The pattern you get is quite common when mixing languages. The physical storage of arrays is generally just a “vector” of length w*h. The problem comes when one language stores arrays by column and the other by rows. A matrix transpose is often used to switch storage orders. Unfortunately, transpose is often an “expensive” operation. NumPy internals says transpose is done in the array metadata, but this just defers the expense to the point at which the data are used: because you are accessing widely separated locations in memory, you may get cache misses and other overhead. The numpy approach works well if you are working on small pieces of the array, but doesn’t really save anything if you are doing operations on the full array. If you are doing heavy calculations on large images in numpy you may find things go faster if you work in the original (Java) order and apply transpose after all the computations have been done.
As a further complication, some languages display images with the first row at the bottom, while others start from the top.