Snappy read pixels problem

I have this code that works, but it takes approximately one hour to complete. Is there any way to speed it up?
I am trying to create a table/dataframe with the band values for each pixel, as well as the lat/long, to import into a Postgres database.


import pandas as pd
from snappy import ProductIO, GeoPos, PixelPos
import re
import numpy as np
import time

# Measure execution time
start_time = time.time()

product_path = r"E:\PhD\gis_stuff\Lai_Processed\Subset_S2A_MSIL2A_20180420T112121_N0207_R037_T30UVD_20180420T132427_resampled_BandMath.dim"
product = ProductIO.readProduct(product_path)

# Define the list of bands to retrieve
bands_to_retrieve = ['lai', 'ndvi', 'fapar', 'fcover', 'gndvi', 'gci', 'EVI', 'EVI2', 'lai2']

# Create an empty DataFrame to store the data
data = {'Date': [], 'X': [], 'Y': [], 'Latitude': [], 'Longitude': []}
for band in bands_to_retrieve:
    data[band] = []

df = pd.DataFrame(data)

# Extract date from the filename
match = re.search(r'L2A_(.*?)T', product_path)
if match:
    date = match.group(1)
else:
    date = None

# Get the width and height of the product
width = product.getSceneRasterWidth()
height = product.getSceneRasterHeight()

# Get geocoding
gc = product.getSceneGeoCoding()

# Iterate over each pixel
for y in range(height):
    for x in range(width):
        print(x, y)  # progress output for every pixel
        # Get geo-coordinates for each pixel
        geoPos = gc.getGeoPos(PixelPos(x, y), None)
        lat = geoPos.getLat()
        lon = geoPos.getLon()

        # Retrieve values for each band
        band_values = {}
        for band_name in bands_to_retrieve:
            band = product.getBand(band_name)
            band_array = np.zeros(1, dtype=np.float32)
            band.readPixels(x, y, 1, 1, band_array)
            band_values[band_name] = band_array[0]

        # Append data to the DataFrame
        df = df.append({'Date': date, 'X': x, 'Y': y, 'Latitude': lat, 'Longitude': lon, **band_values}, ignore_index=True)

# Save the DataFrame to a CSV file
csv_filename = f"{date}.csv"
df.to_csv(csv_filename, index=False)  
# Calculate and print the execution time
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds") 

Alright, I’m not sure I’m getting the whole picture, but I’ll try to help. What I want to suggest is considering parallel processing. Python’s multiprocessing or concurrent.futures modules could be very helpful in this case. By distributing the data reading and processing across multiple cores, you might significantly reduce the execution time.

Also, instead of appending to the DataFrame in each loop iteration, you could try collecting all the data in a list first and then converting it to a DataFrame in one go. Appending to a DataFrame row by row is quite slow, especially with large datasets, because each append copies the entire frame.
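Something along these lines, reusing the variable names from your script (untested, just a sketch of the pattern):

# Collect rows in a plain Python list instead of appending to the DataFrame
rows = []
for y in range(height):
    for x in range(width):
        geoPos = gc.getGeoPos(PixelPos(x, y), None)
        row = {'Date': date, 'X': x, 'Y': y,
               'Latitude': geoPos.getLat(), 'Longitude': geoPos.getLon()}
        for band_name in bands_to_retrieve:
            band_array = np.zeros(1, dtype=np.float32)
            product.getBand(band_name).readPixels(x, y, 1, 1, band_array)
            row[band_name] = band_array[0]
        rows.append(row)

# Build the DataFrame once at the end
df = pd.DataFrame(rows)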

Hope this helps somehow!

It seems that the problem was with appending to the DataFrame. I re-did the code so that it writes a line to a .csv file on each iteration of the “iterate over each pixel” loop.
It runs much faster now :)
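For anyone who finds this later, the rewritten loop looks roughly like this (simplified from my actual script):

import csv

with open(f"{date}.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Date', 'X', 'Y', 'Latitude', 'Longitude'] + bands_to_retrieve)
    # Iterate over each pixel, writing one CSV row per pixel
    for y in range(height):
        for x in range(width):
            geoPos = gc.getGeoPos(PixelPos(x, y), None)
            values = []
            for band_name in bands_to_retrieve:
                band_array = np.zeros(1, dtype=np.float32)
                product.getBand(band_name).readPixels(x, y, 1, 1, band_array)
                values.append(band_array[0])
            writer.writerow([date, x, y, geoPos.getLat(), geoPos.getLon()] + values)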


Awesome to hear that! That’s a neat solution.

If you’re up for a bit more tinkering, how about giving parallel processing a whirl? Playing around with Python’s multiprocessing or concurrent.futures might be a fun next step. It’s like divvying up the work among a bunch of CPU cores. I think it could speed things up even more.
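A rough sketch of what I mean, untested with snappy on my end. I’m assuming each worker has to open the product itself, since snappy’s Java-bridge objects can’t be pickled and shared between processes:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def process_strip(args):
    # Each worker re-opens the product: snappy's Java objects
    # can't be pickled and sent to another process (my assumption)
    from snappy import ProductIO, PixelPos
    path, band_names, y_start, y_end = args
    product = ProductIO.readProduct(path)
    gc = product.getSceneGeoCoding()
    width = product.getSceneRasterWidth()
    rows = []
    for y in range(y_start, y_end):
        for x in range(width):
            geoPos = gc.getGeoPos(PixelPos(x, y), None)
            row = [x, y, geoPos.getLat(), geoPos.getLon()]
            for name in band_names:
                arr = np.zeros(1, dtype=np.float32)
                product.getBand(name).readPixels(x, y, 1, 1, arr)
                row.append(arr[0])
            rows.append(row)
    return rows

# Split the image into horizontal strips, one per worker.
# (On Windows this part must run under an `if __name__ == "__main__":` guard.)
n_workers = 4
step = -(-height // n_workers)  # ceiling division
tasks = [(product_path, bands_to_retrieve, y0, min(y0 + step, height))
         for y0 in range(0, height, step)]
with ProcessPoolExecutor(max_workers=n_workers) as executor:
    all_rows = [row for strip in executor.map(process_strip, tasks) for row in strip]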

If you know the final size, it is often faster to create a DataFrame pre-filled with “fill” values and then update the values in a loop. In remote sensing there are often large regions of missing data, so you can skip the write operation entirely for pixels with no data.
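A minimal sketch of that pattern, reusing the variable names from the code above (I am assuming NaN marks no data; substitute your product’s actual no-data value):

columns = ['Latitude', 'Longitude'] + bands_to_retrieve

# Pre-size the DataFrame with a fill value instead of growing it row by row
df = pd.DataFrame(np.nan, index=range(width * height), columns=columns)

for y in range(height):
    for x in range(width):
        i = y * width + x
        geoPos = gc.getGeoPos(PixelPos(x, y), None)
        df.iat[i, 0] = geoPos.getLat()
        df.iat[i, 1] = geoPos.getLon()
        for col, band_name in enumerate(bands_to_retrieve, start=2):
            arr = np.zeros(1, dtype=np.float32)
            product.getBand(band_name).readPixels(x, y, 1, 1, arr)
            if np.isnan(arr[0]):  # no-data pixel: keep the fill value
                continue          # and skip the write
            df.iat[i, col] = arr[0]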

Saving the data in NetCDF-CF format also gives you access to many tools that extract and format data. Converting to ASCII can lose floating-point precision, which affects summary statistics computed from the final database, so minimizing conversions has advantages beyond efficiency.
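For instance, with xarray (my choice here; any NetCDF library would do) the bands can be read in bulk and written straight to a NetCDF file, assuming a full band fits in memory:

import numpy as np
import xarray as xr

# Read each band as a full 2-D array and bundle them into a Dataset
data_vars = {}
for band_name in bands_to_retrieve:
    arr = np.zeros(width * height, dtype=np.float32)
    product.getBand(band_name).readPixels(0, 0, width, height, arr)
    data_vars[band_name] = (('y', 'x'), arr.reshape(height, width))

ds = xr.Dataset(data_vars, coords={'y': np.arange(height), 'x': np.arange(width)})
ds.attrs['date'] = str(date)
ds.to_netcdf(f"{date}.nc")

CF attributes (units, coordinate metadata) can then be added to each variable before writing.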