Duplicate observations in the same archive?

Hi,

In most cases each S2 archive contains only one granule for each MGRS grid name (though the spatial extents may overlap a lot, or sometimes completely). However, for about 200 archives there are duplicate grid names. For example, S2A_OPER_PRD_MSIL1C_PDMC_20160119T013846_R116_V20160117T002243_20160117T002334.zip contains:

S2A_OPER_MSI_L1C_TL_SGS__20160117T051155_A002973_T55KES_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T051155_A002973_T55KET_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T051155_A002973_T55KFS_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T051155_A002973_T55KFT_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T051155_A002973_T55KGS_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T051155_A002973_T55KHS_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T070701_A002973_T55KDS_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T070701_A002973_T55KES_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T070701_A002973_T55KFS_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T070701_A002973_T55KGS_N02.01/
S2A_OPER_MSI_L1C_TL_SGS__20160117T070701_A002973_T55KHS_N02.01/

Note that KES, KFS, KGS and KHS are encountered twice, with different processing times.
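
A check like this can be scripted from the granule directory names alone. The following is a minimal sketch, assuming the old-style granule naming shown in this thread; the regular expression and the `duplicate_tiles` helper are illustrative names, not part of any ESA tool, and the timestamp field is simply treated as the granule creation time.

```python
import re
from collections import defaultdict

# Old-style L1C granule name, e.g.
# S2A_OPER_MSI_L1C_TL_SGS__20160117T051155_A002973_T55KES_N02.01
GRANULE_RE = re.compile(
    r"^S2[AB]_OPER_MSI_L1C_TL_(?P<centre>[A-Z0-9_]{4})_"
    r"(?P<created>\d{8}T\d{6})_A(?P<orbit>\d{6})_"
    r"T(?P<tile>\d{2}[A-Z]{3})_N(?P<baseline>\d{2}\.\d{2})/?$"
)


def duplicate_tiles(granule_names):
    """Map each tile id seen more than once to its sorted creation times."""
    by_tile = defaultdict(set)
    for name in granule_names:
        m = GRANULE_RE.match(name.strip())
        if m is None:
            continue  # not a granule directory name in the assumed format
        by_tile[m["tile"]].add(m["created"])
    return {t: sorted(ts) for t, ts in by_tile.items() if len(ts) > 1}


names = [
    "S2A_OPER_MSI_L1C_TL_SGS__20160117T051155_A002973_T55KES_N02.01/",
    "S2A_OPER_MSI_L1C_TL_SGS__20160117T051155_A002973_T55KET_N02.01/",
    "S2A_OPER_MSI_L1C_TL_SGS__20160117T070701_A002973_T55KDS_N02.01/",
    "S2A_OPER_MSI_L1C_TL_SGS__20160117T070701_A002973_T55KES_N02.01/",
]
print(duplicate_tiles(names))
```

Run over the full listing above, this should report the four tiles mentioned (55KES, 55KFS, 55KGS, 55KHS), each with two creation times.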

The granule XML files show different sensing time as well. For the KHS granules, they are:
For 20160117T051155_A002973_T55KHS:
<SENSING_TIME metadataLevel="Standard">2016-01-17T00:22:43.886Z</SENSING_TIME>

For 20160117T070701_A002973_T55KHS:
<SENSING_TIME metadataLevel="Standard">2016-01-17T00:23:34.395Z</SENSING_TIME>

These times 22:43 and 23:34 happen to correspond exactly to the start and end time shown in the archive’s filename.
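
The comparison between SENSING_TIME and the times in the archive name can be automated. This is a rough sketch, assuming the granule XML exposes a SENSING_TIME element as quoted above; the `sensing_time` and `validity_window` helpers and the simplified XML fragment are made up for illustration, and real granule metadata has more structure (including namespaces, which the lookup ignores).

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def sensing_time(xml_text):
    """Return SENSING_TIME from granule metadata as an aware UTC datetime."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Compare local names so the lookup works with or without namespaces.
        if el.tag.rsplit("}", 1)[-1] == "SENSING_TIME":
            parsed = datetime.strptime(el.text.strip(), "%Y-%m-%dT%H:%M:%S.%fZ")
            return parsed.replace(tzinfo=timezone.utc)
    return None


def validity_window(product_name):
    """Return the (start, end) times encoded after '_V' in a product name."""
    m = re.search(r"_V(\d{8}T\d{6})_(\d{8}T\d{6})", product_name)
    if m is None:
        return None
    start, end = (
        datetime.strptime(g, "%Y%m%dT%H%M%S").replace(tzinfo=timezone.utc)
        for g in m.groups()
    )
    return start, end


# Simplified stand-in for a granule XML file (hypothetical wrapper elements).
FRAGMENT = """<Tile><General_Info>
<SENSING_TIME metadataLevel="Standard">2016-01-17T00:22:43.886Z</SENSING_TIME>
</General_Info></Tile>"""

PRODUCT = ("S2A_OPER_PRD_MSIL1C_PDMC_20160119T013846_R116_"
           "V20160117T002243_20160117T002334.zip")

start, end = validity_window(PRODUCT)
t = sensing_time(FRAGMENT).replace(microsecond=0)  # drop the milliseconds
print("matches start:", t == start, "matches end:", t == end)
```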

They are mostly spatially distinct, but in the area of overlap their values are either the same or differ by just a few units, so it seems that they represent two different passes over the same data using different tiling schemes.

The two different coverings don’t necessarily fall into the same grid tiles. Granules with the same product times in the id have the same SENSING_TIME in the metadata.

I’m a little unsure how to interpret them - are they two valid observations, or should one group with either earlier or later processing time be discarded?

First:

Second:

Diff (large black and white areas are taken by only one image, the speckled gray area in the middle is overlap):

For the record, usually there are exactly two different product dates when there is a granule collision. There are two exceptions, though:

  1. Three product dates for granules in archive
    S2A_OPER_PRD_MSIL1C_PDMC_20160308T165226_R096_V20160305T143722_20160305T144316

The granule XML files also have three different sensing times corresponding to the three product dates.

granules:
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MKC_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MKD_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MLC_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MLD_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MLE_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MMC_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MMD_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MME_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MNC_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MND_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MNE_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MPD_N02.01
S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MPE_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160305T214502_A003668_T20MLE_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160305T214502_A003668_T20MME_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160305T214502_A003668_T20MNE_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160305T214502_A003668_T20MPE_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160305T231645_A003668_T20MKC_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160305T231645_A003668_T20MLC_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160305T231645_A003668_T20MMC_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160305T231645_A003668_T20MNC_N02.01

  2. Two product dates, but no granule tile overlap for
    S2A_OPER_PRD_MSIL1C_PDMC_20160119T095356_R137_V20160118T112024_20160118T113008

granules (I confirmed that each of them has at least some unique pixels):
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T29SPD_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T29SQD_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T29TPE_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T29TQE_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T30STJ_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T30SUH_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T30SUJ_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T30SVH_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T30SVJ_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T30TTK_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T30TUK_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T152255_A002994_T30TVK_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T29SPC_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T29SPD_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T29SQC_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T29SQD_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T30STH_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T30STJ_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T30SUH_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T30SUJ_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T30SVH_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160118T165356_A002994_T30SVJ_N02.01

I looked in more detail at all granules from S2A_OPER_PRD_MSIL1C_PDMC_20160119T013846_R116_V20160117T002243_20160117T002334.zip, and the areas covered by the two sets are not the same, even though the data in the overlap areas are identical.

Granules for 20160117T051155:

Granules for 20160117T070701:

Hello Simon,
Where you find duplicate L1C Tiles, it is because your product has 2 (or more) Datatakes within it. In your first instance (Orbit 002973 over Queensland) both Datatakes were processed at Svalbard (SGS).

In your second example (Orbit 003668 South of the Equator in Brazil) the L1C Tiles come from Matera, Italy (MTI) and Svalbard (SGS).

In the second instance, I’ve done a rough approximation of a single L1C Tile (20MNE) on a Catalogue view containing the Orbit 003668

You’ll see how the L1C tile sits across the line identifying separate Datatakes.

Cheers

Jan

S2 MPC Operations Manager

@Jan: thank you, this clarifies things. Followup questions:

  1. The catalog XML contains only one datatake id field per product. For my second example, S2A_OPER_PRD_MSIL1C_PDMC_20160308T165226_R096_V20160305T143722_20160305T144316, it contains GS2A_20160305T143722_003668_N02.01 - just the last of the three datatake ids you listed. Is that expected?

  2. Could you confirm that future reprocessed versions of the same datatake would have the same datatake id, except with the suffix changing from 01 to 02?

@simonf

Hi Simon,

Yes, it is expected. What you have there is the Datatake Identifier (GS2A_20160305T143722_003668_N02.01).

If you look in your AUX_DATA > DATASTRIP folder you will see there are three Datastrips:

├── S2A_OPER_PRD_MSIL1C_PDMC_20160308T165226_R096_V20160305T143722_20160305T144316.SAFE
│ ├── AUX_DATA
│ ├── DATASTRIP
│ │ ├── S2A_OPER_MSI_L1C_DS_MTI__20160305T224740_S20160305T144239_N02.01
│ │ ├── S2A_OPER_MSI_L1C_DS_SGS__20160305T214502_S20160305T143722_N02.01
│ │ └── S2A_OPER_MSI_L1C_DS_SGS__20160305T231645_S20160305T144316_N02.01
│ │

within this Datatake. The confusion may have arisen in my earlier response. I’m sorry if this is the case.
You can find out more on the difference between Datastrips and Datatakes in the Definitions part of the Sentinel Online website: https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/definitions
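
To see which tiles belong to which Datastrip, the names can be matched up. This is only a sketch under an assumption: in the example above the processing centre and creation timestamp in each granule name line up with one of the Datastrip names, and the code relies on that pattern holding. The regular expressions and the `tiles_by_datastrip` helper are illustrative names.

```python
import re

# ..._DS_<centre>_<created>_S<sensing start>_N<baseline>
DS_RE = re.compile(
    r"_DS_(?P<centre>[A-Z0-9_]{4})_(?P<created>\d{8}T\d{6})"
    r"_S(?P<sensing_start>\d{8}T\d{6})_N"
)
# ..._TL_<centre>_<created>_A<orbit>_T<tile>_N<baseline>
TL_RE = re.compile(
    r"_TL_(?P<centre>[A-Z0-9_]{4})_(?P<created>\d{8}T\d{6})"
    r"_A\d{6}_T(?P<tile>\d{2}[A-Z]{3})_N"
)


def tiles_by_datastrip(datastrips, granules):
    """Group tile ids by the sensing start of the Datastrip they match."""
    index = {}
    for ds in datastrips:
        m = DS_RE.search(ds)
        if m:
            index[(m["centre"], m["created"])] = m["sensing_start"]
    grouped = {}
    for g in granules:
        m = TL_RE.search(g)
        if m is None:
            continue
        start = index.get((m["centre"], m["created"]))
        if start is not None:  # skip granules with no matching Datastrip
            grouped.setdefault(start, []).append(m["tile"])
    return grouped


datastrips = [
    "S2A_OPER_MSI_L1C_DS_MTI__20160305T224740_S20160305T144239_N02.01",
    "S2A_OPER_MSI_L1C_DS_SGS__20160305T214502_S20160305T143722_N02.01",
    "S2A_OPER_MSI_L1C_DS_SGS__20160305T231645_S20160305T144316_N02.01",
]
granules = [
    "S2A_OPER_MSI_L1C_TL_MTI__20160305T224740_A003668_T20MLE_N02.01",
    "S2A_OPER_MSI_L1C_TL_SGS__20160305T214502_A003668_T20MLE_N02.01",
    "S2A_OPER_MSI_L1C_TL_SGS__20160305T231645_A003668_T20MKC_N02.01",
]
print(tiles_by_datastrip(datastrips, granules))
```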

Regarding your query about the numbering of the reprocessed versions, I believe that ESA are not incrementing the suffix; reprocessed products will still be 02.01.

I hope this helps

Cheers

Jan

@Jan: thank you, this is getting clearer for me. One more followup:

Which metadata fields should I look at to determine whether a new granule should replace an older granule, and how do I identify the older granule to replace? If there were no partial overlap, I could just say that for each pair (sensing_time, tile_id) I should keep the granule with the most recent creation date. But because of the overlap I should choose the most recent group with creation dates close together, and discard the granules with much older creation dates. This is harder to implement correctly.

Alternatively, I can look at the product’s GENERATION_TIME and discard all granules with the same sensing times and tile ids as those contained in previous products with an older GENERATION_TIME. In this case, though, the new product might not contain all the pixels of the previous product if, e.g., only one granule was reprocessed.

Do you have any recommendations?

Somewhat relevant: I found a few cases when the same observation generates 2 or 3 products that appear to be the same (at least their granule names, as well as overviews on scihub are the same). Looks like I can deduplicate them by picking the product with the earliest date, but this does make it harder to look for future truly reprocessed products with the same observation dates and different product dates.

Example:

S2A_OPER_PRD_MSIL1C_PDMC_20160309T000754_R111_V20160306T154612_20160306T154612
S2A_OPER_PRD_MSIL1C_PDMC_20160309T000935_R111_V20160306T154612_20160306T154612
S2A_OPER_PRD_MSIL1C_PDMC_20160309T001049_R111_V20160306T154612_20160306T154612

The granules in all of them are:
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T17PPR_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T17PQR_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T17PQS_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T17PQT_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T17PRR_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T17PRS_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T17PRT_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T18PTA_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T18PTB_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T18PTC_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T18PUA_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T18PUB_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T18PUC_N02.01
S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T18PVC_N02.01

Simon

Hi Simon
For your purposes, the unique ID in SciHub is the filename.
It can happen that the same tile is embedded in two ZIP files in SciHub.
This is due to the way SciHub takes the data from the PDGS.
For a reprocessing, the baseline should change (that is, 02.01 should change).
The tile is attached to a datatake, which represents an acquisition and a version of the generation.

Thank you, @talazac - I’ll be comparing granule ids.

Do you know by any chance how this works for S1? Looks like the dates in the S1 GRD asset id uniquely identify an observation, and the ingestiondate field in the XML catalog helps to disambiguate occasional duplicate entries. Is there something else?

Hi Simon,

we stumbled over a similar issue when going through our automatically downloaded data. It seems that on some occasions a reprocessing of basically the metadata only has been performed. Hence the processing ID of the granules has not changed, since they are identical to the previous processing.
The only difference we could find in our case was that the cloud coverage estimate in the metadata was slightly different.

An example for this would be:

S2A_OPER_PRD_MSIL1C_PDMC_20160308T072905_R022_V20150813T101657_20150813T101657.SAFE
S2A_OPER_PRD_MSIL1C_PDMC_20160309T042055_R022_V20150813T101657_20150813T101657.SAFE

We have yet to decide how to handle this though. For sure the data is the same apart from the metadata about cloud coverage assessment.

How is it in your case? Is this the same issue in the case described by you?

Best,

Alex

In the cases I checked, yes, the data appears to be different. We are only going to keep in Earth Engine the most recent product.

So you are basically checking processing time and which granules are contained? And in the case you have the same granules you select the one with the most recent processing time?

I make a check at the product (zip file) level - if the sensing (observation) time and the granule names are the same on several products, I only ingest granules from the product with the most recent product (generation) time.

It’s theoretically possible for granule names to only partially overlap, but I have not seen that.
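
For anyone who wants to reproduce this product-level check, here is a minimal sketch. It assumes the timestamp after `PDMC_` in the product name can stand in for the product generation time and that the granule list of each product is already known; `pick_latest` and the input layout are hypothetical names chosen for the example.

```python
import re
from collections import defaultdict

PRODUCT_RE = re.compile(
    r"_PDMC_(?P<generated>\d{8}T\d{6})_R\d{3}"
    r"_V(?P<start>\d{8}T\d{6})_(?P<end>\d{8}T\d{6})"
)


def pick_latest(products):
    """Keep one product per (sensing window, granule set): the newest one.

    `products` maps a product name to an iterable of its granule names.
    """
    groups = defaultdict(list)
    for name, granules in products.items():
        m = PRODUCT_RE.search(name)
        if m is None:
            continue
        key = (m["start"], m["end"], frozenset(granules))
        # YYYYMMDDTHHMMSS strings sort chronologically as plain text.
        groups[key].append((m["generated"], name))
    return sorted(max(candidates)[1] for candidates in groups.values())


granules = [
    "S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T17PPR_N02.01",
    "S2A_OPER_MSI_L1C_TL_SGS__20160306T224527_A003683_T17PQR_N02.01",
]
prefix = "S2A_OPER_PRD_MSIL1C_PDMC_"
suffix = "_R111_V20160306T154612_20160306T154612"
products = {
    prefix + "20160309T000754" + suffix: granules,
    prefix + "20160309T000935" + suffix: granules,
    prefix + "20160309T001049" + suffix: granules,
}
print(pick_latest(products))
```

Products whose granule sets differ, even partially, land in separate groups and are all kept, so the partial-overlap case would need extra handling.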

It also happens that duplicates (same scene, but different production time stamps) appear for a while and then disappear. This suggests there is some post-ingestion brush up step, but it would be more useful if this was done pre-ingestion. Saves wasting precious bandwidth.

For instance, you will no longer find:

S2A_OPER_PRD_MSIL1C_PDMC_20160309T042055_R022_V20150813T101657_20150813T101657.SAFE

in the example above. It looks like something recent (from early March). Maybe it is because more S2 processors have come on-line and there is some overlap in processing tasks??

Guido

Hi Guido,

it seems you are right. Funnily enough, this example was still online while I was writing my response yesterday, but I checked just now and the duplicates have disappeared.

Cheers,

Alex

@edit:

Ok, I made a mistake in my query; the duplicates are still in. I had just marked the wrong geographic extent and hence they didn’t show up.

I am noting the same behaviour for Sentinel-1 after going over my automatically downloaded data.
It is messing a bit with my scripting. Any news on this?
S1A_IW_SLC__1SDV_20160130T173415_20160130T173442_009726_00E342_77A4.zip
S1A_IW_SLC__1SDV_20160130T173415_20160130T173442_009726_00E342_AB3F.zip

Could you check the metadata to see what are the differences between those two products? The sensing-times are identical so it’s the same data-take.

Hi @andretheronsa

Although the products have the same filename, there appears to be some separation between when the AMALFI Reports are generated:

S1A_IW_SLC__1SDV_20160130T173415_20160130T173442_009726_00E342_77A4.SAFE Date: 2016-01-30T18:51:17
S1A_IW_SLC__1SDV_20160130T173415_20160130T173442_009726_00E342_AB3F.SAFE Date: 2016-06-27T07:24:41

…so, although I don’t have a detailed understanding of the processing/reprocessing requirements for the S1 mission, it looks as if the later AMALFI Report indicates that this product was reprocessed, with both products being retained in the Catalogue.

The two products each have a Product Unique Identifier (“77A4” and “AB3F”) as highlighted in the Product Naming Convention:

https://sentinel.esa.int/web/sentinel/user-guides/sentinel-1-sar/naming-conventions

…and Figure 3-7 (Sentinel-1 Product Naming Convention) in the Sentinel-1 Product Specification document at:

https://sentinel.esa.int/documents/247904/1877131/Sentinel-1-Product-Specification
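
For scripting around such pairs, one option is to group downloads by everything in the name except the trailing identifier. This is a small sketch, assuming the unique identifier is always the last underscore-separated field of the name, as in the two files above; `split_unique_id` and `group_versions` are illustrative helper names.

```python
from collections import defaultdict


def split_unique_id(filename):
    """Split an S1 product name into (stem, unique id), dropping .zip/.SAFE."""
    base = filename
    if base.endswith((".zip", ".SAFE")):
        base = base.rsplit(".", 1)[0]
    stem, _, uid = base.rpartition("_")
    return stem, uid


def group_versions(filenames):
    """Map each stem to the sorted unique ids found for it."""
    groups = defaultdict(set)
    for name in filenames:
        stem, uid = split_unique_id(name)
        groups[stem].add(uid)
    return {stem: sorted(uids) for stem, uids in groups.items()}


files = [
    "S1A_IW_SLC__1SDV_20160130T173415_20160130T173442_009726_00E342_77A4.zip",
    "S1A_IW_SLC__1SDV_20160130T173415_20160130T173442_009726_00E342_AB3F.zip",
]
for stem, uids in group_versions(files).items():
    if len(uids) > 1:
        print(stem, "has several versions:", uids)
```

Which version to keep is a separate decision; the names alone do not say which one is newer.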

Cheers

Jan

S2 MPC Operations Manager

@Jan - do you know why both versions are kept in the catalog?