Validation of random forest classification

pia · December 18, 2020, 11:12am

Hey there!

What do I do if I don’t have any reference data to compare my classification with? I’m using SNAP for a random forest classification of Landsat images. I just have 2 classes: urban and non-urban.
I made training samples for these two classes and after my classification I made new training samples for a new class called urban-validation and another class called non-urban-validation. As I don’t have any other data than my self-made classification of the satellite image, I guess my training samples for the two validation classes would have an accuracy of 100 %.
That’s why I don’t know if it’s even necessary to do a confusion matrix?
But if I would still want to do a confusion matrix, which classes should I use?

Does someone have an idea what I should do? I would be so grateful!!

Have a good day!

ABraun · December 18, 2020, 12:06pm

What you get from SNAP is the training accuracy, which means how well the classifier was able to put the training samples into the correct class based on the image data. It by no means tells you about the quality of the result.

Where is your area located? There are often many independent reference datasets available which could be used. Maybe we can think together of alternatives.

For the difference between training and prediction accuracy, I have compiled a tutorial which may help you: Landcover classification with Sentinel-1 GRD
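
To make the difference concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: a 1-nearest-neighbour classifier stands in for the random forest, and all pixel values and labels are made up. The point is that a classifier can reproduce its own training samples perfectly and still disagree with independent reference samples.

```python
# Toy illustration only: a 1-nearest-neighbour classifier stands in for the
# random forest, and all values below are hypothetical.

def predict(train, value):
    """Return the label of the training sample closest to `value`."""
    nearest = min(train, key=lambda sample: abs(sample[0] - value))
    return nearest[1]

def accuracy(train, samples):
    """Share of `samples` whose predicted label matches the given label."""
    hits = sum(predict(train, value) == label for value, label in samples)
    return hits / len(samples)

# (hypothetical pixel value, label) pairs used for training
training = [(0.10, "non-urban"), (0.15, "non-urban"), (0.20, "non-urban"),
            (0.35, "urban"), (0.40, "urban"), (0.45, "urban")]

# Independent reference samples the classifier has never seen
reference = [(0.12, "non-urban"), (0.26, "urban"), (0.30, "non-urban"),
             (0.38, "urban"), (0.22, "urban"), (0.42, "urban")]

training_accuracy = accuracy(training, training)     # samples the model was fitted on
reference_accuracy = accuracy(training, reference)   # independent samples

print(f"training accuracy:  {training_accuracy:.2f}")
print(f"reference accuracy: {reference_accuracy:.2f}")
```

In this toy case the training accuracy is perfect while only half of the independent samples are classified correctly, which is why the training accuracy alone says little about the map.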

pia · December 18, 2020, 3:22pm

Hi, thanks for your quick response!

Yeah, that’s what I thought too.

It’s Tanzania, smaller cities like Tunduma or Mbeya

ABraun · December 18, 2020, 4:10pm

the following sources are worth checking for their suitability to serve as reference data.

pia · December 19, 2020, 9:22am

Hey again

I checked some of these pages out and figured some of them would work, but am I right that the resolution of the reference data has to be the same as that of my satellite data? It’s 30 meters.

What about these points? What do I have to consider when using reference data?

-Definition of urban and non-urban must be the same?
-Year of observations must be the same?
-the Coordinate Reference System should be the same?
-Satellite should be the same? (Landsat/Sentinel)

ABraun · December 19, 2020, 6:29pm

I’d say the most important factor of reference data for validation is independence from the data you used. That means the urban areas you use for validation should not have been derived from the same data as you are using. Only then can you say that your method provides results which are as good (or not) as results from other sources. So, to answer your points:

- Ideally, the quality of the reference data is higher than yours, so if the spatial resolution of the validation data is higher, even better!
- The definition of what is considered urban and what is not should be similar, yes.
- The temporal difference should be as small as possible so that differences between result and validation are not caused by temporal dynamics.
- The coordinate reference system does not matter; you can reproject the validation data to match yours.
- The satellite should not be the same, for the above-mentioned reasons.
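
Once you have independent reference labels at a set of sample locations, the confusion matrix compares them with the class your map shows at the same locations. Below is a rough Python sketch of that comparison. The labels are invented for illustration, and the function names `confusion_matrix` and `accuracy_report` are hypothetical helpers, not part of SNAP.

```python
from collections import Counter

def confusion_matrix(reference, predicted):
    """Count how often each (reference label, predicted label) pair occurs."""
    return Counter(zip(reference, predicted))

def accuracy_report(reference, predicted, classes):
    """Overall accuracy plus producer's and user's accuracy per class."""
    counts = confusion_matrix(reference, predicted)
    report = {"overall": sum(counts[(c, c)] for c in classes) / len(reference)}
    for c in classes:
        ref_total = sum(counts[(c, p)] for p in classes)   # reference samples of class c
        pred_total = sum(counts[(r, c)] for r in classes)  # samples mapped as class c
        report[c] = {
            # producer's accuracy: share of reference samples that were mapped correctly
            "producer": counts[(c, c)] / ref_total if ref_total else float("nan"),
            # user's accuracy: share of mapped samples that are correct
            "user": counts[(c, c)] / pred_total if pred_total else float("nan"),
        }
    return report

# Hypothetical labels at 10 validation points (made up for illustration)
reference = ["urban"] * 4 + ["non-urban"] * 6
predicted = ["urban", "urban", "urban", "non-urban"] + ["non-urban"] * 5 + ["urban"]

report = accuracy_report(reference, predicted, ["urban", "non-urban"])
print(report)
```

So the classes to compare are your two mapped classes (urban, non-urban) against the same two classes taken from the independent reference, not separate "validation" classes digitised from the same image.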