Number of training samples in the Random Forest classifier

More trees do not necessarily mean better results, at least at this scale. To be absolutely sure, you can try 200 or 500 trees, which fully average the best findings and drop out input rasters of low significance.

OK. Thanks, Sir. I will do it, although it takes time in SNAP. One more question: if we save a classifier trained with 10 trees, is it possible to apply the same classifier with 500 trees? (I want to avoid the instability of random forest this way.)
If yes, I should probably change the number of trees and the selected images and then choose the classifier. Am I right?

The principle of a random forest is to use a large number of images to explain a trained distribution. If you select 500 trees, the classifier will randomly choose from your input images, so you can always use the same input data. However, if you have only a few input rasters, 50 or 500 trees doesn’t make any difference.

Thanks. My input is one image, for which I used some polygons for training (in SNAP I kept the default of 5000 training samples). I classified with 500 trees, but the result is worse again.
with 10 trees: 44%
with 50 trees: 42%
with 500 trees: 38%

That doesn’t make much sense. Random Forest classifiers are suitable for cases with a multitude of explanatory variables (>12 input rasters), which are then randomly permuted during the training process. By doing this, the classifier statistically favours those rasters which have a higher value for the correct prediction of your training samples (also randomly selected).
Entering the same raster 500 times into the classifier doesn’t increase the quality at all. In that case, a k-Nearest Neighbor (KNN) classifier would possibly be more suitable.
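To illustrate the point about a single input raster, here is a minimal sketch in plain Python. It assumes the common rule of offering roughly sqrt(number of rasters) randomly chosen rasters to each split; whether SNAP uses exactly this rule is an assumption, and the function name is hypothetical.

```python
import math
import random

def distinct_split_candidates(n_rasters, n_draws, seed=0):
    """Count the distinct raster subsets offered to splits over n_draws draws."""
    rng = random.Random(seed)
    # Assumed rule of thumb: about sqrt(n_rasters) candidates per split.
    subset_size = max(1, int(math.sqrt(n_rasters)))
    seen = set()
    for _ in range(n_draws):
        subset = tuple(sorted(rng.sample(range(n_rasters), subset_size)))
        seen.add(subset)
    return len(seen)

one_raster = distinct_split_candidates(n_rasters=1, n_draws=500)
many_rasters = distinct_split_candidates(n_rasters=12, n_draws=500)
print(one_raster, many_rasters)
```

With one raster there is only ever one possible subset, so the random feature selection has nothing to vary, no matter how many trees are requested. With 12 rasters, many different subsets appear.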

I did this work with Maximum Likelihood and the result was 47%. I do not know why. Would you please give a reference for the sentence below?
Random Forest classifiers are suitable for cases with a multitude of explanatory variables (>12 input rasters), which are then randomly permuted during the training process. By doing this, the classifier statistically favours those rasters which have a higher value for the correct prediction of your training samples (also randomly selected).

Try to understand how the classifiers work. Not every classifier fits every dataset. Many of them are not rocket science and there are descriptions for any kind of reader. Some are more technical, others use more examples.

Nicely illustrated: http://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-learing/

Original source: https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf

Very detailed but most comprehensive work: https://arxiv.org/pdf/1407.7502.pdf

OK. I got it now. Thanks.
As you know, there is instability in RF. That’s why I used one classifier for all images, but I do not know whether my method is right in SNAP.
I chose the ‘train and apply classifier’ option on my first image and ran the classification (the classifier is saved automatically), and then I used the same classifier for the other images (that is, for the other images I chose ‘load and apply classifier’ instead of ‘train and apply classifier’). However, I think we should first select ‘number of trees’ and the other options and then go on to ‘load and apply classifier’.
Am I doing this the right way?

Technically, this is the right way. But it is worth mentioning that you can only apply your classifier to other images if you calibrated them carefully. Otherwise, the threshold values do not necessarily apply to these images.

Hi, I am doing flood mapping using Sentinel-1 GRD images. I am using two methods (after pre-processing):

  1. taking a threshold value by inspecting the histogram
  2. Random Forest classification (two classes: water and non-water)

With RFC I always get a much higher number of water pixels than with the first method.
The RFC result is not even close to the thresholding result for water pixels. Why is RFC showing such high values?
Or can I change the parameters in RFC to bring it closer to the threshold method?
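For reference, the first method (thresholding) can be sketched in a few lines of plain Python. The backscatter values and the threshold of -18 dB below are hypothetical and only for illustration; the actual threshold has to be read from the histogram of the image in question.

```python
def count_water_pixels(sigma0_db, threshold_db):
    """Count pixels darker than the threshold (open water is usually dark in SAR)."""
    return sum(1 for value in sigma0_db if value < threshold_db)

# Hypothetical backscatter values in dB, for illustration only.
sigma0_vv_db = [-22.5, -20.1, -19.4, -12.3, -9.8, -8.7, -21.0, -10.5]
threshold_db = -18.0  # assumed value; read yours from the histogram

water = count_water_pixels(sigma0_vv_db, threshold_db)
print(water, len(sigma0_vv_db) - water)  # water pixels, non-water pixels
```

Counting the water pixels of both methods in the same way makes the difference between them easy to quantify.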

What are your input layers? Random Forest needs a larger number of rasters to be trained on.

I am using a single layer, “sigma0_vv_db”.

I am drawing polygons for water and non-water and providing them as training vectors.

That is clearly not enough for an RF classifier. It is based on the idea that the input rasters and pixels are randomly shuffled and selected, and only subsets are used for training each tree. Repeating this, you get a very robust classifier.
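The random selection of pixels can be sketched as follows, assuming the usual bootstrap scheme in which each tree draws as many pixels as there are training samples, with replacement. The function name is hypothetical, and this is not SNAP's internal code.

```python
import random

def bootstrap_unique_fraction(n_pixels, seed=0):
    """Fraction of distinct training pixels that one bootstrap sample contains."""
    rng = random.Random(seed)
    # Draw n_pixels indices with replacement, as one tree would.
    drawn = [rng.randrange(n_pixels) for _ in range(n_pixels)]
    return len(set(drawn)) / n_pixels

fraction = bootstrap_unique_fraction(5000)
print(round(fraction, 2))
```

Each tree sees roughly 63% of the distinct training pixels (1 - 1/e in the limit), so the trees differ from one another even when the training data stay the same.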

It is well explained here: http://wgrass.media.osaka-cu.ac.jp/gisideas10/papers/04aa1f4a8beb619e7fe711c29b7b.pdf

You can include GLCM textures to increase your feature space.
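As a rough idea of what a GLCM texture measures, here is a minimal sketch of the contrast feature for horizontally neighbouring pixels. It is a simplified, non-symmetric variant written for illustration and is not the implementation used by the GLCM operator in SNAP; the tiny images are made up.

```python
def glcm_contrast(image, levels):
    """Contrast of a horizontal (one pixel to the right) co-occurrence matrix."""
    counts = [[0] * levels for _ in range(levels)]
    for row in image:
        # Count how often grey level `left` is followed by `right`.
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
    total = sum(sum(r) for r in counts)
    if total == 0:
        return 0.0
    # Weight each pair by its squared grey-level difference.
    return sum(
        counts[i][j] * (i - j) ** 2
        for i in range(levels)
        for j in range(levels)
    ) / total

flat = [[1, 1, 1], [1, 1, 1]]       # homogeneous patch
striped = [[0, 3, 0], [3, 0, 3]]    # strongly alternating patch
print(glcm_contrast(flat, levels=4), glcm_contrast(striped, levels=4))
```

The homogeneous patch gives a contrast of 0.0 and the alternating patch gives 9.0, so the texture layer carries information that the backscatter value of a single pixel does not.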

I have three Sentinel-1 GRD images: one pre-flood image and two post-flood images. After pre-processing, please tell me the steps for RFC.

Change detection is actually a different approach.

Have you seen these tutorials:

However, your approach with RF stays the same. Calculate image textures and classify your image(s).

Yes, I have attended that course, but I was trying to map it with RFC. So I will try your methods.
Thank you.

I have 4 training sets (each set has multiple polygons), and I will use 128 bands or more. So what would be the optimum “number of training samples” and “number of trees” for my RF classification? (The smallest training set has 291 pixels, the biggest has 1850 pixels, and the total is 3489 pixels.)

Thanks.

Dear @ABraun,

So is there any estimate of, or relation between, the number of classes (to be classified), the number of training samples, the number of trees and the number of input bands (variables)?

Thank you,

Training/validation data are often split 2/3 to 1/3 to test the accuracy of the classifier (not of the classified map), but the number of trees and the number of input bands are not linearly linked to the other parameters.
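A 2/3 to 1/3 split can be sketched in plain Python as below. The split is done per class so that small classes are represented in both parts; the class names and pixel counts are hypothetical, and the function is an illustration rather than anything SNAP provides.

```python
import random

def stratified_split(samples_by_class, train_fraction=2 / 3, seed=0):
    """Split the samples of every class into training and validation parts."""
    rng = random.Random(seed)
    train, validation = [], []
    for label, samples in samples_by_class.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train += [(label, s) for s in shuffled[:cut]]
        validation += [(label, s) for s in shuffled[cut:]]
    return train, validation

# Hypothetical pixel indices per class, for illustration only.
pixels = {"class_a": range(300), "class_b": range(900)}
train, validation = stratified_split(pixels)
print(len(train), len(validation))
```

With these example sizes, 800 pixels go to training and 400 to validation. The validation part is then used only to assess the classifier, never to train it.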