Number of training samples in the Random Forest classifier

Hi
I have a question about Random Forest classification. As you know, there are two options we need to set for the Random Forest algorithm in SNAP:

  1. Number of trees
  2. Number of training samples

These default to 10 and 5000 respectively, but I cannot understand what is meant by ‘Number of training samples’ and why it is 5000. Would you please explain this to me?
Actually, I found this paper, ‘https://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XL-7-W3/777/2015/’, which was written by @ABraun, but as you can see the options are different and I cannot see how ‘Number of training samples’ in SNAP corresponds to the options in the paper.
  1. The number of trees defines how often the training data (under different conditions, with some data left out) is trained against the pixel values. For testing, 10 is enough, but to achieve robust results you can increase it up to 100 or 500. However, this only makes sense if you have more than 8 input rasters; otherwise the training data is always the same, even if you repeat it 1000 times.
  2. The number of samples is the number of pixels within your training polygons which are randomly selected for training. This means each trained class should have clearly more than 5000 pixels; otherwise you always use the same available pixels.
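The two SNAP parameters can be illustrated with a small scikit-learn sketch (this is not SNAP's internal code, and the pixel data is synthetic): `n_estimators` plays the role of "Number of trees", and drawing a random cap of 5000 pixels from the labelled training data mimics "Number of training samples".

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Fake "pixels": 8 input rasters (bands), 12000 labelled training pixels
X = rng.normal(size=(12000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# "Number of training samples": cap the training set at 5000 randomly
# drawn pixels, roughly what SNAP does with its default of 5000
idx = rng.choice(len(X), size=5000, replace=False)

# "Number of trees" corresponds to n_estimators
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X[idx], y[idx])
print(round(clf.score(X, y), 2))
```

The point of the cap is exactly what the answer above says: if your polygons contain far more than 5000 pixels, each tree can see a genuinely different random subset.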

@ABraun
I chose some polygons that contain a number of pixels, but I do not know how many pixels there are in total. So how does SNAP choose 5000 pixels?
Should I count the number of pixels in my training polygons and enter that number instead of 5000?

The total number of pixels I used for the training classes is almost 6200 (I mean all polygons together contain 6200 pixels), but I left the number of samples at the default (5000).
Did I do something wrong?

Nothing wrong. 5000 is just a default value which ensures that the Random Forest operator has enough permutation options for selecting random samples.

@ABraun I increased the number of trees, but surprisingly the result is now worse:
with 10 trees: 44%
with 50 trees: 42%

Why did this happen?

More trees do not necessarily mean better results, at least at this scale. To be absolutely sure, you can try 200 or 500 trees, which fully average the best findings and drop input rasters of low significance.
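If you want to check the effect of the tree count yourself outside SNAP, a cross-validated comparison is a quick way to do it. A scikit-learn sketch on synthetic data (the behaviour, not the exact numbers, is the point here):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Synthetic stand-in for a stack of 8 input rasters with labelled pixels
X = rng.normal(size=(2000, 8))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

results = {}
for n_trees in (10, 50, 200):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    # mean accuracy over 3 cross-validation folds
    results[n_trees] = cross_val_score(clf, X, y, cv=3).mean()
    print(n_trees, round(results[n_trees], 3))
```

Cross-validation averages out the run-to-run randomness, so a difference of a few percent between 10 and 50 trees (as reported above) is easier to interpret.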

OK, thank you. I will do that, although it takes time in SNAP. One more question: if we save a classifier trained with 10 trees, is it possible to apply the same classifier with 500 trees (I want to avoid the instability of Random Forest this way)?
If yes, I should probably set the number of trees and the selected images first and then choose the classifier. Am I right?

The principle of a random forest is to use a large number of input images to explain a trained distribution. If you select 500 trees, the classifier will randomly choose from your input images, so you can always use the same input data. However, if you have only a few input images, 50 or 500 trees doesn't make any difference.

Thanks. My input is one image, and I used some polygons for training (with the default of 5000 training samples in SNAP). I classified with 500 trees, but the result is worse again:
with 10 trees: 44%
with 50 trees: 42%
with 500 trees: 38%

That doesn’t make much sense. Random Forest classifiers are suitable for cases with a multitude of explanatory variables (>12 input rasters), which are randomly permuted during the training process. By doing this, the classifier statistically favours those rasters which have a higher value for the correct prediction of your training samples (also randomly selected).
Entering the same raster 500 times into the classifier doesn’t increase the quality at all. In that case, a k-nearest-neighbour (KNN) classifier would possibly be more suitable.
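The KNN suggestion can be illustrated with a minimal scikit-learn sketch on a single feature (simulated backscatter values, not real Sentinel-1 data): with one input band, KNN simply classifies by the nearest labelled values, which is a good match for a one-dimensional problem.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
# One feature: simulated sigma0 (dB) values, water dark, land brighter
water = rng.normal(-18, 2, size=500)
land = rng.normal(-8, 2, size=500)
X = np.concatenate([water, land]).reshape(-1, 1)
y = np.array([0] * 500 + [1] * 500)  # 0 = water, 1 = land

# classify each pixel by the majority vote of its 5 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(round(knn.score(X, y), 2))
```

With well-separated classes like these, KNN on one band already performs well, whereas a random forest has nothing to randomise over.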

I did this work with Maximum Likelihood and the result was 47%. I do not know why. Would you please give a reference for the sentence below?
‘Random Forest classifiers are suitable for cases with a multitude of explanatory variables (>12 input rasters), which are randomly permuted during the training process. By doing this, the classifier statistically favours those rasters which have a higher value for the correct prediction of your training samples (also randomly selected).’

Try to understand how the classifiers work. Not every classifier fits every dataset. Many of them are not rocket science, and there are descriptions for every kind of reader. Some are more technical, others use more examples.

Nicely illustrated: http://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-learing/

Original source: https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf

Very detailed, but the most comprehensive work: https://arxiv.org/pdf/1407.7502.pdf


OK. I got it now. Thanks.
As you know, there is instability in RF. That’s why I used one classifier for all images, but I do not know whether my method is correct in SNAP.
I chose the ‘train and apply classifier’ option on my first image and ran the classification (the classifier is saved automatically), and then I used the same classifier for the other images (I mean that for the other images, instead of ‘train and apply classifier’, I chose ‘load and apply classifier’). Although I think that first of all we should set the number of trees and the other options and only then use ‘load and apply classifier’.
Am I doing it the right way?

Technically, this is the right way. But it is worth mentioning that you can only apply your classifier to other images if you calibrated them carefully. Otherwise, the threshold values do not necessarily apply to these images.
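The "train and apply" / "load and apply" workflow in code form: train once, persist the model, then apply the identical model to pixels from a later (carefully calibrated!) image. This is a sketch with `pickle` and scikit-learn; SNAP stores its classifiers in its own auxdata format, but the workflow is the same.

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# Training pixels from the first image: 4 bands, binary labels
X_train = rng.normal(size=(1000, 4))
y_train = (X_train.sum(axis=1) > 0).astype(int)

# "train and apply classifier" (the classifier is saved)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
blob = pickle.dumps(clf)

# "load and apply classifier" on a second image
clf_loaded = pickle.loads(blob)
X_new = rng.normal(size=(10, 4))  # pixels from a second, calibrated image
print((clf.predict(X_new) == clf_loaded.predict(X_new)).all())
```

Because the loaded model is byte-identical to the saved one, its tree count and thresholds are fixed at save time, which is why changing the number of trees before "load and apply" has no effect on a loaded classifier.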

Hi, I am doing flood mapping using Sentinel-1 GRD images. After pre-processing, I am using two approaches:

  1. taking a threshold value from the histogram
  2. Random Forest classification (two classes: water and non-water)

I always get a much higher number of water pixels with RFC, much more than with the first method.
The RFC result is not even close to the thresholding result for water pixels. Why is RFC giving such high values?
Or can I change the parameters in RFC to bring it closer to the threshold method?
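For reference, the first approach (histogram thresholding) can be automated rather than read off the histogram by eye. A NumPy sketch using Otsu's method on synthetic Sigma0 values (the dB numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic Sigma0_VV_db values: a dark water mode and a brighter land mode
pixels = np.concatenate([rng.normal(-18, 1.5, 3000),   # water
                         rng.normal(-8, 2.0, 7000)])   # non-water

hist, edges = np.histogram(pixels, bins=256)
centers = (edges[:-1] + edges[1:]) / 2

# Otsu's method: choose the split maximising the between-class variance
w0 = np.cumsum(hist)[:-1].astype(float)       # pixels below each split
w1 = hist.sum() - w0                          # pixels above each split
s0 = np.cumsum(hist * centers)[:-1]
mu0 = s0 / w0                                 # mean below the split
mu1 = ((hist * centers).sum() - s0) / w1      # mean above the split
t = np.argmax(w0 * w1 * (mu0 - mu1) ** 2)
threshold = centers[t]

water = pixels < threshold
print(round(float(threshold), 1), int(water.sum()))
```

A threshold found this way gives a reproducible baseline to compare the RF water count against.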

What are your input layers? Random Forest needs a larger number of rasters to be trained on.

I am using a single layer, “Sigma0_VV_db”.

I am making polygons for water and non-water and giving them as training vectors.

That is clearly not enough for an RF classifier. It is based on the idea that the input rasters and pixels are randomly shuffled and only subsets are used for training. By repeating this, you get a very robust classifier.

It is well explained here: http://wgrass.media.osaka-cu.ac.jp/gisideas10/papers/04aa1f4a8beb619e7fe711c29b7b.pdf

You can include GLCM textures to increase your feature space.
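To make the GLCM suggestion concrete: texture measures turn the single sigma0 band into several feature rasters. Below is a minimal hand-rolled NumPy sketch of a grey-level co-occurrence matrix and one texture measure (contrast); in SNAP you would use the built-in GLCM texture operator instead, and the dB range used for quantisation here is an assumption.

```python
import numpy as np

def glcm_contrast(img, levels=8, vmin=-25.0, vmax=0.0):
    """Contrast of the horizontal grey-level co-occurrence matrix."""
    # quantise the dB image into a fixed number of grey levels
    q = np.floor(levels * (img - vmin) / (vmax - vmin)).astype(int)
    q = np.clip(q, 0, levels - 1)
    glcm = np.zeros((levels, levels))
    # count horizontal neighbour pairs (pixel offset dx = 1)
    np.add.at(glcm, (q[:, :-1].ravel(), q[:, 1:].ravel()), 1)
    glcm /= glcm.sum()
    i, j = np.indices(glcm.shape)
    return float(((i - j) ** 2 * glcm).sum())

rng = np.random.default_rng(5)
smooth = rng.normal(-18, 0.1, (32, 32))  # calm water: low texture
rough = rng.normal(-8, 3.0, (32, 32))    # heterogeneous land: high texture
print(glcm_contrast(smooth) < glcm_contrast(rough))
```

Stacking a few such texture rasters (contrast, homogeneity, entropy, …) alongside sigma0 gives the random forest the feature diversity it needs.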