Hi, I am wondering if there is any difference in the implementation of the Random Forest Classifier between SNAP and sklearn (scikit-learn)?
Parameters used in SNAP RF classifier:
Number of training samples: 5000
Number of trees: 500
Training vectors: land, water
Feature bands: Sigma0_VV_db
I have coded my own RF classifier in sklearn using the same parameters as above. However, for the same input image:
Run time: SNAP took around 5 minutes, while sklearn took more than an hour.
Accuracy: the SNAP result is very accurate (water bodies are delineated very clearly), while the sklearn result is completely inaccurate (it looks like a TV-static image).
Also, how many test samples are used, and how are they selected, to evaluate the classifier?
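For reference, here is a minimal sketch of what such an sklearn setup could look like. The data here is synthetic (the actual Sigma0_VV_db raster is not shown), and the backscatter values are illustrative assumptions, but the classifier parameters match the ones listed above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training vectors: Sigma0_VV_db values
# for "water" and "land" samples (2500 each = 5000 training samples).
# The means/spreads below are illustrative, not measured values.
rng = np.random.default_rng(0)
water = rng.normal(-22.0, 2.0, 2500)  # water backscatter tends to be low
land = rng.normal(-10.0, 3.0, 2500)

X = np.concatenate([water, land]).reshape(-1, 1)  # one feature band
y = np.concatenate([np.zeros(2500), np.ones(2500)])  # 0 = water, 1 = land

# 500 trees, parallelized across all cores (n_jobs=-1 helps runtime a lot)
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X, y)

# Classify a full image by flattening it to an (n_pixels, 1) array
# and reshaping the predictions back to the image dimensions.
image = rng.normal(-16.0, 6.0, (100, 100))
labels = clf.predict(image.reshape(-1, 1)).reshape(image.shape)
print(labels.shape)
```

One practical note: forgetting `n_jobs=-1` leaves sklearn single-threaded, which alone can explain a large part of a runtime gap on 500 trees.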
I think the main difference between the two is that the sklearn RF offers more options to control the randomization of the input features and samples, for example the fraction of the training set used to build each tree, the required pureness of nodes, the minimum number of samples within a node, etc. All of these are predefined (and hidden from the user) in SNAP, which could be one reason for its faster performance (no, or less, randomization, subsetting and bootstrapping of samples).
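As a sketch, these are the sklearn parameters I mean (the values below are purely illustrative, not recommendations, and SNAP's internal equivalents are not documented in the same terms):

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative mapping of the options mentioned above onto sklearn's
# RandomForestClassifier parameters (names are sklearn's, not SNAP's):
clf = RandomForestClassifier(
    n_estimators=500,
    max_samples=0.5,             # fraction of the training set bootstrapped per tree
    min_impurity_decrease=1e-4,  # node "pureness": stop splitting near-pure nodes
    min_samples_leaf=10,         # minimum number of samples within a leaf node
    max_features="sqrt",         # features considered at each split
    n_jobs=-1,                   # parallelize tree building across cores
    random_state=0,
)
print(clf.get_params()["n_estimators"])
```

Tightening `min_samples_leaf` and `min_impurity_decrease` also shrinks the trees, which can noticeably reduce both training and prediction time.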
Besides that, I think it is of limited use to generate 500 trees from only one input raster, because you will extract samples from the same source over and over again. Your random forest basically becomes just a forest, and the only thing that changes between trees is the subset of sampling points. You would get much more out of a random forest if you offered at least a couple of rasters to be randomly permuted.