In R, we have both the randomForest package and cforest (from the party package) for random forest classification. There are some differences between cforest and randomForest. For variable importance, cforest uses permutation importance while randomForest uses Gini importance, and the Gini measure is biased toward continuous and multi-category variables. cforest can also deal with highly correlated predictor variables.
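As a rough illustration of the two importance measures, here is a sketch on the built-in iris data (not the ASB data used below); varimp() is party's permutation importance and importance() is randomForest's:

# Sketch: compare importance measures on iris (illustration only)
library(party)
library(randomForest)

cf <- cforest(Species ~ ., data = iris, controls = cforest_unbiased(ntree = 100))
varimp(cf)                      # permutation importance
varimp(cf, conditional = TRUE)  # conditional variant, meant for correlated predictors

rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 100)
importance(rf, type = 2)        # Gini importance (MeanDecreaseGini)
importance(rf, type = 1)        # permutation-based measure in randomForest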
So it seems that cforest is better than randomForest. cforest also provides a weights parameter for dealing with imbalanced data, but the documentation is not clear about it and there is no example of how to use it. Here is what I tried:
> table(train_data$ASB)
   0    1
1780  182
> weights = rep(1/sum(train_data$ASB == "1"), nrow(train_data))
> weights[train_data$ASB == "0"] = 1/sum(train_data$ASB == "0")
> sum(weights)
[1] 2
> table(weights)
weights
0.000561797752808989  0.00549450549450549
                1780                  182
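The idea is that every row gets the reciprocal of its class count, so each class contributes a total weight of 1 (hence sum(weights) is 2). An equivalent, more compact way to build the same vector (a sketch, assuming ASB is a factor with levels "0" and "1" as shown above):

# Each row's weight = 1 / (number of rows in its class)
class_counts <- table(train_data$ASB)
weights <- 1 / as.numeric(class_counts[as.character(train_data$ASB)])
sum(weights)  # should again be 2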
> cf_tree <- cforest(ASB ~ ., data = train_data, weights = weights,
+                    controls = cforest_unbiased())
> table(predict(cf_tree), train_data$ASB)
       0    1
  0 1780   89
  1    0   93
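Note that predict(cf_tree) without new data re-predicts the training rows using all trees, which tends to look optimistic; party can instead return out-of-bag predictions, where each row is predicted only by trees that did not use it. A sketch with the same cf_tree object:

# Out-of-bag confusion matrix: a less optimistic view of training performance
table(predict(cf_tree, OOB = TRUE), train_data$ASB)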
This confusion matrix is better than the one from the model fitted without weights, which simply predicts every observation as class 0:
       0    1
  0 1780  182
  1    0    0
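To compare the two models with a single number, one option is to turn each confusion matrix into sensitivity, specificity, and balanced accuracy. A minimal sketch, assuming tables laid out as above (predictions in rows, true ASB in columns); conf_summary() is a hypothetical helper, not part of party:

# Hypothetical helper: summarise a predicted-by-actual confusion matrix
conf_summary <- function(cm) {
  sens <- cm["1", "1"] / sum(cm[, "1"])  # true positives / all actual 1s
  spec <- cm["0", "0"] / sum(cm[, "0"])  # true negatives / all actual 0s
  c(sensitivity = sens, specificity = spec,
    balanced_accuracy = (sens + spec) / 2)
}

conf_summary(table(predict(cf_tree), train_data$ASB))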