Wednesday, 24 September 2014

Using cforest to train on unbalanced data

In R, both the randomForest package and the cforest function (from the party package) can be used for random forest classification.
There are some differences between cforest and randomForest:
Importance measurement: permutation importance in cforest vs. Gini importance in randomForest. (Gini importance favors continuous variables and categorical variables with many categories.)
cforest can also deal with highly correlated variables (through its conditional permutation importance).
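
To make the importance point concrete, here is a minimal sketch that compares the marginal and the conditional permutation importance on simulated data. The data set, the variable names (x1, x2, x3) and the ntree/mtry settings are assumptions made up for illustration and are not from the original analysis. The model part only runs if the party package is installed.

```r
# Simulated data (hypothetical): x1 drives the class, x2 is a noisy copy
# of x1 (so the two are highly correlated), and x3 is pure noise.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
y  <- factor(ifelse(x1 + rnorm(n, sd = 0.5) > 0, "1", "0"))
sim <- data.frame(y, x1, x2, x3)

if (requireNamespace("party", quietly = TRUE)) {
  library(party)
  cf <- cforest(y ~ ., data = sim,
                controls = cforest_unbiased(ntree = 100, mtry = 2))
  # Marginal permutation importance
  print(varimp(cf))
  # Conditional permutation importance, which adjusts for correlated predictors
  print(varimp(cf, conditional = TRUE))
}
```

The conditional version is noticeably slower, so it is probably best tried on a small forest first.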

So cforest seems to be the better choice here. cforest also provides a parameter, weights, for dealing with imbalanced data, but the documentation is not clear about it, and there is no example showing how to use it. The session below gives each class the same total weight:



> library(party)
> table(train_data$ASB)
   0    1
1780  182
> weights <- rep(1 / sum(train_data$ASB == "1"), nrow(train_data))
> weights[train_data$ASB == "0"] <- 1 / sum(train_data$ASB == "0")
> sum(weights)
[1] 2
> table(weights)
weights
0.000561797752808989  0.00549450549450549
                1780                  182
> cf_tree <- cforest(ASB ~ ., data = train_data, weights = weights,
+                    controls = cforest_unbiased())
> pre_vector <- predict(cf_tree)
> table(train_data$ASB, pre_vector)
   pre_vector
       0    1
  0 1780    0
  1   89   93
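
The weighting used in the session above (each observation gets 1 divided by the size of its class) can be wrapped in a small helper. This is only a sketch: class_balanced_weights is a hypothetical name, and the label vector below is a toy stand-in that merely reproduces the class counts from the post, since the real train_data is not available here.

```r
# Hypothetical helper: weight each observation by 1 / (size of its class),
# so that every class contributes the same total weight.
class_balanced_weights <- function(y) {
  y <- factor(y)
  counts <- table(y)
  as.numeric(1 / counts[as.character(y)])
}

# Toy labels with the same class counts as in the post (1780 vs 182).
asb <- factor(c(rep("0", 1780), rep("1", 182)))
w <- class_balanced_weights(asb)

sum(w)               # total weight over all observations
tapply(w, asb, sum)  # total weight per class
```

With this scheme the per-class totals are equal, so the minority class is no longer swamped when cforest samples observations according to the weights.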

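The weighted model gives a better confusion matrix than the model without weights, which is shown below. With weights, 93 of the 182 positive cases are identified correctly; without weights, every observation is predicted as class 0.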
   pre_vector_without_weight
       0    1
  0 1780    0
  1  182    0
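
To put numbers on the comparison, the two confusion matrices can be turned into accuracy, sensitivity and specificity. This is a minimal sketch that assumes the rows are the actual classes and the columns the predicted classes (the row totals match the class counts 1780 and 182). class_metrics is a hypothetical helper name.

```r
# Confusion matrices copied from the post.
# Assumption: rows = actual class, columns = predicted class.
labels <- list(actual = c("0", "1"), predicted = c("0", "1"))
with_weights    <- matrix(c(1780, 0,  89, 93), nrow = 2, byrow = TRUE,
                          dimnames = labels)
without_weights <- matrix(c(1780, 0, 182,  0), nrow = 2, byrow = TRUE,
                          dimnames = labels)

# Hypothetical helper: summary metrics, treating "1" as the positive class.
class_metrics <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),
    specificity = cm["0", "0"] / sum(cm["0", ]))
}

round(rbind(with_weights    = class_metrics(with_weights),
            without_weights = class_metrics(without_weights)), 3)
```

Accuracy alone hides the problem: the unweighted model still scores about 0.91 simply by predicting the majority class. These numbers are computed on the training data, so they are optimistic; as far as I know, predict() for cforest in party also accepts OOB = TRUE for out-of-bag predictions, which would give a more honest estimate.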

