Wednesday 24 September 2014

Using cforest to train on unbalanced data

In R, we have both the randomForest package and party's cforest for random forest classification.
There are some differences between cforest and randomForest:
Importance measurement: cforest uses permutation importance, while randomForest uses Gini importance (Gini importance is biased toward continuous and multi-category variables).
cforest can also deal with highly correlated variables, via conditional permutation importance.
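The two importance measures can be compared side by side. A minimal sketch on the built-in iris data, assuming the party and randomForest packages are installed (the dataset and ntree value are just for illustration):

```r
library(party)         # provides cforest(), cforest_unbiased(), varimp()
library(randomForest)  # provides randomForest() and Gini importance

set.seed(42)

# Conditional inference forest: permutation-based importance
cf <- cforest(Species ~ ., data = iris,
              control = cforest_unbiased(ntree = 100))
varimp(cf)                      # permutation importance
varimp(cf, conditional = TRUE)  # conditional version, for correlated predictors

# Classic random forest: Gini-based importance
rf <- randomForest(Species ~ ., data = iris)
rf$importance                   # MeanDecreaseGini column
```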

So cforest seems better than randomForest. cforest also provides a weights parameter to deal with imbalanced data, but the documentation is not clear, and there is no example of how to use it.



> table(train_data$ASB)
   0    1
1780  182
> weights=rep(1/sum(train_data$ASB=="1"), nrow(train_data))
> weights[train_data$ASB=="0"] = 1/sum(train_data$ASB=="0")
> sum(weights)
[1] 2
> table(weights)
weights
0.000561797752808989  0.00549450549450549
                1780                  182
> cf_tree <- cforest(ASB ~ ., data = train_data, weights = weights,
+                    control = cforest_unbiased())
> pre_vector <- predict(cf_tree)
> table(train_data$ASB, pre_vector)
   pre_vector
       0    1
  0 1780    0
  1   89   93
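The weighting scheme above gives every observation a weight of 1 over its class size, so each class contributes a total weight of 1 (which is why sum(weights) is 2). The same idea can be written generically for any factor response; a sketch, where balance_weights is a hypothetical helper name:

```r
# Class-balancing case weights: each observation gets 1 / (size of its class),
# so every class contributes a total weight of exactly 1.
balance_weights <- function(y) {
  counts <- table(y)                      # observations per class
  as.numeric(1 / counts[as.character(y)]) # look up each row's class count
}

# Toy example: three "0"s and one "1"
y <- factor(c("0", "0", "0", "1"))
w <- balance_weights(y)   # 1/3, 1/3, 1/3, 1
sum(w)                    # 2, i.e. one unit of weight per class
```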

This confusion matrix is better than that of the model without weights, shown below:
   pre_vector_without_weight
       0    1
  0 1780    0
  1   182   0
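To make "better" concrete, one can compute sensitivity, specificity, and balanced accuracy from the two confusion matrices. A sketch, with the matrices hard-coded from the output above (rows = actual ASB, columns = predicted):

```r
# Confusion matrices from the transcript (rows = actual, columns = predicted)
cm_weighted   <- matrix(c(1780, 89, 0, 93), nrow = 2,
                        dimnames = list(actual = c("0", "1"),
                                        predicted = c("0", "1")))
cm_unweighted <- matrix(c(1780, 182, 0, 0), nrow = 2,
                        dimnames = list(actual = c("0", "1"),
                                        predicted = c("0", "1")))

balanced_accuracy <- function(cm) {
  sens <- cm["1", "1"] / sum(cm["1", ])  # recall on the minority class
  spec <- cm["0", "0"] / sum(cm["0", ])  # recall on the majority class
  (sens + spec) / 2
}

balanced_accuracy(cm_weighted)    # about 0.756 (93/182 sensitivity, full specificity)
balanced_accuracy(cm_unweighted)  # 0.5 (the model never predicts class 1)
```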

