Wednesday, 24 September 2014

Using cforest to train on unbalanced data

In R, both the randomForest package and cforest (from the party package) can be used for random forest classification.
There are some differences between cforest and randomForest:
Importance measurement: cforest uses permutation importance, while randomForest reports Gini importance by default. (Gini importance favors continuous variables and categorical variables with many categories.)
cforest can also deal with highly correlated variables.

So it seems that cforest is better than randomForest. cforest also provides a parameter (weights) to deal with imbalanced data, but the documentation is not clear, and there is no example of how to use it. Here is what I did:



> table(train_data$ASB)
   0    1
1780  182
> weights=rep(1/sum(train_data$ASB=="1"), nrow(train_data))
> weights[train_data$ASB=="0"] = 1/sum(train_data$ASB=="0")
> sum(weights)
[1] 2
> table(weights)
weights
0.000561797752808989  0.00549450549450549
                1780                  182
> cf_tree <- cforest(ASB ~ ., data = train_data, weights = weights,
+ control = cforest_unbiased())
> pre_vector <- predict(cf_tree)
> table(train_data$ASB, pre_vector)
   pre_vector
       0    1
  0 1780    0
  1   89   93

This confusion matrix is better than the one from the model without weights, which looks like this:
   pre_vector_without_weight
       0    1
  0 1780    0
  1  182    0
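
The weighting used above gives every sample a weight of 1 / (size of its class), so each class sums to 1 and the total is 2. Below is a minimal Python sketch of that arithmetic, plus the recall of class 1 in both confusion matrices. It assumes the rows of the tables are the actual classes (the row sums match the class counts); the helper names balanced_weights and recall are made up for this illustration and are not part of cforest.

```python
from collections import Counter


def balanced_weights(labels):
    """Give each sample the weight 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def recall(confusion, cls):
    """Share of samples with actual class `cls` that were predicted as `cls`.

    `confusion[actual][predicted]` holds counts (rows = actual class).
    """
    row = confusion[cls]
    return row[cls] / sum(row.values())


# Same class sizes as table(train_data$ASB) above
labels = ["0"] * 1780 + ["1"] * 182
weights = balanced_weights(labels)
print(round(sum(weights), 6))  # each class sums to 1, so the total is 2.0

# Confusion matrices copied from the two tables above
weighted = {"0": {"0": 1780, "1": 0}, "1": {"0": 89, "1": 93}}
unweighted = {"0": {"0": 1780, "1": 0}, "1": {"0": 182, "1": 0}}
print(round(recall(weighted, "1"), 3))  # 93 / 182
print(recall(unweighted, "1"))          # 0 / 182
```

Under that reading, the unweighted model never predicts the minority class, while the weighted model finds about half of it without losing any of the majority class on the training data.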


Friday, 19 September 2014

Magit install problem in Emacs with el-get


Emacs is a good IDE for Python.

Recently, Jhamrick updated her great Emacs settings again (https://github.com/jhamrick configuration). When I tried to follow her updated settings, I found that magit couldn't be installed by el-get. It gave me errors like this:

#The make information
makeinfo magit.texi -o magit.info
magit.texi:6: warning: unrecognized encoding name `utf-8'.
install-info --dir=dir magit.info
info: dir: No such file or directory
make: *** [dir] Error 1

#The message:
magit failed to install: (error el-get: make el-get could not build magit [make EMACS=/homed/home/shi/bin/emacs/bin/emacs-24.3 all]) [3 times]
el-get-installation-failed: el-get: make el-get could not build magit [make EMACS=/homed/home/shi/bin/emacs/bin/emacs-24.3 all]

I guess the problem is the "install-info" command on our server, which is out of date, so el-get can't get through the make process.

When I checked the magit git site, it said that magit can also be installed through "elpa". This time I didn't get any errors during installation. To use the elpa package, put the following in the .emacs file:

;;Melpa
(require 'package)
(add-to-list 'package-archives
             '("melpa" . "http://melpa.milkbox.net/packages/") t)
(when (< emacs-major-version 24)
  (add-to-list 'package-archives '("gnu" . "http://elpa.gnu.org/packages/")))
(package-initialize)

Then we can load magit successfully:
(require 'magit)


 ================================ 
One principle: if Jhamrick's install script fails, install the packages manually on your machine.

Friday, 12 September 2014

Pandas column slicing formats "['name']" and "[['name']]" are different


The slightly different slicing formats "[]" and "[[]]" give different objects: the first returns a Series, the second a one-column DataFrame. This can cause problems when you compare the result with other data.

Code:
print(allele_data.head())             # the data
print(type(allele_data["start"]))     # format 1: single brackets
print(type(allele_data[["start"]]))   # format 2: double brackets


Result:
In [157]:
     chr      start  ref_score ref alt  ref_index ref_strand  alt_score
0   chr1  186214179   0.822386   C   T         10          -   0.768521
1  chr20   49942978   0.959431   A   G          1          -   0.953408
2   chr1  144989929   0.649916   A   G         11          -   0.666702
3   chr4    8548970   0.803862   G   A         15          -   0.773032
4   chr8  135550588   0.892755   C   T          7          +   0.843062

In [159]: <class 'pandas.core.series.Series'>
In [161]: <class 'pandas.core.frame.DataFrame'>
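
To make the difference concrete, here is a small self-contained sketch. The two-row frame is made up for illustration (only the column names and values are borrowed from the output above), and it assumes pandas is installed. It shows that the two formats differ in type and shape, and one way to get back from the one-column DataFrame to a Series before comparing.

```python
import pandas as pd

# Tiny stand-in for allele_data (hypothetical two-row example)
df = pd.DataFrame({
    "chr": ["chr1", "chr20"],
    "start": [186214179, 49942978],
})

series = df["start"]     # single brackets: one-dimensional Series
frame = df[["start"]]    # double brackets: two-dimensional DataFrame

print(type(series).__name__, series.shape)
print(type(frame).__name__, frame.shape)

# Selecting the column by name turns the one-column frame back into a Series
back_to_series = frame["start"]
same = back_to_series.equals(series)
print(same)

# A Series comparison gives a boolean Series that can filter rows directly
mask = series == 186214179
print(df[mask])
```

If a function expects a one-dimensional column (for example for an element-wise comparison against another Series), checking the type of the slice first is a cheap way to catch this kind of mismatch.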