Wednesday 24 September 2014

Using cforest to train on unbalanced data

In R, both the randomForest package and cforest (from the party package) provide random forest classification.
There are some differences between cforest and randomForest:
Importance measurement: permutation importance in cforest vs. Gini importance in randomForest. (Gini importance favors continuous and multi-category variables.)
cforest can also deal with highly correlated variables.

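So it seems that cforest is better than randomForest. cforest also provides a parameter (weights) to deal with imbalanced data, but the documentation is not clear, and there is no example of how to use it. Below I give every sample a weight of 1 divided by the size of its class, so that both classes carry the same total weight.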



> table(train_data$ASB)
   0    1
1780  182
> weights=rep(1/sum(train_data$ASB=="1"), nrow(train_data))
> weights[train_data$ASB=="0"] = 1/sum(train_data$ASB=="0")
> sum(weights)
[1] 2
> table(weights)
weights
0.000561797752808989  0.00549450549450549
                1780                  182
> cf_tree <- cforest(ASB ~ ., data = train_data, weights = weights,
+ control = cforest_unbiased())
> table(predict(cf_tree), train_data$ASB)
   pre_vector
       0    1
  0 1780    0
  1   89   93

This confusion matrix is better than the one from the model without weights, which looks like this:
   pre_vector_without_weight
       0    1
  0 1780    0
  1  182    0
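
For readers who want to check the weighting scheme outside R, here is a minimal Python sketch of the same arithmetic. It is only an illustration: the function name inverse_class_weights is made up for this example, and the label list is rebuilt from the class counts shown in the table(train_data$ASB) output above.

```python
from collections import Counter


def inverse_class_weights(labels):
    """Give each sample the weight 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Class sizes taken from the table() output in the post: 1780 zeros, 182 ones.
labels = ["0"] * 1780 + ["1"] * 182
weights = inverse_class_weights(labels)

# Each class sums to 1, so the total over two classes is 2.
print(round(sum(weights), 6))  # 2.0
```

Because each class sums to 1, the minority class is no longer outvoted by the majority class during training.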


Friday 19 September 2014

Magit install problem on Emacs by el-get


Emacs is a good IDE for Python.

Recently, Jhamrick updated her great Emacs settings again (https://github.com/jhamrick configuration). When I tried to follow her updated settings, I found that magit couldn't be installed by el-get. It gave me errors like this:

#The make information
makeinfo magit.texi -o magit.info
magit.texi:6: warning: unrecognized encoding name `utf-8'.
install-info --dir=dir magit.info
info: dir: No such file or directory
make: *** [dir] Error 1

#The message:
magit failed to install: (error el-get: make el-get could not build magit [make EMACS=/homed/home/shi/bin/emacs/bin/emacs-24.3 all]) [3 times]
el-get-installation-failed: el-get: make el-get could not build magit [make EMACS=/homed/home/shi/bin/emacs/bin/emacs-24.3 all]

I guess the problem is the "install-info" command on our server, which is out of date. As a result, el-get can't get through the make process.

When I checked the magit git site, it said that magit can also be installed from "elpa". This time I got no errors during installation. To use the elpa package, put the following in the .emacs file:

;;Melpa
(require 'package)
(add-to-list 'package-archives
             '("melpa" . "http://melpa.milkbox.net/packages/") t)
(when (< emacs-major-version 24)
  (add-to-list 'package-archives '("gnu" . "http://elpa.gnu.org/packages/")))
(package-initialize)

Then we can load magit successfully:
(require 'magit)


 ================================ 
One principle: install the packages manually on your machine if Jhamrick's install script fails.

Friday 12 September 2014

Pandas column slicing formats "['name']" and "[['name']]" are different


The slightly different slicing formats "[]" and "[[]]" give different objects: the first returns a Series, the second a DataFrame. This will cause problems when you compare the result with other data.

Code:
print(allele_data.head())             # data
print(type(allele_data["start"]))     # format 1: single brackets
print(type(allele_data[["start"]]))   # format 2: double brackets


Result:
In [157]:
     chr      start  ref_score ref alt  ref_index ref_strand  alt_score
0   chr1  186214179   0.822386   C   T         10          -   0.768521
1  chr20   49942978   0.959431   A   G          1          -   0.953408
2   chr1  144989929   0.649916   A   G         11          -   0.666702
3   chr4    8548970   0.803862   G   A         15          -   0.773032
4   chr8  135550588   0.892755   C   T          7          +   0.843062

In [159]: <class 'pandas.core.series.Series'>
In [161]: <class 'pandas.core.frame.DataFrame'>
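
A small self-contained sketch of the same behavior, assuming pandas is installed. The frame here is a cut-down stand-in for allele_data that reuses the first three rows of the table above; the variable names as_series and as_frame are only for illustration.

```python
import pandas as pd

# Cut-down stand-in for allele_data, using values from the table above.
allele_data = pd.DataFrame({
    "chr": ["chr1", "chr20", "chr1"],
    "start": [186214179, 49942978, 144989929],
})

as_series = allele_data["start"]    # single brackets: one-dimensional Series
as_frame = allele_data[["start"]]   # double brackets: one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# Selecting the column from the one-column frame gives back an equal Series.
print(as_frame["start"].equals(as_series))
```

If a later step expects a Series (for example, to compare against another column), it is probably safest to use the single-bracket form from the start.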



Wednesday 5 March 2014

"qsub: Unknown option" Problem

Today, a script that had worked fine under qsub stopped working after I made some minor changes.

When I submitted it, qsub gave me this message:

qsub: Unknown option.


The script works fine on my machine, but it fails under qsub. After some googling, I finally found a clue in this post: http://www.biac.duke.edu/forums/topic.asp?TOPIC_ID=1284.

Basically, I had added one line to my script:

#$local_bin/novoindex -t 2 hg19.nix hg19.fa


As you may notice, I had already commented it out. This is fine in the shell, but qsub treats a line starting with "#$" as a line of options for itself. If I change it to

##$local_bin/novoindex -t 2 hg19.nix hg19.fa

The script works fine again.
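
To catch this before submitting, a small check like the following could help. It is only a sketch: it assumes the scheduler uses "#$" as its option prefix (as in the case above), and the function name find_suspicious_directives is made up for this example. It flags lines that start with the prefix but do not continue with a dash-style option.

```python
def find_suspicious_directives(script_text, prefix="#$"):
    """Return (line_number, line) pairs for lines that start with the
    scheduler prefix but whose remainder does not begin with '-'."""
    suspicious = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if not line.startswith(prefix):
            continue
        rest = line[len(prefix):].strip()
        if rest and not rest.startswith("-"):
            suspicious.append((number, line))
    return suspicious


script = """#!/bin/bash
#$ -cwd
#$local_bin/novoindex -t 2 hg19.nix hg19.fa
##$local_bin/novoindex -t 2 hg19.nix hg19.fa
"""

print(find_suspicious_directives(script))
# [(3, '#$local_bin/novoindex -t 2 hg19.nix hg19.fa')]
```

Only the third line is reported: the second is a real option line, and the fourth starts with "##", so it does not match the prefix.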

Wednesday 6 November 2013

Bedtools intersect big files


For large files (especially ones bigger than 2 GB), bedtools intersect should be run with the -sorted option, which requires the input files to be sorted by chromosome and then by start position. Otherwise, the program can run the whole afternoon and give no results. With -sorted, it finishes within minutes.
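
A rough sketch of the sort-then-intersect workflow, driven from Python. The file names and both helper functions are hypothetical; the commands themselves are the usual "sort -k1,1 -k2,2n" and "bedtools intersect -a ... -b ... -sorted", and both tools are assumed to be on the PATH.

```python
import subprocess


def sorted_intersect_commands(a_bed, b_bed):
    """Build the sort commands and the bedtools command as argument lists."""
    sort_cmds = [
        # Sort by chromosome, then numerically by start position.
        ["sort", "-k1,1", "-k2,2n", "-o", path + ".sorted", path]
        for path in (a_bed, b_bed)
    ]
    intersect_cmd = [
        "bedtools", "intersect",
        "-a", a_bed + ".sorted",
        "-b", b_bed + ".sorted",
        "-sorted",
    ]
    return sort_cmds, intersect_cmd


def run_sorted_intersect(a_bed, b_bed, out_bed):
    """Sort both inputs, then write the intersection to out_bed."""
    sort_cmds, intersect_cmd = sorted_intersect_commands(a_bed, b_bed)
    for cmd in sort_cmds:
        subprocess.run(cmd, check=True)
    with open(out_bed, "w") as out:
        subprocess.run(intersect_cmd, stdout=out, check=True)


# Show the commands without running them (file names are placeholders).
sort_cmds, intersect_cmd = sorted_intersect_commands("a.bed", "b.bed")
print(sort_cmds)
print(intersect_cmd)
```

Calling run_sorted_intersect("a.bed", "b.bed", "out.bed") would run the three commands in order.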

Sunday 8 September 2013

Productive Programmer: a book about using tools


This book covers a lot of topics related to productivity at different levels:
*Computer interaction: searching instead of navigating, launchers, choosing an editor
*Programming tools: code analyzers
*Programming methodology: test-driven design, meta-programming, composed method, polyglot programming
*Programming philosophy: automation, Don't Repeat Yourself
*Programmer self-management: focus, YAGNI, don't shave yaks, accidental complexity vs. essential complexity

The first two topics are purely about tools, the third is about how to use the programming language, and the last two are actually about managing the programmer. Anyway, if you treat the programming language and yourself as kinds of programming tools, the author's answer to productivity fits all the chapters of this book well.

Before I read this book, I had already collected some similar tips on computer interaction myself. But honestly speaking, I'm still not a productive programmer at all. Why? The answer lies in the last two parts, the programmer's self-management.

Writing a script in an unsuitable language only wastes a few hours, but doing unnecessary things can cost me several days. We can run fast, but only if we are heading in the right direction. When I added this book to my reading list, I also added another, Rapid Development, which states a general strategy for rapid development:
*Avoid classic mistakes
*Apply development fundamentals
*Manage risks to avoid catastrophic setbacks
*Apply schedule-oriented practices

From this view, The Productive Programmer mainly addresses the development fundamentals. For a research-based programmer, management is much more important than programming tricks.

Thursday 22 August 2013

The problems in the design of data processing pipelines


Design is the biggest question for me, because it determines whether my work will still be useful in a few days. It is much more important than whether you use Emacs or some automatic unit test harness. After struggling these past days, I found some pitfalls in my design process.
 
Design when programming
For a huge pipeline, I tend to design the next step while programming the current one. This frequently leads to a lot of unnecessary work. On the other hand, we can't design all the steps before we start. But this should not be an excuse to avoid designing early. Solution: design as much as possible up front.

Optimize too early
Last week, I spent two days learning a utility in GATK that speeds up retrieving the fasta sequence of a given interval from the whole genome. I had done this in R, but it was very slow. Now that the pipeline is almost done, I find that I don't use the faster utility at all, and the speed of R is acceptable. So the two days I spent on this utility were more or less wasted. The solution to this design trap: do the necessary things, not the beautiful things.
 
Paralyzed in front of multiple choices
When facing a lot of choices, my brain is paralyzed. Then I try to escape from the situation and do distracting things. In such a situation, write down the sub-steps of each choice. Then you'll get a feeling for which one you like, if not which one is better.