Wednesday 6 November 2013

Bedtools intersect big files


For large files (especially those bigger than 2 GB), bedtools intersect should be run with the -sorted option. Otherwise, the program can run for a whole afternoon and give no results. With the -sorted option, it finishes within minutes.
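The -sorted option expects both input files to be sorted by chromosome and then by start position (the bedtools documentation suggests `sort -k1,1 -k2,2n`). Here is a minimal Python sketch of that ordering. The helper name `sort_bed_lines` and the example records are made up for illustration, and for files too large to fit in memory the command-line sort is the better choice:

```python
def sort_bed_lines(lines):
    """Sort BED records by chromosome name (as text), then by numeric start."""
    # Drop blank lines and trailing newlines before sorting.
    records = [ln.rstrip("\n") for ln in lines if ln.strip()]

    def key(record):
        fields = record.split("\t")
        # Column 1 is the chromosome, column 2 the start coordinate.
        return (fields[0], int(fields[1]))

    return sorted(records, key=key)


# Hypothetical example records, not taken from any real data set.
example = [
    "chr2\t500\t600",
    "chr1\t1000\t1100",
    "chr1\t200\t300",
]
for record in sort_bed_lines(example):
    print(record)
```

Once both files are sorted this way, the call would look something like `bedtools intersect -a a.sorted.bed -b b.sorted.bed -sorted`, where the file names are placeholders.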

Sunday 8 September 2013

Productive Programmer: a book about using tools


This book covers a lot of topics related to productivity at different levels:
*computer interaction: searching instead of navigating, launch pads, choosing an editor
*programming tools: code analyzers
*programming methodology: test-driven design, meta-programming, composed method, polyglot programming
*programming philosophy: automation, Don't Repeat Yourself
*programmer self-management: focus, YAGNI, don't shave yaks, accidental complexity versus essential complexity

The first two topics are purely about tools, the third is about how to use the programming language, and the last two are really about how programmers manage themselves. Still, if you treat the programming language and yourself as kinds of programming tools, the author's answer to productivity matches all the chapters of this book well.

Before I read this book, I had already summarized some similar tips on computer interaction by myself. But honestly speaking, I'm still not a productive programmer at all. Why? The answer lies in the last two parts, the programmer's self-management.

Writing a script in an unsuitable language only wastes a few hours, but doing unnecessary things can cost me several days. We can run fast, but only if we are running in the right direction. When I added this book to my reading list, I also added another one, Rapid Development, which states a general strategy for rapid development:
*Avoid classic mistakes
*Apply development fundamentals
*Manage risks to avoid catastrophic setbacks
*Apply schedule-oriented practices

From this point of view, The Productive Programmer mainly addresses development fundamentals. For a research-based programmer like me, management is much more important than programming tricks.

Thursday 22 August 2013

The problems in the design of data processing pipelines


Design is the biggest problem for me, because it determines whether my work will still be useful in a few days. It's much more important than whether I use Emacs or some automated unit test harness. After struggling these past days, I found some pitfalls in my design process.
 
Design when programming
For a huge pipeline, I tend to design the next step while programming the current one. This frequently leads to a lot of unnecessary work. On the other hand, we can't design all the steps before starting. But this should not be an excuse for avoiding early design. Solution: design as much as possible up front.

Optimize too early
Last week, I spent two days learning a GATK utility that speeds up retrieving the FASTA sequence of a given interval from the whole genome. I had already done this in R, but it was very slow. Now that the pipeline is almost done, I find that I don't use the faster utility at all, and the speed of R is acceptable. So the two days I spent on this utility were mostly wasted. The solution to this design trap: do the necessary things, not the beautiful things.
 
Paralyzed in front of multiple choices
When facing a lot of choices, my brain is paralyzed. Then I try to escape from the situation by doing something distracting. In such a situation, write down the sub-steps of each choice. You'll then get a feeling for which one you prefer, though not necessarily which one is better.

Friday 9 August 2013

vcf-compare problems in practice

Recently, I have been using VCFtools to compare two VCF files. The command for this job is vcf-compare. But the developers of VCFtools don't describe how to use it in detail, which causes a lot of problems in practice.

The two VCF files I'm comparing are supposed to overlap a lot with each other. But vcf-compare told me that there was no overlap between them. This is its output:

Error: There is no overlap between any of the samples, yet haplotype comparison was requested.

Even by chance alone, there should be some overlap. So I supposed there were problems in my pipeline or my data. After comparing a lot of cases, I found two problems with the data:
1. The column header line should be the same in both files. Since the two VCF files come from different BAM files, the last field of the header line (the one after FORMAT) is the name of the original BAM file. This makes the two headers differ, and they need to be identical.

2. The chromosome names. In my case, one file uses 1 to represent chromosome 1, while the other uses chr1. So I needed to add chr in front of the numbers.

After fixing the two problems above, vcf-compare works well.
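The two fixes can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `harmonize_vcf_lines` and the placeholder name `SAMPLE` are hypothetical, the file is assumed to be an uncompressed single-sample VCF, and any `##contig` lines in the header are left untouched:

```python
def harmonize_vcf_lines(lines, sample_name="SAMPLE"):
    """Prefix chromosome names with 'chr' and give the sample column a shared name."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            out.append(line)
        elif line.startswith("#CHROM"):
            fields = line.split("\t")
            # Assumes a single-sample file: the last column is the sample name.
            fields[-1] = sample_name
            out.append("\t".join(fields))
        elif line:
            fields = line.split("\t")
            # Add the prefix only when it is missing, so chr1 stays chr1.
            if not fields[0].startswith("chr"):
                fields[0] = "chr" + fields[0]
            out.append("\t".join(fields))
    return out


# Made-up records for illustration only.
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\trun1.bam",
    "1\t12345\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
]
for line in harmonize_vcf_lines(vcf):
    print(line)
```

Running the same function with the same `sample_name` over both files gives them matching header lines and chromosome names before they are passed to vcf-compare.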


Friday 26 July 2013

Emacs Mark Region and Scrolling Problem


The problem looks like this when I set the mark in Emacs:
  1. Press C-SPC to set the mark at the beginning of the paragraph.
  1. Move to the end of the paragraph.
  1. If I need to scroll the page in step 2, the cursor jumps into the part of the buffer that is currently visible.
  1. Then I left-click to select the region between the start mark and the current cursor position. But I only get the part of the region that is in the current view.

After being frustrated by this for a long time, I realized that the cause was that setting the mark in step 1 had failed: C-SPC was captured not by Emacs, but by my input method.

Tuesday 9 July 2013

Synchronize a folder to the server in RStudio



  1. Install WinSCP
  1. Add the WinSCP installation path to the Windows 7 PATH environment variable
  1. Write the code:
    1. winscp_sync.txt:
                  open stored_session_name_in_WinSCP
                  synchronize both E:\Projects\R\data\server /homed/home/shi/anthony/tfbs_pwm/rsnp
                  exit
    1. winscp_sync.bat:
                  WinSCP.exe /console /script=winscp_sync.txt
  1. In RStudio, call:
                  system('./winscp_sync.bat')  # the path to the .bat file is needed
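
If the same synchronization is needed for several folders, the script file from the steps above could be generated instead of written by hand. A small Python sketch, assuming a hypothetical helper `winscp_sync_script` and paths without spaces (paths containing spaces would need quoting in the WinSCP script):

```python
def winscp_sync_script(session, local_dir, remote_dir):
    """Return the text of a WinSCP script that synchronizes two folders both ways."""
    return "\n".join([
        f"open {session}",
        f"synchronize both {local_dir} {remote_dir}",
        "exit",
    ]) + "\n"


# The values below are the ones used in the steps above.
script = winscp_sync_script(
    "stored_session_name_in_WinSCP",
    r"E:\Projects\R\data\server",
    "/homed/home/shi/anthony/tfbs_pwm/rsnp",
)
print(script)
```

The returned text can be saved as winscp_sync.txt and run through the same .bat file as before.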