For large files (especially those bigger than 2 GB), bedtools intersect should be run with the -sorted option. Otherwise, the program can run a whole afternoon without giving any results. With -sorted, which requires both input files to be sorted by chromosome and then by start position, it finishes within minutes.
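Since -sorted only works on position-sorted input, it can help to check the files first. The following is a minimal sketch, not part of bedtools itself: the helper names is_position_sorted and intersect_command are hypothetical, and the check assumes tab-separated BED lines with chromosomes ordered by plain string comparison (the order `sort -k1,1 -k2,2n` gives under the C locale).

```python
def is_position_sorted(lines):
    """Return True if BED lines are ordered by chromosome, then numeric start.

    Streams through the lines once, so it does not load a large file into memory.
    Chromosome order is plain string order (an assumption, see above).
    """
    prev = None
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], int(fields[1]))
        if prev is not None and key < prev:
            return False
        prev = key
    return True


def intersect_command(a_path, b_path):
    """Build the bedtools call that uses the -sorted option."""
    return ["bedtools", "intersect", "-a", a_path, "-b", b_path, "-sorted"]


if __name__ == "__main__":
    example = ["chr1\t10\t20\n", "chr1\t15\t30\n", "chr2\t5\t9\n"]
    print(is_position_sorted(example))
    print(" ".join(intersect_command("a.sorted.bed", "b.sorted.bed")))
```

The command list can be passed to subprocess.run(..., check=True) once both files pass the check. The file names here are placeholders.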
Wednesday, 6 November 2013
Sunday, 8 September 2013
Productive Programmer: a book about using tools
This book covers a lot of topics related to productivity at different levels:
*Computer interaction: searching instead of navigating, launch pads, choosing an editor
*Programming tools: code analyzers
*Programming methodology: test-driven design, meta-programming, composed method, polyglot programming
*Programming philosophy: automation, Don't Repeat Yourself
*Programmer self-management: focus, YAGNI, don't shave yaks, accidental complexity and essential complexity
The first two topics are purely about tools, the third is about how to use the programming language, and the last two are really about managing the programmer. Still, if you treat the programming language and yourself as kinds of programming tools, the author's answer to productivity matches all the chapters of the book well.
Before I read this book, I had collected some similar tips on computer interaction myself. But honestly speaking, I'm still not a productive programmer at all. Why? The answer lies in the last two parts, the programmer's self-management. Writing a script in an unsuitable language only wastes me several hours, but doing unnecessary things costs me several days. We can run fast, but only on the premise that we are running in the right direction. When I added this book to my reading list, I also added another one, Rapid Development, which states a general strategy for rapid development:
*Avoid classic mistakes
*Apply development fundamentals
*Manage risks to avoid catastrophic setbacks
*Apply schedule-oriented practices
From this view, The Productive Programmer mainly addresses the development fundamentals. For a research-based programmer, management is much more important than programming tricks.
Thursday, 22 August 2013
The problems in the design of data processing pipelines
The design problem is the biggest question for me, because it determines whether your work will still be useful in the next few days. It's much more important than whether you use Emacs or some automatic unit test harness. After struggling these days, I found some pitfalls in my design process.
Design when programming
For a huge pipeline, I tend to design the next step while programming the current one. This frequently leads to a lot of unnecessary work. On the other hand, we can't design all the steps before starting. But this should not be an excuse to avoid designing early. Solution: design as much as possible.
Optimize too early
Last week, I spent two days learning a GATK utility that speeds up retrieving the FASTA sequence of a given interval from the whole genome. I had done this in R, but it was very slow. Now that the pipeline is almost done, I find that I don't use the faster utility at all, and the speed of R is acceptable. So the two days I spent on this utility were largely wasted. The solution to this design trap: do the necessary things, not the beautiful things.
Paralyzed in front of multiple choices
When facing a lot of choices, my brain is paralyzed. Then I try to escape from the situation and do something distracting. In such a situation, write down the sub-steps of each choice. You'll then get a feeling for which one you like, though not necessarily which one is better.
Friday, 9 August 2013
vcf-compare problems in practice
Recently, I have been using VCFtools to compare two VCF files. The command for this job is vcf-compare. But the developers of VCFtools don't explain how to use it in detail, which causes a lot of problems in practice.
The two VCF files I'm comparing are supposed to overlap a lot with each other. But vcf-compare told me that there was no overlap between them. This is the output of vcf-compare:
Error: There is no overlap between any of the samples, yet haplotype comparison was requested.
Even by chance alone, there should be some overlap. So I suspected there were problems in my pipeline or data. After comparing many cases, I found two problems in the data:
1. The format header should be the same. Since the two VCF files come from different BAM files, the last field of the format header is the name of the original BAM file. This makes the headers differ, and they should be made the same.
2. The chromosome names. In my case, one file uses 1 to represent chr1, while the other uses chr1. So I needed to add chr before the numbers.
After fixing the two problems above, vcf-compare works well.
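The two fixes above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name normalize_vcf_line and the placeholder sample name "SAMPLE" are hypothetical, the "last field of the format header" is taken to mean the sample column of the #CHROM line in a single-sample VCF, and ## meta lines (including any ##contig entries) are passed through unchanged.

```python
def normalize_vcf_line(line, sample_name="SAMPLE"):
    """Make one VCF line comparable across files.

    - '##' meta lines and blank lines are returned untouched (assumption:
      contig declarations are not rewritten here).
    - In the '#CHROM' header line of a single-sample file, the sample column
      (e.g. the BAM file name) is replaced by a shared name.
    - In data lines, 'chr' is prepended to chromosome names that lack it.
    """
    if not line.strip() or line.startswith("##"):
        return line
    fields = line.rstrip("\n").split("\t")
    if line.startswith("#CHROM"):
        # 9 fixed columns (#CHROM .. FORMAT) plus exactly one sample column.
        if len(fields) == 10:
            fields[9] = sample_name
        return "\t".join(fields) + "\n"
    if not fields[0].startswith("chr"):
        fields[0] = "chr" + fields[0]
    return "\t".join(fields) + "\n"


if __name__ == "__main__":
    header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\trun1.bam\n"
    record = "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n"
    print(normalize_vcf_line(header), end="")
    print(normalize_vcf_line(record), end="")
```

Applying the function to every line of both files, before compressing and indexing them again, should give vcf-compare matching sample and chromosome names to work with.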
Friday, 26 July 2013
Emacs Mark Region and Scrolling Problem
The problem looks like this in Emacs when I set the mark:
- C-SPC sets the mark at the beginning of the paragraph
- Move to the end of the paragraph.
- If I need to scroll the page in step 2, the cursor moves into the current view of Emacs
- Then I left-click to select the region between the start mark and the current cursor. But I only get the part of the region that is in the current view.
After being depressed by this for a long time, I realized that the cause was the failure to set the mark in step 1. C-SPC was not captured by Emacs, but by my input method.
Tuesday, 9 July 2013
Synchronize a folder to the server from RStudio
- Install WinSCP
- Add the WinSCP installation path to the Windows 7 PATH environment variable
- Write the code:
- winscp_sync.txt:
open stored_session_name_in_WinSCP
synchronize both E:\Projects\R\data\server /homed/home/shi/anthony/tfbs_pwm/rsnp
exit
- winscp_sync.bat:
WinSCP.exe /console /script=winscp_sync.txt
- In RStudio, call:
system('./winscp_sync.bat')  # adjust the path to the .bat file as needed