Thursday 22 August 2013

The problems in the design of data processing pipelines


The design problem is the most big question for me, because this will determine whether your work will be useful in next a few days. It's much important then whether you use Emacs or some automatic unit test harness. After struggling these days, I find some pitfalls in my design process.
 
Design when programming
For a huge pipeline, I tend to design the next step when programming this step. This frequently leads to doing a lot of unnecessary work. On the other hand, we can't design all the steps before start. But this should not be the excuse of the avoid design early. [Solution], design as much as possible.

Optimize too early
Last week, I took two days to learn a utility in GATK that can accelerate the speed of retrieving fasta sequence of certain interval from whole genome.  I had done this in R, but it's very slow. Now, when the pipeline is almost done, I find that I don't use accelerate utility at all, and the speed of R is acceptable. So the two days I spent on this utility is kind of wasting time.So the solution of this design trap: Do the necessary things, not the beautiful things.
 
Paralyzed in front of multiple choices
When facing a lot of choices, my brain is paralyzed. And then I tried to escape from such situation, and do some distracting things. In such situation, write down sub steps of each choice. Then you'll get a feeling of which one you like, but not which one is better.

Friday 9 August 2013

vcf-compare problems in practice

Recently, I'm using vcftool to compare two vcf files. The command to finish this job is vcf-compare. But the developer of vcf tools doesn't illustrate the how to use it in detail, which causes a lot of problem in pratice.

The two vcf files I'm comparing are supposed to overlap a lot with each other. But the vcf compare told me that there was no overlap between them. This is the output of vcf-compare:

Error: There is no overlap between any of the samples, yet haplotype comparison was requested.

By chance, there should be some overlapping. So I supposed there were some problems in my pipeline and data. After comparing a lot of situations, I found two problems of the data:
1. The format header should be same. Since the two vcf files come from different bam files, the last field of format header is the name of their original bam file. This will lead to the difference in format and it should be same.

2. The name of chromosome. In my case, one file using 1 to representing chr1, while the other using chr1. So I need to add chr in before the numbers.

After fixing two problems above, vcf-compare works well.