Thursday, 22 August 2013

Problems in designing data processing pipelines


Design is the biggest question for me, because it determines whether my work will still be useful a few days from now. It matters much more than whether you use Emacs or some automatic unit test harness. After struggling with this over the past few days, I have found some pitfalls in my design process.
 
Designing while programming
For a huge pipeline, I tend to design the next step while I am programming the current one. This frequently leads to a lot of unnecessary work. On the other hand, we can't design all the steps before we start. But that should not be an excuse to avoid designing early. Solution: design as much as possible up front.
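One way to design as much as possible before writing the details is to lay out every step as a stub first, so the whole pipeline runs end to end from day one. The sketch below is only an illustration: the stage names (extract_sequences, summarize) and the interval strings are made up for this example and are not the steps of my actual pipeline.

```python
def extract_sequences(intervals):
    # Stub: returns placeholder records so later stages can be
    # designed against the shape of the data, not its content.
    return [{"interval": iv, "sequence": ""} for iv in intervals]


def summarize(records):
    # Stub: a trivial summary, enough to check that the stages connect.
    return {"n_records": len(records)}


def run_pipeline(data, stages):
    # Feed the output of each stage into the next one.
    for stage in stages:
        data = stage(data)
    return data


# Hypothetical input intervals, for illustration only.
result = run_pipeline(["chr1:100-200", "chr2:5-50"],
                      [extract_sequences, summarize])
```

Once the skeleton runs, each stub can be filled in one at a time, and a step that turns out to be unnecessary is visible before any real work goes into it.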

Optimizing too early
Last week, I spent two days learning a GATK utility that speeds up retrieving the FASTA sequence of a given interval from a whole genome. I had already done this in R, but it was very slow. Now that the pipeline is almost done, I find that I don't use the faster utility at all, and the speed of R is acceptable. So the two days I spent on this utility were more or less wasted. The way out of this design trap: do the necessary things, not the beautiful things.
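A quick check that might have saved those two days is to time the slow step first and compare the total cost with what is tolerable. This is a minimal sketch under my own assumptions: time_step and worth_optimizing are hypothetical helpers, and the time budget is a number you pick yourself.

```python
import time


def time_step(step, *args, repeats=3):
    # Run the step a few times and keep the best wall-clock time in seconds.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        step(*args)
        best = min(best, time.perf_counter() - start)
    return best


def worth_optimizing(seconds_per_run, runs_expected, budget_seconds):
    # Optimize only if the total expected time exceeds what you can tolerate.
    return seconds_per_run * runs_expected > budget_seconds
```

For example, a step that takes 0.5 seconds and runs 10 times fits easily into a one-minute budget, so worth_optimizing(0.5, 10, 60) is False and the step can stay as it is.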
 
Paralyzed by too many choices
When I face a lot of choices, my brain freezes. Then I try to escape from the situation and do something distracting instead. When this happens, write down the sub-steps of each choice. You will then get a feeling for which one you like, though not necessarily which one is better.
