I’ve been using Nextflow a fair amount recently; here are some collected thoughts:

  1. It is a good system for distributing and organising computation based on processes working in traditional O/S-level abstractions: files, stdin/stdout, process-based parallelism.
  2. The caching system is useful (as expected; see my own similar system here).
  3. The integration with cloud auto-scaling clusters of nodes makes it much easier to scale out into the cloud efficiently; this can reduce time-to-completion by a large factor.
  4. Access to file-based intermediate results, together with cache/resume, is good for exploratory data analysis.
  5. The error messages produced by programming errors are not great, which will be confusing for beginners.
  6. Few potential users will know Groovy (and some knowledge of it is very useful, if not strictly required).
  7. The dichotomy between Groovy functions and dataflow processes, and similarly between Groovy variables and dataflow variables, will take some time for beginners to get used to.
  8. Some edges in the DSL are unfortunate, like not being able to use the same process twice in a workflow (without re-importing it under a different name).
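To illustrate the function/process dichotomy from point 7: an ordinary Groovy function executes immediately when called, while a process only declares a task that the runtime schedules once its input channels emit. A minimal sketch (the `greet`/`GREET` names and workflow are made up for illustration):

```groovy
// Ordinary Groovy function: evaluated immediately, on the head node.
def greet(name) {
    return "Hello, ${name}"
}

// Dataflow process: a declaration only; the runtime schedules it
// as a separate task once the input channel emits a value.
process GREET {
    input:
    val name

    output:
    stdout

    script:
    """
    echo 'Hello, ${name}'
    """
}

workflow {
    println greet('world')       // runs now, like any Groovy call
    GREET(Channel.of('world'))   // queued as a task; runs asynchronously
}
```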
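The process re-use limitation from point 8 is worked around with aliased imports in DSL2. A sketch assuming a module file `./modules/align.nf` that defines a process `ALIGN` (all names here are hypothetical):

```groovy
include { ALIGN as ALIGN_TUMOUR } from './modules/align.nf'
include { ALIGN as ALIGN_NORMAL } from './modules/align.nf'

workflow {
    // The same process definition, invoked twice under different aliases.
    ALIGN_TUMOUR(Channel.fromPath(params.tumour_reads))
    ALIGN_NORMAL(Channel.fromPath(params.normal_reads))
}
```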

Overall, it is a system you can get work done in straight away. As always: develop with a scaled-down problem, then scale out!
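One way to follow the scale-down-then-scale-out advice is with configuration profiles. A sketch of a `nextflow.config`, where the profile names, queue, and bucket are assumptions:

```groovy
// nextflow.config -- hypothetical profiles for local development vs cloud runs
profiles {
    dev {
        process.executor = 'local'
        params.input     = 'data/subset/*.fastq'   // small test subset
    }
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'        // hypothetical AWS Batch queue
        workDir          = 's3://my-bucket/work'   // hypothetical object store for intermediates
        params.input     = 's3://my-bucket/data/*.fastq'
    }
}
```

Develop with `nextflow run main.nf -profile dev`; once that works, run `nextflow run main.nf -profile cloud`, and when iterating within a profile, add `-resume` to re-use cached task results for unchanged stages.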

Single node execution model

Single node with object store

Multi-node with SLURM