Scientific Workflows at Scale using GNU Parallel

Length: Half day

Prerequisite Knowledge:

  • Basic usage of the Linux terminal and familiarity with simple commands such as ls, cat, and date
  • Familiarity with basic concepts such as files, directories, CPUs, and memory

Presenters:

Abstract:

This tutorial offers theoretical foundations and hands-on experience with GNU Parallel, a shell tool that executes terminal commands in parallel on one or more computers. A versatile tool, Parallel offers numerous options and features that make working on multicore and multinode architectures easy, concise, and efficient. Its ability to run tasks in parallel over files or raw data, in a wide variety of modes and load distributions, makes it a powerful fit for many workflows. GNU Parallel also meshes well with HPC middleware and filesystems, making it a low-friction tool for performing not only computations but also data movements efficiently.

Most HPC centers offer their resources in a shared manner via resource managers and schedulers, so it is important for any tool to work well with them. Parallel integrates cleanly with HPC job schedulers such as SLURM, LSF, and PBS/Torque.

Workflows often require dependency chaining between multiple computing stages. Using simple shell techniques, we will see how to leverage Parallel to build fully asynchronous, parallel, multi-stage scientific workflows. Parallel is invaluable from the scientific user's point of view: its simplicity empowers users to rapidly extract the parallel profile of a complex workflow, then experiment with it and hone it for large-scale runs using Parallel or a more specialized workflow tool.
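
To give a flavor of the interface, here is a minimal sketch of running independent tasks over a set of files; process_one and inputs.txt are hypothetical stand-ins for any per-item command and work list:

    # Compress every .txt file, one job per CPU core by default
    parallel gzip ::: *.txt

    # Read work items from a file, run two jobs per core, show a progress bar;
    # process_one is a hypothetical per-item command
    parallel --bar -j200% process_one {} :::: inputs.txt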
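
As an illustration of scheduler integration, the following is a sketch of Parallel inside a SLURM batch script, assuming a shared filesystem, passwordless SSH between the allocated nodes, and a hypothetical analyze program:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8

    # Turn the SLURM allocation into a node list that Parallel can SSH into
    scontrol show hostnames "$SLURM_JOB_NODELIST" > nodefile.txt

    # Spread jobs across the allocated nodes, eight at a time per node;
    # analyze is a hypothetical per-file program
    parallel --sshloginfile nodefile.txt -j 8 analyze {} ::: data/*.dat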
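
Dependency chaining itself needs nothing beyond the shell: chaining stages with && inside the command string lets each input advance to the next stage as soon as its previous stage succeeds, with no global barrier between stages. A sketch with hypothetical stage1 and stage2 programs:

    # Each input moves to stage2 the moment its own stage1 finishes;
    # {.} is the input filename with its extension removed
    parallel 'stage1 {} > {.}.s1 && stage2 {.}.s1 > {.}.out' ::: samples/*.raw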