1. Operator Tour¶
In this chapter, we take a curated tour of the Nextflow operators. Commonly used and well understood operators are not covered here - only those that we've seen could use more attention or those where the usage could be more elaborate. These set of operators have been chosen to illustrate tangential concepts and Nextflow features.
1.1 map
¶
1.1.1 Basics¶
Map is certainly the most commonly used of the operators covered here. It's a way to supply a closure through which each element in the channel is passed. The return value of the closure is emitted as an element in a new output channel. A canonical example is a closure that multiplies two numbers:
The code above is available in a starter main.nf
file available at advanced/operators/main.nf
. It is recommended to open and edit this file to follow along with the examples given in the rest of this chapter. The workflow can be executed with:
By default, the element being passed to the closure is given the default name it
. If you would prefer a more informative variable name, it can be named by using the ->
notation:
Groovy is an optionally typed language, and it is possible to specify the type of the argument passed to the closure.
1.1.2 Named Closures¶
If you find yourself re-using the same closure multiple times in your pipeline, the closure can be named and referenced:
If you have these re-usable closures defined, you can compose them together.
N E X T F L O W ~ version 23.04.1
Launching `./main.nf` [focused_borg] DSL2 - revision: f3c3e751fe
3
6
11
18
27
The above is the same as writing:
For those inclined towards functional programming, you'll be happy to know that closures can be curried:
1.2 view
¶
In addition to the argument-less usage of view
as shown above, this operator can also take a closure to customize the stdout message. We can create a closure to print the value of the elements in a channel as well as their type, for example:
Most closures will remain anonymous
In many cases, it is simply cleaner to keep the closure anonymous, defined inline. Giving closures a name is only recommended when you find yourself defining the same or similar closures repeatedly in a given workflow.
1.3 splitCsv
¶
A common Nextflow pattern is for a simple samplesheet to be passed as primary input into a workflow. We'll see some more complicated ways to manage these inputs later on in the workshop, but the splitCsv
(docs) is an excellent tool to have in a pinch. This operator will parse a csv/tsv and return a channel where each item is a row in the csv/tsv:
Exercise
From the directory advanced/operators
, use the splitCsv
and map
operators to read the file data/samplesheet.csv
and return a channel that would be suitable input to the process below. Feel free to consult the splitCsv documentation for tips.
Solution
Specifying the header
argument in the splitCsv
operator, we have convenient named access to csv elements. The closure returns a list of two elements where the second element a list of paths.
Convert Strings to Paths
The fastq paths are simple strings in the context of a csv row. In order to pass them as paths to a Nextflow process, they need to be converted into objects that adjere to the Path
interface. This is accomplished by wrapping them in file
.
In the sample above, we've lost an important piece of metadata - the tumor/normal classification, choosing only the sample id as the first element in the output list.
In the next chapter, we'll discuss the "meta map" pattern in more detail, but we can preview that here.
The construction of this map is very repetitive, and in the next chapter, we'll discuss some Groovy methods available on the Map
class that can make this pattern more concise and less error-prone.
1.4 multiMap
¶
The multiMap
(documentation) operator is a way of taking a single input channel and emitting into multiple channels for each input element.
Let's assume we've been given a samplesheet that has tumor/normal pairs bundled together on the same row. View the example samplesheet with:
Using the splitCsv
operator would give us one entry that would contain all four fastq files. Let's consider that we wanted to split these fastqs into separate channels for tumor and normal. In other words, for every row in the samplesheet, we would like to emit an entry into two new channels. To do this, we can use the multiMap
operator:
multiMapCriteria
The closure supplied to multiMap
needs to return multiple channels, so using named closures as described in the map
section above will not work. Fortunately, Nextflow provides the convenience multiMapCriteria
method to allow you to define named multiMap
closures should you need them. See the multiMap
documentation for more info.
1.5 branch
¶
The branch
operator (documentation) is a way of taking a single input channel and emitting a new element into one (and only one) of a selection of output channels.
In the example above, the multiMap
operator was necessary because we were supplied with a samplesheet that combined two pairs of fastq per row and we wanted to turn each row into new elements in multiple channels. If we were to use the neater samplesheet that had tumor/normal pairs on separate rows, we could use the branch
operator to achieve the same result as we are routing each input element into a single output channel.
An element is only emitted to the first channel were the test condition is met. If an element does not meet any of the tests, it is not emitted to any of the output channels. You can 'catch' any such samples by specifying true
as a condition. If we knew that all samples would be either tumor or normal and no third 'type', we could write
We may want to emit a slightly different element than the one passed as input. The branch
operator can (optionally) return a new element to an channel. For example, to add an extra key in the meta map of the tumor samples, we add a new line under the condition and return our new element. In this example, we modify the first element of the List
to be a new list that is the result of merging the existing meta map with a new map containing a single key:
Exercise
How would you modify the element returned in the tumor
channel to have the key:value pair type:'abnormal'
instead of type:'tumor'
?
Solution
There are many ways to accomplish this, but the map merging pattern introduced above can also be used to safely and concisely rename values in a map.
Merging maps is safe
Using the +
operator to merge two or more Maps returns a new Map. There are rare edge cases where modification of map rather than returning a new map can affect other channels. We discuss this further in the next chapter, but just be aware that this +
operator is safer and often more convenient than modifying the meta
object directly.
See the Groovy Map documentation for details.
1.5.1 Multi-channel Objects¶
Some Nextflow operators return objects that contain multiple channels. The multiMap
and branch
operators are excellent examples. In most instances, the output is assigned to a variable and then addressed by name:
or by using the set
operator (documentation):
A more interesting situation occurs when given a process that takes multiple channels as input:
You can either provide the channels individually:
or you can provide the multichannel as a single input:
For an even cleaner solution, you can skip the now-redundant set
operator:
If you have processes that output multiple channels and input multiple channels and the cardinality matches, they can be chained together in the same manner.
1.6 groupTuple
¶
A common operation is to group elements from a single channel where those elements share a common key. Take this example samplesheet as an example:
We see that there are multiple rows where the first element in the item emitted by the channel is the Map [id:sampleA, type:normal]
and items in the channel where the first element is the Map [id:sampleA, type:tumor]
.
The groupTuple
operator allows us to combine elements that share a common key:
1.7 transpose
¶
The transpose operator is often misunderstood. It can be thought of as the inverse of the groupTuple
operator. Give the following workflow, the groupTuple
and transpose
operators cancel each other out. Removing lines 8 and 9 returns the same result.
Given a workflow that returns one element per sample, where we have grouped the samplesheet lines on a meta containing only id and type:
N E X T F L O W ~ version 23.04.1
Launching `./main.nf` [spontaneous_rutherford] DSL2 - revision: 7dc1cc0039
[[id:sampleA, type:normal], [1, 2], [[data/reads/sampleA_rep1_normal_R1.fastq.gz, data/reads/sampleA_rep1_normal_R2.fastq.gz], [data/reads/sampleA_rep2_normal_R1.fastq.gz, data/reads/sampleA_rep2_normal_R2.fastq.gz]]]
[[id:sampleA, type:tumor], [1, 2], [[data/reads/sampleA_rep1_tumor_R1.fastq.gz, data/reads/sampleA_rep1_tumor_R2.fastq.gz], [data/reads/sampleA_rep2_tumor_R1.fastq.gz, data/reads/sampleA_rep2_tumor_R2.fastq.gz]]]
[[id:sampleB, type:normal], [1], [[data/reads/sampleB_rep1_normal_R1.fastq.gz, data/reads/sampleB_rep1_normal_R2.fastq.gz]]]
[[id:sampleB, type:tumor], [1], [[data/reads/sampleB_rep1_tumor_R1.fastq.gz, data/reads/sampleB_rep1_tumor_R2.fastq.gz]]]
[[id:sampleC, type:normal], [1], [[data/reads/sampleC_rep1_normal_R1.fastq.gz, data/reads/sampleC_rep1_normal_R2.fastq.gz]]]
[[id:sampleC, type:tumor], [1], [[data/reads/sampleC_rep1_tumor_R1.fastq.gz, data/reads/sampleC_rep1_tumor_R2.fastq.gz]]]
If we add in a transpose
, each repeat number is matched back to the appropriate list of reads:
N E X T F L O W ~ version 23.04.1
Launching `./main.nf` [elegant_rutherford] DSL2 - revision: 2c5476b133
[[id:sampleA, type:normal], 1, [data/reads/sampleA_rep1_normal_R1.fastq.gz, data/reads/sampleA_rep1_normal_R2.fastq.gz]]
[[id:sampleA, type:normal], 2, [data/reads/sampleA_rep2_normal_R1.fastq.gz, data/reads/sampleA_rep2_normal_R2.fastq.gz]]
[[id:sampleA, type:tumor], 1, [data/reads/sampleA_rep1_tumor_R1.fastq.gz, data/reads/sampleA_rep1_tumor_R2.fastq.gz]]
[[id:sampleA, type:tumor], 2, [data/reads/sampleA_rep2_tumor_R1.fastq.gz, data/reads/sampleA_rep2_tumor_R2.fastq.gz]]
[[id:sampleB, type:normal], 1, [data/reads/sampleB_rep1_normal_R1.fastq.gz, data/reads/sampleB_rep1_normal_R2.fastq.gz]]
[[id:sampleB, type:tumor], 1, [data/reads/sampleB_rep1_tumor_R1.fastq.gz, data/reads/sampleB_rep1_tumor_R2.fastq.gz]]
[[id:sampleC, type:normal], 1, [data/reads/sampleC_rep1_normal_R1.fastq.gz, data/reads/sampleC_rep1_normal_R2.fastq.gz]]
[[id:sampleC, type:tumor], 1, [data/reads/sampleC_rep1_tumor_R1.fastq.gz, data/reads/sampleC_rep1_tumor_R2.fastq.gz]]