flo.yaml specification

Individual analysis tasks are defined as YAML objects in a file named flo.yaml (or any other name you prefer), like this:

---
creates: "path/to/some/output/file.txt"
depends: "path/to/some/script.py"
command: "python {{depends}} > {{creates}}"

Every task YAML object must have a creates key and a command key and can optionally contain a depends key. The order of these keys does not matter; the above order is chosen for explanatory purposes only.

creates

The creates key defines the resource that is created. By default, it is interpreted as a path to a file (relative paths are interpreted as relative to the flo.yaml file). You can also specify a protocol, such as mysql:database/table (yet-to-be-implemented), for non-file based resources.

depends

The depends key defines the resource(s) on which this task depends. It is common for depends to specify many things, including data analysis scripts or other tasks from within the flo.yaml. Multiple dependencies can be defined in a YAML list like this:

depends:
  - "path/to/some/script.py"
  - "another/task/creates/target.txt"

These dependencies are what flo uses to determine if a task is out of sync and needs to be re-executed. Importantly, flo obeys the dependencies when it constructs the task graph but always runs in a deterministic order.
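As a rough illustration of what "out of sync" could mean, the sketch below uses a make-style timestamp comparison: a task is considered stale if its creates file is missing or older than any of its dependencies. This is an assumption made for illustration only; flo may track state differently (for example, by comparing file contents), and is_stale is a hypothetical helper, not part of flo.

```python
import os

def is_stale(creates, depends):
    """Return True if `creates` is missing or older than any path in `depends`.

    Make-style timestamp check; an illustrative assumption, not flo's API.
    """
    if not os.path.exists(creates):
        return True  # nothing has been produced yet
    target_mtime = os.path.getmtime(creates)
    # Stale as soon as one dependency was modified after the output.
    return any(os.path.getmtime(dep) > target_mtime for dep in depends)
```

A task with no depends is only stale under this rule when its output does not exist yet.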

command

The command key is mandatory and it defines the command(s) that should be executed to produce the resource specified by the creates key. Like the depends key, multiple steps can be defined in a YAML list like this:

command:
  - "mkdir -p $(dirname {{creates}})"
  - "python {{depends}} > {{creates}}"

templating variables

Importantly, the command is rendered as a jinja template to avoid duplicating information that is already defined in that task. It's quite common to use {{depends}} and {{creates}} in the command specification, but you can also use other variables like this:

---
creates: "path/to/some/output/file.txt"
sigma: "2.137"
depends: "path/to/some/script.py"
command: "python {{depends}} {{sigma}} > {{creates}}"

In the example above, sigma is only available when rendering the jinja template for that task. If you’d like to use sigma in several tasks, you can instead put it in a global namespace in flo.yaml like this (similar example here):

---
sigma: "2.137"
tasks:
  -
    creates: "path/to/some/output/file.txt"
    depends: "path/to/some/script.py"
    command: "python {{depends}} {{sigma}} > {{creates}}"
  -
    creates: "path/to/another/output/file.txt"
    depends:
      - "path/to/another/script.py"
      - "path/to/some/output/file.txt"
    command: "python {{depends[0]}} {{sigma}} < {{depends[1]}} > {{creates}}"
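The way global and task-level variables combine when a command is rendered can be sketched as follows. This is a simplified stand-in, not flo's code: render only handles plain {{name}} placeholders with a regular expression (real jinja also supports indexing such as {{depends[0]}} and filters such as join), and the rule that task-level keys take precedence over global ones is an assumption. Both function names are hypothetical.

```python
import re

def render(template, context):
    """Replace each plain {{name}} placeholder with context[name].

    Simplified stand-in for jinja: no filters, indexing, or expressions.
    """
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(context[m.group(1)]), template)

def render_command(task, global_vars):
    """Render a task's command using global variables plus the task's own keys."""
    context = dict(global_vars)  # start from the global namespace
    context.update(task)         # assumed: task-level keys win on conflict
    return render(task["command"], context)

global_vars = {"sigma": "2.137"}
task = {
    "creates": "path/to/some/output/file.txt",
    "depends": "path/to/some/script.py",
    "command": "python {{depends}} {{sigma}} > {{creates}}",
}
print(render_command(task, global_vars))
# → python path/to/some/script.py 2.137 > path/to/some/output/file.txt
```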

Another common use case for global variables is when you have several tasks that all depend on the same file. You can also use jinja templating in the creates and depends attributes of your flo.yaml like this:

---
input: "data/sp500.html"
tasks:
  -
    creates: "{{input}}"
    command:
      - "mkdir -p $(dirname {{creates}})"
      - "wget http://en.wikipedia.org/wiki/List_of_S%26P_500_companies -O {{creates}}"
  -
    creates: "data/names.dat"
    depends:
      - "src/extract_names.py"
      - "{{input}}"
    command: "python {{depends|join(' ')}} > {{creates}}"
  -
    creates: "data/symbols.dat"
    depends:
      - "src/extract_symbols.py"
      - "{{input}}"
    command: "python {{depends|join(' ')}} > {{creates}}"

There are several examples for more inspiration on how you could use the flo.yaml specification. If you have suggestions for other ideas, please add them!

deterministic execution order

flo is guaranteed to run tasks in exactly the same order every time, and it's important that users understand how this works. When flo is executed, it obeys the dependencies specified in the YAML configuration. In the event of ties, tasks are executed in the same order as they appear in the YAML configuration. Technically, this is very similar to a breadth-first search originating from the set of tasks that have no dependencies, except that tasks are ordered by their maximum distance from any source node and ties are broken by the order in the YAML configuration file.
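The ordering rule described above can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than flo's actual implementation: execution_order is a hypothetical function, tasks are given as (creates, depends) pairs in the order they appear in the YAML file, dependencies that no task creates (such as scripts) are ignored, and the graph is assumed to have no cycles.

```python
def execution_order(tasks):
    """Order tasks by max distance from a source task, ties broken by file order.

    `tasks` is a list of (creates, depends) pairs in YAML order.
    """
    position = {creates: i for i, (creates, _) in enumerate(tasks)}
    # Keep only dependencies that are themselves produced by another task.
    depends_on = {creates: [d for d in deps if d in position]
                  for creates, deps in tasks}
    distance = {}

    def max_distance(name):
        # Longest path from any source task to `name` (assumes no cycles).
        if name not in distance:
            parents = depends_on[name]
            distance[name] = 1 + max(map(max_distance, parents)) if parents else 0
        return distance[name]

    return sorted(position, key=lambda name: (max_distance(name), position[name]))

# Two threads (a -> b, and c) merging into d.
print(execution_order([("a", []), ("b", ["a"]), ("c", []), ("d", ["b", "c"])]))
# → ['a', 'c', 'b', 'd']
```

Sorting by maximum distance guarantees that every task appears after all of its dependencies, because a task's distance is always at least one more than that of each parent.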

The deterministic order example contains a few different YAML configuration files that demonstrate how this works in practice; the highlights are summarized here.

task graph for sibling tasks that all depend on the same parent

Sibling tasks are executed in the order in which they appear in the YAML configuration file, but always after their dependencies have been satisfied. In this example, the task graph looks like this and the tasks are guaranteed to run in alphabetical order.

task graph for parallel task threads

For parallel threads, tasks are executed primarily based on their distance from the source tasks and secondarily based on their ordering in the YAML configuration file. In this example, the task graph looks something like this and the tasks are guaranteed to run in alphabetical order.

task graph for merging task threads

For merging task graphs, tasks are executed based on their maximal distance from any source task. In this example, the task graph looks something like this and the tasks are guaranteed to run in alphabetical order.