flo.yaml specification

Individual analysis tasks are defined as YAML objects in a file named flo.yaml (or any other name you prefer), like this:

---
creates: "path/to/some/output/file.txt"
depends: "path/to/some/script.py"
command: "python {{depends}} > {{creates}}"

Every YAML object that defines a task must have creates and command keys and can optionally contain a depends key. The order of these keys does not matter; the above order is chosen for explanatory purposes only.
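The key requirements above can be checked mechanically. The following is a minimal sketch, not flo's own validation code; the names REQUIRED_KEYS and missing_keys are hypothetical and exist only for illustration. It assumes a task has already been parsed from YAML into a Python dict.

```python
# Hypothetical helper, not part of flo: report which required keys a task lacks.
REQUIRED_KEYS = ("creates", "command")  # "depends" is optional


def missing_keys(task):
    """Return the required keys that are absent from a task mapping."""
    return [key for key in REQUIRED_KEYS if key not in task]


complete = {"creates": "out.txt", "command": "touch {{creates}}"}
incomplete = {"depends": "script.py", "command": "python {{depends}}"}

print(missing_keys(complete))    # []
print(missing_keys(incomplete))  # ['creates']
```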

creates

The creates key uniquely identifies the resource that is created. By default, it is interpreted as a path to a file or a directory (relative paths are interpreted relative to the flo.yaml file). Importantly, every task is intended to create a single file or directory. If you have a task that creates multiple files, you can either (i) split it into separate tasks or (ii) place all of those files in a directory and use the directory name as the creates value like this:

---
creates: "path/to/output/directory"
depends: "path/to/some/script.py"
command:
  - "mkdir -p {{creates}}"
  - "python {{depends}} {{creates}}"

In this case, the directory path/to/output/directory is passed as the first argument to path/to/some/script.py, which can then add as many files as necessary to that directory. When this task is complete, flo checks the hash of all files in path/to/output/directory and all of its child directories to determine if it is in sync or not.
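To make the idea of hashing a whole directory concrete, here is a minimal sketch of one way it could be done. It is not flo's actual implementation: the function name directory_digest, the choice of SHA-1, and the decision to mix each file's relative path into the hash are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def directory_digest(root):
    """Hash every file under root, including files in child directories."""
    digest = hashlib.sha1()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            # Include the relative path so that renaming a file changes the digest.
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    target = os.path.join(root, "sub", "a.txt")
    with open(target, "w") as handle:
        handle.write("one")
    before = directory_digest(root)
    unchanged = directory_digest(root)
    with open(target, "w") as handle:
        handle.write("two")
    after = directory_digest(root)

print(before == unchanged)  # same contents give the same digest
print(before != after)      # edited contents give a different digest
```

Comparing a stored digest against a freshly computed one is enough to tell whether anything under the directory has changed since the task last ran.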

depends

The depends key defines the resource(s) on which this task depends. It is common for depends to specify many things, including data analysis scripts or other tasks from within the flo.yaml. Multiple dependencies can be defined in a YAML list like this:

depends:
  - "path/to/some/script.py"
  - "another/task/creates/target.txt"

These dependencies are what flo uses to determine if a task is out of sync and needs to be re-executed. Importantly, flo obeys the dependencies when it constructs the task graph but always runs in a deterministic order. If a specified dependency does not exist immediately before flo runs the task, flo throws an informative error.

command

The command key is mandatory and it defines the command(s) that should be executed to produce the resource specified by the creates key. Like the depends key, multiple steps can be defined in a YAML list like this:

command:
  - "mkdir -p $(dirname {{creates}})"
  - "python {{depends}} > {{creates}}"

templating variables

Importantly, the command is rendered as a jinja template to avoid duplicating information that is already defined in that task. It's quite common to use {{depends}} and {{creates}} in the command specification, but you can also use other variables like this:

---
creates: "path/to/some/output/file.txt"
sigma: "2.137"
depends: "path/to/some/script.py"
command: "python {{depends}} {{sigma}} > {{creates}}"

In the example above, sigma is only available when rendering the jinja template for that task. If you’d like to use sigma in several tasks, you can instead put it in a global namespace in flo.yaml like this (similar example here):

---
sigma: "2.137"
tasks:
  -
    creates: "path/to/some/output/file.txt"
    depends: "path/to/some/script.py"
    command: "python {{depends}} {{sigma}} > {{creates}}"
  -
    creates: "path/to/another/output/file.txt"
    depends:
      - "path/to/another/script.py"
      - "path/to/some/output/file.txt"
    command: "python {{depends[0]}} {{sigma}} < {{depends[1]}} > {{creates}}"

Another common use case for global variables is when you have several tasks that all depend on the same file. You can also use jinja templating in the creates and depends attributes of your flo.yaml like this:

---
input: "data/sp500.html"
tasks:
  -
    creates: "{{input}}"
    command:
      - "mkdir -p $(dirname {{creates}})"
      - "wget http://en.wikipedia.org/wiki/List_of_S%26P_500_companies -O {{creates}}"
  -
    creates: "data/names.dat"
    depends:
      - "src/extract_names.py"
      - "{{input}}"
    command: "python {{depends|join(' ')}} > {{creates}}"
  -
    creates: "data/symbols.dat"
    depends:
      - "src/extract_symbols.py"
      - "{{input}}"
    command: "python {{depends|join(' ')}} > {{creates}}"
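The templating examples above can be reproduced with the jinja2 Python package to see what the rendered commands look like. This is a sketch under assumptions: it requires jinja2 to be installed, and the way the rendering context is built here (global variables first, then the task's own keys) is a guess at the behavior described above, not flo's actual code.

```python
from jinja2 import Template

# Values copied from the global-namespace example; the merging strategy is an assumption.
global_vars = {"sigma": "2.137"}
task = {
    "creates": "path/to/another/output/file.txt",
    "depends": ["path/to/another/script.py", "path/to/some/output/file.txt"],
    "command": "python {{depends[0]}} {{sigma}} < {{depends[1]}} > {{creates}}",
}

# Task-level keys are layered on top of the global variables.
context = dict(global_vars, **task)
rendered = Template(task["command"]).render(**context)
print(rendered)

# The join filter turns a list of dependencies into space-separated arguments.
joined = Template("python {{depends|join(' ')}} > {{creates}}").render(**context)
print(joined)
```

Indexing with depends[0] and depends[1] is useful when each dependency plays a different role in the command, whereas the join filter is convenient when every dependency is simply passed along as an argument.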

There are several examples for more inspiration on how you could use the flo.yaml specification. If you have suggestions for other ideas, please add them!

deterministic execution order

flo is guaranteed to run tasks in exactly the same order every time, and it's important that users understand how this works. When flo is executed, it obeys the dependencies specified in the YAML configuration. In the event of ties, tasks are executed in the order in which they appear in the YAML configuration. Technically, this is very similar to a breadth-first search originating from the set of tasks that have no dependencies, except that tasks are ordered by their maximum distance from any source node, with ties broken by their order in the YAML configuration file.
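The ordering rule described above can be sketched in a few lines of Python. This is an illustration of the rule as stated, not flo's source code: the function name execution_order is hypothetical, tasks are assumed to be already-parsed dicts, and dependencies that no task creates (such as scripts) are assumed not to affect the ordering.

```python
def execution_order(tasks):
    """Order tasks by (maximum distance from a source task, position in the YAML)."""
    index = {task["creates"]: i for i, task in enumerate(tasks)}

    def as_list(value):
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    # Keep only dependencies that are produced by another task.
    parents = {
        task["creates"]: [d for d in as_list(task.get("depends")) if d in index]
        for task in tasks
    }

    depth = {}

    def max_distance(name, visiting=()):
        if name in depth:
            return depth[name]
        if name in visiting:
            raise ValueError("dependency cycle at " + name)
        distance = 0
        for parent in parents[name]:
            distance = max(distance, 1 + max_distance(parent, visiting + (name,)))
        depth[name] = distance
        return distance

    return sorted(index, key=lambda name: (max_distance(name), index[name]))


# "d" appears second in the file but runs last, because it is two steps from "a".
tasks = [
    {"creates": "a", "command": "..."},
    {"creates": "d", "depends": ["a", "c"], "command": "..."},
    {"creates": "b", "depends": "a", "command": "..."},
    {"creates": "c", "depends": ["a", "script.py"], "command": "..."},
]
order = execution_order(tasks)
print(order)  # ['a', 'b', 'c', 'd']
```

Here b and c are both one step from the source task a, so their relative order comes from their position in the configuration, while d waits for the longest chain leading to it.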

The deterministic order example contains a few different YAML configuration files that demonstrate how this works in practice; the highlights are summarized here.

task graph for sibling tasks that all depend on the same parent

Sibling tasks are executed in the order in which they appear in the YAML configuration file, but always after their dependencies have been satisfied. In this example, the task graph looks like this and the tasks are guaranteed to run in alphabetical order.

task graph for parallel task threads

Parallel task threads are executed based on their distance from the source tasks and secondarily based on their ordering in the YAML configuration file. In this example, the task graph looks something like this and the tasks are guaranteed to run in alphabetical order.

task graph for merging task threads

When task threads merge, tasks are executed based on their maximal distance from any source task. In this example, the task graph looks something like this and the tasks are guaranteed to run in alphabetical order.