pydatatask.pipeline module#

A pipeline is just an unordered collection of tasks. Relationships between the tasks are implicit, defined by which repositories they share.

class pydatatask.pipeline.Pipeline(tasks: Iterable[Task], session: Session, resources: Iterable[ResourceManager], priority: Callable[[str, str], int] | None = None)[source]#

The pipeline class.

Parameters:

tasks – The tasks which make up this pipeline.
session – The session to open while this pipeline is active.
resources – Any resource managers in use. You need to provide these so the pipeline can reset the rate-limiting at each update.
priority – Optional: A function which takes a task and job name and returns an integer priority. No jobs will be scheduled unless all higher-priority jobs (larger numbers) have already been scheduled.

settings(synchronous=False, metadata=True)[source]#

This method can be called to set properties of the current run.

Parameters:

synchronous – Whether jobs will be started and completed in-process, waiting for their completion before a launch phase succeeds.
metadata – Whether jobs will store their completion metadata.

async open()[source]#: Opens the pipeline and its associated resource session. This will be automatically when entering an async with pipeline: block.

async close()[source]#: Closes the pipeline and its associated resource session. This will be automatically called when exiting an async with pipeline: block.

async update() → bool[source]#

Perform one round of pipeline maintenance, running the update phase and then the launch phase. The pipeline must be opened for this function to run.

async update_only_update() → bool[source]#

Perform one round of the update phase of pipeline maintenance. The pipeline must be opened for this function to run.

async update_only_launch() → bool[source]#

Perform one round of the launch phase of pipeline maintenance. The pipeline must be opened for this function to run.

async gather_ready_jobs(task: Task) → Set[str][source]#: Collect all jobs that are ready to be launched for a given task.

graph() → DiGraph[source]#: Generate the dependency graph for a pipeline. This is a directed graph containing both repositories and tasks as nodes.

Iterate the list of repositories that are dependent on the given node of the dependency graph.

Parameters:

node – The starting point of the graph traversal.
recursive – Whether to return only direct dependencies (False) or transitive dependencies (True) of node.