Partitions vs config

In this example, we'll explore two approaches to parameterizing Dagster pipelines. When you need to process data for different segments (like customers, regions, or dates), you can choose between Dagster's partitions and run configuration. Each approach has distinct trade-offs in tracking, observability, and workflow.

Problem: Processing data for multiple customers

Imagine you need to process data for multiple customers, where each customer's data should be processed independently. You want to be able to run the pipeline for specific customers and potentially reprocess historical data when needed.

The key question is: Should you use partitions to create a segment for each customer, or should you use config to pass the customer ID as a parameter?

Solution 1: Using partitions

Partitions divide your data into discrete segments. Each customer becomes a partition, giving you full visibility into which customers have been processed and the ability to backfill specific customers.

src/project_mini/defs/partitions_vs_config/with_partitions.py
import dagster as dg

customer_partitions = dg.StaticPartitionsDefinition(["customer_a", "customer_b", "customer_c"])


@dg.asset(partitions_def=customer_partitions)
def customer_orders(context: dg.AssetExecutionContext):
"""Fetch and process orders for a specific customer partition."""
customer_id = context.partition_key
context.log.info(f"Fetching orders for customer: {customer_id}")

orders = [
{"order_id": f"{customer_id}-001", "amount": 150.00},
{"order_id": f"{customer_id}-002", "amount": 275.50},
{"order_id": f"{customer_id}-003", "amount": 89.99},
]

context.log.info(f"Processed {len(orders)} orders for {customer_id}")
return orders


@dg.asset(partitions_def=customer_partitions, deps=[customer_orders])
def customer_summary(context: dg.AssetExecutionContext):
"""Generate a summary report for a specific customer partition."""
customer_id = context.partition_key
context.log.info(f"Generating summary for customer: {customer_id}")

summary = {
"customer_id": customer_id,
"total_orders": 3,
"total_revenue": 515.49,
}

context.log.info(f"Summary for {customer_id}: {summary}")
return summary


@dg.schedule(
    cron_schedule="0 1 * * *",
    job=dg.define_asset_job(
        "all_customers_job",
        selection=[customer_orders, customer_summary],
        partitions_def=customer_partitions,
    ),
)
def daily_customer_schedule():
    """Trigger processing for all customer partitions."""
    for partition_key in customer_partitions.get_partition_keys():
        yield dg.RunRequest(partition_key=partition_key)

Partitions approach

  • Materialization tracking: per-customer history visible in the UI
  • Backfilling: built-in support for reprocessing specific customers
  • Scheduling: native support for processing all partitions
  • UI experience: partition status bar shows processing state
  • Setup complexity: requires defining the partition set upfront
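
To materialize a single partition outside the UI (for example, in a test), you can pass a partition key to Dagster's materialize helper. A minimal sketch, assuming the with_partitions.py module above; the script itself is not part of the project files:

import dagster as dg

from project_mini.defs.partitions_vs_config.with_partitions import (
    customer_orders,
    customer_summary,
)

# Materialize both assets for a single customer partition.
result = dg.materialize(
    [customer_orders, customer_summary],
    partition_key="customer_a",
)
assert result.success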

Solution 2: Using config

Run configuration passes the customer ID as a parameter at runtime. This approach is simpler to set up but doesn't track which customers have been processed.

src/project_mini/defs/partitions_vs_config/with_config.py
import dagster as dg


class CustomerConfig(dg.Config):
    customer_id: str


@dg.asset
def customer_orders(context: dg.AssetExecutionContext, config: CustomerConfig):
"""Fetch and process orders for a specific customer."""
customer_id = config.customer_id
context.log.info(f"Fetching orders for customer: {customer_id}")

orders = [
{"order_id": f"{customer_id}-001", "amount": 150.00},
{"order_id": f"{customer_id}-002", "amount": 275.50},
{"order_id": f"{customer_id}-003", "amount": 89.99},
]

context.log.info(f"Processed {len(orders)} orders for {customer_id}")
return orders


@dg.asset(deps=[customer_orders])
def customer_summary(context: dg.AssetExecutionContext, config: CustomerConfig):
"""Generate a summary report for a specific customer."""
customer_id = config.customer_id
context.log.info(f"Generating summary for customer: {customer_id}")

summary = {
"customer_id": customer_id,
"total_orders": 3,
"total_revenue": 515.49,
}

context.log.info(f"Summary for {customer_id}: {summary}")
return summary

Config approach

  • Materialization tracking: single asset history (not per-customer)
  • Backfilling: manual re-runs required
  • Scheduling: requires custom logic to iterate over customers
  • UI experience: specify the customer in the Launchpad before each run
  • Setup complexity: simple config class, no partition management
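
In the Launchpad, a run for one customer is configured through run config; assets are configured under the ops key by asset name. A sketch of the YAML, with customer_x as an illustrative value:

ops:
  customer_orders:
    config:
      customer_id: customer_x
  customer_summary:
    config:
      customer_id: customer_x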

When to use each approach

The choice between partitions and config depends on your specific requirements:

Use partitions when:

  • Your data naturally segments into discrete categories
  • You need to track materialization status per segment
  • Backfilling specific segments is a common operation
  • You want to schedule processing for all segments automatically
  • You need visibility into which segments are up-to-date vs stale

Use config when:

  • Processing is infrequent or ad-hoc
  • Parameters are dynamic or come from an unbounded set
  • You don't need per-parameter tracking
  • A single materialization history is sufficient
  • You want simple parameterization without partition overhead (see the launch sketch after this list)
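
For ad-hoc runs from a script, the same assets can be materialized with a Pythonic run config. A minimal sketch, assuming the with_config.py module above; customer_x is an illustrative ID:

import dagster as dg

from project_mini.defs.partitions_vs_config.with_config import (
    CustomerConfig,
    customer_orders,
    customer_summary,
)

# Both assets take the same config class, so each receives the customer ID.
result = dg.materialize(
    [customer_orders, customer_summary],
    run_config=dg.RunConfig(
        ops={
            "customer_orders": CustomerConfig(customer_id="customer_x"),
            "customer_summary": CustomerConfig(customer_id="customer_x"),
        }
    ),
)
assert result.success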

Hybrid approach

You can also combine both approaches: use partitions for the primary segmentation (e.g., by customer) and config for additional runtime parameters (e.g., processing options).

src/project_mini/defs/partitions_vs_config/with_partitions_and_config.py
import dagster as dg

customer_partitions = dg.StaticPartitionsDefinition(["customer_a", "customer_b", "customer_c"])


class ProcessingConfig(dg.Config):
    include_archived: bool = False
    limit: int = 1000


@dg.asset(partitions_def=customer_partitions)
def customer_data(context: dg.AssetExecutionContext, config: ProcessingConfig):
    """Process a customer partition, honoring runtime options from config."""
    customer_id = context.partition_key
    context.log.info(f"Processing {customer_id} with include_archived={config.include_archived}")
    return {"customer_id": customer_id, "limit": config.limit}

This gives you the benefits of partition tracking while maintaining flexibility for runtime parameters.
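
Launching the hybrid asset follows the same pattern, combining a partition key with runtime config. A minimal sketch; the option values are illustrative:

import dagster as dg

from project_mini.defs.partitions_vs_config.with_partitions_and_config import (
    ProcessingConfig,
    customer_data,
)

# One customer partition, with non-default runtime options.
result = dg.materialize(
    [customer_data],
    partition_key="customer_b",
    run_config=dg.RunConfig(
        ops={"customer_data": ProcessingConfig(include_archived=True, limit=500)}
    ),
)
assert result.success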