Edilitics | Data to Decisions

Hybrid Pipelines

Combine no-code operations and Python Polars scripts in a single pipeline. Each step passes its output to the next regardless of type.

A hybrid pipeline lets you mix point-and-click operations with Python scripts in a single workflow. Each step feeds into the next automatically. Use the no-code operations for standard tasks like filtering, joining, and cleaning. Add a script step when the logic is too custom for a point-and-click operation.

Everything runs in the same pipeline - same preview, same run history, same schedule.


When to Use a Hybrid Pipeline

Add a script step to your pipeline when you need to:

  • Read values out of a JSON or free-text column
  • Calculate a score that combines several columns with custom weights
  • Find the first row where a running total crosses a threshold
  • Group by more than one column at the same time

Everything else - stacking tables, filtering rows, joining datasets, cleaning nulls, renaming columns - is covered by the point-and-click operations.


Sample Datasets

This tutorial uses three sample files. Download all three before you begin.

edilitics_sample_orders.csv

H1 orders - 500 rows, 14 columns including metadata JSON · 500 rows

Download

edilitics_sample_orders_h2.csv

H2 orders - 250 rows, 16 columns (includes return_flag and delivery_partner) · 250 rows

Download

edilitics_sample_products.csv

Product catalog - 4 rows, join on product_id to add pricing and support tier · 4 rows

Download

What You Start With

Three separate raw files with several data quality problems to fix before analysis:

FileRowsIssues
edilitics_sample_orders.csv500H1 orders. Nulls in discount_pct, product_tier, and notes. Contains 10 orders referencing a product that does not exist in the catalog.
edilitics_sample_orders_h2.csv250H2 orders. Same core columns as H1, plus two extra columns (return_flag, delivery_partner) that H1 does not have.
edilitics_sample_products.csv4Product catalog. No issues.

What You End With

A single clean summary table with one row per sales rep per region - ready for reporting or dashboard use.

ColumnDescription
sales_repSales representative name
regionRegion
total_revenueTotal revenue across all valid orders
order_countNumber of orders
avg_deal_scoreAverage deal quality score
sla_breach_countNumber of orders that took more than 30 days to deliver
avg_delivery_daysAverage days from order to delivery
milestone_ordersNumber of orders where the rep first crossed 100k in cumulative revenue

No nulls. No raw JSON. No unmatched products. No schema mismatches. Around 40 rows.


Worked Example: Full-Year Sales Performance Pipeline

Goal: combine two order exports and a product catalog into a per-rep, per-region performance summary - cleaning data quality issues, enriching with product details, scoring deal quality, and flagging SLA breaches.

10 steps. 4 code steps. 6 no-code steps.

Start your pipeline with edilitics_sample_orders.csv as the source table.

Concat (no-code)

Stack the H2 order export onto the H1 source data to build a full-year dataset.

Operation: Concat

Concat type: Vertical

Database: select the integration where you uploaded the sample files

Table: edilitics_sample_orders_h2

Manage duplicates: Keep All

When you select the H2 table, Edilitics checks whether the two tables share the same columns. Because H2 has two extra columns (return_flag and delivery_partner) that H1 does not, the status will show Concatenation not allowed and open a conflict modal listing those columns under Extra Columns. Click Drop Columns to continue. This removes the extra columns from the H2 rows so both tables match on the same 14-column schema before stacking.

Result: 750 rows, 14 columns

Filter (no-code)

Remove orders that reference a product not in the catalog. The H1 data contains 10 orders with product_id set to PROD-999, which does not exist in the products table. Keeping these rows would result in empty product columns after the join in Step 4, making them useless for analysis. Remove them now.

Operation: Filter

Filter Column: product_id

Condition: Not Equal to

Target Value(s): PROD-999

Result: 740 rows - 10 orders removed

Null Values Handling (no-code)

Drop rows where product_tier is null. Orders without a product tier cannot be grouped or segmented accurately downstream.

Operation: Null Values Handling

Column: product_tier

Impute/Drop: Drop Null values

Result: 593 rows - 147 incomplete records removed

Joins (no-code)

Enrich each order with product details from the catalog. After Step 2, every remaining order has a valid product_id, so the join will fully populate the new columns with no unmatched rows.

Operation: Joins

Join type: Left Join

Left Key: product_id (orders)

Right Key: product_id (products)

Added columns: product_name, base_price_usd, support_level, max_users, storage_gb

Result: 593 rows, 5 new columns added

Code Editor: Extract lead source and campaign from metadata

The metadata column stores a small JSON object on every row, for example {"source": "web", "campaign": "q1_inbound"}. The point-and-click operations cannot read values out of a JSON string. This script pulls source and campaign into their own plain-text columns so they can be used later in grouping and reporting.

Operation: Code Editor

df = df.with_columns([
    pl.col("metadata").map_elements(
        lambda x: re.search(r'"source":\s*"([^"]+)"', x).group(1)
        if x and re.search(r'"source":\s*"([^"]+)"', x) else None,
        return_dtype=pl.String
    ).alias("lead_source"),
    pl.col("metadata").map_elements(
        lambda x: re.search(r'"campaign":\s*"([^"]+)"', x).group(1)
        if x and re.search(r'"campaign":\s*"([^"]+)"', x) else None,
        return_dtype=pl.String
    ).alias("campaign"),
])

Added columns: lead_source, campaign

Why code: extracting values from a JSON string is not available as a point-and-click operation

Date Time Delta (no-code)

Calculate how many days passed between the order date and the delivery date for each order.

Operation: Date Time Delta

Start column: order_date

End column: delivery_date

Unit: Days

Result column name: days_to_delivery

Result: new column added with the delivery duration for every order

Conditional Column (no-code)

Flag whether each order met or missed the 30-day SLA. Any order that took more than 30 days to deliver is marked as Breached.

Operation: Conditional Column

New column name: sla_status

Rule: IF days_to_delivery > 30 THEN Breached

Else value: On Time

Result: 268 orders marked Breached, 325 marked On Time

Code Editor: Deal score

Give each order a score based on three things: how strong its revenue was relative to other orders in the same region, how lean the discount was, and how fast it was delivered. The final score blends all three using fixed weights.

To build this score, the script first calculates the average and spread of revenue within each region, uses those to position each order, then combines the result with the discount and delivery figures. The working values are dropped at the end - only deal_score remains.

Operation: Code Editor

df = df.with_columns([
    pl.col("revenue").mean().over("region").alias("region_rev_mean"),
    pl.col("revenue").std().over("region").alias("region_rev_std"),
])

df = df.with_columns(
    ((pl.col("revenue") - pl.col("region_rev_mean")) / pl.col("region_rev_std"))
    .alias("revenue_zscore")
)

df = df.with_columns(
    (
        pl.col("revenue_zscore") * 0.5
        + (1 - pl.col("discount_pct").fill_null(0.0)) * 0.3
        + (1 / (pl.col("days_to_delivery") + 1)) * 0.2
    ).alias("deal_score")
)

df = df.drop(["region_rev_mean", "region_rev_std", "revenue_zscore"])

Added column: deal_score

Removed: 3 working columns used only for the calculation

Why code: this score needs several intermediate values that are calculated and used in sequence - not possible with a single point-and-click operation

Drop / Rename Columns (no-code)

Remove columns that are no longer needed and rename one for clarity before the final aggregation.

Operation: Drop / Rename Columns

In the column list, click the delete icon next to each of the following columns to remove them:

  • metadata
  • notes
  • email
  • max_users
  • storage_gb

Then rename base_price_usd to list_price by editing the name field inline.

Click Save & Preview when done.

Result: a leaner dataset with no raw JSON, no personal data, and clear column names

Code Editor: Milestone flag and final aggregation

This step produces the final summary table. It does two things that the point-and-click operations cannot handle:

  1. Milestone flag - sorts each rep's orders by date, builds a running revenue total, and marks the specific order where each rep first crossed 100,000. This requires stepping through rows in order, which is not available as a point-and-click operation.
  2. Group by two columns - summarises the data by both sales_rep and region at the same time. The Group By operation supports one group column only.

Operation: Code Editor

df = df.sort(["sales_rep", "order_date"])

df = df.with_columns(
    pl.col("revenue").cum_sum().over("sales_rep").alias("cumulative_rev_by_rep")
)

df = df.with_columns(
    (
        (pl.col("cumulative_rev_by_rep") >= 100000)
        & (pl.col("cumulative_rev_by_rep").shift(1).over("sales_rep") < 100000)
    ).cast(pl.Int32).alias("milestone_row")
)

df = df.with_columns(
    (pl.col("sla_status") == pl.lit("Breached")).cast(pl.Int32).alias("sla_breach")
)

df = df.group_by(["sales_rep", "region"]).agg([
    pl.col("revenue").sum().alias("total_revenue"),
    pl.col("deal_score").mean().alias("avg_deal_score"),
    pl.col("order_id").count().alias("order_count"),
    pl.col("sla_breach").sum().alias("sla_breach_count"),
    pl.col("days_to_delivery").mean().alias("avg_delivery_days"),
    pl.col("milestone_row").sum().alias("milestone_orders"),
]).sort(["sales_rep", "region"])

Result: approximately 40 rows, 8 columns - one row per sales rep per region, ready for reporting


Pipeline Summary

StepOperationRowsColumnsWhat changed
StartSource load50014-
1Concat75014H2 rows added, extra H2 columns dropped
2Filter7401410 unmatched product orders removed
3Null Values Handling59314147 rows with no product tier removed
4Joins593195 product columns added
5Code: metadata extraction59321lead_source and campaign added
6Date Time Delta59322days_to_delivery added
7Conditional Column59323sla_status added
8Code: deal score59321deal_score added, 3 working columns removed
9Drop / Rename Columns593165 columns removed, 1 renamed
10Code: milestone + group by~408Aggregated to final summary table

Key Principles

  • Point-and-click first. Use the no-code operations for everything they can handle. Add a script step only when a point-and-click operation cannot do the job.
  • Test before saving. Every script step has a Test button. Click it to check the script against the preview data before saving.
  • Script steps are not special. They sit in the pipeline like any other step. You can reorder, remove, or duplicate them freely.
  • Clean up in the same step. If a script creates working columns that are only needed mid-calculation, remove them before the step ends so they do not carry forward.

Frequently Asked Questions


Next Steps

Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.

Last updated on

On this page