Edilitics | Data to Decisions

Group By

Group rows by a column and apply aggregations such as sum, count, mean, median, min, max, variance, and standard deviation to produce grouped summary tables.

Group By collapses your dataset into one row per unique value in a chosen column and computes aggregations across all other columns. The output is a summary table - not the original rows with new columns added. Every operation downstream will work on the grouped result.

You can apply multiple aggregations to the same column (for example, both sum and mean on revenue) by adding extra aggregation rows in one operation.

Group By reshapes the entire dataset. All original rows are replaced by the grouped summary. If you need to keep original rows, add a Conditional Column or Column Aggregation instead.


When to Use Group By

  • Revenue by region. Group by region, sum revenue to get total sales per territory.
  • Order count by product tier. Group by product_tier, count order_id to see volume per segment.
  • Average deal size by sales rep. Group by sales_rep, take the mean of revenue.
  • Distinct customer count per region. Group by region, count distinct customer_name.
  • Spread analysis. Group by product_tier, compute std dev of revenue to understand deal size variability per tier.

Sample Dataset

The examples in this doc use the Edilitics sample orders dataset (edilitics_sample_orders.csv, 500 rows). Download it to follow along in your own workspace.

Relevant columns for the Group By examples: region, product_tier, sales_rep, and customer_name (string), order_id (string identifier), and revenue (numerical).

Aggregations by Column Type

The aggregation options shown depend on the type of the column being aggregated.

Numerical columns:

Aggregation | Label | What it computes
Sum | Sum | Total of all values in the group
Count | Count | Number of non-null rows in the group
Count Distinct | Count Dist | Number of unique non-null values
Minimum | Min | Lowest value in the group
Maximum | Max | Highest value in the group
Mean | Mean | Arithmetic average
Median | Median | Middle value when sorted
Mode | Mode | Most frequently occurring value
Variance (sample) | Var (Sample) | Sample variance (ddof=1)
Variance (population) | Var (Pop) | Population variance (ddof=0)
Std Dev (sample) | Std Dev (Sample) | Sample standard deviation (ddof=1)
Std Dev (population) | Std Dev (Pop) | Population standard deviation (ddof=0)

String and datetime columns:

Aggregation | Label | What it computes
Count | Count | Number of non-null rows in the group
Count Distinct | Count Dist | Number of unique non-null values
Mode | Mode | Most frequently occurring value

Use Sample variance/std dev when your dataset is a sample drawn from a larger population (the typical case for order data). Use Population when your dataset is the complete population.


How to Apply Group By

Open the Group By operation

In your Transform pipeline, click Add Operation and select Group By from the operation list.

If your dataset has columns with Object, List, or dictionary types, a warning appears. Flatten those columns first to avoid data loss.

Select the Group By column

Choose one column from the dropdown to group by. All column types are available - string, datetime, and numerical. Object and list columns are excluded.

In the sample dataset, try region to group orders by territory, or product_tier to group by subscription segment.

Once selected, three column headers appear: Aggregate Column, Aggregation, and Result Column Name.

Review and adjust aggregations

All remaining columns appear automatically with default aggregations:

  • Numerical columns default to Sum
  • String and datetime columns default to Count

The Result Column Name is auto-generated (sum_revenue, count_order_id, and so on). Edit any name to something more meaningful.

Change the aggregation for any column using its Aggregation dropdown. Only aggregations valid for that column's type appear.

Remove a column from the output using its remove icon.

Add extra aggregations (optional)

Click Add Column to add a blank aggregation row. Use this to apply a second aggregation to any column - for example, mean_revenue alongside sum_revenue.

Each added row needs: an Aggregate Column, an Aggregation, and a unique Result Column Name.

The same aggregation cannot be applied to the same column twice - already-used aggregations are hidden from the dropdown for that column.

Click Save & Preview

Click Save & Preview. Edilitics applies the group by and aggregations. The success toast confirms: "Group By applied. Preview now reflects grouped and aggregated output."

The preview now shows the grouped summary table - one row per unique value of the group-by column.

Verify in the preview

Check row count - it should equal the number of distinct values in the group-by column. Check aggregated values for sense. Nulls in aggregated columns indicate the group had no non-null values for that column.


Before and After

Group By column: region. Aggregations: sum of revenue, count of order_id, mean of revenue.

Input (500 rows):

order_id | region | revenue | product_tier
ORD-2024-0001 | North America | 12450.00 | Enterprise
ORD-2024-0047 | Europe | 8320.50 | Growth
ORD-2024-0099 | North America | 83841.00 | Enterprise
ORD-2024-0312 | Latin America | 3750.75 | Starter
... | ... | ... | ...

Output (5 rows - one per region):

region | sum_revenue | count_order_id | mean_revenue
North America | 621430.50 | 98 | 6341.13
Europe | 584210.75 | 94 | 6215.01
Asia Pacific | 512890.25 | 88 | 5828.30
Latin America | 398740.00 | 76 | 5246.58
Middle East & Africa | 312650.00 | 64 | 4885.16

500 rows collapsed to 5. All original columns except region are replaced by the aggregated columns.


Code Equivalent

SQL:

-- Compatible with PostgreSQL, BigQuery, Snowflake, Redshift, DuckDB
SELECT
  region,
  SUM(revenue)          AS sum_revenue,
  COUNT(order_id)       AS count_order_id,
  AVG(revenue)          AS mean_revenue
FROM orders
GROUP BY region;

Polars (Python):

import polars as pl

df = df.group_by("region").agg([
    pl.col("revenue").sum().alias("sum_revenue"),
    pl.col("order_id").count().alias("count_order_id"),
    pl.col("revenue").mean().alias("mean_revenue"),
])

After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.



Next Steps

Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.
