Group By
Group rows by a column and apply aggregations like sum, count, mean, median, min, max, variance, and standard deviation to output grouped summary tables.
Group By collapses your dataset into one row per unique value in a chosen column and computes aggregations across all other columns. The output is a summary table - not the original rows with new columns added. Every operation downstream will work on the grouped result.
You can apply multiple aggregations to the same column (for example, both sum and mean on revenue) by adding extra aggregation rows in one operation.
Group By reshapes the entire dataset. All original rows are replaced by the grouped summary. If you need to keep original rows, add a Conditional Column or Column Aggregation instead.
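The collapse can be sketched in plain Python. This is a minimal illustration of the semantics, not the Edilitics implementation; the rows and values are hypothetical, modeled on the sample dataset:

```python
from collections import defaultdict

# Hypothetical input rows modeled on the sample orders dataset
rows = [
    {"region": "North America", "revenue": 12450.00},
    {"region": "Europe", "revenue": 8320.50},
    {"region": "North America", "revenue": 83841.00},
]

# Bucket revenue values by the group-by column
groups = defaultdict(list)
for row in rows:
    groups[row["region"]].append(row["revenue"])

# One output row per unique region; the original rows are gone
summary = {
    region: {"sum_revenue": sum(vals), "mean_revenue": sum(vals) / len(vals)}
    for region, vals in groups.items()
}
```

Three input rows collapse to two summary rows here, one per distinct region, which is exactly what downstream operations will see.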
When to Use Group By
- **Revenue by region.** Group by `region`, sum `revenue` to get total sales per territory.
- **Order count by product tier.** Group by `product_tier`, count `order_id` to see volume per segment.
- **Average deal size by sales rep.** Group by `sales_rep`, take the mean of `revenue`.
- **Distinct customer count per region.** Group by `region`, count distinct `customer_name`.
- **Spread analysis.** Group by `product_tier`, compute the std dev of `revenue` to understand deal size variability per tier.
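The distinct-count and spread cases above can be sketched in stdlib Python (hypothetical deals and values; `statistics.stdev` computes the sample standard deviation, matching the default sample variant):

```python
import statistics
from collections import defaultdict

# Hypothetical deals; column names mirror the examples above
deals = [
    {"product_tier": "Enterprise", "customer_name": "Acme", "revenue": 100.0},
    {"product_tier": "Enterprise", "customer_name": "Acme", "revenue": 300.0},
    {"product_tier": "Starter", "customer_name": "Beta", "revenue": 50.0},
    {"product_tier": "Starter", "customer_name": "Gamma", "revenue": 70.0},
]

by_tier = defaultdict(list)
for d in deals:
    by_tier[d["product_tier"]].append(d)

result = {
    tier: {
        # Count Distinct: unique customers, not total rows
        "count_distinct_customer": len({d["customer_name"] for d in ds}),
        # Std Dev (sample): deal-size variability within the tier
        "stddev_revenue": statistics.stdev([d["revenue"] for d in ds]),
    }
    for tier, ds in by_tier.items()
}
```

Note how Enterprise has two rows but only one distinct customer, which is why Count and Count Distinct can disagree.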
Sample Dataset
The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.
edilitics_sample_orders.csv
Sample orders dataset for hands-on practice · 500 rows
Relevant columns for Group By examples:

| Column | Type |
|---|---|
| region | String |
| product_tier | String |
| sales_rep | String |
| customer_name | String |
| order_id | String |
| revenue | Numerical |
Aggregations by Column Type
The aggregation options shown depend on the type of the column being aggregated.
Numerical columns:
| Aggregation | Label | What it computes |
|---|---|---|
| Sum | Sum | Total of all values in the group |
| Count | Count | Number of non-null rows in the group |
| Count Distinct | Count Dist | Number of unique non-null values |
| Minimum | Min | Lowest value in the group |
| Maximum | Max | Highest value in the group |
| Mean | Mean | Arithmetic average |
| Median | Median | Middle value when sorted |
| Mode | Mode | Most frequently occurring value |
| Variance (sample) | Var (Sample) | Sample variance (ddof=1) |
| Variance (population) | Var (Pop) | Population variance (ddof=0) |
| Std Dev (sample) | Std Dev (Sample) | Sample standard deviation (ddof=1) |
| Std Dev (population) | Std Dev (Pop) | Population standard deviation (ddof=0) |
String and datetime columns:
| Aggregation | Label | What it computes |
|---|---|---|
| Count | Count | Number of non-null rows in the group |
| Count Distinct | Count Dist | Number of unique non-null values |
| Mode | Mode | Most frequently occurring value |
Use Sample variance/std dev when your dataset is a sample drawn from a larger population (the typical case for order data). Use Population when your dataset is the complete population.
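Python's `statistics` module exposes both variants, which makes the ddof difference easy to see (illustrative values only):

```python
import statistics

revenues = [12450.00, 8320.50, 3750.75]

# Sample (ddof=1): divides by n-1 — the usual choice for order data
var_sample = statistics.variance(revenues)
sd_sample = statistics.stdev(revenues)

# Population (ddof=0): divides by n — only when the data is the full population
var_pop = statistics.pvariance(revenues)
sd_pop = statistics.pstdev(revenues)
```

For n values, sample variance equals population variance scaled by n/(n-1), so it is always the larger of the two for n > 1.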
How to Apply Group By
Open the Group By operation
In your Transform pipeline, click Add Operation and select Group By from the operation list.
If your dataset has columns with Object, List, or dictionary types, a warning appears. Flatten those columns first to avoid data loss.
Select the Group By column
Choose one column from the dropdown to group by. All column types are available - string, datetime, and numerical. Object and list columns are excluded.
In the sample dataset, try region to group orders by territory, or product_tier to group by subscription segment.
Once selected, three column headers appear: Aggregate Column, Aggregation, and Result Column Name.
Review and adjust aggregations
All remaining columns appear automatically with default aggregations:
- Numerical columns default to Sum
- String and datetime columns default to Count
The Result Column Name is auto-generated (for example, sum_revenue or count_order_id). Edit any name to something more meaningful.
Change the aggregation for any column using its Aggregation dropdown. Only aggregations valid for that column's type appear.
Remove a column from the output using its remove icon.
Add extra aggregations (optional)
Click Add Column to add a blank aggregation row. Use this to apply a second aggregation to any column - for example, mean_revenue alongside sum_revenue.
Each added row needs: an Aggregate Column, an Aggregation, and a unique Result Column Name.
The same aggregation cannot be applied to the same column twice - already-used aggregations are hidden from the dropdown for that column.
Click Save & Preview
Click Save & Preview. Edilitics applies the group by and aggregations. The success toast confirms: "Group By applied. Preview now reflects grouped and aggregated output."
The preview now shows the grouped summary table - one row per unique value of the group-by column.
Verify in the preview
Check the row count - it should equal the number of distinct values in the group-by column. Spot-check aggregated values for plausibility. Nulls in an aggregated column mean the group had no non-null values for that column.
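Both checks above follow from simple invariants, sketched here in plain Python with hypothetical values:

```python
# Invariant 1: grouped row count equals the number of distinct group-by values
regions = ["North America", "Europe", "North America", "Asia Pacific"]
expected_rows = len(set(regions))  # the Group By output should have this many rows

# Invariant 2: Count skips nulls — a column that is entirely null within a
# group aggregates to null (here, 2 of 4 revenue values are non-null)
revenues = [100.0, None, 250.0, None]
count_revenue = len([v for v in revenues if v is not None])
```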
Before and After
Group By column: region. Aggregations: sum of revenue, count of order_id, mean of revenue.
Input (500 rows):
| order_id | region | revenue | product_tier |
|---|---|---|---|
| ORD-2024-0001 | North America | 12450.00 | Enterprise |
| ORD-2024-0047 | Europe | 8320.50 | Growth |
| ORD-2024-0099 | North America | 83841.00 | Enterprise |
| ORD-2024-0312 | Latin America | 3750.75 | Starter |
| ... | ... | ... | ... |
Output (5 rows - one per region):
| region | sum_revenue | count_order_id | mean_revenue |
|---|---|---|---|
| North America | 621430.50 | 98 | 6341.13 |
| Europe | 584210.75 | 94 | 6215.01 |
| Asia Pacific | 512890.25 | 88 | 5828.30 |
| Latin America | 398740.00 | 76 | 5246.58 |
| Middle East & Africa | 312650.00 | 64 | 4885.16 |
500 rows collapse to 5. All original columns except region are replaced by the aggregated columns.
Code Equivalent
```sql
-- Compatible with PostgreSQL, BigQuery, Snowflake, Redshift, DuckDB
SELECT
  region,
  SUM(revenue) AS sum_revenue,
  COUNT(order_id) AS count_order_id,
  AVG(revenue) AS mean_revenue
FROM orders
GROUP BY region;
```

```python
import polars as pl

df = df.group_by("region").agg([
    pl.col("revenue").sum().alias("sum_revenue"),
    pl.col("order_id").count().alias("count_order_id"),
    pl.col("revenue").mean().alias("mean_revenue"),
])
```
After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.
Operation Reference
Frequently Asked Questions
Next Steps
Filter
Filter the grouped summary to keep only rows matching a condition.
Sort / Order By
Sort the grouped output by an aggregated column such as sum_revenue.
Window Functions
Add group-level aggregations as new columns without collapsing the original rows.
Pivot / Unpivot
Reshape the grouped output by turning row values into column headers.
Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.