Drop Duplicate Rows
Remove duplicate rows based on a column's values. Choose to keep the first occurrence, the last, or drop all rows where a duplicate exists. No code needed.
Drop Duplicate Rows removes rows where a specified column contains repeated values. For each deduplication rule, you choose which occurrence to keep: the first, the last, or none at all. You can add multiple rules in one operation, each targeting a different column.
All column types are supported as deduplication keys.
When to Use Drop Duplicate Rows
- Enforcing unique keys. Customer IDs, order numbers, or transaction codes should appear only once. Drop duplicates on those columns before any aggregation or join.
- Cleaning merged or appended datasets. When two datasets are combined, the same record can appear twice. Deduplicate on a unique identifier to collapse them to one.
- Keeping the latest record per entity. Sort by a timestamp column first, then deduplicate with Keep Last to retain the most recent entry per customer or product.
- Keeping the earliest record per entity. Deduplicate with Keep First to retain the original registration or first order per customer.
- Removing all ambiguous rows. Use Keep None when you want to discard any row where a duplicate exists, not just remove the extras.
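The three rules map onto familiar dataframe semantics. As a rough sketch (using pandas as a stand-in, with column names from the sample dataset; Edilitics itself requires no code):

```python
import pandas as pd

# Toy orders frame with a duplicated customer_id (C001)
df = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002"],
    "order_date": ["2024-01-10", "2024-03-22", "2024-02-14"],
    "revenue": [500, 800, 300],
})

# Keep First: retain the earliest occurrence of each key
first = df.drop_duplicates(subset=["customer_id"], keep="first")

# Keep Last: retain the latest occurrence of each key
last = df.drop_duplicates(subset=["customer_id"], keep="last")

# Keep None: drop every row whose key appears more than once
none = df[~df.duplicated(subset=["customer_id"], keep=False)]
```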
Sample Dataset
The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.
edilitics_sample_orders.csv
Sample orders dataset for hands-on practice · 500 rows
Relevant columns for Drop Duplicate Rows examples:

| Column | Type |
|---|---|
| customer_id | Text |
| order_date | Date |
| revenue | Numeric |
The Three Handling Rules
Each deduplication rule requires you to choose what to do when duplicates are found in the key column.
Keep First
Keeps the first occurrence of each value and removes all subsequent duplicates.
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
After deduplicating customer_id with Keep First:
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C002 | 2024-02-14 | 300 |
The first C001 row (January) is kept. The March row is removed.
Best for: retaining the original or earliest record per entity.
Keep Last
Keeps the last occurrence of each value and removes all earlier duplicates.
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
After deduplicating customer_id with Keep Last:
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
The last C001 row (March) is kept. The January row is removed.
Best for: retaining the most recent record per entity. Sort by a date column ascending before running this operation to ensure "last" means "most recent."
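The sort-then-dedupe pattern can be sketched as follows (a pandas stand-in, not the Edilitics implementation; ISO-formatted date strings sort correctly even as text):

```python
import pandas as pd

# Rows arrive out of date order
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C001"],
    "order_date": ["2024-03-22", "2024-02-14", "2024-01-10"],
    "revenue": [800, 300, 500],
})

# Sort ascending by date first so "last" means "most recent",
# then keep the last occurrence per customer
latest = (
    df.sort_values("order_date")
      .drop_duplicates(subset=["customer_id"], keep="last")
)
```

Without the sort, "last" would simply mean "last in arrival order," which here would wrongly keep the January C001 row.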
Keep None
Removes all rows where the key column value appears more than once. Both the original and the duplicate are dropped.
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
After deduplicating customer_id with Keep None:
| customer_id | order_date | revenue |
|---|---|---|
| C002 | 2024-02-14 | 300 |
Both C001 rows are removed because customer_id C001 appeared more than once. C002 is kept because it appeared exactly once.
Best for: strict quality pipelines where any duplicated key is considered invalid and should be excluded entirely.
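One way to reason about Keep None: count occurrences per key and retain only keys seen exactly once. A minimal sketch (pandas stand-in, hypothetical variable names):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002"],
    "revenue": [500, 800, 300],
})

# Count occurrences per key, then keep only keys seen exactly once
counts = df["customer_id"].value_counts()
unique_keys = counts[counts == 1].index
strict = df[df["customer_id"].isin(unique_keys)]
```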
How to Apply Drop Duplicate Rows
Open the Drop Duplicate Rows operation
In your Transform pipeline, click Add Operation and select Drop Duplicate Rows from the operation list.
Select a deduplication key column
Choose the column whose values will be checked for duplicates. This is the Deduplication Key. All column types are available.
A column selected in one row is removed from the dropdown in all other rows so the same key cannot be used twice.
Choose a handling rule
Select Keep First, Keep Last, or Keep None from the Duplicate Handling Rule dropdown. This dropdown is disabled until a key column is selected.
- Keep First: keeps the earliest occurrence, removes the rest.
- Keep Last: keeps the most recent occurrence, removes the rest. Sort by a date column before this step if row order matters.
- Keep None: removes all rows where the key value appears more than once.
Add more rules (optional)
Click Add Column to add another deduplication rule targeting a different column. Each rule is applied independently and sequentially. The Add Column button is disabled until the current rule has both a column and a handling rule selected.
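Applied independently and sequentially, multiple rules behave like chained deduplications, each on its own key. A hedged sketch (pandas stand-in; `order_id` is an illustrative column, not from the sample dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002"],
    "order_id": ["O1", "O2", "O2"],
    "revenue": [500, 800, 300],
})

# Rule 1: dedupe on customer_id, keeping the first occurrence
df = df.drop_duplicates(subset=["customer_id"], keep="first")

# Rule 2: dedupe on order_id, keeping the last occurrence
# (runs on the output of Rule 1, not the original data)
df = df.drop_duplicates(subset=["order_id"], keep="last")
```

Note that rule order matters: a row removed by an earlier rule is never seen by a later one.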
Remove a rule (optional)
Click the remove icon on any row to delete it. The remove icon only appears when more than one rule row exists.
Save the operation
Click Save & Preview. Edilitics applies each rule in order. A success toast confirms: "Duplicate rows removed. Preview updated with the de-duplicated result."
Verify in the preview
Check the row count and spot-check values in the key column to confirm duplicates were handled as expected.
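A quick way to express the check you are doing by eye: after deduplication with Keep First or Keep Last, no key value should repeat. A sketch of that invariant (pandas stand-in):

```python
import pandas as pd

# Stand-in for the deduplicated preview
df = pd.DataFrame({"customer_id": ["C001", "C002", "C003"]})

# After deduplication, no key value should repeat
remaining = df.duplicated(subset=["customer_id"]).sum()
assert remaining == 0
```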
Before and After
Dataset with duplicate customer_id values:
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
| C003 | 2024-01-05 | 150 |
| C003 | 2024-04-01 | 420 |
After deduplicating customer_id with Keep Last (dataset pre-sorted ascending by order_date):
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
| C003 | 2024-04-01 | 420 |
5 rows reduced to 3. Most recent order per customer retained.
Code Equivalent
```sql
-- Keep Last: use ROW_NUMBER to rank by date, keep rank = 1 after sorting desc
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
  FROM orders
)
SELECT * EXCLUDE rn  -- EXCLUDE is DuckDB/Snowflake syntax; list columns explicitly in other dialects
FROM ranked
WHERE rn = 1;
```

```python
import polars as pl

# Keep Last occurrence per customer_id
df = df.unique(subset=["customer_id"], keep="last")

# Keep First
df = df.unique(subset=["customer_id"], keep="first")

# Keep None (drop all rows where customer_id appears more than once)
df = df.unique(subset=["customer_id"], keep="none")
```
After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.
Operation Reference

| Prop | Type |
|---|---|
| Deduplication Key | Column selector. Any column type can be used as the key. |
| Duplicate Handling Rule | Keep First, Keep Last, or Keep None. |
Next Steps
Sort / Order By
Sort by a date column before deduplicating to control which row is considered first or last.
Manage Nulls
Handle null values in key columns before running deduplication.
Filter
Filter rows after deduplication to narrow down to specific segments.
Group By
Aggregate the deduplicated dataset by category or time period.
Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.