Drop Duplicate Rows
Remove duplicate rows based on a column's values. Choose to keep the first occurrence, the last, or drop all rows where a duplicate exists. No code needed.
Drop Duplicate Rows removes rows where a specified column contains repeated values. For each deduplication rule, you choose which occurrence to keep: the first, the last, or none at all. You can add multiple rules in one operation, each targeting a different column.
All column types are supported as deduplication keys.
When to Use Drop Duplicate Rows
- Enforcing unique keys. Customer IDs, order numbers, or transaction codes should appear only once. Drop duplicates on those columns before any aggregation or join.
- Cleaning merged or appended datasets. When two datasets are combined, the same record can appear twice. Deduplicate on a unique identifier to collapse them to one.
- Keeping the latest record per entity. Sort by a timestamp column first, then deduplicate with Keep Last to retain the most recent entry per customer or product.
- Keeping the earliest record per entity. Deduplicate with Keep First to retain the original registration or first order per customer.
- Removing all ambiguous rows. Use Keep None when you want to discard any row where a duplicate exists, not just remove the extras.
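The three rules map onto familiar dataframe semantics. As a rough sketch (using pandas as a stand-in, with column names from the sample dataset; Edilitics itself requires no code):

```python
import pandas as pd

# Toy orders frame with a duplicated customer_id (C001)
df = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002"],
    "order_date": ["2024-01-10", "2024-03-22", "2024-02-14"],
    "revenue": [500, 800, 300],
})

# Keep First: retain the earliest occurrence of each key
first = df.drop_duplicates(subset=["customer_id"], keep="first")

# Keep Last: retain the latest occurrence of each key
last = df.drop_duplicates(subset=["customer_id"], keep="last")

# Keep None: drop every row whose key appears more than once
none = df[~df.duplicated(subset=["customer_id"], keep=False)]
```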
Sample Dataset
The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.
edilitics_sample_orders.csv
Sample orders dataset for hands-on practice · 500 rows
Relevant columns for Drop Duplicate Rows examples:

| Column | Type |
|---|---|
| customer_id | Text |
| order_date | Date |
| revenue | Numeric |
The Three Handling Rules
Each deduplication rule requires you to choose what to do when duplicates are found in the key column.
Keep First
Keeps the first occurrence of each value and removes all subsequent duplicates.
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
After deduplicating customer_id with Keep First:
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C002 | 2024-02-14 | 300 |
The first C001 row (January) is kept. The March row is removed.
Best for: retaining the original or earliest record per entity.
Keep Last
Keeps the last occurrence of each value and removes all earlier duplicates.
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
After deduplicating customer_id with Keep Last:
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
The last C001 row (March) is kept. The January row is removed.
Best for: retaining the most recent record per entity. Sort by a date column ascending before running this operation to ensure "last" means "most recent."
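The sort-then-dedupe pattern can be sketched as follows (a pandas stand-in, not the Edilitics implementation; ISO-formatted date strings sort correctly even as text):

```python
import pandas as pd

# Rows arrive out of date order
df = pd.DataFrame({
    "customer_id": ["C001", "C002", "C001"],
    "order_date": ["2024-03-22", "2024-02-14", "2024-01-10"],
    "revenue": [800, 300, 500],
})

# Sort ascending by date first so "last" means "most recent",
# then keep the last occurrence per customer
latest = (
    df.sort_values("order_date")
      .drop_duplicates(subset=["customer_id"], keep="last")
)
```

Without the sort, "last" would simply mean "last in arrival order," which here would wrongly keep the January C001 row.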
Keep None
Removes all rows where the key column value appears more than once. Both the original and the duplicate are dropped.
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
After deduplicating customer_id with Keep None:
| customer_id | order_date | revenue |
|---|---|---|
| C002 | 2024-02-14 | 300 |
Both C001 rows are removed because customer_id C001 appeared more than once. C002 is kept because it appeared exactly once.
Best for: strict quality pipelines where any duplicated key is considered invalid and should be excluded entirely.
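One way to reason about Keep None: count occurrences per key and retain only keys seen exactly once. A minimal sketch (pandas stand-in, hypothetical variable names):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002"],
    "revenue": [500, 800, 300],
})

# Count occurrences per key, then keep only keys seen exactly once
counts = df["customer_id"].value_counts()
unique_keys = counts[counts == 1].index
strict = df[df["customer_id"].isin(unique_keys)]
```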
How to Apply Drop Duplicate Rows
Open the Drop Duplicate Rows operation
In your Transform pipeline, click Add Operation and select Drop Duplicate Rows from the operation list.
Select a deduplication key column
Choose the column whose values will be checked for duplicates. This is the Deduplication Key. All column types are available.
A column selected in one row is removed from the dropdown in all other rows so the same key cannot be used twice.
Choose a handling rule
Select Keep First, Keep Last, or Keep None from the Duplicate Handling Rule dropdown. This dropdown is disabled until a key column is selected.
- Keep First: keeps the earliest occurrence, removes the rest.
- Keep Last: keeps the most recent occurrence, removes the rest. Sort by a date column before this step if row order matters.
- Keep None: removes all rows where the key value appears more than once.
Add more rules (optional)
Click Add Column to add another deduplication rule targeting a different column. Each rule is applied independently and sequentially. The Add Column button is disabled until the current rule has both a column and a handling rule selected.
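Applied independently and sequentially, multiple rules behave like chained deduplications, each on its own key. A hedged sketch (pandas stand-in; `order_id` is an illustrative column, not from the sample dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002"],
    "order_id": ["O1", "O2", "O2"],
    "revenue": [500, 800, 300],
})

# Rule 1: dedupe on customer_id, keeping the first occurrence
df = df.drop_duplicates(subset=["customer_id"], keep="first")

# Rule 2: dedupe on order_id, keeping the last occurrence
# (runs on the output of Rule 1, not the original data)
df = df.drop_duplicates(subset=["order_id"], keep="last")
```

Note that rule order matters: a row removed by an earlier rule is never seen by a later one.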
Remove a rule (optional)
Click the remove icon on any row to delete it. The remove icon only appears when more than one rule row exists.
Save the operation
Click Save & Preview. Edilitics applies each rule in order. A success toast confirms: "Duplicate rows removed. Preview updated with the de-duplicated result."
Verify in the preview
Check the row count and spot-check values in the key column to confirm duplicates were handled as expected.
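A quick way to express the check you are doing by eye: after deduplication with Keep First or Keep Last, no key value should repeat. A sketch of that invariant (pandas stand-in):

```python
import pandas as pd

# Stand-in for the deduplicated preview
df = pd.DataFrame({"customer_id": ["C001", "C002", "C003"]})

# After deduplication, no key value should repeat
remaining = df.duplicated(subset=["customer_id"]).sum()
assert remaining == 0
```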
Before and After
Dataset with duplicate customer_id values:
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-01-10 | 500 |
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
| C003 | 2024-01-05 | 150 |
| C003 | 2024-04-01 | 420 |
After deduplicating customer_id with Keep Last (dataset pre-sorted ascending by order_date):
| customer_id | order_date | revenue |
|---|---|---|
| C001 | 2024-03-22 | 800 |
| C002 | 2024-02-14 | 300 |
| C003 | 2024-04-01 | 420 |
5 rows reduced to 3. Most recent order per customer retained.
Code Equivalent
```sql
-- Keep Last: use ROW_NUMBER to rank by date, keep rank = 1 after sorting desc
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
  FROM orders
)
SELECT * EXCLUDE rn  -- EXCLUDE is DuckDB/Snowflake syntax; list columns explicitly in other dialects
FROM ranked
WHERE rn = 1;
```

```python
import polars as pl

# Keep Last occurrence per customer_id
df = df.unique(subset=["customer_id"], keep="last")

# Keep First
df = df.unique(subset=["customer_id"], keep="first")

# Keep None (drop all rows where customer_id appears more than once)
df = df.unique(subset=["customer_id"], keep="none")
```
After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.
Operation Reference

| Prop | Type |
|---|---|
| Deduplication Key | Column selector. Any column type can be used as the key. |
| Duplicate Handling Rule | Keep First, Keep Last, or Keep None. |
Next Steps
Sort / Order By
Sort by a date column before deduplicating to control which row is considered first or last.
Manage Nulls
Handle null values in key columns before running deduplication.
Filter
Filter rows after deduplication to narrow down to specific segments.
Group By
Aggregate the deduplicated dataset by category or time period.
Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.