Edilitics | Data to Decisions

Drop Duplicate Rows

Remove duplicate rows based on a column's values. Choose to keep the first occurrence, the last, or drop all rows where a duplicate exists. No code needed.

Drop Duplicate Rows removes rows where a specified column contains repeated values. For each deduplication rule, you choose which occurrence to keep: the first, the last, or none at all. You can add multiple rules in one operation, each targeting a different column.

All column types are supported as deduplication keys.


When to Use Drop Duplicate Rows

  • Enforcing unique keys. Customer IDs, order numbers, or transaction codes should appear only once. Drop duplicates on those columns before any aggregation or join.
  • Cleaning merged or appended datasets. When two datasets are combined, the same record can appear twice. Deduplicate on a unique identifier to collapse them to one.
  • Keeping the latest record per entity. Sort by a timestamp column first, then deduplicate with Keep Last to retain the most recent entry per customer or product.
  • Keeping the earliest record per entity. Deduplicate with Keep First to retain the original registration or first order per customer.
  • Removing all ambiguous rows. Use Keep None when you want to discard any row where a duplicate exists, not just remove the extras.

Sample Dataset

The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.

edilitics_sample_orders.csv

Sample orders dataset for hands-on practice · 500 rows

Download

Relevant columns for Drop Duplicates examples:

Prop

Type


The Three Handling Rules

Each deduplication rule requires you to choose what to do when duplicates are found in the key column.

Keeps the first occurrence of each value and removes all subsequent duplicates.

customer_idorder_daterevenue
C0012024-01-10500
C0012024-03-22800
C0022024-02-14300

After deduplicating customer_id with Keep First:

customer_idorder_daterevenue
C0012024-01-10500
C0022024-02-14300

The first C001 row (January) is kept. The March row is removed.

Best for: retaining the original or earliest record per entity.

Keeps the last occurrence of each value and removes all earlier duplicates.

customer_idorder_daterevenue
C0012024-01-10500
C0012024-03-22800
C0022024-02-14300

After deduplicating customer_id with Keep Last:

customer_idorder_daterevenue
C0012024-03-22800
C0022024-02-14300

The last C001 row (March) is kept. The January row is removed.

Best for: retaining the most recent record per entity. Sort by a date column ascending before running this operation to ensure "last" means "most recent."

Removes all rows where the key column value appears more than once. Both the original and the duplicate are dropped.

customer_idorder_daterevenue
C0012024-01-10500
C0012024-03-22800
C0022024-02-14300

After deduplicating customer_id with Keep None:

customer_idorder_daterevenue
C0022024-02-14300

Both C001 rows are removed because customer_id C001 appeared more than once. C002 is kept because it appeared exactly once.

Best for: strict quality pipelines where any duplicated key is considered invalid and should be excluded entirely.


How to Apply Drop Duplicate Rows

Open the Drop Duplicate Rows operation

In your Transform pipeline, click Add Operation and select Drop Duplicate Rows from the operation list.

Select a deduplication key column

Choose the column whose values will be checked for duplicates. This is the Deduplication Key. All column types are available.

A column selected in one row is removed from the dropdown in all other rows so the same key cannot be used twice.

Choose a handling rule

Select Keep First, Keep Last, or Keep None from the Duplicate Handling Rule dropdown. This dropdown is disabled until a key column is selected.

  • Keep First: keeps the earliest occurrence, removes the rest.
  • Keep Last: keeps the most recent occurrence, removes the rest. Sort by a date column before this step if row order matters.
  • Keep None: removes all rows where the key value appears more than once.

Add more rules (optional)

Click Add Column to add another deduplication rule targeting a different column. Each rule is applied independently and sequentially. The Add Column button is disabled until the current rule has both a column and a handling rule selected.

Remove a rule (optional)

Click the remove icon on any row to delete it. The remove icon only appears when more than one rule row exists.

Save the operation

Click Save & Preview. Edilitics applies each rule in order. A success toast confirms: "Duplicate rows removed. Preview updated with the de-duplicated result."

Verify in the preview

Check the row count and spot-check values in the key column to confirm duplicates were handled as expected.


Before and After

Dataset with duplicate customer_id values:

customer_idorder_daterevenue
C0012024-01-10500
C0012024-03-22800
C0022024-02-14300
C0032024-01-05150
C0032024-04-01420

After deduplicating customer_id with Keep Last (dataset pre-sorted ascending by order_date):

customer_idorder_daterevenue
C0012024-03-22800
C0022024-02-14300
C0032024-04-01420

5 rows reduced to 3. Most recent order per customer retained.


Code Equivalent

-- Keep Last: use ROW_NUMBER to rank by date, keep rank = 1 after sorting desc
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
  FROM orders
)
SELECT * EXCLUDE rn
FROM ranked
WHERE rn = 1;
import polars as pl

# Keep Last occurrence per customer_id
df = df.unique(subset=["customer_id"], keep="last")

# Keep First
df = df.unique(subset=["customer_id"], keep="first")

# Keep None (drop all rows where customer_id appears more than once)
df = df.unique(subset=["customer_id"], keep="none")

After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.


After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.


Operation Reference

Prop

Type


Frequently Asked Questions


Next Steps

Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.

Last updated on

On this page