Sampling
Sample rows from your dataset using simple random sampling, systematic sampling, or stratified sampling by category group. No code needed.
Sampling reduces your dataset to a subset of rows using one of three strategies: simple random sampling by percentage, systematic sampling by fixed interval, or stratified sampling by category group. The original dataset is replaced by the sample, and every downstream operation works on the sampled result.
Sampling replaces the dataset with the sampled rows. Original rows not selected are removed from the pipeline. If you need the full dataset later, add sampling as a late step in your pipeline or work on a copy.
When to Use Sampling
- Speed up development. Work on 10% of a large dataset while building and testing transformations.
- Create train/test splits. Sample a percentage of rows for model training before exporting.
- Audit a structured subset. Use systematic sampling to review every Nth transaction without bias.
- Ensure category representation. Use stratified sampling to guarantee rows from each region or product tier.
- Reduce export volume. Sample before exporting to keep output files to a manageable size.
Sample Dataset
The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.
edilitics_sample_orders.csv
Sample orders dataset for hands-on practice · 500 rows
Relevant columns for Sampling examples:

| Column | Type |
|---|---|
| region | String |
Sampling Methods
| Method | How it selects rows | Best for |
|---|---|---|
| Simple Random Sampling | Random percentage of total rows, each row has equal probability | Quick testing, unbiased subsets |
| Systematic Sampling | Every Nth row starting from an offset position | Structured audits, time-series intervals |
| Stratified Sampling | Fixed count or proportionate share from each category subgroup | Ensuring category representation |
How to Apply Sampling
Open the Sampling operation
In your Transform pipeline, click Add Operation and select Sampling from the operation list.
Choose a Sampling Type
In the Choose Sampling Type dropdown, select one of:
- Simple Random Sampling
- Systematic Sampling
- Stratified Sampling
The configuration fields update based on the selected type.
Configure the selected method
Follow the configuration for the method you chose. See the sections below for each method.
Click Save & Preview
Click Save & Preview. Edilitics applies the sampling and replaces the dataset with the selected rows. The success toast confirms: "Sampling applied. Preview updated with sampled data."
Verify in the preview
Check the row count in the preview. Confirm the sampled rows reflect the expected distribution or range.
Simple Random Sampling
Selects a random percentage of rows from the full dataset. Every row has equal probability of being selected.
Fields:
| Field | Required | Description |
|---|---|---|
| Sample Size (%) | Yes | Percentage of rows to include. Positive integer only (e.g. 10, 50, 100). Decimals not accepted. |
| Random Seed | No | Integer seed for reproducibility. Same seed produces the same sample on repeated runs. |
| Allow Duplicates | No | Checkbox. When checked, the same row can appear more than once. Auto-checked when Sample Size exceeds 100%. |
If you enter a Sample Size greater than 100, Allow Duplicates is automatically enabled. Without repetition, you cannot sample more rows than the dataset contains.
Before and After (Sample Size 60%, 500-row dataset):
Input: 500 rows
Output: approximately 300 rows (random selection, order not preserved)
Systematic Sampling
Selects every Nth row starting from an offset position. N is computed automatically as total rows divided by the requested sample size.
Fields:
| Field | Required | Description |
|---|---|---|
| Total Rows (Approx) | Read-only | Auto-fetched total row count for the current dataset. |
| Selection Frequency | Read-only | Auto-calculated as Total Rows divided by Sample Size. This is N - every Nth row is selected. |
| Sample Size | Yes | Number of rows to return. Must be a positive integer less than total rows. |
| Offset | Yes | Starting position within the dataset. Must be less than the Selection Frequency. Enabled only after Sample Size is entered. |
Before and After (Sample Size 50, 500-row dataset, Offset 3):
Selection Frequency = 500 / 50 = 10. Rows selected: 3, 13, 23, 33 ... (every 10th row starting at position 3).
Input: 500 rows
Output: 50 rows
Stratified Sampling
Divides the dataset by unique values in a categorical column and samples from each subgroup. Two distribution modes: Proportionate and Disproportionate.
Fields:
| Field | Required | Description |
|---|---|---|
| Group By Column | Yes | Categorical column to define subgroups. Only string columns appear. |
| Distribution Type | Yes | Proportionate or Disproportionate. |
| Total Sample Size | Yes (Proportionate) | Total rows to return across all groups. Distributed in proportion to each group's share of the full dataset. |
| Number of Samples from Each Subgroup | Yes (Disproportionate) | Fixed number of rows to take from each group independently of group size. |
Proportionate maintains the original group ratios. If the region column is 40% North America and 20% Europe, the output sample preserves that 2:1 ratio.
Disproportionate takes an equal fixed count from every group regardless of size. Useful when smaller groups need representation equal to larger ones.
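A sketch of how the two modes allocate a sample, assuming illustrative group counts for five regions (the counts themselves are not from the sample dataset):

```python
# Assumed group sizes for a 500-row dataset split across five regions
group_counts = {"North America": 200, "Europe": 100, "APAC": 100, "LATAM": 60, "MEA": 40}
total = sum(group_counts.values())  # 500

# Proportionate: each group's share of a 300-row sample matches its share of the data
proportionate = {g: round(c / total * 300) for g, c in group_counts.items()}
print(proportionate["North America"], proportionate["Europe"])  # 120 60 -> 2:1 preserved

# Disproportionate: a fixed 10 rows from every group, regardless of group size
disproportionate = {g: 10 for g in group_counts}
```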
If no categorical columns exist, an info toast appears: "No columns with categorical fields found!" Cast columns to string or select a dataset that includes categorical columns.
Before and After (Stratified, Disproportionate, Group By Column region, 10 from each subgroup):
Input: 500 rows across 5 regions
Output: 50 rows (10 per region)
Code Equivalent
-- Simple random sample (PostgreSQL)
SELECT *
FROM orders
ORDER BY RANDOM()
LIMIT 300;
-- Systematic sample: every 10th row (DuckDB)
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER () AS rn FROM orders
)
WHERE (rn - 3) % 10 = 0;
-- Stratified sample: 10 rows per region (PostgreSQL)
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY RANDOM()) AS rn
FROM orders
) t
WHERE rn <= 10;

import polars as pl
# Simple random sample: 60% of rows
df_simple = df.sample(fraction=0.6, seed=42, with_replacement=False)
# Systematic sample: every 10th row starting at offset 3
df_systematic = df[list(range(3, df.shape[0], 10))]
# Stratified sample: proportionate, 300 total rows
ratios = dict(
df["region"]
.value_counts()
.with_columns(pl.col("count") / df.shape[0])
.iter_rows()
)
df_stratified = pl.concat([
df.filter(pl.col("region") == region).sample(n=int(ratio * 300))
for region, ratio in ratios.items()
])
# Stratified sample: disproportionate, 10 rows per region
df_disproportionate = pl.concat([
df.filter(pl.col("region") == region).sample(n=10)
for region in df["region"].unique()
])
After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.
Next Steps
Filter
Filter the sampled dataset to narrow down to rows matching specific conditions.
Group By
Aggregate the sampled rows by category to summarise the subset.
Cast Data Types
Cast columns in the sample before exporting or joining with other datasets.
Manage Nulls
Handle null values in the sample before further analysis.
Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.