Edilitics | Data to Decisions

Sampling

Sample rows from your dataset using simple random sampling, systematic sampling, or stratified sampling by category group. No code needed.

Sampling reduces your dataset to a subset of rows using one of three strategies: simple random sampling by percentage, systematic sampling by fixed interval, or stratified sampling by category group. The original dataset is replaced by the sample. Every operation downstream works on the sampled result.

Sampling replaces the dataset with the sampled rows. Original rows not selected are removed from the pipeline. If you need the full dataset later, add sampling as a late step in your pipeline or work on a copy.


When to Use Sampling

  • Speed up development. Work on 10% of a large dataset while building and testing transformations.
  • Create train/test splits. Sample a percentage of rows for model training before exporting.
  • Audit a structured subset. Use systematic sampling to review every Nth transaction without bias.
  • Ensure category representation. Use stratified sampling to guarantee rows from each region or product tier.
  • Reduce export volume. Sample before exporting to keep output files to a manageable size.

Sample Dataset

The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.

edilitics_sample_orders.csv

Sample orders dataset for hands-on practice · 500 rows



Sampling Methods

| Method | How it selects rows | Best for |
| --- | --- | --- |
| Simple Random Sampling | Random percentage of total rows, each row has equal probability | Quick testing, unbiased subsets |
| Systematic Sampling | Every Nth row starting from an offset position | Structured audits, time-series intervals |
| Stratified Sampling | Fixed count or proportionate share from each category subgroup | Ensuring category representation |

How to Apply Sampling

Open the Sampling operation

In your Transform pipeline, click Add Operation and select Sampling from the operation list.

Choose a Sampling Type

In the Choose Sampling Type dropdown, select one of:

  • Simple Random Sampling
  • Systematic Sampling
  • Stratified Sampling

The configuration fields update based on the selected type.

Configure the selected method

Follow the configuration for the method you chose. See the sections below for each method.

Click Save & Preview

Click Save & Preview. Edilitics applies the sampling and replaces the dataset with the selected rows. The success toast confirms: "Sampling applied. Preview updated with sampled data."

Verify in the preview

Check the row count in the preview. Confirm the sampled rows reflect the expected distribution or range.


Simple Random Sampling

Selects a random percentage of rows from the full dataset. Every row has equal probability of being selected.

Fields:

| Field | Required | Description |
| --- | --- | --- |
| Sample Size (%) | Yes | Percentage of rows to include. Positive integer only (e.g. 10, 50, 100). Decimals not accepted. |
| Random Seed | No | Integer seed for reproducibility. Same seed produces the same sample on repeated runs. |
| Allow Duplicates | No | Checkbox. When checked, the same row can appear more than once. Auto-checked when Sample Size exceeds 100%. |

If you enter a Sample Size greater than 100, Allow Duplicates is automatically enabled. Without repetition, you cannot sample more rows than the dataset contains.

Before and After (Sample Size 60%, 500-row dataset):

Input: 500 rows

Output: approximately 300 rows (random selection, order not preserved)
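The interaction between the three fields can be sketched in plain Python. This is a minimal illustration, not the Edilitics implementation: the function name `simple_random_sample` and the rounding of the target row count are assumptions, and the tool itself reports an approximate count, so its exact rule may differ.

```python
import random


def simple_random_sample(rows, sample_size_pct, seed=None, allow_duplicates=False):
    """Return a random sample holding sample_size_pct percent of rows (hypothetical helper)."""
    if not isinstance(sample_size_pct, int) or sample_size_pct <= 0:
        raise ValueError("Sample Size (%) must be a positive integer")
    # Above 100% there are not enough distinct rows, so duplicates are required.
    if sample_size_pct > 100:
        allow_duplicates = True
    target = round(len(rows) * sample_size_pct / 100)  # assumed rounding rule
    rng = random.Random(seed)  # the same seed reproduces the same sample
    if allow_duplicates:
        return rng.choices(rows, k=target)  # sampling with replacement
    return rng.sample(rows, k=target)  # sampling without replacement


orders = list(range(1, 501))  # stand-in for the 500-row dataset
sample = simple_random_sample(orders, 60, seed=42)
print(len(sample))  # 300
```

Passing the same seed again returns the same rows, which mirrors the Random Seed field described above.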


Systematic Sampling

Selects every Nth row starting from an offset position. N is computed automatically as total rows divided by the requested sample size.

Fields:

| Field | Required | Description |
| --- | --- | --- |
| Total Rows (Approx) | Read-only | Auto-fetched total row count for the current dataset. |
| Selection Frequency | Read-only | Auto-calculated as Total Rows divided by Sample Size. This is N: every Nth row is selected. |
| Sample Size | Yes | Number of rows to return. Must be a positive integer less than total rows. |
| Offset | Yes | Starting position within the dataset. Must be less than the Selection Frequency. Enabled only after Sample Size is entered. |

Before and After (Sample Size 50, 500-row dataset, Offset 3):

Selection Frequency = 500 / 50 = 10. Rows selected: 3, 13, 23, 33 ... (every 10th row starting at position 3).

Input: 500 rows

Output: 50 rows
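The arithmetic in this example can be checked with a short Python sketch. It rests on assumptions that the page does not state: the Selection Frequency is floored when the division is not exact, positions are treated as zero-based indices (so Offset 0 would select the first row), and the result is capped at the requested Sample Size. The function name `systematic_positions` is hypothetical.

```python
def systematic_positions(total_rows, sample_size, offset):
    """Return the row positions picked by systematic sampling (hypothetical helper)."""
    if not 0 < sample_size < total_rows:
        raise ValueError("Sample Size must be a positive integer less than total rows")
    frequency = total_rows // sample_size  # Selection Frequency (N), assumed floored
    if not 0 <= offset < frequency:
        raise ValueError("Offset must be less than the Selection Frequency")
    # Every Nth position starting at the offset, capped at the requested size.
    positions = list(range(offset, total_rows, frequency))
    return positions[:sample_size]


positions = systematic_positions(total_rows=500, sample_size=50, offset=3)
print(positions[:4], len(positions))  # [3, 13, 23, 33] 50
```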


Stratified Sampling

Divides the dataset by unique values in a categorical column and samples from each subgroup. Two distribution modes: Proportionate and Disproportionate.

Fields:

| Field | Required | Description |
| --- | --- | --- |
| Group By Column | Yes | Categorical column to define subgroups. Only string columns appear. |
| Distribution Type | Yes | Proportionate or Disproportionate. |
| Total Sample Size | Yes (Proportionate) | Total rows to return across all groups. Distributed in proportion to each group's share of the full dataset. |
| Number of Samples from Each Subgroup | Yes (Disproportionate) | Fixed number of rows to take from each group independently of group size. |

Proportionate maintains original group ratios. If region has 40% North America and 20% Europe, the output sample preserves that 2:1 ratio.
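As a rough sketch of how a proportionate split could be computed, the Python below divides a Total Sample Size across groups by their share of rows. The group counts are hypothetical (chosen to match the 40% and 20% shares above), and the rule for handing out rows lost to rounding is an assumption, not documented Edilitics behavior.

```python
def proportionate_allocation(group_counts, total_sample_size):
    """Split total_sample_size across groups in proportion to their row counts."""
    total_rows = sum(group_counts.values())
    exact = {g: total_sample_size * n / total_rows for g, n in group_counts.items()}
    allocation = {g: int(share) for g, share in exact.items()}  # floor each share
    # Hand out rows lost to flooring, largest fractional part first (assumed rule).
    leftover = total_sample_size - sum(allocation.values())
    by_fraction = sorted(exact, key=lambda g: exact[g] - allocation[g], reverse=True)
    for g in by_fraction[:leftover]:
        allocation[g] += 1
    return allocation


# Hypothetical counts for a 500-row dataset: 40% North America, 20% Europe.
counts = {"North America": 200, "Europe": 100, "Other": 200}
allocation = proportionate_allocation(counts, total_sample_size=300)
print(allocation)  # {'North America': 120, 'Europe': 60, 'Other': 120}
```

North America still receives twice as many rows as Europe, so the 2:1 ratio carries over into the sample.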

Disproportionate takes an equal fixed count from every group regardless of size. Useful when smaller groups need representation equal to larger ones.

If no categorical columns exist, an info toast appears: "No columns with categorical fields found!" Cast columns to string or select a dataset that includes categorical columns.

Before and After (Stratified, Disproportionate, Group By Column region, 10 from each subgroup):

Input: 500 rows across 5 regions

Output: 50 rows (10 per region)
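The disproportionate case above can be approximated in plain Python. This is an illustrative sketch only: the `order_id` column, the placeholder region names, and the choice to return every row of a group smaller than the requested count are all assumptions.

```python
import random
from collections import defaultdict


def sample_per_group(rows, group_key, per_group, seed=None):
    """Take up to per_group random rows from each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    rng = random.Random(seed)
    sampled = []
    for members in groups.values():
        # Assumption: a group smaller than per_group contributes all of its rows.
        sampled.extend(rng.sample(members, k=min(per_group, len(members))))
    return sampled


# Hypothetical stand-in data: 500 rows spread evenly over 5 placeholder regions.
rows = [{"order_id": i, "region": f"region_{i % 5}"} for i in range(500)]
result = sample_per_group(rows, "region", per_group=10, seed=7)
print(len(result))  # 50
```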


Code Equivalent

-- Simple random sample: 300 rows, i.e. 60% of the 500-row dataset (PostgreSQL)
SELECT *
FROM orders
ORDER BY RANDOM()
LIMIT 300;

-- Systematic sample: every 10th row starting at row 3 (DuckDB)
-- rn is 1-based; without an ORDER BY inside OVER (), row order is not guaranteed.
SELECT *
FROM (
  SELECT *, ROW_NUMBER() OVER () AS rn FROM orders
) t
WHERE (rn - 3) % 10 = 0;

-- Stratified sample: 10 rows per region (PostgreSQL)
SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY RANDOM()) AS rn
  FROM orders
) t
WHERE rn <= 10;

Python (Polars):

import polars as pl

# Load the sample orders dataset
df = pl.read_csv("edilitics_sample_orders.csv")

# Simple random sample: 60% of rows
df_simple = df.sample(fraction=0.6, seed=42, with_replacement=False)

# Systematic sample: every 10th row starting at offset 3
# (row indices are zero-based here, whereas rn in the SQL version starts at 1)
df_systematic = df.gather_every(10, offset=3)

# Stratified sample: proportionate, 300 total rows
# Map each region to its share of the full dataset
ratios = dict(
    df["region"]
    .value_counts()
    .with_columns(pl.col("count") / df.shape[0])
    .iter_rows()
)
# int() truncates, so the combined total can fall slightly below 300
df_stratified = pl.concat([
    df.filter(pl.col("region") == region).sample(n=int(ratio * 300))
    for region, ratio in ratios.items()
])

# Stratified sample: disproportionate, 10 rows per region
# (sample(n=10) raises an error if a region has fewer than 10 rows)
df_disproportionate = pl.concat([
    df.filter(pl.col("region") == region).sample(n=10)
    for region in df["region"].unique()
])

After Save & Preview, the pipeline shows a DQ delta badge on this step: green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.





Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.
