Sampling
Sample rows from your dataset using simple random sampling, systematic sampling, or stratified sampling by category group. No code needed.
Sampling reduces your dataset to a subset of rows using one of three strategies: simple random sampling by percentage, systematic sampling by fixed interval, or stratified sampling by category group. The original dataset is replaced by the sample, and every downstream operation works on the sampled result.
Sampling replaces the dataset with the sampled rows. Original rows not selected are removed from the pipeline. If you need the full dataset later, add sampling as a late step in your pipeline or work on a copy.
When to Use Sampling
- Speed up development. Work on 10% of a large dataset while building and testing transformations.
- Create train/test splits. Sample a percentage of rows for model training before exporting.
- Audit a structured subset. Use systematic sampling to review every Nth transaction without bias.
- Ensure category representation. Use stratified sampling to guarantee rows from each region or product tier.
- Reduce export volume. Sample before exporting to keep output files to a manageable size.
Sample Dataset
The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.
edilitics_sample_orders.csv
Sample orders dataset for hands-on practice · 500 rows
Relevant columns for Sampling examples:

| Column | Type |
|---|---|
| region | String |
Sampling Methods
| Method | How it selects rows | Best for |
|---|---|---|
| Simple Random Sampling | Random percentage of total rows, each row has equal probability | Quick testing, unbiased subsets |
| Systematic Sampling | Every Nth row starting from an offset position | Structured audits, time-series intervals |
| Stratified Sampling | Fixed count or proportionate share from each category subgroup | Ensuring category representation |
How to Apply Sampling
Open the Sampling operation
In your Transform pipeline, click Add Operation and select Sampling from the operation list.
Choose a Sampling Type
In the Choose Sampling Type dropdown, select one of:
- Simple Random Sampling
- Systematic Sampling
- Stratified Sampling
The configuration fields update based on the selected type.
Configure the selected method
Follow the configuration for the method you chose. See the sections below for each method.
Click Save & Preview
Click Save & Preview. Edilitics applies the sampling and replaces the dataset with the selected rows. The success toast confirms: "Sampling applied. Preview updated with sampled data."
Verify in the preview
Check the row count in the preview. Confirm the sampled rows reflect the expected distribution or range.
Simple Random Sampling
Selects a random percentage of rows from the full dataset. Every row has equal probability of being selected.
Fields:
| Field | Required | Description |
|---|---|---|
| Sample Size (%) | Yes | Percentage of rows to include. Positive integer only (e.g. 10, 50, 100). Decimals not accepted. |
| Random Seed | No | Integer seed for reproducibility. Same seed produces the same sample on repeated runs. |
| Allow Duplicates | No | Checkbox. When checked, the same row can appear more than once. Auto-checked when Sample Size exceeds 100%. |
If you enter a Sample Size greater than 100, Allow Duplicates is automatically enabled. Without repetition, you cannot sample more rows than the dataset contains.
Before and After (Sample Size 60%, 500-row dataset):
Input: 500 rows
Output: approximately 300 rows (random selection, order not preserved)
Systematic Sampling
Selects every Nth row starting from an offset position. N is computed automatically as total rows divided by the requested sample size.
Fields:
| Field | Required | Description |
|---|---|---|
| Total Rows (Approx) | Read-only | Auto-fetched total row count for the current dataset. |
| Selection Frequency | Read-only | Auto-calculated as Total Rows divided by Sample Size. This is N - every Nth row is selected. |
| Sample Size | Yes | Number of rows to return. Must be a positive integer less than total rows. |
| Offset | Yes | Starting position within the dataset. Must be less than the Selection Frequency. Enabled only after Sample Size is entered. |
Before and After (Sample Size 50, 500-row dataset, Offset 3):
Selection Frequency = 500 / 50 = 10. Rows selected: 3, 13, 23, 33 ... (every 10th row starting at position 3).
Input: 500 rows
Output: 50 rows
Stratified Sampling
Divides the dataset by unique values in a categorical column and samples from each subgroup. Two distribution modes: Proportionate and Disproportionate.
Fields:
| Field | Required | Description |
|---|---|---|
| Group By Column | Yes | Categorical column to define subgroups. Only string columns appear. |
| Distribution Type | Yes | Proportionate or Disproportionate. |
| Total Sample Size | Yes (Proportionate) | Total rows to return across all groups. Distributed in proportion to each group's share of the full dataset. |
| Number of Samples from Each Subgroup | Yes (Disproportionate) | Fixed number of rows to take from each group independently of group size. |
Proportionate maintains the original group ratios. If the region column is 40% North America and 20% Europe, the output sample preserves that 2:1 ratio.
Disproportionate takes an equal fixed count from every group regardless of size. Useful when smaller groups need representation equal to larger ones.
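A sketch of how the two modes allocate a sample, assuming illustrative group counts for five regions (the counts themselves are not from the sample dataset):

```python
# Assumed group sizes for a 500-row dataset split across five regions
group_counts = {"North America": 200, "Europe": 100, "APAC": 100, "LATAM": 60, "MEA": 40}
total = sum(group_counts.values())  # 500

# Proportionate: each group's share of a 300-row sample matches its share of the data
proportionate = {g: round(c / total * 300) for g, c in group_counts.items()}
print(proportionate["North America"], proportionate["Europe"])  # 120 60 -> 2:1 preserved

# Disproportionate: a fixed 10 rows from every group, regardless of group size
disproportionate = {g: 10 for g in group_counts}
```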
If no categorical columns exist, an info toast appears: "No columns with categorical fields found!" Cast columns to string or select a dataset that includes categorical columns.
Before and After (Stratified, Disproportionate, Group By Column region, 10 from each subgroup):
Input: 500 rows across 5 regions
Output: 50 rows (10 per region)
Code Equivalent
-- Simple random sample (PostgreSQL)
SELECT *
FROM orders
ORDER BY RANDOM()
LIMIT 300;
-- Systematic sample: every 10th row (DuckDB)
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER () AS rn FROM orders
)
WHERE (rn - 3) % 10 = 0;
-- Stratified sample: 10 rows per region (PostgreSQL)
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY RANDOM()) AS rn
FROM orders
) t
WHERE rn <= 10;

import polars as pl
# Simple random sample: 60% of rows
df_simple = df.sample(fraction=0.6, seed=42, with_replacement=False)
# Systematic sample: every 10th row starting at offset 3
df_systematic = df[list(range(3, df.shape[0], 10))]
# Stratified sample: proportionate, 300 total rows
ratios = dict(
df["region"]
.value_counts()
.with_columns(pl.col("count") / df.shape[0])
.iter_rows()
)
df_stratified = pl.concat([
df.filter(pl.col("region") == region).sample(n=int(ratio * 300))
for region, ratio in ratios.items()
])
# Stratified sample: disproportionate, 10 rows per region
df_disproportionate = pl.concat([
df.filter(pl.col("region") == region).sample(n=10)
for region in df["region"].unique()
])
After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.
Next Steps
Filter
Filter the sampled dataset to narrow down to rows matching specific conditions.
Group By
Aggregate the sampled rows by category to summarise the subset.
Cast Data Types
Cast columns in the sample before exporting or joining with other datasets.
Manage Nulls
Handle null values in the sample before further analysis.
Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.