String Extract
Extract substrings from a text column using regex capture groups. Each capture group becomes a new string column. Preset patterns for email, URL, phone, date, and number. No code needed.
String Extract pulls specific substrings from a text column using a regex pattern with capture groups. Each capture group in your pattern maps to one new output column. The source column is unchanged. Output columns are always string type.
Use String Extract when you need to extract one or more values embedded in a longer string - an ID buried in a log entry, a domain from an email address, or a year from a formatted date string.
When to Use String Extract
- Email decomposition. Extract username and domain from
contact@edilitics.cominto two separate columns. - Order ID parsing. Pull the year segment from
ORD-2024-0001using a regex group. - URL domain extraction. Extract the hostname from a full URL for traffic source analysis.
- Phone number normalisation. Pull a phone number embedded in a freeform notes field.
- Log parsing. Extract error codes, user IDs, or timestamps from unstructured log strings.
String Extract targets one source column per operation and uses re.search -
it finds the pattern anywhere in the string, not just at the start. If the
pattern does not match a row, that row's output columns receive null.
Sample Dataset
The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.
edilitics_sample_orders.csv
Sample orders dataset for hands-on practice · 500 rows
Relevant columns for String Extract examples:
Prop
Type
Preset Patterns
String Extract includes five built-in presets. Selecting a preset auto-populates the Regex Pattern field and sets the correct number of output column slots.
| Preset | Pattern | Typical use |
|---|---|---|
| Email address | ([a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}) | Extract a full email address embedded in a string |
| URL | (https?://[^\s]+) | Extract a URL from freeform text |
| Phone number | ((\+?\d[\d\-\s]{7,}\d)) | Extract a phone number in international or local format |
| Date (YYYY-MM-DD) | (\d{4}-\d{2}-\d{2}) | Extract an ISO date from a longer string |
| Number | ([\d]+(?:\.\d+)?) | Extract the first number (integer or decimal) found |
| Custom | - | Type your own pattern with capture groups |
Selecting a preset locks the pattern field with the preset regex. Switch to Custom to type your own.
How to Apply String Extract
Open the String Extract operation
In your Transform pipeline, click Add Operation and select String Extract from the operation list.
Select the source column
Choose a column from the Source Column dropdown. Only string (text) columns appear - boolean, date, and datetime columns are excluded. In the sample dataset, try email, order_id, or notes.
Choose a preset or enter a custom pattern
From the Preset Pattern dropdown, select one of the five built-in patterns, or select Custom and type your own regex in the Regex Pattern field.
The pattern field shows a live capture group count: "2 capture groups detected." This count determines how many output column slots you can add.
Rules for the pattern:
- Must contain at least one capture group
(...)- non-capturing groups(?:...)do not count - Must be a valid regular expression
- Invalid patterns block submission
Name the output columns
In the Output Columns section, one slot appears per capture group detected. Name each slot - the name becomes the new column in your dataset.
Each name must:
- Contain only letters, numbers, and underscores
- Not start with
__or a number - Not already exist in the dataset
You can map fewer groups than the pattern contains - remove a slot with its remove icon. You cannot add more slots than the pattern has capture groups.
Click Save & Preview
Click Save & Preview. Edilitics applies the extraction across all rows and adds the new columns to the dataset. The success toast confirms: "Extracted N column(s) from column_name."
Rows where the pattern does not match receive null in all output columns.
Verify in the preview
Check the new columns. Nulls indicate rows where the pattern did not match. Adjust your pattern if too many rows are returning null.
Before and After
Pattern: ([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}) applied to email.
Output columns: email_user (Group 1), email_domain (Group 2).
Input:
| order_id | revenue | |
|---|---|---|
| ORD-2024-0001 | contact@meridian.com | 12450.00 |
| ORD-2024-0047 | ops@globalretail.io | 8320.50 |
| ORD-2024-0312 | procurement@techwave.co | 3750.75 |
Output:
| order_id | revenue | email_user | email_domain | |
|---|---|---|---|---|
| ORD-2024-0001 | contact@meridian.com | 12450.00 | contact | meridian.com |
| ORD-2024-0047 | ops@globalretail.io | 8320.50 | ops | globalretail.io |
| ORD-2024-0312 | procurement@techwave.co | 3750.75 | procurement | techwave.co |
Source column email unchanged. Two new string columns added.
Code Equivalent
-- PostgreSQL: extract username and domain from email
SELECT
*,
(REGEXP_MATCH(email, '([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})'))[1] AS email_user,
(REGEXP_MATCH(email, '([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})'))[2] AS email_domain
FROM orders;
-- BigQuery: use REGEXP_EXTRACT for single group
SELECT
*,
REGEXP_EXTRACT(email, r'([^@]+)@') AS email_user,
REGEXP_EXTRACT(email, r'@(.+)') AS email_domain
FROM orders;import polars as pl
import re
# Extract capture groups using map_elements (re.search finds match anywhere in string)
pattern = r"([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})"
df = df.with_columns([
pl.col("email")
.map_elements(
lambda x: (m := re.search(pattern, str(x))) and m.group(1),
return_dtype=pl.String
)
.alias("email_user"),
pl.col("email")
.map_elements(
lambda x: (m := re.search(pattern, str(x))) and m.group(2),
return_dtype=pl.String
)
.alias("email_domain"),
])After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.
After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.
Operation Reference
Prop
Type
Frequently Asked Questions
Next Steps
Cast Data Types
Cast extracted string columns to numeric or date types if needed.
Split Columns
Split a column at a delimiter into multiple parts - an alternative when the structure is fixed.
Filter
Filter rows based on the extracted values in the new columns.
Find & Replace
Clean up values in the extracted column before using it downstream.
Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.
Last updated on
Text Case
Standardize text casing in any string column without code. Apply sentence, upper, lower, title, or swap case to one or more columns in a single step.
Conditional Column
Add a new column whose value is set by IF/THEN rules evaluated top-to-bottom. First matching rule wins. Else value for unmatched rows. No code needed.