Edilitics | Data to Decisions

String Extract

Extract substrings from a text column using regex capture groups. Each capture group becomes a new string column. Preset patterns for email, URL, phone, date, and number. No code needed.

String Extract pulls specific substrings from a text column using a regex pattern with capture groups. Each capture group in your pattern maps to one new output column. The source column is unchanged. Output columns are always string type.

Use String Extract when you need to extract one or more values embedded in a longer string - an ID buried in a log entry, a domain from an email address, or a year from a formatted date string.


When to Use String Extract

  • Email decomposition. Extract username and domain from contact@edilitics.com into two separate columns.
  • Order ID parsing. Pull the year segment from ORD-2024-0001 using a regex group.
  • URL domain extraction. Extract the hostname from a full URL for traffic source analysis.
  • Phone number normalisation. Pull a phone number embedded in a freeform notes field.
  • Log parsing. Extract error codes, user IDs, or timestamps from unstructured log strings.

String Extract targets one source column per operation and uses re.search - it finds the pattern anywhere in the string, not just at the start. If the pattern does not match a row, that row's output columns receive null.
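Python's re.search semantics are easy to verify directly - a minimal sketch (the order-ID pattern here is illustrative, not a built-in preset):

```python
import re

# re.search matches anywhere in the string, not only at the start (unlike re.match)
pattern = re.compile(r"ORD-(\d{4})-(\d{4})")

m = pattern.search("Shipped order ORD-2024-0001 on Friday")
year = m.group(1) if m else None   # "2024"
seq = m.group(2) if m else None    # "0001"

# A non-matching string yields None for every would-be output column
m2 = pattern.search("no order reference here")
missing = m2.group(1) if m2 else None  # None
```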


Sample Dataset

The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.

edilitics_sample_orders.csv - Sample orders dataset for hands-on practice (500 rows)

Relevant columns for String Extract examples:

| Column | Type |
| --- | --- |
| order_id | string |
| email | string |
| notes | string |


Preset Patterns

String Extract includes five built-in presets. Selecting a preset auto-populates the Regex Pattern field and sets the correct number of output column slots.

| Preset | Pattern | Typical use |
| --- | --- | --- |
| Email address | `([a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})` | Extract a full email address embedded in a string |
| URL | `(https?://[^\s]+)` | Extract a URL from freeform text |
| Phone number | `(\+?\d[\d\-\s]{7,}\d)` | Extract a phone number in international or local format |
| Date (YYYY-MM-DD) | `(\d{4}-\d{2}-\d{2})` | Extract an ISO date from a longer string |
| Number | `([\d]+(?:\.\d+)?)` | Extract the first number (integer or decimal) found |
| Custom | - | Type your own pattern with capture groups |

Selecting a preset locks the pattern field with the preset regex. Switch to Custom to type your own.
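You can sanity-check a preset before applying it by running its regex against a sample value - for instance, the Date (YYYY-MM-DD) preset, sketched here with Python's re module, which mirrors the re.search behaviour String Extract uses:

```python
import re

date_pattern = r"(\d{4}-\d{2}-\d{2})"  # Date (YYYY-MM-DD) preset

m = re.search(date_pattern, "Invoice issued 2024-03-15, due in 30 days")
iso_date = m.group(1) if m else None  # "2024-03-15"
```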


How to Apply String Extract

Open the String Extract operation

In your Transform pipeline, click Add Operation and select String Extract from the operation list.

Select the source column

Choose a column from the Source Column dropdown. Only string (text) columns appear - boolean, date, and datetime columns are excluded. In the sample dataset, try email, order_id, or notes.

Choose a preset or enter a custom pattern

From the Preset Pattern dropdown, select one of the five built-in patterns, or select Custom and type your own regex in the Regex Pattern field.

The pattern field shows a live capture group count - for example, "2 capture groups detected." This count determines how many output column slots you can add.

Rules for the pattern:

  • Must contain at least one capture group (...) - non-capturing groups (?:...) do not count
  • Must be a valid regular expression
  • Invalid patterns block submission
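A capture-group count like the one shown in the UI can be computed with Python's re module - a sketch, not Edilitics' actual validation code (re.compile(...).groups counts only capturing groups, so (?:...) is excluded):

```python
import re

def count_capture_groups(pattern: str) -> int:
    """Return the number of capturing groups; raises re.error if the pattern is invalid."""
    return re.compile(pattern).groups

count_capture_groups(r"([^@]+)@(.+)")        # 2 capture groups
count_capture_groups(r"(?:https?://)(\S+)")  # 1 - the (?:...) group does not count

# An invalid pattern raises re.error, which is what blocks submission
try:
    count_capture_groups(r"([unclosed")
except re.error:
    pass
```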

Name the output columns

In the Output Columns section, one slot appears per capture group detected. Name each slot - the name becomes the new column in your dataset.

Each name must:

  • Contain only letters, numbers, and underscores
  • Not start with __ or a number
  • Not already exist in the dataset

You can map fewer groups than the pattern contains - remove a slot with its remove icon. You cannot add more slots than the pattern has capture groups.
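The naming rules can be condensed into one check - a hypothetical sketch (is_valid_output_name is not an Edilitics API):

```python
import re

def is_valid_output_name(name: str, existing: set[str]) -> bool:
    # Only letters, numbers, and underscores
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        return False
    # Must not start with "__" or a number
    if name.startswith("__") or name[0].isdigit():
        return False
    # Must not already exist in the dataset
    return name not in existing

existing = {"order_id", "email", "revenue"}
is_valid_output_name("email_user", existing)  # True
is_valid_output_name("2nd_part", existing)    # False - starts with a number
is_valid_output_name("__meta", existing)      # False - starts with __
is_valid_output_name("email", existing)       # False - already exists
```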

Click Save & Preview

Click Save & Preview. Edilitics applies the extraction across all rows and adds the new columns to the dataset. The success toast confirms: "Extracted N column(s) from column_name."

Rows where the pattern does not match receive null in all output columns.

Verify in the preview

Check the new columns. Nulls indicate rows where the pattern did not match. Adjust your pattern if too many rows are returning null.


Before and After

Pattern: ([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}) applied to email.

Output columns: email_user (Group 1), email_domain (Group 2).

Input:

| order_id | email | revenue |
| --- | --- | --- |
| ORD-2024-0001 | contact@meridian.com | 12450.00 |
| ORD-2024-0047 | ops@globalretail.io | 8320.50 |
| ORD-2024-0312 | procurement@techwave.co | 3750.75 |

Output:

| order_id | email | revenue | email_user | email_domain |
| --- | --- | --- | --- | --- |
| ORD-2024-0001 | contact@meridian.com | 12450.00 | contact | meridian.com |
| ORD-2024-0047 | ops@globalretail.io | 8320.50 | ops | globalretail.io |
| ORD-2024-0312 | procurement@techwave.co | 3750.75 | procurement | techwave.co |

Source column email unchanged. Two new string columns added.


Code Equivalent

```sql
-- PostgreSQL: extract username and domain from email
SELECT
  *,
  (REGEXP_MATCH(email, '([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})'))[1] AS email_user,
  (REGEXP_MATCH(email, '([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})'))[2] AS email_domain
FROM orders;
```

```sql
-- BigQuery: use REGEXP_EXTRACT for a single group
SELECT
  *,
  REGEXP_EXTRACT(email, r'([^@]+)@') AS email_user,
  REGEXP_EXTRACT(email, r'@(.+)') AS email_domain
FROM orders;
```
```python
import polars as pl
import re

# Extract capture groups using map_elements (re.search finds the match anywhere in the string)
pattern = r"([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})"

df = df.with_columns([
    pl.col("email")
      .map_elements(
          lambda x: (m := re.search(pattern, str(x))) and m.group(1),
          return_dtype=pl.String
      )
      .alias("email_user"),
    pl.col("email")
      .map_elements(
          lambda x: (m := re.search(pattern, str(x))) and m.group(2),
          return_dtype=pl.String
      )
      .alias("email_domain"),
])
```

After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.


Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.
