Edilitics | Data to Decisions

String Extract

Extract substrings from a text column using regex capture groups. Each capture group becomes a new string column. Preset patterns for email, URL, phone, date, and number. No code needed.

String Extract pulls specific substrings from a text column using a regex pattern with capture groups. Each capture group in your pattern maps to one new output column. The source column is unchanged. Output columns are always string type.

Use String Extract when you need to extract one or more values embedded in a longer string - an ID buried in a log entry, a domain from an email address, or a year from a formatted date string.


When to Use String Extract

  • Email decomposition. Extract username and domain from contact@edilitics.com into two separate columns.
  • Order ID parsing. Pull the year segment from ORD-2024-0001 using a regex group.
  • URL domain extraction. Extract the hostname from a full URL for traffic source analysis.
  • Phone number normalisation. Pull a phone number embedded in a freeform notes field.
  • Log parsing. Extract error codes, user IDs, or timestamps from unstructured log strings.

String Extract targets one source column per operation and uses re.search - it finds the pattern anywhere in the string, not just at the start. If the pattern does not match a row, that row's output columns receive null.
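Python's re.search semantics are easy to verify directly - a minimal sketch (the order-ID pattern here is illustrative, not a built-in preset):

```python
import re

# re.search matches anywhere in the string, not only at the start (unlike re.match)
pattern = re.compile(r"ORD-(\d{4})-(\d{4})")

m = pattern.search("Shipped order ORD-2024-0001 on Friday")
year = m.group(1) if m else None   # "2024"
seq = m.group(2) if m else None    # "0001"

# A non-matching string yields None for every would-be output column
m2 = pattern.search("no order reference here")
missing = m2.group(1) if m2 else None  # None
```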


Sample Dataset

The examples in this doc use the Edilitics sample orders dataset. Download it to follow along in your own workspace.

edilitics_sample_orders.csv - Sample orders dataset for hands-on practice (500 rows)

Relevant columns for String Extract examples:

| Column | Type |
| --- | --- |
| order_id | string |
| email | string |
| notes | string |


Preset Patterns

String Extract includes five built-in presets. Selecting a preset auto-populates the Regex Pattern field and sets the correct number of output column slots.

| Preset | Pattern | Typical use |
| --- | --- | --- |
| Email address | `([a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})` | Extract a full email address embedded in a string |
| URL | `(https?://[^\s]+)` | Extract a URL from freeform text |
| Phone number | `(\+?\d[\d\-\s]{7,}\d)` | Extract a phone number in international or local format |
| Date (YYYY-MM-DD) | `(\d{4}-\d{2}-\d{2})` | Extract an ISO date from a longer string |
| Number | `([\d]+(?:\.\d+)?)` | Extract the first number (integer or decimal) found |
| Custom | - | Type your own pattern with capture groups |

Selecting a preset locks the pattern field with the preset regex. Switch to Custom to type your own.
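You can sanity-check a preset before applying it by running its regex against a sample value - for instance, the Date (YYYY-MM-DD) preset, sketched here with Python's re module, which mirrors the re.search behaviour String Extract uses:

```python
import re

date_pattern = r"(\d{4}-\d{2}-\d{2})"  # Date (YYYY-MM-DD) preset

m = re.search(date_pattern, "Invoice issued 2024-03-15, due in 30 days")
iso_date = m.group(1) if m else None  # "2024-03-15"
```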


How to Apply String Extract

Open the String Extract operation

In your Transform pipeline, click Add Operation and select String Extract from the operation list.

Select the source column

Choose a column from the Source Column dropdown. Only string (text) columns appear - boolean, date, and datetime columns are excluded. In the sample dataset, try email, order_id, or notes.

Choose a preset or enter a custom pattern

From the Preset Pattern dropdown, select one of the five built-in patterns, or select Custom and type your own regex in the Regex Pattern field.

The pattern field shows a live capture group count - for example, "2 capture groups detected." This count determines how many output column slots you can add.

Rules for the pattern:

  • Must contain at least one capture group (...) - non-capturing groups (?:...) do not count
  • Must be a valid regular expression
  • Invalid patterns block submission
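A capture-group count like the one shown in the UI can be computed with Python's re module - a sketch, not Edilitics' actual validation code (re.compile(...).groups counts only capturing groups, so (?:...) is excluded):

```python
import re

def count_capture_groups(pattern: str) -> int:
    """Return the number of capturing groups; raises re.error if the pattern is invalid."""
    return re.compile(pattern).groups

count_capture_groups(r"([^@]+)@(.+)")        # 2 capture groups
count_capture_groups(r"(?:https?://)(\S+)")  # 1 - the (?:...) group does not count

# An invalid pattern raises re.error, which is what blocks submission
try:
    count_capture_groups(r"([unclosed")
except re.error:
    pass
```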

Name the output columns

In the Output Columns section, one slot appears per capture group detected. Name each slot - the name becomes the new column in your dataset.

Each name must:

  • Contain only letters, numbers, and underscores
  • Not start with __ or a number
  • Not already exist in the dataset

You can map fewer groups than the pattern contains - remove a slot with its remove icon. You cannot add more slots than the pattern has capture groups.
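The naming rules can be condensed into one check - a hypothetical sketch (is_valid_output_name is not an Edilitics API):

```python
import re

def is_valid_output_name(name: str, existing: set[str]) -> bool:
    # Only letters, numbers, and underscores
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        return False
    # Must not start with "__" or a number
    if name.startswith("__") or name[0].isdigit():
        return False
    # Must not already exist in the dataset
    return name not in existing

existing = {"order_id", "email", "revenue"}
is_valid_output_name("email_user", existing)  # True
is_valid_output_name("2nd_part", existing)    # False - starts with a number
is_valid_output_name("__meta", existing)      # False - starts with __
is_valid_output_name("email", existing)       # False - already exists
```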

Click Save & Preview

Click Save & Preview. Edilitics applies the extraction across all rows and adds the new columns to the dataset. The success toast confirms: "Extracted N column(s) from column_name."

Rows where the pattern does not match receive null in all output columns.

Verify in the preview

Check the new columns. Nulls indicate rows where the pattern did not match. Adjust your pattern if too many rows are returning null.


Before and After

Pattern: ([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}) applied to email.

Output columns: email_user (Group 1), email_domain (Group 2).

Input:

| order_id | email | revenue |
| --- | --- | --- |
| ORD-2024-0001 | contact@meridian.com | 12450.00 |
| ORD-2024-0047 | ops@globalretail.io | 8320.50 |
| ORD-2024-0312 | procurement@techwave.co | 3750.75 |

Output:

| order_id | email | revenue | email_user | email_domain |
| --- | --- | --- | --- | --- |
| ORD-2024-0001 | contact@meridian.com | 12450.00 | contact | meridian.com |
| ORD-2024-0047 | ops@globalretail.io | 8320.50 | ops | globalretail.io |
| ORD-2024-0312 | procurement@techwave.co | 3750.75 | procurement | techwave.co |

Source column email unchanged. Two new string columns added.


Code Equivalent

```sql
-- PostgreSQL: extract username and domain from email
SELECT
  *,
  (REGEXP_MATCH(email, '([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})'))[1] AS email_user,
  (REGEXP_MATCH(email, '([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})'))[2] AS email_domain
FROM orders;
```

```sql
-- BigQuery: use REGEXP_EXTRACT for a single group
SELECT
  *,
  REGEXP_EXTRACT(email, r'([^@]+)@') AS email_user,
  REGEXP_EXTRACT(email, r'@(.+)') AS email_domain
FROM orders;
```
```python
import polars as pl
import re

# Extract capture groups using map_elements (re.search finds the match anywhere in the string)
pattern = r"([a-zA-Z0-9._%+\-]+)@([a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})"

df = df.with_columns([
    pl.col("email")
      .map_elements(
          lambda x: (m := re.search(pattern, str(x))) and m.group(1),
          return_dtype=pl.String
      )
      .alias("email_user"),
    pl.col("email")
      .map_elements(
          lambda x: (m := re.search(pattern, str(x))) and m.group(2),
          return_dtype=pl.String
      )
      .alias("email_domain"),
])
```

After Save & Preview, the pipeline shows a DQ delta badge on this step - green if the table score improved, red if it dropped. See Data Quality Scoring for how scores are calculated.


Need help? Email support@edilitics.com with your workspace, job ID, and context. We reply within one business day.
