Group By

The Group By functionality in Edilitics empowers users to aggregate data efficiently, delivering powerful insights from their datasets. This feature supports a diverse range of aggregation methods tailored to different data types, ensuring a comprehensive and flexible data analysis process.


Step-by-Step Guide to Utilizing Group By

Select Column to Group By

Choose the column you wish to group by from the dropdown list provided.

Note: Columns with data types such as Object, List, Dict, or Struct will be automatically excluded from this list. Edilitics recommends users flatten these data types before applying the Group By operation to avoid data loss. Users will be informed of the existence of such columns.

Default Aggregation Applied

Upon selecting the group by column, Edilitics automatically applies default aggregation to all other columns in the table:

  • Categorical and Datetime Columns: Aggregated by Count.

  • Numerical Columns: Aggregated by Sum.

Adjust Aggregations

Modify these aggregations from the dropdown options provided. Edilitics ensures only relevant aggregation options are displayed for the corresponding data types.

Categorical and Datetime Columns

Categorical and datetime columns contain discrete values, such as categories (e.g., "Red," "Blue," "Green") or timestamps (e.g., "2023-12-01 14:30:00"). The operations applied to these columns focus on counting occurrences and identifying patterns.

Count (Default)

  • Definition: Computes the total number of occurrences of values in a column.
  • How it Works: Each row with a non-null value contributes to the count.
  • When to Use:
    • To determine how often a category appears in a dataset.
    • To analyze the frequency distribution of categorical variables.
    • To count the number of valid timestamps in a datetime column.

Count Distinct

  • Definition: Calculates the number of unique values in a column.
  • How it Works: Identifies distinct entries in the column and returns their count.
  • When to Use:
    • To check how many different categories exist (e.g., different product types).
    • To determine the number of unique timestamps in a datetime column.

Mode

  • Definition: Returns the most frequently occurring value in a column.
  • How it Works: Finds the value with the highest frequency.
  • When to Use:
    • To identify the most common category (e.g., most popular product sold).
    • To determine the most frequent timestamp in a dataset (e.g., peak transaction time).

Numerical Columns

Numerical columns contain continuous values that can be aggregated using mathematical operations such as sum, mean, and standard deviation.

Sum (Default)

  • Definition: Computes the total sum of all numerical values in a column.
  • How it Works: Adds up all non-null values in the column.
  • When to Use:
    • To find total sales, revenue, or expenses.
    • To aggregate numerical data over a time period (e.g., total website visits in a month).

Count

  • Definition: Counts the number of numerical values in a column.
  • How it Works: Ignores null values and counts only non-null entries.
  • When to Use:
    • To check the number of recorded transactions.
    • To determine the number of valid numerical entries in a dataset.

Count Distinct

  • Definition: Counts the number of unique numerical values in a column.
  • How it Works: Identifies distinct numbers and returns their count.
  • When to Use:
    • To find how many unique prices are present in a product catalog.
    • To determine the number of distinct transaction amounts.

Min (Minimum)

  • Definition: Returns the smallest value in the column.
  • How it Works: Scans all values and finds the lowest one.
  • When to Use:
    • To identify the lowest transaction amount.
    • To determine the earliest recorded temperature in a dataset.

Max (Maximum)

  • Definition: Returns the largest value in the column.
  • How it Works: Scans all values and finds the highest one.
  • When to Use:
    • To determine the highest sales figure recorded.
    • To find the peak website traffic on a given day.

Mean (Average)

  • Definition: Computes the arithmetic mean (sum of values divided by count).
  • How it Works: Adds all values and divides by the number of non-null entries.
  • When to Use:
    • To find the average customer purchase amount.
    • To determine the mean temperature over a time period.

Median

  • Definition: Returns the middle value when the data is sorted.
  • How it Works: Sorts the values and picks the middle one (or the average of two middle values if the count is even).
  • When to Use:
    • To analyze salary distributions (median salary is often more informative than mean).
    • To find the central tendency of house prices, which may be skewed by extreme values.

Mode

  • Definition: Returns the most frequently occurring numerical value.
  • How it Works: Identifies the value appearing most frequently in the dataset.
  • When to Use:
    • To identify the most common transaction amount.
    • To determine the most frequent age in a customer demographic dataset.

Variance

  • Definition: Measures the spread of data points from the mean.
  • How it Works: Calculates the average of squared differences from the mean.
  • When to Use:
    • To analyze the volatility of stock prices.
    • To measure consistency in test scores across students.

Standard Deviation

  • Definition: Measures how much values deviate from the mean.
  • How it Works: Square root of variance.
  • When to Use:
    • To assess risk in financial datasets (e.g., fluctuations in stock prices).
    • To determine consistency in production quality in manufacturing.

Adjust New Column Names

Modify new column names as needed. Edilitics auto-assigns new column names based on the aggregation applied (e.g., sum_column_name or mode_column_name).

Note: Column names should only include alphabetic characters, numbers, and underscores. They cannot start with an underscore (_) or numeric values." This validation prevents errors while loading data to the destination database.

Drop Columns - (Optional)

Users can opt to drop any columns they do not want to include in the final destination table.

Add Fields - (Optional)

Users can add existing columns with different aggregations. This feature allows multiple perspectives on the same data, enabling more comprehensive analysis.

  • Description:

    • Adding fields with different aggregations allows simultaneous views of sum, average, and maximum values, providing deeper insights. This is especially useful for complex data analysis tasks requiring various aspects of data distribution.
  • When to Use:

    • Utilize this feature for multiple aggregated views to better understand trends and patterns. It is beneficial for:

    • Financial reporting to see total, average, and peak values.

    • Sales analysis to understand sales sum and transaction frequency. Performance metrics to combine mean, median, and standard deviation for a complete data variability picture.

Submit

Submit the operation to apply the Group By function and create the aggregated table.


Practical Applications

Retail

  • Objective: Analyze total sales by product category.

  • Scenario:

    • Group By: ProductCategory

    • Default Aggregations Applied:

      • SalesAmount: Sum

      • TransactionDate: Count

    • Adjusted Aggregations:

      • SalesAmount: Sum (default), Count, Mean

      • TransactionDate: Count (default), Count Distinct

    • New Column Names: sum_SalesAmount, count_TransactionDate

    • Use Case: Understand total and average sales by product category for inventory and sales strategy planning.

Healthcare

  • Objective: Aggregate patient visits by department.

  • Scenario:

    • Group By: Department

    • Default Aggregations Applied:

      • VisitID: Count

      • VisitDate: Count

    • Adjusted Aggregations:

      • VisitID: Count (default), Count Distinct

      • VisitDate: Count (default), Count Distinct

    • New Column Names: count_VisitID, count_VisitDate

    • Use Case: Track the number of visits per department to allocate resources efficiently.

Finance

  • Objective: Summarize transaction amounts by account type.

  • Scenario:

    • Group By: AccountType

    • Default Aggregations Applied:

      • TransactionAmount: Sum

      • TransactionDate: Count

    • Adjusted Aggregations:

      • TransactionAmount: Sum (default), Max, Mean

      • TransactionDate: Count (default), Count Distinct

    • New Column Names: sum_TransactionAmount, max_TransactionAmount

    • Use Case: Monitor transaction volumes and average transaction values by account type for financial analysis.

Manufacturing

  • Objective: Calculate total production by factory.

  • Scenario:

    • Group By: Factory

    • Default Aggregations Applied:

      • ProductionUnits: Sum

      • ProductionDate: Count

    • Adjusted Aggregations:

      • ProductionUnits: Sum (default), Min, Max

      • ProductionDate: Count (default), Count Distinct

    • New Column Names: sum_ProductionUnits, min_ProductionUnits

    • Use Case: Assess production efficiency and output by factory to optimize operations.

Education

  • Objective: Aggregate student scores by class.

  • Scenario:

    • Group By: Class

    • Default Aggregations Applied:

      • Score: Sum

      • TestDate: Count

    • Adjusted Aggregations:

      • Score: Sum (default), Mean, Standard Deviation

      • TestDate: Count (default), Count Distinct

    • New Column Names: sum_Score, mean_Score, stddev_Score

    • Use Case: Evaluate class performance and identify areas needing improvement based on student scores.


The Group By functionality in Edilitics provides a robust, user-friendly solution for aggregating and analyzing data. With support for various aggregation methods tailored to different data types, users can efficiently derive meaningful insights from their datasets. This feature enhances data accessibility and usability, making it an indispensable tool for all users.

Need Assistance? Edilitics Support is Here for You!

Our dedicated support team is ready to assist you. If you have any questions or need help using Edilitics, please don't hesitate to contact us at support@edilitics.com. We're committed to ensuring your success!

Don't just manage data, unlock its potential.

Choose Edilitics and gain a powerful advantage in today's data-driven world.