30  Grouping and Summarizing Data

When analyzing data, we often want to focus on subsets or categories—groups—to see how relationships in the data vary across them. Grouping and summarizing data lets us uncover patterns and insights by organizing rows into meaningful groups and reducing each group to summary values.

Imagine you run an online store and sell products worldwide. While looking at total sales is useful, you might also want to understand sales by country, by product, or by month. Grouping data allows you to dissect and view it from various angles, revealing insights that might otherwise be invisible. Before you can analyze sales by country, product, or month, you first have to group the sales data accordingly.

In this chapter, you’ll learn:

  1. How to create groups using group_by().
  2. How to explore and manage grouped data.
  3. How to calculate summaries for each group using summarize().
  4. Advanced techniques for summarizing across multiple columns, handling missing data, and chaining transformations.


30.1 Demonstration Data: Employee and Team Evaluation at MarketTrack

MarketTrack is a fictional project-based consulting startup that relies heavily on its teams’ and employees’ performance to deliver value to clients. The company evaluates its employees across various projects and teams, tracking their contributions and satisfaction levels to understand workload patterns and identify high performers.

The MarketTrack Employee Evaluation dataset employee_data contains performance ratings for employees across the company’s teams and projects. Each row represents a unique evaluation of an employee on a specific project, capturing details about their performance and workload. This dataset provides insights into how employees contribute to different projects and teams over time.

Dataset Structure

  • EmployeeID: Unique identifier for each employee (e.g., “E1”, “E2”, etc.).
  • Team: Team to which the employee is assigned, such as “Team A”, “Team B”, or “Team C”.
  • Project: Name of the project the employee is working on, e.g., “Project X”, “Project Y”, or “Project Z”.
  • Hours_Worked: Number of hours the employee has contributed to the project.
  • Satisfaction_Score: Rating on a scale from 1 to 5, reflecting the employee’s satisfaction with and contribution to the project.

Here’s a preview of employee_data:
# A tibble: 100 × 5
   EmployeeID Team   Project   Hours_Worked Satisfaction_Score
   <chr>      <chr>  <chr>            <int>              <dbl>
 1 E13        Team C Project X           24                1.1
 2 E16        Team A Project Y           32                2.6
 3 E12        Team A Project Z           13                3.8
 4 E20        Team A Project Y           38                4.8
 5 E10        Team A Project Z           14                3.1
 6 E16        Team A Project Y           38                3.8
 7 E3         Team A Project X           38                2.3
 8 E5         Team C Project X           19                1.4
 9 E4         Team C Project Y           31                1.9
10 E11        Team B Project X           39                2.1
# ℹ 90 more rows


30.2 Creating Groups with group_by()

What Are Groups in Data?

A group is a collection of rows that share a common value in one or more columns. Groups help organize data for targeted analysis, making it easier to:

  • Compare metrics across categories.
  • Identify trends or outliers within specific groups.
  • Make informed decisions based on summarized data.

For example, in the MarketTrack employee_data dataset, grouping by the Team column creates three groups: Team A, Team B, and Team C. Each group contains rows where Team has the same value.



Creating Groups

To create groups in your data, use the group_by() function from the dplyr package. Grouping organizes rows into groups based on the values in one or more columns.

Syntax

The basic syntax for the group_by() function is

# Basic syntax for group_by()
grouped_data <- data |> 
  group_by(column)

Grouping by Team

Here’s how to group the MarketTrack employee_data by Team:

# Grouping by Team
grouped_data <- employee_data |> 
  group_by(Team)

# Preview grouped data
grouped_data
# A tibble: 100 × 5
# Groups:   Team [3]
   EmployeeID Team   Project   Hours_Worked Satisfaction_Score
   <chr>      <chr>  <chr>            <int>              <dbl>
 1 E13        Team C Project X           24                1.1
 2 E16        Team A Project Y           32                2.6
 3 E12        Team A Project Z           13                3.8
 4 E20        Team A Project Y           38                4.8
 5 E10        Team A Project Z           14                3.1
 6 E16        Team A Project Y           38                3.8
 7 E3         Team A Project X           38                2.3
 8 E5         Team C Project X           19                1.4
 9 E4         Team C Project Y           31                1.9
10 E11        Team B Project X           39                2.1
# ℹ 90 more rows

The output shows that the data is now grouped by Team.

Groups Are Logical, Not Visual

When you use group_by(), the data is grouped logically, but the rows remain in their original order. The “Groups: Team [3]” label confirms that there are 3 groups, but the rows for each group are not yet clustered together in the dataset.

Imagine an activity where groups are formed by having participants call out numbers in sequence: “1”, “2”, “3”, and so on. Once everyone has called out their number, the groups are formed, but the participants are still standing where they started. Similarly, in group_by(), the data is assigned to groups but remains in its original arrangement.

If you want the rows for each group to appear together in the dataset, you can use arrange() to cluster them visually:

# Group and arrange data by Team
employee_data |> 
  group_by(Team) |> 
  arrange(Team)
# A tibble: 100 × 5
# Groups:   Team [3]
   EmployeeID Team   Project   Hours_Worked Satisfaction_Score
   <chr>      <chr>  <chr>            <int>              <dbl>
 1 E16        Team A Project Y           32                2.6
 2 E12        Team A Project Z           13                3.8
 3 E20        Team A Project Y           38                4.8
 4 E10        Team A Project Z           14                3.1
 5 E16        Team A Project Y           38                3.8
 6 E3         Team A Project X           38                2.3
 7 E18        Team A Project Z           11                4.5
 8 E2         Team A Project Z           29                3.4
 9 E7         Team A Project X           24                3.9
10 E15        Team A Project Y           36                3.3
# ℹ 90 more rows


Exploring Grouped Data

Grouped data frames behave differently from regular data frames:

  1. Operations like summarize() or mutate() are applied within each group (see the sketch after this list).
  2. The grouping columns remain visible, but rows are treated as part of their group.
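
To see point 1 in action, here’s a quick sketch using only columns from employee_data: with the data grouped by Team, mutate() computes the mean within each team, so every row receives its own team’s average rather than the overall average.

# Grouped mutate(): each row gets its team's average hours
employee_data |> 
  group_by(Team) |> 
  mutate(Team_Avg_Hours = mean(Hours_Worked, na.rm = TRUE))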

You can inspect grouped data using:

  • group_vars(): Lists the columns used for grouping.
  • n_groups(): Counts the number of groups.

Inspecting Groups

# Inspecting groups
group_vars(grouped_data)  # Which columns define groups
[1] "Team"
n_groups(grouped_data)    # How many groups exist
[1] 3


30.3 Summarizing Groups with summarize()

With your data grouped by key categories, the summarize() function allows you to calculate metrics—such as totals, averages, or counts—that are specific to each group. These metrics are often summary statistics or derived insights that provide high-level information about your data in the context of the group.

The primary purpose of summarize() is to condense all rows within a group into a single summary row, resulting in an output that usually has far fewer rows than the original dataset. This makes it a powerful tool for aggregating data and extracting actionable insights.

For example:

  • Grouping sales data by product categories and calculating the total sales for each category.
  • Grouping employee data by team and calculating the average satisfaction score for each team.
  • Grouping customers by their willingness to pay and counting how many fall at each price point.

This reduction in data granularity shifts the focus from individual rows to high-level patterns and trends within groups, which sharpens analysis and supports decision-making.

Syntax

# Basic syntax for summarize()
summarized_data <- data |> 
  summarize(new_column = summary_function(column))

Common Summarizing Functions

Many familiar functions can serve as summarizing functions. The most common fall into a few families:

Centrality

Centrality functions are useful for calculating the central tendency of variables:

  • mean(): Calculates the average
  • median(): Calculates the median

Mean and Median Hours Worked by Team
# Calculate average and median hours worked per team
employee_data |> 
  group_by(Team) |> 
  summarize(
    Avg_Hours = mean(Hours_Worked, na.rm = TRUE),
    Med_Hours = median(Hours_Worked, na.rm = TRUE)
  )
# A tibble: 3 × 3
  Team   Avg_Hours Med_Hours
  <chr>      <dbl>     <dbl>
1 Team A      25.1      26.5
2 Team B      23.6      22.5
3 Team C      27.1      27.5

Spread

Spread functions are useful for measuring how widely the values of a variable vary:

  • sd(): Calculates the standard deviation
  • var(): Calculates the variance
  • IQR(): Calculates the interquartile range
  • min(): Finds the smallest value
  • max(): Finds the largest value

Standard Deviation and Interquartile Range of Hours Worked by Team
# Calculate standard deviation and interquartile range of hours worked per team
employee_data |> 
  group_by(Team) |> 
  summarize(
    sd_Hours = sd(Hours_Worked, na.rm = TRUE),
    iqr_Hours = IQR(Hours_Worked, na.rm = TRUE)
  )
# A tibble: 3 × 3
  Team   sd_Hours iqr_Hours
  <chr>     <dbl>     <dbl>
1 Team A    11.0      21.8 
2 Team B    10.2      18.5 
3 Team C     6.98      9.75
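
min() and max() work the same way. For instance, a quick sketch of the range of hours worked within each team:

# Smallest and largest hours worked per team
employee_data |> 
  group_by(Team) |> 
  summarize(
    Min_Hours = min(Hours_Worked, na.rm = TRUE),
    Max_Hours = max(Hours_Worked, na.rm = TRUE)
  )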

Counts

Count functions tally rows, count unique values, or sum values, and are useful for understanding the size of each group:

  • sum(): Adds up all values.
  • n(): Counts the number of rows in the group.
  • n_distinct(): Counts the number of unique values in a column.

Counting Employees and Total Hours Worked per Team
# Count the number of employees per team and total hours worked per team
employee_data |> 
  group_by(Team) |> 
  summarize(
    Num_Employees = n_distinct(EmployeeID),
    Total_Hours = sum(Hours_Worked, na.rm = TRUE)
  )
# A tibble: 3 × 3
  Team   Num_Employees Total_Hours
  <chr>          <int>       <int>
1 Team A            15         653
2 Team B            17         851
3 Team C            16        1030
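
Note the difference between n() and n_distinct() here: because an employee can be evaluated on several projects, n() counts evaluations (rows), while n_distinct(EmployeeID) counts unique employees. A quick sketch of the contrast:

# Evaluations (rows) vs. unique employees per team
employee_data |> 
  group_by(Team) |> 
  summarize(
    Num_Evaluations = n(),                   # rows in each group
    Num_Employees = n_distinct(EmployeeID)   # unique employees in each group
  )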

Cumulative Summaries

Cumulative functions calculate running totals or statistics within a group. Because they return one value per row rather than one per group, they are paired with mutate() instead of summarize():

  • cumsum(): Cumulative sum of values.
  • cummin(): Cumulative minimum value.
  • cummax(): Cumulative maximum value.
  • cummean(): Cumulative mean (from the dplyr package).

Cumulative Hours Worked by Team
# Cumulative hours worked by team
employee_data |> 
  arrange(Team, Hours_Worked) |>  # Arrange for cumulative calculation
  group_by(Team) |> 
  mutate(Cumulative_Hours = cumsum(Hours_Worked))
# A tibble: 100 × 6
# Groups:   Team [3]
   EmployeeID Team   Project   Hours_Worked Satisfaction_Score Cumulative_Hours
   <chr>      <chr>  <chr>            <int>              <dbl>            <int>
 1 E19        Team A Project X           10                2.8               10
 2 E2         Team A Project Z           10                3                 20
 3 E18        Team A Project Z           11                4.5               31
 4 E18        Team A Project Z           12                4                 43
 5 E5         Team A Project X           12                4.9               55
 6 E12        Team A Project Z           13                3.8               68
 7 E11        Team A Project Y           13                1.4               81
 8 E10        Team A Project Z           14                3.1               95
 9 E5         Team A Project Y           14                2.2              109
10 E20        Team A Project Y           17                3.9              126
# ℹ 90 more rows

Logical Summaries

Logical functions evaluate whether certain conditions are met:

  • any(): Checks if at least one value is TRUE.
  • all(): Checks if all values are TRUE.

Checking Satisfaction Thresholds by Team
# Check if any or all satisfaction scores exceed 4 in each team
employee_data |> 
  group_by(Team) |> 
  summarize(
    Any_High_Satisfaction = any(Satisfaction_Score > 4),
    All_High_Satisfaction = all(Satisfaction_Score > 4)
  )
# A tibble: 3 × 3
  Team   Any_High_Satisfaction All_High_Satisfaction
  <chr>  <lgl>                 <lgl>                
1 Team A TRUE                  FALSE                
2 Team B TRUE                  FALSE                
3 Team C TRUE                  FALSE                
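
Because R treats TRUE as 1 and FALSE as 0, taking the mean() of a condition gives the proportion of rows that meet it. For instance, a quick sketch of the share of evaluations in each team with a satisfaction score above 4:

# Proportion of evaluations with satisfaction above 4, per team
employee_data |> 
  group_by(Team) |> 
  summarize(Share_High_Satisfaction = mean(Satisfaction_Score > 4, na.rm = TRUE))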

Custom Summaries

Sometimes, you may need a metric that isn’t provided by the built-in functions. In such cases, you can write the calculation directly inside summarize(), or define a reusable helper with function() or the anonymous-function shorthand \(x):

Weighted Average of Satisfaction Scores
# Calculate weighted average satisfaction score (hypothetical weights)
employee_data |> 
  group_by(Team) |> 
  summarize(
    Weighted_Avg_Satisfaction = sum(Satisfaction_Score * Hours_Worked) / sum(Hours_Worked)
  )
# A tibble: 3 × 2
  Team   Weighted_Avg_Satisfaction
  <chr>                      <dbl>
1 Team A                      3.05
2 Team B                      3.25
3 Team C                      3.19
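
For a reusable custom summary, here’s a sketch using the \(x) shorthand. The cv() helper below is hypothetical (not part of dplyr); it computes the coefficient of variation, the standard deviation relative to the mean:

# Define a custom summary function (hypothetical helper, not part of dplyr)
cv <- \(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Coefficient of variation of hours worked, per team
employee_data |> 
  group_by(Team) |> 
  summarize(CV_Hours = cv(Hours_Worked))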


Exercise: Using group_by() and summarize()

Try it yourself:


Let’s revisit the Muscle Cola customer dataset, introduced in Section 18.5.2, to practice grouping and summarizing transformations. Muscle Cola is a protein-infused cola drink designed for fitness enthusiasts, and this dataset contains survey responses from potential customers.

In the original dataset, each row represents a customer’s response to a specified price for a serving of Muscle Cola. Respondents were asked how many servings they would purchase per month at each of a series of prices ($0.50 to $2.50). This approach aligns with the economics of demand, as it captures the relationship between price and quantity demanded for non-durable goods like Muscle Cola, which are consumed and need to be repurchased regularly. However, asking customers to evaluate multiple prices in a single survey can be overwhelming, potentially leading to less thoughtful or accurate responses.

For this exercise, we’ll use a simplified variation of the data that employs an alternative survey design. Instead of asking customers how many servings they would buy at multiple prices, respondents were asked just two questions:

  • What is the most you would be willing to pay for one serving of Muscle Cola?
  • At that price, how many servings would you purchase in a typical month?

This simplified approach has trade-offs. While it may underestimate the total quantity demanded at lower prices—since respondents are not explicitly asked about their behavior at reduced prices—the brevity of the survey can lead to more thoughtful and reliable answers. By focusing on willingness to pay and expected quantity at that price, this data provides an opportunity to explore demand estimation through grouping, arranging, and cumulative summarization techniques.

  • Customer Preferences:

    • Gym_preference: Indicates whether the respondent prefers a gym workout, another form of exercise, or both.
    • Gym_member: Specifies whether the respondent is a gym member (“Yes”/“No”).
    • Gym_freq: Numeric value representing the average number of times per week the respondent goes to the gym, ranging from occasional to daily attendance.
    • Protein_importance: Likert-scale response indicating how important protein intake is to the respondent, ranging from “Below Average” to “Above Average”.
  • Customer Demand:

    • WTP: The maximum amount the respondent would be willing to pay for a single serving.
    • Quantity: The number of servings the respondent would consume per month if the price matched their willingness to pay.
  • Demographics:

    • Gender: Gender of the respondent.
    • Age: Numeric value indicating the respondent’s age.
    • Weight: The respondent’s weight in pounds, which may be relevant for understanding protein needs.

In economics, the logic of demand implies that customers who are willing to pay a higher price would also buy at lower prices. Therefore, the quantity demanded by customers with a higher willingness to pay must be added to the quantity demanded at lower price points to correctly calculate total demand at each price level.

To transform this variation on the customer dataset muscle_cola_data into proper demand data:

  1. Group by Willingness to Pay: Use group_by() to group the data by each unique willingness-to-pay value reported by customers. Each group represents the potential price point and its associated customers.

  2. Calculate Total Quantity: Use summarize() to calculate the total quantity that would be consumed by customers at each willingness-to-pay value.

  3. Arrange the Data: Sort the summarized data in descending order of willingness to pay (highest to lowest). This ensures that the cumulative calculations aggregate from higher to lower price points.

  4. Calculate Quantity Demanded: For each price (willingness-to-pay value), calculate the cumulative sum of total quantities at that price and all higher prices. This creates a proper demand curve by folding the quantities demanded by higher-paying customers into the total demand at lower price levels.

Specify the dataset as muscle_cola_data, then

  1. group the data by willingness to pay values with group_by()
  2. using summarize(), create a new variable as the sum of all quantity at each willingness to pay value
  3. arrange the data from highest to lowest willingness to pay with arrange()
  4. using mutate(), calculate quantity demanded as the cumulative sum of total (summed) quantities at each willingness-to-pay value

Fully-worked solution:

muscle_cola_demand_data <- muscle_cola_data |> 
  group_by(WTP) |>                                # 1. group by willingness to pay
  summarize(Total_Q = sum(Quantity)) |>           # 2. total quantity at each WTP
  arrange(desc(WTP)) |>                           # 3. highest to lowest price
  mutate(Quantity_Demanded = cumsum(Total_Q))     # 4. cumulative quantity demanded
muscle_cola_demand_data


Tips for Using Summarizing Functions

  1. Always verify if missing values (NA) need to be handled with na.rm = TRUE.
  2. Choose the summarizing function that best matches your business question.
  3. Combine multiple summaries in one call for efficiency and clarity.

By mastering summarizing functions, you can derive valuable insights from your grouped data. In the next section, we’ll explore advanced techniques like summarizing across multiple columns and chaining transformations.



30.4 Summarizing Multiple Columns with across() [advanced and optional]

In most cases, you’ll use summarize() to calculate summary statistics for a single variable, such as finding the average satisfaction score. However, some datasets may include multiple related variables that require the same summarizing operation. For example, you might have several satisfaction metrics (Satisfaction_1, Satisfaction_2) that you want to average.

The across() function makes it easy to apply the same operation to multiple columns at once. This approach is especially efficient for handling larger datasets with multiple related variables.

Syntax

The basic syntax for the across() function is

across(
  columns_to_transform,
  \(x) transformation_function(x, additional_arguments)
)

where

  • columns_to_transform: Specifies the columns to be transformed (e.g., starts_with(“Satisfaction”)).
  • \(x): Defines an anonymous function to be applied to each column.
  • transformation_function(x): Specifies the operation to perform (e.g., mean(x)).
  • additional_arguments: Pass arguments directly to the transformation function (e.g., na.rm = TRUE).

Averaging Multiple Satisfaction Metrics

Suppose our employee data includes three different types of employee satisfaction scores:

  1. Work-Life Balance Satisfaction: Satisfaction with the ability to balance work and personal life (Satisfaction_Work_Life)
  2. Team Dynamics Satisfaction: Satisfaction with collaboration and support within their team (Satisfaction_Team_Dynamics)
  3. Project Engagement Satisfaction: Satisfaction with their interest and engagement in the projects they work on (Satisfaction_Project_Engagement)

The expanded dataset, employee_satisfaction_data, has this structure:
Rows: 100
Columns: 7
$ EmployeeID                      <chr> "E15", "E19", "E14", "E3", "E10", "E18…
$ Team                            <chr> "Team B", "Team A", "Team C", "Team C"…
$ Project                         <chr> "Project Y", "Project Z", "Project X",…
$ Hours_Worked                    <int> 30, 26, 40, 21, 39, 12, 39, 34, 27, 24…
$ Satisfaction_Work_Life          <dbl> 3.8, 2.8, 4.3, 1.2, 4.0, 2.4, 4.3, 3.1…
$ Satisfaction_Team_Dynamics      <dbl> 4.4, 2.8, 3.9, 1.4, 1.9, 4.8, 4.0, 4.3…
$ Satisfaction_Project_Engagement <dbl> 4.5, 3.6, 4.7, 1.7, 2.3, 4.2, 5.0, 2.3…

Using the new employee_satisfaction_data dataset, you can use across() to calculate the mean of each satisfaction metric in a single call:

# Calculate the mean for multiple satisfaction metrics
employee_satisfaction_data |> 
  summarize(across(starts_with("Satisfaction"), \(x) mean(x, na.rm = TRUE)))
# A tibble: 1 × 3
  Satisfaction_Work_Life Satisfaction_Team_Dynamics Satisfaction_Project_Engag…¹
                   <dbl>                      <dbl>                        <dbl>
1                   3.14                       3.01                         2.87
# ℹ abbreviated name: ¹​Satisfaction_Project_Engagement

This produces one summary column for each satisfaction metric.
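
across() also combines naturally with group_by(). For example, a quick sketch computing the same three averages separately for each team:

# Mean of each satisfaction metric, per team
employee_satisfaction_data |> 
  group_by(Team) |> 
  summarize(across(starts_with("Satisfaction"), \(x) mean(x, na.rm = TRUE)))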

When to Use across()

While across() is highly efficient, it’s most useful when:

  • You need to apply the same operation to multiple related variables (e.g., satisfaction metrics, product ratings).
  • Your dataset includes multiple columns with similar types of data.

If your dataset only includes one variable of interest (e.g., a single Satisfaction column), across() is unnecessary. Instead, use a simple summarize() call:

# Simple summarization for one variable
employee_data |> 
  summarize(Avg_Satisfaction = mean(Satisfaction_Score, na.rm = TRUE))
# A tibble: 1 × 1
  Avg_Satisfaction
             <dbl>
1             3.11

Key Takeaways

  • Use Simple Syntax First: For most datasets with one column of interest, a straightforward summarize() call is sufficient.
  • Optional for Multiple Variables: When working with datasets containing many related columns, across() can streamline your code.
  • AI Code Awareness: Tools like AI assistants often generate code using across() because it’s efficient and modern. While this may seem advanced, understanding across() helps you interpret and adapt such code.


30.5 Managing Grouping Structures

Ungrouping with ungroup()

After summarizing, you may want to remove grouping to avoid unexpected behavior in subsequent transformations. Use ungroup():

# Ungrouping data
ungrouped_data <- grouped_data |> 
  ungroup()
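
Why does this matter? A grouped data frame keeps applying operations within each group. In the sketch below, the first mutate() computes per-team averages because the data is still grouped, while the second mutate(), after ungroup(), computes a single overall average:

# Grouped vs. ungrouped behavior of mutate()
employee_data |> 
  group_by(Team) |> 
  mutate(Team_Avg = mean(Hours_Worked, na.rm = TRUE)) |>    # within each team
  ungroup() |> 
  mutate(Overall_Avg = mean(Hours_Worked, na.rm = TRUE))    # across all rows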

Best Practices for Working with Groups

  1. Group Effectively:
    • Choose grouping columns thoughtfully to avoid excessive complexity.
  2. Verify Group Structures:
    • Check your groups with group_vars() and n_groups() before summarizing.
  3. Ungroup When Necessary:
    • Always ungroup() after summarizing if further operations don’t require grouping.

Common Pitfalls and Debugging Tips

  1. Forgetting to Group:
    • summarize() without group_by() produces a single summary for the entire dataset.
  2. Misunderstanding Output:
    • Be clear on how n() and n_distinct() differ when counting rows or unique values.
  3. Handling Missing Data:
    • Always include na.rm = TRUE when working with columns that might contain NA values.

This chapter provides a comprehensive guide to grouping and summarizing data, empowering you to uncover patterns and make data-driven decisions.