28 Grouping and Summarizing Data
When analyzing data, we often want to focus on subsets or categories—groups—for additional insight into relationships in the data and how they vary across groups. Grouping and summarizing data lets us uncover patterns and insights by organizing rows into meaningful groups and reducing those groups into summarized values.
Imagine you run an online store and sell products worldwide. While looking at total sales is useful, you might also want to understand sales by country, by product, or by month. Grouping data allows you to dissect and view it from various angles, revealing insights that might be invisible otherwise. Before you can analyze sales by country, product, or month, you first have to group the sales data by country, product, or month.
In this chapter, you’ll learn:
- How to create groups using group_by().
- How to explore and manage grouped data.
- How to calculate summaries for each group using summarize().
- Advanced techniques for summarizing across multiple columns, handling missing data, and chaining transformations.
28.1 Demonstration Data: Employee and Team Evaluation at MarketTrack
MarketTrack is a fictional project-based consulting startup that relies heavily on its teams’ and employees’ performance to deliver value to clients. The company evaluates its employees across various projects and teams, tracking their contributions and satisfaction levels to understand workload patterns and identify high performers.
The MarketTrack Employee Evaluation dataset employee_data contains performance ratings for employees across various teams and projects at MarketTrack, a company specializing in project-based consulting. Each row represents a unique evaluation for an employee on a specific project, capturing details about their performance and workload. This dataset provides insights into how employees contribute to different projects and teams over time.
Dataset Structure
- EmployeeID: Unique identifier for each employee (e.g., “E1”, “E2”, etc.).
- Team: Team to which the employee is assigned, such as “Team A”, “Team B”, or “Team C”.
- Project: Name of the project the employee is working on, e.g., “Project X”, “Project Y”, or “Project Z”.
- Hours_Worked: Number of hours the employee has contributed to the project.
- Satisfaction_Score: Performance rating of the employee on a scale from 1 to 5, reflecting their satisfaction and contributions to the project.
28.2 Creating Groups with group_by()
What Are Groups in Data?
A group is a collection of rows that share a common value in one or more columns. Groups help organize data for targeted analysis, making it easier to:
- Compare metrics across categories.
- Identify trends or outliers within specific groups.
- Make informed decisions based on summarized data.
For example, in the MarketTrack employee_data dataset, grouping by the Team column creates three groups: Team A, Team B, and Team C. Each group contains rows where Team has the same value.
Creating Groups
To create groups in your data, use the group_by() function from the dplyr package. Grouping organizes rows into groups based on the values in one or more columns.
Syntax
The basic syntax for the group_by() function is

# Basic syntax for group_by()
grouped_data <- data |>
  group_by(column)
Grouping by Team
Here’s how to group the MarketTrack employee_data by Team:

# Grouping by Team
grouped_data <- employee_data |>
  group_by(Team)
# Preview grouped data
grouped_data
# A tibble: 100 × 5
# Groups: Team [3]
EmployeeID Team Project Hours_Worked Satisfaction_Score
<chr> <chr> <chr> <int> <dbl>
1 E13 Team C Project X 24 1.1
2 E16 Team A Project Y 32 2.6
3 E12 Team A Project Z 13 3.8
4 E20 Team A Project Y 38 4.8
5 E10 Team A Project Z 14 3.1
6 E16 Team A Project Y 38 3.8
7 E3 Team A Project X 38 2.3
8 E5 Team C Project X 19 1.4
9 E4 Team C Project Y 31 1.9
10 E11 Team B Project X 39 2.1
# ℹ 90 more rows
The output shows that the data is now grouped by Team.
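You can also pass more than one column to group_by(), in which case each unique combination of values forms its own group. A quick sketch (result not shown here); with three teams and three projects, this creates up to nine groups:

# Grouping by every Team/Project combination
employee_data |>
  group_by(Team, Project)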
Exploring Grouped Data
Grouped data frames behave differently from regular data frames:
- Operations like summarize() or mutate() are applied within each group.
- The grouping columns remain visible, but rows are treated as part of their group.
You can inspect grouped data using:
- group_vars(): Lists the columns used for grouping.
- n_groups(): Counts the number of groups.
Inspecting Groups
# Inspecting groups
group_vars(grouped_data) # Which columns define groups
[1] "Team"
n_groups(grouped_data) # How many groups exist
[1] 3
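dplyr also provides helpers for looking inside the groups themselves, such as group_keys() and group_size(); these are part of dplyr but are not used elsewhere in this chapter, so treat this as an optional aside:

# Further inspection helpers from dplyr
group_keys(grouped_data) # One row per group: Team A, Team B, Team C
group_size(grouped_data) # Number of rows in each group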
28.3 Summarizing Groups with summarize()
With your data grouped by key categories, the summarize() function allows you to calculate metrics—such as totals, averages, or counts—that are specific to each group. These metrics are often summary statistics or derived insights that provide high-level information about your data in the context of the group.
The primary purpose of summarize() is to condense all rows within a group into a single summary row, resulting in an output that usually has far fewer rows than the original dataset. This makes it a powerful tool for aggregating data and extracting actionable insights.
For example:
- Grouping sales data by product categories and calculating the total sales for each category.
- Grouping employee data by team and calculating the average satisfaction score for each team.
- Grouping customers by their willingness to pay and calculating the total count at each price point.
This reduction in data granularity shifts the focus from individual rows to high-level patterns and trends within groups, which sharpens analysis and supports decision-making.
Syntax
# Basic syntax for summarize()
summarize(data, new_column = summary_function(column))
Common Summarizing Functions
Many common functions can be used as summarizing functions here:
Centrality
Centrality functions are useful for calculating the central tendency of variables:
- mean(): Calculates the average.
- median(): Calculates the median.
Mean and Median Hours Worked by Team
# Calculate average and median hours worked per team
employee_data |>
  group_by(Team) |>
  summarize(
    Avg_Hours = mean(Hours_Worked, na.rm = TRUE),
    Med_Hours = median(Hours_Worked)
  )
# A tibble: 3 × 3
Team Avg_Hours Med_Hours
<chr> <dbl> <dbl>
1 Team A 25.1 26.5
2 Team B 23.6 22.5
3 Team C 27.1 27.5
Spread
Spread functions are useful for measuring the variability of variables:
- sd(): Calculates the standard deviation.
- var(): Calculates the variance.
- IQR(): Calculates the interquartile range.
- min(): Finds the smallest value.
- max(): Finds the largest value.
Standard Deviation and Interquartile Range of Hours Worked by Team
# Calculate standard deviation and interquartile range of hours worked per team
employee_data |>
  group_by(Team) |>
  summarize(
    sd_Hours = sd(Hours_Worked, na.rm = TRUE),
    iqr_Hours = IQR(Hours_Worked)
  )
# A tibble: 3 × 3
Team sd_Hours iqr_Hours
<chr> <dbl> <dbl>
1 Team A 11.0 21.8
2 Team B 10.2 18.5
3 Team C 6.98 9.75
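min() and max() follow the same pattern. For instance, a sketch of the range of hours worked per team (output omitted):

# Smallest and largest hours worked per team
employee_data |>
  group_by(Team) |>
  summarize(
    Min_Hours = min(Hours_Worked),
    Max_Hours = max(Hours_Worked)
  )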
Counts
Count functions count or sum rows or unique values, which is useful for understanding the size of each group:
- sum(): Adds up all values.
- n(): Counts the number of rows in the group.
- n_distinct(): Counts the number of unique values in a column.
Counting Employees and Total Hours Worked per Team
# Count the number of employees per team and total hours worked per team
employee_data |>
  group_by(Team) |>
  summarize(
    Num_Employees = n_distinct(EmployeeID),
    Total_Hours = sum(Hours_Worked)
  )
# A tibble: 3 × 3
Team Num_Employees Total_Hours
<chr> <int> <int>
1 Team A 15 653
2 Team B 17 851
3 Team C 16 1030
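Note the choice of n_distinct() here: an employee can be evaluated on several projects, so n() would count evaluations rather than people. A sketch contrasting the two (output omitted):

# Rows per group vs. unique employees per group
employee_data |>
  group_by(Team) |>
  summarize(
    Num_Evaluations = n(),                 # every row in the group
    Num_Employees = n_distinct(EmployeeID) # unique employees only
  )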
Cumulative Summaries
Cumulative functions calculate running totals or statistics within a group:
- cumsum(): Cumulative sum of values.
- cummin(): Cumulative minimum value.
- cummax(): Cumulative maximum value.
- cummean(): Cumulative mean (from the dplyr package).

Because cumulative functions return one value per row rather than one per group, they pair with mutate() instead of summarize(), as in the example below.
Cumulative Hours Worked by Team
# Cumulative hours worked by team
employee_data |>
  arrange(Team, Hours_Worked) |> # Arrange for cumulative calculation
  group_by(Team) |>
  mutate(Cumulative_Hours = cumsum(Hours_Worked))
# A tibble: 100 × 6
# Groups: Team [3]
EmployeeID Team Project Hours_Worked Satisfaction_Score Cumulative_Hours
<chr> <chr> <chr> <int> <dbl> <int>
1 E19 Team A Project X 10 2.8 10
2 E2 Team A Project Z 10 3 20
3 E18 Team A Project Z 11 4.5 31
4 E18 Team A Project Z 12 4 43
5 E5 Team A Project X 12 4.9 55
6 E12 Team A Project Z 13 3.8 68
7 E11 Team A Project Y 13 1.4 81
8 E10 Team A Project Z 14 3.1 95
9 E5 Team A Project Y 14 2.2 109
10 E20 Team A Project Y 17 3.9 126
# ℹ 90 more rows
Logical Summaries
Logical functions evaluate whether certain conditions are met:
- any(): Checks if at least one value is TRUE.
- all(): Checks if all values are TRUE.
Checking Satisfaction Thresholds by Team
# Check if any or all satisfaction scores exceed 4 in each team
employee_data |>
  group_by(Team) |>
  summarize(
    Any_High_Satisfaction = any(Satisfaction_Score > 4),
    All_High_Satisfaction = all(Satisfaction_Score > 4)
  )
# A tibble: 3 × 3
Team Any_High_Satisfaction All_High_Satisfaction
<chr> <lgl> <lgl>
1 Team A TRUE FALSE
2 Team B TRUE FALSE
3 Team C TRUE FALSE
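A closely related idiom, offered here as an aside: because TRUE counts as 1 and FALSE as 0 in R, taking the mean() of a logical condition gives the proportion of rows that meet it, which is often more informative than a plain TRUE/FALSE answer. A sketch (output omitted):

# Share of evaluations with satisfaction above 4, per team
employee_data |>
  group_by(Team) |>
  summarize(Share_High_Satisfaction = mean(Satisfaction_Score > 4))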
Custom Summaries
Sometimes, you may need a metric that isn’t provided by the default functions. In such cases, you can create custom summary functions using function() or anonymous functions like \(x):
Weighted Average of Satisfaction Scores
# Calculate average satisfaction score, weighted by hours worked
employee_data |>
  group_by(Team) |>
  summarize(
    Weighted_Avg_Satisfaction = sum(Satisfaction_Score * Hours_Worked) / sum(Hours_Worked)
  )
# A tibble: 3 × 2
Team Weighted_Avg_Satisfaction
<chr> <dbl>
1 Team A 3.05
2 Team B 3.25
3 Team C 3.19
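To tie this back to the function() syntax mentioned above, the same metric can be packaged as a reusable helper. A minimal sketch, using weighted_satisfaction as a helper name made up for this example, which should reproduce the table above:

# Define a reusable custom summary function
weighted_satisfaction <- function(score, hours) {
  sum(score * hours) / sum(hours)
}

employee_data |>
  group_by(Team) |>
  summarize(Weighted_Avg_Satisfaction = weighted_satisfaction(Satisfaction_Score, Hours_Worked))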
Exercise: Using group_by() and summarize()
Try it yourself:
Let’s revisit the Muscle Cola customer dataset, introduced in Section 18.4.2, to practice grouping and summarizing transformations. Muscle Cola is a protein-infused cola drink designed for fitness enthusiasts, and this dataset contains survey responses from potential customers.
In the original dataset, each row represents a customer’s response to a specified price for a serving of Muscle Cola. Respondents were asked how many servings they would purchase per month at each of a series of prices ($0.50 to $2.50). This approach aligns with the economics of demand, as it captures the relationship between price and quantity demanded for non-durable goods like Muscle Cola, which are consumed and need to be repurchased regularly. However, asking customers to evaluate multiple prices in a single survey can be overwhelming, potentially leading to less thoughtful or accurate responses.
For this exercise, we’ll use a simplified variation of the data that employs an alternative survey design. Instead of asking customers how many servings they would buy at multiple prices, respondents were asked just two questions:
- What is the most you would be willing to pay for one serving of Muscle Cola?
- At that price, how many servings would you purchase in a typical month?
This simplified approach has trade-offs. While it may underestimate the total quantity demanded at lower prices—since respondents are not explicitly asked about their behavior at reduced prices—the brevity of the survey can lead to more thoughtful and reliable answers. By focusing on willingness to pay and expected quantity at that price, this data provides an opportunity to explore demand estimation through grouping, arranging, and cumulative summarization techniques.
Customer Preferences:
- Gym_preference: Indicates whether the respondent prefers a gym workout, another form of exercise, or both.
- Gym_member: Specifies whether the respondent is a gym member (“Yes”/“No”).
- Gym_freq: Numeric value representing the average number of times per week the respondent goes to the gym, ranging from occasional to daily attendance.
- Protein_importance: Likert-scale response indicating how important protein intake is to the respondent, ranging from “Below Average” to “Above Average”.
Customer Demand:
- WTP: The maximum amount the respondent would be willing to pay for a single serving.
- Quantity: The number of servings the respondent would consume per month if the price matched that respondent’s willingness to pay.
Demographics:
- Gender: Gender of the respondent.
- Age: Numeric value indicating the respondent’s age.
- Weight: The respondent’s weight in pounds, which may be relevant for understanding protein needs.
In economics, the concept of customer demand dictates that customers who are willing to pay a higher price would also be willing to pay lower prices. Therefore, the quantity demanded by customers with higher willingness to pay must be added to the quantity demanded at lower price points to correctly calculate the total demand at each price level.
To transform this variation on the customer dataset muscle_cola_data into proper demand data:
1. Group by Willingness to Pay: Use group_by() to group the data by each unique willingness-to-pay value reported by customers. Each group represents a potential price point and its associated customers.
2. Calculate Total Quantity: Use summarize() to calculate the total quantity that would be consumed by customers at each willingness-to-pay value.
3. Arrange the Data: Arrange the data in descending order of willingness to pay (highest to lowest). This ensures that the cumulative calculations reflect the logic of aggregating from higher to lower price points.
4. Calculate Quantity Demanded: For each price (willingness-to-pay value), calculate the cumulative sum of total quantities down to that price point. This creates a proper demand curve by aggregating the quantities of higher-paying customers into the total demand at lower price levels.
Specify the dataset as muscle_cola_data, then
- group the data by willingness-to-pay values with group_by()
- using summarize(), create a new variable as the sum of all quantity at each willingness-to-pay value
- arrange the data from highest to lowest willingness to pay with arrange()
- using mutate(), calculate quantity demanded as the cumulative sum of total (summed) quantities at each willingness-to-pay value
Fully-worked solution:
muscle_cola_demand_data <- muscle_cola_data |>
  group_by(WTP) |>
  summarize(Total_Q = sum(Quantity)) |>
  arrange(desc(WTP)) |>
  mutate(Quantity_Demanded = cumsum(Total_Q))

muscle_cola_demand_data
Tips for Using Summarizing Functions
- Always verify if missing values (NA) need to be handled with na.rm = TRUE.
- Choose the summarizing function that best matches your business question.
- Combine multiple summaries in one call for efficiency and clarity.
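The first tip matters because most summary functions return NA when any input value is missing. A quick illustration:

# Missing values propagate unless na.rm = TRUE
mean(c(10, 20, NA))               # returns NA
mean(c(10, 20, NA), na.rm = TRUE) # returns 15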
By mastering summarizing functions, you can derive valuable insights from your grouped data. In the next section, we’ll explore advanced techniques like summarizing across multiple columns and chaining transformations.
28.4 Summarizing Multiple Columns with across()
[advanced and optional]
In most cases, you’ll use summarize() to calculate summary statistics for a single variable, such as finding the average satisfaction score. However, some datasets may include multiple related variables that require the same summarizing operation. For example, you might have several satisfaction metrics (Satisfaction_1, Satisfaction_2) that you want to average.
The across() function makes it easy to apply the same operation to multiple columns at once. This approach is especially efficient for handling larger datasets with multiple related variables.
Syntax
The basic syntax for the across() function is

across(
  columns_to_transform,
  \(x) transformation_function(x, additional_arguments)
)
where
- columns_to_transform: Specifies the columns to be transformed (e.g., starts_with("Satisfaction")).
- \(x): Defines an anonymous function to be applied to each column.
- transformation_function(x): Specifies the operation to perform (e.g., mean(x)).
- additional_arguments: Passes arguments directly to the transformation function (e.g., na.rm = TRUE).
Averaging Multiple Satisfaction Metrics
Suppose our employee data includes three different types of employee satisfaction scores:
- Work-Life Balance Satisfaction: Satisfaction with the ability to balance work and personal life (Satisfaction_Work_Life)
- Team Dynamics Satisfaction: Satisfaction with collaboration and support within their team (Satisfaction_Team_Dynamics)
- Project Engagement Satisfaction: Satisfaction with their interest and engagement in the projects they work on (Satisfaction_Project_Engagement)
Rows: 100
Columns: 7
$ EmployeeID <chr> "E15", "E19", "E14", "E3", "E10", "E18…
$ Team <chr> "Team B", "Team A", "Team C", "Team C"…
$ Project <chr> "Project Y", "Project Z", "Project X",…
$ Hours_Worked <int> 30, 26, 40, 21, 39, 12, 39, 34, 27, 24…
$ Satisfaction_Work_Life <dbl> 3.8, 2.8, 4.3, 1.2, 4.0, 2.4, 4.3, 3.1…
$ Satisfaction_Team_Dynamics <dbl> 4.4, 2.8, 3.9, 1.4, 1.9, 4.8, 4.0, 4.3…
$ Satisfaction_Project_Engagement <dbl> 4.5, 3.6, 4.7, 1.7, 2.3, 4.2, 5.0, 2.3…
Using the new employee_satisfaction_data dataset, you can use across() to calculate the mean of each satisfaction metric in a single call:
# Calculate the mean for multiple satisfaction metrics
employee_satisfaction_data |>
  summarize(across(starts_with("Satisfaction"), \(x) mean(x, na.rm = TRUE)))
# A tibble: 1 × 3
Satisfaction_Work_Life Satisfaction_Team_Dynamics Satisfaction_Project_Engag…¹
<dbl> <dbl> <dbl>
1 3.14 3.01 2.87
# ℹ abbreviated name: ¹Satisfaction_Project_Engagement
This produces one summary column for each satisfaction metric.
When to Use across()
While across() is highly efficient, it’s most useful when:
- You need to apply the same operation to multiple related variables (e.g., satisfaction metrics, product ratings).
- Your dataset includes multiple columns with similar types of data.
If your dataset only includes one variable of interest (e.g., a single Satisfaction column), across() is unnecessary. Instead, use a simple summarize() call:
# Simple summarization for one variable
employee_data |>
  summarize(Avg_Satisfaction = mean(Satisfaction_Score, na.rm = TRUE))
# A tibble: 1 × 1
Avg_Satisfaction
<dbl>
1 3.11
Key Takeaways
- Use Simple Syntax First: For most datasets with one column of interest, a straightforward summarize() call is sufficient.
- Optional for Multiple Variables: When working with datasets containing many related columns, across() can streamline your code.
- AI Code Awareness: Tools like AI assistants often generate code using across() because it’s efficient and modern. While this may seem advanced, understanding across() helps you interpret and adapt such code.
28.5 Managing Grouping Structures
Ungrouping with ungroup()
After summarizing, you may want to remove grouping to avoid unexpected behavior in subsequent transformations. Use ungroup():
# Ungrouping data
ungrouped_data <- grouped_data |>
  ungroup()
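To see the kind of unexpected behavior ungroup() prevents, consider a sketch: on grouped data, mutate() computes within each group, so sum(Hours_Worked) below is the team total rather than the overall total (Share_of_Team_Hours and Share_of_All_Hours are illustrative names):

# On grouped data, sum() is computed per team
grouped_data |>
  mutate(Share_of_Team_Hours = Hours_Worked / sum(Hours_Worked))

# After ungrouping, sum() uses every row in the dataset
grouped_data |>
  ungroup() |>
  mutate(Share_of_All_Hours = Hours_Worked / sum(Hours_Worked))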
Best Practices for Working with Groups
- Group Effectively: Choose grouping columns thoughtfully to avoid excessive complexity.
- Verify Group Structures: Check your groups with group_vars() and n_groups() before summarizing.
- Ungroup When Necessary: Always ungroup() after summarizing if further operations don’t require grouping.
Common Pitfalls and Debugging Tips
- Forgetting to Group: summarize() without group_by() produces a single summary for the entire dataset.
- Misunderstanding Output: Be clear on how n() and n_distinct() differ when counting rows or unique values.
- Handling Missing Data: Always include na.rm = TRUE when working with columns that might contain NA values.
This chapter provides a comprehensive guide to grouping and summarizing data, empowering you to uncover patterns and make data-driven decisions.