# A tibble: 2 × 4
Product Jan Feb Mar
<chr> <dbl> <dbl> <dbl>
1 Product A 120 130 150
2 Product B 90 110 95
25 Tidy Data
Tidy data is essential for data analysis within the tidyverse
, providing consistency and compatibility with data manipulation, visualization, and modeling tools. However, in real-world scenarios, data is often not organized this way. Understanding what tidy data is, why data is untidy and how to convert from untidy to tidy data can help you better structure your data for analysis.
25.1 Tidy Data Principles
The tidy data structure adheres to three main principles:
- Each Variable Forms a Column: Each variable should be in its own column.
- Each Observation Forms a Row: Each row represents one unique observation.
- Each Type of Observational Unit Forms a Table: Different types of observational units should be in separate tables.
25.2 Why Tidy Data Matters in the Tidyverse
Tidy data is essential for data analysis within the tidyverse, providing consistency and compatibility with data manipulation, visualization, and modeling tools. Here’s why tidy data matters:
- Simplified Manipulation: Consistent structure makes operations like filtering, grouping, and summarizing easier and more intuitive.
- Visualization Compatibility: Visualization tools, like
ggplot2
, work best with tidy data because they rely on data structure consistency for plotting variables. - Better Collaboration: Tidy data is easier to understand, share, and collaborate on, as it follows a standard format that others can quickly interpret.
25.3 Why Is Data Often Untidy?
Data is often collected and organized to serve purposes beyond analytics, and these other uses can lead to untidy data structures that are more challenging to work with. Some common reasons for untidy data include:
Incremental Data Collection Over Time: In spreadsheets, data is frequently added over time, with new columns created for each time period. While this approach visually separates each period, it results in a wide format that is harder to analyze.
Survey Structures with Parent and Child Questions: Survey data is commonly wide because of parent and child question structures. When a survey question has several sub-questions or asks respondents to rate items in different contexts, each context or sub-question becomes a separate column.
Optimizing for Human Readability: Data in spreadsheets is often organized for readability rather than analysis. Merged cells, summary rows, and descriptive headers make data easier to interpret at a glance but result in an untidy structure.
Legacy Systems and Inconsistent Data Standards: In organizations, data often comes from various departments or systems with differing data structures. Integrating data from these sources requires tidying to make it usable for analysis.
Custom Reporting Requirements: Many reports are designed for presentation, not analysis. This often results in wide layouts or additional visual elements that need to be restructured to achieve a tidy format.
25.4 Types of Untidy Data
Many datasets come in forms that do not follow tidy data principles. Here are common types of untidy data:
- Wide Data: Variables spread across multiple columns, often seen in spreadsheet exports.
- Long Data: Observational units are stacked in rows without clearly separated variables.
Wide Format (Months as Columns)
Example: Consider this wide-format dataset representing monthly sales:
This data is not tidy because each month (e.g., Jan
, Feb
, Mar
) appears as a separate column. In tidy data, each variable forms its own column, but here, the column names Jan
, Feb
, and Mar
are actually values of a variable we might call Month.
Additionally, the values in these columns represent sales revenue, so they would ideally belong to a column labeled Sales_Revenue.
To make this dataset tidy:
- Each row should represent a single observation (product sales in a specific month).
- Each variable should form its own column (e.g.,
Product
,Month
, andSales_Revenue
).
In the next sections, we’ll discuss how to convert this into a tidy format.
# A tibble: 2 × 4
Region Jan Feb Mar
<chr> <dbl> <dbl> <dbl>
1 North 100 110 115
2 South 80 90 100
This wide-format dataset shows monthly sales by region, with each month in a separate column.
Example 2: Long Format (Observations Stacked in Rows)
# A tibble: 4 × 3
City Measurement Value
<chr> <chr> <dbl>
1 NYC January Temp 32
2 NYC February Temp 35
3 LA January Temp 68
4 LA February Temp 70
In this dataset, temperature measurements are stacked rather than separated into variables, making it hard to analyze by month.
Exercise: Recognizing Untidy Data
Try it yourself:
The dataset price_data
contains data for prices of three products over three months.
- Is the data in
price_data
tidy? - What are the variables in
price_data
? - Are the variables different from the columns in
price_data
?
Hint 1
Consider the principles of tidy data: Each variable should form a column, each observation should form a row, and each type of observational unit should form a separate table. What variables are contained in the data table and do they match the variable names?
Fully worked solution:
This dataset is not in a tidy format:
- The variables of the dataset are products, monthly prices and the month. The column labels refer primarily to the month while the values are for prices.
- Each row represents an observation about a product which may be tidy except that the data represent prices per month. In that case, a single observation would have one product, one price and one month.
- This dataset is not tidy because every column is not a variable and every row is not an observation.
How might the current structure affect an analysis that compares sales trends across months?
The untidy data in price_data
would be prohibitively difficult to analyze in its untidy form. For example, univariate EDA could apply to the prices but the histogram needs to point to a single variable in the aesthetic (x = price) and our prices our spread out over three columns. We need the prices in a single column to analyze with a histogram.
25.5 Tidying Data with pivot_longer()
The pivot_longer()
function is essential for converting wide data into a tidy format by “melting” columns into rows.
Syntax of pivot_longer()
<- old_data %>%
new_data pivot_longer(
cols = ..., # Columns to pivot
names_to = ..., # Name for the new column containing the old column names
values_to = ... # Name for the new column containing values
)
Converting Wide to Tidy with pivot_longer()
ShopEase Customer Satisfaction Survey Dataset (Wide Format with NA Values)
This dataset represents customer satisfaction ratings for ShopEase, a company collecting feedback on several of its products. Each row corresponds to a unique customer who has provided satisfaction ratings for different products (Product A, Product B, Product C, and Product D). However, not all customers have experience with each product, so there are some missing values (NA
) where a customer did not provide a rating for a particular product.
Dataset Structure
- CustomerID: Unique identifier for each customer (e.g., “C1”, “C2”, etc.).
- Satisfaction_ProductA: Satisfaction rating for Product A on a scale from 1 to 5, with
NA
indicating no rating. - Satisfaction_ProductB: Satisfaction rating for Product B on a scale from 1 to 5, with
NA
indicating no rating. - Satisfaction_ProductC: Satisfaction rating for Product C on a scale from 1 to 5, with
NA
indicating no rating. - Satisfaction_ProductD: Satisfaction rating for Product D on a scale from 1 to 5, with
NA
indicating no rating.
This dataset is structured in a wide format, where each product’s satisfaction rating is stored in a separate column. While this layout is useful initially, it can become cumbersome for deeper analysis.
Rows: 50
Columns: 5
$ CustomerID <chr> "C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", …
$ Satisfaction_ProductA <int> 3, NA, 3, 2, 2, NA, 3, 5, 4, NA, NA, 1, 2, 3, 5,…
$ Satisfaction_ProductB <int> 5, 2, 1, 1, 3, 1, NA, 5, 1, 2, 4, 4, NA, NA, 3, …
$ Satisfaction_ProductC <int> 2, 4, 2, 2, NA, 4, 4, NA, 1, NA, NA, NA, 3, NA, …
$ Satisfaction_ProductD <int> 4, 3, 5, 4, 4, 4, NA, 1, 2, 3, 4, 3, 1, NA, 5, 5…
Tidying ShopEase Data with pivot_longer()
Let’s use tidying techniques to transform the dataset into a more manageable, “tidy” format where each row represents a unique combination of CustomerID
and Product
.
Using pivot_longer()
, we transform the data so that all product satisfaction ratings are consolidated into two columns: one for the product name (Product
) and another for the rating (Satisfaction
).
# Tidying the data with pivot_longer()
<- customer_satisfaction |>
tidy_customer_satisfaction pivot_longer(
cols = starts_with("Satisfaction_Product"), # choosing which columns to pivot
names_to = "Product", # old column names become values of new variable Product
names_prefix = "Satisfaction_Product", # pivot can ignore the leading text of column names
values_to = "Satisfaction" # values of the old dataset are pivoted into a new variable "Satisfaction
)
# View the tidied data
tidy_customer_satisfaction
# A tibble: 200 × 3
CustomerID Product Satisfaction
<chr> <chr> <int>
1 C1 A 3
2 C1 B 5
3 C1 C 2
4 C1 D 4
5 C2 A NA
6 C2 B 2
7 C2 C 4
8 C2 D 3
9 C3 A 3
10 C3 B 1
# ℹ 190 more rows
The new tidy dataset
tidy_customer_satisfaction
has three columns, one for each variableCustomer_ID
,Product
, andSatisfaction
. Every column is a variable. Every row is an observation of a customer rating her satisfaction with each product.After practicing with
pivot_longer()
, you can also try usingpivot_wider()
to return the dataset to its original wide format, reinforcing your understanding of these tidying functions.
Example: Tidying monthly sales data.
<- wide_sales_data |>
tidy_sales_data pivot_longer(cols = -Region, names_to = "Month", values_to = "Sales")
tidy_sales_data
# A tibble: 6 × 3
Region Month Sales
<chr> <chr> <dbl>
1 North Jan 100
2 North Feb 110
3 North Mar 115
4 South Jan 80
5 South Feb 90
6 South Mar 100
Exercise: Tidy a Wide Dataset
Try it yourself:
Customer data from the protein-infused Muscle Cola was introduced in Chapter 18 in the exercise generating a scatter plot in Section 18.4.2. The original untidy data has been imported here as muscle_cola_wide
.
The variables are
Customer Preferences:
Gym_preference
: Indicates whether the respondent prefers a gym workout, another form of exercise, or both.Gym_member
: Specifies whether the respondent is a gym member (“Yes”/“No”).Gym_freq
: Numeric value representing the average number of times per week the respondent goes to the gym, ranging from occasional to daily attendance.Protein_importance
: Likert-scale response indicating how important protein intake is to the respondent, ranging from “Below Average” to “Above Average”.
Price Sensitivity:
P0.50
,P1.00
,P1.50
,P2.00
,P2.50
: Quantity the respondent is willing to purchase at different price points (in dollars), where the column name indicates the price (e.g., P0.50 represents $0.50, P1.00 represents $1.00).
Demographics:
Gender
: Gender of the respondent.Age
: Numeric value indicating the respondent’s age.Weight
: The respondent’s weight in pounds, which may be relevant for understanding protein needs.
For muscle_cola_wide
, use pivot_longer()
to tidy the data.
Hint 1
Focus on the column names and variable values to identify the variables in the dataset
Use
pivot_longer()
to tidy the dataset
Hint 2
There are many variables in this dataset. The data is wide in the price/quantity variables where P0.50 - P2.50 are clearly wide. The variables should be Price with values from the columns labels and Quantity with values from the values in the columns
Call the
pivot_longer()
function specifying data asmuscle_cola_wide
andP0.50
,P1.00
,P1.50
,P1.00
, andP2.50
are the columns to be pivoted to be longer. The column names are translated into values for a variable calledPrice
. The column values move to a column namedQuantity.
The names prefix for the columns to be pivoted and to have their names converted into price data isP.
Fully worked solution:
# Call the `pivot_longer()` function specifying data as `muscle_cola_wide`
<- muscle_cola_wide |>
muscle_cola_long pivot_longer(
cols = c(P0.50, P1.00, P1.50, P2.00, P2.50), # Explicitly list price columns
names_to = "Price",
values_to = "Quantity",
names_prefix = "P"
)
muscle_cola_long
25.6 Tidying Data with pivot_wider()
The pivot_wider()
function is used to convert long data into a wide format by spreading rows into columns.
Syntax of pivot_wider()
The basic syntax of pivot_wider()
is
<- old_data %>%
new_data pivot_wider(
names_from = ..., # Column containing new column names
values_from = ... # Column containing values to be spread
)
Demo: Converting Long to Wide with pivot_wider()
In customer data, it is much less common to have long data that needs to be made wider to be tidy. Here we demonstrate pivot_wider()
with census data from the us_rent_income
dataset. This dataset provides information on median income and median rent across different U.S. states. Each row represents an observation for either rent or income in a particular state, allowing comparisons of income levels relative to rental costs.
Dataset Structure
GEOID
: A unique identifier for each state.NAME
: The name of the state (e.g., “Alabama”, “Alaska”).variable
: Indicates the type of data, either “income” or “rent”, for that particular row. estimate: The estimated median value for the specified variable (either median income or median rent).moe
: The margin of error for the estimate, providing a measure of uncertainty around the estimate.
This dataset allows for analysis of rent and income levels across states, supporting insights into cost-of-living differences and affordability.
head(us_rent_income)
# A tibble: 6 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama income 24476 136
2 01 Alabama rent 747 3
3 02 Alaska income 32940 508
4 02 Alaska rent 1200 13
5 04 Arizona income 27517 148
6 04 Arizona rent 972 4
Notice that the variable column signifies whether that row is for the median income
or median rent
for every GEOID
and state
. Having two variables in one column means that we need to use pivot_wider()
to separate and analyze income and rent. We also need to widen the margin of error data (moe
) which has values for income
and rent
for every GEOID
and state
.
# Pivot temperature data from long to wide format
|>
us_rent_income pivot_wider(
names_from = variable,
values_from = c(estimate, moe)
)
# A tibble: 52 × 6
GEOID NAME estimate_income estimate_rent moe_income moe_rent
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 24476 747 136 3
2 02 Alaska 32940 1200 508 13
3 04 Arizona 27517 972 148 4
4 05 Arkansas 23789 709 165 5
5 06 California 29454 1358 109 3
6 08 Colorado 32401 1125 109 5
7 09 Connecticut 35326 1123 195 5
8 10 Delaware 31560 1076 247 10
9 11 District of Columbia 43198 1424 681 17
10 12 Florida 25952 1077 70 3
# ℹ 42 more rows
Here both estimate and moe are values columns, so we supplied them both to values_from
.
25.7 Chapter Summary: Tidy Data
In this chapter, we explored the foundational concepts of tidy data and learned how to reshape and consolidate data using pivot_longer()
to convert wide-format data into a tidy, long format. Through the Muscle Cola dataset, we examined customer responses related to their willingness to purchase at various price points and their workout preferences.
A few practical tips will aid you in recognizing and tidying data:
- Check Data Structure: Look at the column and row layout before transformations.
- Identify Columns to Pivot: Avoid unnecessary columns in
pivot_longer()
orpivot_wider()
. - Spot Repeated Observations: Repeated or hierarchical observations need special attention.
By practicing these tidying techniques, you will be able to:
- Recognize when data is untidy and identify key indicators of a wide format.
- Use
pivot_longer()
to reorganize data columns effectively, transforming multiple columns of related values into a single column for easier analysis. - Interpret consolidated data, preparing it for further exploration and analysis in subsequent chapters.
These tidying and consolidation techniques ensure data consistency, making it easier to conduct meaningful analysis and draw insights to inform business decisions.