29  Tidy Data

Tidy data is essential for data analysis within the tidyverse, providing consistency and compatibility with data manipulation, visualization, and modeling tools. However, in real-world scenarios, data is often not organized this way. Understanding what tidy data is, why data is untidy and how to convert from untidy to tidy data can help you better structure your data for analysis.



29.1 Tidy Data Principles

The tidy data structure adheres to three main principles:

  1. Each Variable Forms a Column: Each variable should be in its own column.
  2. Each Observation Forms a Row: Each row represents one unique observation.
  3. Each Type of Observational Unit Forms a Table: Different types of observational units should be in separate tables.


29.2 Why Tidy Data Matters in the Tidyverse

Tidy data is essential for data analysis within the tidyverse, providing consistency and compatibility with data manipulation, visualization, and modeling tools. Here’s why tidy data matters:

  • Simplified Manipulation: Consistent structure makes operations like filtering, grouping, and summarizing easier and more intuitive.
  • Visualization Compatibility: Visualization tools, like ggplot2, work best with tidy data because they rely on data structure consistency for plotting variables.
  • Better Collaboration: Tidy data is easier to understand, share, and collaborate on, as it follows a standard format that others can quickly interpret.


29.3 Why Is Data Often Untidy?

Data is often collected and organized to serve purposes beyond analytics, and these other uses can lead to untidy data structures that are more challenging to work with. Some common reasons for untidy data include:

  • Incremental Data Collection Over Time: In spreadsheets, data is frequently added over time, with new columns created for each time period. While this approach visually separates each period, it results in a wide format that is harder to analyze.

  • Survey Structures with Parent and Child Questions: Survey data is commonly wide because of parent and child question structures. When a survey question has several sub-questions or asks respondents to rate items in different contexts, each context or sub-question becomes a separate column.

  • Optimizing for Human Readability: Data in spreadsheets is often organized for readability rather than analysis. Merged cells, summary rows, and descriptive headers make data easier to interpret at a glance but result in an untidy structure.

  • Legacy Systems and Inconsistent Data Standards: In organizations, data often comes from various departments or systems with differing data structures. Integrating data from these sources requires tidying to make it usable for analysis.

  • Custom Reporting Requirements: Many reports are designed for presentation, not analysis. This often results in wide layouts or additional visual elements that need to be restructured to achieve a tidy format.



29.4 Types of Untidy Data

Many datasets come in forms that do not follow tidy data principles. Here are common types of untidy data:

  • Wide Data: Variables spread across multiple columns, often seen in spreadsheet exports.
  • Long Data: Observational units are stacked in rows without clearly separated variables.

Wide Format (Months as Columns)

Example: Consider this wide-format dataset representing monthly sales:

# A tibble: 2 × 4
  Product     Jan   Feb   Mar
  <chr>     <dbl> <dbl> <dbl>
1 Product A   120   130   150
2 Product B    90   110    95

This data is not tidy because each month (e.g., Jan, Feb, Mar) appears as a separate column. In tidy data, each variable forms its own column, but here, the column names Jan, Feb, and Mar are actually values of a variable we might call Month. Additionally, the values in these columns represent sales revenue, so they would ideally belong to a column labeled Sales_Revenue.

To make this dataset tidy:

  • Each row should represent a single observation (product sales in a specific month).
  • Each variable should form its own column (e.g., Product, Month, and Sales_Revenue).

In the next sections, we’ll discuss how to convert this into a tidy format.


# A tibble: 2 × 4
  Region   Jan   Feb   Mar
  <chr>  <dbl> <dbl> <dbl>
1 North    100   110   115
2 South     80    90   100

This wide-format dataset shows monthly sales by region, with each month in a separate column.

Example 2: Long Format (Observations Stacked in Rows)

# A tibble: 4 × 3
  City  Measurement   Value
  <chr> <chr>         <dbl>
1 NYC   January Temp     32
2 NYC   February Temp    35
3 LA    January Temp     68
4 LA    February Temp    70

In this dataset, temperature measurements are stacked rather than separated into variables, making it hard to analyze by month.

Exercise: Recognizing Untidy Data

Try it yourself:


The dataset price_data contains data for prices of three products over three months.

  1. Is the data in price_data tidy?
  2. What are the variables in price_data?
  3. Are the variables different from the columns in price_data?

Hint 1

Consider the principles of tidy data: Each variable should form a column, each observation should form a row, and each type of observational unit should form a separate table. What variables are contained in the data table and do they match the variable names?

Fully worked solution:

This dataset is not in a tidy format:

  • The variables of the dataset are products, monthly prices and the month. The column labels refer primarily to the month while the values are for prices.
  • Each row represents an observation about a product which may be tidy except that the data represent prices per month. In that case, a single observation would have one product, one price and one month.
  • This dataset is not tidy because every column is not a variable and every row is not an observation.

How might the current structure affect an analysis that compares sales trends across months?

The untidy data in price_data would be prohibitively difficult to analyze in its untidy form. For example, univariate EDA could apply to the prices but the histogram needs to point to a single variable in the aesthetic (x = price) and our prices our spread out over three columns. We need the prices in a single column to analyze with a histogram.



29.5 Tidying Data with pivot_longer()

The pivot_longer() function is essential for converting wide data into a tidy format by “melting” columns into rows.

Syntax of pivot_longer()

new_data <- old_data %>%
  pivot_longer(
    cols = ...,       # Columns to pivot
    names_to = ...,   # Name for the new column containing the old column names
    values_to = ...   # Name for the new column containing values
  )

Converting Wide to Tidy with pivot_longer()

ShopEase Customer Satisfaction Survey Dataset (Wide Format with NA Values)

This dataset represents customer satisfaction ratings for ShopEase, a company collecting feedback on several of its products. Each row corresponds to a unique customer who has provided satisfaction ratings for different products (Product A, Product B, Product C, and Product D). However, not all customers have experience with each product, so there are some missing values (NA) where a customer did not provide a rating for a particular product.

Dataset Structure
  • CustomerID: Unique identifier for each customer (e.g., “C1”, “C2”, etc.).
  • Satisfaction_ProductA: Satisfaction rating for Product A on a scale from 1 to 5, with NA indicating no rating.
  • Satisfaction_ProductB: Satisfaction rating for Product B on a scale from 1 to 5, with NA indicating no rating.
  • Satisfaction_ProductC: Satisfaction rating for Product C on a scale from 1 to 5, with NA indicating no rating.
  • Satisfaction_ProductD: Satisfaction rating for Product D on a scale from 1 to 5, with NA indicating no rating.

This dataset is structured in a wide format, where each product’s satisfaction rating is stored in a separate column. While this layout is useful initially, it can become cumbersome for deeper analysis.

Rows: 50
Columns: 5
$ CustomerID            <chr> "C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", …
$ Satisfaction_ProductA <int> 3, NA, 3, 2, 2, NA, 3, 5, 4, NA, NA, 1, 2, 3, 5,…
$ Satisfaction_ProductB <int> 5, 2, 1, 1, 3, 1, NA, 5, 1, 2, 4, 4, NA, NA, 3, …
$ Satisfaction_ProductC <int> 2, 4, 2, 2, NA, 4, 4, NA, 1, NA, NA, NA, 3, NA, …
$ Satisfaction_ProductD <int> 4, 3, 5, 4, 4, 4, NA, 1, 2, 3, 4, 3, 1, NA, 5, 5…

Tidying ShopEase Data with pivot_longer()

Let’s use tidying techniques to transform the dataset into a more manageable, “tidy” format where each row represents a unique combination of CustomerID and Product.

Using pivot_longer(), we transform the data so that all product satisfaction ratings are consolidated into two columns: one for the product name (Product) and another for the rating (Satisfaction).

# Tidying the data with pivot_longer()
tidy_customer_satisfaction <- customer_satisfaction |> 
  pivot_longer(
    cols = starts_with("Satisfaction_Product"), # choosing which columns to pivot
    names_to = "Product",                       # old column names become values of new variable Product
    names_prefix = "Satisfaction_Product",      # pivot can ignore the leading text of column names
    values_to = "Satisfaction"                  # values of the old dataset are pivoted into a new variable "Satisfaction
  )

# View the tidied data
tidy_customer_satisfaction
# A tibble: 200 × 3
   CustomerID Product Satisfaction
   <chr>      <chr>          <int>
 1 C1         A                  3
 2 C1         B                  5
 3 C1         C                  2
 4 C1         D                  4
 5 C2         A                 NA
 6 C2         B                  2
 7 C2         C                  4
 8 C2         D                  3
 9 C3         A                  3
10 C3         B                  1
# ℹ 190 more rows

The new tidy dataset tidy_customer_satisfaction has three columns, one for each variable Customer_ID, Product, and Satisfaction. Every column is a variable. Every row is an observation of a customer rating her satisfaction with each product.

After practicing with pivot_longer(), you can also try using pivot_wider() to return the dataset to its original wide format, reinforcing your understanding of these tidying functions.

Example: Tidying monthly sales data.

tidy_sales_data <- wide_sales_data |> 
  pivot_longer(cols = -Region, names_to = "Month", values_to = "Sales")
tidy_sales_data
# A tibble: 6 × 3
  Region Month Sales
  <chr>  <chr> <dbl>
1 North  Jan     100
2 North  Feb     110
3 North  Mar     115
4 South  Jan      80
5 South  Feb      90
6 South  Mar     100

Exercise: Tidy a Wide Dataset

Try it yourself:


Customer data from the protein-infused Muscle Cola was introduced in Chapter 18 in the exercise generating a scatter plot in Section 18.5.2. The original untidy data has been imported here as muscle_cola_wide.

The variables are

  • Customer Preferences:

    • Gym_preference: Indicates whether the respondent prefers a gym workout, another form of exercise, or both.
    • Gym_member: Specifies whether the respondent is a gym member (“Yes”/“No”).
    • Gym_freq: Numeric value representing the average number of times per week the respondent goes to the gym, ranging from occasional to daily attendance.
    • Protein_importance: Likert-scale response indicating how important protein intake is to the respondent, ranging from “Below Average” to “Above Average”.
  • Price Sensitivity:

    • P0.50, P1.00, P1.50, P2.00, P2.50: Quantity the respondent is willing to purchase at different price points (in dollars), where the column name indicates the price (e.g., P0.50 represents $0.50, P1.00 represents $1.00).
  • Demographics:

    • Gender: Gender of the respondent.
    • Age: Numeric value indicating the respondent’s age.
    • Weight: The respondent’s weight in pounds, which may be relevant for understanding protein needs.

For muscle_cola_wide, use pivot_longer() to tidy the data.

Hint 1

  1. Focus on the column names and variable values to identify the variables in the dataset

  2. Use pivot_longer() to tidy the dataset

Hint 2

  1. There are many variables in this dataset. The data is wide in the price/quantity variables where P0.50 - P2.50 are clearly wide. The variables should be Price with values from the columns labels and Quantity with values from the values in the columns

  2. Call the pivot_longer() function specifying data as muscle_cola_wide and P0.50, P1.00, P1.50, P1.00, and P2.50 are the columns to be pivoted to be longer. The column names are translated into values for a variable called Price. The column values move to a column named Quantity. The names prefix for the columns to be pivoted and to have their names converted into price data is P.

Fully worked solution:

# Call the `pivot_longer()` function specifying data as `muscle_cola_wide`
muscle_cola_long <- muscle_cola_wide |> 
    pivot_longer(
          cols = c(P0.50, P1.00, P1.50, P2.00, P2.50), # Explicitly list price columns
          names_to = "Price",
          values_to = "Quantity",
          names_prefix = "P"
     )

muscle_cola_long


29.6 Tidying Data with pivot_wider()

The pivot_wider() function is used to convert long data into a wide format by spreading rows into columns.

Syntax of pivot_wider()

The basic syntax of pivot_wider() is

new_data <- old_data %>%
  pivot_wider(
    names_from = ...,  # Column containing new column names
    values_from = ...  # Column containing values to be spread
  )

Demo: Converting Long to Wide with pivot_wider()

In customer data, it is much less common to have long data that needs to be made wider to be tidy. Here we demonstrate pivot_wider() with census data from the us_rent_income dataset. This dataset provides information on median income and median rent across different U.S. states. Each row represents an observation for either rent or income in a particular state, allowing comparisons of income levels relative to rental costs.

Dataset Structure

  • GEOID: A unique identifier for each state.
  • NAME: The name of the state (e.g., “Alabama”, “Alaska”).
  • variable: Indicates the type of data, either “income” or “rent”, for that particular row. estimate: The estimated median value for the specified variable (either median income or median rent).
  • moe: The margin of error for the estimate, providing a measure of uncertainty around the estimate.

This dataset allows for analysis of rent and income levels across states, supporting insights into cost-of-living differences and affordability.

head(us_rent_income)
# A tibble: 6 × 5
  GEOID NAME    variable estimate   moe
  <chr> <chr>   <chr>       <dbl> <dbl>
1 01    Alabama income      24476   136
2 01    Alabama rent          747     3
3 02    Alaska  income      32940   508
4 02    Alaska  rent         1200    13
5 04    Arizona income      27517   148
6 04    Arizona rent          972     4

Notice that the variable column signifies whether that row is for the median income or median rent for every GEOID and state. Having two variables in one column means that we need to use pivot_wider() to separate and analyze income and rent. We also need to widen the margin of error data (moe) which has values for income and rent for every GEOID and state.

# Pivot temperature data from long to wide format
us_rent_income |> 
  pivot_wider(
    names_from = variable, 
    values_from = c(estimate, moe)
  )
# A tibble: 52 × 6
   GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
   <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
 1 01    Alabama                        24476           747        136        3
 2 02    Alaska                         32940          1200        508       13
 3 04    Arizona                        27517           972        148        4
 4 05    Arkansas                       23789           709        165        5
 5 06    California                     29454          1358        109        3
 6 08    Colorado                       32401          1125        109        5
 7 09    Connecticut                    35326          1123        195        5
 8 10    Delaware                       31560          1076        247       10
 9 11    District of Columbia           43198          1424        681       17
10 12    Florida                        25952          1077         70        3
# ℹ 42 more rows

Here both estimate and moe are values columns, so we supplied them both to values_from.



29.7 Chapter Summary: Tidy Data

In this chapter, we explored the foundational concepts of tidy data and learned how to reshape and consolidate data using pivot_longer() to convert wide-format data into a tidy, long format. Through the Muscle Cola dataset, we examined customer responses related to their willingness to purchase at various price points and their workout preferences.

A few practical tips will aid you in recognizing and tidying data:

  • Check Data Structure: Look at the column and row layout before transformations.
  • Identify Columns to Pivot: Avoid unnecessary columns in pivot_longer() or pivot_wider().
  • Spot Repeated Observations: Repeated or hierarchical observations need special attention.

By practicing these tidying techniques, you will be able to:

  • Recognize when data is untidy and identify key indicators of a wide format.
  • Use pivot_longer() to reorganize data columns effectively, transforming multiple columns of related values into a single column for easier analysis.
  • Interpret consolidated data, preparing it for further exploration and analysis in subsequent chapters.

These tidying and consolidation techniques ensure data consistency, making it easier to conduct meaningful analysis and draw insights to inform business decisions.