31  Transforming Columns

In this chapter, we focus on transforming columns in datasets using key tidyverse functions. Whether you are creating new variables, selecting specific columns, renaming, or reordering, these transformations help you prepare your data for analysis. We’ll explore:

  1. Adding and modifying columns with mutate().
  2. Selecting and dropping columns with select().
  3. Renaming columns with rename() and rename_with().
  4. Reordering columns with relocate().
  5. Applying advanced mutations programmatically with across() [optional].


31.1 Demonstration Data: Inventory at GearHub

We will use the GearHub Inventory dataset to demonstrate these column transformation techniques. GearHub is a fictional retail company specializing in a variety of products, including electronics, clothing, and home goods. Their sales_inventory dataset, also used in Section 30.1, contains data on:

  • ProductID: A unique identifier for each product (e.g., “P1”, “P2”).
  • ProductName: Descriptive names for products (e.g., “Wireless Headphones”, “Winter Jacket”).
  • Category: Product categories such as “Electronics”, “Clothing”, and “Home Goods”.
  • Price: The price of each product in dollars.
  • Quantity: The number of units currently in stock.
  • StoreLocation: The location of the store where the product is available (e.g., “New York”).
# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P1        Product 1   Electronics  12.2       40 Los Angeles  
 2 P2        Product 2   Electronics  70.7        6 San Francisco
 3 P3        Product 3   Home Goods   32.8       41 San Francisco
 4 P4        Product 4   Clothing     95.1       21 New York     
 5 P5        Product 5   Electronics  25.9       46 New York     
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles  
 7 P7        Product 7   Electronics  46.1       17 San Francisco
 8 P8        Product 8   Clothing     27.4       12 Los Angeles  
 9 P9        Product 9   Home Goods   51.0       35 New York     
10 P10       Product 10  Clothing     46.0        9 San Francisco
# ℹ 90 more rows


31.2 Creating and Modifying Columns with mutate()

The mutate() function allows you to add new columns or modify existing ones. This is often used to calculate derived variables or clean up data.

Syntax

# Basic syntax for mutate()
new_data <- data |> 
  mutate(new_column = expression)

Calculate Total Inventory Value

In the GearHub Inventory dataset, we can calculate the total inventory value for each product by multiplying the Price by Quantity:

sales_inventory |> 
  mutate(TotalValue = Price * Quantity)
# A tibble: 100 × 7
   ProductID ProductName Category    Price Quantity StoreLocation TotalValue
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>              <dbl>
 1 P1        Product 1   Electronics  12.2       40 Los Angeles         490.
 2 P2        Product 2   Electronics  70.7        6 San Francisco       424.
 3 P3        Product 3   Home Goods   32.8       41 San Francisco      1346.
 4 P4        Product 4   Clothing     95.1       21 New York           1997.
 5 P5        Product 5   Electronics  25.9       46 New York           1193.
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles         193.
 7 P7        Product 7   Electronics  46.1       17 San Francisco       784.
 8 P8        Product 8   Clothing     27.4       12 Los Angeles         328.
 9 P9        Product 9   Home Goods   51.0       35 New York           1784.
10 P10       Product 10  Clothing     46.0        9 San Francisco       414.
# ℹ 90 more rows

You can create multiple columns in one call to mutate():

sales_inventory |> 
  mutate(
    TotalValue = Price * Quantity,
    DiscountedPrice = Price * 0.9
  )
# A tibble: 100 × 8
   ProductID ProductName Category    Price Quantity StoreLocation TotalValue
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>              <dbl>
 1 P1        Product 1   Electronics  12.2       40 Los Angeles         490.
 2 P2        Product 2   Electronics  70.7        6 San Francisco       424.
 3 P3        Product 3   Home Goods   32.8       41 San Francisco      1346.
 4 P4        Product 4   Clothing     95.1       21 New York           1997.
 5 P5        Product 5   Electronics  25.9       46 New York           1193.
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles         193.
 7 P7        Product 7   Electronics  46.1       17 San Francisco       784.
 8 P8        Product 8   Clothing     27.4       12 Los Angeles         328.
 9 P9        Product 9   Home Goods   51.0       35 New York           1784.
10 P10       Product 10  Clothing     46.0        9 San Francisco       414.
# ℹ 90 more rows
# ℹ 1 more variable: DiscountedPrice <dbl>

Conditional Logic with case_when()

To assign categories based on price, we can use case_when() inside mutate():

sales_inventory |> 
  mutate(
    PriceCategory = case_when(
      Price < 20 ~ "Low",
      Price < 50 ~ "Medium",
      TRUE ~ "High"
    )
  )
# A tibble: 100 × 7
   ProductID ProductName Category    Price Quantity StoreLocation PriceCategory
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>         <chr>        
 1 P1        Product 1   Electronics  12.2       40 Los Angeles   Low          
 2 P2        Product 2   Electronics  70.7        6 San Francisco High         
 3 P3        Product 3   Home Goods   32.8       41 San Francisco Medium       
 4 P4        Product 4   Clothing     95.1       21 New York      High         
 5 P5        Product 5   Electronics  25.9       46 New York      Medium       
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles   High         
 7 P7        Product 7   Electronics  46.1       17 San Francisco Medium       
 8 P8        Product 8   Clothing     27.4       12 Los Angeles   Medium       
 9 P9        Product 9   Home Goods   51.0       35 New York      High         
10 P10       Product 10  Clothing     46.0        9 San Francisco Medium       
# ℹ 90 more rows

Exercise: Using mutate()

Try it yourself:


Let’s use the Muscle Cola customer dataset to practice row transformations. This dataset, introduced in Section 18.5.2, contains survey data from potential customers of Muscle Cola, a protein-infused cola drink designed for fitness enthusiasts.

This dataset has been tidied so each row represents the response of a customer to a specified price for a serving of Muscle Cola. The variables are

  • Customer Preferences:

    • Gym_preference: Indicates whether the respondent prefers a gym workout, another form of exercise, or both.
    • Gym_member: Specifies whether the respondent is a gym member (“Yes”/“No”).
    • Gym_freq: Numeric value representing the average number of times per week the respondent goes to the gym, ranging from occasional to daily attendance.
    • Protein_importance: Likert-scale response indicating how important protein intake is to the respondent, ranging from “Below Average” to “Above Average”.
  • Customer Demand:

    • Price: Specifies a series of prices at which respondents stated the quantity they would consumer
    • Quantity: Quantities that respondents would consume at different prices per month.
  • Demographics:

    • Gender: Gender of the respondent.
    • Age: Numeric value indicating the respondent’s age.
    • Weight: The respondent’s weight in pounds, which may be relevant for understanding protein needs.

For more details about how this dataset was collected and tidied, see Section 18.5.2.


For the muscle_cola_data dataset, add new columns to the dataset in a single mutate() command and name the resulting dataset new_muscle_cola_data:

  1. Create a new column named Individual_Revenue that calculates the potential revenue for each respondent at each price by multiplying the Price and Quantity columns.
  2. Create a new column named AgeGroup that assigns respondents to “Young” (Age < 30), “Middle-Aged” (30 ≤ Age < 60), or “Senior” (Age ≥ 60).

Specify the dataset as muscle_cola_data; then call mutate(); then

  1. create Individual_Revenue by multiplying the Price and Quantity column; then
  2. create AgeGroup using case_when() to assign respondents to “Young” (if Age < 30), “Middle-Aged” (if 30 ≤ Age < 60), or “Senior” (if Age ≥ 60).

Be sure that you have saved the results as new_muscle_cola_data.

Fully worked solution:

new_muscle_cola_data <- muscle_cola_data |> 
    mutate(Individual_Revenue = Price * Quantity,
           Age_Group = case_when(Age < 30 ~ "Young",
                                 Age >= 30 | Age <= 60 ~ "Middle-aged",
                                 Age > 60 ~ "Senior"
                                 )
           )

Notice that the newly created variables are not printed to the screen. You can see that they were created because they are in the list of variables that are not shown.

  • One way to see the new variables is to glimpse() the new data.
  • Another way is to call the new variables to the screen new_muscle_cola_data$Individual_Revenue and new_muscle_cola_data$Age_Group.
  • Another way to see the new variables is to select them for viewing using the select() function as we do in Section 31.3.
  • Still another way is to relocate the new variables to a visible position, using relocate() function as we do in Section 31.5.


31.3 Selecting and Dropping Columns with select()

The select() function allows you to focus on specific columns, reorder them, or exclude unnecessary ones.

Syntax

# Basic syntax for select()
new_data <- data |> 
  select(columns_to_keep)

Select Specific Columns

Suppose we only want to keep ProductID, Category, and Price:

sales_inventory |> 
  select(ProductID, Category, Price)
# A tibble: 100 × 3
   ProductID Category    Price
   <chr>     <chr>       <dbl>
 1 P1        Electronics  12.2
 2 P2        Electronics  70.7
 3 P3        Home Goods   32.8
 4 P4        Clothing     95.1
 5 P5        Electronics  25.9
 6 P6        Home Goods   96.3
 7 P7        Electronics  46.1
 8 P8        Clothing     27.4
 9 P9        Home Goods   51.0
10 P10       Clothing     46.0
# ℹ 90 more rows

Using Helpers with select()

You can also use helper functions like starts_with() or contains() to simplify column selection:

sales_inventory |> 
  select(starts_with("P"))
# A tibble: 100 × 3
   ProductID ProductName Price
   <chr>     <chr>       <dbl>
 1 P1        Product 1    12.2
 2 P2        Product 2    70.7
 3 P3        Product 3    32.8
 4 P4        Product 4    95.1
 5 P5        Product 5    25.9
 6 P6        Product 6    96.3
 7 P7        Product 7    46.1
 8 P8        Product 8    27.4
 9 P9        Product 9    51.0
10 P10       Product 10   46.0
# ℹ 90 more rows

Exclude Columns

To exclude certain columns, use the - sign:

sales_inventory |> 
  select(-StoreLocation)
# A tibble: 100 × 5
   ProductID ProductName Category    Price Quantity
   <chr>     <chr>       <chr>       <dbl>    <int>
 1 P1        Product 1   Electronics  12.2       40
 2 P2        Product 2   Electronics  70.7        6
 3 P3        Product 3   Home Goods   32.8       41
 4 P4        Product 4   Clothing     95.1       21
 5 P5        Product 5   Electronics  25.9       46
 6 P6        Product 6   Home Goods   96.3        2
 7 P7        Product 7   Electronics  46.1       17
 8 P8        Product 8   Clothing     27.4       12
 9 P9        Product 9   Home Goods   51.0       35
10 P10       Product 10  Clothing     46.0        9
# ℹ 90 more rows

Exercise: Using select()

Try it yourself:


Using the new_muscle_cola_data dataset that came from the exercise in Section 31.2.4:

  1. select the new columns so their values are visible.
  2. drop all of the columns that are not the new columns.
  3. select the Age and Age_Group columns using the contains() helper function.

Specify the dataset as new_muscle_cola_data, then

  1. call select() for the new columns so only their values are visible.
  2. call select() to drop all of the columns that are not the new columns.
  3. call select() the Age and Age_Group columns using the contains() helper function.

Fully worked solution:

new_muscle_cola_data |> select(Individual_Revenue, Age_Group)
new_muscle_cola_data |> select(-(Gym_preference:Weight))
new_muscle_cola_data |> select(contains("Age"))


31.4 Renaming Columns with rename() and rename_with()

Renaming columns helps make datasets easier to understand and work with. Renaming is especially useful when working with data from Qualtrics where the entire survey question is given as the column name.

Syntax: rename()

# Rename columns directly
new_data <- data |> 
  rename(new_name = old_name)

Rename Columns

Rename StoreLocation to Location:

sales_inventory |> 
  rename(Location = StoreLocation)
# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity Location     
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P1        Product 1   Electronics  12.2       40 Los Angeles  
 2 P2        Product 2   Electronics  70.7        6 San Francisco
 3 P3        Product 3   Home Goods   32.8       41 San Francisco
 4 P4        Product 4   Clothing     95.1       21 New York     
 5 P5        Product 5   Electronics  25.9       46 New York     
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles  
 7 P7        Product 7   Electronics  46.1       17 San Francisco
 8 P8        Product 8   Clothing     27.4       12 Los Angeles  
 9 P9        Product 9   Home Goods   51.0       35 New York     
10 P10       Product 10  Clothing     46.0        9 San Francisco
# ℹ 90 more rows

Rename by Column Position

Sometimes column names are so long and unwieldy as to make it difficult to inspect and analyze the data. This is particularly useful when column names are too long to type. Consider the variable names of this dataset:

# A tibble: 3 × 2
  `Long and unwieldy column name 1` `Another unnecessarily verbose name`
                              <dbl>                                <dbl>
1                                 1                                    4
2                                 2                                    5
3                                 3                                    6

Rather than type the column name as an argument in the rename() function, you can reference and rename columns using their position:

tb_renamed <- tb |> 
  rename( short_name1 = 1, 
          short_name2 = 2 ) 
tb_renamed
# A tibble: 3 × 2
  short_name1 short_name2
        <dbl>       <dbl>
1           1           4
2           2           5
3           3           6

Syntax: rename_with()

rename_with() applies a function to modify column names programmatically.

rename_with() using the tolower argument

One built-in tool to rename data is tolower which will convert all upper-case letters to lower-case.

# Rename columns using a function
new_data <- data |> 
  rename_with(tolower)

rename_with() using the toupper argument

Another built-in tool to rename data is toupper which will convert all lower-case letters to upper-case.

# Rename columns using a function
new_data <- data |> 
  rename_with(toupper)

Standardize Column Names

Convert all column names to lowercase:

sales_inventory |> 
  rename_with(tolower)
# A tibble: 100 × 6
   productid productname category    price quantity storelocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P1        Product 1   Electronics  12.2       40 Los Angeles  
 2 P2        Product 2   Electronics  70.7        6 San Francisco
 3 P3        Product 3   Home Goods   32.8       41 San Francisco
 4 P4        Product 4   Clothing     95.1       21 New York     
 5 P5        Product 5   Electronics  25.9       46 New York     
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles  
 7 P7        Product 7   Electronics  46.1       17 San Francisco
 8 P8        Product 8   Clothing     27.4       12 Los Angeles  
 9 P9        Product 9   Home Goods   51.0       35 New York     
10 P10       Product 10  Clothing     46.0        9 San Francisco
# ℹ 90 more rows

Exercise: Using rename()

When working with customer data, one of the most common and urgent tasks for entrepreneurs is renaming columns, especially when importing survey data from Qualtrics. Qualtrics often inserts the full survey question text as the column name, which can be lengthy and unwieldy—sometimes spanning entire sentences or paragraphs. These overly verbose column names make it challenging to inspect and analyze the data effectively.

When renaming:

  • Use variable names that are concise yet meaningful enough to interpret.
  • Follow tidyverse naming conventions, such as snake_case, where words are separated by underscores, e.g. new_muscle_cola_data.

Try it yourself:


For this exercise, the muscle_cola_data dataset has been edited to restore its original survey-question column names as originally imported from Qualtrics. The column names are so long that even viewing the glimpse() output becomes cumbersome. Your task is to rename these columns to shorter, more manageable names that are still descriptive and informative.

  1. Rename the muscle_cola_data columns by their position.
  2. [optional] Remember that the original muscle_cola_data is not tidy. That is clear here where columns containing quantity demanded at different prices are named for their prices. As a more advanced exercise, first rename the columns and then tidy the data as we did in the pivot_longer() exercise in Section 29.5.3.
  • Specify the dataset as muscle_cola_data, then call rename() and rename the variables by their position. - [optional] tidy muscle_cola_data now that the variables names are manageable.

Renamed solution:

renamed_muscle_cola_data <- muscle_cola_data |> 
    rename(Gym_preference = 1,
           Gym_member = 2,
           Gym_freq = 3,
           Protein_importance = 4,
           WTP = 5,
           price_050 = 6,
           price_100 = 7,
           price_150 = 8,
           price_200 = 9,
           price_250 = 10,
           Gender = 11,
           Age = 12,
           Weight = 13
    )
renamed_muscle_cola_data

Renamed and tidied solution:

renamed_muscle_cola_data <- muscle_cola_data |> 
    rename(Gym_preference = 1,
           Gym_member = 2,
           Gym_freq = 3,
           Protein_importance = 4,
           WTP = 5,
           price_050 = 6,
           price_100 = 7,
           price_150 = 8,
           price_200 = 9,
           price_250 = 10,
           Gender = 11,
           Age = 12,
           Weight = 13
    )
renamed_muscle_cola_data

tidy_muscle_cola_data <- renamed_muscle_cola_data |>
    pivot_longer(
        cols = c(price_050, 
                 price_100, 
                 price_150, 
                 price_200, 
                 price_250), # Explicitly list price/quantity columns
         names_to = "Price",
         values_to = "Quantity",
         names_prefix = "price_"
          ) |>
    mutate(Price = as.numeric(Price)/100) 
tidy_muscle_cola_data

31.5 Reordering Columns with relocate()

Use relocate() to move columns to specific positions.

Syntax

# Move a column to a new position
new_data <- data |> 
  relocate(column_name, .before = other_column)

Move Price to the Front

sales_inventory |> 
  relocate(Price, .before = ProductID)
# A tibble: 100 × 6
   Price ProductID ProductName Category    Quantity StoreLocation
   <dbl> <chr>     <chr>       <chr>          <int> <chr>        
 1  12.2 P1        Product 1   Electronics       40 Los Angeles  
 2  70.7 P2        Product 2   Electronics        6 San Francisco
 3  32.8 P3        Product 3   Home Goods        41 San Francisco
 4  95.1 P4        Product 4   Clothing          21 New York     
 5  25.9 P5        Product 5   Electronics       46 New York     
 6  96.3 P6        Product 6   Home Goods         2 Los Angeles  
 7  46.1 P7        Product 7   Electronics       17 San Francisco
 8  27.4 P8        Product 8   Clothing          12 Los Angeles  
 9  51.0 P9        Product 9   Home Goods        35 New York     
10  46.0 P10       Product 10  Clothing           9 San Francisco
# ℹ 90 more rows

You can also move columns to the end or after specific columns:

sales_inventory |> 
  relocate(StoreLocation, .after = ProductName)
# A tibble: 100 × 6
   ProductID ProductName StoreLocation Category    Price Quantity
   <chr>     <chr>       <chr>         <chr>       <dbl>    <int>
 1 P1        Product 1   Los Angeles   Electronics  12.2       40
 2 P2        Product 2   San Francisco Electronics  70.7        6
 3 P3        Product 3   San Francisco Home Goods   32.8       41
 4 P4        Product 4   New York      Clothing     95.1       21
 5 P5        Product 5   New York      Electronics  25.9       46
 6 P6        Product 6   Los Angeles   Home Goods   96.3        2
 7 P7        Product 7   San Francisco Electronics  46.1       17
 8 P8        Product 8   Los Angeles   Clothing     27.4       12
 9 P9        Product 9   New York      Home Goods   51.0       35
10 P10       Product 10  San Francisco Clothing     46.0        9
# ℹ 90 more rows

Exercise: Using relocate()

Try it yourself:


Using the new_muscle_cola_data created in Section 31.2.4, relocate the new columns to the beginning of the dataset so their values are visible.

Specify the dataset as new_muscle_cola_data and call relocate() to move the new columns to a position where their values are visible.

Fully worked solution:

new_muscle_cola_data |>  relocate(c(Individual_Revenue, Age_Group), .before = Gym_preference)

31.6 Programmatic Transformations with across()

The across() function lets you apply transformations to multiple columns programmatically. It allows you to apply the same transformation or set of transformations to multiple columns in your dataset. It’s a cleaner and more powerful alternative to using functions like mutate_at() or mutate_all().

Here’s a breakdown of what across() does and how it works:

across() allows you to:

  • Select multiple columns within a mutate() (or summarize()) call.
  • Apply one or more functions to those selected columns.
  • Optionally rename the resulting columns based on the transformations.

Syntax

data |> mutate(
  across(
    cols,              # Which columns to target
    .fns = function,   # What function(s) to apply
    .names = format    # Optional: How to name the new columns
    )
  )
  • cols: Columns to apply the function(s) to (e.g., starts_with(“price”)).
  • .fns: The transformation(s) to apply, like mean, round, or a custom function.
  • .names: Optional template for renaming resulting columns (default keeps the original names).

Creating Transformed Columns with mutate() and across()

Imagine you want to convert the Price column in sales_inventory dataset from dollars to cents (by multiplying by 100) and double the Quantity column to model a promotion where customers can purchase double the inventory.

sales_inventory
# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P1        Product 1   Electronics  12.2       40 Los Angeles  
 2 P2        Product 2   Electronics  70.7        6 San Francisco
 3 P3        Product 3   Home Goods   32.8       41 San Francisco
 4 P4        Product 4   Clothing     95.1       21 New York     
 5 P5        Product 5   Electronics  25.9       46 New York     
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles  
 7 P7        Product 7   Electronics  46.1       17 San Francisco
 8 P8        Product 8   Clothing     27.4       12 Los Angeles  
 9 P9        Product 9   Home Goods   51.0       35 New York     
10 P10       Product 10  Clothing     46.0        9 San Francisco
# ℹ 90 more rows
sales_inventory |> 
  mutate( across(c(Price, Quantity), 
                 ~ .x * c(100, 2), 
                 .names = "{.col}_transformed"
                 ) 
          ) 
# A tibble: 100 × 8
   ProductID ProductName Category Price Quantity StoreLocation Price_transformed
   <chr>     <chr>       <chr>    <dbl>    <int> <chr>                     <dbl>
 1 P1        Product 1   Electro…  12.2       40 Los Angeles              1224  
 2 P2        Product 2   Electro…  70.7        6 San Francisco             141. 
 3 P3        Product 3   Home Go…  32.8       41 San Francisco            3284  
 4 P4        Product 4   Clothing  95.1       21 New York                  190. 
 5 P5        Product 5   Electro…  25.9       46 New York                 2593  
 6 P6        Product 6   Home Go…  96.3        2 Los Angeles               193. 
 7 P7        Product 7   Electro…  46.1       17 San Francisco            4609  
 8 P8        Product 8   Clothing  27.4       12 Los Angeles                54.7
 9 P9        Product 9   Home Go…  51.0       35 New York                 5096  
10 P10       Product 10  Clothing  46.0        9 San Francisco              92.0
# ℹ 90 more rows
# ℹ 1 more variable: Quantity_transformed <dbl>

Explanation:

  1. across(): applies the transformation to each column in the list c(Price, Quantity).
  2. ~ .x * c(100, 2):
    • Price gets multiplied by 100 (to convert to cents).
    • Quantity gets multiplied by 2 (to simulate doubling).
  3. .names = "{.col}_transformed": Keeps the original column name and appends _transformed.

Result: You get two new columns in the dataset:

  • Price_transformed: Prices in cents.
  • Quantity_transformed: Quantities doubled.

Applying Multiple Functions

Let’s calculate the mean and standard deviation for the Price and Quantity columns from sales_inventory and create new columns for each.

sales_inventory
# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P1        Product 1   Electronics  12.2       40 Los Angeles  
 2 P2        Product 2   Electronics  70.7        6 San Francisco
 3 P3        Product 3   Home Goods   32.8       41 San Francisco
 4 P4        Product 4   Clothing     95.1       21 New York     
 5 P5        Product 5   Electronics  25.9       46 New York     
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles  
 7 P7        Product 7   Electronics  46.1       17 San Francisco
 8 P8        Product 8   Clothing     27.4       12 Los Angeles  
 9 P9        Product 9   Home Goods   51.0       35 New York     
10 P10       Product 10  Clothing     46.0        9 San Francisco
# ℹ 90 more rows
sales_inventory |> mutate( across(c(Price, Quantity), list(mean = mean, sd = sd), 
                                  .names = "{.col}_{.fn}"
                                  ) 
                           ) 
# A tibble: 100 × 10
   ProductID ProductName Category    Price Quantity StoreLocation Price_mean
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>              <dbl>
 1 P1        Product 1   Electronics  12.2       40 Los Angeles         52.4
 2 P2        Product 2   Electronics  70.7        6 San Francisco       52.4
 3 P3        Product 3   Home Goods   32.8       41 San Francisco       52.4
 4 P4        Product 4   Clothing     95.1       21 New York            52.4
 5 P5        Product 5   Electronics  25.9       46 New York            52.4
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles         52.4
 7 P7        Product 7   Electronics  46.1       17 San Francisco       52.4
 8 P8        Product 8   Clothing     27.4       12 Los Angeles         52.4
 9 P9        Product 9   Home Goods   51.0       35 New York            52.4
10 P10       Product 10  Clothing     46.0        9 San Francisco       52.4
# ℹ 90 more rows
# ℹ 3 more variables: Price_sd <dbl>, Quantity_mean <dbl>, Quantity_sd <dbl>

Explanation:

  • across(c(Price, Quantity), list(mean = mean, sd = sd)):

    • Applies the functions mean and sd (standard deviation) to the columns Price and Quantity.
  • .names = "{.col}_{.fn}":

    • Creates new column names by appending the function name (_mean or _sd) to the original column name (Price or Quantity).

Result:

This code generates four new columns:

  • Price_mean: The mean of the Price column.
  • Price_sd: The standard deviation of the Price column.
  • Quantity_mean: The mean of the Quantity column.
  • Quantity_sd: The standard deviation of the Quantity column.

Apply Transformation to Numeric Columns

Let’s double the values of all numeric columns in the sales_inventory dataset.

sales_inventory
# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P1        Product 1   Electronics  12.2       40 Los Angeles  
 2 P2        Product 2   Electronics  70.7        6 San Francisco
 3 P3        Product 3   Home Goods   32.8       41 San Francisco
 4 P4        Product 4   Clothing     95.1       21 New York     
 5 P5        Product 5   Electronics  25.9       46 New York     
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles  
 7 P7        Product 7   Electronics  46.1       17 San Francisco
 8 P8        Product 8   Clothing     27.4       12 Los Angeles  
 9 P9        Product 9   Home Goods   51.0       35 New York     
10 P10       Product 10  Clothing     46.0        9 San Francisco
# ℹ 90 more rows
sales_inventory |> mutate( across(where(is.numeric), ~ .x * 2) ) 
# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <dbl> <chr>        
 1 P1        Product 1   Electronics  24.5       80 Los Angeles  
 2 P2        Product 2   Electronics 141.        12 San Francisco
 3 P3        Product 3   Home Goods   65.7       82 San Francisco
 4 P4        Product 4   Clothing    190.        42 New York     
 5 P5        Product 5   Electronics  51.9       92 New York     
 6 P6        Product 6   Home Goods  193.         4 Los Angeles  
 7 P7        Product 7   Electronics  92.2       34 San Francisco
 8 P8        Product 8   Clothing     54.7       24 Los Angeles  
 9 P9        Product 9   Home Goods  102.        70 New York     
10 P10       Product 10  Clothing     92.0       18 San Francisco
# ℹ 90 more rows

Explanation:

  • where(is.numeric):

    • Targets only the numeric columns in the dataset (Price and Quantity in this case).
  • ~ .x * 2:

    • Multiplies each value in the targeted columns by 2.

Result: This transformation updates the dataset, doubling the values in the Price and Quantity columns while leaving other columns unchanged.

Why Is across() Useful?

  • Readable and Concise: It simplifies repetitive transformations.
  • Flexible: You can apply multiple functions or transformations in one line.
  • Customizable Naming: Easily create new columns with meaningful names.
  • Powerful Column Selection: Use selectors like starts_with(), ends_with(), or where().

Exercise: Using across()

Try it yourself:


Add a 10% price tax to all price columns in the untidy version of muscle_cola_data.

Use the across() function to add a 10% price tax to all price_* columns.

Fully worked solution:

muscle_cola_data <- muscle_cola_data |>
  mutate(
    across(starts_with("price"), ~ .x * 1.10, 
           .names = "taxed_{.col}")
  )
muscle_cola_data |> select(starts_with("taxed"))


31.7 Conclusion

Transforming columns is an essential step in preparing data for analysis. With the tools in this chapter, you can:

  • Create new variables with mutate().
  • Focus on relevant columns with select().
  • Rename columns for clarity using rename() and rename_with().
  • Reorder columns using relocate().
  • Apply advanced mutations programmatically with across().

These skills will streamline your workflow and improve your ability to handle real-world datasets effectively.