# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
6 P6 Product 6 Home Goods 96.3 2 Los Angeles
7 P7 Product 7 Electronics 46.1 17 San Francisco
8 P8 Product 8 Clothing 27.4 12 Los Angeles
9 P9 Product 9 Home Goods 51.0 35 New York
10 P10 Product 10 Clothing 46.0 9 San Francisco
# ℹ 90 more rows
31 Transforming Columns
In this chapter, we focus on transforming columns in datasets using key tidyverse functions. Whether you are creating new variables, selecting specific columns, renaming, or reordering, these transformations help you prepare your data for analysis. We’ll explore:
- Adding and modifying columns with
mutate()
. - Selecting and dropping columns with
select()
. - Renaming columns with
rename()
andrename_with()
. - Reordering columns with
relocate()
. - Applying advanced mutations programmatically with
across()
[optional].
31.1 Demonstration Data: Inventory at GearHub
We will use the GearHub Inventory dataset to demonstrate these column transformation techniques. GearHub is a fictional retail company specializing in a variety of products, including electronics, clothing, and home goods. Their sales_inventory
dataset, also used in Section 30.1, contains data on:
ProductID
: A unique identifier for each product (e.g., “P1”, “P2”).ProductName
: Descriptive names for products (e.g., “Wireless Headphones”, “Winter Jacket”).Category
: Product categories such as “Electronics”, “Clothing”, and “Home Goods”.Price
: The price of each product in dollars.Quantity
: The number of units currently in stock.StoreLocation
: The location of the store where the product is available (e.g., “New York”).
31.2 Creating and Modifying Columns with mutate()
The mutate()
function allows you to add new columns or modify existing ones. This is often used to calculate derived variables or clean up data.
Syntax
# Basic syntax for mutate()
<- data |>
new_data mutate(new_column = expression)
Calculate Total Inventory Value
In the GearHub Inventory dataset, we can calculate the total inventory value for each product by multiplying the Price
by Quantity
:
|>
sales_inventory mutate(TotalValue = Price * Quantity)
# A tibble: 100 × 7
ProductID ProductName Category Price Quantity StoreLocation TotalValue
<chr> <chr> <chr> <dbl> <int> <chr> <dbl>
1 P1 Product 1 Electronics 12.2 40 Los Angeles 490.
2 P2 Product 2 Electronics 70.7 6 San Francisco 424.
3 P3 Product 3 Home Goods 32.8 41 San Francisco 1346.
4 P4 Product 4 Clothing 95.1 21 New York 1997.
5 P5 Product 5 Electronics 25.9 46 New York 1193.
6 P6 Product 6 Home Goods 96.3 2 Los Angeles 193.
7 P7 Product 7 Electronics 46.1 17 San Francisco 784.
8 P8 Product 8 Clothing 27.4 12 Los Angeles 328.
9 P9 Product 9 Home Goods 51.0 35 New York 1784.
10 P10 Product 10 Clothing 46.0 9 San Francisco 414.
# ℹ 90 more rows
You can create multiple columns in one call to mutate()
:
|>
sales_inventory mutate(
TotalValue = Price * Quantity,
DiscountedPrice = Price * 0.9
)
# A tibble: 100 × 8
ProductID ProductName Category Price Quantity StoreLocation TotalValue
<chr> <chr> <chr> <dbl> <int> <chr> <dbl>
1 P1 Product 1 Electronics 12.2 40 Los Angeles 490.
2 P2 Product 2 Electronics 70.7 6 San Francisco 424.
3 P3 Product 3 Home Goods 32.8 41 San Francisco 1346.
4 P4 Product 4 Clothing 95.1 21 New York 1997.
5 P5 Product 5 Electronics 25.9 46 New York 1193.
6 P6 Product 6 Home Goods 96.3 2 Los Angeles 193.
7 P7 Product 7 Electronics 46.1 17 San Francisco 784.
8 P8 Product 8 Clothing 27.4 12 Los Angeles 328.
9 P9 Product 9 Home Goods 51.0 35 New York 1784.
10 P10 Product 10 Clothing 46.0 9 San Francisco 414.
# ℹ 90 more rows
# ℹ 1 more variable: DiscountedPrice <dbl>
Conditional Logic with case_when()
To assign categories based on price, we can use case_when()
inside mutate()
:
|>
sales_inventory mutate(
PriceCategory = case_when(
< 20 ~ "Low",
Price < 50 ~ "Medium",
Price TRUE ~ "High"
) )
# A tibble: 100 × 7
ProductID ProductName Category Price Quantity StoreLocation PriceCategory
<chr> <chr> <chr> <dbl> <int> <chr> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles Low
2 P2 Product 2 Electronics 70.7 6 San Francisco High
3 P3 Product 3 Home Goods 32.8 41 San Francisco Medium
4 P4 Product 4 Clothing 95.1 21 New York High
5 P5 Product 5 Electronics 25.9 46 New York Medium
6 P6 Product 6 Home Goods 96.3 2 Los Angeles High
7 P7 Product 7 Electronics 46.1 17 San Francisco Medium
8 P8 Product 8 Clothing 27.4 12 Los Angeles Medium
9 P9 Product 9 Home Goods 51.0 35 New York High
10 P10 Product 10 Clothing 46.0 9 San Francisco Medium
# ℹ 90 more rows
Exercise: Using mutate()
Try it yourself:
Let’s use the Muscle Cola customer dataset to practice row transformations. This dataset, introduced in Section 18.5.2, contains survey data from potential customers of Muscle Cola, a protein-infused cola drink designed for fitness enthusiasts.
This dataset has been tidied so each row represents the response of a customer to a specified price for a serving of Muscle Cola. The variables are
Customer Preferences:
Gym_preference
: Indicates whether the respondent prefers a gym workout, another form of exercise, or both.Gym_member
: Specifies whether the respondent is a gym member (“Yes”/“No”).Gym_freq
: Numeric value representing the average number of times per week the respondent goes to the gym, ranging from occasional to daily attendance.Protein_importance
: Likert-scale response indicating how important protein intake is to the respondent, ranging from “Below Average” to “Above Average”.
Customer Demand:
Price
: Specifies a series of prices at which respondents stated the quantity they would consumerQuantity
: Quantities that respondents would consume at different prices per month.
Demographics:
Gender
: Gender of the respondent.Age
: Numeric value indicating the respondent’s age.Weight
: The respondent’s weight in pounds, which may be relevant for understanding protein needs.
For more details about how this dataset was collected and tidied, see Section 18.5.2.
For the muscle_cola_data
dataset, add new columns to the dataset in a single mutate()
command and name the resulting dataset new_muscle_cola_data
:
- Create a new column named
Individual_Revenue
that calculates the potential revenue for each respondent at each price by multiplying thePrice
andQuantity
columns. - Create a new column named
AgeGroup
that assigns respondents to “Young” (Age < 30), “Middle-Aged” (30 ≤ Age < 60), or “Senior” (Age ≥ 60).
Specify the dataset as muscle_cola_data
; then call mutate()
; then
- create
Individual_Revenue
by multiplying thePrice
andQuantity
column; then - create
AgeGroup
usingcase_when()
to assign respondents to “Young” (if Age < 30), “Middle-Aged” (if 30 ≤ Age < 60), or “Senior” (if Age ≥ 60).
Be sure that you have saved the results as new_muscle_cola_data
.
Fully worked solution:
<- muscle_cola_data |>
new_muscle_cola_data mutate(Individual_Revenue = Price * Quantity,
Age_Group = case_when(Age < 30 ~ "Young",
>= 30 | Age <= 60 ~ "Middle-aged",
Age > 60 ~ "Senior"
Age
) )
Notice that the newly created variables are not printed to the screen. You can see that they were created because they are in the list of variables that are not shown.
- One way to see the new variables is to
glimpse()
the new data.
- Another way is to call the new variables to the screen
new_muscle_cola_data$Individual_Revenue
andnew_muscle_cola_data$Age_Group
.- Another way to see the new variables is to select them for viewing using the
select()
function as we do in Section 31.3.
- Still another way is to relocate the new variables to a visible position, using
relocate()
function as we do in Section 31.5.
31.3 Selecting and Dropping Columns with select()
The select()
function allows you to focus on specific columns, reorder them, or exclude unnecessary ones.
Syntax
# Basic syntax for select()
<- data |>
new_data select(columns_to_keep)
Select Specific Columns
Suppose we only want to keep ProductID
, Category
, and Price
:
|>
sales_inventory select(ProductID, Category, Price)
# A tibble: 100 × 3
ProductID Category Price
<chr> <chr> <dbl>
1 P1 Electronics 12.2
2 P2 Electronics 70.7
3 P3 Home Goods 32.8
4 P4 Clothing 95.1
5 P5 Electronics 25.9
6 P6 Home Goods 96.3
7 P7 Electronics 46.1
8 P8 Clothing 27.4
9 P9 Home Goods 51.0
10 P10 Clothing 46.0
# ℹ 90 more rows
Using Helpers with select()
You can also use helper functions like starts_with()
or contains()
to simplify column selection:
|>
sales_inventory select(starts_with("P"))
# A tibble: 100 × 3
ProductID ProductName Price
<chr> <chr> <dbl>
1 P1 Product 1 12.2
2 P2 Product 2 70.7
3 P3 Product 3 32.8
4 P4 Product 4 95.1
5 P5 Product 5 25.9
6 P6 Product 6 96.3
7 P7 Product 7 46.1
8 P8 Product 8 27.4
9 P9 Product 9 51.0
10 P10 Product 10 46.0
# ℹ 90 more rows
Exclude Columns
To exclude certain columns, use the -
sign:
|>
sales_inventory select(-StoreLocation)
# A tibble: 100 × 5
ProductID ProductName Category Price Quantity
<chr> <chr> <chr> <dbl> <int>
1 P1 Product 1 Electronics 12.2 40
2 P2 Product 2 Electronics 70.7 6
3 P3 Product 3 Home Goods 32.8 41
4 P4 Product 4 Clothing 95.1 21
5 P5 Product 5 Electronics 25.9 46
6 P6 Product 6 Home Goods 96.3 2
7 P7 Product 7 Electronics 46.1 17
8 P8 Product 8 Clothing 27.4 12
9 P9 Product 9 Home Goods 51.0 35
10 P10 Product 10 Clothing 46.0 9
# ℹ 90 more rows
Exercise: Using select()
Try it yourself:
Using the new_muscle_cola_data
dataset that came from the exercise in Section 31.2.4:
- select the new columns so their values are visible.
- drop all of the columns that are not the new columns.
- select the
Age
andAge_Group
columns using thecontains()
helper function.
Specify the dataset as new_muscle_cola_data
, then
- call
select()
for the new columns so only their values are visible. - call
select()
to drop all of the columns that are not the new columns. - call
select()
theAge
andAge_Group
columns using thecontains()
helper function.
Fully worked solution:
|> select(Individual_Revenue, Age_Group)
new_muscle_cola_data |> select(-(Gym_preference:Weight))
new_muscle_cola_data |> select(contains("Age")) new_muscle_cola_data
31.4 Renaming Columns with rename()
and rename_with()
Renaming columns helps make datasets easier to understand and work with. Renaming is especially useful when working with data from Qualtrics where the entire survey question is given as the column name.
Syntax: rename()
# Rename columns directly
<- data |>
new_data rename(new_name = old_name)
Rename Columns
Rename StoreLocation
to Location
:
|>
sales_inventory rename(Location = StoreLocation)
# A tibble: 100 × 6
ProductID ProductName Category Price Quantity Location
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
6 P6 Product 6 Home Goods 96.3 2 Los Angeles
7 P7 Product 7 Electronics 46.1 17 San Francisco
8 P8 Product 8 Clothing 27.4 12 Los Angeles
9 P9 Product 9 Home Goods 51.0 35 New York
10 P10 Product 10 Clothing 46.0 9 San Francisco
# ℹ 90 more rows
Rename by Column Position
Sometimes column names are so long and unwieldy as to make it difficult to inspect and analyze the data. This is particularly useful when column names are too long to type. Consider the variable names of this dataset:
# A tibble: 3 × 2
`Long and unwieldy column name 1` `Another unnecessarily verbose name`
<dbl> <dbl>
1 1 4
2 2 5
3 3 6
Rather than type the column name as an argument in the rename()
function, you can reference and rename columns using their position:
<- tb |>
tb_renamed rename( short_name1 = 1,
short_name2 = 2 )
tb_renamed
# A tibble: 3 × 2
short_name1 short_name2
<dbl> <dbl>
1 1 4
2 2 5
3 3 6
Syntax: rename_with()
rename_with()
applies a function to modify column names programmatically.
rename_with()
using the tolower
argument
One built-in tool to rename data is tolower
which will convert all upper-case letters to lower-case.
# Rename columns using a function
<- data |>
new_data rename_with(tolower)
rename_with()
using the toupper
argument
Another built-in tool to rename data is toupper
which will convert all lower-case letters to upper-case.
# Rename columns using a function
<- data |>
new_data rename_with(toupper)
Standardize Column Names
Convert all column names to lowercase:
|>
sales_inventory rename_with(tolower)
# A tibble: 100 × 6
productid productname category price quantity storelocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
6 P6 Product 6 Home Goods 96.3 2 Los Angeles
7 P7 Product 7 Electronics 46.1 17 San Francisco
8 P8 Product 8 Clothing 27.4 12 Los Angeles
9 P9 Product 9 Home Goods 51.0 35 New York
10 P10 Product 10 Clothing 46.0 9 San Francisco
# ℹ 90 more rows
Exercise: Using rename()
When working with customer data, one of the most common and urgent tasks for entrepreneurs is renaming columns, especially when importing survey data from Qualtrics. Qualtrics often inserts the full survey question text as the column name, which can be lengthy and unwieldy—sometimes spanning entire sentences or paragraphs. These overly verbose column names make it challenging to inspect and analyze the data effectively.
When renaming:
- Use variable names that are concise yet meaningful enough to interpret.
- Follow tidyverse naming conventions, such as snake_case, where words are separated by underscores, e.g.
new_muscle_cola_data
.
Try it yourself:
For this exercise, the muscle_cola_data
dataset has been edited to restore its original survey-question column names as originally imported from Qualtrics. The column names are so long that even viewing the glimpse()
output becomes cumbersome. Your task is to rename these columns to shorter, more manageable names that are still descriptive and informative.
- Rename the
muscle_cola_data
columns by their position. - [optional] Remember that the original
muscle_cola_data
is not tidy. That is clear here where columns containing quantity demanded at different prices are named for their prices. As a more advanced exercise, first rename the columns and then tidy the data as we did in thepivot_longer()
exercise in Section 29.5.3.
- Specify the dataset as
muscle_cola_data
, then callrename()
and rename the variables by their position. - [optional] tidymuscle_cola_data
now that the variables names are manageable.
Renamed solution:
<- muscle_cola_data |>
renamed_muscle_cola_data rename(Gym_preference = 1,
Gym_member = 2,
Gym_freq = 3,
Protein_importance = 4,
WTP = 5,
price_050 = 6,
price_100 = 7,
price_150 = 8,
price_200 = 9,
price_250 = 10,
Gender = 11,
Age = 12,
Weight = 13
) renamed_muscle_cola_data
Renamed and tidied solution:
<- muscle_cola_data |>
renamed_muscle_cola_data rename(Gym_preference = 1,
Gym_member = 2,
Gym_freq = 3,
Protein_importance = 4,
WTP = 5,
price_050 = 6,
price_100 = 7,
price_150 = 8,
price_200 = 9,
price_250 = 10,
Gender = 11,
Age = 12,
Weight = 13
)
renamed_muscle_cola_data
<- renamed_muscle_cola_data |>
tidy_muscle_cola_data pivot_longer(
cols = c(price_050,
price_100,
price_150,
price_200, # Explicitly list price/quantity columns
price_250), names_to = "Price",
values_to = "Quantity",
names_prefix = "price_"
|>
) mutate(Price = as.numeric(Price)/100)
tidy_muscle_cola_data
31.5 Reordering Columns with relocate()
Use relocate()
to move columns to specific positions.
Syntax
# Move a column to a new position
<- data |>
new_data relocate(column_name, .before = other_column)
Move Price
to the Front
|>
sales_inventory relocate(Price, .before = ProductID)
# A tibble: 100 × 6
Price ProductID ProductName Category Quantity StoreLocation
<dbl> <chr> <chr> <chr> <int> <chr>
1 12.2 P1 Product 1 Electronics 40 Los Angeles
2 70.7 P2 Product 2 Electronics 6 San Francisco
3 32.8 P3 Product 3 Home Goods 41 San Francisco
4 95.1 P4 Product 4 Clothing 21 New York
5 25.9 P5 Product 5 Electronics 46 New York
6 96.3 P6 Product 6 Home Goods 2 Los Angeles
7 46.1 P7 Product 7 Electronics 17 San Francisco
8 27.4 P8 Product 8 Clothing 12 Los Angeles
9 51.0 P9 Product 9 Home Goods 35 New York
10 46.0 P10 Product 10 Clothing 9 San Francisco
# ℹ 90 more rows
You can also move columns to the end or after specific columns:
|>
sales_inventory relocate(StoreLocation, .after = ProductName)
# A tibble: 100 × 6
ProductID ProductName StoreLocation Category Price Quantity
<chr> <chr> <chr> <chr> <dbl> <int>
1 P1 Product 1 Los Angeles Electronics 12.2 40
2 P2 Product 2 San Francisco Electronics 70.7 6
3 P3 Product 3 San Francisco Home Goods 32.8 41
4 P4 Product 4 New York Clothing 95.1 21
5 P5 Product 5 New York Electronics 25.9 46
6 P6 Product 6 Los Angeles Home Goods 96.3 2
7 P7 Product 7 San Francisco Electronics 46.1 17
8 P8 Product 8 Los Angeles Clothing 27.4 12
9 P9 Product 9 New York Home Goods 51.0 35
10 P10 Product 10 San Francisco Clothing 46.0 9
# ℹ 90 more rows
Exercise: Using relocate()
Try it yourself:
Using the new_muscle_cola_data
created in Section 31.2.4, relocate the new columns to the beginning of the dataset so their values are visible.
Specify the dataset as new_muscle_cola_data
and call relocate()
to move the new columns to a position where their values are visible.
Fully worked solution:
|> relocate(c(Individual_Revenue, Age_Group), .before = Gym_preference) new_muscle_cola_data
31.6 Programmatic Transformations with across()
The across()
function lets you apply transformations to multiple columns programmatically. It allows you to apply the same transformation or set of transformations to multiple columns in your dataset. It’s a cleaner and more powerful alternative to using functions like mutate_at()
or mutate_all()
.
Here’s a breakdown of what across()
does and how it works:
across()
allows you to:
- Select multiple columns within a
mutate()
(orsummarize()
) call. - Apply one or more functions to those selected columns.
- Optionally rename the resulting columns based on the transformations.
Syntax
|> mutate(
data across(
# Which columns to target
cols, .fns = function, # What function(s) to apply
.names = format # Optional: How to name the new columns
) )
cols
: Columns to apply the function(s) to (e.g., starts_with(“price”))..fns
: The transformation(s) to apply, like mean, round, or a custom function..names
: Optional template for renaming resulting columns (default keeps the original names).
Creating Transformed Columns with mutate() and across()
Imagine you want to convert the Price
column in sales_inventory
dataset from dollars to cents (by multiplying by 100) and double the Quantity
column to model a promotion where customers can purchase double the inventory.
sales_inventory
# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
6 P6 Product 6 Home Goods 96.3 2 Los Angeles
7 P7 Product 7 Electronics 46.1 17 San Francisco
8 P8 Product 8 Clothing 27.4 12 Los Angeles
9 P9 Product 9 Home Goods 51.0 35 New York
10 P10 Product 10 Clothing 46.0 9 San Francisco
# ℹ 90 more rows
|>
sales_inventory mutate( across(c(Price, Quantity),
~ .x * c(100, 2),
.names = "{.col}_transformed"
) )
# A tibble: 100 × 8
ProductID ProductName Category Price Quantity StoreLocation Price_transformed
<chr> <chr> <chr> <dbl> <int> <chr> <dbl>
1 P1 Product 1 Electro… 12.2 40 Los Angeles 1224
2 P2 Product 2 Electro… 70.7 6 San Francisco 141.
3 P3 Product 3 Home Go… 32.8 41 San Francisco 3284
4 P4 Product 4 Clothing 95.1 21 New York 190.
5 P5 Product 5 Electro… 25.9 46 New York 2593
6 P6 Product 6 Home Go… 96.3 2 Los Angeles 193.
7 P7 Product 7 Electro… 46.1 17 San Francisco 4609
8 P8 Product 8 Clothing 27.4 12 Los Angeles 54.7
9 P9 Product 9 Home Go… 51.0 35 New York 5096
10 P10 Product 10 Clothing 46.0 9 San Francisco 92.0
# ℹ 90 more rows
# ℹ 1 more variable: Quantity_transformed <dbl>
Explanation:
across()
: applies the transformation to each column in the listc(Price, Quantity)
.~ .x * c(100, 2)
:Price
gets multiplied by 100 (to convert to cents).Quantity
gets multiplied by 2 (to simulate doubling).
.names = "{.col}_transformed"
: Keeps the original column name and appends_transformed
.
Result: You get two new columns in the dataset:
Price_transformed
: Prices in cents.Quantity_transformed
: Quantities doubled.
Applying Multiple Functions
Let’s calculate the mean and standard deviation for the Price
and Quantity
columns from sales_inventory
and create new columns for each.
sales_inventory
# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
6 P6 Product 6 Home Goods 96.3 2 Los Angeles
7 P7 Product 7 Electronics 46.1 17 San Francisco
8 P8 Product 8 Clothing 27.4 12 Los Angeles
9 P9 Product 9 Home Goods 51.0 35 New York
10 P10 Product 10 Clothing 46.0 9 San Francisco
# ℹ 90 more rows
|> mutate( across(c(Price, Quantity), list(mean = mean, sd = sd),
sales_inventory .names = "{.col}_{.fn}"
) )
# A tibble: 100 × 10
ProductID ProductName Category Price Quantity StoreLocation Price_mean
<chr> <chr> <chr> <dbl> <int> <chr> <dbl>
1 P1 Product 1 Electronics 12.2 40 Los Angeles 52.4
2 P2 Product 2 Electronics 70.7 6 San Francisco 52.4
3 P3 Product 3 Home Goods 32.8 41 San Francisco 52.4
4 P4 Product 4 Clothing 95.1 21 New York 52.4
5 P5 Product 5 Electronics 25.9 46 New York 52.4
6 P6 Product 6 Home Goods 96.3 2 Los Angeles 52.4
7 P7 Product 7 Electronics 46.1 17 San Francisco 52.4
8 P8 Product 8 Clothing 27.4 12 Los Angeles 52.4
9 P9 Product 9 Home Goods 51.0 35 New York 52.4
10 P10 Product 10 Clothing 46.0 9 San Francisco 52.4
# ℹ 90 more rows
# ℹ 3 more variables: Price_sd <dbl>, Quantity_mean <dbl>, Quantity_sd <dbl>
Explanation:
across(c(Price, Quantity), list(mean = mean, sd = sd))
:- Applies the functions
mean
andsd
(standard deviation) to the columnsPrice
andQuantity.
- Applies the functions
.names = "{.col}_{.fn}"
:- Creates new column names by appending the function name (
_mean
or_sd
) to the original column name (Price
orQuantity
).
- Creates new column names by appending the function name (
Result:
This code generates four new columns:
Price_mean
: The mean of thePrice
column.Price_sd
: The standard deviation of thePrice
column.Quantity_mean
: The mean of theQuantity
column.Quantity_sd
: The standard deviation of theQuantity
column.
Apply Transformation to Numeric Columns
Let’s double the values of all numeric columns in the sales_inventory
dataset.
sales_inventory
# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
6 P6 Product 6 Home Goods 96.3 2 Los Angeles
7 P7 Product 7 Electronics 46.1 17 San Francisco
8 P8 Product 8 Clothing 27.4 12 Los Angeles
9 P9 Product 9 Home Goods 51.0 35 New York
10 P10 Product 10 Clothing 46.0 9 San Francisco
# ℹ 90 more rows
|> mutate( across(where(is.numeric), ~ .x * 2) ) sales_inventory
# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <dbl> <chr>
1 P1 Product 1 Electronics 24.5 80 Los Angeles
2 P2 Product 2 Electronics 141. 12 San Francisco
3 P3 Product 3 Home Goods 65.7 82 San Francisco
4 P4 Product 4 Clothing 190. 42 New York
5 P5 Product 5 Electronics 51.9 92 New York
6 P6 Product 6 Home Goods 193. 4 Los Angeles
7 P7 Product 7 Electronics 92.2 34 San Francisco
8 P8 Product 8 Clothing 54.7 24 Los Angeles
9 P9 Product 9 Home Goods 102. 70 New York
10 P10 Product 10 Clothing 92.0 18 San Francisco
# ℹ 90 more rows
Explanation:
where(is.numeric)
:- Targets only the numeric columns in the dataset (
Price
andQuantity
in this case).
- Targets only the numeric columns in the dataset (
~ .x * 2
:- Multiplies each value in the targeted columns by 2.
Result: This transformation updates the dataset, doubling the values in the Price
and Quantity
columns while leaving other columns unchanged.
Why Is across()
Useful?
- Readable and Concise: It simplifies repetitive transformations.
- Flexible: You can apply multiple functions or transformations in one line.
- Customizable Naming: Easily create new columns with meaningful names.
- Powerful Column Selection: Use selectors like
starts_with()
,ends_with()
, orwhere()
.
Exercise: Using across()
Try it yourself:
Add a 10% price tax to all price columns in the untidy version of muscle_cola_data
.
Use the across()
function to add a 10% price tax to all price_* columns.
Fully worked solution:
<- muscle_cola_data |>
muscle_cola_data mutate(
across(starts_with("price"), ~ .x * 1.10,
.names = "taxed_{.col}")
)|> select(starts_with("taxed")) muscle_cola_data
31.7 Conclusion
Transforming columns is an essential step in preparing data for analysis. With the tools in this chapter, you can:
- Create new variables with
mutate()
. - Focus on relevant columns with
select()
. - Rename columns for clarity using
rename()
andrename_with()
. - Reorder columns using
relocate()
. - Apply advanced mutations programmatically with
across()
.
These skills will streamline your workflow and improve your ability to handle real-world datasets effectively.