# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
6 P6 Product 6 Home Goods 96.3 2 Los Angeles
7 P7 Product 7 Electronics 46.1 17 San Francisco
8 P8 Product 8 Clothing 27.4 12 Los Angeles
9 P9 Product 9 Home Goods 51.0 35 New York
10 P10 Product 10 Clothing 46.0 9 San Francisco
# ℹ 90 more rows
30 Transforming Rows
Continuing our study of how to transform existing data into the data that we need for analysis, we learn to transform rows. Remember that in tidy data, every row is an observation of the data source. Once your data is tidy, transforming rows allows you to isolate, reorder, or clean observations in preparation for further analysis. In this chapter, we’ll focus on four tidyverse functions that are designed to transform rows: slice()
, filter()
, arrange()
, and distinct()
.
30.1 Demonstration Data: Inventory at GearHub
GearHub is a fictional retail company specializing in a variety of products, including electronics, clothing, and home goods. The company operates stores in multiple locations, such as New York, San Francisco, and Los Angeles. GearHub’s inventory dataset provides a detailed view of the products available across these locations, tracking key information about each item. Specifically, the sales_inventory
dataset contains data on:
ProductID
: A unique identifier for each product (e.g., “P1”, “P2”).ProductName
: Descriptive names for products (e.g., “Wireless Headphones”, “Winter Jacket”).Category
: Product categories such as “Electronics”, “Clothing”, and “Home Goods”.Price
: The price of each product in dollars.Quantity
: The number of units currently in stock.StoreLocation
: The location of the store where the product is available (e.g., “New York”).
30.2 Selecting Rows by Position with slice()
The slice()
function selects rows based on their position in the dataset.
Syntax
The basic syntax for the slice()
function is
slice(data, row_number)
Here:
- data: The tibble containing the dataset to be transformed.
- row_number: Code specifying which rows to include.
Saving the Transformed Data
In most cases, you’ll need the transformed data for further analysis. To save the result, you assign it to a new dataset (or overwrite the original dataset if appropriate):
<- orig_data |>
new_data slice(row_number)
This flexibility makes slice()
a versatile tool for working with rows beyond just selecting specific positions.
Selecting the First 5 Products
To select the first five rows of sales_inventory
with slice()
:
|> slice(1:5) sales_inventory
# A tibble: 5 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
Alternatively, you could use the head()
function to achieve the same result:
head(sales_inventory, 5)
# A tibble: 5 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
The difference with slice()
is that it allows you to select rows from any position in the dataset, not just the head or tail.
Dropping the First 4 Products
Rows can also be excluded by specifying negative indices:
# Drop the first four rows
|>
sales_inventory slice(-(1:4))
# A tibble: 96 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P5 Product 5 Electronics 25.9 46 New York
2 P6 Product 6 Home Goods 96.3 2 Los Angeles
3 P7 Product 7 Electronics 46.1 17 San Francisco
4 P8 Product 8 Clothing 27.4 12 Los Angeles
5 P9 Product 9 Home Goods 51.0 35 New York
6 P10 Product 10 Clothing 46.0 9 San Francisco
7 P11 Product 11 Electronics 9.05 9 Los Angeles
8 P12 Product 12 Home Goods 59.9 28 New York
9 P13 Product 13 Clothing 92.9 37 New York
10 P14 Product 14 Home Goods 99.1 39 San Francisco
# ℹ 86 more rows
Remember that we used the
slice()
function previously when importing Qualtrics data from .csv files Section 12.3.4 and Section 13.4. Qualtrics datasets often include rows of metadata unrelated to survey responses. Whileread_csv()
can skip the first row of metadata,slice()
allowed us to remove additional metadata rows while keeping the column names intact. This prior example highlights howslice()
is useful for selecting or excluding specific rows during data import.
Variants of slice()
slice_head()
Select the first n
rows.
|>
sales_inventory slice_head(n = 5)
# A tibble: 5 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
slice_tail()
Select the last n
rows.
|>
sales_inventory slice_tail(n = 5)
# A tibble: 5 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P96 Product 96 Home Goods 29.4 45 San Francisco
2 P97 Product 97 Clothing 84.5 10 New York
3 P98 Product 98 Home Goods 22.8 35 Los Angeles
4 P99 Product 99 Clothing 81.2 34 New York
5 P100 Product 100 Home Goods 98.9 42 San Francisco
slice_sample()
Select random rows.
|>
sales_inventory slice_sample(n = 5)
# A tibble: 5 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P70 Product 70 Electronics 11.1 33 New York
2 P12 Product 12 Home Goods 59.9 28 New York
3 P45 Product 45 Electronics 43.6 45 San Francisco
4 P38 Product 38 Home Goods 58.9 31 Los Angeles
5 P6 Product 6 Home Goods 96.3 2 Los Angeles
slice_min()
/ slice_max()
Select rows with minimum or maximum values for a column.
|>
sales_inventory slice_max(Price, n = 3)
# A tibble: 3 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P14 Product 14 Home Goods 99.1 39 San Francisco
2 P100 Product 100 Home Goods 98.9 42 San Francisco
3 P22 Product 22 Home Goods 98.7 43 New York
Exercise: Using slice()
Try it yourself:
Let’s use the Muscle Cola customer dataset to practice row transformations. This dataset, introduced in a grammar of graphics exercise in Section 18.5.2, contains survey data from potential customers of Muscle Cola, a protein-infused cola drink designed for fitness enthusiasts.
This dataset has been tidied so each row represents the response of a customer to a specified price for a serving of Muscle Cola. The variables are
Customer Preferences:
Gym_preference
: Indicates whether the respondent prefers a gym workout, another form of exercise, or both.Gym_member
: Specifies whether the respondent is a gym member (“Yes”/“No”).Gym_freq
: Numeric value representing the average number of times per week the respondent goes to the gym, ranging from occasional to daily attendance.Protein_importance
: Likert-scale response indicating how important protein intake is to the respondent, ranging from “Below Average” to “Above Average”.
Customer Demand:
Price
: Specifies a series of prices at which respondents stated the quantity they would consumerQuantity
: Quantities that respondents would consume at different prices per month.
Demographics:
Gender
: Gender of the respondent.Age
: Numeric value indicating the respondent’s age.Weight
: The respondent’s weight in pounds, which may be relevant for understanding protein needs.
For the muscle_cola_data
dataset,
- Use
slice_head()
to select the first 10 rows. - Use
slice_sample()
to randomly select 5 rows. - Use
slice_max()
to find the 3 rows with the highestQuantity
.
- Call
slice_head()
, specify themuscle_cola_data
dataset and select the first 10 rows or callslice()
, specify themuscle_cola_data
dataset and specify rows 1 through 10. - Call
slice_sample()
, specify themuscle_cola_data
dataset and randomly select 5 rows. - Call
slice_max()
, specify themuscle_cola_data
dataset and select the 3 rows with the highestQuantity
.
Fully worked solution:
# 1. Use `slice_head()` to select the first 10 rows.
|> slice_head(n = 10)
muscle_cola_data # 1. Use `slice()` to select the first 10 rows.
|> slice(1:10)
muscle_cola_data
# 2. Use `slice_sample()` to randomly select 5 rows.
|> slice_sample(n = 5)
muscle_cola_data
# 3. Use `slice_max()` to find the 3 rows with the highest `Quantity`.
|> slice_max(Quantity, n = 3) muscle_cola_data
Notice that the
slice_max()
command selected more than 3 rows. This is because the maximum quantity was 60 and there were more than 3 cases of that quantity.
30.3 Selecting Rows by Condition with filter()
It’s useful for narrowing down data to focus on specific criteria. The filter()
function subsets rows based on conditions.
Syntax
filter(data, condition1, condition2, ...)
Demo: Filtering Products by Category and Price
Suppose you want to filter the GearHub dataset for Electronics
products priced over $50.
|>
sales_inventory filter(Category == "Electronics", Price > 50)
# A tibble: 14 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P2 Product 2 Electronics 70.7 6 San Francisco
2 P24 Product 24 Electronics 70.6 49 New York
3 P26 Product 26 Electronics 60.4 11 San Francisco
4 P33 Product 33 Electronics 55.3 38 New York
5 P35 Product 35 Electronics 77.3 5 Los Angeles
6 P46 Product 46 Electronics 93.2 36 Los Angeles
7 P54 Product 54 Electronics 69.0 29 New York
8 P64 Product 64 Electronics 87.9 50 New York
9 P69 Product 69 Electronics 81.7 40 Los Angeles
10 P74 Product 74 Electronics 62.0 8 San Francisco
11 P77 Product 77 Electronics 87.5 44 San Francisco
12 P80 Product 80 Electronics 78.6 50 New York
13 P93 Product 93 Electronics 56.1 38 Los Angeles
14 P95 Product 95 Electronics 80.3 22 San Francisco
Common Conditions Used for Filtering
Equalities
Select products with Quantity
of inventory of 50 units
|>
sales_inventory filter(Quantity == 50)
# A tibble: 2 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P64 Product 64 Electronics 87.9 50 New York
2 P80 Product 80 Electronics 78.6 50 New York
Inequalities
Select products with a price less than 10.
|>
sales_inventory filter(Price < 10)
# A tibble: 2 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P11 Product 11 Electronics 9.05 9 Los Angeles
2 P75 Product 75 Clothing 9.39 36 San Francisco
Character Strings
For character and factor variables, specify the condition for filtering. Filter for products in the San Francisco
store location`.
|>
sales_inventory filter(StoreLocation == "San Francisco")
# A tibble: 36 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P2 Product 2 Electronics 70.7 6 San Francisco
2 P3 Product 3 Home Goods 32.8 41 San Francisco
3 P7 Product 7 Electronics 46.1 17 San Francisco
4 P10 Product 10 Clothing 46.0 9 San Francisco
5 P14 Product 14 Home Goods 99.1 39 San Francisco
6 P16 Product 16 Home Goods 89.1 31 San Francisco
7 P17 Product 17 Clothing 62.6 42 San Francisco
8 P23 Product 23 Electronics 19.3 42 San Francisco
9 P26 Product 26 Electronics 60.4 11 San Francisco
10 P27 Product 27 Electronics 18.3 5 San Francisco
# ℹ 26 more rows
Multiple (Boolean) Conditions
Filter on multiple conditions using logical operators (boolean conditions1) such as & (AND), | (OR), and ! (NOT). Filter for products in the Home Goods
category in the New York
store location.
|>
sales_inventory filter(Category == "Home Goods" & StoreLocation == "New York")
# A tibble: 7 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P9 Product 9 Home Goods 51.0 35 New York
2 P12 Product 12 Home Goods 59.9 28 New York
3 P22 Product 22 Home Goods 98.7 43 New York
4 P50 Product 50 Home Goods 14.7 18 New York
5 P68 Product 68 Home Goods 69.6 44 New York
6 P73 Product 73 Home Goods 55.1 5 New York
7 P78 Product 78 Home Goods 19.7 25 New York
Distinction Between
=
and==
: In R,=
is used for assignment, while==
is used for equality comparison. When filtering rows based on values, always use==
. For example:
- Correct: Filter rows where order Quantity is equal to 103
high_inventory <- sales_inventory |> filter(Quantity == 50)
- Incorrect: Avoid using = for comparison. (This code will generate an error)
high_inventory <- sales_inventory |> filter(Price = 55)
In this section, we’ve explored the filter()
function, emphasizing its role as a verb that selects rows based on conditions to help you extract meaningful information from your datasets.
Exercise: Using filter()
Try it yourself:
Filter the muscle_cola_data
dataset to find
- Respondents who have a gym membership.
- Respondents who go to the gym at least 5 days per week.
- Respondents who place moderately above average and far above average importance on protein in their diet.
- Respondents who are younger than the average age and have a gym membership.
Specify the muscle_cola_data
dataset and then:
- Call
filter()
and select rows of respondents with a gym membership. - Call
filter()
and select rows of respondents who go to the gym at least 5 days per week. - Call
filter()
and select rows of respondents who place importance on protein in their diet moderately above average and far above average. - Call
filter()
and select rows of respondents under the average age and have a gym membership
Fully worked solution:
# 1. Filter for respondents with a gym membership
|> filter(Gym_member == "Yes")
muscle_cola_data
# 2. Filter for respondents who go to the gym at least 5 days per week.
|> filter(Gym_freq >= 5)
muscle_cola_data
# 3. Filter for respondents who place the importance of protein in their diet at a level of moderately above average or far above average.
|> filter(Protein_importance == "Moderately above average" |
muscle_cola_data == "Far above average")
Protein_importance |> filter(Protein_importance %in% c("Moderately above average", "Far above average"))
muscle_cola_data
# 4. Call `filter()` and select rows of respondents younger than the average age and have a gym membership
|> filter(Age > mean(Age, na.rm=T) & Gym_member == "Yes") muscle_cola_data
30.4 Reordering Rows with arrange()
Sorting rows is a common data transformation task, and arrange()
allows you to specify the order in which rows should appear based on one or more columns. You can arrange data in ascending (default) or descending order for each column.
Syntax
arrange(data, column1, desc(column2))
Demo: Sorting Products by Single and Multiple Variables
Sort by a single variable in ascending order
Sort the GearHub dataset by Price
in ascending order:
|>
sales_inventory arrange(Price)
# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P11 Product 11 Electronics 9.05 9 Los Angeles
2 P75 Product 75 Clothing 9.39 36 San Francisco
3 P86 Product 86 Home Goods 10.4 46 San Francisco
4 P62 Product 62 Electronics 10.8 17 San Francisco
5 P70 Product 70 Electronics 11.1 33 New York
6 P39 Product 39 Clothing 11.5 9 Los Angeles
7 P1 Product 1 Electronics 12.2 40 Los Angeles
8 P32 Product 32 Home Goods 12.7 48 Los Angeles
9 P79 Product 79 Clothing 13.4 10 New York
10 P50 Product 50 Home Goods 14.7 18 New York
# ℹ 90 more rows
Sort by a single variable in descending order
Sort the GearHub dataset by Quantity
in descending order:
|>
sales_inventory arrange(desc(Quantity))
# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P64 Product 64 Electronics 87.9 50 New York
2 P80 Product 80 Electronics 78.6 50 New York
3 P24 Product 24 Electronics 70.6 49 New York
4 P32 Product 32 Home Goods 12.7 48 Los Angeles
5 P41 Product 41 Home Goods 83.0 47 San Francisco
6 P55 Product 55 Home Goods 27.0 47 San Francisco
7 P72 Product 72 Clothing 22.1 47 Los Angeles
8 P5 Product 5 Electronics 25.9 46 New York
9 P31 Product 31 Clothing 97.9 46 Los Angeles
10 P86 Product 86 Home Goods 10.4 46 San Francisco
# ℹ 90 more rows
Sort by multiple variables for hierarchical ordering
Sort by Category
alphabetically and then by Price
(descending):
|>
sales_inventory arrange(Category, desc(Price))
# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P31 Product 31 Clothing 97.9 46 Los Angeles
2 P4 Product 4 Clothing 95.1 21 New York
3 P18 Product 18 Clothing 94.0 9 Los Angeles
4 P13 Product 13 Clothing 92.9 37 New York
5 P53 Product 53 Clothing 92.5 17 San Francisco
6 P63 Product 63 Clothing 87.6 6 New York
7 P48 Product 48 Clothing 86.8 16 San Francisco
8 P97 Product 97 Clothing 84.5 10 New York
9 P37 Product 37 Clothing 81.2 9 Los Angeles
10 P99 Product 99 Clothing 81.2 34 New York
# ℹ 90 more rows
Exercise: Using arrange()
Try it yourself:
- Arrange
muscle_cola_data
responses byQuantity
(ascending). - Arrange
muscle_cola_data
responses byPrice
(descending). - Arrange
muscle_cola_data
responses byGym_freq
(ascending) andPrice
(descending).
Specify the muscle_cola_data
dataset and then:
- Call
arrange()
byQuantity
in ascending order. - Call
arrange()
byPrice
in descending order. - Call
arrange()
byGym_freq
in ascending order and thenPrice
in descending order.
Fully worked solution:
# 1. Arrange `muscle_cola_data` responses by `Quantity` (ascending).
|> arrange(Quantity)
muscle_cola_data
# 2. Arrange `muscle_cola_data` responses by `Price` (descending).
|> arrange(desc(Quantity))
muscle_cola_data
# 3. Arrange `muscle_cola_data` responses by `Gym_freq` (ascending) and `Price` (descending).
|> arrange(Gym_freq, desc(Quantity)) muscle_cola_data
30.5 Removing Duplicate Rows with distinct()
The distinct()
function removes duplicate rows or identifies unique values in specific columns. distinct()
is useful when you want to retain only the unique rows in your dataset.
Syntax
distinct(data, column1, column2, .keep_all = TRUE)
Demo: Removing Duplicate Products
Identify unique categories
|>
sales_inventory distinct(Category)
# A tibble: 3 × 1
Category
<chr>
1 Electronics
2 Home Goods
3 Clothing
Identify distinct combinations of Category
and StoreLocation
:
|>
sales_inventory distinct(Category, StoreLocation)
# A tibble: 9 × 2
Category StoreLocation
<chr> <chr>
1 Electronics Los Angeles
2 Electronics San Francisco
3 Home Goods San Francisco
4 Clothing New York
5 Electronics New York
6 Home Goods Los Angeles
7 Clothing Los Angeles
8 Home Goods New York
9 Clothing San Francisco
Retain all columns while ensuring uniqueness based on Category
By default, the distinct()
function retains only the column(s) specified for identifying unique values. However, you can retain all other columns by using the .keep_all = TRUE
argument. This approach keeps the first instance of each distinct value while preserving the rest of the dataset’s information.
|>
sales_inventory distinct(Category, .keep_all = TRUE)
# A tibble: 3 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P3 Product 3 Home Goods 32.8 41 San Francisco
3 P4 Product 4 Clothing 95.1 21 New York
Remove all duplicate rows from a dataset
To remove duplicate rows entirely from a dataset using distinct()
, you simply call the function without specifying any columns. By default, distinct()
will consider all columns when determining whether a row is a duplicate. When no columns are specified, the function considers the entire row as a unique entity and only unique rows (based on all columns) are retained.
#Remove duplicate rows from the entire dataset
|>
sales_inventory distinct()
# A tibble: 100 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P1 Product 1 Electronics 12.2 40 Los Angeles
2 P2 Product 2 Electronics 70.7 6 San Francisco
3 P3 Product 3 Home Goods 32.8 41 San Francisco
4 P4 Product 4 Clothing 95.1 21 New York
5 P5 Product 5 Electronics 25.9 46 New York
6 P6 Product 6 Home Goods 96.3 2 Los Angeles
7 P7 Product 7 Electronics 46.1 17 San Francisco
8 P8 Product 8 Clothing 27.4 12 Los Angeles
9 P9 Product 9 Home Goods 51.0 35 New York
10 P10 Product 10 Clothing 46.0 9 San Francisco
# ℹ 90 more rows
In this dataset, removing duplicate rows has no effect because there are none. Every row has a unique ProductID
and ProductName.
Removing duplicate rows is especially useful when working with datasets that may have accidental duplicate entries. If you want to keep duplicates based on specific columns, you would specify those columns explicitly. But for the entire dataset, no arguments are needed!
Exercise: Using distinct()
Try it yourself:
From the muscle_cola_data
dataset
- Find distinct combinations of
Gym_freq
andProtein_importance
. - Find distinct combinations of
Price
andQuantity
, retaining all of the variables. - Remove duplicate rows from the dataset entirely.
Specify the muscle_cola_data
dataset and then:
- Call
distinct()
withGym_freq
andProtein_importance
as arguments. - Call
distinct()
withPrice
,Quantity
, and.keep_all
as arguments. - Call
distinct()
with no arguments.
Fully worked solution:
# 1. Call `distinct()` with `Gym_freq` and `Protein_importance` as arguments.
|> distinct(Gym_freq, Protein_importance)
muscle_cola_data
# 2. Call `distinct()` with `Price`, `Quantity`, and `.keep_all` as arguments.
|> distinct(Price, Quantity, .keep_all = TRUE)
muscle_cola_data
# 3. Call `distinct()` with no arguments.
|> distinct() muscle_cola_data
30.6 Combining Row Transformations
You can combine row-transforming functions for more complex tasks.
Demo: Combining filter()
, arrange()
, and distinct()
Filter for Electronics
products priced over $50, remove duplicates by Category
, and sort by Quantity
(descending):
|>
sales_inventory filter(Category == "Electronics", Price > 50) |>
distinct(Category, .keep_all = TRUE) |>
arrange(desc(Quantity))
# A tibble: 1 × 6
ProductID ProductName Category Price Quantity StoreLocation
<chr> <chr> <chr> <dbl> <int> <chr>
1 P2 Product 2 Electronics 70.7 6 San Francisco
Exercise: Using Combined Row Transformation`
Try it yourself:
From the muscle_cola_data
dataset
- Filter for respondents aged 30 and below, find distinct
Gym_preference
, and arrange them alphabetically. - Filter the dataset to include only customers who visit the gym at least three times per week. From this subset, select the top 5 responses with the highest quantity, ensuring the results are sorted in descending order. Include the corresponding price for each response.
Specify the muscle_cola_data
dataset and then:
- Call
filter()
for respondents aged 30 and below, then calldistinct()
forGym_preference
, andarrange()
them alphabetically. - Call
filter()
for respondents who go to the gym at least 3 times per week, callslice_max()
to select the top 5 responses with the highest quantity, ensuring the results are sorted in descending order.
Fully worked solution:
# 1. Filter for respondents aged 30 and below, find distinct `Gym_preference`, and arrange them alphabetically.
|> filter(Age <= 30) |>
muscle_cola_data distinct(Gym_preference, .keep_all = TRUE) |>
arrange()
# 2. Filter for gym-goers with high frequency and select top responses by quantity
|> filter(Gym_freq >= 3) |> # Customers who go to the gym 3+ times per week
muscle_cola_data slice_max(Quantity, n = 5) |> # Top 5 by Quantity
arrange(desc(Quantity)) # Ensure descending order
30.7 Chapter Summary
In this chapter, you learned how to:
- Use
filter()
to select rows based on conditions. - Use
slice()
to select rows by position or value. - Use
arrange()
to reorder rows. - Use
distinct()
to remove duplicates or find unique combinations.
By mastering these functions, you can effectively manipulate and prepare data for analysis, making your datasets cleaner, more focused, and easier to interpret.
Boolean conditions, named after mathematician and logician George Boole, are expressions or statements in programming that evaluate to either true or false, representing logical values or states.↩︎