30 Transforming Rows

Continuing our study of how to transform existing data into the data that we need for analysis, we learn to transform rows. Remember that in tidy data, every row is an observation of the data source. Once your data is tidy, transforming rows allows you to isolate, reorder, or clean observations in preparation for further analysis. In this chapter, we’ll focus on four tidyverse functions that are designed to transform rows: slice(), filter(), arrange(), and distinct().

30.1 Demonstration Data: Inventory at GearHub

GearHub is a fictional retail company specializing in a variety of products, including electronics, clothing, and home goods. The company operates stores in multiple locations, such as New York, San Francisco, and Los Angeles. GearHub’s inventory dataset provides a detailed view of the products available across these locations, tracking key information about each item. Specifically, the sales_inventory dataset contains data on:

ProductID: A unique identifier for each product (e.g., “P1”, “P2”).
ProductName: Descriptive names for products (e.g., “Wireless Headphones”, “Winter Jacket”).
Category: Product categories such as “Electronics”, “Clothing”, and “Home Goods”.
Price: The price of each product in dollars.
Quantity: The number of units currently in stock.
StoreLocation: The location of the store where the product is available (e.g., “New York”).

# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P1        Product 1   Electronics  12.2       40 Los Angeles  
 2 P2        Product 2   Electronics  70.7        6 San Francisco
 3 P3        Product 3   Home Goods   32.8       41 San Francisco
 4 P4        Product 4   Clothing     95.1       21 New York     
 5 P5        Product 5   Electronics  25.9       46 New York     
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles  
 7 P7        Product 7   Electronics  46.1       17 San Francisco
 8 P8        Product 8   Clothing     27.4       12 Los Angeles  
 9 P9        Product 9   Home Goods   51.0       35 New York     
10 P10       Product 10  Clothing     46.0        9 San Francisco
# ℹ 90 more rows

30.2 Selecting Rows by Position with `slice()`

The slice() function selects rows based on their position in the dataset.

Syntax

The basic syntax for the slice() function is

slice(data, row_number)

Here:

data: The tibble containing the dataset to be transformed.
row_number: Code specifying which rows to include.

Saving the Transformed Data

In most cases, you’ll need the transformed data for further analysis. To save the result, you assign it to a new dataset (or overwrite the original dataset if appropriate):

new_data <- orig_data |> 
  slice(row_number)

This flexibility makes slice() a versatile tool for working with rows beyond just selecting specific positions.

Selecting the First 5 Products

To select the first five rows of sales_inventory with slice():

sales_inventory |> slice(1:5)

# A tibble: 5 × 6
  ProductID ProductName Category    Price Quantity StoreLocation
  <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
1 P1        Product 1   Electronics  12.2       40 Los Angeles  
2 P2        Product 2   Electronics  70.7        6 San Francisco
3 P3        Product 3   Home Goods   32.8       41 San Francisco
4 P4        Product 4   Clothing     95.1       21 New York     
5 P5        Product 5   Electronics  25.9       46 New York

Alternatively, you could use the head() function to achieve the same result:

head(sales_inventory, 5)

# A tibble: 5 × 6
  ProductID ProductName Category    Price Quantity StoreLocation
  <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
1 P1        Product 1   Electronics  12.2       40 Los Angeles  
2 P2        Product 2   Electronics  70.7        6 San Francisco
3 P3        Product 3   Home Goods   32.8       41 San Francisco
4 P4        Product 4   Clothing     95.1       21 New York     
5 P5        Product 5   Electronics  25.9       46 New York

The difference with slice() is that it allows you to select rows from any position in the dataset, not just the head or tail.

Dropping the First 4 Products

Rows can also be excluded by specifying negative indices:

# Drop the first four rows
sales_inventory |> 
  slice(-(1:4))

# A tibble: 96 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P5        Product 5   Electronics 25.9        46 New York     
 2 P6        Product 6   Home Goods  96.3         2 Los Angeles  
 3 P7        Product 7   Electronics 46.1        17 San Francisco
 4 P8        Product 8   Clothing    27.4        12 Los Angeles  
 5 P9        Product 9   Home Goods  51.0        35 New York     
 6 P10       Product 10  Clothing    46.0         9 San Francisco
 7 P11       Product 11  Electronics  9.05        9 Los Angeles  
 8 P12       Product 12  Home Goods  59.9        28 New York     
 9 P13       Product 13  Clothing    92.9        37 New York     
10 P14       Product 14  Home Goods  99.1        39 San Francisco
# ℹ 86 more rows

Remember that we used the slice() function previously when importing Qualtrics data from .csv files Section 12.3.4 and Section 13.4. Qualtrics datasets often include rows of metadata unrelated to survey responses. While read_csv() can skip the first row of metadata, slice() allowed us to remove additional metadata rows while keeping the column names intact. This prior example highlights how slice() is useful for selecting or excluding specific rows during data import.

Variants of `slice()`

`slice_head()`

Select the first n rows.

sales_inventory |>
  slice_head(n = 5)

# A tibble: 5 × 6
  ProductID ProductName Category    Price Quantity StoreLocation
  <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
1 P1        Product 1   Electronics  12.2       40 Los Angeles  
2 P2        Product 2   Electronics  70.7        6 San Francisco
3 P3        Product 3   Home Goods   32.8       41 San Francisco
4 P4        Product 4   Clothing     95.1       21 New York     
5 P5        Product 5   Electronics  25.9       46 New York

`slice_tail()`

Select the last n rows.

sales_inventory |>
  slice_tail(n = 5)

# A tibble: 5 × 6
  ProductID ProductName Category   Price Quantity StoreLocation
  <chr>     <chr>       <chr>      <dbl>    <int> <chr>        
1 P96       Product 96  Home Goods  29.4       45 San Francisco
2 P97       Product 97  Clothing    84.5       10 New York     
3 P98       Product 98  Home Goods  22.8       35 Los Angeles  
4 P99       Product 99  Clothing    81.2       34 New York     
5 P100      Product 100 Home Goods  98.9       42 San Francisco

`slice_sample()`

Select random rows.

  sales_inventory |>
    slice_sample(n = 5)

# A tibble: 5 × 6
  ProductID ProductName Category    Price Quantity StoreLocation
  <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
1 P70       Product 70  Electronics  11.1       33 New York     
2 P12       Product 12  Home Goods   59.9       28 New York     
3 P45       Product 45  Electronics  43.6       45 San Francisco
4 P38       Product 38  Home Goods   58.9       31 Los Angeles  
5 P6        Product 6   Home Goods   96.3        2 Los Angeles

`slice_min()` / `slice_max()`

Select rows with minimum or maximum values for a column.

  sales_inventory |>
    slice_max(Price, n = 3)

# A tibble: 3 × 6
  ProductID ProductName Category   Price Quantity StoreLocation
  <chr>     <chr>       <chr>      <dbl>    <int> <chr>        
1 P14       Product 14  Home Goods  99.1       39 San Francisco
2 P100      Product 100 Home Goods  98.9       42 San Francisco
3 P22       Product 22  Home Goods  98.7       43 New York

Exercise: Using `slice()`

Try it yourself:

Let’s use the Muscle Cola customer dataset to practice row transformations. This dataset, introduced in a grammar of graphics exercise in Section 18.5.2, contains survey data from potential customers of Muscle Cola, a protein-infused cola drink designed for fitness enthusiasts.

This dataset has been tidied so each row represents the response of a customer to a specified price for a serving of Muscle Cola. The variables are

Customer Preferences:
- Gym_preference: Indicates whether the respondent prefers a gym workout, another form of exercise, or both.
- Gym_member: Specifies whether the respondent is a gym member (“Yes”/“No”).
- Gym_freq: Numeric value representing the average number of times per week the respondent goes to the gym, ranging from occasional to daily attendance.
- Protein_importance: Likert-scale response indicating how important protein intake is to the respondent, ranging from “Below Average” to “Above Average”.
Customer Demand:
- Price: Specifies a series of prices at which respondents stated the quantity they would consumer
- Quantity: Quantities that respondents would consume at different prices per month.
Demographics:
- Gender: Gender of the respondent.
- Age: Numeric value indicating the respondent’s age.
- Weight: The respondent’s weight in pounds, which may be relevant for understanding protein needs.

For the muscle_cola_data dataset,

Use slice_head() to select the first 10 rows.
Use slice_sample() to randomly select 5 rows.
Use slice_max() to find the 3 rows with the highest Quantity.

Fully worked solution:

# 1. Use `slice_head()` to select the first 10 rows.
muscle_cola_data |> slice_head(n = 10)
# 1. Use `slice()` to select the first 10 rows.
muscle_cola_data |> slice(1:10)

# 2. Use `slice_sample()` to randomly select 5 rows.
muscle_cola_data |> slice_sample(n = 5)

# 3. Use `slice_max()` to find the 3 rows with the highest `Quantity`.
muscle_cola_data |> slice_max(Quantity, n = 3)

Notice that the slice_max() command selected more than 3 rows. This is because the maximum quantity was 60 and there were more than 3 cases of that quantity.

30.3 Selecting Rows by Condition with `filter()`

It’s useful for narrowing down data to focus on specific criteria. The filter() function subsets rows based on conditions.

Syntax

filter(data, condition1, condition2, ...)

Demo: Filtering Products by Category and Price

Suppose you want to filter the GearHub dataset for Electronics products priced over $50.

sales_inventory |>
  filter(Category == "Electronics", Price > 50)

# A tibble: 14 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P2        Product 2   Electronics  70.7        6 San Francisco
 2 P24       Product 24  Electronics  70.6       49 New York     
 3 P26       Product 26  Electronics  60.4       11 San Francisco
 4 P33       Product 33  Electronics  55.3       38 New York     
 5 P35       Product 35  Electronics  77.3        5 Los Angeles  
 6 P46       Product 46  Electronics  93.2       36 Los Angeles  
 7 P54       Product 54  Electronics  69.0       29 New York     
 8 P64       Product 64  Electronics  87.9       50 New York     
 9 P69       Product 69  Electronics  81.7       40 Los Angeles  
10 P74       Product 74  Electronics  62.0        8 San Francisco
11 P77       Product 77  Electronics  87.5       44 San Francisco
12 P80       Product 80  Electronics  78.6       50 New York     
13 P93       Product 93  Electronics  56.1       38 Los Angeles  
14 P95       Product 95  Electronics  80.3       22 San Francisco

Common Conditions Used for Filtering

Equalities

Select products with Quantity of inventory of 50 units

sales_inventory |>
  filter(Quantity == 50)

# A tibble: 2 × 6
  ProductID ProductName Category    Price Quantity StoreLocation
  <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
1 P64       Product 64  Electronics  87.9       50 New York     
2 P80       Product 80  Electronics  78.6       50 New York

Inequalities

Select products with a price less than 10.

sales_inventory |>
  filter(Price < 10)

# A tibble: 2 × 6
  ProductID ProductName Category    Price Quantity StoreLocation
  <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
1 P11       Product 11  Electronics  9.05        9 Los Angeles  
2 P75       Product 75  Clothing     9.39       36 San Francisco

Character Strings

For character and factor variables, specify the condition for filtering. Filter for products in the San Francisco store location`.

sales_inventory |>
  filter(StoreLocation == "San Francisco")

# A tibble: 36 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P2        Product 2   Electronics  70.7        6 San Francisco
 2 P3        Product 3   Home Goods   32.8       41 San Francisco
 3 P7        Product 7   Electronics  46.1       17 San Francisco
 4 P10       Product 10  Clothing     46.0        9 San Francisco
 5 P14       Product 14  Home Goods   99.1       39 San Francisco
 6 P16       Product 16  Home Goods   89.1       31 San Francisco
 7 P17       Product 17  Clothing     62.6       42 San Francisco
 8 P23       Product 23  Electronics  19.3       42 San Francisco
 9 P26       Product 26  Electronics  60.4       11 San Francisco
10 P27       Product 27  Electronics  18.3        5 San Francisco
# ℹ 26 more rows

Multiple (Boolean) Conditions

Filter on multiple conditions using logical operators (boolean conditions¹) such as & (AND), | (OR), and ! (NOT). Filter for products in the Home Goods category in the New York store location.

sales_inventory |>
  filter(Category == "Home Goods" & StoreLocation == "New York")

# A tibble: 7 × 6
  ProductID ProductName Category   Price Quantity StoreLocation
  <chr>     <chr>       <chr>      <dbl>    <int> <chr>        
1 P9        Product 9   Home Goods  51.0       35 New York     
2 P12       Product 12  Home Goods  59.9       28 New York     
3 P22       Product 22  Home Goods  98.7       43 New York     
4 P50       Product 50  Home Goods  14.7       18 New York     
5 P68       Product 68  Home Goods  69.6       44 New York     
6 P73       Product 73  Home Goods  55.1        5 New York     
7 P78       Product 78  Home Goods  19.7       25 New York

Distinction Between = and ==: In R, = is used for assignment, while == is used for equality comparison. When filtering rows based on values, always use ==. For example:

Correct: Filter rows where order Quantity is equal to 103 high_inventory <- sales_inventory |> filter(Quantity == 50)

Incorrect: Avoid using = for comparison. (This code will generate an error) high_inventory <- sales_inventory |> filter(Price = 55)

In this section, we’ve explored the filter() function, emphasizing its role as a verb that selects rows based on conditions to help you extract meaningful information from your datasets.

Exercise: Using `filter()`

Try it yourself:

Filter the muscle_cola_data dataset to find

Respondents who have a gym membership.
Respondents who go to the gym at least 5 days per week.
Respondents who place moderately above average and far above average importance on protein in their diet.
Respondents who are younger than the average age and have a gym membership.

Fully worked solution:

# 1. Filter for respondents with a gym membership
muscle_cola_data |> filter(Gym_member == "Yes")

# 2. Filter for respondents who go to the gym at least 5 days per week.
muscle_cola_data |> filter(Gym_freq >= 5)

# 3. Filter for respondents who place the importance of protein in their diet at a level of moderately above average or far above average.
muscle_cola_data |> filter(Protein_importance == "Moderately above average" |
                           Protein_importance == "Far above average") 
muscle_cola_data |> filter(Protein_importance %in% c("Moderately above average", "Far above average")) 
                            
# 4. Call `filter()` and select rows of respondents younger than the average age and have a gym membership
muscle_cola_data |> filter(Age > mean(Age, na.rm=T) & Gym_member == "Yes")

30.4 Reordering Rows with `arrange()`

Sorting rows is a common data transformation task, and arrange() allows you to specify the order in which rows should appear based on one or more columns. You can arrange data in ascending (default) or descending order for each column.

Syntax

arrange(data, column1, desc(column2))

Demo: Sorting Products by Single and Multiple Variables

Sort by a single variable in ascending order

Sort the GearHub dataset by Price in ascending order:

sales_inventory |>
  arrange(Price)

# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P11       Product 11  Electronics  9.05        9 Los Angeles  
 2 P75       Product 75  Clothing     9.39       36 San Francisco
 3 P86       Product 86  Home Goods  10.4        46 San Francisco
 4 P62       Product 62  Electronics 10.8        17 San Francisco
 5 P70       Product 70  Electronics 11.1        33 New York     
 6 P39       Product 39  Clothing    11.5         9 Los Angeles  
 7 P1        Product 1   Electronics 12.2        40 Los Angeles  
 8 P32       Product 32  Home Goods  12.7        48 Los Angeles  
 9 P79       Product 79  Clothing    13.4        10 New York     
10 P50       Product 50  Home Goods  14.7        18 New York     
# ℹ 90 more rows

Sort by a single variable in descending order

Sort the GearHub dataset by Quantity in descending order:

sales_inventory |>
  arrange(desc(Quantity))

# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P64       Product 64  Electronics  87.9       50 New York     
 2 P80       Product 80  Electronics  78.6       50 New York     
 3 P24       Product 24  Electronics  70.6       49 New York     
 4 P32       Product 32  Home Goods   12.7       48 Los Angeles  
 5 P41       Product 41  Home Goods   83.0       47 San Francisco
 6 P55       Product 55  Home Goods   27.0       47 San Francisco
 7 P72       Product 72  Clothing     22.1       47 Los Angeles  
 8 P5        Product 5   Electronics  25.9       46 New York     
 9 P31       Product 31  Clothing     97.9       46 Los Angeles  
10 P86       Product 86  Home Goods   10.4       46 San Francisco
# ℹ 90 more rows

Sort by multiple variables for hierarchical ordering

Sort by Category alphabetically and then by Price (descending):

sales_inventory |>
  arrange(Category, desc(Price))

# A tibble: 100 × 6
   ProductID ProductName Category Price Quantity StoreLocation
   <chr>     <chr>       <chr>    <dbl>    <int> <chr>        
 1 P31       Product 31  Clothing  97.9       46 Los Angeles  
 2 P4        Product 4   Clothing  95.1       21 New York     
 3 P18       Product 18  Clothing  94.0        9 Los Angeles  
 4 P13       Product 13  Clothing  92.9       37 New York     
 5 P53       Product 53  Clothing  92.5       17 San Francisco
 6 P63       Product 63  Clothing  87.6        6 New York     
 7 P48       Product 48  Clothing  86.8       16 San Francisco
 8 P97       Product 97  Clothing  84.5       10 New York     
 9 P37       Product 37  Clothing  81.2        9 Los Angeles  
10 P99       Product 99  Clothing  81.2       34 New York     
# ℹ 90 more rows

Exercise: Using `arrange()`

Try it yourself:

Arrange muscle_cola_data responses by Quantity (ascending).
Arrange muscle_cola_data responses by Price (descending).
Arrange muscle_cola_data responses by Gym_freq (ascending) and Price (descending).

Fully worked solution:

# 1. Arrange `muscle_cola_data` responses by `Quantity` (ascending).
muscle_cola_data |> arrange(Quantity)

# 2. Arrange `muscle_cola_data` responses by `Price` (descending).
muscle_cola_data |> arrange(desc(Quantity))

# 3. Arrange `muscle_cola_data` responses by `Gym_freq` (ascending) and `Price` (descending).
muscle_cola_data |> arrange(Gym_freq, desc(Quantity))

30.5 Removing Duplicate Rows with `distinct()`

The distinct() function removes duplicate rows or identifies unique values in specific columns. distinct() is useful when you want to retain only the unique rows in your dataset.

Syntax

distinct(data, column1, column2, .keep_all = TRUE)

Demo: Removing Duplicate Products

Identify unique categories

sales_inventory |>
  distinct(Category)

# A tibble: 3 × 1
  Category   
  <chr>      
1 Electronics
2 Home Goods 
3 Clothing

Identify distinct combinations of `Category` and `StoreLocation`:

sales_inventory |>
  distinct(Category, StoreLocation)

# A tibble: 9 × 2
  Category    StoreLocation
  <chr>       <chr>        
1 Electronics Los Angeles  
2 Electronics San Francisco
3 Home Goods  San Francisco
4 Clothing    New York     
5 Electronics New York     
6 Home Goods  Los Angeles  
7 Clothing    Los Angeles  
8 Home Goods  New York     
9 Clothing    San Francisco

Retain all columns while ensuring uniqueness based on `Category`

By default, the distinct() function retains only the column(s) specified for identifying unique values. However, you can retain all other columns by using the .keep_all = TRUE argument. This approach keeps the first instance of each distinct value while preserving the rest of the dataset’s information.

sales_inventory |>
  distinct(Category, .keep_all = TRUE)

# A tibble: 3 × 6
  ProductID ProductName Category    Price Quantity StoreLocation
  <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
1 P1        Product 1   Electronics  12.2       40 Los Angeles  
2 P3        Product 3   Home Goods   32.8       41 San Francisco
3 P4        Product 4   Clothing     95.1       21 New York

Remove all duplicate rows from a dataset

To remove duplicate rows entirely from a dataset using distinct(), you simply call the function without specifying any columns. By default, distinct() will consider all columns when determining whether a row is a duplicate. When no columns are specified, the function considers the entire row as a unique entity and only unique rows (based on all columns) are retained.

#Remove duplicate rows from the entire dataset
sales_inventory |> 
  distinct()

# A tibble: 100 × 6
   ProductID ProductName Category    Price Quantity StoreLocation
   <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
 1 P1        Product 1   Electronics  12.2       40 Los Angeles  
 2 P2        Product 2   Electronics  70.7        6 San Francisco
 3 P3        Product 3   Home Goods   32.8       41 San Francisco
 4 P4        Product 4   Clothing     95.1       21 New York     
 5 P5        Product 5   Electronics  25.9       46 New York     
 6 P6        Product 6   Home Goods   96.3        2 Los Angeles  
 7 P7        Product 7   Electronics  46.1       17 San Francisco
 8 P8        Product 8   Clothing     27.4       12 Los Angeles  
 9 P9        Product 9   Home Goods   51.0       35 New York     
10 P10       Product 10  Clothing     46.0        9 San Francisco
# ℹ 90 more rows

In this dataset, removing duplicate rows has no effect because there are none. Every row has a unique ProductID and ProductName. Removing duplicate rows is especially useful when working with datasets that may have accidental duplicate entries. If you want to keep duplicates based on specific columns, you would specify those columns explicitly. But for the entire dataset, no arguments are needed!

Exercise: Using `distinct()`

Try it yourself:

From the muscle_cola_data dataset

Find distinct combinations of Gym_freq and Protein_importance.
Find distinct combinations of Price and Quantity, retaining all of the variables.
Remove duplicate rows from the dataset entirely.

Fully worked solution:

# 1. Call `distinct()` with `Gym_freq` and `Protein_importance` as arguments.
muscle_cola_data |> distinct(Gym_freq, Protein_importance)

# 2. Call `distinct()` with `Price`, `Quantity`, and `.keep_all` as arguments.
muscle_cola_data |> distinct(Price, Quantity, .keep_all = TRUE)

# 3. Call `distinct()` with no arguments.
muscle_cola_data |> distinct()

30.6 Combining Row Transformations

You can combine row-transforming functions for more complex tasks.

Demo: Combining `filter()`, `arrange()`, and `distinct()`

Filter for Electronics products priced over $50, remove duplicates by Category, and sort by Quantity (descending):

sales_inventory |>
  filter(Category == "Electronics", Price > 50) |>
  distinct(Category, .keep_all = TRUE) |>
  arrange(desc(Quantity))

# A tibble: 1 × 6
  ProductID ProductName Category    Price Quantity StoreLocation
  <chr>     <chr>       <chr>       <dbl>    <int> <chr>        
1 P2        Product 2   Electronics  70.7        6 San Francisco

Exercise: Using Combined Row Transformation`

Try it yourself:

From the muscle_cola_data dataset

Filter for respondents aged 30 and below, find distinct Gym_preference, and arrange them alphabetically.
Filter the dataset to include only customers who visit the gym at least three times per week. From this subset, select the top 5 responses with the highest quantity, ensuring the results are sorted in descending order. Include the corresponding price for each response.

Fully worked solution:


# 1. Filter for respondents aged 30 and below, find distinct `Gym_preference`, and arrange them alphabetically.
muscle_cola_data |> filter(Age <= 30) |>
                    distinct(Gym_preference, .keep_all = TRUE) |>
                    arrange()

# 2. Filter for gym-goers with high frequency and select top responses by quantity
muscle_cola_data |> filter(Gym_freq >= 3) |> # Customers who go to the gym 3+ times per week
                    slice_max(Quantity, n = 5) |> # Top 5 by Quantity 
                    arrange(desc(Quantity)) # Ensure descending order

30.7 Chapter Summary

In this chapter, you learned how to:

Use filter() to select rows based on conditions.
Use slice() to select rows by position or value.
Use arrange() to reorder rows.
Use distinct() to remove duplicates or find unique combinations.

By mastering these functions, you can effectively manipulate and prepare data for analysis, making your datasets cleaner, more focused, and easier to interpret.

Boolean conditions, named after mathematician and logician George Boole, are expressions or statements in programming that evaluate to either true or false, representing logical values or states.↩︎

30.1 Demonstration Data: Inventory at GearHub

30.2 Selecting Rows by Position with slice()

Syntax

Saving the Transformed Data

Selecting the First 5 Products

Dropping the First 4 Products

Variants of slice()

slice_head()

slice_tail()

slice_sample()

slice_min() / slice_max()

Exercise: Using slice()

Try it yourself:

Fully worked solution:

30.3 Selecting Rows by Condition with filter()

Syntax

Demo: Filtering Products by Category and Price

Common Conditions Used for Filtering

Equalities

Inequalities

Character Strings

Multiple (Boolean) Conditions

Exercise: Using filter()

Try it yourself:

Fully worked solution:

30.4 Reordering Rows with arrange()

Syntax

Demo: Sorting Products by Single and Multiple Variables

Sort by a single variable in ascending order

Sort by a single variable in descending order

Sort by multiple variables for hierarchical ordering

Exercise: Using arrange()

Try it yourself:

Fully worked solution:

30.5 Removing Duplicate Rows with distinct()

Syntax

Demo: Removing Duplicate Products

Identify unique categories

Identify distinct combinations of Category and StoreLocation:

Retain all columns while ensuring uniqueness based on Category

Remove all duplicate rows from a dataset

Exercise: Using distinct()

Try it yourself:

Fully worked solution:

30.6 Combining Row Transformations

Demo: Combining filter(), arrange(), and distinct()

Exercise: Using Combined Row Transformation`

Try it yourself:

Fully worked solution:

30.7 Chapter Summary

30.2 Selecting Rows by Position with `slice()`

Variants of `slice()`

`slice_head()`

`slice_tail()`

`slice_sample()`

`slice_min()` / `slice_max()`

Exercise: Using `slice()`

30.3 Selecting Rows by Condition with `filter()`

Exercise: Using `filter()`

30.4 Reordering Rows with `arrange()`

Exercise: Using `arrange()`

30.5 Removing Duplicate Rows with `distinct()`

Identify distinct combinations of `Category` and `StoreLocation`:

Retain all columns while ensuring uniqueness based on `Category`

Exercise: Using `distinct()`

Demo: Combining `filter()`, `arrange()`, and `distinct()`