21  Variables

21.1 Exploring the Foundations of Data Analysis

Variables are the foundation of data analysis. They represent the aspects of the world we want to study, measure, or observe, and serve as the building blocks for all forms of statistical and visual analysis.

In this chapter, you will:

  1. Understand the definition and purpose of variables.
  2. Learn about the different types of variables and their uses.
  3. Explore how to identify variables in a dataset using the glimpse() function in R.
  4. Practice converting and manipulating variables for analysis.

21.2 What Is a Variable?

A variable is any characteristic, number, or quantity that can be measured or observed. As the name suggests, it is a collection of data that varies across observations.

Each observation represents a single instance or entity in the dataset (e.g., a customer, a transaction, or a product). For every observation, the value of the variable reflects a specific measurement or state.

For example, in a dataset of customer data:

  • An observation might represent a single customer.
  • Variables could include characteristics such as:
    • Age: The customer’s age.
    • Income: The customer’s annual income.
    • City of Residence: Where the customer lives.
    • Willingness to Pay: The customer’s stated willingness to pay for a product.
  • The specific recorded values for these variables would vary across customers (observations).

In short, a variable is a way to capture and organize data that varies across observations, helping to measure the state or characteristics of what is being studied.
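As a rough illustration of the customer example above, the sketch below builds a tiny data frame in base R. Every value and the object name customers are invented for illustration; only the variable names come from the list above.

```r
# Hypothetical customer data: each row is one observation (a customer),
# each column is one variable. All values are invented for illustration.
customers <- data.frame(
  Age = c(34, 52, 27),
  Income = c(58000, 91000, 43000),
  City_of_Residence = c("Denver", "Austin", "Boston"),
  Willingness_to_Pay = c(25, 40, 18)
)

nrow(customers)    # number of observations (rows)
names(customers)   # names of the variables (columns)
customers$Age      # one variable: its value varies across observations
```

Reading across a row describes one customer; reading down a column shows how one variable varies across customers.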


To explore the concept of variables, consider the liberty_ship_data dataset, first introduced in Section 18.2. This dataset captures information about the production of supply ships during World War II. Key variables include:

  • Yard: The shipyard responsible for building the ship.
  • Way: The number identifying the sloping platform on which the ship was built and from which it was launched into the water.
  • Direct_Hours: Total direct labor hours spent constructing a ship.
  • Total_Production_Days: Total number of days required to produce a ship.
  • Total_Cost: The total cost of producing a ship in dollars.
  • Delivery_Date: The date on which production and outfitting of a ship were finished and it entered service as a transport ship.

Each variable provides a unique lens through which to understand data and answer questions. For instance:

  • How much time did it take to construct a ship?
  • Are certain shipyards more efficient than others?

The first ten rows of the dataset are shown below:

# A tibble: 1,571 × 7
    Unit Yard    Way Direct_Hours Total_Production_Days Total_Cost Delivery_Date
   <dbl> <chr> <dbl>        <dbl>                 <dbl>      <dbl> <chr>        
 1     1 Beth…     1       870870                   244   2615849  12/30/41     
 2     2 Beth…     2       831745                   249   2545125  1/19/42      
 3     3 Beth…     3       788406                   222   2466811  1/29/42      
 4     4 Beth…     4       758934                   233   2414978  2/9/42       
 5     5 Beth…     5       735197                   220   2390643  2/20/42      
 6     6 Beth…     6       710342                   227   2345051  2/27/42      
 7     8 Beth…     8       668785                   217   2254490  3/30/42      
 8     9 Beth…     9       675662                   196   2139564. 3/18/42      
 9    10 Beth…    10       652911                   211   2221499. 4/11/42      
10    11 Beth…    11       603625                   229   2217642. 5/9/42       
# ℹ 1,561 more rows

21.3 Types of Variables

Variables are often classified based on the type of data they hold. Here are the key types of variables:

Numeric Variables

These variables represent quantities and can be used for calculations.

For example:

  • Direct_Hours, Total_Production_Days, and Total_Cost from the liberty_ship_data dataset.
  • Revenue or profit of startups.
  • Willingness to pay for a startup’s product.

Categorical Variables

These represent groups or categories.

For example:

  • Yard and Way variables from the liberty_ship_data dataset.
  • Product categories (Tech, Retail, Healthcare).
  • Gender (Male, Female).

Binary Variables

A special type of categorical variable with only two values.

For example:

  • Whether a survey respondent is part of the target customer population (Yes/No).
  • Whether a startup is profitable (Yes/No).

Ordinal Variables

Categorical variables with an inherent order.

For example:

  • Satisfaction level (Low, Medium, High).
  • Education level (High school degree, Some college, College degree).

Time-Series Variables

Observations indexed over time.

For example:

  • Delivery_Date in the liberty_ship_data dataset.
  • Year of founding (2010, 2015, 2020).
  • Sales date (02-Jan-2020, 23-Jan-2020, 24-Jan-2020).
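
The variable types above map onto R data types roughly as follows. This is a minimal base R sketch with invented startup records; the object name startups and all values are assumptions made for illustration, and a binary variable could equally be stored as a two-level factor instead of a logical.

```r
# Invented startup records showing one common R representation per variable type.
startups <- data.frame(
  Revenue = c(120000, 45000, 310000),                                # numeric
  Category = factor(c("Tech", "Retail", "Tech")),                    # categorical
  Is_Profitable = c(TRUE, FALSE, TRUE),                              # binary
  Satisfaction = factor(c("Low", "High", "Medium"),
                        levels = c("Low", "Medium", "High"),
                        ordered = TRUE),                             # ordinal
  Sales_Date = as.Date(c("2020-01-02", "2020-01-23", "2020-01-24"))  # date
)

# Report the (first) class of each column.
sapply(startups, function(column) class(column)[1])
```

The result lists numeric, factor, logical, ordered, and Date, one label per column.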

21.4 Converting Variable Types

When working with datasets, variables are often imported with incorrect or suboptimal types for analysis. To ensure accurate and efficient analysis, it’s important to identify these cases and convert variables to more appropriate types. The most common scenario is a variable imported as a character string <chr> that should be converted into a more structured type, such as a factor or a date.


Inspect Data Types with glimpse() or str()

Inspecting the structure of a dataset is a crucial step in data analysis. It helps you understand the variables in the dataset, their data types, and whether they are suitable for the intended analysis.

Using glimpse()

The glimpse() function from the dplyr package provides a compact, column-wise overview of a dataset. It displays:

  • The variable names.
  • A preview of the data values.
  • The data type of each variable.

This function is especially useful for large datasets, as it shows all variables in a concise format, making it easier to spot issues such as incorrect data types.

For example, here is the glimpse() of the Liberty ship dataset liberty_ship_data:

# Inspect variable types
library(dplyr)  # provides glimpse()
glimpse(liberty_ship_data)
Rows: 1,571
Columns: 7
$ Unit                  <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard                  <chr> "Bethlehem", "Bethlehem", "Bethlehem", "Bethlehe…
$ Way                   <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours          <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost            <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
$ Delivery_Date         <chr> "12/30/41", "1/19/42", "1/29/42", "2/9/42", "2/2…

Using str()

The str() function is a base R tool for examining the structure of an object, including datasets. It displays:

  • The object type (e.g., tibble, data frame, vector).
  • Variable names.
  • Data types.
  • A preview of the values for each variable.

str() is slightly more verbose than glimpse() and works consistently across different environments. This makes it a great fallback if glimpse() doesn’t render as expected in certain output formats.

Here is the str() output for liberty_ship_data:

# Inspect variable types
str(liberty_ship_data)
tibble [1,571 × 7] (S3: tbl_df/tbl/data.frame)
 $ Unit                 : num [1:1571] 1 2 3 4 5 6 8 9 10 11 ...
 $ Yard                 : chr [1:1571] "Bethlehem" "Bethlehem" "Bethlehem" "Bethlehem" ...
 $ Way                  : num [1:1571] 1 2 3 4 5 6 8 9 10 11 ...
 $ Direct_Hours         : num [1:1571] 870870 831745 788406 758934 735197 ...
 $ Total_Production_Days: num [1:1571] 244 249 222 233 220 227 217 196 211 229 ...
 $ Total_Cost           : num [1:1571] 2615849 2545125 2466811 2414978 2390643 ...
 $ Delivery_Date        : chr [1:1571] "12/30/41" "1/19/42" "1/29/42" "2/9/42" ...

As you can see, str() labels numeric variables num, whereas glimpse() labels them <dbl>. Both mean the same thing: a numeric variable.
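
One way to check that the two labels describe the same kind of data is to ask R directly. The short base R sketch below uses the first three Direct_Hours values from the output above; the object name hours is just a placeholder.

```r
# First three Direct_Hours values from the output above.
hours <- c(870870, 831745, 788406)

class(hours)       # "numeric": the general class, shown as num by str()
typeof(hours)      # "double":  the storage type, shown as <dbl> by glimpse()
is.numeric(hours)  # TRUE
```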


Character to Category with as.factor()

Text-based variables that represent groups or categories (e.g., product types, regions) can be converted into a categorical data type known as a factor. A factor stores the set of possible categories as levels, which lets statistical and plotting functions treat the variable as a set of groups rather than as free text.

In the liberty_ship_data dataset, we can see from the output of glimpse() that the Yard variable was imported as a character data type <chr>. This variable contains the names of the shipyards that built the Liberty ships. Since there are only eight shipyards, the variable should be treated as categorical rather than as freeform text. Categorical variables have a fixed set of categories (in this case, the names of shipyards) and are best represented in R using the factor type.

When we find a categorical variable that is misspecified as a character vector, we convert it to a categorical variable (factor) using the as.factor() function.

# Convert a character variable to categorical / factor
liberty_ship_data <- liberty_ship_data |>
  mutate(Yard = as.factor(Yard))
glimpse(liberty_ship_data)
Rows: 1,571
Columns: 7
$ Unit                  <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard                  <fct> Bethlehem, Bethlehem, Bethlehem, Bethlehem, Beth…
$ Way                   <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours          <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost            <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
$ Delivery_Date         <chr> "12/30/41", "1/19/42", "1/29/42", "2/9/42", "2/2…

The glimpse() function shows that Yard is now a factor data type <fct>.
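
To see what a factor adds over plain text, the base R sketch below converts a small toy vector and inspects its levels. Only "Bethlehem" appears in the data shown above; "Yard B" and "Yard C" are placeholder names, not real shipyards from the dataset.

```r
# Toy vector: "Bethlehem" appears in the data; the other names are placeholders.
yard <- c("Bethlehem", "Yard B", "Bethlehem", "Yard C", "Yard B", "Bethlehem")
yard_factor <- as.factor(yard)

levels(yard_factor)    # the distinct categories
nlevels(yard_factor)   # how many categories there are
table(yard_factor)     # number of observations in each category
```

Running nlevels() on the real Yard variable is a quick way to confirm that it contains the expected eight shipyards.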


Character to Date with mdy()

Dates stored as strings cannot be used for calculations (e.g., determining the time difference between two dates). Converting them into Date objects allows for precise handling of time-based analysis.

In the liberty_ship_data dataset, the Delivery_Date variable was imported as a character data type <chr> with dates having the common format “12/30/41” for 30 December 1941. We know this variable represents dates in the format "MM/DD/YY". Dates are a special kind of data type in R that allow us to perform time-based calculations and analyses.

For example, knowing the Delivery_Date for each ship allows us to:

  • Analyze how ship deliveries changed over time.
  • Compare production rates across shipyards or periods.
  • Calculate the time it took to deliver ships after they were ordered.
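
The kind of calculation that Date objects make possible can be sketched in base R. The two dates below are the first two delivery dates in the dataset, written out with their full years; the object names are placeholders.

```r
# First two delivery dates from the dataset, written with four-digit years.
first_delivery  <- as.Date("1941-12-30")
second_delivery <- as.Date("1942-01-19")

# Subtracting Date objects gives a time difference in days.
second_delivery - first_delivery               # Time difference of 20 days
as.numeric(second_delivery - first_delivery)   # 20

# The same subtraction on character strings fails, because text has no arithmetic:
# "1/19/42" - "12/30/41"   # Error: non-numeric argument to binary operator
```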

To convert the Delivery_Date variable into a proper date type, we use the mdy() function from the lubridate package, which reads text dates written in month/day/year order.

# Convert a character variable to a date
library(lubridate)  # provides mdy()
liberty_ship_data <- liberty_ship_data |>
  mutate(Delivery_Date = mdy(Delivery_Date))
glimpse(liberty_ship_data)
Rows: 1,571
Columns: 7
$ Unit                  <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard                  <fct> Bethlehem, Bethlehem, Bethlehem, Bethlehem, Beth…
$ Way                   <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours          <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost            <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
$ Delivery_Date         <date> 2041-12-30, 2042-01-19, 2042-01-29, 2042-02-09,…

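After the transformation, glimpse() shows that Delivery_Date is now a <date> variable, so it can be used in date calculations. Note, however, that the two-digit years were read as 2041 and 2042 rather than 1941 and 1942: by default, lubridate places two-digit years up to 68 in the 2000s. Because these ships were delivered during World War II, the dates need to be shifted back by 100 years before any analysis that depends on the actual calendar year.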

If the date variable had a different format like Day/Month/Year or Year-Month-Day, you would use the dmy() or ymd() functions instead.¹
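
The glimpse() output above shows years of 2041 and 2042, although the ships were delivered in the 1940s. A minimal sketch of one possible correction follows, assuming that every date in the data belongs to the 1900s; the object names are placeholders, and the three strings are the first Delivery_Date values shown earlier.

```r
library(lubridate)

# First three Delivery_Date values from the dataset.
delivery_text <- c("12/30/41", "1/19/42", "1/29/42")

parsed <- mdy(delivery_text)
year(parsed)                       # 2041 2042 2042

# Assumption: all dates belong to the 1900s, so shift each one back a century.
corrected <- parsed - years(100)
corrected                          # "1941-12-30" "1942-01-19" "1942-01-29"
```

Inside a mutate() call, the same idea would read Delivery_Date = mdy(Delivery_Date) - years(100). For data that mixes centuries, a conditional shift would be needed instead.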


Numeric to Factor with as.factor()

The Way variable identifies the sloped platform where a ship was constructed and launched. It was imported as a numeric data type <dbl>, but it is better understood as a name or label than as a true numeric value. For instance, it wouldn’t make sense to say that “Way 4 is twice as much as Way 2.” This makes Way another categorical variable. We can convert the Way variable to a factor using the as.factor() function:

# Convert a numeric variable to categorical
liberty_ship_data <- liberty_ship_data |>
  mutate(Way = as.factor(Way))
glimpse(liberty_ship_data)
Rows: 1,571
Columns: 7
$ Unit                  <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard                  <fct> Bethlehem, Bethlehem, Bethlehem, Bethlehem, Beth…
$ Way                   <fct> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours          <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost            <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
$ Delivery_Date         <date> 2041-12-30, 2042-01-19, 2042-01-29, 2042-02-09,…

After the conversion, glimpse() shows that Way is now recognized as a factor variable. This ensures that it will be treated appropriately in subsequent analyses, such as grouping, summarizing, or visualizing data.
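
One caveat is worth knowing if a factor built from numbers ever has to be turned back into numbers. The base R sketch below uses three invented way numbers to show that as.numeric() applied directly to a factor returns the internal level codes, not the original values.

```r
way <- c(10, 2, 11)            # invented way numbers
way_factor <- as.factor(way)

levels(way_factor)                     # "2" "10" "11"
as.numeric(way_factor)                 # 2 1 3   (level codes, not the labels)
as.numeric(as.character(way_factor))   # 10 2 11 (the original numbers)
```

Converting through as.character() first recovers the values that the labels represent.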


Key Takeaways about Data Types

  1. Inspect Variable Types: Always inspect variable types when starting an analysis to ensure they align with their intended use.
  2. Understand the Context: Evaluate whether a variable’s meaning aligns with its type (e.g., numeric values that represent ordered labels).
  3. Convert as Needed: Use as.factor(), as.numeric(), mdy(), or similar functions to reclassify variables when necessary for accurate analysis.

21.5 Exercise: Converting Data Types

Try it yourself:


You are analyzing data collected from 50 startups participating in the LaunchBright Accelerator Program, a competitive initiative designed to support high-potential startups across diverse industries. The program tracks key metrics to understand the factors contributing to startup success, such as funding, team composition, and market performance.

The dataset is named startup_data and includes a mix of numeric and categorical variables, offering insights into startup characteristics and performance. However, some variables have been imported in formats unsuitable for analysis and need to be transformed.

  1. Use str() to identify the current data types.
  2. Transform variables from inappropriate data types to more appropriate ones.

Hint 1

  1. Run the str() command on the startup_data dataset.
  2. Identify the variables that have the wrong data types.
  3. Convert the incorrectly specified variables to their correct type.

Hint 2

str(startup_data) shows the following incorrect data types:

  1. Market_Segment is a character type but should be a category (factor).
  2. Customer_Satisfaction is a character type but should be an ordinal factor.
  3. Accelerator_Cohort is a character type but should be a date or a number.
  4. Is_Profitable is a character type but should be a factor.

Do the appropriate data type conversions with as.factor(), factor() for ordinal variables, and as.numeric() (or a date function such as year()) for the cohort.

Fully worked solution:

startup_data <- startup_data |>
  mutate(
    Market_Segment = as.factor(Market_Segment),                   # (1)
    Customer_Satisfaction = factor(
      Customer_Satisfaction,
      levels = c("Low", "Medium", "High"),
      ordered = TRUE
    ),                                                            # (2)
    Accelerator_Cohort = as.numeric(Accelerator_Cohort),          # (3)
    Is_Profitable = as.factor(Is_Profitable)                      # (4)
  )
str(startup_data)

  1. Convert Market_Segment to a factor using as.factor().
  2. Convert Customer_Satisfaction to an ordinal factor using factor(), with arguments listing the levels of the factor from low to high and declaring that the factor is ordered.
  3. Convert Accelerator_Cohort to a numeric type using as.numeric().
  4. Convert Is_Profitable to a factor with as.factor().

You could convert Accelerator_Cohort to a date variable but there is really no benefit to having Accelerator_Cohort as a date variable compared to a numeric variable. Since the conversion to numeric is easier, we go with that.
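
To illustrate why the solution declares Customer_Satisfaction as ordered, the base R sketch below uses a few invented responses. The level names match those in the solution; the object names and values are assumptions for illustration.

```r
# Invented responses; the level names match those used in the solution.
satisfaction_text <- c("Medium", "Low", "High", "Medium")
satisfaction <- factor(satisfaction_text,
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE)

satisfaction >= "Medium"   # TRUE FALSE TRUE TRUE
max(satisfaction)          # High
```

With an ordered factor, comparisons and max() respect the declared order Low < Medium < High. An unordered factor would not support these comparisons.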

As you work through the exercise, keep these questions in mind:

  • Which variables need data types converted?
  • What data type is needed?
  • Which functions convert to the needed data types?

21.6 Summary

In this chapter, we explored the definition and types of variables, their role in datasets, and how to manipulate them for analysis. Mastering variables is the first step in making sense of data and drawing meaningful conclusions.

Understanding variables helps entrepreneurs:

  • Segment customers by demographics (categorical variables).
  • Track growth over time (time-series variables).
  • Identify patterns in performance metrics (numeric variables).


  1. There are many more date formats that lubridate can manage through diverse functions. Explore them at the lubridate reference page.