# A tibble: 1,571 × 7
Unit Yard Way Direct_Hours Total_Production_Days Total_Cost Delivery_Date
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 Beth… 1 870870 244 2615849 12/30/41
2 2 Beth… 2 831745 249 2545125 1/19/42
3 3 Beth… 3 788406 222 2466811 1/29/42
4 4 Beth… 4 758934 233 2414978 2/9/42
5 5 Beth… 5 735197 220 2390643 2/20/42
6 6 Beth… 6 710342 227 2345051 2/27/42
7 8 Beth… 8 668785 217 2254490 3/30/42
8 9 Beth… 9 675662 196 2139564. 3/18/42
9 10 Beth… 10 652911 211 2221499. 4/11/42
10 11 Beth… 11 603625 229 2217642. 5/9/42
# ℹ 1,561 more rows
21 Variables
21.1 Exploring the Foundations of Data Analysis
Variables are the foundation of data analysis. They represent the aspects of the world we want to study, measure, or observe, and serve as the building blocks for all forms of statistical and visual analysis.
In this chapter, you will:
- Understand the definition and purpose of variables.
- Learn about the different types of variables and their uses.
- Explore how to identify variables in a dataset using the glimpse() function in R.
- Practice converting and manipulating variables for analysis.
21.2 What Is a Variable?
A variable is any characteristic, number, or quantity that can be measured or observed. As the name suggests, it is a collection of data that varies across observations.
Each observation represents a single instance or entity in the dataset (e.g., a customer, a transaction, or a product). For every observation, the value of the variable reflects a specific measurement or state.
For example, in a dataset of customer data:
- An observation might represent a single customer.
- Variables could include characteristics such as:
- Age: The customer’s age.
- Income: The customer’s annual income.
- City of Residence: Where the customer lives.
- Willingness to Pay: The customer’s stated willingness to pay for a product.
- The specific recorded values for these variables would vary across customers (observations).
In short, a variable is a way to capture and organize data that varies across observations, helping to measure the state or characteristics of what is being studied.
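As a minimal sketch of this idea, the data frame below holds three hypothetical customers. The column names mirror the example variables above, and every value is invented purely for illustration.

```r
# Hypothetical customer data: each row is one observation (a customer),
# each column is one variable. All values are invented for illustration.
customers <- data.frame(
  Age = c(34, 52, 27),
  Income = c(58000, 91000, 43000),
  City = c("Austin", "Denver", "Boston"),
  Willingness_to_Pay = c(25, 40, 15)
)

nrow(customers)   # number of observations (rows)
names(customers)  # the variables (columns)
customers$Age     # the values of one variable vary across observations
```

Reading the table row by row gives you observations; reading it column by column gives you variables.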
To explore the concept of variables, consider the liberty_ship_data dataset, first introduced in Section 18.2. This dataset captures information about the production of supply ships during World War II. Key variables include:
- Yard: The shipyard responsible for building the ship.
- Way: Ordered number of the sloping platform that launches the ship into the water after construction.
- Direct_Hours: Total direct labor hours spent constructing a ship.
- Total_Production_Days: Total number of days required to produce a ship.
- Total_Cost: The total cost of producing a ship in dollars.
- Delivery_Date: The date on which production and outfitting of a ship is finished and it enters service as a transport ship.
Each variable provides a unique lens through which to understand data and answer questions. For instance:
- How much time did it take to construct a ship?
- Are certain shipyards more efficient than others?
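Questions like these can be answered by summarizing a variable within groups. The sketch below rebuilds only the first four rows of liberty_ship_data (values copied from the dataset preview) and computes the average production time per yard. The object names ships and yard_summary are illustrative, and with the full dataset the same pipeline would return one row per shipyard.

```r
library(dplyr)

# First four rows of liberty_ship_data, copied from the dataset preview
ships <- tibble(
  Unit = c(1, 2, 3, 4),
  Yard = "Bethlehem",
  Total_Production_Days = c(244, 249, 222, 233)
)

# Average construction time (in days) for each shipyard
yard_summary <- ships |>
  group_by(Yard) |>
  summarise(mean_days = mean(Total_Production_Days))

yard_summary
```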
21.3 Types of Variables
Variables are often classified based on the type of data they hold. Here are the key types of variables:
Numeric Variables
These variables represent quantities and can be used for calculations.
For example:
- Direct_Hours, Total_Production_Days, and Total_Cost from the liberty_ship_data dataset.
- Revenue or profit of startups.
- Willingness to pay for the product of a startup.
Categorical Variables
These represent groups or categories.
For example:
- Yard and Way variables from the liberty_ship_data dataset.
- Product categories (Tech, Retail, Healthcare).
- Gender (Male, Female).
Binary Variables
A special type of categorical variable with only two values.
For example:
- Whether a survey respondent is part of the target customer population (Yes/No).
- Whether a startup is profitable (Yes/No).
Ordinal Variables
Categorical variables with an inherent order.
For example:
- Satisfaction level (Low, Medium, High).
- Education level (High school degree, Some college, College degree).
Time-Series Variables
Observations indexed over time.
For example:
- Delivery date in the liberty_ship_data dataset.
- Year of founding (2010, 2015, 2020).
- Sales date (02-Jan-2020, 23-Jan-2020, 24-Jan-2020).
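To make these categories concrete, here is a rough sketch of how each type might be represented in base R. The vectors are hypothetical startup records, not data from this chapter, and the variable names are chosen only for illustration.

```r
# Hypothetical startup records illustrating each variable type
revenue <- c(120000, 85000, 240000)                 # numeric
segment <- factor(c("Tech", "Retail", "Tech"))      # categorical
profitable <- factor(c("Yes", "No", "Yes"))         # binary (two categories)
satisfaction <- factor(
  c("Low", "High", "Medium"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE                                    # ordinal (ordered categories)
)
sale_date <- as.Date(c("2020-01-02", "2020-01-23", "2020-01-24"))  # dates

# Check how R classifies each vector
class(revenue)
class(segment)
nlevels(profitable)
is.ordered(satisfaction)
class(sale_date)
```

Note that binary and ordinal variables are both stored as factors; what distinguishes them is the number of levels and whether the levels are ordered.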
21.4 Converting Variable Types
When working with datasets, variables are often imported with incorrect or suboptimal types for analysis. To ensure accurate and efficient analysis, it’s important to identify these cases and convert variables to more appropriate types. The most common scenarios involve variables imported as character strings <chr> that should be converted into more structured formats, such as factors or dates.
Inspect Data Types with glimpse() or str()
Inspecting the structure of a dataset is a crucial step in data analysis. It helps you understand the variables in the dataset, their data types, and whether they are suitable for the intended analysis.
Using glimpse()
The glimpse() function from the dplyr package provides a compact, column-wise overview of a dataset. It displays:
- The variable names.
- A preview of the data values.
- The data types of each variable.
This function is especially useful for large datasets, as it shows all variables in a concise format, making it easier to spot issues such as incorrect data types.
For example, here is the glimpse() output for the Liberty ship dataset, liberty_ship_data:
Code
# Inspect variable types
glimpse(liberty_ship_data)
Rows: 1,571
Columns: 7
$ Unit <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard <chr> "Bethlehem", "Bethlehem", "Bethlehem", "Bethlehe…
$ Way <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
$ Delivery_Date <chr> "12/30/41", "1/19/42", "1/29/42", "2/9/42", "2/2…
Using str()
The str() function is a base R tool for examining the structure of an object, including datasets. It displays:
- The object type (e.g., tibble, data frame, vector).
- Variable names.
- Data types.
- A preview of the values for each variable.
str() is slightly more verbose than glimpse() and works consistently across different environments. This makes it a great fallback if glimpse() doesn’t render as expected in certain output formats.
Here is the str() output for liberty_ship_data:
Code
# Inspect variable types
str(liberty_ship_data)
tibble [1,571 × 7] (S3: tbl_df/tbl/data.frame)
$ Unit : num [1:1571] 1 2 3 4 5 6 8 9 10 11 ...
$ Yard : chr [1:1571] "Bethlehem" "Bethlehem" "Bethlehem" "Bethlehem" ...
$ Way : num [1:1571] 1 2 3 4 5 6 8 9 10 11 ...
$ Direct_Hours : num [1:1571] 870870 831745 788406 758934 735197 ...
$ Total_Production_Days: num [1:1571] 244 249 222 233 220 227 217 196 211 229 ...
$ Total_Cost : num [1:1571] 2615849 2545125 2466811 2414978 2390643 ...
$ Delivery_Date : chr [1:1571] "12/30/41" "1/19/42" "1/29/42" "2/9/42" ...
As you can see, str() denotes numeric variables as num where glimpse() denotes them as <dbl>, but they mean the same thing: a numeric variable.
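The two labels come from two different questions you can ask R about a vector. A brief sketch, using the first three Direct_Hours values from the dataset (the object names x and y are illustrative):

```r
# First three Direct_Hours values from liberty_ship_data
x <- c(870870, 831745, 788406)

class(x)   # "numeric": the label str() abbreviates as num
typeof(x)  # "double":  the storage type glimpse() abbreviates as <dbl>

# Whole numbers written with an L suffix are stored as integers instead
y <- c(1L, 2L, 3L)
class(y)   # "integer"
typeof(y)  # "integer"

# Both doubles and integers count as numeric for calculations
is.numeric(x)
is.numeric(y)
```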
Character to Category with as.factor()
Text-based variables that represent groups or categories (e.g., product types, regions) can be converted into categorical data types known as factors. Factors are more efficient and allow for better handling in statistical analysis and visualization.
In the liberty_ship_data dataset, we can see from the output of glimpse() that the Yard variable was imported as a character data type <chr>. This variable contains the names of the shipyards that built the Liberty ships. Since there are only eight shipyards, the variable should be treated as categorical rather than as freeform text. Categorical variables have a fixed set of categories (in this case, the names of shipyards) and are best represented in R using the factor type.
When we find a categorical variable that is misspecified as a character vector, we convert it to a categorical variable (factor) using the as.factor() function.
Code
# Convert a character variable to categorical / factor
liberty_ship_data <- liberty_ship_data |>
  mutate(Yard = as.factor(Yard))
glimpse(liberty_ship_data)
Rows: 1,571
Columns: 7
$ Unit <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard <fct> Bethlehem, Bethlehem, Bethlehem, Bethlehem, Beth…
$ Way <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
$ Delivery_Date <chr> "12/30/41", "1/19/42", "1/29/42", "2/9/42", "2/2…
The glimpse() function shows that Yard is now a factor data type <fct>.
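Once a variable is a factor, R keeps track of its fixed set of categories. The sketch below uses a short hypothetical vector to show this. Only "Bethlehem" is a yard name taken from the dataset; "Yard B" and "Yard C" are placeholder labels, not real shipyard names.

```r
# Hypothetical yard labels; only "Bethlehem" appears in the real data
yard_chr <- c("Bethlehem", "Yard B", "Bethlehem", "Yard C", "Yard B", "Bethlehem")
yard_fct <- as.factor(yard_chr)

levels(yard_fct)   # the distinct categories, in alphabetical order
nlevels(yard_fct)  # how many categories there are
table(yard_fct)    # number of observations in each category
```

Running levels() on the real Yard variable is a quick way to confirm that it contains the expected eight shipyards and no misspelled duplicates.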
Character to Date with mdy()
Dates stored as strings cannot be used for calculations (e.g., determining the time difference between two dates). Converting them into Date objects allows for precise handling of time-based analysis.
In the liberty_ship_data dataset, the Delivery_Date variable was imported as a character data type <chr>. Its values follow the "MM/DD/YY" format, so “12/30/41” stands for 30 December 1941. Dates are a special kind of data type in R that allow us to perform time-based calculations and analyses.
For example, knowing the Delivery_Date for each ship allows us to:
- Analyze how ship deliveries changed over time.
- Compare production rates across shipyards or periods.
- Calculate the time it took to deliver ships after they were ordered.
To convert the Delivery_Date variable into a proper date type, we use the mdy() function from the lubridate package, which parses strings written in month, day, year order, such as the "MM/DD/YY" format.
Code
# Convert a character variable to a date
liberty_ship_data <- liberty_ship_data |>
  mutate(Delivery_Date = mdy(Delivery_Date))
glimpse(liberty_ship_data)
Rows: 1,571
Columns: 7
$ Unit <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard <fct> Bethlehem, Bethlehem, Bethlehem, Bethlehem, Beth…
$ Way <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
$ Delivery_Date <date> 2041-12-30, 2042-01-19, 2042-01-29, 2042-02-09,…
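After the transformation, glimpse() shows that Delivery_Date is now a <date> variable, so it can be used in date calculations. Note, however, that the years appear as 2041 and 2042 rather than 1941 and 1942: mdy() reads two-digit years from 00 to 68 as 2000 to 2068. Because every Liberty ship was delivered in the 1940s, the parsed dates need to be shifted back a century, for example with mdy(Delivery_Date) - years(100), before they are used in any analysis that depends on the actual year.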
If the date variable had a different format, such as Day/Month/Year or Year-Month-Day, you would use the dmy() or ymd() function instead.[1]
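The sketch below works through the conversion on the first three Delivery_Date values from the dataset. It assumes that every date in the column belongs to the 1900s, which holds for Liberty ships but would not hold for a column that mixes centuries. The object names are illustrative.

```r
library(lubridate)

# First three Delivery_Date values from liberty_ship_data
raw_dates <- c("12/30/41", "1/19/42", "1/29/42")

# mdy() reads two-digit years 00 to 68 as 2000 to 2068
parsed <- mdy(raw_dates)
parsed

# Assumption: all dates fall in the 1900s, so shift back one century
corrected <- parsed - years(100)
corrected

# Date arithmetic now works: days between consecutive deliveries
days_between <- as.numeric(diff(corrected))
days_between
```

The gaps between deliveries are the same before and after the century shift, but anything that depends on the actual year, such as plotting against a historical timeline, needs the corrected values.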
Numeric to Factor with as.factor()
The Way variable is the sloped platform where a ship was constructed and launched. It was imported as a numeric data type <dbl> in the dataset. However, this variable is better understood as a name or label rather than a true numeric value. For instance, it wouldn’t make sense to say that “Way 4 is twice as much as Way 2.” This makes Way another categorical variable. We can convert the data type of the Way variable to a factor using the as.factor() function:
Code
# Convert a numeric variable to categorical
liberty_ship_data <- liberty_ship_data |>
  mutate(Way = as.factor(Way))
glimpse(liberty_ship_data)
Rows: 1,571
Columns: 7
$ Unit <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard <fct> Bethlehem, Bethlehem, Bethlehem, Bethlehem, Beth…
$ Way <fct> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
$ Delivery_Date <date> 2041-12-30, 2042-01-19, 2042-01-29, 2042-02-09,…
After the conversion, glimpse() shows that Way is now recognized as a factor variable. This ensures that it will be treated appropriately in subsequent analyses, such as grouping, summarizing, or visualizing data.
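One caveat is worth knowing if you ever need to go back from a factor to numbers: as.numeric() applied directly to a factor returns the internal level codes, not the original labels. A small sketch with hypothetical way numbers:

```r
# Hypothetical way numbers converted to a factor
way_num <- c(1, 2, 10, 11)
way_fct <- as.factor(way_num)

# Direct conversion returns the position of each level, not the label
codes <- as.numeric(way_fct)
codes      # 1 2 3 4

# Converting through character recovers the original values
restored <- as.numeric(as.character(way_fct))
restored   # 1 2 10 11
```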
Key Takeaways about Data Types
- Inspect Variable Types: Always inspect variable types when starting an analysis to ensure they align with their intended use.
- Understand the Context: Evaluate whether a variable’s meaning aligns with its type (e.g., numeric values that represent ordered labels).
- Convert as Needed: Use as.factor(), as.numeric(), mdy(), or similar functions to reclassify variables when necessary for accurate analysis.
21.5 Exercise: Converting Data Types
Try it yourself:
You are analyzing data collected from 50 startups participating in the LaunchBright Accelerator Program, a competitive initiative designed to support high-potential startups across diverse industries. The program tracks key metrics to understand the factors contributing to startup success, such as funding, team composition, and market performance.
The dataset is named startup_data and includes a mix of numeric and categorical variables, offering insights into startup characteristics and performance. However, some variables have been imported in formats unsuitable for analysis and need to be transformed.
- Use str() to identify the current data types.
- Transform variables from inappropriate data types to more appropriate ones.
Hint 1
- Run the str() command on the startup_data dataset.
- Identify the variables that have the wrong data types.
- Convert the incorrectly specified variables to their correct type.
Hint 2
str(startup_data) shows the following incorrect data types:
- Market_Segment is a character type but should be a category (factor).
- Customer_Satisfaction is a character type but should be an ordinal factor.
- Accelerator_Cohort is a character type but should be a date or a numeric year.
- Is_Profitable is a character type but should be a factor.
Do the appropriate data type conversions with as.factor(), factor() for ordinal variables, and as.numeric() for the cohort year.
Fully worked solution:
startup_data <- startup_data |>
  mutate(
    Market_Segment = as.factor(Market_Segment),              # 1
    Customer_Satisfaction = factor(Customer_Satisfaction,
                                   levels = c("Low", "Medium", "High"),
                                   ordered = TRUE),          # 2
    Accelerator_Cohort = as.numeric(Accelerator_Cohort),     # 3
    Is_Profitable = as.factor(Is_Profitable)                 # 4
  )
str(startup_data)
1. Convert Market_Segment to a factor using as.factor().
2. Convert Customer_Satisfaction to an ordinal factor using factor(), with arguments listing the levels of the factor from low to high and declaring that the factor is ordered.
3. Convert Accelerator_Cohort to a numeric type using as.numeric().
4. Convert Is_Profitable to a factor with as.factor().
You could convert Accelerator_Cohort to a date variable, but there is really no benefit to having Accelerator_Cohort as a date variable compared to a numeric variable. Since the conversion to numeric is easier, we go with that.
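To see what the ordered = TRUE argument buys you, consider the sketch below. It uses a few hypothetical satisfaction ratings and cohort years rather than the actual startup_data values.

```r
# Hypothetical ratings, declared as an ordered factor from low to high
satisfaction <- factor(
  c("Medium", "Low", "High", "Medium"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# Ordered factors support comparisons that respect the level order
at_least_medium <- satisfaction >= "Medium"
at_least_medium

# The highest rating observed
top_rating <- as.character(max(satisfaction))
top_rating

# Hypothetical cohort years stored as text, converted to numbers
cohort <- as.numeric(c("2021", "2022", "2021"))
range(cohort)
```

With an unordered factor, the comparison and max() calls would not be meaningful, which is why the solution declares the order explicitly for Customer_Satisfaction.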
- Which variables need data types converted?
- What data type is needed?
- Which functions convert to the needed data types?
21.6 Summary
In this chapter, we explored the definition and types of variables, their role in datasets, and how to manipulate them for analysis. Mastering variables is the first step in making sense of data and drawing meaningful conclusions.
Understanding variables helps entrepreneurs:
- Segment customers by demographics (categorical variables).
- Track growth over time (time-series variables).
- Identify patterns in performance metrics (numeric variables).
[1] There are many more date formats that lubridate can manage through diverse functions. Explore them at the lubridate reference page.