21 Variables and Distributions
21.1 The Building Blocks of Basic Statistics
Variables and distributions form the foundation of data analysis, enabling us to describe, calculate, visualize, and interpret data patterns. In this chapter, we’ll explore:
- The role of variables in organizing and structuring data.
- Key measures, including central tendency, spread, and skewness, to summarize data meaningfully.
- How to calculate and interpret descriptive statistics for variables.
- The concept of distributions and how they reveal the shape of the data.
Understanding these principles will prepare you to explore data systematically and make informed decisions based on patterns and relationships within your data.
21.2 Demonstration Data
We’ll use the Liberty ships dataset throughout this chapter to illustrate key concepts. First introduced in Section 18.2, this dataset captures information about the production of supply ships during World War II. Key variables include:
Yard
: The shipyard responsible for building the ship (categorical).

Way
: Ordered number of the sloping platform that launches the ship into the water after construction (categorical).

Direct_Hours
: Total direct labor hours spent constructing a ship (numeric).

Total_Production_Days
: Total number of days required to produce a ship (numeric).

Total_Cost
: The total cost of producing a ship in dollars (numeric).
Each variable provides a unique lens through which to understand data and answer questions. For instance:
- How much time did it take to construct a ship?
- Are certain shipyards more efficient than others?
These variables provide a practical context for understanding how variables and distributions are analyzed in a business setting.
21.3 Variables
A variable represents a characteristic or quantity that can take on different values across observations. Variables are the building blocks of data analysis, capturing the information we need to describe, compare, and analyze patterns.
Types of Variables
- Categorical Variables:
  - Represent groups or categories (e.g., Yard in the Liberty ships dataset).
  - Example: Which shipyard produced each ship.
- Numeric Variables:
  - Represent measurable quantities (e.g., Direct_Hours, Total_Cost).
  - Subtypes:
    - Continuous: Can take any value within a range (e.g., Direct_Hours).
    - Discrete: Represent countable values (e.g., number of ships built).
- Ordinal Variables:
  - Represent categories with an inherent order (e.g., Way in the Liberty ships dataset).
  - Example: The number of the ramp used to launch the ship after construction.
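To make these types concrete, here is a minimal sketch of how each kind of variable is typically represented in R (small example vectors are used here rather than the full dataset):

```r
# Categorical: a fixed set of groups, stored as a factor
yard <- factor(c("Bethlehem", "Oregon", "Bethlehem"))

# Numeric, continuous: measurable quantities within a range
direct_hours <- c(870870, 831745, 788406)

# Numeric, discrete: countable values
ships_built <- c(385, 330, 322)

# Ordinal: categories with an inherent order, stored as an ordered factor
build_priority <- factor(c("low", "high", "medium"),
                         levels = c("low", "medium", "high"),
                         ordered = TRUE)

str(yard)
str(build_priority)
```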
Working with Variable Types in R
Variables often need to be identified and converted to appropriate types for analysis. The glimpse()
function is an excellent tool for inspecting the structure of a dataset, including variable types:
# Inspect variable types
glimpse(liberty_ship_data)
Rows: 1,571
Columns: 6
$ Unit <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard <chr> "Bethlehem", "Bethlehem", "Bethlehem", "Bethlehe…
$ Way <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
From the output, we can see that the Yard
variable was imported as a character <chr>
data type. This variable contains the names of the shipyards that built the Liberty ships. Since there are only eight shipyards, the variable should be treated as categorical rather than as freeform text. Categorical variables have a fixed set of categories (in this case, the names of shipyards) and are best represented in R using the factor
type.
The Way
variable is a numeric type <dbl>
in the dataset, representing the sloped platform where a ship was constructed and launched. However, this variable is better understood as an ordered label rather than a true numeric value. For instance, it wouldn’t make sense to say that “Way 4 is twice as much as Way 2.” This makes Way
another categorical variable.
To correctly categorize these variables, we can use the mutate()
function along with as.factor()
to convert them to factors:
# Convert Yard and Way to categorical (factor) variables
liberty_ship_data <- liberty_ship_data |>
  mutate(Yard = as.factor(Yard),
         Way = as.factor(Way))

glimpse(liberty_ship_data)
Rows: 1,571
Columns: 6
$ Unit <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Yard <fct> Bethlehem, Bethlehem, Bethlehem, Bethlehem, Beth…
$ Way <fct> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Direct_Hours <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
$ Total_Cost <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
After the conversion, glimpse()
shows that both Yard
and Way
are now recognized as factor variables. This ensures that they will be treated appropriately in subsequent analyses, such as grouping, summarizing, or visualizing data.
Key Takeaways about Variable Types
- Inspect Variable Types: Always inspect variable types when starting an analysis to ensure they align with their intended use.
- Understand the Context: Evaluate whether a variable’s meaning aligns with its type (e.g., numeric values that represent ordered labels).
- Convert as Needed: Use as.factor(), as.numeric(), or similar functions to reclassify variables when necessary for accurate analysis.
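As a small illustration of the last point, here is a sketch of both directions of conversion, including a common pitfall when turning a factor back into a number (the vectors are made up for demonstration):

```r
# Character to factor
region <- as.factor(c("East", "West", "East"))
levels(region)

# Factor back to numeric: convert to character first; calling
# as.numeric() directly returns the internal level codes
x <- factor(c("10", "2", "30"))
as.numeric(x)                # 1 2 3  (level codes, not the original values)
as.numeric(as.character(x))  # 10 2 30
```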
21.4 Descriptive Statistics
Variables vary—that’s their defining characteristic. This variability creates the need to understand the patterns and trends within their values. Descriptive statistics provide a systematic way to summarize and explore these patterns, offering insights that can help make sense of the data and improve predictability in decision-making.
Measures of Central Tendency
Central tendency describes the point around which values cluster in a distribution. It helps identify a “typical” or “representative” value for a dataset. The key measures of central tendency include:
- Mean: The arithmetic average of all values.
- Median: The middle value when data is sorted.
- Mode: The most frequently occurring value.
Key Measures Explained
- Mean:
- Calculated by summing all values and dividing by the number of observations.
- Often referred to as the expected value, as it predicts the central value of a distribution.
- Sensitive to extreme values (outliers), which can skew its representativeness.
- Median:
- The middle value when all observations are sorted in ascending order.
- More robust to outliers than the mean, making it ideal for skewed distributions.
- Mode:
- The most frequently occurring value in the dataset.
- Particularly useful for categorical variables or distributions with clear peaks.
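For reference, the mean and median can be written formally as follows, where \(x_{(i)}\) denotes the \(i\)-th smallest of \(n\) observations:

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\text{median} =
\begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[6pt]
\dfrac{x_{(n/2)} + x_{(n/2 + 1)}}{2} & \text{if } n \text{ is even}
\end{cases}
\]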
Calculating Central Tendency in R
Here’s how to calculate the measures of central tendency for the Total_Cost
variable in the Liberty ship dataset:
# Calculate mean, median, and mode
summary_stats <- liberty_ship_data %>%
  summarize(
    mean_cost = mean(Total_Cost, na.rm = TRUE),
    median_cost = median(Total_Cost, na.rm = TRUE),
    mode_cost = names(sort(table(Total_Cost), decreasing = TRUE))[1]
  )

summary_stats
# A tibble: 1 × 3
mean_cost median_cost mode_cost
<dbl> <dbl> <chr>
1 1951953. 1919645. 1829220.302
Key Insights
- The mean provides the average cost, giving a quick estimate of overall production costs.
- The median offers a better measure of central tendency for datasets with skewed distributions, such as production costs with a few very high or low values.
- The mode, while less commonly used for numeric data, highlights the most frequent value and is particularly valuable for categorical data, such as identifying the most common shipyard.
- Use mean and median to identify typical values.
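One small caveat about the mode calculation above: because the mode is pulled from the names of a frequency table, it is returned as a character string (the <chr> column in the output). Convert it back to a number if needed:

```r
# The table-based mode is a character string; convert it for numeric use
as.numeric(summary_stats$mode_cost)
```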
Measures of Spread
Measures of spread help us understand the variability in a dataset—how much individual values differ from the central tendency. By quantifying spread, we can make better predictions and more informed decisions.
The key measures of spread include:
- Range: The difference between the largest and smallest values, giving a quick sense of the data’s span.
- Interquartile Range (IQR): The range of the middle 50% of data, robust to outliers.
- Standard Deviation (SD): Roughly the typical distance of observations from the mean, indicating how spread out the data is.
- Variance: The square of the standard deviation, emphasizing larger deviations from the mean.
Key Measures Explained
- Range:
- The simplest measure of spread, calculated as the difference between the maximum and minimum values.
- Sensitive to outliers and does not account for how values are distributed within the range.
- Interquartile Range (IQR):
- Measures the spread of the middle 50% of the data.
- Defined as the difference between the third quartile (75th percentile) and the first quartile (25th percentile).
- More robust than the range, as it is less affected by extreme values.
- Standard Deviation (SD):
- Measures the typical distance of values from the mean (formally, the square root of the average squared deviation).
- A high SD indicates greater variability, while a low SD suggests values are tightly clustered around the mean.
- Commonly used in statistical analyses.
- Variance:
- Represents the average squared deviation from the mean.
- While less intuitive than the SD, it is a key metric in many advanced statistical models.
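For reference, the sample variance, standard deviation, and IQR (matching what var(), sd(), and IQR() compute in R) can be written as:

\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2},
\qquad
\text{IQR} = Q_3 - Q_1
\]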
Calculating Spread in R
We can calculate these measures for the Total_Cost
variable in the Liberty ship dataset as follows:
# Calculate range, IQR, standard deviation, and variance
spread_stats <- liberty_ship_data %>%
  summarize(
    range_cost = diff(range(Total_Cost, na.rm = TRUE)),
    iqr_cost = IQR(Total_Cost, na.rm = TRUE),
    sd_cost = sd(Total_Cost, na.rm = TRUE),
    var_cost = var(Total_Cost, na.rm = TRUE)
  )

spread_stats
# A tibble: 1 × 4
range_cost iqr_cost sd_cost var_cost
<dbl> <dbl> <dbl> <dbl>
1 2835747. 260812. 262535. 68924856289.
Key Insights
- Range: Provides a quick sense of the span of costs but can be misleading if there are extreme outliers.
- IQR: Offers a more stable view of variability by focusing on the central portion of the data, excluding extremes.
- SD and Variance: Quantify overall variability in the dataset, with SD being more interpretable in units of the original variable.
These measures allow you to understand not only the average production cost of Liberty ships but also how consistent those costs are across observations. High variability might indicate differing efficiencies or practices across shipyards, while low variability suggests more standardized processes.
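As a follow-up sketch, the same kind of summary can be grouped by shipyard to probe those differences in consistency (column names as shown in the dataset above):

```r
# Compare typical cost and cost variability across shipyards
liberty_ship_data %>%
  group_by(Yard) %>%
  summarize(
    n_ships   = n(),
    mean_cost = mean(Total_Cost, na.rm = TRUE),
    sd_cost   = sd(Total_Cost, na.rm = TRUE)
  ) %>%
  arrange(desc(sd_cost))
```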
21.5 Distributions
What is a Distribution?
Imagine arranging all the values of a variable along a line, from the smallest to the largest. Some values will cluster closely together, while others will spread out. This clustering and spreading form the shape of the distribution.
For example, imagine plotting the heights of a group of people. The resulting distribution might have a central peak where most heights cluster (around the average) and taper off on both sides for shorter and taller individuals whose heights are less common.
If the clustering is dense around the average, we say those values are more likely to occur. Conversely, values that are less dense are less likely. This idea of density is central to understanding distributions.
This plot shows individual data points of a random variable \(\mathsf{X}\) that follow a pattern defined by a standard normal distribution. The values of \(\mathsf{X}\) are spread along the x-axis, with the density curve illustrating the theoretical distribution shape that describes where values are most likely to occur. The height of the curve reflects the relative likelihood of values in different regions, with peaks indicating where data points cluster. This pattern is typical of many real-world variables.
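The figure itself is not reproduced here, but a rough sketch of a similar plot can be built with ggplot2 (assumed to be available), overlaying simulated draws on the standard normal density curve:

```r
library(ggplot2)

set.seed(42)
sim <- data.frame(x = rnorm(200))  # simulated draws of X

ggplot(sim, aes(x = x)) +
  # individual data points spread along the x-axis
  geom_point(aes(y = 0), alpha = 0.3,
             position = position_jitter(height = 0.01, width = 0)) +
  # theoretical standard normal density curve
  stat_function(fun = dnorm, color = "steelblue", linewidth = 1) +
  labs(x = "X", y = "Density")
```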
Common Shapes of Distributions
There are hundreds of known probability distributions, each with a different shape. Some common shapes include:
- Normal Distribution:
- The bell-shaped curve is symmetrical and centered around the mean.
- Real-world example: Heights of individuals or measurement errors in experiments.
- Entrepreneurial Insight: Normal distributions often serve as a benchmark for modeling and comparison.
- Right-Skewed Distribution:
- Most values cluster on the left, with a long tail extending to the right.
- Real-world example: Income levels or customer spending.
- Entrepreneurial Insight: Right-skewed distributions highlight data where a small number of observations have disproportionate impact (e.g., big spenders driving sales).
- Bimodal Distribution
- Two peaks indicate distinct groups within the data.
- Real-world example: Customer preferences or seasonal sales.
- Entrepreneurial Insight: Bimodal distributions often reveal multiple populations or market segments requiring tailored strategies.
This visualization shows how distributions vary in shape and density. Differences in density imply different actionable insights that the savvy entrepreneur can leverage to increase success.
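The underlying figure is not shown here, but a minimal simulation sketch of the three shapes might look like this (the shape labels and simulation parameters are illustrative choices, not the book's original figure):

```r
library(ggplot2)

set.seed(123)
shapes <- data.frame(
  value = c(rnorm(1000, mean = 0, sd = 1),                 # normal
            rexp(1000, rate = 1),                          # right-skewed
            c(rnorm(500, -2, 0.7), rnorm(500, 2, 0.7))),   # bimodal
  shape = rep(c("Normal", "Right-skewed", "Bimodal"), each = 1000)
)

ggplot(shapes, aes(x = value)) +
  geom_density(fill = "grey80") +
  facet_wrap(~ shape, scales = "free")
```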
Moments of a Distribution
Moments are a way to describe the shape and behavior of a distribution by summarizing key characteristics. Think of moments as the “landmarks” of a dataset: they tell us where the data is centered, how spread out it is, and whether it has unusual patterns like leaning to one side or having heavy tails. Just as a photograph has features like brightness, contrast, and sharpness, moments capture the features of a distribution’s shape.
- First Moment (Central Tendency): Where the data is centered because the values are densely clustered.
- Second Moment (Spread): How far values tend to deviate from the center.
- Third Moment (Skewness): Whether the data leans left or right.
- Fourth Moment (Kurtosis): How much the data is packed into the tails or the peak.
The first moment, central tendency, and the second moment, spread, capture key features of a distribution’s shape. These concepts, already introduced in the descriptive statistics of variables, form the foundation of understanding a distribution. While they provide critical insights, the third and fourth moments—skewness and kurtosis—also reveal important implications for decision-making, especially for entrepreneurs.
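For reference, for a variable \(X\) with mean \(\mu\) and standard deviation \(\sigma\), the standardized third and fourth moments discussed below are:

\[
\text{skewness} = \frac{\operatorname{E}\left[(X - \mu)^3\right]}{\sigma^3},
\qquad
\text{kurtosis} = \frac{\operatorname{E}\left[(X - \mu)^4\right]}{\sigma^4}
\]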
Centrality: The First Moment
Central tendency represents the predicted value of a variable, often serving as the best guess for its typical value. Among measures of central tendency, the mean is commonly used because it reflects the expected value of a distribution.
For example, consider planning your daily commute. If the average commute time is 20 minutes, you might choose to leave 20 minutes before your target arrival time. However, relying solely on the mean might overlook important details about variability.
In entrepreneurship, the mean helps estimate average customer spending, which is critical for revenue predictions. However, other measures like the median may better reflect customer spending power in cases of extreme income disparities. Additionally, identifying the mode of product purchases can reveal the most popular items, guiding inventory and marketing strategies.
Spread: The Second Moment
Variability helps us understand how much individual values differ from the central value. While the mean provides a central point, measures of spread—like variance and standard deviation—help refine decisions based on how widely values deviate from that average.
Returning to the commute example, suppose the standard deviation of commute times is high. On important days like exams or job interviews, relying on the average alone could be risky. A high variability in commute times signals unpredictability, suggesting a need for earlier departure to ensure punctuality.
In entrepreneurship, high variability in customer spending could indicate distinct customer segments. For example, premium customers may require a different marketing strategy than budget-conscious ones. A wide range of spending also signals opportunities to tailor offerings, such as creating both high-end and budget-friendly product lines.
Below are two normal distributions with the same mean but different levels of variance (spread). Notice how a high-variance distribution is more spread out, indicating a wider range of possible values around the central mean.
In these plots, you can see that a low-variance distribution clusters more tightly around the mean, whereas a high-variance distribution is more spread out.
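A small sketch of such a comparison, using simulated commute times rather than the Liberty ships data:

```r
library(ggplot2)

set.seed(7)
commutes <- data.frame(
  minutes = c(rnorm(1000, mean = 20, sd = 2),    # low variance
              rnorm(1000, mean = 20, sd = 8)),   # high variance
  spread  = rep(c("Low variance (sd = 2)", "High variance (sd = 8)"),
                each = 1000)
)

ggplot(commutes, aes(x = minutes, color = spread)) +
  geom_density(linewidth = 1) +
  labs(x = "Commute time (minutes)", y = "Density", color = NULL)
```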
Skewness: The Third Moment
A skewed distribution leans to one side, significantly influencing the interpretation of central tendency. Below are visualizations of positively and negatively skewed distributions:
- Right-skewed distributions have values clustered toward the lower end, with a long tail extending to the right.
- Left-skewed distributions have values clustered at the higher end, with a long tail extending to the left.
Entrepreneurial Insights
- Right-Skewed Distributions: A right-skewed distribution might represent income, where most individuals earn below the average, but a few high earners pull the mean higher and stretch the tail. These high earners, if they are also significant spenders, can guide dual marketing strategies:
- Low-end targeting: Capture the majority of customers who spend modestly.
- High-end targeting: Develop premium offerings for the lucrative but smaller high-spending segment.
- Left-Skewed Distributions: A left-skewed distribution in product ratings could indicate dissatisfaction from a small but vocal subset of users. This could signal a need for targeted interventions:
- Address key grievances: Identify and resolve issues raised by this subset to improve product perception.
- Mitigate influence risks: If these users significantly influence mainstream customers (e.g., through reviews), proactive trust remediation and tailored compensation may help.
- Comparative Analysis of Skewness: In entrepreneurship, comparing the skewness of revenue distributions across regions can reveal disparities in market behavior. For instance, one region might exhibit right skewness (indicating a small base of high-value customers), while another shows a more symmetrical distribution (indicating consistent spending patterns). These insights can inform regional pricing models, promotional campaigns, or product availability strategies.
By understanding and leveraging skewness, entrepreneurs can adapt strategies to cater to both the mainstream and the extremes of their customer base. This ensures that outliers are not ignored but used strategically for growth.
These visuals highlight how skewness and variance shape a distribution. Skewness reveals asymmetry in value clustering, while variance shows the degree of spread around the central value. Together, these properties inform the choice of central tendency measures and guide decisions on interpreting data patterns.
Kurtosis: The Fourth Moment
Kurtosis measures the “tailedness” of a distribution, providing insight into the frequency and magnitude of extreme values. A distribution with high kurtosis has heavier tails and a sharper peak, indicating more frequent extreme deviations from the mean, while low kurtosis reflects lighter tails and a flatter peak. For example, normal distributions have a kurtosis value of 3, which serves as a baseline for comparison.
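Skewness and kurtosis are not built into base R's summary functions, but a sketch using the moments package (an assumption here; it is not used elsewhere in this chapter) could look like the following. Note that moments::kurtosis() reports plain rather than excess kurtosis, so a normally distributed variable lands near 3:

```r
library(moments)  # assumed to be installed; provides skewness() and kurtosis()

liberty_ship_data %>%
  summarize(
    skew_cost = skewness(Total_Cost, na.rm = TRUE),
    kurt_cost = kurtosis(Total_Cost, na.rm = TRUE)
  )
```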
Entrepreneurial Insights
- Customer Behavior:
- High kurtosis in spending habits might highlight extreme variability among customers, such as a small group of high spenders that significantly impact revenue. Entrepreneurs can target these outliers with premium services or loyalty programs.
- Low kurtosis suggests more consistent spending patterns, making it easier to predict overall revenue and standardize marketing efforts.
- Product Feedback and Satisfaction:
- High kurtosis in customer satisfaction scores may indicate polarization, where customers either love or hate a product. This signals the need to investigate underlying causes and either address dissatisfaction or double down on what high-scoring customers value. For example, polarized reviews might guide adjustments in product design, pricing, or marketing strategies to address extreme opinions.
- Revenue Stream Analysis: Entrepreneurs can evaluate kurtosis in revenue distributions to assess volatility.
- High kurtosis could signal reliance on irregular, extreme revenue events (e.g., seasonal spikes or a few high-value customers), necessitating risk mitigation strategies such as diversifying revenue streams.
- Low kurtosis might indicate more stable revenue patterns, reducing the need for aggressive risk management.
- Demographic Insights: By evaluating kurtosis across demographic groups, businesses can detect differences in variability.
- High kurtosis in spending habits among young customers might suggest diverse financial capabilities, requiring segmentation into premium and budget-conscious categories.
- Low kurtosis within a group may indicate homogeneity, simplifying targeted marketing and product offerings.
- Risk and Investment Decisions:
- High kurtosis in metrics like customer lifetime value or project ROI could highlight potential risks, such as over-reliance on outliers for profitability. Entrepreneurs can use this insight to avoid overestimating average performance due to extreme values.
- Low kurtosis indicates a more predictable range of outcomes, which might encourage higher confidence in scaling efforts.
21.6 Conclusion
Understanding variables and distributions is foundational for any data analysis. Variables capture the core elements of your data, while distributions reveal how their values cluster and spread. Skewness and kurtosis describe the shape of a variable’s distribution, providing insight into asymmetry (skewness) and the weight of the tails or sharpness of the peak (kurtosis). These concepts underpin the discussion of how data deviate from a normal distribution in Chapter 23 and the exploratory data analysis covered in Chapter 25.
By learning to identify and interpret measures of central tendency, spread, skewness, and kurtosis, you’ve gained essential tools for summarizing data. In the next chapter, we’ll explore how to use confidence intervals and hypothesis testing to draw inferences and make data-driven decisions.