# A tibble: 1,571 × 7
    Unit Yard    Way Direct_Hours Total_Production_Days Total_Cost Delivery_Date
   <dbl> <chr> <dbl>        <dbl>                 <dbl>      <dbl> <chr>
 1     1 Beth…     1       870870                   244   2615849  12/30/41
 2     2 Beth…     2       831745                   249   2545125  1/19/42
 3     3 Beth…     3       788406                   222   2466811  1/29/42
 4     4 Beth…     4       758934                   233   2414978  2/9/42
 5     5 Beth…     5       735197                   220   2390643  2/20/42
 6     6 Beth…     6       710342                   227   2345051  2/27/42
 7     8 Beth…     8       668785                   217   2254490  3/30/42
 8     9 Beth…     9       675662                   196   2139564. 3/18/42
 9    10 Beth…    10       652911                   211   2221499. 4/11/42
10    11 Beth…    11       603625                   229   2217642. 5/9/42
# ℹ 1,561 more rows
22 Descriptive Statistics
22.1 Introduction
Because the values of a variable differ from one observation to the next, we need ways to summarize the patterns and trends in the data. Descriptive statistics are essential tools for summarizing the characteristics of a dataset. They provide insights into the central tendency, variability, and shape of data, bridging the gap between raw data and meaningful interpretation. This chapter focuses on:
- Measures of central tendency and spread at the variable level.
- An introduction to distribution-level statistics like skewness and kurtosis, setting the stage for deeper exploration in the distributions chapter.
Through these concepts, you’ll learn to describe and interpret data systematically, supporting decision-making in entrepreneurial and analytical contexts.
22.2 Demonstration Data
We’ll use the liberty_ship_data dataset to explore descriptive statistics. This dataset describes the production of Liberty ships, the transport ships built during World War II. It was first introduced in Section 18.2 and used again in Section 21.2. Key variables include:
- Yard: The shipyard responsible for building the ship.
- Way: Ordered number of the sloping platform that launches the ship into the water after construction.
- Direct_Hours: Total direct labor hours spent constructing a ship.
- Total_Production_Days: Total number of days required to produce a ship.
- Total_Cost: The total cost of producing a ship in dollars.
- Delivery_Date: The date the ship was launched into service as a transport vehicle.
22.3 Measures of Central Tendency
Measures of central tendency identify the “center” of a dataset, summarizing where most values cluster. Common measures include:
- Mean: The arithmetic average of all values.
- Median: The middle value when data is sorted.
- Mode: The most frequently occurring value.
Key Measures Explained
- Mean:
  - Calculated by summing all values and dividing by the number of observations.
  - Often referred to as the expected value: the sample mean estimates the central value of the underlying distribution.
  - Sensitive to extreme values (outliers), which can pull it away from where most observations lie.
- Median:
  - The middle value when all observations are sorted in ascending order.
  - More robust to outliers than the mean, making it a good choice for skewed distributions.
- Mode:
  - The most frequently occurring value in the dataset.
  - Particularly useful for categorical variables or distributions with clear peaks.
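To make the point about outliers concrete, here is a minimal sketch in base R. The cost values are made up for illustration and are not taken from the Liberty ship data; the sketch simply shows how one extreme value moves the mean far more than the median.

```r
# Hypothetical ship costs in dollars (made-up values, for illustration only)
costs <- c(1800000, 1900000, 1900000, 2000000, 2100000)

# Add a single extreme value to the same data
costs_with_outlier <- c(costs, 9000000)

# Without the outlier, mean and median are close together
mean(costs)     # 1940000
median(costs)   # 1900000

# With the outlier, the mean jumps to about 3.12 million,
# while the median only moves to 1950000
mean(costs_with_outlier)
median(costs_with_outlier)
```

The median changes by 50,000 dollars while the mean changes by more than a million, which is why the median is usually the safer summary when a few observations are extreme.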
Calculating Central Tendency in R
Here’s how to calculate the measures of central tendency for the Total_Cost
variable in the Liberty ship dataset:
# Load dplyr for %>% and summarize()
library(dplyr)

# Calculate mean, median, and mode
centrality_stats <- liberty_ship_data %>%
  summarize(
    mean_cost = mean(Total_Cost, na.rm = TRUE),
    median_cost = median(Total_Cost, na.rm = TRUE),
    # table() counts each value; the most frequent one is returned as text
    mode_cost = names(sort(table(Total_Cost), decreasing = TRUE))[1]
  )
centrality_stats
# A tibble: 1 × 3
mean_cost median_cost mode_cost
<dbl> <dbl> <chr>
1 1951953. 1919645. 1829220.302
Decision-Making with Central Tendency
Central tendency represents the predicted value of a variable, often serving as the best guess for its typical value. Among measures of central tendency, the mean is commonly used because it reflects the expected value of a distribution.
For example, consider planning your daily commute. If the average commute time is 20 minutes, you might choose to leave 20 minutes before your target arrival time. However, relying solely on the mean might overlook important details about variability.
In entrepreneurship, the mean helps estimate average customer spending, which is critical for revenue predictions. However, other measures like the median may better reflect customer spending power in cases of extreme income disparities. Additionally, identifying the mode of product purchases can reveal the most popular items, guiding inventory and marketing strategies.
Key Insights
- The mean provides the average cost, giving a quick estimate of overall production costs.
- The median offers a better measure of central tendency for datasets with skewed distributions, such as production costs with a few very high or low values.
- The mode, while less commonly used for numeric data, highlights the most frequent value and is particularly valuable for categorical data, such as identifying the most common shipyard.
- Use mean and median to identify typical values.
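The mode of a categorical variable can be sketched in a few lines of base R. The shipyard labels below are hypothetical placeholders, not values from liberty_ship_data; the same pattern should carry over to the Yard column.

```r
# Hypothetical shipyard labels (made-up, for illustration only)
yards <- c("Yard A", "Yard B", "Yard A", "Yard C", "Yard A", "Yard B")

# Count how often each label occurs, most frequent first
yard_counts <- sort(table(yards), decreasing = TRUE)
yard_counts

# The first name is the mode
most_common_yard <- names(yard_counts)[1]
most_common_yard   # "Yard A"

# If several labels tie for the top count, this keeps all of them
all_modes <- names(yard_counts)[yard_counts == max(yard_counts)]
all_modes
```

Taking only the first name silently picks one winner when there is a tie, so checking all_modes is a reasonable safeguard.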
22.4 Measures of Spread
Measures of spread help us understand the variability in a dataset: how much individual values differ from the central point. By quantifying spread, we can make better predictions and more informed decisions.
The key measures of spread include:
- Range: The difference between the largest and smallest values, giving a quick sense of the data’s span.
- Interquartile Range (IQR): The range of the middle 50% of data, robust to outliers.
- Standard Deviation (SD): The typical distance of observations from the mean (the square root of the variance), indicating how spread out the data is.
- Variance: The square of the standard deviation, emphasizing larger deviations from the mean.
Key Measures Explained
- Range:
  - The simplest measure of spread, calculated as the difference between the maximum and minimum values.
  - Sensitive to outliers and does not account for how values are distributed within the range.
- Interquartile Range (IQR):
  - Measures the spread of the middle 50% of the data.
  - Defined as the difference between the third quartile (75th percentile) and the first quartile (25th percentile).
  - More robust than the range, as it is less affected by extreme values.
- Standard Deviation (SD):
  - Measures the typical distance of each value from the mean; it is the square root of the variance.
  - A high SD indicates greater variability, while a low SD suggests values are tightly clustered around the mean.
  - Commonly used in statistical analyses.
- Variance:
  - Represents the average squared deviation from the mean. R’s var() divides by n - 1 rather than n, which is the usual sample variance.
  - While less intuitive than the SD, it is a key metric in many advanced statistical models.
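The definitions above can be checked by hand on a small vector. This is a minimal sketch in base R using made-up numbers; it computes each measure from its definition and compares the result with R’s built-in functions.

```r
# Small made-up sample, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Range: maximum minus minimum
range_by_hand <- max(x) - min(x)

# IQR: 75th percentile minus 25th percentile
iqr_by_hand <- unname(quantile(x, 0.75) - quantile(x, 0.25))

# Variance: sum of squared deviations from the mean, divided by n - 1
deviations <- x - mean(x)
var_by_hand <- sum(deviations^2) / (n - 1)

# Standard deviation: square root of the variance
sd_by_hand <- sqrt(var_by_hand)

# Each comparison should return TRUE
all.equal(range_by_hand, diff(range(x)))
all.equal(iqr_by_hand, IQR(x))
all.equal(var_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```

Because the variance is in squared units, the SD is usually the one to report alongside the mean.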
Calculating Spread in R
We can calculate these measures for the Total_Cost
variable in the Liberty ship dataset as follows:
# Calculate range, IQR, standard deviation, and variance
spread_stats <- liberty_ship_data %>%
  summarize(
    range_cost = diff(range(Total_Cost, na.rm = TRUE)),
    iqr_cost = IQR(Total_Cost, na.rm = TRUE),
    sd_cost = sd(Total_Cost, na.rm = TRUE),
    var_cost = var(Total_Cost, na.rm = TRUE)
  )
spread_stats
# A tibble: 1 × 4
range_cost iqr_cost sd_cost var_cost
<dbl> <dbl> <dbl> <dbl>
1 2835747. 260812. 262535. 68924856289.
Decision-Making with Variability
Variability helps us understand how much individual values differ from the central value. While the mean provides a central point, measures of spread—like variance and standard deviation—help refine decisions based on how widely values deviate from that average.
Returning to the commute example, suppose the standard deviation of commute times is high. On important days like exams or job interviews, relying on the average alone could be risky. A high variability in commute times signals unpredictability, suggesting a need for earlier departure to ensure punctuality.
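The commute example can be sketched with two hypothetical routes that share the same average time but differ in variability. The minutes below are made up, and the "mean plus two standard deviations" buffer is only an illustrative rule of thumb, not a method prescribed by this chapter.

```r
# Hypothetical commute times in minutes (made-up values)
route_a <- c(19, 20, 21, 20, 20)   # steady route
route_b <- c(10, 30, 15, 25, 20)   # unpredictable route

# Both routes average 20 minutes
mean(route_a)
mean(route_b)

# The spread differs a lot: about 0.7 versus about 7.9 minutes
sd(route_a)
sd(route_b)

# Illustrative buffer (an assumption): allow mean + 2 * SD minutes
buffer_a <- mean(route_a) + 2 * sd(route_a)
buffer_b <- mean(route_b) + 2 * sd(route_b)
round(c(route_a = buffer_a, route_b = buffer_b), 1)
```

On an ordinary day the two routes look identical by their means, but on an exam day the second route would call for a noticeably earlier departure.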
In entrepreneurship, high variability in customer spending could indicate distinct customer segments. For example, premium customers may require a different marketing strategy than budget-conscious ones. A wide range of spending also signals opportunities to tailor offerings, such as creating both high-end and budget-friendly product lines.
Key Insights
- Range: Provides a quick sense of the span of costs but can be misleading if there are extreme outliers.
- IQR: Offers a more stable view of variability by focusing on the central portion of the data, excluding extremes.
- SD and Variance: Quantify overall variability in the dataset, with SD being more interpretable in units of the original variable.
These measures allow you to understand not only the average production cost of Liberty ships but also how consistent those costs are across observations. High variability might indicate differing efficiencies or practices across shipyards, while low variability suggests more standardized processes.
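One way to explore whether variability differs across shipyards is to compute the spread within each yard. The sketch below uses base R’s tapply() on a tiny made-up data frame; the yard names and costs are hypothetical, and with the real data the same call would be applied to the Yard and Total_Cost columns of liberty_ship_data.

```r
# Tiny hypothetical dataset (made-up yards and costs, for illustration only)
ships <- data.frame(
  Yard = c("Yard A", "Yard A", "Yard A", "Yard B", "Yard B", "Yard B"),
  Total_Cost = c(1900000, 2000000, 2100000, 1500000, 2000000, 2500000)
)

# Mean cost per yard: both yards average 2,000,000
mean_by_yard <- tapply(ships$Total_Cost, ships$Yard, mean)
mean_by_yard

# Standard deviation per yard: Yard B is five times as variable
sd_by_yard <- tapply(ships$Total_Cost, ships$Yard, sd)
sd_by_yard
```

Two yards with the same average cost can still differ sharply in consistency, which the mean alone would hide.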
22.5 Exercise: Calculating Descriptive Statistics
Try it yourself:
We return to the data collected from 50 startups participating in the LaunchBright Accelerator Program, introduced in the earlier exercise on converting data types (sec-exercise-convert-types). The dataset startup_data offers insights into startup characteristics and performance.
- Calculate mean and median for Funding_Amount in startup_data.
- Compute range, IQR, and standard deviation for Funding_Amount.
- Convert Customer_Satisfaction from character to an ordered factor (Low < Medium < High).
- Calculate the mode of Customer_Satisfaction as a factor.
Hint 1
- Calculate mean and median for Funding_Amount using the mean() and median() functions.
- Compute range, IQR, and standard deviation for Funding_Amount using the range(), IQR(), and sd() functions.
- Convert Customer_Satisfaction from character to an ordered factor (Low < Medium < High) using the factor() function.
- Calculate the mode of Customer_Satisfaction using the table() function.
Hint 2
- Calculate mean using mean() and median using median() for Funding_Amount in startup_data.
- Compute range using range(), IQR using IQR(), and standard deviation using sd() for Funding_Amount.
- Convert Customer_Satisfaction from character to an ordered factor (Low < Medium < High) using factor( , levels = c(), ordered = TRUE ).
- Calculate the mode of Customer_Satisfaction using summarize( names(sort(table( ), decreasing = TRUE))[1]).
Fully worked solution:
# 1: central tendency of Funding_Amount
central_tendency <- startup_data |>
  summarize(
    mean_funding = mean(Funding_Amount, na.rm = TRUE),
    median_funding = median(Funding_Amount, na.rm = TRUE),
    mode_funding = names(sort(table(Funding_Amount), decreasing = TRUE))[1]
  )
central_tendency

# 2: spread of Funding_Amount
spread_measures <- startup_data |>
  summarize(
    range_funding = diff(range(Funding_Amount, na.rm = TRUE)),
    iqr_funding = IQR(Funding_Amount, na.rm = TRUE),
    sd_funding = sd(Funding_Amount, na.rm = TRUE)
  )
spread_measures

# 3: convert Customer_Satisfaction to an ordered factor
startup_data <- startup_data |>
  mutate(Customer_Satisfaction = factor(Customer_Satisfaction,
                                        levels = c("Low", "Medium", "High"),
                                        ordered = TRUE))

# 4: mode of Customer_Satisfaction
mode_satisfaction <- startup_data |>
  summarize(mode_satisfaction = names(sort(table(Customer_Satisfaction), decreasing = TRUE))[1])
mode_satisfaction
- 1: Calculate the mean, median, and mode for Funding_Amount.
- 2: Compute range, IQR, and standard deviation for Funding_Amount.
- 3: Convert Customer_Satisfaction to an ordinal factor using factor(), with arguments listing the levels of the factor from low to high and declaring that the factor is ordered.
- 4: Calculate the mode of Customer_Satisfaction.
22.6 Conclusion
Descriptive statistics are the most commonly used first step in understanding data. They provide insights into the central values, variability, and shape of datasets, setting the stage for deeper analysis. Practice these concepts to build confidence in summarizing and interpreting data effectively.