22 Descriptive Statistics

22.1 Introduction

Because the values of a variable differ from one observation to the next, we need tools for understanding the patterns and trends in data. Descriptive statistics are essential tools for summarizing the characteristics of a dataset. They provide insights into the central tendency, variability, and shape of data, bridging the gap between raw data and meaningful interpretation. This chapter focuses on:

Measures of central tendency and spread at the variable level.
An introduction to distribution-level statistics like skewness and kurtosis, setting the stage for deeper exploration in the distributions chapter.

Through these concepts, you’ll learn to describe and interpret data systematically, supporting decision-making in entrepreneurial and analytical contexts.

22.2 Demonstration Data

We’ll use the liberty_ship_data dataset to explore descriptive statistics. This dataset describes the production of Liberty ships, the transport ships built during World War II. The dataset was first introduced in Section 18.2 and again in Section 21.2. Key variables include:

Yard: The shipyard responsible for building the ship.
Way: Ordered number of the sloping platform that launches the ship into the water after construction.
Direct_Hours: Total direct labor hours spent constructing a ship.
Total_Production_Days: Total number of days required to produce a ship.
Total_Cost: The total cost of producing a ship in dollars.
Delivery_Date: The date the ship was launched into service as a transport vehicle.

# A tibble: 1,571 × 7
    Unit Yard    Way Direct_Hours Total_Production_Days Total_Cost Delivery_Date
   <dbl> <chr> <dbl>        <dbl>                 <dbl>      <dbl> <chr>        
 1     1 Beth…     1       870870                   244   2615849  12/30/41     
 2     2 Beth…     2       831745                   249   2545125  1/19/42      
 3     3 Beth…     3       788406                   222   2466811  1/29/42      
 4     4 Beth…     4       758934                   233   2414978  2/9/42       
 5     5 Beth…     5       735197                   220   2390643  2/20/42      
 6     6 Beth…     6       710342                   227   2345051  2/27/42      
 7     8 Beth…     8       668785                   217   2254490  3/30/42      
 8     9 Beth…     9       675662                   196   2139564. 3/18/42      
 9    10 Beth…    10       652911                   211   2221499. 4/11/42      
10    11 Beth…    11       603625                   229   2217642. 5/9/42       
# ℹ 1,561 more rows

22.3 Measures of Central Tendency

Measures of central tendency identify the “center” of a dataset, summarizing where most values cluster. Common measures include:

Mean: The arithmetic average of all values.
Median: The middle value when data is sorted.
Mode: The most frequently occurring value.

Key Measures Explained

Mean:
- Calculated by summing all values and dividing by the number of observations.
- Often referred to as the expected value, as it predicts the central value of a distribution.
- Sensitive to extreme values (outliers), which can skew its representativeness.
Median:
- The middle value when all observations are sorted in ascending order.
- More robust to outliers than the mean, making it ideal for skewed distributions.
Mode:
- The most frequently occurring value in the dataset.
- Particularly useful for categorical variables or distributions with clear peaks.
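
To make the contrast between the mean and the median concrete, here is a minimal sketch using made-up numbers rather than the Liberty ship data. The vectors `costs` and `costs_outlier` are illustrative assumptions; the point is only that one extreme value pulls the mean a long way while the median barely moves.

```r
# Hypothetical values (in thousands of dollars), not taken from the Liberty ship data
costs <- c(10, 12, 12, 13, 15)
costs_outlier <- c(costs, 100)   # the same values plus one extreme observation

mean(costs)            # 12.4
median(costs)          # 12

mean(costs_outlier)    # 27
median(costs_outlier)  # 12.5
```

Adding a single value of 100 more than doubles the mean, while the median shifts by only half a unit.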

Calculating Central Tendency in R

Here’s how to calculate the measures of central tendency for the Total_Cost variable in the Liberty ship dataset:

library(dplyr)  # provides %>% and summarize()

# Calculate mean, median, and mode
centrality_stats <- liberty_ship_data %>%
  summarize(
    mean_cost = mean(Total_Cost, na.rm = TRUE),
    median_cost = median(Total_Cost, na.rm = TRUE),
    # names() returns text, so mode_cost is a character string (<chr>)
    mode_cost = names(sort(table(Total_Cost), decreasing = TRUE))[1]
  )

centrality_stats

# A tibble: 1 × 3
  mean_cost median_cost mode_cost  
      <dbl>       <dbl> <chr>      
1  1951953.    1919645. 1829220.302
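
The table()-based expression above reports the mode as a character string, which is why mode_cost prints as `<chr>`. One possible alternative is a small helper that returns the most frequent value in its original type. The function name `stat_mode()` and the toy `yards` vector below are hypothetical and are not part of the chapter's dataset or of any package.

```r
# Hypothetical helper: returns the most frequent value in its original type.
# If several values tie, the one that appears first in x is returned.
stat_mode <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  ux <- unique(x)                          # distinct values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]    # value with the highest count
}

# Made-up shipyard labels, for illustration only
yards <- c("Yard A", "Yard B", "Yard A", "Yard C", "Yard A", "Yard B")
stat_mode(yards)               # "Yard A"
stat_mode(c(3, 7, 7, 2, NA))   # 7, still numeric
```

A helper like this could be used inside summarize() in place of the table() expression. For a continuous variable such as Total_Cost, where few values repeat exactly, the mode is usually less informative than it is for a categorical variable such as Yard.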

Decision-Making with Central Tendency

Central tendency represents the predicted value of a variable, often serving as the best guess for its typical value. Among measures of central tendency, the mean is commonly used because it reflects the expected value of a distribution.

For example, consider planning your daily commute. If the average commute time is 20 minutes, you might choose to leave 20 minutes before your target arrival time. However, relying solely on the mean might overlook important details about variability.

In entrepreneurship, the mean helps estimate average customer spending, which is critical for revenue predictions. However, other measures like the median may better reflect customer spending power in cases of extreme income disparities. Additionally, identifying the mode of product purchases can reveal the most popular items, guiding inventory and marketing strategies.

Key Insights

The mean provides the average cost, giving a quick estimate of overall production costs.
The median offers a better measure of central tendency for datasets with skewed distributions, such as production costs with a few very high or low values.
The mode, while less commonly used for numeric data, highlights the most frequent value and is particularly valuable for categorical data, such as identifying the most common shipyard.
Use mean and median to identify typical values.

22.4 Measures of Spread

Measures of spread help us understand the variability in a dataset: how much individual values differ from the central cluster point. By quantifying spread, we can make better predictions and more informed decisions.

The key measures of spread include:

Range: The difference between the largest and smallest values, giving a quick sense of the data’s span.
Interquartile Range (IQR): The range of the middle 50% of data, robust to outliers.
Standard Deviation (SD): The typical distance of an observation from the mean (formally, the square root of the variance), indicating how spread out the data is.
Variance: The square of the standard deviation, emphasizing larger deviations from the mean.

Key Measures Explained

Range:
- The simplest measure of spread, calculated as the difference between the maximum and minimum values.
- Sensitive to outliers and does not account for how values are distributed within the range.
Interquartile Range (IQR):
- Measures the spread of the middle 50% of the data.
- Defined as the difference between the third quartile (75th percentile) and the first quartile (25th percentile).
- More robust than the range, as it is less affected by extreme values.
Standard Deviation (SD):
- Measures the typical distance of each value from the mean, computed as the square root of the variance.
- A high SD indicates greater variability, while a low SD suggests values are tightly clustered around the mean.
- Commonly used in statistical analyses.
Variance:
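- Represents the average squared deviation from the mean. R's var() uses the sample formula, which divides the sum of squared deviations by n - 1 rather than n.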
- While less intuitive than the SD, it is a key metric in many advanced statistical models.
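
The definitions above can be checked by hand on a small vector. This is a minimal sketch with made-up values; the vector `x` is an assumption chosen so the arithmetic is easy to follow, and each manual result is compared with the corresponding built-in R function.

```r
# Hypothetical values, chosen so the arithmetic is easy to follow
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Range: maximum minus minimum
range_x <- max(x) - min(x)                        # 7

# IQR: 75th percentile minus 25th percentile (R's default quantile method)
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr_x <- q[2] - q[1]                              # 1.5, same as IQR(x)

# Sample variance: squared deviations summed and divided by n - 1
var_x <- sum((x - mean(x))^2) / (length(x) - 1)   # same as var(x)

# Standard deviation: square root of the variance
sd_x <- sqrt(var_x)                               # same as sd(x)

# Compare the manual results with the built-in functions
all.equal(c(iqr_x, var_x, sd_x), c(IQR(x), var(x), sd(x)))
```

Because the SD is the square root of the variance, it is expressed in the same units as the data, while the variance is in squared units.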

Calculating Spread in R

We can calculate these measures for the Total_Cost variable in the Liberty ship dataset as follows:

# Calculate range, IQR, standard deviation, and variance
spread_stats <- liberty_ship_data %>%
  summarize(
    range_cost = diff(range(Total_Cost, na.rm = TRUE)),
    iqr_cost = IQR(Total_Cost, na.rm = TRUE),
    sd_cost = sd(Total_Cost, na.rm = TRUE),
    var_cost = var(Total_Cost, na.rm = TRUE)
  )

spread_stats

# A tibble: 1 × 4
  range_cost iqr_cost sd_cost     var_cost
       <dbl>    <dbl>   <dbl>        <dbl>
1   2835747.  260812. 262535. 68924856289.

Decision-Making with Variability

Variability helps us understand how much individual values differ from the central value. While the mean provides a central point, measures of spread—like variance and standard deviation—help refine decisions based on how widely values deviate from that average.

Returning to the commute example, suppose the standard deviation of commute times is high. On important days like exams or job interviews, relying on the average alone could be risky. A high variability in commute times signals unpredictability, suggesting a need for earlier departure to ensure punctuality.
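
The commute example can be sketched with two hypothetical routes that share the same average time but differ in variability. The commute times below are made up, and the "mean plus two standard deviations" buffer is just one possible rule of thumb assumed for illustration, not a standard.

```r
# Made-up commute times in minutes for two hypothetical routes
route_a <- c(19, 20, 21, 20, 20, 19, 21)   # consistent
route_b <- c(10, 35, 12, 30, 11, 32, 10)   # erratic

# Both routes average 20 minutes
mean(route_a)
mean(route_b)

# The standard deviations differ sharply
sd(route_a)
sd(route_b)

# Assumed rule of thumb: budget mean + 2 SD on important days
buffer_a <- mean(route_a) + 2 * sd(route_a)
buffer_b <- mean(route_b) + 2 * sd(route_b)
c(route_a = buffer_a, route_b = buffer_b)
```

Under this rule the consistent route needs less than two extra minutes, while the erratic route needs more than twenty, even though the means are identical.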

In entrepreneurship, high variability in customer spending could indicate distinct customer segments. For example, premium customers may require a different marketing strategy than budget-conscious ones. A wide range of spending also signals opportunities to tailor offerings, such as creating both high-end and budget-friendly product lines.

Key Insights

Range: Provides a quick sense of the span of costs but can be misleading if there are extreme outliers.
IQR: Offers a more stable view of variability by focusing on the central portion of the data, excluding extremes.
SD and Variance: Quantify overall variability in the dataset, with SD being more interpretable in units of the original variable.

These measures allow you to understand not only the average production cost of Liberty ships but also how consistent those costs are across observations. High variability might indicate differing efficiencies or practices across shipyards, while low variability suggests more standardized processes.
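
One way to explore whether variability differs across shipyards is to compute the same statistics within each yard. The sketch below uses a small made-up tibble, `toy_ships`, that borrows the Yard and Total_Cost column names from the chapter's dataset; the yard labels and cost values are hypothetical.

```r
library(dplyr)

# Made-up data with the same column names as the chapter's dataset;
# the yard labels and costs are hypothetical
toy_ships <- tibble(
  Yard = c("Yard A", "Yard A", "Yard A", "Yard B", "Yard B", "Yard B"),
  Total_Cost = c(2.0e6, 2.1e6, 1.9e6, 1.5e6, 2.5e6, 2.0e6)
)

# Summarize center and spread separately for each yard
spread_by_yard <- toy_ships %>%
  group_by(Yard) %>%
  summarize(
    mean_cost = mean(Total_Cost, na.rm = TRUE),
    sd_cost   = sd(Total_Cost, na.rm = TRUE),
    iqr_cost  = IQR(Total_Cost, na.rm = TRUE)
  )

spread_by_yard
```

In this toy data both yards have the same mean cost, but Yard B's standard deviation is five times larger, which is the kind of pattern that would prompt a closer look at its processes. The same pipeline should work on liberty_ship_data if its columns match those shown in Section 22.2.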

22.5 Exercise: Calculating Descriptive Statistics

Try it yourself:

We return to the data collected from 50 startups participating in the LaunchBright Accelerator Program as introduced in @#sec-exercise-convert-types. The dataset startup_data offers insights into startup characteristics and performance.

Calculate the mean and median for Funding_Amount in startup_data.
Compute the range, IQR, and standard deviation for Funding_Amount.
Convert Customer_Satisfaction from character to an ordered factor (Low < Medium < High).
Calculate the mode of Customer_Satisfaction as a factor.

Fully worked solution:

# 1
central_tendency <- startup_data |>
  summarize(
    mean_funding = mean(Funding_Amount, na.rm = TRUE),
    median_funding = median(Funding_Amount, na.rm = TRUE),
    mode_funding = names(sort(table(Funding_Amount), decreasing = TRUE))[1]
  )
central_tendency

# 2
spread_measures <- startup_data |>
  summarize(
    range_funding = diff(range(Funding_Amount, na.rm = TRUE)),
    iqr_funding = IQR(Funding_Amount, na.rm = TRUE),
    sd_funding = sd(Funding_Amount, na.rm = TRUE)
  )
spread_measures

# 3
startup_data <- startup_data |>
  mutate(
    Customer_Satisfaction = factor(
      Customer_Satisfaction,
      levels = c("Low", "Medium", "High"),
      ordered = TRUE
    )
  )

# 4
mode_satisfaction <- startup_data |>
  summarize(mode_satisfaction = names(sort(table(Customer_Satisfaction), decreasing = TRUE))[1])
mode_satisfaction

1: Calculate the mean, median, and mode for Funding_Amount.
2: Compute range, IQR, and standard deviation for Funding_Amount.
3: Convert Customer_Satisfaction to an ordinal factor using factor(), with arguments that list the factor levels from low to high and declare that the factor is ordered.
4: Calculate the mode of Customer_Satisfaction.
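
To see what the ordered factor in step 3 changes, the conversion can be tried on a small stand-in data frame. The object `toy_startups` and its values are hypothetical (the real startup_data has 50 rows); only the column names are borrowed from the exercise.

```r
# Hypothetical stand-in for startup_data, for illustration only
toy_startups <- data.frame(
  Funding_Amount = c(50000, 120000, 75000, 120000, 300000),
  Customer_Satisfaction = c("High", "Low", "Medium", "High", "High")
)

# Convert the character column to an ordered factor
toy_startups$Customer_Satisfaction <- factor(
  toy_startups$Customer_Satisfaction,
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# table() now reports counts in level order (Low, Medium, High),
# not alphabetical order
satisfaction_counts <- table(toy_startups$Customer_Satisfaction)
satisfaction_counts

# Ordered factors support comparisons such as "at least Medium"
n_at_least_medium <- sum(toy_startups$Customer_Satisfaction >= "Medium")   # 4

# Mode: the level with the highest count
mode_level <- names(which.max(satisfaction_counts))                        # "High"
```

Without `ordered = TRUE`, the `>=` comparison would not be meaningful for a factor, so declaring the order is what makes rank-based questions possible.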

22.6 Conclusion

Descriptive statistics are usually the first step in understanding data. They provide insights into the central values, variability, and shape of datasets, setting the stage for deeper analysis. Practice these concepts to build confidence in summarizing and interpreting data effectively.
