23  Univariate EDA

Exploratory Data Analysis of Individual Variables

Univariate Exploratory Data Analysis (EDA) focuses on examining one variable at a time. By understanding each variable individually, we gain valuable insights that lay the groundwork for analyzing relationships and building predictive models. In the context of entrepreneurship, a thorough exploration of each variable provides clarity on customer demographics, financial projections, or product feedback, which is essential for making informed business decisions.



23.1 Why Univariate Analysis Matters

In analytics, understanding one variable at a time helps us:

  • Validate Data Quality: Spot issues like outliers or missing values that could distort your analysis. For example, identifying outliers in customer spending data helps spot potential errors or unique customer segments worth investigating.

  • Identify Patterns: Observe distributions (e.g., age distribution) to understand customer demographics or product preferences. For instance, noticing a peak in customer age distribution can reveal a dominant demographic, informing targeted marketing efforts.

  • Guide Future Analysis: Set a foundation for examining relationships between variables and building models that predict customer behavior or business performance. For example, understanding the spending range in a single customer segment can guide pricing strategies and promotional campaigns.

By mastering univariate techniques, you’ll be able to confidently analyze single variables in your data, whether you’re studying customer demographics, sales figures, or product feedback. These foundational skills will prepare you to make data-driven decisions as an entrepreneur.

Key Learning Objectives

  • Analyze single-variable data to assess quality, detect patterns, and inform decisions in business analytics.
  • Employ univariate techniques to describe and interpret data distributions in real-world business contexts.
  • Use R to apply fundamental univariate methods, including descriptive statistics and visualizations.

Entrepreneurial Insight

Observing single-variable patterns, like customer age or purchase frequency, helps you design targeted marketing strategies and understand your customer base better. Spotting unusual values early on can also guide strategic decisions, like prioritizing inventory or tailoring promotions to specific customer groups.

In the next sections, we’ll dive deeper into specific univariate techniques, using data from a hypothetical company, UrbanFind, to illustrate these foundational skills.



23.2 Demonstration Data: UrbanFind

Consider UrbanFind, a startup that specializes in curating personalized recommendations for city dwellers in several areas of their lives:

  • Tech Gadgets: Recommendations for the latest gadgets and devices that enhance convenience and connectivity in a fast-paced city life, such as smart home devices, wearable tech, and productivity tools.

  • Fashion: Curated fashion items and accessories that align with urban styles and seasonal trends, helping city dwellers look their best in a competitive, image-conscious environment.

  • Outdoor Activities: Gear and suggestions for outdoor activities that are accessible even in or near urban settings—like urban hiking, weekend getaways, and fitness equipment for both outdoor and indoor use.

  • Health and Wellness Products: Products focused on personal well-being, including fitness equipment, nutritional supplements, and relaxation tools to counterbalance the stresses of urban life.

These recommendations aim to provide city residents with tailored options that fit their lifestyle and preferences, whether they’re looking to upgrade their tech, update their wardrobe, stay active, or improve their wellness. By analyzing customer data, UrbanFind can better understand which areas resonate most with their audience and refine their product offerings and marketing strategies accordingly.

By examining single variables—like customer age, income level, or product rating—UrbanFind can answer foundational questions:

  • Who is the customer?
  • What price range can they afford?
  • How satisfied are they with existing products?

These insights, while simple, guide strategic decisions and set the stage for deeper analysis.

Variables in UrbanFind’s Data

UrbanFind conducted a survey to gather insights into customer demographics, spending habits, and interests. The dataset we’re working with contains responses from 100 survey participants who are representative of UrbanFind’s potential customer base. Each row is an observation, representing the responses of one unique respondent, with the following variables captured:

  • Age: The age of the customer in years. Age is an important demographic factor for UrbanFind, as different age groups may have distinct preferences for technology, fashion, or outdoor activities.

  • Spending: The amount (in dollars) each customer reported spending on lifestyle-related products in the past month. This includes items like tech gadgets, health products, and outdoor gear. UrbanFind aims to understand the range of spending to help design product bundles and set price points.

  • Product Interest: The product category the customer is most interested in, chosen from four options: Tech, Fashion, Outdoors, and Health. This helps UrbanFind determine which product lines to prioritize for marketing and inventory.

  • Region: The geographic region where each customer lives, categorized into North, South, East, and West. This variable provides insights into potential regional differences in product preferences and spending behaviors.

Each of these variables gives us a unique lens through which to view the customer base. By examining them individually, we gain insights that will inform how UrbanFind can tailor its offerings to meet customer needs.

Entrepreneurial Insight

Understanding variables at this level helps entrepreneurs like UrbanFind make informed decisions. By focusing on characteristics such as age, spending, and preferences, businesses can design targeted marketing strategies, set appropriate price points, and determine which products resonate most with their customers. This level of insight is foundational for any data-driven business strategy.

Viewing the UrbanFind Dataset

Here’s a preview of the customer_data dataset. Notice how the values of each variable vary across observations. In other words, age, spending, product_interest, and region are all variables that provide different types of information.

# A tibble: 100 × 4
     age spending product_interest region
   <dbl>    <dbl> <chr>            <chr> 
 1    33      495 Fashion          East  
 2    18      458 Fashion          North 
 3    32      491 Health           South 
 4    30      420 Fashion          South 
 5    85      664 Fashion          East  
 6    35      533 Fashion          East  
 7    31      526 Health           South 
 8    14      350 Fashion          South 
 9    24      471 Health           East  
10    NA      424 Fashion          South 
# ℹ 90 more rows


23.3 Visualizing Distributions

The greatest value of a picture is when it forces us to notice what we never expected to see.
— John Tukey

In exploratory data analysis, understanding the distribution of a variable is essential, but we rarely know what this distribution looks like until we visualize it. By using visualization tools in R, such as those available in ggplot2, we can gain simple yet powerful insights into data patterns, including clustering, spread, and unusual values. Visualization helps reveal whether data tends to follow a normal distribution, has skewness, or shows other unique characteristics.

Histograms for Continuous Variables

A histogram shows the frequency of values within specified ranges (or “bins”) along the x-axis, making it ideal for visualizing the shape of the data. Histograms allow us to observe clustering patterns, skewness, and whether the distribution has one peak or multiple peaks.

For example, if we were examining a variable representing the price of items in a store, a histogram could reveal if prices tend to cluster around certain points or if there’s a broader spread. In the following demonstration, you’ll see how histograms effectively depict the distribution of values within a dataset.
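A minimal sketch of such a histogram, assuming a hypothetical data frame store_data with a price column (and ggplot2 loaded):

# Histogram of item prices (hypothetical store_data)
ggplot(store_data, aes(x = price)) +
  geom_histogram(binwidth = 5,
                 fill = "royalblue",
                 color = "black") +
  labs(title = "Distribution of Item Prices",
       x = "Price",
       y = "Frequency") +
  theme_minimal()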

In this demonstration, the histogram shows the frequency distribution of our sample variable, giving insight into how values are spread and where they tend to cluster.

Box Plots for Summarizing Descriptive Statistics

A box plot provides a summary of data based on quartiles, helping us visualize the spread, center, and outliers within a dataset. The box represents the interquartile range (IQR), covering the middle 50% of the data. The whiskers extend to the minimum and maximum values within 1.5 times the IQR, while any points beyond the whiskers are displayed as individual outliers.

This annotated box plot shows the key components, helping us understand the distribution’s spread and identify any potential outliers.
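A minimal sketch of such a box plot in ggplot2 (without the annotations), again assuming a hypothetical store_data with a price column:

# Box plot summarizing the distribution of price
ggplot(store_data, aes(x = "", y = price)) +
  geom_boxplot() +
  labs(title = "Box Plot of Item Prices",
       x = NULL,
       y = "Price") +
  theme_minimal()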

Bar Plots for Categorical Variables

While histograms are ideal for visualizing the distribution of continuous variables, bar plots are the go-to tool for visualizing categorical variables. A categorical variable represents distinct categories, with each observation falling into one category. Examples include gender, product type, or region.

In a bar plot, each bar represents a category, and the height of the bar shows the count (or percentage) of observations in that category. This visualization provides insights into the relative frequency of each category, allowing us to compare categories at a glance.


   Clothing Electronics       Games   Groceries      Health     Outdoor 
         25          45          38          89          44          23 
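A minimal sketch of the bar plot behind counts like these, assuming a hypothetical data frame purchases with one row per sale and a category column (geom_bar() counts the rows in each category for us):

# Bar plot of purchase counts by category
ggplot(purchases, aes(x = category)) +
  geom_bar(fill = "royalblue",
           color = "black") +
  labs(title = "Purchases by Category",
       x = "Category",
       y = "Count") +
  theme_minimal()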

In this bar plot, each bar’s height shows the frequency of purchases in each category. The plot highlights which purchase categories occur most often, helping us identify potential areas for business expansion or targeted marketing.

Demonstration: Exploring Unknown Distributions

When we encounter new data, we often don’t know the underlying distribution of each variable. Visualizations like histograms and box plots make it possible to quickly uncover patterns, shapes, and unique characteristics of the data that may not be obvious from just inspecting raw values.

In this demonstration, we’ll use a dataset named purchase_data that contains three variables with distinct, initially unknown distributions: product_price, customer_age, and purchase_frequency. We’ll first inspect the dataset to see the raw data and then visualize each variable individually to reveal its shape.

# A tibble: 6 × 3
  product_price customer_age purchase_frequency
          <dbl>        <dbl>              <dbl>
1          63.7         20.4               43.3
2          44.4         20.8               44.7
3          53.6         27.1               35.3
4          56.3         26.2               38.8
5          54.0         33.8               33.4
6          48.9         21.3               44.2

The head() of the dataset gives a preview of raw values but doesn’t reveal much about the distributions. Since the distributions of product_price, customer_age, and purchase_frequency are unknown, let’s begin our exploration of the data by visualizing the variables with histograms.

Product Price

To plot the histogram of product_price, we follow the grammar of graphics by

  1. calling the ggplot() function,
  2. specifying the dataset purchase_data,
  3. specifying the aesthetic mapping x = product_price, and
  4. adding the geometry geom_histogram().

To create a basic plot in ggplot2, you only need to specify the data, the aesthetic mapping, and a geometry. For histograms, bar charts, and box plots, we typically map only the x aesthetic to the variable of interest. While this demonstration includes custom colors for aesthetics, the simplest ggplot requires only those three components.
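For instance, a bare-minimum histogram of product_price (using geom_histogram()’s default of 30 bins) is just:

ggplot(purchase_data, aes(x = product_price)) +
  geom_histogram()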

# Plotting the normal distribution for product price
ggplot(purchase_data, aes(x = product_price)) +
  geom_histogram(binwidth = 2, 
                 fill = "royalblue", 
                 color = "black") +
  labs(title = "Distribution of Product Prices", 
       x = "Price", 
       y = "Frequency") +
  theme_minimal()

Insights about this (Normal) distribution:

The histogram of product_price shows a symmetric bell curve around the mean price. In real-world datasets, a normal distribution is often used as a baseline for comparisons because of its predictable properties. Here, the normal shape indicates that most products are priced around the mean, with a few outliers on either side.

Customer Age

Next, let’s visualize customer_age with the same grammar-of-graphics approach we used for product_price.

# Plotting the unknown distribution for customer age
ggplot(purchase_data, aes(x = customer_age)) +
  geom_histogram(binwidth = 5, 
                 fill = "#7B3294", 
                 color = "black") +
  labs(title = "Distribution of Customer Ages", 
       x = "Customer Age", 
       y = "Frequency") +
  theme_minimal()

For customer_age, we see a right-skewed distribution where most values are lower, with a few exceptionally high outliers (such as older customers).

Insights about this (right-skewed) distribution:

The histogram reveals a right-skewed distribution, with most ages clustered on the lower end and a long tail extending to the right. This shape suggests that typical customers are younger, with a few much older customers. For such distributions, the median age is often a better measure of central tendency than the mean, as it’s less influenced by extreme values.

Tip: When exploring a new distribution, it’s often helpful to try different binwidths in your histogram. Adjusting the binwidth can uncover finer details or highlight broader trends in the data, providing a clearer picture of its underlying shape. For example, with a binwidth of 10, the histogram of customer_age appears to peak at age = 0, losing one of the key attributes of a skewed distribution.

# Plotting the unknown distribution for customer age
ggplot(purchase_data, aes(x = customer_age)) +
  geom_histogram(binwidth = 10, 
                 fill = "#7B3294", 
                 color = "black") +
  labs(title = "Distribution of Customer Ages", 
       x = "Customer Age", 
       y = "Frequency") +
  theme_minimal()

Purchase Frequency

Finally, we’ll plot purchase_frequency as we did for the previous two variables.

# Plotting the bimodal distribution for purchase frequency
ggplot(purchase_data, aes(x = purchase_frequency)) +
  geom_histogram(binwidth = 2, 
                 fill = "mediumseagreen", 
                 color = "black") +
  labs(title = "Distribution of Purchase Frequency", 
       x = "Purchase Frequency", 
       y = "Frequency (Count)") +
  theme_minimal()

Insights about this (bimodal) distribution:

The histogram of purchase_frequency shows two peaks, suggesting two distinct groups within the data. This could represent two customer segments, one that purchases more frequently and another that purchases less often. Identifying bimodal distributions can guide targeted marketing strategies by focusing on the needs of each segment.

Exercise: Visualizing Distributions

Try it yourself:


The dataset distribution_data contains a right-skewed variable named skewed_variable and a bimodal variable named bimodal_variable. Visualize the distributions of these variables using histograms. Compare the shapes to what you would expect from each distribution type.

Hint 1

Build your histogram plot using the grammar of graphics by declaring the data, specifying the aesthetic mapping (a histogram maps a variable to the x-axis only), and calling the geometry (a histogram uses the geom_histogram() function). Note that you can specify the binwidth (the span of the x-axis covered by one bar of the histogram), fill (the color of the bars of the histogram), and color (the color of the borders of the bars).

Hint 2

  1. For the skewed distribution:

    1. Call the ggplot() function
    2. Specify the data as distribution_data
    3. Map the aesthetic with x = skewed_variable
    4. Specify the geometry as geom_histogram
    5. [optional] Specify binwidth, fill, and color as you like
  2. For the bimodal distribution:

    1. Call the ggplot() function
    2. Specify the data as distribution_data
    3. Map the aesthetic with x = bimodal_variable
    4. Specify the geometry as geom_histogram
    5. [optional] Specify binwidth, fill, and color as you like
  For example: geom_histogram(binwidth = 1)

Fully worked solution:

  1. For the distribution of skewed_variable:

    1. Call the ggplot() function
    2. Specify the data as distribution_data
    3. Specify the aesthetic mapping as x = skewed_variable
    4. Specify the geometry as geom_histogram
    5. [optional] Specify binwidth, fill, and color as you like
  2. For the distribution of bimodal_variable:

    1. Call the ggplot() function
    2. Specify the data as distribution_data
    3. Specify the aesthetic mapping as x = bimodal_variable
    4. Specify the geometry as geom_histogram
    5. [optional] Specify binwidth, fill, and color as you like
ggplot(distribution_data,
       aes(x = skewed_variable)) +
  geom_histogram(binwidth = 2,
                 fill = "royalblue",
                 color = "black")

ggplot(distribution_data,
       aes(x = bimodal_variable)) +
  geom_histogram(binwidth = 0.5,
                 fill = "royalblue",
                 color = "black")

  1. Call the ggplot() function and specify distribution_data.
  2. Specify the aesthetic mapping with skewed_variable or bimodal_variable plotted on the x-axis.
  3. Call the geom_histogram() function to get a histogram of the distributions; [optional] specify the binwidth, fill, color, or other aesthetics of geom_histogram().

Data visualization is an essential tool in exploratory data analysis. Each type of plot provides unique insights, depending on whether the data is continuous or categorical. The following table outlines commonly used visualization types, along with their purposes and suitability for continuous or categorical data.

Visualization Type | Description | Continuous | Categorical | Use Case
Histogram | Shows frequency distribution of values in bins | ✅ | | Distribution shape, outlier detection
Box Plot | Displays median, quartiles, and potential outliers | ✅ | | Spread, outliers
Density Plot | Smooth curve showing data density over a continuous range | ✅ | | Distribution shape
Bar Chart | Shows count or proportion of each category | | ✅ | Frequency of categorical values
QQ Plot | Plots data quantiles against a normal distribution | ✅ | | Normality check


23.4 Descriptive Statistics

Numerical quantities focus on expected values, graphical summaries on unexpected values.
— John Tukey

Descriptive statistics provide a concise summary of your data and offer insights into its distribution, central tendencies, and variability. We’ll start with R’s summary() function to get a high-level overview, then calculate more specific measures with individual functions. In this section, we’ll calculate these key statistics to get a comprehensive understanding of the customer_data dataset from UrbanFind from Section 23.2.

Summary Statistics with summary()

R’s summary() function provides a quick overview of key statistics, including measures of central tendency and spread (e.g., mean, median, and range). This overview often reveals initial insights, including potential outliers and data skewness.

# Summary statistics for customer_data
summary(customer_data)
      age           spending      product_interest      region         
 Min.   :14.00   Min.   : 139.0   Length:100         Length:100        
 1st Qu.:29.00   1st Qu.: 431.2   Class :character   Class :character  
 Median :37.00   Median : 529.0   Mode  :character   Mode  :character  
 Mean   :36.81   Mean   : 543.8                                        
 3rd Qu.:43.00   3rd Qu.: 627.8                                        
 Max.   :90.00   Max.   :1600.0                                        
 NA's   :3       NA's   :2                                             

The summary() function outputs the mean and median for continuous variables like age and spending and also shows the minimum and maximum values (range). However, it doesn’t cover standard deviation or interquartile range (IQR) directly, which we’ll calculate separately.

From the summary(), we see that the mean age (36.81) and the median age (37) are close in value. Likewise, the mean spending ($543.83) and the median spending ($529) are close in value.

Interpretation: When the mean and median are similar, as seen here for both customer age and spending, it suggests a balanced, symmetrical distribution with minimal skew. To determine how closely values cluster around the average, we also need to examine the range and standard deviation.

Measures of Central Tendency

Central tendency indicates the “typical” or “average” values within the data, helping us understand what’s most representative. Let’s calculate these measures with R functions to see how they differ and when each is most useful.

Mean

The mean is the average value, which is useful for understanding overall levels. However, it can be influenced by outliers.

# Calculate mean for age and spending
mean(customer_data$age, na.rm = TRUE)
[1] 36.81443
mean(customer_data$spending, na.rm = TRUE)
[1] 543.8265

Median

The median represents the middle value, which can be more informative than the mean for skewed distributions, as it’s less affected by outliers.

# Calculate median for age and spending
median(customer_data$age, na.rm = TRUE)
[1] 37
median(customer_data$spending, na.rm = TRUE)
[1] 529

Mode

The mode is the most frequently occurring value, especially useful for categorical data. For example, let’s calculate the mode for product_interest in customer_data.

# Calculate mode for product_interest
table(customer_data$product_interest) 

 Fashion   Health Outdoors     Tech 
      25       24       27       22 
table(customer_data$product_interest) |> which.max()
Outdoors 
       3 
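Here which.max() returns the position (3) of the largest count, labeled with its category name. To extract just the label of the mode, one small sketch:

# Name of the most frequent category (the mode)
names(which.max(table(customer_data$product_interest)))  # returns "Outdoors"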

Measures of Spread

While measures of central tendency (like mean and median) give us an idea of typical values, measures of spread reveal how much the data varies around those central values. Understanding the spread is essential for interpreting data patterns, as it tells us whether values are tightly clustered around the mean or widely dispersed. For instance, high variability suggests diverse customer profiles, while low variability indicates uniformity.

Here, we’ll explore three key measures of spread: range, standard deviation, and variance. Each provides a unique perspective on data variability.

Range

The range is the difference between the maximum and minimum values. It’s simple but can be affected by outliers.

# Calculate range for age and spending
range(customer_data$age, na.rm = TRUE)
[1] 14 90
range(customer_data$spending, na.rm = TRUE)
[1]  139 1600

In customer_data, the range of age is from 14 to 90, while the range of spending spans from $139 to $1600. These ranges show the full spectrum of ages and spending amounts but don’t tell us how common values are within these spans.

Interquartile Range (IQR)

The IQR measures the spread of the middle 50% of values and is particularly useful for understanding spread in skewed data. It’s calculated as the difference between the 75th percentile (75% of values fall below this point) and 25th percentile (25% of values fall below this point):

\[ \mathsf{IQR = Q3 - Q1}.\]

# Calculate IQR for age and spending
IQR(customer_data$age, na.rm = TRUE)
[1] 14
IQR(customer_data$spending, na.rm = TRUE)
[1] 196.5

The IQR of age is 14, meaning the middle 50% of ages fall within a 14-year span. For spending, the IQR is $196.50.

The IQR is especially useful in box plots, where it represents the range of the central box.

Standard Deviation (SD)

The standard deviation measures the average distance of each value from the mean. A low SD indicates that values are clustered near the mean, while a high SD suggests more variability. Standard deviation is useful for interpreting consistency in data.

# Calculate standard deviation for age and spending
sd(customer_data$age, na.rm = TRUE)
[1] 13.2534
sd(customer_data$spending, na.rm = TRUE)
[1] 207.0577

The standard deviation for customer age is 13.25, and for spending, it’s $207.06. These values show how much age and spending typically vary from their respective means.

Interpretation: Measures of spread are crucial for understanding data variability. For example, a low standard deviation in customer ages could imply a relatively homogeneous customer segment that can be reached through the same channels, while a higher standard deviation in spending might indicate diverse customer income levels or buying patterns. Understanding variability helps UrbanFind plan for advertising, inventory, and fluctuations in sales.

Exercise: Calculating Descriptive Statistics

Try it yourself:


The sales_data dataset represents a set of sales records for a fictional company named MetroMart, which operates across multiple regions. This dataset was created to demonstrate descriptive statistics and data exploration techniques, essential for understanding product performance and sales trends.

We’ll work with both continuous variables (e.g., price, quantity_sold) and categorical variables (e.g., region, category, and promotion).

Dataset Overview:

  • Product ID: A unique identifier for each product (from 1 to 300).
  • Price: The selling price of each product, normally distributed with a mean price of $20 and some variation to represent typical pricing diversity.
  • Quantity Sold: The number of units sold, following a Poisson distribution to reflect typical purchase quantities.
  • Region: The region where each product was sold, categorized into North, South, East, and West regions.
  • Category: The product category, which can be one of four types: Electronics, Grocery, Clothing, or Home Goods. This variable helps us understand which product types tend to perform better in terms of sales volume and pricing across different regions.
  • Promotion: Indicates whether the product was on promotion at the time of sale, with possible values of “Yes” or “No.” This variable allows us to analyze how promotional offers affect sales volume and may reveal seasonal or regional preferences for discounted products.

This provides a rich foundation for exploring key statistical concepts, such as central tendency and variability, while allowing us to analyze categorical differences. With these variables, we can gain insights into average prices, sales volume, and how factors like product type, regional differences, and promotions influence sales. These metrics can reveal patterns, outliers, and trends that are crucial for strategic decision-making in areas such as pricing, inventory management, and targeted marketing.

For MetroMart’s sales_data:

  1. Calculate the measures of central tendency (mean and median) for the price and quantity_sold variables to understand the typical values for these sales metrics.

  2. Calculate the mode of the categorical variables, such as region, to see where most purchases happen, and category to identify the most popular product type.

  3. Calculate the spread of the price and quantity_sold variables to understand how they differ from their means, and analyze how promotion might influence these values.

Hint 1

  1. Calculate descriptive statistics for mean and median using the summary() function.
  2. Calculate the mode of the categorical variables using table()
  3. Calculate descriptive statistics for spread (range, IQR, and standard deviation) using the appropriate R functions.

Hint 2

  1. For descriptive statistics about central tendencies:

    1. Call the summary() function
    2. Specify the data as sales_data
  2. For descriptive statistics about spread:

    1. Call the range() function for sales_data and the price and quantity_sold variables – be sure to remove NA values from the calculation
    2. Call the IQR() function for sales_data and the price and quantity_sold variables – be sure to remove NA values from the calculation
    3. Call the sd() function for sales_data and the price and quantity_sold variables – be sure to remove NA values from the calculation
  Functions to use: summary(), range(), IQR(), sd()

Fully worked solution:

summary(sales_data)

table(sales_data$region)
table(sales_data$category)
table(sales_data$promotion)

range(sales_data$price, na.rm = TRUE)
range(sales_data$quantity_sold, na.rm = TRUE)

IQR(sales_data$price, na.rm = TRUE)
IQR(sales_data$quantity_sold, na.rm = TRUE)

sd(sales_data$price, na.rm = TRUE)
sd(sales_data$quantity_sold, na.rm = TRUE)

  1. Call the summary() function for sales_data.
  2. Call the table() function for the categorical variables region, category, and promotion.
  3. Call the range() function for the price and quantity_sold variables, removing NA values from the calculation.
  4. Call the IQR() function for the price and quantity_sold variables, removing NA values from the calculation.
  5. Call the sd() function for the price and quantity_sold variables, removing NA values from the calculation.
  • Compare mean and median to see if the data may be skewed.
  • What does the spread of the data suggest about the diversity of prices? quantity sold?

In exploring any dataset, descriptive statistics provide a foundation for understanding key characteristics of the data. They allow us to summarize central tendencies, variations, and identify any unusual patterns that may be present. The table below outlines essential descriptive statistics techniques, each providing unique insights into different aspects of a dataset.

Measure | Description | Continuous | Categorical
Central Tendency | Measures the “typical” value | ✅ (Mean, Median) | ✅ (Mode)
Variation | Measures data spread | ✅ (Range, SD, IQR) |
Skewness & Kurtosis | Assesses data symmetry and tail heaviness | ✅ (Skewness, Kurtosis) |
Outlier Detection | Identifies unusual values using thresholds | ✅ (Z-score, IQR) |
Normality Tests | Tests if data follows a normal distribution | ✅ (Shapiro-Wilk) |


23.5 Testing for Outliers

Whenever I see an outlier, I never know whether to throw it away or patent it.
– Bert Gunter, R-Help, 9/14/2015

Outliers are data points that deviate significantly from other observations. They can indicate data entry errors, unusual values, or meaningful deviations within the data, and they are crucial to identify in EDA, as they can influence statistical analyses and skew insights. There are several methods to detect and assess outliers, with Z-scores being one of the most common when dealing with approximately normal data.

Testing for Outliers with Z-Scores

When data is approximately normal, we can use Z-scores to assess how far specific values deviate from the mean in terms of standard deviations.

  • Z-score: A Z-score tells us how many standard deviations an observed value (\(\mathsf{X}\)) is from the mean (\(\mathsf{\mu}\)):

    \[ \mathsf{Z = \dfrac{X - \mu}{\sigma}} \]

    where \(\mathsf{X}\) is the observed value, \(\mathsf{\mu}\) is the mean, and \(\mathsf{\sigma}\) is the standard deviation.

    Generally:

    • Values with Z-scores beyond ±2 may be considered unusual.
    • Values beyond ±3 are often classified as outliers.

The following plot illustrates the concept of Z-scores on a normal distribution, with dashed lines at ±2 standard deviations and dotted lines at ±3 standard deviations to mark the bounds of “usual” and “outlier” values.
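A minimal sketch of such a plot, assuming ggplot2 is loaded:

# Standard normal curve with Z-score bounds at ±2 and ±3
ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm) +
  geom_vline(xintercept = c(-2, 2), color = "red", linetype = "dashed") +
  geom_vline(xintercept = c(-3, 3), color = "red", linetype = "dotted") +
  labs(title = "Z-Scores on the Standard Normal Curve",
       x = "Z (standard deviations from the mean)",
       y = "Density") +
  theme_minimal()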

In this plot, the dashed red lines at ±2 standard deviations and the dotted red lines at ±3 standard deviations illustrate where values are likely to be considered unusual or outliers based on their Z-scores.

Demonstration of the Z-score Test

To test for outliers with Z-scores in our sales_data dataset, we’ll calculate the Z-scores for the price variable and check for any values that are more than 3 standard deviations from the mean. Here’s how it’s done in R:

# Calculate Z-scores for sales_data
sales_data$z_score_price <- (sales_data$price - mean(sales_data$price, na.rm = TRUE)) / sd(sales_data$price, na.rm = TRUE)

# Filtering potential outliers based on Z-scores
z_score_outliers <- sales_data %>% filter(abs(z_score_price) > 3)
z_score_outliers
[1] product_id    price         quantity_sold region        z_score_price
<0 rows> (or 0-length row.names)

In this example, we check the price variable for outliers. Since z_score_outliers returns zero rows, we can conclude that there are no outliers in the price variable of sales_data by the Z-score criterion.

Note: Z-scores assume a normal distribution. For non-normal distributions, consider other outlier detection methods, such as the IQR-based method, which we will cover next.

Testing for Outliers with Interquartile Range (IQR)

The IQR method is useful for data that is not normally distributed. This method defines outliers as values that fall below \(\mathsf{Q1 - 1.5 \cdot IQR}\) or above \(\mathsf{Q3 + 1.5 \cdot IQR}\).

Demonstration of the IQR Test

# Calculate IQR and define bounds for potential outliers
q1 <- quantile(sales_data$price, 0.25, na.rm = TRUE)
q3 <- quantile(sales_data$price, 0.75, na.rm = TRUE)
iqr <- q3 - q1

# Define lower and upper bounds for outliers
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Filter potential outliers based on IQR bounds
iqr_outliers <- sales_data %>% filter(price < lower_bound | price > upper_bound)
iqr_outliers
  product_id price quantity_sold region z_score_price
1        156  5.88            22  North     -2.990809

Using the IQR calculation, we can conclude that there is one outlier in the price variable of sales_data: the observation with product_id = 156. Notice that the Z-score for this observation is -2.99, meaning the Z-score test also came close to flagging it as an outlier.

The IQR method is especially effective for detecting outliers in skewed data, as it isn’t influenced by extreme values on either end.
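As a quick cross-check, base R’s boxplot.stats() applies the same 1.5 × IQR whisker rule; a minimal sketch:

# Values flagged as outliers by the 1.5 * IQR rule (NAs are dropped automatically)
boxplot.stats(sales_data$price)$out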

Testing for Outliers with Box Plot Visualization

A box plot visually identifies outliers as individual points beyond the whiskers. In a standard box plot, whiskers extend up to 1.5 times the IQR. Points beyond this range are considered potential outliers.

Demonstration of the Box Plot Test

# Box plot to visually identify outliers in price
ggplot(sales_data, aes(x = "", y = price)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16) +
  labs(title = "Box Plot of Product Prices with Outliers Highlighted", y = "Price") +
  theme_minimal()

This box plot clearly identifies a single value of price, the minimum, as an outlier. Checking the data, we can verify that the lowest price in sales_data belongs to the observation with product_id = 156.

# slice and display the observation (row) with minimum price
sales_data |> slice_min(price)
  product_id price quantity_sold region z_score_price
1        156  5.88            22  North     -2.990809

Choosing an Outlier Detection Method

The best method depends on the data distribution and the context of analysis:

  • Z-scores are suitable for normally distributed data.
  • IQR is robust and works well with skewed data.
  • Box plots provide a quick, visual approach for outlier detection.

Takeaway: Detecting outliers helps identify unusual or potentially erroneous data points. By carefully assessing outliers, we can refine data quality and uncover valuable insights for analysis and decision-making.

Each method offers different insights, making it beneficial to combine them for a comprehensive approach to outlier detection.

Exercise: Testing for Outliers with Statistics and Visualization

Try it yourself:


Perform the Z-score, IQR, and box plot tests to determine whether the age and/or spending variables in UrbanFind’s customer_data have outliers. Which test is more appropriate?

Hint 1

  1. Calculate the Z-score and the IQR test score for both variables.
  2. Visualize the box plot for both variables.

Hint 2

  1. For Z-scores:

    1. Calculate (age - mean(age)) / sd(age)
    2. Calculate (spending - mean(spending)) / sd(spending)
    3. Filter the data to identify cases where the absolute Z-score is greater than 3 (that is, the value lies more than 3 standard deviations from the mean)

(variable - mean(variable)) / sd(variable)
data |> filter(abs(z_score) > 3)

  2. For the IQR test:

    1. Calculate the IQR test bounds Q1 - 1.5 * IQR and Q3 + 1.5 * IQR for age
    2. Calculate the IQR test bounds Q1 - 1.5 * IQR and Q3 + 1.5 * IQR for spending
    3. Filter the data to identify cases where the value of an observation is less than the lower bound or greater than the upper bound

# Calculate quantiles and IQR
q1 <- quantile(variable, 0.25)
q3 <- quantile(variable, 0.75)
iqr <- q3 - q1

# Define lower and upper bounds for outliers
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Filter potential outliers based on IQR bounds
data |> filter(variable < lower_bound | variable > upper_bound)

  3. For the box plot test:

    1. Plot the box plot for age and for spending
    2. Visually check for outliers
    3. Filter or slice to identify the observations with outliers

ggplot(data, aes(x = "", y = variable)) +
  geom_boxplot()
data |> slice_min(order_by = variable, n = number_of_outliers)
data |> slice_max(order_by = variable, n = number_of_outliers)
data |> filter(condition)

Fully worked solution:

customer_data$z_score_age <- (customer_data$age - mean(customer_data$age, na.rm = TRUE)) / sd(customer_data$age, na.rm = TRUE)
customer_data |> filter(abs(z_score_age) > 3)

customer_data$z_score_spending <- (customer_data$spending - mean(customer_data$spending, na.rm = TRUE)) / sd(customer_data$spending, na.rm = TRUE)
customer_data |> filter(abs(z_score_spending) > 3)

q1 <- quantile(customer_data$age, 0.25, na.rm = TRUE)
q3 <- quantile(customer_data$age, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
customer_data |> filter(age < lower_bound | age > upper_bound)

q1 <- quantile(customer_data$spending, 0.25, na.rm = TRUE)
q3 <- quantile(customer_data$spending, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
customer_data |> filter(spending < lower_bound | spending > upper_bound)

ggplot(customer_data, aes(x = "", y = age)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16)
customer_data |> slice_max(order_by = age, n = 2)

ggplot(customer_data, aes(x = "", y = spending)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16)
customer_data |> slice_max(order_by = spending, n = 2)
customer_data |> slice_max(order_by = spending, n = 1)

  1. Calculate the Z-scores and filter for observations more than 3 standard deviations from the mean.
  2. Calculate the IQR lower and upper bounds and filter for observations below or above them.
  3. Plot the box plot, visually check for outliers, and filter or slice to inspect them.
  • What observations may be outliers?
  • Which test seems more appropriate to test for outliers in these variables?


Interpretation in Entrepreneurship

Understanding distribution shape allows us to make better data-driven decisions:

  • Detecting Skew: Skewness tells us if a variable’s distribution leans towards higher or lower values. For instance, if product prices are right-skewed, most products are affordable, but a few are premium-priced.
  • Identifying Outliers: Box plots make it easy to spot outliers. Outliers could indicate data errors or valuable insights, like a product that sells exceptionally well or poorly.
  • Assessing Variability: Histograms and box plots help in quickly visualizing data spread, giving insight into whether values cluster tightly around the mean or are widely dispersed.

By combining histograms, box plots, and an understanding of skewness and kurtosis, we gain a clearer view of data characteristics, which aids in further analysis and informs decision-making.



23.6 Testing for Normality

Why are you testing your data for normality? For large sample sizes the normality tests often give a meaningful answer to a meaningless question (for small samples they give a meaningless answer to a meaningful question)
– Greg Snow, R-Help, 21 Feb 2014

The normal distribution is one of the most commonly encountered probability distributions in statistics. Often called a “bell curve,” the normal distribution is symmetric, with most values clustering around the mean and tapering off as they move away. This distribution is particularly useful as a benchmark for identifying patterns and detecting outliers.

Properties of a Normal Distribution

A normal distribution has several key properties:

  1. Symmetry: It’s perfectly symmetric around the mean, meaning values are equally likely to be higher or lower than the mean.
  2. Mean, Median, and Mode Alignment: In a normal distribution, the mean, median, and mode all lie at the center.
  3. Standard Deviations and Proportion of Data: The data follows a predictable pattern:
    • Approximately 68% of values fall within 1 standard deviation of the mean.
    • Approximately 95% of values fall within 2 standard deviations.
    • Nearly all values (99.7%) fall within 3 standard deviations.
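We can confirm these proportions directly in R with pnorm(), the cumulative distribution function of the normal distribution:

# Proportion of values within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)  # ~0.683
pnorm(2) - pnorm(-2)  # ~0.954
pnorm(3) - pnorm(-3)  # ~0.997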

Let’s visualize a normal distribution to see these properties in action.
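A minimal sketch of such a curve, using ggplot2’s stat_function() to draw the standard normal density:

# Classic bell curve of the standard normal distribution
ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm, color = "royalblue") +
  labs(title = "A Normal Distribution",
       x = "Value (standard deviations from the mean)",
       y = "Density") +
  theme_minimal()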

This plot shows a classic bell-shaped curve, highlighting the central clustering and symmetry of a normal distribution.

Real-World Contexts for Normal Distributions

Normal distributions are frequently observed in various real-world contexts:

  • Human Characteristics: Many biological traits, like height and blood pressure, are approximately normally distributed.
  • Measurement Errors: In scientific and engineering measurements, errors often follow a normal distribution due to random variations.
  • Business and Economics: Some financial metrics, like daily stock returns (under certain conditions), are assumed to be normally distributed for modeling and forecasting.

These contexts rely on the normal distribution to create reliable benchmarks, allowing us to predict outcomes and assess what’s typical or unusual within a dataset.

Why the Normal Distribution Matters

The normal distribution helps us make sense of what’s typical in our data and identify values that stand out. By using Z-scores, we can quickly assess whether a value is within an expected range or if it’s unusually high or low. This can be invaluable for detecting anomalies, setting benchmarks, and making decisions across various fields.

Takeaway: When a dataset is normally distributed, it’s easier to interpret central tendency and variability. We can confidently use Z-scores to detect outliers and make informed decisions about unusual data points. Understanding the normal distribution’s properties gives us a powerful tool for data analysis and decision-making.

Testing for Normality

Once we understand the concept of a normal distribution, it’s often important to test whether our data approximates this distribution. Many statistical methods assume normality, and knowing whether data meets this assumption can guide our approach to analysis.

Visual Methods

  1. Histogram: Plotting the data as a histogram and comparing it visually to a bell curve is a straightforward way to check for normality. If the histogram approximates a symmetric, bell-shaped curve, the data may be normally distributed.

  2. Q-Q Plot (Quantile-Quantile Plot): A Q-Q plot compares the quantiles of our data to the quantiles of a theoretical normal distribution. If the data is normally distributed, points in a Q-Q plot will approximately follow a straight line.

# Plotting a Q-Q plot
ggplot(normal_data, aes(sample = value)) +
  stat_qq(color = "blue") +
  stat_qq_line(color = "red", linetype = "dashed") +
  labs(title = "Q-Q Plot for Normality Check", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

In this plot, if the points closely follow the dashed line, the data can be considered approximately normal. Deviations from this line indicate departures from normality.

Statistical Tests

  1. Shapiro-Wilk Test: The Shapiro-Wilk test is a common statistical test for normality. It produces a p-value that helps us decide whether to reject the null hypothesis (that the data is normally distributed). A small p-value (typically < 0.05) indicates that the data is unlikely to be normally distributed.
# Shapiro-Wilk test for normality
shapiro_test <- shapiro.test(normal_data$value)
shapiro_test

    Shapiro-Wilk normality test

data:  normal_data$value
W = 0.99882, p-value = 0.767

In the output, the test’s p-value tells us whether the data significantly deviates from a normal distribution. A p-value greater than 0.05 suggests that the data may approximate a normal distribution. Since the p-value of 0.767 is much greater than 0.05, we can be more confident that the variable is approximately normally distributed.

Note: Visual methods provide a general sense of normality, while statistical tests offer more rigor. However, keep in mind that large sample sizes can make tests overly sensitive, rejecting normality for minor deviations.

Exercise: Testing for Normality with Statistics and Visualization

Try it yourself:


Perform the Q-Q plot and Shapiro-Wilk tests to determine whether the age and/or spending variables in UrbanFind’s customer_data are normally distributed. Which test is more appropriate?

Hint 1

  1. Plot the Q-Q plot for both variables.
  2. Calculate the Shapiro-Wilk test statistic

Hint 2

  1. For the Q-Q plot:
ggplot(data, aes(sample = variable)) + stat_qq() + stat_qq_line()
  2. For the Shapiro-Wilk test statistic:
shapiro.test(data$variable)

Fully worked solution:

ggplot(customer_data, aes(sample = age)) + stat_qq() + stat_qq_line()
ggplot(customer_data, aes(sample = spending)) + stat_qq() + stat_qq_line()

shapiro.test(customer_data$age)
shapiro.test(customer_data$spending)

  1. Plot the Q-Q plots and check for deviations from the normal reference line.
  2. Calculate the Shapiro-Wilk test and compare the p-value to 0.05.
  • Which variables may not be normally distributed?
  • Which test seems more appropriate to test for normality in these variables?

When to Test for Normality

Testing for normality is particularly relevant when:

  • We plan to use parametric tests that assume normality, such as t-tests or ANOVA.
  • We want to identify and interpret outliers accurately.
  • Our goal is to understand the overall shape and spread of data, especially in comparison to a benchmark.

Summary: Testing for normality ensures we’re using the right statistical methods and assumptions. Combining visual inspection with statistical tests gives a well-rounded perspective on the normality of our data.

By assessing normality before further analysis, we set up our workflow with greater confidence in the reliability of our conclusions.

23.7 Wrapping Up: The Role of Univariate Analysis in Business Analytics

In business analytics, understanding each variable individually provides the foundation for more complex analysis. Univariate analysis not only allows us to gain insights into data distributions and identify potential issues like outliers, but it also sets the stage for exploring relationships between variables.

Key Takeaways

  • Descriptive Statistics: Measures of central tendency and spread reveal typical values and the variability in data, helping us summarize large datasets quickly. Knowing the mean, median, and standard deviation of variables like product prices or customer age provides a snapshot that can inform pricing strategies, target markets, and resource allocation.

  • Distribution Shapes and Visualization: Histograms and box plots give us a visual sense of how data is distributed. Detecting skewness or unusual shapes helps us understand customer segments or product lines, allowing for more tailored marketing and inventory decisions.

  • Normality and Outliers: By identifying whether data follows a normal distribution, we can decide when to apply statistical techniques that rely on this assumption. Detecting outliers allows us to spot anomalies, which could indicate data errors or highlight exceptional cases worth further investigation—like products that underperform or customers with unusual purchasing patterns.

Why This Matters for Business Decisions

Univariate analysis provides the initial, crucial insights that business leaders need. Each of these concepts—mean, median, standard deviation, skewness, and outliers—contributes to more informed decision-making. Whether setting realistic sales targets, evaluating customer demographics, or pricing products, univariate insights offer the clarity needed to make effective choices. EDA matters for practical reasons:

  • Data Quality Assurance: Checking distribution properties like central tendency, spread, and skewness helps ensure data reliability. For example, detecting abnormality, such as extreme skew or high kurtosis, can reveal data quality issues or outliers, prompting further investigation before analysis.

  • Appropriate Modeling Choices: Many statistical models assume data normality. For instance, linear regression models tend to perform best when data approximates a normal distribution. Knowing whether data is normal or not allows you to choose or transform the data appropriately (e.g., using log transformations on skewed data; see the sketch after this list).

  • Effective Comparison and Interpretation: Descriptive statistics (mean, median, mode, variance) allow for quick data comparisons across different groups. In business analytics, this could mean understanding customer spending habits, employee performance across departments, or differences in product popularity.

  • Setting Benchmarks and Thresholds: In cases like quality control or risk analysis, knowing the natural variability in a data set lets you set realistic benchmarks. Identifying and analyzing the “normal” range helps to spot significant deviations, which might indicate issues or opportunities.

  • Anomaly Detection and Risk Identification: Abnormal distributions can indicate underlying risks or rare events that might skew typical results, such as a heavy-tailed distribution indicating potential risk in financial data. Identifying these traits early can guide more robust analyses that account for potential volatility or risk.

  • Customer Insights and Market Segmentation: In marketing analytics, for example, distribution patterns in customer demographics or purchase behaviors can reveal distinct market segments. Insights into the “shape” of this data help tailor products and campaigns to match customer profiles more closely.
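As an example of the transformation mentioned above, a one-line sketch (assuming a right-skewed spending column with strictly positive values):

# Log-transform a right-skewed variable to pull in its long right tail
customer_data$log_spending <- log(customer_data$spending)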

Looking Ahead: As we move into bivariate analysis, where we examine relationships between two variables, these foundational skills will help us understand how different factors interact. For example, understanding individual distributions prepares us to explore questions like: “How does price relate to sales volume?” or “What is the relationship between customer age and spending behavior?”