# A tibble: 100 × 4
     age spending product_interest region
   &lt;dbl&gt;    &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;
 1    33      495 Fashion          East
 2    18      458 Fashion          North
 3    32      491 Health           South
 4    30      420 Fashion          South
 5    85      664 Fashion          East
 6    35      533 Fashion          East
 7    31      526 Health           South
 8    14      350 Fashion          South
 9    24      471 Health           East
10    NA      424 Fashion          South
# ℹ 90 more rows
23 Univariate EDA
Exploratory Data Analysis of Individual Variables
Univariate Exploratory Data Analysis (EDA) focuses on examining one variable at a time. By understanding each variable individually, we gain valuable insights that lay the groundwork for analyzing relationships and building predictive models. In the context of entrepreneurship, a thorough exploration of each variable provides clarity on customer demographics, financial projections, or product feedback, which is essential for making informed business decisions.
23.1 Why Univariate Analysis Matters
In analytics, understanding one variable at a time helps us:
Validate Data Quality: Spot issues like outliers or missing values that could distort your analysis. For example, identifying outliers in customer spending data helps spot potential errors or unique customer segments worth investigating.
Identify Patterns: Observe distributions (e.g., age distribution) to understand customer demographics or product preferences. For instance, noticing a peak in customer age distribution can reveal a dominant demographic, informing targeted marketing efforts.
Guide Future Analysis: Set a foundation for examining relationships between variables and building models that predict customer behavior or business performance. For example, understanding the spending range in a single customer segment can guide pricing strategies and promotional campaigns.
By mastering univariate techniques, you’ll be able to confidently analyze single variables in your data, whether you’re studying customer demographics, sales figures, or product feedback. These foundational skills will prepare you to make data-driven decisions as an entrepreneur.
Key Learning Objectives
- Analyze single-variable data to assess quality, detect patterns, and inform decisions in business analytics.
- Employ univariate techniques to describe and interpret data distributions in real-world business contexts.
- Use R to apply fundamental univariate methods, including descriptive statistics and visualizations.
In the next sections, we’ll dive deeper into specific univariate techniques, using data from a hypothetical company, UrbanFind, to illustrate these foundational skills.
23.2 Demonstration Data: UrbanFind
Consider UrbanFind, a startup that specializes in curating personalized recommendations for city dwellers in several areas of their lives:
Tech Gadgets: Recommendations for the latest gadgets and devices that enhance convenience and connectivity in a fast-paced city life, such as smart home devices, wearable tech, and productivity tools.
Fashion: Curated fashion items and accessories that align with urban styles and seasonal trends, helping city dwellers look their best in a competitive, image-conscious environment.
Outdoor Activities: Gear and suggestions for outdoor activities that are accessible even in or near urban settings—like urban hiking, weekend getaways, and fitness equipment for both outdoor and indoor use.
Health and Wellness Products: Products focused on personal well-being, including fitness equipment, nutritional supplements, and relaxation tools to counterbalance the stresses of urban life.
These recommendations aim to provide city residents with tailored options that fit their lifestyle and preferences, whether they’re looking to upgrade their tech, update their wardrobe, stay active, or improve their wellness. By analyzing customer data, UrbanFind can better understand which areas resonate most with their audience and refine their product offerings and marketing strategies accordingly.
By examining single variables—like customer age, income level, or product rating—UrbanFind can answer foundational questions:
- Who is the customer?
- What price range can they afford?
- How satisfied are they with existing products?
These insights, while simple, guide strategic decisions and set the stage for deeper analysis.
Variables in UrbanFind’s Data
UrbanFind conducted a survey to gather insights into customer demographics, spending habits, and interests. The dataset we’re working with contains responses from 100 survey participants who are representative of UrbanFind’s potential customer base. Each row is an observation, representing the responses of one unique respondent, with the following variables captured:
Age: The age of the customer in years. Age is an important demographic factor for UrbanFind, as different age groups may have distinct preferences for technology, fashion, or outdoor activities.
Spending: The amount (in dollars) each customer reported spending on lifestyle-related products in the past month. This includes items like tech gadgets, health products, and outdoor gear. UrbanFind aims to understand the range of spending to help design product bundles and set price points.
Product Interest: The product category the customer is most interested in, chosen from four options: Tech, Fashion, Outdoors, and Health. This helps UrbanFind determine which product lines to prioritize for marketing and inventory.
Region: The geographic region where each customer lives, categorized into North, South, East, and West. This variable provides insights into potential regional differences in product preferences and spending behaviors.
Each of these variables gives us a unique lens through which to view the customer base. By examining them individually, we gain insights that will inform how UrbanFind can tailor its offerings to meet customer needs.
Viewing the UrbanFind Dataset
Here’s a preview of the `customer_data` dataset. Notice how the values of each variable vary across observations. In other words, `age`, `spending`, `product_interest`, and `region` are all variables that provide different types of information.
23.3 Visualizing Distributions
The greatest value of a picture is when it forces us to notice what we never expected to see. — John Tukey
In exploratory data analysis, understanding the distribution of a variable is essential, but we rarely know what this distribution looks like until we visualize it. By using visualization tools in R, such as those available in `ggplot2`, we can gain simple yet powerful insights into data patterns, including clustering, spread, and unusual values. Visualization helps reveal whether data tends to follow a normal distribution, has skewness, or shows other unique characteristics.
Histograms for Continuous Variables
A histogram shows the frequency of values within specified ranges (or “bins”) along the x-axis, making it ideal for visualizing the shape of the data. Histograms allow us to observe clustering patterns, skewness, and whether the distribution has one peak or multiple peaks.
For example, if we were examining a variable representing the price of items in a store, a histogram could reveal if prices tend to cluster around certain points or if there’s a broader spread. In the following demonstration, you’ll see how histograms effectively depict the distribution of values within a dataset.
In this demonstration, the histogram shows the frequency distribution of our sample variable, giving insight into how values are spread and where they tend to cluster.
Box Plots for Summarizing Descriptive Statistics
A box plot provides a summary of data based on quartiles, helping us visualize the spread, center, and outliers within a dataset. The box represents the interquartile range (IQR), covering the middle 50% of the data. The whiskers extend to the minimum and maximum values within 1.5 times the IQR, while any points beyond the whiskers are displayed as individual outliers.
This annotated box plot shows the key components, helping us understand the distribution’s spread and identify any potential outliers.
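As a minimal sketch of this idea (reusing the `spending` variable from UrbanFind’s `customer_data`, introduced earlier in this chapter), a box plot needs only one mapped aesthetic:

```r
# Minimal box plot of monthly customer spending
ggplot(customer_data, aes(y = spending)) +
  geom_boxplot() +
  labs(title = "Box Plot of Customer Spending",
       y = "Spending ($)") +
  theme_minimal()
```

The box spans the IQR, the line inside it marks the median, and any points drawn beyond the whiskers are the potential outliers described above.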
Bar Plots for Categorical Variables
While histograms are ideal for visualizing the distribution of continuous variables, bar plots are the go-to tool for visualizing categorical variables. A categorical variable represents distinct categories, with each observation falling into one category. Examples include gender, product type, or region.
In a bar plot, each bar represents a category, and the height of the bar shows the count (or percentage) of observations in that category. This visualization provides insights into the relative frequency of each category, allowing us to compare categories at a glance.
   Clothing Electronics       Games   Groceries      Health     Outdoor
         25          45          38          89          44          23
In this bar plot, each bar’s height shows the number of purchases in each category. The plot highlights which purchase categories are most popular, helping us identify potential areas for business expansion or targeted marketing.
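A bar plot of counts like these can be sketched with `geom_bar()`. Here we assume a hypothetical data frame named `purchases` with a `category` column containing the values tabulated above:

```r
# Bar plot of purchase counts by category
# (`purchases` and its `category` column are assumed for illustration)
ggplot(purchases, aes(x = category)) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(title = "Purchases by Category",
       x = "Category",
       y = "Count") +
  theme_minimal()
```

Note that `geom_bar()` counts observations for us; if we already had a column of counts, `geom_col()` would map those counts directly to bar heights.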
Demonstration: Exploring Unknown Distributions
When we encounter new data, we often don’t know the underlying distribution of each variable. Visualizations like histograms and box plots make it possible to quickly uncover patterns, shapes, and unique characteristics of the data that may not be obvious from just inspecting raw values.
In this demonstration, we’ll use a dataset named `purchase_data` that contains three variables with distinct, initially unknown distributions: `product_price`, `customer_age`, and `purchase_frequency`. We’ll first inspect the dataset to see the raw data and then visualize each variable individually to reveal its shape.
# A tibble: 6 × 3
  product_price customer_age purchase_frequency
          &lt;dbl&gt;        &lt;dbl&gt;              &lt;dbl&gt;
1          63.7         20.4               43.3
2          44.4         20.8               44.7
3          53.6         27.1               35.3
4          56.3         26.2               38.8
5          54.0         33.8               33.4
6          48.9         21.3               44.2
The `head()` of the dataset gives a preview of raw values but doesn’t reveal much about the distributions. Since the distributions of `product_price`, `customer_age`, and `purchase_frequency` are unknown, let’s begin our exploration of the data by visualizing the variables with histograms.
Product Price
To plot the histogram of `product_price`, we follow the grammar of graphics by:
- calling the `ggplot()` function,
- specifying the dataset `purchase_data`, and
- specifying the aesthetic mapping `x = product_price`.
To create a basic plot in `ggplot2`, you only need to specify the data and the mapping. For example, when creating histograms, bar charts, and box plots, we typically map only the `x` aesthetic to the variable of interest. While this demonstration includes custom colors for aesthetics, the simplest `ggplot` only requires you to declare the data and mapping.
# Plotting the normal distribution for product price
ggplot(purchase_data, aes(x = product_price)) +
geom_histogram(binwidth = 2,
fill = "royalblue",
color = "black") +
labs(title = "Distribution of Product Prices",
x = "Price",
y = "Frequency") +
theme_minimal()
Customer Age
Next, let’s visualize `customer_age` with the same grammar of graphics code as we did in Section 23.3.4.1.
# Plotting the unknown distribution for customer age
ggplot(purchase_data, aes(x = customer_age)) +
geom_histogram(binwidth = 5,
fill = "#7B3294",
color = "black") +
labs(title = "Distribution of Customer Ages",
x = "Customer Age",
y = "Frequency") +
theme_minimal()
For `customer_age`, we see a right-skewed distribution where most values are lower, with a few exceptionally high outliers (such as older customers).
Tip: When exploring a new distribution, it’s often helpful to try different binwidths in your histogram. Adjusting the binwidth can uncover finer details or highlight broader trends in the data, providing a clearer picture of its underlying shape. For example, with a binwidth of 10, the histogram of `customer_age` appears to peak at age = 0, losing one of the key attributes of a skewed distribution.
# Plotting the unknown distribution for customer age
ggplot(purchase_data, aes(x = customer_age)) +
geom_histogram(binwidth = 10,
fill = "#7B3294",
color = "black") +
labs(title = "Distribution of Customer Ages",
x = "Customer Age",
y = "Frequency") +
theme_minimal()
Purchase Frequency
Finally, we’ll plot `purchase_frequency` as we did for the previous two variables.
# Plotting the bimodal distribution for purchase frequency
ggplot(purchase_data, aes(x = purchase_frequency)) +
geom_histogram(binwidth = 2,
fill = "mediumseagreen",
color = "black") +
labs(title = "Distribution of Purchase Frequency",
x = "Purchase Frequency",
y = "Frequency (Count)") +
theme_minimal()
Exercise: Visualizing Distributions
Try it yourself:
In a dataset named `distribution_data`, a right-skewed variable named `skewed_variable` appears together with a bimodal variable named `bimodal_variable`. Visualize the distributions of these variables using histograms. Compare the shapes to what you would expect from each distribution type.
Hint 1
Build your histogram plot using the grammar of graphics by declaring the data, specifying the aesthetic mapping (a histogram maps a variable to the x-axis only), and calling the geometry (a histogram uses the `geom_histogram()` function). Note that you can specify the `binwidth` (the span of the x-axis covered by one bar of the histogram), `fill` (the color of the bars of the histogram), and `color` (the color of the borders of the bars).
Hint 2
For the skewed distribution:
- Call the `ggplot()` function
- Specify the data as `distribution_data`
- Map the aesthetic with `x = skewed_variable`
- Specify the geometry as `geom_histogram()`
- [optional] Specify `binwidth`, `fill`, and `color` as you like
For the bimodal distribution:
- Call the `ggplot()` function
- Specify the data as `distribution_data`
- Map the aesthetic with `x = bimodal_variable`
- Specify the geometry as `geom_histogram()`, for example `geom_histogram(binwidth = 1)`
- [optional] Specify `binwidth`, `fill`, and `color` as you like
Fully worked solution:
For the distribution of `skewed_variable`:
- Call the `ggplot()` function
- Specify the data as `distribution_data`
- Specify the aesthetic mapping as `x = skewed_variable`
- Specify the geometry as `geom_histogram()`
- [optional] Specify `binwidth`, `fill`, and `color` as you like

For the distribution of `bimodal_variable`:
- Call the `ggplot()` function
- Specify the data as `distribution_data`
- Specify the aesthetic mapping as `x = bimodal_variable`
- Specify the geometry as `geom_histogram()`
- [optional] Specify `binwidth`, `fill`, and `color` as you like
ggplot(distribution_data,
       aes(x = skewed_variable)) +
  geom_histogram(binwidth = 2,
                 fill = "royalblue",
                 color = "black")

ggplot(distribution_data,
       aes(x = bimodal_variable)) +
  geom_histogram(binwidth = 0.5,
                 fill = "royalblue",
                 color = "black")
In both plots we call the `ggplot()` function and specify `distribution_data`, map `skewed_variable` or `bimodal_variable` to the x-axis, and call the `geom_histogram()` function to get a histogram of each distribution. Optionally, we can also specify the `binwidth`, `fill`, `color`, or other aesthetics of `geom_histogram()`.
Data visualization is an essential tool in exploratory data analysis. Each type of plot provides unique insights, depending on whether the data is continuous or categorical. The following table outlines commonly used visualization types, along with their purposes and suitability for continuous or categorical data.
| Visualization Type | Description | Continuous | Categorical | Use Case |
|---|---|---|---|---|
| Histogram | Shows frequency distribution of values in bins | ✅ | ❌ | Distribution shape, outlier detection |
| Box Plot | Displays median, quartiles, and potential outliers | ✅ | ❌ | Spread, outliers |
| Density Plot | Smooth curve showing data density over a continuous range | ✅ | ❌ | Distribution shape |
| Bar Chart | Shows count or proportion of each category | ❌ | ✅ | Frequency of categorical values |
| QQ Plot | Plots data against a normal distribution | ✅ | ❌ | Normality check |
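Two of the plot types in this table — the density plot and the QQ plot — were not demonstrated above. As a sketch, both follow the same grammar of graphics, here reusing `purchase_data` from the earlier demonstration:

```r
# Density plot: a smoothed alternative to the histogram
ggplot(purchase_data, aes(x = product_price)) +
  geom_density(fill = "royalblue", alpha = 0.4) +
  labs(title = "Density of Product Prices",
       x = "Price", y = "Density") +
  theme_minimal()

# QQ plot: sample quantiles plotted against a normal distribution
ggplot(purchase_data, aes(sample = product_price)) +
  stat_qq() +
  stat_qq_line() +
  theme_minimal()
```

If `product_price` is approximately normal, the QQ plot’s points will fall close to the reference line drawn by `stat_qq_line()`.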
23.4 Descriptive Statistics
Numerical quantities focus on expected values, graphical summaries on unexpected values. — John Tukey
Descriptive statistics provide a concise summary of your data and offer insights into its distribution, central tendencies, and variability. We’ll start with R’s `summary()` function to get a high-level overview, then calculate more specific measures with individual functions. In this section, we’ll calculate these key statistics to get a comprehensive understanding of the `customer_data` dataset from UrbanFind introduced in Section 23.2.
Summary Statistics with summary()
R’s `summary()` function provides a quick overview of key statistics, including measures of central tendency and spread (e.g., mean, median, and range). This overview often reveals initial insights, including potential outliers and data skewness.
# Summary statistics for customer_data
summary(customer_data)
      age           spending      product_interest     region
 Min.   :14.00   Min.   : 139.0   Length:100         Length:100
 1st Qu.:29.00   1st Qu.: 431.2   Class :character   Class :character
 Median :37.00   Median : 529.0   Mode  :character   Mode  :character
 Mean   :36.81   Mean   : 543.8
 3rd Qu.:43.00   3rd Qu.: 627.8
 Max.   :90.00   Max.   :1600.0
 NA's   :3       NA's   :2
The `summary()` function outputs the mean and median for continuous variables like `age` and `spending` and also shows the minimum and maximum values (range). However, it doesn’t report the standard deviation or interquartile range (IQR) directly, so we’ll calculate those separately.
From the `summary()`, we see that the mean age is 36.81 and the median age is 37, which are close in value. Likewise, the mean spending of $543.83 and the median spending of $529 are close in value.
Interpretation: When the mean and median are similar, as seen here for both customer `age` and `spending`, it suggests a balanced, symmetrical distribution with minimal skew. To determine how closely values cluster around the average, we also need to examine the range and standard deviation.
Measures of Central Tendency
Central tendency indicates the “typical” or “average” values within the data, helping us understand what’s most representative. Let’s calculate these measures with R functions to see how they differ and when each is most useful.
Mean
The mean is the average value, which is useful for understanding overall levels. However, it can be influenced by outliers.
# Calculate mean for age and spending
mean(customer_data$age, na.rm = TRUE)
[1] 36.81443
mean(customer_data$spending, na.rm = TRUE)
[1] 543.8265
Median
The median represents the middle value, which can be more informative than the mean for skewed distributions, as it’s less affected by outliers.
# Calculate median for age and spending
median(customer_data$age, na.rm = TRUE)
[1] 37
median(customer_data$spending, na.rm = TRUE)
[1] 529
Mode
The mode is the most frequently occurring value and is especially useful for categorical data. For example, let’s calculate the mode for `product_interest` in `customer_data`.
# Calculate mode for product_interest
table(customer_data$product_interest)
 Fashion   Health Outdoors     Tech
      25       24       27       22
table(customer_data$product_interest) |> which.max()
Outdoors
       3
Here `which.max()` returns the position of the most frequent category (3), labeled with its name: the mode of `product_interest` is Outdoors.
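If we want just the name of the modal category as a character string, we can pipe the result of the code above through `names()`:

```r
# Extract the name of the most frequent product_interest category
customer_data$product_interest |>
  table() |>
  which.max() |>
  names()
```

Based on the counts shown above, this returns "Outdoors".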
Measures of Spread
While measures of central tendency (like mean and median) give us an idea of typical values, measures of spread reveal how much the data varies around those central values. Understanding the spread is essential for interpreting data patterns, as it tells us whether values are tightly clustered around the mean or widely dispersed. For instance, high variability suggests diverse customer profiles, while low variability indicates uniformity.
Here, we’ll explore three key measures of spread: range, standard deviation, and variance. Each provides a unique perspective on data variability.
Range
The range is the difference between the maximum and minimum values. It’s simple but can be affected by outliers.
# Calculate range for age and spending
range(customer_data$age, na.rm = TRUE)
[1] 14 90
range(customer_data$spending, na.rm = TRUE)
[1] 139 1600
In `customer_data`, the range of `age` is from 14 to 90, while the range of `spending` spans from $139 to $1600. These ranges show the full spectrum of ages and spending amounts but don’t tell us how common values are within each span.
Interquartile Range (IQR)
The IQR measures the spread of the middle 50% of values and is particularly useful for understanding spread in skewed data. It’s calculated as the difference between the 75th percentile (75% of values fall below this point) and 25th percentile (25% of values fall below this point):
\[ \mathsf{IQR = Q3 - Q1}.\]
# Calculate IQR for age and spending
IQR(customer_data$age, na.rm = TRUE)
[1] 14
IQR(customer_data$spending, na.rm = TRUE)
[1] 196.5
The IQR of `age` is 14, meaning the middle 50% of ages span a 14-year window. For `spending`, the IQR is $196.50.
The IQR is especially useful in box plots, where it represents the range of the central box.
Standard Deviation (SD)
The standard deviation measures the average distance of each value from the mean. A low SD indicates that values are clustered near the mean, while a high SD suggests more variability. Standard deviation is useful for interpreting consistency in data.
# Calculate standard deviation for age and spending
sd(customer_data$age, na.rm = TRUE)
[1] 13.2534
sd(customer_data$spending, na.rm = TRUE)
[1] 207.0577
The standard deviation for customer `age` is 13.25, and for `spending`, it’s $207.06. These values show how much age and spending typically vary from their respective means.
Interpretation: Measures of spread are crucial for understanding data variability. For example, a low standard deviation in customer ages could imply a relatively homogenous customer segment that can be reached through the same channels, while a higher standard deviation in spending might indicate diverse customer income levels or buying patterns. Understanding variability helps UrbanFind plan for advertising, inventory, and fluctuations in sales.
Exercise: Calculating Descriptive Statistics
Try it yourself:
The `sales_data` dataset represents a set of sales records for a fictional company named MetroMart, which operates across multiple regions. This dataset was created to demonstrate descriptive statistics and data exploration techniques, essential for understanding product performance and sales trends.
We’ll work with both continuous variables (e.g., `price`, `quantity_sold`) and categorical variables (e.g., `region`, `category`, and `promotion`).
Dataset Overview:
- Product ID: A unique identifier for each product (from 1 to 300).
- Price: The selling price of each product, normally distributed with a mean price of $20 and some variation to represent typical pricing diversity.
- Quantity Sold: The number of units sold, following a Poisson distribution to reflect typical purchase quantities.
- Region: The region where each product was sold, categorized into North, South, East, and West regions.
- Category: The product category, which can be one of four types: Electronics, Grocery, Clothing, or Home Goods. This variable helps us understand which product types tend to perform better in terms of sales volume and pricing across different regions.
- Promotion: Indicates whether the product was on promotion at the time of sale, with possible values of “Yes” or “No.” This variable allows us to analyze how promotional offers affect sales volume and may reveal seasonal or regional preferences for discounted products.
This provides a rich foundation for exploring key statistical concepts, such as central tendency and variability, while allowing us to analyze categorical differences. With these variables, we can gain insights into average prices, sales volume, and how factors like product type, regional differences, and promotions influence sales. These metrics can reveal patterns, outliers, and trends that are crucial for strategic decision-making in areas such as pricing, inventory management, and targeted marketing.
For MetroMart’s `sales_data`:

1. Calculate the measures of central tendency (mean and median) for the `price` and `quantity_sold` variables to understand the typical values for these sales metrics.
2. Calculate the mode of the categorical variables, such as `region`, to see where most purchases happen, and `category` to identify the most popular product type.
3. Calculate the spread of the `price` and `quantity_sold` variables to understand how they differ from their means, and analyze how `promotion` might influence these values.
Hint 1
- Calculate descriptive statistics for mean and median using the `summary()` function.
- Calculate the mode of the categorical variables using `table()`.
- Calculate descriptive statistics for spread (range, IQR, and standard deviation) using the appropriate R functions.
Hint 2
For descriptive statistics about central tendencies:
- Call the `summary()` function
- Specify the data as `sales_data`

For descriptive statistics about spread:
- Call the `range()` function for `sales_data` and the `price` and `quantity_sold` variables – be sure to remove `NA` values from the calculation
- Call the `IQR()` function for `sales_data` and the `price` and `quantity_sold` variables – be sure to remove `NA` values from the calculation
- Call the `sd()` function for `sales_data` and the `price` and `quantity_sold` variables – be sure to remove `NA` values from the calculation
Fully worked solution:
# Central tendency overview
summary(sales_data)

# Mode of the categorical variables
table(sales_data$region)
table(sales_data$category)
table(sales_data$promotion)

# Range of the continuous variables
range(sales_data$price, na.rm = TRUE)
range(sales_data$quantity_sold, na.rm = TRUE)

# Interquartile range
IQR(sales_data$price, na.rm = TRUE)
IQR(sales_data$quantity_sold, na.rm = TRUE)

# Standard deviation
sd(sales_data$price, na.rm = TRUE)
sd(sales_data$quantity_sold, na.rm = TRUE)
Here we call the `summary()` function for `sales_data`; the `table()` function for the categorical variables `region`, `category`, and `promotion`; and the `range()`, `IQR()`, and `sd()` functions for the `price` and `quantity_sold` variables, removing `NA` values from each calculation.
- Compare the mean and median to see if the data may be skewed.
- What does the spread of the data suggest about the diversity of prices? Of quantities sold?
In exploring any dataset, descriptive statistics provide a foundation for understanding key characteristics of the data. They allow us to summarize central tendencies, variations, and identify any unusual patterns that may be present. The table below outlines essential descriptive statistics techniques, each providing unique insights into different aspects of a dataset.
| Measure | Description | Continuous | Categorical |
|---|---|---|---|
| Central Tendency | Measures the “typical” value | ✅ (Mean, Median) | ✅ (Mode) |
| Variation | Measures data spread | ✅ (Range, SD, IQR) | ❌ |
| Skewness &amp; Kurtosis | Assesses data symmetry and tail heaviness | ✅ (Skewness, Kurtosis) | ❌ |
| Outlier Detection | Identifies unusual values using thresholds | ✅ (Z-score, IQR) | ❌ |
| Normality Tests | Tests if data follows a normal distribution | ✅ (Shapiro-Wilk) | ❌ |
23.5 Testing for Outliers
Whenever I see an outlier, I never know whether to throw it away or patent it. — Bert Gunter, R-Help, 9/14/2015
Outliers are data points that deviate significantly from other observations. They can indicate data entry errors, unusual values, or meaningful deviations within the data, and they are crucial to identify in EDA, as they can influence statistical analyses and skew insights. There are several methods to detect and assess outliers, with Z-scores being one of the most common when dealing with approximately normal data.
Testing for Outliers with Z-Scores
When data is approximately normal, we can use Z-scores to assess how far specific values deviate from the mean in terms of standard deviations.
Z-score: A Z-score tells us how many standard deviations an observed value (\(\mathsf{X}\)) is from the mean (\(\mathsf{\mu}\)):
\[ \mathsf{Z = \dfrac{X - \mu}{\sigma}} \]
where \(\mathsf{X}\) is the observed value, \(\mathsf{\mu}\) is the mean, and \(\mathsf{\sigma}\) is the standard deviation.
Generally:
- Values with Z-scores beyond ±2 may be considered unusual.
- Values beyond ±3 are often classified as outliers.
The following plot illustrates the concept of Z-scores on a normal distribution, with dashed lines at ±2 standard deviations and dotted lines at ±3 standard deviations to mark the bounds of “usual” and “outlier” values.
In this plot, the dashed red lines at ±2 standard deviations and the dotted red lines at ±3 standard deviations illustrate where values are likely to be considered unusual or outliers based on their Z-scores.
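The Z-score formula above can be checked by hand. Using the statistics computed for `customer_data` in Section 23.4 (mean age of 36.81, standard deviation of 13.25), the oldest customer works out to:

```r
# Z-score of the oldest customer (age 90),
# using the mean and SD from Section 23.4
z <- (90 - 36.81) / 13.25
z
# about 4.0 -- beyond +/-3, so age 90 would be flagged as an outlier
```

The same arithmetic applies to any value: subtract the mean, then divide by the standard deviation.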
Demonstration of the Z-score Test
To test for outliers with Z-scores in our `sales_data` dataset, we’ll calculate the Z-scores for the `price` variable and check for any values that are more than 3 standard deviations from the mean. Here’s how it’s done in R:
# Calculate Z-scores for price in sales_data
sales_data$z_score_price <- (sales_data$price - mean(sales_data$price, na.rm = TRUE)) /
  sd(sales_data$price, na.rm = TRUE)

# Filter potential outliers based on Z-scores
z_score_outliers <- sales_data %>% filter(abs(z_score_price) > 3)
z_score_outliers
[1] product_id price quantity_sold region z_score_price
<0 rows> (or 0-length row.names)
In this example, we check the `price` variable for outliers. If `z_score_outliers` returns zero rows, as it does here, we can conclude that there are no outliers in the `price` variable of `sales_data`.
Note: Z-scores assume a normal distribution. For non-normal distributions, consider other outlier detection methods, such as the IQR-based method, which we will cover next.
Testing for Outliers with Interquartile Range (IQR)
The IQR method is useful for data that is not normally distributed. This method defines outliers as values that fall below \(\mathsf{Q1 - 1.5 \cdot IQR}\) or above \(\mathsf{Q3 + 1.5 \cdot IQR}\).
Demonstration of the IQR Test
# Calculate quartiles and IQR for price
q1  <- quantile(sales_data$price, 0.25, na.rm = TRUE)
q3  <- quantile(sales_data$price, 0.75, na.rm = TRUE)
iqr <- q3 - q1

# Define lower and upper bounds for outliers
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Filter potential outliers based on IQR bounds
iqr_outliers <- sales_data %>% filter(price < lower_bound | price > upper_bound)
iqr_outliers
product_id price quantity_sold region z_score_price
1 156 5.88 22 North -2.990809
Using the IQR calculation, we can conclude that there is one outlier in the `price` variable from `sales_data`, namely the observation with `product_id` = 156. Notice that the Z-score for this observation is -2.99, meaning the Z-score test also came close to identifying this observation as an outlier.
The IQR method is especially effective for detecting outliers in skewed data, as it isn’t influenced by extreme values on either end.
Testing for Outliers with Box Plot Visualization
A box plot visually identifies outliers as individual points beyond the whiskers. In a standard box plot, whiskers extend up to 1.5 times the IQR. Points beyond this range are considered potential outliers.
Demonstration of the Box Plot Test
# Box plot to visually identify outliers in price
ggplot(sales_data, aes(x = "", y = price)) +
geom_boxplot(outlier.color = "red", outlier.shape = 16) +
labs(title = "Box Plot of Product Prices with Outliers Highlighted", y = "Price") +
theme_minimal()
This box plot clearly identifies a single value, the minimum value of `price`, as an outlier. Checking the data, we can verify that the lowest `price` in `sales_data` belongs to the observation with `product_id` = 156.
# Slice and display the observation (row) with the minimum price
sales_data |> slice_min(price)
product_id price quantity_sold region z_score_price
1 156 5.88 22 North -2.990809
Choosing an Outlier Detection Method
The best method depends on the data distribution and the context of analysis:
- Z-scores are suitable for normally distributed data.
- IQR is robust and works well with skewed data.
- Box plots provide a quick, visual approach for outlier detection.
Takeaway: Detecting outliers helps identify unusual or potentially erroneous data points. By carefully assessing outliers, we can refine data quality and uncover valuable insights for analysis and decision-making.
Each method offers different insights, making it beneficial to combine them for a comprehensive approach to outlier detection.
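As a sketch of that combined approach (the helper name flag_outliers and the toy vector below are hypothetical, not from the chapter's datasets), one function can apply both rules side by side:

```r
# Hypothetical helper: flag values by both the Z-score rule and the IQR rule
flag_outliers <- function(x) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  data.frame(
    value    = x,
    z_flag   = abs(z) > 3,
    iqr_flag = x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
  )
}

flag_outliers(c(10, 12, 13, 15, 14, 200))
```

In this tiny sample the IQR rule flags 200 while the Z-score rule does not, because the outlier itself inflates the standard deviation. This illustrates why the two tests can disagree and why checking both is worthwhile.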
Exercise: Testing for Outliers with Statistics and Visualization
Try it yourself:
Perform the Z-score, IQR, and box plot tests to determine whether the age and/or spending variables in UrbanFind’s customer_data have outliers. Which test is more appropriate?
Hint 1
- Calculate the Z-score and the IQR test score for both variables.
- Visualize the box plot for both variables.
Hint 2
For Z-scores:
- Calculate (age - mean(age)) / sd(age)
- Calculate (spending - mean(spending)) / sd(spending)
- Filter the data to identify cases where the absolute value of the Z-score is greater than 3
(variable - mean(variable)) / sd(variable)
data |> filter(abs(z_score) > 3)
For the IQR test
- Calculate the IQR test values Q1 - 1.5 * IQR and Q3 + 1.5 * IQR for age
- Calculate the IQR test values Q1 - 1.5 * IQR and Q3 + 1.5 * IQR for spending
- Filter the data to identify cases where the value of an observation is less than the lower bound or greater than the upper bound
# Calculate quantiles and IQR
q1 <- quantile(variable, 0.25)
q3 <- quantile(variable, 0.75)
iqr <- q3 - q1

# Define lower and upper bounds for outliers
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Filter potential outliers based on IQR bounds
data |> filter(variable < lower_bound | variable > upper_bound)
For the box plot test
- Plot the box plot for age and for spending
- Visually check for outliers
- Filter or slice to identify the observations with outliers
ggplot(data, aes(x = "", y = variable)) +
  geom_boxplot()

data |> slice_min(order_by = variable, n = number_of_outliers)
data |> slice_max(order_by = variable, n = number_of_outliers)
data |> filter(condition)
Fully worked solution:
customer_data$z_score_age <- (customer_data$age - mean(customer_data$age, na.rm = T)) / sd(customer_data$age, na.rm = T)  # <1>
customer_data |> filter(abs(z_score_age) > 3)
##
customer_data$z_score_spending <- (customer_data$spending - mean(customer_data$spending, na.rm = T)) / sd(customer_data$spending, na.rm = T)
customer_data |> filter(abs(z_score_spending) > 3)

q1 <- quantile(customer_data$age, 0.25, na.rm = T)  # <2>
q3 <- quantile(customer_data$age, 0.75, na.rm = T)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
customer_data %>% filter(age < lower_bound | age > upper_bound)
#
q1 <- quantile(customer_data$spending, 0.25, na.rm = T)
q3 <- quantile(customer_data$spending, 0.75, na.rm = T)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
customer_data %>% filter(spending < lower_bound | spending > upper_bound)

ggplot(customer_data, aes(x = "", y = age)) +  # <3>
  geom_boxplot(outlier.color = "red", outlier.shape = 16)
customer_data |> slice_max(order_by = age, n = 2)
#
ggplot(customer_data, aes(x = "", y = spending)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16)
customer_data |> slice_max(order_by = spending, n = 2)
customer_data |> slice_max(order_by = spending, n = 1)
1. Calculate the Z-score and filter for observations greater than 3 standard deviations.
2. Calculate the IQR upper and lower bounds and filter for observations above or below them.
3. Plot the box plot, visually check for outliers, and filter/slice the outliers.
- What observations may be outliers?
- Which test seems more appropriate to test for outliers in these variables?
Interpretation in Entrepreneurship
Understanding distribution shape allows us to make better data-driven decisions:
- Detecting Skew: Skewness tells us if a variable’s distribution leans towards higher or lower values. For instance, if product prices are right-skewed, most products are affordable, but a few are premium-priced.
- Identifying Outliers: Box plots make it easy to spot outliers. Outliers could indicate data errors or valuable insights, like a product that sells exceptionally well or poorly.
- Assessing Variability: Histograms and box plots help in quickly visualizing data spread, giving insight into whether values cluster tightly around the mean or are widely dispersed.
By combining histograms, box plots, and an understanding of skewness and kurtosis, we gain a clearer view of data characteristics, which aids in further analysis and informs decision-making.
23.6 Testing for Normality
Why are you testing your data for normality? For large sample sizes the normality tests often give a meaningful answer to a meaningless question (for small samples they give a meaningless answer to a meaningful question)– Greg Snow, R-Help, 21 Feb 2014
The normal distribution is one of the most commonly encountered probability distributions in statistics. Often called a “bell curve,” the normal distribution is symmetric, with most values clustering around the mean and tapering off as they move away. This distribution is particularly useful as a benchmark for identifying patterns and detecting outliers.
Properties of a Normal Distribution
A normal distribution has several key properties:
- Symmetry: It’s perfectly symmetric around the mean, meaning values are equally likely to be higher or lower than the mean.
- Mean, Median, and Mode Alignment: In a normal distribution, the mean, median, and mode all lie at the center.
- Standard Deviations and Proportion of Data: The data follows a predictable pattern:
- Approximately 68% of values fall within 1 standard deviation of the mean.
- Approximately 95% of values fall within 2 standard deviations.
- Nearly all values (99.7%) fall within 3 standard deviations.
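These percentages can be checked directly with R's pnorm() function, which returns the cumulative probability of a standard normal distribution:

```r
# Probability mass within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

round(within_k_sd(1), 3)  # 0.683
round(within_k_sd(2), 3)  # 0.954
round(within_k_sd(3), 3)  # 0.997
```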
Let’s visualize a normal distribution to see these properties in action.
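The rendered figure comes from a hidden code chunk; a minimal ggplot2 sketch that produces a comparable bell curve (the object names here are illustrative) is:

```r
library(ggplot2)

# Density of the standard normal distribution over +/- 4 standard deviations
ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm, color = "blue") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +  # +/- 1 sd marks
  labs(title = "Standard Normal Distribution",
       x = "Value", y = "Density") +
  theme_minimal()
```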
This plot shows a classic bell-shaped curve, highlighting the central clustering and symmetry of a normal distribution.
Real-World Contexts for Normal Distributions
Normal distributions are frequently observed in various real-world contexts:
- Human Characteristics: Many biological traits, like height and blood pressure, are approximately normally distributed.
- Measurement Errors: In scientific and engineering measurements, errors often follow a normal distribution due to random variations.
- Business and Economics: Some financial metrics, like daily stock returns (under certain conditions), are assumed to be normally distributed for modeling and forecasting.
These contexts rely on the normal distribution to create reliable benchmarks, allowing us to predict outcomes and assess what’s typical or unusual within a dataset.
Why the Normal Distribution Matters
The normal distribution helps us make sense of what’s typical in our data and identify values that stand out. By using Z-scores, we can quickly assess whether a value is within an expected range or if it’s unusually high or low. This can be invaluable for detecting anomalies, setting benchmarks, and making decisions across various fields.
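For example (with hypothetical numbers, not from any dataset in this chapter): suppose average customer spending is $500 with a standard deviation of $100, and one order comes in at $900.

```r
# Hypothetical benchmark: mean spending $500, sd $100; observed order $900
value <- 900
mu    <- 500
sigma <- 100

z <- (value - mu) / sigma
z                              # 4 standard deviations above the mean
pnorm(z, lower.tail = FALSE)   # upper-tail probability if spending were normal
```

Under a normal benchmark, a value four standard deviations above the mean has a tail probability of roughly 0.00003, so an order like this is well worth a closer look.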
Takeaway: When a dataset is normally distributed, it’s easier to interpret central tendency and variability. We can confidently use Z-scores to detect outliers and make informed decisions about unusual data points. Understanding the normal distribution’s properties gives us a powerful tool for data analysis and decision-making.
Testing for Normality
Once we understand the concept of a normal distribution, it’s often important to test whether our data approximates this distribution. Many statistical methods assume normality, and knowing whether data meets this assumption can guide our approach to analysis.
Visual Methods
Histogram: Plotting the data as a histogram and comparing it visually to a bell curve is a straightforward way to check for normality. If the histogram approximates a symmetric, bell-shaped curve, the data may be normally distributed.
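A sketch of that comparison overlays a fitted normal curve on a histogram. Here normal_data is simulated as a stand-in for the chapter's object of the same name:

```r
library(ggplot2)

set.seed(123)
normal_data <- data.frame(value = rnorm(1000))  # stand-in for the chapter's data

ggplot(normal_data, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "lightblue", color = "white") +
  stat_function(fun = dnorm,
                args = list(mean = mean(normal_data$value),
                            sd   = sd(normal_data$value)),
                color = "red") +
  labs(title = "Histogram with Normal Curve Overlay",
       x = "Value", y = "Density") +
  theme_minimal()
```

If the bars track the red curve closely, the data may be approximately normal; systematic gaps or lopsided tails suggest otherwise.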
Q-Q Plot (Quantile-Quantile Plot): A Q-Q plot compares the quantiles of our data to the quantiles of a theoretical normal distribution. If the data is normally distributed, points in a Q-Q plot will approximately follow a straight line.
# Plotting a Q-Q plot
ggplot(normal_data, aes(sample = value)) +
stat_qq(color = "blue") +
stat_qq_line(color = "red", linetype = "dashed") +
labs(title = "Q-Q Plot for Normality Check", x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_minimal()
In this plot, if the points closely follow the dashed line, the data can be considered approximately normal. Deviations from this line indicate departures from normality.
Statistical Tests
- Shapiro-Wilk Test: The Shapiro-Wilk test is a common statistical test for normality. It produces a p-value that helps us decide whether to reject the null hypothesis (that the data is normally distributed). A small p-value (typically < 0.05) indicates that the data is unlikely to be normally distributed.
# Shapiro-Wilk test for normality
shapiro_test <- shapiro.test(normal_data$value)
shapiro_test
Shapiro-Wilk normality test
data: normal_data$value
W = 0.99882, p-value = 0.767
In the output, the test’s p-value tells us whether the data significantly deviates from a normal distribution. A p-value greater than 0.05 suggests that the data may approximate a normal distribution. Since the p-value of 0.767 is much greater than 0.05, we have greater confidence that the variable is approximately normally distributed.
Note: Visual methods provide a general sense of normality, while statistical tests offer more rigor. However, keep in mind that large sample sizes can make tests overly sensitive, rejecting normality for minor deviations.
Exercise: Testing for Normality with Statistics and Visualization
Try it yourself:
Perform the Q-Q plot and Shapiro-Wilk tests to determine whether the age and/or spending variables in UrbanFind’s customer_data are normally distributed. Which test is more appropriate?
Hint 1
- Plot the Q-Q plot for both variables.
- Calculate the Shapiro-Wilk test statistic
Hint 2
- For the Q-Q plot: ggplot(data, aes(sample = variable)) + stat_qq() + stat_qq_line()
- For the Shapiro-Wilk test statistic: shapiro.test(data$variable)
Fully worked solution:
ggplot(customer_data, aes(sample = age)) + stat_qq() + stat_qq_line()  # <1>
ggplot(customer_data, aes(sample = spending)) + stat_qq() + stat_qq_line()
#
shapiro.test(customer_data$age)  # <2>
shapiro.test(customer_data$spending)
1. Plot the Q-Q plot and check for deviations from the normal line.
2. Calculate the Shapiro-Wilk test and compare the p-value to 0.05.
- Which variables may not be normally distributed?
- Which test seems more appropriate to test for normality in these variables?
When to Test for Normality
Testing for normality is particularly relevant when:
- We plan to use parametric tests that assume normality, such as t-tests or ANOVA.
- We want to identify and interpret outliers accurately.
- Our goal is to understand the overall shape and spread of data, especially in comparison to a benchmark.
Summary: Testing for normality ensures we’re using the right statistical methods and assumptions. Combining visual inspection with statistical tests gives a well-rounded perspective on the normality of our data.
By assessing normality before further analysis, we set up our workflow with greater confidence in the reliability of our conclusions.
23.7 Wrapping Up: The Role of Univariate Analysis in Business Analytics
In business analytics, understanding each variable individually provides the foundation for more complex analysis. Univariate analysis not only allows us to gain insights into data distributions and identify potential issues like outliers, but it also sets the stage for exploring relationships between variables.
Key Takeaways
Descriptive Statistics: Measures of central tendency and spread reveal typical values and the variability in data, helping us summarize large datasets quickly. Knowing the mean, median, and standard deviation of variables like product prices or customer age provides a snapshot that can inform pricing strategies, target markets, and resource allocation.
Distribution Shapes and Visualization: Histograms and box plots give us a visual sense of how data is distributed. Detecting skewness or unusual shapes helps us understand customer segments or product lines, allowing for more tailored marketing and inventory decisions.
Normality and Outliers: By identifying whether data follows a normal distribution, we can decide when to apply statistical techniques that rely on this assumption. Detecting outliers allows us to spot anomalies, which could indicate data errors or highlight exceptional cases worth further investigation—like products that underperform or customers with unusual purchasing patterns.
Why This Matters for Business Decisions
Univariate analysis provides the initial, crucial insights that business leaders need. Each of these concepts—mean, median, standard deviation, skewness, and outliers—contributes to more informed decision-making. Whether setting realistic sales targets, evaluating customer demographics, or pricing products, univariate insights offer the clarity needed to make effective choices. EDA matters for practical reasons:
Data Quality Assurance: Checking distribution properties like central tendency, spread, and skewness helps ensure data reliability. For example, detecting abnormality, such as extreme skew or high kurtosis, can reveal data quality issues or outliers, prompting further investigation before analysis.
Appropriate Modeling Choices: Many statistical models assume data normality. For instance, linear regression models tend to perform best when data approximates a normal distribution. Knowing whether data is normal or not allows you to choose or transform the data appropriately (e.g., using log transformations on skewed data).
Effective Comparison and Interpretation: Descriptive statistics (mean, median, mode, variance) allow for quick data comparisons across different groups. In business analytics, this could mean understanding customer spending habits, employee performance across departments, or differences in product popularity.
Setting Benchmarks and Thresholds: In cases like quality control or risk analysis, knowing the natural variability in a data set lets you set realistic benchmarks. Identifying and analyzing the “normal” range helps to spot significant deviations, which might indicate issues or opportunities.
Anomaly Detection and Risk Identification: Abnormal distributions can indicate underlying risks or rare events that might skew typical results, such as a heavy-tailed distribution indicating potential risk in financial data. Identifying these traits early can guide more robust analyses that account for potential volatility or risk.
Customer Insights and Market Segmentation: In marketing analytics, for example, distribution patterns in customer demographics or purchase behaviors can reveal distinct market segments. Insights into the “shape” of this data help tailor products and campaigns to match customer profiles more closely.
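The log-transformation idea mentioned under Appropriate Modeling Choices can be sketched with simulated data (the revenue vector and skewness helper below are illustrative, not from the chapter's datasets):

```r
# Simulated right-skewed revenue data
set.seed(42)
revenue <- rlnorm(1000, meanlog = 4, sdlog = 1)

# Sample skewness: the third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(revenue)       # strongly positive: long right tail
skewness(log(revenue))  # close to zero: roughly symmetric after the transform
```

After taking logs the distribution is approximately symmetric, so methods that assume normality become more defensible.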
Looking Ahead: As we move into bivariate analysis, where we examine relationships between two variables, these foundational skills will help us understand how different factors interact. For example, understanding individual distributions prepares us to explore questions like: “How does price relate to sales volume?” or “What is the relationship between customer age and spending behavior?”