23 Testing for Normality
23.1 Introduction
Normality is a cornerstone assumption in many statistical methods. Knowing whether your data follows a normal distribution helps guide the choice of analytical techniques, from descriptive summaries to advanced modeling. Testing for normality ensures that statistical assumptions are met or highlights when adjustments, such as transformations or non-parametric methods, are necessary.
In this chapter, we’ll explore:
- The properties of a normal distribution.
- Why normality matters for statistical analysis.
- How to assess normality visually and statistically.
- Practical guidance on interpreting results and choosing appropriate methods.
23.2 What Is a Normal Distribution?
The normal distribution, or “bell curve,” is a probability distribution commonly used in statistics. Its symmetrical shape and predictable properties make it an ideal benchmark for many analyses.
Key Properties
- Symmetry: The distribution is perfectly symmetric around the mean.
- Mean, Median, and Mode Alignment: All three measures of central tendency coincide at the center.
- Standard Deviations: As we saw in Figure 22.1, approximately:
  - 68% of data falls within 1 standard deviation of the mean.
  - 95% within 2 standard deviations.
  - 99.7% within 3 standard deviations.
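These percentages follow directly from the normal cumulative distribution function, and we can verify them in R with a quick sketch using the standard normal:
# Verify the 68-95-99.7 rule with the standard normal CDF
pnorm(1) - pnorm(-1)   # ~0.683
pnorm(2) - pnorm(-2)   # ~0.954
pnorm(3) - pnorm(-3)   # ~0.997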
Let’s visualize a normal distribution to see these properties in action.
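One way to draw such a curve (a minimal sketch using ggplot2 and the standard normal density):
# Plot the standard normal density curve
library(ggplot2)

ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm, color = "darkblue", linewidth = 1.5) +
  labs(title = "The Normal Distribution", x = "Value", y = "Density") +
  theme_minimal()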
This plot shows a classic bell-shaped curve, highlighting the central clustering and symmetry of a normal distribution.
23.3 Real-World Contexts for Normal Distributions
Normal distributions are frequently observed in various real-world contexts:
- Human Characteristics: Many biological traits, like height and blood pressure, are approximately normally distributed.
- Measurement Errors: In scientific and engineering measurements, errors often follow a normal distribution due to random variations.
- Business and Economics: Some financial metrics, like daily stock returns (under certain conditions), are assumed to be normally distributed for modeling and forecasting.
These contexts rely on the normal distribution to create reliable benchmarks, allowing us to predict outcomes and assess what’s typical or unusual within a dataset.
23.4 Why the Normal Distribution Matters
The normal distribution helps us make sense of what’s typical in our data and identify values that stand out. By using Z-scores, we can quickly assess whether a value is within an expected range or if it’s unusually high or low. This can be invaluable for detecting anomalies, setting benchmarks, and making decisions across various fields.
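For example, with a small made-up sample (the values here are assumed for illustration), Z-scores flag the observation that sits far from the rest:
# Hypothetical data: compute Z-scores and flag values beyond 2 SDs
x <- c(12, 15, 14, 13, 40, 14, 16)   # assumed example values
z <- (x - mean(x)) / sd(x)           # Z-score for each value
x[abs(z) > 2]                        # unusually high or low values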
Takeaway: When a dataset is normally distributed, it’s easier to interpret central tendency and variability. We can confidently use Z-scores to detect outliers and make informed decisions about unusual data points. Understanding the normal distribution’s properties gives us a powerful tool for data analysis and decision-making.
23.5 Testing for Normality with Visualization
Once we understand the concept of a normal distribution, it’s often important to test whether our data approximates this distribution. Many statistical methods assume normality, and knowing whether data meets this assumption can guide our approach to analysis.
Histogram
Plotting the data as a histogram and comparing it visually to a bell curve is a straightforward way to check for normality. If the histogram approximates a symmetric, bell-shaped curve, the data may be normally distributed.
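As a sketch, suppose normal_data holds simulated draws from a normal distribution (an assumption here; this is the object used in the examples that follow):
# Simulate data and compare its histogram to the fitted normal curve
library(ggplot2)

set.seed(123)
normal_data <- data.frame(value = rnorm(1000, mean = 50, sd = 10))

ggplot(normal_data, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 fill = "steelblue", color = "black") +
  stat_function(fun = dnorm,
                args = list(mean = mean(normal_data$value),
                            sd = sd(normal_data$value)),
                color = "red", linewidth = 1) +
  labs(title = "Histogram vs. Normal Curve", x = "Value", y = "Density") +
  theme_minimal()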
Q-Q Plot (Quantile-Quantile Plot)
A Q-Q plot compares the quantiles of our data to the quantiles of a theoretical normal distribution. If the data is normally distributed, points in a Q-Q plot will approximately follow a straight line.
# Plotting a Q-Q plot
ggplot(normal_data, aes(sample = value)) +
  stat_qq(color = "blue") +
  stat_qq_line(color = "red", linetype = "dashed") +
  labs(title = "Q-Q Plot for Normality Check",
       x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()
In this plot, if the points closely follow the dashed line, the data can be considered approximately normal. Deviations from this line indicate departures from normality.
In this particular plot, the points follow the dashed line closely except at the upper quantiles, where some deviation begins. The deviation appears small, so it is likely that value is normally distributed. To be sure, we can apply a statistical test to quantify our confidence in this conclusion.
23.6 Testing for Normality with Statistics
Shapiro-Wilk Test
The Shapiro-Wilk test is a common statistical test for normality. It produces a p-value that helps us decide whether to reject the null hypothesis (that the data is normally distributed). A small p-value (typically < 0.05) indicates that the data is unlikely to be normally distributed.
# Shapiro-Wilk test for normality
shapiro_test <- shapiro.test(normal_data$value)
shapiro_test
Shapiro-Wilk normality test
data: normal_data$value
W = 0.99882, p-value = 0.767
In the output, the test’s p-value tells us whether the data significantly deviates from a normal distribution. A p-value greater than 0.05 suggests that the data may approximate a normal distribution. Since the p-value of 0.767 is much greater than 0.05, we fail to reject the null hypothesis and have greater confidence that value is approximately normally distributed.
Note: Visual methods provide a general sense of normality, while statistical tests offer more rigor. However, keep in mind that large sample sizes can make tests overly sensitive, rejecting normality for minor deviations.
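To see this sensitivity in action, here is a small illustration (an assumed example: a mildly skewed Beta sample that looks bell-shaped but is not exactly normal):
# Large n makes the test sensitive to small departures from normality
set.seed(42)
nearly_normal <- rbeta(5000, shape1 = 5, shape2 = 4)  # mild skew
shapiro.test(nearly_normal)  # typically rejects despite a bell-like shape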
23.7 Demonstration: Testing Whether Liberty Ship Variables Are Normally Distributed
Demonstration Data
Let’s use the Liberty ships dataset from Section 18.2 to test the normality of the key variables in the dataset. Specifically, we’ll examine whether the Direct_Hours, Total_Production_Days, and Total_Cost variables follow a normal distribution.
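Assuming liberty_ship_data was loaded as in Section 18.2, printing it gives a quick look at the data:
# Inspect the Liberty ships dataset
liberty_ship_data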
# A tibble: 1,571 × 6
Unit Yard Way Direct_Hours Total_Production_Days Total_Cost
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Bethlehem 1 870870 244 2615849
2 2 Bethlehem 2 831745 249 2545125
3 3 Bethlehem 3 788406 222 2466811
4 4 Bethlehem 4 758934 233 2414978
5 5 Bethlehem 5 735197 220 2390643
6 6 Bethlehem 6 710342 227 2345051
7 8 Bethlehem 8 668785 217 2254490
8 9 Bethlehem 9 675662 196 2139564.
9 10 Bethlehem 10 652911 211 2221499.
10 11 Bethlehem 11 603625 229 2217642.
# ℹ 1,561 more rows
Normality Testing
Histogram
#| fig-cap: "Comparing a histogram of direct hours to produce a Liberty ship to a normal distribution"
# Plotting a histogram with a fitted normal curve overlaid
library(ggplot2)

direct_hours_plot <- ggplot(liberty_ship_data, aes(x = Direct_Hours)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 25000,
                 fill = "seagreen4", color = "black") +
  stat_function(fun = dnorm,
                args = list(mean = mean(liberty_ship_data$Direct_Hours),
                            sd = sd(liberty_ship_data$Direct_Hours)),
                color = "darkblue", linewidth = 1.5) +
  labs(title = "Histogram of Liberty Ship Production",
       x = "Direct Hours", y = "Density") +
  theme_minimal()

direct_hours_plot
production_days_plot <- ggplot(liberty_ship_data, aes(x = Total_Production_Days)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 10,
                 fill = "tomato", color = "black") +
  stat_function(fun = dnorm,
                args = list(mean = mean(liberty_ship_data$Total_Production_Days, na.rm = TRUE),
                            sd = sd(liberty_ship_data$Total_Production_Days, na.rm = TRUE)),
                color = "darkblue", linewidth = 1.5) +
  labs(title = "Histogram of Liberty Ship Production",
       x = "Total Production Days", y = "Density") +
  theme_minimal()

production_days_plot
total_cost_plot <- ggplot(liberty_ship_data, aes(x = Total_Cost)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 100000,
                 fill = "orangered3", color = "black") +
  stat_function(fun = dnorm,
                args = list(mean = mean(liberty_ship_data$Total_Cost, na.rm = TRUE),
                            sd = sd(liberty_ship_data$Total_Cost, na.rm = TRUE)),
                color = "darkblue", linewidth = 1.5) +
  labs(title = "Histogram of Liberty Ship Production",
       x = "Total Cost", y = "Density") +
  theme_minimal()

total_cost_plot
Q-Q Plot (Quantile-Quantile Plot)
We can also check each variable with a Q-Q plot, comparing its quantiles to those of a theoretical normal distribution. For example, for Direct_Hours:
# Plotting a Q-Q plot for Direct_Hours
ggplot(liberty_ship_data, aes(sample = Direct_Hours)) +
  stat_qq(color = "blue") +
  stat_qq_line(color = "red", linetype = "dashed") +
  labs(title = "Q-Q Plot for Direct Hours",
       x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()
The same check applies to Total_Production_Days and Total_Cost by swapping the variable in aes(sample = ...).
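The demonstration above relies on visual checks; to mirror Section 23.6, the Shapiro-Wilk test could be applied to each variable as well (a sketch):
# Shapiro-Wilk tests for each Liberty ship variable
# (shapiro.test() accepts sample sizes between 3 and 5000, so the
#  1,571-row dataset is within range; missing values are dropped)
shapiro.test(liberty_ship_data$Direct_Hours)
shapiro.test(liberty_ship_data$Total_Production_Days)
shapiro.test(liberty_ship_data$Total_Cost)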
23.8 Exercise: Testing for Normality with Statistics and Visualization
Try it yourself:
Perform the Q-Q plot and Shapiro-Wilk tests to determine whether the age and/or spending variables in UrbanFind’s customer_data are normally distributed. Which test is more appropriate?
Hint 1
- Plot the Q-Q plot for both variables.
- Calculate the Shapiro-Wilk test statistic.
Hint 2
- For the Q-Q plot: ggplot(data, aes(sample = variable)) + stat_qq() + stat_qq_line()
- For the Shapiro-Wilk test statistic: shapiro.test(data$variable)
Fully worked solution:
# Plot the Q-Q plots and check for deviations from normality
ggplot(customer_data, aes(sample = age)) + stat_qq() + stat_qq_line()
ggplot(customer_data, aes(sample = spending)) + stat_qq() + stat_qq_line()

# Calculate the Shapiro-Wilk tests and compare the p-values to 0.05
shapiro.test(customer_data$age)
shapiro.test(customer_data$spending)
Questions to consider:
- Which variables may not be normally distributed?
- Which test seems more appropriate to test for normality in these variables?
23.9 When to Test for Normality
Testing for normality is particularly relevant when:
- We plan to use parametric tests that assume normality, such as t-tests or ANOVA.
- We want to identify and interpret outliers accurately.
- Our goal is to understand the overall shape and spread of data, especially in comparison to a benchmark.
Summary: Testing for normality ensures we’re using the right statistical methods and assumptions. Combining visual inspection with statistical tests gives a well-rounded perspective on the normality of our data.
By assessing normality before further analysis, we set up our workflow with greater confidence in the reliability of our conclusions.
23.10 What to Do If Data Deviates from Normality
- Transform the Data (see the sketch after this list):
  - Apply log, square root, or Box-Cox transformations.
  - Goal: Bring the data closer to normality.
- Use Non-Parametric Methods: Tests like the Mann-Whitney U or Kruskal-Wallis do not assume normality.
- Proceed with Caution: For large datasets, parametric methods may still perform adequately under mild deviations.
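A minimal sketch of the transformation route, using simulated right-skewed (lognormal) data, for which a log transform restores normality by construction:
# Log-transform a right-skewed variable and re-test for normality
set.seed(7)
skewed <- rlnorm(500, meanlog = 0, sdlog = 1)  # simulated, right-skewed

shapiro.test(skewed)       # rejects normality: strong right skew
shapiro.test(log(skewed))  # log(skewed) is normal by construction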
23.11 Summary
Testing for normality ensures statistical methods are applied appropriately. While not always necessary, it helps detect when alternative approaches are warranted. By using visual tools like histograms and Q-Q plots alongside statistical tests like Shapiro-Wilk, you can confidently assess whether your data aligns with the assumptions of normality.
Looking ahead, these techniques will inform how we approach both univariate and bivariate analysis, ensuring robust and reliable insights.