24  Univariate EDA: Visualization

The greatest value of a picture is when it forces us to notice what we never expected to see.
— John Tukey


24.1 Introduction

Univariate Exploratory Data Analysis (EDA) examines individual variables to uncover their properties, validate data quality, and gain insights. By focusing on one variable at a time, we can identify patterns, detect anomalies, and ensure the data is ready for deeper analysis. For entrepreneurs, univariate EDA provides clarity on customer demographics, product feedback, and business performance metrics.

EDA generally follows a two-step process:

  1. Visualization: Start by examining the variable’s distribution through plots like histograms or box plots.
  2. Statistical Measurement: Follow up with descriptive statistics like mean, median, and variance to quantify its characteristics.

As discussed in Exploring Data, this approach applies to all forms of EDA. In this chapter, we’ll focus on visual exploration of a single variable to understand the distribution, detect data quality issues, and interpret its properties effectively. In the following Chapter 25, we’ll focus on exploring the distribution of variables using descriptive statistics.

24.2 Visualizing Distributions

In exploratory data analysis, understanding the distribution of a variable is essential, but we rarely know what this distribution looks like until we visualize it. By using visualization tools in R, such as those available in ggplot2, we can gain simple yet powerful insights into data patterns, including clustering, spread, and unusual values. Visualization helps reveal whether data tends to follow a normal distribution, has skewness, or shows other unique characteristics.


24.3 Demonstration Data: UrbanFind

Consider UrbanFind, a startup that specializes in curating personalized recommendations for city dwellers in several areas of their lives:

  • Tech Gadgets: Recommendations for the latest gadgets and devices that enhance convenience and connectivity in a fast-paced city life, such as smart home devices, wearable tech, and productivity tools.

  • Fashion: Curated fashion items and accessories that align with urban styles and seasonal trends, helping city dwellers look their best in a competitive, image-conscious environment.

  • Outdoor Activities: Gear and suggestions for outdoor activities that are accessible even in or near urban settings—like urban hiking, weekend getaways, and fitness equipment for both outdoor and indoor use.

  • Health and Wellness Products: Products focused on personal well-being, including fitness equipment, nutritional supplements, and relaxation tools to counterbalance the stresses of urban life.

These recommendations aim to provide city residents with tailored options that fit their lifestyle and preferences, whether they’re looking to upgrade their tech, update their wardrobe, stay active, or improve their wellness. By analyzing customer data, UrbanFind can better understand which areas resonate most with their audience and refine their product offerings and marketing strategies accordingly.

By examining single variables—like customer age, income level, or product rating—UrbanFind can answer foundational questions:

  • Who is the customer?
  • What price range can they afford?
  • How satisfied are they with existing products?

These insights, while simple, guide strategic decisions and set the stage for deeper analysis.

Variables in UrbanFind’s Data

UrbanFind conducted a survey to gather insights into customer demographics, spending habits, and interests. The dataset we’re working with contains responses from 100 survey participants who are representative of UrbanFind’s potential customer base. Each row is an observation, representing the responses of one unique respondent, with the following variables captured:

  • Age: The age of the customer in years. Age is an important demographic factor for UrbanFind, as different age groups may have distinct preferences for technology, fashion, or outdoor activities.

  • Spending: The amount (in dollars) each customer reported spending on lifestyle-related products in the past month. This includes items like tech gadgets, health products, and outdoor gear. UrbanFind aims to understand the range of spending to help design product bundles and set price points.

  • Product Interest: The product category the customer is most interested in, chosen from four options: Tech, Fashion, Outdoors, and Health. This helps UrbanFind determine which product lines to prioritize for marketing and inventory.

  • Region: The geographic region where each customer lives, categorized into North, South, East, and West. This variable provides insights into potential regional differences in product preferences and spending behaviors.

Each of these variables gives us a unique lens through which to view the customer base. By examining them individually, we gain insights that will inform how UrbanFind can create targeted marketing strategies, set appropriate price points, determine which products resonate most with their customers, and tailor its offerings to meet customer needs,

Preview of the UrbanFind customer_data Dataset

Here’s a preview of the customer_data dataset. Notice how the values of each variable vary across observations. In other words, age, spending, product_interest, and region are all variables that provide different types of information.

# A tibble: 100 × 4
     age spending product_interest region
   <dbl>    <dbl> <chr>            <chr> 
 1    33      495 Fashion          East  
 2    18      458 Fashion          North 
 3    32      491 Health           South 
 4    30      420 Fashion          South 
 5    85      664 Fashion          East  
 6    35      533 Fashion          East  
 7    31      526 Health           South 
 8    14      350 Fashion          South 
 9    24      471 Health           East  
10    NA      424 Fashion          South 
# ℹ 90 more rows

24.4 Histograms for Continuous Variables

A histogram shows the frequency of values within specified ranges (or “bins”) along the x-axis, making it ideal for visualizing the shape of the data. Histograms allow us to observe clustering patterns, skewness, and whether the distribution has one peak or multiple peaks.

For example, if we were examining a variable representing the price of items in a store, a histogram could reveal if prices tend to cluster around certain points or if there’s a broader spread. In the following demonstration, you’ll see how histograms effectively depict the distribution of values within a dataset.

Here the histogram shows the frequency distribution of our sample variable, giving insight into how values are spread and where they tend to cluster.


24.5 Box Plots for Continuous Variables

A box plot provides a summary of data based on quartiles, helping us visualize the statistical measures of spread, center, and outliers within a numeric variable. The box represents the interquartile range (IQR), covering the middle 50% of the data. The whiskers extend to the minimum and maximum values within 1.5 times the IQR, while any points beyond the whiskers are displayed as individual outliers.

This annotated box plot shows the key components, helping us understand the distribution’s spread and identify any potential outliers.


24.6 Bar Plots for Categorical Variables

While histograms are ideal for visualizing the distribution of continuous variables, bar plots are the go-to tool for visualizing categorical variables. A categorical variable represents distinct categories, with each observation falling into one category. Examples include gender, product type, or region.

In a bar plot, each bar represents a category, and the height of the bar shows the count (or percentage) of observations in that category. This visualization provides insights into the relative frequency of each category, allowing us to compare categories at a glance.


   Clothing Electronics       Games   Groceries      Health     Outdoor 
         25          45          38          89          44          23 

In this bar plot, each bar’s height shows the frequency of sales in each region. The plot highlights which purchase categories receive the most spending, helping us identify potential areas for business expansion or targeted marketing.

24.7 Demonstration: Exploring Unknown Distributions

When we encounter new data, we often don’t know the underlying distribution of each variable. Visualizations like histograms and box plots make it possible to quickly uncover patterns, shapes, and unique characteristics of the data that may not be obvious from just inspecting raw values.

In this demonstration, we’ll use a dataset of purchase_data that contains three different variables with distinct distributions that are initially unknown. The variables are product_price, customer_age, and purchase_frequency. We’ll first inspect the dataset to see the raw data and then visualize each variable individually to reveal its shape.

# A tibble: 6 × 3
  product_price customer_age purchase_frequency
          <dbl>        <dbl>              <dbl>
1          63.7         20.4               43.3
2          44.4         20.8               44.7
3          53.6         27.1               35.3
4          56.3         26.2               38.8
5          54.0         33.8               33.4
6          48.9         21.3               44.2

The head() of the dataset gives a preview of raw values, but doesn’t reveal much about the distributions. Since the distributions of product_price, customer_age, and purchase_frequency are unknown, lets begin our exploration of the data by visualizing the variables with histograms.

Product Price

To plot the histogram of product_price, we following the grammar of graphics by

  1. calling the ggplot function,
  2. specifying the dataset purchase_data, and
  3. specifying the aesthetic mapping x = product_price.

To create a basic plot in ggplot2, you only need to specify the data and the mapping. For example, when creating histograms, bar charts, and box plots, we typically map only the x aesthetic to the variable of interest. While this demonstration includes custom colors for aesthetics, the simplest ggplot only requires you to declare the data and mapping.

# Plotting the normal distribution for product price
ggplot(purchase_data, aes(x = product_price)) +
  geom_histogram(binwidth = 2, 
                 fill = "royalblue", 
                 color = "black") +
  labs(title = "Distribution of Product Prices", 
       x = "Price", 
       y = "Frequency") +
  theme_minimal()

The histogram of purchase_price shows a symmetric bell curve around the mean price. In real-world datasets, a normal distribution is often used as a baseline for comparisons because of its predictable properties. Here, the normal shape indicates that most products are priced around the mean, with a few outliers on either side.

Customer Age

Next, let’s visualize customer_age with the same grammar of graphics code.

# Plotting the unknown distribution for customer age
ggplot(purchase_data, aes(x = customer_age)) +
  geom_histogram(binwidth = 5, 
                 fill = "#7B3294", 
                 color = "black") +
  labs(title = "Distribution of Customer Ages", 
       x = "Customer Age", 
       y = "Frequency") +
  theme_minimal()

For customer_age, we see a right-skewed distribution where most values are lower, with a few exceptionally high outliers (such as older customers).

The histogram reveals a right-skewed distribution, with most ages clustered on the lower end and a long tail extending to the right. This shape suggests that typical customers are younger, with a few much older customers. For such distributions, the median age is often a better measure of central tendency than the mean, as it’s less influenced by extreme values.

Tip: When exploring a new distribution, it’s often helpful to try different binwidths in your histogram. Adjusting the binwidth can uncover finer details or highlight broader trends in the data, providing a clearer picture of its underlying shape. For example, with a binwidth of 10, the histogram of customer_age appears to peak at age = 0, losing one of the key attributes of a skewed distribution.

# Plotting the unknown distribution for customer age
ggplot(purchase_data, aes(x = customer_age)) +
  geom_histogram(binwidth = 10, 
                 fill = "#7B3294", 
                 color = "black") +
  labs(title = "Distribution of Customer Ages", 
       x = "Customer Age", 
       y = "Frequency") +
  theme_minimal()

Purchase Frequency

Finally, we’ll plot purchase_frequency as we did for the previous two variables.

# Plotting the bimodal distribution for purchase frequency
ggplot(purchase_data, aes(x = purchase_frequency)) +
  geom_histogram(binwidth = 2, 
                 fill = "mediumseagreen", 
                 color = "black") +
  labs(title = "Distribution of Purchase Frequency", 
       x = "Purchase Frequency", 
       y = "Frequency (Count)") +
  theme_minimal()

The histogram of purchase_frequency shows two peaks, suggesting two distinct groups within the data. This could represent two customer segments, one that purchases more frequently and another that purchases less often. Identifying bimodal distributions can guide targeted marketing strategies by focusing on the needs of each segment.

24.8 Exercise: Visualizing Distributions

Try it yourself:


In a dataset named distribution_data, data for a right-skewed variable named skewed_variable is found together with a bimodal variable named bimodal_variable. Visualize the distributions of these variables using a histograms. Compare the shapes to what you would expect from each distribution type.

Hint 1

Build your histogram plot using the grammar of graphics by declaring the data, specifying the aesthetic mapping (a histogram maps a variable to the x-axis only), and calling the geometry (a histogram uses the geom_histogram() function). Note that you can specify the binwidth (the span of the x-axis covered by one bar of the histogram), fill (the color of the bars of the histogram), and color (the color of the borders of the bars).

Hint 2

  1. For the skewed distribution:

    1. Call the ggplot() function
    2. Specify the data as distribution_data
    3. Map the aesthetic with x = skewed_variable
    4. Specify the geometry as geom_histogram
    5. [optional] Specify binwidth, fill, and color as you like
  2. For the bimodal distribution:

    1. Call the ggplot() function
    2. Specify the data as distribution_data
    3. Map the aesthetic with x = bimodal_variable
    4. Specify the geometry as geom_histogram
    5. [optional] Specify binwidth, fill, and color as you like
  geom_histogram(binwidth = 1) 

Fully worked solution:

  1. For the distribution of skewed_variable:

    1. Call the ggplot() function
    2. Specify the data as distribution_data
    3. Specify the aesthetic mapping as x = skewed_variable
    4. Specify the geometry as geom_histogram
    5. [optional] Specify binwidth, fill, and color as you like
  2. For the distribution of bimodal_variable:

    1. Call the ggplot() function
    2. Specify the data as distribution_data
    3. Specify the aesthetic mapping as x = bimodal_variable
    4. Specify the geometry as geom_histogram
    5. [optional] Specify binwidth, fill, and color as you like
1ggplot(distribution_data,
2       aes(x = skewed_variable)) +
3  geom_histogram(binwidth = 2,
4                 fill = "royalblue",
5                 color = "black")

ggplot(distribution_data,
       aes(x = bimodal_variable)) +
  geom_histogram(binwidth = 0.5,
                 fill = "royalblue",
                 color = "black")
1
Call the ggplot() function and specify purchase_data
2
Specify that aesthetic mapping with skewed_variable or bimodal_variable plotted on the x-axis
3
Call the geom_histogram() function to get a histogram of the distributions [optional] and specify the binwidth,
4
fill,
5
color, or other aesthetics of geom_histogram()


24.9 Conclusion

Data visualization is an essential tool in exploratory data analysis. Each type of plot provides unique insights, depending on whether the data is continuous or categorical. The following table outlines commonly used visualization types, along with their purposes and suitability for continuous or categorical data.

Visualization Type Description Continuous Categorical Use Case
Histogram Shows frequency distribution of values in bins Distribution Shape, Outlier Detection
Box Plot Displays median, quartiles, and potential outliers Spread, Outliers
Density Plot Smooth curve showing data density over continuous range Distribution Shape
Bar Chart Shows count or proportion of each category Frequency of Categorical Values
QQ Plot Plots data against a normal distribution Normality Check

Effective data visualization is the cornerstone of univariate exploratory data analysis, transforming raw data into actionable insights. By using tools like histograms, box plots, and bar charts, we gain an intuitive understanding of a dataset’s distribution, variability, and potential anomalies. These visual tools allow us to identify patterns, such as skewness or clustering, and spot outliers that might require further investigation.

Understanding distribution shape allows us to make better data-driven decisions:

  • Detecting Skew: Skewness tells us if a variable’s distribution leans towards higher or lower values. For instance, if product prices are right-skewed, most products are affordable, but a few are premium-priced.
  • Identifying Outliers: Box plots make it easy to spot outliers. Outliers could indicate data errors or valuable insights, like a product that sells exceptionally well or poorly.
  • Assessing Variability: Histograms and box plots help in quickly visualizing data spread, giving insight into whether values cluster tightly around the mean or are widely dispersed.

For entrepreneurs, these visualizations can answer critical questions like:

  • Are customer purchases concentrated around specific price points?
  • Do certain product categories dominate sales volume?
  • Is variability in spending consistent across different regions?

Visualization not only simplifies complex datasets but also facilitates communication of insights to stakeholders, enabling data-driven decision-making. As you progress in EDA, remember that every visualization should serve a purpose: to clarify the data and support your decisions.

Looking ahead, statistical measures will deepen your understanding of the patterns observed visually, providing numerical evidence to complement these initial insights.