13  Visualizing Data

Data visualization is an essential part of any data analysis. It allows us to see patterns, relationships, and trends that are difficult to grasp from raw data alone. A good visualization clarifies, communicates, and uncovers insights, while a bad visualization can mislead or obscure important information.

In this chapter, we’ll cover the basic principles of effective data visualization and introduce you to the powerful ggplot2 package in R, which follows the grammar of graphics framework to build layered and flexible visualizations.

13.1 Basic Principles of Good Data Visualization

Before diving into coding, let’s discuss some basic principles to keep in mind when creating visualizations:

Clarity

Your visualization should make the data easier to understand, not harder. Always strive for simplicity and clarity by avoiding unnecessary elements (e.g., 3D effects, excessive gridlines). Every component should contribute to conveying the data’s message.

Honesty

Ensure that your visual representation of data is honest and accurate. This includes using appropriate scales, avoiding distortion (e.g., truncated y-axes that exaggerate differences), and choosing the right chart type for your data.
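
To see how much a truncated axis can exaggerate a difference, consider a small worked example in R. The two values and the truncated baseline of 95 are made up purely for illustration.

```r
# Hypothetical values for two groups (illustration only)
values <- c(A = 98, B = 100)

# Bar heights as drawn when the y-axis starts at 0 versus at 95
heights_full      <- values - 0
heights_truncated <- values - 95

# Ratio of the drawn bar heights (B relative to A)
ratio_full      <- unname(heights_full["B"] / heights_full["A"])
ratio_truncated <- unname(heights_truncated["B"] / heights_truncated["A"])

round(ratio_full, 2)       # 1.02: B looks about 2% taller than A
round(ratio_truncated, 2)  # 1.67: B looks about 67% taller than A
```

The data are identical in both cases; only the starting point of the axis changes how large the difference appears.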

Audience Awareness

Always consider the audience when designing a visualization. What message are you trying to communicate, and who will be interpreting it? Tailor your visualization to the needs and experience level of your audience.

Minimize Cognitive Load

The visualization should be intuitive and easy to read. Minimize the amount of mental effort your audience needs to interpret the graphic. Use familiar chart types and avoid clutter.

Consistency

Maintain consistent use of colors, scales, and labels. Consistency across visualizations helps the audience compare different aspects of the data and improves the overall readability of your report or presentation.

Data-Ink Ratio

This principle, popularized by Edward Tufte, suggests that the proportion of “data ink” (the ink used to represent the actual data) to non-data ink (decorations, gridlines, etc.) should be maximized. In other words, reduce chart junk and highlight the data.



13.2 Grammar of Graphics

Now that we’ve covered the key principles, let’s move on to visualizing data in R using the ggplot2 package. ggplot2 follows the Grammar of Graphics framework, which breaks down a plot into layers, allowing for flexible and modular plotting.

At the core of ggplot2 is the idea that any plot can be described by the following elements:

  1. Data: The dataset being visualized.
  2. Aesthetic mappings (aes): The visual properties of the data (e.g., position, color, size) that correspond to the variables in the dataset.
  3. Geometric objects (geoms): The shapes that represent the data, such as points, lines, and bars.
  4. Labels and Titles: Plot titles and axes labels to make the plot more informative.
  5. Scales: How data values are mapped to visual properties like axes, colors, or sizes.
  6. Facets: Dividing data into subplots.
  7. Themes: Control the visual appearance of the plot.

Basic Plot with ggplot2

Let’s walk through a simple example of how ggplot2 works by exploring data from the production of Liberty ships during World War II.

Demonstration Data: Production of Liberty Ships

During World War II, the U.S. embraced the challenge of rapidly building supply ships to support the Allied war effort. To address this, Liberty ships were developed to be a quick and cost-effective solution. Starting with an inexperienced workforce, the U.S. Maritime Commission eventually produced over 2,700 Liberty ships in just five years. Sometimes referred to as the “miracle of the Liberty ship,” the time to build one ship was reduced from 244 days to just 42 days by the end of the war, playing a crucial role in the war effort by maintaining supply lines across the Atlantic.

There are 18 variables in the Liberty ship dataset, so we will focus on just a few:

  • TC: the total cost of producing a ship
  • Yard: location of the shipyard that built the ship
  • Direct_Hours: the total direct labor hours spent on constructing each ship
  • Total_Production_Days: the total number of days required to produce and deliver a ship

We’ll use these variables to create scatter plots and explore their relationships.

Import the data

Before we create the plot, we first need to import the dataset. The Liberty ship data is stored in a CSV file named liberty_ship.csv, located in the data subdirectory.

library(tidyverse)  # provides read_csv(), glimpse(), and ggplot()
# Import the liberty_ship.csv dataset
liberty_ship_data <- read_csv("data/liberty_ship.csv")
liberty_ship_data
# A tibble: 1,571 × 18
    Unit   Way Keel_Date Delivery_Date Yard_Cost Procurement_Cost
   <dbl> <dbl> <chr>     <chr>             <dbl>            <dbl>
 1     1     1 4/30/41   12/30/41        1613203           888165
 2     2     2 5/15/41   1/19/42         1545574           888165
 3     3     3 6/21/41   1/29/42         1470687           888165
 4     4     4 6/21/41   2/9/42          1421123           888165
 5     5     5 7/15/41   2/20/42         1397852           888165
 6     6     6 7/15/41   2/27/42         1354256           888165
 7     8     8 8/25/41   3/30/42         1267658           888165
 8     9     9 9/3/41    3/18/42         1151856           888165
 9    10    10 9/12/41   4/11/42         1236111           888165
10    11    11 9/22/41   5/9/42          1236111           888165
# ℹ 1,561 more rows
# ℹ 12 more variables: Facilities_Cost <dbl>, Admin_Cost <dbl>, TC <dbl>,
#   VC <dbl>, FC <dbl>, Yard <chr>, Direct_Hours <dbl>, Indirect_Hours <dbl>,
#   Total_ManHours <dbl>, Way_Days <dbl>, Outfitting_Days <dbl>,
#   Total_Production_Days <dbl>

Inspect the data

By inspecting the data, we can confirm that it loaded correctly and get a quick overview before visualizing it. The glimpse() and summary() functions give us a snapshot of the Liberty ship dataset:

glimpse(liberty_ship_data)
Rows: 1,571
Columns: 18
$ Unit                  <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, …
$ Way                   <dbl> 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 1, 12, 13, 2, 14…
$ Keel_Date             <chr> "4/30/41", "5/15/41", "6/21/41", "6/21/41", "7/1…
$ Delivery_Date         <chr> "12/30/41", "1/19/42", "1/29/42", "2/9/42", "2/2…
$ Yard_Cost             <dbl> 1613203, 1545574, 1470687, 1421123, 1397852, 135…
$ Procurement_Cost      <dbl> 888165, 888165, 888165, 888165, 888165, 888165, …
$ Facilities_Cost       <dbl> 88815, 86414, 83755, 81995, 81169, 79621, 76546,…
$ Admin_Cost            <dbl> 25666.00, 24972.00, 24204.00, 23695.00, 23457.00…
$ TC                    <dbl> 2615849, 2545125, 2466811, 2414978, 2390643, 234…
$ VC                    <dbl> 2501368, 2433739, 2358852, 2309288, 2286017, 224…
$ FC                    <dbl> 114481.00, 111386.00, 107959.00, 105690.00, 1046…
$ Yard                  <chr> "Bethlehem", "Bethlehem", "Bethlehem", "Bethlehe…
$ Direct_Hours          <dbl> 870870, 831745, 788406, 758934, 735197, 710342, …
$ Indirect_Hours        <dbl> 327818, 313916, 308175, 293346, 283848, 271138, …
$ Total_ManHours        <dbl> 1198688, 1145661, 1096581, 1052280, 1019045, 981…
$ Way_Days              <dbl> 150, 163, 147, 168, 144, 168, 174, 143, 163, 187…
$ Outfitting_Days       <dbl> 94, 86, 75, 65, 76, 59, 43, 53, 48, 42, 48, 44, …
$ Total_Production_Days <dbl> 244, 249, 222, 233, 220, 227, 217, 196, 211, 229…
#summary(liberty_ship_data)

These functions allow us to quickly verify that the dataset includes all expected variables and to get a general sense of the data’s distribution and structure.

Now that we’ve inspected the data, let’s visualize the relationship between Total_Production_Days (the total number of days required to build and deliver a ship) and Direct_Hours (the total direct labor hours required to build each ship).

Remember, we will follow the grammar of graphics principles:

  1. Data
  2. Aesthetic mappings (aes)
  3. Geometric objects (geoms)
  4. Labels and Titles
  5. Scales
  6. Facets
  7. Themes

Scatter Plot of Liberty Ship Data Using the Grammar of Graphics

library(scales)    # provides comma() for the axis labels
library(ggthemes)  # provides theme_hc()

ggplot(liberty_ship_data,                    # 1. specify the data
       aes(x = Direct_Hours,                 # 2. aesthetic mapping of Direct_Hours to the x axis
           y = Total_Production_Days)) +     # 2. aesthetic mapping of Total_Production_Days to the y axis
  geom_point(aes(color = Yard)) +            # 3. specify the geometry as points (scatter plot)
  labs(title = "Liberty Ship Production",    # 4. specify the title label
       x = "Direct Labor Hours",             # 4. specify the x-axis label    
       y = "Total Production Days") +        # 4. specify the y-axis label
  scale_x_continuous(labels = comma) +       # 5. modify the axis scales
  theme_hc()                                 # 7. apply a theme from ggthemes

In the following sections, we will learn to create this scatter plot, built from layers of the elements of the grammar of graphics.



13.3 Call the ggplot() Function

# Call the ggplot function with no arguments
ggplot()

Since we haven’t provided any data or aesthetic mappings, R returns an empty rectangle. This serves as a placeholder for the plot, showing that we’ll build it layer by layer.

13.4 Specify the Data

The next step is to specify the dataset that will be plotted.

ggplot(liberty_ship_data)

At this point, we’ve specified that we want to explore the Liberty ship dataset, but we haven’t yet told ggplot() which variables we want to plot, so nothing is shown. The plot remains an empty canvas until we declare which variables to use in the plot. In the next step, we will map our variables to create the plot.



13.5 Map the Aesthetics

Aesthetic mappings in ggplot() define how variables in your dataset are visually represented in the plot. The simplest form of aesthetic mapping is to declare which variable will be placed on the x-axis and which will be placed on the y-axis. This is done using the aes() function (short for “aesthetics”).

In this example, we will map Direct_Hours to the x-axis and Total_Production_Days to the y-axis:

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days))

Now, the plot is starting to take shape. Direct_Hours is mapped to the x-axis, with a range automatically set from 0 to around 1,250,000, reflecting the range of direct labor hours in the dataset. Similarly, Total_Production_Days is mapped to the y-axis, with a range from 0 to around 350, based on the values in that variable.

At this point, the axes are scaled, but we still don’t see any data points because we haven’t added any geometric objects to represent the data visually.
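
One way to confirm what the plot contains at this stage is to store it in a variable and inspect it. The following is a minimal sketch that uses a small made-up data frame, toy_ships, in place of the real dataset; the values and the name toy_ships are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical toy data that mimics two columns of the Liberty ship dataset
toy_ships <- data.frame(
  Direct_Hours          = c(870000, 790000, 640000, 520000, 450000, 400000),
  Total_Production_Days = c(244, 222, 150, 110, 80, 60)
)

# Data and aesthetic mappings only, no geoms yet
p <- ggplot(toy_ships,
            aes(x = Direct_Hours, y = Total_Production_Days))

names(p$mapping)  # the aesthetics that have been mapped: "x" "y"
length(p$layers)  # 0, because no geometric objects have been added
```

The plot object already knows its data and its x and y mappings, but it has no layers, which is why only the scaled axes are drawn.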



13.6 Specify the Geometric Objects

Even though we’ve mapped the variables to the axes, the data isn’t yet visible because we haven’t specified how to represent it. This is where we declare the geometric object or geom. The geom defines the shape that will be used to display the data.

In this case, we will use geom_point() to add a layer of points to the plot, where each point represents a single observation.

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point()

Now that we’ve added a geometric object, the aesthetic mappings become visible. The points are positioned based on Direct_Hours and Total_Production_Days.
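
Because a plot is built by adding layers, a partially built plot can be stored and extended later. The sketch below illustrates this with a small made-up data frame (toy_ships is a hypothetical stand-in for the real data).

```r
library(ggplot2)

# Hypothetical toy data that mimics two columns of the Liberty ship dataset
toy_ships <- data.frame(
  Direct_Hours          = c(870000, 790000, 640000, 520000, 450000, 400000),
  Total_Production_Days = c(244, 222, 150, 110, 80, 60)
)

# Store the data and mappings, then add the point layer in a second step
base_plot <- ggplot(toy_ships,
                    aes(x = Direct_Hours, y = Total_Production_Days))
scatter   <- base_plot + geom_point()

length(base_plot$layers)  # 0 layers: axes only
length(scatter$layers)    # 1 layer: the points

scatter                   # print the object to draw the scatter plot
```

Storing the base plot this way can be convenient when you want to try several geoms on the same data and mappings.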

Try it yourself: Basic Scatter Plot


The customer data about the protein-infused Muscle Cola is already imported here as muscle_cola_data. Generate a scatter plot with the respondents’ willingness to pay (WTP) on the x-axis and the number of servings they would consume per month (Quantity) on the y-axis.

Hint 1

Build your plot from layers beginning with specifying the dataset, then specifying aesthetic mapping of the x- and y-variables, then declaring the geometry to plot the data points.

Hint 2

Inside the ggplot() function, add the argument that the dataset is muscle_cola_data and the aesthetic mapping aes() that specifies that willingness to pay WTP is the x variable and Quantity is the y variable. Then call the geom_point() function to plot the data as points (scatter plot).

  aes(x = WTP, y = Quantity)

Fully worked solution:

As arguments to the ggplot() function, declare the data as muscle_cola_data and the aesthetic mapping as aes(x = WTP, y = Quantity). Then call the geom_point() function to plot the data as points.

ggplot(muscle_cola_data,                 # 1. call ggplot() and specify muscle_cola_data as the data
       aes(x = WTP, y = Quantity)) +     # 2. map WTP to the x-axis and Quantity to the y-axis
  geom_point()                           # 3. call geom_point() to get a scatter plot of points

Add Additional Aesthetic Mappings to the Geometry

In addition to mapping the x and y variables, we can map other aesthetic properties such as color, shape, size, and linewidth to variables in our dataset. Mappings declared inside ggplot() apply to every geom in the plot unless a specific geom overrides them.

  • Color: We can use color to differentiate between categories in the data.
  • Shape: Shape can be used to differentiate categories in scatter plots.
  • Size: Size can represent a numeric variable, making points larger or smaller based on values.
  • Linewidth: If you are drawing lines, you can use linewidth to control the thickness of the lines.

Color

Here’s how to map the variable Yard to color to show how different shipyards affect the relationship between production time and labor hours:

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days, color = Yard)) +
  geom_point()

In this case, color differentiates the shipyards, and each point will be colored according to which shipyard built that ship. This helps to visually distinguish groups within the data.
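
A related point that often causes confusion is the difference between mapping color to a variable inside aes() and setting a single fixed color outside aes(). The sketch below contrasts the two using a made-up data frame; toy_ships, its values, and the yard names are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data with two made-up shipyards
toy_ships <- data.frame(
  Direct_Hours          = c(870000, 790000, 640000, 520000, 450000, 400000),
  Total_Production_Days = c(244, 222, 150, 110, 80, 60),
  Yard                  = c("Yard A", "Yard A", "Yard A", "Yard B", "Yard B", "Yard B")
)

# Mapping: color varies with the Yard variable and a legend is drawn
p_mapped <- ggplot(toy_ships,
                   aes(x = Direct_Hours, y = Total_Production_Days, color = Yard)) +
  geom_point()

# Setting: every point gets the same fixed color and no legend is needed
p_fixed <- ggplot(toy_ships,
                  aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(color = "steelblue")

# layer_data() returns the values ggplot2 computed for a layer
length(unique(layer_data(p_mapped)$colour))  # 2: one color per shipyard
unique(layer_data(p_fixed)$colour)           # "steelblue"
```

Use a mapping when color should carry information from the data, and a fixed setting when color is purely a styling choice.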

Size

We can also add other aesthetic mappings, such as size, to show the relationship between additional variables:

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days, color = Yard, size = Total_ManHours)) +
  geom_point()

Here, the size of each point corresponds to the total number of man-hours required for each ship, providing additional insight into how labor hours relate to production time.

Try it yourself: Add Color Aesthetics to the Geometry


Now map the variable Gym_member to color to show how gym membership affects the relationship between willingness to pay WTP and consumption Quantity.

Hint 1

Re-build your plot from layers beginning with specifying the dataset, then specifying aesthetic mapping of the x- and y-variables as well as the aesthetic mapping of gym membership to color. Then declare the geometry to plot the data points.

Hint 2

Inside the ggplot() function, add the argument that the dataset is muscle_cola_data and the aesthetic mapping aes() that specifies that willingness to pay WTP is the x variable and Quantity is the y variable. Add an argument inside the aesthetic mapping that maps color to Gym_member. Then call the geom_point() function to plot the data as points (scatter plot).

  aes(x = WTP, y = Quantity, color = Gym_member)

Fully worked solution:

As arguments to the ggplot() function, declare the data as muscle_cola_data and the aesthetic mapping as aes(x = WTP, y = Quantity, color = Gym_member). Then call the geom_point() function to plot the data as points.

ggplot(muscle_cola_data,                                  # 1. call ggplot() and specify muscle_cola_data as the data
       aes(x = WTP, y = Quantity, color = Gym_member)) +  # 2. map WTP to the x-axis, Quantity to the y-axis, and Gym_member to color
  geom_point()                                            # 3. call geom_point() to get a scatter plot of points

Aesthetics in Geom Layers

In some cases, you may want to apply aesthetic mappings only to specific geoms. Aesthetic mappings can be defined within individual layers (geoms), allowing you to control the appearance of different geoms independently. For example, we can map color to Yard in the point layer only and add a line layer, which connects the observations in order of their x values (Direct_Hours) and is drawn without the color mapping:

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +         # Color by shipyard in the point layer
  geom_line()                             # Add a line connecting the points
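
To see why the placement of a mapping matters, the following sketch compares a color mapping declared in ggplot() with the same mapping declared only in geom_point(). It uses a made-up data frame (toy_ships and the yard names are hypothetical) and inspects the line layer with layer_data().

```r
library(ggplot2)

# Hypothetical toy data with two made-up shipyards
toy_ships <- data.frame(
  Direct_Hours          = c(870000, 790000, 640000, 520000, 450000, 400000),
  Total_Production_Days = c(244, 222, 150, 110, 80, 60),
  Yard                  = c("Yard A", "Yard A", "Yard A", "Yard B", "Yard B", "Yard B")
)

# Color mapped in ggplot(): inherited by both the points and the line
p_global <- ggplot(toy_ships,
                   aes(x = Direct_Hours, y = Total_Production_Days, color = Yard)) +
  geom_point() +
  geom_line()

# Color mapped in geom_point() only: the line layer does not see it
p_local <- ggplot(toy_ships,
                  aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +
  geom_line()

# Number of separate lines drawn by the second layer (geom_line)
length(unique(layer_data(p_global, 2)$group))  # 2: one line per shipyard
length(unique(layer_data(p_local, 2)$group))   # 1: a single line through all points
```

With the global mapping, the line is split by shipyard; with the layer-specific mapping, only the points are colored and one line runs through every observation.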

Summary of Aesthetic Mappings

To summarize, aesthetic mappings allow you to represent variables in your data through different visual properties, such as:

  • x and y positions: Mapped using aes(x = , y = ).
  • color: Used to visually separate categories.
  • shape: Differentiates categories in scatter plots.
  • size: Represents numeric variables by adjusting the size of points.
  • linewidth: Controls line thickness for line-based geoms like geom_line() or geom_smooth().


Try it yourself: Mapping Plot Aesthetics and Geometry Aesthetics


  1. Plot the relationship between willingness to pay WTP and monthly consumption Quantity.
  2. Map the variable Gym_member to color to show how gym membership affects the relationship between willingness to pay WTP and consumption Quantity.
  3. Map the length of the respondent’s workout Workout_length to shape to show how workout length affects the relationship.

Hint 1

Re-build your plot from layers beginning with specifying the dataset, then specifying aesthetic mapping of the x- and y-variables as well as the aesthetic mapping of gym membership to color. Inside a new aesthetic mapping in the point geometry, map the shape of the point to the length of the workout. Then declare the geometry to plot the data points.

Hint 2

Inside the ggplot() function, add the argument that the dataset is muscle_cola_data and the aesthetic mapping aes() that specifies that willingness to pay WTP is the x variable and Quantity is the y variable. Add an argument inside the aesthetic mapping that maps color to Gym_member. Then call the geom_point() function to plot the data as points (scatter plot). Add an aesthetic mapping of shape to Workout_length inside the geom_point().

  geom_point(aes(shape = Workout_length))

Fully worked solution:

As arguments to the ggplot() function, declare the data as muscle_cola_data and the aesthetic mapping as aes(x = WTP, y = Quantity, color = Gym_member). Then call the geom_point() function with aes(shape = Workout_length) to plot the data as points whose shape reflects the workout length.

ggplot(muscle_cola_data,                                  # 1. call ggplot() and specify muscle_cola_data as the data
       aes(x = WTP, y = Quantity, color = Gym_member)) +  # 2. map WTP to the x-axis, Quantity to the y-axis, and Gym_member to color
  geom_point(aes(shape = Workout_length))                 # 3. call geom_point() and map shape to Workout_length


13.7 Labels and Titles

To make the plot more informative, we can add titles and labels for the axes. This helps clarify what the variables represent and provides context for the viewer.

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +  # Color by shipyard
  labs(title = "Liberty Ship Production Time vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Production Days")

In this plot:

  • title: Describes what the plot is about.
  • x and y labels: Make the axes clear so viewers know what each variable represents.
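
The labs() function can also rename the legend for a mapped aesthetic by using the aesthetic's name as an argument. The sketch below shows this on a made-up data frame; toy_ships, its values, and the legend title "Shipyard" are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data with two made-up shipyards
toy_ships <- data.frame(
  Direct_Hours          = c(870000, 790000, 640000, 520000, 450000, 400000),
  Total_Production_Days = c(244, 222, 150, 110, 80, 60),
  Yard                  = c("Yard A", "Yard A", "Yard A", "Yard B", "Yard B", "Yard B")
)

p <- ggplot(toy_ships,
            aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +
  labs(title = "Liberty Ship Production Time vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Production Days",
       color = "Shipyard")   # title shown above the color legend

p
```

Without the color argument, the legend title defaults to the variable name, Yard.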

Try it yourself: Add Labels


  1. Plot the relationship between willingness to pay WTP and monthly consumption Quantity.

  2. Map the variable Gym_member to color to show how gym membership affects the relationship between willingness to pay WTP and consumption Quantity.

  3. Map the length of the respondent’s workout Workout_length to shape to show how workout length affects the relationship.

  4. Add title and axes labels

    • Make the title “Willingness to Pay and Consume Muscle Cola”
    • Make the x-axis “Willingness to Pay”
    • Make the y-axis “Quantity Consumed”

Hint 1

Re-build your plot from layers beginning with specifying the dataset, then specifying aesthetic mapping of the x- and y-variables as well as the aesthetic mapping of gym membership to color. Inside a new aesthetic mapping in the point geometry, map the shape of the point to the length of the workout. Then declare the geometry to plot the data points. Then add the title and labels as instructed.

Hint 2

Inside the ggplot() function, add the argument that the dataset is muscle_cola_data and the aesthetic mapping aes() that specifies that willingness to pay WTP is the x variable and Quantity is the y variable. Add an argument inside the aesthetic mapping that maps color to Gym_member. Then call the geom_point() function to plot the data as points (scatter plot). Add an aesthetic mapping of shape to Workout_length inside the geom_point(). Then call the labs() function using arguments title = "Willingness to Pay and Consume Muscle Cola", x = "Willingness to Pay", y = "Quantity Consumed".

  labs(title = "Willingness to Pay and Consume Muscle Cola", 
       x = "Willingness to Pay", 
       y = "Quantity Consumed"
       )

Fully worked solution:

As arguments to the ggplot() function, declare the data as muscle_cola_data and the aesthetic mapping as aes(x = WTP, y = Quantity, color = Gym_member). Then call the geom_point() function with aes(shape = Workout_length) to plot the data as points. Then call the labs() function to add the title and axes labels.

ggplot(muscle_cola_data,                                      # 1. call ggplot() and specify muscle_cola_data as the data
       aes(x = WTP, y = Quantity, color = Gym_member)) +      # 2. map WTP to the x-axis, Quantity to the y-axis, and Gym_member to color
  geom_point(aes(shape = Workout_length)) +                   # 3. call geom_point() and map shape to Workout_length
  labs(title = "Willingness to Pay and Consume Muscle Cola",  # 4. call labs() and specify the title
       x = "Willingness to Pay",                              # 5. specify the x-axis label
       y = "Quantity Consumed")                               # 6. specify the y-axis label


13.8 Scales

Scales define how your data values are translated into visual properties such as axis ranges, color gradients, or point sizes. Scales are an important tool when you need to refine the default behavior of ggplot2.

Adjusting Axis Scales

You can use scale_x_continuous() and scale_y_continuous() to control the appearance of the x- and y-axes. For instance, you may want to limit the range of the axes or change the axis labels.

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +  # Color by shipyard
  labs(title = "Liberty Ship Production Time vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Production Days") +
  scale_x_continuous(limits = c(0, 1500000), breaks = seq(0, 1500000, 250000), labels = comma) +
  scale_y_continuous(limits = c(0, 400), breaks = seq(0, 400, 50))

In this example:

  • limits: Define the minimum and maximum range for the axis.
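  • breaks: Specify the positions of the axis ticks to make the plot more readable.
  • labels: Control how the tick values are printed; here comma (from the scales package) adds thousands separators.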
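
To make the breaks and labels arguments more concrete, the short sketch below evaluates the same expressions outside of a plot. It assumes only that the scales package is installed.

```r
library(scales)

# The tick positions requested with breaks = seq(0, 1500000, 250000)
axis_breaks <- seq(0, 1500000, 250000)
length(axis_breaks)  # 7 tick positions, from 0 to 1,500,000

# comma() is an ordinary function that turns numbers into formatted text
comma(axis_breaks)
# "0" "250,000" "500,000" "750,000" "1,000,000" "1,250,000" "1,500,000"
```

Passing labels = comma to scale_x_continuous() tells ggplot2 to apply this function to each tick value before printing it.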

Adjusting Color Scales

You can control the color scale with scale_color_manual() or scale_color_gradient() when you want specific colors for categories or continuous data.

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +  # Color by shipyard
  labs(title = "Liberty Ship Production Time vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Production Days") +
  scale_color_manual(values = c("blue", "red", "green", "purple", "orange", "yellow"))

In this example, we manually assign colors to different shipyards using scale_color_manual().
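
When specific categories should always receive specific colors, one option is to pass a named vector to scale_color_manual(), so that each color is tied to a category name rather than to its position. The sketch below uses a made-up data frame; toy_ships and the yard names "Yard A" and "Yard B" are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data with two made-up shipyards
toy_ships <- data.frame(
  Direct_Hours          = c(870000, 790000, 640000, 520000, 450000, 400000),
  Total_Production_Days = c(244, 222, 150, 110, 80, 60),
  Yard                  = c("Yard A", "Yard A", "Yard A", "Yard B", "Yard B", "Yard B")
)

# Named vector: the names must match the values of the mapped variable
yard_colors <- c("Yard A" = "blue", "Yard B" = "red")

p <- ggplot(toy_ships,
            aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +
  scale_color_manual(values = yard_colors)

p
```

With an unnamed vector, colors are assigned in the order of the categories, so the vector needs at least as many colors as there are categories.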

Alternatively, for continuous data, you can use scale_color_gradient() to create a color gradient based on the data values:

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days, color = Total_ManHours)) +
  geom_point() +                    # Points colored by total man-hours
  labs(title = "Liberty Ship Production Time vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Production Days") +
  scale_color_gradient(low = "blue", high = "red")

In this case, scale_color_gradient() maps the values of Total_ManHours to a gradient ranging from blue (low values) to red (high values).

Adding Currency to Scales

When dealing with financial data, you may need to format axis labels to display values as currency. In ggplot2, this can be done using the scales package, which provides a variety of formatting functions. The most common for financial data is dollar().

Let’s say you want to display TC (the total cost of producing a ship) as a financial value in dollars. You can use scale_y_continuous() (or scale_x_continuous() for the x-axis) and apply the dollar() function to format the axis values.

library(scales)  # Load the scales package for currency formatting
ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = TC)) +
  geom_point(aes(color = Yard)) +  # Color by shipyard
  labs(title = "Liberty Ship Total Cost vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Cost") +
  scale_x_continuous(labels = comma) +  
  scale_y_continuous(labels = dollar)  # Format y-axis as currency

In this example, scales::dollar() automatically formats the y-axis labels as currency, adding dollar signs and commas where appropriate.

Formatting Axis with Custom Currency or Units

You can also customize the currency format (e.g., changing the currency symbol or specifying decimals).

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = TC)) +
  geom_point(aes(color = Yard)) +  # Color by shipyard
  labs(title = "Liberty Ship Total Cost vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Cost") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = dollar_format(prefix = "$", suffix = "K", scale = 0.001))

In this example:

  • dollar_format(prefix = "$", suffix = "K", scale = 0.001): Adds a dollar sign prefix, a “K” suffix to represent thousands, and scales the values down by a factor of 1,000. For example, 1,000,000 becomes “$1,000K.”

Example: Formatting for Other Currencies

You can also format the labels for other currencies by changing the prefix. Here’s how you might display values in euros or pounds:

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = TC)) +
  geom_point(aes(color = Yard)) +  # Color by shipyard
  labs(title = "Liberty Ship Total Cost vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Cost (in €)") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = dollar_format(prefix = "€", suffix = ""))

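Here, the y-axis values will be displayed with a euro sign (e.g., “€1,000”). Note that changing the prefix only changes the symbol; it does not convert the underlying values between currencies.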
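
The formatters used in this section are ordinary functions, so one way to check what an axis label will look like is to call them directly on a number. The sketch below assumes only that the scales package is installed; the input values are chosen for illustration.

```r
library(scales)

# Default dollar formatting
dollar(1000000)  # "$1,000,000"

# Custom formatter: scale by 0.001 and add a "K" suffix
dollar_k <- dollar_format(prefix = "$", suffix = "K", scale = 0.001)
dollar_k(1000000)  # "$1,000K"

# Same idea with a euro prefix (the symbol changes, the value does not)
euro <- dollar_format(prefix = "€", suffix = "")
euro(1000)  # "€1,000"
```

Note that dollar_format() returns a function, which is why it can be passed to the labels argument of a scale.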

Recap of Scales:

  • scale_x_continuous() / scale_y_continuous(): Adjust axis ranges, breaks, and labels.
  • scale_color_manual(): Manually assign specific colors to categories.
  • scale_color_gradient(): Create color gradients for continuous data.
  • scales::dollar(): Quickly formats axis labels as dollars.
  • dollar_format(): Allows you to customize the currency symbol, add suffixes like “K” or “M”, and scale the values for better readability.
  • Other Currencies: You can use dollar_format() for any currency by modifying the prefix argument (e.g., “€” or “£”).

Using scales allows you to fine-tune how your data is presented visually, giving you more control over the plot’s aesthetics. For example, currency formatting is a handy tool when dealing with financial data, making your plots clearer and more professional in entrepreneurial contexts.

Try it yourself: Add Currency to the x-axis Scale


  1. Plot the relationship between willingness to pay WTP and monthly consumption Quantity.

  2. Map the variable Gym_member to color to show how gym membership affects the relationship between willingness to pay WTP and consumption Quantity.

  3. Map the length of the respondent’s workout Workout_length to shape to show how workout length affects the relationship.

  4. Add title and axes labels

    • Make the title “Willingness to Pay and Consume Muscle Cola”
    • Make the x-axis “Willingness to Pay”
    • Make the y-axis “Quantity Consumed”
  5. Set the x-axis scale to be currency in dollars ($). Note that the scales library is already loaded so the scale functions are available to you.

Hint 1

Re-build your plot from layers beginning with specifying the dataset, then specifying aesthetic mapping of the x- and y-variables as well as the aesthetic mapping of gym membership to color. Inside a new aesthetic mapping in the point geometry, map the shape of the point to the length of the workout. Then declare the geometry to plot the data points. Then add the title and labels. Then format the x-axis to have currency in dollars.

Hint 2

Inside the ggplot() function, add the argument that the dataset is muscle_cola_data and the aesthetic mapping aes() that specifies that willingness to pay WTP is the x variable and Quantity is the y variable. Add an argument inside the aesthetic mapping that maps color to Gym_member. Then call the geom_point() function to plot the data as points (scatter plot). Add an aesthetic mapping of shape to Workout_length inside the geom_point(). Then call the labs() function using arguments title = "Willingness to Pay and Consume Muscle Cola", x = "Willingness to Pay", y = "Quantity Consumed". Then call the scale_x_continuous() function and specify the argument that the labels are dollar currency.

  scale_x_continuous(labels = dollar)

Fully worked solution:

As arguments to the ggplot() function, declare the data as muscle_cola_data and the aesthetic mapping as aes(x = WTP, y = Quantity, color = Gym_member). Then call the geom_point() function with aes(shape = Workout_length) to plot the data as points. Then call the labs() function to add the title and axes labels. Then call the scale_x_continuous() function to specify the x-axis scale labels to be dollar currency.

ggplot(muscle_cola_data,                                      # 1. call ggplot() and specify muscle_cola_data as the data
       aes(x = WTP, y = Quantity, color = Gym_member)) +      # 2. map WTP to the x-axis, Quantity to the y-axis, and Gym_member to color
  geom_point(aes(shape = Workout_length)) +                   # 3. call geom_point() and map shape to Workout_length
  labs(title = "Willingness to Pay and Consume Muscle Cola",  # 4. call labs() and specify the title
       x = "Willingness to Pay",                              # 5. specify the x-axis label
       y = "Quantity Consumed") +                             # 6. specify the y-axis label
  scale_x_continuous(labels = dollar)                         # 7. call scale_x_continuous() and format the labels as dollars


13.9 Facets

Faceting allows you to split your data into multiple panels or subplots based on the values of one or more categorical variables. This is useful when you want to compare the same relationship across different subsets of the data, such as different categories or groups.

Using facet_wrap()

The function facet_wrap() is used to create facets (small multiples) for a single variable. In the following example, we create a separate plot for each Yard to compare the relationship between Direct_Hours and Total_Production_Days across shipyards.

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +  # Color by shipyard
  labs(title = "Liberty Ship Production Time vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Production Days") +
  scale_x_continuous(labels = comma) +
  facet_wrap(~ Yard)

Here, facet_wrap(~ Yard) creates a separate panel for each shipyard, showing how the relationship between Direct_Hours and Total_Production_Days varies across shipyards.
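
The layout of the panels can be adjusted with additional arguments to facet_wrap(), such as ncol for the number of columns and scales for whether panels share axis ranges. The following sketch uses a made-up data frame; toy_ships and the yard names are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data with two made-up shipyards
toy_ships <- data.frame(
  Direct_Hours          = c(870000, 790000, 640000, 520000, 450000, 400000),
  Total_Production_Days = c(244, 222, 150, 110, 80, 60),
  Yard                  = c("Yard A", "Yard A", "Yard A", "Yard B", "Yard B", "Yard B")
)

p <- ggplot(toy_ships,
            aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point() +
  facet_wrap(~ Yard,
             ncol = 1,            # stack the panels in a single column
             scales = "free_x")   # let each panel choose its own x-axis range

p
```

Shared axes (the default) make panels directly comparable, while free scales can reveal patterns within a panel at the cost of easy comparison across panels.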

Using facet_grid()

If you want to facet your data on two variables (one defining the rows and one defining the columns of the panel layout), you can use facet_grid(). This function creates a grid of plots in which the rows and columns correspond to the values of the two variables.

ggplot(liberty_ship_data,
       aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +  # Color by shipyard
  labs(title = "Liberty Ship Production Time vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Production Days") +
  scale_x_continuous(labels = comma) +
  facet_grid(Way ~ Yard)

Here, facet_grid(Way ~ Yard) creates a grid of plots where each row corresponds to a different production platform (way) and each column corresponds to a production yard.
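The row ~ column formula can be sketched with the built-in mpg dataset, used here as a stand-in for liberty_ship_data. The choice of drv and cyl as faceting variables is only for illustration. The labeller = label_both argument prints the variable name alongside each value in the strip labels, which helps when the values alone are ambiguous.

```r
library(ggplot2)

# Rows are drive types (drv), columns are cylinder counts (cyl)
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(title = "Highway Mileage vs. Engine Size",
       x = "Engine Displacement (liters)",
       y = "Highway Miles per Gallon") +
  facet_grid(drv ~ cyl, labeller = label_both)

p_grid
```

Unlike facet_wrap(), facet_grid() draws a panel for every row and column combination, so some panels may be empty when a combination does not occur in the data.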

Why Use Facets?

  • Easy comparisons: Facets make it easy to compare the same relationship across different categories without overcrowding a single plot.
  • Clarity: By splitting the data into smaller, separate plots, facets allow you to focus on specific subgroups of the data.

Faceting is a powerful tool in ggplot2 for visualizing complex relationships in your data by dividing it into more digestible pieces. It can be particularly helpful when comparing categorical variables or exploring subsets of the data.

Try it yourself: Create Facets to Isolate Relationships


  1. Plot the relationship between willingness to pay WTP and monthly consumption Quantity.

  2. Map the variable Gym_member to color to show how gym membership affects the relationship between willingness to pay WTP and consumption Quantity.

  3. Map the length of the respondent’s workout Workout_length to shape to show how workout length affects the relationship.

  4. Add title and axes labels

    • Make the title “Willingness to Pay and Consume Muscle Cola”
    • Make the x-axis “Willingness to Pay”
    • Make the y-axis “Quantity Consumed”
  5. Set the x-axis scale to be currency in dollars ($)

  6. Create facets to separate the effects of the relationship for the preferred drink Drink_preferred of the respondents.

Hint 1

Re-build your plot from layers beginning with specifying the dataset, then specifying aesthetic mapping of the x- and y-variables as well as the aesthetic mapping of gym membership to color. Inside a new aesthetic mapping in the point geometry, map the shape of the point to the length of the workout. Then declare the geometry to plot the data points. Then add the title and labels. Then format the x-axis to have currency in dollars. Then wrap the plot facets by respondent drink preference.

Hint 2

Inside the ggplot() function, add the argument that the dataset is muscle_cola_data and the aesthetic mapping aes() that specifies that willingness to pay WTP is the x variable and Quantity is the y variable. Add an argument inside the aesthetic mapping that maps color to Gym_member. Then call the geom_point() function to plot the data as points (scatter plot). Add an aesthetic mapping of shape to Workout_length inside geom_point(). Then call the labs() function using arguments title = "Willingness to Pay and Consume Muscle Cola", x = "Willingness to Pay", y = "Quantity Consumed". Then call the scale_x_continuous() function and specify the argument that the labels are dollar currency. Then call the facet_wrap() function and specify it to create facets for each value of Drink_preferred:

 facet_wrap(~ Drink_preferred)

Fully worked solution:

As arguments to the ggplot() function, declare the data as muscle_cola_data and the aesthetic mapping as aes(x = WTP, y = Quantity, color = Gym_member). Then call the geom_point() function to plot the data as points, mapping shape to Workout_length inside it. Next, call the labs() function to add the title and axis labels. Then call the scale_x_continuous() function to format the x-axis labels as dollar currency. Finally, call the facet_wrap() function to create facets wrapped on Drink_preferred.

ggplot(muscle_cola_data,                                        # (1)
       aes(x = WTP, y = Quantity, color = Gym_member)) +        # (2)
  geom_point(aes(shape = Workout_length)) +                     # (3)
  labs(title = "Willingness to Pay and Consume Muscle Cola",    # (4)
       x = "Willingness to Pay",                                # (5)
       y = "Quantity Consumed") +                               # (6)
  scale_x_continuous(labels = dollar) +                         # (7)
  facet_wrap(~ Drink_preferred)                                 # (8)

  1. Call the ggplot() function and specify muscle_cola_data as the data.
  2. Specify the aesthetic mapping with WTP on the x-axis and Quantity on the y-axis, and map color to Gym_member.
  3. Call the geom_point() function to draw a scatter plot and add the aesthetic that maps shape to Workout_length.
  4. Call the labs() function and specify the title.
  5. Specify the x-axis label.
  6. Specify the y-axis label.
  7. Call the scale_x_continuous() function and specify the labels to be dollar.
  8. Call the facet_wrap() function and specify the facets to be created for Drink_preferred.


13.10 Themes

Themes in ggplot2 control the non-data-related visual aspects of the plot, such as background color, gridlines, text size, and font. While the default theme in ggplot2 works well in many cases, customizing the theme can make your plot more suitable for reports, presentations, or other specific formats.

Default Theme: theme_gray()

The default theme in ggplot2 is theme_gray(), which uses a gray background and white gridlines. This is a good starting point, but you might want to change the theme to fit your needs.

ggplot(liberty_ship_data, aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +  # Color by shipyard
  labs(title = "Liberty Ship Production Time vs. Labor Hours",
       x = "Direct Labor Hours",
       y = "Total Production Days") +
  scale_x_continuous(labels = comma) +
  theme_gray()

Customizing the Theme

You can use several built-in themes to quickly change the appearance of your plot. For example:

  • theme_gray(): Gray background and white gridlines. Puts the data forward to make comparisons easy. This is the default theme in ggplot2.
  • theme_bw(): White background and gray gridlines. May work better for presentations displayed with a projector.
  • theme_light(): A theme with light gray lines and axes, to direct more attention towards the data.
  • theme_dark(): Same as theme_light() but with a dark background. Useful to make thin colored lines pop out.
  • theme_minimal(): A clean, minimalistic theme with no background annotations.
  • theme_classic(): A traditional theme with only axis lines and no gridlines.
  • theme_void(): A completely empty theme, useful for non-standard plots or creative visualizations.
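One convenient way to compare these themes is to store the plot once and add a different theme to each copy. The sketch below assumes the built-in mpg dataset as a stand-in for liberty_ship_data. The base_size argument, accepted by the built-in themes, scales all text in the plot.

```r
library(ggplot2)

# Build the plot once, without a theme call
base_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Engine Size vs. Highway Mileage",
       x = "Engine Displacement (liters)",
       y = "Highway Miles per Gallon")

# Add a different built-in theme to each copy
p_bw      <- base_plot + theme_bw()
p_classic <- base_plot + theme_classic()
p_minimal <- base_plot + theme_minimal(base_size = 14)  # larger text throughout

p_bw
p_classic
p_minimal
```

Because the theme is just another layer, base_plot itself is unchanged and can be reused with any other theme.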

Customizing Specific Theme Elements

In addition to using predefined themes, you can customize individual elements of the theme, such as the background, text size, or axis lines.

ggplot(liberty_ship_data, aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +
  labs(title = "Liberty Ship Production: Custom Theme") +
  scale_x_continuous(labels = comma) +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 12),
    panel.grid.major = element_line(color = "gray", linewidth = 0.5),  # linewidth replaces size for lines in ggplot2 3.4.0 and later
    panel.background = element_rect(fill = "white"),
    legend.position = "top"
  )

In this example, we use theme_minimal() as a base and customize:

  • plot.title: Adjusting the font size and boldness of the title.
  • axis.title: Changing the font size of axis labels.
  • panel.grid.major: Customizing the major gridlines.
  • panel.background: Setting the background color.
  • legend.position: Moving the legend to the top of the plot.
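When the same adjustments are needed on several plots, one option is to wrap them in a small function. The sketch below is an illustration, not part of ggplot2: theme_report() is a hypothetical helper name, and the built-in mpg dataset stands in for liberty_ship_data.

```r
library(ggplot2)

# Hypothetical helper: bundles a base theme and a few tweaks for reuse
theme_report <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(size = 16, face = "bold"),
      legend.position = "top"
    )
}

p_report <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Custom Theme Applied to mpg") +
  theme_report()

p_report
```

Any later theme() call added after theme_report() would still override individual elements, so the helper sets defaults without locking them in.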

Using External Themes (ggthemes)

In addition to built-in themes, you can use external packages like ggthemes to apply specialized themes, such as themes that mimic professional charts (e.g., from The Economist, Wall Street Journal).

# install.packages("ggthemes") # Install and load ggthemes (if needed)
library(ggthemes)
ggplot(liberty_ship_data, aes(x = Direct_Hours, y = Total_Production_Days)) +
  geom_point(aes(color = Yard)) +  
  labs(title = "Liberty Ship Production: Economist Theme",
       x = "Direct Labor Hours",
       y = "Total Production Days") +
  scale_x_continuous(labels = comma) +
  theme_economist()  # Apply a theme from ggthemes

You can use several other themes from the ggthemes package to adopt their appearance. For example:

  • theme_excel(): a theme that mimics the default chart style in Excel, bridging the gap between ggplot2 and Excel
  • theme_tufte(): a minimalist theme that follows the data-ink principles of Edward Tufte discussed earlier in this chapter
  • theme_economist(): a theme based on the plots in The Economist magazine
  • theme_wsj(): a theme based on plots in the Wall Street Journal, with clean visuals that mirror financial and business reporting
  • theme_fivethirtyeight(): a modern theme based on the charts of the data-driven news site FiveThirtyEight, often seen in political and sports analytics
  • theme_hc(): a theme based on Highcharts JS
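To try several of these side by side, the themed variants can be collected in a list. This is a minimal sketch using the built-in mpg dataset as a stand-in for liberty_ship_data. The requireNamespace() guard is an added precaution so the code still runs, with the default theme, when ggthemes is not installed.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Engine Size vs. Highway Mileage")

# Guard in case ggthemes is not installed on this machine
if (requireNamespace("ggthemes", quietly = TRUE)) {
  themed_plots <- list(
    tufte           = base_plot + ggthemes::theme_tufte(),
    wsj             = base_plot + ggthemes::theme_wsj(),
    fivethirtyeight = base_plot + ggthemes::theme_fivethirtyeight()
  )
} else {
  themed_plots <- list(default = base_plot)
}

themed_plots[[1]]  # print one variant; index the list to see the others
```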

Why Use Themes?

  • Consistency: Applying a consistent theme across multiple plots can give your visualizations a professional look.
  • Readability: Themes help you control elements that make your plots easier to read, especially when presenting in different formats (e.g., reports, presentations).
  • Customization: You can tailor your plot’s appearance to your audience or purpose by fine-tuning individual elements.
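For consistency across many plots, ggplot2 also provides theme_set(), which changes the default theme for every plot drawn afterwards. The sketch below assumes the built-in mpg dataset and shows the pattern of saving and restoring the previous default.

```r
library(ggplot2)

# theme_set() changes the default theme and returns the previous one
old_theme <- theme_set(theme_bw())

p_a <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p_b <- ggplot(mpg, aes(x = class)) + geom_bar()

# Both plots pick up theme_bw() when printed, with no theme call on either
p_a
p_b

# Restore the previous default so later plots are unaffected
theme_set(old_theme)
```

Setting the theme once at the top of a report is usually less error-prone than repeating the same theme call on every plot.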

Recap of Themes:

  • Built-in themes: ggplot2 offers several built-in themes like theme_gray(), theme_minimal(), theme_classic(), and more.
  • External themes: Packages like ggthemes provide additional styles to make your plots look more professional or specialized.
  • Custom themes: You can customize individual elements of a theme, such as text size, gridlines, and backgrounds, to match your specific needs.

Try it yourself: Add a Theme


  1. Plot the relationship between willingness to pay WTP and monthly consumption Quantity.

  2. Map the variable Gym_member to color to show how gym membership affects the relationship between willingness to pay WTP and consumption Quantity.

  3. Map the length of the respondent’s workout Workout_length to shape to show how workout length affects the relationship.

  4. Add title and axes labels

    • Make the title “Willingness to Pay and Consume Muscle Cola”
    • Make the x-axis “Willingness to Pay”
    • Make the y-axis “Quantity Consumed”
  5. Set the x-axis scale to be currency in dollars ($)

  6. Drop the facet wrap to reduce complexity.

  7. Create the plot in at least two themes other than the default theme_gray(). Note that ggthemes is already loaded so you can choose from any themes in that library as well.

Hint 1

Re-build your plot from layers beginning with specifying the dataset, then specifying aesthetic mapping of the x- and y-variables as well as the aesthetic mapping of gym membership to color. Inside a new aesthetic mapping in the point geometry, map the shape of the point to the length of the workout. Then declare the geometry to plot the data points. Then add the title and labels. Then format the x-axis to have currency in dollars. Add a theme and evaluate your preferences for the appearance.

Hint 2

Inside the ggplot() function, add the argument that the dataset is muscle_cola_data and the aesthetic mapping aes() that specifies that willingness to pay WTP is the x variable and Quantity is the y variable. Add an argument inside the aesthetic mapping that maps color to Gym_member. Then call the geom_point() function to plot the data as points (scatter plot). Add an aesthetic mapping of shape to Workout_length inside geom_point(). Then call the labs() function using arguments title = "Willingness to Pay and Consume Muscle Cola", x = "Willingness to Pay", y = "Quantity Consumed". Then call the scale_x_continuous() function and specify the argument that the labels are dollar currency. Finally, call a theme_name() function to evaluate the appearance:

 theme_name() # replace "name" with the name of your chosen theme

Fully worked solution:

As arguments to the ggplot() function, declare the data as muscle_cola_data and the aesthetic mapping as aes(x = WTP, y = Quantity, color = Gym_member). Then call the geom_point() function to plot the data as points, mapping shape to Workout_length inside it. Next, call the labs() function to add the title and axis labels. Then call the scale_x_continuous() function to format the x-axis labels as dollar currency. Finally, call a theme function. In this solution, we call theme_fivethirtyeight().

ggplot(muscle_cola_data,                                        # (1)
       aes(x = WTP, y = Quantity, color = Gym_member)) +        # (2)
  geom_point(aes(shape = Workout_length)) +                     # (3)
  labs(title = "Willingness to Pay and Consume Muscle Cola",    # (4)
       x = "Willingness to Pay",                                # (5)
       y = "Quantity Consumed") +                               # (6)
  scale_x_continuous(labels = dollar) +                         # (7)
  theme_fivethirtyeight()                                       # (8)

  1. Call the ggplot() function and specify muscle_cola_data as the data.
  2. Specify the aesthetic mapping with WTP on the x-axis and Quantity on the y-axis, and map color to Gym_member.
  3. Call the geom_point() function to draw a scatter plot and add the aesthetic that maps shape to Workout_length.
  4. Call the labs() function and specify the title.
  5. Specify the x-axis label.
  6. Specify the y-axis label.
  7. Call the scale_x_continuous() function and specify the labels to be dollar.
  8. Call the theme_fivethirtyeight() function from ggthemes to apply the FiveThirtyEight theme.



13.11 Conclusion

In this chapter, we explored the foundational principles of the Grammar of Graphics that underpin ggplot2. By breaking down the construction of a plot into distinct components—data, aesthetics, geometric objects, labels, scales, facets, and themes—you’ve learned how to build visualizations step by step. This approach allows you to customize your plots fully, making them both informative and visually appealing.

Whether you’re transforming data, adjusting scales, or applying professional themes, ggplot2 gives you the flexibility to create compelling data visualizations tailored to your audience. As you move forward, continue experimenting with different combinations of geoms, aesthetics, and themes to refine your plots for specific needs. The ability to clearly and effectively communicate your data insights through visualization is a critical skill in entrepreneurship and beyond.