Exercise 2 Applied Data Science

Healy - Chapter 4

library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(color="gray70", aes(group = country)) +
    geom_smooth(linewidth = 1.1, method = "loess", se = FALSE) +
    scale_y_log10(labels=scales::dollar) +
    facet_wrap(~ continent, ncol = 5) +
    labs(x = "Year",
         y = "GDP per capita",
         title = "GDP per capita on Five Continents")

## `geom_smooth()` using formula = 'y ~ x'

1. Revisit the gapminder plots at the beginning of the chapter and experiment with different ways to facet the data. Try plotting population and per capita GDP while faceting on year, or even on country. In the latter case you will get a lot of panels, and plotting them straight to the screen may take a long time. Instead, assign the plot to an object and save it as a PDF file to your figures/ folder. Experiment with the height and width of the figure.

# Regular scale for GDP per capita
p_gdp_no_log <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap)) +
  geom_line(color = "gray70", aes(group = country)) +
  geom_smooth(linewidth = 1.1, method = "loess", se = FALSE) +
  scale_y_continuous(labels = scales::dollar) +  # Regular scale with dollar formatting
  facet_wrap(~ year, ncol = 4) +  # Facet by year, 4 columns
  labs(x = "Year", 
       y = "GDP per capita", 
       title = "GDP per capita Faceted by Year")

print(p_gdp_no_log)

## `geom_smooth()` using formula = 'y ~ x'
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
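
The geom_line() message above appears because each country contributes only one observation to a year panel, so there is nothing for a line to connect. A minimal sketch of an alternative follows, assuming the goal is the one stated in the exercise: plot population against per capita GDP faceted by year, then facet by country and write that plot to a PDF instead of the screen. The object names, the file name, and the width and height values are illustrative choices, not part of the original answer.

```r
library(ggplot2)
library(gapminder)

# One point per country within each year panel, so use points rather than lines
p_by_year <- ggplot(gapminder, aes(x = pop, y = gdpPercap)) +
  geom_point(alpha = 0.4) +
  scale_x_log10(labels = scales::label_comma()) +
  scale_y_log10(labels = scales::label_dollar()) +
  facet_wrap(~ year, ncol = 4) +
  labs(x = "Population (log scale)",
       y = "GDP per capita (log scale)",
       title = "Population and GDP per capita, Faceted by Year")

# Faceting by country gives one panel per country, so save it to a file
p_by_country <- ggplot(gapminder, aes(x = year, y = gdpPercap)) +
  geom_line() +
  scale_y_log10(labels = scales::label_dollar()) +
  facet_wrap(~ country, ncol = 10) +
  labs(x = "Year",
       y = "GDP per capita (log scale)",
       title = "GDP per capita, Faceted by Country")

# Create the figures/ folder if it does not exist yet, then write the PDF
dir.create("figures", showWarnings = FALSE)
ggsave(file.path("figures", "gdp_by_country.pdf"),
       plot = p_by_country, width = 20, height = 30)
```

The width and height are in inches by default. Changing them and reopening the PDF is a quick way to see how much room each panel needs.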

2. Investigate the difference between a formula written as facet_grid(sex ~ species) versus one written as facet_grid(~ sex + species).

# Load necessary libraries
library(ggplot2)
library(palmerpenguins)

# Clean the dataset by removing rows with missing values
penguins_clean <- na.omit(penguins)

# Plot using facet_grid(sex ~ species)
p1 <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species)) +
  facet_grid(sex ~ species) +  # Facet by sex (rows) and species (columns)
  labs(title = "Facet Grid: sex ~ species",
       x = "Flipper Length (mm)", 
       y = "Bill Length (mm)") +
  theme_minimal()

# Print the plot
print(p1)

# Plot using facet_grid(~ sex + species)
p2 <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species)) +
  facet_grid(~ sex + species) +  # Facet by combined sex and species in 1D
  labs(title = "Facet Grid: ~ sex + species",
       x = "Flipper Length (mm)", 
       y = "Bill Length (mm)") +
  theme_minimal()

# Print the plot
print(p2)

facet_grid(sex ~ species): This creates a 2D grid with sex controlling the rows and species controlling the columns. It is good for organizing data into a cross-classified grid.

facet_grid(~ sex + species): This creates a 1D layout in which each unique combination of sex and species gets its own column in a single row. It is more compact vertically but harder to read when there are many combinations.

3. Experiment to see what happens when you use facet_wrap() with more complex formulas like facet_wrap(~ sex + race) instead of facet_grid. Like facet_grid(), the facet_wrap() function can facet on two or more variables at once. But it will do it by laying the results out in a wrapped one-dimensional table instead of a fully cross-classified grid.

# Plot using facet_grid(sex ~ species)
p_grid <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species)) +
  facet_grid(sex ~ species) +  # Facet by sex (rows) and species (columns)
  labs(title = "Facet Grid: sex ~ species",
       x = "Flipper Length (mm)", 
       y = "Bill Length (mm)") +
  theme_minimal()

# Print the plot
print(p_grid)

# Plot using facet_wrap(~ sex + species) with 2 columns
p_wrap_ncol <- ggplot(penguins_clean, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species)) +
  facet_wrap(~ sex + species, ncol = 2) +  # Facet with 2 columns
  labs(title = "Facet Wrap: ~ sex + species (2 Columns)",
       x = "Flipper Length (mm)", 
       y = "Bill Length (mm)") +
  theme_minimal()

# Print the plot
print(p_wrap_ncol)

facet_wrap(~ sex + species) arranges the facets in a wrapped 1D sequence, which is more flexible and compact but has no row-and-column structure.

facet_grid(sex ~ species) produces a fully cross-classified 2D grid, which is more structured and easier to interpret when comparing across rows and columns.

facet_wrap() also lets you control the number of rows or columns with nrow or ncol, as in the ncol = 2 call above.
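
One further difference shows up when a combination of the faceting variables is missing from the data. The sketch below uses a small made-up data frame (the values and the names toy, p_toy_grid, and p_toy_wrap are hypothetical) in which one sex and species combination is deliberately absent. With default settings, facet_grid() still reserves a cell for every row-by-column pairing, while facet_wrap() draws a panel only for combinations that occur.

```r
library(ggplot2)

# Hypothetical toy data: the female / C combination is deliberately absent
toy <- data.frame(
  sex     = c("female", "female", "male", "male", "male"),
  species = c("A", "B", "A", "B", "C"),
  x       = c(1, 2, 3, 4, 5),
  y       = c(2, 1, 4, 3, 5)
)

p_toy_base <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# Full 2 x 3 grid; the female / C cell is drawn but left empty
p_toy_grid <- p_toy_base + facet_grid(sex ~ species)

# One panel per combination that actually appears in the data
p_toy_wrap <- p_toy_base + facet_wrap(~ sex + species)

# Panel counts implied by the two layouts
n_grid_panels <- length(unique(toy$sex)) * length(unique(toy$species))
n_wrap_panels <- nrow(unique(toy[c("sex", "species")]))
```

Printing p_toy_grid and p_toy_wrap side by side makes the empty cell in the grid version easy to spot.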

4. Frequency polygons are closely related to histograms. Instead of displaying the count of observations using bars, they display it with a series of connected lines. You can try the various geom_histogram() calls in this chapter using geom_freqpoly() instead.

# Frequency polygons of flipper length by species
p_freqpoly_species <- ggplot(penguins_clean, aes(x = flipper_length_mm, color = species)) +
  geom_freqpoly(binwidth = 5, linewidth = 1.2) +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  labs(title = "Frequency Polygon of Flipper Length by Species",
       x = "Flipper Length (mm)",
       y = "Count") +
  theme_minimal()

# Print the frequency polygon for species comparison
print(p_freqpoly_species)

# Histogram of flipper length by species
p_hist_species <- ggplot(penguins_clean, aes(x = flipper_length_mm, fill = species)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.4, color = "black") +
  labs(title = "Histogram of Flipper Length by Species",
       x = "Flipper Length (mm)",
       y = "Count") +
  theme_minimal()

# Print the histogram for species comparison
print(p_hist_species)

5. A histogram bins observations for one variable and shows bars with the count in each bin. We can do this for two variables at once, too. The geom_bin2d() function takes two mappings, x and y. It divides your plot into a grid and colors the bins by the count of observations in them. Try using it on the gapminder data to plot life expectancy versus per capita GDP. Like a histogram, you can vary the number or width of the bins for both x and y. Instead of saying bins = 30 or binwidth = 1, provide a number for both x and y with, for example, bins = c(20, 50). If you specify binwidth instead, you will need to pick values that are on the same scale as the variable you are mapping.

# Basic 2D histogram using geom_bin2d()
p_bin2d <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_bin2d() +
  scale_x_log10() +  # Log scale for GDP per capita
  scale_fill_continuous(type = "viridis") +  # Use viridis color scale for bin counts
  labs(title = "2D Histogram of Life Expectancy vs GDP per Capita",
       x = "GDP per Capita (log scale)",
       y = "Life Expectancy",
       fill = "Count") +
  theme_minimal()

# Print the plot
print(p_bin2d)
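
The plot above uses the default binning. As a sketch of the variations the exercise describes, the code below sets the number of bins per axis with bins = c(20, 50) and then sets bin widths instead. The widths 0.25 and 5 are illustrative values. Because the x axis uses scale_x_log10(), the x width is assumed to be interpreted on the transformed scale, that is, in log10 units rather than dollars.

```r
library(ggplot2)
library(gapminder)

# Number of bins: 20 along x (GDP per capita), 50 along y (life expectancy)
p_bin2d_bins <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_bin2d(bins = c(20, 50)) +
  scale_x_log10(labels = scales::label_dollar()) +
  labs(x = "GDP per Capita (log scale)",
       y = "Life Expectancy",
       fill = "Count")

# Bin widths: 0.25 on the log10 x scale, 5 years on the y scale
p_bin2d_width <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_bin2d(binwidth = c(0.25, 5)) +
  scale_x_log10(labels = scales::label_dollar()) +
  labs(x = "GDP per Capita (log scale)",
       y = "Life Expectancy",
       fill = "Count")

print(p_bin2d_bins)
print(p_bin2d_width)
```

Fewer, wider bins smooth the picture at the cost of detail. Many narrow bins show more structure but leave more cells with very small counts.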

6. Density estimates can also be drawn in two dimensions. The geom_density_2d() function draws contour lines estimating the joint distribution of two variables. Try it with the midwest data, for example, plotting percent below the poverty line (percbelowpoverty) against percent college-educated (percollege). Try it with and without a geom_point() layer.

# Load necessary libraries
library(ggplot2)
# Load the midwest dataset
data("midwest")

# Filled density plot
p_density2d_filled <- ggplot(midwest, aes(x = percbelowpoverty, y = percollege)) +
  geom_point(alpha = 0.5, color = "blue") +  # Add points with transparency
  geom_density_2d_filled(alpha = 0.4) +  # Add filled density contours with transparency
  labs(title = "Filled 2D Density Estimate: Poverty vs College Education",
       x = "Percent Below Poverty Line",
       y = "Percent College Educated") +
  theme_minimal()

# Print the plot
print(p_density2d_filled)
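
The exercise asks for geom_density_2d() contour lines, with and without a point layer, whereas the plot above uses the filled variant. A minimal sketch of the two requested versions follows, keeping the same axis assignment as the code above. The object names are illustrative, and drawing the plots assumes the MASS package is installed, since ggplot2 relies on it for the 2D density estimate.

```r
library(ggplot2)

# Contour lines only
p_density2d_lines <- ggplot(midwest, aes(x = percbelowpoverty, y = percollege)) +
  geom_density_2d() +
  labs(title = "2D Density Contours: Poverty vs College Education",
       x = "Percent Below Poverty Line",
       y = "Percent College Educated") +
  theme_minimal()

# Points first, contour lines drawn on top of them
p_density2d_points <- ggplot(midwest, aes(x = percbelowpoverty, y = percollege)) +
  geom_point(alpha = 0.3) +
  geom_density_2d() +
  labs(title = "2D Density Contours with Points: Poverty vs College Education",
       x = "Percent Below Poverty Line",
       y = "Percent College Educated") +
  theme_minimal()
```

Print each object to compare them. The version with points shows the individual counties that the contours summarize, including those lying outside the outermost contour.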

Healy - Chapter 5

1. The subset() function is very useful when used in conjunction with a series of layered geoms. Go back to your code for the Presidential Elections plot (Figure 5.18) and redo it so that it shows all the data points but only labels elections since 1992. You might need to look again at the elections_historic data to see what variables are available to you. You can also experiment with subsetting by political party, or changing the colors of the points to reflect the winning party.

# Load necessary libraries
library(ggplot2)
library(ggrepel)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(socviz)  # Ensure the socviz package is loaded

# Set up plot labels and titles
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"

# Base plot with all data points
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct, label = winner_label)) +
  geom_hline(yintercept = 0.5, linewidth = 1.4, color = "gray80") +  # Horizontal line at 50%
  geom_vline(xintercept = 0.5, linewidth = 1.4, color = "gray80") +  # Vertical line at 50%
  geom_point() +  # Plot points for all elections
  scale_x_continuous(labels = scales::percent) +  # Format x-axis labels as percentages
  scale_y_continuous(labels = scales::percent) +  # Format y-axis labels as percentages
  labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle, caption = p_caption)

# Add labels only for elections since 1992
p + geom_text_repel(data = subset(elections_historic, year >= 1992))

2. Use geom_point() and reorder() to make a Cleveland dot plot of all Presidential elections, ordered by share of the popular vote.

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(socviz)  # Ensure the socviz package is loaded

# Cleveland dot plot ordered by winner's share of the popular vote
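# winner_label combines the winner's name and election year, so each election gets its own row
# (winner alone would collapse presidents who won more than once onto a single row)
ggplot(elections_historic, aes(x = reorder(winner_label, popular_pct), y = popular_pct)) +
  geom_point(size = 1, color = "black") +  # Plot points
  coord_flip() +  # Flip the coordinates for better readability
  scale_y_continuous(labels = scales::percent) +  # Format y-axis as percentages
  labs(x = "Election winner", y = "Winner's Share of Popular Vote",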
       title = "Presidential Elections Ordered by Popular Vote Share",
       subtitle = "1824-2016",
       caption = "Source: socviz package") +
  theme_minimal()  # Apply a minimal theme for a clean look

3. Try using annotate() to add a rectangle that lightly colors the entire upper left quadrant of Figure 5.18.

# Load necessary libraries
library(ggplot2)
library(ggrepel)
library(dplyr)
library(socviz)  # Ensure the socviz package is loaded

# Set up plot labels and titles
p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"

# Base plot with all data points
p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct, label = winner_label)) +
  geom_hline(yintercept = 0.5, linewidth = 1.4, color = "gray80") +  # Horizontal line at 50%
  geom_vline(xintercept = 0.5, linewidth = 1.4, color = "gray80") +  # Vertical line at 50%

  # Add a light blue rectangle in the upper-left quadrant
  annotate("rect", xmin = 0, xmax = 0.5, ymin = 0.5, ymax = 1, 
           alpha = 0.2, fill = "lightblue") +  # Lightly shade the upper-left quadrant

  geom_point() +  # Plot points for all elections
  geom_text_repel() +  # Add labels to the points
  scale_x_continuous(labels = scales::percent) +  # Format x-axis labels as percentages
  scale_y_continuous(labels = scales::percent) +  # Format y-axis labels as percentages
  labs(x = x_label, y = y_label, title = p_title, subtitle = p_subtitle, caption = p_caption) +
  theme_minimal()  # Apply a minimal theme for a clean look

# Print the plot
p

## Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

4. The main action verbs in the dplyr library are group_by(), filter(), select(), summarize(), and mutate(). Practice with them by revisiting the gapminder data to see if you can reproduce a pair of graphs from Chapter One, shown here again in Figure 5.28. You will need to filter some rows, group the data by continent, and calculate the mean life expectancy by continent before beginning the plotting process.

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gapminder)  # Ensure the gapminder dataset is loaded

# Step 1: Keep 2007 only, group by continent, and calculate the average life expectancy
continent_lifeExp <- gapminder |> 
  filter(year == 2007) |>  # The axis label below refers to 2007
  group_by(continent) |>  # Group by continent
  summarize(mean_lifeExp = mean(lifeExp, na.rm = TRUE))  # Calculate mean life expectancy

# Step 2: Create the bar plot
ggplot(continent_lifeExp, aes(x = mean_lifeExp, y = reorder(continent, mean_lifeExp))) +
  geom_col() +  # Bar lengths come directly from mean_lifeExp
  labs(x = "Life Expectancy in years, 2007",
       y = "Continent") +
  theme_minimal()  # Apply a minimal theme for a clean look

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gapminder)  # Ensure the gapminder dataset is loaded

# Step 1: Filter the data for the year 2007 and calculate the average life expectancy per continent
continent_lifeExp_2007 <- gapminder |> 
  filter(year == 2007) |> 
  group_by(continent)  |> 
  summarize(mean_lifeExp = mean(lifeExp))

# Step 2: Create the scatterplot, with continents ordered by life expectancy
ggplot(continent_lifeExp_2007, aes(x = mean_lifeExp, y = reorder(continent, mean_lifeExp))) +
  geom_point(size = 1) +  # Plot a single point for each continent
  labs(x = "Average Life Expectancy (Years)",
       y = "Continent") +
  theme_minimal()

5. Get comfortable with grouping, mutating, and summarizing data in pipelines. This will become a routine task as you work with your data. There are many ways that tables can be aggregated and transformed. Remember group_by() groups your data from left to right, with the rightmost or innermost group being the level calculations will be done at; mutate() adds a column at the current level of grouping; and summarize() aggregates to the next level up. Try creating some grouped objects from the GSS data, calculating frequencies as you learned in this Chapter, and then check to see if the totals are what you expect.

# Step 1: Group by gender and political party affiliation, calculate frequencies and percentages
gender_party_freq <- gss_sm |> 
  group_by(sex, partyid) |>   # Group by gender and political party
  summarize(N = n())  |>   # Count the number of occurrences in each group
  group_by(sex) |>   # Group by gender again for percentage calculation
  mutate(pct = round(N / sum(N) * 100, 1))  # Calculate percentage within each gender group

## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.

# View the resulting grouped data
print(gender_party_freq)

## # A tibble: 18 × 4
## # Groups:   sex [2]
##    sex    partyid                N   pct
##    <fct>  <fct>              <int> <dbl>
##  1 Male   Strong Democrat      169  13.2
##  2 Male   Not Str Democrat     193  15.1
##  3 Male   Ind,near Dem         188  14.7
##  4 Male   Independent          211  16.5
##  5 Male   Ind,near Rep         156  12.2
##  6 Male   Not Str Republican   171  13.4
##  7 Male   Strong Republican    130  10.2
##  8 Male   Other Party           45   3.5
##  9 Male   <NA>                  13   1  
## 10 Female Strong Democrat      294  18.5
## 11 Female Not Str Democrat     303  19  
## 12 Female Ind,near Dem         217  13.6
## 13 Female Independent          262  16.5
## 14 Female Ind,near Rep         136   8.5
## 15 Female Not Str Republican   193  12.1
## 16 Female Strong Republican    140   8.8
## 17 Female Other Party           27   1.7
## 18 Female <NA>                  19   1.2

# Step 2: Check if the totals match our expectations
# Check the total number of observations for each gender
gender_totals <- gender_party_freq  |> 
  group_by(sex)  |> 
  summarize(total = sum(N))  # Summing the counts to check if the totals match

# View the total counts for each gender
print(gender_totals)

## # A tibble: 2 × 2
##   sex    total
##   <fct>  <int>
## 1 Male    1276
## 2 Female  1591

# Compare with the overall total number of observations in the dataset
overall_total <- nrow(gss_sm)
print(paste("Overall total number of observations:", overall_total))

## [1] "Overall total number of observations: 2867"

6. This code is similar to what you saw earlier, but a little more compact. (We calculate the pct values directly.) Check the results are as you expect by grouping by race and summing the percentages. Try doing the same exercise grouping by sex or region.

# Load necessary libraries
library(dplyr)

# Group by race and degree, calculate N and percentage within each race
race_degree_freq <- gss_sm %>%
  group_by(race, degree) %>%
  summarize(N = n()) %>%
  group_by(race) %>%
  mutate(pct = round(N / sum(N) * 100, 0))  # Calculate percentage

## `summarise()` has grouped output by 'race'. You can override using the
## `.groups` argument.

# View the resulting data
print(race_degree_freq)

## # A tibble: 18 × 4
## # Groups:   race [3]
##    race  degree             N   pct
##    <fct> <fct>          <int> <dbl>
##  1 White Lt High School   197     9
##  2 White High School     1057    50
##  3 White Junior College   166     8
##  4 White Bachelor         426    20
##  5 White Graduate         250    12
##  6 White <NA>               4     0
##  7 Black Lt High School    60    12
##  8 Black High School      292    60
##  9 Black Junior College    33     7
## 10 Black Bachelor          71    14
## 11 Black Graduate          31     6
## 12 Black <NA>               3     1
## 13 Other Lt High School    71    26
## 14 Other High School      112    40
## 15 Other Junior College    17     6
## 16 Other Bachelor          39    14
## 17 Other Graduate          37    13
## 18 Other <NA>               1     0

# Step 2: Check if percentages sum to 100% within each race
race_degree_check <- race_degree_freq %>%
  group_by(race) %>%
  summarize(total_pct = sum(pct))  # Sum the percentages for each race

# View the result
print(race_degree_check)

## # A tibble: 3 × 2
##   race  total_pct
##   <fct>     <dbl>
## 1 White        99
## 2 Black       100
## 3 Other        99
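
The totals of 99 are a rounding effect, not a counting error: each percentage is rounded to a whole number before the percentages are summed. The short check below uses only the six counts from the White rows of the table printed above. The variable names are illustrative.

```r
# Counts for the White rows, copied from the printed table above
n_white <- c(197, 1057, 166, 426, 250, 4)

pct_exact   <- n_white / sum(n_white) * 100  # Unrounded percentages
pct_rounded <- round(pct_exact, 0)           # What the pipeline above stores in pct

sum(pct_exact)    # 100, up to floating-point error
sum(pct_rounded)  # 99
```

Summing the unrounded percentages, or rounding only for display, avoids the discrepancy.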

# Group by sex and degree, calculate N and percentage within each sex
sex_degree_freq <- gss_sm %>%
  group_by(sex, degree) %>%
  summarize(N = n()) %>%
  group_by(sex) %>%
  mutate(pct = round(N / sum(N) * 100, 0))  # Calculate percentage

## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.

# View the resulting data
print(sex_degree_freq)

## # A tibble: 12 × 4
## # Groups:   sex [2]
##    sex    degree             N   pct
##    <fct>  <fct>          <int> <dbl>
##  1 Male   Lt High School   147    12
##  2 Male   High School      662    52
##  3 Male   Junior College    89     7
##  4 Male   Bachelor         243    19
##  5 Male   Graduate         132    10
##  6 Male   <NA>               3     0
##  7 Female Lt High School   181    11
##  8 Female High School      799    50
##  9 Female Junior College   127     8
## 10 Female Bachelor         293    18
## 11 Female Graduate         186    12
## 12 Female <NA>               5     0

# Step 2: Check if percentages sum to 100% within each sex
sex_degree_check <- sex_degree_freq %>%
  group_by(sex) %>%
  summarize(total_pct = sum(pct))  # Sum the percentages for each sex

# View the result
print(sex_degree_check)

## # A tibble: 2 × 2
##   sex    total_pct
##   <fct>      <dbl>
## 1 Male         100
## 2 Female        99

7. Try summary calculations with functions other than sum. Can you calculate the mean and median number of children by degree? (Hint: the childs variable in gss_sm has children as a numeric value.)

# Load necessary libraries
library(dplyr)

# Step 1: Group by degree and calculate mean and median number of children
degree_children_summary <- gss_sm %>%
  group_by(degree) %>%  # Group by degree
  summarize(
    mean_children = mean(childs, na.rm = TRUE),   # Calculate mean number of children
    median_children = median(childs, na.rm = TRUE) # Calculate median number of children
  )

# Step 2: View the resulting summary
print(degree_children_summary)

## # A tibble: 6 × 3
##   degree         mean_children median_children
##   <fct>                  <dbl>           <dbl>
## 1 Lt High School          2.81               3
## 2 High School             1.86               2
## 3 Junior College          1.77               2
## 4 Bachelor                1.45               1
## 5 Graduate                1.52               2
## 6 <NA>                    3.6                4

8. dplyr has a large number of helper functions that let you summarize data in many different ways. The vignette on window functions included with the dplyr documentation is a good place to begin learning about these. You should also look at Chapter 3 of Wickham & Grolemund (2016) for more information on transforming data with dplyr.

# Load necessary libraries
library(dplyr)

# Example: Create a lagged column to compare the number of children with the previous row
gss_sm %>%
  arrange(year) %>%  # Ensure the data is ordered by year for time-based operations
  mutate(lagged_childs = lag(childs)) %>%  # Create a new column with the lagged value of childs
  select(year, childs, lagged_childs) %>%  # Select relevant columns for display
  head(10)  # Show the first 10 rows

## # A tibble: 10 × 3
##     year childs lagged_childs
##    <dbl>  <dbl>         <dbl>
##  1  2016      3            NA
##  2  2016      0             3
##  3  2016      2             0
##  4  2016      4             2
##  5  2016      2             4
##  6  2016      2             2
##  7  2016      2             2
##  8  2016      3             2
##  9  2016      3             3
## 10  2016      4             3
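
Every row shown above is from 2016, so the lagged value there simply comes from the previous respondent. Window functions are usually more informative within groups that have a natural order. The sketch below uses a small made-up table (the data and the column names unit, year, and value are hypothetical) to show lag(), min_rank(), and cumsum() applied within each group.

```r
library(dplyr)

# Hypothetical toy data: two units observed over three years
toy <- tibble(
  unit  = c("a", "a", "a", "b", "b", "b"),
  year  = c(2000, 2001, 2002, 2000, 2001, 2002),
  value = c(10, 12, 15, 7, 7, 4)
)

toy_windowed <- toy |>
  group_by(unit) |>
  arrange(year, .by_group = TRUE) |>        # Order rows by year within each unit
  mutate(
    change        = value - lag(value),     # Difference from the previous year (NA in the first year)
    rank_in_unit  = min_rank(desc(value)),  # 1 = largest value within the unit; ties share a rank
    running_total = cumsum(value)           # Cumulative sum within the unit
  ) |>
  ungroup()

print(toy_windowed)
```

Because the data are grouped before mutate(), each function restarts at the first row of every unit instead of running across the whole table.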

9. Experiment with the gapminder data to practice some of the new geoms we have learned. Try examining population or life expectancy over time using a series of boxplots. (Hint: you may need to use the group aesthetic in the aes() call.) Can you facet this boxplot by continent? Is anything different if you create a tibble from gapminder that explicitly groups the data by year and continent first, and then create your plots with that?

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gapminder)

# Boxplot of life expectancy over time
ggplot(gapminder, aes(x = factor(year), y = lifeExp, group = year)) +
  geom_boxplot() +
  labs(title = "Life Expectancy Over Time",
       x = "Year",
       y = "Life Expectancy") +
  theme_minimal()

# Facet the boxplot by continent
ggplot(gapminder, aes(x = factor(year), y = lifeExp, group = year)) +
  geom_boxplot() +
  labs(title = "Life Expectancy Over Time by Continent",
       x = "Year",
       y = "Life Expectancy") +
  facet_wrap(~ continent) +  # Facet by continent
  theme_minimal()

# Group the data by year and continent
# (gapminder is already a tibble; calling as_tibble() here would drop the grouping)
gapminder_grouped <- gapminder %>%
  group_by(year, continent)

# Create the same boxplot after grouping
ggplot(gapminder_grouped, aes(x = factor(year), y = lifeExp, group = year)) +
  geom_boxplot() +
  labs(title = "Life Expectancy Over Time by Continent (Grouped Data)",
       x = "Year",
       y = "Life Expectancy") +
  facet_wrap(~ continent) +  # Facet by continent
  theme_minimal()

Without grouping: The boxplot facets by continent and shows the distribution of life expectancy over time. Each year within a continent has its own boxplot.

With grouping: The plot looks the same, because ggplot2 does not use dplyr's grouping metadata; it relies only on the aesthetics and facets in the plot specification. Grouping beforehand pays off only when you compute something with it first (e.g., means, medians, or other summary statistics) and then plot the result.

10. Read the help page for geom_boxplot() and take a look at the notch and varwidth options. Try them out to see how they change the look of the plot.

# Boxplot with notches and variable width
ggplot(gapminder, aes(x = factor(year), y = lifeExp, group = year)) +
  geom_boxplot(notch = TRUE, varwidth = TRUE) +  # Combine notches and variable width
  labs(title = "Boxplot of Life Expectancy Over Time (With Notches and Variable Width)",
       x = "Year",
       y = "Life Expectancy") +
  theme_minimal()

11. As an alternative to geom_boxplot() try geom_violin() for a similar plot, but with a mirrored density distribution instead of a box and whiskers.

# Violin plot with boxplot overlay
ggplot(gapminder, aes(x = factor(year), y = lifeExp)) +
  geom_violin(trim = FALSE, fill = "lightblue") +  # Violin plot showing full density
  geom_boxplot(width = 0.1, outlier.shape = NA) +  # Add boxplot inside the violin plot
  labs(title = "Violin Plot of Life Expectancy Over Time with Boxplot Overlay",
       x = "Year",
       y = "Life Expectancy") +
  theme_minimal()

12. geom_pointrange() is one of a family of related geoms that produce different kinds of error bars and ranges, depending on your specific needs. They include geom_linerange(), geom_crossbar(), and geom_errorbar(). Try them out using gapminder or organdata to see how they differ.

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gapminder)

# Calculate the mean and standard deviation of life expectancy by year and continent
gapminder_summary <- gapminder %>%
  group_by(year, continent) %>%
  summarize(
    mean_lifeExp = mean(lifeExp, na.rm = TRUE),
    sd_lifeExp = sd(lifeExp, na.rm = TRUE)
  ) %>%
  mutate(
    lower = mean_lifeExp - sd_lifeExp,  # Lower bound (mean - 1 sd)
    upper = mean_lifeExp + sd_lifeExp   # Upper bound (mean + 1 sd)
  )

## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

# Plot using geom_pointrange()
ggplot(gapminder_summary, aes(x = factor(year), y = mean_lifeExp, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  labs(title = "Life Expectancy Over Time (Using geom_pointrange)",
       x = "Year",
       y = "Mean Life Expectancy") +
  facet_wrap(~ continent) +
  theme_minimal()

# Plot using geom_linerange()
ggplot(gapminder_summary, aes(x = factor(year), y = mean_lifeExp, ymin = lower, ymax = upper)) +
  geom_linerange() +
  geom_point() +  # Add points to represent the mean value
  labs(title = "Life Expectancy Over Time (Using geom_linerange and geom_point)",
       x = "Year",
       y = "Mean Life Expectancy") +
  facet_wrap(~ continent) +
  theme_minimal()

# Plot using geom_crossbar()
ggplot(gapminder_summary, aes(x = factor(year), y = mean_lifeExp, ymin = lower, ymax = upper)) +
  geom_crossbar(width = 0.5) +  # y, ymin, and ymax are inherited from ggplot()
  labs(title = "Life Expectancy Over Time (Using geom_crossbar)",
       x = "Year",
       y = "Mean Life Expectancy") +
  facet_wrap(~ continent) +
  theme_minimal()

# Plot using geom_errorbar()
ggplot(gapminder_summary, aes(x = factor(year), y = mean_lifeExp, ymin = lower, ymax = upper)) +
  geom_errorbar(width = 0.2) +  # Error bars with horizontal lines at the ends
  geom_point() +  # Add points to represent the mean value
  labs(title = "Life Expectancy Over Time (Using geom_errorbar)",
       x = "Year",
       y = "Mean Life Expectancy") +
  facet_wrap(~ continent) +
  theme_minimal()

geom_pointrange(): Shows both the point (mean) and the range (lower to upper bounds) in a single geom. Useful when you want a compact representation of the range and the central value.

geom_linerange(): Only shows the range (lower to upper bounds), without the point, which is why the plot above adds geom_point() for the mean.

geom_crossbar(): Draws a box spanning the lower and upper bounds, with a horizontal line across it at the mean. The bar emphasizes the central value.

geom_errorbar(): The most traditional way to represent error bars, with a vertical line from the lower to the upper bound and small horizontal caps at the ends. Often combined with geom_point() to show the mean.

Healy - Chapter 6

out <- lm(formula = lifeExp ~ log(gdpPercap) + pop + continent, data = gapminder)
plot(out, which = c(1,2), ask=FALSE)

Tamrin Cheng

2024-10-13
