This lab uses the dslabs package (for the gapminder dataset) and tidyverse. Both are loaded above. Remember to add your name to the file.

1. Boxplots and outliers

Boxplots provide a simple way to identify outliers. Here we will see why outliers may be easier to identify in boxplots than scatterplots.

  1. Using the gapminder dataset, make a scatterplot of year (x-axis) and infant mortality (y-axis) using data for the year 2015. Are there any outliers (extreme observations)?
gapminder %>% 
  filter(year == 2015) %>% 
  ggplot(aes(x = year, y = infant_mortality)) + 
  geom_point()
## Warning: Removed 7 rows containing missing values (`geom_point()`).

  1. Now repeat the exercise using a boxplot instead of a scatterplot. Are there outliers? No outliers.
gapminder %>% 
  filter(year == 2015) %>% 
  ggplot(aes(x = "", y = infant_mortality)) + 
  geom_boxplot()
## Warning: Removed 7 rows containing non-finite values (`stat_boxplot()`).

  1. Now make the same boxplot using data for ten years. Pick any decade that you like. Has there been any change over time?
gapminder %>% 
  filter(year >= 2005 & year <= 2015) %>% 
  # Map year to the x-axis as a factor so each year gets its own box
  ggplot(aes(x = factor(year), y = infant_mortality)) + 
  geom_boxplot()
## Warning: Removed 77 rows containing non-finite values (`stat_boxplot()`).
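
As a rough check on what a boxplot flags as an outlier, the sketch below applies the 1.5 × IQR rule that geom_boxplot uses by default. The vector x is made up for illustration and is not taken from gapminder; lower_fence and upper_fence are names chosen here.

```r
# Toy values, invented for illustration (not gapminder data)
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 60, NA)

# First and third quartiles, ignoring missing values
q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

outliers <- x[!is.na(x) & (x < lower_fence | x > upper_fence)]
outliers
```

Replacing x with the infant_mortality values for a single year should list the points that the boxplot draws individually, if there are any.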

2. Faceted barplot

In line with the exercise in class, create a bar plot of the number of flights per carrier and airport of origin. Instead of using both variables within a single bar plot as in class, add a facet_wrap to draw a separate panel for each airport.

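# Assumption: flights comes from the nycflights13 package,
# which is not among the packages listed at the top of this lab
library(nycflights13)

flights %>% 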
  ggplot(aes(x = carrier)) + 
  geom_bar() + 
  facet_wrap(~ origin)
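
To see the numbers behind the bars, the sketch below tabulates a tiny made-up data frame by origin and carrier. geom_bar counts rows per carrier, and facet_wrap(~ origin) repeats that count separately for each origin. The data frame toy and its values are hypothetical and are not the real flights data.

```r
# Hypothetical mini data set, invented for illustration
toy <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA", "DL"),
  origin  = c("JFK", "LGA", "JFK", "JFK", "LGA", "LGA")
)

# One row per origin (one facet), one column per carrier (one bar)
counts <- table(origin = toy$origin, carrier = toy$carrier)
counts
```

Each row of the table corresponds to one panel of the faceted plot, and each cell is the height of one bar.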

3. Mean vs. median

  1. Compute the mean of population in 1960 and assign it to object mean_pop. Remove NAs; read the help for mean if you do not already know how to do it.
mean_pop <- gapminder %>% 
  filter(year == 1960) %>% 
  pull(population) %>% 
  mean(na.rm = TRUE)
  1. Compute the median of population in 1960 and assign it to object median_pop. Is it greater or smaller than the mean? What does this tell you?
median_pop <- gapminder %>% 
  filter(year == 1960) %>% 
  pull(population) %>% 
  median(na.rm = TRUE)
median_pop
## [1] 3075752
  1. Create a density plot of population in 1960 using geom_density. A density plot is a way of representing the distribution of a numeric variable. Add a vertical line at the value of mean_pop and another one at the value of median_pop. Use geom_vline to do so, and use as.numeric around mean_pop and median_pop. What do you observe?
gapminder %>% 
  filter(year == 1960) %>% 
  ggplot(aes(x = population)) + 
  geom_density() + 
  geom_vline(xintercept = as.numeric(mean_pop), color = "red") +
  geom_vline(xintercept = as.numeric(median_pop), color = "blue")
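
The gap between the two vertical lines can be illustrated with a minimal sketch on made-up numbers. The vector pop below is invented for illustration and is not gapminder data. It shows how a single very large value pulls the mean upward while the median barely moves, and why na.rm = TRUE is needed when values are missing.

```r
# Toy right-skewed values with one missing entry (invented for illustration)
pop <- c(1, 2, 3, 4, 100, NA)

# Without na.rm = TRUE, any NA makes the result NA
mean_with_na <- mean(pop)

# Dropping NAs: the single large value pulls the mean far above the median
toy_mean   <- mean(pop, na.rm = TRUE)    # 22
toy_median <- median(pop, na.rm = TRUE)  # 3

c(mean = toy_mean, median = toy_median)
```

When the mean is well above the median, as here, the distribution has a long right tail: a few very large values dominate the average.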