Submit this file (after adding your name after “author”) using Canvas. Make sure to label your plots!

  1. (10 points) Load the library dslabs and define the dataset gapminder as gapminder <- as_tibble(gapminder). There is a different dataset, also called gapminder, in the gapminder package, so make sure you load only dslabs. To avoid confusion, you can write dslabs::gapminder to make the package explicit. The dslabs version has yearly observations, whereas the gapminder package version reports data every 5 years.
library(dslabs)
library(dplyr)  # needed for as_tibble(), %>%, mutate(), group_by() and summarize()

gapminder <- as_tibble(gapminder)

To look at the dataset, you can do head(gapminder). If you just write dslabs::gapminder, R will print the whole dataset. Please try to avoid this.
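
A few ways to inspect the data without printing all of it, sketched under the assumption that dslabs and dplyr are installed. The name gm is illustrative only and is not part of the assignment.

```r
library(dslabs)
library(dplyr)

# `gm` is an illustrative name; the dslabs:: prefix makes the source package explicit
gm <- as_tibble(dslabs::gapminder)

dim(gm)       # number of rows and columns
head(gm, 5)   # first five rows only
glimpse(gm)   # one line per column: name, type, and the first few values
```

Because gm is a tibble, typing gm on its own also prints only the first ten rows rather than the whole dataset.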

  1. (10 points) Create a new variable called gdp_per_cap corresponding to gdp divided by population.
gapminder <- gapminder %>%
  mutate(gdp_per_cap = gdp / population)
  1. (10 points) Compute the median life expectancy per continent per year. Make sure to remove missing values. Assign the newly created dataset to a new name.
gapminder_m <- gapminder %>%
  group_by(continent, year) %>%
  summarize(median_life_exp = median(life_expectancy, na.rm = TRUE))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
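
The message above is informational, not an error: after summarizing by two grouping variables, the result is still grouped by the first one. A small sketch on made-up data (the tibble toy and its values are invented for illustration) shows the difference the `.groups` argument makes:

```r
library(dplyr)

# Made-up data, not taken from gapminder
toy <- tibble(
  continent       = c("A", "A", "A", "B"),
  year            = c(2000L, 2000L, 2001L, 2000L),
  life_expectancy = c(60, NA, 70, 80)
)

# Default behavior: the last grouping variable (year) is dropped,
# so the result stays grouped by continent and the message is shown
by_default <- toy %>%
  group_by(continent, year) %>%
  summarize(median_life_exp = median(life_expectancy, na.rm = TRUE))

# .groups = "drop" returns an ungrouped tibble and silences the message
dropped <- toy %>%
  group_by(continent, year) %>%
  summarize(
    median_life_exp = median(life_expectancy, na.rm = TRUE),
    .groups = "drop"
  )

group_vars(by_default)
group_vars(dropped)
```

Dropping the groups is usually the safer choice when the summarized data will be passed on to further steps that should not operate per continent.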
  1. (10 points) Plot the average life expectancy over time, using a different color for each continent. Don’t forget to label your axes. (You may want to drop observations for 2016, here or at an earlier point, if you want to avoid a warning saying that there are NA values. This comes from missing values for 2016.)
library(ggplot2)
gapminder_no2016 <- gapminder %>%
  filter(year != 2016)
# mutate() keeps every country-year row and attaches the continent-year mean to each one
avg_lifeexp <- gapminder_no2016 %>%
  group_by(continent, year) %>%
  mutate(avg_life_exp = mean(life_expectancy, na.rm = TRUE))

ggplot(avg_lifeexp, aes(x = year, y = avg_life_exp, color = continent)) +
  geom_line() +
  labs(
    title = "Average Life Expectancy over Time in Each Continent",
    x = "Year",
    y = "Average Life Expectancy"
  )

avg_lifeexp
## # A tibble: 10,360 × 11
## # Groups:   continent, year [280]
##    country   year infant_mortality life_expectancy fertility population      gdp
##    <fct>    <int>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
##  1 Albania   1960            115.             62.9      6.19    1636054 NA      
##  2 Algeria   1960            148.             47.5      7.65   11124892  1.38e10
##  3 Angola    1960            208              36.0      7.32    5270844 NA      
##  4 Antigua…  1960             NA              63.0      4.43      54681 NA      
##  5 Argenti…  1960             59.9            65.4      3.11   20619075  1.08e11
##  6 Armenia   1960             NA              66.9      4.55    1867396 NA      
##  7 Aruba     1960             NA              65.7      4.82      54208 NA      
##  8 Austral…  1960             20.3            70.9      3.45   10292328  9.67e10
##  9 Austria   1960             37.3            68.8      2.7     7065525  5.24e10
## 10 Azerbai…  1960             NA              61.3      5.57    3897889 NA      
## # ℹ 10,350 more rows
## # ℹ 4 more variables: continent <fct>, region <fct>, gdp_per_cap <dbl>,
## #   avg_life_exp <dbl>
  1. (10 points) Create a histogram of fertility rates in 2010. Within the appropriate geom_* set:

Call this graph g.

g <- gapminder %>%
  filter(year == 2010) %>%
  ggplot(aes(x = fertility)) +
  geom_histogram(
    binwidth = 0.5,
    color = "white",
    fill = "#d90502"
  ) +
  labs(
    title = "Histogram of Fertility Rates in 2010",
    x = "Fertility Rate",
    y = "Count"
  ) +
  theme_minimal()

g

  1. (10 points) Using the previous graph g, facet it by continent such that each continent’s plot is a new row. (Hint: check for help for facet_grid)
g <- g + facet_grid(rows = vars(continent))
g
  1. (10 points) Create a scatter plot of fertility rate (y-axis) with respect to gdp per capita (x-axis) in 2010. Within the appropriate geom_*, set

What do you see?

scatter_2010 <- gapminder %>%
  filter(year == 2010) %>%
  ggplot(aes(x = gdp_per_cap, y = fertility)) +
  geom_point(
    size = 3,
    alpha = 0.5,
    color = "#009E73"
  ) +
  labs(
    title = "Scatter Plot of Fertility Rate by GDP per Capita in 2010",
    x = "GDP per Capita",
    y = "Fertility Rate"
  ) +
  theme_minimal()  # the theme is added to the plot, outside labs()
scatter_2010
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_point()`).
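
GDP per capita tends to be strongly right-skewed, so on a linear axis many countries bunch up near the left edge. One optional variation, not required by the question, is to put the x axis on a log10 scale. The sketch below assumes dslabs, dplyr and ggplot2 are installed; the name scatter_2010_log is illustrative.

```r
library(dslabs)
library(dplyr)
library(ggplot2)

# Illustrative variant of the scatter plot with a log10 x axis
scatter_2010_log <- as_tibble(dslabs::gapminder) %>%
  mutate(gdp_per_cap = gdp / population) %>%
  # drop rows with missing values up front to avoid the "Removed rows" warning
  filter(year == 2010, !is.na(gdp_per_cap), !is.na(fertility)) %>%
  ggplot(aes(x = gdp_per_cap, y = fertility)) +
  geom_point(size = 3, alpha = 0.5, color = "#009E73") +
  scale_x_log10() +
  labs(
    title = "Fertility Rate by GDP per Capita in 2010 (log scale)",
    x = "GDP per Capita (log10 scale)",
    y = "Fertility Rate"
  ) +
  theme_minimal()

scatter_2010_log
```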

  1. (30 points) Boxplots provide a simple way to identify outliers. Here we will see why outliers may be easier to identify in boxplots than scatterplots.
  1. (10 points) Using the gapminder dataset, make a scatterplot of year (x-axis) and infant mortality (y axis) using data for year 2015. Are there any outliers (extreme observations)?
scatter_2015 <- gapminder %>%
  filter(year == 2015) %>%
  ggplot(aes(x = year, y = infant_mortality)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Infant Mortality Rate in 2015",
    x = "Year",
    y = "Infant Mortality"
  ) +
  theme_minimal()  # the theme is added to the plot, outside labs()
scatter_2015
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

## There are a few outliers with noticeably higher infant mortality than the rest.
  1. (10 points) Now repeat the exercise using a boxplot instead of a scatterplot. Are there outliers?
box_2015 <- gapminder %>%
  filter(year == 2015) %>%
  ggplot(aes(x = year, y = infant_mortality)) +
  geom_boxplot() +
  labs(
    title = "Box Plot of Infant Mortality Rate in 2015",
    x = "Year",
    y = "Infant Mortality"
  ) +
  theme_minimal()  # the theme is added to the plot, outside labs()
box_2015
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Yes, there are outliers at the top of the range, and they are easier to see than in the scatter plot.
  1. (10 points) Now use the same boxplot but using data for ten years. Pick any decade that you like. Has there been any change over time?
gapminder_2000s <- gapminder %>%
  filter(year >= 2000 & year <= 2009) %>%
  # factor(year) makes x discrete so that each year gets its own box
  ggplot(aes(x = factor(year), y = infant_mortality)) +
  geom_boxplot() +
  labs(
    title = "Infant Mortality by Year from 2000-2009",
    x = "Year",
    y = "Infant Mortality"
  ) +
  theme_minimal()
gapminder_2000s
## Warning: Removed 70 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Over time, there are more outliers, and they reach higher levels of infant mortality (toward 150) than in the single-year plot, where they stay near 100.
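
The points that the boxplots above draw individually follow a simple rule: as documented for geom_boxplot(), a value is shown as an outlier when it lies more than 1.5 times the interquartile range beyond the edges of the box. A minimal base-R sketch of that rule on made-up numbers follows; flag_outliers is a hypothetical helper written for illustration, and the values are not taken from gapminder.

```r
# Hypothetical helper: returns the values lying more than `coef` * IQR
# beyond the first or third quartile
flag_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]                            # ignore missing values
  q <- unname(quantile(x, c(0.25, 0.75)))      # first and third quartiles
  iqr <- q[2] - q[1]                           # interquartile range
  lower <- q[1] - coef * iqr                   # lower fence
  upper <- q[2] + coef * iqr                   # upper fence
  x[x < lower | x > upper]
}

# Made-up values: seven moderate observations and one extreme one
rates <- c(2, 3, 4, 5, 6, 7, 8, 50)
flag_outliers(rates)
```

In a scatterplot the same extreme values are present, but nothing marks them as unusual, which is why they are easier to spot in the boxplot.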