Submit this homework on Canvas as an RMarkdown file (.Rmd) and the output file (.html). Type your code in the grey areas and the text of your answers in the white areas. Click the green play button in each code chunk to run it, and knit to produce the html document. Make sure your code runs without errors. To check this, I recommend restarting RStudio to make sure the relevant packages are loaded. Remember that you can work in groups, but each student must submit individual answers.
In this problem set you will analyze the dataset gapminder from the
package dslabs used in class as an example. You will
investigate the relationship between income, infant mortality,
fertility, and life expectancy.
Is gapminder a tibble? Make it one using the as_tibble() function if not.
library(dslabs)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(gapminder)
##
## Attaching package: 'gapminder'
## The following object is masked from 'package:dslabs':
##
## gapminder
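The tibble check asked for above can be sketched as follows. This is a minimal sketch, not the required answer: the name gm is a hypothetical placeholder, and the dslabs copy of the data is referenced explicitly with dslabs::gapminder on the assumption that loading the gapminder package has masked the plain name gapminder.

```r
library(dslabs)
library(tibble)

# Refer to the dslabs copy explicitly, because library(gapminder) masks the name.
# `gm` is a hypothetical name used only for this illustration.
gm <- dslabs::gapminder

# Check whether the object is already a tibble
is_tibble(gm)

# Convert only if it is not one yet
if (!is_tibble(gm)) {
  gm <- as_tibble(gm)
}
```

Writing dslabs::gapminder rather than gapminder keeps the example independent of the order in which the packages were attached.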
We will first get the data ready.
gapminder <- gapminder %>% mutate(gdp_percap = gdpPercap / pop)
save(gapminder, file = "gapminder.RData")
gapminder_subset <- gapminder %>% filter(country %in% c("Costa Rica","United States","Germany"))
summary(gapminder_subset)
##           country      continent       year         lifeExp
##  Costa Rica   :12   Africa  : 0   Min.   :1952   Min.   :57.21
##  Germany      :12   Americas:24   1st Qu.:1966   1st Qu.:70.03
##  United States:12   Asia    : 0   Median :1980   Median :73.42
##  Afghanistan  : 0   Europe  :12   Mean   :1980   Mean   :72.37
##  Albania      : 0   Oceania : 0   3rd Qu.:1993   3rd Qu.:76.27
##  Algeria      : 0                 Max.   :2007   Max.   :79.41
##  (Other)      : 0
##       pop              gdpPercap       gdp_percap
##  Min.   :   926317   Min.   : 2627   Min.   :8.633e-05
##  1st Qu.:  3431884   1st Qu.: 6548   1st Qu.:1.242e-04
##  Median : 78248020   Median :15510   Median :2.718e-04
##  Mean   :102719428   Mean   :17422   Mean   :9.208e-04
##  3rd Qu.:189581500   3rd Qu.:25384   3rd Qu.:2.012e-03
##  Max.   :301139947   Max.   :42952   Max.   :2.836e-03
##
ggplot(data = gapminder_subset, aes(x = year, y = gdp_percap, color = country)) +
  geom_line()
gapminder %>%
  filter(country %in% c("Costa Rica", "United States", "Germany")) %>%
  ggplot(aes(x = year, y = lifeExp, color = country)) +
  geom_line()
c. Repeat the same for infant mortality. Do you learn something from
this?
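# infant_mortality is in the dslabs version of gapminder, so reference that copy explicitly
filtered_data <- dslabs::gapminder %>%
  filter(country %in% c("Costa Rica", "United States", "Germany"))
ggplot(data = filtered_data, aes(x = year, y = infant_mortality, color = country)) +
  geom_line() +
  ggtitle("Infant mortality") +
  xlab("Year") +
  ylab("Infant mortality")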
d. Plot gdp_per_cap on the x-axis and life expectancy on the
y-axis, including all countries. Does it look like there is a
relationship between income (GDP per capita) and life expectancy? Yes,
the higher the GDP per capita, the higher the life expectancy.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
e. Now repeat the plot above but restricted to Costa Rica, Germany, and
the United States. Use a different color for each country. Does it look
like there is a relationship between income (gdp per capita) and life
expectancy? Yes, there is definitely a relationship between income and
life expectancy.
filtered_data <- gapminder %>%
  filter(country %in% c("Costa Rica", "Germany", "United States"))
ggplot(data = filtered_data, aes(x = gdpPercap, y = lifeExp, color = country)) +
  geom_point() +
  xlab("GDP pc") +
  ylab("Life exp")
For this question we will use the dataset vehicles in
the package fueleconomy.
library(fueleconomy)
vehicles <- vehicles %>%
  mutate(type = case_when(
    class %in% c("Subcompact Cars", "Compact Cars", "Midsize Cars",
                 "Small Station Wagons", "Midsize Station Wagons", "Large Cars") ~ "Compact",
    class %in% c("Small Sport Utility Vehicle 2WD", "Small Sport Utility Vehicle 4WD",
                 "Sport Utility Vehicle - 2WD", "Sport Utility Vehicle - 4WD",
                 "Standard Sport Utility Vehicle 2WD", "Standard Sport Utility Vehicle 4WD",
                 "Minivan - 2WD", "Minivan - 4WD",
                 "Standard Pickup Trucks", "Standard Pickup Trucks 2WD",
                 "Standard Pickup Trucks/2wd", "Standard Pickup Trucks 4WD",
                 "Small Pickup Trucks", "Small Pickup Trucks 2WD",
                 "Small Pickup Trucks 4WD") ~ "SUV - Pickup"))
vehicles <- vehicles %>%
  filter(!is.na(type)) %>%          # keep only vehicles with a non-missing type
  mutate(fuele = (cty + hwy) / 2)   # average of city and highway fuel efficiency
# Average fuel efficiency by type and year
vehicles_summary <- aggregate(fuele ~ type + year, data = vehicles, mean)
ggplot(vehicles_summary, aes(x = year, y = fuele, color = type)) +
  geom_line() +
  labs(x = "Year", y = "Average fuel efficiency", title = "Evolution of average fuel efficiency by type of vehicle over time") +
  scale_color_manual(values = c("blue", "orange"))
Create a new variable fuele as the average of the
city fuel efficiency cty and the highway fuel
efficiency hwy.
Keep only vehicles for which the new variable type
is not missing.
Compute the average fuel efficiency by type of car
(type) and year. You may want to create a new dataset for
this.
Use a graph to visualize the evolution of the average fuel efficiency by type of vehicle over time. What type of graph is most appropriate? Explain. What can you learn about fuel efficiency over time, by type, from the graph? Explain. Make sure you label your axes. Change the colors to ones of your choice.
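The steps listed above can be illustrated on a small toy dataset. This is only a sketch under stated assumptions: toy and toy_summary are hypothetical names, the rows are made-up values standing in for the vehicles data, and dplyr's group_by() and summarise() are used here as one possible way to compute the averages.

```r
library(dplyr)

# Toy data standing in for the vehicles dataset (values are made up for illustration)
toy <- tibble(
  class = c("Compact Cars", "Compact Cars", "Standard Pickup Trucks", "Vans"),
  year  = c(2000, 2000, 2000, 2000),
  cty   = c(24, 26, 14, 12),
  hwy   = c(32, 34, 18, 16)
)

toy_summary <- toy %>%
  mutate(
    # case_when() returns NA for classes that match no condition ("Vans" here)
    type = case_when(
      class == "Compact Cars"           ~ "Compact",
      class == "Standard Pickup Trucks" ~ "SUV - Pickup"
    ),
    # average of city and highway fuel efficiency
    fuele = (cty + hwy) / 2
  ) %>%
  filter(!is.na(type)) %>%                      # keep only rows with a known type
  group_by(type, year) %>%                      # one group per type and year
  summarise(fuele = mean(fuele), .groups = "drop")

toy_summary
```

Because case_when() has no default branch here, unmatched classes become NA, and the filter() step is what removes them before averaging.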