Problem Set 1

Submit this homework on Canvas as an RMarkdown file (.Rmd) and the output file (.html). Type your code in the grey areas and the text of your answers in the white areas. Click the green play button in each code chunk to run it, and knit to produce the HTML document. Make sure your code runs without errors. To check this, I recommend restarting RStudio to make sure the relevant packages are loaded. Remember that you can work in groups, but each student must submit individual answers.

A. Analysis with gapminder

1. Packages

In this problem set you will analyze the dataset gapminder from the package dslabs used in class as an example. You will investigate the relationship between income, infant mortality, fertility, and life expectancy.

Start by loading the packages dslabs and tidyverse (install them if you haven’t yet). Is the dataset gapminder a tibble? Make it one using the as_tibble() function if not.

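library(dslabs)
library(tidyverse)  # also attaches dplyr and ggplot2, so they need no separate library() call
# Note: loading gapminder after dslabs masks dslabs::gapminder (see the message below)
library(gapminder)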

## 
## Attaching package: 'gapminder'

## The following object is masked from 'package:dslabs':
## 
##     gapminder
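
The tibble question above can be checked along these lines. This is a minimal sketch, assuming the dslabs version of the data is the one wanted; `gm` is a hypothetical name used only for illustration, and the dataset is namespace-qualified so that the masking shown above does not affect it.

```r
library(dslabs)
library(tibble)

# Namespace-qualify the dataset so a masking package cannot change what is checked
gm <- dslabs::gapminder

is_tibble(gm)        # FALSE here would mean it is a plain data frame
gm <- as_tibble(gm)  # convert it (harmless if it already is a tibble)
is_tibble(gm)
```

Converting changes how the data prints, not its contents.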

2. Life Expectancy Over Time

We will first get the data ready.

Create a new variable called gdp_percap equal to gdp divided by population. Remember again to save the dataset or you will lose it.

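# Note: in the gapminder package's data, gdpPercap is already GDP per capita, so this
# divides by population a second time; gdp / population applies to dslabs::gapminder.
gapminder <- gapminder %>% mutate(gdp_percap = gdpPercap / pop)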
save(gapminder, file = "gapminder.RData")
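
For comparison, the calculation described in the prompt could look like the sketch below when run on the dslabs version of the data. It assumes the dslabs column names `gdp` and `population`; `gm` is a hypothetical name, and assigning the result back with `<-` is what keeps the new variable.

```r
library(dslabs)
library(dplyr)

# Assumption: dslabs::gapminder stores total GDP in `gdp` and headcount in `population`
gm <- dslabs::gapminder %>%
  mutate(gdp_percap = gdp / population)

# Rows with a missing gdp or population get a missing gdp_percap
summary(gm$gdp_percap)
```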

Restrict the sample to Costa Rica, the United States, and Germany. Make sure you save this new dataset so that you only look at these three countries for the rest of the analysis.

gapminder_subset <- gapminder %>% filter(country %in% c("Costa Rica","United States","Germany"))

Are there NAs in your data? In which variable(s)? Keep this answer in mind below. (Hint: Use summary()) No, there are no NAs in the data.

summary(gapminder_subset)

##           country      continent       year         lifeExp     
##  Costa Rica   :12   Africa  : 0   Min.   :1952   Min.   :57.21  
##  Germany      :12   Americas:24   1st Qu.:1966   1st Qu.:70.03  
##  United States:12   Asia    : 0   Median :1980   Median :73.42  
##  Afghanistan  : 0   Europe  :12   Mean   :1980   Mean   :72.37  
##  Albania      : 0   Oceania : 0   3rd Qu.:1993   3rd Qu.:76.27  
##  Algeria      : 0                 Max.   :2007   Max.   :79.41  
##  (Other)      : 0                                               
##       pop              gdpPercap       gdp_percap       
##  Min.   :   926317   Min.   : 2627   Min.   :8.633e-05  
##  1st Qu.:  3431884   1st Qu.: 6548   1st Qu.:1.242e-04  
##  Median : 78248020   Median :15510   Median :2.718e-04  
##  Mean   :102719428   Mean   :17422   Mean   :9.208e-04  
##  3rd Qu.:189581500   3rd Qu.:25384   3rd Qu.:2.012e-03  
##  Max.   :301139947   Max.   :42952   Max.   :2.836e-03  
##
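
Besides reading the summary() output, missing values can be counted directly per column. The sketch below uses a small made-up data frame (`toy` is hypothetical and not part of the assignment data) so that the result is easy to verify by eye; the same call would work on gapminder_subset.

```r
# Toy data with exactly one missing value, standing in for the real dataset
toy <- data.frame(
  country = c("A", "B", "C"),
  gdp     = c(1.5, NA, 3.0),
  pop     = c(10, 20, 30)
)

# is.na() marks each missing cell; colSums() counts them per column
na_counts <- colSums(is.na(toy))
na_counts
```

A column with a count above zero is one where later plots or averages will silently drop or propagate missing values.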

3. Relationships between variables

a. Plot the evolution of GDP per capita over time, using a different line and color for each country. Does this surprise you? No, it doesn't surprise me.

ggplot(data = gapminder_subset, aes(x = year, y = gdp_percap, color = country)) + 
  geom_line()

b. Make a line plot that shows the evolution of life expectancy over time, using a different line and color for each country. What do you see? Anything interesting? Is this how you expected it, given the patterns in 3a? I think it is interesting how much Costa Rica has improved over time.

gapminder %>%
  filter(country %in% c("Costa Rica","United States","Germany")) %>%
  ggplot(aes(x = year, y = lifeExp, color = country)) + 
  geom_line()

c. Repeat the same for infant mortality. Do you learn something from this?

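# Infant mortality is only in the dslabs version of the data (column infant_mortality),
# so that dataset is referenced explicitly here.
filtered_data <- dslabs::gapminder %>%
  filter(country %in% c("Costa Rica", "United States", "Germany"))

ggplot(data = filtered_data, aes(x = year, y = infant_mortality, color = country)) +
  geom_line() +
  ggtitle("Infant mortality") +
  xlab("Year") +
  ylab("Infant mortality")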

d. Let’s go back to the whole sample now. Make a scatterplot with gdp_percap on the x-axis and life expectancy on the y-axis, including all countries. Does it look like there is a relationship between income (GDP per capita) and life expectancy? Yes, the higher GDP per capita is, the higher life expectancy tends to be.

ggplot(data=gapminder, aes(x=gdpPercap, y=lifeExp)) + 
  geom_point()

e. Now repeat the plot above but restricted to Costa Rica, Germany, and the United States. Use a different color for each country. Does it look like there is a relationship between income (GDP per capita) and life expectancy? Yes, there is definitely a relationship between income and life expectancy.

filtered_data <- gapminder %>% filter(country %in% c("Costa Rica", "Germany", "United States"))
ggplot(data = filtered_data, aes(x = gdpPercap, y = lifeExp, color = country)) +
  geom_point() +
  xlab("GDP per capita") +
  ylab("Life expectancy")

f. Add 3 more countries to the plot, ideally of different income levels. Does it look like there is a relationship between income (GDP per capita) and life expectancy? What do you conclude? Yes, there is clearly a relationship between the two variables.
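
A plot for this step could be sketched as follows. The three added countries (Brazil, India, Nigeria) are an arbitrary illustration, not a choice made in the original answer; `six_countries` and `p` are hypothetical names. The sketch uses the gapminder package's columns (gdpPercap, lifeExp) to match the plots above, and the log scale on the x-axis is an optional addition that spreads out the lower-income observations.

```r
library(dplyr)
library(ggplot2)

# Original three countries plus three illustrative additions
countries <- c("Costa Rica", "Germany", "United States",
               "Brazil", "India", "Nigeria")

six_countries <- gapminder::gapminder %>%
  filter(country %in% countries)

p <- ggplot(six_countries, aes(x = gdpPercap, y = lifeExp, color = country)) +
  geom_point() +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)", y = "Life expectancy")
p
```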

B. Data wrangling and visualization

For this question we will use the dataset vehicles in the package fueleconomy.

After loading the library, run the lines below to create a new variable for two car categories. These categories roughly represent two types of cars subject to different standards regarding fuel efficiency (less stringent for light-duty vehicles such as SUVs and pickup trucks).

library(fueleconomy)
vehicles <- vehicles %>%
  mutate(type = case_when(
    class %in% c("Subcompact Cars", "Compact Cars", "Midsize Cars",
                 "Small Station Wagons", "Midsize Station Wagons", "Large Cars") ~ "Compact",
    class %in% c("Small Sport Utility Vehicle 2WD", "Small Sport Utility Vehicle 4WD",
                 "Sport Utility Vehicle - 2WD", "Sport Utility Vehicle - 4WD",
                 "Standard Sport Utility Vehicle 2WD", "Standard Sport Utility Vehicle 4WD",
                 "Minivan - 2WD", "Minivan - 4WD",
                 "Small Pickup Trucks", "Small Pickup Trucks 2WD", "Small Pickup Trucks 4WD",
                 "Standard Pickup Trucks", "Standard Pickup Trucks 2WD",
                 "Standard Pickup Trucks 4WD", "Standard Pickup Trucks/2wd") ~ "SUV - Pickup"
  ))

vehicles <- vehicles %>%
  mutate(fuele = (cty + hwy) / 2)

vehicles_summary <- aggregate(fuele ~ type + year, data = vehicles, mean)

ggplot(vehicles_summary, aes(x = year, y = fuele, color = type)) + 
  geom_line() + 
  labs(x = "Year", y = "Average fuel efficiency", title = "Evolution of average fuel efficiency by type of vehicle over time") +
  scale_color_manual(values = c("blue", "orange"))
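
The aggregate() step above can also be written with dplyr verbs, which some readers find easier to follow alongside the rest of the tidyverse code. The sketch below runs on a small made-up data frame (`toy` is hypothetical) so the averages can be checked by hand. It drops rows whose type is NA explicitly, which the formula interface of aggregate() does by default.

```r
library(dplyr)

# Toy stand-in for vehicles: two compact cars, one SUV, one unclassified row
toy <- data.frame(
  type = c("Compact", "Compact", "SUV - Pickup", NA),
  year = c(2000, 2000, 2000, 2000),
  cty  = c(20, 30, 15, 18),
  hwy  = c(30, 40, 21, 24)
)

toy_summary <- toy %>%
  filter(!is.na(type)) %>%                 # drop vehicles outside both categories
  mutate(fuele = (cty + hwy) / 2) %>%      # average of city and highway mileage
  group_by(type, year) %>%
  summarise(fuele = mean(fuele), .groups = "drop")

toy_summary
```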