Academic Honesty Statement (fill your name in the blank)

I, Yuhe, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.

Load packages

# load required packages here
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
library(openintro)
library(nycflights13)
library(lubridate)

1. Data Tidying and Relational Data

The following questions shall be answered by working with the world_bank_pop and who data sets from the openinto library.

  1. The data set world_bank_pop is not clean. Clean the data set such that the after data tidying you have six columns: country, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW. Give your code and show the first 10 rows of the data set after being tidied. Then explain the meaning of each column.
# Enter code here.
cleaned_data <- world_bank_pop %>%
  pivot_longer(c("2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017"), names_to = "year", values_to = "population") %>%
  pivot_wider(names_from = indicator, values_from = population)

head(cleaned_data, 10)

Answer:
“country”means country code; “year” means years; “SP.URB.TOTAL” menas the total urban population; “SP.URB.GROW” means the urban population growth; “SP.POP.TOTL” means total population; “SP.POP.GROW” means the population growth.

  1. Replace the country column of the tided data set in step a) with full names of the country (for example, replace USA with United States of America) by checking the data frame who, which contains the full name of each country corresponding to the three-digit country code. Give your code and show the updated data set in a manner to illustrate that the task is correctly fulfilled.
# Enter code here.
who_1 <- who %>%
  select(country, iso3) %>%
  group_by(country, iso3) %>%
  summarise(n = n()) %>%
  select(-n)

world_bank_pop_1 <- cleaned_data %>%
  left_join(who_1, c("country" = "iso3")) %>%
  mutate(country = country.y) %>%
  select(-country.y,) %>%
  filter(!is.na(country))

glimpse(world_bank_pop_1)
## Rows: 3,780
## Columns: 6
## $ country     <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aru…
## $ year        <chr> "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2…
## $ SP.URB.TOTL <dbl> 41625, 42025, 42194, 42277, 42317, 42399, 42555, 42729, 42…
## $ SP.URB.GROW <dbl> 1.66422212, 0.95637310, 0.40133515, 0.19651721, 0.09456936…
## $ SP.POP.TOTL <dbl> 89101, 90691, 91781, 92701, 93540, 94483, 95606, 96787, 97…
## $ SP.POP.GROW <dbl> 2.5392344, 1.7687566, 1.1947181, 0.9973955, 0.9009892, 1.0…

Answer:
As above.

  1. With the data set obtained in step b), answer which countries had undergone significant urbanization between 2000 and 2017. You need to show the code and the results (either graphs or tables) to support your answer.
# Enter code here.
world_bank_pop_2 <- world_bank_pop_1 %>%
  filter(year == "2000" | year == "2017") %>%
  group_by(country) %>%
  summarise(
    city_pop_year_2000 = SP.URB.TOTL[year == "2000"],
    city_pop_year_2017 = SP.URB.TOTL[year == "2017"],
    pop_change = city_pop_year_2017 - city_pop_year_2000
  ) %>%
  arrange(desc(pop_change))

print(world_bank_pop_2)
## # A tibble: 210 × 4
##    country                    city_pop_year_2000 city_pop_year_2017 pop_change
##    <chr>                                   <dbl>              <dbl>      <dbl>
##  1 China                               452999147          809246214  356247067
##  2 India                               293168849          455009748  161840899
##  3 Indonesia                            89914698          144572428   54657730
##  4 Nigeria                              42801631           95817238   53015607
##  5 United States of America            223069137          266788716   43719579
##  6 Brazil                              142795391          179958546   37163155
##  7 Pakistan                             50914288           78853074   27938786
##  8 Bangladesh                           30476706           58016080   27539374
##  9 Mexico                               73132993           98108030   24975037
## 10 Iran (Islamic Republic of)           41975934           62866706   20890772
## # ℹ 200 more rows

Answer:
China had undergone significant urbanization between 2000 and 2017. India is at the second place.

2. Factors and Relational Data

For the following tasks, use data set planes and flights from the nycflights13 package.

  1. For the planes data set, only keep planes from manufacturers that have more than 10 samples in the data set. Then convert manufacturer column into a factor. Then combine AIRBUS and AIRBUS INDUSTRIE as a single category AIRBUS; combine MCDONNELL DOUGLAS, MCDONNELL DOUGLAS AIRCRAFT CO and MCDONNELL DOUGLAS CORPORATION into a single category MCDONNELL. Save your data frame as a new one. Show your code and the first 10 rows of the updated data frame.
# Enter code here.
planes_1 <- planes %>%
  group_by(manufacturer) %>%
  mutate(n = n()) %>%
  filter(n > 10) %>%
  select(-n) %>%
  mutate(manufacturer = as.factor(manufacturer)) %>%
  mutate(manufacturer = fct_collapse(manufacturer,
    "AIRBUS"    = c("AIRBUS", "AIRBUS INDUSTRIE"),
    "MCDONNELL" = c("MCDONNELL DOUGLAS", "MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")
  ))

head(planes_1, 10)

Answer:
As above

  1. Join the flights data set with the planes data set, study how plane models correlate with the flight distance with proper data visualizations or summary tables. You are required to summarize your findings concisely in your own words.
# Enter code here.
planes_2 <- planes_1 %>%
  left_join(flights, by = "tailnum") %>%
  group_by(model, manufacturer) %>%
  summarise(avg_distance = mean(distance)) %>%
  arrange(desc(avg_distance))

planes_2

Answer:
The A330-243 model from the AIRBUS manufacturer have the highest average flight distance. BOEING has most flight models that flight a high average distance in the top 10 average flight distance list.

3. Datetime and Data Transformation

For the following tasks, use the data set weather, flights or planes from the nycflights13 package.

  1. Create a plot of the temperature change across the whole year of 2013 at the JFK airport. (Hint: You need to first create a datetime variable for each hour.)
# Enter code here.
jfk_weather <- weather %>%
  filter(origin == "JFK") %>%
  mutate(datetime = make_datetime(year, month, day, hour))

ggplot(jfk_weather, aes(x = datetime, y = temp)) +
  geom_line() +
  labs(
    title = "Temperature Chnage at JFK airport of 2013",
    x = "Month",
    y = "Temperature"
    )

Answer:

  1. Find out which day of the year has the largest temperature difference (defined as the difference between the highest and the lowest temperature) across the day (0am - 11pm).
# Enter code here.

weather_1 <- weather %>%
  filter(origin == "JFK") %>%
  group_by(year, month, day) %>%
  summarize(
    max = max(temp, na.rm = TRUE),
    min = min(temp, na.rm = TRUE),
    temp_diff = max - min) %>%
  arrange(desc(temp_diff))

head(weather_1, 1)

Answer: The highest temperature difference day is 2013-05-08.

  1. Find a way to select all overnight flights (also called “Red Eye Flights” that depart at late night and arrive in the early morning) from the flights data set. Here overnight flights are defined as flights that departed between 10pm and 1am, and having an air time of over 4 hours . Create a categorical variable overnight_flag with YES or NO as the possible values. Show your code and the updated data frame.
# Enter code here.
flights_updated <- flights %>%
  mutate(
    overnight = if_else( (dep_time >= 2200 | dep_time <= 100) & air_time > 240, "YES", "NO", missing = "NO") )

flights_updated %>%
  filter(overnight == "YES") %>%
  select(month, day, tailnum, dest, dep_time, air_time, overnight) %>%
  head(10)

Answer:

  1. Someone says that most overnight flights use relatively small planes. Verify whether this is true with the data frame obtained in c) and the planes data set.
# Enter code here.
overnight_planes <- flights_updated %>%
  inner_join(planes, by = "tailnum")

comparison <- overnight_planes %>%
  group_by(overnight) %>%
  summarize(
    mean = mean(seats),
    median = median(seats),
    min = min(seats),
    total_flights = n()
    )

print(comparison)
## # A tibble: 2 × 5
##   overnight  mean median   min total_flights
##   <chr>     <dbl>  <int> <int>         <int>
## 1 NO         137.    149     2        283531
## 2 YES        200.    200     2           639

Answer:
The table shows that the overnight planes have a higher average on seat number, which mean, the overnight flights are actually bigger. However, the flights number is pretty small compare to day flights.

4. General Analysis and Statistical Tests

Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.

  1. From the gss_cat data set, find factors that are significantly correlated with the reported income.
# Enter code here.

Answer:

  1. From the smoking data set of the openintro package, find find factors that are significantly correlated with the smoking status and the number of cigarettes smoked per day.
# Enter code here.

Answer: