Academic Honesty Statement (fill your name in the blank)

I, Duc Vinh Hoang, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.

Load packages

library(openintro)
library(tidyverse)
library(nycflights13)
library(lubridate)
data(smoking)

1. Data Tidying and Relational Data

The following questions shall be answered by working with the world_bank_pop and who data sets from the openinto library.

  1. The data set world_bank_pop is not clean. Clean the data set such that the after data tidying you have six columns: country, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW. Give your code and show the first 10 rows of the data set after being tidied. Then explain the meaning of each column.
wb_tidy <- world_bank_pop %>%
  pivot_longer(
    cols = `2000`:`2017`, 
    names_to = "year", 
    values_to = "value"
  ) %>%
  pivot_wider(
    names_from = indicator, 
    values_from = value
  )

head(wb_tidy, 10)

Answer:
“country”: 3 digits ISO country code.
“year”: Year that the data is recorded.
“SP.URB.TOTL”: Total urban population count.
“SP.URB.GROW”: The annual growth rate of the urban population.
“SP.POP.TOTL”: Total national population count.
“SP.POP.GROW”: The annual growth rate of the total population.

  1. Replace the country column of the tided data set in step a) with full names of the country (for example, replace USA with United States of America) by checking the data frame who, which contains the full name of each country corresponding to the three-digit country code. Give your code and show the updated data set in a manner to illustrate that the task is correctly fulfilled.
country_mapping <- who %>%
  distinct(iso3, country) %>%
  rename(full_name = country)

wb_named <- wb_tidy %>%
  left_join(country_mapping, by = c("country" = "iso3")) %>%
  mutate(country = coalesce(full_name, country)) %>%
  select(-full_name)

wb_named %>%
  filter(country == "Viet Nam") %>%
  head(5)

Answer:

  1. With the data set obtained in step b), answer which countries had undergone significant urbanization between 2000 and 2017. You need to show the code and the results (either graphs or tables) to support your answer.
urbanization <- wb_named %>%
  filter(year %in% c("2000", "2017")) %>%
  drop_na(SP.URB.TOTL, SP.POP.TOTL) %>%
  mutate(urban_pct = (SP.URB.TOTL / SP.POP.TOTL) * 100) %>%
  select(country, year, urban_pct) %>%
  pivot_wider(names_from = year, values_from = urban_pct) %>%
  mutate(pct_change = `2017` - `2000`) %>%

  filter(nchar(country) > 3) %>%
  arrange(desc(pct_change)) %>%
  head(5)

print(urbanization)
## # A tibble: 5 × 4
##   country               `2000` `2017` pct_change
##   <chr>                  <dbl>  <dbl>      <dbl>
## 1 Equatorial Guinea       49.1   71.6       22.6
## 2 China                   35.9   58.0       22.1
## 3 Costa Rica              59.1   78.6       19.5
## 4 Haiti                   35.6   54.3       18.7
## 5 Sao Tome and Principe   53.4   72.0       18.5
ggplot(urbanization, aes(x = reorder(country, pct_change), y = pct_change)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Top 5 Countries That Had Significant Urbanization (2000-2017)",
    x = "Country",
    y = "Increase in Urban Population Proportion (%)"
  ) 

Answer:
It seems to be Equatorial Guinea had the most significant urbanization changes from 2000 to 2017, closely followed by China, Costa Rica, Haiti and Sao Tome and Principe.

2. Factors and Relational Data

For the following tasks, use data set planes and flights from the nycflights13 package.

  1. For the planes data set, only keep planes from manufacturers that have more than 10 samples in the data set. Then convert manufacturer column into a factor. Then combine AIRBUS and AIRBUS INDUSTRIE as a single category AIRBUS; combine MCDONNELL DOUGLAS, MCDONNELL DOUGLAS AIRCRAFT CO and MCDONNELL DOUGLAS CORPORATION into a single category MCDONNELL. Save your data frame as a new one. Show your code and the first 10 rows of the updated data frame.
planes_dataset <- planes %>%
  add_count(manufacturer) %>%
  filter(n > 10) %>%
  select(-n) %>% 
  
  mutate(
    manufacturer = as.factor(manufacturer),
    manufacturer = fct_collapse(manufacturer,
      "AIRBUS" = c("AIRBUS", "AIRBUS INDUSTRIE"),
      "MCDONNELL" = c("MCDONNELL DOUGLAS", "MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")
    )
  )

head(planes_dataset, 10)

Answer:

  1. Join the flights data set with the planes data set, study how plane models correlate with the flight distance with proper data visualizations or summary tables. You are required to summarize your findings concisely in your own words.
flights_planes <- flights %>%
  inner_join(planes_dataset, by = "tailnum", suffix = c("_flight", "_plane"))

model_distance_summary <- flights_planes %>%
  group_by(model) %>%
  summarize(
    avg_distance = mean(distance, na.rm = TRUE),
    median_distance = median(distance, na.rm = TRUE),
    num_flights = n()
  ) %>%
  filter(num_flights > 50) %>%
  arrange(desc(avg_distance))

head(model_distance_summary, 10)
top_models <- flights_planes %>%
  count(model, sort = TRUE) %>%
  slice_head(n = 10) %>%
  pull(model)

flights_planes %>%
  filter(model %in% top_models) %>%
  ggplot(aes(x = reorder(model, distance, FUN = median), y = distance)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Flight Distance Distribution by Plane Model",
    x = "Plane Model",
    y = "Flight Distance (Miles)"
  )

Answer:
Wide-body models handle the longest hauls: Both Boeing 737-824 and 757-222 easily exceed above 2000 miles.
Narrow-body serve mid-range routes: Airbus A319, A320, Boeing 737-7H4 are the three aircrafts that serve mid-range routes at around 700mi to 1300mi.
Regional jets: The rest of the list are private jets and their purpose are to serve short-range and not quite popular for commercial, their routes are best below 1000mi

3. Datetime and Data Transformation

For the following tasks, use the data set weather, flights or planes from the nycflights13 package.

  1. Create a plot of the temperature change across the whole year of 2013 at the JFK airport. (Hint: You need to first create a datetime variable for each hour.)
jfk_weather <- weather %>%
  filter(origin == "JFK") %>%
  mutate(datetime = make_datetime(year, month, day, hour))

ggplot(jfk_weather, aes(x = datetime, y = temp)) +
  geom_line(color = "coral", alpha = 0.8) +
  scale_x_datetime(date_labels = "%B", date_breaks = "1 month") +
  labs(
    title = "Hourly Temperature Change at JFK Airport (2013)",
    x = "Month",
    y = "Temperature (F)"
  ) 

Answer:

  1. Find out which day of the year has the largest temperature difference (defined as the difference between the highest and the lowest temperature) across the day (0am - 11pm).
largest_temp_diff <- weather %>%
  group_by(origin, year, month, day) %>%
  summarize(
    max_temp = max(temp, na.rm = TRUE),
    min_temp = min(temp, na.rm = TRUE),
    temp_diff = max_temp - min_temp,
    .groups = "drop"
  ) %>%
  arrange(desc(temp_diff)) %>%
  head(1)

print(largest_temp_diff)
## # A tibble: 1 × 7
##   origin  year month   day max_temp min_temp temp_diff
##   <chr>  <int> <int> <int>    <dbl>    <dbl>     <dbl>
## 1 JFK     2013     5     8     62.6     13.1      49.5

Answer:

  1. Find a way to select all overnight flights (also called “Red Eye Flights” that depart at late night and arrive in the early morning) from the flights data set. Here overnight flights are defined as flights that departed between 10pm and 1am, and having an air time of over 4 hours . Create a categorical variable overnight_flag with YES or NO as the possible values. Show your code and the updated data frame.
flights_latenight <- flights %>%
  mutate(
    overnight_flag = if_else(
      condition = (dep_time >= 2200 | dep_time <= 100) & air_time > 240,
      true = "YES",
      false = "NO",
    )
  )

flights_latenight %>%
  filter(overnight_flag == "YES") %>%
  select(year:day, dep_time, air_time, dest, overnight_flag) %>%
  head(10)

Answer:

  1. Someone says that most overnight flights use relatively small planes. Verify whether this is true with the data frame obtained in c) and the planes data set.
flights_plane_type <- flights_latenight %>%
  inner_join(planes, by = "tailnum", suffix = c("_flight", "_plane"))

plane_size_summary <- flights_plane_type %>%
  group_by(overnight_flag) %>%
  summarize(
    avg_seats = mean(seats, na.rm = TRUE),
    median_seats = median(seats, na.rm = TRUE),
    total_flights = n()
  )

print(plane_size_summary)
## # A tibble: 3 × 4
##   overnight_flag avg_seats median_seats total_flights
##   <chr>              <dbl>        <dbl>         <int>
## 1 NO                 137.           149        279298
## 2 YES                200.           200           639
## 3 <NA>                84.8           55          4233

Answer:
The statement is wrong.
With larger body planes.

4. General Analysis and Statistical Tests

Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.

  1. From the gss_cat data set, find factors that are significantly correlated with the reported income.
chisq_marital <- chisq.test(table(gss_cat$marital, gss_cat$rincome))
print(chisq_marital)
## 
##  Pearson's Chi-squared test
## 
## data:  table(gss_cat$marital, gss_cat$rincome)
## X-squared = 2470.6, df = 75, p-value < 2.2e-16
chisq_race <- chisq.test(table(fct_drop(gss_cat$race), gss_cat$rincome))
print(chisq_race)
## 
##  Pearson's Chi-squared test
## 
## data:  table(fct_drop(gss_cat$race), gss_cat$rincome)
## X-squared = 197.91, df = 30, p-value < 2.2e-16
ggplot(gss_cat, aes(x = rincome, fill = marital)) +
  geom_bar(position = "fill", linewidth = 0.2) +
  coord_flip() +
  labs(
    title = "Marital Status by Reported Income (Including Non-Responses)",
    x = "Reported Income",
    y = "Proportion",
    fill = "Marital Status"
  ) 

Answer:
Marital Status: With a p-value of < 2.2e-16, marital status is significantly correlated with reported income, race is also significantly correlated with it with p-value of < 2.2e-16.

  1. From the smoking data set of the openintro package, find find factors that are significantly correlated with the smoking status and the number of cigarettes smoked per day.
chisq_smoke_marital <- chisq.test(table(smoking$marital_status, smoking$smoke))
print(chisq_smoke_marital)
## 
##  Pearson's Chi-squared test
## 
## data:  table(smoking$marital_status, smoking$smoke)
## X-squared = 74.98, df = 4, p-value = 2.012e-15
ggplot(smoking, aes(x = marital_status, fill = smoke)) +
  geom_bar(position = "fill") +
  labs(
    title = "Smoking Status by Marriage",
    x = "Marital Status",
    y = "Proportion",
    fill = "Smoke"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

smokers_only <- smoking %>%
  filter(smoke == "Yes") %>%
  mutate(cigs_per_day = (amt_weekdays * 5 + amt_weekends * 2) / 7) %>%
  drop_na(cigs_per_day, gender)

t_test_gender <- t.test(cigs_per_day ~ gender, data = smokers_only)
print(t_test_gender)
## 
##  Welch Two Sample t-test
## 
## data:  cigs_per_day by gender
## t = -4.0167, df = 317.7, p-value = 7.373e-05
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -5.470389 -1.873280
## sample estimates:
## mean in group Female   mean in group Male 
##             12.87973             16.55157
ggplot(smokers_only, aes(x = gender, y = cigs_per_day, fill = gender)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Average Daily Cigarettes Smoked by Gender",
    x = "Gender",
    y = "Cigarettes per Day"
  ) 

Answer:
Looking at the “Smoking Status by Marriage”, people who are “Divorced” or “Separated” have a higher proportion of smokers compared to people who are “Married” or “Widowed”.
Male smokers tend to smoke more than female in one day with more outliers. However, the Q1 and Q3 from both Male and Female have no difference at all.