Academic Honesty Statement (fill your name in the blank)

I, Sy Cong Nguyen, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.

Load packages

library(tidyverse)
library(nycflights13)
library(openintro)
library(lubridate)

1. Data Tidying and Relational Data

The following questions are to be answered by working with the world_bank_pop and who data sets from the openintro library.

  a. The data set world_bank_pop is not clean. Clean the data set so that after tidying you have six columns: country, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW. Give your code and show the first 10 rows of the tidied data set. Then explain the meaning of each column.
head(world_bank_pop)
# Gather the year columns (2000-2017) into year/value pairs
long_data <- pivot_longer(world_bank_pop, 
                          cols = `2000`:`2017`,
                          names_to = "year",
                          values_to = "value")

head(long_data)
# Spread the four indicators into their own columns
tidy_data <- pivot_wider(long_data,
                         names_from = indicator,
                         values_from = value)

tidy_data$year <- as.numeric(tidy_data$year)

head(tidy_data, 10)

Answer:

The data set now has 6 columns: country (the three-letter country or region code), year, SP.URB.TOTL (total urban population), SP.URB.GROW (urban population growth, annual %), SP.POP.TOTL (total population), and SP.POP.GROW (total population growth, annual %).

The original data was messy because the years were spread across columns and the four indicators were stacked in rows. After using pivot_longer and then pivot_wider, each row is one country-year observation.
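As a quick sanity check (a sketch, not required by the question), we can confirm that each country-year combination appears exactly once after the reshape:

# Every country-year pair should appear exactly once after tidying;
# an empty result means the reshape produced one row per observation.
tidy_data %>% 
  count(country, year) %>% 
  filter(n > 1)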

  b. Replace the country column of the tidied data set in step a) with the full name of each country (for example, replace USA with United States of America) by using the data frame who, which contains the full name of the country corresponding to each three-letter country code. Give your code and show the updated data set in a way that illustrates the task was carried out correctly.
country_names <- who %>% 
  select(country, iso3) %>% 
  distinct()

head(country_names)
tidy_data2 <- tidy_data %>% 
  rename(iso3 = country)

tidy_data2 <- left_join(tidy_data2, country_names, by = "iso3")

tidy_data2 <- tidy_data2 %>% 
  select(country, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW)

tidy_data2 %>% 
  filter(country %in% c("United States of America", "China", "India", "Germany")) %>% 
  head(10)

Answer:

The country column now has full names instead of codes. For example, “USA” is now “United States of America” and “CHN” is now “China”. I used left_join to match the 3-letter codes in world_bank_pop with the country names in who.
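One caveat worth checking (a small sketch): world_bank_pop may also contain codes, such as regional aggregates, that have no match in who; those rows end up with NA in the country column after the left join. The unmatched codes can be listed with an anti_join:

# Codes in the tidied data that have no matching full name in `who`
anti_join(tidy_data, country_names, by = c("country" = "iso3")) %>% 
  distinct(country)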

  c. With the data set obtained in step b), determine which countries underwent significant urbanization between 2000 and 2017. You need to show the code and the results (either graphs or tables) to support your answer.
urban_data <- tidy_data2 %>% 
  filter(year == 2000 | year == 2017) %>% 
  filter(!is.na(country)) %>% 
  mutate(urban_share = SP.URB.TOTL / SP.POP.TOTL)

urban_compare <- urban_data %>% 
  select(country, year, urban_share) %>% 
  pivot_wider(names_from = year, values_from = urban_share)

urban_compare$change <- urban_compare$`2017` - urban_compare$`2000`

urban_compare <- urban_compare %>% 
  filter(!is.na(change)) %>% 
  arrange(desc(change))

top10 <- head(urban_compare, 10)
top10
ggplot(top10, aes(x = reorder(country, change), y = change * 100)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Countries with Biggest Urbanization 2000-2017",
       x = "Country",
       y = "Change in urban share (%)")

Answer:

I calculated the urban share (urban population divided by total population) for each country in 2000 and 2017, then found the difference. The countries with the biggest increase in urban share are the ones that urbanized the most. From the bar chart, countries such as Maldives, Laos, Bhutan, Burkina Faso, and several other developing countries in Asia and Africa urbanized the most, with their urban share rising by around 15-25 percentage points.
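For reference, the same change can be computed without a second pivot_wider by summarising within each country (a sketch; the [1] guards against countries that are missing one of the two years):

# Alternative: compute the change in urban share per country directly
# from the long data, without pivoting wider.
urban_data %>% 
  group_by(country) %>% 
  summarise(change = urban_share[year == 2017][1] - urban_share[year == 2000][1]) %>% 
  filter(!is.na(change)) %>% 
  arrange(desc(change)) %>% 
  head(10)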

2. Factors and Relational Data

For the following tasks, use the planes and flights data sets from the nycflights13 package.

  a. For the planes data set, only keep planes from manufacturers that have more than 10 samples in the data set. Then convert the manufacturer column into a factor. Then combine AIRBUS and AIRBUS INDUSTRIE into a single category AIRBUS; combine MCDONNELL DOUGLAS, MCDONNELL DOUGLAS AIRCRAFT CO and MCDONNELL DOUGLAS CORPORATION into a single category MCDONNELL. Save your data frame as a new one. Show your code and the first 10 rows of the updated data frame.
manuf_count <- planes %>% 
  group_by(manufacturer) %>% 
  summarise(n = n())

manuf_count
big_manuf <- manuf_count %>% 
  filter(n > 10) %>% 
  pull(manufacturer)

big_manuf
## [1] "AIRBUS"                        "AIRBUS INDUSTRIE"             
## [3] "BOEING"                        "BOMBARDIER INC"               
## [5] "EMBRAER"                       "MCDONNELL DOUGLAS"            
## [7] "MCDONNELL DOUGLAS AIRCRAFT CO" "MCDONNELL DOUGLAS CORPORATION"
planes_new <- planes %>% 
  filter(manufacturer %in% big_manuf)

planes_new$manufacturer <- ifelse(planes_new$manufacturer == "AIRBUS INDUSTRIE",
                                  "AIRBUS",
                                  planes_new$manufacturer)

planes_new$manufacturer <- ifelse(planes_new$manufacturer %in% 
                                    c("MCDONNELL DOUGLAS",
                                      "MCDONNELL DOUGLAS AIRCRAFT CO",
                                      "MCDONNELL DOUGLAS CORPORATION"),
                                  "MCDONNELL",
                                  planes_new$manufacturer)

planes_new$manufacturer <- as.factor(planes_new$manufacturer)

levels(planes_new$manufacturer)
## [1] "AIRBUS"         "BOEING"         "BOMBARDIER INC" "EMBRAER"       
## [5] "MCDONNELL"
head(planes_new, 10)

Answer:

I first counted how many planes each manufacturer had, then kept only the manufacturers with more than 10. After that I used ifelse to rename the Airbus and McDonnell Douglas variants so they all fall under one category, and finally converted the column to a factor. The new factor has fewer levels, which makes it easier to work with.
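The same recoding can also be written with forcats, which is loaded as part of the tidyverse (a sketch that should be equivalent to the ifelse() calls above; planes_new2 is just an alternative name):

# Equivalent factor recoding with forcats::fct_collapse
planes_new2 <- planes %>% 
  filter(manufacturer %in% big_manuf) %>% 
  mutate(manufacturer = fct_collapse(manufacturer,
    AIRBUS = c("AIRBUS", "AIRBUS INDUSTRIE"),
    MCDONNELL = c("MCDONNELL DOUGLAS",
                  "MCDONNELL DOUGLAS AIRCRAFT CO",
                  "MCDONNELL DOUGLAS CORPORATION")))

levels(planes_new2$manufacturer)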

  b. Join the flights data set with the planes data set and study how plane model correlates with flight distance, using proper data visualizations or summary tables. You are required to summarize your findings concisely in your own words.
combined <- flights %>% 
  inner_join(planes_new, by = "tailnum")

model_dist <- combined %>% 
  group_by(model) %>% 
  summarise(avg_distance = mean(distance, na.rm = TRUE),
            n_flights = n()) %>% 
  filter(n_flights > 500) %>%
  arrange(desc(avg_distance))

model_dist
top_models <- combined %>% 
  count(model, sort = TRUE) %>% 
  head(10) %>% 
  pull(model)

combined %>% 
  filter(model %in% top_models) %>% 
  ggplot(aes(x = model, y = distance)) +
  geom_boxplot(fill = "lightblue") +
  coord_flip() +
  labs(title = "Flight Distance by Plane Model",
       x = "Plane Model",
       y = "Distance (miles)")

Answer:

Different plane models fly different distances. From the boxplot, I can see that bigger planes like the Airbus A320 series and Boeing 757-200 fly the longest distances (median over 1000 miles), while smaller regional jets like the Embraer ERJ-145 and CRJ-200 fly shorter routes (median around 500 miles). This makes sense because airlines use bigger planes for longer trips and smaller planes for shorter trips.
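To back this up numerically (a sketch using seats as a proxy for plane size), we can correlate each model's seat count with its average flight distance; a clearly positive correlation would support the "bigger planes fly farther" reading:

# Seats vs. average distance per model (models with > 500 flights)
model_size <- combined %>% 
  group_by(model) %>% 
  summarise(seats = mean(seats, na.rm = TRUE),
            avg_distance = mean(distance, na.rm = TRUE),
            n_flights = n()) %>% 
  filter(n_flights > 500)

cor(model_size$seats, model_size$avg_distance, use = "complete.obs")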

3. Datetime and Data Transformation

For the following tasks, use the weather, flights, and planes data sets from the nycflights13 package.

  a. Create a plot of the temperature change across the whole year of 2013 at the JFK airport. (Hint: You need to first create a datetime variable for each hour.)
jfk <- weather %>% 
  filter(origin == "JFK")

# Combine year, month, day, and hour into a single datetime for plotting
jfk$datetime <- make_datetime(jfk$year, jfk$month, jfk$day, jfk$hour)

ggplot(jfk, aes(x = datetime, y = temp)) +
  geom_line(color = "red") +
  labs(title = "Temperature at JFK Airport in 2013",
       x = "Date",
       y = "Temperature (F)")

Answer:

The temperature at JFK follows the normal seasonal pattern. It’s cold in January and February (often below 30°F), gets warmer through spring, peaks in July and August (around 80-90°F), and then drops again towards winter. There’s a lot of hourly variation but the overall trend is clear.
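Because the hourly series is quite noisy, a complementary view (a sketch, not required by the question) is to plot the monthly mean temperature, which shows the seasonal pattern more cleanly:

# Monthly mean temperature at JFK in 2013
jfk %>% 
  group_by(month) %>% 
  summarise(avg_temp = mean(temp, na.rm = TRUE)) %>% 
  ggplot(aes(x = month, y = avg_temp)) +
  geom_line(color = "darkred") +
  geom_point() +
  scale_x_continuous(breaks = 1:12) +
  labs(title = "Monthly Mean Temperature at JFK in 2013",
       x = "Month", y = "Temperature (F)")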

  b. Find out which day of the year has the largest temperature difference (defined as the difference between the highest and the lowest temperature) across the day (12am-11pm).
# Daily temperature range (max - min) for each station and day
daily_diff <- weather %>% 
  group_by(origin, month, day) %>% 
  summarise(max_temp = max(temp, na.rm = TRUE),
            min_temp = min(temp, na.rm = TRUE),
            diff = max_temp - min_temp) %>% 
  arrange(desc(diff))

head(daily_diff, 5)

Answer:

The day with the largest temperature difference was January 31, 2013 at LaGuardia, where the temperature swung about 40°F between the lowest and highest reading that day. This was most likely due to a cold front passing through, which is common in New York winters.
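Note that the code above finds the largest swing per station per day. If the question is read as asking for a single day of the year across all three NYC stations combined, a small variation (a sketch) is to drop origin from the grouping:

# Largest daily temperature range across all three stations combined
weather %>% 
  group_by(month, day) %>% 
  summarise(diff = max(temp, na.rm = TRUE) - min(temp, na.rm = TRUE),
            .groups = "drop") %>% 
  arrange(desc(diff)) %>% 
  head(5)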

  c. Find a way to select all overnight flights (also called "red-eye flights", which depart late at night and arrive in the early morning) from the flights data set. Here overnight flights are defined as flights that departed between 10pm and 1am and have an air time of over 4 hours. Create a categorical variable overnight_flag with YES or NO as the possible values. Show your code and the updated data frame.
flights_new <- flights %>% 
  mutate(dep_hour = dep_time %/% 100)   # dep_time is in HHMM format, so %/% 100 extracts the hour

flights_new$overnight_flag <- ifelse(
  (flights_new$dep_hour >= 22 | flights_new$dep_hour < 1) & flights_new$air_time > 240,
  "YES",
  "NO"
)

table(flights_new$overnight_flag)
## 
##     NO    YES 
## 327836    650
flights_new %>% 
  filter(overnight_flag == "YES") %>% 
  select(month, day, dep_time, arr_time, air_time, origin, dest, overnight_flag) %>% 
  head(10)

Answer:

I used dep_time %/% 100 to get the hour of departure, then used ifelse to flag flights as "YES" if they departed between 10pm and 1am AND had an air time of more than 4 hours. Looking at the results, the overnight flights mostly go from JFK or EWR to West Coast destinations like LAX, SFO, and SEA, which makes sense.
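Two edge cases are worth noting in this approach: dep_time records midnight departures as 2400, which %/% 100 turns into hour 24 rather than 0, and cancelled flights have NA dep_time, which leaves overnight_flag as NA. A sketch that handles both (flights_adj is a hypothetical alternative to flights_new above):

# Map 2400 to hour 0 and label flights with a missing departure time as "NO"
flights_adj <- flights %>% 
  mutate(dep_hour = (dep_time %/% 100) %% 24,
         overnight_flag = if_else((dep_hour >= 22 | dep_hour == 0) & air_time > 240,
                                  "YES", "NO", missing = "NO"))

table(flights_adj$overnight_flag)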

  d. Someone says that most overnight flights use relatively small planes. Verify whether this is true with the data frame obtained in c) and the planes data set.
flights_with_planes <- flights_new %>% 
  inner_join(planes, by = "tailnum")

flights_with_planes %>% 
  group_by(overnight_flag) %>% 
  summarise(avg_seats = mean(seats, na.rm = TRUE),
            median_seats = median(seats, na.rm = TRUE),
            n = n())
ggplot(flights_with_planes, aes(x = overnight_flag, y = seats, fill = overnight_flag)) +
  geom_boxplot() +
  labs(title = "Plane Seats: Overnight vs Non-Overnight Flights",
       x = "Overnight Flight?",
       y = "Number of Seats")

Answer:

The statement is not true. Looking at the average and median number of seats, overnight flights actually use bigger planes than non-overnight flights, not smaller ones. This makes sense because overnight flights are usually long distance (transcontinental) flights that need bigger planes like Boeing 757 or Airbus A320, not small regional jets.
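A formal check (a sketch, in addition to the summary table above) is a two-sample t-test on seat counts between the two groups:

# Do overnight and non-overnight flights differ in mean seat count?
t.test(seats ~ overnight_flag, data = flights_with_planes)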

4. General Analysis and Statistical Tests

Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.

  a. From the gss_cat data set, find factors that are significantly correlated with the reported income.
str(gss_cat)
## tibble [21,483 × 9] (S3: tbl_df/tbl/data.frame)
##  $ year   : int [1:21483] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ marital: Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...
##  $ age    : int [1:21483] 26 48 67 39 25 25 36 44 44 47 ...
##  $ race   : Factor w/ 4 levels "Other","Black",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ rincome: Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
##  $ partyid: Factor w/ 10 levels "No answer","Don't know",..: 6 5 7 6 9 10 5 8 9 4 ...
##  $ relig  : Factor w/ 16 levels "No answer","Don't know",..: 15 15 15 6 12 15 5 15 15 15 ...
##  $ denom  : Factor w/ 30 levels "No answer","Don't know",..: 25 23 3 30 30 25 30 15 4 25 ...
##  $ tvhours: int [1:21483] 12 NA 2 4 1 NA 3 NA 0 3 ...
good_incomes <- c("Lt $1000", "$1000 to 2999", "$3000 to 3999", "$4000 to 4999",
                  "$5000 to 5999", "$6000 to 6999", "$7000 to 7999",
                  "$8000 to 9999", "$10000 - 14999", "$15000 - 19999",
                  "$20000 - 24999", "$25000 or more")

gss_clean <- gss_cat %>% 
  filter(rincome %in% good_incomes)

gss_clean$income_num <- as.numeric(factor(gss_clean$rincome, levels = good_incomes))

summary(aov(income_num ~ marital, data = gss_clean))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## marital         5   4129   825.9   98.09 <2e-16 ***
## Residuals   13009 109535     8.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(income_num ~ race, data = gss_clean))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## race            2    771   385.7   44.45 <2e-16 ***
## Residuals   13012 112893     8.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(income_num ~ partyid, data = gss_clean))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## partyid         9   1124  124.92   14.44 <2e-16 ***
## Residuals   13005 112540    8.65                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cor.test(gss_clean$income_num, gss_clean$age)
## 
##  Pearson's product-moment correlation
## 
## data:  gss_clean$income_num and gss_clean$age
## t = 17.99, df = 12988, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1391008 0.1726587
## sample estimates:
##       cor 
## 0.1559248
ggplot(gss_clean, aes(x = marital, fill = rincome)) +
  geom_bar(position = "fill") +
  labs(title = "Income Distribution by Marital Status",
       y = "Proportion") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

ggplot(gss_clean, aes(x = race, fill = rincome)) +
  geom_bar(position = "fill") +
  labs(title = "Income Distribution by Race",
       y = "Proportion")

Answer:

I treated the reported income as a number from 1 (lowest bracket) to 12 (highest bracket) and used ANOVA tests to see which factors are significantly related to income. The results show that marital status, race, and party identification are all significantly associated with income (each ANOVA has p < 2e-16), and age has a weak but significant positive correlation with income (r ≈ 0.16, p < 2.2e-16).

The bar charts also show these patterns visually: for example, the "$25000 or more" group is much larger among married respondents and among the "White" group.
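The ANOVAs above test each factor on its own; a combined model (a sketch, treating the 1-12 income coding as roughly numeric) checks whether each factor still matters after controlling for the others:

# Joint model: which factors remain significant when considered together?
income_model <- lm(income_num ~ marital + race + partyid + age, data = gss_clean)
anova(income_model)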

  b. From the smoking data set of the openintro package, find factors that are significantly correlated with the smoking status and the number of cigarettes smoked per day.
str(smoking)
## tibble [1,691 × 12] (S3: tbl_df/tbl/data.frame)
##  $ gender               : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
##  $ age                  : int [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
##  $ marital_status       : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
##  $ highest_qualification: Factor w/ 8 levels "A Levels","Degree",..: 6 6 2 2 4 4 2 2 3 6 ...
##  $ nationality          : Factor w/ 8 levels "British","English",..: 1 1 2 2 1 1 1 2 2 2 ...
##  $ ethnicity            : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ gross_income         : Factor w/ 10 levels "10,400 to 15,600",..: 3 9 5 1 3 2 7 1 3 6 ...
##  $ region               : Factor w/ 7 levels "London","Midlands & East Anglia",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ smoke                : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 2 2 ...
##  $ amt_weekends         : int [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
##  $ amt_weekdays         : int [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
##  $ type                 : Factor w/ 5 levels "","Both/Mainly Hand-Rolled",..: 1 5 1 1 1 1 5 1 4 5 ...
chisq.test(table(smoking$gender, smoking$smoke))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(smoking$gender, smoking$smoke)
## X-squared = 0.42699, df = 1, p-value = 0.5135
chisq.test(table(smoking$marital_status, smoking$smoke))
## 
##  Pearson's Chi-squared test
## 
## data:  table(smoking$marital_status, smoking$smoke)
## X-squared = 74.98, df = 4, p-value = 2.012e-15
chisq.test(table(smoking$highest_qualification, smoking$smoke))
## 
##  Pearson's Chi-squared test
## 
## data:  table(smoking$highest_qualification, smoking$smoke)
## X-squared = 40.281, df = 7, p-value = 1.112e-06
chisq.test(table(smoking$gross_income, smoking$smoke))
## 
##  Pearson's Chi-squared test
## 
## data:  table(smoking$gross_income, smoking$smoke)
## X-squared = 19.835, df = 9, p-value = 0.01896
t.test(age ~ smoke, data = smoking)
## 
##  Welch Two Sample t-test
## 
## data:  age by smoke
## t = 9.9722, df = 831.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##   7.615566 11.348206
## sample estimates:
##  mean in group No mean in group Yes 
##          52.19685          42.71496
smoking %>% 
  group_by(highest_qualification) %>% 
  summarise(smoke_rate = mean(smoke == "Yes")) %>% 
  ggplot(aes(x = reorder(highest_qualification, smoke_rate), y = smoke_rate)) +
  geom_bar(stat = "identity", fill = "orange") +
  coord_flip() +
  labs(title = "Smoking Rate by Education Level",
       x = "Highest Qualification",
       y = "Proportion who smoke")

smokers <- smoking %>% 
  filter(smoke == "Yes")

t.test(amt_weekdays ~ gender, data = smokers)
## 
##  Welch Two Sample t-test
## 
## data:  amt_weekdays by gender
## t = -4.1101, df = 309.26, p-value = 5.071e-05
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -5.728368 -2.019289
## sample estimates:
## mean in group Female   mean in group Male 
##             12.02991             15.90374
summary(aov(amt_weekdays ~ highest_qualification, data = smokers))
##                        Df Sum Sq Mean Sq F value Pr(>F)  
## highest_qualification   7   1058  151.14   1.736  0.099 .
## Residuals             413  35961   87.07                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(amt_weekdays ~ gross_income, data = smokers))
##               Df Sum Sq Mean Sq F value Pr(>F)
## gross_income   9    898   99.75   1.135  0.336
## Residuals    411  36121   87.89
cor.test(smokers$amt_weekdays, smokers$age)
## 
##  Pearson's product-moment correlation
## 
## data:  smokers$amt_weekdays and smokers$age
## t = 4.0216, df = 419, p-value = 6.856e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.0990349 0.2831382
## sample estimates:
##       cor 
## 0.1927826
ggplot(smokers, aes(x = highest_qualification, y = amt_weekdays)) +
  geom_boxplot(fill = "lightcoral") +
  coord_flip() +
  labs(title = "Cigarettes per Day by Education",
       x = "Education",
       y = "Cigarettes per weekday")

Answer:

For smoking status (Yes/No): using chi-square tests, the factors significantly related to smoking status are marital status (p ≈ 2e-15), highest qualification (p ≈ 1.1e-06), and gross income (p ≈ 0.019); gender is not significant (p ≈ 0.51). The t-test on age also shows that smokers are noticeably younger than non-smokers on average (about 43 vs. 52 years).

For the number of cigarettes per weekday (among smokers): gender matters (men smoke roughly 4 more cigarettes per weekday than women, p ≈ 5e-05) and age is positively correlated with the amount smoked (r ≈ 0.19, p ≈ 7e-05); education is only marginally significant (p ≈ 0.099) and income is not significant (p ≈ 0.34).

Overall conclusion: education is one of the strongest factors for whether someone smokes at all, with lower education linked to a higher smoking rate, and marital status, income, and age also matter. Among smokers, however, how much they smoke is more clearly tied to gender and age than to education or income.
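As in the previous question, the tests above consider one factor at a time. A joint check (a sketch, assuming a simple additive model is adequate) fits a logistic regression for smoking status and a linear model for the amount smoked among smokers:

# Joint models: smoking status (logistic regression) and
# cigarettes per weekday among smokers (linear model)
smoke_model <- glm(smoke ~ gender + age + marital_status + highest_qualification + gross_income,
                   data = smoking, family = binomial)
summary(smoke_model)

amt_model <- lm(amt_weekdays ~ gender + age + highest_qualification + gross_income,
                data = smokers)
summary(amt_model)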