I, Sy Cong Nguyen, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
library(tidyverse)
library(nycflights13)
library(openintro)
library(lubridate)
The following questions shall be answered by working with the
world_bank_pop and who data sets from the
openinto library.
world_bank_pop is not clean. Clean the
data set such that the after data tidying you have six columns:
country, year, SP.URB.TOTL,
SP.URB.GROW, SP.POP.TOTL,
SP.POP.GROW. Give your code and show the first 10 rows of
the data set after being tidied. Then explain the meaning of each
column.head(world_bank_pop)
long_data <- pivot_longer(world_bank_pop,
cols = `2000`:`2017`,
names_to = "year",
values_to = "value")
head(long_data)
tidy_data <- pivot_wider(long_data,
names_from = indicator,
values_from = value)
tidy_data$year <- as.numeric(tidy_data$year)
head(tidy_data, 10)
Answer:
The data set now has 6 columns:
country: the 3-letter country code (like USA, CHN)year: the year (from 2000 to 2017)SP.URB.TOTL: total urban population in that
country/yearSP.URB.GROW: urban population growth rate (%)SP.POP.TOTL: total populationSP.POP.GROW: total population growth rate (%)The original data was messy because the years were spread across
columns. After using pivot_longer and then
pivot_wider, each row is one country-year observation.
country column of the tided data set in
step a) with full names of the country (for example, replace
USA with United States of America) by checking
the data frame who, which contains the full name of each
country corresponding to the three-digit country code. Give your code
and show the updated data set in a manner to illustrate that the task is
correctly fulfilled.country_names <- who %>%
select(country, iso3) %>%
distinct()
head(country_names)
tidy_data2 <- tidy_data %>%
rename(iso3 = country)
tidy_data2 <- left_join(tidy_data2, country_names, by = "iso3")
tidy_data2 <- tidy_data2 %>%
select(country, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW)
tidy_data2 %>%
filter(country %in% c("United States of America", "China", "India", "Germany")) %>%
head(10)
Answer:
The country column now has full names instead of codes.
For example, “USA” is now “United States of America” and “CHN” is now
“China”. I used left_join to match the 3-letter codes in
world_bank_pop with the country names in who.
urban_data <- tidy_data2 %>%
filter(year == 2000 | year == 2017) %>%
filter(!is.na(country)) %>%
mutate(urban_share = SP.URB.TOTL / SP.POP.TOTL)
urban_compare <- urban_data %>%
select(country, year, urban_share) %>%
pivot_wider(names_from = year, values_from = urban_share)
urban_compare$change <- urban_compare$`2017` - urban_compare$`2000`
urban_compare <- urban_compare %>%
filter(!is.na(change)) %>%
arrange(desc(change))
top10 <- head(urban_compare, 10)
top10
ggplot(top10, aes(x = reorder(country, change), y = change * 100)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Countries with Biggest Urbanization 2000-2017",
x = "Country",
y = "Change in urban share (%)")
Answer:
I calculated the urban share (urban population divided by total
population) for each country in 2000 and 2017, then found the
difference. The countries with the biggest increase in urban share are
the ones that urbanized the most. From the bar chart, countries like
Maldives, Laos, Bhutan, Burkina Faso, and several other developing
countries in Asia and Africa had the biggest urbanization, with their
urban share going up by around 15-25 percentage points.
For the following tasks, use data set planes and
flights from the nycflights13 package.
planes data set, only keep planes from
manufacturers that have more than 10 samples in the data set. Then
convert manufacturer column into a factor. Then combine
AIRBUS and AIRBUS INDUSTRIE as a single
category AIRBUS; combine MCDONNELL DOUGLAS,
MCDONNELL DOUGLAS AIRCRAFT CO and
MCDONNELL DOUGLAS CORPORATION into a single category
MCDONNELL. Save your data frame as a new one. Show your
code and the first 10 rows of the updated data frame.manuf_count <- planes %>%
group_by(manufacturer) %>%
summarise(n = n())
manuf_count
big_manuf <- manuf_count %>%
filter(n > 10) %>%
pull(manufacturer)
big_manuf
## [1] "AIRBUS" "AIRBUS INDUSTRIE"
## [3] "BOEING" "BOMBARDIER INC"
## [5] "EMBRAER" "MCDONNELL DOUGLAS"
## [7] "MCDONNELL DOUGLAS AIRCRAFT CO" "MCDONNELL DOUGLAS CORPORATION"
planes_new <- planes %>%
filter(manufacturer %in% big_manuf)
planes_new$manufacturer <- ifelse(planes_new$manufacturer == "AIRBUS INDUSTRIE",
"AIRBUS",
planes_new$manufacturer)
planes_new$manufacturer <- ifelse(planes_new$manufacturer %in%
c("MCDONNELL DOUGLAS",
"MCDONNELL DOUGLAS AIRCRAFT CO",
"MCDONNELL DOUGLAS CORPORATION"),
"MCDONNELL",
planes_new$manufacturer)
planes_new$manufacturer <- as.factor(planes_new$manufacturer)
levels(planes_new$manufacturer)
## [1] "AIRBUS" "BOEING" "BOMBARDIER INC" "EMBRAER"
## [5] "MCDONNELL"
head(planes_new, 10)
Answer:
I first counted how many planes each manufacturer had, then kept only
the ones with more than 10. After that I used ifelse to
rename the Airbus and McDonnell Douglas variants so they all fall under
one category, and finally converted the column to a factor. The new
factor has fewer levels which is easier to work with.
flights data set with the planes
data set, study how plane models correlate with the flight distance with
proper data visualizations or summary tables. You are required to
summarize your findings concisely in your own words.combined <- flights %>%
inner_join(planes_new, by = "tailnum")
model_dist <- combined %>%
group_by(model) %>%
summarise(avg_distance = mean(distance, na.rm = TRUE),
n_flights = n()) %>%
filter(n_flights > 500) %>%
arrange(desc(avg_distance))
model_dist
top_models <- combined %>%
count(model, sort = TRUE) %>%
head(10) %>%
pull(model)
combined %>%
filter(model %in% top_models) %>%
ggplot(aes(x = model, y = distance)) +
geom_boxplot(fill = "lightblue") +
coord_flip() +
labs(title = "Flight Distance by Plane Model",
x = "Plane Model",
y = "Distance (miles)")
Answer:
Different plane models fly different distances. From the boxplot, I
can see that bigger planes like the Airbus A320 series and Boeing
757-200 fly the longest distances (median over 1000 miles), while
smaller regional jets like the Embraer ERJ-145 and CRJ-200 fly shorter
routes (median around 500 miles). This makes sense because airlines use
bigger planes for longer trips and smaller planes for shorter trips.
For the following tasks, use the data set weather,
flights or planes from the
nycflights13 package.
JFK airport. (Hint: You need to first create a
datetime variable for each hour.)jfk <- weather %>%
filter(origin == "JFK")
jfk$datetime <- make_datetime(jfk$year, jfk$month, jfk$day, jfk$hour)
ggplot(jfk, aes(x = datetime, y = temp)) +
geom_line(color = "red") +
labs(title = "Temperature at JFK Airport in 2013",
x = "Date",
y = "Temperature (F)")
Answer:
The temperature at JFK follows the normal seasonal pattern. It’s cold
in January and February (often below 30°F), gets warmer through spring,
peaks in July and August (around 80-90°F), and then drops again towards
winter. There’s a lot of hourly variation but the overall trend is
clear.
daily_diff <- weather %>%
group_by(origin, month, day) %>%
summarise(max_temp = max(temp, na.rm = TRUE),
min_temp = min(temp, na.rm = TRUE),
diff = max_temp - min_temp) %>%
arrange(desc(diff))
head(daily_diff, 5)
Answer:
The day with the largest temperature difference was January
31, 2013 at LaGuardia, where the temperature swung about 40°F
between the lowest and highest reading that day. This is because of a
cold front passing through, which is common in New York winters.
flights data set. Here overnight
flights are defined as flights that departed between 10pm and 1am, and
having an air time of over 4 hours . Create a categorical variable
overnight_flag with YES or NO as
the possible values. Show your code and the updated data frame.flights_new <- flights %>%
mutate(dep_hour = dep_time %/% 100)
flights_new$overnight_flag <- ifelse(
(flights_new$dep_hour >= 22 | flights_new$dep_hour < 1) & flights_new$air_time > 240,
"YES",
"NO"
)
table(flights_new$overnight_flag)
##
## NO YES
## 327836 650
flights_new %>%
filter(overnight_flag == "YES") %>%
select(month, day, dep_time, arr_time, air_time, origin, dest, overnight_flag) %>%
head(10)
Answer:
I used dep_time %/% 100 to get the hour of departure,
then used ifelse to flag flights as “YES” if they departed
between 10pm-1am AND had air time more than 4 hours. Looking at the
results, the overnight flights are mostly going from JFK or EWR to West
Coast destinations like LAX, SFO, and SEA, which makes sense.
planes data set.flights_with_planes <- flights_new %>%
inner_join(planes, by = "tailnum")
flights_with_planes %>%
group_by(overnight_flag) %>%
summarise(avg_seats = mean(seats, na.rm = TRUE),
median_seats = median(seats, na.rm = TRUE),
n = n())
ggplot(flights_with_planes, aes(x = overnight_flag, y = seats, fill = overnight_flag)) +
geom_boxplot() +
labs(title = "Plane Seats: Overnight vs Non-Overnight Flights",
x = "Overnight Flight?",
y = "Number of Seats")
Answer:
The statement is not true. Looking at the average
and median number of seats, overnight flights actually use
bigger planes than non-overnight flights, not smaller
ones. This makes sense because overnight flights are usually long
distance (transcontinental) flights that need bigger planes like Boeing
757 or Airbus A320, not small regional jets.
Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.
gss_cat data set, find factors that are
significantly correlated with the reported income.str(gss_cat)
## tibble [21,483 × 9] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:21483] 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ marital: Factor w/ 6 levels "No answer","Never married",..: 2 4 5 2 4 6 2 4 6 6 ...
## $ age : int [1:21483] 26 48 67 39 25 25 36 44 44 47 ...
## $ race : Factor w/ 4 levels "Other","Black",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ rincome: Factor w/ 16 levels "No answer","Don't know",..: 8 8 16 16 16 5 4 9 4 4 ...
## $ partyid: Factor w/ 10 levels "No answer","Don't know",..: 6 5 7 6 9 10 5 8 9 4 ...
## $ relig : Factor w/ 16 levels "No answer","Don't know",..: 15 15 15 6 12 15 5 15 15 15 ...
## $ denom : Factor w/ 30 levels "No answer","Don't know",..: 25 23 3 30 30 25 30 15 4 25 ...
## $ tvhours: int [1:21483] 12 NA 2 4 1 NA 3 NA 0 3 ...
good_incomes <- c("Lt $1000", "$1000 to 2999", "$3000 to 3999", "$4000 to 4999",
"$5000 to 5999", "$6000 to 6999", "$7000 to 7999",
"$8000 to 9999", "$10000 - 14999", "$15000 - 19999",
"$20000 - 24999", "$25000 or more")
gss_clean <- gss_cat %>%
filter(rincome %in% good_incomes)
gss_clean$income_num <- as.numeric(factor(gss_clean$rincome, levels = good_incomes))
summary(aov(income_num ~ marital, data = gss_clean))
## Df Sum Sq Mean Sq F value Pr(>F)
## marital 5 4129 825.9 98.09 <2e-16 ***
## Residuals 13009 109535 8.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(income_num ~ race, data = gss_clean))
## Df Sum Sq Mean Sq F value Pr(>F)
## race 2 771 385.7 44.45 <2e-16 ***
## Residuals 13012 112893 8.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(income_num ~ partyid, data = gss_clean))
## Df Sum Sq Mean Sq F value Pr(>F)
## partyid 9 1124 124.92 14.44 <2e-16 ***
## Residuals 13005 112540 8.65
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cor.test(gss_clean$income_num, gss_clean$age)
##
## Pearson's product-moment correlation
##
## data: gss_clean$income_num and gss_clean$age
## t = 17.99, df = 12988, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1391008 0.1726587
## sample estimates:
## cor
## 0.1559248
ggplot(gss_clean, aes(x = marital, fill = rincome)) +
geom_bar(position = "fill") +
labs(title = "Income Distribution by Marital Status",
y = "Proportion") +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
ggplot(gss_clean, aes(x = race, fill = rincome)) +
geom_bar(position = "fill") +
labs(title = "Income Distribution by Race",
y = "Proportion")
Answer:
I treated income as a number (1 to 12) and used ANOVA tests to see which factors are significantly related to income. The results show that:
The bar charts also show these patterns visually — for example, the
“$25000 or more” group is much bigger among married respondents and
among the “White” group.
smoking data set of the openintro
package, find find factors that are significantly correlated with the
smoking status and the number of cigarettes smoked per day.str(smoking)
## tibble [1,691 × 12] (S3: tbl_df/tbl/data.frame)
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 1 1 2 2 2 1 ...
## $ age : int [1:1691] 38 42 40 40 39 37 53 44 40 41 ...
## $ marital_status : Factor w/ 5 levels "Divorced","Married",..: 1 4 2 2 2 2 2 4 4 2 ...
## $ highest_qualification: Factor w/ 8 levels "A Levels","Degree",..: 6 6 2 2 4 4 2 2 3 6 ...
## $ nationality : Factor w/ 8 levels "British","English",..: 1 1 2 2 1 1 1 2 2 2 ...
## $ ethnicity : Factor w/ 7 levels "Asian","Black",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ gross_income : Factor w/ 10 levels "10,400 to 15,600",..: 3 9 5 1 3 2 7 1 3 6 ...
## $ region : Factor w/ 7 levels "London","Midlands & East Anglia",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ smoke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 1 2 2 ...
## $ amt_weekends : int [1:1691] NA 12 NA NA NA NA 6 NA 8 15 ...
## $ amt_weekdays : int [1:1691] NA 12 NA NA NA NA 6 NA 8 12 ...
## $ type : Factor w/ 5 levels "","Both/Mainly Hand-Rolled",..: 1 5 1 1 1 1 5 1 4 5 ...
chisq.test(table(smoking$gender, smoking$smoke))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(smoking$gender, smoking$smoke)
## X-squared = 0.42699, df = 1, p-value = 0.5135
chisq.test(table(smoking$marital_status, smoking$smoke))
##
## Pearson's Chi-squared test
##
## data: table(smoking$marital_status, smoking$smoke)
## X-squared = 74.98, df = 4, p-value = 2.012e-15
chisq.test(table(smoking$highest_qualification, smoking$smoke))
##
## Pearson's Chi-squared test
##
## data: table(smoking$highest_qualification, smoking$smoke)
## X-squared = 40.281, df = 7, p-value = 1.112e-06
chisq.test(table(smoking$gross_income, smoking$smoke))
##
## Pearson's Chi-squared test
##
## data: table(smoking$gross_income, smoking$smoke)
## X-squared = 19.835, df = 9, p-value = 0.01896
t.test(age ~ smoke, data = smoking)
##
## Welch Two Sample t-test
##
## data: age by smoke
## t = 9.9722, df = 831.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 7.615566 11.348206
## sample estimates:
## mean in group No mean in group Yes
## 52.19685 42.71496
smoking %>%
group_by(highest_qualification) %>%
summarise(smoke_rate = mean(smoke == "Yes")) %>%
ggplot(aes(x = reorder(highest_qualification, smoke_rate), y = smoke_rate)) +
geom_bar(stat = "identity", fill = "orange") +
coord_flip() +
labs(title = "Smoking Rate by Education Level",
x = "Highest Qualification",
y = "Proportion who smoke")
smokers <- smoking %>%
filter(smoke == "Yes")
t.test(amt_weekdays ~ gender, data = smokers)
##
## Welch Two Sample t-test
##
## data: amt_weekdays by gender
## t = -4.1101, df = 309.26, p-value = 5.071e-05
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -5.728368 -2.019289
## sample estimates:
## mean in group Female mean in group Male
## 12.02991 15.90374
summary(aov(amt_weekdays ~ highest_qualification, data = smokers))
## Df Sum Sq Mean Sq F value Pr(>F)
## highest_qualification 7 1058 151.14 1.736 0.099 .
## Residuals 413 35961 87.07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(amt_weekdays ~ gross_income, data = smokers))
## Df Sum Sq Mean Sq F value Pr(>F)
## gross_income 9 898 99.75 1.135 0.336
## Residuals 411 36121 87.89
cor.test(smokers$amt_weekdays, smokers$age)
##
## Pearson's product-moment correlation
##
## data: smokers$amt_weekdays and smokers$age
## t = 4.0216, df = 419, p-value = 6.856e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0990349 0.2831382
## sample estimates:
## cor
## 0.1927826
ggplot(smokers, aes(x = highest_qualification, y = amt_weekdays)) +
geom_boxplot(fill = "lightcoral") +
coord_flip() +
labs(title = "Cigarettes per Day by Education",
x = "Education",
y = "Cigarettes per weekday")
Answer:
For smoking status (Yes/No): Using chi-square tests, the factors significantly related to smoking status are:
For number of cigarettes per day (among smokers):
Overall conclusion: Education level is the most
important factor for both whether someone smokes and how much they
smoke. People with lower education are more likely to smoke AND smoke
more cigarettes per day. Income and marital status also matter.