I, Boyang Liu, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
library(openintro)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.5.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
library(nycflights13)
library(lubridate)
library(forcats)
The following questions shall be answered by working with the
world_bank_pop and who data sets from the
openintro library.
world_bank_pop is not clean. Clean the
data set such that after data tidying you have six columns:
country, year, SP.URB.TOTL,
SP.URB.GROW, SP.POP.TOTL,
SP.POP.GROW. Give your code and show the first 10 rows of
the data set after being tidied. Then explain the meaning of each
column.
world_bank_tidy <- world_bank_pop %>%
  pivot_longer(
    cols = -c(country, indicator),
    names_to = "year",
    values_to = "value"
  ) %>%
  pivot_wider(
    names_from = indicator,
    values_from = value
  ) %>%
  mutate(year = as.integer(year)) %>%
  select(
    country,
    year,
    SP.URB.TOTL,
    SP.URB.GROW,
    SP.POP.TOTL,
    SP.POP.GROW
  )
head(world_bank_tidy, 10)
Answer:
country = 3-letter country code.
year = year of observation.
SP.URB.TOTL = total urban
population.
SP.URB.GROW = urban population growth rate.
SP.POP.TOTL = total population.
SP.POP.GROW = total population
growth rate.
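The two-step reshape used above can be illustrated on a small made-up table. This is only a sketch: the country codes AAA and BBB and all values in toy_wide are invented for illustration and are not taken from world_bank_pop.

```r
library(dplyr)
library(tidyr)

# Toy data in the same wide layout as world_bank_pop (all values are made up)
toy_wide <- tibble(
  country   = c("AAA", "AAA", "BBB", "BBB"),
  indicator = c("SP.POP.TOTL", "SP.URB.TOTL", "SP.POP.TOTL", "SP.URB.TOTL"),
  `2000`    = c(100, 40, 200, 150),
  `2001`    = c(110, 46, 210, 160)
)

toy_tidy <- toy_wide %>%
  # Step 1: stack the year columns into (year, value) pairs
  pivot_longer(
    cols = -c(country, indicator),
    names_to = "year",
    values_to = "value"
  ) %>%
  # Step 2: spread each indicator into its own column
  pivot_wider(
    names_from = indicator,
    values_from = value
  ) %>%
  # Year labels come out of pivot_longer() as text, so convert them
  mutate(year = as.integer(year))

toy_tidy
```

The result has one row per country and year and one column per indicator, which is the same layout as world_bank_tidy.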
Replace the country column of the tidied data set in
step a) with full names of the country (for example, replace
USA with United States of America) by checking
the data frame who, which contains the full name of each
country corresponding to the three-letter country code. Give your code
and show the updated data set in a manner to illustrate that the task is
correctly fulfilled.
country_names <- who %>%
  select(country_name = country, iso3) %>%
  distinct()
world_bank_named <- world_bank_tidy %>%
  left_join(country_names, by = c("country" = "iso3")) %>%
  mutate(country = ifelse(is.na(country_name), country, country_name)) %>%
  select(
    country,
    year,
    SP.URB.TOTL,
    SP.URB.GROW,
    SP.POP.TOTL,
    SP.POP.GROW
  )
head(world_bank_named, 10)
Answer:
urban_change <- world_bank_named %>%
  filter(country %in% country_names$country_name) %>%
  filter(year == 2000 | year == 2017) %>%
  mutate(urban_percent = SP.URB.TOTL / SP.POP.TOTL * 100) %>%
  select(country, year, urban_percent) %>%
  pivot_wider(
    names_from = year,
    values_from = urban_percent,
    names_prefix = "year_"
  ) %>%
  mutate(change = year_2017 - year_2000) %>%
  arrange(desc(change))
urban_change %>%
  select(country, year_2000, year_2017, change) %>%
  head(10)
top_10_urbanization <- urban_change %>%
  head(10)
ggplot(top_10_urbanization, aes(x = reorder(country, change), y = change)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(
    title = "Top 10 Countries with the Largest Urbanization Increase",
    x = "Country",
    y = "Increase in Urban Population Percentage (2000 to 2017)"
  )
Answer:
I calculated the urban population
percentage for each country in 2000 and 2017. Then I found the
difference between these two years.
The countries with the largest increases had the most significant urbanization. Based on the table and graph, Equatorial Guinea, China, Costa Rica, Haiti, and Sao Tome and Principe had large increases in urbanization from 2000 to 2017.
This means that in these countries, more people lived in urban areas in 2017 than in 2000.
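To make the calculation concrete, here is a small sketch on invented numbers (the countries AAA and BBB and their populations are hypothetical). Note that change is a difference of two percentages, so it is measured in percentage points.

```r
library(dplyr)
library(tidyr)

# Made-up urban and total populations for two hypothetical countries
toy_pop <- tibble(
  country     = c("AAA", "AAA", "BBB", "BBB"),
  year        = c(2000L, 2017L, 2000L, 2017L),
  SP.URB.TOTL = c(40, 66, 150, 168),
  SP.POP.TOTL = c(100, 120, 200, 210)
)

toy_change <- toy_pop %>%
  # Urban share of the total population, in percent
  mutate(urban_percent = SP.URB.TOTL / SP.POP.TOTL * 100) %>%
  select(country, year, urban_percent) %>%
  # One column per year so the two shares can be subtracted
  pivot_wider(
    names_from = year,
    values_from = urban_percent,
    names_prefix = "year_"
  ) %>%
  # Change in percentage points between 2000 and 2017
  mutate(change = year_2017 - year_2000) %>%
  arrange(desc(change))

toy_change
```

In this toy table AAA goes from 40% to 55% urban (a change of 15 points) and BBB from 75% to 80% (a change of 5 points), so AAA is listed first.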
For the following tasks, use data set planes and
flights from the nycflights13 package.
In the planes data set, only keep planes from
manufacturers that have more than 10 samples in the data set. Then
convert the manufacturer column into a factor. Then combine
AIRBUS and AIRBUS INDUSTRIE as a single
category AIRBUS; combine MCDONNELL DOUGLAS,
MCDONNELL DOUGLAS AIRCRAFT CO and
MCDONNELL DOUGLAS CORPORATION into a single category
MCDONNELL. Save your data frame as a new one. Show your
code and the first 10 rows of the updated data frame.
planes_clean <- planes %>%
  group_by(manufacturer) %>%
  filter(n() > 10) %>%
  ungroup() %>%
  mutate(
    manufacturer = case_when(
      manufacturer %in% c(
        "AIRBUS",
        "AIRBUS INDUSTRIE"
      ) ~ "AIRBUS",
      manufacturer %in% c(
        "MCDONNELL DOUGLAS",
        "MCDONNELL DOUGLAS AIRCRAFT CO",
        "MCDONNELL DOUGLAS CORPORATION"
      ) ~ "MCDONNELL",
      TRUE ~ manufacturer
    ),
    manufacturer = as.factor(manufacturer)
  )
head(planes_clean, 10)
Answer:
By joining the flights data set with the planes
data set, study how plane models correlate with the flight distance using
proper data visualizations or summary tables. You are required to
summarize your findings concisely in your own words.
flights_planes <- flights %>%
  inner_join(planes_clean, by = "tailnum")
model_distance <- flights_planes %>%
  group_by(model) %>%
  summarise(
    number_of_flights = n(),
    average_distance = mean(distance, na.rm = TRUE),
    median_distance = median(distance, na.rm = TRUE)
  ) %>%
  filter(number_of_flights > 50) %>%
  arrange(desc(average_distance))
head(model_distance, 10)
top_models <- model_distance %>%
  head(10)
ggplot(top_models, aes(x = reorder(model, average_distance), y = average_distance)) +
  geom_col(fill = "orange") +
  coord_flip() +
  labs(
    title = "Top 10 Plane Models by Average Flight Distance",
    x = "Plane Model",
    y = "Average Flight Distance"
  )
Answer:
I joined the flights data set with the
cleaned planes data set by tailnum. Then I grouped the data by plane
model and calculated the average flight distance.
Based on the table and graph, some plane models have higher average flight distances than others. This means some models are used more often for longer flights, while other models are used more often for shorter flights.
So, plane model is related to flight distance. Since model is a category, I compared the average distance for each model instead of using a numeric correlation.
For the following tasks, use the data set weather,
flights or planes from the
nycflights13 package.
JFK airport. (Hint: You need to first create a
datetime variable for each hour.)
jfk_weather <- weather %>%
  filter(origin == "JFK") %>%
  mutate(
    datetime = make_datetime(year, month, day, hour)
  )
ggplot(jfk_weather, aes(x = datetime, y = temp)) +
  geom_line() +
  labs(
    title = "Temperature Change at JFK Airport in 2013",
    x = "Date",
    y = "Temperature"
  )
Answer:
This plot shows the temperature change
at JFK airport across the whole year of 2013. I created a datetime
variable by combining year, month, day, and hour. The temperature is
lower in winter and higher in summer.
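As a small illustration of the datetime step, the sketch below builds one timestamp from separate year, month, day, and hour values. The specific date is just an example, and the name example_datetime is made up for this sketch. By default make_datetime() returns a time in UTC.

```r
library(lubridate)

# Hypothetical single observation: 5 am on 1 January 2013
example_datetime <- make_datetime(year = 2013, month = 1, day = 1, hour = 5)

# A POSIXct value that can be used directly on a continuous time axis
class(example_datetime)
format(example_datetime, "%Y-%m-%d %H:%M")
```

The weather data set also has a time_hour column that stores each hour as a date-time, so it could probably be used for the x-axis as well.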
daily_temp_diff <- jfk_weather %>%
  group_by(year, month, day) %>%
  summarise(
    highest_temp = max(temp, na.rm = TRUE),
    lowest_temp = min(temp, na.rm = TRUE),
    temp_difference = highest_temp - lowest_temp,
    .groups = "drop"
  ) %>%
  arrange(desc(temp_difference))
head(daily_temp_diff, 10)
Answer:
I grouped the JFK weather data by year,
month, and day. Then I calculated the highest temperature and the lowest
temperature for each day. The temperature difference is the highest
temperature minus the lowest temperature.
The first row in the table shows the day with the largest temperature difference.
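The daily range calculation can be checked by hand on a tiny made-up table. The readings in toy_weather are invented, and one of them is missing on purpose to show what na.rm = TRUE does.

```r
library(dplyr)

# Three hypothetical hourly readings on each of two days
toy_weather <- tibble(
  month = c(1, 1, 1, 1, 1, 1),
  day   = c(1, 1, 1, 2, 2, 2),
  hour  = c(0, 12, 23, 0, 12, 23),
  temp  = c(30, 45, 33, 28, NA, 31)
)

toy_range <- toy_weather %>%
  group_by(month, day) %>%
  summarise(
    # na.rm = TRUE ignores the missing reading on day 2
    highest_temp = max(temp, na.rm = TRUE),
    lowest_temp = min(temp, na.rm = TRUE),
    temp_difference = highest_temp - lowest_temp,
    .groups = "drop"
  ) %>%
  # Largest daily range first
  arrange(desc(temp_difference))

toy_range
```

Here day 1 has a range of 45 - 30 = 15 and day 2 has a range of 31 - 28 = 3, so day 1 is the first row.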
flights data set. Here overnight
flights are defined as flights that departed between 10pm and 1am, and
having an air time of over 4 hours. Create a categorical variable
overnight_flag with YES or NO as
the possible values. Show your code and the updated data frame.
flights_overnight <- flights %>%
  mutate(
    dep_hour = dep_time %/% 100,
    overnight_flag = ifelse(
      # dep_time is in HHMM format: 10pm or later, or no later than 1:00am
      (dep_hour >= 22 | dep_time <= 100) & air_time > 240,
      "YES",
      "NO"
    ),
    overnight_flag = as.factor(overnight_flag)
  )
head(flights_overnight, 10)
Answer:
I created a new variable called
overnight_flag.
A flight is marked as YES if it departed between 10pm and 1am and had an air time over 4 hours. Otherwise, it is marked as NO.
I used dep_time to get the departure hour, and I used air_time > 240 because 4 hours equals 240 minutes.
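The flag logic can be checked on a few made-up departure times. This sketch assumes a strict reading of the time window, where dep_time (in HHMM format) must be from 2200 to 2400 or at most 100, so a 1:30am departure is not counted. It also assumes that rows with a missing time should stay NA. The values in toy_flights are invented.

```r
library(dplyr)

# Hypothetical flights: dep_time in HHMM format, air_time in minutes
toy_flights <- tibble(
  dep_time = c(2230L, 2400L, 45L, 100L, 130L, 2230L, 900L, NA),
  air_time = c(300, 250, 260, 245, 300, 120, 300, 300)
)

toy_flagged <- toy_flights %>%
  mutate(
    # Integer division drops the minutes, e.g. 2230 %/% 100 is 22
    dep_hour = dep_time %/% 100,
    overnight_flag = case_when(
      # Keep the flag missing when either input is missing
      is.na(dep_time) | is.na(air_time) ~ NA_character_,
      # 10pm to midnight, or midnight to 1:00am, and more than 4 hours in the air
      (dep_time >= 2200 | dep_time <= 100) & air_time > 240 ~ "YES",
      TRUE ~ "NO"
    )
  )

toy_flagged
```

The 1:30am flight has dep_hour equal to 1 but is flagged NO, which shows why comparing dep_time itself against 100 is stricter than comparing the hour against 1.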
planes data set.
overnight_planes <- flights_overnight %>%
  inner_join(planes, by = "tailnum")
overnight_size_summary <- overnight_planes %>%
  group_by(overnight_flag) %>%
  summarise(
    number_of_flights = n(),
    average_seats = mean(seats, na.rm = TRUE),
    median_seats = median(seats, na.rm = TRUE),
    smallest_seats = min(seats, na.rm = TRUE),
    largest_seats = max(seats, na.rm = TRUE)
  )
overnight_size_summary
ggplot(overnight_planes, aes(x = overnight_flag, y = seats)) +
  geom_boxplot(fill = "purple") +
  labs(
    title = "Plane Size Comparison for Overnight and Non-Overnight Flights",
    x = "Overnight Flight",
    y = "Number of Seats"
  )
Answer:
I used the number of seats to measure
plane size. Then I joined the overnight flight data with the planes data
by tailnum.
Based on the summary table and boxplot, I compared the seat counts for overnight and non-overnight flights. If the average or median seat count for overnight flights is lower, the statement is supported; if it is similar or higher, the statement is not supported.
From the result, overnight flights do not clearly use smaller planes. The seat numbers should be compared with non-overnight flights before making this claim.
Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.
In the gss_cat data set, find factors that are
significantly correlated with the reported income.
gss_income <- gss_cat %>%
  filter(
    !rincome %in% c("No answer", "Don't know", "Refused", "Not applicable")
  ) %>%
  droplevels()
test_results <- data.frame(
  factor = c("marital", "race", "partyid", "relig"),
  p_value = c(
    chisq.test(table(gss_income$rincome, gss_income$marital), simulate.p.value = TRUE)$p.value,
    chisq.test(table(gss_income$rincome, gss_income$race), simulate.p.value = TRUE)$p.value,
    chisq.test(table(gss_income$rincome, gss_income$partyid), simulate.p.value = TRUE)$p.value,
    chisq.test(table(gss_income$rincome, gss_income$relig), simulate.p.value = TRUE)$p.value
  )
)
test_results
ggplot(gss_income, aes(x = rincome, fill = marital)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(
    title = "Reported Income by Marital Status",
    x = "Reported Income",
    y = "Proportion",
    fill = "Marital Status"
  )
ggplot(gss_income, aes(x = rincome, fill = race)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(
    title = "Reported Income by Race",
    x = "Reported Income",
    y = "Proportion",
    fill = "Race"
  )
Answer:
The chi-square test results show that
marital status, race, party identification, and religion are all
significantly related to reported income because all p-values are less
than 0.05.
This means reported income is not the same across these groups. The graphs also show different income patterns for different groups.
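A small p-value only says that an association exists, not how strong it is. As a hedged addition that is not part of the analysis above, the sketch below runs a chi-square test on a made-up 2 x 2 table and also computes Cramer's V, a common effect size for two categorical variables. The table toy_table and its counts are invented.

```r
# Made-up counts: income band (rows) by group (columns)
toy_table <- matrix(
  c(30, 10,
    10, 30),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(income = c("low", "high"), group = c("g1", "g2"))
)

# Pearson chi-square test without the continuity correction
toy_test <- chisq.test(toy_table, correct = FALSE)
toy_test$p.value

# Cramer's V = sqrt(chi-square / (n * (min(rows, cols) - 1)))
n <- sum(toy_table)
cramers_v <- sqrt(unname(toy_test$statistic) / (n * (min(dim(toy_table)) - 1)))
cramers_v
```

For this toy table the chi-square statistic is 20 with 80 observations, so Cramer's V is sqrt(20 / 80) = 0.5. Values close to 0 would mean a weak association even when the p-value is small.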
In the smoking data set of the openintro
package, find factors that are significantly correlated with the
smoking status and the number of cigarettes smoked per day.
Answer: