I, Sang Dao, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
# load required packages here
library(tidyverse)
library(openintro)
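library(nycflights13)
library(lubridate)  # make_datetime(); attached by tidyverse >= 2.0.0, loaded explicitly for older versions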
The following questions shall be answered by working with the
world_bank_pop and who data sets from the
tidyr package (attached by tidyverse).
world_bank_pop is not clean. Clean the
data set so that after tidying you have six columns:
country, year, SP.URB.TOTL,
SP.URB.GROW, SP.POP.TOTL,
SP.POP.GROW. Give your code and show the first 10 rows of
the data set after being tidied. Then explain the meaning of each
column.
wb_tidy <- world_bank_pop %>%
  pivot_longer(cols = -c(country, indicator),
               names_to = "year",
               values_to = "value") %>%
  pivot_wider(names_from = indicator,
              values_from = value)
head(wb_tidy, 10)
Answer:
country: The three-letter code that identifies each country or region (for example, USA).
year: The year in which the values were recorded.
SP.URB.TOTL: Total urban population.
SP.URB.GROW: Annual urban population growth rate (%).
SP.POP.TOTL: Total population.
SP.POP.GROW: Annual total population growth rate (%).
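To make the reshaping step easier to follow, here is a minimal sketch on a made-up table with the same layout as world_bank_pop. The country codes AAA and BBB and all values are hypothetical, and the year is converted to an integer here only as an optional extra step.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per country-indicator pair, one column per year
toy_wide <- tibble(
  country   = c("AAA", "AAA", "BBB", "BBB"),
  indicator = c("SP.POP.TOTL", "SP.URB.TOTL", "SP.POP.TOTL", "SP.URB.TOTL"),
  `2000`    = c(100, 40, 200, 150),
  `2001`    = c(110, 46, 202, 153)
)

toy_tidy <- toy_wide %>%
  # Step 1: stack the year columns so each row is country-indicator-year
  pivot_longer(cols = -c(country, indicator),
               names_to = "year",
               values_to = "value") %>%
  # Optional: store year as a number instead of text
  mutate(year = as.integer(year)) %>%
  # Step 2: spread the indicators so each one becomes its own column
  pivot_wider(names_from = indicator, values_from = value)

print(toy_tidy)
```

The result has one row per country and year, with one column per indicator, which is the same shape as wb_tidy.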
Replace the country column of the tidied data set in
step a) with full names of the country (for example, replace
USA with United States of America) by checking
the data frame who, which contains the full name of each
country corresponding to the three-letter country code. Give your code
and show the updated data set in a manner to illustrate that the task is
correctly fulfilled.
who_codes <- who %>%
  select(full_name = country, iso3) %>%
  distinct(iso3, .keep_all = TRUE)
wb_joined <- wb_tidy %>%
  left_join(who_codes, by = c("country" = "iso3")) %>%
  select(country = full_name, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW)
head(wb_joined, 10)
Answer:
urban_change_data <- wb_joined %>%
  filter(year %in% c("2000", "2017")) %>%
  mutate(urban_prop = SP.URB.TOTL / SP.POP.TOTL) %>%
  select(country, year, urban_prop) %>%
  pivot_wider(names_from = year,
              values_from = urban_prop,
              names_prefix = "yr_",
              values_fn = mean) %>%
  mutate(urbanization_change = yr_2017 - yr_2000) %>%
  filter(!is.na(urbanization_change)) %>%
  arrange(desc(urbanization_change))
head(urban_change_data)
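Answer: The six rows shown by head() are the countries with the largest increase in urban population share (urban_prop) between 2000 and 2017, that is, the countries that have undergone the most significant urbanization in that period.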
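One point worth checking in the pipeline above: any code that has no match in who gets an NA name after the left join, and pivot_wider() with values_fn = mean then averages all such rows into a single NA "country". The following is only a sketch on made-up data, assuming it is acceptable to drop unmatched rows before reshaping; the name Xland and all values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one named country and two unmatched entries (NA name)
toy_joined <- tibble(
  country     = c("Xland", "Xland", NA, NA, NA, NA),
  year        = c("2000", "2017", "2000", "2017", "2000", "2017"),
  SP.URB.TOTL = c(40, 60, 10, 20, 30, 90),
  SP.POP.TOTL = c(100, 100, 100, 100, 100, 100)
)

toy_change <- toy_joined %>%
  # Assumption: rows without a full country name are not wanted in the ranking
  filter(!is.na(country)) %>%
  mutate(urban_prop = SP.URB.TOTL / SP.POP.TOTL) %>%
  select(country, year, urban_prop) %>%
  pivot_wider(names_from = year,
              values_from = urban_prop,
              names_prefix = "yr_") %>%
  mutate(urbanization_change = yr_2017 - yr_2000)

print(toy_change)
```

With the unmatched rows removed, each country appears once per year, so values_fn = mean is no longer needed to resolve duplicates in this toy case.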
For the following tasks, use data set planes and
flights from the nycflights13 package.
In the planes data set, only keep planes from
manufacturers that have more than 10 samples in the data set. Then
convert manufacturer column into a factor. Then combine
AIRBUS and AIRBUS INDUSTRIE as a single
category AIRBUS; combine MCDONNELL DOUGLAS,
MCDONNELL DOUGLAS AIRCRAFT CO and
MCDONNELL DOUGLAS CORPORATION into a single category
MCDONNELL. Save your data frame as a new one. Show your
code and the first 10 rows of the updated data frame.
planes_cleaned <- planes %>%
  group_by(manufacturer) %>%
  filter(n() > 10) %>%
  ungroup() %>%
  mutate(manufacturer = factor(manufacturer)) %>%
  mutate(manufacturer = fct_collapse(manufacturer,
    AIRBUS = c("AIRBUS", "AIRBUS INDUSTRIE"),
    MCDONNELL = c("MCDONNELL DOUGLAS", "MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")
  ))
head(planes_cleaned, 10)
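As a quick check of how fct_collapse() merges levels, here is a small sketch on a hypothetical factor (the values are made up and not taken from planes):

```r
library(forcats)

# Hypothetical manufacturer values
toy_manufacturer <- factor(c("AIRBUS", "AIRBUS INDUSTRIE", "BOEING", "AIRBUS"))

# Merge the two Airbus spellings into one level; other levels are left unchanged
toy_collapsed <- fct_collapse(toy_manufacturer,
  AIRBUS = c("AIRBUS", "AIRBUS INDUSTRIE")
)

print(levels(toy_collapsed))
print(table(toy_collapsed))
```

Running levels(planes_cleaned$manufacturer) in the same way would be one option for confirming that only AIRBUS and MCDONNELL remain for the merged groups.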
Answer:
flights data set with the planes
data set, study how plane models correlate with the flight distance with
proper data visualizations or summary tables. You are required to
summarize your findings concisely in your own words.
flights_planes <- flights %>%
  inner_join(planes_cleaned, by = "tailnum")
model_distance_summary <- flights_planes %>%
  group_by(model) %>%
  summarise(
    flight_count = n(),
    avg_distance = mean(distance, na.rm = TRUE),
    median_distance = median(distance, na.rm = TRUE)
  ) %>%
  arrange(desc(avg_distance))
head(model_distance_summary, 10)
top_15_models <- model_distance_summary %>%
  arrange(desc(flight_count)) %>%
  slice(1:15) %>%
  pull(model)
flights_planes %>%
  filter(model %in% top_15_models) %>%
  ggplot(aes(x = fct_reorder(model, distance, .fun = median), y = distance)) +
  geom_boxplot(fill = "steelblue", outlier.alpha = 0.3) +
  coord_flip()
Answer: The summary table and the box plots show that different plane models serve different kinds of routes. Smaller planes mostly fly short trips, usually under 700 miles, and stay on the same short routes. The larger Airbus and Boeing models fly much farther, often 1,200 or even 2,000 miles, and cover a wide range of long trips across the country.
For the following tasks, use the data set weather,
flights or planes from the
nycflights13 package.
JFK airport. (Hint: You need to first create a
datetime variable for each hour.)
jfk_weather <- weather %>%
  filter(origin == "JFK") %>%
  mutate(datetime = make_datetime(year, month, day, hour))
ggplot(jfk_weather, aes(x = datetime, y = temp)) +
  geom_line(color = "steelblue", alpha = 0.7)
Answer:
largest_temp_diff_day <- weather %>%
  group_by(year, month, day) %>%
  summarise(
    max_temp = max(temp, na.rm = TRUE),
    min_temp = min(temp, na.rm = TRUE),
    temp_diff = max_temp - min_temp,
    .groups = "drop"
  ) %>%
  arrange(desc(temp_diff))
head(largest_temp_diff_day, 1)
Answer:
flights data set. Here overnight
flights are defined as flights that departed between 10pm and 1am and
had an air time of over 4 hours. Create a categorical variable
overnight_flag with YES or NO as
the possible values. Show your code and the updated data frame.
flights_updated <- flights %>%
  mutate(
    overnight_flag = ifelse((dep_time >= 2200 | dep_time <= 100) & air_time > 240, "YES", "NO")
  )
overnight_flights <- flights_updated %>%
  filter(overnight_flag == "YES")
overnight_flights %>%
  select(tailnum, dep_time, air_time, overnight_flag) %>%
  head(10)
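The flag logic can be checked on a few hand-made rows. This is only a sketch: the toy values are hypothetical, and treating flights with a missing dep_time or air_time as "NO" is an assumption that differs from the ifelse() call above, which leaves them as NA.

```r
library(dplyr)

# Hypothetical flights: dep_time in HHMM format, air_time in minutes
toy_flights <- tibble(
  dep_time = c(2230L, 30L, 1500L, 2359L, NA),
  air_time = c(300, 250, 400, 120, NA)
)

toy_flagged <- toy_flights %>%
  mutate(
    overnight_flag = if_else(
      (dep_time >= 2200 | dep_time <= 100) & air_time > 240,
      "YES", "NO",
      missing = "NO"  # assumption: unknown times count as not overnight
    )
  )

print(toy_flagged)
```

The first two rows depart inside the 10pm to 1am window and fly longer than 240 minutes, so they are flagged YES; the afternoon departure and the short late-night flight are flagged NO.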
Answer:
planes data set.
flights_with_planes <- flights_updated %>%
  inner_join(planes, by = "tailnum")
size_verification <- flights_with_planes %>%
  filter(!is.na(overnight_flag)) %>%
  group_by(overnight_flag) %>%
  summarise(
    flight_count = n(),
    avg_seats = mean(seats, na.rm = TRUE),
    median_seats = median(seats, na.rm = TRUE)
  )
print(size_verification)
## # A tibble: 2 × 4
##   overnight_flag flight_count avg_seats median_seats
##   <chr>                 <int>     <dbl>        <dbl>
## 1 NO                   279298      137.          149
## 2 YES                     639      200.          200
Answer:
Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.
Using the gss_cat data set, find factors that are
significantly correlated with the reported income.
gss_cat %>%
  filter(!is.na(rincome) & !is.na(tvhours)) %>%
  ggplot(aes(x = fct_reorder(rincome, tvhours, .fun = median, .na_rm = TRUE), y = tvhours)) +
  geom_boxplot(fill = "steelblue") +
  coord_flip()
oneway.test(tvhours ~ rincome, data = gss_cat)
##
## One-way analysis of means (not assuming equal variances)
##
## data: tvhours and rincome
## F = 53.509, num df = 15, denom df = 1004, p-value < 2.2e-16
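Answer: The box plot shows a clear difference between income groups. People who earn less usually watch more TV than people with high pay, so in general, the more money someone makes, the less time they spend watching TV. The one-way test (F = 53.509, p-value < 2.2e-16) indicates that the difference in TV hours across reported income groups is statistically significant.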
Using the smoking data set of the openintro
package, find factors that are significantly correlated with the
smoking status and the number of cigarettes smoked per day.
smokers_data <- smoking %>%
  filter(smoke == "Yes" & !is.na(amt_weekdays) & !is.na(age))
ggplot(smokers_data, aes(x = age, y = amt_weekdays)) +
  geom_point(position = "jitter", alpha = 0.5, color = "darkred") +
  geom_smooth(method = "lm")
cor(smokers_data$age, smokers_data$amt_weekdays)
## [1] 0.1927826
Answer: The correlation coefficient is about 0.19, which is fairly close to zero, so the linear link between age and the number of cigarettes smoked on weekdays is weak. In the scatter plot the fitted line goes up only slightly. Older people might smoke a few more cigarettes during the week, but the points are spread widely around the line.
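Since the question asks about significance, cor() alone does not settle it. A correlation test such as cor.test() from base R reports a p-value and a confidence interval alongside the coefficient. The sketch below uses simulated data only (the variables sim_age and sim_cigs are hypothetical and are not the smoking data), so its numbers say nothing about the real result.

```r
# Simulated data for illustration only; not the smoking data set
set.seed(1)
sim_age  <- sample(18:80, 200, replace = TRUE)
sim_cigs <- pmax(0, round(10 + 0.05 * sim_age + rnorm(200, sd = 8)))

# Pearson correlation test: estimate, p-value and 95% confidence interval
sim_test <- cor.test(sim_age, sim_cigs)

print(sim_test$estimate)
print(sim_test$p.value)
print(sim_test$conf.int)
```

On the real data the equivalent call would be cor.test(smokers_data$age, smokers_data$amt_weekdays), and the p-value would show whether the weak correlation of about 0.19 is statistically significant.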