I, keyan______, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
library(tidyverse)
library(openintro)
library(nycflights13)
library(forcats)
library(lubridate)  # make_datetime(); attached by tidyverse >= 2.0.0, loaded explicitly for older versions
The following questions shall be answered by working with the
world_bank_pop and who data sets from the
openintro library.
world_bank_pop is not clean. Clean the
data set so that after tidying you have six columns:
country, year, SP.URB.TOTL,
SP.URB.GROW, SP.POP.TOTL,
SP.POP.GROW. Give your code and show the first 10 rows of
the data set after being tidied. Then explain the meaning of each
column.
data(world_bank_pop)
tidy_data <- world_bank_pop %>%
  # one row per country, indicator and year
  pivot_longer(
    cols = -c(country, indicator),
    names_to = "year",
    values_to = "value"
  ) %>%
  # one column per indicator
  pivot_wider(
    names_from = indicator,
    values_from = value
  )
head(tidy_data, 10)
Answer:
country: three-letter country code
year: the year of the observation (2000 to 2017)
SP.URB.TOTL: total urban population
SP.URB.GROW: urban population growth rate (annual %)
SP.POP.TOTL: total population
SP.POP.GROW: total population growth rate (annual %)
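The reshape used in this step can be checked on a small table first. The following is a minimal sketch on hypothetical toy data (the country codes and values are made up, and only two indicators are used); converting year to an integer with names_transform is an optional extra that the answer above does not apply.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for world_bank_pop:
# one row per country and indicator, one column per year.
toy <- tibble(
  country   = c("AAA", "AAA", "BBB", "BBB"),
  indicator = c("SP.POP.TOTL", "SP.URB.TOTL", "SP.POP.TOTL", "SP.URB.TOTL"),
  `2000`    = c(100, 40, 200, 150),
  `2001`    = c(110, 45, 210, 160)
)

toy_tidy <- toy %>%
  # year columns become rows; the year label is converted to integer
  pivot_longer(
    cols = -c(country, indicator),
    names_to = "year",
    values_to = "value",
    names_transform = list(year = as.integer)
  ) %>%
  # each indicator becomes its own column
  pivot_wider(names_from = indicator, values_from = value)

toy_tidy
```

Each country and year now occupies a single row, with one column per indicator.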
Replace the country column of the tidied data set in
step a) with full names of the country (for example, replace
USA with United States of America) by checking
the data frame who, which contains the full name of each
country corresponding to the three-letter country code. Give your code
and show the updated data set in a manner to illustrate that the task is
correctly fulfilled.
data(world_bank_pop)
data(who)
tidy_data <- world_bank_pop %>%
  pivot_longer(
    cols = -c(country, indicator),
    names_to = "year",
    values_to = "value"
  ) %>%
  pivot_wider(
    names_from = indicator,
    values_from = value
  )
# who has one row per country and year, so reduce it to one row per
# country code before joining; otherwise the join duplicates rows of tidy_data.
country_names <- who %>%
  distinct(iso3, country)
updated_data <- tidy_data %>%
  rename(iso3 = country) %>%
  left_join(country_names, by = "iso3") %>%
  select(country, everything(), -iso3)
head(updated_data, 10)
Answer:
The country column now holds the full country name instead of the
three-letter code. The join works by matching rows on the same country
code, so the who data set acts like a dictionary from code to name.
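A lookup table that repeats its key duplicates rows in a left join. The sketch below uses hypothetical miniature tables (pop and lookup are made-up names and values) to show one way to guard against that: reduce the lookup to one row per code with distinct(), and, assuming dplyr 1.1.0 or later, pass relationship = "many-to-one" so the join fails loudly if a code still appears twice.

```r
library(dplyr)

# Hypothetical miniature population table keyed by country code.
pop <- tibble(
  iso3  = c("USA", "USA", "FRA"),
  year  = c(2000, 2001, 2000),
  value = c(1, 2, 3)
)

# Hypothetical lookup with repeated keys (one row per country and year).
lookup <- tibble(
  iso3    = c("USA", "USA", "FRA", "FRA"),
  country = c("United States of America", "United States of America",
              "France", "France")
)

# Keep exactly one row per code before joining.
names_only <- distinct(lookup, iso3, country)

joined <- left_join(pop, names_only, by = "iso3",
                    relationship = "many-to-one")

# The join should not change the number of rows.
nrow(joined) == nrow(pop)
```

Comparing row counts before and after the join is a quick check that no observation was duplicated.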
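result <- updated_data %>%
  # codes with no match in who have NA as country; drop them so they
  # are not pooled into a single NA group
  filter(!is.na(country)) %>%
  filter(year %in% c("2000", "2017")) %>%
  group_by(country, year) %>%
  summarise(SP.URB.TOTL = mean(SP.URB.TOTL, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = SP.URB.TOTL) %>%
  # absolute change in urban population between 2000 and 2017
  mutate(change = `2017` - `2000`) %>%
  arrange(desc(change))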
head(result, 10)
Answer:
The countries at the top of the table had the most significant
urbanization, measured as the largest absolute increase in urban
population between 2000 and 2017.
For the following tasks, use data set planes and
flights from the nycflights13 package.
planes data set, only keep planes from
manufacturers that have more than 10 samples in the data set. Then
convert manufacturer column into a factor. Then combine
AIRBUS and AIRBUS INDUSTRIE as a single
category AIRBUS; combine MCDONNELL DOUGLAS,
MCDONNELL DOUGLAS AIRCRAFT CO and
MCDONNELL DOUGLAS CORPORATION into a single category
MCDONNELL. Save your data frame as a new one. Show your
code and the first 10 rows of the updated data frame.
data(planes)
planes_updated <- planes %>%
  # keep manufacturers with more than 10 planes
  group_by(manufacturer) %>%
  filter(n() > 10) %>%
  ungroup() %>%
  mutate(manufacturer = as.factor(manufacturer)) %>%
  # merge name variants of the same manufacturer
  mutate(manufacturer = fct_collapse(manufacturer,
    "AIRBUS" = c("AIRBUS", "AIRBUS INDUSTRIE"),
    "MCDONNELL" = c("MCDONNELL DOUGLAS", "MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")
  ))
head(planes_updated, 10)
flights data set with the planes
data set, study how plane models correlate with the flight distance with
proper data visualizations or summary tables. You are required to
summarize your findings concisely in your own words.
data(flights)
data(planes)
joined <- flights %>%
  left_join(planes, by = "tailnum")
result <- joined %>%
  filter(!is.na(model), !is.na(distance)) %>%
  group_by(model) %>%
  # average flight distance and number of flights per plane model
  summarise(avg_distance = mean(distance, na.rm = TRUE), n = n()) %>%
  arrange(desc(avg_distance))
head(result, 10)
Answer:
Long-range or larger aircraft models
generally fly longer distances, while smaller or short-range models fly
shorter routes overall.
For the following tasks, use the data set weather,
flights or planes from the
nycflights13 package.
JFK airport. (Hint: You need to first create a
datetime variable for each hour.)
weather_jfk <- weather %>%
  filter(origin == "JFK") %>%
  mutate(datetime = make_datetime(year, month, day, hour))
ggplot(weather_jfk, aes(x = datetime, y = temp)) +
  # hourly readings in the background, smoothed trend on top
  geom_line(color = "steelblue", alpha = 0.3, linewidth = 0.3) +
  geom_smooth(method = "loess", span = 0.2, color = "red", se = FALSE, linewidth = 1) +
  labs(
    title = "Temperature Change at JFK in 2013",
    x = "Time",
    y = "Temperature (F)"
  ) +
  theme_minimal()
Answer:
Temperature at JFK follows a clear seasonal pattern. It rises from
the beginning of the year, reaches a peak in summer, and then decreases
toward the end of the year.
data(weather)
result <- weather %>%
  filter(origin == "JFK") %>%
  mutate(date = as.Date(make_datetime(year, month, day, hour))) %>%
  group_by(date) %>%
  # daily range: warmest minus coldest hourly reading
  summarise(temp_diff = max(temp, na.rm = TRUE) - min(temp, na.rm = TRUE)) %>%
  arrange(desc(temp_diff))
head(result, 1)
Answer:
The first row of result is the day at JFK with the largest
temperature difference, measured as the gap between the daily maximum
and minimum of the hourly temperature readings.
flights data set. Here overnight
flights are defined as flights that departed between 10pm and 1am, and
having an air time of over 4 hours. Create a categorical variable
overnight_flag with YES or NO as
the possible values. Show your code and the updated data frame.
data(flights)
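flights_new <- flights %>%
  mutate(
    overnight_flag = case_when(
      # dep_time is in HHMM format: 10pm to 1am is 2200-2400 or up to 100;
      # air_time is in minutes, so over 4 hours is > 240
      (dep_time >= 2200 | dep_time <= 100) & air_time > 240 ~ "YES",
      # everything else, including rows with missing times
      TRUE ~ "NO"
    )
  )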
head(flights_new, 10)
Answer:
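Flights are flagged YES when they depart late at night (between 10pm
and 1am) and have an air time of over 4 hours (240 minutes). All other
flights, including those with missing departure or air time, are flagged NO.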
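The boundaries of the departure window are easy to get wrong, so it helps to test the rule on a few hand-picked times. This is a minimal sketch on hypothetical toy values, assuming "between 10pm and 1am" means dep_time from 2200 through 100 inclusive in HHMM format; toy_flights and flagged are made-up names.

```r
library(dplyr)

# Hypothetical departure times (HHMM) around the window edges.
# air_time is fixed at 300 minutes so only dep_time decides the flag.
toy_flights <- tibble(
  dep_time = c(2159, 2200, 2359, 30, 100, 130, NA),
  air_time = rep(300, 7)
)

flagged <- toy_flights %>%
  mutate(
    overnight_flag = if_else(
      # missing times are treated as not overnight
      !is.na(dep_time) & !is.na(air_time) &
        (dep_time >= 2200 | dep_time <= 100) &
        air_time > 240,
      "YES", "NO"
    )
  )

flagged
```

Under this reading, 21:59 and 01:30 fall outside the window, while 22:00 and 01:00 are inside it.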
planes data set.
data(flights)
data(planes)
flights_new <- flights %>%
  mutate(
    overnight_flag = case_when(
      # dep_time is in HHMM format: 10pm to 1am is 2200-2400 or up to 100
      (dep_time >= 2200 | dep_time <= 100) & air_time > 240 ~ "YES",
      TRUE ~ "NO"
    )
  )
joined <- flights_new %>%
  left_join(planes, by = "tailnum")
result <- joined %>%
  filter(!is.na(seats)) %>%
  group_by(overnight_flag) %>%
  # average number of seats per flight, by overnight status
  summarise(avg_seats = mean(seats, na.rm = TRUE), n = n())
result
Answer:
Overnight flights tend to use larger planes on average (more seats
per flight), so the statement is not true.
Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.
gss_cat data set, find factors that are
significantly correlated with the reported income.
gss_clean <- gss_cat %>%
  # drop non-informative income responses and their unused levels
  filter(!rincome %in% c("No answer", "Don't know", "Refused", "Not applicable")) %>%
  mutate(rincome = fct_drop(rincome))
ggplot(gss_clean, aes(x = rincome, y = age)) +
  geom_boxplot(fill = "skyblue", color = "darkgreen", outlier.alpha = 0.3) +
  coord_flip() +
  labs(title = "Age Distribution by Income",
       x = "Income",
       y = "Age") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.y = element_text(size = 9))
## Warning: Removed 25 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
ggplot(gss_clean, aes(x = marital, fill = rincome)) +
  geom_bar(position = "fill", width = 0.7) +
  coord_flip() +
  # the Brewer "Blues" palette has at most 9 colours, fewer than the
  # number of income levels; viridis scales to any number of levels
  scale_fill_viridis_d(direction = -1) +
  labs(title = "Income Distribution by Marital Status",
       x = "Marital Status",
       y = "Proportion",
       fill = "Income") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
Answer:
In summary, age and marital status are
related to income. Older and married people are more likely to have
higher income. Younger and single people are more likely to have lower
income.
smoking data set of the openintro
package, find factors that are significantly correlated with the
smoking status and the number of cigarettes smoked per day.
ggplot(smoking, aes(x = gender, fill = smoke)) +
  geom_bar(position = "fill", width = 0.7) +
  scale_fill_manual(values = c("Yes" = "orange", "No" = "gray")) +
  labs(title = "Smoking Status by Gender",
       x = "Gender",
       y = "Proportion",
       fill = "Smoking") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
smoking_active <- smoking %>% filter(smoke == "Yes")
ggplot(smoking_active, aes(x = age, y = amt_weekends)) +
  geom_point(alpha = 0.4, color = "blue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Age vs Cigarettes Smoked (Weekends)",
       x = "Age",
       y = "Cigarettes (Weekends)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
Answer:
Gender is related to smoking status: males and females show different
proportions of smokers.
Age is related to smoking intensity: among smokers, some age groups
smoke more cigarettes than others.
Overall, gender and age both affect smoking behavior, including
whether people smoke and how much they smoke.
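A proportion plot shows a difference but not whether it is statistically significant. One possible follow-up, sketched here with hypothetical counts that are NOT taken from the smoking data set, is a chi-squared test of independence on the gender by smoking status table.

```r
# Hypothetical 2x2 table of counts, for illustration only.
counts <- matrix(
  c(120, 380,
    150, 350),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Female", "Male"),
                  smoke  = c("Yes", "No"))
)

# Chi-squared test of independence between gender and smoking status.
res <- chisq.test(counts)

res$p.value  # a small value suggests the two variables are associated
```

With the real data, the same call could be applied to table(smoking$gender, smoking$smoke), assuming the openintro package is loaded.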