I, Jerry Chan, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
The following questions shall be answered by working with the
world_bank_pop and who data sets from the
openintro library.
world_bank_pop is not clean. Clean the
data set such that after tidying you have six columns:
country, year, SP.URB.TOTL,
SP.URB.GROW, SP.POP.TOTL,
SP.POP.GROW. Give your code and show the first 10 rows of
the data set after being tidied. Then explain the meaning of each
column.
world_bank_pop_tidy <- world_bank_pop %>%
  pivot_longer(
    cols = `2000`:`2017`,
    names_to = "year",
    values_to = "value"
  )
world_bank_pop_tidy <- world_bank_pop_tidy %>%
  pivot_wider(
    names_from = indicator,
    values_from = value
  ) %>%
  mutate(year = as.numeric(year)) %>%
  select(country, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW)
head(world_bank_pop_tidy, 10)
Answer:
country: three-letter country (or region) code
year: year of the observation
SP.URB.TOTL: total urban population
SP.URB.GROW: annual urban population growth rate (%)
SP.POP.TOTL: total population of the country
SP.POP.GROW: annual total population growth rate (%)
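The longer-then-wider reshaping used above can be checked on a small table first. The sketch below is only an illustration: `toy` and its values are made up, with the same layout as world_bank_pop (one row per country and indicator, one column per year).

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for world_bank_pop; the numbers are invented
toy <- tibble::tribble(
  ~country, ~indicator,    ~`2000`, ~`2001`,
  "AAA",    "SP.POP.TOTL", 100,     110,
  "AAA",    "SP.POP.GROW", 1.0,     1.1,
  "BBB",    "SP.POP.TOTL", 200,     190,
  "BBB",    "SP.POP.GROW", -0.5,    -0.4
)

toy_tidy <- toy %>%
  # year columns become rows: one row per country, indicator and year
  pivot_longer(cols = `2000`:`2001`, names_to = "year", values_to = "value") %>%
  # indicators become columns: one row per country and year
  pivot_wider(names_from = indicator, values_from = value) %>%
  mutate(year = as.integer(year))

toy_tidy
```

Each country-year pair ends up on one row with one column per indicator, which is the shape the question asks for.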
Replace the country column of the tidied data set in
step a) with full names of the country (for example, replace
USA with United States of America) by checking
the data frame who, which contains the full name of each
country corresponding to the three-digit country code. Give your code
and show the updated data set in a manner to illustrate that the task is
correctly fulfilled.
world_bank_pop_named <- world_bank_pop_tidy %>%
  rename(iso3 = country) %>%
  left_join(
    # who has one row per country and year, so keep one row per code before joining
    who %>% distinct(iso3, country),
    by = "iso3"
  )
world_bank_pop_named <- world_bank_pop_named %>%
  select(
    country,
    year,
    SP.URB.TOTL,
    SP.URB.GROW,
    SP.POP.TOTL,
    SP.POP.GROW
  )
head(world_bank_pop_named, 10)
Answer:
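The tidied data set was left-joined with the who data set on the iso3
country code, which brings in the full country name for each three-letter
code. The same six columns as in step a) were then selected, with country
now holding the full name. Codes with no match in who are left as NA.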
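A lookup join like this one is sensitive to repeated keys in the lookup table. The sketch below uses made-up tables (`lookup` and `pop` are hypothetical, not the exam data) to show how a left join repeats rows when the lookup has several rows per code, and how `distinct()` avoids it.

```r
library(dplyr)

# Hypothetical lookup with one row per code and year, similar in shape to who
lookup <- tibble::tibble(
  iso3    = c("AAA", "AAA", "BBB"),
  country = c("Country A", "Country A", "Country B"),
  year    = c(1999, 2000, 1999)
)
pop <- tibble::tibble(iso3 = c("AAA", "BBB"), pop = c(100, 200))

# Joining on the raw lookup repeats AAA once per lookup row
joined_raw <- left_join(pop, select(lookup, iso3, country), by = "iso3")

# Reducing the lookup to one row per code keeps the original row count
joined_dedup <- left_join(pop, distinct(lookup, iso3, country), by = "iso3")

nrow(joined_raw)
nrow(joined_dedup)
```

Comparing `nrow()` before and after a join is a quick way to confirm that no rows were duplicated.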
urban_change <- world_bank_pop_named %>%
  filter(year %in% c(2000, 2017)) %>%
  group_by(country, year) %>%
  summarise(urban_pop = mean(SP.URB.TOTL, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(
    names_from = year,
    values_from = urban_pop
  ) %>%
  mutate(
    # backticks refer to the columns named 2000 and 2017, not the numbers
    urban_change = `2017` - `2000`,
    percent_change = urban_change / `2000` * 100
  ) %>%
  arrange(desc(percent_change))
head(urban_change, 10)
Answer:
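The data was filtered to the years 2000 and 2017, and summarise() with
mean() collapses any repeated country-year rows into a single value before
pivoting. Each country then has one 2000 value and one 2017 value, from
which the change and percent change in urban population are calculated.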
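After pivoting, the year columns have numeric names, so they need backticks inside mutate(). The sketch below uses an invented two-row table (`wide` is hypothetical) to contrast the bare numbers with the backticked column names.

```r
library(dplyr)

# Hypothetical wide table with one column per year; values are invented
wide <- tibble::tibble(
  country = c("A", "B"),
  `2000`  = c(100, 200),
  `2017`  = c(150, 180)
)

result <- wide %>%
  mutate(
    wrong          = 2017 - 2000,      # plain arithmetic on the numbers: always 17
    change         = `2017` - `2000`,  # difference between the two columns
    percent_change = change / `2000` * 100
  )

result
```

The `wrong` column is the same constant for every country, which is a useful warning sign when checking output of this kind.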
For the following tasks, use data set planes and
flights from the nycflights13 package.
planes data set, only keep planes from
manufacturers that have more than 10 samples in the data set. Then
convert manufacturer column into a factor. Then combine
AIRBUS and AIRBUS INDUSTRIE as a single
category AIRBUS; combine MCDONNELL DOUGLAS,
MCDONNELL DOUGLAS AIRCRAFT CO and
MCDONNELL DOUGLAS CORPORATION into a single category
MCDONNELL. Save your data frame as a new one. Show your
code and the first 10 rows of the updated data frame.
planes_clean <- planes %>%
  group_by(manufacturer) %>%
  filter(n() > 10) %>%
  ungroup() %>%  # drop the grouping so manufacturer can be modified
  mutate(
    manufacturer = case_when(
      manufacturer %in% c("AIRBUS", "AIRBUS INDUSTRIE") ~ "AIRBUS",
      manufacturer %in% c("MCDONNELL DOUGLAS",
                          "MCDONNELL DOUGLAS AIRCRAFT CO",
                          "MCDONNELL DOUGLAS CORPORATION") ~ "MCDONNELL",
      TRUE ~ manufacturer
    ),
    manufacturer = factor(manufacturer)  # convert to a factor as required
  )
head(planes_clean, 10)
Answer:
Manufacturers were filtered to those with more than 10 planes in the data
set. case_when was then used as an if-else to combine the Airbus variants
into AIRBUS and the McDonnell Douglas variants into MCDONNELL, leaving the
rest as is, and the column was converted to a factor. The first 10 rows of
the new data frame were then printed.
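An alternative to case_when for merging categories is `fct_collapse()` from forcats, which works directly on factor levels. This is a sketch on a made-up vector (`makers` is hypothetical), not the exam data.

```r
library(forcats)

# Hypothetical manufacturer values covering every variant to be merged
makers <- factor(c(
  "AIRBUS", "AIRBUS INDUSTRIE", "BOEING",
  "MCDONNELL DOUGLAS", "MCDONNELL DOUGLAS AIRCRAFT CO",
  "MCDONNELL DOUGLAS CORPORATION"
))

# Each argument names the new level and lists the old levels it replaces
makers_collapsed <- fct_collapse(
  makers,
  AIRBUS    = c("AIRBUS", "AIRBUS INDUSTRIE"),
  MCDONNELL = c("MCDONNELL DOUGLAS",
                "MCDONNELL DOUGLAS AIRCRAFT CO",
                "MCDONNELL DOUGLAS CORPORATION")
)

table(makers_collapsed)
```

Levels that are not listed, such as BOEING here, are kept unchanged.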
flights data set with the planes
data set, study how plane models correlate with the flight distance with
proper data visualizations or summary tables. You are required to
summarize your findings concisely in your own words.
flights_planes <- flights %>%
  left_join(planes, by = "tailnum")
flights_planes_clean <- flights_planes %>%
  filter(!is.na(model), !is.na(distance))
model_distance_summary <- flights_planes_clean %>%
  group_by(model) %>%
  summarise(
    avg_distance = mean(distance, na.rm = TRUE),
    n_flights = n()
  ) %>%
  arrange(desc(avg_distance))
head(model_distance_summary, 10)
Answer:
The flights data set was joined with the planes data set by tailnum to get
each aircraft's model. The table shows that larger models fly longer
distances on average than smaller models, which suggests that aircraft
type is an important factor when looking at flight distance.
For the following tasks, use the data set weather,
flights or planes from the
nycflights13 package.
JFK airport. (Hint: You need to first create a
datetime variable for each hour.)
jfk_weather <- weather %>%
  filter(origin == "JFK") %>%
  mutate(datetime = make_datetime(year, month, day, hour))
ggplot(jfk_weather, aes(x = datetime, y = temp)) +
  geom_line(alpha = 0.5) +
  labs(
    title = "JFK airport temperature change across year 2013",
    x = "Date",
    y = "Temperature (F)"
  )
Answer:
A datetime variable was created from year, month, day and hour, so that
the hourly temperature at JFK airport could be plotted across the year
2013.
jfk_weather <- weather %>%
  filter(origin == "JFK") %>%
  mutate(
    datetime = make_datetime(year, month, day, hour),
    date = as_date(datetime)
  )
daily_range <- jfk_weather %>%
  group_by(date) %>%
  summarise(
    max_temp = max(temp, na.rm = TRUE),
    min_temp = min(temp, na.rm = TRUE),
    temp_diff = max_temp - min_temp
  ) %>%
  arrange(desc(temp_diff))
head(daily_range, 1)
Answer:
For each day, the maximum and minimum temperatures were found and
subtracted to get the temperature difference. The days were then arranged
in descending order of that difference to find the largest one.
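The daily range calculation can be checked by hand on a few rows. The sketch below uses invented hourly readings (`toy_weather` is hypothetical) with the same column names as the weather data set.

```r
library(dplyr)
library(lubridate)

# Hypothetical hourly readings for two days; temperatures are invented
toy_weather <- tibble::tibble(
  year  = 2013,
  month = 1,
  day   = c(1, 1, 1, 2, 2),
  hour  = c(0, 12, 23, 6, 18),
  temp  = c(30, 45, 33, 40, 42)
)

toy_range <- toy_weather %>%
  # build an hourly timestamp, then keep only its calendar date
  mutate(date = as_date(make_datetime(year, month, day, hour))) %>%
  group_by(date) %>%
  summarise(temp_diff = max(temp) - min(temp)) %>%
  arrange(desc(temp_diff))

toy_range
```

Here the first day spans 30 to 45 degrees, so it sorts to the top with a difference of 15.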
flights data set. Here overnight
flights are defined as flights that departed between 10pm and 1am, and
having an air time of over 4 hours . Create a categorical variable
overnight_flag with YES or NO as
the possible values. Show your code and the updated data frame.
overnight_flights <- flights %>%
  mutate(
    overnight_flag = case_when(
      (dep_time >= 2200 | dep_time <= 100) & air_time > 240 ~ "YES",
      TRUE ~ "NO"
    )
  )
head(overnight_flights, 10)
Answer:
A new variable called overnight_flag was created to flag flights that
departed between 10pm and 1am with more than 4 hours in the air. dep_time
is stored as HHMM, so the departure window is dep_time >= 2200 or
dep_time <= 100.
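The flag logic can be checked on a handful of edge cases, including a missing departure time. The rows below are invented (`toy_flights` is hypothetical), with dep_time written as HHMM and air_time in minutes as in the flights data set.

```r
library(dplyr)

# Hypothetical flights chosen to hit each branch of the condition
toy_flights <- tibble::tibble(
  dep_time = c(2230, 30, 2359, 1200, 2300, NA),
  air_time = c(300, 250, 100, 400, 241, 300)
)

flagged <- toy_flights %>%
  mutate(
    overnight_flag = case_when(
      # late-night departure (10pm or later, or 1am or earlier) and over 4 hours in the air
      (dep_time >= 2200 | dep_time <= 100) & air_time > 240 ~ "YES",
      # everything else, including rows where the condition is NA
      TRUE ~ "NO"
    )
  )

flagged
```

A missing dep_time makes the condition NA, so that row falls through to the default and is labelled NO.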
planes data set.
small_overnight_planes <- overnight_flights %>%
  left_join(planes %>% select(tailnum, seats), by = "tailnum")
small_overnight_planes %>%
  group_by(overnight_flag) %>%
  summarise(
    avg_seats = mean(seats, na.rm = TRUE),
    median_seats = median(seats, na.rm = TRUE),
    n = n()
  )
ggplot(small_overnight_planes, aes(x = overnight_flag, y = seats)) +
  geom_boxplot() +
  labs(
    title = "Plane Size Comparison: Overnight vs Non-Overnight Flights",
    x = "Overnight Flight",
    y = "Number of Seats"
  )
Answer:
Aircraft size was measured by the number of seats. Comparing overnight
and non-overnight flights, overnight flights on average use smaller
planes.
Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.
gss_cat data set, find factors that are
significantly correlated with the reported income.
gss_clean <- gss_cat %>%
  filter(!is.na(rincome))
ggplot(gss_clean, aes(x = age, fill = rincome)) +
  geom_histogram() +
  labs(
    title = "Reported Income By Age",
    x = "Age",
    y = "Number of respondents",  # histogram bars are counts
    fill = "Reported income"
  )
Answer:
Reported income is related to age: middle-aged respondents report the
highest incomes.
smoking data set of the openintro
package, find factors that are significantly correlated with the
smoking status and the number of cigarettes smoked per day.
smoking_clean <- smoking %>%
  mutate(cigs_per_day = (amt_weekends + amt_weekdays) / 2)
ggplot(smoking_clean, aes(x = highest_qualification, fill = smoke)) +
  geom_bar(position = "fill") +  # stack to 100% so the axis shows proportions
  coord_flip() +
  labs(
    title = "Smoking Status by Education",
    x = "Education Level",
    y = "Proportion"
  )
ggplot(smoking_clean, aes(x = highest_qualification, y = cigs_per_day)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Cigarettes per Day by Education",
    x = "Education Level",
    y = "Cigarettes per day"
  )
Answer:
Smoking status is associated with education level: respondents with lower
qualifications smoke more than those with higher qualifications.
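A summary table of proportions can back up the comparison across education levels. The sketch below uses a made-up six-row table (`toy_smoking` is hypothetical) with the same column names as the smoking data set.

```r
library(dplyr)

# Hypothetical respondents; the split between groups is invented
toy_smoking <- tibble::tibble(
  highest_qualification = c("Degree", "Degree", "Degree", "Degree",
                            "No Qualification", "No Qualification"),
  smoke = c("No", "No", "No", "Yes", "Yes", "No")
)

smoke_props <- toy_smoking %>%
  count(highest_qualification, smoke) %>%
  group_by(highest_qualification) %>%
  mutate(prop = n / sum(n)) %>%  # share of each smoking status within a level
  ungroup()

smoke_props
```

Proportions within each education level are easier to compare than raw counts when the groups differ in size.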