I, Yuhe, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
# load required packages here
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
library(openintro)
library(nycflights13)
library(lubridate)
The following questions shall be answered by working with the
world_bank_pop and who data sets from the
openinto library.
world_bank_pop is not clean. Clean the
data set such that the after data tidying you have six columns:
country, year, SP.URB.TOTL,
SP.URB.GROW, SP.POP.TOTL,
SP.POP.GROW. Give your code and show the first 10 rows of
the data set after being tidied. Then explain the meaning of each
column.# Enter code here.
cleaned_data <- world_bank_pop %>%
pivot_longer(c("2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017"), names_to = "year", values_to = "population") %>%
pivot_wider(names_from = indicator, values_from = population)
head(cleaned_data, 10)
Answer:
“country”means country code; “year”
means years; “SP.URB.TOTAL” menas the total urban population;
“SP.URB.GROW” means the urban population growth; “SP.POP.TOTL” means
total population; “SP.POP.GROW” means the population growth.
country column of the tided data set in
step a) with full names of the country (for example, replace
USA with United States of America) by checking
the data frame who, which contains the full name of each
country corresponding to the three-digit country code. Give your code
and show the updated data set in a manner to illustrate that the task is
correctly fulfilled.# Enter code here.
who_1 <- who %>%
select(country, iso3) %>%
group_by(country, iso3) %>%
summarise(n = n()) %>%
select(-n)
world_bank_pop_1 <- cleaned_data %>%
left_join(who_1, c("country" = "iso3")) %>%
mutate(country = country.y) %>%
select(-country.y,) %>%
filter(!is.na(country))
glimpse(world_bank_pop_1)
## Rows: 3,780
## Columns: 6
## $ country <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aru…
## $ year <chr> "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2…
## $ SP.URB.TOTL <dbl> 41625, 42025, 42194, 42277, 42317, 42399, 42555, 42729, 42…
## $ SP.URB.GROW <dbl> 1.66422212, 0.95637310, 0.40133515, 0.19651721, 0.09456936…
## $ SP.POP.TOTL <dbl> 89101, 90691, 91781, 92701, 93540, 94483, 95606, 96787, 97…
## $ SP.POP.GROW <dbl> 2.5392344, 1.7687566, 1.1947181, 0.9973955, 0.9009892, 1.0…
Answer:
As above.
# Enter code here.
world_bank_pop_2 <- world_bank_pop_1 %>%
filter(year == "2000" | year == "2017") %>%
group_by(country) %>%
summarise(
city_pop_year_2000 = SP.URB.TOTL[year == "2000"],
city_pop_year_2017 = SP.URB.TOTL[year == "2017"],
pop_change = city_pop_year_2017 - city_pop_year_2000
) %>%
arrange(desc(pop_change))
print(world_bank_pop_2)
## # A tibble: 210 × 4
## country city_pop_year_2000 city_pop_year_2017 pop_change
## <chr> <dbl> <dbl> <dbl>
## 1 China 452999147 809246214 356247067
## 2 India 293168849 455009748 161840899
## 3 Indonesia 89914698 144572428 54657730
## 4 Nigeria 42801631 95817238 53015607
## 5 United States of America 223069137 266788716 43719579
## 6 Brazil 142795391 179958546 37163155
## 7 Pakistan 50914288 78853074 27938786
## 8 Bangladesh 30476706 58016080 27539374
## 9 Mexico 73132993 98108030 24975037
## 10 Iran (Islamic Republic of) 41975934 62866706 20890772
## # ℹ 200 more rows
Answer:
China had undergone significant
urbanization between 2000 and 2017. India is at the second place.
For the following tasks, use data set planes and
flights from the nycflights13 package.
planes data set, only keep planes from
manufacturers that have more than 10 samples in the data set. Then
convert manufacturer column into a factor. Then combine
AIRBUS and AIRBUS INDUSTRIE as a single
category AIRBUS; combine MCDONNELL DOUGLAS,
MCDONNELL DOUGLAS AIRCRAFT CO and
MCDONNELL DOUGLAS CORPORATION into a single category
MCDONNELL. Save your data frame as a new one. Show your
code and the first 10 rows of the updated data frame.# Enter code here.
planes_1 <- planes %>%
group_by(manufacturer) %>%
mutate(n = n()) %>%
filter(n > 10) %>%
select(-n) %>%
mutate(manufacturer = as.factor(manufacturer)) %>%
mutate(manufacturer = fct_collapse(manufacturer,
"AIRBUS" = c("AIRBUS", "AIRBUS INDUSTRIE"),
"MCDONNELL" = c("MCDONNELL DOUGLAS", "MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")
))
head(planes_1, 10)
Answer:
As above
flights data set with the planes
data set, study how plane models correlate with the flight distance with
proper data visualizations or summary tables. You are required to
summarize your findings concisely in your own words.# Enter code here.
planes_2 <- planes_1 %>%
left_join(flights, by = "tailnum") %>%
group_by(model, manufacturer) %>%
summarise(avg_distance = mean(distance)) %>%
arrange(desc(avg_distance))
planes_2
Answer:
The A330-243 model from the AIRBUS
manufacturer have the highest average flight distance. BOEING has most
flight models that flight a high average distance in the top 10 average
flight distance list.
For the following tasks, use the data set weather,
flights or planes from the
nycflights13 package.
JFK airport. (Hint: You need to first create a
datetime variable for each hour.)# Enter code here.
jfk_weather <- weather %>%
filter(origin == "JFK") %>%
mutate(datetime = make_datetime(year, month, day, hour))
ggplot(jfk_weather, aes(x = datetime, y = temp)) +
geom_line() +
labs(
title = "Temperature Chnage at JFK airport of 2013",
x = "Month",
y = "Temperature"
)
Answer:
# Enter code here.
weather_1 <- weather %>%
filter(origin == "JFK") %>%
group_by(year, month, day) %>%
summarize(
max = max(temp, na.rm = TRUE),
min = min(temp, na.rm = TRUE),
temp_diff = max - min) %>%
arrange(desc(temp_diff))
head(weather_1, 1)
Answer: The highest temperature difference day is
2013-05-08.
flights data set. Here overnight
flights are defined as flights that departed between 10pm and 1am, and
having an air time of over 4 hours . Create a categorical variable
overnight_flag with YES or NO as
the possible values. Show your code and the updated data frame.# Enter code here.
flights_updated <- flights %>%
mutate(
overnight = if_else( (dep_time >= 2200 | dep_time <= 100) & air_time > 240, "YES", "NO", missing = "NO") )
flights_updated %>%
filter(overnight == "YES") %>%
select(month, day, tailnum, dest, dep_time, air_time, overnight) %>%
head(10)
Answer:
planes data set.# Enter code here.
overnight_planes <- flights_updated %>%
inner_join(planes, by = "tailnum")
comparison <- overnight_planes %>%
group_by(overnight) %>%
summarize(
mean = mean(seats),
median = median(seats),
min = min(seats),
total_flights = n()
)
print(comparison)
## # A tibble: 2 × 5
## overnight mean median min total_flights
## <chr> <dbl> <int> <int> <int>
## 1 NO 137. 149 2 283531
## 2 YES 200. 200 2 639
Answer:
The table shows that the overnight
planes have a higher average on seat number, which mean, the overnight
flights are actually bigger. However, the flights number is pretty small
compare to day flights.
Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.
gss_cat data set, find factors that are
significantly correlated with the reported income.# Enter code here.
Answer:
smoking data set of the openintro
package, find find factors that are significantly correlated with the
smoking status and the number of cigarettes smoked per day.# Enter code here.
Answer: