I, tianlu sui______, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.
# load required packages here
library(tidyverse)
library(openintro)
library(dplyr)
library(ggplot2)
library(lubridate)
library(nycflights13)
library(forcats)
The following questions shall be answered by working with the
world_bank_pop and who data sets from the
openinto library.
world_bank_pop is not clean. Clean the
data set such that the after data tidying you have six columns:
country, year, SP.URB.TOTL,
SP.URB.GROW, SP.POP.TOTL,
SP.POP.GROW. Give your code and show the first 10 rows of
the data set after being tidied. Then explain the meaning of each
column.world_bank_pop
## # A tibble: 1,064 x 20
## country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW SP.URB.TOTL 4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
## 2 ABW SP.URB.GROW 1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
## 3 ABW SP.POP.TOTL 8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
## 4 ABW SP.POP.GROW 2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
## 5 AFE SP.URB.TOTL 1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
## 6 AFE SP.URB.GROW 3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
## 7 AFE SP.POP.TOTL 4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
## 8 AFE SP.POP.GROW 2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
## 9 AFG SP.URB.TOTL 4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
## 10 AFG SP.URB.GROW 1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
## # i 1,054 more rows
## # i 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## # `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## # `2016` <dbl>, `2017` <dbl>
?world_bank_pop
## 打开httpd帮助服务器… 好了
# Enter code here.
tidy_pop <- world_bank_pop %>%
pivot_longer(
cols = `2000`:`2017`,
names_to = "year",
values_to = "value"
) %>%
mutate(year = as.integer(year)) %>%
pivot_wider(
names_from = indicator,
values_from = value
) %>%
select(country, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW)
head(tidy_pop, 10)
## # A tibble: 10 x 6
## country year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 ABW 2000 41625 1.66 89101 2.54
## 2 ABW 2001 42025 0.956 90691 1.77
## 3 ABW 2002 42194 0.401 91781 1.19
## 4 ABW 2003 42277 0.197 92701 0.997
## 5 ABW 2004 42317 0.0946 93540 0.901
## 6 ABW 2005 42399 0.194 94483 1.00
## 7 ABW 2006 42555 0.367 95606 1.18
## 8 ABW 2007 42729 0.408 96787 1.23
## 9 ABW 2008 42906 0.413 97996 1.24
## 10 ABW 2009 43079 0.402 99212 1.23
Answer:
SP.POP.GROW = population growth SP.POP.TOTL = total population SP.URB.GROW = urban population growth SP.URB.TOTL = total urban population country:Abbreviations of country names year:Time
country column of the tided data set in
step a) with full names of the country (for example, replace
USA with United States of America) by checking
the data frame who, which contains the full name of each
country corresponding to the three-digit country code. Give your code
and show the updated data set in a manner to illustrate that the task is
correctly fulfilled.who <- tidyr::who
# Enter code here.
who1<-who%>%
select(country,iso3)%>%
rename(country_name=iso3)
tidy_pop2 <- tidy_pop %>%
rename(country_name=country)%>%
left_join(
who1,
by = "country_name"
)%>%
select(-country_name)
## Warning in left_join(., who1, by = "country_name"): Detected an unexpected many-to-many relationship between `x` and `y`.
## i Row 1 of `x` matches multiple rows in `y`.
## i Row 341 of `y` matches multiple rows in `x`.
## i If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
head(tidy_pop2, 10)
## # A tibble: 10 x 6
## year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW country
## <int> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2000 41625 1.66 89101 2.54 Aruba
## 2 2000 41625 1.66 89101 2.54 Aruba
## 3 2000 41625 1.66 89101 2.54 Aruba
## 4 2000 41625 1.66 89101 2.54 Aruba
## 5 2000 41625 1.66 89101 2.54 Aruba
## 6 2000 41625 1.66 89101 2.54 Aruba
## 7 2000 41625 1.66 89101 2.54 Aruba
## 8 2000 41625 1.66 89101 2.54 Aruba
## 9 2000 41625 1.66 89101 2.54 Aruba
## 10 2000 41625 1.66 89101 2.54 Aruba
Answer:
# Enter code here.
top10_urban_pop_change <- tidy_pop2 %>%
filter(year %in% c(2000, 2017)) %>%
select(country, year, SP.URB.TOTL) %>%
pivot_wider(
names_from = year,
values_from = SP.URB.TOTL,
values_fn = mean
) %>%
rename(pop_2000 = `2000`,
pop_2017 = `2017`) %>%
mutate(change = pop_2017 - pop_2000) %>%
arrange(desc(change)) %>%
head(10)
top10_urban_pop_change
## # A tibble: 10 x 4
## country pop_2000 pop_2017 change
## <chr> <dbl> <dbl> <dbl>
## 1 China 452999147 809246214 356247067
## 2 India 293168849 455009748 161840899
## 3 Indonesia 89914698 144572428 54657730
## 4 Nigeria 42801631 95817238 53015607
## 5 United States of America 223069137 266788716 43719579
## 6 Brazil 142795391 179958546 37163155
## 7 Pakistan 50914288 78853074 27938786
## 8 Bangladesh 30476706 58016080 27539374
## 9 Mexico 73132993 98108030 24975037
## 10 Iran (Islamic Republic of) 41975934 62866706 20890772
Answer:
china india indonesia nigeria usa
For the following tasks, use data set planes and
flights from the nycflights13 package.
planes data set, only keep planes from
manufacturers that have more than 10 samples in the data set. Then
convert manufacturer column into a factor. Then combine
AIRBUS and AIRBUS INDUSTRIE as a single
category AIRBUS; combine MCDONNELL DOUGLAS,
MCDONNELL DOUGLAS AIRCRAFT CO and
MCDONNELL DOUGLAS CORPORATION into a single category
MCDONNELL. Save your data frame as a new one. Show your
code and the first 10 rows of the updated data frame.# Enter code here.
planes2 <- planes %>%
filter(!is.na(manufacturer)) %>%
group_by(manufacturer) %>%
filter(n() > 10)
planes2%>%
mutate(manufacturer = case_when(
manufacturer %in% c("AIRBUS", "AIRBUS INDUSTRIE") ~ "AIRBUS",
manufacturer %in% c("MCDONNELL DOUGLAS",
"MCDONNELL DOUGLAS AIRCRAFT CO",
"MCDONNELL DOUGLAS CORPORATION") ~ "MCDONNELL",
TRUE ~ manufacturer
)) %>%
mutate(manufacturer = factor(manufacturer))
## # A tibble: 3,270 x 9
## # Groups: manufacturer [5]
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <fct> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi~ EMBRAER EMB-~ 2 55 NA Turbo~
## 2 N102UW 1998 Fixed wing multi~ AIRBUS A320~ 2 182 NA Turbo~
## 3 N103US 1999 Fixed wing multi~ AIRBUS A320~ 2 182 NA Turbo~
## 4 N104UW 1999 Fixed wing multi~ AIRBUS A320~ 2 182 NA Turbo~
## 5 N10575 2002 Fixed wing multi~ EMBRAER EMB-~ 2 55 NA Turbo~
## 6 N105UW 1999 Fixed wing multi~ AIRBUS A320~ 2 182 NA Turbo~
## 7 N107US 1999 Fixed wing multi~ AIRBUS A320~ 2 182 NA Turbo~
## 8 N108UW 1999 Fixed wing multi~ AIRBUS A320~ 2 182 NA Turbo~
## 9 N109UW 1999 Fixed wing multi~ AIRBUS A320~ 2 182 NA Turbo~
## 10 N110UW 1999 Fixed wing multi~ AIRBUS A320~ 2 182 NA Turbo~
## # i 3,260 more rows
head(planes2, 10)
## # A tibble: 10 x 9
## # Groups: manufacturer [2]
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi~ EMBRAER EMB-~ 2 55 NA Turbo~
## 2 N102UW 1998 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 3 N103US 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 4 N104UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 5 N10575 2002 Fixed wing multi~ EMBRAER EMB-~ 2 55 NA Turbo~
## 6 N105UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 7 N107US 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 8 N108UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 9 N109UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 10 N110UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
Answer:
flights data set with the planes
data set, study how plane models correlate with the flight distance with
proper data visualizations or summary tables. You are required to
summarize your findings concisely in your own words.flight_planes <- flights %>%
left_join(planes2, by = "tailnum")
flight_planes
## # A tibble: 336,776 x 27
## year.x month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # i 336,766 more rows
## # i 19 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, year.y <int>, type <chr>,
## # manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>,
## # engine <chr>
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # i 336,766 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# Enter code here.
planes2 <- planes %>%
filter(!is.na(model))
flights2 <- flights %>%
filter(!is.na(distance), !is.na(tailnum))
flight_planes <- flights2 %>%
left_join(planes2, by = "tailnum")
model_summary <- flight_planes %>%
filter(!is.na(model)) %>%
group_by(model) %>%
summarise(
avg_distance = mean(distance, na.rm = TRUE),
n = n()
) %>%
arrange(desc(avg_distance))
model_summary
## # A tibble: 127 x 3
## model avg_distance n
## <chr> <dbl> <int>
## 1 A330-243 4983 342
## 2 767-424ER 3850. 532
## 3 A319-115 2525. 94
## 4 777-222 2520 4
## 5 757-212 2475 2
## 6 A330-323 2422 1
## 7 737-890 2402 346
## 8 737-8FH 2402 16
## 9 737-990 2402 18
## 10 737-990ER 2402 334
## # i 117 more rows
Answer:
The A330-243 has a significantly longer average range than other models; the 767-424ER is also quite good, whilst the CL-600-2B19 performs the worst.The flight range of most models is between 2,400 and 500 metres.
For the following tasks, use the data set weather,
flights or planes from the
nycflights13 package.
JFK airport. (Hint: You need to first create a
datetime variable for each hour.)# Enter code here.
jfk_weather <- weather %>%
filter(origin == "JFK") %>%
mutate(datetime = make_datetime(year, month, day, hour))
ggplot(jfk_weather, aes(x = datetime, y = temp)) +
geom_line() +
labs(title = "Temperature change at JFK",
x = "Time",
y = "Temperature")
Answer:
# Enter code here.
daily_range <- weather %>%
group_by(year, month, day) %>%
summarise(
max_temp = max(temp, na.rm = TRUE),
min_temp = min(temp, na.rm = TRUE),
temp_diff = max_temp - min_temp
)
daily_range %>%
arrange(desc(temp_diff))%>%
head(5)
## # A tibble: 5 x 6
## # Groups: year, month [3]
## year month day max_temp min_temp temp_diff
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 2013 5 8 66.9 13.1 53.8
## 2 2013 4 9 84.0 46.9 37.1
## 3 2013 1 31 62.6 30.0 32.6
## 4 2013 5 29 87.1 57.0 30.1
## 5 2013 5 30 93.0 63.0 30.1
Answer:
May 8
flights data set. Here overnight
flights are defined as flights that departed between 10pm and 1am, and
having an air time of over 4 hours . Create a categorical variable
overnight_flag with YES or NO as
the possible values. Show your code and the updated data frame.# Enter code here.
flights_redeyes <- flights %>%
mutate(
dep_hour = dep_time %/% 100,
late_night = (dep_hour >= 22 | dep_hour <= 1),
long_flight = air_time > 240,
overnight_flag = ifelse(late_night & long_flight, "YES", "NO")
)
flights_redeyes
## # A tibble: 336,776 x 23
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # i 336,766 more rows
## # i 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, dep_hour <dbl>,
## # late_night <lgl>, long_flight <lgl>, overnight_flag <chr>
Answer:
planes data set.# Enter code here.
redeyesflight_plane <- flights_redeyes %>%
left_join(planes, by = "tailnum")
redeyesflight_plane
## # A tibble: 336,776 x 31
## year.x month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # i 336,766 more rows
## # i 23 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, dep_hour <dbl>,
## # late_night <lgl>, long_flight <lgl>, overnight_flag <chr>, year.y <int>,
## # type <chr>, manufacturer <chr>, model <chr>, engines <int>, seats <int>,
## # speed <int>, engine <chr>
redeyesflight_plane %>%
group_by(overnight_flag) %>%
summarise(
avg_seats = mean(seats, na.rm = TRUE),
n = n()
)
## # A tibble: 3 x 3
## overnight_flag avg_seats n
## <chr> <dbl> <int>
## 1 NO 137. 327817
## 2 YES 200. 667
## 3 <NA> 84.8 8292
Answer:
That is incorrect; it is quite clear
that overnight flights have a higher average number of seats than
non-overnight flights.
Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.
gss_cat data set, find factors that are
significantly correlated with the reported income.gss_cat
## # A tibble: 21,483 x 9
## year marital age race rincome partyid relig denom tvhours
## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 2000 Never married 26 White $8000 to 9999 Ind,near ~ Prot~ Sout~ 12
## 2 2000 Divorced 48 White $8000 to 9999 Not str r~ Prot~ Bapt~ NA
## 3 2000 Widowed 67 White Not applicable Independe~ Prot~ No d~ 2
## 4 2000 Never married 39 White Not applicable Ind,near ~ Orth~ Not ~ 4
## 5 2000 Divorced 25 White Not applicable Not str d~ None Not ~ 1
## 6 2000 Married 25 White $20000 - 24999 Strong de~ Prot~ Sout~ NA
## 7 2000 Never married 36 White $25000 or more Not str r~ Chri~ Not ~ 3
## 8 2000 Divorced 44 White $7000 to 7999 Ind,near ~ Prot~ Luth~ NA
## 9 2000 Married 44 White $25000 or more Not str d~ Prot~ Other 0
## 10 2000 Married 47 White $25000 or more Strong re~ Prot~ Sout~ 3
## # i 21,473 more rows
?gss_cat
# Enter code here.
table(gss_cat$rincome, gss_cat$marital) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 2470.6, df = 75, p-value < 2.2e-16
table(gss_cat$rincome, gss_cat$tvhours) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 1263.1, df = 345, p-value < 2.2e-16
table(gss_cat$rincome, gss_cat$partyid) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 806.11, df = 135, p-value < 2.2e-16
table(gss_cat$rincome, gss_cat$age) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 8157, df = 1065, p-value < 2.2e-16
ggplot(gss_cat, aes(x = rincome, fill = marital)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Answer:
Consequently, the reported income is associated with marital status, age, political affiliation and the amount of time spent watching television
Generally speaking, the income for never married people is slightly lower.
smoking data set of the openintro
package, find find factors that are significantly correlated with the
smoking status and the number of cigarettes smoked per day.smoking
## # A tibble: 1,691 x 12
## gender age marital_status highest_qualification nationality ethnicity
## * <fct> <int> <fct> <fct> <fct> <fct>
## 1 Male 38 Divorced No Qualification British White
## 2 Female 42 Single No Qualification British White
## 3 Male 40 Married Degree English White
## 4 Female 40 Married Degree English White
## 5 Female 39 Married GCSE/O Level British White
## 6 Female 37 Married GCSE/O Level British White
## 7 Male 53 Married Degree British White
## 8 Male 44 Single Degree English White
## 9 Male 40 Single GCSE/CSE English White
## 10 Female 41 Married No Qualification English White
## # i 1,681 more rows
## # i 6 more variables: gross_income <fct>, region <fct>, smoke <fct>,
## # amt_weekends <int>, amt_weekdays <int>, type <fct>
# Enter code here.
table(smoking$smoke, smoking$gender) %>%
chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 0.42699, df = 1, p-value = 0.5135
table(smoking$smoke, smoking$age) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 159.26, df = 78, p-value = 1.631e-07
table(smoking$smoke, smoking$nationality) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 14.929, df = 7, p-value = 0.03692
table(smoking$smoke, smoking$gross_income) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 19.835, df = 9, p-value = 0.01896
table(smoking$smoke, smoking$region) %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 12.671, df = 6, p-value = 0.04857
ggplot(smoking, aes(x = smoke, y = age)) +
geom_boxplot()
Answer:
Smoking is linked to age; smokers are predominantly aged between 30 and 50, whilst non-smokers are predominantly aged between 40 and 70.