Academic Honesty Statement (fill your name in the blank)

I, tianlu sui______, hereby state that I have not gained information in any way not allowed by the exam rules during this exam, and that all work is my own.

Load packages

# load required packages here
library(tidyverse)
library(openintro)
library(dplyr)
library(ggplot2)
library(lubridate)
library(nycflights13)
library(forcats)

1. Data Tidying and Relational Data

The following questions shall be answered by working with the world_bank_pop and who data sets from the openinto library.

  1. The data set world_bank_pop is not clean. Clean the data set such that the after data tidying you have six columns: country, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW. Give your code and show the first 10 rows of the data set after being tidied. Then explain the meaning of each column.
world_bank_pop
## # A tibble: 1,064 x 20
##    country indicator      `2000`  `2001`  `2002`  `2003`  `2004`  `2005`  `2006`
##    <chr>   <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 ABW     SP.URB.TOTL    4.16e4 4.20e+4 4.22e+4 4.23e+4 4.23e+4 4.24e+4 4.26e+4
##  2 ABW     SP.URB.GROW    1.66e0 9.56e-1 4.01e-1 1.97e-1 9.46e-2 1.94e-1 3.67e-1
##  3 ABW     SP.POP.TOTL    8.91e4 9.07e+4 9.18e+4 9.27e+4 9.35e+4 9.45e+4 9.56e+4
##  4 ABW     SP.POP.GROW    2.54e0 1.77e+0 1.19e+0 9.97e-1 9.01e-1 1.00e+0 1.18e+0
##  5 AFE     SP.URB.TOTL    1.16e8 1.20e+8 1.24e+8 1.29e+8 1.34e+8 1.39e+8 1.44e+8
##  6 AFE     SP.URB.GROW    3.60e0 3.66e+0 3.72e+0 3.71e+0 3.74e+0 3.81e+0 3.81e+0
##  7 AFE     SP.POP.TOTL    4.02e8 4.12e+8 4.23e+8 4.34e+8 4.45e+8 4.57e+8 4.70e+8
##  8 AFE     SP.POP.GROW    2.58e0 2.59e+0 2.61e+0 2.62e+0 2.64e+0 2.67e+0 2.70e+0
##  9 AFG     SP.URB.TOTL    4.31e6 4.36e+6 4.67e+6 5.06e+6 5.30e+6 5.54e+6 5.83e+6
## 10 AFG     SP.URB.GROW    1.86e0 1.15e+0 6.86e+0 7.95e+0 4.59e+0 4.47e+0 5.03e+0
## # i 1,054 more rows
## # i 11 more variables: `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>, `2017` <dbl>
?world_bank_pop
## 打开httpd帮助服务器… 好了
# Enter code here.


tidy_pop <- world_bank_pop %>%
  

  pivot_longer(
    cols = `2000`:`2017`,  
    names_to = "year",
    values_to = "value"
  ) %>%
  
  mutate(year = as.integer(year)) %>%
  pivot_wider(
    names_from = indicator,
    values_from = value
  ) %>%
  


  select(country, year, SP.URB.TOTL, SP.URB.GROW, SP.POP.TOTL, SP.POP.GROW)


head(tidy_pop, 10)
## # A tibble: 10 x 6
##    country  year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW
##    <chr>   <int>       <dbl>       <dbl>       <dbl>       <dbl>
##  1 ABW      2000       41625      1.66         89101       2.54 
##  2 ABW      2001       42025      0.956        90691       1.77 
##  3 ABW      2002       42194      0.401        91781       1.19 
##  4 ABW      2003       42277      0.197        92701       0.997
##  5 ABW      2004       42317      0.0946       93540       0.901
##  6 ABW      2005       42399      0.194        94483       1.00 
##  7 ABW      2006       42555      0.367        95606       1.18 
##  8 ABW      2007       42729      0.408        96787       1.23 
##  9 ABW      2008       42906      0.413        97996       1.24 
## 10 ABW      2009       43079      0.402        99212       1.23

Answer:

SP.POP.GROW = population growth SP.POP.TOTL = total population SP.URB.GROW = urban population growth SP.URB.TOTL = total urban population country:Abbreviations of country names year:Time

  1. Replace the country column of the tided data set in step a) with full names of the country (for example, replace USA with United States of America) by checking the data frame who, which contains the full name of each country corresponding to the three-digit country code. Give your code and show the updated data set in a manner to illustrate that the task is correctly fulfilled.
who <- tidyr::who
# Enter code here.

who1<-who%>%
  select(country,iso3)%>%
  rename(country_name=iso3)

tidy_pop2 <- tidy_pop %>%
  rename(country_name=country)%>%
  left_join(
    who1,
    by = "country_name"
  )%>%
  select(-country_name)
## Warning in left_join(., who1, by = "country_name"): Detected an unexpected many-to-many relationship between `x` and `y`.
## i Row 1 of `x` matches multiple rows in `y`.
## i Row 341 of `y` matches multiple rows in `x`.
## i If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
head(tidy_pop2, 10)
## # A tibble: 10 x 6
##     year SP.URB.TOTL SP.URB.GROW SP.POP.TOTL SP.POP.GROW country
##    <int>       <dbl>       <dbl>       <dbl>       <dbl> <chr>  
##  1  2000       41625        1.66       89101        2.54 Aruba  
##  2  2000       41625        1.66       89101        2.54 Aruba  
##  3  2000       41625        1.66       89101        2.54 Aruba  
##  4  2000       41625        1.66       89101        2.54 Aruba  
##  5  2000       41625        1.66       89101        2.54 Aruba  
##  6  2000       41625        1.66       89101        2.54 Aruba  
##  7  2000       41625        1.66       89101        2.54 Aruba  
##  8  2000       41625        1.66       89101        2.54 Aruba  
##  9  2000       41625        1.66       89101        2.54 Aruba  
## 10  2000       41625        1.66       89101        2.54 Aruba

Answer:

  1. With the data set obtained in step b), answer which countries had undergone significant urbanization between 2000 and 2017. You need to show the code and the results (either graphs or tables) to support your answer.
# Enter code here.
top10_urban_pop_change <- tidy_pop2 %>%
  filter(year %in% c(2000, 2017)) %>%
  select(country, year, SP.URB.TOTL) %>%
  
  pivot_wider(
    names_from = year,
    values_from = SP.URB.TOTL,
    values_fn = mean  
  ) %>%
  
  rename(pop_2000 = `2000`,
         pop_2017 = `2017`) %>%
  
  mutate(change = pop_2017 - pop_2000) %>%
  arrange(desc(change)) %>%
  head(10)

top10_urban_pop_change
## # A tibble: 10 x 4
##    country                     pop_2000  pop_2017    change
##    <chr>                          <dbl>     <dbl>     <dbl>
##  1 China                      452999147 809246214 356247067
##  2 India                      293168849 455009748 161840899
##  3 Indonesia                   89914698 144572428  54657730
##  4 Nigeria                     42801631  95817238  53015607
##  5 United States of America   223069137 266788716  43719579
##  6 Brazil                     142795391 179958546  37163155
##  7 Pakistan                    50914288  78853074  27938786
##  8 Bangladesh                  30476706  58016080  27539374
##  9 Mexico                      73132993  98108030  24975037
## 10 Iran (Islamic Republic of)  41975934  62866706  20890772

Answer:
china india indonesia nigeria usa

2. Factors and Relational Data

For the following tasks, use data set planes and flights from the nycflights13 package.

  1. For the planes data set, only keep planes from manufacturers that have more than 10 samples in the data set. Then convert manufacturer column into a factor. Then combine AIRBUS and AIRBUS INDUSTRIE as a single category AIRBUS; combine MCDONNELL DOUGLAS, MCDONNELL DOUGLAS AIRCRAFT CO and MCDONNELL DOUGLAS CORPORATION into a single category MCDONNELL. Save your data frame as a new one. Show your code and the first 10 rows of the updated data frame.
# Enter code here.
planes2 <- planes %>%
  filter(!is.na(manufacturer)) %>%
  
  group_by(manufacturer) %>%
  filter(n() > 10)

planes2%>%
  mutate(manufacturer = case_when(
    manufacturer %in% c("AIRBUS", "AIRBUS INDUSTRIE") ~ "AIRBUS",
    manufacturer %in% c("MCDONNELL DOUGLAS",
                        "MCDONNELL DOUGLAS AIRCRAFT CO",
                        "MCDONNELL DOUGLAS CORPORATION") ~ "MCDONNELL",
    TRUE ~ manufacturer
  )) %>%
  
  mutate(manufacturer = factor(manufacturer))
## # A tibble: 3,270 x 9
## # Groups:   manufacturer [5]
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <fct>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi~ EMBRAER      EMB-~       2    55    NA Turbo~
##  2 N102UW   1998 Fixed wing multi~ AIRBUS       A320~       2   182    NA Turbo~
##  3 N103US   1999 Fixed wing multi~ AIRBUS       A320~       2   182    NA Turbo~
##  4 N104UW   1999 Fixed wing multi~ AIRBUS       A320~       2   182    NA Turbo~
##  5 N10575   2002 Fixed wing multi~ EMBRAER      EMB-~       2    55    NA Turbo~
##  6 N105UW   1999 Fixed wing multi~ AIRBUS       A320~       2   182    NA Turbo~
##  7 N107US   1999 Fixed wing multi~ AIRBUS       A320~       2   182    NA Turbo~
##  8 N108UW   1999 Fixed wing multi~ AIRBUS       A320~       2   182    NA Turbo~
##  9 N109UW   1999 Fixed wing multi~ AIRBUS       A320~       2   182    NA Turbo~
## 10 N110UW   1999 Fixed wing multi~ AIRBUS       A320~       2   182    NA Turbo~
## # i 3,260 more rows
head(planes2, 10)
## # A tibble: 10 x 9
## # Groups:   manufacturer [2]
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi~ EMBRAER      EMB-~       2    55    NA Turbo~
##  2 N102UW   1998 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  3 N103US   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  4 N104UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  5 N10575   2002 Fixed wing multi~ EMBRAER      EMB-~       2    55    NA Turbo~
##  6 N105UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  7 N107US   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  8 N108UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  9 N109UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
## 10 N110UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~

Answer:

  1. Join the flights data set with the planes data set, study how plane models correlate with the flight distance with proper data visualizations or summary tables. You are required to summarize your findings concisely in your own words.
flight_planes <- flights %>%
  left_join(planes2, by = "tailnum")
flight_planes
## # A tibble: 336,776 x 27
##    year.x month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##     <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1   2013     1     1      517            515         2      830            819
##  2   2013     1     1      533            529         4      850            830
##  3   2013     1     1      542            540         2      923            850
##  4   2013     1     1      544            545        -1     1004           1022
##  5   2013     1     1      554            600        -6      812            837
##  6   2013     1     1      554            558        -4      740            728
##  7   2013     1     1      555            600        -5      913            854
##  8   2013     1     1      557            600        -3      709            723
##  9   2013     1     1      557            600        -3      838            846
## 10   2013     1     1      558            600        -2      753            745
## # i 336,766 more rows
## # i 19 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, year.y <int>, type <chr>,
## #   manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>,
## #   engine <chr>
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # i 336,766 more rows
## # i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
# Enter code here.

planes2 <- planes %>%
  filter(!is.na(model))

flights2 <- flights %>%
  filter(!is.na(distance), !is.na(tailnum))

flight_planes <- flights2 %>%
  left_join(planes2, by = "tailnum")

model_summary <- flight_planes %>%
  filter(!is.na(model)) %>%
  group_by(model) %>%
  summarise(
    avg_distance = mean(distance, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(avg_distance))


model_summary
## # A tibble: 127 x 3
##    model     avg_distance     n
##    <chr>            <dbl> <int>
##  1 A330-243         4983    342
##  2 767-424ER        3850.   532
##  3 A319-115         2525.    94
##  4 777-222          2520      4
##  5 757-212          2475      2
##  6 A330-323         2422      1
##  7 737-890          2402    346
##  8 737-8FH          2402     16
##  9 737-990          2402     18
## 10 737-990ER        2402    334
## # i 117 more rows

Answer:

The A330-243 has a significantly longer average range than other models; the 767-424ER is also quite good, whilst the CL-600-2B19 performs the worst.The flight range of most models is between 2,400 and 500 metres.

3. Datetime and Data Transformation

For the following tasks, use the data set weather, flights or planes from the nycflights13 package.

  1. Create a plot of the temperature change across the whole year of 2013 at the JFK airport. (Hint: You need to first create a datetime variable for each hour.)
# Enter code here.

jfk_weather <- weather %>%
  filter(origin == "JFK") %>%
  mutate(datetime = make_datetime(year, month, day, hour))

ggplot(jfk_weather, aes(x = datetime, y = temp)) +
  geom_line() +
  labs(title = "Temperature change at JFK",
       x = "Time",
       y = "Temperature")

Answer:

  1. Find out which day of the year has the largest temperature difference (defined as the difference between the highest and the lowest temperature) across the day (0am - 11pm).
# Enter code here.

daily_range <- weather %>%
  group_by(year, month, day) %>%
  summarise(
    max_temp = max(temp, na.rm = TRUE),
    min_temp = min(temp, na.rm = TRUE),
    temp_diff = max_temp - min_temp
  )

daily_range %>%
  arrange(desc(temp_diff))%>%
  head(5)
## # A tibble: 5 x 6
## # Groups:   year, month [3]
##    year month   day max_temp min_temp temp_diff
##   <int> <int> <int>    <dbl>    <dbl>     <dbl>
## 1  2013     5     8     66.9     13.1      53.8
## 2  2013     4     9     84.0     46.9      37.1
## 3  2013     1    31     62.6     30.0      32.6
## 4  2013     5    29     87.1     57.0      30.1
## 5  2013     5    30     93.0     63.0      30.1

Answer:

May 8

  1. Find a way to select all overnight flights (also called “Red Eye Flights” that depart at late night and arrive in the early morning) from the flights data set. Here overnight flights are defined as flights that departed between 10pm and 1am, and having an air time of over 4 hours . Create a categorical variable overnight_flag with YES or NO as the possible values. Show your code and the updated data frame.
# Enter code here.

flights_redeyes <- flights %>%
  mutate(
    dep_hour = dep_time %/% 100,
    late_night = (dep_hour >= 22 | dep_hour <= 1),
    long_flight = air_time > 240,
    overnight_flag = ifelse(late_night & long_flight, "YES", "NO")
  )
flights_redeyes
## # A tibble: 336,776 x 23
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # i 336,766 more rows
## # i 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, dep_hour <dbl>,
## #   late_night <lgl>, long_flight <lgl>, overnight_flag <chr>

Answer:

  1. Someone says that most overnight flights use relatively small planes. Verify whether this is true with the data frame obtained in c) and the planes data set.
# Enter code here.
redeyesflight_plane <- flights_redeyes %>%
  left_join(planes, by = "tailnum")
redeyesflight_plane
## # A tibble: 336,776 x 31
##    year.x month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##     <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1   2013     1     1      517            515         2      830            819
##  2   2013     1     1      533            529         4      850            830
##  3   2013     1     1      542            540         2      923            850
##  4   2013     1     1      544            545        -1     1004           1022
##  5   2013     1     1      554            600        -6      812            837
##  6   2013     1     1      554            558        -4      740            728
##  7   2013     1     1      555            600        -5      913            854
##  8   2013     1     1      557            600        -3      709            723
##  9   2013     1     1      557            600        -3      838            846
## 10   2013     1     1      558            600        -2      753            745
## # i 336,766 more rows
## # i 23 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, dep_hour <dbl>,
## #   late_night <lgl>, long_flight <lgl>, overnight_flag <chr>, year.y <int>,
## #   type <chr>, manufacturer <chr>, model <chr>, engines <int>, seats <int>,
## #   speed <int>, engine <chr>
redeyesflight_plane %>%
  group_by(overnight_flag) %>%
  summarise(
    avg_seats = mean(seats, na.rm = TRUE),
    n = n()
  )
## # A tibble: 3 x 3
##   overnight_flag avg_seats      n
##   <chr>              <dbl>  <int>
## 1 NO                 137.  327817
## 2 YES                200.     667
## 3 <NA>                84.8   8292

Answer:
That is incorrect; it is quite clear that overnight flights have a higher average number of seats than non-overnight flights.

4. General Analysis and Statistical Tests

Answer the following questions with data visualization or summary. You are required to summarize your findings concisely in your own words and support your conclusion with proper graphs or tables.

  1. From the gss_cat data set, find factors that are significantly correlated with the reported income.
gss_cat
## # A tibble: 21,483 x 9
##     year marital         age race  rincome        partyid    relig denom tvhours
##    <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
##  1  2000 Never married    26 White $8000 to 9999  Ind,near ~ Prot~ Sout~      12
##  2  2000 Divorced         48 White $8000 to 9999  Not str r~ Prot~ Bapt~      NA
##  3  2000 Widowed          67 White Not applicable Independe~ Prot~ No d~       2
##  4  2000 Never married    39 White Not applicable Ind,near ~ Orth~ Not ~       4
##  5  2000 Divorced         25 White Not applicable Not str d~ None  Not ~       1
##  6  2000 Married          25 White $20000 - 24999 Strong de~ Prot~ Sout~      NA
##  7  2000 Never married    36 White $25000 or more Not str r~ Chri~ Not ~       3
##  8  2000 Divorced         44 White $7000 to 7999  Ind,near ~ Prot~ Luth~      NA
##  9  2000 Married          44 White $25000 or more Not str d~ Prot~ Other       0
## 10  2000 Married          47 White $25000 or more Strong re~ Prot~ Sout~       3
## # i 21,473 more rows
?gss_cat
# Enter code here.
table(gss_cat$rincome, gss_cat$marital) %>%
  chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 2470.6, df = 75, p-value < 2.2e-16
table(gss_cat$rincome, gss_cat$tvhours) %>%
  chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 1263.1, df = 345, p-value < 2.2e-16
table(gss_cat$rincome, gss_cat$partyid) %>%
  chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 806.11, df = 135, p-value < 2.2e-16
table(gss_cat$rincome, gss_cat$age) %>%
  chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 8157, df = 1065, p-value < 2.2e-16
ggplot(gss_cat, aes(x = rincome, fill = marital)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Answer:

Consequently, the reported income is associated with marital status, age, political affiliation and the amount of time spent watching television

Generally speaking, the income for never married people is slightly lower.

  1. From the smoking data set of the openintro package, find find factors that are significantly correlated with the smoking status and the number of cigarettes smoked per day.
smoking
## # A tibble: 1,691 x 12
##    gender   age marital_status highest_qualification nationality ethnicity
##  * <fct>  <int> <fct>          <fct>                 <fct>       <fct>    
##  1 Male      38 Divorced       No Qualification      British     White    
##  2 Female    42 Single         No Qualification      British     White    
##  3 Male      40 Married        Degree                English     White    
##  4 Female    40 Married        Degree                English     White    
##  5 Female    39 Married        GCSE/O Level          British     White    
##  6 Female    37 Married        GCSE/O Level          British     White    
##  7 Male      53 Married        Degree                British     White    
##  8 Male      44 Single         Degree                English     White    
##  9 Male      40 Single         GCSE/CSE              English     White    
## 10 Female    41 Married        No Qualification      English     White    
## # i 1,681 more rows
## # i 6 more variables: gross_income <fct>, region <fct>, smoke <fct>,
## #   amt_weekends <int>, amt_weekdays <int>, type <fct>
# Enter code here.
table(smoking$smoke, smoking$gender) %>%
  chisq.test()
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 0.42699, df = 1, p-value = 0.5135
table(smoking$smoke, smoking$age) %>%
  chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 159.26, df = 78, p-value = 1.631e-07
table(smoking$smoke, smoking$nationality) %>%
  chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 14.929, df = 7, p-value = 0.03692
table(smoking$smoke, smoking$gross_income) %>%
  chisq.test()
## Warning in chisq.test(.): Chi-squared近似算法有可能不准
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 19.835, df = 9, p-value = 0.01896
table(smoking$smoke, smoking$region) %>%
  chisq.test()
## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 12.671, df = 6, p-value = 0.04857
ggplot(smoking, aes(x = smoke, y = age)) +
  geom_boxplot()

Answer:


Smoking is linked to age; smokers are predominantly aged between 30 and 50, whilst non-smokers are predominantly aged between 40 and 70.