My Workbook

Author

Kia Wright-Williams

Week 1: Meet the penguins

Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst.

The penguins data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.

The plot below shows the relationship between flipper and bill lengths of these penguins.
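
The rendered plot is not reproduced here, but a minimal sketch of the kind of code that would produce it (assuming the palmerpenguins and ggplot2 packages are installed):

Code
library(palmerpenguins) # provides the penguins data frame
library(ggplot2)

ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm,
                     color = species)) +
  geom_point(na.rm = TRUE) + # a few penguins have missing measurements
  labs(x = "Flipper length (mm)", y = "Bill length (mm)", color = "Species")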

Week 2: Computations

This dataset contains a subset of the fuel economy data from the EPA. Specifically, we use the mpg dataset from the ggplot2 package.

Code
library(ggplot2)

The plots in Figure 1 show the relationship between city and highway mileage for 38 popular models of cars. In Figure 1 (a) the points are colored by the number of cylinders while in Figure 1(b) the points are colored by engine displacement.

Code
#| label: fig-mpg
#| fig-cap: "City and highway mileage for 38 popular models of cars."
#| fig-subcap:
#|   - "Color by number of cylinders"
#|   - "Color by engine displacement, in liters"
#| layout-ncol: 2
#| column: page

ggplot(mpg, aes(x = hwy, y = cty, color = cyl)) +
  geom_point(alpha = 0.5, size = 2) +
  scale_color_viridis_c() +
  theme_minimal()

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(alpha = 0.5, size = 2) +
  scale_color_viridis_c(option = "E") +
  theme_minimal()

Code
#| echo: false
mean_cty <- round(mean(mpg$cty), 2)
mean_hwy <- round(mean(mpg$hwy), 2)
Figure 1: City and highway mileage for 38 popular models of cars. (a) Colored by number of cylinders; (b) colored by engine displacement. Both panels show a strong, positive, linear relationship between city and highway mileage, with mileage increasing as the number of cylinders decreases.

There are 234 observations in our data.

The average city mileage of the cars in our data is 16.86 and the average highway mileage is 23.44.
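
In the Quarto source, these numbers are typically interpolated with inline R code rather than typed by hand; a sketch, assuming the mean_cty and mean_hwy objects computed in the hidden chunk above: "The average city mileage of the cars in our data is `r mean_cty` and the average highway mileage is `r mean_hwy`."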

Week 3: Exercises (diamond dataset)

Code
1 + 1
[1] 2
Code
# View(diamonds) # to view the dataset
# str(diamonds) # to show the structure of the dataset
# names(diamonds) # to view the variable names

library(dplyr) # provides %>% and mutate()

diamonds %>% 
  mutate(JustOne = 1,
         Values = "something",
         Simple = TRUE) # mutate() creates new variables, optionally based on existing ones
# A tibble: 53,940 × 13
   carat cut    color clarity depth table price     x     y     z JustOne Values
   <dbl> <ord>  <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>   <dbl> <chr> 
 1  0.23 Ideal  E     SI2      61.5    55   326  3.95  3.98  2.43       1 somet…
 2  0.21 Premi… E     SI1      59.8    61   326  3.89  3.84  2.31       1 somet…
 3  0.23 Good   E     VS1      56.9    65   327  4.05  4.07  2.31       1 somet…
 4  0.29 Premi… I     VS2      62.4    58   334  4.2   4.23  2.63       1 somet…
 5  0.31 Good   J     SI2      63.3    58   335  4.34  4.35  2.75       1 somet…
 6  0.24 Very … J     VVS2     62.8    57   336  3.94  3.96  2.48       1 somet…
 7  0.24 Very … I     VVS1     62.3    57   336  3.95  3.98  2.47       1 somet…
 8  0.26 Very … H     SI1      61.9    55   337  4.07  4.11  2.53       1 somet…
 9  0.22 Fair   E     VS2      65.1    61   337  3.87  3.78  2.49       1 somet…
10  0.23 Very … H     VS1      59.4    61   338  4     4.05  2.39       1 somet…
# ℹ 53,930 more rows
# ℹ 1 more variable: Simple <lgl>
Code
diamonds %>% 
  mutate(price200 = price - 200, # £200 off original price
         price20perc = price * 0.20, # 20% of original price
         price20percoff = price * 0.80, # 20% off original price
         pricepercarat = price / carat, # ratio of price to carat
         squaredepth = depth ^ 2) # original depth, squared
# A tibble: 53,940 × 15
   carat cut       color clarity depth table price     x     y     z price200
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43      126
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31      126
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31      127
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63      134
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75      135
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48      136
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47      136
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53      137
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49      137
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39      138
# ℹ 53,930 more rows
# ℹ 4 more variables: price20perc <dbl>, price20percoff <dbl>,
#   pricepercarat <dbl>, squaredepth <dbl>
Code
diamonds.new <- # used to save changes to dataset as a new object
  diamonds %>% 
  mutate(price200 = price - 200, # £200 off original price
         price20perc = price * 0.20, # 20% of original price
         price20percoff = price * 0.80, # 20% off original price
         pricepercarat = price / carat, # ratio of price to carat
         squaredepth = depth ^ 2) # original depth, squared

## Nesting Functions
diamonds %>% 
  mutate(m = mean(price), # calculates the mean price
         sd = sd(price), # calculates the standard deviation
         med = median(price)) # calculates the median price
# A tibble: 53,940 × 13
   carat cut       color clarity depth table price     x     y     z     m    sd
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 3933. 3989.
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 3933. 3989.
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 3933. 3989.
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 3933. 3989.
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 3933. 3989.
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 3933. 3989.
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 3933. 3989.
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 3933. 3989.
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 3933. 3989.
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 3933. 3989.
# ℹ 53,930 more rows
# ℹ 1 more variable: med <dbl>
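
For contrast, summarize() collapses the same aggregates to a single row instead of repeating them on all 53,940 rows, as mutate() does above. A minimal sketch:

Code
diamonds %>% 
  summarize(m = mean(price), # overall mean price
            sd = sd(price), # overall standard deviation
            med = median(price)) # overall median price
# returns a 1-row tibble rather than one value per diamond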

Midwest data set

To calculate average population density

Code
library(tidyverse)
library(ggplot2) # This is where midwest is located
midwest <- ggplot2::midwest # Load the midwest dataset 
avg_density <- mean(midwest$popdensity)
midwest$avg.pop.den <- avg_density
head(midwest)
# A tibble: 6 × 29
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 20 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>

To calculate the average area for the entire data set

Code
avg.area <- mean(midwest$area)
head(midwest)
# A tibble: 6 × 29
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 20 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>

To calculate the total number of adults in the dataset

Code
total_adults <- sum(midwest$popadults)
midwest$totadult <- total_adults
head(midwest)
# A tibble: 6 × 30
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 21 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>

To calculate the difference between the total population and the white population.

Code
midwest$tot.minus.white <- midwest$poptotal - midwest$popwhite ## alternatively without using poptotal and popwhite 
midwest$tot.minus.white <- midwest$popblack + midwest$popamerindian + midwest$popasian + midwest$popother
head(midwest)
# A tibble: 6 × 31
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 22 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>, tot.minus.white <int>
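
The two assignments above should give identical results if the race columns partition poptotal; a quick sanity check (a sketch):

Code
all.equal(midwest$poptotal - midwest$popwhite,
          midwest$popblack + midwest$popamerindian +
            midwest$popasian + midwest$popother) # TRUE if the race columns sum to poptotal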

To calculate the ratio between populations

Code
midwest$child.to.adult <- midwest$percchildbelowpovert / midwest$percadultpoverty 
head(midwest) 
# A tibble: 6 × 32
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 23 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>, tot.minus.white <int>, child.to.adult <dbl>
Code
## Why are the values in child.to.adult all different? = Because they represent the ratio of the percentage of children below the poverty line to the percentage of adults in poverty for each observation (row) in the dataset. 
midwest$ratio.adult <- midwest$popadults / midwest$poptotal
head(midwest)
# A tibble: 6 × 33
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 24 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>, tot.minus.white <int>, child.to.adult <dbl>, …

To calculate the percentage of the total population that is made up of adults, per county

Code
midwest$perc.adult <- (midwest$popadults / midwest$poptotal) * 100
head(midwest)
# A tibble: 6 × 34
    PID county   state  area poptotal popdensity popwhite popblack popamerindian
  <int> <chr>    <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
1   561 ADAMS    IL    0.052    66090      1271.    63917     1702            98
2   562 ALEXAND… IL    0.014    10626       759      7054     3496            19
3   563 BOND     IL    0.022    14991       681.    14477      429            35
4   564 BOONE    IL    0.017    30806      1812.    29344      127            46
5   565 BROWN    IL    0.018     5836       324.     5264      547            14
6   566 BUREAU   IL    0.05     35688       714.    35157       50            65
# ℹ 25 more variables: popasian <int>, popother <int>, percwhite <dbl>,
#   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
#   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
#   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
#   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
#   percelderlypoverty <dbl>, inmetro <int>, category <chr>, avg.pop.den <dbl>,
#   totadult <int>, tot.minus.white <int>, child.to.adult <dbl>, …

Presidential data set

To load dataset

Code
library(ggplot2) # the presidential dataset ships with ggplot2, not datasets
data("presidential")
head(presidential) # to look at the data itself
# A tibble: 6 × 4
  name       start      end        party     
  <chr>      <date>     <date>     <chr>     
1 Eisenhower 1953-01-20 1961-01-20 Republican
2 Kennedy    1961-01-20 1963-11-22 Democratic
3 Johnson    1963-11-22 1969-01-20 Democratic
4 Nixon      1969-01-20 1974-08-09 Republican
5 Ford       1974-08-09 1977-01-20 Republican
6 Carter     1977-01-20 1981-01-20 Democratic
Code
str(presidential) # to understand the structure and types of the dataset
tibble [12 × 4] (S3: tbl_df/tbl/data.frame)
 $ name : chr [1:12] "Eisenhower" "Kennedy" "Johnson" "Nixon" ...
 $ start: Date[1:12], format: "1953-01-20" "1961-01-20" ...
 $ end  : Date[1:12], format: "1961-01-20" "1963-11-22" ...
 $ party: chr [1:12] "Republican" "Democratic" "Democratic" "Republican" ...

To create a duration column by calculating the difference in days

Code
presidential$start <- as.Date(presidential$start)
presidential$end <- as.Date(presidential$end)
presidential$duration <- as.numeric(presidential$end - presidential$start) 
head(presidential)
# A tibble: 6 × 5
  name       start      end        party      duration
  <chr>      <date>     <date>     <chr>         <dbl>
1 Eisenhower 1953-01-20 1961-01-20 Republican     2922
2 Kennedy    1961-01-20 1963-11-22 Democratic     1036
3 Johnson    1963-11-22 1969-01-20 Democratic     1886
4 Nixon      1969-01-20 1974-08-09 Republican     2027
5 Ford       1974-08-09 1977-01-20 Republican      895
6 Carter     1977-01-20 1981-01-20 Democratic     1461
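
With the duration column in place, questions such as "who served longest in this data?" become one-liners; a sketch:

Code
presidential[which.max(presidential$duration), c("name", "duration")] # row with the most days in office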

Exercises (economics dataset)

To load data

Code
library(ggplot2) # the economics dataset ships with ggplot2, not datasets
data("economics") 
head(economics)
# A tibble: 6 × 6
  date         pce    pop psavert uempmed unemploy
  <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
1 1967-07-01  507. 198712    12.6     4.5     2944
2 1967-08-01  510. 198911    12.6     4.7     2945
3 1967-09-01  516. 199113    11.9     4.6     2958
4 1967-10-01  512. 199311    12.9     4.9     3143
5 1967-11-01  517. 199498    12.8     4.7     3066
6 1967-12-01  525. 199657    11.8     4.8     3018

To calculate the % of the population that is unemployed

Code
economics$perc.unemploy <- (economics$unemploy / economics$pop) * 100 ##multiplying by 100 turns the ratio into a %.
head(economics)
# A tibble: 6 × 7
  date         pce    pop psavert uempmed unemploy perc.unemploy
  <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>         <dbl>
1 1967-07-01  507. 198712    12.6     4.5     2944          1.48
2 1967-08-01  510. 198911    12.6     4.7     2945          1.48
3 1967-09-01  516. 199113    11.9     4.6     2958          1.49
4 1967-10-01  512. 199311    12.9     4.9     3143          1.58
5 1967-11-01  517. 199498    12.8     4.7     3066          1.54
6 1967-12-01  525. 199657    11.8     4.8     3018          1.51
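
The new column can also be plotted over time to see how the unemployed share of the population has moved; a minimal sketch using ggplot2:

Code
ggplot(economics, aes(x = date, y = perc.unemploy)) +
  geom_line() +
  labs(x = "Year", y = "% of population unemployed")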

Exercises (TX housing dataset)

To load the dataset

Code
library(ggplot2)
data("txhousing") 
head(txhousing)
# A tibble: 6 × 9
  city     year month sales   volume median listings inventory  date
  <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.

To create success rate

Code
txhousing$successrate <- (txhousing$sales / txhousing$listings) * 100 
head(txhousing)
# A tibble: 6 × 10
  city     year month sales   volume median listings inventory  date successrate
  <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>       <dbl>
1 Abilene  2000     1    72  5380000  71400      701       6.3 2000         10.3
2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.        13.1
3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.        16.6
4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.        12.5
5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.        17.8
6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.        20  

To calculate the % of listings that do not sell, out of the total listings available.

Code
txhousing$failrate <- ((txhousing$listings - txhousing$sales) / txhousing$listings) * 100 
head(txhousing)
# A tibble: 6 × 11
  city     year month sales   volume median listings inventory  date successrate
  <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>       <dbl>
1 Abilene  2000     1    72  5380000  71400      701       6.3 2000         10.3
2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.        13.1
3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.        16.6
4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.        12.5
5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.        17.8
6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.        20  
# ℹ 1 more variable: failrate <dbl>
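
Since failrate is defined as the complement of successrate, the two columns should sum to 100 wherever listings data is present; a quick check (a sketch):

Code
range(txhousing$successrate + txhousing$failrate, na.rm = TRUE) # both endpoints should be 100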

Chapter 6.6.1: Exercises

Problem A

Code
midwest %>% # the pipe takes the output of one function and uses it as input for the next 
  group_by(state) %>% # groups the data by the state column 
  summarize(poptotalmean = mean(poptotal), # calculates summary stats for each group, starting with the mean of the population total 
            poptotalmed = median(poptotal), # calculates the median of the population total 
            popmax = max(poptotal), # finds the maximum value of the population total 
            popmin = min(poptotal), # finds the minimum value of the population total
            popdistinct = n_distinct(poptotal), # counts the number of distinct values in the population total column 
            popfirst = first(poptotal), # retrieves the first value of the population total column for each state 
            popany = any(poptotal < 5000), # checks if any values in the population total column are less than 5000 for each state 
            popany2 = any(poptotal > 2000000)) %>% # checks for any values in the population total column greater than 2 million 
  ungroup() # removes the grouping structure from the data frame. Useful if you plan to perform operations that should not be done by group
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem B

Code
midwest %>% 
  group_by(state) %>% 
  summarize(num5k = sum(poptotal < 5000), # number of counties in each state where the population is less than 5000 
            num2mil = sum(poptotal > 2000000), # number of counties where the population is greater than 2 million 
            numrows = n()) %>% # total number of counties in each state group 
  ungroup()
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72
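
The counting trick in num5k works because sum() coerces logicals to 0/1; a standalone illustration with made-up values:

Code
pops <- c(3000, 4800, 5200) # hypothetical county populations
pops < 5000      # TRUE TRUE FALSE
sum(pops < 5000) # 2, because TRUE coerces to 1 and FALSE to 0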

Problem C

Code
midwest %>% 
  group_by(county) %>% # groups data by the county column 
  summarize(x = n_distinct(state)) %>% # calculates the number of distinct states associated with each county name 
  arrange(desc(x)) %>% # sorts summarised data in descending order based on values in column x 
  ungroup() 
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows
Code
# part II 
# How does n() differ from n_distinct()? = n() counts the total number of observations (rows) in the current group, whereas n_distinct() counts the number of unique (distinct) values in the specified column within the current group 

# When would they be the same? different? = Same when every row in the group corresponds to a distinct value in the column being counted; different when there are multiple records for the same value in the specified column
midwest %>% 
  group_by(county) %>% 
  summarize(x = n()) %>% # calculates total number of rows or observations for each county and stores in x 
  ungroup() 
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows
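
A tiny contrast of n() and n_distinct() on made-up data, matching the explanation in the comments above (assuming the tibble package is loaded via tidyverse); a sketch:

Code
tibble(county = c("ADAMS", "ADAMS", "ALCONA")) %>% 
  summarize(rows = n(), # 3: total observations
            uniques = n_distinct(county)) # 2: distinct county names
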
Code
# part III # hint: 
# - How many distinctly different counties are there for each county? = if you group by county, you'll get 1 row per county name, showing the total count of records associated with that name 

# - Can there be more than 1 (county) county in each county? = no, but the same county name can appear in several states, so there can be multiple records for that name 

# - What if we replace 'county' with 'state'? = the output will show the total number of records for each state. 

midwest %>% 
  group_by(county) %>% 
  summarize(x = n_distinct(county)) %>% # calculates the number of distinct county names within each county group, which is always 1 
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows

Problem D

Code
diamonds %>% 
  group_by(clarity) %>% # groups dataset by clarity column 
  summarize(a = n_distinct(color), # creates summary statistics for each group; counts the number of distinct colours within each clarity group 
            b = n_distinct(price), # counts the number of distinct prices within each clarity group 
            c = n()) %>% # counts total number of diamonds within each clarity group 
  ungroup()
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E

Code
# part I
diamonds %>% 
  group_by(color, cut) %>% # groups dataset by color and cut 
  summarize(m = mean(price), # calculates mean price of diamonds for each combination of color and cut 
            s = sd(price), # calculates standard deviation of price for each combination of color and cut 
            .groups = 'drop') %>% 
  ungroup() # redundant here: .groups = 'drop' has already removed the grouping
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows
Code
# part II
diamonds %>% 
  group_by(cut, color) %>% # operations are identical 
  summarize(m = mean(price), 
            s = sd(price), 
            .groups = 'drop') %>% 
  ungroup() 
# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows
Code
# part III
# hint: 
# - How good is the sale if the price of diamonds equaled msale? 
# - If msale represents a price that is 20% off mean price of diamonds in each group, you could evaluate how the sale compares to the original mean price 
# - e.g. The diamonds are x% off the original price in msale. 
diamonds %>% 
  group_by(cut, color, clarity) %>% # groups dataset by the columns 
  summarize(m = mean(price), 
            s = sd(price), 
            msale = m * 0.80, # calculates sale price that is 20% off mean price 
            .groups = 'drop')
# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

Problem F

Code
diamonds %>% 
  group_by(cut) %>% 
  summarize(potato = mean(depth),# calculates mean depth of diamonds for each cut category 
            pizza = mean(price),# calculates mean price of diamonds for each cut category 
            popcorn = median(y), # calculates the median of the column y (diamond width in mm) for each cut 
            pineapple = potato - pizza,# calculates difference between mean depth and mean price 
            papaya = pineapple ^ 2,# squares value of pineapple 
            peach = n()) %>% # counts total number of diamonds in each cut category 
  ungroup()
# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G

Code
# part I 
diamonds %>% 
  group_by(color) %>% # groups dataset by colour variable 
  summarize(m = mean(price)) %>% # calculates mean price of diamonds for each colour group 
  mutate(x1 = str_c("Diamond color ", color), x2 = 5) %>% # adds new columns to the summarised dataset: x1 concatenates the string "Diamond color " with the actual colour value; x2 assigns the value 5 to every row 
  ungroup() 
# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5
Code
# part II # What does the first ungroup() do? Is it useful here? Why/why not? = it removes the grouping structure after summarisation. It's useful if you plan to perform additional operations on the resulting data frame that should not be affected by the previous grouping 

# Why isn't there a closing ungroup() after the mutate()? = since the first ungroup() already ensured that subsequent operations do not treat the data as grouped, there is no need for another ungroup() after mutate(). 

diamonds %>% 
  group_by(color) %>% 
  summarize(m = mean(price)) %>% 
  ungroup() %>% 
  # ungroup before mutate: (1) summarise mean price by colour, (2) ungrouping summarised data, (3) adding new columns (x1 & x2) 
  mutate(x1 = str_c("Diamond color ", color), x2 = 5)
# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5

Problem H

Code
# part I 
diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% # creates new column (x1) that contains half the value of the price for each diamond 
  summarize(m = mean(x1)) %>% # calculates mean of new column for each colour group stored in a new column named 'm' 
  ungroup() 
# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.
Code
# part II # What's the difference between part I and II? = (1) the position of ungroup(); (2) the effect on the results: in part I the mean of x1 is computed within each colour group, whereas here the grouping is removed first, so a single overall mean is returned 

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% 
  ungroup() %>% # removed grouping structure before summarising 
  summarize(m = mean(x1))
# A tibble: 1 × 1
      m
  <dbl>
1 1966.

Why is grouping data necessary? - crucial for performing calculations and analyses within subsets of the data

Why is ungrouping data necessary? - necessary to ensure that further operations do not unintentionally respect previous groupings

When should you ungroup data? - after summary operations, or before applying transformations that should not be group-specific

If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? - if there's no grouping in the code, ungroup() is not required
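
One way to see the difference in practice: the same mutate() call yields group-wise means while a grouping is active, and a single overall mean once it is removed. A sketch:

Code
diamonds %>% 
  group_by(color) %>% 
  mutate(avg = mean(price)) %>% # avg = mean price within each colour group
  ungroup() %>% 
  mutate(avg_all = mean(price)) # avg_all = one overall mean, repeated on every row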

Chapter 6.7: Extra Practice

Code
diamonds_sorted_low_to_high <- diamonds %>% # sort diamonds by price (low to high)
  arrange(price) 
diamonds_sorted_high_to_low <- diamonds %>% # sort diamonds by price (high to low)
  arrange(desc(price)) 
diamonds_sorted_low_price_cut <- diamonds %>% # sort diamonds by price (low to high, then by cut)
  arrange(price, cut) 
diamonds_sorted_high_price_cut <- diamonds %>% # sort diamonds by price (high to low, then by cut)
  arrange(desc(price), cut)
diamonds_sorted_price_clarity <- diamonds %>% # sort diamonds by price and then by clarity (high to low)
  arrange(price, desc(clarity))
diamonds_with_sale_price <- diamonds %>% # create a new column for sale price
  mutate(salePrice = price - 250) 
diamonds_selected <- diamonds %>% # select specific columns, removing x, y, and z
  select(-x, -y, -z)
diamonds_count_by_cut <- diamonds %>% # count diamonds by cut
  group_by(cut) %>%
  summarise(count = n()) 
diamonds_with_total_num <- diamonds %>% #add total number of diamonds as a new column
  mutate(totalNum = n())

Exercise for research methods:

Good Question- “How does the price of diamonds change by the quality of the cut?”

“What impact does clarity have on the price of diamonds?”

Bad Question- “What is the average price of a diamond?”

“Which diamond is the cheapest?”

Data Exploration: Cricket Dataset


To load and run data:

Code
options(repos = c(CRAN = "https://cloud.r-project.org/"))
library(tidyverse)
install.packages("modeldata")

Code
library(modeldata)

Attaching package: 'modeldata'
The following object is masked _by_ '.GlobalEnv':

    penguins
The following object is masked from 'package:palmerpenguins':

    penguins
Code
?ggplot

?crickets
View(crickets)

The basics:

Code
ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

To add colour to visualise differences between species:

Code
ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2")

Modifying basic properties of the plot:

Code
ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point(color = "red",
             size = 2,
             alpha = .3,
             shape = "square") +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")

Code
# Learn more about the options for geom_point()
# with ?geom_point

Adding another layer:

Code
ggplot(crickets, aes(x = temp, 
                     y = rate)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)")
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot(crickets, aes(x = temp, 
                     y = rate,
                     color = species)) + 
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  labs(x = "Temperature",
       y = "Chirp rate",
       color = "Species",
       title = "Cricket chirps",
       caption = "Source: McDonald (2009)") +
  scale_color_brewer(palette = "Dark2") 
`geom_smooth()` using formula = 'y ~ x'

Other plots:

Code
ggplot(crickets, aes(x = rate)) + 
  geom_histogram(bins = 15) # one quantitative variable

Code
ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15)

Code
ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black",
           fill = "lightblue")

Code
ggplot(crickets, aes(x = species, 
                     fill = species)) + 
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Dark2")

Code
ggplot(crickets, aes(x = species, 
                     y = rate,
                     color = species)) + 
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

Code
?theme_minimal()

Faceting (a technique used to create multiple subplots within a single plot based on a categorical variable).

Code
# not great:
ggplot(crickets, aes(x = rate, 
                     fill = species)) + 
  geom_histogram(bins = 15) +
  scale_fill_brewer(palette = "Dark2") ## assigns colours from the Dark2 ColorBrewer palette

Code
ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species) + ## creates a grid of plots, one per species. Useful with a moderate number of levels.
  scale_fill_brewer(palette = "Dark2")

Code
?facet_wrap

ggplot(crickets, aes(x = rate,
                     fill = species)) + 
  geom_histogram(bins = 15,
                 show.legend = FALSE) + 
  facet_wrap(~species,
             ncol = 1) +
  scale_fill_brewer(palette = "Dark2") + 
  theme_minimal()

Research Methods: ‘what is a good research hypothesis?’

A good research hypothesis is a specific, concise and testable statement that establishes an assumption which can either be supported or falsified through data collection and statistical analysis. It serves as a foundation for conducting research, providing empirical evidence that guides further research and deepens understanding of the subject being investigated.

Week 5: Formative

Code
data("iris")

Graph 1:

Code
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Boxplot of Sepal Length by Species",
        xlab = "Species",
        ylab = "Sepal Length (cm)",
        col= c("lightblue", "lightgreen", "lightpink"))

  • Boxplots are used for visualising distribution of data and can be associated with several statistical tests: t-tests, ANOVA (post hoc), Kruskal-Wallis test, Mann-Whitney U test and Chi-Square tests.

Graph 2:

Code
library(ggplot2)
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.5) + # plot with transparency
  labs(title = "Density plot of petal length",
       x = "Petal length (cm)",
       y = "Density") +
  theme_minimal()

  • Frequency graphs are useful for showing the shape or distribution of data and are associated with the following tests: Shapiro-Wilk test, Kolmogorov-Smirnov test, Anderson-Darling test, Chi-square test, t-tests, ANOVA (post hocs), Kruskal-Wallis test and Mann-Whitney U test.

Graph 3:

Code
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point(size = 3, alpha = 0.7) + ## adjust size and transparency
  geom_smooth(method = "lm", se = FALSE, color = "black") + ## line of best fit
  labs(title = "Scatterplot of Petal Length vs. Petal Width",
       x = "Petal Length (cm)",
       y = "Petal Width (cm)") +
  theme_minimal() +
  scale_colour_brewer(palette = "Set1")
`geom_smooth()` using formula = 'y ~ x'

  • Scatterplots show possible relationships between 2 variables, validate assumptions and identify outliers. Associated tests: correlation coefficient (Pearson's r), Spearman's rank correlation test, linear regression, ANOVA, Chi-square test, residual analysis and Levene's test (HOV).

Graph 4:

Code
library(dplyr)
library(ggplot2)

# Create the size variable
iris <- iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "Small", "Big"))

head(iris)  # should show the new 'size' column
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  size
1          5.1         3.5          1.4         0.2  setosa Small
2          4.9         3.0          1.4         0.2  setosa Small
3          4.7         3.2          1.3         0.2  setosa Small
4          4.6         3.1          1.5         0.2  setosa Small
5          5.0         3.6          1.4         0.2  setosa Small
6          5.4         3.9          1.7         0.4  setosa Small
Code
iris_summary <- iris %>%
  group_by(Species, size) %>%
  summarise(count = n(), .groups = "drop")

head(iris_summary)  #  should include Species, size, and count
# A tibble: 6 × 3
  Species    size  count
  <fct>      <chr> <int>
1 setosa     Big       1
2 setosa     Small    49
3 versicolor Big      29
4 versicolor Small    21
5 virginica  Big      47
6 virginica  Small     3
Code
ggplot(iris_summary, aes(x = Species, y = count, fill = size)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Species", y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c("Small" = "lightblue", "Big" = "salmon"))  

  • Bar graphs are used to display the frequency count of values for a categorical variable and are associated with the following tests: Chi-squared test, t-tests, ANOVA (post hoc), Kruskal-Wallis test, Proportion test and Fisher's Exact test.

Week 6: Formative

Code
# Import the data
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
head(housetasks)
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Code
install.packages("gplots")

Code
library(gplots)

Attaching package: 'gplots'
The following object is masked from 'package:stats':

    lowess
Code
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

Code
library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
           main = "housetasks")

Code
## The argument shade is used to color the graph. 
# The argument las = 2 produces vertical labels
# Blue color indicates that the observed value is higher than the expected value if the data were random
# Red color specifies that the observed value is lower than the expected value if the data were random
Code
#There is another package named vcd, which can be used to make a mosaic plot (function mosaic()) or an association plot (function assoc()).

install.packages("vcd")

Code
library("vcd")
Loading required package: grid
Code
# plot just a subset of the table
assoc(head(dt, 5), shade = TRUE, las=3)

Code
# Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.
# This calculated Chi-square statistic is compared to the critical value (obtained from statistical tables) with df=(r−1)(c−1) degrees of freedom and p = 0.05.
# If the calculated Chi-square statistic is greater than the critical value, then we must conclude that the row and the column variables are not independent of each other. This implies that they are significantly associated.
chisq <- chisq.test(housetasks)
chisq

    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
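
To make the comparison with the critical value concrete: housetasks has 13 rows and 4 columns, so df = (13 - 1)(4 - 1) = 36, and the 5% critical value can be looked up with qchisq(); a sketch:

Code
qchisq(0.95, df = 36) # critical value at p = 0.05, roughly 51
# the observed statistic, 1944.5, is far larger, so independence is rejected
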
Code
# Observed counts
chisq$observed
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153
Code
# Expected counts
round(chisq$expected,2)
            Wife Alternating Husband Jointly
Laundry    60.55       25.63   38.45   51.37
Main_meal  52.64       22.28   33.42   44.65
Dinner     37.16       15.73   23.59   31.52
Breakfeast 48.17       20.39   30.58   40.86
Tidying    41.97       17.77   26.65   35.61
Dishes     38.88       16.46   24.69   32.98
Shopping   41.28       17.48   26.22   35.02
Official   33.03       13.98   20.97   28.02
Driving    47.82       20.24   30.37   40.57
Finances   38.88       16.46   24.69   32.98
Insurance  47.82       20.24   30.37   40.57
Repairs    56.77       24.03   36.05   48.16
Holidays   55.05       23.30   34.95   46.70

To know which cells contribute most to the total Chi-square score, we have to calculate the Chi-square statistic for each cell: the Pearson residual, r = (observed - expected) / sqrt(expected).

Pearson residuals can be extracted from the output of the function chisq.test()

Code
round(chisq$residuals, 3)
             Wife Alternating Husband Jointly
Laundry    12.266      -2.298  -5.878  -6.609
Main_meal   9.836      -0.484  -4.917  -6.084
Dinner      6.537      -1.192  -3.416  -3.299
Breakfeast  4.875       3.457  -2.818  -5.297
Tidying     1.702      -1.606  -4.969   3.585
Dishes     -1.103       1.859  -4.163   3.486
Shopping   -1.289       1.321  -3.362   3.376
Official   -3.659       8.563   0.443  -2.459
Driving    -5.469       6.836   8.100  -5.898
Finances   -4.150      -0.852  -0.742   5.750
Insurance  -5.758      -4.277   4.107   5.720
Repairs    -7.534      -4.290  20.646  -6.651
Holidays   -7.419      -4.620  -4.897  15.556
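
As a check on the formula, the residuals can be recomputed by hand from the observed and expected counts; a sketch:

Code
manual_resid <- (chisq$observed - chisq$expected) / sqrt(chisq$expected)
all.equal(manual_resid, chisq$residuals) # should be TRUE
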
Code
# To visualise
install.packages("corrplot")

Code
library(corrplot)
corrplot 0.95 loaded
Code
corrplot(chisq$residuals, is.cor = FALSE)

  1. Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables.
  • In the image above, it’s evident that there is an association between the column Wife and the rows Laundry and Main_meal.

  • There is a strong positive association between the column Husband and the row Repairs.

  2. Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables. For example, the column Wife is negatively associated (~ “not associated”) with the row Repairs. There is a repulsion between the column Husband and the rows Laundry and Main_meal.
Code
# Contribution in percentage (%)
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
            Wife Alternating Husband Jointly
Laundry    7.738       0.272   1.777   2.246
Main_meal  4.976       0.012   1.243   1.903
Dinner     2.197       0.073   0.600   0.560
Breakfeast 1.222       0.615   0.408   1.443
Tidying    0.149       0.133   1.270   0.661
Dishes     0.063       0.178   0.891   0.625
Shopping   0.085       0.090   0.581   0.586
Official   0.688       3.771   0.010   0.311
Driving    1.538       2.403   3.374   1.789
Finances   0.886       0.037   0.028   1.700
Insurance  1.705       0.941   0.868   1.683
Repairs    2.919       0.947  21.921   2.275
Holidays   2.831       1.098   1.233  12.445
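
Because the Chi-square statistic equals the sum of the squared residuals, these cell contributions sum to 100%; a quick check (a sketch):

Code
sum(contrib) # 100, since chisq$statistic == sum(chisq$residuals^2)
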
Code
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)

Code
# printing the p-value (so small here that it underflows to 0 in double precision)
chisq$p.value
[1] 0
Code
# chisq$estimate: Pearson's Chi-squared test returns no estimate component, so this is NULL
chisq$estimate
NULL