Online Workbook

Author

Erica Bass, N1343960

Session 1 - Setting up Workbook

library(tidyverse) # this loads up the previously installed package
library(palmerpenguins) # this loads up the package containing penguin data
tibble(penguins) # this produces a data table in the console
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
  • Next I will produce a bar chart of penguin species
ggplot(penguins, aes(x = species, fill = species)) +
  geom_bar()
Figure 1: A bar chart of penguin species
  • This information can also be found by summarising the data
penguins %>%
  group_by(species) %>% # this groups the penguin data by species
  summarise(n = n()) %>% # this summarises the species data by number
  ungroup() # this ungroups the data so that it is not permanently altered
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Session 3 - Data Wrangling

Session 6.6.1

R for Graduate Students

Question 1

Problem A

midwest %>% # tells R to use this data set
  group_by(state) %>% # groups data by state variable
  summarise(poptotalmean = mean(poptotal), # displays mean populations
            poptotalmed = median(poptotal), # displays median of populations
            popmax = max(poptotal), # displays maximum population within state
            popmin = min(poptotal), # displays minimum population within state
            popdistinct = n_distinct(poptotal), # displays number of distinct values
            popfirst = first(poptotal), #displays the first value for poptotal in each state
            popany = any(poptotal < 5000), #states if there are any populations within each state that are less than 5000 people.
            popany2 = any(poptotal > 2000000)) %>% # states if there are any populations within each state that are greater than 2000000 people.
  ungroup() # final ungrouping of data
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem B

midwest %>% # tells R to use this dataset
  group_by(state) %>% # groups data by state variable
  summarise(num5k = sum(poptotal < 5000), # shows total number of counties within state that have population less than 5000 people.
            num2mil = sum(poptotal > 2000000), # shows total number of counties within state that have population of more than 2mil people.
            numrows = n()) %>% # shows total number of counties (rows) per state
  ungroup() # final ungrouping of data
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72

Problem C

Part 1

midwest %>% # tells R to use this data set
  group_by(county) %>% # groups data by county variable
  summarise(x = n_distinct(state)) %>% # shows how many states contain counties of each name, under column 'x'
  arrange(desc(x)) %>% # arranges column 'x' in descending order
  ungroup() # final ungrouping of data
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows

Part 2

  • n() differs from n_distinct() as n() gives the total number of values within a variable, whereas n_distinct() gives the number of values that are different from each other.

  • n() and n_distinct() might be the same if every value in the variable is different, e.g. a row number. They may be different if there are only a finite number of possible values within a variable, e.g. customer satisfaction ratings on a scale of 1-10.

midwest %>% # tells R to use this data set
  group_by(county) %>% # groups data by county variable
  summarise(x = n ()) %>% # summarises data by total number of each county
  ungroup() # final ungrouping of data
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

Part 3

midwest %>% # tells R to use this data set
  group_by(county) %>% # groups data by county variable
  summarise(n_distinct(county)) %>% # summarises data by number of distinct counties per county
  ungroup() # final ungrouping of data
# A tibble: 320 × 2
   county    `n_distinct(county)`
   <chr>                    <int>
 1 ADAMS                        1
 2 ALCONA                       1
 3 ALEXANDER                    1
 4 ALGER                        1
 5 ALLEGAN                      1
 6 ALLEN                        1
 7 ALPENA                       1
 8 ANTRIM                       1
 9 ARENAC                       1
10 ASHLAND                      1
# ℹ 310 more rows
  • There is only one distinct county for each county, and there cannot be more than one, so this information is useless.
midwest %>% # tells R to use this data set
  group_by(state) %>% # groups data by state variable
  summarise(n_distinct(state)) %>% # summarises data by number of distinct states per state
  ungroup() # final ungrouping of data
# A tibble: 5 × 2
  state `n_distinct(state)`
  <chr>               <int>
1 IL                      1
2 IN                      1
3 MI                      1
4 OH                      1
5 WI                      1
  • Running the same code but using ‘state’ instead of ‘county’ is also useless as there cannot be more than one of each.

Problem D

diamonds %>% # tells R to use this data set
  group_by(clarity) %>% # groups data by clarity variable
  summarize(a = n_distinct(color), # a = number of unique colours per clarity group
            b = n_distinct(price), # b = number of unique prices per clarity group
            c = n()) %>% # c = total number of diamonds per clarity group
  ungroup()
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem E

Part 1

diamonds %>% # tells R to use this data set
  group_by(color, cut) %>% # groups data by color and cut variables
  summarise(m = mean(price), # column m = mean price of each color/cut
            s = sd (price)) %>% # column s = standard deviation of each mean price in column m
  ungroup() # final ungrouping of data
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

Part 2

diamonds %>% # tells R to use this data set
  group_by(cut, color) %>% # groups data by cut and color variables
  summarise(m = mean(price), # column m = mean price of each cut/ color
            s = sd (price)) %>% # column s = standard deviation of each mean price in column m
  ungroup() # final ungrouping of data
# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows
  • Observation: it doesn’t matter which variable comes first or last in the group_by() function; the same summary values are produced, only the ordering of the rows in the output differs.

Part 3

diamonds %>% # tells R to use this data set
  group_by(cut, color, clarity) %>% # groups data by cut, color and clarity
  summarize(m = mean(price), # column m = mean price of each grouped variable
            s = sd(price), # column s = standard deviations of column m
            msale = m * 0.80) %>% # column msale = 20% off each mean price
  ungroup() # final ungrouping of data
# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

Problem F

diamonds %>% # tells R to use this data set
  group_by(cut) %>% # groups data by cut variable
  summarise(potato = mean(depth), # column 'potato' = mean depth for each cut
            pizza = mean(price), # column 'pizza' = mean price for each cut
            popcorn = median(y), # column 'popcorn' = median of variable 'y' for each cut
            pineapple = potato - pizza, # column 'pineapple' = difference between mean depth and mean price for each cut
            papaya = pineapple ^ 2, # column 'papaya' = pineapple squared
            peach = n()) %>% # column 'peach' = total number of values per cut
  ungroup() # final ungrouping of data
# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem G

Part 1

diamonds %>% # tells R to use this data set
  group_by(color) %>% # groups data by color variable
  summarise(m = mean(price)) %>% # column 'm' = mean price for each color
  mutate(x1 = str_c("Diamond Color", color), # adds new variable 'x1' which strings together the words 'diamond color' and the already categorised colors in the data
         x2 = 5) %>% # adds new variable 'x2' where every value is 5
  ungroup() # final ungrouping of data
# A tibble: 7 × 4
  color     m x1                x2
  <ord> <dbl> <chr>          <dbl>
1 D     3170. Diamond ColorD     5
2 E     3077. Diamond ColorE     5
3 F     3725. Diamond ColorF     5
4 G     3999. Diamond ColorG     5
5 H     4487. Diamond ColorH     5
6 I     5092. Diamond ColorI     5
7 J     5324. Diamond ColorJ     5

Part 2

diamonds %>% # tells R to use this data set
  group_by(color) %>% # groups data by color variable
  summarize(m = mean(price)) %>%  # column 'm' = mean price for each color
  ungroup() %>% # ungroups data
  mutate(x1 = str_c("Diamond color ", color), # adds new variable 'x1' which strings together the words 'diamond color' and the already categorised colors in the data
         x2 = 5) # adds new variable 'x2' where every value is 5
# A tibble: 7 × 4
  color     m x1                 x2
  <ord> <dbl> <chr>           <dbl>
1 D     3170. Diamond color D     5
2 E     3077. Diamond color E     5
3 F     3725. Diamond color F     5
4 G     3999. Diamond color G     5
5 H     4487. Diamond color H     5
6 I     5092. Diamond color I     5
7 J     5324. Diamond color J     5
  • In the example above, the first ungroup() makes sure that the data is ungrouped after the mean prices are calculated. It is not particularly useful here, as the mutate() that follows works row by row and does not depend on the grouping.

  • There isn’t a closing ungroup() after the mutate() as the data has already been ungrouped by the previous command.

Problem H

Part 1

diamonds %>% # tells R to use this data set
  group_by(color) %>% # groups data by color variable
  mutate(x1 = price * 0.5) %>% # column 'x1' = price for each color, halved
  summarize(m = mean(x1)) %>% # column 'm' = means of each color from column 'x1'
  ungroup() # final ungrouping of data
# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

Part 2

diamonds %>% # tells R to use this data set
  group_by(color) %>% # groups data by color variable
  mutate(x1 = price * 0.5) %>% # column 'x1' = price for each color, halved
  ungroup() %>% # ungrouping of data
  summarize(m = mean(x1)) # column 'm' = mean of column 'x1'
# A tibble: 1 × 1
      m
  <dbl>
1 1966.
  • This tibble differs from the one above as the data is ungrouped before it is summarised. Therefore in this example, a mean is taken from all of the values in column ‘x1’ rather than a mean of each color group.

Question 2

Grouping data is necessary so that trends can be seen more easily, as it allows for comparison within variables.

Question 3

Ungrouping data is necessary so that the grouping does not carry over and affect later steps in the analysis.

Question 4

You should always ungroup data every time you finish your analysis on the grouped data.

Question 5

If the code does not contain group_by(), then you do not need to ungroup() at the end.

Session 6.7 “Extra Practice”

Question 1

View all of the variable names in diamonds:

diamonds %>% 
  str() 
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Question 2

Arrange the diamonds by lowest to highest price:

diamonds %>%
  arrange(price)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Arrange the diamonds by highest to lowest price:

diamonds %>%
  arrange(desc(price))
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

Arrange the diamonds by lowest price and cut:

diamonds %>%
  arrange(price) %>%
  arrange(cut)
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows

Arrange the diamonds by highest price and cut:

diamonds %>%
  arrange(desc(price)) %>%
  arrange(desc(cut))
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.51 Ideal G     IF       61.7  55   18806  7.37  7.41  4.56
 2  2.07 Ideal G     SI2      62.5  55   18804  8.2   8.13  5.11
 3  2.15 Ideal G     SI2      62.6  54   18791  8.29  8.35  5.21
 4  2.05 Ideal G     SI1      61.9  57   18787  8.1   8.16  5.03
 5  1.6  Ideal F     VS1      62    56   18780  7.47  7.52  4.65
 6  2.06 Ideal I     VS2      62.2  55   18779  8.15  8.19  5.08
 7  1.71 Ideal G     VVS2     62.1  55   18768  7.66  7.63  4.75
 8  2.08 Ideal H     SI1      58.7  60   18760  8.36  8.4   4.92
 9  2.03 Ideal G     SI1      60    55.8 18757  8.17  8.3   4.95
10  2.61 Ideal I     SI2      62.1  56   18756  8.85  8.73  5.46
# ℹ 53,930 more rows

Question 3

Arrange the diamonds by lowest to highest price and worst to best clarity:

diamonds %>%
  arrange(price) %>%
  arrange(clarity)
# A tibble: 53,940 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 2  0.32 Good    D     I1       64      54   361  4.33  4.36  2.78
 3  0.31 Premium F     I1       62.9    59   394  4.33  4.29  2.71
 4  0.34 Ideal   D     I1       62.5    57   413  4.47  4.49  2.8 
 5  0.34 Ideal   D     I1       61.4    55   413  4.5   4.52  2.77
 6  0.32 Premium E     I1       60.9    58   444  4.42  4.38  2.68
 7  0.43 Premium H     I1       62      59   452  4.78  4.83  2.98
 8  0.41 Good    G     I1       63.8    56   467  4.7   4.74  3.01
 9  0.32 Good    D     I1       64      54   468  4.36  4.33  2.78
10  0.32 Ideal   E     I1       60.7    57   490  4.45  4.41  2.69
# ℹ 53,930 more rows

Question 4

Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond:

diamonds %>%
  mutate(salePrice = price - 250)
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z salePrice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows

Question 5

Remove the x, y, and z variables from the diamonds dataset:

diamonds %>%
  select(-x,-y,-z)
# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

Question 6

Determine the number of diamonds there are for each cut value:

diamonds %>%
  group_by(cut) %>%
  summarise(n()) %>%
  ungroup()
# A tibble: 5 × 2
  cut       `n()`
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

Question 7

Create a new column named totalNum that calculates the total number of diamonds:

diamonds %>%
  mutate(row = row_number()) %>%
  summarise(n_distinct(row))
# A tibble: 1 × 1
  `n_distinct(row)`
              <int>
1             53940

Session 3 - Research Questions

  • A bad research question about the diamonds data set: Do more expensive diamonds have better clarity?

  • A good research question about the diamonds data set: To what extent does the clarity and cut of a diamond affect its final price, and are there any other factors that have greater influence?

Session 4 - Data Exploration

Pre-session exercises

Chapter 2: R Graphics Cookbook

1. Creating a Scatter Plot

Note

Using ggplot2 from the tidyverse package, the ggplot() function tells R to create a plot object, and then geom_point() tells R to add a layer of points.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

2. Creating a Line Graph

Note

Follow the same formula as above, using ggplot2.

ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_line() +
  geom_point() # optional if you would like the points shown on the graph too

3. Creating a Bar Graph

Note

There are various ways to plot a bar graph using ggplot2, depending on your variables. geom_col() is used to plot values, whereas geom_bar() is used to plot counts (n). See below.

  1. For continuous x and y variables:
ggplot(BOD, aes(x = Time, y = demand)) +
  geom_col()

  2. In order to treat the x variable as discrete, convert it to a factor:
ggplot(BOD, aes(x = factor(Time), y = demand)) +
  geom_col()

  3. To plot counts, use the geom_bar() function and only state the x variable:
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

4. Creating a Histogram

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note

When making a histogram, R will tell you that the default number of bins (30) has been used. You can set the bin width yourself, as below.

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 4)

5. Creating a Box Plot

ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_boxplot()

Note

You can also plot multiple variables at once using the interaction function.

ggplot(ToothGrowth, aes(x = interaction(supp, dose), y = len)) +
  geom_boxplot()

6. Plotting a Function Curve

Note

data.frame() is used to set the range of x values. fun = the name of the function to plot, e.g. dnorm, the density of the normal distribution.

ggplot(data.frame(x = c(0,20)), aes(x = x)) +
  stat_function(fun = dnorm, geom = "line")

Post-session exercises

Youtube video: 'Data Visualisation with R in 36 Minutes' Equitable Equations

library(tidyverse) # as is done in the console at the start of each session
library(modeldata) # previously installed via the console; contains the crickets data used in these exercises
  • Producing a simple scatter plot of temperature vs rate, and colouring by species
ggplot(crickets, aes(x = temp, y = rate, colour = species)) +
  geom_point() +
  labs (x = "Temperature", # changes x axis label
        y = "Chirp rate", # changes y axis label
        color = "Species", # changes title of color legend (adding capital S)
        title = "Cricket chirps", # adds title
        caption = "Source: McDonald (2009)") # adds caption

Tip

You can change the colours used in your graph with a function such as scale_color_brewer(palette = "Dark2"). Here scale refers to changing the scale of a certain aesthetic, color refers to the aesthetic you want to change, and brewer refers to the ColorBrewer palettes (supplied by the RColorBrewer package), which include Dark2.
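For example, a minimal sketch applying this to the crickets scatter plot from earlier:

ggplot(crickets, aes(x = temp, y = rate, colour = species)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2") # swaps the default colours for the ColorBrewer Dark2 palette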

  • Making small modifications to the output
ggplot(crickets, aes(x = temp, y = rate)) +
  geom_point(color = "red", # changes colour of all points (colour goes in the aes brackets above if wanting to specify by a variable)
             size = 2, # changes size of points
             alpha = 0.8, # changes transparency of points, value between 0-1
             shape = "square") # changes shape of points

Note

Remember: to call the help window with all the information about a function, use ‘?’, e.g. running ?geom_point will show you all the changes that can be made within this function. You could also call the help with ?labs.

  • Adding a regression line
ggplot(crickets, aes(x = temp, y = rate, colour = species)) +
  geom_point() +
  geom_smooth(method = "lm", # geom_smooth adds a best fit curve. Method "lm" turns it into a linear model
              se = FALSE) # removes the standard error (greyed out area around the regression line)

  • Different ways of plotting a single quantitative or categorical variable
ggplot(crickets, aes(x = rate)) +
  geom_histogram(bins = 15) 

ggplot(crickets, aes(x = rate)) +
  geom_freqpoly(bins = 15)

ggplot(crickets, aes(x = species)) +
  geom_bar(color = "black", # determines bar outline colour
           fill = "lightblue", # determines bar fill colour
           show.legend = FALSE)

Note

As a general rule:

Bar Chart = single categorical variable

Histogram = single quantitative variable

Scatter Plot = two quantitative variables

Grouped Bar Chart = two categorical variables

Box Plot = one quantitative and one categorical variable

  • Producing a box plot

The 5 values represented on a box plot are: Minimum value, First quartile, Median, Third quartile, Maximum value

ggplot(crickets, aes(x = species, y = rate,
                     colour = species)) +
  geom_boxplot(show.legend = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()

Useful to know

In R Studio, go to ‘help > cheat sheets > data visualisation with ggplot2’ for useful information about presenting data.

  • Faceting displays multiple subplots simultaneously, each representing a different subset of the data based on one or more categorical variables.
ggplot(crickets, aes(x = rate, # chirp rate on the x axis
                     fill = species)) + # tells R to colour differentiate by species
  geom_histogram(bins = 15, 
                 show.legend = FALSE) + # removes legend from graph
  facet_wrap(~species, # tells R to facet species data
             ncol = 1) # 1 column means the graphs are stacked on top of each other; nrow = 1 would place them next to each other.

Session 4 - Research Methods

A good research hypothesis should be formed around a gap in knowledge in a particular subject area, and will initially be based on an assumption drawn from prior research. It should be clear and concise, and contain only information that is directly relevant to the research question. Suitable information and/or data should be able to be collected in response to the hypothesis, and any data collected must be appropriate for statistical analysis. Meaningful conclusions need to be able to be drawn from the research, and therefore a good hypothesis must be able to be either proven or disproven.

Session 5 - How to Choose the Correct Analyses

Reproduce the graphs on NOW using the data set ‘iris’ and identify which statistical test you would use.

  1. I would use ANOVA to analyse this data. This is because the predictor variable (species) is categorical and the outcome variable (sepal length) is quantitative. There are more than two groups being tested (there are 3 species), and only one outcome variable per group (sepal length).
ggplot(iris, aes(x = Species, 
                 y = Sepal.Length, 
                 colour = Species)) +
  geom_boxplot()

  2. I would use simple regression to analyse this data. This is because the predictor variable (petal length) is quantitative and the outcome variable (density) is also quantitative. There is also only one predictor variable (petal length).
ggplot(iris, aes(x = Petal.Length, 
                 fill = Species)) +
  geom_density(alpha = 0.3)

  3. I would also use a simple regression to analyse this data. This is because the predictor variable (petal length) is quantitative, and so is the outcome variable (petal width). There is also only one predictor variable (petal length).
ggplot(iris, aes(x = Petal.Length, 
                 y = Petal.Width)) +
  geom_point(aes(colour = Species,
                 shape = Species)) +
  geom_smooth(method = "lm",
              se = TRUE)

  4. I would use ANOVA to analyse this data. This is because the predictor variable (species) is categorical, and the outcome variable (count) is quantitative. More than two groups are being compared (there are 3 species), and there is only one outcome variable (count).
iris.new <- iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length),"small", "big"))

ggplot(iris.new, aes(x = Species,
                     colour = size,
                     fill = size)) +
  geom_bar(position = 'dodge')

Session 6 - Frequency Tests

Extra Reading

Book: Practical Statistics for Field Biology

The chi-squared test encompasses various tests: homogeneity, randomness, association, independence and goodness of fit.

  • The test involves comparing observed frequencies with expected frequencies.
  • Expected frequencies may be calculated from sample data or from a mathematical model.
  • Versions of the chi-squared test that compare frequencies with that of a mathematical model are called goodness of fit tests.
  • All versions of chi-squared test assume that samples are random and observations are independent.
  • Yates’ correction should be applied when there is only one degree of freedom.

Dispersion models can also be compared with frequency distributions, to visualise expected and observed frequencies.

  • Binomial model = regular dispersion
  • Poisson model = random dispersion
  • Negative binomial model = contagious dispersion (aggregated or clumped).

Online Article (Scribbr): Chi-Square (Χ²) Tests | Types, Formula & Examples

Chi-squared test is for categorical data. It is a non-parametric test.

There are 2 types:

  • Goodness of fit = test whether frequency distribution is different to your expectations. Used when you have one categorical variable. Usually the expected frequency is that each category will have equal proportions.

  • Test of independence = test whether two variables are related to each other. Used when you have two categorical variables. If the two variables are independent, the probability of belonging to a certain group of one variable isn’t affected by the other variable.

A larger Χ² value means a bigger difference between expected and observed frequencies.

Contingency tables can be used to show frequency distributions in each combination of groups, when you have two categorical variables.

Eg.

         Hand
Country   Left-handed Right-handed
  America          50          250
  Canada           30          200
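A minimal sketch of running a test of independence on the table above (values copied from the example; because this is a 2x2 table, R's chisq.test() applies Yates' continuity correction by default):

handedness <- matrix(c(50, 250, 30, 200), byrow = TRUE, nrow = 2,
                     dimnames = list(Country = c("America", "Canada"),
                                     Hand = c("Left-handed", "Right-handed")))
chisq.test(handedness) # use correct = FALSE to see the uncorrected statistic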

Pre-session work

Youtube video: Frequency and Contingency Tables with R

library(vcd) # installed via console. Contains the 'Arthritis' data used in this example
tibble(Arthritis)
# A tibble: 84 × 5
      ID Treatment Sex     Age Improved
   <int> <fct>     <fct> <int> <ord>   
 1    57 Treated   Male     27 Some    
 2    46 Treated   Male     29 None    
 3    77 Treated   Male     30 None    
 4    17 Treated   Male     32 Marked  
 5    36 Treated   Male     46 Marked  
 6    23 Treated   Male     58 Marked  
 7    75 Treated   Male     59 None    
 8    39 Treated   Male     59 Marked  
 9    33 Treated   Male     63 None    
10    55 Treated   Male     63 None    
# ℹ 74 more rows

1. Producing a one-way table:

one.way.table <- table(Arthritis$Improved) # produces a table in the global environment, of frequencies within the variable 'improved'
one.way.table # shows simple counts for table created above

  None   Some Marked 
    42     14     28 
prop.table(one.way.table) # shows counts as proportions of total count

     None      Some    Marked 
0.5000000 0.1666667 0.3333333 
prop.table(one.way.table)*100 # shows counts as percentages

    None     Some   Marked 
50.00000 16.66667 33.33333 

2. Producing a two-way table:

two.way.table <- xtabs(~ Treatment + Improved, data = Arthritis) # creates a contingency table from the arthritis data set, using treatment and improved variables
two.way.table # shows simple counts for table created above
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
prop.table(two.way.table) # produces table of proportions, where all cells in the table add up to 1.
         Improved
Treatment       None       Some     Marked
  Placebo 0.34523810 0.08333333 0.08333333
  Treated 0.15476190 0.08333333 0.25000000
Note

You can also use these functions to specify proportions by row or column, rather than a proportion of the whole table.

The addmargins() function allows you to add totals to your table. These could be sums of counts, proportions or percentages.
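A minimal sketch of both, using the two.way.table created above:

prop.table(two.way.table, 1) # proportions within each row (each Treatment group sums to 1)
prop.table(two.way.table, 2) # proportions within each column (each Improved level sums to 1)
addmargins(two.way.table)    # appends row and column totals to the table of counts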

3. Creating a two-way table using CrossTable

You need to install the package “gmodels” for this.

library(gmodels)

CrossTable(Arthritis$Treatment, Arthritis$Improved)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  84 

 
                    | Arthritis$Improved 
Arthritis$Treatment |      None |      Some |    Marked | Row Total | 
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 | 
                    |     2.616 |     0.004 |     3.752 |           | 
                    |     0.674 |     0.163 |     0.163 |     0.512 | 
                    |     0.690 |     0.500 |     0.250 |           | 
                    |     0.345 |     0.083 |     0.083 |           | 
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 | 
                    |     2.744 |     0.004 |     3.935 |           | 
                    |     0.317 |     0.171 |     0.512 |     0.488 | 
                    |     0.310 |     0.500 |     0.750 |           | 
                    |     0.155 |     0.083 |     0.250 |           | 
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 | 
                    |     0.500 |     0.167 |     0.333 |           | 
--------------------|-----------|-----------|-----------|-----------|

 

4. Three-way contingency tables

Note

The function ftable() can be used to create a condensed, easier-to-read contingency table when there are more than two variables.

For example:

three.way.table <- xtabs(~ Treatment+Sex+Improved, data = Arthritis)

ftable(three.way.table)
                 Improved None Some Marked
Treatment Sex                             
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5

Youtube video: Chi squared goodness of fit testing in R - Equitable Equations

Example: 200 rolls of a die produce the following distribution: 1=28, 2=30, 3=22, 4=31, 5=38, 6=51. Is the die unfair?

H0 = the die is fair.

First, enter the counts as a vector.

counts <- c(28,30,22,31,38,51)

The expected frequencies don’t need to be entered in this example, as it is assumed that all numbers have an equal chance of being rolled.

Then perform chi-squared test:

chisq.test(counts)

    Chi-squared test for given probabilities

data:  counts
X-squared = 15.22, df = 5, p-value = 0.009463

A very low p-value means the null hypothesis can be rejected, so the die is not fair.

Degrees of freedom = number of different possible outcomes - 1.

Note

If the expected frequencies are not assumed equal, then you need to input them as another vector, e.g. props (proportions).

Then you would run the test using chisq.test(counts, p = props).
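For instance, a minimal sketch with made-up expected proportions (the props values below are purely illustrative and must sum to 1):

props <- c(0.10, 0.10, 0.10, 0.20, 0.20, 0.30) # hypothetical expected proportions for the six faces
chisq.test(counts, p = props) # counts is the vector entered earlier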

Youtube video: Chi-squared testing for independence in R - Equitable Equations

First create a matrix of your data. This example shows the types of volunteers against number of hours worked per week.

volunteers <- matrix(c(111,96,48,96,133,61,91,150,53), byrow = T, nrow = 3)
rownames(volunteers) <- c("Community College", "Four Year", "Non-student")
colnames(volunteers) <- c("1-3hrs", "4-6hrs", "7-9hrs")

volunteers
                  1-3hrs 4-6hrs 7-9hrs
Community College    111     96     48
Four Year             96    133     61
Non-student           91    150     53

H0 = the variables are independent of each other.

chisq.test(volunteers)

    Pearson's Chi-squared test

data:  volunteers
X-squared = 12.991, df = 4, p-value = 0.01132

Degrees of freedom = (number of rows - 1) * (number of columns - 1).

Post-session work

Online Article: Analytics with R- Contingency Tables

Additional package to tidyverse needed:

library(knitr) # prints html-friendly tables
  • First pick out the variables in the data that you want to analyse:
mpg %>%
  group_by(class, cyl) %>%
  summarise(n=n()) %>%
  kable() # produces a table from the data, using knitr package
class        cyl    n
2seater        8    5
compact        4   32
compact        5    2
compact        6   13
midsize        4   16
midsize        6   23
midsize        8    2
minivan        4    1
minivan        6   10
pickup         4    3
pickup         6   10
pickup         8   20
subcompact     4   21
subcompact     5    2
subcompact     6    7
subcompact     8    5
suv            4    8
suv            6   16
suv            8   38
  • To turn this data into a contingency table, we need to be able to display the data by frequency (counts). Here, we can display ‘class’ as rows and ‘cyl’ as columns.
mpg %>%
  group_by(class, cyl) %>%
  summarise(n = n()) %>%
  spread(cyl, n) %>% # this spread function tells R to create columns for each cyl value, using counts (n) as the crosstab response value
  kable()
class         4    5    6    8
2seater      NA   NA   NA    5
compact      32    2   13   NA
midsize      16   NA   23    2
minivan       1   NA   10   NA
pickup        3   NA   10   20
subcompact   21    2    7    5
suv           8   NA   16   38
  • You can also use the dplyr package to summarise by values other than frequency. Here, the data is summarised by first creating a new column called mean_cty, which calculates the average city miles per gallon for each class and cyl value, then using the spread function to display this as a contingency table.
mpg %>%
  group_by(class, cyl) %>%
  summarise(mean_cty = mean(cty)) %>%
  spread(cyl, mean_cty) %>%
  kable()
class               4    5          6          8
2seater            NA   NA         NA   15.40000
compact      21.37500   21   16.92308         NA
midsize      20.50000   NA   17.78261   16.00000
minivan      18.00000   NA   15.60000         NA
pickup       16.00000   NA   14.50000   11.80000
subcompact   22.85714   20   17.00000   14.80000
suv          18.00000   NA   14.50000   12.13158
  • The table() function can also be used to configure a table quickly. To produce a contingency table of frequencies for vehicle class by number of cylinders:
table(mpg$class,mpg$cyl)
            
              4  5  6  8
  2seater     0  0  0  5
  compact    32  2 13  0
  midsize    16  0 23  2
  minivan     1  0 10  0
  pickup      3  0 10 20
  subcompact 21  2  7  5
  suv         8  0 16 38
  • If we assign the table as an object, we can then do more with it.
mpg_table <- table(mpg$class,mpg$cyl)
  • To create a sum of row frequencies in this table, use the following code with the margin index ‘1’:
margin.table(mpg_table,1)

   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 
  • To create a sum of column frequencies in this table, use the following code with the margin index ‘2’:
margin.table(mpg_table,2)

 4  5  6  8 
81  4 79 70 
  • The following code produces similar output, but presents proportions instead of frequencies.

  • As a proportion of the whole table:

prop.table(mpg_table)
            
                       4           5           6           8
  2seater    0.000000000 0.000000000 0.000000000 0.021367521
  compact    0.136752137 0.008547009 0.055555556 0.000000000
  midsize    0.068376068 0.000000000 0.098290598 0.008547009
  minivan    0.004273504 0.000000000 0.042735043 0.000000000
  pickup     0.012820513 0.000000000 0.042735043 0.085470085
  subcompact 0.089743590 0.008547009 0.029914530 0.021367521
  suv        0.034188034 0.000000000 0.068376068 0.162393162
  • For proportions by row:
prop.table(mpg_table,1)
            
                      4          5          6          8
  2seater    0.00000000 0.00000000 0.00000000 1.00000000
  compact    0.68085106 0.04255319 0.27659574 0.00000000
  midsize    0.39024390 0.00000000 0.56097561 0.04878049
  minivan    0.09090909 0.00000000 0.90909091 0.00000000
  pickup     0.09090909 0.00000000 0.30303030 0.60606061
  subcompact 0.60000000 0.05714286 0.20000000 0.14285714
  suv        0.12903226 0.00000000 0.25806452 0.61290323
  • For proportions by column:
prop.table(mpg_table,2)
            
                      4          5          6          8
  2seater    0.00000000 0.00000000 0.00000000 0.07142857
  compact    0.39506173 0.50000000 0.16455696 0.00000000
  midsize    0.19753086 0.00000000 0.29113924 0.02857143
  minivan    0.01234568 0.00000000 0.12658228 0.00000000
  pickup     0.03703704 0.00000000 0.12658228 0.28571429
  subcompact 0.25925926 0.50000000 0.08860759 0.07142857
  suv        0.09876543 0.00000000 0.20253165 0.54285714
Tip

The following link gives more examples of information that can be drawn from chi-squared testing, as well as producing graphics like balloon and mosaic plots, using additional packages in R. http://www.sthda.com/english/wiki/comparing-proportions-in-r

Session 7 - Mean Tests

Pre-session work

T-Tests

Youtube: t-test and interpreting p values using R programming

Additional libraries installed:

library(patchwork) # additional graphics options
library(gapminder) # contains data used in this video

T-tests always refer to questions about the average. We are asking: is it different? This can mean: is the average different from an assumed average, or is it different from the average of another subset of data?

To test hypotheses, we assume the H0 is correct.

  • Example (single sample t-test) :
gapminder %>%
  filter(continent == "Africa") %>% # shows data for Africa only
  select(lifeExp) %>% # selects data for life expectancy variable only
  t.test(mu = 50) # we enter the assumed mean as it is a single sample t-test

    One Sample t-test

data:  .
t = -3.0976, df = 623, p-value = 0.002038
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
 48.14599 49.58467
sample estimates:
mean of x 
 48.86533 

The p-value is smaller than 0.05, so we can reject H0.

  • Example (two-sided t-test for difference of means):
gapminder %>%
  filter(continent %in% c("Africa", "Europe")) %>% # filters by Africa and Europe
  t.test(lifeExp ~ continent, data = .) # compares life expectancy by continent

    Welch Two Sample t-test

data:  lifeExp by continent
t = -49.551, df = 981.2, p-value < 2.2e-16
alternative hypothesis: true difference in means between group Africa and group Europe is not equal to 0
95 percent confidence interval:
 -23.95076 -22.12595
sample estimates:
mean in group Africa mean in group Europe 
            48.86533             71.90369 
# data = . tells R to use the data that has been filtered in the t-test.

The p-value is smaller than 0.05, so we can reject H0. Therefore there is a statistically significant difference.

  • Example (one-sided t-test for difference of means):

This test is one-sided as it has directionality. Whereas in the two-sided test you are looking for a difference of means in either direction, here you are looking for a difference that is specifically either larger OR smaller.

gapminder %>%
  filter(country %in% c("Ireland", "Switzerland")) %>%
  t.test(lifeExp ~ country, data = .,
         alternative = "less", # here you state the directionality of H1. Here the alternative hypothesis is that the average lifeExp of Ireland is less than that of Switzerland.
         conf.level = 0.99) # the default is 0.95; change it here if you want a different confidence level.

    Welch Two Sample t-test

data:  lifeExp by country
t = -1.6337, df = 21.77, p-value = 0.05835
alternative hypothesis: true difference in means between group Ireland and group Switzerland is less than 0
99 percent confidence interval:
     -Inf 1.367223
sample estimates:
    mean in group Ireland mean in group Switzerland 
                 73.01725                  75.56508 

The difference is not statistically significant (p > 0.01). Therefore we fail to reject H0: there is no evidence of a difference in average life expectancy.

  • Example (paired t-test):

Eg. comparing life expectancy of one country over different time periods.

paired.test.data <- gapminder %>%
  filter(year %in% c(1957,2007) & continent == "Africa") %>%
  mutate(year2 = as.factor(year)) %>% # creates a new column 'year2' where the years are a factor rather than numeric, so it can run in the t-test
  select(country, year2, lifeExp) %>% # only selects the info we now need
  spread(year2, lifeExp) # forms a table where the years are different columns, so that the data is formally 'paired'

print(paired.test.data)
# A tibble: 52 × 3
   country                  `1957` `2007`
   <fct>                     <dbl>  <dbl>
 1 Algeria                    45.7   72.3
 2 Angola                     32.0   42.7
 3 Benin                      40.4   56.7
 4 Botswana                   49.6   50.7
 5 Burkina Faso               34.9   52.3
 6 Burundi                    40.5   49.6
 7 Cameroon                   40.4   50.4
 8 Central African Republic   37.5   44.7
 9 Chad                       39.9   50.7
10 Comoros                    42.5   65.2
# ℹ 42 more rows

Now the t-test can be run:

t.test(paired.test.data$`1957`,paired.test.data$`2007`,
       paired = TRUE)

    Paired t-test

data:  paired.test.data$`1957` and paired.test.data$`2007`
t = -11.381, df = 51, p-value = 1.308e-15
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -15.92814 -11.15125
sample estimates:
mean difference 
      -13.53969 
Note

Use the tail() function to see the end of a tibble, as you are normally only shown the first few rows.
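For example, using the paired data created above:

tail(paired.test.data) # prints the last six rows instead of the first few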

Assumptions for the t-test:

  1. You have a large, representative sample
  2. The data is normally distributed (can be visualised with a density plot; see the sketch after this list)
  3. The two samples have similar variance
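A minimal sketch of a density plot for assumption 2, using the Ireland and Switzerland data compared earlier:

gapminder %>%
  filter(country %in% c("Ireland", "Switzerland")) %>%
  ggplot(aes(x = lifeExp, fill = country)) +
  geom_density(alpha = 0.3) # overlapping density curves make skewness easy to spot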

For example, if you wanted to check the variance of life expectancy in Ireland and Switzerland before doing a t-test, run the following:

var(gapminder$lifeExp [gapminder$country=="Ireland"])
[1] 13.09338
var(gapminder$lifeExp [gapminder$country=="Switzerland"])
[1] 16.09271

ANOVA

Youtube: ANOVA using R programming

  • ANOVA is used to compare means for more than 2 data sets.
  • H0 is still that we assume there are no differences between the means.

An example using the gapminder data:

  1. Create a data set to work with
ANOVA_data <- gapminder %>%
  filter(year == 2007 & continent %in% c("Americas","Europe","Asia")) %>%
  select(continent, lifeExp)
  2. Look at the distribution of means
ANOVA_data %>%
  group_by(continent) %>%
  summarise(mean_life = mean(lifeExp))
# A tibble: 3 × 2
  continent mean_life
  <fct>         <dbl>
1 Americas       73.6
2 Asia           70.7
3 Europe         77.6
  3. Run the ANOVA test and assign it as an object
ANOVA_results <- ANOVA_data %>%
  aov(lifeExp ~ continent, data = .)

summary(ANOVA_results)
            Df Sum Sq Mean Sq F value   Pr(>F)    
continent    2  755.6   377.8   11.63 3.42e-05 ***
Residuals   85 2760.3    32.5                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  4. To go further, we can use the Tukey method to determine whether the significance is being driven by a particular continent
ANOVA_results %>%
  TukeyHSD()
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lifeExp ~ continent, data = .)

$continent
                     diff        lwr        upr     p adj
Asia-Americas   -2.879635 -6.4839802  0.7247099 0.1432634
Europe-Americas  4.040480  0.3592746  7.7216854 0.0279460
Europe-Asia      6.920115  3.4909215 10.3493088 0.0000189

This output shows p-values (significance values) for the differences in means between each pair of continents.

Post-session work

Book: Practical Statistics for Field Biology

  • Non-parametric tests: convert observations to ranks and compare sample distributions (medians).

  • Parametric tests: use actual observations and compare means.

  • Matched observations: eg. masses of birds on day 0 and then 10 days later.

  • Unmatched observations: eg. masses of 10 birds at two different sites.

Mann-Whitney U-Test

  • Also known as unpaired two-samples Wilcoxon test
  • A non-parametric test for comparing medians of two unmatched samples.
  • Can be used with very small sample sizes, and sample sizes can be unequal.
  • Minimum 4 observations needed per sample.
  • Can also be used on interval and ordinal data, as observations are converted to ranks.
  • The U value has to be smaller than the critical value to reject H0.
  • Data does not need to be normally distributed, but the test assumes the two distributions have a similar shape, e.g. one cannot be negatively skewed and the other positively skewed (see the sketch after this list).
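A minimal sketch on made-up measurements from two unmatched samples (the values are purely illustrative):

site_a <- c(12, 15, 14, 10, 18) # hypothetical observations from site A
site_b <- c(20, 22, 17, 25)     # hypothetical observations from site B
wilcox.test(site_a, site_b)     # unpaired two-sample Wilcoxon, i.e. the Mann-Whitney U-test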

Kruskal-Wallis test

  • A non-parametric test for comparing medians of three or more samples.
  • Can also be used on interval and ordinal data.
  • Samples don’t have to be equal size.
  • If only 3 samples, at least 5 observations per sample are needed.
  • The K value produced is compared with the χ² distribution (pre-defined values in appendix 3 of the book); see the sketch after this list.
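A minimal sketch using the penguins data loaded at the start of the workbook, comparing flipper length across the three species:

kruskal.test(flipper_length_mm ~ species, data = penguins) # non-parametric alternative to one-way ANOVA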

Wilcoxon test for matched pairs

  • A non-parametric test for comparing medians of two matched samples.
  • Pairs of values have to be subtracted from each other, so it is suitable for interval data (i.e. counts) but not ordinal data (see the sketch after this list).
  • 6 or more pairs of observations need to have a difference that is not 0.
  • If the number of matched pairs exceeds 40, a parametric test might be more suitable.
  • The two data sets do not have to be normally distributed, but their distributions should be similar.
  • The T value has to be smaller than or equal to the tabulated value (appendix 7 in book), to reject H0.
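A minimal sketch on made-up paired counts, e.g. the same sites measured on two occasions (values are purely illustrative):

before <- c(10, 12, 9, 14, 11, 13, 8)
after  <- c(11, 14, 12, 18, 16, 19, 15)
wilcox.test(before, after, paired = TRUE) # Wilcoxon signed-rank test for matched pairs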

The F-test (two-tailed)

  • Should be routinely applied before testing for differences between means (using z and t tests).
  • It determines if the difference in variance between samples is so small that it can be ignored. If not, the validity of the means test is reduced.
  • It helps ensure that any differences found are due to actual differences in means, and not differences in variance (see the sketch after this list).
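A minimal sketch using var.test(), comparing the variance in life expectancy for Ireland and Switzerland (the same comparison checked with var() earlier):

var.test(gapminder$lifeExp[gapminder$country == "Ireland"],
         gapminder$lifeExp[gapminder$country == "Switzerland"]) # two-tailed F-test of equal variances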

The z-test

  • A parametric test for comparing means of two large samples.
  • Used when more than 30 observations per sample.
  • For unmatched data.
  • Doesn’t necessarily assume data is normally distributed.
  • If data is badly skewed, try increasing number of observations.
  • If not possible, either transform the data or use the Mann-Whitney U-test.

The t-test

  • A parametric test for comparing means of two small samples.
  • Used when under 30 observations per sample.
  • Does assume data is normally distributed.
  • The standard t-test is for unmatched data, but a matched pairs t-test is also available.

ANOVA

  • Used to compare means of more than two samples.
  • Two-way ANOVA compares means from two variables at once.
  • ANOVA assumes all observations are obtained randomly and are normally distributed. The Shapiro-Wilk test can be used to check for normality. For a visual check, a Q-Q plot of the residuals can be produced, or a basic histogram can help you to see normality (see the sketch after this list).
  • ANOVA assumes that the variances of the samples are similar. Levene’s test can be used to test variance.
  • Tukey’s test can be used to identify which means are different from which, when significant differences are identified in the ANOVA output.
  • The Kruskal-Wallis test is the non-parametric alternative to one-way ANOVA.
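A minimal sketch of the normality checks mentioned above, using the penguin flipper lengths analysed in the worked example below:

flipper <- na.omit(penguins$flipper_length_mm) # drop the two missing values
shapiro.test(flipper) # Shapiro-Wilk test of normality
qqnorm(flipper)       # Q-Q plot; points close to a straight line suggest approximate normality
qqline(flipper)       # adds the reference line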

Youtube video: ANOVA in R: a complete example- Equitable Equations

  1. Visualise your data:
ggplot(penguins, aes(x = species, y = flipper_length_mm)) +
  geom_boxplot()

We want to test whether differences in flipper length between species are statistically significant.

  2. Assess whether the assumptions of the ANOVA test are met:
  • Are the observations independent? We are looking at individual penguins, so assume yes.

  • Are the observations approximately normally distributed within groups? Yes, the box plot shows symmetrical flipper lengths.

  • Are the variances equal? Yes, the box plot shows the boxes are roughly the same size, so the interquartile ranges are similar for each species.

ggplot(penguins, aes(x = flipper_length_mm)) + 
  geom_histogram() + 
  facet_wrap(~ species, ncol = 1)

This helps us to visualise that the data is approximately normally distributed. Fortunately, ANOVA is quite resilient to slightly non-normal data.

We can also assess variance by running the following code:

penguins %>%
  group_by(species) %>%
  summarise(var(flipper_length_mm, na.rm = TRUE))
# A tibble: 3 × 2
  species   `var(flipper_length_mm, na.rm = TRUE)`
  <fct>                                      <dbl>
1 Adelie                                      42.8
2 Chinstrap                                   50.9
3 Gentoo                                      42.1

The variances are very close so it is ok to proceed.

  3. Run the ANOVA test:
ANOVA_penguins <- aov(flipper_length_mm ~ species, data = penguins)
summary(ANOVA_penguins)
             Df Sum Sq Mean Sq F value Pr(>F)    
species       2  52473   26237   594.8 <2e-16 ***
Residuals   339  14953      44                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 observations deleted due to missingness
  4. As the difference is shown to be significant, we can run a post-hoc test (Tukey) to show exactly which differences are significant:
TukeyHSD(ANOVA_penguins)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = flipper_length_mm ~ species, data = penguins)

$species
                      diff       lwr       upr p adj
Chinstrap-Adelie  5.869887  3.586583  8.153191     0
Gentoo-Adelie    27.233349 25.334376 29.132323     0
Gentoo-Chinstrap 21.363462 19.000841 23.726084     0

This shows that the differences in means between all species are statistically significant.

Session 8 - Correlations

Pre-session work

Book: Making Sense of Data with R - Yi Shang

Scatterplot - shows two quantitative variables on two axes: the independent variable on the x axis and the dependent variable on the y axis.

Side-by-side boxplot - shows the relationship between a quantitative and a categorical variable.
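
As a minimal sketch, both plot types can be produced with ggplot (assuming the palmerpenguins data loaded earlier):

ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point() # scatter plot: two quantitative variables

ggplot(penguins, aes(x = species, y = bill_length_mm)) +
  geom_boxplot() # side-by-side box plots: a quantitative variable split by a categorical one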

Correlation = the relationship between two variables. To measure a correlation we need two measurements from each case in the dataset.

Note

The direction of a correlation is not relevant to its strength. You can have a strong negative correlation or a weak positive one.

Pearson’s r = the correlation coefficient, a value between -1 and 1 that describes the strength and direction of a correlation.

Use the code: cor.test(variable1, variable2)

Eg.

correlation.data <- penguins %>%
  select(flipper_length_mm, bill_length_mm)

cor.test(correlation.data$flipper_length_mm, correlation.data$bill_length_mm)

    Pearson's product-moment correlation

data:  correlation.data$flipper_length_mm and correlation.data$bill_length_mm
t = 16.034, df = 340, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5912769 0.7126403
sample estimates:
      cor 
0.6561813 

(The same interpretation applies to negative correlation values.)

It is beneficial to visualise the data with a scatter plot as well as calculating the correlation coefficient, because r:

  • Is not the slope, i.e. it tells you how confidently you can predict one variable from the other, but it can’t tell you how steep the slope is
  • Only represents linear relationships
  • Could be influenced by outliers
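
A minimal sketch, using simulated (hypothetical) data, of how a single outlier can change r even though the rest of the data are unchanged:

set.seed(1)
x <- 1:20
y <- x + rnorm(20, sd = 2) # roughly linear, strong positive relationship
cor(x, y) # r is close to 1

x_out <- c(x, 60) # add one extreme observation
y_out <- c(y, -40)
cor(x_out, y_out) # r drops sharply (here it even turns negative) because of the single outlier
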
Tip

Remember, correlation does not always equal causation.

Youtube video: Pearson's correlation - clearly explained!

  • A small p-value for a correlation means you can be more confident that the relationship is real. It doesn’t necessarily mean the relationship is stronger.
  • More data also = more confidence in the correlation. It will not necessarily increase the correlation value.
  • Note that an r value (correlation) of 0.5 is not actually twice as strong as 0.25. You should use r^2 to make comparisons like this (see the quick check below).
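
A quick check of this point:

c(0.5, 0.25)^2 # 0.25 vs 0.0625: r = 0.5 explains four times as much variance as r = 0.25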

Session notes

  • Pearson’s correlation is reported as lower case r.
  • For non-normal data, Spearman’s rank correlation should be used (see the sketch after this list).
  • You can do a balloon plot in R that represents the correlations between all of your different variables. This is a good visual aid to assess the strength of different relationships. See lecture slides.
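
A minimal sketch of a Spearman's rank correlation, re-using the correlation.data object created earlier (R may warn about tied ranks):

cor.test(correlation.data$flipper_length_mm,
         correlation.data$bill_length_mm,
         method = "spearman")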

Session 8 - Introductions

You can use the funnel theory to structure your introduction and discussion.

Introduction: broad -> narrow

  1. Define research theory
  2. Identify knowledge gap/ find niche
  3. Explain how the study aims to occupy this niche

Discussion: narrow -> broad

  1. Major findings
  2. Your findings in the context of other studies
  3. Implications of your findings, and their generalisability

You can use the MEAL method to structure your paragraphs:

  1. Main idea: state the main idea of the paragraph in your opening sentence. Everything in the paragraph should support this idea.
  2. Evidence: specific information that supports your main idea
  3. Analysis: your interpretation of the evidence (important!)
  4. Link: ties everything together, linking onto the next paragraph

Session 9 - Linear Models

Pre-session work

Youtube video: Linear regression using R programming - Greg Martin

  • If you have the y-intercept and the slope, then you can use any x value to predict a value of y (see the quick sketch below).
  • A small p-value = we can reject H0 and accept the hypothesis that the slope we are seeing is statistically significant.
  • R^2 = the proportion of variation in the y values that can be explained by changes in x.
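
As a quick sketch of the first point, using rounded values close to the intercept and slope of the cars model fitted below:

intercept <- -17.6
slope <- 3.9
intercept + slope * 10 # predicted stopping distance (y) for a speed (x) of 10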

To produce a linear model:

cars %>%
  lm(dist ~ speed, data = .) %>% # input dependent variable first (dist)
  summary()

Call:
lm(formula = dist ~ speed, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
  • Residuals are the distances from each observation to the line of best fit. Therefore smaller residuals = a better model.
  • Coefficients show the y-intercept and the slope in the ‘Estimate’ column, along with the p-value for the slope.
  • R squared is the proportion of variation in y that can be explained by changes in x. The adjusted R squared is only needed for multiple regression, not for a simple linear model.
  • F statistic is more relevant if more variables are added.

By assigning the output of the linear model as an object, there are other things we can do with it in R:

linear_model <- lm(dist ~ speed, data = cars)
summary(linear_model)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

For example, we can ask R to display all of the residuals and then produce a histogram of these values, to show how they are distributed around zero.

linear_model$residuals
         1          2          3          4          5          6          7 
  3.849460  11.849460  -5.947766  12.052234   2.119825  -7.812584  -3.744993 
         8          9         10         11         12         13         14 
  4.255007  12.255007  -8.677401   2.322599 -15.609810  -9.609810  -5.609810 
        15         16         17         18         19         20         21 
 -1.609810  -7.542219   0.457781   0.457781  12.457781 -11.474628  -1.474628 
        22         23         24         25         26         27         28 
 22.525372  42.525372 -21.407036 -15.407036  12.592964 -13.339445  -5.339445 
        29         30         31         32         33         34         35 
-17.271854  -9.271854   0.728146 -11.204263   2.795737  22.795737  30.795737 
        36         37         38         39         40         41         42 
-21.136672 -11.136672  10.863328 -29.069080 -13.069080  -9.069080  -5.069080 
        43         44         45         46         47         48         49 
  2.930920  -2.933898 -18.866307  -6.798715  15.201285  16.201285  43.201285 
        50 
  4.268876 
hist(linear_model$residuals)

Once you have your linear model, it can be used to predict values of y, using values of x.

First you need to create an object with the x values you are interested in:

new_speeds <- data.frame(speed = c(10,15,20))

Then use the predict function to obtain your y-values:

predict(linear_model, new_speeds) %>%
  round(1) # rounds the data to 1 d.p.
   1    2    3 
21.7 41.4 61.1 

Youtube video: Learn statistical regression in 40 mins

  • The line of best fit through the points is the straight line that reduces the sum of the squared errors (residuals) as much as possible. The errors are squared so that negative errors do not cancel out positive ones.

  • The way that R squared is calculated is:

R squared = sum of squares due to regression/total sum of squares

It is the proportion of variation in the y variable, being explained by the x variable.
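
A minimal sketch of this calculation in R, re-fitting the same cars model used earlier so the result can be compared with the Multiple R-squared in its summary:

linear_model <- lm(dist ~ speed, data = cars)
sst <- sum((cars$dist - mean(cars$dist))^2) # total sum of squares
ssr <- sst - sum(residuals(linear_model)^2) # sum of squares due to regression
ssr / sst # approximately 0.651, matching the Multiple R-squared reported above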

  • The way that degrees of freedom is calculated is:

df = number of observations - number of x variables - 1

Note

Adding more and more x variables will increase your r-squared value, even if the extra variables are not relevant to explaining y. The adjusted r-squared uses the degrees of freedom to penalise these extra variables, so when multiple x variables are involved you should look at the adjusted r-squared value.

Post-session work

Use the mtcars data set to produce a linear model of the effect of weight (wt) on efficiency (mpg).

lm_cars <- mtcars %>%
  lm(mpg ~ wt, data = .)

summary(lm_cars)

Call:
lm(formula = mpg ~ wt, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Here we can see that there is a negative relationship between weight and efficiency, and that this is statistically significant (p < 0.001). As the weight of the cars increases, the fuel efficiency decreases. The R squared value indicates that approximately 75% of the variation in fuel efficiency is explained by variation in the weight of the cars.

mtcars %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Session 10 - Logistic Regression

Pre-session work

Youtube video: StatQuest - Logistic Regression

  • Logistic regression is similar to linear regression, but it allows us to predict whether something is true or false (a binary output), rather than predicting values on a continuous scale. It can be used for continuous or discrete x variables.

  • It fits an ‘s’ shaped logistic function to the data rather than a line. It helps us to predict the probability of an output being true (1) or false (0), using a known value on the x axis.

Youtube video: Equitable Equations - Logistic Regression in R

Differences between linear and logistic regression:

  • Residuals are not the same for logistic models as they are for linear models - they can’t be plotted or have their standard error calculated.
  • There is no normal distribution in a logistic model - at every x value the distribution of the response variable is binomial, so using confidence intervals in logistic models is trickier.
admissions <- read.csv("C://Users//User//OneDrive//Documents//MSc//RM&DA//admissions.csv")

ggplot(admissions, aes(x = gpa, 
                       y = admitted)) +
  geom_jitter(height = 0.05,
              alpha = 0.1)

logistic_model <- glm(admitted ~ gpa,
                      data = admissions,
                      family = "binomial") # glm() is a broader function (generalised linear model), so family = "binomial" is stated here to produce a logistic model.

summary(logistic_model)

Call:
glm(formula = admitted ~ gpa, family = "binomial", data = admissions)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -12.0352     0.4917  -24.48   <2e-16 ***
gpa           4.0802     0.1649   24.74   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2906.6  on 2099  degrees of freedom
Residual deviance: 1527.4  on 2098  degrees of freedom
AIC: 1531.4

Number of Fisher Scoring iterations: 5

A regression line can be added to the graph above, as follows:

ggplot(admissions, aes(x = gpa,
                       y = admitted)) +
  geom_jitter(height = 0.05,
              alpha = 0.1) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

Session 11 - Polynomial and Poisson Models

Pre-session work

Youtube video: Polynomial regression in R - MarinStatsLectures

  • Polynomial regression is a form of linear regression used to model curved relationships.
  • Partial F-tests can be used to decide which regression model is a better fit for the curve or line.
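
A minimal sketch of this, using the built-in cars data: fit a straight-line model and a quadratic polynomial model, then compare them with a partial F-test via anova() (the object names are just for illustration):

straight_line <- lm(dist ~ speed, data = cars)
quadratic <- lm(dist ~ poly(speed, 2), data = cars) # adds a squared term

anova(straight_line, quadratic) # partial F-test: is the extra squared term a significantly better fit?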

Youtube video: Poisson regression in R - Equitable Equations

library(ISLR2) # contains bikeshare data

Attaching package: 'ISLR2'
The following object is masked from 'package:vcd':

    Hitters
glimpse(Bikeshare)
Rows: 8,645
Columns: 15
$ season     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ mnth       <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan,…
$ day        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ hr         <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ weekday    <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ weathersit <fct> clear, clear, clear, clear, clear, cloudy/misty, clear, cle…
$ temp       <dbl> 0.24, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.20, 0.24, 0.32,…
$ atemp      <dbl> 0.2879, 0.2727, 0.2727, 0.2879, 0.2879, 0.2576, 0.2727, 0.2…
$ hum        <dbl> 0.81, 0.80, 0.80, 0.75, 0.75, 0.75, 0.80, 0.86, 0.75, 0.76,…
$ windspeed  <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0896, 0.0000, 0.0…
$ casual     <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41, 1…
$ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70, 5…
$ bikers     <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, 110…
ggplot(Bikeshare, aes(x = temp,
                      y = bikers)) +
  geom_point(alpha = 0.1) +
  facet_wrap(~weathersit) +
  theme_minimal()

Linear regression would not be suitable to model the above examples.

  • Poisson regression is used to model count data:
poisson.model <- glm(bikers ~ temp + workingday + weathersit, data = Bikeshare,
                     family = "poisson")
summary(poisson.model)

Call:
glm(formula = bikers ~ temp + workingday + weathersit, family = "poisson", 
    data = Bikeshare)

Coefficients:
                           Estimate Std. Error  z value Pr(>|z|)    
(Intercept)                3.885010   0.003231 1202.490  < 2e-16 ***
temp                       2.129054   0.004789  444.587  < 2e-16 ***
workingday                -0.008888   0.001943   -4.574 4.78e-06 ***
weathersitcloudy/misty    -0.042219   0.002132  -19.805  < 2e-16 ***
weathersitlight rain/snow -0.432090   0.004023 -107.412  < 2e-16 ***
weathersitheavy rain/snow -0.760995   0.166681   -4.566 4.98e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1052921  on 8644  degrees of freedom
Residual deviance:  816754  on 8639  degrees of freedom
AIC: 869804

Number of Fisher Scoring iterations: 5
ggplot(Bikeshare, aes(x = temp,
                      y = bikers,
                      colour = weathersit)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "glm",
              se = FALSE,
              method.args = list(family = "poisson")) +
  facet_wrap(~workingday) +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2")
`geom_smooth()` using formula = 'y ~ x'

Session 12 - Multivariate Analysis

Pre-session work

Youtube video: Statquest: PCA main ideas in only 5 minutes

  • A PCA plot uses the correlations between many variables to plot the samples on a 2D graph.
  • Points that are highly correlated with each other will cluster together.
  • The axes of the graph are ranked in order of importance, i.e. PC1 is more important than PC2. Therefore, if clusters along the PC1 axis are the same distance apart as clusters along the PC2 axis, the difference between the clusters along PC1 is greater, because that axis explains more of the variation.
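
A minimal sketch of running a PCA in R, assuming the palmerpenguins data loaded earlier (the object names are just for illustration):

pca_data <- penguins %>%
  drop_na() %>% # PCA cannot handle missing values
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

penguin_pca <- prcomp(pca_data, scale. = TRUE) # scale. = TRUE standardises the variables

summary(penguin_pca) # proportion of variance explained by each principal component
biplot(penguin_pca) # quick base-R plot of the samples on PC1 and PC2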

A great worked example of how to perform (and interpret) PCA in R can be found here:

https://www.datacamp.com/tutorial/pca-analysis-r