ESC&R RMDA Workbook 2024

Author

Nick Park (N1343061)

FM Lecture Week 3 on 8/10/24 Post session Exercises taken from:

Chapter 6.6.1 of “R for Graduate Students” by Y.Wendy Huynh 2019 (Huynh 2019)

Start up stuff

For example:

  • setting the working directory

  • installing packages (no need in this case as Tidyverse (Wickham et al. 2019) already installed)

  • loading packages

  • importing data (in this case the Diamonds & the Midwest data in ggplot)

getwd() 
library(tidyverse) 
library(ggplot2) 
data("diamonds")
data("midwest")

Types of Variable

Variables may be either:

  • Categorical (otherwise know as factors) which can be further subdivided into:

    • Nominal (no order eg. happy, neutral or sad) or binary (True or False)
    • Ordinal (ordered eg. 1-10 from sad to happy)
  • Numerical which are quantitative and either:

    • discrete (eg. number of students in a class or numbers of pages in a book) or

    • continuous (eg. height, temperature)

Checking & Exploring the data

Must look at the data before we start analysing it to get a better understanding of it and to start some basic manipulations and assessments:

For example we could use:

  • View() but often big data files so might be better to look at small sections using glimpse(), kable() (from dplyr and knitr packages respectively) and the $ operator to look at individual variables (columns)

  • using the select() function as below to pick a sample of the data

  • also really useful to look at the structure using str() again as shown below

diamonds %>% 
  select(1:8) #picking the first 8 columns
# A tibble: 53,940 × 8
   carat cut       color clarity depth table price     x
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95
 2  0.21 Premium   E     SI1      59.8    61   326  3.89
 3  0.23 Good      E     VS1      56.9    65   327  4.05
 4  0.29 Premium   I     VS2      62.4    58   334  4.2 
 5  0.31 Good      J     SI2      63.3    58   335  4.34
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95
 8  0.26 Very Good H     SI1      61.9    55   337  4.07
 9  0.22 Fair      E     VS2      65.1    61   337  3.87
10  0.23 Very Good H     VS1      59.4    61   338  4   
# ℹ 53,930 more rows
str(diamonds) #showing a summary of the dataframe
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

These outputs show that there are 10 total variables (columns) with 3 being ordered factors (ie. categorical-cut, color and clarity), 1 integer or discrete numerical (price) and 6 numeric or continuous numerical (carat, depth, table, x, y & z) . There are 53,940 round cut diamonds in the dataset and each row contains information about each one. An ordered factor arranges the categorical values in a low to high ranked order. For example there are 5 categories of diamond cuts with Fair being the lowest grade to Ideal being the highest.

Smith (Statistics in r for Biodiversity Conservation 2020) suggest that: “before attempting to analyse data it is vital to perform a data exploration as this will save time by identifying any potential problems in the data and will help in deciding what type of analysis to conduct”.

Note

The authors recommend the 6 step data exploration proposed by Zuur (Zuur, Ieno, and Meesters 2010) which is intended to identify:

  1. Outliers in response and independent variables
  2. Normality and homegenity of the response variable
  3. An excess of 0’s in the response variable
  4. Multicollinearity among independent variables
  5. Relationships among response and independent variables
  6. Independence of the response variable

As an example, Boxplots can be used to visualise/identify outliers as below where the diamonds data is plotted by cut against price as in Figure 1 below. There are many other plotting techniques and statistical methods which can also be used for this and the other points in the Zuur method (see chapter 1 for more information) and below for lots of practice at these methods!!

diamonds %>%
  group_by(cut) %>%
  ggplot(aes(x=cut, y=price, color=cut, fill=cut))+
  geom_boxplot()
Figure 1: Diamonds data set boxplot showing Cut versus Price

The image below Figure 2 taken from Andrew Gard YouTube’s (Gard 2023) series is a nice summary for which graph/plot to use depending on categorical or numerical variables:

Figure 2: Which plot to choose?

Exercises 6.6.1 Using the diamonds dataset

Figure 3: Nice picture of a big diamond!
Useful Tip!

Good idea to execute one line at a time to see how each line changes the output but remember to not include the pipe or commas at the end of a line or it won’t work!

Another useful tip!

Basic data management might involve techniques/fucntions such as mutate, nesting, summarise, group by/ungroup, filter, select, arrange, names.

1. 1st problem on the Diamonds dataset

This problem shows how to group and ungroup variables in a dataframe using the group_by function; create and add in an extra column using the mutate function; choose which ones to display using the select function; and list values in a variable in ascending or descending order using the arrange function.

diamonds %>% #utilises the diamonds dataset
  group_by(color, clarity) %>% #groups by color and clarity variables
  mutate(price200 = mean(price)) %>% #creates new variable (average price by groups)
  ungroup() %>% #data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>% #new variable, original price +$10
  select(cut, color, clarity, price, price200, random10) %>% # retain only these columns
  arrange(color) %>% #visualise data ordered by color
  group_by(cut) %>% #group data by cut
  mutate(dis = n_distinct(price), #numbers each row consecutively for each cut
        rowID = row_number()) %>% #numbers each row consecutively for each cut
    ungroup() #then final ungroup
# A tibble: 53,940 × 8
   cut       color clarity price price200 random10   dis rowID
   <ord>     <ord> <ord>   <int>    <dbl>    <dbl> <int> <int>
 1 Very Good D     VS2       357    2587.      367  5840     1
 2 Very Good D     VS1       402    3030.      412  5840     2
 3 Very Good D     VS2       403    2587.      413  5840     3
 4 Good      D     VS2       403    2587.      413  3086     1
 5 Good      D     VS1       403    3030.      413  3086     2
 6 Premium   D     VS2       404    2587.      414  6014     1
 7 Premium   D     SI1       552    2976.      562  6014     2
 8 Ideal     D     SI1       552    2976.      562  7281     1
 9 Ideal     D     SI1       552    2976.      562  7281     2
10 Very Good D     VVS1      553    2948.      563  5840     4
# ℹ 53,930 more rows
Warning!

Always Ungroup after grouping!

2. Problem 2a. on the Midwest dataset

This problem shows how to collapse all rows into a one row summary using the summarise() function. This can also be used along with the group() function as per below.

midwest %>% 
  group_by(state) %>%
  summarise(poptotalmean = mean(poptotal),
            poptotalmed = median(poptotal),
            popmax = max(poptotal),
            popmin = min(poptotal),
            popdistinct = n_distinct(poptotal),
            popfirst = first(poptotal),
            popany = any(poptotal < 5000),
            popany2 = any(poptotal > 2000000)) %>% 
  ungroup()
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>

Problem 2b. on the Midwest dataset

This problem shows how to put parameters against the summarise() function to show how many rows of values exist in the dataframe within those set parameters as per below.

## Problem B
midwest %>% 
  group_by(state) %>% 
  summarise(num5k = sum(poptotal < 5000),
            num2mil = sum(poptotal > 2000000),
            numrows = n()) %>% 
  ungroup()
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72

Problem 2c.1 on the Midwest dataset

This problem shows how to show the number of unique entries there are in the dataset with values ??I’m not sure!!

midwest %>% 
  group_by(county) %>% 
  summarise(x = n_distinct(state)) %>% 
  arrange(desc(x)) %>% 
  ungroup()
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows

Problem 2c.11 on the Midwest dataset

This problem shows how n() differs from n_distinct().

midwest %>% 
  group_by(county) %>% 
  summarise(x = n()) %>% 
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows

Problem 3. Still on the Midwest dataset

This one shows that there can’t be more than one distinct county for each county! If replace county by state then get a more sensible response!

midwest %>%
  group_by(county) %>%
  summarise(x=n_distinct(county)) %>%
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows
midwest %>%
  group_by(state) %>%
  summarise(x=n_distinct(county)) %>%
  ungroup()
# A tibble: 5 × 2
  state     x
  <chr> <int>
1 IL      102
2 IN       92
3 MI       83
4 OH       88
5 WI       72

Problem 4. Looking at the Diamonds dataset again

This one shows how to look at a specific variable using the group() function, and then identify within that group, how many unique values there were for 2 other variables (using the n_distinct(variable) function, aswell as the total number of values using the n() function

diamonds %>%
  group_by(clarity) %>% 
  summarise(a=n_distinct(color),
            b=n_distinct(price),
            c=n()) %>%
  ungroup()
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790

Problem 5.1 Still on the Diamonds dataset

This one shows how you can use the summarise function to show the mean and sd values for values that are from 2 groups of variables

diamonds %>% 
  group_by(color, cut) %>% 
  summarise(m = mean(price),
            s = sd(price)) %>% 
  ungroup()
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows

Problem 5.2 and this one shows what happens if you put the 2 variables in a different order!

If you reverse cut and color order in first line then the values are the same but presented in different order.

diamonds %>% 
  group_by(cut, color) %>% 
  summarise(m = mean(price),
            s = sd(price)) %>% 
  ungroup()
# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows

Problem 5.3 Still in diamonds

This problem shows a practical application in that can see what effect a 20% discount has when compared to the mean sale price

diamonds %>% 
  group_by(cut, color, clarity) %>% 
  summarize(m = mean(price),
            s = sd(price),
            msale = m * 0.80) %>% 
  ungroup()
# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows

Problem 6. Still in Diamonds

This shows how you can call the output variables whatever you want!

diamonds %>% 
  group_by(cut) %>% 
  summarise(potato = mean(depth),
            pizza = mean(price),
            popcorn = median(y),
            pineapple = potato - pizza,
            papaya = pineapple ^ 2,
            peach = n()) %>% 
  ungroup()
# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551

Problem 7.1 Still in Diamonds….

What is the difference between 7.1 and 7.2?

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% 
  summarise(m = mean(x1)) %>% 
  ungroup()
# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D     1585.
2 E     1538.
3 F     1862.
4 G     2000.
5 H     2243.
6 I     2546.
7 J     2662.

Problem 7.2…..

diamonds %>% 
  group_by(color) %>% 
  mutate(x1 = price * 0.5) %>% 
  ungroup() %>%  
  summarize(m = mean(x1))
# A tibble: 1 × 1
      m
  <dbl>
1 1966.

The top one shows what the mean of the prices of all of those diamonds of each color (ie this is different for each colour) after each price has been times by 0.5. The bottom one shows the mean of all the diamonds’ prices times by 0.5 regardless of colour.

Further questions:

  • why is grouping the data necessary? This allows specific variables to be brought together for future operations. For example, grouping a dataset by age and sex might be useful if we were looking to see what effect age and sex had on a particular test result. It is then possible to conduct further manipulation and analysis of the grouped data.

  • Why is ungrouping data necessary? If you don’t ungroup once you’ve finished your calculations then future data management will likely produce errors as any further fucntions will be based on the already manipulated data

  • When should you ungroup data? At the end of your calculation

  • If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()? No

Further practice from the same book section 6.7

Q1. View all of the variable names in diamonds. An easy one to get started! Could also use head(), tail(), slice_head or slice_tail functions as below:

view(diamonds) #this brings up the table in a separate tab at the top
head(diamonds) #brings up the first 6 rows
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
diamonds %>%
slice_head(n=5) # to bring up the first 5 rows
# A tibble: 5 × 10
  carat cut     color clarity depth table price     x     y     z
  <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good    J     SI2      63.3    58   335  4.34  4.35  2.75
diamonds %>%
  slice_tail(n=5) #shows the last 5 rows
# A tibble: 5 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.72 Ideal     D     SI1      60.8    57  2757  5.75  5.76  3.5 
2  0.72 Good      D     SI1      63.1    55  2757  5.69  5.75  3.61
3  0.7  Very Good D     SI1      62.8    60  2757  5.66  5.68  3.56
4  0.86 Premium   H     SI2      61      58  2757  6.15  6.12  3.74
5  0.75 Ideal     D     SI2      62.2    55  2757  5.83  5.87  3.64

Q2. Arrange the diamonds by:

  • Lowest to highest price (hint: arrange())

  • Highest to lowest price (hint: arrange()desc())

  • Lowest price and cut

  • highest price and cut

diamonds %>%
  arrange(price) #lowest to highest
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
diamonds %>% 
  arrange(desc(price)) #highest to lowest
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
diamonds %>%
  arrange(cut, price) # by lowest price and cut
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.25 Fair  E     VS1      55.2    64   361  4.21  4.23  2.33
 3  0.23 Fair  G     VVS2     61.4    66   369  3.87  3.91  2.39
 4  0.27 Fair  E     VS1      66.4    58   371  3.99  4.02  2.66
 5  0.3  Fair  J     VS2      64.8    58   416  4.24  4.16  2.72
 6  0.3  Fair  F     SI1      63.1    58   496  4.3   4.22  2.69
 7  0.34 Fair  J     SI1      64.5    57   497  4.38  4.36  2.82
 8  0.37 Fair  F     SI1      65.3    56   527  4.53  4.47  2.94
 9  0.3  Fair  D     SI2      64.6    54   536  4.29  4.25  2.76
10  0.25 Fair  D     VS1      61.2    55   563  4.09  4.11  2.51
# ℹ 53,930 more rows
diamonds %>%
  arrange(cut, desc(price)) #by highest price and cut
# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.01 Fair  G     SI1      70.6    64 18574  7.43  6.64  4.69
 2  2.02 Fair  H     VS2      64.5    57 18565  8     7.95  5.14
 3  4.5  Fair  J     I1       65.8    58 18531 10.2  10.2   6.72
 4  2    Fair  G     VS2      67.6    58 18515  7.65  7.61  5.16
 5  2.51 Fair  H     SI2      64.7    57 18308  8.44  8.5   5.48
 6  3.01 Fair  I     SI2      65.8    56 18242  8.99  8.94  5.9 
 7  3.01 Fair  I     SI2      65.8    56 18242  8.99  8.94  5.9 
 8  2.32 Fair  H     SI1      62      62 18026  8.47  8.31  5.2 
 9  5.01 Fair  J     I1       65.5    59 18018 10.7  10.5   6.98
10  1.93 Fair  F     VS1      58.9    62 17995  8.17  7.97  4.75
# ℹ 53,930 more rows

Q3. Arrange the diamonds by lowest to highest price and worst to best clarity

diamonds %>%
  arrange(price, clarity)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Q4. Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond (hint: mutate())

diamonds %>%
  mutate(salePrice = price - 250)
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z salePrice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows

Q5. Remove the x, y, and z variables from the diamonds dataset (hint: select())

diamonds %>%
  select(1:7)
# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows

Q6. Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).

diamonds %>%
  group_by(cut) %>%
  summarise(number=n()) %>%
  ungroup()
# A tibble: 5 × 2
  cut       number
  <ord>      <int>
1 Fair        1610
2 Good        4906
3 Very Good  12082
4 Premium    13791
5 Ideal      21551

Q7. Create a new column named totalNum that calculates the total number of diamonds.

diamonds %>%
  mutate(total_diamonds =n()) # see 11 column on far right
# A tibble: 53,940 × 11
   carat cut    color clarity depth table price     x     y     z total_diamonds
   <dbl> <ord>  <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>          <int>
 1  0.23 Ideal  E     SI2      61.5    55   326  3.95  3.98  2.43          53940
 2  0.21 Premi… E     SI1      59.8    61   326  3.89  3.84  2.31          53940
 3  0.23 Good   E     VS1      56.9    65   327  4.05  4.07  2.31          53940
 4  0.29 Premi… I     VS2      62.4    58   334  4.2   4.23  2.63          53940
 5  0.31 Good   J     SI2      63.3    58   335  4.34  4.35  2.75          53940
 6  0.24 Very … J     VVS2     62.8    57   336  3.94  3.96  2.48          53940
 7  0.24 Very … I     VVS1     62.3    57   336  3.95  3.98  2.47          53940
 8  0.26 Very … H     SI1      61.9    55   337  4.07  4.11  2.53          53940
 9  0.22 Fair   E     VS2      65.1    61   337  3.87  3.78  2.49          53940
10  0.23 Very … H     VS1      59.4    61   338  4     4.05  2.39          53940
# ℹ 53,930 more rows

Research Methods subsection

Exercise: Generate a good question and a bad question about the diamonds data set you’ve just explored. Try to use the principles discussed to simulate both types of questions

What makes a good question?

Good questions should be clear and interesting; relevant and meaningful; feasible and manageable in terms of available resources; have measurable variables; not answerable with yes/no; often contain what/how rather than is/are/do/does; be novel and push the boundaries of knowledge forwards; tend to be specific rather than vague; aim to discover/explain/explore; be ethical; be systematically constructed

Examples of good questions might include:

  • why does a diamond have differing clarities/depths of colours/cuts?

  • what makes a diamond attractive to buyers?

  • how does colour affect the price?

  • which is the most important determinant of price?

Examples of bad questions might include:

  • is there a difference in price between diamonds of different cuts/clarity/colours etc?

  • are poor clarity diamonds cheaper?

  • does my wife want another diamond? This is a terrible question!!

Week 4 FM Data Exploration & Scientific Hypotheses-Formative exercise on Data Analyses

Based on Andrew Gard’s YouTube Video tutorial (Gard 2023)on “Visualisation with R in 36minutes”

This tutorial utilises the Tidyverse package and the ModelData package as this contains the ‘Cricket’ data we will be using

library(tidyverse)
library(modeldata)
library(ggplot2)
data("crickets")

Exploring the data shows that there are 3 columns of variables and 31 observations or rows. There are 2 continuous numerical variables (temperature and chirp rate ) and one categorical which is binary in nature (species).

View(crickets)
str(crickets)
tibble [31 × 3] (S3: tbl_df/tbl/data.frame)
 $ species: Factor w/ 2 levels "O. exclamationis",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ temp   : num [1:31] 20.8 20.8 24 24 24 24 26.2 26.2 26.2 26.2 ...
 $ rate   : num [1:31] 67.9 65.1 77.3 78.7 79.4 80.4 85.8 86.6 87.5 89.1 ...
glimpse(crickets)
Rows: 31
Columns: 3
$ species <fct> O. exclamationis, O. exclamationis, O. exclamationis, O. excla…
$ temp    <dbl> 20.8, 20.8, 24.0, 24.0, 24.0, 24.0, 26.2, 26.2, 26.2, 26.2, 28…
$ rate    <dbl> 67.9, 65.1, 77.3, 78.7, 79.4, 80.4, 85.8, 86.6, 87.5, 89.1, 98…

Exercise 1. Start with a simple scatter or point plot

As shown below these type of plots are useful for showing 2 numerical variables as in this case:

ggplot(crickets, aes(x = temp,
                     y = rate)) + 
  geom_point()

Which is OK but could improve the appearance by changing labels, adding captions and titles for example:

# to improve the appearance of this:
ggplot(crickets, aes(x = temp,
                     y = rate))+
  geom_point() + 
  labs(x = "Temperature", # these will change labels, add caption and titles etc
       y = "Chirp rate",
       title = "Cricket Chirps", 
       caption = "Source: Mcdonald (2009)")

Looking at this plot and the data it is obvious that it would be much more meaningful if the species were added in to:

ggplot(crickets, aes(x = temp,
                     y = rate, 
                     colour = species))+ #this will add in the species in different colours
  geom_point() + 
  labs(x = "Temperature", # these will change labels,add caption and title etc
       y = "Chirp rate",
       colour = "Species", # adding in species and relabelling the species legend with a capital S
       title = "Cricket Chirps", 
       caption = "Source: Mcdonald (2009)")

…..and might also be able to improve on the visual appearance by tweaking the colour scheme:

ggplot(crickets, aes(x = temp,
                     y = rate, 
                     colour = species))+
  geom_point() + 
  labs(x = "Temperature", # these will change labels,add caption and title etc
       y = "Chirp rate",
       colour = "Species", # adding in species and relabelling the species legend with a capital S
       title = "Cricket Chirps", 
       caption = "Source: Mcdonald (2009)") +
  scale_color_brewer(palette = "Dark2") #this will tweak the colour scheme

……and there are lots of other options for modifying the appearance and layout of the basic plot:

ggplot(crickets, aes(x = temp,
                     y = rate))+
  geom_point(colour = "red",#note where this is to change the whole plot's colour 
             size = 2, #obviously changes the size
             alpha = 0.3, #changes the transparency
             shape = "square") + #changing the shape
  labs(x = "Temperature", # these will change labels,add caption and title etc
       y = "Chirp rate",
       title = "Cricket Chirps", 
       caption = "Source: Mcdonald (2009)")

Can also further modify this simple plot to give more useful information, for example adding in another layer to show a regression line with or without error bars:

ggplot(crickets, aes(x = temp,
                     y = rate))+
  geom_point() +
  geom_smooth(method = "lm", 
              se = TRUE)+ # adding in regression line or curve of best fit and removing the error bars
  labs(x = "Temperature", # these will change labels,add caption and title etc
       y = "Chirp rate",
       title = "Cricket Chirps", 
       caption = "Source: Mcdonald (2009)")

…but again we know there are lots of effects from the differing species so ought to demonstrate this with the 2 species shown as likely to be far more meaningful:

ggplot(crickets, aes(x = temp,
                     y = rate,
                     colour = species))+ #adding the species back in
  geom_point() +
  geom_smooth(method = "lm", 
              se = FALSE)+ # adding in regression line or curve of best fit and removing the error bars
  labs(x = "Temperature", # these will change labels,add caption and title etc
       y = "Chirp rate",
       colour =  "Species",
       title = "Cricket Chirps", 
       caption = "Source: Mcdonald (2009)")

Exercise 2. Looking at other types of plots including (remember to refer back to Figure 2 to understand which plot to use):

  • for one quantitative variable

This type of data can often be usefully shown using a histogram:

ggplot(crickets, aes(x = rate)) +
  geom_histogram(bins = 15) #note the bins give a size of the categories and can be changed to alter the appearance and usefulness by trial and error

…..or a frequency polygon:

ggplot(crickets, aes(x = rate)) + 
  geom_freqpoly(bins = 15)

  • and for one single categorical variable such as the species using a bar chart:
ggplot(crickets, aes(x = species)) + 
  geom_bar(color = "black", #just adding some colours in
           fill = "green")

…..and adding some more colour in:

ggplot(crickets, aes(x = species, 
                     fill = species)) + 
  geom_bar(show.legend = FALSE) + # removes the legend as not needed
  scale_fill_brewer(palette = "Dark2") #useful for adding in colours

  • and for one categorical and one numerical variable using a box plot:
ggplot(crickets, aes(x = species, y = rate)) +
  geom_boxplot()

….and adding in some colour:

ggplot(crickets, aes(x = species,
                     y = rate,
                     colour = species)) + #adding in colour
  geom_boxplot(show.legend = FALSE) + # removes the legend as not needed
  scale_colour_brewer(palette = "Dark2") +
  theme_minimal() # removes the grey background

Exercise 3. Faceting

Faceting is a powerful technique used to split data into subsets and display multiple plots for each subset, all in one cohesive graphic layout. This allows you to compare patterns or trends across different categories or groups in a dataset. For example the first plot shows the overlaid technique for comparing the 2 species’ chirp rates but it is messy and not very useful. It would be much better to show them side by side on the same axis as in the second plot:

ggplot(crickets, aes(x = rate, 
                     fill = species)) +
  geom_histogram(bins = 15) +
  scale_color_brewer((palette = "Dark2")) #but this is messy and not really useful, better if side by side on same axis so can compare them more easily

ggplot(crickets, aes(x = rate, 
                     fill = species)) +
    geom_histogram(bins = 15, show.legend = FALSE)+
    scale_color_brewer((palette = "Dark2"))+
    facet_wrap(~species)

Research Methods subsection Week 4 FM

What makes a good research hypothesis?

After identifying a potentially interesting gap in the current knowledge and understanding in a particular area of interest (generating a research question), a comprehensive literature review is required to confirm that the gap is real and worthy of further effort. The next crucial step in the scientific method is to create a concise statement or hypothesis outlining the result which is expected or predicted to be available to be proved (the so called alternative hypothesis) or disproved (the so-called null hypothesis) as a result of the investigation which follows. Such hypotheses should be simple, clear and relevant to that field of investigation and may take many formats with examples including if/then or direct statements. Getting this hypothesis right should then allow the investigation and collection of carefully observed empirical data utilising reproducible and specific methodology, followed by thorough and rigorous analysis, including statistical methods, of the data to allow valid conclusions to be drawn. Effective and timely communication of the results in peer reviewed publications should allow the iterative and continual evolution of knowledge in that specific field to continue potentially leading to the generation of further hypotheses to be developed in the same field.

Flow Chart of the Scientific Method
Figure 4: Flow Chart of the Scientific Method

References include:

(Efbrazil 2020), (contributors 2023), (Elsevier 2023),

(Barroga and Matanguihan 2022)

Week 5 Choosing the Right Statistical Tests

First step is to understand the type of variable you are looking at, remembering that R will help if needed by looking at the tibble using summarise(), glimpse() etc

Figure 5

Then work out what your “x” or predictor variable(s) are, and what are your “y” or response variable(s) are, so that you know you are plotting against what from a categorical or quantitative point of view. This will allow you to assess which of the four families of statistical test is the most appropriate as below:

Figure 6: The 4 Families of Statistical Tests

Putting this into words:

  • Categorical vs. Categorical use Frequency Tests

  • Categorical vs. Numerical use Mean Tests

  • Numerical vs. Numerical use Correlations and Models

  • Numerical vs. Categorical use Logistic Models

Or a flow chart format:

Choosing a statistical test
Figure 7: Choosing a statistical test

and some examples of the actual tests themselves:

Decision Tree for Choosing which Statistical Test
Figure 8: Decision Tree for Choosing the Most Appropriate Statistical Test

Week 5 formative

Utilising the “iris’ data file which is pre-loaded in R. So firstly, load the data, then describe the data and explore it in more detail including plots:

#|label: Load the iris data
#| results: hide
data("iris")
glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

So can see from these that the dataset consists of 5 columns of which the first 4 are continuous numerical variables and the last one is a nominal (no order) categorical variable. There are 150 rows of individual observations so a reasonable size dataset. First thought is that it would be good to know if each species represented equally-easy to visualise this by a histogram as single categorical.

ggplot(iris, aes(x = Species, fill = Species)) + 
  geom_bar(color = "black", show.legend = FALSE) + # The outline of the bars remains black and the legend is removed
  labs(y = "Count",
       title = "Number in Each Species",
       caption = "Source:Melo 2024")

So there are equal numbers of each species. It would also be interesting to see if there is a relationship between any of the numerical variables and the species so plots of the numerical variables on the y axis versus species on the x axis. For example species vs sepal length as a box plot:

ggplot(iris, aes(x = Species,
                     y = Sepal.Length,
                     colour = Species)) + #adding in colour
  geom_boxplot(show.legend = FALSE) + # removes the legend as not needed
  scale_colour_brewer(palette = "Dark2")

Looking back to our Four Families of statistical tests diagram show in Figure 6 this shows that the correct statistical analysis to use would be from the Mean Tests family including T tests for example. So let’s try it:

summary(lm(Sepal.Length~Species, data=iris))

Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

The next graph required in the formative exercise shows on paper as density versus petal length by species. I wonder if frequency would be a better word than density? Actually with a bit of digging found out that in a density plot, the density on the y-axis represents a numerical variable. So this is another way of displaying numerical versus numerical (like a scatter plot). The density is a continuous numerical value that estimates the probability density function of a continuous variable. It shows how the data is distributed along the x-axis (the variable of interest), and the area under the curve sums to 1. The y-axis typically does not represent counts or frequencies directly, but rather a scaled density that shows the relative concentration of the data points. Key Points:

  • X-axis: The continuous variable for which the distribution is being plotted (e.g., height, weight, petal length, etc.).

  • Y-axis (Density): A continuous numerical variable representing the probability density or the relative likelihood of the data points.

So, the y-axis is also numerical, not categorical.

ggplot(iris, aes(x = Petal.Length, fill = Species)) + 
  geom_density(trim = FALSE, color = "black") 

Looking back to our Four Families of statistical tests diagram show in Figure 6 this suggests that the correct statistical analysis to use would be from the Correlations and models family including Pearson’s correlation or linear regression??

The 3rd graph in the exercise showed a scatter plot with the x axis showing the predictor variable Petal Length which is a numerical continuous variable, and the y axis showing the response variable Petal Width which is also a continuous numerical variable. So we are looking for a correlation relationship between these 2. Plotting the graph:

ggplot(iris, aes(x = Petal.Length,
                     y = Petal.Width))+
  geom_point() +
  geom_smooth(method="lm",
              se = TRUE, colour = "black")

But this isn’t that useful as we know already that there are differences between species so should separate out the data by species:

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +  # No color aesthetic #here
  geom_point(aes(colour = Species)) +  # Color only for points
  geom_smooth(method = "lm", se = TRUE)  # Regression line with SE bars

Personally, there is such a difference between species there might be more use in having each species with its own regression line and SE bars:

ggplot(iris, aes(x = Petal.Length,
                     y = Petal.Width,
                     colour = Species))+ #adding the species back in
  geom_point()+
  geom_smooth(method = "lm", 
              se = TRUE) # adding in regression line or curve of best fit #and removing the error bars

Would also be good to be able to show the different points in different shapes as in the original slide:

ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +  # No color aesthetic #here
  geom_point(aes(colour = Species, shape =  Species)) +  # Color only for points
  geom_smooth(method = "lm", se = TRUE)  # Regression line with SE bars

The last curve is a histogram which show categorical vs categorical data, so we need to create another variable here, which we can call “Size”. This is based on how many values for each species fall into new categories called big or small which are based on whether or not the value is above or below the median value for the column Sepal Length. This is done by the following:

iris_sepal_length <- iris %>%
mutate(Size=ifelse(Sepal.Length < median(Sepal.Length),
  "small", "big"))

This produces the iris_sepal_length object in the environment and allows us to use this new dataset to produced the bar chart:

ggplot(iris_sepal_length, aes(x = Species, fill = Size)) +
  geom_bar(position = "dodge") +
  labs(y = "Frequency", x = "Species") +
  ggtitle("Frequency of Size Categories by Species")

In this chunk aes(x = Species, fill = Size) maps the x-axis to Species and uses fill = Size to color the bars by the Size category (“small” or “big”). The geom_bar(position = "dodge") creates a bar plot where bars for different size categories are side by side for each species.

Week 6 Frequency Tests

This is the first family of tests out of the 4 (others are Mean, Logistic, Correlation and Models) to be looked at, and 99% of analysis will fall into these 4. Used for categorical variables on x and y axis.

Frequency tests are non-parametric tests meaning that they do not assume a specific distribution for the data, making them ideal for analyzing data that do not meet the assumptions of parametric tests. They are particularly useful when:

  1. Data is Not Normally Distributed: Non-parametric tests do not require normality, so they work well with skewed data or data with outliers.

  2. Small Sample Sizes: They are often better suited for small sample sizes where the distribution cannot be reliably determined.

  3. Ordinal or Ranked Data: Non-parametric tests are designed to work with ordinal data (ranked data) and do not rely on precise numerical differences between values.

  4. Unequal Variances: Unlike parametric tests, non-parametric tests do not require homogeneity of variances.

Pre-Session videos & Exercises:

Frequency & Contingency tables with R taken from Statisticis using R programming Youtube channel

A frequency table lists a set of values and how often each one appears-helps show which values are common/rare. A contingency table (2 way frequency table) is a tabular mechanism with the least 2 rows and 2 columns used to present categorical data in terms of frequency counts.

options(repos = c(CRAN = "https://cran.r-project.org")) #setting the CRAn mirror
install.packages("vcd")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(vcd)
write.csv(Arthritis, "Arthritis.csv", row.names = FALSE) #putting the arthritis data onto my computer
Arthritis <- read.csv("/Users/nickpark/Desktop/MSC/RMDA/R/Arthritis.csv") #telling quarto where to find it
head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked
str(Arthritis)
'data.frame':   84 obs. of  5 variables:
 $ ID       : int  57 46 77 17 36 23 75 39 33 55 ...
 $ Treatment: chr  "Treated" "Treated" "Treated" "Treated" ...
 $ Sex      : chr  "Male" "Male" "Male" "Male" ...
 $ Age      : int  27 29 30 32 46 58 59 59 63 63 ...
 $ Improved : chr  "Some" "None" "None" "Marked" ...
summary(Arthritis)
       ID         Treatment             Sex                 Age       
 Min.   : 1.00   Length:84          Length:84          Min.   :23.00  
 1st Qu.:21.75   Class :character   Class :character   1st Qu.:46.00  
 Median :42.50   Mode  :character   Mode  :character   Median :57.00  
 Mean   :42.50                                         Mean   :53.36  
 3rd Qu.:63.25                                         3rd Qu.:63.00  
 Max.   :84.00                                         Max.   :74.00  
   Improved        
 Length:84         
 Class :character  
 Mode  :character  
                   
                   
                   

So creating a one way table from this Arthritis data which has 5 columns and 84 rows, with 2 categorical factors (Treatment and Sex), one ordinal categorical factor (Improved) and 2 numerical variables. Could do some plots eg bar charts) and summary stats but just diving straight into the exercises for now:

Tip!

For the coding chunks below run one line at a time to see what effect the next one has on the previous output

#one way table
mytable <- table(Arthritis$Improved) #create the object showing the Improved column data from the Arthritis dataset
mytable #shows counts in the Improved column

Marked   None   Some 
    28     42     14 
prop.table(mytable) #same thing but shown as proportions of the total

   Marked      None      Some 
0.3333333 0.5000000 0.1666667 
prop.table(mytable)*100 #and as percentages

  Marked     None     Some 
33.33333 50.00000 16.66667 

and a 2-way table using the xtabs() function:

#two way table
mytable <- xtabs(~Treatment+Improved, data=Arthritis) #selecting the columns want to be included
mytable #counts and shows the Contingecy table
         Improved
Treatment Marked None Some
  Placebo      7   29    7
  Treated     21   13    7

and using the margin.table function() to add the totals to allow calculation of proportions using the prop.table() function:

margin.table(mytable, 1) # the 1 argument tells R that we want to look at the chosen row and give the total counts for that row (Treatment)
Treatment
Placebo Treated 
     43      41 
prop.table(mytable, 1) # and by the chosen row as proportions (note rows sums to 1)
         Improved
Treatment    Marked      None      Some
  Placebo 0.1627907 0.6744186 0.1627907
  Treated 0.5121951 0.3170732 0.1707317

and the same thing by column:

margin.table(mytable, 2) # 2 the 1 argument tells R that we want to look at the chosen column this time and give the total counts for that column (Treatment)
Improved
Marked   None   Some 
    28     42     14 
prop.table(mytable, 2) #and in proportions (columns add to 1)
         Improved
Treatment    Marked      None      Some
  Placebo 0.2500000 0.6904762 0.5000000
  Treated 0.7500000 0.3095238 0.5000000

Or looking at whole table or all cells together:

prop.table((mytable)) # cell proportions (all cells add to 1)
         Improved
Treatment     Marked       None       Some
  Placebo 0.08333333 0.34523810 0.08333333
  Treated 0.25000000 0.15476190 0.08333333
addmargins(mytable) #cell counts with row and column sums
         Improved
Treatment Marked None Some Sum
  Placebo      7   29    7  43
  Treated     21   13    7  41
  Sum         28   42   14  84
addmargins(prop.table(mytable)) #cell proportions with row and column sums
         Improved
Treatment     Marked       None       Some        Sum
  Placebo 0.08333333 0.34523810 0.08333333 0.51190476
  Treated 0.25000000 0.15476190 0.08333333 0.48809524
  Sum     0.33333333 0.50000000 0.16666667 1.00000000

and summing by chosen row or column:

addmargins(prop.table(mytable, 1),2) #row proportions with row sums
         Improved
Treatment    Marked      None      Some       Sum
  Placebo 0.1627907 0.6744186 0.1627907 1.0000000
  Treated 0.5121951 0.3170732 0.1707317 1.0000000
addmargins(prop.table(mytable, 2), 1)#column proportions with column sums
         Improved
Treatment    Marked      None      Some
  Placebo 0.2500000 0.6904762 0.5000000
  Treated 0.7500000 0.3095238 0.5000000
  Sum     1.0000000 1.0000000 1.0000000

So can see that there are lots of options for presenting the data and interpretation is difficult! Lies, damned lies and statistics!

Can also use the CrossTable() function to create a 2 way frequency or contingency table but need to remember to load the gmodels package first. In this example the table shows the same columns as in the above example with the output showing all the results possible including a chi squared contribution value:

library(gmodels)
CrossTable(Arthritis$Treatment, Arthritis$Improved)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  84 

 
                    | Arthritis$Improved 
Arthritis$Treatment |    Marked |      None |      Some | Row Total | 
--------------------|-----------|-----------|-----------|-----------|
            Placebo |         7 |        29 |         7 |        43 | 
                    |     3.752 |     2.616 |     0.004 |           | 
                    |     0.163 |     0.674 |     0.163 |     0.512 | 
                    |     0.250 |     0.690 |     0.500 |           | 
                    |     0.083 |     0.345 |     0.083 |           | 
--------------------|-----------|-----------|-----------|-----------|
            Treated |        21 |        13 |         7 |        41 | 
                    |     3.935 |     2.744 |     0.004 |           | 
                    |     0.512 |     0.317 |     0.171 |     0.488 | 
                    |     0.750 |     0.310 |     0.500 |           | 
                    |     0.250 |     0.155 |     0.083 |           | 
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        28 |        42 |        14 |        84 | 
                    |     0.333 |     0.500 |     0.167 |           | 
--------------------|-----------|-----------|-----------|-----------|

 

and a 3 way contingency table using the xtabs() function:

mytable <- xtabs(~Treatment+Sex+Improved, data=Arthritis)
mytable
, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2
margin.table(mytable, 1) #totals for treatment
Treatment
Placebo Treated 
     43      41 
margin.table(mytable, 2) #totals for Sex
Sex
Female   Male 
    59     25 
margin.table(mytable, 3) #totals for Improved
Improved
Marked   None   Some 
    28     42     14 
margin.table(mytable, c(1,3)) #totals for treatment by improved
         Improved
Treatment Marked None Some
  Placebo      7   29    7
  Treated     21   13    7

and the ftable() function can also be useful:

#Treatment by Sex for each level of Improved
ftable(mytable)
                 Improved Marked None Some
Treatment Sex                             
Placebo   Female               6   19    7
          Male                 1   10    0
Treated   Female              16    6    5
          Male                 5    7    2
ftable(prop.table(mytable, c(1,2)),3) #props sum to 1 
                 Improved     Marked       None       Some
Treatment Sex                                             
Placebo   Female          0.18750000 0.59375000 0.21875000
          Male            0.09090909 0.90909091 0.00000000
Treated   Female          0.59259259 0.22222222 0.18518519
          Male            0.35714286 0.50000000 0.14285714
ftable(addmargins(prop.table(mytable, c(1,2)),3))
                 Improved     Marked       None       Some        Sum
Treatment Sex                                                        
Placebo   Female          0.18750000 0.59375000 0.21875000 1.00000000
          Male            0.09090909 0.90909091 0.00000000 1.00000000
Treated   Female          0.59259259 0.22222222 0.18518519 1.00000000
          Male            0.35714286 0.50000000 0.14285714 1.00000000
ftable(addmargins(prop.table(mytable, c(1,2)),3))*100
                 Improved     Marked       None       Some        Sum
Treatment Sex                                                        
Placebo   Female           18.750000  59.375000  21.875000 100.000000
          Male              9.090909  90.909091   0.000000 100.000000
Treated   Female           59.259259  22.222222  18.518519 100.000000
          Male             35.714286  50.000000  14.285714 100.000000

The Chi Squared Goodness of Fit Test for Frequency Data taken from Iain’s Math & Stat Screencasts Youtube channel.

Useful for testing frequency data especially when there are more than 2 categories. This is a Hypothesis test so need to set null and alternative. No exercises on this one.

The Chi-Squared Goodness-of-Fit test is a non-parametric statistical test used to determine if observed categorical data fit a particular expected distribution. It helps answer whether the observed frequencies in a single categorical variable differ significantly from expected frequencies under a hypothesized distribution.

When to Use the Chi-Squared Goodness-of-Fit Test

The test is applied in situations where:

  1. You have categorical data with different levels (categories) — for example, colors of M&Ms in a sample or survey responses.

  2. You want to compare observed frequencies of categorie  to expected frequencies based on a theoretical or known distribution, like a uniform distribution or a specific expected ratio.

Key Concepts and Hypotheses

  • Null Hypothesis (H₀): The observed frequencies fit the expected distribution. This means any differences between observed and expected frequencies are due to random variation.

  • Alternative Hypothesis (H₁): The observed frequencies do not fit the expected distribution, suggesting a significant difference.

Steps for Performing the Test

  1. Define expected frequencies: Based on the hypothesized distribution.

  2. Calculate observed frequencies: From your sample data.

  3. Compute the Chi-Squared statistic: Using the formula:

    χ2=∑(Oi−Ei)2Eiχ2=∑Ei​(Oi​−Ei​)2​

    where OiOi​ is the observed frequency for category ii, and EiEi​ is the expected frequency.

  4. Determine the degrees of freedom: This is (k−1)(k−1), where kk is the number of categories.

  5. Interpret the p-value: Compare the calculated χ2χ2 statistic to the critical value in the Chi-Squared distribution table or obtain a p-value to determine significance.

Chi Squared testing for Independence in R

The Chi-Squared Test of Independence is a statistical test used to determine if there is a significant association between two categorical variables. This test assesses whether the distribution of one variable differs across the levels of the other variable, indicating independence or a relationship between them.

When to Use the Chi-Squared Test of Independence

This test is appropriate when:

  1. You have two categorical variables that you want to examine for association.

  2. You want to determine if the variables are independent, meaning the distribution of one variable is unaffected by the levels of the other variable.

Hypotheses for the Test

  • Null Hypothesis (H₀): The two variables are independent (no association between them).

  • Alternative Hypothesis (H₁): The two variables are not independent (there is an association between them).

Steps for Performing the Test

  1. Create a contingency table: This table cross-tabulates the frequencies for each combination of the two categorical variables.

  2. Calculate expected frequencies: For each cell in the table, the expected frequency is calculated assuming the two variables are independent. The formula for each expected count in a cell is:

    Eij=(row total)×(column total)grand totalEij​=grand total(row total)×(column total)​

  3. Compute the Chi-Squared statistic:

    χ2=∑(Oij−Eij)2Eijχ2=∑Eij​(Oij​−Eij​)2​

    where OijOij​ is the observed frequency for cell (i,j)(i,j), and EijEij​ is the expected frequency for the same cell.

  4. Determine degrees of freedom: This is calculated as (r−1)(c−1)(r−1)(c−1), where rr is the number of rows, and cc is the number of columns.

  5. Interpret the p-value: A small p-value (typically < 0.05) suggests a significant association between the two variables, meaning they are not independent.

Useful to understand if the categorical variables in question are associated/related/independent of each other. Remember to run each line in sequence.

volunteers <- matrix(c(111, 96, 48, 96, 133, 61, 91, 150, 53), byrow = T, nrow = 3) # Creating the data to be looked at
rownames(volunteers) <- c("Community College", "Four Year", "Nonstudent")
colnames(volunteers) <- c("1-3Hours", "4-6Hours", "7-9Hours") #naming rows and columns
volunteers #looking at the output
                  1-3Hours 4-6Hours 7-9Hours
Community College      111       96       48
Four Year               96      133       61
Nonstudent              91      150       53
model <- chisq.test(volunteers) #running the chisquare test
model #showing the output

    Pearson's Chi-squared test

data:  volunteers
X-squared = 12.991, df = 4, p-value = 0.01132
model$expected # visualising the expected cell counts
                   1-3Hours 4-6Hours 7-9Hours
Community College  90.57211 115.1907 49.23719
Four Year         103.00358 131.0012 55.99523
Nonstudent        104.42431 132.8081 56.76758
model$residuals #and the Pearson's residuals
                    1-3Hours   4-6Hours   7-9Hours
Community College  2.1464772 -1.7880604 -0.1763148
Four Year         -0.6900708  0.1746359  0.6688187
Nonstudent        -1.3136852  1.4918030 -0.5000487

Another way of running this test is:

vol_tab <- as.table(volunteers) #changing the existing matrix to a table
summary(vol_tab)#using the summary function to run the test
Number of cases in table: 839 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 12.991, df = 4, p-value = 0.01132

Week 6 Frequency Testing Lecture Session Exercises

install.packages("janitor")
install.packages("gmodels")
library(tidyverse)
library(tibble)
library(knitr)
library(janitor)
library(gmodels)

So creating a dataset to be used called ladybirds. Clearly interested to see if there is any association between the habitat and the colour frequency (as with the moths in the industrial revolution):

ladybirds <- tribble(
              ~Habitat, ~Site, ~Colour, ~Number,
              "Rural", "R1", "Black", 10,
              "Rural", "R2", "Black", 3,
              "Rural", "R3", "Black", 4,
              "Rural", "R4", "Black", 7,
              "Rural", "R5", "Black", 6,
              "Rural", "R1", "Red", 15,
              "Rural", "R2", "Red", 18,
              "Rural", "R3", "Red", 9,
              "Rural", "R4", "Red", 12,
              "Rural", "R5", "Red", 16,
              "Industrial", "U1", "Black", 32,
              "Industrial", "U2", "Black", 25,
              "Industrial", "U3", "Black", 25,
              "Industrial", "U4", "Black", 17,
              "Industrial", "U5", "Black", 16,
              "Industrial", "U1", "Red", 17,
              "Industrial", "U2", "Red", 23,
              "Industrial", "U3", "Red", 21,
              "Industrial", "U4", "Red", 9,
              "Industrial", "U5",  "Red", 15)
view(ladybirds)
str(ladybirds)
tibble [20 × 4] (S3: tbl_df/tbl/data.frame)
 $ Habitat: chr [1:20] "Rural" "Rural" "Rural" "Rural" ...
 $ Site   : chr [1:20] "R1" "R2" "R3" "R4" ...
 $ Colour : chr [1:20] "Black" "Black" "Black" "Black" ...
 $ Number : num [1:20] 10 3 4 7 6 15 18 9 12 16 ...
head(ladybirds)
# A tibble: 6 × 4
  Habitat Site  Colour Number
  <chr>   <chr> <chr>   <dbl>
1 Rural   R1    Black      10
2 Rural   R2    Black       3
3 Rural   R3    Black       4
4 Rural   R4    Black       7
5 Rural   R5    Black       6
6 Rural   R1    Red        15

So the dataset contains 4 columns and 20 rows with 3 categorical factors and one numerical variable. We might be interested in asking:

  • how does habitat influence morphotype occurrence of ladybirds?

  • is there an association (dependency) between habitat and ladybird morphotype?

First step as usual is to manipulate the table to show the variables we are most interested in:

#Table 1
ladybirds%>% #take the ladybirds dataset just created
  group_by(Habitat, Colour) %>% #create groups
  summarise(count = sum(Number)) |> #this last sympbol is similar to the %>% operation 
  kable() #creates the table showing how many of each colour at each habitat
Habitat Colour count
Industrial Black 115
Industrial Red 85
Rural Black 30
Rural Red 70

and adding in proportions:

ladybirds%>% 
  group_by(Habitat, Colour) %>% 
  summarise(count = sum(Number)) |> 
  mutate(prop = count/sum(count)) |> #creates a column called prop which is the calculation of the count over the total ie the proportions. Note total sums to 1
  kable() 
Habitat Colour count prop
Industrial Black 115 0.575
Industrial Red 85 0.425
Rural Black 30 0.300
Rural Red 70 0.700

and taking key value pairs, convert from long format with multiple rows for each category to wide format with separate columns for each category. The spread() function is used to create separate columns for each unique value in the colour column:

ladybirds |> 
  group_by(Habitat, Colour) %>% 
  summarise(count = sum(Number)) %>% 
  spread(Colour, count, fill = 0)|> #The spread() function takes each unique Colour as a separate column, making the data frame wide. This reshaped data frame has Habitat as rows and Colour categories as columns, with count values in each cell. If any combination of Habitat and Colour is missing, fill = 0 ensures it’s filled with a 0.
  adorn_totals(c("row", "col")) |> #adds totals for both rows and columns
  kable() #displays the table in a clean format
Habitat Black Red Total
Industrial 115 85 200
Rural 30 70 100
Total 145 155 300

and looking at this with proportions:

library(dplyr)
library(tibble)
library(tidyr)
library(knitr)
ladybirds |> 
  group_by(Habitat, Colour) %>% 
  summarise(count = sum(Number)) %>% 
  spread(Colour, count, fill = 0) |> 
  column_to_rownames("Habitat") |> #This line moves the Habitat column to row names, so that each row represents a habitat, with columns for each Colour
  proportions() %>% #The proportions() function calculates the relative proportions of each count within the table. By default, this will compute proportions for each cell as a fraction of the total sum of all counts across habitats and colors, giving the proportion each cell contributes to the total count of all ladybirds.
  kable()
Black Red
Industrial 0.3833333 0.2833333
Rural 0.1000000 0.2333333

and converting the data frame to a matrix saved as t (useful when want to perform operations like proportions across rows and columns as matrices often allow simpler indexing and manipulation than data frames):

ladybirds |> 
  group_by(Habitat, Colour) %>% 
  summarise(count = sum(Number)) %>% 
  spread(Colour, count, fill = 0) |> 
  column_to_rownames("Habitat") |> 
  as.matrix()->t
  proportions(t,1) |> #This calculates proportions across each row (i.e., within each Habitat) by specifying margin = 1. For each habitat, the sum of all colors will equal 1, so each cell value represents the proportion of that color within the habitat.
  kable()
Black Red
Industrial 0.575 0.425
Rural 0.300 0.700

and the same matrix with proportions calculated across each column instead of row:

ladybirds |> 
    group_by(Habitat, Colour) %>% 
    summarise(count = sum(Number)) %>% 
    spread(Colour, count, fill = 0) |> 
    column_to_rownames("Habitat") |> 
    as.matrix()->t
  proportions(t,2) |> 
    kable()
Black Red
Industrial 0.7931034 0.5483871
Rural 0.2068966 0.4516129

So again it is really obvious that there are loads of options for presenting frequency data and it can therefore be very difficult to interpret!!

Looking at Cross tables as a shortcut to the above: The function CrossTable(t) in R is provided by the gmodels package. It creates a cross-tabulation, or contingency table, which summarizes the frequency or proportion of combinations of categorical variables in the data. CrossTable() generates a detailed table with counts and various statistics, often used for analyzing relationships between two categorical variables. For example, where t is a matrix of counts of ladybird colors across habitats, CrossTable(t) will display:

  • Frequency Counts for each combination of Habitat and Colour in the matrix t

  • Row Proportions – the proportion of each color count relative to the row totals (e.g., proportion #of colors within each habitat)

  • Column Proportions – the proportion of each count relative to the column totals (e.g., how much each habitat contributes to a specific color)

  • Total Proportions – the proportion of each cell relative to the total count for the entire table

CrossTable(t)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  300 

 
             |  
             |     Black |       Red | Row Total | 
-------------|-----------|-----------|-----------|
  Industrial |       115 |        85 |       200 | 
             |     3.477 |     3.253 |           | 
             |     0.575 |     0.425 |     0.667 | 
             |     0.793 |     0.548 |           | 
             |     0.383 |     0.283 |           | 
-------------|-----------|-----------|-----------|
       Rural |        30 |        70 |       100 | 
             |     6.954 |     6.505 |           | 
             |     0.300 |     0.700 |     0.333 | 
             |     0.207 |     0.452 |           | 
             |     0.100 |     0.233 |           | 
-------------|-----------|-----------|-----------|
Column Total |       145 |       155 |       300 | 
             |     0.483 |     0.517 |           | 
-------------|-----------|-----------|-----------|

 

As can be seen the Cross table produces the following output details:

  • Cell Counts for each combination

  • Row and Column Proportions for each combination, helping identify any notable patterns or associations

  • Chi-square Statistics (if requested), to test independence between the categories.

    The CrossTable() output can be very informative in understanding patterns and relationships in categorical data

Types of Frequency Test

1.Test for independence

Looking at the naruto dataset which is a 2 by 2 matrix to see if there is any association or dependency between the categorical variables by performing a chi squared test of independence (see pre session section above). The h0 is that there is no association between the variables and the alternate is that there is a statistically significant association between them. So the first step is to create the frequency table as below. It is then easy to use R to run the test:

naruto<-matrix(c(35,205,8,48), nrow=2, byrow=TRUE)
  chisq.test(naruto)$expected #chisq.test(naruto) conducts a Chi-square test of independence on the naruto matrix. Adding $expected extracts the expected frequencies from the Chi-square test result. These expected frequencies are the counts that would occur if there were no association between the row and column categories.  
          [,1]      [,2]
[1,] 34.864865 205.13514
[2,]  8.135135  47.86486
  proportions(naruto[1,]) #computes the proportion of each value within this row relative to the total of the row. It divides each element by the sum of the elements in the row.Using proportions in this way provides a view of the relative distribution of counts within a specific row. This is useful for understanding how each category within that row contributes to the total count of the row, often used in contingency tables or categorical data analysis.
[1] 0.1458333 0.8541667
  proportions(naruto[2,]) #likewise for column 
[1] 0.1428571 0.8571429
  chisq.test(naruto) 

    Pearson's Chi-squared test with Yates' continuity correction

data:  naruto
X-squared = 5.9938e-30, df = 1, p-value = 1

In this case the p value is well above 0.05 suggesting that we have to accept the h0 and believe that there is no association between the variables.

Using the Placebo dataset:

The placebo effect in statistical terms refers to a measurable, positive response observed in participants receiving a placebo — an inactive treatment or substance (like a sugar pill) — due to their belief or expectation that they are receiving a real treatment. This effect can complicate the interpretation of clinical trials or experiments, as the response seen is not due to the active treatment but rather to psychological or other non-specific effects. Frequency testing using tests for independence or goodness of fit

  placebo<-matrix(c(43,31,27,12,24,28), nrow=2, byrow=TRUE)
  rownames(placebo)<-c("cure-yes", "cure-no")
  colnames(placebo)<-c("treatment","placebo","no treatment")
  exp.placebo<-chisq.test(placebo)$expected #extracts the matrix of expected frequencies under the null hypothesis of independence.
  exp.placebo #This displays the expected frequencies calculated from the chi-square test, showing the counts that would be expected in each cell if the cure and treatment types were unrelated or independent.
         treatment  placebo no treatment
cure-yes  33.66667 33.66667     33.66667
cure-no   21.33333 21.33333     21.33333
  chisq.test(placebo) # shows an effect of treatement or placebo as p<0.05. It assesses if there is a significant association between cure rates and treatment types by comparing observed counts in placebo with expected counts (exp.placebo). If the chi-square test finds that the observed counts significantly differ from the expected counts, it suggests a relationship between treatment type and cure status.

    Pearson's Chi-squared test

data:  placebo
X-squared = 10.619, df = 2, p-value = 0.004945

In this example, chisq.test(placebo) tests whether there is an association between treatment type and cure status. If the p-value is below a significance threshold (e.g., 0.05), you can conclude that the type of treatment is significantly associated with the cure status.

2. Testing for Homogeneity

In statistics, a Chi-Square Test for Homogeneity is used to compare the distribution of categorical variables across multiple groups to determine if they are homogeneous (similar in distribution) across those groups. It’s a type of chi-squared test that examines whether different populations have the same proportions for a particular categorical variable.

When to Use the Test for Homogeneity:

The Chi-Square Test for Homogeneity is appropriate when:

  1. You have two or more independent groups (e.g., different samples or populations).

  2. You want to test if the distribution of a categorical variable is consistent across these groups.

-creating a dataset showing frequency of various age categories across 2 countries to look at this test and plotting using a bar chart:

  pop<-tribble(
    ~age, ~country, ~pop,
    "0-10","UK", 50,
    "0-10","BRA", 120,
    "11-20","UK", 70,
    "11-20","BRA", 140,
    "21-30","UK", 110,
    "21-30","BRA", 150,
    "31-40","UK", 150,
    "31-40","BRA", 180,
    "40-50","UK", 180,
    "40-50","BRA", 180,
    "50-60","UK", 150,
    "50-60","BRA", 90,
    "60-70","UK", 130,
    "60-70","BRA", 50)
  pop |> 
    ggplot(aes(x=age, y= pop, color=country, fill=country))+
    geom_col(position="dodge")+
    theme(axis.text=element_text(size = 18),
          axis.title = element_text(size=18))

and then grouping the data to allow for looking to see if the 2 distributions have the same pattern. Clearly the h0 is that there is no significant difference between the groups, with the alternate hypothesis suggesting that there is.

pop |> 
  group_by(country, age) %>% #summarise(count = sum(Number)) %>% 
  spread(country, pop, fill = 0) |> 
  column_to_rownames("age") |> 
  as.matrix() -> t2
  chisq.test(t2)

    Pearson's Chi-squared test

data:  t2
X-squared = 108.97, df = 6, p-value < 2.2e-16

Interpreting these Results

  • p-value: If the p-value is below a certain significance level (commonly 0.05), it indicates that there is a statistically significant difference in the distribution of preferences across the groups, meaning the groups are not homogeneous with respect to the categorical variable.

3. Goodness of Fit

Useful for testing frequency data especially when there are more than 2 categories. This is a Hypothesis test so need to set null and alternative.

The Chi-Squared Goodness-of-Fit test is a non-parametric statistical test used to determine if observed categorical data fit a particular expected distribution. It helps answer whether the observed frequencies in a single categorical variable differ significantly from expected frequencies under a hypothesized distribution.

When to Use the Chi-Squared Goodness-of-Fit Test:

The test is applied in situations where:

  1. You have categorical data with different levels (categories) — for example, colors of M&Ms in a sample or survey responses.

  2. You want to compare observed frequencies of categories to expected frequencies based on a theoretical or known distribution, like a uniform distribution or a specific expected ratio.

Key Concepts and Hypotheses

  • Null Hypothesis (H₀): The observed frequencies fit the expected distribution. This means any differences between observed and expected frequencies are due to random variation.

  • Alternative Hypothesis (H₁): The observed frequencies do not fit the expected distribution, suggesting a significant difference.

  pop %>% 
    spread(country, pop, fill=0) %>% 
    mutate(world=(BRA+UK)/2) %>% 
    pivot_longer(cols = BRA:world, names_to = "place")->pop2
  pop2  #show the output
# A tibble: 21 × 3
   age   place value
   <chr> <chr> <dbl>
 1 0-10  BRA     120
 2 0-10  UK       50
 3 0-10  world    85
 4 11-20 BRA     140
 5 11-20 UK       70
 6 11-20 world   105
 7 21-30 BRA     150
 8 21-30 UK      110
 9 21-30 world   130
10 31-40 BRA     180
# ℹ 11 more rows
  pop2 %>%
    ggplot(aes(x=age,y=value, colour=place, fill=place))+
    geom_col(position = "dodge")

  pop2 %>% 
    spread(place, value, fill = 0) %>% 
    column_to_rownames("age")-> pop3
  chisq.test(pop3$BR, pop3$world)

    Pearson's Chi-squared test

data:  pop3$BR and pop3$world
X-squared = 35, df = 30, p-value = 0.2426

Post Lecture Session Week6

Exercises-taken from Statistical tools for high-throughput data analysis website

http://www.sthda.com/english/wiki/comparing-proportions-in-r

  • One proportion Z test-used to compare observed proportion to a theoretical one when there are only 2 categories.

    This test is often used in cases where you want to compare a sample proportion to a population proportion for categorical data.

    When to Use a One-Proportion Z-Test:

    • You have a single sample with categorical data (e.g., success/failure, yes/no).

    • You want to test if the proportion of “successes” in this sample matches a known population proportion.

    • The sample size is large enough that the sampling distribution of the proportion is approximately normal (commonly, both np≥10np≥10 and n(1−p)≥10n(1−p)≥10).

    • H0 usually stated as the sample proportion is equal to the population proportion, with alternative h being the sample proportion is different from (greater or less than) the population proportion

      For example: one population of mice, n=160, would expect 50:50 m:f (pe), observed shows 95males:65females. So observed proportion of males is 95/160 = p0, females is 1-p0 #setting the null hypothesis-would expect p0=pe (a two tailed test). Setting the null hypothesis would expect p0=pe (2 tailed). Using R function binom.test(x, n, p=0.5, alternative = “two.sided”) where x = number of successes, n= total, p = probability.

  • Use the Binomial test when sample is small.

    Or use the prop.test() function prop.test(x, n, p =NULL, alternative + two.sided”, correct =TRUE)

    binom.test(95, 160, p = 0.5, alternative = "two.sided")
    
        Exact binomial test
    
    data:  95 and 160
    number of successes = 95, number of trials = 160, p-value = 0.02157
    alternative hypothesis: true probability of success is not equal to 0.5
    95 percent confidence interval:
     0.5133727 0.6705878
    sample estimates:
    probability of success 
                   0.59375 
    prop.test(95, 160, p = NULL, alternative = "two.sided", correct = TRUE)
    
        1-sample proportions test with continuity correction
    
    data:  95 out of 160, null probability 0.5
    X-squared = 5.2562, df = 1, p-value = 0.02187
    alternative hypothesis: true p is not equal to 0.5
    95 percent confidence interval:
     0.5131775 0.6697480
    sample estimates:
          p 
    0.59375 

Interpretation:

If the p-valu  is low (e.g., < 0.05), you may conclude that the observed sample proportion is significantly different from the hypothesized proportion. If it’s high, there isn’t enough evidence to suggest a difference.

2.Two Proportions Z tests-used to compare 2 observed proportions. This used to compare the proportions of two independent groups to determine if they are significantly different from each other. It’s commonly applied to categorical data when you have two groups (e.g., treatment vs. control) and want to test if the proportions of a specific outcome (e.g., success rate) differ between them

When to Use a Two-Proportion Z-Test:

  • You have categorical data and two independent groups.

  • You want to test if the proportions of a certain outcome are different between the two groups.

  • The sample sizes are large enough to approximate a normal distribution (commonly, both groups should meet the criteria n×p≥10n×p≥10 and n×(1−p)≥10n×(1−p)≥10).

    For example: 2 groups A with lung cancer n = 500, no of smokers is 490, 98% B healthy, n = 500, no of smokers is 400, 80% overall prop of smokers is 890/500, 89%, non smoker 11% . We want to know if the proportion of smokers are the same in each group so setting hypothesis pa=pb and using the prop.test() function gives:

res <- prop.test(x = c(490, 400), n =c(500, 500))
res

    2-sample test for equality of proportions with continuity correction

data:  c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1408536 0.2191464
sample estimates:
prop 1 prop 2 
  0.98   0.80 

Interpretation

A low p-value (e.g., < 0.05) indicates that the proportions are significantly different between the two groups. A high p-value suggests there isn’t enough evidence to conclude a difference.

3.Chi square Goodness of Fit test to evaluate the association between 2 categorical variables. Used to compare observed distribution to expected with 2 or more categories in discrete data. In other words it compares multiple observed proportions to expected probabilities. For example:

collected wild tulips 81 red, 50y, 27w

if equally distributed would expect 1/3

but if in the region of collection ratio is 3:2:1 (3+2+1=6)

then expected proportion would be 1/2 red, 1/3 y, 1/6 w

so is there a diff between observed and expected?

null h0 is no significant difference

So in R use: chisq.test(x, p ) where x is numeric vector and p is probabilities

tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res #so p v low which says can reject null h0

    Chi-squared test for given probabilities

data:  tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07

So very low p value so can reject the h0.

  1. Chi square test of independence. Used to evaluate the association between 2 categorical variables. First load the data which is a contingency table showing 13 tasks and their distribution in the couple

    file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
    housetasks <- read.delim(file_path, row.names = 1) #installing the data
    install.packages("gplots")
    
    The downloaded binary packages are in
        /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
    library(gplots) #and displaying these graphically
    dt <- as.table(as.matrix(housetasks)) #first by converting the data to a table
    balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
                label = FALSE, show.margins = FALSE) # then Graphing using balloon plot function in gplots

then running the Chi squared test to examine whether the rows and columns of the contingency table are statistically significantly associated, with the null hypothesis set that they are independent:

chisq <- chisq.test(housetasks)
chisq

    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16

can extract observed and expected counts:

chisq$observed
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153
round(chisq$expected,2)
            Wife Alternating Husband Jointly
Laundry    60.55       25.63   38.45   51.37
Main_meal  52.64       22.28   33.42   44.65
Dinner     37.16       15.73   23.59   31.52
Breakfeast 48.17       20.39   30.58   40.86
Tidying    41.97       17.77   26.65   35.61
Dishes     38.88       16.46   24.69   32.98
Shopping   41.28       17.48   26.22   35.02
Official   33.03       13.98   20.97   28.02
Driving    47.82       20.24   30.37   40.57
Finances   38.88       16.46   24.69   32.98
Insurance  47.82       20.24   30.37   40.57
Repairs    56.77       24.03   36.05   48.16
Holidays   55.05       23.30   34.95   46.70

can also find which are the most contributing cells to the total chi-sq score, by calculating the Chi sq stat for each cell (Pearson’s residuals or standardised residuals):

round(chisq$residuals, 3)
             Wife Alternating Husband Jointly
Laundry    12.266      -2.298  -5.878  -6.609
Main_meal   9.836      -0.484  -4.917  -6.084
Dinner      6.537      -1.192  -3.416  -3.299
Breakfeast  4.875       3.457  -2.818  -5.297
Tidying     1.702      -1.606  -4.969   3.585
Dishes     -1.103       1.859  -4.163   3.486
Shopping   -1.289       1.321  -3.362   3.376
Official   -3.659       8.563   0.443  -2.459
Driving    -5.469       6.836   8.100  -5.898
Finances   -4.150      -0.852  -0.742   5.750
Insurance  -5.758      -4.277   4.107   5.720
Repairs    -7.534      -4.290  20.646  -6.651
Holidays   -7.419      -4.620  -4.897  15.556

and can visualise these using corrplot() function:

install.packages("corrplot")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(corrplot)
corrplot(chisq$residuals, is.cor = FALSE) #for a given cell the size of the circle is proportional to the amount of the cell contribution

#positive residuals in blue specify positive association between corresponding row and column variables and vv

and can look at the contribution in percentages:

contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3) #and visualising this:
            Wife Alternating Husband Jointly
Laundry    7.738       0.272   1.777   2.246
Main_meal  4.976       0.012   1.243   1.903
Dinner     2.197       0.073   0.600   0.560
Breakfeast 1.222       0.615   0.408   1.443
Tidying    0.149       0.133   1.270   0.661
Dishes     0.063       0.178   0.891   0.625
Shopping   0.085       0.090   0.581   0.586
Official   0.688       3.771   0.010   0.311
Driving    1.538       2.403   3.374   1.789
Finances   0.886       0.037   0.028   1.700
Insurance  1.705       0.941   0.868   1.683
Repairs    2.919       0.947  21.921   2.275
Holidays   2.831       1.098   1.233  12.445
corrplot(contrib, is.cor = FALSE)#the relative contribution of each cell to total chi sq score give some indication of the nature of the dependency between rows and columns of the contingency table

Contingency tables-taken from Analytics with R-Data Analysis section

There are many options for producing contingecy tables and summary tables in R. For example:

  • summary tables and stats using dplyr and tidyr (enables you to continue code seamlessly into the next task (filtering, plotting etc)

  • producing frequency and proportion tables using table()

  • producing frequency, proportion and Chi squared values using CrossTable()

  • the group_by(), summarise() and spread() commands are also useful for producing summary or agrregate values of our data

library(ggplot2) #loading required packages
library(dplyr)
library(tidyr)
library(knitr)
data(mpg)
str(mpg)
tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
 $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
 $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
 $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr [1:234] "f" "f" "f" "f" ...
 $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr [1:234] "p" "p" "p" "p" ...
 $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

So looking at the total number of cars with each class and cyl combination:

mpg%>%
    group_by(class, cyl)%>%
    summarise(n=n())%>% #creates the column n which contains the count of each row
    kable()
class cyl n
2seater 8 5
compact 4 32
compact 5 2
compact 6 13
midsize 4 16
midsize 6 23
midsize 8 2
minivan 4 1
minivan 6 10
pickup 4 3
pickup 6 10
pickup 8 20
subcompact 4 21
subcompact 5 2
subcompact 6 7
subcompact 8 5
suv 4 8
suv 6 16
suv 8 38

and to turn this summary data into a cross table (contingency table) we need variable A (class) to be listed by row and variable B (cyl) to be listed by column. Can do this using the spread() function to create columns for each cyl value with n as the cross tab response value:

mpg%>%
  group_by(class, cyl) %>%
  summarise(n=n()) %>%
  spread(cyl, n)%>%
  kable()
class 4 5 6 8
2seater NA NA NA 5
compact 32 2 13 NA
midsize 16 NA 23 2
minivan 1 NA 10 NA
pickup 3 NA 10 20
subcompact 21 2 7 5
suv 8 NA 16 38

If we wanted to look at summary stats other than frequency we can determine what kind of summary stat we want to see easily by adjusting our summarise() input. Here instead of frequency we can get the average number of city miles by class and cylinder:

mpg%>%
  group_by(class, cyl) %>%
  summarise(mean_cty=mean(cty))%>% #For each class and cyl group, calculates the mean of cty (city miles per gallon) and stores it in a new column called mean_cty
  spread(cyl, mean_cty) %>% #Reshapes the data from long to wide format, spreading the cyl values across columns. Each cyl becomes its own column, with mean_cty values filling the cells
  kable()
class 4 5 6 8
2seater NA NA NA 15.40000
compact 21.37500 21 16.92308 NA
midsize 20.50000 NA 17.78261 16.00000
minivan 18.00000 NA 15.60000 NA
pickup 16.00000 NA 14.50000 11.80000
subcompact 22.85714 20 17.00000 14.80000
suv 18.00000 NA 14.50000 12.13158

or could look at or max number of city miles by class and cyl:

mpg%>%
  group_by(class, cyl)%>%
  summarise(max_cty=max(cty))%>%
  spread(cyl, max_cty)%>%
  kable()
class 4 5 6 8
2seater NA NA NA 16
compact 33 21 18 NA
midsize 23 NA 19 16
minivan 18 NA 17 NA
pickup 17 NA 16 14
subcompact 35 20 18 15
suv 20 NA 17 14

Looking at the same data by Proportions rather than counts-we can find proportions by creating a new calculated variable dividing row frequency by table frequency:

mpg%>%
  group_by(class)%>%
  summarise(n=n())%>%
  mutate(prop=n/sum(n))%>% #the new proportion variable
  kable()
class n prop
2seater 5 0.0213675
compact 47 0.2008547
midsize 41 0.1752137
minivan 11 0.0470085
pickup 33 0.1410256
subcompact 35 0.1495726
suv 62 0.2649573

and we can then create a contingency table of proportion values by applying spread() as before:

mpg%>%
  group_by(class, cyl)%>%
  summarise(n=n())%>%
  mutate(prop=n/sum(n))%>%
  subset(select=c("class", "cyl", "prop"))%>% #this drops the frequency value
  spread(class, prop)%>%
  kable()
cyl 2seater compact midsize minivan pickup subcompact suv
4 NA 0.6808511 0.3902439 0.0909091 0.0909091 0.6000000 0.1290323
5 NA 0.0425532 NA NA NA 0.0571429 NA
6 NA 0.2765957 0.5609756 0.9090909 0.3030303 0.2000000 0.2580645
8 1 NA 0.0487805 NA 0.6060606 0.1428571 0.6129032

The table() function is a quick way to pull together row/column frequency and proportions for categorical variables. For example, using basic table() command we can get contingency table for class by number of cyl:

table(mpg$class, mpg$cyl)
            
              4  5  6  8
  2seater     0  0  0  5
  compact    32  2 13  0
  midsize    16  0 23  2
  minivan     1  0 10  0
  pickup      3  0 10 20
  subcompact 21  2  7  5
  suv         8  0 16 38

A table of frequency can also be called using the ftable() command:

mpg_table<- table(mpg$class, mpg$cyl)
  ftable(mpg_table)
             4  5  6  8
                       
2seater      0  0  0  5
compact     32  2 13  0
midsize     16  0 23  2
minivan      1  0 10  0
pickup       3  0 10 20
subcompact  21  2  7  5
suv          8  0 16 38

and when specifically wanting to look at row/column frequency we can use the margin.table() function with the appropriate argument applied:

  margin.table(mpg_table, 1)#for row freq we use the margin.table() command with the 1 argument

   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 
  margin.table(mpg_table, 2)#and for column freq we use the 2 argument

 4  5  6  8 
81  4 79 70 

and of course we can do the same things for proportions:

  prop.table(mpg_table)#for prop of the entire table we use prop.table() command:
            
                       4           5           6           8
  2seater    0.000000000 0.000000000 0.000000000 0.021367521
  compact    0.136752137 0.008547009 0.055555556 0.000000000
  midsize    0.068376068 0.000000000 0.098290598 0.008547009
  minivan    0.004273504 0.000000000 0.042735043 0.000000000
  pickup     0.012820513 0.000000000 0.042735043 0.085470085
  subcompact 0.089743590 0.008547009 0.029914530 0.021367521
  suv        0.034188034 0.000000000 0.068376068 0.162393162
  prop.table(mpg_table, 1)#for row proportions we use prop.table() with the 1 argument
            
                      4          5          6          8
  2seater    0.00000000 0.00000000 0.00000000 1.00000000
  compact    0.68085106 0.04255319 0.27659574 0.00000000
  midsize    0.39024390 0.00000000 0.56097561 0.04878049
  minivan    0.09090909 0.00000000 0.90909091 0.00000000
  pickup     0.09090909 0.00000000 0.30303030 0.60606061
  subcompact 0.60000000 0.05714286 0.20000000 0.14285714
  suv        0.12903226 0.00000000 0.25806452 0.61290323
  prop.table(mpg_table, 2)#and for column proportions we use the 2 argument
            
                      4          5          6          8
  2seater    0.00000000 0.00000000 0.00000000 0.07142857
  compact    0.39506173 0.50000000 0.16455696 0.00000000
  midsize    0.19753086 0.00000000 0.29113924 0.02857143
  minivan    0.01234568 0.00000000 0.12658228 0.00000000
  pickup     0.03703704 0.00000000 0.12658228 0.28571429
  subcompact 0.25925926 0.50000000 0.08860759 0.07142857
  suv        0.09876543 0.00000000 0.20253165 0.54285714

Lastly, the CrossTable command from the gmodels package produces a frequency with table and row proportions with one single command which can be very useful:

install.packages("gmodels")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(gmodels)
CrossTable(mpg$class, mpg$cyl) #run cross table command with the 2 variables as input

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  234 

 
             | mpg$cyl 
   mpg$class |         4 |         5 |         6 |         8 | Row Total | 
-------------|-----------|-----------|-----------|-----------|-----------|
     2seater |         0 |         0 |         0 |         5 |         5 | 
             |     1.731 |     0.085 |     1.688 |     8.210 |           | 
             |     0.000 |     0.000 |     0.000 |     1.000 |     0.021 | 
             |     0.000 |     0.000 |     0.000 |     0.071 |           | 
             |     0.000 |     0.000 |     0.000 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     compact |        32 |         2 |        13 |         0 |        47 | 
             |    15.210 |     1.782 |     0.518 |    14.060 |           | 
             |     0.681 |     0.043 |     0.277 |     0.000 |     0.201 | 
             |     0.395 |     0.500 |     0.165 |     0.000 |           | 
             |     0.137 |     0.009 |     0.056 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     midsize |        16 |         0 |        23 |         2 |        41 | 
             |     0.230 |     0.701 |     6.059 |     8.591 |           | 
             |     0.390 |     0.000 |     0.561 |     0.049 |     0.175 | 
             |     0.198 |     0.000 |     0.291 |     0.029 |           | 
             |     0.068 |     0.000 |     0.098 |     0.009 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     minivan |         1 |         0 |        10 |         0 |        11 | 
             |     2.070 |     0.188 |    10.641 |     3.291 |           | 
             |     0.091 |     0.000 |     0.909 |     0.000 |     0.047 | 
             |     0.012 |     0.000 |     0.127 |     0.000 |           | 
             |     0.004 |     0.000 |     0.043 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
      pickup |         3 |         0 |        10 |        20 |        33 | 
             |     6.211 |     0.564 |     0.117 |    10.391 |           | 
             |     0.091 |     0.000 |     0.303 |     0.606 |     0.141 | 
             |     0.037 |     0.000 |     0.127 |     0.286 |           | 
             |     0.013 |     0.000 |     0.043 |     0.085 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
  subcompact |        21 |         2 |         7 |         5 |        35 | 
             |     6.515 |     3.284 |     1.963 |     2.858 |           | 
             |     0.600 |     0.057 |     0.200 |     0.143 |     0.150 | 
             |     0.259 |     0.500 |     0.089 |     0.071 |           | 
             |     0.090 |     0.009 |     0.030 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
         suv |         8 |         0 |        16 |        38 |        62 | 
             |     8.444 |     1.060 |     1.162 |    20.403 |           | 
             |     0.129 |     0.000 |     0.258 |     0.613 |     0.265 | 
             |     0.099 |     0.000 |     0.203 |     0.543 |           | 
             |     0.034 |     0.000 |     0.068 |     0.162 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |        81 |         4 |        79 |        70 |       234 | 
             |     0.346 |     0.017 |     0.338 |     0.299 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|

 

Week 7 Mean Tests

Pre-session videos and exercises

t-test and interpreting p values using R Programming by Greg Martin Youtube channel

t tests (single, two sided, one sided, paired) & interpreting p values using R

library(tidyverse)
install.packages("gapminder")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(gapminder)
install.packages("patchwork")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(patchwork)
library(dplyr)

View and plot the data of interest in density plots (or box plots):

str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

Examples of the plots in the exercise:

Plot 1: Life expectancy in the continent of Africa

# Filter data for Africa
africa_data <- gapminder %>%
  filter(continent == "Africa")
# Calculate mean life expectancy
mean_lifeExp <- mean(africa_data$lifeExp)
# Plot density with mean line
ggplot(africa_data, aes(x = lifeExp)) +
  geom_density(fill = "blue", alpha = 0.5) +
  geom_vline(xintercept = mean_lifeExp, color = "red", linetype = "dashed", size = 1) +
  labs(title = "Density Plot of Life Expectancy in Africa",
       x = "Life Expectancy",
       y = "Density") +
  theme_minimal()

Plot 2: Life expectancy in Africa and Europe

# Filter data for Africa and Europe
africa_europe_data <- gapminder %>%
  filter(continent %in% c("Africa", "Europe"))

# Calculate mean life expectancy for each continent
mean_lifeExp <- africa_europe_data %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp))

# Plot density for each continent with mean lines
ggplot(africa_europe_data, aes(x = lifeExp, fill = continent, color = continent)) +
  geom_density(alpha = 0.4) +
  geom_vline(data = mean_lifeExp, aes(xintercept = mean_lifeExp, color = continent),
             linetype = "dashed", size = 1) +
  labs(title = "Density Plot of Life Expectancy in Africa and Europe",
       x = "Life Expectancy",
       y = "Density") +
  theme_minimal()

Mean tests, like the t-test (both one-sample, two-sample, and paired) and ANOVA (analysis of variance), are parametric tests. This means they rely on certain assumptions about the population data, primarily that:

  1. The data are normally distributed.

  2. The samples have similar variances (homogeneity of variance).

  3. Observations are independent of each other.

These tests are designed to estimate and compare means under these assumptions, making .hem robust for many types of data but less ideal when data significantly deviate from normality or have unequal variances.

If the data do not meet these assumptions, non-parametric alternatives to compare central tendencies exist, such as the Mann-Whitney U test or Wilcoxon signed-rank test for comparing medians rather than means.

In a t test we are basically looking at the mean or average in to see if that average is different? Different to the one we think we expect or in one subset statistically significantly different to the average in another subset of the same population.

  1. Single sample t test-set the hypothesis: h0 could be that the mean expected lifespan is 50yrs. The alternative is then that the mean expected lifespan is not 50 yrs. So assuming that ho is correct, what are the chances that we’d get a sample that is as different or more. Must pick the threshold before run the data.
gapminder%>% #using the preloaded data in gapminder
    filter(continent == "Africa")%>% #selecting Africa only data
    select(lifeExp)%>% #looking at Lifeexpectancy variable only  
    t.test(mu = 50) #run the t test with pop mean of 50 (assumed mean is 50) 

    One Sample t-test

data:  .
t = -3.0976, df = 623, p-value = 0.002038
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
 48.14599 49.58467
sample estimates:
mean of x 
 48.86533 

Results: p value-probability that we’d get a sample that is as different to 50 or more as what we’ve got here, is a less than 5% chance so can reject h0, and therefore by definition accept alternative-95% confident that true mean lies between 48 and 49, does not include 50 so sample data is statistically significantly different to 50.

Can also do same thing by creating an object as below:

my_ttest <- gapminder %>% #doing same thing but creating an object
  filter(continent == "Africa")%>%
  select(lifeExp)%>%
  t.test(mu = 50)
attributes(my_ttest)#then allows functions on that object
$names
 [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
 [6] "null.value"  "stderr"      "alternative" "method"      "data.name"  

$class
[1] "htest"
my_ttest$p.value #such as p value specifically for example 
[1] 0.002038444
  1. Two sided t test-is the mean for Africa statistically significantly different to Europe’s? The 2 sides refer to above or below

    gapminder%>%
      filter(continent %in% c("Africa", "Europe")) %>% 
      t.test(lifeExp ~ continent, data = ., 
             alternativve = "two.sided") #default anyway but means we are doing 2 sided test-is it different in any direction (up or down)
    
        Welch Two Sample t-test
    
    data:  lifeExp by continent
    t = -49.551, df = 981.2, p-value < 2.2e-16
    alternative hypothesis: true difference in means between group Africa and group Europe is not equal to 0
    95 percent confidence interval:
     -23.95076 -22.12595
    sample estimates:
    mean in group Africa mean in group Europe 
                48.86533             71.90369 

These results show tiny p value (far less than 0.05) so can therefore reject h0 and accept that the difference is statistically significant, confidence interval doesn’t include 0.

  1. One sided t test-looking at the plot the life expectancy in Ireland it is clearly less than Switzerland but is it statistically significant? In this case the question isis it stat sig that is lower than swi, so ho is that they are the same, h1 is that life expectancy in Ire is less by that magnitude or more than it is in Switzerland.

    gapminder %>%
      filter(country %in% c("Ireland", "Switzerland"))%>%
      t.test(lifeExp ~ country, data = ., 
             alternative = "less", 
             conf.level = 0.95)
    
        Welch Two Sample t-test
    
    data:  lifeExp by country
    t = -1.6337, df = 21.77, p-value = 0.05835
    alternative hypothesis: true difference in means between group Ireland and group Switzerland is less than 0
    95 percent confidence interval:
          -Inf 0.1313697
    sample estimates:
        mean in group Ireland mean in group Switzerland 
                     73.01725                  75.56508 

Results say not statistically significant as p value 0.058 so can’t reject h0, might well be the case that LE in the 2 is the same, 95% conf interval and does include 0.

  1. Paired t test-example here is looking at one population at 2 different time points-life expectancy in Africa in 1957 versus 2007, exact counterpart for each data point so pairs of data so needs a paired t test:

    library(tidyr)
    gapminder %>% 
      filter(continent=="Africa") %>% 
      mutate(year2=as.factor(year)) %>% 
      filter(year %in% c("1957", "2007")) %>% 
      select(country, year2, lifeExp) %>% 
      spread(year2, lifeExp)->africa
    africa
    # A tibble: 52 × 3
       country                  `1957` `2007`
       <fct>                     <dbl>  <dbl>
     1 Algeria                    45.7   72.3
     2 Angola                     32.0   42.7
     3 Benin                      40.4   56.7
     4 Botswana                   49.6   50.7
     5 Burkina Faso               34.9   52.3
     6 Burundi                    40.5   49.6
     7 Cameroon                   40.4   50.4
     8 Central African Republic   37.5   44.7
     9 Chad                       39.9   50.7
    10 Comoros                    42.5   65.2
    # ℹ 42 more rows
    t.test(africa$`1957`, africa$`2007`, paired = TRUE, alternative = "less")
    
        Paired t-test
    
    data:  africa$`1957` and africa$`2007`
    t = -11.381, df = 51, p-value = 6.54e-16
    alternative hypothesis: true mean difference is less than 0
    95 percent confidence interval:
          -Inf -11.54659
    sample estimates:
    mean difference 
          -13.53969 
Warning

Assumptions of t test:

  • large, representative sample

  • values are normally distributed (check shape using density plots for example)

  • similar variances (can check this in R)

ANOVA’s

ANOVA using R programming by Greg Martin Youtube channel

ANOVA, or Analysis of Variance, is a statistical method used to compare the means of three or more groups to see if at least one group mean is significantly different from the others. ANOVA determines this by analyzing the variance within each group and comparing it to the variance between the groups. Here’s a breakdown of key concepts and types of ANOVA:

How ANOVA Works

  1. Hypotheses:

    • Null Hypothesis (H₀): All group means are equal.

    • Alternative Hypothesis (H₁): At least one group mean is different.

  2. F-statistic: ANOVA calculates an F-statistic to determine if observed group mean differences are greater than what would be expected by chance. A higher F-statistic typically indicates a more significant difference between groups.

  3. Assumptions:

    • The data are normally distributed within each group.

    • Homogeneity of variances (each group has a similar variance).

    • Independence of observations.

Types of ANOVA

  1. One-Way ANOVA:

    • Compares means across one factor (e.g., test scores across different teaching methods).

    • Useful when testing one categorical independent variable with more than two levels.

  2. Two-Way ANOVA:

    • Compares means across two factors (e.g., the effect of both teaching methods and student gender on test scores).

    • Can also test for interactions between the two factors.

  3. Repeated Measures ANOVA:

    • Used when the same subjects are measured multiple times under different conditions (e.g., testing the same group’s scores before and after training).

    • Accounts for correlations within subjects.

  4. MANOVA (Multivariate Analysis of Variance):

    • Used when there are multiple dependent variables.

    • Allows testing of how independent variables influence several outcomes at once.

Make sure know what question is being asked and how to interpret the results. t test compares 2 means, if it was just 2 then t test , if more than 2 need to use anova. Underlying principles same-assume no difference in means

Useful plots are boxplots and density plots. Densities suggest that means are not equal. So we are asking if the difference we are seeing is significant. The way we answer this is by assuming the opposite-h0 in all 3 continents is the same, no difference. If can show that this unlikely to be true then can accept the alternative hypothesis that the difference we are seeing is real. If there were no difference then how likely (what is the prob) that we would get sample that shows the diff we are seeing-if very unlikely then can reject and accept the alternative hypothesis.

decide on significance level (p hacking is bad science) before start

gapdata <- gapminder %>% 
  filter(year == 2007 &
           continent %in% c("Americas", "Europe", "Asia"))%>%
      select(continent, lifeExp)

Arranging this data to show what we want:

gapdata %>%
  group_by(continent)%>%
  summarise(Mean_life = mean(lifeExp))%>%
  arrange(Mean_life)
# A tibble: 3 × 2
  continent Mean_life
  <fct>         <dbl>
1 Asia           70.7
2 Americas       73.6
3 Europe         77.6

Is the difference real? If the p value is less than 0.05 then will reject h0.

So creating the Anova model:

gapdata %>%
  aov(lifeExp ~ continent, data = .,) %>%
  summary()
            Df Sum Sq Mean Sq F value   Pr(>F)    
continent    2  755.6   377.8   11.63 3.42e-05 ***
Residuals   85 2760.3    32.5                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov_model <- gapdata %>% 
  aov(lifeExp ~ continent, data = .)

So p tiny so can reject p0, so means are stat diff.

However don’t know which one is diff to the other 2 (can say not equal) but don’t which ones are diff. so teasing that out a bit further:

gapdata %>%
    aov(lifeExp ~ continent, data = .) %>%
  TukeyHSD() #takes each combination and consider each 2
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lifeExp ~ continent, data = .)

$continent
                     diff        lwr        upr     p adj
Asia-Americas   -2.879635 -6.4839802  0.7247099 0.1432634
Europe-Americas  4.040480  0.3592746  7.7216854 0.0279460
Europe-Asia      6.920115  3.4909215 10.3493088 0.0000189

Asia and America not statistically significantly difference but Europe and America are, and Euro-Asia also is as p <0.05

Week 7 Lecture session

Means testing is the only one of the 4 families that has parametric and non-p tests.

Used for data where x axis is categorical and y response variable is numerical (can be discrete or continuous)

One sample T test

Does the mean of the sample differ from the known or expected mean? Need to check assumptions before applying a parametric mean test (continuous, normally distributed, random sample and enough data)

Eg using penguin data-all of this work has been done on an r script, not sure if worth transcribing over as largely repeat of pre and post lecture sessions work

Post Lecture session taken from Statistical Tools for high throughput data analysis website

1. Comparing one-sample mean to a standard known mean

One-sample T-test (parametric)

A one-sample t-test is used to compare the mean of one sample to a known standard (or theoretical/hypothetical) mean (μ). Generally, the theoretical mean comes from:

  • a previous experiment. For example, compare whether the mean weight of mice differs from 200 mg, a value determined in a previous study.

  • or from an experiment where you have control and treatment conditions. If you express your data as “percent of control”, you can test whether the average value of treatment condition differs significantly from 100.

Note that, one-sample t-test can be used only, when the data are normally distributed which can be checked using Shapiro-Wilk test.

Research questions and statistical hypotheses-typical research questions are:

  • whether the mean (m) of the sample is equal to/less then/greater than the theoretical mean (mu)

We can then define the corresponding null and alternate hypothesis (and if they are one or two tailed).

Formula of one-sample t-test-can be calculated as follows:

t=m−μs/n‾√ where,

  • m is the sample mean

  • n is the sample size

  • s is the sample standard deviation with n−1 degrees of freedom

  • μ is the theoretical value

We can compute the p-value corresponding to the absolute value of the t-test statistics (|t|) for the degrees of freedom (df): df=n−1.

How to interpret the results?

If the p-value is inferior or equal to the significance level 0.05, we can reject the null hypothesis and accept the alternative hypothesis. In other words, we conclude that the sample mean is significantly different from the theoretical mean.

Visualize your data and compute one-sample t-test in R

Install ggpubr R package for data visualization

install.packages("ggpubr")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages

R function to compute one-sample t-test

To perform one-sample t-test, the R function t.test() can be used as follow:

t.test(x, mu = 0, alternative = "two.sided")
  • x: a numeric vector containing your data values

  • mu: the theoretical mean. Default is 0 but you can change it.

  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.

    Example:

    Here, we’ll use an example data set containing the weight of 10 mice.

    We want to know, if the average weight of the mice differs from 25g?

    set.seed(1234)
    my_data <- data.frame(
      name = paste0(rep("M_", 10), 1:10),
      weight = round(rnorm(10, 20, 2), 1)
    )
    # Print the first 10 rows of the data to check it
    head(my_data, 10)
       name weight
    1   M_1   17.6
    2   M_2   20.6
    3   M_3   22.2
    4   M_4   15.3
    5   M_5   20.9
    6   M_6   21.0
    7   M_7   18.9
    8   M_8   18.9
    9   M_9   18.9
    10 M_10   18.2
# Statistical summaries of weight
summary(my_data$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20 
#Visualize your data using box plots
library(ggpubr)
ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,
          ggtheme = theme_minimal())

Preliminary test to check one-sample t-test assumptions:

  1. Is this a large sample? - No, because n < 30.

  2. Since the sample size is not large enough (less than 30, central limit theorem), we need to check whether the data follow a normal distribution.

How to check the normality? Using the Shapiro-Wilk normality test and to look at the normality plot with h0 set as the data are normally distributed

shapiro.test(my_data$weight) # => p-value = 0.6993 so from the output, the p-value isdgreater than the significance level 0.05 implying that the distribution of the data are not significantly different from normal distribtion. In other words, we can assume the normality.

    Shapiro-Wilk normality test

data:  my_data$weight
W = 0.9526, p-value = 0.6993

Could also do visual inspection of the normality using q-q plots.

Then Compute one-sample t-test:

We want to know, if the average weight of the mice differs from 25g (two-tailed test)?

# One-sample t-test
res <- t.test(my_data$weight, mu = 25)
# Printing the results
res 

    One Sample t-test

data:  my_data$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 17.8172 20.6828
sample estimates:
mean of x 
    19.25 

In the result above :

  • t is the t-test statistic value (t = -9.078),

  • df is the degrees of freedom (df= 9),

  • p-value is the significance level of the t-test (p-value = 7.95310^{-6}).

  • conf.int is the confidence interval of the mean at 95% (conf.int = [17.8172, 20.6828]);

  • sample estimates is the mean value of the sample (mean = 19.25).

    The p-value of the test is 7.95310^{-6}, which is less than the significance level alpha = 0.05. We can conclude that the mean weight of the mice is significantly different from 25g with ap-value = 7.95310^{-6}

    Note that:

    • if you want to test whether the mean weight of mice is less/more than 25g (one-tailed test), type this:
    t.test(my_data$weight, mu = 25,
                  alternative = "less/more")

Access to the values returned by t.test() function-the result is a list containing:

  • statistic: the value of the t test statistics

  • parameter: the degrees of freedom for the t test statistics

  • p.value: the p-value for the test

  • conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.

  • estimate: the means of the two groups being compared (in the case of independent t test) or difference in means (in the case of paired t test).

The format of the R code to use for getting these values is as follow:

# printing the p-value
res$p.value
[1] 7.953383e-06
# printing the mean
res$estimate
mean of x 
    19.25 
# printing the confidence interval
res$conf.int
[1] 17.8172 20.6828
attr(,"conf.level")
[1] 0.95

2. What’s a one-sample Wilcoxon signed rank test?

The one-sample Wilcoxon signed rank test is a non-parametric alternative to one-sample t-test when the data cannot be assumed to be normally distributed. It’s used to determine whether the median of the sample is equal to a known standard value (i.e. theoretical value). Note that, the data should be distributed symmetrically around the median. In other words, there should be roughly the same number of values above and below the median. 

Research questions and statistical hypotheses-typical research questions are:

  • whether the median (m) of the sample is equal to/greater than/less than the theoretical value (m0)?

In statistics, we can define the corresponding null hypothesis (H0) as follow:

  • H0:m=m0/H0:m≤m0/H0:m≥m0 and the corresponding alternative hypotheses (Ha) are as follow:
  1. Ha:m≠m0 (different)

  2. Ha:m>m0 (greater)

  3. Ha:m<m0 (less)

    R function to compute one-sample Wilcoxon test

    To perform one-sample Wilcoxon-test, the R function

    wilcox.test() can be used as follow:

    {wilcox.test(x, mu = 0, alternative = "two.sided")}

    where:

    • x: a numeric vector containing your data values

    • mu: the theoretical mean/median value. Default is 0 but you can change it.

    • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.

    Eg: Here, we’ll use an example data set containing the weight of 10 mice. We want to know, if the median weight of the mice differs from 25g?

    set.seed(1234)
    my_data <- data.frame(
      name = paste0(rep("M_", 10), 1:10),
      weight = round(rnorm(10, 20, 2), 1)
    )

Check your data:

# Print the first 10 rows of the data
head(my_data, 10)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2
# Statistical summaries of weight
summary(my_data$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20 

Visualize your data using box plots:

library(ggpubr)
ggboxplot(my_data$weight, 
          ylab = "Weight (g)", xlab = FALSE,)

          ggtheme = theme_minimal()

Compute one-sample Wilcoxon test-we want to know if the average weight of the mice differs from 25g (two-tailed test)?

# One-sample wilcoxon test
res <- wilcox.test(my_data$weight, mu = 25)
# Printing the results
res 

    Wilcoxon signed rank test with continuity correction

data:  my_data$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
# print only the p-value
res$p.value
[1] 0.005793045

The p-value of the test is 0.005793, which is less than the significance level alpha = 0.05. We can reject the null hypothesis and conclude that the average weight of the mice is significantly different from 25g with a p-value = 0.005793.

Note that:

  • if you want to test whether the median weight of mice is less than 25g (one-tailed test), type this:
wilcox.test(my_data$weight, mu = 25,
              alternative = "less")
  • Or, if you want to test whether the median weight of mice is greater than 25g (one-tailed test), type this:
wilcox.test(my_data$weight, mu = 25,
              alternative = "greater")
              

3. Comparing the means of two independent groups

3.1 Unpaired two samples t-test (parametric)

The unpaired two-samples t-test is used to compare the mean of two independent groups. For example, suppose that we have measured the weight of 100 individuals: 50 women (group A) and 50 men (group B). We want to know if the mean weight of women (mA) is significantly different from that of men (mB). In this case, we have two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it’s possible to use an independent t-test to evaluate whether the means are different.

Note that, unpaired two-samples t-test can be used only under certain conditions:

  • when the two groups of samples (A and B), being compared, are normally distributed. This can be checked using Shapiro-Wilk test.

  • and when the variances of the two groups are equal. This can be checked using F-test.

Typical research questions are: whether the mean of group A is equal to /less than /greater than the mean of group B, allowing h0 to be set as ma=mb.

R function to compute unpaired two-samples t-test

To perform two-samples t-test comparing the means of two independent samples (x & y), the R function t.test() can be used as follow:

t.test(x, y, alternative = “two.sided”, var.equal = FALSE) where:

  • x,y: numeric vectors

  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.

  • var.equal: a logical variable indicating whether to treat the two variances as being equal. If TRUE then the pooled variance is used to estimate the variance otherwise the Welch test is used.

Example: Here, we’ll use an example data set, which contains the weight of 18 individuals (9 women and 9 men). We want to know, if the average women’s weight differs from the average men’s weight?

# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4) 
# Create a data frame
my_data <- data.frame( 
                group = rep(c("Woman", "Man"), each = 9),
                weight = c(women_weight,  men_weight)
                )

check the data:

print(my_data)
   group weight
1  Woman   38.9
2  Woman   61.2
3  Woman   73.3
4  Woman   21.8
5  Woman   63.4
6  Woman   64.6
7  Woman   48.4
8  Woman   48.8
9  Woman   48.5
10   Man   67.8
11   Man   60.0
12   Man   63.4
13   Man   76.0
14   Man   89.4
15   Man   73.3
16   Man   67.3
17   Man   61.3
18   Man   62.4

Compute summary statistics by groups:

library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 2 × 4
  group count  mean    sd
  <chr> <int> <dbl> <dbl>
1 Man       9  69.0  9.38
2 Woman     9  52.1 15.6 

Visualise the data using box plots:

# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
        ylab = "Weight", xlab = "Groups")

Preliminary test to check independent t-test assumptions:

Assumption 1: Are the two samples independents?

Yes, since the samples from men and women are not related.

Assumtion 2: Are the data from each of the 2 groups follow a normal distribution?

Use Shapiro-Wilk normality test with the null hypothesis set as the data are normally distributed (alternative hypothesis therefore the data are not normally distributed)

We’ll use the functions with() and shapiro.test() to compute Shapiro-Wilk test for each group of samples:

# Shapiro-Wilk normality test for Men's weights
with(my_data, shapiro.test(weight[group == "Man"]))# p = 0.1

    Shapiro-Wilk normality test

data:  weight[group == "Man"]
W = 0.86425, p-value = 0.1066
# Shapiro-Wilk normality test for Women's weights
with(my_data, shapiro.test(weight[group == "Woman"])) # p = 0.6

    Shapiro-Wilk normality test

data:  weight[group == "Woman"]
W = 0.94266, p-value = 0.6101

From the output, the two p-values are greater than the significance level 0.05 implying that the distribution of the data are not significantly different from the normal distribution. In other words, we can assume the normality. Note that, if the data are not normally distributed, it’s recommended to use the non parametric two-samples Wilcoxon rank test.

Assumption 3. Do the two populations have the same variances? We’ll use F-test to test for homogeneity in variances. This can be performed with the function var.test() as follow:

res.ftest <- var.test(weight ~ group, data = my_data)
res.ftest

    F test to compare two variances

data:  weight by group
F = 0.36134, num df = 8, denom df = 8, p-value = 0.1714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.08150656 1.60191315
sample estimates:
ratio of variances 
         0.3613398 

The p-value of F-test is p = 0.1713596. It’s greater than the significance level alpha = 0.05. In conclusion, there is no significant difference between the variances of the two sets of data. Therefore, we can use the classic t-test which assume equality of the two variances.

Compute unpaired two-samples t-test

Question : Is there any significant difference between women and men weights?

1) Compute independent t-test - Method 1: The data are saved in two different numeric vectors.

# Compute t-test
res <- t.test(women_weight, men_weight, var.equal = TRUE)
res

    Two Sample t-test

data:  women_weight and men_weight
t = -2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -29.748019  -4.029759
sample estimates:
mean of x mean of y 
 52.10000  68.98889 

2) Compute independent t-test - Method 2: The data are saved in a data frame.

# Compute t-test
res <- t.test(weight ~ group, data = my_data, var.equal = TRUE)
res

    Two Sample t-test

data:  weight by group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
95 percent confidence interval:
  4.029759 29.748019
sample estimates:
  mean in group Man mean in group Woman 
           68.98889            52.10000 

Both methods give the same result. In the result above :

  • t is the t-test statistic value (t = 2.784),

  • df is the degrees of freedom (df= 16),

  • p-value is the significance level of the t-test (p-value = 0.01327).

  • conf.int is the confidence interval of the mean at 95% (conf.int = [4.0298, 29.748]);

  • sample estimates is he mean value of the sample (mean = 68.9888889, 52.1).

Note that:

  • if you want to test whether the average men’s weight is less than the average women’s weight, type this:
t.test(weight ~ group, data = my_data,
        var.equal = TRUE, alternative = "less")
  • Or, if you want to test whether the average men’s weight is greater than the average women’s weight, type this
t.test(weight ~ group, data = my_data,
        var.equal = TRUE, alternative = "greater")

Interpretation of the result

The p-value of the test is 0.01327, which is less than the significance level alpha = 0.05. We can conclude that men’s average weight is significantly different from women’s average weight with a p-value 0.01327.

Access to the values returned by t.test() function:

The result of t.test() function is a list containing the following components:

  • statistic: the value of the t test statistics

  • parameter: the degrees of freedom for the test statistics

  • p.value: the p-value for the test

  • conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.

  • estimate: the means of the two groups being compared (in the case of independent t test) or difference in means (in the case of paired t test).

The format of the R code to use for getting these values is as follow:

# printing the p-value
res$p.value
[1] 0.0132656
# printing the mean
res$estimate
  mean in group Man mean in group Woman 
           68.98889            52.10000 
# printing the confidence interval
res$conf.int
[1]  4.029759 29.748019
attr(,"conf.level")
[1] 0.95
3.2 Unpaired two-samples Wilcoxon test (non-parametric)

The unpaired two-samples Wilcoxon test (also known as Wilcoxon rank sum test or Mann-Whitney test) is a non-parametric alternative to the unpaired two-samples t-test, which can be used to compare two independent groups of samples. It’s used when your data are not normally distributed.

Visualize your data and compute Wilcoxon test in R:

To perform two-samples Wilcoxon test comparing the means of two independent samples (x & y), the R function wilcox.test() can be used as follows:

wilcox.test(x, y, alternative = “two.sided”) where x,y are

numeric vectors and alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.

Example:Here, we’ll use an example data set, which contains the weight of 18 individuals (9 women and 9 men). We want to know, if the median women’s weight differs from the median men’s weight?

# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4) 
# Create a data frame
my_data <- data.frame( 
                group = rep(c("Woman", "Man"), each = 9),
                weight = c(women_weight,  men_weight)
                )

Checking the data and looking at summary stats:

print(my_data)
   group weight
1  Woman   38.9
2  Woman   61.2
3  Woman   73.3
4  Woman   21.8
5  Woman   63.4
6  Woman   64.6
7  Woman   48.4
8  Woman   48.8
9  Woman   48.5
10   Man   67.8
11   Man   60.0
12   Man   63.4
13   Man   76.0
14   Man   89.4
15   Man   73.3
16   Man   67.3
17   Man   61.3
18   Man   62.4
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )
# A tibble: 2 × 4
  group count median   IQR
  <chr> <int>  <dbl> <dbl>
1 Man       9   67.3  10.9
2 Woman     9   48.8  15  

Visualising the data using box plots:

# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
          ylab = "Weight", xlab = "Groups")

Compute unpaired two-samples Wilcoxon test

Question : Is there any significant difference between women and men weights?

1) Compute two-samples Wilcoxon test - Method 1: The data are saved in two different numeric vectors.

res <- wilcox.test(women_weight, men_weight)
res

    Wilcoxon rank sum test with continuity correction

data:  women_weight and men_weight
W = 15, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0

2) Compute two-samples Wilcoxon test - Method 2: The data are saved in a data frame.

res <- wilcox.test(weight ~ group, data = my_data,
                   exact = FALSE)
res

    Wilcoxon rank sum test with continuity correction

data:  weight by group
W = 66, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0

As you can see, the two methods give the same results. The p-value of the test is 0.02712, which is less than the significance level alpha = 0.05. We can conclude that men’s median weight is significantly different from women’s median weight with a p-value = 0.02712.

Note that:

  • if you want to test whether the median men’s weight is less than the median women’s weight, type this:
wilcox.test(weight ~ group, data = my_data, 
        exact = FALSE, alternative = "less")
  • Or, if you want to test whether the median men’s weight is greater than the median women’s weight, type this
wilcox.test(weight ~ group, data = my_data,
        exact = FALSE, alternative = "greater")

4 Comparing the means of paired samples

4.1 Paired samples t-test (parametric)

The paired samples t-test is used to compare the means between two related groups of samples. In this case, you have two values (i.e., pair of values) for the same samples. This article describes how to compute paired samples t-test using R software. As an example of data, 20 mice received a treatment X during 3 months. We want to know whether the treatment X has an impact on the weight of the mice.

To answer this question, the weight of the 20 mice has been measured before and after the treatment. This gives us 20 sets of values before treatment and 20 sets of values after treatment from measuring twice the weight of the same mice. In such situations paired t-test can be used to compare the mean weights before and after treatment.

Paired t-test analysis is performed as follow:

  1. Calculate the difference (d) between each pair of value

  2. Compute the mean (m) and the standard deviation (s) of d

  3. Compare the average difference to 0. If there is any significant difference between the two pairs of samples, then the mean of d (m) is expected to be far from 0.

Paired t-test can be used only when the difference d is normally distributed. This can be checked using Shapiro-Wilk test.

Research questions and statistical hypotheses-typical research questions are whether mean difference is equal to/less than/greater than 0. Allows us to define h0 as m=0 etc.

Visualize your data and compute paired t-test in RsR function to compute paired t-test

To perform paired samples t-test comparing the means of two paired samples (x & y), the R function t.test() can be used as follows:

t.test(x, y, paired = TRUE, alternative = "two.sided") 
where
  • x,y: numeric vectors

  • paired: a logical value specifying that we want to compute a paired t-test

  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.

    # Data in two numeric vectors
    # ++++++++++++++++++++++++++
    # Weight of the mice before treatment
    before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
    # Weight of the mice after treatment
    after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
    # Create a data frame
    my_data <- data.frame( 
                    group = rep(c("before", "after"), each = 10),
                    weight = c(before,  after)
                    )

We want to know if there is a significant difference?

Checking and summarising the data:

# Print all data
print(my_data)
    group weight
1  before  200.1
2  before  190.9
3  before  192.7
4  before  213.0
5  before  241.4
6  before  196.9
7  before  172.2
8  before  185.5
9  before  205.2
10 before  193.7
11  after  392.9
12  after  393.2
13  after  345.1
14  after  393.0
15  after  434.0
16  after  427.9
17  after  422.0
18  after  383.9
19  after  392.3
20  after  352.2
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 2 × 4
  group  count  mean    sd
  <chr>  <int> <dbl> <dbl>
1 after     10  394.  29.4
2 before    10  199.  18.5

and visualising the data:

# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
          order = c("before", "after"),
          ylab = "Weight", xlab = "Groups")

Box plots show you the increase, but lose the paired information. You can use the function plot.paired() [in pairedData package] to plot paired data (“before - after” plot).

install.packages("PairedData")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(PairedData)
# Subset weight data before treatment
before <- subset(my_data,  group == "before", weight,
                 drop = TRUE)
# subset weight data after treatment
after <- subset(my_data,  group == "after", weight,
                 drop = TRUE)
# Plot paired data
library(PairedData)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()

Preliminary tests to check paired t-test assumption:s

Assumption 1: Are the two samples paired? 

Yes, since the data have been collected from measuring twice the weight of the same mice.

Assumption 2: Is this a large sample?

No, because n < 30. Since the sample size is not large enough (less than 30), we need to check whether the differences of the pairs follow a normal distribution using the Shapiro-Wilk normality test with the Null hypothesis set as the data are normally distributed.

# compute the difference
d <- with(my_data, 
        weight[group == "before"] - weight[group == "after"])
# Shapiro-Wilk normality test for the differences
shapiro.test(d) # => p-value = 0.6141

    Shapiro-Wilk normality test

data:  d
W = 0.94536, p-value = 0.6141

From the output, the p-value is greater than the significance level 0.05 implying that the distribution of the differences (d) are not significantly different from normal distribution. In other words, we can assume the normality. Note that, if the data are not normally distributed, it’s recommended to use the non parametric paired two-samples Wilcoxon test.

Compute paired samples t-test

Question : Is there any significant changes in the weights of mice after treatment?

1) Compute paired t-test - Method 1: The data are saved in two different numeric vectors.

# Compute t-test
res <- t.test(before, after, paired = TRUE)
res

    Paired t-test

data:  before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -215.5581 -173.4219
sample estimates:
mean difference 
        -194.49 

2) Compute paired t-test - Method 2: The data are saved in a data frame but could not get this method to work (as kept getting error message that could not use paired in formula method) so have removed as a chunk but left included for completeness:

res <- t.test(weight ~ group, data = my_data, paired = TRUE)

res

Both methods give the same result. In the result above :

  • t is the t-test statistic value (t = 20.88),

  • df is the degrees of freedom (df= 9),

  • p-value is the significance level of the t-test (p-value = 6.210^{-9}).

  • conf.int is the confidence interval (conf.int) of the mean differences at 95% is also shown (conf.int= [173.42, 215.56])

  • sample estimates is the mean differences between pairs (mean = 194.49).

Note that:

  • if you want to test whether the average weight before treatment is less than the average weight after treatment, type this:
t.test(weight ~ group, data = my_data, paired = TRUE,
        alternative = "less")
  • Or, if you want to test whether the average weight before treatment is greater than the average weight after treatment, type this
t.test(weight ~ group, data = my_data, paired = TRUE,
       alternative = "greater")

Interpretation of the result

The p-value of the test is 6.210^{-9}, which is less than the significance level alpha = 0.05. We can then reject null hypothesis and conclude that the average weight of the mice before treatment is significantly different from the average weight after treatment with a p-value = 6.210^{-9}.

Access to the values returned by t.test() function

The result of t.test(  function is a list containing the following components:

  • statistic: the value of the t test statistics

  • parameter: the degrees of freedom for the t test statistics

  • p.value: the p-value for the test

  • conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.

  • estimate: the means of the two groups being compared (in the case of independent t test) or difference in means (in the case of paired t test).

The format of the R code to use for getting these values is as follow:

# printing the p-value
res$p.value
[1] 6.200298e-09
# printing the mean
res$estimate
mean difference 
        -194.49 
# printing the confidence interval
res$conf.int
[1] -215.5581 -173.4219
attr(,"conf.level")
[1] 0.95
4.2 Paired samples Wilcoxon test (non-parametric)

The paired samples Wilcoxon test (also known as Wilcoxon signed-rank test) is a non-parametric alternative to paired t-test used to compare paired data. It’s used when your data are not normally distributed. Differences between paired samples should be distributed symmetrically around the median.

Visualise your data and compute paired samples using the R functionwilcox.test() as follows:

wilcox.test(x, y, paired = TRUE, alternative = "two.sided")
  • where:
    x,y: numeric vectors

  • paired: a logical value specifying that we want to compute a paired Wilcoxon test

  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.

    Here, we’ll use an example data set, which contains the weight of 10 mice before and after a treatment.

    # Data in two numeric vectors
    # ++++++++++++++++++++++++++
    # Weight of the mice before treatment
    before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
    # Weight of the mice after treatment
    after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
    # Create a data frame
    my_data <- data.frame( 
                    group = rep(c("before", "after"), each = 10),
                    weight = c(before,  after)
                    )

    We want to know, if there is any significant difference in the median weights before and after treatment?

    Checking the data and computing summary statistics (median and inter-quartile range (IQR)) by groups:

    # Print all data
    print(my_data)
        group weight
    1  before  200.1
    2  before  190.9
    3  before  192.7
    4  before  213.0
    5  before  241.4
    6  before  196.9
    7  before  172.2
    8  before  185.5
    9  before  205.2
    10 before  193.7
    11  after  392.9
    12  after  393.2
    13  after  345.1
    14  after  393.0
    15  after  434.0
    16  after  427.9
    17  after  422.0
    18  after  383.9
    19  after  392.3
    20  after  352.2
    group_by(my_data, group) %>%
      summarise(
        count = n(),
        median = median(weight, na.rm = TRUE),
        IQR = IQR(weight, na.rm = TRUE)
      )
    # A tibble: 2 × 4
      group  count median   IQR
      <chr>  <int>  <dbl> <dbl>
    1 after     10   393.  28.8
    2 before    10   195.  12.6

    Visualise the data using boxplots:

    # Plot weight by group and color by group
    library("ggpubr")
    ggboxplot(my_data, x = "group", y = "weight", 
              color = "group", palette = c("#00AFBB", "#E7B800"),
              order = c("before", "after"),
              ylab = "Weight", xlab = "Groups")

    Box plots show you the increase, but lose the paired information. You can use the function plot.paired() [in pairedDatapackage] to plot paired data (“before - after” plot).

    # Subset weight data before treatment
    before <- subset(my_data,  group == "before", weight,
                     drop = TRUE)
    # subset weight data after treatment
    after <- subset(my_data,  group == "after", weight,
                     drop = TRUE)
    # Plot paired data
    library(PairedData)
    pd <- paired(before, after)
    plot(pd, type = "profile") + theme_bw()

Compute paired-sample Wilcoxon test

Question : Is there any significant changes in the weights of mice before after treatment?

1) Compute paired Wilcoxon test - Method 1: The data are saved in two different numeric vectors.

res <- wilcox.test(before, after, paired = TRUE)
res

    Wilcoxon signed rank exact test

data:  before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0

2) Compute paired Wilcoxon-test - Method 2: The data are saved in a data frame. As previously this method did not work and keeps giving the same error message that cannot use paired in formula method so have removed it from chunk although left in for completeness sake:

Compute t-test

res <- wilcox.test(weight ~ group, data = my_data, paired = TRUE) res

As you can see, the two methods give the same results. The p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment with a p-value = 0.001953.

Note that:

  • if you want to test whether the median weight before treatment is less than the median weight after treatment, type this:
wilcox.test(weight ~ group, data = my_data, paired = TRUE,
        alternative = "less")
  • Or, if you want to test whether the median weight before treatment is greater than the median weight after treatment, type this
wilcox.test(weight ~ group, data = my_data, paired = TRUE,
       alternative = "greater")

5 Comparing the means of more than two groups

5.1 One-way ANOVA test

An extension of independent two-samples t-test for comparing means in a situation where there are more than two groups. The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of independent two-samples t-test for comparing means in a situation where there are more than two groups. In one-way ANOVA, the data is organized into several groups base on one single grouping variable (also called factor variable).

ANOVA test hypotheses:

  • Null hypothesis: the means of the different groups are the same

  • Alternative hypothesis: At least one sample mean is not equal to the others.

Note that, if you have only two groups, you can use t-test. In this case the F-test and the t-test are equivalent.

Assumptions-the ANOVA test can be applied only when:

  • The observations are obtained independently and randomly from the population defined by the factor levels

  • The data of each factor level are normally distributed.

  • These normal populations have a common variance. (Levene’s test can be used to check this.)

    Visualise and compute-for example we’ll use the built-in R data set named PlantGrowth. It contains the weight of plants obtained under a control and two different treatment conditions:

    my_data <- PlantGrowth

Check the data-to have an idea of what the data looks like, we can use the the function sample_n()[in dplyr package]. The sample_n() function randomly picks a few of the observations in the data frame to print out:

# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)
   weight group
1    6.15  trt2
2    3.83  trt1
3    5.29  trt2
4    5.12  trt2
5    4.50  ctrl
6    4.17  trt1
7    5.87  trt1
8    5.33  ctrl
9    5.26  trt2
10   4.61  ctrl

In R terminology, the column “group” is called factor and the different categories (“ctr”, “trt1”, “trt2”) are named factor levels. The levels are ordered alphabetically.

# Show the levels
levels(my_data$group)
[1] "ctrl" "trt1" "trt2"

Compute summary statistics by groups - count, mean, sd:

group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 3 × 4
  group count  mean    sd
  <fct> <int> <dbl> <dbl>
1 ctrl     10  5.03 0.583
2 trt1     10  4.66 0.794
3 trt2     10  5.53 0.443

Visualise the data:

# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "group", y = "weight", 
       add = c("mean_se", "jitter"), 
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

Compute one-way ANOVA test using the function aov()-we want to know if there is any significant difference between the average weights of plants in the 3 experimental conditions:

# Compute the analysis of variance
res.aov <- aov(weight ~ group, data = my_data)
# Summary of the analysis
summary(res.aov)
            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output includes the columns F value and Pr(>F) corresponding to the p-value of the test.

Interpret the result of one-way ANOVA tests:

As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the groups highlighted with “*” in the model summary.

Multiple pairwise-comparison between the means of groups:

In one-way ANOVA test, a significant p-value indicates that some of the group means are different, but we don’t know which pairs of groups are different. It’s possible to perform multiple pairwise-comparison, to determine if the mean difference between specific pairs of group are statistically significant using the Tukey multiple pairwise-comparisons. As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences, R function: TukeyHSD()) for performing multiple pairwise-comparison between the means of groups.

The function TukeyHD() takes the fitted ANOVA as an argument:

TukeyHSD(res.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ group, data = my_data)

$group
            diff        lwr       upr     p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1  0.865  0.1737839 1.5562161 0.0120064
  • diff: difference between means of the two groups

  • lwrupr: the lower and the upper end point of the confidence interval at 95% (default)

  • p adj: p-value after adjustment for the multiple comparisons.

It can be seen from the output, that only the difference between trt2 and trt1 is significant with an adjusted p-value of 0.012.

Check ANOVA assumptions: test validity?

The ANOVA test assumes that, the data are normally distributed and the variance across groups are homogeneous. We can check that with some diagnostic plots. Check the homogeneity of variance assumption with the residuals versus fits plot. In the plot below, there is no evident relationships between residuals and fitted values (the mean of each groups) so we can assume the homogeneity of variances.

# 1. Homogeneity of variances
plot(res.aov, 1)

Points 17, 15, 4 are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions. It’s also possible to use Levene’s test to check the homogeneity of variances.

library(car)
leveneTest(weight ~ group, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  1.1192 0.3412
      27               

From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.

Relaxing the homogeneity of variance assumption:

The classical one-way ANOVA test requires an assumption of equal variances for all groups. In our example, the homogeneity of variance assumption turned out to be fine as the Levene test is not significant. But how do we save our ANOVA test, in a situation where the homogeneity of variance assumption is violated?

An alternative procedure the Welch one-way test, which does not require this assumption can be used with the function oneway.test().

  • ANOVA test with no assumption of equal variances

    oneway.test(weight ~ group, data = my_data)
    
        One-way analysis of means (not assuming equal variances)
    
    data:  weight and group
    F = 5.181, num df = 2.000, denom df = 17.128, p-value = 0.01739

    Check the normality assumption using the normality plot of residuals. In the plot below, the quantiles of the residuals are plotted against the quantiles of the normal distribution. A 45-degree reference line is also plotted. The normal probability plot of residuals is used to check the assumption that the residuals are normally distributed. It should approximately follow a straight line.

    # 2. Normality
    plot(res.aov, 2)

As all the points fall approximately along this reference line, we can assume normality.

The conclusion above, is supported by the Shapiro-Wilk test on the ANOVA residuals (W = 0.96, p = 0.6) which finds no indication that normality is violated.

# Extract the residuals
aov_residuals <- residuals(object = res.aov )
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )

    Shapiro-Wilk normality test

data:  aov_residuals
W = 0.96607, p-value = 0.4379

Note that, a non-parametric alternative to one-way ANOVA is Kruskal-Wallis rank sum test, which can be used when ANNOVA assumptions are not met.

kruskal.test(weight ~ group, data = my_data)

    Kruskal-Wallis rank sum test

data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
5.2 Two-Way ANOVA test

This is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable. The grouping variables are also known as factors. The different categories (groups) of a factor are called levels. The number of levels can vary between factors. The level combinations of factors are called cells.

  • When the sample sizes within cells are equal, we have the so-called balanced design. In this case the standard two-way ANOVA test can be applied.

  • When the sample sizes within each level of the independent variables are not the same (case of unbalanced designs), the ANOVA test should be handled differently.

Two-way ANOVA test hypotheses:

  1. There is no difference in the means of factor A

  2. There is no difference in means of factor B

  3. There is no interaction between factors A and B

The alternative hypothesis for cases 1 and 2 is: the means are not equal.

The alternative hypothesis for case 3 is: there is an interaction between A and B.

Assumptions of two-way ANOVA test:

Two-way ANOVA, like all ANOVA tests, assumes that the observations within each cell are normally distributed and have equal variances.

Compute two-way ANOVA test in R: balanced designs

Balanced designs correspond to the situation where we have equal sample sizes within levels of our independent grouping levels. Here, we’ll use the built-in R data set named ToothGrowth. It contains data from a study evaluating the effect of vitamin C on tooth growth in Guinea pigs. The experiment has been performed on 60 pigs, where each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, (orange juice or ascorbic acid (a form of vitamin C and coded as VC). Tooth length was measured and a sample of the data is shown below.

# Store the data in the variable my_data
my_data <- ToothGrowth

check the data:

# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)
    len supp dose
1  21.5   VC  2.0
2  17.3   VC  1.0
3  27.3   OJ  2.0
4  18.5   VC  2.0
5   8.2   OJ  0.5
6  26.4   OJ  1.0
7  25.8   OJ  1.0
8   5.2   VC  0.5
9   6.4   VC  0.5
10  9.4   OJ  0.5
# Check the structure
str(my_data)
'data.frame':   60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

From the output above, R considers “dose” as a numeric variable. We’ll convert it as a factor variable (i.e., grouping variable) as follows:

# Convert dose as a factor and recode the levels
# as "D0.5", "D1", "D2"
my_data$dose <- factor(my_data$dose, 
                  levels = c(0.5, 1, 2),
                  labels = c("D0.5", "D1", "D2"))
head(my_data)
   len supp dose
1  4.2   VC D0.5
2 11.5   VC D0.5
3  7.3   VC D0.5
4  5.8   VC D0.5
5  6.4   VC D0.5
6 10.0   VC D0.5

Question: We want to know if tooth length depends on supp and dose. So firstly we need to generate a frequency table:

table(my_data$supp, my_data$dose)
    
     D0.5 D1 D2
  OJ   10 10 10
  VC   10 10 10

We have 2x3 design cells with the factors being supp and dose and 10 subjects in each cell, that is we have a balanced design.

Visualising the data using Box plots and line plots to see group differences:

  • Box plot to plot the data grouped by the combinations of the levels of the two factors.

  • Two-way interaction plot, which plots the mean (or other summary) of the response for two-way combinations of factors, thereby illustrating possible interactions

# Box plot with multiple groups
# +++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
library("ggpubr")
ggboxplot(my_data, x = "dose", y = "len", color = "supp",
          palette = c("#00AFBB", "#E7B800"))

# Line plots with multiple groups
# +++++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "dose", y = "len", color = "supp",
       add = c("mean_se", "dotplot"),
       palette = c("#00AFBB", "#E7B800"))

Compute two-way ANOVA test:

We want to know if tooth length depends on supp and dose. The R function aov() can be used to answer this question. The function summary.aov() is then used to summarize the analysis of variance model.

res.aov2 <- aov(len ~ supp + dose, data = my_data)
summary(res.aov2)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4   14.02 0.000429 ***
dose         2 2426.4  1213.2   82.81  < 2e-16 ***
Residuals   56  820.4    14.7                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output includes the columns F value and Pr(>F) corresponding to the p-value of the test.

From the ANOVA table we can conclude that both supp and dose are statistically significant. dos  is the most significant factor variable. These results would lead us to believe that changing delivery methods (supp) or the dose of vitamin C, will impact significantly the mean tooth length.

Note the above fitted model is called additive model. It makes an assumption that the two factor variables are independent. If you think that these two variables might interact to create an synergistic effect, replace the plus symbol (+) by an asterisk (*), as follows:

# Two-way ANOVA with interaction effect
# These two calls are equivalent
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data)
summary(res.aov3)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4  15.572 0.000231 ***
dose         2 2426.4  1213.2  92.000  < 2e-16 ***
supp:dose    2  108.3    54.2   4.107 0.021860 *  
Residuals   54  712.1    13.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It can be seen that the two main effects (supp and dose) are statistically significant, as well as their interaction. Note that in the situation where the interaction is not significant you should use the additive model.

Interpret the results

From the ANOVA results, you can conclude the following, based on the p-values and a significance level of 0.05:

  • the p-value of supp is 0.000429 (significant), which indicates that the levels of supp are associated with significant different tooth length.

  • the p-value of dose is < 2e-16 (significant), which indicates that the levels of dose are associated with significant different tooth length.

  • the p-value for the interaction between supp*dose is 0.02 (significant), which indicates that the relationships between dose and tooth length depends on the supp method.

Computing some summary statistics:

group_by(my_data, supp, dose) %>%
  summarise(
    count = n(),
    mean = mean(len, na.rm = TRUE),
    sd = sd(len, na.rm = TRUE)
  )
# A tibble: 6 × 5
# Groups:   supp [2]
  supp  dose  count  mean    sd
  <fct> <fct> <int> <dbl> <dbl>
1 OJ    D0.5     10 13.2   4.46
2 OJ    D1       10 22.7   3.91
3 OJ    D2       10 26.1   2.66
4 VC    D0.5     10  7.98  2.75
5 VC    D1       10 16.8   2.52
6 VC    D2       10 26.1   4.80

Multiple pairwise-comparison between the means of groups

In ANOVA test, a significant p-value indicates that some of the group means are different, but we don’t know which pairs of groups are different. It’s possible to perform multiple pairwise-comparison, to determine if the mean difference between specific pairs of group are statistically significant.

As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences, R function: TukeyHSD()) for performing multiple pairwise-comparison between the means of groups. The function TukeyHD() takes the fitted ANOVA as an argument. We don’t need to perform the test for the “supp” variable because it has only two levels, which have been already proven to be significantly different by ANOVA test. Therefore, the Tukey HSD test will be done only for the factor variable “dose”.

TukeyHSD(res.aov3, which = "dose")
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = len ~ supp + dose + supp:dose, data = my_data)

$dose
          diff       lwr       upr   p adj
D1-D0.5  9.130  6.362488 11.897512 0.0e+00
D2-D0.5 15.495 12.727488 18.262512 0.0e+00
D2-D1    6.365  3.597488  9.132512 2.7e-06

It can be seen from the output, that all pairwise comparisons are significant with an adjusted p-value < 0.05.Check ANOVA assumptions: test validity?

ANOVA assumes that the data are normally distributed and the variance across groups are homogeneous. We can check that with some diagnostic plots.

The residuals versus fits plot is used to check the homogeneity of variances. In the plot below, there is no evident relationships between residuals and fitted values (the mean of each groups) so we can assume the homogeneity of variances.

# 1. Homogeneity of variances
plot(res.aov3, 1)

Points 32 and 23 are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions.

Can also use Levene’s test to check the homogeneity of variance:

library(car)
leveneTest(len ~ supp*dose, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  5  1.7086 0.1484
      54               

From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.

Check the normality assumption using the normality plot of the residuals. In the plot below, the quantiles of the residuals are plotted against the quantiles of the normal distribution. A 45-degree reference line is also plotted. The normal probability plot of residuals is used to verify the assumption that the residuals are normally distributed. The normal probability plot of the residuals should approximately follow a straight line.

# 2. Normality
plot(res.aov3, 2)

As all the points fall approximately along this reference line, we can assume normality. The conclusion above, is supported by the Shapiro-Wilk test on the ANOVA residuals (W = 0.98, p = 0.5) which finds no indication that normality is violated.

# Extract the residuals
aov_residuals <- residuals(object = res.aov3)
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )

    Shapiro-Wilk normality test

data:  aov_residuals
W = 0.98499, p-value = 0.6694

Compute two-way ANOVA test in R for unbalanced designs where there are unequal numbers of subjects in each group. There are three fundamentally different ways to run an ANOVA in an unbalanced design. They are known as Type-I, Type-II and Type-III sums of squares. To keep things simple, note thtt The recommended method are the Type-III sums of squares. The three methods give the same result when the design is balanced. However, when the design is unbalanced, they don’t give the same results.

The functionAnova() [in car package] can be used to compute two-way ANOVA test for unbalanced designs.

library(car)
my_anova <- aov(len ~ supp * dose, data = my_data)
Anova(my_anova, type = "III")
Anova Table (Type III tests)

Response: len
             Sum Sq Df F value    Pr(>F)    
(Intercept) 1750.33  1 132.730 3.603e-16 ***
supp         137.81  1  10.450  0.002092 ** 
dose         885.26  2  33.565 3.363e-10 ***
supp:dose    108.32  2   4.107  0.021860 *  
Residuals    712.11 54                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6 MANOVA test: Multivariate analysis of variance

In the situation where there multiple response variables you can test them simultaneously using a multivariate analysis of variance (MANOVA). For example, we may conduct an experiment where we give two treatments (A and B) to two groups of mice, and we are interested in the weight and height of mice. In that case, the weight and height of mice are two dependent variables, and our hypothesis is that both together are affected by the difference in treatment. A multivariate analysis of variance could be used to test this hypothesis.

Assumptions of MANOVA

MANOVA can be used in certain conditions:

  • The dependent variables should be normally distributed within groups. The R function mshapiro.test( )[in the mvnormtest package] can be used to perform the Shapiro-Wilk test for multivariate normality. This is useful in the case of MANOVA, which assumes multivariate normality.

  • Homogeneity of variances across the range of predictors.

  • Linearity between all pairs of dependent variables, all pairs of covariates, and all dependent variable-covariate pairs in each cell

Interpretation of MANOVA

If the global multivariate test is significant, we conclude that the corresponding effect (treatment) is significant. In that case, the next question is to determine if the treatment affects only the weight, only the height or both. In other words, we want to identify the specific dependent variables that contributed to the significant global effect. To answer this question, we can use one-way ANOVA (or univariate ANOVA) to examine separately each dependent variable.

Compute MANOVA in R-using the iris data set:

# Store the data in the variable my_data
my_data <- iris

check the data:

# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           5.2         3.5          1.5         0.2     setosa
2           5.7         2.6          3.5         1.0 versicolor
3           6.3         3.3          6.0         2.5  virginica
4           6.5         3.2          5.1         2.0  virginica
5           6.3         3.4          5.6         2.4  virginica
6           6.4         2.8          5.6         2.2  virginica
7           6.8         3.2          5.9         2.3  virginica
8           7.9         3.8          6.4         2.0  virginica
9           6.2         2.9          4.3         1.3 versicolor
10          7.1         3.0          5.9         2.1  virginica

Question: We want to know if there is any significant difference, in sepal and petal length, between the different species.

Compute the MANOVA test:

sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
# MANOVA test
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(res.man)
           Df Pillai approx F num Df den Df    Pr(>F)    
Species     2 0.9885   71.829      4    294 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look to see which differ
summary.aov(res.man)
 Response Sepal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals   147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Petal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals   147  27.22   0.185                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output above, it can be seen that the two variables are highly significantly different among Species.

7. Kruskal-Wallis test

Kruskal-Wallis test by rank is a non-parametric alternative to one-way ANOVA test, which extends the two-samples Wilcoxon test in the situation where there are more than two groups. It’s recommended when the assumptions of one-way ANOVA test are not met.

V/isualize your data and compute Kruskal-Wallis test in R

Eg. Using the built-in R data set named PlantGrowth. It contains the weight of plants obtained under a control and two different treatment conditions.

my_data <- PlantGrowth

checking the data:

# print the head of the file
head(my_data)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl

and showing the group levels:

# Show the group levels
levels(my_data$group)
[1] "ctrl" "trt1" "trt2"

and computing summary stats by group:

group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )
# A tibble: 3 × 6
  group count  mean    sd median   IQR
  <fct> <int> <dbl> <dbl>  <dbl> <dbl>
1 ctrl     10  5.03 0.583   5.15 0.743
2 trt1     10  4.66 0.794   4.55 0.662
3 trt2     10  5.53 0.443   5.44 0.467

and visualising the data box plots:

# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

and mean plots:

# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "group", y = "weight", 
       add = c("mean_se", "jitter"), 
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

computing the Kruskal-Wallis test:

We want to know if there is any significant difference between the average weights of plants in the 3 experimental conditions.The test can be performed using the function kruskal.test() as follow:

kruskal.test(weight ~ group, data = my_data)

    Kruskal-Wallis rank sum test

data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

Interpretation: as the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the treatment groups.

Week 8 Correlations

Pre-lecture session taken from the following Youtube video:

Pearson’s Correlation, Clearly Explained by Josh Starmer

and Chapter 5 “Correlation” from Making Sense of Data with R by Yi Shang.

The below is taken from a mixture of the lecture session and the above pre session info.

Plotting the relationship between 2 quantitative variables is best done using scatterplots (see Exercise 1 Week 4).

In statistics the relationship between 2 variables are often called correlation. To have a correlation we need 2 measurements from each case in the dataset. Correlation is very useful in the sense that it tells us if we can use one variable to predict another. What is used to make the prediction is called the predictor (independent) variable and what is being predicted is called the outcome (Dependent) variable. By convention on a scatter plot the independent variable goes on x axis, dependent on the y.

We have a positive correlation when the 2 variables tend to change in the same direction, and a negative correlation when the 2 variables tend to change in different directions. The strength or magnitude of a correlation is the likelihood that x and y change in accordance with each other. The direction of a correlation is not relevant to its strength (a negative correlation can be strong while a positive one can be weak and vv). The correlation coefficient (Pearson’s r) is an index to express the direction and strength of of the relationship (from 1 to -1 for any pair of variables with 1 and -1 being perfect strongest possible correlations). Remember that r values are not slopes-can have exactly the same slope with 3 different r values and likewise 3 different slopes with same r value.

Here’s a rough rule of thumb to interpret the strength of Pearson’s r:

The higher the r number that tighter are the plots around the line.

The df number or degrees of freedom is equal to the number of pairs we have in the date plus 2. The 95% confidence interval is a margin of error around the correlation coefficient and marks the extent of our uncertainty about this relationship (a wider margin means more uncertain and usually reflects small size of data).

The correlation coefficient tells us whether is is possible to predict one variable based on the other but it does not tell us how to make that prediction. r shows up only as the tightness of the dots clustering around the invisible straight line-it has nothing to do with how steep that slope is.

Note r only works on linear relationships. Also can be unduly influenced by outliers (skewed) so must look at the plot too!

The p value in correlation tests tells us the probability that randomly drawn dots will result in a similarly strong relationship or stronger. The smaller the p value the more confidence we have in the predictions we make with the line.

So as long as the correlation value is not 0 we can use the line to make inferences but our guesses become more refined the closer the correlation gets to 1 or -1 and our confidence in our inferences depends on the amount of data we have collected and the p value.

Remember correlation is not causation!

Some worked examples:

#loading required package
library(tidyverse)
library(Hmisc)
install.packages("dplyr")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(dplyr)

Using the Mid West car dataset, showing the 3 types of correlation test:

  1. Pearson’s correlation-assumption: normal distribution

    cor.test(mtcars$hp, mtcars$mpg) #looking at the hp and mpg columns
    
        Pearson's product-moment correlation
    
    data:  mtcars$hp and mtcars$mpg
    t = -6.7424, df = 30, p-value = 1.788e-07
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     -0.8852686 -0.5860994
    sample estimates:
           cor 
    -0.7761684 

Note that this gives a tiny p value with -0.77 correlation coefficient (r value) so strong negative correlation/relationship with a high level of confidence that the result isn’t just down to random chance-the probability of random data creating a similarly strong or stronger relationship is crazy small so can have good confidence that if we have a value on the x axis could use this to predict a value on the y axis.

  1. Spearmans correlation

    -assumption: non-normal distribution

    cor.test(mtcars$hp, mtcars$mpg, method = "spearman")
    
        Spearman's rank correlation rho
    
    data:  mtcars$hp and mtcars$mpg
    S = 10337, p-value = 5.086e-12
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
           rho 
    -0.8946646 

This one is reported as rho = -0.89 so a strong negative correlation with a tiny p value so highly likely that not due to random chance.

  1. Multiple correlations

    Using the code below it is easy to obtain all the r values for each combination of variables:

mtcars %>%
    dplyr::select(mpg:qsec) %>%
    cor() %>%
    round(., 2) ->res
  res
       mpg   cyl  disp    hp  drat    wt  qsec
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17
qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00

Visualisation

As ever visualising the data can be really helpful, for example using the corrplot package:

if (!require("corrplot")) install.packages("corrplot")
corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

and checking p values on the above data (although note that not give same result as above when used the Multiple correlation approach):

if (!require("Hmisc")) install.packages("Hmisc")
res2<-rcorr(res, type = c("pearson", "spearman"))
res2
       mpg   cyl  disp    hp  drat    wt  qsec
mpg   1.00 -0.99 -1.00 -0.97  0.95 -0.99  0.78
cyl  -0.99  1.00  0.99  0.98 -0.93  0.97 -0.84
disp -1.00  0.99  1.00  0.97 -0.96  0.99 -0.79
hp   -0.97  0.98  0.97  1.00 -0.86  0.92 -0.91
drat  0.95 -0.93 -0.96 -0.86  1.00 -0.97  0.60
wt   -0.99  0.97  0.99  0.92 -0.97  1.00 -0.69
qsec  0.78 -0.84 -0.79 -0.91  0.60 -0.69  1.00

n= 7 


P
     mpg    cyl    disp   hp     drat   wt     qsec  
mpg         0.0000 0.0000 0.0004 0.0011 0.0000 0.0377
cyl  0.0000        0.0000 0.0000 0.0020 0.0004 0.0178
disp 0.0000 0.0000        0.0004 0.0007 0.0000 0.0350
hp   0.0004 0.0000 0.0004        0.0126 0.0033 0.0043
drat 0.0011 0.0020 0.0007 0.0126        0.0002 0.1519
wt   0.0000 0.0004 0.0000 0.0033 0.0002        0.0878
qsec 0.0377 0.0178 0.0350 0.0043 0.1519 0.0878       

Using the iris data to show more ways of evaluating the data:

data <- iris[, 1:4]
groups <- iris[,5]
library(psych)
corPlot(data, cex = 1.2) #to give a numerical tabular result

par(mfrow = c(2,3))
corrplot(cor(data), method = "circle",
  title = "method = 'circle'",
  tl.pos = "n", mar = c(2,1,3,1))

install.packages("corrplot")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(corrplot)
corrplot(cor(data), method = "square",
  title = "method = 'square'",
  tl.pos = "n", mar = c(2,1,3,1))

corrplot(cor(data), method = "ellipse",
  title = "method = 'ellipse'",
  tl.pos = "n", mar = c(2,1,3,1))

corrplot(cor(data), method = "pie",
  title = "method = 'pie'",
  tl.pos = "n", mar = c(2,1,3,1))

corrplot(cor(data), method =  "number",
         title = "method = 'number'",
        tl.pos = "n", mar = c(2,1,3,1))

and using ggplot and GGally on penguins data for example which gives a myriad of different plots to look at:

if (!require("GGally")) install.packages("GGally")
install.packages("palmerpenguins")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(palmerpenguins)
library(GGally)
# Code that generates plots or runs intensive computations
ggpairs(penguins[,c(1,3:7)], ggplot2::aes(colour = species))

Week 9 Linear Models

Here’s a link to a Greg Martin Youtube video https://youtu.be/-mGXnm0fHtI

and some of the coding used in this video are shown below. Using the tidyverse package and the cars dataset:

library(tidyverse)
head(cars, 10)
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17

It’s easy to create a simple linear model:

cars %>%
  lm(dist ~ speed, data = .) %>% #with dist being our dependent y axis variable and speed being our independent x axis variable
  summary()

Call:
lm(formula = dist ~ speed, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

In the above:

  • residuals-represents the distance between the line of best fit and the observations so a good model has low residuals (all dots close to line). Ideally want to see symmetrical distribution around 0

  • next get coefficients, which gives the slope (for every unit on x get 3.9 on y in this case) and also p value for our slope

  • lastly r squared which gives 65% of the change in y can be explained by the change in x

As ever there are other ways of doing the same thing, for example:

mod <- lm(dist ~ speed, data = cars)
mod

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17.579        3.932  
summary(mod) #and this method allows us to do more with the output to gain better understanding as to the suitability and strength of the model

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

and this method allows us to do more with the output:

attributes(mod)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

$class
[1] "lm"
mod$residuals #such as looking at individual columns for example
         1          2          3          4          5          6          7 
  3.849460  11.849460  -5.947766  12.052234   2.119825  -7.812584  -3.744993 
         8          9         10         11         12         13         14 
  4.255007  12.255007  -8.677401   2.322599 -15.609810  -9.609810  -5.609810 
        15         16         17         18         19         20         21 
 -1.609810  -7.542219   0.457781   0.457781  12.457781 -11.474628  -1.474628 
        22         23         24         25         26         27         28 
 22.525372  42.525372 -21.407036 -15.407036  12.592964 -13.339445  -5.339445 
        29         30         31         32         33         34         35 
-17.271854  -9.271854   0.728146 -11.204263   2.795737  22.795737  30.795737 
        36         37         38         39         40         41         42 
-21.136672 -11.136672  10.863328 -29.069080 -13.069080  -9.069080  -5.069080 
        43         44         45         46         47         48         49 
  2.930920  -2.933898 -18.866307  -6.798715  15.201285  16.201285  43.201285 
        50 
  4.268876 
hist(mod$residuals) #and graphing the output to check the distribution

shapiro.test(mod$residuals) #and can run shapiro test to look at the distribution of the residuals to check its normality

    Shapiro-Wilk normality test

data:  mod$residuals
W = 0.94509, p-value = 0.02152

and can then use this model to do some predictions as follows:

new_speeds <- data.frame(speed = c(10, 15, 20))
predict(mod, new_speeds) %>% 
  round(1) 
   1    2    3 
21.7 41.4 61.1 

and as before can achieve same thing with:

cars %>%
  lm(dist ~ speed, data = .) %>% 
  predict(data.frame(speed = c(10, 15, 20))) %>%
   round()
 1  2  3 
22 41 61 

Post Lecture exercise:

Answer the following question using a linear model “What is the effect of the weight of the vehicle on its efficiency (mile per gallon)?”

#loading packages and exploring the datasets
library(tidyverse)
library(ggplot2)
data("mtcars")
str(mtcars)   
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Obvious that do have numerical versus numerical continuous quantitative data so can use linear regression models to help the analysis. Starting with plotting the data using scatter plot:

ggplot(mtcars, aes(x = wt, y = mpg))+
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) #adding line of best fit and removing the error bars-use se = TRUE if want them in

which shows that there does look to be a relationship between the 2 variables showing a negative slope as would expect(ie response variable mpg on y axis goes down as predictor variable wt on x axis goes up). Using linear regression on this as below:

 mtcars %>%
    lm(mpg ~ wt, data = .) %>% #with mpg being our dependent y axis variable and wt being our independent x axis variable
    summary()

Call:
lm(formula = mpg ~ wt, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Residuals-are the differences between the observed values and the values predicted by the model. Represents the distance between the line of best fit and the observations so a good model has low residuals (all dots close to line). The summary statistics of residuals are provided:

  • Min: Smallest residual (-4.5432).

  • 1Q: First quartile residual (-2.3647).

  • Median: Middle residual (-0.1252).

  • 3Q: Third quartile residual (1.4096).

  • Max: Largest residual (6.8727).

The closer these values are to zero, the better the model fits the data.

Coefficients

This section provides estimates for the regression coefficients: | Term | Estimate | Std. Error | t value | Pr(>|t|) |

  • Intercept (37.2851): The predicted mpg when wt = 0. While not realistic for this dataset (since weight is never 0), it serves as the baseline for the regression line.

  • Slope (-5.3445): For every 1-unit increase in wt, mpg decreases by 5.3445 units on average.

  • Std. Error: Indicates the variability of the estimates. Smaller values suggest more precise estimates.

  • t value: The test statistic for each coefficient, calculated as Estimate/Std. Error.

  • Pr(>|t|): The p-value for the hypothesis test H0:β=0, H0​:β=0. For both coefficients,p<0.001p<0.001, indicating strong evidence that the coefficients are significantly different from zero.

Residual Standard Error

The standard deviation of the residuals:

  • 3.046: On average, the observed mpg deviates by about 3.046 units from the predicted values.

R-squared and Adjusted R-squared

  • Multiple R-squared (0.7528): About 75.28% of the variability in mpg is explained by wt.

  • Adjusted R-squared (0.7446): Adjusted for the number of predictors and sample size, slightly lower.

Higher R-squared values indicate a better fit.

F-statistic

  • 91.38: Tests whether the model as a whole explains a significant proportion of variance in the dependent variable.

  • p-value (1.294e-10): Extremely small, indicating the model is highly significant.

Interpretation

The regression model suggests that car weight (wt) is a significant predictor of miles per gallon (mpg). Heavier cars tend to have lower fuel efficiency, with a strong linear relationship accounting for about 75% of the variation in mpg.

Also wanted to run an ANOVA on the above linear regression model to test whether the model as a whole explains a significant amount of variation in the dependent variable compared to a simpler model (eg a model with no predictors):

# Fit the linear model
model <- lm(mpg ~ wt, data = mtcars)

# Perform ANOVA
anova_result <- anova(model)

# View the result
print(anova_result)
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
wt         1 847.73  847.73  91.375 1.294e-10 ***
Residuals 30 278.32    9.28                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

with the following explanations:

  1. Df (Degrees of Freedom):

    • For the predictor wt: 1 degree of freedom (since it’s one variable).

    • Residuals: n−k−1, where n is the number of observations and k is the number of predictors (30 in this case).

  2. Sum Sq (Sum of Squares):

    • Total variance explained by wt: 847.725.

    • Residual variance (unexplained variance): 278.321.

  3. Mean Sq (Mean Squares):

    • Sum of Squares divided by Degrees of Freedom.
  4. F value:

    • Ratio of variance explained by the predictor to residual variance (Mean Sq for wt/Mean Sq for Residuals).
  5. Pr(>F):

    • The p-value for the F-test. A small value (e.g., <0.05) indicates that the predictor significantly explains variation in the dependent variable.

Interpretation:

From the ANOVA output:

  • The p-value for the predictor wt (e.g.,<2.2e−16) suggests that car weight (wt) is a highly significant predictor of miles per gallon (mpg).

  • The large F value (91.375) indicates that the variability explained by wt is much larger than the residual (unexplained) variability, confirming a strong relationship.

This aligns with the earlier regression analysis, reinforcing that wt is a significant predictor of mpg.

Multiple Linear Models

Can look at multiple linear models by simply adding in variables as in the below example using the penguins dataset:

m2<-lm(flipper_length_mm~bill_length_mm
       + bill_depth_mm, data = penguins)
summary(m2)

Call:
lm(formula = flipper_length_mm ~ bill_length_mm + bill_depth_mm, 
    data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.731  -5.855   0.041   6.112  23.059 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    194.31175    6.33800   30.66   <2e-16 ***
bill_length_mm   1.41476    0.08802   16.07   <2e-16 ***
bill_depth_mm   -3.23801    0.24335  -13.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.626 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6259,    Adjusted R-squared:  0.6237 
F-statistic: 283.6 on 2 and 339 DF,  p-value: < 2.2e-16

and as before can then use the shapiro test to look the normality of the residuals:

shapiro.test(m2$residuals)

    Shapiro-Wilk normality test

data:  m2$residuals
W = 0.99317, p-value = 0.1227

and can then use ancova to compare regression slopes:

ancov1<-lm(bill_length_mm~body_mass_g
           + species  
           + body_mass_g*species, data=penguins)
summary(ancov1)

Call:
lm(formula = bill_length_mm ~ body_mass_g + species + body_mass_g * 
    species, data = penguins)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4208 -1.6461  0.0919  1.4718  9.3138 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  26.9941391  1.5926015  16.950  < 2e-16 ***
body_mass_g                   0.0031879  0.0004271   7.464 7.27e-13 ***
speciesChinstrap              5.1800537  3.2746719   1.582    0.115    
speciesGentoo                -0.2545907  2.7138655  -0.094    0.925    
body_mass_g:speciesChinstrap  0.0012748  0.0008740   1.459    0.146    
body_mass_g:speciesGentoo     0.0009030  0.0006066   1.489    0.138    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.399 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.8098,    Adjusted R-squared:  0.807 
F-statistic: 286.1 on 5 and 336 DF,  p-value: < 2.2e-16

and then anova of the output:

anova(ancov1)
Analysis of Variance Table

Response: bill_length_mm
                     Df Sum Sq Mean Sq  F value Pr(>F)    
body_mass_g           1 3599.7  3599.7 625.5924 <2e-16 ***
species               2 4612.5  2306.3 400.8045 <2e-16 ***
body_mass_g:species   2   18.6     9.3   1.6159 0.2003    
Residuals           336 1933.4     5.8                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Research Methods exercise:

From a paper of your preference, read the methods carefully and create a graphical abstract of the experimetnal design with a legend to help with explanation:

Week 10 Logistic Regression

Pre session

Couple of Youtube videos to get started:

https://youtu.be/yIYKR4sgzI8?list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe In summary this one by StatQuest by Josh Starmer started with a quick review of Linear Regression where we had 2 numerical variables plotted (eg weight and size) and a line of best fit applied which allowed us to calculate r2 and determine if weight and size are correlated (large values imply large effect), and a p value to see if the r2 was statistically significant; and use the line to predict size given weight. Can also use mutliple regression whereby trying to predict size using weight and blood volume for example (or trying to model size using weight and blood volume). Multiple regression did the same thing that normal regression did: calculate r2, p value and predict size given weight and blood volume. Also possible to compare models for example the simple (single) model to the complicated one (multiple) tells us if we need to measure weight and blood volume to accurately predict size or if we can get away with just weight.

So that was linear regression, now logistic regression: similar except predicts whether something is true or false instead of predicting something continuous like size. Also instead of fitting a line to the data logistic regression fits an “S” shaped logistic function from 0-1 which tells you the probability that a mouse is obese based on its weight as an example. Just like with linear regression we can make simple models (eg obesity predicted by weight) or more complicated models (eg obesity predicted by weight + genotype + age + astrological sign). In other words just like linear regression logistic regression can work with continuous data (like weight and age) and discrete data like (genotype and astrological size). Can also test to see if each variable is useful for predicting obesity. However unlike liner regression we can’t compare the complicated model to the simple model. Instead we just test to see if a variable’s effect on the prediction is signifciantly different from 0. If not it means the variable is not helping the prediction (use Waldt’s test). Logistic regression does not have the same concept of residuals as linear regression (where residuals are used to fit the line and to calculate r2 and compare simple to complicated models). Instead it uses the maximum likelihood concept.

2nd video is the next in this series and looks at worked examples as below:

https://youtu.be/C4N3_XJJ-jU?list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe

Using the heart disease data which needs some initial wrangling:

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header = FALSE)
head(data)
  V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
1 63  1  1 145 233  1  2 150  0 2.3   3 0.0 6.0   0
2 67  1  4 160 286  0  2 108  1 1.5   2 3.0 3.0   2
3 67  1  4 120 229  0  2 129  1 2.6   2 2.0 7.0   1
4 37  1  3 130 250  0  0 187  0 3.5   3 0.0 3.0   0
5 41  0  2 130 204  0  2 172  0 1.4   1 0.0 3.0   0
6 56  1  2 120 236  0  0 178  0 0.8   1 0.0 3.0   0

None of the columns are named so need to name them:

colnames(data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "hd")
head(data)
  age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal hd
1  63   1  1      145  233   1       2     150     0     2.3     3 0.0  6.0  0
2  67   1  4      160  286   0       2     108     1     1.5     2 3.0  3.0  2
3  67   1  4      120  229   0       2     129     1     2.6     2 2.0  7.0  1
4  37   1  3      130  250   0       0     187     0     3.5     3 0.0  3.0  0
5  41   0  2      130  204   0       2     172     0     1.4     1 0.0  3.0  0
6  56   1  2      120  236   0       0     178     0     0.8     1 0.0  3.0  0

which looks better but str() still shows an issue:

str(data)
'data.frame':   303 obs. of  14 variables:
 $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : num  1 1 1 1 0 1 0 0 1 1 ...
 $ cp      : num  1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
 $ restecg : num  2 2 2 0 2 0 2 0 2 2 ...
 $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : num  0 1 1 0 0 0 0 1 0 1 ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : num  3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : chr  "0.0" "3.0" "2.0" "0.0" ...
 $ thal    : chr  "6.0" "3.0" "7.0" "3.0" ...
 $ hd      : int  0 2 1 0 0 0 3 0 2 1 ...

some of the columns are a problem, for example sex is a number rather than a factor where 0 represents “female” and 1 represents “male”. CP (chest pain) is also supposed to be a factor (where 1-3 represents different types of pain and 4 is no chest pain). Similarly ca and thal are correctly called factors but one of the levels is “?” when we need it to be NA. So we’ve got some cleaning up to do!

data[data == "?"] <- NA # to change the question marks to NA's
data$sex <- ifelse(test=data$sex == 0, yes = "F", no = "M")
data$sex <- as.factor(data$sex) #and convert the column into a factor
data$cp <- as.factor(data$cp) # and converting other columns into factors as that's what they are supposed to be
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)
data$ca <- as.integer(data$ca) #and tell r that ca column is now a column  of integers and then convert it to a factor
data$ca <- as.factor(data$ca)#as above
data$thal <- as.factor(data$thal)#as above
data$hd <- ifelse(test=data$hd == 0, yes = "Healthy", no = "Unhealthy") #and lastly
data$hd <- as.factor(data$hd) #making hd a factor which is easier on the eye

and check that this has all worked:

str(data)
'data.frame':   303 obs. of  14 variables:
 $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
 $ cp      : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
 $ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
 $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
 $ thal    : Factor w/ 3 levels "3.0","6.0","7.0": 2 1 3 1 1 1 1 1 3 3 ...
 $ hd      : Factor w/ 2 levels "Healthy","Unhealthy": 1 2 2 1 1 1 2 1 2 2 ...

now see how many samples have NA values to decide if to get rid of them or if we should impute values for them:

nrow(data[is.na(data$ca) | is.na(data$thal),]) #shows 6 rows of data have NA's in them
[1] 6

we can view the samples with NA’s in them by selecting those rows from the data:

data[is.na(data$ca) | is.na(data$thal),]
    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope   ca thal
88   53   F  3      128  216   0       2     115     0     0.0     1    0 <NA>
167  52   M  3      138  223   0       0     169     0     0.0     1 <NA>  3.0
193  43   M  4      132  247   1       2     143     1     0.1     2 <NA>  7.0
267  52   M  4      128  204   1       0     156     1     1.0     2    0 <NA>
288  58   M  2      125  220   0       0     144     0     0.4     2 <NA>  7.0
303  38   M  3      138  175   0       0     173     0     0.0     1 <NA>  3.0
           hd
88    Healthy
167   Healthy
193 Unhealthy
267 Unhealthy
288   Healthy
303   Healthy

For this exercise we will just remove them:

data <- data[!(is.na(data$ca) | is.na(data$thal)),]
nrow(data) #shows that these have been removed
[1] 297

Now we need to make sure that healthy and diseased samples come from each gender as if only male samples have heard disease then we should remove all females from the model. Can do this using the xtabs function to select the columns we want to build a table from (in this case HD and sex)

xtabs(~hd + sex, data=data)
           sex
hd            F   M
  Healthy    71  89
  Unhealthy  25 112

ie healthy and unhealthy in both male and female. Now let’s verify that all all 4 levels of chest pain were reported by a few patients:

xtabs(~hd +cp, data=data)
           cp
hd            1   2   3   4
  Healthy    16  40  65  39
  Unhealthy   7   9  18 103

so that’s fine, lets do the same thing for all other boolean and categorical variables that we are using to predict heart disease:

xtabs(~hd + fbs, data = data)
           fbs
hd            0   1
  Healthy   137  23
  Unhealthy 117  20
xtabs(~hd + restecg, data = data)
           restecg
hd           0  1  2
  Healthy   92  1 67
  Unhealthy 55  3 79

Potential issue here in that only 4 patients represent level 1 so might be a problem as not enough.

xtabs(~hd + slope, data = data)
           slope
hd            1   2   3
  Healthy   103  48   9
  Unhealthy  36  89  12
xtabs(~hd + ca, data = data)
           ca
hd            0   1   2   3
  Healthy   129  21   7   3
  Unhealthy  45  44  31  17
xtabs(~hd + thal, data = data)
           thal
hd          3.0 6.0 7.0
  Healthy   127   6  27
  Unhealthy  37  12  88

So all these look ok apart from possibly ecg. Now let’s run the Logistic Regression:

Starting with a simple model where we will try to predict heart disease using only the gender of each patient:

logistic <- glm(hd ~ sex, data=data, family = "binomial") #the call to glm the function that performs generalised linear models. Binomial makes the glm function do the logistic regression
summary(logistic)

Call:
glm(formula = hd ~ sex, family = "binomial", data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.0438     0.2326  -4.488 7.18e-06 ***
sexM          1.2737     0.2725   4.674 2.95e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 409.95  on 296  degrees of freedom
Residual deviance: 386.12  on 295  degrees of freedom
AIC: 390.12

Number of Fisher Scoring iterations: 4

Output shows:

  1. Summary of the deviance residuals which look good as close to being centred on 0 and are roughly symmetrical

  2. Coefficients which correspond to the equation: heart disease = -1.0438 + 1.2737 x the patient is male (this latter variable is equal to 0 when the patient is female and 1 when the patient is male) so if we are predicting heart disease for a female patient we get the following equation: hd = -1.0438+1.2737x0 which reduce to hd = -1.0438 so the log(odds) that a female has hd = -1.0438. For a male patient the equation is hd = -1.0438 + 1.2737x1 giving hd = -1.0438+1.2737. In other words the second term is the log(odds ration) of the odds that a male will have hd over the odds that a female will have hd. The pvalues are well below 0.05 and thus the log(odds) and the log(odds ratios) are both stat sig. Remember that a small p value alone is not interesting, we also want large effect sizes and that’s what the log(odds) and log(odds ratio) tells us. For more on the odds ratio see this video https://youtu.be/8nm0G-1uJzA.

  3. Next we see the default dispersion parameter used for this logistic regression. When we do normal linear regression we estimate both the mean and the variance from the data. In contrast with logistic regression we estimate the mean of the data and the variance is derived from the mean. It is therefore possible that the variance is underestimated. If so you can adjust the dispersion parameter in the summary() command.

  4. Then we have the Null Deviance and the Residual Deviance. These can be used to compare models, compute r2 and an overal p value.

  5. Then we have the AIC (Akaike Information Creiterion) which is just the Residual Deviance adjusted for the number of parameters in the model. The AIC can be used to compare one model to another.

  6. Lastly we have the number of Fisher Scoring iterations which just tells us how quickly the glm() function converged on the maximum liklihood estimates for the coefficients.

Now that we’ve done a simple logistic regression using just one variable (sex) to predict hd we can create a fancy model that uses all of the variables to predict hd.

logistic <- glm(hd~., data=data, family = "binomial")
summary(logistic)

Call:
glm(formula = hd ~ ., family = "binomial", data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.253978   2.960399  -2.113 0.034640 *  
age         -0.023508   0.025122  -0.936 0.349402    
sexM         1.670152   0.552486   3.023 0.002503 ** 
cp2          1.448396   0.809136   1.790 0.073446 .  
cp3          0.393353   0.700338   0.562 0.574347    
cp4          2.373287   0.709094   3.347 0.000817 ***
trestbps     0.027720   0.011748   2.359 0.018300 *  
chol         0.004445   0.004091   1.087 0.277253    
fbs1        -0.574079   0.592539  -0.969 0.332622    
restecg1     1.000887   2.638393   0.379 0.704424    
restecg2     0.486408   0.396327   1.227 0.219713    
thalach     -0.019695   0.011717  -1.681 0.092781 .  
exang1       0.653306   0.447445   1.460 0.144267    
oldpeak      0.390679   0.239173   1.633 0.102373    
slope2       1.302289   0.486197   2.679 0.007395 ** 
slope3       0.606760   0.939324   0.646 0.518309    
ca1          2.237444   0.514770   4.346 1.38e-05 ***
ca2          3.271852   0.785123   4.167 3.08e-05 ***
ca3          2.188715   0.928644   2.357 0.018428 *  
thal6.0     -0.168439   0.810310  -0.208 0.835331    
thal7.0      1.433319   0.440567   3.253 0.001141 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 409.95  on 296  degrees of freedom
Residual deviance: 183.10  on 276  degrees of freedom
AIC: 225.1

Number of Fisher Scoring iterations: 6

In this we see that age isn’t a useful predictor because it has a large p value. However the median age in our dataset was 56 so most of the folks were pretty old and that explains why it wasn’t very useful.

Gender is still a good predictor though.

If we scroll down to the bottom of the output we see that the Residual Deviance and the AIC are both much smaller for this fancy model than they were for the simple model when we only used gender to predict hd. If we want to calculate McFadden’s Pseudo r2 we can pull the log-liklihood of the null model out of the logistic variable by getting the value for the null deviance and dividing by -2 and we can pull the log-liklihood for the fancy model out of the logistic variable by getting the value for the residual deviance and dividing by -2, then we just do the maths and we end up with a Pseudo r2 = 0.55. This can be interpreted as the overall effect size.

ll.null <- logistic$null.deviance/-2
ll.proposed <- logistic$deviance/-2
(ll.null-ll.proposed)/ll.null
[1] 0.5533531

and we can use those same log-liklihoods to calculate a p value for that r2 using a chi square distribution:

1-pchisq(2*(ll.proposed-ll.null), df = (length(logistic$coefficients)-1))
[1] 0

in this case the p value is tiny so the r2 value isn’t due to dumb luck!

Lastly we can draw a graph that shows the predicted probabilities that each patient has heart disease along with their actual heart disease status. Most of the patients with hd are predicted to have a high probability of having hd and most of the patients without hd are predicted to have a low probability of having hd. Thus our logistic regression has done a pretty good job.

predicted.data <- data.frame(probability.of.hd=logistic$fitted.values, hd=data$hd) #create new data frame that contains the probabilities of having hd along with the actual hd status
predicted.data <- predicted.data[order(predicted.data$probability.of.hd, decreasing=FALSE),]#then we sort the data frame from low probabilities to high
predicted.data$rank <- 1:nrow(predicted.data)#then we add a new column to the data frame that has the rank of each sample from low probability to =high.
library(ggplot2)#then we load ggplot to we can draw a fancy graph
library(cowplot)
ggplot(data=predicted.data, aes(x=rank, y=probability.of.hd))+
  geom_point(aes(color=hd), alpha=1, shape=4, stroke=2) +
    xlab("Index") +
    ylab("Predicted probability of getting heart disease")

Post session Exercises

Taken from the STHDA website on Logistic Regression Essentials in R https://www.sthda.com/english/articles/36-classification-methods-essentials/151-logistic-regression-essentials-in-r/

Logistic Regression is used to predict the class or category of individuals based on one or multiple predictor variables (x). It is used to model a binary outcome, that is a variable which can only have only have two possible values 0 or 1, yes or no, true or false etc. LR belongs to a family named Generalised Linear Model developed for extending the linear regression model to other situations. Logistic regression does not return direclty the class of observations-it allows us to estimate the probability of class membership, ranging between 0 and 1. You need to decide the threshold probability at which the category flips from one to the other. By default this is set = 0.5 but in reality it should be settled based on the analysis purpose.

Logistic function

The standard logistic regression function, for predicting the outcome of an observation given a predictor variable (x), is an s-shaped curve defined as p = exp(y) / [1 + exp(y)] (James et al. 2014). This can be also simply written as p = 1/[1 + exp(-y)], where:

  • y = b0 + b1*x,

  • exp() is the exponential and

  • p is the probability of event to occur (1) given x. Mathematically, this is written as p(event=1|x) and abbreviated asp(x), sopx = 1/[1 + exp(-(b0 + b1*x))]`

By a bit of manipulation, it can be demonstrated that p/(1-p) = exp(b0 + b1*x). By taking the logarithm of both sides, the formula becomes a linear combination of predictors: log[p/(1-p)] = b0 + b1*x.

When you have multiple predictor variables, the logistic function looks like: log[p/(1-p)] = b0 + b1*x1 + b2*x2 + ... + bn*xn

b0 and b1 are the regression beta coefficients. A positive b1 indicates that increasing x will be associated with increasing p. Conversely, a negative b1 indicates that increasing x will be associated with decreasing p.

The quantity log[p/(1-p)] is called the logarithm of the odd, also known as log-odd or log it.

The odds reflect the likelihood that the event will occur. It can be seen as the ratio of “successes” to “non-successes”. Technically, odds are the probability of an event divided by the probability that the event will not take place(P. Bruce and Bruce 2017). For example, if the probability of being diabetes-positive is 0.5, the probability of “won’t be” is 1-0.5 = 0.5, and the odds are 1.0.

Note that, the probability can be calculated from the odds as p = Odds/(1 + Odds).

Need a couple of R packages for this one:

library(tidyverse)
install.packages("caret")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(caret)
library(ggplot2)
theme_set(theme_bw())

Preparing the data

Performing the following steps might improve the accuracy of your model

  • Remove potential outliers

  • Make sure that the predictor variables are normally distributed. If not, you can use log, root, Box-Cox transformation.

  • Remove highly correlated predictors to minimize overfitting. The presence of highly correlated predictors might lead to an unstable model solution.

Here, we’ll use the PimaIndiansDiabetes [in mlbench package] for predicting the probability of being diabetes positive based on multiple clinical variables.

We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model).

install.packages("mlbench")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(mlbench)
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
    pregnant glucose pressure triceps insulin mass pedigree age diabetes
421        1     119       88      41     170 45.3    0.507  26      neg
188        1     128       98      41      58 32.0    1.321  33      pos
639        7      97       76      32      91 40.9    0.871  32      pos
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]

Computing logistic regression

The R function glm(), for generalized linear model, can be used to compute logistic regression. You need to specify the option family = binomial, which tells R that we want to fit logistic regression.

# Fit the model
model <- glm( diabetes ~., data = train.data, family = binomial)
# Summarize the model
summary(model)

Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.053e+01  1.440e+00  -7.317 2.54e-13 ***
pregnant     1.005e-01  6.127e-02   1.640  0.10092    
glucose      3.710e-02  6.486e-03   5.719 1.07e-08 ***
pressure    -3.876e-04  1.383e-02  -0.028  0.97764    
triceps      1.418e-02  1.998e-02   0.710  0.47800    
insulin      5.940e-04  1.508e-03   0.394  0.69371    
mass         7.997e-02  3.180e-02   2.515  0.01190 *  
pedigree     1.329e+00  4.823e-01   2.756  0.00585 ** 
age          2.718e-02  2.020e-02   1.346  0.17840    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 398.80  on 313  degrees of freedom
Residual deviance: 267.18  on 305  degrees of freedom
AIC: 285.18

Number of Fisher Scoring iterations: 5
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
[1] 0.7564103

Simple logistic regression is used to predict the probability of class membership based on one single predictor variable. The following R code builds a model to predict the probability of being diabetes-positive based on the plasma glucose concentration:

model <- glm( diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -6.15882009 0.700096646 -8.797100 1.403974e-18
glucose      0.04327234 0.005341133  8.101716 5.418949e-16

The output above shows the estimate of the regression beta coefficients and their significance levels. The intercept (b0) is -6.15 and the coefficient of glucose variable is 0.043.

The logistic equation can be written as p = exp(-6.15 + 0.043*glucose)/ [1 + exp(-6.15 + 0.043*glucose)]. Using this formula, for each new glucose plasma concentration value, you can predict the probability of the individuals in being diabetes positive. Predictions can be easily made using the function predict(). Use the option type = “response” to directly obtain the probabilities

newdata <- data.frame(glucose = c(20,  180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes
    1     2 
"neg" "pos" 

The logistic function gives an s-shaped probability curve illustrated as follow:

train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Logistic Regression Model", 
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabete-pos"
    )

Multiple logistic regression is used to predict the probability of class membership based on multiple predictor variables, as follow:

model <- glm( diabetes ~ glucose + mass + pregnant, 
                data = train.data, family = binomial)
summary(model)$coef
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -9.32369818 1.125997285 -8.280391 1.227711e-16
glucose      0.03886154 0.005404219  7.190962 6.433636e-13
mass         0.09458458 0.023529905  4.019760 5.825738e-05
pregnant     0.14466661 0.045125729  3.205857 1.346611e-03

and in the below using all the predictor variables:

model <- glm( diabetes ~., data = train.data, family = binomial)
summary(model)$coef
                 Estimate  Std. Error     z value     Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant     1.005031e-01 0.061266974  1.64041157 1.009196e-01
glucose      3.709621e-02 0.006486093  5.71934633 1.069346e-08
pressure    -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps      1.417771e-02 0.019981885  0.70952823 4.779967e-01
insulin      5.939876e-04 0.001508231  0.39383055 6.937061e-01
mass         7.997447e-02 0.031798907  2.51500698 1.190300e-02
pedigree     1.329149e+00 0.482291020  2.75590704 5.852963e-03
age          2.718224e-02 0.020199295  1.34570257 1.783985e-01

From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. Columns are:

  • Estimate: the intercept (b0) and the beta coefficient estimates associated to each predictor variable

  • Std.Error: the standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confident we are about the estimate.

  • z value: the z-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3)

  • Pr(>|z|): The p-value corresponding to the z-statistic. The smaller the p-value, the more significant the estimate is.

Note that, the functions coef() and summary()can be used to extract only the coefficients, as follow:

coef(model)
  (Intercept)      pregnant       glucose      pressure       triceps 
-1.053400e+01  1.005031e-01  3.709621e-02 -3.875933e-04  1.417771e-02 
      insulin          mass      pedigree           age 
 5.939876e-04  7.997447e-02  1.329149e+00  2.718224e-02 
summary(model )$coef
                 Estimate  Std. Error     z value     Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant     1.005031e-01 0.061266974  1.64041157 1.009196e-01
glucose      3.709621e-02 0.006486093  5.71934633 1.069346e-08
pressure    -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps      1.417771e-02 0.019981885  0.70952823 4.779967e-01
insulin      5.939876e-04 0.001508231  0.39383055 6.937061e-01
mass         7.997447e-02 0.031798907  2.51500698 1.190300e-02
pedigree     1.329149e+00 0.482291020  2.75590704 5.852963e-03
age          2.718224e-02 0.020199295  1.34570257 1.783985e-01

Interpretation

It can be seen that only 5 out of the 8 predictors are significantly associated to the outcome. These include: pregnant, glucose, pressure, mass and pedigree.

The coefficient estimate of the variable glucose is b = 0.045, which is positive. This means that an increase in glucose is associated with increase in the probability of being diabetes-positive. However the coefficient for the variable pressure is b = -0.007, which is negative. This means that an increase in blood pressure will be associated with a decreased probability of being diabetes-positive.

An important concept to understand, for interpreting the logistic beta coefficients, is the odds ratio. An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0).

For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor.

If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) versus x is absent (x = 0).

For example, the regression coefficient for glucose is 0.042. This indicate that one unit increase in the glucose concentration will increase the odds of being diabetes-positive by exp(0.042) 1.04 times.

From the logistic regression results, it can be noticed that some variables - triceps, insulin and age - are not statistically significant. Keeping them in the model may contribute to overfitting. Therefore, they should be eliminated. This can be done automatically using statistical techniques, including stepwise regression an penalized regression methods. Briefly, they consist of selecting an optimal model with a reduced set of variables, without compromising the model accuracy. Here, as we have a small number of predictors (n = 9), we can select manually the most significant:

model <- glm( diabetes ~ pregnant + glucose + pressure + mass + pedigree, 
                data = train.data, family = binomial)

Making predictions

We can predictions using the test data in order to evaluate the performance of our logistic regression model as follows:

  1. Predict the class membership probabilities of observations based on predictor variables

  2. Assign the observations to the class with highest probability score (i.e above 0.5)

The R function predict() can be used to predict the probability of being diabetes-positive, given the predictor values.

Predict the probabilities of being diabetes-positive:

probabilities <- model %>% predict(test.data, type = "response")
head(probabilities)
       19        21        32        55        64        71 
0.1352603 0.5127526 0.6795376 0.7517408 0.2734867 0.1648174 

Which classes do these probabilities refer to? In our example, the output is the probability that the diabetes test will be positive. We know that these values correspond to the probability of the test to be positive, rather than negative, because the contrasts() function indicates that R has created a dummy variable with a 1 for “pos” and “0” for neg. The probabilities always refer to the class dummy-coded as “1”.

Check the dummy coding:

contrasts(test.data$diabetes)
    pos
neg   0
pos   1

Predict the class of individuals:

The following R code categorizes individuals into two groups based on their predicted probabilities (p) of being diabetes-positive. Individuals, with p above 0.5 (random guessing), are considered as diabetes-positive.

predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
   19    21    32    55    64    71 
"neg" "pos" "pos" "pos" "neg" "neg" 

Assessing model accuracy is measured as the proportion of observations that have been correctly classified. Inversely, the classification error is defined as the proportion of observations that have been misclassified.

Proportion of correctly classified observations:

mean(predicted.classes == test.data$diabetes)
[1] 0.7564103

The classification prediction accuracy is about 76%, which is good. The misclassification error rate is 24%.

Week 11 Polynomial & Poisson Models

Pre-session

https://youtu.be/ZYN0YD7UfK4 on Fitting & Assessing Polynomial regression models by Mike Marin. Polynomial Regression is a special case of Linear Regression where the relationship between x and y is modelled using a polynomial, rather than a line. It can be used when the relationship is non-linear, although this is still considered to be a special case of Multiple Linear Regression. COULD NOT DOWNLOAD LCD DATA

https://youtu.be/FfBnX5dfxXw on Poisson Regression in R by . Using the bikeshare data set-see R notes for week11.

Session

See notes and R scripts for lecture session notes and exercises on Non Linear Models

Post Session

Polynomial Models-create a fictional dataframe or tibble using a quadratic function. Graph it using ggplot to visualise the data and then fit a quadratic model:

-creating a fictional dataset of blood volume versus height following a quadratic model introduce a a squared term for height, which reflects a non-linear relationship:

# Set seed for reproducibility
set.seed(42)
# Create a fictional dataset
n <- 100  # Number of observations
# Generate random heights in centimeters (between 150 and 200 cm)
height <- runif(n, min = 150, max = 200)
# Simulate blood volume using a quadratic relationship
blood_volume <- 0.02 * height^2 - 5 * height + 400 + rnorm(n, mean = 0, sd = 2)
# Combine into a data frame
blood_data_quadratic <- data.frame(
  Height_cm = round(height, 1),
  Blood_Volume_L = round(blood_volume, 2))
# Display the first few rows
head(blood_data_quadratic)
  Height_cm Blood_Volume_L
1     195.7         188.23
2     196.9         189.19
3     164.3         121.55
4     191.5         177.29
5     182.1         152.86
6     176.0         139.98

Visualise the data:

library(ggplot2)

ggplot(blood_data_quadratic, aes(x = Height_cm, y = Blood_Volume_L)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE, color = "blue") +
  labs(
    title = "Blood Volume vs Height (Quadratic Model)",
    x = "Height (cm)",
    y = "Blood Volume (L)"
  ) +
  theme_minimal()

and fitting a linear model to this data:

lm(formula = Blood_Volume_L ~ Height_cm, data = blood_data_quadratic)

Call:
lm(formula = Blood_Volume_L ~ Height_cm, data = blood_data_quadratic)

Coefficients:
(Intercept)    Height_cm  
   -211.925        2.022  

and then a quadratic model:

lm(formula = Blood_Volume_L ~ Height_cm + I(Height_cm^2), data = blood_data_quadratic)

Call:
lm(formula = Blood_Volume_L ~ Height_cm + I(Height_cm^2), data = blood_data_quadratic)

Coefficients:
   (Intercept)       Height_cm  I(Height_cm^2)  
     396.66827        -4.98051         0.01999  

and then comparing using ANOVA:

linear_model <- lm(Blood_Volume_L ~ Height_cm, data = blood_data_quadratic)
quadratic_model <- lm(Blood_Volume_L ~ Height_cm + I(Height_cm^2), data = blood_data_quadratic)
anova(linear_model, quadratic_model)
Analysis of Variance Table

Model 1: Blood_Volume_L ~ Height_cm
Model 2: Blood_Volume_L ~ Height_cm + I(Height_cm^2)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     98 1937.76                                  
2     97  338.05  1    1599.7 459.02 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So tiny p value suggesting that quadratic model much better than linear model for this data.

Poisson Models:

Load the data ‘Recreation Demand’ from the AER package:

install.packages("AER")

The downloaded binary packages are in
    /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(AER)
data("RecreationDemand")
head(RecreationDemand)
  trips quality ski income userfee  costC   costS   costH
1     0       0 yes      4      no  67.59  68.620  76.800
2     0       0  no      9      no  68.86  70.936  84.780
3     0       0 yes      5      no  58.12  59.465  72.110
4     0       0  no      2      no  15.79  13.750  23.680
5     0       0 yes      3      no  24.02  34.033  34.547
6     0       0 yes      5      no 129.46 137.377 137.850
str(RecreationDemand)
'data.frame':   659 obs. of  8 variables:
 $ trips  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ quality: num  0 0 0 0 0 0 0 0 0 2 ...
 $ ski    : Factor w/ 2 levels "no","yes": 2 1 2 1 2 2 1 2 1 1 ...
 $ income : num  4 9 5 2 3 5 1 5 2 3 ...
 $ userfee: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ costC  : num  67.6 68.9 58.1 15.8 24 ...
 $ costS  : num  68.6 70.9 59.5 13.8 34 ...
 $ costH  : num  76.8 84.8 72.1 23.7 34.5 ...

Have a look at the data description using the help in r studio:

?RecreationDemand

What is the best model to predict the number of boat trips using a poisson model? Compare different models that you create.

So this is saying that boat trips will be our response variable and we want to see which of the other variables influence this most. Obvious that some are likely to be more important than others-for example, quality of the facility and whether a fee were payable seem most likely to influence the number of trips so let’s test it. Use a Poisson regression model as we are looking at count data that follow a Poisson distribution where the mean and variance of the response variable are approximately equal.

Boat_trip_Model1 <- glm(trips ~ quality, 
    data = RecreationDemand, 
    family = "poisson") #have to let r know which glm want
summary(Boat_trip_Model1)

Call:
glm(formula = trips ~ quality, family = "poisson", data = RecreationDemand)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.51378    0.05751  -8.933   <2e-16 ***
quality      0.54696    0.01518  36.022   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4849.7  on 658  degrees of freedom
Residual deviance: 3294.4  on 657  degrees of freedom
AIC: 4051.5

Number of Fisher Scoring iterations: 7

and plotting this:

library(ggplot2)
ggplot(RecreationDemand, aes(y = trips,
                      x = quality,)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "glm", 
              se = FALSE,
              method.args = list(family = "poisson")) +
  scale_colour_brewer(palette =  "Dark2")

and then add in more predictors:

Boat_trip_Model2 <- glm(trips ~ quality + userfee, 
    data = RecreationDemand, 
    family = "poisson") #have to let r know which glm want
summary(Boat_trip_Model2)

Call:
glm(formula = trips ~ quality + userfee, family = "poisson", 
    data = RecreationDemand)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.54705    0.05786  -9.455   <2e-16 ***
quality      0.52697    0.01539  34.244   <2e-16 ***
userfeeyes   1.35552    0.07835  17.301   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4849.7  on 658  degrees of freedom
Residual deviance: 3077.5  on 656  degrees of freedom
AIC: 3836.5

Number of Fisher Scoring iterations: 7

and plot this:

ggplot(RecreationDemand, aes(y = trips,
                      x = quality,
                      colour = userfee)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "glm", 
              se = FALSE,
              method.args = list(family = "poisson")) +
  scale_colour_brewer(palette =  "Dark2")

Compare models using AIC:

AIC(Boat_trip_Model1, Boat_trip_Model2)
                 df      AIC
Boat_trip_Model1  2 4051.477
Boat_trip_Model2  3 3836.527

Which shows that Model2 is the best model as has the lowest AIC.

and testing for overdispersion:

dispersiontest(Boat_trip_Model2)

    Overdispersion test

data:  Boat_trip_Model2
z = 1.8959, p-value = 0.02899
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  14.18829 

So is over-dispersion as >1 so should use another type of distribution such as quasi-poisson.

References

Barroga, Edanz, and Grace J. Matanguihan. 2022. “A Practical Guide to Writing Quantitative and Qualitative Research Questions and Hypotheses in Scholarly Articles.” Journal of Korean Medical Science 37 (16): e121. https://doi.org/10.3346/jkms.2022.37.e121.
contributors, Wikipedia. 2023. “The Scientific Method [SVG Image].” https://en.wikipedia.org/wiki/Scientific_method#/media/File:The_Scientific_Method.svg.
Efbrazil. 2020. “Image from Wikimedia Commons.” https://commons.wikimedia.org/w/index.php?curid=102392470.
Elsevier. 2023. “What Is and How to Write a Good Hypothesis in Research.” https://scientific-publishing.webshop.elsevier.com/manuscript-preparation/what-how-write-good-hypothesis-research/.
Gard, Andrew. 2023. “Quarto Tutorial Series.” YouTube. https://www.youtube.com/playlist?list=PLwfiFx4cFJbmZcUm8hu6v_zWyMfLDeIuq.
Huynh, Y. Wendy. 2019. R for Graduate Students. Self-published.
Statistics in r for Biodiversity Conservation. 2020. Amazon.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse 4: 1686. https://doi.org/10.21105/joss.01686.
Zuur, Alain F., Elena N. Ieno, and Erik H. W. G. Meesters. 2010. A Beginner’s Guide to r. New York, NY: Springer. https://doi.org/10.1007/978-0-387-93837-0.