getwd()
library(tidyverse)
library(ggplot2)
data("diamonds")
data("midwest")ESC&R RMDA Workbook 2024
FM Lecture Week 3 on 8/10/24 Post session Exercises taken from:
Chapter 6.6.1 of “R for Graduate Students” by Y.Wendy Huynh 2019 (Huynh 2019)
Start up stuff
For example:
setting the working directory
installing packages (no need in this case as Tidyverse (Wickham et al. 2019) already installed)
loading packages
importing data (in this case the Diamonds & the Midwest data in ggplot)
Types of Variable
Variables may be either:
Categorical (otherwise know as factors) which can be further subdivided into:
- Nominal (no order eg. happy, neutral or sad) or binary (True or False)
- Ordinal (ordered eg. 1-10 from sad to happy)
Numerical which are quantitative and either:
discrete (eg. number of students in a class or numbers of pages in a book) or
continuous (eg. height, temperature)
Checking & Exploring the data
Must look at the data before we start analysing it to get a better understanding of it and to start some basic manipulations and assessments:
For example we could use:
View() but often big data files so might be better to look at small sections using glimpse(), kable() (from dplyr and knitr packages respectively) and the $ operator to look at individual variables (columns)
using the select() function as below to pick a sample of the data
also really useful to look at the structure using str() again as shown below
diamonds %>%
select(1:8) #picking the first 8 columns# A tibble: 53,940 × 8
carat cut color clarity depth table price x
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95
2 0.21 Premium E SI1 59.8 61 326 3.89
3 0.23 Good E VS1 56.9 65 327 4.05
4 0.29 Premium I VS2 62.4 58 334 4.2
5 0.31 Good J SI2 63.3 58 335 4.34
6 0.24 Very Good J VVS2 62.8 57 336 3.94
7 0.24 Very Good I VVS1 62.3 57 336 3.95
8 0.26 Very Good H SI1 61.9 55 337 4.07
9 0.22 Fair E VS2 65.1 61 337 3.87
10 0.23 Very Good H VS1 59.4 61 338 4
# ℹ 53,930 more rows
str(diamonds) #showing a summary of the dataframetibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
These outputs show that there are 10 total variables (columns) with 3 being ordered factors (ie. categorical-cut, color and clarity), 1 integer or discrete numerical (price) and 6 numeric or continuous numerical (carat, depth, table, x, y & z) . There are 53,940 round cut diamonds in the dataset and each row contains information about each one. An ordered factor arranges the categorical values in a low to high ranked order. For example there are 5 categories of diamond cuts with Fair being the lowest grade to Ideal being the highest.
Smith (Statistics in r for Biodiversity Conservation 2020) suggest that: “before attempting to analyse data it is vital to perform a data exploration as this will save time by identifying any potential problems in the data and will help in deciding what type of analysis to conduct”.
The authors recommend the 6 step data exploration proposed by Zuur (Zuur, Ieno, and Meesters 2010) which is intended to identify:
- Outliers in response and independent variables
- Normality and homegenity of the response variable
- An excess of 0’s in the response variable
- Multicollinearity among independent variables
- Relationships among response and independent variables
- Independence of the response variable
As an example, Boxplots can be used to visualise/identify outliers as below where the diamonds data is plotted by cut against price as in Figure 1 below. There are many other plotting techniques and statistical methods which can also be used for this and the other points in the Zuur method (see chapter 1 for more information) and below for lots of practice at these methods!!
diamonds %>%
group_by(cut) %>%
ggplot(aes(x=cut, y=price, color=cut, fill=cut))+
geom_boxplot()The image below Figure 2 taken from Andrew Gard YouTube’s (Gard 2023) series is a nice summary for which graph/plot to use depending on categorical or numerical variables:
Exercises 6.6.1 Using the diamonds dataset
Good idea to execute one line at a time to see how each line changes the output but remember to not include the pipe or commas at the end of a line or it won’t work!
Basic data management might involve techniques/fucntions such as mutate, nesting, summarise, group by/ungroup, filter, select, arrange, names.
1. 1st problem on the Diamonds dataset
This problem shows how to group and ungroup variables in a dataframe using the group_by function; create and add in an extra column using the mutate function; choose which ones to display using the select function; and list values in a variable in ascending or descending order using the arrange function.
diamonds %>% #utilises the diamonds dataset
group_by(color, clarity) %>% #groups by color and clarity variables
mutate(price200 = mean(price)) %>% #creates new variable (average price by groups)
ungroup() %>% #data no longer grouped by color and clarity
mutate(random10 = 10 + price) %>% #new variable, original price +$10
select(cut, color, clarity, price, price200, random10) %>% # retain only these columns
arrange(color) %>% #visualise data ordered by color
group_by(cut) %>% #group data by cut
mutate(dis = n_distinct(price), #numbers each row consecutively for each cut
rowID = row_number()) %>% #numbers each row consecutively for each cut
ungroup() #then final ungroup# A tibble: 53,940 × 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# ℹ 53,930 more rows
Always Ungroup after grouping!
2. Problem 2a. on the Midwest dataset
This problem shows how to collapse all rows into a one row summary using the summarise() function. This can also be used along with the group() function as per below.
midwest %>%
group_by(state) %>%
summarise(poptotalmean = mean(poptotal),
poptotalmed = median(poptotal),
popmax = max(poptotal),
popmin = min(poptotal),
popdistinct = n_distinct(poptotal),
popfirst = first(poptotal),
popany = any(poptotal < 5000),
popany2 = any(poptotal > 2000000)) %>%
ungroup()# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
Problem 2b. on the Midwest dataset
This problem shows how to put parameters against the summarise() function to show how many rows of values exist in the dataframe within those set parameters as per below.
## Problem B
midwest %>%
group_by(state) %>%
summarise(num5k = sum(poptotal < 5000),
num2mil = sum(poptotal > 2000000),
numrows = n()) %>%
ungroup()# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
Problem 2c.1 on the Midwest dataset
This problem shows how to show the number of unique entries there are in the dataset with values ??I’m not sure!!
midwest %>%
group_by(county) %>%
summarise(x = n_distinct(state)) %>%
arrange(desc(x)) %>%
ungroup()# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
Problem 2c.11 on the Midwest dataset
This problem shows how n() differs from n_distinct().
midwest %>%
group_by(county) %>%
summarise(x = n()) %>%
ungroup()# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
Problem 3. Still on the Midwest dataset
This one shows that there can’t be more than one distinct county for each county! If replace county by state then get a more sensible response!
midwest %>%
group_by(county) %>%
summarise(x=n_distinct(county)) %>%
ungroup()# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
midwest %>%
group_by(state) %>%
summarise(x=n_distinct(county)) %>%
ungroup()# A tibble: 5 × 2
state x
<chr> <int>
1 IL 102
2 IN 92
3 MI 83
4 OH 88
5 WI 72
Problem 4. Looking at the Diamonds dataset again
This one shows how to look at a specific variable using the group() function, and then identify within that group, how many unique values there were for 2 other variables (using the n_distinct(variable) function, aswell as the total number of values using the n() function
diamonds %>%
group_by(clarity) %>%
summarise(a=n_distinct(color),
b=n_distinct(price),
c=n()) %>%
ungroup()# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
Problem 5.1 Still on the Diamonds dataset
This one shows how you can use the summarise function to show the mean and sd values for values that are from 2 groups of variables
diamonds %>%
group_by(color, cut) %>%
summarise(m = mean(price),
s = sd(price)) %>%
ungroup()# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
Problem 5.2 and this one shows what happens if you put the 2 variables in a different order!
If you reverse cut and color order in first line then the values are the same but presented in different order.
diamonds %>%
group_by(cut, color) %>%
summarise(m = mean(price),
s = sd(price)) %>%
ungroup()# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
Problem 5.3 Still in diamonds
This problem shows a practical application in that can see what effect a 20% discount has when compared to the mean sale price
diamonds %>%
group_by(cut, color, clarity) %>%
summarize(m = mean(price),
s = sd(price),
msale = m * 0.80) %>%
ungroup()# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
Problem 6. Still in Diamonds
This shows how you can call the output variables whatever you want!
diamonds %>%
group_by(cut) %>%
summarise(potato = mean(depth),
pizza = mean(price),
popcorn = median(y),
pineapple = potato - pizza,
papaya = pineapple ^ 2,
peach = n()) %>%
ungroup()# A tibble: 5 × 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
Problem 7.1 Still in Diamonds….
What is the difference between 7.1 and 7.2?
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>%
summarise(m = mean(x1)) %>%
ungroup()# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
Problem 7.2…..
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>%
ungroup() %>%
summarize(m = mean(x1))# A tibble: 1 × 1
m
<dbl>
1 1966.
The top one shows what the mean of the prices of all of those diamonds of each color (ie this is different for each colour) after each price has been times by 0.5. The bottom one shows the mean of all the diamonds’ prices times by 0.5 regardless of colour.
Further questions:
why is grouping the data necessary? This allows specific variables to be brought together for future operations. For example, grouping a dataset by age and sex might be useful if we were looking to see what effect age and sex had on a particular test result. It is then possible to conduct further manipulation and analysis of the grouped data.
Why is ungrouping data necessary? If you don’t ungroup once you’ve finished your calculations then future data management will likely produce errors as any further fucntions will be based on the already manipulated data
When should you ungroup data? At the end of your calculation
If the code does not contain
group_by(), do you still needungroup()at the end? For example, doesdata() %>% mutate(newVar = 1 + 2)requireungroup()? No
Further practice from the same book section 6.7
Q1. View all of the variable names in diamonds. An easy one to get started! Could also use head(), tail(), slice_head or slice_tail functions as below:
view(diamonds) #this brings up the table in a separate tab at the tophead(diamonds) #brings up the first 6 rows# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
diamonds %>%
slice_head(n=5) # to bring up the first 5 rows# A tibble: 5 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
diamonds %>%
slice_tail(n=5) #shows the last 5 rows# A tibble: 5 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.5
2 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
3 0.7 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
4 0.86 Premium H SI2 61 58 2757 6.15 6.12 3.74
5 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
Q2. Arrange the diamonds by:
Lowest to highest
price(hint:arrange())Highest to lowest
price(hint:arrange(),desc())Lowest
priceandcuthighest
priceandcut
diamonds %>%
arrange(price) #lowest to highest# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
diamonds %>%
arrange(desc(price)) #highest to lowest# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
diamonds %>%
arrange(cut, price) # by lowest price and cut# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.25 Fair E VS1 55.2 64 361 4.21 4.23 2.33
3 0.23 Fair G VVS2 61.4 66 369 3.87 3.91 2.39
4 0.27 Fair E VS1 66.4 58 371 3.99 4.02 2.66
5 0.3 Fair J VS2 64.8 58 416 4.24 4.16 2.72
6 0.3 Fair F SI1 63.1 58 496 4.3 4.22 2.69
7 0.34 Fair J SI1 64.5 57 497 4.38 4.36 2.82
8 0.37 Fair F SI1 65.3 56 527 4.53 4.47 2.94
9 0.3 Fair D SI2 64.6 54 536 4.29 4.25 2.76
10 0.25 Fair D VS1 61.2 55 563 4.09 4.11 2.51
# ℹ 53,930 more rows
diamonds %>%
arrange(cut, desc(price)) #by highest price and cut# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.01 Fair G SI1 70.6 64 18574 7.43 6.64 4.69
2 2.02 Fair H VS2 64.5 57 18565 8 7.95 5.14
3 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
4 2 Fair G VS2 67.6 58 18515 7.65 7.61 5.16
5 2.51 Fair H SI2 64.7 57 18308 8.44 8.5 5.48
6 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.9
7 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.9
8 2.32 Fair H SI1 62 62 18026 8.47 8.31 5.2
9 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
10 1.93 Fair F VS1 58.9 62 17995 8.17 7.97 4.75
# ℹ 53,930 more rows
Q3. Arrange the diamonds by lowest to highest price and worst to best clarity
diamonds %>%
arrange(price, clarity)# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Q4. Create a new variable named salePrice to reflect a discount of $250 off of the original cost of each diamond (hint: mutate())
diamonds %>%
mutate(salePrice = price - 250)# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z salePrice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
Q5. Remove the x, y, and z variables from the diamonds dataset (hint: select())
diamonds %>%
select(1:7)# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
Q6. Determine the number of diamonds there are for each cut value (hint: group_by(), summarize()).
diamonds %>%
group_by(cut) %>%
summarise(number=n()) %>%
ungroup()# A tibble: 5 × 2
cut number
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
Q7. Create a new column named totalNum that calculates the total number of diamonds.
diamonds %>%
mutate(total_diamonds =n()) # see 11 column on far right# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z total_diamonds
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premi… E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premi… I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very … J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very … I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very … H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very … H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
Research Methods subsection
Exercise: Generate a good question and a bad question about the diamonds data set you’ve just explored. Try to use the principles discussed to simulate both types of questions
Good questions should be clear and interesting; relevant and meaningful; feasible and manageable in terms of available resources; have measurable variables; not answerable with yes/no; often contain what/how rather than is/are/do/does; be novel and push the boundaries of knowledge forwards; tend to be specific rather than vague; aim to discover/explain/explore; be ethical; be systematically constructed

Examples of good questions might include:
why does a diamond have differing clarities/depths of colours/cuts?
what makes a diamond attractive to buyers?
how does colour affect the price?
which is the most important determinant of price?
Examples of bad questions might include:
is there a difference in price between diamonds of different cuts/clarity/colours etc?
are poor clarity diamonds cheaper?
does my wife want another diamond? This is a terrible question!!
Week 4 FM Data Exploration & Scientific Hypotheses-Formative exercise on Data Analyses
Based on Andrew Gard’s YouTube Video tutorial (Gard 2023)on “Visualisation with R in 36minutes”
This tutorial utilises the Tidyverse package and the ModelData package as this contains the ‘Cricket’ data we will be using
library(tidyverse)
library(modeldata)
library(ggplot2)
data("crickets")Exploring the data shows that there are 3 columns of variables and 31 observations or rows. There are 2 continuous numerical variables (temperature and chirp rate ) and one categorical which is binary in nature (species).
View(crickets)
str(crickets)tibble [31 × 3] (S3: tbl_df/tbl/data.frame)
$ species: Factor w/ 2 levels "O. exclamationis",..: 1 1 1 1 1 1 1 1 1 1 ...
$ temp : num [1:31] 20.8 20.8 24 24 24 24 26.2 26.2 26.2 26.2 ...
$ rate : num [1:31] 67.9 65.1 77.3 78.7 79.4 80.4 85.8 86.6 87.5 89.1 ...
glimpse(crickets)Rows: 31
Columns: 3
$ species <fct> O. exclamationis, O. exclamationis, O. exclamationis, O. excla…
$ temp <dbl> 20.8, 20.8, 24.0, 24.0, 24.0, 24.0, 26.2, 26.2, 26.2, 26.2, 28…
$ rate <dbl> 67.9, 65.1, 77.3, 78.7, 79.4, 80.4, 85.8, 86.6, 87.5, 89.1, 98…
Exercise 1. Start with a simple scatter or point plot
As shown below these type of plots are useful for showing 2 numerical variables as in this case:
ggplot(crickets, aes(x = temp,
y = rate)) +
geom_point()Which is OK but could improve the appearance by changing labels, adding captions and titles for example:
# to improve the appearance of this:
ggplot(crickets, aes(x = temp,
y = rate))+
geom_point() +
labs(x = "Temperature", # these will change labels, add caption and titles etc
y = "Chirp rate",
title = "Cricket Chirps",
caption = "Source: Mcdonald (2009)")Looking at this plot and the data it is obvious that it would be much more meaningful if the species were added in to:
ggplot(crickets, aes(x = temp,
y = rate,
colour = species))+ #this will add in the species in different colours
geom_point() +
labs(x = "Temperature", # these will change labels,add caption and title etc
y = "Chirp rate",
colour = "Species", # adding in species and relabelling the species legend with a capital S
title = "Cricket Chirps",
caption = "Source: Mcdonald (2009)")…..and might also be able to improve on the visual appearance by tweaking the colour scheme:
ggplot(crickets, aes(x = temp,
y = rate,
colour = species))+
geom_point() +
labs(x = "Temperature", # these will change labels,add caption and title etc
y = "Chirp rate",
colour = "Species", # adding in species and relabelling the species legend with a capital S
title = "Cricket Chirps",
caption = "Source: Mcdonald (2009)") +
scale_color_brewer(palette = "Dark2") #this will tweak the colour scheme……and there are lots of other options for modifying the appearance and layout of the basic plot:
ggplot(crickets, aes(x = temp,
y = rate))+
geom_point(colour = "red",#note where this is to change the whole plot's colour
size = 2, #obviously changes the size
alpha = 0.3, #changes the transparency
shape = "square") + #changing the shape
labs(x = "Temperature", # these will change labels,add caption and title etc
y = "Chirp rate",
title = "Cricket Chirps",
caption = "Source: Mcdonald (2009)")Can also further modify this simple plot to give more useful information, for example adding in another layer to show a regression line with or without error bars:
ggplot(crickets, aes(x = temp,
y = rate))+
geom_point() +
geom_smooth(method = "lm",
se = TRUE)+ # adding in regression line or curve of best fit and removing the error bars
labs(x = "Temperature", # these will change labels,add caption and title etc
y = "Chirp rate",
title = "Cricket Chirps",
caption = "Source: Mcdonald (2009)")…but again we know there are lots of effects from the differing species so ought to demonstrate this with the 2 species shown as likely to be far more meaningful:
ggplot(crickets, aes(x = temp,
y = rate,
colour = species))+ #adding the species back in
geom_point() +
geom_smooth(method = "lm",
se = FALSE)+ # adding in regression line or curve of best fit and removing the error bars
labs(x = "Temperature", # these will change labels,add caption and title etc
y = "Chirp rate",
colour = "Species",
title = "Cricket Chirps",
caption = "Source: Mcdonald (2009)")Exercise 2. Looking at other types of plots including (remember to refer back to Figure 2 to understand which plot to use):
- for one quantitative variable
This type of data can often be usefully shown using a histogram:
ggplot(crickets, aes(x = rate)) +
geom_histogram(bins = 15) #note the bins give a size of the categories and can be changed to alter the appearance and usefulness by trial and error…..or a frequency polygon:
ggplot(crickets, aes(x = rate)) +
geom_freqpoly(bins = 15)- and for one single categorical variable such as the species using a bar chart:
ggplot(crickets, aes(x = species)) +
geom_bar(color = "black", #just adding some colours in
fill = "green")…..and adding some more colour in:
ggplot(crickets, aes(x = species,
fill = species)) +
geom_bar(show.legend = FALSE) + # removes the legend as not needed
scale_fill_brewer(palette = "Dark2") #useful for adding in colours- and for one categorical and one numerical variable using a box plot:
ggplot(crickets, aes(x = species, y = rate)) +
geom_boxplot()….and adding in some colour:
ggplot(crickets, aes(x = species,
y = rate,
colour = species)) + #adding in colour
geom_boxplot(show.legend = FALSE) + # removes the legend as not needed
scale_colour_brewer(palette = "Dark2") +
theme_minimal() # removes the grey backgroundExercise 3. Faceting
Faceting is a powerful technique used to split data into subsets and display multiple plots for each subset, all in one cohesive graphic layout. This allows you to compare patterns or trends across different categories or groups in a dataset. For example the first plot shows the overlaid technique for comparing the 2 species’ chirp rates but it is messy and not very useful. It would be much better to show them side by side on the same axis as in the second plot:
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_color_brewer((palette = "Dark2")) #but this is messy and not really useful, better if side by side on same axis so can compare them more easilyggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15, show.legend = FALSE)+
scale_color_brewer((palette = "Dark2"))+
facet_wrap(~species)Research Methods subsection Week 4 FM
What makes a good research hypothesis?
After identifying a potentially interesting gap in the current knowledge and understanding in a particular area of interest (generating a research question), a comprehensive literature review is required to confirm that the gap is real and worthy of further effort. The next crucial step in the scientific method is to create a concise statement or hypothesis outlining the result which is expected or predicted to be available to be proved (the so called alternative hypothesis) or disproved (the so-called null hypothesis) as a result of the investigation which follows. Such hypotheses should be simple, clear and relevant to that field of investigation and may take many formats with examples including if/then or direct statements. Getting this hypothesis right should then allow the investigation and collection of carefully observed empirical data utilising reproducible and specific methodology, followed by thorough and rigorous analysis, including statistical methods, of the data to allow valid conclusions to be drawn. Effective and timely communication of the results in peer reviewed publications should allow the iterative and continual evolution of knowledge in that specific field to continue potentially leading to the generation of further hypotheses to be developed in the same field.
References include:
(Efbrazil 2020), (contributors 2023), (Elsevier 2023),
Week 5 Choosing the Right Statistical Tests
First step is to understand the type of variable you are looking at, remembering that R will help if needed by looking at the tibble using summarise(), glimpse() etc
Then work out what your “x” or predictor variable(s) are, and what are your “y” or response variable(s) are, so that you know you are plotting against what from a categorical or quantitative point of view. This will allow you to assess which of the four families of statistical test is the most appropriate as below:
Putting this into words:
Categorical vs. Categorical use Frequency Tests
Categorical vs. Numerical use Mean Tests
Numerical vs. Numerical use Correlations and Models
Numerical vs. Categorical use Logistic Models
Or a flow chart format:
and some examples of the actual tests themselves:
Week 5 formative
Utilising the “iris’ data file which is pre-loaded in R. So firstly, load the data, then describe the data and explore it in more detail including plots:
#|label: Load the iris data
#| results: hide
data("iris")glimpse(iris)Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
str(iris)'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
So can see from these that the dataset consists of 5 columns of which the first 4 are continuous numerical variables and the last one is a nominal (no order) categorical variable. There are 150 rows of individual observations so a reasonable size dataset. First thought is that it would be good to know if each species represented equally-easy to visualise this by a histogram as single categorical.
ggplot(iris, aes(x = Species, fill = Species)) +
geom_bar(color = "black", show.legend = FALSE) + # The outline of the bars remains black and the legend is removed
labs(y = "Count",
title = "Number in Each Species",
caption = "Source:Melo 2024")So there are equal numbers of each species. It would also be interesting to see if there is a relationship between any of the numerical variables and the species so plots of the numerical variables on the y axis versus species on the x axis. For example species vs sepal length as a box plot:
ggplot(iris, aes(x = Species,
y = Sepal.Length,
colour = Species)) + #adding in colour
geom_boxplot(show.legend = FALSE) + # removes the legend as not needed
scale_colour_brewer(palette = "Dark2")Looking back to our Four Families of statistical tests diagram show in Figure 6 this shows that the correct statistical analysis to use would be from the Mean Tests family including T tests for example. So let’s try it:
summary(lm(Sepal.Length~Species, data=iris))
Call:
lm(formula = Sepal.Length ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
The next graph required in the formative exercise shows on paper as density versus petal length by species. I wonder if frequency would be a better word than density? Actually with a bit of digging found out that in a density plot, the density on the y-axis represents a numerical variable. So this is another way of displaying numerical versus numerical (like a scatter plot). The density is a continuous numerical value that estimates the probability density function of a continuous variable. It shows how the data is distributed along the x-axis (the variable of interest), and the area under the curve sums to 1. The y-axis typically does not represent counts or frequencies directly, but rather a scaled density that shows the relative concentration of the data points. Key Points:
X-axis: The continuous variable for which the distribution is being plotted (e.g., height, weight, petal length, etc.).
Y-axis (Density): A continuous numerical variable representing the probability density or the relative likelihood of the data points.
So, the y-axis is also numerical, not categorical.
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
geom_density(trim = FALSE, color = "black") Looking back to our Four Families of statistical tests diagram show in Figure 6 this suggests that the correct statistical analysis to use would be from the Correlations and models family including Pearson’s correlation or linear regression??
The 3rd graph in the exercise showed a scatter plot with the x axis showing the predictor variable Petal Length which is a numerical continuous variable, and the y axis showing the response variable Petal Width which is also a continuous numerical variable. So we are looking for a correlation relationship between these 2. Plotting the graph:
ggplot(iris, aes(x = Petal.Length,
y = Petal.Width))+
geom_point() +
geom_smooth(method="lm",
se = TRUE, colour = "black")But this isn’t that useful as we know already that there are differences between species so should separate out the data by species:
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) + # No color aesthetic #here
geom_point(aes(colour = Species)) + # Color only for points
geom_smooth(method = "lm", se = TRUE) # Regression line with SE barsPersonally, there is such a difference between species there might be more use in having each species with its own regression line and SE bars:
ggplot(iris, aes(x = Petal.Length,
y = Petal.Width,
colour = Species))+ #adding the species back in
geom_point()+
geom_smooth(method = "lm",
se = TRUE) # adding in regression line or curve of best fit #and removing the error barsWould also be good to be able to show the different points in different shapes as in the original slide:
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) + # No color aesthetic #here
geom_point(aes(colour = Species, shape = Species)) + # Color only for points
geom_smooth(method = "lm", se = TRUE) # Regression line with SE barsThe last curve is a histogram which show categorical vs categorical data, so we need to create another variable here, which we can call “Size”. This is based on how many values for each species fall into new categories called big or small which are based on whether or not the value is above or below the median value for the column Sepal Length. This is done by the following:
iris_sepal_length <- iris %>%
mutate(Size=ifelse(Sepal.Length < median(Sepal.Length),
"small", "big"))This produces the iris_sepal_length object in the environment and allows us to use this new dataset to produced the bar chart:
ggplot(iris_sepal_length, aes(x = Species, fill = Size)) +
geom_bar(position = "dodge") +
labs(y = "Frequency", x = "Species") +
ggtitle("Frequency of Size Categories by Species")In this chunk aes(x = Species, fill = Size) maps the x-axis to Species and uses fill = Size to color the bars by the Size category (“small” or “big”). The geom_bar(position = "dodge") creates a bar plot where bars for different size categories are side by side for each species.
Week 6 Frequency Tests
This is the first family of tests out of the 4 (others are Mean, Logistic, Correlation and Models) to be looked at, and 99% of analysis will fall into these 4. Used for categorical variables on x and y axis.
Frequency tests are non-parametric tests meaning that they do not assume a specific distribution for the data, making them ideal for analyzing data that do not meet the assumptions of parametric tests. They are particularly useful when:
Data is Not Normally Distributed: Non-parametric tests do not require normality, so they work well with skewed data or data with outliers.
Small Sample Sizes: They are often better suited for small sample sizes where the distribution cannot be reliably determined.
Ordinal or Ranked Data: Non-parametric tests are designed to work with ordinal data (ranked data) and do not rely on precise numerical differences between values.
Unequal Variances: Unlike parametric tests, non-parametric tests do not require homogeneity of variances.
Pre-Session videos & Exercises:
Frequency & Contingency tables with R taken from Statisticis using R programming Youtube channel
A frequency table lists a set of values and how often each one appears-helps show which values are common/rare. A contingency table (2 way frequency table) is a tabular mechanism with the least 2 rows and 2 columns used to present categorical data in terms of frequency counts.
options(repos = c(CRAN = "https://cran.r-project.org")) #setting the CRAn mirror
install.packages("vcd")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(vcd)write.csv(Arthritis, "Arthritis.csv", row.names = FALSE) #putting the arthritis data onto my computerArthritis <- read.csv("/Users/nickpark/Desktop/MSC/RMDA/R/Arthritis.csv") #telling quarto where to find ithead(Arthritis) ID Treatment Sex Age Improved
1 57 Treated Male 27 Some
2 46 Treated Male 29 None
3 77 Treated Male 30 None
4 17 Treated Male 32 Marked
5 36 Treated Male 46 Marked
6 23 Treated Male 58 Marked
str(Arthritis)'data.frame': 84 obs. of 5 variables:
$ ID : int 57 46 77 17 36 23 75 39 33 55 ...
$ Treatment: chr "Treated" "Treated" "Treated" "Treated" ...
$ Sex : chr "Male" "Male" "Male" "Male" ...
$ Age : int 27 29 30 32 46 58 59 59 63 63 ...
$ Improved : chr "Some" "None" "None" "Marked" ...
summary(Arthritis) ID Treatment Sex Age
Min. : 1.00 Length:84 Length:84 Min. :23.00
1st Qu.:21.75 Class :character Class :character 1st Qu.:46.00
Median :42.50 Mode :character Mode :character Median :57.00
Mean :42.50 Mean :53.36
3rd Qu.:63.25 3rd Qu.:63.00
Max. :84.00 Max. :74.00
Improved
Length:84
Class :character
Mode :character
So creating a one way table from this Arthritis data which has 5 columns and 84 rows, with 2 categorical factors (Treatment and Sex), one ordinal categorical factor (Improved) and 2 numerical variables. Could do some plots eg bar charts) and summary stats but just diving straight into the exercises for now:
For the coding chunks below run one line at a time to see what effect the next one has on the previous output
#one way table
mytable <- table(Arthritis$Improved) #create the object showing the Improved column data from the Arthritis dataset
mytable #shows counts in the Improved column
Marked None Some
28 42 14
prop.table(mytable) #same thing but shown as proportions of the total
Marked None Some
0.3333333 0.5000000 0.1666667
prop.table(mytable)*100 #and as percentages
Marked None Some
33.33333 50.00000 16.66667
and a 2-way table using the xtabs() function:
#two way table
mytable <- xtabs(~Treatment+Improved, data=Arthritis) #selecting the columns want to be included
mytable #counts and shows the Contingecy table Improved
Treatment Marked None Some
Placebo 7 29 7
Treated 21 13 7
and using the margin.table function() to add the totals to allow calculation of proportions using the prop.table() function:
margin.table(mytable, 1) # the 1 argument tells R that we want to look at the chosen row and give the total counts for that row (Treatment)Treatment
Placebo Treated
43 41
prop.table(mytable, 1) # and by the chosen row as proportions (note rows sums to 1) Improved
Treatment Marked None Some
Placebo 0.1627907 0.6744186 0.1627907
Treated 0.5121951 0.3170732 0.1707317
and the same thing by column:
margin.table(mytable, 2) # 2 the 1 argument tells R that we want to look at the chosen column this time and give the total counts for that column (Treatment)Improved
Marked None Some
28 42 14
prop.table(mytable, 2) #and in proportions (columns add to 1) Improved
Treatment Marked None Some
Placebo 0.2500000 0.6904762 0.5000000
Treated 0.7500000 0.3095238 0.5000000
Or looking at whole table or all cells together:
prop.table((mytable)) # cell proportions (all cells add to 1) Improved
Treatment Marked None Some
Placebo 0.08333333 0.34523810 0.08333333
Treated 0.25000000 0.15476190 0.08333333
addmargins(mytable) #cell counts with row and column sums Improved
Treatment Marked None Some Sum
Placebo 7 29 7 43
Treated 21 13 7 41
Sum 28 42 14 84
addmargins(prop.table(mytable)) #cell proportions with row and column sums Improved
Treatment Marked None Some Sum
Placebo 0.08333333 0.34523810 0.08333333 0.51190476
Treated 0.25000000 0.15476190 0.08333333 0.48809524
Sum 0.33333333 0.50000000 0.16666667 1.00000000
and summing by chosen row or column:
addmargins(prop.table(mytable, 1),2) #row proportions with row sums Improved
Treatment Marked None Some Sum
Placebo 0.1627907 0.6744186 0.1627907 1.0000000
Treated 0.5121951 0.3170732 0.1707317 1.0000000
addmargins(prop.table(mytable, 2), 1)#column proportions with column sums Improved
Treatment Marked None Some
Placebo 0.2500000 0.6904762 0.5000000
Treated 0.7500000 0.3095238 0.5000000
Sum 1.0000000 1.0000000 1.0000000
So can see that there are lots of options for presenting the data and interpretation is difficult! Lies, damned lies and statistics!
Can also use the CrossTable() function to create a 2 way frequency or contingency table but need to remember to load the gmodels package first. In this example the table shows the same columns as in the above example with the output showing all the results possible including a chi squared contribution value:
library(gmodels)
CrossTable(Arthritis$Treatment, Arthritis$Improved)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 84
| Arthritis$Improved
Arthritis$Treatment | Marked | None | Some | Row Total |
--------------------|-----------|-----------|-----------|-----------|
Placebo | 7 | 29 | 7 | 43 |
| 3.752 | 2.616 | 0.004 | |
| 0.163 | 0.674 | 0.163 | 0.512 |
| 0.250 | 0.690 | 0.500 | |
| 0.083 | 0.345 | 0.083 | |
--------------------|-----------|-----------|-----------|-----------|
Treated | 21 | 13 | 7 | 41 |
| 3.935 | 2.744 | 0.004 | |
| 0.512 | 0.317 | 0.171 | 0.488 |
| 0.750 | 0.310 | 0.500 | |
| 0.250 | 0.155 | 0.083 | |
--------------------|-----------|-----------|-----------|-----------|
Column Total | 28 | 42 | 14 | 84 |
| 0.333 | 0.500 | 0.167 | |
--------------------|-----------|-----------|-----------|-----------|
and a 3 way contingency table using the xtabs() function:
mytable <- xtabs(~Treatment+Sex+Improved, data=Arthritis)
mytable, , Improved = Marked
Sex
Treatment Female Male
Placebo 6 1
Treated 16 5
, , Improved = None
Sex
Treatment Female Male
Placebo 19 10
Treated 6 7
, , Improved = Some
Sex
Treatment Female Male
Placebo 7 0
Treated 5 2
margin.table(mytable, 1) #totals for treatmentTreatment
Placebo Treated
43 41
margin.table(mytable, 2) #totals for SexSex
Female Male
59 25
margin.table(mytable, 3) #totals for ImprovedImproved
Marked None Some
28 42 14
margin.table(mytable, c(1,3)) #totals for treatment by improved Improved
Treatment Marked None Some
Placebo 7 29 7
Treated 21 13 7
and the ftable() function can also be useful:
#Treatment by Sex for each level of Improved
ftable(mytable) Improved Marked None Some
Treatment Sex
Placebo Female 6 19 7
Male 1 10 0
Treated Female 16 6 5
Male 5 7 2
ftable(prop.table(mytable, c(1,2)),3) #props sum to 1 Improved Marked None Some
Treatment Sex
Placebo Female 0.18750000 0.59375000 0.21875000
Male 0.09090909 0.90909091 0.00000000
Treated Female 0.59259259 0.22222222 0.18518519
Male 0.35714286 0.50000000 0.14285714
ftable(addmargins(prop.table(mytable, c(1,2)),3)) Improved Marked None Some Sum
Treatment Sex
Placebo Female 0.18750000 0.59375000 0.21875000 1.00000000
Male 0.09090909 0.90909091 0.00000000 1.00000000
Treated Female 0.59259259 0.22222222 0.18518519 1.00000000
Male 0.35714286 0.50000000 0.14285714 1.00000000
ftable(addmargins(prop.table(mytable, c(1,2)),3))*100 Improved Marked None Some Sum
Treatment Sex
Placebo Female 18.750000 59.375000 21.875000 100.000000
Male 9.090909 90.909091 0.000000 100.000000
Treated Female 59.259259 22.222222 18.518519 100.000000
Male 35.714286 50.000000 14.285714 100.000000
The Chi Squared Goodness of Fit Test for Frequency Data taken from Iain’s Math & Stat Screencasts Youtube channel.
Useful for testing frequency data especially when there are more than 2 categories. This is a Hypothesis test so need to set null and alternative. No exercises on this one.
The Chi-Squared Goodness-of-Fit test is a non-parametric statistical test used to determine if observed categorical data fit a particular expected distribution. It helps answer whether the observed frequencies in a single categorical variable differ significantly from expected frequencies under a hypothesized distribution.
When to Use the Chi-Squared Goodness-of-Fit Test
The test is applied in situations where:
You have categorical data with different levels (categories) — for example, colors of M&Ms in a sample or survey responses.
You want to compare observed frequencies of categorie to expected frequencies based on a theoretical or known distribution, like a uniform distribution or a specific expected ratio.
Key Concepts and Hypotheses
Null Hypothesis (H₀): The observed frequencies fit the expected distribution. This means any differences between observed and expected frequencies are due to random variation.
Alternative Hypothesis (H₁): The observed frequencies do not fit the expected distribution, suggesting a significant difference.
Steps for Performing the Test
Define expected frequencies: Based on the hypothesized distribution.
Calculate observed frequencies: From your sample data.
Compute the Chi-Squared statistic: Using the formula:
χ2=∑(Oi−Ei)2Eiχ2=∑Ei(Oi−Ei)2
where OiOi is the observed frequency for category ii, and EiEi is the expected frequency.
Determine the degrees of freedom: This is (k−1)(k−1), where kk is the number of categories.
Interpret the p-value: Compare the calculated χ2χ2 statistic to the critical value in the Chi-Squared distribution table or obtain a p-value to determine significance.
Chi Squared testing for Independence in R
The Chi-Squared Test of Independence is a statistical test used to determine if there is a significant association between two categorical variables. This test assesses whether the distribution of one variable differs across the levels of the other variable, indicating independence or a relationship between them.
When to Use the Chi-Squared Test of Independence
This test is appropriate when:
You have two categorical variables that you want to examine for association.
You want to determine if the variables are independent, meaning the distribution of one variable is unaffected by the levels of the other variable.
Hypotheses for the Test
Null Hypothesis (H₀): The two variables are independent (no association between them).
Alternative Hypothesis (H₁): The two variables are not independent (there is an association between them).
Steps for Performing the Test
Create a contingency table: This table cross-tabulates the frequencies for each combination of the two categorical variables.
Calculate expected frequencies: For each cell in the table, the expected frequency is calculated assuming the two variables are independent. The formula for each expected count in a cell is:
Eij=(row total)×(column total)grand totalEij=grand total(row total)×(column total)
Compute the Chi-Squared statistic:
χ2=∑(Oij−Eij)2Eijχ2=∑Eij(Oij−Eij)2
where OijOij is the observed frequency for cell (i,j)(i,j), and EijEij is the expected frequency for the same cell.
Determine degrees of freedom: This is calculated as (r−1)(c−1)(r−1)(c−1), where rr is the number of rows, and cc is the number of columns.
Interpret the p-value: A small p-value (typically < 0.05) suggests a significant association between the two variables, meaning they are not independent.
Useful to understand if the categorical variables in question are associated/related/independent of each other. Remember to run each line in sequence.
volunteers <- matrix(c(111, 96, 48, 96, 133, 61, 91, 150, 53), byrow = T, nrow = 3) # Creating the data to be looked at
rownames(volunteers) <- c("Community College", "Four Year", "Nonstudent")
colnames(volunteers) <- c("1-3Hours", "4-6Hours", "7-9Hours") #naming rows and columns
volunteers #looking at the output 1-3Hours 4-6Hours 7-9Hours
Community College 111 96 48
Four Year 96 133 61
Nonstudent 91 150 53
model <- chisq.test(volunteers) #running the chisquare test
model #showing the output
Pearson's Chi-squared test
data: volunteers
X-squared = 12.991, df = 4, p-value = 0.01132
model$expected # visualising the expected cell counts 1-3Hours 4-6Hours 7-9Hours
Community College 90.57211 115.1907 49.23719
Four Year 103.00358 131.0012 55.99523
Nonstudent 104.42431 132.8081 56.76758
model$residuals #and the Pearson's residuals 1-3Hours 4-6Hours 7-9Hours
Community College 2.1464772 -1.7880604 -0.1763148
Four Year -0.6900708 0.1746359 0.6688187
Nonstudent -1.3136852 1.4918030 -0.5000487
Another way of running this test is:
vol_tab <- as.table(volunteers) #changing the existing matrix to a table
summary(vol_tab)#using the summary function to run the testNumber of cases in table: 839
Number of factors: 2
Test for independence of all factors:
Chisq = 12.991, df = 4, p-value = 0.01132
Week 6 Frequency Testing Lecture Session Exercises
install.packages("janitor")
install.packages("gmodels")
library(tidyverse)
library(tibble)
library(knitr)
library(janitor)
library(gmodels)So creating a dataset to be used called ladybirds. Clearly interested to see if there is any association between the habitat and the colour frequency (as with the moths in the industrial revolution):
ladybirds <- tribble(
~Habitat, ~Site, ~Colour, ~Number,
"Rural", "R1", "Black", 10,
"Rural", "R2", "Black", 3,
"Rural", "R3", "Black", 4,
"Rural", "R4", "Black", 7,
"Rural", "R5", "Black", 6,
"Rural", "R1", "Red", 15,
"Rural", "R2", "Red", 18,
"Rural", "R3", "Red", 9,
"Rural", "R4", "Red", 12,
"Rural", "R5", "Red", 16,
"Industrial", "U1", "Black", 32,
"Industrial", "U2", "Black", 25,
"Industrial", "U3", "Black", 25,
"Industrial", "U4", "Black", 17,
"Industrial", "U5", "Black", 16,
"Industrial", "U1", "Red", 17,
"Industrial", "U2", "Red", 23,
"Industrial", "U3", "Red", 21,
"Industrial", "U4", "Red", 9,
"Industrial", "U5", "Red", 15)view(ladybirds)
str(ladybirds)tibble [20 × 4] (S3: tbl_df/tbl/data.frame)
$ Habitat: chr [1:20] "Rural" "Rural" "Rural" "Rural" ...
$ Site : chr [1:20] "R1" "R2" "R3" "R4" ...
$ Colour : chr [1:20] "Black" "Black" "Black" "Black" ...
$ Number : num [1:20] 10 3 4 7 6 15 18 9 12 16 ...
head(ladybirds)# A tibble: 6 × 4
Habitat Site Colour Number
<chr> <chr> <chr> <dbl>
1 Rural R1 Black 10
2 Rural R2 Black 3
3 Rural R3 Black 4
4 Rural R4 Black 7
5 Rural R5 Black 6
6 Rural R1 Red 15
So the dataset contains 4 columns and 20 rows with 3 categorical factors and one numerical variable. We might be interested in asking:
how does habitat influence morphotype occurrence of ladybirds?
is there an association (dependency) between habitat and ladybird morphotype?
First step as usual is to manipulate the table to show the variables we are most interested in:
#Table 1
ladybirds%>% #take the ladybirds dataset just created
group_by(Habitat, Colour) %>% #create groups
summarise(count = sum(Number)) |> #this last sympbol is similar to the %>% operation
kable() #creates the table showing how many of each colour at each habitat| Habitat | Colour | count |
|---|---|---|
| Industrial | Black | 115 |
| Industrial | Red | 85 |
| Rural | Black | 30 |
| Rural | Red | 70 |
and adding in proportions:
ladybirds%>%
group_by(Habitat, Colour) %>%
summarise(count = sum(Number)) |>
mutate(prop = count/sum(count)) |> #creates a column called prop which is the calculation of the count over the total ie the proportions. Note total sums to 1
kable() | Habitat | Colour | count | prop |
|---|---|---|---|
| Industrial | Black | 115 | 0.575 |
| Industrial | Red | 85 | 0.425 |
| Rural | Black | 30 | 0.300 |
| Rural | Red | 70 | 0.700 |
and taking key value pairs, convert from long format with multiple rows for each category to wide format with separate columns for each category. The spread() function is used to create separate columns for each unique value in the colour column:
ladybirds |>
group_by(Habitat, Colour) %>%
summarise(count = sum(Number)) %>%
spread(Colour, count, fill = 0)|> #The spread() function takes each unique Colour as a separate column, making the data frame wide. This reshaped data frame has Habitat as rows and Colour categories as columns, with count values in each cell. If any combination of Habitat and Colour is missing, fill = 0 ensures it’s filled with a 0.
adorn_totals(c("row", "col")) |> #adds totals for both rows and columns
kable() #displays the table in a clean format| Habitat | Black | Red | Total |
|---|---|---|---|
| Industrial | 115 | 85 | 200 |
| Rural | 30 | 70 | 100 |
| Total | 145 | 155 | 300 |
and looking at this with proportions:
library(dplyr)
library(tibble)
library(tidyr)
library(knitr)
ladybirds |>
group_by(Habitat, Colour) %>%
summarise(count = sum(Number)) %>%
spread(Colour, count, fill = 0) |>
column_to_rownames("Habitat") |> #This line moves the Habitat column to row names, so that each row represents a habitat, with columns for each Colour
proportions() %>% #The proportions() function calculates the relative proportions of each count within the table. By default, this will compute proportions for each cell as a fraction of the total sum of all counts across habitats and colors, giving the proportion each cell contributes to the total count of all ladybirds.
kable()| Black | Red | |
|---|---|---|
| Industrial | 0.3833333 | 0.2833333 |
| Rural | 0.1000000 | 0.2333333 |
and converting the data frame to a matrix saved as t (useful when want to perform operations like proportions across rows and columns as matrices often allow simpler indexing and manipulation than data frames):
ladybirds |>
group_by(Habitat, Colour) %>%
summarise(count = sum(Number)) %>%
spread(Colour, count, fill = 0) |>
column_to_rownames("Habitat") |>
as.matrix()->t
proportions(t,1) |> #This calculates proportions across each row (i.e., within each Habitat) by specifying margin = 1. For each habitat, the sum of all colors will equal 1, so each cell value represents the proportion of that color within the habitat.
kable()| Black | Red | |
|---|---|---|
| Industrial | 0.575 | 0.425 |
| Rural | 0.300 | 0.700 |
and the same matrix with proportions calculated across each column instead of row:
ladybirds |>
group_by(Habitat, Colour) %>%
summarise(count = sum(Number)) %>%
spread(Colour, count, fill = 0) |>
column_to_rownames("Habitat") |>
as.matrix()->t
proportions(t,2) |>
kable()| Black | Red | |
|---|---|---|
| Industrial | 0.7931034 | 0.5483871 |
| Rural | 0.2068966 | 0.4516129 |
So again it is really obvious that there are loads of options for presenting frequency data and it can therefore be very difficult to interpret!!
Looking at Cross tables as a shortcut to the above: The function CrossTable(t) in R is provided by the gmodels package. It creates a cross-tabulation, or contingency table, which summarizes the frequency or proportion of combinations of categorical variables in the data. CrossTable() generates a detailed table with counts and various statistics, often used for analyzing relationships between two categorical variables. For example, where t is a matrix of counts of ladybird colors across habitats, CrossTable(t) will display:
Frequency Counts for each combination of Habitat and Colour in the matrix t
Row Proportions – the proportion of each color count relative to the row totals (e.g., proportion #of colors within each habitat)
Column Proportions – the proportion of each count relative to the column totals (e.g., how much each habitat contributes to a specific color)
Total Proportions – the proportion of each cell relative to the total count for the entire table
CrossTable(t)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 300
|
| Black | Red | Row Total |
-------------|-----------|-----------|-----------|
Industrial | 115 | 85 | 200 |
| 3.477 | 3.253 | |
| 0.575 | 0.425 | 0.667 |
| 0.793 | 0.548 | |
| 0.383 | 0.283 | |
-------------|-----------|-----------|-----------|
Rural | 30 | 70 | 100 |
| 6.954 | 6.505 | |
| 0.300 | 0.700 | 0.333 |
| 0.207 | 0.452 | |
| 0.100 | 0.233 | |
-------------|-----------|-----------|-----------|
Column Total | 145 | 155 | 300 |
| 0.483 | 0.517 | |
-------------|-----------|-----------|-----------|
As can be seen the Cross table produces the following output details:
Cell Counts for each combination
Row and Column Proportions for each combination, helping identify any notable patterns or associations
Chi-square Statistics (if requested), to test independence between the categories.
The CrossTable() output can be very informative in understanding patterns and relationships in categorical data
Types of Frequency Test
1.Test for independence
Looking at the naruto dataset which is a 2 by 2 matrix to see if there is any association or dependency between the categorical variables by performing a chi squared test of independence (see pre session section above). The h0 is that there is no association between the variables and the alternate is that there is a statistically significant association between them. So the first step is to create the frequency table as below. It is then easy to use R to run the test:
naruto<-matrix(c(35,205,8,48), nrow=2, byrow=TRUE)
chisq.test(naruto)$expected #chisq.test(naruto) conducts a Chi-square test of independence on the naruto matrix. Adding $expected extracts the expected frequencies from the Chi-square test result. These expected frequencies are the counts that would occur if there were no association between the row and column categories. [,1] [,2]
[1,] 34.864865 205.13514
[2,] 8.135135 47.86486
proportions(naruto[1,]) #computes the proportion of each value within this row relative to the total of the row. It divides each element by the sum of the elements in the row.Using proportions in this way provides a view of the relative distribution of counts within a specific row. This is useful for understanding how each category within that row contributes to the total count of the row, often used in contingency tables or categorical data analysis.[1] 0.1458333 0.8541667
proportions(naruto[2,]) #likewise for column [1] 0.1428571 0.8571429
chisq.test(naruto)
Pearson's Chi-squared test with Yates' continuity correction
data: naruto
X-squared = 5.9938e-30, df = 1, p-value = 1
In this case the p value is well above 0.05 suggesting that we have to accept the h0 and believe that there is no association between the variables.
Using the Placebo dataset:
The placebo effect in statistical terms refers to a measurable, positive response observed in participants receiving a placebo — an inactive treatment or substance (like a sugar pill) — due to their belief or expectation that they are receiving a real treatment. This effect can complicate the interpretation of clinical trials or experiments, as the response seen is not due to the active treatment but rather to psychological or other non-specific effects. Frequency testing using tests for independence or goodness of fit
placebo<-matrix(c(43,31,27,12,24,28), nrow=2, byrow=TRUE)
rownames(placebo)<-c("cure-yes", "cure-no")
colnames(placebo)<-c("treatment","placebo","no treatment")
exp.placebo<-chisq.test(placebo)$expected #extracts the matrix of expected frequencies under the null hypothesis of independence.
exp.placebo #This displays the expected frequencies calculated from the chi-square test, showing the counts that would be expected in each cell if the cure and treatment types were unrelated or independent. treatment placebo no treatment
cure-yes 33.66667 33.66667 33.66667
cure-no 21.33333 21.33333 21.33333
chisq.test(placebo) # shows an effect of treatement or placebo as p<0.05. It assesses if there is a significant association between cure rates and treatment types by comparing observed counts in placebo with expected counts (exp.placebo). If the chi-square test finds that the observed counts significantly differ from the expected counts, it suggests a relationship between treatment type and cure status.
Pearson's Chi-squared test
data: placebo
X-squared = 10.619, df = 2, p-value = 0.004945
In this example, chisq.test(placebo) tests whether there is an association between treatment type and cure status. If the p-value is below a significance threshold (e.g., 0.05), you can conclude that the type of treatment is significantly associated with the cure status.
2. Testing for Homogeneity
In statistics, a Chi-Square Test for Homogeneity is used to compare the distribution of categorical variables across multiple groups to determine if they are homogeneous (similar in distribution) across those groups. It’s a type of chi-squared test that examines whether different populations have the same proportions for a particular categorical variable.
When to Use the Test for Homogeneity:
The Chi-Square Test for Homogeneity is appropriate when:
You have two or more independent groups (e.g., different samples or populations).
You want to test if the distribution of a categorical variable is consistent across these groups.
-creating a dataset showing frequency of various age categories across 2 countries to look at this test and plotting using a bar chart:
pop<-tribble(
~age, ~country, ~pop,
"0-10","UK", 50,
"0-10","BRA", 120,
"11-20","UK", 70,
"11-20","BRA", 140,
"21-30","UK", 110,
"21-30","BRA", 150,
"31-40","UK", 150,
"31-40","BRA", 180,
"40-50","UK", 180,
"40-50","BRA", 180,
"50-60","UK", 150,
"50-60","BRA", 90,
"60-70","UK", 130,
"60-70","BRA", 50)
pop |>
ggplot(aes(x=age, y= pop, color=country, fill=country))+
geom_col(position="dodge")+
theme(axis.text=element_text(size = 18),
axis.title = element_text(size=18))and then grouping the data to allow for looking to see if the 2 distributions have the same pattern. Clearly the h0 is that there is no significant difference between the groups, with the alternate hypothesis suggesting that there is.
pop |>
group_by(country, age) %>% #summarise(count = sum(Number)) %>%
spread(country, pop, fill = 0) |>
column_to_rownames("age") |>
as.matrix() -> t2
chisq.test(t2)
Pearson's Chi-squared test
data: t2
X-squared = 108.97, df = 6, p-value < 2.2e-16
Interpreting these Results
- p-value: If the p-value is below a certain significance level (commonly 0.05), it indicates that there is a statistically significant difference in the distribution of preferences across the groups, meaning the groups are not homogeneous with respect to the categorical variable.
3. Goodness of Fit
Useful for testing frequency data especially when there are more than 2 categories. This is a Hypothesis test so need to set null and alternative.
The Chi-Squared Goodness-of-Fit test is a non-parametric statistical test used to determine if observed categorical data fit a particular expected distribution. It helps answer whether the observed frequencies in a single categorical variable differ significantly from expected frequencies under a hypothesized distribution.
When to Use the Chi-Squared Goodness-of-Fit Test:
The test is applied in situations where:
You have categorical data with different levels (categories) — for example, colors of M&Ms in a sample or survey responses.
You want to compare observed frequencies of categories to expected frequencies based on a theoretical or known distribution, like a uniform distribution or a specific expected ratio.
Key Concepts and Hypotheses
Null Hypothesis (H₀): The observed frequencies fit the expected distribution. This means any differences between observed and expected frequencies are due to random variation.
Alternative Hypothesis (H₁): The observed frequencies do not fit the expected distribution, suggesting a significant difference.
pop %>%
spread(country, pop, fill=0) %>%
mutate(world=(BRA+UK)/2) %>%
pivot_longer(cols = BRA:world, names_to = "place")->pop2
pop2 #show the output# A tibble: 21 × 3
age place value
<chr> <chr> <dbl>
1 0-10 BRA 120
2 0-10 UK 50
3 0-10 world 85
4 11-20 BRA 140
5 11-20 UK 70
6 11-20 world 105
7 21-30 BRA 150
8 21-30 UK 110
9 21-30 world 130
10 31-40 BRA 180
# ℹ 11 more rows
pop2 %>%
ggplot(aes(x=age,y=value, colour=place, fill=place))+
geom_col(position = "dodge") pop2 %>%
spread(place, value, fill = 0) %>%
column_to_rownames("age")-> pop3
chisq.test(pop3$BR, pop3$world)
Pearson's Chi-squared test
data: pop3$BR and pop3$world
X-squared = 35, df = 30, p-value = 0.2426
Post Lecture Session Week6
Exercises-taken from Statistical tools for high-throughput data analysis website
http://www.sthda.com/english/wiki/comparing-proportions-in-r
One proportion Z test-used to compare observed proportion to a theoretical one when there are only 2 categories.
This test is often used in cases where you want to compare a sample proportion to a population proportion for categorical data.
When to Use a One-Proportion Z-Test:
You have a single sample with categorical data (e.g., success/failure, yes/no).
You want to test if the proportion of “successes” in this sample matches a known population proportion.
The sample size is large enough that the sampling distribution of the proportion is approximately normal (commonly, both np≥10np≥10 and n(1−p)≥10n(1−p)≥10).
H0 usually stated as the sample proportion is equal to the population proportion, with alternative h being the sample proportion is different from (greater or less than) the population proportion
For example: one population of mice, n=160, would expect 50:50 m:f (pe), observed shows 95males:65females. So observed proportion of males is 95/160 = p0, females is 1-p0 #setting the null hypothesis-would expect p0=pe (a two tailed test). Setting the null hypothesis would expect p0=pe (2 tailed). Using R function binom.test(x, n, p=0.5, alternative = “two.sided”) where x = number of successes, n= total, p = probability.
Use the Binomial test when sample is small.
Or use the prop.test() function prop.test(x, n, p =NULL, alternative + two.sided”, correct =TRUE)
binom.test(95, 160, p = 0.5, alternative = "two.sided")Exact binomial test data: 95 and 160 number of successes = 95, number of trials = 160, p-value = 0.02157 alternative hypothesis: true probability of success is not equal to 0.5 95 percent confidence interval: 0.5133727 0.6705878 sample estimates: probability of success 0.59375prop.test(95, 160, p = NULL, alternative = "two.sided", correct = TRUE)1-sample proportions test with continuity correction data: 95 out of 160, null probability 0.5 X-squared = 5.2562, df = 1, p-value = 0.02187 alternative hypothesis: true p is not equal to 0.5 95 percent confidence interval: 0.5131775 0.6697480 sample estimates: p 0.59375
Interpretation:
If the p-valu is low (e.g., < 0.05), you may conclude that the observed sample proportion is significantly different from the hypothesized proportion. If it’s high, there isn’t enough evidence to suggest a difference.
2.Two Proportions Z tests-used to compare 2 observed proportions. This used to compare the proportions of two independent groups to determine if they are significantly different from each other. It’s commonly applied to categorical data when you have two groups (e.g., treatment vs. control) and want to test if the proportions of a specific outcome (e.g., success rate) differ between them
When to Use a Two-Proportion Z-Test:
You have categorical data and two independent groups.
You want to test if the proportions of a certain outcome are different between the two groups.
The sample sizes are large enough to approximate a normal distribution (commonly, both groups should meet the criteria n×p≥10n×p≥10 and n×(1−p)≥10n×(1−p)≥10).
For example: 2 groups A with lung cancer n = 500, no of smokers is 490, 98% B healthy, n = 500, no of smokers is 400, 80% overall prop of smokers is 890/500, 89%, non smoker 11% . We want to know if the proportion of smokers are the same in each group so setting hypothesis pa=pb and using the prop.test() function gives:
res <- prop.test(x = c(490, 400), n =c(500, 500))
res
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
Interpretation
A low p-value (e.g., < 0.05) indicates that the proportions are significantly different between the two groups. A high p-value suggests there isn’t enough evidence to conclude a difference.
3.Chi square Goodness of Fit test to evaluate the association between 2 categorical variables. Used to compare observed distribution to expected with 2 or more categories in discrete data. In other words it compares multiple observed proportions to expected probabilities. For example:
collected wild tulips 81 red, 50y, 27w
if equally distributed would expect 1/3
but if in the region of collection ratio is 3:2:1 (3+2+1=6)
then expected proportion would be 1/2 red, 1/3 y, 1/6 w
so is there a diff between observed and expected?
null h0 is no significant difference
So in R use: chisq.test(x, p ) where x is numeric vector and p is probabilities
tulip <- c(81, 50, 27)
res <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res #so p v low which says can reject null h0
Chi-squared test for given probabilities
data: tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07
So very low p value so can reject the h0.
Chi square test of independence. Used to evaluate the association between 2 categorical variables. First load the data which is a contingency table showing 13 tasks and their distribution in the couple
file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt" housetasks <- read.delim(file_path, row.names = 1) #installing the data install.packages("gplots")The downloaded binary packages are in /var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packageslibrary(gplots) #and displaying these graphically dt <- as.table(as.matrix(housetasks)) #first by converting the data to a table balloonplot(t(dt), main ="housetasks", xlab ="", ylab="", label = FALSE, show.margins = FALSE) # then Graphing using balloon plot function in gplots
then running the Chi squared test to examine whether the rows and columns of the contingency table are statistically significantly associated, with the null hypothesis set that they are independent:
chisq <- chisq.test(housetasks)
chisq
Pearson's Chi-squared test
data: housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
can extract observed and expected counts:
chisq$observed Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
Shopping 33 23 9 55
Official 12 46 23 15
Driving 10 51 75 3
Finances 13 13 21 66
Insurance 8 1 53 77
Repairs 0 3 160 2
Holidays 0 1 6 153
round(chisq$expected,2) Wife Alternating Husband Jointly
Laundry 60.55 25.63 38.45 51.37
Main_meal 52.64 22.28 33.42 44.65
Dinner 37.16 15.73 23.59 31.52
Breakfeast 48.17 20.39 30.58 40.86
Tidying 41.97 17.77 26.65 35.61
Dishes 38.88 16.46 24.69 32.98
Shopping 41.28 17.48 26.22 35.02
Official 33.03 13.98 20.97 28.02
Driving 47.82 20.24 30.37 40.57
Finances 38.88 16.46 24.69 32.98
Insurance 47.82 20.24 30.37 40.57
Repairs 56.77 24.03 36.05 48.16
Holidays 55.05 23.30 34.95 46.70
can also find which are the most contributing cells to the total chi-sq score, by calculating the Chi sq stat for each cell (Pearson’s residuals or standardised residuals):
round(chisq$residuals, 3) Wife Alternating Husband Jointly
Laundry 12.266 -2.298 -5.878 -6.609
Main_meal 9.836 -0.484 -4.917 -6.084
Dinner 6.537 -1.192 -3.416 -3.299
Breakfeast 4.875 3.457 -2.818 -5.297
Tidying 1.702 -1.606 -4.969 3.585
Dishes -1.103 1.859 -4.163 3.486
Shopping -1.289 1.321 -3.362 3.376
Official -3.659 8.563 0.443 -2.459
Driving -5.469 6.836 8.100 -5.898
Finances -4.150 -0.852 -0.742 5.750
Insurance -5.758 -4.277 4.107 5.720
Repairs -7.534 -4.290 20.646 -6.651
Holidays -7.419 -4.620 -4.897 15.556
and can visualise these using corrplot() function:
install.packages("corrplot")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(corrplot)
corrplot(chisq$residuals, is.cor = FALSE) #for a given cell the size of the circle is proportional to the amount of the cell contribution#positive residuals in blue specify positive association between corresponding row and column variables and vvand can look at the contribution in percentages:
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3) #and visualising this: Wife Alternating Husband Jointly
Laundry 7.738 0.272 1.777 2.246
Main_meal 4.976 0.012 1.243 1.903
Dinner 2.197 0.073 0.600 0.560
Breakfeast 1.222 0.615 0.408 1.443
Tidying 0.149 0.133 1.270 0.661
Dishes 0.063 0.178 0.891 0.625
Shopping 0.085 0.090 0.581 0.586
Official 0.688 3.771 0.010 0.311
Driving 1.538 2.403 3.374 1.789
Finances 0.886 0.037 0.028 1.700
Insurance 1.705 0.941 0.868 1.683
Repairs 2.919 0.947 21.921 2.275
Holidays 2.831 1.098 1.233 12.445
corrplot(contrib, is.cor = FALSE)#the relative contribution of each cell to total chi sq score give some indication of the nature of the dependency between rows and columns of the contingency tableContingency tables-taken from Analytics with R-Data Analysis section
There are many options for producing contingecy tables and summary tables in R. For example:
summary tables and stats using dplyr and tidyr (enables you to continue code seamlessly into the next task (filtering, plotting etc)
producing frequency and proportion tables using table()
producing frequency, proportion and Chi squared values using CrossTable()
the group_by(), summarise() and spread() commands are also useful for producing summary or agrregate values of our data
library(ggplot2) #loading required packages
library(dplyr)
library(tidyr)
library(knitr)data(mpg)
str(mpg)tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
$ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
$ model : chr [1:234] "a4" "a4" "a4" "a4" ...
$ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
$ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
$ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
$ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
$ drv : chr [1:234] "f" "f" "f" "f" ...
$ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
$ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
$ fl : chr [1:234] "p" "p" "p" "p" ...
$ class : chr [1:234] "compact" "compact" "compact" "compact" ...
glimpse(mpg)Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
So looking at the total number of cars with each class and cyl combination:
mpg%>%
group_by(class, cyl)%>%
summarise(n=n())%>% #creates the column n which contains the count of each row
kable()| class | cyl | n |
|---|---|---|
| 2seater | 8 | 5 |
| compact | 4 | 32 |
| compact | 5 | 2 |
| compact | 6 | 13 |
| midsize | 4 | 16 |
| midsize | 6 | 23 |
| midsize | 8 | 2 |
| minivan | 4 | 1 |
| minivan | 6 | 10 |
| pickup | 4 | 3 |
| pickup | 6 | 10 |
| pickup | 8 | 20 |
| subcompact | 4 | 21 |
| subcompact | 5 | 2 |
| subcompact | 6 | 7 |
| subcompact | 8 | 5 |
| suv | 4 | 8 |
| suv | 6 | 16 |
| suv | 8 | 38 |
and to turn this summary data into a cross table (contingency table) we need variable A (class) to be listed by row and variable B (cyl) to be listed by column. Can do this using the spread() function to create columns for each cyl value with n as the cross tab response value:
mpg%>%
group_by(class, cyl) %>%
summarise(n=n()) %>%
spread(cyl, n)%>%
kable()| class | 4 | 5 | 6 | 8 |
|---|---|---|---|---|
| 2seater | NA | NA | NA | 5 |
| compact | 32 | 2 | 13 | NA |
| midsize | 16 | NA | 23 | 2 |
| minivan | 1 | NA | 10 | NA |
| pickup | 3 | NA | 10 | 20 |
| subcompact | 21 | 2 | 7 | 5 |
| suv | 8 | NA | 16 | 38 |
If we wanted to look at summary stats other than frequency we can determine what kind of summary stat we want to see easily by adjusting our summarise() input. Here instead of frequency we can get the average number of city miles by class and cylinder:
mpg%>%
group_by(class, cyl) %>%
summarise(mean_cty=mean(cty))%>% #For each class and cyl group, calculates the mean of cty (city miles per gallon) and stores it in a new column called mean_cty
spread(cyl, mean_cty) %>% #Reshapes the data from long to wide format, spreading the cyl values across columns. Each cyl becomes its own column, with mean_cty values filling the cells
kable()| class | 4 | 5 | 6 | 8 |
|---|---|---|---|---|
| 2seater | NA | NA | NA | 15.40000 |
| compact | 21.37500 | 21 | 16.92308 | NA |
| midsize | 20.50000 | NA | 17.78261 | 16.00000 |
| minivan | 18.00000 | NA | 15.60000 | NA |
| pickup | 16.00000 | NA | 14.50000 | 11.80000 |
| subcompact | 22.85714 | 20 | 17.00000 | 14.80000 |
| suv | 18.00000 | NA | 14.50000 | 12.13158 |
or could look at or max number of city miles by class and cyl:
mpg%>%
group_by(class, cyl)%>%
summarise(max_cty=max(cty))%>%
spread(cyl, max_cty)%>%
kable()| class | 4 | 5 | 6 | 8 |
|---|---|---|---|---|
| 2seater | NA | NA | NA | 16 |
| compact | 33 | 21 | 18 | NA |
| midsize | 23 | NA | 19 | 16 |
| minivan | 18 | NA | 17 | NA |
| pickup | 17 | NA | 16 | 14 |
| subcompact | 35 | 20 | 18 | 15 |
| suv | 20 | NA | 17 | 14 |
Looking at the same data by Proportions rather than counts-we can find proportions by creating a new calculated variable dividing row frequency by table frequency:
mpg%>%
group_by(class)%>%
summarise(n=n())%>%
mutate(prop=n/sum(n))%>% #the new proportion variable
kable()| class | n | prop |
|---|---|---|
| 2seater | 5 | 0.0213675 |
| compact | 47 | 0.2008547 |
| midsize | 41 | 0.1752137 |
| minivan | 11 | 0.0470085 |
| pickup | 33 | 0.1410256 |
| subcompact | 35 | 0.1495726 |
| suv | 62 | 0.2649573 |
and we can then create a contingency table of proportion values by applying spread() as before:
mpg%>%
group_by(class, cyl)%>%
summarise(n=n())%>%
mutate(prop=n/sum(n))%>%
subset(select=c("class", "cyl", "prop"))%>% #this drops the frequency value
spread(class, prop)%>%
kable()| cyl | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
|---|---|---|---|---|---|---|---|
| 4 | NA | 0.6808511 | 0.3902439 | 0.0909091 | 0.0909091 | 0.6000000 | 0.1290323 |
| 5 | NA | 0.0425532 | NA | NA | NA | 0.0571429 | NA |
| 6 | NA | 0.2765957 | 0.5609756 | 0.9090909 | 0.3030303 | 0.2000000 | 0.2580645 |
| 8 | 1 | NA | 0.0487805 | NA | 0.6060606 | 0.1428571 | 0.6129032 |
The table() function is a quick way to pull together row/column frequency and proportions for categorical variables. For example, using basic table() command we can get contingency table for class by number of cyl:
table(mpg$class, mpg$cyl)
4 5 6 8
2seater 0 0 0 5
compact 32 2 13 0
midsize 16 0 23 2
minivan 1 0 10 0
pickup 3 0 10 20
subcompact 21 2 7 5
suv 8 0 16 38
A table of frequency can also be called using the ftable() command:
mpg_table<- table(mpg$class, mpg$cyl)
ftable(mpg_table) 4 5 6 8
2seater 0 0 0 5
compact 32 2 13 0
midsize 16 0 23 2
minivan 1 0 10 0
pickup 3 0 10 20
subcompact 21 2 7 5
suv 8 0 16 38
and when specifically wanting to look at row/column frequency we can use the margin.table() function with the appropriate argument applied:
margin.table(mpg_table, 1)#for row freq we use the margin.table() command with the 1 argument
2seater compact midsize minivan pickup subcompact suv
5 47 41 11 33 35 62
margin.table(mpg_table, 2)#and for column freq we use the 2 argument
4 5 6 8
81 4 79 70
and of course we can do the same things for proportions:
prop.table(mpg_table)#for prop of the entire table we use prop.table() command:
4 5 6 8
2seater 0.000000000 0.000000000 0.000000000 0.021367521
compact 0.136752137 0.008547009 0.055555556 0.000000000
midsize 0.068376068 0.000000000 0.098290598 0.008547009
minivan 0.004273504 0.000000000 0.042735043 0.000000000
pickup 0.012820513 0.000000000 0.042735043 0.085470085
subcompact 0.089743590 0.008547009 0.029914530 0.021367521
suv 0.034188034 0.000000000 0.068376068 0.162393162
prop.table(mpg_table, 1)#for row proportions we use prop.table() with the 1 argument
4 5 6 8
2seater 0.00000000 0.00000000 0.00000000 1.00000000
compact 0.68085106 0.04255319 0.27659574 0.00000000
midsize 0.39024390 0.00000000 0.56097561 0.04878049
minivan 0.09090909 0.00000000 0.90909091 0.00000000
pickup 0.09090909 0.00000000 0.30303030 0.60606061
subcompact 0.60000000 0.05714286 0.20000000 0.14285714
suv 0.12903226 0.00000000 0.25806452 0.61290323
prop.table(mpg_table, 2)#and for column proportions we use the 2 argument
4 5 6 8
2seater 0.00000000 0.00000000 0.00000000 0.07142857
compact 0.39506173 0.50000000 0.16455696 0.00000000
midsize 0.19753086 0.00000000 0.29113924 0.02857143
minivan 0.01234568 0.00000000 0.12658228 0.00000000
pickup 0.03703704 0.00000000 0.12658228 0.28571429
subcompact 0.25925926 0.50000000 0.08860759 0.07142857
suv 0.09876543 0.00000000 0.20253165 0.54285714
Lastly, the CrossTable command from the gmodels package produces a frequency with table and row proportions with one single command which can be very useful:
install.packages("gmodels")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(gmodels)
CrossTable(mpg$class, mpg$cyl) #run cross table command with the 2 variables as input
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 234
| mpg$cyl
mpg$class | 4 | 5 | 6 | 8 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
2seater | 0 | 0 | 0 | 5 | 5 |
| 1.731 | 0.085 | 1.688 | 8.210 | |
| 0.000 | 0.000 | 0.000 | 1.000 | 0.021 |
| 0.000 | 0.000 | 0.000 | 0.071 | |
| 0.000 | 0.000 | 0.000 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
compact | 32 | 2 | 13 | 0 | 47 |
| 15.210 | 1.782 | 0.518 | 14.060 | |
| 0.681 | 0.043 | 0.277 | 0.000 | 0.201 |
| 0.395 | 0.500 | 0.165 | 0.000 | |
| 0.137 | 0.009 | 0.056 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
midsize | 16 | 0 | 23 | 2 | 41 |
| 0.230 | 0.701 | 6.059 | 8.591 | |
| 0.390 | 0.000 | 0.561 | 0.049 | 0.175 |
| 0.198 | 0.000 | 0.291 | 0.029 | |
| 0.068 | 0.000 | 0.098 | 0.009 | |
-------------|-----------|-----------|-----------|-----------|-----------|
minivan | 1 | 0 | 10 | 0 | 11 |
| 2.070 | 0.188 | 10.641 | 3.291 | |
| 0.091 | 0.000 | 0.909 | 0.000 | 0.047 |
| 0.012 | 0.000 | 0.127 | 0.000 | |
| 0.004 | 0.000 | 0.043 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
pickup | 3 | 0 | 10 | 20 | 33 |
| 6.211 | 0.564 | 0.117 | 10.391 | |
| 0.091 | 0.000 | 0.303 | 0.606 | 0.141 |
| 0.037 | 0.000 | 0.127 | 0.286 | |
| 0.013 | 0.000 | 0.043 | 0.085 | |
-------------|-----------|-----------|-----------|-----------|-----------|
subcompact | 21 | 2 | 7 | 5 | 35 |
| 6.515 | 3.284 | 1.963 | 2.858 | |
| 0.600 | 0.057 | 0.200 | 0.143 | 0.150 |
| 0.259 | 0.500 | 0.089 | 0.071 | |
| 0.090 | 0.009 | 0.030 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
suv | 8 | 0 | 16 | 38 | 62 |
| 8.444 | 1.060 | 1.162 | 20.403 | |
| 0.129 | 0.000 | 0.258 | 0.613 | 0.265 |
| 0.099 | 0.000 | 0.203 | 0.543 | |
| 0.034 | 0.000 | 0.068 | 0.162 | |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 81 | 4 | 79 | 70 | 234 |
| 0.346 | 0.017 | 0.338 | 0.299 | |
-------------|-----------|-----------|-----------|-----------|-----------|
Week 7 Mean Tests
Pre-session videos and exercises
t-test and interpreting p values using R Programming by Greg Martin Youtube channel
t tests (single, two sided, one sided, paired) & interpreting p values using R
library(tidyverse)
install.packages("gapminder")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(gapminder)
install.packages("patchwork")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(patchwork)
library(dplyr)View and plot the data of interest in density plots (or box plots):
str(gapminder)tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
head(gapminder)# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
Examples of the plots in the exercise:
Plot 1: Life expectancy in the continent of Africa
# Filter data for Africa
africa_data <- gapminder %>%
filter(continent == "Africa")
# Calculate mean life expectancy
mean_lifeExp <- mean(africa_data$lifeExp)
# Plot density with mean line
ggplot(africa_data, aes(x = lifeExp)) +
geom_density(fill = "blue", alpha = 0.5) +
geom_vline(xintercept = mean_lifeExp, color = "red", linetype = "dashed", size = 1) +
labs(title = "Density Plot of Life Expectancy in Africa",
x = "Life Expectancy",
y = "Density") +
theme_minimal()Plot 2: Life expectancy in Africa and Europe
# Filter data for Africa and Europe
africa_europe_data <- gapminder %>%
filter(continent %in% c("Africa", "Europe"))
# Calculate mean life expectancy for each continent
mean_lifeExp <- africa_europe_data %>%
group_by(continent) %>%
summarize(mean_lifeExp = mean(lifeExp))
# Plot density for each continent with mean lines
ggplot(africa_europe_data, aes(x = lifeExp, fill = continent, color = continent)) +
geom_density(alpha = 0.4) +
geom_vline(data = mean_lifeExp, aes(xintercept = mean_lifeExp, color = continent),
linetype = "dashed", size = 1) +
labs(title = "Density Plot of Life Expectancy in Africa and Europe",
x = "Life Expectancy",
y = "Density") +
theme_minimal()Mean tests, like the t-test (both one-sample, two-sample, and paired) and ANOVA (analysis of variance), are parametric tests. This means they rely on certain assumptions about the population data, primarily that:
The data are normally distributed.
The samples have similar variances (homogeneity of variance).
Observations are independent of each other.
These tests are designed to estimate and compare means under these assumptions, making .hem robust for many types of data but less ideal when data significantly deviate from normality or have unequal variances.
If the data do not meet these assumptions, non-parametric alternatives to compare central tendencies exist, such as the Mann-Whitney U test or Wilcoxon signed-rank test for comparing medians rather than means.
In a t test we are basically looking at the mean or average in to see if that average is different? Different to the one we think we expect or in one subset statistically significantly different to the average in another subset of the same population.
- Single sample t test-set the hypothesis: h0 could be that the mean expected lifespan is 50yrs. The alternative is then that the mean expected lifespan is not 50 yrs. So assuming that ho is correct, what are the chances that we’d get a sample that is as different or more. Must pick the threshold before run the data.
gapminder%>% #using the preloaded data in gapminder
filter(continent == "Africa")%>% #selecting Africa only data
select(lifeExp)%>% #looking at Lifeexpectancy variable only
t.test(mu = 50) #run the t test with pop mean of 50 (assumed mean is 50)
One Sample t-test
data: .
t = -3.0976, df = 623, p-value = 0.002038
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
48.14599 49.58467
sample estimates:
mean of x
48.86533
Results: p value-probability that we’d get a sample that is as different to 50 or more as what we’ve got here, is a less than 5% chance so can reject h0, and therefore by definition accept alternative-95% confident that true mean lies between 48 and 49, does not include 50 so sample data is statistically significantly different to 50.
Can also do same thing by creating an object as below:
my_ttest <- gapminder %>% #doing same thing but creating an object
filter(continent == "Africa")%>%
select(lifeExp)%>%
t.test(mu = 50)
attributes(my_ttest)#then allows functions on that object$names
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "stderr" "alternative" "method" "data.name"
$class
[1] "htest"
my_ttest$p.value #such as p value specifically for example [1] 0.002038444
Two sided t test-is the mean for Africa statistically significantly different to Europe’s? The 2 sides refer to above or below
gapminder%>% filter(continent %in% c("Africa", "Europe")) %>% t.test(lifeExp ~ continent, data = ., alternativve = "two.sided") #default anyway but means we are doing 2 sided test-is it different in any direction (up or down)Welch Two Sample t-test data: lifeExp by continent t = -49.551, df = 981.2, p-value < 2.2e-16 alternative hypothesis: true difference in means between group Africa and group Europe is not equal to 0 95 percent confidence interval: -23.95076 -22.12595 sample estimates: mean in group Africa mean in group Europe 48.86533 71.90369
These results show tiny p value (far less than 0.05) so can therefore reject h0 and accept that the difference is statistically significant, confidence interval doesn’t include 0.
One sided t test-looking at the plot the life expectancy in Ireland it is clearly less than Switzerland but is it statistically significant? In this case the question isis it stat sig that is lower than swi, so ho is that they are the same, h1 is that life expectancy in Ire is less by that magnitude or more than it is in Switzerland.
gapminder %>% filter(country %in% c("Ireland", "Switzerland"))%>% t.test(lifeExp ~ country, data = ., alternative = "less", conf.level = 0.95)Welch Two Sample t-test data: lifeExp by country t = -1.6337, df = 21.77, p-value = 0.05835 alternative hypothesis: true difference in means between group Ireland and group Switzerland is less than 0 95 percent confidence interval: -Inf 0.1313697 sample estimates: mean in group Ireland mean in group Switzerland 73.01725 75.56508
Results say not statistically significant as p value 0.058 so can’t reject h0, might well be the case that LE in the 2 is the same, 95% conf interval and does include 0.
Paired t test-example here is looking at one population at 2 different time points-life expectancy in Africa in 1957 versus 2007, exact counterpart for each data point so pairs of data so needs a paired t test:
library(tidyr) gapminder %>% filter(continent=="Africa") %>% mutate(year2=as.factor(year)) %>% filter(year %in% c("1957", "2007")) %>% select(country, year2, lifeExp) %>% spread(year2, lifeExp)->africa africa# A tibble: 52 × 3 country `1957` `2007` <fct> <dbl> <dbl> 1 Algeria 45.7 72.3 2 Angola 32.0 42.7 3 Benin 40.4 56.7 4 Botswana 49.6 50.7 5 Burkina Faso 34.9 52.3 6 Burundi 40.5 49.6 7 Cameroon 40.4 50.4 8 Central African Republic 37.5 44.7 9 Chad 39.9 50.7 10 Comoros 42.5 65.2 # ℹ 42 more rowst.test(africa$`1957`, africa$`2007`, paired = TRUE, alternative = "less")Paired t-test data: africa$`1957` and africa$`2007` t = -11.381, df = 51, p-value = 6.54e-16 alternative hypothesis: true mean difference is less than 0 95 percent confidence interval: -Inf -11.54659 sample estimates: mean difference -13.53969
Assumptions of t test:
large, representative sample
values are normally distributed (check shape using density plots for example)
similar variances (can check this in R)
ANOVA’s
ANOVA using R programming by Greg Martin Youtube channel
ANOVA, or Analysis of Variance, is a statistical method used to compare the means of three or more groups to see if at least one group mean is significantly different from the others. ANOVA determines this by analyzing the variance within each group and comparing it to the variance between the groups. Here’s a breakdown of key concepts and types of ANOVA:
How ANOVA Works
Hypotheses:
Null Hypothesis (H₀): All group means are equal.
Alternative Hypothesis (H₁): At least one group mean is different.
F-statistic: ANOVA calculates an F-statistic to determine if observed group mean differences are greater than what would be expected by chance. A higher F-statistic typically indicates a more significant difference between groups.
Assumptions:
The data are normally distributed within each group.
Homogeneity of variances (each group has a similar variance).
Independence of observations.
Types of ANOVA
One-Way ANOVA:
Compares means across one factor (e.g., test scores across different teaching methods).
Useful when testing one categorical independent variable with more than two levels.
Two-Way ANOVA:
Compares means across two factors (e.g., the effect of both teaching methods and student gender on test scores).
Can also test for interactions between the two factors.
Repeated Measures ANOVA:
Used when the same subjects are measured multiple times under different conditions (e.g., testing the same group’s scores before and after training).
Accounts for correlations within subjects.
MANOVA (Multivariate Analysis of Variance):
Used when there are multiple dependent variables.
Allows testing of how independent variables influence several outcomes at once.
Make sure know what question is being asked and how to interpret the results. t test compares 2 means, if it was just 2 then t test , if more than 2 need to use anova. Underlying principles same-assume no difference in means
Useful plots are boxplots and density plots. Densities suggest that means are not equal. So we are asking if the difference we are seeing is significant. The way we answer this is by assuming the opposite-h0 in all 3 continents is the same, no difference. If can show that this unlikely to be true then can accept the alternative hypothesis that the difference we are seeing is real. If there were no difference then how likely (what is the prob) that we would get sample that shows the diff we are seeing-if very unlikely then can reject and accept the alternative hypothesis.
decide on significance level (p hacking is bad science) before start
gapdata <- gapminder %>%
filter(year == 2007 &
continent %in% c("Americas", "Europe", "Asia"))%>%
select(continent, lifeExp)Arranging this data to show what we want:
gapdata %>%
group_by(continent)%>%
summarise(Mean_life = mean(lifeExp))%>%
arrange(Mean_life)# A tibble: 3 × 2
continent Mean_life
<fct> <dbl>
1 Asia 70.7
2 Americas 73.6
3 Europe 77.6
Is the difference real? If the p value is less than 0.05 then will reject h0.
So creating the Anova model:
gapdata %>%
aov(lifeExp ~ continent, data = .,) %>%
summary() Df Sum Sq Mean Sq F value Pr(>F)
continent 2 755.6 377.8 11.63 3.42e-05 ***
Residuals 85 2760.3 32.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov_model <- gapdata %>%
aov(lifeExp ~ continent, data = .)So p tiny so can reject p0, so means are stat diff.
However don’t know which one is diff to the other 2 (can say not equal) but don’t which ones are diff. so teasing that out a bit further:
gapdata %>%
aov(lifeExp ~ continent, data = .) %>%
TukeyHSD() #takes each combination and consider each 2 Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lifeExp ~ continent, data = .)
$continent
diff lwr upr p adj
Asia-Americas -2.879635 -6.4839802 0.7247099 0.1432634
Europe-Americas 4.040480 0.3592746 7.7216854 0.0279460
Europe-Asia 6.920115 3.4909215 10.3493088 0.0000189
Asia and America not statistically significantly difference but Europe and America are, and Euro-Asia also is as p <0.05
Week 7 Lecture session
Means testing is the only one of the 4 families that has parametric and non-p tests.
Used for data where x axis is categorical and y response variable is numerical (can be discrete or continuous)
One sample T test
Does the mean of the sample differ from the known or expected mean? Need to check assumptions before applying a parametric mean test (continuous, normally distributed, random sample and enough data)
Eg using penguin data-all of this work has been done on an r script, not sure if worth transcribing over as largely repeat of pre and post lecture sessions work
Post Lecture session taken from Statistical Tools for high throughput data analysis website
1. Comparing one-sample mean to a standard known mean
One-sample T-test (parametric)
A one-sample t-test is used to compare the mean of one sample to a known standard (or theoretical/hypothetical) mean (μ). Generally, the theoretical mean comes from:
a previous experiment. For example, compare whether the mean weight of mice differs from 200 mg, a value determined in a previous study.
or from an experiment where you have control and treatment conditions. If you express your data as “percent of control”, you can test whether the average value of treatment condition differs significantly from 100.
Note that, one-sample t-test can be used only, when the data are normally distributed which can be checked using Shapiro-Wilk test.
Research questions and statistical hypotheses-typical research questions are:
- whether the mean (m) of the sample is equal to/less then/greater than the theoretical mean (mu)
We can then define the corresponding null and alternate hypothesis (and if they are one or two tailed).
Formula of one-sample t-test-can be calculated as follows:
t=m−μs/n‾√ where,
m is the sample mean
n is the sample size
s is the sample standard deviation with n−1 degrees of freedom
μ is the theoretical value
We can compute the p-value corresponding to the absolute value of the t-test statistics (|t|) for the degrees of freedom (df): df=n−1.
How to interpret the results?
If the p-value is inferior or equal to the significance level 0.05, we can reject the null hypothesis and accept the alternative hypothesis. In other words, we conclude that the sample mean is significantly different from the theoretical mean.
Visualize your data and compute one-sample t-test in R
Install ggpubr R package for data visualization
install.packages("ggpubr")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
R function to compute one-sample t-test
To perform one-sample t-test, the R function t.test() can be used as follow:
t.test(x, mu = 0, alternative = "two.sided")
x: a numeric vector containing your data values
mu: the theoretical mean. Default is 0 but you can change it.
alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.
Example:
Here, we’ll use an example data set containing the weight of 10 mice.
We want to know, if the average weight of the mice differs from 25g?
set.seed(1234) my_data <- data.frame( name = paste0(rep("M_", 10), 1:10), weight = round(rnorm(10, 20, 2), 1) )# Print the first 10 rows of the data to check it head(my_data, 10)name weight 1 M_1 17.6 2 M_2 20.6 3 M_3 22.2 4 M_4 15.3 5 M_5 20.9 6 M_6 21.0 7 M_7 18.9 8 M_8 18.9 9 M_9 18.9 10 M_10 18.2
# Statistical summaries of weight
summary(my_data$weight) Min. 1st Qu. Median Mean 3rd Qu. Max.
15.30 18.38 18.90 19.25 20.82 22.20
#Visualize your data using box plots
library(ggpubr)
ggboxplot(my_data$weight,
ylab = "Weight (g)", xlab = FALSE,
ggtheme = theme_minimal())Preliminary test to check one-sample t-test assumptions:
Is this a large sample? - No, because n < 30.
Since the sample size is not large enough (less than 30, central limit theorem), we need to check whether the data follow a normal distribution.
How to check the normality? Using the Shapiro-Wilk normality test and to look at the normality plot with h0 set as the data are normally distributed
shapiro.test(my_data$weight) # => p-value = 0.6993 so from the output, the p-value isdgreater than the significance level 0.05 implying that the distribution of the data are not significantly different from normal distribtion. In other words, we can assume the normality.
Shapiro-Wilk normality test
data: my_data$weight
W = 0.9526, p-value = 0.6993
Could also do visual inspection of the normality using q-q plots.
Then Compute one-sample t-test:
We want to know, if the average weight of the mice differs from 25g (two-tailed test)?
# One-sample t-test
res <- t.test(my_data$weight, mu = 25)
# Printing the results
res
One Sample t-test
data: my_data$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
17.8172 20.6828
sample estimates:
mean of x
19.25
In the result above :
t is the t-test statistic value (t = -9.078),
df is the degrees of freedom (df= 9),
p-value is the significance level of the t-test (p-value = 7.95310^{-6}).
conf.int is the confidence interval of the mean at 95% (conf.int = [17.8172, 20.6828]);
sample estimates is the mean value of the sample (mean = 19.25).
The p-value of the test is 7.95310^{-6}, which is less than the significance level alpha = 0.05. We can conclude that the mean weight of the mice is significantly different from 25g with ap-value = 7.95310^{-6}
Note that:
- if you want to test whether the mean weight of mice is less/more than 25g (one-tailed test), type this:
t.test(my_data$weight, mu = 25, alternative = "less/more")
Access to the values returned by t.test() function-the result is a list containing:
statistic: the value of the t test statistics
parameter: the degrees of freedom for the t test statistics
p.value: the p-value for the test
conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.
estimate: the means of the two groups being compared (in the case of independent t test) or difference in means (in the case of paired t test).
The format of the R code to use for getting these values is as follow:
# printing the p-value
res$p.value[1] 7.953383e-06
# printing the mean
res$estimatemean of x
19.25
# printing the confidence interval
res$conf.int[1] 17.8172 20.6828
attr(,"conf.level")
[1] 0.95
2. What’s a one-sample Wilcoxon signed rank test?
The one-sample Wilcoxon signed rank test is a non-parametric alternative to one-sample t-test when the data cannot be assumed to be normally distributed. It’s used to determine whether the median of the sample is equal to a known standard value (i.e. theoretical value). Note that, the data should be distributed symmetrically around the median. In other words, there should be roughly the same number of values above and below the median.
Research questions and statistical hypotheses-typical research questions are:
- whether the median (m) of the sample is equal to/greater than/less than the theoretical value (m0)?
In statistics, we can define the corresponding null hypothesis (H0) as follow:
- H0:m=m0/H0:m≤m0/H0:m≥m0 and the corresponding alternative hypotheses (Ha) are as follow:
Ha:m≠m0 (different)
Ha:m>m0 (greater)
Ha:m<m0 (less)
R function to compute one-sample Wilcoxon test
To perform one-sample Wilcoxon-test, the R function
wilcox.test() can be used as follow:
{wilcox.test(x, mu = 0, alternative = "two.sided")}where:
x: a numeric vector containing your data values
mu: the theoretical mean/median value. Default is 0 but you can change it.
alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.
Eg: Here, we’ll use an example data set containing the weight of 10 mice. We want to know, if the median weight of the mice differs from 25g?
set.seed(1234) my_data <- data.frame( name = paste0(rep("M_", 10), 1:10), weight = round(rnorm(10, 20, 2), 1) )
Check your data:
# Print the first 10 rows of the data
head(my_data, 10) name weight
1 M_1 17.6
2 M_2 20.6
3 M_3 22.2
4 M_4 15.3
5 M_5 20.9
6 M_6 21.0
7 M_7 18.9
8 M_8 18.9
9 M_9 18.9
10 M_10 18.2
# Statistical summaries of weight
summary(my_data$weight) Min. 1st Qu. Median Mean 3rd Qu. Max.
15.30 18.38 18.90 19.25 20.82 22.20
Visualize your data using box plots:
library(ggpubr)
ggboxplot(my_data$weight,
ylab = "Weight (g)", xlab = FALSE,) ggtheme = theme_minimal()Compute one-sample Wilcoxon test-we want to know if the average weight of the mice differs from 25g (two-tailed test)?
# One-sample wilcoxon test
res <- wilcox.test(my_data$weight, mu = 25)
# Printing the results
res
Wilcoxon signed rank test with continuity correction
data: my_data$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
# print only the p-value
res$p.value[1] 0.005793045
The p-value of the test is 0.005793, which is less than the significance level alpha = 0.05. We can reject the null hypothesis and conclude that the average weight of the mice is significantly different from 25g with a p-value = 0.005793.
Note that:
- if you want to test whether the median weight of mice is less than 25g (one-tailed test), type this:
wilcox.test(my_data$weight, mu = 25,
alternative = "less")
- Or, if you want to test whether the median weight of mice is greater than 25g (one-tailed test), type this:
wilcox.test(my_data$weight, mu = 25,
alternative = "greater")
3. Comparing the means of two independent groups
3.1 Unpaired two samples t-test (parametric)
The unpaired two-samples t-test is used to compare the mean of two independent groups. For example, suppose that we have measured the weight of 100 individuals: 50 women (group A) and 50 men (group B). We want to know if the mean weight of women (mA) is significantly different from that of men (mB). In this case, we have two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it’s possible to use an independent t-test to evaluate whether the means are different.
Note that, unpaired two-samples t-test can be used only under certain conditions:
when the two groups of samples (A and B), being compared, are normally distributed. This can be checked using Shapiro-Wilk test.
and when the variances of the two groups are equal. This can be checked using F-test.
Typical research questions are: whether the mean of group A is equal to /less than /greater than the mean of group B, allowing h0 to be set as ma=mb.
R function to compute unpaired two-samples t-test
To perform two-samples t-test comparing the means of two independent samples (x & y), the R function t.test() can be used as follow:
t.test(x, y, alternative = “two.sided”, var.equal = FALSE) where:
x,y: numeric vectors
alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.
var.equal: a logical variable indicating whether to treat the two variances as being equal. If TRUE then the pooled variance is used to estimate the variance otherwise the Welch test is used.
Example: Here, we’ll use an example data set, which contains the weight of 18 individuals (9 women and 9 men). We want to know, if the average women’s weight differs from the average men’s weight?
# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(
group = rep(c("Woman", "Man"), each = 9),
weight = c(women_weight, men_weight)
)check the data:
print(my_data) group weight
1 Woman 38.9
2 Woman 61.2
3 Woman 73.3
4 Woman 21.8
5 Woman 63.4
6 Woman 64.6
7 Woman 48.4
8 Woman 48.8
9 Woman 48.5
10 Man 67.8
11 Man 60.0
12 Man 63.4
13 Man 76.0
14 Man 89.4
15 Man 73.3
16 Man 67.3
17 Man 61.3
18 Man 62.4
Compute summary statistics by groups:
library(dplyr)
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)# A tibble: 2 × 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 Man 9 69.0 9.38
2 Woman 9 52.1 15.6
Visualise the data using box plots:
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
color = "group", palette = c("#00AFBB", "#E7B800"),
ylab = "Weight", xlab = "Groups")Preliminary test to check independent t-test assumptions:
Assumption 1: Are the two samples independents?
Yes, since the samples from men and women are not related.
Assumtion 2: Are the data from each of the 2 groups follow a normal distribution?
Use Shapiro-Wilk normality test with the null hypothesis set as the data are normally distributed (alternative hypothesis therefore the data are not normally distributed)
We’ll use the functions with() and shapiro.test() to compute Shapiro-Wilk test for each group of samples:
# Shapiro-Wilk normality test for Men's weights
with(my_data, shapiro.test(weight[group == "Man"]))# p = 0.1
Shapiro-Wilk normality test
data: weight[group == "Man"]
W = 0.86425, p-value = 0.1066
# Shapiro-Wilk normality test for Women's weights
with(my_data, shapiro.test(weight[group == "Woman"])) # p = 0.6
Shapiro-Wilk normality test
data: weight[group == "Woman"]
W = 0.94266, p-value = 0.6101
From the output, the two p-values are greater than the significance level 0.05 implying that the distribution of the data are not significantly different from the normal distribution. In other words, we can assume the normality. Note that, if the data are not normally distributed, it’s recommended to use the non parametric two-samples Wilcoxon rank test.
Assumption 3. Do the two populations have the same variances? We’ll use F-test to test for homogeneity in variances. This can be performed with the function var.test() as follow:
res.ftest <- var.test(weight ~ group, data = my_data)
res.ftest
F test to compare two variances
data: weight by group
F = 0.36134, num df = 8, denom df = 8, p-value = 0.1714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.08150656 1.60191315
sample estimates:
ratio of variances
0.3613398
The p-value of F-test is p = 0.1713596. It’s greater than the significance level alpha = 0.05. In conclusion, there is no significant difference between the variances of the two sets of data. Therefore, we can use the classic t-test which assume equality of the two variances.
Compute unpaired two-samples t-test
Question : Is there any significant difference between women and men weights?
1) Compute independent t-test - Method 1: The data are saved in two different numeric vectors.
# Compute t-test
res <- t.test(women_weight, men_weight, var.equal = TRUE)
res
Two Sample t-test
data: women_weight and men_weight
t = -2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-29.748019 -4.029759
sample estimates:
mean of x mean of y
52.10000 68.98889
2) Compute independent t-test - Method 2: The data are saved in a data frame.
# Compute t-test
res <- t.test(weight ~ group, data = my_data, var.equal = TRUE)
res
Two Sample t-test
data: weight by group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
95 percent confidence interval:
4.029759 29.748019
sample estimates:
mean in group Man mean in group Woman
68.98889 52.10000
Both methods give the same result. In the result above :
t is the t-test statistic value (t = 2.784),
df is the degrees of freedom (df= 16),
p-value is the significance level of the t-test (p-value = 0.01327).
conf.int is the confidence interval of the mean at 95% (conf.int = [4.0298, 29.748]);
sample estimates is he mean value of the sample (mean = 68.9888889, 52.1).
Note that:
- if you want to test whether the average men’s weight is less than the average women’s weight, type this:
t.test(weight ~ group, data = my_data,
var.equal = TRUE, alternative = "less")
- Or, if you want to test whether the average men’s weight is greater than the average women’s weight, type this
t.test(weight ~ group, data = my_data,
var.equal = TRUE, alternative = "greater")
Interpretation of the result
The p-value of the test is 0.01327, which is less than the significance level alpha = 0.05. We can conclude that men’s average weight is significantly different from women’s average weight with a p-value 0.01327.
Access to the values returned by t.test() function:
The result of t.test() function is a list containing the following components:
statistic: the value of the t test statistics
parameter: the degrees of freedom for the test statistics
p.value: the p-value for the test
conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.
estimate: the means of the two groups being compared (in the case of independent t test) or difference in means (in the case of paired t test).
The format of the R code to use for getting these values is as follow:
# printing the p-value
res$p.value[1] 0.0132656
# printing the mean
res$estimate mean in group Man mean in group Woman
68.98889 52.10000
# printing the confidence interval
res$conf.int[1] 4.029759 29.748019
attr(,"conf.level")
[1] 0.95
3.2 Unpaired two-samples Wilcoxon test (non-parametric)
The unpaired two-samples Wilcoxon test (also known as Wilcoxon rank sum test or Mann-Whitney test) is a non-parametric alternative to the unpaired two-samples t-test, which can be used to compare two independent groups of samples. It’s used when your data are not normally distributed.
Visualize your data and compute Wilcoxon test in R:
To perform two-samples Wilcoxon test comparing the means of two independent samples (x & y), the R function wilcox.test() can be used as follows:
wilcox.test(x, y, alternative = “two.sided”) where x,y are
numeric vectors and alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.
Example:Here, we’ll use an example data set, which contains the weight of 18 individuals (9 women and 9 men). We want to know, if the median women’s weight differs from the median men’s weight?
# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(
group = rep(c("Woman", "Man"), each = 9),
weight = c(women_weight, men_weight)
)Checking the data and looking at summary stats:
print(my_data) group weight
1 Woman 38.9
2 Woman 61.2
3 Woman 73.3
4 Woman 21.8
5 Woman 63.4
6 Woman 64.6
7 Woman 48.4
8 Woman 48.8
9 Woman 48.5
10 Man 67.8
11 Man 60.0
12 Man 63.4
13 Man 76.0
14 Man 89.4
15 Man 73.3
16 Man 67.3
17 Man 61.3
18 Man 62.4
library(dplyr)
group_by(my_data, group) %>%
summarise(
count = n(),
median = median(weight, na.rm = TRUE),
IQR = IQR(weight, na.rm = TRUE)
)# A tibble: 2 × 4
group count median IQR
<chr> <int> <dbl> <dbl>
1 Man 9 67.3 10.9
2 Woman 9 48.8 15
Visualising the data using box plots:
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
color = "group", palette = c("#00AFBB", "#E7B800"),
ylab = "Weight", xlab = "Groups")Compute unpaired two-samples Wilcoxon test
Question : Is there any significant difference between women and men weights?
1) Compute two-samples Wilcoxon test - Method 1: The data are saved in two different numeric vectors.
res <- wilcox.test(women_weight, men_weight)
res
Wilcoxon rank sum test with continuity correction
data: women_weight and men_weight
W = 15, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
2) Compute two-samples Wilcoxon test - Method 2: The data are saved in a data frame.
res <- wilcox.test(weight ~ group, data = my_data,
exact = FALSE)
res
Wilcoxon rank sum test with continuity correction
data: weight by group
W = 66, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
As you can see, the two methods give the same results. The p-value of the test is 0.02712, which is less than the significance level alpha = 0.05. We can conclude that men’s median weight is significantly different from women’s median weight with a p-value = 0.02712.
Note that:
- if you want to test whether the median men’s weight is less than the median women’s weight, type this:
wilcox.test(weight ~ group, data = my_data,
exact = FALSE, alternative = "less")
- Or, if you want to test whether the median men’s weight is greater than the median women’s weight, type this
wilcox.test(weight ~ group, data = my_data,
exact = FALSE, alternative = "greater")
4 Comparing the means of paired samples
4.1 Paired samples t-test (parametric)
The paired samples t-test is used to compare the means between two related groups of samples. In this case, you have two values (i.e., pair of values) for the same samples. This article describes how to compute paired samples t-test using R software. As an example of data, 20 mice received a treatment X during 3 months. We want to know whether the treatment X has an impact on the weight of the mice.
To answer this question, the weight of the 20 mice has been measured before and after the treatment. This gives us 20 sets of values before treatment and 20 sets of values after treatment from measuring twice the weight of the same mice. In such situations paired t-test can be used to compare the mean weights before and after treatment.
Paired t-test analysis is performed as follow:
Calculate the difference (d) between each pair of value
Compute the mean (m) and the standard deviation (s) of d
Compare the average difference to 0. If there is any significant difference between the two pairs of samples, then the mean of d (m) is expected to be far from 0.
Paired t-test can be used only when the difference d is normally distributed. This can be checked using Shapiro-Wilk test.
Research questions and statistical hypotheses-typical research questions are whether mean difference is equal to/less than/greater than 0. Allows us to define h0 as m=0 etc.
Visualize your data and compute paired t-test in RsR function to compute paired t-test
To perform paired samples t-test comparing the means of two paired samples (x & y), the R function t.test() can be used as follows:
t.test(x, y, paired = TRUE, alternative = "two.sided")
where
x,y: numeric vectors
paired: a logical value specifying that we want to compute a paired t-test
alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.
# Data in two numeric vectors # ++++++++++++++++++++++++++ # Weight of the mice before treatment before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7) # Weight of the mice after treatment after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2) # Create a data frame my_data <- data.frame( group = rep(c("before", "after"), each = 10), weight = c(before, after) )
We want to know if there is a significant difference?
Checking and summarising the data:
# Print all data
print(my_data) group weight
1 before 200.1
2 before 190.9
3 before 192.7
4 before 213.0
5 before 241.4
6 before 196.9
7 before 172.2
8 before 185.5
9 before 205.2
10 before 193.7
11 after 392.9
12 after 393.2
13 after 345.1
14 after 393.0
15 after 434.0
16 after 427.9
17 after 422.0
18 after 383.9
19 after 392.3
20 after 352.2
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)# A tibble: 2 × 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 after 10 394. 29.4
2 before 10 199. 18.5
and visualising the data:
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
color = "group", palette = c("#00AFBB", "#E7B800"),
order = c("before", "after"),
ylab = "Weight", xlab = "Groups")Box plots show you the increase, but lose the paired information. You can use the function plot.paired() [in pairedData package] to plot paired data (“before - after” plot).
install.packages("PairedData")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(PairedData)# Subset weight data before treatment
before <- subset(my_data, group == "before", weight,
drop = TRUE)
# subset weight data after treatment
after <- subset(my_data, group == "after", weight,
drop = TRUE)
# Plot paired data
library(PairedData)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()Preliminary tests to check paired t-test assumption:s
Assumption 1: Are the two samples paired?
Yes, since the data have been collected from measuring twice the weight of the same mice.
Assumption 2: Is this a large sample?
No, because n < 30. Since the sample size is not large enough (less than 30), we need to check whether the differences of the pairs follow a normal distribution using the Shapiro-Wilk normality test with the Null hypothesis set as the data are normally distributed.
# compute the difference
d <- with(my_data,
weight[group == "before"] - weight[group == "after"])
# Shapiro-Wilk normality test for the differences
shapiro.test(d) # => p-value = 0.6141
Shapiro-Wilk normality test
data: d
W = 0.94536, p-value = 0.6141
From the output, the p-value is greater than the significance level 0.05 implying that the distribution of the differences (d) are not significantly different from normal distribution. In other words, we can assume the normality. Note that, if the data are not normally distributed, it’s recommended to use the non parametric paired two-samples Wilcoxon test.
Compute paired samples t-test
Question : Is there any significant changes in the weights of mice after treatment?
1) Compute paired t-test - Method 1: The data are saved in two different numeric vectors.
# Compute t-test
res <- t.test(before, after, paired = TRUE)
res
Paired t-test
data: before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-215.5581 -173.4219
sample estimates:
mean difference
-194.49
2) Compute paired t-test - Method 2: The data are saved in a data frame but could not get this method to work (as kept getting error message that could not use paired in formula method) so have removed as a chunk but left included for completeness:
res <- t.test(weight ~ group, data = my_data, paired = TRUE)
res
Both methods give the same result. In the result above :
t is the t-test statistic value (t = 20.88),
df is the degrees of freedom (df= 9),
p-value is the significance level of the t-test (p-value = 6.210^{-9}).
conf.int is the confidence interval (conf.int) of the mean differences at 95% is also shown (conf.int= [173.42, 215.56])
sample estimates is the mean differences between pairs (mean = 194.49).
Note that:
- if you want to test whether the average weight before treatment is less than the average weight after treatment, type this:
t.test(weight ~ group, data = my_data, paired = TRUE,
alternative = "less")
- Or, if you want to test whether the average weight before treatment is greater than the average weight after treatment, type this
t.test(weight ~ group, data = my_data, paired = TRUE,
alternative = "greater")
Interpretation of the result
The p-value of the test is 6.210^{-9}, which is less than the significance level alpha = 0.05. We can then reject null hypothesis and conclude that the average weight of the mice before treatment is significantly different from the average weight after treatment with a p-value = 6.210^{-9}.
Access to the values returned by t.test() function
The result of t.test( function is a list containing the following components:
statistic: the value of the t test statistics
parameter: the degrees of freedom for the t test statistics
p.value: the p-value for the test
conf.int: a confidence interval for the mean appropriate to the specified alternative hypothesis.
estimate: the means of the two groups being compared (in the case of independent t test) or difference in means (in the case of paired t test).
The format of the R code to use for getting these values is as follow:
# printing the p-value
res$p.value[1] 6.200298e-09
# printing the mean
res$estimatemean difference
-194.49
# printing the confidence interval
res$conf.int[1] -215.5581 -173.4219
attr(,"conf.level")
[1] 0.95
4.2 Paired samples Wilcoxon test (non-parametric)
The paired samples Wilcoxon test (also known as Wilcoxon signed-rank test) is a non-parametric alternative to paired t-test used to compare paired data. It’s used when your data are not normally distributed. Differences between paired samples should be distributed symmetrically around the median.
Visualise your data and compute paired samples using the R functionwilcox.test() as follows:
wilcox.test(x, y, paired = TRUE, alternative = "two.sided")
where:
x,y: numeric vectorspaired: a logical value specifying that we want to compute a paired Wilcoxon test
alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.
Here, we’ll use an example data set, which contains the weight of 10 mice before and after a treatment.
# Data in two numeric vectors # ++++++++++++++++++++++++++ # Weight of the mice before treatment before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7) # Weight of the mice after treatment after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2) # Create a data frame my_data <- data.frame( group = rep(c("before", "after"), each = 10), weight = c(before, after) )We want to know, if there is any significant difference in the median weights before and after treatment?
Checking the data and computing summary statistics (median and inter-quartile range (IQR)) by groups:
# Print all data print(my_data)group weight 1 before 200.1 2 before 190.9 3 before 192.7 4 before 213.0 5 before 241.4 6 before 196.9 7 before 172.2 8 before 185.5 9 before 205.2 10 before 193.7 11 after 392.9 12 after 393.2 13 after 345.1 14 after 393.0 15 after 434.0 16 after 427.9 17 after 422.0 18 after 383.9 19 after 392.3 20 after 352.2group_by(my_data, group) %>% summarise( count = n(), median = median(weight, na.rm = TRUE), IQR = IQR(weight, na.rm = TRUE) )# A tibble: 2 × 4 group count median IQR <chr> <int> <dbl> <dbl> 1 after 10 393. 28.8 2 before 10 195. 12.6Visualise the data using boxplots:
# Plot weight by group and color by group library("ggpubr") ggboxplot(my_data, x = "group", y = "weight", color = "group", palette = c("#00AFBB", "#E7B800"), order = c("before", "after"), ylab = "Weight", xlab = "Groups")Box plots show you the increase, but lose the paired information. You can use the function plot.paired() [in pairedDatapackage] to plot paired data (“before - after” plot).
# Subset weight data before treatment before <- subset(my_data, group == "before", weight, drop = TRUE) # subset weight data after treatment after <- subset(my_data, group == "after", weight, drop = TRUE) # Plot paired data library(PairedData) pd <- paired(before, after) plot(pd, type = "profile") + theme_bw()
Compute paired-sample Wilcoxon test
Question : Is there any significant changes in the weights of mice before after treatment?
1) Compute paired Wilcoxon test - Method 1: The data are saved in two different numeric vectors.
res <- wilcox.test(before, after, paired = TRUE)
res
Wilcoxon signed rank exact test
data: before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
2) Compute paired Wilcoxon-test - Method 2: The data are saved in a data frame. As previously this method did not work and keeps giving the same error message that cannot use paired in formula method so have removed it from chunk although left in for completeness sake:
Compute t-test
res <- wilcox.test(weight ~ group, data = my_data, paired = TRUE) res
As you can see, the two methods give the same results. The p-value of the test is 0.001953, which is less than the significance level alpha = 0.05. We can conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment with a p-value = 0.001953.
Note that:
- if you want to test whether the median weight before treatment is less than the median weight after treatment, type this:
wilcox.test(weight ~ group, data = my_data, paired = TRUE,
alternative = "less")
- Or, if you want to test whether the median weight before treatment is greater than the median weight after treatment, type this
wilcox.test(weight ~ group, data = my_data, paired = TRUE,
alternative = "greater")
5 Comparing the means of more than two groups
5.1 One-way ANOVA test
An extension of independent two-samples t-test for comparing means in a situation where there are more than two groups. The one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of independent two-samples t-test for comparing means in a situation where there are more than two groups. In one-way ANOVA, the data is organized into several groups base on one single grouping variable (also called factor variable).
ANOVA test hypotheses:
Null hypothesis: the means of the different groups are the same
Alternative hypothesis: At least one sample mean is not equal to the others.
Note that, if you have only two groups, you can use t-test. In this case the F-test and the t-test are equivalent.
Assumptions-the ANOVA test can be applied only when:
The observations are obtained independently and randomly from the population defined by the factor levels
The data of each factor level are normally distributed.
These normal populations have a common variance. (Levene’s test can be used to check this.)
Visualise and compute-for example we’ll use the built-in R data set named PlantGrowth. It contains the weight of plants obtained under a control and two different treatment conditions:
my_data <- PlantGrowth
Check the data-to have an idea of what the data looks like, we can use the the function sample_n()[in dplyr package]. The sample_n() function randomly picks a few of the observations in the data frame to print out:
# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10) weight group
1 6.15 trt2
2 3.83 trt1
3 5.29 trt2
4 5.12 trt2
5 4.50 ctrl
6 4.17 trt1
7 5.87 trt1
8 5.33 ctrl
9 5.26 trt2
10 4.61 ctrl
In R terminology, the column “group” is called factor and the different categories (“ctr”, “trt1”, “trt2”) are named factor levels. The levels are ordered alphabetically.
# Show the levels
levels(my_data$group)[1] "ctrl" "trt1" "trt2"
Compute summary statistics by groups - count, mean, sd:
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)# A tibble: 3 × 4
group count mean sd
<fct> <int> <dbl> <dbl>
1 ctrl 10 5.03 0.583
2 trt1 10 4.66 0.794
3 trt2 10 5.53 0.443
Visualise the data:
# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "group", y = "weight",
add = c("mean_se", "jitter"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")Compute one-way ANOVA test using the function aov()-we want to know if there is any significant difference between the average weights of plants in the 3 experimental conditions:
# Compute the analysis of variance
res.aov <- aov(weight ~ group, data = my_data)
# Summary of the analysis
summary(res.aov) Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output includes the columns F value and Pr(>F) corresponding to the p-value of the test.
Interpret the result of one-way ANOVA tests:
As the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the groups highlighted with “*” in the model summary.
Multiple pairwise-comparison between the means of groups:
In one-way ANOVA test, a significant p-value indicates that some of the group means are different, but we don’t know which pairs of groups are different. It’s possible to perform multiple pairwise-comparison, to determine if the mean difference between specific pairs of group are statistically significant using the Tukey multiple pairwise-comparisons. As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences, R function: TukeyHSD()) for performing multiple pairwise-comparison between the means of groups.
The function TukeyHD() takes the fitted ANOVA as an argument:
TukeyHSD(res.aov) Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = my_data)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
diff: difference between means of the two groups
lwr, upr: the lower and the upper end point of the confidence interval at 95% (default)
p adj: p-value after adjustment for the multiple comparisons.
It can be seen from the output, that only the difference between trt2 and trt1 is significant with an adjusted p-value of 0.012.
Check ANOVA assumptions: test validity?
The ANOVA test assumes that, the data are normally distributed and the variance across groups are homogeneous. We can check that with some diagnostic plots. Check the homogeneity of variance assumption with the residuals versus fits plot. In the plot below, there is no evident relationships between residuals and fitted values (the mean of each groups) so we can assume the homogeneity of variances.
# 1. Homogeneity of variances
plot(res.aov, 1)Points 17, 15, 4 are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions. It’s also possible to use Levene’s test to check the homogeneity of variances.
library(car)
leveneTest(weight ~ group, data = my_data)Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 1.1192 0.3412
27
From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.
Relaxing the homogeneity of variance assumption:
The classical one-way ANOVA test requires an assumption of equal variances for all groups. In our example, the homogeneity of variance assumption turned out to be fine as the Levene test is not significant. But how do we save our ANOVA test, in a situation where the homogeneity of variance assumption is violated?
An alternative procedure the Welch one-way test, which does not require this assumption can be used with the function oneway.test().
ANOVA test with no assumption of equal variances
oneway.test(weight ~ group, data = my_data)One-way analysis of means (not assuming equal variances) data: weight and group F = 5.181, num df = 2.000, denom df = 17.128, p-value = 0.01739Check the normality assumption using the normality plot of residuals. In the plot below, the quantiles of the residuals are plotted against the quantiles of the normal distribution. A 45-degree reference line is also plotted. The normal probability plot of residuals is used to check the assumption that the residuals are normally distributed. It should approximately follow a straight line.
# 2. Normality plot(res.aov, 2)
As all the points fall approximately along this reference line, we can assume normality.
The conclusion above, is supported by the Shapiro-Wilk test on the ANOVA residuals (W = 0.96, p = 0.6) which finds no indication that normality is violated.
# Extract the residuals
aov_residuals <- residuals(object = res.aov )
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )
Shapiro-Wilk normality test
data: aov_residuals
W = 0.96607, p-value = 0.4379
Note that, a non-parametric alternative to one-way ANOVA is Kruskal-Wallis rank sum test, which can be used when ANNOVA assumptions are not met.
kruskal.test(weight ~ group, data = my_data)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
5.2 Two-Way ANOVA test
This is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable. The grouping variables are also known as factors. The different categories (groups) of a factor are called levels. The number of levels can vary between factors. The level combinations of factors are called cells.
When the sample sizes within cells are equal, we have the so-called balanced design. In this case the standard two-way ANOVA test can be applied.
When the sample sizes within each level of the independent variables are not the same (case of unbalanced designs), the ANOVA test should be handled differently.
Two-way ANOVA test hypotheses:
There is no difference in the means of factor A
There is no difference in means of factor B
There is no interaction between factors A and B
The alternative hypothesis for cases 1 and 2 is: the means are not equal.
The alternative hypothesis for case 3 is: there is an interaction between A and B.
Assumptions of two-way ANOVA test:
Two-way ANOVA, like all ANOVA tests, assumes that the observations within each cell are normally distributed and have equal variances.
Compute two-way ANOVA test in R: balanced designs
Balanced designs correspond to the situation where we have equal sample sizes within levels of our independent grouping levels. Here, we’ll use the built-in R data set named ToothGrowth. It contains data from a study evaluating the effect of vitamin C on tooth growth in Guinea pigs. The experiment has been performed on 60 pigs, where each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, (orange juice or ascorbic acid (a form of vitamin C and coded as VC). Tooth length was measured and a sample of the data is shown below.
# Store the data in the variable my_data
my_data <- ToothGrowthcheck the data:
# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10) len supp dose
1 21.5 VC 2.0
2 17.3 VC 1.0
3 27.3 OJ 2.0
4 18.5 VC 2.0
5 8.2 OJ 0.5
6 26.4 OJ 1.0
7 25.8 OJ 1.0
8 5.2 VC 0.5
9 6.4 VC 0.5
10 9.4 OJ 0.5
# Check the structure
str(my_data)'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
From the output above, R considers “dose” as a numeric variable. We’ll convert it as a factor variable (i.e., grouping variable) as follows:
# Convert dose as a factor and recode the levels
# as "D0.5", "D1", "D2"
my_data$dose <- factor(my_data$dose,
levels = c(0.5, 1, 2),
labels = c("D0.5", "D1", "D2"))
head(my_data) len supp dose
1 4.2 VC D0.5
2 11.5 VC D0.5
3 7.3 VC D0.5
4 5.8 VC D0.5
5 6.4 VC D0.5
6 10.0 VC D0.5
Question: We want to know if tooth length depends on supp and dose. So firstly we need to generate a frequency table:
table(my_data$supp, my_data$dose)
D0.5 D1 D2
OJ 10 10 10
VC 10 10 10
We have 2x3 design cells with the factors being supp and dose and 10 subjects in each cell, that is we have a balanced design.
Visualising the data using Box plots and line plots to see group differences:
Box plot to plot the data grouped by the combinations of the levels of the two factors.
Two-way interaction plot, which plots the mean (or other summary) of the response for two-way combinations of factors, thereby illustrating possible interactions
# Box plot with multiple groups
# +++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
library("ggpubr")
ggboxplot(my_data, x = "dose", y = "len", color = "supp",
palette = c("#00AFBB", "#E7B800"))# Line plots with multiple groups
# +++++++++++++++++++++++
# Plot tooth length ("len") by groups ("dose")
# Color box plot by a second group: "supp"
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "dose", y = "len", color = "supp",
add = c("mean_se", "dotplot"),
palette = c("#00AFBB", "#E7B800"))Compute two-way ANOVA test:
We want to know if tooth length depends on supp and dose. The R function aov() can be used to answer this question. The function summary.aov() is then used to summarize the analysis of variance model.
res.aov2 <- aov(len ~ supp + dose, data = my_data)
summary(res.aov2) Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 14.02 0.000429 ***
dose 2 2426.4 1213.2 82.81 < 2e-16 ***
Residuals 56 820.4 14.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output includes the columns F value and Pr(>F) corresponding to the p-value of the test.
From the ANOVA table we can conclude that both supp and dose are statistically significant. dos is the most significant factor variable. These results would lead us to believe that changing delivery methods (supp) or the dose of vitamin C, will impact significantly the mean tooth length.
Note the above fitted model is called additive model. It makes an assumption that the two factor variables are independent. If you think that these two variables might interact to create an synergistic effect, replace the plus symbol (+) by an asterisk (*), as follows:
# Two-way ANOVA with interaction effect
# These two calls are equivalent
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data)
summary(res.aov3) Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 15.572 0.000231 ***
dose 2 2426.4 1213.2 92.000 < 2e-16 ***
supp:dose 2 108.3 54.2 4.107 0.021860 *
Residuals 54 712.1 13.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It can be seen that the two main effects (supp and dose) are statistically significant, as well as their interaction. Note that in the situation where the interaction is not significant you should use the additive model.
Interpret the results
From the ANOVA results, you can conclude the following, based on the p-values and a significance level of 0.05:
the p-value of supp is 0.000429 (significant), which indicates that the levels of supp are associated with significant different tooth length.
the p-value of dose is < 2e-16 (significant), which indicates that the levels of dose are associated with significant different tooth length.
the p-value for the interaction between supp*dose is 0.02 (significant), which indicates that the relationships between dose and tooth length depends on the supp method.
Computing some summary statistics:
group_by(my_data, supp, dose) %>%
summarise(
count = n(),
mean = mean(len, na.rm = TRUE),
sd = sd(len, na.rm = TRUE)
)# A tibble: 6 × 5
# Groups: supp [2]
supp dose count mean sd
<fct> <fct> <int> <dbl> <dbl>
1 OJ D0.5 10 13.2 4.46
2 OJ D1 10 22.7 3.91
3 OJ D2 10 26.1 2.66
4 VC D0.5 10 7.98 2.75
5 VC D1 10 16.8 2.52
6 VC D2 10 26.1 4.80
Multiple pairwise-comparison between the means of groups
In ANOVA test, a significant p-value indicates that some of the group means are different, but we don’t know which pairs of groups are different. It’s possible to perform multiple pairwise-comparison, to determine if the mean difference between specific pairs of group are statistically significant.
As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences, R function: TukeyHSD()) for performing multiple pairwise-comparison between the means of groups. The function TukeyHD() takes the fitted ANOVA as an argument. We don’t need to perform the test for the “supp” variable because it has only two levels, which have been already proven to be significantly different by ANOVA test. Therefore, the Tukey HSD test will be done only for the factor variable “dose”.
TukeyHSD(res.aov3, which = "dose") Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = len ~ supp + dose + supp:dose, data = my_data)
$dose
diff lwr upr p adj
D1-D0.5 9.130 6.362488 11.897512 0.0e+00
D2-D0.5 15.495 12.727488 18.262512 0.0e+00
D2-D1 6.365 3.597488 9.132512 2.7e-06
It can be seen from the output, that all pairwise comparisons are significant with an adjusted p-value < 0.05.Check ANOVA assumptions: test validity?
ANOVA assumes that the data are normally distributed and the variance across groups are homogeneous. We can check that with some diagnostic plots.
The residuals versus fits plot is used to check the homogeneity of variances. In the plot below, there is no evident relationships between residuals and fitted values (the mean of each groups) so we can assume the homogeneity of variances.
# 1. Homogeneity of variances
plot(res.aov3, 1)Points 32 and 23 are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions.
Can also use Levene’s test to check the homogeneity of variance:
library(car)
leveneTest(len ~ supp*dose, data = my_data)Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 1.7086 0.1484
54
From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.
Check the normality assumption using the normality plot of the residuals. In the plot below, the quantiles of the residuals are plotted against the quantiles of the normal distribution. A 45-degree reference line is also plotted. The normal probability plot of residuals is used to verify the assumption that the residuals are normally distributed. The normal probability plot of the residuals should approximately follow a straight line.
# 2. Normality
plot(res.aov3, 2)As all the points fall approximately along this reference line, we can assume normality. The conclusion above, is supported by the Shapiro-Wilk test on the ANOVA residuals (W = 0.98, p = 0.5) which finds no indication that normality is violated.
# Extract the residuals
aov_residuals <- residuals(object = res.aov3)
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )
Shapiro-Wilk normality test
data: aov_residuals
W = 0.98499, p-value = 0.6694
Compute two-way ANOVA test in R for unbalanced designs where there are unequal numbers of subjects in each group. There are three fundamentally different ways to run an ANOVA in an unbalanced design. They are known as Type-I, Type-II and Type-III sums of squares. To keep things simple, note thtt The recommended method are the Type-III sums of squares. The three methods give the same result when the design is balanced. However, when the design is unbalanced, they don’t give the same results.
The functionAnova() [in car package] can be used to compute two-way ANOVA test for unbalanced designs.
library(car)
my_anova <- aov(len ~ supp * dose, data = my_data)
Anova(my_anova, type = "III")Anova Table (Type III tests)
Response: len
Sum Sq Df F value Pr(>F)
(Intercept) 1750.33 1 132.730 3.603e-16 ***
supp 137.81 1 10.450 0.002092 **
dose 885.26 2 33.565 3.363e-10 ***
supp:dose 108.32 2 4.107 0.021860 *
Residuals 712.11 54
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
6 MANOVA test: Multivariate analysis of variance
In the situation where there multiple response variables you can test them simultaneously using a multivariate analysis of variance (MANOVA). For example, we may conduct an experiment where we give two treatments (A and B) to two groups of mice, and we are interested in the weight and height of mice. In that case, the weight and height of mice are two dependent variables, and our hypothesis is that both together are affected by the difference in treatment. A multivariate analysis of variance could be used to test this hypothesis.
Assumptions of MANOVA
MANOVA can be used in certain conditions:
The dependent variables should be normally distributed within groups. The R function mshapiro.test( )[in the mvnormtest package] can be used to perform the Shapiro-Wilk test for multivariate normality. This is useful in the case of MANOVA, which assumes multivariate normality.
Homogeneity of variances across the range of predictors.
Linearity between all pairs of dependent variables, all pairs of covariates, and all dependent variable-covariate pairs in each cell
Interpretation of MANOVA
If the global multivariate test is significant, we conclude that the corresponding effect (treatment) is significant. In that case, the next question is to determine if the treatment affects only the weight, only the height or both. In other words, we want to identify the specific dependent variables that contributed to the significant global effect. To answer this question, we can use one-way ANOVA (or univariate ANOVA) to examine separately each dependent variable.
Compute MANOVA in R-using the iris data set:
# Store the data in the variable my_data
my_data <- irischeck the data:
# Show a random sample
set.seed(1234)
dplyr::sample_n(my_data, 10) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.2 3.5 1.5 0.2 setosa
2 5.7 2.6 3.5 1.0 versicolor
3 6.3 3.3 6.0 2.5 virginica
4 6.5 3.2 5.1 2.0 virginica
5 6.3 3.4 5.6 2.4 virginica
6 6.4 2.8 5.6 2.2 virginica
7 6.8 3.2 5.9 2.3 virginica
8 7.9 3.8 6.4 2.0 virginica
9 6.2 2.9 4.3 1.3 versicolor
10 7.1 3.0 5.9 2.1 virginica
Question: We want to know if there is any significant difference, in sepal and petal length, between the different species.
Compute the MANOVA test:
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
# MANOVA test
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(res.man) Df Pillai approx F num Df den Df Pr(>F)
Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look to see which differ
summary.aov(res.man) Response Sepal.Length :
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.212 31.606 119.26 < 2.2e-16 ***
Residuals 147 38.956 0.265
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response Petal.Length :
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 437.10 218.551 1180.2 < 2.2e-16 ***
Residuals 147 27.22 0.185
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output above, it can be seen that the two variables are highly significantly different among Species.
7. Kruskal-Wallis test
Kruskal-Wallis test by rank is a non-parametric alternative to one-way ANOVA test, which extends the two-samples Wilcoxon test in the situation where there are more than two groups. It’s recommended when the assumptions of one-way ANOVA test are not met.
V/isualize your data and compute Kruskal-Wallis test in R
Eg. Using the built-in R data set named PlantGrowth. It contains the weight of plants obtained under a control and two different treatment conditions.
my_data <- PlantGrowthchecking the data:
# print the head of the file
head(my_data) weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl
and showing the group levels:
# Show the group levels
levels(my_data$group)[1] "ctrl" "trt1" "trt2"
and computing summary stats by group:
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE),
median = median(weight, na.rm = TRUE),
IQR = IQR(weight, na.rm = TRUE)
)# A tibble: 3 × 6
group count mean sd median IQR
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 ctrl 10 5.03 0.583 5.15 0.743
2 trt1 10 4.66 0.794 4.55 0.662
3 trt2 10 5.53 0.443 5.44 0.467
and visualising the data box plots:
# Box plots
# ++++++++++++++++++++
# Plot weight by group and color by group
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")and mean plots:
# Mean plots
# ++++++++++++++++++++
# Plot weight by group
# Add error bars: mean_se
# (other values include: mean_sd, mean_ci, median_iqr, ....)
library("ggpubr")
ggline(my_data, x = "group", y = "weight",
add = c("mean_se", "jitter"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")computing the Kruskal-Wallis test:
We want to know if there is any significant difference between the average weights of plants in the 3 experimental conditions.The test can be performed using the function kruskal.test() as follow:
kruskal.test(weight ~ group, data = my_data)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
Interpretation: as the p-value is less than the significance level 0.05, we can conclude that there are significant differences between the treatment groups.
Week 8 Correlations
Pre-lecture session taken from the following Youtube video:
Pearson’s Correlation, Clearly Explained by Josh Starmer
and Chapter 5 “Correlation” from Making Sense of Data with R by Yi Shang.
The below is taken from a mixture of the lecture session and the above pre session info.
Plotting the relationship between 2 quantitative variables is best done using scatterplots (see Exercise 1 Week 4).
In statistics the relationship between 2 variables are often called correlation. To have a correlation we need 2 measurements from each case in the dataset. Correlation is very useful in the sense that it tells us if we can use one variable to predict another. What is used to make the prediction is called the predictor (independent) variable and what is being predicted is called the outcome (Dependent) variable. By convention on a scatter plot the independent variable goes on x axis, dependent on the y.
We have a positive correlation when the 2 variables tend to change in the same direction, and a negative correlation when the 2 variables tend to change in different directions. The strength or magnitude of a correlation is the likelihood that x and y change in accordance with each other. The direction of a correlation is not relevant to its strength (a negative correlation can be strong while a positive one can be weak and vv). The correlation coefficient (Pearson’s r) is an index to express the direction and strength of of the relationship (from 1 to -1 for any pair of variables with 1 and -1 being perfect strongest possible correlations). Remember that r values are not slopes-can have exactly the same slope with 3 different r values and likewise 3 different slopes with same r value.
Here’s a rough rule of thumb to interpret the strength of Pearson’s r:
The higher the r number that tighter are the plots around the line.
The df number or degrees of freedom is equal to the number of pairs we have in the date plus 2. The 95% confidence interval is a margin of error around the correlation coefficient and marks the extent of our uncertainty about this relationship (a wider margin means more uncertain and usually reflects small size of data).
The correlation coefficient tells us whether is is possible to predict one variable based on the other but it does not tell us how to make that prediction. r shows up only as the tightness of the dots clustering around the invisible straight line-it has nothing to do with how steep that slope is.
Note r only works on linear relationships. Also can be unduly influenced by outliers (skewed) so must look at the plot too!
The p value in correlation tests tells us the probability that randomly drawn dots will result in a similarly strong relationship or stronger. The smaller the p value the more confidence we have in the predictions we make with the line.
So as long as the correlation value is not 0 we can use the line to make inferences but our guesses become more refined the closer the correlation gets to 1 or -1 and our confidence in our inferences depends on the amount of data we have collected and the p value.
Remember correlation is not causation!
Some worked examples:
#loading required package
library(tidyverse)
library(Hmisc)
install.packages("dplyr")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(dplyr)Using the Mid West car dataset, showing the 3 types of correlation test:
Pearson’s correlation-assumption: normal distribution
cor.test(mtcars$hp, mtcars$mpg) #looking at the hp and mpg columnsPearson's product-moment correlation data: mtcars$hp and mtcars$mpg t = -6.7424, df = 30, p-value = 1.788e-07 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.8852686 -0.5860994 sample estimates: cor -0.7761684
Note that this gives a tiny p value with -0.77 correlation coefficient (r value) so strong negative correlation/relationship with a high level of confidence that the result isn’t just down to random chance-the probability of random data creating a similarly strong or stronger relationship is crazy small so can have good confidence that if we have a value on the x axis could use this to predict a value on the y axis.
Spearmans correlation
-assumption: non-normal distribution
cor.test(mtcars$hp, mtcars$mpg, method = "spearman")Spearman's rank correlation rho data: mtcars$hp and mtcars$mpg S = 10337, p-value = 5.086e-12 alternative hypothesis: true rho is not equal to 0 sample estimates: rho -0.8946646
This one is reported as rho = -0.89 so a strong negative correlation with a tiny p value so highly likely that not due to random chance.
Multiple correlations
Using the code below it is easy to obtain all the r values for each combination of variables:
mtcars %>%
dplyr::select(mpg:qsec) %>%
cor() %>%
round(., 2) ->res
res mpg cyl disp hp drat wt qsec
mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42
cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59
disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43
hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71
drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09
wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17
qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00
Visualisation
As ever visualising the data can be really helpful, for example using the corrplot package:
if (!require("corrplot")) install.packages("corrplot")
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)and checking p values on the above data (although note that not give same result as above when used the Multiple correlation approach):
if (!require("Hmisc")) install.packages("Hmisc")
res2<-rcorr(res, type = c("pearson", "spearman"))
res2 mpg cyl disp hp drat wt qsec
mpg 1.00 -0.99 -1.00 -0.97 0.95 -0.99 0.78
cyl -0.99 1.00 0.99 0.98 -0.93 0.97 -0.84
disp -1.00 0.99 1.00 0.97 -0.96 0.99 -0.79
hp -0.97 0.98 0.97 1.00 -0.86 0.92 -0.91
drat 0.95 -0.93 -0.96 -0.86 1.00 -0.97 0.60
wt -0.99 0.97 0.99 0.92 -0.97 1.00 -0.69
qsec 0.78 -0.84 -0.79 -0.91 0.60 -0.69 1.00
n= 7
P
mpg cyl disp hp drat wt qsec
mpg 0.0000 0.0000 0.0004 0.0011 0.0000 0.0377
cyl 0.0000 0.0000 0.0000 0.0020 0.0004 0.0178
disp 0.0000 0.0000 0.0004 0.0007 0.0000 0.0350
hp 0.0004 0.0000 0.0004 0.0126 0.0033 0.0043
drat 0.0011 0.0020 0.0007 0.0126 0.0002 0.1519
wt 0.0000 0.0004 0.0000 0.0033 0.0002 0.0878
qsec 0.0377 0.0178 0.0350 0.0043 0.1519 0.0878
Using the iris data to show more ways of evaluating the data:
data <- iris[, 1:4]
groups <- iris[,5]library(psych)
corPlot(data, cex = 1.2) #to give a numerical tabular resultpar(mfrow = c(2,3))
corrplot(cor(data), method = "circle",
title = "method = 'circle'",
tl.pos = "n", mar = c(2,1,3,1))install.packages("corrplot")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(corrplot)
corrplot(cor(data), method = "square",
title = "method = 'square'",
tl.pos = "n", mar = c(2,1,3,1))corrplot(cor(data), method = "ellipse",
title = "method = 'ellipse'",
tl.pos = "n", mar = c(2,1,3,1))corrplot(cor(data), method = "pie",
title = "method = 'pie'",
tl.pos = "n", mar = c(2,1,3,1))corrplot(cor(data), method = "number",
title = "method = 'number'",
tl.pos = "n", mar = c(2,1,3,1))and using ggplot and GGally on penguins data for example which gives a myriad of different plots to look at:
if (!require("GGally")) install.packages("GGally")
install.packages("palmerpenguins")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(palmerpenguins)
library(GGally)
# Code that generates plots or runs intensive computations
ggpairs(penguins[,c(1,3:7)], ggplot2::aes(colour = species))Week 9 Linear Models
Here’s a link to a Greg Martin Youtube video https://youtu.be/-mGXnm0fHtI
and some of the coding used in this video are shown below. Using the tidyverse package and the cars dataset:
library(tidyverse)
head(cars, 10) speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
It’s easy to create a simple linear model:
cars %>%
lm(dist ~ speed, data = .) %>% #with dist being our dependent y axis variable and speed being our independent x axis variable
summary()
Call:
lm(formula = dist ~ speed, data = .)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
In the above:
residuals-represents the distance between the line of best fit and the observations so a good model has low residuals (all dots close to line). Ideally want to see symmetrical distribution around 0
next get coefficients, which gives the slope (for every unit on x get 3.9 on y in this case) and also p value for our slope
lastly r squared which gives 65% of the change in y can be explained by the change in x
As ever there are other ways of doing the same thing, for example:
mod <- lm(dist ~ speed, data = cars)
mod
Call:
lm(formula = dist ~ speed, data = cars)
Coefficients:
(Intercept) speed
-17.579 3.932
summary(mod) #and this method allows us to do more with the output to gain better understanding as to the suitability and strength of the model
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
and this method allows us to do more with the output:
attributes(mod)$names
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
$class
[1] "lm"
mod$residuals #such as looking at individual columns for example 1 2 3 4 5 6 7
3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993
8 9 10 11 12 13 14
4.255007 12.255007 -8.677401 2.322599 -15.609810 -9.609810 -5.609810
15 16 17 18 19 20 21
-1.609810 -7.542219 0.457781 0.457781 12.457781 -11.474628 -1.474628
22 23 24 25 26 27 28
22.525372 42.525372 -21.407036 -15.407036 12.592964 -13.339445 -5.339445
29 30 31 32 33 34 35
-17.271854 -9.271854 0.728146 -11.204263 2.795737 22.795737 30.795737
36 37 38 39 40 41 42
-21.136672 -11.136672 10.863328 -29.069080 -13.069080 -9.069080 -5.069080
43 44 45 46 47 48 49
2.930920 -2.933898 -18.866307 -6.798715 15.201285 16.201285 43.201285
50
4.268876
hist(mod$residuals) #and graphing the output to check the distributionshapiro.test(mod$residuals) #and can run shapiro test to look at the distribution of the residuals to check its normality
Shapiro-Wilk normality test
data: mod$residuals
W = 0.94509, p-value = 0.02152
and can then use this model to do some predictions as follows:
new_speeds <- data.frame(speed = c(10, 15, 20))
predict(mod, new_speeds) %>%
round(1) 1 2 3
21.7 41.4 61.1
and as before can achieve same thing with:
cars %>%
lm(dist ~ speed, data = .) %>%
predict(data.frame(speed = c(10, 15, 20))) %>%
round() 1 2 3
22 41 61
Post Lecture exercise:
Answer the following question using a linear model “What is the effect of the weight of the vehicle on its efficiency (mile per gallon)?”
#loading packages and exploring the datasets
library(tidyverse)
library(ggplot2)
data("mtcars")
str(mtcars) 'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Obvious that do have numerical versus numerical continuous quantitative data so can use linear regression models to help the analysis. Starting with plotting the data using scatter plot:
ggplot(mtcars, aes(x = wt, y = mpg))+
geom_point() +
geom_smooth(method = "lm", se = FALSE) #adding line of best fit and removing the error bars-use se = TRUE if want them inwhich shows that there does look to be a relationship between the 2 variables showing a negative slope as would expect(ie response variable mpg on y axis goes down as predictor variable wt on x axis goes up). Using linear regression on this as below:
mtcars %>%
lm(mpg ~ wt, data = .) %>% #with mpg being our dependent y axis variable and wt being our independent x axis variable
summary()
Call:
lm(formula = mpg ~ wt, data = .)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Residuals-are the differences between the observed values and the values predicted by the model. Represents the distance between the line of best fit and the observations so a good model has low residuals (all dots close to line). The summary statistics of residuals are provided:
Min: Smallest residual (-4.5432).
1Q: First quartile residual (-2.3647).
Median: Middle residual (-0.1252).
3Q: Third quartile residual (1.4096).
Max: Largest residual (6.8727).
The closer these values are to zero, the better the model fits the data.
Coefficients
This section provides estimates for the regression coefficients: | Term | Estimate | Std. Error | t value | Pr(>|t|) |
Intercept (37.2851): The predicted
mpgwhenwt = 0. While not realistic for this dataset (since weight is never 0), it serves as the baseline for the regression line.Slope (-5.3445): For every 1-unit increase in
wt,mpgdecreases by 5.3445 units on average.Std. Error: Indicates the variability of the estimates. Smaller values suggest more precise estimates.
t value: The test statistic for each coefficient, calculated as Estimate/Std. Error.
Pr(>|t|): The p-value for the hypothesis test H0:β=0, H0:β=0. For both coefficients,p<0.001p<0.001, indicating strong evidence that the coefficients are significantly different from zero.
Residual Standard Error
The standard deviation of the residuals:
- 3.046: On average, the observed
mpgdeviates by about 3.046 units from the predicted values.
R-squared and Adjusted R-squared
Multiple R-squared (0.7528): About 75.28% of the variability in
mpgis explained bywt.Adjusted R-squared (0.7446): Adjusted for the number of predictors and sample size, slightly lower.
Higher R-squared values indicate a better fit.
F-statistic
91.38: Tests whether the model as a whole explains a significant proportion of variance in the dependent variable.
p-value (1.294e-10): Extremely small, indicating the model is highly significant.
Interpretation
The regression model suggests that car weight (wt) is a significant predictor of miles per gallon (mpg). Heavier cars tend to have lower fuel efficiency, with a strong linear relationship accounting for about 75% of the variation in mpg.
Also wanted to run an ANOVA on the above linear regression model to test whether the model as a whole explains a significant amount of variation in the dependent variable compared to a simpler model (eg a model with no predictors):
# Fit the linear model
model <- lm(mpg ~ wt, data = mtcars)
# Perform ANOVA
anova_result <- anova(model)
# View the result
print(anova_result)Analysis of Variance Table
Response: mpg
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.73 847.73 91.375 1.294e-10 ***
Residuals 30 278.32 9.28
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with the following explanations:
Df (Degrees of Freedom):
For the predictor
wt: 1 degree of freedom (since it’s one variable).Residuals: n−k−1, where n is the number of observations and k is the number of predictors (30 in this case).
Sum Sq (Sum of Squares):
Total variance explained by
wt: 847.725.Residual variance (unexplained variance): 278.321.
Mean Sq (Mean Squares):
- Sum of Squares divided by Degrees of Freedom.
F value:
- Ratio of variance explained by the predictor to residual variance (Mean Sq for wt/Mean Sq for Residuals).
Pr(>F):
- The p-value for the F-test. A small value (e.g., <0.05) indicates that the predictor significantly explains variation in the dependent variable.
Interpretation:
From the ANOVA output:
The p-value for the predictor
wt(e.g.,<2.2e−16) suggests that car weight (wt) is a highly significant predictor of miles per gallon (mpg).The large F value (91.375) indicates that the variability explained by
wtis much larger than the residual (unexplained) variability, confirming a strong relationship.
This aligns with the earlier regression analysis, reinforcing that wt is a significant predictor of mpg.
Multiple Linear Models
Can look at multiple linear models by simply adding in variables as in the below example using the penguins dataset:
m2<-lm(flipper_length_mm~bill_length_mm
+ bill_depth_mm, data = penguins)
summary(m2)
Call:
lm(formula = flipper_length_mm ~ bill_length_mm + bill_depth_mm,
data = penguins)
Residuals:
Min 1Q Median 3Q Max
-37.731 -5.855 0.041 6.112 23.059
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 194.31175 6.33800 30.66 <2e-16 ***
bill_length_mm 1.41476 0.08802 16.07 <2e-16 ***
bill_depth_mm -3.23801 0.24335 -13.31 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.626 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6259, Adjusted R-squared: 0.6237
F-statistic: 283.6 on 2 and 339 DF, p-value: < 2.2e-16
and as before can then use the shapiro test to look the normality of the residuals:
shapiro.test(m2$residuals)
Shapiro-Wilk normality test
data: m2$residuals
W = 0.99317, p-value = 0.1227
and can then use ancova to compare regression slopes:
ancov1<-lm(bill_length_mm~body_mass_g
+ species
+ body_mass_g*species, data=penguins)
summary(ancov1)
Call:
lm(formula = bill_length_mm ~ body_mass_g + species + body_mass_g *
species, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-6.4208 -1.6461 0.0919 1.4718 9.3138
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.9941391 1.5926015 16.950 < 2e-16 ***
body_mass_g 0.0031879 0.0004271 7.464 7.27e-13 ***
speciesChinstrap 5.1800537 3.2746719 1.582 0.115
speciesGentoo -0.2545907 2.7138655 -0.094 0.925
body_mass_g:speciesChinstrap 0.0012748 0.0008740 1.459 0.146
body_mass_g:speciesGentoo 0.0009030 0.0006066 1.489 0.138
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.399 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.8098, Adjusted R-squared: 0.807
F-statistic: 286.1 on 5 and 336 DF, p-value: < 2.2e-16
and then anova of the output:
anova(ancov1)Analysis of Variance Table
Response: bill_length_mm
Df Sum Sq Mean Sq F value Pr(>F)
body_mass_g 1 3599.7 3599.7 625.5924 <2e-16 ***
species 2 4612.5 2306.3 400.8045 <2e-16 ***
body_mass_g:species 2 18.6 9.3 1.6159 0.2003
Residuals 336 1933.4 5.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Research Methods exercise:
From a paper of your preference, read the methods carefully and create a graphical abstract of the experimetnal design with a legend to help with explanation:
Week 10 Logistic Regression
Pre session
Couple of Youtube videos to get started:
https://youtu.be/yIYKR4sgzI8?list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe In summary this one by StatQuest by Josh Starmer started with a quick review of Linear Regression where we had 2 numerical variables plotted (eg weight and size) and a line of best fit applied which allowed us to calculate r2 and determine if weight and size are correlated (large values imply large effect), and a p value to see if the r2 was statistically significant; and use the line to predict size given weight. Can also use mutliple regression whereby trying to predict size using weight and blood volume for example (or trying to model size using weight and blood volume). Multiple regression did the same thing that normal regression did: calculate r2, p value and predict size given weight and blood volume. Also possible to compare models for example the simple (single) model to the complicated one (multiple) tells us if we need to measure weight and blood volume to accurately predict size or if we can get away with just weight.
So that was linear regression, now logistic regression: similar except predicts whether something is true or false instead of predicting something continuous like size. Also instead of fitting a line to the data logistic regression fits an “S” shaped logistic function from 0-1 which tells you the probability that a mouse is obese based on its weight as an example. Just like with linear regression we can make simple models (eg obesity predicted by weight) or more complicated models (eg obesity predicted by weight + genotype + age + astrological sign). In other words just like linear regression logistic regression can work with continuous data (like weight and age) and discrete data like (genotype and astrological size). Can also test to see if each variable is useful for predicting obesity. However unlike liner regression we can’t compare the complicated model to the simple model. Instead we just test to see if a variable’s effect on the prediction is signifciantly different from 0. If not it means the variable is not helping the prediction (use Waldt’s test). Logistic regression does not have the same concept of residuals as linear regression (where residuals are used to fit the line and to calculate r2 and compare simple to complicated models). Instead it uses the maximum likelihood concept.
2nd video is the next in this series and looks at worked examples as below:
https://youtu.be/C4N3_XJJ-jU?list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe
Using the heart disease data which needs some initial wrangling:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header = FALSE)
head(data) V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
None of the columns are named so need to name them:
colnames(data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "hd")
head(data) age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal hd
1 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
2 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
3 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
4 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
5 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
6 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
which looks better but str() still shows an issue:
str(data)'data.frame': 303 obs. of 14 variables:
$ age : num 63 67 67 37 41 56 62 57 63 53 ...
$ sex : num 1 1 1 1 0 1 0 0 1 1 ...
$ cp : num 1 4 4 3 2 2 4 4 4 4 ...
$ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
$ chol : num 233 286 229 250 204 236 268 354 254 203 ...
$ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
$ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
$ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
$ exang : num 0 1 1 0 0 0 0 1 0 1 ...
$ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
$ slope : num 3 2 2 3 1 1 3 1 2 3 ...
$ ca : chr "0.0" "3.0" "2.0" "0.0" ...
$ thal : chr "6.0" "3.0" "7.0" "3.0" ...
$ hd : int 0 2 1 0 0 0 3 0 2 1 ...
some of the columns are a problem, for example sex is a number rather than a factor where 0 represents “female” and 1 represents “male”. CP (chest pain) is also supposed to be a factor (where 1-3 represents different types of pain and 4 is no chest pain). Similarly ca and thal are correctly called factors but one of the levels is “?” when we need it to be NA. So we’ve got some cleaning up to do!
data[data == "?"] <- NA # to change the question marks to NA's
data$sex <- ifelse(test=data$sex == 0, yes = "F", no = "M")
data$sex <- as.factor(data$sex) #and convert the column into a factor
data$cp <- as.factor(data$cp) # and converting other columns into factors as that's what they are supposed to be
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)
data$ca <- as.integer(data$ca) #and tell r that ca column is now a column of integers and then convert it to a factor
data$ca <- as.factor(data$ca)#as above
data$thal <- as.factor(data$thal)#as above
data$hd <- ifelse(test=data$hd == 0, yes = "Healthy", no = "Unhealthy") #and lastly
data$hd <- as.factor(data$hd) #making hd a factor which is easier on the eyeand check that this has all worked:
str(data)'data.frame': 303 obs. of 14 variables:
$ age : num 63 67 67 37 41 56 62 57 63 53 ...
$ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 2 1 1 2 2 ...
$ cp : Factor w/ 4 levels "1","2","3","4": 1 4 4 3 2 2 4 4 4 4 ...
$ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
$ chol : num 233 286 229 250 204 236 268 354 254 203 ...
$ fbs : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 2 ...
$ restecg : Factor w/ 3 levels "0","1","2": 3 3 3 1 3 1 3 1 3 3 ...
$ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
$ exang : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 2 1 2 ...
$ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
$ slope : Factor w/ 3 levels "1","2","3": 3 2 2 3 1 1 3 1 2 3 ...
$ ca : Factor w/ 4 levels "0","1","2","3": 1 4 3 1 1 1 3 1 2 1 ...
$ thal : Factor w/ 3 levels "3.0","6.0","7.0": 2 1 3 1 1 1 1 1 3 3 ...
$ hd : Factor w/ 2 levels "Healthy","Unhealthy": 1 2 2 1 1 1 2 1 2 2 ...
now see how many samples have NA values to decide if to get rid of them or if we should impute values for them:
nrow(data[is.na(data$ca) | is.na(data$thal),]) #shows 6 rows of data have NA's in them[1] 6
we can view the samples with NA’s in them by selecting those rows from the data:
data[is.na(data$ca) | is.na(data$thal),] age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
88 53 F 3 128 216 0 2 115 0 0.0 1 0 <NA>
167 52 M 3 138 223 0 0 169 0 0.0 1 <NA> 3.0
193 43 M 4 132 247 1 2 143 1 0.1 2 <NA> 7.0
267 52 M 4 128 204 1 0 156 1 1.0 2 0 <NA>
288 58 M 2 125 220 0 0 144 0 0.4 2 <NA> 7.0
303 38 M 3 138 175 0 0 173 0 0.0 1 <NA> 3.0
hd
88 Healthy
167 Healthy
193 Unhealthy
267 Unhealthy
288 Healthy
303 Healthy
For this exercise we will just remove them:
data <- data[!(is.na(data$ca) | is.na(data$thal)),]
nrow(data) #shows that these have been removed[1] 297
Now we need to make sure that healthy and diseased samples come from each gender as if only male samples have heard disease then we should remove all females from the model. Can do this using the xtabs function to select the columns we want to build a table from (in this case HD and sex)
xtabs(~hd + sex, data=data) sex
hd F M
Healthy 71 89
Unhealthy 25 112
ie healthy and unhealthy in both male and female. Now let’s verify that all all 4 levels of chest pain were reported by a few patients:
xtabs(~hd +cp, data=data) cp
hd 1 2 3 4
Healthy 16 40 65 39
Unhealthy 7 9 18 103
so that’s fine, lets do the same thing for all other boolean and categorical variables that we are using to predict heart disease:
xtabs(~hd + fbs, data = data) fbs
hd 0 1
Healthy 137 23
Unhealthy 117 20
xtabs(~hd + restecg, data = data) restecg
hd 0 1 2
Healthy 92 1 67
Unhealthy 55 3 79
Potential issue here in that only 4 patients represent level 1 so might be a problem as not enough.
xtabs(~hd + slope, data = data) slope
hd 1 2 3
Healthy 103 48 9
Unhealthy 36 89 12
xtabs(~hd + ca, data = data) ca
hd 0 1 2 3
Healthy 129 21 7 3
Unhealthy 45 44 31 17
xtabs(~hd + thal, data = data) thal
hd 3.0 6.0 7.0
Healthy 127 6 27
Unhealthy 37 12 88
So all these look ok apart from possibly ecg. Now let’s run the Logistic Regression:
Starting with a simple model where we will try to predict heart disease using only the gender of each patient:
logistic <- glm(hd ~ sex, data=data, family = "binomial") #the call to glm the function that performs generalised linear models. Binomial makes the glm function do the logistic regression
summary(logistic)
Call:
glm(formula = hd ~ sex, family = "binomial", data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0438 0.2326 -4.488 7.18e-06 ***
sexM 1.2737 0.2725 4.674 2.95e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 409.95 on 296 degrees of freedom
Residual deviance: 386.12 on 295 degrees of freedom
AIC: 390.12
Number of Fisher Scoring iterations: 4
Output shows:
Summary of the deviance residuals which look good as close to being centred on 0 and are roughly symmetrical
Coefficients which correspond to the equation: heart disease = -1.0438 + 1.2737 x the patient is male (this latter variable is equal to 0 when the patient is female and 1 when the patient is male) so if we are predicting heart disease for a female patient we get the following equation: hd = -1.0438+1.2737x0 which reduce to hd = -1.0438 so the log(odds) that a female has hd = -1.0438. For a male patient the equation is hd = -1.0438 + 1.2737x1 giving hd = -1.0438+1.2737. In other words the second term is the log(odds ration) of the odds that a male will have hd over the odds that a female will have hd. The pvalues are well below 0.05 and thus the log(odds) and the log(odds ratios) are both stat sig. Remember that a small p value alone is not interesting, we also want large effect sizes and that’s what the log(odds) and log(odds ratio) tells us. For more on the odds ratio see this video https://youtu.be/8nm0G-1uJzA.
Next we see the default dispersion parameter used for this logistic regression. When we do normal linear regression we estimate both the mean and the variance from the data. In contrast with logistic regression we estimate the mean of the data and the variance is derived from the mean. It is therefore possible that the variance is underestimated. If so you can adjust the dispersion parameter in the summary() command.
Then we have the Null Deviance and the Residual Deviance. These can be used to compare models, compute r2 and an overal p value.
Then we have the AIC (Akaike Information Creiterion) which is just the Residual Deviance adjusted for the number of parameters in the model. The AIC can be used to compare one model to another.
Lastly we have the number of Fisher Scoring iterations which just tells us how quickly the glm() function converged on the maximum liklihood estimates for the coefficients.
Now that we’ve done a simple logistic regression using just one variable (sex) to predict hd we can create a fancy model that uses all of the variables to predict hd.
logistic <- glm(hd~., data=data, family = "binomial")
summary(logistic)
Call:
glm(formula = hd ~ ., family = "binomial", data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.253978 2.960399 -2.113 0.034640 *
age -0.023508 0.025122 -0.936 0.349402
sexM 1.670152 0.552486 3.023 0.002503 **
cp2 1.448396 0.809136 1.790 0.073446 .
cp3 0.393353 0.700338 0.562 0.574347
cp4 2.373287 0.709094 3.347 0.000817 ***
trestbps 0.027720 0.011748 2.359 0.018300 *
chol 0.004445 0.004091 1.087 0.277253
fbs1 -0.574079 0.592539 -0.969 0.332622
restecg1 1.000887 2.638393 0.379 0.704424
restecg2 0.486408 0.396327 1.227 0.219713
thalach -0.019695 0.011717 -1.681 0.092781 .
exang1 0.653306 0.447445 1.460 0.144267
oldpeak 0.390679 0.239173 1.633 0.102373
slope2 1.302289 0.486197 2.679 0.007395 **
slope3 0.606760 0.939324 0.646 0.518309
ca1 2.237444 0.514770 4.346 1.38e-05 ***
ca2 3.271852 0.785123 4.167 3.08e-05 ***
ca3 2.188715 0.928644 2.357 0.018428 *
thal6.0 -0.168439 0.810310 -0.208 0.835331
thal7.0 1.433319 0.440567 3.253 0.001141 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 409.95 on 296 degrees of freedom
Residual deviance: 183.10 on 276 degrees of freedom
AIC: 225.1
Number of Fisher Scoring iterations: 6
In this we see that age isn’t a useful predictor because it has a large p value. However the median age in our dataset was 56 so most of the folks were pretty old and that explains why it wasn’t very useful.
Gender is still a good predictor though.
If we scroll down to the bottom of the output we see that the Residual Deviance and the AIC are both much smaller for this fancy model than they were for the simple model when we only used gender to predict hd. If we want to calculate McFadden’s Pseudo r2 we can pull the log-liklihood of the null model out of the logistic variable by getting the value for the null deviance and dividing by -2 and we can pull the log-liklihood for the fancy model out of the logistic variable by getting the value for the residual deviance and dividing by -2, then we just do the maths and we end up with a Pseudo r2 = 0.55. This can be interpreted as the overall effect size.
ll.null <- logistic$null.deviance/-2
ll.proposed <- logistic$deviance/-2
(ll.null-ll.proposed)/ll.null[1] 0.5533531
and we can use those same log-liklihoods to calculate a p value for that r2 using a chi square distribution:
1-pchisq(2*(ll.proposed-ll.null), df = (length(logistic$coefficients)-1))[1] 0
in this case the p value is tiny so the r2 value isn’t due to dumb luck!
Lastly we can draw a graph that shows the predicted probabilities that each patient has heart disease along with their actual heart disease status. Most of the patients with hd are predicted to have a high probability of having hd and most of the patients without hd are predicted to have a low probability of having hd. Thus our logistic regression has done a pretty good job.
predicted.data <- data.frame(probability.of.hd=logistic$fitted.values, hd=data$hd) #create new data frame that contains the probabilities of having hd along with the actual hd status
predicted.data <- predicted.data[order(predicted.data$probability.of.hd, decreasing=FALSE),]#then we sort the data frame from low probabilities to high
predicted.data$rank <- 1:nrow(predicted.data)#then we add a new column to the data frame that has the rank of each sample from low probability to =high.
library(ggplot2)#then we load ggplot to we can draw a fancy graph
library(cowplot)
ggplot(data=predicted.data, aes(x=rank, y=probability.of.hd))+
geom_point(aes(color=hd), alpha=1, shape=4, stroke=2) +
xlab("Index") +
ylab("Predicted probability of getting heart disease")Post session Exercises
Taken from the STHDA website on Logistic Regression Essentials in R https://www.sthda.com/english/articles/36-classification-methods-essentials/151-logistic-regression-essentials-in-r/
Logistic Regression is used to predict the class or category of individuals based on one or multiple predictor variables (x). It is used to model a binary outcome, that is a variable which can only have only have two possible values 0 or 1, yes or no, true or false etc. LR belongs to a family named Generalised Linear Model developed for extending the linear regression model to other situations. Logistic regression does not return direclty the class of observations-it allows us to estimate the probability of class membership, ranging between 0 and 1. You need to decide the threshold probability at which the category flips from one to the other. By default this is set = 0.5 but in reality it should be settled based on the analysis purpose.
Logistic function
The standard logistic regression function, for predicting the outcome of an observation given a predictor variable (x), is an s-shaped curve defined as p = exp(y) / [1 + exp(y)] (James et al. 2014). This can be also simply written as p = 1/[1 + exp(-y)], where:
y = b0 + b1*x,exp()is the exponential andpis the probability of event to occur (1) givenx. Mathematically, this is written asp(event=1|x) and abbreviated asp(x), sopx = 1/[1 + exp(-(b0 + b1*x))]`
By a bit of manipulation, it can be demonstrated that p/(1-p) = exp(b0 + b1*x). By taking the logarithm of both sides, the formula becomes a linear combination of predictors: log[p/(1-p)] = b0 + b1*x.
When you have multiple predictor variables, the logistic function looks like: log[p/(1-p)] = b0 + b1*x1 + b2*x2 + ... + bn*xn
b0 and b1 are the regression beta coefficients. A positive b1 indicates that increasing x will be associated with increasing p. Conversely, a negative b1 indicates that increasing x will be associated with decreasing p.
The quantity log[p/(1-p)] is called the logarithm of the odd, also known as log-odd or log it.
The odds reflect the likelihood that the event will occur. It can be seen as the ratio of “successes” to “non-successes”. Technically, odds are the probability of an event divided by the probability that the event will not take place(P. Bruce and Bruce 2017). For example, if the probability of being diabetes-positive is 0.5, the probability of “won’t be” is 1-0.5 = 0.5, and the odds are 1.0.
Note that, the probability can be calculated from the odds as p = Odds/(1 + Odds).
Need a couple of R packages for this one:
library(tidyverse)
install.packages("caret")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(caret)
library(ggplot2)
theme_set(theme_bw())Preparing the data
Performing the following steps might improve the accuracy of your model
Remove potential outliers
Make sure that the predictor variables are normally distributed. If not, you can use log, root, Box-Cox transformation.
Remove highly correlated predictors to minimize overfitting. The presence of highly correlated predictors might lead to an unstable model solution.
Here, we’ll use the PimaIndiansDiabetes [in mlbench package] for predicting the probability of being diabetes positive based on multiple clinical variables.
We’ll randomly split the data into training set (80% for building a predictive model) and test set (20% for evaluating the model).
install.packages("mlbench")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(mlbench)
# Load the data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3) pregnant glucose pressure triceps insulin mass pedigree age diabetes
421 1 119 88 41 170 45.3 0.507 26 neg
188 1 128 98 41 58 32.0 1.321 33 pos
639 7 97 76 32 91 40.9 0.871 32 pos
# Split the data into training and test set
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]Computing logistic regression
The R function glm(), for generalized linear model, can be used to compute logistic regression. You need to specify the option family = binomial, which tells R that we want to fit logistic regression.
# Fit the model
model <- glm( diabetes ~., data = train.data, family = binomial)
# Summarize the model
summary(model)
Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053e+01 1.440e+00 -7.317 2.54e-13 ***
pregnant 1.005e-01 6.127e-02 1.640 0.10092
glucose 3.710e-02 6.486e-03 5.719 1.07e-08 ***
pressure -3.876e-04 1.383e-02 -0.028 0.97764
triceps 1.418e-02 1.998e-02 0.710 0.47800
insulin 5.940e-04 1.508e-03 0.394 0.69371
mass 7.997e-02 3.180e-02 2.515 0.01190 *
pedigree 1.329e+00 4.823e-01 2.756 0.00585 **
age 2.718e-02 2.020e-02 1.346 0.17840
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 267.18 on 305 degrees of freedom
AIC: 285.18
Number of Fisher Scoring iterations: 5
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)[1] 0.7564103
Simple logistic regression is used to predict the probability of class membership based on one single predictor variable. The following R code builds a model to predict the probability of being diabetes-positive based on the plasma glucose concentration:
model <- glm( diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.15882009 0.700096646 -8.797100 1.403974e-18
glucose 0.04327234 0.005341133 8.101716 5.418949e-16
The output above shows the estimate of the regression beta coefficients and their significance levels. The intercept (b0) is -6.15 and the coefficient of glucose variable is 0.043.
The logistic equation can be written as p = exp(-6.15 + 0.043*glucose)/ [1 + exp(-6.15 + 0.043*glucose)]. Using this formula, for each new glucose plasma concentration value, you can predict the probability of the individuals in being diabetes positive. Predictions can be easily made using the function predict(). Use the option type = “response” to directly obtain the probabilities
newdata <- data.frame(glucose = c(20, 180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes 1 2
"neg" "pos"
The logistic function gives an s-shaped probability curve illustrated as follow:
train.data %>%
mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
ggplot(aes(glucose, prob)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
labs(
title = "Logistic Regression Model",
x = "Plasma Glucose Concentration",
y = "Probability of being diabete-pos"
)Multiple logistic regression is used to predict the probability of class membership based on multiple predictor variables, as follow:
model <- glm( diabetes ~ glucose + mass + pregnant,
data = train.data, family = binomial)
summary(model)$coef Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.32369818 1.125997285 -8.280391 1.227711e-16
glucose 0.03886154 0.005404219 7.190962 6.433636e-13
mass 0.09458458 0.023529905 4.019760 5.825738e-05
pregnant 0.14466661 0.045125729 3.205857 1.346611e-03
and in the below using all the predictor variables:
model <- glm( diabetes ~., data = train.data, family = binomial)
summary(model)$coef Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant 1.005031e-01 0.061266974 1.64041157 1.009196e-01
glucose 3.709621e-02 0.006486093 5.71934633 1.069346e-08
pressure -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps 1.417771e-02 0.019981885 0.70952823 4.779967e-01
insulin 5.939876e-04 0.001508231 0.39383055 6.937061e-01
mass 7.997447e-02 0.031798907 2.51500698 1.190300e-02
pedigree 1.329149e+00 0.482291020 2.75590704 5.852963e-03
age 2.718224e-02 0.020199295 1.34570257 1.783985e-01
From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. Columns are:
Estimate: the intercept (b0) and the beta coefficient estimates associated to each predictor variableStd.Error: the standard error of the coefficient estimates. This represents the accuracy of the coefficients. The larger the standard error, the less confident we are about the estimate.z value: the z-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3)Pr(>|z|): The p-value corresponding to the z-statistic. The smaller the p-value, the more significant the estimate is.
Note that, the functions coef() and summary()can be used to extract only the coefficients, as follow:
coef(model) (Intercept) pregnant glucose pressure triceps
-1.053400e+01 1.005031e-01 3.709621e-02 -3.875933e-04 1.417771e-02
insulin mass pedigree age
5.939876e-04 7.997447e-02 1.329149e+00 2.718224e-02
summary(model )$coef Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant 1.005031e-01 0.061266974 1.64041157 1.009196e-01
glucose 3.709621e-02 0.006486093 5.71934633 1.069346e-08
pressure -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps 1.417771e-02 0.019981885 0.70952823 4.779967e-01
insulin 5.939876e-04 0.001508231 0.39383055 6.937061e-01
mass 7.997447e-02 0.031798907 2.51500698 1.190300e-02
pedigree 1.329149e+00 0.482291020 2.75590704 5.852963e-03
age 2.718224e-02 0.020199295 1.34570257 1.783985e-01
Interpretation
It can be seen that only 5 out of the 8 predictors are significantly associated to the outcome. These include: pregnant, glucose, pressure, mass and pedigree.
The coefficient estimate of the variable glucose is b = 0.045, which is positive. This means that an increase in glucose is associated with increase in the probability of being diabetes-positive. However the coefficient for the variable pressure is b = -0.007, which is negative. This means that an increase in blood pressure will be associated with a decreased probability of being diabetes-positive.
An important concept to understand, for interpreting the logistic beta coefficients, is the odds ratio. An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0).
For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor.
If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) versus x is absent (x = 0).
For example, the regression coefficient for glucose is 0.042. This indicate that one unit increase in the glucose concentration will increase the odds of being diabetes-positive by exp(0.042) 1.04 times.
From the logistic regression results, it can be noticed that some variables - triceps, insulin and age - are not statistically significant. Keeping them in the model may contribute to overfitting. Therefore, they should be eliminated. This can be done automatically using statistical techniques, including stepwise regression an penalized regression methods. Briefly, they consist of selecting an optimal model with a reduced set of variables, without compromising the model accuracy. Here, as we have a small number of predictors (n = 9), we can select manually the most significant:
model <- glm( diabetes ~ pregnant + glucose + pressure + mass + pedigree,
data = train.data, family = binomial)Making predictions
We can predictions using the test data in order to evaluate the performance of our logistic regression model as follows:
Predict the class membership probabilities of observations based on predictor variables
Assign the observations to the class with highest probability score (i.e above 0.5)
The R function predict() can be used to predict the probability of being diabetes-positive, given the predictor values.
Predict the probabilities of being diabetes-positive:
probabilities <- model %>% predict(test.data, type = "response")
head(probabilities) 19 21 32 55 64 71
0.1352603 0.5127526 0.6795376 0.7517408 0.2734867 0.1648174
Which classes do these probabilities refer to? In our example, the output is the probability that the diabetes test will be positive. We know that these values correspond to the probability of the test to be positive, rather than negative, because the contrasts() function indicates that R has created a dummy variable with a 1 for “pos” and “0” for neg. The probabilities always refer to the class dummy-coded as “1”.
Check the dummy coding:
contrasts(test.data$diabetes) pos
neg 0
pos 1
Predict the class of individuals:
The following R code categorizes individuals into two groups based on their predicted probabilities (p) of being diabetes-positive. Individuals, with p above 0.5 (random guessing), are considered as diabetes-positive.
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes) 19 21 32 55 64 71
"neg" "pos" "pos" "pos" "neg" "neg"
Assessing model accuracy is measured as the proportion of observations that have been correctly classified. Inversely, the classification error is defined as the proportion of observations that have been misclassified.
Proportion of correctly classified observations:
mean(predicted.classes == test.data$diabetes)[1] 0.7564103
The classification prediction accuracy is about 76%, which is good. The misclassification error rate is 24%.
Week 11 Polynomial & Poisson Models
Pre-session
https://youtu.be/ZYN0YD7UfK4 on Fitting & Assessing Polynomial regression models by Mike Marin. Polynomial Regression is a special case of Linear Regression where the relationship between x and y is modelled using a polynomial, rather than a line. It can be used when the relationship is non-linear, although this is still considered to be a special case of Multiple Linear Regression. COULD NOT DOWNLOAD LCD DATA
https://youtu.be/FfBnX5dfxXw on Poisson Regression in R by . Using the bikeshare data set-see R notes for week11.
Session
See notes and R scripts for lecture session notes and exercises on Non Linear Models
Post Session
Polynomial Models-create a fictional dataframe or tibble using a quadratic function. Graph it using ggplot to visualise the data and then fit a quadratic model:
-creating a fictional dataset of blood volume versus height following a quadratic model introduce a a squared term for height, which reflects a non-linear relationship:
# Set seed for reproducibility
set.seed(42)
# Create a fictional dataset
n <- 100 # Number of observations
# Generate random heights in centimeters (between 150 and 200 cm)
height <- runif(n, min = 150, max = 200)
# Simulate blood volume using a quadratic relationship
blood_volume <- 0.02 * height^2 - 5 * height + 400 + rnorm(n, mean = 0, sd = 2)
# Combine into a data frame
blood_data_quadratic <- data.frame(
Height_cm = round(height, 1),
Blood_Volume_L = round(blood_volume, 2))
# Display the first few rows
head(blood_data_quadratic) Height_cm Blood_Volume_L
1 195.7 188.23
2 196.9 189.19
3 164.3 121.55
4 191.5 177.29
5 182.1 152.86
6 176.0 139.98
Visualise the data:
library(ggplot2)
ggplot(blood_data_quadratic, aes(x = Height_cm, y = Blood_Volume_L)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE, color = "blue") +
labs(
title = "Blood Volume vs Height (Quadratic Model)",
x = "Height (cm)",
y = "Blood Volume (L)"
) +
theme_minimal()and fitting a linear model to this data:
lm(formula = Blood_Volume_L ~ Height_cm, data = blood_data_quadratic)
Call:
lm(formula = Blood_Volume_L ~ Height_cm, data = blood_data_quadratic)
Coefficients:
(Intercept) Height_cm
-211.925 2.022
and then a quadratic model:
lm(formula = Blood_Volume_L ~ Height_cm + I(Height_cm^2), data = blood_data_quadratic)
Call:
lm(formula = Blood_Volume_L ~ Height_cm + I(Height_cm^2), data = blood_data_quadratic)
Coefficients:
(Intercept) Height_cm I(Height_cm^2)
396.66827 -4.98051 0.01999
and then comparing using ANOVA:
linear_model <- lm(Blood_Volume_L ~ Height_cm, data = blood_data_quadratic)
quadratic_model <- lm(Blood_Volume_L ~ Height_cm + I(Height_cm^2), data = blood_data_quadratic)
anova(linear_model, quadratic_model)Analysis of Variance Table
Model 1: Blood_Volume_L ~ Height_cm
Model 2: Blood_Volume_L ~ Height_cm + I(Height_cm^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 1937.76
2 97 338.05 1 1599.7 459.02 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So tiny p value suggesting that quadratic model much better than linear model for this data.
Poisson Models:
Load the data ‘Recreation Demand’ from the AER package:
install.packages("AER")
The downloaded binary packages are in
/var/folders/wp/4hyx4m6j4w1f_tbkn65gz9dr0000gn/T//RtmpMwq327/downloaded_packages
library(AER)
data("RecreationDemand")
head(RecreationDemand) trips quality ski income userfee costC costS costH
1 0 0 yes 4 no 67.59 68.620 76.800
2 0 0 no 9 no 68.86 70.936 84.780
3 0 0 yes 5 no 58.12 59.465 72.110
4 0 0 no 2 no 15.79 13.750 23.680
5 0 0 yes 3 no 24.02 34.033 34.547
6 0 0 yes 5 no 129.46 137.377 137.850
str(RecreationDemand)'data.frame': 659 obs. of 8 variables:
$ trips : num 0 0 0 0 0 0 0 0 0 0 ...
$ quality: num 0 0 0 0 0 0 0 0 0 2 ...
$ ski : Factor w/ 2 levels "no","yes": 2 1 2 1 2 2 1 2 1 1 ...
$ income : num 4 9 5 2 3 5 1 5 2 3 ...
$ userfee: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ costC : num 67.6 68.9 58.1 15.8 24 ...
$ costS : num 68.6 70.9 59.5 13.8 34 ...
$ costH : num 76.8 84.8 72.1 23.7 34.5 ...
Have a look at the data description using the help in r studio:
?RecreationDemandWhat is the best model to predict the number of boat trips using a poisson model? Compare different models that you create.
So this is saying that boat trips will be our response variable and we want to see which of the other variables influence this most. Obvious that some are likely to be more important than others-for example, quality of the facility and whether a fee were payable seem most likely to influence the number of trips so let’s test it. Use a Poisson regression model as we are looking at count data that follow a Poisson distribution where the mean and variance of the response variable are approximately equal.
Boat_trip_Model1 <- glm(trips ~ quality,
data = RecreationDemand,
family = "poisson") #have to let r know which glm want
summary(Boat_trip_Model1)
Call:
glm(formula = trips ~ quality, family = "poisson", data = RecreationDemand)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.51378 0.05751 -8.933 <2e-16 ***
quality 0.54696 0.01518 36.022 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4849.7 on 658 degrees of freedom
Residual deviance: 3294.4 on 657 degrees of freedom
AIC: 4051.5
Number of Fisher Scoring iterations: 7
and plotting this:
library(ggplot2)
ggplot(RecreationDemand, aes(y = trips,
x = quality,)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "glm",
se = FALSE,
method.args = list(family = "poisson")) +
scale_colour_brewer(palette = "Dark2")and then add in more predictors:
Boat_trip_Model2 <- glm(trips ~ quality + userfee,
data = RecreationDemand,
family = "poisson") #have to let r know which glm want
summary(Boat_trip_Model2)
Call:
glm(formula = trips ~ quality + userfee, family = "poisson",
data = RecreationDemand)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.54705 0.05786 -9.455 <2e-16 ***
quality 0.52697 0.01539 34.244 <2e-16 ***
userfeeyes 1.35552 0.07835 17.301 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4849.7 on 658 degrees of freedom
Residual deviance: 3077.5 on 656 degrees of freedom
AIC: 3836.5
Number of Fisher Scoring iterations: 7
and plot this:
ggplot(RecreationDemand, aes(y = trips,
x = quality,
colour = userfee)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "glm",
se = FALSE,
method.args = list(family = "poisson")) +
scale_colour_brewer(palette = "Dark2")Compare models using AIC:
AIC(Boat_trip_Model1, Boat_trip_Model2) df AIC
Boat_trip_Model1 2 4051.477
Boat_trip_Model2 3 3836.527
Which shows that Model2 is the best model as has the lowest AIC.
and testing for overdispersion:
dispersiontest(Boat_trip_Model2)
Overdispersion test
data: Boat_trip_Model2
z = 1.8959, p-value = 0.02899
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion
14.18829
So is over-dispersion as >1 so should use another type of distribution such as quasi-poisson.