library(tidyverse)
ARES40011 Workbook
Week 1
Formatting notes
Remember the cheat sheet under the help menu!
One * for italics, two for bold.
When comfortable/knowledgeable you can edit in either Visual (which has buttons to help with new things) or Source (more code-like) mode.
For example, I do not know how to insert tables or images in the source code. The Visual editor can help with this.
Running Code
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
The above code would usually display a table of data relating to the tidyverse package; however, as we have added an output: false instruction, it is not doing so here.
These code sections are called "chunks".
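For reference, a minimal sketch of a chunk using that option (this is an illustrative chunk, not the hidden one above):
#| output: false
library(tidyverse) # the code still runs, but output: false hides the printed startup messages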
From the pre-work for this week, I had successfully imported a data set from Dropbox using the following code. However, this is now returning the error "more columns than column names". What is the fix for this?
SOLVED I typed the web address incorrectly several times and on several attempts!
cyan <- read.table(file="https://www.dropbox.com/s/5q3hdyxcouw6ifa/cyanistes.txt?dl=1", header=TRUE, dec=".")
Add a plot
Code
ggplot(diamonds, aes(x=cut)) +
geom_bar()
You can refer to areas. This is a reference: Figure 1
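A minimal sketch of how such a cross-reference is usually wired up in Quarto (the label fig-diamond-cut is made up for illustration): give the plot chunk a label starting with fig- and a caption, then reference it in the text with @fig-diamond-cut, which renders as "Figure 1".
#| label: fig-diamond-cut
#| fig-cap: "Counts of diamonds by cut"
ggplot(diamonds, aes(x = cut)) +
geom_bar()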
Using LaTeX
We can put formulas inline like this: \(\alpha\)
We indent like this
\[\pi \in \mathbb{R}\] Google LaTeX for more options and guides.
Other display options
For different output types, you can change the format of the document at the top of the file (the YAML header).
Examples include:
PDF
Slides - use the revealjs format for this. It looks like MS Sway. Be cautious with this though, as some changes will be required to ensure code sections display correctly; from the example video, adding echo: true to the code chunks sorts this. There are lots of options with the slides.
watch more tutorials on this
You can also add additional options to the file types. Under format in the YAML header, you can select themes. Web search for more Quarto themes.
To add a table of contents, you can add a toc option within the HTML format area. Further customization is available - use the Ctrl+Space options to explore.
You can also have multiple outputs at once, e.g. HTML and PDF.
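A hedged sketch of what that YAML header area can look like (cosmo is just one example theme name):
format:
  html:
    theme: cosmo # pick any available Quarto theme
    toc: true # adds a table of contents
  pdf: default # also render a PDF version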
Other useful stuff!
-use Ctrl + Space to see autofill possibilities within code or YAML sections
-you can set global options, as in the defaults across the whole document, in the YAML header area too.
-you can add additional code in other languages by changing the info between the {} in the chunks.
Using python example
This is where I got stuck: error - "python binding not loaded". Code removed as I was unable to render/publish.
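A possible fix to try (an assumption on my part, since the exact error text was not kept): knitr/Quarto run Python chunks through the reticulate package, so installing it and checking which Python it finds may resolve the binding error.
install.packages("reticulate") # R-to-Python bridge used for {python} chunks (only needed once)
reticulate::py_config() # shows which Python installation will be used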
Week 2 - During session notes
-Added callout boxes to highlight areas in week 1's notes.
library(palmerpenguins)
library(tidyverse) # added again as was not running
Tasks for this week:
Play around with penguin data
Insert a picture into the workbook - done
Playing with Penguin data
penguins %>%
glimpse()
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
$ sex <fct> male, female, female, NA, female, male, female, male~
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
Week 2 - Post Session
Playing with Blue Tit data
For this data, re-run the command to get the blue tit data as detailed in the chunk above. This sets the dataframe to cyan.
The first step in data analysis is data exploration. This helps identify any issues in the dataset and ensures any analysis is valid.
The steps suggested for data exploration are to identify:
Outliers in response and independent variables
Normality and homogeneity of the response variable
An excess of zeros in the response variable
Multicollinearity among independent variables
Relationships among response and independent variables
Independence of the response variable
These look familiar to me. Unsurprising as many of these conditions are assumptions of analyses that we may want to run!
Once we have the data, we can inspect the data using the structure function
str(cyan)
'data.frame': 438 obs. of 7 variables:
$ id : chr "B1" "B10" "B111" "B112" ...
$ zone : chr "B" "B" "B" "B" ...
$ year : int 2002 2001 2002 2002 2001 2001 2002 2002 2002 2002 ...
$ multi : chr "yes" "yes" "no" "no" ...
$ height: num 2.7 2 2 1.9 2.1 2.5 1.9 2.2 2.4 2.3 ...
$ day : int 8 25 10 6 28 23 10 10 11 7 ...
$ depth : num 0.33 0.25 0.2 0.25 0.33 0.4 0.2 0.2 0.2 0.33 ...
This is telling us a few things that we should note and would link to study design.
id is the unique identifier of the nest box in the study area
the variable zone gives the plot within the study area
year is the year the data was collected
multi is a binary variable stating whether a nest box was used more than once ("yes"/"no")
height is the height at which the box is attached to the tree
day is the number of days after 1st of April that the first egg was laid in the nest
depth is a measure of nest size (as a fraction of nest box filled) and may indicate body condition and resources of the nest box occupants. In this example this is the main variable of interest
We could run a similar analysis using the glimpse function used above.
glimpse(cyan)
Rows: 438
Columns: 7
$ id <chr> "B1", "B10", "B111", "B112", "B115", "B116", "B118", "B121", "B~
$ zone <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B"~
$ year <int> 2002, 2001, 2002, 2002, 2001, 2001, 2002, 2002, 2002, 2002, 200~
$ multi <chr> "yes", "yes", "no", "no", "no", "no", "no", "yes", "no", "yes",~
$ height <dbl> 2.7, 2.0, 2.0, 1.9, 2.1, 2.5, 1.9, 2.2, 2.4, 2.3, 2.6, 1.5, 2.7~
$ day <int> 8, 25, 10, 6, 28, 23, 10, 10, 11, 7, 18, 19, 11, 32, 7, 27, 4, ~
$ depth <dbl> 0.33, 0.25, 0.20, 0.25, 0.33, 0.40, 0.20, 0.20, 0.20, 0.33, 0.5~
Note that the output also tells you the data type
-chr is a categorical data type defined by characters
-int shows the variable is an integer (whole number).
-num is continuous numeric data. Note that in the glimpse output num is replaced by dbl. This is not an important difference as far as we are concerned.
Remember
Data types can be categorical or continuous
There may be missing data in the set, which can cause difficulties in analysis. R treats missing data as NA. To find them we can run the following:
colSums(is.na(cyan)) #what this is saying is, return the sum of the number of times NA appears in the dataset cyan
id zone year multi height day depth
0 0 0 0 0 0 0
There are no missing data in this dataset.
Checking for outliers
One of the best ways to inspect data is visually. For outliers we can do this using a boxplot.
boxplot(depth ~ zone, ylab = 'Nest Depth', xlab = 'Woodland Zone', data = cyan, col = 'lightblue',outpch = 16, las=1)
# the boxplot parameters are as follows:
#depth ~ zone tells R to split the values of the depth by the category zone
#xlab and ylab are the axis labels
#data is which dataset to use for the plot
#col is the colour of the boxes
#outpch is a graphical parameter that sets the plotting symbol used for outliers (16 = filled circle)
#las = 1 tells R to keep the axis labels horizontal
We can see from this plot that there are 4 data points that sit outside of the expected range. This plot also shows us that there are differences in nest depth between the different areas of the woodland. Visual inspection therefore has lots of advantages.
Normality
To test for normality we can use a frequency polygon.
p <- ggplot()
p <- p + xlab("Nest Depth")
p <- p + ylab("Frequency")
p <- theme(text=element_text(size=15)) # note: this and the next two lines overwrite p with a theme object rather than adding to the plot - this is the source of the error described below
p <- theme(panel.background = element_blank())
p <- theme(panel.border = element_rect(fill=NA, colour="black", size=1))
p <- p + theme(strip.background = element_rect(fill="white", colour="white", size=1))
p <- p + theme(text=element_text(15))
p <- p + theme(legend.position = 'none')
p <- p + geom_freqpoly(data=cyan, aes(depth), bins=7)
p
I could not get this to work with the book's code or from a variety of online searches! I wanted to plot a histogram or frequency polygon graphic.
This was the error:
Error in `+.gg`: ! Can't add `geom_freqpoly(data = cyan, aes(depth), bins = 7)` to a theme object.
Backtrace: 1. ggplot2:::`+.gg`(p, geom_freqpoly(data = cyan, aes(depth), bins = 7))
Solution: Following week 4's work, I was able to create a histogram for the depth data using the geom_histogram plot type - see the code below.
ggplot(cyan, aes(depth))+
geom_histogram(bins=7)
and the Frequency Polygon
ggplot(cyan, aes(x=depth))+
geom_freqpoly(bins=7)
#This is suitable for one quantitative variable
Another way to test for normality is the Shapiro-Wilk normality test. For this test, the null hypothesis is that the data is normally distributed - if it is not, we will see a p-value < 0.05.
shapiro.test(cyan$depth)
Shapiro-Wilk normality test
data: cyan$depth
W = 0.89872, p-value < 2.2e-16
This test shows the data is not distributed normally.
Next step in data exploration is to check the homogeneity of the data. This is another assumption of many statistical tests. This can be done visually (covariate boxplot) or with tests such as Levene’s Test.
library(lawstat)
Warning: package 'lawstat' was built under R version 4.1.3
#I had to add the lawstat package to run this test
levene.test(cyan$depth, cyan$zone)
Modified robust Brown-Forsythe Levene-type test based on the absolute
deviations from the median
data: cyan$depth
Test Statistic = 0.6051, p-value = 0.7738
Levene’s test does not assume normality and so is suitable for this dataset. As the p-value is >0.05, we can say that the variances do not deviate significantly from homogeneity.
An excess of zeros in a response variable is called zero inflation. If this is a problem there are lots of ways to deal with it; however, let's test first!
sum(cyan$depth==0, na.rm=TRUE)*100/nrow(cyan)
[1] 0
The result is 0, so we can move on to the next exploration step.
Multicollinearity among covariates
<- c("zone", "year", "height", "day")
Coll
<- function(x, y, digits=1, prefix="", cex.cor = 6)
panel.cor <- par("usr"); on.exit(par(usr))
{usr par(usr = c(0, 1, 0, 1))
=cor(x,y,use="pairwise.complete.obs")
r1<- abs(cor(x, y,use="pairwise.complete.obs"))
r <- format(c(r1, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
txt if(missing(cex.cor)) { cex <- 0.9/strwidth(txt) } else {
= cex.cor}
cex text(0.5, 0.5, txt, cex = cex * r)}
pairs(cyan[, Coll], lower.panel = panel.cor)
This code is copied directly from the zip files supplied via the Statistics in R for Biodiversity Conservation Dropbox, but it would not run and I was unsure why - possibly because zone is a character column, which cor() cannot handle. I tried to google some other ways of doing it but was unsuccessful.
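One possible workaround (an assumption, since the original error message was not recorded): if the problem is the character column zone, restricting the pairs plot to the numeric covariates may allow the panel function above to run.
Coll.num <- c("year", "height", "day") # numeric covariates only
pairs(cyan[, Coll.num], lower.panel = panel.cor)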
Relationships among dependent and independent variables
Visually inspecting plotted data is useful to check these relationships as well
#| label: multipanel scatter plot of depth and nestbox height across zones
ggplot(cyan, aes(x = height, y = depth)) +
geom_point() +
geom_smooth(method = 'lm', colour = 'red', se = FALSE) +
theme(panel.background = element_blank()) +
theme(panel.border = element_rect(fill = NA, size = 1)) +
theme(strip.background = element_rect(fill = "white",
color = "white", size = 1)) +
theme(text = element_text(size=16)) +
facet_wrap(~zone)
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
i Please use the `linewidth` argument instead.
`geom_smooth()` using formula = 'y ~ x'
The best fit lines fitted to the above data show that the panels for zones CP and O might behave differently from the others, which suggests a possible interaction and that further investigation is needed.
Independence of response variable
In this dataset, there are relationships between responses - nests are in the same boxes in different years. For the purposes of this example, no need to worry.
Week 3 Pre work
Week 3 Session
No notes made during session
Week 3 Post Session
First let’s ensure we have called the necessary packages
library(tidyverse)
6.1.1.01
Task Mutations of the midwest dataset
midwest.new <- # defines new object name
midwest %>% # calls the existing dataset and pipes it to the next commands
mutate(avg.pop.den=mean(popdensity), # this adds the column for task a
avg.area=mean(area), # this is for b
totadult=sum(popadults), # the totals for all these are the same throughout the data as they are averages or totals from the entire dataset
tot.minus.white=(totadult-popwhite), # this is now different per line as popwhite varies across the rows of the data. You could achieve the same result by adding together the other pop* data fields but this is more efficient
child.to.adult=(percchildbelowpovert/percadultpoverty), # as with the tot.minus.white data, these vary by row due to differences between the rows of data
ratio.adult=(popadults/poptotal), # dividing the 2 fields to get the ratio
perc.adult=(ratio.adult*100)) # multiplying by 100 to convert the ratio to a percentage
midwest.new # this then displays the newly amended table
# A tibble: 437 x 35
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAN~ IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
7 567 CALHOUN IL 0.017 5322 313. 5298 1 8
8 568 CARROLL IL 0.027 16805 622. 16519 111 30
9 569 CASS IL 0.024 13437 560. 13384 16 8
10 570 CHAMPA~ IL 0.058 173025 2983. 146506 16559 331
# i 427 more rows
# i 26 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, ...
#note that combining all the mutations in a single multi-command chunk is neater than having several chunks running one after the other.
Task Mutations of the presidential dataset
presidential.new <- # defines new object name
presidential %>% # utilizes the presidential dataset
mutate(duration=(end-start)) # duration is the end date minus the start date
presidential.new # to display the new data
# A tibble: 12 x 5
name start end party duration
<chr> <date> <date> <chr> <drtn>
1 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
2 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days
3 Johnson 1963-11-22 1969-01-20 Democratic 1886 days
4 Nixon 1969-01-20 1974-08-09 Republican 2027 days
5 Ford 1974-08-09 1977-01-20 Republican 895 days
6 Carter 1977-01-20 1981-01-20 Democratic 1461 days
7 Reagan 1981-01-20 1989-01-20 Republican 2922 days
8 Bush 1989-01-20 1993-01-20 Republican 1461 days
9 Clinton 1993-01-20 2001-01-20 Democratic 2922 days
10 Bush 2001-01-20 2009-01-20 Republican 2922 days
11 Obama 2009-01-20 2017-01-20 Democratic 2922 days
12 Trump 2017-01-20 2021-01-20 Republican 1461 days
Task Mutations with the economic dataset
econmic.new <- # defines new object name
economics %>% # utilizes the economics dataset
mutate(perc.unemploy=((unemploy/pop)*100)) # combination of 2 functions - the ratio and then the multiplication. Note the extra brackets required to achieve this
econmic.new # to display the new data
# A tibble: 574 x 7
date pce pop psavert uempmed unemploy perc.unemploy
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967-07-01 507. 198712 12.6 4.5 2944 1.48
2 1967-08-01 510. 198911 12.6 4.7 2945 1.48
3 1967-09-01 516. 199113 11.9 4.6 2958 1.49
4 1967-10-01 512. 199311 12.9 4.9 3143 1.58
5 1967-11-01 517. 199498 12.8 4.7 3066 1.54
6 1967-12-01 525. 199657 11.8 4.8 3018 1.51
7 1968-01-01 531. 199808 11.7 5.1 2878 1.44
8 1968-02-01 534. 199920 12.3 4.5 3001 1.50
9 1968-03-01 544. 200056 11.7 4.1 2877 1.44
10 1968-04-01 544 200208 12.3 4.6 2709 1.35
# i 564 more rows
Task Mutations with the txhousing dataset
txhousing.new <- # defines new object name
txhousing %>% # utilizes the txhousing dataset
mutate(successrate=((sales/listings)*100), # combination of 2 functions - the ratio and then the multiplication. Note the extra brackets required to achieve this
failrate=(100-successrate)) # fail rate is the difference between the whole (100) and the successrate previously calculated
txhousing.new # to display the new data
# A tibble: 8,602 x 11
city year month sales volume median listings inventory date successrate
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5.38e6 71400 701 6.3 2000 10.3
2 Abilene 2000 2 98 6.51e6 58700 746 6.6 2000. 13.1
3 Abilene 2000 3 130 9.28e6 58100 784 6.8 2000. 16.6
4 Abilene 2000 4 98 9.73e6 68600 785 6.9 2000. 12.5
5 Abilene 2000 5 141 1.06e7 67300 794 6.8 2000. 17.8
6 Abilene 2000 6 156 1.39e7 66900 780 6.6 2000. 20
7 Abilene 2000 7 152 1.26e7 73500 742 6.2 2000. 20.5
8 Abilene 2000 8 131 1.07e7 75000 765 6.4 2001. 17.1
9 Abilene 2000 9 104 7.62e6 64500 771 6.5 2001. 13.5
10 Abilene 2000 10 101 7.04e6 59300 764 6.6 2001. 13.2
# i 8,592 more rows
# i 1 more variable: failrate <dbl>
Tasks from 6.6.1
Recreating code from the guide
# Example illustrating how to answer these problems:
diamonds.new <- # added name of new dataset
diamonds %>% # utilizes the diamonds dataset
group_by(color, clarity) %>% # groups data by color and clarity variables
mutate(price200 = mean(price)) %>% # creates new variable (average price by groups)
ungroup() %>% # data no longer grouped by color and clarity
mutate(random10 = 10 + price) %>% # new variable, original price + $10
select(cut, color, # retain only these listed columns
clarity, price,
price200, random10) %>%
arrange(color) %>% # visualize data ordered by color
group_by(cut) %>% # group data by cut
mutate(dis = n_distinct(price), # counts the number of unique price values per cut
rowID = row_number()) %>% # numbers each row consecutively for each cut
ungroup() # final ungrouping of data
diamonds.new # display the dataset
# A tibble: 53,940 x 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# i 53,930 more rows
The functions all make sense, but the ordering does not to me. The rowID column is confusing when the data is ordered as it is.
midwest %>% # utilizes the midwest dataset
group_by(state) %>% # groups the data by state
summarize(poptotalmean = mean(poptotal), # create new field and calculate the mean of the population
poptotalmed = median(poptotal), # create new field and calculate the median of the population
popmax = max(poptotal), # create new field and return the maximum of the population
popmin = min(poptotal), # create new field and return the minimum of the population
popdistinct = n_distinct(poptotal), # create new field and count the number of different poptotals
popfirst = first(poptotal), # create new field and return the first value of total population
popany = any(poptotal < 5000), # are there any items where the total population is under 5000? (TRUE FALSE response)
popany2 = any(poptotal > 2000000)) %>% # are there any items where the total population is over 2000000(TRUE FALSE response)
ungroup()
# A tibble: 5 x 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# i 1 more variable: popany2 <lgl>
midwest %>% # utilizes the midwest dataset
group_by(state) %>% # groups the data by state
summarize(num5k = sum(poptotal < 5000), # return the number of items where the poptotal is less than 5000
num2mil = sum(poptotal > 2000000), # return the number of items where the poptotal is more than 2000000
numrows = n()) %>% # number of rows (counties) in each state
ungroup()
# A tibble: 5 x 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
midwest %>% # utilizes the midwest dataset
group_by(county) %>% # group by county
summarize(x = n_distinct(state)) %>% # set field x to the number of different states in which that county name appears
arrange(desc(x)) %>% # arrange data with the highest counts first
ungroup()
# A tibble: 320 x 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# i 310 more rows
#| label: Problem C part 2
midwest %>%
group_by(county) %>%
summarize(x = n()) %>% # set field x to number of times that county appears total
ungroup()
# A tibble: 320 x 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# i 310 more rows
Questions
How does n() differ from n_distinct()?
n() is the number of times the county name appears in the dataset; n_distinct(state) is the number of different states in which that county name appears.
When would they be the same? different?
If all the counties only appear in each state once, they would be the same. However, if a county appears twice in one state it will only be counted once by n_distinct() but twice by n().
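A tiny made-up example (a hypothetical tibble, not the course data) showing where the two differ:
tibble(county = c("ADAMS", "ADAMS", "BROWN"),
state = c("IL", "IL", "WI")) %>%
group_by(county) %>%
summarize(n_rows = n(), # ADAMS = 2: counts every row
n_states = n_distinct(state)) # ADAMS = 1: counts distinct states only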
midwest %>%
group_by(county) %>%
summarize(x = n_distinct(county)) %>%
arrange(desc(x)) %>% # this line is added to solve part 2 of the questions below
ungroup()
# A tibble: 320 x 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# i 310 more rows
Questions
How many distinctly different counties are there for each county?
Each county is individually named and there are 320 rows, representing 320 different county names.
Can there be more than 1 (county) county in each county?
There cannot. By sorting the data in part 3i from largest to smallest, we would have identified any counts higher than 1.
What if we replace ‘county’ with ‘state’? See below
#| label: Problem C part 3ii
midwest %>%
group_by(state) %>%
summarize(x = n_distinct(county)) %>%
ungroup()
# A tibble: 5 x 2
state x
<chr> <int>
1 IL 102
2 IN 92
3 MI 83
4 OH 88
5 WI 72
# this changed code now returns the number of counties by state
#| label: Problem D
diamonds %>%
group_by(clarity) %>% # items are grouped in the category clarity
summarize(a = n_distinct(color), # no of unique colors by clarity
b = n_distinct(price),# no of unique prices by clarity
c = n()) %>% # total number of items
ungroup()
# A tibble: 8 x 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
#| label: Problem E part 1
diamonds %>%
group_by(color, cut) %>%
summarize(m = mean(price),
s = sd(price)) %>%
ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 x 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# i 25 more rows
#this has grouped the items in the dataset by color and cut, then returned the mean price of items in those categories along with the standard deviation
#| label: Problem E part 2
diamonds %>%
group_by(cut, color) %>%
summarize(m = mean(price),
s = sd(price)) %>%
ungroup()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 x 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# i 25 more rows
#this has grouped the items in the dataset by cut then colour, returning the mean price of items and standard deviation. It is the same summary as part 1, just in a different order
#| label: Problem E part 3
diamonds %>%
group_by(cut, color, clarity) %>% # grouping by three variables; after summarize() the result is still grouped by cut and color, because summarize() only drops the last grouping variable (clarity) - see the message below
summarize(m = mean(price),
s = sd(price),
msale = m * 0.80) %>% # multiples the mean price by 0.8 (20% off)
ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 x 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# i 266 more rows
# the msale data is the mean data with 20% off.
Questions
How good is the sale if the price of diamonds equaled sale?
Not bad! 20% off is nothing to be sniffed at!
#| label: Problem F
diamonds %>%
group_by(cut) %>%
summarize(potato = mean(depth), # data potato is the mean depth
pizza = mean(price), # data pizza is the mean price
popcorn = median(y), # data popcorn is the median of y
pineapple = potato - pizza, # gives the difference between the mean depth and the mean price
papaya = pineapple ^ 2, # the square of the difference in the means
peach = n()) %>% # the number of items
ungroup()
# A tibble: 5 x 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
# these labels are stupid
#| label: Problem G part I
diamonds %>%
group_by(color) %>% # groups items by color
summarize(m = mean(price)) %>% # gives the mean of the items
mutate(x1 = str_c("Diamond color ", color), # combines the two text fields diamond color and color to a new field "x1"
x2 = 5) %>% # creates new column "x2" with the value 5 for each row
ungroup()
# A tibble: 7 x 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
#| label: Problem G part ii
diamonds %>%
group_by(color) %>% # groups items by color
summarize(m = mean(price)) %>% # gives the mean of the items
ungroup() %>%
mutate(x1 = str_c("Diamond color ", color), # combines the two text fields diamond color and color to a new field "x1"
x2 = 5) # creates new column "x2" with the value 5 for each row
# A tibble: 7 x 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Questions
What does the first ungroup() do? Is it useful here? Why/why not?
In this example, it does nothing: after summarise() with a single grouping variable the result is already ungrouped, and the commands after ungroup() only add columns or reorder rather than performing grouped calculations.
No further ungroup() is required as nothing is grouped at this point.
#| label: Problem H part i
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>% # creates column x1 which gives the price divided by 2
summarize(m = mean(x1)) %>% # gives the mean of the values in x1, grouped by color per the initial group_by command
ungroup()
# A tibble: 7 x 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
#| label: part ii
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>%
ungroup() %>%
summarize(m = mean(x1)) # returns the mean of all data
# A tibble: 1 x 1
m
<dbl>
1 1966.
Questions
What’s the difference between part I and II?
In part I, the mean is calculated for each group of items; in part II it is the mean of the whole dataset. That is why you only have a single value in part II compared to part I.
Questions
Why is grouping data necessary?
So that operations can be applied to specific groups within the data rather than to the whole dataset at once.
Why is ungrouping data necessary?
To stop later operations unintentionally being applied group-by-group, which adds complexity and risks errors.
When should you ungroup data?
In the same code chunk, whilst learning R, probably after each function or group of functions
If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?
No, no grouping has been created therefore there is no need to undo it
Week 3 Post session - Research Methods
Bad Question on the Diamonds Dataset
How does diamond price vary?
Good Question on the Diamonds Dataset
What effect does Diamond Color have on price?
Week 4 Pre-Work
No notes - remember the difference between matrix and tables though!
Week 4 Session
No notes
Week 4 Post Session - recreating visuals
library(tidyverse)
library(modeldata)
Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':
penguins
gg is grammar of graphics
you define the variables of interest first, before moving onto the plot type
#| label: Crickets plot
ggplot(crickets, aes(x=temp, y=rate)) + #this is calling the dataset and then telling R what to plot where, in this case temp on the x axis and rate on the y
geom_point() #this step specifies the type of plot to be drawn
#this is a basic plot. Once we have the basic plot, we can then work on the additional information/design we want. See chunk 38
# This visual is for 2 quantitative variables
ggplot(crickets, aes(x=temp, y=rate, color=species)) + #aes is aesthetics
geom_point()+
labs(x="Temperature",
y="Chirp Rate",
color="Species",
title="Cricket Chirps",
caption=" Source Mcdonald (2009)")+
scale_color_brewer(palette="Dark2") #scale as we are changing a scale feature, colour as we are changing the color and brewer is the package where dark2 is found
#labs adds labels and there are lots you can do, including the demonstration here
Let's try some more modifications!
#| label: Crickets Plots 3
ggplot(crickets, aes(x=temp, y=rate))+
geom_point(color="red", #this is where we can change general properties of the plot
size=2, #changes point size
alpha=.3, #changes point transparency
shape="square")+ #changers point shape
labs(x="Temperature",
y="Chirp Rate",
title="Cricket Chirps",
caption="Source: Mcdonald (2009)")
#You can learn more about the options for each geom by looking at the help files. You call them via ?geom_point, for example.
Adding additional layers to the plot
#| label: Cricket Plots - extra layers
ggplot(crickets, aes(x=temp, y=rate)) +
geom_point()+
geom_smooth(method="lm") + #Adding a regression line, as with other options there are lots of things you can do here, use ?geom_smooth to find out what they are
labs(x="Temperature",
y="Chirp Rate",
color="Species",
title="Cricket Chirps",
caption=" Source Mcdonald (2009)")+
scale_color_brewer(palette="Dark2")
`geom_smooth()` using formula = 'y ~ x'
Other plots
#| label: Histogram
ggplot(crickets, aes(x=rate))+
geom_histogram(bins=5)
#This is suitable for one quantitative variable
#| label: Frequency Polygon
ggplot(crickets, aes(x=rate))+
geom_freqpoly(bins=15)
#This is suitable for one quantitative variable
#| label: Bar Graph
ggplot(crickets, aes(x=species,
fill=species))+
geom_bar(show.legend=FALSE)+
labs(x="Species",
y= NULL)+
scale_fill_brewer(palette="Dark2")
#This is suitable for a single categorical variable.
#As with the other plots you can alter lots of options (labels, colors etc etc)
#| label: Box Plot
ggplot(crickets, aes(x=species, y=rate, color=species))+
geom_boxplot(show.legend=FALSE)+
scale_color_brewer(palette="Dark2")+
theme_minimal()
#This is for a categorical and quantitative variable
Planning
Remember when planning visuals, what type of variables have you got and what data question you have.
Faceting
#| label: Not Great
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species) +
scale_fill_brewer(palette = "Dark2")
?facet_wrap will bring up the help menu for this command as per usual
#| label: Much Better
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species,
ncol = 1) +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
Week 5 Pre-Session and During Session
No Notes
Week 5 Post Session
ggplot(iris, aes(x=Species, y=Sepal.Length, color=Species))+
geom_boxplot()+
theme_minimal()
This plot relates to mean tests, as you are looking for differences in the means of the data.
ggplot(iris, aes(x=Petal.Length))+
geom_density(aes(fill=Species))
#I got stuck recreating this one!!!!
#| label: Recreating the plots - Scatter Plot
ggplot(iris, aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species))+
geom_smooth(method="lm", se=FALSE)
`geom_smooth()` using formula = 'y ~ x'
#I got stuck recreating this one!!!!
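A possible tweak for both plots (an assumption about what the target figures looked like): moving colour into the global aes() makes geom_smooth() fit one line per species, and adding alpha makes the overlapping densities easier to read.
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) # one regression line per species
ggplot(iris, aes(x = Petal.Length)) +
geom_density(aes(fill = Species), alpha = 0.5) # semi-transparent overlapping densities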
#| label: Bar Chart
library(dplyr)
library(ggplot2)
# Create a new data frame with the mutated variable. Compared to the provided code, we have redefined the dataframe
iris <- mutate(iris, size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))
# Create the bar plot
ggplot(iris, aes(x = Species, fill = size)) + #the fill here splits the data by the variable size and colours it appropriately
geom_bar(position = "dodge") #the dodge creates the clustered, rather than stacked plot
Week 6 Frequency Tests
Z tests
One proportion z-test:
res1 <- prop.test(x = 95, n = 160, p = 0.5,
correct = FALSE)
# the syntax for the above: x is our observed count, n is our total count and p is the expected proportion. "correct" states whether we need to apply Yates' continuity correction.
# Printing the results
res1
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.01771
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5163169 0.6667870
sample estimates:
p
0.59375
What this is telling us is that the observed proportion (95 out of 160, about 0.59) is different from the expected proportion under the null hypothesis (0.5), with a p-value of 0.01771. This is a one-proportion test as we only have one observed proportion.
Two proportion z-test
res2 <- prop.test(x = c(490, 400), n = c(500, 500))
#Here we have two observation counts (490 and 400) and two groups (both of 500)
# Printing the results
res2
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
What this return is telling us is the value of Pearson’s chi-squared test statistic (80.909), the degrees of freedom (1) and the p-value (<2.2e-16). It also gives us a 95% confidence interval and an estimated probability of success for each group.
We can also extract individual components of this test:
# p-value
res2$p.value
[1] 2.363439e-19
# Estimated proportions
res2$estimate
prop 1 prop 2
0.98 0.80
# Confidence interval
res2$conf.int
[1] 0.1408536 0.2191464
attr(,"conf.level")
[1] 0.95
We can use this when we have 2 observed proportions
Chi-squared goodness of fit test
The chi-square goodness of fit test is used to compare the observed distribution to an expected distribution, where we have two or more categories as discrete data. In other words, it compares multiple observed proportions to expected probabilities.
tulip <- c(81, 50, 27) # this is our observed tulip counts
res3 <- chisq.test(tulip, p = c(1/3, 1/3, 1/3)) # the expected proportions under the null hypothesis that the distribution of all 3 types is equal
res3
Chi-squared test for given probabilities
data: tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07
This result is telling us that the tulip types are not equally distributed, with a p-value of:
res3$p.value
[1] 8.802693e-07
which is much less than 0.05.
We can ask what the expected counts would have been if the tulips had been distributed equally:
res3$expected
[1] 52.66667 52.66667 52.66667
We can modify this command further. If we know that the proportions are not equal, but we have data stating what the expected proportions are, we can use the same test:
tulip <- c(81, 50, 27)
res4 <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res4
Chi-squared test for given probabilities
data: tulip
X-squared = 0.20253, df = 2, p-value = 0.9037
Here, our p-value is greater than 0.05 and so the observed data is not statistically significantly different from the expected proportions.
Chi-squared test of independence
The chi-square test of independence is used to analyze the frequency table (like a contingency table) formed by two categorical variables.
The chi-square test evaluates whether there is a significant association between the categories of the two variables.
First, let's grab some data:
<- "https://www.sthda.com/sthda/RDoc/data/housetasks.txt"
file_path <- read.delim(file_path, row.names = 1)
housetasks housetasks
Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
Shopping 33 23 9 55
Official 12 46 23 15
Driving 10 51 75 3
Finances 13 13 21 66
Insurance 8 1 53 77
Repairs 0 3 160 2
Holidays 0 1 6 153
We can visualize this data in 2 ways. A balloon plot:
library("gplots")
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Or as a mosaic plot:
library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
main = "housetasks")
It can be seen from both these visuals that, in this dataset, Laundry, Main_meal, Dinner and Breakfeast are mainly done by the wife (blue on the mosaic plot).
On to the chi-squared test!
Basics - Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.
Null hypothesis (H0): the row and the column variables of the contingency table are independent.
Alternative hypothesis (H1): row and column variables are dependent
The test:
chisq <- chisq.test(housetasks)
chisq
Pearson's Chi-squared test
data: housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
These test results show the variables are statistically significantly associated.
If we want to see the data we can extract it using either chisq$observed or chisq$expected to see what was observed or what would be expected (if there were no association) respectively.
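For example, a quick look at the expected counts (using the chisq object created above):
round(chisq$expected, 2) # counts expected in each cell if rows and columns were independent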
We can further investigate these results by looking at which cells contribute the most to the total chi-squared score. This is called looking at the Pearson or standardized residuals.
round(chisq$residuals, 3) #the 3 is defining the number of decimal places
Wife Alternating Husband Jointly
Laundry 12.266 -2.298 -5.878 -6.609
Main_meal 9.836 -0.484 -4.917 -6.084
Dinner 6.537 -1.192 -3.416 -3.299
Breakfeast 4.875 3.457 -2.818 -5.297
Tidying 1.702 -1.606 -4.969 3.585
Dishes -1.103 1.859 -4.163 3.486
Shopping -1.289 1.321 -3.362 3.376
Official -3.659 8.563 0.443 -2.459
Driving -5.469 6.836 8.100 -5.898
Finances -4.150 -0.852 -0.742 5.750
Insurance -5.758 -4.277 4.107 5.720
Repairs -7.534 -4.290 20.646 -6.651
Holidays -7.419 -4.620 -4.897 15.556
And we can visualise this too:
library(corrplot)
corrplot 0.95 loaded
corrplot(chisq$residuals, is.cor = FALSE)
The sign of the standardized residuals is also very important to interpret the association between rows and columns:
- Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables.
In the image above, it’s evident that there is an association between the column Wife and the rows Laundry and Main_meal. There is a strong positive association between the column Husband and the row Repairs.
-Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables.
For example, the column Wife is negatively associated (~ “not associated”) with the row Repairs. There is a repulsion between the column Husband and the rows Laundry and Main_meal.
Further practice using the MPG dataset
library(ggplot2)
tab1 <- mpg %>%
group_by(class, cyl) %>%
summarize(n=n())
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
tab1
# A tibble: 19 x 3
# Groups: class [7]
class cyl n
<chr> <int> <int>
1 2seater 8 5
2 compact 4 32
3 compact 5 2
4 compact 6 13
5 midsize 4 16
6 midsize 6 23
7 midsize 8 2
8 minivan 4 1
9 minivan 6 10
10 pickup 4 3
11 pickup 6 10
12 pickup 8 20
13 subcompact 4 21
14 subcompact 5 2
15 subcompact 6 7
16 subcompact 8 5
17 suv 4 8
18 suv 6 16
19 suv 8 38
Here, we have created a summary table that includes a limited part of the data from the mpg dataset.
To make this data into a crosstab or contingency table, we need to rearrange our variables. In this example, we are going to use class and cyl as our variables.
tab2 <- mpg %>%
group_by(class, cyl) %>%
summarize(n=n()) %>%
spread(cyl, n)
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
tab2
# A tibble: 7 x 5
# Groups: class [7]
class `4` `5` `6` `8`
<chr> <int> <int> <int> <int>
1 2seater NA NA NA 5
2 compact 32 2 13 NA
3 midsize 16 NA 23 2
4 minivan 1 NA 10 NA
5 pickup 3 NA 10 20
6 subcompact 21 2 7 5
7 suv 8 NA 16 38
This table is giving us the number of observations by class and cyl. Let's graph it:
balloonplot(t(tab2), main="class", xlab="", ylab="",
label = FALSE, show.margins = FALSE)
Ok, so that has not worked. Let's try again, using the matrix/table method.
tab3 <- as.table(as.matrix(tab2))
tab3
class 4 5 6 8
A 2seater 5
B compact 32 2 13
C midsize 16 23 2
D minivan 1 10
E pickup 3 10 20
F subcompact 21 2 7 5
G suv 8 16 38
balloonplot(t(tab3), main="class",xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Warning in balloonplot.default(x, y, z, xlab = xlab, ylab = ylab, zlab = zlab,
: NAs introduced by coercion
As a comparison, let's try this without the grouping:
tab4 <- as.table(as.matrix(tab1))
tab4
class cyl n
A 2seater 8 5
B compact 4 32
C compact 5 2
D compact 6 13
E midsize 4 16
F midsize 6 23
G midsize 8 2
H minivan 4 1
I minivan 6 10
J pickup 4 3
K pickup 6 10
L pickup 8 20
M subcompact 4 21
N subcompact 5 2
O subcompact 6 7
P subcompact 8 5
Q suv 4 8
R suv 6 16
S suv 8 38
balloonplot(t(tab4), main="class",xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Warning in balloonplot.default(x, y, z, xlab = xlab, ylab = ylab, zlab = zlab,
: NAs introduced by coercion
This shows us the importance of correct grouping prior to analysis; even without running a test, the pattern is much clearer in the grouped data.
Week 7 - Comparing Means
One Sample T-test
The formula for this test is:
t.test(x, mu = 0, alternative = "two.sided")
x is the data we are testing - a vector, for example my_data$weight, where my_data is the dataframe and weight is the column containing the data.
mu is the theoretical mean we are testing against (the default is 0)
alternative specifies the alternative hypothesis. This can be "two.sided" (the default), "greater" or "less". Use "greater" or "less" for a one-tailed test; "two.sided" tests for a difference in either direction.
set.seed(1234)
mouse <- data.frame(
name = paste0(rep("M_", 10), 1:10),
weight = round(rnorm(10, 20, 2), 1)
) # this has created a dummy dataset: simulated weights for 10 mice
# Print the first 10 rows of the data
head(mouse, 10)
name weight
1 M_1 17.6
2 M_2 20.6
3 M_3 22.2
4 M_4 15.3
5 M_5 20.9
6 M_6 21.0
7 M_7 18.9
8 M_8 18.9
9 M_9 18.9
10 M_10 18.2
Before we run any tests, let's check the summary:
summary(mouse$weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.30 18.38 18.90 19.25 20.82 22.20
Let’s visualize!
library(ggplot2)
ggplot(mouse, aes(y=weight)) +
geom_boxplot()
Before we can run any tests, we need to check the assumptions of the test. For one-sample t-tests these are:
Normal distribution
Suitable sample size
For this data set n < 30, so we cannot rely on sample size alone and need to check normality. To check for normality we can run a Shapiro-Wilk test and normality plots.
Normality Test
Shapiro-Wilk
For this test, the null hypothesis is that the data is normally distributed, so if the test returns a p-value < 0.05 we conclude that the data is not normally distributed.
Lets code it:
shapiro.test(mouse$weight)
Shapiro-Wilk normality test
data: mouse$weight
W = 0.9526, p-value = 0.6993
This result shows us that the data is normally distributed and so we can perform the t-test.
For this dataset, the expected mean average is 25g. From the boxplot above, we can set a hypothesis that the mean average of our dataset is less than the expected mean.
In code:
t.test(mouse$weight, mu=25, alternative="less")
One Sample t-test
data: mouse$weight
t = -9.0783, df = 9, p-value = 3.977e-06
alternative hypothesis: true mean is less than 25
95 percent confidence interval:
-Inf 20.41105
sample estimates:
mean of x
19.25
This result shows us that our data is statistically significantly different from the expected mean (p-value < 0.05) and so we can reject the null hypothesis in favour of our hypothesis that the mean is less than 25g.
We could also run this t-test as a two-tailed test - this is where our hypothesis is that the mean of our data is statistically significantly different from our expected mean, but we have not stated a direction. The corresponding null hypothesis is that the means are the same.
In code:
t.test(mouse$weight, mu=25)
One Sample t-test
data: mouse$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
17.8172 20.6828
sample estimates:
mean of x
19.25
As with the one-tailed test, this confirms our hypothesis that the mean of our data is not equal to the expected mean.
Non-normal one-sample t-test
If our data is not normally distributed we can do a non-parametric test. One of these is the One-sample Wilcoxon signed rank test
When the data are not normal, rather than testing against a mean, this test compares the median of the data to the expected median.
The syntax for this test is similar to the t-test:
wilcox.test(x, mu = 0, alternative = "two.sided")
x is the data we are testing - a vector, for example my_data$weight, where my_data is the dataframe and weight is the column containing the data.
mu is the theoretical value we are testing against (the default is 0)
alternative specifies the alternative hypothesis. This can be "two.sided" (the default), "greater" or "less". Use "greater" or "less" for a one-tailed test; "two.sided" tests for a difference in either direction.
We will run this test using the same data as before. Despite this data being normally distributed we can still run this test. Unlike the t-test, this test is comparing medians
Let’s run the test:
wilcox.test(mouse$weight, mu=25, alternative = "less")
Warning in wilcox.test.default(mouse$weight, mu = 25, alternative = "less"):
cannot compute exact p-value with ties
Wilcoxon signed rank test with continuity correction
data: mouse$weight
V = 0, p-value = 0.002897
alternative hypothesis: true location is less than 25
The two tailed equivalent is the same as it is for the t-test:
wilcox.test(mouse$weight, mu=25)
Warning in wilcox.test.default(mouse$weight, mu = 25): cannot compute exact
p-value with ties
Wilcoxon signed rank test with continuity correction
data: mouse$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
As the two-tailed test is the default for both of these tests, we do not need to set the alternative to "less" or "greater" unless we want a one-tailed test.
Unpaired 2 sample
These tests are used when comparing the means of two independent groups. There are 2 further assumptions related to this test:
data is normally distributed (Shapiro-Wilk test)
variances are equal (F-test)
Uneven variance test: Welch t-statistic
If the variance of the data sets is not even, you can do this test instead.
As with the one-sample test above, there are variations of the “tails” of the test. One-tailed is when we are testing whether the mean of one group is greater or less than the other; the two-tailed version tests whether the means differ in either direction.
# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
sex_weight <- data.frame(
group = rep(c("Woman", "Man"), each = 9),
weight = c(women_weight, men_weight)
)
sex_weight
group weight
1 Woman 38.9
2 Woman 61.2
3 Woman 73.3
4 Woman 21.8
5 Woman 63.4
6 Woman 64.6
7 Woman 48.4
8 Woman 48.8
9 Woman 48.5
10 Man 67.8
11 Man 60.0
12 Man 63.4
13 Man 76.0
14 Man 89.4
15 Man 73.3
16 Man 67.3
17 Man 61.3
18 Man 62.4
Now our data is in, let's follow best practice by inspecting it:
library(dplyr)
group_by(sex_weight, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE))
# A tibble: 2 x 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 Man 9 69.0 9.38
2 Woman 9 52.1 15.6
This summary suggests that there is a difference between our means. Let's graph it!
# Plot weight by group and color by group
library("ggplot2")
ggplot(sex_weight, aes(x=group, y=weight,
colour=group))+
geom_boxplot()
Now, let's check our assumptions before we run the t-test.
First, test the normality of distributions:
# Shapiro-Wilk normality test for Men's weights
with(sex_weight, shapiro.test(weight[group == "Man"]))# p = 0.1
Shapiro-Wilk normality test
data: weight[group == "Man"]
W = 0.86425, p-value = 0.1066
# Shapiro-Wilk normality test for Women's weights
with(sex_weight, shapiro.test(weight[group == "Woman"])) # p = 0.6
Shapiro-Wilk normality test
data: weight[group == "Woman"]
W = 0.94266, p-value = 0.6101
Remember this is testing against the null hypothesis that the data are normally distributed. With p-values above 0.05, we fail to reject this hypothesis and can say the data do not deviate significantly from normal.
Next, let's check the homogeneity of the variances using the F-test:
var.test(weight~group, data=sex_weight)
F test to compare two variances
data: weight by group
F = 0.36134, num df = 8, denom df = 8, p-value = 0.1714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.08150656 1.60191315
sample estimates:
ratio of variances
0.3613398
The alternative hypothesis stated in the output is that the true ratio of variances is not equal to 1. As the p-value is greater than 0.05, we fail to reject the null hypothesis of equal variances and can say that there is no significant difference between the variances.
Both the assumptions have been met, so now we can do the test
t.test(weight~group, data=sex_weight, var=TRUE)
Two Sample t-test
data: weight by group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
95 percent confidence interval:
4.029759 29.748019
sample estimates:
mean in group Man mean in group Woman
68.98889 52.10000
Understanding this output
t is the t-test statistic value (t = 2.784), df is the degrees of freedom (df = 16), and p-value is the significance level of the t-test (p-value = 0.01327). conf.int is the 95% confidence interval of the difference in means (conf.int = [4.0298, 29.748]); sample estimates are the mean values of each group (68.99 for men, 52.1 for women).
With this p-value, we can conclude that the men’s average weight is significantly different from the women’s average weight.
Wilcoxon test
This test is for similar data, but when the two independent groups data is not normally distributed. This would be if the Shapiro-Wilk test result was reversed from the above scenario. All other steps are the same so are unrepeated here.
wilcox.test(weight~group, data=sex_weight)
Warning in wilcox.test.default(x = c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, :
cannot compute exact p-value with ties
Wilcoxon rank sum test with continuity correction
data: weight by group
W = 66, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
The warning shown is because of the assumption of a Wilcoxon test that the responses are continuous. You can suppress this message by adding another argument exact = FALSE, but the result will be the same.
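As described, the same call with that extra argument (the result is unchanged, only the warning is suppressed):
wilcox.test(weight ~ group, data = sex_weight, exact = FALSE)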
The p-value of the test is 0.02712, which is less than the significance level alpha = 0.05. We can conclude that men’s median weight is significantly different from women’s median weight with a p-value = 0.02712.
This is the same result as we saw from the t-test, which is to be expected as it is the same data!
Paired sample t-tests
The paired samples t-test is used to compare the means between two related groups of samples. In this example, there are two values (i.e., pair of values) for the same samples.
In this example, 10 mice received a treatment (X) for 3 months. We want to know whether treatment X has an impact on the weight of the mice.
To answer this question, the weight of the 10 mice was measured before and after the treatment. This gives us 10 values before treatment and 10 values after treatment, from weighing the same mice twice.
In situations like this, a paired t-test can be used to compare the mean weights before and after treatment.
As this is a t-test, a normality assumption still applies (Shapiro-Wilk test), but for paired data it applies to the differences rather than to the raw values.
The difference for paired data is that the test is run on the difference between the paired values (i.e. the difference in weight of each mouse before and after treatment), and so the normality test is performed on this "difference" data, not the original observations.
This is an important point to remember! The paired t-test examines the mean of the within-individual differences (weight after treatment (Wa) - weight before treatment (Wb) = weight difference (Wd)). We are testing the mean (and distribution) of Wd.
Here is our data:
# Data in two numeric vectors
#
# Weight of the mice before treatment
before <- c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <- c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame (combine the vectors)
mouse_weight <- data.frame(
group = rep(c("before", "after"), each = 10),
weight = c(before, after)
)
mouse_weight
group weight
1 before 200.1
2 before 190.9
3 before 192.7
4 before 213.0
5 before 241.4
6 before 196.9
7 before 172.2
8 before 185.5
9 before 205.2
10 before 193.7
11 after 392.9
12 after 393.2
13 after 345.1
14 after 393.0
15 after 434.0
16 after 427.9
17 after 422.0
18 after 383.9
19 after 392.3
20 after 352.2
Let's examine this data:
group_by(mouse_weight, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)
# A tibble: 2 x 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 after 10 394. 29.4
2 before 10 199. 18.5
Note how for this and the above data, we have grouped the data!
graph time!
# Plot weight by group and color by group
library("ggplot2")
ggplot(mouse_weight, aes(x=group, y=weight,
colour=group))+
geom_boxplot()
This is showing the differences between the groups, but has lost the paired nature of the data. We need to add the PairedData package.
library(PairedData)
Warning: package 'PairedData' was built under R version 4.1.3
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Loading required package: gld
Warning: package 'gld' was built under R version 4.1.3
Loading required package: mvtnorm
Loading required package: lattice
Attaching package: 'PairedData'
The following object is masked from 'package:base':
summary
# Create data subsets: weight data before treatment
before <- subset(mouse_weight, group == "before", weight,
drop = TRUE)
# subset weight data after treatment
after <- subset(mouse_weight, group == "after", weight,
drop = TRUE)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()
We can now see the change for each individual mouse from before to after treatment. As before, we need to check our assumptions before we run the test.
1) Are the two samples paired? Yes - the data were collected by measuring the same mouse before and after treatment.
2) Is this a large sample? No - as this is a small sample (fewer than 30), we need to check for normality.
3) Is it normally distributed? Lets find out!
# calculate the differences between each pair of values (wd)
wd <- pd$after - pd$before
shapiro.test(wd)
Shapiro-Wilk normality test
data: wd
W = 0.94536, p-value = 0.6141
The p-value is greater than the significance level (0.05), so we cannot conclude that the differences deviate significantly from a normal distribution.
There are two methods for calculating the t-test, depending on how the data are arranged. Before we added our data to the data frame, we had the data stored as two vectors (before and after). To perform the t-test using the vectors we do the following:
t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-215.5581 -173.4219
sample estimates:
mean of the differences
-194.49
Alternatively, we can calculate the t-test from the data frame:
t.test(weight~group, data=mouse_weight, paired=TRUE)
Paired t-test
data: weight by group
t = 20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
173.4219 215.5581
sample estimates:
mean of the differences
194.49
As we are testing the same data, the same values are to be expected and are confirmed by the tests (the sign of t flips only because the groups are taken in the opposite order).
This test output has a p-value of less than 0.05, so we can reject the null hypothesis (that the before and after weights are not different) and accept our alternative hypothesis that the weights before and after treatment are significantly different.
Remember
We can modify these tests to one-tailed tests by defining what our alternative hypothesis is. We do this by adding alternative = "less" or alternative = "greater", depending on what our alternative hypothesis is.
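For example, a minimal sketch of a one-tailed version of the paired test above, assuming our alternative hypothesis is that weights are lower before treatment than after:
# One-tailed paired t-test: tests whether mean(before - after) is less than 0
t.test(before, after, paired = TRUE, alternative = "less")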
Wilcoxon Test for paired data
As with the unpaired data examples above, the non-parametric test for paired data is the Wilcoxon test.
Remember The difference between the t-test and the Wilcoxon test is that the t-test compares means and the Wilcoxon test compares medians. As a result, the Wilcoxon test is suitable when the data are not normally distributed.
We will be using the same data as above for this example and so the data summary and checking processes will be skipped as they are covered above.
As with the t-test, we can compute the Wilcoxon test either using vector data:
wilcox.test(before, after, paired=TRUE)
Wilcoxon signed rank exact test
data: before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
Or using the dataframe data:
wilcox.test(weight~group, data=mouse_weight, paired=TRUE)
Wilcoxon signed rank exact test
data: weight by group
V = 55, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
As before, the results are the same because we are testing the same data. We can also modify these to one-tailed tests (greater or less) using the same code as for the t-tests.
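A sketch of the one-tailed form of the Wilcoxon test, under the same assumed alternative hypothesis (before weights lower than after):
# One-tailed paired Wilcoxon test
wilcox.test(before, after, paired = TRUE, alternative = "less")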
One-Way ANOVA test
ANOVA is an extension of independent two-sample t-tests (covered above). Like the t-test, ANOVA is for comparing means, but unlike t-tests, ANOVA is used in situations where there are more than 2 groups.
To conduct ANOVA tests, data needs to be organised into several groups based on a single grouping variable (also called factor variable).
The Null hypothesis for ANOVA tests is that the means of the different groups are the same.
The alternate hypothesis is that at least one sample mean is not equal to the others.
Assumptions
- The observations are obtained independently and randomly from the population
- The data are normally distributed
- The normal populations have a common variance
For this example, we will be using a built in dataset
PlantGrowth
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl
7 5.17 ctrl
8 4.53 ctrl
9 5.33 ctrl
10 5.14 ctrl
11 4.81 trt1
12 4.17 trt1
13 4.41 trt1
14 3.59 trt1
15 5.87 trt1
16 3.83 trt1
17 6.03 trt1
18 4.89 trt1
19 4.32 trt1
20 4.69 trt1
21 6.31 trt2
22 5.12 trt2
23 5.54 trt2
24 5.50 trt2
25 5.37 trt2
26 5.29 trt2
27 4.92 trt2
28 6.15 trt2
29 5.80 trt2
30 5.26 trt2
We can see that the "group" column here refers to the treatment group. We can see all the groups by asking R to show us them using this command:
levels(PlantGrowth$group)
[1] "ctrl" "trt1" "trt2"
Lets look at the summary statistics of our data:
group_by(PlantGrowth, group) %>%
summarise(
count=n(),
mean=mean(weight,na.rm=TRUE),
sd=sd(weight,na.rm=TRUE)
)
# A tibble: 3 x 4
group count mean sd
<fct> <int> <dbl> <dbl>
1 ctrl 10 5.03 0.583
2 trt1 10 4.66 0.794
3 trt2 10 5.53 0.443
From this we can see the mean and standard deviation of our groups of treatments.
Lets graph it!
ggplot(PlantGrowth, aes(x=group, y=weight, color=group))+
geom_boxplot()
We can see, visually, that there is a difference between the means of the groups; ANOVA can tell us whether this difference is statistically significant.
aov(weight~group, data=PlantGrowth)
Call:
aov(formula = weight ~ group, data = PlantGrowth)
Terms:
group Residuals
Sum of Squares 3.76634 10.49209
Deg. of Freedom 2 27
Residual standard error: 0.6233746
Estimated effects may be unbalanced
And to summarise this output in an easier to view way:
summary(aov(weight~group, data=PlantGrowth))
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value is the ratio of the variance between the samples to the variance within the samples. A small F-value (i.e. around 1 or less) indicates that there is no significant difference among the group means, whilst a higher F-value implies that the variation among group means is significant.
The output labelled Pr(>F) is the equivalent of the p-value in the earlier tests. As the value in this example is under 0.05, we can conclude that there are significant differences between the groups.
This output tells us that there is a significant difference, but it does not tell us between which groups this difference is. To find this out we need to undertake a pairwise-comparison.
One way of doing this is to use the Tukey HSD function in R:
TukeyHSD(aov(weight~group, data=PlantGrowth))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = PlantGrowth)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
We can make this code neater by assigning the ANOVA test to a specific reference and then run the Tukey test:
PWAOV <- aov(weight ~ group, data = PlantGrowth)

TukeyHSD(PWAOV)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = PlantGrowth)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
The Tukey result shows that there is only a significant (p-value < 0.05) difference between trt2 and trt1.
There are other ways of conducting this pairwise comparison, but they are not explored here.
As mentioned above, ANOVA tests have some assumptions
Assumptions
- The observations are obtained independently and randomly from the population
- The data are normally distributed
- The normal populations have a common variance
The observation assumption is part of the experimental design.
The variance and distribution assumptions can be tested. Remember that ANOVA is comparing means, so these assumptions apply to the residuals (the variation around the group means), not to the raw data set.
ANOVA Variance check
Homogeneity of variance can be checked visually using a residuals-versus-fits plot:
plot(PWAOV, 1)
#the 1 refers to the type of plot to be made, do not worry too much about these numbers, just that 1 is needed to test this assumption when looking at ANOVA outputs
Data points labelled 17, 15 and 4 appear as outliers on this plot, which suggests there may be some deviation from homogeneity of variance. These data points could be removed, as outliers can affect the normality and variance assumptions and thus invalidate the ANOVA.
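If we did decide to exclude them, a hedged sketch of how that might look (row numbers 4, 15 and 17 are the labels from the plot; PlantGrowth_trim and PWAOV_trim are just illustrative names, and the original data is left untouched):
# Refit the ANOVA without rows 4, 15 and 17, then re-check the assumption plot
PlantGrowth_trim <- PlantGrowth[-c(4, 15, 17), ]
PWAOV_trim <- aov(weight ~ group, data = PlantGrowth_trim)
plot(PWAOV_trim, 1)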
We can test this further using Bartlett’s Test:
bartlett.test(weight ~ group, data = PlantGrowth)
Bartlett test of homogeneity of variances
data: weight by group
Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371
This returns a p-value of >0.05 so the variance does not deviate from homogeneous.
What if the common variance assumption is violated
This would mean that we cannot run the ANOVA test shown above.
One option is a Welch one-way test:
oneway.test(weight~group, data=PlantGrowth)
One-way analysis of means (not assuming equal variances)
data: weight and group
F = 5.181, num df = 2.000, denom df = 17.128, p-value = 0.01739
An alternative is a Pairwise t.test with no assumption of equal variance:
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method="BH", pool.sd=FALSE)
Pairwise comparisons using t tests with non-pooled SD
data: PlantGrowth$weight and PlantGrowth$group
ctrl trt1
trt1 0.250 -
trt2 0.072 0.028
P value adjustment method: BH
#The pool.sd = FALSE argument tells R not to treat the standard deviation as pooled - this removes the need for the assumption of homogeneous variance
ANOVA Normality check
As with the variance check, we can do this visually:
plot(PWAOV, 2)
#as above, do not worry about what 2 means too much, just that it is the one to use to test this assumption
And we can do it with a statistical test:
# Get the residual values from our test (remember the residuals are the differences
# around the mean, and as it is the means we are testing this is the dataset our
# assumptions need to meet)
PWaov_residuals <- residuals(object = PWAOV)
# Then run the test
shapiro.test(x = PWaov_residuals)
Shapiro-Wilk normality test
data: PWaov_residuals
W = 0.96607, p-value = 0.4379
As with the variance checks, this confirms that our data meets the assumptions of the ANOVA test.
Non-parametric alternative
If the assumptions are not met for a one-way ANOVA, we can use an alternative test - the Kruskal-Wallis rank sum test:
kruskal.test(weight~group, data=PlantGrowth)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
As with the parametric version, the test shows a statistically significant difference in the data groups.
Two-Way ANOVA test
Two-way ANOVA test is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable.
Assumptions
The same assumptions apply as for the one-way ANOVA:
- The observations are obtained independently and randomly from the population
- The data are normally distributed
- The normal populations have a common variance
Further, in two-way ANOVA, the sample sizes within cells need to be equal, leading to a so-called balanced design. In this case the standard two-way ANOVA test can be applied.
When the sample sizes within each level of the independent variables are not the same (case of unbalanced designs), the ANOVA test should be handled differently.
Lets get some data!
ToothGrowth
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
11 16.5 VC 1.0
12 16.5 VC 1.0
13 15.2 VC 1.0
14 17.3 VC 1.0
15 22.5 VC 1.0
16 17.3 VC 1.0
17 13.6 VC 1.0
18 14.5 VC 1.0
19 18.8 VC 1.0
20 15.5 VC 1.0
21 23.6 VC 2.0
22 18.5 VC 2.0
23 33.9 VC 2.0
24 25.5 VC 2.0
25 26.4 VC 2.0
26 32.5 VC 2.0
27 26.7 VC 2.0
28 21.5 VC 2.0
29 23.3 VC 2.0
30 29.5 VC 2.0
31 15.2 OJ 0.5
32 21.5 OJ 0.5
33 17.6 OJ 0.5
34 9.7 OJ 0.5
35 14.5 OJ 0.5
36 10.0 OJ 0.5
37 8.2 OJ 0.5
38 9.4 OJ 0.5
39 16.5 OJ 0.5
40 9.7 OJ 0.5
41 19.7 OJ 1.0
42 23.3 OJ 1.0
43 23.6 OJ 1.0
44 26.4 OJ 1.0
45 20.0 OJ 1.0
46 25.2 OJ 1.0
47 25.8 OJ 1.0
48 21.2 OJ 1.0
49 14.5 OJ 1.0
50 27.3 OJ 1.0
51 25.5 OJ 2.0
52 26.4 OJ 2.0
53 22.4 OJ 2.0
54 24.5 OJ 2.0
55 24.8 OJ 2.0
56 30.9 OJ 2.0
57 26.4 OJ 2.0
58 27.3 OJ 2.0
59 29.4 OJ 2.0
60 23.0 OJ 2.0
The data shown above has identified one of our treatment types as a continuous data type (dbl). We need to tell R to treat this data as a factor before we continue:
ToothGrowth1 <- ToothGrowth # creating a new data frame ready for our manipulation

ToothGrowth1$dose <- factor(ToothGrowth1$dose,
                            levels = c(0.5, 1, 2),
                            labels = c("D0.5", "D1", "D2"))
# This command tells R to convert the dose column in our data set to a factor,
# defining what the factor levels are and what to call them (labels)
ToothGrowth1
len supp dose
1 4.2 VC D0.5
2 11.5 VC D0.5
3 7.3 VC D0.5
4 5.8 VC D0.5
5 6.4 VC D0.5
6 10.0 VC D0.5
7 11.2 VC D0.5
8 11.2 VC D0.5
9 5.2 VC D0.5
10 7.0 VC D0.5
11 16.5 VC D1
12 16.5 VC D1
13 15.2 VC D1
14 17.3 VC D1
15 22.5 VC D1
16 17.3 VC D1
17 13.6 VC D1
18 14.5 VC D1
19 18.8 VC D1
20 15.5 VC D1
21 23.6 VC D2
22 18.5 VC D2
23 33.9 VC D2
24 25.5 VC D2
25 26.4 VC D2
26 32.5 VC D2
27 26.7 VC D2
28 21.5 VC D2
29 23.3 VC D2
30 29.5 VC D2
31 15.2 OJ D0.5
32 21.5 OJ D0.5
33 17.6 OJ D0.5
34 9.7 OJ D0.5
35 14.5 OJ D0.5
36 10.0 OJ D0.5
37 8.2 OJ D0.5
38 9.4 OJ D0.5
39 16.5 OJ D0.5
40 9.7 OJ D0.5
41 19.7 OJ D1
42 23.3 OJ D1
43 23.6 OJ D1
44 26.4 OJ D1
45 20.0 OJ D1
46 25.2 OJ D1
47 25.8 OJ D1
48 21.2 OJ D1
49 14.5 OJ D1
50 27.3 OJ D1
51 25.5 OJ D2
52 26.4 OJ D2
53 22.4 OJ D2
54 24.5 OJ D2
55 24.8 OJ D2
56 30.9 OJ D2
57 26.4 OJ D2
58 27.3 OJ D2
59 29.4 OJ D2
60 23.0 OJ D2
The above table shows that dose is now a factor, not a dbl data type.
Lets check that our data meets the balanced design assumption by looking at the number of observations by category:
table(ToothGrowth1$supp, ToothGrowth1$dose)
D0.5 D1 D2
OJ 10 10 10
VC 10 10 10
This is a frequency table showing the factors supp and dose, with 10 subjects in each cell. This meets our assumption, as the sample sizes within cells are equal.
Lets start with visualising our data:
#We need to plot tooth length (len) by groups (dose) then plot the second group (supp) and show them in different colors
ggplot(ToothGrowth1, aes(x=dose, y=len, color=supp))+
geom_boxplot()
We can also do this as a line plot with multiple groups:
ggplot(ToothGrowth1, aes(x = factor(dose), y = len, color = supp)) + geom_point(position = position_jitter(width = 0.2), alpha = 0.6) + # Dotplot with jitter
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2, color = "black") + # Mean and SE
stat_summary(fun = mean, geom = "point", size = 3, color = "black") + # Mean points
geom_line(stat = "summary", fun = mean, aes(group = supp)) + # Lines connecting the means
labs(x = "Dose", y = "Length", color = "Supplement") + theme_minimal()
The code for the two-way ANOVA test is very similar to that of the one-way ANOVA test; we just define the additional factor we want examined:
res.tl <- aov(len ~ supp + dose, data = ToothGrowth1) # the "+ dose" adds the second factor for analysis
summary(res.tl)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 14.02 0.000429 ***
dose 2 2426.4 1213.2 82.81 < 2e-16 ***
Residuals 56 820.4 14.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output tells us that both supp and dose have a statistically significant impact on our response variable (tooth length), with dose having the smaller p-value.
This test assumes independence of our treatments (an additive model), but does not consider whether the treatments interact. We need to modify our code to test for this:
res.tl1 <- aov(len ~ supp * dose, data = ToothGrowth1)
summary(res.tl1)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 15.572 0.000231 ***
dose 2 2426.4 1213.2 92.000 < 2e-16 ***
supp:dose 2 108.3 54.2 4.107 0.021860 *
Residuals 54 712.1 13.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output is showing that alongside the statistically significant impact of the treatments dose and supp, there is also a statistically significant interaction between both of the treatments.
The full interpretation of this output is:
From the ANOVA results, you can conclude the following, based on the p-values and a significance level of 0.05:
- the p-value of supp is 0.000231 (significant), which indicates that the levels of supp are associated with significantly different tooth lengths.
- the p-value of dose is < 2e-16 (significant), which indicates that the levels of dose are associated with significantly different tooth lengths.
- the p-value for the supp:dose interaction is 0.022 (significant), which indicates that the relationship between dose and tooth length depends on the supp method.
The above results show that the group means differ significantly across our treatment groups.
The supp treatment only has 2 levels and we know its effect is significant; however, the dose treatment has multiple levels, so we need to know which pairwise differences are significant. We do this the same way we did for the one-way ANOVA test, with pairwise comparisons:
TukeyHSD(res.tl1, which = "dose")
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = len ~ supp * dose, data = ToothGrowth1)
$dose
diff lwr upr p adj
D1-D0.5 9.130 6.362488 11.897512 0.0e+00
D2-D0.5 15.495 12.727488 18.262512 0.0e+00
D2-D1 6.365 3.597488 9.132512 2.7e-06
This shows that all the pairwise comparisons are statistically significant.
You can also do this using a pairwise t test:
pairwise.t.test(ToothGrowth1$len, ToothGrowth1$dose, p.adjust.method = "BH")
Pairwise comparisons using t tests with pooled SD
data: ToothGrowth1$len and ToothGrowth1$dose
D0.5 D1
D1 1.0e-08 -
D2 4.4e-16 1.4e-05
P value adjustment method: BH
This again shows that for dose, all the pairwise comparisons are significant.
Does the data meet the assumptions?
As with the previous ANOVA tests, we need to ensure our data meets the test's assumptions for our results to be valid.
Homogeneity of variance
Graphically:
plot(res.tl1, 1)
Three points seem like outliers, so lets check with a test:
# Note: levene.test() is not in base R; judging by the output below it appears to come from the lawstat package (car::leveneTest() is a common alternative)
levene.test(ToothGrowth1$len, # This is specifying the data to be tested against
            group = interaction(ToothGrowth1$supp, ToothGrowth1$dose)) # this is stating which factors to include
Modified robust Brown-Forsythe Levene-type test based on the absolute
deviations from the median
data: ToothGrowth1$len
Test Statistic = 1.7086, p-value = 0.1484
#This test did not work as expected, so function modified following research. The result matches the expected, so modification accepted.
This result tells us that there is no evidence that the variances differ significantly across the groups. Our data therefore meets the homogeneity of variance assumption.
Normal distribution
Visual inspection:
plot(res.tl1, 2)
This looks pretty good, but lets test statistically to be sure:
shapiro.test(x=(residuals(object=res.tl1)))
Shapiro-Wilk normality test
data: (residuals(object = res.tl1))
W = 0.98499, p-value = 0.6694
#Remember, we are testing the data around the means - this is the residual data we are calling for this test.
The residuals are normally distributed, so our test results are valid.
What if our data is an unbalanced design?
Reminder A balanced design is where the sample sizes within cells are equal, unbalanced is when this is not the case.
In these situations, we need to use a type-III sum of squares method. I am not including that here.
MANOVA - Multivariate Analysis of Variance
MANOVA can be used in situations where there are multiple response variables; it tests them simultaneously using a multivariate analysis of variance.
For example, we may conduct an experiment where we give two treatments (A and B) to two groups of mice, and we are interested in the weight and height of mice. In that case, the weight and height of mice are two dependent variables, and our hypothesis is that both together are affected by the difference in treatment. A multivariate analysis of variance could be used to test this hypothesis.
As with other tests, there are assumptions:
Assumptions As with the two-way ANOVA above, the MANOVA assumptions include:
- The data in the groups are normally distributed.
- Homogeneity of variances across the range of predictors.
- Linearity between all pairs of dependent variables, all pairs of covariates, and all dependent variable-covariate pairs in each cell.
As with the ANOVA tests, it is necessary to further investigate the results if the MANOVA test shows statistical significance. We will see this in the worked example.
The below is a MANOVA test to find out if there are significant differences between the species' petal and sepal lengths in the built-in iris data set:
# Get data and assign vectors
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
# MANOVA Test
summary(manova(cbind(sepl, petl) ~ Species, data = iris))
Df Pillai approx F num Df den Df Pr(>F)
Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Here we have both asked for a summary of the test result and coded the test in a single step
res.man <- manova(cbind(sepl, petl) ~ Species, data = iris)
# This line assigns the name res.man to our MANOVA results, making it easier for the next step
This tells us that there is a significant difference between our groups. The different species do have statistically significantly different sepal and petal lengths, but the output does not tell us whether the difference lies in sepal length, petal length, or both.
To investigate which response variables differ, we can run ANOVA tests on the results to elucidate our findings:
summary.aov(res.man)
Response sepl :
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.212 31.606 119.26 < 2.2e-16 ***
Residuals 147 38.956 0.265
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response petl :
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 437.10 218.551 1180.2 < 2.2e-16 ***
Residuals 147 27.22 0.185
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This ANOVA result shows that both petal and sepal length are significantly different between the species. To fully validate the findings, we should run the checks we did above on the ANOVA to ensure our assumptions are met.
Kruskal-Wallis Test
This is a ranking test, providing a non-parametric alternative to one way ANOVA tests.
For this example, we are going to use the plant growth data used in the earlier ANOVA example, so I am skipping the visualisation steps as we have previously completed them.
On to the test:
kruskal.test(weight~group, data=PlantGrowth)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
As with the ANOVA tests, this tells us that there is a significant difference, but does not say where.
As our data is non-parametric, we cannot use the Tukey test; instead we can use a pairwise Wilcoxon test:
pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "BH")
Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute
exact p-value with ties
Pairwise comparisons using Wilcoxon rank sum test with continuity correction
data: PlantGrowth$weight and PlantGrowth$group
ctrl trt1
trt1 0.199 -
trt2 0.095 0.027
P value adjustment method: BH
The result shows that there is only a significant (p-value < 0.05) difference between trt2 and trt1.
Week 9 Linear Models
We are using the pre-existing mtcars data set for this
Task - What is the effect of the weight of the vehicle on its efficiency (miles per gallon)?
ggplot(mtcars, aes(x=wt, y=mpg))+ #calls the data
geom_point()+#defines the plot type
geom_smooth(method= "lm")#adds our regression line
`geom_smooth()` using formula = 'y ~ x'
This output shows how the model could fit, but does not give us the model (i.e. the equation).
There is another way to fit the model and extract the info we are after:
model <- lm(mpg ~ wt, data = mtcars) # this is defining and fitting our model

summary(model) # This gives us the detail from our fitted model
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
This output tells us that there is a strong, statistically significant negative relationship between weight and miles per gallon, along with the p-value of that relationship.
# Extract the coefficients from the model (the coefficients will inform us as to the
# details of the model, i.e. the mathematical relationship between our variables)
coefficients <- summary(model)$coefficients
# Create the scatterplot with regression line and equation
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
annotate("text", x = 3.5, y = 30, label = paste("y = ", round(coefficients[1,1], 2), " - ", abs(round(coefficients[2,1], 2)), " * x", sep = ""), color = "black") +
labs(x = "Weight (1000 lbs)", y = "Miles Per Gallon", title = "Effect of Vehicle Weight on Efficiency") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Most of this code is not new; however, the annotate() function is, so lets break down its syntax.
annotate("text", ...) This tells ggplot2 that you want to add text annotation to the plot.
x = 3.5, y = 30 These parameters specify the coordinates on the plot where the annotation text should be placed. In this case, the text will be placed at the point (3.5, 30) on the plot.
label = paste("y = ", round(coefficients[1,1], 2), " - ", abs(round(coefficients[2,1], 2)), " * x", sep = "") This creates the text label to be displayed.
Inside the label parameter:
paste(...): Concatenates strings together
"y = ": The beginning of the text string
round(coefficients[1,1], 2): Extracts the intercept from the model coefficients (found in the coefficient table in cell 1,1) and rounds it to 2 decimal places
" - ": Adds a minus sign, as we can see the correlation is negative
abs(round(coefficients[2,1], 2)): Extracts the slope from the model coefficients (found in the coefficient table in cell 2,1), takes the absolute value, and rounds it to 2 decimal places.
" * x": Adds the final part of the equation.
sep = "": Ensures no additional separators are added between concatenated strings.
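To see what that paste() call produces on its own, we can run it outside the plot (using the coefficients object created above):
# The label string by itself; for this model it evaluates to "y = 37.29 - 5.34 * x"
paste("y = ", round(coefficients[1, 1], 2), " - ",
      abs(round(coefficients[2, 1], 2)), " * x", sep = "")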
We can run an ANOVA on our data as well to check its significance:
summary(aov(mpg~wt, data=mtcars))
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.7 847.7 91.38 1.29e-10 ***
Residuals 30 278.3 9.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This again shows that there is statistical significance in the relationship between weight and MPG.
Lets try some other groups of data in the mtcars data set.
summary(manova(cbind(cyl, disp, qsec, am) ~ hp, data = mtcars))
Df Pillai approx F num Df den Df Pr(>F)
hp 1 0.79905 26.84 4 27 4.604e-09 ***
Residuals 30
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov((manova(cbind(cyl, disp, qsec, am) ~ hp, data = mtcars)))
Response cyl :
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 68.517 68.517 67.71 3.478e-09 ***
Residuals 30 30.358 1.012
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response disp :
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 297901 297901 50.128 7.143e-08 ***
Residuals 30 178284 5943
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response qsec :
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 49.651 49.651 30.19 5.766e-06 ***
Residuals 30 49.338 1.645
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response am :
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 0.4566 0.45655 1.886 0.1798
Residuals 30 7.2622 0.24207
The MANOVA output above shows that horsepower (hp) is significantly associated with the combination of number of cylinders (cyl), displacement (disp), 1/4 mile time (qsec) and transmission type (am), which are treated here as the response variables.
When we analyse the follow-up ANOVA output, we can see that, actually, horsepower has no statistically significant relationship with transmission type, but it does with all the other variables, and the associated values are shown in the output.
Week 10 - Logistic Regression
Definitions
Logistic regression is used to predict the class (or category) of individuals based on one or multiple predictor variables (x). It is used to model binary outcomes - those with only 2 possible values: 0 or 1, yes or no, diseased or non-diseased, male or female, etc. It gives the probability that an observation with predictor value x falls within the category. Think of how tarsus width gives the probability that a peregrine pullus is female.
Logistic regression is part of the generalized linear model family (GLM).
The logistic regression formula can be generalized as:
\[p = \frac{\exp(y)}{1 + \exp(y)}\]
where p is the probability of the event occurring, exp is the exponential function (how steep the gradient of our curve is) and \(y = b_0 + b_1 x\).
In words, this equation says that the probability of an event occurring (p) is a function of our predictor variable x through the linear predictor y. In situations where there is more than one predictor variable, \(y = b_0 + b_1 x_1 + b_2 x_2 + \dots\)
The b values are the regression beta coefficients. A positive b indicates that an increase in variable x is associated with an increase in y, with a negative b being the opposite.
The generalized formula is an exponential function, and through logarithmic transformation (don't worry about the algebra!) it becomes a linear combination of predictors:
\[\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x\]
Remember how logarithmically transforming data can produce a straight-line graph from data that originally plots as an exponential curve - it is the same logic.
The quantity log[p/(1-p)] is called the logarithm of the odds, also known as the log-odds or logit.
This odds function enables the calculation of the probability too:
\[p = \frac{\text{odds}}{1 + \text{odds}}\]
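A quick numeric check of that relationship, using made-up values purely for illustration:
p <- 0.75
odds <- p / (1 - p) # odds of 3 to 1
odds / (1 + odds)   # back to the probability: 0.75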
Actually doing some modelling!
As with all models, there are assumptions to check and some data wrangling to do to improve the accuracy of a model.
The following steps should be undertaken to help with accuracy:
Remove potential outliers
Make sure that the predictor variables are normally distributed. If not, you can use log, root, Box-Cox transformation.
Remove highly correlated predictors to minimize over fitting.
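On that last point, one simple way to look for strongly correlated predictors is a correlation matrix of the numeric columns. This is only a sketch, using the mlbench diabetes data that we load properly below (pima is just an illustrative working copy):
library(mlbench)
data(PimaIndiansDiabetes2)
pima <- na.omit(PimaIndiansDiabetes2)          # hypothetical working copy with NAs removed
round(cor(pima[sapply(pima, is.numeric)]), 2)  # look for pairs with |r| close to 1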
Enough Talk - lets try some stuff!
rm(list = ls()) #this clears the workspace
#| Label: library calls LR
library(tidyverse)
library(caret)
Warning: package 'caret' was built under R version 4.1.3
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
library(mlbench)
output=FALSE
It is important when modeling that you can both train and then test the model. We are going to use a pre-built dataset:
# Load the data and remove NAs
data(PimaIndiansDiabetes2)
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3) # this draws a random sample of 3 rows from the dataset
pregnant glucose pressure triceps insulin mass pedigree age diabetes
370 1 133 102 28 140 32.8 0.234 45 pos
131 4 173 70 14 168 29.7 0.361 33 pos
9 2 197 70 45 543 30.5 0.158 53 pos
# Split data into training and test set
set.seed(123) # set.seed calls a specific random number sequence. This is important as it ensures the results can be reproduced; unique random sampling on each run would not allow this
training.samples <- PimaIndiansDiabetes2$diabetes %>% # $diabetes is the outcome we are interested in
  createDataPartition(p = 0.8, list = FALSE) # this creates a data partition containing 80% of the data set
train.data <- PimaIndiansDiabetes2[training.samples, ] # this assigns the name train.data to the 80% partition created above
test.data <- PimaIndiansDiabetes2[-training.samples, ] # this assigns the name test.data to the data not included in the 80% partition
This is an example of the model code and output:
# Fit the Model
model <- glm(diabetes ~ ., data = train.data, family = binomial)
# Summarise the Model
summary(model)
Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5832 -0.6544 -0.3292 0.6248 2.5968
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053e+01 1.440e+00 -7.317 2.54e-13 ***
pregnant 1.005e-01 6.127e-02 1.640 0.10092
glucose 3.710e-02 6.486e-03 5.719 1.07e-08 ***
pressure -3.876e-04 1.383e-02 -0.028 0.97764
triceps 1.418e-02 1.998e-02 0.710 0.47800
insulin 5.940e-04 1.508e-03 0.394 0.69371
mass 7.997e-02 3.180e-02 2.515 0.01190 *
pedigree 1.329e+00 4.823e-01 2.756 0.00585 **
age 2.718e-02 2.020e-02 1.346 0.17840
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 267.18 on 305 degrees of freedom
AIC: 285.18
Number of Fisher Scoring iterations: 5
#This took a few times to get to work, remember to proof and check for erroneous capitalization of diabetes!
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
[1] 0.7564103
We can see from the output that there are some statistically significant predictors, but we will return to the interpretation of the output after some other modeling practices.
A slower walk through of logistic regression modeling
A simple logistic regression is used to predict the probability of class membership based on a single predictor variable.
Lets build a model to predict the probability of being diabetes-positive (the class) based on plasma glucose concentration (our predictor variable):
model <- glm(diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef # $coef extracts the coefficients and the associated statistics from the summary
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.15882009 0.700096646 -8.797100 1.403974e-18
glucose 0.04327234 0.005341133 8.101716 5.418949e-16
This output shows us the estimates of the beta coefficients and their associated p-values. We can put this information into the equations we explained earlier.
Reminder
Our base equation is \(p = \exp(y)/[1 + \exp(y)]\), where \(y = b_0 + b_1 x\).
So, from the output above, our equation is \(p = \exp(-6.15 + 0.043 \times \text{glucose}) / [1 + \exp(-6.15 + 0.043 \times \text{glucose})]\).
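As a check on that equation, we can compute a predicted probability by hand from the stored coefficients; this is just a sketch using a hypothetical glucose value of 150:
b <- coef(model)                          # intercept and glucose coefficient
y <- b[1] + b[2] * 150                    # linear predictor for glucose = 150
exp(y) / (1 + exp(y))                     # manual probability
predict(model, data.frame(glucose = 150),
        type = "response")                # should give the same value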
We can now use our model to predict the probability of an individual having diabetes given a glucose plasma value.
newdata <- data.frame(glucose = c(20, 180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.class <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.class
1 2
"neg" "pos"
This output is telling us that for a glucose data value of 20 our model predicts a negative diabetes result, whilst a glucose value of 180 should be expected to return a positive diabetes result.
We can plot this and it produces an s-shaped curve:
train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>% # change the pos/neg data into a 1/0 result
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) + # plot the points at 20% opacity so overlapping points are visible
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + # method.args passes additional arguments to geom_smooth; here we tell R that the glm is binomial
  labs(
    title = "Logistic Regression Model",
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabetes-pos"
  )
`geom_smooth()` using formula = 'y ~ x'
Multiple Logistic Regression
The multiple logistic regression is used to predict the probability of class based on multiple predictor variables:
model <- glm(diabetes ~ glucose + mass + pregnant,
             data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.32369818 1.125997285 -8.280391 1.227711e-16
glucose 0.03886154 0.005404219 7.190962 6.433636e-13
mass 0.09458458 0.023529905 4.019760 5.825738e-05
pregnant 0.14466661 0.045125729 3.205857 1.346611e-03
In this example, we have asked R to consider and evaluate the impact of glucose, mass and pregnant data on the probability that someone will be in the positive class for diabetes.
We can expand this to include all data as follows:
model <- glm(diabetes ~ .,
             data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant 1.005031e-01 0.061266974 1.64041157 1.009196e-01
glucose 3.709621e-02 0.006486093 5.71934633 1.069346e-08
pressure -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps 1.417771e-02 0.019981885 0.70952823 4.779967e-01
insulin 5.939876e-04 0.001508231 0.39383055 6.937061e-01
mass 7.997447e-02 0.031798907 2.51500698 1.190300e-02
pedigree 1.329149e+00 0.482291020 2.75590704 5.852963e-03
age 2.718224e-02 0.020199295 1.34570257 1.783985e-01
I have paused at this point as despite the code chunks being the same, they are outputting different results.
See this check, where I have copied the code from both locations
identical(model.frame(glm(diabetes ~ ., data = train.data, family = binomial)),
model.frame(glm(diabetes ~., data = train.data, family = binomial)))
[1] TRUE
Lets try clearing our environments and starting again!
rm(list = ls())
library(dplyr)
library(caret)
library(mlbench)
data(PimaIndiansDiabetes2)
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
Now lets run the two varieties of glm and see if we get different results:
model1 <- glm(diabetes ~ glucose + mass + pregnant, data = train.data, family = binomial)
model2 <- glm(diabetes ~ ., data = train.data, family = binomial)
summary(model1)
Call:
glm(formula = diabetes ~ glucose + mass + pregnant, family = binomial,
data = train.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2066 -0.6448 -0.3576 0.6523 2.4407
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.323698 1.125997 -8.280 < 2e-16 ***
glucose 0.038862 0.005404 7.191 6.43e-13 ***
mass 0.094585 0.023530 4.020 5.83e-05 ***
pregnant 0.144667 0.045126 3.206 0.00135 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 279.88 on 310 degrees of freedom
AIC: 287.88
Number of Fisher Scoring iterations: 5
summary(model2)
Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5832 -0.6544 -0.3292 0.6248 2.5968
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053e+01 1.440e+00 -7.317 2.54e-13 ***
pregnant 1.005e-01 6.127e-02 1.640 0.10092
glucose 3.710e-02 6.486e-03 5.719 1.07e-08 ***
pressure -3.876e-04 1.383e-02 -0.028 0.97764
triceps 1.418e-02 1.998e-02 0.710 0.47800
insulin 5.940e-04 1.508e-03 0.394 0.69371
mass 7.997e-02 3.180e-02 2.515 0.01190 *
pedigree 1.329e+00 4.823e-01 2.756 0.00585 **
age 2.718e-02 2.020e-02 1.346 0.17840
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 267.18 on 305 degrees of freedom
AIC: 285.18
Number of Fisher Scoring iterations: 5
Running the clearing command both here and before the quick-start code above seems to have removed the differences between the quick-start output and the later code where all predictors were included in the model.
Further, it is interesting to note that when all predictors are used, compared to the model that just included glucose + mass + pregnant, the pregnant variable changes from a significant predictor to a non-significant one.
This could be an example of overfitting, but there are other reasons why this might be the case:
- Shared Variance
When you add more predictors to the model, some predictors may share variance with others. This means that the new predictors might explain some of the same variation in the response variable that was previously attributed to existing predictors. As a result, the statistical significance of existing predictors may decrease because their unique contribution to explaining the response variable is reduced.
- Multicollinearity
Adding predictors that are highly correlated with each other can lead to multicollinearity, which inflates the standard errors of the estimated coefficients. Higher standard errors lead to lower t-values (or z-values in the case of a GLM), which in turn can make predictors appear less significant. (A quick check for this is sketched after this list.)
- Model Complexity
A more complex model with many predictors can dilute the effect of individual predictors. When the model includes more predictors, each predictor has to account for less unique variance, potentially reducing their individual significance.
- Overfitting
Including too many predictors, especially those that do not contribute much to explaining the variability in the response variable, can lead to overfitting. Overfitting decreases the model’s ability to generalize to new data and can affect the significance of the predictors.
While adding more predictors can improve the explanatory power of a model, it’s important to carefully consider which predictors to include. Using techniques like stepwise regression, regularization (e.g., Lasso or Ridge regression), or principal component analysis can help manage the trade-offs between model complexity and predictor significance.
We are not covering these model types here, but should return to learn more about them in the future
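Returning to the multicollinearity point above, one common check is the variance inflation factor. This is only a sketch, assuming the car package is installed (vif() comes from car, not base R):
library(car)
vif(model2) # values well above ~5 would suggest problematic collinearity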
Interpreting the results
From the coefficient output, we can see that glucose, mass and pedigree are significantly associated with the outcome. Note, this is different from the webpage used to build this model, but is consistent with my findings.
All our coefficients are positive values, meaning an increase in these variables is associated with an increased probability of being diabetes-positive. If the coefficients were negative, this would be reversed.
Another important part of analysing the output is the odds ratio. The odds ratio tells us the impact size of the predictor on the outcome.
As an example, our coefficient for glucose in the above is 0.0371. This indicates that a one-unit increase in glucose concentration multiplies the odds of being diabetes-positive by about 1.038 (the value of e^0.0371).
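We can get all of the odds ratios at once by exponentiating the coefficients (a quick sketch using the full model, model2, fitted above):
# Odds ratios for the full model
round(exp(coef(model2)), 3) # the glucose entry is about 1.038, i.e. exp(0.0371)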
Now we know which predictors are significant to the model, we can make a new model based only on these.
model <- glm(diabetes ~ glucose + mass + pedigree, data = train.data, family = binomial)
summary(model)
Call:
glm(formula = diabetes ~ glucose + mass + pedigree, family = binomial,
data = train.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5262 -0.6984 -0.3927 0.6445 2.5943
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.524661 1.159907 -8.212 < 2e-16 ***
glucose 0.042445 0.005565 7.628 2.39e-14 ***
mass 0.082817 0.022990 3.602 0.000315 ***
pedigree 1.276572 0.463373 2.755 0.005870 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 282.64 on 310 degrees of freedom
AIC: 290.64
Number of Fisher Scoring iterations: 5
Now we have our model, we can test it using the testing data previously partitioned. This will enable us to evaluate the performance of our model:
probabilities <- model %>%
  predict(test.data, type = "response")
head(probabilities)
19 21 32 55 64 71
0.2086204 0.4941976 0.7078530 0.6530271 0.3672377 0.1901986
This gives us the probabilities, but that is not that helpful in giving us the class. To give us the probable result of the diabetes test, given the values of the predictors in our data, we need to tell R what output to give based on the probabilities returned.
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
19 21 32 55 64 71
"neg" "neg" "pos" "pos" "neg" "neg"
Well done! We made it to predictions!
Assessing model accuracy
We can now predict based on our model, but how do we know if the model is accurate?
There are many ways models can be evaluated, as a start point we can ask for the proportion of correctly classified observations:
mean(predicted.classes==test.data$diabetes)
[1] 0.7692308
This output tells us that our model correctly classified ~77% of our test data. This can also be written as a misclassification error rate of 23%.
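A slightly more detailed check, sketched here with base R, is a confusion table of predicted against observed classes (caret's confusionMatrix() would give even more detail):
# Counts of correct and incorrect classifications by class
table(Predicted = predicted.classes, Observed = test.data$diabetes)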