library(tidyverse)
ARES40011 Workbook
Week 1
Formatting notes
Remember the cheat sheet under the help menu!
One * for italics, two for bold.
When comfortable/knowledgeable you can edit in either Visual (which has buttons to help with new things) or Source (more code-like) mode.
For example, I do not know how to insert tables or images in the source code. The Visual editor can help with this.
Running Code
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
The above code would usually display a table of data relating to the tidyverse package; however, as we have added an output: false instruction, it is not doing so here.
These code sections are called "chunks".
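For reference, a minimal sketch of a chunk using that option (this is an illustrative chunk, not the hidden one above):
#| output: false
library(tidyverse) # the code still runs, but output: false hides the printed startup messages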
From the pre-work for this week, I had successfully imported a data set from Dropbox using the following code. However, this is now returning the error "more columns than column names". What is the fix for this?
SOLVED I typed the web address incorrectly several times and on several attempts!
cyan <- read.table(file="https://www.dropbox.com/s/5q3hdyxcouw6ifa/cyanistes.txt?dl=1", header=TRUE, dec=".")
Add a plot
Code
ggplot(diamonds, aes(x=cut)) +
geom_bar()
You can refer to areas. This is a reference: Figure 1
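A minimal sketch of how such a cross-reference is usually wired up in Quarto (the label fig-diamond-cut is made up for illustration): give the plot chunk a label starting with fig- and a caption, then reference it in the text with @fig-diamond-cut, which renders as "Figure 1".
#| label: fig-diamond-cut
#| fig-cap: "Counts of diamonds by cut"
ggplot(diamonds, aes(x = cut)) +
geom_bar()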
Using LaTeX
We can put formulas inline like this: \(\alpha\)
We indent like this
\[\pi \in \mathbb{R}\] Google LaTeX for more options and guides.
Other display options
For different output types, you can change the format of the document at the top of the file (the YAML header).
Examples include:
PDF
Slides - use the revealjs format for this. It looks like MS Sway. Be cautious with this though, as some changes will be required to ensure code sections display correctly; from the example video, adding echo: true to the code chunks sorts this. There are lots of options with the slides.
watch more tutorials on this
You can also add additional options to the file types. Under format in the YAML header, you can select themes. Web search for more Quarto themes.
To add a table of contents, you can add a toc option within the HTML format area. Further customization is available - use the Ctrl+Space options to explore.
You can also have multiple outputs at once, e.g. HTML and PDF.
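A hedged sketch of what that YAML header area can look like (cosmo is just one example theme name):
format:
  html:
    theme: cosmo # pick any available Quarto theme
    toc: true # adds a table of contents
  pdf: default # also render a PDF version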
Other useful stuff!
-use Ctrl + Space to see autofill possibilities within code or YAML sections
-you can set global options, as in the defaults across the whole document, in the YAML header area too.
-you can add additional code in other languages by changing the info between the {} in the chunks.
Using python example
This is where I got stuck: error - "python binding not loaded". Code removed as I was unable to render/publish.
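A possible fix to try (an assumption on my part, since the exact error text was not kept): knitr/Quarto run Python chunks through the reticulate package, so installing it and checking which Python it finds may resolve the binding error.
install.packages("reticulate") # R-to-Python bridge used for {python} chunks (only needed once)
reticulate::py_config() # shows which Python installation will be used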
Week 2 - During session notes
-Added callout boxes to highlight areas in week 1's notes.
library(palmerpenguins)
library(tidyverse) # added again as was not running
Tasks for this week:
Play around with penguin data
Insert a picture into the workbook - done
Playing with Penguin data
penguins %>%
glimpse()
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
$ sex <fct> male, female, female, NA, female, male, female, male~
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
Week 2 - Post Session
Playing with Blue Tit data
For this data, re-run the command to get the blue tit data as detailed in the chunk above. This sets the dataframe to cyan.
The first step in data analysis is data exploration. This helps identify any issues in the dataset and ensures any analysis is valid.
The steps suggested for data exploration are to identify:
Outliers in response and independent variables
Normality and homogeneity of the response variable
An excess of zeros in the response variable
Multicollinearity among independent variables
Relationships among response and independent variables
Independence of the response variable
These look familiar to me. Unsurprising as many of these conditions are assumptions of analyses that we may want to run!
Once we have the data, we can inspect the data using the structure function
str(cyan)
'data.frame': 438 obs. of 7 variables:
$ id : chr "B1" "B10" "B111" "B112" ...
$ zone : chr "B" "B" "B" "B" ...
$ year : int 2002 2001 2002 2002 2001 2001 2002 2002 2002 2002 ...
$ multi : chr "yes" "yes" "no" "no" ...
$ height: num 2.7 2 2 1.9 2.1 2.5 1.9 2.2 2.4 2.3 ...
$ day : int 8 25 10 6 28 23 10 10 11 7 ...
$ depth : num 0.33 0.25 0.2 0.25 0.33 0.4 0.2 0.2 0.2 0.33 ...
This is telling us a few things that we should note and would link to study design.
id is the unique identifier of the nest box in the study area
the variable zone gives the plot within the study area
year is the year the data was collected
multi is a binary variable stating whether a nest box was used more than once ("yes"/"no")
height is the height at which the box is attached to the tree
day is the number of days after 1st of April that the first egg was laid in the nest
depth is a measure of nest size (as a fraction of nest box filled) and may indicate body condition and resources of the nest box occupants. In this example this is the main variable of interest
We could run a similar analysis using the glimpse function used above.
glimpse(cyan)
Rows: 438
Columns: 7
$ id <chr> "B1", "B10", "B111", "B112", "B115", "B116", "B118", "B121", "B~
$ zone <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B"~
$ year <int> 2002, 2001, 2002, 2002, 2001, 2001, 2002, 2002, 2002, 2002, 200~
$ multi <chr> "yes", "yes", "no", "no", "no", "no", "no", "yes", "no", "yes",~
$ height <dbl> 2.7, 2.0, 2.0, 1.9, 2.1, 2.5, 1.9, 2.2, 2.4, 2.3, 2.6, 1.5, 2.7~
$ day <int> 8, 25, 10, 6, 28, 23, 10, 10, 11, 7, 18, 19, 11, 32, 7, 27, 4, ~
$ depth <dbl> 0.33, 0.25, 0.20, 0.25, 0.33, 0.40, 0.20, 0.20, 0.20, 0.33, 0.5~
Note that the output also tells you the data type
-chr is a categorical data type defined by characters
-int shows the variable is an integer (whole number).
-num is continuous numeric data. Note that in the glimpse output num is replaced by dbl. This is not an important difference as far as we are concerned.
Remember
Data types can be categorical or continuous
There may be missing data in the set, which can cause difficulties in analysis. R treats missing data as NA. To find them we can run the following:
colSums(is.na(cyan)) #what this is saying is, return the sum of the number of times NA appears in the dataset cyan
id zone year multi height day depth
0 0 0 0 0 0 0
There are no missing data in this dataset.
Checking for outliers
One of the best ways to inspect data is visually. For outliers we can do this using a boxplot.
boxplot(depth ~ zone, ylab = 'Nest Depth', xlab = 'Woodland Zone', data = cyan, col = 'lightblue',outpch = 16, las=1)
# the boxplot parameters are as follows:
#depth ~ zone tells R to split the values of the depth by the category zone
#xlab and ylab are the axis labels
#data is which dataset to use for the plot
#col is the colour of the boxes
#outpch is a graphical parameter that sets the plotting symbol used for outliers (16 = filled circle)
#las = 1 tells R to keep the axis labels horizontal
We can see from this plot that there are 4 data points that sit outside of the expected range. This plot also shows us that there are differences in nest depth between the different areas of the woodland. Visual inspection therefore has lots of advantages.
Normality
To test for normality we can use a frequency polygon.
p <- ggplot()
p <- p + xlab("Nest Depth")
p <- p + ylab("Frequency")
p <- theme(text=element_text(size=15)) # note: this and the next two lines overwrite p with a theme object rather than adding to the plot - this is the source of the error described below
p <- theme(panel.background = element_blank())
p <- theme(panel.border = element_rect(fill=NA, colour="black", size=1))
p <- p + theme(strip.background = element_rect(fill="white", colour="white", size=1))
p <- p + theme(text=element_text(15))
p <- p + theme(legend.position = 'none')
p <- p + geom_freqpoly(data=cyan, aes(depth), bins=7)
p
I could not get this to work with the book's code or from a variety of online searches! I wanted to plot a histogram or frequency polygon graphic.
This was the error:
Error in `+.gg`: ! Can't add `geom_freqpoly(data = cyan, aes(depth), bins = 7)` to a theme object.
Backtrace: 1. ggplot2:::`+.gg`(p, geom_freqpoly(data = cyan, aes(depth), bins = 7))
Solution: Following week 4's work, I was able to create a histogram for the depth data using the geom_histogram plot type - see the code below.
ggplot(cyan, aes(depth))+
geom_histogram(bins=7)
and the Frequency Polygon
ggplot(cyan, aes(x=depth))+
geom_freqpoly(bins=7)
#This is suitable for one quantitative variable
Another way to test for normality is the Shapiro-Wilk normality test. For this test, the null hypothesis is that the data is normally distributed - if it is not, we will see a p-value < 0.05.
shapiro.test(cyan$depth)
Shapiro-Wilk normality test
data: cyan$depth
W = 0.89872, p-value < 2.2e-16
This test shows the data is not distributed normally.
Next step in data exploration is to check the homogeneity of the data. This is another assumption of many statistical tests. This can be done visually (covariate boxplot) or with tests such as Levene’s Test.
library(lawstat)
Warning: package 'lawstat' was built under R version 4.1.3
#I had to add the lawstat package to run this test
levene.test(cyan$depth, cyan$zone)
Modified robust Brown-Forsythe Levene-type test based on the absolute
deviations from the median
data: cyan$depth
Test Statistic = 0.6051, p-value = 0.7738
Levene’s test does not assume normality and so is suitable for this dataset. As the p-value is >0.05, we can say that the variances do not deviate significantly from homogeneity.
An excess of zeros in a response variable is called zero inflation. If this is a problem there are lots of ways to deal with it; however, let's test first!
sum(cyan$depth==0, na.rm=TRUE)*100/nrow(cyan)
[1] 0
The result is 0, so we can move on to the next exploration step.
Multicollinearity among covariates
<- c("zone", "year", "height", "day")
Coll
<- function(x, y, digits=1, prefix="", cex.cor = 6)
panel.cor <- par("usr"); on.exit(par(usr))
{usr par(usr = c(0, 1, 0, 1))
=cor(x,y,use="pairwise.complete.obs")
r1<- abs(cor(x, y,use="pairwise.complete.obs"))
r <- format(c(r1, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
txt if(missing(cex.cor)) { cex <- 0.9/strwidth(txt) } else {
= cex.cor}
cex text(0.5, 0.5, txt, cex = cex * r)}
pairs(cyan[, Coll], lower.panel = panel.cor)
This code is copied directly from the zip files supplied via the Statistics in R for Biodiversity Conservation Dropbox, but it would not run and I was unsure why - possibly because zone is a character column, which cor() cannot handle. I tried to google some other ways of doing it but was unsuccessful.
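One possible workaround (an assumption, since the original error message was not recorded): if the problem is the character column zone, restricting the pairs plot to the numeric covariates may allow the panel function above to run.
Coll.num <- c("year", "height", "day") # numeric covariates only
pairs(cyan[, Coll.num], lower.panel = panel.cor)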
Relationships among dependent and independent variables
Visually inspecting plotted data is useful to check these relationships as well
#| label: multipanel scatter plot of depth and nestbox height across zones
ggplot(cyan, aes(x = height, y = depth)) +
geom_point() +
geom_smooth(method = 'lm', colour = 'red', se = FALSE) +
theme(panel.background = element_blank()) +
theme(panel.border = element_rect(fill = NA, size = 1)) +
theme(strip.background = element_rect(fill = "white",
color = "white", size = 1)) +
theme(text = element_text(size=16)) +
facet_wrap(~zone)
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
i Please use the `linewidth` argument instead.
`geom_smooth()` using formula = 'y ~ x'
The best fit lines fitted to the above data show that the panels for zones CP and O might behave differently from the others, which suggests a possible interaction and that further investigation is needed.
Independence of response variable
In this dataset, there are relationships between responses - nests are in the same boxes in different years. For the purposes of this example, no need to worry.
Week 3 Pre work
Week 3 Session
No notes made during session
Week 3 Post Session
First let’s ensure we have called the necessary packages
library(tidyverse)
6.1.1.01
Task Mutations of the midwest dataset
midwest.new <- # defines new object name
midwest %>% # calls the existing dataset and pipes it to the next commands
mutate(avg.pop.den=mean(popdensity), # this adds the column for task a
avg.area=mean(area), # this is for b
totadult=sum(popadults), # the totals for all these are the same throughout the data as they are averages or totals from the entire dataset
tot.minus.white=(totadult-popwhite), # this is now different per line as popwhite varies across the rows of the data. You could achieve the same result by adding together the other pop* data fields but this is more efficient
child.to.adult=(percchildbelowpovert/percadultpoverty), # as with the tot.minus.white data, these vary by row due to differences between the rows of data
ratio.adult=(popadults/poptotal), # dividing the 2 fields to get the ratio
perc.adult=(ratio.adult*100)) # multiplying by 100 to convert the ratio to a percentage
midwest.new # this then displays the newly amended table
# A tibble: 437 x 35
PID county state area poptotal popdensity popwhite popblack popamerindian
<int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
2 562 ALEXAN~ IL 0.014 10626 759 7054 3496 19
3 563 BOND IL 0.022 14991 681. 14477 429 35
4 564 BOONE IL 0.017 30806 1812. 29344 127 46
5 565 BROWN IL 0.018 5836 324. 5264 547 14
6 566 BUREAU IL 0.05 35688 714. 35157 50 65
7 567 CALHOUN IL 0.017 5322 313. 5298 1 8
8 568 CARROLL IL 0.027 16805 622. 16519 111 30
9 569 CASS IL 0.024 13437 560. 13384 16 8
10 570 CHAMPA~ IL 0.058 173025 2983. 146506 16559 331
# i 427 more rows
# i 26 more variables: popasian <int>, popother <int>, percwhite <dbl>,
# percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
# popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
# poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
# percchildbelowpovert <dbl>, percadultpoverty <dbl>,
# percelderlypoverty <dbl>, inmetro <int>, category <chr>, ...
#note that combining all the mutations in a single multi-command chunk is neater than having several chunks running one after the other.
Task Mutations of the presidential dataset
presidential.new <- # defines new object name
presidential %>% # utilizes the presidential dataset
mutate(duration=(end-start)) # duration is the end date minus the start date
presidential.new # to display the new data
# A tibble: 12 x 5
name start end party duration
<chr> <date> <date> <chr> <drtn>
1 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
2 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days
3 Johnson 1963-11-22 1969-01-20 Democratic 1886 days
4 Nixon 1969-01-20 1974-08-09 Republican 2027 days
5 Ford 1974-08-09 1977-01-20 Republican 895 days
6 Carter 1977-01-20 1981-01-20 Democratic 1461 days
7 Reagan 1981-01-20 1989-01-20 Republican 2922 days
8 Bush 1989-01-20 1993-01-20 Republican 1461 days
9 Clinton 1993-01-20 2001-01-20 Democratic 2922 days
10 Bush 2001-01-20 2009-01-20 Republican 2922 days
11 Obama 2009-01-20 2017-01-20 Democratic 2922 days
12 Trump 2017-01-20 2021-01-20 Republican 1461 days
Task Mutations with the economic dataset
econmic.new <- # defines new object name
economics %>% # utilizes the economics dataset
mutate(perc.unemploy=((unemploy/pop)*100)) # combination of 2 functions - the ratio and then the multiplication. Note the extra brackets required to achieve this
econmic.new # to display the new data
# A tibble: 574 x 7
date pce pop psavert uempmed unemploy perc.unemploy
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1967-07-01 507. 198712 12.6 4.5 2944 1.48
2 1967-08-01 510. 198911 12.6 4.7 2945 1.48
3 1967-09-01 516. 199113 11.9 4.6 2958 1.49
4 1967-10-01 512. 199311 12.9 4.9 3143 1.58
5 1967-11-01 517. 199498 12.8 4.7 3066 1.54
6 1967-12-01 525. 199657 11.8 4.8 3018 1.51
7 1968-01-01 531. 199808 11.7 5.1 2878 1.44
8 1968-02-01 534. 199920 12.3 4.5 3001 1.50
9 1968-03-01 544. 200056 11.7 4.1 2877 1.44
10 1968-04-01 544 200208 12.3 4.6 2709 1.35
# i 564 more rows
Task Mutations with the txhousing dataset
txhousing.new <- # defines new object name
txhousing %>% # utilizes the txhousing dataset
mutate(successrate=((sales/listings)*100), # combination of 2 functions - the ratio and then the multiplication. Note the extra brackets required to achieve this
failrate=(100-successrate)) # fail rate is the difference between the whole (100) and the successrate previously calculated
txhousing.new # to display the new data
# A tibble: 8,602 x 11
city year month sales volume median listings inventory date successrate
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abilene 2000 1 72 5.38e6 71400 701 6.3 2000 10.3
2 Abilene 2000 2 98 6.51e6 58700 746 6.6 2000. 13.1
3 Abilene 2000 3 130 9.28e6 58100 784 6.8 2000. 16.6
4 Abilene 2000 4 98 9.73e6 68600 785 6.9 2000. 12.5
5 Abilene 2000 5 141 1.06e7 67300 794 6.8 2000. 17.8
6 Abilene 2000 6 156 1.39e7 66900 780 6.6 2000. 20
7 Abilene 2000 7 152 1.26e7 73500 742 6.2 2000. 20.5
8 Abilene 2000 8 131 1.07e7 75000 765 6.4 2001. 17.1
9 Abilene 2000 9 104 7.62e6 64500 771 6.5 2001. 13.5
10 Abilene 2000 10 101 7.04e6 59300 764 6.6 2001. 13.2
# i 8,592 more rows
# i 1 more variable: failrate <dbl>
Tasks from 6.6.1
Recreating code from the guide
# Example illustrating how to answer these problems:
diamonds.new <- # added name of new dataset
diamonds %>% # utilizes the diamonds dataset
group_by(color, clarity) %>% # groups data by color and clarity variables
mutate(price200 = mean(price)) %>% # creates new variable (average price by groups)
ungroup() %>% # data no longer grouped by color and clarity
mutate(random10 = 10 + price) %>% # new variable, original price + $10
select(cut, color, # retain only these listed columns
clarity, price,
price200, random10) %>%
arrange(color) %>% # visualize data ordered by color
group_by(cut) %>% # group data by cut
mutate(dis = n_distinct(price), # counts the number of unique price values per cut
rowID = row_number()) %>% # numbers each row consecutively for each cut
ungroup() # final ungrouping of data
diamonds.new # display the dataset
# A tibble: 53,940 x 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# i 53,930 more rows
The functions all make sense, but the ordering does not to me. The rowID column is confusing when the data is ordered as it is.
midwest %>% # utilizes the midwest dataset
group_by(state) %>% # groups the data by state
summarize(poptotalmean = mean(poptotal), # create new field and calculate the mean of the population
poptotalmed = median(poptotal), # create new field and calculate the median of the population
popmax = max(poptotal), # create new field and return the maximum of the population
popmin = min(poptotal), # create new field and return the minimum of the population
popdistinct = n_distinct(poptotal), # create new field and count the number of different poptotals
popfirst = first(poptotal), # create new field and return the first value of total population
popany = any(poptotal < 5000), # are there any items where the total population is under 5000? (TRUE FALSE response)
popany2 = any(poptotal > 2000000)) %>% # are there any items where the total population is over 2000000(TRUE FALSE response)
ungroup()
# A tibble: 5 x 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# i 1 more variable: popany2 <lgl>
midwest %>% # utilizes the midwest dataset
group_by(state) %>% # groups the data by state
summarize(num5k = sum(poptotal < 5000), # return the number of items where the poptotal is less than 5000
num2mil = sum(poptotal > 2000000), # return the number of items where the poptotal is more than 2000000
numrows = n()) %>% # number of rows (counties) in each state
ungroup()
# A tibble: 5 x 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
midwest %>% # utilizes the midwest dataset
group_by(county) %>% # group by county
summarize(x = n_distinct(state)) %>% # set field x to the number of different states in which that county name appears
arrange(desc(x)) %>% # arrange data with the highest counts first
ungroup()
# A tibble: 320 x 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# i 310 more rows
#| label: Problem C part 2
midwest %>%
group_by(county) %>%
summarize(x = n()) %>% # set field x to number of times that county appears total
ungroup()
# A tibble: 320 x 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# i 310 more rows
Questions
How does n() differ from n_distinct()?
n() is the number of times the county name appears in the dataset; n_distinct(state) is the number of different states in which that county name appears.
When would they be the same? different?
If all the counties only appear in each state once, they would be the same. However, if a county appears twice in one state it will only be counted once by n_distinct() but twice by n().
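A tiny made-up example (a hypothetical tibble, not the course data) showing where the two differ:
tibble(county = c("ADAMS", "ADAMS", "BROWN"),
state = c("IL", "IL", "WI")) %>%
group_by(county) %>%
summarize(n_rows = n(), # ADAMS = 2: counts every row
n_states = n_distinct(state)) # ADAMS = 1: counts distinct states only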
midwest %>%
group_by(county) %>%
summarize(x = n_distinct(county)) %>%
arrange(desc(x)) %>% # this line is added to solve part 2 of the questions below
ungroup()
# A tibble: 320 x 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# i 310 more rows
Questions
How many distinctly different counties are there for each county?
Each county is individually named and there are 320 rows, representing 320 different county names.
Can there be more than 1 (county) county in each county?
There cannot. By sorting the data in part 3i from largest to smallest, we would have identified any counts higher than 1.
What if we replace ‘county’ with ‘state’? See below
#| label: Problem C part 3ii
midwest %>%
group_by(state) %>%
summarize(x = n_distinct(county)) %>%
ungroup()
# A tibble: 5 x 2
state x
<chr> <int>
1 IL 102
2 IN 92
3 MI 83
4 OH 88
5 WI 72
# this changed code now returns the number of counties by state
#| label: Problem D
diamonds %>%
group_by(clarity) %>% # items are grouped in the category clarity
summarize(a = n_distinct(color), # no of unique colors by clarity
b = n_distinct(price),# no of unique prices by clarity
c = n()) %>% # total number of items
ungroup()
# A tibble: 8 x 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
#| label: Problem E part 1
diamonds %>%
group_by(color, cut) %>%
summarize(m = mean(price),
s = sd(price)) %>%
ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 x 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# i 25 more rows
#this has grouped the items in the dataset by color and cut, then returned the mean price of items in those categories along with the standard deviation
#| label: Problem E part 2
diamonds %>%
group_by(cut, color) %>%
summarize(m = mean(price),
s = sd(price)) %>%
ungroup()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 x 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# i 25 more rows
#this has grouped the items in the dataset by cut then colour, returning the mean price of items and standard deviation. It is the same summary as part 1, just in a different order
#| label: Problem E part 3
diamonds %>%
group_by(cut, color, clarity) %>% # grouping by three variables; after summarize() the result is still grouped by cut and color, because summarize() only drops the last grouping variable (clarity) - see the message below
summarize(m = mean(price),
s = sd(price),
msale = m * 0.80) %>% # multiples the mean price by 0.8 (20% off)
ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 x 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# i 266 more rows
# the msale data is the mean data with 20% off.
Questions
How good is the sale if the price of diamonds equaled sale?
Not bad! 20% off is nothing to be sniffed at!
#| label: Problem F
diamonds %>%
group_by(cut) %>%
summarize(potato = mean(depth), # data potato is the mean depth
pizza = mean(price), # data pizza is the mean price
popcorn = median(y), # data popcorn is the median of y
pineapple = potato - pizza, # gives the difference between the mean depth and the mean price
papaya = pineapple ^ 2, # the square of the difference in the means
peach = n()) %>% # the number of items
ungroup()
# A tibble: 5 x 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
# these labels are stupid
#| label: Problem G part I
diamonds %>%
group_by(color) %>% # groups items by color
summarize(m = mean(price)) %>% # gives the mean of the items
mutate(x1 = str_c("Diamond color ", color), # combines the two text fields diamond color and color to a new field "x1"
x2 = 5) %>% # creates new column "x2" with the value 5 for each row
ungroup()
# A tibble: 7 x 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
#| label: Problem G part ii
diamonds %>%
group_by(color) %>% # groups items by color
summarize(m = mean(price)) %>% # gives the mean of the items
ungroup() %>%
mutate(x1 = str_c("Diamond color ", color), # combines the two text fields diamond color and color to a new field "x1"
x2 = 5) # creates new column "x2" with the value 5 for each row
# A tibble: 7 x 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond color D 5
2 E 3077. Diamond color E 5
3 F 3725. Diamond color F 5
4 G 3999. Diamond color G 5
5 H 4487. Diamond color H 5
6 I 5092. Diamond color I 5
7 J 5324. Diamond color J 5
Questions
What does the first ungroup() do? Is it useful here? Why/why not?
In this example, it does nothing: after summarise() with a single grouping variable the result is already ungrouped, and the commands after ungroup() only add columns or reorder rather than performing grouped calculations.
No further ungroup() is required as nothing is grouped at this point.
#| label: Problem H part i
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>% # creates column x1 which gives the price divided by 2
summarize(m = mean(x1)) %>% # gives the mean of the values in x1, grouped by color per the initial group_by command
ungroup()
# A tibble: 7 x 2
color m
<ord> <dbl>
1 D 1585.
2 E 1538.
3 F 1862.
4 G 2000.
5 H 2243.
6 I 2546.
7 J 2662.
#| label: part ii
diamonds %>%
group_by(color) %>%
mutate(x1 = price * 0.5) %>%
ungroup() %>%
summarize(m = mean(x1)) # returns the mean of all data
# A tibble: 1 x 1
m
<dbl>
1 1966.
Questions
What’s the difference between part I and II?
In part I, the mean is calculated for each group of items; in part II it is the mean of the whole dataset. That is why you only have a single value in part II compared to part I.
Questions
Why is grouping data necessary?
So that operations can be applied to specific groups within the data rather than to the whole dataset at once.
Why is ungrouping data necessary?
To stop later operations unintentionally being applied group-by-group, which adds complexity and risks errors.
When should you ungroup data?
In the same code chunk, whilst learning R, probably after each function or group of functions
If the code does not contain group_by(), do you still need ungroup() at the end? For example, does data() %>% mutate(newVar = 1 + 2) require ungroup()?
No, no grouping has been created therefore there is no need to undo it
Week 3 Post session - Research Methods
Bad Question on the Diamonds Dataset
How does diamond price vary?
Good Question on the Diamonds Dataset
What effect does Diamond Color have on price?
Week 4 Pre-Work
No notes - remember the difference between matrix and tables though!
Week 4 Session
No notes
Week 4 Post Session - recreating visuals
library(tidyverse)
library(modeldata)
Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':
penguins
gg is grammar of graphics
you define the variables of interest first, before moving onto the plot type
#| label: Crickets plot
ggplot(crickets, aes(x=temp, y=rate)) + #this is calling the dataset and then telling R what to plot where, in this case temp on the x axis and rate on the y
geom_point() #this step specifies the type of plot to be drawn
#this is a basic plot. Once we have the basic plot, we can then work on the additional information/design we want. See chunk 38
# This visual is for 2 quantitative variables
ggplot(crickets, aes(x=temp, y=rate, color=species)) + #aes is aesthetics
geom_point()+
labs(x="Temperature",
y="Chirp Rate",
color="Species",
title="Cricket Chirps",
caption=" Source Mcdonald (2009)")+
scale_color_brewer(palette="Dark2") #scale as we are changing a scale feature, colour as we are changing the color and brewer is the package where dark2 is found
#labs adds labels and there are lots you can do, including the demonstration here
Let's try some more modifications!
#| label: Crickets Plots 3
ggplot(crickets, aes(x=temp, y=rate))+
geom_point(color="red", #this is where we can change general properties of the plot
size=2, #changes point size
alpha=.3, #changes point transparency
shape="square")+ #changers point shape
labs(x="Temperature",
y="Chirp Rate",
title="Cricket Chirps",
caption="Source: Mcdonald (2009)")
#You can learn more about the options for each geom by looking at the help files. You call them via ?geom_point, for example.
Adding additional layers to the plot
#| label: Cricket Plots - extra layers
ggplot(crickets, aes(x=temp, y=rate)) +
geom_point()+
geom_smooth(method="lm") + #Adding a regression line, as with other options there are lots of things you can do here, use ?geom_smooth to find out what they are
labs(x="Temperature",
y="Chirp Rate",
color="Species",
title="Cricket Chirps",
caption=" Source Mcdonald (2009)")+
scale_color_brewer(palette="Dark2")
`geom_smooth()` using formula = 'y ~ x'
Other plots
#| label: Histogram
ggplot(crickets, aes(x=rate))+
geom_histogram(bins=5)
#This is suitable for one quantitative variable
#| label: Frequency Polygon
ggplot(crickets, aes(x=rate))+
geom_freqpoly(bins=15)
#This is suitable for one quantitative variable
#| label: Bar Graph
ggplot(crickets, aes(x=species,
fill=species))+
geom_bar(show.legend=FALSE)+
labs(x="Species",
y= NULL)+
scale_fill_brewer(palette="Dark2")
#This is suitable for a single categorical variable.
#As with the other plots you can alter lots of options (labels, colors etc etc)
#| label: Box Plot
ggplot(crickets, aes(x=species, y=rate, color=species))+
geom_boxplot(show.legend=FALSE)+
scale_color_brewer(palette="Dark2")+
theme_minimal()
#This is for a categorical and quantitative variable
Planning
Remember when planning visuals, what type of variables have you got and what data question you have.
Faceting
#| label: Not Great
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15) +
scale_fill_brewer(palette = "Dark2")
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species) +
scale_fill_brewer(palette = "Dark2")
?facet_wrap will bring up the help menu for this command as per usual
#| label: Much Better
ggplot(crickets, aes(x = rate,
fill = species)) +
geom_histogram(bins = 15,
show.legend = FALSE) +
facet_wrap(~species,
ncol = 1) +
scale_fill_brewer(palette = "Dark2") +
theme_minimal()
Week 5 Pre-Session and During Session
No Notes
Week 5 Post Session
ggplot(iris, aes(x=Species, y=Sepal.Length, color=Species))+
geom_boxplot()+
theme_minimal()
This plot relates to mean tests, as you are looking for differences in the means of the data.
ggplot(iris, aes(x=Petal.Length))+
geom_density(aes(fill=Species))
#I got stuck recreating this one!!!!
#| label: Recreating the plots - Scatter Plot
ggplot(iris, aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species))+
geom_smooth(method="lm", se=FALSE)
`geom_smooth()` using formula = 'y ~ x'
#I got stuck recreating this one!!!!
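A possible tweak for both plots (an assumption about what the target figures looked like): moving colour into the global aes() makes geom_smooth() fit one line per species, and adding alpha makes the overlapping densities easier to read.
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) # one regression line per species
ggplot(iris, aes(x = Petal.Length)) +
geom_density(aes(fill = Species), alpha = 0.5) # semi-transparent overlapping densities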
#| label: Bar Chart
library(dplyr)
library(ggplot2)
# Create a new data frame with the mutated variable. Compared to the provided code, we have redefined the dataframe
iris <- mutate(iris, size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))
# Create the bar plot
ggplot(iris, aes(x = Species, fill = size)) + #the fill here splits the data by the variable size and colours it appropriately
geom_bar(position = "dodge") #the dodge creates the clustered, rather than stacked plot
Week 6 Frequency Tests
Z tests
One proportion z-test:
res1 <- prop.test(x = 95, n = 160, p = 0.5,
correct = FALSE)
# the syntax for the above: x is our observed count, n is our total count and p is the expected proportion. "correct" states whether we need to apply Yates' continuity correction.
# Printing the results
res1
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.01771
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5163169 0.6667870
sample estimates:
p
0.59375
What this is telling us is that the observed proportion (95 out of 160, about 0.59) is different from the expected proportion under the null hypothesis (0.5), with a p-value of 0.01771. This is a one-proportion test as we only have one observed proportion.
Two proportion z-test
res2 <- prop.test(x = c(490, 400), n = c(500, 500))
#Here we have two observation counts (490 and 400) and two groups (both of 500)
# Printing the results
res2
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
What this return is telling us is the value of Pearson’s chi-squared test statistic (80.909), the degrees of freedom (1) and the p-value (<2.2e-16). It also gives us a 95% confidence interval and an estimated probability of success for each group.
We can also extract individual components of this test:
# p-value
res2$p.value
[1] 2.363439e-19
# Estimated proportions
res2$estimate
prop 1 prop 2
0.98 0.80
# Confidence interval
res2$conf.int
[1] 0.1408536 0.2191464
attr(,"conf.level")
[1] 0.95
We can use this when we have 2 observed proportions
Chi-squared goodness of fit test
The chi-square goodness of fit test is used to compare the observed distribution to an expected distribution, where we have two or more categories as discrete data. In other words, it compares multiple observed proportions to expected probabilities.
tulip <- c(81, 50, 27) # this is our observed tulip counts
res3 <- chisq.test(tulip, p = c(1/3, 1/3, 1/3)) # the expected proportions under the null hypothesis that the distribution of all 3 types is equal
res3
Chi-squared test for given probabilities
data: tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07
This result is telling us that the tulip types are not equally distributed, with a p-value of:
res3$p.value
[1] 8.802693e-07
which is much less than 0.05.
We can ask what the expected counts would have been if the tulips had been distributed equally:
res3$expected
[1] 52.66667 52.66667 52.66667
We can modify this command further. If we know that the proportions are not equal, but we have data stating what the expected proportions are, we can use the same test:
tulip <- c(81, 50, 27)
res4 <- chisq.test(tulip, p = c(1/2, 1/3, 1/6))
res4
Chi-squared test for given probabilities
data: tulip
X-squared = 0.20253, df = 2, p-value = 0.9037
Here, our p-value is greater than 0.05 and so the observed data is not statistically significantly different from the expected proportions.
Chi-squared test of independence
The chi-square test of independence is used to analyze the frequency table (like a contingency table) formed by two categorical variables.
The chi-square test evaluates whether there is a significant association between the categories of the two variables.
First, let's grab some data:
<- "https://www.sthda.com/sthda/RDoc/data/housetasks.txt"
file_path <- read.delim(file_path, row.names = 1)
housetasks housetasks
Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
Shopping 33 23 9 55
Official 12 46 23 15
Driving 10 51 75 3
Finances 13 13 21 66
Insurance 8 1 53 77
Repairs 0 3 160 2
Holidays 0 1 6 153
We can visualize this data in 2 ways. A balloon plot:
library("gplots")
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
# 1. convert the data as a table
dt <- as.table(as.matrix(housetasks))
# 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Or as a mosaic plot:
library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
main = "housetasks")
It can be seen from both these visuals that, in this dataset, Laundry, Main_meal, Dinner and Breakfeast are mainly done by the wife (blue on the mosaic plot).
On to the chi-squared test!
Basics - Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.
Null hypothesis (H0): the row and the column variables of the contingency table are independent.
Alternative hypothesis (H1): row and column variables are dependent
The test:
chisq <- chisq.test(housetasks)
chisq
Pearson's Chi-squared test
data: housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
These test results show the variables are statistically significantly associated.
If we want to see the data we can extract it using either chisq$observed or chisq$expected to see what was observed or what would be expected (if there were no association) respectively.
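For example, a quick look at the expected counts (using the chisq object created above):
round(chisq$expected, 2) # counts expected in each cell if rows and columns were independent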
We can further investigate these results by looking at which cells contribute the most to the total chi-squared score. This is called looking at the Pearson or standardized residuals.
round(chisq$residuals, 3) #the 3 is defining the number of decimal places
Wife Alternating Husband Jointly
Laundry 12.266 -2.298 -5.878 -6.609
Main_meal 9.836 -0.484 -4.917 -6.084
Dinner 6.537 -1.192 -3.416 -3.299
Breakfeast 4.875 3.457 -2.818 -5.297
Tidying 1.702 -1.606 -4.969 3.585
Dishes -1.103 1.859 -4.163 3.486
Shopping -1.289 1.321 -3.362 3.376
Official -3.659 8.563 0.443 -2.459
Driving -5.469 6.836 8.100 -5.898
Finances -4.150 -0.852 -0.742 5.750
Insurance -5.758 -4.277 4.107 5.720
Repairs -7.534 -4.290 20.646 -6.651
Holidays -7.419 -4.620 -4.897 15.556
And we can visualise this too:
library(corrplot)
corrplot 0.95 loaded
corrplot(chisq$residuals, is.cor = FALSE)
The sign of the standardized residuals is also very important to interpret the association between rows and columns:
- Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables.
In the image above, it’s evident that there is an association between the column Wife and the rows Laundry and Main_meal. There is a strong positive association between the column Husband and the row Repairs.
-Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables.
For example, the column Wife is negatively associated (~ “not associated”) with the row Repairs. There is a repulsion between the column Husband and the rows Laundry and Main_meal.
Further practice using the MPG dataset
library(ggplot2)
tab1 <- mpg %>%
group_by(class, cyl) %>%
summarize(n=n())
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
tab1
# A tibble: 19 x 3
# Groups: class [7]
class cyl n
<chr> <int> <int>
1 2seater 8 5
2 compact 4 32
3 compact 5 2
4 compact 6 13
5 midsize 4 16
6 midsize 6 23
7 midsize 8 2
8 minivan 4 1
9 minivan 6 10
10 pickup 4 3
11 pickup 6 10
12 pickup 8 20
13 subcompact 4 21
14 subcompact 5 2
15 subcompact 6 7
16 subcompact 8 5
17 suv 4 8
18 suv 6 16
19 suv 8 38
Here, we have created a summary table that includes a limited part of the data from the mpg dataset.
To make this data into a crosstab or contingency table, we need to rearrange our variables. In this example, we are going to use class and cyl as our variables.
tab2 <- mpg %>%
group_by(class, cyl) %>%
summarize(n=n()) %>%
spread(cyl, n)
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
tab2
# A tibble: 7 x 5
# Groups: class [7]
class `4` `5` `6` `8`
<chr> <int> <int> <int> <int>
1 2seater NA NA NA 5
2 compact 32 2 13 NA
3 midsize 16 NA 23 2
4 minivan 1 NA 10 NA
5 pickup 3 NA 10 20
6 subcompact 21 2 7 5
7 suv 8 NA 16 38
This table is giving us the number of observations by class and cyl. Let's graph it:
balloonplot(t(tab2), main="class", xlab="", ylab="",
label = FALSE, show.margins = FALSE)
Ok, so that has not worked. Let's try again, using the matrix/table method.
tab3 <- as.table(as.matrix(tab2))
tab3
class 4 5 6 8
A 2seater 5
B compact 32 2 13
C midsize 16 23 2
D minivan 1 10
E pickup 3 10 20
F subcompact 21 2 7 5
G suv 8 16 38
balloonplot(t(tab3), main="class",xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Warning in balloonplot.default(x, y, z, xlab = xlab, ylab = ylab, zlab = zlab,
: NAs introduced by coercion
As a comparison, let's try this without the grouping:
tab4 <- as.table(as.matrix(tab1))
tab4
class cyl n
A 2seater 8 5
B compact 4 32
C compact 5 2
D compact 6 13
E midsize 4 16
F midsize 6 23
G midsize 8 2
H minivan 4 1
I minivan 6 10
J pickup 4 3
K pickup 6 10
L pickup 8 20
M subcompact 4 21
N subcompact 5 2
O subcompact 6 7
P subcompact 8 5
Q suv 4 8
R suv 6 16
S suv 8 38
balloonplot(t(tab4), main="class",xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Warning in balloonplot.default(x, y, z, xlab = xlab, ylab = ylab, zlab = zlab,
: NAs introduced by coercion
This shows us the importance of correct grouping prior to analysis; even without running a test, the pattern is much clearer in the grouped data.
Week 7 - Comparing Means
One Sample T-test
The formula for this test is:
t.test(x, mu = 0, alternative = "two.sided")
x is the data we are testing - a vector, for example my_data$weight, where my_data is the dataframe and weight is the column containing the data.
mu is the theoretical mean we are testing against (the default is 0)
alternative specifies the alternative hypothesis. This can be "two.sided" (the default), "greater" or "less". Use "greater" or "less" for a one-tailed test; "two.sided" tests for a difference in either direction.
set.seed(1234)
mouse <- data.frame(
name = paste0(rep("M_", 10), 1:10),
weight = round(rnorm(10, 20, 2), 1)
) # this has created a dummy dataset: simulated weights for 10 mice
# Print the first 10 rows of the data
head(mouse, 10)
name weight
1 M_1 17.6
2 M_2 20.6
3 M_3 22.2
4 M_4 15.3
5 M_5 20.9
6 M_6 21.0
7 M_7 18.9
8 M_8 18.9
9 M_9 18.9
10 M_10 18.2
Before we run any tests, let's check the summary:
summary(mouse$weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.30 18.38 18.90 19.25 20.82 22.20
Let’s visualize!
library(ggplot2)
ggplot(mouse, aes(y=weight)) +
geom_boxplot()
Before we can run any tests, we need to check the assumptions of the test. For one-sample t-tests these are:
Normal distribution
Suitable sample size
For this data set n < 30, so we cannot rely on sample size alone and need to check normality. To check for normality we can run a Shapiro-Wilk test and normality plots.
Normality Test
Shapiro-Wilk
For this test, the null hypothesis is that the data is normally distributed, so if the test returns a p-value < 0.05 we conclude that the data is not normally distributed.
Lets code it:
shapiro.test(mouse$weight)
Shapiro-Wilk normality test
data: mouse$weight
W = 0.9526, p-value = 0.6993
This result shows us that the data is normally distributed and so we can perform the t-test.
For this dataset, the expected mean average is 25g. From the boxplot above, we can set a hypothesis that the mean average of our dataset is less than the expected mean.
In code:
t.test(mouse$weight, mu=25, alternative="less")
One Sample t-test
data: mouse$weight
t = -9.0783, df = 9, p-value = 3.977e-06
alternative hypothesis: true mean is less than 25
95 percent confidence interval:
-Inf 20.41105
sample estimates:
mean of x
19.25
This result shows us that our data is statistically significantly different from the expected mean (p-value < 0.05) and so we can reject the null hypothesis in favour of our hypothesis that the mean is less than 25g.
We could also run this t-test as a two-tailed test - this is where our hypothesis is that the mean of our data is statistically significantly different from our expected mean, but we have not stated a direction. The corresponding null hypothesis is that the means are the same.
In code:
t.test(mouse$weight, mu=25)
One Sample t-test
data: mouse$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
17.8172 20.6828
sample estimates:
mean of x
19.25
As with the one-tailed test, this confirms our hypothesis that the mean of our data is not equal to the expected mean.
Non-normal one-sample t-test
If our data is not normally distributed we can do a non-parametric test. One of these is the One-sample Wilcoxon signed rank test
When the data are not normal, rather than testing against a mean, this test compares the median of the data to the expected median.
The syntax for this test is similar to the t-test:
wilcox.test(x, mu = 0, alternative = "two.sided")
x is the data we are testing - a vector, for example my_data$weight, where my_data is the dataframe and weight is the column containing the data.
mu is the theoretical value we are testing against (the default is 0)
alternative specifies the alternative hypothesis. This can be "two.sided" (the default), "greater" or "less". Use "greater" or "less" for a one-tailed test; "two.sided" tests for a difference in either direction.
We will run this test using the same data as before. Despite this data being normally distributed we can still run this test. Unlike the t-test, this test is comparing medians
Let’s run the test:
wilcox.test(mouse$weight, mu=25, alternative = "less")
Warning in wilcox.test.default(mouse$weight, mu = 25, alternative = "less"):
cannot compute exact p-value with ties
Wilcoxon signed rank test with continuity correction
data: mouse$weight
V = 0, p-value = 0.002897
alternative hypothesis: true location is less than 25
The two tailed equivalent is the same as it is for the t-test:
wilcox.test(mouse$weight, mu=25)
Warning in wilcox.test.default(mouse$weight, mu = 25): cannot compute exact
p-value with ties
Wilcoxon signed rank test with continuity correction
data: mouse$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
As the two-tailed test is the default for both of these tests, we do not need to set the alternative to "less" or "greater" unless we want a one-tailed test.
Unpaired 2 sample
These tests are used when comparing the means of two independent groups. There are 2 further assumptions related to this test:
data is normally distributed (Shapiro-Wilk test)
variances are equal (F-test)
Uneven variance test: Welch t-statistic
If the variance of the data sets is not even, you can do this test instead.
As with the one-sample test above, there are variations of the “tails” of the test. One-tailed is when we are testing whether the mean of one group is greater or less than the other; the two-tailed version tests whether the means differ in either direction.
# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
sex_weight <- data.frame(
group = rep(c("Woman", "Man"), each = 9),
weight = c(women_weight, men_weight)
)
sex_weight
group weight
1 Woman 38.9
2 Woman 61.2
3 Woman 73.3
4 Woman 21.8
5 Woman 63.4
6 Woman 64.6
7 Woman 48.4
8 Woman 48.8
9 Woman 48.5
10 Man 67.8
11 Man 60.0
12 Man 63.4
13 Man 76.0
14 Man 89.4
15 Man 73.3
16 Man 67.3
17 Man 61.3
18 Man 62.4
Now our data is in, let's follow best practice by inspecting it:
library(dplyr)
group_by(sex_weight, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE))
# A tibble: 2 x 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 Man 9 69.0 9.38
2 Woman 9 52.1 15.6
This summary suggests that there is a difference between our means. Let's graph it!
# Plot weight by group and color by group
library("ggplot2")
ggplot(sex_weight, aes(x=group, y=weight,
colour=group))+
geom_boxplot()
Now, let's check our assumptions before we run the t-test.
First, test the normality of distributions:
# Shapiro-Wilk normality test for Men's weights
with(sex_weight, shapiro.test(weight[group == "Man"]))# p = 0.1
Shapiro-Wilk normality test
data: weight[group == "Man"]
W = 0.86425, p-value = 0.1066
# Shapiro-Wilk normality test for Women's weights
with(sex_weight, shapiro.test(weight[group == "Woman"])) # p = 0.6
Shapiro-Wilk normality test
data: weight[group == "Woman"]
W = 0.94266, p-value = 0.6101
Remember this is testing against the null hypothesis that the data are normally distributed. With p-values above 0.05, we fail to reject this hypothesis and can say the data do not deviate significantly from normal.
Next, let's check the homogeneity of the variances using the F-test:
var.test(weight~group, data=sex_weight)
F test to compare two variances
data: weight by group
F = 0.36134, num df = 8, denom df = 8, p-value = 0.1714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.08150656 1.60191315
sample estimates:
ratio of variances
0.3613398
The alternative hypothesis stated in the output is that the true ratio of variances is not equal to 1. As the p-value is greater than 0.05, we fail to reject the null hypothesis of equal variances and can say that there is no significant difference between the variances.
Both the assumptions have been met, so now we can do the test
t.test(weight~group, data=sex_weight, var=TRUE)
Two Sample t-test
data: weight by group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
95 percent confidence interval:
4.029759 29.748019
sample estimates:
mean in group Man mean in group Woman
68.98889 52.10000
Understanding this output
t is the t-test statistic value (t = 2.784), df is the degrees of freedom (df = 16), and p-value is the significance level of the t-test (p-value = 0.01327). conf.int is the 95% confidence interval of the difference in means (conf.int = [4.0298, 29.748]); sample estimates are the mean values of each group (68.99 for men, 52.1 for women).
With this p-value, we can conclude that the men’s average weight is significantly different from the women’s average weight.
Wilcoxon test
This test is for similar data, but when the two independent groups data is not normally distributed. This would be if the Shapiro-Wilk test result was reversed from the above scenario. All other steps are the same so are unrepeated here.
wilcox.test(weight~group, data=sex_weight)
Warning in wilcox.test.default(x = c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, :
cannot compute exact p-value with ties
Wilcoxon rank sum test with continuity correction
data: weight by group
W = 66, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
The warning shown is because of the assumption of a Wilcoxon test that the responses are continuous. You can suppress this message by adding another argument exact = FALSE, but the result will be the same.
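As described, the same call with that extra argument (the result is unchanged, only the warning is suppressed):
wilcox.test(weight ~ group, data = sex_weight, exact = FALSE)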
The p-value of the test is 0.02712, which is less than the significance level alpha = 0.05. We can conclude that men’s median weight is significantly different from women’s median weight with a p-value = 0.02712.
This is the same result as we saw from the t-test, which is to be expected as it is the same data!
Paired sample t-tests
The paired samples t-test is used to compare the means between two related groups of samples. In this example, there are two values (i.e., pair of values) for the same samples.
In this example, 10 mice received a treatment (X) for 3 months. We want to know whether treatment X has an impact on the weight of the mice.
To answer this question, the weight of the 10 mice was measured before and after the treatment. This gives us 10 values before treatment and 10 values after treatment, from weighing the same mice twice.
In situations like this, a paired t-test can be used to compare the mean weights before and after treatment.
As this is a t-test, a normality assumption still applies (Shapiro-Wilk test), but for paired data it applies to the differences rather than to the raw values.
The difference for paired data is that the test is run on the difference between the paired values (i.e. the difference in weight of each mouse before and after treatment), and so the normality test is performed on this "difference" data, not the original observations.
This is an important point to remember! The paired t-test examines the mean of the within-individual differences (weight after treatment (Wa) - weight before treatment (Wb) = weight difference (Wd)). We are testing the mean (and distribution) of Wd.
Here is our data:
# Data in two numeric vectors
#
# Weight of the mice before treatment
before <- c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <- c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame (combine the vectors)
mouse_weight <- data.frame(
group = rep(c("before", "after"), each = 10),
weight = c(before, after)
)
mouse_weight
group weight
1 before 200.1
2 before 190.9
3 before 192.7
4 before 213.0
5 before 241.4
6 before 196.9
7 before 172.2
8 before 185.5
9 before 205.2
10 before 193.7
11 after 392.9
12 after 393.2
13 after 345.1
14 after 393.0
15 after 434.0
16 after 427.9
17 after 422.0
18 after 383.9
19 after 392.3
20 after 352.2
Let's examine this data:
group_by(mouse_weight, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)
# A tibble: 2 x 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 after 10 394. 29.4
2 before 10 199. 18.5
Note how for this and the above data, we have grouped the data!
graph time!
# Plot weight by group and color by group
library("ggplot2")
ggplot(mouse_weight, aes(x=group, y=weight,
colour=group))+
geom_boxplot()
This is showing the differences between the groups, but has lost the paired nature of the data. We need to add the PairedData package.
library(PairedData)
Warning: package 'PairedData' was built under R version 4.1.3
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Loading required package: gld
Warning: package 'gld' was built under R version 4.1.3
Loading required package: mvtnorm
Loading required package: lattice
Attaching package: 'PairedData'
The following object is masked from 'package:base':
summary
# Create data subsets: weight data before treatment
before <- subset(mouse_weight, group == "before", weight,
drop = TRUE)
# subset weight data after treatment
after <- subset(mouse_weight, group == "after", weight,
drop = TRUE)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()
We can now see the change for each individual mouse from before to after treatment. As before, we need to check our assumptions before we run the test.
1) Are the two samples paired? Yes - the data were collected by measuring the same mouse before and after treatment.
2) Is this a large sample? No - as this is a small sample (fewer than 30), we need to check for normality.
3) Is it normally distributed? Lets find out!
# calculate the differences between each pair of values (wd)
wd <- pd$after - pd$before
shapiro.test(wd)
Shapiro-Wilk normality test
data: wd
W = 0.94536, p-value = 0.6141
The p-value is greater than the significance level (0.05), so we cannot conclude that the differences deviate significantly from a normal distribution.
There are two methods for calculating the t-test, depending on how the data are arranged. Before we added our data to the data frame, we had the data stored as two vectors (before and after). To perform the t-test using the vectors we do the following:
t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-215.5581 -173.4219
sample estimates:
mean of the differences
-194.49
Alternatively, we can calculate the t-test from the data frame:
t.test(weight~group, data=mouse_weight, paired=TRUE)
Paired t-test
data: weight by group
t = 20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
173.4219 215.5581
sample estimates:
mean of the differences
194.49
As we are testing the same data, the same values are to be expected and are confirmed by the tests (the sign of t flips only because the groups are taken in the opposite order).
This test output has a p-value of less than 0.05, so we can reject the null hypothesis (that the before and after weights are not different) and accept our alternative hypothesis that the weights before and after treatment are significantly different.
Remember
We can modify these tests to one-tailed tests by defining what our alternative hypothesis is. We do this by adding alternative = "less" or alternative = "greater", depending on what our alternative hypothesis is.
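For example, a minimal sketch of a one-tailed version of the paired test above, assuming our alternative hypothesis is that weights are lower before treatment than after:
# One-tailed paired t-test: tests whether mean(before - after) is less than 0
t.test(before, after, paired = TRUE, alternative = "less")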
Wilcoxon Test for paired data
As with the unpaired data examples above, the non-parametric test for paired data is the Wilcoxon test.
Remember The difference between the t-test and the Wilcoxon test is that the t-test compares means and the Wilcoxon test compares medians. As a result, the Wilcoxon test is suitable when the data are not normally distributed.
We will be using the same data as above for this example and so the data summary and checking processes will be skipped as they are covered above.
As with the t-test, we can compute the Wilcoxon test either using vector data:
wilcox.test(before, after, paired=TRUE)
Wilcoxon signed rank exact test
data: before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
Or using the dataframe data:
wilcox.test(weight~group, data=mouse_weight, paired=TRUE)
Wilcoxon signed rank exact test
data: weight by group
V = 55, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
As before, the results are the same because we are testing the same data. We can also modify these to one-tailed tests (greater or less) using the same code as for the t-tests.
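A sketch of the one-tailed form of the Wilcoxon test, under the same assumed alternative hypothesis (before weights lower than after):
# One-tailed paired Wilcoxon test
wilcox.test(before, after, paired = TRUE, alternative = "less")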
One-Way ANOVA test
ANOVA is an extension of independent two-sample t-tests (covered above). Like the t-test, ANOVA is for comparing means, but unlike t-tests, ANOVA is used in situations where there are more than 2 groups.
To conduct ANOVA tests, data needs to be organised into several groups based on a single grouping variable (also called factor variable).
The Null hypothesis for ANOVA tests is that the means of the different groups are the same.
The alternate hypothesis is that at least one sample mean is not equal to the others.
Assumptions
- The observations are obtained independently and randomly from the population
- The data are normally distributed
- The normal populations have a common variance
For this example, we will be using a built in dataset
PlantGrowth
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl
7 5.17 ctrl
8 4.53 ctrl
9 5.33 ctrl
10 5.14 ctrl
11 4.81 trt1
12 4.17 trt1
13 4.41 trt1
14 3.59 trt1
15 5.87 trt1
16 3.83 trt1
17 6.03 trt1
18 4.89 trt1
19 4.32 trt1
20 4.69 trt1
21 6.31 trt2
22 5.12 trt2
23 5.54 trt2
24 5.50 trt2
25 5.37 trt2
26 5.29 trt2
27 4.92 trt2
28 6.15 trt2
29 5.80 trt2
30 5.26 trt2
We can see that the "group" column here refers to the treatment group. We can see all the groups by asking R to show us them using this command:
levels(PlantGrowth$group)
[1] "ctrl" "trt1" "trt2"
Lets look at the summary statistics of our data:
group_by(PlantGrowth, group) %>%
summarise(
count=n(),
mean=mean(weight,na.rm=TRUE),
sd=sd(weight,na.rm=TRUE)
)
# A tibble: 3 x 4
group count mean sd
<fct> <int> <dbl> <dbl>
1 ctrl 10 5.03 0.583
2 trt1 10 4.66 0.794
3 trt2 10 5.53 0.443
From this we can see the mean and standard deviation of our groups of treatments.
Lets graph it!
ggplot(PlantGrowth, aes(x=group, y=weight, color=group))+
geom_boxplot()
We can see, visually, that there is a difference between the means of the groups; ANOVA can tell us whether this difference is statistically significant.
aov(weight~group, data=PlantGrowth)
Call:
aov(formula = weight ~ group, data = PlantGrowth)
Terms:
group Residuals
Sum of Squares 3.76634 10.49209
Deg. of Freedom 2 27
Residual standard error: 0.6233746
Estimated effects may be unbalanced
And to summarise this output in an easier to view way:
summary(aov(weight~group, data=PlantGrowth))
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value is the ratio of the variance between the samples to the variance within the samples. A small F-value (i.e. around 1 or less) indicates that there is no significant difference among the group means, whilst a higher F-value implies that the variation among group means is significant.
The output labelled Pr(>F) is the equivalent of the p-value in the earlier tests. As the value in this example is under 0.05, we can conclude that there are significant differences between the groups.
This output tells us that there is a significant difference, but it does not tell us between which groups this difference is. To find this out we need to undertake a pairwise-comparison.
One way of doing this is to use the Tukey HSD function in R:
TukeyHSD(aov(weight~group, data=PlantGrowth))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = PlantGrowth)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
We can make this code neater by assigning the ANOVA test to a specific reference and then run the Tukey test:
PWAOV <- aov(weight ~ group, data = PlantGrowth)

TukeyHSD(PWAOV)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = PlantGrowth)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
The Tukey result shows that there is only a significant (p-value < 0.05) difference between trt2 and trt1.
There are other ways of conducting this pairwise comparison, but they are not explored here.
As mentioned above, ANOVA tests have some assumptions
Assumptions
- The observations are obtained independently and randomly from the population
- The data are normally distributed
- The normal populations have a common variance
The observation assumption is part of the experimental design.
The variance and distribution assumptions can be tested. Remember that ANOVA is comparing means, so these assumptions apply to the residuals (the variation around the group means), not to the raw data set.
ANOVA Variance check
Homogeneity of variance can be checked visually using a residuals-versus-fits plot:
plot(PWAOV, 1)
#the 1 refers to the type of plot to be made, do not worry too much about these numbers, just that 1 is needed to test this assumption when looking at ANOVA outputs
Data points labelled 17, 15 and 4 appear as outliers on this plot, which suggests there may be some deviation from homogeneity of variance. These data points could be removed, as outliers can affect the normality and variance assumptions and thus invalidate the ANOVA.
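If we did decide to exclude them, a hedged sketch of how that might look (row numbers 4, 15 and 17 are the labels from the plot; PlantGrowth_trim and PWAOV_trim are just illustrative names, and the original data is left untouched):
# Refit the ANOVA without rows 4, 15 and 17, then re-check the assumption plot
PlantGrowth_trim <- PlantGrowth[-c(4, 15, 17), ]
PWAOV_trim <- aov(weight ~ group, data = PlantGrowth_trim)
plot(PWAOV_trim, 1)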
We can test this further using Bartlett’s Test:
bartlett.test(weight ~ group, data = PlantGrowth)
Bartlett test of homogeneity of variances
data: weight by group
Bartlett's K-squared = 2.8786, df = 2, p-value = 0.2371
This returns a p-value of >0.05 so the variance does not deviate from homogeneous.
What if the common variance assumption is violated
This would mean that we cannot run the ANOVA test shown above.
One option is a Welch one-way test:
oneway.test(weight~group, data=PlantGrowth)
One-way analysis of means (not assuming equal variances)
data: weight and group
F = 5.181, num df = 2.000, denom df = 17.128, p-value = 0.01739
An alternative is a Pairwise t.test with no assumption of equal variance:
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method="BH", pool.sd=FALSE)
Pairwise comparisons using t tests with non-pooled SD
data: PlantGrowth$weight and PlantGrowth$group
ctrl trt1
trt1 0.250 -
trt2 0.072 0.028
P value adjustment method: BH
#The pool.sd = FALSE argument tells R not to treat the standard deviation as pooled - this removes the need for the assumption of homogeneous variance
ANOVA Normality check
As with the variance check, we can do this visually:
plot(PWAOV, 2)
#as above, do not worry about what 2 means too much, just that it is the one to use to test this assumption
And we can do it with a statistical test:
# Get the residual values from our test (remember the residuals are the differences
# around the mean, and as it is the means we are testing this is the dataset our
# assumptions need to meet)
PWaov_residuals <- residuals(object = PWAOV)
# Then run the test
shapiro.test(x = PWaov_residuals)
Shapiro-Wilk normality test
data: PWaov_residuals
W = 0.96607, p-value = 0.4379
As with the variance checks, this confirms that our data meets the assumptions of the ANOVA test.
Non-parametric alternative
If the assumptions are not met for a one-way ANOVA, we can use an alternative test - the Kruskal-Wallis rank sum test:
kruskal.test(weight~group, data=PlantGrowth)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
As with the parametric version, the test shows a statistically significant difference in the data groups.
Two-Way ANOVA test
Two-way ANOVA test is used to evaluate simultaneously the effect of two grouping variables (A and B) on a response variable.
Assumptions
The same assumptions apply as for the one-way ANOVA:
- The observations are obtained independently and randomly from the population
- The data are normally distributed
- The normal populations have a common variance
Further, in two-way ANOVA, the sample sizes within cells need to be equal, leading to a so-called balanced design. In this case the standard two-way ANOVA test can be applied.
When the sample sizes within each level of the independent variables are not the same (case of unbalanced designs), the ANOVA test should be handled differently.
Lets get some data!
ToothGrowth
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
11 16.5 VC 1.0
12 16.5 VC 1.0
13 15.2 VC 1.0
14 17.3 VC 1.0
15 22.5 VC 1.0
16 17.3 VC 1.0
17 13.6 VC 1.0
18 14.5 VC 1.0
19 18.8 VC 1.0
20 15.5 VC 1.0
21 23.6 VC 2.0
22 18.5 VC 2.0
23 33.9 VC 2.0
24 25.5 VC 2.0
25 26.4 VC 2.0
26 32.5 VC 2.0
27 26.7 VC 2.0
28 21.5 VC 2.0
29 23.3 VC 2.0
30 29.5 VC 2.0
31 15.2 OJ 0.5
32 21.5 OJ 0.5
33 17.6 OJ 0.5
34 9.7 OJ 0.5
35 14.5 OJ 0.5
36 10.0 OJ 0.5
37 8.2 OJ 0.5
38 9.4 OJ 0.5
39 16.5 OJ 0.5
40 9.7 OJ 0.5
41 19.7 OJ 1.0
42 23.3 OJ 1.0
43 23.6 OJ 1.0
44 26.4 OJ 1.0
45 20.0 OJ 1.0
46 25.2 OJ 1.0
47 25.8 OJ 1.0
48 21.2 OJ 1.0
49 14.5 OJ 1.0
50 27.3 OJ 1.0
51 25.5 OJ 2.0
52 26.4 OJ 2.0
53 22.4 OJ 2.0
54 24.5 OJ 2.0
55 24.8 OJ 2.0
56 30.9 OJ 2.0
57 26.4 OJ 2.0
58 27.3 OJ 2.0
59 29.4 OJ 2.0
60 23.0 OJ 2.0
The data shown above has identified one of our treatment types as a continuous data type (dbl). We need to tell R to treat this data as a factor before we continue:
ToothGrowth1 <- ToothGrowth # creating a new data frame ready for our manipulation

ToothGrowth1$dose <- factor(ToothGrowth1$dose,
                            levels = c(0.5, 1, 2),
                            labels = c("D0.5", "D1", "D2"))
# This command tells R to convert the dose column in our data set to a factor,
# defining what the factor levels are and what to call them (labels)
ToothGrowth1
len supp dose
1 4.2 VC D0.5
2 11.5 VC D0.5
3 7.3 VC D0.5
4 5.8 VC D0.5
5 6.4 VC D0.5
6 10.0 VC D0.5
7 11.2 VC D0.5
8 11.2 VC D0.5
9 5.2 VC D0.5
10 7.0 VC D0.5
11 16.5 VC D1
12 16.5 VC D1
13 15.2 VC D1
14 17.3 VC D1
15 22.5 VC D1
16 17.3 VC D1
17 13.6 VC D1
18 14.5 VC D1
19 18.8 VC D1
20 15.5 VC D1
21 23.6 VC D2
22 18.5 VC D2
23 33.9 VC D2
24 25.5 VC D2
25 26.4 VC D2
26 32.5 VC D2
27 26.7 VC D2
28 21.5 VC D2
29 23.3 VC D2
30 29.5 VC D2
31 15.2 OJ D0.5
32 21.5 OJ D0.5
33 17.6 OJ D0.5
34 9.7 OJ D0.5
35 14.5 OJ D0.5
36 10.0 OJ D0.5
37 8.2 OJ D0.5
38 9.4 OJ D0.5
39 16.5 OJ D0.5
40 9.7 OJ D0.5
41 19.7 OJ D1
42 23.3 OJ D1
43 23.6 OJ D1
44 26.4 OJ D1
45 20.0 OJ D1
46 25.2 OJ D1
47 25.8 OJ D1
48 21.2 OJ D1
49 14.5 OJ D1
50 27.3 OJ D1
51 25.5 OJ D2
52 26.4 OJ D2
53 22.4 OJ D2
54 24.5 OJ D2
55 24.8 OJ D2
56 30.9 OJ D2
57 26.4 OJ D2
58 27.3 OJ D2
59 29.4 OJ D2
60 23.0 OJ D2
The above table shows that dose is now a factor, not a dbl data type.
Lets check that our data meets the balanced design assumption by looking at the number of observations by category:
table(ToothGrowth1$supp, ToothGrowth1$dose)
D0.5 D1 D2
OJ 10 10 10
VC 10 10 10
This is a frequency table showing the factors supp and dose, with 10 subjects in each cell. This meets our assumption, as the sample sizes within cells are equal.
Lets start with visualising our data:
#We need to plot tooth length (len) by groups (dose) then plot the second group (supp) and show them in different colors
ggplot(ToothGrowth1, aes(x=dose, y=len, color=supp))+
geom_boxplot()
We can also do this as a line plot with multiple groups:
ggplot(ToothGrowth1, aes(x = factor(dose), y = len, color = supp)) + geom_point(position = position_jitter(width = 0.2), alpha = 0.6) + # Dotplot with jitter
stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2, color = "black") + # Mean and SE
stat_summary(fun = mean, geom = "point", size = 3, color = "black") + # Mean points
geom_line(stat = "summary", fun = mean, aes(group = supp)) + # Lines connecting the means
labs(x = "Dose", y = "Length", color = "Supplement") + theme_minimal()
The code for the two-way ANOVA test is very similar to that of the one-way ANOVA test; we just define the additional factor we want examined:
res.tl <- aov(len ~ supp + dose, data = ToothGrowth1) # the "+ dose" adds the second factor for analysis
summary(res.tl)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 14.02 0.000429 ***
dose 2 2426.4 1213.2 82.81 < 2e-16 ***
Residuals 56 820.4 14.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output tells us that both supp and dose have a statistically significant impact on our response variable (tooth length), with dose having the smaller p-value.
This test assumes independence of our treatments (an additive model), but does not consider whether the treatments interact. We need to modify our code to test for this:
res.tl1 <- aov(len ~ supp * dose, data = ToothGrowth1)
summary(res.tl1)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 15.572 0.000231 ***
dose 2 2426.4 1213.2 92.000 < 2e-16 ***
supp:dose 2 108.3 54.2 4.107 0.021860 *
Residuals 54 712.1 13.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This output is showing that alongside the statistically significant impact of the treatments dose and supp, there is also a statistically significant interaction between both of the treatments.
The full interpretation of this output is:
From the ANOVA results, you can conclude the following, based on the p-values and a significance level of 0.05:
- the p-value of supp is 0.000231 (significant), which indicates that the levels of supp are associated with significantly different tooth lengths.
- the p-value of dose is < 2e-16 (significant), which indicates that the levels of dose are associated with significantly different tooth lengths.
- the p-value for the supp:dose interaction is 0.022 (significant), which indicates that the relationship between dose and tooth length depends on the supp method.
The above results show that the group means differ significantly across our treatment groups.
The supp treatment only has 2 levels and we know its effect is significant; however, the dose treatment has multiple levels, so we need to know which pairwise differences are significant. We do this the same way we did for the one-way ANOVA test, with pairwise comparisons:
TukeyHSD(res.tl1, which = "dose")
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = len ~ supp * dose, data = ToothGrowth1)
$dose
diff lwr upr p adj
D1-D0.5 9.130 6.362488 11.897512 0.0e+00
D2-D0.5 15.495 12.727488 18.262512 0.0e+00
D2-D1 6.365 3.597488 9.132512 2.7e-06
This shows that all the pairwise comparisons are statistically significant.
You can also do this using a pairwise t test:
pairwise.t.test(ToothGrowth1$len, ToothGrowth1$dose, p.adjust.method = "BH")
Pairwise comparisons using t tests with pooled SD
data: ToothGrowth1$len and ToothGrowth1$dose
D0.5 D1
D1 1.0e-08 -
D2 4.4e-16 1.4e-05
P value adjustment method: BH
This again shows that for dose, all the pairwise comparisons are significant.
Does the data meet the assumptions?
As with the previous ANOVA tests, we need to ensure our data meets the test's assumptions for our results to be valid.
Homogeneity of variance
Graphically:
plot(res.tl1, 1)
Three points seem like outliers, so lets check with a test:
# Note: levene.test() is not in base R; judging by the output below it appears to come from the lawstat package (car::leveneTest() is a common alternative)
levene.test(ToothGrowth1$len, # This is specifying the data to be tested against
            group = interaction(ToothGrowth1$supp, ToothGrowth1$dose)) # this is stating which factors to include
Modified robust Brown-Forsythe Levene-type test based on the absolute
deviations from the median
data: ToothGrowth1$len
Test Statistic = 1.7086, p-value = 0.1484
#This test did not work as expected, so function modified following research. The result matches the expected, so modification accepted.
This result tells us that there is no evidence that the variances differ significantly across the groups. Our data therefore meets the homogeneity of variance assumption.
Normal distribution
Visual inspection:
plot(res.tl1, 2)
This looks pretty good, but lets test statistically to be sure:
shapiro.test(x=(residuals(object=res.tl1)))
Shapiro-Wilk normality test
data: (residuals(object = res.tl1))
W = 0.98499, p-value = 0.6694
#Remember, we are testing the data around the means - this is the residual data we are calling for this test.
The residuals are normally distributed, so our test results are valid.
What if our data is an unbalanced design?
Reminder A balanced design is where the sample sizes within cells are equal, unbalanced is when this is not the case.
In these situations, we need to use a type-III sum of squares method. I am not including that here.
MANOVA - Multivariate Analysis of Variance
MANOVA can be used in situations where there are multiple response variables; it tests them simultaneously using a multivariate analysis of variance.
For example, we may conduct an experiment where we give two treatments (A and B) to two groups of mice, and we are interested in the weight and height of mice. In that case, the weight and height of mice are two dependent variables, and our hypothesis is that both together are affected by the difference in treatment. A multivariate analysis of variance could be used to test this hypothesis.
As with other tests, there are assumptions:
Assumptions As with the two-way ANOVA above, the MANOVA assumptions include:
- The data in the groups are normally distributed.
- Homogeneity of variances across the range of predictors.
- Linearity between all pairs of dependent variables, all pairs of covariates, and all dependent variable-covariate pairs in each cell.
As with the ANOVA tests, it is necessary to further investigate the results if the MANOVA test shows statistical significance. We will see this in the worked example.
The below is a MANOVA test to find out if there are significant differences between the species' petal and sepal lengths in the built-in iris data set:
# Get data and assign vectors
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
# MANOVA Test
summary(manova(cbind(sepl, petl) ~ Species, data = iris))
Df Pillai approx F num Df den Df Pr(>F)
Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Here we have both asked for a summary of the test result and coded the test in a single step
res.man <- manova(cbind(sepl, petl) ~ Species, data = iris)
# This line assigns the name res.man to our MANOVA results, making it easier for the next step
This tells us that there is a significant difference between our groups. The different species do have statistically significantly different sepal and petal lengths, but the output does not tell us whether the difference lies in sepal length, petal length, or both.
To investigate which response variables differ, we can run ANOVA tests on the results to elucidate our findings:
summary.aov(res.man)
Response sepl :
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.212 31.606 119.26 < 2.2e-16 ***
Residuals 147 38.956 0.265
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response petl :
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 437.10 218.551 1180.2 < 2.2e-16 ***
Residuals 147 27.22 0.185
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This ANOVA result shows that both petal and sepal length are significantly different between the species. To fully validate the findings, we should run the checks we did above on the ANOVA to ensure our assumptions are met.
Kruskal-Wallis Test
This is a ranking test, providing a non-parametric alternative to one way ANOVA tests.
For this example, we are going to use the plant growth data used in the earlier ANOVA example, so I am skipping the visualisation steps as we have previously completed them.
On to the test:
kruskal.test(weight~group, data=PlantGrowth)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
As with the ANOVA tests, this tells us that there is a significant difference, but does not say where.
As our data is non-parametric, we cannot use the Tukey test; instead we can use a pairwise Wilcoxon test:
pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "BH")
Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute
exact p-value with ties
Pairwise comparisons using Wilcoxon rank sum test with continuity correction
data: PlantGrowth$weight and PlantGrowth$group
ctrl trt1
trt1 0.199 -
trt2 0.095 0.027
P value adjustment method: BH
The result shows that there is only a significant (p-value < 0.05) difference between trt2 and trt1.
Week 9 Linear Models
We are using the pre-existing mtcars data set for this
Task - What is the effect of the weight of the vehicle on its efficiency (miles per gallon)?
ggplot(mtcars, aes(x=wt, y=mpg))+ #calls the data
geom_point()+#defines the plot type
geom_smooth(method= "lm")#adds our regression line
`geom_smooth()` using formula = 'y ~ x'
This output shows how the model could fit, but does not give us the model (i.e. the equation).
There is another way to fit the model and extract the info we are after:
model <- lm(mpg ~ wt, data = mtcars) # this is defining and fitting our model

summary(model) # This gives us the detail from our fitted model
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
This output tells us that there is a strong, statistically significant negative relationship between weight and miles per gallon, along with the p-value of that relationship.
# Extract the coefficients from the model (the coefficients will inform us as to the
# details of the model, i.e. the mathematical relationship between our variables)
coefficients <- summary(model)$coefficients
# Create the scatterplot with regression line and equation
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue") +
annotate("text", x = 3.5, y = 30, label = paste("y = ", round(coefficients[1,1], 2), " - ", abs(round(coefficients[2,1], 2)), " * x", sep = ""), color = "black") +
labs(x = "Weight (1000 lbs)", y = "Miles Per Gallon", title = "Effect of Vehicle Weight on Efficiency") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Most of this code is not new; however, the annotate() function is, so lets break down its syntax.
annotate("text", ...) This tells ggplot2 that you want to add text annotation to the plot.
x = 3.5, y = 30 These parameters specify the coordinates on the plot where the annotation text should be placed. In this case, the text will be placed at the point (3.5, 30) on the plot.
label = paste("y = ", round(coefficients[1,1], 2), " - ", abs(round(coefficients[2,1], 2)), " * x", sep = "") This creates the text label to be displayed.
Inside the label parameter:
paste(...): Concatenates strings together
"y = ": The beginning of the text string
round(coefficients[1,1], 2): Extracts the intercept from the model coefficients (found in the coefficient table in cell 1,1) and rounds it to 2 decimal places
" - ": Adds a minus sign, as we can see the correlation is negative
abs(round(coefficients[2,1], 2)): Extracts the slope from the model coefficients (found in the coefficient table in cell 2,1), takes the absolute value, and rounds it to 2 decimal places.
" * x": Adds the final part of the equation.
sep = "": Ensures no additional separators are added between concatenated strings.
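To see what that paste() call produces on its own, we can run it outside the plot (using the coefficients object created above):
# The label string by itself; for this model it evaluates to "y = 37.29 - 5.34 * x"
paste("y = ", round(coefficients[1, 1], 2), " - ",
      abs(round(coefficients[2, 1], 2)), " * x", sep = "")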
We can run an ANOVA on our data as well to check its significance:
summary(aov(mpg~wt, data=mtcars))
Df Sum Sq Mean Sq F value Pr(>F)
wt 1 847.7 847.7 91.38 1.29e-10 ***
Residuals 30 278.3 9.3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This again shows that there is statistical significance in the relationship between weight and MPG.
Lets try some other groups of data in the mtcars data set.
summary(manova(cbind(cyl, disp, qsec, am) ~ hp, data = mtcars))
Df Pillai approx F num Df den Df Pr(>F)
hp 1 0.79905 26.84 4 27 4.604e-09 ***
Residuals 30
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov((manova(cbind(cyl, disp, qsec, am) ~ hp, data = mtcars)))
Response cyl :
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 68.517 68.517 67.71 3.478e-09 ***
Residuals 30 30.358 1.012
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response disp :
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 297901 297901 50.128 7.143e-08 ***
Residuals 30 178284 5943
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response qsec :
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 49.651 49.651 30.19 5.766e-06 ***
Residuals 30 49.338 1.645
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response am :
Df Sum Sq Mean Sq F value Pr(>F)
hp 1 0.4566 0.45655 1.886 0.1798
Residuals 30 7.2622 0.24207
The MANOVA output above shows that horsepower (hp) is significantly associated with the combination of number of cylinders (cyl), displacement (disp), 1/4 mile time (qsec) and transmission type (am), which are treated here as the response variables.
When we analyse the follow-up ANOVA output, we can see that, actually, horsepower has no statistically significant relationship with transmission type, but it does with all the other variables, and the associated values are shown in the output.
Week 10 - Logistic Regression
Definitions
Logistic regression is used to predict the class (or category) of individuals based on one or multiple predictor variables (x). It is used to model binary outcomes - those with only 2 possible values: 0 or 1, yes or no, diseased or non-diseased, male or female, etc. It gives the probability that an observation with predictor value x falls within the category. Think of how tarsus width gives the probability that a peregrine pullus is female.
Logistic regression is part of the generalized linear model family (GLM).
The logistic regression formula can be generalized as:
\[p = \frac{\exp(y)}{1 + \exp(y)}\]
where p is the probability of the event occurring, exp is the exponential function (how steep the gradient of our curve is) and \(y = b_0 + b_1 x\).
In words, this equation says that the probability of an event occurring (p) is a function of our predictor variable x through the linear predictor y. In situations where there is more than one predictor variable, \(y = b_0 + b_1 x_1 + b_2 x_2 + \dots\)
The b values are the regression beta coefficients. A positive b indicates that an increase in variable x is associated with an increase in y, with a negative b being the opposite.
The generalized formula is an exponential function, and through logarithmic transformation (don't worry about the algebra!) it becomes a linear combination of predictors:
\[\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x\]
Remember how logarithmically transforming data can produce a straight-line graph from data that originally plots as an exponential curve - it is the same logic.
The quantity log[p/(1-p)] is called the logarithm of the odds, also known as the log-odds or logit.
This odds function enables the calculation of the probability too:
\[p = \frac{\text{odds}}{1 + \text{odds}}\]
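A quick numeric check of that relationship, using made-up values purely for illustration:
p <- 0.75
odds <- p / (1 - p) # odds of 3 to 1
odds / (1 + odds)   # back to the probability: 0.75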
Actually doing some modelling!
As with all models, there are assumptions to check and some data wrangling to do to improve the accuracy of a model.
The following steps should be undertaken to help with accuracy:
Remove potential outliers
Make sure that the predictor variables are normally distributed. If not, you can use log, root, Box-Cox transformation.
Remove highly correlated predictors to minimize over fitting.
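On that last point, one simple way to look for strongly correlated predictors is a correlation matrix of the numeric columns. This is only a sketch, using the mlbench diabetes data that we load properly below (pima is just an illustrative working copy):
library(mlbench)
data(PimaIndiansDiabetes2)
pima <- na.omit(PimaIndiansDiabetes2)          # hypothetical working copy with NAs removed
round(cor(pima[sapply(pima, is.numeric)]), 2)  # look for pairs with |r| close to 1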
Enough Talk - lets try some stuff!
rm(list = ls()) #this clears the workspace
#| Label: library calls LR
library(tidyverse)
library(caret)
Warning: package 'caret' was built under R version 4.1.3
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
library(mlbench)
output=FALSE
It is important when modeling that you can both train and then test the model. We are going to use a pre-built dataset:
# Load the data and remove NAs
data(PimaIndiansDiabetes2)
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
sample_n(PimaIndiansDiabetes2, 3) # this draws a random sample of 3 rows from the dataset
pregnant glucose pressure triceps insulin mass pedigree age diabetes
370 1 133 102 28 140 32.8 0.234 45 pos
131 4 173 70 14 168 29.7 0.361 33 pos
9 2 197 70 45 543 30.5 0.158 53 pos
# Split data into training and test set
set.seed(123) # set.seed calls a specific random number sequence. This is important as it ensures the results can be reproduced; unique random sampling on each run would not allow this
training.samples <- PimaIndiansDiabetes2$diabetes %>% # $diabetes is the outcome we are interested in
  createDataPartition(p = 0.8, list = FALSE) # this creates a data partition containing 80% of the data set
train.data <- PimaIndiansDiabetes2[training.samples, ] # this assigns the name train.data to the 80% partition created above
test.data <- PimaIndiansDiabetes2[-training.samples, ] # this assigns the name test.data to the data not included in the 80% partition
This is an example of the model code and output:
# Fit the Model
model <- glm(diabetes ~ ., data = train.data, family = binomial)
# Summarise the Model
summary(model)
Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5832 -0.6544 -0.3292 0.6248 2.5968
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053e+01 1.440e+00 -7.317 2.54e-13 ***
pregnant 1.005e-01 6.127e-02 1.640 0.10092
glucose 3.710e-02 6.486e-03 5.719 1.07e-08 ***
pressure -3.876e-04 1.383e-02 -0.028 0.97764
triceps 1.418e-02 1.998e-02 0.710 0.47800
insulin 5.940e-04 1.508e-03 0.394 0.69371
mass 7.997e-02 3.180e-02 2.515 0.01190 *
pedigree 1.329e+00 4.823e-01 2.756 0.00585 **
age 2.718e-02 2.020e-02 1.346 0.17840
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 267.18 on 305 degrees of freedom
AIC: 285.18
Number of Fisher Scoring iterations: 5
#This took a few times to get to work, remember to proof and check for erroneous capitalization of diabetes!
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
# Model accuracy
mean(predicted.classes == test.data$diabetes)
[1] 0.7564103
We can see from the output that there are some statistically significant predictors, but we will return to the interpretation of the output after some other modeling practices.
A slower walk through of logistic regression modeling
A simple logistic regression is used to predict the probability of class membership based on a single predictor variable.
Lets build a model to predict the probability of being diabetes-positive (the class) based on plasma glucose concentration (our predictor variable):
model <- glm(diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef # $coef extracts the coefficients and the associated statistics from the summary
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.15882009 0.700096646 -8.797100 1.403974e-18
glucose 0.04327234 0.005341133 8.101716 5.418949e-16
This output shows us the estimates of the beta coefficients and their associated p-values. We can put this information into the equations we explained earlier.
Reminder
Our base equation is \(p = \exp(y)/[1 + \exp(y)]\), where \(y = b_0 + b_1 x\).
So, from the output above, our equation is \(p = \exp(-6.15 + 0.043 \times \text{glucose}) / [1 + \exp(-6.15 + 0.043 \times \text{glucose})]\).
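As a check on that equation, we can compute a predicted probability by hand from the stored coefficients; this is just a sketch using a hypothetical glucose value of 150:
b <- coef(model)                          # intercept and glucose coefficient
y <- b[1] + b[2] * 150                    # linear predictor for glucose = 150
exp(y) / (1 + exp(y))                     # manual probability
predict(model, data.frame(glucose = 150),
        type = "response")                # should give the same value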
We can now use our model to predict the probability of an individual having diabetes given a glucose plasma value.
newdata <- data.frame(glucose = c(20, 180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.class <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.class
1 2
"neg" "pos"
This output is telling us that for a glucose data value of 20 our model predicts a negative diabetes result, whilst a glucose value of 180 should be expected to return a positive diabetes result.
We can plot this and it produces an s-shaped curve:
train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>% # change the pos/neg data into a 1/0 result
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) + # plot the points at 20% opacity so overlapping points are visible
  geom_smooth(method = "glm", method.args = list(family = "binomial")) + # method.args passes additional arguments to geom_smooth; here we tell R that the glm is binomial
  labs(
    title = "Logistic Regression Model",
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabetes-pos"
  )
`geom_smooth()` using formula = 'y ~ x'
Multiple Logistic Regression
The multiple logistic regression is used to predict the probability of class based on multiple predictor variables:
model <- glm(diabetes ~ glucose + mass + pregnant,
             data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.32369818 1.125997285 -8.280391 1.227711e-16
glucose 0.03886154 0.005404219 7.190962 6.433636e-13
mass 0.09458458 0.023529905 4.019760 5.825738e-05
pregnant 0.14466661 0.045125729 3.205857 1.346611e-03
In this example, we have asked R to consider and evaluate the impact of glucose, mass and pregnant data on the probability that someone will be in the positive class for diabetes.
We can expand this to include all data as follows:
model <- glm(diabetes ~ .,
             data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053400e+01 1.439679266 -7.31690975 2.537464e-13
pregnant 1.005031e-01 0.061266974 1.64041157 1.009196e-01
glucose 3.709621e-02 0.006486093 5.71934633 1.069346e-08
pressure -3.875933e-04 0.013826185 -0.02803328 9.776356e-01
triceps 1.417771e-02 0.019981885 0.70952823 4.779967e-01
insulin 5.939876e-04 0.001508231 0.39383055 6.937061e-01
mass 7.997447e-02 0.031798907 2.51500698 1.190300e-02
pedigree 1.329149e+00 0.482291020 2.75590704 5.852963e-03
age 2.718224e-02 0.020199295 1.34570257 1.783985e-01
I have paused at this point as despite the code chunks being the same, they are outputting different results.
See this check, where I have copied the code from both locations
identical(model.frame(glm(diabetes ~ ., data = train.data, family = binomial)),
model.frame(glm(diabetes ~., data = train.data, family = binomial)))
[1] TRUE
Lets try clearing our environments and starting again!
rm(list = ls())
library(dplyr)
library(caret)
library(mlbench)
data(PimaIndiansDiabetes2)
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
Now lets run the two varieties of glm and see if we get different results:
model1 <- glm(diabetes ~ glucose + mass + pregnant, data = train.data, family = binomial)
model2 <- glm(diabetes ~ ., data = train.data, family = binomial)
summary(model1)
Call:
glm(formula = diabetes ~ glucose + mass + pregnant, family = binomial,
data = train.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2066 -0.6448 -0.3576 0.6523 2.4407
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.323698 1.125997 -8.280 < 2e-16 ***
glucose 0.038862 0.005404 7.191 6.43e-13 ***
mass 0.094585 0.023530 4.020 5.83e-05 ***
pregnant 0.144667 0.045126 3.206 0.00135 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 279.88 on 310 degrees of freedom
AIC: 287.88
Number of Fisher Scoring iterations: 5
summary(model2)
Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5832 -0.6544 -0.3292 0.6248 2.5968
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053e+01 1.440e+00 -7.317 2.54e-13 ***
pregnant 1.005e-01 6.127e-02 1.640 0.10092
glucose 3.710e-02 6.486e-03 5.719 1.07e-08 ***
pressure -3.876e-04 1.383e-02 -0.028 0.97764
triceps 1.418e-02 1.998e-02 0.710 0.47800
insulin 5.940e-04 1.508e-03 0.394 0.69371
mass 7.997e-02 3.180e-02 2.515 0.01190 *
pedigree 1.329e+00 4.823e-01 2.756 0.00585 **
age 2.718e-02 2.020e-02 1.346 0.17840
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 267.18 on 305 degrees of freedom
AIC: 285.18
Number of Fisher Scoring iterations: 5
Running the clearing command both here and before the quick-start code above seems to have removed the differences between the quick-start output and the later code where all predictors were included in the model.
Further, it is interesting to note that when all predictors are used, compared to the model that just included glucose + mass + pregnant, the pregnant variable changes from a significant predictor to a non-significant one.
This could be an example of overfitting, but there are other reasons why this might be the case:
- Shared Variance
When you add more predictors to the model, some predictors may share variance with others. This means that the new predictors might explain some of the same variation in the response variable that was previously attributed to existing predictors. As a result, the statistical significance of existing predictors may decrease because their unique contribution to explaining the response variable is reduced.
- Multicollinearity
Adding predictors that are highly correlated with each other can lead to multicollinearity, which inflates the standard errors of the estimated coefficients. Higher standard errors lead to lower t-values (or z-values in the case of a GLM), which in turn can make predictors appear less significant. (A quick check for this is sketched after this list.)
- Model Complexity
A more complex model with many predictors can dilute the effect of individual predictors. When the model includes more predictors, each predictor has to account for less unique variance, potentially reducing their individual significance.
- Overfitting
Including too many predictors, especially those that do not contribute much to explaining the variability in the response variable, can lead to overfitting. Overfitting decreases the model’s ability to generalize to new data and can affect the significance of the predictors.
While adding more predictors can improve the explanatory power of a model, it’s important to carefully consider which predictors to include. Using techniques like stepwise regression, regularization (e.g., Lasso or Ridge regression), or principal component analysis can help manage the trade-offs between model complexity and predictor significance.
We are not covering these model types here, but should return to learn more about them in the future
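Returning to the multicollinearity point above, one common check is the variance inflation factor. This is only a sketch, assuming the car package is installed (vif() comes from car, not base R):
library(car)
vif(model2) # values well above ~5 would suggest problematic collinearity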
Interpreting the results
From the coefficient output, we can see that glucose, mass and pedigree are significantly associated with the outcome. Note, this is different from the webpage used to build this model, but is consistent with my findings.
All our coefficients are positive values, meaning an increase in these variables is associated with an increased probability of being diabetes-positive. If the coefficients were negative, this would be reversed.
Another important part of analysing the output is the odds ratio. The odds ratio tells us the impact size of the predictor on the outcome.
As an example, our coefficient for glucose in the above is 0.0371. This indicates that a one-unit increase in glucose concentration multiplies the odds of being diabetes-positive by about 1.038 (the value of e^0.0371).
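We can get all of the odds ratios at once by exponentiating the coefficients (a quick sketch using the full model, model2, fitted above):
# Odds ratios for the full model
round(exp(coef(model2)), 3) # the glucose entry is about 1.038, i.e. exp(0.0371)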
Now we know which predictors are significant to the model, we can make a new model based only on these.
model <- glm(diabetes ~ glucose + mass + pedigree, data = train.data, family = binomial)
summary(model)
Call:
glm(formula = diabetes ~ glucose + mass + pedigree, family = binomial,
data = train.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5262 -0.6984 -0.3927 0.6445 2.5943
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.524661 1.159907 -8.212 < 2e-16 ***
glucose 0.042445 0.005565 7.628 2.39e-14 ***
mass 0.082817 0.022990 3.602 0.000315 ***
pedigree 1.276572 0.463373 2.755 0.005870 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 398.80 on 313 degrees of freedom
Residual deviance: 282.64 on 310 degrees of freedom
AIC: 290.64
Number of Fisher Scoring iterations: 5
Now we have our model, we can test it using the testing data previously partitioned. This will enable us to evaluate the performance of our model:
probabilities <- model %>%
  predict(test.data, type = "response")
head(probabilities)
19 21 32 55 64 71
0.2086204 0.4941976 0.7078530 0.6530271 0.3672377 0.1901986
This gives us the probabilities, but that is not that helpful in giving us the class. To give us the probable result of the diabetes test, given the values of the predictors in our data, we need to tell R what output to give based on the probabilities returned.
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
head(predicted.classes)
19 21 32 55 64 71
"neg" "neg" "pos" "pos" "neg" "neg"
Well done! We made it to predictions!
Assessing model accuracy
We can now predict based on our model, but how do we know if the model is accurate?
There are many ways models can be evaluated, as a start point we can ask for the proportion of correctly classified observations:
mean(predicted.classes==test.data$diabetes)
[1] 0.7692308
This output tells us that our model correctly classified ~77% of our test data. This can also be written as a misclassification error rate of 23%.
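A slightly more detailed check, sketched here with base R, is a confusion table of predicted against observed classes (caret's confusionMatrix() would give even more detail):
# Counts of correct and incorrect classifications by class
table(Predicted = predicted.classes, Observed = test.data$diabetes)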