Shay’s (Shana’s) Workbook

Week 1

In Week 1, I followed the Quarto tutorial and completed the “Meet the Penguin” document. I played around with Quarto in RStudio. As you can see, I’ve learnt how to insert an image; however, I ran into some trouble with JPG and PNG files. I had to open the JPG file in the “Paint” app and save it again as a PNG file, so that when I rendered my work the picture was displayed instead of being shown as a file.

I then learnt how to change the theme, font colour and background colour with the help of the Quarto website. From this, I learnt what a hex code is!

I’ve worked with R quite a lot during my Bachelor’s, so I am familiar with ggplots and the like. Therefore, all I did to the output was change the value colours.

Week 2

I did a bit of reading around the tidyverse: what it is, what we can do with it, and how it can help us with data organisation and analysis.

Week 3

In Week 3 I read and wrote notes about Section 6 of the Guide to R book, which was all about basic data management. I then completed exercise 6.6.1, which can be seen below. I had to separate the different problems into separate R code chunks because there was an error when rendering, and splitting the code into sections made it much easier to see which part was causing the issue. I then completed the extra practice in Section 6.7, which was quite fun and satisfying; thinking through code and understanding it makes me happy when the results are exactly what I wanted. I have created a subheading for the research methods part of Week 3.

Research Methods

A good question about the diamond data set is “What is the distribution of diamond prices across different cuts, and how does this distribution change when organized by clarity?” A bad question about the diamond data set is “Why are diamonds so expensive?” or “Which diamonds are better?”
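To make the good question concrete, here is a minimal sketch of my own (not part of the exercise) showing one way to look at the price distribution by cut, split by clarity:

library(tidyverse)
# Boxplots of price by cut, one panel per clarity grade (illustrative only)
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  facet_wrap(~ clarity) +
  labs(x = "Cut", y = "Price (USD)")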

Notes about basic data management

  • When modifying the dataset, don’t forget to add %>% to link a sequence of analysis steps

  • mutate() adds new columns or modifies current variables

  • summarize() collapses all rows and returns a one-row summary, e.g. summarize(avg.price = mean(price))

  • If you just want to look at females and males separately, for example, you can group_by(Sex) or group_by(Sex, Age), and then ungroup() when you are done

  • The function filter() retains only the rows of data that meet the requirements, e.g. filter(cut == "Fair") or filter(cut == "Fair" | cut == "Good", price <= 600); the latter will display cuts of Fair or Good with a price at or under $600

  • The select() function allows you to keep only the columns (variables) that you want to see, e.g. select(1:5) to select columns 1 to 5 (see the sketch after this list)
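Below is a quick sketch of my own (not from the book) showing these verbs working together on the diamonds data:

library(tidyverse)
# Filter, select, group, summarize and mutate chained with %>% (illustrative only)
diamonds %>%
  filter(cut == "Fair" | cut == "Good", price <= 600) %>%   # keep Fair or Good cuts at or under $600
  select(cut, color, clarity, price) %>%                    # keep only these columns
  group_by(cut) %>%                                         # summarize separately per cut
  summarize(avg.price = mean(price),                        # collapse to one row per cut
            n = n()) %>%
  ungroup() %>%
  mutate(price.plus10 = avg.price + 10)                     # hypothetical new column, average price + $10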

6.6.1 Exercises

library(tidyverse)
View(diamonds)
##Now look at the structure of the data set
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##We know that the dataset contains 53,940 round-cut diamonds as each row of data represents a different diamond and there are 53,940 rows
##Now look at the variable names
names(diamonds)
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
 [8] "x"       "y"       "z"      
data(diamonds)
diamonds %>%                         # utilizes the diamonds dataset
  group_by(color, clarity) %>%      # groups data by color and clarity variables
  mutate(price200 = mean(price)) %>%  # creates new variable (average price by groups)
  ungroup() %>%                # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>%  # new variable, original price + $10
  select(cut, color,                 # retain only these listed columns
         clarity, price, 
         price200, random10)%>%
  arrange(color) %>%                 # visualize data ordered by color
  group_by(cut) %>%                  # group data by cut
  mutate(dis = n_distinct(price),    # counts the number of unique price values per cut 
         rowID = row_number()) %>%   # numbers each row consecutively for each cut
  ungroup()     # final ungrouping of data
# A tibble: 53,940 × 8
   cut       color clarity price price200 random10   dis rowID
   <ord>     <ord> <ord>   <int>    <dbl>    <dbl> <int> <int>
 1 Very Good D     VS2       357    2587.      367  5840     1
 2 Very Good D     VS1       402    3030.      412  5840     2
 3 Very Good D     VS2       403    2587.      413  5840     3
 4 Good      D     VS2       403    2587.      413  3086     1
 5 Good      D     VS1       403    3030.      413  3086     2
 6 Premium   D     VS2       404    2587.      414  6014     1
 7 Premium   D     SI1       552    2976.      562  6014     2
 8 Ideal     D     SI1       552    2976.      562  7281     1
 9 Ideal     D     SI1       552    2976.      562  7281     2
10 Very Good D     VVS1      553    2948.      563  5840     4
# ℹ 53,930 more rows
##Problem A
data(midwest)
View(midwest)
midwest %>%
  group_by(state) %>%
  summarize(poptotalmean = mean(poptotal),
            poptotalmed = median(poptotal),
            popmax = max(poptotal),
            popmin = min(poptotal),
            popdistinct = n_distinct(poptotal),
            popfirst = first(poptotal),
            popany = any(poptotal < 5000),
            popany2 = any(poptotal > 2000000)) %>%
  ungroup()
# A tibble: 5 × 9
  state poptotalmean poptotalmed  popmax popmin popdistinct popfirst popany
  <chr>        <dbl>       <dbl>   <int>  <int>       <int>    <int> <lgl> 
1 IL         112065.      24486. 5105067   4373         101    66090 TRUE  
2 IN          60263.      30362.  797159   5315          92    31095 FALSE 
3 MI         111992.      37308  2111687   1701          83    10145 TRUE  
4 OH         123263.      54930. 1412140  11098          88    25371 FALSE 
5 WI          67941.      33528   959275   3890          72    15682 TRUE  
# ℹ 1 more variable: popany2 <lgl>
##Problem B
data(midwest)
midwest %>%
  group_by(state) %>%
  summarize (num5k = sum(poptotal < 5000),
             num2mil = sum(poptotal > 2000000),
             numrows = n()) %>%
  ungroup()
# A tibble: 5 × 4
  state num5k num2mil numrows
  <chr> <int>   <int>   <int>
1 IL        1       1     102
2 IN        0       0      92
3 MI        1       1      83
4 OH        0       0      88
5 WI        2       0      72
##Problem C part 1
data(midwest)
midwest %>%
  group_by(county) %>%
  summarize(x = n_distinct(state)) %>%
  arrange(desc(x)) %>%
  ungroup()
# A tibble: 320 × 2
   county         x
   <chr>      <int>
 1 CRAWFORD       5
 2 JACKSON        5
 3 MONROE         5
 4 ADAMS          4
 5 BROWN          4
 6 CLARK          4
 7 CLINTON        4
 8 JEFFERSON      4
 9 LAKE           4
10 WASHINGTON     4
# ℹ 310 more rows
##Problem C part 2
data(midwest)
midwest %>%
  group_by(county) %>%
  summarize(x = n()) %>%
  ungroup()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         4
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         2
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       2
# ℹ 310 more rows
##Problem C part 3
data(midwest)
midwest %>%
  group_by(county) %>%
  summarize(x = n_distinct(county)) %>%
  ungroup ()
# A tibble: 320 × 2
   county        x
   <chr>     <int>
 1 ADAMS         1
 2 ALCONA        1
 3 ALEXANDER     1
 4 ALGER         1
 5 ALLEGAN       1
 6 ALLEN         1
 7 ALPENA        1
 8 ANTRIM        1
 9 ARENAC        1
10 ASHLAND       1
# ℹ 310 more rows
midwest %>%
  group_by(state) %>%
  summarize(x = n_distinct(state)) %>%
  ungroup()
# A tibble: 5 × 2
  state     x
  <chr> <int>
1 IL        1
2 IN        1
3 MI        1
4 OH        1
5 WI        1
##Problem D
data(diamonds)
diamonds %>%
  group_by(clarity) %>%
  summarize(a = n_distinct(color),
            b = n_distinct(price),
            c = n()) %>%
  ungroup()
# A tibble: 8 × 4
  clarity     a     b     c
  <ord>   <int> <int> <int>
1 I1          7   632   741
2 SI2         7  4904  9194
3 SI1         7  5380 13065
4 VS2         7  5051 12258
5 VS1         7  3926  8171
6 VVS2        7  2409  5066
7 VVS1        7  1623  3655
8 IF          7   902  1790
##Problem E part 1
data(diamonds)
diamonds %>%
  group_by(color, cut) %>%
  summarize (m = mean(price),
             s = sd(price)) %>%
  ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
   color cut           m     s
   <ord> <ord>     <dbl> <dbl>
 1 D     Fair      4291. 3286.
 2 D     Good      3405. 3175.
 3 D     Very Good 3470. 3524.
 4 D     Premium   3631. 3712.
 5 D     Ideal     2629. 3001.
 6 E     Fair      3682. 2977.
 7 E     Good      3424. 3331.
 8 E     Very Good 3215. 3408.
 9 E     Premium   3539. 3795.
10 E     Ideal     2598. 2956.
# ℹ 25 more rows
##Problem E part 2
data(diamonds)
diamonds %>%
  group_by(cut,color) %>%
  summarize(m =mean(price),
            s = sd(price)) %>%
  ungroup()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 4
   cut   color     m     s
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 3286.
 2 Fair  E     3682. 2977.
 3 Fair  F     3827. 3223.
 4 Fair  G     4239. 3610.
 5 Fair  H     5136. 3886.
 6 Fair  I     4685. 3730.
 7 Fair  J     4976. 4050.
 8 Good  D     3405. 3175.
 9 Good  E     3424. 3331.
10 Good  F     3496. 3202.
# ℹ 25 more rows
##Problem E part 3
data(diamonds)
diamonds %>%
  group_by(cut,color,clarity) %>%
  summarize(m = mean(price),
            s = sd(price),
            msale = m*0.80) %>%
  ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
   cut   color clarity     m     s msale
   <ord> <ord> <ord>   <dbl> <dbl> <dbl>
 1 Fair  D     I1      7383  5899. 5906.
 2 Fair  D     SI2     4355. 3260. 3484.
 3 Fair  D     SI1     4273. 3019. 3419.
 4 Fair  D     VS2     4513. 3383. 3610.
 5 Fair  D     VS1     2921. 2550. 2337.
 6 Fair  D     VVS2    3607  3629. 2886.
 7 Fair  D     VVS1    4473  5457. 3578.
 8 Fair  D     IF      1620.  525. 1296.
 9 Fair  E     I1      2095.  824. 1676.
10 Fair  E     SI2     4172. 3055. 3338.
# ℹ 266 more rows
##Problem F
data(diamonds)
diamonds %>%
  group_by(cut) %>%
  summarize(potato = mean(depth),
            pizza = mean(price),
            popcorn = median(y),
            pineapple = potato - pizza,
            papaya = pineapple ^ 2,
            peach = n()) %>%
  ungroup ()
# A tibble: 5 × 7
  cut       potato pizza popcorn pineapple    papaya peach
  <ord>      <dbl> <dbl>   <dbl>     <dbl>     <dbl> <int>
1 Fair        64.0 4359.    6.1     -4295. 18444586.  1610
2 Good        62.4 3929.    5.99    -3866. 14949811.  4906
3 Very Good   61.8 3982.    5.77    -3920. 15365942. 12082
4 Premium     61.3 4584.    6.06    -4523. 20457466. 13791
5 Ideal       61.7 3458.    5.26    -3396. 11531679. 21551
##Problem G part 1
data(diamonds)
diamonds %>% 
  group_by(color) %>%
  summarize(m= mean(price)) %>%
  mutate(x1 = str_c("Diamond color", color),
         x2 = 5) %>%
  ungroup()
# A tibble: 7 × 4
  color     m x1                x2
  <ord> <dbl> <chr>          <dbl>
1 D     3170. Diamond colorD     5
2 E     3077. Diamond colorE     5
3 F     3725. Diamond colorF     5
4 G     3999. Diamond colorG     5
5 H     4487. Diamond colorH     5
6 I     5092. Diamond colorI     5
7 J     5324. Diamond colorJ     5
##Problem G part 2  
data(diamonds)
diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%
  ungroup() %>%
  mutate(x1 = str_c("Diamond color", color),
         x2 = 5)
# A tibble: 7 × 4
  color     m x1                x2
  <ord> <dbl> <chr>          <dbl>
1 D     3170. Diamond colorD     5
2 E     3077. Diamond colorE     5
3 F     3725. Diamond colorF     5
4 G     3999. Diamond colorG     5
5 H     4487. Diamond colorH     5
6 I     5092. Diamond colorI     5
7 J     5324. Diamond colorJ     5
##Problem H part 1
data(diamonds)
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  summarize(m=mean(1)) %>%
  ungroup()
# A tibble: 7 × 2
  color     m
  <ord> <dbl>
1 D         1
2 E         1
3 F         1
4 G         1
5 H         1
6 I         1
7 J         1
##Problem H part 2
data(diamonds)
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%
  summarize(m = mean(x1))
# A tibble: 1 × 1
      m
  <dbl>
1 1966.

6.7 Extra Practice

data(diamonds)
View(diamonds)

diamonds %>%
  arrange(price)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
diamonds %>%
  arrange (desc(price))
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
diamonds %>%
  arrange(price, cut)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 2  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
 9  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
diamonds %>%
  arrange(desc(price), cut)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
diamonds$clarity <- factor(diamonds$clarity, 
                           levels = c("I1", "I2", "I3", "SI1", "SI2", "VS1", "VS2", "VVS1", "VVS2", "IF"))
diamonds %>%
  arrange(price, clarity)
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 2  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 7  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows
diamonds %>%
  mutate(saleprice = price-250)
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z saleprice
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43        76
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31        76
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31        77
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63        84
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75        85
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48        86
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47        86
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53        87
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49        87
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39        88
# ℹ 53,930 more rows
diamonds %>%
  select(-x,-y,-z)
# A tibble: 53,940 × 7
   carat cut       color clarity depth table price
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int>
 1  0.23 Ideal     E     SI2      61.5    55   326
 2  0.21 Premium   E     SI1      59.8    61   326
 3  0.23 Good      E     VS1      56.9    65   327
 4  0.29 Premium   I     VS2      62.4    58   334
 5  0.31 Good      J     SI2      63.3    58   335
 6  0.24 Very Good J     VVS2     62.8    57   336
 7  0.24 Very Good I     VVS1     62.3    57   336
 8  0.26 Very Good H     SI1      61.9    55   337
 9  0.22 Fair      E     VS2      65.1    61   337
10  0.23 Very Good H     VS1      59.4    61   338
# ℹ 53,930 more rows
diamonds %>%
  group_by(cut) %>%
  summarize(count=n())
# A tibble: 5 × 2
  cut       count
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551
diamonds %>%
  mutate(totalNum= nrow(diamonds))
# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z totalNum
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <int>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43    53940
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31    53940
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31    53940
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63    53940
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75    53940
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48    53940
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47    53940
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53    53940
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49    53940
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39    53940
# ℹ 53,930 more rows

Week 4

Research Methods:

Based on this week’s readings, below is a paragraph I’ve written describing my understanding of what a “good” hypothesis is:

A hypothesis is often described as a “conjecture”: an educated proposition, based on incomplete evidence, that is yet to be proven. To create a hypothesis, we use inductive reasoning, which involves identifying patterns in specific cases and using these observations to make broader inferences. A good hypothesis will allow predictions to be made by deductive reasoning. A successful hypothesis is not meant to reduce the validity of a previous finding or seek power over existing theories; rather, it should stimulate research that may reveal the vagueness of the research area. A hypothesis should be brief and identify the objective of the study; however, the researcher should also entertain many alternative hypotheses so as to avoid favouring information that supports their own. A hypothesis should be clear and demonstrate relevance to its field; most importantly, it should address a knowledge gap and outline what we expect to learn. Identifying a knowledge gap is a key part of writing a hypothesis, so it is important to start with a literature review to acquire that knowledge. A hypothesis must also be falsifiable, otherwise it is “unscientific” and not ready to be tested; it must concern something observable or accessible. A research hypothesis is usually presented in the introduction and may influence the reader’s judgement of the applicability and reliability of your research. A research hypothesis can take many forms, such as the if/then method, the statement method, or the when X/then Y method, whereas a statistical hypothesis mainly takes the form of a null and alternative hypothesis.

Data Analysis:

This week, I followed the tutorial video online and repeated it myself. Following along with the video was helpful, as it allowed me to understand why we were implementing each step and how the plot changed depending on our input. I liked the flow chart that was in the YouTube video as well as the lecture; I have uploaded it below. I have hidden my R script for this week to test this format out in my workbook, as well as to keep it neat.

  • A bar chart is the default for a single categorical variable

  • A histogram is the default for a single quantitative variable

  • A scatterplot is good for two quantitative variables

  • A boxplot is for one quantitative and one categorical variable (see the sketch after this list)
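Here is a minimal sketch of my own (using the penguins data loaded this week; the column names are those from the palmerpenguins package) showing each of these defaults:

library(palmerpenguins)
library(ggplot2)

# Bar chart: one categorical variable
ggplot(penguins, aes(x = species)) +
  geom_bar()

# Histogram: one quantitative variable
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)

# Scatterplot: two quantitative variables
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Boxplot: one quantitative and one categorical variable
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()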


Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':

    penguins

Basic Scatterplot & scatterplot with colours/labels

Modifying basic properties of a plot

Adding another layer & another layer with two species

`geom_smooth()` using formula = 'y ~ x'

`geom_smooth()` using formula = 'y ~ x'

Other plots

Faceting (bad vs good)

Week 5

Research Methods:

An ANOVA test is appropriate for the first plot, a boxplot of sepal length across the three iris species, as well as for the second plot, which shows the density of petal lengths by species.

A linear regression analysis would be appropriate for the third plot, which shows petal width against petal length across the iris species. This plot shows a single regression line; however, if we changed the code to display a regression line for each species, we could use an ANOVA to test whether the slopes differ significantly. We can also calculate the correlation coefficient to measure the strength of the relationship between the two variables.

A chi-squared test is most appropriate for the fourth plot, which is a bar chart of the count of each species by size category. This allows us to determine whether there is a significant association between the two variables. If the chi-squared test comes back significant, we can follow up with a post-hoc analysis, conducting pairwise comparisons using chi-squared tests with p-values adjusted for multiple comparisons.
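As a rough sketch of my own (not part of the coursework), these tests could be run on the iris data like this, recreating the size variable used for the bar chart further down:

library(tidyverse)

# One-way ANOVA: does mean sepal length differ between the three species? (first plot)
summary(aov(Sepal.Length ~ Species, data = iris))

# Linear regression of petal width on petal length, plus the correlation coefficient (third plot)
summary(lm(Petal.Width ~ Petal.Length, data = iris))
cor(iris$Petal.Length, iris$Petal.Width)

# Fitting a separate slope per species and testing whether the slopes differ
anova(lm(Petal.Width ~ Petal.Length * Species, data = iris))

# Chi-squared test of association between species and size category (fourth plot)
iris_size <- iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length), "small", "big"))
chisq.test(table(iris_size$Species, iris_size$size))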

Data Analysis:

library(tidyverse)
library(modeldata)
library(ggplot2)
library(dplyr)


data(iris)
View(iris)

#Boxplot

ggplot(iris, aes(x = Species,
                 y = Sepal.Length,
                 color = Species)) +
  geom_boxplot(show.legend = TRUE)

# Density plot

ggplot(iris, aes(x = Petal.Length,
                 fill = Species)) +
  geom_density(alpha = 0.3) +
  labs(title = "Density plot of Petal Length by Species",
       x = "Petal Length",
       y = "Density")

#Scatter plot with fitted regression line

ggplot(iris, aes(x = Petal.Length,
                 y = Petal.Width,
                 colour = Species)) +
  geom_point(aes(shape = Species)) +
  geom_smooth(method = "lm", se = TRUE, aes(group = 1), colour = "blue") +
  labs(x = "Petal length",
       y = "Petal Width",
       color = "Species",
       shape = "Species") +
  scale_shape_manual(values = c(16, 17, 15))
`geom_smooth()` using formula = 'y ~ x'

#Histogram

data(iris)
iris <- iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length),
                     "small", "big"))
ggplot(iris, aes(x = Species, fill = size)) +
  geom_bar(position = "dodge")

Week 6

Activity No.1

file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
housetasks <- read.delim(file_path, row.names = 1)
head(housetasks)
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
library("gplots")

Attaching package: 'gplots'
The following object is masked from 'package:stats':

    lowess
# Convert the data to a contingency table
dt <- as.table(as.matrix(housetasks))
# Graph it as a balloon plot
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
           main = "housetasks")

library("vcd")
Loading required package: grid
# plot just a subset of the table
assoc(head(dt, 5), shade = TRUE, las=3)

#Chi-square tests examine whether the rows and columns of a contingency table are statistically significantly associated
chisq <- chisq.test(housetasks)
chisq

    Pearson's Chi-squared test

data:  housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
#This is how to extract observed and expected counts
chisq$observed
           Wife Alternating Husband Jointly
Laundry     156          14       2       4
Main_meal   124          20       5       4
Dinner       77          11       7      13
Breakfeast   82          36      15       7
Tidying      53          11       1      57
Dishes       32          24       4      53
Shopping     33          23       9      55
Official     12          46      23      15
Driving      10          51      75       3
Finances     13          13      21      66
Insurance     8           1      53      77
Repairs       0           3     160       2
Holidays      0           1       6     153
round(chisq$expected,2)
            Wife Alternating Husband Jointly
Laundry    60.55       25.63   38.45   51.37
Main_meal  52.64       22.28   33.42   44.65
Dinner     37.16       15.73   23.59   31.52
Breakfeast 48.17       20.39   30.58   40.86
Tidying    41.97       17.77   26.65   35.61
Dishes     38.88       16.46   24.69   32.98
Shopping   41.28       17.48   26.22   35.02
Official   33.03       13.98   20.97   28.02
Driving    47.82       20.24   30.37   40.57
Finances   38.88       16.46   24.69   32.98
Insurance  47.82       20.24   30.37   40.57
Repairs    56.77       24.03   36.05   48.16
Holidays   55.05       23.30   34.95   46.70
#If you want to know the most contributing cells to the total Chi-square score, you just have to calculate the Chi-square statistic for each cell:
#The chi-square statistic for each cell is also the Pearson residual
round(chisq$residuals, 3)
             Wife Alternating Husband Jointly
Laundry    12.266      -2.298  -5.878  -6.609
Main_meal   9.836      -0.484  -4.917  -6.084
Dinner      6.537      -1.192  -3.416  -3.299
Breakfeast  4.875       3.457  -2.818  -5.297
Tidying     1.702      -1.606  -4.969   3.585
Dishes     -1.103       1.859  -4.163   3.486
Shopping   -1.289       1.321  -3.362   3.376
Official   -3.659       8.563   0.443  -2.459
Driving    -5.469       6.836   8.100  -5.898
Finances   -4.150      -0.852  -0.742   5.750
Insurance  -5.758      -4.277   4.107   5.720
Repairs    -7.534      -4.290  20.646  -6.651
Holidays   -7.419      -4.620  -4.897  15.556
#We can visualize Pearson residuals like this
library(corrplot)
corrplot 0.95 loaded
corrplot(chisq$residuals, is.cor = FALSE)

#Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables
#Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables.

# Contribution in percentage (%)
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
            Wife Alternating Husband Jointly
Laundry    7.738       0.272   1.777   2.246
Main_meal  4.976       0.012   1.243   1.903
Dinner     2.197       0.073   0.600   0.560
Breakfeast 1.222       0.615   0.408   1.443
Tidying    0.149       0.133   1.270   0.661
Dishes     0.063       0.178   0.891   0.625
Shopping   0.085       0.090   0.581   0.586
Official   0.688       3.771   0.010   0.311
Driving    1.538       2.403   3.374   1.789
Finances   0.886       0.037   0.028   1.700
Insurance  1.705       0.941   0.868   1.683
Repairs    2.919       0.947  21.921   2.275
Holidays   2.831       1.098   1.233  12.445
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)

# printing the p-value
chisq$p.value
[1] 0
# printing the estimate (NULL, as a chi-squared test has no parameter estimate)
chisq$estimate
NULL

Activity No.2

library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr)

mpg%>%
  group_by(class, cyl)%>%
  summarize(n=n())%>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class       cyl   n
2seater       8   5
compact       4  32
compact       5   2
compact       6  13
midsize       4  16
midsize       6  23
midsize       8   2
minivan       4   1
minivan       6  10
pickup        4   3
pickup        6  10
pickup        8  20
subcompact    4  21
subcompact    5   2
subcompact    6   7
subcompact    8   5
suv           4   8
suv           6  16
suv           8  38
#In order to turn our summary data into a contingency table, we need class to be listed by row and cyl to be listed by column
mpg%>%
  group_by(class, cyl)%>%
  summarise(n=n())%>%
  spread(cyl, n)%>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class        4   5   6   8
2seater     NA  NA  NA   5
compact     32   2  13  NA
midsize     16  NA  23   2
minivan      1  NA  10  NA
pickup       3  NA  10  20
subcompact  21   2   7   5
suv          8  NA  16  38
#Instead of frequencies, we can get the average number of city miles by class &cyl
mpg%>%
  group_by(class, cyl)%>%
  summarise(mean_cty=mean(cty))%>%
  spread(cyl, mean_cty)%>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class              4   5         6         8
2seater           NA  NA        NA  15.40000
compact     21.37500  21  16.92308        NA
midsize     20.50000  NA  17.78261  16.00000
minivan     18.00000  NA  15.60000        NA
pickup      16.00000  NA  14.50000  11.80000
subcompact  22.85714  20  17.00000  14.80000
suv         18.00000  NA  14.50000  12.13158
#Or we can see max number of city miles by class &cyl
mpg%>%
  group_by(class, cyl)%>%
  summarise(max_cty=max(cty))%>%
  spread(cyl, max_cty)%>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class        4   5   6   8
2seater     NA  NA  NA  16
compact     33  21  18  NA
midsize     23  NA  19  16
minivan     18  NA  17  NA
pickup      17  NA  16  14
subcompact  35  20  18  15
suv         20  NA  17  14
#We can find proportions by creating a new variable dividing row freq by table freq
mpg%>%
  group_by(class)%>%
  summarize(n=n())%>%
  mutate(prop=n/sum(n))%>%   # our new proportion variable
  kable()
class        n       prop
2seater      5  0.0213675
compact     47  0.2008547
midsize     41  0.1752137
minivan     11  0.0470085
pickup      33  0.1410256
subcompact  35  0.1495726
suv         62  0.2649573
#Contingency Tables

#We can then create a contingency table of proportion values 
mpg%>%
  group_by(class, cyl)%>%
  summarize(n=n())%>%
  mutate(prop=n/sum(n))%>%
  subset(select=c("class","cyl","prop"))%>% 
  spread(class, prop)%>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
cyl    2seater    compact    midsize    minivan     pickup  subcompact        suv
4           NA  0.6808511  0.3902439  0.0909091  0.0909091   0.6000000  0.1290323
5           NA  0.0425532         NA         NA         NA   0.0571429         NA
6           NA  0.2765957  0.5609756  0.9090909  0.3030303   0.2000000  0.2580645
8            1         NA  0.0487805         NA  0.6060606   0.1428571  0.6129032
#We can construct a contingency table of vehicle class by number of cylinders using table()
# I am naming it contingency table so that it is easier to identify when conducting the Chi Sq test
contingency_table <- table(mpg$class, mpg$cyl)

#The table freq can also be called like this;
mpg_table<- table(mpg$class, mpg$cyl) #define object w/table parameters for simple calling
ftable(mpg_table)
             4  5  6  8
                       
2seater      0  0  0  5
compact     32  2 13  0
midsize     16  0 23  2
minivan      1  0 10  0
pickup       3  0 10 20
subcompact  21  2  7  5
suv          8  0 16 38
#Row frequencies
margin.table(mpg_table, 1) 

   2seater    compact    midsize    minivan     pickup subcompact        suv 
         5         47         41         11         33         35         62 
#Column frequencies
margin.table(mpg_table, 2)  

 4  5  6  8 
81  4 79 70 
#Proportion of the entire table
prop.table(mpg_table)
            
                       4           5           6           8
  2seater    0.000000000 0.000000000 0.000000000 0.021367521
  compact    0.136752137 0.008547009 0.055555556 0.000000000
  midsize    0.068376068 0.000000000 0.098290598 0.008547009
  minivan    0.004273504 0.000000000 0.042735043 0.000000000
  pickup     0.012820513 0.000000000 0.042735043 0.085470085
  subcompact 0.089743590 0.008547009 0.029914530 0.021367521
  suv        0.034188034 0.000000000 0.068376068 0.162393162
#Row proportions
prop.table(mpg_table, 1)
            
                      4          5          6          8
  2seater    0.00000000 0.00000000 0.00000000 1.00000000
  compact    0.68085106 0.04255319 0.27659574 0.00000000
  midsize    0.39024390 0.00000000 0.56097561 0.04878049
  minivan    0.09090909 0.00000000 0.90909091 0.00000000
  pickup     0.09090909 0.00000000 0.30303030 0.60606061
  subcompact 0.60000000 0.05714286 0.20000000 0.14285714
  suv        0.12903226 0.00000000 0.25806452 0.61290323
#Column proportions
prop.table(mpg_table, 2)
            
                      4          5          6          8
  2seater    0.00000000 0.00000000 0.00000000 0.07142857
  compact    0.39506173 0.50000000 0.16455696 0.00000000
  midsize    0.19753086 0.00000000 0.29113924 0.02857143
  minivan    0.01234568 0.00000000 0.12658228 0.00000000
  pickup     0.03703704 0.00000000 0.12658228 0.28571429
  subcompact 0.25925926 0.50000000 0.08860759 0.07142857
  suv        0.09876543 0.00000000 0.20253165 0.54285714
#Chisq tests
#We can either use the contingency table of proportions for the test or the raw data. If we use proportions, we are testing the observed proportions against a given set of expected proportions, i.e. whether the observed data matches an expected distribution.
#If we use the raw data, we are testing an association between class and cylinders.
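#As a sketch of the first approach (my own illustration, not from the activity): a goodness-of-fit test compares observed counts with a given set of expected proportions, e.g. testing whether all vehicle classes are equally common
chisq.test(table(mpg$class), p = rep(1/7, 7))
#Below, the raw table of counts is used instead, testing for an association between class and cyl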

chisq2 <- chisq.test(contingency_table)
Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect
chisq2

    Pearson's Chi-squared test

data:  contingency_table
X-squared = 138.03, df = 18, p-value < 2.2e-16
chisq2$observed
            
              4  5  6  8
  2seater     0  0  0  5
  compact    32  2 13  0
  midsize    16  0 23  2
  minivan     1  0 10  0
  pickup      3  0 10 20
  subcompact 21  2  7  5
  suv         8  0 16 38
round(chisq2$expected,2)
            
                 4    5     6     8
  2seater     1.73 0.09  1.69  1.50
  compact    16.27 0.80 15.87 14.06
  midsize    14.19 0.70 13.84 12.26
  minivan     3.81 0.19  3.71  3.29
  pickup     11.42 0.56 11.14  9.87
  subcompact 12.12 0.60 11.82 10.47
  suv        21.46 1.06 20.93 18.55
#The chi-square statistic for each cell is also the Pearson residual
round(chisq2$residuals, 3)
            
                  4      5      6      8
  2seater    -1.316 -0.292 -1.299  2.865
  compact     3.900  1.335 -0.720 -3.750
  midsize     0.480 -0.837  2.462 -2.931
  minivan    -1.439 -0.434  3.262 -1.814
  pickup     -2.492 -0.751 -0.342  3.224
  subcompact  2.553  1.812 -1.401 -1.691
  suv        -2.906 -1.029 -1.078  4.517
#We can visualize Pearson residuals like this
library(corrplot)
corrplot(chisq2$residuals, is.cor = FALSE)

# Contribution in percentage (%)
contrib <- 100*chisq2$residuals^2/chisq2$statistic
round(contrib, 3)
            
                  4      5      6      8
  2seater     1.254  0.062  1.223  5.948
  compact    11.020  1.291  0.375 10.186
  midsize     0.167  0.508  4.390  6.224
  minivan     1.500  0.136  7.709  2.384
  pickup      4.500  0.409  0.085  7.528
  subcompact  4.720  2.379  1.422  2.070
  suv         6.117  0.768  0.842 14.782
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)

# printing the p-value (note: chisq here is still the housetasks test from Activity 1; for the mpg test we would use chisq2$p.value)
chisq$p.value
[1] 0
# printing the estimate (NULL, as a chi-squared test has no parameter estimate)
chisq$estimate
NULL
library(gmodels)
Registered S3 method overwritten by 'gdata':
  method         from  
  reorder.factor gplots
#The CrossTable command produces frequencies, as well as table, row & column proportions
CrossTable(mpg$class, mpg$cyl)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  234 

 
             | mpg$cyl 
   mpg$class |         4 |         5 |         6 |         8 | Row Total | 
-------------|-----------|-----------|-----------|-----------|-----------|
     2seater |         0 |         0 |         0 |         5 |         5 | 
             |     1.731 |     0.085 |     1.688 |     8.210 |           | 
             |     0.000 |     0.000 |     0.000 |     1.000 |     0.021 | 
             |     0.000 |     0.000 |     0.000 |     0.071 |           | 
             |     0.000 |     0.000 |     0.000 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     compact |        32 |         2 |        13 |         0 |        47 | 
             |    15.210 |     1.782 |     0.518 |    14.060 |           | 
             |     0.681 |     0.043 |     0.277 |     0.000 |     0.201 | 
             |     0.395 |     0.500 |     0.165 |     0.000 |           | 
             |     0.137 |     0.009 |     0.056 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     midsize |        16 |         0 |        23 |         2 |        41 | 
             |     0.230 |     0.701 |     6.059 |     8.591 |           | 
             |     0.390 |     0.000 |     0.561 |     0.049 |     0.175 | 
             |     0.198 |     0.000 |     0.291 |     0.029 |           | 
             |     0.068 |     0.000 |     0.098 |     0.009 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
     minivan |         1 |         0 |        10 |         0 |        11 | 
             |     2.070 |     0.188 |    10.641 |     3.291 |           | 
             |     0.091 |     0.000 |     0.909 |     0.000 |     0.047 | 
             |     0.012 |     0.000 |     0.127 |     0.000 |           | 
             |     0.004 |     0.000 |     0.043 |     0.000 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
      pickup |         3 |         0 |        10 |        20 |        33 | 
             |     6.211 |     0.564 |     0.117 |    10.391 |           | 
             |     0.091 |     0.000 |     0.303 |     0.606 |     0.141 | 
             |     0.037 |     0.000 |     0.127 |     0.286 |           | 
             |     0.013 |     0.000 |     0.043 |     0.085 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
  subcompact |        21 |         2 |         7 |         5 |        35 | 
             |     6.515 |     3.284 |     1.963 |     2.858 |           | 
             |     0.600 |     0.057 |     0.200 |     0.143 |     0.150 | 
             |     0.259 |     0.500 |     0.089 |     0.071 |           | 
             |     0.090 |     0.009 |     0.030 |     0.021 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
         suv |         8 |         0 |        16 |        38 |        62 | 
             |     8.444 |     1.060 |     1.162 |    20.403 |           | 
             |     0.129 |     0.000 |     0.258 |     0.613 |     0.265 | 
             |     0.099 |     0.000 |     0.203 |     0.543 |           | 
             |     0.034 |     0.000 |     0.068 |     0.162 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |        81 |         4 |        79 |        70 |       234 | 
             |     0.346 |     0.017 |     0.338 |     0.299 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|

 

Week 7

One sample t-test (parametric)

This is used to compare the mean of one sample to a known standard (hypothetical) mean (mu)

#We will use an example data set
set.seed(1234)
my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10,20,2), 1)
)
#Check data
head(my_data, 10)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2
#Statistical summaries of weight
summary(my_data$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.30   18.38   18.90   19.25   20.82   22.20 
#Visualize data
library(ggpubr)
ggboxplot(my_data$weight,
          ylab = "Weight(g)", xlab = FALSE,
          ggtheme = theme_minimal())

#As the sample size is not large enough (<30) we need to check whether the data follows a normal distribution
shapiro.test(my_data$weight)

    Shapiro-Wilk normality test

data:  my_data$weight
W = 0.9526, p-value = 0.6993
#As the p value = 0.6993 (> 0.05), the data are not significantly different from a normal distribution, so we can assume normality

#Visual inspection using Q-Q plots (draws correlation between sample and normal distribution)
ggqqplot(my_data$weight, ylab = "Men's weight",
         ggtheme = theme_minimal())

#Question: does the average weight of mice differ from 25g (two tailed)
res <- t.test(my_data$weight, mu=25)
res

    One Sample t-test

data:  my_data$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
 17.8172 20.6828
sample estimates:
mean of x 
    19.25 
#The p value is less than the significance level, so we conclude that the mean weight of the mice is significantly different from 25g

#If you want to test whether it is less than or greater than 25g:
t.test(my_data$weight, mu=25,
       alternative = "less")

    One Sample t-test

data:  my_data$weight
t = -9.0783, df = 9, p-value = 3.977e-06
alternative hypothesis: true mean is less than 25
95 percent confidence interval:
     -Inf 20.41105
sample estimates:
mean of x 
    19.25 
t.test(my_data$weight, mu=25,
       alternative = "greater")

    One Sample t-test

data:  my_data$weight
t = -9.0783, df = 9, p-value = 1
alternative hypothesis: true mean is greater than 25
95 percent confidence interval:
 18.08895      Inf
sample estimates:
mean of x 
    19.25 
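To connect the t.test() output with the formula, here is a small check of my own (not from the tutorial): the one-sample t statistic is the difference between the sample mean and mu divided by the standard error, t = (m - mu) / (s / sqrt(n)).

#Hand computation of the one-sample t statistic and its two-sided p-value (my own check)
m <- mean(my_data$weight)        # sample mean
s <- sd(my_data$weight)          # sample standard deviation
n <- length(my_data$weight)      # sample size
t_stat <- (m - 25) / (s / sqrt(n))
t_stat                           # should match the t reported by t.test()
2 * pt(-abs(t_stat), df = n - 1) # two-sided p-value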

One-sample Wilcoxon Signed Rank Test (non-parametric)

Alternative to one-sample t-test when data cannot be assumed to be normally distributed

Used to determine whether the median of the sample is equal to a known standard value

#We will use the same mice data above
head(my_data, 10)
   name weight
1   M_1   17.6
2   M_2   20.6
3   M_3   22.2
4   M_4   15.3
5   M_5   20.9
6   M_6   21.0
7   M_7   18.9
8   M_8   18.9
9   M_9   18.9
10 M_10   18.2
#Question:Does the average weight of the mice differ from 25g (two tailed)
res <- wilcox.test(my_data$weight, mu = 25)
Warning in wilcox.test.default(my_data$weight, mu = 25): cannot compute exact
p-value with ties
res

    Wilcoxon signed rank test with continuity correction

data:  my_data$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
res$p.value
[1] 0.005793045
#The p value is less than the significance level. We reject the null hypothesis and conclude that the average weight of the mice is significantly different from 25g

Unpaired Two-Samples T-test (parametric)

Used to compare the mean of two independent groups

#We will use an example data set
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4) 

# Create a data frame
my_data <- data.frame( 
  group = rep(c("Woman", "Man"), each = 9),
  weight = c(women_weight,  men_weight)
)
print(my_data)
   group weight
1  Woman   38.9
2  Woman   61.2
3  Woman   73.3
4  Woman   21.8
5  Woman   63.4
6  Woman   64.6
7  Woman   48.4
8  Woman   48.8
9  Woman   48.5
10   Man   67.8
11   Man   60.0
12   Man   63.4
13   Man   76.0
14   Man   89.4
15   Man   73.3
16   Man   67.3
17   Man   61.3
18   Man   62.4
library(dplyr)
#Compute summary statistics by groups
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 2 × 4
  group count  mean    sd
  <chr> <int> <dbl> <dbl>
1 Man       9  69.0  9.38
2 Woman     9  52.1 15.6 
#Visualize data by using boxplot
ggboxplot(my_data, x = "group", y="weight",
          color = "group", palette = c("#00AFBB", "#E7B800"),
          ylab = "Weight", xlab = "Groups")

#We should check the independent t-test assumptions
#Assumption 1: Are the two samples independent? Yes, as the samples from women and men aren't related
#Assumption 2: Do the data from each of the 2 groups follow a normal distribution?
#Assumption 3: Do the two populations have the same variances?

#To check assumption 2, we should conduct a Shapiro-Wilk normality test

with(my_data, shapiro.test(weight[group == "Man"]))

    Shapiro-Wilk normality test

data:  weight[group == "Man"]
W = 0.86425, p-value = 0.1066
#p = 0.1
with(my_data, shapiro.test(weight[group == "Woman"]))

    Shapiro-Wilk normality test

data:  weight[group == "Woman"]
W = 0.94266, p-value = 0.6101
#p = 0.6
# As both of the p-values are greater than the significance level (0.05), we assume that the distributions are not significantly different from the normal distribution


#To check assumption 3, we should use the f-test to test for homogeneity in variances
res.ftest <- var.test(weight ~ group, data = my_data)
res.ftest

    F test to compare two variances

data:  weight by group
F = 0.36134, num df = 8, denom df = 8, p-value = 0.1714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.08150656 1.60191315
sample estimates:
ratio of variances 
         0.3613398 
#The p value of the F test is greater than the significance level, so there is no significant difference between the variances of the two sets of data
#As there is homogeneity of variances, we can use the classic t-test, which assumes equality of the two variances

#Question: Is there any significant difference between women's and men's weights?
#There are two methods:
#Method 1
res <- t.test(women_weight, men_weight, var.equal = TRUE)
res

    Two Sample t-test

data:  women_weight and men_weight
t = -2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -29.748019  -4.029759
sample estimates:
mean of x mean of y 
 52.10000  68.98889 
#Method 2
res <- t.test(weight ~ group, data = my_data, var.equal = TRUE)
res

    Two Sample t-test

data:  weight by group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
95 percent confidence interval:
  4.029759 29.748019
sample estimates:
  mean in group Man mean in group Woman 
           68.98889            52.10000 
#The p-value is less than the significance level. We can conclude that men's average weight is significantly different from women's average weight
res$p.value
[1] 0.0132656
res$estimate
  mean in group Man mean in group Woman 
           68.98889            52.10000 
res$conf.int
[1]  4.029759 29.748019
attr(,"conf.level")
[1] 0.95
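As a side note of my own: if the F-test had indicated unequal variances, we could instead use Welch's t-test, which does not assume equal variances and is what t.test() runs by default when var.equal is left as FALSE.

#Welch two-sample t-test (no equal-variance assumption); the default behaviour of t.test()
t.test(weight ~ group, data = my_data)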

Unpaired two-samples Wilcoxon test (non-parametric)

Alternative to the unpaired two-sample t test

Used when your data are not normally distributed

#We will use the same example data above
print(my_data)
   group weight
1  Woman   38.9
2  Woman   61.2
3  Woman   73.3
4  Woman   21.8
5  Woman   63.4
6  Woman   64.6
7  Woman   48.4
8  Woman   48.8
9  Woman   48.5
10   Man   67.8
11   Man   60.0
12   Man   63.4
13   Man   76.0
14   Man   89.4
15   Man   73.3
16   Man   67.3
17   Man   61.3
18   Man   62.4
#Compute summary by groups
group_by(my_data, group) %>%
  summarise(
    count = n(),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )
# A tibble: 2 × 4
  group count median   IQR
  <chr> <int>  <dbl> <dbl>
1 Man       9   67.3  10.9
2 Woman     9   48.8  15  
#Visualize with boxplot

#Question: Is there any significant difference between women's and men's weights?
#Method 1
res <- wilcox.test(women_weight, men_weight)
Warning in wilcox.test.default(women_weight, men_weight): cannot compute exact
p-value with ties
res

    Wilcoxon rank sum test with continuity correction

data:  women_weight and men_weight
W = 15, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
#Method 2
res <- wilcox.test(weight ~ group, data = my_data,
                   exact = FALSE)
res

    Wilcoxon rank sum test with continuity correction

data:  weight by group
W = 66, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
res$p.value
[1] 0.02711657
#As the p value is less than the significance level, we can conclude that men's median weight is significantly different from women's median weight

Paired Samples T-test

Used to compare the means between two related groups of samples

#We will use example data
# Weight of the mice before treatment
before <-c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <-c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
my_data <- data.frame( 
  group = rep(c("before", "after"), each = 10),
  weight = c(before,  after)
)

#Check data
print(my_data)
    group weight
1  before  200.1
2  before  190.9
3  before  192.7
4  before  213.0
5  before  241.4
6  before  196.9
7  before  172.2
8  before  185.5
9  before  205.2
10 before  193.7
11  after  392.9
12  after  393.2
13  after  345.1
14  after  393.0
15  after  434.0
16  after  427.9
17  after  422.0
18  after  383.9
19  after  392.3
20  after  352.2
#Compute summary stats by groups
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 2 × 4
  group  count  mean    sd
  <chr>  <int> <dbl> <dbl>
1 after     10  394.  29.4
2 before    10  199.  18.5
#Visualize data using boxplot
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800"),
          order = c("before", "after"),
          ylab = "Weight", xlab = "Groups")

#Subset weight data before treatment
before <- subset(my_data, group == "before", weight, 
                 drop = TRUE)
#Subset weight data after treatment
after <- subset(my_data, group == "after", weight, 
                drop = TRUE)
#Plot paired data
library(PairedData)
Loading required package: MASS

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
Loading required package: gld
Loading required package: mvtnorm
Loading required package: lattice

Attaching package: 'PairedData'
The following object is masked from 'package:base':

    summary
pd <- paired(before,after)
plot(pd, type = "profile") + theme_bw()

#We then have to check the paired t-test assumptions
#Assumption 1: Are the two samples paired? Yes
#Assumption 2: Is this a large sample? NO, therefore we need to check whether the differences of the pairs follow a normal distribution

#Check if data are normally distributed:
d <- with(my_data, 
          weight[group == "before"] - weight[group == "after"])
#Shapiro-Wilk normality test for the differences
shapiro.test(d)

    Shapiro-Wilk normality test

data:  d
W = 0.94536, p-value = 0.6141
#As the p value is greater than the significance level, this implies that the distribution of the differences is not significantly different from a normal distribution

#Question: Are there any significant changes in the weights of mice after treatment?
#Method 1
res <- t.test(before, after, paired = TRUE)
res

    Paired t-test

data:  before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -215.5581 -173.4219
sample estimates:
mean difference 
        -194.49 
#The p value is less than the significance level. We reject the null hypothesis and conclude that the average weight of the mice before treatment is significantly different from the average weight after treatment

Paired Samples Wilcoxon Test (non-parametric)

Alternative to paired t-test when data are not normally distributed

#We will use the same data as the last example

print(my_data)
    group weight
1  before  200.1
2  before  190.9
3  before  192.7
4  before  213.0
5  before  241.4
6  before  196.9
7  before  172.2
8  before  185.5
9  before  205.2
10 before  193.7
11  after  392.9
12  after  393.2
13  after  345.1
14  after  393.0
15  after  434.0
16  after  427.9
17  after  422.0
18  after  383.9
19  after  392.3
20  after  352.2
#compute summary stats by groups
#Visualized by boxplot

#Subset weight data before treatment
before <- subset(my_data, group == "before", weight, 
                 drop = TRUE)
#Subset weight data after treatment
after <- subset(my_data, group == "after", weight, 
                drop = TRUE)
#Plot paired data


library(PairedData)
pd <- paired(before,after)
plot(pd, type = "profile") + theme_bw()

#Plot paired data


#Question: Are there any significant changes in the weights of mice before and after treatment?
#Method 1
res <- wilcox.test(before, after, paired = TRUE)
res

    Wilcoxon signed rank exact test

data:  before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
res$p.value
[1] 0.001953125
#As the p value is less than the significance level, we conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment

One-Way ANOVA Test

An extension of the independent two-samples t-test for comparing means in a situation where there are more than two groups

#We will use an example data
my_data <-PlantGrowth
set.seed(1234)
dplyr::sample_n(my_data, 10)
   weight group
1    6.15  trt2
2    3.83  trt1
3    5.29  trt2
4    5.12  trt2
5    4.50  ctrl
6    4.17  trt1
7    5.87  trt1
8    5.33  ctrl
9    5.26  trt2
10   4.61  ctrl
#Show the levels
levels(my_data$group)
[1] "ctrl" "trt1" "trt2"
#If they aren't in the correct order, re-order them like this:
my_data$group <- ordered(my_data$group,
                         levels = c("ctrl", "trt1", "trt2"))

#Compute summary stats by groups
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
# A tibble: 3 × 4
  group count  mean    sd
  <ord> <int> <dbl> <dbl>
1 ctrl     10  5.03 0.583
2 trt1     10  4.66 0.794
3 trt2     10  5.53 0.443
#Visualise data
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight", 
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

#Plot weight by group
ggline(my_data, x = "group", y = "weight", 
       add = c("mean_se", "jitter"), 
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

#Is there any significant difference between the average weights of plants in the 3 experimental conditions?
#Compute the analysis of variance
res.aov <- aov(weight ~ group, data = my_data)
summary(res.aov)
            Df Sum Sq Mean Sq F value Pr(>F)  
group        2  3.766  1.8832   4.846 0.0159 *
Residuals   27 10.492  0.3886                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#As the P value is less than the significance level, we conclude that there are significant differences between the groups 


#As the group means are different, we want to find out which pairs of groups are different
#Therefore we should perform multiple pairwise-comparisons

#Tukey multiple pairwise-comparisons
TukeyHSD(res.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = weight ~ group, data = my_data)

$group
            diff        lwr       upr     p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1  0.865  0.1737839 1.5562161 0.0120064
#The output says that only the difference between trt2 and trt1 is significant

#We can also use the pairwise.t.test to calculate pairwise comparisons between group levels with corrections for multiple testing
pairwise.t.test(my_data$weight, my_data$group,
                p.adjust.method = "BH")

    Pairwise comparisons using t tests with pooled SD 

data:  my_data$weight and my_data$group 

     ctrl  trt1 
trt1 0.194 -    
trt2 0.132 0.013

P value adjustment method: BH 
#Lets check ANOVA assumptions
#1. Check the homogeneity of variance assumption
plot(res.aov, 1)

#There is no evident relationship between the residuals and the fitted values, so we can assume homogeneity of variances

#2.Check the normality assumption
plot(res.aov, 2)

#As all the points fall approximately along this reference line, we can assume normality
#This can also be supported by the Shapiro-Wilk test
aov_residuals <-residuals(object = res.aov)
shapiro.test(x = aov_residuals)

    Shapiro-Wilk normality test

data:  aov_residuals
W = 0.96607, p-value = 0.4379
#If these assumptions are not met, we can use the Kruskal-Wallis rank sum test
kruskal.test(weight ~ group, data= my_data)

    Kruskal-Wallis rank sum test

data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842

Two-Way ANOVA Test

Used to evaluate simultaneously the effect of two grouping variables on a response variable

#We will use example data
my_data <- ToothGrowth
set.seed(1234)
dplyr::sample_n(my_data, 10)
    len supp dose
1  21.5   VC  2.0
2  17.3   VC  1.0
3  27.3   OJ  2.0
4  18.5   VC  2.0
5   8.2   OJ  0.5
6  26.4   OJ  1.0
7  25.8   OJ  1.0
8   5.2   VC  0.5
9   6.4   VC  0.5
10  9.4   OJ  0.5
str(my_data)
'data.frame':   60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#Dose is considered a numeric variable, so we will change it to a factor variable
my_data$dose <- factor(my_data$dose,
                        levels = c(0.5,1,2),
                        labels = c("D0.5", "D1", "D2"))
head(my_data)
   len supp dose
1  4.2   VC D0.5
2 11.5   VC D0.5
3  7.3   VC D0.5
4  5.8   VC D0.5
5  6.4   VC D0.5
6 10.0   VC D0.5
#Question: Does tooth length depend on supp and dose?

table(my_data$supp, my_data$dose)
    
     D0.5 D1 D2
  OJ   10 10 10
  VC   10 10 10
#Visualize data
library("ggpubr")
ggboxplot(my_data, x ="dose", y="len", color = "supp",
          palette = c("#00AFBB", "#E7B800"))

#Line plots with multiple groups
library("ggpubr")
ggline(my_data, x = "dose", y = "len", color = "supp",
       add = c("mean_se", "dotplot"),
       palette = c("#00AFBB", "#E7B800"))
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.

res.aov2 <- aov(len ~ supp + dose, data = my_data)
summary(res.aov2)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4   14.02 0.000429 ***
dose         2 2426.4  1213.2   82.81  < 2e-16 ***
Residuals   56  820.4    14.7                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#We can conclude that both supp and dose are statistically significant

#Two-way ANOVA with interaction effect
#These two calls are equivalent 
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data)
summary(res.aov3)
            Df Sum Sq Mean Sq F value   Pr(>F)    
supp         1  205.4   205.4  15.572 0.000231 ***
dose         2 2426.4  1213.2  92.000  < 2e-16 ***
supp:dose    2  108.3    54.2   4.107 0.021860 *  
Residuals   54  712.1    13.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The two main effects (supp and dose) are statistically significant, as well as their interaction


#Compute summary stats
require("dplyr")
group_by(my_data, supp, dose) %>%
  summarise(
    count = n(),
    mean = mean(len, na.rm = TRUE),
    sd = sd(len, na.rm = TRUE)
  )
`summarise()` has grouped output by 'supp'. You can override using the
`.groups` argument.
# A tibble: 6 × 5
# Groups:   supp [2]
  supp  dose  count  mean    sd
  <fct> <fct> <int> <dbl> <dbl>
1 OJ    D0.5     10 13.2   4.46
2 OJ    D1       10 22.7   3.91
3 OJ    D2       10 26.1   2.66
4 VC    D0.5     10  7.98  2.75
5 VC    D1       10 16.8   2.52
6 VC    D2       10 26.1   4.80
model.tables(res.aov3, type="means", se = TRUE)
Tables of means
Grand mean
         
18.81333 

 supp 
supp
    OJ     VC 
20.663 16.963 

 dose 
dose
  D0.5     D1     D2 
10.605 19.735 26.100 

 supp:dose 
    dose
supp D0.5  D1    D2   
  OJ 13.23 22.70 26.06
  VC  7.98 16.77 26.14

Standard errors for differences of means
          supp   dose supp:dose
        0.9376 1.1484    1.6240
replic.     30     20        10
#We will perform the Tukey HSD test for the factor variable "dose" as it has more than 2 levels
TukeyHSD(res.aov3, which = "dose")
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = len ~ supp + dose + supp:dose, data = my_data)

$dose
          diff       lwr       upr   p adj
D1-D0.5  9.130  6.362488 11.897512 0.0e+00
D2-D0.5 15.495 12.727488 18.262512 0.0e+00
D2-D1    6.365  3.597488  9.132512 2.7e-06
#All pairwise comparisons are significant with an adjusted p value<0.05

#The pairwise.t.test can also be used to calculate pairwise comparisons between group levels with corrections for multiple testing
pairwise.t.test(my_data$len, my_data$dose,
                p.adjust.method = "BH")

    Pairwise comparisons using t tests with pooled SD 

data:  my_data$len and my_data$dose 

   D0.5    D1     
D1 1.0e-08 -      
D2 4.4e-16 1.4e-05

P value adjustment method: BH 
#Check ANOVA assumptions
#1. Homogeneity of variances
plot(res.aov3,1)

#Points 23 and 32 are flagged as outliers; removing them can help meet the test assumptions
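#A minimal sketch of that removal (my own, not from the course notes), assuming rows 23 and 32 are the flagged observations:
my_data_no_outliers <- my_data[-c(23, 32), ]            #drop the two flagged rows
res.aov3_clean <- aov(len ~ supp * dose, data = my_data_no_outliers)
summary(res.aov3_clean)                                 #re-check the results without the outliers
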
#Use Levene's test to check homogeneity of variances
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
leveneTest(len ~ supp*dose, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  5  1.7086 0.1484
      54               
# The p value is greater than the significance level, which means we can assume homogeneity of variances across the treatment groups

#2. Normality
plot(res.aov3, 2)

#As the points fall approx along the reference line, we assume normality 
#We can also support this with the Shapiro-Wilk test
aov_residuals <- residuals(object = res.aov3)
shapiro.test(x = aov_residuals)

    Shapiro-Wilk normality test

data:  aov_residuals
W = 0.98499, p-value = 0.6694
#This finds no indication that normality is violated


#When the design is unbalanced (unequal group sizes), use Anova() from the car package with Type III sums of squares:
library(car)
my_anova <- aov(len ~ supp * dose, data = my_data)
Anova(my_anova, type = "III")
Anova Table (Type III tests)

Response: len
             Sum Sq Df F value    Pr(>F)    
(Intercept) 1750.33  1 132.730 3.603e-16 ***
supp         137.81  1  10.450  0.002092 ** 
dose         885.26  2  33.565 3.363e-10 ***
supp:dose    108.32  2   4.107  0.021860 *  
Residuals    712.11 54                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

MANOVA test

When there are multiple response variables you can test them simultaneously using a multivariate analysis of variance (MANOVA)

my_data <- iris
set.seed(1234)
dplyr::sample_n(my_data, 10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species  size
1           5.2         3.5          1.5         0.2     setosa small
2           5.7         2.6          3.5         1.0 versicolor small
3           6.3         3.3          6.0         2.5  virginica   big
4           6.5         3.2          5.1         2.0  virginica   big
5           6.3         3.4          5.6         2.4  virginica   big
6           6.4         2.8          5.6         2.2  virginica   big
7           6.8         3.2          5.9         2.3  virginica   big
8           7.9         3.8          6.4         2.0  virginica   big
9           6.2         2.9          4.3         1.3 versicolor   big
10          7.1         3.0          5.9         2.1  virginica   big
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
# MANOVA test
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(res.man)
           Df Pillai approx F num Df den Df    Pr(>F)    
Species     2 0.9885   71.829      4    294 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#As the p value is significant, we see which ones differ:
summary.aov(res.man)
 Response Sepal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals   147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response Petal.Length :
             Df Sum Sq Mean Sq F value    Pr(>F)    
Species       2 437.10 218.551  1180.2 < 2.2e-16 ***
Residuals   147  27.22   0.185                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#It can be seen that the variables are highly significantly different among species 

Kruskal-Wallis test

Non-parametric alternative to one-way ANOVA

my_data <- PlantGrowth 
head(my_data)
  weight group
1   4.17  ctrl
2   5.58  ctrl
3   5.18  ctrl
4   6.11  ctrl
5   4.50  ctrl
6   4.61  ctrl
levels(my_data$group)
[1] "ctrl" "trt1" "trt2"
#Summary stats by groups
library(dplyr)
group_by(my_data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE),
    median = median(weight, na.rm = TRUE),
    IQR = IQR(weight, na.rm = TRUE)
  )

#Visualize data
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

#Mean plots
library("ggpubr")
ggline(my_data, x = "group", y = "weight",
       add = c("mean_se", "jitter"),
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

#Question: Is there any significant difference between the average weights of plants in the 3 experimental conditions?

kruskal.test(weight ~ group, data = my_data)

    Kruskal-Wallis rank sum test

data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
#As the p value is less than the significance level, we conclude that there are significant differences between the treatment groups

#We then conduct a pairwise.wilcox.test to calculate pairwise comparisons between group levels with corrections for multiple testing
pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "BH")
#This shows that only trt1 and trt2 are significantly different

Week 8

Exercise 5 - Correlation

I have completed 10 questions as I have grasped the concept of correlation well.

  1. Which of the following correlation coefficients expresses the strongest association?
     a) 0.55 b) 0.09 c) -0.77 d) 0.1 e) 1.05

I believe it is c, as the strength of a correlation is determined by how close the coefficient is to -1 or 1, and e is not a valid coefficient because correlation coefficients have to be between -1 and 1.
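
A quick way to check this in R (my own illustration, not part of the exercise): take the absolute values of the valid options and see which is largest.

r <- c(a = 0.55, b = 0.09, c = -0.77, d = 0.1)  #e) 1.05 is excluded as it is not a valid coefficient
abs(r)                                          #c has the largest absolute value, so the strongest association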

  2. For which of the coefficients from the previous question does it apply that a person with an above-average score on X will probably also have an above-average score on Y?

I believe it is coefficient a, as this requires a positive correlation, and a is the strongest positive (moderate) coefficient among the options.

  3. We have five representative samples of people aged 15, 20, 30, 45 and 60 years who completed a questionnaire on political conservatism. In these five samples, in the given order, the average political conservatism scores were: 60, 85, 80, 70, 65. The correlation between age and political conservatism is:
     a) 1.0 b) -1.0 c) linear d) nonlinear

I think it is d (nonlinear): as age increases from 15 to 20, conservatism increases, but as age increases from 20 to 60, conservatism decreases, so the relationship is not a straight line.
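
A quick numeric check (my own, using the five sample means as data points) shows the Pearson correlation is weak even though the pattern is clear, which is how a nonlinear relationship looks to r:

age <- c(15, 20, 30, 45, 60)
conservatism <- c(60, 85, 80, 70, 65)
cor(age, conservatism)   #about -0.25: weak linear association despite the clear rise-then-fall pattern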

  4. For each scatterplot below, choose the corresponding association description:
  1. perfect positive linear relationship (r = 1.0)

    I think this corresponds to graph c

  2. moderate positive linear relationship (r = 0.5)

    I think this corresponds to graph d

  3. no linear relationship (r = 0)

    I think this corresponds to graph e

  4. moderate negative relationship (r = -0.5)

    I think this corresponds to graph a

  5. perfect negative linear relationship (r = -1.0)

    I think this corresponds to graph b

  5. How is Pearson’s coefficient influenced by:

     a) limited variability?

     If there is limited variability (a restricted range), Pearson’s coefficient will be lower (closer to 0)

     b) differences in distribution of the correlated variables?

     As Pearson’s coefficient assumes that the variables are normally distributed, the coefficient may not accurately represent the strength or direction of the relationship

     c) outliers?

     Outliers can significantly distort the correlation, as they pull the regression line towards them and can therefore create a false impression of correlation (see the small demonstration after this list)

     d) using extreme groups as samples?

     The correlation/relationship may appear stronger or weaker than it really is, as the full range of data is not considered
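
To illustrate the outliers point above, here is a small toy demonstration of my own (not from the exercise): two unrelated variables show little correlation until a single extreme point is added, which drags the coefficient up.

set.seed(42)
x <- rnorm(20)
y <- rnorm(20)            #generated independently of x, so any correlation is chance
cor(x, y)                 #weak at most
x_out <- c(x, 10)
y_out <- c(y, 10)         #add one extreme point at (10, 10)
cor(x_out, y_out)         #much larger, driven almost entirely by the single outlier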

  6. Estimate the correlation between the variable pairs listed below – is it positive, negative, or zero?

     a) height in cm, weight in kg - positive
     b) age in months, time in run for 50 meters - positive
     c) math grade, reading grade - positive
     d) math grade, number of missed school lessons in a year - negative
     e) IQ, personal identification number - zero
     f) interest in sports, interest in politics - zero
     g) mileage on car speedometer, year when the car was produced - negative
     h) maximum daily temperature, household water consumption per day - positive

  7. Suppose the answer to question 6.g was r = -0.8; how would the correlation coefficient change if, instead of the variable “year when the car was produced”, we used the variable “car’s age”?

    It would change to a positive correlation of the same strength (r = 0.8), since a car’s age is just the production year reversed

  8. If the correlation between X and Y is 0.5, how does the correlation change if we transform X to T-scores?

    The correlation will not change, as T-scores are a linear transformation of the original variable

  9. If r = 1 and zx = -0.5, what is zy? If r = -1 and zx = 0.8, what is zy?

    The z scores are the standardized scores of X and Y, and the predicted standardized score is zy = r * zx

    zy = 1 * -0.5 = -0.5 , zy = -1 * 0.8 = -0.8

  10. Regarding interpretation, the difference between correlations 0.2 and 0.4 and between correlations 0.5 and 0.7 is: a) approximately the same b) bigger between the first pair than between the second pair c) smaller between the first pair than between the second pair d) impossible to compare. Justify your answer.

I think the answer is c, as correlation is not a linear scale (see the quick calculation below)
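
One way to justify this (my own reasoning) is to compare the shared variance r^2 for each pair: the jump from 0.5 to 0.7 corresponds to a much larger increase in explained variance than the jump from 0.2 to 0.4.

0.4^2 - 0.2^2   #0.16 - 0.04 = 0.12 increase in shared variance
0.7^2 - 0.5^2   #0.49 - 0.25 = 0.24, twice as large, so the second pair differ more in practical terms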

Week 10

library(tidyverse)
library(caret)
Warning: package 'caret' was built under R version 4.4.2

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
theme_set(theme_bw())

#Load the data; note that it contains missing values (NA), which will matter for the model accuracy below
data("PimaIndiansDiabetes2", package = "mlbench")

#Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
    pregnant glucose pressure triceps insulin mass pedigree age diabetes
726        4     112       78      40      NA 39.4    0.236  38      neg
602        6      96       NA      NA      NA 23.7    0.190  28      neg
326        1     157       72      21     168 25.6    0.123  24      neg
#Split the data into training and test set
#Randomly split the data into a training set (80%, used to build the predictive model) and a test set (20%, used to evaluate the model)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p =0.8, list = FALSE)  
train.data <- PimaIndiansDiabetes2[training.samples,]
test.data <- PimaIndiansDiabetes2[-training.samples,]

#Fit the model
model <- glm(diabetes ~., data = train.data, family = binomial)
#Summarize the model
summary(model)

Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.026e+01  1.392e+00  -7.370 1.71e-13 ***
pregnant     3.562e-02  6.256e-02   0.569  0.56911    
glucose      3.969e-02  6.817e-03   5.822 5.80e-09 ***
pressure    -3.277e-03  1.306e-02  -0.251  0.80184    
triceps     -1.009e-03  1.971e-02  -0.051  0.95916    
insulin     -6.832e-04  1.445e-03  -0.473  0.63645    
mass         8.291e-02  3.171e-02   2.615  0.00893 ** 
pedigree     1.619e+00  5.010e-01   3.231  0.00123 ** 
age          3.520e-02  2.001e-02   1.759  0.07849 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 399.60  on 314  degrees of freedom
Residual deviance: 278.74  on 306  degrees of freedom
  (300 observations deleted due to missingness)
AIC: 296.74

Number of Fisher Scoring iterations: 5
#Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
#Model accuracy
mean(predicted.classes ==test.data$diabetes)
[1] NA
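#The accuracy is NA because the test set still contains missing values, so some predictions are NA
#Two possible fixes (my own sketch, not part of the original output):
#Option 1 - ignore the NA predictions when computing accuracy
mean(predicted.classes == test.data$diabetes, na.rm = TRUE)
#Option 2 - drop incomplete rows up front (before the split) and refit the model
PimaComplete <- na.omit(PimaIndiansDiabetes2)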
#Simple logistic regression is used to predict the probability of class membership based on a single predictor variable
# This builds a model to predict the probability of being diabetes-positive based on the plasma glucose concentration
model <- glm(diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef
               Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -5.67020090 0.490507636 -11.55986 6.581398e-31
glucose      0.04017613 0.003775009  10.64266 1.886704e-26
#The intercept is -5.67 and the coefficient of the glucose variable is 0.040
#The logistic equation can be written as p = exp(-5.67 + 0.040*glucose)/[1 + exp(-5.67 + 0.040*glucose)]

#Using this formula, for each new plasma glucose concentration value you can predict the probability of an individual being diabetes-positive
newdata <- data.frame(glucose = c(20,180))
probabilities <- model %>% predict(newdata, type ="response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes
    1     2 
"neg" "pos" 
#What does this look like on the graph of logistic regression model?
train.data %>%
  mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
  ggplot(aes(glucose, prob)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Logistic Regression Model", 
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabete-pos"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).

#Multiple logistic regression is used to predict the probability of class membership based on multiple predictor variables
model <-glm(diabetes ~ glucose + mass + pregnant,
            data = train.data, family = binomial)
summary(model)$coef
               Estimate  Std. Error    z value     Pr(>|z|)
(Intercept) -8.44301872 0.755764024 -11.171501 5.622253e-29
glucose      0.03715272 0.003835489   9.686568 3.438916e-22
mass         0.08205533 0.016232625   5.054964 4.304725e-07
pregnant     0.10959903 0.030364055   3.609499 3.067888e-04
#We want to include all predictor variables in the data set
model <- glm(diabetes ~., data = train.data, family = binomial)
summary(model)$coef
                 Estimate  Std. Error     z value     Pr(>|z|)
(Intercept) -1.025834e+01 1.391940017 -7.36981709 1.708623e-13
pregnant     3.561814e-02 0.062557686  0.56936476 5.691086e-01
glucose      3.969191e-02 0.006817086  5.82241482 5.800336e-09
pressure    -3.277149e-03 0.013057808 -0.25097238 8.018355e-01
triceps     -1.009412e-03 0.019712451 -0.05120681 9.591607e-01
insulin     -6.832247e-04 0.001445460 -0.47266941 6.364491e-01
mass         8.291187e-02 0.031709372  2.61474334 8.929453e-03
pedigree     1.618980e+00 0.501031092  3.23129629 1.232301e-03
age          3.520407e-02 0.020008021  1.75949760 7.849303e-02
#To only extract the coefficients:
coef(model)
  (Intercept)      pregnant       glucose      pressure       triceps 
-1.025834e+01  3.561814e-02  3.969191e-02 -3.277149e-03 -1.009412e-03 
      insulin          mass      pedigree           age 
-6.832247e-04  8.291187e-02  1.618980e+00  3.520407e-02 
#A predictor is significantly associated with the outcome if the p value of its coefficient estimate is less than 0.05
#Example: if glucose has b = 0.045, an increase in glucose is associated with an increase in the probability of being diabetes-positive
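#A common way to make the coefficients more interpretable (my own addition) is to exponentiate them into odds ratios
exp(coef(model))   #e.g. exp(0.0397) is about 1.04: each extra unit of glucose multiplies the odds of being diabetes-positive by about 1.04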



#Penguin data set:

library(tidyverse)
library(palmerpenguins)
library(ggplot2)

m1 <- glm(sex ~ body_mass_g, family = "binomial", data = penguins)
summary(m1)

Call:
glm(formula = sex ~ body_mass_g, family = "binomial", data = penguins)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.1625416  0.7243906  -7.127 1.03e-12 ***
body_mass_g  0.0012398  0.0001727   7.177 7.10e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 461.61  on 332  degrees of freedom
Residual deviance: 396.64  on 331  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 400.64

Number of Fisher Scoring iterations: 4
#Pseudo R^2
null <- m1$null.deviance/-2
model <- m1$deviance/-2
(null-model)/null
[1] 0.1407346
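#This is McFadden's pseudo R^2; an equivalent one-liner (my own) gives the same value
1 - m1$deviance / m1$null.deviance   #also about 0.14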
#Logistic model using all predictors
m2 <- glm(formula = sex ~., family = "binomial", data = penguins)
summary(m2)$coef
                       Estimate   Std. Error    z value     Pr(>|z|)
(Intercept)       -80.376671843 12.329734561 -6.5189296 7.081087e-11
speciesChinstrap   -7.402696697  1.662533652 -4.4526598 8.481309e-06
speciesGentoo      -8.427610932  2.597027352 -3.2450990 1.174098e-03
islandDream         0.324158306  0.809135282  0.4006231 6.886976e-01
islandTorgersen    -0.507858042  0.855745697 -0.5934684 5.528677e-01
bill_length_mm      0.614435757  0.131967898  4.6559487 3.224923e-06
bill_depth_mm       1.646445994  0.335797746  4.9030883 9.434156e-07
flipper_length_mm   0.026653725  0.048306899  0.5517581 5.811141e-01
body_mass_g         0.005818832  0.001087293  5.3516706 8.714591e-08
#Pseudo R^2
null2<- m2$null.deviance/ -2
model2 <- m2$deviance / -2
(null2 - model2)/null2
[1] 0.726939
#Final model
m3 <- glm(sex ~ bill_length_mm + bill_depth_mm + body_mass_g, family = "binomial", data = penguins)
summary(m3)

Call:
glm(formula = sex ~ bill_length_mm + bill_depth_mm + body_mass_g, 
    family = "binomial", data = penguins)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -6.056e+01  7.081e+00  -8.552  < 2e-16 ***
bill_length_mm  9.151e-02  4.416e-02   2.072   0.0382 *  
bill_depth_mm   2.063e+00  2.469e-01   8.356  < 2e-16 ***
body_mass_g     5.061e-03  6.348e-04   7.971 1.57e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 461.61  on 332  degrees of freedom
Residual deviance: 159.89  on 329  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 167.89

Number of Fisher Scoring iterations: 7
#Pseudo R^2
null3 <- m3$null.deviance / -2
model3 <- m3$deviance / -2
(null3 - model3)/null3
[1] 0.6536294
#Odds ratio - Penguin's sex
penguins %>%
  mutate(size = ifelse(body_mass_g > 4202, "large", "small")) -> penguins2

#This shows a contingency table of how many male and female penguins are classified as large or small based on their body mass
xtabs(~size+sex, data = penguins2)
       sex
size    female male
  large     53   92
  small    112   76
#Then you can compute the odds ratio either as (A*D)/(B*C) or as (A/B)/(C/D), which give the same result
(53/92)/(112/76)
[1] 0.3909161
#The odds ratio of 0.391 means that the odds of being female in the "large" group are about 0.39 times the odds of being female in the "small" group
#This suggests that males are more likely than females to be in the "large" group, while females are more likely than males to be in the "small" group
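#A quick check (my own) that the cross-product form gives the same odds ratio
(53 * 76) / (92 * 112)   #about 0.391, identical to (A/B)/(C/D) above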

#Modelling probabilities
#I will use penguins2 as I have added the column of size determined by body mass 

#Clean up the data to remove NAs
penguins2_clean <- penguins2 %>%
  filter(!is.na(sex), !is.na(body_mass_g))


m1 <- glm(sex ~ body_mass_g, family = "binomial", data = penguins2_clean)
penguins2_clean$fitted <- predict(m1, type = "response")

penguins2_clean %>%
  ggplot(aes(x = body_mass_g, y=fitted, color = size)) +
  geom_point() +
  labs(
    title = "Fitted Probabilities of Size Based on Body Mass", 
    x = "Body Mass (g)",
    y = "Fitted Probability"
  )

#Even though the pseudo R^2 value of 0.14 may seem low, the fitted probabilities plotted with ggplot look reasonable. A pseudo R^2 of 0.14 doesn't necessarily indicate a bad model, especially in logistic regression, where pseudo R^2 values are typically lower than R^2 values from linear regression