Shay’s (Shana’s) Workbook
Week 1
In Week 1, I followed the Quarto tutorial and completed the "Meet the Penguin" document. I played around with Quarto in RStudio. As you can see, I've learnt how to insert an image; however, I ran into some trouble with jpg and png files. I had to open the jpg file in the "Paint" app and save it again as a png file so that, when I rendered my work, the picture was displayed instead of shown as a file.
I then learnt how to change the theme, font colour and background colour with the help of the quarto website. From this, I learnt what a hex code is!
I've worked with R quite a lot during my Bachelor's, so I am familiar with ggplot and similar tools. Therefore, all I did to the output was change the value colours.
Week 2
I did a bit of reading around tidyverse, what it is, what we can do with it, and how it can help us with data organisation and analysis.
Week 3
In week 3 I read and wrote notes about Section 6 of the Guide to R book, which was all about basic data management. I then completed exercise 6.6.1, which can be seen below. I had to separate the different problems into separate R code chunks because there was an error when rendering, and splitting the code into sections made it much easier to see which part was causing the issue. I then completed the extra practice in section 6.7, which was quite fun and satisfying; thinking about code and understanding it makes me happy when the results are exactly what I wanted. I have created a subheading for the research methods part of week 3.
Research Methods
A good question about the diamond data set is “What is the distribution of diamond prices across different cuts, and how does this distribution change when organized by clarity?” A bad question about the diamond data set is “Why are diamonds so expensive?” or “Which diamonds are better?”
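As a quick illustration of how the "good" question could be explored, here is a minimal sketch; it assumes the tidyverse and the built-in diamonds data are loaded, and the summary and plot choices are just one possible approach rather than a required answer.
library(tidyverse)

# Summarise the spread of price within each cut
diamonds %>%
  group_by(cut) %>%
  summarize(median_price = median(price),
            iqr_price = IQR(price))

# See how the price-by-cut distribution shifts across clarity grades
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  facet_wrap(~ clarity)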
Notes about basic data management
If I want to modify the dataset, I must remember to add %>% to link a sequence of analysis steps
mutate() adds new columns or modifies current variables
summarize() collapses all rows and returns a one-row summary, e.g. summarize(avg.price = mean(price))
If you just want to look at females and males separately, for example, you can group_by(Sex) or group_by(Sex, Age) and then ungroup() when you are done
The function filter() retains specific rows of data that meet the requirements, e.g. filter(cut == "Fair") or filter(cut == "Fair" | cut == "Good", price <= 600); this will display cuts of Fair or Good with a price at or under $600
The select() function allows you to keep only the columns (variables) that you want to see, e.g. select(1:5) to keep the first five columns; see the short sketch below that chains these verbs together
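A minimal sketch chaining these verbs together (assuming the tidyverse and the diamonds data used below are loaded; the column choices are only illustrative):
library(tidyverse)

diamonds %>%
  filter(cut == "Fair" | cut == "Good", price <= 600) %>% # keep Fair or Good cuts at or under $600
  select(cut, clarity, price) %>%                         # keep only these three columns
  group_by(cut) %>%                                       # summarise each cut separately
  summarize(avg.price = mean(price)) %>%
  ungroup()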
6.6.1 Exercises
library (tidyverse)
View (diamonds)
##Now look at the structure of the data set
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##We know that the dataset contains 53,940 round-cut diamonds as each row of data represents a different diamond and there are 53,940 rows
##Now look at the variable names
names(diamonds)
[1] "carat" "cut" "color" "clarity" "depth" "table" "price"
[8] "x" "y" "z"
data(diamonds)
diamonds %>% # utilizes the diamonds dataset
  group_by(color, clarity) %>% # groups data by color and clarity variables
  mutate(price200 = mean(price)) %>% # creates new variable (average price by groups)
  ungroup() %>% # data no longer grouped by color and clarity
  mutate(random10 = 10 + price) %>% # new variable, original price + $10
  select(cut, color, # retain only these listed columns
         clarity, price,
         price200, random10) %>%
  arrange(color) %>% # visualize data ordered by color
  group_by(cut) %>% # group data by cut
  mutate(dis = n_distinct(price), # counts the number of unique price values per cut
         rowID = row_number()) %>% # numbers each row consecutively for each cut
  ungroup() # final ungrouping of data
# A tibble: 53,940 × 8
cut color clarity price price200 random10 dis rowID
<ord> <ord> <ord> <int> <dbl> <dbl> <int> <int>
1 Very Good D VS2 357 2587. 367 5840 1
2 Very Good D VS1 402 3030. 412 5840 2
3 Very Good D VS2 403 2587. 413 5840 3
4 Good D VS2 403 2587. 413 3086 1
5 Good D VS1 403 3030. 413 3086 2
6 Premium D VS2 404 2587. 414 6014 1
7 Premium D SI1 552 2976. 562 6014 2
8 Ideal D SI1 552 2976. 562 7281 1
9 Ideal D SI1 552 2976. 562 7281 2
10 Very Good D VVS1 553 2948. 563 5840 4
# ℹ 53,930 more rows
##Problem A
data(midwest)
View (midwest)
midwest %>%
  group_by(state) %>%
  summarize(poptotalmean = mean(poptotal),
            poptotalmed = median(poptotal),
            popmax = max(poptotal),
            popmin = min(poptotal),
            popdistinct = n_distinct(poptotal),
            popfirst = first(poptotal),
            popany = any(poptotal < 5000),
            popany2 = any(poptotal > 2000000)) %>%
  ungroup()
# A tibble: 5 × 9
state poptotalmean poptotalmed popmax popmin popdistinct popfirst popany
<chr> <dbl> <dbl> <int> <int> <int> <int> <lgl>
1 IL 112065. 24486. 5105067 4373 101 66090 TRUE
2 IN 60263. 30362. 797159 5315 92 31095 FALSE
3 MI 111992. 37308 2111687 1701 83 10145 TRUE
4 OH 123263. 54930. 1412140 11098 88 25371 FALSE
5 WI 67941. 33528 959275 3890 72 15682 TRUE
# ℹ 1 more variable: popany2 <lgl>
##Problem B
data(midwest)
midwest %>%
  group_by(state) %>%
  summarize(num5k = sum(poptotal < 5000),
            num2mil = sum(poptotal > 2000000),
            numrows = n()) %>%
  ungroup()
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
##Problem C part 1
data(midwest)
midwest %>%
  group_by(county) %>%
  summarize(x = n_distinct(state)) %>%
  arrange(desc(x)) %>%
  ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
##Problem C part 2
data(midwest)
midwest %>%
  group_by(county) %>%
  summarize(x = n()) %>%
  ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
##Problem C part 3
data(midwest)
midwest %>%
  group_by(county) %>%
  summarize(x = n_distinct(county)) %>%
  ungroup()
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
midwest %>%
  group_by(state) %>%
  summarize(x = n_distinct(state)) %>%
  ungroup()
# A tibble: 5 × 2
state x
<chr> <int>
1 IL 1
2 IN 1
3 MI 1
4 OH 1
5 WI 1
##Problem D
data(diamonds)
diamonds %>%
  group_by(clarity) %>%
  summarize(a = n_distinct(color),
            b = n_distinct(price),
            c = n()) %>%
  ungroup()
# A tibble: 8 × 4
clarity a b c
<ord> <int> <int> <int>
1 I1 7 632 741
2 SI2 7 4904 9194
3 SI1 7 5380 13065
4 VS2 7 5051 12258
5 VS1 7 3926 8171
6 VVS2 7 2409 5066
7 VVS1 7 1623 3655
8 IF 7 902 1790
##Problem E part 1
data(diamonds)
diamonds %>%
  group_by(color, cut) %>%
  summarize(m = mean(price),
            s = sd(price)) %>%
  ungroup()
`summarise()` has grouped output by 'color'. You can override using the
`.groups` argument.
# A tibble: 35 × 4
color cut m s
<ord> <ord> <dbl> <dbl>
1 D Fair 4291. 3286.
2 D Good 3405. 3175.
3 D Very Good 3470. 3524.
4 D Premium 3631. 3712.
5 D Ideal 2629. 3001.
6 E Fair 3682. 2977.
7 E Good 3424. 3331.
8 E Very Good 3215. 3408.
9 E Premium 3539. 3795.
10 E Ideal 2598. 2956.
# ℹ 25 more rows
##Problem E part 2
data(diamonds)
diamonds %>%
  group_by(cut, color) %>%
  summarize(m = mean(price),
            s = sd(price)) %>%
  ungroup()
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
# A tibble: 35 × 4
cut color m s
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
##Problem E part 3
data(diamonds)
diamonds %>%
  group_by(cut, color, clarity) %>%
  summarize(m = mean(price),
            s = sd(price),
            msale = m * 0.80) %>%
  ungroup()
`summarise()` has grouped output by 'cut', 'color'. You can override using the
`.groups` argument.
# A tibble: 276 × 6
cut color clarity m s msale
<ord> <ord> <ord> <dbl> <dbl> <dbl>
1 Fair D I1 7383 5899. 5906.
2 Fair D SI2 4355. 3260. 3484.
3 Fair D SI1 4273. 3019. 3419.
4 Fair D VS2 4513. 3383. 3610.
5 Fair D VS1 2921. 2550. 2337.
6 Fair D VVS2 3607 3629. 2886.
7 Fair D VVS1 4473 5457. 3578.
8 Fair D IF 1620. 525. 1296.
9 Fair E I1 2095. 824. 1676.
10 Fair E SI2 4172. 3055. 3338.
# ℹ 266 more rows
##Problem F
data(diamonds)
diamonds %>%
  group_by(cut) %>%
  summarize(potato = mean(depth),
            pizza = mean(price),
            popcorn = median(y),
            pineapple = potato - pizza,
            papaya = pineapple ^ 2,
            peach = n()) %>%
  ungroup()
# A tibble: 5 × 7
cut potato pizza popcorn pineapple papaya peach
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair 64.0 4359. 6.1 -4295. 18444586. 1610
2 Good 62.4 3929. 5.99 -3866. 14949811. 4906
3 Very Good 61.8 3982. 5.77 -3920. 15365942. 12082
4 Premium 61.3 4584. 6.06 -4523. 20457466. 13791
5 Ideal 61.7 3458. 5.26 -3396. 11531679. 21551
##Problem G part 1
data(diamonds)
diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%
  mutate(x1 = str_c("Diamond color", color),
         x2 = 5) %>%
  ungroup()
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond colorD 5
2 E 3077. Diamond colorE 5
3 F 3725. Diamond colorF 5
4 G 3999. Diamond colorG 5
5 H 4487. Diamond colorH 5
6 I 5092. Diamond colorI 5
7 J 5324. Diamond colorJ 5
##Problem G part 2
data(diamonds)
diamonds %>%
  group_by(color) %>%
  summarize(m = mean(price)) %>%
  ungroup() %>%
  mutate(x1 = str_c("Diamond color", color),
         x2 = 5)
# A tibble: 7 × 4
color m x1 x2
<ord> <dbl> <chr> <dbl>
1 D 3170. Diamond colorD 5
2 E 3077. Diamond colorE 5
3 F 3725. Diamond colorF 5
4 G 3999. Diamond colorG 5
5 H 4487. Diamond colorH 5
6 I 5092. Diamond colorI 5
7 J 5324. Diamond colorJ 5
##Problem H part 1
data(diamonds)
diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  summarize(m = mean(1)) %>%
  ungroup()
# A tibble: 7 × 2
color m
<ord> <dbl>
1 D 1
2 E 1
3 F 1
4 G 1
5 H 1
6 I 1
7 J 1
##Problem H part 2
data(diamonds)

diamonds %>%
  group_by(color) %>%
  mutate(x1 = price * 0.5) %>%
  ungroup() %>%
  summarize(m = mean(x1))
# A tibble: 1 × 1
m
<dbl>
1 1966.
6.7 Extra Practice
data(diamonds)
View(diamonds)
diamonds %>%
  arrange(price)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
diamonds %>%
  arrange(desc(price))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
diamonds %>%
  arrange(price, cut)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
9 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
diamonds %>%
  arrange(desc(price), cut)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
diamonds$clarity <- factor(diamonds$clarity,
                           levels = c("I1", "I2", "I3", "SI1", "SI2", "VS1", "VS2", "VVS1", "VVS2", "IF"))

diamonds %>%
  arrange(price, clarity)
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
2 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
7 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
diamonds %>%
  mutate(saleprice = price - 250)
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z saleprice
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 76
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 76
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 77
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 84
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 85
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 86
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 86
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 87
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 87
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 88
# ℹ 53,930 more rows
diamonds %>%
  select(-x, -y, -z)
# A tibble: 53,940 × 7
carat cut color clarity depth table price
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326
2 0.21 Premium E SI1 59.8 61 326
3 0.23 Good E VS1 56.9 65 327
4 0.29 Premium I VS2 62.4 58 334
5 0.31 Good J SI2 63.3 58 335
6 0.24 Very Good J VVS2 62.8 57 336
7 0.24 Very Good I VVS1 62.3 57 336
8 0.26 Very Good H SI1 61.9 55 337
9 0.22 Fair E VS2 65.1 61 337
10 0.23 Very Good H VS1 59.4 61 338
# ℹ 53,930 more rows
diamonds %>%
  group_by(cut) %>%
  summarize(count = n())
# A tibble: 5 × 2
cut count
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
diamonds %>%
  mutate(totalNum = nrow(diamonds))
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z totalNum
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 53940
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 53940
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 53940
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 53940
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 53940
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 53940
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 53940
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 53940
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 53940
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 53940
# ℹ 53,930 more rows
Week 4
Research Methods:
Based upon the readings of this week, below is a paragraph that I’ve written describing my understanding of what a “good” hypothesis is:
A hypothesis is often described as a "conjecture": an educated proposition based on incomplete evidence that is yet to be proven. To create a hypothesis we use inductive reasoning, which involves identifying patterns in specific cases and using those observations to make broader inferences. A good hypothesis then allows predictions to be made by deductive reasoning. A successful hypothesis is not meant to reduce the validity of a previous finding or seek power over existing theories; rather, it should stimulate research that may reveal the vagueness of the research area. A hypothesis should be brief and identify the objective of the study; however, the researcher should also entertain many alternative hypotheses to avoid favouring information that supports their own. A hypothesis should be clear and demonstrate relevance to its field; most importantly, it should address a knowledge gap and outline what we expect to learn. Identifying a knowledge gap is a key part of writing a hypothesis, so it is important to start with a literature review to acquire knowledge. A hypothesis must also be falsifiable, otherwise it is "unscientific" and not ready to be tested; it must concern something observable or accessible. A research hypothesis is associated with the introduction and may influence the reader's judgement of the applicability and reliability of your research. A research hypothesis can take many forms, such as the if/then method, the statement method, or the when X/then Y method, whereas a statistical hypothesis mainly takes the form of a null and alternative hypothesis.
Data Analysis:
This week, I followed the tutorial video online and repeated it myself. Following along with the video was helpful, as it allowed me to understand why we were implementing each step and how the plot changed depending on our input. I liked the flow chart that was in the YouTube video as well as the lecture; I have uploaded it below. I have hidden my R script for this week to test this format out in my workbook, as well as to keep it neat.
A bar chart is the default for a single categorical variable
A histogram is the default for a single quantitative variable
A scatterplot is good for two quantitative variables
A boxplot is for one quantitative and one categorical variable (a quick sketch of each follows below)
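To illustrate these defaults, here is a hedged sketch assuming ggplot2 and the palmerpenguins data used in this week's tutorial are available; the variable choices are mine and only meant as examples.
library(ggplot2)
library(palmerpenguins)

# Bar chart: one categorical variable
ggplot(penguins, aes(x = species)) + geom_bar()

# Histogram: one quantitative variable
ggplot(penguins, aes(x = body_mass_g)) + geom_histogram(bins = 30)

# Scatterplot: two quantitative variables
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()

# Boxplot: one quantitative and one categorical variable
ggplot(penguins, aes(x = species, y = body_mass_g)) + geom_boxplot()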
Attaching package: 'modeldata'
The following object is masked from 'package:palmerpenguins':
penguins
Basic Scatterplot & scatterplot with colours/labels
Modifying basic properties of a plot
Adding another layer & another layer with two species
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Other plots
Faceting (bad vs good)
Week 5
Research Methods:
An ANOVA test is appropriate for the first plot, a boxplot of sepal length for the three species, as well as for the second plot, which shows the density of petal lengths by species.
A linear regression analysis would be appropriate for the third plot, which shows petal length against petal width across species of iris. This plot shows one regression line; however, if we changed the code to display a regression line for each species, we could use an ANOVA test to see whether the slopes differ significantly. We can also calculate the correlation coefficient to measure the strength of the relationship between the two variables.
A chi-squared test is most appropriate for the fourth plot, a bar chart of species counts by size category. This allows us to determine whether there is a significant association between the two variables. If the chi-square test returns a significant result, we can follow up with a post-hoc analysis by conducting pairwise comparisons using chi-square tests with a correction for multiple comparisons.
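As a hedged sketch of how these tests might be run in R on the iris data (the size variable here is a median split of sepal length, mirroring the bar chart code below; this is one possible way to code the tests, not the prescribed analysis):
# One-way ANOVA: does mean sepal length differ between species?
summary(aov(Sepal.Length ~ Species, data = iris))

# Linear regression and correlation: petal width against petal length
summary(lm(Petal.Width ~ Petal.Length, data = iris))
cor(iris$Petal.Length, iris$Petal.Width)

# Chi-squared test: association between species and a size category
size <- ifelse(iris$Sepal.Length < median(iris$Sepal.Length), "small", "big")
chisq.test(table(iris$Species, size))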
Data Analysis:
library(tidyverse)
library(modeldata)
library(ggplot2)
library(dplyr)
data(iris)
View(iris)
#Boxplot
ggplot(iris, aes(x=Species,
y = Sepal.Length,
color=Species))+
geom_boxplot(show.legend = TRUE)
# Density plot
ggplot(iris, aes(x=Petal.Length,
fill = Species))+
geom_density(alpha = 0.3) +
labs(title = "Density plot of Petal Length by Species",
x = "Petal Length",
y = "Density")
#Scatter plot with fitted regression line
ggplot(iris, aes(x = Petal.Length,
y = Petal.Width,
colour = Species)) +
geom_point(aes(shape = Species)) +
geom_smooth(method="lm",se=TRUE, aes(group =1), colour="blue") +
labs(x = "Petal length",
y = "Petal Width",
color = "Species",
shape = "Species") +
scale_shape_manual(values = c(16, 17, 15))
`geom_smooth()` using formula = 'y ~ x'
#Bar chart of species counts by size
data(iris)
iris <- iris %>%
  mutate(size = ifelse(Sepal.Length < median(Sepal.Length),
                       "small", "big"))
ggplot(iris, aes(x = Species, fill = size)) +
geom_bar(position = "dodge")
Week 6
Activity No.1
<- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
file_path <- read.delim(file_path, row.names = 1)
housetasks head(housetasks)
Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
library("gplots")
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
# Convert the data as a contingency table
dt <- as.table(as.matrix(housetasks))
# Graph it as a balloon plot
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
main = "housetasks")
library("vcd")
Loading required package: grid
# plot just a subset of the table
assoc(head(dt, 5), shade = TRUE, las=3)
#A chi-square test examines whether the rows and columns of a contingency table are statistically significantly associated
chisq <- chisq.test(housetasks)
chisq
Pearson's Chi-squared test
data: housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
#This is how to extract observed and expected counts
chisq$observed
Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
Shopping 33 23 9 55
Official 12 46 23 15
Driving 10 51 75 3
Finances 13 13 21 66
Insurance 8 1 53 77
Repairs 0 3 160 2
Holidays 0 1 6 153
round(chisq$expected,2)
Wife Alternating Husband Jointly
Laundry 60.55 25.63 38.45 51.37
Main_meal 52.64 22.28 33.42 44.65
Dinner 37.16 15.73 23.59 31.52
Breakfeast 48.17 20.39 30.58 40.86
Tidying 41.97 17.77 26.65 35.61
Dishes 38.88 16.46 24.69 32.98
Shopping 41.28 17.48 26.22 35.02
Official 33.03 13.98 20.97 28.02
Driving 47.82 20.24 30.37 40.57
Finances 38.88 16.46 24.69 32.98
Insurance 47.82 20.24 30.37 40.57
Repairs 56.77 24.03 36.05 48.16
Holidays 55.05 23.30 34.95 46.70
#If you want to know the most contributing cells to the total Chi-square score, you just have to calculate the Chi-square statistic for each cell:
#The chi-square statistic for each cell is also the Pearson's residual
round(chisq$residuals, 3)
Wife Alternating Husband Jointly
Laundry 12.266 -2.298 -5.878 -6.609
Main_meal 9.836 -0.484 -4.917 -6.084
Dinner 6.537 -1.192 -3.416 -3.299
Breakfeast 4.875 3.457 -2.818 -5.297
Tidying 1.702 -1.606 -4.969 3.585
Dishes -1.103 1.859 -4.163 3.486
Shopping -1.289 1.321 -3.362 3.376
Official -3.659 8.563 0.443 -2.459
Driving -5.469 6.836 8.100 -5.898
Finances -4.150 -0.852 -0.742 5.750
Insurance -5.758 -4.277 4.107 5.720
Repairs -7.534 -4.290 20.646 -6.651
Holidays -7.419 -4.620 -4.897 15.556
#We can visualize Pearson residuals like this
library(corrplot)
corrplot 0.95 loaded
corrplot(chisq$residuals, is.cor = FALSE)
#Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables
#Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables.
# Contribution in percentage (%)
contrib <- 100 * chisq$residuals^2 / chisq$statistic
round(contrib, 3)
Wife Alternating Husband Jointly
Laundry 7.738 0.272 1.777 2.246
Main_meal 4.976 0.012 1.243 1.903
Dinner 2.197 0.073 0.600 0.560
Breakfeast 1.222 0.615 0.408 1.443
Tidying 0.149 0.133 1.270 0.661
Dishes 0.063 0.178 0.891 0.625
Shopping 0.085 0.090 0.581 0.586
Official 0.688 3.771 0.010 0.311
Driving 1.538 2.403 3.374 1.789
Finances 0.886 0.037 0.028 1.700
Insurance 1.705 0.941 0.868 1.683
Repairs 2.919 0.947 21.921 2.275
Holidays 2.831 1.098 1.233 12.445
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)
# printing the p-value
chisq$p.value
[1] 0
# printing the mean
chisq$estimate
NULL
Activity No.2
library(ggplot2)
library(dplyr)
library(tidyr)
library(knitr)
mpg %>%
  group_by(class, cyl) %>%
  summarize(n = n()) %>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | cyl | n |
---|---|---|
2seater | 8 | 5 |
compact | 4 | 32 |
compact | 5 | 2 |
compact | 6 | 13 |
midsize | 4 | 16 |
midsize | 6 | 23 |
midsize | 8 | 2 |
minivan | 4 | 1 |
minivan | 6 | 10 |
pickup | 4 | 3 |
pickup | 6 | 10 |
pickup | 8 | 20 |
subcompact | 4 | 21 |
subcompact | 5 | 2 |
subcompact | 6 | 7 |
subcompact | 8 | 5 |
suv | 4 | 8 |
suv | 6 | 16 |
suv | 8 | 38 |
#In order to turn our summary data into a contingency table, we need class to be listed by row and cyl to be listed by column
mpg %>%
  group_by(class, cyl) %>%
  summarise(n = n()) %>%
  spread(cyl, n) %>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 5 |
compact | 32 | 2 | 13 | NA |
midsize | 16 | NA | 23 | 2 |
minivan | 1 | NA | 10 | NA |
pickup | 3 | NA | 10 | 20 |
subcompact | 21 | 2 | 7 | 5 |
suv | 8 | NA | 16 | 38 |
#Instead of frequencies, we can get the average number of city miles by class and cyl
mpg %>%
  group_by(class, cyl) %>%
  summarise(mean_cty = mean(cty)) %>%
  spread(cyl, mean_cty) %>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 15.40000 |
compact | 21.37500 | 21 | 16.92308 | NA |
midsize | 20.50000 | NA | 17.78261 | 16.00000 |
minivan | 18.00000 | NA | 15.60000 | NA |
pickup | 16.00000 | NA | 14.50000 | 11.80000 |
subcompact | 22.85714 | 20 | 17.00000 | 14.80000 |
suv | 18.00000 | NA | 14.50000 | 12.13158 |
#Or we can see the max number of city miles by class and cyl
mpg %>%
  group_by(class, cyl) %>%
  summarise(max_cty = max(cty)) %>%
  spread(cyl, max_cty) %>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 16 |
compact | 33 | 21 | 18 | NA |
midsize | 23 | NA | 19 | 16 |
minivan | 18 | NA | 17 | NA |
pickup | 17 | NA | 16 | 14 |
subcompact | 35 | 20 | 18 | 15 |
suv | 20 | NA | 17 | 14 |
#We can find proportions by creating a new variable dividing row freq by table freq
mpg %>%
  group_by(class) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n)) %>% # our new proportion variable
  kable()
class | n | prop |
---|---|---|
2seater | 5 | 0.0213675 |
compact | 47 | 0.2008547 |
midsize | 41 | 0.1752137 |
minivan | 11 | 0.0470085 |
pickup | 33 | 0.1410256 |
subcompact | 35 | 0.1495726 |
suv | 62 | 0.2649573 |
#Contingency Tables
#We can then create a contingency table of proportion values
mpg %>%
  group_by(class, cyl) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  subset(select = c("class", "cyl", "prop")) %>%
  spread(class, prop) %>%
  kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
cyl | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
---|---|---|---|---|---|---|---|
4 | NA | 0.6808511 | 0.3902439 | 0.0909091 | 0.0909091 | 0.6000000 | 0.1290323 |
5 | NA | 0.0425532 | NA | NA | NA | 0.0571429 | NA |
6 | NA | 0.2765957 | 0.5609756 | 0.9090909 | 0.3030303 | 0.2000000 | 0.2580645 |
8 | 1 | NA | 0.0487805 | NA | 0.6060606 | 0.1428571 | 0.6129032 |
#We can construct a contingency table of vehicle class by number of cylinders by using table()
# I am naming it contingency_table so that it is easier to identify when conducting the Chi Sq test
contingency_table <- table(mpg$class, mpg$cyl)
#The table freq can also be called like this:
mpg_table <- table(mpg$class, mpg$cyl) #define object w/table parameters for simple calling
ftable(mpg_table)
4 5 6 8
2seater 0 0 0 5
compact 32 2 13 0
midsize 16 0 23 2
minivan 1 0 10 0
pickup 3 0 10 20
subcompact 21 2 7 5
suv 8 0 16 38
#Row frequencies
margin.table(mpg_table, 1)
2seater compact midsize minivan pickup subcompact suv
5 47 41 11 33 35 62
#Column frequencies
margin.table(mpg_table, 2)
4 5 6 8
81 4 79 70
#Proportion of the entire table
prop.table(mpg_table)
4 5 6 8
2seater 0.000000000 0.000000000 0.000000000 0.021367521
compact 0.136752137 0.008547009 0.055555556 0.000000000
midsize 0.068376068 0.000000000 0.098290598 0.008547009
minivan 0.004273504 0.000000000 0.042735043 0.000000000
pickup 0.012820513 0.000000000 0.042735043 0.085470085
subcompact 0.089743590 0.008547009 0.029914530 0.021367521
suv 0.034188034 0.000000000 0.068376068 0.162393162
#Row proportions
prop.table(mpg_table, 1)
4 5 6 8
2seater 0.00000000 0.00000000 0.00000000 1.00000000
compact 0.68085106 0.04255319 0.27659574 0.00000000
midsize 0.39024390 0.00000000 0.56097561 0.04878049
minivan 0.09090909 0.00000000 0.90909091 0.00000000
pickup 0.09090909 0.00000000 0.30303030 0.60606061
subcompact 0.60000000 0.05714286 0.20000000 0.14285714
suv 0.12903226 0.00000000 0.25806452 0.61290323
#Column proportions
prop.table(mpg_table, 2)
4 5 6 8
2seater 0.00000000 0.00000000 0.00000000 0.07142857
compact 0.39506173 0.50000000 0.16455696 0.00000000
midsize 0.19753086 0.00000000 0.29113924 0.02857143
minivan 0.01234568 0.00000000 0.12658228 0.00000000
pickup 0.03703704 0.00000000 0.12658228 0.28571429
subcompact 0.25925926 0.50000000 0.08860759 0.07142857
suv 0.09876543 0.00000000 0.20253165 0.54285714
#Chisq tests
#We can either use the contingency table of proportions for the test or the raw data. If we use proportions we are testing our expected proportions against proportions given, or if the observed data matches the expected distribution.
#If we use the raw data, we are testing an association between class and cylinders.
chisq2 <- chisq.test(contingency_table)
Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect
chisq2
Pearson's Chi-squared test
data: contingency_table
X-squared = 138.03, df = 18, p-value < 2.2e-16
chisq2$observed
4 5 6 8
2seater 0 0 0 5
compact 32 2 13 0
midsize 16 0 23 2
minivan 1 0 10 0
pickup 3 0 10 20
subcompact 21 2 7 5
suv 8 0 16 38
round(chisq2$expected,2)
4 5 6 8
2seater 1.73 0.09 1.69 1.50
compact 16.27 0.80 15.87 14.06
midsize 14.19 0.70 13.84 12.26
minivan 3.81 0.19 3.71 3.29
pickup 11.42 0.56 11.14 9.87
subcompact 12.12 0.60 11.82 10.47
suv 21.46 1.06 20.93 18.55
#The chi-square statistic for each cell is also the Pearson's residual
round(chisq2$residuals, 3)
4 5 6 8
2seater -1.316 -0.292 -1.299 2.865
compact 3.900 1.335 -0.720 -3.750
midsize 0.480 -0.837 2.462 -2.931
minivan -1.439 -0.434 3.262 -1.814
pickup -2.492 -0.751 -0.342 3.224
subcompact 2.553 1.812 -1.401 -1.691
suv -2.906 -1.029 -1.078 4.517
#We can visualize Pearson residuals like this
library(corrplot)
corrplot(chisq2$residuals, is.cor = FALSE)
# Contribution in percentage (%)
contrib <- 100 * chisq2$residuals^2 / chisq2$statistic
round(contrib, 3)
4 5 6 8
2seater 1.254 0.062 1.223 5.948
compact 11.020 1.291 0.375 10.186
midsize 0.167 0.508 4.390 6.224
minivan 1.500 0.136 7.709 2.384
pickup 4.500 0.409 0.085 7.528
subcompact 4.720 2.379 1.422 2.070
suv 6.117 0.768 0.842 14.782
# Visualize the contribution
corrplot(contrib, is.cor = FALSE)
# printing the p-value
chisq$p.value
[1] 0
# printing the mean
chisq$estimate
NULL
library(gmodels)
Registered S3 method overwritten by 'gdata':
method from
reorder.factor gplots
#The cross table command produces frequencies, and table, row & column proportions
CrossTable(mpg$class, mpg$cyl)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 234
| mpg$cyl
mpg$class | 4 | 5 | 6 | 8 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
2seater | 0 | 0 | 0 | 5 | 5 |
| 1.731 | 0.085 | 1.688 | 8.210 | |
| 0.000 | 0.000 | 0.000 | 1.000 | 0.021 |
| 0.000 | 0.000 | 0.000 | 0.071 | |
| 0.000 | 0.000 | 0.000 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
compact | 32 | 2 | 13 | 0 | 47 |
| 15.210 | 1.782 | 0.518 | 14.060 | |
| 0.681 | 0.043 | 0.277 | 0.000 | 0.201 |
| 0.395 | 0.500 | 0.165 | 0.000 | |
| 0.137 | 0.009 | 0.056 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
midsize | 16 | 0 | 23 | 2 | 41 |
| 0.230 | 0.701 | 6.059 | 8.591 | |
| 0.390 | 0.000 | 0.561 | 0.049 | 0.175 |
| 0.198 | 0.000 | 0.291 | 0.029 | |
| 0.068 | 0.000 | 0.098 | 0.009 | |
-------------|-----------|-----------|-----------|-----------|-----------|
minivan | 1 | 0 | 10 | 0 | 11 |
| 2.070 | 0.188 | 10.641 | 3.291 | |
| 0.091 | 0.000 | 0.909 | 0.000 | 0.047 |
| 0.012 | 0.000 | 0.127 | 0.000 | |
| 0.004 | 0.000 | 0.043 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
pickup | 3 | 0 | 10 | 20 | 33 |
| 6.211 | 0.564 | 0.117 | 10.391 | |
| 0.091 | 0.000 | 0.303 | 0.606 | 0.141 |
| 0.037 | 0.000 | 0.127 | 0.286 | |
| 0.013 | 0.000 | 0.043 | 0.085 | |
-------------|-----------|-----------|-----------|-----------|-----------|
subcompact | 21 | 2 | 7 | 5 | 35 |
| 6.515 | 3.284 | 1.963 | 2.858 | |
| 0.600 | 0.057 | 0.200 | 0.143 | 0.150 |
| 0.259 | 0.500 | 0.089 | 0.071 | |
| 0.090 | 0.009 | 0.030 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
suv | 8 | 0 | 16 | 38 | 62 |
| 8.444 | 1.060 | 1.162 | 20.403 | |
| 0.129 | 0.000 | 0.258 | 0.613 | 0.265 |
| 0.099 | 0.000 | 0.203 | 0.543 | |
| 0.034 | 0.000 | 0.068 | 0.162 | |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 81 | 4 | 79 | 70 | 234 |
| 0.346 | 0.017 | 0.338 | 0.299 | |
-------------|-----------|-----------|-----------|-----------|-----------|
Week 7
One sample t-test (parametric)
This is used to compare the mean of one sample to a known standard (hypothetical) mean (mu)
#We will use an example data set
set.seed(1234)
my_data <- data.frame(
  name = paste0(rep("M_", 10), 1:10),
  weight = round(rnorm(10, 20, 2), 1)
)
#Check data
head(my_data, 10)
name weight
1 M_1 17.6
2 M_2 20.6
3 M_3 22.2
4 M_4 15.3
5 M_5 20.9
6 M_6 21.0
7 M_7 18.9
8 M_8 18.9
9 M_9 18.9
10 M_10 18.2
#Statistical summaries of weight
summary(my_data$weight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.30 18.38 18.90 19.25 20.82 22.20
#Visualize data
library(ggpubr)
ggboxplot(my_data$weight,
ylab = "Weight(g)", xlab = FALSE,
ggtheme = theme_minimal())
#As the sample size is not large enough (<30) we need to check whether the data follows a normal distribution
shapiro.test(my_data$weight)
Shapiro-Wilk normality test
data: my_data$weight
W = 0.9526, p-value = 0.6993
#As the p value = 0.6993 (>0.05), the data are not significantly different from normality, so we assume normality
#Visual inspection using Q-Q plots (draws correlation between sample and normal distribution)
ggqqplot(my_data$weight, ylab = "Men's weight",
ggtheme = theme_minimal())
#Question: does the average weight of mice differ from 25g (two tailed)
res <- t.test(my_data$weight, mu = 25)
res
One Sample t-test
data: my_data$weight
t = -9.0783, df = 9, p-value = 7.953e-06
alternative hypothesis: true mean is not equal to 25
95 percent confidence interval:
17.8172 20.6828
sample estimates:
mean of x
19.25
#The p value is less than the significance level, so we conclude that the mean weight of the mice is significantly different from 25g
#If you want to test whether it is less than or greater than 25g:
t.test(my_data$weight, mu=25,
alternative = "less")
One Sample t-test
data: my_data$weight
t = -9.0783, df = 9, p-value = 3.977e-06
alternative hypothesis: true mean is less than 25
95 percent confidence interval:
-Inf 20.41105
sample estimates:
mean of x
19.25
t.test(my_data$weight, mu=25,
alternative = "greater")
One Sample t-test
data: my_data$weight
t = -9.0783, df = 9, p-value = 1
alternative hypothesis: true mean is greater than 25
95 percent confidence interval:
18.08895 Inf
sample estimates:
mean of x
19.25
One-sample Wilcoxon Signed Rank Test (non-parametric)
Alternative to one-sample t-test when data cannot be assumed to be normally distributed
Used to determine whether the median of the sample is equal to a known standard value
#We will use the same mice data above
head(my_data, 10)
name weight
1 M_1 17.6
2 M_2 20.6
3 M_3 22.2
4 M_4 15.3
5 M_5 20.9
6 M_6 21.0
7 M_7 18.9
8 M_8 18.9
9 M_9 18.9
10 M_10 18.2
#Question:Does the average weight of the mice differ from 25g (two tailed)
res <- wilcox.test(my_data$weight, mu = 25)
Warning in wilcox.test.default(my_data$weight, mu = 25): cannot compute exact
p-value with ties
res
Wilcoxon signed rank test with continuity correction
data: my_data$weight
V = 0, p-value = 0.005793
alternative hypothesis: true location is not equal to 25
res$p.value
[1] 0.005793045
#The p value is less than the significance level. We reject the null hypothesis and conclude that the average weight of the mice is significantly different from 25g
Unpaired Two-Samples T-test (parametric)
Used to compare the mean of two independent groups
#We will use an example data set
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)

# Create a data frame
my_data <- data.frame(
  group = rep(c("Woman", "Man"), each = 9),
  weight = c(women_weight, men_weight)
)
print(my_data)
group weight
1 Woman 38.9
2 Woman 61.2
3 Woman 73.3
4 Woman 21.8
5 Woman 63.4
6 Woman 64.6
7 Woman 48.4
8 Woman 48.8
9 Woman 48.5
10 Man 67.8
11 Man 60.0
12 Man 63.4
13 Man 76.0
14 Man 89.4
15 Man 73.3
16 Man 67.3
17 Man 61.3
18 Man 62.4
library(dplyr)
#Compute summary statistics by groups
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)
# A tibble: 2 × 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 Man 9 69.0 9.38
2 Woman 9 52.1 15.6
#Visualize data by using boxplot
ggboxplot(my_data, x = "group", y="weight",
color = "group", palette = c("#00AFBB", "#E7B800"),
ylab = "Weight", xlab = "Groups")
#We should check the independent t-test assumptions
#Assumption 1: are the two samples independent? Yes, as the samples from women and men aren't related
#Assumption 2: do the data from each of the two groups follow a normal distribution?
#Assumption 3: Do the two populations have the same variances?
#To check assumption 2, we should conduct a Shapiro-Wilk normality test
with(my_data, shapiro.test(weight[group == "Man"]))
Shapiro-Wilk normality test
data: weight[group == "Man"]
W = 0.86425, p-value = 0.1066
#p = 0.1
with(my_data, shapiro.test(weight[group == "Woman"]))
Shapiro-Wilk normality test
data: weight[group == "Woman"]
W = 0.94266, p-value = 0.6101
#p = 0.6
# As both of the p-values are greater than the significance level (0.05), we assume that the distribution are not significantly different from the normal distribution
#To check assumption 3, we should use the f-test to test for homogeneity in variances
res.ftest <- var.test(weight ~ group, data = my_data)
res.ftest
F test to compare two variances
data: weight by group
F = 0.36134, num df = 8, denom df = 8, p-value = 0.1714
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.08150656 1.60191315
sample estimates:
ratio of variances
0.3613398
#The p value of the F test is greater than the significance level, so there is no significant difference between the variances of the two sets of data
#As there is homogeneity of variances, we can use the classic t-test which assumes equality of the two variances
#Question:Is there any significant difference between women and men weights?
#There are two methods:
#Method 1
res <- t.test(women_weight, men_weight, var.equal = TRUE)
res
Two Sample t-test
data: women_weight and men_weight
t = -2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-29.748019 -4.029759
sample estimates:
mean of x mean of y
52.10000 68.98889
#Method 2
res <- t.test(weight ~ group, data = my_data, var.equal = TRUE)
res
Two Sample t-test
data: weight by group
t = 2.7842, df = 16, p-value = 0.01327
alternative hypothesis: true difference in means between group Man and group Woman is not equal to 0
95 percent confidence interval:
4.029759 29.748019
sample estimates:
mean in group Man mean in group Woman
68.98889 52.10000
#The p-value is less than the significance level. We can conclude that men's average weight is significantly different from women's average weight
res$p.value
[1] 0.0132656
res$estimate
mean in group Man mean in group Woman
68.98889 52.10000
res$conf.int
[1] 4.029759 29.748019
attr(,"conf.level")
[1] 0.95
Unpaired two-samples Wilcoxon test (non-parametric)
Alternative to the unpaired two-sample t test
Used when your data are not normally distributed
#We will use the same example data above
print(my_data)
group weight
1 Woman 38.9
2 Woman 61.2
3 Woman 73.3
4 Woman 21.8
5 Woman 63.4
6 Woman 64.6
7 Woman 48.4
8 Woman 48.8
9 Woman 48.5
10 Man 67.8
11 Man 60.0
12 Man 63.4
13 Man 76.0
14 Man 89.4
15 Man 73.3
16 Man 67.3
17 Man 61.3
18 Man 62.4
#Compute summary by groups
group_by(my_data, group) %>%
summarise(
count = n(),
median = median(weight, na.rm = TRUE),
IQR = IQR(weight, na.rm = TRUE)
)
# A tibble: 2 × 4
group count median IQR
<chr> <int> <dbl> <dbl>
1 Man 9 67.3 10.9
2 Woman 9 48.8 15
#Visualize with boxplot
#Question: Is there any significant difference between women and men weights?
#Method 1
res <- wilcox.test(women_weight, men_weight)
Warning in wilcox.test.default(women_weight, men_weight): cannot compute exact
p-value with ties
res
Wilcoxon rank sum test with continuity correction
data: women_weight and men_weight
W = 15, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
#Method 2
res <- wilcox.test(weight ~ group, data = my_data,
                   exact = FALSE)
res
Wilcoxon rank sum test with continuity correction
data: weight by group
W = 66, p-value = 0.02712
alternative hypothesis: true location shift is not equal to 0
res$p.value
[1] 0.02711657
#As the p value is less than the significance level, we can conclude that men's median weight is significantly different from women's median weight
Paired Samples T-test
Used to compare the means between two related groups of samples
#We will use example data
# Weight of the mice before treatment
before <- c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice after treatment
after <- c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)
# Create a data frame
my_data <- data.frame(
  group = rep(c("before", "after"), each = 10),
  weight = c(before, after)
)
#Check data
print(my_data)
group weight
1 before 200.1
2 before 190.9
3 before 192.7
4 before 213.0
5 before 241.4
6 before 196.9
7 before 172.2
8 before 185.5
9 before 205.2
10 before 193.7
11 after 392.9
12 after 393.2
13 after 345.1
14 after 393.0
15 after 434.0
16 after 427.9
17 after 422.0
18 after 383.9
19 after 392.3
20 after 352.2
#Compute summary stats by groups
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)
# A tibble: 2 × 4
group count mean sd
<chr> <int> <dbl> <dbl>
1 after 10 394. 29.4
2 before 10 199. 18.5
#Visualize data using boxplot
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
color = "group", palette = c("#00AFBB", "#E7B800"),
order = c("before", "after"),
ylab = "Weight", xlab = "Groups")
#Subset weight data before treatment
before <- subset(my_data, group == "before", weight,
                 drop = TRUE)
#Subset weight data after treatment
after <- subset(my_data, group == "after", weight,
                drop = TRUE)
#Plot paired data
library(PairedData)
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Loading required package: gld
Loading required package: mvtnorm
Loading required package: lattice
Attaching package: 'PairedData'
The following object is masked from 'package:base':
summary
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()
#We then have to check the paired t-test assumptions
#Assumption 1: Are the two samples paired? Yes
#Assumption 2: Is this a large sample? NO, therefore we need to check whether the differences of the pairs follow a normal distribution
#Check if data are normally distributed:
d <- with(my_data,
          weight[group == "before"] - weight[group == "after"])
#Shapiro-Wilk normality test for the differences
shapiro.test(d)
Shapiro-Wilk normality test
data: d
W = 0.94536, p-value = 0.6141
#As the p value is greater than the significance level, the distribution of the differences is not significantly different from a normal distribution
#Question: Are there any significant changes in the weights of mice after treatment?
#Method 1
res <- t.test(before, after, paired = TRUE)
res
Paired t-test
data: before and after
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-215.5581 -173.4219
sample estimates:
mean difference
-194.49
#The p value is less than the significance level. We reject the null hypothesis and conclude that the average weight of the mice before treatment is significantly different from the average weight after treatment
Paired Samples Wilcoxon Test (non-parametric)
Alternative to paired t-test when data are not normally distributed
#We will use the same data as the last example
print(my_data)
group weight
1 before 200.1
2 before 190.9
3 before 192.7
4 before 213.0
5 before 241.4
6 before 196.9
7 before 172.2
8 before 185.5
9 before 205.2
10 before 193.7
11 after 392.9
12 after 393.2
13 after 345.1
14 after 393.0
15 after 434.0
16 after 427.9
17 after 422.0
18 after 383.9
19 after 392.3
20 after 352.2
#compute summary stats by groups
#Visualized by boxplot
#Subset weight data before treatment
before <- subset(my_data, group == "before", weight,
                 drop = TRUE)
#Subset weight data after treatment
after <- subset(my_data, group == "after", weight,
                drop = TRUE)
#Plot paired data
library(PairedData)
pd <- paired(before, after)
plot(pd, type = "profile") + theme_bw()
#Plot paired data
#Question: Are there any significant changes in the weights of mice before and after treatment?
#Method 1
res <- wilcox.test(before, after, paired = TRUE)
res
Wilcoxon signed rank exact test
data: before and after
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0
res$p.value
[1] 0.001953125
#As the p value is less than the significance level, we conclude that the median weight of the mice before treatment is significantly different from the median weight after treatment
One-Way ANOVA Test
An extension of independent two-samples t-test for comparing means in a situation where there are more than two groups
#We will use an example data
my_data <- PlantGrowth
set.seed(1234)
dplyr::sample_n(my_data, 10)
weight group
1 6.15 trt2
2 3.83 trt1
3 5.29 trt2
4 5.12 trt2
5 4.50 ctrl
6 4.17 trt1
7 5.87 trt1
8 5.33 ctrl
9 5.26 trt2
10 4.61 ctrl
#Show the levels
levels(my_data$group)
[1] "ctrl" "trt1" "trt2"
#If they aren't in the correct order, re-order them like this:
my_data$group <- ordered(my_data$group,
                         levels = c("ctrl", "trt1", "trt2"))
#Compute summary stats by groups
library(dplyr)
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE)
)
# A tibble: 3 × 4
group count mean sd
<ord> <int> <dbl> <dbl>
1 ctrl 10 5.03 0.583
2 trt1 10 4.66 0.794
3 trt2 10 5.53 0.443
#Visualise data
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")
#Plot weight by group
ggline(my_data, x = "group", y = "weight",
add = c("mean_se", "jitter"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")
#Is there any significant difference between the average weights of plants in the 3 experimental conditions
#Compute the analysis of variance
res.aov <- aov(weight ~ group, data = my_data)
summary(res.aov)
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#As the P value is less than the significance level, we conclude that there are significant differences between the groups
#As the group means are different, we want to find out which pairs of groups are different
#Therefore we should perform multiple pairwise-comparisons
#Tukey multiple pairwise-comparisons
TukeyHSD(res.aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = my_data)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
#The output says that only the difference between trt2 and trt1 is significant
#We can also use the pairwise.t.test to calculate pairwise comparisons between group levels with corrections for multiple testing
pairwise.t.test(my_data$weight, my_data$group,
p.adjust.method = "BH")
Pairwise comparisons using t tests with pooled SD
data: my_data$weight and my_data$group
ctrl trt1
trt1 0.194 -
trt2 0.132 0.013
P value adjustment method: BH
#Lets check ANOVA assumptions
#1. Check the homogeneity of variance assumption
plot(res.aov, 1)
#There is no evident relationship between residuals and fitted values, so we can assume the homogeneity of variances
#2.Check the normality assumption
plot(res.aov, 2)
#As all the points fall approximately along this reference line, we can assume normality
#This can also be supported by the Shapiro-Wilk test
aov_residuals <- residuals(object = res.aov)
shapiro.test(x = aov_residuals)
Shapiro-Wilk normality test
data: aov_residuals
W = 0.96607, p-value = 0.4379
#If these assumptions are not met, we can use the Kruskal-Wallis rank sum test
kruskal.test(weight ~ group, data= my_data)
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
Two-Way ANOVA Test
Used to evaluate simultaneously the effect of two grouping variables on a response variable
#We will use example data
my_data <- ToothGrowth
set.seed(1234)
dplyr::sample_n(my_data, 10)
len supp dose
1 21.5 VC 2.0
2 17.3 VC 1.0
3 27.3 OJ 2.0
4 18.5 VC 2.0
5 8.2 OJ 0.5
6 26.4 OJ 1.0
7 25.8 OJ 1.0
8 5.2 VC 0.5
9 6.4 VC 0.5
10 9.4 OJ 0.5
str(my_data)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#Dose is considered a numeric variable so we will change it to a factor variable
my_data$dose <- factor(my_data$dose,
                       levels = c(0.5, 1, 2),
                       labels = c("D0.5", "D1", "D2"))
head(my_data)
len supp dose
1 4.2 VC D0.5
2 11.5 VC D0.5
3 7.3 VC D0.5
4 5.8 VC D0.5
5 6.4 VC D0.5
6 10.0 VC D0.5
#Question: Does tooth length depend on supp and dose
table(my_data$supp, my_data$dose)
D0.5 D1 D2
OJ 10 10 10
VC 10 10 10
#Visualize data
library("ggpubr")
ggboxplot(my_data, x ="dose", y="len", color = "supp",
palette = c("#00AFBB", "#E7B800"))
#Line plots with multiple groups
library("ggpubr")
ggline(my_data, x = "dose", y = "len", color = "supp",
add = c("mean_se", "dotplot"),
palette = c("#00AFBB", "#E7B800"))
Bin width defaults to 1/30 of the range of the data. Pick better value with
`binwidth`.
res.aov2 <- aov(len ~ supp + dose, data = my_data)
summary(res.aov2)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 14.02 0.000429 ***
dose 2 2426.4 1213.2 82.81 < 2e-16 ***
Residuals 56 820.4 14.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#We can conclude that both supp and dose are statistically significant
#Two-way ANOVA with interaction effect
#These two calls are equivalent
res.aov3 <- aov(len ~ supp * dose, data = my_data)
res.aov3 <- aov(len ~ supp + dose + supp:dose, data = my_data)
summary(res.aov3)
Df Sum Sq Mean Sq F value Pr(>F)
supp 1 205.4 205.4 15.572 0.000231 ***
dose 2 2426.4 1213.2 92.000 < 2e-16 ***
supp:dose 2 108.3 54.2 4.107 0.021860 *
Residuals 54 712.1 13.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The two main effects(supp and dose) are statistically significant, as well as their interaction
#Compute summary stats
require("dplyr")
group_by(my_data, supp, dose) %>%
summarise(
count = n(),
mean = mean(len, na.rm = TRUE),
sd = sd(len, na.rm = TRUE)
)
`summarise()` has grouped output by 'supp'. You can override using the
`.groups` argument.
# A tibble: 6 × 5
# Groups: supp [2]
supp dose count mean sd
<fct> <fct> <int> <dbl> <dbl>
1 OJ D0.5 10 13.2 4.46
2 OJ D1 10 22.7 3.91
3 OJ D2 10 26.1 2.66
4 VC D0.5 10 7.98 2.75
5 VC D1 10 16.8 2.52
6 VC D2 10 26.1 4.80
model.tables(res.aov3, type="means", se = TRUE)
Tables of means
Grand mean
18.81333
supp
supp
OJ VC
20.663 16.963
dose
dose
D0.5 D1 D2
10.605 19.735 26.100
supp:dose
dose
supp D0.5 D1 D2
OJ 13.23 22.70 26.06
VC 7.98 16.77 26.14
Standard errors for differences of means
supp dose supp:dose
0.9376 1.1484 1.6240
replic. 30 20 10
#We will perform the Tukey HSD test for the factor variable "dose" as it has more than 2 levels
TukeyHSD(res.aov3, which = "dose")
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = len ~ supp + dose + supp:dose, data = my_data)
$dose
diff lwr upr p adj
D1-D0.5 9.130 6.362488 11.897512 0.0e+00
D2-D0.5 15.495 12.727488 18.262512 0.0e+00
D2-D1 6.365 3.597488 9.132512 2.7e-06
#All pairwise comparisons are significant with an adjusted p value<0.05
#The pairwise.t.test can also be used to calculate pairwise comparisons between group levels with corrections for multiple testing
pairwise.t.test(my_data$len, my_data$dose,
p.adjust.method = "BH")
Pairwise comparisons using t tests with pooled SD
data: my_data$len and my_data$dose
D0.5 D1
D1 1.0e-08 -
D2 4.4e-16 1.4e-05
P value adjustment method: BH
#Check ANOVA assumptions
#1. Homogeneity of variances
plot(res.aov3,1)
#Points 32 and 23 are outliers, useful to remove to meet test assumptions
#Use Levene's test to check homogeneity of variances
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
leveneTest(len ~ supp*dose, data = my_data)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 5 1.7086 0.1484
54
# The p value is more than the significance level which means that we assume the homogeneity of variances in the different treatment groups
#2. Normality
plot(res.aov3, 2)
#As the points fall approx along the reference line, we assume normality
#We can also support this with the Shapiro-Wilk test
aov_residuals <- residuals(object = res.aov3)
shapiro.test(x = aov_residuals)
Shapiro-Wilk normality test
data: aov_residuals
W = 0.98499, p-value = 0.6694
#This finds no indication that normality is violated
#When the test is unbalanced:
library(car)
my_anova <- aov(len ~ supp * dose, data = my_data)
Anova(my_anova, type = "III")
Anova Table (Type III tests)
Response: len
Sum Sq Df F value Pr(>F)
(Intercept) 1750.33 1 132.730 3.603e-16 ***
supp 137.81 1 10.450 0.002092 **
dose 885.26 2 33.565 3.363e-10 ***
supp:dose 108.32 2 4.107 0.021860 *
Residuals 712.11 54
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
MANOVA test
When there are multiple response variables you can test them simultaneously using a multivariate analysis of variance (MANOVA)
my_data <- iris
set.seed(1234)
dplyr::sample_n(my_data, 10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species size
1 5.2 3.5 1.5 0.2 setosa small
2 5.7 2.6 3.5 1.0 versicolor small
3 6.3 3.3 6.0 2.5 virginica big
4 6.5 3.2 5.1 2.0 virginica big
5 6.3 3.4 5.6 2.4 virginica big
6 6.4 2.8 5.6 2.2 virginica big
7 6.8 3.2 5.9 2.3 virginica big
8 7.9 3.8 6.4 2.0 virginica big
9 6.2 2.9 4.3 1.3 versicolor big
10 7.1 3.0 5.9 2.1 virginica big
sepl <- iris$Sepal.Length
petl <- iris$Petal.Length
# MANOVA test
res.man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
summary(res.man)
Df Pillai approx F num Df den Df Pr(>F)
Species 2 0.9885 71.829 4 294 < 2.2e-16 ***
Residuals 147
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#As the p value is significant, we see which ones differ:
summary.aov(res.man)
Response Sepal.Length :
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.212 31.606 119.26 < 2.2e-16 ***
Residuals 147 38.956 0.265
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response Petal.Length :
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 437.10 218.551 1180.2 < 2.2e-16 ***
Residuals 147 27.22 0.185
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Both response variables differ highly significantly among the species
Kruskal-Wallis test
Non-parametric alternative to one-way ANOVA
my_data <- PlantGrowth
head(my_data)
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl
levels(my_data$group)
[1] "ctrl" "trt1" "trt2"
#Summary stats by groups
library(dplyr)
group_by(my_data, group) %>%
summarise(
count = n(),
mean = mean(weight, na.rm = TRUE),
sd = sd(weight, na.rm = TRUE),
median = median(weight, na.rm = TRUE),
IQR = IQR(weight, na.rm = TRUE)
)
#Visualize data
library("ggpubr")
ggboxplot(my_data, x = "group", y = "weight",
color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")
#Mean plots
ggline(my_data, x = "group", y = "weight",
add = c("mean_se", "jitter"),
order = c("ctrl", "trt1", "trt2"),
ylab = "Weight", xlab = "Treatment")
#Question: Is there any significant difference between the average weights of plants in the 3 experimental conditions?
kruskal.test(weight ~ group, data = my_data)
#As the p value is less than the significance level, we conclude that there are significant differences between the treatment groups
Kruskal-Wallis rank sum test
data: weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
#We then conduct a pairwise.wilcox.test to calculate pairwise comparisons between group levels with corrections for multiple testing
pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group, p.adjust.method = "BH")
#This shows that only trt1 and trt2 are significantly different
Week 8
Exercise 5 - Correlation
I have completed 10 questions as I have grasped the concept of correlation well.
- Which of the following correlation coefficients expresses the strongest association?
a) 0.55 b) 0.09 c) -0.77 d) 0.1 e) 1.05
I believe it is c), as the strength of a correlation is determined by how close the coefficient is to -1 or 1 (its absolute value), and e) is not a valid coefficient because correlations must lie between -1 and 1.
- For which of the coefficients from the previous question does it hold that a person with an above-average score on X will probably also have an above-average score on Y?
I believe it is coefficient a), as it must be a positive correlation that is also reasonably strong, and a) is the only moderately strong positive coefficient.
- We have five representative samples of people aged 15, 20, 30, 45 and 60 years who completed a questionnaire on political conservatism. In these 5 samples, in the given order, the average political conservatism scores were: 60, 85, 80, 70, 65. The correlation between age and political conservatism is:
a) 1.0 b) -1.0 c) linear d) nonlinear
I think it is d) nonlinear: as age increases from 15 to 20, conservatism increases, but as age increases from 20 to 60, conservatism decreases (see the quick numeric check after the scatterplot question below).
- For each scatterplot below choose corresponding association description:
perfect positive linear relationship (r = 1.0)
I think this corresponds to graph c
moderate positive linear relationship (r = 0.5)
I think this corresponds to graph d
no linear relationship (r = 0)
I think this corresponds to graph e
moderate negative relationship (r = -0.5)
I think this corresponds to graph a
perfect negative linear relationship (r = -1.0)
I think this corresponds to graph b
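To double-check the age and conservatism answer numerically (my own addition, not part of the exercise), I can put the five ages and average scores into R: the Pearson coefficient comes out weak rather than ±1, even though the scatterplot shows a clear rise-then-fall (nonlinear) pattern.
age <- c(15, 20, 30, 45, 60)
conservatism <- c(60, 85, 80, 70, 65)
cor(age, conservatism)   #a weak negative coefficient (about -0.25), nowhere near -1 or 1
plot(age, conservatism)  #the points rise and then fall, i.e. the relationship is nonlinear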
How is Pearson’s coefficient influenced by:
- limited variability?
If there is limited variability (a restricted range) in one of the variables, Pearson's coefficient will tend to be attenuated, i.e. closer to 0 than it would be over the full range.
- differences in distribution of the correlated variables?
As Pearson's coefficient assumes the variables are roughly normally distributed, differences in their distributions mean the coefficient may not accurately represent the strength of the relationship.
- outliers?
Outliers can significantly distort the correlation: a single extreme point pulls the regression line towards it and can therefore create a false impression of the strength, or even the direction, of the correlation (this and the restricted-range effect are illustrated in the sketch after this list).
- using extreme groups as samples?
The correlation may appear stronger (or sometimes weaker) than it really is, because the full range of the data is not represented.
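A small simulation (my own sketch, not part of the exercise) illustrates the outlier and restricted-range effects on Pearson's r:
set.seed(42)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)        #moderately correlated data
cor(x, y)                       #baseline correlation
cor(c(x, 10), c(y, -10))        #a single extreme outlier drags the coefficient down
keep <- x > 0                   #restrict the range of x
cor(x[keep], y[keep])           #usually attenuated, because the spread of x is reduced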
6. Estimate the correlation between the variable pairs listed below – is it positive, negative, or zero?
a) height in cm, weight in kg - positive
b) age in months, time to run 50 meters - positive
c) math grade, reading grade - positive
d) math grade, number of missed school lessons in a year - negative
e) IQ, personal identification number - zero
f) interest in sports, interest in politics - zero
g) mileage on car speedometer, year the car was produced - negative
h) maximum daily temperature, household water consumption per day - positive
Suppose the answer to question 6g was r = -0.8; how would the correlation coefficient change if, instead of the variable “year the car was produced”, we used the variable “car’s age”?
It would change to a positive correlation of the same magnitude (r = 0.8), since a car’s age is simply a reversed (negative) linear transformation of its production year.
If correlation between X and Y is 0.5, how does the correlation change if we transform X to T-scores?
The correlation will not change (it stays 0.5), because T-scores are a positive linear transformation of the original variable and Pearson’s r is unaffected by such transformations (checked numerically below).
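A quick numerical check (my own addition): T-scores (mean 50, SD 10) are a positive linear transformation of X, so Pearson's r is unchanged.
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
t_x <- as.numeric(50 + 10 * scale(x))   #T-scores of X
cor(x, y)
cor(t_x, y)                             #identical to cor(x, y)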
If r = 1 and zx = -0.5, what is zy? If r = -1 and zx = 0.8, what is zy?
The z-scores are the standardized scores of X and Y; the predicted z-score of Y is zy = r * zx, and with a perfect correlation this prediction is exact.
zy = 1 * -0.5 = -0.5, and zy = -1 * 0.8 = -0.8
Regarding interpretation, between correlations 0.2 and 0.4 and between correlations 0.5 and 0.7…:
a) there is approximately the same difference
b) there is a bigger difference between the first pair than between the second pair
c) there is a smaller difference between the first pair than between the second pair
d) the differences can’t be compared
Justify your answer.
I think the answer is c), as correlation is not a linear scale: equal differences in r do not represent equal differences in the strength of association (see the Fisher z check below).
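One way to make this concrete (my own addition) is the Fisher z transformation, atanh(r), which puts correlations on a scale where differences are more comparable; on that scale the gap between 0.5 and 0.7 is clearly larger than the gap between 0.2 and 0.4, which supports answer c.
atanh(0.4) - atanh(0.2)   #about 0.22
atanh(0.7) - atanh(0.5)   #about 0.32, a noticeably bigger difference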
Week 10
library(tidyverse)
library(caret)
Warning: package 'caret' was built under R version 4.4.2
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
theme_set(theme_bw())
#Load the data and remove data that is not available (NA)
data("PimaIndiansDiabetes2", package = "mlbench")
#Inspect the data
sample_n(PimaIndiansDiabetes2, 3)
pregnant glucose pressure triceps insulin mass pedigree age diabetes
726 4 112 78 40 NA 39.4 0.236 38 neg
602 6 96 NA NA NA 23.7 0.190 28 neg
326 1 157 72 21 168 25.6 0.123 24 neg
#Split the data into training and test set
#Randomly split the data into training set(80% for a predictive model) and a test set (20% for evaluating the model)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
#Fit the model
model <- glm(diabetes ~ ., data = train.data, family = binomial)
#Summarize the model
summary(model)
Call:
glm(formula = diabetes ~ ., family = binomial, data = train.data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.026e+01 1.392e+00 -7.370 1.71e-13 ***
pregnant 3.562e-02 6.256e-02 0.569 0.56911
glucose 3.969e-02 6.817e-03 5.822 5.80e-09 ***
pressure -3.277e-03 1.306e-02 -0.251 0.80184
triceps -1.009e-03 1.971e-02 -0.051 0.95916
insulin -6.832e-04 1.445e-03 -0.473 0.63645
mass 8.291e-02 3.171e-02 2.615 0.00893 **
pedigree 1.619e+00 5.010e-01 3.231 0.00123 **
age 3.520e-02 2.001e-02 1.759 0.07849 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 399.60 on 314 degrees of freedom
Residual deviance: 278.74 on 306 degrees of freedom
(300 observations deleted due to missingness)
AIC: 296.74
Number of Fisher Scoring iterations: 5
#Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
#Model accuracy
mean(predicted.classes == test.data$diabetes)
[1] NA
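The NA comes from the missing values in the test set: rows with missing predictors get NA predictions, and == then propagates the NA into the mean. A quick fix (my own sketch) is to compute the accuracy over the complete cases only:
mean(predicted.classes == test.data$diabetes, na.rm = TRUE)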
#Simple logistic regression is used to predict the probability of class membership based on a single predictor variable
# This builds a model to predict the probability of being diabetes-positive based on the plasma glucose concentration
model <- glm(diabetes ~ glucose, data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.67020090 0.490507636 -11.55986 6.581398e-31
glucose 0.04017613 0.003775009 10.64266 1.886704e-26
#The intercept is -5.67 and the coefficient of glucose variable is 0.040
#The logistic equation can be written as p = exp(-5.67 + 0.040*glucose)/[1 + exp(-5.67 + 0.040*glucose)]
#Using this formula, for each new plasma glucose concentration you can predict the probability of an individual being diabetes-positive
newdata <- data.frame(glucose = c(20, 180))
probabilities <- model %>% predict(newdata, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
predicted.classes
1 2
"neg" "pos"
#What does this look like on the graph of logistic regression model?
train.data %>%
mutate(prob = ifelse(diabetes == "pos", 1, 0)) %>%
ggplot(aes(glucose, prob)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
labs(
title = "Logistic Regression Model",
x = "Plasma Glucose Concentration",
y = "Probability of being diabete-pos"
)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
#Multiple logistic regression is used to predict the probability of class membership based on multiple predictor variables
model <- glm(diabetes ~ glucose + mass + pregnant,
data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.44301872 0.755764024 -11.171501 5.622253e-29
glucose 0.03715272 0.003835489 9.686568 3.438916e-22
mass 0.08205533 0.016232625 5.054964 4.304725e-07
pregnant 0.10959903 0.030364055 3.609499 3.067888e-04
#We want to include all predictor variables in the data set
model <- glm(diabetes ~ ., data = train.data, family = binomial)
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.025834e+01 1.391940017 -7.36981709 1.708623e-13
pregnant 3.561814e-02 0.062557686 0.56936476 5.691086e-01
glucose 3.969191e-02 0.006817086 5.82241482 5.800336e-09
pressure -3.277149e-03 0.013057808 -0.25097238 8.018355e-01
triceps -1.009412e-03 0.019712451 -0.05120681 9.591607e-01
insulin -6.832247e-04 0.001445460 -0.47266941 6.364491e-01
mass 8.291187e-02 0.031709372 2.61474334 8.929453e-03
pedigree 1.618980e+00 0.501031092 3.23129629 1.232301e-03
age 3.520407e-02 0.020008021 1.75949760 7.849303e-02
#To only extract the coefficients:
coef(model)
(Intercept) pregnant glucose pressure triceps
-1.025834e+01 3.561814e-02 3.969191e-02 -3.277149e-03 -1.009412e-03
insulin mass pedigree age
-6.832247e-04 8.291187e-02 1.618980e+00 3.520407e-02
#Or, to get the full coefficient table with standard errors, z values and p values:
summary(model)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.025834e+01 1.391940017 -7.36981709 1.708623e-13
pregnant 3.561814e-02 0.062557686 0.56936476 5.691086e-01
glucose 3.969191e-02 0.006817086 5.82241482 5.800336e-09
pressure -3.277149e-03 0.013057808 -0.25097238 8.018355e-01
triceps -1.009412e-03 0.019712451 -0.05120681 9.591607e-01
insulin -6.832247e-04 0.001445460 -0.47266941 6.364491e-01
mass 8.291187e-02 0.031709372 2.61474334 8.929453e-03
pedigree 1.618980e+00 0.501031092 3.23129629 1.232301e-03
age 3.520407e-02 0.020008021 1.75949760 7.849303e-02
#A predictor is significantly associated with the outcome if the p value of its coefficient estimate is less than 0.05
#Example: if the glucose coefficient is b = 0.045, an increase in glucose is associated with an increased probability of being diabetes-positive
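To make the coefficients easier to read (my own addition), they can be exponentiated into odds ratios; values above 1 correspond to predictors that increase the odds of being diabetes-positive:
exp(coef(model))   #odds ratios for the full model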
#Penguin data set:
library(tidyverse)
library(palmerpenguins)
library(ggplot2)
m1 <- glm(sex ~ body_mass_g, family = "binomial", data = penguins)
summary(m1)
Call:
glm(formula = sex ~ body_mass_g, family = "binomial", data = penguins)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.1625416 0.7243906 -7.127 1.03e-12 ***
body_mass_g 0.0012398 0.0001727 7.177 7.10e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 461.61 on 332 degrees of freedom
Residual deviance: 396.64 on 331 degrees of freedom
(11 observations deleted due to missingness)
AIC: 400.64
Number of Fisher Scoring iterations: 4
#Pseudo R^2
null <- m1$null.deviance / -2
model <- m1$deviance / -2
(null - model) / null
[1] 0.1407346
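Because the same pseudo R^2 calculation is repeated for each model below, it can be wrapped in a small helper function (my own sketch; it is algebraically identical to (null - model)/null above):
pseudo_r2 <- function(fit) 1 - fit$deviance / fit$null.deviance
pseudo_r2(m1)   #about 0.14, matching the value above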
#Logistic model with all penguin predictors
m2 <- glm(formula = sex ~ ., family = "binomial", data = penguins)
summary(m2)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -80.376671843 12.329734561 -6.5189296 7.081087e-11
speciesChinstrap -7.402696697 1.662533652 -4.4526598 8.481309e-06
speciesGentoo -8.427610932 2.597027352 -3.2450990 1.174098e-03
islandDream 0.324158306 0.809135282 0.4006231 6.886976e-01
islandTorgersen -0.507858042 0.855745697 -0.5934684 5.528677e-01
bill_length_mm 0.614435757 0.131967898 4.6559487 3.224923e-06
bill_depth_mm 1.646445994 0.335797746 4.9030883 9.434156e-07
flipper_length_mm 0.026653725 0.048306899 0.5517581 5.811141e-01
body_mass_g 0.005818832 0.001087293 5.3516706 8.714591e-08
#Pseudo R^2
null2 <- m2$null.deviance / -2
model2 <- m2$deviance / -2
(null2 - model2) / null2
[1] 0.726939
#Final model
m3 <- glm(sex ~ bill_length_mm + bill_depth_mm + body_mass_g, family = "binomial", data = penguins)
summary(m3)
Call:
glm(formula = sex ~ bill_length_mm + bill_depth_mm + body_mass_g,
family = "binomial", data = penguins)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.056e+01 7.081e+00 -8.552 < 2e-16 ***
bill_length_mm 9.151e-02 4.416e-02 2.072 0.0382 *
bill_depth_mm 2.063e+00 2.469e-01 8.356 < 2e-16 ***
body_mass_g 5.061e-03 6.348e-04 7.971 1.57e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 461.61 on 332 degrees of freedom
Residual deviance: 159.89 on 329 degrees of freedom
(11 observations deleted due to missingness)
AIC: 167.89
Number of Fisher Scoring iterations: 7
#Pseudo R^2
null3 <- m3$null.deviance / -2
model3 <- m3$deviance / -2
(null3 - model3) / null3
[1] 0.6536294
#Odds ratio - Penguin's sex
penguins %>%
mutate(size = ifelse(body_mass_g > 4202, "large", "small")) -> penguins2
#This shows a contingency table of how many male and female penguins are classified as large or small based on their body mass
xtabs(~size+sex, data = penguins2)
sex
size female male
large 53 92
small 112 76
#Then you can compute either (A*D)/(B*C) or (A/B)/(C/D), which are equivalent
(53/92)/(112/76)
[1] 0.3909161
#The odds ratio of 0.391 means that the odds of a penguin being female (rather than male) in the "large" group are 0.391 times the odds of being female in the "small" group
#This suggests that males are more likely than females to be in the "large" group, while females are more likely than males to be in the "small" group
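As a cross-check (my own addition), fisher.test() on the same contingency table gives an odds ratio with a confidence interval; it uses a conditional maximum-likelihood estimate, so it is close to, but not exactly equal to, the hand-computed 0.391:
fisher.test(xtabs(~ size + sex, data = penguins2))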
#Modelling probabilities
#I will use penguins2 as I have added the column of size determined by body mass
#Clean up the data to remove NAs
penguins2_clean <- penguins2 %>%
filter(!is.na(sex), !is.na(body_mass_g))
m1 <- glm(sex ~ body_mass_g, family = "binomial", data = penguins2_clean)
penguins2_clean$fitted <- predict(m1, type = "response")

penguins2_clean %>%
ggplot(aes(x = body_mass_g, y = fitted, color = size)) +
geom_point() +
labs(
title = "Fitted Probabilities of Size Based on Body Mass",
x = "Body Mass (g)",
y = "Fitted Probability"
)
#Even though the pseudo R^2 value was 0.14, which may seem low, the fitted model looks reasonable on the ggplot. A value of 0.14 doesn't necessarily indicate a bad model, especially in logistic regression, where pseudo R^2 values are typically lower than the R^2 values seen in linear models