| title: “Week 1: Introduction to the course, R & Sleuth Chapter 1” |
| author: “Eugene D. Gallagher” |
| date: “9/6/2021” |
| output: html_document: |
Does everyone have a working laptop?
Today’s Class is designed to bring students one up to speed on the tools necessary to follow all the case studies in Ramsey & Schafer’s ``Statistical Sleuth’’. I assume a rudimentary knowledge of R from watching Andy Field’s tutorial on installing R & RStudio (see HO02), some familiarity with setting up an R project, and a general overview of R Markdown. I don’t require any experience with coding.
If there are any questions or areas you’d like covered in more detail please do not hesitate to speak up.
Course Google Doc for student background and questions during the lecture: https://docs.google.com/document/d/1FW7untHx9-dXleVoj6WoJ1mlWonjHu8xL3e3UBSB61A/edit?usp=sharing
RStudio has multiple panes. One can customize what’s open at any point. In general the four quadrants are:
If one accidentally closes one, it’s easily opened again from the View menu.
Note Menu items have underlines for certain characters. These are shortcuts for Alt-letter that one can access quickly from the keyboard.
Initially, I’d like everyone to go to Blackboard and download 2 files from the Week 1 module: Week01_Int_Ch01_F21.Rmd and Penguins.csv
We’ll initially set up an R project on your laptops for your R Markdown files and images. Following Andy Field, we’ll create an R Project, labeled something like Week01_int_Ch01_F21 with 2 subdirectories doc and images: There are 7 images you should download from the Week 1 images folder as a zip file.
R Scripts are simple text files, i.e. not Word or editors that contain formatting information. Just straight text, usually presented in a monospaced font.
From RStudio, select File -> New File and there is a large number of options. The important ones for this class are:
source in the upper right of the file and it will execute in the console below.Knit to construct the document. No preview mode. For documents that take a long time to build this is preferred and is becoming the more common option.There are two R sessions going inside R Studio when editing Rmd. One is the console and the other is used to generate Preview sections of documents. This can lead to confusion occasionally as variables that are seen at the console are not what is loaded in the document’s environment. The Run -> Restart R and Run All Chunks restarts the Rmd environment and reruns all chunks in a document. This has the effect of resynchronizing both environments.
It is sometimes helpful to communicate versions of packages are loaded when discussing issues. The command sessionInfo() provides a nice overview of this that is good for checking package versions loaded.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 purrr_0.3.4 readr_2.0.1
## [5] tidyr_1.1.3 tibble_3.1.4 tidyverse_1.3.1 Sleuth3_1.0-3
## [9] mosaic_1.8.3 ggridges_0.5.3 mosaicData_0.20.2 ggformula_0.10.1
## [13] ggstance_0.3.5 dplyr_1.0.7 Matrix_1.3-4 Hmisc_4.5-0
## [17] ggplot2_3.3.5 Formula_1.2-4 survival_3.2-13 lattice_0.20-44
## [21] car_3.0-11 carData_3.0-4 aplpack_1.3.3
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.0 lubridate_1.7.10 httr_1.4.2
## [4] RColorBrewer_1.1-2 tools_4.1.1 backports_1.2.1
## [7] bslib_0.3.0 utf8_1.2.2 R6_2.5.1
## [10] rpart_4.1-15 DBI_1.1.1 colorspace_2.0-2
## [13] nnet_7.3-16 withr_2.4.2 tidyselect_1.1.1
## [16] gridExtra_2.3 leaflet_2.0.4.1 curl_4.3.2
## [19] compiler_4.1.1 cli_3.0.1 rvest_1.0.1
## [22] htmlTable_2.2.1 xml2_1.3.2 ggdendro_0.1.22
## [25] sass_0.4.0 mosaicCore_0.9.0 scales_1.1.1
## [28] checkmate_2.0.0 digest_0.6.27 foreign_0.8-81
## [31] rmarkdown_2.10 rio_0.5.27 base64enc_0.1-3
## [34] jpeg_0.1-9 pkgconfig_2.0.3 htmltools_0.5.2
## [37] labelled_2.8.0 dbplyr_2.1.1 fastmap_1.1.0
## [40] htmlwidgets_1.5.3 rlang_0.4.11 readxl_1.3.1
## [43] rstudioapi_0.13 jquerylib_0.1.4 generics_0.1.0
## [46] farver_2.1.0 jsonlite_1.7.2 crosstalk_1.1.1
## [49] zip_2.2.0 magrittr_2.0.1 Rcpp_1.0.7
## [52] munsell_0.5.0 fansi_0.5.0 abind_1.4-5
## [55] lifecycle_1.0.0 stringi_1.7.4 MASS_7.3-54
## [58] plyr_1.8.6 grid_4.1.1 ggrepel_0.9.1
## [61] crayon_1.4.1 haven_2.4.3 splines_4.1.1
## [64] hms_1.1.0 knitr_1.33 pillar_1.6.2
## [67] tcltk_4.1.1 reprex_2.0.1 glue_1.4.2
## [70] evaluate_0.14 latticeExtra_0.6-29 modelr_0.1.8
## [73] data.table_1.14.0 tzdb_0.1.2 png_0.1-7
## [76] vctrs_0.3.8 tweenr_1.0.2 cellranger_1.1.0
## [79] gtable_0.3.0 polyclip_1.10-0 assertthat_0.2.1
## [82] xfun_0.25 ggforce_0.3.3 openxlsx_4.2.4
## [85] broom_0.7.9 cluster_2.1.2 ellipsis_0.3.2
The current working directory is available via getwd() and settable via setwd("directory"). I find it easiset to set the working directory from the Session button at the top of the RStudio screen. All file loads and saves are relative to the current working directory.
R maintains “environment frames” which are nested up to the top level. Each environment can contain variables in memory. The local environment is searched first then it follows up through the levels.
Note that library is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require used in its place. If the package is not installed, library will trigger an error while require will return FALSE. Once loaded the functions created in the package are available to your R session.
BA Biology Carleton College, Northfield MN 1976
M.Sc. & Ph.D. Oceanography, University of Washington School of Oceanography 1983
Dissertation: The mechanisms of succession in the Skagit Flats infaunal community
Programming Languages
UW Graduate Statistics Courses
Loveday Conquest, UW
Donald McCaughran, UW
Pete Jumars, UW, UMaine
Professional Experience
UMass Boston Statistics courses taught
Professional Development in Statistics
Michael West, Duke
Douglas Bates, University of Wisconsin-Madison
Frank Harrell, Vanderbilt
## syllabus for Fall 2021 EnvSci601
# Introduction to R Coding: Basic R commands (from Gelman 2021, Appendix A)
1/3
## [1] 0.333
sqrt(2)
## [1] 1.41
pi
## [1] 3.14
curve(x^2 +5, from = -2, to=2)
a<-3
print(a)
## [1] 3
b<-10
a+b
## [1] 13
a*b
## [1] 30
exp(a)
## [1] 20.1
10^a
## [1] 1000
log(b)
## [1] 2.3
a^b
## [1] 59049
round(3.435, 0)
## [1] 3
round(3.435, 1)
## [1] 3.4
round(3.435, 2)
## [1] 3.44
x<-c(4,10,-1,2.4)
print(x)
## [1] 4.0 10.0 -1.0 2.4
seq(4,54,10)
## [1] 4 14 24 34 44 54
c(1,3,5)
## [1] 1 3 5
1:5
## [1] 1 2 3 4 5
c(1:5,1, 3, 5)
## [1] 1 2 3 4 5 1 3 5
c(1:5,10:20)
## [1] 1 2 3 4 5 10 11 12 13 14 15 16 17 18 19 20
seq(-1,9,2)
## [1] -1 1 3 5 7 9
# Sampling and random numbers
runif(1,0,100)
## [1] 68
runif(50,0,100)
## [1] 70.284 89.355 98.156 16.062 77.484 50.133 38.209 23.297 37.693 69.220
## [11] 94.898 8.736 71.558 89.353 34.425 76.812 51.325 39.138 71.275 77.508
## [21] 3.377 99.226 90.131 5.372 42.443 98.181 37.613 99.492 97.506 97.389
## [31] 87.884 18.139 84.238 89.693 91.441 38.785 0.366 46.532 18.715 55.208
## [41] 3.504 62.394 30.117 30.791 65.830 34.810 24.212 71.085 48.417 66.498
# Pick one of three colors with equal probability
color<-c("blue","red","green")
sample(color,1)
## [1] "blue"
# Sample with unequal probabilities
p<-c(0.5,0.3,0.2)
sample(color,1, prob=p)
## [1] "green"
# Or do it all in one line, which is more compact but less readable:
sample(c("blue","red","green"),1,prob=c(0.5,0.3,0.2))
## [1] "red"
1/0
## [1] Inf
-1/0
## [1] -Inf
exp(1000)
## [1] Inf
exp(-1000)
## [1] 0
1/Inf
## [1] 0
Inf + Inf
## [1] Inf
-Inf - Inf
## [1] -Inf
0/0
## [1] NaN
Inf - Inf
## [1] NaN
2+3 == 4
## [1] FALSE
2 + 3 == 5
## [1] TRUE
1 < 2
## [1] TRUE
2 < 1
## [1] FALSE
number <- runif(1, 0, 100)
color <- ifelse(number<30,"red","blue")
# Looping
for (i in 1:10){
print("hello")}
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
for (i in 1:10){
print(i)}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
for (i in 1:10){
print(paste("hello",i))}
## [1] "hello 1"
## [1] "hello 2"
## [1] "hello 3"
## [1] "hello 4"
## [1] "hello 5"
## [1] "hello 6"
## [1] "hello 7"
## [1] "hello 8"
## [1] "hello 9"
## [1] "hello 10"
for (i in 1:10){
number <- runif(1,0, 100)
color <- ifelse(number<30, "red","blue")
print(color)
}
## [1] "blue"
## [1] "blue"
## [1] "red"
## [1] "blue"
## [1] "blue"
## [1] "blue"
## [1] "blue"
## [1] "red"
## [1] "blue"
## [1] "blue"
## Working with Vectors
x <- 1:5
y <- c(3,4,1,1)
z <- c("A", "B","C")
# A vector of 5 uniformly distributed random numbers between 0 and 100:
u <- runif (5, 0, 100)
# Component wise operation on vectors
x
## [1] 1 2 3 4 5
y
## [1] 3 4 1 1
x + y
## Warning in x + y: longer object length is not a multiple of shorter object
## length
## [1] 4 6 4 5 8
x/3
## [1] 0.333 0.667 1.000 1.333 1.667
x^4
## [1] 1 16 81 256 625
sum(x)
## [1] 15
mean(x)
## [1] 3
# Subscripting
a <- c("A","B","C","D","E","F","G","H","I","J")
a[1]
## [1] "A"
a[2]
## [1] "B"
a[4:6]
## [1] "D" "E" "F"
a[c(1,3,5)]
## [1] "A" "C" "E"
a[c(8,1:3,2)]
## [1] "H" "A" "B" "C" "B"
a[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)]
## [1] "D" "E" "F"
Penguins
setwd("M:/EnvSci601/R/r_projects/Wk01_IntroR_Ch01_F21/docs")
# or click on the Session button --> Set Working Directory --> Choose Directory
# ?read_csv #let's see how we can load our csv file into an R dataframe
penguins <- read_csv(file = "M:/EnvSci601/R/r_projects/Wk01_IntroR_Ch01_F21/data/penguins.csv") #load penguins data
## Rows: 344 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
penguins #see data, note NA values in red, means there is no data for those points
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ... with 334 more rows, and 2 more variables: sex <chr>, year <dbl>
# View the penguins dataframe
View(penguins)
# Close the view window when done by clicking x on the top of the view
# Two other ways of analyzing the structure of the penguins dataframe:
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A~
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", ~
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex <chr> "male", "female", "female", NA, "female", "male", "f~
## $ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
str(penguins)
## spec_tbl_df [344 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ species : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : num [1:344] 3750 3800 3250 NA 3450 ...
## $ sex : chr [1:344] "male" "female" "female" NA ...
## $ year : num [1:344] 2007 2007 2007 2007 2007 ...
## - attr(*, "spec")=
## .. cols(
## .. species = col_character(),
## .. island = col_character(),
## .. bill_length_mm = col_double(),
## .. bill_depth_mm = col_double(),
## .. flipper_length_mm = col_double(),
## .. body_mass_g = col_double(),
## .. sex = col_character(),
## .. year = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
class(penguins)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
# Use a tidyverse pipe %>% to count the number of individuals of each species.
penguins %>% #get count of each species
count(species)
## # A tibble: 3 x 2
## species n
## <chr> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
penguins %>% #Use a pipe to get mean of bill length, but 2 species have NAs. Let's fix that
group_by(species) %>%
summarise(mean.bill.length = mean(bill_length_mm)) #getting mean of bill length for each species
## # A tibble: 3 x 2
## species mean.bill.length
## <chr> <dbl>
## 1 Adelie NA
## 2 Chinstrap 48.8
## 3 Gentoo NA
# Create a new dataframe penguins2 that gets rid of NA ('Not Available') cases
penguins2 <- penguins %>% #make a new dataframe that gets rid of NA values
drop_na()
View(penguins2) #great! no NAs, but 11 fewer cases 333 vs. 344
str(penguins2)
## tibble [333 x 8] (S3: tbl_df/tbl/data.frame)
## $ species : chr [1:333] "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr [1:333] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth_mm : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length_mm: num [1:333] 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass_g : num [1:333] 3750 3800 3250 3450 3650 ...
## $ sex : chr [1:333] "male" "female" "female" "female" ...
## $ year : num [1:333] 2007 2007 2007 2007 2007 ...
penguins2 %>% #re-calculate the means with a tidyverse pipe
group_by(species) %>%
summarise(mean.bill.length = mean(bill_length_mm))
## # A tibble: 3 x 2
## species mean.bill.length
## <chr> <dbl>
## 1 Adelie 38.8
## 2 Chinstrap 48.8
## 3 Gentoo 47.6
# Make a scatterplot of bill depth (x axis) vs bill length (y axis) and color
# the points by species
ggplot(data=penguins2, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
#brings in the data, tells ggplot which columns, colors by species
geom_point() + #tells ggplot to plot it with points
stat_smooth(method=lm, formula=y~x) + #adds a least squares regression line for each species and sex
theme_classic() +
labs(title="Palmer Penguins: Bill Depth vs. Bill Length",
x= "Bill Depth (mm)", y = "Bill Length (mm)") #sets the title, x and y axis labels
# Make a scatterplot of bill depth (x axis) vs bill length (y axis) and color
# the points by species & plot sex with different symbols
ggplot(data=penguins2, aes(x = bill_depth_mm, y = bill_length_mm, color = species, shape=sex)) + #brings in the data, tells ggplot which columns, colors by species
geom_point() + #tells ggplot to plot it with points
stat_smooth(method=lm, formula=y~x) + #adds a least squares regression line for each species and sex
theme_classic() + labs(title="Palmer Penguins: Bill Depth vs. Bill Length",
x= "Bill Depth (mm)", y = "Bill Length (mm)") #sets the title, x and y axis labels
# Let's make a coplot with different graphs for each species, but the same x,y scales
# The theme is also changed back to the ggplot default, a gray scale
ggplot(data=penguins2, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
#brings in the data, tells ggplot which columns, colors by species
geom_point() + #tells ggplot to plot it with points
stat_smooth(method=lm, formula=y~x) +
#adds a least squares regression line for each species and sex
facet_wrap(~species) +
labs(title="Palmer Penguins: Bill Depth vs. Bill Length",
x= "Bill Depth (mm)", y = "Bill Length (mm)") #sets the title, x and y axis labels
# Let's make a coplot with different graphs for each species, but the same x,y scales, add sex
# The theme is also changed back to the ggplot default, a gray scale
ggplot(data=penguins2, aes(x = bill_depth_mm, y = bill_length_mm, color = species, shape=sex)) + #brings in the data, tells ggplot which columns, colors by species
geom_point() + #tells ggplot to plot it with points
stat_smooth(method=lm, formula=y~x) +
#adds a least squares regression line for each species and sex
facet_wrap(~species) +
labs(title="Palmer Penguins: Bill Depth vs. Bill Length",
x= "Bill Depth (mm)", y = "Bill Length (mm)") #sets the title, x and y axis labels
# Fit a least squares regression of bill depth and bill length in Gentoos
# but first let's make a dataframe for each of the three species
adelie <- filter(penguins2, species == "Adelie") #make a dataframe for just
# adelie, only takes rows that have Adelie in the species column
adelie
## # A tibble: 146 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## # ... with 136 more rows, and 2 more variables: sex <chr>, year <dbl>
chinstrap <- filter(penguins2, species == "Chinstrap")
gentoo <- filter(penguins2, species == "Gentoo")
# Make a scatterplot of Bill Depth (mm) [x] vs. Bill Length (mm) [y] for gentoo penguins
ggplot(data=gentoo, aes(y = bill_length_mm, x = bill_depth_mm)) + # brings in the data, tells ggplot which columns, colors by species
geom_point() + # tells ggplot to plot it with points
stat_smooth(method=lm, formula=y~x) + #adds a least squares regression line
theme_classic() + labs(title="Gentoo penguins: Bill Depth vs. Bill Length", x= "Bill Depth (mm)", y = "Bill Length (mm)")
# labs sets the title, x and y axis labels
# now we will do least squares regression, lm stands for 'linear model' of
# bill_length vs. bill depth for Gentoo penguins.
# ?lm #check the arguments for lm() to see what we need to add to the function
# the tilde symbol '~' means 'is a function of'
g.lm <- lm(bill_length_mm ~ bill_depth_mm, data = gentoo)
# the formula is y ~ x, so the y variable has to go first!
g.lm
##
## Call:
## lm(formula = bill_length_mm ~ bill_depth_mm, data = gentoo)
##
## Coefficients:
## (Intercept) bill_depth_mm
## 16.67 2.06
attributes(g.lm) #this let's you see what is in the g.lm object
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
g.lm$coefficients #get intercept and slope values
## (Intercept) bill_depth_mm
## 16.67 2.06
# get the p value for the regression (Ho: the slope is 0) from the final column of an ANOVA table
anova(g.lm)
## Analysis of Variance Table
##
## Response: bill_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## bill_depth_mm 1 487 487 87.5 7.3e-16 ***
## Residuals 117 651 6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# After invoking this command, move the cursor to the Hit <return> in console
# and hit return 4 times to see the residual plots for this regression.
# Make a box plot of bill_length_mm for each species
ggplot(penguins2, aes(species, bill_length_mm, color = species)) +
geom_boxplot() + #tells ggplot to make a box plot
theme_classic() + labs(title="Palmer Penguins: Species vs. Bill Length",
x="Species",y="Bill Length (mm)")
# Do a t test to compare bill lengths in Adelie and Gentoo penguins
# ?t.test #see the arguments for a t.test
# test whether we can reject the null hypothesis that
# adelie bill length = gentoo bill length; note the use of the $ sign to
# indicate which variable is being tested from the two different data frames
t.test(adelie$bill_length_mm, gentoo$bill_length_mm, alternative = "less")
##
## Welch Two Sample t-test
##
## data: adelie$bill_length_mm and gentoo$bill_length_mm
## t = -24, df = 234, p-value <2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -8.15
## sample estimates:
## mean of x mean of y
## 38.8 47.6
# Try doing the t test again, but change "less" to "greater" what happens?
t.test(adelie$bill_length_mm, gentoo$bill_length_mm, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: adelie$bill_length_mm and gentoo$bill_length_mm
## t = -24, df = 234, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -9.34 Inf
## sample estimates:
## mean of x mean of y
## 38.8 47.6
# Now do the t test again, but change to a two-tailed or two sided alternative
# hypothesis
t.test(adelie$bill_length_mm, gentoo$bill_length_mm, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: adelie$bill_length_mm and gentoo$bill_length_mm
## t = -24, df = 234, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.45 -8.03
## sample estimates:
## mean of x mean of y
## 38.8 47.6
# Do boxplots of sexual dimorphism in mass among species:
ggplot(penguins2,aes(x=species,y=body_mass_g,fill=sex)) +
geom_boxplot()+labs(title="Palmer Penguins", x = "Species",
y="Body Mass (g)")
# plot all variables vs. each other
# Install the Ggally package which has the function ggpairs and plot all
# variables pairwise
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# plot should include only variables: species, bill_length_mm, bill_depth_mm,
# flipper_length_mm, body_mass_g, sex
# The indices for these columns will be used to create an index vector c
# c stands for concatenate or combine
# this vector c will contain 1,3,4,5,6,7
c <- c(1,3:7)
ggpairs(penguins2,columns=c, ggplot2::aes(colour=factor(species)),
title='Palmer Penguins')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Sleuth Chapter 1 Drawing Statistical Conclusions
Science 2015
Science 2015
Questionaire
Questionaire
Wunderlich took thousands of measurements, and found that the “normal” temperature had an average of 36.8 degrees with a large standard error. With appropriate rounding to just 2 significant digits, Wunderlich reported the average human temperature as 37º C #### 98.6º F is the conversion from 37ºF, which should have been rounded up to 99º to maintain 2 significant figures, but this would have made the error worse #### If he’d used 3 significant figures, normal temperature would be 98.2ºF #### http://en.wikipedia.org/wiki/Carl_Reinhold_August_Wunderlich #### If data are to be converted to different scales, incorporate 1 additional signficant figure in reporting the average # Significant figures or digits
library(Sleuth3)
data(case0101)
# Analyze the structure of the data table
head(case0101) # Show the 1st 6 rows of data
## Score Treatment
## 1 5.0 Extrinsic
## 2 5.4 Extrinsic
## 3 6.1 Extrinsic
## 4 10.9 Extrinsic
## 5 11.8 Extrinsic
## 6 12.0 Extrinsic
str(case0101)
## 'data.frame': 47 obs. of 2 variables:
## $ Score : num 5 5.4 6.1 10.9 11.8 12 12.3 14.8 15 16.8 ...
## $ Treatment: Factor w/ 2 levels "Extrinsic","Intrinsic": 1 1 1 1 1 1 1 1 1 1 ...
summary(case0101)
## Score Treatment
## Min. : 5.0 Extrinsic:23
## 1st Qu.:14.9 Intrinsic:24
## Median :18.7
## Mean :17.9
## 3rd Qu.:21.2
## Max. :29.7
favstats(Score ~ Treatment, data=case0101)
## Treatment min Q1 median Q3 max mean sd n missing
## 1 Extrinsic 5 12.2 17.2 18.9 24.0 15.7 5.25 23 0
## 2 Intrinsic 12 17.4 20.4 22.3 29.7 19.9 4.44 24 0
# from the Sleuth3 help file, with slight modifications: data = case0101
boxplot(Score ~ Treatment,data=case0101, # Boxplots with labels
ylab= "Average Creativity Score From 11 Judges (on a 40-point scale)",
names=c("23 'Extrinsic' Group Students","24 'Intrinsic' Group Students"),
main= "Haiku Creativity Scores for 47 Creative Writing Students")
Tukey stem-and-leaf plot
# Individual stem-and-leaf plots, from Nicholas Horton code
with(subset(case0101, Treatment == "Extrinsic"), stem(Score, scale = 5))
##
## The decimal point is at the |
##
## 5 | 04
## 6 | 1
## 7 |
## 8 |
## 9 |
## 10 | 9
## 11 | 8
## 12 | 03
## 13 |
## 14 | 8
## 15 | 0
## 16 | 8
## 17 | 2245
## 18 | 577
## 19 | 25
## 20 | 7
## 21 | 2
## 22 | 1
## 23 |
## 24 | 0
with(subset(case0101, Treatment == "Intrinsic"), stem(Score, scale = 5))
##
## The decimal point is at the |
##
## 12 | 009
## 13 | 6
## 14 |
## 15 |
## 16 | 6
## 17 | 25
## 18 | 2
## 19 | 138
## 20 | 356
## 21 | 36
## 22 | 126
## 23 | 1
## 24 | 03
## 25 |
## 26 | 7
## 27 |
## 28 |
## 29 | 7
# This Horton code produces and error, don't know why:
# mosaic::maggregate(Score ~ Treatment, data = case0101, FUN = stem)
# Display 1.10, page 17 in the text using package aplpack, needed attach/detach
attach(case0101)
aplpack::stem.leaf.backback(Score[case0101$Treatment == "Extrinsic"],
Score[case0101$Treatment == "Intrinsic"],
back.to.back = TRUE,unit=0.1,
show.no.depths = TRUE)
## __________________
## 1 | 2: represents 1.2, leaf unit: 0.1
## Score[case0101$Treatment == "Extrinsic"]
## Score[case0101$Treatment == "Intrinsic"]
## __________________
## 40| 5 |
## 1| 6 |
## | 7 |
## | 8 |
## | 9 |
## 9| 10 |
## 8| 11 |
## 30| 12 |009
## | 13 |6
## 8| 14 |
## 0| 15 |
## 8| 16 |6
## 5422| 17 |25
## 775| 18 |2
## 52| 19 |138
## 7| 20 |356
## 2| 21 |36
## 1| 22 |126
## | 23 |1
## 0| 24 |03
## | 25 |
## | 26 |7
## | 27 |
## | 28 |
## | 29 |7
## | 30 |
## __________________
## n: 23 24
## __________________
detach(case0101)
# Gallagher code with modifications by David Winsemius (1/31/12 post on R-help):
# histogram is a plotting function from the lattice package.
histogram(~Score | Treatment, data=case0101,
scales=list(x=list(at=seq(4,32,by=4),
labels=sprintf("%2.0f", seq(4,32,by=4)))),
endpoints = c(3.5, 32.5), layout = c(1,2), aspect = 1,
xlab = "Creativity Scores")
# Equal variance t test
t.test(Score~Treatment, alternative='two.sided', conf.level=.95,
var.equal=TRUE, data=case0101)
##
## Two Sample t-test
##
## data: Score by Treatment
## t = -3, df = 45, p-value = 0.005
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
## 95 percent confidence interval:
## -7.00 -1.29
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic
## 15.7 19.9
# Unequal variance or Welch t test
t.test(Score ~ Treatment, alternative = "two.sided", data=case0101)
##
## Welch Two Sample t-test
##
## data: Score by Treatment
## t = -3, df = 43, p-value = 0.006
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
## 95 percent confidence interval:
## -7.01 -1.28
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic
## 15.7 19.9
# analyze with a linear model from N Horton
summary(lm(Score ~ Treatment, data = case0101))
##
## Call:
## lm(formula = Score ~ Treatment, data = case0101)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.74 -2.98 1.06 2.96 9.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.74 1.01 15.55 <2e-16 ***
## TreatmentIntrinsic 4.14 1.42 2.93 0.0054 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.85 on 45 degrees of freedom
## Multiple R-squared: 0.16, Adjusted R-squared: 0.141
## F-statistic: 8.56 on 1 and 45 DF, p-value: 0.00537
# Monte Carlo simulations from N Horton's Chapter 1 pdf on his web site
diffmeans = diff(mean(Score ~ Treatment, data = case0101))
diffmeans # observed difference
## Intrinsic
## 4.14
numsim = 4999 # set to a sufficient number of Monte Carlo trials
nulldist = do(numsim) * diff(mean(Score ~ shuffle(Treatment), data = case0101))
confint(nulldist)
## name lower upper level method estimate
## 1 Intrinsic -3.01 2.99 0.95 percentile 4.14
histogram(~Intrinsic, nint = 50, data = nulldist, v = c(-4.14, 4.14))
# Monte Carlo simulations: permutations from Robison-Cox
## Montana State University Sleuth Code (Jim Robison-Cox, Stat 410/511)
with(case0101, tapply(Score, Treatment, mean))
## Extrinsic Intrinsic
## 15.7 19.9
diff(with(case0101, tapply(Score, Treatment, mean)))
## Intrinsic
## 4.14
## randomly resample these numbers once, shuffling the codes:
diff(with(case0101, tapply(Score, sample(Treatment), mean)))
## Intrinsic
## -0.99
## how many ways can these be shuffled? 1.61238e+13
choose(47,23) ## or choose(47, 24)
## [1] 1.61e+13
trials=499
random_differences <- numeric(trials) ## set up storage space, or mailboxes
## use square brackets to refer to one mailbox at a time.
for(i in 1:trials) random_differences[i] <- diff(with(case0101, tapply(Score, sample(Treatment), mean)))
hist(random_differences,breaks=15)
# plot a vertical line at the observed difference of 4.144203
abline(v= c(4.14, -4.14),col="red")
random_differences[which(random_differences < -4.14 | random_differences>4.14)]
## [1] 4.26 -4.69
## approximate (due to not seeing all 1.6e+13 possibilities) p-value
## how extreme is the difference we observed relative to all
## possible differences?
# Following Manly & Legendre & Legendre, add 1 for the observed difference to simulation
(length( which(random_differences < -4.14 | random_differences>4.14))+1)/(trials+1)
## [1] 0.006
##Chihara - Hesterberg Permutation tests code (p 41)
with(case0101, tapply(Score, Treatment, mean))
## Extrinsic Intrinsic
## 15.7 19.9
observed <- diff(with(case0101, tapply(Score, Treatment, mean)))
observed
## Intrinsic
## 4.14
## Perform permutation analysis of means
N <- 10^3 -1 # Number of times to repeat this process
result <- numeric (N) # Space to save the random differences
# use a for loop to do (N-1=999) random permutations
for (i in 1:N)
{ # sample of size 24 from 1 to 47 without replacement
index <- sample (47, size = 24, replace = FALSE)
result[i] <- mean (case0101$Score [index]) -
mean (case0101$Score [-index])
}
hist(result,breaks = 15, xlab = "xbar1 - xbar2",
main = "Case 1.1: Permutation distribution for creativity scores")
abline(v = c(observed, -observed), col = "blue") # add line at observed mean diff.
Pvalue <- (sum(result >= observed)+1)/(N+1) #P-value
Pvalue
## [1] 0.002
“There is strong evidence that a subject would receive a lower creativity score for a poem written after the extrinsic motivation questionnaire” Two-sided p-value=0.005 from a 2-sample t test. The effect size 4.1 point difference on a 0-40 point scale. 95% confidence interval for the difference is 1.3 and 7.0 points
Since this was a randomized experiment, difference in creativity was caused by the difference in motivational questionnaires Because the individuals were not selected randomly from a larger population, extending this inference to a larger population is speculative.
data(case0102)
str(case0102)
## 'data.frame': 93 obs. of 2 variables:
## $ Salary: int 3900 4020 4290 4380 4380 4380 4380 4380 4440 4500 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
head(case0102) # Loads the first 6 cases for viewing
## Salary Sex
## 1 3900 Female
## 2 4020 Female
## 3 4290 Female
## 4 4380 Female
## 5 4380 Female
## 6 4380 Female
# favstats from mosaic (Horton's code)
favstats(Salary ~ Sex, data = case0102)
## Sex min Q1 median Q3 max mean sd n missing
## 1 Female 3900 4800 5220 5400 6300 5139 540 61 0
## 2 Male 4620 5400 6000 6075 8100 5957 691 32 0
boxplot(Salary ~ Sex, data=case0102,
ylab= "Starting Salary (U.S. Dollars)",
names=c("61 Females","32 Males"),
main= "Harris Bank Entry Level Clerical Workers, 1969-1971")
# car Boxplot, which labels the outlier
Boxplot(Salary ~ Sex, data=case0102,
ylab= "Starting Salary (U.S. Dollars)",
names=c("61 Females","32 Males"),
main= "Harris Bank Entry Level Clerical Workers, 1969-1971")
## [1] "93"
# Horton's code:
bwplot(Salary ~ Sex, data = case0102)
SPSS Application Guide Figure 2.7
McGill et al. 1978
Nordhaus, W. PNAS 2006 103: 3495
# densityplot from the lattice package
densityplot(~Salary, groups = Sex, auto.key = TRUE, data = case0102)
# Histograms
attach(case0102)
hist(Salary[Sex=="Female"])
hist(Salary[Sex=="Male"])
detach(case0102)
# Gallagher code with modifications by David Winsemius (1/31/12 post on R-help):
# histogram is a plotting function from the lattice package.
histogram(~ Salary | Sex, data=case0102,
scales=list(x=list(at=seq(4000,8000,by=2000),
labels=sprintf("$%4.0f", seq(4000,8000,by=2000)))),
endpoints = c(3500, 8500), layout = c(1,2), aspect = 1,
xlab = "Salary", main="Case 1.2")
## I can't get the y labels to plot properly on this back-to-back plot, part of
# Frank Harrell's Hmisc package.
# out <- histbackback(split(case0102$Salary, case0102$Sex), probability=TRUE, xlim=c(-.001,.001),
options(digits=0)
out <- histbackback(split(case0102$Salary, case0102$Sex), probability=FALSE, xlim=c(-30,30),
main = 'Sleuth Case 1.2',ylab='Starting Salaries($)')
#! just adding color
barplot(-out$left, col="red" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="blue", horiz=TRUE, space=0, add=TRUE, axes=FALSE)
options(digits=3)
# Plots somewhat like Display 1.13 on page 20 in 3rd edition
# Plot some distributions
histogram(rnorm(1000)) # Normal
histogram(rexp(1000)) # Long-tailed
histogram(runif(1000)) # Short-tailed
histogram(rchisq(1000, df = 15)) # Skewed
diffSex<-diff(with(case0102, tapply(Salary,Sex, mean)))
sprintf('The difference in salary was %8.4f',diffSex)
## [1] "The difference in salary was 818.0225"
# Independent sample t test, equal variances assumed:
t.test(Salary~Sex, alternative='two.sided', conf.level=.95, var.equal=TRUE,
data=case0102)
##
## Two Sample t-test
##
## data: Salary by Sex
## t = -6, df = 91, p-value = 1e-08
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -1076 -560
## sample estimates:
## mean in group Female mean in group Male
## 5139 5957
# Permutation test from Horton
obsdiff = diff(mean(Salary ~ Sex, data = case0102))
obsdiff
## Male
## 818
numsim = 999
res = do(numsim) * diff(mean(Salary ~ shuffle(Sex), data = case0102))
densityplot(~Male, data = res)
confint(res)
## name lower upper level method estimate
## 1 Male -290 305 0.95 percentile 818
# Add the Manly-Legendre convention for simulations to add 1 for the actual
# This prevents reporting a p-value of 0, which it never is.
p = (sum(abs(res$Male) >= abs(obsdiff))+1)/(numsim +1)
p
## [1] 0.001
# Robison Cox's MSU simulation
observed <- diff(with(case0102, tapply(Salary,Sex, mean)))
## 818.0225
plot(Salary ~ Sex, data = case0102) # because code is numeric, gender is a factor
with(case0102, tapply(Salary,Sex, length))
## Female Male
## 61 32
# How many combinations with groups of size 61 & 32
choose(61+32,32)
## [1] 8.66e+24
trials = 10^4 -1
random_differences <- numeric(trials) ## set up storage space, or mailboxes
## use square brackets to refer to one mailbox at a time.
for(i in 1:trials) random_differences[i] <- diff(with(case0102, tapply(sample(Salary),Sex, mean)))
hist(random_differences, breaks=15)
abline(v= c(-1,1) * observed, col="red")
l<-length( which(random_differences < -observed | random_differences> observed))/1000
## approximate p-value < 1 in 1000
Pvalue <- (l+1)/(trials+1)
Pvalue
## [1] 1e-04
##Chihara - Hesterberg Permutation tests code (p 41)
with(case0101, tapply(Score, Treatment, mean))
## Extrinsic Intrinsic
## 15.7 19.9
observed <- diff(with(case0102, tapply(Salary, Sex, mean)))
observed
## Male
## 818
## Perform permutation analysis of means
N <- 10^4 -1 # Number of times to repeat this process
result <- numeric (N) # Space to save the random differences
for (i in 1:N)
{ # sample of size 32 from 1 to 32 without replacement
index <- sample (61+32, size = 32, replace = FALSE)
result[i] <- mean (case0102$Salary [index]) -
mean (case0102$Salary [-index])
}
# scale is off, need to fix sometime
hist(result,breaks = 15, xlab = "xbar1 - xbar2",
main = "Case 1.2: Permutation distribution for salaried")
abline(v = c(observed, -observed), col = "blue") # add line at observed mean diff.
Pvalue <- (sum(result >= observed)+1)/(N+1) #P-value
Pvalue
## [1] 1e-04
Sleuth Display 1.5
An inference is a conclusion that patterns in the data are present in some broader context
A statistical inference is an inference justified by a probability model linking the data to the broader context # Randomization From Kendall & Stuart’s (1977) ‘Advanced Theory of Statistics’ “The principle of randomization is simply stated: Whenever experimental units are allocated to factor-combinations in an experiment, this should be done by a random process using equal probabilities.”
“Even if the relationship of the dependent variable with some unsuspected causal factor is not recognized until after the experiment, the validity of the inferences will not be impaired, provided that the factor’s influence was “randomized out” of the experiment.”
Kendall & Stuart (1977) on experiments In any experiment the factors influencing the dependent variable are, explicitly or implicitly, divided by the experimenter into three classes: - Those incorporated into the structure of the experiment - Those “randomized out” of the experiment - Those neither incorporated nor randomized out
Classes 1 & 2 require positive action, affecting the layout of the experiment, or the randomization procedure employed. A factor may find its way into class (3) by simply being overlooked.
What makes a good experimenter? Kendall & Stuart (1977) “A substantial part of the skill of the experimenter lies in his choice of factors to be randomized out of the experiment. If he is careful, he will randomize out all the factors which are suspected of being causally important but which are not actually part of the experimental procedure. But every experimenter necessarily neglects some conceivably causal factors; if this were not so, the randomization procedure required would be impossibly complicated. Thus the choice of what factors to be randomized out is essentially a matter of judgement.”
Description of experimental design should include - The nature of the experimental units to be employed - The number and kinds of treatments and the properties of the responses that will be measured. - Specification of how the treatments will be assigned to the available experimental units (replicates) - The physical arrangement of the experimental units, (and often) the temporal sequence in which treatments are applied to and measurements made on the different experimental units.’
Sleuth is too restrictive here: Inferences to populations can be drawn from random sampling studies, but not otherwise
Simple random sampling (SRS): A simple random sample of size n from a population is a subset of the population consisting of n members selected in such a way that every subset of size n is afforded the same chance of being selected.
Random sampling ensures that all subpopulations are represented in the sample in roughly the same mix as in the overall population.
Statistical inference procedures incorporate measures of uncertainty that describe that chance. Selecting a random sample - Simple random sampling - Stratified random sampling - Multilevel sampling (e.g., Regions, Lakes, areas within lakes) - Systematic sampling
Manly Figure 1.2
Thompson 1990
- Variable probability sampling
- EMAP
EMAP Virginia Province
MA Bay sampling
MWRA MA Bay sampling
BostonHarbor sampling
P-value Modified from K Wuensch The p-value is the probability of obtaining data as or more discrepant with respect to the null and alternative hypothesis than are those in the present sample, assuming that the null hypothesis is absolutely correct. In a one-sample z-test of the hypothesis μ = μ_o, a z-score of 1.96 can have p-value 0.975, 0.05 or 0.025 depending on whether the alternative hypothesis is μ < μ_o, μ ≠ μ_o, or μ > μ_o, respectively.
Statistical Sleuthing Significant digits, Case Studies 1.1 & 1.2 Histograms, density plots, stem-and-leaf & dot plots Randomization, Fisher’s major contribution Experiment vs. observational study (survey), Experimental design Sampling schemes (SRS, stratified, systematic, adaptive, EMAP probability based sampling) Probability model for randomized experiments Randomization distribution p values