title: “Week 1: Introduction to the course, R & Sleuth Chapter 1”

author: “Eugene D. Gallagher”

date: “9/6/2021”

output: html_document:

Foundations

Does everyone have a working laptop?

Today’s Class is designed to bring students one up to speed on the tools necessary to follow all the case studies in Ramsey & Schafer’s ``Statistical Sleuth’’. I assume a rudimentary knowledge of R from watching Andy Field’s tutorial on installing R & RStudio (see HO02), some familiarity with setting up an R project, and a general overview of R Markdown. I don’t require any experience with coding.

If there are any questions or areas you’d like covered in more detail please do not hesitate to speak up.

Course Google Doc for student background and questions during the lecture: https://docs.google.com/document/d/1FW7untHx9-dXleVoj6WoJ1mlWonjHu8xL3e3UBSB61A/edit?usp=sharing

RStudio

RStudio has multiple panes. One can customize what’s open at any point. In general the four quadrants are:

Upper Left: Open Files
Upper Right: Environment and History Browser. Good for seeing what variables are in memory and exploring them.
Lower Left: Console. The interactive command to the running R interpreter.
Lower Right: Directory browser, Plots, Packages, Help.

If one accidentally closes one, it’s easily opened again from the View menu.

Note Menu items have underlines for certain characters. These are shortcuts for Alt-letter that one can access quickly from the keyboard.

Initially, I’d like everyone to go to Blackboard and download 2 files from the Week 1 module: Week01_Int_Ch01_F21.Rmd and Penguins.csv

We’ll initially set up an R project on your laptops for your R Markdown files and images. Following Andy Field, we’ll create an R Project, labeled something like Week01_int_Ch01_F21 with 2 subdirectories doc and images: There are 7 images you should download from the Week 1 images folder as a zip file.

Creating scripts

R Scripts are simple text files, i.e. not Word or editors that contain formatting information. Just straight text, usually presented in a monospaced font.

From RStudio, select File -> New File and there is a large number of options. The important ones for this class are:

R Script. A file containing R code that is executable. If the file is open in the upper left pane of RStudio one can click source in the upper right of the file and it will execute in the console below.
R Notebook. An older version that allows for a preview of the document. The preview is rebuilt on every save.
R Markdown. Same as R Notebook, but requires a manual Knit to construct the document. No preview mode. For documents that take a long time to build this is preferred and is becoming the more common option.

Session & Working Directory

There are two R sessions going inside R Studio when editing Rmd. One is the console and the other is used to generate Preview sections of documents. This can lead to confusion occasionally as variables that are seen at the console are not what is loaded in the document’s environment. The Run -> Restart R and Run All Chunks restarts the Rmd environment and reruns all chunks in a document. This has the effect of resynchronizing both environments.

It is sometimes helpful to communicate versions of packages are loaded when discussing issues. The command sessionInfo() provides a nice overview of this that is good for checking package versions loaded.

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.1     stringr_1.4.0     purrr_0.3.4       readr_2.0.1      
##  [5] tidyr_1.1.3       tibble_3.1.4      tidyverse_1.3.1   Sleuth3_1.0-3    
##  [9] mosaic_1.8.3      ggridges_0.5.3    mosaicData_0.20.2 ggformula_0.10.1 
## [13] ggstance_0.3.5    dplyr_1.0.7       Matrix_1.3-4      Hmisc_4.5-0      
## [17] ggplot2_3.3.5     Formula_1.2-4     survival_3.2-13   lattice_0.20-44  
## [21] car_3.0-11        carData_3.0-4     aplpack_1.3.3    
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.0            lubridate_1.7.10    httr_1.4.2         
##  [4] RColorBrewer_1.1-2  tools_4.1.1         backports_1.2.1    
##  [7] bslib_0.3.0         utf8_1.2.2          R6_2.5.1           
## [10] rpart_4.1-15        DBI_1.1.1           colorspace_2.0-2   
## [13] nnet_7.3-16         withr_2.4.2         tidyselect_1.1.1   
## [16] gridExtra_2.3       leaflet_2.0.4.1     curl_4.3.2         
## [19] compiler_4.1.1      cli_3.0.1           rvest_1.0.1        
## [22] htmlTable_2.2.1     xml2_1.3.2          ggdendro_0.1.22    
## [25] sass_0.4.0          mosaicCore_0.9.0    scales_1.1.1       
## [28] checkmate_2.0.0     digest_0.6.27       foreign_0.8-81     
## [31] rmarkdown_2.10      rio_0.5.27          base64enc_0.1-3    
## [34] jpeg_0.1-9          pkgconfig_2.0.3     htmltools_0.5.2    
## [37] labelled_2.8.0      dbplyr_2.1.1        fastmap_1.1.0      
## [40] htmlwidgets_1.5.3   rlang_0.4.11        readxl_1.3.1       
## [43] rstudioapi_0.13     jquerylib_0.1.4     generics_0.1.0     
## [46] farver_2.1.0        jsonlite_1.7.2      crosstalk_1.1.1    
## [49] zip_2.2.0           magrittr_2.0.1      Rcpp_1.0.7         
## [52] munsell_0.5.0       fansi_0.5.0         abind_1.4-5        
## [55] lifecycle_1.0.0     stringi_1.7.4       MASS_7.3-54        
## [58] plyr_1.8.6          grid_4.1.1          ggrepel_0.9.1      
## [61] crayon_1.4.1        haven_2.4.3         splines_4.1.1      
## [64] hms_1.1.0           knitr_1.33          pillar_1.6.2       
## [67] tcltk_4.1.1         reprex_2.0.1        glue_1.4.2         
## [70] evaluate_0.14       latticeExtra_0.6-29 modelr_0.1.8       
## [73] data.table_1.14.0   tzdb_0.1.2          png_0.1-7          
## [76] vctrs_0.3.8         tweenr_1.0.2        cellranger_1.1.0   
## [79] gtable_0.3.0        polyclip_1.10-0     assertthat_0.2.1   
## [82] xfun_0.25           ggforce_0.3.3       openxlsx_4.2.4     
## [85] broom_0.7.9         cluster_2.1.2       ellipsis_0.3.2

The current working directory is available via getwd() and settable via setwd("directory"). I find it easiset to set the working directory from the Session button at the top of the RStudio screen. All file loads and saves are relative to the current working directory.

Environment

R maintains “environment frames” which are nested up to the top level. Each environment can contain variables in memory. The local environment is searched first then it follows up through the levels.

Loading Packages

Note that library is a bit of misnomer as R uses packages, not libraries. From a technical standpoint, it’s nice to recognize the distinction. You may see require used in its place. If the package is not installed, library will trigger an error while require will return FALSE. Once loaded the functions created in the package are available to your R session.

Introduction to EnvSci601

Gallagher’s Background

BA Biology Carleton College, Northfield MN 1976
M.Sc. & Ph.D. Oceanography, University of Washington School of Oceanography 1983
- Dissertation: The mechanisms of succession in the Skagit Flats infaunal community
- Programming Languages
  - FORTRAN
  - IMSL & Boeing subroutine libraries
  - Matlab
  - R
- UW Graduate Statistics Courses
  - Probability & Introduction to Statistics
  - Applied Statistics: ANOVA
  - Regression
  - Nonparametric statistics

Loveday Conquest, UW

Donald McCaughran, UW

Pete Jumars, UW, UMaine

Professional Experience
- Instructor, Benthic Ecology, UW School of Oceanography 1982
- Independent Scientist Woods Hole Marine Biological Laboratory 1983-1985
- Co-instructor MBL Marine Ecology Course 1984-1987
- UMass Boston 1983-Present
UMass Boston Statistics courses taught
- EnvSci261: Intro. Env. Statisitcs (with R)
- EnvSci601: Intro. Prob Models & Statistics (with Matlab, now R)
- EnvSci611: Applied Statistics (R, Matlab, SPSS)
- EnvSci612: Multivariate Statistics (Matlab)
Professional Development in Statistics
- Member, Boston Chapter of the American Statistical Association
- West’s Analysis of Time Series with Matlab ASA short course (2013)
- Bates’s Mixed modeling with Julia ASA short course (2017)
- Harrell’s Regression Modeling Strategies (11-14 May 2021, 4-d course)

Michael West, Duke

Douglas Bates, University of Wisconsin-Madison

Frank Harrell, Vanderbilt

  ## syllabus for Fall 2021 EnvSci601

# Introduction to R Coding: Basic R commands (from Gelman 2021, Appendix A)

1/3

## [1] 0.333

sqrt(2)

## [1] 1.41

pi

## [1] 3.14

curve(x^2 +5, from = -2, to=2)

a<-3
print(a)

## [1] 3

b<-10
a+b

## [1] 13

a*b

## [1] 30

exp(a)

## [1] 20.1

10^a

## [1] 1000

log(b)

## [1] 2.3

a^b

## [1] 59049

round(3.435, 0)

## [1] 3

round(3.435, 1)

## [1] 3.4

round(3.435, 2)

## [1] 3.44

x<-c(4,10,-1,2.4)
print(x)

## [1]  4.0 10.0 -1.0  2.4

seq(4,54,10)

## [1]  4 14 24 34 44 54

c(1,3,5)

## [1] 1 3 5

1:5

## [1] 1 2 3 4 5

c(1:5,1, 3, 5)

## [1] 1 2 3 4 5 1 3 5

c(1:5,10:20)

##  [1]  1  2  3  4  5 10 11 12 13 14 15 16 17 18 19 20

seq(-1,9,2)

## [1] -1  1  3  5  7  9

# Sampling and random numbers
runif(1,0,100)

## [1] 68

runif(50,0,100)

##  [1] 70.284 89.355 98.156 16.062 77.484 50.133 38.209 23.297 37.693 69.220
## [11] 94.898  8.736 71.558 89.353 34.425 76.812 51.325 39.138 71.275 77.508
## [21]  3.377 99.226 90.131  5.372 42.443 98.181 37.613 99.492 97.506 97.389
## [31] 87.884 18.139 84.238 89.693 91.441 38.785  0.366 46.532 18.715 55.208
## [41]  3.504 62.394 30.117 30.791 65.830 34.810 24.212 71.085 48.417 66.498

# Pick one of three colors with equal probability
color<-c("blue","red","green")
sample(color,1)

## [1] "blue"

# Sample with unequal probabilities
p<-c(0.5,0.3,0.2)
sample(color,1, prob=p)

## [1] "green"

# Or do it all in one line, which is more compact but less readable:
sample(c("blue","red","green"),1,prob=c(0.5,0.3,0.2))

## [1] "red"

1/0

## [1] Inf

-1/0

## [1] -Inf

exp(1000)

## [1] Inf

exp(-1000)

## [1] 0

1/Inf

## [1] 0

Inf + Inf

## [1] Inf

-Inf - Inf

## [1] -Inf

0/0

## [1] NaN

Inf - Inf

## [1] NaN

2+3 == 4

## [1] FALSE

2 + 3 == 5

## [1] TRUE

1 < 2

## [1] TRUE

2 < 1

## [1] FALSE

number <- runif(1, 0, 100)
color <- ifelse(number<30,"red","blue")

# Looping
for (i in 1:10){
  print("hello")}

## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"
## [1] "hello"

for (i in 1:10){
  print(i)}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

for (i in 1:10){
  print(paste("hello",i))}

## [1] "hello 1"
## [1] "hello 2"
## [1] "hello 3"
## [1] "hello 4"
## [1] "hello 5"
## [1] "hello 6"
## [1] "hello 7"
## [1] "hello 8"
## [1] "hello 9"
## [1] "hello 10"

  for (i in 1:10){
    number <- runif(1,0, 100)
    color <- ifelse(number<30, "red","blue")
    print(color)
  }

## [1] "blue"
## [1] "blue"
## [1] "red"
## [1] "blue"
## [1] "blue"
## [1] "blue"
## [1] "blue"
## [1] "red"
## [1] "blue"
## [1] "blue"

## Working with Vectors

x <- 1:5
y <- c(3,4,1,1)
z <- c("A", "B","C")
# A vector of 5 uniformly distributed random numbers between 0 and 100:
u <- runif (5, 0, 100)
# Component wise operation on vectors
x

## [1] 1 2 3 4 5

## [1] 3 4 1 1

x + y

## Warning in x + y: longer object length is not a multiple of shorter object
## length

## [1] 4 6 4 5 8

x/3

## [1] 0.333 0.667 1.000 1.333 1.667

x^4

## [1]   1  16  81 256 625

sum(x)

## [1] 15

mean(x)

## [1] 3

# Subscripting
a <- c("A","B","C","D","E","F","G","H","I","J")
a[1]

## [1] "A"

a[2]

## [1] "B"

a[4:6]

## [1] "D" "E" "F"

a[c(1,3,5)]

## [1] "A" "C" "E"

a[c(8,1:3,2)]

## [1] "H" "A" "B" "C" "B"

  a[c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)]

## [1] "D" "E" "F"

Analyze the Penguin Data

Penguins

setwd("M:/EnvSci601/R/r_projects/Wk01_IntroR_Ch01_F21/docs")
# or click on the Session button --> Set Working Directory --> Choose Directory
# ?read_csv #let's see how we can load our csv file into an R dataframe

penguins <- read_csv(file = "M:/EnvSci601/R/r_projects/Wk01_IntroR_Ch01_F21/data/penguins.csv") #load penguins data

## Rows: 344 Columns: 8

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

penguins #see data, note NA values in red, means there is no data for those points

## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ... with 334 more rows, and 2 more variables: sex <chr>, year <dbl>

# View the penguins dataframe
View(penguins) 
# Close the view window when done by clicking x on the top of the view

# Two other ways of analyzing the structure of the penguins dataframe:
glimpse(penguins)

## Rows: 344
## Columns: 8
## $ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A~
## $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", ~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <chr> "male", "female", "female", NA, "female", "male", "f~
## $ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

str(penguins)

## spec_tbl_df [344 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ species          : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island           : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : num [1:344] 3750 3800 3250 NA 3450 ...
##  $ sex              : chr [1:344] "male" "female" "female" NA ...
##  $ year             : num [1:344] 2007 2007 2007 2007 2007 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   species = col_character(),
##   ..   island = col_character(),
##   ..   bill_length_mm = col_double(),
##   ..   bill_depth_mm = col_double(),
##   ..   flipper_length_mm = col_double(),
##   ..   body_mass_g = col_double(),
##   ..   sex = col_character(),
##   ..   year = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

class(penguins)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

# Use a tidyverse pipe %>% to count the number of individuals of each species.
penguins %>% #get count of each species
  count(species)

## # A tibble: 3 x 2
##   species       n
##   <chr>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

penguins %>% #Use a pipe to get mean of bill length, but 2 species have NAs. Let's fix that
  group_by(species) %>% 
  summarise(mean.bill.length = mean(bill_length_mm)) #getting mean of bill length for each species

## # A tibble: 3 x 2
##   species   mean.bill.length
##   <chr>                <dbl>
## 1 Adelie                NA  
## 2 Chinstrap             48.8
## 3 Gentoo                NA

# Create a new dataframe penguins2 that gets rid of NA ('Not Available') cases
penguins2 <- penguins %>% #make a new dataframe that gets rid of NA values
  drop_na()

View(penguins2) #great! no NAs, but 11 fewer cases 333 vs. 344
str(penguins2)

## tibble [333 x 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : chr [1:333] "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island           : chr [1:333] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
##  $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
##  $ flipper_length_mm: num [1:333] 181 186 195 193 190 181 195 182 191 198 ...
##  $ body_mass_g      : num [1:333] 3750 3800 3250 3450 3650 ...
##  $ sex              : chr [1:333] "male" "female" "female" "female" ...
##  $ year             : num [1:333] 2007 2007 2007 2007 2007 ...

penguins2 %>% #re-calculate the means with a tidyverse pipe
  group_by(species) %>%
  summarise(mean.bill.length = mean(bill_length_mm))

## # A tibble: 3 x 2
##   species   mean.bill.length
##   <chr>                <dbl>
## 1 Adelie                38.8
## 2 Chinstrap             48.8
## 3 Gentoo                47.6

# Make a scatterplot of bill depth (x axis) vs bill length (y axis) and color 
# the points by species 
ggplot(data=penguins2, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  #brings in the data, tells ggplot which columns, colors by species
  geom_point() + #tells ggplot to plot it with points
  stat_smooth(method=lm, formula=y~x)  + #adds a least squares regression line for each species and sex
  theme_classic() + 
  labs(title="Palmer Penguins: Bill Depth vs. Bill Length", 
  x= "Bill Depth (mm)", y = "Bill Length (mm)") #sets the title, x and y axis labels

# Make a scatterplot of bill depth (x axis) vs bill length (y axis) and color 
# the points by species & plot sex with different symbols
ggplot(data=penguins2, aes(x = bill_depth_mm, y = bill_length_mm, color = species, shape=sex)) + #brings in the data, tells ggplot which columns, colors by species
  geom_point() + #tells ggplot to plot it with points
  stat_smooth(method=lm, formula=y~x)  + #adds a least squares regression line for each species and sex
  theme_classic() + labs(title="Palmer Penguins: Bill Depth vs. Bill Length", 
  x= "Bill Depth (mm)", y = "Bill Length (mm)") #sets the title, x and y axis labels

# Let's make a coplot with different graphs for each species, but the same x,y scales
# The theme is also changed back to the ggplot default, a gray scale
ggplot(data=penguins2, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  #brings in the data, tells ggplot which columns, colors by species
  geom_point() + #tells ggplot to plot it with points
  stat_smooth(method=lm, formula=y~x)  + 
  #adds a least squares regression line for each species and sex
  facet_wrap(~species) +
  labs(title="Palmer Penguins: Bill Depth vs. Bill Length", 
       x= "Bill Depth (mm)", y = "Bill Length (mm)") #sets the title, x and y axis labels

# Let's make a coplot with different graphs for each species, but the same x,y scales, add sex
# The theme is also changed back to the ggplot default, a gray scale
ggplot(data=penguins2, aes(x = bill_depth_mm, y = bill_length_mm, color = species, shape=sex)) + #brings in the data, tells ggplot which columns, colors by species
  geom_point() + #tells ggplot to plot it with points
  stat_smooth(method=lm, formula=y~x)  + 
  #adds a least squares regression line for each species and sex
  facet_wrap(~species) +
   labs(title="Palmer Penguins: Bill Depth vs. Bill Length", 
    x= "Bill Depth (mm)", y = "Bill Length (mm)") #sets the title, x and y axis labels

# Fit a least squares regression of bill depth and bill length in Gentoos
# but first let's make a dataframe for each of the three species
adelie <- filter(penguins2, species == "Adelie") #make a dataframe for just 
# adelie, only takes rows that have Adelie in the species column
adelie

## # A tibble: 146 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # ... with 136 more rows, and 2 more variables: sex <chr>, year <dbl>

chinstrap <- filter(penguins2, species == "Chinstrap")
gentoo <- filter(penguins2, species == "Gentoo")

# Make a scatterplot of Bill Depth (mm) [x] vs. Bill Length (mm) [y] for gentoo penguins
ggplot(data=gentoo, aes(y = bill_length_mm, x = bill_depth_mm)) + # brings in the data, tells ggplot which columns, colors by species
  geom_point() + # tells ggplot to plot it with points
  stat_smooth(method=lm, formula=y~x)  + #adds a least squares regression line
  theme_classic() + labs(title="Gentoo penguins: Bill Depth vs. Bill Length", x= "Bill Depth (mm)", y = "Bill Length (mm)")

# labs sets the title, x and y axis labels

# now we will do least squares regression, lm stands for 'linear model' of 
# bill_length vs. bill depth for Gentoo penguins.

# ?lm #check the arguments for lm() to see what we need to add to the function
# the tilde symbol '~' means 'is a function of'
g.lm <- lm(bill_length_mm ~ bill_depth_mm, data = gentoo) 
# the formula is y ~ x, so the y variable has to go first!
g.lm

## 
## Call:
## lm(formula = bill_length_mm ~ bill_depth_mm, data = gentoo)
## 
## Coefficients:
##   (Intercept)  bill_depth_mm  
##         16.67           2.06

attributes(g.lm) #this let's you see what is in the g.lm object

## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"

g.lm$coefficients #get intercept and slope values

##   (Intercept) bill_depth_mm 
##         16.67          2.06

# get the p value for the regression (Ho: the slope is 0) from the final column of an ANOVA table
anova(g.lm)

## Analysis of Variance Table
## 
## Response: bill_length_mm
##                Df Sum Sq Mean Sq F value  Pr(>F)    
## bill_depth_mm   1    487     487    87.5 7.3e-16 ***
## Residuals     117    651       6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# After invoking this command, move the cursor to the Hit <return> in console
# and hit return 4 times to see the residual plots for this regression.

# Make a box plot of bill_length_mm for each species
ggplot(penguins2, aes(species, bill_length_mm, color = species)) +
  geom_boxplot() + #tells ggplot to make a box plot 
  theme_classic() + labs(title="Palmer Penguins: Species vs. Bill Length",
  x="Species",y="Bill Length (mm)")

# Do a t test to compare bill lengths in Adelie and Gentoo penguins
# ?t.test #see the arguments for a t.test

# test whether we can reject the null hypothesis that
# adelie bill length = gentoo bill length; note the use of the $ sign to
# indicate which variable is being tested from the two different data frames
t.test(adelie$bill_length_mm, gentoo$bill_length_mm, alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  adelie$bill_length_mm and gentoo$bill_length_mm
## t = -24, df = 234, p-value <2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##   -Inf -8.15
## sample estimates:
## mean of x mean of y 
##      38.8      47.6

# Try doing the t test again, but change "less" to "greater" what happens?
t.test(adelie$bill_length_mm, gentoo$bill_length_mm, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  adelie$bill_length_mm and gentoo$bill_length_mm
## t = -24, df = 234, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -9.34   Inf
## sample estimates:
## mean of x mean of y 
##      38.8      47.6

# Now do the t test again, but change to a two-tailed or two sided alternative
# hypothesis
t.test(adelie$bill_length_mm, gentoo$bill_length_mm, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  adelie$bill_length_mm and gentoo$bill_length_mm
## t = -24, df = 234, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.45 -8.03
## sample estimates:
## mean of x mean of y 
##      38.8      47.6

# Do boxplots of sexual dimorphism in mass among species:
ggplot(penguins2,aes(x=species,y=body_mass_g,fill=sex)) +
  geom_boxplot()+labs(title="Palmer Penguins", x = "Species",  
  y="Body Mass (g)")

# plot all variables vs. each other
# Install the Ggally package which has the function ggpairs and plot all
# variables pairwise
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

# plot should include only variables: species, bill_length_mm, bill_depth_mm,
# flipper_length_mm, body_mass_g, sex
# The indices for these columns will be used to create an index vector c
# c stands for concatenate or combine
# this vector c will contain 1,3,4,5,6,7
c <- c(1,3:7)

ggpairs(penguins2,columns=c, ggplot2::aes(colour=factor(species)),
        title='Palmer Penguins')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Sleuth Chapter 1 Drawing Statistical Conclusions

Case 1: Motivation & Creativity

A randomized experiment by Amabile et al. 1985 (JPSP)

Questions

Do grading systems promote creativity in students?

Do ranking systems and incentive awards programs increase productivity among students?

Do rewards and praise stimulate students to learn?

Reproducibility & Replicability

Science 2015

Motivation Experimental Design

Subjects with considerable experience in creative writing were randomly assigned to two groups

Intrinsic

Extrinsic

The groups were asked to complete a questionnaire, ranking reasons for writing

There were two types of questionaire (see Figure): one emphasizing intrinsic rewards

The other questionaire emphasized extrinsic factors

Questionaire

All subjects asked to write a Haiku poem about laughter

All Haikus submitted to a panel of 12 poets, to be graded on a 0 to 40 point scale

Analysis of Sleuth Case Study 1.1 data with R

Questionaire

Too few significant figures can cause a fever to be missed

From Paulos: “A mathematician reads the newspaper”

Trick question for the day: What is “normal” human temperature?

Answer: 98.2º F

Wunderlich took thousands of measurements, and found that the “normal” temperature had an average of 36.8 degrees with a large standard error. With appropriate rounding to just 2 significant digits, Wunderlich reported the average human temperature as 37º C #### 98.6º F is the conversion from 37ºF, which should have been rounded up to 99º to maintain 2 significant figures, but this would have made the error worse #### If he’d used 3 significant figures, normal temperature would be 98.2ºF #### http://en.wikipedia.org/wiki/Carl_Reinhold_August_Wunderlich #### If data are to be converted to different scales, incorporate 1 additional signficant figure in reporting the average # Significant figures or digits

From Bevington & Robinson (1992, p. 4) & Taylor (1997)

The leftmost nonzero digit is the most significant digit

If there is no decimal point, the rightmost non-zero digit is the least significant digit

If there is a decimal point, the rightmost digit is the least significant digit, even if it is zero

All digits between the least and most significant digits are counted as significant digits

How many digits should be reported?

These numbers all have 4 significant digits (or figures): 1234, 123400, 123.4, 1001, 1000., 10.10, 0.0001010, 100.0

Best to write in scientific notation with the appropriate number of digits 1.010 x 10-4

Bevington & Robinson (1992): In calculations, carry only 1 digit more than the number of significant figures, round

That statement is wrong. Carry all significant figures in calculations. This is done automatically in all computer programs (to about 14 significant figures in Intel processors)

In summary statistics, report 1 more significant figure than is present in the data (think of Wunderlich’s 98.6º F)

Let the uncertainty define the number of significant digits

It is incorrect to report 9.979 ± 5.015

Due to propagation of error, the number of significant figures can not increase in a calculation

Our 98.6º F average human temperature would be 98.2 º F if this rule were used

library(Sleuth3)
data(case0101)
# Analyze the structure of the data table
head(case0101) # Show the 1st 6 rows of data

##   Score Treatment
## 1   5.0 Extrinsic
## 2   5.4 Extrinsic
## 3   6.1 Extrinsic
## 4  10.9 Extrinsic
## 5  11.8 Extrinsic
## 6  12.0 Extrinsic

str(case0101)

## 'data.frame':    47 obs. of  2 variables:
##  $ Score    : num  5 5.4 6.1 10.9 11.8 12 12.3 14.8 15 16.8 ...
##  $ Treatment: Factor w/ 2 levels "Extrinsic","Intrinsic": 1 1 1 1 1 1 1 1 1 1 ...

summary(case0101)

##      Score          Treatment 
##  Min.   : 5.0   Extrinsic:23  
##  1st Qu.:14.9   Intrinsic:24  
##  Median :18.7                 
##  Mean   :17.9                 
##  3rd Qu.:21.2                 
##  Max.   :29.7

favstats(Score ~ Treatment, data=case0101)

##   Treatment min   Q1 median   Q3  max mean   sd  n missing
## 1 Extrinsic   5 12.2   17.2 18.9 24.0 15.7 5.25 23       0
## 2 Intrinsic  12 17.4   20.4 22.3 29.7 19.9 4.44 24       0

# from the Sleuth3 help file, with slight modifications: data = case0101
boxplot(Score ~ Treatment,data=case0101,  # Boxplots with labels
        ylab= "Average Creativity Score From 11 Judges (on a 40-point scale)",  
        names=c("23 'Extrinsic' Group Students","24 'Intrinsic' Group Students"), 
        main= "Haiku Creativity Scores for 47 Creative Writing Students")

## Stem & Leaf plots

Tukey stem-and-leaf plot

# Individual stem-and-leaf plots, from Nicholas Horton code
with(subset(case0101, Treatment == "Extrinsic"), stem(Score, scale = 5))

## 
##   The decimal point is at the |
## 
##    5 | 04
##    6 | 1
##    7 | 
##    8 | 
##    9 | 
##   10 | 9
##   11 | 8
##   12 | 03
##   13 | 
##   14 | 8
##   15 | 0
##   16 | 8
##   17 | 2245
##   18 | 577
##   19 | 25
##   20 | 7
##   21 | 2
##   22 | 1
##   23 | 
##   24 | 0

with(subset(case0101, Treatment == "Intrinsic"), stem(Score, scale = 5))

## 
##   The decimal point is at the |
## 
##   12 | 009
##   13 | 6
##   14 | 
##   15 | 
##   16 | 6
##   17 | 25
##   18 | 2
##   19 | 138
##   20 | 356
##   21 | 36
##   22 | 126
##   23 | 1
##   24 | 03
##   25 | 
##   26 | 7
##   27 | 
##   28 | 
##   29 | 7

# This Horton code produces and error, don't know why:
# mosaic::maggregate(Score ~ Treatment, data = case0101, FUN = stem)

# Display 1.10, page 17 in the text using package aplpack, needed attach/detach
attach(case0101)
aplpack::stem.leaf.backback(Score[case0101$Treatment == "Extrinsic"],
                     Score[case0101$Treatment == "Intrinsic"],
                     back.to.back = TRUE,unit=0.1,
                     show.no.depths = TRUE)

## __________________
##   1 | 2: represents 1.2, leaf unit: 0.1 
## Score[case0101$Treatment == "Extrinsic"]
##             Score[case0101$Treatment == "Intrinsic"]
## __________________
##     40|  5 |      
##      1|  6 |      
##       |  7 |      
##       |  8 |      
##       |  9 |      
##      9| 10 |      
##      8| 11 |      
##     30| 12 |009   
##       | 13 |6     
##      8| 14 |      
##      0| 15 |      
##      8| 16 |6     
##   5422| 17 |25    
##    775| 18 |2     
##     52| 19 |138   
##      7| 20 |356   
##      2| 21 |36    
##      1| 22 |126   
##       | 23 |1     
##      0| 24 |03    
##       | 25 |      
##       | 26 |7     
##       | 27 |      
##       | 28 |      
##       | 29 |7     
##       | 30 |      
## __________________
## n:  23      24
## __________________

detach(case0101)

# Gallagher code with modifications by David Winsemius (1/31/12 post on R-help):
# histogram is a plotting function from the lattice package.
histogram(~Score | Treatment, data=case0101,
          scales=list(x=list(at=seq(4,32,by=4),
                             labels=sprintf("%2.0f", seq(4,32,by=4)))),
          endpoints = c(3.5, 32.5), layout = c(1,2), aspect = 1,
          xlab = "Creativity Scores")

# Equal variance t test
t.test(Score~Treatment, alternative='two.sided', conf.level=.95,
  var.equal=TRUE, data=case0101)

## 
##  Two Sample t-test
## 
## data:  Score by Treatment
## t = -3, df = 45, p-value = 0.005
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
## 95 percent confidence interval:
##  -7.00 -1.29
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic 
##                    15.7                    19.9

# Unequal variance or Welch t test
t.test(Score ~ Treatment, alternative = "two.sided", data=case0101)

## 
##  Welch Two Sample t-test
## 
## data:  Score by Treatment
## t = -3, df = 43, p-value = 0.006
## alternative hypothesis: true difference in means between group Extrinsic and group Intrinsic is not equal to 0
## 95 percent confidence interval:
##  -7.01 -1.28
## sample estimates:
## mean in group Extrinsic mean in group Intrinsic 
##                    15.7                    19.9

# analyze with a linear model from N Horton

summary(lm(Score ~ Treatment, data = case0101))

## 
## Call:
## lm(formula = Score ~ Treatment, data = case0101)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10.74  -2.98   1.06   2.96   9.82 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           15.74       1.01   15.55   <2e-16 ***
## TreatmentIntrinsic     4.14       1.42    2.93   0.0054 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.85 on 45 degrees of freedom
## Multiple R-squared:  0.16,   Adjusted R-squared:  0.141 
## F-statistic: 8.56 on 1 and 45 DF,  p-value: 0.00537

# Monte Carlo simulations from N Horton's Chapter 1 pdf on his web site
diffmeans = diff(mean(Score ~ Treatment, data = case0101))
diffmeans # observed difference

## Intrinsic 
##      4.14

numsim = 4999 # set to a sufficient number of Monte Carlo trials
nulldist = do(numsim) * diff(mean(Score ~ shuffle(Treatment), data = case0101))
confint(nulldist)

##        name lower upper level     method estimate
## 1 Intrinsic -3.01  2.99  0.95 percentile     4.14

histogram(~Intrinsic, nint = 50, data = nulldist, v = c(-4.14, 4.14))

# Monte Carlo simulations: permutations from Robison-Cox
## Montana State University Sleuth Code (Jim Robison-Cox, Stat 410/511)
with(case0101, tapply(Score, Treatment, mean))

## Extrinsic Intrinsic 
##      15.7      19.9

diff(with(case0101, tapply(Score, Treatment, mean)))

## Intrinsic 
##      4.14

## randomly resample these numbers once, shuffling the codes:
diff(with(case0101, tapply(Score, sample(Treatment), mean)))

## Intrinsic 
##     -0.99

## how many ways can these be shuffled? 1.61238e+13
choose(47,23)      ## or choose(47, 24)

## [1] 1.61e+13

trials=499
random_differences <-  numeric(trials)  ## set up storage space, or mailboxes
  ## use square brackets to refer to one mailbox at a time.
for(i in 1:trials) random_differences[i] <-  diff(with(case0101, tapply(Score, sample(Treatment), mean)))

hist(random_differences,breaks=15)
# plot a vertical line at the observed difference of  4.144203
abline(v= c(4.14, -4.14),col="red")

 random_differences[which(random_differences < -4.14 | random_differences>4.14)]

## [1]  4.26 -4.69

  ## approximate (due to not seeing all 1.6e+13 possibilities) p-value
  ##  how extreme is the difference we observed relative to all
  ##  possible differences?
# Following Manly & Legendre & Legendre, add 1 for the observed difference to simulation

(length( which(random_differences < -4.14 | random_differences>4.14))+1)/(trials+1)

## [1] 0.006

##Chihara - Hesterberg Permutation tests code (p 41)
with(case0101, tapply(Score, Treatment, mean))

## Extrinsic Intrinsic 
##      15.7      19.9

observed <- diff(with(case0101, tapply(Score, Treatment, mean)))
observed

## Intrinsic 
##      4.14

## Perform permutation analysis of means
N <- 10^3 -1 # Number of times to repeat this process
result <- numeric (N) # Space to save the random differences
# use a for loop to do (N-1=999) random permutations
for (i in 1:N)
  { # sample of size 24 from 1 to 47 without replacement
  index <- sample (47, size = 24, replace = FALSE)
  result[i] <- mean (case0101$Score [index]) - 
  mean (case0101$Score [-index])
}

hist(result,breaks = 15, xlab = "xbar1 - xbar2",
     main = "Case 1.1: Permutation distribution for creativity scores")
abline(v = c(observed, -observed), col = "blue") # add line at observed mean diff.

Pvalue <- (sum(result >= observed)+1)/(N+1) #P-value
Pvalue

## [1] 0.002

Statistical inference: Case Study 1.1

“There is strong evidence that a subject would receive a lower creativity score for a poem written after the extrinsic motivation questionnaire” Two-sided p-value=0.005 from a 2-sample t test. The effect size 4.1 point difference on a 0-40 point scale. 95% confidence interval for the difference is 1.3 and 7.0 points

Scope of inference: Case study 1.1

Since this was a randomized experiment, difference in creativity was caused by the difference in motivational questionnaires Because the individuals were not selected randomly from a larger population, extending this inference to a larger population is speculative.

Analyze Sleuth Case Study 1.2

data(case0102)

str(case0102)

## 'data.frame':    93 obs. of  2 variables:
##  $ Salary: int  3900 4020 4290 4380 4380 4380 4380 4380 4440 4500 ...
##  $ Sex   : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...

head(case0102) # Loads the first 6 cases for viewing

##   Salary    Sex
## 1   3900 Female
## 2   4020 Female
## 3   4290 Female
## 4   4380 Female
## 5   4380 Female
## 6   4380 Female

# favstats from mosaic (Horton's code)
favstats(Salary ~ Sex, data = case0102)

##      Sex  min   Q1 median   Q3  max mean  sd  n missing
## 1 Female 3900 4800   5220 5400 6300 5139 540 61       0
## 2   Male 4620 5400   6000 6075 8100 5957 691 32       0

boxplot(Salary ~ Sex, data=case0102,
        ylab= "Starting Salary (U.S. Dollars)", 
        names=c("61 Females","32 Males"),
        main= "Harris Bank Entry Level Clerical Workers, 1969-1971")

# car Boxplot, which labels the outlier
Boxplot(Salary ~ Sex, data=case0102,
        ylab= "Starting Salary (U.S. Dollars)", 
        names=c("61 Females","32 Males"),
        main= "Harris Bank Entry Level Clerical Workers, 1969-1971")

## [1] "93"

# Horton's code:
bwplot(Salary ~ Sex, data = case0102)

Boxplots From Cleveland’s “Visualizing Data”

Outside values (more than 1.5 box lengths [Interquartile range, IQR] from upper & lower hinge) also called outliers

Extreme values (more than 3 box lengths [IQR] from upper and lower hinge): Very extreme points, extreme outliers

SPSS Application Guide Figure 2.7

Tukey’s hinges vs. Quartiles

Hyndman & Fan (1996) American Statistician 50: 361-365: 9 different ways to calculate quartiles

Tukey hinges: 702.5 & 767.5

SPSS quartiles: 701.25 & 773.75

Excel & Quattro Pro quartiles: 703.75 & 761.25

Notched box plots

Why 1.57 for notched boxplots? n1 = med + 1.57*(q3-q1)/sqrt(length(x));

McGill et al. 1978

Nordhaus, W. PNAS 2006 103: 3495

Tufte’s & Gallagher’s Rules

Tufte’s Rule: Every part of a scientific graphic -– the colors, the lines, and their orientation — should convey information.

Gallagher’s Rule: You have no more than 1 minute to convey the information in a slide. You can’t spend 30 seconds describing an unusual plotting method, like the McGill notched boxplot or the Nordhaus data-filled boxplot

Gallagher’s rule trumps Tufte’s rule

# densityplot from the lattice package
densityplot(~Salary, groups = Sex, auto.key = TRUE, data = case0102)

# Histograms
attach(case0102)
hist(Salary[Sex=="Female"])

hist(Salary[Sex=="Male"])

detach(case0102)

# Gallagher code with modifications by David Winsemius (1/31/12 post on R-help):
# histogram is a plotting function from the lattice package.
histogram(~ Salary | Sex, data=case0102,
               scales=list(x=list(at=seq(4000,8000,by=2000),
               labels=sprintf("$%4.0f", seq(4000,8000,by=2000)))),
                 endpoints = c(3500, 8500), layout = c(1,2), aspect = 1,
                 xlab = "Salary", main="Case 1.2")

## I can't get the y labels to plot properly on this back-to-back plot, part of
# Frank Harrell's Hmisc package.

# out <- histbackback(split(case0102$Salary, case0102$Sex), probability=TRUE, xlim=c(-.001,.001),
options(digits=0)
out <- histbackback(split(case0102$Salary, case0102$Sex), probability=FALSE, xlim=c(-30,30),
                    main = 'Sleuth Case 1.2',ylab='Starting Salaries($)')
#! just adding color
barplot(-out$left, col="red" , horiz=TRUE, space=0, add=TRUE, axes=FALSE)
barplot(out$right, col="blue", horiz=TRUE, space=0, add=TRUE, axes=FALSE)

options(digits=3)
# Plots somewhat like Display 1.13 on page 20 in 3rd edition

# Plot some distributions
histogram(rnorm(1000)) # Normal

histogram(rexp(1000)) # Long-tailed

histogram(runif(1000)) # Short-tailed

histogram(rchisq(1000, df = 15)) # Skewed

diffSex<-diff(with(case0102, tapply(Salary,Sex, mean)))
sprintf('The difference in salary was %8.4f',diffSex)

## [1] "The difference in salary was 818.0225"

# Independent sample t test, equal variances assumed:
t.test(Salary~Sex, alternative='two.sided', conf.level=.95, var.equal=TRUE, 
  data=case0102)

## 
##  Two Sample t-test
## 
## data:  Salary by Sex
## t = -6, df = 91, p-value = 1e-08
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -1076  -560
## sample estimates:
## mean in group Female   mean in group Male 
##                 5139                 5957

# Permutation test from Horton
obsdiff = diff(mean(Salary ~ Sex, data = case0102))
obsdiff

## Male 
##  818

numsim = 999
res = do(numsim) * diff(mean(Salary ~ shuffle(Sex), data = case0102))
densityplot(~Male, data = res)

confint(res)

##   name lower upper level     method estimate
## 1 Male  -290   305  0.95 percentile      818

# Add the Manly-Legendre convention for simulations to add 1 for the actual
# This prevents reporting a p-value of 0, which it never is.
p = (sum(abs(res$Male) >= abs(obsdiff))+1)/(numsim +1)
p

## [1] 0.001

# Robison Cox's MSU simulation

observed <- diff(with(case0102, tapply(Salary,Sex, mean)))
## 818.0225

plot(Salary ~ Sex, data = case0102) # because code is numeric, gender is a factor

with(case0102, tapply(Salary,Sex, length))

## Female   Male 
##     61     32

# How many combinations with groups of size 61 & 32
choose(61+32,32)

## [1] 8.66e+24

trials = 10^4 -1
random_differences <-  numeric(trials)  ## set up storage space, or mailboxes
## use square brackets to refer to one mailbox at a time.
for(i in 1:trials) random_differences[i] <-  diff(with(case0102, tapply(sample(Salary),Sex, mean)))
hist(random_differences, breaks=15)
abline(v= c(-1,1) * observed, col="red")

l<-length( which(random_differences < -observed | random_differences> observed))/1000
## approximate p-value < 1 in 1000
Pvalue <- (l+1)/(trials+1)
Pvalue

## [1] 1e-04

##Chihara - Hesterberg Permutation tests code (p 41)
with(case0101, tapply(Score, Treatment, mean))

## Extrinsic Intrinsic 
##      15.7      19.9

observed <- diff(with(case0102, tapply(Salary, Sex, mean)))
observed

## Male 
##  818

## Perform permutation analysis of means
N <- 10^4 -1 # Number of times to repeat this process
result <- numeric (N) # Space to save the random differences
for (i in 1:N)
{ # sample of size 32 from 1 to 32 without replacement
  index <- sample (61+32, size = 32, replace = FALSE)
  result[i] <- mean (case0102$Salary [index]) - 
    mean (case0102$Salary [-index])
}

# scale is off, need to fix sometime
hist(result,breaks = 15, xlab = "xbar1 - xbar2",
     main = "Case 1.2: Permutation distribution for salaried")
abline(v = c(observed, -observed), col = "blue") # add line at observed mean diff.

Pvalue <- (sum(result >= observed)+1)/(N+1) #P-value
Pvalue

## [1] 1e-04

Sleuth Display 1.5

Statistical inferences and chance mechanisms

An inference is a conclusion that patterns in the data are present in some broader context

A statistical inference is an inference justified by a probability model linking the data to the broader context # Randomization From Kendall & Stuart’s (1977) ‘Advanced Theory of Statistics’ “The principle of randomization is simply stated: Whenever experimental units are allocated to factor-combinations in an experiment, this should be done by a random process using equal probabilities.”

“Even if the relationship of the dependent variable with some unsuspected causal factor is not recognized until after the experiment, the validity of the inferences will not be impaired, provided that the factor’s influence was “randomized out” of the experiment.”

Kendall & Stuart (1977) on experiments In any experiment the factors influencing the dependent variable are, explicitly or implicitly, divided by the experimenter into three classes: - Those incorporated into the structure of the experiment - Those “randomized out” of the experiment - Those neither incorporated nor randomized out

Classes 1 & 2 require positive action, affecting the layout of the experiment, or the randomization procedure employed. A factor may find its way into class (3) by simply being overlooked.

What makes a good experimenter? Kendall & Stuart (1977) “A substantial part of the skill of the experimenter lies in his choice of factors to be randomized out of the experiment. If he is careful, he will randomize out all the factors which are suspected of being causally important but which are not actually part of the experimental procedure. But every experimenter necessarily neglects some conceivably causal factors; if this were not so, the randomization procedure required would be impossibly complicated. Thus the choice of what factors to be randomized out is essentially a matter of judgement.”

Description of experimental design should include - The nature of the experimental units to be employed - The number and kinds of treatments and the properties of the responses that will be measured. - Specification of how the treatments will be assigned to the available experimental units (replicates) - The physical arrangement of the experimental units, (and often) the temporal sequence in which treatments are applied to and measurements made on the different experimental units.’

Randomized Experiments vs Observational Studies

Randomized experiment: a chance mechanism used to assign subject to groups
Observational study: group status beyond the control of the investigator
“Statistical inferences of cause-and-effect relationships can be drawn from randomized experiments, but not from observational studies”
“A confounding variable is related both to group membership and to the outcome. Its presence makes it hard to establish the outcome as being a direct consequence of group membership.” (Male experience)

Sample surveys vs. experiments from Kendall & Stuart’s “The Advanced theory of statistics” (1977)

“The distinction between the design of experiments and the design of sample surveys is fairly clear-cut, and may be expressed by saying that in surveys we make observations on a sample taken from a finite population of individuals, whereas in experiments we make observations which are in principle generated by a hypothetical infinite population, in exactly the same way that the tosses of a coin are. Of course, we may sometimes experiment on the members of a sample resulting from a survey, or even make a sample survey of the results of an (extensive) experiment, but the essential distinction between the two fields should be clear.” Inferences to populations

Sleuth is too restrictive here: Inferences to populations can be drawn from random sampling studies, but not otherwise

Simple random sampling (SRS): A simple random sample of size n from a population is a subset of the population consisting of n members selected in such a way that every subset of size n is afforded the same chance of being selected.

Random sampling ensures that all subpopulations are represented in the sample in roughly the same mix as in the overall population.

Statistical inference procedures incorporate measures of uncertainty that describe that chance. Selecting a random sample - Simple random sampling - Stratified random sampling - Multilevel sampling (e.g., Regions, Lakes, areas within lakes) - Systematic sampling

Manly Figure 1.2

Quadrat samples
Line transect samples: see Hayek & Buzas (1996)
Random cluster sampling (selecting blocks or grids at random)
Lakes: Can adjust the probability of being sampled
Adaptive cluster sampling (Thompson 1990)

$Thompson 1990$

Thompson 1990

- Variable probability sampling
- EMAP

EMAP Virginia Province

MA Bay sampling
- Random locations, but many rocky stations deleted

MWRA MA Bay sampling

Boston Harbor sampling
- 8 stations sampled since 1991: not selected randomly

BostonHarbor sampling

P-value

P-value Modified from K Wuensch The p-value is the probability of obtaining data as or more discrepant with respect to the null and alternative hypothesis than are those in the present sample, assuming that the null hypothesis is absolutely correct. In a one-sample z-test of the hypothesis μ = μ_o, a z-score of 1.96 can have p-value 0.975, 0.05 or 0.025 depending on whether the alternative hypothesis is μ < μ_o, μ ≠ μ_o, or μ > μ_o, respectively.

Summary & Conclusions

Statistical Sleuthing Significant digits, Case Studies 1.1 & 1.2 Histograms, density plots, stem-and-leaf & dot plots Randomization, Fisher’s major contribution Experiment vs. observational study (survey), Experimental design Sampling schemes (SRS, stratified, systematic, adaptive, EMAP probability based sampling) Probability model for randomized experiments Randomization distribution p values