1 Preparing Your Computer

In class, these steps will already be done for you. If you want to use the workshop files on your own computer, follow these steps.

  1. Installing R & RStudio Install R and RStudio from these two sites: http://www.r-project.org http://www.rstudio.com

  2. Unzip the workshop files, taking care to ensure they all stay together in a folder

  3. Install these add-on packages by selecting the following “install.packages()” lines then choose “Run selected lines” from the “Run” menu, or use Ctrl-Enter on PC or Cmd-Enter on Mac

install.packages("car")
install.packages("emmeans")
install.packages("jmv")
install.packages("multcomp")
install.packages("RcmdrMisc")
install.packages("skimr")
install.packages("tidyverse")
  1. Test your installation Run the following chunk of R code by clicking on the green arrow on the right side of the code. If you see the data, you’re set. If not, call the OIT HelpDesk at 865-974-9900
load("mydata100.RData")
head(mydata100)

2 OIT Research Computing Support Assistance

2.1 Help We Provide

Up to 15 hours per semester of help with:

  • R, SPSS, SAS, NVivo, QuestionPro, ArcGIS…

  • Research planning, design, sample size…

  • Data entry, management, file conversions

  • Statistics, machine learning, data mining

  • Creating geographic maps of statistical information

  • Text mining, qualitative analysis…

  • Web survey design and deployment

2.2 Where to Find us


3 Basic Statistics

3.1 Preparing the Workspace

library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages --------
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'purrr' was built under R version 3.6.2
## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("skimr")
## Warning: package 'skimr' was built under R version 3.6.2
load("mydata100.RData")

3.2 R’s summary function

  • Accepts a data frame

  • Min/Max values shown to 3 significant digits. Use digits = 10 to change

mydata100 %>%
  select(pretest, posttest) %>%
  summary()
##     pretest         posttest    
##  Min.   :58.00   Min.   :59.00  
##  1st Qu.:72.00   1st Qu.:77.00  
##  Median :75.00   Median :82.00  
##  Mean   :74.97   Mean   :82.06  
##  3rd Qu.:79.00   3rd Qu.:86.00  
##  Max.   :86.00   Max.   :98.00

3.3 The skim Function

  • Fairly new

  • Designed for the tidyverse

  • Produces tibbles amenable to filter, group_by etc.

  • Optimized for high-quality output (more soon)

library("skimr")
library("knitr")
## Warning: package 'knitr' was built under R version 3.6.2
mydata100 %>% 
  skim() 
Data summary
Name Piped data
Number of rows 100
Number of columns 9
_______________________
Column type frequency:
factor 2
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
workshop 0 1 FALSE 4 R: 31, SAS: 25, SPS: 25, Sta: 19
gender 0 1 FALSE 2 Mal: 52, Fem: 48

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 50.50 29.01 1 25.75 50.5 75.25 100 ▇▇▇▇▇
q1 0 1.00 3.45 1.10 1 3.00 3.0 4.00 5 ▁▃▇▆▅
q2 0 1.00 3.06 1.22 1 2.00 3.0 4.00 5 ▃▇▆▇▃
q3 1 0.99 3.08 1.17 1 2.00 3.0 4.00 5 ▂▅▇▅▃
q4 0 1.00 3.40 1.14 1 3.00 3.0 4.00 5 ▂▃▇▆▅
pretest 0 1.00 74.97 5.30 58 72.00 75.0 79.00 86 ▁▂▇▇▃
posttest 0 1.00 82.06 6.59 59 77.00 82.0 86.00 98 ▁▁▇▇▂

3.4 skim Example with filter()

mydata100 %>% 
  skim() %>%
  filter(numeric.mean > 70)
Data summary
Name Piped data
Number of rows 100
Number of columns 9
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
pretest 0 1 74.97 5.30 58 72 75 79 86 ▁▂▇▇▃
posttest 0 1 82.06 6.59 59 77 82 86 98 ▁▁▇▇▂

3.5 jmv Package’s descriptives Function

  • The jmv package contains many useful functions that provide APA-style output

  • It also offers a fairly comprehensive set of univariate plots & will split by group

  • Can also use select_if(is.numeric) to get all numeric vars

library("jmv")
## Warning: package 'jmv' was built under R version 3.6.2
mydata100 %>%
  select(pretest, posttest) %>%
  descriptives()
## 
##  DESCRIPTIVES
## 
##  Descriptives                       
##  ---------------------------------- 
##               pretest    posttest   
##  ---------------------------------- 
##    N              100         100   
##    Missing          0           0   
##    Mean          75.0        82.1   
##    Median        75.0        82.0   
##    Minimum         58          59   
##    Maximum         86          98   
##  ----------------------------------

3.6 Frequency & Percent Tables

  • We’ve seen R’s table function with its sparse results

  • The jmv::descriptives function provides much nicer output

  • Adding freq=TRUE turns on the frequencies for factors

  • Could also use select_if(is.factor) to get all factors

library("jmv")

mydata100 %>%
  select(workshop, gender) %>%
  descriptives(freq = TRUE)
## 
##  DESCRIPTIVES
## 
##  Descriptives                      
##  --------------------------------- 
##               workshop    gender   
##  --------------------------------- 
##    N               100       100   
##    Missing           0         0   
##    Mean                            
##    Median                          
##    Minimum                         
##    Maximum                         
##  --------------------------------- 
## 
## 
##  FREQUENCIES
## 
##  Frequencies of workshop                            
##  -------------------------------------------------- 
##    Levels    Counts    % of Total    Cumulative %   
##  -------------------------------------------------- 
##    R             31          31.0            31.0   
##    SAS           25          25.0            56.0   
##    SPSS          25          25.0            81.0   
##    Stata         19          19.0           100.0   
##  -------------------------------------------------- 
## 
## 
##  Frequencies of gender                              
##  -------------------------------------------------- 
##    Levels    Counts    % of Total    Cumulative %   
##  -------------------------------------------------- 
##    Female        48          48.0            48.0   
##    Male          52          52.0           100.0   
##  --------------------------------------------------

3.7 Cross-tabulation & Chi-Square

  • Goal: Compare groups on pattern of counts

  • Example: Do the genders differ on workshop attendance?

  • Predictor: Factor

  • Outcome: Factor

  • Assumptions:

    • Minimum expected count > 1

    • Fewer than 20% of expected counts < 5

3.8 R’s Built-in table Function

myWG <- mydata100 %>%
  select(workshop, gender) %>%
  table

myWG
##         gender
## workshop Female Male
##    R         14   17
##    SAS       12   13
##    SPSS      13   12
##    Stata      9   10
summary(myWG)  
## Number of cases in table: 100 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 0.26338, df = 3, p-value = 0.9668

3.10 Cross-tabulations Using the jmv Package

  • Provides publication-quality tables when run in the jamovi GUI.

  • Offers Fisher’s Exact Test, odds, relative risk, and many measures of association.

  • contTablesPaired does McNemar’s Test

library("jmv")

mydata100 %>%
  contTables(
    rows = 'workshop',
    cols = 'gender', pcRow = TRUE)
## 
##  CONTINGENCY TABLES
## 
##  Contingency Tables                                       
##  -------------------------------------------------------- 
##    workshop                    Female    Male     Total   
##  -------------------------------------------------------- 
##    R           Observed            14       17       31   
##                % within row      45.2     54.8            
##                                                           
##    SAS         Observed            12       13       25   
##                % within row      48.0     52.0            
##                                                           
##    SPSS        Observed            13       12       25   
##                % within row      52.0     48.0            
##                                                           
##    Stata       Observed             9       10       19   
##                % within row      47.4     52.6            
##                                                           
##    Total       Observed            48       52      100   
##                % within row      48.0     52.0            
##  -------------------------------------------------------- 
## 
## 
##  <U+03C7>² Tests                       
##  ------------------------------ 
##          Value    df    p       
##  ------------------------------ 
##    <U+03C7>²    0.263     3    0.967   
##    N       100                  
##  ------------------------------

3.11 Other Categorical Functions

  • Small samples: fisher.test

  • Same variable measured twice: mcnemar.test

  • 3-way table: mantelhaen.test

  • N-way table: loglm in MASS package


4 Correlation

4.1 Preparing the Workspace

library("tidyverse")

load("mydata100.RData")

4.2 Correlation

  • Goal: Measure strength of linear association

  • Example: Is pretest related to posttest?

  • Predictor: Numeric (one)

  • Outcome: Numeric (one)

  • Pearson Assumes:

    • Relationship is linear

    • Variables are normally distributed

  • Spearman Assumes:

    • Relationship is consistent (monotonic)

    • Variables come from same distribution

4.3 R’s Built-in cor Function

  • Demonstrate using small mydata

  • For non-parametric, use “spearman”

load("mydata.RData")

mydata %>%
  select(q1:q4) %>%
  cor(use    = "pairwise",
      method = "pearson")
##            q1         q2          q3          q4
## q1  1.0000000  0.7395179 -0.12500000  0.88040627
## q2  0.7395179  1.0000000 -0.27003086  0.85063978
## q3 -0.1250000 -0.2700309  1.00000000 -0.02613542
## q4  0.8804063  0.8506398 -0.02613542  1.00000000

4.4 Built-in Test of Significance

Perhaps to warn gains impact of multiple testing, this allows only one test per run!

cor.test(mydata$q1, mydata$q4,
         use    = "pairwise",
         method = "pearson"
)
## 
##  Pearson's product-moment correlation
## 
## data:  mydata$q1 and mydata$q4
## t = 4.5476, df = 6, p-value = 0.003902
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4629416 0.9782033
## sample estimates:
##       cor 
## 0.8804063

4.5 R Commander’s rcorr.adjust Does More

  • Output similar to SAS/SPSS/Stata

  • Provides p-values by default

  • Adjusts p-values for number tested

library("RcmdrMisc")
## Warning: package 'RcmdrMisc' was built under R version 3.6.2
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## Loading required package: sandwich
## 
## Attaching package: 'RcmdrMisc'
## The following object is masked from 'package:jmv':
## 
##     reliability
mydata %>%
  select(q1:q4) %>%
  rcorr.adjust(
    use  = "pairwise",
    type = "pearson")
## 
##  Pearson correlations:
##         q1      q2      q3      q4
## q1  1.0000  0.7395 -0.1250  0.8804
## q2  0.7395  1.0000 -0.2700  0.8506
## q3 -0.1250 -0.2700  1.0000 -0.0261
## q4  0.8804  0.8506 -0.0261  1.0000
## 
##  Number of observations: 
##    q1 q2 q3 q4
## q1  8  8  7  8
## q2  8  8  7  8
## q3  7  7  7  7
## q4  8  8  7  8
## 
##  Pairwise two-sided p-values:
##    q1     q2     q3     q4    
## q1        0.0360 0.7894 0.0039
## q2 0.0360        0.5581 0.0074
## q3 0.7894 0.5581        0.9556
## q4 0.0039 0.0074 0.9556       
## 
##  Adjusted p-values (Holm's method)
##    q1     q2     q3     q4    
## q1        0.1440 1.0000 0.0234
## q2 0.1440        1.0000 0.0371
## q3 1.0000 1.0000        1.0000
## q4 0.0234 0.0371 1.0000

Note: you can run this analysis in BlueSky and produce a apa table


5 Multiple Regression

  • Goal: To use several variables to explain one other
  • Example: Which variables explain overall workshop rating?
  • Predictor: Numeric (one or more)
  • Outcome: Numeric (one only)
  • Assumptions:
    • Relationship is linear
    • Prediction errors (residuals) are normally distributed
    • Variance is consistent across range of predictor (homoscedasticity)

5.1 Modeling Functions

  • In SAS & SPSS

    • Specify settings in advance

    • Then see all results

  • In R, modeling functions

    • Accept formulas

    • Create model objects

    • Use generic functions show more, e.g. summary(), plot(), etc.

    • Use “extractor” (a.k.a accessor) functions to show more, e.g. coefficients()

5.2 Linear Models Using lm

my_model <- lm(q4 ~ q1 + q2 + q3, 
              data = mydata100)

summary(my_model)
## 
## Call:
## lm(formula = q4 ~ q1 + q2 + q3, data = mydata100)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9467 -0.6418  0.1175  0.5960  2.0533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.20940    0.31787   3.805 0.000251 ***
## q1           0.41134    0.13170   3.123 0.002370 ** 
## q2           0.15791    0.10690   1.477 0.142942    
## q3           0.09372    0.11617   0.807 0.421838    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9308 on 95 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3561, Adjusted R-squared:  0.3358 
## F-statistic: 17.51 on 3 and 95 DF,  p-value: 3.944e-09
# q4 = 1.2 + .411*q1 + .16*q2 + .09*q3

Notes: * Residuals: beyond +/-3 is worrisome, +/- 4 indicates outliers * P values interpetation: “is this important above and beyond the other values?” because p-value is a partial test * Focus on adjusted R-squared + Multiple R increases with each variable while adjusted is penalized, “pays the reaper”

anova(my_model)
## Analysis of Variance Table
## 
## Response: q4
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## q1         1 42.306  42.306 48.8278 3.824e-10 ***
## q2         1  2.657   2.657  3.0661   0.08317 .  
## q3         1  0.564   0.564  0.6508   0.42184    
## Residuals 95 82.312   0.866                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(my_model)

predict(my_model, mydata)
##        1        2        3        4        5        6        7        8 
## 2.247260 2.564880 2.722794       NA 3.831776 4.366367 4.114732 4.112939

5.3 Model Contents

  • “Generic functions” change their behavior based on an object’s class

  • What is print doing??

print(my_model)
## 
## Call:
## lm(formula = q4 ~ q1 + q2 + q3, data = mydata100)
## 
## Coefficients:
## (Intercept)           q1           q2           q3  
##     1.20940      0.41134      0.15791      0.09372
mode(my_model)
## [1] "list"
class(my_model)
## [1] "lm"
names(my_model)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"
my_model$rank 
## [1] 4

5.4 Examining Model Structure

  • Thetidyverse glimpse() function shows any object’s structure

  • The built-in equilvalent is str()

  • You rarely need to dig in this deeply

glimpse(my_model)
## List of 13
##  $ coefficients : Named num [1:4] 1.2094 0.4113 0.1579 0.0937
##   ..- attr(*, "names")= chr [1:4] "(Intercept)" "q1" "q2" "q3"
##  $ residuals    : Named num [1:99] 1.297 0.644 -1.366 0.232 0.821 ...
##   ..- attr(*, "names")= chr [1:99] "1" "2" "4" "5" ...
##  $ effects      : Named num [1:99] -33.87 6.504 1.63 0.751 0.766 ...
##   ..- attr(*, "names")= chr [1:99] "(Intercept)" "q1" "q2" "q3" ...
##  $ rank         : int 4
##  $ fitted.values: Named num [1:99] 3.7 3.36 4.37 3.77 4.18 ...
##   ..- attr(*, "names")= chr [1:99] "1" "2" "4" "5" ...
##  $ assign       : int [1:4] 0 1 2 3
##  $ qr           :List of 5
##   ..$ qr   : num [1:99, 1:4] -9.95 0.101 0.101 0.101 0.101 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
##   ..$ qraux: num [1:4] 1.1 1.05 1 1.08
##   ..$ pivot: int [1:4] 1 2 3 4
##   ..$ tol  : num 1e-07
##   ..$ rank : int 4
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 95
##  $ na.action    : 'omit' Named int 3
##   ..- attr(*, "names")= chr "3"
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = q4 ~ q1 + q2 + q3, data = mydata100)
##  $ terms        :Classes 'terms', 'formula'  language q4 ~ q1 + q2 + q3
##   .. ..- attr(*, "variables")= language list(q4, q1, q2, q3)
##   .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. ..- attr(*, "term.labels")= chr [1:3] "q1" "q2" "q3"
##   .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(q4, q1, q2, q3)
##   .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:4] "q4" "q1" "q2" "q3"
##  $ model        :'data.frame':   99 obs. of  4 variables:
##   ..$ q4: int [1:99] 5 4 3 4 5 4 5 1 5 3 ...
##   ..$ q1: int [1:99] 4 3 5 4 5 3 4 3 5 3 ...
##   ..$ q2: int [1:99] 3 4 4 4 4 1 4 2 5 4 ...
##   ..$ q3: int [1:99] 4 3 5 3 3 3 2 2 5 4 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language q4 ~ q1 + q2 + q3
##   .. .. ..- attr(*, "variables")= language list(q4, q1, q2, q3)
##   .. .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..- attr(*, "term.labels")= chr [1:3] "q1" "q2" "q3"
##   .. .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(q4, q1, q2, q3)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:4] "q4" "q1" "q2" "q3"
##   ..- attr(*, "na.action")= 'omit' Named int 3
##   .. ..- attr(*, "names")= chr "3"
##  - attr(*, "class")= chr "lm"

5.5 Finding Relevant Functions

Keywords: help, class, objects

What functions work on lm objects?

methods(class = "lm")
##  [1] add1                alias               anova              
##  [4] Anova               avPlot              Boot               
##  [7] bootCase            boxCox              bread              
## [10] brief               case.names          ceresPlot          
## [13] coerce              confidenceEllipse   confint            
## [16] Confint             cooks.distance      crPlot             
## [19] deltaMethod         deviance            dfbeta             
## [22] dfbetaPlots         dfbetas             dfbetasPlots       
## [25] drop1               dummy.coef          durbinWatsonTest   
## [28] effects             estfun              extractAIC         
## [31] family              formula             fortify            
## [34] hatvalues           hccm                infIndexPlot       
## [37] influence           influencePlot       initialize         
## [40] inverseResponsePlot kappa               labels             
## [43] leveneTest          leveragePlot        linearHypothesis   
## [46] logLik              mcPlot              mmp                
## [49] model.frame         model.matrix        ncvTest            
## [52] nextBoot            nobs                outlierTest        
## [55] plot                powerTransform      predict            
## [58] Predict             print               proj               
## [61] qqnorm              qqPlot              qr                 
## [64] residualPlot        residualPlots       residuals          
## [67] rstandard           rstudent            S                  
## [70] show                sigmaHat            simulate           
## [73] slotsFromS3         spreadLevelPlot     summary            
## [76] summarySandwich     variable.names      vcov               
## [79] vcovBS             
## see '?methods' for accessing help and source code

5.6 Table of Regression Formulas

  • Note that data= argument applies only to formulas!

  • x1*x2*x3 includes main effects and all possible interactions!

Model Formula
Simple y ~ x
Without intercept y ~ -1 + x
Using all variables y ~ .
Specify Interactions y ~ x1 + x2 + x1:x2
Automate Interactions y ~ x1*x2 Alert: Includes main effects!
Limited Automatic
…(like SAS @2) y ~ (x1 + x2) ^ 2
Polynomial y ~ x1 + I(x1^2) + I(x1^3)
…or y ~ poly(x1, 3)

6 Comparing Two Groups

6.1 Preparing the Workspace

library("tidyverse")
  
load("mydata100.RData")

6.2 Two Independent Groups

  • Goal: Compare means/medians of independent groups

  • Example: Do genders differ on mean/median survey response?

  • Predictor: Factor

  • Outcome: Numeric

  • Assumptions:

    • t.test: Data are normally distributed

    • wilcox.test: Data have same distribution
      (except for mean or “location” of distribution)

6.3 Independent Samples t.test

  • Parametric
t.test(q1 ~ gender, data = mydata100)
## 
##  Welch Two Sample t-test
## 
## data:  q1 by gender
## t = 4.9784, df = 97.998, p-value = 2.748e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5878915 1.3672368
## sample estimates:
## mean in group Female   mean in group Male 
##             3.958333             2.980769

6.4 Independent Samples Non-parametric wilcox.test

  • Non-parametric

  • To be a test on medians, the two groups must have the same distribution (need not be normal)

  • Otherwise, it’s a test on mean ranks

wilcox.test(q1 ~ gender, data = mydata100)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  q1 by gender
## W = 1868, p-value = 8.902e-06
## alternative hypothesis: true location shift is not equal to 0
mydata100 %>%
  group_by(gender) %>%
  summarise(my_median = median(q1))
## # A tibble: 2 x 2
##   gender my_median
##   <fct>      <dbl>
## 1 Female         4
## 2 Male           3

6.5 Paired Samples

  • Goal: Compare means of paired groups

  • Example: Do the subjects differ on pre/post test means?

  • Predictor: Numeric

  • Outcome: Numeric

  • Assumptions:

    • t.test: Data are normally distributed

    • wilcox.test: Data have same distribution (except for mean or “location” of distribution)

6.6 Paired Samples t.test

  • Parametric

  • Note that data = doesn’t work here as there is no formula

t.test(mydata100$pretest, 
       mydata100$posttest, 
       paired = TRUE)
## 
##  Paired t-test
## 
## data:  mydata100$pretest and mydata100$posttest
## t = -14.46, df = 99, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.062921 -6.117079
## sample estimates:
## mean of the differences 
##                   -7.09
mean(mydata100$pretest)
## [1] 74.97
mean(mydata100$posttest)
## [1] 82.06

6.7 Paired Samples Non-parametric wilcox.test

  • Non-parametric
wilcox.test(mydata100$pretest, 
            mydata100$posttest, 
            paired = TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  mydata100$pretest and mydata100$posttest
## V = 45, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
median(mydata100$pretest)
## [1] 75
median(mydata100$posttest)
## [1] 82

7 Comparing More Than 2 Groups

  • Goal: Compare means of >2 groups, >= 1 factor, optionally correcting for covariates

  • Example: Are there workshop differences in posttest means?

  • Predictor: One or more factors, optional numeric covariates

  • Outcome: Numeric

  • Assumptions:

    • ANOVA: Errors are normally distributed

    • ANOVA: Variances are roughly equal

    • Kruskal-Wallis: Data have same distribution
      (except for mean or “location” of distribution)

7.1 Getting Means & Variances by Gender

mydata100 %>%
  group_by(gender) %>%
  summarise(mean     = mean(posttest),
            variance = var( posttest))
## # A tibble: 2 x 3
##   gender  mean variance
##   <fct>  <dbl>    <dbl>
## 1 Female  81.7     36.3
## 2 Male    82.4     50.6

7.2 Getting Means & Variances by Workshop

mydata100 %>%
  group_by(workshop) %>%
  summarise(mean     = mean(posttest),
            variance = var( posttest))
## # A tibble: 4 x 3
##   workshop  mean variance
##   <fct>    <dbl>    <dbl>
## 1 R         86.3     25.0
## 2 SAS       79.6     36.1
## 3 SPSS      81.7     19.5
## 4 Stata     78.9     73.6

7.3 Test for Equality of Variances

library("car")
leveneTest(posttest ~ gender,
           data = mydata100)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.5017 0.4804
##       98
leveneTest(posttest ~ workshop,
           data = mydata100)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  3  2.3689 0.07544 .
##       96                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.4 Create AOV Model

  • AOV = Analysis Of Variance to differentiate between that and the anova function

  • contrasts option sets R to match SAS when doing Type III tests

options(contrasts=c("contr.sum","contr.poly"))

my_model <-  aov(posttest ~ gender*workshop, 
                 data=mydata100)

7.5 Plot Diagnostics

plot(my_model)

7.6 Types of ANOVA Tests

  • Balanced ANOVA designs have the same number of observations in each cell of a cross-tabulation.

  • For balanced designs, all types of F-tests yield the same p-values

  • ALERT: For unbalanced designs, R’s built-in anova() function yields sequential, or “Type I” tests, while SAS, SPSS, Stata all use “Type III” so answers differ!

  • The car package’s Anova (capital “A”) yields “Type II” tests, which are preferred (meets “Marginality Principle”)

For details, see:

  • Exegesis on Linear Models, by WN Venables

  • Companion to Applied Regression (car),
    by John Fox

7.7 Get ANOVA Table

  • Note the capital “A” in Anova() here!

  • Built in one is all lower case

library("car")
Anova(my_model, type = "III")
## Anova Table (Type III tests)
## 
## Response: posttest
##                 Sum Sq Df    F value    Pr(>F)    
## (Intercept)     643750  1 17817.7525 < 2.2e-16 ***
## gender               3  1     0.0716 0.7895879    
## workshop           845  3     7.7976 0.0001076 ***
## gender:workshop     77  3     0.7090 0.5490659    
## Residuals         3324 92                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.8 Types of Means

  • Arithmetic means take the mean of all values, as we calculated above

  • Estimated marginal means (a.k.a. lsmeans) take the mean of the group means

  • For balanced designs, these are the same

  • For unbalanced designs, either can be “best”

  • For more complex models (e.g. ANACOVA) the comparison is more complex because estimated marginal means use the model, not the raw data

7.9 Estimated Marginal Means

Keywords: Least Squares Means, LSmeans, EMmeans, Predicted Means

The help files for the emmeans package contain over a dozen vignettes filled with useful examples.

library("emmeans")
## Warning: package 'emmeans' was built under R version 3.6.2
## Welcome to emmeans.
## NOTE -- Important change from versions <= 1.41:
##     Indicator predictors are now treated as 2-level factors by default.
##     To revert to old behavior, use emm_options(cov.keep = character(0))
myEMmeans <- emmeans(my_model, ~ workshop)
## NOTE: Results may be misleading due to involvement in interactions
myEMmeans
##  workshop emmean   SE df lower.CL upper.CL
##  R          86.1 1.08 92     84.0     88.3
##  SAS        79.5 1.20 92     77.2     81.9
##  SPSS       81.7 1.20 92     79.3     84.1
##  Stata      79.0 1.38 92     76.3     81.7
## 
## Results are averaged over the levels of: gender 
## Confidence level used: 0.95

7.10 Comparing All Means

contrast(myEMmeans, 
         method="pairwise", 
         adjust = "tukey" )
##  contrast     estimate   SE df t.ratio p.value
##  R - SAS         6.580 1.62 92  4.062  0.0006 
##  R - SPSS        4.410 1.62 92  2.723  0.0381 
##  R - Stata       7.122 1.76 92  4.056  0.0006 
##  SAS - SPSS     -2.170 1.70 92 -1.275  0.5809 
##  SAS - Stata     0.542 1.83 92  0.296  0.9909 
##  SPSS - Stata    2.712 1.83 92  1.481  0.4534 
## 
## Results are averaged over the levels of: gender 
## P value adjustment: tukey method for comparing a family of 4 estimates

7.11 Comparing Means to Control

Since this does fewer tests, p-values are not corrected as harshly for multiple testing

Compares all against the first level of the factor

contrast(myEMmeans, method="trt.vs.ctrl" )
##  contrast  estimate   SE df t.ratio p.value
##  SAS - R      -6.58 1.62 92 -4.062  0.0003 
##  SPSS - R     -4.41 1.62 92 -2.723  0.0217 
##  Stata - R    -7.12 1.76 92 -4.056  0.0003 
## 
## Results are averaged over the levels of: gender 
## P value adjustment: dunnettx method for 3 tests

7.12 Methods of Comparison

For each of these, adding reverse = TRUE flips pair order. See help(“contrast-methods”) for details.

  • pairwise

  • revpairwise (reverses order)

  • poly - polynomial

  • trt.vs.ctrl - compares each to 1st level

  • trt.vs.ctrlk - compares each to kth level

7.13 Adjustment Types

  • BH - Benjamini-Hochberg a.k.a. fdr
  • BY - Benjamini & Yekutieli
  • bonferroni - way too conservative)
  • fdr - most liberal other than “none”
  • hochberg
  • holm - default in many R functions
  • hommel
  • none - i.e. uncorrected pairwise t-tests
  • mvt
  • scheffe
  • sidak
  • tukey

For details see

  • help(“p.adjust.methods”)
  • help(“summary.emmGrid”)

7.14 Compact Letter Display (CLD)

  • Compares all means very compactly

  • The “group” column places means into sets which are not significantly different within the set, but all those in group 1 are significantly different from those in groups 2 (…and 3, and 4, etc.)

  • cld = Compact Letter Display but here it is done using numbers instead of letters.

  • Possibly more useful for larger outcomes

library("multcomp")
## Warning: package 'multcomp' was built under R version 3.6.2
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Warning: package 'TH.data' was built under R version 3.6.2
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
cld(myEMmeans)
##  workshop emmean   SE df lower.CL upper.CL .group
##  Stata      79.0 1.38 92     76.3     81.7  1    
##  SAS        79.5 1.20 92     77.2     81.9  1    
##  SPSS       81.7 1.20 92     79.3     84.1  1    
##  R          86.1 1.08 92     84.0     88.3   2   
## 
## Results are averaged over the levels of: gender 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 4 estimates 
## significance level used: alpha = 0.05

7.15 Plotting All Comparisons

  • A compact display done with plotting

  • Mean differences are plotted as dots with 95% confidence intervals

  • If confidence interval includes zero, those means do not differ significantly

  • This is a ggplot, so it’s easy to embellish

  • Confidence intervals, compared against 0

plot( contrast(myEMmeans, method="pairwise") ) +
  geom_vline(xintercept = 0)

7.16 EM Means Interaction Plot (MIP)

  • Our model interaction is not significant, but we’ll plot it anyway.

  • Remember, interactions need much larger sample sizes to reach significance!

library("emmeans")

emmip(my_model, workshop ~ gender, CIs = TRUE)

7.17 Table of ANOVA / ANCOVA Formulas

Remember: data= applies only to formulas!

Model Formula
One-way ANOVA y ~ a
Two-way ANOVA
…with interaction y ~ a + b + a:b
…or y ~ a*b Alert: Includes main effects!
3-way ANOVA with
…all interactions y ~ abc
3-way ANOVA with
…with only 2-way
interactions y ~ (a + b + c) ^ 2
…or y ~ a * b * c - a:b:c
Analysis of Covariance
…Separate slopes y ~ x * a
…Common slopes y ~ x + a
ANOVA nesting
b within a y ~ b %in% a
…or y ~ a/b

8 Graphical User Interfaces to R

8.1 jamovi jamovi

  • Beautiful interface with immediate feedback when you click on an option

  • You need not know R is behind the scenes

  • Journal-style output with no extra work

  • Fairly limited selection of methods, but what’s there is high quality

  • Currently undable to add any variables to its data sets, e.g. predicted values, PCA…

  • Useful jmv package lets you use all its functions outside of jamovi

  • Detailed review: http://r4stats.com/articles/software-reviews/jamovi/

8.2 BlueSky Statistics BlueSky

  • Extensive feature set

  • You need not be aware R is involved at at all

  • Journal-style output with no extra work

  • Teaches tidyverse code

  • Includes machine learning both direct to each package and via caret with tuning

  • Detailed review: http://r4stats.com/articles/software-reviews/bluesky


9 Conclusion

9.1 Feedback Please

http://workshop.utk.edu

9.2 Questions?