1 Preparing Your Computer

In class, these steps will already be done for you. If you want to use the workshop files on your own computer, follow these steps.

Installing R & RStudio Install R and RStudio from these two sites: http://www.r-project.org http://www.rstudio.com
Unzip the workshop files, taking care to ensure they all stay together in a folder
Install these add-on packages by selecting the following “install.packages()” lines then choose “Run selected lines” from the “Run” menu, or use Ctrl-Enter on PC or Cmd-Enter on Mac

install.packages("car")
install.packages("emmeans")
install.packages("jmv")
install.packages("multcomp")
install.packages("RcmdrMisc")
install.packages("skimr")
install.packages("tidyverse")

Test your installation Run the following chunk of R code by clicking on the green arrow on the right side of the code. If you see the data, you’re set. If not, call the OIT HelpDesk at 865-974-9900

load("mydata100.RData")
head(mydata100)

2 OIT Research Computing Support Assistance

2.1 Help We Provide

Up to 15 hours per semester of help with:

R, SPSS, SAS, NVivo, QuestionPro, ArcGIS…
Research planning, design, sample size…
Data entry, management, file conversions
Statistics, machine learning, data mining
Creating geographic maps of statistical information
Text mining, qualitative analysis…
Web survey design and deployment

2.2 Where to Find us

Walk-in support in Greve Hall 540, schedule: http://oit.utk.edu/research/schedule
By appointment at Greve Hall 517 & Pendergrass Library
Helpdesk: 974-9900

3 Basic Statistics

3.1 Preparing the Workspace

library("tidyverse")

## Warning: package 'tidyverse' was built under R version 3.6.2

## -- Attaching packages --------

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'purrr' was built under R version 3.6.2

## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library("skimr")

## Warning: package 'skimr' was built under R version 3.6.2

load("mydata100.RData")

3.2 R’s `summary` function

Accepts a data frame
Min/Max values shown to 3 significant digits. Use digits = 10 to change

mydata100 %>%
  select(pretest, posttest) %>%
  summary()

##     pretest         posttest    
##  Min.   :58.00   Min.   :59.00  
##  1st Qu.:72.00   1st Qu.:77.00  
##  Median :75.00   Median :82.00  
##  Mean   :74.97   Mean   :82.06  
##  3rd Qu.:79.00   3rd Qu.:86.00  
##  Max.   :86.00   Max.   :98.00

3.3 The `skim` Function

Fairly new
Designed for the tidyverse
Produces tibbles amenable to filter, group_by etc.
Optimized for high-quality output (more soon)

library("skimr")
library("knitr")

## Warning: package 'knitr' was built under R version 3.6.2

mydata100 %>% 
  skim()

Data summary
Name	Piped data
Number of rows	100
Number of columns	9
_______________________
Column type frequency:
factor	2
numeric	7
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
workshop	0	1	FALSE	4	R: 31, SAS: 25, SPS: 25, Sta: 19
gender	0	1	FALSE	2	Mal: 52, Fem: 48

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1.00	50.50	29.01	1	25.75	50.5	75.25	100	▇▇▇▇▇
q1	0	1.00	3.45	1.10	1	3.00	3.0	4.00	5	▁▃▇▆▅
q2	0	1.00	3.06	1.22	1	2.00	3.0	4.00	5	▃▇▆▇▃
q3	1	0.99	3.08	1.17	1	2.00	3.0	4.00	5	▂▅▇▅▃
q4	0	1.00	3.40	1.14	1	3.00	3.0	4.00	5	▂▃▇▆▅
pretest	0	1.00	74.97	5.30	58	72.00	75.0	79.00	86	▁▂▇▇▃
posttest	0	1.00	82.06	6.59	59	77.00	82.0	86.00	98	▁▁▇▇▂

3.4 `skim` Example with `filter()`

mydata100 %>% 
  skim() %>%
  filter(numeric.mean > 70)

Data summary
Name	Piped data
Number of rows	100
Number of columns	9
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
pretest	0	1	74.97	5.30	58	72	75	79	86	▁▂▇▇▃
posttest	0	1	82.06	6.59	59	77	82	86	98	▁▁▇▇▂

3.5 `jmv` Package’s `descriptives` Function

The jmv package contains many useful functions that provide APA-style output
It also offers a fairly comprehensive set of univariate plots & will split by group
Can also use select_if(is.numeric) to get all numeric vars

library("jmv")

## Warning: package 'jmv' was built under R version 3.6.2

mydata100 %>%
  select(pretest, posttest) %>%
  descriptives()

## 
##  DESCRIPTIVES
## 
##  Descriptives                       
##  ---------------------------------- 
##               pretest    posttest   
##  ---------------------------------- 
##    N              100         100   
##    Missing          0           0   
##    Mean          75.0        82.1   
##    Median        75.0        82.0   
##    Minimum         58          59   
##    Maximum         86          98   
##  ----------------------------------

3.6 Frequency & Percent Tables

We’ve seen R’s table function with its sparse results
The jmv::descriptives function provides much nicer output
Adding freq=TRUE turns on the frequencies for factors
Could also use select_if(is.factor) to get all factors

library("jmv")

mydata100 %>%
  select(workshop, gender) %>%
  descriptives(freq = TRUE)

## 
##  DESCRIPTIVES
## 
##  Descriptives                      
##  --------------------------------- 
##               workshop    gender   
##  --------------------------------- 
##    N               100       100   
##    Missing           0         0   
##    Mean                            
##    Median                          
##    Minimum                         
##    Maximum                         
##  --------------------------------- 
## 
## 
##  FREQUENCIES
## 
##  Frequencies of workshop                            
##  -------------------------------------------------- 
##    Levels    Counts    % of Total    Cumulative %   
##  -------------------------------------------------- 
##    R             31          31.0            31.0   
##    SAS           25          25.0            56.0   
##    SPSS          25          25.0            81.0   
##    Stata         19          19.0           100.0   
##  -------------------------------------------------- 
## 
## 
##  Frequencies of gender                              
##  -------------------------------------------------- 
##    Levels    Counts    % of Total    Cumulative %   
##  -------------------------------------------------- 
##    Female        48          48.0            48.0   
##    Male          52          52.0           100.0   
##  --------------------------------------------------

3.7 Cross-tabulation & Chi-Square

Goal: Compare groups on pattern of counts
Example: Do the genders differ on workshop attendance?
Predictor: Factor
Outcome: Factor
Assumptions:
- Minimum expected count > 1
- Fewer than 20% of expected counts < 5

3.8 R’s Built-in `table` Function

myWG <- mydata100 %>%
  select(workshop, gender) %>%
  table

myWG

##         gender
## workshop Female Male
##    R         14   17
##    SAS       12   13
##    SPSS      13   12
##    Stata      9   10

summary(myWG)

## Number of cases in table: 100 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 0.26338, df = 3, p-value = 0.9668

3.10 Cross-tabulations Using the `jmv` Package

Provides publication-quality tables when run in the jamovi GUI.
Offers Fisher’s Exact Test, odds, relative risk, and many measures of association.
contTablesPaired does McNemar’s Test

library("jmv")

mydata100 %>%
  contTables(
    rows = 'workshop',
    cols = 'gender', pcRow = TRUE)

## 
##  CONTINGENCY TABLES
## 
##  Contingency Tables                                       
##  -------------------------------------------------------- 
##    workshop                    Female    Male     Total   
##  -------------------------------------------------------- 
##    R           Observed            14       17       31   
##                % within row      45.2     54.8            
##                                                           
##    SAS         Observed            12       13       25   
##                % within row      48.0     52.0            
##                                                           
##    SPSS        Observed            13       12       25   
##                % within row      52.0     48.0            
##                                                           
##    Stata       Observed             9       10       19   
##                % within row      47.4     52.6            
##                                                           
##    Total       Observed            48       52      100   
##                % within row      48.0     52.0            
##  -------------------------------------------------------- 
## 
## 
##  <U+03C7>² Tests                       
##  ------------------------------ 
##          Value    df    p       
##  ------------------------------ 
##    <U+03C7>²    0.263     3    0.967   
##    N       100                  
##  ------------------------------

3.11 Other Categorical Functions

Small samples: fisher.test
Same variable measured twice: mcnemar.test
3-way table: mantelhaen.test
N-way table: loglm in MASS package

4 Correlation

4.1 Preparing the Workspace

library("tidyverse")

load("mydata100.RData")

4.2 Correlation

Goal: Measure strength of linear association
Example: Is pretest related to posttest?
Predictor: Numeric (one)
Outcome: Numeric (one)
Pearson Assumes:
- Relationship is linear
- Variables are normally distributed
Spearman Assumes:
- Relationship is consistent (monotonic)
- Variables come from same distribution

4.3 R’s Built-in `cor` Function

Demonstrate using small mydata
For non-parametric, use “spearman”

load("mydata.RData")

mydata %>%
  select(q1:q4) %>%
  cor(use    = "pairwise",
      method = "pearson")

##            q1         q2          q3          q4
## q1  1.0000000  0.7395179 -0.12500000  0.88040627
## q2  0.7395179  1.0000000 -0.27003086  0.85063978
## q3 -0.1250000 -0.2700309  1.00000000 -0.02613542
## q4  0.8804063  0.8506398 -0.02613542  1.00000000

4.4 Built-in Test of Significance

Perhaps to warn gains impact of multiple testing, this allows only one test per run!

cor.test(mydata$q1, mydata$q4,
         use    = "pairwise",
         method = "pearson"
)

## 
##  Pearson's product-moment correlation
## 
## data:  mydata$q1 and mydata$q4
## t = 4.5476, df = 6, p-value = 0.003902
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4629416 0.9782033
## sample estimates:
##       cor 
## 0.8804063

4.5 R Commander’s `rcorr.adjust` Does More

Output similar to SAS/SPSS/Stata
Provides p-values by default
Adjusts p-values for number tested

library("RcmdrMisc")

## Warning: package 'RcmdrMisc' was built under R version 3.6.2

## Loading required package: car

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

## Loading required package: sandwich

## 
## Attaching package: 'RcmdrMisc'

## The following object is masked from 'package:jmv':
## 
##     reliability

mydata %>%
  select(q1:q4) %>%
  rcorr.adjust(
    use  = "pairwise",
    type = "pearson")

## 
##  Pearson correlations:
##         q1      q2      q3      q4
## q1  1.0000  0.7395 -0.1250  0.8804
## q2  0.7395  1.0000 -0.2700  0.8506
## q3 -0.1250 -0.2700  1.0000 -0.0261
## q4  0.8804  0.8506 -0.0261  1.0000
## 
##  Number of observations: 
##    q1 q2 q3 q4
## q1  8  8  7  8
## q2  8  8  7  8
## q3  7  7  7  7
## q4  8  8  7  8
## 
##  Pairwise two-sided p-values:
##    q1     q2     q3     q4    
## q1        0.0360 0.7894 0.0039
## q2 0.0360        0.5581 0.0074
## q3 0.7894 0.5581        0.9556
## q4 0.0039 0.0074 0.9556       
## 
##  Adjusted p-values (Holm's method)
##    q1     q2     q3     q4    
## q1        0.1440 1.0000 0.0234
## q2 0.1440        1.0000 0.0371
## q3 1.0000 1.0000        1.0000
## q4 0.0234 0.0371 1.0000

Note: you can run this analysis in BlueSky and produce a apa table

5 Multiple Regression

Goal: To use several variables to explain one other
Example: Which variables explain overall workshop rating?
Predictor: Numeric (one or more)
Outcome: Numeric (one only)
Assumptions:
- Relationship is linear
- Prediction errors (residuals) are normally distributed
- Variance is consistent across range of predictor (homoscedasticity)

5.1 Modeling Functions

In SAS & SPSS
- Specify settings in advance
- Then see all results
In R, modeling functions
- Accept formulas
- Create model objects
- Use generic functions show more, e.g. summary(), plot(), etc.
- Use “extractor” (a.k.a accessor) functions to show more, e.g. coefficients()

5.2 Linear Models Using `lm`

my_model <- lm(q4 ~ q1 + q2 + q3, 
              data = mydata100)

summary(my_model)

## 
## Call:
## lm(formula = q4 ~ q1 + q2 + q3, data = mydata100)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9467 -0.6418  0.1175  0.5960  2.0533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.20940    0.31787   3.805 0.000251 ***
## q1           0.41134    0.13170   3.123 0.002370 ** 
## q2           0.15791    0.10690   1.477 0.142942    
## q3           0.09372    0.11617   0.807 0.421838    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9308 on 95 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3561, Adjusted R-squared:  0.3358 
## F-statistic: 17.51 on 3 and 95 DF,  p-value: 3.944e-09

# q4 = 1.2 + .411*q1 + .16*q2 + .09*q3

Notes: * Residuals: beyond +/-3 is worrisome, +/- 4 indicates outliers * P values interpetation: “is this important above and beyond the other values?” because p-value is a partial test * Focus on adjusted R-squared + Multiple R increases with each variable while adjusted is penalized, “pays the reaper”

anova(my_model)

## Analysis of Variance Table
## 
## Response: q4
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## q1         1 42.306  42.306 48.8278 3.824e-10 ***
## q2         1  2.657   2.657  3.0661   0.08317 .  
## q3         1  0.564   0.564  0.6508   0.42184    
## Residuals 95 82.312   0.866                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(my_model)

predict(my_model, mydata)

##        1        2        3        4        5        6        7        8 
## 2.247260 2.564880 2.722794       NA 3.831776 4.366367 4.114732 4.112939

5.3 Model Contents

“Generic functions” change their behavior based on an object’s class
What is print doing??

print(my_model)

## 
## Call:
## lm(formula = q4 ~ q1 + q2 + q3, data = mydata100)
## 
## Coefficients:
## (Intercept)           q1           q2           q3  
##     1.20940      0.41134      0.15791      0.09372

mode(my_model)

## [1] "list"

class(my_model)

## [1] "lm"

names(my_model)

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"

my_model$rank

## [1] 4

5.4 Examining Model Structure

Thetidyverse glimpse() function shows any object’s structure
The built-in equilvalent is str()
You rarely need to dig in this deeply

glimpse(my_model)

## List of 13
##  $ coefficients : Named num [1:4] 1.2094 0.4113 0.1579 0.0937
##   ..- attr(*, "names")= chr [1:4] "(Intercept)" "q1" "q2" "q3"
##  $ residuals    : Named num [1:99] 1.297 0.644 -1.366 0.232 0.821 ...
##   ..- attr(*, "names")= chr [1:99] "1" "2" "4" "5" ...
##  $ effects      : Named num [1:99] -33.87 6.504 1.63 0.751 0.766 ...
##   ..- attr(*, "names")= chr [1:99] "(Intercept)" "q1" "q2" "q3" ...
##  $ rank         : int 4
##  $ fitted.values: Named num [1:99] 3.7 3.36 4.37 3.77 4.18 ...
##   ..- attr(*, "names")= chr [1:99] "1" "2" "4" "5" ...
##  $ assign       : int [1:4] 0 1 2 3
##  $ qr           :List of 5
##   ..$ qr   : num [1:99, 1:4] -9.95 0.101 0.101 0.101 0.101 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
##   ..$ qraux: num [1:4] 1.1 1.05 1 1.08
##   ..$ pivot: int [1:4] 1 2 3 4
##   ..$ tol  : num 1e-07
##   ..$ rank : int 4
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 95
##  $ na.action    : 'omit' Named int 3
##   ..- attr(*, "names")= chr "3"
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = q4 ~ q1 + q2 + q3, data = mydata100)
##  $ terms        :Classes 'terms', 'formula'  language q4 ~ q1 + q2 + q3
##   .. ..- attr(*, "variables")= language list(q4, q1, q2, q3)
##   .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. ..- attr(*, "term.labels")= chr [1:3] "q1" "q2" "q3"
##   .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(q4, q1, q2, q3)
##   .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:4] "q4" "q1" "q2" "q3"
##  $ model        :'data.frame':   99 obs. of  4 variables:
##   ..$ q4: int [1:99] 5 4 3 4 5 4 5 1 5 3 ...
##   ..$ q1: int [1:99] 4 3 5 4 5 3 4 3 5 3 ...
##   ..$ q2: int [1:99] 3 4 4 4 4 1 4 2 5 4 ...
##   ..$ q3: int [1:99] 4 3 5 3 3 3 2 2 5 4 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language q4 ~ q1 + q2 + q3
##   .. .. ..- attr(*, "variables")= language list(q4, q1, q2, q3)
##   .. .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..- attr(*, "term.labels")= chr [1:3] "q1" "q2" "q3"
##   .. .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(q4, q1, q2, q3)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:4] "q4" "q1" "q2" "q3"
##   ..- attr(*, "na.action")= 'omit' Named int 3
##   .. ..- attr(*, "names")= chr "3"
##  - attr(*, "class")= chr "lm"

5.5 Finding Relevant Functions

Keywords: help, class, objects

What functions work on lm objects?

methods(class = "lm")

##  [1] add1                alias               anova              
##  [4] Anova               avPlot              Boot               
##  [7] bootCase            boxCox              bread              
## [10] brief               case.names          ceresPlot          
## [13] coerce              confidenceEllipse   confint            
## [16] Confint             cooks.distance      crPlot             
## [19] deltaMethod         deviance            dfbeta             
## [22] dfbetaPlots         dfbetas             dfbetasPlots       
## [25] drop1               dummy.coef          durbinWatsonTest   
## [28] effects             estfun              extractAIC         
## [31] family              formula             fortify            
## [34] hatvalues           hccm                infIndexPlot       
## [37] influence           influencePlot       initialize         
## [40] inverseResponsePlot kappa               labels             
## [43] leveneTest          leveragePlot        linearHypothesis   
## [46] logLik              mcPlot              mmp                
## [49] model.frame         model.matrix        ncvTest            
## [52] nextBoot            nobs                outlierTest        
## [55] plot                powerTransform      predict            
## [58] Predict             print               proj               
## [61] qqnorm              qqPlot              qr                 
## [64] residualPlot        residualPlots       residuals          
## [67] rstandard           rstudent            S                  
## [70] show                sigmaHat            simulate           
## [73] slotsFromS3         spreadLevelPlot     summary            
## [76] summarySandwich     variable.names      vcov               
## [79] vcovBS             
## see '?methods' for accessing help and source code

5.6 Table of Regression Formulas

Note that data= argument applies only to formulas!
x1*x2*x3 includes main effects and all possible interactions!

Model	Formula
Simple	y ~ x
Without intercept	y ~ -1 + x
Using all variables	y ~ .
Specify Interactions	y ~ x1 + x2 + x1:x2
Automate Interactions	y ~ x1x2 Alert:* Includes main effects!
Limited Automatic
…(like SAS @2)	y ~ (x1 + x2) ^ 2
Polynomial	y ~ x1 + I(x1^2) + I(x1^3)
…or	y ~ poly(x1, 3)

6 Comparing Two Groups

6.1 Preparing the Workspace

library("tidyverse")
  
load("mydata100.RData")

6.2 Two Independent Groups

Goal: Compare means/medians of independent groups
Example: Do genders differ on mean/median survey response?
Predictor: Factor
Outcome: Numeric
Assumptions:
- t.test: Data are normally distributed
- wilcox.test: Data have same distribution
  (except for mean or “location” of distribution)

6.3 Independent Samples `t.test`

Parametric

t.test(q1 ~ gender, data = mydata100)

## 
##  Welch Two Sample t-test
## 
## data:  q1 by gender
## t = 4.9784, df = 97.998, p-value = 2.748e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5878915 1.3672368
## sample estimates:
## mean in group Female   mean in group Male 
##             3.958333             2.980769

6.4 Independent Samples Non-parametric `wilcox.test`

Non-parametric
To be a test on medians, the two groups must have the same distribution (need not be normal)
Otherwise, it’s a test on mean ranks

wilcox.test(q1 ~ gender, data = mydata100)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  q1 by gender
## W = 1868, p-value = 8.902e-06
## alternative hypothesis: true location shift is not equal to 0

mydata100 %>%
  group_by(gender) %>%
  summarise(my_median = median(q1))

## # A tibble: 2 x 2
##   gender my_median
##   <fct>      <dbl>
## 1 Female         4
## 2 Male           3

6.5 Paired Samples

Goal: Compare means of paired groups
Example: Do the subjects differ on pre/post test means?
Predictor: Numeric
Outcome: Numeric
Assumptions:
- t.test: Data are normally distributed
- wilcox.test: Data have same distribution (except for mean or “location” of distribution)

6.6 Paired Samples `t.test`

Parametric
Note that data = doesn’t work here as there is no formula

t.test(mydata100$pretest, 
       mydata100$posttest, 
       paired = TRUE)

## 
##  Paired t-test
## 
## data:  mydata100$pretest and mydata100$posttest
## t = -14.46, df = 99, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.062921 -6.117079
## sample estimates:
## mean of the differences 
##                   -7.09

mean(mydata100$pretest)

## [1] 74.97

mean(mydata100$posttest)

## [1] 82.06

6.7 Paired Samples Non-parametric `wilcox.test`

Non-parametric

wilcox.test(mydata100$pretest, 
            mydata100$posttest, 
            paired = TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  mydata100$pretest and mydata100$posttest
## V = 45, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

median(mydata100$pretest)

## [1] 75

median(mydata100$posttest)

## [1] 82

7 Comparing More Than 2 Groups

Goal: Compare means of >2 groups, >= 1 factor, optionally correcting for covariates
Example: Are there workshop differences in posttest means?
Predictor: One or more factors, optional numeric covariates
Outcome: Numeric
Assumptions:
- ANOVA: Errors are normally distributed
- ANOVA: Variances are roughly equal
- Kruskal-Wallis: Data have same distribution
  (except for mean or “location” of distribution)

7.1 Getting Means & Variances by Gender

mydata100 %>%
  group_by(gender) %>%
  summarise(mean     = mean(posttest),
            variance = var( posttest))

## # A tibble: 2 x 3
##   gender  mean variance
##   <fct>  <dbl>    <dbl>
## 1 Female  81.7     36.3
## 2 Male    82.4     50.6

7.2 Getting Means & Variances by Workshop

mydata100 %>%
  group_by(workshop) %>%
  summarise(mean     = mean(posttest),
            variance = var( posttest))

## # A tibble: 4 x 3
##   workshop  mean variance
##   <fct>    <dbl>    <dbl>
## 1 R         86.3     25.0
## 2 SAS       79.6     36.1
## 3 SPSS      81.7     19.5
## 4 Stata     78.9     73.6

7.3 Test for Equality of Variances

library("car")
leveneTest(posttest ~ gender,
           data = mydata100)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.5017 0.4804
##       98

leveneTest(posttest ~ workshop,
           data = mydata100)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  3  2.3689 0.07544 .
##       96                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.4 Create AOV Model

AOV = Analysis Of Variance to differentiate between that and the anova function
contrasts option sets R to match SAS when doing Type III tests

options(contrasts=c("contr.sum","contr.poly"))

my_model <-  aov(posttest ~ gender*workshop, 
                 data=mydata100)

7.5 Plot Diagnostics

plot(my_model)

7.6 Types of ANOVA Tests

Balanced ANOVA designs have the same number of observations in each cell of a cross-tabulation.
For balanced designs, all types of F-tests yield the same p-values
ALERT: For unbalanced designs, R’s built-in anova() function yields sequential, or “Type I” tests, while SAS, SPSS, Stata all use “Type III” so answers differ!
The car package’s Anova (capital “A”) yields “Type II” tests, which are preferred (meets “Marginality Principle”)

For details, see:

Exegesis on Linear Models, by WN Venables
Companion to Applied Regression (car),
by John Fox

7.7 Get ANOVA Table

Note the capital “A” in Anova() here!
Built in one is all lower case

library("car")
Anova(my_model, type = "III")

## Anova Table (Type III tests)
## 
## Response: posttest
##                 Sum Sq Df    F value    Pr(>F)    
## (Intercept)     643750  1 17817.7525 < 2.2e-16 ***
## gender               3  1     0.0716 0.7895879    
## workshop           845  3     7.7976 0.0001076 ***
## gender:workshop     77  3     0.7090 0.5490659    
## Residuals         3324 92                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.8 Types of Means

Arithmetic means take the mean of all values, as we calculated above
Estimated marginal means (a.k.a. lsmeans) take the mean of the group means
For balanced designs, these are the same
For unbalanced designs, either can be “best”
For more complex models (e.g. ANACOVA) the comparison is more complex because estimated marginal means use the model, not the raw data

7.9 Estimated Marginal Means

Keywords: Least Squares Means, LSmeans, EMmeans, Predicted Means

The help files for the emmeans package contain over a dozen vignettes filled with useful examples.

library("emmeans")

## Warning: package 'emmeans' was built under R version 3.6.2

## Welcome to emmeans.
## NOTE -- Important change from versions <= 1.41:
##     Indicator predictors are now treated as 2-level factors by default.
##     To revert to old behavior, use emm_options(cov.keep = character(0))

myEMmeans <- emmeans(my_model, ~ workshop)

## NOTE: Results may be misleading due to involvement in interactions

myEMmeans

##  workshop emmean   SE df lower.CL upper.CL
##  R          86.1 1.08 92     84.0     88.3
##  SAS        79.5 1.20 92     77.2     81.9
##  SPSS       81.7 1.20 92     79.3     84.1
##  Stata      79.0 1.38 92     76.3     81.7
## 
## Results are averaged over the levels of: gender 
## Confidence level used: 0.95

7.10 Comparing All Means

contrast(myEMmeans, 
         method="pairwise", 
         adjust = "tukey" )

##  contrast     estimate   SE df t.ratio p.value
##  R - SAS         6.580 1.62 92  4.062  0.0006 
##  R - SPSS        4.410 1.62 92  2.723  0.0381 
##  R - Stata       7.122 1.76 92  4.056  0.0006 
##  SAS - SPSS     -2.170 1.70 92 -1.275  0.5809 
##  SAS - Stata     0.542 1.83 92  0.296  0.9909 
##  SPSS - Stata    2.712 1.83 92  1.481  0.4534 
## 
## Results are averaged over the levels of: gender 
## P value adjustment: tukey method for comparing a family of 4 estimates

7.11 Comparing Means to Control

Since this does fewer tests, p-values are not corrected as harshly for multiple testing

Compares all against the first level of the factor

contrast(myEMmeans, method="trt.vs.ctrl" )

##  contrast  estimate   SE df t.ratio p.value
##  SAS - R      -6.58 1.62 92 -4.062  0.0003 
##  SPSS - R     -4.41 1.62 92 -2.723  0.0217 
##  Stata - R    -7.12 1.76 92 -4.056  0.0003 
## 
## Results are averaged over the levels of: gender 
## P value adjustment: dunnettx method for 3 tests

7.12 Methods of Comparison

For each of these, adding reverse = TRUE flips pair order. See help(“contrast-methods”) for details.

pairwise
revpairwise (reverses order)
poly - polynomial
trt.vs.ctrl - compares each to 1st level
trt.vs.ctrlk - compares each to kth level

7.13 Adjustment Types

BH - Benjamini-Hochberg a.k.a. fdr
BY - Benjamini & Yekutieli
bonferroni - way too conservative)
fdr - most liberal other than “none”
hochberg
holm - default in many R functions
hommel
none - i.e. uncorrected pairwise t-tests
mvt
scheffe
sidak
tukey

For details see

help(“p.adjust.methods”)
help(“summary.emmGrid”)

7.14 Compact Letter Display (CLD)

Compares all means very compactly
The “group” column places means into sets which are not significantly different within the set, but all those in group 1 are significantly different from those in groups 2 (…and 3, and 4, etc.)
cld = Compact Letter Display but here it is done using numbers instead of letters.
Possibly more useful for larger outcomes

library("multcomp")

## Warning: package 'multcomp' was built under R version 3.6.2

## Loading required package: mvtnorm

## Loading required package: survival

## Loading required package: TH.data

## Warning: package 'TH.data' was built under R version 3.6.2

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## 
## Attaching package: 'TH.data'

## The following object is masked from 'package:MASS':
## 
##     geyser

cld(myEMmeans)

##  workshop emmean   SE df lower.CL upper.CL .group
##  Stata      79.0 1.38 92     76.3     81.7  1    
##  SAS        79.5 1.20 92     77.2     81.9  1    
##  SPSS       81.7 1.20 92     79.3     84.1  1    
##  R          86.1 1.08 92     84.0     88.3   2   
## 
## Results are averaged over the levels of: gender 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 4 estimates 
## significance level used: alpha = 0.05

7.15 Plotting All Comparisons

A compact display done with plotting
Mean differences are plotted as dots with 95% confidence intervals
If confidence interval includes zero, those means do not differ significantly
This is a ggplot, so it’s easy to embellish
Confidence intervals, compared against 0

plot( contrast(myEMmeans, method="pairwise") ) +
  geom_vline(xintercept = 0)

7.16 EM Means Interaction Plot (MIP)

Our model interaction is not significant, but we’ll plot it anyway.
Remember, interactions need much larger sample sizes to reach significance!

library("emmeans")

emmip(my_model, workshop ~ gender, CIs = TRUE)

7.17 Table of ANOVA / ANCOVA Formulas

Remember: data= applies only to formulas!

Model	Formula
One-way ANOVA	y ~ a
Two-way ANOVA
…with interaction	y ~ a + b + a:b
…or	y ~ ab Alert:* Includes main effects!
3-way ANOVA with
…all interactions	y ~ abc
3-way ANOVA with
…with only 2-way
interactions	y ~ (a + b + c) ^ 2
…or	y ~ a * b * c - a:b:c
Analysis of Covariance
…Separate slopes	y ~ x * a
…Common slopes	y ~ x + a
ANOVA nesting
b within a	y ~ b %in% a
…or	y ~ a/b

8 Graphical User Interfaces to R

8.1 jamovi jamovi

Beautiful interface with immediate feedback when you click on an option
You need not know R is behind the scenes
Journal-style output with no extra work
Fairly limited selection of methods, but what’s there is high quality
Currently undable to add any variables to its data sets, e.g. predicted values, PCA…
Useful jmv package lets you use all its functions outside of jamovi
Detailed review: http://r4stats.com/articles/software-reviews/jamovi/

8.2 BlueSky Statistics BlueSky

Extensive feature set
You need not be aware R is involved at at all
Journal-style output with no extra work
Teaches tidyverse code
Includes machine learning both direct to each package and via caret with tuning
Detailed review: http://r4stats.com/articles/software-reviews/bluesky

9 Conclusion

9.1 Feedback Please

http://workshop.utk.edu

R Statistics

Bob Muenchen

2/7/2020