In class, these steps will already be done for you. If you want to use the workshop files on your own computer, follow these steps.
Installing R & RStudio
Install R and RStudio from these two sites:
http://www.r-project.org
http://www.rstudio.com
Unzip the workshop files, taking care to ensure they all stay together in a folder
Install these add-on packages by selecting the following install.packages() lines, then choosing "Run selected lines" from the "Run" menu, or pressing Ctrl-Enter on a PC or Cmd-Enter on a Mac:
install.packages("car")
install.packages("emmeans")
install.packages("jmv")
install.packages("multcomp")
install.packages("RcmdrMisc")
install.packages("skimr")
install.packages("tidyverse")
load("mydata100.RData")
head(mydata100)
Up to 15 hours per semester of help with:
R, SPSS, SAS, NVivo, QuestionPro, ArcGIS…
Research planning, design, sample size…
Data entry, management, file conversions
Statistics, machine learning, data mining
Creating geographic maps of statistical information
Text mining, qualitative analysis…
Web survey design and deployment
Walk-in support in Greve Hall 540, schedule: http://oit.utk.edu/research/schedule
By appointment at Greve Hall 517 & Pendergrass Library
Helpdesk: 974-9900
library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages --------
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'purrr' was built under R version 3.6.2
## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("skimr")
## Warning: package 'skimr' was built under R version 3.6.2
load("mydata100.RData")
summary Function
Accepts a data frame
Min/max values are shown to 3 significant digits; use the digits argument (e.g., digits = 10) to change that, as in the sketch after the output below
mydata100 %>%
select(pretest, posttest) %>%
summary()
## pretest posttest
## Min. :58.00 Min. :59.00
## 1st Qu.:72.00 1st Qu.:77.00
## Median :75.00 Median :82.00
## Mean :74.97 Mean :82.06
## 3rd Qu.:79.00 3rd Qu.:86.00
## Max. :86.00 Max. :98.00
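A minimal sketch of the digits argument mentioned above (output not shown here):
mydata100 %>%
select(pretest, posttest) %>%
summary(digits = 10) # request more significant digits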
skim Function
Fairly new
Designed for the tidyverse
Produces tibbles amenable to filter, group_by, etc. (both shown below)
Optimized for high-quality output (more soon)
library("skimr")
library("knitr")
## Warning: package 'knitr' was built under R version 3.6.2
mydata100 %>%
skim()
| Name | Piped data |
| Number of rows | 100 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| workshop | 0 | 1 | FALSE | 4 | R: 31, SAS: 25, SPS: 25, Sta: 19 |
| gender | 0 | 1 | FALSE | 2 | Mal: 52, Fem: 48 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 50.50 | 29.01 | 1 | 25.75 | 50.5 | 75.25 | 100 | ▇▇▇▇▇ |
| q1 | 0 | 1.00 | 3.45 | 1.10 | 1 | 3.00 | 3.0 | 4.00 | 5 | ▁▃▇▆▅ |
| q2 | 0 | 1.00 | 3.06 | 1.22 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▃▇▆▇▃ |
| q3 | 1 | 0.99 | 3.08 | 1.17 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▅▇▅▃ |
| q4 | 0 | 1.00 | 3.40 | 1.14 | 1 | 3.00 | 3.0 | 4.00 | 5 | ▂▃▇▆▅ |
| pretest | 0 | 1.00 | 74.97 | 5.30 | 58 | 72.00 | 75.0 | 79.00 | 86 | ▁▂▇▇▃ |
| posttest | 0 | 1.00 | 82.06 | 6.59 | 59 | 77.00 | 82.0 | 86.00 | 98 | ▁▁▇▇▂ |
skim Example with filter()
mydata100 %>%
skim() %>%
filter(numeric.mean > 70)
| Name | Piped data |
| Number of rows | 100 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| pretest | 0 | 1 | 74.97 | 5.30 | 58 | 72 | 75 | 79 | 86 | ▁▂▇▇▃ |
| posttest | 0 | 1 | 82.06 | 6.59 | 59 | 77 | 82 | 86 | 98 | ▁▁▇▇▂ |
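skimr also summarizes grouped data frames by group; a minimal sketch (output not shown):
mydata100 %>%
group_by(gender) %>% # one set of statistics per gender
skim(pretest, posttest)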
jmv Package's descriptives Function
The jmv package contains many useful functions that provide APA-style output
It also offers a fairly comprehensive set of univariate plots & will split by group
Can also use select_if(is.numeric) to get all numeric vars (see the sketch after the output below)
library("jmv")
## Warning: package 'jmv' was built under R version 3.6.2
mydata100 %>%
select(pretest, posttest) %>%
descriptives()
##
## DESCRIPTIVES
##
## Descriptives
## ----------------------------------
## pretest posttest
## ----------------------------------
## N 100 100
## Missing 0 0
## Mean 75.0 82.1
## Median 75.0 82.0
## Minimum 58 59
## Maximum 86 98
## ----------------------------------
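A minimal sketch of the select_if() variant mentioned above, which picks up every numeric variable:
mydata100 %>%
select_if(is.numeric) %>% # id, q1-q4, pretest, posttest
descriptives()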
We’ve seen R’s table function with its sparse results
The jmv::descriptives function provides much nicer output
Adding freq=TRUE turns on the frequencies for factors
Could also use select_if(is.factor) to get all factors (see the sketch after the output below)
library("jmv")
mydata100 %>%
select(workshop, gender) %>%
descriptives(freq = TRUE)
##
## DESCRIPTIVES
##
## Descriptives
## ---------------------------------
## workshop gender
## ---------------------------------
## N 100 100
## Missing 0 0
## Mean
## Median
## Minimum
## Maximum
## ---------------------------------
##
##
## FREQUENCIES
##
## Frequencies of workshop
## --------------------------------------------------
## Levels Counts % of Total Cumulative %
## --------------------------------------------------
## R 31 31.0 31.0
## SAS 25 25.0 56.0
## SPSS 25 25.0 81.0
## Stata 19 19.0 100.0
## --------------------------------------------------
##
##
## Frequencies of gender
## --------------------------------------------------
## Levels Counts % of Total Cumulative %
## --------------------------------------------------
## Female 48 48.0 48.0
## Male 52 52.0 100.0
## --------------------------------------------------
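And the corresponding select_if() sketch for factors:
mydata100 %>%
select_if(is.factor) %>% # workshop, gender
descriptives(freq = TRUE)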
Goal: Compare groups on pattern of counts
Example: Do the genders differ on workshop attendance?
Predictor: Factor
Outcome: Factor
Assumptions:
Minimum expected count > 1
Fewer than 20% of expected counts < 5 (see the expected-count check after the table output below)
table Function
myWG <- mydata100 %>%
select(workshop, gender) %>%
table()
myWG
## gender
## workshop Female Male
## R 14 17
## SAS 12 13
## SPSS 13 12
## Stata 9 10
summary(myWG)
## Number of cases in table: 100
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 0.26338, df = 3, p-value = 0.9668
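To check the expected-count assumptions listed earlier, chisq.test() stores the expected counts in its result; a minimal sketch:
# None should be < 1, and fewer than 20% should be < 5
chisq.test(myWG)$expected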
jmv Package
Provides publication-quality tables when run in the jamovi GUI.
Offers Fisher’s Exact Test, odds, relative risk, and many measures of association.
contTablesPaired does McNemar’s Test
library("jmv")
mydata100 %>%
contTables(
rows = 'workshop',
cols = 'gender', pcRow = TRUE)
##
## CONTINGENCY TABLES
##
## Contingency Tables
## --------------------------------------------------------
## workshop Female Male Total
## --------------------------------------------------------
## R Observed 14 17 31
## % within row 45.2 54.8
##
## SAS Observed 12 13 25
## % within row 48.0 52.0
##
## SPSS Observed 13 12 25
## % within row 52.0 48.0
##
## Stata Observed 9 10 19
## % within row 47.4 52.6
##
## Total Observed 48 52 100
## % within row 48.0 52.0
## --------------------------------------------------------
##
##
## χ² Tests
## ------------------------------
## Value df p
## ------------------------------
## χ² 0.263 3 0.967
## N 100
## ------------------------------
Small samples: fisher.test (example after this list)
Same variable measured twice: mcnemar.test
3-way table: mantelhaen.test
N-way table: loglm in MASS package
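For example, Fisher's exact test runs directly on the table built above; a minimal sketch (output not shown):
fisher.test(myWG)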
library("tidyverse")
load("mydata100.RData")
Goal: Measure strength of linear association
Example: Is pretest related to posttest?
Predictor: Numeric (one)
Outcome: Numeric (one)
Pearson Assumes:
Relationship is linear
Variables are normally distributed
Spearman Assumes:
Relationship is consistent (monotonic)
Variables come from same distribution
cor Function
Demonstrated using the small mydata data set
For non-parametric correlations, use method = "spearman" (see the sketch after the output below)
load("mydata.RData")
mydata %>%
select(q1:q4) %>%
cor(use = "pairwise",
method = "pearson")
## q1 q2 q3 q4
## q1 1.0000000 0.7395179 -0.12500000 0.88040627
## q2 0.7395179 1.0000000 -0.27003086 0.85063978
## q3 -0.1250000 -0.2700309 1.00000000 -0.02613542
## q4 0.8804063 0.8506398 -0.02613542 1.00000000
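The Spearman variant mentioned above is a one-argument change (output not shown):
mydata %>%
select(q1:q4) %>%
cor(use = "pairwise",
method = "spearman") # rank-based, non-parametric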
Perhaps to warn against the impact of multiple testing, cor.test allows only one test per run!
cor.test(mydata$q1, mydata$q4,
use = "pairwise",
method = "pearson"
)
##
## Pearson's product-moment correlation
##
## data: mydata$q1 and mydata$q4
## t = 4.5476, df = 6, p-value = 0.003902
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4629416 0.9782033
## sample estimates:
## cor
## 0.8804063
rcorr.adjust Does More
Output similar to SAS/SPSS/Stata
Provides p-values by default
Adjusts p-values for the number of correlations tested
library("RcmdrMisc")
## Warning: package 'RcmdrMisc' was built under R version 3.6.2
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## Loading required package: sandwich
##
## Attaching package: 'RcmdrMisc'
## The following object is masked from 'package:jmv':
##
## reliability
mydata %>%
select(q1:q4) %>%
rcorr.adjust(
use = "pairwise",
type = "pearson")
##
## Pearson correlations:
## q1 q2 q3 q4
## q1 1.0000 0.7395 -0.1250 0.8804
## q2 0.7395 1.0000 -0.2700 0.8506
## q3 -0.1250 -0.2700 1.0000 -0.0261
## q4 0.8804 0.8506 -0.0261 1.0000
##
## Number of observations:
## q1 q2 q3 q4
## q1 8 8 7 8
## q2 8 8 7 8
## q3 7 7 7 7
## q4 8 8 7 8
##
## Pairwise two-sided p-values:
## q1 q2 q3 q4
## q1 0.0360 0.7894 0.0039
## q2 0.0360 0.5581 0.0074
## q3 0.7894 0.5581 0.9556
## q4 0.0039 0.0074 0.9556
##
## Adjusted p-values (Holm's method)
## q1 q2 q3 q4
## q1 0.1440 1.0000 0.0234
## q2 0.1440 1.0000 0.0371
## q3 1.0000 1.0000 1.0000
## q4 0.0234 0.0371 1.0000
Note: you can run this analysis in BlueSky and produce an APA-style table
In SAS & SPSS
Specify settings in advance
Then see all results
In R, modeling functions
Accept formulas
Create model objects
Use generic functions to show more, e.g. summary(), plot(), etc.
Use "extractor" (a.k.a. accessor) functions to show more, e.g. coefficients()
lm Function
my_model <- lm(q4 ~ q1 + q2 + q3,
data = mydata100)
summary(my_model)
##
## Call:
## lm(formula = q4 ~ q1 + q2 + q3, data = mydata100)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9467 -0.6418 0.1175 0.5960 2.0533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.20940 0.31787 3.805 0.000251 ***
## q1 0.41134 0.13170 3.123 0.002370 **
## q2 0.15791 0.10690 1.477 0.142942
## q3 0.09372 0.11617 0.807 0.421838
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9308 on 95 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3561, Adjusted R-squared: 0.3358
## F-statistic: 17.51 on 3 and 95 DF, p-value: 3.944e-09
# q4 = 1.2 + .411*q1 + .16*q2 + .09*q3
Notes:
Residuals: values beyond +/-3 are worrisome; +/-4 indicates outliers
P-value interpretation: each is a partial test, asking "is this variable important above and beyond the others?"
Focus on adjusted R-squared: multiple R-squared increases with every added variable, while the adjusted version is penalized ("pays the reaper") for model complexity
anova(my_model)
## Analysis of Variance Table
##
## Response: q4
## Df Sum Sq Mean Sq F value Pr(>F)
## q1 1 42.306 42.306 48.8278 3.824e-10 ***
## q2 1 2.657 2.657 3.0661 0.08317 .
## q3 1 0.564 0.564 0.6508 0.42184
## Residuals 95 82.312 0.866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(my_model)
predict(my_model, mydata)
## 1 2 3 4 5 6 7 8
## 2.247260 2.564880 2.722794 NA 3.831776 4.366367 4.114732 4.112939
“Generic functions” change their behavior based on an object’s class
What is print() doing?
print(my_model)
##
## Call:
## lm(formula = q4 ~ q1 + q2 + q3, data = mydata100)
##
## Coefficients:
## (Intercept) q1 q2 q3
## 1.20940 0.41134 0.15791 0.09372
mode(my_model)
## [1] "list"
class(my_model)
## [1] "lm"
names(my_model)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "na.action" "xlevels" "call" "terms"
## [13] "model"
my_model$rank
## [1] 4
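Rather than reaching into the object with $, the "extractor" functions mentioned earlier are the supported route; a minimal sketch (output not shown):
coef(my_model) # same as coefficients(my_model)
confint(my_model) # confidence intervals for the coefficients
head(residuals(my_model))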
The tidyverse glimpse() function shows any object's structure
The built-in equivalent is str()
You rarely need to dig in this deeply
glimpse(my_model)
## List of 13
## $ coefficients : Named num [1:4] 1.2094 0.4113 0.1579 0.0937
## ..- attr(*, "names")= chr [1:4] "(Intercept)" "q1" "q2" "q3"
## $ residuals : Named num [1:99] 1.297 0.644 -1.366 0.232 0.821 ...
## ..- attr(*, "names")= chr [1:99] "1" "2" "4" "5" ...
## $ effects : Named num [1:99] -33.87 6.504 1.63 0.751 0.766 ...
## ..- attr(*, "names")= chr [1:99] "(Intercept)" "q1" "q2" "q3" ...
## $ rank : int 4
## $ fitted.values: Named num [1:99] 3.7 3.36 4.37 3.77 4.18 ...
## ..- attr(*, "names")= chr [1:99] "1" "2" "4" "5" ...
## $ assign : int [1:4] 0 1 2 3
## $ qr :List of 5
## ..$ qr : num [1:99, 1:4] -9.95 0.101 0.101 0.101 0.101 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
## ..$ qraux: num [1:4] 1.1 1.05 1 1.08
## ..$ pivot: int [1:4] 1 2 3 4
## ..$ tol : num 1e-07
## ..$ rank : int 4
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 95
## $ na.action : 'omit' Named int 3
## ..- attr(*, "names")= chr "3"
## $ xlevels : Named list()
## $ call : language lm(formula = q4 ~ q1 + q2 + q3, data = mydata100)
## $ terms :Classes 'terms', 'formula' language q4 ~ q1 + q2 + q3
## .. ..- attr(*, "variables")= language list(q4, q1, q2, q3)
## .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. ..- attr(*, "term.labels")= chr [1:3] "q1" "q2" "q3"
## .. ..- attr(*, "order")= int [1:3] 1 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(q4, q1, q2, q3)
## .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:4] "q4" "q1" "q2" "q3"
## $ model :'data.frame': 99 obs. of 4 variables:
## ..$ q4: int [1:99] 5 4 3 4 5 4 5 1 5 3 ...
## ..$ q1: int [1:99] 4 3 5 4 5 3 4 3 5 3 ...
## ..$ q2: int [1:99] 3 4 4 4 4 1 4 2 5 4 ...
## ..$ q3: int [1:99] 4 3 5 3 3 3 2 2 5 4 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language q4 ~ q1 + q2 + q3
## .. .. ..- attr(*, "variables")= language list(q4, q1, q2, q3)
## .. .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. ..- attr(*, "term.labels")= chr [1:3] "q1" "q2" "q3"
## .. .. ..- attr(*, "order")= int [1:3] 1 1 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(q4, q1, q2, q3)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:4] "q4" "q1" "q2" "q3"
## ..- attr(*, "na.action")= 'omit' Named int 3
## .. ..- attr(*, "names")= chr "3"
## - attr(*, "class")= chr "lm"
Keywords: help, class, objects
What functions work on lm objects?
methods(class = "lm")
## [1] add1 alias anova
## [4] Anova avPlot Boot
## [7] bootCase boxCox bread
## [10] brief case.names ceresPlot
## [13] coerce confidenceEllipse confint
## [16] Confint cooks.distance crPlot
## [19] deltaMethod deviance dfbeta
## [22] dfbetaPlots dfbetas dfbetasPlots
## [25] drop1 dummy.coef durbinWatsonTest
## [28] effects estfun extractAIC
## [31] family formula fortify
## [34] hatvalues hccm infIndexPlot
## [37] influence influencePlot initialize
## [40] inverseResponsePlot kappa labels
## [43] leveneTest leveragePlot linearHypothesis
## [46] logLik mcPlot mmp
## [49] model.frame model.matrix ncvTest
## [52] nextBoot nobs outlierTest
## [55] plot powerTransform predict
## [58] Predict print proj
## [61] qqnorm qqPlot qr
## [64] residualPlot residualPlots residuals
## [67] rstandard rstudent S
## [70] show sigmaHat simulate
## [73] slotsFromS3 spreadLevelPlot summary
## [76] summarySandwich variable.names vcov
## [79] vcovBS
## see '?methods' for accessing help and source code
Note that the data= argument applies only to formulas!
x1*x2*x3 includes main effects and all possible interactions!
| Model | Formula |
|---|---|
| Simple | y ~ x |
| Without intercept | y ~ -1 + x |
| Using all variables | y ~ . |
| Specify Interactions | y ~ x1 + x2 + x1:x2 |
| Automate Interactions | y ~ x1*x2 Alert: Includes main effects! |
| Limited Automatic | |
| …(like SAS @2) | y ~ (x1 + x2) ^ 2 |
| Polynomial | y ~ x1 + I(x1^2) + I(x1^3) |
| …or | y ~ poly(x1, 3) |
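For example, these two calls fit identical models, since x1*x2 expands to the main effects plus their interaction (a sketch using the workshop variables):
lm(q4 ~ q1 * q2, data = mydata100)
lm(q4 ~ q1 + q2 + q1:q2, data = mydata100) # same fitted model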
library("tidyverse")
load("mydata100.RData")
Goal: Compare means/medians of independent groups
Example: Do genders differ on mean/median survey response?
Predictor: Factor
Outcome: Numeric
Assumptions:
t.test: Data are normally distributed
wilcox.test: Data have same distribution
(except for mean or “location” of distribution)
t.test Function
t.test(q1 ~ gender, data = mydata100)
##
## Welch Two Sample t-test
##
## data: q1 by gender
## t = 4.9784, df = 97.998, p-value = 2.748e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5878915 1.3672368
## sample estimates:
## mean in group Female mean in group Male
## 3.958333 2.980769
wilcox.test Function
Non-parametric
To be a test on medians, the two groups must have the same distribution (need not be normal)
Otherwise, it’s a test on mean ranks
wilcox.test(q1 ~ gender, data = mydata100)
##
## Wilcoxon rank sum test with continuity correction
##
## data: q1 by gender
## W = 1868, p-value = 8.902e-06
## alternative hypothesis: true location shift is not equal to 0
mydata100 %>%
group_by(gender) %>%
summarise(my_median = median(q1))
## # A tibble: 2 x 2
## gender my_median
## <fct> <dbl>
## 1 Female 4
## 2 Male 3
Goal: Compare means of paired groups
Example: Do the subjects differ on pre/post test means?
Predictor: Numeric
Outcome: Numeric
Assumptions:
t.test: Data are normally distributed
wilcox.test: Data have same distribution (except for mean or “location” of distribution)
t.test Function
Parametric
Note that data = doesn't work here, as there is no formula
t.test(mydata100$pretest,
mydata100$posttest,
paired = TRUE)
##
## Paired t-test
##
## data: mydata100$pretest and mydata100$posttest
## t = -14.46, df = 99, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.062921 -6.117079
## sample estimates:
## mean of the differences
## -7.09
mean(mydata100$pretest)
## [1] 74.97
mean(mydata100$posttest)
## [1] 82.06
wilcox.test Function
wilcox.test(mydata100$pretest,
mydata100$posttest,
paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: mydata100$pretest and mydata100$posttest
## V = 45, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
median(mydata100$pretest)
## [1] 75
median(mydata100$posttest)
## [1] 82
Goal: Compare means of >2 groups, >= 1 factor, optionally correcting for covariates
Example: Are there workshop differences in posttest means?
Predictor: One or more factors, optional numeric covariates
Outcome: Numeric
Assumptions:
ANOVA: Errors are normally distributed
ANOVA: Variances are roughly equal
Kruskal-Wallis: Data have same distribution
(except for mean or “location” of distribution)
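The Kruskal-Wallis test mentioned above uses the same formula interface; a minimal sketch (output not shown):
kruskal.test(posttest ~ workshop, data = mydata100)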
mydata100 %>%
group_by(gender) %>%
summarise(mean = mean(posttest),
variance = var( posttest))
## # A tibble: 2 x 3
## gender mean variance
## <fct> <dbl> <dbl>
## 1 Female 81.7 36.3
## 2 Male 82.4 50.6
mydata100 %>%
group_by(workshop) %>%
summarise(mean = mean(posttest),
variance = var( posttest))
## # A tibble: 4 x 3
## workshop mean variance
## <fct> <dbl> <dbl>
## 1 R 86.3 25.0
## 2 SAS 79.6 36.1
## 3 SPSS 81.7 19.5
## 4 Stata 78.9 73.6
library("car")
leveneTest(posttest ~ gender,
data = mydata100)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 0.5017 0.4804
## 98
leveneTest(posttest ~ workshop,
data = mydata100)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 2.3689 0.07544 .
## 96
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov = Analysis Of Variance, R's model-fitting function, named to differentiate it from the anova() function, which extracts tables from fitted models
The contrasts option sets R to match SAS when doing Type III tests
options(contrasts=c("contr.sum","contr.poly"))
my_model <- aov(posttest ~ gender*workshop,
data=mydata100)
plot(my_model)
Balanced ANOVA designs have the same number of observations in each cell of a cross-tabulation.
For balanced designs, all types of F-tests yield the same p-values
ALERT: For unbalanced designs, R's built-in anova() function yields sequential, or "Type I", tests, while SAS, SPSS, and Stata all use "Type III", so answers differ!
The car package's Anova() (capital "A") yields "Type II" tests by default, which are preferred (they meet the "Marginality Principle"); a sketch follows the Type III output below
For details, see:
Exegeses on Linear Models, by W.N. Venables
Companion to Applied Regression (car), by John Fox
Note the capital "A" in Anova() here!
The built-in one is all lower case
library("car")
Anova(my_model, type = "III")
## Anova Table (Type III tests)
##
## Response: posttest
## Sum Sq Df F value Pr(>F)
## (Intercept) 643750 1 17817.7525 < 2.2e-16 ***
## gender 3 1 0.0716 0.7895879
## workshop 845 3 7.7976 0.0001076 ***
## gender:workshop 77 3 0.7090 0.5490659
## Residuals 3324 92
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
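As noted above, car's Anova() produces the preferred Type II tests when no type is specified; a minimal sketch (output not shown):
Anova(my_model) # type = "II" is the default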
Arithmetic means take the mean of all values, as we calculated above
Estimated marginal means (a.k.a. lsmeans) take the mean of the group means
For balanced designs, these are the same
For unbalanced designs, either can be “best”
For more complex models (e.g. ANACOVA) the comparison is more complex because estimated marginal means use the model, not the raw data
Keywords: Least Squares Means, LSmeans, EMmeans, Predicted Means
The help files for the emmeans package contain over a dozen vignettes filled with useful examples.
library("emmeans")
## Warning: package 'emmeans' was built under R version 3.6.2
## Welcome to emmeans.
## NOTE -- Important change from versions <= 1.41:
## Indicator predictors are now treated as 2-level factors by default.
## To revert to old behavior, use emm_options(cov.keep = character(0))
myEMmeans <- emmeans(my_model, ~ workshop)
## NOTE: Results may be misleading due to involvement in interactions
myEMmeans
## workshop emmean SE df lower.CL upper.CL
## R 86.1 1.08 92 84.0 88.3
## SAS 79.5 1.20 92 77.2 81.9
## SPSS 81.7 1.20 92 79.3 84.1
## Stata 79.0 1.38 92 76.3 81.7
##
## Results are averaged over the levels of: gender
## Confidence level used: 0.95
contrast(myEMmeans,
method="pairwise",
adjust = "tukey" )
## contrast estimate SE df t.ratio p.value
## R - SAS 6.580 1.62 92 4.062 0.0006
## R - SPSS 4.410 1.62 92 2.723 0.0381
## R - Stata 7.122 1.76 92 4.056 0.0006
## SAS - SPSS -2.170 1.70 92 -1.275 0.5809
## SAS - Stata 0.542 1.83 92 0.296 0.9909
## SPSS - Stata 2.712 1.83 92 1.481 0.4534
##
## Results are averaged over the levels of: gender
## P value adjustment: tukey method for comparing a family of 4 estimates
Compares all levels against the first level of the factor
Since this does fewer tests, p-values are not corrected as harshly for multiple testing
contrast(myEMmeans, method="trt.vs.ctrl" )
## contrast estimate SE df t.ratio p.value
## SAS - R -6.58 1.62 92 -4.062 0.0003
## SPSS - R -4.41 1.62 92 -2.723 0.0217
## Stata - R -7.12 1.76 92 -4.056 0.0003
##
## Results are averaged over the levels of: gender
## P value adjustment: dunnettx method for 3 tests
For each of these, adding reverse = TRUE flips pair order. See help(“contrast-methods”) for details.
pairwise
revpairwise (reverses order)
poly - polynomial
trt.vs.ctrl - compares each to 1st level
trt.vs.ctrlk - compares each to kth level
For details, see help("contrast-methods")
Compares all means very compactly
The .group column places means into sets that do not differ significantly within the set, but all those in group 1 are significantly different from those in group 2 (and 3, and 4, etc.)
cld = Compact Letter Display, though here it uses numbers instead of letters
Possibly more useful for larger numbers of means
library("multcomp")
## Warning: package 'multcomp' was built under R version 3.6.2
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Warning: package 'TH.data' was built under R version 3.6.2
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
cld(myEMmeans)
## workshop emmean SE df lower.CL upper.CL .group
## Stata 79.0 1.38 92 76.3 81.7 1
## SAS 79.5 1.20 92 77.2 81.9 1
## SPSS 81.7 1.20 92 79.3 84.1 1
## R 86.1 1.08 92 84.0 88.3 2
##
## Results are averaged over the levels of: gender
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 4 estimates
## significance level used: alpha = 0.05
A compact display done with plotting
Mean differences are plotted as dots with 95% confidence intervals
If confidence interval includes zero, those means do not differ significantly
This is a ggplot, so it’s easy to embellish
Confidence intervals, compared against 0
plot( contrast(myEMmeans, method="pairwise") ) +
geom_vline(xintercept = 0)
Our model interaction is not significant, but we’ll plot it anyway.
Remember, interactions need much larger sample sizes to reach significance!
library("emmeans")
emmip(my_model, workshop ~ gender, CIs = TRUE)
Remember: data= applies only to formulas!
| Model | Formula |
|---|---|
| One-way ANOVA | y ~ a |
| Two-way ANOVA | |
| …with interaction | y ~ a + b + a:b |
| …or | y ~ a*b Alert: Includes main effects! |
| 3-way ANOVA with | |
| …all interactions | y ~ a*b*c |
| 3-way ANOVA with | |
| …with only 2-way | |
| interactions | y ~ (a + b + c) ^ 2 |
| …or | y ~ a * b * c - a:b:c |
| Analysis of Covariance | |
| …Separate slopes | y ~ x * a |
| …Common slopes | y ~ x + a |
| ANOVA nesting | |
| b within a | y ~ b %in% a |
| …or | y ~ a/b |
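As with the regression formulas, several of these expand to the same fitted model; for example, these two two-way ANOVA calls are equivalent (a sketch, output not shown):
aov(posttest ~ gender * workshop, data = mydata100)
aov(posttest ~ gender + workshop + gender:workshop, data = mydata100) # same model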
Beautiful interface with immediate feedback when you click on an option
You need not know R is behind the scenes
Journal-style output with no extra work
Fairly limited selection of methods, but what’s there is high quality
Currently unable to add any variables to its data sets, e.g. predicted values, PCA…
Useful jmv package lets you use all its functions outside of jamovi
Detailed review: http://r4stats.com/articles/software-reviews/jamovi/
Extensive feature set
You need not be aware R is involved at all
Journal-style output with no extra work
Teaches tidyverse code
Includes machine learning, both directly through each package and via caret with tuning
Detailed review: http://r4stats.com/articles/software-reviews/bluesky