Using TILDA data, we are looking to see whether we can predict the score on a Mini-Mental State Examination (coded as mms) from whether the individual ever had an alcohol problem and their level of exercise.
Let’s load in the TILDA data, subset out the variables we want, and set up a data frame of the variables of interest.
I’m using the dplyr package for data manipulation. It has different functions for subsetting: instead of square brackets [], it uses the select function. The functions are more intuitive and powerful once you are familiar with them. Check out this chapter of the R for Data Science book for an introduction. I would strongly recommend the book; it’s comprehensive and will set you up to tackle any and all datasets you are presented with.
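For example, here is the same column subset written with base R square brackets and with dplyr's select (on a toy data frame, not the TILDA data):
library(dplyr)
df <- data.frame(id = 1:3, score = c(28, 30, 25), extra = c("a", "b", "c"))
df[, c("id", "score")]  # base R: square-bracket subsetting by name
select(df, id, score)   # dplyr: the same subset, with unquoted column names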
library(haven) # read_dta() reads Stata .dta files
tilda <- read_dta("0053-05_TILDA_Wave4_v4.0.dta")
tilda <- as_tibble(tilda) # a tibble is the tidyverse's take on a data frame (https://www.tidyverse.org/)
data <- dplyr::select(tilda, c(alcohol = BEHcage2, exercise = FRexercise3, mms = COGmmse))
head(data)
Great, we have our data loaded up. Now we have to consider what type of data we are dealing with.
First things first. What data types do we have? Wikipedia has a great page on the many different types of statistical data that exist. The type of data that we have will determine how we approach the analysis.
alcohol describes whether the individual has ever had an issue with alcohol, yes or no. It is therefore binary. It can only be true/false, yes/no or 0/1.
exercise describes the level of activity of an individual, the options being 0, 1 or 2. These are ordered categories, but the distance between 0 and 1 and between 1 and 2 is unlikely to be the same; the categories are not evenly spaced. It is therefore ordinal data: ordered categories with unequal distances between them.
mms describes the score an individual received on a mini-mental state examination. This again is ordinal data: an ordered series of categories, 31 of them, ranging from 0 to 30. For a good explanation of why test scores are ordinal, check out this link.
The type of data in the response (dependent) variable, the one that would be plotted on the y-axis, determines the type of model to use.
As mms is our response variable, we should consider an ordered logistic regression. Here are a number of tutorials I found on ordered logistic regression.
I’m going to follow along with the UCLA tutorial.
Exploring the data.
lapply(data[, c("alcohol", "exercise", "mms")], table)
$alcohol
0 1
4209 577
$exercise
0 1 2
2111 1866 1595
$mms
0 3 13 14 15 16 17 18 19 20 21 22 23
2 1 1 1 4 6 7 7 13 22 31 50 56
24 25 26 27 28 29 30
100 110 233 446 816 1487 2322
ftable(xtabs(~ alcohol + exercise + mms, data = data))
mms 0 13 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
alcohol exercise
0 0 1 0 2 2 2 1 6 5 8 12 25 35 37 68 121 256 360 569
1 1 1 0 2 0 0 2 4 1 5 5 15 17 53 100 197 398 597
2 0 0 0 0 0 1 0 2 3 8 7 7 16 31 75 148 357 539
1 0 0 0 0 0 0 0 0 1 1 4 1 2 2 8 12 24 48 96
1 0 0 0 0 0 0 1 0 2 0 1 1 1 3 10 27 42 111
2 0 0 0 0 0 0 0 0 1 1 0 1 1 2 8 14 51 87
data$exercise <- as.factor(data$exercise)
data$alcohol <- as.factor(data$alcohol)
data$mms <- as.numeric(data$mms)
library(ggplot2)
ggplot(data, aes(x = exercise, y = mms)) +
geom_boxplot(size = .75) +
geom_jitter(alpha = .2) +
facet_grid(rows = vars(alcohol), margins = TRUE) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
Now we will apply a proportional odds logistic regression using the polr function in the MASS package.
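Briefly, the proportional odds model fits one cumulative logit per cut-point between adjacent score categories, with a single shared set of slopes. In polr's parameterisation,

logit[ P(mms <= j) ] = zeta_j - (b1*exercise1 + b2*exercise2 + b3*alcohol1)

where the zeta_j are the intercepts reported in the summary below. Because the slopes are shared across all cut-points (the proportional odds assumption), a positive coefficient means higher odds of landing in a higher score category.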
## fit ordered logit model and store results 'm'
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
    select
data$mms <- as.ordered(data$mms) # polr() expects an ordered factor response
m <- polr(mms ~ exercise + alcohol, data = data, Hess = TRUE) # Hess = TRUE keeps the Hessian so summary() can compute standard errors
## view a summary of the model
summary(m)
Call:
polr(formula = mms ~ exercise + alcohol, data = data, Hess = TRUE)
Coefficients:
Value Std. Error t value
exercise1 0.3406 0.06390 5.330
exercise2 0.4538 0.06679 6.795
alcohol1 0.4255 0.08477 5.019
Intercepts:
Value Std. Error t value
0|3 -7.4966 0.6696 -11.1960
3|13 -7.4827 0.6655 -11.2434
13|14 -7.0769 0.5546 -12.7613
14|15 -7.0761 0.5544 -12.7643
15|16 -6.5644 0.4370 -15.0207
16|17 -5.9754 0.3302 -18.0951
17|18 -5.7744 0.2999 -19.2536
18|19 -5.6070 0.2768 -20.2583
19|20 -5.0788 0.2149 -23.6360
20|21 -4.6407 0.1745 -26.5887
21|22 -4.2514 0.1456 -29.2085
22|23 -3.7748 0.1173 -32.1758
23|24 -3.3685 0.0985 -34.1997
24|25 -2.9398 0.0830 -35.4327
25|26 -2.5769 0.0727 -35.4266
26|27 -2.0349 0.0615 -33.0765
27|28 -1.3744 0.0530 -25.9355
28|29 -0.5400 0.0482 -11.2122
29|30 0.5985 0.0482 12.4241
Residual Deviance: 14491.66
AIC: 14535.66
(1049 observations deleted due to missingness)
## store table
ctable <- coef(summary(m))
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
ctable <- cbind(ctable, "p value" = p)
# default method gives profiled CIs
(ci <- confint(m))
Waiting for profiling to be done...
2.5 % 97.5 %
exercise1 0.2143976 0.4668488
exercise2 0.3219268 0.5858133
alcohol1 0.2588293 0.5938320
# CIs assuming normality
confint.default(m)
2.5 % 97.5 %
exercise1 0.2153630 0.4658393
exercise2 0.3229131 0.5847123
alcohol1 0.2593577 0.5916654
## odds ratios
exp(coef(m))
exercise1 exercise2 alcohol1
1.405792 1.574303 1.530373
## OR and CI
exp(cbind(OR = coef(m), ci))
OR 2.5 % 97.5 %
exercise1 1.405792 1.239115 1.594960
exercise2 1.574303 1.379784 1.796452
alcohol1 1.530373 1.295413 1.810915
Interpreting the odds ratios
These are the exponentiated coefficients from the model above, e.g. exp(0.3406) ≈ 1.41 for exercise1.
Exercise1
For individuals who reported exercise level 1, the odds of scoring higher on the test are 1.41 times the odds for individuals who did no exercise (level 0), holding alcohol constant.
Exercise2
For individuals who reported exercise level 2, the odds of scoring higher on the test are 1.57 times the odds for individuals who did no exercise (level 0), holding alcohol constant.
alcohol1
For individuals who had an issue with alcohol at some stage of life, the odds of scoring higher on the test are 1.53 times the odds for individuals who never had an alcohol issue, holding exercise constant.
Not sure I trust my analysis now, although you can see from the graph above that people who had an alcohol problem didn’t have as many bad scores.
I attempted to check whether this analysis is valid (i.e. whether the proportional odds assumption holds), but am unable to do so at the moment: I can’t get the function to work.
sf <- function(y) {
c("Y>=0" = qlogis(mean(y >= 0)),
"Y>=3" = qlogis(mean(y >= 3)),
"Y>=13" = qlogis(mean(y >= 13)),
"Y>=14" = qlogis(mean(y >= 14)),
"Y>=15" = qlogis(mean(y >= 15)),
"Y>=16" = qlogis(mean(y >= 16)),
"Y>=18" = qlogis(mean(y >= 18)),
"Y>=19" = qlogis(mean(y >= 19)),
"Y>=20" = qlogis(mean(y >= 20)),
"Y>=21" = qlogis(mean(y >= 21)),
"Y>=22" = qlogis(mean(y >= 22)),
"Y>=23" = qlogis(mean(y >= 23)),
"Y>=24" = qlogis(mean(y >= 24)),
"Y>=25" = qlogis(mean(y >= 25)),
"Y>=26" = qlogis(mean(y >= 26)),
"Y>=27" = qlogis(mean(y >= 27)),
"Y>=28" = qlogis(mean(y >= 28)),
"Y>=29" = qlogis(mean(y >= 29)),
"Y>=30" = qlogis(mean(y >= 30)))
}
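As an aside, the repetition in sf can be avoided by generating the cut-points programmatically. A minimal equivalent sketch, where cutpoints is just the vector of observed mms levels:
cutpoints <- c(0, 3, 13:30) # the mms scores observed in the data
sf <- function(y) {
  out <- sapply(cutpoints, function(k) qlogis(mean(y >= k)))
  names(out) <- paste0("Y>=", cutpoints)
  out
}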
# debugging: inspecting mms to work out why the summary call below fails
data$mms <- as.numeric(data$mms)
class(data$mms)
str(data$mms)
levels(data$mms)
data$mms[levels(data$mms) == "3"]
qlogis(mean(data$mms >= 0))
qlogis(mean(data$mms >= 24))
data$mms[1:10]
droplevels(data$mms[data$mms >= 29])
mean(data$mms >= 30)
data$mms[1]
(s <- with(data, summary(as.numeric(mms) ~ alcohol + exercise, fun=sf)))
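For what it’s worth, two things look likely to be tripping this up. First, summary() with a formula and a fun= argument is the summary.formula method from the Hmisc package, so Hmisc has to be loaded. Second, as.numeric() on a factor returns the internal level codes (1 to 20 here), not the original scores (0 to 30); converting through as.character() first recovers the real scores. A sketch of what I would try (mms_num is a hypothetical helper column, untested here):
library(Hmisc) # provides the formula method for summary() with fun=
# recover the real 0-30 scores, not the 1-20 level codes
data$mms_num <- as.numeric(as.character(data$mms))
(s <- with(data, summary(mms_num ~ alcohol + exercise, fun = sf)))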
The following analysis is based on this tutorial.
data$mms <- as.ordered(data$mms)
levels(data$mms)
[1] "0" "3" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
[13] "23" "24" "25" "26" "27" "28" "29" "30"
str(data$mms)
Ord.factor w/ 20 levels "0"<"3"<"13"<"14"<..: 17 18 20 20 7 20 20 19 20 18 ...
set.seed(1) # arbitrary seed, just to make the split reproducible
trainingRows <- sample(1:nrow(data), 0.7 * nrow(data))
trainingData <- data[trainingRows, ]
testData <- data[-trainingRows, ]
options(contrasts = c("contr.treatment", "contr.poly"))
library(MASS)
polrMod <- polr(mms ~ exercise + alcohol, data = trainingData, Hess=TRUE)
summary(polrMod)
Call:
polr(formula = mms ~ exercise + alcohol, data = trainingData,
Hess = TRUE)
Coefficients:
Value Std. Error t value
exercise 0.2369 0.04007 5.912
alcohol 0.4379 0.10140 4.319
Intercepts:
Value Std. Error t value
0|3 -7.8515 1.0607 -7.4023
3|13 -7.8467 1.0578 -7.4178
13|14 -7.1521 0.7270 -9.8374
14|15 -7.1521 0.7270 -9.8374
15|16 -6.7473 0.5887 -11.4612
16|17 -6.0532 0.4136 -14.6349
17|18 -5.8987 0.3827 -15.4132
18|19 -5.6466 0.3374 -16.7350
19|20 -5.0081 0.2463 -20.3331
20|21 -4.5802 0.2003 -22.8716
21|22 -4.1968 0.1670 -25.1342
22|23 -3.7501 0.1359 -27.5876
23|24 -3.3525 0.1141 -29.3824
24|25 -2.9686 0.0973 -30.5006
25|26 -2.6167 0.0851 -30.7537
26|27 -2.0879 0.0713 -29.2676
27|28 -1.4563 0.0608 -23.9723
28|29 -0.6021 0.0539 -11.1639
29|30 0.5506 0.0536 10.2808
Residual Deviance: 10073.48
AIC: 10115.48
(731 observations deleted due to missingness)
predictedmss <- predict(polrMod, testData) # predict the classes directly
head(predictedmss)
[1] 30 30 30 30 <NA> <NA>
20 Levels: 0 3 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 ... 30
predictedScores <- predict(polrMod, testData, type="p") # predict the probabilites
head(predictedScores)
0 3 13 14 15
1 0.0003890251 1.864704e-06 0.0003916854 2.495786e-09 0.0003900739
2 0.0003070039 1.471675e-06 0.0003091544 1.970068e-09 0.0003079333
3 0.0003890251 1.864704e-06 0.0003916854 2.495786e-09 0.0003900739
4 0.0002422716 1.161446e-06 0.0002440006 1.554981e-09 0.0002430685
5 NA NA NA NA NA
6 NA NA NA NA NA
16 17 18 19 20
1 0.0011721299 0.0003908061 0.0007813491 0.003122596 0.003509599
2 0.0009256112 0.0003087147 0.0006173744 0.002469319 0.002779247
3 0.0011721299 0.0003908061 0.0007813491 0.003122596 0.003509599
4 0.0007308253 0.0002438122 0.0004876762 0.001951835 0.002199249
5 NA NA NA NA NA
6 NA NA NA NA NA
21 22 23 24 25
1 0.004671060 0.008154348 0.010838537 0.015051930 0.01920634
2 0.003705410 0.006486178 0.008656032 0.012087317 0.01553654
3 0.004671060 0.008154348 0.010838537 0.015051930 0.01920634
4 0.002936151 0.005150710 0.006895881 0.009671911 0.01250514
5 NA NA NA NA NA
6 NA NA NA NA NA
26 27 28 29 30
1 0.04220338 0.07875566 0.1648381 0.2804163 0.3657152
2 0.03459217 0.06626753 0.1464004 0.2760497 0.4221929
3 0.04220338 0.07875566 0.1648381 0.2804163 0.3657152
4 0.02814275 0.05509817 0.1275554 0.2649186 0.4807814
5 NA NA NA NA NA
6 NA NA NA NA NA
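The class predictions above are just the most probable column of this probability matrix, which we can reproduce by hand; for every complete case the mass peaks at 30:
# predict() with the default type picks the most probable category per
# row; this reproduces predictedmss for the complete cases
ok <- complete.cases(predictedScores)
head(colnames(predictedScores)[apply(predictedScores[ok, ], 1, which.max)])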
## Confusion matrix and misclassification error
table(testData$mms, predictedmss) # confusion matrix
predictedmss
0 3 13 14 15 16 17 18 19 20 21 22 23 24 25
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
predictedmss
26 27 28 29 30
0 0 0 0 0 1
3 0 0 0 0 0
13 0 0 0 0 0
14 0 0 0 0 0
15 0 0 0 0 1
16 0 0 0 0 1
17 0 0 0 0 1
18 0 0 0 0 0
19 0 0 0 0 1
20 0 0 0 0 3
21 0 0 0 0 4
22 0 0 0 0 9
23 0 0 0 0 11
24 0 0 0 0 22
25 0 0 0 0 24
26 0 0 0 0 54
27 0 0 0 0 114
28 0 0 0 0 199
29 0 0 0 0 368
30 0 0 0 0 584
mean(na.omit(as.character(testData$mms) != as.character(predictedmss))) # misclassification error
[1] 0.5819613
This is quite a poor classification rate, and the predicted classes are all 30 (or NA where a predictor is missing). That is actually expected here: the predicted class is simply the most probable category, and with an outcome this skewed the probability mass peaks at 30 for every combination of exercise and alcohol, so the model never predicts anything else.
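To put the 58% error in context, compare it with the naive strategy of always predicting the most common score; a minimal sketch:
# misclassification error of always guessing the modal score, "30"
mean(na.omit(as.character(testData$mms) != "30"))
Since the model predicts 30 for every complete case, its error rate is essentially this baseline: exercise and alcohol shift the odds, but not by enough to move the most probable category away from 30.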