file <- "assignment3.sas7bdat"
data <- read_sas(file)
summary(data)
## IMALE IEDUC PAQAGE IMARRIED
## Min. :0.0000 Min. : 1.0 Min. :18.0 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:12.0 1st Qu.:42.0 1st Qu.:0.000
## Median :0.0000 Median :13.0 Median :57.0 Median :1.000
## Mean :0.3827 Mean :13.3 Mean :54.8 Mean :0.592
## 3rd Qu.:1.0000 3rd Qu.:16.0 3rd Qu.:67.0 3rd Qu.:1.000
## Max. :1.0000 Max. :18.0 Max. :98.0 Max. :1.000
## NA's :78 NA's :126 NA's :51 NA's :71
## IADJINCM NCLINDEP CHF MI
## Min. : 625 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.: 12374 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median : 17678 Median :0.0000 Median :0.00000 Median :0.00000
## Mean : 22922 Mean :0.1401 Mean :0.06577 Mean :0.03258
## 3rd Qu.: 31820 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :100000 Max. :1.0000 Max. :1.00000 Max. :1.00000
## NA's :405 NA's :37
## IWHITE sevhyp PSQG PSQT
## Min. :0.0000 Min. :0.00000 Min. : 0.00 Min. : 0.00
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.: 50.00 1st Qu.: 57.50
## Median :1.0000 Median :0.00000 Median : 66.67 Median : 67.50
## Mean :0.7916 Mean :0.03349 Mean : 63.92 Mean : 67.06
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.: 79.17 3rd Qu.: 77.50
## Max. :1.0000 Max. :1.00000 Max. :100.00 Max. :100.00
## NA's :117 NA's :59 NA's :41 NA's :42
## PSQI PSQF PSQTI PSQA
## Min. : 3.571 Min. : 0.00 Min. : 0.00 Min. : 6.25
## 1st Qu.: 64.286 1st Qu.: 53.12 1st Qu.: 50.00 1st Qu.: 62.50
## Median : 75.000 Median : 68.75 Median : 75.00 Median : 72.92
## Mean : 72.006 Mean : 65.39 Mean : 65.02 Mean : 71.08
## 3rd Qu.: 82.143 3rd Qu.: 81.25 3rd Qu.: 75.00 3rd Qu.: 81.25
## Max. :100.000 Max. :100.00 Max. :100.00 Max. :100.00
## NA's :100 NA's :96 NA's :71 NA's :55
## severity
## Min. :1.000
## 1st Qu.:2.000
## Median :3.000
## Mean :2.839
## 3rd Qu.:3.000
## Max. :4.000
## NA's :142
data$IMALE <- as.factor(data$IMALE)
data$IMARRIED <- as.factor(data$IMARRIED)
data$NCLINDEP <- as.factor(data$NCLINDEP)
data$CHF <- as.factor(data$CHF)
data$MI <- as.factor(data$MI)
data$IWHITE <- as.factor(data$IWHITE)
data$sevhyp <- as.factor(data$sevhyp)
#data$severity <- as.factor(data$severity)
#data$severity <- ordered(data$severity,levels=c("1","2","3","4"))
Two plots are generated to understand the distribution of missing values, including their combination. The first plot is scaled so that cell height is proportional to N, while the second is not. Combined, the plots illustrate that most cases are complete. Of cases with missing values, 73% are missing income alone. Very few cases have combinations of missing values.
Summary tables from the previous section indicate a low incidence of missing values amongst variables of interest:
41 cases are missing the response variable (PSQG) (1%). In this revision, cases missing the response variable are deleted.
0 are missing pain severity (4%)
37 37 are missing clinical depression (1%)
Overall, 73.91% of cases are complete.
aggr(data, prop = T, col=("blue"),numbers = T,bars=TRUE,only.miss=TRUE,combined=TRUE,sortCombs=TRUE,varheight=TRUE)
aggr(data, prop = T, col=("blue"),numbers = T,bars=TRUE,only.miss=TRUE,combined=TRUE,sortCombs=TRUE,sortVars=TRUE,labels=TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## IADJINCM 392
## severity 129
## IEDUC 123
## IWHITE 111
## IMALE 77
## IMARRIED 69
## PSQI 64
## PSQF 62
## sevhyp 56
## PAQAGE 51
## NCLINDEP 37
## PSQTI 33
## PSQA 16
## PSQT 3
## CHF 0
## MI 0
## PSQG 0
Three plots are generated to look for signs that missingness is conditional on the value of key variables. Quality scores appear independent of the missingness of income, pain severity, and depression, and vice-versa.
Given that income has the highest incidence of missing data, I think it is important to assess for non-random missingness (even if will only be included as potential confound). Marginplot for income by education conditoned on missing value status suggest a non-random association, specifically non-response bias for individiuals of low socioeconomic status.
# Key response and predictor variables
marginplot(data[c(11, 6)], col = c("lightblue", "red", "orange"))
marginplot(data[c(6, 17)], col = c("lightblue", "red", "orange"))
marginplot(data[c(11, 17)], col = c("lightblue", "red", "orange"))
# Income (highest incidence of missing data)
marginplot(data[c(11, 5)], col = c("lightblue", "red", "orange"))
prop.table(table(data$NCLINDEP,is.na(data$severity)),2)
##
## FALSE TRUE
## 0 0.8593699 0.8818898
## 1 0.1406301 0.1181102
marginplot(data[c(2, 5)], col = c("lightblue", "red", "orange"),numbers=T)
t.test(IEDUC~is.na(IADJINCM),data=data)
##
## Welch Two Sample t-test
##
## data: IEDUC by is.na(IADJINCM)
## t = 7.4599, df = 371.93, p-value = 6.17e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.915084 1.570172
## sample estimates:
## mean in group FALSE mean in group TRUE
## 13.44602 12.20339
Before imputation, will examine the distribution of variables to see whether it might be reasonable to test some variable transformations during the modeling.
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 873 rows containing non-finite values (stat_bin).
The distributions for income, and to a lesser extent education, are skewed. During modeling, I will include a log transformed income variable as well as a square root for education
data$logIADJINCM <- log(data$IADJINCM)
data$sqrIEDUC <- (data$IEDUC^2)
ggplot(data=data,aes(data$logIADJINCM))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 392 rows containing non-finite values (stat_bin).
ggplot(data=data,aes(data$sqrIEDUC))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 123 rows containing non-finite values (stat_bin).
One largely discredited technqiue for managing missing data is to impute a measure of central tendency (e.g. median, or mode for categorical data). A second approach, complete case analysis, will also be used later.
# Mode function from https://www.tutorialspoint.com/r/r_mean_median_mode.htm
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Impute function, I created
impute.mean <- function(x) {
if (class(x) == "numeric") {
x[is.na(x)] <- median(x,na.rm=TRUE) # Use median if numeric
return(as.numeric(x))
}
else {
x[is.na(x)] <- getmode(x) # Otherwise use mode
return(as.factor(x))
}
}
# I thought I'd be able to use apply but it coereces all data to the same data type, not sure if there is an equivalent function that doesn't do that
data.mean <- data
data.mean$IMALE <- impute.mean(data$IMALE)
data.mean$IEDUC <- impute.mean(data$IEDUC)
data.mean$PAQAGE <- impute.mean(data$PAQAGE)
data.mean$IMARRIED <- impute.mean(data$IMARRIED)
data.mean$IADJINCM <- impute.mean(data$IADJINCM)
data.mean$NCLINDEP <- impute.mean(data$NCLINDEP)
data.mean$CHF <- impute.mean(data$CHF)
data.mean$MI <- impute.mean(data$MI)
data.mean$IWHITE <- impute.mean(data$IWHITE)
data.mean$sevhyp <- impute.mean(data$sevhyp)
data.mean$PSQG <- impute.mean(data$PSQG)
data.mean$PSQT <- impute.mean(data$PSQT)
data.mean$PSQI <- impute.mean(data$PSQI)
data.mean$PSQF <- impute.mean(data$PSQF)
data.mean$PSQTI <- impute.mean(data$PSQTI)
data.mean$PSQA <- impute.mean(data$PSQA)
data.mean$severity <- impute.mean(data$severity)
data.mean$logIADJINCM <- impute.mean(data$IADJINCM)
data.mean$sqrIEDUC <- impute.mean(data$sqrIEDUC)
# Sanity check
summary(data)
## IMALE IEDUC PAQAGE IMARRIED IADJINCM
## 0 :1952 Min. : 1.00 Min. :18.00 0 :1293 Min. : 625
## 1 :1214 1st Qu.:12.00 1st Qu.:42.00 1 :1881 1st Qu.: 12374
## NA's: 77 Median :13.00 Median :57.00 NA's: 69 Median : 17678
## Mean :13.33 Mean :54.71 Mean : 23015
## 3rd Qu.:16.00 3rd Qu.:67.00 3rd Qu.: 31820
## Max. :18.00 Max. :98.00 Max. :100000
## NA's :123 NA's :51 NA's :392
## NCLINDEP CHF MI IWHITE sevhyp PSQG
## 0 :2758 0:3032 0:3137 0 : 645 0 :3082 Min. : 0.00
## 1 : 448 1: 211 1: 106 1 :2487 1 : 105 1st Qu.: 50.00
## NA's: 37 NA's: 111 NA's: 56 Median : 66.67
## Mean : 63.92
## 3rd Qu.: 79.17
## Max. :100.00
##
## PSQT PSQI PSQF PSQTI
## Min. : 0.00 Min. : 3.571 Min. : 0.00 Min. : 0.00
## 1st Qu.: 57.50 1st Qu.: 64.286 1st Qu.: 53.12 1st Qu.: 50.00
## Median : 67.50 Median : 75.000 Median : 68.75 Median : 75.00
## Mean : 67.06 Mean : 71.986 Mean : 65.39 Mean : 65.01
## 3rd Qu.: 77.50 3rd Qu.: 82.143 3rd Qu.: 81.25 3rd Qu.: 75.00
## Max. :100.00 Max. :100.000 Max. :100.00 Max. :100.00
## NA's :3 NA's :64 NA's :62 NA's :33
## PSQA severity logIADJINCM sqrIEDUC
## Min. : 6.25 Min. :1.000 Min. : 6.438 Min. : 1.0
## 1st Qu.: 62.50 1st Qu.:2.000 1st Qu.: 9.423 1st Qu.:144.0
## Median : 72.92 Median :3.000 Median : 9.780 Median :169.0
## Mean : 71.09 Mean :2.838 Mean : 9.759 Mean :186.4
## 3rd Qu.: 81.25 3rd Qu.:3.000 3rd Qu.:10.368 3rd Qu.:256.0
## Max. :100.00 Max. :4.000 Max. :11.513 Max. :324.0
## NA's :16 NA's :129 NA's :392 NA's :123
summary(data.mean)
## IMALE IEDUC PAQAGE IMARRIED IADJINCM
## 0:2029 Min. : 1.00 Min. :18.00 0:1293 Min. : 625
## 1:1214 1st Qu.:12.00 1st Qu.:43.00 1:1950 1st Qu.: 12500
## Median :13.00 Median :57.00 Median : 17678
## Mean :13.32 Mean :54.74 Mean : 22370
## 3rd Qu.:16.00 3rd Qu.:66.50 3rd Qu.: 25981
## Max. :18.00 Max. :98.00 Max. :100000
## NCLINDEP CHF MI IWHITE sevhyp PSQG
## 0:2795 0:3032 0:3137 0: 645 0:3138 Min. : 0.00
## 1: 448 1: 211 1: 106 1:2598 1: 105 1st Qu.: 50.00
## Median : 66.67
## Mean : 63.92
## 3rd Qu.: 79.17
## Max. :100.00
## PSQT PSQI PSQF PSQTI
## Min. : 0.00 Min. : 3.571 Min. : 0.00 Min. : 0.00
## 1st Qu.: 57.50 1st Qu.: 64.286 1st Qu.: 53.12 1st Qu.: 50.00
## Median : 67.50 Median : 75.000 Median : 68.75 Median : 75.00
## Mean : 67.06 Mean : 72.045 Mean : 65.46 Mean : 65.11
## 3rd Qu.: 77.50 3rd Qu.: 82.143 3rd Qu.: 81.25 3rd Qu.: 75.00
## Max. :100.00 Max. :100.000 Max. :100.00 Max. :100.00
## PSQA severity logIADJINCM sqrIEDUC
## Min. : 6.25 Min. :1.000 Min. : 625 Min. : 1.0
## 1st Qu.: 62.50 1st Qu.:2.000 1st Qu.: 12500 1st Qu.:144.0
## Median : 72.92 Median :3.000 Median : 17678 Median :169.0
## Mean : 71.09 Mean :2.845 Mean : 22370 Mean :185.7
## 3rd Qu.: 81.25 3rd Qu.:3.000 3rd Qu.: 25981 3rd Qu.:256.0
## Max. :100.00 Max. :4.000 Max. :100000 Max. :324.0
table(complete.cases(data.mean))
##
## TRUE
## 3243
From Bair et al., the goal of the study is to determine the effect of depression and pain severity on patient satisfaction. I will model effects on general satisfaction (PSQG). The dependent variables include depression and pain, with other variables included as potential confounds.
As one approach to variable selection I tried using the regsubsets function of the leaps package. This function calculates model BIC for all combinations of included variables, displayed as a heatmap. With this approach, we can see that optimal model BIC is achieved with education, age, depression and severity.
This procedure could be repeated for each of the datasets (complete, imputed mean, and each frame of multiple imputation). However, given the low incidence of missing data in this dataset, I do not think this is worthwhile.
data.mean <- data.mean[-c(12:16)]
data <- data[-c(12:16)]
complete.leaps <- regsubsets(PSQG~.,data=data,nbest=10)
complete.leaps$nn # number of cases used, see that n << nrow(data)
## [1] 2464
mean.leaps <- regsubsets(PSQG~.,data=data.mean,nbest=10)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 1 linear dependencies found
## Reordering variables and trying again:
mean.leaps$nn # number of cases used, now n = nrow(data)
## [1] 3243
plot(complete.leaps, scale="bic")
plot(mean.leaps,scale="bic")
mean.m1 <- lm(PSQG~severity,data=data.mean)
mean.m2 <- lm(PSQG~severity+NCLINDEP,data=data.mean)
mean.m3 <- lm(PSQG~severity+NCLINDEP+PAQAGE,data=data.mean)
mean.m4 <- lm(PSQG~severity+NCLINDEP+PAQAGE+sqrIEDUC,data=data.mean)
multiplot(mean.m1,mean.m2,mean.m3,mean.m4,intercept=FALSE)
summary(mean.m4)
##
## Call:
## lm(formula = PSQG ~ severity + NCLINDEP + PAQAGE + sqrIEDUC,
## data = data.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.899 -12.039 1.201 13.058 51.901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.808385 2.108512 24.097 < 2e-16 ***
## severity 3.296399 0.394969 8.346 < 2e-16 ***
## NCLINDEP1 -6.778905 1.012516 -6.695 2.53e-11 ***
## PAQAGE 0.185665 0.023161 8.016 1.51e-15 ***
## sqrIEDUC -0.029563 0.004631 -6.384 1.97e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.83 on 3238 degrees of freedom
## Multiple R-squared: 0.08519, Adjusted R-squared: 0.08406
## F-statistic: 75.38 on 4 and 3238 DF, p-value: < 2.2e-16
plot(mean.m4)
complete <- data[complete.cases(data),]
complete.m4 <- lm(PSQG~severity+NCLINDEP+PAQAGE+sqrIEDUC,data=complete)
summary(complete.m4)
##
## Call:
## lm(formula = PSQG ~ severity + NCLINDEP + PAQAGE + sqrIEDUC,
## data = complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.452 -12.087 1.116 12.985 51.804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.837253 2.400351 21.596 < 2e-16 ***
## severity 3.419606 0.455716 7.504 8.61e-14 ***
## NCLINDEP1 -6.695808 1.120323 -5.977 2.61e-09 ***
## PAQAGE 0.167743 0.026417 6.350 2.56e-10 ***
## sqrIEDUC -0.031355 0.005197 -6.034 1.84e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.84 on 2459 degrees of freedom
## Multiple R-squared: 0.08515, Adjusted R-squared: 0.08366
## F-statistic: 57.22 on 4 and 2459 DF, p-value: < 2.2e-16
multiplot(complete.m4,mean.m4,intercept=FALSE)
We have been asked to use multiple imputation as an alternative to complete case analysis and other simpler techniques. For the imputation, I will use the mice package. I have chosen predictive mean matching ‘pmm’ as the imputation method. Alternatives based on linear regression are probably not appropriate, as many of the included variables for modeling are discrete variables.
As a check of the imputed values, I have compared observed vs. imputed using the density plot functions. These plots show substantial overlap in the distributions. As expected, under the pmm method, peaks and valleys in distributions are preserved (e.g. severity).
data.mi <- mice(data,m=100,method="pmm",seed=1,print=FALSE)
densityplot(data.mi)
The default method for mice is “pmm” (predictive mean matching), which can be used for any data type (continuous, binary, ordered categorical and non-ordered categorical. This dataset contains continuous, binary and ordered categorical data (one could argue education as ordered categorical, but I dont think that’s useful here). An alternative approach will be to use logistic regression to impute binary variables and linear regression to impute continuous variables.
Plots of imputed vs. observed variables show fair overlap, but peaks and valleys are no longer preserved.
method <- data.mi$method # Export the variable-wise list of methods [easy to work with this list to specify method for each variable]. A few variables will have an empty string "", these are variables with no missing data.
method["IMALE"] <- "logreg.boot" # Logistic regression with bootstrap (logistric regression alone would just reproduce the same value)
method["IEDUC"] <- "norm" # Predicted values from linear regression
method["PAQAGE"] <- "norm"
method["IMARRIED"] <- "logreg.boot"
method["IADJINCM"] <-"norm"
method["NCLINDEP"] <- "logreg.boot"
method["IWHITE"] <- "logreg.boot"
method["sevhyp"] <- "logreg.boot"
method["severity"] <- "norm" #
method["logIADJINCM"] <- "norm"
method["sqrIEDUC"] <- "norm"
data.mir <- mice(data,m=100,method=method,seed=1,print=FALSE)
densityplot(data.mir)
mi.m4 <- with(data.mi,lm(PSQG~severity+NCLINDEP+PAQAGE+sqrIEDUC))
mir.m4 <- with(data.mir,lm(PSQG~severity+NCLINDEP+PAQAGE+sqrIEDUC))
pool.m4 <- pool(mi.m4)
pool.r.squared(mi.m4)
## est lo 95 hi 95 fmi
## R^2 0.08647058 0.06861455 0.1059654 NaN
pool.m4r <- pool(mir.m4)
pool.r.squared(mir.m4)
## est lo 95 hi 95 fmi
## R^2 0.08654106 0.068696 0.106021 NaN
summary(pool.m4)
## estimate std.error statistic df p.value
## (Intercept) 50.45725749 2.127652702 23.714988 2985.775 0.000000e+00
## severity 3.29207221 0.396960932 8.293189 2897.911 0.000000e+00
## NCLINDEP1 -6.73217810 1.019196340 -6.605379 3126.409 4.642242e-11
## PAQAGE 0.18848177 0.023311604 8.085320 3108.480 8.881784e-16
## sqrIEDUC -0.02825746 0.004641059 -6.088582 2972.425 1.277804e-09
summary(pool.m4r)
## estimate std.error statistic df p.value
## (Intercept) 50.55244212 2.121009210 23.834146 3035.762 0.000000e+00
## severity 3.27158533 0.396182486 8.257774 2945.325 2.220446e-16
## NCLINDEP1 -6.73877829 1.021378135 -6.597731 3093.538 4.881473e-11
## PAQAGE 0.18849941 0.023252510 8.106626 3139.272 8.881784e-16
## sqrIEDUC -0.02842697 0.004637299 -6.130071 2968.803 9.878707e-10
summary(mean.m4)
##
## Call:
## lm(formula = PSQG ~ severity + NCLINDEP + PAQAGE + sqrIEDUC,
## data = data.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.899 -12.039 1.201 13.058 51.901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.808385 2.108512 24.097 < 2e-16 ***
## severity 3.296399 0.394969 8.346 < 2e-16 ***
## NCLINDEP1 -6.778905 1.012516 -6.695 2.53e-11 ***
## PAQAGE 0.185665 0.023161 8.016 1.51e-15 ***
## sqrIEDUC -0.029563 0.004631 -6.384 1.97e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.83 on 3238 degrees of freedom
## Multiple R-squared: 0.08519, Adjusted R-squared: 0.08406
## F-statistic: 75.38 on 4 and 3238 DF, p-value: < 2.2e-16
summary(complete.m4)
##
## Call:
## lm(formula = PSQG ~ severity + NCLINDEP + PAQAGE + sqrIEDUC,
## data = complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.452 -12.087 1.116 12.985 51.804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.837253 2.400351 21.596 < 2e-16 ***
## severity 3.419606 0.455716 7.504 8.61e-14 ***
## NCLINDEP1 -6.695808 1.120323 -5.977 2.61e-09 ***
## PAQAGE 0.167743 0.026417 6.350 2.56e-10 ***
## sqrIEDUC -0.031355 0.005197 -6.034 1.84e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.84 on 2459 degrees of freedom
## Multiple R-squared: 0.08515, Adjusted R-squared: 0.08366
## F-statistic: 57.22 on 4 and 2459 DF, p-value: < 2.2e-16