#data <- haven::read_sas("infection_control_survey.sas7bdat", "formats.sas7bcat") # Read in SAS data with formats
#dim(data) # Sanity check, confirm dimensions
#formats <- readxl::read_xlsx("Assignment 2 variables.xlsx", sheet = 1) # Read in Excel file containing question numbers and question labels (short and long)
#colnames(data) <- formats$Name # Assign short column labels to data
#invert.which <- which(formats$Valence == "-") # Identify columns to be inverted
#data[invert.which] <- apply(data[invert.which], 2, function(x) x * -1 + 6) # Invert columns by multiplying by -1, then adding 6 to rescale to 1-5
#data$hospAcademic <- as.factor(data$hospAcademic) # Hospital characteristics as factors
#data$hospType <- as.factor(data$hospType)
#save(data, file = "data.Rda") # Save prepared data for reuse
#rm(data)
load("data.Rda") ### Can start here
Histogram analysis of each of the survey responses suggests:
- No obvious data quality problems (distributions look reasonable; this could also be checked with five-number summaries).
- It might be preferable to use the Spearman correlation, since the underlying type of all survey variables is ordinal (not interval).
library(ggplot2) # ggplot2 is assumed loaded for this and the plotting chunks below
ggplot(tidyr::gather(data[1:24]), aes(value)) +
geom_histogram(bins = 5) +
facet_wrap(~key, scales = "fixed")
The heat map illustrates a high degree of pairwise correlation among variables belonging to two concepts: workplace culture (including administration support) and information technology. Spearman correlation was used (see note above).
data.cor <- cor(data[1:24],method="spearman") # Correlation matrix
data.cor.melt <- reshape2::melt(data.cor) # Melt matrix (needed for ggplot)
ggplot(data = data.cor.melt, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "white", high = "blue", mid = "red",
midpoint = 0.5, limit = c(0,1), space = "Lab",
name="Spearman\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
The Bartlett test is highly significant (p ≈ 0), indicating that it is very unlikely that the observed correlation matrix reflects an identity matrix. At KMO = 0.86, the result is comfortably above the 0.5 threshold recommended in Williams, Onsman, and Brown (2010).
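The Bartlett and KMO results are reported without an accompanying chunk; a minimal sketch of how they might be computed with the psych package (an assumption, since the package actually used is not shown), with the values reported above noted only as expectations:
psych::cortest.bartlett(cor(data[1:24]), n = nrow(data)) # Bartlett's test of sphericity; p-value reported as ~0 above
psych::KMO(data[1:24]) # Kaiser-Meyer-Olkin measure of sampling adequacy; overall MSA reported as 0.86 above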
Because the purpose of this analysis is to prepare an analytic dataset for regression, I think that it is preferable to use PCA.
I will use a combination of the eigenvalue > 1 rule, the scree plot, and parallel analysis to choose the number of components.
Using the paran package for parallel analysis (https://www.rdocumentation.org/packages/paran/versions/1.5.2/topics/paran), it is possible to generate a scree plot containing the unadjusted (raw) eigenvalues as well as eigenvalues generated from random datasets of equivalent dimension. Components whose observed eigenvalue exceeds the random eigenvalue at the same component index can be kept (a non-math-heavy reference: https://stats.idre.ucla.edu/stata/faq/how-to-do-parallel-analysis-for-pca-or-factor-analysis-in-stata/).
Because the purpose of this PCA is data reduction for regression, and the gain in variance explained from components 4-5 is small (~5% each), I think it is reasonable to limit the PCA to 3 components (which is what the parallel analysis suggests).
A sensitivity analysis for this decision might include a subjective assessment of the interpretability of the component loadings. Given the purpose of this analysis, it would not make sense to keep additional components if they were difficult to interpret.
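The variance figures above and the loading tables below reference data.pca, which is not defined in the chunks shown; a minimal sketch, assuming the full PCA was run with prcomp on the standardized survey items (mirroring the prcomp call used later for the trimmed data):
data.pca <- prcomp(data[1:24], scale = TRUE) # PCA of the 24 survey items on the correlation (standardized) scale
summary(data.pca) # Proportion of variance explained by each component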
paran::paran(data[c(1:24)], iterations = 5000, centile =50, quietly = TRUE,
status = FALSE, all = FALSE, cfa = FALSE, graph = TRUE, color = TRUE,
col = c("black", "red", "blue"), lty = c(1, 2, 3), lwd = 1, legend = TRUE,
file = "", width = 640, height = 640, grdevice = "png", seed = 0)
##
## Using eigendecomposition of correlation matrix.
The table and heatmap of component loadings suggest that PC1 loads on cultural factors, PC2 on technology factors, and PC3 on provider/clinician involvement.
Component loading is relatively non-specific for PC1: the maximum loading is 0.29, and 5 variables load above 0.25. Culture-related questions dominate, but I suspect this component is also capturing a non-specific tendency to agree or disagree across survey questions. Directionality is reversed for PC1 and PC2 (higher agreement with a question = more negative principal component score).
Component loadings are straightforward for the second and third components: PC2 loads on the five technology-related questions from section 18 (loadings 0.29-0.47), whereas PC3 loads on the three policy dissemination items from section 3 (0.40-0.43) as well as a question about reporting of trends in drug use (0.32).
Loadings are less robust beyond PC3: PC4 loads heavily on a single item, restricted formulary (0.61), whereas PC5 loads on a mix of culture variables already represented in PC1. This supports the choice to truncate the analysis at 3 components.
data.rot.melt <- reshape2::melt(data.pca$rotation[,1:5]) # Melt matrix (needed for ggplot), factors 1-5
ggplot(data = data.rot.melt, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "red", high = "blue", mid = "white",
midpoint = 0.0, limit = c(-0.6,0.6), name="Factor loadings") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
abs(round(data.pca$rotation[,1:5],2))
## PC1 PC2 PC3 PC4 PC5
## adminSupport 0.23 0.08 0.14 0.11 0.07
## providerDiscuss 0.18 0.06 0.40 0.11 0.02
## providerCopy 0.17 0.05 0.43 0.28 0.03
## formsCreated 0.16 0.01 0.42 0.29 0.15
## handwashFeedback 0.19 0.19 0.24 0.10 0.07
## formsAntimicrobial 0.10 0.05 0.21 0.21 0.40
## drugRestricted 0.11 0.09 0.17 0.61 0.12
## drugUseReported 0.15 0.06 0.32 0.35 0.19
## adminParticipate 0.21 0.11 0.04 0.19 0.29
## adminCoordinate 0.28 0.08 0.07 0.09 0.25
## qualityProcess 0.24 0.06 0.08 0.13 0.25
## commMDtoAdmin 0.27 0.04 0.17 0.07 0.22
## commMDtoRN 0.25 0.05 0.15 0.27 0.07
## cultureFlex 0.25 0.15 0.08 0.04 0.36
## cultureParticipative 0.29 0.15 0.11 0.01 0.28
## cultureData 0.27 0.00 0.11 0.03 0.24
## cultureIsBarrier 0.15 0.16 0.26 0.02 0.35
## qualityMDParticipate 0.25 0.10 0.00 0.13 0.22
## qualityClinInput 0.27 0.13 0.11 0.01 0.03
## techResources 0.17 0.29 0.03 0.17 0.17
## techDecSupport 0.18 0.41 0.04 0.04 0.15
## techPatientData 0.06 0.47 0.14 0.07 0.10
## techAutomated 0.09 0.46 0.16 0.16 0.05
## techCommunicate 0.09 0.36 0.10 0.23 0.05
To trim the dataset, a critical value is set for each of the three components. I set these values according to the factor loadings in the previous section. Variables are kept if they exceed the critical loading on at least one component. This yields an analytic dataset with 16 variables (reduced from 24).
At this point, I also create a new dataset that keeps only those variables that were excluded from the trimmed dataset. This is intended to simplify future development of the linear regression models (I do not want any of the variables reduced by PCA to appear in the model).
keep.1 <- which(abs(data.pca$rotation[,1]) > 0.25) # Variables loading on PC1 (culture)
keep.2 <- which(abs(data.pca$rotation[,2]) > 0.28) # Variables loading on PC2 (technology)
keep.3 <- which(abs(data.pca$rotation[,3]) > 0.30) # Variables loading on PC3 (provider involvement)
keep <- c(keep.1, keep.2, keep.3) # Column indices of the retained survey variables
data.trim <- data[,keep] # Trimmed dataset used to re-run the PCA
data.mod <- data[,-keep] # Everything else: remaining survey items, hospital characteristics, and outcomes
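As a quick check (not in the original chunks) that the trimming behaves as described, assuming no variable crosses more than one threshold:
length(unique(keep)) # 16 survey variables retained
ncol(data.trim) # 16
setdiff(names(data)[1:24], names(data.trim)) # The 8 survey items left out of the PCA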
The scree plot and parallel analysis are essentially unchanged; 3 components are still supported.
paran::paran(data.trim[c(1:16)], iterations = 5000, centile =50, quietly = TRUE,
status = FALSE, all = FALSE, cfa = FALSE, graph = TRUE, color = TRUE,
col = c("black", "red", "blue"), lty = c(1, 2, 3), lwd = 1, legend = TRUE,
file = "", width = 640, height = 640, grdevice = "png", seed = 0)
##
## Using eigendecomposition of correlation matrix.
The distribution of loadings across variables remains qualitatively unchanged. Factor loadings have increased, which I think is an expected effect of removing uncorrelated variables. Although PC1 still appears to be a mix of cultural factors and non-specific response bias, I think this is a reasonable place to stop. The last step in this analysis is to add the calculated component scores back to the main data file for use in regression. I do not think that there are any clearly cross-loaded variables; if there were, a sensitivity analysis could be done with and without the cross-loaded variables.
data.trim.pca <- prcomp(data.trim,scale=T)
data.trim.rot.melt <- reshape2::melt(data.trim.pca$rotation[,1:5]) # Melt matrix (needed for ggplot), factors 1-5
ggplot(data = data.trim.rot.melt, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "red", high = "blue", mid = "white",
midpoint = 0.0, limit = c(-1,1), name="Factor loadings") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
abs(round(data.trim.pca$rotation[,1:5],2))
## PC1 PC2 PC3 PC4 PC5
## adminCoordinate 0.32 0.08 0.14 0.12 0.28
## commMDtoAdmin 0.31 0.01 0.23 0.04 0.22
## cultureFlex 0.30 0.14 0.19 0.01 0.35
## cultureParticipative 0.34 0.14 0.24 0.09 0.26
## cultureData 0.33 0.00 0.18 0.03 0.24
## qualityMDParticipate 0.29 0.12 0.07 0.04 0.40
## qualityClinInput 0.31 0.11 0.20 0.08 0.07
## techResources 0.22 0.28 0.07 0.32 0.30
## techDecSupport 0.22 0.38 0.15 0.26 0.26
## techPatientData 0.09 0.50 0.03 0.07 0.08
## techAutomated 0.12 0.49 0.03 0.18 0.03
## techCommunicate 0.12 0.37 0.01 0.32 0.36
## providerDiscuss 0.23 0.18 0.38 0.13 0.12
## providerCopy 0.22 0.16 0.50 0.29 0.03
## formsCreated 0.20 0.11 0.52 0.31 0.18
## drugUseReported 0.18 0.05 0.23 0.68 0.35
culture <- predict(data.trim.pca)[,1] # PC1 scores: workplace culture
tech <- predict(data.trim.pca)[,2] # PC2 scores: information technology
provider <- predict(data.trim.pca)[,3] # PC3 scores: provider/clinician involvement
data.mod <- cbind(data.mod, culture, tech, provider) # Append component scores to the modeling dataset
Based on the histograms, I will choose the educational variable (goldEducational) for modeling, as it is closest to a normal distribution.
ggplot(tidyr::gather(data[26:30]), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = "fixed")
To build the model, I worked with the analytic dataset containing hospital characteristics, component scores, and the survey variables that were not used to generate the components. I chose to start with the component scores because I thought they reflected more robust concepts than individual survey questions. Individual survey questions were tested second, followed by hospital characteristics, which were assessed as potential confounders.
Through stepwise model building, assessed by AIC and by changes to effect size and precision (for confounders), I landed on a model containing the culture and provider components and the survey question addressing the restricted formulary; administrator participation was also retained because it improved AIC, although its coefficient is only marginally significant. Other variables failed to improve model fit, and effect sizes for the included variables did not significantly change with the addition of potential confounders.
library(coefplot)
# Stage 1: component scores
mod.culture <- lm(goldEducational~culture, data=data.mod)
mod.clinician <- lm(goldEducational~culture+provider, data=data.mod)
mod.tech <- lm(goldEducational~culture+provider+tech, data=data.mod)
# Stage 2: individual survey questions not used in the PCA
mod.restricted <- lm(goldEducational~culture+provider+drugRestricted, data=data.mod)
mod.handwash <- lm(goldEducational~culture+provider+drugRestricted+handwashFeedback, data=data.mod)
mod.forms <- lm(goldEducational~culture+provider+drugRestricted+formsAntimicrobial, data=data.mod)
mod.process <- lm(goldEducational~culture+provider+drugRestricted+qualityProcess, data=data.mod)
mod.comm <- lm(goldEducational~culture+provider+drugRestricted+commMDtoRN, data=data.mod)
mod.barrier <- lm(goldEducational~culture+provider+drugRestricted+cultureIsBarrier, data=data.mod)
mod.admin <- lm(goldEducational~culture+provider+drugRestricted+adminParticipate, data=data.mod)
# Stage 3: hospital characteristics as potential confounders
mod.size <- lm(goldEducational~culture+provider+drugRestricted+adminParticipate+hospBed, data=data.mod)
mod.type <- lm(goldEducational~culture+provider+drugRestricted+adminParticipate+hospType, data=data.mod)
mod.academic <- lm(goldEducational~culture+provider+drugRestricted+adminParticipate+hospAcademic, data=data.mod)
# Compare coefficients across models
multiplot(mod.culture,mod.clinician,mod.tech,mod.restricted,mod.handwash,mod.forms,mod.process,mod.comm,mod.barrier,mod.admin,mod.size,mod.type,mod.academic)
AIC(mod.culture,mod.clinician,mod.tech,mod.restricted,mod.handwash,mod.forms,mod.process,mod.comm,mod.barrier,mod.admin,mod.size,mod.type,mod.academic)
## df AIC
## mod.culture 3 540.1591
## mod.clinician 4 524.9405
## mod.tech 5 525.5629
## mod.restricted 5 498.6683
## mod.handwash 6 500.6181
## mod.forms 6 500.0557
## mod.process 6 499.9113
## mod.comm 6 500.3303
## mod.barrier 6 500.6671
## mod.admin 6 496.9524
## mod.size 7 498.2487
## mod.type 7 498.9517
## mod.academic 7 498.4790
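As a cross-check on the manual forward selection above (hypothetical, not part of the original analysis), an automated AIC-based search over the same candidate terms could be sketched as:
mod.null <- lm(goldEducational ~ 1, data = data.mod) # Intercept-only starting point
mod.full <- lm(goldEducational ~ culture + provider + tech + drugRestricted +
handwashFeedback + formsAntimicrobial + qualityProcess +
commMDtoRN + cultureIsBarrier + adminParticipate +
hospBed + hospType + hospAcademic, data = data.mod) # Upper scope: all candidates tested manually above
step(mod.null, scope = formula(mod.full), direction = "forward", trace = FALSE) # Forward selection by AIC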
The standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage) for the selected model are shown below.
plot(mod.admin)
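A minimal, optional tweak (not in the original chunk) to arrange the four base-R diagnostic panels in a single 2x2 grid:
op <- par(mfrow = c(2, 2)) # 2x2 layout for the four lm diagnostic panels
plot(mod.admin)
par(op) # Restore the previous graphics settings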
The results of the linear model for agreement with “Improve antimicrobial prescribing by educational and administrative means” indicate associations with the culture and provider (clinician involvement) components, as well as with the specific question addressing formulary restriction, but not with the technology component. These effects were not confounded by hospital characteristics, including size and academic status.
# Standardize the predictors so effect sizes are on a comparable scale
data.mod$adminParticipate <- scale(data.mod$adminParticipate)
data.mod$drugRestricted <- scale(data.mod$drugRestricted)
data.mod$culture <- scale(data.mod$culture)
data.mod$provider <- scale(data.mod$provider)
# Refit the selected model with standardized predictors
mod.admin <- lm(goldEducational~culture+provider+drugRestricted+adminParticipate, data=data.mod)
summary(mod.admin)
##
## Call:
## lm(formula = goldEducational ~ culture + provider + drugRestricted +
## adminParticipate, data = data.mod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.67718 -0.45950 -0.02311 0.45964 2.10505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.27593 0.04308 76.046 < 2e-16 ***
## culture -0.43415 0.05052 -8.593 1.18e-15 ***
## provider 0.16614 0.04345 3.824 0.000168 ***
## drugRestricted 0.23939 0.04492 5.330 2.29e-07 ***
## adminParticipate -0.09394 0.04906 -1.915 0.056706 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6688 on 236 degrees of freedom
## Multiple R-squared: 0.4084, Adjusted R-squared: 0.3983
## F-statistic: 40.72 on 4 and 236 DF, p-value: < 2.2e-16