library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
f.d <- read.csv("~/Desktop/472final.csv", header = TRUE, na.strings=".")
str(f.d)
## 'data.frame': 488 obs. of 12 variables:
## $ region : Factor w/ 7 levels "","MW","NE","SE",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ region.n : int 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : int 1 1 1 1 1 1 1 2 2 1 ...
## $ What.year.were.you.born.: int 1958 1978 1946 1981 1953 1986 1985 1970 1962 1948 ...
## $ age : int 60 40 72 37 65 32 33 48 56 70 ...
## $ education : int 3 5 4 4 4 2 1 2 3 5 ...
## $ farm_size : Factor w/ 185 levels "","0","1","1,650",..: 16 117 163 86 96 13 32 52 133 27 ...
## $ value.health : int 4 1 1 NA 3 1 NA NA 5 5 ...
## $ value.profit : int 4 1 3 NA 2 2 NA NA 3 5 ...
## $ value.cons : int 3 4 5 NA 4 2 NA NA 3 4 ...
## $ value.overall : int 3 3 1 NA 3 1 NA NA 3 4 ...
## $ q22 : int 1 2 7 NA 2 7 NA NA 1 1 ...
f.d <- f.d[,-1]
dim(f.d) ## 487 cases
## [1] 488 11
ok <- complete.cases(f.d[,6:9])
sum(!ok) ## 53 removed
## [1] 54
f.d <- f.d[ok,]
dim(f.d)
## [1] 434 11
First, after loading in the data, I spent some time checking it and getting it into shape. PCA requires complete cases of answers to the four questions. There were 487 cases and and 53 incomplete cases were removed at this point.
head(f.d)
## region.n gender What.year.were.you.born. age education farm_size
## 1 1 1 1958 60 3 1100
## 2 1 1 1978 40 5 42
## 3 1 1 1946 72 4 70
## 5 1 1 1953 65 4 31
## 6 1 1 1986 32 2 11
## 9 1 2 1962 56 3 50
## value.health value.profit value.cons value.overall q22
## 1 4 4 3 3 1
## 2 1 1 4 3 2
## 3 1 3 5 1 7
## 5 3 2 4 3 2
## 6 1 2 2 1 7
## 9 5 3 3 3 1
tail(f.d)
## region.n gender What.year.were.you.born. age education farm_size
## 455 NA 1 1965 53 6 500
## 456 NA 1 1976 42 3 5800
## 457 NA 1 1957 61 4 45
## 458 NA 1 1966 52 5 800
## 459 NA 1 1961 57 5 1970
## 460 NA 2 1987 31 6 2388
## value.health value.profit value.cons value.overall q22
## 455 1 1 1 1 7
## 456 5 5 5 5 1
## 457 3 3 3 3 2
## 458 2 2 2 2 7
## 459 2 3 3 3 5
## 460 3 3 3 3 NA
Here, looking at the correlation matrix, you can see the answers to the four questions are moderately to strongly correlated.
f.data <- f.d[,7:10]
cor(f.data)
## value.health value.profit value.cons value.overall
## value.health 1.0000000 0.7766135 0.4869402 0.7870802
## value.profit 0.7766135 1.0000000 0.4543142 0.7312193
## value.cons 0.4869402 0.4543142 1.0000000 0.5694547
## value.overall 0.7870802 0.7312193 0.5694547 1.0000000
Here is the usual table of summary statistics by question. If we use 3 as an absolute benchmark of Neutral, you’ll notice that Q20a and Q20d look low, and Q20b looks suspiciously low.
df.sum <- summarise_all(f.data, funs(mean, sd))
df.stats.tidy <- df.sum %>% gather(stat, val) %>% separate(stat, into = c("var", "stat"), sep = "_") %>%
spread(stat, val)
df.stats.tidy
## var mean sd
## 1 value.cons 3.179724 1.167281
## 2 value.health 2.580645 1.258392
## 3 value.overall 2.518433 1.217987
## 4 value.profit 2.165899 1.106411
I center and scale the answers to the four questions and run PCA to obtain “loadings” and a scree plot.
csf.data <- scale(f.data)
PCA <- prcomp(csf.data)
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7105 0.7883 0.50564 0.4441
## Proportion of Variance 0.7314 0.1554 0.06392 0.0493
## Cumulative Proportion 0.7314 0.8868 0.95070 1.0000
(temp <- PCA$rotation %*% diag(PCA$sdev)) ## PC loadings
## [,1] [,2] [,3] [,4]
## value.health -0.9061401 0.22562281 -0.10044409 -0.34338832
## value.profit -0.8787400 0.27561141 0.36702358 0.13095064
## value.cons -0.7064041 -0.70122340 0.08749115 -0.04030153
## value.overall -0.9129223 0.05335666 -0.32128253 0.24597439
# cor(csf.data, PCA$x) ## Also PC loadings
plot(PCA)
From looking at a.) the PC loadings, b.) the square root of the eigenvalues (standard deviation), c.) the cumulative variance, and d.) the scree plot, we get a nice result; that is, the data are described nicely by the first principal component (PC1).
cor(PCA$x) ## Demonstrates orthogonality of the PCs and is a sanity check.
## PC1 PC2 PC3 PC4
## PC1 1.000000e+00 -1.528309e-15 3.265671e-15 5.537072e-15
## PC2 -1.528309e-15 1.000000e+00 -1.521088e-15 -2.045480e-15
## PC3 3.265671e-15 -1.521088e-15 1.000000e+00 -7.000060e-16
## PC4 5.537072e-15 -2.045480e-15 -7.000060e-16 1.000000e+00
apply(PCA$x, 2, mean) ## Safety check; should all be essentially 0.
## PC1 PC2 PC3 PC4
## -1.219812e-16 -1.851850e-17 -1.935288e-17 -1.995731e-17
apply(PCA$x, 2, var) ## Safety check; these will equal the eigenvalues.
## PC1 PC2 PC3 PC4
## 2.9257078 0.6214285 0.2556725 0.1971912
csf.pcs <- (-1)*scale(PCA$x)
cor(csf.pcs) ## Demonstrates orthogonality of the PCs and is a sanity check.
## PC1 PC2 PC3 PC4
## PC1 1.000000e+00 -1.546319e-15 3.268856e-15 5.538717e-15
## PC2 -1.546319e-15 1.000000e+00 -1.512642e-15 -2.037694e-15
## PC3 3.268856e-15 -1.512642e-15 1.000000e+00 -6.914678e-16
## PC4 5.538717e-15 -2.037694e-15 -6.914678e-16 1.000000e+00
apply(csf.pcs, 2, mean) ## Safety check; should all be essentially 0.
## PC1 PC2 PC3 PC4
## -4.205967e-18 1.167841e-17 1.693179e-17 1.787934e-17
apply(csf.pcs, 2, var) ## Safety check; should all be 1.
## PC1 PC2 PC3 PC4
## 1 1 1 1
As is commonly done, PC1 has been (implicitly) scaled to have standard deviation 1. I then multiplied it by -1 for ease of interpretation. PC1 can be thought of as the construct “Value”. The larger the value of PC1, the more value the participant believes the FARM program has.
To obtain the coefficients for the standardized PC1, we take the inverse of the loadings matrix from before and pull off the appropriate row.
(-1)*solve(temp)[1,] ## One way to do it
## value.health value.profit value.cons value.overall
## 0.3097165 0.3003512 0.2414473 0.3120347
(coefs <- (-1)*(PCA$rotation %*% solve(diag(PCA$sdev)))[,1]) ## Second way to do it
## value.health value.profit value.cons value.overall
## 0.3097165 0.3003512 0.2414473 0.3120347
Thus we see that: \[PC1 = 0.31Q20a + 0.30Q20b + 0.24Q20c + 0.31Q20d\]
We can compute the (standardized) PC1 scores now. They have mean 0 and standard deviation 1. We can compute them manually or obtain them from the PCA object created previously.
PC1_scores <- csf.data %*% coefs
mean(PC1_scores)
## [1] -3.340742e-17
sd(PC1_scores)
## [1] 1
head(PC1_scores)
## [,1]
## 1 0.9334229
## 2 -0.4124867
## 3 -0.1750897
## 5 0.3512193
## 6 -1.0670919
## 9 0.9080794
head((-1)*PCA$x[,1]/sd(PCA$x[,1])) ## Same thing; ta-da!!!
## 1 2 3 5 6 9
## 0.9334229 -0.4124867 -0.1750897 0.3512193 -1.0670919 0.9080794
We are then prepared to use PC1 in the multinomial regression model in the next step of the analysis. We’ll be sure to check for complete cases.
One thing I should point out is that there is a difference between “relative Value” and “absolute Value”. For example, if we look at the previous summary statistics above, it is pretty clear that participants overall, or in absolute terms, might not really think the FARM program has all that much value. Hence, what would a PC1 score look like that corresponds to Neutral, where all four questions are answered with 3s? Using the standardized scoring coefficients and the standardized question answers given here, such a PC1 score would be ~ 0.4158. We will refer to this value throughout the remainder of this report.
## The following gives the coefficients for std X to recover the std PC
(PC1.neutral <- coefs %*% ((3 - df.stats.tidy[,2])/df.stats.tidy[,3]))
## [,1]
## [1,] 0.3831046
In Part Two, we did not have many participants who did not answer Q22. Therefore, we will ultimately retain 430 complete cases for this analysis.
f.d <- read.csv("~/Desktop/472final.csv", header = TRUE, na.strings=".")
str(f.d)
## 'data.frame': 488 obs. of 12 variables:
## $ region : Factor w/ 7 levels "","MW","NE","SE",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ region.n : int 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : int 1 1 1 1 1 1 1 2 2 1 ...
## $ What.year.were.you.born.: int 1958 1978 1946 1981 1953 1986 1985 1970 1962 1948 ...
## $ age : int 60 40 72 37 65 32 33 48 56 70 ...
## $ education : int 3 5 4 4 4 2 1 2 3 5 ...
## $ farm_size : Factor w/ 185 levels "","0","1","1,650",..: 16 117 163 86 96 13 32 52 133 27 ...
## $ value.health : int 4 1 1 NA 3 1 NA NA 5 5 ...
## $ value.profit : int 4 1 3 NA 2 2 NA NA 3 5 ...
## $ value.cons : int 3 4 5 NA 4 2 NA NA 3 4 ...
## $ value.overall : int 3 3 1 NA 3 1 NA NA 3 4 ...
## $ q22 : int 1 2 7 NA 2 7 NA NA 1 1 ...
f.d <- f.d[,-1]
f.d$gender <- factor(f.d$gender)
str(f.d)
## 'data.frame': 488 obs. of 11 variables:
## $ region.n : int 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 2 2 1 ...
## $ What.year.were.you.born.: int 1958 1978 1946 1981 1953 1986 1985 1970 1962 1948 ...
## $ age : int 60 40 72 37 65 32 33 48 56 70 ...
## $ education : int 3 5 4 4 4 2 1 2 3 5 ...
## $ farm_size : Factor w/ 185 levels "","0","1","1,650",..: 16 117 163 86 96 13 32 52 133 27 ...
## $ value.health : int 4 1 1 NA 3 1 NA NA 5 5 ...
## $ value.profit : int 4 1 3 NA 2 2 NA NA 3 5 ...
## $ value.cons : int 3 4 5 NA 4 2 NA NA 3 4 ...
## $ value.overall : int 3 3 1 NA 3 1 NA NA 3 4 ...
## $ q22 : int 1 2 7 NA 2 7 NA NA 1 1 ...
dim(f.d) ## 487 cases
## [1] 488 11
ok <- complete.cases(f.d[,7:10])
sum(!ok) ## 53 removed
## [1] 54
f.d <- f.d[ok,]
dim(f.d)
## [1] 434 11
head(f.d)
## region.n gender What.year.were.you.born. age education farm_size
## 1 1 1 1958 60 3 1100
## 2 1 1 1978 40 5 42
## 3 1 1 1946 72 4 70
## 5 1 1 1953 65 4 31
## 6 1 1 1986 32 2 11
## 9 1 2 1962 56 3 50
## value.health value.profit value.cons value.overall q22
## 1 4 4 3 3 1
## 2 1 1 4 3 2
## 3 1 3 5 1 7
## 5 3 2 4 3 2
## 6 1 2 2 1 7
## 9 5 3 3 3 1
tail(f.d)
## region.n gender What.year.were.you.born. age education farm_size
## 455 NA 1 1965 53 6 500
## 456 NA 1 1976 42 3 5800
## 457 NA 1 1957 61 4 45
## 458 NA 1 1966 52 5 800
## 459 NA 1 1961 57 5 1970
## 460 NA 2 1987 31 6 2388
## value.health value.profit value.cons value.overall q22
## 455 1 1 1 1 7
## 456 5 5 5 5 1
## 457 3 3 3 3 2
## 458 2 2 2 2 7
## 459 2 3 3 3 5
## 460 3 3 3 3 NA
f.data <- f.d[,7:10]
csf.data <- scale(f.data)
PCA <- prcomp(csf.data)
csf.pcs <- scale(PCA$x)
new.data <- cbind(Q22 = f.d$q22, PC1 = (-1)*csf.pcs[,1])
dim(new.data) ## 434 cases
## [1] 434 2
ok <- complete.cases(new.data)
sum(!ok) ## 4 removed
## [1] 4
new.data <- new.data[ok,]
dim(new.data)
## [1] 430 2
new.data <- data.frame(new.data)
new.data$Q22 <- factor(new.data$Q22)
We now conceive of PC1 as a numerical predictor variable. Due to the sparsity of the data, we only generate strip charts by level of Q22, instead of side-by-side boxplots. Red diamonds denote the median of PC1 at that level. You will also notice the vertical dashed red line corresponds to absolute Neutrality (PC1 = 0.4158). These results are striking when you consider PC1 is on a standardized scale. This means those participants who answered 7 to Q22 (“The program is not important to me”) are almost 1.5 standard deviations below the line of absolute Neutrality. In other words, these respondents place very little value in the FARM program. We will also keep our eye on those who answered 1 to Q22.
my.sc4 <- ggplot(data = new.data, aes(y = PC1, x = Q22))
my.sc4 <- my.sc4 + geom_jitter(position = position_jitter(0.05))
my.sc4 <- my.sc4 + stat_summary(fun.y = median, geom = "point", shape = 18, size = 3, color = "red")
my.sc4 <- my.sc4 + geom_hline(yintercept = 0.4158, linetype = "dashed", color = "red")
my.sc4 <- my.sc4 + coord_flip()
my.sc4
We now fit a multinomial regression model with Q22 as the response with 7 levels and the 7th level set as the reference category, and PC1 as a numerical predictor. I used the pmlr package in R to first fit the multinomial regression model. It showed that PC1 is highly significant in terms of its effect on Q22 (\(P\)-value < 0.0001).
{r} # new.data$Q22 <- relevel(new.data$Q22, ref = "7") # # library(pmlr) # ## Takes about 5 minutes to run # kayla02 <- pmlr(Q22 ~ PC1, data = new.data, weights = NULL, alpha = 0.05, penalized = TRUE, # method = "likelihood", joint = TRUE) # summary(kayla02) #Following is a table of results looking at the effect of PC1 for all six submodels:
| Submodel | \(P\)-value | OR | 95% CI |
|---|---|---|---|
| 7 (ref) vs. 1 | < 0.0001 | 29.517 | (15.126, 61.171) |
| 7 (ref) vs. 2 | < 0.0001 | 7.341 | (5.070, 11.017) |
| 7 (ref) vs. 3 | 0.0282 | 6.051 | (1.217, 36.216) |
| 7 (ref) vs. 4 | < 0.0001 | 8.873 | (3.745, 22.309) |
| 7 (ref) vs. 5 | < 0.0001 | 4.215 | (2.794, 6.533) |
| 7 (ref) vs. 6 | < 0.0001 | 6.475 | (2.801, 15.824) |
These results are dramatic. With the exception of the submodel comparing \(\pi_3\) versus \(\pi_7\), the effect of PC1 is highly significant. Some of the confidence intervals are broad which reflects the sparsity of the data.
Leaning on our interpretation of odds ratios, we can interpret these in a similar fashion. For example, for every one-unit increase in PC1 (and this is a big increase!), we estimate the odds of a participant choosing answer 6 as opposed to answer 7 increase by a factor of about 6.475. Stated another way, the more value the participant believes the FARM program has, the higher the probability he/she will feel the FARM program unifies the dairy industry on animal welfare as opposed to feeling the program is not important. The other results are interpreted analagously (note all the OR estimates are positive).
We can plot predicted probabilities \(\widehat{\pi}_i\) against PC1 with 95% confidence limits for different levels of Q22. I think this plot is informative. For example, examine the line corresponding to answer 7. For PC1 around -1 or less (“low value”), the probability a participant would select this answer is quite high (0.70 or higher). By the time we get to absolute Neutrality (PC1 = 0.4158), the probability of select answer 7 has dropped all the way down to around 0.10.
library(nnet)
mod.full <- multinom(Q22 ~ PC1, data = new.data)
## # weights: 21 (12 variable)
## initial value 836.741364
## iter 10 value 484.096119
## iter 20 value 475.427025
## final value 475.426525
## converged
library(effects)
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
fit.eff <- Effect("PC1", mod.full, xlevels = list(PC1 = seq(min(new.data$PC1), max(new.data$PC1), length = 100)))
# data.frame(fit.eff$prob, fit.eff$lower.prob, fit.eff$upper.prob)
plot(fit.eff, multiline = TRUE)
As was the case in Part One, we should exhasutively look at the remaining 15 pairings to get as much information out of these data as possible. By repeatedly fitting the multinomial regression model and toggling the answer used as the reference category (R code and output not shown), we generate the following table of additional significant results looking at the effect of PC1 for other submodels:
| Submodel | \(P\)-value | OR | 95% CI |
|---|---|---|---|
| 6 (ref) vs. 1 | 0.0020 | 4.558 | (1.728, 12.332) |
| 5 (ref) vs. 1 | < 0.0001 | 7.003 | (3.702, 14.063) |
| 5 (ref) vs. 2 | 0.0020 | 1.742 | (1.224, 2.506) |
| 4 (ref) vs. 1 | 0.0151 | 3.326 | (1.258, 9.029) |
| 2 (ref) vs. 1 | < 0.0001 | 4.021 | (2.291, 7.499) |
To me, the main story from this table (which ties into the previous plot) is answer 1. It seems the more value participants place on the FARM program, the more likely they are to select answer 1 (“Improves animal health and wellbeing”) relative to every other answer except 3.