library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## Warning: package 'dplyr' was built under R version 3.5.1

## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

f.d <- read.csv("~/Desktop/472final.csv", header = TRUE, na.strings=".")

str(f.d)

## 'data.frame':    488 obs. of  12 variables:
##  $ region                  : Factor w/ 7 levels "","MW","NE","SE",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ region.n                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ gender                  : int  1 1 1 1 1 1 1 2 2 1 ...
##  $ What.year.were.you.born.: int  1958 1978 1946 1981 1953 1986 1985 1970 1962 1948 ...
##  $ age                     : int  60 40 72 37 65 32 33 48 56 70 ...
##  $ education               : int  3 5 4 4 4 2 1 2 3 5 ...
##  $ farm_size               : Factor w/ 185 levels "","0","1","1,650",..: 16 117 163 86 96 13 32 52 133 27 ...
##  $ value.health            : int  4 1 1 NA 3 1 NA NA 5 5 ...
##  $ value.profit            : int  4 1 3 NA 2 2 NA NA 3 5 ...
##  $ value.cons              : int  3 4 5 NA 4 2 NA NA 3 4 ...
##  $ value.overall           : int  3 3 1 NA 3 1 NA NA 3 4 ...
##  $ q22                     : int  1 2 7 NA 2 7 NA NA 1 1 ...

f.d <- f.d[,-1]  

dim(f.d) ## 487 cases

## [1] 488  11

ok <- complete.cases(f.d[,6:9]) 
sum(!ok) ## 53 removed

## [1] 54

f.d <- f.d[ok,]
dim(f.d)

## [1] 434  11

First, after loading in the data, I spent some time checking it and getting it into shape. PCA requires complete cases of answers to the four questions. There were 487 cases and and 53 incomplete cases were removed at this point.

head(f.d)

##   region.n gender What.year.were.you.born. age education farm_size
## 1        1      1                     1958  60         3      1100
## 2        1      1                     1978  40         5        42
## 3        1      1                     1946  72         4        70
## 5        1      1                     1953  65         4        31
## 6        1      1                     1986  32         2        11
## 9        1      2                     1962  56         3        50
##   value.health value.profit value.cons value.overall q22
## 1            4            4          3             3   1
## 2            1            1          4             3   2
## 3            1            3          5             1   7
## 5            3            2          4             3   2
## 6            1            2          2             1   7
## 9            5            3          3             3   1

tail(f.d)

##     region.n gender What.year.were.you.born. age education farm_size
## 455       NA      1                     1965  53         6       500
## 456       NA      1                     1976  42         3      5800
## 457       NA      1                     1957  61         4        45
## 458       NA      1                     1966  52         5       800
## 459       NA      1                     1961  57         5      1970
## 460       NA      2                     1987  31         6      2388
##     value.health value.profit value.cons value.overall q22
## 455            1            1          1             1   7
## 456            5            5          5             5   1
## 457            3            3          3             3   2
## 458            2            2          2             2   7
## 459            2            3          3             3   5
## 460            3            3          3             3  NA

EDA

Here, looking at the correlation matrix, you can see the answers to the four questions are moderately to strongly correlated.

f.data <- f.d[,7:10]
cor(f.data)

##               value.health value.profit value.cons value.overall
## value.health     1.0000000    0.7766135  0.4869402     0.7870802
## value.profit     0.7766135    1.0000000  0.4543142     0.7312193
## value.cons       0.4869402    0.4543142  1.0000000     0.5694547
## value.overall    0.7870802    0.7312193  0.5694547     1.0000000

Here is the usual table of summary statistics by question. If we use 3 as an absolute benchmark of Neutral, you’ll notice that Q20a and Q20d look low, and Q20b looks suspiciously low.

df.sum <- summarise_all(f.data, funs(mean, sd))
df.stats.tidy <- df.sum %>% gather(stat, val) %>% separate(stat, into = c("var", "stat"), sep = "_") %>%
                 spread(stat, val)
df.stats.tidy

##             var     mean       sd
## 1    value.cons 3.179724 1.167281
## 2  value.health 2.580645 1.258392
## 3 value.overall 2.518433 1.217987
## 4  value.profit 2.165899 1.106411

PCA

I center and scale the answers to the four questions and run PCA to obtain “loadings” and a scree plot.

csf.data <- scale(f.data)

PCA <- prcomp(csf.data)
summary(PCA)

## Importance of components:
##                           PC1    PC2     PC3    PC4
## Standard deviation     1.7105 0.7883 0.50564 0.4441
## Proportion of Variance 0.7314 0.1554 0.06392 0.0493
## Cumulative Proportion  0.7314 0.8868 0.95070 1.0000

(temp <- PCA$rotation %*% diag(PCA$sdev)) ## PC loadings

##                     [,1]        [,2]        [,3]        [,4]
## value.health  -0.9061401  0.22562281 -0.10044409 -0.34338832
## value.profit  -0.8787400  0.27561141  0.36702358  0.13095064
## value.cons    -0.7064041 -0.70122340  0.08749115 -0.04030153
## value.overall -0.9129223  0.05335666 -0.32128253  0.24597439

# cor(csf.data, PCA$x)  ## Also PC loadings
plot(PCA)

From looking at a.) the PC loadings, b.) the square root of the eigenvalues (standard deviation), c.) the cumulative variance, and d.) the scree plot, we get a nice result; that is, the data are described nicely by the first principal component (PC1).

cor(PCA$x) ## Demonstrates orthogonality of the PCs and is a sanity check.

##               PC1           PC2           PC3           PC4
## PC1  1.000000e+00 -1.528309e-15  3.265671e-15  5.537072e-15
## PC2 -1.528309e-15  1.000000e+00 -1.521088e-15 -2.045480e-15
## PC3  3.265671e-15 -1.521088e-15  1.000000e+00 -7.000060e-16
## PC4  5.537072e-15 -2.045480e-15 -7.000060e-16  1.000000e+00

apply(PCA$x, 2, mean) ## Safety check; should all be essentially 0.

##           PC1           PC2           PC3           PC4 
## -1.219812e-16 -1.851850e-17 -1.935288e-17 -1.995731e-17

apply(PCA$x, 2, var) ## Safety check; these will equal the eigenvalues.

##       PC1       PC2       PC3       PC4 
## 2.9257078 0.6214285 0.2556725 0.1971912

csf.pcs <- (-1)*scale(PCA$x)
cor(csf.pcs) ## Demonstrates orthogonality of the PCs and is a sanity check.

##               PC1           PC2           PC3           PC4
## PC1  1.000000e+00 -1.546319e-15  3.268856e-15  5.538717e-15
## PC2 -1.546319e-15  1.000000e+00 -1.512642e-15 -2.037694e-15
## PC3  3.268856e-15 -1.512642e-15  1.000000e+00 -6.914678e-16
## PC4  5.538717e-15 -2.037694e-15 -6.914678e-16  1.000000e+00

apply(csf.pcs, 2, mean) ## Safety check; should all be essentially 0.

##           PC1           PC2           PC3           PC4 
## -4.205967e-18  1.167841e-17  1.693179e-17  1.787934e-17

apply(csf.pcs, 2, var) ## Safety check; should all be 1.

## PC1 PC2 PC3 PC4 
##   1   1   1   1

As is commonly done, PC1 has been (implicitly) scaled to have standard deviation 1. I then multiplied it by -1 for ease of interpretation. PC1 can be thought of as the construct “Value”. The larger the value of PC1, the more value the participant believes the FARM program has.

To obtain the coefficients for the standardized PC1, we take the inverse of the loadings matrix from before and pull off the appropriate row.

(-1)*solve(temp)[1,] ## One way to do it

##  value.health  value.profit    value.cons value.overall 
##     0.3097165     0.3003512     0.2414473     0.3120347

(coefs <- (-1)*(PCA$rotation %*% solve(diag(PCA$sdev)))[,1]) ## Second way to do it

##  value.health  value.profit    value.cons value.overall 
##     0.3097165     0.3003512     0.2414473     0.3120347

Thus we see that: \[PC1 = 0.31Q20a + 0.30Q20b + 0.24Q20c + 0.31Q20d\]

We can compute the (standardized) PC1 scores now. They have mean 0 and standard deviation 1. We can compute them manually or obtain them from the PCA object created previously.

PC1_scores <- csf.data %*% coefs
mean(PC1_scores)

## [1] -3.340742e-17

sd(PC1_scores)

## [1] 1

head(PC1_scores)

##         [,1]
## 1  0.9334229
## 2 -0.4124867
## 3 -0.1750897
## 5  0.3512193
## 6 -1.0670919
## 9  0.9080794

head((-1)*PCA$x[,1]/sd(PCA$x[,1])) ## Same thing; ta-da!!!

##          1          2          3          5          6          9 
##  0.9334229 -0.4124867 -0.1750897  0.3512193 -1.0670919  0.9080794

We are then prepared to use PC1 in the multinomial regression model in the next step of the analysis. We’ll be sure to check for complete cases.

One thing I should point out is that there is a difference between “relative Value” and “absolute Value”. For example, if we look at the previous summary statistics above, it is pretty clear that participants overall, or in absolute terms, might not really think the FARM program has all that much value. Hence, what would a PC1 score look like that corresponds to Neutral, where all four questions are answered with 3s? Using the standardized scoring coefficients and the standardized question answers given here, such a PC1 score would be ~ 0.4158. We will refer to this value throughout the remainder of this report.

## The following gives the coefficients for std X to recover the std PC
(PC1.neutral <-  coefs %*% ((3 - df.stats.tidy[,2])/df.stats.tidy[,3]))

##           [,1]
## [1,] 0.3831046

Part Two

In Part Two, we did not have many participants who did not answer Q22. Therefore, we will ultimately retain 430 complete cases for this analysis.

f.d <- read.csv("~/Desktop/472final.csv", header = TRUE, na.strings=".")

str(f.d)

## 'data.frame':    488 obs. of  12 variables:
##  $ region                  : Factor w/ 7 levels "","MW","NE","SE",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ region.n                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ gender                  : int  1 1 1 1 1 1 1 2 2 1 ...
##  $ What.year.were.you.born.: int  1958 1978 1946 1981 1953 1986 1985 1970 1962 1948 ...
##  $ age                     : int  60 40 72 37 65 32 33 48 56 70 ...
##  $ education               : int  3 5 4 4 4 2 1 2 3 5 ...
##  $ farm_size               : Factor w/ 185 levels "","0","1","1,650",..: 16 117 163 86 96 13 32 52 133 27 ...
##  $ value.health            : int  4 1 1 NA 3 1 NA NA 5 5 ...
##  $ value.profit            : int  4 1 3 NA 2 2 NA NA 3 5 ...
##  $ value.cons              : int  3 4 5 NA 4 2 NA NA 3 4 ...
##  $ value.overall           : int  3 3 1 NA 3 1 NA NA 3 4 ...
##  $ q22                     : int  1 2 7 NA 2 7 NA NA 1 1 ...

f.d <- f.d[,-1]
f.d$gender <- factor(f.d$gender)
str(f.d)

## 'data.frame':    488 obs. of  11 variables:
##  $ region.n                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ gender                  : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 2 2 1 ...
##  $ What.year.were.you.born.: int  1958 1978 1946 1981 1953 1986 1985 1970 1962 1948 ...
##  $ age                     : int  60 40 72 37 65 32 33 48 56 70 ...
##  $ education               : int  3 5 4 4 4 2 1 2 3 5 ...
##  $ farm_size               : Factor w/ 185 levels "","0","1","1,650",..: 16 117 163 86 96 13 32 52 133 27 ...
##  $ value.health            : int  4 1 1 NA 3 1 NA NA 5 5 ...
##  $ value.profit            : int  4 1 3 NA 2 2 NA NA 3 5 ...
##  $ value.cons              : int  3 4 5 NA 4 2 NA NA 3 4 ...
##  $ value.overall           : int  3 3 1 NA 3 1 NA NA 3 4 ...
##  $ q22                     : int  1 2 7 NA 2 7 NA NA 1 1 ...

dim(f.d) ## 487 cases

## [1] 488  11

ok <- complete.cases(f.d[,7:10])
sum(!ok) ## 53 removed

## [1] 54

f.d <- f.d[ok,]
dim(f.d)

## [1] 434  11

head(f.d)

##   region.n gender What.year.were.you.born. age education farm_size
## 1        1      1                     1958  60         3      1100
## 2        1      1                     1978  40         5        42
## 3        1      1                     1946  72         4        70
## 5        1      1                     1953  65         4        31
## 6        1      1                     1986  32         2        11
## 9        1      2                     1962  56         3        50
##   value.health value.profit value.cons value.overall q22
## 1            4            4          3             3   1
## 2            1            1          4             3   2
## 3            1            3          5             1   7
## 5            3            2          4             3   2
## 6            1            2          2             1   7
## 9            5            3          3             3   1

tail(f.d)

##     region.n gender What.year.were.you.born. age education farm_size
## 455       NA      1                     1965  53         6       500
## 456       NA      1                     1976  42         3      5800
## 457       NA      1                     1957  61         4        45
## 458       NA      1                     1966  52         5       800
## 459       NA      1                     1961  57         5      1970
## 460       NA      2                     1987  31         6      2388
##     value.health value.profit value.cons value.overall q22
## 455            1            1          1             1   7
## 456            5            5          5             5   1
## 457            3            3          3             3   2
## 458            2            2          2             2   7
## 459            2            3          3             3   5
## 460            3            3          3             3  NA

f.data <- f.d[,7:10]

csf.data <- scale(f.data)

PCA <- prcomp(csf.data)
csf.pcs <- scale(PCA$x)

new.data <- cbind(Q22 = f.d$q22, PC1 = (-1)*csf.pcs[,1])
dim(new.data) ## 434 cases

## [1] 434   2

ok <- complete.cases(new.data)
sum(!ok) ## 4 removed

## [1] 4

new.data <- new.data[ok,]
dim(new.data)

## [1] 430   2

new.data <- data.frame(new.data)
new.data$Q22 <- factor(new.data$Q22)

Naive Graphics

We now conceive of PC1 as a numerical predictor variable. Due to the sparsity of the data, we only generate strip charts by level of Q22, instead of side-by-side boxplots. Red diamonds denote the median of PC1 at that level. You will also notice the vertical dashed red line corresponds to absolute Neutrality (PC1 = 0.4158). These results are striking when you consider PC1 is on a standardized scale. This means those participants who answered 7 to Q22 (“The program is not important to me”) are almost 1.5 standard deviations below the line of absolute Neutrality. In other words, these respondents place very little value in the FARM program. We will also keep our eye on those who answered 1 to Q22.

my.sc4 <- ggplot(data = new.data, aes(y = PC1, x = Q22))
my.sc4 <- my.sc4 + geom_jitter(position = position_jitter(0.05))
my.sc4 <- my.sc4 + stat_summary(fun.y = median, geom = "point", shape = 18, size = 3, color = "red")
my.sc4 <- my.sc4 + geom_hline(yintercept = 0.4158, linetype = "dashed", color = "red")
my.sc4 <- my.sc4 + coord_flip()
my.sc4

Modeling

We now fit a multinomial regression model with Q22 as the response with 7 levels and the 7th level set as the reference category, and PC1 as a numerical predictor. I used the pmlr package in R to first fit the multinomial regression model. It showed that PC1 is highly significant in terms of its effect on Q22 ($P$-value < 0.0001).

`{r} # new.data$Q22 <- relevel(new.data$Q22, ref = "7") # # library(pmlr) # ## Takes about 5 minutes to run # kayla02 <- pmlr(Q22 ~ PC1, data = new.data, weights = NULL, alpha = 0.05, penalized = TRUE, # method = "likelihood", joint = TRUE) # summary(kayla02) #`

Post-Hoc Analysis

Following is a table of results looking at the effect of PC1 for all six submodels:

Submodel	$P$-value	OR	95% CI
7 (ref) vs. 1	< 0.0001	29.517	(15.126, 61.171)
7 (ref) vs. 2	< 0.0001	7.341	(5.070, 11.017)
7 (ref) vs. 3	0.0282	6.051	(1.217, 36.216)
7 (ref) vs. 4	< 0.0001	8.873	(3.745, 22.309)
7 (ref) vs. 5	< 0.0001	4.215	(2.794, 6.533)
7 (ref) vs. 6	< 0.0001	6.475	(2.801, 15.824)

These results are dramatic. With the exception of the submodel comparing $\pi_3$ versus $\pi_7$, the effect of PC1 is highly significant. Some of the confidence intervals are broad which reflects the sparsity of the data.
Leaning on our interpretation of odds ratios, we can interpret these in a similar fashion. For example, for every one-unit increase in PC1 (and this is a big increase!), we estimate the odds of a participant choosing answer 6 as opposed to answer 7 increase by a factor of about 6.475. Stated another way, the more value the participant believes the FARM program has, the higher the probability he/she will feel the FARM program unifies the dairy industry on animal welfare as opposed to feeling the program is not important. The other results are interpreted analagously (note all the OR estimates are positive).

We can plot predicted probabilities $\widehat{\pi}_i$ against PC1 with 95% confidence limits for different levels of Q22. I think this plot is informative. For example, examine the line corresponding to answer 7. For PC1 around -1 or less (“low value”), the probability a participant would select this answer is quite high (0.70 or higher). By the time we get to absolute Neutrality (PC1 = 0.4158), the probability of select answer 7 has dropped all the way down to around 0.10.

library(nnet)
mod.full <- multinom(Q22 ~ PC1, data = new.data)

## # weights:  21 (12 variable)
## initial  value 836.741364 
## iter  10 value 484.096119
## iter  20 value 475.427025
## final  value 475.426525 
## converged

library(effects)

## Loading required package: carData

## lattice theme set by effectsTheme()
## See ?effectsTheme for details.

fit.eff <- Effect("PC1", mod.full, xlevels = list(PC1 = seq(min(new.data$PC1), max(new.data$PC1), length = 100)))
# data.frame(fit.eff$prob, fit.eff$lower.prob, fit.eff$upper.prob)
plot(fit.eff, multiline = TRUE)

As was the case in Part One, we should exhasutively look at the remaining 15 pairings to get as much information out of these data as possible. By repeatedly fitting the multinomial regression model and toggling the answer used as the reference category (R code and output not shown), we generate the following table of additional significant results looking at the effect of PC1 for other submodels:

Submodel	$P$-value	OR	95% CI
6 (ref) vs. 1	0.0020	4.558	(1.728, 12.332)
5 (ref) vs. 1	< 0.0001	7.003	(3.702, 14.063)
5 (ref) vs. 2	0.0020	1.742	(1.224, 2.506)
4 (ref) vs. 1	0.0151	3.326	(1.258, 9.029)
2 (ref) vs. 1	< 0.0001	4.021	(2.291, 7.499)

To me, the main story from this table (which ties into the previous plot) is answer 1. It seems the more value participants place on the FARM program, the more likely they are to select answer 1 (“Improves animal health and wellbeing”) relative to every other answer except 3.

FARM Data Analysis

Philip Turk

April 7, 2017 Draft

EDA

PCA

Part Two

Naive Graphics

Modeling

`{r} # new.data$Q22 <- relevel(new.data$Q22, ref = "7") # # library(pmlr) # ## Takes about 5 minutes to run # kayla02 <- pmlr(Q22 ~ PC1, data = new.data, weights = NULL, alpha = 0.05, penalized = TRUE, # method = "likelihood", joint = TRUE) # summary(kayla02) #`

Post-Hoc Analysis