Part 1: More categorical predictors with 3+ levels

Question 1

codes = c("testNoArts", "testJan", "testJanMethod")
noArt = c(3/4, 0, 0)
ArtificialsinJan = c(-1/4, -1/3, 1/2)
ArtificialsinJanSpread = c(-1/4, -1/3, -1/2)
ArtificialsinAprilSpread = c(-1/4, 2/3, 0)
codeTable = data.frame(codes, noArt, ArtificialsinJan, ArtificialsinJanSpread, ArtificialsinAprilSpread)
colnames(codeTable) = c("", "No Artificials", "Artificials in January", "Artificials in January with Spreading", "Artificials in April with Spreading")
kable(codeTable, digits = 2, align = "lcccc", caption = "Helmert coding scheme used to test questions 1a-c") %>% kable_styling("striped")
Helmert coding scheme used to test questions 1a-c
No Artificials Artificials in January Artificials in January with Spreading Artificials in April with Spreading
testNoArts 0.75 -0.25 -0.25 -0.25
testJan 0.00 -0.33 -0.33 0.67
testJanMethod 0.00 0.50 -0.50 0.00

The table above shows the Helmert contrast codes used in the model comparisons for each of question 1a-c, which are given in turn below:

For question 1a) we have the model comparison: \[ Model_{A}: yield_i = \beta_0 + \beta_1*testNoArts_i + \beta_2*testJan_i + \beta_3*testJanMethod_i + \epsilon_i \] \[ Model_{C}: yield_i = \beta_0 + \beta_2*testJan_i + \beta_3*testJanMethod_i + \epsilon_i \] \[ Null: \beta_1 = 0 \] For question 1b) we have \[ Model_{A}: yield_i = \beta_0 + \beta_1*testNoArts_i + \beta_2*testJan_i + \beta_3*testJanMethod_i + \epsilon_i \] \[ Model_{C}: yield_i = \beta_0 + \beta_1*testNoArts_i + \beta_3*testJanMethod_i + \epsilon_i \] \[ Null: \beta_2 = 0 \] Finally, for question 1c) we have \[ Model_{A}: yield_i = \beta_0 + \beta_1*testNoArts_i + \beta_2*testJan_i + \beta_3*testJanMethod_i + \epsilon_i \] \[ Model_{C}: yield_i = \beta_0 + \beta_1*testNoArts_i + \beta_2*testJan_i + \epsilon_i \] \[ Null: \beta_3 = 0 \]

Importantly, the Helmert contrast coding scheme employed here yields a complete set of orthogonal contrast codes. This is readily apparent from the table since the products of the “No Artificials” and “Artificials in April with Spreading” levels each give 0 while the products of the inner two columns together sum to zero (\((-1/4)(-1/3)(1/2) + (-1/4)(-1/3)(-1/2) = 0\)).

Question 2

To run the appropriate model comparisons for question 2a and 2b, we first summarize and generate the correct set of contrast codes:

codes = c("testAnyCollege", "testOneorBoth")
noCollege = c(2/3, 0)
OneCollege = c(-1/3, 1/2)
TwoCollege = c(-1/3, -1/2)
collegeCodeTable1 = data.frame(codes, noCollege, OneCollege, TwoCollege)
colnames(collegeCodeTable1) = c("", "Neither Parent with Degree", "One Parent with Degree", "Two Parents with Degrees")
kable(collegeCodeTable1, digits = 2, align = "lccc", caption = "Helmert coding scheme used to test questions 2a-b") %>% kable_styling("striped")
Helmert coding scheme used to test questions 2a-b
Neither Parent with Degree One Parent with Degree Two Parents with Degrees
testAnyCollege 0.67 -0.33 -0.33
testOneorBoth 0.00 0.50 -0.50
#generate corresponding contrast codes
nursery$testAnyCollege = 2/3*(nursery$COLLEGE=="none") -1/3*(nursery$COLLEGE=="one") -1/3*(nursery$COLLEGE=="both")
nursery$testOneorBoth = 0*(nursery$COLLEGE=="none") + 1/2*(nursery$COLLEGE=="one") -1/2*(nursery$COLLEGE=="both")

The table above summarizes the coding scheme used here. We now run the appropriate model comparisons for 2a and 2b in turn:

model.2aA = lm(nursery$PEABODY ~ nursery$testAnyCollege + nursery$testOneorBoth)
mcSummary(model.2aA)
## lm(formula = nursery$PEABODY ~ nursery$testAnyCollege + nursery$testOneorBoth)
## 
## Omnibus ANOVA
##                 SS df      MS EtaSq     F     p
## Model       330.99  2 165.495 0.074 1.882 0.164
## Error      4131.99 47  87.915                  
## Corr Total 4462.98 49  91.081                  
## 
##   RMSE AdjEtaSq
##  9.376    0.035
## 
## Coefficients
##                           Est StErr      t     SSR(3) EtaSq   tol  CI_2.5
## (Intercept)            76.743 1.683 45.597 182778.803 0.978    NA  73.357
## nursery$testAnyCollege -5.690 3.110 -1.830    294.328 0.066 0.758 -11.946
## nursery$testOneorBoth  -6.720 4.593 -1.463    188.160 0.044 0.758 -15.961
##                        CI_97.5     p
## (Intercept)             80.129 0.000
## nursery$testAnyCollege   0.566 0.074
## nursery$testOneorBoth    2.521 0.150

In order to run the appropriate model comparison to test 2c, we must first rearrange the coding scheme:

codes = c("testOneCollege", "testNoneorBoth")
noCollege = c(-1/3, 1/2)
OneCollege = c(2/3, 0)
TwoCollege = c(1/3, -1/2)
collegeCodeTable2 = data.frame(codes, noCollege, OneCollege, TwoCollege)
colnames(collegeCodeTable2) = c("", "Neither Parent with Degree", "One Parent with Degree", "Two Parents with Degrees")
kable(collegeCodeTable2, digits = 2, align = "lccc", caption = "Helmert coding scheme used to test questions 2c") %>% kable_styling("striped")
Helmert coding scheme used to test questions 2c
Neither Parent with Degree One Parent with Degree Two Parents with Degrees
testOneCollege -0.33 0.67 0.33
testNoneorBoth 0.50 0.00 -0.50
#generate corresponding contrast codes
nursery$testOneCollege = -1/3*(nursery$COLLEGE=="none") + 2/3*(nursery$COLLEGE=="one") + 1/3*(nursery$COLLEGE=="both")
nursery$testNoneorBoth = 1/2*(nursery$COLLEGE=="none") + 0*(nursery$COLLEGE=="one") -1/2*(nursery$COLLEGE=="both")

We now run the appropriate model comparison for 2c:

model.2cA = lm(nursery$PEABODY ~ nursery$testOneCollege + nursery$testNoneorBoth)
mcSummary(model.2cA)
## lm(formula = nursery$PEABODY ~ nursery$testOneCollege + nursery$testNoneorBoth)
## 
## Omnibus ANOVA
##                 SS df      MS EtaSq     F     p
## Model       330.99  2 165.495 0.074 1.882 0.164
## Error      4131.99 47  87.915                  
## Corr Total 4462.98 49  91.081                  
## 
##   RMSE AdjEtaSq
##  9.376    0.035
## 
## Coefficients
##                            Est StErr      t    SSR(3) EtaSq   tol  CI_2.5
## (Intercept)             77.475 2.344 33.051 96038.010 0.959    NA  72.759
## nursery$testOneCollege  -3.292 4.503 -0.731    47.005 0.011 0.388 -12.351
## nursery$testNoneorBoth -11.245 6.647 -1.692   251.642 0.057 0.388 -24.616
##                        CI_97.5     p
## (Intercept)             82.191 0.000
## nursery$testOneCollege   5.766 0.468
## nursery$testNoneorBoth   2.126 0.097

We provide statistical summaries for the results of question 2a-c in turn:

2a - In a model not controlling for any other factors, we found that there were no significant differences between the PEABODY means of children of 0, 1, or 2 parents with college degrees, F(2, 47) = 1.882, PRE = .074, p > .1.

2b - Tests of individual, single degree-of-freedom contrasts showed no significant differences between the mean of children of non-degree holding parents versus the combined means of children of parents where one or both attained degrees, F(1, 47) = 3.34, PRE = .066, p = .074.

2c - Moreover, a further single degree-of-freedom contrast showed no significant differences between the PEABODY means of children of no degree-holders versus the means of children of two degree-holders, F = 2.82, PRE = .057, p = .097.

We use the code below to generate a dummy coding scheme to test simultaneously whether there is a difference between the “none” and “one” groups as well as between the “none” and “two” groups:

codes = c("testNoneorBoth_Dummy", "testNoneorOne_Dummy")
noCollege = c(0, 0)
OneCollege = c(0, 1)
TwoCollege = c(1, 0)
collegeCodeTable3 = data.frame(codes, noCollege, OneCollege, TwoCollege)
colnames(collegeCodeTable3) = c("", "Neither Parent with Degree", "One Parent with Degree", "Two Parents with Degrees")
kable(collegeCodeTable3, digits = 2, align = "lccc", caption = "Helmert coding scheme used to test questions 2c") %>% kable_styling("striped")
Helmert coding scheme used to test questions 2c
Neither Parent with Degree One Parent with Degree Two Parents with Degrees
testNoneorBoth_Dummy 0 0 1
testNoneorOne_Dummy 0 1 0
#generate corresponding contrast codes
nursery$testNoneorBoth_Dummy = 0*(nursery$COLLEGE=="none") + 0*(nursery$COLLEGE=="one") + 1*(nursery$COLLEGE=="both")
nursery$testNoneorOne_Dummy = 0*(nursery$COLLEGE=="none") + 1*(nursery$COLLEGE=="one") + 0*(nursery$COLLEGE=="both")

model.2eA = lm(nursery$PEABODY ~ nursery$testNoneorBoth_Dummy + nursery$testNoneorOne_Dummy)
mcSummary(model.2eA)
## lm(formula = nursery$PEABODY ~ nursery$testNoneorBoth_Dummy + 
##     nursery$testNoneorOne_Dummy)
## 
## Omnibus ANOVA
##                 SS df      MS EtaSq     F     p
## Model       330.99  2 165.495 0.074 1.882 0.164
## Error      4131.99 47  87.915                  
## Corr Total 4462.98 49  91.081                  
## 
##   RMSE AdjEtaSq
##  9.376    0.035
## 
## Coefficients
##                                Est StErr      t     SSR(3) EtaSq   tol
## (Intercept)                  72.95 2.097 34.794 106434.050 0.963    NA
## nursery$testNoneorBoth_Dummy  9.05 4.688  1.930    327.610 0.073 0.889
## nursery$testNoneorOne_Dummy   2.33 2.813  0.828     60.321 0.014 0.889
##                              CI_2.5 CI_97.5     p
## (Intercept)                  68.732  77.168 0.000
## nursery$testNoneorBoth_Dummy -0.381  18.481 0.060
## nursery$testNoneorOne_Dummy  -3.329   7.989 0.412

The output provided above shows that the PEABODY means of children of no degree-holders were not significantly different from the means of children of two degree-holders, F(1, 47) = 3.72, PRE = .073, p = .060, and from those of children of one degree-holder, F(1, 47) = .685, PRE = .014, p = .412.

Part 2: Interactions

Question 3

We first predict the size of the budget deficit from city size and poverty rate:

model.3aA = lm(cities$budget ~ cities$size + cities$pov)
mcSummary(model.3aA)
## lm(formula = cities$budget ~ cities$size + cities$pov)
## 
## Omnibus ANOVA
##                 SS  df      MS EtaSq      F p
## Model      226.938   2 113.469  0.46 72.915 0
## Error      266.108 171   1.556               
## Corr Total 493.046 173   2.850               
## 
##   RMSE AdjEtaSq
##  1.247    0.454
## 
## Coefficients
##                Est StErr      t SSR(3) EtaSq   tol CI_2.5 CI_97.5 p
## (Intercept)  3.260 0.453  7.188 80.395 0.232    NA  2.364   4.155 0
## cities$size -0.047 0.006 -7.518 87.953 0.248 0.952 -0.059  -0.035 0
## cities$pov  -0.144 0.019 -7.574 89.267 0.251 0.952 -0.182  -0.107 0

From the output above, we observe that for every additional 10,000 inhabitants, cities can expect budget surpluses to decrease by $470,000 (or, if they are running a deficit, deficits will increase by this amount). Similarly, for every additional percentage point of residents living below the poverty line, cities can expect budget surpluses to decrease by $1.44 million. We provide a news summary below to summarize the above results:

News Summary: Our group was interested in identifying which economic and social factors predict the magnitude of budget surpluses/deficits in large American cities. In a preliminary analysis, both the size (in tens of thousands) and percentage of residents below the poverty line together predicted budget shortages. We aimed to quantify the indpendent contributions of each variable in a follow-up analysis, which showed that both variables significantly predict budget shortages independently. In particular, we found that for every additional 10,000 inhabitants, cities can expect budget deficits to increase by $470,000. Similarly, for every additional percentage point of residents living below the poverty line, cities can expect budget deficits to increase by $1.44 million.

To extend the results above, we include a size-by-poverty interaction term in the following model comparison:

model.3bA = lm(cities$budget ~ cities$size + cities$pov + cities$size:cities$pov)
mcSummary(model.3bA)
## lm(formula = cities$budget ~ cities$size + cities$pov + cities$size:cities$pov)
## 
## Omnibus ANOVA
##                 SS  df     MS EtaSq      F p
## Model      233.080   3 77.693 0.473 50.806 0
## Error      259.967 170  1.529               
## Corr Total 493.046 173  2.850               
## 
##   RMSE AdjEtaSq
##  1.237    0.463
## 
## Coefficients
##                           Est StErr      t SSR(3) EtaSq   tol CI_2.5
## (Intercept)             0.819 1.298  0.631  0.609 0.002    NA -1.743
## cities$size             0.018 0.033  0.536  0.439 0.002 0.034 -0.047
## cities$pov             -0.039 0.056 -0.700  0.749 0.003 0.109 -0.149
## cities$size:cities$pov -0.003 0.001 -2.004  6.142 0.023 0.023 -0.005
##                        CI_97.5     p
## (Intercept)              3.382 0.529
## cities$size              0.082 0.593
## cities$pov               0.071 0.485
## cities$size:cities$pov   0.000 0.047

From the output above we observe that for every additional 10,000 inhabitants, city surpluses can be expected to increase by $180,000 when that city has a 0% poverty level (however, this parameter estimate is far from significance). Similarly, for every percentage point increase in residents below the poverty line, a city with 0 residents can expected to see surpluses decrease by $390,000 (however, this scenario is clearly nonsensical and the associated parameter estimate is far from significance). Finally, an interaction effect was detected: for every percentage point increase in the poverty rate, the negative relationship between city size and surplus levels was magnified by $30,000.

Statistical Summary: In a model predicting surplus levels from the main effects of city size and poverty rate as well as their interaction, simple effects were not significant for city size at a poverty rate of 0%, b = .018, F(1, 170) = .287, PRE = .034, p = .593, nor significant for poverty rate at a city size of 0 residents, b = -.039, F(1, 170) = .49, PRE = .003, .485. However, their interaction was significant, b = -.003, F(1, 170) = 4.016, PRE = .023, p = .047, such that for every percentage point increase in the poverty rate, the negative relationship between city size and surplus levels was magnified by $30,000.

To confirm the results above, we mean-center the predictors and re-run the model comparison:

cities$size_C = cities$size - mean(cities$size)
cities$pov_C = cities$pov - mean(cities$pov)

model.3cA = lm(cities$budget ~ cities$size_C*cities$pov_C)
mcSummary(model.3cA)
## lm(formula = cities$budget ~ cities$size_C * cities$pov_C)
## 
## Omnibus ANOVA
##                 SS  df     MS EtaSq      F p
## Model      233.080   3 77.693 0.473 50.806 0
## Error      259.967 170  1.529               
## Corr Total 493.046 173  2.850               
## 
##   RMSE AdjEtaSq
##  1.237    0.463
## 
## Coefficients
##                               Est StErr       t  SSR(3) EtaSq   tol CI_2.5
## (Intercept)                -1.747 0.097 -18.077 499.740 0.658    NA -1.938
## cities$size_C              -0.044 0.006  -6.895  72.691 0.219 0.899 -0.056
## cities$pov_C               -0.144 0.019  -7.615  88.674 0.254 0.952 -0.181
## cities$size_C:cities$pov_C -0.003 0.001  -2.004   6.142 0.023 0.940 -0.005
##                            CI_97.5     p
## (Intercept)                 -1.557 0.000
## cities$size_C               -0.031 0.000
## cities$pov_C                -0.107 0.000
## cities$size_C:cities$pov_C   0.000 0.047

From the output above we observe that for every additional 10,000 inhabitants, city surpluses can be expected to decrease by $440,000 when that city has poverty level at the mean. Similarly, for every percentage point increase in residents below the poverty line, a city with the mean number of residents can expected to see surpluses decrease by $1,440,000. Finally, an interaction effect was detected: for every percentage point increase in the poverty rate, the negative relationship between city size and surplus levels was magnified by $30,000.

To investigate these effects in cities with populations at or above 500,000 we deviate size and re-run the corresponding model comparisons:

cities$size_50 = cities$size - 50
model.3dA = lm(cities$budget ~ cities$size_50 + cities$pov + cities$size_50*cities$pov, data = cities)
mcSummary(model.3dA)
## lm(formula = cities$budget ~ cities$size_50 + cities$pov + cities$size_50 * 
##     cities$pov, data = cities)
## 
## Omnibus ANOVA
##                 SS  df     MS EtaSq      F p
## Model      233.080   3 77.693 0.473 50.806 0
## Error      259.967 170  1.529               
## Corr Total 493.046 173  2.850               
## 
##   RMSE AdjEtaSq
##  1.237    0.463
## 
## Coefficients
##                              Est StErr      t SSR(3) EtaSq   tol CI_2.5
## (Intercept)                1.697 0.601  2.824 12.199 0.045    NA  0.511
## cities$size_50             0.018 0.033  0.536  0.439 0.002 0.034 -0.047
## cities$pov                -0.175 0.024 -7.176 78.740 0.232 0.568 -0.224
## cities$size_50:cities$pov -0.003 0.001 -2.004  6.142 0.023 0.036 -0.005
##                           CI_97.5     p
## (Intercept)                 2.883 0.005
## cities$size_50              0.082 0.593
## cities$pov                 -0.127 0.000
## cities$size_50:cities$pov   0.000 0.047

From the output above we observe that for every additional 10,000 inhabitants, city surpluses can be expected to increase by $440,000 when that city has a 0% poverty level. Similarly, for every percentage point increase in residents below the poverty line, a city with 50,000 residents can expected to see surpluses decrease by $1,750,000. Finally, an interaction effect was detected: for every percentage point increase in the poverty rate, the negative relationship between city size and surplus levels was magnified by $30,000.

News Summary: Our group was interested in identifying which economic and social factors predict the magnitude of budget surpluses/deficits in large American cities. We found that city size was not related to budget surpluses in cases where cities experienced a 0% poverty rate. On the other hand, poverty rate did influence surpluses, such that for every percentage point increase in residents below the poverty line, a city with 50,000 residents can expected to see surpluses decrease by $1,750,000. Importantly, this effect was magnified by the city size: for every percentage point increase in the poverty rate, the negative relationship between city size and surplus levels was magnified by $30,000.

To visualize the relationship between poverty and budget at various levels of city size, we plot the regression lines at four selected levels of size:

summary(cities)
##       obs              size            pov             unemp      
##  Min.   :  1.00   Min.   :10.30   Min.   : 9.509   Min.   : 4.89  
##  1st Qu.: 44.25   1st Qu.:26.72   1st Qu.:19.646   1st Qu.:11.28  
##  Median : 87.50   Median :36.88   Median :22.241   Median :14.57  
##  Mean   : 87.50   Mean   :38.42   Mean   :22.544   Mean   :14.53  
##  3rd Qu.:130.75   3rd Qu.:48.45   3rd Qu.:25.774   3rd Qu.:17.48  
##  Max.   :174.00   Max.   :87.53   Max.   :39.929   Max.   :24.39  
##      police            crime            budget            size_C       
##  Min.   : 0.2963   Min.   : 15.91   Min.   :-6.4838   Min.   :-28.129  
##  1st Qu.:24.0452   1st Qu.: 82.67   1st Qu.:-2.9091   1st Qu.:-11.706  
##  Median :35.5413   Median :117.17   Median :-1.7208   Median : -1.549  
##  Mean   :34.4456   Mean   :118.33   Mean   :-1.7946   Mean   :  0.000  
##  3rd Qu.:43.3959   3rd Qu.:152.83   3rd Qu.:-0.5446   3rd Qu.: 10.025  
##  Max.   :76.5688   Max.   :240.67   Max.   : 1.6955   Max.   : 49.111  
##      pov_C             size_50       
##  Min.   :-13.0351   Min.   :-39.705  
##  1st Qu.: -2.8979   1st Qu.:-23.282  
##  Median : -0.3029   Median :-13.125  
##  Mean   :  0.0000   Mean   :-11.576  
##  3rd Qu.:  3.2295   3rd Qu.: -1.551  
##  Max.   : 17.3854   Max.   : 37.535
cities %>% ggplot(aes(pov, budget)) +
  geom_point(alpha = .2) +
  geom_abline(intercept = .999, slope = -.069, size = 2, color = "red") +
  geom_abline(intercept = 1.51, slope = -.154, size = 2, color = "orange") +
  geom_abline(intercept = 1.719, slope = -.189, size = 2, color = "yellow") +
  geom_abline(intercept = 2.169, slope = -.264, size = 2, color = "green") +
  coord_cartesian(ylim = c(-10, 5), xlim = c(0, 45)) +
  theme_classic(base_size = 22)