#Question 2
I'm going to create NEast, a dummy variable indicating whether a state is in the Northeast. States in the Northeast are coded 1, and all other states are coded 0.
load("~/Desktop/Fall 19/POGO 611/Assignments/R stuff/pogo_611.RData")
head(states)
## state region pop area density metro waste energy miles toxic
## 1 Alabama 3 4041000 52423 77.08 67.4 1.11 393 10.5 27.86
## 2 Alaska 1 550000 570374 0.96 41.1 0.91 991 7.2 37.41
## 3 Arizona 1 3665000 113642 32.25 79.0 0.79 258 9.7 19.65
## 4 Arkansas 3 2351000 52075 45.15 40.1 0.85 330 8.9 24.60
## 5 California 1 29760000 155973 190.80 95.7 1.51 246 8.7 3.26
## 6 Colorado 1 3294000 103730 31.76 81.5 0.73 273 8.3 2.25
## green house senate csat vsat msat percent expense income high college
## 1 29.25 30 10 991 476 515 8 3627 27.498 66.9 15.7
## 2 NA 0 20 920 439 481 41 8330 48.254 86.6 23.0
## 3 18.37 13 33 932 442 490 26 4309 32.093 78.7 20.3
## 4 26.04 25 37 1005 482 523 6 3700 24.643 66.3 13.3
## 5 15.65 50 47 897 415 482 47 4491 41.716 76.2 23.4
## 6 21.89 36 58 959 453 506 29 5064 35.123 84.4 27.0
## NEast
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
states$NEast <- as.numeric(states$region == 2)
print(states$NEast)
## [1] 0 0 0 0 0 0 1 0 NA 0 0 0 0 0 0 0 0 0 0 1 0 1 0
## [24] 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1
## [47] 0 0 0 0 0
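To double-check that Northeast states are coded 1 and all others 0, I print the state names alongside NEast. The one NA in the vector above belongs to row 9, the District of Columbia.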
print(states[, c("state","NEast")])
## state NEast
## 1 Alabama 0
## 2 Alaska 0
## 3 Arizona 0
## 4 Arkansas 0
## 5 California 0
## 6 Colorado 0
## 7 Connecticut 1
## 8 Delaware 0
## 9 District of Columbia NA
## 10 Florida 0
## 11 Georgia 0
## 12 Hawaii 0
## 13 Idaho 0
## 14 Illinois 0
## 15 Indiana 0
## 16 Iowa 0
## 17 Kansas 0
## 18 Kentucky 0
## 19 Louisiana 0
## 20 Maine 1
## 21 Maryland 0
## 22 Massachusetts 1
## 23 Michigan 0
## 24 Minnesota 0
## 25 Mississippi 0
## 26 Missouri 0
## 27 Montana 0
## 28 Nebraska 0
## 29 Nevada 0
## 30 New Hampshire 1
## 31 New Jersey 1
## 32 New Mexico 0
## 33 New York 1
## 34 North Carolina 0
## 35 North Dakota 0
## 36 Ohio 0
## 37 Oklahoma 0
## 38 Oregon 0
## 39 Pennsylvania 1
## 40 Rhode Island 1
## 41 South Carolina 0
## 42 South Dakota 0
## 43 Tennessee 0
## 44 Texas 0
## 45 Utah 0
## 46 Vermont 1
## 47 Virginia 0
## 48 Washington 0
## 49 West Virginia 0
## 50 Wisconsin 0
## 51 Wyoming 0
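A more compact way to run the same check is to cross-tabulate the original region codes against the new dummy, so that each region code should land in exactly one NEast column and any missing values remain visible. The sketch below uses a small made-up data frame called `toy` (hypothetical, not the real `states` data) and assumes, as in the recode above, that region code 2 means Northeast.

```r
# Toy data standing in for the real `states` data frame (values are made up).
toy <- data.frame(
  state  = c("A", "B", "C", "D", "E"),
  region = c(1, 2, 3, NA, 2)
)

# Same recode as above: 1 if region == 2, 0 otherwise; a missing region stays NA.
toy$NEast <- as.numeric(toy$region == 2)

# Cross-tabulate region against NEast, keeping NA as its own row and column.
check <- table(region = toy$region, NEast = toy$NEast, useNA = "ifany")
print(check)
```

Running the same `table()` call on `states$region` and `states$NEast` would show whether any region code other than 2 ended up coded 1.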
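Next I run a one-way ANOVA of csat on NEast and store the result in an object called anova2. Because the District of Columbia has a missing NEast value, it is dropped from the model.
anova2 <- aov(csat ~ NEast, data = states)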
print(anova2)
## Call:
## aov(formula = csat ~ NEast, data = states)
##
## Terms:
## NEast Residuals
## Sum of Squares 35191.4 177770.0
## Deg. of Freedom 1 48
##
## Residual standard error: 60.85673
## Estimated effects may be unbalanced
## 1 observation deleted due to missingness
summary(anova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## NEast 1 35191 35191 9.502 0.0034 **
## Residuals 48 177770 3704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
According to the model, there is a statistically significant difference in the mean composite SAT score (csat) between Northeast states (NEast = 1) and states outside the Northeast (NEast = 0), F(1, 48) = 9.50, p = 0.0034.
The ANOVA table does not show the size or direction of the difference, so next I run Tukey's HSD (Honest Significant Differences) test.
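With only two groups, a one-way ANOVA is equivalent to a pooled-variance two-sample t-test: the F statistic equals the squared t statistic and the p-values agree. The sketch below illustrates this on simulated data. The data frame `toy`, the seed, and the group means and sizes are all made up for illustration and are not the real `states` values.

```r
set.seed(611)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in for the real data: 12 non-Northeast and 5 Northeast "states".
toy <- data.frame(
  NEast = rep(c(0, 1), times = c(12, 5)),
  csat  = c(rnorm(12, mean = 960, sd = 60), rnorm(5, mean = 890, sd = 60))
)

# One-way ANOVA: pull the F statistic and p-value out of the summary table.
fit_aov <- aov(csat ~ NEast, data = toy)
aov_tab <- summary(fit_aov)[[1]]
f_stat  <- aov_tab[["F value"]][1]
p_aov   <- aov_tab[["Pr(>F)"]][1]

# Pooled-variance (equal variances assumed) two-sample t-test on the same data.
fit_t <- t.test(csat ~ NEast, data = toy, var.equal = TRUE)

# Both comparisons should return TRUE.
print(all.equal(f_stat, unname(fit_t$statistic)^2))
print(all.equal(p_aov, fit_t$p.value))
```

Note that `var.equal = TRUE` is needed for the match; the default Welch test in `t.test()` does not pool the variances and will generally give a slightly different result.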
TukeyHSD(aov(csat ~ as.factor(NEast), data = states))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = csat ~ as.factor(NEast), data = states)
##
## $`as.factor(NEast)`
## diff lwr upr p adj
## 1-0 -69.0542 -114.0958 -24.01262 0.0033963
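The TukeyHSD test shows that Northeast states have, on average, a mean composite SAT score roughly 69 points lower than states outside the Northeast (95% confidence interval: -114.1 to -24.0). This difference is also significant, as the adjusted p-value for the TukeyHSD test is 0.0034. With only two groups there is a single pairwise comparison, which is why this p-value matches the one from the ANOVA.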
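The `diff` column reported by TukeyHSD is simply the mean of the second group minus the mean of the first, so the sign shows which group scores lower. A minimal sketch with a made-up data frame (`toy` and its values are hypothetical, not the real `states` data) shows how to confirm this by computing the group means directly:

```r
# Made-up scores: four non-Northeast (0) and three Northeast (1) observations.
toy <- data.frame(
  NEast = c(0, 0, 0, 0, 1, 1, 1),
  csat  = c(1000, 980, 950, 990, 900, 920, 880)
)

# Tukey HSD on the two-level factor, as in the call above.
tuk <- TukeyHSD(aov(csat ~ as.factor(NEast), data = toy))
tukey_diff <- tuk[["as.factor(NEast)"]]["1-0", "diff"]

# Group means computed directly, then the "1 minus 0" difference.
group_means <- tapply(toy$csat, toy$NEast, mean)
mean_diff   <- unname(group_means["1"] - group_means["0"])

print(group_means)
print(c(tukey = tukey_diff, direct = mean_diff))
```

On the real data, `tapply(states$csat, states$NEast, mean)` would give the two group means behind the difference that TukeyHSD reports.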