# Question 2

I’m going to create NEast, a dummy variable indicating whether a state is in the Northeast. States in the Northeast (region code 2) are coded 1, and all other states are coded 0.

load("~/Desktop/Fall 19/POGO 611/Assignments/R stuff/pogo_611.RData")
head(states)
##        state region      pop   area density metro waste energy miles toxic
## 1    Alabama      3  4041000  52423   77.08  67.4  1.11    393  10.5 27.86
## 2     Alaska      1   550000 570374    0.96  41.1  0.91    991   7.2 37.41
## 3    Arizona      1  3665000 113642   32.25  79.0  0.79    258   9.7 19.65
## 4   Arkansas      3  2351000  52075   45.15  40.1  0.85    330   8.9 24.60
## 5 California      1 29760000 155973  190.80  95.7  1.51    246   8.7  3.26
## 6   Colorado      1  3294000 103730   31.76  81.5  0.73    273   8.3  2.25
##   green house senate csat vsat msat percent expense income high college
## 1 29.25    30     10  991  476  515       8    3627 27.498 66.9    15.7
## 2    NA     0     20  920  439  481      41    8330 48.254 86.6    23.0
## 3 18.37    13     33  932  442  490      26    4309 32.093 78.7    20.3
## 4 26.04    25     37 1005  482  523       6    3700 24.643 66.3    13.3
## 5 15.65    50     47  897  415  482      47    4491 41.716 76.2    23.4
## 6 21.89    36     58  959  453  506      29    5064 35.123 84.4    27.0
##   NEast
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0
states$NEast <- as.numeric(states$region == 2)
print(states$NEast)
##  [1]  0  0  0  0  0  0  1  0 NA  0  0  0  0  0  0  0  0  0  0  1  0  1  0
## [24]  0  0  0  0  0  0  1  1  0  1  0  0  0  0  0  1  1  0  0  0  0  0  1
## [47]  0  0  0  0  0
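The NA in position 9 most likely comes from the comparison itself: in R, `NA == 2` evaluates to `NA`, so a missing region code carries through to the dummy. The sketch below uses a made-up `region` vector (not the real data) to show that behavior. The names `region`, `neast`, and `n_missing` are illustrative only.

```r
# Hypothetical region codes for illustration only; 2 stands for the Northeast,
# matching the coding used in this assignment.
region <- c(3, 1, 2, NA, 2)

# The comparison returns TRUE/FALSE/NA; as.numeric() maps these to 1/0/NA.
neast <- as.numeric(region == 2)
print(neast)

# Count how many observations end up without a dummy value.
n_missing <- sum(is.na(neast))
print(n_missing)
```

Any row with a missing dummy will later be dropped by `aov()`, which is worth keeping in mind when reading the degrees of freedom.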

But I want to double-check that Northeast states are coded 1 and all others are coded 0. Note that the District of Columbia comes out as NA rather than 0.

print(states[, c("state","NEast")])
##                   state NEast
## 1               Alabama     0
## 2                Alaska     0
## 3               Arizona     0
## 4              Arkansas     0
## 5            California     0
## 6              Colorado     0
## 7           Connecticut     1
## 8              Delaware     0
## 9  District of Columbia    NA
## 10              Florida     0
## 11              Georgia     0
## 12               Hawaii     0
## 13                Idaho     0
## 14             Illinois     0
## 15              Indiana     0
## 16                 Iowa     0
## 17               Kansas     0
## 18             Kentucky     0
## 19            Louisiana     0
## 20                Maine     1
## 21             Maryland     0
## 22        Massachusetts     1
## 23             Michigan     0
## 24            Minnesota     0
## 25          Mississippi     0
## 26             Missouri     0
## 27              Montana     0
## 28             Nebraska     0
## 29               Nevada     0
## 30        New Hampshire     1
## 31           New Jersey     1
## 32           New Mexico     0
## 33             New York     1
## 34       North Carolina     0
## 35         North Dakota     0
## 36                 Ohio     0
## 37             Oklahoma     0
## 38               Oregon     0
## 39         Pennsylvania     1
## 40         Rhode Island     1
## 41       South Carolina     0
## 42         South Dakota     0
## 43            Tennessee     0
## 44                Texas     0
## 45                 Utah     0
## 46              Vermont     1
## 47             Virginia     0
## 48           Washington     0
## 49        West Virginia     0
## 50            Wisconsin     0
## 51              Wyoming     0
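Reading 51 rows by eye works, but a cross-tabulation of the region code against the dummy is a quicker check. This is a minimal sketch on a made-up data frame called `toy` (a stand-in for `states`, with invented values); the same `table()` call should work on the real data, assuming `region` is stored as a numeric code.

```r
# Made-up miniature data frame standing in for `states` (values are illustrative).
toy <- data.frame(
  state  = c("A", "B", "C", "D", "E", "F"),
  region = c(1, 2, 3, 2, NA, 4)
)
toy$NEast <- as.numeric(toy$region == 2)

# Rows are region codes, columns are dummy values; NA is shown explicitly.
check <- table(region = toy$region, NEast = toy$NEast, useNA = "ifany")
print(check)

# Every region-2 row should be coded 1 and every other non-missing row 0.
# which() drops the rows where region is NA.
stopifnot(all(toy$NEast[which(toy$region == 2)] == 1),
          all(toy$NEast[which(toy$region != 2)] == 0))
```

If the coding is right, all counts for region 2 fall in the NEast = 1 column and all other regions fall in the NEast = 0 column.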

Next, I run a one-way ANOVA of csat on NEast and store the result in an object called anova2.

anova2 <- aov(csat ~ NEast, data = states)
print(anova2)
## Call:
##    aov(formula = csat ~ NEast, data = states)
## 
## Terms:
##                    NEast Residuals
## Sum of Squares   35191.4  177770.0
## Deg. of Freedom        1        48
## 
## Residual standard error: 60.85673
## Estimated effects may be unbalanced
## 1 observation deleted due to missingness
summary(anova2)
##             Df Sum Sq Mean Sq F value Pr(>F)   
## NEast        1  35191   35191   9.502 0.0034 **
## Residuals   48 177770    3704                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness

According to the model, there is a statistically significant difference in mean composite SAT score (csat) between states in the Northeast (NEast = 1) and states outside the Northeast (NEast = 0), F(1, 48) = 9.50, p = 0.0034. One observation, the District of Columbia (whose NEast value is NA), was dropped from the model.
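To see where the F value and p-value come from, the numbers in the ANOVA table can be recomputed by hand. The sketch below copies the sums of squares and degrees of freedom from the output above; the variable names are my own.

```r
# Values copied from the ANOVA output above.
ss_neast <- 35191.4   # sum of squares for NEast
ss_resid <- 177770.0  # residual sum of squares
df_neast <- 1
df_resid <- 48

# Mean squares are sums of squares divided by their degrees of freedom.
ms_neast <- ss_neast / df_neast
ms_resid <- ss_resid / df_resid

# The F statistic is the ratio of the two mean squares.
f_value <- ms_neast / ms_resid

# The upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_value, df_neast, df_resid, lower.tail = FALSE)

# The residual standard error is the square root of the residual mean square.
resid_se <- sqrt(ms_resid)

print(round(f_value, 3))
print(round(p_value, 4))
print(round(resid_se, 2))
```

The results should agree with the F value, Pr(>F), and residual standard error printed by R, up to rounding.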

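Next, I run a Tukey HSD (Honest Significant Differences) test. Because NEast has only two groups, this gives the same p-value as the ANOVA, but it also reports the size and direction of the difference.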

TukeyHSD(aov(csat ~ as.factor(NEast), data = states))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = csat ~ as.factor(NEast), data = states)
## 
## $`as.factor(NEast)`
##         diff       lwr       upr     p adj
## 1-0 -69.0542 -114.0958 -24.01262 0.0033963

The Tukey HSD test shows that states in the Northeast have, on average, a mean composite SAT score roughly 69 points lower than states outside the Northeast (95% confidence interval: about 114 to 24 points lower). This difference is also significant: the adjusted p-value is 0.0034.
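The `diff` column in the Tukey output is simply the mean of the NEast = 1 group minus the mean of the NEast = 0 group. A small sketch with invented scores (these are not the values in `states`, and `toy` is a hypothetical data frame) shows this, and also compares the Tukey adjusted p-value with the ANOVA p-value when there are only two groups.

```r
# Made-up scores for illustration; these are not the values in `states`.
toy <- data.frame(
  csat  = c(1000, 940, 1010, 950, 1020, 930, 990, 910, 960),
  NEast = c(0, 0, 0, 0, 0, 1, 1, 1, 1)
)

fit   <- aov(csat ~ as.factor(NEast), data = toy)
tukey <- TukeyHSD(fit)

# Difference in group means, Northeast minus non-Northeast.
group_means <- tapply(toy$csat, toy$NEast, mean)
mean_diff   <- unname(group_means["1"] - group_means["0"])

# Pull the matching pieces out of the Tukey and ANOVA results.
tukey_diff <- tukey[[1]]["1-0", "diff"]
tukey_p    <- tukey[[1]]["1-0", "p adj"]
anova_p    <- summary(fit)[[1]][["Pr(>F)"]][1]

print(c(mean_diff = mean_diff, tukey_diff = tukey_diff))
print(c(tukey_p = tukey_p, anova_p = anova_p))
```

With more than two groups the Tukey adjustment would matter, since it corrects for making several pairwise comparisons at once.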