To keep this project cohesive, here is the previous part: https://rpubs.com/bekkahmoore/747602
I’m going to keep working with the same Marriage Data I’ve been using, but look at it a little differently using a \(\chi^2\) distribution.
I’m going to focus on the states that this data comes from. Let’s make a table!
table(data$State)
##
## AL AR CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC
## 16 8 57 11 12 5 1 20 24 2 16 3 35 20 13 12 13 38 5 4 27 15 22 5 6 34
## ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY
## 4 3 9 22 3 2 74 34 11 13 66 7 13 1 20 44 4 22 6 14 16 7 1
I want to see whether the number of colleges the data was collected from differs by state. My null and alternative hypotheses are as follows:
\[ H_0: \text{The number of colleges surveyed is the same in every state} \] \[ H_A: \text{The number of colleges surveyed is NOT the same in every state}. \] I’ll now test this with a \(\chi^2\) goodness-of-fit test.
states_test = chisq.test(table(data$State))
states_test
##
## Chi-squared test for given probabilities
##
## data: table(data$State)
## X-squared = 783.02, df = 48, p-value < 2.2e-16
Clearly, the p-value is very small, so I can reject the null hypothesis that every state had the same number of colleges surveyed. We could also come to this conclusion from the table I made before the chi-squared test, but it was nice to confirm it.
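Here is a minimal sketch of the arithmetic behind that result, assuming the counts are exactly the ones printed in the table above. The names `counts`, `n`, `k`, `expected`, `x2`, and `p_value` are mine for illustration and are not part of the original analysis.

```r
# Counts copied by hand from the table(data$State) output above.
counts <- c(
  AL = 16, AR = 8, CA = 57, CO = 11, CT = 12, DC = 5, DE = 1, FL = 20,
  GA = 24, HI = 2, IA = 16, ID = 3, IL = 35, IN = 20, KS = 13, KY = 12,
  LA = 13, MA = 38, MD = 5, ME = 4, MI = 27, MN = 15, MO = 22, MS = 5,
  MT = 6, NC = 34, ND = 4, NE = 3, NH = 9, NJ = 22, NM = 3, NV = 2,
  NY = 74, OH = 34, OK = 11, OR = 13, PA = 66, RI = 7, SC = 13, SD = 1,
  TN = 20, TX = 44, UT = 4, VA = 22, VT = 6, WA = 14, WI = 16, WV = 7,
  WY = 1
)

n <- sum(counts)        # total number of colleges in the table
k <- length(counts)     # number of state categories (DC included)
expected <- n / k       # equal expected count for every state under H0

# Pearson statistic: sum over states of (observed - expected)^2 / expected
x2 <- sum((counts - expected)^2 / expected)

# Upper-tail probability of a chi-squared distribution with k - 1 df
p_value <- pchisq(x2, df = k - 1, lower.tail = FALSE)

round(c(expected = expected, X2 = x2, df = k - 1), 2)
p_value
```

If the counts were copied correctly, this should reproduce the `X-squared = 783.02` and `df = 48` reported by `chisq.test()`.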
Here are the expected values for the number of colleges surveyed per state:
states_test$expected
## AL AR CA CO CT DC DE FL
## 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469
## GA HI IA ID IL IN KS KY
## 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469
## LA MA MD ME MI MN MO MS
## 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469
## MT NC ND NE NH NJ NM NV
## 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469
## NY OH OK OR PA RI SC SD
## 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469
## TN TX UT VA VT WA WI WV
## 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469 16.73469
## WY
## 16.73469
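Under the null hypothesis, the 820 colleges would be spread evenly across the 49 state categories, so every state gets the same expected count of 820/49 ≈ 16.73. Alabama, Iowa, and Wisconsin are the only states that actually had 16 colleges surveyed, which is the closest any state comes to that value. Most of the other observed counts are nowhere near it, which is why the test rejected the null hypothesis.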
Here’s a bar chart of my table!
barplot(table(data$State),
col="darkmagenta",
main = "Number of Colleges Surveyed per State")
This barplot gives a great visual for how aggressively I can reject the null hypothesis. It’s very clear that different numbers of colleges were surveyed throughout the states. I made the bars magenta to add a little flair. Let’s move on!
I’d like to bring HBCUs (Historically Black Colleges & Universities) into this study. I want to see if the number of HBCUs surveyed depends on the state. Here comes another table.
table(data$State,data$HBCU.)
##
## No Yes
## AL 14 2
## AR 7 1
## CA 57 0
## CO 11 0
## CT 12 0
## DC 4 1
## DE 1 0
## FL 19 1
## GA 20 4
## HI 2 0
## IA 16 0
## ID 3 0
## IL 35 0
## IN 20 0
## KS 13 0
## KY 11 1
## LA 10 3
## MA 38 0
## MD 5 0
## ME 4 0
## MI 27 0
## MN 15 0
## MO 22 0
## MS 4 1
## MT 6 0
## NC 28 6
## ND 4 0
## NE 3 0
## NH 9 0
## NJ 22 0
## NM 3 0
## NV 2 0
## NY 74 0
## OH 33 1
## OK 11 0
## OR 13 0
## PA 65 1
## RI 7 0
## SC 12 1
## SD 1 0
## TN 19 1
## TX 43 1
## UT 4 0
## VA 20 2
## VT 6 0
## WA 14 0
## WI 16 0
## WV 6 1
## WY 1 0
So it looks like many of the states have 0 HBCUs, but then North Carolina has 6… I am going to hypothesize that the number of HBCUs surveyed is independent of the state. Stated formally:
\[ H_0: \text{The number of HBCUs surveyed is independent of state} \] \[ H_A: \text{The number of HBCUs surveyed depends on the state}. \] Let’s test it with another \(\chi^2\) test, this time a test of independence.
HBCU_test = chisq.test(table(data$State,data$HBCU.))
## Warning in chisq.test(table(data$State, data$HBCU.)): Chi-squared approximation
## may be incorrect
HBCU_test
##
## Pearson's Chi-squared test
##
## data: table(data$State, data$HBCU.)
## X-squared = 87.595, df = 48, p-value = 0.0004207
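Again, a very small p-value, which is strong evidence against the null hypothesis. One caveat: R warned that the chi-squared approximation may be incorrect, which happens when expected counts are small, so this p-value should be read with some caution. The result is still interesting to me. At first I wondered why so many states had 0 HBCUs surveyed if the number truly depends on state, but that is exactly what dependence looks like: all 28 HBCUs in the data come from just 16 of the 49 states. Let’s get the expected values for this test too.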
HBCU_test$expected
##
## No Yes
## AL 15.4536585 0.54634146
## AR 7.7268293 0.27317073
## CA 55.0536585 1.94634146
## CO 10.6243902 0.37560976
## CT 11.5902439 0.40975610
## DC 4.8292683 0.17073171
## DE 0.9658537 0.03414634
## FL 19.3170732 0.68292683
## GA 23.1804878 0.81951220
## HI 1.9317073 0.06829268
## IA 15.4536585 0.54634146
## ID 2.8975610 0.10243902
## IL 33.8048780 1.19512195
## IN 19.3170732 0.68292683
## KS 12.5560976 0.44390244
## KY 11.5902439 0.40975610
## LA 12.5560976 0.44390244
## MA 36.7024390 1.29756098
## MD 4.8292683 0.17073171
## ME 3.8634146 0.13658537
## MI 26.0780488 0.92195122
## MN 14.4878049 0.51219512
## MO 21.2487805 0.75121951
## MS 4.8292683 0.17073171
## MT 5.7951220 0.20487805
## NC 32.8390244 1.16097561
## ND 3.8634146 0.13658537
## NE 2.8975610 0.10243902
## NH 8.6926829 0.30731707
## NJ 21.2487805 0.75121951
## NM 2.8975610 0.10243902
## NV 1.9317073 0.06829268
## NY 71.4731707 2.52682927
## OH 32.8390244 1.16097561
## OK 10.6243902 0.37560976
## OR 12.5560976 0.44390244
## PA 63.7463415 2.25365854
## RI 6.7609756 0.23902439
## SC 12.5560976 0.44390244
## SD 0.9658537 0.03414634
## TN 19.3170732 0.68292683
## TX 42.4975610 1.50243902
## UT 3.8634146 0.13658537
## VA 21.2487805 0.75121951
## VT 5.7951220 0.20487805
## WA 13.5219512 0.47804878
## WI 15.4536585 0.54634146
## WV 6.7609756 0.23902439
## WY 0.9658537 0.03414634
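These expected values also explain the warning that `chisq.test()` printed: every expected count in the Yes column is below 5, which is where the chi-squared approximation gets shaky. One common way to check the result is a Monte Carlo p-value, using the `simulate.p.value` and `B` arguments of `chisq.test()`. The sketch below assumes the observed counts are exactly the ones in the contingency table above; the names `states`, `no`, `yes`, `obs`, `asym`, and `sim` are mine for illustration, and the seed and number of replicates are arbitrary choices.

```r
# Observed counts copied by hand from the table(data$State, data$HBCU.) output.
states <- c("AL", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI",
            "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
            "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ",
            "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD",
            "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")
no  <- c(14, 7, 57, 11, 12, 4, 1, 19, 20, 2,
         16, 3, 35, 20, 13, 11, 10, 38, 5, 4,
         27, 15, 22, 4, 6, 28, 4, 3, 9, 22,
         3, 2, 74, 33, 11, 13, 65, 7, 12, 1,
         19, 43, 4, 20, 6, 14, 16, 6, 1)
yes <- c(2, 1, 0, 0, 0, 1, 0, 1, 4, 0,
         0, 0, 0, 0, 0, 1, 3, 0, 0, 0,
         0, 0, 0, 1, 0, 6, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
         1, 1, 0, 2, 0, 0, 0, 1, 0)

obs <- cbind(No = no, Yes = yes)
rownames(obs) <- states

# Same asymptotic test as in the post (warning silenced here on purpose).
asym <- suppressWarnings(chisq.test(obs))

# How many cells have an expected count below 5?
sum(asym$expected < 5)

# Monte Carlo p-value: compares the observed statistic with statistics from
# random tables that share the same row and column totals.
set.seed(1)  # arbitrary seed so the simulation is repeatable
sim <- chisq.test(obs, simulate.p.value = TRUE, B = 10000)

asym$p.value
sim$p.value
```

If the simulated p-value is also small, the conclusion does not hinge on the approximation that triggered the warning.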
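So North Carolina had an expected value of about 1.2 HBCUs with an observed value of 6… and New York had an expected value of about 2.5 and an observed value of 0. Again, interesting. I’m not sure exactly what to make of these results, but gaps like these between observed and expected counts are what make the test statistic so large. I’ll make a visual for this contingency table.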
mosaicplot(table(data$State, data$HBCU.),
main = "HBCU's by State",
xlab = "State",
ylab = "HBCU?",
off = 30)
This visual is kind of overwhelming because there are a ton of states squished in together, but it still paints a nice enough picture. I figured out that the “off = 30” argument controls the spacing between the tiles, which is what separates the yes’s and no’s from each other so the plot is easier to read.
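Going back to the North Carolina and New York comparison, one way to make sense of gaps like those is the Pearson residual, which scales the difference between an observed count \(O\) and its expected count \(E\). Here it is worked out for the Yes column, using the expected values printed earlier rounded to three decimals:

\[ r = \frac{O - E}{\sqrt{E}}, \qquad r_{\text{NC}} = \frac{6 - 1.161}{\sqrt{1.161}} \approx 4.49, \qquad r_{\text{NY}} = \frac{0 - 2.527}{\sqrt{2.527}} \approx -1.59 \]

Squaring a residual gives that cell’s contribution to the \(\chi^2\) statistic, so North Carolina’s Yes cell alone accounts for about 20.2 of the 87.6 total. R stores these values for every cell in HBCU_test$residuals.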
This part of the report went smoothly for the most part, and it helped me use \(\chi^2\) tests to compare categorical variables. I’m still slightly surprised that the number of HBCUs surveyed depended on state, but other than that it was pretty standard.