Introduction:

This analysis examines the relationship between regions in the United States and education levels since 2000. In “The New Geography of Jobs”, Enrico Moretti argues that America is diverging economically and culturally along geographic lines because the highly educated and skilled are moving to and creating geographic clusters of economic prosperity, exacerbating the division between rich and poor cities. He makes a strong connection between economic prosperity and education level, stating that the higher the percentage of a city’s population that has a college degree, the greater the potential that city has to become economically successful. This analysis looks at one part of Moretti’s argument: whether regions in the United States differ significantly in education level, measured by the highest educational degree attained. It uses a different geographic scope (region vs. city) and does not evaluate economic differences. I will conduct a chi-squared test of independence on the highest degree attained to examine whether there is a significant difference between regions. If there is a growing education gap between regions in the United States, this could be associated with a growing geographic economic disparity and growing political divisions by region.

Data:

The data for this analysis come from the General Social Survey (GSS). Below is a citation for the data and a URL where it is located:

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1

Persistent URL: http://doi.org/10.3886/ICPSR34802.v1

The data from this survey were collected through face-to-face interviews with respondents who were selected through random sampling of the US population that controls for age, gender, and status of employment. Because of this data source, this analysis is a retrospective observational study: it is based on historical data gathered through random sampling, it does not interfere with how the data arose, and it involves no random assignment, so it cannot be used to establish causal links. This analysis can only find associations, not causation. Because of the random sampling, the findings can be generalized to the populations of the different regions of the US. The survey subsampled non-respondents to fulfill quotas, which should limit non-response bias, although it cannot rule it out entirely.
The population of interest is the population of the US grouped by region. Each case in this analysis is a respondent to the GSS since 2000. The two variables of study are the region of interview (region), which is regular (nominal) categorical, and the highest educational degree attained (degree), which is ordinal categorical. For our purposes, the region of interview is interpreted as the respondent’s region of residence (New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, Mountain, and Pacific). Both variables come from the surveys conducted from 2000 to 2012 (2000, 2002, 2004, 2006, 2008, 2010, and 2012).
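
The report does not show how the working data frame `y00s` was built from the cumulative file. A minimal sketch of one way to do it is given here, assuming the data are held in a data frame with a `year` column. The name `gss_toy` and its rows are hypothetical stand-ins, not GSS records:

```r
# Hypothetical stand-in for the cumulative GSS file; the rows are made up
# purely to illustrate the filtering step.
gss_toy <- data.frame(
  year   = c(1998, 2000, 2004, 2012, 1996),
  degree = factor(c("High School", "Bachelor", NA, "Graduate", "High School")),
  region = factor(c("Pacific", "Mountain", "Pacific", "New England", "Pacific"))
)

# Keep only respondents interviewed in 2000 or later.
y00s_toy <- subset(gss_toy, year >= 2000)

print(y00s_toy)
```

Rows with a missing `degree` are kept by this filter, which matches the `<NA>` entry visible in the appendix listing.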

Below is how the states are grouped by region in the survey:
New England = Maine, Vermont, New Hampshire, Massachusetts, Connecticut, Rhode Island
Middle Atlantic = New York, New Jersey, Pennsylvania
East North Central = Wisconsin, Illinois, Indiana, Michigan, Ohio
West North Central = Minnesota, Iowa, Missouri, North Dakota, South Dakota, Nebraska, Kansas
South Atlantic = Delaware, Maryland, West Virginia, Virginia, North Carolina, South Carolina, Georgia, Florida, District of Columbia
East South Central = Kentucky, Tennessee, Alabama, Mississippi
West South Central = Arkansas, Oklahoma, Louisiana, Texas
Mountain = Montana, Idaho, Wyoming, Nevada, Utah, Colorado, Arizona, New Mexico
Pacific = Washington, Oregon, California, Alaska, Hawaii

Exploratory data analysis:

Below is a table showing the percentage of respondents across the United States who listed each of the following as their highest educational degree attained:

signif(table(y00s$degree)/dim(y00s)[1],3)*100
## 
## Lt High School    High School Junior College       Bachelor       Graduate 
##          14.10          51.10           7.80          17.10           9.23

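Across all respondents, a slight majority (about 51%) report high school as their highest degree. About 26% hold a bachelor's or graduate degree, about 14% did not finish high school, and about 8% hold a junior college degree.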
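
The percentages above sum to slightly less than 100. A likely reason is that the denominator, `dim(y00s)[1]`, counts every row, including respondents whose `degree` is missing (the appendix shows at least one `<NA>`), while `table()` drops missing values by default. A minimal sketch of the difference, using a small hypothetical vector rather than the GSS data:

```r
# Hypothetical toy responses; NA marks a respondent with no recorded degree.
degree_toy <- factor(c("High School", "High School", "Bachelor", NA, "Graduate"))

# table() drops NA by default, so the counts cover 4 of the 5 rows.
counts <- table(degree_toy)

# Denominator = all rows, including the NA row (as in the report's code).
pct_all_rows <- counts / length(degree_toy) * 100

# Denominator = rows with a recorded degree only.
pct_non_missing <- prop.table(counts) * 100

print(sum(pct_all_rows))
print(sum(pct_non_missing))
```

Either denominator is defensible. The choice only needs to be stated so that readers know whether the percentages are of all respondents or of those with a recorded degree.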

Below is a cross tabulation of the same data broken down by region. Each row shows the percentage of that region's respondents at each degree level.

signif(table(y00s$region, y00s$degree)/as.vector(table(y00s$region)),3)*100
##                  
##                   Lt High School High School Junior College Bachelor
##   New England              10.30       41.60           7.51    23.20
##   Middle Atlantic          12.80       48.60           8.09    17.50
##   E. Nor. Central          13.70       54.20           8.00    14.70
##   W. Nor. Central          10.70       56.00           5.67    19.00
##   South Atlantic           14.60       49.90           8.62    17.50
##   E. Sou. Central          19.20       55.60           5.92    11.10
##   W. Sou. Central          18.90       51.50           6.91    14.30
##   Mountain                 10.40       52.10           9.26    19.90
##   Pacific                  13.80       48.90           7.92    19.40
##                  
##                   Graduate
##   New England        16.00
##   Middle Atlantic    12.50
##   E. Nor. Central     8.62
##   W. Nor. Central     8.20
##   South Atlantic      8.44
##   E. Sou. Central     7.30
##   W. Sou. Central     7.11
##   Mountain            8.01
##   Pacific             9.53
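
The cross tabulation code relies on R recycling a vector down the columns of a table, so that each row is divided by its region's total. A minimal sketch of that behaviour, next to the more explicit `prop.table(..., margin = 1)` form, is shown here on hypothetical toy data (the names `region_toy` and `degree_toy` are illustrative only):

```r
# Hypothetical toy data, not GSS records.
region_toy <- factor(c("Pacific", "Pacific", "Pacific", "Mountain", "Mountain"))
degree_toy <- factor(c("High School", "Bachelor", "High School",
                       "High School", "Graduate"))

tab <- table(region_toy, degree_toy)

# Report's approach: a table divided by a vector is recycled down the
# columns, so each row is divided by that region's total.
pct_recycled <- tab / as.vector(table(region_toy)) * 100

# More explicit form: proportions within each row (margin = 1).
pct_rows <- prop.table(tab, margin = 1) * 100

print(round(pct_rows, 1))
```

With no missing values the two forms agree. When `degree` has missing values, the recycled form divides by all respondents in the region, while `prop.table` divides only by those counted in the table.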

Below is a boxplot of years of education grouped by region. Years of education is related to highest degree attained and allows the data to be visualized more easily:

par(mar = c(10,6,4,2), oma = c(0,0,0,0))
boxplot(y00s$educ ~ y00s$region, las = 2)
title(main = "Years of Education by Region", ylab = "Years of Education")
title(xlab = "Region", mgp = c(8,0,0))

Figure: Years of Education by Region (boxplots of years of education for each region)

As seen in the boxplots and cross tabulation above, the summary statistics for individual regions visibly differ from the overall US distribution. The cross tabulation shows what appears to be a substantial difference between education levels in different regions, which the inferential analysis below will test. In particular, New England shows very high levels of education, while the East and West South Central regions show below-average levels. This suggests that region and education level are dependent on each other.

Inference:

I will conduct a hypothesis test using the chi-squared test of independence on region and highest educational degree attained, because both variables are categorical and each has more than two levels.
The null hypothesis is that region and highest educational degree attained are independent: education levels do not differ between regions in the United States. The alternative hypothesis is that the two variables are dependent: education levels do differ between regions.
As mentioned above, the data were collected through random sampling of less than 10% of the population, each cell of the table has at least five expected cases, and each case contributes to only one cell. Therefore, the data meet the necessary conditions for this type of inferential analysis.

Below are the results of a chi-squared test of independence on the highest degree attained. This test measures the difference between the observed counts in each region and the counts expected based on proportions from the pooled data (the total surveyed US sample). To reject the null hypothesis, the combination of the X-squared statistic and the degrees of freedom must produce a p-value below the significance level, for which I will use the standard 5%.

source("http://bit.ly/dasi_inference")
inference(y = y00s$degree, x = y00s$region, est = "proportion", type = "ht", method = "theoretical", alternative = 'greater')
## Response variable: categorical, Explanatory variable: categorical
## Chi-square test of independence
## 
## Summary statistics:
##                 x
## y                New England Middle Atlantic E. Nor. Central
##   Lt High School          82             323             444
##   High School            332            1230            1756
##   Junior College          60             205             259
##   Bachelor               185             443             477
##   Graduate               128             317             279
##   Sum                    787            2518            3215
##                 x
## y                W. Nor. Central South Atlantic E. Sou. Central
##   Lt High School             136            571             224
##   High School                711           1958             648
##   Junior College              72            338              69
##   Bachelor                   241            686             129
##   Graduate                   104            331              85
##   Sum                       1264           3884            1155
##                 x
## y                W. Sou. Central Mountain Pacific   Sum
##   Lt High School             375      142     370  2667
##   High School               1022      708    1309  9674
##   Junior College             137      126     212  1478
##   Bachelor                   283      270     518  3232
##   Graduate                   141      109     255  1749
##   Sum                       1958     1355    2664 18800
## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
## Check conditions: expected counts
##                 x
## y                New England Middle Atlantic E. Nor. Central
##   Lt High School      111.65           357.2           456.1
##   High School         404.97          1295.7          1654.4
##   Junior College       61.87           198.0           252.8
##   Bachelor            135.30           432.9           552.7
##   Graduate             73.22           234.2           299.1
##                 x
## y                W. Nor. Central South Atlantic E. Sou. Central
##   Lt High School          179.31          551.0           163.8
##   High School             650.42         1998.6           594.3
##   Junior College           99.37          305.4            90.8
##   Bachelor                217.30          667.7           198.6
##   Graduate                117.59          361.3           107.5
##                 x
## y                W. Sou. Central Mountain Pacific
##   Lt High School           277.8    192.2   377.9
##   High School             1007.5    697.2  1370.8
##   Junior College           153.9    106.5   209.4
##   Bachelor                 336.6    232.9   458.0
##   Graduate                 182.2    126.1   247.8
## 
##  Pearson's Chi-squared test
## 
## data:  y_table
## X-squared = 322.3, df = 32, p-value < 2.2e-16
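
The test above relies on a helper function sourced from an external URL. As a sketch of what that output represents, the same quantities can be recomputed with base R from the observed counts printed in the summary table. The object names (`observed`, `expected`, `x_squared`) are illustrative and not part of the original analysis:

```r
# Observed counts copied from the summary table in this report
# (rows = degree, columns = region).
observed <- matrix(
  c( 82,  323,  444, 136,  571, 224,  375, 142,  370,
    332, 1230, 1756, 711, 1958, 648, 1022, 708, 1309,
     60,  205,  259,  72,  338,  69,  137, 126,  212,
    185,  443,  477, 241,  686, 129,  283, 270,  518,
    128,  317,  279, 104,  331,  85,  141, 109,  255),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    region = c("New England", "Middle Atlantic", "E. Nor. Central",
               "W. Nor. Central", "South Atlantic", "E. Sou. Central",
               "W. Sou. Central", "Mountain", "Pacific")))

n <- sum(observed)

# Expected count under independence: (row total * column total) / n.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Condition check: every cell should have an expected count of at least 5.
all_cells_ok <- all(expected >= 5)

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected.
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

# Cross-check against base R's built-in test.
builtin <- chisq.test(observed)

print(all_cells_ok)
print(df)
print(round(x_squared, 1))
```

Provided the counts were transcribed correctly, `x_squared` and `df` should match the X-squared and df values reported above, and `chisq.test(observed)` should agree with the manual calculation.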

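Figure: plot produced by the inference call above.

According to the chi-squared test, the p-value is extremely low: if region and highest degree attained were independent, differences at least as large as those observed would be extremely unlikely to occur by chance, so I reject the null hypothesis. Because of the nature of the data (two categorical variables with more than two levels each), there is no other method, such as a confidence interval, with which to compare this result.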

Conclusion:

The results of the chi-squared test of independence demonstrate that there is a significant difference in education between regions in the United States since 2000, since the test produced a p-value that is well under the standard 5% significance level. The regionalization of education is an important topic that may explain certain regional differences, so more analysis should be undertaken to see whether it is linked to economic prosperity. A major shortcoming of this analysis is its geographic level of analysis. According to Moretti, the geography of economic growth is tied more to cities than to regions, but regions or states can significantly influence the environment in which cities grow.

Appendix:

y00s[1:59,c(2,12,28)]
##       year         degree          region
## 38117 2000       Bachelor W. Sou. Central
## 38118 2000    High School W. Sou. Central
## 38119 2000    High School W. Sou. Central
## 38120 2000    High School W. Sou. Central
## 38121 2000 Junior College W. Sou. Central
## 38122 2000    High School W. Sou. Central
## 38123 2000    High School W. Sou. Central
## 38124 2000 Junior College W. Sou. Central
## 38125 2000       Graduate W. Sou. Central
## 38126 2000       Bachelor W. Sou. Central
## 38127 2000       Graduate W. Sou. Central
## 38128 2000       Graduate W. Sou. Central
## 38129 2000    High School W. Sou. Central
## 38130 2000       Bachelor W. Sou. Central
## 38131 2000       Bachelor W. Sou. Central
## 38132 2000    High School W. Sou. Central
## 38133 2000 Junior College W. Sou. Central
## 38134 2000    High School W. Sou. Central
## 38135 2000 Lt High School W. Sou. Central
## 38136 2000    High School W. Sou. Central
## 38137 2000       Bachelor W. Sou. Central
## 38138 2000       Bachelor W. Sou. Central
## 38139 2000           <NA> W. Sou. Central
## 38140 2000       Bachelor W. Sou. Central
## 38141 2000       Bachelor W. Sou. Central
## 38142 2000    High School W. Sou. Central
## 38143 2000    High School W. Sou. Central
## 38144 2000    High School W. Sou. Central
## 38145 2000    High School W. Sou. Central
## 38146 2000    High School W. Sou. Central
## 38147 2000    High School W. Sou. Central
## 38148 2000       Bachelor W. Sou. Central
## 38149 2000    High School W. Sou. Central
## 38150 2000    High School W. Sou. Central
## 38151 2000    High School W. Sou. Central
## 38152 2000 Lt High School W. Sou. Central
## 38153 2000       Bachelor W. Sou. Central
## 38154 2000    High School W. Sou. Central
## 38155 2000       Bachelor W. Sou. Central
## 38156 2000    High School W. Sou. Central
## 38157 2000       Bachelor W. Sou. Central
## 38158 2000 Junior College W. Sou. Central
## 38159 2000    High School W. Sou. Central
## 38160 2000       Graduate W. Sou. Central
## 38161 2000    High School W. Sou. Central
## 38162 2000 Lt High School W. Sou. Central
## 38163 2000       Bachelor W. Sou. Central
## 38164 2000    High School W. Sou. Central
## 38165 2000       Bachelor W. Sou. Central
## 38166 2000       Bachelor W. Sou. Central
## 38167 2000 Lt High School W. Sou. Central
## 38168 2000       Bachelor W. Sou. Central
## 38169 2000       Graduate W. Sou. Central
## 38170 2000    High School W. Sou. Central
## 38171 2000       Bachelor W. Sou. Central
## 38172 2000    High School W. Sou. Central
## 38173 2000    High School W. Sou. Central
## 38174 2000       Graduate W. Sou. Central
## 38175 2000       Bachelor W. Sou. Central