The ‘Other’, as in ‘a person different from yourself’, has always been a difficult topic, and it remains so today. This paper addresses the question: “Do people living in the US South Atlantic region hold a markedly different opinion from people in other US regions on the statement ‘racial differences in welfare are caused by a lack of education’?”
Since ‘Ferguson’ the topic under discussion has received international attention. However, it is not difficult to imagine a similar survey in, let’s say, any European country yielding similar results (perhaps with the word ‘racial’ replaced by ‘ethnic’ or ‘immigrants’).
The General Social Survey (GSS) is a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The survey is conducted as a face-to-face interview by the National Opinion Research Center at the University of Chicago, with adults (18+) in randomly selected households. The survey was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992). Since 1994, it has been conducted every other year. The survey takes about 90 minutes to administer. See https://en.wikipedia.org/wiki/General_Social_Survey for a detailed description of the data collection process. The original and full survey is available at http://www3.norc.org/GSS+Website/ .
Each case is the result of one survey interview, conducted face-to-face. The person interviewed is an adult (18+) from a randomly selected household. Concretely, this paper mainly uses two variables from the above-mentioned study: region and racdif3.
Region is pretty much self-explanatory. For Racdif3 it might be better to repeat the question asked : On the average (negroes/blacks/African-Americans) have worse jobs, income, and housing than white people. Do you think these differences are: c. Because most (negroes/blacks/African-Americans) don’t have the chance for education that it takes to rise out of poverty?
The study is observational and is based on historical data (1972-2010). No experiment was carried out and there are no control groups. Data is collected (observed) via surveys, and each person interviewed was asked the same questions. The goal of the survey is to correlate demographic data with different beliefs.
The survey randomly samples adults from all over the US, covering all regions with sufficient cases. Considering the size and duration of the survey, it may be considered unbiased in its selection of people, and results can be generalised across the US. However, the second variable selected for this analysis (racdif3) has many missing answers, and we will have to investigate whether this indirectly introduces a bias (for example, a voluntary response bias on this particular question).
The analysis will not establish any causal links; it will merely reveal facts and look for correlations.
Let’s look first at a frequency crosstable for the 2 variables region and racdif3.
##                  racdif3
## region             Yes   No <NA>
##   New England      736  468 1471
##   Middle Atlantic 1769 1618 5048
##   E. Nor. Central 2165 2161 6246
##   W. Nor. Central 1019  845 2357
##   South Atlantic  1970 2700 6307
##   E. Sou. Central  629  996 2140
##   W. Sou. Central  989 1415 2959
##   Mountain         916  709 1798
##   Pacific         1812 1542 4276
##   <NA>               0    0    0
All cells have sufficiently high frequencies. There are many not available (NA) values, so the following table shows the share of YES/NO/NA answers per region. As can be seen, the NA share is fairly similar across regions.
##                  racdif3
## region              Yes    No  <NA>
##   New England     0.275 0.175 0.550
##   Middle Atlantic 0.210 0.192 0.598
##   E. Nor. Central 0.205 0.204 0.591
##   W. Nor. Central 0.241 0.200 0.558
##   South Atlantic  0.179 0.246 0.575
##   E. Sou. Central 0.167 0.265 0.568
##   W. Sou. Central 0.184 0.264 0.552
##   Mountain        0.268 0.207 0.525
##   Pacific         0.237 0.202 0.560
##   <NA>
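The shares above can be reproduced from the frequency table alone. The following is a minimal sketch that re-enters the counts by hand instead of reading them from the gss data frame; the object names `counts` and `shares` are illustrative and do not come from the original analysis.

```r
# Counts copied from the frequency table above (rows: region, columns: racdif3).
regions <- c("New England", "Middle Atlantic", "E. Nor. Central",
             "W. Nor. Central", "South Atlantic", "E. Sou. Central",
             "W. Sou. Central", "Mountain", "Pacific")
counts <- matrix(
  c( 736,  468, 1471,
    1769, 1618, 5048,
    2165, 2161, 6246,
    1019,  845, 2357,
    1970, 2700, 6307,
     629,  996, 2140,
     989, 1415, 2959,
     916,  709, 1798,
    1812, 1542, 4276),
  ncol = 3, byrow = TRUE,
  dimnames = list(region = regions, racdif3 = c("Yes", "No", "<NA>"))
)

# Row-wise shares: each region's row sums to 1 before rounding.
shares <- round(prop.table(counts, margin = 1), 3)
print(shares)

# Smallest and largest NA share across the nine regions.
na_range <- range(shares[, "<NA>"])
print(na_range)
```

The NA share runs from 0.525 (Mountain) to 0.598 (Middle Atlantic), which is the basis for calling it fairly similar across regions.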
If we ignore NA values for a moment, what does the distribution of YES/NO answers look like per region? In the graph below, the blue line is the overall average of YES answers. Eyeballing the graph suggests there are relevant differences from region to region.
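The numbers behind such a graph can be sketched as follows. This assumes that the blue line is the pooled YES share over all respondents who answered (rather than the mean of the regional shares); the counts are re-entered from the frequency table, and the names `yes_share` and `overall_yes` are illustrative.

```r
# Yes/No counts per region, copied from the frequency table (NA answers dropped).
regions <- c("New England", "Middle Atlantic", "E. Nor. Central",
             "W. Nor. Central", "South Atlantic", "E. Sou. Central",
             "W. Sou. Central", "Mountain", "Pacific")
yes <- c(736, 1769, 2165, 1019, 1970, 629, 989, 916, 1812)
no  <- c(468, 1618, 2161,  845, 2700, 996, 1415, 709, 1542)

# Share of YES among those who answered, per region.
yes_share <- setNames(yes / (yes + no), regions)

# Pooled YES share over all regions (assumed to be the blue line).
overall_yes <- sum(yes) / sum(yes + no)

print(round(sort(yes_share, decreasing = TRUE), 3))
print(round(overall_yes, 3))

# Optional plot in the spirit of the figure:
# barplot(yes_share, las = 2, ylim = c(0, 1))
# abline(h = overall_yes, col = "blue")
```

With these counts, the three southern regions (South Atlantic, E. Sou. Central, W. Sou. Central) fall below the pooled share and the other regions lie above it.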
The NA question requires a bit more investigation as it could introduce a bias. The next couple of tables and charts investigate the distribution of NA values per region per sex and race.
The table and graph below show the proportions, by sex, of people who DID NOT respond to the question.
##                  sex
## region             Male Female
##   New England     0.457  0.543
##   Middle Atlantic 0.434  0.566
##   E. Nor. Central 0.438  0.562
##   W. Nor. Central 0.464  0.536
##   South Atlantic  0.436  0.564
##   E. Sou. Central 0.416  0.584
##   W. Sou. Central 0.431  0.569
##   Mountain        0.433  0.567
##   Pacific         0.466  0.534
The table and graph below show the proportions, by race, of people who DID NOT respond to the question.
##                  race
## region            White Black Other
##   New England     0.917 0.054 0.029
##   Middle Atlantic 0.806 0.154 0.041
##   E. Nor. Central 0.845 0.133 0.022
##   W. Nor. Central 0.888 0.089 0.023
##   South Atlantic  0.738 0.231 0.031
##   E. Sou. Central 0.770 0.221 0.008
##   W. Sou. Central 0.739 0.203 0.058
##   Mountain        0.906 0.022 0.072
##   Pacific         0.825 0.076 0.099
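Tables of this kind can be built by keeping only the cases with a missing racdif3 answer and cross-tabulating region against the second variable. The sketch below shows the idea on a small made-up data frame (`toy` is hypothetical and is not GSS data); with the real data, `toy` would be replaced by the gss data frame and `sex` by `race` for the second table.

```r
# Hypothetical toy data with the same column names as the GSS extract.
toy <- data.frame(
  region  = c("Pacific", "Pacific", "Pacific", "Pacific",
              "Mountain", "Mountain", "Mountain"),
  sex     = c("Male", "Female", "Female", "Male",
              "Male", "Female", "Female"),
  racdif3 = c(NA, NA, NA, "Yes", NA, "No", NA),
  stringsAsFactors = FALSE
)

# Keep only the cases without an answer to racdif3.
no_answer <- toy[is.na(toy$racdif3), ]

# Row-wise proportions: within each region, the split of non-respondents by sex.
na_by_sex <- prop.table(table(no_answer$region, no_answer$sex), margin = 1)
print(round(na_by_sex, 3))
```

Each row sums to 1, so the table describes the composition of the non-respondents within a region, not the non-response rate of each group.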
It is not obvious from the above tables and charts whether the NA values inject a relevant bias into this study. In the next chapter we will assume that the data is not biased and originates from a random sample. The question remains: is the observed difference in the proportions in figure 1 statistically significant?
We focus again on figure 1 from the previous chapter. The null hypothesis \(H_0\) is that the proportions of YES/NO answers are the same in every region, i.e. response and region are independent. The alternative hypothesis \(H_A\) states that the proportions differ between regions. We test at the 5% significance level (\({\alpha} = 0.05\)). We are investigating two categorical variables, at least one of which has more than two levels, so we apply a chi-square test of independence, which will reveal whether there is a significant difference in the proportions. For this we use the CrossTable function from the gmodels package. It shows the observed and expected values, as well as the \({\chi}^2\) contribution per cell.
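The formula used for Pearson’s chi-squared test is:
\({\chi}^2=\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}\)
where \(O_k\) and \(E_k\) are the observed and expected counts in cell \(k\), and \(n\) is the number of cells in the table.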
The conditions for the test are met. As investigated in the previous chapter, we have a random sample with no clear bias, and the expected count per cell is sufficiently large for both YES and NO answers. The people interviewed are only a tiny fraction of the total US population, and each case is counted in one cell only.
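The expected counts and per-cell contributions can also be worked out by hand from the observed counts. The following is a minimal sketch with the YES/NO counts re-entered from the frequency table; the object names are illustrative, and the threshold of at least 5 expected cases per cell is the usual rule of thumb, assumed here because the text does not state a specific cut-off.

```r
# Observed Yes/No counts per region, copied from the frequency table.
regions <- c("New England", "Middle Atlantic", "E. Nor. Central",
             "W. Nor. Central", "South Atlantic", "E. Sou. Central",
             "W. Sou. Central", "Mountain", "Pacific")
observed <- cbind(
  Yes = c(736, 1769, 2165, 1019, 1970, 629, 989, 916, 1812),
  No  = c(468, 1618, 2161,  845, 2700, 996, 1415, 709, 1542)
)
rownames(observed) <- regions

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Contribution of each cell to the chi-square statistic: (O - E)^2 / E.
contribution <- (observed - expected)^2 / expected

print(round(expected, 3))
print(round(contribution, 3))

# Rule-of-thumb check (assumed threshold): every expected count is at least 5.
print(all(expected >= 5))
```

For South Atlantic/YES, for example, this gives an expected count of 4670 × 12005 / 24459 ≈ 2292.1 against 1970 observed.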
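library(gmodels)
# Assumes the gss data frame is already loaded; cases with NA in racdif3 are left out of the table
analysis <- CrossTable(gss$region, gss$racdif3, expected = TRUE,
                       prop.t = FALSE, prop.c = FALSE, prop.r = FALSE,
                       chisq = TRUE, prop.chisq = TRUE)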
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## |-------------------------|
##
##
## Total Observations in Table: 24459
##
##
## | gss$racdif3
## gss$region | Yes | No | Row Total |
## ----------------|-----------|-----------|-----------|
## New England | 736 | 468 | 1204 |
## | 590.949 | 613.051 | |
## | 35.603 | 34.320 | |
## ----------------|-----------|-----------|-----------|
## Middle Atlantic | 1769 | 1618 | 3387 |
## | 1662.412 | 1724.588 | |
## | 6.834 | 6.588 | |
## ----------------|-----------|-----------|-----------|
## E. Nor. Central | 2165 | 2161 | 4326 |
## | 2123.293 | 2202.707 | |
## | 0.819 | 0.790 | |
## ----------------|-----------|-----------|-----------|
## W. Nor. Central | 1019 | 845 | 1864 |
## | 914.891 | 949.109 | |
## | 11.847 | 11.420 | |
## ----------------|-----------|-----------|-----------|
## South Atlantic | 1970 | 2700 | 4670 |
## | 2292.136 | 2377.864 | |
## | 45.273 | 43.641 | |
## ----------------|-----------|-----------|-----------|
## E. Sou. Central | 629 | 996 | 1625 |
## | 797.585 | 827.415 | |
## | 35.634 | 34.349 | |
## ----------------|-----------|-----------|-----------|
## W. Sou. Central | 989 | 1415 | 2404 |
## | 1179.935 | 1224.065 | |
## | 30.897 | 29.783 | |
## ----------------|-----------|-----------|-----------|
## Mountain | 916 | 709 | 1625 |
## | 797.585 | 827.415 | |
## | 17.581 | 16.947 | |
## ----------------|-----------|-----------|-----------|
## Pacific | 1812 | 1542 | 3354 |
## | 1646.215 | 1707.785 | |
## | 16.696 | 16.094 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 12005 | 12454 | 24459 |
## ----------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 395.1133 d.f. = 8 p = 2.078646e-80
##
##
##
With 8 degrees of freedom and a \({\chi}^2\) value of 395, the p-value under the \({\chi}^2\) distribution, assuming \(H_0\) is true, is negligible (about \(2 \times 10^{-80}\)). Therefore we reject the null hypothesis.
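The degrees of freedom and the p-value follow directly from the table dimensions and the reported statistic. A minimal sketch, taking the \({\chi}^2\) value from the output above as given (the variable names are illustrative):

```r
chi_sq <- 395.1133           # statistic reported in the CrossTable output
n_rows <- 9                  # regions
n_cols <- 2                  # Yes / No

# Degrees of freedom for a test of independence: (rows - 1) * (columns - 1).
df <- (n_rows - 1) * (n_cols - 1)

# Upper-tail probability of the chi-square distribution under H0.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Critical value at the 5% significance level, for comparison.
critical <- qchisq(0.95, df = df)

print(df)
print(p_value)
print(critical)
```

The critical value for 8 degrees of freedom at the 5% level is about 15.51, so a statistic of 395 lies far inside the rejection region.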
The statistical analysis shows that the responses to the racdif3 question differ significantly (at the 5% \({\alpha}\) level) between regions. In other words, there is a significant association between the response given and the region. The chi-square test is an overall test and does not single out one region, but the South Atlantic cells make the largest contributions to \({\chi}^2\) (45.273 and 43.641), with fewer YES answers than expected (1970 observed versus about 2292 expected). This is merely an observation but one that raises many more questions which should be the subject of a more elaborate study:
The analysis uses the GSS data set http://bit.ly/dasi_gss_data (General Social Survey - 1972-2012), cleaned by Duke University. If the link does not download the file, use https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html.
The table below shows a random sample from the data set, restricted to the columns used in this study.
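# load(url("http://bit.ly/dasi_gss_data"))  # uncomment to load the gss data frame
subColumns <- c("year", "race", "sex", "region", "racdif3")
analysisData <- gss[, subColumns]
# Draw 35 rows at random; no seed is set, so each run returns a different sample
extr1 <- analysisData[sample(nrow(analysisData), 35), ]
extr1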
## year race sex region racdif3
## 54284 2010 White Female South Atlantic <NA>
## 19551 1986 White Male Pacific Yes
## 43736 2004 White Male Middle Atlantic <NA>
## 22481 1988 Black Female E. Nor. Central Yes
## 3962 1974 White Male Middle Atlantic <NA>
## 27850 1993 White Female E. Nor. Central Yes
## 10897 1980 White Female W. Nor. Central <NA>
## 41384 2002 White Female Middle Atlantic <NA>
## 31164 1994 White Male South Atlantic No
## 45035 2004 White Female E. Nor. Central <NA>
## 43171 2002 White Female Middle Atlantic <NA>
## 51097 2008 White Male Middle Atlantic Yes
## 32766 1996 White Female E. Nor. Central Yes
## 39763 2000 Black Female E. Nor. Central Yes
## 3883 1974 White Male Middle Atlantic <NA>
## 24096 1989 White Female South Atlantic No
## 26234 1990 White Male Pacific No
## 23116 1988 White Female W. Nor. Central <NA>
## 25960 1990 White Female W. Nor. Central Yes
## 38538 2000 White Female Pacific <NA>
## 45179 2004 Black Female E. Sou. Central <NA>
## 7756 1977 White Male E. Nor. Central Yes
## 32268 1994 White Female Mountain Yes
## 8202 1977 Black Female South Atlantic <NA>
## 45797 2004 White Male Mountain No
## 21388 1987 White Female E. Sou. Central <NA>
## 4199 1974 White Female Pacific <NA>
## 23887 1989 Black Female E. Nor. Central No
## 28327 1993 White Male Pacific <NA>
## 25295 1990 Black Female New England No
## 12889 1982 White Male Middle Atlantic <NA>
## 14158 1983 White Female E. Nor. Central <NA>
## 8020 1977 White Female Pacific Yes
## 48548 2006 White Male E. Nor. Central <NA>
## 33247 1996 White Male Mountain Yes