Introduction:

The ‘Other’, as in ‘a person different from yourself’, has always been a difficult topic, and it remains so today. This paper addresses the question: “Do people living in the US South Atlantic region hold a markedly different opinion from people in other US regions on whether racial differences in welfare are due to a lack of education?”

Since ‘Ferguson’, the topic under discussion has received international attention. However, it is not difficult to imagine a similar survey in, say, any European country producing similar results (perhaps with the word ‘racial’ replaced by ‘ethnic’ or ‘immigrant’).

Data:

The General Social Survey (GSS) is a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. It is conducted by the National Opinion Research Center at the University of Chicago through in-person interviews with adults (18+) in randomly selected households. The survey was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992) and has been conducted every other year since 1994. It takes about 90 minutes to administer. See https://en.wikipedia.org/wiki/General_Social_Survey for a detailed description of the data collection process. The original, full survey is available at http://www3.norc.org/GSS+Website/ .

Each case is one completed in-person interview with an adult (18+) from a randomly selected household. This paper mainly uses two variables from the survey:

  1. region (categorical)
  2. racdif3 - DIFFERENCES DUE TO LACK OF EDUCATION (categorical: Yes/No)

Region is largely self-explanatory. For racdif3 it helps to repeat the question asked: “On the average (negroes/blacks/African-Americans) have worse jobs, income, and housing than white people. Do you think these differences are: c. Because most (negroes/blacks/African-Americans) don’t have the chance for education that it takes to rise out of poverty?”

The study is observational and is based on historical data (1972-2010). There is no experiment and there are no control groups. Data are collected (observed) via surveys, and each person interviewed was asked the same questions. The goal of the survey is to relate demographic characteristics to attitudes and beliefs.

The survey randomly samples adults from all over the US and covers all regions with sufficient cases. Given the size and duration of the survey, the sample can reasonably be treated as unbiased in its selection of people, so results can be generalised to the US adult population. However, the second variable selected for this analysis (racdif3) has many missing answers, and we will have to investigate whether this indirectly introduces a bias (for example, a voluntary response bias on this particular question).

The analysis will not establish any causal links; it will only describe the data and look for associations.

Exploratory data analysis:

Let’s first look at a frequency cross-tabulation of the two variables region and racdif3.

##                  racdif3
## region             Yes   No <NA>
##   New England      736  468 1471
##   Middle Atlantic 1769 1618 5048
##   E. Nor. Central 2165 2161 6246
##   W. Nor. Central 1019  845 2357
##   South Atlantic  1970 2700 6307
##   E. Sou. Central  629  996 2140
##   W. Sou. Central  989 1415 2959
##   Mountain         916  709 1798
##   Pacific         1812 1542 4276
##   <NA>               0    0    0

All cells have sufficiently high frequencies. There are many missing (NA) values; the following table shows, per region, the share of Yes, No and NA answers. As can be seen, the NA share is fairly similar across regions (between 0.525 and 0.598).

##                  racdif3
## region              Yes    No  <NA>
##   New England     0.275 0.175 0.550
##   Middle Atlantic 0.210 0.192 0.598
##   E. Nor. Central 0.205 0.204 0.591
##   W. Nor. Central 0.241 0.200 0.558
##   South Atlantic  0.179 0.246 0.575
##   E. Sou. Central 0.167 0.265 0.568
##   W. Sou. Central 0.184 0.264 0.552
##   Mountain        0.268 0.207 0.525
##   Pacific         0.237 0.202 0.560
##   <NA>
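
The shares in the table above can be recomputed from the frequency table. The following is a minimal sketch in base R, with the counts typed in by hand from the first table; the object names regions, counts and shares are illustrative and not part of the original analysis.

```r
# Counts copied by hand from the frequency table (columns: Yes, No, NA)
regions <- c("New England", "Middle Atlantic", "E. Nor. Central",
             "W. Nor. Central", "South Atlantic", "E. Sou. Central",
             "W. Sou. Central", "Mountain", "Pacific")
counts <- matrix(c( 736,  468, 1471,
                   1769, 1618, 5048,
                   2165, 2161, 6246,
                   1019,  845, 2357,
                   1970, 2700, 6307,
                    629,  996, 2140,
                    989, 1415, 2959,
                    916,  709, 1798,
                   1812, 1542, 4276),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(region = regions,
                                 racdif3 = c("Yes", "No", "<NA>")))

# Row-wise shares: each row is divided by its own total
shares <- round(prop.table(counts, margin = 1), 3)
print(shares)
```

With the full data frame loaded, the same shares could presumably be obtained directly from a table of gss$region against gss$racdif3 that keeps the NA level.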

If we ignore the NA values for a moment, what does the distribution of YES/NO answers look like per region? In the graph below, the blue line is the overall average share of YES answers. Eyeballing the graph suggests there are relevant differences from region to region.
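
As a numeric companion to the graph, the share of YES answers among respondents can be computed per region from the frequency table. The sketch below types in the counts by hand; the vector names yes, no, yes_share and overall are illustrative.

```r
# Yes and No counts per region, copied from the frequency table
yes <- c("New England" = 736, "Middle Atlantic" = 1769,
         "E. Nor. Central" = 2165, "W. Nor. Central" = 1019,
         "South Atlantic" = 1970, "E. Sou. Central" = 629,
         "W. Sou. Central" = 989, "Mountain" = 916, "Pacific" = 1812)
no  <- c(468, 1618, 2161, 845, 2700, 996, 1415, 709, 1542)  # same order

# Share of Yes among those who answered, per region and overall
yes_share <- yes / (yes + no)
overall   <- sum(yes) / sum(yes + no)

print(round(sort(yes_share), 3))
print(round(overall, 3))
```

A call to barplot(yes_share) followed by abline(h = overall, col = "blue") would draw a chart of the same kind, with the overall share as a horizontal reference line.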

The NA values require a bit more investigation, as they could introduce a bias. The next tables and charts examine the distribution of NA values per region, by sex and by race.

The table and graph below show, per region, the proportions by sex of the people who did not respond to the question.

##                  sex
## region             Male Female
##   New England     0.457  0.543
##   Middle Atlantic 0.434  0.566
##   E. Nor. Central 0.438  0.562
##   W. Nor. Central 0.464  0.536
##   South Atlantic  0.436  0.564
##   E. Sou. Central 0.416  0.584
##   W. Sou. Central 0.431  0.569
##   Mountain        0.433  0.567
##   Pacific         0.466  0.534

The table and graph below show, per region, the proportions by race of the people who did not respond to the question.

##                  race
## region            White Black Other
##   New England     0.917 0.054 0.029
##   Middle Atlantic 0.806 0.154 0.041
##   E. Nor. Central 0.845 0.133 0.022
##   W. Nor. Central 0.888 0.089 0.023
##   South Atlantic  0.738 0.231 0.031
##   E. Sou. Central 0.770 0.221 0.008
##   W. Sou. Central 0.739 0.203 0.058
##   Mountain        0.906 0.022 0.072
##   Pacific         0.825 0.076 0.099

It is not obvious from the charts above whether the NA values introduce a relevant bias into this study. In the next chapter we will assume that the data are not biased and originate from random sampling. The question remains: is the observed difference in the proportions in figure 1 statistically significant?
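
One way to complement the visual check, not part of the original analysis, is to test whether the share of missing answers is associated with region. The sketch below builds an answered/missing table from the counts in the frequency table and runs chisq.test from base R; the object names are illustrative. With samples this large a formal test can flag differences that are small in practical terms, so the size of the differences matters as much as the p-value.

```r
regions <- c("New England", "Middle Atlantic", "E. Nor. Central",
             "W. Nor. Central", "South Atlantic", "E. Sou. Central",
             "W. Sou. Central", "Mountain", "Pacific")

# Answered = Yes + No per region; missing = NA per region (from the table)
n_answered <- c(1204, 3387, 4326, 1864, 4670, 1625, 2404, 1625, 3354)
n_missing  <- c(1471, 5048, 6246, 2357, 6307, 2140, 2959, 1798, 4276)

tab <- cbind(answered = n_answered, missing = n_missing)
rownames(tab) <- regions

# Chi-square test of independence between region and missingness
res <- chisq.test(tab)
print(res)

# Share of missing answers per region, for judging the practical size
print(round(n_missing / (n_answered + n_missing), 3))
```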

Inference:

We focus again on figure 1 from the previous chapter. The null hypothesis \(H_0\) is that the proportions of YES/NO answers are the same in every region, i.e. that the answer is independent of region. The alternative hypothesis \(H_A\) is that the proportions differ between regions. We test at the 5% significance level (\({\alpha} = 0.05\)). We are investigating two categorical variables, at least one of which has more than two levels, so a chi-square test of independence is appropriate. For this we use the CrossTable function from the gmodels package. It shows the observed and expected counts, as well as the \({\chi}^2\) contribution per cell.

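The formula used for Pearson’s chi-squared test is, with \(O_k\) the observed count and \(E_k\) the expected count in cell \(k\), and \(n\) the number of cells: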

\({\chi}^2=\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}\)

As discussed in the previous chapter, we treat the data as a random sample with no clear bias, and the expected count in every cell, for both YES and NO answers, is well above the usual minimum of 5. The people interviewed are only a tiny fraction of the total US population (well under 10%), and each case contributes to one cell only.
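
To make the formula and the expected-count condition concrete, the sketch below recomputes the expected counts and the per-cell contributions by hand in base R. Under independence the expected count of a cell is its row total times its column total, divided by the grand total. The counts are typed in from the frequency table and the object names are illustrative.

```r
regions <- c("New England", "Middle Atlantic", "E. Nor. Central",
             "W. Nor. Central", "South Atlantic", "E. Sou. Central",
             "W. Sou. Central", "Mountain", "Pacific")
observed <- matrix(c( 736,  468, 1769, 1618, 2165, 2161,
                     1019,  845, 1970, 2700,  629,  996,
                      989, 1415,  916,  709, 1812, 1542),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(region = regions,
                                   racdif3 = c("Yes", "No")))

# E = (row total * column total) / grand total, for every cell
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Per-cell contribution (O - E)^2 / E; their sum is the test statistic
contrib <- (observed - expected)^2 / expected

print(round(expected["New England", ], 3))
print(round(contrib["New England", ], 3))
print(min(expected))   # smallest expected count, to check the >= 5 condition
print(sum(contrib))    # chi-square statistic
```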

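library(gmodels)
# Cross-tabulate region by racdif3, printing observed counts, expected counts
# and per-cell chi-square contributions, and run Pearson's chi-squared test.
# Cases with a missing answer are left out (24459 observations remain).
analysis <- CrossTable(gss$region, gss$racdif3, expected = TRUE,
                       prop.t = FALSE, prop.c = FALSE, prop.r = FALSE,
                       chisq = TRUE, prop.chisq = TRUE)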
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |              Expected N |
## | Chi-square contribution |
## |-------------------------|
## 
##  
## Total Observations in Table:  24459 
## 
##  
##                 | gss$racdif3 
##      gss$region |       Yes |        No | Row Total | 
## ----------------|-----------|-----------|-----------|
##     New England |       736 |       468 |      1204 | 
##                 |   590.949 |   613.051 |           | 
##                 |    35.603 |    34.320 |           | 
## ----------------|-----------|-----------|-----------|
## Middle Atlantic |      1769 |      1618 |      3387 | 
##                 |  1662.412 |  1724.588 |           | 
##                 |     6.834 |     6.588 |           | 
## ----------------|-----------|-----------|-----------|
## E. Nor. Central |      2165 |      2161 |      4326 | 
##                 |  2123.293 |  2202.707 |           | 
##                 |     0.819 |     0.790 |           | 
## ----------------|-----------|-----------|-----------|
## W. Nor. Central |      1019 |       845 |      1864 | 
##                 |   914.891 |   949.109 |           | 
##                 |    11.847 |    11.420 |           | 
## ----------------|-----------|-----------|-----------|
##  South Atlantic |      1970 |      2700 |      4670 | 
##                 |  2292.136 |  2377.864 |           | 
##                 |    45.273 |    43.641 |           | 
## ----------------|-----------|-----------|-----------|
## E. Sou. Central |       629 |       996 |      1625 | 
##                 |   797.585 |   827.415 |           | 
##                 |    35.634 |    34.349 |           | 
## ----------------|-----------|-----------|-----------|
## W. Sou. Central |       989 |      1415 |      2404 | 
##                 |  1179.935 |  1224.065 |           | 
##                 |    30.897 |    29.783 |           | 
## ----------------|-----------|-----------|-----------|
##        Mountain |       916 |       709 |      1625 | 
##                 |   797.585 |   827.415 |           | 
##                 |    17.581 |    16.947 |           | 
## ----------------|-----------|-----------|-----------|
##         Pacific |      1812 |      1542 |      3354 | 
##                 |  1646.215 |  1707.785 |           | 
##                 |    16.696 |    16.094 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |     12005 |     12454 |     24459 | 
## ----------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  395.1133     d.f. =  8     p =  2.078646e-80 
## 
## 
## 

With 8 degrees of freedom and a \({\chi}^2\) value of 395, the p-value, i.e. the probability of observing a statistic at least this large if \(H_0\) were true, is negligible (about \(2 \times 10^{-80}\)). Therefore we reject the null hypothesis.
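
The tail probability can be checked directly against the \({\chi}^2\) distribution. The sketch below uses the base R functions qchisq and pchisq with the statistic and degrees of freedom reported in the output above; the variable names are illustrative.

```r
chi2_stat <- 395.1133   # statistic reported by CrossTable
df        <- 8          # (9 regions - 1) * (2 answers - 1)

# Critical value at the 5% level: reject H0 if the statistic exceeds it
critical <- qchisq(0.95, df = df)

# Upper-tail probability of the statistic under H0 (the p-value)
p_value <- pchisq(chi2_stat, df = df, lower.tail = FALSE)

print(critical)
print(p_value)
print(chi2_stat > critical)
```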

Conclusion:

The statistical analysis shows a significant association (at the 5% \({\alpha}\) level) between region and the response to the racdif3 question. The South Atlantic region is the largest single contributor to the \({\chi}^2\) statistic: it has fewer YES answers than expected under independence (1970 observed against about 2292 expected). Note that the chi-square test is an overall test; on its own it does not single out which regions differ from which. This is merely an observation, but one that raises many more questions which should be the subject of a more elaborate study.
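
Because the research question singles out the South Atlantic region, one possible follow-up, which is not part of the analysis above, is a two-sample test of proportions comparing South Atlantic with all other regions combined. The sketch below uses prop.test from base R with counts taken from the CrossTable output; the variable names are illustrative. Since the comparison was chosen after looking at the data, its p-value should be read with some caution.

```r
# Yes answers and respondents in South Atlantic (from the CrossTable output)
sa_yes <- 1970
sa_n   <- 4670

# All other regions combined: column and grand totals minus South Atlantic
other_yes <- 12005 - sa_yes
other_n   <- 24459 - sa_n

# Two-sided test of equal Yes proportions in the two groups
res <- prop.test(x = c(sa_yes, other_yes), n = c(sa_n, other_n))

print(res$estimate)   # estimated Yes share in each group
print(res$conf.int)   # 95% confidence interval for the difference
print(res$p.value)
```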

References:

The analysis uses the GSS data set http://bit.ly/dasi_gss_data (General Social Survey, 1972-2012), cleaned by Duke University. If the link does not download the file, use https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html.

Appendix:

The table below shows a random sample of rows from the data set, restricted to the columns used in this study.

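# Load the data set (uncomment if gss is not already in the workspace)
#load(url("http://bit.ly/dasi_gss_data"))
# Keep only the columns used in this study
subColumns <- c("year", "race", "sex", "region", "racdif3")
analysisData <- gss[, subColumns]
# Draw 35 rows at random (no seed is set, so the rows differ between runs)
extr1 <- analysisData[sample(nrow(analysisData), 35), ]
extr1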
##       year  race    sex          region racdif3
## 54284 2010 White Female  South Atlantic    <NA>
## 19551 1986 White   Male         Pacific     Yes
## 43736 2004 White   Male Middle Atlantic    <NA>
## 22481 1988 Black Female E. Nor. Central     Yes
## 3962  1974 White   Male Middle Atlantic    <NA>
## 27850 1993 White Female E. Nor. Central     Yes
## 10897 1980 White Female W. Nor. Central    <NA>
## 41384 2002 White Female Middle Atlantic    <NA>
## 31164 1994 White   Male  South Atlantic      No
## 45035 2004 White Female E. Nor. Central    <NA>
## 43171 2002 White Female Middle Atlantic    <NA>
## 51097 2008 White   Male Middle Atlantic     Yes
## 32766 1996 White Female E. Nor. Central     Yes
## 39763 2000 Black Female E. Nor. Central     Yes
## 3883  1974 White   Male Middle Atlantic    <NA>
## 24096 1989 White Female  South Atlantic      No
## 26234 1990 White   Male         Pacific      No
## 23116 1988 White Female W. Nor. Central    <NA>
## 25960 1990 White Female W. Nor. Central     Yes
## 38538 2000 White Female         Pacific    <NA>
## 45179 2004 Black Female E. Sou. Central    <NA>
## 7756  1977 White   Male E. Nor. Central     Yes
## 32268 1994 White Female        Mountain     Yes
## 8202  1977 Black Female  South Atlantic    <NA>
## 45797 2004 White   Male        Mountain      No
## 21388 1987 White Female E. Sou. Central    <NA>
## 4199  1974 White Female         Pacific    <NA>
## 23887 1989 Black Female E. Nor. Central      No
## 28327 1993 White   Male         Pacific    <NA>
## 25295 1990 Black Female     New England      No
## 12889 1982 White   Male Middle Atlantic    <NA>
## 14158 1983 White Female E. Nor. Central    <NA>
## 8020  1977 White Female         Pacific     Yes
## 48548 2006 White   Male E. Nor. Central    <NA>
## 33247 1996 White   Male        Mountain     Yes