Setup

The required packages and data were loaded as follows:

Load packages

library(ggplot2)
library(dplyr)
## Warning: Installed Rcpp (0.12.10) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
library(statsr)

Load data

load("gss.Rdata")
dim(gss)
## [1] 57061   114

Introduction

A research question was formulated on a real-world dataset and inferential statistical analyses were performed to check the hypothesis.

The question of interest was whether there is a significant association between respondents’ social class and their attitude towards homosexuality over time. This was an intriguing hypothesis as it would reflect social dimensions of homosexuality and help to identify whom policy makers should focus on in drafting policies and planning programmes to ensure the rights of this marginalized community. The study sample was dichotomized into two time frames, before 2006 and 2006 or afterwards, to evaluate whether the association differs over time.

The analysis comprised exploratory data analysis (EDA) and statistical inference on the data using the appropriate test: chi-square test for independence.

The null hypothesis was rejected and it was concluded that the data indicates that there is a relationship between the respondent’s social class and their attitudes towards homosexuality.

As the study is only a large-scale cross-sectional study/survey but not an experimental one, causality cannot be inferred. Temporality of the association cannot be dissected by cross-sectional data. However, as the sample was quite large, and randomly selected representing the entire population of interest, the results can be generalized to the population of interest and beyond the sample.


Part 1: Data

The GSS is a periodic survey aimed at collecting data on various aspects of contemporary American society. The objective is to monitor and evaluate trends and patterns of an array of societal aspects spanning attitudes, behaviors, and related attributes. A wealth of data on these aspects has been accumulated since 1972, allowing for analyses over time.

The GSS questionnaire encompasses demographic, behavioral, and attitudinal aspects as well as certain other domains of special interest such as civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events. Therefore, the GSS can be considered a rich source of sociological and attitudinal trend data in the United States. The GSS aims to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

Data collection process

Face-to-face interviews are used by the team at the National Opinion Research Center (NORC), hosted at the University of Chicago, to collect data. Adults living in the United States are the target group/study population, from which a random sample is drawn using an area probability design that randomly selects respondents in households across the United States. The GSS sample is a heterogeneous mix of rural, urban and semi-urban people. Participation in the study is strictly voluntary.

The survey has been conducted annually from 1972 to 1994 (except in 1979, 1981, and 1992). Since 1994, it has been conducted every other year. The survey takes about 90 minutes to administer. As of 2014, 30 national samples with 59,599 respondents and 5,900+ variables have been collected.

Generalizability, Causality, and Bias

As the GSS employs random sampling, findings are generalizable to the US population.

The GSS is, however, an observational study (more specifically, a cross-sectional study) with no randomized assignment/allocation to treatments, so all relationships indicate association but not causation, as biases, confounders and random error cannot be ruled out.

Changes in data collection methods over the years may have introduced bias, if the changes were systematically different across groups. As an example, it was not until 2006 that Spanish-speaking adults were included in the survey. Voluntary participation may introduce a form of selection bias, if the participants and non-participants are systematically different. The relatively long duration of the interview (about 90 minutes) may increase non-response, leading to a biased outcome. There is also the possibility of underreporting/misclassification bias, especially if the questions are on socially sensitive or taboo issues such as sexuality, politics, religion, smoking, drug use etc.


Part 2: Research question

Is there an association between a respondent’s social class and their opinion on homosexuality? This research question was interesting as it can help unravel any social stratification of sexuality related opinions in the American society, if it actually exists. Understanding such societal differences in sexually-sensitive domains is important in crafting sexual health policies aimed at vulnerable groups and also planning and implementing cross-cutting programmes to enhance their protection and integration etc. So this will provide a crude but broader picture that would be useful for higher level policy makers, administrators and programme planners.

Based on that question, the analysis was done using the following variables:

year - year the observations were made.
class - self-perceived social class of the respondent.
homosex - attitude/opinion about homosexuality.

Modifications to the data to aid in analysis: All NAs were removed. The survey years were dichotomized as before 2006 and 2006 or afterwards to evaluate whether there is any time-bound change in the attitude towards homosexuality. The attitude to homosexuality was recoded into a new logical variable, positive, which is TRUE only for the response “Not Wrong At All”.


Part 3: Exploratory data analysis

Subset the dataset

df_study <- select(gss,year,class,homosex) %>% na.omit() %>%
  mutate(positive=grepl("Not Wrong At All",homosex)) %>%
  mutate(recent=as.factor(ifelse(year>=2006,"R","H")))

#summary of the whole dataset/overview

alltime <- df_study
summary(alltime)
##       year                class                   homosex     
##  Min.   :1973   Lower Class  : 1669   Always Wrong    :20947  
##  1st Qu.:1982   Working Class:14727   Almst Always Wrg: 1529  
##  Median :1991   Middle Class :14477   Sometimes Wrong : 2177  
##  Mean   :1992   Upper Class  : 1017   Not Wrong At All: 7171  
##  3rd Qu.:2000   No Class     :    1   Other           :   67  
##  Max.   :2012                                                 
##   positive       recent   
##  Mode :logical   H:26286  
##  FALSE:24720     R: 5605  
##  TRUE :7171               
##                           
##                           
## 
alltime_table <- table(alltime$class,alltime$positive)
alltime_table
##                
##                 FALSE  TRUE
##   Lower Class    1312   357
##   Working Class 11785  2942
##   Middle Class  10917  3560
##   Upper Class     706   311
##   No Class          0     1
prop.table(alltime_table)
##                
##                        FALSE         TRUE
##   Lower Class   4.114013e-02 1.119438e-02
##   Working Class 3.695400e-01 9.225173e-02
##   Middle Class  3.423223e-01 1.116302e-01
##   Upper Class   2.213791e-02 9.751968e-03
##   No Class      0.000000e+00 3.135681e-05
#The count table and the proportions table both indicate a difference in attitude by classes.
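
Note that prop.table() without a margin argument divides every cell by the grand total, so the figures above are joint proportions. To compare attitudes between classes, row-wise proportions are easier to read. The following is a minimal sketch in base R; the matrix name alltime_counts is introduced here for illustration only, and its values are copied from the count table above (the single "No Class" respondent is left out).

```r
# Counts copied from the all-time table above; rows are classes,
# columns are the values of the logical variable `positive`.
alltime_counts <- matrix(
  c( 1312,  357,
    11785, 2942,
    10917, 3560,
      706,  311),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    class    = c("Lower Class", "Working Class", "Middle Class", "Upper Class"),
    positive = c("FALSE", "TRUE")
  )
)

# margin = 1 divides each cell by its row total, so every row sums to 1.
row_props <- prop.table(alltime_counts, margin = 1)
round(row_props, 3)
```

Each row sums to 1, so the TRUE column can be compared directly across classes: roughly 21%, 20%, 25% and 31% from the lower to the upper class.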

#Visualization of data

# Legend labels follow the order of the fill values: FALSE (negative) first, then TRUE (positive).
g <- ggplot(alltime) + aes(x=class,fill=positive) + geom_bar(position = "fill") +
  labs(x="Social class",y="Proportion",title="Impact of social class on Positive View towards homosexuality") +
  scale_fill_discrete(name="Opinion",labels=c("Negative","Positive"))
g

Now EDA will be performed on the recent data; 2006 and afterwards.

since2006 <- filter(alltime,recent=="R")
summary(since2006)
##       year                class                  homosex    
##  Min.   :2006   Lower Class  : 400   Always Wrong    :2809  
##  1st Qu.:2006   Working Class:2569   Almst Always Wrg: 228  
##  Median :2008   Middle Class :2456   Sometimes Wrong : 401  
##  Mean   :2009   Upper Class  : 180   Not Wrong At All:2167  
##  3rd Qu.:2010   No Class     :   0   Other           :   0  
##  Max.   :2012                                               
##   positive       recent  
##  Mode :logical   H:   0  
##  FALSE:3438      R:5605  
##  TRUE :2167              
##                          
##                          
## 
since2006_table <- table(since2006$class,since2006$positive)
since2006_table
##                
##                 FALSE TRUE
##   Lower Class     251  149
##   Working Class  1652  917
##   Middle Class   1438 1018
##   Upper Class      97   83
##   No Class          0    0
prop.table(since2006_table)
##                
##                      FALSE       TRUE
##   Lower Class   0.04478145 0.02658341
##   Working Class 0.29473684 0.16360393
##   Middle Class  0.25655665 0.18162355
##   Upper Class   0.01730598 0.01480821
##   No Class      0.00000000 0.00000000

There appears to be a difference in attitude among classes in recent times (2006 and afterwards) too.

Visualization

# Legend labels follow the order of the fill values: FALSE (negative) first, then TRUE (positive).
h <- ggplot(since2006) + aes(x=class,fill=positive) + geom_bar(position = "fill") +
  labs(x="Social class",y="Proportion",title="Impact of social class on Positive View of homosexuality") +
  scale_fill_discrete(name="Opinion",labels=c("Negative","Positive"))
h

A visual comparison of historical and the recent subsets was next made as follows:

# Legend labels follow the order of the fill values: FALSE (negative) first, then TRUE (positive).
i <- ggplot(alltime) + aes(x=recent,fill=positive) + geom_bar(position = "fill") + facet_grid(.~class) +
  labs(x="Historical versus Recent",y="Proportion",title="Impact of social class on Positive View of homosexuality") +
  scale_fill_discrete(name="Opinion",labels=c("Negative","Positive"))
i

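Comparing the two periods, the proportion having a positive attitude is higher in recent times (2006 and afterwards) than before 2006 across all social classes. From the count tables above, for example, it rose from about 16% (208 of 1269) to about 37% (149 of 400) in the lower class.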
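
A numeric cross-check of the graph can be sketched from the two count tables above: subtracting the 2006-and-afterwards counts from the all-time counts gives the historical counts, and the share of TRUE in each row is the proportion with a positive attitude. The object names below (classes, alltime_counts, recent_counts, historical_counts, positive_share) are illustrative assumptions and not part of the original analysis; the "No Class" row is omitted.

```r
classes <- c("Lower Class", "Working Class", "Middle Class", "Upper Class")

# Counts copied from the all-time and the 2006-and-afterwards tables above.
alltime_counts <- matrix(c(1312, 357, 11785, 2942, 10917, 3560, 706, 311),
                         ncol = 2, byrow = TRUE,
                         dimnames = list(classes, c("FALSE", "TRUE")))
recent_counts <- matrix(c(251, 149, 1652, 917, 1438, 1018, 97, 83),
                        ncol = 2, byrow = TRUE,
                        dimnames = list(classes, c("FALSE", "TRUE")))

# Historical counts (before 2006) are the difference between the two tables.
historical_counts <- alltime_counts - recent_counts

# Proportion with a positive attitude per class, by period.
positive_share <- cbind(
  historical = prop.table(historical_counts, margin = 1)[, "TRUE"],
  recent     = prop.table(recent_counts, margin = 1)[, "TRUE"]
)
round(positive_share, 3)
```

The resulting matrix has one row per class and one column per period, so the direction of the change can be read off directly without relying on the legend of the plot.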


Part 4: Inference

Hypotheses:

The null hypothesis (H0):The respondents’ opinion on homosexuality is independent of their social class.

The alternative hypothesis (HA): There is an association between the respondents’ opinion on homosexuality and their social class.

Methods

Since the analysis involves two categorical variables (social class and attitude to homosexuality), and one of them (social class) has more than two levels, the chi-square test for independence was chosen. The structure of the data can be seen below:

str(alltime)
## 'data.frame':    31891 obs. of  5 variables:
##  $ year    : int  1973 1973 1973 1973 1973 1973 1973 1973 1973 1973 ...
##  $ class   : Factor w/ 5 levels "Lower Class",..: 2 3 2 3 2 2 2 2 3 2 ...
##  $ homosex : Factor w/ 5 levels "Always Wrong",..: 1 1 4 1 1 1 1 1 2 2 ...
##  $ positive: logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ recent  : Factor w/ 2 levels "H","R": 1 1 1 1 1 1 1 1 1 1 ...
str(since2006)
## 'data.frame':    5605 obs. of  5 variables:
##  $ year    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ class   : Factor w/ 5 levels "Lower Class",..: 2 2 3 2 3 3 1 2 2 2 ...
##  $ homosex : Factor w/ 5 levels "Always Wrong",..: 1 4 1 4 1 3 4 1 1 1 ...
##  $ positive: logi  FALSE TRUE FALSE TRUE FALSE FALSE ...
##  $ recent  : Factor w/ 2 levels "H","R": 2 2 2 2 2 2 2 2 2 2 ...

Conditions

  1. Independence between observations. This is assumed to be intact as random sampling was used in the GSS.

  2. As can be seen below, every cell has a count of at least 5 except in the “No Class” category.

study_table1 <- table(alltime$class,alltime$positive)
study_table1
##                
##                 FALSE  TRUE
##   Lower Class    1312   357
##   Working Class 11785  2942
##   Middle Class  10917  3560
##   Upper Class     706   311
##   No Class          0     1
sum(study_table1<5)
## [1] 2

So that row was removed.

t1 <- study_table1[-5,]
t1
##                
##                 FALSE  TRUE
##   Lower Class    1312   357
##   Working Class 11785  2942
##   Middle Class  10917  3560
##   Upper Class     706   311
study_table2 <- table(since2006$class,since2006$positive)
study_table2
##                
##                 FALSE TRUE
##   Lower Class     251  149
##   Working Class  1652  917
##   Middle Class   1438 1018
##   Upper Class      97   83
##   No Class          0    0

Again, two cells have values <5.

sum(study_table2<5)
## [1] 2

The zero cells were removed from the table.

t2 <- study_table2[-5,]
t2
##                
##                 FALSE TRUE
##   Lower Class     251  149
##   Working Class  1652  917
##   Middle Class   1438 1018
##   Upper Class      97   83
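
The sample-size condition for the chi-square test is conventionally stated in terms of expected rather than observed cell counts. As a hedged sketch, the expected counts under independence can be read from the expected component of the object returned by chisq.test(). The matrices t1_counts and t2_counts are illustrative stand-ins whose values are copied from t1 and t2 above.

```r
classes <- c("Lower Class", "Working Class", "Middle Class", "Upper Class")

# Illustrative copies of t1 (all years) and t2 (2006 and afterwards).
t1_counts <- matrix(c(1312, 357, 11785, 2942, 10917, 3560, 706, 311),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(classes, c("FALSE", "TRUE")))
t2_counts <- matrix(c(251, 149, 1652, 917, 1438, 1018, 97, 83),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(classes, c("FALSE", "TRUE")))

# Expected cell counts if class and attitude were independent.
expected_alltime <- chisq.test(t1_counts)$expected
expected_recent  <- chisq.test(t2_counts)$expected

# The condition holds when every expected count is at least 5.
all(expected_alltime >= 5)
all(expected_recent >= 5)
```

With the "No Class" row removed, both checks return TRUE, so the condition is met for the two tables that are tested.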

The chi-square test for independence was performed on both the entire dataset and the recent subset.

# The contingency table is the only argument needed; the extra vectors would be matched to the unrelated arguments y and correct.
c_alltime<-chisq.test(t1)
c_alltime
## 
##  Pearson's Chi-squared test
## 
## data:  t1
## X-squared = 129.37, df = 3, p-value < 2.2e-16
# As above, only the contingency table is passed.
c_since2006<-chisq.test(t2)
c_since2006
## 
##  Pearson's Chi-squared test
## 
## data:  t2
## X-squared = 22.133, df = 3, p-value = 6.121e-05
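
To make the test less of a black box, the statistic for the entire dataset can be reproduced by hand. This is a sketch under the usual definitions: expected counts are (row total × column total) / grand total, the statistic is the sum of (observed − expected)² / expected over all cells, and the degrees of freedom are (rows − 1) × (columns − 1). The names t1_counts, expected, chi_sq, df and p_value are assumptions introduced for this illustration; the counts are copied from t1.

```r
# Illustrative copy of t1 (all years, "No Class" row removed).
t1_counts <- matrix(c(1312, 357, 11785, 2942, 10917, 3560, 706, 311),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(
                      c("Lower Class", "Working Class", "Middle Class", "Upper Class"),
                      c("FALSE", "TRUE")))

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(t1_counts), colSums(t1_counts)) / sum(t1_counts)

# Pearson chi-square statistic and its degrees of freedom.
chi_sq <- sum((t1_counts - expected)^2 / expected)
df     <- (nrow(t1_counts) - 1) * (ncol(t1_counts) - 1)

# Upper-tail probability of the chi-square distribution with df degrees of freedom.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(statistic = chi_sq, df = df, p.value = p_value)
```

The hand computation gives the same statistic (about 129.37) and degrees of freedom (3) as the chisq.test() output for t1 above.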

Interpretation of results

In both cases, the null hypothesis is rejected (p < 2.2e-16 for the entire dataset and p = 6.121e-05 for the recent subset), and there is a statistically significant association between the social class and the opinion on homosexuality of respondents. In other words, attitude to homosexuality varies by social class.