Statistics with R Specialization - Duke University

Week 5 - Inferential Statistics

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

Make sure the data file and the R Markdown file are in the same directory. Once loaded, the data object is called gss.

load("gss.Rdata")

Part 1: Data

Generalizability

The data were collected across survey years through full probability sampling (commonly called random sampling), using personal interviews. Within this design, samples were stratified by parameters such as age, region, race, and income. Participation in the interviews is voluntary and no variable is controlled, so this is an observational study. Because respondents were randomly sampled, the results can be generalized to the United States population, and statistical inference can be used to assess how well the sample reflects that population.

Causality

Causal conclusions cannot be drawn because there is no random assignment; the data are observations from independently and randomly drawn samples. The survey also uses non-response subsampling, which helps reduce non-response bias. Reducing this bias makes the sample more representative of the entire US population, but it does not make causal claims possible.


Part 2: Research question

Find out whether there is an association between social class and financial situation.

Interest: This question is motivated by curiosity about how many people are dissatisfied with their financial situation, and how this differs by sex.

Variables:

  • class: Subjective class identification
  • satfin: Satisfaction with financial situation
  • sex: Respondent’s sex

Part 3: Exploratory data analysis

Getting the required data

# Keep only the three variables of interest and drop "No Class" respondents
data <- gss %>%
  select(sex, class, satfin) %>%
  filter(class != "No Class")

Summary

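# Remove rows with missing values in any of the three variables
data <- data[complete.cases(data), ]
# "No Class" was already filtered out when the data was extracted,
# so this filter is only a safeguard and removes no further rows
data <- data %>%
  filter(class != "No Class")
# Summary of all variables
summary(data)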
##      sex                  class                  satfin     
##  Male  :22455   Lower Class  : 2962   Satisfied     :14795  
##  Female:28184   Working Class:23189   More Or Less  :22402  
##                 Middle Class :22852   Not At All Sat:13442  
##                 Upper Class  : 1636                         
##                 No Class     :    0

Plots

# Stacked counts of satisfaction level by sex, one panel per social class
ggplot(data, aes(x = sex)) +
  geom_bar(aes(fill = satfin), width = 0.7, position = "stack") +
  labs(x = "Sex", y = "Frequency", fill = "Satisfaction Level") +
  theme_minimal(base_size = 10) +
  facet_grid(. ~ class)

Narrative

The plot above suggests that social class is associated with financial satisfaction. The pattern also varies slightly between males and females.

  • People who identify as lower class are mostly not at all satisfied with their financial situation, and only a small percentage are satisfied.
  • Among the working class, more people are not at all satisfied or more or less satisfied than are satisfied with their financial situation.
  • Among the middle class, the share of people who are not at all satisfied is considerably lower than in the working class. In this category the largest group is more or less satisfied with their financial situation.
  • In the upper class, most people are satisfied and only a small share are not at all satisfied.
  • Overall, females are comparatively more dissatisfied with their financial situation than males.
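The statements above can be checked numerically with row proportions. The following is a minimal sketch, not part of the original analysis: it rebuilds the class-by-satisfaction counts reported in the contingency table in Part 4, and the names `counts` and `row_props` are illustrative.

```r
# Counts copied from the class-by-satfin contingency table in Part 4
counts <- matrix(
  c( 299,   787, 1876,
    4594, 10935, 7660,
    8913, 10260, 3679,
     989,   420,  227),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    class  = c("Lower Class", "Working Class", "Middle Class", "Upper Class"),
    satfin = c("Satisfied", "More Or Less", "Not At All Sat")
  )
)

# Share of each satisfaction level within each class (each row sums to 1)
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)
```

With the loaded data, the same table should be obtainable with `prop.table(table(data$class, data$satfin), margin = 1)`.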

Part 4: Inference

Hypotheses

H0: Social class and financial satisfaction are independent (there is no association between them).

HA: Social class and financial satisfaction are dependent (there is an association between them).

Create a contingency table

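# Cross-tabulate class (rows) against financial satisfaction (columns)
ct <- table(data$class, data$satfin)
# Drop the fifth row, the empty "No Class" level
ct <- ct[-5L, ]
ct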
##                
##                 Satisfied More Or Less Not At All Sat
##   Lower Class         299          787           1876
##   Working Class      4594        10935           7660
##   Middle Class       8913        10260           3679
##   Upper Class         989          420            227
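Removing the empty row by position works here but depends on "No Class" being the fifth level. A possible alternative, sketched below on toy data, is to drop unused factor levels before tabulating. The data frame `toy` and its values are hypothetical stand-ins, not GSS data.

```r
# Toy data standing in for the filtered GSS subset (hypothetical values)
toy <- data.frame(
  class  = factor(c("Lower Class", "Upper Class", "Lower Class"),
                  levels = c("Lower Class", "Upper Class", "No Class")),
  satfin = factor(c("Satisfied", "Not At All Sat", "Satisfied"))
)

# droplevels() removes factor levels that no longer occur in the data,
# so the unused "No Class" level never reaches the table
toy <- droplevels(toy)
toy_ct <- table(toy$class, toy$satfin)
toy_ct
```

Applied to the report's data, this would be `data <- droplevels(data)` before calling `table()`, which makes the positional `ct[-5L, ]` step unnecessary.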

Condition

  • Independence: the GSS documentation states that respondents were randomly sampled.
  • Sample size: as the contingency table above shows, the number of cases in each group is well below 10% of the corresponding population, which is far larger.
  • Each case contributes to exactly one cell of the table; rows with NA values were filtered out to remove any ambiguity.
  • Each cell has an expected count of at least 5; the smallest expected count reported with the test results below is 434.27.

# ct is already a table, so as.table() leaves it unchanged
# (as.table() has no na.rm argument)
cont <- as.table(ct)

mosaicplot(cont, shade = TRUE, las = 2, main = "Class and Financial Satisfaction")
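With `shade = TRUE`, the mosaic plot colors tiles by residuals from the independence model; as far as I understand the `mosaicplot()` documentation, these are Pearson residuals by default. The sketch below computes them directly from the counts in the contingency table above. The names `ct_counts`, `expected_counts`, and `pearson_resid` are illustrative.

```r
# Counts copied from the contingency table above
ct_counts <- matrix(
  c( 299,   787, 1876,
    4594, 10935, 7660,
    8913, 10260, 3679,
     989,   420,  227),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Lower Class", "Working Class", "Middle Class", "Upper Class"),
    c("Satisfied", "More Or Less", "Not At All Sat")
  )
)

# Expected counts under independence: row total * column total / grand total
expected_counts <- outer(rowSums(ct_counts), colSums(ct_counts)) / sum(ct_counts)

# Pearson residual per cell: positive means more cases than independence predicts
pearson_resid <- (ct_counts - expected_counts) / sqrt(expected_counts)
round(pearson_resid, 1)
```

Cells with large positive or negative residuals are the ones that deviate most from independence, which is what the shading in the plot highlights.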

Method(s) to be used and why and how

We use the chi-square test of independence to evaluate the relationship between these two categorical variables. It is the appropriate test because both variables are categorical, at least one of them has more than two levels, and the conditions for the test of independence (TOI) listed above are satisfied.

Calculate chi-square statistics

chidata <- chisq.test(ct)
chidata
## 
##  Pearson's Chi-squared test
## 
## data:  ct
## X-squared = 5668, df = 6, p-value < 2.2e-16
round(chidata$expected,2)
##                
##                 Satisfied More Or Less Not At All Sat
##   Lower Class      865.40      1310.35         786.26
##   Working Class   6775.04     10258.50        6155.46
##   Middle Class    6676.58     10109.41        6066.01
##   Upper Class      477.98       723.74         434.27
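The statistic reported above can be reproduced by hand, which also makes the expected-count condition explicit. This is a minimal sketch using the observed counts from the contingency table; the variable names are illustrative, and `chisq.test()` remains the method used in this report.

```r
# Observed counts copied from the contingency table
observed <- matrix(
  c( 299,   787, 1876,
    4594, 10935, 7660,
    8913, 10260, 3679,
     989,   420,  227),
  nrow = 4, byrow = TRUE
)

# Expected counts under independence: row total * column total / grand total
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Condition check: every cell should have at least 5 expected cases
stopifnot(all(expected >= 5))

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value; it is so small that it may print as 0
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The statistic and degrees of freedom should agree with the `chisq.test()` output, up to rounding.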

Result

The chi-square statistic is very large for 6 degrees of freedom, which gives a p-value far below the 5% significance level. We therefore reject the null hypothesis: the data provide strong evidence of an association between social class and financial satisfaction. Because the study is observational, this association does not establish causation.