The data were collected by full probability sampling (commonly called random sampling) across multiple years, using personal interviews. Within this design, the sample was stratified by parameters such as age, region, race, and income. Because this is an interview survey, participation is voluntary and no variable is controlled, which makes it an observational study. Since respondents were sampled at random, the results can be generalized to the United States population, and significance tests can gauge how far patterns in the sample are likely to hold for that population.
Causal conclusions cannot be drawn because there is no random assignment; the data are observations from independently and randomly drawn samples. The survey also uses non-response subsampling, which helps reduce non-response bias. Reducing this bias makes the sample more representative of the entire US population and strengthens the basis for the significance tests.
Getting the required data
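library(dplyr)  # provides %>% and filter()
# remove rows with missing values
data <- data[complete.cases(data), ]
# drop respondents who reported no social class
data <- data %>%
  filter(class != "No Class")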
#summary of all variables
summary(data)
##      sex                  class                 satfin
##  Male  :22455   Lower Class  : 2962   Satisfied     :14795
##  Female:28184   Working Class:23189   More Or Less  :22402
##                 Middle Class :22852   Not At All Sat:13442
##                 Upper Class  : 1636
##                 No Class     :    0
library(ggplot2)
# stacked bars of satisfaction counts by sex, one panel per social class
ggplot(data, aes(x = sex)) +
  geom_bar(aes(fill = satfin), width = 0.7, position = "stack") +
  labs(x = "Sex", y = "Frequency", fill = "Satisfaction Level") +
  theme_minimal(base_size = 10) +
  facet_grid(. ~ class)
The plot suggests that social class is associated with financial satisfaction. The pattern also varies slightly between male and female respondents.
H0 : Social class and financial satisfaction are independent (there is no association between them)
HA : Social class and financial satisfaction are associated
Create a contingency table
##
##                 Satisfied More Or Less Not At All Sat
##   Lower Class         299          787           1876
##   Working Class      4594        10935           7660
##   Middle Class       8913        10260           3679
##   Upper Class         989          420            227
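The call that built `ct` is not shown in this write-up. A minimal sketch, assuming it was a plain cross-tabulation of `class` against `satfin`: the published counts are re-entered by hand so the snippet runs without the survey data, and the `table()` line in the comment is only a guess at the original call. The row proportions at the end put numbers on the pattern seen in the bar plot.

```r
# Counts copied from the contingency table above (rows: class, columns: satfin).
ct <- as.table(matrix(
  c( 299,   787, 1876,
    4594, 10935, 7660,
    8913, 10260, 3679,
     989,   420,  227),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Lower Class", "Working Class", "Middle Class", "Upper Class"),
    c("Satisfied", "More Or Less", "Not At All Sat")
  )
))

# Assumption: with the cleaned data frame, an equivalent table might be built as
# ct <- table(droplevels(data$class), data$satfin)

# Share of each satisfaction level within a class (each row sums to 1).
row_share <- prop.table(ct, margin = 1)
round(row_share, 3)
```

Read row-wise, about 10% of Lower Class respondents report being satisfied, against about 60% of Upper Class respondents.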
# as.table() has no na.rm argument, so it is dropped here
cont <- as.table(as.matrix(ct))
mosaicplot(cont, shade = TRUE, las = 2, main = "Class and Financial Satisfaction")
We use a chi-square test of independence to assess the relationship between these two categorical variables. Both variables have more than two levels, and the conditions for the test of independence (TOI) are satisfied: the observations come from a random sample and can be treated as independent, and every expected cell count is far above 5.
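As a worked illustration of the expected-count condition, the expected count for a cell under independence is the product of its row and column totals divided by the sample size. Using the totals from the summary output (n = 50639), the first cell and the smallest cell work out as follows:

```latex
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}

E_{\text{Lower},\,\text{Satisfied}} = \frac{2962 \times 14795}{50639} \approx 865.40

E_{\text{Upper},\,\text{Not At All Sat}} = \frac{1636 \times 13442}{50639} \approx 434.27
```

Even the smallest expected count is in the hundreds, so the usual requirement of at least 5 per cell is comfortably met.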
Calculate chi-square statistics
##
## Pearson's Chi-squared test
##
## data: ct
## X-squared = 5668, df = 6, p-value < 2.2e-16
##
##                 Satisfied More Or Less Not At All Sat
##   Lower Class      865.40      1310.35         786.26
##   Working Class   6775.04     10258.50        6155.46
##   Middle Class    6676.58     10109.41        6066.01
##   Upper Class      477.98       723.74         434.27
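The code behind this output is not shown. A minimal sketch, assuming the table was passed to base R's `chisq.test()` and the second table is its `expected` component; the counts are re-entered by hand so the snippet runs on its own.

```r
# Observed counts re-entered from the contingency table (rows: class, columns: satfin).
ct <- as.table(matrix(
  c( 299,   787, 1876,
    4594, 10935, 7660,
    8913, 10260, 3679,
     989,   420,  227),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Lower Class", "Working Class", "Middle Class", "Upper Class"),
    c("Satisfied", "More Or Less", "Not At All Sat")
  )
))

res <- chisq.test(ct)     # Pearson's chi-squared test of independence
res                       # prints the statistic, degrees of freedom and p-value
round(res$expected, 2)    # expected counts under independence
min(res$expected) >= 5    # checks the expected-count condition
```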
Given the large chi-square statistic (5668) on 6 degrees of freedom, the p-value (< 2.2e-16) is far below the 5% significance level. We therefore reject the null hypothesis: the data provide strong evidence of an association between social class and financial satisfaction.
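For reference, the statistic and its degrees of freedom follow the standard Pearson form, with O the observed and E the expected counts from the two tables above. The single-cell value is shown only as an example of how one term contributes:

```latex
\chi^2 = \sum_{i=1}^{4} \sum_{j=1}^{3} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (4 - 1)(3 - 1) = 6

\frac{(299 - 865.40)^2}{865.40} \approx 370.7
\quad \text{(Lower Class, Satisfied)}
```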