DATA 606 Spring 2019 - Data Project

Introduction

GSS Data - Race vs Income and Financial Satisfaction Analysis

This project studies the relationship between a respondent's race and their family income and personal financial satisfaction.

Research Question 1: Does race influence how much money a person makes? If so, what is the relationship between race and income?
Research Question 2: Does race affect personal financial satisfaction? If so, what is the relationship between the two?

The research data come from the General Social Survey (GSS), a sociological survey of US residents that collects data on demographic characteristics and behavior. Studying the survey can reveal interesting insights about American society.

Data

Data Collection

The General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as income, national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.

The GSS data was collected by computer-assisted personal interview (CAPI), face-to-face interview and telephone interview of adults (18+) in randomly selected households.

library(treemap)
library(tidyverse)
library(sqldf)
library(ggplot2)
library(statsr)

load(url("http://bit.ly/dasi_gss_data"))
# Keep the three variables of interest and drop rows with a missing value in any of them
data <- gss %>% select(race, coninc, satfin) %>% filter(!is.na(race), !is.na(coninc), !is.na(satfin))

race <- sqldf("select race,count(*) as count from data group by race")
satfin1 <- sqldf("select satfin,race,count(*) as count from data group by satfin,race")

dim(gss)

## [1] 57061   114

summary(data)

##     race           coninc                  satfin     
##  White:38791   Min.   :   383   Satisfied     :13660  
##  Black: 6381   1st Qu.: 18241   More Or Less  :20874  
##  Other: 2120   Median : 35471   Not At All Sat:12758  
##                Mean   : 43959                         
##                3rd Qu.: 58849                         
##                Max.   :180386

Cases

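The gss data set is composed of 57,061 cases (rows) and 114 variables (columns), and each row corresponds to a person surveyed. After selecting race, coninc, and satfin and dropping rows with missing values, 47,292 cases remain in data for the analysis.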
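
As a quick cross-check on how many cases are left in data once rows with missing values are dropped, the counts printed by summary(data) above can be added up. This is a minimal sketch that retypes those printed counts by hand; the vector names are illustrative and not part of the original analysis.

```r
# Counts copied from the summary(data) output above (retyped by hand)
race_counts   <- c(White = 38791, Black = 6381, Other = 2120)
satfin_counts <- c(Satisfied = 13660, "More Or Less" = 20874, "Not At All Sat" = 12758)

# Both variables should account for the same set of respondents
n_complete <- sum(race_counts)
n_complete
sum(satfin_counts) == n_complete

# Share of the 57,061 surveyed cases that remain after filtering
round(n_complete / 57061, 3)
```

The total is also consistent with the ANOVA table later in the report, where the residual degrees of freedom (47289) equal the number of cases minus the 3 race groups.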

Variables

race: Records the race of the respondent (categorical)
satfin: Records whether the respondent is personally satisfied with their financial situation. (categorical)
coninc: Records the total family income in constant dollars (continuous numerical)

Type of study

This is an observational study: respondents were surveyed, and no variable was manipulated or randomly assigned. It can therefore establish only correlation between the variables examined, not causation.

Scope of inference

Because respondents were selected by random sampling, the results can be generalized to the US adult population. Because the selected individuals were not divided into control and treatment groups to be treated differently, causal conclusions cannot be drawn.

Exploratory data analysis

#treemap
treemap(dtf = race,
        index=c("race"),
        vSize="count",
        vColor="count",
        palette="Pastel2",
        type="value",
        border.col=c("grey70", "grey90"),
        fontsize.title = 18,
        algorithm="pivotSize",
        title ="Fig1: Race Distribution",
        title.legend="Count")

Fig1: Distribution of the race variable, which records each respondent's self-declared race. White respondents make up by far the largest group.

#histogram
ggplot(data, aes(x=coninc)) + geom_histogram(binwidth=5000, colour="black") + xlab("Continuous Income") + ggtitle("Fig2: Family Income") + theme(plot.title = element_text(hjust = 0.5))

Fig2: The distribution of family income is right-skewed and contains no negative values. The number of respondents generally decreases as income increases.

#box plot
ggplot(data, aes(x=race, y=coninc, fill=race)) + geom_boxplot(alpha=0.2,notch=TRUE) + xlab("Race") + ylab("Income") + ggtitle("Fig3: Family Income vs Race") + theme(plot.title = element_text(hjust = 0.5))

Fig3: The boxplots suggest that the income distributions of the three races are broadly similar and overlap substantially.

#density
ggplot(data, aes(coninc, color = race)) + geom_density (alpha = 0.1) + labs(title = "Fig4: Density - Family Income vs Race") + labs(x = "Family Income", y = "Density") + theme(plot.title = element_text(hjust = 0.5))

Fig4: Consistent with Fig3, the density curves show overlapping income distributions across races.

#plotting the data
ggplot(satfin1, aes(race, count, fill = satfin)) + geom_col() + labs(x="Race", y="Number of Respondents") + theme(plot.title = element_text(hjust = 0.5)) + labs(title = "Fig5: Financial Satisfaction Level vs Race")

Fig5: Proportionally, Black and Other respondents appear to be the least satisfied with their financial situation, whereas White respondents are the most satisfied.
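
Because the bars in Fig5 show raw counts, the proportional comparison is easier to read from row percentages. A minimal sketch, assuming the race by satisfaction counts reported by table(data$race, data$satfin) later in this report are retyped into a matrix (the object names here are illustrative):

```r
# Counts retyped from the table(data$race, data$satfin) output in this report
counts <- matrix(
  c(12034, 17268, 9489,
     1165,  2568, 2648,
      461,  1038,  621),
  nrow = 3, byrow = TRUE,
  dimnames = list(race   = c("White", "Black", "Other"),
                  satfin = c("Satisfied", "More Or Less", "Not At All Sat"))
)

# Row proportions: share of each satisfaction level within each race
row_props <- prop.table(counts, margin = 1)
round(100 * row_props, 1)
```

Under these counts, about 41.5% of Black respondents report being not at all satisfied, compared with about 24.5% of White respondents. With the full data frame, geom_col(position = "fill") would draw the same proportions directly.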

Inference - Question 1

Question 1: Does race influence how much money a person makes? If so, what is the relationship between race and income?

This inference test explores whether there is a statistically significant difference in the mean family income of United States residents across races.

Conditions

Null hypothesis H0: The mean family income is the same for every race. [µ1=µ2=µ3]
Alternative hypothesis HA: At least one race has a mean family income that differs from the others.

Based on the data exploration, the conditions to check are:
1. Independence: Observations are independent within and between groups
2. Normality: The income distribution within each group is approximately normal
3. Equality: The groups have roughly equal variability

Methodology

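The conditions on normality and constant variance are not completely satisfied, because income is right-skewed. Even so, we use Analysis of Variance (ANOVA) for the hypothesis test, since each group is large, which makes the F test reasonably robust to moderate departures from normality.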

Simulation based inference

#Quantiles
par(mfrow = c(1,3))
qqnorm(data$coninc[data$race == "White"], main = "White")
qqline(data$coninc[data$race == "White"])
qqnorm(data$coninc[data$race == "Black"], main = "Black")
qqline(data$coninc[data$race == "Black"])
qqnorm(data$coninc[data$race == "Other"], main = "Other")
qqline(data$coninc[data$race == "Other"])

#ANOVA
anova(lm(coninc ~ race, data=data))

## Analysis of Variance Table
## 
## Response: coninc
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## race          2 1.5310e+12 7.6549e+11  633.58 < 2.2e-16 ***
## Residuals 47289 5.7135e+13 1.2082e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova_fit <- aov(lm(coninc ~ race, data=data))
#Tukey HSD
thsd <- TukeyHSD(anova_fit, ordered=TRUE)
thsd

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
##     factor levels have been ordered
## 
## Fit: aov(formula = lm(coninc ~ race, data = data))
## 
## $race
##                  diff       lwr       upr p adj
## Other-Black 11929.302  9887.121 13971.483     0
## White-Black 16652.845 15552.329 17753.361     0
## White-Other  4723.543  2906.529  6540.556     0

The ANOVA table reports an F statistic of 633.58 with a p-value close to zero. In other words, if the null hypothesis were true, the probability of observing an F value of 633.58 or higher would be extremely low, so we reject the null hypothesis.
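
The F statistic can be reproduced by hand from the ANOVA table: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. A minimal sketch using the rounded values printed above, so the result matches only up to rounding (variable names are illustrative):

```r
# Values copied from the printed ANOVA table (rounded as shown there)
ss_race  <- 1.5310e12; df_race  <- 2
ss_resid <- 5.7135e13; df_resid <- 47289

ms_race  <- ss_race / df_race      # mean square between groups
ms_resid <- ss_resid / df_resid    # mean square within groups (residual)
f_stat   <- ms_race / ms_resid
f_stat

# Upper-tail probability of the F distribution with (2, 47289) degrees of freedom
p_value <- pf(f_stat, df_race, df_resid, lower.tail = FALSE)
p_value < 2.2e-16
```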

Because the null hypothesis is rejected, we next use pairwise comparisons to find which groups have different means. For each pair, a t test evaluates the null hypothesis that the two group means are equal against the alternative that they differ. The Tukey HSD intervals above already exclude zero for all three pairs, and the pairwise t tests below give the same picture.

pairwise.t.test(data$coninc, data$race)

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$coninc and data$race 
## 
##       White   Black  
## Black < 2e-16 -      
## Other 1.1e-09 < 2e-16
## 
## P value adjustment method: holm
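
The output notes that the p-values were adjusted with the Holm method, which is the default in pairwise.t.test(). A minimal sketch of what that adjustment does, using three made-up raw p-values (hypothetical, not taken from this analysis): the smallest p-value is multiplied by the number of comparisons, the next by one fewer, and so on, and the results are then forced to be non-decreasing.

```r
# Hypothetical raw p-values for three pairwise comparisons (illustration only)
raw_p <- c(0.01, 0.04, 0.03)

# Manual Holm adjustment
ord      <- order(raw_p)                          # indices from smallest to largest p
m        <- length(raw_p)
stepped  <- raw_p[ord] * (m - seq_len(m) + 1)     # multiply by m, m-1, ..., 1
adjusted <- pmin(1, cummax(stepped))              # enforce monotonicity, cap at 1
manual   <- adjusted[order(ord)]                  # back to the original order

# Built-in equivalent
builtin <- p.adjust(raw_p, method = "holm")
all.equal(manual, builtin)
```

In this report all three adjusted p-values are far below 0.05, so the choice of adjustment method does not change the conclusion.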

Inference - Question 2

Question 2: Does race affect personal financial satisfaction? If so, what is the relationship between the two?

This inference test explores whether there is a statistically significant association between race and the personal financial satisfaction level of United States residents.

Conditions

Null hypothesis H0: Race and financial satisfaction levels are independent of each other
Alternative hypothesis HA: Race and financial satisfaction levels are dependent on each other.

Based on the data exploration, the conditions are:
1. Sampling Method: Random sampling
2. Sample Size: Each cell of the contingency table must have at least 5 expected cases.
3. Independence: Since random sampling was used, the observations can be considered independent.

Methodology

table(data$race, data$satfin)

##        
##         Satisfied More Or Less Not At All Sat
##   White     12034        17268           9489
##   Black      1165         2568           2648
##   Other       461         1038            621

Based on the above conditions and data, we can use a chi-square test of independence.
Degrees of Freedom = (r-1)*(c-1), where r = no. of rows and c = no. of columns in the contingency table
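
The chi-square statistic compares each observed count with the count expected under independence, which is (row total times column total) divided by the grand total. A minimal sketch that retypes the observed table above into a matrix (object names are illustrative) and works through the calculation by hand:

```r
# Observed counts retyped from the table(data$race, data$satfin) output above
observed <- matrix(
  c(12034, 17268, 9489,
     1165,  2568, 2648,
      461,  1038,  621),
  nrow = 3, byrow = TRUE,
  dimnames = list(race   = c("White", "Black", "Other"),
                  satfin = c("Satisfied", "More Or Less", "Not At All Sat"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sample-size condition: every expected count should be at least 5
all(expected >= 5)

# Chi-square statistic, degrees of freedom, and upper-tail p-value
chi_sq  <- sum((observed - expected)^2 / expected)
dof     <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)
c(chi_sq = chi_sq, df = dof)
```

Calling chisq.test(observed) on the same matrix should give the same statistic, since the test only needs the table of counts.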

Simulation based inference

#Based on data
df = (3-1)*(3-1)
df

## [1] 4

#Chi-square test of independence
chisq.test(data$race, data$satfin)

## 
##  Pearson's Chi-squared test
## 
## data:  data$race and data$satfin
## X-squared = 976.59, df = 4, p-value < 2.2e-16

The p-value is virtually 0, far below the 5% significance level, so we reject H0 in favor of HA.

Let’s conduct a hypothesis test using the inference() function from the statsr package

inference(x = race, y = satfin, data = data, type = "ht", statistic = "proportion", method = "theoretical", sig_level = 0.05, success = "Satisfied", alternative = "greater" )

## Response variable: categorical (3 levels) 
## Explanatory variable: categorical (3 levels) 
## Observed:
##        y
## x       Satisfied More Or Less Not At All Sat
##   White     12034        17268           9489
##   Black      1165         2568           2648
##   Other       461         1038            621
## 
## Expected:
##        y
## x        Satisfied More Or Less Not At All Sat
##   White 11204.5390   17121.7824     10464.6786
##   Black  1843.1122    2816.4805      1721.4074
##   Other   612.3488     935.7371       571.9141
## 
## H0: race and satfin are independent
## HA: race and satfin are dependent
## chi_sq = 976.5906, df = 4, p_value = 0

The result agrees with chisq.test().

Conclusion

Question 1: Does race influence how much money a person makes? If so, what is the relationship between race and income?
We can conclude that, although the family income distributions of the races overlap considerably, mean family income differs significantly by race. In particular, Black respondents tend to have a lower family income than respondents of other races: the Tukey HSD estimates put their mean about 16,653 below that of White respondents and about 11,929 below that of Other respondents.
This pattern shows up throughout the analysis, from the exploratory graphs to the inference tests.

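Question 2: Does race affect personal financial satisfaction? If so, what is the relationship between the two?
Based on the chi-square test, we can conclude that race and financial satisfaction are dependent on each other. Because this is an observational study, the result shows an association, not that race causes the difference.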

References

http://gss.norc.org/Get-Documentation
http://gss.norc.org/Get-The-Data
http://gss.norc.org/faq

Mohamed Thasleem Kalikul Zaman

May 15, 2019
