This project studies the relationship between a respondent’s race and their family income and personal financial satisfaction.
Research Question 1: Does race influence how much money a person makes? If so, what is the relationship between race and income?
Research Question 2: Does race affect a person’s level of financial satisfaction? If so, what is the relationship?
The research data comes from the General Social Survey (GSS), a sociological survey of US residents that collects data on demographic characteristics and behavior. Studying the survey can yield interesting insights into American society.
The General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as income, national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.
The GSS data was collected by computer-assisted personal interview (CAPI), face-to-face interview and telephone interview of adults (18+) in randomly selected households.
library(treemap)
library(tidyverse)
library(sqldf)
library(ggplot2)
library(statsr)
load(url("http://bit.ly/dasi_gss_data"))
#keep the three variables of interest and drop rows with missing values
data <- gss %>% select(race, coninc, satfin) %>% filter(!is.na(race), !is.na(coninc), !is.na(satfin))
race <- sqldf("select race,count(*) as count from data group by race")
satfin1 <- sqldf("select satfin,race,count(*) as count from data group by satfin,race")
dim(gss)
## [1] 57061 114
summary(data)
##     race           coninc                  satfin
##  White:38791   Min.   :   383   Satisfied     :13660
##  Black: 6381   1st Qu.: 18241   More Or Less  :20874
##  Other: 2120   Median : 35471   Not At All Sat:12758
##                Mean   : 43959
##                3rd Qu.: 58849
##                Max.   :180386
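The full dataset is composed of 57,061 cases (rows) and 114 variables (columns); each row corresponds to one person surveyed. After keeping the three variables of interest and dropping rows with missing values, 47,292 cases remain.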
race: Records the race of the respondent (categorical)
satfin: Records whether the respondent is personally satisfied with their financial situation. (categorical)
coninc: Records the family income in constant dollars (continuous numerical)
This is an observational study: the respondents were not divided into control and treatment groups to be treated differently, so the analysis can establish only association between the variables examined, not causation.
Because the respondents were randomly sampled, the findings can be generalized to the US adult population.
#treemap
treemap(dtf = race,
index=c("race"),
vSize="count",
vColor="count",
palette="Pastel2",
type="value",
border.col=c("grey70", "grey90"),
fontsize.title = 18,
algorithm="pivotSize",
title ="Fig1: Race Distribution",
title.legend="Count")
Fig1: The distribution of the race variable, which records each respondent’s self-declared race. White respondents are by far the largest group.
#histogram
ggplot(data, aes(x=coninc)) + geom_histogram(binwidth=5000, colour="black") + xlab("Continuous Income") + ggtitle("Fig2: Family Income") + theme(plot.title = element_text(hjust = 0.5))
Fig2: The distribution of family income is right-skewed, with no negative incomes; the number of respondents decreases as income increases.
#box plot
ggplot(data, aes(x=race, y=coninc, fill=race)) + geom_boxplot(alpha=0.2,notch=TRUE) + xlab("Race") + ylab("Income") + ggtitle("Fig3: Family Income vs Race") + theme(plot.title = element_text(hjust = 0.5))
Fig3: From the boxplot, the income distributions of the three races appear broadly similar and overlap considerably.
#density
ggplot(data, aes(coninc, color = race)) + geom_density (alpha = 0.1) + labs(title = "Fig4: Density - Family Income vs Race") + labs(x = "Family Income", y = "Density") + theme(plot.title = element_text(hjust = 0.5))
Fig4: Consistent with Fig3, the density curves show overlapping income distributions across races.
#plotting the data
ggplot(satfin1, aes(race, count, fill = satfin)) + geom_col() + labs(x="Race", y="Number of Respondents", fill="Financial Satisfaction") + theme(plot.title = element_text(hjust = 0.5)) + labs(title = "Fig5: Financial Satisfaction Level vs Race")
Fig5: Proportionally, Black and Other respondents appear to be the least satisfied with their financial situation, while White respondents are the most satisfied.
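Because Fig5 stacks raw counts, the proportional reading is easier to check numerically. The following is a minimal sketch, assuming the counts from the race-by-satfin table printed later in this report; the names `counts` and `row_props` are illustrative and not part of the original analysis.

```r
# Counts copied from the table(data$race, data$satfin) output in this report.
counts <- matrix(
  c(12034, 17268, 9489,
     1165,  2568, 2648,
      461,  1038,  621),
  nrow = 3, byrow = TRUE,
  dimnames = list(race   = c("White", "Black", "Other"),
                  satfin = c("Satisfied", "More Or Less", "Not At All Sat"))
)

# Share of each satisfaction level within each race (every row sums to 1).
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```

Comparing the "Satisfied" and "Not At All Sat" columns row by row shows how the satisfaction mix differs between groups regardless of group size.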
Question 1: Does race influence how much money a person makes? If so, what is the relationship between race and income?
This inference test explores whether there is a statistically significant difference in the mean family income of United States residents with respect to their race.
Null hypothesis H0: The mean family income (µ) is the same for every race. [µ1=µ2=µ3]
Alternate hypothesis HA: At least one pair of group means differs.
Based on the data exploration, the conditions are as follows:
1. Independence: observations are independent within and between groups
2. Normality: the income distribution within each group is approximately normal
3. Equality: the groups have roughly equal variability
The conditions on normality and constant variance are not completely satisfied (income is right-skewed; see the Q-Q plots below). Because each group is large, we still use Analysis of Variance (ANOVA) for the hypothesis test and interpret the result with this caveat in mind.
#Quantiles
par(mfrow = c(1,3))
qqnorm(data$coninc[data$race == "White"], main = "White")
qqline(data$coninc[data$race == "White"])
qqnorm(data$coninc[data$race == "Black"], main = "Black")
qqline(data$coninc[data$race == "Black"])
qqnorm(data$coninc[data$race == "Other"], main = "Other")
qqline(data$coninc[data$race == "Other"])
#ANOVA
anova(lm(coninc ~ race, data=data))
## Analysis of Variance Table
##
## Response: coninc
## Df Sum Sq Mean Sq F value Pr(>F)
## race 2 1.5310e+12 7.6549e+11 633.58 < 2.2e-16 ***
## Residuals 47289 5.7135e+13 1.2082e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova <- aov(lm(coninc ~ race, data=data))
#Tukey HSD
thsd <- TukeyHSD(anova, ordered=TRUE)
thsd
## Tukey multiple comparisons of means
## 95% family-wise confidence level
## factor levels have been ordered
##
## Fit: aov(formula = lm(coninc ~ race, data = data))
##
## $race
## diff lwr upr p adj
## Other-Black 11929.302 9887.121 13971.483 0
## White-Black 16652.845 15552.329 17753.361 0
## White-Other 4723.543 2906.529 6540.556 0
The ANOVA gives an F statistic of 633.58 with a p-value close to zero, meaning that the probability of observing an F value of 633.58 or higher, if the null hypothesis were true, is very low. We can therefore reject the null hypothesis.
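The F statistic can be reproduced from the sums of squares in the ANOVA table. The sketch below assumes the rounded values printed in that table, so the result matches only up to rounding; the variable names and the eta-squared effect size are illustrative additions, not part of the original analysis.

```r
# Values copied (rounded) from the ANOVA table in this report.
ss_between <- 1.5310e12   # Sum Sq for race
df_between <- 2           # number of groups - 1
ss_within  <- 5.7135e13   # Sum Sq for residuals
df_within  <- 47289       # observations - number of groups

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group variability.
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total income variability associated with race.
eta_sq <- ss_between / (ss_between + ss_within)

print(c(F = f_stat, p = p_value, eta_sq = eta_sq))
```

A very small p-value alongside a modest eta squared is typical of large samples: the group means differ reliably even though race accounts for only a small part of the variation in income.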
Since the null hypothesis is rejected, we use pairwise comparisons to find which groups have different means. The Tukey HSD output above shows that all three pairwise differences are significant. Pairwise t tests, each testing the null hypothesis that two group means are equal against the alternative that they differ, confirm this.
pairwise.t.test(data$coninc, data$race)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$coninc and data$race
##
## White Black
## Black < 2e-16 -
## Other 1.1e-09 < 2e-16
##
## P value adjustment method: holm
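The output notes that the p-values were adjusted with the Holm method, which is the default in pairwise.t.test(). As a rough illustration of what that adjustment does, the sketch below applies it by hand to three hypothetical raw p-values (made up for the example, not taken from this analysis) and compares the result with base R's p.adjust().

```r
# Hypothetical raw p-values, chosen only to illustrate the procedure.
raw_p <- c(0.01, 0.04, 0.03)
m <- length(raw_p)

# Holm: sort ascending, multiply the i-th smallest by (m - i + 1),
# enforce monotonicity with a running maximum, and cap at 1.
ord      <- order(raw_p)
stepwise <- raw_p[ord] * (m - seq_len(m) + 1)
manual   <- numeric(m)
manual[ord] <- pmin(1, cummax(stepwise))

# Base R implementation of the same adjustment.
builtin <- p.adjust(raw_p, method = "holm")

print(rbind(raw = raw_p, manual = manual, builtin = builtin))
```

The adjustment makes each p-value larger to account for running several tests at once, so a comparison that remains significant after adjustment is less likely to be a chance finding.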
Question 2: Does race affect a person’s level of financial satisfaction? If so, what is the relationship?
This inference test explores whether there is a statistically significant difference in the personal financial satisfaction level of United States residents with respect to their race.
Null hypothesis H0: Race and financial satisfaction levels are independent of each other
Alternate hypothesis HA: Race and financial satisfaction levels are dependent on each other.
Based on the data exploration, the conditions are as follows:
1. Sampling Method: Random sampling
2. Sample Size: Each cell of the contingency table must have at least 5 expected cases.
3. Independence: Since random sampling was used, the observations are considered independent.
table(data$race, data$satfin)
##
## Satisfied More Or Less Not At All Sat
## White 12034 17268 9489
## Black 1165 2568 2648
## Other 461 1038 621
Based on the above conditions and data, we can use chi-square test for this inference test
Degrees of Freedom = (r-1)*(c-1) where r = no. of rows and c = no. of columns
#Based on data
df = (3-1)*(3-1)
df
## [1] 4
#Chi-square test of independence
chisq.test(data$race, data$satfin)
##
## Pearson's Chi-squared test
##
## data: data$race and data$satfin
## X-squared = 976.59, df = 4, p-value < 2.2e-16
The p-value is virtually 0, far below the 5% significance level, so we can reject H0 in favor of HA.
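The chi-square statistic and the sample-size condition can also be checked by hand from the contingency table. The following is a minimal sketch using the observed counts printed in this report; the variable names and the Pearson residuals are illustrative additions, not part of the original analysis.

```r
# Observed counts copied from the table(data$race, data$satfin) output.
observed <- matrix(
  c(12034, 17268, 9489,
     1165,  2568, 2648,
      461,  1038,  621),
  nrow = 3, byrow = TRUE,
  dimnames = list(race   = c("White", "Black", "Other"),
                  satfin = c("Satisfied", "More Or Less", "Not At All Sat"))
)

# Expected count under independence: row total * column total / grand total.
n        <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic, degrees of freedom, and p-value.
chi_sq  <- sum((observed - expected)^2 / expected)
dof     <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Pearson residuals show which cells depart most from independence.
pearson_resid <- (observed - expected) / sqrt(expected)

print(c(chi_sq = chi_sq, df = dof, p = p_value, min_expected = min(expected)))
print(round(pearson_resid, 1))
```

The smallest expected count is well above 5, so the sample-size condition holds, and the cells with the largest absolute residuals indicate where the observed counts differ most from what independence would predict.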
Let’s conduct a hypothesis test using the inference() function from the statsr package.
inference(x = race, y = satfin, data = data, type = "ht", statistic = "proportion", method = "theoretical", sig_level = 0.05, success = "Satisfied", alternative = "greater" )
## Response variable: categorical (3 levels)
## Explanatory variable: categorical (3 levels)
## Observed:
## y
## x Satisfied More Or Less Not At All Sat
## White 12034 17268 9489
## Black 1165 2568 2648
## Other 461 1038 621
##
## Expected:
## y
## x Satisfied More Or Less Not At All Sat
## White 11204.5390 17121.7824 10464.6786
## Black 1843.1122 2816.4805 1721.4074
## Other 612.3488 935.7371 571.9141
##
## H0: race and satfin are independent
## HA: race and satfin are dependent
## chi_sq = 976.5906, df = 4, p_value = 0
The result agrees with chisq.test().
Question 1: Does race influence how much money a person makes? If so, what is the relationship between race and income?
We can conclude that, although the income distributions of the three races overlap, mean family income differs significantly between them. Black respondents have the lowest mean family income, roughly 16,650 below White respondents and 11,930 below respondents of other races (Tukey HSD estimates).
This pattern can be observed throughout the analysis, from the data exploration plots to the inference tests.
Question 2: Does race affect a person’s level of financial satisfaction? If so, what is the relationship?
We can conclude that race and financial satisfaction are not independent: the chi-square test shows a statistically significant association between them.