This project studies how a respondent's race relates to family income and to satisfaction with their personal financial situation.
The data come from the General Social Survey (GSS), a sociological survey of US residents that collects data on demographic characteristics and behavior. Studying the survey offers insight into American society.
The General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as income, national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.
library(treemap)
library(tidyverse)
library(sqldf)
library(ggplot2)
load(url("http://bit.ly/dasi_gss_data"))
# Keep the three variables of interest and drop rows with a missing value in any of them
data <- gss %>% select(race, coninc, satfin) %>% filter(!is.na(race), !is.na(coninc), !is.na(satfin))
race <- sqldf("select race,count(*) as count from data group by race")
satfin1 <- sqldf("select satfin,race,count(*) as count from data group by satfin,race")
dim(gss)
## [1] 57061 114
head(data)
## race coninc satfin
## 1 White 25926 Not At All Sat
## 2 White 33333 More Or Less
## 3 White 33333 Satisfied
## 4 White 41667 Not At All Sat
## 5 White 69444 Satisfied
## 6 White 60185 More Or Less
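The filtering step and the two sqldf aggregations above can also be written with dplyr and tidyr alone. The following is a minimal sketch on a small made-up data frame: `toy` and its values are hypothetical (not GSS records), and it only assumes the same three column names as `data`.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the three GSS columns used in this project.
toy <- tibble(
  race   = factor(c("White", "Black", "White", NA, "Other", "White")),
  coninc = c(25926, 33333, NA, 41667, 69444, 60185),
  satfin = factor(c("Satisfied", "More Or Less", "Satisfied",
                    "Not At All Sat", "More Or Less", "Satisfied"))
)

# drop_na() removes every row with a missing value in the listed columns.
toy_clean <- toy %>% drop_na(race, coninc, satfin)

# Equivalent of "select race, count(*) as count ... group by race".
race_counts <- toy_clean %>% count(race, name = "count")

# Equivalent of "select satfin, race, count(*) as count ... group by satfin, race".
satfin_counts <- toy_clean %>% count(satfin, race, name = "count")

print(race_counts)
print(satfin_counts)
```

Applied to the project data, the same calls would start from `gss` instead of `toy`.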
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Research Question 1: Is a respondent's race associated with family income and, if so, what is the relationship between race and income?
Research Question 2: Is a respondent's race associated with their level of satisfaction with their personal financial situation and, if so, what is the relationship?
What are the cases, and how many are there?
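The full dataset has 57,061 cases (rows) and 114 variables (columns); each case corresponds to one person surveyed. After rows with a missing value in race, coninc, or satfin are dropped, 47,292 cases remain for this analysis.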
Describe the method of data collection.
The GSS data was collected by computer-assisted personal interview (CAPI), face-to-face interview and telephone interview of adults (18+) in randomly selected households.
What type of study is this (observational/experiment)?
This is an observational study: respondents were surveyed, not randomly assigned to groups. It can therefore establish only association between the variables examined, not causation.
If you collected the data, state self-collected. If not, provide a citation/link.
The data were not self-collected. They come from the General Social Survey (GSS) and were loaded from the link below.
http://bit.ly/dasi_gss_data
What is the response variable, and what type is it (numerical/categorical)?
satfin: Records how satisfied the respondent is with their personal financial situation (categorical, three ordered levels)
coninc: Records total family income in constant dollars (continuous numerical)
What is the explanatory variable, and what type is it (numerical/categorical)?
race: Records the race of the respondent (categorical)
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(data)
##     race           coninc                  satfin
##  White:38791   Min.   :   383   Satisfied     :13660
##  Black: 6381   1st Qu.: 18241   More Or Less  :20874
##  Other: 2120   Median : 35471   Not At All Sat:12758
##                Mean   : 43959
##                3rd Qu.: 58849
##                Max.   :180386
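Because Research Question 1 compares income across groups, per-group sample sizes, means, and standard deviations are more informative than the pooled summary. The following is a minimal sketch of such a summary; the helper name `income_by_race` and the `toy` values are hypothetical and only illustrate the calculation.

```r
library(dplyr)

# Per-group sample size, mean, SD, and median of family income.
income_by_race <- function(df) {
  df %>%
    group_by(race) %>%
    summarise(
      n      = n(),
      mean   = mean(coninc),
      sd     = sd(coninc),
      median = median(coninc),
      .groups = "drop"
    )
}

# Hypothetical values for illustration only; not GSS results.
toy <- tibble(
  race   = c("White", "White", "Black", "Black", "Other", "Other"),
  coninc = c(20000, 40000, 10000, 30000, 25000, 35000)
)

print(income_by_race(toy))
```

With the project data, the call would be `income_by_race(data)`.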
#treemap
treemap(dtf = race,
index=c("race"),
vSize="count",
vColor="count",
palette="Pastel2",
type="value",
border.col=c("grey70", "grey90"),
fontsize.title = 18,
algorithm="pivotSize",
title ="Fig1: Race Distribution",
title.legend="Count")
Fig1: Distribution of the race variable, which records the respondent's self-declared race. White respondents are by far the largest group.
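The size of each group can be expressed as a share of all complete cases. The short sketch below uses only the group counts printed in the summary(data) output of this report; the variable names are illustrative.

```r
# Group sizes copied from the summary(data) output in this report.
race_counts <- c(White = 38791, Black = 6381, Other = 2120)

# Total number of complete cases.
total <- sum(race_counts)

# Share of each group in percent, rounded to one decimal place.
race_share <- round(100 * race_counts / total, 1)

print(total)
print(race_share)
```

White respondents make up roughly 82% of the complete cases, so any pooled statistic is dominated by that group.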
#histogram
ggplot(data, aes(x=coninc)) + geom_histogram(binwidth=5000, colour="black") + xlab("Continuous Income") + ggtitle("Fig2: Family Income") + theme(plot.title = element_text(hjust = 0.5))
Fig2: The distribution of family income is right-skewed and contains no negative values; the number of respondents decreases as income increases.
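The right skew is also visible in the summary statistics reported earlier: the mean exceeds the median, and the upper tail is much longer than the lower tail. The sketch below checks this using only values from the summary(data) output; the variable names are illustrative.

```r
# Values copied from the summary(data) output in this report.
coninc_summary <- c(min = 383, q1 = 18241, median = 35471,
                    mean = 43959, q3 = 58849, max = 180386)

# A mean above the median is consistent with a right-skewed distribution.
mean_minus_median <- coninc_summary[["mean"]] - coninc_summary[["median"]]

# Compare the length of the upper tail (Q3 to max) with the lower tail (min to Q1).
upper_tail <- coninc_summary[["max"]] - coninc_summary[["q3"]]
lower_tail <- coninc_summary[["q1"]] - coninc_summary[["min"]]

print(mean_minus_median)        # 8488
print(upper_tail / lower_tail)  # ratio well above 1
```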
#box plot
ggplot(data, aes(x=race, y=coninc, fill=race)) + geom_boxplot(alpha=0.2,notch=TRUE) + xlab("Race") + ylab("Income") + ggtitle("Fig3: Family Income vs Race") + theme(plot.title = element_text(hjust = 0.5))
Fig3: From the boxplot, the distribution of family income appears broadly similar across the racial groups.
#density
ggplot(data, aes(coninc, color = race)) + geom_density (alpha = 0.1) + labs(title = "Fig4: Density - Family Income vs Race") + labs(x = "Family Income", y = "Density") + theme(plot.title = element_text(hjust = 0.5))
Fig4: Consistent with Fig3, the income distributions of the racial groups overlap.
#plotting the data
ggplot(satfin1, aes(race, count, fill = satfin)) + geom_col() + labs(x="Race", y="Number of Respondents", fill="Financial Satisfaction Level") + theme(plot.title = element_text(hjust = 0.5)) + labs(title = "Fig5: Financial Satisfaction Level vs Race")
Fig5: Proportionally, Black and Other respondents appear to be the least satisfied with their financial situation, while White respondents appear to be the most satisfied.
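Because the groups differ greatly in size, a proportional comparison is easier to read from within-race shares than from raw counts. The following is a minimal sketch of that calculation; the helper name `satfin_share` and the `toy_counts` values are hypothetical, and the input is assumed to have the same columns as `satfin1` (race, satfin, count).

```r
library(dplyr)

# Convert counts per (race, satfin) cell into within-race proportions.
satfin_share <- function(counts) {
  counts %>%
    group_by(race) %>%
    mutate(prop = count / sum(count)) %>%
    ungroup()
}

# Hypothetical counts for illustration only; not GSS results.
toy_counts <- tibble(
  race   = rep(c("White", "Black"), each = 3),
  satfin = rep(c("Satisfied", "More Or Less", "Not At All Sat"), times = 2),
  count  = c(30, 50, 20, 10, 40, 50)
)

shares <- satfin_share(toy_counts)
print(shares)
```

With the project data, the call would be `satfin_share(satfin1)`. Alternatively, passing `position = "fill"` to `geom_col()` in the Fig5 code would draw each bar as proportions that sum to 1.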