This project examines whether Americans’ income is affected by race, age, marital status, or sex.
For decades, questions have been raised about whether Americans’ income differs by race, age, marital status, or sex. Using the census dataset, this project aims to determine whether any of these characteristics is associated with an American’s income and, if so, how strong or weak each relationship is. The data is a random sample from the 2000 U.S. Census, distributed with the openintro package, containing 500 observations on the following 8 variables: the census year, the name of the state, the total family income (in U.S. dollars), the age of each survey participant, the sex (Female or Male), the race (American Indian or Alaska Native, Black, Chinese, Japanese, Other Asian or Pacific Islander, Two major races, White and Other), the marital status (Divorced, Married/spouse absent, Married/spouse present, Never married/single, Separated and Widowed), and the total personal income (in U.S. dollars).
Use the following chunk of code to load the census dataset into the environment and explore the data.
# Store the Census dataset
library(openintro)
census <- census
# Analyzing the census dataset
str(census)
## 'data.frame': 500 obs. of 8 variables:
## $ censusYear : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ stateFIPScode : Factor w/ 47 levels "Alabama","Arizona",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ totalFamilyIncome : int 14550 22800 0 23000 48000 74000 23000 74000 60000 14600 ...
## $ age : int 44 20 20 6 55 43 60 47 54 58 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1 1 1 1 1 ...
## $ raceGeneral : Factor w/ 8 levels "American Indian or Alaska Native",..: 7 8 2 8 8 8 8 8 2 8 ...
## $ maritalStatus : Factor w/ 6 levels "Divorced","Married/spouse absent",..: 3 4 4 4 3 3 3 3 3 6 ...
## $ totalPersonalIncome: int 0 13000 20000 NA 36000 27000 11800 48000 40000 14600 ...
head(census)
## censusYear stateFIPScode totalFamilyIncome age sex raceGeneral
## 1 2000 Florida 14550 44 Male Two major races
## 2 2000 Florida 22800 20 Female White
## 3 2000 Florida 0 20 Male Black
## 4 2000 Florida 23000 6 Female White
## 5 2000 Florida 48000 55 Male White
## 6 2000 Florida 74000 43 Female White
## maritalStatus totalPersonalIncome
## 1 Married/spouse present 0
## 2 Never married/single 13000
## 3 Never married/single 20000
## 4 Never married/single NA
## 5 Married/spouse present 36000
## 6 Married/spouse present 27000
summary(census)
## censusYear stateFIPScode totalFamilyIncome age
## Min. :2000 California : 62 Min. : 0 Min. : 0.0
## 1st Qu.:2000 New York : 43 1st Qu.: 21500 1st Qu.:17.0
## Median :2000 Florida : 39 Median : 43000 Median :35.0
## Mean :2000 Texas : 37 Mean : 57411 Mean :35.3
## 3rd Qu.:2000 Pennsylvania: 26 3rd Qu.: 70700 3rd Qu.:51.0
## Max. :2000 Ohio : 23 Max. :892050 Max. :93.0
## (Other) :270 NA's :15
## sex raceGeneral
## Female:232 White :363
## Male :268 Black : 71
## Other : 25
## Two major races : 16
## Other Asian or Pacific Islander: 13
## Chinese : 8
## (Other) : 4
## maritalStatus totalPersonalIncome
## Divorced : 38 Min. : -4400
## Married/spouse absent : 14 1st Qu.: 5900
## Married/spouse present:192 Median : 17750
## Never married/single :222 Mean : 29082
## Separated : 3 3rd Qu.: 37000
## Widowed : 31 Max. :456000
## NA's :108
boxplot(census)
Remove portions of the dataset that are irrelevant to the analyses that will be conducted.
# Count the children (younger than 15) in the dataset
children <- nrow(census[census$age < 15, ])
# Remove the 108 children from the dataset, since they have no total personal income listed
Income <- subset(census, !is.na(totalPersonalIncome))
Of the 500 people in the dataset, 108 have no applicable personal income because they are younger than 15 years of age. The sample size considered in the income analyses below is therefore 392.
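The filtering step can be checked on a small scale. The sketch below uses a toy data frame with hypothetical values (not rows from the census sample) to show how `!is.na()` drops the rows without income and how one might confirm that every dropped row belongs to a child.

```r
# Toy data frame with hypothetical values (not rows from the census sample)
toy <- data.frame(
  age = c(6, 12, 20, 44, 55),
  totalPersonalIncome = c(NA, NA, 13000, 0, 36000)
)

# Keep only the rows with a recorded personal income
kept <- subset(toy, !is.na(totalPersonalIncome))
dropped <- subset(toy, is.na(totalPersonalIncome))

n_kept <- nrow(kept)
n_dropped <- nrow(dropped)

# TRUE only if every dropped row belongs to someone younger than 15
all_dropped_are_children <- all(dropped$age < 15)
```

Running the same check on `census` would confirm whether the missing incomes coincide exactly with the rows where `age < 15`.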
1.) Explore the race, age, marital status, and sex variables in the census dataset.
#Analyzing the race, age, marital status, and sex variables
summary(census$raceGeneral)
## American Indian or Alaska Native Black
## 3 71
## Chinese Japanese
## 8 1
## Other Other Asian or Pacific Islander
## 25 13
## Two major races White
## 16 363
summary(census$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 17.0 35.0 35.3 51.0 93.0
summary(census$maritalStatus)
## Divorced Married/spouse absent Married/spouse present
## 38 14 192
## Never married/single Separated Widowed
## 222 3 31
summary(census$sex)
## Female Male
## 232 268
2.) As seen in the exploratory analysis, Black and White are the two largest racial groups in the sample. The other groups represent a small fraction of the total sample size, which makes any analysis of them less reliable. Determine the difference in income between the Black and White populations by running a two-independent-sample t-test.
Hypotheses
\(\mu_1\) = mean total personal income of the Black population
\(\mu_2\) = mean total personal income of the White population
\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
Two-tailed test
# Black subset
Black <- subset(census, census$raceGeneral == "Black")
# White subset
White <- subset(census, census$raceGeneral == "White")
# Mean difference in total personal income (x1 - x2)
mean(Black$totalPersonalIncome, na.rm = TRUE) - mean(White$totalPersonalIncome, na.rm = TRUE)
## [1] -15171.02
# T-test function
t.test(Black$totalPersonalIncome, White$totalPersonalIncome, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: Black$totalPersonalIncome and White$totalPersonalIncome
## t = -4.019, df = 231.87, p-value = 7.903e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -22608.300 -7733.746
## sample estimates:
## mean of x mean of y
## 17049.19 32220.22
Decision
p < 0.05 (p-value = 7.903e-05), therefore we reject Ho.
Conclusion
There is sufficient evidence to support the claim that mean total personal income differs between the Black and White populations. In this sample, the White mean is about $15,171 higher than the Black mean.
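Because `t.test()` defaults to the Welch test, which does not assume equal variances, it may help to see how that statistic is assembled. The following is a minimal sketch on two hypothetical income vectors (`group_a` and `group_b` are made up for illustration and are not drawn from the census sample); it computes the Welch t statistic, degrees of freedom, and p-value by hand and compares them with `t.test()`.

```r
# Hypothetical income samples (illustrative only, not census data)
group_a <- c(12000, 18500, 9000, 25000, 16000, 21000)
group_b <- c(30000, 42000, 27500, 55000, 38000, 31000, 47000)

# Each group's variance of the sample mean
va <- var(group_a) / length(group_a)
vb <- var(group_b) / length(group_b)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(group_a) - mean(group_b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (va + vb)^2 /
  (va^2 / (length(group_a) - 1) + vb^2 / (length(group_b) - 1))

# Two-sided p-value
p_manual <- 2 * pt(-abs(t_manual), df_manual)

# Built-in version; var.equal = FALSE (Welch) is the default
welch <- t.test(group_a, group_b, alternative = "two.sided")
```

The manual values should agree with `welch$statistic`, `welch$parameter`, and `welch$p.value`, which is also why the degrees of freedom in the output above are not whole numbers.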
3.) Both the total personal income and age variables are numerical. For the following exercise, determine whether income changes with age by examining the linear relationship between these two variables.
a.) Create a scatterplot, with income as the response variable and with age as the explanatory variable.
# scatterplot of the relationship between income and age
plot(Income$age, Income$totalPersonalIncome, main = "Total Personal Income by Age", xlab = "Age", ylab = "Total Personal Income")
Based on the scatterplot above, there is at most a very weak linear relationship between total personal income and the age of the participants.
b.) Calculate the correlation coefficient
# Correlation coefficient
cor(Income$age,Income$totalPersonalIncome)
## [1] 0.1314763
The correlation coefficient is small (r ≈ 0.13), which indicates a weak positive linear relationship: age alone accounts for less than 2% of the variation in income (r² ≈ 0.017).
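A small correlation can still be tested formally. As a rough sketch, the reported coefficient and the sample size of 392 (the number of rows left after removing missing incomes) can be plugged into the usual t statistic for a correlation, \(t = r\sqrt{n-2}/\sqrt{1-r^2}\). This reuses only the numbers printed above; calling `cor.test(Income$age, Income$totalPersonalIncome)` on the data itself would be the more direct route.

```r
r <- 0.1314763  # correlation coefficient reported above
n <- 392        # rows remaining after removing missing incomes

# Share of the variation in income associated with age
r_squared <- r^2

# t statistic and two-sided p-value for H0: the true correlation is 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
```

Comparing `p_value` with 0.05 separates two questions that are easy to conflate: whether the linear association is distinguishable from zero, and whether it is strong enough to matter in practice.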
This calculation uses ANOVA (analysis of variance) to determine the difference in income based on marital status. ANOVA is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among group means in a sample.
4.) Determine the difference in income based on marital status using an ANOVA test.
a.) Create a boxplot of the relationship between marital status and total personal income.
# Boxplot
par(mar=c(11,9,1,1))
boxplot(Income$totalPersonalIncome ~ Income$maritalStatus, las = 2)
This side-by-side boxplot of the total personal income for the 392 Americans in each marital status group (divorced, married/spouse present, married/spouse absent, widowed, separated, never married/single) shows quite a bit of detail (e.g., the outliers in the married/spouse present group).
b.) Test whether there is a difference in income across the groups stated above.
Analysis of Variance
Hypotheses
\(H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6\)
\(H_A:\) At least one mean is different.
c.) Determine the test statistic and the p-value.
# use aov() and store as result
result <- aov(totalPersonalIncome ~ maritalStatus,data = Income)
# use summary() to find out p-value
summary(result)
## Df Sum Sq Mean Sq F value Pr(>F)
## maritalStatus 5 4.145e+10 8.290e+09 4.055 0.00134 **
## Residuals 386 7.891e+11 2.044e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Decision and Conclusion
Since p < 0.05 (p = 0.00134), we reject Ho. The data presents sufficient evidence to support the claim that the mean income of at least one marital status group differs from the others.
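The F value and p-value in the ANOVA table follow directly from its other columns. The sketch below reuses the rounded mean squares and degrees of freedom printed above (so the result only matches to the displayed precision) and recomputes both quantities.

```r
# Values copied from the ANOVA table above (rounded as printed)
ms_between <- 8.290e+09  # Mean Sq for maritalStatus
ms_within  <- 2.044e+09  # Mean Sq for Residuals
df_between <- 6 - 1      # number of groups minus 1
df_within  <- 392 - 6    # observations minus number of groups

# F is the ratio of between-group to within-group variability
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)
```

A large F means the group means are spread out more than the variation within groups would suggest, which is what leads to rejecting Ho here.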
d.) The results above show that the F statistic is 4.06. Although it can be used to determine whether there are significant differences in income between the groups, it does not show which groups differ, so further analysis is needed.
# Determine how personal income differs between the marital statuses using the pairwise.t.test function
pairwise.t.test(Income$totalPersonalIncome,Income$maritalStatus,p.adjust.method = "bonferroni", na.rm = TRUE)
##
## Pairwise comparisons using t tests with pooled SD
##
## data: Income$totalPersonalIncome and Income$maritalStatus
##
## Divorced Married/spouse absent
## Married/spouse absent 1.0000 -
## Married/spouse present 1.0000 1.0000
## Never married/single 1.0000 1.0000
## Separated 1.0000 1.0000
## Widowed 1.0000 1.0000
## Married/spouse present Never married/single
## Married/spouse absent - -
## Married/spouse present - -
## Never married/single 0.0013 -
## Separated 1.0000 1.0000
## Widowed 0.0737 1.0000
## Separated
## Married/spouse absent -
## Married/spouse present -
## Never married/single -
## Separated -
## Widowed 1.0000
##
## P value adjustment method: bonferroni
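Based on the pairwise comparisons above, at the 0.05 significance level the only significant difference in income is between the Never married/single and the Married/spouse present groups (adjusted p = 0.0013). The difference between the Widowed and Married/spouse present groups (adjusted p = 0.0737) does not reach significance, and every other adjusted p-value is 1. The results suggest that income differs by marital status mainly between Americans who are Married/spouse present and those who are Never married/single.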
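Many of the adjusted p-values above are exactly 1 because of how the Bonferroni correction works: each raw p-value is multiplied by the number of comparisons and capped at 1. A minimal sketch follows, using hypothetical raw p-values (the vector `raw_p` is made up for illustration and does not come from this analysis).

```r
# Six marital status groups give choose(6, 2) pairwise comparisons
n_groups <- 6
n_comparisons <- choose(n_groups, 2)

# Hypothetical raw p-values (illustrative only)
raw_p <- c(0.00009, 0.005, 0.2)

# Built-in Bonferroni adjustment, told how many comparisons were made
adjusted <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)

# The same adjustment by hand: multiply, then cap at 1
by_hand <- pmin(1, raw_p * n_comparisons)
```

With 15 comparisons, any raw p-value above roughly 0.067 is reported as 1 after adjustment, so a column of 1.0000 entries is expected whenever most pairs are far from significant.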
The sex variable has two categories: Male and Female. Determine the difference in income, if there is any, between these two categories.
5.) Create a boxplot to examine the relationship between the sex and total personal income variables.
# Relationship between sex and total personal income
boxplot(Income$totalPersonalIncome ~ Income$sex)
This boxplot shows that there are a few outliers present, but because the sample size is large, these outliers are not of major concern.
Hypotheses
\(H_0: \mu_{female}=\mu_{male}\)
\(H_A: \mu_{female}\neq\mu_{male}\)
6.) Run a t-test to obtain the p-value, which determines whether or not Ho is rejected.
# Create Males and Females subsets
maleincome <- subset(Income,sex=="Male")
femaleincome <- subset(Income,sex=="Female")
# T.test
t.test(femaleincome$totalPersonalIncome,maleincome$totalPersonalIncome)
##
## Welch Two Sample t-test
##
## data: femaleincome$totalPersonalIncome and maleincome$totalPersonalIncome
## t = -5.118, df = 237.12, p-value = 6.38e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -30977.56 -13758.03
## sample estimates:
## mean of x mean of y
## 17441.34 39809.14
Decision and Conclusion
Since p < 0.05, we reject Ho. The data provides sufficient evidence to support the claim that there is a significant difference in total personal income between males and females.
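The 95 percent confidence interval in the output above can be rebuilt from the other reported numbers, which shows how the pieces fit together. This is a sketch that reuses the rounded values as printed, so it reproduces the interval only approximately.

```r
# Values copied from the t-test output above (rounded as printed)
mean_female <- 17441.34
mean_male   <- 39809.14
t_stat      <- -5.118
df_welch    <- 237.12

# Difference in means, and the standard error implied by t = diff / se
diff_means <- mean_female - mean_male
se <- diff_means / t_stat

# Margin of error uses the t critical value for a two-sided 95% interval
margin <- qt(0.975, df_welch) * se
ci <- c(diff_means - margin, diff_means + margin)
```

Because the whole interval lies below zero, it leads to the same decision as the p-value: the mean personal income of females in this sample is lower than that of males.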
All the analyses conducted above were to answer the question of whether or not there is a difference in Americans’ income based on sex, age, marital status, and race. To answer that question, we used two-independent-sample t-tests, an ANOVA test followed by pairwise t-tests with a Bonferroni adjustment, and a correlation analysis. The results were as follows for each variable analyzed.
a.) Race There is a significant difference in income between the Black and White populations, with the White population earning a higher personal income than the Black population.
b.) Age Children (<15 years old) were excluded from the data, since their income was not applicable. The sample size analyzed was 392, and age showed only a weak linear relationship with personal income (r ≈ 0.13).
c.) Marital Status The only groups between which there was a significant difference in income were the Never married/single and the Married/spouse present groups.
d.) Sex Analyses show there is a significant difference in income between males and females, with males having a higher personal income than females.
Below are some limitations of the analyses conducted.
a.) Although this research has been of some help, the data sample was collected in 2000, and income patterns may have changed since then. This research therefore describes income differences in the year 2000 and does not necessarily apply to the current decade.
b.) The sample included 47 states rather than all 50, so the states included may have been subject to more or fewer income disparities than the ones left out. Each state also has different wage levels, so the analysis says little about income differences that stem from factors other than the personal characteristics studied here.
c.) Some marital status groups were very small (for example, only 3 people in the Separated group in the full sample), which made it more difficult to assess the actual differences between all groups in the ANOVA and pairwise tests.
This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Nirwendjie Altidor
Semester: Fall 2018