In this week’s assignment I will use linear regression models to predict an individual’s net capital gains (capital gains minus capital losses, used here as a measure of income) using data extracted from the 1994 US Census data set. As predictors I will look at gender, race, number of years of education, and age, and, using an interaction term, I will look at differences within the categories.
library(Zelig)
library(readr)
library(pander)
library(radiant.data)
library(texreg)
library(lmtest)
library(visreg)
library(tidyverse)
In this step I loaded the required packages to analyze and visualize the data.
Income<- read_csv("C:/Users/Papa/Desktop/Soc 712 -R/adult.csv", col_names = TRUE)
head(Income)
In this step, I loaded my data set and saved it as “Income”. Currently, we only see the first 6 observations.
Income2 <- Income %>% mutate(newincome = capital.gain - capital.loss)
head(Income2)
In this step, I first created a new data set called Income2. Next, I created a continuous outcome variable called “newincome” using the mutate command by taking the difference of capital gains and capital loss variables.
Income2 <- Income2 %>% mutate(newsex = ifelse(sex == "Female",1,0))
head(Income2)
Here I generated a new variable called “newsex” using the mutate command in which “Female” is 1 and “Male” is 0.
lm0 <- lm(newincome ~ newsex, data = Income2)
summary(lm0)
Call:
lm(formula = newincome ~ newsex, data = Income2)
Residuals:
Min 1Q Median 3Q Max
-4999 -1229 -1229 -507 99492
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1229.16 50.14 24.515 <0.0000000000000002 ***
newsex -721.93 87.18 -8.281 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7401 on 32559 degrees of freedom
Multiple R-squared: 0.002102, Adjusted R-squared: 0.002071
F-statistic: 68.58 on 1 and 32559 DF, p-value: < 0.00000000000000022
For the first model, I looked at the relationship between income and sex. The result shows that women on average have 721.93 dollars less than men in net capital gains. The difference is statistically significant (p < 0.001), well beyond the 99% confidence level. I chose this variable because there is generally a gender pay gap.
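With a single 0/1 predictor, the regression simply reproduces the two group means: the intercept is the mean of the group coded 0 and the slope is the difference between the groups. The sketch below illustrates this on a small made-up data frame (the values in `toy` are hypothetical and are not taken from the census extract).

```r
# Toy data (hypothetical values, not taken from the census extract)
toy <- data.frame(
  newincome = c(1000, 1400, 1200, 300, 700, 500),
  newsex    = c(0, 0, 0, 1, 1, 1)   # 0 = Male, 1 = Female
)

fit <- lm(newincome ~ newsex, data = toy)

# Group means computed directly
male_mean   <- mean(toy$newincome[toy$newsex == 0])
female_mean <- mean(toy$newincome[toy$newsex == 1])

# The intercept equals the male mean; the slope equals female mean minus male mean
coef(fit)
c(male_mean = male_mean, gap = female_mean - male_mean)
```

Read the same way, lm0 says that men average 1229.16 dollars in net capital gains and women average 1229.16 - 721.93 = 507.23 dollars.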
lm1 <- lm(newincome ~ newsex + race + age, data = Income2)
summary(lm1)
Call:
lm(formula = newincome ~ newsex + race + age, data = Income2)
Residuals:
Min 1Q Median 3Q Max
-6943 -1418 -896 -396 100395
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -608.486 435.394 -1.398 0.162
newsex -598.674 87.895 -6.811 0.00000000000984 ***
raceAsian-Pac-Islander 738.306 477.191 1.547 0.122
raceBlack 4.547 439.080 0.010 0.992
raceOther 437.036 613.567 0.712 0.476
raceWhite 335.342 421.016 0.797 0.426
age 38.432 3.013 12.753 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7382 on 32554 degrees of freedom
Multiple R-squared: 0.007367, Adjusted R-squared: 0.007184
F-statistic: 40.27 on 6 and 32554 DF, p-value: < 0.00000000000000022
For the second model I added the variables race and age to the previous model. American-Indian/Eskimo is the omitted reference category for race, so the intercept represents a male American-Indian/Eskimo individual at age 0, and every race coefficient is a comparison with that group. Now we see that women have on average 598.67 dollars less than men in capital gains and that this is still statistically significant at a 99% confidence level. Next, we see that Asian/Pacific Islander individuals have on average 738.31 dollars more in capital gains than their American-Indian/Eskimo counterparts, Black individuals 4.55 dollars more, individuals in the Other races category 437.04 dollars more, and White individuals 335.34 dollars more. It should be noted that none of the race coefficients is statistically significant. Finally, for the age variable we see that with every additional year, capital gains increase on average by 38.43 dollars, which is statistically significant at a 99% confidence level.
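The choice of American-Indian/Eskimo as the baseline is not deliberate: `lm()` turns a character predictor into a factor, and R orders factor levels alphabetically, using the first one as the reference. The following sketch, on a toy vector that only borrows the category labels, shows how to check the reference level and how `relevel()` could select a different one.

```r
# Toy factor with the same category labels as the race variable
race <- factor(c("White", "Black", "Amer-Indian-Eskimo", "Other",
                 "Asian-Pac-Islander", "White"))

# Levels are sorted alphabetically; the first level is the omitted reference
levels(race)

# relevel() picks a different baseline without recoding the values
race_white_ref <- relevel(race, ref = "White")
levels(race_white_ref)
```

Changing the reference level changes which comparisons the coefficients report, but not the fitted values or the R-squared of the model.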
lm2 <- lm(newincome ~ newsex + race + age + education.num, data = Income2)
summary(lm2)
Call:
lm(formula = newincome ~ newsex + race + age + education.num,
data = Income2)
Residuals:
Min 1Q Median 3Q Max
-7518 -1627 -804 -70 100488
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3605.559 455.981 -7.907 0.00000000000000271 ***
newsex -598.927 87.319 -6.859 0.00000000000705673 ***
raceAsian-Pac-Islander 194.967 474.789 0.411 0.681
raceBlack -51.752 436.214 -0.119 0.906
raceOther 584.620 609.593 0.959 0.338
raceWhite 66.746 418.460 0.160 0.873
age 36.320 2.995 12.125 < 0.0000000000000002 ***
education.num 330.295 15.903 20.769 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7334 on 32553 degrees of freedom
Multiple R-squared: 0.02035, Adjusted R-squared: 0.02014
F-statistic: 96.59 on 7 and 32553 DF, p-value: < 0.00000000000000022
In this model, I added the number of years of education variable. This time we see that on average women have 598.93 dollars less than men in capital gains and that this is still statistically significant at a 99% confidence level. Next, we see that Asian/Pacific Islander individuals have on average 194.97 dollars more in capital gains than their American-Indian/Eskimo counterparts, Black individuals 51.75 dollars less, individuals in the Other races category 584.62 dollars more, and White individuals 66.75 dollars more. Again, none of the race coefficients is statistically significant. Finally, with every additional year of age capital gains increase on average by 36.32 dollars, and with every additional year of education they increase on average by 330.30 dollars. Both coefficients are statistically significant at a 99% confidence level.
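To see how the coefficients combine into a prediction, the sketch below plugs the lm2 estimates into the regression equation by hand. The person described (a 40-year-old White woman with 13 years of education) is a hypothetical example chosen for illustration.

```r
# Coefficients copied from the summary(lm2) output above
b <- c(intercept = -3605.559, newsex = -598.927, raceWhite = 66.746,
       age = 36.320, education.num = 330.295)

# Hypothetical person: White woman, 40 years old, 13 years of education
pred <- b["intercept"] + b["newsex"] * 1 + b["raceWhite"] * 1 +
  b["age"] * 40 + b["education.num"] * 13

# Predicted net capital gains in dollars
unname(pred)
```

The large negative intercept is not meaningful by itself, since it corresponds to someone aged 0 with 0 years of education, which lies outside the range of the data.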
lm3 <- lm(newincome ~ newsex*race + age + education.num, data = Income2)
summary(lm3)
Call:
lm(formula = newincome ~ newsex * race + age + education.num,
data = Income2)
Residuals:
Min 1Q Median 3Q Max
-7474 -1631 -798 -64 100526
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3711.615 558.888 -6.641 0.0000000000316 ***
newsex -301.621 855.725 -0.352 0.724
raceAsian-Pac-Islander 293.991 599.153 0.491 0.624
raceBlack -134.379 560.775 -0.240 0.811
raceOther 867.200 782.472 1.108 0.268
raceWhite 197.400 532.252 0.371 0.711
age 36.158 2.997 12.064 < 0.0000000000000002 ***
education.num 330.116 15.911 20.748 < 0.0000000000000002 ***
newsex:raceAsian-Pac-Islander -251.889 982.811 -0.256 0.798
newsex:raceBlack 97.491 895.044 0.109 0.913
newsex:raceOther -718.742 1248.127 -0.576 0.565
newsex:raceWhite -350.373 861.023 -0.407 0.684
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7334 on 32549 degrees of freedom
Multiple R-squared: 0.02044, Adjusted R-squared: 0.0201
F-statistic: 61.73 on 11 and 32549 DF, p-value: < 0.00000000000000022
For this model I looked at the interaction between sex and race with respect to net capital gains, to see whether the gender gap differs across races. I chose the interaction between race and gender to get a closer look at income discrepancies.
In this model, the intercept (-3711.615) represents male individuals of American-Indian/Eskimo race at age 0 with 0 years of education. Because of the interaction, the main effects now refer to the reference groups only. The newsex coefficient says that, among American-Indian/Eskimo individuals, women have on average 301.62 dollars less than men in capital gains. The race coefficients compare men only: Asian/Pacific Islander men have on average 293.99 dollars more than American-Indian/Eskimo men, Black men 134.38 dollars less, men of Other races 867.20 dollars more, and White men 197.40 dollars more. For each additional year of age capital gains increase on average by 36.16 dollars, and for each additional year of education they increase on average by 330.12 dollars. The interaction coefficients do not compare women across races directly; they show how the gender gap in each race differs from the gap of -301.62 dollars in the reference group. The gap is 251.89 dollars wider among Asian/Pacific Islanders (-553.51 in total), 97.49 dollars narrower among Black individuals (-204.13), 718.74 dollars wider among Other races (-1020.36), and 350.37 dollars wider among White individuals (-651.99). In this model only the coefficients for age and years of education are statistically significant at a 99% confidence level; none of the sex, race, or interaction coefficients is.
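The gender gap within each race follows from adding the newsex main effect to the matching interaction coefficient. A short sketch of that arithmetic, using the estimates printed in the summary(lm3) output (the variable names are only illustrative):

```r
# Coefficients copied from the summary(lm3) output above
sex_main <- -301.621   # newsex: female minus male gap in the reference race
sex_by_race <- c(
  "Asian-Pac-Islander" = -251.889,
  "Black"              =   97.491,
  "Other"              = -718.742,
  "White"              = -350.373
)

# Gender gap (female minus male) within each race
gap_by_race <- c("Amer-Indian-Eskimo" = sex_main, sex_main + sex_by_race)
round(gap_by_race, 2)
```

As a consistency check, the gap for White individuals (-651.99) equals the newsex coefficient reported for lmodel3 further down, where White is the reference category.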
Income3 <- Income2 %>%
select(newincome, newsex, race ) %>%
group_by(newsex, race) %>%
summarise(mean = mean(newincome))
Income3
In this step, I created another data set called “Income3”, which holds group-wise summary statistics of the dependent variable: the mean capital gains for each combination of race and sex. To check my interpretation, I will compare the results of the regression with the interaction terms against these summary statistics. For each race group I will compute the difference between the male and female values, once from the regression results and once from the summary. Then I will look at whether the differences between the regression and the summary are substantial or not.
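The summary side of this comparison can be sketched as follows. The example uses base R's `tapply()` on a small made-up data frame (`toy`, with hypothetical values) instead of the census data, and computes the raw gender gap per race from the cell means.

```r
# Toy data (hypothetical values) standing in for Income2
toy <- data.frame(
  newincome = c(900, 1100, 400, 600, 1500, 1300, 700, 500),
  newsex    = c(0, 0, 1, 1, 0, 0, 1, 1),
  race      = rep(c("Black", "White"), each = 4)
)

# Mean of newincome for every sex-by-race cell (rows: newsex, columns: race)
cell_means <- tapply(toy$newincome, list(toy$newsex, toy$race), mean)

# Raw gender gap (female minus male) within each race
raw_gap <- cell_means["1", ] - cell_means["0", ]
raw_gap
```

Raw gaps of this kind are not adjusted for age or years of education, so some deviation from the regression-based gaps is expected. They would match exactly only for a model containing nothing but sex, race, and their interaction.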
Gender differences among Asian males & females:
This shows a large difference between the two results.
Gender differences among Black males and females:
We see again large differences from both results.
Gender differences among Other races males and females:
For this group we also see large differences from both results.
Gender differences among White males and females:
Checking the regression estimates against the summary results suggests that the findings of the regression with the interaction term make sense. The regression model showed that, even with the interaction term, the race and sex coefficients are not statistically significant, and the large deviations between the two sets of differences are consistent with that imprecision, which makes those two variables unreliable for estimation. Part of the deviation is also expected because the group means are not adjusted for age and years of education, whereas the regression estimates are.
Based on this conclusion, we will go back to the last model and look at the two variables that were statistically significant, age and years of education.
Income4 <- Income2 %>%
select(newincome, education.num) %>%
group_by(education.num) %>%
summarise(mean = mean(newincome))
Income4
In this step, I created another data set called “Income4”, which holds group-wise summary statistics of the dependent variable: the mean capital gains for each number of years of education. Looking at both results, we see that capital gains tend to increase as years of education increase, which is consistent with the positive and statistically significant education coefficient in the regression.
library(texreg)
htmlreg(list(lm0, lm1, lm2, lm3))
| | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| (Intercept) | 1229.16*** | -608.49 | -3605.56*** | -3711.62*** |
| | (50.14) | (435.39) | (455.98) | (558.89) |
| newsex | -721.93*** | -598.67*** | -598.93*** | -301.62 |
| | (87.18) | (87.89) | (87.32) | (855.72) |
| raceAsian-Pac-Islander | | 738.31 | 194.97 | 293.99 |
| | | (477.19) | (474.79) | (599.15) |
| raceBlack | | 4.55 | -51.75 | -134.38 |
| | | (439.08) | (436.21) | (560.78) |
| raceOther | | 437.04 | 584.62 | 867.20 |
| | | (613.57) | (609.59) | (782.47) |
| raceWhite | | 335.34 | 66.75 | 197.40 |
| | | (421.02) | (418.46) | (532.25) |
| age | | 38.43*** | 36.32*** | 36.16*** |
| | | (3.01) | (3.00) | (3.00) |
| education.num | | | 330.29*** | 330.12*** |
| | | | (15.90) | (15.91) |
| newsex:raceAsian-Pac-Islander | | | | -251.89 |
| | | | | (982.81) |
| newsex:raceBlack | | | | 97.49 |
| | | | | (895.04) |
| newsex:raceOther | | | | -718.74 |
| | | | | (1248.13) |
| newsex:raceWhite | | | | -350.37 |
| | | | | (861.02) |
| R2 | 0.00 | 0.01 | 0.02 | 0.02 |
| Adj. R2 | 0.00 | 0.01 | 0.02 | 0.02 |
| Num. obs. | 32561 | 32561 | 32561 | 32561 |
| RMSE | 7401.31 | 7382.33 | 7334.01 | 7334.13 |
| *** p < 0.001, ** p < 0.01, * p < 0.05 | | | | |
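Using the texreg package we put the results of the 4 models into a table to compare them; standard errors are shown in parentheses below each coefficient. Even though the models were not very successful in explaining the factors that influence capital gains, we were able to slightly improve the fit from an R^2 of 0.00 (0.002 before rounding) in the first model to an R^2 of 0.02 in the last model. An R^2 of 0.02 means that the model explains only 2% of the variation in capital gains. Note that almost all of this improvement comes from adding age and education: the interaction in Model 4 leaves R^2 practically unchanged compared with Model 3 (0.02044 versus 0.02035). This means that in the future, we need to select better independent variables and choose our interaction terms more carefully.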
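Whether the interaction terms add anything beyond Model 3 can be checked with an F-test for nested models. The sketch below approximates that test from the R-squared values and residual degrees of freedom printed in the summaries; because those R-squared values are rounded, the result is only approximate.

```r
# R-squared values and residual degrees of freedom copied from the summaries above
r2_reduced <- 0.02035   # lm2 (no interaction)
r2_full    <- 0.02044   # lm3 (sex-by-race interaction)
q          <- 4         # number of added interaction terms
df_resid   <- 32549     # residual degrees of freedom of lm3

# F statistic for the joint test that all added coefficients are zero
f_stat  <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)
p_value <- pf(f_stat, q, df_resid, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

With the fitted models still in memory, `anova(lm2, lm3)` performs the same comparison from the unrounded values.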
The first graph shows the relationship between age and capital gains.
library(visreg)
visreg(lm2, "age", scale = "response")
In line with our regression results, the fitted line slopes upward: higher age is associated with higher capital gains, and the age coefficient is statistically significant.
The second graph shows the relationship between years of education and capital gains.
visreg(lm2, "education.num", scale = "response")
Again, the graph is in line with our regression results: more years of education are associated with higher capital gains, and the education coefficient is statistically significant.
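For the third graph we first need to recode race into a factor with numeric labels (1 = White, 2 = Asian-Pac-Islander, 3 = Black, 4 = Other, 5 = Amer-Indian-Eskimo), create a new data set (Income5), and re-run the lm2 model as lmodel2. With this coding, White becomes the reference category.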
Income5 <- Income2%>%
mutate(newrace = factor(ifelse(race == "White", 1,
ifelse(race == "Asian-Pac-Islander", 2,
ifelse(race == "Black", 3,
ifelse(race == "Other", 4,
ifelse(race == "Amer-Indian-Eskimo", 5, "error")))))))
Income5
lmodel2 <- lm(newincome ~ newsex + newrace + age + education.num, data = Income5)
summary(lmodel2)
Call:
lm(formula = newincome ~ newsex + newrace + age + education.num,
data = Income5)
Residuals:
Min 1Q Median 3Q Max
-7518 -1627 -804 -70 100488
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3538.813 203.491 -17.391 < 0.0000000000000002 ***
newsex -598.927 87.319 -6.859 0.00000000000706 ***
newrace2 128.220 232.141 0.552 0.581
newrace3 -118.499 139.725 -0.848 0.396
newrace4 517.874 448.451 1.155 0.248
newrace5 -66.746 418.460 -0.160 0.873
age 36.320 2.995 12.125 < 0.0000000000000002 ***
education.num 330.295 15.903 20.769 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7334 on 32553 degrees of freedom
Multiple R-squared: 0.02035, Adjusted R-squared: 0.02014
F-statistic: 96.59 on 7 and 32553 DF, p-value: < 0.00000000000000022
visreg(lmodel2, "newrace", by = "newsex", scale = "response")
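Looking at this graph, we see predicted capital gains by race, conditioned on gender. The differences between races are small relative to their uncertainty, in line with the race coefficients not being statistically significant. The gender difference, by contrast, corresponds to the newsex coefficient (-598.93), which is statistically significant in this model.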
visreg(lmodel2, "newsex", by = "newrace", scale = "response")
Now we switched gender and race, so the graph shows capital gains by gender, conditioned on race. Because this model has no interaction term, the gender gap is constrained to be the same for every race, and again the differences between races are not statistically significant.
Next, we will re-run the lm3 model with the interaction term and label it lmodel3. Finally, we will generate the last graph.
lmodel3 <- lm(newincome ~ newsex*newrace + age + education.num, data = Income5)
summary(lmodel3)
Call:
lm(formula = newincome ~ newsex * newrace + age + education.num,
data = Income5)
Residuals:
Min 1Q Median 3Q Max
-7474 -1631 -798 -64 100526
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3514.215 204.139 -17.215 < 0.0000000000000002 ***
newsex -651.994 95.404 -6.834 0.0000000000084 ***
newrace2 96.590 284.146 0.340 0.7339
newrace3 -331.779 192.997 -1.719 0.0856 .
newrace4 669.799 579.219 1.156 0.2475
newrace5 -197.400 532.252 -0.371 0.7107
age 36.158 2.997 12.064 < 0.0000000000000002 ***
education.num 330.116 15.911 20.748 < 0.0000000000000002 ***
newsex:newrace2 98.484 492.238 0.200 0.8414
newsex:newrace3 447.864 279.264 1.604 0.1088
newsex:newrace4 -368.368 913.536 -0.403 0.6868
newsex:newrace5 350.373 861.023 0.407 0.6841
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7334 on 32549 degrees of freedom
Multiple R-squared: 0.02044, Adjusted R-squared: 0.0201
F-statistic: 61.73 on 11 and 32549 DF, p-value: < 0.00000000000000022
visreg(lmodel3, "newrace", by = "newsex", scale = "response")
Again, the graph is in line with the regression output: none of the interaction terms is statistically significant, so the gender gap in capital gains does not differ significantly across races.
Using the regression models, the summary statistics, and the graphs, we can see that race and the interaction between gender and race are not statistically significant predictors of capital gains. Gender on its own is statistically significant in the models without the interaction (women average roughly 600 dollars less than men), but the gap cannot be shown to differ across races. The variables that proved to be statistically significant in every model that includes them are age and years of education, both of which are associated with higher capital gains.