01. Summarize and describe the data set. Your answer should include: What share of the sample is female? What share of the sample is non-white? How skewed are the income and education distributions? Create three figures (graphs) that summarize key variables. Create two figures (graphs) that summarize how key variables relate to each other. Explain your decisions on summarizing the data. What do you learn about potential relationships?
library(pacman)
## Warning: package 'pacman' was built under R version 3.6.3
p_load(tidyverse,
stargazer,
estimatr,
here,
magrittr,
purrr,
ggthemes)
file_data <- read.csv("proj1.csv")
female_data <- file_data %>% filter(female == 1)
female_data_sum <- sum(female_data$female)
female_data_sum
## [1] 2600
nonwhite_data <- file_data %>% filter(nonwhite == 1)
nonwhite_sum <- sum(nonwhite_data$nonwhite)
nonwhite_sum
## [1] 3126
Thus there are 2,600 females in the sample, which leaves 5000 - 2600 = 2,400 males. The female share is 2600 / 5000 * 100 = 52 percent.
We can also tell that 3,126 people in our sample are nonwhite, so the nonwhite share is 3126 / 5000 * 100 = 62.52 percent.
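Because female and nonwhite are 0/1 indicators, their means are the sample shares, so the counting and dividing can be done in one step. A minimal sketch on a small made-up data frame (the values in `toy_data` are hypothetical and are not taken from proj1.csv):

```r
# Toy data standing in for file_data (hypothetical values, not proj1.csv)
toy_data <- data.frame(
  female   = c(1, 0, 1, 1, 0),
  nonwhite = c(1, 1, 0, 1, 0)
)

# The mean of a 0/1 indicator is the share of observations equal to 1
share_female   <- mean(toy_data$female)
share_nonwhite <- mean(toy_data$nonwhite)

# Express both shares as percentages
round(100 * c(female = share_female, nonwhite = share_nonwhite), 2)
```

Applied to the real data, `mean(file_data$female)` would return 2600 / 5000 = 0.52.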
# Scatter plot of income against education with a fitted OLS line
ggplot(file_data, aes(x = education,
                      y = income)) +
  geom_point(shape = 1,
             color = "darkgreen") +
  theme_minimal() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  labs(x = "Education",
       y = "Income",
       title = "Effect of Education on Income") +
  geom_smooth(method = "lm")
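Looking at this graph, the data appear to be skewed left. A scatter plot is only an indirect way to judge skewness, though; separate histograms of income and education would show the shape of each distribution more directly.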
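Skewness can also be measured numerically rather than read off a plot. The sketch below computes the moment-based sample skewness in base R. The helper name `sample_skewness` and the simulated variable `sim_income` are assumptions made for illustration; nothing here uses proj1.csv.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
# sample_skewness is a hypothetical helper, not a function from a package.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated right-skewed variable (log-normal), NOT the income column from proj1.csv
set.seed(1)
sim_income <- exp(rnorm(5000, mean = 10, sd = 1))

skew_levels <- sample_skewness(sim_income)       # positive: long right tail
skew_logs   <- sample_skewness(log(sim_income))  # near zero after taking logs
c(levels = skew_levels, logs = skew_logs)
```

A positive value indicates a long right tail, a negative value a long left tail, and a value near zero a roughly symmetric distribution. On the real data the same helper could be called as `sample_skewness(file_data$income)` and `sample_skewness(file_data$education)`.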
hist(file_data$female,
col = "darkgreen",
main = "Histogram of Female Data",
xlab = "Female")
hist(file_data$nonwhite,
col = "darkgreen",
main = "Histogram of Non-White Data",
xlab = "Non-White")
hist(file_data$urban,
col = "darkgreen",
main = "Histogram of Urban Data",
xlab = "Urban")
# Scatter plot of income against number of kids with a fitted OLS line
ggplot(file_data, aes(x = kids,
                      y = income)) +
  geom_point(shape = 1,
             color = "darkgreen") +
  theme_minimal() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  labs(x = "Kids",
       y = "Income",
       title = "Effect of Kids on Income") +
  geom_smooth(method = "lm")
# Scatter plot of income against ability with a fitted OLS line
ggplot(file_data, aes(x = ability,
                      y = income)) +
  geom_point(shape = 1,
             color = "darkgreen") +
  theme_minimal() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  labs(x = "Ability",
       y = "Income",
       title = "Effect of Ability on Income") +
  geom_smooth(method = "lm")
I learned that there are more nonwhite people than white people in our sample. I also noticed that we have slightly more females than males. The number of people living in an urban environment is the same as the number living in a non-urban environment.
Looking at the graph of kids against income, there does not seem to be any correlation: the line of best fit appears to have a slope of zero. In the graph of ability against income, income varies widely at every level of ability, and ability does not appear to be strongly correlated with income. However, high variance in income relative to ability does not mean we can omit ability as a variable, because it can still be statistically significant.
02. Regress individuals’ income (income) on an intercept and their education (education).
# Simple regression of income on education
reg1 <- lm(income ~ education, data = file_data)
stargazer(reg1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## income
## -----------------------------------------------
## education 34,559.590***
## (608.848)
##
## Constant -227,319.400***
## (6,251.173)
##
## -----------------------------------------------
## Observations 5,000
## R2 0.392
## Adjusted R2 0.392
## Residual Std. Error 137,767.600 (df = 4998)
## F Statistic 3,221.957*** (df = 1; 4998)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
03. Create a scatter plot with the residuals from 02 on the y axis and education on the x axis.
file_data$residuals_1 <- resid(reg1)
# Residuals from reg1 plotted against education
ggplot(file_data, aes(x = education, y = residuals_1)) +
  geom_point(shape = 1,
             color = "darkgreen") +
  theme_minimal() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  labs(x = "Education",
       y = "Residuals (Income)",
       title = "Effect of Education on Residuals") +
  geom_smooth(method = "lm")
04. Does the scatter plot from 03 suggest that heteroskedasticity may be present? Explain your answer
Yes, the scatter plot from 03 suggests that heteroskedasticity is present. The spread of the residuals changes with education: the plot funnels from both sides, which is a classic sign of heteroskedasticity. The residuals also look skewed left and are not evenly scattered around zero. The two-sided funnel matters for testing. The Goldfeld-Quandt (GQ) test compares the residual variance in the low-education group with the residual variance in the high-education group, so when both ends have a similar spread it can fail to reject the null hypothesis of homoskedasticity, even though the plot shows that the errors are heteroskedastic.
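The point about the GQ test can be illustrated with a small simulation. The sketch below is written in base R and uses simulated data only. The helper `gq_test`, the 3/8 group fraction, and every variable name are assumptions made for this illustration, not part of the assignment or of any package.

```r
# Goldfeld-Quandt style test: compare residual variance at the low and high ends of x.
# gq_test is a hypothetical helper written for this sketch, not a package function.
gq_test <- function(y, x, frac = 3 / 8) {
  ord <- order(x)
  y <- y[ord]
  x <- x[ord]
  n <- length(y)
  k <- floor(n * frac)             # observations in each end group
  lo <- seq_len(k)                 # lowest-x group
  hi <- (n - k + 1):n              # highest-x group
  ssr_lo <- sum(resid(lm(y[lo] ~ x[lo]))^2)
  ssr_hi <- sum(resid(lm(y[hi] ~ x[hi]))^2)
  df <- k - 2                      # intercept and slope estimated in each group
  f_stat <- max(ssr_lo, ssr_hi) / min(ssr_lo, ssr_hi)
  p_value <- min(1, 2 * pf(f_stat, df, df, lower.tail = FALSE))  # two-sided
  c(f_stat = f_stat, p_value = p_value)
}

set.seed(42)
x_left <- seq(0, 4.99, length.out = 300)
x <- c(x_left, 10 - x_left)        # x values mirrored around 5
z <- rnorm(300)

# Case 1: one-sided funnel, error spread grows with x
e_one <- rnorm(600) * (0.5 + x)

# Case 2: funnel on both sides, identical spread at both ends by construction
e_two <- rep(z * (0.5 + abs(x_left - 5)), 2)

gq_one <- gq_test(1 + 2 * x + e_one, x)
gq_two <- gq_test(1 + 2 * x + e_two, x)
rbind(one_sided = gq_one, two_sided = gq_two)
```

In the one-sided case the test rejects homoskedasticity. In the mirrored case the two end groups have the same residual variance, so the test does not reject even though the error variance clearly depends on x. A test that does not rely on comparing the two ends, such as the White or Breusch-Pagan test, is not fooled by this symmetry in the same way.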
05. More generally: Does the scatter plot from 03 suggest that there are any issues with your specification? Explain
06. Explain why the regression in 02 could suffer from omitted-variable bias.
The regression in 02 suffers from omitted-variable bias because the variable ability is omitted. Ability is often left out because it is very hard to measure, yet it is correlated with education and it also affects income, which are the two conditions for omitted-variable bias.
07. Give an example of an omitted variable (other than ability) that could cause bias in the regression in 02. Just to be clear: Do not use the variable ability as your example. Explain how your example variable satisfies both requirements for omitted-variable bias. Describe the direction of the bias this variable would cause (when we estimate the effect of education on income). Explain your answer.
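An omitted variable other than ability could be gender or race. Omitted-variable bias requires two things: the omitted variable must affect the outcome (income), and it must be correlated with an included explanatory variable (education). I would assume that being female and/or nonwhite has an impact on a person's income and that gender and race are correlated with education. The direction of the bias is the sign of the omitted variable's effect on income multiplied by the sign of its correlation with education. If, for example, being female lowers income and women in the sample have less education on average, both signs are negative, and the estimated effect of education is biased upward.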
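The direction of the bias can be written out with the standard omitted-variable bias formula. The notation below is an assumption made for this sketch: z stands for the omitted variable, and the "short" superscript marks the regression that leaves z out.

```latex
\begin{align}
  % True model, with z the omitted variable
  \text{income}_i &= \beta_0 + \beta_1\,\text{education}_i + \beta_2\, z_i + u_i \\
  % Probability limit of the education coefficient when z is left out
  \operatorname{plim}\,\hat{\beta}_1^{\text{short}}
    &= \beta_1 + \beta_2\,
       \frac{\operatorname{Cov}(\text{education}_i,\, z_i)}{\operatorname{Var}(\text{education}_i)}
\end{align}
```

The bias term is positive (upward bias) when \beta_2 and the covariance have the same sign, and negative (downward bias) when they have opposite signs. It is zero if either the omitted variable has no effect on income or it is uncorrelated with education.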
08. Now regress income on an intercept, education, and ability. Interpret the results.
# Regression of income on education and ability
reg2 <- lm(income ~ education + ability, data = file_data)
stargazer(reg2, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## income
## -----------------------------------------------
## education 33,324.970***
## (619.306)
##
## ability 650.066***
## (72.034)
##
## Constant -244,530.100***
## (6,488.086)
##
## -----------------------------------------------
## Observations 5,000
## R2 0.402
## Adjusted R2 0.401
## Residual Std. Error 136,672.200 (df = 4997)
## F Statistic 1,677.627*** (df = 2; 4997)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The p-values are very small (p < 0.01 for every coefficient), so the estimates are statistically significant. Holding education constant, a one-unit increase in ability is associated with a 650.066 increase in income. Holding ability constant, a one-unit increase in education is associated with a 33,324.970 increase in income.
09. Does your estimate for the effect of education on income change from question 02 to question 08? Explain why this change (or lack of change) makes sense. Hint: Is there a significant relationship between education and ability?
Yes. The estimate falls from 34,559.59 in question 02 to 33,324.97 in question 08, because question 02 omitted the variable ability. Ability has a positive effect on income (650.066 per unit) and is positively correlated with education, so leaving it out biased the education coefficient upward: education was absorbing part of ability's effect. We include ability in question 08 because it is an important variable that we cannot omit.
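The link between the two education estimates is an exact in-sample identity: the coefficient from the regression without ability equals the coefficient from the regression with ability, plus the ability coefficient times the slope from regressing ability on education. A minimal sketch on simulated data (all `sim_` variables and their parameter values are assumptions for illustration, not the proj1.csv data):

```r
set.seed(123)
n <- 5000

# Simulated data in which education is positively correlated with ability
sim_ability   <- rnorm(n, mean = 100, sd = 15)
sim_education <- 2 + 0.1 * sim_ability + rnorm(n)
sim_income    <- 1000 * sim_education + 50 * sim_ability + rnorm(n, sd = 500)

short_fit <- lm(sim_income ~ sim_education)                # omits ability
long_fit  <- lm(sim_income ~ sim_education + sim_ability)  # includes ability
aux_fit   <- lm(sim_ability ~ sim_education)               # ability on education

b_short <- coef(short_fit)[["sim_education"]]
b_long  <- coef(long_fit)[["sim_education"]]
b_abil  <- coef(long_fit)[["sim_ability"]]
delta   <- coef(aux_fit)[["sim_education"]]

# Omitted-variable identity: short coefficient = long coefficient + bias term
implied_short <- b_long + b_abil * delta
c(short = b_short, implied = implied_short, long = b_long)
```

On the real data the same three regressions could be run with income, education, and ability from file_data to check how much of the drop in the education coefficient is accounted for by ability.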
10. Up to this point, we have generally told you which regressions to run. And we’ve stuck with pretty simple regressions (e.g., regress y on x1 and x2). We now want you to explore the actual complexity of econometric/statistical analyses. Estimate three new models. These models should not match your previous models (in 02 and 08). Across these three new models, you should include (at least once): a log-transformed variable (i.e., use log) and an interaction.
# reg3: all available controls
reg3 <- lm(income ~ education + ability + female + nonwhite + urban, data = file_data)
# reg4: female-by-nonwhite interaction (female * nonwhite expands to both main effects plus the interaction)
reg4 <- lm(income ~ education + ability + female * nonwhite, data = file_data)
# reg5: log-transformed outcome
reg5 <- lm(log(income) ~ education + ability + female, data = file_data)
stargazer(reg3, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## income
## -----------------------------------------------
## education 34,032.630***
## (643.017)
##
## ability 635.733***
## (71.873)
##
## female -34,632.500***
## (3,893.467)
##
## nonwhite -5,434.732
## (4,093.320)
##
## urban 286.443
## (3,836.374)
##
## Constant -229,524.000***
## (7,747.643)
##
## -----------------------------------------------
## Observations 5,000
## R2 0.411
## Adjusted R2 0.411
## Residual Std. Error 135,610.600 (df = 4994)
## F Statistic 697.907*** (df = 5; 4994)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(reg4, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## income
## -----------------------------------------------
## education 34,033.080***
## (643.103)
##
## ability 635.805***
## (71.886)
##
## female -34,776.400***
## (6,320.337)
##
## nonwhite -5,551.476
## (5,827.209)
##
## female:nonwhite 231.845
## (7,937.430)
##
## Constant -229,316.400***
## (7,925.311)
##
## -----------------------------------------------
## Observations 5,000
## R2 0.411
## Adjusted R2 0.411
## Residual Std. Error 135,610.600 (df = 4994)
## F Statistic 697.905*** (df = 5; 4994)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(reg5, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## log(income)
## -----------------------------------------------
## education 0.347***
## (0.003)
##
## ability 0.003***
## (0.0003)
##
## female -0.299***
## (0.018)
##
## Constant 7.325***
## (0.030)
##
## -----------------------------------------------
## Observations 5,000
## R2 0.764
## Adjusted R2 0.764
## Residual Std. Error 0.622 (df = 4996)
## F Statistic 5,400.744*** (df = 3; 4996)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
11. How did you choose your specifications in 10? Explain your decision making.
I started by including every available variable in the regression and checked which ones were statistically significant. I then dropped the variables that were not. The variables that appear significant are education, ability, and female. I also tried an interaction between female and nonwhite, since I would expect race and gender to both play a role. However, the p-value on the interaction was too large for it to be statistically significant, so I omitted it, which leaves regression number 5. There I used a log transformation on the Y variable, income, so that each coefficient can be read as an approximate percent change: a one-unit change in x is associated with roughly a 100 * beta percent change in income.
12. Which of your new models is “best”—if you must choose one model, which would you choose? Why?
reg5 <- lm(log(income) ~ education + ability + female, data = file_data)
stargazer(reg5, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## log(income)
## -----------------------------------------------
## education 0.347***
## (0.003)
##
## ability 0.003***
## (0.0003)
##
## female -0.299***
## (0.018)
##
## Constant 7.325***
## (0.030)
##
## -----------------------------------------------
## Observations 5,000
## R2 0.764
## Adjusted R2 0.764
## Residual Std. Error 0.622 (df = 4996)
## F Statistic 5,400.744*** (df = 3; 4996)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
I chose regression number 5 because, as I said in question 11, the log transformation allows me to read each coefficient as an approximate percent change (100 times beta) in income for a one-unit increase in a variable. For example, a one-unit increase in education is associated with approximately a 34.7% increase in income, and a one-unit increase in ability with approximately a 0.3% increase. Being female is associated with approximately a 29.9% decrease in income. I chose this regression because it includes all the important variables and lets me see percent changes.
13. For your “best” model (chosen in 12): Interpret the coefficients and comment on their statistical significance.
A one-unit increase in education is associated with approximately a 34.7% increase in income, holding the other variables constant. A one-unit increase in ability is associated with approximately a 0.3% increase in income. Being female is associated with approximately a 29.9% decrease in income. Each of these coefficients has a small p-value (p < 0.01), so each is statistically significant.
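The 100 * beta reading is an approximation that works well for small coefficients and less well for large ones. In a log-level model the exact percent change for a one-unit increase is 100 * (exp(beta) - 1). The short sketch below compares the two for the reg5 coefficients, which are typed in by hand from the table rather than pulled from a fitted model object:

```r
# Coefficients copied by hand from the reg5 table
reg5_coefs <- c(education = 0.347, ability = 0.003, female = -0.299)

approx_pct <- 100 * reg5_coefs              # the usual 100 * beta approximation
exact_pct  <- 100 * (exp(reg5_coefs) - 1)   # exact percent change in a log-level model

round(rbind(approx = approx_pct, exact = exact_pct), 2)
```

For ability the two numbers are nearly identical. For education and female the coefficients are large enough that the exact figures differ noticeably from the approximation: roughly a 41.5% increase for education and roughly a 25.8% decrease for female.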
14. Do you trust the estimates from your best model? Explain why/why not.
I do trust the estimates from my best model, because it includes the variables that matter and whose omission was causing bias. Including these important variables reduces omitted-variable bias. I also trust my estimates because each variable has a small p-value, meaning that the coefficients are statistically significant. Another reason this is the “best” model is that, when comparing the stargazer tables, reg5 has the highest adjusted R-squared value (0.764) of all the regressions. This comparison is only suggestive, however, because the dependent variable in reg5 is log income while the other models use income in levels, so the R-squared values are not measured on the same scale.
15. Suppose you want to estimate the effect of high-school graduation. How could you use the current data to estimate this effect? Describe any regressions, estimates, figures, and/or caveats you would make. Note: You can assume that someone with 12 years of education graduated from high school.
# Flag people with more than 12 years of education and keep only that group
file_data_grad_info <- file_data %>%
  mutate(grad_level = ifelse(education > 12, 1, 0)) %>%
  filter(grad_level == 1)
sum(file_data_grad_info$grad_level)
## [1] 933
dim(file_data)
## [1] 5000 9
Thus, out of the 5,000 people in our sample, only 933 have more than 12 years of education. That is 933 / 5000 = 18.66% of the sample with education beyond high school. Note that this filter (education > 12) excludes people with exactly 12 years of education, who also count as high-school graduates.
# Same specification as reg5, estimated on the education > 12 subsample
reg6 <- lm(log(income) ~ education + ability + female, data = file_data_grad_info)
stargazer(reg6, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## log(income)
## -----------------------------------------------
## education 0.469***
## (0.017)
##
## ability 0.002***
## (0.001)
##
## female -0.636***
## (0.037)
##
## Constant 6.218***
## (0.229)
##
## -----------------------------------------------
## Observations 933
## R2 0.524
## Adjusted R2 0.522
## Residual Std. Error 0.534 (df = 929)
## F Statistic 340.755*** (df = 3; 929)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
After filtering the sample down to the 18.66% of people with more than 12 years of education, I ran the same specification as reg5. This restricts the sample to education > 12 rather than adding a control for it. In this subsample the coefficient on education is higher: for a person who already has a high-school degree, one additional year (unit) of education is associated with approximately a 46.9% increase in income. It is also worth noting that the coefficient on female more than doubled in magnitude, from -0.299 in reg5 to -0.636 in reg6.
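Another way to approach the question, offered here only as a sketch, is to keep the full sample and add an indicator for having at least 12 years of education, so that its coefficient compares graduates with non-graduates. The code below uses simulated data. The name `hs_grad`, the simulated data frame `sim_data`, and the effect size of 0.4 built into the simulation are all assumptions for illustration, not results from proj1.csv.

```r
set.seed(2024)
n <- 2000

# Simulated sample (NOT proj1.csv): education in years, ability score, female indicator
sim_data <- data.frame(
  education = sample(8:16, n, replace = TRUE),
  ability   = rnorm(n, mean = 100, sd = 15),
  female    = rbinom(n, 1, 0.5)
)

# Indicator for high-school graduation: 12 or more years of education
sim_data$hs_grad <- as.integer(sim_data$education >= 12)

# Simulated log income with an assumed graduation effect of 0.4 log points
sim_data$income <- exp(8 + 0.4 * sim_data$hs_grad + 0.003 * sim_data$ability -
                         0.3 * sim_data$female + rnorm(n, sd = 0.5))

# Regress log income on the graduation indicator and controls, using the full sample
grad_fit <- lm(log(income) ~ hs_grad + ability + female, data = sim_data)
hs_coef  <- coef(grad_fit)[["hs_grad"]]
round(coef(summary(grad_fit)), 3)
```

On the real data, the analogous steps would be `mutate(hs_grad = as.integer(education >= 12))` followed by the same regression on file_data. The usual caveat applies: the coefficient on the indicator only has a causal reading if no omitted variable is correlated with both graduation and income.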