#Robert Ross
#1) Critique each of the plans
A difference in means alone might not give enough information, because more factors are at play than the means capture. For example, workers could have different amounts of experience, different education levels, or work in different types of engineering. Running a regression of wage on a dummy variable for woman and other independent variables such as experience, education, and age may offer better results.
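To make the contrast concrete, here is a minimal sketch comparing a raw difference in means with a regression-adjusted gap. The data are simulated purely for illustration; the variable names (`engineers`, `experience`, `education`) and all effect sizes are assumptions, not values from the problem.

```r
# Simulated data for illustration only; names and effect sizes are assumptions.
set.seed(1)
n <- 500
female     <- rbinom(n, 1, 0.3)
experience <- round(runif(n, 0, 30))
education  <- sample(c(16, 18, 21), n, replace = TRUE)
wage <- 20 + 0.8 * experience + 1.5 * education - 3 * female + rnorm(n, sd = 5)
engineers <- data.frame(wage, female, experience, education)

# Raw difference in means: ignores experience and education
diff_means <- mean(engineers$wage[engineers$female == 1]) -
  mean(engineers$wage[engineers$female == 0])

# Regression: wage gap holding experience and education fixed
fit <- lm(wage ~ female + experience + education, data = engineers)
adj_gap <- coef(fit)[["female"]]

c(diff_means = diff_means, adjusted_gap = adj_gap)
```

The two numbers differ whenever the covariates are unevenly distributed across the groups, which is the concern raised above.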
I think this person is doing everything right, and a regression makes sense here. The crime committed and the time spent in prison may also help; I think employers would view 1 year in prison for a small crime very differently from 20 years in prison for assault. But given the data the person has, I think they are doing things well.
#2) Using the regression results in column (2):
Age is an important factor. Its coefficient is 0.61 with a standard error of 0.05 (the 0.05 used in the calculations below), so in the data supplied a 1 year increase in age, ceteris paribus, should increase expected ahe by 0.61 dollars/hour. To test whether this differs from zero I use the t statistic, which is the coefficient divided by its standard error. t is 12.2, far above the 1.96 critical value for a two-sided test at the 5% level, so age is statistically significant at the 95% confidence level.
t_stat <- (0.61-0)/(0.05)
t_stat
## [1] 12.2
We are given enough to form the regression equation. However, these two applicants are the same in every variable but age, so the intercept, college, and female terms cancel and the only variable of interest is age. The expected difference in earnings is therefore 5 times the age coefficient, where the 5 accounts for the 5 year age difference, and a 95% confidence interval is 5 times the interval for the age coefficient, built from its estimate of 0.61 and the 0.05 standard error:
CI(female, bachelors, ages 29-34) = 5[0.61 - 1.96x0.05: 0.61 + 1.96x0.05]
CI(female, bachelors, ages 29-34) = 5[0.61 +/- 0.098]
CI(female, bachelors, ages 29-34) = 5[0.512: 0.708]
CI(female, bachelors, ages 29-34) = [2.56, 3.54]
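The same interval can be reproduced in R. This is a small sketch that hard-codes the coefficient and standard error used above; the variable names are my own.

```r
beta_age <- 0.61   # age coefficient from column (2)
se_age   <- 0.05   # its standard error
years    <- 5      # age difference between the two workers

z <- qnorm(0.975)  # about 1.96 for a 95% two-sided interval
ci_age <- years * (beta_age + c(-1, 1) * z * se_age)
round(ci_age, 2)
```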
#3 Suppose a researcher collects data on houses that have sold in a particular neighborhood over the past year and obtains the regression results in the table shown below:
To construct a 95% confidence interval for the change in house price we need the effect of size on price. This is given in the summary table: a 1 sqft increase in size has the effect of increasing price by 0.00042 (a proportional change, about 0.042%), with an se of 0.000038. To construct this confidence interval we do as follows:
CI = [coef - 1.96x0.000038: coef + 1.96x0.000038]*500sqft
CI = [0.00042 - 0.00007448: 0.00042 + 0.00007448]*500
CI = [0.00034552: 0.00049448]*500
CI = [0.17276: 0.24724]
This tells us with 95% confidence that adding 500 sqft to a house should increase the price by between 17.28% and 24.72% on average.
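A short sketch of this calculation in R. It also shows the exact percentage change implied by a log-price coefficient, which assumes (as the percent interpretation here does) that the dependent variable is ln(price); the variable names are my own.

```r
beta_size <- 0.00042    # coefficient on size
se_size   <- 0.000038   # its standard error
delta     <- 500        # added square feet

# 95% interval for the change in ln(price)
ci_size <- delta * (beta_size + c(-1, 1) * 1.96 * se_size)

# Approximate percent change (log points times 100)
pct_approx <- 100 * ci_size

# Exact percent change, assuming the outcome is ln(price)
pct_exact <- 100 * (exp(ci_size) - 1)

rbind(pct_approx, pct_exact)
```

For changes this large the exact conversion is noticeably bigger than the log-point approximation, so which one to report depends on how precise the answer needs to be.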
t_size <- 0.00042/0.000038
t_lnsize <- 0.69/0.054
t_size; t_lnsize
## [1] 11.05263
## [1] 12.77778
Since t_lnsize is larger, ln(size) is the more significant regressor and should be the variable that is used, as it explains more of the data.
c) Estimated effect of pool on price using ln(size): By definition of a regression, adding a pool to a house, ceteris paribus, should increase the house's value by 0.071, or about 7.1%. A confidence interval for the true value is as follows:
CI = [0.071 - 1.96x0.034: 0.071 + 1.96x0.034]
CI = [0.00436: 0.1376] or [0.44%: 13.76%]
t_bed <- 0.0036/0.073
t_bed
## [1] 0.04931507
t_quad <- 0.0078/0.14
t_quad
## [1] 0.05571429
No. Both t stats are very low; a value of at least 1.64 in absolute value would be needed for significance at the 10% level (90% confidence).
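The t statistics computed so far can be turned into approximate two-sided p values. This is a sketch using the same coefficient/se ratios as above; using the normal distribution rather than the t distribution is an assumption that is reasonable only for a large sample.

```r
# Coefficient / standard error ratios taken from the calculations above
t_stats <- c(size   = 0.00042 / 0.000038,
             lnsize = 0.69 / 0.054,
             bed    = 0.0036 / 0.073,
             quad   = 0.0078 / 0.14)

# Two-sided p values under the large-sample normal approximation
p_vals <- 2 * pnorm(-abs(t_stats))

# Significant at the 10% level if |t| exceeds about 1.645
signif_10 <- abs(t_stats) > qnorm(0.95)

data.frame(t = round(t_stats, 3), p = signif(p_vals, 3), signif_10)
```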
The interaction term in column (5) actually has this second part for us. So we only need to compare adding a pool to a standard house and our interaction term.
The addition of a pool has the effect of increasing price by 0.071, all else equal. The poolxview interaction adds a further 0.0022 when the house also has a view. This means that adding a pool to a house with no view has an expected price increase of (0.071) 7.1%, and adding a pool to a house with a view has an expected increase of (0.071+0.0022) 7.32%, ceteris paribus. Thus, there is not much of a difference. The poolxview variable is also insignificant, with a t value of 0.022.
t_poolxview <- 0.0022/0.1
t_poolxview
## [1] 0.022
#4
library(haven)
CPS2015 <- read_dta("CPS2015.dta")
data <- CPS2015
Variables in the data that probably affect a person's earnings are years of education, gender, and age. Other factors I would expect to matter are marital status, family size, location/avg income in the area, and even things like inflation and CPI.
Location, avg income in the area, and inflation are not available, so we have fewer characteristics to hold fixed when looking for a causal effect. We have bachelor's degree as a proxy for years of education, and we are given gender, wage, and age. The lack of data on location will affect our interpretation of the difference between bachelor's degree holders and non-holders: a woman in NYC working the same job as a woman in Georgia, ceteris paribus, would probably make more money, because the cost of living in NYC is higher than in Georgia. We will not be able to account for cost of living like this. However, we can still model wages with the control variables we do have (gender, a bachelor's degree, and age) to get at some causal effect. If we increased the number of variables, the share of the data explained by each variable would fall. This means that we could be more accurate given more information, and that we are probably overvaluing our current variables in the regression. That does not detract from our results: a statistically significant regression is still significant, but it could be more accurate given more control variables.
reg1 <- lm(ahe~ bachelor + female + age, data = data)
summary(reg1)
##
## Call:
## lm(formula = ahe ~ bachelor + female + age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.913 -6.647 -1.865 4.252 83.908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.04481 1.35465 1.509 0.131
## bachelor 9.84564 0.26242 37.519 <2e-16 ***
## female -4.14354 0.26590 -15.583 <2e-16 ***
## age 0.53128 0.04507 11.788 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.92 on 7094 degrees of freedom
## Multiple R-squared: 0.1896, Adjusted R-squared: 0.1893
## F-statistic: 553.4 on 3 and 7094 DF, p-value: < 2.2e-16
Bachelor, female, and age are all statistically significant at the 0.1% level. The coefficient on bachelor is the one to be interpreted for this problem. I believe it is easier to start by interpreting the intercept first, and then a covariate. The intercept is interpreted as follows: for a male at age 0 with no bachelor's degree, the average expected hourly wage is 2.04 dollars/hour. The bachelor coefficient then moves from the intercept to bachelor = 1, ceteris paribus. That is, for a male, age 0, with a bachelor's degree, the average expected hourly wage increases by $9.85/hour; in other words, he would be expected to earn 11.89 dollars/hour simply by having a bachelor's degree. In reality this baseline does not make sense: no one at age 0 would be working, and they certainly wouldn't have a bachelor's degree. The 9.85 dollar difference itself, however, applies at any age, holding gender and age fixed.
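As a rough check on this interpretation, the fitted equation can be evaluated at a more realistic age. The sketch below hard-codes the coefficients printed in the summary; `predict_ahe` is a hypothetical helper written for illustration, not part of the original analysis.

```r
# Coefficients copied from summary(reg1)
b <- c(intercept = 2.04481, bachelor = 9.84564, female = -4.14354, age = 0.53128)

# Hypothetical helper: predicted average hourly earnings
predict_ahe <- function(bachelor, female, age) {
  b[["intercept"]] + b[["bachelor"]] * bachelor +
    b[["female"]] * female + b[["age"]] * age
}

no_degree   <- predict_ahe(bachelor = 0, female = 0, age = 30)
with_degree <- predict_ahe(bachelor = 1, female = 0, age = 30)

# The gap equals the bachelor coefficient regardless of the age chosen
c(no_degree = no_degree, with_degree = with_degree,
  gap = with_degree - no_degree)
```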
Bootstrap the data and find that SE and compare it to the one from the regression
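n <- length(data$bachelor)  # number of observations
B <- 10000                  # number of bootstrap replications
variable <- data$bachelor   # note: this resamples the bachelor indicator itself
# each column is one bootstrap sample of size n, drawn with replacement
bootstrap.samples <- matrix(sample(variable, size = n*B, replace = TRUE),
nrow=n, ncol=B)
boot.test.stat1 <- rep(0,B)
for (i in 1:B) {
# statistic: share of bachelor degree holders in the bootstrap sample
boot.test.stat1[i] <- mean(bootstrap.samples[, i])
}
sd(boot.test.stat1)  # bootstrap SE of the sample mean of bachelor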
## [1] 0.005927387
The SE for the bachelor coefficient in the regression was 0.2624 and the SE from the bootstrap samples is 0.0059. The bootstrap standard error is much lower, but the two are not directly comparable: the bootstrap above resamples the bachelor indicator and takes its mean, so 0.0059 is the standard error of the share of degree holders in the sample, not of the regression coefficient. A comparable bootstrap would resample whole rows of the data and re-estimate the regression each time.
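For reference, here is a minimal sketch of a pairs (row-resampling) bootstrap for a regression coefficient. It runs on simulated data standing in for CPS2015, so the numbers it prints are not results for the real data; `boot_se_coef` and the simulated effect sizes are assumptions made for illustration.

```r
# Hypothetical helper: bootstrap SE of one regression coefficient
boot_se_coef <- function(df, formula, term, B = 500) {
  n <- nrow(df)
  est <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)      # resample whole rows
    est[b] <- coef(lm(formula, data = df[idx, , drop = FALSE]))[[term]]
  }
  sd(est)                                         # spread of re-estimated coefficients
}

# Simulated stand-in for the CPS data (assumed effect sizes)
set.seed(42)
n <- 1000
sim <- data.frame(bachelor = rbinom(n, 1, 0.5),
                  female   = rbinom(n, 1, 0.4),
                  age      = sample(25:34, n, replace = TRUE))
sim$ahe <- 2 + 10 * sim$bachelor - 4 * sim$female + 0.5 * sim$age +
  rnorm(n, sd = 10)

fit     <- lm(ahe ~ bachelor + female + age, data = sim)
se_ols  <- summary(fit)$coefficients["bachelor", "Std. Error"]
se_boot <- boot_se_coef(sim, ahe ~ bachelor + female + age, "bachelor")

c(ols = se_ols, bootstrap = se_boot)
```

With this kind of bootstrap the two standard errors should be of similar size, which makes the comparison the question asks for meaningful. On the real data the call would use `data` in place of `sim`.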
mean(data$age)
## [1] 29.63046
# 1 if age is at or above the sample mean, 0 otherwise
older <- ifelse(data$age>=mean(data$age), 1, 0)
data$did1 <- data$bachelor*older
didreg1 <- lm(ahe~ bachelor + female + age + did1, data = data)
summary(didreg1)
##
## Call:
## lm(formula = ahe ~ bachelor + female + age + did1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.079 -6.637 -1.921 4.296 83.618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.18263 1.73935 2.980 0.00290 **
## bachelor 9.15182 0.35648 25.673 < 2e-16 ***
## female -4.12676 0.26583 -15.524 < 2e-16 ***
## age 0.42520 0.05824 7.301 3.17e-13 ***
## did1 1.32900 0.46244 2.874 0.00407 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.91 on 7093 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1901
## F-statistic: 417.5 on 4 and 7093 DF, p-value: < 2.2e-16
The question asks us to interpret the effect of bachelor on ahe. My initial thought was to do this using a difference in differences approach. So I turned age into a dummy variable called older, in which 1 is at or above the mean age and 0 is below it. I multiplied it by the bachelor dummy to get an interaction term called did1. It is significant at the 1% level (p = 0.004) and is interpreted as follows: the bachelor coefficient of 9.15 is the degree premium for workers below the mean age of 29.63, and for workers at or above the mean age the premium is larger by 1.33 dollars/hour. Because these are dummy variables, did1 can only be 1 or 0. So simply put, having a bachelor's degree and being at or above the age of 29.63 raises estimated avg earnings per hour by an additional 1.33 dollars, on avg, on top of the bachelor effect.
Adding did1 caused the standard error of the bachelor coefficient to increase (from 0.262 to 0.356). I believe this is because some of what bachelor previously described is now being described by the did1 variable. This tells me that our previous regression was better for describing the data.
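One way to see what the interaction does is to compute the degree premium separately for each age group, with a standard error for the combined coefficient. The sketch below uses simulated data, so its output does not describe the CPS sample; the effect sizes are assumptions, and the `bachelor:older` term plays the role of did1.

```r
# Simulated stand-in data (assumed effect sizes)
set.seed(7)
n <- 2000
sim <- data.frame(bachelor = rbinom(n, 1, 0.5),
                  female   = rbinom(n, 1, 0.4),
                  age      = sample(25:34, n, replace = TRUE))
sim$older <- as.integer(sim$age >= mean(sim$age))
sim$ahe <- 5 + 9 * sim$bachelor + 1.3 * sim$bachelor * sim$older -
  4 * sim$female + 0.4 * sim$age + rnorm(n, sd = 10)

# Same specification as didreg1, with the interaction built in the formula
fit <- lm(ahe ~ bachelor + female + age + bachelor:older, data = sim)
b <- coef(fit)
V <- vcov(fit)

prem_young <- b[["bachelor"]]                          # premium below mean age
prem_old   <- b[["bachelor"]] + b[["bachelor:older"]]  # premium at or above mean age

# SE of a sum of two coefficients needs their covariance
se_old <- sqrt(V["bachelor", "bachelor"] +
               V["bachelor:older", "bachelor:older"] +
               2 * V["bachelor", "bachelor:older"])

c(prem_young = prem_young, prem_old = prem_old, se_old = se_old)
```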
I am not going to interpret it because I already interpreted one, but given that female is already a binary variable like bachelor, I think we were supposed to use that. Here is the regression for it:
didtry <- lm(ahe~ bachelor + female + age + bachelor*female, data = data)
data$ww <- data$bachelor*data$female
didtry2 <- lm(ahe~ bachelor + female + age + ww, data = data)
summary(didtry)
##
## Call:
## lm(formula = ahe ~ bachelor + female + age + bachelor * female,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.224 -6.668 -1.951 4.296 83.345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.88935 1.35509 1.394 0.16328
## bachelor 10.45964 0.34023 30.743 < 2e-16 ***
## female -3.30728 0.39717 -8.327 < 2e-16 ***
## age 0.52694 0.04507 11.691 < 2e-16 ***
## bachelor:female -1.51454 0.53455 -2.833 0.00462 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.91 on 7093 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1901
## F-statistic: 417.4 on 4 and 7093 DF, p-value: < 2.2e-16
summary(didtry2)
##
## Call:
## lm(formula = ahe ~ bachelor + female + age + ww, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.224 -6.668 -1.951 4.296 83.345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.88935 1.35509 1.394 0.16328
## bachelor 10.45964 0.34023 30.743 < 2e-16 ***
## female -3.30728 0.39717 -8.327 < 2e-16 ***
## age 0.52694 0.04507 11.691 < 2e-16 ***
## ww -1.51454 0.53455 -2.833 0.00462 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.91 on 7093 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1901
## F-statistic: 417.4 on 4 and 7093 DF, p-value: < 2.2e-16
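As a rough reading of the output above, the interaction coefficient splits the bachelor premium by gender. This is a small arithmetic sketch using the printed estimates only; the variable names are my own, and it says nothing about the precision of the combined effects.

```r
# Estimates copied from summary(didtry)
b_bachelor <- 10.45964
b_female   <- -3.30728
b_inter    <- -1.51454   # bachelor:female

prem_men   <- b_bachelor             # degree premium for men
prem_women <- b_bachelor + b_inter   # degree premium for women

gap_no_degree <- b_female            # female gap without a degree
gap_degree    <- b_female + b_inter  # female gap with a degree

round(c(prem_men = prem_men, prem_women = prem_women,
        gap_no_degree = gap_no_degree, gap_degree = gap_degree), 2)
```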
library(MatchIt)
match1 <- matchit(bachelor~ female, data = data, method = "exact")
summary(match1)
##
## Call:
## matchit(formula = bachelor ~ female, data = data, method = "exact")
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## female 0.4865 0.3397 0.2937 . 0.1468
## eCDF Max
## female 0.1468
##
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## female 0.4865 0.4865 0 . 0
## eCDF Max Std. Pair Dist.
## female 0 0
##
## Percent Balance Improvement:
## Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
## female 100 . 100 100
##
## Sample Sizes:
## Control Treated
## All 3365. 3733
## Matched (ESS) 3070.04 3733
## Matched 3365. 3733
## Unmatched 0. 0
## Discarded 0. 0
df.match <- match.data(match1)
mYsub <- mean(df.match$ahe[which(df.match$subclass==2)])
mYsub1 <- mean(df.match$ahe[which(df.match$subclass==1)])
abs(mYsub1 - mYsub)
## [1] 2.759816
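For this I needed to find the matches for bachelor and female among our variables, and then, looking only at the matched data, find the effect on ahe. With exact matching on female alone there are two subclasses, one for each value of female, so the code above compares mean ahe between those two subclasses: the result is a difference of 2.76 dollars on average. To read this as the effect of a bachelor's degree, ahe would have to be compared between degree holders and non-holders within each subclass.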
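A minimal base R sketch of a within-subclass comparison: it takes the treated-minus-control difference in mean outcome inside each stratum and averages those differences weighted by the number of treated units. `att_exact` is a hypothetical helper and the toy data are made up, so the number it returns is not an estimate for the CPS sample.

```r
# Hypothetical helper: exact-matching estimate of the effect on the treated
att_exact <- function(df, outcome, treat, stratum) {
  strata <- split(df, df[[stratum]])
  # treated minus control mean outcome within each stratum
  diffs <- sapply(strata, function(s) {
    mean(s[[outcome]][s[[treat]] == 1]) - mean(s[[outcome]][s[[treat]] == 0])
  })
  # weight each stratum by its number of treated units
  w <- sapply(strata, function(s) sum(s[[treat]] == 1))
  sum(diffs * w) / sum(w)
}

# Toy data, made up for illustration
toy <- data.frame(
  female   = c(0, 0, 0, 0, 1, 1, 1, 1),
  bachelor = c(1, 1, 0, 0, 1, 0, 0, 0),
  ahe      = c(30, 34, 20, 22, 28, 18, 20, 19)
)

att_toy <- att_exact(toy, "ahe", "bachelor", "female")
att_toy
```

On the matched data the equivalent call would be `att_exact(df.match, "ahe", "bachelor", "subclass")`, assuming every subclass contains both degree holders and non-holders.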