homework4

#Robert Ross

#1) Critique each of the plans

Given: a researcher is interested in determining whether a large aerospace firm is guilty of sex bias in setting wages. Data are collected on salary and sex for all of the firm's engineers. The plan is a difference-in-means test to determine whether the avg salary for women is significantly lower than for men.

A difference in means alone is not enough information, because it ignores the other factors that determine pay. For example, workers could have different amounts of experience, different education levels, or work in different types of engineering. If any of these differ by sex, the raw gap mixes their effect with any sex bias. Regressing wage on a dummy variable for female plus other independent variables like experience, education, and age should give a more credible estimate.
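To see why the controls matter, here is a minimal sketch on simulated data. Everything in it is made up for illustration (the variable names, the sample size, and the assumption that wages depend only on experience and that women in the fake sample have less experience), so it is not the firm's data.

```r
# Simulated (hypothetical) data: wages depend on experience, not on sex,
# but women in this fake sample have less experience on average.
set.seed(1)
n <- 2000
female     <- rbinom(n, 1, 0.3)
experience <- rnorm(n, mean = 12 - 4 * female, sd = 3)
wage       <- 40 + 2 * experience + rnorm(n, sd = 5)

# Raw difference in means: picks up the experience gap.
raw_gap <- mean(wage[female == 1]) - mean(wage[female == 0])

# Regression with a female dummy and experience as a control.
fit     <- lm(wage ~ female + experience)
adj_gap <- unname(coef(fit)["female"])

raw_gap   # large and negative even though there is no bias in the simulation
adj_gap   # close to zero once experience is held fixed
```

The raw gap looks like bias, while the coefficient on female is near zero, which is the point of adding the other independent variables.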

Second plan: a researcher wants to estimate the effect of prison on long-term wages. Treated subjects have been out of prison for at least 15 years; untreated subjects are randomly chosen people who have never been to prison. The data include wage, edu, age, ethnicity, gender, tenure, occupation, union status, and a dummy for incarceration. The plan is to regress wages on incarceration and include the rest as control variables.

I think this person is doing most things right, and a regression makes sense here. The crime committed and the time spent in prison would also help: I think employers would view 1 year in prison for a small crime very differently from 20 years in prison for assault. But given the data this person has, I think the plan is reasonable.

#2) Using the regression results in column (2):

Is age an important determinant of earnings? Use an appropriate statistical test and/or confidence interval to explain.

Age is an important factor. In column (2) the coefficient on age is 0.61 with a standard error of 0.05, so a 1 year increase in age, ceteris paribus, is associated with an increase in expected ahe of 0.61 dollars/hour. To test the null hypothesis that the coefficient is zero I use the t statistic, which is the estimate divided by its standard error. T is 12.2, far above the 5% critical value of 1.96, so the p value is essentially zero and the coefficient is statistically significant. Equivalently, the 95% confidence interval, 0.61 +/- 1.96x0.05 = [0.512, 0.708], does not contain zero.

t_stat <- (0.61-0)/(0.05)
t_stat

## [1] 12.2
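As a quick check on the test, the p value and confidence interval can be computed from the same two numbers. This is a sketch that assumes the large-sample normal approximation; the names b_age, se_age, p_value and ci_95 are mine.

```r
b_age  <- 0.61   # coefficient on age, column (2)
se_age <- 0.05   # its standard error

t_stat  <- (b_age - 0) / se_age
p_value <- 2 * pnorm(-abs(t_stat))                  # two-sided p value
ci_95   <- b_age + c(-1, 1) * qnorm(0.975) * se_age # 95% confidence interval

t_stat
p_value
ci_95
```

The interval stays well away from zero, which agrees with the large t statistic.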

Sally is a 29 year old female college grad. Betsy is a 34 year old female college grad. Construct a 95% confidence interval for the expected difference between their earnings.

We are given enough to form the regression equation. However, it is worth noting that these two workers are the same in every variable but age. This means that the intercept, college, and female terms cancel in the difference, and the only variable of interest is age. The expected difference is therefore 5 times the age coefficient, and its confidence interval is 5 times the 95% confidence interval for that coefficient, built from the estimate (0.61) and its standard error (0.05). The 5 accounts for the 5 year age difference.

CI(female, bachelors, ages 29-34) = 5[0.61 - 1.96x0.05: 0.61 + 1.96x0.05]

CI(female, bachelors, ages 29-34) = 5[0.61 +/- 0.098]

CI(female, bachelors, ages 29-34) = 5[0.512: 0.708]

CI(female, bachelors, ages 29-34) = [2.56, 3.54]
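The same interval in R, as a sketch. It relies on the fact that multiplying a coefficient by a constant multiplies its standard error by the same constant; age_gap and ci_diff are names I made up.

```r
b_age   <- 0.61     # coefficient on age
se_age  <- 0.05     # standard error of that coefficient
age_gap <- 34 - 29  # Betsy's age minus Sally's age

# Scale the coefficient's 95% interval by the 5 year gap
ci_diff <- age_gap * (b_age + c(-1, 1) * 1.96 * se_age)
ci_diff
```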

#3 Suppose a researcher collects data on houses that have sold in a particular neighborhood over the past year and obtains the regression results in the table shown below:

Using the results in column (1), what is the expected change in price of building a 500sqft addition to a house? Construct a 95% confidence interval for the %deltaP

To construct a 95% confidence interval for the change in house price we need the effect of size on price. This is given in the summary table: a 1sqft increase in size raises ln(price) by 0.00042 (about 0.042%), with a se of 0.000038. For 500sqft the expected change is 500 x 0.00042 = 0.21, or about 21%. To construct the confidence interval we do as follows:

CI = [mean - 1.96x0.000038: mean + 1.96x0.000038]*500sqft

CI = [0.00042 - 0.00007448: 0.00042 + 0.00007448]*500

CI = [0.00034552, 0.00049448]*500

CI = [0.17276: 0.24724]

This tells us with 95% confidence that adding 500sqft to a house should increase the price by between 17.28% and 24.72% on average.
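The arithmetic is easy to slip on by hand, so here is a short sketch that does it in R. The names added_sqft, point and ci are mine; the two inputs are the column (1) numbers quoted above.

```r
b_size     <- 0.00042    # coefficient on size, column (1)
se_size    <- 0.000038   # its standard error
added_sqft <- 500

point <- added_sqft * b_size                               # expected change in ln(price)
ci    <- added_sqft * (b_size + c(-1, 1) * 1.96 * se_size) # 95% interval

round(100 * c(estimate = point, lower = ci[1], upper = ci[2]), 2)  # in percent
```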

Is ln(Size) or size a better determinant? look at the t-scores

t_size <- 0.00042/0.000038
t_lnsize <- 0.69/0.054
t_size; t_lnsize

## [1] 11.05263

## [1] 12.77778

Since t_lnsize is larger, ln(size) is the more significant of the two, which suggests it is the better variable to use. It would be worth confirming this by comparing the adjusted R-squared of the two columns.

c) Estimate the effect of a pool on price using the ln(size) specification.

Because the dependent variable is ln(price), adding a pool to a house, ceteris paribus, should increase ln(price) by 0.071, which is a price increase of about 7.1%. A 95% confidence interval for the true value is as follows:

CI = [0.071 - 1.96x0.034: 0.071 + 1.96x0.034]

CI = [0.00436: 0.1376] or [0.44%: 13.76%]
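Reading a log-price coefficient directly as a percent is an approximation. The sketch below compares it with the exact conversion exp(b) - 1, assuming the dependent variable is ln(price); the names ci_log, approx_pct and exact_pct are mine.

```r
b_pool  <- 0.071   # coefficient on pool
se_pool <- 0.034   # its standard error

ci_log <- b_pool + c(-1, 1) * 1.96 * se_pool   # interval in log points

approx_pct   <- 100 * b_pool                   # "coefficient times 100" reading
exact_pct    <- 100 * (exp(b_pool) - 1)        # exact percent change
exact_ci_pct <- 100 * (exp(ci_log) - 1)        # exact interval in percent

approx_pct
exact_pct
exact_ci_pct
```

For a coefficient this small the two readings are close, so the 7.1% figure is a fine approximation.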

Estimate the effect of adding a bedroom; is it significant? Why is the effect so small?

The estimated effect of adding 1 bedroom, ceteris paribus, is an increase in ln(price) of only 0.0036 (about 0.36%). It has a t score of only 0.049, which is far below 1.96, so it is not statistically significant. I think the issue is that the regression holds house size fixed, so adding a room without increasing house size just divides the same space differently. Generally we think adding a room means increasing house size with it, and house size is a better explanatory variable than bedrooms.

t_bed <- 0.0036/0.073
t_bed

## [1] 0.04931507

Is the quadratic term ln(size)^2 important?

t_quad <- 0.0078/0.14
t_quad

## [1] 0.05571429

No. The t stat is very low; it would need an absolute value of at least 1.645 to be significant even at the 10% level.
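The cutoffs come from the standard normal distribution. A small sketch, assuming a two-sided test and the large-sample normal approximation (crit_10 and crit_5 are my names):

```r
t_quad <- 0.0078 / 0.14

crit_10 <- qnorm(0.95)    # two-sided 10% critical value, about 1.645
crit_5  <- qnorm(0.975)   # two-sided 5% critical value, about 1.96

c(t = t_quad, crit_10 = crit_10, crit_5 = crit_5)
abs(t_quad) > crit_10     # FALSE: not significant at the 10% level
```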

Add a pool to a house with no view. Then add a pool to a house with a view. What is the difference? Is it statistically significant?

The interaction term in column (5) actually has this second part for us. So we only need to compare adding a pool to a standard house and our interaction term.

Adding a pool has the effect of increasing ln(p) by 0.071, all else equal. The poolxview interaction term has a coefficient of 0.0022, which is the extra effect of a pool when the house also has a view. This means that adding a pool to a house with no view has an expected price increase of (0.071) 7.1%, and adding a pool to a house with a view has one of (0.071+0.0022) 7.32%, ceteris paribus. Thus the difference is only 0.22 percentage points. The poolxview variable is not statistically significant, with a t value of 0.022.

t_poolxview <- 0.0022/0.1
t_poolxview

## [1] 0.022

library(haven)
CPS2015 <- read_dta("CPS2015.dta")
data <- CPS2015

Variables in the data that probably affect a person's earnings are years of education, gender, and age. Other factors I would expect to matter are things like marital status, family size, location/avg income in your area, and even things like inflation and CPI.
Location, avg income in your area, and inflation are not available. This means we have fewer characteristics to hold fixed when looking for a causal effect. We have bachelor's degree as a proxy for years of education, and we are given gender, wage, and age. The lack of data on location will affect our interpretation of the difference between bachelor's degree holders and non-holders, because a woman in NYC would probably make more money than a woman in Georgia working the same job, ceteris paribus, since the cost of living in NYC is higher. We will not be able to account for cost of living like this. However, we can regress wages on bachelor's degree with gender and age as control variables to get at some of the causal effect. If an omitted variable such as location is correlated with both wages and having a degree, part of its effect is attributed to the variables we do include, so we are probably overvaluing our current variables in a regression. That does not take away from our results: a statistically significant regression is still significant, but it could be more accurate given more control variables.
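The overvaluing problem can be illustrated with a small simulation. This is only a sketch: the data are fake, city is a hypothetical high-cost-area dummy that is not in CPS2015, and the true bachelor effect is set to 8 by construction.

```r
set.seed(42)
n <- 5000
city     <- rbinom(n, 1, 0.5)                  # hypothetical high-cost-area dummy
bachelor <- rbinom(n, 1, 0.35 + 0.3 * city)    # degrees more common in the city
ahe      <- 12 + 8 * bachelor + 5 * city + rnorm(n, sd = 6)

# Regression that omits location vs. one that controls for it
b_short <- unname(coef(lm(ahe ~ bachelor))["bachelor"])
b_long  <- unname(coef(lm(ahe ~ bachelor + city))["bachelor"])

c(omitting_city = b_short, controlling_for_city = b_long)
```

When city is left out, the bachelor coefficient absorbs part of the city premium and overstates the true effect of 8.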

reg1 <- lm(ahe~ bachelor + female + age, data = data)
summary(reg1)

## 
## Call:
## lm(formula = ahe ~ bachelor + female + age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.913  -6.647  -1.865   4.252  83.908 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.04481    1.35465   1.509    0.131    
## bachelor     9.84564    0.26242  37.519   <2e-16 ***
## female      -4.14354    0.26590 -15.583   <2e-16 ***
## age          0.53128    0.04507  11.788   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.92 on 7094 degrees of freedom
## Multiple R-squared:  0.1896, Adjusted R-squared:  0.1893 
## F-statistic: 553.4 on 3 and 7094 DF,  p-value: < 2.2e-16

Bachelor, female, and age are all statistically significant at the 0.1% level. The coefficient “bachelor” is to be interpreted for this problem. I believe it is easier to start by interpreting the intercept first, and then a covariate. The intercept is interpreted as follows: for a male at age 0 with no bachelor's degree, the average expected hourly wage is 2.04 dollars/hour. The coefficient on bachelor then looks at the change when bachelor = 1, ceteris paribus. That is, for a male, age 0, with a bachelor's degree, the average expected hourly wage increases by $9.85/hour, or in other words, he would be expected to earn 11.89 dollars/hour simply by having a bachelor's degree. In reality the age 0 baseline does not make sense: no one at age 0 would be working, let alone hold a bachelor's degree. The useful reading is that, holding gender and age fixed, workers with a bachelor's degree earn about $9.85/hour more on average than workers without one.
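One way around the age 0 baseline is to center age before regressing. The sketch below uses simulated data with the same variable names as the homework (the coefficients used to generate it and the centering age of 30 are my assumptions), and shows that centering changes only the intercept, not the bachelor coefficient.

```r
set.seed(2015)
n <- 1000
sim <- data.frame(
  bachelor = rbinom(n, 1, 0.5),
  female   = rbinom(n, 1, 0.4),
  age      = sample(25:34, n, replace = TRUE)
)
sim$ahe <- 2 + 10 * sim$bachelor - 4 * sim$female + 0.5 * sim$age + rnorm(n, sd = 10)

fit_raw      <- lm(ahe ~ bachelor + female + age, data = sim)
fit_centered <- lm(ahe ~ bachelor + female + I(age - 30), data = sim)

# Intercept now refers to a 30 year old male without a degree;
# the bachelor coefficient is identical in both fits.
coef(fit_raw)[c("(Intercept)", "bachelor")]
coef(fit_centered)[c("(Intercept)", "bachelor")]
```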

Bootstrap the data, find the bootstrap SE, and compare it to the one from the regression.

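# Bootstrap the sample mean of the bachelor dummy.
# Note: this resamples one variable, so sd() below is the SE of its mean,
# not the SE of the regression coefficient on bachelor.
n <- length(data$bachelor)
B <- 10000
variable <- data$bachelor

# Each column is one bootstrap sample of size n, drawn with replacement.
bootstrap.samples <- matrix(sample(variable, size = n*B, replace = TRUE), 
                            nrow=n, ncol=B)
boot.test.stat1 <- rep(0,B)
for (i in 1:B) {
  boot.test.stat1[i] <- mean(bootstrap.samples[, i])
}
# Bootstrap standard error: spread of the statistic across resamples.
sd(boot.test.stat1)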

## [1] 0.005927387

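The SE for bachelor in the regression was 0.2624 and the SE from the bootstrap samples is 0.0059. The bootstrap number is much lower, but the two are not measuring the same thing. The code above resamples the bachelor dummy by itself and takes its mean, so 0.0059 is the standard error of the share of the sample with a bachelor's degree, not of the regression coefficient. To compare like with like, each bootstrap sample would need to resample whole rows of the data, rerun the regression, and save the bachelor coefficient; the standard deviation of those coefficients is the bootstrap SE, and I would expect it to be roughly in line with the regression SE.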
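A bootstrap that is comparable to the regression SE resamples whole rows and refits the regression each time. Here is a minimal sketch of that on simulated data, since it does not load CPS2015.dta; the data frame sim, the coefficients used to generate it, and B = 500 are all assumptions for illustration.

```r
set.seed(123)
n <- 500
sim <- data.frame(
  bachelor = rbinom(n, 1, 0.5),
  female   = rbinom(n, 1, 0.4),
  age      = sample(25:34, n, replace = TRUE)
)
sim$ahe <- 2 + 10 * sim$bachelor - 4 * sim$female + 0.5 * sim$age + rnorm(n, sd = 10)

# Conventional OLS standard error for the bachelor coefficient
fit    <- lm(ahe ~ bachelor + female + age, data = sim)
se_ols <- summary(fit)$coefficients["bachelor", "Std. Error"]

# Bootstrap: resample rows with replacement, refit, keep the coefficient
B <- 500
boot_coef <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample(n, n, replace = TRUE)
  boot_coef[b] <- coef(lm(ahe ~ bachelor + female + age, data = sim[idx, ]))["bachelor"]
}
se_boot <- sd(boot_coef)

c(ols = se_ols, bootstrap = se_boot)
```

On the real data the same loop would use data in place of sim, at the cost of refitting the regression B times.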

Estimate the effect of bachelor on ahe using a regression adjustment estimator. How does this compare to the estimate from part c?

mean(data$age)

## [1] 29.63046

# Dummy equal to 1 for workers at or above the mean age, 0 otherwise
older <- ifelse(data$age>=mean(data$age), 1, 0)
data$did1 <- data$bachelor*older
didreg1 <- lm(ahe~ bachelor + female + age + did1, data = data)
summary(didreg1)

## 
## Call:
## lm(formula = ahe ~ bachelor + female + age + did1, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.079  -6.637  -1.921   4.296  83.618 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.18263    1.73935   2.980  0.00290 ** 
## bachelor     9.15182    0.35648  25.673  < 2e-16 ***
## female      -4.12676    0.26583 -15.524  < 2e-16 ***
## age          0.42520    0.05824   7.301 3.17e-13 ***
## did1         1.32900    0.46244   2.874  0.00407 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.91 on 7093 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1901 
## F-statistic: 417.5 on 4 and 7093 DF,  p-value: < 2.2e-16

The question asks us to interpret the effect of bachelor on ahe. My initial thought was to do this using a difference-in-differences style approach. So I turned age into a dummy variable called older, in which 1 means at or above the mean age and 0 means below the mean. I multiplied it by the bachelor dummy to get an interaction term called did1, which I used for the regression adjustment. It is significant at the 1% level (p = 0.004) and is interpreted as follows: for workers aged 29.63 or older, the effect of a bachelor's degree on average hourly earnings is 1.33 dollars larger than for younger workers. So the estimated effect of a bachelor's degree is 9.15 dollars/hour for workers below the mean age and 9.15 + 1.33 = 10.48 dollars/hour for workers at or above it, compared with the single estimate of 9.85 dollars/hour from the earlier regression.

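This caused the standard error for our bachelor coefficient to increase, from 0.262 to 0.356. I believe this is because did1 is built from bachelor, so the two are correlated and some of what bachelor previously described is now being described by the did1 variable. The fit barely changes (adjusted R-squared goes from 0.1893 to 0.1901), so the previous regression describes the data almost as well with one fewer variable.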
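For comparison, one common textbook form of a regression adjustment estimator fits the outcome regression separately for the treated and untreated groups, predicts both outcomes for everyone, and averages the difference. I am not sure this is the exact form the question has in mind, so treat it as a sketch. It runs on simulated data (the data frame sim and its true effect of 10 are assumptions), not on CPS2015.

```r
set.seed(7)
n <- 4000
sim <- data.frame(
  female = rbinom(n, 1, 0.4),
  age    = sample(25:34, n, replace = TRUE)
)
sim$bachelor <- rbinom(n, 1, plogis(-0.3 + 0.6 * sim$female))
sim$ahe <- 2 + 10 * sim$bachelor - 4 * sim$female + 0.5 * sim$age + rnorm(n, sd = 10)

# Outcome models fit separately by treatment status
fit1 <- lm(ahe ~ female + age, data = sim, subset = bachelor == 1)
fit0 <- lm(ahe ~ female + age, data = sim, subset = bachelor == 0)

# Predicted earnings for every worker with and without a degree
mu1 <- predict(fit1, newdata = sim)
mu0 <- predict(fit0, newdata = sim)

ate_ra <- mean(mu1 - mu0)   # regression adjustment estimate of the average effect
ate_ra
```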

I am not going to interpret it because I already did one, but given that female is already a binary variable like bachelor I think we were supposed to use that. Here is the regression for it:

didtry <- lm(ahe~ bachelor + female + age + bachelor*female, data = data)
data$ww <- data$bachelor*data$female
didtry2 <- lm(ahe~ bachelor + female + age + ww, data = data)
summary(didtry)

## 
## Call:
## lm(formula = ahe ~ bachelor + female + age + bachelor * female, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.224  -6.668  -1.951   4.296  83.345 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.88935    1.35509   1.394  0.16328    
## bachelor        10.45964    0.34023  30.743  < 2e-16 ***
## female          -3.30728    0.39717  -8.327  < 2e-16 ***
## age              0.52694    0.04507  11.691  < 2e-16 ***
## bachelor:female -1.51454    0.53455  -2.833  0.00462 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.91 on 7093 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1901 
## F-statistic: 417.4 on 4 and 7093 DF,  p-value: < 2.2e-16

summary(didtry2)

## 
## Call:
## lm(formula = ahe ~ bachelor + female + age + ww, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.224  -6.668  -1.951   4.296  83.345 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.88935    1.35509   1.394  0.16328    
## bachelor    10.45964    0.34023  30.743  < 2e-16 ***
## female      -3.30728    0.39717  -8.327  < 2e-16 ***
## age          0.52694    0.04507  11.691  < 2e-16 ***
## ww          -1.51454    0.53455  -2.833  0.00462 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.91 on 7093 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1901 
## F-statistic: 417.4 on 4 and 7093 DF,  p-value: < 2.2e-16

Matching - find similar people and find differences for those outcomes

library(MatchIt)

match1 <- matchit(bachelor~ female, data = data, method = "exact")
summary(match1)

## 
## Call:
## matchit(formula = bachelor ~ female, data = data, method = "exact")
## 
## Summary of Balance for All Data:
##        Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## female        0.4865        0.3397          0.2937          .    0.1468
##        eCDF Max
## female   0.1468
## 
## 
## Summary of Balance for Matched Data:
##        Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## female        0.4865        0.4865               0          .         0
##        eCDF Max Std. Pair Dist.
## female        0               0
## 
## Percent Balance Improvement:
##        Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
## female             100          .       100      100
## 
## Sample Sizes:
##               Control Treated
## All           3365.      3733
## Matched (ESS) 3070.04    3733
## Matched       3365.      3733
## Unmatched        0.         0
## Discarded        0.         0

df.match <- match.data(match1)
mYsub <- mean(df.match$ahe[which(df.match$subclass==2)])
mYsub1 <- mean(df.match$ahe[which(df.match$subclass==1)])
abs(mYsub1 - mYsub)

## [1] 2.759816

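For this I matched bachelor's degree holders and non-holders exactly on female, then compared ahe across the matched data. Because the only matching variable is female, the two subclasses are the two gender groups, so the 2.76 dollars computed above is the gap in average ahe between those two groups, not the effect of a bachelor's degree. To get the matching estimate of the bachelor effect, the comparison should be between treated and control workers within each subclass, averaged across the subclasses.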
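A within-subclass comparison can be written in base R without MatchIt. The sketch below uses simulated data (the vectors and the true effect of 10 are assumptions) and weights each gender stratum by its share of degree holders, which gives the effect on the treated.

```r
set.seed(11)
n <- 4000
female   <- rbinom(n, 1, 0.4)
bachelor <- rbinom(n, 1, 0.45 + 0.15 * female)
ahe      <- 15 + 10 * bachelor - 4 * female + rnorm(n, sd = 10)

strata <- c(0, 1)   # exact matching on female gives two strata

# Treated-minus-control gap in mean ahe inside each stratum
gap_by_stratum <- sapply(strata, function(f) {
  mean(ahe[female == f & bachelor == 1]) - mean(ahe[female == f & bachelor == 0])
})

# Weight each stratum by its share of the treated (degree-holding) workers
w <- sapply(strata, function(f) sum(female == f & bachelor == 1)) / sum(bachelor == 1)

att_exact <- sum(w * gap_by_stratum)
att_exact
```

With the matched data from match.data, the same idea would compare ahe by bachelor within each value of subclass rather than across subclasses.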