Read the data "Earnings_and_Height"

require(xlsx)
## Loading required package: xlsx
## Loading required package: rJava
## Loading required package: xlsxjars
mydata<-read.xlsx("C:/Users/yuanmany/Downloads/Earnings_and_Height.xlsx",1)

E4.2

(a) What is the median value of height in the sample?

Answer

height=mydata$height
median(height)
## [1] 67

The median height in the sample is 67 inches. …

(b)

(i) Estimate average earnings of workers whose height is at most 67 inches.

Answer

shortdata<-subset(mydata,height<67)
b=c(shortdata$earnings)
mean(b)
## [1] 44487.9

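The average earnings of workers shorter than 67 inches are $44487.9. Note that the subset height<67 excludes workers who are exactly 67 inches tall, whereas the question asks for workers who are at most 67 inches tall (height<=67), so the estimate for that group may differ. …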
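Part (b)(i) asks about workers whose height is "at most" 67 inches, which includes those exactly 67 inches tall. The following is a minimal sketch of how the two subsets differ. The data frame `toy` and its values are made up for illustration and are not the survey data.

```r
# Toy data standing in for mydata (values are invented for illustration)
toy <- data.frame(height   = c(64, 66, 67, 67, 70, 72),
                  earnings = c(30000, 35000, 40000, 42000, 50000, 55000))

# "At most 67 inches" keeps workers who are exactly 67 inches tall
at_most_67 <- subset(toy, height <= 67)

# A strict inequality drops them
below_67 <- subset(toy, height < 67)

mean_at_most <- mean(at_most_67$earnings)
mean_below   <- mean(below_67$earnings)
print(c(at_most = mean_at_most, below = mean_below))
```

Applied to the real data, the same comparison would be subset(mydata, height <= 67) against subset(mydata, height < 67).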

(ii) Estimate average earnings of workers whose height is greater than 67 inches.

Answer

highdata<-subset(mydata,height>67)
a=c(highdata$earnings)
mean(a)
## [1] 49987.88

The average earnings of workers taller than 67 inches are $49987.88. …

(iii) On average, do taller workers earn more than shorter workers? How much more? What is a 95% confidence interval for the difference in average earnings?

Answer

t.test(a,b, var.equal=TRUE, paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  a and b
## t = 13.095, df = 16360, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4676.698 6323.263
## sample estimates:
## mean of x mean of y 
##  49987.88  44487.90
mean(a)-mean(b)
## [1] 5499.98

The 95% confidence interval for the difference in mean earnings between taller and shorter workers is ($4676.698, $6323.263). Because the interval excludes zero, taller workers earn more on average than shorter workers. The sample means differ by $5499.98. The test above assumes equal variances in the two groups (var.equal=TRUE). …
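If the equal-variance assumption is not wanted, t.test() can be run with var.equal=FALSE (the Welch test). The sketch below shows how to pull the interval and the difference in means out of the result. The vectors `tall` and `short` are simulated here, so the numbers it prints are not the survey results.

```r
set.seed(1)
# Simulated earnings for two groups (illustrative only)
tall  <- rnorm(200, mean = 50000, sd = 27000)
short <- rnorm(300, mean = 44000, sd = 25000)

# Welch two-sample t-test: does not assume equal variances
welch <- t.test(tall, short, var.equal = FALSE)

# 95% confidence interval for the difference in means (tall minus short)
ci <- welch$conf.int

# Point estimate of the difference in means
diff_means <- unname(welch$estimate[1] - welch$estimate[2])

print(diff_means)
print(ci)
```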

(c) Construct a scatterplot of annual earnings (Earnings) on height (Height). Notice that the points on the plot fall along horizontal lines. (There are only 23 distinct values of Earnings). Why?

Answer

height=mydata$height
earnings=mydata$earnings
plot(height,earnings)

The survey reports labor earnings in 23 brackets rather than as exact amounts. For each bracket, Professors Case and Paxson estimated a value of average earnings based on information in the Current Population Survey, and that average was assigned to every worker with earnings in the bracket. …
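The number of distinct earnings values can be checked directly. This is a small sketch on a made-up vector, `toy_earnings`. On the real data the same call would be length(unique(mydata$earnings)).

```r
# Made-up bracketed earnings: each worker gets the average for a bracket
toy_earnings <- c(12000, 12000, 25000, 25000, 25000, 84000)

# Number of distinct values (one per bracket that appears in the data)
n_distinct <- length(unique(toy_earnings))

# How many workers fall in each bracket
counts <- table(toy_earnings)

print(n_distinct)
print(counts)
```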

(d) Run a regression of Earnings on Height.

regression=lm(earnings~height,data=mydata)
summary(regression)
## 
## Call:
## lm(formula = earnings ~ height, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47836 -21879  -7976  34323  50599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -512.73    3386.86  -0.151     0.88    
## height        707.67      50.49  14.016   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared:  0.01088,    Adjusted R-squared:  0.01082 
## F-statistic: 196.5 on 1 and 17868 DF,  p-value: < 2.2e-16

The estimated regression is Earnings = -512.73 + 707.67*Height …

(i) What is the estimated slope?

Answer

The slope is 707.67 …

(ii) Use the estimated regression to predict earnings for a worker who is 67 inches tall, for a worker who is 70 inches tall, and a worker who is 65 inches tall.

Answer

The estimated regression is Earnings = -512.73 + 707.67*Height.

For a worker who is 67 inches tall, predicted earnings are -512.73 + 707.67*67 = $46901.16.

For a worker who is 70 inches tall, predicted earnings are -512.73 + 707.67*70 = $49024.17.

For a worker who is 65 inches tall, predicted earnings are -512.73 + 707.67*65 = $45485.82.
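The three predictions can be reproduced in one step from the rounded coefficients reported above. The names `b0`, `b1` and `heights` are introduced here only for illustration.

```r
# Rounded coefficients from the regression output in (d)
b0 <- -512.73   # intercept
b1 <- 707.67    # slope (dollars per inch)

# Heights to predict for, in inches
heights <- c(67, 70, 65)

# Predicted earnings from the fitted line
predicted <- b0 + b1 * heights
print(round(predicted, 2))

# With the fitted model object, predict() does the same using the
# unrounded coefficients, so results may differ slightly:
# predict(regression, newdata = data.frame(height = heights))
```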

(e) Suppose height were measured in centimeters instead of inches. Answer the following questions about Earnings on Height (in cm) regression.

cmheight=height*2.54
regression2=lm(earnings~cmheight,data=mydata)
summary(regression2)
## 
## Call:
## lm(formula = earnings ~ cmheight, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47836 -21879  -7976  34323  50599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -512.73    3386.86  -0.151     0.88    
## cmheight      278.61      19.88  14.016   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared:  0.01088,    Adjusted R-squared:  0.01082 
## F-statistic: 196.5 on 1 and 17868 DF,  p-value: < 2.2e-16

The estimated regression is Earnings = -512.73 + 278.61*cmheight …

(i) What is the estimated slope of the regression?

Answer

The slope is 278.61 …

(ii) What is the estimated intercept?

Answer

The estimated intercept is -512.73 …

(iii) What is the R²?

Answer

The R² is 0.01088 …

(iv) What is the standard error of the regression?

Answer

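The standard error of the regression is 26780, the same as in the regression with height in inches, since rescaling the regressor leaves the residuals unchanged. …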
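The pattern in (e) is general: multiplying the regressor by 2.54 divides the slope by 2.54 (707.67/2.54 is about 278.61) and leaves the intercept, R² and standard error of the regression unchanged. The sketch below checks this on simulated data. All variable names and the simulated values are assumptions for illustration.

```r
set.seed(42)
# Simulated heights (inches) and earnings, for illustration only
height_in    <- round(rnorm(500, mean = 67, sd = 4))
earnings_sim <- 700 * height_in + rnorm(500, sd = 20000)

# Same heights expressed in centimeters
height_cm <- height_in * 2.54

fit_in <- lm(earnings_sim ~ height_in)
fit_cm <- lm(earnings_sim ~ height_cm)

# Ratio of the two slopes should equal the conversion factor
slope_ratio <- coef(fit_in)[["height_in"]] / coef(fit_cm)[["height_cm"]]
print(slope_ratio)

# Intercept, R-squared and residual standard error are unaffected
print(c(coef(fit_in)[["(Intercept)"]], coef(fit_cm)[["(Intercept)"]]))
print(c(summary(fit_in)$r.squared, summary(fit_cm)$r.squared))
print(c(summary(fit_in)$sigma, summary(fit_cm)$sigma))
```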

(f) Run a regression of Earnings on Height, using data for female workers only.

femaledata<-subset(mydata,sex<1)
femaleregression=lm(earnings~height,data=femaledata)
summary(femaleregression)
## 
## Call:
## lm(formula = earnings ~ height, data = femaledata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42748 -22006  -7466  36641  46865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12650.9     6383.7   1.982   0.0475 *  
## height         511.2       98.9   5.169  2.4e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26800 on 9972 degrees of freedom
## Multiple R-squared:  0.002672,   Adjusted R-squared:  0.002572 
## F-statistic: 26.72 on 1 and 9972 DF,  p-value: 2.396e-07

The estimated regression is Earnings = 12650.9 + 511.2*Height …

(i) What is the estimated slope?

Answer

511.2 …

(ii) A randomly selected woman is 1 inch taller than the average woman in the sample. Would you predict her earnings to be higher or lower than the average earnings for women in the sample? By how much?

Answer

mean(femaledata$height)
## [1] 64.49278

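Predicted earnings for the average woman (about 64.5 inches) are 12650.9 + 511.2*64.5 = $45623.3.

Predicted earnings for a woman 1 inch taller are 12650.9 + 511.2*65.5 = $46134.5.

Her predicted earnings are higher by $511.2, which is the slope. …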

(g) Repeat (f) for male workers.

maledata<-subset(mydata,sex>0)
maleregression=lm(earnings~height,data=maledata)
summary(maleregression)
## 
## Call:
## lm(formula = earnings ~ height, data = maledata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50158 -22373  -8118  33091  59228 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43130.3     7068.5  -6.102  1.1e-09 ***
## height        1306.9      100.8  12.969  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26670 on 7894 degrees of freedom
## Multiple R-squared:  0.02086,    Adjusted R-squared:  0.02074 
## F-statistic: 168.2 on 1 and 7894 DF,  p-value: < 2.2e-16

(i)

Answer

The slope is 1306.9 …

(ii)

Answer

A man who is 1 inch taller than the average man in the sample is predicted to earn more than average by 1 * the slope of the regression line, which is $1306.9. …

(h) Do you think that height is uncorrelated with other factors that cause earning? That is,do you think that regression error term,say ui,has a conditional mean of zero, given Height (Xi)?

Answer

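Probably not. Height is likely to be correlated with other factors that affect earnings, such as sex, age, occupation and race. The results above point in this direction: the estimated slope is 511.2 for women and 1306.9 for men, and the average woman in the sample (64.49 inches) is shorter than the sample median (67 inches), so sex is one factor related to height that is left in the error term of the pooled regression. Being tall does not guarantee a particular occupation, but if such factors vary with height on average and also affect earnings, the regression error term does not have a conditional mean of zero given height. …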

E5.1

(a) Run a regression of Earnings on Height.

regression3=lm(earnings~height,data=mydata)
summary(regression3)
## 
## Call:
## lm(formula = earnings ~ height, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47836 -21879  -7976  34323  50599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -512.73    3386.86  -0.151     0.88    
## height        707.67      50.49  14.016   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared:  0.01088,    Adjusted R-squared:  0.01082 
## F-statistic: 196.5 on 1 and 17868 DF,  p-value: < 2.2e-16

(i) Is the estimated slope statistically significant?

Answer

Yes. The t-value of the slope is 14.016, so we can reject the null hypothesis that the slope of the regression line is 0 at any conventional significance level. The estimated slope of 707.67 is therefore statistically significant. …

(ii) Construct a 95% confidence interval for the slope coefficient.

Answer

The 95% confidence interval of the slope = [707.67-1.96*50.49, 707.67+1.96*50.49] = [608.7096, 806.6304] …
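The same calculation can be wrapped in a small helper so it can be reused for each regression. `slope_ci` is a hypothetical function written for this sketch. It uses qnorm(0.975), about 1.96, so its results differ from the hand calculation only in later decimal places.

```r
# Normal-approximation confidence interval from an estimate and its
# standard error (slope_ci is an illustrative helper, not a library function)
slope_ci <- function(estimate, std_error, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(lower = estimate - z * std_error,
    upper = estimate + z * std_error)
}

# Slope and standard error from the regression output in (a)
ci_all <- slope_ci(707.67, 50.49)
print(ci_all)
```

The female and male intervals follow the same way from slope_ci(511.2, 98.9) and slope_ci(1306.9, 100.8).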

(b) Repeat (a) for women.

femaledata<-subset(mydata,sex<1)
femaleregression=lm(earnings~height,data=femaledata)
summary(femaleregression)
## 
## Call:
## lm(formula = earnings ~ height, data = femaledata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42748 -22006  -7466  36641  46865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12650.9     6383.7   1.982   0.0475 *  
## height         511.2       98.9   5.169  2.4e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26800 on 9972 degrees of freedom
## Multiple R-squared:  0.002672,   Adjusted R-squared:  0.002572 
## F-statistic: 26.72 on 1 and 9972 DF,  p-value: 2.396e-07

(i)

Answer

Yes. The t-value of the slope is 5.169, which exceeds the critical value of 2.58 for a two-sided test at the 1% significance level. The slope is therefore statistically significant. …
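The p-value and the 1% critical value behind this comparison can be computed from the t-value and the residual degrees of freedom reported in the output. This is a sketch, and the variable names are introduced here for illustration.

```r
# t-value and residual degrees of freedom from the female regression output
t_value <- 5.169
df      <- 9972

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df)

# Two-sided critical value at the 1% significance level
critical_1pct <- qt(1 - 0.01 / 2, df)

print(p_value)
print(critical_1pct)
print(abs(t_value) > critical_1pct)
```

With this many degrees of freedom the t critical value is very close to the normal value of 2.58 used in the answer.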

(ii)

Answer

The 95% confidence interval = [511.2-98.9*1.96, 511.2+98.9*1.96]

511.2-98.9*1.96
## [1] 317.356
511.2+98.9*1.96
## [1] 705.044

So the 95% confidence interval of the slope is [317.356, 705.044] …

(c) Repeat (a) for men.

maledata<-subset(mydata,sex>0)
maleregression=lm(earnings~height,data=maledata)
summary(maleregression)
## 
## Call:
## lm(formula = earnings ~ height, data = maledata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50158 -22373  -8118  33091  59228 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43130.3     7068.5  -6.102  1.1e-09 ***
## height        1306.9      100.8  12.969  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26670 on 7894 degrees of freedom
## Multiple R-squared:  0.02086,    Adjusted R-squared:  0.02074 
## F-statistic: 168.2 on 1 and 7894 DF,  p-value: < 2.2e-16

(i)

Answer

The slope is statistically significant since the t-value of the slope is 12.969, which exceeds the 1% significance level critical value of 2.58. That means we can reject the null hypothesis that the slope is 0.

(ii)

Answer

The 95% confidence interval of the slope = [1306.9-100.8*1.96, 1306.9+100.8*1.96]

1306.9-100.8*1.96
## [1] 1109.332
1306.9+100.8*1.96
## [1] 1504.468

The 95% confidence interval of the slope is [1109.332, 1504.468]
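R can also produce these intervals directly from a fitted model with confint(), which uses t quantiles with the residual degrees of freedom rather than 1.96. The sketch below runs on simulated data (`x`, `y` and `fit` are illustrative names) and shows that the built-in interval matches the manual formula.

```r
set.seed(7)
# Simulated regressor and outcome, for illustration only
x <- rnorm(100, mean = 67, sd = 4)
y <- 1000 * x + rnorm(100, sd = 20000)

fit <- lm(y ~ x)

# Built-in 95% confidence interval for the slope
ci_builtin <- confint(fit, "x", level = 0.95)

# Manual version: estimate +/- t quantile * standard error
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
ci_manual <- est + c(-1, 1) * qt(0.975, df.residual(fit)) * se

print(ci_builtin)
print(ci_manual)
```

For the regressions in this exercise the equivalent calls would be confint(regression3, "height"), confint(femaleregression, "height") and confint(maleregression, "height"). With several thousand observations the t quantile is almost identical to 1.96.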