require(xlsx)
## Loading required package: xlsx
## Loading required package: rJava
## Loading required package: xlsxjars
mydata<-read.xlsx("C:/Users/yuanmany/Downloads/Earnings_and_Height.xlsx",1)
…
height=mydata$height
median(height)
## [1] 67
The median height is 67 inches. …
shortdata<-subset(mydata,height<67)
b=c(shortdata$earnings)
mean(b)
## [1] 44487.9
The average earnings of workers shorter than 67 inches are $44487.90. …
highdata<-subset(mydata,height>67)
a=c(highdata$earnings)
mean(a)
## [1] 49987.88
The average earnings of workers taller than 67 inches are $49987.88. Workers who are exactly 67 inches tall fall into neither group, because both subsets use strict inequalities. …
t.test(a,b, var.equal=TRUE, paired=FALSE)
##
## Two Sample t-test
##
## data: a and b
## t = 13.095, df = 16360, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4676.698 6323.263
## sample estimates:
## mean of x mean of y
## 49987.88 44487.90
mean(a)-mean(b)
## [1] 5499.98
The 95% confidence interval for the difference in true average earnings between tall and short workers is ($4676.698, $6323.263). Because the interval lies entirely above zero, taller workers earn more on average than shorter workers, and the difference is statistically significant at the 5% level. In the sample, taller workers earn $5499.98 more than shorter workers on average. …
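The interval reported by `t.test` can be reproduced by hand from the two group means, the pooled variance, and a t critical value. The sketch below does this on simulated data; the vectors `tall` and `short` and all of their parameters are hypothetical and are not the survey data. It assumes equal variances, matching `var.equal=TRUE` above.

```r
# Simulated earnings for two groups; these numbers are made up for illustration.
set.seed(1)
tall  <- rnorm(500, mean = 50000, sd = 26000)
short <- rnorm(400, mean = 44500, sd = 26000)

n1 <- length(tall)
n2 <- length(short)

# Pooled variance, the quantity used when var.equal = TRUE
sp2 <- ((n1 - 1) * var(tall) + (n2 - 1) * var(short)) / (n1 + n2 - 2)
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))

# Difference in sample means plus/minus critical value times standard error
diff_means <- mean(tall) - mean(short)
crit       <- qt(0.975, df = n1 + n2 - 2)
manual_ci  <- c(diff_means - crit * se, diff_means + crit * se)

# Built-in interval for comparison (as.numeric drops the conf.level attribute)
builtin_ci <- as.numeric(t.test(tall, short, var.equal = TRUE)$conf.int)

print(manual_ci)
print(builtin_ci)
```

Leaving out `var.equal = TRUE` makes `t.test` compute the Welch interval instead, which does not assume that the two groups share a variance.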
height=mydata$height
earnings=mydata$earnings
plot(height,earnings)
In the survey, labor earnings are reported in 23 brackets. For each bracket, Professors Case and Paxson estimated a value of average earnings based on information in the current population, and that average was assigned to all workers with incomes in the corresponding bracket. As a result, earnings takes at most 23 distinct values in the data, so the points in the scatterplot line up in horizontal rows. …
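The effect of bracketing can be illustrated with a small simulation. This is only a sketch: the income distribution, the quantile-based bracket boundaries, and the names `income`, `bracket`, and `bracketed` are assumptions made for the example, not the procedure actually used for the survey.

```r
# Simulated continuous incomes; hypothetical values, not the survey data.
set.seed(2)
income <- rlnorm(2000, meanlog = 10.5, sdlog = 0.6)

# Cut incomes into 23 brackets. Quantile-based breaks are an assumption
# made here so that every bracket contains some observations.
breaks  <- quantile(income, probs = seq(0, 1, length.out = 24))
bracket <- cut(income, breaks = breaks, include.lowest = TRUE)

# Assign every worker the average income of their bracket.
bracketed <- ave(income, bracket)

# Only 23 distinct values remain, so a scatterplot of bracketed income
# against any x variable shows 23 horizontal rows of points.
n_distinct <- length(unique(bracketed))
print(n_distinct)
```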
regression=lm(earnings~height,data=mydata)
summary(regression)
##
## Call:
## lm(formula = earnings ~ height, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47836 -21879 -7976 34323 50599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -512.73 3386.86 -0.151 0.88
## height 707.67 50.49 14.016 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared: 0.01088, Adjusted R-squared: 0.01082
## F-statistic: 196.5 on 1 and 17868 DF, p-value: < 2.2e-16
The estimated regression is earnings = -512.73 + 707.67*height …
The slope is 707.67 …
The estimated regression is earnings = -512.73 + 707.67*height.
For a worker who is 67 inches tall, predicted earnings are -512.73 + 707.67*67 = $46901.16.
For a worker who is 70 inches tall, predicted earnings are -512.73 + 707.67*70 = $49024.17.
For a worker who is 65 inches tall, predicted earnings are -512.73 + 707.67*65 = $45485.82.
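The three predictions can be checked in one step by plugging the reported coefficients into the fitted line. The names `b0`, `b1`, `heights`, and `predicted` are introduced here only for illustration.

```r
# Coefficients as reported in the regression output above
b0 <- -512.73   # intercept
b1 <- 707.67    # slope, in dollars per inch of height

# Heights of interest, in inches
heights <- c(65, 67, 70)

# Predicted earnings from the fitted line
predicted <- b0 + b1 * heights
names(predicted) <- heights
print(round(predicted, 2))
```

With the fitted model object available, `predict(regression, newdata = data.frame(height = c(65, 67, 70)))` should give the same predictions, up to the rounding of the printed coefficients.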
…
cmheight=height*2.54
regression2=lm(earnings~cmheight,data=mydata)
summary(regression2)
##
## Call:
## lm(formula = earnings ~ cmheight, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47836 -21879 -7976 34323 50599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -512.73 3386.86 -0.151 0.88
## cmheight 278.61 19.88 14.016 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared: 0.01088, Adjusted R-squared: 0.01082
## F-statistic: 196.5 on 1 and 17868 DF, p-value: < 2.2e-16
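The estimated regression is earnings = -512.73 + 278.61*cmheight …
The slope is 278.61, which is the slope in inches divided by 2.54 …
The estimated intercept is -512.73 …
The R-squared is 0.01088 …
The standard error of the regression is 26780 …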
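Changing the units of the regressor rescales only the slope (and its standard error); the intercept, R-squared, and standard error of the regression are unchanged. A minimal sketch on simulated data, where all variable names and parameter values are hypothetical:

```r
# Simulated data; hypothetical values, not the survey data.
set.seed(3)
height_in    <- rnorm(300, mean = 67, sd = 4)
earnings_sim <- -500 + 700 * height_in + rnorm(300, sd = 26000)
height_cm    <- height_in * 2.54

fit_in <- lm(earnings_sim ~ height_in)
fit_cm <- lm(earnings_sim ~ height_cm)

slope_in <- unname(coef(fit_in)[2])
slope_cm <- unname(coef(fit_cm)[2])

# The slope per centimeter equals the slope per inch divided by 2.54
print(c(slope_in / 2.54, slope_cm))

# Intercept, R-squared, and residual standard error are identical
print(c(unname(coef(fit_in)[1]), unname(coef(fit_cm)[1])))
print(c(summary(fit_in)$r.squared, summary(fit_cm)$r.squared))
print(c(summary(fit_in)$sigma, summary(fit_cm)$sigma))
```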
femaledata<-subset(mydata,sex<1)
femaleregression=lm(earnings~height,data=femaledata)
summary(femaleregression)
##
## Call:
## lm(formula = earnings ~ height, data = femaledata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42748 -22006 -7466 36641 46865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12650.9 6383.7 1.982 0.0475 *
## height 511.2 98.9 5.169 2.4e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26800 on 9972 degrees of freedom
## Multiple R-squared: 0.002672, Adjusted R-squared: 0.002572
## F-statistic: 26.72 on 1 and 9972 DF, p-value: 2.396e-07
The estimated regression is earnings = 12650.9 + 511.2*height …
The slope is 511.2 …
mean(femaledata$height)
## [1] 64.49278
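Predicted earnings for a woman of average height (about 64.5 inches) = 12650.9 + 511.2*64.5 = $45623.30.
Predicted earnings for a woman who is 1 inch taller = 12650.9 + 511.2*65.5 = $46134.50.
The difference in predicted earnings is $511.20, which equals the slope. …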
maledata<-subset(mydata,sex>0)
maleregression=lm(earnings~height,data=maledata)
summary(maleregression)
##
## Call:
## lm(formula = earnings ~ height, data = maledata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50158 -22373 -8118 33091 59228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43130.3 7068.5 -6.102 1.1e-09 ***
## height 1306.9 100.8 12.969 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26670 on 7894 degrees of freedom
## Multiple R-squared: 0.02086, Adjusted R-squared: 0.02074
## F-statistic: 168.2 on 1 and 7894 DF, p-value: < 2.2e-16
The slope is 1306.9 …
The difference in predicted earnings is $1306.90. The selected man is 1 inch taller than average, so his predicted earnings are higher by 1 times the slope of the regression line, which is 1306.9. …
Yes, height should not be correlated with other factors such as age, occupation, race, etc. Being tall does not guarantee that a person gets a particular occupation, such as sales or mechanics. Therefore, the regression error term would have a conditional mean of zero given height. …
regression3=lm(earnings~height,data=mydata)
summary(regression3)
##
## Call:
## lm(formula = earnings ~ height, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47836 -21879 -7976 34323 50599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -512.73 3386.86 -0.151 0.88
## height 707.67 50.49 14.016 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared: 0.01088, Adjusted R-squared: 0.01082
## F-statistic: 196.5 on 1 and 17868 DF, p-value: < 2.2e-16
…
Yes. The slope has a very large t-value of 14.016, so we can reject the null hypothesis that the slope of the regression line is 0 at any conventional significance level. The estimated slope of 707.67 is statistically significant. …
The 95% confidence interval for the slope = {707.67 - 1.96*50.49, 707.67 + 1.96*50.49} = {608.7096, 806.6304} …
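The same kind of interval can be obtained directly from a fitted model with `confint`, which uses the exact t critical value rather than 1.96. The sketch below compares the two on simulated data; the variables `height_sim` and `earnings_sim` and their parameters are hypothetical stand-ins for the survey data.

```r
# Simulated data; hypothetical values, not the survey data.
set.seed(4)
height_sim   <- rnorm(400, mean = 67, sd = 4)
earnings_sim <- -500 + 700 * height_sim + rnorm(400, sd = 26000)
fit <- lm(earnings_sim ~ height_sim)

# Slope estimate and its standard error from the coefficient table
est <- coef(summary(fit))["height_sim", "Estimate"]
se  <- coef(summary(fit))["height_sim", "Std. Error"]

# Large-sample interval using the normal critical value 1.96
normal_ci <- est + c(-1, 1) * 1.96 * se

# Interval using the t critical value with the residual degrees of freedom
t_ci <- est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se

# Built-in interval, which matches the t-based version
builtin_ci <- unname(confint(fit, "height_sim", level = 0.95)[1, ])

print(normal_ci)
print(t_ci)
print(builtin_ci)
```

With thousands of residual degrees of freedom, as in the regressions here, the t critical value is very close to 1.96, so the two intervals are nearly the same.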
femaledata<-subset(mydata,sex<1)
femaleregression=lm(earnings~height,data=femaledata)
summary(femaleregression)
##
## Call:
## lm(formula = earnings ~ height, data = femaledata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42748 -22006 -7466 36641 46865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12650.9 6383.7 1.982 0.0475 *
## height 511.2 98.9 5.169 2.4e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26800 on 9972 degrees of freedom
## Multiple R-squared: 0.002672, Adjusted R-squared: 0.002572
## F-statistic: 26.72 on 1 and 9972 DF, p-value: 2.396e-07
Yes. The t-value of the slope is 5.169, which exceeds the critical value of 2.58 for a two-sided test at the 1% significance level. Therefore the slope is statistically significant. …
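The comparison with the 1% critical value, and the corresponding two-sided p-value, can be checked from the reported t statistic and degrees of freedom. The names `t_stat`, `df_res`, `crit_1pct`, and `p_value` are introduced here for illustration.

```r
# t statistic and residual degrees of freedom as reported in the output above
t_stat <- 5.169
df_res <- 9972

# Large-sample two-sided critical value at the 1% level (about 2.58)
crit_1pct <- qnorm(0.995)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_res)

print(crit_1pct)
print(abs(t_stat) > crit_1pct)
print(p_value)
```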
The 95% confidence interval = [511.2 - 98.9*1.96, 511.2 + 98.9*1.96]
511.2-98.9*1.96
## [1] 317.356
511.2+98.9*1.96
## [1] 705.044
So the 95% confidence interval for the slope is [317.356, 705.044] …
maledata<-subset(mydata,sex>0)
maleregression=lm(earnings~height,data=maledata)
summary(maleregression)
##
## Call:
## lm(formula = earnings ~ height, data = maledata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50158 -22373 -8118 33091 59228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43130.3 7068.5 -6.102 1.1e-09 ***
## height 1306.9 100.8 12.969 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26670 on 7894 degrees of freedom
## Multiple R-squared: 0.02086, Adjusted R-squared: 0.02074
## F-statistic: 168.2 on 1 and 7894 DF, p-value: < 2.2e-16
The slope is statistically significant: its t-value is 12.969, which exceeds the 1% critical value of 2.58, so we can reject the null hypothesis that the slope is 0.
…
The 95% confidence interval for the slope = [1306.9 - 100.8*1.96, 1306.9 + 100.8*1.96]
1306.9-100.8*1.96
## [1] 1109.332
1306.9+100.8*1.96
## [1] 1504.468
So the 95% confidence interval for the slope is [1109.332, 1504.468]