library(haven)
## Warning: package 'haven' was built under R version 3.4.4
hprice1 <- read_dta("C:/Users/Admin/Desktop/Berlin Uni/Econometrics/hprice1.dta")
mean(hprice1$price)
## [1] 293.546
Estimate the model and provide the results in the usual form, including n and R2. Interpret the coefficients, i.e. explain the meaning.
my.model <- lm(lprice ~ bdrms + llotsize, data=hprice1)
summary(my.model)
##
## Call:
## lm(formula = lprice ~ bdrms + llotsize, data = hprice1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85343 -0.12505 0.00694 0.11992 0.68115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.95489 0.41950 7.044 4.55e-10 ***
## bdrms 0.14043 0.03072 4.571 1.64e-05 ***
## llotsize 0.24449 0.04752 5.145 1.69e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2376 on 85 degrees of freedom
## Multiple R-squared: 0.4013, Adjusted R-squared: 0.3872
## F-statistic: 28.49 on 2 and 85 DF, p-value: 3.399e-10
llotsize_squared <- hprice1$llotsize^2
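#llotsize_squared is created in the workspace, not as a column of hprice1; lm() still finds it there, but hprice1$llotsize_squared <- hprice1$llotsize^2 would keep it together with the data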
my.model2 <- lm(lprice ~ bdrms + llotsize + llotsize_squared, data=hprice1 )
summary(my.model2)
##
## Call:
## lm(formula = lprice ~ bdrms + llotsize + llotsize_squared, data = hprice1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85603 -0.12371 0.00907 0.12133 0.63947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.673910 2.793093 1.315 0.192
## bdrms 0.140325 0.030897 4.542 1.85e-05 ***
## llotsize 0.087582 0.604419 0.145 0.885
## llotsize_squared 0.008526 0.032741 0.260 0.795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.239 on 84 degrees of freedom
## Multiple R-squared: 0.4018, Adjusted R-squared: 0.3804
## F-statistic: 18.81 on 3 and 84 DF, p-value: 2.019e-09
anova(my.model2)
## Analysis of Variance Table
##
## Response: lprice
## Df Sum Sq Mean Sq F value Pr(>F)
## bdrms 1 1.7224 1.72236 30.1650 4.147e-07 ***
## llotsize 1 1.4951 1.49512 26.1851 1.933e-06 ***
## llotsize_squared 1 0.0039 0.00387 0.0678 0.7952
## Residuals 84 4.7962 0.05710
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
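The difference between the two models is very small: R-squared rises only from 0.4013 to 0.4018. In the first model, holding the other variables constant, a 1% increase in lot size is associated with a price increase of about 0.24%, and one more bedroom with a price increase of about 14%.
In the second model the squared term makes the relationship quadratic, so the effect of llotsize on price is no longer constant: it equals 0.0876 + 2(0.0085)llotsize. A negative coefficient on the squared term would mean a diminishing effect; here it is positive but tiny and not significant (p = 0.795), so there is no evidence that the effect of lot size changes with lot size.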
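A minimal sketch of how the reported coefficients translate into effects, using only the rounded estimates from the two summaries. The lot size of 9,000 square feet is a hypothetical value chosen for illustration, and the variable names are not part of the original analysis.

```r
# Coefficients copied from the summary() output above (rounded).
b_bdrms     <- 0.14043    # model 1: bdrms
b_llot_m1   <- 0.24449    # model 1: llotsize (elasticity of price w.r.t. lot size)
b_llot_m2   <- 0.087582   # model 2: llotsize
b_llotsq_m2 <- 0.008526   # model 2: llotsize_squared

# Exact percentage effect of one more bedroom in a log-level relationship.
pct_bdrms <- 100 * (exp(b_bdrms) - 1)

# In the quadratic model the elasticity depends on the lot size:
# d lprice / d llotsize = b1 + 2 * b2 * llotsize
lotsize_example <- 9000   # hypothetical lot size in square feet
elasticity_m2 <- b_llot_m2 + 2 * b_llotsq_m2 * log(lotsize_example)

round(c(pct_bdrms = pct_bdrms,
        elasticity_m1 = b_llot_m1,
        elasticity_m2 = elasticity_m2), 3)
```

At this lot size the elasticity implied by the quadratic model (about 0.243) is close to the 0.244 of the simpler model, which is consistent with the squared term adding almost nothing.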
Gauss-Markov assumptions
In statistics, the bias (or bias function) of an estimator is the difference between the estimator's expected value and the true value of the parameter being estimated. An estimator of the error variance that does not correct for the degrees of freedom is biased.
An unbiased estimator is a statistic used to approximate a population parameter whose expected value equals that parameter. "Accurate" in this sense means that, on average, it neither overestimates nor underestimates.
If it does overestimate or underestimate on average, the mean of the difference is called the "bias."
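$\hat{\sigma}^2 = SSR/(n-k-1)$, where k is the number of independent variables; here k = 3, so $\hat{\sigma}^2 = SSR/(n-4)$
#1) SSR/(n-1) = 4.7962/(88-1) = 0.055 (biased version)
#2) SSR/(n-k-1) = 4.7962/84 = 0.057 (unbiased version)
#The difference between the biased and the unbiased version is small because the difference in degrees of freedom is small relative to the sample size (n = 88, 4 estimated parameters).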
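A short sketch of the variance estimators discussed above, assuming the SSR of 4.7962 and the 84 residual degrees of freedom from the anova table of my.model2. The variable names are illustrative only.

```r
ssr <- 4.7962   # residual sum of squares of my.model2, from the anova table
n   <- 88       # observations (84 residual df + 4 estimated parameters)
k   <- 3        # regressors: bdrms, llotsize, llotsize_squared

sigma2_n        <- ssr / n            # no degrees-of-freedom correction
sigma2_n1       <- ssr / (n - 1)      # version 1) in the notes
sigma2_unbiased <- ssr / (n - k - 1)  # version 2), unbiased under the Gauss-Markov assumptions

round(c(div_n = sigma2_n, div_n_minus_1 = sigma2_n1, unbiased = sigma2_unbiased), 4)

# The square root of the unbiased estimate is the residual standard error.
round(sqrt(sigma2_unbiased), 3)
```

The last line should reproduce the residual standard error of 0.239 reported by summary(my.model2).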
# the standard error formula: se(b_j) = sigma_hat / [SST_j * (1 - R_j^2)]^0.5
se_llotsize = 0.04288/(25.7521193*(1-0.9939^2))^0.5
se_llotsize
## [1] 0.07661815
se_bdrms = 0.83885/(61.5909091*(1-0.0289))^0.5
se_bdrms
## [1] 0.1084661
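The same formula can be wrapped in a small helper so that the inputs are named. This is only a sketch: `se_beta()` and `vif()` are hypothetical helper names, and the numbers passed in are the ones used in the calculation above.

```r
# se(b_j) = sigma_hat / sqrt(SST_j * (1 - R_j^2))
#   sigma  : estimated standard deviation of the error term
#   sst_j  : total sum of squares of regressor j
#   r2_j   : R-squared from regressing regressor j on the other regressors
se_beta <- function(sigma, sst_j, r2_j) {
  stopifnot(sst_j > 0, r2_j >= 0, r2_j < 1)
  sigma / sqrt(sst_j * (1 - r2_j))
}

# Variance inflation factor: how much collinearity inflates Var(b_j).
vif <- function(r2_j) 1 / (1 - r2_j)

se_beta(0.04288, 25.7521193, 0.9939^2)   # same inputs as se_llotsize above
vif(0.9939^2)
```

With these inputs the helper returns the same value as se_llotsize, and the variance inflation factor of roughly 82 shows how strongly llotsize and its square are correlated.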
Use the data in WAGE2.RAW for this exercise. Estimate the model
$$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + \beta_4 married + \beta_5 black + \beta_6 south + \beta_7 urban + u \qquad (3)$$
and report the results in the usual form. Holding other factors fixed, what is the approximate difference in monthly salary between blacks and nonblacks? What is the corresponding 90% confidence interval for this salary difference?
WAGE2 <- read_dta("C:/Users/Admin/Desktop/Berlin Uni/R statistics/WAGE2.DTA")
model3 <- lm(lwage ~ educ + exper + tenure + married + black + south + urban, data=WAGE2)
summary(model3)
##
## Call:
## lm(formula = lwage ~ educ + exper + tenure + married + black +
## south + urban, data = WAGE2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98069 -0.21996 0.00707 0.24288 1.22822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.395497 0.113225 47.653 < 2e-16 ***
## educ 0.065431 0.006250 10.468 < 2e-16 ***
## exper 0.014043 0.003185 4.409 1.16e-05 ***
## tenure 0.011747 0.002453 4.789 1.95e-06 ***
## married 0.199417 0.039050 5.107 3.98e-07 ***
## black -0.188350 0.037667 -5.000 6.84e-07 ***
## south -0.090904 0.026249 -3.463 0.000558 ***
## urban 0.183912 0.026958 6.822 1.62e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3655 on 927 degrees of freedom
## Multiple R-squared: 0.2526, Adjusted R-squared: 0.2469
## F-statistic: 44.75 on 7 and 927 DF, p-value: < 2.2e-16
#Holding other factors fixed, what is the approximate difference in monthly salary between blacks and nonblacks
model3$coefficients['black']
## black
## -0.1883499
exp(model3$coefficients['black'])
## black
## 0.8283248
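#approximately 18.8% lower salary for blacks (100 * coefficient); the exact difference is 100*(exp(-0.188)-1) = -17.2%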
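A small sketch comparing the approximate and the exact percentage difference, assuming only the coefficient on black reported by summary(model3). The variable names are illustrative.

```r
b_black <- -0.1883499   # coefficient on black from summary(model3)

approx_pct <- 100 * b_black              # log-point approximation
exact_pct  <- 100 * (exp(b_black) - 1)   # exact percentage difference

round(c(approx = approx_pct, exact = exact_pct), 1)
```

The two numbers differ because the approximation 100 * b only works well for small coefficients.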
#What is the corresponding 90% confidence interval for this salary difference?
qnorm(.95)
## [1] 1.644854
#the interval is built around the coefficient itself (log scale), not around exp(coefficient)
corresponded_interval1=model3$coefficients['black'] + qnorm(.95)*0.037667
corresponded_interval2=model3$coefficients['black'] - qnorm(.95)*0.037667
corresponded_interval1
## black
## -0.1263932
corresponded_interval2
## black
## -0.2503066
#90% confidence interval = b5 +/- c * se(b5), where c = 1.645 is the critical value for 90%
With 90% confidence, holding other factors fixed, the monthly salary of blacks is approximately 12.6% to 25.0% lower than that of nonblacks.
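As a cross-check, the interval can be computed from the reported estimate and standard error alone. This sketch assumes the values printed by summary(model3) and uses the t distribution with 927 degrees of freedom instead of the normal approximation; the variable names are illustrative.

```r
b_black  <- -0.1883499   # estimate from summary(model3)
se_black <- 0.037667     # its standard error
df_resid <- 927          # residual degrees of freedom

crit   <- qt(0.95, df_resid)                    # critical value for a two-sided 90% interval
ci_log <- b_black + c(-1, 1) * crit * se_black  # interval on the log scale
ci_pct <- 100 * (exp(ci_log) - 1)               # converted to exact percentages

round(ci_log, 3)
round(ci_pct, 1)
```

With the data loaded, `confint(model3, "black", level = 0.90)` should give the same log-scale interval. The interval lies entirely below zero, so the difference is significant at the 10% level.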
expersq <- WAGE2$exper^2
head(WAGE2)
## # A tibble: 6 x 17
## wage hours IQ KWW educ exper tenure age married black south
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 769 40 93 35 12 11 2 31 1 0 0
## 2 808 50 119 41 18 11 16 37 1 0 0
## 3 825 40 108 46 14 11 9 33 1 0 0
## 4 650 40 96 32 12 13 7 32 1 0 0
## 5 562 40 74 27 11 14 5 34 1 0 0
## 6 1400 40 116 43 16 14 2 35 1 1 0
## # ... with 6 more variables: urban <dbl>, sibs <dbl>, brthord <dbl>,
## # meduc <dbl>, feduc <dbl>, lwage <dbl>
tenuresq <- WAGE2$tenure^2
model4 <- lm(lwage ~ educ + exper + tenure + married + black + south + urban +expersq + tenuresq, data=WAGE2)
#model4 <- lm(lwage ~ educ + exper + tenure + married + black + south + urban + I(exper^2) + I(tenure^2), data=WAGE2) - I() computes the squares inside the formula, so no new variables have to be added to the data set (useful e.g. when the data set is large)
summary(model4)
##
## Call:
## lm(formula = lwage ~ educ + exper + tenure + married + black +
## south + urban + expersq + tenuresq, data = WAGE2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.98236 -0.21972 -0.00036 0.24078 1.25127
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3586757 0.1259143 42.558 < 2e-16 ***
## educ 0.0642761 0.0063115 10.184 < 2e-16 ***
## exper 0.0172146 0.0126138 1.365 0.172665
## tenure 0.0249291 0.0081297 3.066 0.002229 **
## married 0.1985470 0.0391103 5.077 4.65e-07 ***
## black -0.1906636 0.0377011 -5.057 5.13e-07 ***
## south -0.0912153 0.0262356 -3.477 0.000531 ***
## urban 0.1854241 0.0269585 6.878 1.12e-11 ***
## expersq -0.0001138 0.0005319 -0.214 0.830622
## tenuresq -0.0007964 0.0004710 -1.691 0.091188 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3653 on 925 degrees of freedom
## Multiple R-squared: 0.255, Adjusted R-squared: 0.2477
## F-statistic: 35.17 on 9 and 925 DF, p-value: < 2.2e-16
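The I() notation mentioned in the comment on model4 can be illustrated with a self-contained sketch. The built-in `mtcars` data set stands in for WAGE2 here, so the variables `mpg` and `wt` are only placeholders for `lwage` and `exper`.

```r
# mtcars ships with R and stands in for WAGE2 here.
cars <- mtcars
cars$wtsq <- cars$wt^2                                # explicit squared column

fit_column <- lm(mpg ~ wt + wtsq, data = cars)
fit_inline <- lm(mpg ~ wt + I(wt^2), data = mtcars)   # square computed inside the formula

# Same estimates, only the coefficient label differs.
all.equal(unname(coef(fit_column)), unname(coef(fit_inline)))

# Without I(), ^ is formula syntax, so the squared term is silently dropped.
names(coef(lm(mpg ~ wt + wt^2, data = mtcars)))
```

The last line shows why I() is needed: without it the model contains only the intercept and the linear term.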
anova(model4)
## Analysis of Variance Table
##
## Response: lwage
## Df Sum Sq Mean Sq F value Pr(>F)
## educ 1 16.138 16.1377 120.9469 < 2.2e-16 ***
## exper 1 5.540 5.5400 41.5202 1.874e-10 ***
## tenure 1 4.018 4.0177 30.1111 5.268e-08 ***
## married 1 3.493 3.4935 26.1826 3.777e-07 ***
## black 1 3.888 3.8878 29.1379 8.569e-08 ***
## south 1 2.545 2.5447 19.0718 1.401e-05 ***
## urban 1 6.216 6.2164 46.5901 1.582e-11 ***
## expersq 1 0.016 0.0161 0.1204 0.72869
## tenuresq 1 0.382 0.3815 2.8592 0.09119 .
## Residuals 925 123.421 0.1334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#F-statistic
SSRr = sum(model3$residuals^2)
SSRr
## [1] 123.8185
SSRur = sum(model4$residuals^2)
SSRur
## [1] 123.421
#the denominator uses the degrees of freedom of the unrestricted model: n-k-1 with k=9
F_test = ((SSRr-SSRur)/2)/(SSRur/(nrow(WAGE2)-9-1))
round(F_test, 2)
## [1] 1.49
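#H0: the additional variables (expersq, tenuresq) have no explanatory power, i.e. both coefficients are zero
#in the F table (q=2, r=10000) the 5% critical value is 3.00
#rejection rule: reject H0 if F > 3.00; F is about 1.49, which is smaller than 3.00, so we do not reject H0 and keep the more parsimonious model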
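The test decision can also be checked without a printed table. This sketch assumes the two SSR values reported above and the 925 residual degrees of freedom of model4; the variable names are illustrative.

```r
ssr_r  <- 123.8185   # restricted model (model3)
ssr_ur <- 123.421    # unrestricted model (model4)
q      <- 2          # number of restrictions (expersq, tenuresq)
df_ur  <- 925        # residual df of the unrestricted model

f_stat  <- ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)
f_crit  <- qf(0.95, q, df_ur)                         # 5% critical value
p_value <- pf(f_stat, q, df_ur, lower.tail = FALSE)   # p-value of the F test

round(c(F = f_stat, critical = f_crit, p = p_value), 3)
f_stat > f_crit   # TRUE would mean rejecting H0
```

With the data loaded, `anova(model3, model4)` should report the same F statistic and p-value for the two nested models.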
Use an F-test for choosing between this model and the more parsimonious model of part (a), i.e. test whether the additional variables provide 'enough' additional explanatory power. Provide the test statistic, the rejection rule (α = 0.05) and the test decision.
model5 = lm(lwage ~ educ + exper + tenure + married + black + south + urban + black*educ, data = WAGE2)
summary(model5)
##
## Call:
## lm(formula = lwage ~ educ + exper + tenure + married + black +
## south + urban + black * educ, data = WAGE2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.97782 -0.21832 0.00475 0.24136 1.23226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.374817 0.114703 46.859 < 2e-16 ***
## educ 0.067115 0.006428 10.442 < 2e-16 ***
## exper 0.013826 0.003191 4.333 1.63e-05 ***
## tenure 0.011787 0.002453 4.805 1.80e-06 ***
## married 0.198908 0.039047 5.094 4.25e-07 ***
## black 0.094809 0.255399 0.371 0.710561
## south -0.089450 0.026277 -3.404 0.000692 ***
## urban 0.183852 0.026955 6.821 1.63e-11 ***
## educ:black -0.022624 0.020183 -1.121 0.262603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3654 on 926 degrees of freedom
## Multiple R-squared: 0.2536, Adjusted R-squared: 0.2471
## F-statistic: 39.32 on 8 and 926 DF, p-value: < 2.2e-16
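#the t-value of educ:black is -1.121 and its p-value is 0.263, which is greater than 0.1, so we do not reject H0. The black*educ interaction is not significant.
The black coefficient is now 0.0948 instead of -0.188350 in the initial model, because with the interaction it only measures the wage difference at educ = 0; the negative part of the difference is now carried by educ:black (-0.0226 per year of education). At educ = 12, for example, the estimated difference is 0.094809 - 0.022624*12 = -0.177.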
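A brief sketch of how the estimated gap depends on education in the interaction model, using only the two coefficients reported by summary(model5). The function name `gap_at()` and the chosen education levels are illustrative assumptions.

```r
b_black      <- 0.094809    # coefficient on black in model5
b_educ_black <- -0.022624   # coefficient on educ:black in model5

# Estimated log-wage difference between blacks and nonblacks at a given education level.
gap_at <- function(educ) b_black + b_educ_black * educ

educ_levels <- c(0, 12, 16)   # illustrative years of education
round(setNames(gap_at(educ_levels), paste0("educ=", educ_levels)), 3)
```

At 12 years of education the gap is close to the -0.188 of the model without the interaction, which fits the finding that the interaction term is not significant.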
#married.black = WAGE2$black*WAGE2$married