(a) What is the sample mean of price? Provide a table of summary statistics for price, bdrms and lotsize

library(haven)
## Warning: package 'haven' was built under R version 3.4.4
hprice1 <- read_dta("C:/Users/Admin/Desktop/Berlin Uni/Econometrics/hprice1.dta")

mean(hprice1$price)
## [1] 293.546
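The question also asks for a table of summary statistics for price, bdrms and lotsize, which the output above does not show. A minimal sketch of one way to build such a table in base R follows; the helper name `summary_table` and the choice of statistics are assumptions made for illustration, not something the assignment prescribes.

```r
# Hypothetical helper: one row per variable, one column per statistic.
summary_table <- function(df, vars) {
  stats <- sapply(df[vars], function(x) {
    c(n    = sum(!is.na(x)),
      mean = mean(x, na.rm = TRUE),
      sd   = sd(x, na.rm = TRUE),
      min  = min(x, na.rm = TRUE),
      max  = max(x, na.rm = TRUE))
  })
  t(stats)  # transpose so that variables are rows
}

# Only runs when the data set from above has been loaded.
if (exists("hprice1")) {
  print(round(summary_table(hprice1, c("price", "bdrms", "lotsize")), 2))
}
```

Calling `summary(hprice1[, c("price", "bdrms", "lotsize")])` would give a similar overview without a helper, although without the standard deviation.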

(b) Consider the model log(price) = b0 + b1bdrms + b2 log(lotsize) + v. (1)

Estimate the model and provide the results in the usual form, including n and R2. Interpret the coefficients, i.e. explain the meaning.

my.model <- lm(lprice ~ bdrms + llotsize, data=hprice1)
summary(my.model)
## 
## Call:
## lm(formula = lprice ~ bdrms + llotsize, data = hprice1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85343 -0.12505  0.00694  0.11992  0.68115 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.95489    0.41950   7.044 4.55e-10 ***
## bdrms        0.14043    0.03072   4.571 1.64e-05 ***
## llotsize     0.24449    0.04752   5.145 1.69e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2376 on 85 degrees of freedom
## Multiple R-squared:  0.4013, Adjusted R-squared:  0.3872 
## F-statistic: 28.49 on 2 and 85 DF,  p-value: 3.399e-10
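In the usual form the output reads log(price) = 2.955 + 0.140 bdrms + 0.244 log(lotsize), with n = 88 (85 residual degrees of freedom plus 3 estimated coefficients) and R2 = 0.401. The sketch below only re-uses the printed estimates to turn them into percentage effects; the variable names are made up for illustration.

```r
# Estimates copied from summary(my.model) above.
b_bdrms    <- 0.14043
b_llotsize <- 0.24449

# Log-level: one more bedroom changes price by roughly 100 * b1 percent.
approx_pct_bdrms <- 100 * b_bdrms
# Exact percentage change implied by the log specification.
exact_pct_bdrms <- 100 * (exp(b_bdrms) - 1)

# Log-log: b2 is an elasticity, so a 1% larger lot goes with about b2 percent higher price.
elasticity_lotsize <- b_llotsize

round(c(approx = approx_pct_bdrms,
        exact = exact_pct_bdrms,
        elasticity = elasticity_lotsize), 2)
```

Holding lot size fixed, one additional bedroom is therefore associated with a price that is roughly 14% higher (about 15% using the exact formula), and holding bedrooms fixed, a 1% larger lot is associated with a price that is about 0.24% higher.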

(c) Now consider the model

log(price) = b0 + b1bdrms + b2 log(lotsize) + b3 [log(lotsize)]^2 + u. (2)

(i) Estimate the model and provide the results in the usual form and interpret the coefficients.

hprice1$llotsize_squared <- hprice1$llotsize^2

#the squared term is added as a column of hprice1 so that lm(..., data=hprice1) finds it in the data set

my.model2 <- lm(lprice ~ bdrms + llotsize + llotsize_squared, data=hprice1 )
summary(my.model2)
## 
## Call:
## lm(formula = lprice ~ bdrms + llotsize + llotsize_squared, data = hprice1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85603 -0.12371  0.00907  0.12133  0.63947 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.673910   2.793093   1.315    0.192    
## bdrms            0.140325   0.030897   4.542 1.85e-05 ***
## llotsize         0.087582   0.604419   0.145    0.885    
## llotsize_squared 0.008526   0.032741   0.260    0.795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.239 on 84 degrees of freedom
## Multiple R-squared:  0.4018, Adjusted R-squared:  0.3804 
## F-statistic: 18.81 on 3 and 84 DF,  p-value: 2.019e-09
anova(my.model2)
## Analysis of Variance Table
## 
## Response: lprice
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## bdrms             1 1.7224 1.72236 30.1650 4.147e-07 ***
## llotsize          1 1.4951 1.49512 26.1851 1.933e-06 ***
## llotsize_squared  1 0.0039 0.00387  0.0678    0.7952    
## Residuals        84 4.7962 0.05710                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(ii) Compare point estimates and standard errors of the log(lotsize)- and bdrms-coefficients in the two models (equations (1) and (2)). Explain why this pattern occurs.

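The bdrms estimate is almost unchanged: 0.1404 (se 0.0307) in model (1) and 0.1403 (se 0.0309) in model (2). The log(lotsize) estimate falls from 0.2445 (se 0.0475) to 0.0876 (se 0.6044), so its standard error is more than twelve times larger and the coefficient is no longer significant.

This pattern occurs because log(lotsize) and [log(lotsize)]^2 are almost perfectly collinear, so their separate effects cannot be estimated precisely, while bdrms is only weakly correlated with the added regressor. In model (2) the elasticity of price with respect to lot size is no longer a single coefficient but b2 + 2 b3 log(lotsize); at log(lotsize) = 9, for example, it is 0.0876 + 2(0.0085)(9) = 0.24, close to the estimate in model (1). The positive b3 would mean that the elasticity rises with lot size, but b3 is not statistically different from zero.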
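The role of collinearity can be made concrete with the variance inflation factor 1/(1 − R_j^2). The sketch below assumes that the value 0.9939 appearing in part (e) is the R-squared from regressing llotsize on the other regressors of model (2); under that reading, the square root of the inflation factor should be close to the observed ratio of standard errors.

```r
# Standard errors copied from the two summary() outputs above.
se_llotsize_m1 <- 0.04752
se_llotsize_m2 <- 0.604419
se_bdrms_m1    <- 0.03072
se_bdrms_m2    <- 0.030897

# Assumption: 0.9939 (the value used in part (e)) is the R-squared from
# regressing llotsize on bdrms and llotsize_squared.
r2_aux_llotsize <- 0.9939
vif_llotsize <- 1 / (1 - r2_aux_llotsize)

se_ratio_llotsize <- se_llotsize_m2 / se_llotsize_m1
se_ratio_bdrms    <- se_bdrms_m2 / se_bdrms_m1

round(c(sqrt_vif = sqrt(vif_llotsize),
        ratio_llotsize = se_ratio_llotsize,
        ratio_bdrms = se_ratio_bdrms), 2)
```

The comparison is only approximate, since the residual standard error and the correlation with bdrms also differ slightly between the two models.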

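(d) A possible estimator for the variance of u in model (2) is to calculate the sample variance of the residuals (SSR/(n − 1)). We know that this estimator is biased. (The divisor n − 1 does not account for the k slope parameters that were estimated.)

Gauss-Markov assumptions (linearity in parameters, random sampling, no perfect collinearity, zero conditional mean, homoskedasticity)

In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. SSR/(n − 1) is biased downward because it does not correct the degrees of freedom for the estimated coefficients.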

Which (unbiased) estimator is usually implemented in regression packages (provide the formula)? Explain what unbiasedness means.

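An estimator is unbiased if its expected value equals the population parameter it estimates. This is a statement about the average over repeated samples: the estimator neither systematically overestimates nor systematically underestimates, although any single estimate can still miss the true value.

If the expected value differs from the true value, that difference is called the "bias."

sigma_hat^2 = SSR/(n − k − 1), where k is the number of independent variables; for model (2), k = 3, so sigma_hat^2 = SSR/(88 − 3 − 1) = SSR/84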

Which assumptions are required for this property? Obtain the two suggested variance estimates for model (2). Are there large differences? Why (not)?

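#1) SSR/(n-1) = 4.7962/87 = 0.0551 (biased version)
#2) SSR/(n-k-1) = 4.7962/84 = 0.0571 (unbiased version; equals the squared residual standard error, 0.239^2)

#The two estimates differ only slightly because the degrees-of-freedom correction is small relative to the sample size: the divisors are 87 and 84 (n = 88, k + 1 = 4), so the ratio of the estimates is 84/87 = 0.966.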
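The two variance estimates can be reproduced from the residual sum of squares printed by anova(my.model2). This is a small sketch that hard-codes the printed numbers rather than re-running the regression; the variable names are chosen for illustration.

```r
ssr <- 4.7962   # residual sum of squares from anova(my.model2)
n   <- 88       # 84 residual degrees of freedom + 4 estimated coefficients
k   <- 3        # bdrms, llotsize, llotsize_squared

var_biased   <- ssr / (n - 1)       # sample variance of the residuals
var_unbiased <- ssr / (n - k - 1)   # estimator reported by regression packages

round(c(biased = var_biased,
        unbiased = var_unbiased,
        ratio = var_biased / var_unbiased), 4)
```

The unbiased value matches the Mean Sq of the residuals in the ANOVA table (0.05710), and the ratio of the two estimates is (n − k − 1)/(n − 1), which approaches 1 as n grows with k fixed.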

e) Calculate the standard error of b2 in model (2). Similarly, calculate the standard error of b1 using this output:

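# standard error formula: se(b_j) = sigma_hat / [SST_j * (1 - R_j^2)]^0.5
# sigma_hat is the residual standard error of model (2), 0.239; R_j^2 is the R-squared of the auxiliary regression of x_j on the other regressors and enters unsquared

se_llotsize = 0.239/(25.7521193*(1-0.9939))^0.5
round(se_llotsize, 3)
## [1] 0.603
se_bdrms = 0.239/(61.5909091*(1-0.0289))^0.5
round(se_bdrms, 4)
## [1] 0.0309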
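As a cross-check, the formula should reproduce the standard errors printed by summary(my.model2) when sigma_hat is the residual standard error of model (2) and R_j^2 enters as it is. The sketch below wraps the formula in a hypothetical helper `se_coef`; reading 25.7521193 and 61.5909091 as SST_j and 0.9939 and 0.0289 as auxiliary R-squared values is an assumption about the output the exercise refers to.

```r
# Hypothetical helper implementing se(b_j) = sigma_hat / sqrt(SST_j * (1 - R_j^2)).
se_coef <- function(sigma_hat, sst_j, r2_j) {
  sigma_hat / sqrt(sst_j * (1 - r2_j))
}

sigma_hat <- 0.239  # residual standard error of model (2)

# SST_j and R_j^2 as read from the auxiliary-regression figures (assumption).
se_llotsize_check <- se_coef(sigma_hat, 25.7521193, 0.9939)
se_bdrms_check    <- se_coef(sigma_hat, 61.5909091, 0.0289)

# Compare with the standard errors printed by summary(my.model2).
round(c(llotsize = se_llotsize_check, printed_llotsize = 0.604419,
        bdrms = se_bdrms_check, printed_bdrms = 0.030897), 4)
```

Small remaining gaps between the formula values and the printed ones come from rounding in sigma_hat and R_j^2.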

II (7 Points)

Use the data in WAGE2.RAW for this exercise.

(a) Estimate the model

log(wage) =b0 + b1educ + b2exper + b3tenure + b4married + b5black + b6south + b7urban + u (3)

and report the results in the usual form. Holding other factors fixed, what is the approximate difference in monthly salary between blacks and nonblacks? What is the corresponding 90%- confidence interval for this salary difference?

WAGE2 <- read_dta("C:/Users/Admin/Desktop/Berlin Uni/R statistics/WAGE2.DTA")

model3 <- lm(lwage ~ educ + exper + tenure + married + black + south + urban, data=WAGE2)

summary(model3)
## 
## Call:
## lm(formula = lwage ~ educ + exper + tenure + married + black + 
##     south + urban, data = WAGE2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98069 -0.21996  0.00707  0.24288  1.22822 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.395497   0.113225  47.653  < 2e-16 ***
## educ         0.065431   0.006250  10.468  < 2e-16 ***
## exper        0.014043   0.003185   4.409 1.16e-05 ***
## tenure       0.011747   0.002453   4.789 1.95e-06 ***
## married      0.199417   0.039050   5.107 3.98e-07 ***
## black       -0.188350   0.037667  -5.000 6.84e-07 ***
## south       -0.090904   0.026249  -3.463 0.000558 ***
## urban        0.183912   0.026958   6.822 1.62e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3655 on 927 degrees of freedom
## Multiple R-squared:  0.2526, Adjusted R-squared:  0.2469 
## F-statistic: 44.75 on 7 and 927 DF,  p-value: < 2.2e-16
#Holding other factors fixed, what is the approximate difference in monthly salary between blacks and nonblacks

model3$coefficients['black']
##      black 
## -0.1883499
exp(model3$coefficients['black'])
##     black 
## 0.8283248
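#approximate difference: 100*b5 = -18.8%, i.e. blacks earn about 18.8% less; the exact difference is 100*(exp(b5)-1) = -17.2%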

#What is the corresponding 90%- confidence interval for this salary difference?

#the interval is built on the log scale: b5 +/- c * se(b5), with c the 95% quantile (1.645)
qnorm(.95)
## [1] 1.644854
ci_upper = unname(model3$coefficients['black']) + qnorm(.95)*0.037667
ci_lower = unname(model3$coefficients['black']) - qnorm(.95)*0.037667
round(c(ci_lower, ci_upper), 3)
## [1] -0.250 -0.126

With 90% confidence, and holding the other factors fixed, the monthly wage of blacks is approximately 12.6% to 25.0% lower than that of nonblacks (exact percentages via 100*(exp(.) − 1): 11.9% to 22.1% lower).
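The interval for the salary difference is easiest to build on the log scale, around the coefficient itself, and only then converted to percentages. The sketch below re-uses the printed estimate and standard error and takes the critical value from the t distribution with 927 degrees of freedom; the variable names are illustrative.

```r
b_black  <- -0.1883499  # coefficient on black from summary(model3)
se_black <- 0.037667    # its standard error
df_resid <- 927         # residual degrees of freedom

# 90% interval on the log scale, using the t quantile instead of qnorm().
crit   <- qt(0.95, df_resid)
ci_log <- b_black + c(-1, 1) * crit * se_black

# Exact percentage differences implied by the log specification.
pct_point <- 100 * (exp(b_black) - 1)
pct_ci    <- 100 * (exp(ci_log) - 1)

round(ci_log, 3)
round(c(pct_point, pct_ci), 1)
```

With the fitted model available, `confint(model3, "black", level = 0.90)` should give the same log-scale interval directly.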

(b) Adding the variables exper^2 and tenure^2 yields the following output:

expersq <- WAGE2$exper^2
head(WAGE2)
## # A tibble: 6 x 17
##    wage hours    IQ   KWW  educ exper tenure   age married black south
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl>
## 1   769    40    93    35    12    11      2    31       1     0     0
## 2   808    50   119    41    18    11     16    37       1     0     0
## 3   825    40   108    46    14    11      9    33       1     0     0
## 4   650    40    96    32    12    13      7    32       1     0     0
## 5   562    40    74    27    11    14      5    34       1     0     0
## 6  1400    40   116    43    16    14      2    35       1     1     0
## # ... with 6 more variables: urban <dbl>, sibs <dbl>, brthord <dbl>,
## #   meduc <dbl>, feduc <dbl>, lwage <dbl>
tenuresq <- WAGE2$tenure^2
model4 <- lm(lwage ~ educ + exper + tenure + married + black + south + urban +expersq + tenuresq, data=WAGE2)

#model4 <- lm(lwage ~ educ + exper + tenure + married + black + south + urban + I(exper^2) + I(tenure^2), data=WAGE2) - with I() the squares are computed inside the formula, so no new columns have to be added to the data set (useful e.g. when it is large)

summary(model4)
## 
## Call:
## lm(formula = lwage ~ educ + exper + tenure + married + black + 
##     south + urban + expersq + tenuresq, data = WAGE2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98236 -0.21972 -0.00036  0.24078  1.25127 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.3586757  0.1259143  42.558  < 2e-16 ***
## educ         0.0642761  0.0063115  10.184  < 2e-16 ***
## exper        0.0172146  0.0126138   1.365 0.172665    
## tenure       0.0249291  0.0081297   3.066 0.002229 ** 
## married      0.1985470  0.0391103   5.077 4.65e-07 ***
## black       -0.1906636  0.0377011  -5.057 5.13e-07 ***
## south       -0.0912153  0.0262356  -3.477 0.000531 ***
## urban        0.1854241  0.0269585   6.878 1.12e-11 ***
## expersq     -0.0001138  0.0005319  -0.214 0.830622    
## tenuresq    -0.0007964  0.0004710  -1.691 0.091188 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3653 on 925 degrees of freedom
## Multiple R-squared:  0.255,  Adjusted R-squared:  0.2477 
## F-statistic: 35.17 on 9 and 925 DF,  p-value: < 2.2e-16
anova(model4)
## Analysis of Variance Table
## 
## Response: lwage
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## educ        1  16.138 16.1377 120.9469 < 2.2e-16 ***
## exper       1   5.540  5.5400  41.5202 1.874e-10 ***
## tenure      1   4.018  4.0177  30.1111 5.268e-08 ***
## married     1   3.493  3.4935  26.1826 3.777e-07 ***
## black       1   3.888  3.8878  29.1379 8.569e-08 ***
## south       1   2.545  2.5447  19.0718 1.401e-05 ***
## urban       1   6.216  6.2164  46.5901 1.582e-11 ***
## expersq     1   0.016  0.0161   0.1204   0.72869    
## tenuresq    1   0.382  0.3815   2.8592   0.09119 .  
## Residuals 925 123.421  0.1334                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#F-statistic
SSRr = sum(model3$residuals^2)
SSRr
## [1] 123.8185
SSRur = sum(model4$residuals^2)
SSRur
## [1] 123.421
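#the denominator degrees of freedom come from the unrestricted model: n - 9 - 1 = 925
F_test = ((SSRr-SSRur)/2)/(SSRur/(nrow(WAGE2)-9-1))
round(F_test, 2)
## [1] 1.49
#H0: the coefficients on expersq and tenuresq are both zero (the additional variables add no explanatory power)
#rejection rule (a = 0.05): reject H0 if F exceeds the critical value of F(2, 925), about 3.00
#we do not reject H0 because 1.49 is smaller than 3.00, so the more parsimonious model of part (a) is preferred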
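The test can also be carried out without a printed table by taking the critical value and p-value from R's F distribution functions. The sketch below hard-codes the two residual sums of squares shown above and uses the residual degrees of freedom of the unrestricted model; the variable names are illustrative.

```r
ssr_r  <- 123.8185  # restricted model (model3)
ssr_ur <- 123.421   # unrestricted model (model4)
q      <- 2         # number of restrictions (expersq, tenuresq)
df_ur  <- 925       # residual degrees of freedom of model4

f_stat  <- ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)
f_crit  <- qf(0.95, q, df_ur)                        # 5% critical value
p_value <- pf(f_stat, q, df_ur, lower.tail = FALSE)

reject_h0 <- f_stat > f_crit
round(c(F = f_stat, critical = f_crit, p = p_value), 3)
reject_h0
```

With both fitted models in memory, `anova(model3, model4)` performs the same comparison of nested models in one call.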

Use an F-Test for choosing between this model and the more parsimonious model of part (a), i.e. test whether the additional variables provide ‘enough’ additional explanatory power. Provide the test statistics, the rejection rule (a = 0.05) and the test decision.

(c) Extend the original model to allow the return to education to depend on race. Provide the model equation and then run the regression and provide the results. Test whether the return to education does depend on race (a = 0.1).

model5 = lm(lwage ~ educ + exper + tenure + married + black + south + urban + black*educ, data = WAGE2)

summary(model5)
## 
## Call:
## lm(formula = lwage ~ educ + exper + tenure + married + black + 
##     south + urban + black * educ, data = WAGE2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.97782 -0.21832  0.00475  0.24136  1.23226 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.374817   0.114703  46.859  < 2e-16 ***
## educ         0.067115   0.006428  10.442  < 2e-16 ***
## exper        0.013826   0.003191   4.333 1.63e-05 ***
## tenure       0.011787   0.002453   4.805 1.80e-06 ***
## married      0.198908   0.039047   5.094 4.25e-07 ***
## black        0.094809   0.255399   0.371 0.710561    
## south       -0.089450   0.026277  -3.404 0.000692 ***
## urban        0.183852   0.026955   6.821 1.63e-11 ***
## educ:black  -0.022624   0.020183  -1.121 0.262603    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3654 on 926 degrees of freedom
## Multiple R-squared:  0.2536, Adjusted R-squared:  0.2471 
## F-statistic: 39.32 on 8 and 926 DF,  p-value: < 2.2e-16
#the t-value of educ:black is -1.121 with a p-value of 0.263, which is larger than 0.1, so we do not reject H0 that the coefficient on black*educ is zero. The black*educ interaction is not significant, i.e. there is no evidence that the return to education depends on race.
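The extended model adds one term to equation (3): log(wage) = b0 + b1educ + ... + b7urban + b8 black*educ + u, and the hypothesis that the return to education does not depend on race is H0: b8 = 0. A minimal sketch of the two-sided t-test at a = 0.1, re-using the printed estimate and standard error (variable names are illustrative):

```r
b_inter  <- -0.022624  # coefficient on educ:black from summary(model5)
se_inter <- 0.020183   # its standard error
df_resid <- 926        # residual degrees of freedom

t_stat  <- b_inter / se_inter
t_crit  <- qt(1 - 0.1 / 2, df_resid)       # two-sided test, a = 0.1
p_value <- 2 * pt(-abs(t_stat), df_resid)

reject_h0 <- abs(t_stat) > t_crit
round(c(t = t_stat, critical = t_crit, p = p_value), 3)
reject_h0
```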

(i) Interpret the black-coefficient of this model and compare it to part (a).

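The black coefficient is now 0.0948 (standard error 0.255, not significant), compared with -0.188 in part (a). In this model it measures the wage difference between blacks and nonblacks at educ = 0, an education level far from the values seen in the sample (the rows printed above have 11 to 18 years), whereas in part (a) it measures the difference at any education level. The negative differential is now carried by the interaction term educ:black (-0.0226).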
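With the interaction in the model, the estimated race differential is a function of education, b5 + b8*educ, rather than a single number. The sketch below evaluates it at a few education levels picked purely for illustration; `gap_at` is a hypothetical helper name.

```r
b_black <- 0.094809   # coefficient on black in model5
b_inter <- -0.022624  # coefficient on educ:black in model5

# Estimated log-wage difference between blacks and nonblacks at a given education level.
gap_at <- function(educ) b_black + b_inter * educ

# Education levels chosen only for illustration.
round(sapply(c(0, 12, 16), gap_at), 3)
```

Evaluating the gap at a typical education level gives a negative differential of the same order as the -0.188 in part (a), which is why the coefficient at educ = 0 is not comparable with it.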

(ii) Modify your model by replacing the interaction term with black*(educ − c) where c is an appropriate value. Which value of c do you suggest? Run the regression and interpret the black-coefficient of this regression.

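#c = mean(educ): the black-coefficient then measures the wage difference between blacks and nonblacks at the average education level
WAGE2$educ_c <- WAGE2$educ - mean(WAGE2$educ)
model6 = lm(lwage ~ educ + exper + tenure + married + black + south + urban + black:educ_c, data = WAGE2)

summary(model6)

#married.black = WAGE2$black*WAGE2$married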