The Gauss-Markov assumptions are a set of conditions under which the linear regression model is valid and the estimated coefficients are unbiased.
The set consists of six assumptions (linearity, full column rank, the zero conditional mean assumption, homoscedasticity, nonautocorrelation, and exogeneity), discussed here together with the central limit theorem.
1. Simple Explanation
Linearity. The relationship between the dependent variable and the independent variables is linear, which allows the dependent variable to be predicted by linear regression.
Full column rank. No independent variable is an exact linear combination of the others, so the independent variables cannot be perfectly substituted for each other.
The zero conditional mean assumption. The errors average out to zero for any value of the independent variables, which ensures the error term does not bias the estimates.
Homoscedasticity. The spread of the errors between predicted and real values is the same across all observations, which assures the reliability of the predictions.
Nonautocorrelation. The errors of different observations are independent: the error in one row does not affect the error in another.
Exogeneity. The independent variables are determined outside the model and are not affected by the error term.
The Central Limit Theorem. The data are a random sample; with enough observations, averages computed from them behave in a predictable, approximately normal way.
2. Explanation with technical background
Linearity. This assumption states that there is a linear relationship between y and X.
Full column rank. X is an n × k matrix of full column rank: all columns of X are linearly independent.
The zero conditional mean assumption.
\[ E(\epsilon \mid X) = 0 \]
The disturbances average out to zero for any value of X. The assumption implies that the expected value of y depends only on β and X.
Homoscedasticity and Nonautocorrelation.
Assume ϵ represents the error term, and ϵ_t and ϵ_s represent the error terms at times t and s.
Homoscedasticity implies:
\[ \operatorname{Var}(\epsilon_t \mid X) = \sigma^2 \]
Nonautocorrelation implies:
\[ \operatorname{Cov}(\epsilon_t, \epsilon_s \mid X) = 0, \quad t \neq s \]
Exogeneity. X may be fixed or random, but must be generated by a mechanism that is unrelated to ϵ.
The Central Limit Theorem. The data are randomly sampled, and as the sample size n increases, the distribution of the standardized sample mean converges to a standard normal distribution.
This regression selects five explanatory variables from the Boston dataset, which contains 506 rows; the five explanatory columns contribute 2,530 observations in total. The dataset is used to analyze the median value of house prices in Boston.
The dependent variable is medv, the median value of owner-occupied homes in $1000s.
The independent variables:
1. rm. Average number of rooms per dwelling.
2. ptratio. Pupil-teacher ratio by town, in %.
3. lstat. Lower status of the population, in %.
4. dis. Weighted mean of distances to five Boston employment centres, in miles.
5. black. Proportion of Black residents in a community, per 1,000 residents.
library(MASS)
data(Boston)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# keep only the variables used in the analysis
newbos <- Boston %>% select(rm, ptratio, lstat, dis, black, medv)
\[ medv_i = \beta_0 + \beta_1 rm_i + \beta_2 ptratio_i + \beta_3 lstat_i + \beta_4 dis_i + \beta_5 black_i + \epsilon_i \]
library(psych)
library(knitr)
summary_stats <- describe(newbos)
kable(summary_stats, format = "markdown", digits = 2)
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rm | 1 | 506 | 6.28 | 0.70 | 6.21 | 6.25 | 0.51 | 3.56 | 8.78 | 5.22 | 0.40 | 1.84 | 0.03 |
| ptratio | 2 | 506 | 18.46 | 2.16 | 19.05 | 18.66 | 1.70 | 12.60 | 22.00 | 9.40 | -0.80 | -0.30 | 0.10 |
| lstat | 3 | 506 | 12.65 | 7.14 | 11.36 | 11.90 | 7.11 | 1.73 | 37.97 | 36.24 | 0.90 | 0.46 | 0.32 |
| dis | 4 | 506 | 3.80 | 2.11 | 3.21 | 3.54 | 1.91 | 1.13 | 12.13 | 11.00 | 1.01 | 0.46 | 0.09 |
| black | 5 | 506 | 356.67 | 91.29 | 391.44 | 383.17 | 8.09 | 0.32 | 396.90 | 396.58 | -2.87 | 7.10 | 4.06 |
| medv | 6 | 506 | 22.53 | 9.20 | 21.20 | 21.56 | 5.93 | 5.00 | 50.00 | 45.00 | 1.10 | 1.45 | 0.41 |
reg_bos <- lm(data=newbos,
formula = medv ~ rm + ptratio + lstat + dis + black
)
summary(reg_bos)
##
## Call:
## lm(formula = medv ~ rm + ptratio + lstat + dis + black, data = newbos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.7200 -2.7471 -0.7381 1.7482 27.5110
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.70367 4.28365 4.133 4.20e-05 ***
## rm 4.45375 0.41931 10.622 < 2e-16 ***
## ptratio -0.94234 0.11414 -8.256 1.35e-15 ***
## lstat -0.60842 0.04766 -12.766 < 2e-16 ***
## dis -0.61518 0.12545 -4.904 1.27e-06 ***
## black 0.01195 0.00269 4.444 1.09e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.045 on 500 degrees of freedom
## Multiple R-squared: 0.7021, Adjusted R-squared: 0.6991
## F-statistic: 235.7 on 5 and 500 DF, p-value: < 2.2e-16
Analysis of the regression results:
Beta0 represents the expected value of the dependent variable when all independent variables equal zero, which is 17.70367 thousand dollars.
Beta1 represents that if the average number of rooms per dwelling increases by 1, the median value of owner-occupied homes is predicted to increase by 4.45375 thousand dollars.
Beta2 represents that if the pupil-teacher ratio increases by 1 percentage point, the median value of owner-occupied homes is predicted to decrease by 0.94234 thousand dollars.
Beta3 represents that if the lower status of the population increases by 1 percentage point, the median value of owner-occupied homes is predicted to decrease by 0.60842 thousand dollars.
Beta4 represents that if the weighted mean of distances to five Boston employment centres increases by 1 mile, the median value of owner-occupied homes is predicted to decrease by 0.61518 thousand dollars.
Beta5 represents that if the number of Black residents per 1,000 increases by 1, the median value of owner-occupied homes is predicted to increase by 0.01195 thousand dollars.
According to the t-values and p-values, the effects of all independent variables on the dependent variable are statistically different from zero at the 1% significance level.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
vif_value <- vif(reg_bos)
print(vif_value)
## rm ptratio lstat dis black
## 1.722131 1.211453 2.298322 1.384551 1.196475
Based on the results of the test, all VIF values are less than 5, so there is no serious multicollinearity problem.
plot(reg_bos)
Residuals vs. Fitted. This plot displays the relationship between the fitted values and the residuals. Its purpose is to check whether the residuals are randomly distributed; ideally, they scatter randomly around 0.
Q-Q Residuals. This plot shows whether the residuals of the model are normally distributed. If the points follow the diagonal line, the residuals are approximately normal.
Scale-Location. This plot shows whether the residuals are homoscedastic across the fitted values. If the points are randomly scattered around a horizontal line, the variance of the residuals is constant.
Residuals vs. Leverage. This plot identifies data points that have a large influence on the model fit. Ideally, the residuals scatter randomly around zero and no point has an unusually high leverage value.
In the Q-Q Residuals plot, the points in the tails fall mostly off the diagonal line, indicating that there are some extreme values and that the residuals do not quite follow a normal distribution. In the Residuals vs. Fitted and Scale-Location plots, the points are not distributed around a single horizontal line, which suggests the dataset may exhibit heteroskedasticity. The regression therefore has problems with both normality and heteroskedasticity, so the next part changes the variables in the regression model.
\[ medv_i = \beta_0 + \beta_1 rm_i + \beta_2 rm^2_i + \beta_3 ptratio_i + \beta_4 lstat_i + \beta_5 dis_i + \beta_6 black_i + \epsilon_i \]
newbos$rm_2 <- newbos$rm^2
reg_bos2 <- lm(data=newbos,
formula = medv ~ rm + rm_2 + ptratio + lstat + dis + black
)
summary(reg_bos2)
##
## Call:
## lm(formula = medv ~ rm + rm_2 + ptratio + lstat + dis + black,
## data = newbos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.6624 -2.3629 -0.3736 1.6547 27.7642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120.674057 9.352774 12.902 < 2e-16 ***
## rm -29.102197 2.812950 -10.346 < 2e-16 ***
## rm_2 2.602154 0.216245 12.033 < 2e-16 ***
## ptratio -0.738405 0.102003 -7.239 1.72e-12 ***
## lstat -0.641462 0.042092 -15.240 < 2e-16 ***
## dis -0.370613 0.112409 -3.297 0.00105 **
## black 0.010788 0.002372 4.547 6.84e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.446 on 499 degrees of freedom
## Multiple R-squared: 0.7691, Adjusted R-squared: 0.7663
## F-statistic: 277 on 6 and 499 DF, p-value: < 2.2e-16
plot(reg_bos2)
With the addition of the variable rm_2, the points in the Residuals vs. Fitted, Scale-Location, and Residuals vs. Leverage plots lie closer to the horizontal line, and the points in the Q-Q Residuals plot are closer to the diagonal. The problems with normality and heteroscedasticity are improved.
The value of R-squared increases from 0.7021 to 0.7691, indicating a better fit of the model.