The first Gauss-Markov assumption is that the relationship between x and y is linear. The second Gauss-Markov assumption is that there is no perfect multicollinearity between the column vectors; that is, the columns of X are linearly independent. The third assumption is that the disturbances average out to 0 for any value of X (the zero conditional mean assumption). The fourth assumption is homoskedasticity and no autocorrelation. The fifth assumption is that X may be fixed or random, but must be generated by a mechanism that is unrelated to \(\epsilon\). Lastly, we may assume that the disturbances are normally distributed, but this assumption is not required.
The first assumption of linearity means that the relationship between x and y is linear: the best-fit line for the data is a straight line. The second assumption, of full column rank, means that no regressor is an exact linear combination of the others; if it were, we could not identify the effect of changing one variable while holding the others constant, as the sketch below illustrates. An example of problematically correlated variables would be an indicator for whether a subject is under the influence of alcohol and an indicator for whether the subject is under the influence of any drug or substance, when looking at the rate of people being pulled over. The third assumption of zero conditional mean states that for any value chosen for X, the error term averages out to 0 and shows no pattern; this tells us that our model is correctly specified. The fourth assumption of homoskedasticity and no autocorrelation means that the error terms have the same spread at every value of the regressors, and that positive and negative errors occur at random rather than in a correlated pattern. The fifth assumption, about data generation, means that the data cannot be generated in relation to the error term; in practice, we should collect a random sample from the population. Lastly, it is not required, but in order to make hypothesis testing easier, we can assume the disturbances are normally distributed. This means their distribution is symmetric about the mean, and values near the mean occur more frequently.
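To see what a violation of full column rank looks like concretely, here is a minimal sketch using simulated indicator variables (the variable names are hypothetical, echoing the alcohol example above; this is not the house-price data):
set.seed(1)
drunk <- rbinom(100, 1, 0.5)            # under the influence of alcohol?
any_substance <- drunk                   # any drug/substance; identical by construction
X <- cbind(intercept = 1, drunk, any_substance)
qr(X)$rank                               # returns 2, not 3: full column rank fails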
The first assumption of linearity means that the relationship between x and y is linear. Therefore,
\[ y=X\beta + \epsilon \]
The second assumption, of full column rank, means there is no perfect multicollinearity among the column vectors: the columns of X are linearly independent (the identification condition). The third assumption of zero conditional mean states that for any value chosen for X, the error term averages out to 0 and shows no pattern. Therefore,
\[ E[\epsilon|X] = 0, \]
as this assumption states that the disturbances average out to 0 for any value of X. This assumption implies that \(E[y|X] = X\beta\). The fourth assumption of homoskedasticity and no autocorrelation means that the error terms have the same spread at every value of the regressors, and that positive and negative errors occur at random rather than in a correlated pattern. Therefore,
\[ E[\epsilon\epsilon'|X] = \sigma^2 I, \]
as the assumption of homoskedasticity states that the variance of \(\epsilon_i\) is the same for all i, and the assumption of no autocorrelation means that \(cov(\epsilon_i,\epsilon_j|X)=0\) for all \(i \neq j\). The fifth assumption, about data generation, means that the data cannot be generated in relation to the error term; in practice, we should collect a random sample from the population. Lastly, it is not required, but in order to make hypothesis testing easier, we can assume the disturbances are normally distributed, meaning their distribution is symmetric about the mean and values near the mean occur more frequently.
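As a sanity check on what these assumptions mean in practice, here is a minimal simulation sketch (all names and parameter values are made up for illustration) of a data-generating process that satisfies them:
set.seed(42)
n <- 1000
x <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = 2)   # mean-zero, homoskedastic, uncorrelated errors
y <- 3 + 1.5 * x + eps              # true beta0 = 3, beta1 = 1.5
coef(lm(y ~ x))                     # OLS estimates land close to (3, 1.5)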
The data set that I used is a cross-sectional data frame consisting of 546 observations on 12 variables describing the sales prices of houses sold in Windsor, Canada, during July, August, and September of 1987.
rm(list=ls())
library(AER)
library(rsconnect)
library(knitr)
data("HousePrices")
str(HousePrices)
## 'data.frame': 546 obs. of 12 variables:
## $ price : num 42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
## $ lotsize : num 5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
## $ bedrooms : num 3 2 3 3 2 3 3 3 3 3 ...
## $ bathrooms : num 1 1 1 1 1 1 2 1 1 2 ...
## $ stories : num 2 1 1 2 1 1 2 3 1 4 ...
## $ driveway : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ recreation: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 2 2 ...
## $ fullbase : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 2 1 2 1 ...
## $ gasheat : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ aircon : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 2 ...
## $ garage : num 1 0 0 0 0 0 2 0 0 1 ...
## $ prefer : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
price: sales price of the house
lotsize: lot size of the property in square feet
bedrooms: number of bedrooms
bathrooms: number of bathrooms
stories: number of stories excluding the basement
driveway: does the house have a driveway?
recreation: does the house have a rec room?
fullbase: is the basement fully finished?
gasheat: does the house use gas for hot water heating?
aircon: is there air conditioning?
garage: number of garage places
prefer: is the house located in a preferred neighborhood of the city?
\[ Price_i = \beta_0 + \beta_1 \, lotsize_i + \epsilon_i \]
# Run the simple regression of price on lotsize and store it in my_reg
my_reg <- lm(price ~ lotsize, data = HousePrices)
# Standard deviation of lotsize, used below to gauge economic magnitude
sd(HousePrices$lotsize)
## [1] 2168.159
summary(my_reg)
##
## Call:
## lm(formula = price ~ lotsize, data = HousePrices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69551 -14626 -2858 9752 106901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.414e+04 2.491e+03 13.7 <2e-16 ***
## lotsize 6.599e+00 4.458e-01 14.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22570 on 544 degrees of freedom
## Multiple R-squared: 0.2871, Adjusted R-squared: 0.2858
## F-statistic: 219.1 on 1 and 544 DF, p-value: < 2.2e-16
# Effect of a one-standard-deviation increase in lotsize, in dollars
sd(HousePrices$lotsize) * 6.599
## [1] 14307.68
Based on the summary of the simple linear regression that I ran, the estimated equation is:
\[ \widehat{Price}_i = 34140 + 6.599 \, lotsize_i \]
Based on the resulting equation, we can interpret that the predicted price of a house with a lot size of 0 square feet is $34,140, and that each additional square foot of lot size increases the predicted price by $6.599.
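To make this interpretation concrete, the fitted equation can be evaluated directly (a quick sketch; the 5,000 square foot lot size is an arbitrary example value):
# Predicted price for a hypothetical 5,000 sq ft lot:
# roughly 34140 + 6.599 * 5000 = $67,135
predict(my_reg, newdata = data.frame(lotsize = 5000))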
Based on the regression output, both the intercept and the lotsize coefficient are statistically significant at any conventional significance level (p < 2e-16).
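One way to back this up, beyond the p-values in the summary, is with confidence intervals (a supplementary check, not part of the original output):
# 95% confidence intervals for the coefficients; neither contains zero
confint(my_reg, level = 0.95)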
Given that the standard deviation of lotsize is 2,168.159 square feet and the lotsize coefficient is estimated at $6.599 per square foot, a one-standard-deviation increase in lot size is associated with an increase in price of about $14,307.68. The economic magnitude of the impact of lotsize on price is therefore meaningful.
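The same calculation can be done without hard-coding the estimate, by pulling the coefficient from the stored model object (a minor robustness sketch):
# One-standard-deviation effect of lotsize, computed from the stored fit
coef(my_reg)["lotsize"] * sd(HousePrices$lotsize)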
# Produce the four standard diagnostic plots for the fitted model
plot(my_reg)
Based on the residuals vs. fitted values plot, we can judge whether the type of model used is appropriate. In this case we are using a level-level model, and the red best-fit line through the residuals drifts downward from residual = 0 once fitted values reach roughly 80,000-90,000. This could pose a problem, as it reflects an issue of heteroskedasticity.
Based on the Normal Q-Q plot, we can see whether the residuals are normally distributed by comparing them against the quantiles of an actual normal distribution. In our plot, the point density is greatest around (0, 0), but the tails depart markedly from the reference line, with several standardized residuals well beyond 2 and below -2; R even labels the most extreme outliers (e.g., observations 378 and 419). We are overpredicting price on the lower end of lotsize and underpredicting price on the higher end, and the more apparent outliers are on the higher end of lotsize, which is consistent with the residuals vs. fitted values plot.
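The visual impression of non-normal residuals can also be checked formally; a Shapiro-Wilk test is one option (a supplementary check, not part of the original analysis):
# Shapiro-Wilk test of the residuals; a small p-value rejects normality
shapiro.test(residuals(my_reg))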
This is also known as the scale-location (or spread-location) plot. It shows whether the residuals are spread equally across our predictions, as a check for homoskedasticity. We want an even dispersion of points across the plot with no pattern in the residuals, and we want the red line to be relatively horizontal. In the plot that was created, the points are bunched toward the left of the fitted-value range, and on the right side of the plot the points sit farther from the red line on average.
This plot (residuals vs. leverage) helps to show influential data points that have a large effect on the linear model. We are looking to see whether any point lies outside the dotted Cook's distance contours in the top-right or bottom-right corners of the plot. In this case, we do not have any values outside the Cook's distance contours.
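As a numeric complement to the plot, the Cook's distances can be inspected directly (the 4/n cutoff used here is a common rule of thumb, not a hard threshold):
# Flag observations whose Cook's distance exceeds the 4/n rule of thumb
cooks_d <- cooks.distance(my_reg)
which(cooks_d > 4 / nrow(HousePrices))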
Based on the plots that were created, we can see that there is an issue with heteroskedasticity, as the residuals vs. fitted values plot has a red line that starts horizontal and ends with a downward trend. The scale-location plot points the same way, with the dispersion of points along the fitted values skewed to the left. Finally, there is an issue with non-normality of the residuals: the Normal Q-Q plot shows that we are over-predicting price on the lower end of lotsize and under-predicting it on the higher end, so the residuals are not normally distributed.
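Since the lmtest package is already loaded via AER, the visual evidence of heteroskedasticity can be confirmed with a Breusch-Pagan test (a supplementary check; a small p-value rejects homoskedasticity):
# Breusch-Pagan test: the null hypothesis is homoskedastic errors
bptest(my_reg)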
# Check a log-level model
my_reg2 <- lm(log(price) ~ lotsize, data = HousePrices)
# Check a log-log model
my_reg3 <- lm(log(price) ~ log(lotsize), data = HousePrices)
# Check a sqrt-sqrt model
my_reg4 <- lm(sqrt(price) ~ sqrt(lotsize), data = HousePrices)
# Check a sqrt-level model
my_reg5 <- lm(sqrt(price) ~ lotsize, data = HousePrices)
# Check a level-sqrt model
my_reg6 <- lm(price ~ sqrt(lotsize), data = HousePrices)
# Check a sqrt-log model
my_reg7 <- lm(sqrt(price) ~ log(lotsize), data = HousePrices)
# Check a log-sqrt model
my_reg8 <- lm(log(price) ~ sqrt(lotsize), data = HousePrices)
plot(my_reg2)
plot(my_reg3)
plot(my_reg4)
plot(my_reg5)
plot(my_reg6)
plot(my_reg7)
plot(my_reg8)
Based on the observed plots from the transformed models, we can see that the my_reg3 model produces the plots most consistent with the Gauss-Markov assumptions and the best-fit model. The problems associated with heteroskedasticity and non-normality are less apparent in this case, therefore this model is best set up as a log-log model.
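The same formal checks used on the level-level model can be repeated on the log-log model to see whether the problems actually shrink (a sketch; the test results should be judged alongside the plots):
# Repeat the heteroskedasticity and normality checks on the log-log model
bptest(my_reg3)
shapiro.test(residuals(my_reg3))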