Why Gauss-Markov assumptions?
Gauss-Markov Assumptions
Linearity:
Explanation: A linear relationship must exist between the independent variables (X) and the dependent variable (Y) so that a straight line can capture the trend in the data.
Logic behind: If there is no linear relationship between our independent variables and the dependent variable, it is difficult to fit a regression line to the data. Outliers can also make it harder to draw such a line, which is why they often need to be excluded from the data; a quick visual check is sketched below.
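As a quick illustration (simulated data, not the housing data used later), a scatter plot is often enough to eyeball whether a straight-line fit is plausible:
# Linearity check sketch (simulated data)
set.seed(5)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, sd = 2)
plot(x, y)                        # roughly straight-line cloud -> linearity plausible
abline(lm(y ~ x), col = "red")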
No Multicollinearity (Full column rank):
Explanation: The variables (Xs) that we use to predict the dependent variable (Y) are not perfectly correlated with one another. Another way of saying this is that the Xs are linearly independent of each other.
Logic behind: If one independent variable is an exact linear combination of another, then when we represent the data in a matrix we have linear dependence, and this creates a problem when we try to perform regression: \(X^{T}X\) is singular, so there is no unique solution for our coefficients. A minimal sketch of this failure is shown below.
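A minimal sketch (simulated data) of what goes wrong under perfect multicollinearity: when one column of X is an exact multiple of another, \(X^{T}X\) is singular and lm() cannot estimate a unique coefficient for the redundant predictor.
# Perfect multicollinearity sketch (simulated data)
set.seed(1)
x1 <- rnorm(100)
x2 <- 2 * x1                      # x2 is an exact linear combination of x1
X  <- cbind(1, x1, x2)
det(t(X) %*% X)                   # (numerically) zero -> X^T X is singular
# solve(t(X) %*% X)               # would fail: the system is singular
y  <- 1 + 3 * x1 + rnorm(100)
coef(lm(y ~ x1 + x2))             # lm() returns NA for the redundant predictor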
Zero Conditional Mean:
Explanation: The error term has zero mean conditional on the predictors; in other words, our independent variables (Xs) carry no information about the error term.
Logic behind: This assumption implies that
\[E(Y \mid X) = X\beta,\] which is what allows the regression to capture the true relationship between X and Y. The sketch below shows how omitting a relevant factor violates this assumption.
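A small simulated example of a violation: if a factor that affects Y and is correlated with an included X is left out of the model, it ends up in the error term and biases the estimated coefficient.
# Zero-conditional-mean violation sketch (simulated data)
set.seed(2)
x <- rnorm(500)
z <- 0.8 * x + rnorm(500)         # omitted factor correlated with x
y <- 1 + 2 * x + 3 * z + rnorm(500)
coef(lm(y ~ x + z))["x"]          # close to the true slope of 2
coef(lm(y ~ x))["x"]              # biased upward because z hides in the error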
Homoscedasticity:
Explanation: Every error term has the same variance around the regression line, regardless of the value of X.
Logic behind: The job of regression is to estimate a curve that passes as close as possible to as many of the data points as possible. If some observations are much more widely dispersed than others, the largest variances will dominate the fitted trend. Homoscedasticity means the variance of every error term is the same, which is what allows OLS to be BLUE. Without it, we still have a linear, unbiased estimator, but it will not be BLUE. A simulated comparison is sketched below.
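A simulated comparison of homoscedastic and heteroscedastic errors (not the housing data): in the second panel the residuals fan out as the fitted values grow.
# Homoscedasticity vs. heteroscedasticity sketch (simulated data)
set.seed(3)
x       <- runif(200, 1, 10)
y_hom   <- 2 + 3 * x + rnorm(200, sd = 2)       # constant error variance
y_het   <- 2 + 3 * x + rnorm(200, sd = 2 * x)   # variance grows with x
fit_hom <- lm(y_hom ~ x)
fit_het <- lm(y_het ~ x)
par(mfrow = c(1, 2))
plot(fitted(fit_hom), residuals(fit_hom), main = "Homoscedastic")
plot(fitted(fit_het), residuals(fit_het), main = "Heteroscedastic")
par(mfrow = c(1, 1))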
Non-Autocorrelation:
Explanation: Each error term is independent of every other error term.
Logic behind: In the presence of autocorrelation, OLS no longer has the smallest variance among all linear unbiased estimators, which leads to wrong standard errors for the regression coefficient estimates; see the sketch below.
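A minimal sketch of detecting autocorrelation with the Durbin-Watson test (simulated time-series data; assumes the lmtest package is installed).
# Autocorrelation sketch (simulated data, requires lmtest)
library(lmtest)
set.seed(4)
x <- 1:200
e <- as.numeric(arima.sim(list(ar = 0.8), n = 200))  # AR(1) errors
y <- 5 + 0.5 * x + e
dwtest(lm(y ~ x))                 # DW statistic well below 2 -> positive autocorrelation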
Normal Distribution:
Explanation: The error terms are assumed to follow a normal distribution.
Logic behind: Normality is not needed for OLS to be BLUE, but it is what justifies the usual t-tests and confidence intervals, and it is what the Normal Q-Q diagnostic later in this write-up examines.
Data Description
This is a real data set of prices for houses sold in Seattle, Washington, USA between August and December 2022.
Variables included:
beds = Number of beds
baths = Number of bathrooms, (note 0.5 corresponds to a half-bath which has a sink and toilet but no tub or shower)
size = Total floor area of property in square feet
lot_size = Total area of the land the property is located on. The lot belongs to the house owner
zip_code = Zip code
price = Price the property was sold for in US dollars
# Import Data
house <- read.csv("/Users/pin.lyu/Desktop/BC_Class_Folder/Econometrics/DIS_&_ASSIGNMENT/DIS_4/Housing/train.csv")
Objective
Regression function
\[ Price = \beta_0 + \beta_1 Beds + \beta_2 Baths + \beta_3 Size + \beta_4 LotSize + \epsilon \]
This function uses four predictors of price; the fourth term, LotSize, corresponds to the lot_size variable in the data.
Data cleaning
# Checking N/A in data set
colSums(is.na(house))
## beds baths size size_units lot_size
## 0 0 0 0 347
## lot_size_units zip_code price
## 0 0 0
# Eliminate rows that have N/A
house2 <- na.omit(house)
colSums(is.na(house2))
## beds baths size size_units lot_size
## 0 0 0 0 0
## lot_size_units zip_code price
## 0 0 0
# Outlier check
boxplot(house2$price,
ylab = "price"
)
summary(house2$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 159000 680000 865000 1038475 1175000 25000000
Now it looks like we have outliers that are affecting our data, so next we are going to eliminate the outliers in price.
# Outlier elimination
quartiles <- quantile(house2$price, probs=c(.25, .75), na.rm = FALSE)
IQR <- IQR(house2$price)
Lower <- quartiles[1] - 1.5*IQR
Upper <- quartiles[2] + 1.5*IQR
data_no_outlier <- subset(house2, house2$price > Lower & house2$price < Upper)
dim(data_no_outlier)
## [1] 1565 8
Outliers are now excluded; now let's see if it worked.
# New boxplot without outliers
boxplot(data_no_outlier$price,
ylab = "price"
)
my_reg <- lm( price ~ beds + baths + size + lot_size,
data = house2
)
summary(my_reg)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 292277.95998 69475.131333 4.2069436 2.726411e-05
## beds -125889.21034 29279.047806 -4.2996347 1.810042e-05
## baths 71176.85676 30474.626101 2.3356105 1.962983e-02
## size 510.01587 38.657235 13.1932837 7.119295e-38
## lot_size 2.69381 8.966492 0.3004308 7.638861e-01
Interpretation:
This linear regression model is level-level, so each coefficient is interpreted as the change in price, in dollars, for a one-unit change in the predictor.
Intercept term: the predicted price of a house when all of the predictors (Xs) equal zero.
Beds: As the number of beds in a house increases by 1, the predicted price decreases by about $125,889, holding the other predictors constant.
Baths: As the number of bathrooms increases by 1, the predicted price increases by about $71,177.
Size: As the floor area increases by 1 square foot, the predicted price increases by about $510.
Lot_size: As the lot area increases by 1 square foot, the predicted price increases by about $2.69, although this coefficient is not statistically significant (p ≈ 0.76).
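To make the level-level interpretation concrete, here is a quick sketch using predict() on a hypothetical house; the specific values below are invented purely for illustration.
# Predicted price for a hypothetical house (illustrative values only)
new_house <- data.frame(beds = 3, baths = 2, size = 2000, lot_size = 5000)
predict(my_reg, newdata = new_house)
# Each coefficient moves this prediction in dollars per one-unit change,
# e.g. one extra bathroom adds roughly $71,177 to the predicted price.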
plot(my_reg)
Residuals vs Fitted:
Intuition: This graph shows the relationship between the residuals and the fitted (predicted) values. The dashed horizontal line marks the ideal condition in which the mean of the residuals equals zero. The red line is a smoothed trend through the residuals, and the black dots around the dashed line are the individual residuals, showing how far each one lies from the zero-mean line. Ideally the residuals should average to zero everywhere; in reality, outliers or the omission of a factor that influences the dependent variable (Y) can cause the red trend line to deviate from the zero line. The general idea is that the closer the trend line stays to the zero-mean line, the better our linear model is.
Our graph: Roughly two thirds of the trend line lies more or less on the zero-mean line, suggesting that for the most part the regression is doing a good job predicting the market price of these houses. However, as the fitted price increases, the trend line begins to deviate. This suggests either that there are factors missing from our regression or that the deviation is driven by outliers. Overall, though, I would say the model does a reasonably good job, because the deviation at the end of the trend line is caused by a single data point.
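For reference, this panel can be reproduced on its own with the which argument of plot(), and we can confirm that the residuals average out to essentially zero for the fitted sample.
# Residuals vs Fitted panel only
plot(my_reg, which = 1)
mean(residuals(my_reg))           # essentially zero by construction of OLS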
Normal Q-Q:
Intuition: This kind of graph lets us see the distribution of the residuals. Ideally the standardized residuals follow the dashed reference line, indicating they are approximately normally distributed; under normality, roughly 95% of them should fall between -2 and 2.
Our graph: Most of our standardized residuals lie between -2 and 2 and follow the reference line, which suggests they are roughly normally distributed. At the two tails of the distribution, however, some points rise above the dashed line and fall outside the -2 to 2 interval. This indicates that the data set likely contains extreme observations that can impair the quality of the linear regression model. R automatically labels some of these extreme points, such as observations 1701, 1360, and 638.
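A quick way to put a number on the "95% within -2 and 2" rule of thumb is to count the standardized residuals that fall outside that band:
# Normal Q-Q panel only, plus share of standardized residuals outside [-2, 2]
plot(my_reg, which = 2)
std_res <- rstandard(my_reg)
mean(abs(std_res) > 2)            # roughly 0.05 expected under normality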
Scale-Location plot:
Intuition: This plot shows whether the residuals are spread equally across the range of fitted values, which is how we check homoscedasticity. To support homoscedasticity, the points should be scattered fairly evenly across the graph with no obvious pattern, and the red line should be roughly horizontal through the middle of the plot.
Our graph: The plot suggests a violation of homoscedasticity: the square-rooted standardized residuals concentrate in one corner of the graph, with a handful of points scattered at the other end. In other words, one of the Gauss-Markov assumptions, homoscedasticity of the errors, appears to be violated. A formal check is sketched below.
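A formal complement to the visual check is the Breusch-Pagan test (assumes the lmtest package is installed); a small p-value indicates heteroscedasticity.
# Breusch-Pagan test for heteroscedasticity (requires lmtest)
library(lmtest)
bptest(my_reg)                    # small p-value -> reject constant error variance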
Residuals vs Leverage plot:
Intuition: This plot helps us find influential data points that have a large effect on the fitted model. Some observations may lie far from the majority of the data; if such points, so-called outliers, are included in the model, the fitted trend can be pulled toward them. On this graph, the x-axis shows leverage and the y-axis shows standardized residuals. Leverage measures how much influence each data point has on the fit. Rather than looking for a pattern, we should look for points with high leverage, and especially for points that lie outside Cook's distance (the dashed lines).
Our graph: There are about five points that are within Cook's distance but have high leverage values, which is fine. However, two data points, 1701 and 1360, lie on the dashed line. These are outliers that we should consider eliminating to build a better linear model.
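To identify those influential observations programmatically, one common rule-of-thumb approach is to flag points whose Cook's distance exceeds 4/n; the choice of cutoff is a judgment call, not a fixed rule.
# Influential observations by Cook's distance
cd <- cooks.distance(my_reg)
n  <- nrow(model.frame(my_reg))
which(cd > 4 / n)                 # candidate influential points
sort(cd, decreasing = TRUE)[1:5]  # the five most influential observations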