This time, we will learn how to analyze and predict house prices. To conduct this analysis and prediction, we will use one of the machine learning models, namely linear regression. We also aim to understand the relationship between house prices and other variables. The data used for this analysis comes from https://www.kaggle.com/greenwing1985/housepricing.
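The output below is a check for missing values in each column. The exact code is not shown, but it was presumably produced with a call along these lines (assuming the dataset has already been loaded into a data frame named house, the name used later in the analysis):

# count missing values in every column of the house dataset
colSums(is.na(house))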
## Area Garage FirePlace Baths White.Marble
## 0 0 0 0 0
## Black.Marble Indian.Marble Floors City Solar
## 0 0 0 0 0
## Electric Fiber Glass.Doors Swiming.Pool Garden
## 0 0 0 0 0
## Prices
## 0
Based on the data above, our house dataset does
not contain any missing values.
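The structure of the dataset, shown below, can be inspected with str():

# display the number of observations and the type of every column
str(house)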
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
From the data displayed above, we obtain the following information: the dataset consists of 500,000 observations of 16 variables.

Data Description: Area, Garage, FirePlace, Baths, White.Marble, Black.Marble, Indian.Marble, Floors, City, Solar, Electric, Fiber, Glass.Doors, Swiming.Pool, and Garden serve as predictor variables, while Prices is the target variable for house price prediction.
We will use Pearson correlation to measure the relationship between variables in the dataset. To do this, we can utilize the GGally package in R, which provides an enhanced visualization of correlations.
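A correlation plot of this kind can be produced with ggcorr() from GGally; the call below is a sketch, and the exact plot options used for the original figure are assumptions:

library(GGally)
# pairwise Pearson correlations for every variable, with the coefficients printed on the tiles
ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)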
From the correlation plot above, we can conclude that the variables with a strong correlation to house sale prices are White.Marble, Indian.Marble, Floors, City, Fiber, and Glass.Doors.
Before building the linear regression model, we need to split the dataset into training and testing sets. We will allocate 80% of the observations (400,000 rows) to the training set and the remaining 20% (100,000 rows) to the testing set.
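The splitting code is not shown; a minimal sketch of a random 80/20 split, assuming the data frame is named house, might look like this:

set.seed(100)  # hypothetical seed for reproducibility
# draw 80% of the row indices at random for training, keep the rest for testing
index <- sample(nrow(house), size = 0.8 * nrow(house))
data_train <- house[index, ]
data_test  <- house[-index, ]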
house_model <- lm(Prices ~ White.Marble + Indian.Marble + Floors + City + Fiber + Glass.Doors,
                  data = data_train)
house_model
##
## Call:
## lm(formula = Prices ~ White.Marble + Indian.Marble + Floors +
## City + Fiber + Glass.Doors, data = data_train)
##
## Coefficients:
## (Intercept) White.Marble Indian.Marble Floors City
## 18139 9024 -4997 14998 3492
## Fiber Glass.Doors
## 11752 4435
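The detailed coefficient table with standard errors and significance tests, shown next, comes from summary() on the fitted model:

# full model summary: residuals, coefficient tests, and R-squared
summary(house_model)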
##
## Call:
## lm(formula = Prices ~ White.Marble + Indian.Marble + Floors +
## City + Fiber + Glass.Doors, data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9367.0 -2152.0 -0.5 2148.0 9354.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18138.920 16.774 1081.4 <0.0000000000000002 ***
## White.Marble 9024.401 11.857 761.1 <0.0000000000000002 ***
## Indian.Marble -4997.352 11.857 -421.5 <0.0000000000000002 ***
## Floors 14998.245 9.678 1549.6 <0.0000000000000002 ***
## City 3491.729 5.929 589.0 <0.0000000000000002 ***
## Fiber 11752.405 9.679 1214.3 <0.0000000000000002 ***
## Glass.Doors 4434.556 9.679 458.2 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3061 on 399993 degrees of freedom
## Multiple R-squared: 0.9362, Adjusted R-squared: 0.9362
## F-statistic: 9.782e+05 on 6 and 399993 DF, p-value: < 0.00000000000000022
\[ \text{Prices} = 18138.920 + 9024.401 \times \text{White.Marble} - 4997.352 \times \text{Indian.Marble} + 14998.245 \times \text{Floors} + 3491.729 \times \text{City} + 11752.405 \times \text{Fiber} + 4434.556 \times \text{Glass.Doors} \]
The Adjusted R-squared value of the house_model is 0.9362, meaning that the model can explain 93.62% of the variance in the target variable (house price). The remaining 6.38% is influenced by other factors not included in the model.
Before evaluating the model’s performance, we need to predict house prices using the trained regression model. Then, we will use MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) from the caret package to assess the model’s accuracy on both training and testing data.
\[ MAE = \frac{\sum |\hat y - y|}{n} \]
\[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \]
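The prediction step is not shown explicitly; a sketch using predict() (the object names house_predict_train and house_predict_test are the ones used later when computing the errors) could look like this:

# predict house prices for the training and testing sets
house_predict_train <- predict(house_model, newdata = data_train)
house_predict_test  <- predict(house_model, newdata = data_test)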
Prediction of Model Performance on Training Data
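The two values below are presumably the MAE and RMSE on the training data, computed with caret:

library(caret)
# mean absolute error and root mean squared error on the training data
MAE(house_predict_train, data_train$Prices)
RMSE(house_predict_train, data_train$Prices)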
## [1] 2481.073
## [1] 3060.576
Interpretation:
MAE (Mean Absolute Error): 2481.073. This means that, on average, our model’s predictions on the training data deviate by 2481.073 (either positively or negatively) from the actual house prices.

RMSE (Root Mean Squared Error): 3060.576. On average, the model’s predictions deviate by 3060.576 from the actual prices, with the squared errors penalizing larger deviations more heavily.
Prediction of Model Performance on Testing Data
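Likewise, the values below are presumably the MAE and RMSE on the testing data:

# mean absolute error and root mean squared error on the testing data
MAE(house_predict_test, data_test$Prices)
RMSE(house_predict_test, data_test$Prices)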
## [1] 2482.874
## [1] 3062.738
Interpretation:
MAE (Mean Absolute Error): 2482.874. This means that, on average, our model’s predictions on the testing data deviate by 2482.874 (either positively or negatively) from the actual house prices.

RMSE (Root Mean Squared Error): 3062.738. On average, the model’s predictions deviate by 3062.738 from the actual prices, with the squared errors penalizing larger deviations more heavily.
error_model <- data.frame(data_train = c(MAE = MAE(house_predict_train, data_train$Prices),
                                         RMSE = RMSE(house_predict_train, data_train$Prices)),
                          data_test = c(MAE = MAE(house_predict_test, data_test$Prices),
                                        RMSE = RMSE(house_predict_test, data_test$Prices)))
error_model

We can observe that the model’s performance on both the training data and the testing data is similar, with no significant differences.
Thus, we can conclude that the model does not suffer from overfitting or underfitting. This indicates that the model generalizes well to new data and provides reliable predictions.
Linear regression has several assumptions that need to be met to ensure that the interpretation is unbiased. These assumptions are crucial if the goal of building the linear regression model is to interpret or analyze the effect of each predictor on the target variable.
However, if the purpose is only to make predictions, then meeting these assumptions is not mandatory. The model can still be useful even if some assumptions are violated, as long as its predictive performance remains strong.
Linearity means that the relationship between the target variable and each predictor follows a straight-line pattern.
The linear relationship between the target variable and predictors can also be observed through correlation between variables. If the correlation between a predictor and the target variable is strong (either positive or negative), it indicates a linear relationship.
##
## Call:
## lm(formula = Prices ~ White.Marble + Indian.Marble + Floors +
## City + Fiber + Glass.Doors, data = data_train)
##
## Coefficients:
## (Intercept) White.Marble Indian.Marble Floors City
## 18139 9024 -4997 14998 3492
## Fiber Glass.Doors
## 11752 4435
\[ \text{Prices} = 18138.920 + 9024.401 \times \text{White.Marble} - 4997.352 \times \text{Indian.Marble} + 14998.245 \times \text{Floors} + 3491.729 \times \text{City} + 11752.405 \times \text{Fiber} + 4434.556 \times \text{Glass.Doors} \]
library(dplyr)  # for %>% and select()
ggcorr(house %>%
         select(c(White.Marble, Indian.Marble, Floors, City, Fiber, Glass.Doors, Prices)),
       label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

Each of the selected predictors shows a strong correlation with Prices, which supports the linearity assumption.

Normality means that the residuals (errors) of the model should follow an approximately normal distribution. Based on a histogram of the residuals, we can conclude that the residuals are normally distributed.
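The residual histogram can be drawn directly from the fitted model; a minimal sketch:

# plot the distribution of the model residuals to check the normality assumption
hist(house_model$residuals, breaks = 50,
     main = "Distribution of Model Residuals", xlab = "Residual")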
Homoscedasticity means that the residuals (errors) have a constant variance across all levels of the predicted values. This ensures that the model’s predictions remain stable and reliable.
If the residuals show a pattern (such as a funnel shape, increasing/decreasing spread, or a clear trend), it indicates heteroscedasticity. This can lead to biased standard errors and unreliable significance tests for the coefficients.
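Homoscedasticity can be tested formally with the Breusch-Pagan test from the lmtest package; the output below presumably comes from a call like this:

library(lmtest)
# Breusch-Pagan test: H0 = the residuals have constant variance
bptest(house_model)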
##
## studentized Breusch-Pagan test
##
## data: house_model
## BP = 8.7898, df = 6, p-value = 0.1858
Since the p-value (0.1858) is greater than 0.05, we fail to reject the null hypothesis of constant variance; the residuals are homoscedastic.
Multicollinearity occurs when predictor variables in the model have a strong correlation with each other. This can cause instability in the regression model, making it difficult to determine the individual effect of each predictor on the target variable.
We can detect multicollinearity using the Variance Inflation Factor (VIF):
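VIF values can be obtained from the car package; the output below presumably comes from a call like this:

library(car)
# VIF close to 1 indicates no multicollinearity; values above 10 are problematic
vif(house_model)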
## White.Marble Indian.Marble Floors City Fiber
## 1.334860 1.334860 1.000002 1.000012 1.000013
## Glass.Doors
## 1.000010
Since all VIF values are close to 1 (well below the common threshold of 10), there is no indication of multicollinearity among the predictors.

The variables that significantly impact and help explain variation in house prices are White.Marble, Indian.Marble, Floors, City, Fiber, and Glass.Doors, with Prices as the target variable.
The R-squared value obtained from the model is quite high, indicating that the model can explain 93.62% of the variation in house prices, while the remaining 6.38% is influenced by other factors.
The model’s accuracy in predicting house prices is measured using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
On the training data (data_train), the model has: MAE = 2481.073 and RMSE = 3060.576
On the test data (data_test), the model has: MAE = 2482.874 and RMSE = 3062.738
Since the MAE and RMSE values for data_train and data_test are very similar, we can conclude that the model does not suffer from overfitting or underfitting.