The purpose of this regression and exploratory data analysis is to gain insight into how housing prices relate to the other attributes of a house. I am studying the "Housing Prices" dataset from Kaggle. The main focus is on the factors that affect price: the dataset contains 500,000 houses with their prices, and the remaining columns are treated as candidate factors that might affect those prices.
house <- read.csv("data_input/HousePrices_HalfMil.csv")
str(house)
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
Area: Land area of the building
Garage: Garage Type
FirePlace: Type of fireplace in the house
Baths: Bathroom Type
White.Marble: Type of white marble in the house
Black.Marble: Type of black marble in the house
Indian.Marble: Type of Indian marble in the house
Floors: Type of floor
City: City of home location
Solar: Whether the house has solar power (yes / no)
Electric: Whether the house has electricity (yes / no)
Fiber: Whether the house has a fiber roof (yes / no)
Glass.Doors: Whether the doors are made of glass (yes / no)
Swiming.Pool: Whether there is a swimming pool (yes / no)
Garden: Whether there is a garden (yes / no)
Prices: House prices
# check for missing values
anyNA(house)
## [1] FALSE
Here I check whether any column contains missing entries; anyNA() returns FALSE, so there are none. This check is important because missing values reduce the information in the dataset, which can lead to weak and biased analyses.
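A per-column count makes the check above more specific. The sketch below uses a small hypothetical data frame, `toy`, with made-up values (it is not the Kaggle data) to show how `colSums(is.na(...))` reports the number of missing entries in each column. The same call could presumably be applied to `house`.

```r
# Hypothetical toy data standing in for `house`; the values are made up.
toy <- data.frame(
  Area   = c(164, 84, NA, 75),
  Floors = c(0, 1, 0, NA),
  Prices = c(43800, 37550, 49500, 50075)
)

# Number of missing entries in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE if any value in the whole data frame is missing
# (this is the single answer that anyNA() gives)
print(anyNA(toy))
```

When `anyNA()` returns TRUE, the per-column counts show where the gaps are, which helps decide between dropping rows and imputing values.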
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
In the correlation plot, several variables are positively correlated with price.
Among the predictors, Floors has the highest positive correlation with Prices, at about 0.6, so it is the variable most strongly associated with the price of a house.
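The ranking read off the plot can also be computed directly. The following is a minimal sketch on hypothetical data: the data frame `toy` and its pricing rule are invented for illustration and are not the Kaggle data. It computes the correlation of each predictor with `Prices` and sorts the result.

```r
# Hypothetical toy data; in the analysis this would be the numeric `house` data frame.
set.seed(1)
n <- 200
toy <- data.frame(
  Area   = sample(50:250, n, replace = TRUE),
  Floors = sample(0:1, n, replace = TRUE),
  Garden = sample(0:1, n, replace = TRUE)
)
# Made-up price rule, used only so the example has something to correlate
toy$Prices <- 25 * toy$Area + 15000 * toy$Floors + rnorm(n, sd = 2000)

# Correlation of every predictor with Prices (a named numeric vector)
cors <- cor(toy[, names(toy) != "Prices"], toy$Prices)[, 1]

# Strongest positive correlation first
cors_sorted <- sort(cors, decreasing = TRUE)
print(round(cors_sorted, 2))
```

This has to run while the columns are still numeric, because `cor()` does not accept factors.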
At this point, I build a regression model between Prices and Floors, because this pair has the highest correlation. Some variables are categorical in nature but are still stored as numeric, so they must first be converted to factors.
names(house)[2:15]
## [1] "Garage" "FirePlace" "Baths" "White.Marble"
## [5] "Black.Marble" "Indian.Marble" "Floors" "City"
## [9] "Solar" "Electric" "Fiber" "Glass.Doors"
## [13] "Swiming.Pool" "Garden"
house[names(house)[2:15]] <- lapply(house[names(house)[2:15]], as.factor)
str(house)
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : Factor w/ 3 levels "1","2","3": 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : Factor w/ 5 levels "0","1","2","3",..: 1 1 5 5 5 4 1 2 1 3 ...
## $ Baths : Factor w/ 5 levels "1","2","3","4",..: 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 2 1 1 ...
## $ Black.Marble : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 1 ...
## $ Indian.Marble: Factor w/ 2 levels "0","1": 1 2 1 2 1 1 2 1 2 2 ...
## $ Floors : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 2 2 1 ...
## $ City : Factor w/ 3 levels "1","2","3": 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 2 ...
## $ Electric : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 2 1 1 ...
## $ Fiber : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 1 1 1 ...
## $ Glass.Doors : Factor w/ 2 levels "0","1": 2 2 1 2 2 2 2 2 1 1 ...
## $ Swiming.Pool : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 2 2 2 ...
## $ Garden : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 2 1 1 1 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
mod1 <- lm(Prices ~ Floors, data = house)
mod1
##
## Call:
## lm(formula = Prices ~ Floors, data = house)
##
## Coefficients:
## (Intercept) Floors1
## 34558 15003
summary(mod1)
##
## Call:
## lm(formula = Prices ~ Floors, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27211 -6986 -11 6592 28542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34557.66 19.00 1819 <2e-16 ***
## Floors1 15003.39 26.89 558 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9507 on 499998 degrees of freedom
## Multiple R-squared: 0.3837, Adjusted R-squared: 0.3837
## F-statistic: 3.113e+05 on 1 and 499998 DF, p-value: < 2.2e-16
Judging by the multiple R-squared of 0.3837, this model is still weak: Floors alone explains only about 38% of the variance in Prices. Because Floors is a single two-level categorical predictor, the model can only predict two values, one mean price per group. Next, I fit a model on all the variables and let the stepwise regression technique select among them automatically.
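To make the interpretation of such a model concrete, here is a minimal sketch on hypothetical data (the values are simulated, not taken from the Kaggle file). With one two-level factor as the only predictor, the intercept is the mean of the reference group, the coefficient is the difference between the two group means, and R-squared equals the squared correlation between the 0/1 indicator and the response.

```r
# Hypothetical toy data (simulated values, for illustration only)
set.seed(42)
floors <- rep(c(0, 1), each = 50)
prices <- 34000 + 15000 * floors + rnorm(100, sd = 9000)
toy <- data.frame(Floors = factor(floors), Prices = prices)

fit <- lm(Prices ~ Floors, data = toy)

# Group means computed by hand
mean0 <- mean(toy$Prices[toy$Floors == "0"])
mean1 <- mean(toy$Prices[toy$Floors == "1"])

# Intercept = mean of the reference group; Floors1 = difference in means
print(coef(fit))
print(c(mean0, mean1 - mean0))

# R-squared = squared correlation between the indicator and the response
r2 <- summary(fit)$r.squared
print(r2)
print(cor(floors, prices)^2)
```

This is consistent with the numbers reported above: a correlation of about 0.6 squares to roughly 0.38.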
mod2 <- lm(Prices ~ ., data = house)
mod2_back <- step(mod2, direction = "backward", trace = 0)
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense
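Comparing the two models, where the second uses backward stepwise selection to find the variables that give the smallest AIC value, the adjusted R-squared is 1, and R warns that model selection on an essentially perfect fit is meaningless. A perfect fit is suspicious. It may indicate overfitting, or it may mean that Prices is an exact linear function of the predictors in this dataset.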
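The sketch below shows one situation that produces such a perfect fit. It is an illustration on hypothetical data, not a claim about how the Kaggle data was generated: the data frame `toy` and its pricing rule are invented. When the response is computed from the predictors with no noise term, `lm()` recovers the rule exactly and the residuals shrink to floating-point rounding error.

```r
# Hypothetical toy data with a made-up, noise-free pricing rule
set.seed(7)
n <- 500
toy <- data.frame(
  Area   = sample(50:250, n, replace = TRUE),
  Floors = factor(sample(0:1, n, replace = TRUE)),
  Garden = factor(sample(0:1, n, replace = TRUE))
)
toy$Prices <- 1000 + 25 * toy$Area +
  15000 * (toy$Floors == "1") + 3000 * (toy$Garden == "1")

fit <- lm(Prices ~ ., data = toy)

# summary() may warn about an essentially perfect fit here; that is expected
adj_r2 <- summary(fit)$adj.r.squared
max_abs_resid <- max(abs(residuals(fit)))

print(adj_r2)          # effectively 1
print(max_abs_resid)   # rounding error only
```

Residuals of this size say little about model quality on real, noisy prices, so the fit should be read with that in mind.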
As a statistical model, linear regression has several assumptions that need to be met so that the interpretation obtained is not biased. These assumptions only need to be fulfilled if the purpose of the linear regression model is interpretation, that is, seeing the effect of each predictor on the value of the target variable. If you only want to use linear regression to make predictions, the model assumptions are not required to be met.
The expectation is that the errors of a linear regression model are normally distributed. This means that most errors are gathered symmetrically around 0.
hist(mod2_back$residuals, breaks = 20)
Looking at the histogram, the residuals of the backward stepwise model do not appear to be normally distributed, because they are clustered too tightly at 0.
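A normal Q-Q plot is a common complement to the histogram. The sketch below uses a hypothetical residual vector `res` (simulated so that most values sit very close to 0 and a few do not), not the residuals of `mod2_back`. If the residuals were normal, the points would follow the reference line.

```r
# Hypothetical residuals: most values essentially 0, a few much larger
set.seed(2024)
res <- c(rnorm(950, sd = 1e-6), rnorm(50, sd = 1))

# Sample quantiles against theoretical normal quantiles
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")

# Reference line through the first and third quartiles
qqline(res)
```

For the model in this analysis, the same two calls could be applied to `mod2_back$residuals`.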
# shapiro.test() fails when the sample size is outside 3..5000;
# catch the condition and print its message instead of aborting
tryCatch(
  shapiro.test(mod2_back$residuals),
  warning = function(w) message(conditionMessage(w)),
  error = function(e) message(conditionMessage(e))
)
## sample size must be between 3 and 5000
head(mod2_back$residuals)
## 1 2 3 4 5
## -8.344602e-05 5.049946e-07 -3.748786e-07 1.851977e-09 1.199718e-08
## 6
## -8.005082e-08
Based on the error above, the Shapiro test cannot be run on this model. The limit concerns the sample size, not the residual values: shapiro.test() accepts between 3 and 5000 observations, and this model has 500,000 residuals.
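One possible workaround is to run the test on a random subsample of at most 5000 residuals. This is a minimal sketch under that assumption: `res` is a hypothetical simulated vector standing in for `mod2_back$residuals`, and the seed is arbitrary.

```r
# Hypothetical residual vector standing in for mod2_back$residuals
set.seed(123)
res <- rnorm(20000)

# shapiro.test() accepts between 3 and 5000 values,
# so draw a random subsample of the maximum allowed size
res_sub <- sample(res, size = 5000)

sw <- shapiro.test(res_sub)
print(sw)
```

The result depends on which residuals are drawn, and with thousands of observations even small departures from normality can give a low p-value, so it is best read together with the histogram.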
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(mod2_back)
##
## studentized Breusch-Pagan test
##
## data: mod2_back
## BP = 19.34, df = 20, p-value = 0.4998
Based on the Breusch-Pagan test, the p-value of 0.4998 is greater than 0.05, so the null hypothesis of constant error variance is not rejected. The residuals of the backward stepwise model show no evidence of heteroscedasticity, meaning no systematic pattern remains in their spread.
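To show what the test measures, here is a rough base-R sketch of the studentized form of the statistic on hypothetical simulated data. It is not the `lmtest` implementation, and the variables `x` and `y` are invented. The squared residuals are regressed on the predictors, and the number of observations times the R-squared of that auxiliary regression is compared with a chi-squared distribution.

```r
# Hypothetical data with constant error variance by construction
set.seed(99)
n <- 300
x <- runif(n, 50, 250)
y <- 1000 + 25 * x + rnorm(n, sd = 500)
fit <- lm(y ~ x)

# Auxiliary regression: do the predictors explain the squared residuals?
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic and degrees of freedom (number of predictors, excluding the intercept)
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1

# A small p-value would suggest the error variance changes with x
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p = p_value))
```

In practice `bptest()` does this in one call, as used above.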
Based on the assumption checks above and several other considerations (the essentially perfect fit, residuals that do not look normally distributed, and the many categorical predictors), this HousePricing data is not well suited to modelling with multiple linear regression. However, the simple linear regression with a single predictor, Prices on Floors, is still relevant and can be used in the future, keeping in mind that it explains only about 38% of the variance in price.