LBB 4
BACKGROUND
LBB Requirements
In making a report, don’t forget to cover the following:
- Selection of the target variable, which depends on the perspective you take on the case
- Data analysis and the process of selecting predictor variables (feature selection)
- Test the validity of the model
- Model interpretations and recommendations related to the initial case
Case Study
I conducted exploratory data analysis and a regression analysis to gain insight into how housing prices relate to other attributes. The dataset, “Housing Prices”, comes from Kaggle.
Insight
The aim is to analyze which variables affect home prices; the other variables in the dataset are treated as candidate factors that can affect the price.
DATA PREPARATION
Packages
Data Input
## Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
## 1 164 2 0 2 0 1 0 0
## 2 84 2 0 4 0 0 1 1
## 3 190 2 4 4 1 0 0 0
## 4 75 2 4 4 0 0 1 1
## 5 148 1 4 2 1 0 0 1
## 6 124 3 3 3 0 1 0 1
## City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
## 1 3 1 1 1 1 0 0 43800
## 2 2 0 0 0 1 1 1 37550
## 3 2 0 0 1 0 0 0 49500
## 4 1 1 1 1 1 1 1 50075
## 5 2 1 0 0 1 1 1 52400
## 6 1 0 0 1 1 1 1 54300
Colnames
## [1] "Area" "Garage" "FirePlace" "Baths"
## [5] "White.Marble" "Black.Marble" "Indian.Marble" "Floors"
## [9] "City" "Solar" "Electric" "Fiber"
## [13] "Glass.Doors" "Swiming.Pool" "Garden" "Prices"
Chunk Commentary:
Structure
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
Chunk Commentary: The data has 500000 rows and 16 columns. Our target variable is Prices, and the remaining columns are predictors.
Variable Description
The following is an explanation of the variables and their corresponding data types:
- Area: What is the area of the unit? | data type: integer
- Garage: How many garages does the unit have? | data type: integer
- FirePlace: How many fireplaces does the unit have? | data type: integer
- Baths: How many bathrooms does the unit have? | data type: integer
- White.Marble: Does the unit use white marble? | data type: factor
- Black.Marble: Does the unit use black marble? | data type: factor
- Indian.Marble: Does the unit use Indian marble? | data type: factor
- Floors: What is the number of floors? | data type: integer
- City: Which city is the unit in? | data type: factor
- Solar: Does the unit use solar power? | data type: boolean
- Electric: Does the unit have electricity? | data type: boolean
- Fiber: Does the unit use fiber? | data type: boolean
- Glass.Doors: Does the unit have glass doors? | data type: boolean
- Swiming.Pool: Does the unit have a swimming pool? | data type: boolean
- Garden: Does the unit have a garden? | data type: boolean
- Prices: What is the price of the unit? | data type: integer
## [1] 2 4 3 1 5
#change solar Factor levels Yes/no
house$Solar <- as.factor(house$Solar)
levels(x=house$Solar) <- list("no"="0", "yes"="1")
#change Electric Factor levels Yes/no
house$Electric <- as.factor(house$Electric)
levels(x=house$Electric) <- list("no"="0", "yes"="1")
#change Fiber Factor levels Yes/no
house$Fiber <- as.factor(house$Fiber)
levels(x=house$Fiber) <- list("no"="0", "yes"="1")
#change Glass Door Factor levels Yes/no
house$Glass.Doors <- as.factor(house$Glass.Doors)
levels(x=house$Glass.Doors) <- list("no"="0", "yes"="1")
#change swimming pool Factor levels Yes/no
house$Swiming.Pool <- as.factor(house$Swiming.Pool)
levels(x=house$Swiming.Pool) <- list("no"="0", "yes"="1")
#change Garden Factor levels Yes/no
house$Garden <- as.factor(house$Garden)
levels(x=house$Garden) <- list("no"="0", "yes"="1")
#change Floors Factor levels Yes/no
house$Floors <- as.factor(house$Floors)
levels(x=house$Floors) <- list("no"="0", "yes"="1")
Let's check the converted data types:
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 1 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : Factor w/ 2 levels "no","yes": 2 1 1 2 2 1 1 1 1 2 ...
## $ Electric : Factor w/ 2 levels "no","yes": 2 1 1 2 1 1 2 2 1 1 ...
## $ Fiber : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 1 1 ...
## $ Glass.Doors : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 1 1 ...
## $ Swiming.Pool : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ Garden : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 2 1 1 1 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
## Area Garage FirePlace Baths White.Marble
## 0 0 0 0 0
## Black.Marble Indian.Marble Floors City Solar
## 0 0 0 0 0
## Electric Fiber Glass.Doors Swiming.Pool Garden
## 0 0 0 0 0
## Prices
## 0
- Indian.Marble, Black.Marble and White.Marble are indicator columns for what should be a single factor, and from a business perspective the marble type affects the price. We therefore merge them into one marbles variable and drop the three original columns.
Handling marbles:
house <- house %>%
mutate(marbles = case_when( White.Marble == 1 ~ "White",
Black.Marble == 1 ~ "Black",
Indian.Marble == 1 ~ "Indian") ) %>%
select(-c(Black.Marble, White.Marble, Indian.Marble)) %>%
select(Floors, Fiber, marbles, Prices, Glass.Doors, City, Baths, FirePlace, Garage, Area, Electric, Swiming.Pool, Garden)
house$City <- as.factor(house$City)
head(house)
## Floors Fiber marbles Prices Glass.Doors City Baths FirePlace Garage Area
## 1 no yes Black 43800 yes 3 2 0 2 164
## 2 yes no Indian 37550 yes 2 4 0 2 84
## 3 no yes White 49500 no 2 4 4 2 190
## 4 yes yes Indian 50075 yes 1 4 4 2 75
## 5 yes no White 52400 yes 2 2 4 1 148
## 6 yes yes Black 54300 yes 1 3 3 3 124
## Electric Swiming.Pool Garden
## 1 yes no no
## 2 no yes yes
## 3 no no no
## 4 yes yes yes
## 5 no yes yes
## 6 no yes yes
EXPLORATORY DATA ANALYSIS
Linearity Test
Exploratory data analysis is the phase where we explore the variables and look for patterns that may indicate correlation between them.
Find the Pearson correlation between features.
## Warning in ggcorr(house, label = T, hjust = 1, layout.exp = 1): data in
## column(s) 'Floors', 'Fiber', 'marbles', 'Glass.Doors', 'City', 'Electric',
## 'Swiming.Pool', 'Garden' are not numeric and were ignored
- Prices has only a low correlation (about 0.1) with Baths, FirePlace, Garage, and Area
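As a rough illustration of the Pearson step, the sketch below computes the correlation matrix with base R on a small made-up data frame. The `toy` data and its coefficients are assumptions for illustration, not the Kaggle file. Like the ggcorr() warning above, it keeps only the numeric columns, which is why factor variables do not appear in the result.

```r
# Illustrative data only: column names mimic the report, values are simulated.
set.seed(1)
n <- 1000
toy <- data.frame(
  Area   = sample(50:250, n, replace = TRUE),
  Baths  = sample(1:5, n, replace = TRUE),
  Garden = factor(sample(c("no", "yes"), n, replace = TRUE))
)
toy$Prices <- 10000 + 25 * toy$Area + 1250 * toy$Baths + rnorm(n, sd = 5000)

# Pearson correlation is only defined for numeric columns, so drop the factor.
numeric_cols <- sapply(toy, is.numeric)
cor_matrix <- cor(toy[, numeric_cols], method = "pearson")

print(round(cor_matrix, 2))
```

Because the factors are dropped, a low correlation in this matrix says nothing about how strongly the yes/no variables relate to the price.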
## 'data.frame': 500000 obs. of 13 variables:
## $ Floors : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 1 ...
## $ Fiber : Factor w/ 2 levels "no","yes": 2 1 2 2 1 2 2 1 1 1 ...
## $ marbles : chr "Black" "Indian" "White" "Indian" ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
## $ Glass.Doors : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 1 1 ...
## $ City : Factor w/ 3 levels "1","2","3": 3 2 2 1 2 1 3 1 1 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Electric : Factor w/ 2 levels "no","yes": 2 1 1 2 1 1 2 2 1 1 ...
## $ Swiming.Pool: Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ Garden : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 2 1 1 1 ...
Data With Outlier
Chunk commentary:
- Prices is approximately normally distributed
- data collected at random from independent sources often follow a normal distribution, and the histogram here shows the familiar bell-shaped curve
Data without outlier
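outlier <- boxplot(house$Prices, plot = F)$out
# %in% tests membership in the outlier vector; != would recycle it element by element
house.without.oultier <- house %>%
filter(!(Prices %in% outlier))
hist(house.without.oultier$Prices)
Chunk Commentary:
- the distribution looks the same whether or not the outliers are included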
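When a column is filtered against a vector of outlier values, the comparison operator matters. The sketch below uses made-up prices (an assumption, not the report's data) to show that `!=` recycles the outlier vector position by position and can keep every outlier, while `%in%` tests membership.

```r
# Made-up prices with two extreme values (900 and -700).
prices <- c(100, 900, 105, 110, 95, 102, 98, 101, -700, 103)

# Same boxplot rule as in the report: values beyond 1.5 * IQR from the hinges.
outlier <- boxplot(prices, plot = FALSE)$out

# Element-wise comparison: `outlier` is recycled as 900, -700, 900, -700, ...
# so each price is compared with only one of the two outlier values.
naive <- prices[prices != outlier]

# Membership test: drops every price that appears anywhere in `outlier`.
kept <- prices[!(prices %in% outlier)]

print(outlier)
print(length(naive))
print(length(kept))
```

In this example the element-wise version removes nothing, because each outlier happens to be compared with the other one.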
MODELING
Train Test Splitting
set.seed(100)
index <- sample (nrow(house), nrow(house)*0.8)
house_train<- house[index, ]
house_test <- house[-index, ]
Chunk commentary:
- the split data is stored in house_train (80%) and house_test (20%)
Chosen predictor
our.model <- lm(formula = Prices ~ Floors + Fiber + marbles + Glass.Doors + City + Baths + FirePlace + Electric, data = house_train)
summary(our.model)
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + marbles + Glass.Doors +
## City + Baths + FirePlace + Electric, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4754.9 -1564.8 -5.1 1571.6 4754.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15771.815 13.559 1163.2 <0.0000000000000002 ***
## Floorsyes 14994.281 6.885 2177.9 <0.0000000000000002 ***
## Fiberyes 11749.701 6.885 1706.6 <0.0000000000000002 ***
## marblesIndian -5004.175 8.433 -593.4 <0.0000000000000002 ***
## marblesWhite 9006.261 8.439 1067.3 <0.0000000000000002 ***
## Glass.Doorsyes 4437.526 6.885 644.5 <0.0000000000000002 ***
## City2 3497.283 8.434 414.7 <0.0000000000000002 ***
## City3 6986.189 8.434 828.4 <0.0000000000000002 ***
## Baths 1245.565 2.434 511.7 <0.0000000000000002 ***
## FirePlace 751.936 2.435 308.9 <0.0000000000000002 ***
## Electricyes 1253.480 6.885 182.1 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2177 on 399988 degrees of freedom
## Multiple R-squared: 0.9677, Adjusted R-squared: 0.9677
## F-statistic: 1.199e+06 on 10 and 399988 DF, p-value: < 0.00000000000000022
Coefficient interpretation (holding the other predictors constant):
Each 1-unit increase in Baths raises the price by 1245.565
Each 1-unit increase in FirePlace raises the price by 751.936
A unit with Floors = yes is priced 14994.281 higher than a unit with Floors = no
A unit with Indian marble is priced 5004.175 lower than a unit with black marble (the baseline level)
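To make the coefficient reading concrete, the sketch below rebuilds one prediction by hand from the estimates printed in the summary(our.model) output. The example house is the first row of head(house); the `coefs` and `x` vectors are illustrative helpers, not objects from the original notebook.

```r
# Estimates copied from the summary(our.model) table in this report.
coefs <- c(
  "(Intercept)"  = 15771.815,
  Floorsyes      = 14994.281,
  Fiberyes       = 11749.701,
  marblesIndian  = -5004.175,
  marblesWhite   = 9006.261,
  Glass.Doorsyes = 4437.526,
  City2          = 3497.283,
  City3          = 6986.189,
  Baths          = 1245.565,
  FirePlace      = 751.936,
  Electricyes    = 1253.480
)

# First row of head(house): Floors = no, Fiber = yes, marbles = Black (baseline),
# Glass.Doors = yes, City = 3, Baths = 2, FirePlace = 0, Electric = yes.
x <- c(
  "(Intercept)"  = 1,
  Floorsyes      = 0,
  Fiberyes       = 1,
  marblesIndian  = 0,
  marblesWhite   = 0,
  Glass.Doorsyes = 1,
  City2          = 0,
  City3          = 1,
  Baths          = 2,
  FirePlace      = 0,
  Electricyes    = 1
)

# A linear-model prediction is the sum of coefficient * feature value.
predicted <- sum(coefs * x[names(coefs)])
print(predicted)
```

The hand-computed value is 42689.841, against an observed price of 43800 for that row. Baseline levels (Floors = no, black marble, City 1) contribute nothing beyond the intercept.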
Stepwise predictor
All-predictor and no-predictor models
The model with all predictors is stored in all.model, and the model with no predictors (intercept only) is stored in none.model.
##
## Call:
## lm(formula = Prices ~ ., data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.1 -124.7 -122.6 125.3 127.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9625.454893 1.018735 9448.441 < 0.0000000000000002 ***
## Floorsyes 14999.402578 0.395286 37945.693 < 0.0000000000000002 ***
## Fiberyes 11750.084214 0.395292 29725.078 < 0.0000000000000002 ***
## marblesIndian -5000.565594 0.484159 -10328.358 < 0.0000000000000002 ***
## marblesWhite 8999.135046 0.484486 18574.622 < 0.0000000000000002 ***
## Glass.Doorsyes 4450.043568 0.395291 11257.642 < 0.0000000000000002 ***
## City2 3500.104138 0.484229 7228.205 < 0.0000000000000002 ***
## City3 6999.621004 0.484211 14455.729 < 0.0000000000000002 ***
## Baths 1249.945704 0.139769 8942.942 < 0.0000000000000002 ***
## FirePlace 749.999688 0.139780 5365.575 < 0.0000000000000002 ***
## Garage 1500.253660 0.241904 6201.856 < 0.0000000000000002 ***
## Area 25.000557 0.002753 9080.103 < 0.0000000000000002 ***
## Electricyes 1250.542534 0.395284 3163.660 < 0.0000000000000002 ***
## Swiming.Poolyes -0.034082 0.395289 -0.086 0.93129
## Gardenyes -1.246035 0.395291 -3.152 0.00162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 125 on 399984 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 2.684e+08 on 14 and 399984 DF, p-value: < 0.00000000000000022
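The boundary models and the three stepwise searches in this section could be set up roughly as follows with step() from base R. This is a minimal sketch on simulated data: the `toy` data frame, its `Noise` column, and `search_scope` are assumptions for illustration, and the original chunks may have used different arguments.

```r
# Simulated stand-in for house_train: two real predictors and one pure-noise column.
set.seed(100)
n <- 500
toy <- data.frame(
  Area  = sample(50:250, n, replace = TRUE),
  Baths = sample(1:5, n, replace = TRUE),
  Noise = rnorm(n)
)
toy$Prices <- 10000 + 25 * toy$Area + 1250 * toy$Baths + rnorm(n, sd = 100)

# Boundary models: intercept only, and every available predictor.
none.model <- lm(Prices ~ 1, data = toy)
all.model  <- lm(Prices ~ ., data = toy)

# Search range for the forward and both directions.
search_scope <- list(lower = formula(none.model), upper = formula(all.model))

# Backward starts from the full model and drops terms while AIC improves.
backward.model <- step(all.model, direction = "backward", trace = 0)
# Forward starts from the empty model and adds terms while AIC improves.
forward.model <- step(none.model, scope = search_scope,
                      direction = "forward", trace = 0)
# Both allows adding and dropping at every step.
both.model <- step(none.model, scope = search_scope,
                   direction = "both", trace = 0)

print(formula(backward.model))
print(formula(forward.model))
print(formula(both.model))
```

step() selects by AIC, so a weak predictor can survive the search when its p-value is small, as Garden does in the report.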
Backward model
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + marbles + Glass.Doors +
## City + Baths + FirePlace + Garage + Area + Electric + Garden,
## data = house_train)
##
## Coefficients:
## (Intercept) Floorsyes Fiberyes marblesIndian marblesWhite
## 9625.438 14999.403 11750.084 -5000.566 8999.135
## Glass.Doorsyes City2 City3 Baths FirePlace
## 4450.044 3500.104 6999.621 1249.946 750.000
## Garage Area Electricyes Gardenyes
## 1500.254 25.001 1250.543 -1.246
backward.model <-lm(formula = Prices ~ Floors + Fiber + marbles + Glass.Doors +
City + Baths + FirePlace + Garage + Area + Electric + Garden,
data = house_train)
summary(backward.model)
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + marbles + Glass.Doors +
## City + Baths + FirePlace + Garage + Area + Electric + Garden,
## data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.1 -124.7 -122.6 125.3 127.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9625.438006 0.999730 9628.042 < 0.0000000000000002 ***
## Floorsyes 14999.402624 0.395285 37945.775 < 0.0000000000000002 ***
## Fiberyes 11750.084045 0.395287 29725.479 < 0.0000000000000002 ***
## marblesIndian -5000.565586 0.484158 -10328.371 < 0.0000000000000002 ***
## marblesWhite 8999.135104 0.484484 18574.663 < 0.0000000000000002 ***
## Glass.Doorsyes 4450.043544 0.395290 11257.659 < 0.0000000000000002 ***
## City2 3500.104127 0.484228 7228.214 < 0.0000000000000002 ***
## City3 6999.620964 0.484210 14455.753 < 0.0000000000000002 ***
## Baths 1249.945686 0.139769 8942.962 < 0.0000000000000002 ***
## FirePlace 749.999688 0.139780 5365.581 < 0.0000000000000002 ***
## Garage 1500.253647 0.241904 6201.865 < 0.0000000000000002 ***
## Area 25.000557 0.002753 9080.118 < 0.0000000000000002 ***
## Electricyes 1250.542530 0.395283 3163.664 < 0.0000000000000002 ***
## Gardenyes -1.246040 0.395290 -3.152 0.00162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 125 on 399985 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 2.891e+08 on 13 and 399985 DF, p-value: < 0.00000000000000022
Forward model
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + marbles + City + Glass.Doors +
## Area + Baths + Garage + FirePlace + Electric + Garden, data = house_train)
##
## Coefficients:
## (Intercept) Floorsyes Fiberyes marblesIndian marblesWhite
## 9625.438 14999.403 11750.084 -5000.566 8999.135
## City2 City3 Glass.Doorsyes Area Baths
## 3500.104 6999.621 4450.044 25.001 1249.946
## Garage FirePlace Electricyes Gardenyes
## 1500.254 750.000 1250.543 -1.246
forward.model <- lm(formula = Prices ~ Floors + Fiber + marbles + City + Glass.Doors +
Area + Baths + Garage + FirePlace + Electric + Garden, data = house_train)
summary(forward.model)
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + marbles + City + Glass.Doors +
## Area + Baths + Garage + FirePlace + Electric + Garden, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.1 -124.7 -122.6 125.3 127.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9625.438006 0.999730 9628.042 < 0.0000000000000002 ***
## Floorsyes 14999.402624 0.395285 37945.775 < 0.0000000000000002 ***
## Fiberyes 11750.084045 0.395287 29725.479 < 0.0000000000000002 ***
## marblesIndian -5000.565586 0.484158 -10328.371 < 0.0000000000000002 ***
## marblesWhite 8999.135104 0.484484 18574.663 < 0.0000000000000002 ***
## City2 3500.104127 0.484228 7228.214 < 0.0000000000000002 ***
## City3 6999.620964 0.484210 14455.753 < 0.0000000000000002 ***
## Glass.Doorsyes 4450.043544 0.395290 11257.659 < 0.0000000000000002 ***
## Area 25.000557 0.002753 9080.118 < 0.0000000000000002 ***
## Baths 1249.945686 0.139769 8942.962 < 0.0000000000000002 ***
## Garage 1500.253647 0.241904 6201.865 < 0.0000000000000002 ***
## FirePlace 749.999688 0.139780 5365.581 < 0.0000000000000002 ***
## Electricyes 1250.542530 0.395283 3163.664 < 0.0000000000000002 ***
## Gardenyes -1.246040 0.395290 -3.152 0.00162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 125 on 399985 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 2.891e+08 on 13 and 399985 DF, p-value: < 0.00000000000000022
Both model
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + marbles + Glass.Doors +
## City + Baths + FirePlace + Garage + Area + Electric + Garden,
## data = house_train)
##
## Coefficients:
## (Intercept) Floorsyes Fiberyes marblesIndian marblesWhite
## 9625.438 14999.403 11750.084 -5000.566 8999.135
## Glass.Doorsyes City2 City3 Baths FirePlace
## 4450.044 3500.104 6999.621 1249.946 750.000
## Garage Area Electricyes Gardenyes
## 1500.254 25.001 1250.543 -1.246
both.model <- lm(formula = Prices ~ Floors + Fiber + marbles + Glass.Doors +
City + Baths + FirePlace + Garage + Area + Electric + Garden,
data = house_train)
summary(both.model)
##
## Call:
## lm(formula = Prices ~ Floors + Fiber + marbles + Glass.Doors +
## City + Baths + FirePlace + Garage + Area + Electric + Garden,
## data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.1 -124.7 -122.6 125.3 127.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9625.438006 0.999730 9628.042 < 0.0000000000000002 ***
## Floorsyes 14999.402624 0.395285 37945.775 < 0.0000000000000002 ***
## Fiberyes 11750.084045 0.395287 29725.479 < 0.0000000000000002 ***
## marblesIndian -5000.565586 0.484158 -10328.371 < 0.0000000000000002 ***
## marblesWhite 8999.135104 0.484484 18574.663 < 0.0000000000000002 ***
## Glass.Doorsyes 4450.043544 0.395290 11257.659 < 0.0000000000000002 ***
## City2 3500.104127 0.484228 7228.214 < 0.0000000000000002 ***
## City3 6999.620964 0.484210 14455.753 < 0.0000000000000002 ***
## Baths 1249.945686 0.139769 8942.962 < 0.0000000000000002 ***
## FirePlace 749.999688 0.139780 5365.581 < 0.0000000000000002 ***
## Garage 1500.253647 0.241904 6201.865 < 0.0000000000000002 ***
## Area 25.000557 0.002753 9080.118 < 0.0000000000000002 ***
## Electricyes 1250.542530 0.395283 3163.664 < 0.0000000000000002 ***
## Gardenyes -1.246040 0.395290 -3.152 0.00162 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 125 on 399985 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 2.891e+08 on 13 and 399985 DF, p-value: < 0.00000000000000022
Prediction
Based on the evaluation metrics computed before moving on to “Checking Assumptions”, the chosen-predictor model has a higher error in RMSE, MSE, and MAE than the stepwise models.
Chosen Predictor RMSE
# point predictions only: interval = "confidence" would return a fit/lwr/upr matrix,
# and the error metrics need a plain vector of predictions
our.pred <- predict(object = our.model, newdata = house_test)
RMSE(our.pred, house_test$Prices)
## [1] 2172.339
## [1] 4719056
## [1] 1794.511
Stepwise Predictor RMSE
Backward RMSE
All three stepwise models give identical results, so we pick the backward model.
# point predictions only: interval = "confidence" would return a fit/lwr/upr matrix
backward.pred <- predict(object = backward.model, newdata = house_test)
RMSE(backward.pred, house_test$Prices)
## [1] 125.0077
## [1] 15626.93
## [1] 124.9986
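The three numbers printed for each model appear to be RMSE, MSE, and MAE in that order, since the second value is the square of the first (2172.339 squared is about 4719056, and 125.0077 squared is about 15626.93). The sketch below defines the three metrics in base R on a tiny made-up example. The lower-case helper names are hypothetical and chosen so they do not clash with the package functions used in the report.

```r
# Hypothetical helpers; each expects plain numeric vectors of equal length.
mse  <- function(pred, obs) mean((obs - pred)^2)     # mean squared error
rmse <- function(pred, obs) sqrt(mse(pred, obs))     # root mean squared error
mae  <- function(pred, obs) mean(abs(obs - pred))    # mean absolute error

# Made-up observed and predicted values.
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 39)

print(mse(pred, obs))
print(rmse(pred, obs))
print(mae(pred, obs))
```

RMSE and MAE are in the same unit as Prices, which makes them easier to read than MSE when comparing models.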
Comparing Adjusted R-squared and RMSE/MAE/MSE
Adjusted R squared
The stepwise models are better when judged by adjusted R-squared.
For the models suggested by stepwise selection (backward, forward, and both), the R-squared is almost exactly 100%, which may indicate overfitting.
Chosen Predictor Model
## [1] 0.9677117
backward.model
## [1] 0.9998936
forward.model
## [1] 0.9998936
both.model
## [1] 0.9998936
CHECKING ASSUMPTIONS
Normality
What if the residuals are not normally distributed, as in the stepwise models?
- find a new model based on business insight, and re-check it until the assumptions pass and the residuals are normally distributed
- add more data
When fitting a linear regression model, we expect the resulting errors to be normally distributed, meaning that most errors cluster around 0. This assumption can be checked by plotting a histogram of the residuals with the hist() function.
In the normality check, the residuals of the stepwise models are clearly non-normal: their quartiles (-124.7, -122.6, 125.3) show that they cluster around -125 and +125 rather than around 0. The residuals of our.model are centred on 0 (median -5.1) and roughly symmetric, so on this criterion our.model is preferable to the models recommended by stepwise selection.
The Shapiro-Wilk test (shapiro.test()) cannot be used because the sample size exceeds 5000, so the Kolmogorov-Smirnov test is used instead.
## Warning in ks.test(our.model$residuals, "pnorm", mean =
## mean(our.model$residuals), : ties should not be present for the Kolmogorov-
## Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: our.model$residuals
## D = 0.021994, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
Backward model
In the normality test, the residuals of the stepwise model are clearly non-normal because they are not centred on 0.
H0: residuals are normally distributed
H1: residuals are not normally distributed
If p-value < alpha (0.05), reject H0.
Conclusion: reject H0. The p-value is below 0.05, so the residuals are declared non-normal (assumption not met).
Forward model
In the normality test, the residuals of the stepwise model are clearly non-normal because they are not centred on 0.
H0: residuals are normally distributed
H1: residuals are not normally distributed
If p-value < alpha (0.05), reject H0.
Conclusion: reject H0. The p-value is below 0.05, so the residuals are declared non-normal (assumption not met).
Both model
H0: residuals are normally distributed
H1: residuals are not normally distributed
If p-value < alpha (0.05), reject H0.
Conclusion: reject H0. The p-value is below 0.05, so the residuals are declared non-normal (assumption not met).
Chosen Predictor
H0: residuals are normally distributed
H1: residuals are not normally distributed
If p-value < alpha (0.05), reject H0.
Conclusion: the Kolmogorov-Smirnov test on our.model gives D = 0.021994 with p-value < 0.05, so H0 is formally rejected. With roughly 400,000 residuals, however, even very small departures from normality are significant. D is small and the residuals are centred on 0, so we treat the assumption as approximately met.
## Warning in ks.test(our.model$residuals, "pnorm", mean =
## mean(our.model$residuals), : ties should not be present for the Kolmogorov-
## Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: our.model$residuals
## D = 0.021994, p-value < 0.00000000000000022
## alternative hypothesis: two-sided
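The contrast between residuals centred on 0 and residuals clustered around -125 and +125 can be reproduced on simulated data. The sketch below is an illustration under assumed values, not the report's residuals, and `ks_for` is a hypothetical helper wrapping ks.test() from base R. Estimating the mean and standard deviation from the same sample makes the p-value approximate, so the D statistic is the more useful number to compare.

```r
# Simulated residuals: one set centred on 0, one clustered at -125 and +125.
set.seed(42)
n <- 2000
centred <- rnorm(n, mean = 0, sd = 125)
bimodal <- sample(c(-125, 125), n, replace = TRUE) + runif(n, -2, 2)

# One-sample Kolmogorov-Smirnov test against a normal distribution whose
# mean and sd are taken from the residuals themselves.
ks_for <- function(res) {
  ks.test(res, "pnorm", mean = mean(res), sd = sd(res))
}

ks_centred <- ks_for(centred)
ks_bimodal <- ks_for(bimodal)

# D is the largest gap between the empirical and the normal CDF.
print(ks_centred$statistic)
print(ks_bimodal$statistic)

# Visual check: uncomment to draw the two histograms side by side.
# par(mfrow = c(1, 2)); hist(centred); hist(bimodal)
```

The two-cluster residuals give a much larger D than the centred ones, which is the same pattern the histograms show.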
Homoscedasticity
We use the Breusch-Pagan test from the lmtest package. Hypotheses (we hope for p-value > alpha):
H0: the error variance is constant (homoscedasticity)
H1: the error variance is not constant / forms a pattern (heteroscedasticity)
Conclusion: all four models fail to reject H0 (p-value = 0.7505 for our.model and 0.6563 for the three stepwise models), so all of them are homoscedastic.
##
## studentized Breusch-Pagan test
##
## data: our.model
## BP = 6.7316, df = 10, p-value = 0.7505
##
## studentized Breusch-Pagan test
##
## data: backward.model
## BP = 10.455, df = 13, p-value = 0.6563
##
## studentized Breusch-Pagan test
##
## data: forward.model
## BP = 10.455, df = 13, p-value = 0.6563
##
## studentized Breusch-Pagan test
##
## data: both.model
## BP = 10.455, df = 13, p-value = 0.6563
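To show what the studentized Breusch-Pagan statistic measures, the sketch below computes it by hand in base R: regress the squared residuals on the model's predictors and multiply the resulting R-squared by the sample size. The data are simulated and `bp_koenker` is a hypothetical helper. It is meant as an illustration of the idea, and the report itself relies on the lmtest package.

```r
# Simulated data: one response with constant error variance, one whose
# error spread grows with x (a fan shape).
set.seed(7)
n <- 2000
x <- runif(n, 1, 10)
y_const <- 2 + 3 * x + rnorm(n, sd = 2)
y_fan   <- 2 + 3 * x + rnorm(n, sd = x)

# Studentized (Koenker) Breusch-Pagan statistic computed by hand.
bp_koenker <- function(model) {
  e2  <- residuals(model)^2
  X   <- model.matrix(model)              # includes the intercept column
  aux <- lm.fit(X, e2)                    # auxiliary regression of e^2 on X
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  df  <- ncol(X) - 1                      # number of slope terms
  stat <- length(e2) * r2                 # n * R^2
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

bp_const <- bp_koenker(lm(y_const ~ x))
bp_fan   <- bp_koenker(lm(y_fan ~ x))

print(bp_const)
print(bp_fan)
```

A small p-value, as in the fan-shaped case, means the squared residuals can be predicted from the regressors, which is what heteroscedasticity looks like.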
Multicollinearity
Multicollinearity exists whenever an independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation.
A VIF value above 10 indicates multicollinearity, so we want VIF < 10.
Chosen Model
## GVIF Df GVIF^(1/(2*Df))
## Floors 1.000013 1 1.000007
## Fiber 1.000028 1 1.000014
## marbles 1.000011 2 1.000003
## Glass.Doors 1.000020 1 1.000010
## City 1.000027 2 1.000007
## Baths 1.000026 1 1.000013
## FirePlace 1.000012 1 1.000006
## Electric 1.000008 1 1.000004
Stepwise models
## GVIF Df GVIF^(1/(2*Df))
## Floors 1.000019 1 1.000009
## Fiber 1.000028 1 1.000014
## marbles 1.000028 2 1.000007
## City 1.000050 2 1.000013
## Glass.Doors 1.000046 1 1.000023
## Area 1.000032 1 1.000016
## Baths 1.000044 1 1.000022
## Garage 1.000026 1 1.000013
## FirePlace 1.000015 1 1.000007
## Electric 1.000009 1 1.000005
## Garden 1.000043 1 1.000021
## GVIF Df GVIF^(1/(2*Df))
## Floors 1.000019 1 1.000009
## Fiber 1.000028 1 1.000014
## marbles 1.000028 2 1.000007
## Glass.Doors 1.000046 1 1.000023
## City 1.000050 2 1.000013
## Baths 1.000044 1 1.000022
## FirePlace 1.000015 1 1.000007
## Garage 1.000026 1 1.000013
## Area 1.000032 1 1.000016
## Electric 1.000009 1 1.000005
## Garden 1.000043 1 1.000021
## GVIF Df GVIF^(1/(2*Df))
## Floors 1.000019 1 1.000009
## Fiber 1.000028 1 1.000014
## marbles 1.000028 2 1.000007
## City 1.000050 2 1.000013
## Glass.Doors 1.000046 1 1.000023
## Area 1.000032 1 1.000016
## Baths 1.000044 1 1.000022
## Garage 1.000026 1 1.000013
## FirePlace 1.000015 1 1.000007
## Electric 1.000009 1 1.000005
## Garden 1.000043 1 1.000021
Chunk Commentary:
- No multicollinearity is detected: every GVIF is close to 1, far below the threshold of 10
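The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand for simulated numeric predictors. `vif_manual` is a hypothetical helper and the data are assumptions. For factor predictors the tables above report a generalized VIF instead, which this sketch does not cover.

```r
# Simulated predictors: x2 is almost a copy of x1, x3 is independent.
set.seed(3)
n <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(preds)
print(round(vifs, 2))
```

Values near 1, like those in the report, mean a predictor is almost uncorrelated with the others, while the near-duplicate pair here lands far above 10.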
CONCLUSION
From the models built and the evaluation tests conducted, the models produced by stepwise regression (backward, forward, and both) pass the multicollinearity and homoscedasticity checks but fail the normality check. We therefore chose the predictors ourselves, guided by the correlation results and by business insight, and ended up with predictors such as marbles, Fiber, and Floors.
Some business recommendations from the chosen model:
- Indian marble makes the price lower than the other marbles (about 5004 below black marble)
- Floors and Fiber raise the price far more than the other variables (about 14994 and 11750 respectively)
- FirePlace adds less to the price than the other variables (about 752 per fireplace)