The data frame has 321 observations and 13 features. This is a continuous dataset.
This data frame contains the following columns:
year :
The year in which that particular house price was taken
age :
How old the house is in years
nbh :
Number of neighbours that the house has
cbd :
Distance to central business district in feet
intst :
Distance to interstate in feet
price :
Selling price of house
rooms :
Number of rooms in the house
area :
Area of house in square feet
land :
Area of lot in square feet
baths :
Number of bathrooms in the house
dist :
Distance from house to incinerator in feet
wind :
Percentage of time the wind blows from the incinerator towards the house
rprice :
Inflation-adjusted selling price for the 1978 and 1981 sales, in 1978 dollars
Importing the dataset
Since the first row is the header, we set header to TRUE
df = read.csv("incinerator.csv", header = TRUE)
Let’s look at the first few rows and then check whether the dataset contains any NA values.
head(df)
## ï..year age nbh cbd intst price rooms area land baths dist wind rprice
## 1 1978 48 4 3000 1000 60000 7 1660 4578 1 10700 3 60000
## 2 1978 83 4 4000 1000 40000 6 2612 8370 2 11000 3 40000
## 3 1978 58 4 4000 1000 34000 6 1144 5000 1 11500 3 34000
## 4 1978 11 4 4000 1000 63900 5 1136 10000 1 11900 3 63900
## 5 1978 48 4 4000 2000 44000 5 1868 10000 1 12100 3 44000
## 6 1978 78 4 3000 2000 46000 6 1780 9500 3 10000 3 46000
sum(is.na(df))
## [1] 0
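The single total above would not say where the missing values are if there were any. A minimal sketch of a per-column count, using a small made-up data frame (toy and its values are hypothetical, not the incinerator data):

```r
# Toy data frame standing in for df (hypothetical values)
toy <- data.frame(
  age   = c(48, NA, 58),
  price = c(60000, 40000, NA),
  rooms = c(7, 6, 6)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

A column with a non-zero count is the one to inspect before deciding whether to drop or impute.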
There aren’t any NA values, so we can go ahead with data cleaning. There might be some duplicated rows in our data.
Checking for duplicate rows
sum(duplicated(df))
## [1] 1
So there is one duplicate row. To find out which row it is, we can use the which function.
which(duplicated(df))
## [1] 213
So row number 213 is a repeat of an earlier row.
Removing duplicate rows
We need to remove the whole row.
df = df[-213, ]
sum(duplicated(df))
## [1] 0
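Dropping the row by its number works here, but the index has to be looked up by hand first. A minimal sketch of the same clean-up without a hard-coded index, on a made-up data frame (toy is hypothetical):

```r
# Toy data: the fourth row repeats the second (hypothetical values)
toy <- data.frame(
  year  = c(1978, 1978, 1981, 1978),
  price = c(60000, 40000, 52000, 40000)
)

# duplicated() flags every repeat of an earlier row; negating it keeps
# the first occurrence of each row and drops the repeats
toy_unique <- toy[!duplicated(toy), ]
print(toy_unique)
```

This form keeps working if the data are reloaded and the duplicate ends up at a different position.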
names(df)
## [1] "ï..year" "age" "nbh" "cbd" "intst" "price" "rooms"
## [8] "area" "land" "baths" "dist" "wind" "rprice"
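The first column name came through garbled as ï..year, which typically happens when the CSV file starts with a UTF-8 byte-order mark, so it needs renaming. We load the dplyr package to rename columns.
Renaming the columns
Only this one column needs renaming; the rest are fine.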
library(dplyr)
df=rename(df,year=ï..year)
names(df)
## [1] "year" "age" "nbh" "cbd" "intst" "price" "rooms" "area"
## [9] "land" "baths" "dist" "wind" "rprice"
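The stray characters in front of year can also be avoided at import time. A minimal sketch, assuming the file carries a UTF-8 byte-order mark (the temporary file and its contents below are made up for illustration):

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (bytes EF BB BF)
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("year,age\n1978,48\n1981,5\n")),
  tmp
)

# Declaring the encoding lets read.csv() strip the mark from the first header
toy <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(toy))

unlink(tmp)  # remove the temporary file
```

With this option the first column arrives as year, so no renaming step is needed.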
Now that we have no duplicate rows and no NA values, we can look for outliers. In layman’s terms, outliers are data points that lie far away from most of the other data points.
A simple way to find outliers is to draw a boxplot. Any point outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR) is considered an outlier. Outliers can dominate the other data points, because many analysis tools rely on squared or Euclidean distances, which give extreme values a disproportionate weight.
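The points a boxplot draws as outliers can also be listed without plotting. A minimal sketch on made-up numbers (x is hypothetical), using boxplot.stats from base R:

```r
# Ten ordinary values and one extreme value (hypothetical data)
x <- c(1:10, 100)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses
# and returns the flagged points in its $out component
flagged <- boxplot.stats(x)$out
print(flagged)
```

Note that boxplot.stats works from Tukey's hinges, which can differ slightly from the quartiles printed by summary(), so hand-computed fences may not match the plot exactly.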
CHECKING FOR OUTLIERS IN EACH VARIABLE
Checking outliers in distance to interstate variable
boxplot(df$intst, main='Distance to interstate',col='Orange')
There are no outliers in this variable
Checking outliers in Distance from house to incinerator variable
boxplot(df$dist, main='Distance from house to incinerator',col='Red')
Checking outliers in selling price variable
boxplot(df$price, main='Selling price',col='Red')
There are some outliers in this variable. Taking the log compresses the long right tail, which pulls these extreme values back in.
df$price = log(df$price)
boxplot(df$price, main='Selling price',col='Red')
Checking outliers in area
boxplot(df$area, main='Area of house',col='yellow')
Again, we take the log to pull in the outliers in area and draw a box plot to check.
df$area = log(df$area)
boxplot(df$area, main='Area of house',col='yellow')
Checking outliers in area of land
boxplot(df$land, main='Area of land',col='Brown')
Reducing the outliers by taking the log
df$land = log(df$land)
boxplot(df$land)
Even after taking the log, some outliers remain. Any point below Q1 - 1.5*IQR or above Q3 + 1.5*IQR counts as an outlier, so we compute those two fences, check which rows fall outside them, and remove those rows with the subset function.
summary(df$land)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.444 9.718 10.682 10.301 10.739 13.208
IQR = (10.739-9.737)
9.737-1.5*IQR
## [1] 8.234
10.739+1.5*IQR
## [1] 12.242
which(df$land<8.234)
## [1] 63 181 234
which(df$land>12.242)
## [1] 72 157
df=subset(df,land<12.242&land>8.234)
boxplot(df$land, main='Area of land',col='Brown')
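In the steps above the quartiles are typed in by hand (the code uses 9.737 for the first quartile, while summary() prints 9.718), which is easy to get wrong. A minimal sketch of computing the fences directly from the data, on a made-up column (toy, iqr_land, lower and upper are names introduced here):

```r
# Toy log-land values with one low and one high extreme (hypothetical data)
toy <- data.frame(land = c(7.4, 9.5, 9.8, 10.2, 10.6, 10.7, 10.7, 10.8, 13.2))

# Quartiles and interquartile range taken from the data, not typed in
q        <- unname(quantile(toy$land, c(0.25, 0.75)))
iqr_land <- q[2] - q[1]
lower    <- q[1] - 1.5 * iqr_land
upper    <- q[2] + 1.5 * iqr_land

# Keep only the rows that fall strictly inside the two fences
toy_clean <- subset(toy, land > lower & land < upper)
print(toy_clean$land)
```

The intermediate value is called iqr_land rather than IQR so that it does not share a name with R's built-in IQR() function.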
Checking outliers in real price variable
boxplot(df$rprice, main='Real house price',col='Grey')
Taking the log to remove the outliers
df$rprice = log(df$rprice)
boxplot(df$rprice, main='Price in 1978$',col='Grey')
Checking outliers in Age variable
boxplot(df$age, main='Age of house',col='Sky Blue')
We could take the log of the age variable as well, but that is not possible directly because age contains a lot of zeros and log(0) is undefined. One option would be to square the variable instead, but there is a simpler fix: add 1 to every age and take the log after that. This changes the larger values very little, and an age of 0 simply maps to log(1) = 0.
df$age=df$age+1
df$age = log(df$age)
boxplot(df$age)
summary(df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.609 1.743 3.091 5.247
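Base R has a function for exactly this add-one-then-log step. A minimal sketch with made-up ages (the age vector here is hypothetical), showing that log1p gives the same result and that expm1 undoes it:

```r
# A few example ages, including a brand-new house (hypothetical values)
age <- c(0, 1, 10, 48, 189)

# log1p(x) computes log(1 + x) in one step
log_age <- log1p(age)
print(all.equal(log_age, log(age + 1)))

# expm1(x) computes exp(x) - 1, which recovers the original ages
age_back <- expm1(log_age)
print(age_back)
```

Keeping the inverse in mind helps when reading later plots, since the age axis there is on this log scale.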
Checking outliers in cbd and nbh variables
par(mfrow = c(1, 2))
boxplot(df$cbd ,main='Distance to central business district',col='Sky Blue')
boxplot(df$nbh, main='No. of neighbours',col='Brown')
The outliers have now been dealt with, so we can finally get to the interesting part, which is the analysis.
Checking for relation between price and distance to the incinerator
library(tidyverse)
ggplot(df)+aes(x=dist,y=rprice)+geom_smooth()+geom_point()
As the plot shows, houses close to the incinerator sell for much less, and the price rises with distance up to roughly 17000 feet. This suggests that people don’t want to buy a house with a waste facility nearby. Beyond about 17000 feet the curve flattens out, which suggests that buyers are satisfied with the distance and no longer care much about the incinerator. For a real estate company that wants to sell at a high price, the ideal distance from the incinerator would therefore be approximately 17000 to 24000 feet. Interestingly, prices start to fall again beyond that range. One possible explanation is that getting rid of waste becomes tiresome when the facility is too far from the house. Either way, the distance to the incinerator appears to play a large role in the house price.
Let’s back up this finding with a linear regression model. The problem is that the relation is not linear, so we either have to use a nonlinear regression model or add cut points that split the distance into ranges. The cut points have to be chosen carefully, and some trial and error helps. With a suitable cut point we get a significant regression model.
I have coded all distances below 15000 feet as 1 and all other distances as 0.
df$dist=ifelse(df$dist<15000, 1,0)
Now creating a simple linear regression model only for the distance variable
regressor = lm(formula = rprice~dist ,data = df )
summary(regressor)
##
## Call:
## lm(formula = rprice ~ dist, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2162 -0.1780 -0.0035 0.2035 1.1780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.38202 0.02157 527.68 <2e-16 ***
## dist -0.43196 0.04104 -10.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3257 on 313 degrees of freedom
## Multiple R-squared: 0.2614, Adjusted R-squared: 0.259
## F-statistic: 110.8 on 1 and 313 DF, p-value: < 2.2e-16
The model is highly significant, so the finding holds up and the regression is useful. The R-squared value is 0.2614. The coefficient on dist is -0.432. Since distances below 15000 feet are encoded as 1 and rprice is on the log scale, houses closer than 15000 feet have a log price that is 0.432 lower, which corresponds to prices about 35% lower (exp(-0.432) - 1 is roughly -0.35). This is one finding. To find out up to what point, and by what percentage, prices change beyond 15000 feet, we would have to add more cut points to the distance variable.
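Because rprice is log-transformed, the coefficient on a 0/1 variable is a difference in log price, and turning it into a percentage takes one extra step. A minimal sketch using the estimate printed in the summary above (b_dist and pct_change are names introduced here):

```r
# Coefficient on the 0/1 distance indicator, copied from the summary output
b_dist <- -0.43196

# In a log-price model the exact percentage change for a dummy
# variable is 100 * (exp(b) - 1), not 100 * b
pct_change <- 100 * (exp(b_dist) - 1)
print(round(pct_change, 1))
```

Reading the coefficient directly as a percentage is a fair approximation only when it is small (say, below 0.1 in absolute value); at this size the two differ by several points.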
But we already got a good overall picture of that from the smoothed plot above, so now let’s find out the effect of year on house prices.
Checking for relation between price and year
We have two years, 1978 and 1981, in our year variable. We need to encode them as 0 and 1 first.
df$year = ifelse(df$year==1981, 1 , 0)
regressor1=lm(formula=rprice~year,data = df)
summary(regressor1)
##
## Call:
## lm(formula = rprice ~ year, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01463 -0.27674 0.05929 0.24718 0.95174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.17637 0.02760 404.868 < 2e-16 ***
## year 0.19569 0.04156 4.709 3.75e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3662 on 313 degrees of freedom
## Multiple R-squared: 0.06616, Adjusted R-squared: 0.06317
## F-statistic: 22.17 on 1 and 313 DF, p-value: 3.747e-06
As the summary shows, the regression model is significant, but the R-squared value is just 0.066, so year explains only about 6.6% of the variability in house prices. That is not a very strong relation; year on its own doesn’t have much to do with prices.
Checking for relation between House age and price of the house
ggplot(df)+ aes(x=age, y=rprice) + geom_point() + geom_smooth(color="darkred", fill="blue")+
labs(title="Scatter Plot of houseage and real price")
As the graph shows, the price falls as the house gets older, but for very old houses the price starts to rise again, which may have something to do with the historical value of such houses.
But there were some outliers in this variable, so I want to square the age variable to check whether the above relation really holds.
df$age1=df$age^2
ggplot(df)+ aes(x=age1, y=rprice) + geom_point() + geom_smooth(color="darkred", fill="blue")+
labs(title="Scatter Plot of houseage and real price")
The plot is clearer now that we have squared the variable: we can see a sharp increase in house price after a certain point. Cubing the original age variable would give an even clearer picture, but we don’t need that, as we already know the relation here.
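One way to check a U-shaped relation more formally than by eye is to fit a regression that includes both the variable and its square. This is a swapped-in illustration, not a step from the analysis above, and it runs on simulated data (x, y, fit_quad and turning_point are all hypothetical names):

```r
# Simulate a U-shaped relation: high at both ends, lowest near x = 2.5
set.seed(1)
x <- runif(200, min = 0, max = 5)
y <- 11 + 0.1 * (x - 2.5)^2 + rnorm(200, sd = 0.05)

# I(x^2) adds the squared term inside the formula
fit_quad <- lm(y ~ x + I(x^2))
print(coef(fit_quad))

# A positive coefficient on the squared term means the curve bends upward;
# the fitted minimum sits at -b1 / (2 * b2)
b <- unname(coef(fit_quad))
turning_point <- -b[2] / (2 * b[3])
print(turning_point)
```

On the house data the same idea would be a model of rprice on age and its square, with the sign and significance of the squared term indicating whether the upturn for old houses is more than noise.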
Checking for relation between Area of house and price of the house
ggplot(df)+aes(x=area,y=rprice)+geom_smooth()+geom_point()
The price increases with the area of the house up to a certain point, after which it levels off or even decreases.
Let’s substantiate this finding with a linear regression model.
regressor2 = lm(formula = rprice~area, df)
summary(regressor2)
##
## Call:
## lm(formula = rprice ~ area, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10899 -0.12883 0.05874 0.17771 0.60765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.48792 0.35264 15.56 <2e-16 ***
## area 0.75986 0.04636 16.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.278 on 313 degrees of freedom
## Multiple R-squared: 0.4619, Adjusted R-squared: 0.4602
## F-statistic: 268.7 on 1 and 313 DF, p-value: < 2.2e-16
The regression model is significant, and there is clearly a strong relation between the area of the house and its price. The R-squared value is 0.46, which means the area of the house explains 46% of the variation in price. The coefficient on area is 0.760. Since both area and rprice are on the log scale, this coefficient is an elasticity: a 1% increase in area is associated with roughly a 0.76% increase in price. That is a very good relation.
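With both variables on the log scale, the slope translates into a percentage change in price for a given percentage change in area. A small worked sketch using the estimate from the summary above (b_area and pct_price are names introduced here, and the 10% increase is just an example):

```r
# Slope of log price on log area, copied from the summary output
b_area <- 0.75986

# A house with 10% more area has log(area) higher by log(1.10), so the
# predicted price is multiplied by 1.10^b_area
pct_price <- 100 * (1.10^b_area - 1)
print(round(pct_price, 2))
```

So under this model a house that is 10% larger is predicted to sell for about 7.5% more.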
Checking for relation between Area of land and price of the house
ggplot(df)+aes(x=land,y=rprice)+geom_smooth(color="red",fill="blue")+geom_point()+
labs(title="Scatter Plot of land area and real price")
The relation between the two is only roughly linear. We could square or cube the variable to get a better regression model, or we can simply put cut points where the graph changes its nature. By trial and error I found a good regression by splitting the variable into three classes: values below 10 are encoded as 1, values from 10 up to 11.5 as 2, and values of 11.5 and above as 3. I have created a new variable, land1, for the encoding.
Encoding :
# below 10 -> 1, from 10 up to 11.5 -> 2, otherwise 3 (a value of exactly 10 now falls in class 2)
df$land1 = ifelse(df$land < 10, 1, ifelse(df$land < 11.5, 2, 3))
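The same three-class encoding can be written with cut, which scales better if more cut points are added later. A minimal sketch on made-up log-land values (log_land and land_class are hypothetical names):

```r
# Example log(land) values, including the two boundary values (hypothetical)
log_land <- c(9.2, 10, 10.7, 11.5, 12.1)

# right = FALSE makes each interval closed on the left, i.e. [10, 11.5);
# labels = FALSE returns the class numbers 1, 2, 3 instead of a factor
land_class <- cut(log_land,
                  breaks = c(-Inf, 10, 11.5, Inf),
                  labels = FALSE,
                  right  = FALSE)
print(land_class)
```

Adding another cut point then only means adding one number to breaks, rather than nesting another ifelse.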
Building a simple linear regression model between price and land1 (encoded area of land).
regressor3 = lm(formula = rprice~land1, df)
summary(regressor3)
##
## Call:
## lm(formula = rprice ~ land1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21275 -0.17797 -0.00494 0.20689 0.86367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.48829 0.06615 158.54 <2e-16 ***
## land1 0.44516 0.03666 12.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3124 on 313 degrees of freedom
## Multiple R-squared: 0.3203, Adjusted R-squared: 0.3181
## F-statistic: 147.5 on 1 and 313 DF, p-value: < 2.2e-16
As expected, the encoding gives us a significant model. R-squared is 0.32, so about 32% of the variation in house price is explained by the land-size class. Each step up in land class is associated with a 0.445 increase in log price, which is roughly a 56% higher price.
Checking for relation between number of neighbours and price of the house
df$nbh1=factor(df$nbh)
regressor4 = lm(formula = rprice~nbh1, df)
summary(regressor4)
##
## Call:
## lm(formula = rprice ~ nbh1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17906 -0.19172 0.01719 0.22122 0.89736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.34492 0.03024 375.185 <2e-16 ***
## nbh11 0.08295 0.07092 1.170 0.243
## nbh12 -0.00498 0.05648 -0.088 0.930
## nbh13 -0.19943 0.12727 -1.567 0.118
## nbh14 -0.47846 0.05223 -9.161 <2e-16 ***
## nbh15 -0.01645 0.06983 -0.236 0.814
## nbh16 0.07051 0.06525 1.081 0.281
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3271 on 308 degrees of freedom
## Multiple R-squared: 0.267, Adjusted R-squared: 0.2527
## F-statistic: 18.7 on 6 and 308 DF, p-value: < 2.2e-16
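The coefficients here are measured relative to the baseline of 0 neighbours. Only nbh4, that is, having 4 neighbours, differs significantly from that baseline; the other levels show no significant difference in price, so the drop shows up at 4 neighbours rather than as a steady decline.
So when we have 4 neighbours, the log price is 0.478 lower, which corresponds to a price about 38% lower (exp(-0.478) - 1 is roughly -0.38). That is a huge drop.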
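When a factor is used as a predictor, each coefficient is the difference between that level's mean and the mean of the reference level. A minimal sketch on made-up data (g, y and the fitted objects are hypothetical) that also shows how to change the reference level with relevel:

```r
# Three groups with three log prices each (hypothetical values)
g <- factor(c(0, 0, 0, 4, 4, 4, 6, 6, 6))
y <- c(11.3, 11.4, 11.2, 10.8, 10.9, 10.7, 11.4, 11.5, 11.3)

# Intercept = mean of the reference level "0";
# the other coefficients are differences from that mean
fit <- lm(y ~ g)
print(coef(fit))
print(tapply(y, g, mean))

# relevel() makes another level the reference, here "4"
fit_ref4 <- lm(y ~ relevel(g, ref = "4"))
print(coef(fit_ref4))
```

Changing the reference level does not change the fit, only which comparisons the coefficient table reports.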
Checking for relation between number of bathrooms and price of the house
regressor5 = lm(formula = rprice~baths, df)
summary(regressor5)
##
## Call:
## lm(formula = rprice ~ baths, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33717 -0.14853 0.00836 0.19121 0.73925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.40066 0.04576 227.28 <2e-16 ***
## baths 0.36745 0.01854 19.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2524 on 313 degrees of freedom
## Multiple R-squared: 0.5565, Adjusted R-squared: 0.5551
## F-statistic: 392.8 on 1 and 313 DF, p-value: < 2.2e-16
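We have a significant model, and the R-squared value is 0.56, which is high for a single variable. Each additional bathroom is associated with a 0.367 increase in log price, which is roughly a 44% higher price. So having more bathrooms is a good thing.
Now let’s combine the variables into one multiple linear regression model.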
regressor_final= lm(formula = rprice~baths+year+age+cbd+intst+rooms+area+land1+
dist+wind, df)
summary(regressor_final)
##
## Call:
## lm(formula = rprice ~ baths + year + age + cbd + intst + rooms +
## area + land1 + dist + wind, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25823 -0.08926 0.00657 0.11002 0.62535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.548e+00 3.415e-01 25.031 < 2e-16 ***
## baths 1.145e-01 2.765e-02 4.141 4.48e-05 ***
## year 1.280e-01 2.394e-02 5.347 1.76e-07 ***
## age -6.232e-02 1.066e-02 -5.845 1.30e-08 ***
## cbd -8.304e-06 1.035e-05 -0.803 0.422801
## intst 3.671e-06 1.066e-05 0.344 0.730858
## rooms 3.840e-02 1.680e-02 2.285 0.022974 *
## area 2.814e-01 5.144e-02 5.471 9.34e-08 ***
## land1 1.518e-01 4.173e-02 3.639 0.000322 ***
## dist -1.162e-01 4.442e-02 -2.615 0.009361 **
## wind -7.763e-03 7.845e-03 -0.990 0.323160
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2023 on 304 degrees of freedom
## Multiple R-squared: 0.7232, Adjusted R-squared: 0.7141
## F-statistic: 79.43 on 10 and 304 DF, p-value: < 2.2e-16
We see a good regression model here, with an adjusted R-squared of 0.714. But there are still some unnecessary variables that we do not want in our model; we want it to contain only variables that affect the dependent variable. For example, cbd and intst don’t seem to affect house prices much. Variables like these can also add multicollinearity to the model, which we don’t want.
Checking for multicollinearity
library(car)
vif(regressor_final)
## baths year age cbd intst rooms area land1
## 3.462009 1.087271 2.198279 66.223000 71.331250 1.734956 2.324576 3.091400
## dist wind
## 3.036107 3.351303
A VIF above 10 is a common rule of thumb for a variable that contributes problematic multicollinearity. As expected, we have severe multicollinearity in our model: the VIF is 66.22 for cbd and 71.33 for intst. So it’s best to get rid of these two variables. We also drop wind, which was not significant in the model above.
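For a predictor j, the variance inflation factor is 1 / (1 - R2_j), where R2_j is the R-squared from regressing that predictor on all the other predictors. A minimal sketch of computing it by hand in base R on simulated data (x1, x2, x3 and vif_manual are hypothetical; x2 is built to be nearly a copy of x1):

```r
# Simulated predictors: x1 and x2 are almost collinear, x3 is independent
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# Regress each predictor on the others and convert its R-squared to a VIF
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Here x1 and x2 get very large values while x3 stays close to 1, which mirrors the pattern seen for cbd and intst: two variables that carry nearly the same information inflate each other's VIF.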
regressor_final= lm(formula = rprice~baths+year+age+rooms+area+land1+
dist, df)
summary(regressor_final)
##
## Call:
## lm(formula = rprice ~ baths + year + age + rooms + area + land1 +
## dist, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21525 -0.09123 0.00875 0.11562 0.65971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.37570 0.33607 24.922 < 2e-16 ***
## baths 0.12083 0.02776 4.353 1.83e-05 ***
## year 0.13562 0.02392 5.670 3.30e-08 ***
## age -0.05492 0.01035 -5.308 2.13e-07 ***
## rooms 0.03794 0.01682 2.256 0.02476 *
## area 0.29504 0.05149 5.730 2.40e-08 ***
## land1 0.09536 0.03551 2.686 0.00763 **
## dist -0.06916 0.03708 -1.865 0.06313 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2038 on 307 degrees of freedom
## Multiple R-squared: 0.7163, Adjusted R-squared: 0.7098
## F-statistic: 110.7 on 7 and 307 DF, p-value: < 2.2e-16
vif(regressor_final)
## baths year age rooms area land1 dist
## 3.436908 1.069594 2.039970 1.712784 2.295493 2.205109 2.084410
We don’t have multicollinearity in our model anymore: every VIF is now below 4. This is the final model, with an adjusted R-squared of 0.7098, which is very good. It means that about 71% of the variation in house price is explained by these variables. Note that dist is now only marginally significant (p = 0.063).
Conclusions
1. If the incinerator is near the house, the price is very low, and as the distance increases the price rises up to a certain point. From 17000 to 24000 feet the price is not much affected by the distance to the incinerator. Beyond 24000 feet the price starts decreasing. So people want the incinerator to be neither too close to the house nor too far from it.
2. Year on its own doesn’t have much of a role to play.
3. As the house becomes older, the price decreases, but after a certain point it starts increasing again, possibly because of the house’s historical value.
4. Area of the house and area of the lot are very good predictors of the house price. Both are positively related to the price, but it’s worth noting that after a certain point the relation levels off or even reverses.
5. The number of bathrooms is also positively related to the price.
6. The number of neighbours matters as well: having 4 neighbours is associated with a huge drop in price.