A. DATA DICTIONARY

The data frame has 321 observations and 13 variables. All columns are numeric.

This data frame contains the following columns:

year :

The year in which the house price was recorded (1978 or 1981)

age :

How old the house is in years

nbh :

Number of neighbours that the house has

cbd :

Distance to the central business district in feet

intst :

Distance to interstate in feet

price :

Selling price of house

rooms :

Number of rooms in the house

area :

Area of house in square feet

land :

Area of lot in square feet

baths :

Number of bathrooms in the house

dist :

Distance from house to incinerator in feet

wind :

Percentage of time the wind blows from the incinerator toward the house

rprice :

Inflation-adjusted selling price (real price in 1978 dollars) for sales in 1978 and 1981

Importing the dataset

Since the first row is header, we enter header as TRUE

df = read.csv("incinerator.csv", header = TRUE)

B. DATA PREPROCESSING

There might be some NA values in our dataset, so we preview the first rows and count the missing values.

head(df)
##   ï..year age nbh  cbd intst price rooms area  land baths  dist wind rprice
## 1    1978  48   4 3000  1000 60000     7 1660  4578     1 10700    3  60000
## 2    1978  83   4 4000  1000 40000     6 2612  8370     2 11000    3  40000
## 3    1978  58   4 4000  1000 34000     6 1144  5000     1 11500    3  34000
## 4    1978  11   4 4000  1000 63900     5 1136 10000     1 11900    3  63900
## 5    1978  48   4 4000  2000 44000     5 1868 10000     1 12100    3  44000
## 6    1978  78   4 3000  2000 46000     6 1780  9500     3 10000    3  46000
sum(is.na(df))
## [1] 0

There aren’t any NA values, so we can go ahead with data cleaning. There might be some duplicated rows in our data.

Checking for duplicate rows

sum(duplicated(df))
## [1] 1

So there is one duplicate row. To find out which row that is, we can use the which() function.

which(duplicated(df))
## [1] 213

So row number 213 is a duplicate.

Removing duplicate rows

We need to remove the whole row.

df=df[-213,]
sum(duplicated(df))
## [1] 0
names(df)
##  [1] "ï..year" "age"     "nbh"     "cbd"     "intst"   "price"   "rooms"  
##  [8] "area"    "land"    "baths"   "dist"    "wind"    "rprice"
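
Hard-coding the row index (-213) works for this file but would silently remove the wrong row if the data changed. A minimal sketch of an index-free alternative is shown below; `toy` is a small made-up data frame that stands in for df, and its values are for illustration only.

```r
# Toy data frame with one repeated row (made-up values, for illustration only)
toy <- data.frame(year  = c(1978, 1978, 1981, 1978),
                  price = c(60000, 40000, 52000, 60000))

# duplicated() flags the second and later copies of a row
dup_rows <- which(duplicated(toy))

# Keep only the first occurrence of every row, without hard-coding an index
toy_clean <- toy[!duplicated(toy), ]

print(dup_rows)
print(nrow(toy_clean))
```

The same pattern applied to the real data would be `df[!duplicated(df), ]`.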

Some columns might need renaming. The first column was read as ï..year, which usually happens when the CSV file starts with a byte-order mark. We need to load the dplyr package to rename columns.

Renaming the columns

Only one column actually needs renaming; the rest are fine.

library(dplyr)
df=rename(df,year=ï..year)
names(df)
##  [1] "year"   "age"    "nbh"    "cbd"    "intst"  "price"  "rooms"  "area"  
##  [9] "land"   "baths"  "dist"   "wind"   "rprice"
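
The odd name ï..year is typical of a CSV file saved with a UTF-8 byte-order mark (BOM). Assuming that is the cause here (the file itself was not inspected), the column can be read correctly at import time instead of being renamed afterwards. The sketch below writes a tiny temporary file with a BOM to demonstrate; the file contents are made up.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (bytes EF BB BF)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("year,age\n1978,48\n1981,11\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" tells R to drop the BOM while reading
toy <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(toy))

unlink(tmp)
```

With this option the first column keeps its plain name, so no rename step is needed.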

Now that we have no duplicate rows or NA values, we can look for outliers. In layman’s terms, outliers are data points that lie far away from most of the other data points.

A simple way to find outliers is to draw a boxplot. Any point outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR) is considered an outlier. Outliers can dominate the other data points, because many analysis tools are based on Euclidean distance.
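
The 1.5*IQR rule can also be checked numerically rather than by eye. The sketch below uses base R's boxplot.stats(), which applies the same whisker rule as boxplot(); the vector `x` is made up for illustration. Note that boxplot.stats() works with hinges, which can differ slightly from the quartiles that summary() reports.

```r
# Made-up values with one obvious extreme point
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# boxplot.stats() applies the same 1.5 * IQR whisker rule as boxplot()
bs <- boxplot.stats(x)

print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # points that a boxplot would draw individually as outliers
```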

CHECKING FOR OUTLIERS IN EACH VARIABLE

Checking outliers in distance to interstate variable

boxplot(df$intst, main='Distance to interstate',col='Orange')

There are no outliers in this variable

Checking outliers in Distance from house to incinerator variable

boxplot(df$dist, main='Distance from house to incinerator',col='Red')

Checking outliers in selling price variable

boxplot(df$price, main='Selling price',col='Red')

There are some outliers in this variable. We can take the log of this variable, which compresses the long right tail and reduces the influence of these points.

df$price = log(df$price)
boxplot(df$price, main='Selling price',col='Red')

Checking outliers in area

boxplot(df$area, main='Area of house',col='yellow')

Again taking log to remove the outliers in area and plotting a box plot to check

df$area = log(df$area)
boxplot(df$area, main='Area of house',col='yellow')

Checking outliers in area of land

boxplot(df$land, main='Area of land',col='Brown')

Removing outliers by taking log

df$land = log(df$land)
boxplot(df$land)

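Even after taking the log, we still have outliers left. Any point below Q1 - 1.5*IQR or above Q3 + 1.5*IQR counts as an outlier, so we can check which points those are. The bounds below are typed in by hand from the quartiles (the code uses 9.737 for Q1, slightly above the 9.718 that summary() reports), giving cut-offs of 8.234 and 12.242. I have removed all points outside these cut-offs using the subset function.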

summary(df$land)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.444   9.718  10.682  10.301  10.739  13.208
IQR = (10.739-9.737)
9.737-1.5*IQR
## [1] 8.234
10.739+1.5*IQR
## [1] 12.242
which(df$land<8.234)
## [1]  63 181 234
which(df$land>12.242)
## [1]  72 157
df=subset(df,land<12.242&land>8.234)
boxplot(df$land, main='Area of land',col='Brown')  
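
Typing the quartiles in by hand is easy to get slightly wrong. As a sketch of an alternative, the fences can be computed directly from the data with quantile() and IQR(); `toy` and its values are made up, and quantile()'s default method is assumed to be acceptable for this purpose.

```r
# Toy log lot sizes (made-up values) with one low and one high extreme
toy <- data.frame(land = c(7.4, 9.6, 9.8, 10.2, 10.5, 10.7, 10.8, 10.9, 13.2))

# Quartiles and interquartile range computed from the data, not typed in
q <- quantile(toy$land, c(0.25, 0.75))
spread <- IQR(toy$land)

lower <- unname(q[1] - 1.5 * spread)
upper <- unname(q[2] + 1.5 * spread)

# Keep only rows strictly inside the fences
toy_trim <- subset(toy, land > lower & land < upper)
print(c(lower = lower, upper = upper))
print(nrow(toy_trim))
```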

Checking outliers in real price variable

boxplot(df$rprice, main='Real house price',col='Grey')

Taking log to remove the outlier

df$rprice = log(df$rprice)
boxplot(df$rprice, main='Price in 1978$',col='Grey')

Checking outliers in Age variable

boxplot(df$age, main='Age of house',col='Sky Blue')

We could take the log of the age variable as well to deal with the outliers, but that is not possible directly because there are many zero values in age and log(0) is undefined. One option would be to square the whole variable instead, but there is a simpler one: add 1 to every age value and then take the log. This changes the values very little, because the log is applied at the end.

df$age=df$age+1
df$age = log(df$age)
boxplot(df$age)

summary(df$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.609   1.743   3.091   5.247
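
The add-one-then-log step can be written in a single call with base R's log1p(), and undone with expm1(). A minimal sketch, using made-up ages rather than the real column:

```r
# Made-up ages in years, including new houses with age 0
age_years <- c(0, 0, 4, 22, 48, 189)

# log1p(x) computes log(1 + x), so age 0 maps to 0 instead of -Inf
age_log <- log1p(age_years)

print(log(0))                                  # what a plain log gives for age 0
print(all.equal(age_log, log(age_years + 1)))  # same result as the two-step version
print(expm1(age_log))                          # back-transform to years
```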

Checking outliers in cbd and nbh variables

par(mfrow = c(1, 2))
boxplot(df$cbd ,main='Distance to central bus',col='Sky Blue')
boxplot(df$nbh, main='No. of neighbours',col='Brown')

So all the outliers have been dealt with now, and we can finally get to the interesting part, which is the analysis.

C. ANALYSIS

Checking for relation between price and distance to the incinerator

library(tidyverse)
ggplot(df)+aes(x=dist,y=rprice)+geom_smooth()+geom_point()

As can be seen from the plot above, houses close to the incinerator sell for very low prices, and prices rise with distance up to a point. This suggests that people are reluctant to buy a house when the incinerator is closer than roughly 17000 feet, which is why prices fall inside that distance. Beyond approximately 17000 feet the curve flattens out, meaning buyers are satisfied with the distance and no longer care much about it. For a real estate company that wants to sell at a high price, the ideal distance from the incinerator would therefore be approximately 17000 to 24000 feet. Interestingly, beyond this range prices start to decrease again, which may indicate that disposing of waste becomes a tiresome chore when the site is too far from the house. Overall, the distance from the incinerator appears to play a large role in determining the housing price.

Let’s back up this finding with a linear regression model. The problem is that the relation is not linear, so we either have to use a non-linear regression model or add some cut points to the distance variable. The cut points should be chosen carefully; some trial and error can help. This should give us a significant regression model.

I have coded all distances below 15000 feet as 1 and all other distances as 0. Note that this overwrites the original dist column with the 0/1 indicator.

df$dist=ifelse(df$dist<15000, 1,0)

Now creating a simple linear regression model only for the distance variable

regressor = lm(formula = rprice~dist ,data = df )
summary(regressor)
## 
## Call:
## lm(formula = rprice ~ dist, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2162 -0.1780 -0.0035  0.2035  1.1780 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.38202    0.02157  527.68   <2e-16 ***
## dist        -0.43196    0.04104  -10.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3257 on 313 degrees of freedom
## Multiple R-squared:  0.2614, Adjusted R-squared:  0.259 
## F-statistic: 110.8 on 1 and 313 DF,  p-value: < 2.2e-16

The model is highly significant (p < 2.2e-16), so the regression supports the finding from the plot. The R-squared value is 0.2614. The distance coefficient is -0.432. Since distances below 15000 feet are coded as 1 and rprice is in logs, houses within 15000 feet of the incinerator have a log real price about 0.432 lower than houses farther away, which corresponds to a price roughly 35% lower (exp(-0.432) - 1 ≈ -0.35). This is one finding. We would have to add more cut points to the distance variable to find out up to what point, and by what percentage, prices change as the distance increases beyond 15000 feet.
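
Because rprice is in logs, the coefficient on a 0/1 indicator is a difference in log price, and the exact percentage change is 100 * (exp(b) - 1). The helper below, `pct_change()`, is a hypothetical name introduced only for this sketch; the coefficient is copied from the summary above.

```r
# Convert a coefficient from a log-price model into a percentage change.
# pct_change() is a hypothetical helper, not part of any package.
pct_change <- function(b) 100 * (exp(b) - 1)

b_dist <- -0.43196   # coefficient on the dist indicator, from the summary above
print(round(pct_change(b_dist), 1))
```

For small coefficients the two readings nearly coincide, but for a coefficient of this size the coefficient times 100 noticeably overstates the drop.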

But we have already got a fairly complete picture of that from the smoothed plot above, so now let’s look at the effect of year on housing prices.

Checking for relation between price and year

We have two years, 1978 and 1981, in our year variable. We need to encode them as 0 and 1 first.

df$year = ifelse(df$year==1981, 1 , 0)
regressor1=lm(formula=rprice~year,data = df)
summary(regressor1)
## 
## Call:
## lm(formula = rprice ~ year, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01463 -0.27674  0.05929  0.24718  0.95174 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.17637    0.02760 404.868  < 2e-16 ***
## year         0.19569    0.04156   4.709 3.75e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3662 on 313 degrees of freedom
## Multiple R-squared:  0.06616,    Adjusted R-squared:  0.06317 
## F-statistic: 22.17 on 1 and 313 DF,  p-value: 3.747e-06

As can be seen from the summary, the regression model is significant, but the R-squared value is just 0.066, so year explains only about 6.6% of the variability in house prices. The coefficient of 0.196 means that log real prices are about 0.196 higher in 1981 than in 1978 (roughly 22% higher), so year does matter for the price level even though, on its own, it explains little of the variation between houses.

Checking for relation between House age and price of the house

ggplot(df)+ aes(x=age, y=rprice) + geom_point() + geom_smooth(color="darkred", fill="blue")+
  labs(title="Scatter Plot of houseage and real price")

As can be seen from the graph, the price decreases as the house gets older, but for very old houses the price starts to increase again, which may have something to do with the historical value of the house.

Because there were some outliers in this variable, I also square the (log) age variable to check whether the relation above still holds.

df$age1=df$age^2
ggplot(df)+ aes(x=age1, y=rprice) + geom_point() + geom_smooth(color="darkred", fill="blue")+
  labs(title="Scatter Plot of houseage and real price")

After squaring, the plot is clearer and shows a sharp increase in housing price beyond a certain point. Cubing the original age variable would make the picture clearer still, but we do not need that here, as the shape of the relation is already visible.
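
Plotting against a squared variable only rescales the x-axis. One common way to test for a U-shaped relation is to include both the variable and its square in the regression. The sketch below does this on simulated data (not the incinerator data); the coefficients used to generate the data are arbitrary assumptions.

```r
set.seed(1)

# Simulated U-shaped relation (made-up data, not the incinerator data)
x <- runif(200, 0, 5)
y <- 11 - 0.6 * x + 0.1 * x^2 + rnorm(200, sd = 0.1)
sim <- data.frame(x = x, y = y)

# I() protects ^ inside a formula so that x^2 is treated as arithmetic
fit_quad <- lm(y ~ x + I(x^2), data = sim)
b <- coef(fit_quad)
print(b)

# A positive coefficient on the squared term means the curve turns upward;
# the turning point is at -b1 / (2 * b2)
turning_point <- unname(-b["x"] / (2 * b["I(x^2)"]))
print(turning_point)
```

On the real data the analogous call would regress rprice on age and I(age^2).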

Checking for relation between Area of house and price of the house

ggplot(df)+aes(x=area,y=rprice)+geom_smooth()+geom_point()

There is a positive relation between the house price and the area of the house up to a certain point, after which the price stabilises or even decreases.

Let’s substantiate our finding through the linear regression model

regressor2 = lm(formula = rprice~area, df)
summary(regressor2)
## 
## Call:
## lm(formula = rprice ~ area, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10899 -0.12883  0.05874  0.17771  0.60765 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.48792    0.35264   15.56   <2e-16 ***
## area         0.75986    0.04636   16.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.278 on 313 degrees of freedom
## Multiple R-squared:  0.4619, Adjusted R-squared:  0.4602 
## F-statistic: 268.7 on 1 and 313 DF,  p-value: < 2.2e-16

The regression model is significant, and there is clearly a strong relation between the area of the house and its price. The R-squared value is 0.46, which means that the area of the house explains about 46% of the variation in price. The coefficient of area is 0.7599, so a 1 unit increase in log(area) is associated with an increase of about 0.76 in log(rprice). Because both variables are in logs, this can be read as an elasticity: a 1% increase in area goes with roughly a 0.76% increase in price.

Checking for relation between Area of land and price of the house

ggplot(df)+aes(x=land,y=rprice)+geom_smooth(color="red",fill="blue")+geom_point()+
  labs(title="Scatter Plot of land area and real price")

There is a nearly linear relation between the two. To get a good regression model we could square or cube the variable, or we can simply put cut points where the graph changes its nature. By trial and error I found a good regression by splitting the variable at two points, giving three classes: values below 10 are encoded as 1, values between 10 and 11.5 as 2, and values above 11.5 as 3. I have created a new variable, land1, for the encoding.

Encoding :

# 1: log(land) < 10; 2: 10 <= log(land) < 11.5; 3: log(land) >= 11.5
df$land1=ifelse(df$land<10,1,ifelse(df$land<11.5,2,3))
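
Nested ifelse() calls become hard to read as the number of classes grows, and boundary values are easy to misplace. As a sketch of an alternative, base R's findInterval() and cut() express the same three-class encoding from a vector of break points. The input values and the labels "small", "medium" and "large" are made up for illustration.

```r
# Made-up log lot sizes, including the boundary values 10 and 11.5
land_log <- c(9.5, 10, 11, 11.5, 12)

# findInterval() counts how many break points are <= each value,
# so adding 1 gives classes 1 (< 10), 2 (10 to < 11.5) and 3 (>= 11.5)
land_class <- findInterval(land_log, c(10, 11.5)) + 1
print(land_class)

# cut() gives the same grouping as a labelled factor
land_factor <- cut(land_log, breaks = c(-Inf, 10, 11.5, Inf),
                   labels = c("small", "medium", "large"), right = FALSE)
print(table(land_factor))
```

Using the factor version in lm() would estimate a separate effect for each class instead of assuming the classes are equally spaced.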

Building a simple linear regression model between price and land1 (encoded area of land).

regressor3 = lm(formula = rprice~land1, df)
summary(regressor3)
## 
## Call:
## lm(formula = rprice ~ land1, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21275 -0.17797 -0.00494  0.20689  0.86367 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.48829    0.06615  158.54   <2e-16 ***
## land1        0.44516    0.03666   12.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3124 on 313 degrees of freedom
## Multiple R-squared:  0.3203, Adjusted R-squared:  0.3181 
## F-statistic: 147.5 on 1 and 313 DF,  p-value: < 2.2e-16

The encoding gives a significant model. R-squared is 0.32, so about 32% of the variation in house price is explained by the land-size class, and moving up one class (for example from 1 to 2) is associated with an increase of about 0.445 in log real price.

Checking for relation between number of neighbours and price of the house

df$nbh1=factor(df$nbh)
regressor4 = lm(formula = rprice~nbh1, df)
summary(regressor4)
## 
## Call:
## lm(formula = rprice ~ nbh1, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17906 -0.19172  0.01719  0.22122  0.89736 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.34492    0.03024 375.185   <2e-16 ***
## nbh11        0.08295    0.07092   1.170    0.243    
## nbh12       -0.00498    0.05648  -0.088    0.930    
## nbh13       -0.19943    0.12727  -1.567    0.118    
## nbh14       -0.47846    0.05223  -9.161   <2e-16 ***
## nbh15       -0.01645    0.06983  -0.236    0.814    
## nbh16        0.07051    0.06525   1.081    0.281    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3271 on 308 degrees of freedom
## Multiple R-squared:  0.267,  Adjusted R-squared:  0.2527 
## F-statistic:  18.7 on 6 and 308 DF,  p-value: < 2.2e-16

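Because nbh1 is a factor, each coefficient compares that group with the baseline group, nbh = 0 (the intercept). Only the coefficient for nbh = 4 is significant; houses with 1, 2, 3, 5 or 6 neighbours do not differ significantly from the baseline.

So when we have 4 neighbours, the log real price is about 0.478 lower than in the baseline group, which corresponds to a price roughly 38% lower (exp(-0.478) - 1 ≈ -0.38). That is a huge drop.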
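
The coefficient names in the summary (nbh11, nbh14 and so on) come from how R expands a factor into indicator columns. The sketch below shows the mechanism on a made-up data frame with three levels; `toy` and its values are illustrative only.

```r
# Made-up example: a grouping variable with levels 0, 1 and 4
toy <- data.frame(nbh = c(0, 0, 1, 1, 4, 4),
                  y   = c(11.3, 11.4, 11.4, 11.5, 10.8, 10.9))
toy$nbh1 <- factor(toy$nbh)

# The first level (0) is the baseline; one indicator is created per remaining
# level, named by pasting the variable name and the level ("nbh1" + "4" -> "nbh14")
dummy_names <- colnames(model.matrix(~ nbh1, data = toy))
print(dummy_names)

# Each coefficient is the difference in mean y between that level and the baseline
fit <- lm(y ~ nbh1, data = toy)
print(coef(fit))

# relevel() picks a different baseline if another comparison is more useful
toy$nbh1_ref4 <- relevel(toy$nbh1, ref = "4")
print(levels(toy$nbh1_ref4))
```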

Checking for relation between number of bathrooms and price of the house

regressor5 = lm(formula = rprice~baths, df)
summary(regressor5)
## 
## Call:
## lm(formula = rprice ~ baths, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33717 -0.14853  0.00836  0.19121  0.73925 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.40066    0.04576  227.28   <2e-16 ***
## baths        0.36745    0.01854   19.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2524 on 313 degrees of freedom
## Multiple R-squared:  0.5565, Adjusted R-squared:  0.5551 
## F-statistic: 392.8 on 1 and 313 DF,  p-value: < 2.2e-16

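We have a significant model, and the R-squared value of 0.556 is large for a single predictor. The coefficient of 0.367 means each additional bathroom is associated with an increase of about 0.367 in log real price. So more bathrooms go with higher prices, although this linear model cannot show whether the effect levels off when there are too many.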

Building a regression model with all the variables that affect the housing price

regressor_final= lm(formula = rprice~baths+year+age+cbd+intst+rooms+area+land1+
                      dist+wind, df)
summary(regressor_final)
## 
## Call:
## lm(formula = rprice ~ baths + year + age + cbd + intst + rooms + 
##     area + land1 + dist + wind, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25823 -0.08926  0.00657  0.11002  0.62535 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.548e+00  3.415e-01  25.031  < 2e-16 ***
## baths        1.145e-01  2.765e-02   4.141 4.48e-05 ***
## year         1.280e-01  2.394e-02   5.347 1.76e-07 ***
## age         -6.232e-02  1.066e-02  -5.845 1.30e-08 ***
## cbd         -8.304e-06  1.035e-05  -0.803 0.422801    
## intst        3.671e-06  1.066e-05   0.344 0.730858    
## rooms        3.840e-02  1.680e-02   2.285 0.022974 *  
## area         2.814e-01  5.144e-02   5.471 9.34e-08 ***
## land1        1.518e-01  4.173e-02   3.639 0.000322 ***
## dist        -1.162e-01  4.442e-02  -2.615 0.009361 ** 
## wind        -7.763e-03  7.845e-03  -0.990 0.323160    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2023 on 304 degrees of freedom
## Multiple R-squared:  0.7232, Adjusted R-squared:  0.7141 
## F-statistic: 79.43 on 10 and 304 DF,  p-value: < 2.2e-16

We see a good regression model here, with an adjusted R-squared of 0.714. But there are still some unnecessary variables that we do not want in the model; we want to keep only variables that affect the dependent variable. For example, cbd, intst and wind are not significant, so they do not seem to affect housing prices much. Such variables can also introduce multicollinearity into the model, which we do not want.

Checking for multicollinearity

library(car)
vif(regressor_final)
##     baths      year       age       cbd     intst     rooms      area     land1 
##  3.462009  1.087271  2.198279 66.223000 71.331250  1.734956  2.324576  3.091400 
##      dist      wind 
##  3.036107  3.351303

A common rule of thumb is that a VIF above 10 indicates that the variable is involved in strong multicollinearity. As expected, we have severe multicollinearity here: the VIF is 66.22 for cbd and 71.33 for intst. So it is best to get rid of these two variables. I also drop wind, which was not significant in the model above.
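
The VIF of a predictor can be computed by hand: regress that predictor on all the others and take 1 / (1 - R^2). The sketch below does this on simulated data so that it runs without the car package; `vif_manual()` is a hypothetical helper, and the assumption is that for a model with only single-coefficient terms this matches what vif() reports.

```r
set.seed(42)

# Simulated predictors (made-up data): x2 is almost a copy of x1, x3 is unrelated
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
sim <- data.frame(x1, x2, x3)

# VIF for one predictor: regress it on the others and use 1 / (1 - R^2).
# vif_manual() is a hypothetical helper written for this sketch.
vif_manual <- function(var, data) {
  others <- setdiff(names(data), var)
  r2 <- summary(lm(reformulate(others, response = var), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(sim), vif_manual, data = sim)
print(round(vifs, 2))
```

In this simulation the two near-identical predictors get very large VIFs while the unrelated one stays close to 1, which mirrors the pattern seen for cbd and intst.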

regressor_final= lm(formula = rprice~baths+year+age+rooms+area+land1+
                      dist, df)
summary(regressor_final)
## 
## Call:
## lm(formula = rprice ~ baths + year + age + rooms + area + land1 + 
##     dist, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21525 -0.09123  0.00875  0.11562  0.65971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.37570    0.33607  24.922  < 2e-16 ***
## baths        0.12083    0.02776   4.353 1.83e-05 ***
## year         0.13562    0.02392   5.670 3.30e-08 ***
## age         -0.05492    0.01035  -5.308 2.13e-07 ***
## rooms        0.03794    0.01682   2.256  0.02476 *  
## area         0.29504    0.05149   5.730 2.40e-08 ***
## land1        0.09536    0.03551   2.686  0.00763 ** 
## dist        -0.06916    0.03708  -1.865  0.06313 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2038 on 307 degrees of freedom
## Multiple R-squared:  0.7163, Adjusted R-squared:  0.7098 
## F-statistic: 110.7 on 7 and 307 DF,  p-value: < 2.2e-16
vif(regressor_final)
##    baths     year      age    rooms     area    land1     dist 
## 3.436908 1.069594 2.039970 1.712784 2.295493 2.205109 2.084410

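All VIF values are now below 4, so multicollinearity is no longer a concern. This is the final model, with an adjusted R-squared of 0.7098, which is very good: about 71% of the variation in (log) housing price is explained by these variables. Note that the dist indicator is only marginally significant in this model (p = 0.063) once the other variables are controlled for.

D. FINDINGS & CONCLUSION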

1. If the incinerator is near the house, the housing price is very low, and as the distance increases the price increases as well, up to a point. Between 17000 and 24000 feet the price is largely unaffected by the distance to the incinerator. Beyond 24000 feet the price starts decreasing. So people want the incinerator to be neither too close to the house nor too far from it.

2. Year alone explains little of the variation in prices (R-squared of about 0.07), although real prices are significantly higher in 1981 than in 1978.

3. As the house becomes older, the price starts to decrease, but after a certain point the value starts increasing, possibly because of the house’s historical value.

4. Area of the house and area of the lot are very good factors for determining the housing price. Both are positively related to price, but it is worth noting that the smoothed plots suggest the relation flattens or even reverses for the largest values.

5. The number of bathrooms is also positively related to price.

6. Prices also differ by the number of neighbours: houses with nbh = 4 show a huge drop in price compared with the baseline group, while the other groups do not differ significantly.