House Price Prediction

Author

Bhargava Vidiyala

Introduction

The House Pricing Data set is a comprehensive collection of real estate information that serves as a valuable resource for understanding the dynamics of the housing market. Housing is not only a fundamental human need but also a significant investment and a key indicator of economic well-being. Consequently, the analysis of housing data plays a crucial role in various domains, including real estate, finance, urban planning, and government policy development.

Data Description

This data set contains 545 rows and 13 columns. It offers a glimpse into the factors influencing house prices and facilitates research and analysis on topics such as property valuation, market trends, and the impact of various attributes on pricing. The data set consists of a mix of numeric and categorical variables.

Price is the response variable; all other variables are predictors and are described in the table below.

d1<-read.csv("/Users/sunny/Downloads/Housing (1).csv")
head(d1)
     price area bedrooms bathrooms stories mainroad guestroom basement
1 13300000 7420        4         2       3      yes        no       no
2 12250000 8960        4         4       4      yes        no       no
3 12250000 9960        3         2       2      yes        no      yes
4 12215000 7500        4         2       2      yes        no      yes
5 11410000 7420        4         1       2      yes       yes      yes
6 10850000 7500        3         3       1      yes        no      yes
  hotwaterheating airconditioning parking prefarea furnishingstatus
1              no             yes       2      yes        furnished
2              no             yes       3       no        furnished
3              no              no       2      yes   semi-furnished
4              no             yes       3      yes        furnished
5              no             yes       2       no        furnished
6              no             yes       2      yes   semi-furnished
library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
dim(d1)
[1] 545  13

Variables             Description
1  Price              (Response variable) The price of the house
2  Area               The size of the house
3  Bedrooms           Number of bedrooms
4  Bathrooms          Number of bathrooms
5  Stories            Number of floors
6  Mainroad           Whether the house is connected to the main road: a) Yes, b) No
7  Guestroom          Whether the house has a guest room: a) Yes, b) No
8  Basement           Whether the house has a basement: a) Yes, b) No
9  Hotwaterheating    Whether the house has hot water heating: a) Yes, b) No
10 Airconditioning    Whether the house has air conditioning: a) Yes, b) No
11 Parking            Number of parking slots at the house
12 Pref Area          Whether the house is in an area preferred by most people: a) Yes, b) No
13 Furnishing Status  One of three levels: a) Furnished, b) Semi-furnished, c) Unfurnished

# Label each property by its number of stories; case_when() uses the first matching condition
d1 <- d1 %>%
  mutate(category = case_when(
    stories >= 3 ~ "Mansion",
    stories >= 2 ~ "Two-storied",
    stories >= 1 ~ "Bungalow"
  ))

The “category” variable is created from the number of stories (or floors) in each property. The code categorizes the properties as follows:

  • If a property has 3 or more stories, it is categorized as a “Mansion.”

  • If a property has 2 or more stories (but fewer than 3), it is categorized as a “Two-storied” house.

  • If a property has 1 or more stories (but fewer than 2), it is categorized as a “Bungalow.”
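
The order of the conditions matters: case_when() assigns the label of the first condition that is TRUE, which is why the second and third conditions need no explicit upper bound. A minimal sketch on made-up story counts (the toy data frame is hypothetical and not part of the housing data):

```r
library(dplyr)

# Hypothetical toy data: one row per story count (not from the housing file)
toy <- data.frame(stories = c(1, 2, 3, 4))

toy <- toy %>%
  mutate(category = case_when(
    stories >= 3 ~ "Mansion",      # checked first
    stories >= 2 ~ "Two-storied",  # reached only when stories < 3
    stories >= 1 ~ "Bungalow"      # reached only when stories < 2
  ))

print(toy)
```

If the conditions were listed in the opposite order, every row would match stories >= 1 first and be labelled “Bungalow.”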


sorted_d1_desc<-d1 %>%
  arrange(desc(price))
head(sorted_d1_desc)
     price area bedrooms bathrooms stories mainroad guestroom basement
1 13300000 7420        4         2       3      yes        no       no
2 12250000 8960        4         4       4      yes        no       no
3 12250000 9960        3         2       2      yes        no      yes
4 12215000 7500        4         2       2      yes        no      yes
5 11410000 7420        4         1       2      yes       yes      yes
6 10850000 7500        3         3       1      yes        no      yes
  hotwaterheating airconditioning parking prefarea furnishingstatus    category
1              no             yes       2      yes        furnished     Mansion
2              no             yes       3       no        furnished     Mansion
3              no              no       2      yes   semi-furnished Two-storied
4              no             yes       3      yes        furnished Two-storied
5              no             yes       2       no        furnished Two-storied
6              no             yes       2      yes   semi-furnished    Bungalow

The above table is useful for exploring and gaining an overview of the highest-priced properties in the dataset.

selected_rows <- d1 %>%
  filter(bathrooms >= 2,furnishingstatus=="furnished")
head(selected_rows)
     price area bedrooms bathrooms stories mainroad guestroom basement
1 13300000 7420        4         2       3      yes        no       no
2 12250000 8960        4         4       4      yes        no       no
3 12215000 7500        4         2       2      yes        no      yes
4  9240000 3500        4         2       2      yes        no       no
5  8960000 8500        3         2       4      yes        no       no
6  8890000 4600        3         2       2      yes       yes       no
  hotwaterheating airconditioning parking prefarea furnishingstatus    category
1              no             yes       2      yes        furnished     Mansion
2              no             yes       3       no        furnished     Mansion
3              no             yes       3      yes        furnished Two-storied
4             yes              no       2       no        furnished Two-storied
5              no             yes       2       no        furnished     Mansion
6              no             yes       2       no        furnished Two-storied

The resulting data set, named “selected_rows,” contains only the properties that have two or more bathrooms and are furnished. These are common requirements among buyers, so filtering on them makes the matching listings quick to find.
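
In filter(), conditions separated by commas are combined with a logical AND, so a row is kept only when it satisfies all of them. A small sketch on made-up rows (the toy data frame and its values are hypothetical) that compares the comma form with an explicit &:

```r
library(dplyr)

# Hypothetical toy data standing in for d1
toy <- data.frame(
  bathrooms        = c(1, 2, 3, 2),
  furnishingstatus = c("furnished", "furnished", "unfurnished", "semi-furnished")
)

# Comma-separated conditions: both must hold
with_comma <- toy %>%
  filter(bathrooms >= 2, furnishingstatus == "furnished")

# The same filter written with an explicit logical AND
with_and <- toy %>%
  filter(bathrooms >= 2 & furnishingstatus == "furnished")

print(with_comma)
print(isTRUE(all.equal(with_comma, with_and)))
```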

avg_price_by_mainroad <- d1 %>%
  group_by(mainroad) %>%
  summarize(average_price = mean(price))
avg_price_by_mainroad
# A tibble: 2 × 2
  mainroad average_price
  <chr>            <dbl>
1 no            3398905.
2 yes           4991777.

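The table above compares the average price of properties located on a main road with that of properties that are not. Houses on a main road average about 4.99 million, roughly 1.59 million more than houses off a main road (about 3.40 million). Main-road access is another common requirement for people looking to buy a home.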
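
A group mean on its own does not show how many houses are in each group or how skewed the prices are. One possible extension of the summary adds a count and a median per group; the sketch below uses a small made-up data frame (toy and its values are illustrative, not taken from the data set):

```r
library(dplyr)

# Hypothetical toy data standing in for d1
toy <- data.frame(
  mainroad = c("yes", "yes", "yes", "no", "no"),
  price    = c(5000000, 6000000, 4000000, 3000000, 3500000)
)

price_by_mainroad <- toy %>%
  group_by(mainroad) %>%
  summarize(
    n             = n(),           # number of houses in the group
    average_price = mean(price),   # same statistic as in the table above
    median_price  = median(price)  # less sensitive to a few very expensive houses
  )

print(price_by_mainroad)
```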

Scatter plot with Regression line

library(ggplot2)

# Scatterplot with regression line for 'area' vs. 'price'
ggplot(d1, aes(x = area, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Area", y = "Price") +
  ggtitle("Price vs. Area")
`geom_smooth()` using formula = 'y ~ x'

lm1<-lm(price~area+bedrooms+bathrooms+parking+stories,data=d1)
summary(lm1)

Call:
lm(formula = price ~ area + bedrooms + bathrooms + parking + 
    stories, data = d1)

Residuals:
     Min       1Q   Median       3Q      Max 
-3396744  -731825   -64056   601486  5651126 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -145734.5   246634.5  -0.591   0.5548    
area            331.1       26.6  12.448  < 2e-16 ***
bedrooms     167809.8    82932.7   2.023   0.0435 *  
bathrooms   1133740.2   118828.3   9.541  < 2e-16 ***
parking      377596.3    66804.1   5.652 2.57e-08 ***
stories      547939.8    68894.5   7.953 1.07e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1244000 on 539 degrees of freedom
Multiple R-squared:  0.5616,    Adjusted R-squared:  0.5575 
F-statistic: 138.1 on 5 and 539 DF,  p-value: < 2.2e-16


Residual Plot

plot(lm1, which = 1)

Normal Q-Q Plot

# QQ Plot for Residuals
qqnorm(residuals(lm1))
qqline(residuals(lm1))

Interaction Plots

Interaction Plot for Area and Bedrooms

ggplot(d1, aes(x = area, y = price, color = as.factor(bedrooms))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Area", y = "Price") +
  ggtitle("Interaction Plot for Area and Bedrooms")
`geom_smooth()` using formula = 'y ~ x'

Interaction Plot for Area and Bathrooms

ggplot(d1, aes(x = area, y = price, color = as.factor(bathrooms))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Area", y = "Price") +
  ggtitle("Interaction Plot for Area and Bathrooms")
`geom_smooth()` using formula = 'y ~ x'

Boxplot for ‘furnishingstatus’ vs. ‘price’

ggplot(d1, aes(x = furnishingstatus, y = price, fill = furnishingstatus)) +
  geom_boxplot() +
  labs(x = "Furnishing Status", y = "Price") +
  ggtitle("Price vs. Furnishing Status") +
  scale_fill_manual(values = c("red", "blue", "green"))

Histogram for House Price Distribution

library(ggplot2)

# Create a histogram for the distribution of house prices
ggplot(d1, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "lightblue", color = "black") +
  labs(x = "House Price", y = "Frequency") +
  ggtitle("House Price Distribution")

Bar Plot of Price by Air Conditioning and Furnishing Status

ggplot(d1, aes(x = airconditioning, y = price, fill = furnishingstatus)) +
  geom_bar(stat = "identity", width = 0.5) +
  labs(x = "Air Conditioning", y = "Price") +
  scale_fill_discrete(name = "Furnishing Status") +
  ggtitle("Bar Plot of Price by Air Conditioning and Furnishing Status")

This graph shows that, even without air conditioning, furnished houses have a much higher price than the houses with air conditioning.
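
One caveat when reading this plot: geom_bar(stat = "identity") stacks one segment per row, so the height of each bar is the sum of the prices in that group. A group with many houses can therefore look taller even when its typical price is lower. The sketch below contrasts totals with means on made-up values (the toy data frame is hypothetical):

```r
# Hypothetical toy data: three houses without air conditioning, one with
toy <- data.frame(
  airconditioning = c("no", "no", "no", "yes"),
  price           = c(3000000, 3200000, 2800000, 6000000)
)

# Sum per group: what the stacked bar heights represent
total_price <- aggregate(price ~ airconditioning, data = toy, FUN = sum)

# Mean per group: a fairer comparison of typical prices
mean_price <- aggregate(price ~ airconditioning, data = toy, FUN = mean)

print(total_price)
print(mean_price)
```

In this toy example the “no” group has the larger total (9,000,000 versus 6,000,000) but the smaller mean (3,000,000 versus 6,000,000), so comparing group means or using a boxplot may be a safer basis for conclusions about typical prices.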

Bar Plot of Mainroad vs Price

library(ggplot2)

ggplot(d1, aes(x = mainroad, y = price, fill = mainroad)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  labs(x = "Main Road", y = "Price") +
  ggtitle("Bar Plot: Price vs. Main Road") +
  scale_fill_manual(values = c("green", "red")) +
  # Add a theme (e.g., minimal)
  theme_minimal()

Houses connected to the main road are priced higher than houses without a main-road connection, which is consistent with the group averages computed earlier (about 4.99 million versus 3.40 million). Main-road access is therefore one of the factors associated with house prices.

Statistical Model

lm1<-lm(price~area+bedrooms+bathrooms+parking+stories,data=d1)
summary(lm1)

Call:
lm(formula = price ~ area + bedrooms + bathrooms + parking + 
    stories, data = d1)

Residuals:
     Min       1Q   Median       3Q      Max 
-3396744  -731825   -64056   601486  5651126 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -145734.5   246634.5  -0.591   0.5548    
area            331.1       26.6  12.448  < 2e-16 ***
bedrooms     167809.8    82932.7   2.023   0.0435 *  
bathrooms   1133740.2   118828.3   9.541  < 2e-16 ***
parking      377596.3    66804.1   5.652 2.57e-08 ***
stories      547939.8    68894.5   7.953 1.07e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1244000 on 539 degrees of freedom
Multiple R-squared:  0.5616,    Adjusted R-squared:  0.5575 
F-statistic: 138.1 on 5 and 539 DF,  p-value: < 2.2e-16
  • The Call section of the output shows the linear regression model that was fitted: “price” is predicted from the predictor variables “area,” “bedrooms,” “bathrooms,” “parking,” and “stories” using the data set “d1.”

  • Predicting new values with this model means applying it to new data to estimate the response variable (here, “price”) from the values of the predictor variables.

Predictions with the Linear Regression Model

new_data <- data.frame(area = 9000,bedrooms=3,bathrooms=2,stories=2,parking=1)
predictions <- predict(lm1, newdata = new_data)
predictions
      1 
7078691
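
The prediction is the intercept plus each coefficient multiplied by the corresponding predictor value. As a rough check, the sketch below repeats that arithmetic with the rounded estimates printed in the model summary (the vector names coefs and x are illustrative). Because the printed coefficients are rounded, the result is close to, but not exactly equal to, the value returned by predict().

```r
# Rounded estimates copied from the summary(lm1) output
coefs <- c(intercept = -145734.5, area = 331.1, bedrooms = 167809.8,
           bathrooms = 1133740.2, parking = 377596.3, stories = 547939.8)

# Predictor values for the new house; the leading 1 multiplies the intercept
x <- c(intercept = 1, area = 9000, bedrooms = 3,
       bathrooms = 2, parking = 1, stories = 2)

# Intercept + sum of (coefficient * predictor value)
manual_prediction <- sum(coefs * x)
print(manual_prediction)
```

With these rounded inputs the sum is about 7,078,551, which is within roughly 140 of the 7,078,691 reported by predict().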