The House Pricing Data set is a comprehensive collection of real estate information that serves as a valuable resource for understanding the dynamics of the housing market. Housing is not only a fundamental human need but also a significant investment and a key indicator of economic well-being. Consequently, the analysis of housing data plays a crucial role in various domains, including real estate, finance, urban planning, and government policy development.
Data Description
This data set contains 545 rows and 13 columns. It offers a glimpse into the factors influencing house prices and supports research and analysis on topics such as property valuation, market trends, and the effect of individual attributes on pricing. The data set consists of a mix of continuous and categorical variables.
The response variable is price; all other variables are predictors, and the first six rows are shown in the table below.
     price area bedrooms bathrooms stories mainroad guestroom basement
1 13300000 7420        4         2       3      yes        no       no
2 12250000 8960        4         4       4      yes        no       no
3 12215000 7500        4         2       2      yes        no      yes
4  9240000 3500        4         2       2      yes        no       no
5  8960000 8500        3         2       4      yes        no       no
6  8890000 4600        3         2       2      yes       yes       no
  hotwaterheating airconditioning parking prefarea furnishingstatus    category
1              no             yes       2      yes        furnished     Manison
2              no             yes       3       no        furnished     Manison
3              no             yes       3      yes        furnished Two-storied
4             yes              no       2       no        furnished Two-storied
5              no             yes       2       no        furnished     Manison
6              no             yes       2       no        furnished Two-storied
The resulting data set, named “selected_rows,” contains only the rows for furnished properties with two or more bathrooms. These are among the most common requirements of home buyers, so the subset lets them find matching listings quickly.
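A filter of this kind could look like the sketch below. The code that actually produced “selected_rows” is not shown, so the use of base R's subset() is an assumption, and the toy data frame holds made-up values rather than rows from the real data set.

```r
# Hypothetical toy data (illustrative values, not rows from the real data set)
toy <- data.frame(
  price            = c(9000000, 5000000, 4200000, 7000000),
  bathrooms        = c(2, 1, 3, 2),
  furnishingstatus = c("furnished", "furnished", "unfurnished", "semi-furnished")
)

# Keep only furnished properties with two or more bathrooms
selected_rows <- subset(toy, bathrooms >= 2 & furnishingstatus == "furnished")

print(selected_rows)
```

Only the first toy row satisfies both conditions, so the result has a single row.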
# A tibble: 2 × 2
  mainroad average_price
  <chr>            <dbl>
1 no            3398905.
2 yes           4991777.
The table above compares the average price of properties located on a main road (about 4.99 million) with that of properties not on a main road (about 3.40 million). Main-road access is another common requirement for people looking to buy a home.
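A grouped average like this is typically computed with dplyr's group_by() and summarise(). The sketch below assumes that approach (the tibble output suggests dplyr, but the original code is not shown) and uses a made-up toy data frame.

```r
library(dplyr)

# Hypothetical toy data (illustrative values only)
toy <- data.frame(
  price    = c(3000000, 4000000, 5000000, 6000000, 7000000),
  mainroad = c("no", "no", "yes", "yes", "yes")
)

# Average price for each main-road category
avg_by_road <- toy %>%
  group_by(mainroad) %>%
  summarise(average_price = mean(price))

print(avg_by_road)
```

Replacing the toy data frame with the full data set (d1 in this report) would give one row per mainroad value, as in the table above.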
Scatter plot with Regression line
library(ggplot2)

# Scatterplot with regression line for 'area' vs. 'price'
ggplot(d1, aes(x = area, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Area", y = "Price") +
  ggtitle("Price vs. Area")
# QQ Plot for Residuals
qqnorm(residuals(lm1))
qqline(residuals(lm1))
Interaction Plots
Interaction Plot for Area and Bedrooms
ggplot(d1, aes(x = area, y = price, color = as.factor(bedrooms))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Area", y = "Price") +
  ggtitle("Interaction Plot for Area and Bedrooms")
`geom_smooth()` using formula = 'y ~ x'
Interaction Plot for Area and Bathrooms
ggplot(d1, aes(x = area, y = price, color = as.factor(bathrooms))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Area", y = "Price") +
  ggtitle("Interaction Plot for Area and Bathrooms")
`geom_smooth()` using formula = 'y ~ x'
Boxplot for ‘furnishingstatus’ vs. ‘price’
ggplot(d1, aes(x = furnishingstatus, y = price, fill = furnishingstatus)) +
  geom_boxplot() +
  labs(x = "Furnished Status", y = "Price") +
  ggtitle("Price vs. Furnished Status") +
  scale_fill_manual(values = c("red", "blue", "green"))
Histogram for House Price Distribution
library(ggplot2)

# Create a histogram for the distribution of house prices
ggplot(d1, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "lightblue", color = "black") +
  labs(x = "House Price", y = "Frequency") +
  ggtitle("House Price Distribution")
Bar Plot of Price by Air Conditioning and Furnishing Status
ggplot(d1, aes(x = airconditioning, y = price, fill = furnishingstatus)) +
  geom_bar(stat = "identity", width = 0.5) +
  labs(x = "Air Conditioning", y = "Price") +
  scale_fill_discrete(name = "Furnishing Status") +
  ggtitle("Bar Plot of Price by Air Conditioning and Furnishing Status")
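This graph shows that, even without air conditioning, the price of furnished houses is much higher than that of houses with air conditioning. Note that the bars stack individual prices (stat = "identity"), so their heights represent the total price per group rather than the average price.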
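Because a stacked bar adds up every price in its group, totals and averages can tell different stories. The sketch below illustrates the difference with base R's aggregate(); the toy data frame is made up for illustration and is not taken from the real data set.

```r
# Hypothetical toy data (illustrative values only)
toy <- data.frame(
  price            = c(4, 5, 6, 5, 9, 10) * 1e6,
  airconditioning  = c("no", "no", "no", "no", "yes", "yes"),
  furnishingstatus = "furnished"
)

# Total price per group: the height of a stacked stat = "identity" bar
totals <- aggregate(price ~ airconditioning + furnishingstatus,
                    data = toy, FUN = sum)

# Mean price per group: the typical price of one house in the group
means <- aggregate(price ~ airconditioning + furnishingstatus,
                   data = toy, FUN = mean)

print(totals)
print(means)
```

In this toy example the "no" group has the larger total simply because it contains more houses, while the "yes" group has the larger mean. Comparing group means is therefore a useful check before reading price differences off a stacked bar plot.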
Bar Plot of Mainroad vs Price
library(ggplot2)

ggplot(d1, aes(x = mainroad, y = price, fill = mainroad)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  labs(x = "Main Road", y = "Price") +
  ggtitle("Bar Plot: Price vs. Main Road") +
  scale_fill_manual(values = c("green", "red")) +
  # Add a theme (e.g., minimal)
  theme_minimal()
Houses connected to the main road are priced higher than houses without a main-road connection, so main-road access is one of the factors affecting house prices.
Call:
lm(formula = price ~ area + bedrooms + bathrooms + parking +
    stories, data = d1)

Residuals:
     Min       1Q   Median       3Q      Max
-3396744  -731825   -64056   601486  5651126

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -145734.5   246634.5  -0.591   0.5548
area             331.1       26.6  12.448  < 2e-16 ***
bedrooms      167809.8    82932.7   2.023   0.0435 *
bathrooms    1133740.2   118828.3   9.541  < 2e-16 ***
parking       377596.3    66804.1   5.652 2.57e-08 ***
stories       547939.8    68894.5   7.953 1.07e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1244000 on 539 degrees of freedom
Multiple R-squared:  0.5616,  Adjusted R-squared:  0.5575
F-statistic: 138.1 on 5 and 539 DF,  p-value: < 2.2e-16
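The “Call” line specifies the linear regression model that was fitted. It shows the formula used to predict “price” from the predictor variables “area,” “bedrooms,” “bathrooms,” “parking,” and “stories,” using the data set “d1.” All five predictors have positive coefficients that are significant at the 5% level, and the multiple R-squared of 0.5616 means the model explains about 56% of the variation in price.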
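Two of the figures in the summary output can be reproduced by hand, which helps in reading the table. The sketch below uses only numbers printed in the output; the variable names are illustrative.

```r
# Values copied from the summary output
r2 <- 0.5616   # Multiple R-squared
n  <- 545      # number of rows in the data set
p  <- 5        # number of predictors in the model

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))   # 0.5575, matching the output

# A t value is the estimate divided by its standard error (area row)
t_area <- 331.1 / 26.6
print(round(t_area, 2))   # 12.45
```

The hand-computed t value differs slightly from the printed 12.448 because the estimate and standard error shown in the table are themselves rounded.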
Predicting new values with this linear regression model means applying the fitted model to new data in order to estimate the response variable (here, “price”) from the values of the predictor variables.
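As a minimal sketch of what such a prediction involves, the code below applies the printed coefficients by hand to the predictor values from the first row of the table shown earlier. The coefficients are rounded as printed, so the result is approximate; the commented predict() call assumes the fitted model object is the lm1 used for the residual plot.

```r
# Coefficients copied from the summary output (rounded as printed)
coefs <- c(
  intercept = -145734.5,
  area      = 331.1,
  bedrooms  = 167809.8,
  bathrooms = 1133740.2,
  parking   = 377596.3,
  stories   = 547939.8
)

# Predictor values taken from the first row of the data preview
new_house <- c(area = 7420, bedrooms = 4, bathrooms = 2, parking = 2, stories = 3)

# Prediction = intercept + sum of (coefficient * predictor value)
predicted_price <- unname(
  coefs["intercept"] + sum(coefs[names(new_house)] * new_house)
)
print(predicted_price)   # about 7.65 million

# With the fitted model object available (assumed here to be lm1),
# the equivalent call would be:
# predict(lm1, newdata = as.data.frame(as.list(new_house)))
```

The observed price for that row is 13,300,000, so the model underpredicts this particular house by roughly 5.65 million, which is in line with the largest residual reported in the summary.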