The housing market is an important part of any economy, and understanding housing price dynamics is critical for policy makers, economists, and potential home buyers. The purpose of this study is to analyze housing prices in Washington, USA, examine the indicators that may affect them, gain an in-depth understanding of the factors that shape the local real estate market, and use machine learning and deep learning to better predict housing price fluctuations.
Data Source: https://www.kaggle.com/datasets/shree1992/housedata
# Read the dataset
house_data <- read.csv("housing_data.csv")
# Examine the structure of the dataset
str(house_data)
## 'data.frame': 4600 obs. of 18 variables:
## $ date : chr "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement: int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ yr_renovated : int 2005 0 0 0 1992 1994 0 0 0 2010 ...
## $ street : chr "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
## $ city : chr "Shoreline" "Seattle" "Kent" "Bellevue" ...
## $ statezip : chr "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
The provided dataset consists of 4,600 observations (rows) and 18 variables (columns). Here is a summary of the variables:
| Variable Name | Description | Type |
|---|---|---|
| date | Date and time of the house sale | Character |
| price | The sale price of the house | Numeric |
| bedrooms | The number of bedrooms in the house | Numeric |
| bathrooms | The number of bathrooms in the house | Numeric |
| sqft_living | The square footage of the house’s living area | Integer |
| sqft_lot | The total square footage of the house’s lot | Integer |
| floors | The number of floors in the house | Numeric |
| waterfront | An indicator variable (0 or 1) indicating whether the house has a waterfront view | Integer |
| view | A rating of the house’s view quality | Integer |
| condition | A rating of the overall condition of the house | Integer |
| sqft_above | The square footage of the house above ground level | Integer |
| sqft_basement | The square footage of the house’s basement | Integer |
| yr_built | The year the house was built | Integer |
| yr_renovated | The year the house was last renovated | Integer |
| street | The street address of the house | Character |
| city | The city where the house is located | Character |
| statezip | The state and ZIP code of the house’s location | Character |
| country | The country where the house is located (presumably all entries are “USA”). | Character |
The dataset includes a mix of numerical and categorical
variables. The numerical variables represent measurements such as price,
square footage, and ratings, while the categorical variables include
information such as the house’s address and location.
This dataset provides valuable information for analyzing the housing market, exploring factors that influence house prices, and gaining insights into housing trends in Washington, USA.
# Check for missing values
sum(is.na(house_data))
## [1] 0
# Identify invalid data points
invalid_data <- house_data[house_data$price <= 0 | house_data$bathrooms <= 0, ]
str(invalid_data)
## 'data.frame': 51 obs. of 18 variables:
## $ date : chr "2014-06-12 00:00:00" "2014-06-24 00:00:00" "2014-05-05 00:00:00" "2014-05-05 00:00:00" ...
## $ price : num 1095000 1295648 0 0 0 ...
## $ bedrooms : num 0 0 3 4 6 5 5 4 2 4 ...
## $ bathrooms : num 0 0 1.75 2.75 2.75 3.5 1.5 4 2.5 2.25 ...
## $ sqft_living : int 3064 4810 1490 2600 3200 3480 1500 3680 2200 2170 ...
## $ sqft_lot : int 4764 28008 10125 5390 9200 36615 7112 18804 188200 10500 ...
## $ floors : num 3.5 2 1 1 1 2 1 2 1 1 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 2 0 0 0 2 0 0 0 3 2 ...
## $ condition : int 3 3 4 4 4 4 5 3 3 4 ...
## $ sqft_above : int 3064 4810 1490 1300 1600 2490 760 3680 2200 1270 ...
## $ sqft_basement: int 0 0 0 1300 1600 990 740 0 0 900 ...
## $ yr_built : int 1990 1990 1962 1960 1953 1983 1920 1990 2007 1960 ...
## $ yr_renovated : int 2009 2009 0 2001 1983 0 0 2009 0 2001 ...
## $ street : chr "814 E Howe St" "20418 NE 64th Pl" "3911 S 328th St" "2120 31st Ave W" ...
## $ city : chr "Seattle" "Redmond" "Federal Way" "Seattle" ...
## $ statezip : chr "WA 98102" "WA 98053" "WA 98001" "WA 98199" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
The dataset contains no missing values, indicating good data completeness.
To ensure data accuracy, we defined two criteria for flagging invalid data points: a price less than or equal to zero and a number of bathrooms less than or equal to zero.
Invalid price: a price of zero or below is treated as invalid, since it is highly unlikely for a property's sale price to be zero or negative; such values likely reflect data entry errors or missing values recorded as zero.
Invalid bathrooms: a bathroom count of zero or below is likewise treated as invalid, since a property is expected to have at least one bathroom; zero or negative values are unrealistic and do not align with standard real estate practice.
Applying these criteria identifies 51 observations with invalid values. Detecting and addressing these issues is important for maintaining the overall quality and reliability of the dataset.
In this step, data cleaning is performed to identify and fix structural errors and remove unwanted observations.
library(ggplot2); library(dplyr); library(tidyr)
options(repr.plot.width = 12, repr.plot.height = 8)
summary(house_data)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Check and Remove Illogical Values (Price)
# Check whether there is 0 value in house price column
house_data$price[(house_data$price == 0)]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0
Since a house price of 0 is illogical, these values likely reflect errors in the raw data. We replace the 0 values in the price column with the average house price.
# Replace 0 house price with NA
house_data <- house_data %>% mutate_at(c("price"), ~na_if(., 0))
house_data$price[(house_data$price == 0)]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# Replace the NA with mean of house price
house_data <- house_data %>% mutate(price = replace_na(price, mean(price, na.rm = TRUE)))
# To check and confirm that the min of house price is not 0 anymore, and the mean value has changed as well.
summary(house_data)
## date price bedrooms bathrooms
## Length:4600 Min. : 7800 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 328159 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 468750 Median :3.000 Median :2.250
## Mean : 557906 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Check and Remove Illogical Values (Bedrooms and Bathrooms) Since it is illogical for a house to have 0 bathrooms, which would make it uninhabitable, rows with no bedrooms or no bathrooms are removed. Out of the 4,600 rows of data, 2 houses have neither a bathroom nor a bedroom.
# Remove the rows that have no bedrooms or no bathrooms
data.clean <- house_data[house_data$bedrooms != 0 & house_data$bathrooms != 0, ]
Check for outliers
# Set the layout to 3 rows and 3 columns
par(mfrow = c(3, 3))
# Create the boxplots
boxplot(data.clean$price, main = "Price")
boxplot(data.clean$bedrooms, main = "Bedrooms")
boxplot(data.clean$bathrooms, main = "Bathrooms")
boxplot(data.clean$sqft_living, main = "sqft_living")
boxplot(data.clean$sqft_lot, main = "sqft_lot")
boxplot(data.clean$floors, main = "floors")
boxplot(data.clean$waterfront, main = "waterfront")
boxplot(data.clean$condition, main = "condition")
boxplot(data.clean$view, main = "view")
Based on the boxplots, outliers can be observed for most of the variables. However, we decided not to drop the outliers: a great number of them are present in the dataset and they may carry meaningful information, so removing them could produce inaccurate and unreliable results.
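As a quick supplementary check (a sketch, not part of the original workflow), the 1.5 × IQR rule that boxplot() uses to flag points beyond the whiskers can be applied directly to count how many observations are flagged per variable; the helper name and the column selection below are illustrative.
# Sketch: count points outside the 1.5*IQR whiskers (the rule boxplot() uses to flag outliers)
count_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(data.clean[, c("price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot")], count_iqr_outliers)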
Feature Selection
# Feature Engineering and Selection stage
colnames(data.clean)
## [1] "date" "price" "bedrooms" "bathrooms"
## [5] "sqft_living" "sqft_lot" "floors" "waterfront"
## [9] "view" "condition" "sqft_above" "sqft_basement"
## [13] "yr_built" "yr_renovated" "street" "city"
## [17] "statezip" "country"
# Drop "date" "street" "statezip" "country" as the variables do not provide much information for our analysis
data.clean2 <- data.clean[,-c(1,15,17,18)]
head(data.clean2)
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1 313000 3 1.50 1340 7912 1.5 0 0
## 2 2384000 5 2.50 3650 9050 2.0 0 4
## 3 342000 3 2.00 1930 11947 1.0 0 0
## 4 420000 3 2.25 2000 8030 1.0 0 0
## 5 550000 4 2.50 1940 10500 1.0 0 0
## 6 490000 2 1.00 880 6380 1.0 0 0
## condition sqft_above sqft_basement yr_built yr_renovated city
## 1 3 1340 0 1955 2005 Shoreline
## 2 5 3370 280 1921 0 Seattle
## 3 4 1930 0 1966 0 Kent
## 4 4 1000 1000 1963 0 Bellevue
## 5 4 1140 800 1976 1992 Redmond
## 6 3 880 0 1938 1994 Seattle
# Perform multiple linear regression to check the statistical significance of each feature
# mod.fit regresses price on all remaining features
colnames(data.clean2)
## [1] "price" "bedrooms" "bathrooms" "sqft_living"
## [5] "sqft_lot" "floors" "waterfront" "view"
## [9] "condition" "sqft_above" "sqft_basement" "yr_built"
## [13] "yr_renovated" "city"
mod.fit <- lm(price~., data=data.clean2)
summary(mod.fit)$coefficients[-grep("city", rownames(summary(mod.fit)$coefficients)),]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.511357e+06 7.918313e+05 1.9086862 5.636554e-02
## bedrooms -3.825739e+04 1.028341e+04 -3.7203020 2.013883e-04
## bathrooms 4.959504e+04 1.666110e+04 2.9766973 2.929087e-03
## sqft_living 1.557970e+02 2.159427e+01 7.2147380 6.301081e-13
## sqft_lot -1.763398e-01 2.202255e-01 -0.8007237 4.233335e-01
## floors -1.492711e+04 1.985953e+04 -0.7516349 4.523096e-01
## waterfront 4.762616e+05 9.262745e+04 5.1416890 2.837238e-07
## view 5.112173e+04 1.072698e+04 4.7657154 1.940603e-06
## condition 3.603079e+04 1.283701e+04 2.8067893 5.025116e-03
## sqft_above 9.271633e+01 2.229666e+01 4.1583052 3.265304e-05
## yr_built -8.935130e+02 3.793060e+02 -2.3556522 1.853249e-02
## yr_renovated 5.005525e+00 8.368514e+00 0.5981378 5.497778e-01
From the multivariate linear regression results, there are 9 variables with Pr(>|t|) less than 0.05, which means these variables are statistically significant and are therefore retained. The remaining variables are excluded from the later analysis as they are statistically insignificant (Pr(>|t|) greater than 0.05).
# Remove statistically insignificant variables
mod.fit.1 <- lm(price~., data=data.clean2[, -c(6,10,11,13)])
summary(mod.fit.1)$coefficients[-grep("city", rownames(summary(mod.fit.1)$coefficients)),]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.518233e+06 7.283978e+05 2.0843465 3.718435e-02
## bedrooms -4.223986e+04 1.024939e+04 -4.1212058 3.836065e-05
## bathrooms 4.647481e+04 1.617404e+04 2.8734203 4.079359e-03
## sqft_living 2.273048e+02 1.349824e+01 16.8395837 9.008250e-62
## sqft_lot -1.480951e-01 2.203428e-01 -0.6721125 5.015462e-01
## waterfront 4.736047e+05 9.276860e+04 5.1052265 3.438931e-07
## view 4.404886e+04 1.061941e+04 4.1479582 3.415773e-05
## condition 2.569416e+04 1.178850e+04 2.1795961 2.933856e-02
## yr_built -8.623646e+02 3.462448e+02 -2.4906215 1.278741e-02
The reduced model remains statistically significant overall (p-value below 0.05), and most of the retained predictors are individually significant.
# Convert the "city" column to numeric labels
city_labels <- as.numeric(as.factor(data.clean2$city))
data.clean2$city <- city_labels
head(data.clean2)
## price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1 313000 3 1.50 1340 7912 1.5 0 0
## 2 2384000 5 2.50 3650 9050 2.0 0 4
## 3 342000 3 2.00 1930 11947 1.0 0 0
## 4 420000 3 2.25 2000 8030 1.0 0 0
## 5 550000 4 2.50 1940 10500 1.0 0 0
## 6 490000 2 1.00 880 6380 1.0 0 0
## condition sqft_above sqft_basement yr_built yr_renovated city
## 1 3 1340 0 1955 2005 37
## 2 5 3370 280 1921 0 36
## 3 4 1930 0 1966 0 19
## 4 4 1000 1000 1963 0 4
## 5 4 1140 800 1976 1992 32
## 6 3 880 0 1938 1994 36
Check for Correlation
# Cleaned dataset
data.clean3 <- data.clean2[, -c(6,10,11,13)]
corr <- round(cor(data.clean3), 2)
Pearson Correlation
Perform Pearson Correlation to check if there is a linear relationship between price (dependent variable) and the independent variables.
#install.packages("reshape2")
# correlation heatmap with ggplot2
library(reshape2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(corr){
corr[upper.tri(corr)] <- NA
return(corr)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(corr){
corr[lower.tri(corr)]<- NA
return(corr)
}
upper_tri <- get_upper_tri(corr)
upper_tri
## price bedrooms bathrooms sqft_living sqft_lot waterfront view
## price 1 0.21 0.34 0.44 0.05 0.14 0.24
## bedrooms NA 1.00 0.54 0.60 0.07 0.00 0.11
## bathrooms NA NA 1.00 0.77 0.11 0.08 0.21
## sqft_living NA NA NA 1.00 0.21 0.12 0.31
## sqft_lot NA NA NA NA 1.00 0.02 0.07
## waterfront NA NA NA NA NA 1.00 0.36
## view NA NA NA NA NA NA 1.00
## condition NA NA NA NA NA NA NA
## yr_built NA NA NA NA NA NA NA
## city NA NA NA NA NA NA NA
## condition yr_built city
## price 0.04 0.02 0.02
## bedrooms 0.02 0.14 -0.13
## bathrooms -0.12 0.47 -0.10
## sqft_living -0.06 0.29 -0.11
## sqft_lot 0.00 0.05 -0.08
## waterfront 0.00 -0.02 0.00
## view 0.06 -0.06 0.00
## condition 1.00 -0.40 -0.01
## yr_built NA 1.00 -0.21
## city NA NA 1.00
# Melt the correlation matrix and drop the rows with NA values
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
# Create a ggheatmap
ggheatmap <- ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
# print a ggheatmap
print(ggheatmap)
# Add correlation coefficients on the heatmap
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
From the Pearson correlation, the following can be observed:
1. sqft_living shows a moderate positive correlation with price.
2. The number of bedrooms, the number of bathrooms, and view show a weak positive correlation with price.
3. sqft_lot, waterfront, condition, and yr_built show a very weak positive correlation with price.
Plot Scatterplot matrix
# Scatterplot matrix is more suitable for continuous variables, hence it is performed with price, sqft_living and sqft_lot.
# The variables are log-transformed to reduce skewness and obtain a more reliable result.
pairs(~log10(price) + log10(sqft_living) + log10(sqft_lot) , data = data.clean3,
main = "Scatterplot Matrix")
The scatterplot matrix shows a clear positive relationship between price and sqft_living, while the relationship between price and sqft_lot is much weaker.
Plot Q-Q Plot
# Plot q-q plot to check for normality
par(mfrow=c(3, 3))
qqnorm(log10(data.clean3$price), main="Q-Q Plot: Price", pch=1, frame=FALSE)
qqline(log10(data.clean3$price), col="green", lwd=2)
qqnorm(log10(data.clean3$sqft_living), main="Q-Q Plot: sqft_living", pch=1, frame=FALSE)
qqline(log10(data.clean3$sqft_living), col="blue", lwd=2)
qqnorm(log10(data.clean3$sqft_lot), main="Q-Q Plot: sqft_lot", pch=1, frame=FALSE)
qqline(log10(data.clean3$sqft_lot), col="orange", lwd=2)
qqnorm(log10(data.clean3$bedrooms), main="Q-Q Plot: bedrooms", pch=1, frame=FALSE)
qqnorm(log10(data.clean3$bathrooms), main="Q-Q Plot: bathrooms", pch=1, frame=FALSE)
qqnorm(data.clean3$waterfront, main="Q-Q Plot: waterfront", pch=1, frame=FALSE)
qqnorm(data.clean3$view, main="Q-Q Plot: view", pch=1, frame=FALSE)
qqnorm(data.clean3$condition, main="Q-Q Plot: condition", pch=1, frame=FALSE)
Most of the features above are approximately normally distributed: the continuous variables (except sqft_lot) lie close to the Q-Q line, and the discrete variables are reasonably evenly distributed across quantiles. Hence, we can say that most of the data are approximately normal.
Relationship between categorical variables and house price
# Check for relationship between categorical variables and house price
par(mfrow=c(3, 2))
boxplot(log10(price)~bedrooms, data=data.clean3,
main="log US House Price vs Number of Bedrooms",
xlab="bedrooms",
ylab="log US House Price (log10 USD)",
col="blue")
boxplot(log10(price)~bathrooms, data=data.clean3,
main="log US House Price vs Number of Bathrooms",
xlab="bathrooms",
ylab="log US House Price (log10 USD)",
col="yellow")
boxplot(log10(price)~waterfront, data=data.clean3,
main="log US House Price vs Waterfront",
xlab="waterfront",
ylab="log US House Price (log10 USD)",
col="green")
boxplot(log10(price)~view, data=data.clean3,
main="log US House Price vs View",
xlab="view",
ylab="log US House Price (log10 USD)",
col="pink")
boxplot(log10(price)~condition, data=data.clean3,
main="log US House Price vs Condition",
xlab="condition",
ylab="log US House Price (log10 USD)",
col="orange")
From the above charts, it can be observed that as the number of bedrooms and bathrooms increases, the house price increases (seen in the rising median line). Houses with a waterfront, a better view, or a better condition also tend to have higher prices.
library("dplyr")
# Extract the desired columns
classification_data <- data.clean3 %>% select(price, sqft_living, city)
# Remove rows with null values or zero as a value
classification_data <- classification_data[complete.cases(classification_data) & !classification_data$price == 0 & !classification_data$sqft_living == 0, ]
# Add the "price/sqft" column after the "sqft" column
classification_data <- within(classification_data, price_sqft <- price / sqft_living)
# Group the dataset by "city" column
grouped_data <- classification_data %>% group_by(city)
# Create tertile bins based on "price_sqft" within each group
grouped_data <- grouped_data %>% mutate(bin = ntile(price_sqft, 3))
# Map the bin values to corresponding labels
grouped_data$bin <- as.factor(case_when(
grouped_data$bin == 1 ~ "low",
grouped_data$bin == 2 ~ "medium",
grouped_data$bin == 3 ~ "high"
))
#install.packages("caret")
library("caret")
## Loading required package: lattice
# Split the grouped_data into training and testing datasets
set.seed(123) # For reproducibility
train_indices <- createDataPartition(grouped_data$bin, p = 0.8, list = FALSE)
train_data <- grouped_data[train_indices, ]
test_data <- grouped_data[-train_indices, ]
Decision Tree
#install.packages("rpart")
library("rpart")
# Train a decision tree model
model <- rpart(bin ~ city + price_sqft, data = train_data, method = "class")
# Make predictions on the test set
predictions <- predict(model, test_data, type = "class")
# Calculate accuracy
accuracy_dt <- mean(predictions == test_data$bin)
# Create the confusion matrix
confusion_dt <- confusionMatrix(predictions, test_data$bin)
# Print the accuracy and confusion matrix
print(paste("Accuracy:", accuracy_dt))
## [1] "Accuracy: 0.831154684095861"
print(confusion_dt)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low medium
## high 247 13 38
## low 29 281 33
## medium 27 15 235
##
## Overall Statistics
##
## Accuracy : 0.8312
## 95% CI : (0.8053, 0.8548)
## No Information Rate : 0.3366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7467
##
## Mcnemar's Test P-Value : 0.002085
##
## Statistics by Class:
##
## Class: high Class: low Class: medium
## Sensitivity 0.8152 0.9094 0.7680
## Specificity 0.9171 0.8982 0.9314
## Pos Pred Value 0.8289 0.8192 0.8484
## Neg Pred Value 0.9097 0.9513 0.8892
## Prevalence 0.3301 0.3366 0.3333
## Detection Rate 0.2691 0.3061 0.2560
## Detection Prevalence 0.3246 0.3736 0.3017
## Balanced Accuracy 0.8661 0.9038 0.8497
Random Forest
#install.packages("randomForest")
library("randomForest")
# Train a Random Forest Model
model <- randomForest(bin ~ city + price_sqft, data = train_data, ntree = 100)
# Make predictions on the test set
predictions <- predict(model, test_data, type = "class")
# Calculate accuracy
accuracy_rf <- mean(predictions == test_data$bin)
# Create the confusion matrix
confusion_rf <- confusionMatrix(predictions, test_data$bin)
# Print the accuracy and confusion matrix
print(paste("Accuracy:", accuracy_rf))
## [1] "Accuracy: 0.961873638344227"
print(confusion_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low medium
## high 293 3 7
## low 1 299 8
## medium 9 7 291
##
## Overall Statistics
##
## Accuracy : 0.9619
## 95% CI : (0.9474, 0.9733)
## No Information Rate : 0.3366
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9428
##
## Mcnemar's Test P-Value : 0.7252
##
## Statistics by Class:
##
## Class: high Class: low Class: medium
## Sensitivity 0.9670 0.9676 0.9510
## Specificity 0.9837 0.9852 0.9739
## Pos Pred Value 0.9670 0.9708 0.9479
## Neg Pred Value 0.9837 0.9836 0.9755
## Prevalence 0.3301 0.3366 0.3333
## Detection Rate 0.3192 0.3257 0.3170
## Detection Prevalence 0.3301 0.3355 0.3344
## Balanced Accuracy 0.9754 0.9764 0.9624
Naive Bayes
#install.packages("e1071")
library("e1071")
# Train the Naive Bayes model with Laplace smoothing
model <- naiveBayes(bin ~ city + price_sqft, data = train_data, laplace = 1)
# Make predictions on the test set
predictions <- predict(model, test_data, type = "class")
# Calculate accuracy
accuracy_nb <- mean(predictions == test_data$bin)
# Create the confusion matrix
confusion_nb <- confusionMatrix(predictions, test_data$bin)
# Print the accuracy and confusion matrix
print(paste("Accuracy:", accuracy_nb))
## [1] "Accuracy: 0.574074074074074"
print(confusion_nb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low medium
## high 84 2 1
## low 63 235 97
## medium 156 72 208
##
## Overall Statistics
##
## Accuracy : 0.5741
## 95% CI : (0.5413, 0.6063)
## No Information Rate : 0.3366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3601
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: high Class: low Class: medium
## Sensitivity 0.27723 0.7605 0.6797
## Specificity 0.99512 0.7373 0.6275
## Pos Pred Value 0.96552 0.5949 0.4771
## Neg Pred Value 0.73646 0.8585 0.7967
## Prevalence 0.33007 0.3366 0.3333
## Detection Rate 0.09150 0.2560 0.2266
## Detection Prevalence 0.09477 0.4303 0.4749
## Balanced Accuracy 0.63617 0.7489 0.6536
Support Vector Machine
# Train the SVM model
model <- svm(bin ~ city + price_sqft, data = train_data)
# Make predictions on the test set
predictions <- predict(model, test_data, type = "class")
# Calculate accuracy
accuracy_svm <- mean(predictions == test_data$bin)
# Create the confusion matrix
confusion_svm <- confusionMatrix(predictions, test_data$bin)
# Print the accuracy and confusion matrix
print(paste("Accuracy:", accuracy_svm))
## [1] "Accuracy: 0.630718954248366"
print(confusion_svm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low medium
## high 170 9 45
## low 48 240 92
## medium 85 60 169
##
## Overall Statistics
##
## Accuracy : 0.6307
## 95% CI : (0.5986, 0.662)
## No Information Rate : 0.3366
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4456
##
## Mcnemar's Test P-Value : 6.477e-10
##
## Statistics by Class:
##
## Class: high Class: low Class: medium
## Sensitivity 0.5611 0.7767 0.5523
## Specificity 0.9122 0.7701 0.7631
## Pos Pred Value 0.7589 0.6316 0.5382
## Neg Pred Value 0.8084 0.8717 0.7732
## Prevalence 0.3301 0.3366 0.3333
## Detection Rate 0.1852 0.2614 0.1841
## Detection Prevalence 0.2440 0.4139 0.3420
## Balanced Accuracy 0.7366 0.7734 0.6577
Comparison Table for the Classification For the classification task, a comparison table was created to evaluate different models.
#install.packages("tidyverse")
#install.packages("kableExtra")
library(tidyverse)
library(kableExtra)
# Create a data frame for the comparison
comparison_table <- data.frame(Model = c("Decision Tree", "Random Forest", "Naive Bayes", "SVM"), Accuracy = c(accuracy_dt, accuracy_rf, accuracy_nb, accuracy_svm), stringsAsFactors = FALSE)
# Find the index of the model with the highest accuracy
highest_index <- which.max(comparison_table$Accuracy)
# Get the model name with the highest accuracy
highest_model <- comparison_table$Model[highest_index]
# Print the model with the highest accuracy and the comparison table
print(comparison_table)
## Model Accuracy
## 1 Decision Tree 0.8311547
## 2 Random Forest 0.9618736
## 3 Naive Bayes 0.5740741
## 4 SVM 0.6307190
print(paste("Model with the highest accuracy:", highest_model))
## [1] "Model with the highest accuracy: Random Forest"
The analysis revealed that the random forest model achieved the highest accuracy compared to other models such as decision trees, SVM, and Naïve Bayes. Thus, the random forest model is recommended for accurately classifying housing prices into categories such as ‘High,’ ‘Low,’ and ‘Medium’.
This analysis provides valuable insights into the real estate market in Washington, enabling informed decision-making, policy formulation, and a deeper understanding of the economic landscape and development trends in the region.
Plot univariate linear regression between sqft_living and price
ggplot(data.clean3,aes(y=price,x=sqft_living)) +
geom_point() +
xlim(0, 9000) +
ylim(0, 5000000) +
geom_smooth(formula = y ~ x,method="lm")
Multiple linear regression
linearmodel = lm(price~bedrooms + sqft_living + sqft_lot + condition + view ,
data = data.clean3)
summary(linearmodel)
##
## Call:
## lm(formula = price ~ bedrooms + sqft_living + sqft_lot + condition +
## view, data = data.clean3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1490859 -143120 -23176 90341 26258308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.676e+04 4.684e+04 -0.998 0.318135
## bedrooms -5.085e+04 1.024e+04 -4.965 7.11e-07 ***
## sqft_living 2.753e+02 1.030e+01 26.715 < 2e-16 ***
## sqft_lot -7.815e-01 2.098e-01 -3.725 0.000198 ***
## condition 5.307e+04 1.094e+04 4.850 1.27e-06 ***
## view 7.136e+04 1.002e+04 7.123 1.22e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 497600 on 4592 degrees of freedom
## Multiple R-squared: 0.2136, Adjusted R-squared: 0.2128
## F-statistic: 249.5 on 5 and 4592 DF, p-value: < 2.2e-16
# Create a vector of selected column names
cat_cols <- c("price", "sqft_living")
# Extract target variable (y)
y <- data.clean3$price
# Extract input features (X)
X <- data.clean3
X$price <- NULL
# Load required libraries
library(caret)
# Set the seed for reproducibility
set.seed(1)
# Split the data into training and validation sets
split <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[split, ]
X_val <- X[-split, ]
y_train <- y[split]
y_val <- y[-split]
Linear Regression
# Create Linear Regression model
lin_reg <- lm(y_train ~ ., data = cbind(y_train, X_train))
# Make predictions on the validation set
predictions <- predict(lin_reg, newdata = X_val)
# Calculate R-squared
lin_r_squared <- cor(predictions, y_val)^2
print(paste("R-squared:", lin_r_squared))
## [1] "R-squared: 0.556115001988202"
# Calculate RMSLE
lin_rmsle <- sqrt(mean((log(predictions + 1) - log(y_val + 1))^2))
print(paste("RMSLE:", lin_rmsle))
## [1] "RMSLE: 0.373956476460525"
Ridge Regression
#install.packages("glmnet")
# Load required libraries
library(glmnet)
library(caret)
# Create Ridge regression model
ridge <- glmnet(as.matrix(X_train), y_train, alpha = 0)
# Make predictions on the validation set
predictions <- predict(ridge, as.matrix(X_val))
# Calculate R-squared
ridge_r_squared <- cor(predictions, y_val)^2
## Warning in cor(predictions, y_val): the standard deviation is zero
ridge_r_squared <-mean(ridge_r_squared, na.rm = TRUE)
print(paste("R-squared:", ridge_r_squared))
## [1] "R-squared: 0.481397949234316"
# Calculate RMSLE
ridge_rmsle <- sqrt(mean((log(predictions + 1) - log(y_val + 1))^2))
print(paste("RMSLE:", ridge_rmsle))
## [1] "RMSLE: 0.474976573511914"
XGBoost
#install.packages("xgboost")
# Load required libraries
library(xgboost)
library(caret)
# Create XGBoost regression model
xgb <- xgboost(data = as.matrix(X_train), label = y_train, nrounds = 1000, eta = 0.01, verbose = FALSE)
# Make predictions on the validation set
predictions <- predict(xgb, newdata = as.matrix(X_val))
# Calculate R-squared
xgb_r_squared <- cor(predictions, y_val)^2
print(paste("R-squared:", xgb_r_squared))
## [1] "R-squared: 0.0448088378543971"
# Calculate RMSLE
xgb_rmsle <- sqrt(mean((log(predictions + 1) - log(y_val + 1))^2))
print(paste("RMSLE:", xgb_rmsle))
## [1] "RMSLE: 0.397975817756442"
Evaluation Upon evaluating our regression models, we focused on two key metrics: R-squared (R2) and RMSLE (Root Mean Squared Logarithmic Error).
R2 measures the proportion of variance explained by the model, while RMSLE measures the average magnitude of prediction errors on a logarithmic scale.
Higher R2 values suggest a better fit and indicate how well the model captures the variability in the target variable, while lower RMSLE values indicate better accuracy in predicting housing prices.
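For reference, the two metrics can be written as small helper functions that mirror the inline calculations used in this report (a sketch; the function and argument names are placeholders).
# Sketch: metric helpers mirroring the calculations used in this report
r_squared <- function(actual, predicted) cor(predicted, actual)^2
rmsle <- function(actual, predicted) sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
# Example usage: r_squared(y_val, predictions); rmsle(y_val, predictions)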
# Create a data frame with the regression models results
regression_results <- data.frame(
Model = c("Linear Regression", "XGBoost Regression", "Ridge Regression"),
R_squared = c(lin_r_squared, xgb_r_squared, ridge_r_squared),
RMSLE = c(lin_rmsle, xgb_rmsle, ridge_rmsle)
)
# Print the table
print(regression_results)
## Model R_squared RMSLE
## 1 Linear Regression 0.55611500 0.3739565
## 2 XGBoost Regression 0.04480884 0.3979758
## 3 Ridge Regression 0.48139795 0.4749766
Based on the evaluation of the three models, the linear regression model stands out as the top performer with the highest R2 value and the lowest RMSLE value. As a result, the linear regression model is the preferred choice for accurately predicting housing prices in this particular scenario.
Further recommendations to improve the models and enhance results While the linear regression model performed the best among the three models, achieving an R2 value of 0.556 and an RMSLE value of 0.374, there is still room for improvement.
Upon analyzing the correlation matrix and evaluating the regression models, it became evident that the dataset lacked significant correlations among the attributes. The models’ relatively low R2 values suggest they explain only a moderate amount of variability in housing prices. Additionally, the RMSLE values indicate some degree of prediction error.
To enhance the models and address these limitations, we recommend the following (a brief sketch illustrating the first and third points follows the list):
Feature Engineering: Explore additional relevant features or create new meaningful features that better capture the relationship with housing prices.
Model Selection and Tuning: Experiment with different regression models, ensemble methods, or variations of the existing models.
Data Quality and Preprocessing: Apply suitable preprocessing techniques such as feature scaling or normalization to improve model performance.
Dataset Expansion: Expand the dataset by incorporating more diverse and comprehensive data.
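The sketch below illustrates the feature engineering and preprocessing recommendations under simple assumptions: it derives a house age (relative to the 2014 sale year in this dataset) and a renovation indicator, and then centres and scales the predictors with caret's preProcess; the new column and object names are hypothetical.
# Sketch: simple feature engineering plus centring/scaling of the predictors
library(dplyr)
library(caret)
engineered <- data.clean2 %>%
  mutate(house_age = 2014 - yr_built,               # age at the (approximate) sale year
         renovated = as.integer(yr_renovated > 0))  # 1 if ever renovated, else 0
pre_proc <- preProcess(engineered %>% select(-price), method = c("center", "scale"))
scaled_features <- predict(pre_proc, engineered %>% select(-price))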
By implementing these recommendations, we can address the limitations of our current models and enhance the accuracy and reliability of our predictions.
In conclusion, the analysis aimed to predict the sales price of houses in Washington. The data cleaning process involved checking for missing values, outliers, and assessing normality. Exploratory Data Analysis (EDA) was performed to analyze the characteristics, patterns, and relationships present in the data.
The classification analysis revealed that the random forest model achieved the highest accuracy compared to other models such as decision trees, SVM, and Naïve Bayes. Thus, the random forest model is recommended for accurately classifying housing prices into categories such as ‘High,’ ‘Low,’ and ‘Medium’.
Based on the regression analysis, the best option for predicting house prices is the linear regression model. However, it achieves only a moderate level of accuracy, so further improvement is needed.