Introduction

The housing market is an important part of any economy, and understanding housing price dynamics is critical for policy makers, economists, and potential home buyers. The purpose of this study is to analyze housing prices in Washington, USA, examine relevant indicators that may affect them, gain an in-depth understanding of the factors that shape the real estate market in this area, and apply machine learning and deep learning techniques to better predict housing price fluctuations.

Objectives

  1. To classify housing prices (High, Medium, and Low) based on their attributes.
  2. To predict housing prices using a regression model.

Data Understanding

Data Source: https://www.kaggle.com/datasets/shree1992/housedata

# Read the dataset
house_data <- read.csv("housing_data.csv")

# Examine the structure of the dataset
str(house_data)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

Data Description and Exploration

The provided dataset consists of 4,600 observations (rows) and 18 variables (columns). Here is a summary of the variables:

Variable Name Description Type
date Date and time of the house sale Character
price The sale price of the house Numeric
bedrooms The number of bedrooms in the house Numeric
bathrooms The number of bathrooms in the house Numeric
sqft_living The square footage of the house’s living area Integer
sqft_lot The total square footage of the house’s lot Integer
floors The number of floors in the house Numeric
waterfront An indicator variable (0 or 1) indicating whether the house has a waterfront view Integer
view A rating of the house’s view quality Integer
condition A rating of the overall condition of the house Integer
sqft_above The square footage of the house above ground level Integer
sqft_basement The square footage of the house’s basement Integer
yr_built The year the house was built Integer
yr_renovated The year the house was last renovated Integer
street The street address of the house Character
city The city where the house is located Character
statezip The state and ZIP code of the house’s location Character
country The country where the house is located (presumably all entries are “USA”). Character


The dataset includes a mix of numerical and categorical variables. The numerical variables represent measurements such as price, square footage, and ratings, while the categorical variables include information such as the house’s address and location.

This dataset provides valuable information for analyzing the housing market, exploring factors that influence house prices, and gaining insights into housing trends in Washington, USA.

# Check for missing values
sum(is.na(house_data))
## [1] 0
# Identify invalid data points
invalid_data <- house_data[house_data$price <= 0 | house_data$bathrooms <= 0, ]
str(invalid_data)
## 'data.frame':    51 obs. of  18 variables:
##  $ date         : chr  "2014-06-12 00:00:00" "2014-06-24 00:00:00" "2014-05-05 00:00:00" "2014-05-05 00:00:00" ...
##  $ price        : num  1095000 1295648 0 0 0 ...
##  $ bedrooms     : num  0 0 3 4 6 5 5 4 2 4 ...
##  $ bathrooms    : num  0 0 1.75 2.75 2.75 3.5 1.5 4 2.5 2.25 ...
##  $ sqft_living  : int  3064 4810 1490 2600 3200 3480 1500 3680 2200 2170 ...
##  $ sqft_lot     : int  4764 28008 10125 5390 9200 36615 7112 18804 188200 10500 ...
##  $ floors       : num  3.5 2 1 1 1 2 1 2 1 1 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  2 0 0 0 2 0 0 0 3 2 ...
##  $ condition    : int  3 3 4 4 4 4 5 3 3 4 ...
##  $ sqft_above   : int  3064 4810 1490 1300 1600 2490 760 3680 2200 1270 ...
##  $ sqft_basement: int  0 0 0 1300 1600 990 740 0 0 900 ...
##  $ yr_built     : int  1990 1990 1962 1960 1953 1983 1920 1990 2007 1960 ...
##  $ yr_renovated : int  2009 2009 0 2001 1983 0 0 2009 0 2001 ...
##  $ street       : chr  "814 E Howe St" "20418 NE 64th Pl" "3911 S 328th St" "2120 31st Ave W" ...
##  $ city         : chr  "Seattle" "Redmond" "Federal Way" "Seattle" ...
##  $ statezip     : chr  "WA 98102" "WA 98053" "WA 98001" "WA 98199" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

Data Quality Verification

A completeness check of the dataset confirms that there are no missing values, indicating good data completeness.

To ensure data accuracy, we have established two important criteria for identifying invalid data points: a price less than or equal to zero and a number of bathrooms less than or equal to zero.

Invalid Price: It is important to note that any data point with a price equal to or less than zero is considered invalid. This indicates a data issue as it is highly unlikely for a property’s price to be zero or negative. These occurrences may be due to data entry errors or missing values.

Invalid Bathrooms: Similarly, any data point with a number of bathrooms equal to or less than zero is considered invalid. This is because properties are expected to have at least one bathroom. Zero or negative values for the number of bathrooms are unrealistic and do not align with standard real estate practices.

In total, 51 observations violate these criteria. Detecting and addressing these issues is crucial to maintaining the overall quality and reliability of the dataset.
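
As a quick cross-check, the two rules can also be tallied separately; a minimal sketch using the same criteria as above:

# Count the rows violating each rule separately, and the rows violating both (sketch)
sum(house_data$price <= 0)                              # zero or negative sale price
sum(house_data$bathrooms <= 0)                          # zero (or negative) bathrooms
sum(house_data$price <= 0 & house_data$bathrooms <= 0)  # both rules violated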

Data Cleaning

In this step, data cleaning is performed to identify and fix structural errors and remove unwanted observations.

library(ggplot2); library(dplyr); library(tidyr)
options(repr.plot.width = 12, repr.plot.height = 8)
summary(house_data)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Check and Remove Illogical values (Price)

# Check whether there is 0 value in house price column
house_data$price[(house_data$price == 0)]
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0

Since a house price of 0 is illogical, these values likely reflect errors in the raw data. We therefore replace the 0 values in the price column with the average house price.

# Replace 0 house price with NA
house_data <- house_data %>% mutate_at(c("price"), ~na_if(., 0))
house_data$price[(house_data$price == 0)]
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# Replace the NA with mean of house price
house_data <- house_data %>% mutate(price = replace_na(price, mean(price, na.rm = TRUE)))
# To check and confirm that the min of house price is not 0 anymore, and the mean value has changed as well.
summary(house_data)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :    7800   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  328159   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  468750   Median :3.000   Median :2.250  
##                     Mean   :  557906   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Check and Remove Illogical values (Bedroom and Bathroom)
Since it is illogical for a house to have 0 bathrooms or 0 bedrooms, which would make it uninhabitable, these rows are removed. Out of the 4,600 rows of data, 2 houses have neither a bedroom nor a bathroom.

# Remove the rows that have no bedrooms or no bathrooms
data.clean <- house_data[house_data$bedrooms != 0 & house_data$bathrooms != 0, ]

Check for outliers

# Set the plotting layout to 3 rows and 3 columns
par(mfrow = c(3, 3))

# Create the boxplots
boxplot(data.clean$price, main = "Price")
boxplot(data.clean$bedrooms, main = "Bedrooms")
boxplot(data.clean$bathrooms, main = "Bathrooms")
boxplot(data.clean$sqft_living, main = "sqft_living")
boxplot(data.clean$sqft_lot, main = "sqft_lot")
boxplot(data.clean$floors, main = "floors")
boxplot(data.clean$waterfront, main = "waterfront")
boxplot(data.clean$condition, main = "condition")
boxplot(data.clean$view, main = "view")


Based on the boxplots, outliers can be observed for most of the variables. However, we decided not to drop them: a large number of points would be affected, and these observations may carry meaningful information, so removing them all could produce inaccurate and unreliable results.
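
To get a rough sense of how many points would be affected, a minimal sketch of the usual 1.5 x IQR rule is shown below (illustrative only; these observations are kept in the analysis):

# Count potential outliers per variable using the 1.5*IQR rule (sketch, not applied)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(data.clean[, c("price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot")], count_outliers)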

Data Analysis & Result

Exploratory Data Analysis:

Feature Selection

# Feature Engineering and Selection stage
colnames(data.clean)
##  [1] "date"          "price"         "bedrooms"      "bathrooms"    
##  [5] "sqft_living"   "sqft_lot"      "floors"        "waterfront"   
##  [9] "view"          "condition"     "sqft_above"    "sqft_basement"
## [13] "yr_built"      "yr_renovated"  "street"        "city"         
## [17] "statezip"      "country"
# Drop "date" "street" "statezip" "country" as the variables do not provide much information for our analysis
data.clean2 <- data.clean[,-c(1,15,17,18)]
head(data.clean2)
##     price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1  313000        3      1.50        1340     7912    1.5          0    0
## 2 2384000        5      2.50        3650     9050    2.0          0    4
## 3  342000        3      2.00        1930    11947    1.0          0    0
## 4  420000        3      2.25        2000     8030    1.0          0    0
## 5  550000        4      2.50        1940    10500    1.0          0    0
## 6  490000        2      1.00         880     6380    1.0          0    0
##   condition sqft_above sqft_basement yr_built yr_renovated      city
## 1         3       1340             0     1955         2005 Shoreline
## 2         5       3370           280     1921            0   Seattle
## 3         4       1930             0     1966            0      Kent
## 4         4       1000          1000     1963            0  Bellevue
## 5         4       1140           800     1976         1992   Redmond
## 6         3        880             0     1938         1994   Seattle
# Fit a multiple linear regression to check the statistical significance of each feature
# mod.fit regresses price on all remaining predictors; city enters the model as dummy variables

colnames(data.clean2)
##  [1] "price"         "bedrooms"      "bathrooms"     "sqft_living"  
##  [5] "sqft_lot"      "floors"        "waterfront"    "view"         
##  [9] "condition"     "sqft_above"    "sqft_basement" "yr_built"     
## [13] "yr_renovated"  "city"
mod.fit <- lm(price~., data=data.clean2)
summary(mod.fit)$coefficients[-grep("city", rownames(summary(mod.fit)$coefficients)),]
##                   Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)   1.511357e+06 7.918313e+05  1.9086862 5.636554e-02
## bedrooms     -3.825739e+04 1.028341e+04 -3.7203020 2.013883e-04
## bathrooms     4.959504e+04 1.666110e+04  2.9766973 2.929087e-03
## sqft_living   1.557970e+02 2.159427e+01  7.2147380 6.301081e-13
## sqft_lot     -1.763398e-01 2.202255e-01 -0.8007237 4.233335e-01
## floors       -1.492711e+04 1.985953e+04 -0.7516349 4.523096e-01
## waterfront    4.762616e+05 9.262745e+04  5.1416890 2.837238e-07
## view          5.112173e+04 1.072698e+04  4.7657154 1.940603e-06
## condition     3.603079e+04 1.283701e+04  2.8067893 5.025116e-03
## sqft_above    9.271633e+01 2.229666e+01  4.1583052 3.265304e-05
## yr_built     -8.935130e+02 3.793060e+02 -2.3556522 1.853249e-02
## yr_renovated  5.005525e+00 8.368514e+00  0.5981378 5.497778e-01

From the multiple linear regression results, the predictors with Pr(>|t|) less than 0.05 are statistically significant and are retained. The remaining variables are excluded from the later analysis as statistically insignificant (Pr(>|t|) greater than 0.05).

# Drop floors, sqft_above, sqft_basement, and yr_renovated, then refit the model
mod.fit.1 <- lm(price~., data=data.clean2[, -c(6,10,11,13)])
summary(mod.fit.1)$coefficients[-grep("city", rownames(summary(mod.fit.1)$coefficients)),]
##                  Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)  1.518233e+06 7.283978e+05  2.0843465 3.718435e-02
## bedrooms    -4.223986e+04 1.024939e+04 -4.1212058 3.836065e-05
## bathrooms    4.647481e+04 1.617404e+04  2.8734203 4.079359e-03
## sqft_living  2.273048e+02 1.349824e+01 16.8395837 9.008250e-62
## sqft_lot    -1.480951e-01 2.203428e-01 -0.6721125 5.015462e-01
## waterfront   4.736047e+05 9.276860e+04  5.1052265 3.438931e-07
## view         4.404886e+04 1.061941e+04  4.1479582 3.415773e-05
## condition    2.569416e+04 1.178850e+04  2.1795961 2.933856e-02
## yr_built    -8.623646e+02 3.462448e+02 -2.4906215 1.278741e-02

The overall model is statistically significant (the F-test p-value is below 0.05), and most of the retained predictors are individually significant at the 0.05 level.
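
The overall significance can be confirmed from the model's F-statistic; a minimal sketch of extracting its p-value:

# Extract the overall F-test p-value for the reduced model (sketch)
fstat <- summary(mod.fit.1)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)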

# Convert the "city" column to numeric labels
city_labels <- as.numeric(as.factor(data.clean2$city))
data.clean2$city <- city_labels

head(data.clean2)
##     price bedrooms bathrooms sqft_living sqft_lot floors waterfront view
## 1  313000        3      1.50        1340     7912    1.5          0    0
## 2 2384000        5      2.50        3650     9050    2.0          0    4
## 3  342000        3      2.00        1930    11947    1.0          0    0
## 4  420000        3      2.25        2000     8030    1.0          0    0
## 5  550000        4      2.50        1940    10500    1.0          0    0
## 6  490000        2      1.00         880     6380    1.0          0    0
##   condition sqft_above sqft_basement yr_built yr_renovated city
## 1         3       1340             0     1955         2005   37
## 2         5       3370           280     1921            0   36
## 3         4       1930             0     1966            0   19
## 4         4       1000          1000     1963            0    4
## 5         4       1140           800     1976         1992   32
## 6         3        880             0     1938         1994   36
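
Note that label encoding assigns an arbitrary numeric order to the cities, which tree-based models tolerate but linear models interpret as a magnitude. A possible alternative (not used in this analysis) would be one-hot encoding of the original character column, sketched below:

# Alternative (sketch, not used here): one-hot encode the original character "city" column
# city_dummies <- model.matrix(~ city - 1, data = data.clean)   # one 0/1 indicator per city
# data.onehot  <- cbind(data.clean2[, names(data.clean2) != "city"], city_dummies)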

Check for Correlation

# Cleaned dataset
data.clean3 <- data.clean2[, -c(6,10,11,13)]
corr <- round(cor(data.clean3), 2)

Pearson Correlation

Perform Pearson Correlation to check if there is a linear relationship between price (dependent variable) and the independent variables.

#install.packages("reshape2")

# correlation heatmap with ggplot2
library(reshape2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(corr){
  corr[upper.tri(corr)] <- NA
  return(corr)
}

# Get upper triangle of the correlation matrix
get_upper_tri <- function(corr){
  corr[lower.tri(corr)]<- NA
  return(corr)
}

upper_tri <- get_upper_tri(corr)
upper_tri
##             price bedrooms bathrooms sqft_living sqft_lot waterfront view
## price           1     0.21      0.34        0.44     0.05       0.14 0.24
## bedrooms       NA     1.00      0.54        0.60     0.07       0.00 0.11
## bathrooms      NA       NA      1.00        0.77     0.11       0.08 0.21
## sqft_living    NA       NA        NA        1.00     0.21       0.12 0.31
## sqft_lot       NA       NA        NA          NA     1.00       0.02 0.07
## waterfront     NA       NA        NA          NA       NA       1.00 0.36
## view           NA       NA        NA          NA       NA         NA 1.00
## condition      NA       NA        NA          NA       NA         NA   NA
## yr_built       NA       NA        NA          NA       NA         NA   NA
## city           NA       NA        NA          NA       NA         NA   NA
##             condition yr_built  city
## price            0.04     0.02  0.02
## bedrooms         0.02     0.14 -0.13
## bathrooms       -0.12     0.47 -0.10
## sqft_living     -0.06     0.29 -0.11
## sqft_lot         0.00     0.05 -0.08
## waterfront       0.00    -0.02  0.00
## view             0.06    -0.06  0.00
## condition        1.00    -0.40 -0.01
## yr_built           NA     1.00 -0.21
## city               NA       NA  1.00
# Melt the upper-triangle correlation matrix and drop the NA entries
melted_cormat <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

# Create a ggheatmap
ggheatmap <- ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1,1), space = "Lab",
                       name="Pearson\nCorrelation") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
                                   size = 12, hjust = 1))+
  coord_fixed()

# print a ggheatmap
print(ggheatmap)

# Add correlation coefficients on the heatmap
ggheatmap +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.border = element_blank(),
    panel.background = element_blank(),
    axis.ticks = element_blank(),
    legend.justification = c(1, 0),
    legend.position = c(0.6, 0.7),
    legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                               title.position = "top", title.hjust = 0.5))

From the Pearson correlation, the following can be observed (a quick way to rank these correlations is sketched below):
1. sqft_living shows a moderate positive correlation with price.
2. The number of bedrooms, the number of bathrooms, and the view rating show a weak positive correlation with price.
3. sqft_lot, waterfront, condition, and yr_built show a very weak positive correlation with price.
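
A compact way to obtain these figures is to sort the first row of the correlation matrix (a minimal sketch):

# Rank each feature's Pearson correlation with price (sketch)
sort(corr["price", ], decreasing = TRUE)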

Plot Scatterplot matrix

# A scatterplot matrix is most suitable for continuous variables, so it is plotted for price, sqft_living and sqft_lot.
# The variables are log10-transformed to reduce skewness and give a clearer view of the relationships.

pairs(~log10(price) + log10(sqft_living) + log10(sqft_lot) , data = data.clean3,
       main = "Scatterplot Matrix")

The scatterplot matrix shows a strong positive relationship between price and sqft_living, while the relationship between price and sqft_lot is much weaker.

Plot Q-Q Plot

# Plot q-q plot to check for normality

par(mfrow=c(3, 3))

qqnorm(log10(data.clean3$price), main="Q-Q Plot: Price", pch=1, frame=FALSE)
qqline(log10(data.clean3$price), col="green", lwd=2)

qqnorm(log10(data.clean3$sqft_living), main="Q-Q Plot: sqft_living", pch=1, frame=FALSE)
qqline(log10(data.clean3$sqft_living), col="blue", lwd=2)

qqnorm(log10(data.clean3$sqft_lot), main="Q-Q Plot: sqft_lot", pch=1, frame=FALSE)
qqline(log10(data.clean3$sqft_lot), col="orange", lwd=2)

qqnorm(log10(data.clean3$bedrooms), main="Q-Q Plot: bedrooms", pch=1, frame=FALSE)

qqnorm(log10(data.clean3$bathrooms), main="Q-Q Plot: bathrooms", pch=1, frame=FALSE)

qqnorm(data.clean3$waterfront, main="Q-Q Plot: waterfront", pch=1, frame=FALSE)

qqnorm(data.clean3$view, main="Q-Q Plot: view", pch=1, frame=FALSE)

qqnorm(data.clean3$condition, main="Q-Q Plot: condition", pch=1, frame=FALSE)


After the log transformation, most of the continuous features lie close to the Q-Q line (except sqft_lot), and the discrete variables are reasonably spread across the quantiles. Hence, most of the data can be treated as approximately normal.
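
As a complementary, non-visual check, a Shapiro-Wilk test could be run on the log-transformed continuous variables (a sketch; at this sample size the test is very sensitive, so the Q-Q plots remain the primary evidence):

# Shapiro-Wilk normality test on the log-transformed continuous variables (sketch)
shapiro.test(log10(data.clean3$price))
shapiro.test(log10(data.clean3$sqft_living))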

Relationship between categorical variables and house price

# Check for relationship between categorical variables and house price
par(mfrow=c(3, 2))

boxplot(log10(price)~bedrooms, data=data.clean3,
        main="log US House Price vs Number of Bedrooms",
        xlab="bedrooms",
        ylab="log US House Price (log10 USD)",
        col="blue")

boxplot(log10(price)~bathrooms, data=data.clean3,
        main="log US House Price vs Number of Bathrooms",
        xlab="bathrooms",
        ylab="log US House Price (log10 USD)",
        col="yellow")

boxplot(log10(price)~waterfront, data=data.clean3,
        main="log US House Price vs Waterfront",
        xlab="waterfront",
        ylab="log US House Price (log10 USD)",
        col="green")

boxplot(log10(price)~view, data=data.clean3,
        main="log US House Price vs View",
        xlab="view",
        ylab="log US House Price (log10 USD)",
        col="pink")

boxplot(log10(price)~condition, data=data.clean3,
        main="log US House Price vs Condition",
        xlab="condition",
        ylab="log US House Price (log10 USD)",
        col="orange")


From the charts above, it can be observed that the house price increases as the number of bedrooms and bathrooms increases (seen from the rising median lines). Houses with a waterfront, a better view, or a better condition rating also tend to have higher prices.
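
The rising medians can also be confirmed numerically; a minimal dplyr sketch for the bedroom counts:

# Median log10 house price by number of bedrooms, confirming the trend in the boxplots (sketch)
data.clean3 %>%
  group_by(bedrooms) %>%
  summarise(median_log_price = median(log10(price)), n = n()) %>%
  arrange(bedrooms)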

Objective 1: Classification Modeling

library("dplyr")
# Extract the desired columns
classification_data <- data.clean3 %>% select(price, sqft_living, city)
# Remove rows with null values or zero as a value
classification_data <- classification_data[complete.cases(classification_data) & !classification_data$price == 0 & !classification_data$sqft_living == 0, ]
# Add the "price/sqft" column after the "sqft" column
classification_data <- within(classification_data, price_sqft <- price / sqft_living)
# Group the dataset by "city" column
grouped_data <- classification_data %>% group_by(city)

# Create three equal-sized bins (tertiles) of price_sqft within each city group
grouped_data <- grouped_data %>% mutate(bin = ntile(price_sqft, 3))

# Map the bin values to corresponding labels
grouped_data$bin <- as.factor(case_when(
  grouped_data$bin == 1 ~ "low",
  grouped_data$bin == 2 ~ "medium",
  grouped_data$bin == 3 ~ "high"
))
#install.packages("caret")
library("caret")
## Loading required package: lattice
# Split the grouped_data into training and testing datasets
set.seed(123)  # For reproducibility
train_indices <- createDataPartition(grouped_data$bin, p = 0.8, list = FALSE)
train_data <- grouped_data[train_indices, ]
test_data <- grouped_data[-train_indices, ]

Decision Tree

#install.packages("rpart")
library("rpart")
# Train a decision tree model
model <- rpart(bin ~ city + price_sqft, data = train_data, method = "class")

# Make predictions on the test set
predictions <- predict(model, test_data, type = "class")

# Calculate accuracy
accuracy_dt <- mean(predictions == test_data$bin)

# Create the confusion matrix
confusion_dt <- confusionMatrix(predictions, test_data$bin)

# Print the accuracy and confusion matrix
print(paste("Accuracy:", accuracy_dt))
## [1] "Accuracy: 0.831154684095861"
print(confusion_dt)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low medium
##     high    247  13     38
##     low      29 281     33
##     medium   27  15    235
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8312          
##                  95% CI : (0.8053, 0.8548)
##     No Information Rate : 0.3366          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7467          
##                                           
##  Mcnemar's Test P-Value : 0.002085        
## 
## Statistics by Class:
## 
##                      Class: high Class: low Class: medium
## Sensitivity               0.8152     0.9094        0.7680
## Specificity               0.9171     0.8982        0.9314
## Pos Pred Value            0.8289     0.8192        0.8484
## Neg Pred Value            0.9097     0.9513        0.8892
## Prevalence                0.3301     0.3366        0.3333
## Detection Rate            0.2691     0.3061        0.2560
## Detection Prevalence      0.3246     0.3736        0.3017
## Balanced Accuracy         0.8661     0.9038        0.8497

Random Forest

#install.packages("randomForest")
library("randomForest")
# Train a Random Forest Model
model <- randomForest(bin ~ city + price_sqft, data = train_data, ntree = 100)

# Make predictions on the test set
predictions <- predict(model, test_data, type = "class")

# Calculate accuracy
accuracy_rf <- mean(predictions == test_data$bin)

# Create the confusion matrix
confusion_rf <- confusionMatrix(predictions, test_data$bin)

# Print the accuracy and confusion matrix
print(paste("Accuracy:", accuracy_rf))
## [1] "Accuracy: 0.961873638344227"
print(confusion_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low medium
##     high    293   3      7
##     low       1 299      8
##     medium    9   7    291
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9619          
##                  95% CI : (0.9474, 0.9733)
##     No Information Rate : 0.3366          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9428          
##                                           
##  Mcnemar's Test P-Value : 0.7252          
## 
## Statistics by Class:
## 
##                      Class: high Class: low Class: medium
## Sensitivity               0.9670     0.9676        0.9510
## Specificity               0.9837     0.9852        0.9739
## Pos Pred Value            0.9670     0.9708        0.9479
## Neg Pred Value            0.9837     0.9836        0.9755
## Prevalence                0.3301     0.3366        0.3333
## Detection Rate            0.3192     0.3257        0.3170
## Detection Prevalence      0.3301     0.3355        0.3344
## Balanced Accuracy         0.9754     0.9764        0.9624

Naive Bayes

#install.packages("e1071")

library("e1071")
# Train the Naive Bayes model with Laplace smoothing
model <- naiveBayes(bin ~ city + price_sqft, data = train_data, laplace = 1)

# Make predictions on the test set
predictions <- predict(model, test_data, type = "class")

# Calculate accuracy
accuracy_nb <- mean(predictions == test_data$bin)

# Create the confusion matrix
confusion_nb <- confusionMatrix(predictions, test_data$bin)

# Print the accuracy and confusion matrix
print(paste("Accuracy:", accuracy_nb))
## [1] "Accuracy: 0.574074074074074"
print(confusion_nb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low medium
##     high     84   2      1
##     low      63 235     97
##     medium  156  72    208
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5741          
##                  95% CI : (0.5413, 0.6063)
##     No Information Rate : 0.3366          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3601          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: high Class: low Class: medium
## Sensitivity              0.27723     0.7605        0.6797
## Specificity              0.99512     0.7373        0.6275
## Pos Pred Value           0.96552     0.5949        0.4771
## Neg Pred Value           0.73646     0.8585        0.7967
## Prevalence               0.33007     0.3366        0.3333
## Detection Rate           0.09150     0.2560        0.2266
## Detection Prevalence     0.09477     0.4303        0.4749
## Balanced Accuracy        0.63617     0.7489        0.6536

Support Vector Machine

# Train the SVM model
model <- svm(bin ~ city + price_sqft, data = train_data)

# Make predictions on the test set
predictions <- predict(model, test_data, type = "class")

# Calculate accuracy
accuracy_svm <- mean(predictions == test_data$bin)

# Create the confusion matrix
confusion_svm <- confusionMatrix(predictions, test_data$bin)

# Print the accuracy and confusion matrix
print(paste("Accuracy:", accuracy_svm))
## [1] "Accuracy: 0.630718954248366"
print(confusion_svm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low medium
##     high    170   9     45
##     low      48 240     92
##     medium   85  60    169
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6307         
##                  95% CI : (0.5986, 0.662)
##     No Information Rate : 0.3366         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4456         
##                                          
##  Mcnemar's Test P-Value : 6.477e-10      
## 
## Statistics by Class:
## 
##                      Class: high Class: low Class: medium
## Sensitivity               0.5611     0.7767        0.5523
## Specificity               0.9122     0.7701        0.7631
## Pos Pred Value            0.7589     0.6316        0.5382
## Neg Pred Value            0.8084     0.8717        0.7732
## Prevalence                0.3301     0.3366        0.3333
## Detection Rate            0.1852     0.2614        0.1841
## Detection Prevalence      0.2440     0.4139        0.3420
## Balanced Accuracy         0.7366     0.7734        0.6577

Comparison Table for the Classification
For the classification task, a comparison table was created to evaluate different models.

#install.packages("tidyverse")
#install.packages("kableExtra")
library(tidyverse)
library(kableExtra)
# Create a data frame for the comparison
comparison_table <- data.frame(Model = c("Decision Tree", "Random Forest", "Naive Bayes", "SVM"), Accuracy = c(accuracy_dt, accuracy_rf, accuracy_nb, accuracy_svm), stringsAsFactors = FALSE)

# Find the index of the model with the highest accuracy
highest_index <- which.max(comparison_table$Accuracy)

# Get the model name with the highest accuracy
highest_model <- comparison_table$Model[highest_index]

# Print the model with the highest accuracy and the comparison table
print(comparison_table)
##           Model  Accuracy
## 1 Decision Tree 0.8311547
## 2 Random Forest 0.9618736
## 3   Naive Bayes 0.5740741
## 4           SVM 0.6307190
print(paste("Model with the highest accuracy:", highest_model))
## [1] "Model with the highest accuracy: Random Forest"

The analysis revealed that the random forest model achieved the highest accuracy compared to other models such as decision trees, SVM, and Naïve Bayes. Thus, the random forest model is recommended for accurately classifying housing prices into categories such as ‘High,’ ‘Low,’ and ‘Medium’.

This analysis provides valuable insights into the real estate market in Washington, enabling informed decision-making, policy formulation, and a deeper understanding of the economic landscape and development trends in the region.

Objective 2: Regression Modeling

Plot univariate linear regression between sqft_living and price

ggplot(data.clean3,aes(y=price,x=sqft_living)) +
       geom_point() +
        xlim(0, 9000) +
        ylim(0, 5000000) +
        geom_smooth(formula = y ~ x,method="lm")

Multiple linear regression

linearmodel = lm(price~bedrooms + sqft_living + sqft_lot + condition + view ,
                 data = data.clean3)
summary(linearmodel)
## 
## Call:
## lm(formula = price ~ bedrooms + sqft_living + sqft_lot + condition + 
##     view, data = data.clean3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1490859  -143120   -23176    90341 26258308 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.676e+04  4.684e+04  -0.998 0.318135    
## bedrooms    -5.085e+04  1.024e+04  -4.965 7.11e-07 ***
## sqft_living  2.753e+02  1.030e+01  26.715  < 2e-16 ***
## sqft_lot    -7.815e-01  2.098e-01  -3.725 0.000198 ***
## condition    5.307e+04  1.094e+04   4.850 1.27e-06 ***
## view         7.136e+04  1.002e+04   7.123 1.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 497600 on 4592 degrees of freedom
## Multiple R-squared:  0.2136, Adjusted R-squared:  0.2128 
## F-statistic: 249.5 on 5 and 4592 DF,  p-value: < 2.2e-16
# Create a vector of continuous column names
cat_cols <- c("price", "sqft_living")
# Extract target variable (y)
y <- data.clean3$price

# Extract input features (X)
X <- data.clean3
X$price <- NULL
# Load required libraries
library(caret)

# Set the seed for reproducibility
set.seed(1)

# Split the data into training and validation sets
split <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[split, ]
X_val <- X[-split, ]
y_train <- y[split]
y_val <- y[-split]

Linear Regression

# Create Linear Regression model
lin_reg <- lm(y_train ~ ., data = cbind(y_train, X_train))

# Make predictions on the validation set
predictions <- predict(lin_reg, newdata = X_val)

# Calculate R-squared
lin_r_squared <- cor(predictions, y_val)^2
print(paste("R-squared:", lin_r_squared))
## [1] "R-squared: 0.556115001988202"
# Calculate RMSLE
lin_rmsle <- sqrt(mean((log(predictions + 1) - log(y_val + 1))^2))
print(paste("RMSLE:", lin_rmsle))
## [1] "RMSLE: 0.373956476460525"

Ridge Regression

#install.packages("glmnet")
# Load required libraries
library(glmnet)
library(caret)
# Create Ridge regression model
ridge <- glmnet(as.matrix(X_train), y_train, alpha = 0)

# Make predictions on the validation set
predictions <- predict(ridge, as.matrix(X_val))

# Calculate R-squared
ridge_r_squared <- cor(predictions, y_val)^2
## Warning in cor(predictions, y_val): the standard deviation is zero
ridge_r_squared <-mean(ridge_r_squared, na.rm = TRUE)
print(paste("R-squared:", ridge_r_squared))
## [1] "R-squared: 0.481397949234316"
# Append method and performance to lists
ridge_rmsle <- sqrt(mean((log(predictions + 1) - log(y_val + 1))^2))
print(paste("RMSLE:", ridge_rmsle))
## [1] "RMSLE: 0.474976573511914"

XGBoost

#install.packages("xgboost")
# Load required libraries
library(xgboost)
library(caret)
# Create XGBoost regression model
xgb <- xgboost(data = as.matrix(X_train), label = y_train, nrounds = 1000, eta = 0.01, verbose = FALSE)

# Make predictions on the validation set
predictions <- predict(xgb, newdata = as.matrix(X_val))

# Calculate R-squared
xgb_r_squared <- cor(predictions, y_val)^2
print(paste("R-squared:", xgb_r_squared))
## [1] "R-squared: 0.0448088378543971"
# Calculate RMSLE
xgb_rmsle <- sqrt(mean((log(predictions + 1) - log(y_val + 1))^2))
print(paste("RMSLE:", xgb_rmsle))
## [1] "RMSLE: 0.397975817756442"

Evaluation
Upon evaluating our regression models, we focused on two key metrics: R-squared (R2) and RMSLE (Root Mean Squared Logarithmic Error).

R2 measures the proportion of variance explained by the model, while RMSLE evaluates the average magnitude of prediction errors on a logarithmic scale.

Higher R2 values suggest a better fit and indicate how well the model captures the variability in the target variable, while lower RMSLE values indicate better accuracy in predicting housing prices.
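
Both metrics can be written as small helper functions; the sketch below mirrors the calculations used for the models above (note that R-squared is computed here as the squared Pearson correlation between predictions and observed values):

# Helper functions mirroring the metric calculations used above (sketch)
r_squared <- function(pred, obs) cor(pred, obs)^2
rmsle     <- function(pred, obs) sqrt(mean((log(pred + 1) - log(obs + 1))^2))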

# Create a data frame with the regression models results
regression_results <- data.frame(
  Model = c("Linear Regression", "XGBoost Regression", "Ridge Regression"),
  R_squared = c(lin_r_squared, xgb_r_squared, ridge_r_squared),
  RMSLE = c(lin_rmsle, xgb_rmsle, ridge_rmsle)
)

# Print the table
print(regression_results)
##                Model  R_squared     RMSLE
## 1  Linear Regression 0.55611500 0.3739565
## 2 XGBoost Regression 0.04480884 0.3979758
## 3   Ridge Regression 0.48139795 0.4749766

Based on the evaluation of the three models, the linear regression model stands out as the top performer with the highest R2 value and the lowest RMSLE value. As a result, the linear regression model is the preferred choice for accurately predicting housing prices in this particular scenario.
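
For illustration, the fitted linear model could be applied to a hypothetical new listing; the feature values below are made up, and city = 36 corresponds to Seattle in the label encoding shown earlier:

# Sketch: predict the price of a hypothetical new listing with the fitted linear regression model
new_house <- data.frame(bedrooms = 3, bathrooms = 2, sqft_living = 1800, sqft_lot = 6000,
                        waterfront = 0, view = 0, condition = 4, yr_built = 1995, city = 36)
predict(lin_reg, newdata = new_house)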

Further recommendations to improve the models and enhance results
While the linear regression model performed the best among the three models, achieving an R2 value of 0.556 and an RMSLE value of 0.374, there is still room for improvement.

Upon analyzing the correlation matrix and evaluating the regression models, it became evident that most attributes are only weakly correlated with price. The models' relatively low R2 values suggest they explain only a moderate amount of the variability in housing prices, and the RMSLE values indicate some degree of prediction error.

To enhance the models and address these limitations, we recommend the following:

  1. Feature Engineering: Explore additional relevant features or create new meaningful features that better capture the relationship with housing prices (a brief sketch follows after this list).

  2. Model Selection and Tuning: Experiment with different regression models, ensemble methods, or variations of the existing models.

  3. Data Quality and Preprocessing: Apply suitable preprocessing techniques such as feature scaling or normalization to improve model performance.

  4. Dataset Expansion: Expand the dataset by incorporating more diverse and comprehensive data.

By implementing these recommendations, we can address the limitations of our current models and enhance the accuracy and reliability of our predictions.
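
For instance, recommendation 1 could start from simple derived features; a minimal sketch (the new variable names are illustrative):

# Sketch of recommendation 1: derive a few illustrative features from existing columns
data.fe <- data.clean2 %>%
  mutate(house_age      = 2014 - yr_built,               # age of the house at the 2014 sale dates
         ever_renovated = as.integer(yr_renovated > 0),  # indicator: has the house been renovated?
         log_price      = log10(price))                  # log-transformed target to reduce skewness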

Conclusion

In conclusion, the analysis aimed to predict the sale prices of houses in Washington. The data cleaning process involved checking for missing values and outliers and assessing normality. Exploratory Data Analysis (EDA) was performed to examine the characteristics, patterns, and relationships present in the data.


For the classification task, the random forest model achieved the highest accuracy among the models compared (decision tree, SVM, and Naïve Bayes), and it is therefore recommended for classifying housing prices into the ‘High,’ ‘Medium,’ and ‘Low’ categories.


Based on the regression analysis, the best option for predicting house prices is the linear regression model. However, it achieves only a moderate level of accuracy, so further improvement is needed.