1 Objective

The objective of this project is to predict housing prices in the Pittsburgh area based on historical data. Several models were fit and compared: linear regression, ridge regression, lasso regression, bagged trees, and random forests.

Mean squared error (MSE) was then used to determine which model performed best before predicting prices on the test set.
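
For reference, MSE is the average of the squared differences between predicted and observed prices. A minimal helper illustrating the calculation is sketched below; the mse function is for illustration only and is not reused later in this report.

#Mean squared error: average squared difference between predictions and observed values
mse <- function(pred, actual) {
  mean((pred - actual)^2)
}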

2 Data Exploration

2.1 Libraries

library(tidyverse)
library(forcats)
library(GGally)
library(glmnet)
library(caret)
library(randomForest)

2.2 Exploring the Data

Exploring the data will help us get a sense of both sets. We will check that neither set is missing data and convert character variables to factors so our models can use them.

#Read in training and test data
df.train <- read.csv("train.csv", header = TRUE)
df.test <- read.csv("test.csv", header = TRUE)

#Display the first 6 rows of the data frames
head(df.train)
##        id  price          desc numstories yearbuilt exteriorfinish rooftype
## 1 PA10258 432500 SINGLE FAMILY          2      1988          Brick  SHINGLE
## 2 PA10261  86000 SINGLE FAMILY          1      1931          Brick  SHINGLE
## 3 PA10698 875000 SINGLE FAMILY          1      1950          Stone    SLATE
## 4 PA10417 460000 SINGLE FAMILY          2      1900          Frame  SHINGLE
## 5 PA10452 900000 SINGLE FAMILY          1      1995          Frame  SHINGLE
## 6 PA10419  27000   MOBILE HOME          1      1968          Frame    METAL
##   basement totalrooms bedrooms bathrooms fireplaces sqft lotarea zipcode
## 1        1          9        5       2.5          1 4112   11400   15236
## 2        1          6        3       1.5          1 1080   10336   15037
## 3        1         10        3       3.0          2 3322   23950   15243
## 4        1          7        3       2.0          0 2323    1218   15212
## 5        1          3        1       1.0          0 1011  496671   15235
## 6        1          4        2       1.0          0  696   14767   15026
##   AvgIncome Location DistDowntown
## 1     46913  NotCity        7.695
## 2     55863  NotCity       18.970
## 3     60717  NotCity        7.224
## 4     26712 PartCity        2.168
## 5     41367 PartCity       10.963
## 6     69387  NotCity       23.041
head(df.test)
##        id price          desc numstories yearbuilt exteriorfinish rooftype
## 1 PA10723    NA SINGLE FAMILY        2.0      2000          Brick  SHINGLE
## 2 PA10773    NA SINGLE FAMILY        2.0      1994          Brick  SHINGLE
## 3 PA10366    NA SINGLE FAMILY        1.0      1951          Frame  SHINGLE
## 4 PA10915    NA SINGLE FAMILY        1.0      1960          Brick  SHINGLE
## 5 PA10898    NA SINGLE FAMILY        2.0      1931          Brick    SLATE
## 6 PA10890    NA SINGLE FAMILY        2.5      1910          Brick  SHINGLE
##   basement totalrooms bedrooms bathrooms fireplaces sqft lotarea zipcode
## 1        1          8        4       2.5          1 3864   14026   15025
## 2        1          8        4       3.5          1 2352   13262   15025
## 3        1          7        3       2.0          0 1366    2150   15238
## 4        1          7        3       1.5          0 1347    8013   15236
## 5        1         10        4       3.5          1 4531   22073   15228
## 6        1          9        4       2.0          0 4374   10500   15224
##   AvgIncome Location DistDowntown
## 1     60669  NotCity       13.094
## 2     60669  NotCity       13.094
## 3     67370  NotCity       11.876
## 4     46913  NotCity        7.695
## 5     58440  NotCity        7.011
## 6     22880     City        4.644
#Check the dimensions of the training set
dim(df.train)
## [1] 700  18
#Check to see if the set contains missing values
colSums(is.na(df.train))
##             id          price           desc     numstories      yearbuilt 
##              0              0              0              0              0 
## exteriorfinish       rooftype       basement     totalrooms       bedrooms 
##              0              0              0              0              0 
##      bathrooms     fireplaces           sqft        lotarea        zipcode 
##              0              0              0              0              0 
##      AvgIncome       Location   DistDowntown 
##              0              0              0
#See what data type each column is
str(df.train)
## 'data.frame':    700 obs. of  18 variables:
##  $ id            : chr  "PA10258" "PA10261" "PA10698" "PA10417" ...
##  $ price         : int  432500 86000 875000 460000 900000 27000 460000 80900 42500 1250000 ...
##  $ desc          : chr  "SINGLE FAMILY" "SINGLE FAMILY" "SINGLE FAMILY" "SINGLE FAMILY" ...
##  $ numstories    : num  2 1 1 2 1 1 2 2 2 2 ...
##  $ yearbuilt     : int  1988 1931 1950 1900 1995 1968 1962 1930 1900 1998 ...
##  $ exteriorfinish: chr  "Brick" "Brick" "Stone" "Frame" ...
##  $ rooftype      : chr  "SHINGLE" "SHINGLE" "SLATE" "SHINGLE" ...
##  $ basement      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ totalrooms    : int  9 6 10 7 3 4 8 6 4 13 ...
##  $ bedrooms      : int  5 3 3 3 1 2 4 3 1 4 ...
##  $ bathrooms     : num  2.5 1.5 3 2 1 1 3 2 1 6 ...
##  $ fireplaces    : int  1 1 2 0 0 0 0 0 0 2 ...
##  $ sqft          : int  4112 1080 3322 2323 1011 696 2540 1134 1024 6705 ...
##  $ lotarea       : int  11400 10336 23950 1218 496671 14767 12460 2500 816 87120 ...
##  $ zipcode       : int  15236 15037 15243 15212 15235 15026 15243 15227 15224 15044 ...
##  $ AvgIncome     : int  46913 55863 60717 26712 41367 69387 60717 38439 22880 95289 ...
##  $ Location      : chr  "NotCity" "NotCity" "NotCity" "PartCity" ...
##  $ DistDowntown  : num  7.7 18.97 7.22 2.17 10.96 ...

Luckily, we have no missing data in our training set. Since both sets contain character variables, we will convert them to factors so our models can handle them. We will also store the id column in its own variable.

#Convert character variables to factors in both train and test sets
df.train$desc <- factor(df.train$desc, levels = c("CONDOMINIUM", "MOBILE HOME", "MULTI-FAMILY", "ROWHOUSE",
                                             "SINGLE FAMILY"))
df.train <- df.train %>% mutate(exteriorfinish = as.factor(exteriorfinish))
df.train <- df.train %>% mutate(rooftype = as.factor(rooftype))
df.train <- df.train %>% mutate(Location = as.factor(Location))

df.test$desc <- factor(df.test$desc, levels = c("CONDOMINIUM", "MOBILE HOME", "MULTI-FAMILY", "ROWHOUSE",
                                             "SINGLE FAMILY"))
df.test <- df.test %>% mutate(exteriorfinish = as.factor(exteriorfinish))
df.test <- df.test %>% mutate(rooftype = as.factor(rooftype))
df.test <- df.test %>% mutate(Location = as.factor(Location))

#Save the ids in a new variable and omit them from the training set
train_ids <- df.train$id
df.train <- df.train[, -which(names(df.train) == "id")]

test_ids <- df.test$id
df.test <- df.test[, -which(names(df.test) == "id")]

#Set all test prices to 0; this step is not strictly necessary but removes the NAs from the set
df.test$price <- 0

#Check to see the stats of each predictor in the training set
summary(df.train)
##      price                    desc       numstories      yearbuilt   
##  Min.   :  25500   CONDOMINIUM  : 56   Min.   :1.000   Min.   :1830  
##  1st Qu.:  87000   MOBILE HOME  :  2   1st Qu.:1.000   1st Qu.:1930  
##  Median : 196250   MULTI-FAMILY : 29   Median :2.000   Median :1956  
##  Mean   : 302470   ROWHOUSE     : 17   Mean   :1.632   Mean   :1957  
##  3rd Qu.: 361052   SINGLE FAMILY:596   3rd Qu.:2.000   3rd Qu.:1988  
##  Max.   :3300000                       Max.   :3.000   Max.   :2017  
##   exteriorfinish    rooftype      basement        totalrooms    
##  Brick   :356    METAL  :  2   Min.   :0.0000   Min.   : 3.000  
##  Concrete:  4    ROLL   : 34   1st Qu.:1.0000   1st Qu.: 6.000  
##  Frame   :281    SHINGLE:585   Median :1.0000   Median : 7.000  
##  Log     :  1    SLATE  : 79   Mean   :0.9414   Mean   : 7.097  
##  Stone   : 41                  3rd Qu.:1.0000   3rd Qu.: 8.000  
##  Stucco  : 17                  Max.   :1.0000   Max.   :16.000  
##     bedrooms       bathrooms       fireplaces          sqft      
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :  475  
##  1st Qu.:3.000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 1246  
##  Median :3.000   Median :2.000   Median :0.0000   Median : 1894  
##  Mean   :3.267   Mean   :2.235   Mean   :0.5843   Mean   : 2331  
##  3rd Qu.:4.000   3rd Qu.:2.500   3rd Qu.:1.0000   3rd Qu.: 2825  
##  Max.   :7.000   Max.   :7.500   Max.   :5.0000   Max.   :15872  
##     lotarea           zipcode        AvgIncome         Location  
##  Min.   :      0   Min.   :15003   Min.   :14399   City    : 54  
##  1st Qu.:   4248   1st Qu.:15037   1st Qu.:39526   NotCity :453  
##  Median :  10202   Median :15218   Median :55016   PartCity:193  
##  Mean   :  47327   Mean   :15170   Mean   :51837                 
##  3rd Qu.:  23183   3rd Qu.:15236   3rd Qu.:60717                 
##  Max.   :3820212   Max.   :15332   Max.   :95289                 
##   DistDowntown   
##  Min.   : 0.000  
##  1st Qu.: 6.905  
##  Median :10.963  
##  Mean   :10.814  
##  3rd Qu.:13.448  
##  Max.   :23.041

2.3 Variable Distribution, Correlation, and Variable Importance

2.3.1 Response Variable

Here we will look at the distribution of the response variable and check for outliers and high-leverage points in the data set. If any would adversely affect our models, we will remove them to get more accurate predictions.

#Make a density plot of the housing prices to check for outliers
plot(density(df.train$price), main = "Density of Housing Prices", xlab = "Price")

#Check data for outliers/high leverage points
lm.fit <- lm(price ~ ., df.train)
par(mfrow = c(2,2))
plot(lm.fit)

The density plot is severely right skewed, so we will log transform the response variable to bring its distribution closer to normal. The Q-Q plot is also non-linear, indicating non-normal residuals, which the log transformation should help correct.

#Log transform the response variable
df.train <- df.train %>% mutate(price = log(df.train$price))

#Plot density graph after log transformation of response
plot(density(df.train$price), main = "Density of Housing Prices", xlab = "Price")

#Check data for outliers/high leverage points after log transformation
lm.fit <- lm(price ~ ., df.train)
par(mfrow = c(2,2))
plot(lm.fit)

Price is now approximately normally distributed and the residuals in the Q-Q plot look much better.

2.3.2 Numeric Predictors

Now we will explore the numeric variables to determine if any are highly skewed and need to be transformed.

#Check the distributions of numeric predictors in the training set
numeric <- df.train %>%
  select_if(is.numeric) 

histograms <- numeric %>%
  gather(key = "variable", value = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "cyan4") +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")

print(histograms)

From the histograms above, sqft appears highly right skewed and may need to be log transformed for accurate results.

plot(density(df.train$sqft), main = "Density of Square Footage", xlab = "Sqft")

df.train$sqft <- log(df.train$sqft)
df.test$sqft <- log(df.test$sqft)

plot(density(df.train$sqft), main = "Density of Square Footage", xlab = "Sqft")

After log transformation, sqft is approximately normally distributed and should not throw off our results.

2.3.3 Correlation Matrix

#Plot the correlation between variables
ggcorr(df.train, label = T, hjust = 1, layout.exp = 3)

From the correlation plot above, sqft, bathrooms, and totalrooms are the variables most strongly correlated with price.

2.3.4 Variable Importance

We will fit a random forest model to the training set in order to determine variable importance.

rf <- randomForest(price ~ ., data = df.train, 
                   mtry = 16/3, ntree = 500, importance = TRUE)

varImpPlot(rf, main = "Variable Importance")

Based on the random forest model above, sqft, bathrooms, and lotarea appear to be the most important variables for predicting price.

3 Model Selection

3.1 Splitting the Training Set

We split the training set into a training subset and a held-out test subset so we can compare how well our models perform by calculating their MSE.

#Split training data into a train and test set to get model MSE
set.seed(99)

idx <- sample(nrow(df.train), nrow(df.train) * 0.8)
train <- df.train[idx,]
test <- df.train[-idx,]

3.2 Linear Regression

Fit a linear regression model using all variables to determine which predictors are most important and to establish a baseline for the other regression models.

set.seed(99)

#Create a linear regression model using all variables
lm.fit <- lm(price ~ ., data = train)
summary(lm.fit)
## 
## Call:
## lm(formula = price ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75822 -0.29302  0.03428  0.31081  1.93970 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -4.036e-01  5.278e+00  -0.076 0.939076    
## descMOBILE HOME        -5.721e-01  8.059e-01  -0.710 0.478058    
## descMULTI-FAMILY       -1.646e-01  1.768e-01  -0.931 0.352239    
## descROWHOUSE            2.661e-02  1.946e-01   0.137 0.891267    
## descSINGLE FAMILY      -2.087e-01  1.189e-01  -1.755 0.079787 .  
## numstories             -8.340e-02  6.072e-02  -1.373 0.170201    
## yearbuilt               4.217e-03  9.723e-04   4.337 1.73e-05 ***
## exteriorfinishConcrete  5.419e-01  2.912e-01   1.861 0.063281 .  
## exteriorfinishFrame    -7.148e-02  5.311e-02  -1.346 0.178926    
## exteriorfinishLog       6.548e-02  5.872e-01   0.112 0.911250    
## exteriorfinishStone    -5.334e-02  1.149e-01  -0.464 0.642696    
## exteriorfinishStucco    2.713e-01  1.513e-01   1.793 0.073582 .  
## rooftypeROLL            3.814e-01  5.861e-01   0.651 0.515571    
## rooftypeSHINGLE         2.311e-01  5.696e-01   0.406 0.685122    
## rooftypeSLATE           5.029e-01  5.715e-01   0.880 0.379326    
## basement                4.538e-02  1.314e-01   0.345 0.729913    
## totalrooms             -9.814e-03  2.658e-02  -0.369 0.712094    
## bedrooms                1.002e-01  4.394e-02   2.280 0.023005 *  
## bathrooms               1.150e-01  3.945e-02   2.915 0.003710 ** 
## fireplaces              4.089e-02  3.887e-02   1.052 0.293372    
## sqft                    8.890e-01  8.939e-02   9.945  < 2e-16 ***
## lotarea                 2.960e-07  1.335e-07   2.217 0.027039 *  
## zipcode                -2.070e-04  3.166e-04  -0.654 0.513471    
## AvgIncome               6.571e-06  2.229e-06   2.948 0.003341 ** 
## LocationNotCity         9.270e-02  1.344e-01   0.690 0.490704    
## LocationPartCity        7.954e-02  1.101e-01   0.722 0.470331    
## DistDowntown           -2.372e-02  6.900e-03  -3.437 0.000634 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5522 on 533 degrees of freedom
## Multiple R-squared:  0.7025, Adjusted R-squared:  0.688 
## F-statistic: 48.42 on 26 and 533 DF,  p-value: < 2.2e-16
#Predict price on test set
lm.pred <- predict(lm.fit, test)

#Calculate MSE
lm.MSE <- mean((lm.pred - test$price)^2)
cat("Linear Regression MSE = ", lm.MSE, "\n")
## Linear Regression MSE =  0.3200871

As the model summary above shows, many variables contribute little to predicting price. We now try ridge and lasso regression to shrink the coefficients of these less important variables.

3.3 Ridge Regression

Because many variables were not significant in the model above, we will use ridge and lasso regression to shrink the coefficients of the unimportant ones.
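
For reference, ridge regression estimates the coefficients by minimizing the residual sum of squares plus an L2 penalty, so larger values of the tuning parameter lambda shrink all coefficients toward zero:

$$\hat{\beta}^{ridge} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$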

set.seed(99)

#Create matrices in order to perform ridge regression
x <- model.matrix(price ~ ., df.train)[, -1]
y <- df.train$price

#Split into training and test set
xtrain <- sample(1:nrow(x), nrow(x) * 0.8)
xtest <- (-xtrain)
y.test <- y[xtest]

grid <- 10^seq(10, -2, length = 100)

#Create a ridge regression model
ridge.mod <- glmnet(x[xtrain, ], y[xtrain], alpha = 0, lambda = grid, thresh = 1e-12)

#Perform cross validation to obtain the best lambda value to use in the model
cv.out <- cv.glmnet(x[xtrain, ], y[xtrain], alpha = 0)
plot(cv.out)

bestlam <- cv.out$lambda.min
cat("Best Lambda = ", bestlam, "\n")
## Best Lambda =  0.07805105
#Predict prices on the test set and calculate the MSE
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[xtest, ])
ridge.mse <- mean((ridge.pred - y.test)^2)
cat("Ridge Regression MSE = ", ridge.mse, "\n")
## Ridge Regression MSE =  0.3255705
#Identify the most important variables used in the model
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:27, ]
##            (Intercept)        descMOBILE HOME       descMULTI-FAMILY 
##          -6.099579e+00          -7.377955e-01          -2.414076e-02 
##           descROWHOUSE      descSINGLE FAMILY             numstories 
##           2.002937e-01          -8.047298e-02          -3.390782e-02 
##              yearbuilt exteriorfinishConcrete    exteriorfinishFrame 
##           5.361183e-03           5.160519e-01          -5.537529e-02 
##      exteriorfinishLog    exteriorfinishStone   exteriorfinishStucco 
##           1.410496e-01           4.138084e-02           2.314932e-01 
##           rooftypeROLL        rooftypeSHINGLE          rooftypeSLATE 
##           1.865921e-01          -1.273672e-01           1.777983e-01 
##               basement             totalrooms               bedrooms 
##           2.753367e-02           9.660227e-03           7.741935e-02 
##              bathrooms             fireplaces                   sqft 
##           1.422287e-01           5.884382e-02           6.884206e-01 
##                lotarea                zipcode              AvgIncome 
##           3.651078e-07           1.194264e-04           6.156163e-06 
##        LocationNotCity       LocationPartCity           DistDowntown 
##           3.950147e-02           6.006796e-02          -1.429521e-02

Lambda was chosen via cross-validation and applied to the model when predicting prices.
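
As an aside, cv.glmnet also records lambda.1se, the largest lambda whose cross-validation error is within one standard error of the minimum. The sketch below, which is not part of the model comparison, shows how the ridge fit could be re-evaluated at that more conservative value.

#Re-evaluate the ridge fit at the more conservative lambda.1se
lam.1se <- cv.out$lambda.1se
ridge.pred.1se <- predict(ridge.mod, s = lam.1se, newx = x[xtest, ])
mean((ridge.pred.1se - y.test)^2)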

3.4 Lasso Regression

Lasso regression handles unimportant variables by setting their coefficients to 0, excluding them from the model.
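
For reference, the lasso replaces the ridge L2 penalty with an L1 penalty, which can shrink coefficients exactly to zero and therefore performs variable selection:

$$\hat{\beta}^{lasso} = \underset{\beta}{\arg\min} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$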

set.seed(99)

#Create matrices in order to perform lasso regression
x <- model.matrix(price ~ ., df.train)[, -1]
y <- df.train$price

#Split into training and test set
xtrain <- sample(1:nrow(x), nrow(x) * 0.8)
xtest <- (-xtrain)
y.test <- y[xtest]

grid <- 10^seq(10, -2, length = 100)

#Create a lasso regression model
lasso.mod <- glmnet(x[xtrain, ], y[xtrain], alpha = 1, lambda = grid)

#Perform cross validation to obtain the best lambda value to use in the model
cv.out <- cv.glmnet(x[xtrain, ], y[xtrain], alpha = 1)
plot(cv.out)

bestlam <- cv.out$lambda.min
cat("Best Lambda = ", bestlam, "\n")
## Best Lambda =  0.01186306
#Predict prices on the test set and calculate the MSE
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[xtest, ])
lasso.mse <- mean((lasso.pred - y.test)^2)
cat("Lasso Regression MSE = ", lasso.mse, "\n")
## Lasso Regression MSE =  0.3209131
#Identify the most important variables used in the model
out <- glmnet(x, y, alpha = 1, lambda = grid)
predict(out, type = "coefficients", s = bestlam)[1:27, ]
##            (Intercept)        descMOBILE HOME       descMULTI-FAMILY 
##          -5.726748e+00          -4.634005e-01           0.000000e+00 
##           descROWHOUSE      descSINGLE FAMILY             numstories 
##           1.328056e-01          -2.037470e-02          -2.995756e-02 
##              yearbuilt exteriorfinishConcrete    exteriorfinishFrame 
##           5.102767e-03           3.460057e-01          -4.845788e-02 
##      exteriorfinishLog    exteriorfinishStone   exteriorfinishStucco 
##           0.000000e+00           0.000000e+00           1.482732e-01 
##           rooftypeROLL        rooftypeSHINGLE          rooftypeSLATE 
##           0.000000e+00          -2.645959e-01           1.884410e-03 
##               basement             totalrooms               bedrooms 
##           0.000000e+00           0.000000e+00           3.876792e-02 
##              bathrooms             fireplaces                   sqft 
##           1.354157e-01           3.122096e-02           8.549358e-01 
##                lotarea                zipcode              AvgIncome 
##           2.970482e-07           7.184802e-05           5.732477e-06 
##        LocationNotCity       LocationPartCity           DistDowntown 
##           0.000000e+00           0.000000e+00          -1.420707e-02

Lambda was chosen via cross-validation and applied to the model when predicting prices. Neither ridge nor lasso regression outperformed ordinary linear regression, which is somewhat surprising; possibly the training set is large enough relative to the number of coefficients that shrinkage offers little variance reduction.

3.5 Tree Ensembles

Next we perform bagging and random forests on the training set; we expect these models to outperform the regression models above. Bagging is a special case of a random forest in which every predictor (mtry = 16) is considered at each split, while a random forest considers only a random subset of predictors at each split.

3.5.1 Bagged Tree

set.seed(99)
#Create a bagging random forest model and calculate MSE
bag.rf <- randomForest(price ~ ., data = train, mtry = 16, ntree = 500, importance = TRUE)
yhat.bag <- predict(bag.rf, newdata = test)
bag.mse <- mean((yhat.bag - test$price)^2)
cat("Bagged Tree MSE = ", bag.mse, "\n")
## Bagged Tree MSE =  0.304875
importance(bag.rf)
##                  %IncMSE IncNodePurity
## desc            4.122712     5.4977612
## numstories      5.778141     4.9491089
## yearbuilt      13.542958    30.0526809
## exteriorfinish  5.398958     8.0416183
## rooftype        3.778469     3.1869687
## basement       -4.379678     0.6883449
## totalrooms     12.138685    11.3348480
## bedrooms        9.202390     6.1478448
## bathrooms      16.184439    76.7961673
## fireplaces     -2.238825     4.0593096
## sqft           67.891261   308.9373007
## lotarea        15.919443    28.1963206
## zipcode         9.042859    13.5524116
## AvgIncome      15.596305    14.5804645
## Location        9.546598     4.0521093
## DistDowntown   11.109784    13.1618594
#Plot the variables in order of importance
varImpPlot(bag.rf, main = "Bagged Tree Variable Importance")

plot(yhat.bag, test$price, main = "Actual vs Predicted Price", 
     xlab = "Predicted Priced", ylab = "Actual Price")
abline(0, 1, col = "red")

The bagged tree model identified sqft, bathrooms, and lotarea as the most important predictors, similar to what we saw in the exploratory data analysis.

3.5.2 Random Forests

Now we will try a random forest model to see if it performs better than the bagged tree above.

set.seed(99)

#Create a random forest model with 500 trees and mtry equal to the number of predictors divided by 3, then calculate the MSE
rf <- randomForest(price ~ ., data = train, 
                   mtry = 16/3, ntree = 500, importance = TRUE)
yhat.rf <- predict(rf, newdata = test)

#Calculate MSE
rf.mse <- mean((yhat.rf - test$price)^2)
cat("Random Forest MSE = ", rf.mse, "\n")
## Random Forest MSE =  0.2969777
importance(rf)
##                  %IncMSE IncNodePurity
## desc            5.403698     7.5795673
## numstories      5.525999     6.8188834
## yearbuilt      14.598973    36.1042204
## exteriorfinish  6.847546    10.3510353
## rooftype        7.263399     5.5852877
## basement        1.331580     0.8291662
## totalrooms     16.676433    52.6891132
## bedrooms       11.621588    31.7700064
## bathrooms      21.917085   100.4532839
## fireplaces      2.608388     8.5429958
## sqft           37.134582   163.5534223
## lotarea        21.210137    47.2897241
## zipcode        10.159101    14.0958808
## AvgIncome      13.805571    19.6481057
## Location        7.342755     4.8713113
## DistDowntown   11.744082    16.0314598
#Plot the variables to see importance and how close the model fits the expected values
varImpPlot(rf, main = "Random Forest Variable Importance")

plot(yhat.rf, test$price, main = "Actual vs Predicted Price", 
     xlab = "Predicted Priced", ylab = "Actual Price")
abline(0, 1, col = "red")

4 Model Evaluation

Ordering the models by MSE lets us see at a glance which model performed best.

models <- c("Ridge", "Lasso", "LM", "Bagging", "RF")
mses <- c(ridge.mse, lasso.mse, lm.MSE, bag.mse, rf.mse)
df.performance <- data.frame(models, mses)

ggplot(df.performance, aes(x = fct_reorder(models, mses), y = mses)) + 
  geom_col(width = .5, fill = "cyan4") + 
  labs(title = "Model vs MSE", x = "Model", y = "Mean Squared Error (MSE)")

All models produced fairly similar MSEs, with the random forest performing best.

5 Predicting Prices and Writing to CSV

We now refit the random forest model on the full original training set and use it to predict housing prices for the original test set. The predictions are converted back from the log scale and written to a CSV file along with the ids from the test set.

set.seed(99)
#Refit the best model using all the data from the original training set
bestmodel <- randomForest(price ~ ., data = df.train, 
                   mtry = 16/3, ntree = 500, importance = TRUE)

#Predict and convert the prices back to original form
pred <- predict(bestmodel, df.test)
house.price <- as.numeric(round(exp(pred), 0))

out <- data.frame(id = test_ids, price = house.price)

#Write data frame to csv file
write.csv(out, file = "testing_predictions_Predix_Joel_JMP261.csv", row.names = FALSE)