The objective of this project is to predict housing prices in the Pittsburgh area from historical sales data. Several models were fit and compared:
Linear Regression
Ridge Regression
Lasso Regression
Bagged Tree
Random Forest
Mean squared error (MSE) on a held-out split of the training data was then used to determine which model performed best; that model was used to predict prices on the test set.
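For reference, the MSE used to compare models is simply the mean of the squared prediction errors. A one-line helper along these lines reproduces the calculation done inline in each section below (the name mse is my own, for illustration only):
#Illustrative helper: MSE between actual and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)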
library(tidyverse)
library(forcats)
library(GGally)
library(glmnet)
library(caret)
library(randomForest)
Exploring the data gives us a sense of both sets. We will check that neither set is missing any data and convert character variables to factors so the models can use them.
#Read in training and test data
df.train <- read.csv("train.csv", header = TRUE)
df.test <- read.csv("test.csv", header = TRUE)
#Display the first 5 rows of the data frames
head(df.train)
## id price desc numstories yearbuilt exteriorfinish rooftype
## 1 PA10258 432500 SINGLE FAMILY 2 1988 Brick SHINGLE
## 2 PA10261 86000 SINGLE FAMILY 1 1931 Brick SHINGLE
## 3 PA10698 875000 SINGLE FAMILY 1 1950 Stone SLATE
## 4 PA10417 460000 SINGLE FAMILY 2 1900 Frame SHINGLE
## 5 PA10452 900000 SINGLE FAMILY 1 1995 Frame SHINGLE
## 6 PA10419 27000 MOBILE HOME 1 1968 Frame METAL
## basement totalrooms bedrooms bathrooms fireplaces sqft lotarea zipcode
## 1 1 9 5 2.5 1 4112 11400 15236
## 2 1 6 3 1.5 1 1080 10336 15037
## 3 1 10 3 3.0 2 3322 23950 15243
## 4 1 7 3 2.0 0 2323 1218 15212
## 5 1 3 1 1.0 0 1011 496671 15235
## 6 1 4 2 1.0 0 696 14767 15026
## AvgIncome Location DistDowntown
## 1 46913 NotCity 7.695
## 2 55863 NotCity 18.970
## 3 60717 NotCity 7.224
## 4 26712 PartCity 2.168
## 5 41367 PartCity 10.963
## 6 69387 NotCity 23.041
head(df.test)
## id price desc numstories yearbuilt exteriorfinish rooftype
## 1 PA10723 NA SINGLE FAMILY 2.0 2000 Brick SHINGLE
## 2 PA10773 NA SINGLE FAMILY 2.0 1994 Brick SHINGLE
## 3 PA10366 NA SINGLE FAMILY 1.0 1951 Frame SHINGLE
## 4 PA10915 NA SINGLE FAMILY 1.0 1960 Brick SHINGLE
## 5 PA10898 NA SINGLE FAMILY 2.0 1931 Brick SLATE
## 6 PA10890 NA SINGLE FAMILY 2.5 1910 Brick SHINGLE
## basement totalrooms bedrooms bathrooms fireplaces sqft lotarea zipcode
## 1 1 8 4 2.5 1 3864 14026 15025
## 2 1 8 4 3.5 1 2352 13262 15025
## 3 1 7 3 2.0 0 1366 2150 15238
## 4 1 7 3 1.5 0 1347 8013 15236
## 5 1 10 4 3.5 1 4531 22073 15228
## 6 1 9 4 2.0 0 4374 10500 15224
## AvgIncome Location DistDowntown
## 1 60669 NotCity 13.094
## 2 60669 NotCity 13.094
## 3 67370 NotCity 11.876
## 4 46913 NotCity 7.695
## 5 58440 NotCity 7.011
## 6 22880 City 4.644
#Check the dimensions of the training set
dim(df.train)
## [1] 700 18
#Check to see if the set contains missing values
colSums(is.na(df.train))
## id price desc numstories yearbuilt
## 0 0 0 0 0
## exteriorfinish rooftype basement totalrooms bedrooms
## 0 0 0 0 0
## bathrooms fireplaces sqft lotarea zipcode
## 0 0 0 0 0
## AvgIncome Location DistDowntown
## 0 0 0
#See what data type each column is
str(df.train)
## 'data.frame': 700 obs. of 18 variables:
## $ id : chr "PA10258" "PA10261" "PA10698" "PA10417" ...
## $ price : int 432500 86000 875000 460000 900000 27000 460000 80900 42500 1250000 ...
## $ desc : chr "SINGLE FAMILY" "SINGLE FAMILY" "SINGLE FAMILY" "SINGLE FAMILY" ...
## $ numstories : num 2 1 1 2 1 1 2 2 2 2 ...
## $ yearbuilt : int 1988 1931 1950 1900 1995 1968 1962 1930 1900 1998 ...
## $ exteriorfinish: chr "Brick" "Brick" "Stone" "Frame" ...
## $ rooftype : chr "SHINGLE" "SHINGLE" "SLATE" "SHINGLE" ...
## $ basement : int 1 1 1 1 1 1 1 1 1 1 ...
## $ totalrooms : int 9 6 10 7 3 4 8 6 4 13 ...
## $ bedrooms : int 5 3 3 3 1 2 4 3 1 4 ...
## $ bathrooms : num 2.5 1.5 3 2 1 1 3 2 1 6 ...
## $ fireplaces : int 1 1 2 0 0 0 0 0 0 2 ...
## $ sqft : int 4112 1080 3322 2323 1011 696 2540 1134 1024 6705 ...
## $ lotarea : int 11400 10336 23950 1218 496671 14767 12460 2500 816 87120 ...
## $ zipcode : int 15236 15037 15243 15212 15235 15026 15243 15227 15224 15044 ...
## $ AvgIncome : int 46913 55863 60717 26712 41367 69387 60717 38439 22880 95289 ...
## $ Location : chr "NotCity" "NotCity" "NotCity" "PartCity" ...
## $ DistDowntown : num 7.7 18.97 7.22 2.17 10.96 ...
Luckily, there is no missing data in the training set. Since both sets contain character variables, we will convert them to factors so our models can handle them. We will also store the id column in its own variable and drop it from each data frame.
#Convert character variables to factors in both train and test sets
df.train$desc <- factor(df.train$desc, levels = c("CONDOMINIUM", "MOBILE HOME", "MULTI-FAMILY", "ROWHOUSE",
"SINGLE FAMILY"))
df.train <- df.train %>% mutate(exteriorfinish = as.factor(exteriorfinish))
df.train <- df.train %>% mutate(rooftype = as.factor(rooftype))
df.train <- df.train %>% mutate(Location = as.factor(Location))
df.test$desc <- factor(df.test$desc, levels = c("CONDOMINIUM", "MOBILE HOME", "MULTI-FAMILY", "ROWHOUSE",
"SINGLE FAMILY"))
df.test <- df.test %>% mutate(exteriorfinish = as.factor(exteriorfinish))
df.test <- df.test %>% mutate(rooftype = as.factor(rooftype))
df.test <- df.test %>% mutate(Location = as.factor(Location))
#Save the ids in a new variable and omit them from the training set
train_ids <- df.train$id
df.train <- df.train[, -which(names(df.train) == "id")]
test_ids <- df.test$id
df.test <- df.test[, -which(names(df.test) == "id")]
#Set all test prices to 0; not strictly necessary, but it removes the NAs from the set
df.test$price <- 0
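As an aside, the explicit level set for desc keeps its factor coding consistent between the train and test sets even if a category is absent from one of them. The remaining character-to-factor conversions could be collapsed with across(); this is a sketch of an equivalent alternative (assuming dplyr 1.0 or later), not what was actually run:
#Compact equivalent of the three mutate() calls per data frame above
df.train <- df.train %>% mutate(across(c(exteriorfinish, rooftype, Location), as.factor))
df.test <- df.test %>% mutate(across(c(exteriorfinish, rooftype, Location), as.factor))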
#Check to see the stats of each predictor in the training set
summary(df.train)
## price desc numstories yearbuilt
## Min. : 25500 CONDOMINIUM : 56 Min. :1.000 Min. :1830
## 1st Qu.: 87000 MOBILE HOME : 2 1st Qu.:1.000 1st Qu.:1930
## Median : 196250 MULTI-FAMILY : 29 Median :2.000 Median :1956
## Mean : 302470 ROWHOUSE : 17 Mean :1.632 Mean :1957
## 3rd Qu.: 361052 SINGLE FAMILY:596 3rd Qu.:2.000 3rd Qu.:1988
## Max. :3300000 Max. :3.000 Max. :2017
## exteriorfinish rooftype basement totalrooms
## Brick :356 METAL : 2 Min. :0.0000 Min. : 3.000
## Concrete: 4 ROLL : 34 1st Qu.:1.0000 1st Qu.: 6.000
## Frame :281 SHINGLE:585 Median :1.0000 Median : 7.000
## Log : 1 SLATE : 79 Mean :0.9414 Mean : 7.097
## Stone : 41 3rd Qu.:1.0000 3rd Qu.: 8.000
## Stucco : 17 Max. :1.0000 Max. :16.000
## bedrooms bathrooms fireplaces sqft
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. : 475
## 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 1246
## Median :3.000 Median :2.000 Median :0.0000 Median : 1894
## Mean :3.267 Mean :2.235 Mean :0.5843 Mean : 2331
## 3rd Qu.:4.000 3rd Qu.:2.500 3rd Qu.:1.0000 3rd Qu.: 2825
## Max. :7.000 Max. :7.500 Max. :5.0000 Max. :15872
## lotarea zipcode AvgIncome Location
## Min. : 0 Min. :15003 Min. :14399 City : 54
## 1st Qu.: 4248 1st Qu.:15037 1st Qu.:39526 NotCity :453
## Median : 10202 Median :15218 Median :55016 PartCity:193
## Mean : 47327 Mean :15170 Mean :51837
## 3rd Qu.: 23183 3rd Qu.:15236 3rd Qu.:60717
## Max. :3820212 Max. :15332 Max. :95289
## DistDowntown
## Min. : 0.000
## 1st Qu.: 6.905
## Median :10.963
## Mean :10.814
## 3rd Qu.:13.448
## Max. :23.041
Here we look at the distribution of the response variable and check for outliers and high-leverage points in the data set. If any of these would unduly influence the models, we will deal with them to get more accurate predictions.
#Make a density plot of the housing prices to check for outliers
plot(density(df.train$price), main = "Density of Housing Prices", xlab = "Price")
#Check data for outliers/high leverage points
lm.fit <- lm(price ~ ., df.train)
par(mfrow = c(2,2))
plot(lm.fit)
The density plot is severely right-skewed, so we log-transform the response variable to bring it closer to a normal distribution. The Q-Q plot is also non-linear, and the log transformation should improve the behavior of the residuals.
#Log transform the response variable
df.train <- df.train %>% mutate(price = log(df.train$price))
#Plot density graph after log transformation of response
plot(density(df.train$price), main = "Density of Housing Prices", xlab = "Price")
#Check data for outliers/high leverage points after log transformation
lm.fit <- lm(price ~ ., df.train)
par(mfrow = c(2,2))
plot(lm.fit)
Price is now approximately normally distributed, and the residuals in the Q-Q plot look much better.
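For a numeric check alongside the plots, base R's shapiro.test can be applied to the transformed price; with 700 observations even small departures from normality produce tiny p-values, so the density and Q-Q plots remain the primary diagnostics. This check is optional and was not part of the original output:
#Optional numeric normality check on the log-transformed price
shapiro.test(df.train$price)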
Now we will explore the numeric variables to determine if any are highly skewed and need to be transformed.
#Check the distibutions of numeric predictors in the training set
numeric <- df.train %>%
select_if(is.numeric)
histograms <- numeric %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "cyan4") +
facet_wrap(~ variable, scales = "free") +
labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")
print(histograms)
From the histograms above, sqft is highly right-skewed and may need to be transformed for accurate results.
plot(density(df.train$sqft), main = "Density of Square Footage", xlab = "Sqft")
df.train$sqft <- log(df.train$sqft)
df.test$sqft <- log(df.test$sqft)
plot(density(df.train$sqft), main = "Density of Square Footage", xlab = "Sqft")
After the log transformation, sqft is approximately normally distributed and should not throw off our results.
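To quantify the skew rather than judge it by eye, a simple moment-based skewness can be computed. The helper below is my own illustration; since sqft has already been log-transformed at this point, exp() recovers the original scale:
#Illustrative skewness helper: third standardized moment
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(exp(df.train$sqft)) #original scale: expect a large positive value (right-skewed)
skew(df.train$sqft) #log scale: expect a value near 0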
#Plot the correlation between variables
ggcorr(df.train, label = T, hjust = 1, layout.exp = 3)
From the correlation plot above, sqft, bathrooms, and totalrooms are the variables most strongly correlated with price.
We will fit a random forest model to the training set in order to determine variable importance.
rf <- randomForest(price ~ ., data = df.train,
mtry = 16/3, ntree = 500, importance = TRUE)
varImpPlot(rf, main = "Variable Importance")
Based on the random forest model above, sqft, bathrooms, and lotarea appear to be the most important variables driving house prices.
We split the training data into a train and test set so we can estimate how well each model performs by calculating MSE on the held-out portion.
#Split training data into a train and test set to get model MSE
set.seed(99)
idx <- sample(nrow(df.train), nrow(df.train) * 0.8)
train <- df.train[idx,]
test <- df.train[-idx,]
We fit a linear regression model using all variables to determine which predictors are most important and to establish a baseline for the other models.
set.seed(99)
#Create a linear regression model using all variables
lm.fit <- lm(price ~ ., data = train)
summary(lm.fit)
##
## Call:
## lm(formula = price ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75822 -0.29302 0.03428 0.31081 1.93970
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.036e-01 5.278e+00 -0.076 0.939076
## descMOBILE HOME -5.721e-01 8.059e-01 -0.710 0.478058
## descMULTI-FAMILY -1.646e-01 1.768e-01 -0.931 0.352239
## descROWHOUSE 2.661e-02 1.946e-01 0.137 0.891267
## descSINGLE FAMILY -2.087e-01 1.189e-01 -1.755 0.079787 .
## numstories -8.340e-02 6.072e-02 -1.373 0.170201
## yearbuilt 4.217e-03 9.723e-04 4.337 1.73e-05 ***
## exteriorfinishConcrete 5.419e-01 2.912e-01 1.861 0.063281 .
## exteriorfinishFrame -7.148e-02 5.311e-02 -1.346 0.178926
## exteriorfinishLog 6.548e-02 5.872e-01 0.112 0.911250
## exteriorfinishStone -5.334e-02 1.149e-01 -0.464 0.642696
## exteriorfinishStucco 2.713e-01 1.513e-01 1.793 0.073582 .
## rooftypeROLL 3.814e-01 5.861e-01 0.651 0.515571
## rooftypeSHINGLE 2.311e-01 5.696e-01 0.406 0.685122
## rooftypeSLATE 5.029e-01 5.715e-01 0.880 0.379326
## basement 4.538e-02 1.314e-01 0.345 0.729913
## totalrooms -9.814e-03 2.658e-02 -0.369 0.712094
## bedrooms 1.002e-01 4.394e-02 2.280 0.023005 *
## bathrooms 1.150e-01 3.945e-02 2.915 0.003710 **
## fireplaces 4.089e-02 3.887e-02 1.052 0.293372
## sqft 8.890e-01 8.939e-02 9.945 < 2e-16 ***
## lotarea 2.960e-07 1.335e-07 2.217 0.027039 *
## zipcode -2.070e-04 3.166e-04 -0.654 0.513471
## AvgIncome 6.571e-06 2.229e-06 2.948 0.003341 **
## LocationNotCity 9.270e-02 1.344e-01 0.690 0.490704
## LocationPartCity 7.954e-02 1.101e-01 0.722 0.470331
## DistDowntown -2.372e-02 6.900e-03 -3.437 0.000634 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5522 on 533 degrees of freedom
## Multiple R-squared: 0.7025, Adjusted R-squared: 0.688
## F-statistic: 48.42 on 26 and 533 DF, p-value: < 2.2e-16
#Predict price on test set
lm.pred <- predict(lm.fit, test)
#Calculate MSE
lm.MSE <- mean((lm.pred - test$price)^2)
cat("Linear Regression MSE = ", lm.MSE, "\n")
## Linear Regression MSE = 0.3200871
As the model summary above shows, many variables contribute little to predicting price. We therefore try ridge and lasso regression, which add a penalty on coefficient size and shrink the coefficients of unimportant predictors.
set.seed(99)
#Create matrices in order to perform ridge regression
x <- model.matrix(price ~ ., df.train)[, -1]
y <- df.train$price
#Split into training and test set
xtrain <- sample(1:nrow(x), nrow(x) * 0.8)
xtest <- (-xtrain)
y.test <- y[xtest]
grid <- 10^seq(10, -2, length = 100)
#Create a ridge regression model
ridge.mod <- glmnet(x[xtrain, ], y[xtrain], alpha = 0, lambda = grid, thresh = 1e-12)
#Perform cross validation to obtain the best lambda value to use in the model
cv.out <- cv.glmnet(x[xtrain, ], y[xtrain], alpha = 0)
plot(cv.out)
bestlam <- cv.out$lambda.min
cat("Best Lambda = ", bestlam, "\n")
## Best Lambda = 0.07805105
#Predict prices on the test set and calculate the MSE
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[xtest, ])
ridge.mse <- mean((ridge.pred - y.test)^2)
cat("Ridge Regression MSE = ", ridge.mse, "\n")
## Ridge Regression MSE = 0.3255705
#Identify the most important variables used in the model
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:27, ]
## (Intercept) descMOBILE HOME descMULTI-FAMILY
## -6.099579e+00 -7.377955e-01 -2.414076e-02
## descROWHOUSE descSINGLE FAMILY numstories
## 2.002937e-01 -8.047298e-02 -3.390782e-02
## yearbuilt exteriorfinishConcrete exteriorfinishFrame
## 5.361183e-03 5.160519e-01 -5.537529e-02
## exteriorfinishLog exteriorfinishStone exteriorfinishStucco
## 1.410496e-01 4.138084e-02 2.314932e-01
## rooftypeROLL rooftypeSHINGLE rooftypeSLATE
## 1.865921e-01 -1.273672e-01 1.777983e-01
## basement totalrooms bedrooms
## 2.753367e-02 9.660227e-03 7.741935e-02
## bathrooms fireplaces sqft
## 1.422287e-01 5.884382e-02 6.884206e-01
## lotarea zipcode AvgIncome
## 3.651078e-07 1.194264e-04 6.156163e-06
## LocationNotCity LocationPartCity DistDowntown
## 3.950147e-02 6.006796e-02 -1.429521e-02
Lambda was chosen via cross-validation and applied to the model when predicting prices.
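As a side note, cv.glmnet also reports lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum. It gives a more heavily regularized fit and was not used here, but checking it is one line:
#Optional: the more conservative 1-standard-error choice of lambda
cat("Lambda (1se) = ", cv.out$lambda.1se, "\n")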
Lasso regression handles unimportant variables by setting their coefficients exactly to 0, excluding them from the model.
set.seed(99)
#Create matrices in order to perform lasso regression
x <- model.matrix(price ~ ., df.train)[, -1]
y <- df.train$price
#Split into training and test set
xtrain <- sample(1:nrow(x), nrow(x) * 0.8)
xtest <- (-xtrain)
y.test <- y[xtest]
grid <- 10^seq(10, -2, length = 100)
#Create a lasso regression model
lasso.mod <- glmnet(x[xtrain, ], y[xtrain], alpha = 1, lambda = grid)
#Perform cross validation to obtain the best lambda value to use in the model
cv.out <- cv.glmnet(x[xtrain, ], y[xtrain], alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min
cat("Best Lambda = ", bestlam, "\n")
## Best Lambda = 0.01186306
#Predict prices on the test set and calculate the MSE
lasso.pred <- predict(lasso.mod, s = bestlam,newx = x[xtest, ])
lasso.mse <- mean((lasso.pred - y.test)^2)
cat("Lasso Regression MSE = ", lasso.mse, "\n")
## Lasso Regression MSE = 0.3209131
#Identify the most important variables used in the model
out <- glmnet(x, y, alpha = 1, lambda = grid)
predict(out, type = "coefficients", s = bestlam)[1:27, ]
## (Intercept) descMOBILE HOME descMULTI-FAMILY
## -5.726748e+00 -4.634005e-01 0.000000e+00
## descROWHOUSE descSINGLE FAMILY numstories
## 1.328056e-01 -2.037470e-02 -2.995756e-02
## yearbuilt exteriorfinishConcrete exteriorfinishFrame
## 5.102767e-03 3.460057e-01 -4.845788e-02
## exteriorfinishLog exteriorfinishStone exteriorfinishStucco
## 0.000000e+00 0.000000e+00 1.482732e-01
## rooftypeROLL rooftypeSHINGLE rooftypeSLATE
## 0.000000e+00 -2.645959e-01 1.884410e-03
## basement totalrooms bedrooms
## 0.000000e+00 0.000000e+00 3.876792e-02
## bathrooms fireplaces sqft
## 1.354157e-01 3.122096e-02 8.549358e-01
## lotarea zipcode AvgIncome
## 2.970482e-07 7.184802e-05 5.732477e-06
## LocationNotCity LocationPartCity DistDowntown
## 0.000000e+00 0.000000e+00 -1.420707e-02
Lambda was chosen via cross-validation and applied to the model when predicting prices. Somewhat surprisingly, neither ridge nor lasso regression outperformed ordinary linear regression.
Next we fit bagged tree and random forest models on the training set. I expect these models to perform better than the ones above.
set.seed(99)
#Create a bagging random forest model and calculate MSE
bag.rf <- randomForest(price ~ ., data = train, mtry = 16, ntree = 30, importance = TRUE)
yhat.bag <- predict(bag.rf, newdata = test)
bag.mse <- mean((yhat.bag - test$price)^2)
cat("Bagged Tree MSE = ", bag.mse, "\n")
## Bagged Tree MSE = 0.304875
importance(bag.rf)
## %IncMSE IncNodePurity
## desc 4.122712 5.4977612
## numstories 5.778141 4.9491089
## yearbuilt 13.542958 30.0526809
## exteriorfinish 5.398958 8.0416183
## rooftype 3.778469 3.1869687
## basement -4.379678 0.6883449
## totalrooms 12.138685 11.3348480
## bedrooms 9.202390 6.1478448
## bathrooms 16.184439 76.7961673
## fireplaces -2.238825 4.0593096
## sqft 67.891261 308.9373007
## lotarea 15.919443 28.1963206
## zipcode 9.042859 13.5524116
## AvgIncome 15.596305 14.5804645
## Location 9.546598 4.0521093
## DistDowntown 11.109784 13.1618594
#Plot the variables in order of importance
varImpPlot(bag.rf, main = "Bagged Tree Variable Importance")
plot(yhat.bag, test$price, main = "Actual vs Predicted Price",
xlab = "Predicted Price", ylab = "Actual Price")
abline(0, 1, col = "red")
The bagged tree model identified sqft, bathrooms, and lotarea as the most important predictors, consistent with what we saw in the exploratory data analysis.
Now we will try a random forest model to see if it performs better than the bagged tree above.
set.seed(99)
#Create a random forest model with 500 trees and mtry set to the number of predictors divided by 3, then calculate the MSE
rf <- randomForest(price ~ ., data = train,
mtry = 16/3, ntree = 500, importance = TRUE)
yhat.rf <- predict(rf, newdata = test)
#Calculate MSE
rf.mse <- mean((yhat.rf - test$price)^2)
cat("Random Forest MSE = ", rf.mse, "\n")
## Random Forest MSE = 0.2969777
importance(rf)
## %IncMSE IncNodePurity
## desc 5.403698 7.5795673
## numstories 5.525999 6.8188834
## yearbuilt 14.598973 36.1042204
## exteriorfinish 6.847546 10.3510353
## rooftype 7.263399 5.5852877
## basement 1.331580 0.8291662
## totalrooms 16.676433 52.6891132
## bedrooms 11.621588 31.7700064
## bathrooms 21.917085 100.4532839
## fireplaces 2.608388 8.5429958
## sqft 37.134582 163.5534223
## lotarea 21.210137 47.2897241
## zipcode 10.159101 14.0958808
## AvgIncome 13.805571 19.6481057
## Location 7.342755 4.8713113
## DistDowntown 11.744082 16.0314598
#Plot the variables to see importance and how close the model fits the expected values
varImpPlot(rf, main = "Random Forest Variable Importance")
plot(yhat.rf, test$price, main = "Actual vs Predicted Price",
xlab = "Predicted Price", ylab = "Actual Price")
abline(0, 1, col = "red")
Ordering the models by MSE lets us see at a glance which model performed best.
models <- c("Ridge", "Lasso", "LM", "Bagging", "RF")
mses <- c(ridge.mse, lasso.mse, lm.MSE, bag.mse, rf.mse)
df.performance <- data.frame(models, mses)
ggplot(df.performance, aes(x = fct_reorder(models, mses), y = mses)) +
geom_col(width = .5, fill = "cyan4") +
labs(title = "Model vs MSE", x = "Model", y = "Mean Squared Error (MSE)")
All of the models produced fairly similar MSEs, with the random forest performing best.
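The caret package loaded at the start was not needed for the final models, but mtry could also be tuned by cross-validation rather than fixed at p/3. A sketch along these lines should work (the grid of mtry values is my own choice, not from the original analysis):
#Sketch: tune mtry for the random forest with 5-fold cross-validation via caret
ctrl <- trainControl(method = "cv", number = 5)
rf.tuned <- train(price ~ ., data = train, method = "rf",
tuneGrid = expand.grid(mtry = c(3, 5, 8, 12, 16)),
trControl = ctrl, ntree = 500)
rf.tuned$bestTune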
We now refit the random forest model on the full original training set and use it to predict housing prices for the original test set. The predictions, along with the ids from the test set, are then written to a csv file.
set.seed(99)
#Refit the best model using all the data from the original training set
bestmodel <- randomForest(price ~ ., data = df.train,
mtry = 16/3, ntree = 500, importance = TRUE)
#Predict and convert the prices back to original form
pred <- predict(bestmodel, df.test)
house.price <- as.numeric(round(exp(pred), 0))
out <- data.frame(id = test_ids, price = house.price)
#Write data frame to csv file
write.csv(out, file = "testing_predictions_Predix_Joel_JMP261.csv", row.names = FALSE)
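Before submitting, a quick sanity check on the back-transformed predictions (not part of the original output) can catch obvious problems such as negative or implausibly large prices:
#Optional sanity check of the predicted prices
summary(house.price)
head(out)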