Jiang Li
8/29/2017
You are the data scientist at Acme Realty, a real estate company specializing in listings in upscale Hillsborough, California. Senior management has asked you to build a predictive sales model using R.
You have two sets of data at your disposal:
DataScience_7_Case_realdata.xls: Shows details of recent home sales, including the sales price, the house size in square feet, the lot size in square feet, and the numbers of bedrooms and bathrooms.
DataScience_7_Case_TimeSeries.xls: Shows the average home sales price for every quarter from the first quarter of 2008 through the fourth quarter of 2010 (12 observations).
To be thorough, you plan to build two regression-based predictive models in R. The first will use time series regression to forecast future sale prices from historical prices. The second will use multiple regression to predict the sale price from the house size and lot size.
library(readxl)
## Read time series dataset
time.df = read_excel(path = "DataScience_7_Case_TimeSeries.xls",sheet = "Sheet1")
dim(time.df)
## [1] 12 2
head(time.df)
## Date Price ($M)
## 1 2008.00 2.6
## 2 2008.25 2.5
## 3 2008.50 2.5
## 4 2008.75 2.6
## 5 2009.00 2.7
## 6 2009.25 2.7
## Read the home sales dataset: sale price, house size, and lot size
sale.df = read_excel(path = "DataScience_7_Case_realdata.xls")
dim(sale.df)
## [1] 18 3
sale.df
## Price House Lot
## 1 6.0 6.9 42.7
## 2 5.8 8.0 36.6
## 3 5.6 8.0 44.0
## 4 3.5 3.8 18.0
## 5 3.4 6.1 27.4
## 6 3.4 4.3 22.2
## 7 2.7 3.8 22.0
## 8 2.6 5.0 29.3
## 9 2.6 3.6 31.4
## 10 2.3 3.1 22.2
## 11 2.3 3.9 21.7
## 12 2.3 3.2 24.4
## 13 1.9 3.5 25.3
## 14 1.9 3.4 24.0
## 15 1.9 3.2 21.8
## 16 1.6 3.3 6.6
## 17 1.6 2.3 15.9
## 18 1.5 2.3 21.2
Using the regression analysis capability in R, forecast home sales for 2011. Use the TimeSeries dataset. State the governing equation.
library(ggplot2)
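## Plot the quarterly average sale price (bars, points, and connecting line)
ggplot(data = time.df, aes(x = Date, y = `Price ($M)`)) +
  geom_bar(stat = "identity", fill = "grey") +
  geom_point(color = "blue") +
  geom_line(color = "red") +
  ggtitle("Average sale price by quarter, 2008 to 2010")
## Fit a linear trend of price on the decimal-year date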
fit = lm(formula = `Price ($M)`~Date,data=time.df)
summary(fit)
##
## Call:
## lm(formula = `Price ($M)` ~ Date, data = time.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.243473 -0.046970 -0.003613 0.039744 0.203380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 216.06725 83.86487 2.576 0.0276 *
## Date -0.10629 0.04174 -2.547 0.0290 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1248 on 10 degrees of freedom
## Multiple R-squared: 0.3934, Adjusted R-squared: 0.3328
## F-statistic: 6.486 on 1 and 10 DF, p-value: 0.02902
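The task asks for the governing equation. Reading the Estimate column of the coefficient table above, the fitted trend line can be written as follows, with Date as a decimal year and Price in $M. The coefficients are shown rounded as printed, so results computed from them may differ slightly from those of the stored model.

$$
\widehat{\text{Price}} = 216.06725 - 0.10629 \times \text{Date}
$$

The slope implies a decline of roughly $0.106M per year, or about $0.027M per quarter. The R-squared of 0.39 means the linear trend explains only about 39% of the variation in quarterly prices, so any extrapolation beyond 2010 should be read with caution.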
## Replot with the fitted regression line
ggplot(data = time.df, aes(x = Date, y = `Price ($M)`)) +
  geom_bar(stat = "identity", fill = "grey") +
  geom_point(color = "blue") +
  geom_line(color = "red") +
  geom_abline(slope = coef(fit)[["Date"]], intercept = coef(fit)[["(Intercept)"]],
              color = "purple") +                                ## Regression line from fit
  geom_smooth(method = "lm", color = "purple", fill = "green") + ## Same line with its 95% confidence band
  ggtitle("Average sale price by quarter, 2008 to 2010\n(Purple is the regression line)")
## Date = 2011 is the first quarter of 2011 on the decimal-year scale
d.2011 = data.frame(Date = 2011)
predict.2011 = predict(fit, newdata = d.2011)
cat("Forecast sale price for 2011 is $", round(predict.2011, 2), "M\n", sep = "")
## Forecast sale price for 2011 is $2.31M
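Because the data are quarterly, a forecast "for 2011" can also be read as four quarterly forecasts. The sketch below is an illustration rather than part of the original analysis. It retypes the rounded coefficients from the summary(fit) output above, so its values may differ from predict(fit, ...) by about 0.01. The names price.hat and forecast.2011 are hypothetical.

```r
# Coefficients copied from the printed summary(fit) output (rounded to 5 decimals)
b0 <- 216.06725   # intercept
b1 <- -0.10629    # slope: change in price ($M) per year

# Fitted trend line as a function of the decimal-year date
price.hat <- function(date) b0 + b1 * date

# Forecast each quarter of 2011 (2011.00, 2011.25, 2011.50, 2011.75)
quarters.2011 <- seq(2011, 2011.75, by = 0.25)
forecast.2011 <- data.frame(Date  = quarters.2011,
                            Price = price.hat(quarters.2011))
print(round(forecast.2011, 2))

# Average forecast price over the four quarters of 2011
cat("Mean forecast price for 2011 is $", round(mean(forecast.2011$Price), 2), "M\n", sep = "")
```

With the fitted model object available, the same forecasts come from predict(fit, data.frame(Date = quarters.2011)), which uses the unrounded coefficients.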
Using the regression analysis capability in R, forecast the price for a house size of 4000 square feet and a lot size of 22000 square feet. State the governing equation. Use the realdata dataset.
library(scatterplot3d)
## View the data (sizes are in thousands of square feet)
scatterplot3d(x = sale.df$House, y = sale.df$Lot, z = sale.df$Price,
              xlab = "House size (1000 square feet)",
              ylab = "Lot size (1000 square feet)",
              zlab = "Price ($M)",
              main = "Sale price vs house and lot size",
              highlight.3d = TRUE, type = "h")
## Build the model: price as a linear function of house size and lot size
fit2 = lm(formula = Price ~ House + Lot, data = sale.df)
summary(fit2)
##
## Call:
## lm(formula = Price ~ House + Lot, data = sale.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88939 -0.25993 -0.03057 0.21752 1.09898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.55415 0.39559 -1.401 0.181614
## House 0.64680 0.12713 5.088 0.000134 ***
## Lot 0.02763 0.02478 1.115 0.282361
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5486 on 15 degrees of freedom
## Multiple R-squared: 0.8738, Adjusted R-squared: 0.857
## F-statistic: 51.94 on 2 and 15 DF, p-value: 1.808e-07
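The governing equation for this model follows from the Estimate column above, with Price in $M and House and Lot in thousands of square feet (coefficients rounded as printed):

$$
\widehat{\text{Price}} = -0.55415 + 0.64680 \times \text{House} + 0.02763 \times \text{Lot}
$$

Substituting the house in question, House = 4 and Lot = 22:

$$
\widehat{\text{Price}} = -0.55415 + 0.64680(4) + 0.02763(22) = -0.55415 + 2.58720 + 0.60786 \approx 2.64
$$

Holding lot size fixed, each additional 1000 square feet of house is associated with about $0.65M of price. The Lot coefficient is small and not statistically significant (p = 0.28), so most of the model's explanatory power appears to come from house size.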
## Replot and overlay the fitted regression plane
s3d = scatterplot3d(x = sale.df$House, y = sale.df$Lot, z = sale.df$Price,
                    xlab = "House size (1000 square feet)",
                    ylab = "Lot size (1000 square feet)",
                    zlab = "Price ($M)",
                    main = "Sale price vs house and lot size",
                    highlight.3d = TRUE, type = "h")
s3d$plane3d(fit2)
## House and Lot are in thousands of square feet, so 4000 sq ft is 4 and 22000 sq ft is 22
xx = data.frame(House = 4, Lot = 22)
pp = predict(object = fit2, newdata = xx)
cat("Forecast sale price for a 4000 sq ft house on a 22000 sq ft lot is $", round(pp, 2), "M\n", sep = "")
## Forecast sale price for a 4000 sq ft house on a 22000 sq ft lot is $2.64M
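A point forecast says nothing about its own uncertainty. As a hedged sketch, not part of the original analysis, the block below retypes the 18 sales from the printed sale.df, refits the same model, and asks predict() for a 95% prediction interval. It assumes the printed values are the full-precision data. The name new.house is hypothetical.

```r
# Sales data retyped from the printed sale.df
# (Price in $M; House and Lot in thousands of square feet)
sale.df <- data.frame(
  Price = c(6.0, 5.8, 5.6, 3.5, 3.4, 3.4, 2.7, 2.6, 2.6,
            2.3, 2.3, 2.3, 1.9, 1.9, 1.9, 1.6, 1.6, 1.5),
  House = c(6.9, 8.0, 8.0, 3.8, 6.1, 4.3, 3.8, 5.0, 3.6,
            3.1, 3.9, 3.2, 3.5, 3.4, 3.2, 3.3, 2.3, 2.3),
  Lot   = c(42.7, 36.6, 44.0, 18.0, 27.4, 22.2, 22.0, 29.3, 31.4,
            22.2, 21.7, 24.4, 25.3, 24.0, 21.8, 6.6, 15.9, 21.2)
)

# Same model as in the report: price on house size and lot size
fit2 <- lm(Price ~ House + Lot, data = sale.df)

# 4000 sq ft house on a 22000 sq ft lot, in the model's units
new.house <- data.frame(House = 4, Lot = 22)

# Point forecast (fit) with a 95% prediction interval (lwr, upr)
pred <- predict(fit2, newdata = new.house, interval = "prediction", level = 0.95)
print(round(pred, 2))
```

The interval is wide relative to the forecast because the residual standard error is about $0.55M on only 18 sales, so the forecast is better treated as a rough guide than as a precise valuation.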