Learning Outcomes measured in this assignment: LO1 to LO8
Content knowledge you'll gain from doing this assignment: test/train splitting, building a linear model, and communicating your results to a non-technical audience.
Submission: You have two options; choose whichever you prefer. If you choose to publish in RPubs, share your link in your assignment submission.
|   | R code | Decision/Why | Communication of findings |
|---|---|---|---|
| Percentage of Assigned Points | 30% | 35% | 35% |
Decision/Why: Explain the reasoning behind your choice of procedure, set of variables, and so on for each question.
Communication of your findings: Explain your results in terms of training MSE, testing MSE, and prediction of the variable Y.
Consider the data set below. This data set is a simplified version of the UCI Real Estate Valuation data set (https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set).
The data set comes from the following paper: Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
We are trying to predict the house price per unit area (10,000 New Taiwan Dollars per Ping, where Ping is a local unit; 1 Ping = 3.3 square meters).
Variables are as follows:
X2: the house age (in years)
X3: the distance to the nearest MRT station (in meters)
X4: the number of convenience stores in the living circle on foot
X5: the geographic coordinate, latitude (in degrees)
X6: the geographic coordinate, longitude (in degrees)

(10 points) The data is imported to R for you and named housing. Please separate the testing and training sets:
set a seed: the seed should be a function of your birth date, i.e., if you were born on January 18, 2022, use the seed 18.
separate 15% of your data into the testing set
housing=read.table("https://unh.box.com/shared/static/1s7plserj5i0i7uc6ybdc6icn0trtlv9.csv", header = TRUE, sep=",", dec=".")
head(housing)
## X2 X3 X4 X5 X6 Y
## 1 32.0 84.87882 10 24.98298 121.5402 37.9
## 2 19.5 306.59470 9 24.98034 121.5395 42.2
## 3 13.3 561.98450 5 24.98746 121.5439 47.3
## 4 13.3 561.98450 5 24.98746 121.5439 54.8
## 5 5.0 390.56840 5 24.97937 121.5425 43.1
## 6 7.1 2175.03000 3 24.96305 121.5125 32.1
library(caTools)  # provides sample.split
set.seed(123)
split = sample.split(housing$Y, SplitRatio = 0.85)
train_housing = subset(housing, split == TRUE)
test_housing = subset(housing, split == FALSE)
c(nrow(housing), nrow(train_housing), nrow(test_housing))
## [1] 414 351 63
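For reference, the same split could be done with base R alone, without the caTools dependency. This is only a sketch: sample.split stratifies on Y while plain sample does not, so the counts differ slightly, and the seed 18 assumes a January 18 birthday as in the instructions.

# Base-R alternative split: seed taken from the birth day-of-month
set.seed(18)  # assumed birthday of January 18; use your own day
test_idx  = sample(nrow(housing), size = round(0.15 * nrow(housing)))
train_alt = housing[-test_idx, ]
test_alt  = housing[test_idx, ]
c(nrow(train_alt), nrow(test_alt))  # 352 training rows, 62 testing rows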
For questions 2 to 5, use the training set.
Check the relationship between Y (the dependent variable) and the rest of the variables by obtaining the scatter plots and the respective correlations. Hint: plot and cor. Comment on your findings.

plot(housing[,"X2"], housing[,"Y"], col="blue")
q=cbind(House_Price=housing[,"Y"], House_Age_in_Years=housing[,"X2"])
cor(q)
## House_Price House_Age_in_Years
## House_Price 1.000000 -0.210567
## House_Age_in_Years -0.210567 1.000000
There appears to be a negative correlation between house prices and house age; that is, older houses tend to be cheaper. The correlation (-0.21), however, is weaker than one might expect.
plot(housing[,"X3"], housing[,"Y"], col="red")
r=cbind(House_Price=housing[,"Y"], Distance_to_the_nearest_station=housing[,"X3"])
cor(r)
## House_Price Distance_to_the_nearest_station
## House_Price 1.0000000 -0.6736129
## Distance_to_the_nearest_station -0.6736129 1.0000000
There is a fairly strong negative relationship between house prices and the distance to the nearest MRT station: the farther a house is from the station, the lower its price, which matches intuition.
plot(housing[,"X4"], housing[,"Y"], col="black")
s=cbind(House_Price=housing[,"Y"], Number_of_convinience_stores=housing[,"X4"])
cor(s)
## House_Price Number_of_convinience_stores
## House_Price 1.0000000 0.5710049
## Number_of_convinience_stores 0.5710049 1.0000000
The plot and correlation (0.57) suggest a positive relationship between house prices and the number of convenience stores nearby.
plot(housing[,"X5"], housing[,"Y"], col="dark grey")
t=cbind(House_Price=housing[,"Y"], Latitude=housing[,"X5"])
cor(t)
## House_Price Latitude
## House_Price 1.0000000 0.5463067
## Latitude 0.5463067 1.0000000
From the scatter plot, houses at similar latitudes appear to have similar prices; most observations cluster between latitudes 24.96 and 24.98, and the correlation (0.55) is moderately positive.
plot(housing[,"X6"], housing[,"Y"], col="green")
u=cbind(House_Price=housing[,"Y"], Longitude=housing[,"X6"])
cor(u)
## House_Price Longitude
## House_Price 1.0000000 0.5232865
## Longitude 0.5232865 1.0000000
There is a moderately positive correlation (0.52) between house prices and longitude.
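The five scatter plots and correlations above can also be produced in one shot; a compact base-R sketch:

# All pairwise scatter plots in a single panel
pairs(housing, col = "steelblue")
# Correlation of every column with the price Y, rounded to three decimals
round(cor(housing)[, "Y"], 3)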
Build simple linear regression models between Y (as the dependent variable) and your choices of the following variables: X2, X3, X4, X5, or X6 separately. Comment on your findings.

Feature scaling puts all numerical variables (predictors and response) on a similar scale, typically about -3 to +3, which makes the fitted coefficients directly comparable across predictors.
train_housing[, 1:6] = scale(train_housing[, 1:6])
test_housing[, 1:6] = scale(test_housing[, 1:6])
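One caveat: the chunk above standardizes the test set with its own means and standard deviations. A common alternative, sketched below with the hypothetical names train_scaled and test_scaled, could be used in place of that chunk; it reuses the training-set parameters so both sets share one transformation.

# Standardize the training set and keep its centering/scaling parameters
train_scaled = scale(train_housing[, 1:6])
# Reuse the *training* parameters on the test set (avoids information leakage)
test_scaled  = scale(test_housing[, 1:6],
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))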
regressor_1=lm(formula = Y~X2,
data=train_housing)
summary(regressor_1)
##
## Call:
## lm(formula = Y ~ X2, data = train_housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0901 -0.8084 0.1229 0.6121 3.4664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.007e-18 5.236e-02 0.000 1.000000
## X2 -2.013e-01 5.243e-02 -3.839 0.000147 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9809 on 349 degrees of freedom
## Multiple R-squared: 0.04051, Adjusted R-squared: 0.03776
## F-statistic: 14.74 on 1 and 349 DF, p-value: 0.0001468
ggplot()+
geom_point(data = train_housing, aes(X2, Y),
color = 'black')+
geom_line(aes(train_housing$X2, predict(regressor_1, newdata = train_housing)),
color = 'blue')+
ggtitle('Price vs House Age (Training Set)') +
xlab('House Age') +
ylab('Price')
Visualizing the prediction for the test set:
ggplot()+
geom_point(data = test_housing, aes(X2, Y),
color = 'black')+
geom_line(aes(train_housing$X2, predict(regressor_1, newdata = train_housing)),
color = 'blue')+
ggtitle('Price vs House Age (Test Set)') +
xlab('House Age') +
ylab('Price')
Given that the p-value is far below 0.05 (the *** marks high statistical significance), we can conclude that there is a linear relationship between house price and house age. The plot shows that the relationship is negative: the older the house, the lower the price.
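To quantify the uncertainty around the slope reported above, confidence intervals can be pulled directly from the fitted object; a minimal sketch using base R's confint:

# 95% confidence intervals for the intercept and the X2 slope
confint(regressor_1, level = 0.95)
# The slope interval excludes zero, consistent with the significant p-value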
regressor_2=lm(formula = Y~X3,
data=train_housing)
summary(regressor_2)
##
## Call:
## lm(formula = Y ~ X3, data = train_housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8990 -0.4560 -0.1118 0.4103 2.6879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.994e-18 3.810e-02 0.00 1
## X3 -7.014e-01 3.815e-02 -18.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7138 on 349 degrees of freedom
## Multiple R-squared: 0.4919, Adjusted R-squared: 0.4905
## F-statistic: 337.9 on 1 and 349 DF, p-value: < 2.2e-16
ggplot()+
geom_point(data = train_housing, aes(X3, Y),
color = 'black')+
geom_line(aes(train_housing$X3, predict(regressor_2, newdata = train_housing)),
color = 'red')+
ggtitle('Price vs Distance to Nearest MRT Station (Training Set)') +
xlab('Distance to Nearest MRT Station') +
ylab('Price')
ggplot()+
geom_point(data = test_housing, aes(X3, Y),
color = 'black')+
geom_line(aes(train_housing$X3, predict(regressor_2, newdata = train_housing)),
color = 'red')+
ggtitle('Price vs Distance to Nearest MRT Station (Test Set)') +
xlab('Distance to Nearest MRT Station') +
ylab('Price')
The p-value is very small, indicating a linear relationship between house price and distance to the nearest MRT station. The negative coefficient (-0.70) and the plot show that the relationship is negative: prices fall as the distance to the station grows.
regressor_3=lm(formula = Y~X4,
data=train_housing)
summary(regressor_3)
##
## Call:
## lm(formula = Y ~ X4, data = train_housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6968 -0.5628 -0.1119 0.4808 2.3786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.930e-16 4.209e-02 0.00 1
## X4 6.163e-01 4.215e-02 14.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7886 on 349 degrees of freedom
## Multiple R-squared: 0.3799, Adjusted R-squared: 0.3781
## F-statistic: 213.8 on 1 and 349 DF, p-value: < 2.2e-16
ggplot()+
geom_point(data = train_housing, aes(X4, Y),
color = 'black')+
geom_line(aes(train_housing$X4, predict(regressor_3, newdata = train_housing)),
color = 'green')+
ggtitle('Price vs Number of Convenience Stores in Living Circle on Foot (Training Set)') +
xlab('Number of Convenience Stores (on foot)') +
ylab('Price')
ggplot()+
geom_point(data = test_housing, aes(X4, Y),
color = 'black')+
geom_line(aes(train_housing$X4, predict(regressor_3, newdata = train_housing)),
color = 'green')+
ggtitle('Price vs Number of Convenience Stores in Living Circle on Foot (Test Set)') +
xlab('Number of Convenience Stores (on foot)') +
ylab('Price')
The small p-value indicates that the number of convenience stores near a house has a significant linear relationship with its price. The positive coefficient and the plot show that price increases as the number of stores increases.
Build a multiple linear regression model between Y (as the dependent variable) and the rest of the variables. Comment on your findings.

regressor_mlr=lm(formula = Y~.,
data=train_housing)
summary(regressor_mlr)
##
## Call:
## lm(formula = Y ~ ., data = train_housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4849 -0.3921 -0.1187 0.3214 2.6929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.825e-14 3.292e-02 0.000 1.000
## X2 -2.208e-01 3.330e-02 -6.630 1.29e-10 ***
## X3 -3.847e-01 6.536e-02 -5.886 9.36e-09 ***
## X4 2.958e-01 4.166e-02 7.099 7.19e-12 ***
## X5 2.178e-01 4.109e-02 5.301 2.06e-07 ***
## X6 1.403e-02 5.426e-02 0.259 0.796
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6167 on 345 degrees of freedom
## Multiple R-squared: 0.6251, Adjusted R-squared: 0.6196
## F-statistic: 115 on 5 and 345 DF, p-value: < 2.2e-16
This shows that price has a significant linear relationship with every predictor except X6, the only one without *** beside its coefficient.
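Since X6 is the only insignificant predictor, one natural follow-up (a sketch, not part of the graded output; the name regressor_reduced is hypothetical) is to refit without it and compare adjusted R-squared:

# Hypothetical reduced model dropping the insignificant longitude term (X6)
regressor_reduced = lm(Y ~ X2 + X3 + X4 + X5, data = train_housing)
# Compare with the full model's adjusted R-squared (0.6196 above)
summary(regressor_reduced)$adj.r.squared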
#Autocorrelation:
library(car)  # provides durbinWatsonTest and vif
durbinWatsonTest(regressor_mlr)
## lag Autocorrelation D-W Statistic p-value
## 1 0.0109545 1.96598 0.724
## Alternative hypothesis: rho != 0
#Constant Variance:
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(regressor_mlr)
##
## studentized Breusch-Pagan test
##
## data: regressor_mlr
## BP = 9.0416, df = 5, p-value = 0.1074
#Near-perfect multicollinearity:
vif(regressor_mlr)
## X2 X3 X4 X5 X6
## 1.020239 3.930810 1.597178 1.553480 2.709277
Autocorrelation (Durbin-Watson test): since the p-value (0.724) is large, we fail to reject the null hypothesis of no autocorrelation, so we can assume the residuals are not autocorrelated.
Constant variance (Breusch-Pagan test): since the p-value (0.107) is large, we do not reject the null hypothesis of constant variance, so the homoscedasticity assumption holds.
Multicollinearity (VIF): all values are below 5, so near-perfect multicollinearity does not appear to be a problem.
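Beyond these formal tests, base R's plot method on the fitted model gives quick visual checks of the same assumptions; a minimal sketch:

# Residuals vs fitted, normal Q-Q, scale-location, and leverage plots
par(mfrow = c(2, 2))
plot(regressor_mlr)
par(mfrow = c(1, 1))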
Which of your models predicts Y better? Comment on your findings.

pred1=predict(regressor_1, newdata=test_housing)
pred2=predict(regressor_2, newdata=test_housing)
pred3=predict(regressor_3, newdata=test_housing)
pred4=predict(regressor_mlr, newdata=test_housing)
home_prediction=cbind(obs=seq(1,nrow(test_housing)),house_price=test_housing[,"Y"], pred1, pred2, pred3, pred4)
head(home_prediction)
## obs house_price pred1 pred2 pred3 pred4
## 4 1 0.9803208 0.05058326 0.3410621 0.2658455 0.6825378
## 5 2 0.3005518 0.20198558 0.4284048 0.2658455 0.7728640
## 8 3 0.5097115 -0.07710545 0.4808696 0.4726143 0.6104790
## 11 4 0.2017819 -0.34160347 0.4209426 -0.5612295 -0.3210562
## 16 5 0.7304912 -0.35802059 0.3322859 -0.3544607 -0.1426521
## 20 6 0.5678114 0.26582993 0.6154990 0.6793830 0.9665110
mean_se1=mean((test_housing[,"Y"]-pred1)^2)
mean_se2=mean((test_housing[,"Y"]-pred2)^2)
mean_se3=mean((test_housing[,"Y"]-pred3)^2)
mean_se4=mean((test_housing[,"Y"]-pred4)^2)
library(knitr)       # provides kable
library(kableExtra)  # provides kable_styling and the %>% pipe
Models=c("Model 1", "Model 2", "Model 3", "MLR Model")
mse=c(mean_se1, mean_se2, mean_se3, mean_se4)
mse=round(mse, 4)
df=rbind(Models, mse)
kable(df) %>%
kable_styling("striped")
| Models | Model 1 | Model 2 | Model 3 | MLR Model |
|---|---|---|---|---|
| MSE | 0.9197 | 0.6666 | 0.8682 | 0.6239 |
The second and fourth models have the lowest mean squared error, with the fourth (multiple regression) model slightly better. This means that the distance of a house to public transportation is a better single predictor of price than the other variables, and that using all predictors together improves accuracy further.
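The four MSE calculations above follow one pattern, so they could be wrapped in a small helper; a sketch assuming the fitted regressors above (test_mse is a hypothetical name):

# Hypothetical helper: mean squared error of a model on a given test set
test_mse = function(model, test_data, response = "Y") {
  preds = predict(model, newdata = test_data)
  mean((test_data[[response]] - preds)^2)
}
sapply(list(Model1 = regressor_1, Model2 = regressor_2,
            Model3 = regressor_3, MLR = regressor_mlr),
       test_mse, test_data = test_housing)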
Do you think that if you had used a different seed your MSE would be the same? Why or why not?
Using a different seed would produce a different MSE, because different seeds generate different training and testing splits, and the MSE depends on which observations land in each set.
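A quick sanity check (a sketch with the hypothetical names idx_a and idx_b; the size 63 matches the test-set size above) confirms that different seeds select different test rows:

# Two different seeds select different test rows from the same data
set.seed(18); idx_a = sample(nrow(housing), size = 63)
set.seed(99); idx_b = sample(nrow(housing), size = 63)
identical(sort(idx_a), sort(idx_b))  # almost surely FALSE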
If you could have any data you wanted, which other variable(s) would you have liked to include in your regression?
It would have been useful to include additional predictors such as economic indicators, for example median household income or the number of white-collar jobs in the area.
(15 points) Executive Summary: Suppose that you are working for a real estate company as a data scientist with the goal of developing a statistical model to estimate the value of houses. Communicate your findings to stakeholders (a non-technical audience).