Guidelines

Purpose:

  • Learning Outcomes measured in this assignment: LO1 to LO8

  • Content knowledge you’ll gain from doing this assignment: test/train splitting, building a linear model, and communicating your results to a non-technical audience.

Criteria:

  • Submission: You have two options; choose whichever you prefer.

    1. Upload the knitted document on Canvas.
    2. Publish your final output in RPubs. https://rpubs.com/about/getting-started
  • If you choose to publish in RPubs, share your link in your assignment submission.

The grading rubric can be found below:

                                 R code   Decision/Why   Communication of findings
  Percentage of Assigned Points    30%        35%                  35%
  • Decision/why?: Explain the reasoning behind your choice of procedure, set of variables, and so on for each question.

    • Explain why you use the procedure/model/variable
    • To exceed this criterion, describe the steps taken to implement the procedure in a non-technical way.
  • Communication of your findings: Explain your results in terms of training MSE, testing MSE, and prediction of the response variable Y

    • Explain why you think one model is better than the other.
    • To exceed this criterion, explain your model and how it predicts the variable of interest in a non-technical way.

Data

Consider the data set below. This data is a simplified version of the UCI data set (https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set).

The data set comes from the following paper: Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.

We are trying to predict the house price of unit area (10,000 New Taiwan Dollar/Ping, where Ping is a local unit; 1 Ping = 3.3 square meters).

Variables are as follows:

  • X2: house age (in years)
  • X3: distance to the nearest MRT station (in meters)
  • X4: number of convenience stores within walking distance
  • X5: latitude (in degrees)
  • X6: longitude (in degrees)
  • Y: house price of unit area (10,000 NTD/Ping)

Questions

  1. (10 points) The data is imported into R for you and named housing. Please separate it into testing and training sets:

    • set a seed: the seed should be a function of your birth date, e.g., if you were born on January 18, 2022, use the seed 18.

    • separate 15% of your data into the testing set

# Packages used throughout: caTools (sample.split), ggplot2 (plots),
# car (durbinWatsonTest, vif), and knitr/kableExtra (formatted tables).
library(caTools)
library(ggplot2)
library(car)
library(knitr)
library(kableExtra)

housing = read.table("https://unh.box.com/shared/static/1s7plserj5i0i7uc6ybdc6icn0trtlv9.csv",
                     header = TRUE, sep = ",", dec = ".")

head(housing)
##     X2         X3 X4       X5       X6    Y
## 1 32.0   84.87882 10 24.98298 121.5402 37.9
## 2 19.5  306.59470  9 24.98034 121.5395 42.2
## 3 13.3  561.98450  5 24.98746 121.5439 47.3
## 4 13.3  561.98450  5 24.98746 121.5439 54.8
## 5  5.0  390.56840  5 24.97937 121.5425 43.1
## 6  7.1 2175.03000  3 24.96305 121.5125 32.1
set.seed(123)
split = sample.split(housing$Y, SplitRatio = 0.85)
train_housing = subset(housing, split == TRUE)
test_housing = subset(housing, split == FALSE)

c(nrow(housing), nrow(train_housing), nrow(test_housing))
## [1] 414 351  63
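
A base-R alternative (a sketch, not required by the prompt): the same 85/15 split can be done with sample() and no extra package, using a birth-day seed as the assignment suggests (18 below is the prompt's example, not the seed used above):

# Base-R alternative sketch: hold out 15% of rows as the test set
set.seed(18)  # example birth-day seed from the prompt
test_idx = sample(seq_len(nrow(housing)), size = round(0.15 * nrow(housing)))
train_alt = housing[-test_idx, ]  # 85% for training
test_alt  = housing[test_idx, ]   # 15% for testing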

For questions 2 to 5, use the training set.

  2. (15 points) Explore whether a linear relationship is viable between Y (the dependent variable) and the rest of the variables by obtaining scatter plots and the respective correlations. Hint: plot and cor. Comment on your findings.
plot(housing[,"X2"], housing[,"Y"], col="blue")

q=cbind(House_Price=housing[,"Y"], House_Age_in_Years=housing[,"X2"])
cor(q)
##                    House_Price House_Age_in_Years
## House_Price           1.000000          -0.210567
## House_Age_in_Years   -0.210567           1.000000

There appears to be a negative correlation between house price and house age: older houses tend to be cheaper, and vice versa. However, the correlation (-0.21) is weaker than one might expect.

plot(housing[,"X3"], housing[,"Y"], col="red")

r=cbind(House_Price=housing[,"Y"], Distance_to_the_nearest_station=housing[,"X3"])
cor(r)
##                                 House_Price Distance_to_the_nearest_station
## House_Price                       1.0000000                      -0.6736129
## Distance_to_the_nearest_station  -0.6736129                       1.0000000

There is a fairly strong negative relationship between house price and distance to the nearest MRT station: the farther a house is from the station, the cheaper it tends to be, which matches intuition.

plot(housing[,"X4"], housing[,"Y"],  col="black")

s=cbind(House_Price=housing[,"Y"], Number_of_convenience_stores=housing[,"X4"])
cor(s)
##                              House_Price Number_of_convenience_stores
## House_Price                    1.0000000                    0.5710049
## Number_of_convenience_stores   0.5710049                    1.0000000

This plot and correlation suggest a positive relationship between house price and the number of convenience stores nearby.

plot(housing[,"X5"], housing[,"Y"],  col="dark grey")

t=cbind(House_Price=housing[,"Y"], Latitude=housing[,"X5"])
cor(t)
##             House_Price  Latitude
## House_Price   1.0000000 0.5463067
## Latitude      0.5463067 1.0000000

From the scatter plot, houses at similar latitudes appear to have similar prices: most observations cluster between latitudes 24.96 and 24.98, and the correlation with price is moderately positive (0.55).

plot(housing[,"X6"], housing[,"Y"],  col="green")

u=cbind(House_Price=housing[,"Y"], Longitude=housing[,"X6"])
cor(u)
##             House_Price Longitude
## House_Price   1.0000000 0.5232865
## Longitude     0.5232865 1.0000000

There is a moderate positive correlation (0.52) between house price and longitude.
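
All of these pairwise checks can also be produced at once; a compact sketch using the same base functions as the hint:

# Compact alternative: every scatter plot and correlation in one shot
pairs(housing[, c("X2", "X3", "X4", "X5", "X6", "Y")], col = "blue")
round(cor(housing), 2)  # full correlation matrix, rounded for readability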

  3. (15 points) Estimate 3 simple linear regression models between Y (as the dependent variable) and your choices of the following variables: X2, X3, X4, X5, or X6, separately. Comment on your findings.

Feature scaling puts all numerical variables (predictors and response) on a similar scale, typically from about -3 to +3 after standardization. It does not change the substance of a linear fit, but it puts the coefficients on a common scale so their magnitudes can be compared across predictors.

train_housing[, 1:6] = scale(train_housing[, 1:6])
test_housing[, 1:6] = scale(test_housing[, 1:6])
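# Note (alternative sketch, not run here): scaling each set with its own
# mean/SD, as above, lets a little test-set information into the
# preprocessing. A stricter variant reuses the training set's centers
# and spreads for the test set:
#   train_sc = scale(train_housing[, 1:6])
#   test_sc  = scale(test_housing[, 1:6],
#                    center = attr(train_sc, "scaled:center"),
#                    scale  = attr(train_sc, "scaled:scale"))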
regressor_1=lm(formula = Y~X2, 
               data=train_housing)
summary(regressor_1)
## 
## Call:
## lm(formula = Y ~ X2, data = train_housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0901 -0.8084  0.1229  0.6121  3.4664 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.007e-18  5.236e-02   0.000 1.000000    
## X2          -2.013e-01  5.243e-02  -3.839 0.000147 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9809 on 349 degrees of freedom
## Multiple R-squared:  0.04051,    Adjusted R-squared:  0.03776 
## F-statistic: 14.74 on 1 and 349 DF,  p-value: 0.0001468
ggplot()+
    geom_point(data = train_housing, aes(X2, Y), 
               color = 'black')+
    geom_line(aes(train_housing$X2, predict(regressor_1, newdata = train_housing)),
              color = 'blue')+
    ggtitle('Price vs House Age(Training Set)') +
    xlab('House Age') +
    ylab('Price')

Visualizing the predictions for the test set (the blue line is the regression line learned from the training set):

ggplot()+
    geom_point(data = test_housing, aes(X2, Y), 
               color = 'black')+
    geom_line(aes(train_housing$X2, predict(regressor_1, newdata = train_housing)),
              color = 'blue')+
    ggtitle('Price vs House Age(Test Set)') +
    xlab('House Age') +
    ylab('Price')

Given that the p-value is far below 0.05 (the *** flags high statistical significance), we can conclude that there is a linear relationship between house price and house age. The plot shows that the relationship is negative: the older the house, the lower the price.

regressor_2=lm(formula = Y~X3, 
               data=train_housing)
summary(regressor_2)
## 
## Call:
## lm(formula = Y ~ X3, data = train_housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8990 -0.4560 -0.1118  0.4103  2.6879 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.994e-18  3.810e-02    0.00        1    
## X3          -7.014e-01  3.815e-02  -18.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7138 on 349 degrees of freedom
## Multiple R-squared:  0.4919, Adjusted R-squared:  0.4905 
## F-statistic: 337.9 on 1 and 349 DF,  p-value: < 2.2e-16
ggplot()+
    geom_point(data = train_housing, aes(X3, Y), 
               color = 'black')+
    geom_line(aes(train_housing$X3, predict(regressor_2, newdata = train_housing)),
              color = 'red')+
    ggtitle('Price vs Distance to Nearest MRT Station (Training Set)') +
    xlab('Distance to Nearest MRT Station') +
    ylab('Price')

ggplot()+
    geom_point(data = test_housing, aes(X3, Y), 
               color = 'black')+
    geom_line(aes(train_housing$X3, predict(regressor_2, newdata = train_housing)),
              color = 'red')+
    ggtitle('Price vs Distance to Nearest MRT Station (Test Set)') +
    xlab('Distance to Nearest MRT Station') +
    ylab('Price')

The p-value is very small, indicating a linear relationship between house price and distance to the nearest MRT station. The negative coefficient (-0.70) and the plot show that the relationship is negative: prices fall as the distance to the station grows.

regressor_3=lm(formula = Y~X4, 
               data=train_housing)
summary(regressor_3)
## 
## Call:
## lm(formula = Y ~ X4, data = train_housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6968 -0.5628 -0.1119  0.4808  2.3786 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.930e-16  4.209e-02    0.00        1    
## X4           6.163e-01  4.215e-02   14.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7886 on 349 degrees of freedom
## Multiple R-squared:  0.3799, Adjusted R-squared:  0.3781 
## F-statistic: 213.8 on 1 and 349 DF,  p-value: < 2.2e-16
ggplot()+
    geom_point(data = train_housing, aes(X4, Y), 
               color = 'black')+
    geom_line(aes(train_housing$X4, predict(regressor_3, newdata = train_housing)),
              color = 'green')+
    ggtitle('Price vs Number of Convenience Stores within Walking Distance (Training Set)') +
    xlab('Number of Convenience Stores (on foot)') +
    ylab('Price')

ggplot()+
    geom_point(data = test_housing, aes(X4, Y), 
               color = 'black')+
    geom_line(aes(train_housing$X4, predict(regressor_3, newdata = train_housing)),
              color = 'green')+
    ggtitle('Price vs Number of Convenience Stores within Walking Distance (Test Set)') +
    xlab('Number of Convenience Stores (on foot)') +
    ylab('Price')

The small p-value indicates that the number of convenience stores near a house has a significant linear relationship with its price. The positive coefficient and the plot show that price increases as the number of nearby stores increases.
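
Once estimated, such a model can also produce point predictions with uncertainty; a sketch (the value 1 below is a hypothetical input, one standard deviation above the mean store count, since the data were standardized):

# Sketch: predicted price, with a prediction interval, for a hypothetical house
predict(regressor_3, newdata = data.frame(X4 = 1), interval = "prediction")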

  4. (10 points) Estimate a multiple linear regression model between Y (as the dependent variable) and the rest of the variables. Comment on your findings.
regressor_mlr=lm(formula = Y~., 
                 data=train_housing)
summary(regressor_mlr)
## 
## Call:
## lm(formula = Y ~ ., data = train_housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4849 -0.3921 -0.1187  0.3214  2.6929 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.825e-14  3.292e-02   0.000    1.000    
## X2          -2.208e-01  3.330e-02  -6.630 1.29e-10 ***
## X3          -3.847e-01  6.536e-02  -5.886 9.36e-09 ***
## X4           2.958e-01  4.166e-02   7.099 7.19e-12 ***
## X5           2.178e-01  4.109e-02   5.301 2.06e-07 ***
## X6           1.403e-02  5.426e-02   0.259    0.796    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6167 on 345 degrees of freedom
## Multiple R-squared:  0.6251, Adjusted R-squared:  0.6196 
## F-statistic:   115 on 5 and 345 DF,  p-value: < 2.2e-16

This shows that price has a significant linear relationship with every predictor except X6 (the only coefficient without *** next to it).
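
Since X6 is not significant, an optional follow-up (a sketch, not required by the question) is to drop it and test whether the smaller model fits just as well:

# Sketch: refit without X6 and compare the nested models with a partial F-test
regressor_reduced = update(regressor_mlr, . ~ . - X6)
anova(regressor_reduced, regressor_mlr)  # a large p-value means X6 adds little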

  5. (10 points) Check the residuals of your model in part 4. Do you believe that the assumptions of regression are met? Comment on your findings.
# Autocorrelation (Durbin-Watson test, from the car package):
durbinWatsonTest(regressor_mlr)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.0109545       1.96598   0.724
##  Alternative hypothesis: rho != 0
# Constant variance (Breusch-Pagan test):
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(regressor_mlr)
## 
##  studentized Breusch-Pagan test
## 
## data:  regressor_mlr
## BP = 9.0416, df = 5, p-value = 0.1074
# Near-perfect multicollinearity (variance inflation factors):
vif(regressor_mlr)
##       X2       X3       X4       X5       X6 
## 1.020239 3.930810 1.597178 1.553480 2.709277

Autocorrelation (Durbin-Watson test): since the p-value is large, we fail to reject the null hypothesis of no autocorrelation, so we can treat the residuals as uncorrelated.

Constant variance (Breusch-Pagan test): since the p-value is large, we do not reject the null hypothesis of constant variance, so the homoskedasticity assumption appears to hold. The variance inflation factors are all well below the common cutoff of 10 (the largest is about 3.9), so near-perfect multicollinearity is not a concern either.
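
The built-in diagnostic plots give a quick visual complement to these tests; a sketch:

# Visual checks (sketch): linearity/constant variance and normality of residuals
par(mfrow = c(1, 2))
plot(regressor_mlr, which = 1)  # residuals vs fitted values
plot(regressor_mlr, which = 2)  # normal Q-Q plot of residuals
par(mfrow = c(1, 1))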

  6. (15 points) Using your testing data, calculate the accuracy (MSE) for all four models you have estimated. Which model predicts Y best? Comment on your findings.
pred1=predict(regressor_1, newdata=test_housing)
pred2=predict(regressor_2, newdata=test_housing)
pred3=predict(regressor_3, newdata=test_housing)
pred4=predict(regressor_mlr, newdata=test_housing)

home_prediction=cbind(obs=seq(1,nrow(test_housing)),house_price=test_housing[,"Y"], pred1, pred2, pred3, pred4)
head(home_prediction)
##    obs house_price       pred1     pred2      pred3      pred4
## 4    1   0.9803208  0.05058326 0.3410621  0.2658455  0.6825378
## 5    2   0.3005518  0.20198558 0.4284048  0.2658455  0.7728640
## 8    3   0.5097115 -0.07710545 0.4808696  0.4726143  0.6104790
## 11   4   0.2017819 -0.34160347 0.4209426 -0.5612295 -0.3210562
## 16   5   0.7304912 -0.35802059 0.3322859 -0.3544607 -0.1426521
## 20   6   0.5678114  0.26582993 0.6154990  0.6793830  0.9665110
mean_se1=mean((test_housing[,"Y"]-pred1)^2)
mean_se2=mean((test_housing[,"Y"]-pred2)^2)
mean_se3=mean((test_housing[,"Y"]-pred3)^2)
mean_se4=mean((test_housing[,"Y"]-pred4)^2)
Models=c("Model 1", "Model 2", "Model 3", "MLR Model")
mse=c(mean_se1, mean_se2, mean_se3, mean_se4)
mse=round(mse, 4)
df=rbind(Models, mse)
kable(df) %>%
  kable_styling("striped")
Models   Model 1   Model 2   Model 3   MLR Model
MSE       0.9197    0.6666    0.8682      0.6239

The second and fourth models have the lowest mean squared errors, with the fourth (the multiple regression) model slightly better. Among the single-predictor models, distance to the nearest MRT station is therefore the best predictor of price; using all the predictors together improves accuracy a bit further.
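
The four MSE computations above repeat the same pattern, so they could also be produced by a small helper (a sketch; test_mse is a name introduced here, not part of the assignment):

# Helper sketch: test-set MSE for any fitted model
test_mse = function(model) {
  mean((test_housing$Y - predict(model, newdata = test_housing))^2)
}
sapply(list(m1 = regressor_1, m2 = regressor_2,
            m3 = regressor_3, mlr = regressor_mlr), test_mse)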

  7. (10 points) Please comment on the following questions:
  1. Do you think that if you had used a different seed, your MSE would be the same? Why or why not?

    Using a different seed would produce a different MSE, because different seeds generate different training and testing sets, which changes the fitted models and their prediction errors.
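
    To see this concretely, a quick sketch refits the full model under a few arbitrary example seeds (on the unscaled data, so the error magnitudes differ from the table above):

# Sketch: the multiple-regression test MSE under three different seeds
for (s in c(1, 18, 123)) {
  set.seed(s)
  sp  = sample.split(housing$Y, SplitRatio = 0.85)
  fit = lm(Y ~ ., data = subset(housing, sp))
  te  = subset(housing, !sp)
  print(mean((te$Y - predict(fit, newdata = te))^2))
}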

  2. If you could have any data you wanted, which other variable(s) would you have liked to include in your regression?

It would have been useful to include additional economic predictors, such as median household income or the number of white-collar jobs in the area.

  8. (15 points) Executive Summary. Suppose that you are working for a real estate company as a data scientist with the goal of developing a statistical model to estimate the value of houses. Communicate your findings to stakeholders (a non-technical audience).

    • Explain your model, test/train errors, and the predictive power of your model without using jargon such as MSE.