INTRODUCTION

Linear regression serves many purposes in the modern world. At its core, linear regression takes in any number of variables, categorical or quantitative, and returns a prediction of a continuous variable. This shows up in the modern world in the following areas…

I) In insurance, it’s important to match higher levels of risk to higher premiums in order to maintain solvency. To do that, regression models can take a number of variables and estimate how risky a potential insured is likely to be. That risk score can then be used to set an appropriate premium.

II) Sometimes when we went to the doctor as kids, they would tell us how tall we were going to be. This wasn’t magic; it could have been linear regression. The doctor has information such as gender, height, weight, and age on file, and with that data they can extrapolate to predict adult height. While extrapolation is generally bad practice, this is an application of linear regression most of us are familiar with.

III) Another interesting application of linear regression is performance-versus-expectation modeling. A striking example is in football: based on the available data, a model returns the number of yards a player is expected to gain on a rushing play. Because these models fit well, an observation with a large residual suggests an abnormality; players gaining more yards than expected may simply be running the ball well. The same idea applies elsewhere: in health, blood pressure higher than the average for your circumstances can be flagged by comparing your observed value to the one predicted by a linear regression model. This type of modeling can surface abnormalities in everything from business to health to sports.

MULTIPLE LINEAR REGRESSION

Definition: Multiple linear regression is a statistical technique that uses multiple explanatory variables, each assumed to have a linear relationship with the response variable, to predict a value of the response variable.

Formula for MLR: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$

In order to run a multiple linear regression model, the dependent variable must be continuous, and its values should be normally distributed at each combination of values of the independent variables.
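
As a quick illustration of the formula above (a sketch on simulated data, not the project data set), lm() recovers the beta coefficients of a known linear model:

set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(100)  ## true beta0 = 2, beta1 = 1.5, beta2 = -0.7
coef(lm(y ~ x1 + x2))  ## estimates land near the true betas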

DATA ANALYSIS

Descriptive Statistics

Categorical Variables:

Quantitative Variables:

Fitting the multiple regression

## Standardizing the data
## Set aside the two categorical columns, scale the quantitative ones, then recombine
X1.transaction.date <- realEstate$X1.transaction.date
X4.number.of.convenience.stores <- realEstate$X4.number.of.convenience.stores
estate <- realEstate[,-c(1,4)]  ## drop the two categorical columns
estate <- scale(estate, center = TRUE, scale = TRUE)  ## mean 0, sd 1 per column
realEstate <- as.data.frame(cbind(X1.transaction.date, X4.number.of.convenience.stores, estate))
head(realEstate)
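
As a quick sanity check (not in the original analysis), each scaled column should now have mean approximately 0 and standard deviation 1:

round(colMeans(estate), 10)  ## all ~0 after centering
apply(estate, 2, sd)         ## all 1 after scaling
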
## Splitting data into training and testing data
set.seed(123)  ## for reproducibility

sampleSize <- floor(0.75 * nrow(realEstate))  ## 75% of rows go to training

trainingSample <- sample(seq_len(nrow(realEstate)), size = sampleSize)

originalTraining <- realEstate[trainingSample,]
originalTesting <- realEstate[-trainingSample,]
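
The models below are fit with data = training, and later code modifies testing, but only originalTraining and originalTesting are created above. Presumably working copies were made at this point; a minimal sketch (this step is assumed, not shown in the original):

## Working copies that later steps overwrite (assumed)
training <- originalTraining
testing <- originalTesting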

Using the Bayesian Information Criterion (BIC) to select variables:
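
The selection step itself isn’t echoed in the report; a minimal sketch of one common approach, backward stepwise search with a BIC penalty (the object name Price.BIC and the use of step() are assumptions, not the author’s code):

## Backward stepwise search scored by BIC (k = log(n)); a sketch, since the
## original selection code was not shown
full <- lm(Y.house.price.of.unit.area ~ ., data = training)
Price.BIC <- step(full, direction = "backward",
                  k = log(nrow(training)), trace = FALSE)
formula(Price.BIC)  ## expected: Y ~ X4 + X2 + X3 + X5, per the text below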

Here we can see that the combination of number of convenience stores (x4), house age (x2), distance to the nearest MRT station (x3), and latitude (x5) gives us the lowest BIC, which is a good place to start for our model. The summary of that model is below:
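
The chunk that fits this model isn’t echoed; reconstructed from the Call line in the output below, it would be:

Price.Model1 <- lm(Y.house.price.of.unit.area ~ X4.number.of.convenience.stores +
                     X2.house.age + X3.distance.to.the.nearest.MRT.station +
                     X5.latitude, data = training)
summary(Price.Model1)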

## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ X4.number.of.convenience.stores + 
##     X2.house.age + X3.distance.to.the.nearest.MRT.station + X5.latitude, 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5656 -0.3336 -0.0503  0.2515  5.4905 
## 
## Coefficients:
##                                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            -0.143457   0.110335  -1.300  0.19455    
## X4.number.of.convenience.stores2        0.031924   0.148132   0.216  0.82952    
## X4.number.of.convenience.stores3       -0.233130   0.190722  -1.222  0.22255    
## X4.number.of.convenience.stores4       -0.275451   0.156477  -1.760  0.07938 .  
## X4.number.of.convenience.stores5       -0.002066   0.180106  -0.011  0.99085    
## X4.number.of.convenience.stores6        0.237355   0.161782   1.467  0.14340    
## X4.number.of.convenience.stores7        0.392659   0.182281   2.154  0.03204 *  
## X4.number.of.convenience.stores8        0.280157   0.182912   1.532  0.12668    
## X4.number.of.convenience.stores9        0.422210   0.198369   2.128  0.03413 *  
## X4.number.of.convenience.stores10       0.638795   0.203451   3.140  0.00186 ** 
## X4.number.of.convenience.stores11       0.450652   0.337632   1.335  0.18299    
## X2.house.age                           -0.171190   0.040403  -4.237 3.03e-05 ***
## X3.distance.to.the.nearest.MRT.station -0.364293   0.055234  -6.595 1.95e-10 ***
## X5.latitude                             0.270744   0.050286   5.384 1.48e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6815 on 296 degrees of freedom
## Multiple R-squared:  0.5442, Adjusted R-squared:  0.5242 
## F-statistic: 27.19 on 13 and 296 DF,  p-value: < 2.2e-16

Checking Price.Model1 Diagnostics:

plot(Price.Model1)  ## residuals vs fitted, normal Q-Q, scale-location, leverage

Based on these diagnostic plots, Price.Model1 looks like a good candidate for a linear regression model.

Mean Square Error:

anova(Price.Model1)['Residuals', 'Mean Sq']
## [1] 0.2404926

What if we binned the number of convenience stores into four intervals instead of treating each count as its own level?

## Discretizing the X4 variable into four bins
## (discretize() here is assumed to come from the arules package)
library(arules)

training$X4.number.of.convenience.stores <- discretize(originalTraining$X4.number.of.convenience.stores, breaks = 4)

testing$X4.number.of.convenience.stores <- discretize(originalTesting$X4.number.of.convenience.stores, breaks = 4)
## Fitting a new model with the binned X4 variable
Price.Model2 <- lm(Y.house.price.of.unit.area ~ X4.number.of.convenience.stores + X2.house.age + X3.distance.to.the.nearest.MRT.station + X5.latitude, data = training)

summary(Price.Model2)
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ X4.number.of.convenience.stores + 
##     X2.house.age + X3.distance.to.the.nearest.MRT.station + X5.latitude, 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6098 -0.3527 -0.0881  0.2268  5.6544 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            -0.14565    0.11026  -1.321  0.18750    
## X4.number.of.convenience.stores[2,5)   -0.13358    0.12430  -1.075  0.28339    
## X4.number.of.convenience.stores[5,7)    0.15486    0.14659   1.056  0.29162    
## X4.number.of.convenience.stores[7,11]   0.41903    0.14496   2.891  0.00412 ** 
## X2.house.age                           -0.17738    0.03955  -4.485 1.03e-05 ***
## X3.distance.to.the.nearest.MRT.station -0.37735    0.05411  -6.973 1.94e-11 ***
## X5.latitude                             0.25153    0.04854   5.182 4.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6838 on 303 degrees of freedom
## Multiple R-squared:  0.5303, Adjusted R-squared:  0.521 
## F-statistic: 57.01 on 6 and 303 DF,  p-value: < 2.2e-16

Checking Price.Model2 Diagnostics:

plot(Price.Model2)  ## same four diagnostic plots as for Price.Model1

Price.Model2 also looks like a good candidate for a linear regression model, but Price.Model1’s residuals appear more normally distributed based on the Q-Q plot.

Mean Square Error:

anova(Price.Model2)['Residuals', 'Mean Sq']
## [1] 0.4675386
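
Both of these are training-set MSEs; since the data were split precisely so the models could be judged on unseen data, a natural follow-up (a sketch, assuming the working copies above, and not part of the original output) is to compare held-out MSE as well. Note that Price.Model1 must be scored on the non-discretized test set:

## Held-out MSE for each model -- a sketch, not in the original analysis
mean((originalTesting$Y.house.price.of.unit.area -
        predict(Price.Model1, newdata = originalTesting))^2)
mean((testing$Y.house.price.of.unit.area -
        predict(Price.Model2, newdata = testing))^2)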

The beta value for distance to the nearest MRT station is -0.364293, which suggests that, with all other variables held equal, a one-unit increase in (standardized) distance corresponds to a predicted price decrease of about 0.36 (standardized) units. The beta value for house age is -0.171190, which likewise suggests that, all else equal, a one-unit increase in house age corresponds to a predicted price decrease of about 0.17 units. The estimates for the X4 variable in Price.Model1 are more useful because they describe the relationship between price and each individual number of convenience stores, rather than grouping levels that may have different relationships with price into one bin and thereby adding error to the model.
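
These two slopes can be read straight off the fitted model:

## Pulling the slope estimates discussed above from Price.Model1
coef(Price.Model1)[c("X2.house.age",
                     "X3.distance.to.the.nearest.MRT.station")]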

CONCLUSION

Throughout the course of this project I got to practice a lot of real-world techniques used in statistical analysis. I began by cleaning the data to make sure it could be analyzed: data had to be stored in the proper class (strings as strings, numbers as integers and doubles), and categorical variables had to be identified and stored as factors. We also had to standardize the data so it was internally consistent and could be compared against data of different types. For example, someone five feet tall is significantly shorter than someone six feet tall, but it is hard to recognize just how big that difference is unless we standardize the data, which shows that five feet is toward the bottom of the height range while six feet is toward the top. Otherwise it appears the six-foot person is merely 20% taller, when in fact the two may be over 70 percentile points apart. Next, we had to discern which methods were appropriate for variable selection.

We then split the data into training and testing sets to make sure our model could generalize to new data, and tested different ways of encoding variables (binned into discrete chunks or kept as many-level factors) to determine which would result in the lowest mean squared error.

From our model we can discern that higher-priced houses tend to be located near MRT stations, to be fairly new, to sit farther north, or some combination of the three. More expensive homes also tend to have more convenience stores in close proximity.