REGRESSION PROJECT

 

Problem Statement:

As the owner of a data science consulting firm and of the building that my company currently occupies, I have to decide whether it is better for me to sell the building or rent it out to tenants. The decision has not been easy because many factors enter the analysis, and I have to be wise about the profit I make. My decision depends on how much rent I can charge and whether that rental rate will exceed the operating expenses and taxes of keeping ownership of the building. Since a decision has to be made, I can rely on a set of sample data to guide me.

In order to achieve our goal, we will analyze the other buildings in the sample data and build an ordinary least squares regression model, specifically a multiple linear regression model, to help us predict the rent I can charge for my building.

For this regression project our outcome variable (the variable we are trying to predict) is rental rate. Our potential predictor variables are all of the remaining variables in our sample data set (Age, OperExp, VacRate, SqFt, Taxes, W2MiDT).

Process:

1. Import the data

2. Examination of the distributions of the variables

      2.1 View the data (Look at the summary)
      2.2 Analyze the data visually (through histograms and boxplots) and look for skewness within the variables
      2.3 Examining the existence of correlations (through scatter plots and correlation plots)

3. Building and Fitting the Model

      3.1 Manual Iteration
      3.2 Stepwise Regression
      3.3 All Subsets Regression

4. Conclusion

5. Other Considerations

   

knitr::opts_chunk$set(echo = TRUE)
#Loading the libraries that are required to execute the regression modelling process
library(readr) # Used to read the csv file
library(ggplot2) #Used to create the ggplots
library(corrplot) #Used to create the correlation plots for different variables in a graphical table
library(car) # Contains the ncvTest (hypothesis test for non-constant variance) and qqPlot
library(MASS) #Used for Stepwise Regression
library(leaps) # Used for All Subsets Regression

1. Import the data

 

comm_prop <- read_csv("comm_prop.csv") #Importing the data from the CSV file that has the sample data. Note: Before importing, the working directory was set to the directory that contains the csv file.
## Parsed with column specification:
## cols(
##   RentRate = col_double(),
##   Age = col_integer(),
##   OperExp = col_double(),
##   VacRate = col_double(),
##   SqFt = col_integer(),
##   Taxes = col_double(),
##   W2MiDT = col_integer()
## )

   

2. Examination of the distributions of the variables

 

2.1 View the data

 

head(comm_prop) #The head function, from the 'utils' library, returns the FIRST 6 rows of the dataset
## # A tibble: 6 × 7
##   RentRate   Age OperExp VacRate   SqFt Taxes W2MiDT
##      <dbl> <int>   <dbl>   <dbl>  <int> <dbl>  <int>
## 1     13.5     1    5.02    0.14 123000  2.21      0
## 2     12.0    14    8.19    0.27 104079  1.87      0
## 3     10.5    16    3.00    0.00  39998  0.72      0
## 4     15.0     4   10.70    0.05  57112  1.03      0
## 5     14.0    11    8.97    0.07  60000  1.08      0
## 6     10.5    15    9.45    0.24 101385  1.82      0
tail(comm_prop) #The tail function, from the 'utils' library, returns the LAST 6 rows of the dataset
## # A tibble: 6 × 7
##   RentRate   Age OperExp VacRate   SqFt Taxes W2MiDT
##      <dbl> <int>   <dbl>   <dbl>  <int> <dbl>  <int>
## 1    13.40    16    9.15    0.06  99500  1.79      0
## 2    12.30     5    5.13    0.14 105940  1.91      0
## 3    16.20     3   11.90    0.05  94150  1.69      0
## 4    10.80    19    7.28    0.22  92410  1.66      0
## 5    18.45    13   10.87    0.03 408150  7.35      1
## 6    16.35     3    7.48    0.08 165304  2.98      1
summary(comm_prop) #Summary function returns a compact analysis of the dataset
##     RentRate          Age           OperExp          VacRate      
##  Min.   :10.50   Min.   : 0.00   Min.   : 3.000   Min.   :0.0000  
##  1st Qu.:14.00   1st Qu.: 2.00   1st Qu.: 7.997   1st Qu.:0.0000  
##  Median :15.00   Median : 4.00   Median :10.370   Median :0.0400  
##  Mean   :15.13   Mean   : 8.01   Mean   : 9.658   Mean   :0.0801  
##  3rd Qu.:16.50   3rd Qu.:15.00   3rd Qu.:11.660   3rd Qu.:0.0925  
##  Max.   :19.25   Max.   :20.00   Max.   :14.620   Max.   :0.7300  
##       SqFt            Taxes           W2MiDT    
##  Min.   : 27000   Min.   :0.490   Min.   :0.00  
##  1st Qu.: 73141   1st Qu.:1.315   1st Qu.:0.00  
##  Median :129614   Median :2.330   Median :0.00  
##  Mean   :166101   Mean   :2.990   Mean   :0.36  
##  3rd Qu.:244546   3rd Qu.:4.402   3rd Qu.:1.00  
##  Max.   :484290   Max.   :8.720   Max.   :1.00
#View(comm_prop) #The View function lets us inspect the dataset in a table-like format

  Observations: We can get a sense of our data by looking at the summary metrics. We noticed SqFt has a mean well above its median, which indicates the variable is probably right skewed. We should explore further with histograms; a quick numeric check follows below.
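
  As a numeric follow-up, a minimal sketch using only base R compares each continuous variable's mean to its median and computes a crude moment-based skewness statistic; positive values suggest right skew, negative values left skew:

# Gap between mean and median for each continuous column
sapply(comm_prop[,1:6], function(x) mean(x) - median(x))
# Crude skewness statistic: third central moment scaled by the cubed standard deviation
sapply(comm_prop[,1:6], function(x) mean((x - mean(x))^3) / sd(x)^3)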

2.2 Analyze the data visually and look for skewness in the data

 

# Check for missing values
sapply(comm_prop, function(x) sum(is.na(x))) #Check for missing values in all variables
## RentRate      Age  OperExp  VacRate     SqFt    Taxes   W2MiDT 
##        0        0        0        0        0        0        0

  Observations: There are no missing values in our dataset.

#The 'hist' function, from the 'graphics' library, draws a histogram of the given data
hist(comm_prop$RentRate, col="turquoise", xlab = "Property Rental Rate", ylab = "Frequency")  

hist(comm_prop$Age, col ="violet", xlab = "Property Age", ylab = "Frequency") 

hist(comm_prop$OperExp, col ="gold", xlab = "Property Operations Expenses", ylab = "Frequency") 

hist(comm_prop$VacRate, col ="blue", xlab = "Property Vacancy Rate", ylab = "Frequency") 

hist(comm_prop$SqFt, col ="light green", xlab = "Property Square Feet", ylab = "Frequency")

hist(comm_prop$Taxes, col ="green", xlab = "Property Taxes", ylab = "Frequency") 

  Observations: Histograms help us better understand the distribution of continuous variables. Above, we have plotted histograms for all continuous variables. We noticed the following:
Rental Rate - looks normally distributed
Age - appears bimodal; a fair number of buildings were built recently, with another cluster around 15 years old
Operation Expenses - is slightly left skewed
Vacancy Rate - is heavily right skewed
Square Footage - is right skewed
Taxes - is right skewed

#The 'boxplot' function, from the 'graphics' library, produces boxplots of the given data
boxplot(comm_prop[,c(1:3,6)], col ="light blue", horizontal  = TRUE,
        main = "Boxplot Comparison") # For multiple columns of the dataset - VacRate and SqFt are excluded because their very different scales make the combined plot hard to read

boxplot(comm_prop$VacRate, col ="light green", horizontal  = TRUE,
        main = "Vacancy Rate") # for Vacancy Rate

boxplot(comm_prop$SqFt, col ="turquoise", horizontal  = TRUE,
        main = "Square Feet") # for Square Footage

  Observations: The boxplots reinforce the skewness we observed in the histograms. We also noticed quite a few outliers for Vacancy Rate.

# Examination of the factor variable
miles <- mean(comm_prop$W2MiDT) * 100 # percentage of buildings within 2 miles of downtown (computed before the factor conversion)
comm_prop$W2MiDT <- factor(comm_prop$W2MiDT, labels=c("No","Yes")) #convert the variable into a factor
plot(comm_prop$W2MiDT, xlab="Locations Within 2 Mile Radius of Downtown", ylab="Frequency", col ="orange")

boxplot(RentRate~W2MiDT, data=comm_prop, col ="light blue") # rental rate split by downtown proximity

  Observations: 36% of the buildings are located within 2 miles of downtown. Rental rate tends to be higher when the building is within 2 miles of downtown.

2.3 Examining the existence of correlations

 

pairs(comm_prop[,1:6], col ="turquoise") # The 'pairs' function, from the 'graphics' library, draws a matrix of scatterplots of the given dataset

cor(comm_prop[,1:6]) #The 'cor' function, from the 'stats' library, calculates the pairwise correlations of the dataset's variables and returns a matrix
##             RentRate        Age    OperExp     VacRate       SqFt
## RentRate  1.00000000 -0.2717822  0.4961736 -0.03718869 0.58440930
## Age      -0.27178219  1.0000000  0.2807262 -0.18660828 0.26929259
## OperExp   0.49617362  0.2807262  1.0000000 -0.36605714 0.46658357
## VacRate  -0.03718869 -0.1866083 -0.3660571  1.00000000 0.02947991
## SqFt      0.58440930  0.2692926  0.4665836  0.02947991 1.00000000
## Taxes     0.58452951  0.2691128  0.4663997  0.02944071 0.99999918
##               Taxes
## RentRate 0.58452951
## Age      0.26911276
## OperExp  0.46639966
## VacRate  0.02944071
## SqFt     0.99999918
## Taxes    1.00000000
corrplot(cor(comm_prop[,1:6]), method="ellipse") #The 'corrplot' function, from the 'corrplot' library, displays the correlation matrix as a graphical table

  Observations: There is a near-perfect correlation between Taxes and SqFt (0.99999918). Taxes must be essentially a function of SqFt; this makes sense because property taxes are usually calculated from the square footage of the building (the quick check below supports this). The next highest correlation is between RentRate and Taxes (0.58452951).
Moderate positive correlations exist between:
- Taxes and RentRate
- SqFt and RentRate
- OperExp and RentRate
- OperExp and Taxes
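
  To back up the claim that Taxes is essentially a function of SqFt, a minimal sketch (taxes.check is a helper model of our own, used only for this check) regresses one on the other; if the claim holds, this simple regression should explain nearly all of the variance in Taxes:

taxes.check <- lm(Taxes ~ SqFt, data=comm_prop) # helper model for this check only
summary(taxes.check)$r.squared # should be roughly 0.9999984, the squared correlation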

 

3. Building and Fitting the Model

 

3.1 MANUAL ITERATION

  NOTES:
Manual iteration is the process of building a best-fit model by inspecting the fitted model and eliminating the factors without a significant p-value one by one, then observing the new model's adjusted R-square.
At each step we eliminate the single variable with the highest p-value (i.e., the least significant one); a compact way to script these drops is sketched after the first model below.
To do this, we use the 'lm' function, from the 'stats' library, to fit linear models to the dataset, and then run the 'summary' function on each model. The model is fitted using both first-order variables (columns of the dataset) and interactions (combined predictor variables, denoted x:y).
The OUTPUT of this method finally needs to have a HIGH adjusted R-square and a SIGNIFICANT (i.e., LOW) p-value.
We decided to remove the Taxes variable because of its strong correlation with SqFt. We reasoned that Taxes is likely a function of SqFt, so we keep the variable (SqFt) that we believe is used to calculate Taxes.

#For the COMPLETE dataset with all interactions excluding taxes
comm_prop.1 <- lm(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + Age:VacRate + Age:SqFt + Age:W2MiDT + OperExp:VacRate + OperExp:SqFt + OperExp:W2MiDT + VacRate:SqFt + VacRate:W2MiDT, data=comm_prop)
summary(comm_prop.1)
# OBSERVATION: 
# Adjusted R-squared:  0.7827 
# F-statistic: 26.47 on 14 and 85 DF
# p-value: < 2.2e-16 
# REMOVE: OperExp:W2MiDTYes (individual p-value : 0.708035 )
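
  Aside: the same one-term drops can be scripted more compactly with the base 'stats' helpers drop1() and update(); this is just a sketch of the idiom applied to comm_prop.1 above, not a replacement for the step-by-step summaries that follow.

# drop1() reports an F-test for every term that can be removed while respecting the interaction hierarchy
drop1(comm_prop.1, test="F")
# update() refits the model with one term removed, without retyping the whole formula
comm_prop.2.alt <- update(comm_prop.1, . ~ . - OperExp:W2MiDT)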


#Removed OperExp:W2MiDTYes
comm_prop.2 <- lm(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + Age:VacRate + Age:SqFt + Age:W2MiDT + OperExp:VacRate + OperExp:SqFt +  VacRate:SqFt + VacRate:W2MiDT, data=comm_prop)
summary(comm_prop.2)
# OBSERVATION: 
# Adjusted R-squared: 0.7848 (slight increase)
# F-statistic: 28.78 on 13 and 86 DF
# p-value: < 2.2e-16 
# REMOVE: OperExp:VacRate (individual p-value : 0.696145 )


#Removed OperExp:VacRate
comm_prop.3 <- lm(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + Age:VacRate + Age:SqFt + Age:W2MiDT + OperExp:SqFt +  VacRate:SqFt + VacRate:W2MiDT, data=comm_prop)
summary(comm_prop.3)
# OBSERVATION: 
# Adjusted R-squared: 0.7869 (slight increase)
# F-statistic: 31.47 on 12 and 87 DF
# p-value: < 2.2e-16 
# REMOVE: VacRate:SqFt (individual p-value : 0.637853 )


#Removed VacRate:SqFt
comm_prop.4 <- lm(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + Age:VacRate + Age:SqFt + Age:W2MiDT + OperExp:SqFt + VacRate:W2MiDT, data=comm_prop)
summary(comm_prop.4)
# OBSERVATION: 
# Adjusted R-squared: 0.7888 (slight increase)
# F-statistic: 34.62 on 11 and 88 DF
# p-value: < 2.2e-16 
# REMOVE: Age:SqFt (individual p-value : 0.577617 )


#Removed Age:SqFt
comm_prop.5 <- lm(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + Age:VacRate + Age:W2MiDT + OperExp:SqFt + VacRate:W2MiDT, data=comm_prop)
summary(comm_prop.5)
# OBSERVATION: 
# Adjusted R-squared: 0.7904 (slight increase)
# F-statistic: 38.34 on 10 and 89 DF
# p-value: < 2.2e-16 
# REMOVE: OperExp:SqFt (individual p-value : 0.356293 )


#Removed OperExp:SqFt
comm_prop.6 <- lm(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + Age:VacRate + Age:W2MiDT + VacRate:W2MiDT, data=comm_prop)
summary(comm_prop.6)
# OBSERVATION: 
# Adjusted R-squared: 0.7908 (slight increase)
# F-statistic: 42.57 on 9 and 90 DF
# p-value: < 2.2e-16 
# REMOVE: Age:W2MiDTYes (individual p-value : 0.247392 )


#Removed Age:W2MiDTYes
comm_prop.7 <- lm(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + Age:VacRate + VacRate:W2MiDT, data=comm_prop)
summary(comm_prop.7)
# OBSERVATION: 
# Adjusted R-squared: 0.79 (slight decrease) ****
# F-statistic: 47.54 on 8 and 91 DF
# p-value: < 2.2e-16 
# REMOVE: Age:VacRate (individual p-value : 0.111700 )


#Removed Age:VacRate
comm_prop.8 <- lm(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + VacRate:W2MiDT, data=comm_prop)

# OBSERVATION: 
# Adjusted R-squared: 0.7863 (slight decrease) ****
# F-statistic: 53.05 on 7 and 92 DF
# p-value: < 2.2e-16 
# THIS IS OUR MODEL 
summary(comm_prop.8)
## 
## Call:
## lm(formula = RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + 
##     Age:OperExp + VacRate:W2MiDT, data = comm_prop)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51330 -0.51145 -0.01806  0.50503  2.41260 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.377e+01  5.539e-01  24.853  < 2e-16 ***
## Age               -2.972e-01  5.995e-02  -4.958 3.24e-06 ***
## OperExp            9.132e-02  5.900e-02   1.548 0.125117    
## VacRate           -5.057e+00  1.418e+00  -3.568 0.000574 ***
## SqFt               5.490e-06  1.079e-06   5.089 1.90e-06 ***
## W2MiDTYes          1.213e+00  2.677e-01   4.534 1.74e-05 ***
## Age:OperExp        2.096e-02  6.165e-03   3.400 0.000998 ***
## VacRate:W2MiDTYes  5.297e+00  1.653e+00   3.204 0.001865 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8792 on 92 degrees of freedom
## Multiple R-squared:  0.8015, Adjusted R-squared:  0.7863 
## F-statistic: 53.05 on 7 and 92 DF,  p-value: < 2.2e-16

  Analysis and Observations: We started with a linear regression model that included all variables except Taxes, plus all two-way interactions. We then manually generated new models by removing the least significant variable one at a time, based on its p-value. A small p-value, typically .05 or less, indicates strong evidence against the null hypothesis, which in this case is that the regression coefficient is zero. In other words, under the null, with all other variables held constant, an increase or decrease in that variable does not affect our predicted value for Y (RentRate).
Once our model contained only significant p-values for all included variables, we stopped iterating. (Note: we left OperExp in even though it does not have a significant p-value; we must keep it because it is used in the significant interaction term Age:OperExp.)
As we iterated through models, we paid special attention to each model's adjusted r-square value. The adjusted r-square is the r-square value (the percentage of variation in rental rate explained by our model) adjusted for the number of predictors in the model. We mostly saw an increase in adjusted r-square from model to model. We believe the few decreases are justified because they are very slight and the removal of a variable simplifies the model.
With each model iteration, we saw an increase in the F-statistic. Each model had a significant F-statistic (p-value < .01), which is evidence in favor of the alternative hypothesis (our model predicts significantly better than the intercept-only model). With each iteration, the F-statistic increased, indicating stronger evidence that our model predicts better than the intercept-only model.
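
  For reference, the adjusted r-square reported by summary() can be reproduced by hand from the multiple r-square, the sample size n, and the number of predictors p; a small sanity-check sketch using the values from comm_prop.8:

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.8015 # Multiple R-squared from summary(comm_prop.8)
n  <- 100    # buildings in the sample (92 residual df + 8 estimated coefficients)
p  <- 7      # predictors (estimated terms excluding the intercept)
1 - (1 - r2) * (n - 1) / (n - p - 1) # ~0.7864, matching the reported 0.7863 up to rounding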

  We now have a model, but we must also make sure all of the assumptions behind multiple linear regression hold for it. That is, the residuals must be normally distributed with constant variance, and there must be no problematic multicollinearity between variables.

#------------- DIAGNOSTICS ----------------
#------- Visual Tests ----------
# Residual plot
comm_prop.8.df <- fortify(comm_prop.8) # 'fortify' is a function from the 'ggplot2' library; it augments the fitted model with columns such as .fitted and .resid

#Plotting a scatter plot through 'ggplot' function which is taken from the 'ggplot2' library 
ggplot(comm_prop.8.df, aes(x=.fitted, y=.resid)) +
  geom_point(col ="red") + # draws the points of the scatter plot
  geom_hline(yintercept=0, linetype=2) + # draws a horizontal line at y-intercept 0
  labs(x="Fitted Values", y="Residuals") # adds labels to the x and y axes of the plot

# OBSERVATION: Our Residuals appear to be randomly dispersed


# Check for NORMALITY
qqPlot(comm_prop.8$residuals, pch=16)

# OBSERVATION: Residuals appear to be NORMAL (most of the dots fall within the confidence envelope, i.e. the red dotted lines)


# We created a FUNCTION that will help us visually analyze the fitted model
residplot <- function(fit, nbreaks=10) {
  z <- rstudent(fit)
  hist(z, breaks=nbreaks, freq=FALSE,
       xlab="Studentized Residual",
       main="Distribution of Errors")
  rug(jitter(z), col="brown")
  curve(dnorm(x, mean=mean(z), sd=sd(z)),
        add=TRUE, col="blue", lwd=2)
  lines(density(z)$x, density(z)$y,
        col="red", lwd=2, lty=2)
  legend("topright",
         legend = c( "Normal Curve", "Kernel Density Curve"),
         lty=1:2, col=c("blue","red"), cex=.7)
}

residplot(comm_prop.8)

# OBSERVATION: Our Residuals appear to be NORMALLY Distributed
# This is a visual verification. To be more certain we will perform a statistical test.


#------- Statistical Tests ----------
shapiro.test(comm_prop.8$residuals) #Checking for normality
## 
##  Shapiro-Wilk normality test
## 
## data:  comm_prop.8$residuals
## W = 0.98719, p-value = 0.4506
# OBSERVATION: p-value of .4506
# RESULT: Fail to reject the null hypothesis. The residuals appear to be normally distributed.


# Check for NCV - Non Constant Variance
ncvTest(comm_prop.8)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.178327    Df = 1     p = 0.2776968
# OBSERVATION: p-value of .2777
# RESULT: Fail to reject the null hypothesis. The residuals appear to be homoscedastic.


# Check for Multicollinearity - VIF - Variance Inflation 
vif(comm_prop.8) 
##            Age        OperExp        VacRate           SqFt         W2MiDT 
##      19.418401       3.083854       3.984158       1.873984       2.135323 
##    Age:OperExp VacRate:W2MiDT 
##      24.994129       4.757226
# OBSERVATION: 
# High VIFs for Age and Age:OperExp, but since these are an interaction and its parent variable, this is expected and should be fine.
# VacRate:W2MiDT is slightly elevated but below the rule-of-thumb cutoff of 10.

  Based on the residual plot, the qqPlot, and the histogram, our residuals appear normal. To be more certain, we also performed a hypothesis test: the Shapiro-Wilk test returned a p-value of .4506, so we fail to reject the null hypothesis and conclude that the residuals appear to be normally distributed. The assumption is met.
We also performed a non-constant variance test on our residuals and failed to reject the null hypothesis. The residuals appear to have constant variance. This assumption is met.
Finally, we performed a Variance Inflation Factor check for multicollinearity. We observed high VIFs for Age and Age:OperExp, but since these are an interaction and its parent variable, this should be fine. VacRate:W2MiDT is slightly elevated but below the rule-of-thumb cutoff of 10.
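
  If the interaction-driven VIFs were ever a concern, one common remedy is to mean-center the continuous predictors before forming the interactions; the fitted values are unchanged but the structural collinearity between a variable and its interaction shrinks. A minimal sketch (comm_centered is our own construction, not part of the analysis above):

# Mean-center the continuous predictors that enter interactions
comm_centered <- comm_prop
comm_centered$Age     <- comm_centered$Age - mean(comm_centered$Age)
comm_centered$OperExp <- comm_centered$OperExp - mean(comm_centered$OperExp)
comm_centered$VacRate <- comm_centered$VacRate - mean(comm_centered$VacRate)
# Refit the same formula on the centered data and recheck the VIFs
vif(lm(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + VacRate:W2MiDT, data=comm_centered))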

#------- Predict ----------
# Predicting the RentalRate with the following values:
# Age = 9
# OperExp = 13.0
# VacRate = 0.00
# SqFt = 40000
# Taxes = 0.54
# W2MiDT = No

ourBuilding.data <- data.frame(Age = 9, OperExp = 13.0, VacRate = 0.00, SqFt = 40000, Taxes =0.54, W2MiDT="No")
predict(comm_prop.8, newdata=ourBuilding.data, interval="predict")
##        fit      lwr      upr
## 1 14.95052 13.13364 16.76741
# OBSERVATION: 
#        fit      lwr      upr
# 1 14.95052 13.13364 16.76741


#Our Calculations
expensesToKeep <- 13000 + 540 # operating expenses (13.0) plus taxes (0.54), scaled by 1000 to match the rent scale below
predictedRentalRateLow <- 13.13364 * 1000 # lower bound of the prediction interval, on the same scale
predictedRentalRateLow > expensesToKeep
## [1] FALSE
predictedRentalRateLow
## [1] 13133.64
expensesToKeep
## [1] 13540

CONCLUSION:
Our model predicted a rental value for our property between 13.13 and 16.77, based on a 95% prediction interval. Our monthly expenses fall within this predicted range of monthly rental income. Based on this, we think renting would not be a good idea because there is a real possibility we would lose money every month.

3.2 STEPWISE REGRESSION

  The backward stepwise regression algorithm starts with all variables and interactions and then deletes them one by one until removing any more would degrade the quality of the model. Once again, we removed Taxes because of its correlation with SqFt.

# Backward Stepwise 
comm.Step <- lm(RentRate ~ Age + OperExp + VacRate +SqFt + W2MiDT + Age:OperExp + Age:VacRate + Age:SqFt + Age:W2MiDT + OperExp:VacRate + OperExp:SqFt + OperExp:W2MiDT + VacRate:SqFt + VacRate:W2MiDT, data=comm_prop) 
summary(comm.Step)
# OBSERVATION: 
# Adjusted R-squared: 0.7827
# F-statistic: 26.47 on 14 and 85 DF
# p-value: < 2.2e-16 

stepAIC(comm.Step, direction = "backward")
# NOTES: AIC = Akaike Information Criterion; stepAIC keeps the model that minimizes AIC.
# direction can be "backward", "forward", or "both"; "both" will remove variables and occasionally bring them back in.


# Resulting model:
comm.Step2 <- lm(formula = RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + Age:VacRate + VacRate:W2MiDT, data = comm_prop)
summary(comm.Step2)
# OBSERVATION: 
# Adjusted R-squared: 0.79
# F-statistic: 47.54 on 8 and 91 DF
# p-value: < 2.2e-16 
# REMOVE: Age:VacRate (individual p-value : 0.111700 )


# Removed - Age:VacRate
comm.Step3 <- lm(formula = RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + VacRate:W2MiDT, data = comm_prop)
summary(comm.Step3)
# OBSERVATION: 
# Adjusted R-squared: 0.7863 (slight decrease) 
# F-statistic: 53.05 on 7 and 92 DF
# p-value: < 2.2e-16 



#------------- DIAGNOSTICS ----------------
#------- Visual Tests ----------
# Residual plot
comm.Step3.df <- fortify(comm.Step3) # 'fortify' is a function from the 'ggplot2' library; it augments the fitted model with columns such as .fitted and .resid

#Plotting a scatter plot through 'ggplot' function which is taken from the 'ggplot2' library 
ggplot(comm.Step3.df, aes(x=.fitted, y=.resid)) +
  geom_point(col ="red") + # draws the points of the scatter plot
  geom_hline(yintercept=0, linetype=2) + # draws a horizontal line at y-intercept 0
  labs(x="Fitted Values", y="Residuals") # adds labels to the x and y axes of the plot

# OBSERVATION: Our Residuals appear to be randomly dispersed


# Check for NORMALITY
qqPlot(comm.Step3$residuals, pch=16)

# OBSERVATION: Residuals appear to be NORMAL (most of the dots fall within the confidence envelope, i.e. the red dotted lines)


residplot(comm.Step3)

# OBSERVATION: Our Residuals appear to be NORMALLY Distributed
# This is a visual verification. To be more certain we will perform a statistical test.


#------- Statistical Tests ----------
shapiro.test(comm.Step3$residuals) #Checking for normality
# OBSERVATION: p-value of .4506
# RESULT: Fail to reject the null hypothesis. The residuals appear to be normally distributed.


# Check for NCV - Non-Constant Variance
ncvTest(comm.Step3)
# OBSERVATION: p-value of .2777
# RESULT: Fail to reject the null hypothesis. The residuals appear to be homoscedastic.


# Check for Multicollinearity - VIF - Variance Inflation 
vif(comm.Step3) 
# OBSERVATION: 
# High VIFs for Age and Age:OperExp, but since these are an interaction and its parent variable, this is expected and should be fine.
# VacRate:W2MiDT is slightly elevated but below the rule-of-thumb cutoff of 10.


#------- Predict ----------
# Predicting the RentalRate with the following values:
# Age = 9
# OperExp = 13.0
# VacRate = 0.00
# SqFt = 40000
# Taxes = 0.54
# W2MiDT = No

predict(comm.Step3, newdata=ourBuilding.data, interval="predict")
# OBSERVATION: 
#        fit      lwr      upr
# 1 14.95052 13.13364 16.76741


#Our Calculations
expensesToKeep <- 13000 + 540 # operating expenses (13.0) plus taxes (0.54), scaled by 1000 to match the rent scale below
predictedRentalRateLow <- 13.13364 * 1000 # lower bound of the prediction interval, on the same scale
predictedRentalRateLow > expensesToKeep
predictedRentalRateLow
expensesToKeep

# CONCLUSION:
# Our monthly expenses fall within our predicted range of monthly rental income. Based on this, we think renting would not be a good idea because there is a possibility we would lose money every month.

  The stepwise regression process produced the same model we arrived at manually. As before, all of our assumptions are met, and we come to the same decision: we should NOT rent the building.

3.3 ALL SUBSETS REGRESSION

  The all subsets regression algorithm evaluates every possible subset of predictors and reports the best model of each size; we then compare those candidates on adjusted r-square and keep the highest. Once again, we removed Taxes because of its correlation with SqFt.

comm.sub <- regsubsets(RentRate ~ Age + OperExp + VacRate + SqFt + W2MiDT + Age:OperExp + Age:VacRate + Age:SqFt + Age:W2MiDT + OperExp:VacRate + OperExp:SqFt + OperExp:W2MiDT + VacRate:SqFt + VacRate:W2MiDT, data=comm_prop, really.big = TRUE)
plot(comm.sub, scale="adjr2")

#Observations:
# From the resulting graph, the following variables should be included in our model: Age, OperExp, VacRate, SqFt, W2MiDT, Age:OperExp, Age:VacRate, VacRate:W2MiDT
# (The sketch below extracts the same answer programmatically.)
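
  Rather than reading the winner off the plot, the best subset can also be extracted programmatically from the regsubsets summary; a minimal sketch using the 'leaps' accessors:

comm.sub.summary <- summary(comm.sub)
best.size <- which.max(comm.sub.summary$adjr2) # model size with the highest adjusted R-squared
coef(comm.sub, best.size) # coefficients of the best model of that size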


comm.Subset.1 <- lm(RentRate ~ Age+OperExp+VacRate+SqFt+W2MiDT+Age:OperExp+Age:VacRate+VacRate:W2MiDT, data=comm_prop) 
summary(comm.Subset.1)
# OBSERVATION: 
# Adjusted R-squared: 0.79
# F-statistic: 47.54 on 8 and 91 DF
# p-value: < 2.2e-16 


#------------- DIAGNOSTICS ----------------
#------- Visual Tests ----------
# Residual plot
comm.Subset.1.df <- fortify(comm.Subset.1) # 'fortify' is a function from the 'ggplot2' library; it augments the fitted model with columns such as .fitted and .resid

#Plotting a scatter plot through 'ggplot' function which is taken from the 'ggplot2' library 
ggplot(comm.Subset.1.df, aes(x=.fitted, y=.resid)) +
  geom_point(col ="red") + # draws the points of the scatter plot
  geom_hline(yintercept=0, linetype=2) + # draws a horizontal line at y-intercept 0
  labs(x="Fitted Values", y="Residuals") # adds labels to the x and y axes of the plot

# OBSERVATION: Our Residuals appear to be randomly dispersed


# Check for NORMALITY
qqPlot(comm.Subset.1$residuals, pch=16)

# OBSERVATION: Residuals appear to be NORMAL (most of the dots fall within the confidence envelope, i.e. the red dotted lines).
# NOTE: There are a few dots outside the confidence envelope curves


residplot(comm.Subset.1)

# OBSERVATION: Our Residuals appear to be NORMALLY Distributed. Though there is a slight deviation in the Density Curve
# This is a visual verification. To be more certain we will perform a statistical test.

#------- Statistical Tests ----------
shapiro.test(comm.Subset.1$residuals) #Checking for normality
# OBSERVATION: p-value of .5383
# RESULT: Fail to reject the null hypothesis. The residuals appear to be normally distributed.


# Check for NCV - Non-Constant Variance
ncvTest(comm.Subset.1)
# OBSERVATION: p-value of .37
# RESULT: Fail to reject the null hypothesis. The residuals appear to be homoscedastic.


# Check for Multicollinearity - VIF - Variance Inflation 
vif(comm.Subset.1) 
# OBSERVATION: 
# High VIFs for Age and Age:OperExp, but since these are an interaction and its parent variable, this is expected and should be fine.
# VacRate:W2MiDT is slightly elevated but below the rule-of-thumb cutoff of 10.


#------- Predict ----------
# Predicting the RentalRate with the following values:
# Age = 9
# OperExp = 13.0
# VacRate = 0.00
# SqFt = 40000
# Taxes = 0.54
# W2MiDT = No

predict(comm.Subset.1, newdata=ourBuilding.data, interval="predict")
# OBSERVATION: 
#        fit      lwr      upr
# 1 15.00913 13.20593 16.81234


#Our Calculations
expensesToKeep <- 13000 + 540 # operating expenses (13.0) plus taxes (0.54), scaled by 1000 to match the rent scale below
predictedRentalRateLow <- 13.20593 * 1000 # lower bound of the prediction interval, on the same scale
predictedRentalRateLow > expensesToKeep
predictedRentalRateLow
expensesToKeep

CONCLUSION: Our model generated by all subsets regression produced an adjusted r-square of 0.79 and an F-statistic of 47.54. Our monthly expenses fall within our predicted range of monthly rental income, so we think renting would not be a good idea because there is a possibility we would lose money every month.

4. Conclusion

 

After performing a series of analyses on the given data, we came up with two distinct models (the models from manual iteration and stepwise regression turned out to be identical). We are choosing the model produced by manual iteration. Our reasons for choosing it over the all subsets model are:
- It is a simpler model
- Its adjusted r-square is virtually the same (and, in general, closer to 1 is better)
- We believe its interactions make more sense

FINAL RESULT:
F-statistic: 53.05, significant at the 0.01 significance level
Adjusted R-squared: 0.7863
p-value: < 2.2e-16

The final variables for our model are listed below with their respective estimates (the expected change in rental rate for a one-unit increase in the variable, with all other variables held constant).

Age -2.972e-01
OperExp 9.132e-02
VacRate -5.057e+00
SqFt 5.490e-06
W2MiDTYes 1.213e+00
Age:OperExp 2.096e-02
VacRate:W2MiDTYes 5.297e+00
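
As a sanity check, these coefficients reproduce the prediction for our building by hand. A small sketch plugging in our building's values (Age = 9, OperExp = 13.0, VacRate = 0, SqFt = 40000, W2MiDT = No):

# Manually applying the regression equation with the rounded coefficients above
13.77 - 0.2972*9 + 0.09132*13.0 - 5.057*0 + 5.490e-06*40000 + 1.213*0 + 0.02096*9*13.0 + 5.297*0*0
# ~14.95, matching the fit of 14.95052 from predict() up to coefficient rounding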

We believe the signs of the estimates make sense and are intuitive. All variables except Vacancy Rate are associated with a higher rental rate; we find this intuitive because each of them is associated with a more expensive building, whether through location or the building's physical attributes. Vacancy Rate is associated with a lower rental rate, which also makes sense: a building that is frequently vacant may have very low demand for occupancy, which could force the landlord to lower rents.

After thorough examination of our data, and based on our modelling and analysis, our best suggestion is: DO NOT RENT THE BUILDING. Since the prediction interval for our building's rent contains the total operating expense and tax amount, it is possible that we would not make enough money in rent to cover the expenses of the building.

 

5. Other Considerations:

In data science we have many variables to consider, and while working on this project we noticed several open questions that might have impacted the decision. Among the questions we would want answered to analyze the data and reach a more confident conclusion:
- How was the data collected?
- What is the population of the area?

Also, a great data scientist always asks for more data, since there are always relevant attributes missing. Below is a list of data we would want:
- The purchased price of all the buildings in the dataset
- The geo location of all the buildings in the dataset
- Any needed repairs
- Cost of the maintenance (including HouseKeeping, Security, Landscape)
- Insurance rates of all the buildings in the dataset
- Amenities in each building
- The building material (were the buildings made from good material)
- Whether parking is available; if there is a parking garage, what is the charge (if not free)
- Proximity to mass transportation