Project Overview

Analyzing the predictors that determine property values in Boston and West Roxbury, Massachusetts.

  1. Read in the data
  2. Visualizations
  3. Boxplot showing variation in a target variable by some categorical variable (factor).
  4. Fit your best regression to the Boston Housing data with MEDV as the target variable. (Diagnostic plots of your residuals)
  5. Fit your best regression to the West Roxbury data with TOTAL VALUE as the target variable. (Diagnostic plots of your residuals)
  6. Explain why one of your regression models should exclude TAX from the predictors.
  7. Interpret how property value relates to a variety of predictors.

Import Libraries

library(ggplot2)     
library(hrbrthemes)
library(dplyr)

1. Read in the data:

I read in the West Roxbury and Boston Housing data sets and displayed the first 4 rows of each.

wr.df <- read.csv("WestRoxbury.csv")
head(wr.df,4)  
housing.df <- read.csv("BostonHousing.csv")
head(housing.df,4)  

2. Visualizations

I created a scatterplot of total value against living area for West Roxbury, coloring the points by number of floors to indicate whether the number of floors makes a difference in value. The scatterplot shows a positive correlation: the larger the living area, the higher the value. The number of floors tends to increase with the size of the living area, and the colors indicate that homes with more floors usually have higher values.

p <- ggplot(wr.df, aes(x=LIVING.AREA, y=TOTAL.VALUE, color=as.factor(FLOORS))) +
     geom_point(alpha = 0.7) +
     geom_smooth(method=lm , color="red", se=FALSE) 
p + ggtitle("West Roxbury Total Value \nby Living Area & floors")
## `geom_smooth()` using formula 'y ~ x'

I created a scatterplot of total value against lot square footage for West Roxbury. The scatterplot shows a positive correlation: the larger the lot, the higher the value. The x-axis is limited to 30,000 sq. ft., so larger lots are not shown.

p1 <- ggplot(wr.df, aes(x=LOT.SQFT, y=TOTAL.VALUE)) +
     geom_point(alpha = 0.7, color="#69b3a2") +xlim(0,30000) +
     geom_smooth(method=lm , color="red", se=FALSE) +
     theme_ipsum()
p1 + ggtitle("West Roxbury Total Value \nby Lot Square Foot")
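The positive relationships read off the two scatterplots can also be quantified with a correlation coefficient. The following is a minimal sketch, not the analysis actually run here: it assumes a simulated stand-in data frame (`sim.df`, with made-up values) so that it runs without the CSV files.

```r
# Simulated stand-in for wr.df; all values are hypothetical, not the real data.
set.seed(1)
n <- 200
sim.df <- data.frame(LIVING.AREA = runif(n, 800, 3500),
                     LOT.SQFT    = runif(n, 2000, 15000))
sim.df$TOTAL.VALUE <- 100 + 0.08 * sim.df$LIVING.AREA +
                      0.01 * sim.df$LOT.SQFT + rnorm(n, sd = 30)

# Pearson correlation of the target with each plotted predictor
r.living <- cor(sim.df$LIVING.AREA, sim.df$TOTAL.VALUE)
r.lot    <- cor(sim.df$LOT.SQFT,    sim.df$TOTAL.VALUE)
print(round(c(LIVING.AREA = r.living, LOT.SQFT = r.lot), 3))
```

On the real data, the same calls would be `cor(wr.df$LIVING.AREA, wr.df$TOTAL.VALUE)` and `cor(wr.df$LOT.SQFT, wr.df$TOTAL.VALUE)`.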

Density Plot with Gross Area

#  Density
  ggplot(wr.df, aes(x=GROSS.AREA)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)

3. Boxplot showing variation in a target variable by some categorical variable (factor).

I created a boxplot to show whether remodeling makes a difference to the total value of properties in West Roxbury. The medians differ by remodel status: “Recent” has the highest median, Q1, and Q3, and “none” has the lowest. Properties with no remodel also show many more outliers. The boxplot compares the distributions visually; on its own it does not test whether the differences are statistically significant.

wr.df$kitch <- factor(wr.df$REMODEL, labels=c("none", "old", "Recent"))
bp <- ggplot(wr.df) +
     geom_boxplot(aes(x=kitch, y=TOTAL.VALUE)) + 
     xlab("Remodel Status?") +
     ylab("Total value") +
     ggtitle("Does remodeling \nAffect Total Value with statistical significance?")
bp
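The boxplot title asks about statistical significance, which the plot itself cannot answer. As a hedged sketch of how the comparison could be backed with numbers, the code below computes the per-group quartiles that a boxplot draws and runs a Kruskal-Wallis rank-sum test. It uses a simulated stand-in (`sim.df`, with hypothetical group sizes and values), and the choice of Kruskal-Wallis is an assumption of this sketch, not something the original analysis ran.

```r
# Simulated stand-in for wr.df; group sizes and values are hypothetical.
set.seed(2)
sim.df <- data.frame(
  REMODEL = factor(rep(c("None", "Old", "Recent"), times = c(120, 40, 40)),
                   levels = c("None", "Old", "Recent"))
)
shift <- c(None = 0, Old = 10, Recent = 40)
sim.df$TOTAL.VALUE <- 380 + unname(shift[as.character(sim.df$REMODEL)]) +
                      rnorm(nrow(sim.df), sd = 60)

# Q1, median, and Q3 per remodel group: the numbers the boxplot displays
quartiles <- aggregate(TOTAL.VALUE ~ REMODEL, data = sim.df,
                       FUN = quantile, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Kruskal-Wallis test: do the groups share the same distribution of value?
kw <- kruskal.test(TOTAL.VALUE ~ REMODEL, data = sim.df)
print(kw$p.value)
```

A rank-based test is used here because the boxplot shows many outliers; a one-way ANOVA via `aov()` would be the parametric alternative.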

4. Fit your best regression to the Boston Housing data with MEDV as the target variable. (Diagnostic plots of your residuals)

Running the regression of MEDV on all other variables, I found that most variables are significant, while ZN and AGE are not. As a result, I ran a second regression that dropped ZN and AGE and kept all other variables. The second regression is slightly better: its adjusted R-squared rises marginally, from 0.8373 to 0.8378, so removing the two variables costs no explanatory power.

The residual plots show how far each observation falls from the regression line. The residuals are fairly concentrated around zero across the central range of fitted values and become more dispersed at higher fitted values. Overall, most observations lie fairly close to their predicted values.

# Regression on the Boston Housing data with MEDV as the target variable
RegH1 <- lm( MEDV ~ . , data = housing.df)     
summary(RegH1)
## 
## Call:
## lm(formula = MEDV ~ ., data = housing.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8156 -1.9975 -0.2335  1.6757 16.0932 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.954458   3.816870  11.254  < 2e-16 ***
## CRIM         -0.129678   0.025517  -5.082 5.32e-07 ***
## ZN           -0.005113   0.011103  -0.460 0.645396    
## INDUS         0.114290   0.048362   2.363 0.018506 *  
## CHAS          2.359846   0.673138   3.506 0.000497 ***
## NOX         -15.362403   2.983384  -5.149 3.79e-07 ***
## RM            1.058350   0.354782   2.983 0.002995 ** 
## AGE          -0.006162   0.010319  -0.597 0.550689    
## DIS          -0.733482   0.161312  -4.547 6.86e-06 ***
## RAD           0.205249   0.051933   3.952 8.88e-05 ***
## TAX          -0.009369   0.002944  -3.182 0.001554 ** 
## PTRATIO      -0.558002   0.104307  -5.350 1.35e-07 ***
## LSTAT        -0.478377   0.039373 -12.150  < 2e-16 ***
## CAT..MEDV    11.813994   0.647596  18.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.709 on 492 degrees of freedom
## Multiple R-squared:  0.8415, Adjusted R-squared:  0.8373 
## F-statistic: 200.9 on 13 and 492 DF,  p-value: < 2.2e-16
# Diagnostic plots for residuals
par(mfrow = c(2, 2))
plot(RegH1)
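The second regression described above (dropping ZN and AGE) is not shown in the code. One way it could be written is with `update()`, followed by a comparison of adjusted R-squared. This is a sketch under assumptions: `sim.df` is a simulated stand-in with only a few of the Boston Housing column names, and ZN and AGE are generated as pure noise, so the numbers will not match the 0.8373 and 0.8378 reported here.

```r
# Simulated stand-in for housing.df; values are hypothetical.
set.seed(3)
n <- 300
sim.df <- data.frame(RM    = rnorm(n, 6.3, 0.7),
                     LSTAT = runif(n, 2, 35),
                     ZN    = runif(n, 0, 100),   # unrelated to MEDV by construction
                     AGE   = runif(n, 3, 100))   # unrelated to MEDV by construction
sim.df$MEDV <- 5 + 4 * sim.df$RM - 0.5 * sim.df$LSTAT + rnorm(n, sd = 3)

# Full model, then the same model with ZN and AGE removed
full    <- lm(MEDV ~ ., data = sim.df)
reduced <- update(full, . ~ . - ZN - AGE)

# Adjusted R-squared penalizes extra predictors, so it is comparable across models
adj <- c(full    = summary(full)$adj.r.squared,
         reduced = summary(reduced)$adj.r.squared)
print(round(adj, 4))

# Partial F-test: do ZN and AGE jointly add explanatory power?
print(anova(reduced, full))
```

With the real data, the equivalent call would be `update(RegH1, . ~ . - ZN - AGE)`.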

5. Fit your regression to the West Roxbury data with TOTAL VALUE as the target variable. Try using all other variables as predictors. Show diagnostic plots of your residuals.

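I first ran the regression with all variables. The summary shows only TAX and GROSS.AREA as statistically significant, and both R-squared and adjusted R-squared are 1. A perfect fit like this does not mean the model is informative. It means TAX alone reproduces TOTAL.VALUE almost exactly: the TAX coefficient of 7.949e-02 corresponds to roughly 12.58 units of TAX per unit of TOTAL.VALUE, and the residuals stay within about ±0.04, so the other predictors have almost nothing left to explain. The two NA coefficients (kitchold and kitchRecent) appear because kitch, created for the boxplot above, is a relabeled copy of REMODEL and is therefore perfectly collinear with it. I then ran a second regression that dropped the insignificant variables, but the summary did not change much. The residual plots show residuals that are spread out rather than concentrated around zero, but on a scale of only a few hundredths, so they say little about how the other predictors relate to value. Therefore, I ran another set of regressions without TAX as a variable, which gives a better indication of how the predictors influence total value.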

#Run regression to the West Roxbury data with TOTAL VALUE as the target variable
RegWR <- lm( TOTAL.VALUE ~ . , data = wr.df)     
summary(RegWR)
## 
## Call:
## lm(formula = TOTAL.VALUE ~ ., data = wr.df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.041697 -0.019777  0.000203  0.019741  0.041276 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)    3.203e-02  1.805e-02      1.774   0.0761 .  
## TAX            7.949e-02  5.534e-07 143644.813   <2e-16 ***
## LOT.SQFT       2.072e-07  1.398e-07      1.481   0.1386    
## YR.BUILT       1.142e-06  8.998e-06      0.127   0.8990    
## GROSS.AREA     1.860e-06  8.867e-07      2.097   0.0360 *  
## LIVING.AREA   -3.061e-06  1.635e-06     -1.872   0.0612 .  
## FLOORS         1.246e-03  9.305e-04      1.339   0.1807    
## ROOMS         -4.473e-04  3.450e-04     -1.297   0.1948    
## BEDROOMS       9.860e-05  5.256e-04      0.188   0.8512    
## FULL.BATH      7.521e-04  7.177e-04      1.048   0.2947    
## HALF.BATH      9.145e-04  6.601e-04      1.385   0.1660    
## KITCHEN        4.088e-03  2.553e-03      1.601   0.1094    
## FIREPLACE      2.215e-04  5.748e-04      0.385   0.6999    
## REMODELOld    -9.841e-04  1.011e-03     -0.973   0.3306    
## REMODELRecent -4.049e-05  8.897e-04     -0.046   0.9637    
## kitchold              NA         NA         NA       NA    
## kitchRecent           NA         NA         NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02271 on 5787 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.904e+09 on 14 and 5787 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(RegWR)
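The two NA rows in the summary above (kitchold and kitchRecent, "not defined because of singularities") are what `lm()` reports when one predictor duplicates another. The sketch below reproduces the effect on a simulated stand-in (`sim.df`, hypothetical values) by adding a relabeled copy of a factor, in the same way kitch was built from REMODEL, and then removes the duplicate in the formula.

```r
# Simulated stand-in for wr.df; values are hypothetical.
set.seed(4)
n <- 150
sim.df <- data.frame(
  LIVING.AREA = runif(n, 800, 3500),
  REMODEL     = factor(sample(c("None", "Old", "Recent"), n, replace = TRUE))
)
sim.df$TOTAL.VALUE <- 150 + 0.09 * sim.df$LIVING.AREA + rnorm(n, sd = 40)

# Relabeled copy of REMODEL: carries exactly the same information
sim.df$kitch <- factor(sim.df$REMODEL, labels = c("none", "old", "Recent"))

# With "~ .", both REMODEL and kitch enter the model; the copy cannot be estimated
fit <- lm(TOTAL.VALUE ~ ., data = sim.df)
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# Excluding the duplicate in the formula removes the NA coefficients
fit.clean <- lm(TOTAL.VALUE ~ . - kitch, data = sim.df)
print(coef(fit.clean))
```

The estimates for the remaining coefficients are unaffected by the aliased terms, so the summaries shown here are still valid; excluding kitch only tidies the output.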

Run the regression without TAX as a variable

# Drop TAX by name rather than by column position
newwr.df <- wr.df[, names(wr.df) != "TAX"]
#Regression
RegWR2 <- lm(TOTAL.VALUE ~ . , data = newwr.df)    
summary(RegWR2)
## 
## Call:
## lm(formula = TOTAL.VALUE ~ ., data = newwr.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -269.137  -26.247    0.064   25.260  291.423 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.555e+01  3.408e+01  -1.924 0.054449 .  
## LOT.SQFT       8.523e-03  2.391e-04  35.648  < 2e-16 ***
## YR.BUILT       6.006e-02  1.697e-02   3.539 0.000405 ***
## GROSS.AREA     3.155e-02  1.622e-03  19.452  < 2e-16 ***
## LIVING.AREA    5.178e-02  3.011e-03  17.195  < 2e-16 ***
## FLOORS         4.034e+01  1.675e+00  24.087  < 2e-16 ***
## ROOMS          8.692e-01  6.512e-01   1.335 0.182009    
## BEDROOMS      -1.233e+00  9.923e-01  -1.242 0.214175    
## FULL.BATH      1.969e+01  1.330e+00  14.804  < 2e-16 ***
## HALF.BATH      1.884e+01  1.221e+00  15.422  < 2e-16 ***
## KITCHEN       -1.482e+01  4.816e+00  -3.078 0.002092 ** 
## FIREPLACE      1.896e+01  1.056e+00  17.955  < 2e-16 ***
## REMODELOld     4.184e+00  1.909e+00   2.192 0.028435 *  
## REMODELRecent  2.510e+01  1.647e+00  15.242  < 2e-16 ***
## kitchold              NA         NA      NA       NA    
## kitchRecent           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42.88 on 5788 degrees of freedom
## Multiple R-squared:  0.8135, Adjusted R-squared:  0.8131 
## F-statistic:  1942 on 13 and 5788 DF,  p-value: < 2.2e-16
# Diagnostic plots for residuals
par(mfrow = c(2, 2))
plot(RegWR2)

6. Explanation on why West Roxbury regression models should exclude TAX from the predictors.

The regression that includes TAX has an R-squared of 1 and residuals within about ±0.04, and the scatterplot of TOTAL.VALUE against TAX below is almost perfectly linear. Together these indicate that TAX is essentially a rescaled copy of TOTAL.VALUE, most plausibly because the tax bill is computed from the assessed value. A predictor derived from the target says nothing about what drives the target, and it leaves almost nothing for the other variables to explain, which is why only TAX and GROSS.AREA appeared significant. TAX should therefore be excluded from the predictors. Logically, too, TAX is usually not the main determinant for people buying houses or properties, so it should not be treated as a driver of property value. The new regression excluding TAX allows the other variables to show significance, which is reasonable since variables such as living area and gross area are important determinants of the value of a property. The residuals of the model without TAX are much larger in absolute terms (a residual standard error of 42.88 versus 0.02271), but that model, with an adjusted R-squared of 0.8131, actually describes how property characteristics relate to value.

p1 <- ggplot(wr.df, aes(x=TAX, y=TOTAL.VALUE)) +
     geom_point(alpha = 0.7, color="#69b3a2") +
     geom_smooth(method=lm , color="red", se=FALSE, size = 0.4) +
     theme_ipsum()
p1 + ggtitle("West Roxbury Total Value \nby TAX")
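To illustrate why a near-linear TAX versus TOTAL.VALUE scatterplot is a warning sign, the sketch below builds a simulated stand-in (`sim.df`) in which TAX is computed from the target by a hypothetical rate (`tax.rate`, an assumed number, not taken from the data) and rounded to the dollar. It then compares the fit with and without TAX.

```r
# Simulated stand-in for wr.df; values and tax.rate are hypothetical.
set.seed(5)
n <- 200
sim.df <- data.frame(LIVING.AREA = runif(n, 800, 3500))
sim.df$TOTAL.VALUE <- round(150 + 0.09 * sim.df$LIVING.AREA +
                            rnorm(n, sd = 40), 1)

# TAX is derived from the target, so it leaks the answer into the predictors
tax.rate <- 12.5
sim.df$TAX <- round(sim.df$TOTAL.VALUE * tax.rate)

with.tax    <- lm(TOTAL.VALUE ~ ., data = sim.df)
without.tax <- lm(TOTAL.VALUE ~ . - TAX, data = sim.df)

# R-squared is essentially 1 with TAX and drops to a realistic level without it
r2 <- c(with.tax    = summary(with.tax)$r.squared,
        without.tax = summary(without.tax)$r.squared)
print(round(r2, 4))

# The leftover residuals with TAX are only the rounding error in TAX
print(range(residuals(with.tax)))
print(cor(sim.df$TAX, sim.df$TOTAL.VALUE))
```

On the real data, `cor(wr.df$TAX, wr.df$TOTAL.VALUE)` would be a quick check of the same relationship.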

7. Interpretation on how property value relates to a variety of predictors.

Property value relates to a variety of predictors, as shown by the variables that are statistically significant in the regressions. For housing in Boston, all predictors are significant except ZN (proportion of residential land zoned for lots over 25,000 sq. ft.) and AGE (proportion of owner-occupied units built prior to 1940). Among the significant predictors, CRIM, NOX, DIS, TAX, PTRATIO, and LSTAT have negative coefficients, while INDUS, CHAS, RM, RAD, and CAT..MEDV have positive ones. Regarding West Roxbury, after excluding TAX, most predictors are significant except the number of rooms and bedrooms. Lot size, year built, gross area, living area, floors, full and half baths, fireplaces, and remodeling are all positively associated with total value, while KITCHEN has a negative coefficient.