Part 3: Conducting some data cleaning

I clean the data using the code given in the assignment, but I exclude incomplete observations. The clean dataset characterizes each house with two count variables (number of bedrooms and number of bathrooms), two continuous variables (price and square footage), and the address of the house.

cleanDataset <- rawDataset %>%
  mutate(bedrooms = as.integer(str_trim(str_extract(Det, "[\\d ]*(?=bds)")))) %>%
  mutate(bathrooms = as.integer(str_trim(str_extract(Det, "[\\d ]*(?=ba)")))) %>%
  mutate(sqft = str_trim(str_extract(Det, "[\\d ,]*(?=sqft)"))) %>%
  mutate(sqft = as.numeric(str_replace(sqft,",",""))) %>%
  mutate(price = as.numeric(str_replace_all(Price,"[^0-9]*","")))
cleanDataset <- na.omit(cleanDataset)
str(cleanDataset)
## 'data.frame':    18 obs. of  7 variables:
##  $ Price    : Factor w/ 18 levels "$175,000","$339,000",..: 6 5 1 3 4 2 9 7 8 16 ...
##  $ Det      : Factor w/ 18 levels "2 bds2 ba1,273 sqft- Condo for sale",..: 4 6 1 5 2 3 8 7 9 14 ...
##  $ Addr     : Factor w/ 18 levels "10301 Naglee Rd, Silver Spring, MD 20903",..: 3 5 8 2 4 9 1 6 7 15 ...
##  $ bedrooms : int  3 4 2 4 3 3 4 4 5 4 ...
##  $ bathrooms: int  2 3 2 2 2 2 4 3 3 3 ...
##  $ sqft     : num  2276 1458 1273 1914 1296 ...
##  $ price    : num  480000 410000 175000 365000 399900 ...
summary(cleanDataset)
##       Price                                     Det    
##  $175,000: 1   2 bds2 ba1,273 sqft- Condo for sale: 1  
##  $339,000: 1   3 bds2 ba1,296 sqft- House for sale: 1  
##  $365,000: 1   3 bds2 ba1,517 sqft- Condo for sale: 1  
##  $399,900: 1   3 bds2 ba2,276 sqft- House for sale: 1  
##  $410,000: 1   4 bds2 ba1,914 sqft- House for sale: 1  
##  $480,000: 1   4 bds3 ba1,458 sqft- House for sale: 1  
##  (Other) :12   (Other)                            :12  
##                                              Addr       bedrooms    
##  10301 Naglee Rd, Silver Spring, MD 20903      : 1   Min.   :2.000  
##  1110 Ruatan St, Silver Spring, MD 20903       : 1   1st Qu.:3.000  
##  11720 College View Dr, Silver Spring, MD 20902: 1   Median :4.000  
##  11807 Goodloe Rd, Silver Spring, MD 20906     : 1   Mean   :3.611  
##  11815 Dewey Rd, Silver Spring, MD 20906       : 1   3rd Qu.:4.000  
##  1253 Cresthaven Dr, Silver Spring, MD 20903   : 1   Max.   :6.000  
##  (Other)                                       :12                  
##    bathrooms          sqft          price        
##  Min.   :2.000   Min.   :1079   Min.   : 175000  
##  1st Qu.:2.000   1st Qu.:1279   1st Qu.: 276000  
##  Median :2.000   Median :1529   Median : 445000  
##  Mean   :2.722   Mean   :1736   Mean   : 513089  
##  3rd Qu.:3.000   3rd Qu.:2178   3rd Qu.: 592474  
##  Max.   :5.000   Max.   :2752   Max.   :1300000  
## 

Part 4: Doing some visual analysis

Sub-Part 4.1. Scatterplot square footage vs. price

To explore a linear relationship between price and square footage, I will plot square footage against price. However, I’m also interested in having some bedrooms without bathrooms for a couple reasons: (i) office room, (ii) storage room, and (iii) bedrooms sharing bathrooms.

# diff: number of bedrooms minus number of bathrooms
cleanDataset$bedBath <- cleanDataset$bedrooms-cleanDataset$bathrooms

# new variable based on diff
cleanDataset$bedBathCategory <- "NA"
cleanDataset$bedBathCategory[cleanDataset$bedBath<0]  <- "fewer beds than baths"
cleanDataset$bedBathCategory[cleanDataset$bedBath==0] <- "same number of beds and baths"
cleanDataset$bedBathCategory[cleanDataset$bedBath==1] <- "one bed without baths"
cleanDataset$bedBathCategory[cleanDataset$bedBath>=2] <- "two or more beds without baths"
cleanDataset$bedBathCategory = as.factor(cleanDataset$bedBathCategory)

The plot shows two main visual results: (1) most houses are in the quadrant spanning square footage <= 5,000 sqft and price <= $750,000; (2) in this quadrant, the relationship between price and square footage is non-linear, rather logarithmic or quadratic; (3) there is a concentration of most houses with at least one bedroom without bathroom in this quadrant.

graph1 <- ggplot(data = cleanDataset, aes(x = sqft, y = price, color = bedBathCategory)) + 
  geom_point() +
  labs(y = "Price ($)",
       x = "Square footage (squared feet)",
       title = "Square footage vs. Price of houses in Silver Spring, MD")
graph1

Sub-Part 4.2. Generating new variables suggested by the plot

To explore a non-linear relationship between price and square footage in the regressions, I will create the squared value of square footage, as well as the log value of square footage and price. I will also create one dummy variable capturing houses with at least one bedroom without bathroom, which can be interacted with the number of bedrooms; another dummy variable will define the quadrant that excludes the outliers.

# squared sqft
cleanDataset$squared_sqft <- cleanDataset$sqft*cleanDataset$sqft

# log variables
cleanDataset$log_sqft <-  log(cleanDataset$sqft)
cleanDataset$log_price <- log(cleanDataset$price)

# dummy variables
cleanDataset$nonOutlier <- ifelse(cleanDataset$sqft<=5000 & cleanDataset$price<=750000,1,0)
#cleanDataset$nonOutlier = as.factor(cleanDataset$nonOutlier)

cleanDataset$extraBedWithoutBath <- ifelse(cleanDataset$bedBath>=1,1,0)
#cleanDataset$extraBedWithoutBath = as.factor(cleanDataset$extraBedWithoutBath)

# interaction variable
cleanDataset$interactionvar <- cleanDataset$bedrooms * cleanDataset$extraBedWithoutBath

Part 5: Running some regression analysis

I estimated four models: linear, log, quadratic (with interaction), and quadratic (without interaction). Not surprisingly after the visual analysis, the regression results indicate a better fit (highest R-squared) with the quadratic models.

Sub-Part 5.1. Linear model and goodness of fit

lmmodel <- lm(price ~ sqft + bedrooms + extraBedWithoutBath + interactionvar + nonOutlier, cleanDataset)
summary(lmmodel)
## 
## Call:
## lm(formula = price ~ sqft + bedrooms + extraBedWithoutBath + 
##     interactionvar + nonOutlier, data = cleanDataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107543  -42014    9856   37381  118875 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          357022.85  130337.53   2.739 0.017957 *  
## sqft                     69.02      59.27   1.164 0.266861    
## bedrooms             183755.66   41704.44   4.406 0.000856 ***
## extraBedWithoutBath  295589.07  128802.53   2.295 0.040570 *  
## interactionvar      -117230.42   37517.91  -3.125 0.008779 ** 
## nonOutlier          -579221.29   70097.83  -8.263  2.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76790 on 12 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9387 
## F-statistic: 53.08 on 5 and 12 DF,  p-value: 9.076e-08

Sub-Part 5.2. Log model and goodness of fit

logmodel <- lm(price ~ log_sqft + bedrooms + extraBedWithoutBath + interactionvar + nonOutlier, cleanDataset)
summary(logmodel)
## 
## Call:
## lm(formula = price ~ log_sqft + bedrooms + extraBedWithoutBath + 
##     interactionvar + nonOutlier, data = cleanDataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111976  -42170   10036   38879  118612 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -459369     710337  -0.647  0.53001    
## log_sqft              129794     106302   1.221  0.24554    
## bedrooms              176957      44539   3.973  0.00185 ** 
## extraBedWithoutBath   277420     131077   2.116  0.05589 .  
## interactionvar       -111386      38673  -2.880  0.01383 *  
## nonOutlier           -584204      68600  -8.516 1.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76400 on 12 degrees of freedom
## Multiple R-squared:  0.9572, Adjusted R-squared:  0.9393 
## F-statistic: 53.64 on 5 and 12 DF,  p-value: 8.55e-08

Sub-Part 5.3. Quadratic model (with interation) and goodness of fit

quadintmodel <- lm(price ~ squared_sqft + sqft + bedrooms + extraBedWithoutBath + interactionvar + nonOutlier, cleanDataset)
summary(quadintmodel)
## 
## Call:
## lm(formula = price ~ squared_sqft + sqft + bedrooms + extraBedWithoutBath + 
##     interactionvar + nonOutlier, data = cleanDataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -114781  -42242   12950   39793  118484 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.936e+05  2.633e+05   1.115  0.28865    
## squared_sqft        -2.897e-02  1.031e-01  -0.281  0.78399    
## sqft                 1.802e+02  4.005e+02   0.450  0.66153    
## bedrooms             1.750e+05  5.337e+04   3.280  0.00734 ** 
## extraBedWithoutBath  2.706e+05  1.609e+05   1.682  0.12077    
## interactionvar      -1.089e+05  4.899e+04  -2.224  0.04807 *  
## nonOutlier          -5.876e+05  7.884e+04  -7.453 1.27e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79910 on 11 degrees of freedom
## Multiple R-squared:  0.9571, Adjusted R-squared:  0.9336 
## F-statistic: 40.85 on 6 and 11 DF,  p-value: 6.855e-07

Sub-Part 5.4. Quadratic model (without interation) and goodness of fit

quadmodel <- lm(price ~ squared_sqft + sqft + bedrooms + bathrooms + nonOutlier, cleanDataset)
summary(quadmodel)
## 
## Call:
## lm(formula = price ~ squared_sqft + sqft + bedrooms + bathrooms + 
##     nonOutlier, data = cleanDataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -93482 -46162   7126  49742  81675 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.882e+04  2.197e+05   0.222  0.82787    
## squared_sqft -9.545e-02  7.614e-02  -1.254  0.23388    
## sqft          4.510e+02  2.910e+02   1.550  0.14715    
## bedrooms      5.090e+04  2.216e+04   2.296  0.04046 *  
## bathrooms     9.266e+04  2.766e+04   3.350  0.00578 ** 
## nonOutlier   -4.928e+05  8.107e+04  -6.078 5.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70310 on 12 degrees of freedom
## Multiple R-squared:  0.9637, Adjusted R-squared:  0.9486 
## F-statistic: 63.77 on 5 and 12 DF,  p-value: 3.183e-08

The quadratic model (without interaction) is easier to interpret than its counterpart with interaction. Based on that last model, the marginal effects of the attributes “bedroom” and “bathroom” of a house are either insignificant or significant at the 10% level of confidence (depending on the search data) but both positive, with a magnitude approximately lower than $15,000 for “bedroom” and than $25,000 for “bathroom” (depending on the search data). The marginal effect of not taking an outlying house is significant at the 1% level of confidence and largely negative.

quadmodel$coefficients
##   (Intercept)  squared_sqft          sqft      bedrooms     bathrooms 
##  4.881508e+04 -9.544671e-02  4.509749e+02  5.089963e+04  9.265651e+04 
##    nonOutlier 
## -4.927520e+05

Given these results, I will ideally choose a house with the following characteristcs:

  • falling within the range of square footage <= 5,000 sqft and price <= $750,000 ;

  • having just the minimum of bathrooms needed ;

  • having one or two more bedrooms than bathrooms ;

  • and having the minimum square footage among houses with the above three characteristics.

Four other variables I would like to consider further for refining my housing search could be: (1) the median or average income in the house neighborhood; (2) the level of safety in the house neighborhood; (3) the availability of medium to large grocery stores in the in the house neighborhood; and (4) my household income and job tenure (critical for affordability).

Note: This analysis could have been better done if my selection of independent variabes was based on the literature on hedonic price for residential buildings in Maryland, in addition to an analysis of the residuals from the retained model.