Introduction

Questions of Interest

Regression Method

Regression Analysis, Results and Interpretation

We want to investigate

# read the real estate data from the tab-separated file.
library(tibble)  # provides as_tibble(); as.tibble() is deprecated
realEstate <- as_tibble(read.csv("realestate.txt", sep = "\t", header = TRUE))


# add a column for the unit price of the real estate.

realEstate$UnitPrice <- realEstate$SalePrice/realEstate$SqFeet

summary(realEstate)
##    SalePrice         SqFeet           Beds           Baths      
##  Min.   : 84.0   Min.   :0.980   Min.   :1.000   Min.   :1.000  
##  1st Qu.:180.0   1st Qu.:1.701   1st Qu.:3.000   1st Qu.:2.000  
##  Median :229.9   Median :2.061   Median :3.000   Median :3.000  
##  Mean   :277.4   Mean   :2.261   Mean   :3.478   Mean   :2.647  
##  3rd Qu.:335.0   3rd Qu.:2.638   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :920.0   Max.   :5.032   Max.   :7.000   Max.   :7.000  
##       Air             Garage           Pool             Year     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :1885  
##  1st Qu.:1.0000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:1956  
##  Median :1.0000   Median :2.000   Median :0.0000   Median :1966  
##  Mean   :0.8311   Mean   :2.098   Mean   :0.0691   Mean   :1967  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.0000   3rd Qu.:1981  
##  Max.   :1.0000   Max.   :7.000   Max.   :1.0000   Max.   :1998  
##     Quality          Style             Lot           Highway       
##  Min.   :1.000   Min.   : 1.000   Min.   : 4.56   Min.   :0.00000  
##  1st Qu.:2.000   1st Qu.: 1.000   1st Qu.:17.16   1st Qu.:0.00000  
##  Median :2.000   Median : 2.000   Median :22.20   Median :0.00000  
##  Mean   :2.186   Mean   : 3.349   Mean   :24.34   Mean   :0.02111  
##  3rd Qu.:3.000   3rd Qu.: 7.000   3rd Qu.:26.78   3rd Qu.:0.00000  
##  Max.   :3.000   Max.   :11.000   Max.   :86.83   Max.   :1.00000  
##    UnitPrice     
##  Min.   : 60.50  
##  1st Qu.: 98.62  
##  Median :112.94  
##  Mean   :119.53  
##  3rd Qu.:131.66  
##  Max.   :262.58
head(realEstate)
colnames(realEstate)
##  [1] "SalePrice" "SqFeet"    "Beds"      "Baths"     "Air"      
##  [6] "Garage"    "Pool"      "Year"      "Quality"   "Style"    
## [11] "Lot"       "Highway"   "UnitPrice"

Correlation Scatter Matrix


From the correlation matrix, we can see that square footage is among the most important factors contributing to the price of a house. The number of baths (surprisingly), the number of garages, and the number of beds are also important. From the scatterplot, we see no influence of highway on the price of the house. We will investigate this later by performing a hypothesis test on whether the influence is statistically significant.
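The code behind the correlation scatter matrix is not shown in this report. A minimal sketch of how such a ranking could be produced follows; the helper name `cor_with` and the choice of columns passed to `pairs()` are assumptions made for illustration, not the original analysis.

```r
# Hypothetical helper (not from the original report): correlation of every
# numeric column with a target column, sorted by absolute value.
cor_with <- function(data, target = "SalePrice") {
  num <- data[vapply(data, is.numeric, logical(1))]  # keep numeric columns only
  r <- cor(num, use = "complete.obs")[, target]      # correlations with the target
  r <- r[names(r) != target]                         # drop the target itself
  r[order(abs(r), decreasing = TRUE)]
}

# Usage with the report's data frame, if it has already been loaded.
# Note that UnitPrice is derived from SalePrice, so it will also appear.
if (exists("realEstate")) {
  print(round(cor_with(realEstate), 2))
  # Scatter matrix for an assumed subset of the columns discussed in the text.
  pairs(realEstate[, c("SalePrice", "SqFeet", "Beds", "Baths", "Garage", "Highway")])
}
```

Sorting by absolute value keeps strong negative associations near the top as well as strong positive ones.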

# linear regression

lm_lot.size <- lm(SalePrice ~ Lot, data = realEstate)
summary(lm_lot.size)
## 
## Call:
## lm(formula = SalePrice ~ Lot, data = realEstate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -219.40  -85.16  -42.63   51.41  620.57 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 213.9625    13.6150  15.715  < 2e-16 ***
## Lot           2.6063     0.5043   5.168 3.38e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.3 on 519 degrees of freedom
## Multiple R-squared:  0.04894,    Adjusted R-squared:  0.04711 
## F-statistic: 26.71 on 1 and 519 DF,  p-value: 3.38e-07
# diagnostic plots for checking the assumptions of the linear model.
par(mfrow = c(2,2))
plot(lm_lot.size)

# scatterplot of sale price against lot size, with the fitted regression line
plot(SalePrice ~ Lot,data=realEstate, 
    main="Sale Price vs. Lot Size",
    xlab = "Lot Size",
    ylab = "Sale Price")
abline(lm_lot.size, col = "green")
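
The summary that follows reports a regression of UnitPrice on Lot, but the call that produced it is not shown. A minimal sketch of a fit with the same formula is given here; the names `fit_unit_price` and `lm_unit_lot` are hypothetical and do not come from the original analysis.

```r
# Hypothetical helper: regress unit price on lot size for a data frame that
# has UnitPrice and Lot columns (as realEstate does in this report).
fit_unit_price <- function(data) {
  lm(UnitPrice ~ Lot, data = data)
}

# Usage with the report's data frame, if it has already been loaded.
if (exists("realEstate")) {
  lm_unit_lot <- fit_unit_price(realEstate)  # hypothetical object name
  print(summary(lm_unit_lot))
}
```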

## 
## Call:
## lm(formula = UnitPrice ~ Lot, data = realEstate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.608 -18.987  -5.315  11.388 138.344 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 103.9746     3.0665  33.906   <2e-16 ***
## Lot           0.6391     0.1136   5.627    3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.26 on 519 degrees of freedom
## Multiple R-squared:  0.0575, Adjusted R-squared:  0.05568 
## F-statistic: 31.66 on 1 and 519 DF,  p-value: 3.004e-08

The \(R^2\) for the regression of sale price on lot size is about \(4.9\%\) (0.04894), so lot size explains only a small share of the variability in sale price, even though its slope is statistically significant (p-value 3.38e-07).

Likewise, the \(R^2\) for the linear model of unit price on lot size is about \(5.8\%\) (0.0575), which means there is little linear association between unit price and lot size.
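
As a cross-check on the \(R^2\) values, they can be recovered from the F-statistics printed in the two summaries. For a linear model with \(p\) predictors, \(R^2 = pF / (pF + \text{df}_{\text{resid}})\). The sketch below applies this identity to the printed values; the function name `r2_from_f` is made up for this illustration.

```r
# Hypothetical helper: recover R-squared from the overall F-statistic.
# f        = F-statistic reported by summary(lm)
# df_resid = residual degrees of freedom
# df_model = number of predictors (1 for a simple regression)
r2_from_f <- function(f, df_resid, df_model = 1) {
  (f * df_model) / (f * df_model + df_resid)
}

# Values copied from the two summaries in this report.
r2_sale <- r2_from_f(26.71, 519)  # SalePrice ~ Lot
r2_unit <- r2_from_f(31.66, 519)  # UnitPrice ~ Lot

round(c(SalePrice = r2_sale, UnitPrice = r2_unit), 4)
# SalePrice UnitPrice
#    0.0489    0.0575
```

Both values agree with the "Multiple R-squared" lines in the output, so each model accounts for roughly 5% to 6% of the variance in its response.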