We want to investigate
# read the real estate data from the tab-separated file.
library(tibble)
realEstate <- as_tibble(read.csv("realestate.txt", sep = "\t", header = TRUE))
# add a column for the unit price of the real estate.
realEstate$UnitPrice <- realEstate$SalePrice/realEstate$SqFeet
summary(realEstate)
## SalePrice SqFeet Beds Baths
## Min. : 84.0 Min. :0.980 Min. :1.000 Min. :1.000
## 1st Qu.:180.0 1st Qu.:1.701 1st Qu.:3.000 1st Qu.:2.000
## Median :229.9 Median :2.061 Median :3.000 Median :3.000
## Mean :277.4 Mean :2.261 Mean :3.478 Mean :2.647
## 3rd Qu.:335.0 3rd Qu.:2.638 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :920.0 Max. :5.032 Max. :7.000 Max. :7.000
## Air Garage Pool Year
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :1885
## 1st Qu.:1.0000 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:1956
## Median :1.0000 Median :2.000 Median :0.0000 Median :1966
## Mean :0.8311 Mean :2.098 Mean :0.0691 Mean :1967
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.0000 3rd Qu.:1981
## Max. :1.0000 Max. :7.000 Max. :1.0000 Max. :1998
## Quality Style Lot Highway
## Min. :1.000 Min. : 1.000 Min. : 4.56 Min. :0.00000
## 1st Qu.:2.000 1st Qu.: 1.000 1st Qu.:17.16 1st Qu.:0.00000
## Median :2.000 Median : 2.000 Median :22.20 Median :0.00000
## Mean :2.186 Mean : 3.349 Mean :24.34 Mean :0.02111
## 3rd Qu.:3.000 3rd Qu.: 7.000 3rd Qu.:26.78 3rd Qu.:0.00000
## Max. :3.000 Max. :11.000 Max. :86.83 Max. :1.00000
## UnitPrice
## Min. : 60.50
## 1st Qu.: 98.62
## Median :112.94
## Mean :119.53
## 3rd Qu.:131.66
## Max. :262.58
head(realEstate)
colnames(realEstate)
## [1] "SalePrice" "SqFeet" "Beds" "Baths" "Air"
## [6] "Garage" "Pool" "Year" "Quality" "Style"
## [11] "Lot" "Highway" "UnitPrice"
From the correlation matrix, we can see that square footage is among the most important factors contributing to the price of a house. The number of baths (surprisingly), the number of garages, and the number of beds are also important. From the scatterplot, we see no influence of highway on the price of the house. We will investigate this later by performing a hypothesis test on whether the influence is statistically significant.
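The two steps mentioned above, a correlation matrix and a hypothesis test on the highway effect, could be sketched roughly as follows. This is only an illustration: `toy` is a made-up stand-in for `realEstate` (synthetic values, not the real data), and a Welch two-sample t-test is just one possible choice of test, not necessarily the one used later in this analysis.

```r
# Synthetic stand-in for the realEstate data (made-up values, illustration only).
set.seed(1)
n <- 200
toy <- data.frame(
  SqFeet  = runif(n, 1, 5),
  Highway = rep(c(0, 1), times = c(180, 20))
)
# Assumed relationship: price depends on size only, not on Highway.
toy$SalePrice <- 50 + 100 * toy$SqFeet + rnorm(n, sd = 40)

# Pairwise Pearson correlations between all numeric columns.
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# Welch two-sample t-test: does the mean sale price differ between
# the Highway == 0 and Highway == 1 groups?
highway_test <- t.test(SalePrice ~ Highway, data = toy)
print(highway_test$p.value)
```

With the real data the same two calls would be applied to `realEstate`. Since the summary shows that only about 2% of the houses have `Highway` equal to 1, any such test will have limited power.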
# linear regression
lm_lot.size <- lm(SalePrice ~ Lot, data = realEstate)
summary(lm_lot.size)
##
## Call:
## lm(formula = SalePrice ~ Lot, data = realEstate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -219.40 -85.16 -42.63 51.41 620.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.9625 13.6150 15.715 < 2e-16 ***
## Lot 2.6063 0.5043 5.168 3.38e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.3 on 519 degrees of freedom
## Multiple R-squared: 0.04894, Adjusted R-squared: 0.04711
## F-statistic: 26.71 on 1 and 519 DF, p-value: 3.38e-07
# diagnostic plots for checking the assumptions of the linear model.
par(mfrow = c(2,2))
plot(lm_lot.size)
# scatterplot of sale price against lot size, with the fitted regression line
plot(SalePrice ~ Lot, data = realEstate,
     main = "Sale Price vs. Lot Size",
xlab = "Lot Size",
ylab = "Sale Price")
abline(lm_lot.size, col = "green")
# linear regression of unit price on lot size
summary(lm(UnitPrice ~ Lot, data = realEstate))
##
## Call:
## lm(formula = UnitPrice ~ Lot, data = realEstate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.608 -18.987 -5.315 11.388 138.344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.9746 3.0665 33.906 <2e-16 ***
## Lot 0.6391 0.1136 5.627 3e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.26 on 519 degrees of freedom
## Multiple R-squared: 0.0575, Adjusted R-squared: 0.05568
## F-statistic: 31.66 on 1 and 519 DF, p-value: 3.004e-08
The \(R^2\) of the sale price model is about \(4.9\%\) (0.04894), so lot size explains only a small share of the variability in sale price, even though its slope is statistically significant (p-value 3.38e-07).
The \(R^2\) for the linear model of unit price on lot size is similar, about \(5.8\%\) (0.0575), which means there is also little linear association between unit price and lot size.
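As a quick consistency check on the fit statistics, in a simple linear regression the F-statistic can be recovered from \(R^2\) as \(F = \frac{R^2}{1 - R^2} \cdot df\), and it equals the square of the slope's t value. A minimal sketch using only the numbers printed in the two summaries above (the variable names are illustrative):

```r
# Values copied from the two summary() outputs above.
df_resid <- 519

r2_sale <- 0.04894   # SalePrice ~ Lot
t_sale  <- 5.168
r2_unit <- 0.0575    # UnitPrice ~ Lot
t_unit  <- 5.627

# F = R^2 / (1 - R^2) * residual degrees of freedom (one predictor).
f_sale <- r2_sale / (1 - r2_sale) * df_resid
f_unit <- r2_unit / (1 - r2_unit) * df_resid

# Both should agree with the reported F-statistics (26.71 and 31.66)
# and with the squared t values, up to rounding of the printed inputs.
print(c(f_sale = f_sale, t_sq_sale = t_sale^2))
print(c(f_unit = f_unit, t_sq_unit = t_unit^2))
```

Both models give an \(R^2\) near 0.05, so in each case roughly 95% of the variability is left unexplained by lot size.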