Business Understanding

The dataset is obtained from: https://www.kaggle.com/harlfoxem/housesalesprediction

House sale prices vary with a lot of factors and understanding the main drivers of the prices are of interest to a lot of parties. For example, from the perspective of individuals, knowing what makes a house expensive helps them determine the “fair” price one should pay or sell. On the other hand, from the perspective of real estate brokers, this knowledge enables them to easily match the prospective buyers with the right houses in their price ranges. In this analysis, we took on a little different focus and analyze from the perspective of property developers. The choice of the focal point is partly driven by the location from which our data source is derived, King County, Washington.

The county is the home of many Fortune 500 companies and prominent technology and internet firms, such as Boeing, Amazon, Microsoft, Zillow. The strong performance of these companies has led to strong hiring in the job market, particularly for high-paying occupations. Attracted by the opportunities, migrants from other places come to the area, resulting in an influx of new residents. Consequently, the demand for housing rises and brings about the need for building new homes. This is a boon for property developers. However, the construction of new buildings in Washington is no easy task due to the strict zoning law of the state. In addition, a home’s value is greatly impacted by its configuration for which there is an unlimited number of combinations. Given regulatory constraints and a plethora of design options, the property developers, thus, face optimization problems to get the highest prices out of the homes they build. Since the buildings are not easily modified or cost-prohibitive to do so, the property developers only have one chance to get it right. Motivated by this problem, our analysis aims to identify the main drivers of a home’s value to help developers prioritize their configuring combinations. Specifically, we will look into both the internal factors of a home such as the number of bedrooms and bathrooms, and the external factors such as the area of the lot and its location.

Data Understanding

The data consists of 21 variables and each variable has 21,613 records. These various variables are data of a house regarding its structure, location, and pricing. The data is spread over 70 different ZIP codes in King County. In terms of the house structure, we have factors such as square footage of house and lot. Other location parameters include ZIP code of the house as well as its exact coordinates. Also, the year of the house when it was built and its year of renovation can greatly affect the price of the house and hence should not be ignored. Interestingly, the bathroom variable has decimal values, multiples of 0.25 in some records. Through additional research, we discovered that in the parlance of property management, a bathroom is considered a full bathroom if it has all four components: toilet, sink, shower, and bathtub; each component represents a quarter of the bathroom (Rogers, 2011).

Therefore, when there is a decimal value, the bathroom may not have all the components and one of the 4 utilities is missing. Similarly, the floors variable also has decimal values (multiples of 0.5) which correspond to attics instead of full floors in the house.

Data Cleaning

Our dataset contains no missing values as observed. Thus, at first glance, it seems that the cleaning in terms of missing or null values was not a primary issue. However, the missing values might have been coded as zero and may underestimate the values of the variables without proper management. As a result, We examined each variable and evaluate whether a zero value would make sense.

In the bedroom variable, we observed that the number of bedrooms was zero for 13 records; common sense dictates that zero bedrooms should not occur for a house purchase. As the number of records is relatively small and the distribution of these records are spread across ZIP codes and dates of closing, we decided to eliminate these observations. Moreover, we also had one record with 33 bedrooms. This value is significantly outside of the range where the majority of records fall into, 1-11 bedrooms. We cross-checked the price to assess if it is some specially designed mansion, but the price is only $640,000 and is way less than the average price of houses which have 6 or more bedrooms; hence, there might have been a mistake in the inputting of information for this record. Since the number of bedrooms is likely to be an important indicator, we want to avoid the influence of erroneous outlier and eliminate this record.

The number of bathrooms, as we discussed above, can have decimal values as multiples of 0.25. However, it should not be less than 0.75 because a house should have at least one toilet, one sink, and one bathroom, which total up to 0.75; therefore, we decided to exclude these observations. Consequently, after the data cleaning process, we ended up with 21,592 records, a reduction of 21 records from the original dataset.

## Loading libraries
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(gridExtra)
library(grid)
library(dplyr)
library(lubridate)
library(GGally)
library(leaflet)

## Loading the data
housing1 = read.csv('kc_house_data.csv')

## Cleaning the data to exclude outliners and abnormal observations
housing <-filter(housing1, housing1$bedrooms<33 &
                   housing1$bedrooms!=0 & housing1$bathrooms>=0.75 & price <= 3000000)

# Create new date
housing$newDate = as.Date(substr(housing$date,1,8),'%Y%m%d')

# Creat month and year from column newDate 
housing$month = month(housing$newDate)
housing$year = year(housing$newDate)

Exploratory Analysis

We first used the histogram to understand the distribution of prices; they vary from $78,000 up to $7,700,000.

# Histogram to identify outlier
ggplot(data=housing1, aes(x=price))+
  geom_histogram(binwidth = 30000, color="skyblue4", fill = "skyblue3") + 
  theme(axis.text.x = element_text(hjust = 1, size = 12))+
  theme(axis.text.y = element_text(hjust = 1, size = 12))+
  geom_vline(xintercept = mean(housing$price), color = 'darkred')+
  xlab("Price") +  
  ylab("Frequency") + 
  ggtitle("Full Histogram of Price")

However, from the graph, the prices seem to be very right-skewed, and most of the observations fall below $3,000,000. Since there are very few observations beyond $3,000,000, we consider those observations outliers and decided to use $3,000,000 as the cut-off point to retain relevant records for subsequent analyses. After excluding the outliers, the number of observations drops 45 observations, from 21,592 to 21,547. The updated histogram is shown below.

#Histogram excluding outlier
housing %>% filter(price <= 3000000) %>%
  ggplot(aes(x=price))+
  geom_histogram(binwidth = 30000, color="skyblue4", fill="skyblue3") + 
  theme(axis.text.x = element_text(hjust = 1, size = 12))+
  theme(axis.text.y = element_text(hjust = 1, size = 12))+
  geom_vline(xintercept = mean(housing$price), color = 'darkred')+
  xlab("Price") +  
  ylab("Frequency") + 
  ggtitle("Histogram of Price")

The fact that the distribution of price is very right-skewed may indicate the non-linearity of the relationship between predictors and price, as the effects are prominent to most observations only up to a certain point. We explore a log transformation of price to evaluate the use of transformation for regression analyses. The histogram of log price is plotted below.

#Histogram of log price
housing$logPrice = log(housing$price)
housing %>% filter(price <= 3000000) %>%
  ggplot(aes(x=logPrice))+
  geom_histogram(binwidth = 0.15, size = 0.7,color="darkseagreen4", fill="darkseagreen3") + 
  theme(axis.text.x = element_text(hjust = 1, size = 12))+
  theme(axis.text.y = element_text(hjust = 1, size = 12))+
  xlab("Log Price") +  
  ylab("Frequency") + 
  ggtitle("Histogram of Log Price")

The histogram of the log-transformed price resembles largely that of the normal distribution and confirms our intuition. Therefore, we will explore the log-transformed regression analysis in the Modeling session.

Next, we decided to plot the observations on a map of King County to get a first impression of if and how the neighborhood a house is in affects its sale price.

# Clustering Price for Map Index Purpose
set.seed(1)
priceclustering <- housing %>% select(price)


# Build a kmeans model
model_km3 <- kmeans(priceclustering, centers = 5)


# Extract the cluster assignment vector from the kmeans model
clust_km3 <- model_km3$cluster


# Create a new dataframe appending the cluster assignment
price_cluster <- mutate(housing, cluster = clust_km3)
price_index <- function(x){
  for(i in 1:nrow(x)){
    if(x[i,"cluster"] == 5){
      x[i,"index"] = '75,000 - 675,000'
    } else if(x[i,"cluster"] == 4) {
      x[i,"index"] = '675,000 - 1,275,000'
    } else if(x[i,"cluster"] == 3){
      x[i,"index"] = '1,275,000 - 1,875,000'
    } else if(x[i,"cluster"] == 2){
      x[i,"index"] = '1,875,000 - 2,475,000'
    } else {
      x[i,"index"] = '2,475,000 - 3,075,000'
    }    
  }
  return(x)
}

price_map <- price_index(price_cluster)
price_map$index <- as.factor(price_map$index)

house1 <- filter(price_map, cluster == 5)
house2 <- filter(price_map, cluster == 4)
house3 <- filter(price_map, cluster == 3)
house4 <- filter(price_map, cluster == 2)
house5 <- filter(price_map, cluster == 1)


# Color Index
# blue, green, yellow, orange, red
pal <- colorFactor(palette = c("#b0deff", "#caff99", "#ffec6e", "#ffce8a", "#ff997a"), 
                   levels = c('75,000 - 675,000', '675,000 - 1,275,000', '1,275,000 - 1,875,000', 
                              '1,875,000 - 2,475,000', '2,475,000 - 3,075,000'))


# Draw a Map
leaflet(options = leafletOptions(minZoom = 9, dragging = TRUE)) %>% 
  addProviderTiles(provider = 'CartoDB')%>%
  addCircleMarkers(data = house1, radius = 0.02, 
                   popup = ~paste0("<b>", 'USD ', price, "</b>", "<br/>", "House area (sqft): ", 
                                   sqft_living15, "<br/>", "Lot area (sqft): ", sqft_lot15),
                   color = ~pal(index),  group = '75,000 - 675,000') %>%
  addCircleMarkers(data = house2, radius = 0.02, 
                   popup = ~paste0("<b>", 'USD ', price, "</b>", "<br/>", "House area (sqft): ", 
                                   sqft_living15, "<br/>", "Lot area (sqft): ", sqft_lot15),
                   color = ~pal(index),  group = '675,000 - 1,275,000') %>%
  addCircleMarkers(data = house3, radius = 0.02, 
                   popup = ~paste0("<b>", 'USD ', price, "</b>", "<br/>", "House area (sqft): ", 
                                   sqft_living15, "<br/>", "Lot area (sqft): ", sqft_lot15),
                   color = ~pal(index),  group = '1,275,000 - 1,875,000') %>%
  addCircleMarkers(data = house4, radius = 0.02, 
                   popup = ~paste0("<b>", 'USD ', price, "</b>", "<br/>", "House area (sqft): ", 
                                   sqft_living15, "<br/>", "Lot area (sqft): ", sqft_lot15),
                   color = ~pal(index),  group = '1,875,000 - 2,475,000') %>%
  addCircleMarkers(data = house5, radius = 0.02, 
                   popup = ~paste0("<b>", 'USD ', price, "</b>", "<br/>", "House area (sqft): ", 
                                   sqft_living15, "<br/>", "Lot area (sqft): ", sqft_lot15),
                   color = ~pal(index),  group = '2,475,000 - 3,075,000') %>%
  setView(lng = -122.001008, lat = 47.474443, zoom = 9) %>%
  addLegend(pal = pal, 
            values = c('75,000 - 675,000', '675,000 - 1,275,000', '1,275,000 - 1,875,000', '1,875,000 - 2,475,000', '2,475,000 - 3,075,000'),
            opacity = 0.6, title = "Price Range", position = "bottomright") %>%
  addLayersControl(overlayGroups = c('75,000 - 675,000', '675,000 - 1,275,000', '1,275,000 - 1,875,000', '1,875,000 - 2,475,000', '2,475,000 - 3,075,000'), position = "bottomleft")

The map clearly shows a relationship between location and housing prices, with cheaper housing prevalent in the south and expensive housing in and around the center.

## Subsetting housing to exclude some variables from correlation matrix
housingEdit = subset(housing,select=c(price,bedrooms,bathrooms,sqft_living,sqft_lot
                                      ,floors,waterfront,condition,grade,sqft_above,
                                      sqft_basement,sqft_living15,sqft_lot15))

ggcorr(housingEdit, size = 4, color = "grey30", hjust = 0.75,
       label = TRUE, label_size = 4, label_alpha = TRUE,
       nbreaks = 10, palette = "RdBu")

To have a more comprehensive view of the relationship between variables, we created a correlation matrix. It shows that prices have a strong positive relationship with the square-feet living, grade, and square-feet above. The strongest correlation among all the variables is the correlation between square feet of living before renovated and square feet of the house area above the basement.

Interaction Exploration

After some exploratory analysis, we was interested dive deeper into the more complicated interaction exploration.

housing$basement<- ifelse(housing$sqft_basement > 0,"1","0")
ggplot(data = housing, aes(x = housing$sqft_living15, y = price, group=basement, color=basement))+
  xlab("Living Area")+
  ylab("Price") +
  ylim(0,3500000)+
  xlim(0,8000)+
  geom_point(alpha = 0.3)+
  geom_smooth(method='lm', aes(color=basement),fullrange=TRUE)+
  ggtitle("Living area vs Basement") +
  scale_color_manual(values=c("#EA9999", "#80BE86"))
## Warning: Removed 4 rows containing missing values (geom_smooth).

The plots above show the effect of having a basement on the relationship between the size of living area and sales price (0 indicates no, 1 indicates yes). We can see that having a basement in King County, Seattle, has a higher housing price when the size of the living area is the same. This means that if a house has a basement, an extra square foot of space will add more value than for a house without a basement.

housing$floor_num <- ifelse(housing$floors<=1,"1", ifelse(housing$floors<=2,"2","3"))

ggplot(data = housing, aes(x = housing$sqft_living15, y = price, group=floor_num,color=floor_num))+
  xlab("Living area")+
  ylab("Price") +
  ylim(0,3500000)+
  xlim(0,8000)+
  geom_point(alpha = 0.3)+
  geom_smooth(method='lm', aes(color=floor_num),fullrange=TRUE)+
  ggtitle("Living Area vs Number of Floors") +
  scale_color_manual(values=c("#EA9999", "#80BE86", "#6FA8DC"))
## Warning: Removed 11 rows containing missing values (geom_smooth).

The interaction plot between the size of the living area and the number of floors indicates that there is not much difference in the effect on the housing price between having a house with one or two floors. However, having a house with three floors has a much more prominent effect on the housing price.

Modeling

Intuitive Model

For the starting model, we include the most intuitive variables to get a sense of the effects the predictors have on the sale prices.

lm0 = lm(price~ bedrooms + bathrooms + sqft_living + sqft_lot, data = housing)
summary(lm0)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot, 
##     data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1277286  -140030   -24202    97929  2000803 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.110e+05  6.483e+03  17.125  < 2e-16 ***
## bedrooms    -5.362e+04  2.242e+03 -23.920  < 2e-16 ***
## bathrooms    9.758e+03  3.230e+03   3.021  0.00252 ** 
## sqft_living  2.832e+02  2.949e+00  96.045  < 2e-16 ***
## sqft_lot    -2.958e-01  3.945e-02  -7.498 6.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 235000 on 21542 degrees of freedom
## Multiple R-squared:  0.4925, Adjusted R-squared:  0.4924 
## F-statistic:  5227 on 4 and 21542 DF,  p-value: < 2.2e-16
extractAIC(lm0)
## [1]      5 532960

The result of the first model is unexpected in a few ways. First off, the coefficient for the bedrooms is statistically insignificant, even though one would expect that the more bedrooms a house has, the more expensive it is. Closer investigation reveals that this irregularity may be caused by the positive correlation between bedrooms and bathrooms as well as between bedrooms and sqft_living; the collinearity between these variables may have led to the variations of bedrooms be captured by bathrooms and sqft_living. Furthermore, the coefficient for sqft_lot (which indicates the area of the lot) is negative and statistically significant. This indicates that the larger the lot, the less the house would be sold for. This may seem counterintuitive but may point to the importance of controlling for the locations of the houses. A larger area of the lot may mean that the house is located in rural areas and house prices may be a lot less compared to urban locations.

Automatic procedures

To build a more accurate model, We used backward selection and forward selection - to select the variables affecting prices in the most significant manner. We have excluded a few variables before commencing the procedures. Firstly, the coordinate variables such as lat and long were taken out as their values have no meaning in predicting prices. For a similar reason, the unique id was also excluded. We also keep out the ZIP code and date variables at this stage. While the ZIP codes and dates can be used to control for the geographic and time effects, they are not usable in their numerical forms. With the exclusion of these variables, we have the price as the dependent variable and 14 independent variables to narrow down.

## Forward selection
subHousing = subset(housing, select = -c(id,date,view,long,lat,newDate,zipcode))

fit.nothing = lm(price ~ 1, data = subHousing)
FitAll= lm(price ~ ., data = subHousing)

step(fit.nothing,direction 
     = 'forward', scope=formula('FitAll'))
## Start:  AIC=547567.2
## price ~ 1
## 
##                 Df  Sum of Sq        RSS    AIC
## + logPrice       1 1.9883e+15 3.5562e+14 506938
## + sqft_living    1 1.1211e+15 1.2228e+15 533549
## + grade          1 1.0852e+15 1.2587e+15 534173
## + sqft_living15  1 8.4798e+14 1.4959e+15 537893
## + sqft_above     1 8.3813e+14 1.5058e+15 538034
## + bathrooms      1 6.3136e+14 1.7125e+15 540807
## + bedrooms       1 2.4215e+14 2.1017e+15 545220
## + sqft_basement  1 2.1507e+14 2.1288e+15 545495
## + floor_num      2 1.8580e+14 2.1581e+15 545792
## + floors         1 1.7141e+14 2.1725e+15 545933
## + waterfront     1 9.7108e+13 2.2468e+15 546657
## + basement       1 7.8424e+13 2.2655e+15 546836
## + yr_renovated   1 3.6854e+13 2.3070e+15 547228
## + sqft_lot       1 2.1213e+13 2.3227e+15 547373
## + sqft_lot15     1 1.7476e+13 2.3264e+15 547408
## + yr_built       1 6.6097e+12 2.3373e+15 547508
## + condition      1 3.7489e+12 2.3401e+15 547535
## + month          1 5.8856e+11 2.3433e+15 547564
## <none>                        2.3439e+15 547567
## + year           1 1.5237e+11 2.3437e+15 547568
## 
## Step:  AIC=506938
## price ~ logPrice
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living    1 1.6307e+13 3.3931e+14 505929
## + waterfront     1 1.1979e+13 3.4364e+14 506202
## + sqft_above     1 1.0583e+13 3.4504e+14 506289
## + grade          1 7.0195e+12 3.4860e+14 506510
## + sqft_living15  1 5.0754e+12 3.5055e+14 506630
## + bathrooms      1 1.5077e+12 3.5411e+14 506848
## + sqft_basement  1 1.4224e+12 3.5420e+14 506854
## + yr_renovated   1 1.2141e+12 3.5441e+14 506866
## + yr_built       1 9.0256e+11 3.5472e+14 506885
## + floor_num      2 8.6914e+11 3.5475e+14 506889
## + floors         1 6.1416e+11 3.5501e+14 506903
## + basement       1 2.3924e+11 3.5538e+14 506926
## + condition      1 3.5875e+10 3.5559e+14 506938
## <none>                        3.5562e+14 506938
## + sqft_lot       1 2.3164e+10 3.5560e+14 506939
## + bedrooms       1 1.8343e+10 3.5560e+14 506939
## + sqft_lot15     1 9.2637e+09 3.5561e+14 506939
## + year           1 8.9604e+09 3.5561e+14 506939
## + month          1 7.9376e+09 3.5561e+14 506940
## 
## Step:  AIC=505928.6
## price ~ logPrice + sqft_living
## 
##                 Df  Sum of Sq        RSS    AIC
## + waterfront     1 1.3339e+13 3.2598e+14 505066
## + yr_built       1 6.9711e+12 3.3234e+14 505483
## + bedrooms       1 5.4047e+12 3.3391e+14 505585
## + floor_num      2 4.0579e+12 3.3526e+14 505673
## + floors         1 2.7364e+12 3.3658e+14 505756
## + bathrooms      1 2.6553e+12 3.3666e+14 505761
## + yr_renovated   1 1.5734e+12 3.3774e+14 505830
## + basement       1 6.7641e+11 3.3864e+14 505888
## + condition      1 4.6743e+11 3.3885e+14 505901
## + sqft_lot15     1 3.6421e+11 3.3895e+14 505907
## + grade          1 3.1787e+11 3.3900e+14 505910
## + sqft_lot       1 2.0400e+11 3.3911e+14 505918
## + year           1 7.3708e+10 3.3924e+14 505926
## + month          1 4.0637e+10 3.3927e+14 505928
## <none>                        3.3931e+14 505929
## + sqft_living15  1 1.8727e+10 3.3930e+14 505929
## + sqft_above     1 2.8902e+09 3.3931e+14 505930
## + sqft_basement  1 2.8902e+09 3.3931e+14 505930
## 
## Step:  AIC=505066.5
## price ~ logPrice + sqft_living + waterfront
## 
##                 Df  Sum of Sq        RSS    AIC
## + yr_built       1 6.4368e+12 3.1954e+14 504639
## + bedrooms       1 4.5194e+12 3.2146e+14 504768
## + floor_num      2 3.8458e+12 3.2213e+14 504815
## + floors         1 2.5488e+12 3.2343e+14 504899
## + bathrooms      1 2.3882e+12 3.2359e+14 504910
## + yr_renovated   1 9.7282e+11 3.2500e+14 505004
## + basement       1 7.0062e+11 3.2527e+14 505022
## + grade          1 5.0981e+11 3.2547e+14 505035
## + sqft_lot15     1 4.7829e+11 3.2550e+14 505037
## + condition      1 4.4246e+11 3.2553e+14 505039
## + sqft_lot       1 2.4703e+11 3.2573e+14 505052
## + year           1 8.2903e+10 3.2589e+14 505063
## + month          1 5.2956e+10 3.2592e+14 505065
## <none>                        3.2598e+14 505066
## + sqft_above     1 2.0845e+10 3.2595e+14 505067
## + sqft_basement  1 2.0845e+10 3.2595e+14 505067
## + sqft_living15  1 1.7153e+10 3.2596e+14 505067
## 
## Step:  AIC=504638.7
## price ~ logPrice + sqft_living + waterfront + yr_built
## 
##                 Df  Sum of Sq        RSS    AIC
## + bedrooms       1 5.2939e+12 3.1424e+14 504281
## + grade          1 4.3347e+12 3.1520e+14 504346
## + floor_num      2 2.3791e+12 3.1716e+14 504482
## + basement       1 2.1976e+12 3.1734e+14 504492
## + sqft_above     1 1.0170e+12 3.1852e+14 504572
## + sqft_basement  1 1.0170e+12 3.1852e+14 504572
## + sqft_lot15     1 4.7567e+11 3.1906e+14 504609
## + sqft_lot       1 2.7391e+11 3.1926e+14 504622
## + floors         1 2.4198e+11 3.1930e+14 504624
## + bathrooms      1 2.0088e+11 3.1934e+14 504627
## + yr_renovated   1 1.5626e+11 3.1938e+14 504630
## + year           1 1.1642e+11 3.1942e+14 504633
## + sqft_living15  1 1.1120e+11 3.1943e+14 504633
## + month          1 7.4162e+10 3.1946e+14 504636
## + condition      1 5.3898e+10 3.1948e+14 504637
## <none>                        3.1954e+14 504639
## 
## Step:  AIC=504280.8
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + grade          1 3.2282e+12 3.1102e+14 504060
## + floor_num      2 2.0475e+12 3.1220e+14 504144
## + basement       1 1.8658e+12 3.1238e+14 504154
## + sqft_lot15     1 8.8179e+11 3.1336e+14 504222
## + sqft_above     1 7.5223e+11 3.1349e+14 504231
## + sqft_basement  1 7.5223e+11 3.1349e+14 504231
## + sqft_lot       1 5.5292e+11 3.1369e+14 504245
## + floors         1 2.4435e+11 3.1400e+14 504266
## + year           1 1.3946e+11 3.1411e+14 504273
## + yr_renovated   1 1.2933e+11 3.1412e+14 504274
## + month          1 9.0480e+10 3.1415e+14 504277
## + sqft_living15  1 3.5112e+10 3.1421e+14 504280
## <none>                        3.1424e+14 504281
## + condition      1 5.8805e+09 3.1424e+14 504282
## + bathrooms      1 7.6107e+08 3.1424e+14 504283
## 
## Step:  AIC=504060.3
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms + 
##     grade
## 
##                 Df  Sum of Sq        RSS    AIC
## + floor_num      2 2.3499e+12 3.0867e+14 503901
## + basement       1 1.3220e+12 3.0969e+14 503970
## + sqft_lot15     1 7.8240e+11 3.1023e+14 504008
## + floors         1 5.0052e+11 3.1052e+14 504028
## + sqft_lot       1 4.7759e+11 3.1054e+14 504029
## + sqft_above     1 2.8408e+11 3.1073e+14 504043
## + sqft_basement  1 2.8408e+11 3.1073e+14 504043
## + year           1 2.0090e+11 3.1082e+14 504048
## + month          1 1.1427e+11 3.1090e+14 504054
## + yr_renovated   1 1.0097e+11 3.1092e+14 504055
## + sqft_living15  1 3.6644e+10 3.1098e+14 504060
## <none>                        3.1102e+14 504060
## + condition      1 1.2429e+09 3.1102e+14 504062
## + bathrooms      1 5.1013e+08 3.1102e+14 504062
## 
## Step:  AIC=503900.8
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms + 
##     grade + floor_num
## 
##                 Df  Sum of Sq        RSS    AIC
## + basement       1 2.7658e+12 3.0590e+14 503709
## + sqft_above     1 1.5410e+12 3.0713e+14 503795
## + sqft_basement  1 1.5410e+12 3.0713e+14 503795
## + sqft_lot15     1 8.6721e+11 3.0780e+14 503842
## + sqft_lot       1 5.3277e+11 3.0813e+14 503866
## + year           1 1.8464e+11 3.0848e+14 503890
## + yr_renovated   1 1.5436e+11 3.0851e+14 503892
## + month          1 1.0219e+11 3.0856e+14 503896
## <none>                        3.0867e+14 503901
## + floors         1 2.7014e+10 3.0864e+14 503901
## + bathrooms      1 2.2248e+10 3.0864e+14 503901
## + sqft_living15  1 1.6691e+10 3.0865e+14 503902
## + condition      1 1.4408e+10 3.0865e+14 503902
## 
## Step:  AIC=503708.9
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms + 
##     grade + floor_num + basement
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_lot15     1 1.2310e+12 3.0467e+14 503624
## + sqft_lot       1 8.0076e+11 3.0510e+14 503654
## + bathrooms      1 2.5045e+11 3.0565e+14 503693
## + sqft_living15  1 1.7785e+11 3.0572e+14 503698
## + year           1 1.6560e+11 3.0574e+14 503699
## + yr_renovated   1 1.4945e+11 3.0575e+14 503700
## + month          1 8.8841e+10 3.0581e+14 503705
## <none>                        3.0590e+14 503709
## + sqft_above     1 2.0707e+10 3.0588e+14 503709
## + sqft_basement  1 2.0707e+10 3.0588e+14 503709
## + condition      1 9.8576e+09 3.0589e+14 503710
## + floors         1 1.6569e+09 3.0590e+14 503711
## 
## Step:  AIC=503624
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms + 
##     grade + floor_num + basement + sqft_lot15
## 
##                 Df  Sum of Sq        RSS    AIC
## + bathrooms      1 2.1453e+11 3.0446e+14 503611
## + year           1 1.6977e+11 3.0450e+14 503614
## + yr_renovated   1 1.5495e+11 3.0451e+14 503615
## + sqft_living15  1 1.3036e+11 3.0454e+14 503617
## + month          1 8.7868e+10 3.0458e+14 503620
## <none>                        3.0467e+14 503624
## + sqft_lot       1 2.6876e+10 3.0464e+14 503624
## + sqft_above     1 8.9119e+09 3.0466e+14 503625
## + sqft_basement  1 8.9119e+09 3.0466e+14 503625
## + condition      1 6.9211e+09 3.0466e+14 503626
## + floors         1 3.3776e+09 3.0467e+14 503626
## 
## Step:  AIC=503610.8
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms + 
##     grade + floor_num + basement + sqft_lot15 + bathrooms
## 
##                 Df  Sum of Sq        RSS    AIC
## + year           1 1.7632e+11 3.0428e+14 503600
## + sqft_living15  1 1.1274e+11 3.0434e+14 503605
## + yr_renovated   1 1.1201e+11 3.0434e+14 503605
## + month          1 8.9122e+10 3.0437e+14 503607
## <none>                        3.0446e+14 503611
## + sqft_lot       1 2.7675e+10 3.0443e+14 503611
## + floors         1 1.4571e+10 3.0444e+14 503612
## + condition      1 1.0765e+10 3.0444e+14 503612
## + sqft_above     1 7.2477e+09 3.0445e+14 503612
## + sqft_basement  1 7.2477e+09 3.0445e+14 503612
## 
## Step:  AIC=503600.4
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms + 
##     grade + floor_num + basement + sqft_lot15 + bathrooms + year
## 
##                 Df  Sum of Sq        RSS    AIC
## + yr_renovated   1 1.1669e+11 3.0416e+14 503594
## + sqft_living15  1 1.1012e+11 3.0417e+14 503595
## + sqft_lot       1 2.8967e+10 3.0425e+14 503600
## <none>                        3.0428e+14 503600
## + floors         1 1.4234e+10 3.0426e+14 503601
## + sqft_above     1 8.0220e+09 3.0427e+14 503602
## + sqft_basement  1 8.0220e+09 3.0427e+14 503602
## + condition      1 6.5898e+09 3.0427e+14 503602
## + month          1 2.2903e+09 3.0428e+14 503602
## 
## Step:  AIC=503594.1
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms + 
##     grade + floor_num + basement + sqft_lot15 + bathrooms + year + 
##     yr_renovated
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_living15  1 1.0055e+11 3.0406e+14 503589
## + sqft_lot       1 2.8389e+10 3.0413e+14 503594
## <none>                        3.0416e+14 503594
## + floors         1 2.2796e+10 3.0414e+14 503594
## + sqft_above     1 7.5849e+09 3.0415e+14 503596
## + sqft_basement  1 7.5849e+09 3.0415e+14 503596
## + month          1 2.5427e+09 3.0416e+14 503596
## + condition      1 5.7690e+08 3.0416e+14 503596
## 
## Step:  AIC=503589
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms + 
##     grade + floor_num + basement + sqft_lot15 + bathrooms + year + 
##     yr_renovated + sqft_living15
## 
##                 Df  Sum of Sq        RSS    AIC
## + sqft_lot       1 3.4020e+10 3.0403e+14 503589
## <none>                        3.0406e+14 503589
## + floors         1 2.1447e+10 3.0404e+14 503589
## + sqft_above     1 3.7583e+09 3.0406e+14 503591
## + sqft_basement  1 3.7583e+09 3.0406e+14 503591
## + month          1 2.1529e+09 3.0406e+14 503591
## + condition      1 1.2938e+09 3.0406e+14 503591
## 
## Step:  AIC=503588.6
## price ~ logPrice + sqft_living + waterfront + yr_built + bedrooms + 
##     grade + floor_num + basement + sqft_lot15 + bathrooms + year + 
##     yr_renovated + sqft_living15 + sqft_lot
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        3.0403e+14 503589
## + floors         1 2.2621e+10 3.0400e+14 503589
## + sqft_above     1 3.1517e+09 3.0402e+14 503590
## + sqft_basement  1 3.1517e+09 3.0402e+14 503590
## + month          1 2.1410e+09 3.0403e+14 503590
## + condition      1 1.5104e+09 3.0403e+14 503590
## 
## Call:
## lm(formula = price ~ logPrice + sqft_living + waterfront + yr_built + 
##     bedrooms + grade + floor_num + basement + sqft_lot15 + bathrooms + 
##     year + yr_renovated + sqft_living15 + sqft_lot, data = subHousing)
## 
## Coefficients:
##   (Intercept)       logPrice    sqft_living     waterfront       yr_built  
##    -1.685e+07      5.004e+05      6.866e+01      2.971e+05     -9.169e+02  
##      bedrooms          grade     floor_num2     floor_num3      basement1  
##    -1.986e+04      1.928e+04     -3.432e+04     -2.601e+03     -3.017e+04  
##    sqft_lot15      bathrooms           year   yr_renovated  sqft_living15  
##    -2.322e-01      6.654e+03      6.191e+03      5.837e+00     -5.507e+00  
##      sqft_lot  
##    -4.375e-02
lmProcedure = lm(price ~ grade + yr_built + sqft_living + bathrooms + waterfront
                 + sqft_living15 + floors + sqft_above + condition + bedrooms + sqft_lot15
                 + sqft_lot + yr_renovated, data = subHousing)

summary(lmProcedure)
## 
## Call:
## lm(formula = price ~ grade + yr_built + sqft_living + bathrooms + 
##     waterfront + sqft_living15 + floors + sqft_above + condition + 
##     bedrooms + sqft_lot15 + sqft_lot + yr_renovated, data = subHousing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1140099  -107824   -11902    85802  1915104 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.210e+06  1.251e+05  49.640  < 2e-16 ***
## grade          1.247e+05  2.046e+03  60.963  < 2e-16 ***
## yr_built      -3.603e+03  6.406e+01 -56.244  < 2e-16 ***
## sqft_living    1.480e+02  4.259e+00  34.758  < 2e-16 ***
## bathrooms      4.291e+04  3.200e+03  13.410  < 2e-16 ***
## waterfront     5.252e+05  1.659e+04  31.652  < 2e-16 ***
## sqft_living15  5.343e+01  3.260e+00  16.391  < 2e-16 ***
## floors         4.129e+04  3.444e+03  11.989  < 2e-16 ***
## sqft_above    -3.026e+01  4.079e+00  -7.418 1.23e-13 ***
## condition      2.242e+04  2.265e+03   9.898  < 2e-16 ***
## bedrooms      -3.462e+04  1.926e+03 -17.971  < 2e-16 ***
## sqft_lot15    -4.991e-01  7.105e-02  -7.024 2.22e-12 ***
## sqft_lot       8.860e-02  4.644e-02   1.908   0.0564 .  
## yr_renovated   1.562e+01  3.554e+00   4.396 1.11e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195700 on 21533 degrees of freedom
## Multiple R-squared:  0.6481, Adjusted R-squared:  0.6479 
## F-statistic:  3051 on 13 and 21533 DF,  p-value: < 2.2e-16
extractAIC(lmProcedure)
## [1]     14 525087
##Backward selection Price
backwardsub=subset(housing, select=-c(id, date, lat, long, view, zipcode))

FitAll=lm(price~., data=backwardsub)
step(FitAll,direction = 'backward')
## Start:  AIC=503596.2
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + sqft_basement + 
##     yr_built + yr_renovated + sqft_living15 + sqft_lot15 + newDate + 
##     month + year + logPrice + basement + floor_num
## 
## 
## Step:  AIC=503596.2
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     year + logPrice + basement + floor_num
## 
##                 Df  Sum of Sq        RSS    AIC
## - year           1 8.4996e+08 3.0400e+14 503594
## - month          1 2.1327e+09 3.0400e+14 503594
## - condition      1 2.3803e+09 3.0400e+14 503594
## - sqft_above     1 2.7763e+09 3.0400e+14 503594
## - newDate        1 2.8864e+09 3.0400e+14 503594
## - floors         1 2.2548e+10 3.0402e+14 503596
## <none>                        3.0399e+14 503596
## - sqft_lot       1 3.4967e+10 3.0403e+14 503597
## - sqft_living15  1 1.0192e+11 3.0410e+14 503601
## - yr_renovated   1 1.0578e+11 3.0410e+14 503602
## - bathrooms      1 1.7759e+11 3.0417e+14 503607
## - sqft_lot15     1 4.1480e+11 3.0441e+14 503624
## - basement       1 1.4263e+12 3.0542e+14 503695
## - floor_num      2 2.3861e+12 3.0638e+14 503761
## - grade          1 2.8566e+12 3.0685e+14 503796
## - bedrooms       1 4.0632e+12 3.0806e+14 503880
## - sqft_living    1 4.9644e+12 3.0896e+14 503943
## - yr_built       1 5.6659e+12 3.0966e+14 503992
## - waterfront     1 1.2088e+13 3.1608e+14 504434
## - logPrice       1 5.0858e+14 8.1258e+14 524779
## 
## Step:  AIC=503594.3
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     logPrice + basement + floor_num
## 
##                 Df  Sum of Sq        RSS    AIC
## - condition      1 2.3865e+09 3.0400e+14 503592
## - sqft_above     1 2.7598e+09 3.0400e+14 503593
## - floors         1 2.2541e+10 3.0402e+14 503594
## <none>                        3.0400e+14 503594
## - month          1 3.1233e+10 3.0403e+14 503595
## - sqft_lot       1 3.4924e+10 3.0403e+14 503595
## - newDate        1 9.1600e+10 3.0409e+14 503599
## - sqft_living15  1 1.0195e+11 3.0410e+14 503600
## - yr_renovated   1 1.0588e+11 3.0410e+14 503600
## - bathrooms      1 1.7786e+11 3.0417e+14 503605
## - sqft_lot15     1 4.1510e+11 3.0441e+14 503622
## - basement       1 1.4264e+12 3.0542e+14 503693
## - floor_num      2 2.3860e+12 3.0638e+14 503759
## - grade          1 2.8565e+12 3.0685e+14 503794
## - bedrooms       1 4.0651e+12 3.0806e+14 503879
## - sqft_living    1 4.9648e+12 3.0896e+14 503941
## - yr_built       1 5.6684e+12 3.0966e+14 503990
## - waterfront     1 1.2093e+13 3.1609e+14 504433
## - logPrice       1 5.0890e+14 8.1289e+14 524786
## 
## Step:  AIC=503592.5
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + grade + sqft_above + yr_built + yr_renovated + 
##     sqft_living15 + sqft_lot15 + newDate + month + logPrice + 
##     basement + floor_num
## 
##                 Df  Sum of Sq        RSS    AIC
## - sqft_above     1 2.3712e+09 3.0400e+14 503591
## - floors         1 2.1943e+10 3.0402e+14 503592
## <none>                        3.0400e+14 503592
## - month          1 3.1376e+10 3.0403e+14 503593
## - sqft_lot       1 3.4670e+10 3.0403e+14 503593
## - newDate        1 9.3546e+10 3.0409e+14 503597
## - sqft_living15  1 1.0108e+11 3.0410e+14 503598
## - yr_renovated   1 1.1443e+11 3.0411e+14 503599
## - bathrooms      1 1.7588e+11 3.0417e+14 503603
## - sqft_lot15     1 4.1688e+11 3.0441e+14 503620
## - basement       1 1.4254e+12 3.0542e+14 503691
## - floor_num      2 2.3838e+12 3.0638e+14 503757
## - grade          1 2.8635e+12 3.0686e+14 503792
## - bedrooms       1 4.0813e+12 3.0808e+14 503878
## - sqft_living    1 4.9740e+12 3.0897e+14 503940
## - yr_built       1 5.9337e+12 3.0993e+14 504007
## - waterfront     1 1.2092e+13 3.1609e+14 504431
## - logPrice       1 5.1296e+14 8.1696e+14 524891
## 
## Step:  AIC=503590.7
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + grade + yr_built + yr_renovated + sqft_living15 + 
##     sqft_lot15 + newDate + month + logPrice + basement + floor_num
## 
##                 Df  Sum of Sq        RSS    AIC
## - floors         1 2.2824e+10 3.0402e+14 503590
## <none>                        3.0400e+14 503591
## - month          1 3.1482e+10 3.0403e+14 503591
## - sqft_lot       1 3.5234e+10 3.0404e+14 503591
## - newDate        1 9.3037e+10 3.0409e+14 503595
## - sqft_living15  1 1.0438e+11 3.0410e+14 503596
## - yr_renovated   1 1.1472e+11 3.0412e+14 503597
## - bathrooms      1 1.7683e+11 3.0418e+14 503601
## - sqft_lot15     1 4.1764e+11 3.0442e+14 503618
## - floor_num      2 2.5014e+12 3.0650e+14 503763
## - grade          1 2.8901e+12 3.0689e+14 503793
## - basement       1 3.4212e+12 3.0742e+14 503830
## - bedrooms       1 4.0789e+12 3.0808e+14 503876
## - yr_built       1 5.9634e+12 3.0996e+14 504007
## - waterfront     1 1.2113e+13 3.1611e+14 504431
## - sqft_living    1 1.5257e+13 3.1926e+14 504644
## - logPrice       1 5.1301e+14 8.1701e+14 524891
## 
## Step:  AIC=503590.3
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + waterfront + 
##     grade + yr_built + yr_renovated + sqft_living15 + sqft_lot15 + 
##     newDate + month + logPrice + basement + floor_num
## 
##                 Df  Sum of Sq        RSS    AIC
## <none>                        3.0402e+14 503590
## - month          1 3.2122e+10 3.0406e+14 503591
## - sqft_lot       1 3.4054e+10 3.0406e+14 503591
## - newDate        1 9.2200e+10 3.0412e+14 503595
## - sqft_living15  1 1.0571e+11 3.0413e+14 503596
## - yr_renovated   1 1.0644e+11 3.0413e+14 503596
## - bathrooms      1 1.6419e+11 3.0419e+14 503600
## - sqft_lot15     1 4.0935e+11 3.0443e+14 503617
## - grade          1 2.8766e+12 3.0690e+14 503791
## - basement       1 3.4002e+12 3.0742e+14 503828
## - bedrooms       1 4.0610e+12 3.0808e+14 503874
## - floor_num      2 4.3285e+12 3.0835e+14 503891
## - yr_built       1 7.6517e+12 3.1168e+14 504124
## - waterfront     1 1.2114e+13 3.1614e+14 504430
## - sqft_living    1 1.5261e+13 3.1928e+14 504644
## - logPrice       1 5.1302e+14 8.1704e+14 524889
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     waterfront + grade + yr_built + yr_renovated + sqft_living15 + 
##     sqft_lot15 + newDate + month + logPrice + basement + floor_num, 
##     data = backwardsub)
## 
## Coefficients:
##   (Intercept)       bedrooms      bathrooms    sqft_living       sqft_lot  
##    -4.701e+06     -1.986e+04      6.661e+03      6.865e+01     -4.377e-02  
##    waterfront          grade       yr_built   yr_renovated  sqft_living15  
##     2.971e+05      1.929e+04     -9.170e+02      5.842e+00     -5.495e+00  
##    sqft_lot15        newDate          month       logPrice      basement1  
##    -2.322e-01      1.955e+01     -4.181e+02      5.004e+05     -3.017e+04  
##    floor_num2     floor_num3  
##    -3.432e+04     -2.594e+03
backwardmodel=lm(price~bedrooms+bathrooms+sqft_living+sqft_lot+floors+waterfront+condition+grade+sqft_above+yr_built+yr_renovated+sqft_living15+sqft_lot15, data=backwardsub)

summary(backwardmodel)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     floors + waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15, data = backwardsub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1140099  -107824   -11902    85802  1915104 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.210e+06  1.251e+05  49.640  < 2e-16 ***
## bedrooms      -3.462e+04  1.926e+03 -17.971  < 2e-16 ***
## bathrooms      4.291e+04  3.200e+03  13.410  < 2e-16 ***
## sqft_living    1.480e+02  4.259e+00  34.758  < 2e-16 ***
## sqft_lot       8.860e-02  4.644e-02   1.908   0.0564 .  
## floors         4.129e+04  3.444e+03  11.989  < 2e-16 ***
## waterfront     5.252e+05  1.659e+04  31.652  < 2e-16 ***
## condition      2.242e+04  2.265e+03   9.898  < 2e-16 ***
## grade          1.247e+05  2.046e+03  60.963  < 2e-16 ***
## sqft_above    -3.026e+01  4.079e+00  -7.418 1.23e-13 ***
## yr_built      -3.603e+03  6.406e+01 -56.244  < 2e-16 ***
## yr_renovated   1.562e+01  3.554e+00   4.396 1.11e-05 ***
## sqft_living15  5.343e+01  3.260e+00  16.391  < 2e-16 ***
## sqft_lot15    -4.991e-01  7.105e-02  -7.024 2.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195700 on 21533 degrees of freedom
## Multiple R-squared:  0.6481, Adjusted R-squared:  0.6479 
## F-statistic:  3051 on 13 and 21533 DF,  p-value: < 2.2e-16
extractAIC(backwardmodel)
## [1]     14 525087

Interestingly, both procedures arrive at the same variables selection. The final selection contains 13 variables and has an AIC value of 525,087.

Next, as we mentioned in the exploratory analysis, we explored the log transformation of the price variable. Specifically, we repeated the model selection procedure for the natural logarithm of price in a similar manner.

### Automatic model selection for lnPrice

## Forward selection lnPrice
housing$lnprice=log(housing$price)
lnforwardsub=subset(housing, select=-c(price, id, date, lat, long, view, zipcode))
fit.nothing = lm(lnprice ~ 1, data = lnforwardsub)
FitAll= lm(lnprice ~ ., data = lnforwardsub)
lnforward=step(fit.nothing,direction 
               = 'forward', scope=formula('FitAll'))
## Start:  AIC=-28349.39
## lnprice ~ 1
## 
##                 Df Sum of Sq    RSS      AIC
## + logPrice       1    5780.2    0.0 -1352084
## + grade          1    2801.0 2979.3   -42628
## + sqft_living    1    2711.6 3068.6   -41991
## + sqft_living15  1    2173.1 3607.1   -38507
## + sqft_above     1    2014.6 3765.6   -37581
## + bathrooms      1    1687.6 4092.6   -35787
## + bedrooms       1     692.5 5087.7   -31097
## + floor_num      2     611.1 5169.2   -30753
## + floors         1     556.6 5223.6   -30529
## + sqft_basement  1     532.1 5248.1   -30428
## + basement       1     253.3 5526.9   -29313
## + waterfront     1     120.2 5660.1   -28800
## + yr_renovated   1      72.0 5708.3   -28617
## + sqft_lot       1      57.7 5722.6   -28563
## + sqft_lot15     1      48.5 5731.7   -28529
## + yr_built       1      36.0 5744.3   -28482
## + condition      1       8.9 5771.4   -28380
## + month          1       1.3 5778.9   -28352
## <none>                       5780.2   -28349
## + year           1       0.3 5780.0   -28348
## + newDate        1       0.1 5780.1   -28348
## 
## Step:  AIC=-1352084
## lnprice ~ logPrice
## 
##                 Df  Sum of Sq        RSS      AIC
## + yr_renovated   1 1.2654e-26 1.2041e-23 -1352105
## + month          1 1.6944e-27 1.2052e-23 -1352085
## <none>                        1.2054e-23 -1352084
## + basement       1 8.2460e-28 1.2053e-23 -1352083
## + grade          1 6.4160e-28 1.2053e-23 -1352083
## + floors         1 4.4950e-28 1.2054e-23 -1352083
## + sqft_living15  1 3.3640e-28 1.2054e-23 -1352083
## + yr_built       1 2.7860e-28 1.2054e-23 -1352083
## + year           1 2.6510e-28 1.2054e-23 -1352082
## + condition      1 2.2680e-28 1.2054e-23 -1352082
## + bedrooms       1 1.6660e-28 1.2054e-23 -1352082
## + sqft_living    1 1.3310e-28 1.2054e-23 -1352082
## + sqft_above     1 7.7900e-29 1.2054e-23 -1352082
## + newDate        1 7.4000e-29 1.2054e-23 -1352082
## + sqft_lot       1 2.7100e-29 1.2054e-23 -1352082
## + sqft_lot15     1 2.6100e-29 1.2054e-23 -1352082
## + sqft_basement  1 1.7100e-29 1.2054e-23 -1352082
## + waterfront     1 9.4000e-30 1.2054e-23 -1352082
## + bathrooms      1 7.0000e-31 1.2054e-23 -1352082
## + floor_num      2 5.8790e-28 1.2053e-23 -1352081
## 
## Step:  AIC=-1352105
## lnprice ~ logPrice + yr_renovated
## 
##                 Df  Sum of Sq        RSS      AIC
## + month          1 1.5688e-27 1.2040e-23 -1352105
## <none>                        1.2041e-23 -1352105
## + basement       1 6.7072e-28 1.2041e-23 -1352104
## + floors         1 6.0823e-28 1.2041e-23 -1352104
## + sqft_living    1 2.4871e-28 1.2041e-23 -1352103
## + sqft_above     1 2.3891e-28 1.2041e-23 -1352103
## + grade          1 2.1872e-28 1.2041e-23 -1352103
## + year           1 1.8405e-28 1.2041e-23 -1352103
## + waterfront     1 1.3486e-28 1.2041e-23 -1352103
## + newDate        1 1.2826e-28 1.2041e-23 -1352103
## + bedrooms       1 1.0783e-28 1.2041e-23 -1352103
## + yr_built       1 1.0162e-28 1.2041e-23 -1352103
## + sqft_living15  1 5.9980e-29 1.2041e-23 -1352103
## + condition      1 5.9290e-29 1.2041e-23 -1352103
## + sqft_lot15     1 2.3550e-29 1.2041e-23 -1352103
## + sqft_lot       1 2.2920e-29 1.2041e-23 -1352103
## + bathrooms      1 7.2700e-30 1.2041e-23 -1352103
## + sqft_basement  1 1.2000e-31 1.2041e-23 -1352103
## + floor_num      2 6.1555e-28 1.2041e-23 -1352102
## 
## Step:  AIC=-1352105
## lnprice ~ logPrice + yr_renovated + month
## 
##                 Df  Sum of Sq        RSS      AIC
## <none>                        1.2040e-23 -1352105
## + year           1 7.8230e-28 1.2039e-23 -1352105
## + newDate        1 7.1358e-28 1.2039e-23 -1352105
## + basement       1 6.5309e-28 1.2039e-23 -1352105
## + floors         1 5.7058e-28 1.2039e-23 -1352104
## + grade          1 2.5098e-28 1.2040e-23 -1352104
## + sqft_living    1 2.1469e-28 1.2040e-23 -1352104
## + sqft_above     1 2.1318e-28 1.2040e-23 -1352104
## + waterfront     1 1.3995e-28 1.2040e-23 -1352104
## + bedrooms       1 1.1197e-28 1.2040e-23 -1352104
## + yr_built       1 1.0328e-28 1.2040e-23 -1352104
## + condition      1 7.4920e-29 1.2040e-23 -1352104
## + sqft_living15  1 6.9740e-29 1.2040e-23 -1352104
## + sqft_lot15     1 2.5270e-29 1.2040e-23 -1352104
## + sqft_lot       1 2.2440e-29 1.2040e-23 -1352104
## + bathrooms      1 3.9900e-30 1.2040e-23 -1352103
## + sqft_basement  1 0.0000e+00 1.2040e-23 -1352103
## + floor_num      2 5.7726e-28 1.2039e-23 -1352102
extractAIC(lnforward)
## [1]        4 -1352105
##Backward selection lnPrice
housing$lnprice=log(housing$price)
lnbackwardsub=subset(housing, select=-c(price, id, date, lat, long, view, zipcode))
lnbackwardmod=lm(lnprice~., data=lnbackwardsub)
lnbackward=step(lnbackwardmod,direction = 'backward')
## Start:  AIC=-1525351
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + sqft_basement + 
##     yr_built + yr_renovated + sqft_living15 + sqft_lot15 + newDate + 
##     month + year + logPrice + basement + floor_num
## 
## 
## Step:  AIC=-1525351
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     year + logPrice + basement + floor_num
## 
##                 Df Sum of Sq    RSS      AIC
## - year           1       0.0    0.0 -1606913
## - sqft_living15  1       0.0    0.0 -1549542
## - bedrooms       1       0.0    0.0 -1529790
## - basement       1       0.0    0.0 -1525352
## <none>                          0.0 -1525351
## - floor_num      2       0.0    0.0 -1525350
## - month          1       0.0    0.0 -1513624
## - yr_renovated   1       0.0    0.0 -1511361
## - bathrooms      1       0.0    0.0 -1497599
## - sqft_above     1       0.0    0.0 -1490842
## - floors         1       0.0    0.0 -1485155
## - sqft_living    1       0.0    0.0 -1467508
## - condition      1       0.0    0.0 -1462514
## - waterfront     1       0.0    0.0 -1462240
## - newDate        1       0.0    0.0 -1454058
## - sqft_lot       1       0.0    0.0 -1447859
## - sqft_lot15     1       0.0    0.0 -1445383
## - yr_built       1       0.0    0.0 -1438692
## - grade          1       0.0    0.0 -1433220
## - logPrice       1    2029.5 2029.5   -50863
## 
## Step:  AIC=-1606913
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     logPrice + basement + floor_num
## 
##                 Df Sum of Sq    RSS      AIC
## - floor_num      2       0.0    0.0 -1606917
## - basement       1       0.0    0.0 -1606915
## <none>                          0.0 -1606913
## - bedrooms       1       0.0    0.0 -1604456
## - condition      1       0.0    0.0 -1562371
## - month          1       0.0    0.0 -1522299
## - newDate        1       0.0    0.0 -1520704
## - floors         1       0.0    0.0 -1520041
## - sqft_living15  1       0.0    0.0 -1515562
## - sqft_living    1       0.0    0.0 -1513943
## - bathrooms      1       0.0    0.0 -1511221
## - grade          1       0.0    0.0 -1503346
## - sqft_lot15     1       0.0    0.0 -1485012
## - waterfront     1       0.0    0.0 -1473884
## - sqft_above     1       0.0    0.0 -1465350
## - yr_renovated   1       0.0    0.0 -1463546
## - yr_built       1       0.0    0.0 -1459359
## - sqft_lot       1       0.0    0.0 -1452524
## - logPrice       1    2030.9 2030.9   -50851
## 
## Step:  AIC=-1606917
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     logPrice + basement
## 
##                 Df Sum of Sq    RSS      AIC
## - basement       1       0.0    0.0 -1606919
## <none>                          0.0 -1606917
## - bedrooms       1       0.0    0.0 -1604459
## - condition      1       0.0    0.0 -1562373
## - month          1       0.0    0.0 -1522302
## - newDate        1       0.0    0.0 -1520706
## - floors         1       0.0    0.0 -1520043
## - sqft_living15  1       0.0    0.0 -1515564
## - sqft_living    1       0.0    0.0 -1513946
## - bathrooms      1       0.0    0.0 -1511223
## - grade          1       0.0    0.0 -1503349
## - sqft_lot15     1       0.0    0.0 -1485014
## - waterfront     1       0.0    0.0 -1473886
## - sqft_above     1       0.0    0.0 -1465353
## - yr_renovated   1       0.0    0.0 -1463548
## - yr_built       1       0.0    0.0 -1459362
## - sqft_lot       1       0.0    0.0 -1452526
## - logPrice       1    2037.5 2037.5   -50785
## 
## Step:  AIC=-1606919
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     logPrice
## 
##                 Df Sum of Sq    RSS      AIC
## <none>                          0.0 -1606919
## - bedrooms       1       0.0    0.0 -1604461
## - condition      1       0.0    0.0 -1562375
## - month          1       0.0    0.0 -1522304
## - newDate        1       0.0    0.0 -1520708
## - floors         1       0.0    0.0 -1520045
## - sqft_living15  1       0.0    0.0 -1515566
## - sqft_living    1       0.0    0.0 -1513947
## - bathrooms      1       0.0    0.0 -1511225
## - grade          1       0.0    0.0 -1503351
## - sqft_lot15     1       0.0    0.0 -1485016
## - waterfront     1       0.0    0.0 -1473888
## - sqft_above     1       0.0    0.0 -1465353
## - yr_renovated   1       0.0    0.0 -1463550
## - yr_built       1       0.0    0.0 -1459364
## - sqft_lot       1       0.0    0.0 -1452528
## - logPrice       1    2059.1 2059.1   -50560
extractAIC(lnbackward)
## [1]       17 -1606919
##Both direction selection lnPrice
housing$lnprice=log(housing$price)
lnbothsub=subset(housing, select=-c(price, id, date, lat, long, view, zipcode))
lnbothmod=lm(lnprice~., data=lnbothsub)
lnboth=step(lnbothmod,direction = 'both')
## Start:  AIC=-1525351
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + sqft_basement + 
##     yr_built + yr_renovated + sqft_living15 + sqft_lot15 + newDate + 
##     month + year + logPrice + basement + floor_num
## 
## 
## Step:  AIC=-1525351
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     year + logPrice + basement + floor_num
## 
##                 Df Sum of Sq    RSS      AIC
## - year           1       0.0    0.0 -1606913
## - sqft_living15  1       0.0    0.0 -1549542
## - bedrooms       1       0.0    0.0 -1529790
## - basement       1       0.0    0.0 -1525352
## <none>                          0.0 -1525351
## - floor_num      2       0.0    0.0 -1525350
## - month          1       0.0    0.0 -1513624
## - yr_renovated   1       0.0    0.0 -1511361
## - bathrooms      1       0.0    0.0 -1497599
## - sqft_above     1       0.0    0.0 -1490842
## - floors         1       0.0    0.0 -1485155
## - sqft_living    1       0.0    0.0 -1467508
## - condition      1       0.0    0.0 -1462514
## - waterfront     1       0.0    0.0 -1462240
## - newDate        1       0.0    0.0 -1454058
## - sqft_lot       1       0.0    0.0 -1447859
## - sqft_lot15     1       0.0    0.0 -1445383
## - yr_built       1       0.0    0.0 -1438692
## - grade          1       0.0    0.0 -1433220
## - logPrice       1    2029.5 2029.5   -50863
## 
## Step:  AIC=-1606913
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     logPrice + basement + floor_num
## 
##                 Df Sum of Sq    RSS      AIC
## - floor_num      2       0.0    0.0 -1606917
## - basement       1       0.0    0.0 -1606915
## + year           1       0.0    0.0 -1606915
## <none>                          0.0 -1606913
## - bedrooms       1       0.0    0.0 -1604456
## - condition      1       0.0    0.0 -1562371
## - month          1       0.0    0.0 -1522299
## - newDate        1       0.0    0.0 -1520704
## - floors         1       0.0    0.0 -1520041
## - sqft_living15  1       0.0    0.0 -1515562
## - sqft_living    1       0.0    0.0 -1513943
## - bathrooms      1       0.0    0.0 -1511221
## - grade          1       0.0    0.0 -1503346
## - sqft_lot15     1       0.0    0.0 -1485012
## - waterfront     1       0.0    0.0 -1473884
## - sqft_above     1       0.0    0.0 -1465350
## - yr_renovated   1       0.0    0.0 -1463546
## - yr_built       1       0.0    0.0 -1459359
## - sqft_lot       1       0.0    0.0 -1452524
## - logPrice       1    2030.9 2030.9   -50851
## 
## Step:  AIC=-1606917
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     logPrice + basement
## 
##                 Df Sum of Sq    RSS      AIC
## - basement       1       0.0    0.0 -1606919
## + year           1       0.0    0.0 -1606919
## <none>                          0.0 -1606917
## + floor_num      2       0.0    0.0 -1606913
## - bedrooms       1       0.0    0.0 -1604459
## - condition      1       0.0    0.0 -1562373
## - month          1       0.0    0.0 -1522302
## - newDate        1       0.0    0.0 -1520706
## - floors         1       0.0    0.0 -1520043
## - sqft_living15  1       0.0    0.0 -1515564
## - sqft_living    1       0.0    0.0 -1513946
## - bathrooms      1       0.0    0.0 -1511223
## - grade          1       0.0    0.0 -1503349
## - sqft_lot15     1       0.0    0.0 -1485014
## - waterfront     1       0.0    0.0 -1473886
## - sqft_above     1       0.0    0.0 -1465353
## - yr_renovated   1       0.0    0.0 -1463548
## - yr_built       1       0.0    0.0 -1459362
## - sqft_lot       1       0.0    0.0 -1452526
## - logPrice       1    2037.5 2037.5   -50785
## 
## Step:  AIC=-1606919
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     logPrice
## 
##                 Df Sum of Sq    RSS      AIC
## + year           1       0.0    0.0 -1606920
## <none>                          0.0 -1606919
## + basement       1       0.0    0.0 -1606917
## + floor_num      2       0.0    0.0 -1606915
## - bedrooms       1       0.0    0.0 -1604461
## - condition      1       0.0    0.0 -1562375
## - month          1       0.0    0.0 -1522304
## - newDate        1       0.0    0.0 -1520708
## - floors         1       0.0    0.0 -1520045
## - sqft_living15  1       0.0    0.0 -1515566
## - sqft_living    1       0.0    0.0 -1513947
## - bathrooms      1       0.0    0.0 -1511225
## - grade          1       0.0    0.0 -1503351
## - sqft_lot15     1       0.0    0.0 -1485016
## - waterfront     1       0.0    0.0 -1473888
## - sqft_above     1       0.0    0.0 -1465353
## - yr_renovated   1       0.0    0.0 -1463550
## - yr_built       1       0.0    0.0 -1459364
## - sqft_lot       1       0.0    0.0 -1452528
## - logPrice       1    2059.1 2059.1   -50560
## 
## Step:  AIC=-1606920
## lnprice ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + condition + grade + sqft_above + yr_built + 
##     yr_renovated + sqft_living15 + sqft_lot15 + newDate + month + 
##     logPrice + year
## 
##                 Df Sum of Sq    RSS      AIC
## <none>                          0.0 -1606920
## - year           1       0.0    0.0 -1606919
## + basement       1       0.0    0.0 -1606919
## + floor_num      2       0.0    0.0 -1606917
## - bedrooms       1       0.0    0.0 -1604460
## - condition      1       0.0    0.0 -1562374
## - month          1       0.0    0.0 -1522304
## - newDate        1       0.0    0.0 -1520706
## - floors         1       0.0    0.0 -1520044
## - sqft_living15  1       0.0    0.0 -1515565
## - sqft_living    1       0.0    0.0 -1513946
## - bathrooms      1       0.0    0.0 -1511224
## - grade          1       0.0    0.0 -1503350
## - sqft_lot15     1       0.0    0.0 -1485015
## - waterfront     1       0.0    0.0 -1473887
## - sqft_above     1       0.0    0.0 -1465352
## - yr_renovated   1       0.0    0.0 -1463549
## - yr_built       1       0.0    0.0 -1459363
## - sqft_lot       1       0.0    0.0 -1452527
## - logPrice       1    2057.6 2057.6   -50573
extractAIC(lnboth)
## [1]       18 -1606920

Interestingly, once again, all three methods - backward, forward, and stepwise - lead to the same result: 13 variables included (all except square feet of the basement), with an AIC of -50836.31. All the coefficients turn out to be significant and all of them are significant at the 5% significance level.

Since the log-linear model does not show any better fit than the linear model (nor did it reduce the number of variables needed), we decided to use the regular variable price as the dependent variable.

With these 13 variables, both the R2 and AIC value are improved compared to our intuitive starting model. Thus, as far as a “straight-forward” model goes, the variables selection methods help us create a baseline model.