King County is a county in the US state of Washington with a population of around 2,200,000 according to the 2018 census estimate, making it one of the most populous counties in the region, with two-thirds of the population living in suburban areas. It was also the 86th highest-income county in 2011.
Housing in King County has several notable features that affect unit prices. These features include:
This section predicts the price of a housing unit from the main features mentioned earlier using a linear regression model. The dataset can be obtained from this Kaggle Database.
Using the dataset obtained from Kaggle, we first take a glimpse at the composition of our data.
Observations: 21,613
Variables: 21
$ date <fct> 20141013T000000, 20141209T000000, 20150225T00000...
$ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, ...
$ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, ...
$ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000,...
$ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, ...
$ bathrooms <int> 1, 2, 1, 3, 2, 4, 2, 1, 1, 2, 2, 1, 1, 1, 2, 3, ...
$ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1...
$ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 971...
$ floors <int> 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, ...
$ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, ...
$ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, ...
$ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9,...
$ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1...
$ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300...
$ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003,...
$ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47....
$ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045...
$ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, ...
$ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711...
The dataset includes 21,613 observations with 21 variables to analyze, but some of these variables do not correspond to our main features. Intuitively, we can drop several variables such as id, view, zipcode, lat, and long, because we assume they have no bearing on the features considered here.
We also adjust the dataset by adding two new variables: yr_gap, the age of the house (the number of years between the year it was built and the year it was priced), and renov, which tells whether the house has ever been renovated, based on its renovation year. Finally, we convert waterfront, floors, condition, grade, and renov to the factor data type.
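The adjustments described above could be sketched roughly as follows. This is a minimal base R illustration, not the original code: the data frame `raw` is a hypothetical two-row stand-in built from values shown in the glimpse, only a subset of the columns is included, and the date is parsed with `as.Date()` as an assumption (the original output shows a date-time column, so a different parser was probably used).

```r
# Hypothetical two-row stand-in for the Kaggle data (values from the glimpse).
raw <- data.frame(
  date         = c("20141013T000000", "20141209T000000"),
  id           = c(7129300520, 6414100192),
  yr_built     = c(1955L, 1951L),
  yr_renovated = c(0L, 1991L),
  floors       = c(1L, 2L),
  waterfront   = c(0L, 0L),
  condition    = c(3L, 3L),
  grade        = c(7L, 7L),
  price        = c(221900, 538000),
  stringsAsFactors = FALSE
)

house <- raw
# Drop identifier-like columns that are present in this toy frame.
house <- house[, setdiff(names(house), c("id", "view", "zipcode", "lat", "long"))]
# Keep only the YYYYMMDD part of the time stamp and parse it as a date.
house$date <- as.Date(substr(as.character(house$date), 1, 8), format = "%Y%m%d")
# House age at the time of sale: sale year minus build year.
house$yr_gap <- as.integer(format(house$date, "%Y")) - house$yr_built
# 1 if a renovation year is recorded, 0 otherwise.
house$renov <- factor(as.integer(house$yr_renovated > 0))
# Treat the discrete ratings as categorical variables.
for (col in c("waterfront", "floors", "condition", "grade")) {
  house[[col]] <- factor(house[[col]])
}
str(house)
```

For these two rows the computed yr_gap values are 59 and 63, which agree with the values shown in the adjusted dataset.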
Here is what our dataset looks like after these changes:
Observations: 21,613
Variables: 18
$ date <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09,...
$ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, ...
$ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, ...
$ bathrooms <int> 1, 2, 1, 3, 2, 4, 2, 1, 1, 2, 2, 1, 1, 1, 2, 3, ...
$ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1...
$ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 971...
$ floors <fct> 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, ...
$ waterfront <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ condition <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, ...
$ grade <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9,...
$ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1...
$ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300...
$ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, ...
$ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711...
$ yr_gap <int> 59, 63, 82, 49, 28, 13, 19, 52, 55, 12, 50, 72, ...
$ renov <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000,...
This section examines the correlation between each variable and the price, as shown in the correlation figure below.
Figure: The Correlation of Variables and Price
From the figure, we can infer that sqft_living15, sqft_above, sqft_living, and bathrooms have the strongest correlation with price, each with a correlation value above 0.5. Unfortunately, the figure only covers the variables that are not factors. Still, it tells us that the price is likely to be strongly affected by the variables mentioned above.
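The reason the factor variables are missing from the figure is that `cor()` only works on numeric columns. A minimal sketch of how the correlations with price might be computed is shown here; the data frame `toy` is synthetic and purely illustrative, so its numbers say nothing about the real dataset.

```r
# Synthetic stand-in data; column names mimic the real dataset.
set.seed(42)
n <- 100
toy <- data.frame(
  sqft_living = runif(n, 500, 4000),
  bathrooms   = sample(1:4, n, replace = TRUE),
  grade       = factor(sample(5:9, n, replace = TRUE))
)
toy$price <- 100 * toy$sqft_living + 10000 * toy$bathrooms +
  rnorm(n, sd = 50000)

# cor() needs numeric input, so factor columns are filtered out first.
num_cols <- sapply(toy, is.numeric)
cor_price <- cor(toy[, num_cols])[, "price"]

# Strongest correlations with price first.
sort(cor_price, decreasing = TRUE)
```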
# Fix the seed so the train/test split is reproducible.
set.seed(100)
# Sample 80% of the row indices for the training set.
idx <- sample(nrow(house.price),nrow(house.price)*0.8)
# Drop the first three columns (date, yr_built, yr_renovated) in both sets.
train.hp <- house.price[idx,-(1:3)]
test.hp <- house.price[-idx,-(1:3)]

In linear regression, there are two stepwise ways to arrive at a regression model. The forward step method starts with no predictors and adds variables one at a time until we reach an optimum model. The backward step method is similar, but instead of adding variables it starts with all of them and removes variables until it reaches an optimum model. In this section we cover both methods, then compute the RMSE of each model and decide which one to use.
This part creates the regression model using the forward step method.
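A forward step search can be sketched with the `step()` function from R's stats package. The example here is an assumption about how such a search is typically set up, run on a synthetic data frame `toy` rather than on train.hp; the names `model.none`, `model.all`, and `model.forward` are illustrative.

```r
# Synthetic stand-in data with one deliberately useless predictor.
set.seed(100)
n <- 200
toy <- data.frame(
  sqft_living = runif(n, 500, 4000),
  bathrooms   = sample(1:4, n, replace = TRUE),
  noise       = rnorm(n)
)
toy$price <- 50000 + 150 * toy$sqft_living + 20000 * toy$bathrooms +
  rnorm(n, sd = 30000)

# Smallest and largest models that the search may consider.
model.none <- lm(price ~ 1, data = toy)   # intercept only
model.all  <- lm(price ~ ., data = toy)   # every predictor

# Start from the empty model and add one variable at a time,
# keeping an addition only if it lowers the AIC.
model.forward <- step(
  model.none,
  scope = list(lower = formula(model.none), upper = formula(model.all)),
  direction = "forward",
  trace = 0
)
formula(model.forward)
```

Setting `trace = 0` only hides the step-by-step log; removing it prints every candidate model that was tried.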
From the forward step method, we now know which variables affect the price of each housing unit. The equation below summarizes the resulting model:
\(\text{price} = -1.2483 + (-1.1997)\,\text{grade3} + (-8.0184)\,\text{grade4} + (-1.2349)\,\text{grade5} + (-7.5252)\,\text{grade6} + (9.1637)\,\text{grade7} + (9.5285)\,\text{grade8} + (2.3867)\,\text{grade9} + (4.1883)\,\text{grade10} + (6.7310)\,\text{grade11} + (1.1344)\,\text{grade12} + (2.2454)\,\text{grade13} + (1.6135)\,\text{sqft\_living} + (3.3130)\,\text{yr\_gap} + (7.0132)\,\text{waterfront1} + (5.0046)\,\text{bathrooms} + (2.0726)\,\text{floors2} + (1.5925)\,\text{floors3} + (-2.7594)\,\text{bedrooms} + (4.5468)\,\text{sqft\_living15} + (3.1705)\,\text{condition2} + (5.6217)\,\text{condition3} + (7.4934)\,\text{condition4} + (1.1653)\,\text{condition5} + (-4.9017)\,\text{sqft\_lot15} + (-3.9234)\,\text{sqft\_above} + (4.5722)\,\text{renov1}\)
From the model created, we can see that the price is heavily affected by sqft_living15, yr_gap, waterfront1, bathrooms, and renov. Although the factor variables grade and floors also affect the price, the equation above shows that only some of their levels contribute to the price change.
This part of the model creation section focuses on creating the regression model using the backward step method.
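A backward step search can be sketched the same way with `step()`, this time starting from the full model. As before, this is an illustrative assumption run on a synthetic data frame `toy`, not on train.hp, and the object names are hypothetical.

```r
# Synthetic stand-in data with one deliberately useless predictor.
set.seed(100)
n <- 200
toy <- data.frame(
  sqft_living = runif(n, 500, 4000),
  bathrooms   = sample(1:4, n, replace = TRUE),
  noise       = rnorm(n)
)
toy$price <- 50000 + 150 * toy$sqft_living + 20000 * toy$bathrooms +
  rnorm(n, sd = 30000)

# Start from the model with every predictor ...
model.all <- lm(price ~ ., data = toy)

# ... and drop one variable at a time while the AIC keeps improving.
model.backward <- step(model.all, direction = "backward", trace = 0)
formula(model.backward)
```

No scope is needed here: when it is omitted, `step()` treats the starting model as the upper limit and searches downward from it.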
From the backward step method, we now know which variables affect the price of each housing unit. The equation below summarizes the resulting model:
\(\text{price} = -1.2483 + (-2.7594)\,\text{bedrooms} + (5.0046)\,\text{bathrooms} + (1.6135)\,\text{sqft\_living} + (2.0726)\,\text{floors2} + (1.5925)\,\text{floors3} + (7.0132)\,\text{waterfront1} + (3.1705)\,\text{condition2} + (5.6217)\,\text{condition3} + (7.4934)\,\text{condition4} + (1.1653)\,\text{condition5} + (-1.1997)\,\text{grade3} + (-8.0184)\,\text{grade4} + (-1.2349)\,\text{grade5} + (-7.5252)\,\text{grade6} + (9.1637)\,\text{grade7} + (9.5285)\,\text{grade8} + (2.3867)\,\text{grade9} + (4.1883)\,\text{grade10} + (6.7310)\,\text{grade11} + (1.1344)\,\text{grade12} + (2.2454)\,\text{grade13} + (-3.9234)\,\text{sqft\_above} + (4.5468)\,\text{sqft\_living15} + (-4.9017)\,\text{sqft\_lot15} + (3.3130)\,\text{yr\_gap} + (4.5722)\,\text{renov1}\)
The backward step model has exactly the same variables and coefficient values as the forward step model created earlier. With the same variables and values affecting the price, we can hypothesize that both methods will give us the same predicted values and RMSE, so we can choose whichever of the two methods we are more comfortable with.
After obtaining our regression models from the step methods above, we now predict the price using each model.
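The prediction step can be sketched with `predict()` on held-out rows. This is a minimal illustration on a synthetic data frame `toy`, with a hand-written formula standing in for the model chosen by the step methods; it is not the original code.

```r
# Synthetic stand-in data.
set.seed(100)
n <- 200
toy <- data.frame(
  sqft_living = runif(n, 500, 4000),
  bathrooms   = sample(1:4, n, replace = TRUE)
)
toy$price <- 50000 + 150 * toy$sqft_living + 20000 * toy$bathrooms +
  rnorm(n, sd = 30000)

# 80/20 split, mirroring the split used for the real data.
idx   <- sample(nrow(toy), nrow(toy) * 0.8)
train <- toy[idx, ]
test  <- toy[-idx, ]

# Fit on the training rows only, then predict the unseen test rows.
model <- lm(price ~ sqft_living + bathrooms, data = train)
test$pred <- predict(model, newdata = test)

# Actual and predicted price side by side.
head(test[, c("price", "pred")])
```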
After predicting the housing prices, we compare the actual prices with our predicted prices. Below is the comparison between the prices of King County housing.

As hypothesized before, the forward and backward step methods give the same predicted prices. This strengthens our conclusion that the RMSE (root mean square error) will be the same, and that the user can comfortably choose either step method to create the regression model.
Shown below is the RMSE obtained from the prediction models. Since both regressions produce the same model, both give the same predicted values and RMSE.

The variables that most heavily affect the price of each house in King County are bedrooms, bathrooms, sqft_living, floors, waterfront, condition, grade, sqft_above, sqft_living15, sqft_lot15, yr_gap, and renov, with both models giving an RMSE of 204468.9.
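For reference, the RMSE is the square root of the mean squared difference between the actual and predicted prices. A minimal sketch of the calculation is given here; the helper `rmse()` and the three example prices are hypothetical and are not taken from the dataset.

```r
# Root mean square error between two numeric vectors of equal length.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up prices for illustration only.
actual    <- c(200000, 350000, 500000)
predicted <- c(210000, 330000, 500000)

rmse(actual, predicted)
```

Because the RMSE is expressed in the same unit as the price, it can be read as a typical size of the prediction error in dollars.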