1. Data description

This section includes:

  • the table below

  • a brief motivation of my choice of the data

  • the code for setting the global chunk options and loading R packages

  • the code for importing/loading/tidying the data

Data Description
HouseData kc_house_data.csv
www.kaggle.com https://www.kaggle.com/swathiachath/kc-housesales-data
Excel The data was in CSV format, but I converted it to Excel format so that I could import it with read_excel().

I downloaded my data from www.kaggle.com, a well-known website that hosts a wide variety of datasets and that was also recommended in the assignment itself. The dataset is intended for predicting house sale prices in King County, Washington State, USA, from variables such as the number of bedrooms, the number of floors and the square footage of the home. It consists of historic data on houses sold between May 2014 and May 2015. I chose this dataset because I am fascinated by prediction problems, such as predicting house prices or stock prices. The dataset contains many interesting variables, and I expect to find correlations between several of them.
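The detour through Excel is not strictly required, because base R can read CSV files directly. The following is a minimal sketch under stated assumptions: it writes a tiny stand-in file with made-up values to a temporary path so that it runs on its own, and the object name `HD_csv` is hypothetical.

```r
# Toy stand-in for kc_house_data.csv (values are made up for illustration)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,price,bedrooms,sqft_living",
             "1,300000,3,1500",
             "2,550000,4,2400"), csv_path)

# read.csv() parses number-like columns as numeric, so no conversion step is needed
HD_csv <- read.csv(csv_path)
str(HD_csv)
```

With the real file, `csv_path` would be replaced by the path to kc_house_data.csv, assuming it sits in the working directory.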

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v stringr 1.4.0
## v tidyr   1.0.2     v forcats 0.4.0
## v readr   1.3.1
## -- Conflicts ---------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(grid)
library(readxl)
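# Import the Excel copy of kc_house_data.csv
HouseData <- read_excel("HouseData.xlsx")
HD <- HouseData
# Make sure the three variables of interest are stored as numeric
HD$price <- as.numeric(HD$price)
HD$bedrooms <- as.numeric(HD$bedrooms)
HD$sqft_living <- as.numeric(HD$sqft_living)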

2. Univariate statistics

This section includes

  • three univariate summary statistics (one for each variable)

  • the univariate plots in a 1-by-3 plot array (one for each variable)

  • a brief discussion of the univariate distributions

Univariate summary statistics for price
summary(HD$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   78000  322000  450000  540297  645000 7700000
sd(HD$price)
## [1] 367368.1
var(HD$price)
## [1] 134959350362
Univariate summary statistics for bedrooms
summary(HD$bedrooms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   3.000   3.373   4.000  33.000
sd(HD$bedrooms)
## [1] 0.9262989
var(HD$bedrooms)
## [1] 0.8580296
Univariate summary statistics for sqft_living
summary(HD$sqft_living)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     370    1430    1910    2080    2550   13540
sd(HD$sqft_living)
## [1] 918.1061
var(HD$sqft_living)
## [1] 842918.9
Three univariate plots
options(scipen=99999999) #this changes the price notation from scientific notation to standard notation
par(mfrow = c(1,3), pty = "s")
hist(as.numeric(HD$price),
     main = "Histogram of Price",
     xlab = "Price ($)",
     col = blues9,
     xlim=c(0,3000000),
     breaks = 55)

hist(as.numeric(HD$bedrooms),
     main = "Histogram of Bedrooms",
     xlab = "Number of bedrooms",
     col = blues9,
     xlim=c(0,8),
     breaks = 50)

hist(as.numeric(HD$sqft_living),
     main = "Histogram of Square Feet",
     xlab = "Square feet",
     col = blues9,
     xlim = c(0,10000),
     breaks = 40)

Discussion of the univariate distributions

For this assignment I chose the three variables price, bedrooms and sqft_living. I think the number of bedrooms and the square footage of a house determine the price of the house the most, besides location, location, location of course! I made three histograms with frequency on the y-axis and price, bedrooms and sqft_living on the x-axis.

The highest price in the dataset is $7.7 million, but since the vast majority of houses sold for less than $1 million, I limited the x-axis to $3 million. Otherwise the graph would be very difficult to read. The median price is $450,000 and the mean is $540,297. The third quartile is $645,000, so at least three quarters of the houses sold for well below $1 million, and the mean lying above the median points to a right-skewed distribution. On the histogram, the frequency rises very quickly at the low end and then, after a price of approximately $450,000, declines steadily.

The median number of bedrooms is 3 and the mean is 3.373. There is one house with 33 bedrooms, which is just absurd. Since most houses in this dataset have fewer than 8 bedrooms, I limited the x-axis of the histogram to 8 bedrooms so that the plot is easier to read.
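An x-axis limit only hides an extreme value; the row itself stays in the data. One way to inspect or set aside such a row is sketched below. The toy data frame holds made-up values, and the cut-off of 8 bedrooms is an illustrative assumption rather than part of the analysis.

```r
# Toy stand-in for HD (values are made up for illustration)
toy <- data.frame(
  price    = c(300000, 450000, 640000, 520000),
  bedrooms = c(2, 3, 33, 4)
)

# Inspect the rows with an implausible number of bedrooms
outliers <- subset(toy, bedrooms > 8)
print(outliers)

# Keep only the remaining rows for plotting
toy_trimmed <- subset(toy, bedrooms <= 8)
summary(toy_trimmed$bedrooms)
```

Looking at the flagged row before dropping it makes it easier to judge whether it is a data-entry error or a genuine observation.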

Last but not least, the square footage of the house. The dataset covers King County, Washington State, USA, which includes Seattle, a fairly big city, but also houses in the middle of nowhere. This may explain the very wide range of square footage: the smallest house has 370 square feet, while the biggest has 13,540 square feet. The median house has 1,910 square feet and the mean is 2,080. Again the frequency rises sharply at the low end of the histogram and then, from approximately 2,100 square feet onwards, declines steadily.
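The shape of the three distributions can also be read off the summary statistics. The sketch below is a rough check rather than a formal test: it re-enters the values printed in the summaries above and compares the mean with the median, and the standard deviation with the mean. The data frame name `stats_tbl` is hypothetical.

```r
# Summary values copied from the summary() and sd() output above
stats_tbl <- data.frame(
  variable = c("price", "bedrooms", "sqft_living"),
  median   = c(450000, 3, 1910),
  mean     = c(540297, 3.373, 2080),
  sd       = c(367368.1, 0.9262989, 918.1061)
)

# A mean above the median is a common sign of a right-skewed distribution
stats_tbl$mean_over_median <- stats_tbl$mean / stats_tbl$median

# Coefficient of variation: spread relative to the mean
stats_tbl$cv <- stats_tbl$sd / stats_tbl$mean

print(stats_tbl)
```

For all three variables the ratio of mean to median is above 1, which is consistent with the long right tails that the x-axis limits cut off.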

3. Multivariate statistics

This section includes

  • three multivariate summary statistics (joint distributions of two or three variables)

  • three multivariate plots in a 1-by-3 plot array (joint distributions of two or three variables)

  • a brief discussion of the multivariate distributions

Multivariate summary statistic of price and bedrooms

price<-as.numeric(HD$price)
bedrooms<-as.numeric(HD$bedrooms)
summary(lm(price ~ bedrooms))
## 
## Call:
## lm(formula = price ~ bedrooms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3528526  -203093   -66593   105407  6838014 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   127200       8979   14.17 <0.0000000000000002 ***
## bedrooms      122464       2567   47.71 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 349400 on 21595 degrees of freedom
## Multiple R-squared:  0.09535,    Adjusted R-squared:  0.09531 
## F-statistic:  2276 on 1 and 21595 DF,  p-value: < 0.00000000000000022
cor(price, bedrooms)
## [1] 0.3087875

Multivariate summary statistic of price and condition

condition<-as.numeric(HD$condition)
summary(lm(price ~ condition))
## 
## Call:
## lm(formula = price ~ condition)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -463313 -217313  -87313  102048 7147687 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   470868      13331  35.322 < 0.0000000000000002 ***
## condition      20361       3840   5.302          0.000000116 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 367100 on 21595 degrees of freedom
## Multiple R-squared:  0.0013, Adjusted R-squared:  0.001254 
## F-statistic: 28.11 on 1 and 21595 DF,  p-value: 0.0000001157
cor(price, condition)
## [1] 0.03605638

Multivariate summary statistic of price and sqft_living

sqft_living<-as.numeric(HD$sqft_living)
summary(lm(price ~ sqft_living))
## 
## Call:
## lm(formula = price ~ sqft_living)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1478896  -147583   -24131   106274  4359590 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -43988.892   4410.023  -9.975 <0.0000000000000002 ***
## sqft_living    280.863      1.939 144.819 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 261700 on 21595 degrees of freedom
## Multiple R-squared:  0.4927, Adjusted R-squared:  0.4927 
## F-statistic: 2.097e+04 on 1 and 21595 DF,  p-value: < 0.00000000000000022
cor(price, sqft_living)
## [1] 0.7019173

Three multivariate plots

options(scipen=99999999)
grid.arrange(
  ggplot(HD, mapping = aes(x = bedrooms, y = price)) +
    geom_point(color="royalblue") +
    ggtitle("Plot of price and bedrooms") +
    labs(y = "Price ($)", x = "Number of bedrooms") +
    geom_smooth(method = 'lm', color = 'red') +
    theme(aspect.ratio=1, plot.title = element_text(size=9)) +
    # Limit the x-axis to 11 bedrooms: the single house with 33 bedrooms made the plot
    # very difficult to read. xlim() drops that row, which triggers the "Removed 1 rows" warnings.
    xlim(0,11),
  
  ggplot(HD, mapping = aes(x = condition, y = price)) +
    geom_point(color="royalblue") +
    ggtitle("Plot of price and condition") +
    labs(y = "Price ($)", x = "Condition") +
    geom_smooth(method = 'lm', color = 'red') +
    theme(aspect.ratio=1, plot.title = element_text(size=9)),
  
  ggplot(HD, mapping = aes(x = sqft_living, y = price)) +
    geom_point(color="royalblue") +
    ggtitle("Plot of price and square feet") +
    labs(y = "Price ($)", x = "Square feet") +
    geom_smooth(method = 'lm', color = 'red') +
    theme(aspect.ratio=1, plot.title = element_text(size=8.5)),
  
  nrow=1)
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

Discussion of the multivariate distributions

When I first started this exercise, I added a third variable to each of the three ggplots by mapping it to the ‘colour =’ aesthetic (‘col =’ also works). This gives every data point a colour based on the value of that third variable. I added grade and condition, because I think these two variables influence the price a lot. However, doing so added a legend to the right of each plot to explain the colours. Unfortunately the three legends took up a lot of space, and because the plots have to be shown in a 1-by-3 plot array, the graphs became very small and looked messy. Even with a smaller legend the result still looked messy, so I decided not to use a third variable.
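For reference, a third variable can be mapped to colour while keeping the legend out of the way of the panel. The sketch below is one possible approach, not what was used for the plots in this report: it uses a toy data frame with made-up values and moves the legend underneath the panel with `legend.position = "bottom"`.

```r
library(ggplot2)

# Toy stand-in for HD (values are made up for illustration)
toy <- data.frame(
  sqft_living = c(900, 1400, 1900, 2500, 3200, 4100),
  price       = c(250000, 340000, 450000, 610000, 820000, 1200000),
  grade       = c(6, 7, 7, 8, 9, 10)
)

# Map grade to colour and place the legend below the plot instead of beside it
p <- ggplot(toy, aes(x = sqft_living, y = price, colour = factor(grade))) +
  geom_point() +
  labs(x = "Square feet", y = "Price ($)", colour = "Grade") +
  theme(aspect.ratio = 1, legend.position = "bottom")

print(p)
```

Whether this leaves enough room in a 1-by-3 array depends on the number of legend entries, so it would still need to be tried on the real data.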

All the plots have price on the y-axis, because I am interested in finding out what influences price positively or negatively. The first plot combines price and bedrooms. My guess was that the price of a house increases with the number of bedrooms. The plot confirms this: the red regression line slopes upward. The R-squared is 0.095, so approximately 9.5% of the variation in price is explained by bedrooms. This is not a lot, and the correlation between price and bedrooms is approximately 0.31. I would say that price and bedrooms are only weakly correlated in this dataset.

The second plot shows price against the condition of the house. Condition is rated on a scale of one to five, with one for houses in very bad condition and five for houses in excellent condition. My first guess was that a house in good condition sells for a higher price, but in the plot the regression line is pretty much horizontal. At first you might think condition has no real impact on price, but the plot shows that houses with a condition rating of three, four or five tend to reach higher prices than houses with a rating of one or two. So condition probably does influence price, but it is not a strong enough determinant on its own. I think other variables, such as number of bedrooms and square feet, should be included alongside condition to make a good guess of the price. The R-squared for price and condition is almost zero (0.0013) and the correlation is approximately 0.04. So I can conclude that there is very little to no linear correlation between price and condition. This is not surprising, since houses in both excellent and bad condition can be priced very high or very low.
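Because condition takes only five values, a summary per rating can complement the regression line. The following is a minimal sketch on a toy data frame (made-up values, not the King County data); with the real data, `toy` would be replaced by `HD`. The name `median_by_condition` is hypothetical.

```r
# Toy stand-in for HD (values are made up for illustration)
toy <- data.frame(
  condition = c(1, 1, 2, 3, 3, 4, 4, 5, 5),
  price     = c(150000, 210000, 260000, 400000, 520000,
                380000, 610000, 450000, 700000)
)

# Median price for each condition rating
median_by_condition <- aggregate(price ~ condition, data = toy, FUN = median)

# Number of houses behind each median
median_by_condition$n <- as.vector(table(toy$condition))

print(median_by_condition)
```

The median is used here instead of the mean because a few very expensive houses would otherwise dominate the comparison.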

The third plot shows price against sqft_living. I must add that square feet in this plot refers to the living space of the actual home, so excluding garden, garage, driveway, etc. This plot is very interesting: there seems to be a minimum and a maximum price for every size of home, and the spread between them widens as the home gets bigger, so the point cloud looks like a cone. I also added a regression line to this plot, and it slopes upward. When square feet increases, the price of the house increases as well, which sounds very logical. The R-squared for price and square feet is approximately 0.49 and the correlation is 0.70. Of the three variables I tested against price, square feet is the most strongly correlated with it.
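In a regression with a single predictor, R-squared equals the square of the Pearson correlation, which is why the two numbers reported for each pair of variables tell the same story. A quick check, re-entering the correlations from the cor() output above (the vector names `r` and `r_squared` are only for illustration):

```r
# Correlations with price, copied from the cor() output above
r <- c(bedrooms = 0.3087875, condition = 0.03605638, sqft_living = 0.7019173)

# Squaring them reproduces the Multiple R-squared of each lm() summary
r_squared <- round(r^2, 4)
print(r_squared)
```

The squared values agree with the Multiple R-squared lines of the three lm() summaries (0.09535, 0.0013 and 0.4927) up to rounding.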