This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

In HW2, we will do exploratory data analysis by using Housing_prices_data.csv data set.

Read the data first

library(tidyverse)
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 2.2.1     <U+221A> purrr   0.2.4
## <U+221A> tibble  1.4.2     <U+221A> dplyr   0.7.4
## <U+221A> tidyr   0.7.2     <U+221A> stringr 1.2.0
## <U+221A> readr   1.1.1     <U+221A> forcats 0.2.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data <- read.csv(
  "/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/2.15 Lec 5/Housing_prices_data.csv")

2) Generate two plots that examine the distribution of the data for one of the continuous variables from Q1 and for the sales prices. One plot should visualize the distribution of the continuous explanatory variable and a separate plot should visualize the distribution of the variable indicating sales prices.

# Distribution of one of the continuous variables from Q1: GrLivArea 
ggplot(data=data) +
  geom_histogram(mapping = aes(x=GrLivArea),bins=30)

# Distribution of sales prices: SalePrice
ggplot(data=data) +
  geom_histogram(mapping = aes(x=SalePrice),bins=30)

3) Use a correlation plot to present a matrix of correlations for each of the three variables in question one and sales prices.

chooseCRANmirror(graphics=FALSE, ind=1)
knitr::opts_chunk$set(echo = TRUE)

subdata <- select(data,LotArea,GrLivArea,TotalBsmtSF,
               SalePrice)

cor.matrix <- round(cor(subdata),2)
# Here is the corrleation matrix
cor.matrix
##             LotArea GrLivArea TotalBsmtSF SalePrice
## LotArea        1.00      0.26        0.26      0.26
## GrLivArea      0.26      1.00        0.45      0.71
## TotalBsmtSF    0.26      0.45        1.00      0.61
## SalePrice      0.26      0.71        0.61      1.00
# Correlation Plot Visualization 
install.packages(("corrplot"))
## 
## The downloaded binary packages are in
##  /var/folders/zp/3g72mswd05d6nsdjf4g88dzc0000gn/T//RtmptlTy9s/downloaded_packages
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor.matrix, type="upper",
         order = "hclust", tl.col="black",
         tl.srt=45)

4)Find a categorical variable in the data that you think might explain sales prices.

A) Generate a table that presents the frequency of this categorical variable in the full dataset.

Answer:

The categorical variable I choose is OverallQual (Overall material and finish quality). The quality rating is from 1 to 10. I would expect that as the rating of the overall quality increases, the housing sale price would increase.

table(data$OverallQual)
## 
##   1   2   3   4   5   6   7   8   9  10 
##   2   3  20 116 397 374 319 168  43  18

B) Generate a bar plot to visualize this distribution.

library(ggplot2)
ggplot(data = data) +
  geom_bar(mapping = aes(x = OverallQual))

C) Create a single visual with boxplots of sales price for each category of this categorical variable. Is there an identifiable relationship between this variable and housing prices? If so, describe it.

boxplot(SalePrice~OverallQual,data=data, main="Housing Data", 
        xlab="Overall Quality", ylab="Sale Price")

Answer:

From the boxplot above, we can see that there is an identifiable relationship between overall quality and housing prices. As the overall Quality increases from 1 to 10, the key statistical measures for the Sales Prices shown by the plot (median, 1st and 3rd quartiles) increase as well generally.

We also observed that there are overlaps for sales prices in the boxplots amongest most of the rating overall quality 1 to 10. It indicates that the overall quality is not the only variable that explains the sales price, even though the trend shows that the houses with higher rating quality tend to have the higher sales prices.

5) Find a categorical variable in the data with four or fewer categories that you think might explain sales prices.

A) Generate a mean of the sales price variable (variable name is “SalePrice”) for each value of this categorical variable.

Answer:

The categorical variable I find with only four categories is ExterQual (Exterior material quality). It is possible that the mean of the housing sale price might vary as the Exterior material quality varies.

data2 <- group_by(data,ExterQual)
summarise(data2, saleavg = mean(SalePrice, na.rm=TRUE))
## # A tibble: 4 x 2
##   ExterQual saleavg
##   <fct>       <dbl>
## 1 Ex         367361
## 2 Fa          87985
## 3 Gd         231634
## 4 TA         144341

B) Create density plots of sales price for each category of your categorical variable and visualize them in a single plot (see example from “sm” package in EDA code from R script on course site). Is there an identifiable relationship between this variable and housing prices? If so, describe it.

library(sm)
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
# plot densities 
sm.density.compare(data$SalePrice, data$ExterQual, xlab="Sale Price")
title(main="Sale Price Distribution by Exterior material quality")

# create value labels for a legend
qual.f <- factor(mtcars$cyl, levels= c("Ex","Fa","Gd","TA"),
                labels = c("Level Ex","Level Fa", "Level Gd","Level Ta" )) 

# add legend 
colfill<-c(2:(2+length(levels(qual.f)))) 
legend("topright", levels(qual.f), fill=colfill)

Answer:

From the plot above, we can see that there is an identifiable relationship between Exterior material quality and housing prices.

Presumming that the labels refer to Exterior material quality ranging from Fair to Excellent, we observed that the mean of the sales prices increases as the quality increases. We also observed that there are overlaps for the sales price in the density plot amongest all categories (variances of sales price between different categories are different, the variance is biggest in Ex group and smallest in Fa group). It indicates that the Exterior material quality is not the only variable that explains the sales price, even though on average it impacts the average of the sales prices.

6) You have now explored the distributions of multiple continuous and categorical variables that may relate to the sales price of homes. Summarize what you learned about the relationships between these explanatory variables and the sales price of homes from your exploratory analysis.

Answer:

1) For the three continuous variables LotArea, TotalBsmtSF and GrLivArea,the correlation coefficient indicates the strength of a linear relationship between each of them and the sale price variable:

From the correlation plot, we know that GrLivArea (Above grade (ground) living area square feet) and TotalBsmtSF (Total square feet of basement area) have stronger positive linear relationships with sales price. The LotArea (Lot size in square feet) has a less strong positive linear relationship with sales price.

In general, each of these three variables have a positive linear relationship with sales price: With a larger living area, a larger basement area or a larger lot size, the house tends to have higher sales price.

2) For the categorical variable OverallQual (Overall material and finish quality):

From the boxplots of sales price for each category of OverallQual above, we can see that there is an identifiable relationship between overall quality and housing prices. As the overall Quality increases from 1 to 10, the key statistical measures for the Sales Prices shown by the plot (median, 1st and 3rd quartiles) increase as well generally.

However, there are overlaps for sales prices in the boxplots amongest most of the rating overall quality 1 to 10. It indicates that the overall quality is not the only variable that explains the sales price, even though the trend shows that the houses with higher rating quality tend to have the higher sales prices.

3) For the categorical variable ExterQual (Exterior material quality):

From the density plots of sales price for each category of ExterQual above, we can see that there is an identifiable relationship between Exterior material quality and housing prices.

Presumming that the labels refer to Exterior material quality ranging from Fair, TA, Good to Excellent, we observed that the mean of the sales prices increases as the quality increases.

We also observed that there are overlaps in the density plot amongest all categories. It indicates that the Exterior material quality is not the only variable that explains the sales price, even though on average it impacts the average of the sales prices.