This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
In HW2, we will do exploratory data analysis by using Housing_prices_data.csv data set.
library(tidyverse)
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 2.2.1 <U+221A> purrr 0.2.4
## <U+221A> tibble 1.4.2 <U+221A> dplyr 0.7.4
## <U+221A> tidyr 0.7.2 <U+221A> stringr 1.2.0
## <U+221A> readr 1.1.1 <U+221A> forcats 0.2.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data <- read.csv(
"/Users/zhengdongnanzi/Desktop/Columbia Spring 2018/GR 5058/2.15 Lec 5/Housing_prices_data.csv")
# Distribution of one of the continuous variables from Q1: GrLivArea
ggplot(data=data) +
geom_histogram(mapping = aes(x=GrLivArea),bins=30)
# Distribution of sales prices: SalePrice
ggplot(data=data) +
geom_histogram(mapping = aes(x=SalePrice),bins=30)
chooseCRANmirror(graphics=FALSE, ind=1)
knitr::opts_chunk$set(echo = TRUE)
subdata <- select(data,LotArea,GrLivArea,TotalBsmtSF,
SalePrice)
cor.matrix <- round(cor(subdata),2)
# Here is the corrleation matrix
cor.matrix
## LotArea GrLivArea TotalBsmtSF SalePrice
## LotArea 1.00 0.26 0.26 0.26
## GrLivArea 0.26 1.00 0.45 0.71
## TotalBsmtSF 0.26 0.45 1.00 0.61
## SalePrice 0.26 0.71 0.61 1.00
# Correlation Plot Visualization
install.packages(("corrplot"))
##
## The downloaded binary packages are in
## /var/folders/zp/3g72mswd05d6nsdjf4g88dzc0000gn/T//RtmptlTy9s/downloaded_packages
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor.matrix, type="upper",
order = "hclust", tl.col="black",
tl.srt=45)
The categorical variable I choose is OverallQual (Overall material and finish quality). The quality rating is from 1 to 10. I would expect that as the rating of the overall quality increases, the housing sale price would increase.
table(data$OverallQual)
##
## 1 2 3 4 5 6 7 8 9 10
## 2 3 20 116 397 374 319 168 43 18
library(ggplot2)
ggplot(data = data) +
geom_bar(mapping = aes(x = OverallQual))
boxplot(SalePrice~OverallQual,data=data, main="Housing Data",
xlab="Overall Quality", ylab="Sale Price")
From the boxplot above, we can see that there is an identifiable relationship between overall quality and housing prices. As the overall Quality increases from 1 to 10, the key statistical measures for the Sales Prices shown by the plot (median, 1st and 3rd quartiles) increase as well generally.
We also observed that there are overlaps for sales prices in the boxplots amongest most of the rating overall quality 1 to 10. It indicates that the overall quality is not the only variable that explains the sales price, even though the trend shows that the houses with higher rating quality tend to have the higher sales prices.
The categorical variable I find with only four categories is ExterQual (Exterior material quality). It is possible that the mean of the housing sale price might vary as the Exterior material quality varies.
data2 <- group_by(data,ExterQual)
summarise(data2, saleavg = mean(SalePrice, na.rm=TRUE))
## # A tibble: 4 x 2
## ExterQual saleavg
## <fct> <dbl>
## 1 Ex 367361
## 2 Fa 87985
## 3 Gd 231634
## 4 TA 144341
library(sm)
## Package 'sm', version 2.2-5.4: type help(sm) for summary information
# plot densities
sm.density.compare(data$SalePrice, data$ExterQual, xlab="Sale Price")
title(main="Sale Price Distribution by Exterior material quality")
# create value labels for a legend
qual.f <- factor(mtcars$cyl, levels= c("Ex","Fa","Gd","TA"),
labels = c("Level Ex","Level Fa", "Level Gd","Level Ta" ))
# add legend
colfill<-c(2:(2+length(levels(qual.f))))
legend("topright", levels(qual.f), fill=colfill)
From the plot above, we can see that there is an identifiable relationship between Exterior material quality and housing prices.
Presumming that the labels refer to Exterior material quality ranging from Fair to Excellent, we observed that the mean of the sales prices increases as the quality increases. We also observed that there are overlaps for the sales price in the density plot amongest all categories (variances of sales price between different categories are different, the variance is biggest in Ex group and smallest in Fa group). It indicates that the Exterior material quality is not the only variable that explains the sales price, even though on average it impacts the average of the sales prices.
From the correlation plot, we know that GrLivArea (Above grade (ground) living area square feet) and TotalBsmtSF (Total square feet of basement area) have stronger positive linear relationships with sales price. The LotArea (Lot size in square feet) has a less strong positive linear relationship with sales price.
In general, each of these three variables have a positive linear relationship with sales price: With a larger living area, a larger basement area or a larger lot size, the house tends to have higher sales price.
From the boxplots of sales price for each category of OverallQual above, we can see that there is an identifiable relationship between overall quality and housing prices. As the overall Quality increases from 1 to 10, the key statistical measures for the Sales Prices shown by the plot (median, 1st and 3rd quartiles) increase as well generally.
However, there are overlaps for sales prices in the boxplots amongest most of the rating overall quality 1 to 10. It indicates that the overall quality is not the only variable that explains the sales price, even though the trend shows that the houses with higher rating quality tend to have the higher sales prices.
From the density plots of sales price for each category of ExterQual above, we can see that there is an identifiable relationship between Exterior material quality and housing prices.
Presumming that the labels refer to Exterior material quality ranging from Fair, TA, Good to Excellent, we observed that the mean of the sales prices increases as the quality increases.
We also observed that there are overlaps in the density plot amongest all categories. It indicates that the Exterior material quality is not the only variable that explains the sales price, even though on average it impacts the average of the sales prices.