Link to HTML version published on RPubs.com: http://rpubs.com/lucian_lee/362414

3. Use a correlation plot to present a matrix of correlations for each of the three variables in question one and sales prices.

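# Correlation matrix of sale price and the three continuous variables,
# rounded to two decimals and drawn as an upper-triangle plot
data %>%
  select(SalePrice, LotArea, PoolArea, GarageArea) %>%
  cor() %>%
  round(2) %>%
  corrplot(type="upper", order="hclust",
           tl.col="black", tl.srt=45)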
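
The plot shows the correlations as colours, so it can help to read the numbers directly as well. The sketch below is a minimal base R illustration and not part of the original analysis: the data frame `toy` and its values are made up, and only the column names are taken from the code above.

```r
# Toy numeric data (made-up values) standing in for the four selected columns.
toy <- data.frame(
  SalePrice  = c(120000, 150000, 180000, 210000, 260000),
  LotArea    = c(8000, 7000, 9500, 9000, 12000),
  PoolArea   = c(0, 0, 0, 500, 0),
  GarageArea = c(250, 400, 450, 500, 700)
)

# Pairwise Pearson correlations, skipping missing values pair by pair.
cors <- cor(toy, use = "pairwise.complete.obs")

# Correlation of each explanatory variable with SalePrice, strongest first.
with_price <- sort(cors[-1, "SalePrice"], decreasing = TRUE)
print(round(with_price, 2))
```

On the real data, the same two steps would presumably be applied to the selected columns of `data` in place of `toy`.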

4. Find a categorical variable in the data that you think might explain sales prices. A) Generate a table that presents the frequency of this categorical variable in the full dataset. B) Generate a bar plot to visualize this distribution. C) Create a single visual with boxplots of sales price for each category of this categorical variable. Is there an identifiable relationship between this variable and housing prices? If so, describe it.

We will use the categorical variable HouseStyle, which is the style of dwelling.

table(data$HouseStyle)
## 
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer   SLvl 
##    154     14    726      8     11    445     37     65
barplot(table(data$HouseStyle), main="Bar plot of style of dwelling",
        xlab="Style of dwelling")

boxplot(SalePrice~HouseStyle, data=data, main="Sale prices for each style of dwelling", 
        xlab="Style of dwelling", ylab="Sale price ($)")

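The median sale price is higher for finished houses than for unfinished ones, and higher for 2-story houses than for 1-story houses. Several styles have few observations (only 8 houses are 2.5Fin, 11 are 2.5Unf and 14 are 1.5Unf), so their medians are less reliable than those of the larger groups.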
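
The medians read off the boxplots can also be computed directly. The following is a minimal sketch in base R on a made-up data frame (`toy` and its prices are hypothetical; only the column names and style codes come from the data above).

```r
# Toy data frame standing in for the housing data (prices are made up).
toy <- data.frame(
  HouseStyle = c("1Story", "1Story", "1Story", "2Story", "2Story",
                 "1.5Unf", "1.5Fin", "1.5Fin"),
  SalePrice  = c(150000, 160000, 170000, 190000, 210000,
                 110000, 140000, 150000)
)

# Split prices by dwelling style, take the median of each group,
# and sort the result from lowest to highest.
medians <- sort(sapply(split(toy$SalePrice, toy$HouseStyle), median))
print(medians)
```

Replacing `toy` with `data` would give the medians that the boxplots display, which makes the comparison between styles easier to state precisely.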

5. Find a categorical variable in the data with four or fewer categories that you think might explain sales prices. A) Generate a mean of the sales price variable (variable name is “SalePrice”) for each value of this categorical variable. B) Create density plots of sales price for each category of your categorical variable and visualize them in a single plot (see example from “sm” package in EDA code from R script on course site). Is there an identifiable relationship between this variable and housing prices? If so, describe it.

We will use the categorical variable KitchenQual, which is the kitchen quality.

data %>%
  group_by(KitchenQual) %>%
  summarise(meanprice = mean(SalePrice, na.rm = TRUE))
## # A tibble: 4 x 2
##   KitchenQual meanprice
##   <fct>           <dbl>
## 1 Ex             328555
## 2 Fa             105565
## 3 Gd             212116
## 4 TA             139963
sm.density.compare(data$SalePrice, data$KitchenQual, 
                   xlab="Sale price ($)")
title(main="Density plots of sales prices for each type of kitchen quality")

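The mean sale price rises with kitchen quality: about $105,565 for Fa (fair), $139,963 for TA (typical/average), $212,116 for Gd (good) and $328,555 for Ex (excellent). The density plots show the same ordering.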
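
The table above lists the categories alphabetically (Ex, Fa, Gd, TA), which hides the quality ordering. One possible way to make the trend explicit is to set the factor levels from worst to best before summarising. This is a sketch on made-up data: `toy` and its prices are hypothetical, and the level order Fa < TA < Gd < Ex is an assumption about what the codes mean.

```r
# Toy data (made-up prices); the level codes match those in the table above.
toy <- data.frame(
  KitchenQual = c("Ex", "Fa", "Gd", "TA", "Gd", "TA"),
  SalePrice   = c(320000, 100000, 210000, 140000, 220000, 150000)
)

# Assumed ordering, worst to best:
# Fa (fair), TA (typical/average), Gd (good), Ex (excellent).
toy$KitchenQual <- factor(toy$KitchenQual,
                          levels = c("Fa", "TA", "Gd", "Ex"),
                          ordered = TRUE)

# Group means now come out in quality order instead of alphabetical order.
mean_by_quality <- sapply(split(toy$SalePrice, toy$KitchenQual), mean)
print(mean_by_quality)

# TRUE when the mean price increases at every step up in quality.
increasing <- all(diff(mean_by_quality) > 0)
print(increasing)
```

With the levels ordered this way, tables and plot legends built from the factor follow the quality scale, so a monotone relationship is visible at a glance.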

6. You have now explored the distributions of multiple continuous and categorical variables that may relate to the sales price of homes. Summarize what you learned about the relationships between these explanatory variables and the sales price of homes from your exploratory analysis.

Garage area is strongly and positively correlated with sale price. Median sale prices are also higher for finished houses than for unfinished ones and for 2-story houses than for 1-story houses, and the mean sale price rises with kitchen quality.

Thanks for reading ;)