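library(corrplot)  # corrplot()
library(dplyr)     # group_by(), summarise()
library(sm)        # sm.density.compare()
path<-'C:/Users/nfabius/Desktop/data mining/'
file_name<-"Housing_prices_data.csv"
Data<-read.csv(paste(path,file_name,sep=""))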
Generate summary statistics for 3 continuous variables that you think may be related to the sales prices for homes in the data (variable for sales price is called “SalePrice”).
LotArea: Lot size in square feet
summary(Data$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1300    7554    9478   10517   11602  215245
LotFrontage: Linear feet of street connected to property
summary(Data$LotFrontage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   21.00   59.00   69.00   70.05   80.00  313.00     259
GrLivArea: Above grade (ground) living area square feet
summary(Data$GrLivArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1130    1464    1515    1777    5642
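The three summaries above can also be produced in one call, together with a count of missing values per variable. This is a minimal sketch that assumes a small made-up data frame `toy` standing in for `Data`; the values are invented for illustration and are not taken from the housing file.

```r
# Hypothetical stand-in for Data; values are invented for illustration.
toy <- data.frame(
  LotArea     = c(8000, 9500, 12000, 7000, 15000),
  LotFrontage = c(60, 75, NA, 55, 90),
  GrLivArea   = c(1500, 1200, 1800, 1000, 2300)
)

vars <- c("LotArea", "LotFrontage", "GrLivArea")

# summary() on a data frame reports every column, including any NA count.
print(summary(toy[, vars]))

# Count missing values per column to see which variables are incomplete.
na_counts <- colSums(is.na(toy[, vars]))
print(na_counts)
```

With the real data, replacing `toy` by `Data` should flag `LotFrontage` as the only one of the three variables with missing values, matching the `NA's` column in its summary.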
Generate two plots that examine the distribution of the data for one of the continuous variables from question #1 and for the sales prices. One plot should visualize the distribution of the continuous explanatory variable and a separate plot should visualize the distribution of the variable indicating sales prices.
# The y-axis of a kernel density plot is a density, not a frequency
plot(density(Data$LotArea), main="Lot size in square feet", ylab="Density")
polygon(density(Data$LotArea), col="red")
plot(density(Data$SalePrice), main="Sale price", ylab="Density")
polygon(density(Data$SalePrice), col="red")
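The vertical axis of a kernel density plot is a density rather than a count: the area under the curve is approximately 1. The sketch below checks this numerically. It assumes simulated right-skewed values (`x` is generated with `rlnorm`, not read from the housing data).

```r
# Simulated right-skewed values; an assumption for illustration only.
set.seed(1)
x <- rlnorm(500, meanlog = 9, sdlog = 0.5)

# Kernel density estimate on a regular grid of x values.
d <- density(x)

# Trapezoid-rule approximation of the area under the estimated curve.
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(area)
```

Because the area is fixed near 1, the height of the curve depends on the scale of the variable, which is why the density values for a variable measured in dollars or square feet are very small numbers.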
Use a correlation plot to present a matrix of correlations for each of the three variables in question one and sales prices.
Note: rows with missing values are dropped before computing the correlations (LotFrontage has 259 NAs).
Data2<-data.frame("SalePrice" = Data$SalePrice,"LotFrontage" = Data$LotFrontage,"LotArea" = Data$LotArea,"GrLivArea" = Data$GrLivArea)
M<-cor(Data2,use ="complete.obs")
head(round(M,2))
##             SalePrice LotFrontage LotArea GrLivArea
## SalePrice        1.00        0.35    0.31      0.70
## LotFrontage      0.35        1.00    0.43      0.40
## LotArea          0.31        0.43    1.00      0.31
## GrLivArea        0.70        0.40    0.31      1.00
corrplot(M, method="circle", type="upper")
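The choice of `use = "complete.obs"` matters when only one variable has missing values. The sketch below compares it with `use = "pairwise.complete.obs"`. It assumes a made-up data frame `toy` with invented values, not the housing data.

```r
# Hypothetical stand-in for Data2; values are invented for illustration.
toy <- data.frame(
  SalePrice   = c(200, 180, 220, 140, 250, 300),
  LotFrontage = c(65, 80, NA, 60, 84, NA),
  GrLivArea   = c(1700, 1300, 1800, 1100, 2200, 2500)
)

# "complete.obs" drops every row that has an NA in any column.
m_complete <- cor(toy, use = "complete.obs")

# "pairwise.complete.obs" drops rows only for the pair being correlated.
m_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Number of rows that survive listwise deletion.
n_complete <- sum(complete.cases(toy))

print(n_complete)
print(round(m_complete, 2))
print(round(m_pairwise, 2))
```

With `"complete.obs"`, the SalePrice and GrLivArea correlation is computed only on rows where LotFrontage is also present, so it can differ from the value obtained on all rows. With `"pairwise.complete.obs"`, each pair uses every row available for that pair.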
Find a categorical variable in the data that you think might explain sales prices. A) Generate a table that presents the frequency of this categorical variable in the full dataset. B) Generate a bar plot to visualize this distribution. C) Create a single visual with boxplots of sales price for each category of this categorical variable. Is there an identifiable relationship between this variable and housing prices? If so, describe it.
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
count<-table(Data$TotRmsAbvGrd)
count
##
##   2   3   4   5   6   7   8   9  10  11  12  14
##   1  17  97 275 402 329 187  75  47  18  11   1
barplot(count, main ="Total rooms above grade (does not include bathrooms)" )
boxplot(SalePrice~TotRmsAbvGrd,data=Data, main="Price by number of rooms",
xlab="Number of Rooms", ylab="Price")
There appears to be a positive relationship between the number of rooms and price: houses with more rooms tend to sell for more.
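The pattern read off the boxplots can be backed up with the median price and the number of houses in each category. This is a minimal sketch, assuming a made-up data frame `toy` with invented values in place of `Data`.

```r
# Hypothetical stand-in for Data; values are invented for illustration.
toy <- data.frame(
  TotRmsAbvGrd = c(4, 4, 5, 5, 6, 6, 7, 7),
  SalePrice    = c(100, 120, 130, 150, 160, 180, 200, 240)
)

# Median sale price for each room count (the centre line of each box).
med_by_rooms <- tapply(toy$SalePrice, toy$TotRmsAbvGrd, median)

# Number of houses behind each median.
n_by_rooms <- table(toy$TotRmsAbvGrd)

print(med_by_rooms)
print(n_by_rooms)
```

Checking the group sizes is worthwhile here because the frequency table shows that the 2-room and 14-room categories each contain a single house, so their boxes summarize one observation.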
Find a categorical variable in the data with four or fewer categories that you think might explain sales prices. A) Generate a mean of the sales price variable (variable name is “SalePrice”) for each value of this categorical variable. B) Create density plots of sales price for each category of your categorical variable and visualize them in a single plot (see example from “sm” package in EDA code from R script on course site). Is there an identifiable relationship between this variable and housing prices? If so, describe it.
CentralAir: Central air conditioning
table(Data$CentralAir)
##
##    N    Y
##   95 1365
Data3<-group_by(Data, CentralAir)
summarise(Data3, priceavg=mean(SalePrice, na.rm=TRUE))
## # A tibble: 2 x 2
##   CentralAir priceavg
##   <fct>         <dbl>
## 1 N            105264
## 2 Y            186187
sm.density.compare(Data$SalePrice, Data$CentralAir, xlab="Sale Price")
title(main="Sale price by central air conditioning")
# create value labels for a legend
ac.f <- factor(Data$CentralAir, levels= c("N","Y"),
               labels = c("No","Yes"))
# add legend: one fill colour per level, starting at colour 2
colfill<-2:(1+length(levels(ac.f)))
legend("topright", levels(ac.f), fill=colfill)
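Houses with central air conditioning tend to sell for more: the mean sale price is about 186,187 with it versus 105,264 without it. Only 95 of the 1,460 houses lack central air, so the mean for the "N" group rests on a much smaller sample.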
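The group means can be cross-checked in base R without dplyr. This is a minimal sketch, assuming a made-up data frame `toy` with invented values; `gap` is a hypothetical name for the difference in means.

```r
# Hypothetical stand-in for Data; values are invented for illustration.
toy <- data.frame(
  CentralAir = factor(c("N", "N", "Y", "Y", "Y")),
  SalePrice  = c(100, 110, 180, 190, 200)
)

# Mean sale price per category, ignoring any missing prices.
means <- tapply(toy$SalePrice, toy$CentralAir, mean, na.rm = TRUE)

# Difference in mean price between houses with and without central air.
gap <- means[["Y"]] - means[["N"]]

print(means)
print(gap)
```

The same call with `Data` in place of `toy` should reproduce the `priceavg` column from the `summarise()` output.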
You have now explored the distributions of multiple continuous and categorical variables that may relate to the sales price of homes. Summarize what you learned about the relationships between these explanatory variables and the sales price of homes from your exploratory analysis.
Above-grade living area (GrLivArea) shows the strongest relationship with sale price, with a correlation of 0.70. Lot area (0.31) and lot frontage (0.35) are also positively correlated with price, but more weakly.
The number of rooms is also positively related to price, which makes sense because more rooms usually mean a larger living area; this is consistent with the living-area correlation above.
Size is not everything, though: the quality of the house and its amenities also affect price, as shown by the higher sale prices of houses with central air conditioning.