
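library(corrplot)  # corrplot(), used in question 3
library(dplyr)     # group_by() and summarise(), used in question 5
library(sm)        # sm.density.compare(), used in question 5

path<-'C:/Users/nfabius/Desktop/data mining/'
file_name<-"Housing_prices_data.csv"
Data<-read.csv(paste(path,file_name,sep=""))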

1)

Generate summary statistics for 3 continuous variables that you think may be related to the sales prices for homes in the data (the variable for sales price is called “SalePrice”).

LotArea: Lot size in square feet

summary(Data$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245

LotFrontage: Linear feet of street connected to property

summary(Data$LotFrontage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   21.00   59.00   69.00   70.05   80.00  313.00     259

GrLivArea: Above grade (ground) living area square feet

summary(Data$GrLivArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642
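
The three calls above can also be collapsed into a single table that reports the missing-value count next to the other statistics. This is a minimal sketch: the helper name `describe_cols` and the toy data frame are hypothetical, included only so the block runs without the CSV file.

```r
# Hypothetical helper: a few summary statistics plus the NA count per column.
describe_cols <- function(df, cols) {
  out <- sapply(df[cols], function(x) {
    c(min    = min(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      mean   = mean(x, na.rm = TRUE),
      max    = max(x, na.rm = TRUE),
      n_na   = sum(is.na(x)))
  })
  t(out)  # one row per variable, one column per statistic
}

# Toy values (made up) standing in for the housing data
toy <- data.frame(LotArea     = c(8000, 9500, 12000, 7000),
                  LotFrontage = c(60, NA, 80, 70),
                  GrLivArea   = c(1200, 1500, 1800, 1100))

stats <- describe_cols(toy, c("LotArea", "LotFrontage", "GrLivArea"))
print(stats)
```

With the real data the call would be `describe_cols(Data, c("LotArea", "LotFrontage", "GrLivArea"))`.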

2)

Generate two plots that examine the distribution of the data for one of the continuous variables from question #1 and for the sales prices. One plot should visualize the distribution of the continuous explanatory variable and a separate plot should visualize the distribution of the variable indicating sales prices.

plot(density(Data$LotArea), main="Lot size in square feet", ylab="Density")
polygon(density(Data$LotArea), col="red")

plot(density(Data$SalePrice), main="Sale price", ylab="Density")
polygon(density(Data$SalePrice), col="red")
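
The summary in question 1 shows a maximum LotArea (215245) far above the third quartile (11602), so the density is dominated by a long right tail. One common option, which is not part of the original analysis, is to view the same variable on a log scale. The sketch below uses made-up values so that it runs on its own; with the real data, `x` would be `Data$LotArea`.

```r
# Toy right-skewed values (made up); replace with Data$LotArea for the real data
x <- c(1300, 7500, 9500, 10500, 11600, 215245)

d_raw <- density(x)         # density on the original scale
d_log <- density(log10(x))  # density after a log10 transform

op <- par(mfrow = c(1, 2))  # two panels side by side
plot(d_raw, main = "Lot area", xlab = "Square feet", ylab = "Density")
plot(d_log, main = "Lot area (log10)", xlab = "log10(square feet)", ylab = "Density")
par(op)                     # restore the previous layout
```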

3)

Use a correlation plot to present a matrix of correlations for each of the three variables in question one and sales prices.

Note: rows with missing values (the 259 observations where LotFrontage is NA) were dropped before computing the correlations.

Data2<-data.frame("SalePrice" = Data$SalePrice,"LotFrontage" = Data$LotFrontage,"LotArea" =  Data$LotArea,"GrLivArea" =   Data$GrLivArea)
M<-cor(Data2,use ="complete.obs")
head(round(M,2))
##             SalePrice LotFrontage LotArea GrLivArea
## SalePrice        1.00        0.35    0.31      0.70
## LotFrontage      0.35        1.00    0.43      0.40
## LotArea          0.31        0.43    1.00      0.31
## GrLivArea        0.70        0.40    0.31      1.00
corrplot(M, method="circle", type="upper")
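
`use = "complete.obs"` removes every row that has a missing value in any column, so the homes without a LotFrontage value are excluded from all of the correlations, including those that do not involve LotFrontage. The sketch below contrasts this with `use = "pairwise.complete.obs"`, which drops rows separately for each pair of variables. The data frame is made up for illustration.

```r
# Toy data (made up) with missing LotFrontage values
toy <- data.frame(
  SalePrice   = c(100, 150, 200, 250, 300, 120),
  LotFrontage = c(50, NA, 70, 80, NA, 55),
  GrLivArea   = c(900, 1300, 1500, 2100, 2400, 1000)
)

M_complete <- cor(toy, use = "complete.obs")           # rows with any NA removed
M_pairwise <- cor(toy, use = "pairwise.complete.obs")  # NA handled per pair

n_complete <- sum(complete.cases(toy))  # rows that survive "complete.obs"

print(round(M_complete, 2))
print(round(M_pairwise, 2))
```

Under the pairwise option the SalePrice and GrLivArea correlation uses all rows, while the correlations involving LotFrontage use only the rows where it is observed.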

4)

Find a categorical variable in the data that you think might explain sales prices. A) Generate a table that presents the frequency of this categorical variable in the full dataset. B) Generate a bar plot to visualize this distribution. C) Create a single visual with boxplots of sales price for each category of this categorical variable. Is there an identifiable relationship between this variable and housing prices? If so, describe it.

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

count<-table(Data$TotRmsAbvGrd)
count
## 
##   2   3   4   5   6   7   8   9  10  11  12  14 
##   1  17  97 275 402 329 187  75  47  18  11   1
barplot(count, main ="Total rooms above grade (does not include bathrooms)" )

boxplot(SalePrice~TotRmsAbvGrd,data=Data, main="Price by number of rooms", 
          xlab="Number of Rooms", ylab="Price")

There appears to be a positive relationship between the number of rooms and price: houses with more rooms tend to sell for more.
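
To put numbers behind the boxplot, the median sale price can be tabulated for each room count. This is a sketch on made-up values; with the real data the call would be `tapply(Data$SalePrice, Data$TotRmsAbvGrd, median)`.

```r
# Toy data (made up) standing in for the housing data
toy <- data.frame(
  TotRmsAbvGrd = c(4, 4, 5, 5, 6, 6, 6),
  SalePrice    = c(100000, 120000, 130000, 150000, 170000, 180000, 210000)
)

# Median sale price within each room-count category
med_by_rooms <- tapply(toy$SalePrice, toy$TotRmsAbvGrd, median)
print(med_by_rooms)
```

Categories with very few houses (the frequency table shows a single house with 2 rooms and a single house with 14) give medians based on one observation, so they should be read with caution.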

5)

Find a categorical variable in the data with four or fewer categories that you think might explain sales prices. A) Generate a mean of the sales price variable (variable name is “SalePrice”) for each value of this categorical variable. B) Create density plots of sales price for each category of your categorical variable and visualize them in a single plot (see example from “sm” package in EDA code from R script on course site). Is there an identifiable relationship between this variable and housing prices? If so, describe it.

CentralAir: Central air conditioning

table(Data$CentralAir)
## 
##    N    Y 
##   95 1365
Data3<-group_by(Data, CentralAir)
summarise(Data3, priceavg=mean(SalePrice, na.rm=TRUE))
## # A tibble: 2 x 2
##   CentralAir priceavg
##   <fct>         <dbl>
## 1 N            105264
## 2 Y            186187
sm.density.compare(Data$SalePrice, Data$CentralAir, xlab="Sale Price")
title(main="Sale price by central air conditioning")

# create value labels for a legend
ac.f <- factor(Data$CentralAir, levels= c("N","Y"),
               labels = c("No","Yes"))

# add legend with one fill colour per level
colfill<-2:(1+length(levels(ac.f)))
legend("topright", levels(ac.f), fill=colfill)

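Houses with central air conditioning sell for more on average: the mean sale price is 186,187 with central air versus 105,264 without. Only 95 of the 1,460 houses lack central air, so the mean for that group rests on a much smaller sample.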
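
The same comparison can be drawn without the sm package. The sketch below uses base graphics and made-up prices; with the real data, `toy` would be replaced by `Data`. Each group's bandwidth is chosen by the default of `density()`, so the curves may differ somewhat from those drawn by `sm.density.compare`.

```r
# Toy data (made up) standing in for the housing data
toy <- data.frame(
  SalePrice  = c(90000, 100000, 115000, 120000,
                 150000, 175000, 190000, 220000, 260000),
  CentralAir = factor(c("N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y"))
)

groups <- split(toy$SalePrice, toy$CentralAir)  # one price vector per category
dens   <- lapply(groups, density)               # one density estimate per category
cols   <- seq_along(dens) + 1                   # colours 2, 3, ...

# Axis limits wide enough to hold every curve
xlim <- range(unlist(lapply(dens, function(d) d$x)))
ylim <- range(unlist(lapply(dens, function(d) d$y)))

plot(xlim, ylim, type = "n", xlab = "Sale Price", ylab = "Density",
     main = "Sale price by central air conditioning")
for (i in seq_along(dens)) lines(dens[[i]], col = cols[i], lwd = 2)
legend("topright", legend = names(dens), col = cols, lwd = 2, title = "Central air")
```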

6)

You have now explored the distributions of multiple continuous and categorical variables that may relate to the sales price of homes. Summarize what you learned about the relationships between these explanatory variables and the sales price of homes from your exploratory analysis.

Above-ground living area has the strongest relationship with price, with a correlation of 0.70. Lot size and lot frontage appear less relevant, although both are also positively correlated with price (0.31 and 0.35).

The number of rooms is also positively related to price, which makes sense: more rooms usually means a larger living area, which is consistent with the living-area correlation found in question 3.

Size is not everything, though. The quality of a house and its amenities also affect price, as shown by the higher sale prices of houses with central air conditioning.