Ames Housing Dataset: EDA and Linear Regression Modelling

The Ames Housing data contains data on 2,930 properties in Ames, Iowa, including columns related to
1.house characteristics (bedrooms, garage, fireplace, pool, porch, etc.),
2.location (neighborhood),
3.lot information (zoning, shape, size, etc.),
4.ratings of condition and quality, and
5.sale price

First I will be performing an Exploratory data analysis on the variables in the dataset including univariate, bivariate and multivariate plots. Then I will build and compare 3 linear regression models to get a better understanding of the variables that drive the sale prices of houses the most.

Loading necessary Packages

library(tidyverse)
library(dplyr)
library(skimr)
library(see)
library(broom)
library(performance)
library(ggplot2)
library(qqplotr)
library(reshape2)

data(ames, package = "modeldata")

Exploratory Data Analysis

Univariate plots

Sale Price

ggplot(ames)+
  geom_histogram(aes(Sale_Price),bins=40,color='red',fill='skyblue')+
  theme_minimal()+
  labs(title="Sale Price distribution",x="Sale Price ($)",y="Number of Houses")+
  scale_x_continuous(labels = scales::comma)

mean(ames$Sale_Price)

## [1] 180411.6

Sale Price is right-skewed; there are more inexpensive houses than expensive ones. The mean sale price is 180411.6

Checking Outliers in Sale Price using Box plot.

ggplot(data = ames, mapping = aes(Sale_Price))+
  geom_boxplot( color = 'red')+
  scale_x_continuous(labels = scales::comma)+
  theme_minimal()

It is also possible to extract the values of the potential outliers based on the IQR criterion using the boxplot.stats()$out function:

boxplot.stats(ames$Sale_Price)$out

##   [1] 538000 394432 376162 395192 611657 500000 355000 410000 362500 378500
##  [11] 345000 377500 375000 501837 372500 462000 485000 555000 398800 404000
##  [21] 402861 451950 610000 582933 360000 409900 350000 386250 445000 552000
##  [31] 382500 403000 340000 402000 468000 370878 375000 410000 425000 355000
##  [41] 387000 394617 426000 385000 446261 372402 417500 383000 390000 460000
##  [51] 379000 615000 412500 421250 370000 367294 370967 350000 350000 341000
##  [61] 339750 372500 345000 475000 395039 381000 392500 370000 377426 349265
##  [71] 591587 392000 441929 455000 356000 345474 415298 492000 450000 479069
##  [81] 395000 380000 440000 418000 500067 372000 342000 354000 350000 384500
##  [91] 466500 410000 430000 419005 340000 373000 383970 424870 360000 359100
## [101] 375000 392000 457347 545224 356383 556581 361919 535000 401179 438780
## [111] 470000 412083 342643 465000 415000 345000 344133 360000 372397 378000
## [121] 374000 437154 348000 625000 405000 350000 584500 423000 405749 385000
## [131] 475000 415000 375000 369900 359900

Extracting all the rows corresponding to these outliers:

out <- boxplot.stats(ames$Sale_Price)$out
out_ind <- which(ames$Sale_Price %in% c(out))

ames[out_ind,]

## # A tibble: 135 x 74
##    MS_SubClass      MS_Zoning    Lot_Frontage Lot_Area Street Alley  Lot_Shape  
##    <fct>            <fct>               <dbl>    <int> <fct>  <fct>  <fct>      
##  1 Two_Story_1946_~ Residential~           47    53504 Pave   No_Al~ Moderately~
##  2 One_Story_1946_~ Residential~           88    11394 Pave   No_Al~ Regular    
##  3 Two_Story_1946_~ Residential~          102    12858 Pave   No_Al~ Slightly_I~
##  4 One_Story_1946_~ Residential~           83    10159 Pave   No_Al~ Slightly_I~
##  5 One_Story_1946_~ Residential~          100    12919 Pave   No_Al~ Slightly_I~
##  6 One_Story_1946_~ Residential~          110    14300 Pave   No_Al~ Regular    
##  7 Two_Story_1946_~ Residential~           60    17433 Pave   No_Al~ Moderately~
##  8 One_and_Half_St~ Residential~           56    14720 Pave   No_Al~ Slightly_I~
##  9 Two_Story_1946_~ Residential~           85    13143 Pave   No_Al~ Slightly_I~
## 10 One_Story_1946_~ Residential~           89    13214 Pave   No_Al~ Slightly_I~
## # ... with 125 more rows, and 67 more variables: Land_Contour <fct>,
## #   Utilities <fct>, Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>,
## #   Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
## #   Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
## #   Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
## #   Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
## #   Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
## #   BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>,
## #   Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>,
## #   Central_Air <fct>, Electrical <fct>, First_Flr_SF <int>,
## #   Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
## #   Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>,
## #   Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>,
## #   Functional <fct>, Fireplaces <int>, Garage_Type <fct>, Garage_Finish <fct>,
## #   Garage_Cars <dbl>, Garage_Area <dbl>, Garage_Cond <fct>, Paved_Drive <fct>,
## #   Wood_Deck_SF <int>, Open_Porch_SF <int>, Enclosed_Porch <int>,
## #   Three_season_porch <int>, Screen_Porch <int>, Pool_Area <int>,
## #   Pool_QC <fct>, Fence <fct>, Misc_Feature <fct>, Misc_Val <int>,
## #   Mo_Sold <fct>, Year_Sold <int>, Sale_Type <fct>, Sale_Condition <fct>,
## #   Sale_Price <int>, Longitude <dbl>, Latitude <dbl>

Pool Area

ggplot(ames)+
  geom_histogram(aes(Pool_Area),color='red',fill='skyblue')+
  theme_minimal()+
  labs(title="Pool Area distribution",x="Pool Area (sq.ft.)",y="Number of Houses")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ames %>% filter(Pool_Area>0) %>% summarize(n=n())

## # A tibble: 1 x 1
##       n
##   <int>
## 1    11

Pool Area is highly right skewed.Most of the houses have 0 pool area.Only 11 houses have pool area greater than zero.

Monthly sales

Monthly_sales<- ames %>%
  group_by(Mo_Sold)%>%
  summarise( n = n())%>%
  mutate(Month_sold = factor(month.abb[Mo_Sold], levels = month.abb))

Monthly_sales

## # A tibble: 12 x 3
##    Mo_Sold     n Month_sold
##    <fct>   <int> <fct>     
##  1 1         121 Jan       
##  2 2         133 Feb       
##  3 3         232 Mar       
##  4 4         279 Apr       
##  5 5         395 May       
##  6 6         505 Jun       
##  7 7         448 Jul       
##  8 8         233 Aug       
##  9 9         161 Sep       
## 10 10        171 Oct       
## 11 11        143 Nov       
## 12 12        104 Dec

ggplot(data = Monthly_sales, mapping = aes(x= Month_sold, y= n,fill = Month_sold, label = n))+
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = n, color = Month_sold, vjust=-0.2),show.legend = FALSE )+
  theme_minimal()+
  labs(title = 'Individual residential properties sold in Ames, IA from 2006 to 2010')+
  xlab(element_blank())+
  ylab('Number of Properties')

It is clear from the plot that most of the houses were sold in the mid months i.e May, June, July and least were sold in initial and last month.

Houses sold by Year

Let’s start. The first thing to do is to count the number of observations for each value of the year_sold. Thus, how many house are there for each year? We can do this by grouping the data on year_sold and then counting the number of rows using n.

sold_house<- ames %>%
  group_by(Year_Sold)%>%
  summarise(n = n())

sold_house

## # A tibble: 5 x 2
##   Year_Sold     n
##       <int> <int>
## 1      2006   625
## 2      2007   690
## 3      2008   621
## 4      2009   648
## 5      2010   341

Let’s Plot it using Geom_col

ggplot(sold_house)+
  geom_col(aes(x = Year_Sold,y = n),color='red',fill='skyblue')+
  theme_minimal()+
  labs(title="Number of house sold each year",x="Year",y="Count")

It is clear from the plot that maximum houses were sold in the year of 2009 with approximate 700 value. 2006,2008 and, 2009 had almost same sales and 2010 is the only year with least sale.

Lot Area

ggplot(ames)+
  geom_histogram(aes(Lot_Area),bins=40,color='red',fill='skyblue')+
  geom_vline(aes(xintercept=mean(Lot_Area)),
             color="blue", linetype="dashed", size=1)+
  theme_minimal()+
  labs(title="Lot Area Distribution",x="Lot Area (sq.ft.)",y="Number of Houses")

mean(ames$Lot_Area)

## [1] 10103.58

Lot Area is highly right skewed. Most of the houses have lot sizes smaller than 50000 square feet. The mean lot size is 10103.58.

Houses Built by Year

ggplot(ames)+
  geom_density(aes(Year_Built),color='red',fill='skyblue')+
  theme_minimal()+
  labs(title="Year Built Distribution",x="Year Built",y="Density")

We have three major peaks for the years the houses are built. One around 1920, the second around 1960 and the third one after 1990

Ground Living Area

ggplot(ames)+
  geom_histogram(aes(Gr_Liv_Area),bins=40,color='red',fill='skyblue')+
  geom_vline(aes(xintercept=mean(Gr_Liv_Area)),
             color="blue", linetype="dashed", size=1)+
  theme_minimal()+
  labs(title="Above Ground Living Area Distribution with mean",x="Living Area (sq.ft.)",y="Number of Houses")

mean(ames$Gr_Liv_Area)

## [1] 1493.979

The distribution for Living Area is multi-nodal and right skewed. Meaning more houses have smaller living areas than larger ones. The mean is shown by the vertical line which is 1493.9

Garage Area

ggplot(ames)+
  geom_density(aes(Garage_Area),color='red',fill='skyblue')+
  theme_minimal()+
  labs(title="Garage Area Distribution",x="Garaage Area (sq.ft.)",y="Density")

The distribution of Garage Area is multi-nodal and normal. Some houses have O garage area but most of the distribution is concentrated around 500 sq ft.

Total Basement Area

ggplot(ames)+
  geom_histogram(aes(Total_Bsmt_SF),bins=40,color='red',fill='skyblue')+
  theme_minimal()+
  labs(title="Basement Area Distribution",x="Total basement Area (sq.ft.)",y="Number of Houses")

Total Basement Area is unimodal and slightly rightly skewed. Some houses also have no basement

Foundation

Creating Frequency Table

F_tab<- data.frame(table(ames$Foundation))
colnames(F_tab)<- c('Foundation','count')
F_tab

##   Foundation count
## 1     BrkTil   311
## 2     CBlock  1244
## 3      PConc  1305
## 4       Slab    49
## 5      Stone    11
## 6       Wood     5

Bar graph for foundation.

ggplot(F_tab,mapping = aes(x=Foundation, y=count,fill=Foundation))+
  geom_col()+
  theme_minimal()+
  labs(title = 'Distribution of Foundation')

This bar chat shows that most houses were made up using CBlock and PConc and least were made using stone wood and slab.

Bivariate plots

We will now do Bivariate analysis involving 2 variables.

Houses sold by Year and Month

ggplot(ames)+
  geom_bar(aes(Year_Sold,fill=Mo_Sold),position="dodge")+
  theme_minimal()+
  labs(title="Houses sold by year and month",x="Year",y="Number of Houses",fill="Months")

ames %>% group_by(Year_Sold) %>% summarize(houses=n())

## # A tibble: 5 x 2
##   Year_Sold houses
##       <int>  <int>
## 1      2006    625
## 2      2007    690
## 3      2008    621
## 4      2009    648
## 5      2010    341

Most of the houses were sold in months 6 and 7 of all the years, as evident by the peaks in the barcharts. Least number of houses were sold in 2010.

Sale Price VS sale Year

ggplot(data = ames, mapping = aes(x = as.factor(Year_Sold), y = Sale_Price, color = as.factor(Year_Sold)))+
  geom_boxplot(show.legend = FALSE)+
  stat_summary(fun.y=mean, geom="point", shape=20, size=4, color="black", fill="red") +
  scale_y_continuous(labels = scales::comma)+
  theme_minimal()+
  xlab(element_blank())+
  ylab('Sale Price')+
  labs(title = 'Sale Price VS Year of sale')

## Warning: `fun.y` is deprecated. Use `fun` instead.

From above plot, similar patter can be seen in all the years for sale Price. The average sale price is almost below 200,000 which represented by a black dot on the plot.

Let’s calculate it manually.

mean(ames$Sale_Price)

## [1] 180411.6

Sale Price VS Garage Area

ggplot(ames,aes(Garage_Area,Sale_Price))+
  geom_point()+
  geom_smooth(method='lm', se=FALSE)+
  scale_y_continuous(labels = scales::comma)+
   theme_minimal()+
  labs(title="Sale Price VS Garage Area",x="Garage Area(sq. ft.)",y="Sale Price")

## `geom_smooth()` using formula 'y ~ x'

cor(ames$Garage_Area,ames$Sale_Price)

## [1] 0.6480499

A strong positive correlation exists for garage area and sale price. It means as Garage Area increases, Sale Price also increases.

Neighborhood VS Sale Price

ggplot(ames,mapping = aes(x = Neighborhood,y = Sale_Price ))+
  geom_boxplot(aes(color = Neighborhood), show.legend=FALSE)+
  scale_y_continuous(labels = scales::comma)+
  theme_minimal()+
  coord_flip()

It is seen that Northridge Heights and Stony Brook are the most expensive neighborhoods. Northridge Heights has a great significance on the sale price of the houses.

Houses by Neighborhood and Lot type

ggplot(ames)+
  geom_bar(aes(Neighborhood,fill=Lot_Shape))+
  theme_minimal()+
  labs(title="Houses by Neighborhood and Lot Type",x="Neighborhood",y="Number of Houses",fill="Lot Shape")+
  coord_flip()

North Ames Neighborhood has the highest number of houses with Regular Lot Type

Sale Price VS Total Basement Area

ggplot(ames,aes(Total_Bsmt_SF,Sale_Price))+
  geom_point()+
  geom_smooth(method='lm', se=FALSE)+
  scale_y_continuous(labels = scales::comma)+
   theme_minimal()+
  labs(title="Sale Price VS Basement Area",x="Basement Area (sq. ft.)",y="Sale Price ($)")

## `geom_smooth()` using formula 'y ~ x'

cor(ames$Total_Bsmt_SF,ames$Sale_Price)

## [1] 0.6588634

A strong positive correlation exists between Basement area and sale price. It means as Basement Area of houses increases, their sale Price also increases.

Sale Price VS Year Built

ggplot(ames,aes(Year_Built,Sale_Price))+
  geom_point()+
  geom_smooth(method='lm', se=FALSE)+
  scale_y_continuous(labels = scales::comma)+
  theme_minimal()+
  labs(title="Sale Price VS Year Built",x="Year Built",y="Sale Price ($)")

## `geom_smooth()` using formula 'y ~ x'

cor(ames$Year_Built,ames$Sale_Price)

## [1] 0.5651105

A strong positive correlation exists between Year built and Sale price. It means that newer houses are sold at higher prices and houses built in the earlier years were sold for lesser prices.

Sale Price VS Lot Area

ggplot(ames,aes(Lot_Area,Sale_Price))+
  geom_point()+
  geom_smooth(method='lm', se=FALSE)+
  scale_y_continuous(labels = scales::comma)+
    theme_minimal()+ 
  labs(title="Sale Price VS Lot Area",x="Lot Area (sq. ft.)",y="Sale Price ($)")

## `geom_smooth()` using formula 'y ~ x'

 cor(ames$Lot_Area,ames$Sale_Price)

## [1] 0.2700472

We can see that a fairly strong positive correlation exists between lot area and sale price. But we need to be sure that this relationship is not influenced by outliers, so we will filter the houses with less than 100000 lot area and then check the relationship again.

ames_sample <- ames %>% filter(Lot_Area<100000)

ggplot(ames_sample,aes(Lot_Area,Sale_Price))+
  geom_point()+
  geom_smooth(method='lm', se=FALSE)+
  scale_y_continuous(labels = scales::comma)+
    theme_minimal()+
  labs(title="Sale Price VS Lot Area",x="Lot Area (sq. ft.)",y="Sale Price ($)")

## `geom_smooth()` using formula 'y ~ x'

cor(ames_sample$Lot_Area,ames_sample$Sale_Price)

## [1] 0.3450639

It is evident that the relationship is influenced by the outliers but the general trend is the same. The houses with larger lot areas have greater Sale Prices.

Multivariate plots

We will now do Multivariate analysis involving more than 2 variables.

Relationship between Sale Price and Year Built based on Street type

ggplot(ames,aes(Year_Built,Sale_Price,color = Street))+
  geom_jitter(alpha = 0.2)+
  geom_smooth(method='lm', se=FALSE)+
  scale_y_continuous(labels = scales::comma)+
  theme_minimal()+
  labs(title="Sale Price VS Year Built based on Street",x="Year Built",y="Sale Price ($)")

## `geom_smooth()` using formula 'y ~ x'

The above scatter plot shows a linear and direct relationship between built year and sale price for both type of streets.Almost all houses were connected with Paved streets and only few were with gravel streets.

Relationship between Sale price and sold year based on Paved Driveway

ggplot(data = ames,mapping = aes(Year_Sold,Sale_Price,fill = Paved_Drive) )+
  geom_bar(stat = 'identity',position='dodge')+
  scale_y_continuous(labels = scales::comma)+
  theme_minimal()

Mostly expensive houses had paved driveway for all the given years in which they were sold. Houses with price range around $200,000 were connected to partial pavement and Dirt Gravel, similar price pattern can be seen for all properties for all years.

unique(ames$Sale_Type)

##  [1] WD    New   COD   ConLI Con   ConLD Oth   ConLw CWD   VWD  
## Levels: COD Con ConLD ConLI ConLw CWD New Oth VWD WD

Number of Sales by Condition & Miscellaneous Feature

ggplot(data = ames, aes(x=Sale_Condition,fill=Misc_Feature)) + 
  geom_bar()+
  theme_minimal()+
  labs(title="No of Sales by Sale Condition & Misc. Feature",x="Sale Condition",y="Number of Houses",fill="Misc. Feature")

Most of the houses sold had a normal sale with no miscellaneous feature.

Avg Sale Price by of Rooms and Central Air

ggplot(ames,aes(x=TotRms_AbvGrd,y=Sale_Price))+
  stat_summary(fun = mean, na.rm = TRUE,geom='line',aes(group=Central_Air,color=Central_Air))+
  scale_y_continuous(labels = scales::comma)+
  labs(title="Avg Sale Price by No of Rooms & Central Air",x="Total rooms above ground",y="Sale Price ($)")+
  theme_minimal()

It is evident from the line plot that houses with central air have a higher avg sale price than those which do not have central air conditioning. Graph also hows that avg sale price increases with the number of rooms.

We can also get the Avg prices manually

ames %>% group_by(TotRms_AbvGrd, Central_Air) %>%
  summarize(Avg_Price=mean(Sale_Price)) %>% arrange(desc(Avg_Price))

## `summarise()` has grouped output by 'TotRms_AbvGrd'. You can override using the `.groups` argument.

## # A tibble: 23 x 3
## # Groups:   TotRms_AbvGrd [13]
##    TotRms_AbvGrd Central_Air Avg_Price
##            <int> <fct>           <dbl>
##  1            11 Y             353957 
##  2            10 Y             290527.
##  3            12 Y             283349.
##  4             9 Y             263678.
##  5             8 Y             226102.
##  6            13 Y             205000 
##  7             7 Y             201967.
##  8            14 Y             200000 
##  9             6 Y             166733.
## 10            10 N             152056.
## # ... with 13 more rows

Correlation matrix of numeric variables

cormat <- round(cor(ames_num),2)

melted_cormat <- melt(cormat)
head(melted_cormat)

##           Var1         Var2 value
## 1 Lot_Frontage Lot_Frontage  1.00
## 2     Lot_Area Lot_Frontage  0.12
## 3 Mas_Vnr_Area Lot_Frontage  0.09
## 4 BsmtFin_SF_1 Lot_Frontage  0.07
## 5 BsmtFin_SF_2 Lot_Frontage  0.00
## 6  Bsmt_Unf_SF Lot_Frontage  0.14

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile()+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title="Correlation Matrix")

Response Variable

Sale_Price is the response variable in ames dataset. It is not normally distributed. We will be transforming it using log transformation in the data transformation Tablater. Right now, we will be using it as it is.

ggplot(ames)+
  geom_histogram(aes(Sale_Price),bins=40,color='red',fill='skyblue')+
  theme_minimal()+
  labs(title="Sale Price distribution",x="Sale Price ($)",y="Number of Houses")+
  scale_x_continuous(labels = scales::comma)

Modeling

Model 1 (Sale Price ~ Lot Area)

ggplot(ames,aes(Lot_Area,Sale_Price))+
  geom_point()+
  geom_smooth(method='lm', se=FALSE)+
  scale_y_continuous(labels = scales::comma)+
    theme_minimal()+ 
  labs(title="Sale Price VS Lot Area",x="Lot Area (sq. ft.)",y="Sale Price ($)")

## `geom_smooth()` using formula 'y ~ x'

ames %>%
  summarise(cor(Lot_Area, Sale_Price))

## # A tibble: 1 x 1
##   `cor(Lot_Area, Sale_Price)`
##                         <dbl>
## 1                       0.270

m1 <- lm(Sale_Price ~ Lot_Area, data = ames)
tidy(m1)

## # A tibble: 2 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 152869.    2293.         66.7 0       
## 2 Lot_Area         2.73     0.180      15.2 4.72e-50

\[ \hat{y} = 152869.484 + 2.73 \times LotArea \]

This equation tells us two things:

For houses with a Lot_Area of 0 we expect their mean sale price to be 152869.484 dollars.
For every 1 square ft increase in Lot_Area, we expect an sale price to increase on an average of 2.73 dollars.

glance(m1)

## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1    0.0729        0.0726 75649.      230. 4.72e-50     1 -37008. 74023. 74041.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

For this model, 7.29% of the variability in Sale_Price is explained by Lot_Area.

Model 2 (Sale Price ~ Garage Area)

ggplot(ames,aes(Garage_Area,Sale_Price))+
  geom_point()+
  geom_smooth(method='lm', se=FALSE)+
  scale_y_continuous(labels = scales::comma)+
   theme_minimal()+
  labs(title="Sale Price VS Garage Area",x="Garage Area(sq. ft.)",y="Sale Price")

## `geom_smooth()` using formula 'y ~ x'

ames %>%
  summarise(cor(Garage_Area, Sale_Price))

## # A tibble: 1 x 1
##   `cor(Garage_Area, Sale_Price)`
##                            <dbl>
## 1                          0.648

m2 <- lm(Sale_Price ~ Garage_Area, data = ames)
tidy(m2)

## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   68173.   2679.        25.4 3.26e-129
## 2 Garage_Area     238.      5.17      46.0 0

\[ \hat{y} = 68173.409 + 238 \times GarageArea \] This equation tells us two things:

For houses with a Garage_Area of 0 we expect their mean sale price to be 68173.409 dollars.
For every 1 square ft increase in Garage_Area, we expect an increase of mean sale price to increase 238 dollars.

glance(m2)

## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.420         0.420 59837.     2116.       0     1 -36323. 72651. 72669.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

For this model, 42.0% of the variability in Sale_Price is explained by Garage_Area.

Model 3 (Sale Price ~ Living Area)

# some code goes here
ggplot(ames,aes(Gr_Liv_Area,Sale_Price))+
  geom_point() +
  geom_smooth(method = 'lm',se=FALSE)+
   scale_y_continuous(labels = scales::comma)+
   theme_minimal()+
  labs(title="Sale Price VS Living Area",x="Living Area(sq. ft.)",y="Sale Price")

## `geom_smooth()` using formula 'y ~ x'

ames %>%
  summarise(cor(Gr_Liv_Area, Sale_Price))

## # A tibble: 1 x 1
##   `cor(Gr_Liv_Area, Sale_Price)`
##                            <dbl>
## 1                          0.719

m3 <- lm(Sale_Price ~ Gr_Liv_Area, data = ames)
tidy(m3)

## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    6773.   3260.        2.08  0.0378
## 2 Gr_Liv_Area     116.      2.08     56.0   0

\[ \hat{y} = 6773.484 + 116.2 \times GrLivArea\]

This equation tells us two things:

For houses with a Gr_Liv_Area of 0 we expect their mean sale price to be 6773.484 dollars.
For every 1 square ft increase in Gr_Liv_Area, we expect an sale price to increase on an average 116.2 dollars.

glance(m3)

## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.518         0.517 54568.     3137.       0     1 -36053. 72112. 72130.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

For this model, 51.8% of the variability in Sale_Price is explained by Gr_Liv_Area.

Model Assessment

library(modelsummary)

models <- list(
  "Sale Price ~ Lot_Area"     = m1,
 "Sale Price ~ Garage_Area" = m2,
  "Sale Price ~ Gr_Liv_Area"=m3
)

modelsummary(models)

	Sale Price ~ Lot_Area	Sale Price ~ Garage_Area	Sale Price ~ Gr_Liv_Area
(Intercept)	152869.484	68173.409	6773.484
	(2292.521)	(2678.887)	(3260.418)
Lot_Area	2.726
	(0.180)
Garage_Area		237.933
		(5.172)
Gr_Liv_Area			116.225
			(2.075)
Num.Obs.	2925	2925	2925
R2	0.073	0.420	0.518
R2 Adj.	0.073	0.420	0.517
AIC	74022.9	72651.2	72111.9
BIC	74040.8	72669.1	72129.9
Log.Lik.	-37008.444	-36322.602	-36052.970
F	229.929	2116.383	3136.620

Out of all three models, Model 1 has the least R2 of about 7.3% which means higher variability around the regression line.

Model Diagnostics

Sale Price ~ Lot Area

m1_aug <- augment(m1)

m1_aug

## # A tibble: 2,925 x 8
##    Sale_Price Lot_Area .fitted  .resid     .hat .sigma      .cooksd .std.resid
##         <int>    <int>   <dbl>   <dbl>    <dbl>  <dbl>        <dbl>      <dbl>
##  1     215000    31770 239474. -24474. 0.00299  75661. 0.000158       -0.324  
##  2     105000    11622 184551. -79551. 0.000355 75648. 0.000196       -1.05   
##  3     172000    14267 191761. -19761. 0.000440 75661. 0.0000150      -0.261  
##  4     244000    11160 183291.  60709. 0.000348 75654. 0.000112        0.803  
##  5     189900    13830 190570.   -670. 0.000420 75662. 0.0000000165   -0.00885
##  6     195500     9978 180069.  15431. 0.000342 75662. 0.00000712      0.204  
##  7     213500     4920 166281.  47219. 0.000494 75657. 0.0000963       0.624  
##  8     191500     5005 166513.  24987. 0.000489 75661. 0.0000267       0.330  
##  9     236500     5389 167560.  68940. 0.000467 75651. 0.000194        0.912  
## 10     189000     7500 173314.  15686. 0.000380 75662. 0.00000818      0.207  
## # ... with 2,915 more rows

# Linearity

ggplot(data = m1_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")+
  theme_minimal()

Model 1 fails the condition of linearity.

# Nearly normal residuals

ggplot(data = m1_aug, aes(x = .resid)) +
  geom_histogram(color='red',fill='skyblue') +
  xlab("Residuals")+
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Model 1 also fails the condition of normality of residuals.

# constant variability

check_model(m1)

check_heteroscedasticity(m1)

## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

Sale Price ~ Garage Area

# model diagnostics

m2_aug <- augment(m2)

m2_aug

## # A tibble: 2,925 x 8
##    Sale_Price Garage_Area .fitted   .resid     .hat .sigma    .cooksd .std.resid
##         <int>       <dbl>   <dbl>    <dbl>    <dbl>  <dbl>      <dbl>      <dbl>
##  1     215000         528 193802.   21198. 0.000366 59846.    2.30e-5     0.354 
##  2     105000         730 241864. -136864. 0.000840 59794.    2.20e-3    -2.29  
##  3     172000         312 142408.   29592. 0.000532 59845.    6.52e-5     0.495 
##  4     244000         522 192374.   51626. 0.000361 59840.    1.34e-4     0.863 
##  5     189900         482 182857.    7043. 0.000343 59848.    2.38e-6     0.118 
##  6     195500         470 180002.   15498. 0.000342 59847.    1.15e-5     0.259 
##  7     213500         582 206650.    6850. 0.000433 59848.    2.84e-6     0.114 
##  8     191500         506 188567.    2933. 0.000351 59848.    4.21e-7     0.0490
##  9     236500         608 212836.   23664. 0.000481 59846.    3.76e-5     0.396 
## 10     189000         442 173340.   15660. 0.000348 59847.    1.19e-5     0.262 
## # ... with 2,915 more rows

# Linearity

ggplot(data = m2_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")+
  theme_minimal()

# nearly normal 

ggplot(data = m2_aug, aes(x = .resid)) +
  geom_histogram(color='red',fill='skyblue') +
  xlab("Residuals")+
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# constant variability

check_model(m2)

check_heteroscedasticity(m2)

## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

Sale Price ~ Living Area

# model diagnostics

m3_aug <- augment(m3)

m3_aug

## # A tibble: 2,925 x 8
##    Sale_Price Gr_Liv_Area .fitted  .resid     .hat .sigma     .cooksd .std.resid
##         <int>       <int>   <dbl>   <dbl>    <dbl>  <dbl>       <dbl>      <dbl>
##  1     215000        1656 199243.  15757. 0.000380 54577. 0.0000158       0.289 
##  2     105000         896 110911.  -5911. 0.000859 54577. 0.00000505     -0.108 
##  3     172000        1329 161237.  10763. 0.000381 54577. 0.00000742      0.197 
##  4     244000        2110 252009.  -8009. 0.000891 54577. 0.00000961     -0.147 
##  5     189900        1629 196104.  -6204. 0.000368 54577. 0.00000238     -0.114 
##  6     195500        1604 193199.   2301. 0.000359 54577. 0.000000320     0.0422
##  7     213500        1338 162283.  51217. 0.000377 54569. 0.000166        0.939 
##  8     191500        1280 155542.  35958. 0.000408 54573. 0.0000887       0.659 
##  9     236500        1616 194594.  41906. 0.000363 54572. 0.000107        0.768 
## 10     189000        1804 216444. -27444. 0.000481 54575. 0.0000609      -0.503 
## # ... with 2,915 more rows

# Linearity

ggplot(data = m3_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")+
  theme_minimal()

# nearly normal residuals

ggplot(data = m3_aug, aes(x = .resid)) +
  geom_histogram(color='red',fill='skyblue') +
  xlab("Residuals")+
theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# constant variability

check_model(m3)

check_heteroscedasticity(m3)

## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

Based on the diagnostics and assessment, model 2 and model 3 both have normal distribution of residuals and residuals are linearly distributed along zero, but model 3 is our best fit model because of higher adjusted R2 value.

Data Transformation and Re-fitting the Best Model

Now we will be log transforming our response variable Sale_price.

ames<-ames %>% mutate(sale_norm=log(Sale_Price))

ggplot(ames)+
  geom_histogram(aes(sale_norm),color='red',fill='skyblue')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ames %>%
  summarise(cor(Gr_Liv_Area, sale_norm))

## # A tibble: 1 x 1
##   `cor(Gr_Liv_Area, sale_norm)`
##                           <dbl>
## 1                         0.711

m4 <- lm(sale_norm ~ Gr_Liv_Area, data = ames)
tidy(m4)

## # A tibble: 2 x 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept) 11.1      0.0171        653.        0
## 2 Gr_Liv_Area  0.000594 0.0000109      54.7       0

glance(m4)

## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.506         0.506 0.285     2994.       0     1  -482.  970.  988.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

\[ \hat{y} = 11.1 + 0.000594 \times GrLivArea\]

This equation tells us two things:

For houses with a Gr_Liv_Area of 0 we expect their mean sale price to be 11.1 dollars.
For every 1 square ft increase in Gr_Liv_Area, we expect an sale price to increase on an average 0.000594 dollars

For this model, 50.6% of the variability in Sale_Price is explained by Gr_Liv_Area.

Running Diagnostics for this model:

m4_aug <- augment(m4)

# Checking for Linearity

ggplot(data = m4_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")+
  theme_minimal()

# Checking for Nearly normal residuals

ggplot(data = m4_aug, aes(x = .resid)) +
  geom_histogram(color='red',fill='skyblue') +
  xlab("Residuals")+
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Checking for constant variability

check_model(m4)

check_heteroscedasticity(m4)

## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).

Even though this model has a slighter lower adjusted R2, the linearity, homogeneity of variance and normality of residuals are slightly better fitted.