Ames Housing Dataset: EDA and Linear Regression Modelling
The Ames Housing data contains data on 2,930 properties in Ames, Iowa, including columns related to 1.house characteristics (bedrooms, garage, fireplace, pool, porch, etc.), 2.location (neighborhood), 3.lot information (zoning, shape, size, etc.), 4.ratings of condition and quality, and 5.sale price
First I will be performing an Exploratory data analysis on the variables in the dataset including univariate, bivariate and multivariate plots. Then I will build and compare 3 linear regression models to get a better understanding of the variables that drive the sale prices of houses the most.
Loading necessary Packages
library(tidyverse)
library(dplyr)
library(skimr)
library(see)
library(broom)
library(performance)
library(ggplot2)
library(qqplotr)
library(reshape2)
data(ames, package = "modeldata")
Exploratory Data Analysis
Univariate plots
Sale Price
ggplot(ames)+
geom_histogram(aes(Sale_Price),bins=40,color='red',fill='skyblue')+
theme_minimal()+
labs(title="Sale Price distribution",x="Sale Price ($)",y="Number of Houses")+
scale_x_continuous(labels = scales::comma)
mean(ames$Sale_Price)
## [1] 180411.6
Sale Price is right-skewed; there are more inexpensive houses than expensive ones. The mean sale price is 180411.6
Checking Outliers in Sale Price using Box plot.
ggplot(data = ames, mapping = aes(Sale_Price))+
geom_boxplot( color = 'red')+
scale_x_continuous(labels = scales::comma)+
theme_minimal()
It is also possible to extract the values of the potential outliers based on the IQR criterion using the boxplot.stats()$out function:
boxplot.stats(ames$Sale_Price)$out
## [1] 538000 394432 376162 395192 611657 500000 355000 410000 362500 378500
## [11] 345000 377500 375000 501837 372500 462000 485000 555000 398800 404000
## [21] 402861 451950 610000 582933 360000 409900 350000 386250 445000 552000
## [31] 382500 403000 340000 402000 468000 370878 375000 410000 425000 355000
## [41] 387000 394617 426000 385000 446261 372402 417500 383000 390000 460000
## [51] 379000 615000 412500 421250 370000 367294 370967 350000 350000 341000
## [61] 339750 372500 345000 475000 395039 381000 392500 370000 377426 349265
## [71] 591587 392000 441929 455000 356000 345474 415298 492000 450000 479069
## [81] 395000 380000 440000 418000 500067 372000 342000 354000 350000 384500
## [91] 466500 410000 430000 419005 340000 373000 383970 424870 360000 359100
## [101] 375000 392000 457347 545224 356383 556581 361919 535000 401179 438780
## [111] 470000 412083 342643 465000 415000 345000 344133 360000 372397 378000
## [121] 374000 437154 348000 625000 405000 350000 584500 423000 405749 385000
## [131] 475000 415000 375000 369900 359900
Extracting all the rows corresponding to these outliers:
<- boxplot.stats(ames$Sale_Price)$out
out <- which(ames$Sale_Price %in% c(out))
out_ind
ames[out_ind,]
## # A tibble: 135 x 74
## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
## <fct> <fct> <dbl> <int> <fct> <fct> <fct>
## 1 Two_Story_1946_~ Residential~ 47 53504 Pave No_Al~ Moderately~
## 2 One_Story_1946_~ Residential~ 88 11394 Pave No_Al~ Regular
## 3 Two_Story_1946_~ Residential~ 102 12858 Pave No_Al~ Slightly_I~
## 4 One_Story_1946_~ Residential~ 83 10159 Pave No_Al~ Slightly_I~
## 5 One_Story_1946_~ Residential~ 100 12919 Pave No_Al~ Slightly_I~
## 6 One_Story_1946_~ Residential~ 110 14300 Pave No_Al~ Regular
## 7 Two_Story_1946_~ Residential~ 60 17433 Pave No_Al~ Moderately~
## 8 One_and_Half_St~ Residential~ 56 14720 Pave No_Al~ Slightly_I~
## 9 Two_Story_1946_~ Residential~ 85 13143 Pave No_Al~ Slightly_I~
## 10 One_Story_1946_~ Residential~ 89 13214 Pave No_Al~ Slightly_I~
## # ... with 125 more rows, and 67 more variables: Land_Contour <fct>,
## # Utilities <fct>, Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>,
## # Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
## # Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
## # Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
## # Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
## # Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
## # BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>,
## # Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>,
## # Central_Air <fct>, Electrical <fct>, First_Flr_SF <int>,
## # Second_Flr_SF <int>, Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>,
## # Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>,
## # Bedroom_AbvGr <int>, Kitchen_AbvGr <int>, TotRms_AbvGrd <int>,
## # Functional <fct>, Fireplaces <int>, Garage_Type <fct>, Garage_Finish <fct>,
## # Garage_Cars <dbl>, Garage_Area <dbl>, Garage_Cond <fct>, Paved_Drive <fct>,
## # Wood_Deck_SF <int>, Open_Porch_SF <int>, Enclosed_Porch <int>,
## # Three_season_porch <int>, Screen_Porch <int>, Pool_Area <int>,
## # Pool_QC <fct>, Fence <fct>, Misc_Feature <fct>, Misc_Val <int>,
## # Mo_Sold <fct>, Year_Sold <int>, Sale_Type <fct>, Sale_Condition <fct>,
## # Sale_Price <int>, Longitude <dbl>, Latitude <dbl>
Pool Area
ggplot(ames)+
geom_histogram(aes(Pool_Area),color='red',fill='skyblue')+
theme_minimal()+
labs(title="Pool Area distribution",x="Pool Area (sq.ft.)",y="Number of Houses")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
%>% filter(Pool_Area>0) %>% summarize(n=n()) ames
## # A tibble: 1 x 1
## n
## <int>
## 1 11
Pool Area is highly right skewed.Most of the houses have 0 pool area.Only 11 houses have pool area greater than zero.
Monthly sales
<- ames %>%
Monthly_salesgroup_by(Mo_Sold)%>%
summarise( n = n())%>%
mutate(Month_sold = factor(month.abb[Mo_Sold], levels = month.abb))
Monthly_sales
## # A tibble: 12 x 3
## Mo_Sold n Month_sold
## <fct> <int> <fct>
## 1 1 121 Jan
## 2 2 133 Feb
## 3 3 232 Mar
## 4 4 279 Apr
## 5 5 395 May
## 6 6 505 Jun
## 7 7 448 Jul
## 8 8 233 Aug
## 9 9 161 Sep
## 10 10 171 Oct
## 11 11 143 Nov
## 12 12 104 Dec
ggplot(data = Monthly_sales, mapping = aes(x= Month_sold, y= n,fill = Month_sold, label = n))+
geom_col(show.legend = FALSE)+
geom_text(aes(label = n, color = Month_sold, vjust=-0.2),show.legend = FALSE )+
theme_minimal()+
labs(title = 'Individual residential properties sold in Ames, IA from 2006 to 2010')+
xlab(element_blank())+
ylab('Number of Properties')
It is clear from the plot that most of the houses were sold in the mid months i.e May, June, July and least were sold in initial and last month.
Houses sold by Year
Let’s start. The first thing to do is to count the number of observations for each value of the year_sold. Thus, how many house are there for each year? We can do this by grouping the data on year_sold and then counting the number of rows using n.
<- ames %>%
sold_housegroup_by(Year_Sold)%>%
summarise(n = n())
sold_house
## # A tibble: 5 x 2
## Year_Sold n
## <int> <int>
## 1 2006 625
## 2 2007 690
## 3 2008 621
## 4 2009 648
## 5 2010 341
Let’s Plot it using Geom_col
ggplot(sold_house)+
geom_col(aes(x = Year_Sold,y = n),color='red',fill='skyblue')+
theme_minimal()+
labs(title="Number of house sold each year",x="Year",y="Count")
It is clear from the plot that maximum houses were sold in the year of 2009 with approximate 700 value. 2006,2008 and, 2009 had almost same sales and 2010 is the only year with least sale.
Lot Area
ggplot(ames)+
geom_histogram(aes(Lot_Area),bins=40,color='red',fill='skyblue')+
geom_vline(aes(xintercept=mean(Lot_Area)),
color="blue", linetype="dashed", size=1)+
theme_minimal()+
labs(title="Lot Area Distribution",x="Lot Area (sq.ft.)",y="Number of Houses")
mean(ames$Lot_Area)
## [1] 10103.58
Lot Area is highly right skewed. Most of the houses have lot sizes smaller than 50000 square feet. The mean lot size is 10103.58.
Houses Built by Year
ggplot(ames)+
geom_density(aes(Year_Built),color='red',fill='skyblue')+
theme_minimal()+
labs(title="Year Built Distribution",x="Year Built",y="Density")
We have three major peaks for the years the houses are built. One around 1920, the second around 1960 and the third one after 1990
Ground Living Area
ggplot(ames)+
geom_histogram(aes(Gr_Liv_Area),bins=40,color='red',fill='skyblue')+
geom_vline(aes(xintercept=mean(Gr_Liv_Area)),
color="blue", linetype="dashed", size=1)+
theme_minimal()+
labs(title="Above Ground Living Area Distribution with mean",x="Living Area (sq.ft.)",y="Number of Houses")
mean(ames$Gr_Liv_Area)
## [1] 1493.979
The distribution for Living Area is multi-nodal and right skewed. Meaning more houses have smaller living areas than larger ones. The mean is shown by the vertical line which is 1493.9
Garage Area
ggplot(ames)+
geom_density(aes(Garage_Area),color='red',fill='skyblue')+
theme_minimal()+
labs(title="Garage Area Distribution",x="Garaage Area (sq.ft.)",y="Density")
The distribution of Garage Area is multi-nodal and normal. Some houses have O garage area but most of the distribution is concentrated around 500 sq ft.
Total Basement Area
ggplot(ames)+
geom_histogram(aes(Total_Bsmt_SF),bins=40,color='red',fill='skyblue')+
theme_minimal()+
labs(title="Basement Area Distribution",x="Total basement Area (sq.ft.)",y="Number of Houses")
Total Basement Area is unimodal and slightly rightly skewed. Some houses also have no basement
Foundation
Creating Frequency Table
<- data.frame(table(ames$Foundation))
F_tabcolnames(F_tab)<- c('Foundation','count')
F_tab
## Foundation count
## 1 BrkTil 311
## 2 CBlock 1244
## 3 PConc 1305
## 4 Slab 49
## 5 Stone 11
## 6 Wood 5
Bar graph for foundation.
ggplot(F_tab,mapping = aes(x=Foundation, y=count,fill=Foundation))+
geom_col()+
theme_minimal()+
labs(title = 'Distribution of Foundation')
This bar chat shows that most houses were made up using CBlock and PConc and least were made using stone wood and slab.
Bivariate plots
We will now do Bivariate analysis involving 2 variables.
Houses sold by Year and Month
ggplot(ames)+
geom_bar(aes(Year_Sold,fill=Mo_Sold),position="dodge")+
theme_minimal()+
labs(title="Houses sold by year and month",x="Year",y="Number of Houses",fill="Months")
%>% group_by(Year_Sold) %>% summarize(houses=n()) ames
## # A tibble: 5 x 2
## Year_Sold houses
## <int> <int>
## 1 2006 625
## 2 2007 690
## 3 2008 621
## 4 2009 648
## 5 2010 341
Most of the houses were sold in months 6 and 7 of all the years, as evident by the peaks in the barcharts. Least number of houses were sold in 2010.
Sale Price VS sale Year
ggplot(data = ames, mapping = aes(x = as.factor(Year_Sold), y = Sale_Price, color = as.factor(Year_Sold)))+
geom_boxplot(show.legend = FALSE)+
stat_summary(fun.y=mean, geom="point", shape=20, size=4, color="black", fill="red") +
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
xlab(element_blank())+
ylab('Sale Price')+
labs(title = 'Sale Price VS Year of sale')
## Warning: `fun.y` is deprecated. Use `fun` instead.
From above plot, similar patter can be seen in all the years for sale Price. The average sale price is almost below 200,000 which represented by a black dot on the plot.
Let’s calculate it manually.
mean(ames$Sale_Price)
## [1] 180411.6
Sale Price VS Garage Area
ggplot(ames,aes(Garage_Area,Sale_Price))+
geom_point()+
geom_smooth(method='lm', se=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
labs(title="Sale Price VS Garage Area",x="Garage Area(sq. ft.)",y="Sale Price")
## `geom_smooth()` using formula 'y ~ x'
cor(ames$Garage_Area,ames$Sale_Price)
## [1] 0.6480499
A strong positive correlation exists for garage area and sale price. It means as Garage Area increases, Sale Price also increases.
Neighborhood VS Sale Price
ggplot(ames,mapping = aes(x = Neighborhood,y = Sale_Price ))+
geom_boxplot(aes(color = Neighborhood), show.legend=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
coord_flip()
It is seen that Northridge Heights and Stony Brook are the most expensive neighborhoods. Northridge Heights has a great significance on the sale price of the houses.
Houses by Neighborhood and Lot type
ggplot(ames)+
geom_bar(aes(Neighborhood,fill=Lot_Shape))+
theme_minimal()+
labs(title="Houses by Neighborhood and Lot Type",x="Neighborhood",y="Number of Houses",fill="Lot Shape")+
coord_flip()
North Ames Neighborhood has the highest number of houses with Regular Lot Type
Sale Price VS Total Basement Area
ggplot(ames,aes(Total_Bsmt_SF,Sale_Price))+
geom_point()+
geom_smooth(method='lm', se=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
labs(title="Sale Price VS Basement Area",x="Basement Area (sq. ft.)",y="Sale Price ($)")
## `geom_smooth()` using formula 'y ~ x'
cor(ames$Total_Bsmt_SF,ames$Sale_Price)
## [1] 0.6588634
A strong positive correlation exists between Basement area and sale price. It means as Basement Area of houses increases, their sale Price also increases.
Sale Price VS Year Built
ggplot(ames,aes(Year_Built,Sale_Price))+
geom_point()+
geom_smooth(method='lm', se=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
labs(title="Sale Price VS Year Built",x="Year Built",y="Sale Price ($)")
## `geom_smooth()` using formula 'y ~ x'
cor(ames$Year_Built,ames$Sale_Price)
## [1] 0.5651105
A strong positive correlation exists between Year built and Sale price. It means that newer houses are sold at higher prices and houses built in the earlier years were sold for lesser prices.
Sale Price VS Lot Area
ggplot(ames,aes(Lot_Area,Sale_Price))+
geom_point()+
geom_smooth(method='lm', se=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
labs(title="Sale Price VS Lot Area",x="Lot Area (sq. ft.)",y="Sale Price ($)")
## `geom_smooth()` using formula 'y ~ x'
cor(ames$Lot_Area,ames$Sale_Price)
## [1] 0.2700472
We can see that a fairly strong positive correlation exists between lot area and sale price. But we need to be sure that this relationship is not influenced by outliers, so we will filter the houses with less than 100000 lot area and then check the relationship again.
<- ames %>% filter(Lot_Area<100000)
ames_sample
ggplot(ames_sample,aes(Lot_Area,Sale_Price))+
geom_point()+
geom_smooth(method='lm', se=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
labs(title="Sale Price VS Lot Area",x="Lot Area (sq. ft.)",y="Sale Price ($)")
## `geom_smooth()` using formula 'y ~ x'
cor(ames_sample$Lot_Area,ames_sample$Sale_Price)
## [1] 0.3450639
It is evident that the relationship is influenced by the outliers but the general trend is the same. The houses with larger lot areas have greater Sale Prices.
Multivariate plots
We will now do Multivariate analysis involving more than 2 variables.
Relationship between Sale Price and Year Built based on Street type
ggplot(ames,aes(Year_Built,Sale_Price,color = Street))+
geom_jitter(alpha = 0.2)+
geom_smooth(method='lm', se=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
labs(title="Sale Price VS Year Built based on Street",x="Year Built",y="Sale Price ($)")
## `geom_smooth()` using formula 'y ~ x'
The above scatter plot shows a linear and direct relationship between built year and sale price for both type of streets.Almost all houses were connected with Paved streets and only few were with gravel streets.
Relationship between Sale price and sold year based on Paved Driveway
ggplot(data = ames,mapping = aes(Year_Sold,Sale_Price,fill = Paved_Drive) )+
geom_bar(stat = 'identity',position='dodge')+
scale_y_continuous(labels = scales::comma)+
theme_minimal()
Mostly expensive houses had paved driveway for all the given years in which they were sold. Houses with price range around $200,000 were connected to partial pavement and Dirt Gravel, similar price pattern can be seen for all properties for all years.
unique(ames$Sale_Type)
## [1] WD New COD ConLI Con ConLD Oth ConLw CWD VWD
## Levels: COD Con ConLD ConLI ConLw CWD New Oth VWD WD
Number of Sales by Condition & Miscellaneous Feature
ggplot(data = ames, aes(x=Sale_Condition,fill=Misc_Feature)) +
geom_bar()+
theme_minimal()+
labs(title="No of Sales by Sale Condition & Misc. Feature",x="Sale Condition",y="Number of Houses",fill="Misc. Feature")
ggplot(data = ames, aes(x=Sale_Condition,fill=Misc_Feature)) +
geom_bar()+
theme_minimal()+
labs(title="No of Sales by Sale Condition & Misc. Feature",x="Sale Condition",y="Number of Houses",fill="Misc. Feature")
Most of the houses sold had a normal sale with no miscellaneous feature.
Avg Sale Price by of Rooms and Central Air
ggplot(ames,aes(x=TotRms_AbvGrd,y=Sale_Price))+
stat_summary(fun = mean, na.rm = TRUE,geom='line',aes(group=Central_Air,color=Central_Air))+
scale_y_continuous(labels = scales::comma)+
labs(title="Avg Sale Price by No of Rooms & Central Air",x="Total rooms above ground",y="Sale Price ($)")+
theme_minimal()
ggplot(ames,aes(x=TotRms_AbvGrd,y=Sale_Price))+
stat_summary(fun = mean, na.rm = TRUE,geom='line',aes(group=Central_Air,color=Central_Air))+
scale_y_continuous(labels = scales::comma)+
labs(title="Avg Sale Price by No of Rooms & Central Air",x="Total rooms above ground",y="Sale Price ($)")+
theme_minimal()
It is evident from the line plot that houses with central air have a higher avg sale price than those which do not have central air conditioning. Graph also hows that avg sale price increases with the number of rooms.
We can also get the Avg prices manually
%>% group_by(TotRms_AbvGrd, Central_Air) %>%
ames summarize(Avg_Price=mean(Sale_Price)) %>% arrange(desc(Avg_Price))
## `summarise()` has grouped output by 'TotRms_AbvGrd'. You can override using the `.groups` argument.
## # A tibble: 23 x 3
## # Groups: TotRms_AbvGrd [13]
## TotRms_AbvGrd Central_Air Avg_Price
## <int> <fct> <dbl>
## 1 11 Y 353957
## 2 10 Y 290527.
## 3 12 Y 283349.
## 4 9 Y 263678.
## 5 8 Y 226102.
## 6 13 Y 205000
## 7 7 Y 201967.
## 8 14 Y 200000
## 9 6 Y 166733.
## 10 10 N 152056.
## # ... with 13 more rows
Correlation matrix of numeric variables
<- round(cor(ames_num),2)
cormat
<- melt(cormat)
melted_cormat head(melted_cormat)
## Var1 Var2 value
## 1 Lot_Frontage Lot_Frontage 1.00
## 2 Lot_Area Lot_Frontage 0.12
## 3 Mas_Vnr_Area Lot_Frontage 0.09
## 4 BsmtFin_SF_1 Lot_Frontage 0.07
## 5 BsmtFin_SF_2 Lot_Frontage 0.00
## 6 Bsmt_Unf_SF Lot_Frontage 0.14
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
labs(title="Correlation Matrix")
<- round(cor(ames_num),2)
cormat
<- melt(cormat)
melted_cormat head(melted_cormat)
## Var1 Var2 value
## 1 Lot_Frontage Lot_Frontage 1.00
## 2 Lot_Area Lot_Frontage 0.12
## 3 Mas_Vnr_Area Lot_Frontage 0.09
## 4 BsmtFin_SF_1 Lot_Frontage 0.07
## 5 BsmtFin_SF_2 Lot_Frontage 0.00
## 6 Bsmt_Unf_SF Lot_Frontage 0.14
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
labs(title="Correlation Matrix")
Response Variable
Sale_Price
is the response variable in ames
dataset. It is not normally distributed. We will be transforming it using log transformation in the data transformation Tablater. Right now, we will be using it as it is.
ggplot(ames)+
geom_histogram(aes(Sale_Price),bins=40,color='red',fill='skyblue')+
theme_minimal()+
labs(title="Sale Price distribution",x="Sale Price ($)",y="Number of Houses")+
scale_x_continuous(labels = scales::comma)
Modeling
Model 1 (Sale Price ~ Lot Area)
ggplot(ames,aes(Lot_Area,Sale_Price))+
geom_point()+
geom_smooth(method='lm', se=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
labs(title="Sale Price VS Lot Area",x="Lot Area (sq. ft.)",y="Sale Price ($)")
## `geom_smooth()` using formula 'y ~ x'
%>%
ames summarise(cor(Lot_Area, Sale_Price))
## # A tibble: 1 x 1
## `cor(Lot_Area, Sale_Price)`
## <dbl>
## 1 0.270
<- lm(Sale_Price ~ Lot_Area, data = ames)
m1 tidy(m1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 152869. 2293. 66.7 0
## 2 Lot_Area 2.73 0.180 15.2 4.72e-50
\[ \hat{y} = 152869.484 + 2.73 \times LotArea \]
This equation tells us two things:
For houses with a
Lot_Area
of 0 we expect their mean sale price to be 152869.484 dollars.For every 1 square ft increase in
Lot_Area
, we expect an sale price to increase on an average of 2.73 dollars.
glance(m1)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0729 0.0726 75649. 230. 4.72e-50 1 -37008. 74023. 74041.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
For this model, 7.29% of the variability in Sale_Price
is explained by Lot_Area
.
Model 2 (Sale Price ~ Garage Area)
ggplot(ames,aes(Garage_Area,Sale_Price))+
geom_point()+
geom_smooth(method='lm', se=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
labs(title="Sale Price VS Garage Area",x="Garage Area(sq. ft.)",y="Sale Price")
## `geom_smooth()` using formula 'y ~ x'
%>%
ames summarise(cor(Garage_Area, Sale_Price))
## # A tibble: 1 x 1
## `cor(Garage_Area, Sale_Price)`
## <dbl>
## 1 0.648
<- lm(Sale_Price ~ Garage_Area, data = ames)
m2 tidy(m2)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 68173. 2679. 25.4 3.26e-129
## 2 Garage_Area 238. 5.17 46.0 0
\[ \hat{y} = 68173.409 + 238 \times GarageArea \] This equation tells us two things:
For houses with a
Garage_Area
of 0 we expect their mean sale price to be 68173.409 dollars.For every 1 square ft increase in
Garage_Area
, we expect an increase of mean sale price to increase 238 dollars.
glance(m2)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.420 0.420 59837. 2116. 0 1 -36323. 72651. 72669.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
For this model, 42.0% of the variability in Sale_Price
is explained by Garage_Area
.
Model 3 (Sale Price ~ Living Area)
# some code goes here
ggplot(ames,aes(Gr_Liv_Area,Sale_Price))+
geom_point() +
geom_smooth(method = 'lm',se=FALSE)+
scale_y_continuous(labels = scales::comma)+
theme_minimal()+
labs(title="Sale Price VS Living Area",x="Living Area(sq. ft.)",y="Sale Price")
## `geom_smooth()` using formula 'y ~ x'
%>%
ames summarise(cor(Gr_Liv_Area, Sale_Price))
## # A tibble: 1 x 1
## `cor(Gr_Liv_Area, Sale_Price)`
## <dbl>
## 1 0.719
<- lm(Sale_Price ~ Gr_Liv_Area, data = ames)
m3 tidy(m3)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6773. 3260. 2.08 0.0378
## 2 Gr_Liv_Area 116. 2.08 56.0 0
\[ \hat{y} = 6773.484 + 116.2 \times GrLivArea\]
This equation tells us two things:
For houses with a
Gr_Liv_Area
of 0 we expect their mean sale price to be 6773.484 dollars.For every 1 square ft increase in
Gr_Liv_Area
, we expect an sale price to increase on an average 116.2 dollars.
glance(m3)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.518 0.517 54568. 3137. 0 1 -36053. 72112. 72130.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
For this model, 51.8% of the variability in Sale_Price
is explained by Gr_Liv_Area
.
Model Assessment
library(modelsummary)
<- list(
models "Sale Price ~ Lot_Area" = m1,
"Sale Price ~ Garage_Area" = m2,
"Sale Price ~ Gr_Liv_Area"=m3
)
modelsummary(models)
Sale Price ~ Lot_Area | Sale Price ~ Garage_Area | Sale Price ~ Gr_Liv_Area | |
---|---|---|---|
(Intercept) | 152869.484 | 68173.409 | 6773.484 |
(2292.521) | (2678.887) | (3260.418) | |
Lot_Area | 2.726 | ||
(0.180) | |||
Garage_Area | 237.933 | ||
(5.172) | |||
Gr_Liv_Area | 116.225 | ||
(2.075) | |||
Num.Obs. | 2925 | 2925 | 2925 |
R2 | 0.073 | 0.420 | 0.518 |
R2 Adj. | 0.073 | 0.420 | 0.517 |
AIC | 74022.9 | 72651.2 | 72111.9 |
BIC | 74040.8 | 72669.1 | 72129.9 |
Log.Lik. | -37008.444 | -36322.602 | -36052.970 |
F | 229.929 | 2116.383 | 3136.620 |
Out of all three models, Model 1 has the least R2 of about 7.3% which means higher variability around the regression line.
Model Diagnostics
Sale Price ~ Lot Area
<- augment(m1)
m1_aug
m1_aug
## # A tibble: 2,925 x 8
## Sale_Price Lot_Area .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 215000 31770 239474. -24474. 0.00299 75661. 0.000158 -0.324
## 2 105000 11622 184551. -79551. 0.000355 75648. 0.000196 -1.05
## 3 172000 14267 191761. -19761. 0.000440 75661. 0.0000150 -0.261
## 4 244000 11160 183291. 60709. 0.000348 75654. 0.000112 0.803
## 5 189900 13830 190570. -670. 0.000420 75662. 0.0000000165 -0.00885
## 6 195500 9978 180069. 15431. 0.000342 75662. 0.00000712 0.204
## 7 213500 4920 166281. 47219. 0.000494 75657. 0.0000963 0.624
## 8 191500 5005 166513. 24987. 0.000489 75661. 0.0000267 0.330
## 9 236500 5389 167560. 68940. 0.000467 75651. 0.000194 0.912
## 10 189000 7500 173314. 15686. 0.000380 75662. 0.00000818 0.207
## # ... with 2,915 more rows
# Linearity
ggplot(data = m1_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")+
theme_minimal()
<- augment(m1)
m1_aug
m1_aug
## # A tibble: 2,925 x 8
## Sale_Price Lot_Area .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 215000 31770 239474. -24474. 0.00299 75661. 0.000158 -0.324
## 2 105000 11622 184551. -79551. 0.000355 75648. 0.000196 -1.05
## 3 172000 14267 191761. -19761. 0.000440 75661. 0.0000150 -0.261
## 4 244000 11160 183291. 60709. 0.000348 75654. 0.000112 0.803
## 5 189900 13830 190570. -670. 0.000420 75662. 0.0000000165 -0.00885
## 6 195500 9978 180069. 15431. 0.000342 75662. 0.00000712 0.204
## 7 213500 4920 166281. 47219. 0.000494 75657. 0.0000963 0.624
## 8 191500 5005 166513. 24987. 0.000489 75661. 0.0000267 0.330
## 9 236500 5389 167560. 68940. 0.000467 75651. 0.000194 0.912
## 10 189000 7500 173314. 15686. 0.000380 75662. 0.00000818 0.207
## # ... with 2,915 more rows
# Linearity
ggplot(data = m1_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")+
theme_minimal()
Model 1 fails the condition of linearity.
# Nearly normal residuals
ggplot(data = m1_aug, aes(x = .resid)) +
geom_histogram(color='red',fill='skyblue') +
xlab("Residuals")+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Model 1 also fails the condition of normality of residuals.
# constant variability
check_model(m1)
check_heteroscedasticity(m1)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
Sale Price ~ Garage Area
# model diagnostics
<- augment(m2)
m2_aug
m2_aug
## # A tibble: 2,925 x 8
## Sale_Price Garage_Area .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 215000 528 193802. 21198. 0.000366 59846. 2.30e-5 0.354
## 2 105000 730 241864. -136864. 0.000840 59794. 2.20e-3 -2.29
## 3 172000 312 142408. 29592. 0.000532 59845. 6.52e-5 0.495
## 4 244000 522 192374. 51626. 0.000361 59840. 1.34e-4 0.863
## 5 189900 482 182857. 7043. 0.000343 59848. 2.38e-6 0.118
## 6 195500 470 180002. 15498. 0.000342 59847. 1.15e-5 0.259
## 7 213500 582 206650. 6850. 0.000433 59848. 2.84e-6 0.114
## 8 191500 506 188567. 2933. 0.000351 59848. 4.21e-7 0.0490
## 9 236500 608 212836. 23664. 0.000481 59846. 3.76e-5 0.396
## 10 189000 442 173340. 15660. 0.000348 59847. 1.19e-5 0.262
## # ... with 2,915 more rows
# Linearity
ggplot(data = m2_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")+
theme_minimal()
# model diagnostics
<- augment(m2)
m2_aug
m2_aug
## # A tibble: 2,925 x 8
## Sale_Price Garage_Area .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 215000 528 193802. 21198. 0.000366 59846. 2.30e-5 0.354
## 2 105000 730 241864. -136864. 0.000840 59794. 2.20e-3 -2.29
## 3 172000 312 142408. 29592. 0.000532 59845. 6.52e-5 0.495
## 4 244000 522 192374. 51626. 0.000361 59840. 1.34e-4 0.863
## 5 189900 482 182857. 7043. 0.000343 59848. 2.38e-6 0.118
## 6 195500 470 180002. 15498. 0.000342 59847. 1.15e-5 0.259
## 7 213500 582 206650. 6850. 0.000433 59848. 2.84e-6 0.114
## 8 191500 506 188567. 2933. 0.000351 59848. 4.21e-7 0.0490
## 9 236500 608 212836. 23664. 0.000481 59846. 3.76e-5 0.396
## 10 189000 442 173340. 15660. 0.000348 59847. 1.19e-5 0.262
## # ... with 2,915 more rows
# Linearity
ggplot(data = m2_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")+
theme_minimal()
# nearly normal
ggplot(data = m2_aug, aes(x = .resid)) +
geom_histogram(color='red',fill='skyblue') +
xlab("Residuals")+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# constant variability
check_model(m2)
check_heteroscedasticity(m2)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
Sale Price ~ Living Area
# model diagnostics
<- augment(m3)
m3_aug
m3_aug
## # A tibble: 2,925 x 8
## Sale_Price Gr_Liv_Area .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 215000 1656 199243. 15757. 0.000380 54577. 0.0000158 0.289
## 2 105000 896 110911. -5911. 0.000859 54577. 0.00000505 -0.108
## 3 172000 1329 161237. 10763. 0.000381 54577. 0.00000742 0.197
## 4 244000 2110 252009. -8009. 0.000891 54577. 0.00000961 -0.147
## 5 189900 1629 196104. -6204. 0.000368 54577. 0.00000238 -0.114
## 6 195500 1604 193199. 2301. 0.000359 54577. 0.000000320 0.0422
## 7 213500 1338 162283. 51217. 0.000377 54569. 0.000166 0.939
## 8 191500 1280 155542. 35958. 0.000408 54573. 0.0000887 0.659
## 9 236500 1616 194594. 41906. 0.000363 54572. 0.000107 0.768
## 10 189000 1804 216444. -27444. 0.000481 54575. 0.0000609 -0.503
## # ... with 2,915 more rows
# Linearity
ggplot(data = m3_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")+
theme_minimal()
# model diagnostics
<- augment(m3)
m3_aug
m3_aug
## # A tibble: 2,925 x 8
## Sale_Price Gr_Liv_Area .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 215000 1656 199243. 15757. 0.000380 54577. 0.0000158 0.289
## 2 105000 896 110911. -5911. 0.000859 54577. 0.00000505 -0.108
## 3 172000 1329 161237. 10763. 0.000381 54577. 0.00000742 0.197
## 4 244000 2110 252009. -8009. 0.000891 54577. 0.00000961 -0.147
## 5 189900 1629 196104. -6204. 0.000368 54577. 0.00000238 -0.114
## 6 195500 1604 193199. 2301. 0.000359 54577. 0.000000320 0.0422
## 7 213500 1338 162283. 51217. 0.000377 54569. 0.000166 0.939
## 8 191500 1280 155542. 35958. 0.000408 54573. 0.0000887 0.659
## 9 236500 1616 194594. 41906. 0.000363 54572. 0.000107 0.768
## 10 189000 1804 216444. -27444. 0.000481 54575. 0.0000609 -0.503
## # ... with 2,915 more rows
# Linearity
ggplot(data = m3_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")+
theme_minimal()
# nearly normal residuals
ggplot(data = m3_aug, aes(x = .resid)) +
geom_histogram(color='red',fill='skyblue') +
xlab("Residuals")+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# constant variability
check_model(m3)
check_heteroscedasticity(m3)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
Based on the diagnostics and assessment, model 2 and model 3 both have normal distribution of residuals and residuals are linearly distributed along zero, but model 3 is our best fit model because of higher adjusted R2 value.
Data Transformation and Re-fitting the Best Model
Now we will be log transforming our response variable Sale_price.
<-ames %>% mutate(sale_norm=log(Sale_Price))
ames
ggplot(ames)+
geom_histogram(aes(sale_norm),color='red',fill='skyblue')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
%>%
ames summarise(cor(Gr_Liv_Area, sale_norm))
## # A tibble: 1 x 1
## `cor(Gr_Liv_Area, sale_norm)`
## <dbl>
## 1 0.711
<- lm(sale_norm ~ Gr_Liv_Area, data = ames)
m4 tidy(m4)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 11.1 0.0171 653. 0
## 2 Gr_Liv_Area 0.000594 0.0000109 54.7 0
glance(m4)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.506 0.506 0.285 2994. 0 1 -482. 970. 988.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
\[ \hat{y} = 11.1 + 0.000594 \times GrLivArea\]
This equation tells us two things:
For houses with a
Gr_Liv_Area
of 0 we expect their mean sale price to be 11.1 dollars.For every 1 square ft increase in
Gr_Liv_Area
, we expect an sale price to increase on an average 0.000594 dollars
For this model, 50.6% of the variability in Sale_Price
is explained by Gr_Liv_Area
.
Running Diagnostics for this model:
<- augment(m4)
m4_aug
# Checking for Linearity
ggplot(data = m4_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")+
theme_minimal()
# Checking for Nearly normal residuals
ggplot(data = m4_aug, aes(x = .resid)) +
geom_histogram(color='red',fill='skyblue') +
xlab("Residuals")+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Checking for constant variability
check_model(m4)
check_heteroscedasticity(m4)
## Warning: Heteroscedasticity (non-constant error variance) detected (p < .001).
Even though this model has a slighter lower adjusted R2, the linearity, homogeneity of variance and normality of residuals are slightly better fitted.