First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
library(ggplot2)
library(devtools)
## Warning: package 'devtools' was built under R version 3.4.1
library(statsr)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# type your code for Question 1 here, and Knit
#Age is defined as 2017- ames_train$Year.Built
ames_train$age=(2017-ames_train$Year.Built)
hist(ames_train$age, breaks = 30, col='Green', main="Distribution of Houing age in 30 bins")
plot(density(ames_train$age))
summary(ames_train$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 16.0 42.0 44.8 62.0 145.0
fivenum(ames_train$age)
## [1] 7 16 42 62 145
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
describe(ames_train$age)
## ames_train$age
## n missing distinct Info Mean Gmd .05 .10
## 1000 0 102 1 44.8 33.14 10 11
## .25 .50 .75 .90 .95
## 16 42 62 92 98
##
## lowest : 7 8 9 10 11, highest: 122 127 132 137 145
library(pastecs)
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:survival':
##
## aml
## The following object is masked from 'package:lattice':
##
## melanoma
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
stat.desc(ames_train$age)
## nbr.val nbr.null nbr.na min max
## 1.000000e+03 0.000000e+00 0.000000e+00 7.000000e+00 1.450000e+02
## range sum median mean SE.mean
## 1.380000e+02 4.479700e+04 4.200000e+01 4.479700e+01 9.372172e-01
## CI.mean.0.95 var std.dev coef.var
## 1.839140e+00 8.783762e+02 2.963741e+01 6.615937e-01
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(ames_train$age)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1000 44.8 29.64 42 41.83 35.58 7 145 138 0.66 -0.32
## se
## X1 0.94
library(moments)
skewness(ames_train$age)
## [1] 0.6581401
kurtosis(ames_train$age)
## [1] 2.686274
#function for mode.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
v <- ames_train$age
result <- getmode(v)
print(result)
## [1] 12
Answer for Q1
Summary statistics were calculated for the age variable using different fucntions in several R packages. Differing mode, median and mean values, that is, a mode of 12, a mean age of 44.8 and a median age of 42.0 were observed. As the mean (44.8) was larger than the median age of 42.0, a positively skewed (Right-skewed) distribution is seen. This was confirmed by the skewness score of 0.66 which suggests a highly positively skewed distribution.For the kurtosis, we have 2.69 implying that the distribution of the data is platykurtic, since the computed value is less than 3.The distribution as such is not normal.The density plot indicated that the distribution is multi modal with two peaks.
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# type your code for Question 2 here, and Knit
#box plots showing housing prices versus neighborhoods
ames_train %>% select(Neighborhood, price) %>% ggplot(aes(x=reorder(Neighborhood,price, FUN = median), y = price)) + geom_boxplot() + xlab('Neighborhood')+ggtitle("Price vs Neighborhood ")
ames_train %>%
group_by(Neighborhood) %>%
summarise(median_price = median(price), sd_price = sd(price)) %>%
arrange(desc(median_price))
## # A tibble: 27 x 3
## Neighborhood median_price sd_price
## <fctr> <dbl> <dbl>
## 1 StoneBr 340691.5 123459.10
## 2 NridgHt 336860.0 105088.90
## 3 NoRidge 290000.0 35888.97
## 4 GrnHill 280000.0 70710.68
## 5 Timber 232500.0 84029.57
## 6 Somerst 221650.0 65199.49
## 7 Greens 212625.0 29063.42
## 8 Veenker 205750.0 72545.41
## 9 Crawfor 205000.0 71267.56
## 10 CollgCr 195800.0 52786.08
## # ... with 17 more rows
Answer for Q2
In a non-normal, positively skewed distribution, median may be a better measure of central tendency than the mean which is ideal in case of a symmetric distribution. Therefore, as a more robust measure, median was chosen. In general, housing prices are known to be distributed with positive skew and median prices are frequently used. Most Expensive = Stone Brook neighborhood (median price is 340691.5 USD). Least expensive = Meadow Village (median price is 85750.0 USD) . Most heterogeneous = Stone Brook (standard deviation is 123459.10 USD)
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# type your code for Question 3 here, and Knit
library(Rcpp)
require(Amelia)
## Loading required package: Amelia
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(ames_train, main="Missing Map")
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'imputations'.
#missing data proportions
sapply(ames_train, function(df) {
+ sum(is.na(df)==TRUE)/ length(df)
})
## PID area price MS.SubClass
## 0.000 0.000 0.000 0.000
## MS.Zoning Lot.Frontage Lot.Area Street
## 0.000 0.167 0.000 0.000
## Alley Lot.Shape Land.Contour Utilities
## 0.933 0.000 0.000 0.000
## Lot.Config Land.Slope Neighborhood Condition.1
## 0.000 0.000 0.000 0.000
## Condition.2 Bldg.Type House.Style Overall.Qual
## 0.000 0.000 0.000 0.000
## Overall.Cond Year.Built Year.Remod.Add Roof.Style
## 0.000 0.000 0.000 0.000
## Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type
## 0.000 0.000 0.000 0.000
## Mas.Vnr.Area Exter.Qual Exter.Cond Foundation
## 0.007 0.000 0.000 0.000
## Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## 0.021 0.021 0.021 0.021
## BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF
## 0.001 0.021 0.001 0.001
## Total.Bsmt.SF Heating Heating.QC Central.Air
## 0.001 0.000 0.000 0.000
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF
## 0.000 0.000 0.000 0.000
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
## 0.001 0.001 0.000 0.000
## Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd
## 0.000 0.000 0.000 0.000
## Functional Fireplaces Fireplace.Qu Garage.Type
## 0.000 0.000 0.491 0.046
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
## 0.048 0.046 0.001 0.001
## Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 0.047 0.047 0.000 0.000
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch
## 0.000 0.000 0.000 0.000
## Pool.Area Pool.QC Fence Misc.Feature
## 0.000 0.997 0.798 0.971
## Misc.Val Mo.Sold Yr.Sold Sale.Type
## 0.000 0.000 0.000 0.000
## Sale.Condition age
## 0.000 0.000
#missing data in numbers
sapply(ames_train, function(x) sum(is.na(x)))
## PID area price MS.SubClass
## 0 0 0 0
## MS.Zoning Lot.Frontage Lot.Area Street
## 0 167 0 0
## Alley Lot.Shape Land.Contour Utilities
## 933 0 0 0
## Lot.Config Land.Slope Neighborhood Condition.1
## 0 0 0 0
## Condition.2 Bldg.Type House.Style Overall.Qual
## 0 0 0 0
## Overall.Cond Year.Built Year.Remod.Add Roof.Style
## 0 0 0 0
## Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type
## 0 0 0 0
## Mas.Vnr.Area Exter.Qual Exter.Cond Foundation
## 7 0 0 0
## Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1
## 21 21 21 21
## BsmtFin.SF.1 BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF
## 1 21 1 1
## Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 0 0
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF
## 0 0 0 0
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
## 1 1 0 0
## Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd
## 0 0 0 0
## Functional Fireplaces Fireplace.Qu Garage.Type
## 0 0 491 46
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
## 48 46 1 1
## Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 47 47 0 0
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch
## 0 0 0 0
## Pool.Area Pool.QC Fence Misc.Feature
## 0 997 798 971
## Misc.Val Mo.Sold Yr.Sold Sale.Type
## 0 0 0 0
## Sale.Condition age
## 0 0
max(colSums(is.na(ames_train)))
## [1] 997
summary(ames_train$Pool.QC)
## Ex Fa Gd TA NA's
## 1 1 1 0 997
Answer for Q3
The variable Pool.QC with 997 (out of 1000)missing values has largest number of missing values(99.7%).This measures quality of Pools. There are only three houses in the dataset that have a pool.So the non-respondents pertain to those not having pools in their houses.
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
# type your code for Question 4 here, and Knit
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 82 variables:
## $ PID : int 909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
## $ area : int 856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ MS.SubClass : int 30 120 30 70 60 85 20 20 20 180 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
## $ Lot.Frontage : int NA 42 60 80 70 64 60 53 74 35 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
## $ Overall.Qual : int 6 5 5 4 8 7 4 7 5 6 ...
## $ Overall.Cond : int 6 5 9 8 6 5 4 5 6 5 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add : int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
## $ Mas.Vnr.Area : int 0 149 0 0 0 500 0 20 0 76 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
## $ BsmtFin.SF.1 : int 238 552 737 0 643 0 0 0 647 467 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
## $ BsmtFin.SF.2 : int 0 393 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 618 104 100 405 167 0 936 1146 217 80 ...
## $ Total.Bsmt.SF : int 856 1049 837 405 810 0 936 1146 864 547 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 856 1049 1001 717 810 495 936 1246 889 1072 ...
## $ X2nd.Flr.SF : int 0 0 0 322 855 1427 0 0 0 0 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : int 1 1 0 0 1 0 0 0 0 1 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 2 1 1 2 3 1 2 1 1 ...
## $ Half.Bath : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
## $ TotRms.AbvGrd : int 4 5 5 6 6 7 4 5 6 5 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
## $ Fireplaces : int 1 0 0 0 0 1 0 1 0 0 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
## $ Garage.Yr.Blt : int 1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
## $ Garage.Cars : int 2 1 1 1 2 2 2 2 2 2 ...
## $ Garage.Area : int 399 266 216 281 528 672 576 428 484 525 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
## $ Wood.Deck.SF : int 0 0 154 0 0 0 0 100 0 0 ...
## $ Open.Porch.SF : int 0 105 0 0 45 0 32 24 0 44 ...
## $ Enclosed.Porch : int 0 0 42 168 0 177 112 0 0 0 ...
## $ X3Ssn.Porch : int 0 0 86 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 166 0 0 111 0 0 0 0 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 3 2 11 5 11 7 2 3 4 5 ...
## $ Yr.Sold : int 2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
## $ age : num 78 33 87 117 16 14 64 10 33 12 ...
vars <- names(ames_train) %in% c("price","Lot.Area","Land.Slope","Year.Built","Year.Remod.Add","Bedroom.AbvGr")
ames2 <- ames_train[vars]
names(ames2)
## [1] "price" "Lot.Area" "Land.Slope" "Year.Built"
## [5] "Year.Remod.Add" "Bedroom.AbvGr"
str(ames2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 6 variables:
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add: int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
ames2$Land.Slope<-as.integer(ames2$Land.Slope)
str(ames2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 6 variables:
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Land.Slope : int 1 1 1 1 1 1 2 1 1 1 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add: int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
#missing data in numbers
sapply(ames2, function(x) sum(is.na(x)))
## price Lot.Area Land.Slope Year.Built Year.Remod.Add
## 0 0 0 0 0
## Bedroom.AbvGr
## 0
describe(ames2)
## vars n mean sd median trimmed mad
## price 1 1000 181190.08 81909.79 159467.0 169968.85 52681.23
## Lot.Area 2 1000 10352.41 9827.84 9317.0 9469.61 3138.66
## Land.Slope 3 1000 1.04 0.23 1.0 1.00 0.00
## Year.Built 4 1000 1972.20 29.64 1975.0 1975.17 35.58
## Year.Remod.Add 5 1000 1984.34 20.56 1992.5 1985.72 21.50
## Bedroom.AbvGr 6 1000 2.81 0.83 3.0 2.78 0.00
## min max range skew kurtosis se
## price 12789 615000 602211 1.63 3.72 2590.21
## Lot.Area 1470 215245 213775 13.22 241.18 310.78
## Land.Slope 1 3 2 5.72 35.53 0.01
## Year.Built 1872 2010 138 -0.66 -0.32 0.94
## Year.Remod.Add 1950 2010 60 -0.44 -1.32 0.65
## Bedroom.AbvGr 0 6 6 0.34 1.61 0.03
#fitting a multiple linear regression model measuring AIC. Both backward elimination and forward selection were incorporated to determine the optimal model.
allmodel <- lm(log(ames2$price) ~ ., data = ames2)
bestmodel <- step(allmodel,direction = "both")
## Start: AIC=-2530.07
## log(ames2$price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## - Land.Slope 1 0.0465 78.750 -2531.5
## <none> 78.704 -2530.1
## - Bedroom.AbvGr 1 5.1713 83.875 -2468.4
## - Lot.Area 1 5.3617 84.065 -2466.2
## - Year.Remod.Add 1 12.4099 91.114 -2385.7
## - Year.Built 1 19.6773 98.381 -2308.9
##
## Step: AIC=-2531.47
## log(ames2$price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 78.750 -2531.5
## + Land.Slope 1 0.0465 78.704 -2530.1
## - Bedroom.AbvGr 1 5.1250 83.875 -2470.4
## - Lot.Area 1 7.1223 85.872 -2446.9
## - Year.Remod.Add 1 12.3912 91.141 -2387.3
## - Year.Built 1 19.6618 98.412 -2310.6
summary(bestmodel)
##
## Call:
## lm(formula = log(ames2$price) ~ Lot.Area + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr, data = ames2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.09102 -0.16488 -0.02135 0.16605 1.13359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.385e+01 8.633e-01 -16.049 < 2e-16 ***
## Lot.Area 8.697e-06 9.168e-07 9.486 < 2e-16 ***
## Year.Built 6.019e-03 3.819e-04 15.761 < 2e-16 ***
## Year.Remod.Add 6.889e-03 5.505e-04 12.512 < 2e-16 ***
## Bedroom.AbvGr 8.704e-02 1.082e-02 8.047 2.4e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2813 on 995 degrees of freedom
## Multiple R-squared: 0.5544, Adjusted R-squared: 0.5526
## F-statistic: 309.5 on 4 and 995 DF, p-value: < 2.2e-16
anova(bestmodel)
## Analysis of Variance Table
##
## Response: log(ames2$price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.392 131.306 < 2.2e-16 ***
## Year.Built 1 69.442 69.442 877.389 < 2.2e-16 ***
## Year.Remod.Add 1 13.018 13.018 164.483 < 2.2e-16 ***
## Bedroom.AbvGr 1 5.125 5.125 64.754 2.404e-15 ***
## Residuals 995 78.750 0.079
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(allmodel)
## [1] 344.166
BIC(bestmodel)
## [1] 337.8492
Answer for Q4
A new dataframe comprised of the relevant six variables was created named ames2. The log-transformed price variable was the dependent variable and the others were covariates. Step wise regression was applied specifying both the forward selection and backward elimination methods to select the optimal model. The best model contained four significant covariates (all except Land.Slope) and explained a total variance of around 55.26% (adjusted R-squared).
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# type your code for Question 5 here, and Knit
ames3 <- ames_train %>%
select(PID, price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
plm = lm(log(price) ~ Lot.Area
+ Land.Slope
+ Year.Built
+ Year.Remod.Add
+ Bedroom.AbvGr, data = na.omit(ames3))
bplm <- stepAIC(plm, trace = T, k = log(nrow(na.omit(ames3))))
## Start: AIC=-2511.42
## log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 77.322 -2511.4
## - Land.Slope 2 1.4281 78.750 -2506.9
## - Bedroom.AbvGr 1 5.0628 82.385 -2454.9
## - Lot.Area 1 6.7292 84.051 -2434.9
## - Year.Remod.Add 1 11.9642 89.286 -2374.5
## - Year.Built 1 19.8546 97.177 -2289.8
BIC(bplm)
## [1] 333.3642
residuals <- ames3 %>%
select(PID, price) %>%
mutate(log_price = log(price))
residuals$predicted <- predict(bplm)
residuals$residuals <- residuals(bplm)
residuals$squared_residuals <- residuals(bplm)^2
residuals <- residuals %>%
arrange(desc(squared_residuals))
residuals[c(1:5),]
## # A tibble: 5 x 6
## PID price log_price predicted residuals squared_residuals
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 902207130 12789 9.456341 11.54419 -2.0878529 4.3591298
## 2 534427010 84900 11.349229 12.37473 -1.0254998 1.0516498
## 3 911175430 35311 10.471950 11.47230 -1.0003537 1.0007075
## 4 528164060 615000 13.329378 12.33487 0.9945048 0.9890399
## 5 534450090 39300 10.578980 11.55148 -0.9724980 0.9457524
ames_train[which(ames_train$PID == 902207130),]
## # A tibble: 1 x 82
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <int> <int> <int> <int> <fctr> <int> <int> <fctr>
## 1 902207130 832 12789 30 RM 68 9656 Pave
## # ... with 74 more variables: Alley <fctr>, Lot.Shape <fctr>,
## # Land.Contour <fctr>, Utilities <fctr>, Lot.Config <fctr>,
## # Land.Slope <fctr>, Neighborhood <fctr>, Condition.1 <fctr>,
## # Condition.2 <fctr>, Bldg.Type <fctr>, House.Style <fctr>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fctr>, Roof.Matl <fctr>,
## # Exterior.1st <fctr>, Exterior.2nd <fctr>, Mas.Vnr.Type <fctr>,
## # Mas.Vnr.Area <int>, Exter.Qual <fctr>, Exter.Cond <fctr>,
## # Foundation <fctr>, Bsmt.Qual <fctr>, Bsmt.Cond <fctr>,
## # Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>, BsmtFin.SF.1 <int>,
## # BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## # Total.Bsmt.SF <int>, Heating <fctr>, Heating.QC <fctr>,
## # Central.Air <fctr>, Electrical <fctr>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## # Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fctr>,
## # TotRms.AbvGrd <int>, Functional <fctr>, Fireplaces <int>,
## # Fireplace.Qu <fctr>, Garage.Type <fctr>, Garage.Yr.Blt <int>,
## # Garage.Finish <fctr>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fctr>, Garage.Cond <fctr>, Paved.Drive <fctr>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>,
## # Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>, Misc.Val <int>,
## # Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>, Sale.Condition <fctr>,
## # age <dbl>
model.resid<-bplm$residuals
plot(model.resid, main="Residuals",pch=20)
abline(0,0, lwd=2, col="red")
qqnorm(residuals(bplm))
qqline(residuals(bplm))
Answer for Q5
Residuals were exracted from the fitted model.The PID pertaining to the largest squared residual was identified which was “9022 07130”. PID “902207130” is more than 4 times bigger than the second largest squared residual. It is therefore an extreme outlier. This house had the lowest price. The age of the house was 91 years old and waslocated in old Town neighborhood. Overall quality was poor,square footage was below average.These all factors may have made the house unattractive and hence the Squared residuals were the largest.
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# type your code for Question 6 here, and Knit
logmodel <- lm(log(ames3$price)~log(ames3$Lot.Area)+ames3$Land.Slope+ames3$Year.Built+
ames3$Year.Remod.Add+ames3$Bedroom.AbvGr,data=ames3)
bestlogmodel <- stepAIC(logmodel, trace = T, k = log(nrow(na.omit(ames3))))
## Start: AIC=-2615.14
## log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Land.Slope + ames3$Year.Built +
## ames3$Year.Remod.Add + ames3$Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## - ames3$Land.Slope 2 0.4429 70.147 -2622.6
## <none> 69.704 -2615.1
## - ames3$Bedroom.AbvGr 1 2.2002 71.904 -2591.0
## - ames3$Year.Remod.Add 1 11.8182 81.522 -2465.4
## - log(ames3$Lot.Area) 1 14.3474 84.051 -2434.9
## - ames3$Year.Built 1 19.4083 89.112 -2376.4
##
## Step: AIC=-2622.62
## log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Year.Built + ames3$Year.Remod.Add +
## ames3$Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 70.147 -2622.6
## - ames3$Bedroom.AbvGr 1 2.0816 72.228 -2600.3
## - ames3$Year.Remod.Add 1 11.9455 82.092 -2472.3
## - log(ames3$Lot.Area) 1 15.7256 85.872 -2427.3
## - ames3$Year.Built 1 19.3032 89.450 -2386.4
anova(bplm)
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.392 133.462 < 2.2e-16 ***
## Land.Slope 2 2.424 1.212 15.564 2.211e-07 ***
## Year.Built 1 68.991 68.991 886.012 < 2.2e-16 ***
## Year.Remod.Add 1 12.535 12.535 160.982 < 2.2e-16 ***
## Bedroom.AbvGr 1 5.063 5.063 65.018 2.124e-15 ***
## Residuals 993 77.322 0.078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(bestlogmodel)
## Analysis of Variance Table
##
## Response: log(ames3$price)
## Df Sum Sq Mean Sq F value Pr(>F)
## log(ames3$Lot.Area) 1 24.338 24.338 345.224 < 2.2e-16 ***
## ames3$Year.Built 1 67.894 67.894 963.045 < 2.2e-16 ***
## ames3$Year.Remod.Add 1 12.267 12.267 174.000 < 2.2e-16 ***
## ames3$Bedroom.AbvGr 1 2.082 2.082 29.526 6.939e-08 ***
## Residuals 995 70.147 0.070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(bestlogmodel)
## [1] 222.1599
BIC(bplm)
## [1] 333.3642
(BIC(bplm)-BIC(bestlogmodel))
## [1] 111.2043
ggplot(data = ames3, aes(x = Lot.Area)) +
geom_histogram() +
ggtitle("Lot.Area without log-transformation")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = ames3, aes(x = log(Lot.Area))) +
geom_histogram() +
ggtitle("Lot.Area with log transformation")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Answer for Q6
Log transformation of the Lot.Area variable greatly improved the linear model. This is because the original right-skewness of inherent in the Lot.Area variable disappeared with the log transformation and turned into a near-normal distribution.The BIC score of the model also greatly improved comared to the previous model by 111.2043 (333.3642-222.1599). Diagnostic plots(residual vs fitted,Normal Q-Q,Scale Location,Residulas vs Leverage) showed that linearity of the model is better achieved with the log transformed data.This is shown in the plots and in summary for model fit in the next question.
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
# type your code for Question 7 here, and Knit
op <- par(mfrow = c(2, 3))
plot(bplm)
plot(bestlogmodel)
summary(bplm)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = na.omit(ames3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
summary(bestlogmodel)
##
## Call:
## lm(formula = log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Year.Built +
## ames3$Year.Remod.Add + ames3$Bedroom.AbvGr, data = ames3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14609 -0.15825 -0.01477 0.15354 1.01578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
## log(ames3$Lot.Area) 2.471e-01 1.654e-02 14.935 < 2e-16 ***
## ames3$Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
## ames3$Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
## ames3$Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
## F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
anova(bplm)
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## Lot.Area 1 10.392 10.392 133.462 < 2.2e-16 ***
## Land.Slope 2 2.424 1.212 15.564 2.211e-07 ***
## Year.Built 1 68.991 68.991 886.012 < 2.2e-16 ***
## Year.Remod.Add 1 12.535 12.535 160.982 < 2.2e-16 ***
## Bedroom.AbvGr 1 5.063 5.063 65.018 2.124e-15 ***
## Residuals 993 77.322 0.078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(bestlogmodel)
## Analysis of Variance Table
##
## Response: log(ames3$price)
## Df Sum Sq Mean Sq F value Pr(>F)
## log(ames3$Lot.Area) 1 24.338 24.338 345.224 < 2.2e-16 ***
## ames3$Year.Built 1 67.894 67.894 963.045 < 2.2e-16 ***
## ames3$Year.Remod.Add 1 12.267 12.267 174.000 < 2.2e-16 ***
## ames3$Bedroom.AbvGr 1 2.082 2.082 29.526 6.939e-08 ***
## Residuals 995 70.147 0.070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer for Q7
Residual Q-Q plots of both models indicate that errors are normally distributed with a constant variance (homoscedasticity). Homogeneity of the residuals seems to be accomplished by both models.Independence of residuals is better seen in the second model according to the plots.Outliers affecting the normality can be seen in both models. According to the resiuduals versus leverage plots, the influence of leverage points seem to be less in the second model.The adjusted R-squared value is also higher in the second model indicating that it gives a better fit. In summary,the log transformation of Lot.Area improves the model assumptions for multiple linear regression models.