First, let us load the data and necessary packages:

load("ames_train.Rdata")
library(MASS)
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.4.1

library(ggplot2)
library(devtools)

## Warning: package 'devtools' was built under R version 3.4.1

library(statsr)

1

Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.

# type your code for Question 1 here, and Knit
#Age is defined as 2017- ames_train$Year.Built
ames_train$age=(2017-ames_train$Year.Built)


hist(ames_train$age, breaks = 30, col='Green', main="Distribution of Houing age in 30 bins")

plot(density(ames_train$age))

summary(ames_train$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     7.0    16.0    42.0    44.8    62.0   145.0

fivenum(ames_train$age)

## [1]   7  16  42  62 145

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

describe(ames_train$age)

## ames_train$age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1000        0      102        1     44.8    33.14       10       11 
##      .25      .50      .75      .90      .95 
##       16       42       62       92       98 
## 
## lowest :   7   8   9  10  11, highest: 122 127 132 137 145

library(pastecs)

## Loading required package: boot

## 
## Attaching package: 'boot'

## The following object is masked from 'package:survival':
## 
##     aml

## The following object is masked from 'package:lattice':
## 
##     melanoma

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

stat.desc(ames_train$age)

##      nbr.val     nbr.null       nbr.na          min          max 
## 1.000000e+03 0.000000e+00 0.000000e+00 7.000000e+00 1.450000e+02 
##        range          sum       median         mean      SE.mean 
## 1.380000e+02 4.479700e+04 4.200000e+01 4.479700e+01 9.372172e-01 
## CI.mean.0.95          var      std.dev     coef.var 
## 1.839140e+00 8.783762e+02 2.963741e+01 6.615937e-01

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:boot':
## 
##     logit

## The following object is masked from 'package:Hmisc':
## 
##     describe

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

describe(ames_train$age)

##    vars    n mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 1000 44.8 29.64     42   41.83 35.58   7 145   138 0.66    -0.32
##      se
## X1 0.94

library(moments)
skewness(ames_train$age)

## [1] 0.6581401

kurtosis(ames_train$age)

## [1] 2.686274

#function for mode.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

v <- ames_train$age

result <- getmode(v)
print(result)

## [1] 12

Answer for Q1

Summary statistics were calculated for the age variable using different fucntions in several R packages. Differing mode, median and mean values, that is, a mode of 12, a mean age of 44.8 and a median age of 42.0 were observed. As the mean (44.8) was larger than the median age of 42.0, a positively skewed (Right-skewed) distribution is seen. This was confirmed by the skewness score of 0.66 which suggests a highly positively skewed distribution.For the kurtosis, we have 2.69 implying that the distribution of the data is platykurtic, since the computed value is less than 3.The distribution as such is not normal.The density plot indicated that the distribution is multi modal with two peaks.

2

The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.

# type your code for Question 2 here, and Knit
#box plots showing housing prices versus neighborhoods
ames_train %>% select(Neighborhood, price) %>% ggplot(aes(x=reorder(Neighborhood,price, FUN = median), y = price)) + geom_boxplot()  + xlab('Neighborhood')+ggtitle("Price vs Neighborhood ")

ames_train %>%
  group_by(Neighborhood) %>%
  summarise(median_price = median(price), sd_price = sd(price)) %>%
  arrange(desc(median_price))

## # A tibble: 27 x 3
##    Neighborhood median_price  sd_price
##          <fctr>        <dbl>     <dbl>
##  1      StoneBr     340691.5 123459.10
##  2      NridgHt     336860.0 105088.90
##  3      NoRidge     290000.0  35888.97
##  4      GrnHill     280000.0  70710.68
##  5       Timber     232500.0  84029.57
##  6      Somerst     221650.0  65199.49
##  7       Greens     212625.0  29063.42
##  8      Veenker     205750.0  72545.41
##  9      Crawfor     205000.0  71267.56
## 10      CollgCr     195800.0  52786.08
## # ... with 17 more rows

Answer for Q2

In a non-normal, positively skewed distribution, median may be a better measure of central tendency than the mean which is ideal in case of a symmetric distribution. Therefore, as a more robust measure, median was chosen. In general, housing prices are known to be distributed with positive skew and median prices are frequently used. Most Expensive = Stone Brook neighborhood (median price is 340691.5 USD). Least expensive = Meadow Village (median price is 85750.0 USD) . Most heterogeneous = Stone Brook (standard deviation is 123459.10 USD)

3

Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

# type your code for Question 3 here, and Knit
library(Rcpp)
require(Amelia)

## Loading required package: Amelia

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

missmap(ames_train, main="Missing Map")

## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used

## Warning: Unknown or uninitialised column: 'arguments'.

## Warning: Unknown or uninitialised column: 'arguments'.

## Warning: Unknown or uninitialised column: 'imputations'.

#missing data proportions
sapply(ames_train, function(df) {
  + sum(is.na(df)==TRUE)/ length(df)
})

##             PID            area           price     MS.SubClass 
##           0.000           0.000           0.000           0.000 
##       MS.Zoning    Lot.Frontage        Lot.Area          Street 
##           0.000           0.167           0.000           0.000 
##           Alley       Lot.Shape    Land.Contour       Utilities 
##           0.933           0.000           0.000           0.000 
##      Lot.Config      Land.Slope    Neighborhood     Condition.1 
##           0.000           0.000           0.000           0.000 
##     Condition.2       Bldg.Type     House.Style    Overall.Qual 
##           0.000           0.000           0.000           0.000 
##    Overall.Cond      Year.Built  Year.Remod.Add      Roof.Style 
##           0.000           0.000           0.000           0.000 
##       Roof.Matl    Exterior.1st    Exterior.2nd    Mas.Vnr.Type 
##           0.000           0.000           0.000           0.000 
##    Mas.Vnr.Area      Exter.Qual      Exter.Cond      Foundation 
##           0.007           0.000           0.000           0.000 
##       Bsmt.Qual       Bsmt.Cond   Bsmt.Exposure  BsmtFin.Type.1 
##           0.021           0.021           0.021           0.021 
##    BsmtFin.SF.1  BsmtFin.Type.2    BsmtFin.SF.2     Bsmt.Unf.SF 
##           0.001           0.021           0.001           0.001 
##   Total.Bsmt.SF         Heating      Heating.QC     Central.Air 
##           0.001           0.000           0.000           0.000 
##      Electrical     X1st.Flr.SF     X2nd.Flr.SF Low.Qual.Fin.SF 
##           0.000           0.000           0.000           0.000 
##  Bsmt.Full.Bath  Bsmt.Half.Bath       Full.Bath       Half.Bath 
##           0.001           0.001           0.000           0.000 
##   Bedroom.AbvGr   Kitchen.AbvGr    Kitchen.Qual   TotRms.AbvGrd 
##           0.000           0.000           0.000           0.000 
##      Functional      Fireplaces    Fireplace.Qu     Garage.Type 
##           0.000           0.000           0.491           0.046 
##   Garage.Yr.Blt   Garage.Finish     Garage.Cars     Garage.Area 
##           0.048           0.046           0.001           0.001 
##     Garage.Qual     Garage.Cond     Paved.Drive    Wood.Deck.SF 
##           0.047           0.047           0.000           0.000 
##   Open.Porch.SF  Enclosed.Porch     X3Ssn.Porch    Screen.Porch 
##           0.000           0.000           0.000           0.000 
##       Pool.Area         Pool.QC           Fence    Misc.Feature 
##           0.000           0.997           0.798           0.971 
##        Misc.Val         Mo.Sold         Yr.Sold       Sale.Type 
##           0.000           0.000           0.000           0.000 
##  Sale.Condition             age 
##           0.000           0.000

#missing data in numbers
sapply(ames_train, function(x) sum(is.na(x)))

##             PID            area           price     MS.SubClass 
##               0               0               0               0 
##       MS.Zoning    Lot.Frontage        Lot.Area          Street 
##               0             167               0               0 
##           Alley       Lot.Shape    Land.Contour       Utilities 
##             933               0               0               0 
##      Lot.Config      Land.Slope    Neighborhood     Condition.1 
##               0               0               0               0 
##     Condition.2       Bldg.Type     House.Style    Overall.Qual 
##               0               0               0               0 
##    Overall.Cond      Year.Built  Year.Remod.Add      Roof.Style 
##               0               0               0               0 
##       Roof.Matl    Exterior.1st    Exterior.2nd    Mas.Vnr.Type 
##               0               0               0               0 
##    Mas.Vnr.Area      Exter.Qual      Exter.Cond      Foundation 
##               7               0               0               0 
##       Bsmt.Qual       Bsmt.Cond   Bsmt.Exposure  BsmtFin.Type.1 
##              21              21              21              21 
##    BsmtFin.SF.1  BsmtFin.Type.2    BsmtFin.SF.2     Bsmt.Unf.SF 
##               1              21               1               1 
##   Total.Bsmt.SF         Heating      Heating.QC     Central.Air 
##               1               0               0               0 
##      Electrical     X1st.Flr.SF     X2nd.Flr.SF Low.Qual.Fin.SF 
##               0               0               0               0 
##  Bsmt.Full.Bath  Bsmt.Half.Bath       Full.Bath       Half.Bath 
##               1               1               0               0 
##   Bedroom.AbvGr   Kitchen.AbvGr    Kitchen.Qual   TotRms.AbvGrd 
##               0               0               0               0 
##      Functional      Fireplaces    Fireplace.Qu     Garage.Type 
##               0               0             491              46 
##   Garage.Yr.Blt   Garage.Finish     Garage.Cars     Garage.Area 
##              48              46               1               1 
##     Garage.Qual     Garage.Cond     Paved.Drive    Wood.Deck.SF 
##              47              47               0               0 
##   Open.Porch.SF  Enclosed.Porch     X3Ssn.Porch    Screen.Porch 
##               0               0               0               0 
##       Pool.Area         Pool.QC           Fence    Misc.Feature 
##               0             997             798             971 
##        Misc.Val         Mo.Sold         Yr.Sold       Sale.Type 
##               0               0               0               0 
##  Sale.Condition             age 
##               0               0

max(colSums(is.na(ames_train)))

## [1] 997

summary(ames_train$Pool.QC)

##   Ex   Fa   Gd   TA NA's 
##    1    1    1    0  997

Answer for Q3

The variable Pool.QC with 997 (out of 1000)missing values has largest number of missing values(99.7%).This measures quality of Pools. There are only three houses in the dataset that have a pool.So the non-respondents pertain to those not having pools in their houses.

4

We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

# type your code for Question 4 here, and Knit
str(ames_train)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  82 variables:
##  $ PID            : int  909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int  856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int  30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int  NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int  6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int  6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int  0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int  238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int  856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int  0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int  4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int  1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int  2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int  0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int  0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int  2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
##  $ age            : num  78 33 87 117 16 14 64 10 33 12 ...

vars <- names(ames_train) %in%    c("price","Lot.Area","Land.Slope","Year.Built","Year.Remod.Add","Bedroom.AbvGr")
 ames2 <- ames_train[vars] 
 names(ames2)

## [1] "price"          "Lot.Area"       "Land.Slope"     "Year.Built"    
## [5] "Year.Remod.Add" "Bedroom.AbvGr"

 str(ames2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  6 variables:
##  $ price         : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ Lot.Area      : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Land.Slope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Year.Built    : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add: int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Bedroom.AbvGr : int  2 2 2 2 3 4 2 2 3 2 ...

 ames2$Land.Slope<-as.integer(ames2$Land.Slope)
 str(ames2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  6 variables:
##  $ price         : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ Lot.Area      : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Land.Slope    : int  1 1 1 1 1 1 2 1 1 1 ...
##  $ Year.Built    : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add: int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Bedroom.AbvGr : int  2 2 2 2 3 4 2 2 3 2 ...

 #missing data in numbers
sapply(ames2, function(x) sum(is.na(x)))

##          price       Lot.Area     Land.Slope     Year.Built Year.Remod.Add 
##              0              0              0              0              0 
##  Bedroom.AbvGr 
##              0

describe(ames2)

##                vars    n      mean       sd   median   trimmed      mad
## price             1 1000 181190.08 81909.79 159467.0 169968.85 52681.23
## Lot.Area          2 1000  10352.41  9827.84   9317.0   9469.61  3138.66
## Land.Slope        3 1000      1.04     0.23      1.0      1.00     0.00
## Year.Built        4 1000   1972.20    29.64   1975.0   1975.17    35.58
## Year.Remod.Add    5 1000   1984.34    20.56   1992.5   1985.72    21.50
## Bedroom.AbvGr     6 1000      2.81     0.83      3.0      2.78     0.00
##                  min    max  range  skew kurtosis      se
## price          12789 615000 602211  1.63     3.72 2590.21
## Lot.Area        1470 215245 213775 13.22   241.18  310.78
## Land.Slope         1      3      2  5.72    35.53    0.01
## Year.Built      1872   2010    138 -0.66    -0.32    0.94
## Year.Remod.Add  1950   2010     60 -0.44    -1.32    0.65
## Bedroom.AbvGr      0      6      6  0.34     1.61    0.03

#fitting a multiple linear regression model measuring AIC. Both backward elimination and forward selection were incorporated to determine the optimal model.
allmodel <- lm(log(ames2$price) ~ ., data = ames2)
bestmodel <- step(allmodel,direction = "both")

## Start:  AIC=-2530.07
## log(ames2$price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + 
##     Bedroom.AbvGr
## 
##                  Df Sum of Sq    RSS     AIC
## - Land.Slope      1    0.0465 78.750 -2531.5
## <none>                        78.704 -2530.1
## - Bedroom.AbvGr   1    5.1713 83.875 -2468.4
## - Lot.Area        1    5.3617 84.065 -2466.2
## - Year.Remod.Add  1   12.4099 91.114 -2385.7
## - Year.Built      1   19.6773 98.381 -2308.9
## 
## Step:  AIC=-2531.47
## log(ames2$price) ~ Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        78.750 -2531.5
## + Land.Slope      1    0.0465 78.704 -2530.1
## - Bedroom.AbvGr   1    5.1250 83.875 -2470.4
## - Lot.Area        1    7.1223 85.872 -2446.9
## - Year.Remod.Add  1   12.3912 91.141 -2387.3
## - Year.Built      1   19.6618 98.412 -2310.6

summary(bestmodel)

## 
## Call:
## lm(formula = log(ames2$price) ~ Lot.Area + Year.Built + Year.Remod.Add + 
##     Bedroom.AbvGr, data = ames2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09102 -0.16488 -0.02135  0.16605  1.13359 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.385e+01  8.633e-01 -16.049  < 2e-16 ***
## Lot.Area        8.697e-06  9.168e-07   9.486  < 2e-16 ***
## Year.Built      6.019e-03  3.819e-04  15.761  < 2e-16 ***
## Year.Remod.Add  6.889e-03  5.505e-04  12.512  < 2e-16 ***
## Bedroom.AbvGr   8.704e-02  1.082e-02   8.047  2.4e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2813 on 995 degrees of freedom
## Multiple R-squared:  0.5544, Adjusted R-squared:  0.5526 
## F-statistic: 309.5 on 4 and 995 DF,  p-value: < 2.2e-16

anova(bestmodel)

## Analysis of Variance Table
## 
## Response: log(ames2$price)
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## Lot.Area         1 10.392  10.392 131.306 < 2.2e-16 ***
## Year.Built       1 69.442  69.442 877.389 < 2.2e-16 ***
## Year.Remod.Add   1 13.018  13.018 164.483 < 2.2e-16 ***
## Bedroom.AbvGr    1  5.125   5.125  64.754 2.404e-15 ***
## Residuals      995 78.750   0.079                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

BIC(allmodel)

## [1] 344.166

BIC(bestmodel)

## [1] 337.8492

Answer for Q4

A new dataframe comprised of the relevant six variables was created named ames2. The log-transformed price variable was the dependent variable and the others were covariates. Step wise regression was applied specifying both the forward selection and backward elimination methods to select the optimal model. The best model contained four significant covariates (all except Land.Slope) and explained a total variance of around 55.26% (adjusted R-squared).

5

Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?

# type your code for Question 5 here, and Knit
ames3 <- ames_train %>%
  select(PID, price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
plm = lm(log(price) ~ Lot.Area 
               + Land.Slope 
               + Year.Built 
               + Year.Remod.Add 
               + Bedroom.AbvGr, data = na.omit(ames3))
bplm <- stepAIC(plm, trace = T, k = log(nrow(na.omit(ames3))))

## Start:  AIC=-2511.42
## log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + 
##     Bedroom.AbvGr
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        77.322 -2511.4
## - Land.Slope      2    1.4281 78.750 -2506.9
## - Bedroom.AbvGr   1    5.0628 82.385 -2454.9
## - Lot.Area        1    6.7292 84.051 -2434.9
## - Year.Remod.Add  1   11.9642 89.286 -2374.5
## - Year.Built      1   19.8546 97.177 -2289.8

BIC(bplm)

## [1] 333.3642

residuals <- ames3 %>%
  select(PID, price) %>%
  mutate(log_price = log(price))

residuals$predicted <- predict(bplm)
residuals$residuals <- residuals(bplm)
residuals$squared_residuals <- residuals(bplm)^2

residuals <- residuals %>%
  arrange(desc(squared_residuals))
residuals[c(1:5),]

## # A tibble: 5 x 6
##         PID  price log_price predicted  residuals squared_residuals
##       <int>  <int>     <dbl>     <dbl>      <dbl>             <dbl>
## 1 902207130  12789  9.456341  11.54419 -2.0878529         4.3591298
## 2 534427010  84900 11.349229  12.37473 -1.0254998         1.0516498
## 3 911175430  35311 10.471950  11.47230 -1.0003537         1.0007075
## 4 528164060 615000 13.329378  12.33487  0.9945048         0.9890399
## 5 534450090  39300 10.578980  11.55148 -0.9724980         0.9457524

ames_train[which(ames_train$PID == 902207130),]

## # A tibble: 1 x 82
##         PID  area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
##       <int> <int> <int>       <int>    <fctr>        <int>    <int> <fctr>
## 1 902207130   832 12789          30        RM           68     9656   Pave
## # ... with 74 more variables: Alley <fctr>, Lot.Shape <fctr>,
## #   Land.Contour <fctr>, Utilities <fctr>, Lot.Config <fctr>,
## #   Land.Slope <fctr>, Neighborhood <fctr>, Condition.1 <fctr>,
## #   Condition.2 <fctr>, Bldg.Type <fctr>, House.Style <fctr>,
## #   Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## #   Year.Remod.Add <int>, Roof.Style <fctr>, Roof.Matl <fctr>,
## #   Exterior.1st <fctr>, Exterior.2nd <fctr>, Mas.Vnr.Type <fctr>,
## #   Mas.Vnr.Area <int>, Exter.Qual <fctr>, Exter.Cond <fctr>,
## #   Foundation <fctr>, Bsmt.Qual <fctr>, Bsmt.Cond <fctr>,
## #   Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>, BsmtFin.SF.1 <int>,
## #   BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## #   Total.Bsmt.SF <int>, Heating <fctr>, Heating.QC <fctr>,
## #   Central.Air <fctr>, Electrical <fctr>, X1st.Flr.SF <int>,
## #   X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## #   Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## #   Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fctr>,
## #   TotRms.AbvGrd <int>, Functional <fctr>, Fireplaces <int>,
## #   Fireplace.Qu <fctr>, Garage.Type <fctr>, Garage.Yr.Blt <int>,
## #   Garage.Finish <fctr>, Garage.Cars <int>, Garage.Area <int>,
## #   Garage.Qual <fctr>, Garage.Cond <fctr>, Paved.Drive <fctr>,
## #   Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## #   X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>,
## #   Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>, Misc.Val <int>,
## #   Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>, Sale.Condition <fctr>,
## #   age <dbl>

model.resid<-bplm$residuals
plot(model.resid, main="Residuals",pch=20)
abline(0,0, lwd=2, col="red")

qqnorm(residuals(bplm))
qqline(residuals(bplm))

Answer for Q5

Residuals were exracted from the fitted model.The PID pertaining to the largest squared residual was identified which was “9022 07130”. PID “902207130” is more than 4 times bigger than the second largest squared residual. It is therefore an extreme outlier. This house had the lowest price. The age of the house was 91 years old and waslocated in old Town neighborhood. Overall quality was poor,square footage was below average.These all factors may have made the house unattractive and hence the Squared residuals were the largest.

6

Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?

# type your code for Question 6 here, and Knit
logmodel <- lm(log(ames3$price)~log(ames3$Lot.Area)+ames3$Land.Slope+ames3$Year.Built+
                 ames3$Year.Remod.Add+ames3$Bedroom.AbvGr,data=ames3)
bestlogmodel <- stepAIC(logmodel, trace = T, k = log(nrow(na.omit(ames3))))

## Start:  AIC=-2615.14
## log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Land.Slope + ames3$Year.Built + 
##     ames3$Year.Remod.Add + ames3$Bedroom.AbvGr
## 
##                        Df Sum of Sq    RSS     AIC
## - ames3$Land.Slope      2    0.4429 70.147 -2622.6
## <none>                              69.704 -2615.1
## - ames3$Bedroom.AbvGr   1    2.2002 71.904 -2591.0
## - ames3$Year.Remod.Add  1   11.8182 81.522 -2465.4
## - log(ames3$Lot.Area)   1   14.3474 84.051 -2434.9
## - ames3$Year.Built      1   19.4083 89.112 -2376.4
## 
## Step:  AIC=-2622.62
## log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Year.Built + ames3$Year.Remod.Add + 
##     ames3$Bedroom.AbvGr
## 
##                        Df Sum of Sq    RSS     AIC
## <none>                              70.147 -2622.6
## - ames3$Bedroom.AbvGr   1    2.0816 72.228 -2600.3
## - ames3$Year.Remod.Add  1   11.9455 82.092 -2472.3
## - log(ames3$Lot.Area)   1   15.7256 85.872 -2427.3
## - ames3$Year.Built      1   19.3032 89.450 -2386.4

anova(bplm)

## Analysis of Variance Table
## 
## Response: log(price)
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## Lot.Area         1 10.392  10.392 133.462 < 2.2e-16 ***
## Land.Slope       2  2.424   1.212  15.564 2.211e-07 ***
## Year.Built       1 68.991  68.991 886.012 < 2.2e-16 ***
## Year.Remod.Add   1 12.535  12.535 160.982 < 2.2e-16 ***
## Bedroom.AbvGr    1  5.063   5.063  65.018 2.124e-15 ***
## Residuals      993 77.322   0.078                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(bestlogmodel)

## Analysis of Variance Table
## 
## Response: log(ames3$price)
##                       Df Sum Sq Mean Sq F value    Pr(>F)    
## log(ames3$Lot.Area)    1 24.338  24.338 345.224 < 2.2e-16 ***
## ames3$Year.Built       1 67.894  67.894 963.045 < 2.2e-16 ***
## ames3$Year.Remod.Add   1 12.267  12.267 174.000 < 2.2e-16 ***
## ames3$Bedroom.AbvGr    1  2.082   2.082  29.526 6.939e-08 ***
## Residuals            995 70.147   0.070                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

BIC(bestlogmodel)

## [1] 222.1599

BIC(bplm)

## [1] 333.3642

(BIC(bplm)-BIC(bestlogmodel))

## [1] 111.2043

ggplot(data = ames3, aes(x = Lot.Area)) +
  geom_histogram() +
  ggtitle("Lot.Area without log-transformation")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = ames3, aes(x = log(Lot.Area))) +
  geom_histogram() +
  ggtitle("Lot.Area with log transformation")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Answer for Q6

Log transformation of the Lot.Area variable greatly improved the linear model. This is because the original right-skewness of inherent in the Lot.Area variable disappeared with the log transformation and turned into a near-normal distribution.The BIC score of the model also greatly improved comared to the previous model by 111.2043 (333.3642-222.1599). Diagnostic plots(residual vs fitted,Normal Q-Q,Scale Location,Residulas vs Leverage) showed that linearity of the model is better achieved with the log transformed data.This is shown in the plots and in summary for model fit in the next question.

7

Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.

# type your code for Question 7 here, and Knit
op <- par(mfrow = c(2, 3))
plot(bplm)
plot(bestlogmodel)

summary(bplm)

## 
## Call:
## lm(formula = log(price) ~ Lot.Area + Land.Slope + Year.Built + 
##     Year.Remod.Add + Bedroom.AbvGr, data = na.omit(ames3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0878 -0.1651 -0.0211  0.1657  0.9945 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.371e+01  8.574e-01 -15.996  < 2e-16 ***
## Lot.Area        1.028e-05  1.106e-06   9.296  < 2e-16 ***
## Land.SlopeMod   1.384e-01  4.991e-02   2.773  0.00565 ** 
## Land.SlopeSev  -4.567e-01  1.514e-01  -3.016  0.00263 ** 
## Year.Built      6.049e-03  3.788e-04  15.968  < 2e-16 ***
## Year.Remod.Add  6.778e-03  5.468e-04  12.395  < 2e-16 ***
## Bedroom.AbvGr   8.686e-02  1.077e-02   8.063 2.12e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared:  0.5625, Adjusted R-squared:  0.5598 
## F-statistic: 212.8 on 6 and 993 DF,  p-value: < 2.2e-16

summary(bestlogmodel)

## 
## Call:
## lm(formula = log(ames3$price) ~ log(ames3$Lot.Area) + ames3$Year.Built + 
##     ames3$Year.Remod.Add + ames3$Bedroom.AbvGr, data = ames3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.14609 -0.15825 -0.01477  0.15354  1.01578 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.557e+01  8.213e-01 -18.964  < 2e-16 ***
## log(ames3$Lot.Area)   2.471e-01  1.654e-02  14.935  < 2e-16 ***
## ames3$Year.Built      5.964e-03  3.604e-04  16.547  < 2e-16 ***
## ames3$Year.Remod.Add  6.765e-03  5.197e-04  13.017  < 2e-16 ***
## ames3$Bedroom.AbvGr   5.726e-02  1.054e-02   5.434 6.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared:  0.6031, Adjusted R-squared:  0.6015 
## F-statistic: 377.9 on 4 and 995 DF,  p-value: < 2.2e-16

anova(bplm)

## Analysis of Variance Table
## 
## Response: log(price)
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## Lot.Area         1 10.392  10.392 133.462 < 2.2e-16 ***
## Land.Slope       2  2.424   1.212  15.564 2.211e-07 ***
## Year.Built       1 68.991  68.991 886.012 < 2.2e-16 ***
## Year.Remod.Add   1 12.535  12.535 160.982 < 2.2e-16 ***
## Bedroom.AbvGr    1  5.063   5.063  65.018 2.124e-15 ***
## Residuals      993 77.322   0.078                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(bestlogmodel)

## Analysis of Variance Table
## 
## Response: log(ames3$price)
##                       Df Sum Sq Mean Sq F value    Pr(>F)    
## log(ames3$Lot.Area)    1 24.338  24.338 345.224 < 2.2e-16 ***
## ames3$Year.Built       1 67.894  67.894 963.045 < 2.2e-16 ***
## ames3$Year.Remod.Add   1 12.267  12.267 174.000 < 2.2e-16 ***
## ames3$Bedroom.AbvGr    1  2.082   2.082  29.526 6.939e-08 ***
## Residuals            995 70.147   0.070                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer for Q7

Residual Q-Q plots of both models indicate that errors are normally distributed with a constant variance (homoscedasticity). Homogeneity of the residuals seems to be accomplished by both models.Independence of residuals is better seen in the second model according to the plots.Outliers affecting the normality can be seen in both models. According to the resiuduals versus leverage plots, the influence of leverage points seem to be less in the second model.The adjusted R-squared value is also higher in the second model indicating that it gives a better fit. In summary,the log transformation of Lot.Area improves the model assumptions for multiple linear regression models.

Peer Assessment I

1

2

3

4

5

6

7

7.0.1