House Prices - Advanced Regression Techniques

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.

Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
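
As a rough guide to the familywise-error question, the sketch below works through the arithmetic for three pairwise correlation tests. It assumes the tests are independent and that each is run at alpha = 0.20 (matching the 80% confidence level). Both are simplifying assumptions, and the variable names are illustrative only.

```r
# Familywise error rate (FWER) for pairwise correlation tests.
# Assumption: the tests are independent and each uses alpha = 0.20.
alpha   <- 0.20                      # per-test significance level (1 - 0.80)
n_tests <- choose(3, 2)              # number of pairs among three variables
fwer    <- 1 - (1 - alpha)^n_tests   # P(at least one false positive)
bonferroni_alpha <- alpha / n_tests  # per-test level that caps FWER at alpha

cat("tests:", n_tests,
    " FWER:", round(fwer, 3),
    " Bonferroni alpha:", round(bonferroni_alpha, 4), "\n")
```

Under these assumptions the chance of at least one false positive across the three tests is about 49%, well above the nominal 20% for a single test.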

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2
## ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ stringr 1.5.0
## ✔ tidyr   1.3.0     ✔ forcats 1.0.0
## ✔ readr   2.1.4
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(rstatix)      # summary statistics and statistical tests
## Warning: package 'rstatix' was built under R version 4.2.2
## 
## Attaching package: 'rstatix'
## 
## The following object is masked from 'package:stats':
## 
##     filter
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(infer)
## Warning: package 'infer' was built under R version 4.2.2
## 
## Attaching package: 'infer'
## 
## The following objects are masked from 'package:rstatix':
## 
##     chisq_test, prop_test, t_test
library(forcats)
library(DT)
## Warning: package 'DT' was built under R version 4.2.2
# Import training and testing data

train <- read.csv('https://raw.githubusercontent.com/enidroman/Data_605_Fundamentals_of_Computational_Mathematics/main/train.csv')

test  <- read.csv('https://raw.githubusercontent.com/enidroman/Data_605_Fundamentals_of_Computational_Mathematics/main/test.csv')

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set.

# Preview of the train dataset.
head(train) # first 6 observations
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000

Data Structure

# Structure of dataset
str(train) 
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

The dataset contains 1460 observations and 81 variables.
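
The str() output shows NA in several columns (for example Alley and FireplaceQu). A per-column count of missing values is a common next step. The sketch below uses a small hand-typed data frame built from the first six rows printed by head(train), so `toy` is a stand-in for `train`, not the full data.

```r
# Toy stand-in for `train`: four columns, first six rows as printed by head(train).
toy <- data.frame(
  LotFrontage = c(65, 80, 68, 60, 84, 85),
  Alley       = rep(NA_character_, 6),
  FireplaceQu = c(NA, "TA", "TA", "Gd", "TA", NA),
  Fence       = c(NA, NA, NA, NA, NA, "MnPrv")
)

# Count missing values per column, then keep only the columns that have any.
na_counts <- colSums(is.na(toy))
na_counts <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_counts)
```

On the full data the same two lines would be applied to `train` instead of `toy`.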

Data fields

Here’s a brief version of what you’ll find in the data description file.

SalePrice - the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

Summary of Each Variable for the Train Dataset

summary(train)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 

Independent and Dependent Variables

Dependent Variable
SalePrice - the property’s sale price in dollars. This is the target variable that you’re trying to predict.

Independent Variables
LotArea - Lot size in square feet
GrLivArea - Above grade (ground) living area square feet
GarageArea - Size of garage in square feet
PoolArea - Pool area in square feet

The get_summary_stats() function from rstatix returns summary statistics as a data frame, which is convenient for subsequent calculations or for plotting.

Here I use get_summary_stats() to compute summary statistics for the dependent variable and the independent variables used in my analysis.

variables_sum <- train %>% 
  # columns to calculate for
  get_summary_stats(SalePrice, LotArea, GrLivArea, GarageArea, PoolArea,
  # summary stats to return
    type = "common")
variables_sum
## # A tibble: 5 × 10
##   variable       n   min    max  median    iqr      mean      sd      se      ci
##   <fct>      <dbl> <dbl>  <dbl>   <dbl>  <dbl>     <dbl>   <dbl>   <dbl>   <dbl>
## 1 SalePrice   1460 34900 755000 163000  84025  180921.   79443.  2079.   4078.  
## 2 LotArea     1460  1300 215245   9478.  4048   10517.    9981.   261.    512.  
## 3 GrLivArea   1460   334   5642   1464    647.   1515.     525.    13.8    27.0 
## 4 GarageArea  1460     0   1418    480    242.    473.     214.     5.60   11.0 
## 5 PoolArea    1460     0    738      0      0       2.76    40.2    1.05    2.06
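
The se and ci columns can be checked by hand. The sketch below assumes that se is the standard error of the mean, sd / sqrt(n), and that ci is the half-width of a 95% t-based confidence interval for the mean. That reading is consistent with the SalePrice row above; the inputs are the rounded n and sd printed there, and the variable names are illustrative.

```r
# Reproduce se and ci for SalePrice from the printed n and sd (rounded values).
n  <- 1460
s  <- 79443                       # sd as printed in the table
se <- s / sqrt(n)                 # standard error of the mean
ci <- qt(0.975, df = n - 1) * se  # half-width of the 95% confidence interval
round(c(se = se, ci = ci))
```

Both values agree with the table to the nearest dollar, so the 95% interval for mean SalePrice is roughly the mean plus or minus ci.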

Preparation of Dataset with Variables Needed

# Subsetting LotArea, GrLivArea, GarageArea, PoolArea, and SalePrice from dataset train.
subset.train <- subset(train, select = c("LotArea", "GrLivArea", "GarageArea", "PoolArea","SalePrice"))
head(subset.train)
##   LotArea GrLivArea GarageArea PoolArea SalePrice
## 1    8450      1710        548        0    208500
## 2    9600      1262        460        0    181500
## 3   11250      1786        608        0    223500
## 4    9550      1717        642        0    140000
## 5   14260      2198        836        0    250000
## 6   14115      1362        480        0    143000
# Subsetting LotArea, GrLivArea, GarageArea, and PoolArea from dataset train excluding SalePrice
subset.ind <- subset(train, select = c("LotArea", "GrLivArea", "GarageArea", "PoolArea"))
head(subset.ind)
##   LotArea GrLivArea GarageArea PoolArea
## 1    8450      1710        548        0
## 2    9600      1262        460        0
## 3   11250      1786        608        0
## 4    9550      1717        642        0
## 5   14260      2198        836        0
## 6   14115      1362        480        0

Histogram with Density of Each Independent Variable

#install.packages("gridExtra")
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.2.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
#install.packages("plotfunctions")
library(plotfunctions)
## Warning: package 'plotfunctions' was built under R version 4.2.3
## 
## Attaching package: 'plotfunctions'
## The following object is masked from 'package:plotly':
## 
##     add_bars
## The following object is masked from 'package:ggplot2':
## 
##     alpha
p1 <-ggplot(train, aes(x=LotArea)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
 geom_density(alpha=.2, fill="green")+ 
  labs(title = "Lot Area", x = "", y = "")

p2 <- ggplot(train, aes(x=GrLivArea)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
 geom_density(alpha=.2, fill="green")+ 
  labs(title = "Ground Living Area", x = "", y = "")

p3 <- ggplot(train, aes(x=GarageArea)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
 geom_density(alpha=.2, fill="green")+ 
  labs(title = "Garage Area", x = "", y = "")

p4 <-ggplot(train, aes(x=PoolArea)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
 geom_density(alpha=.2, fill="green")+ 
  labs(title = "Pool Area", x = "", y = "")

grid.arrange(p1, p2, p3, p4, nrow=2)
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.

summary(train[c("LotArea", "GrLivArea", "GarageArea", "PoolArea")])
##     LotArea         GrLivArea      GarageArea        PoolArea      
##  Min.   :  1300   Min.   : 334   Min.   :   0.0   Min.   :  0.000  
##  1st Qu.:  7554   1st Qu.:1130   1st Qu.: 334.5   1st Qu.:  0.000  
##  Median :  9478   Median :1464   Median : 480.0   Median :  0.000  
##  Mean   : 10517   Mean   :1515   Mean   : 473.0   Mean   :  2.759  
##  3rd Qu.: 11602   3rd Qu.:1777   3rd Qu.: 576.0   3rd Qu.:  0.000  
##  Max.   :215245   Max.   :5642   Max.   :1418.0   Max.   :738.000

Density curves come in all shapes and sizes and they allow us to gain a quick visual understanding of the distribution of values in a given dataset.

Histograms are one of the most intuitive ways of representing the shape of a data set’s distribution along a single numeric variable.

Regarding skewness, Lot Area, Ground Living Area, and Pool Area are all skewed to the right. Right (positive) skewness means that many of the values are near the lower end of the range, higher values are infrequent, and the mean is greater than the median. All three appear unimodal, since each distribution has only one peak.

Garage Area shows little skew: its mean (473.0) is close to its median (480.0). Its distribution is multimodal, meaning it has two or more peaks. In this case it looks like it has 4 peaks.

I notice that Pool Area shows only 1 visible bar and Lot Area only 6. When the number of bars is too small, important features of the data may be obscured.

Looking at Pool Area, it appears that very few of the houses have a pool installed.
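
Skewness can also be checked numerically rather than read off a histogram. The sketch below uses a hypothetical helper, sample_skewness(), which computes the third standardized moment. It runs on synthetic data, not the housing data, so the numbers only illustrate the sign convention.

```r
# Sample skewness as the third standardized moment (one common definition;
# packages such as e1071 or moments offer variants with small-sample corrections).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
right_skewed <- rexp(1000, rate = 1)   # synthetic stand-in for a right-skewed variable
symmetric    <- c(-2, -1, 0, 1, 2)     # perfectly symmetric toy data

sample_skewness(right_skewed)              # positive for a right-skewed sample
sample_skewness(symmetric)                 # zero for symmetric data
mean(right_skewed) > median(right_skewed)  # the mean is pulled above the median
```

Applying the same helper to a column such as train$LotArea would give a single number to set beside the plot.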

Q-Q Plot of Each Independent Variable

p1 <- ggplot(train, aes(sample = LotArea))+ 
  stat_qq()+
  stat_qq_line()+ 
  labs(title="Lot Area",x = "", y = "")

p2 <- ggplot(train, aes(sample = GrLivArea))+ 
  stat_qq()+
  stat_qq_line()+ 
  labs(title="Ground Living Area", x = "", y = "")

p3 <- ggplot(train, aes(sample = GarageArea))+ 
  stat_qq()+
  stat_qq_line()+ 
  labs(title="Garage Area", x = "", y = "")

p4 <- ggplot(train, aes(sample = PoolArea))+ 
  stat_qq()+
  stat_qq_line()+ 
  labs(title="Pool Area", x = "", y = "")

grid.arrange(p1, p2, p3, p4, nrow=2)

In the Q-Q plot of Ground Living Area, the points seem to fall along a straight line. Notice that the x-axis plots the theoretical quantiles. Those are the quantiles from the standard Normal distribution with mean 0 and standard deviation 1.

In the Q-Q plot of Garage Area, the points start out in a straight run that sits off the reference line, then fall along the line through the middle of the range, and drift away from the line again as the values increase.
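
To make the axes of a Q-Q plot concrete, the coordinates can be computed by hand. This is a minimal sketch on synthetic data. The helper qq_coords() is hypothetical, and it pairs the sorted sample with standard Normal quantiles at the probabilities given by ppoints(), which is what base R's qqnorm() does.

```r
# Hypothetical helper: coordinates of a normal Q-Q plot computed by hand.
qq_coords <- function(y) {
  y <- sort(y[!is.na(y)])
  n <- length(y)
  data.frame(
    theoretical = qnorm(ppoints(n)),  # standard Normal quantiles (mean 0, sd 1)
    sample      = y                   # ordered sample values
  )
}

set.seed(42)
y <- rexp(200, rate = 1 / 500)  # synthetic right-skewed data, not the housing data
qq <- qq_coords(y)

# Base R's qqnorm() returns the same coordinates, in the original data order.
ref <- qqnorm(y, plot.it = FALSE)
all.equal(sort(ref$x), qq$theoretical)
```

Points that bend away from a straight line in these coordinates indicate a departure from normality.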

Histogram with Density and Q-Q Plot of Dependent Variable

p1 <- ggplot(train, aes(x=SalePrice)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white",bins=50)+
 geom_density(alpha=.2, fill="green")+ 
  labs(title="SalePrice", x = "", y = "")

p2 <- ggplot(train, aes(sample = SalePrice))+ 
  stat_qq()+
  stat_qq_line()+ 
  labs(title="SalePrice", x = "", y = "")

grid.arrange(p1, p2, nrow=1)

summary(train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

The distribution of SalePrice is skewed to the right with some prices that are outliers towards the tail. The minimum sale price is $34,900 and the maximum sale price is $755,000. The median sale price is $163,000.

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

# Drawing a scatterplot matrix of LotArea, GrLivArea, GarageArea, PoolArea, SalePrice using the pairs function
pairs(subset.train, pch = 16, col = "blue", main = "Matrix Scatterplot of LotArea, GrLivArea, GarageArea, PoolArea, SalePrice")

Derive a correlation matrix for any three quantitative variables in the dataset.

cor_matrix<-cor(subset.train)
cor_matrix
##               LotArea GrLivArea GarageArea   PoolArea  SalePrice
## LotArea    1.00000000 0.2631162 0.18040276 0.07767239 0.26384335
## GrLivArea  0.26311617 1.0000000 0.46899748 0.17020534 0.70862448
## GarageArea 0.18040276 0.4689975 1.00000000 0.06104727 0.62343144
## PoolArea   0.07767239 0.1702053 0.06104727 1.00000000 0.09240355
## SalePrice  0.26384335 0.7086245 0.62343144 0.09240355 1.00000000
train %>%
  dplyr::select(LotArea, GrLivArea, GarageArea, PoolArea, SalePrice) %>%
  cor() %>%
  corrplot(method = "color", order = "hclust", addrect = 3, number.cex = 1, sig.level = 0.20,
           addCoef.col = "black", # Add the correlation coefficients
           tl.srt = 90,           # Rotate the text labels
           diag = TRUE)

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

Discuss the meaning of your analysis.

Hypotheses

H0: The correlation between each pair of variables is 0.

HA: The correlation between each pair of variables is not 0.

cor.test(subset.train$LotArea, subset.train$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  subset.train$LotArea and subset.train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434

The correlation coefficient between LotArea and SalePrice is 0.2638434, which indicates a weak-to-moderate positive correlation between the two variables.

The t-value of 10.445 and the associated p-value of less than 2.2e-16 indicate that this correlation is statistically significant: a sample correlation this large would be very unlikely if the true correlation were zero. We therefore reject the null hypothesis in favor of the alternative that the true correlation is not equal to 0.

The 80 percent confidence interval is [0.2323391, 0.2947946], which means that we can be 80 percent confident that the true correlation coefficient lies between these two values.
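
The t statistic and the interval reported by cor.test() can be reproduced by hand, which helps show where they come from. The sketch below plugs in the LotArea and SalePrice values from the output above and uses the Fisher z-transformation for the interval.

```r
# Values taken from the cor.test() output above.
r <- 0.2638434
n <- 1460  # degrees of freedom + 2

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 80% confidence interval via the Fisher z-transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
crit <- qnorm(1 - (1 - 0.80) / 2)
ci   <- tanh(z + c(-1, 1) * crit * se_z)

t_stat
ci
```

The results match the reported t = 10.445 and the interval 0.2323391 to 0.2947946 up to rounding.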

Overall, this test tells us that there is a statistically significant weak-to-moderate positive correlation between LotArea and SalePrice.

cor.test(subset.train$GrLivArea, subset.train$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  subset.train$GrLivArea and subset.train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

The correlation coefficient between GrLivArea and SalePrice is 0.7086. This indicates a strong positive linear relationship between the two variables.

The t-value for the test of the null hypothesis that the true correlation between the two variables is zero is 38.348. The degrees of freedom for the t-test are 1458, and the p-value is less than 2.2e-16, indicating strong evidence against the null hypothesis.

The 80 percent confidence interval for the true correlation coefficient between the two variables is (0.6915, 0.7249). This indicates that we are 80 percent confident that the true correlation between the two variables lies in this interval.

Overall, this test tells us that there is a strong positive linear relationship between GrLivArea and SalePrice.

cor.test(subset.train$GarageArea, subset.train$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  subset.train$GarageArea and subset.train$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6024756 0.6435283
## sample estimates:
##       cor 
## 0.6234314

The correlation coefficient between GarageArea and SalePrice is 0.6234, which indicates a moderate positive correlation between the two variables.

The t-value of 30.446 with 1458 degrees of freedom and p-value less than 2.2e-16 suggests that the observed correlation is statistically significant and unlikely to have occurred by chance.

The 80 percent confidence interval suggests that we can be reasonably confident that the true correlation between GarageArea and SalePrice falls between 0.6025 and 0.6435.

Overall, these results suggest that there is a moderate positive relationship between the GarageArea and SalePrice variables, with larger garage areas generally associated with higher sale prices.

cor.test(subset.train$PoolArea, subset.train$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  subset.train$PoolArea and subset.train$SalePrice
## t = 3.5435, df = 1458, p-value = 0.0004073
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.05902496 0.12557575
## sample estimates:
##        cor 
## 0.09240355

The correlation coefficient between PoolArea and SalePrice is 0.0924, which indicates a weak positive correlation between the two variables.

The t value is 3.5435, and the p-value is 0.0004073, which means that the correlation coefficient is statistically significant at a significance level of 0.05. Therefore, we can reject the null hypothesis that there is no correlation between PoolArea and SalePrice.

The 80 percent confidence interval for the true correlation coefficient lies between 0.059 and 0.126, which means that we can be 80 percent confident that the true correlation coefficient falls within this interval.

Would you be worried about familywise error? Why or why not?

Familywise error rate (FWER) is the probability of making one or more type I errors in a family of hypothesis tests. A type I error occurs when a researcher rejects a null hypothesis that is actually true.

You should worry about familywise error when you are performing multiple statistical tests simultaneously, such as in a hypothesis testing scenario where you are comparing the means or variances of multiple groups or testing the correlation between multiple pairs of variables.

This probability increases as the number of tests increases. The more tests you perform, the greater the chance of falsely rejecting at least one null hypothesis, even if each individual test has a low probability of error.

If you don’t account for familywise error, you may end up with a higher chance of finding significant results purely by chance. This can lead to incorrect conclusions and invalid interpretations of your data. Therefore, it is important to adjust your statistical tests to control for familywise error, especially when you are performing a large number of tests.

For example:

k <- 4
alpha <- .05
1 - (1-alpha)^k
## [1] 0.1854938

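The familywise error rate in this case is 0.1854938. This means the probability of committing at least one Type I error across the four tests is 18.55%, well above the nominal 5% level of a single test. In this analysis, however, the largest of the four p-values is 0.0004073, which is still below the Bonferroni-adjusted threshold of 0.05 / 4 = 0.0125. All four correlations remain significant after adjustment, so I would not be worried about familywise error here.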
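
One common way to control the familywise error rate is a Bonferroni correction, which base R provides through p.adjust(). The sketch below applies it to the four p-values reported by cor.test() above. Three of them were printed only as "< 2.2e-16", so 2.2e-16 is used as an upper bound, which is an assumption.

```r
# p-values from the four cor.test() calls above. Three were reported only as
# "< 2.2e-16", so 2.2e-16 is used here as an upper bound (an assumption).
p_values <- c(LotArea = 2.2e-16, GrLivArea = 2.2e-16,
              GarageArea = 2.2e-16, PoolArea = 0.0004073)

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
p_bonf <- p.adjust(p_values, method = "bonferroni")
p_bonf

# TRUE for every test that stays significant at a familywise alpha of 0.05
p_bonf < 0.05
```

Other methods, such as "holm", can be passed to p.adjust() in the same way and are less conservative.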

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

# Correlation matrix from above.
cor_matrix<-cor(subset.train)
cor_matrix
##               LotArea GrLivArea GarageArea   PoolArea  SalePrice
## LotArea    1.00000000 0.2631162 0.18040276 0.07767239 0.26384335
## GrLivArea  0.26311617 1.0000000 0.46899748 0.17020534 0.70862448
## GarageArea 0.18040276 0.4689975 1.00000000 0.06104727 0.62343144
## PoolArea   0.07767239 0.1702053 0.06104727 1.00000000 0.09240355
## SalePrice  0.26384335 0.7086245 0.62343144 0.09240355 1.00000000
# Invert the correlation matrix from above.
prec_matrix <- solve(cor_matrix)
prec_matrix
##                LotArea   GrLivArea   GarageArea     PoolArea   SalePrice
## LotArea     1.09042226 -0.15680158 -0.021168829 -0.041975537 -0.15951123
## GrLivArea  -0.15680158  2.08181516 -0.087419540 -0.211165985 -1.35984155
## GarageArea -0.02116883 -0.08741954  1.640186020  0.004680966 -0.95544319
## PoolArea   -0.04197554 -0.21116598  0.004680966  1.033156947  0.06232672
## SalePrice  -0.15951123 -1.35984155 -0.955443187  0.062326722  2.59559710
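
The claim that the diagonal of the precision matrix holds the variance inflation factors can be checked directly. The sketch below uses synthetic data rather than the housing data. It compares the diagonal of the inverted correlation matrix with 1 / (1 - R^2), where R^2 comes from regressing each variable on the others.

```r
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)             # correlated with x1
x3 <- 0.3 * x1 - 0.4 * x2 + rnorm(n)  # correlated with x1 and x2
toy <- data.frame(x1, x2, x3)         # synthetic data, not the housing data

# Diagonal of the inverted correlation matrix
precision_diag <- diag(solve(cor(toy)))

# VIF for each variable: 1 / (1 - R^2) from regressing it on the others
vif_by_hand <- sapply(names(toy), function(v) {
  f  <- reformulate(setdiff(names(toy), v), response = v)
  r2 <- summary(lm(f, data = toy))$r.squared
  1 / (1 - r2)
})

all.equal(unname(precision_diag), unname(vif_by_hand))
```

Read this way, the SalePrice entry of 2.5956 above corresponds to an R^2 of about 0.61 when SalePrice is regressed on the other four variables.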

Multiply the correlation matrix by the precision matrix.

# Multiply the correlation matrix by the precision matrix
cor_prec <- cor_matrix %*% prec_matrix
cor_prec
##                 LotArea    GrLivArea    GarageArea      PoolArea SalePrice
## LotArea    1.000000e+00 5.551115e-17  0.000000e+00 -3.469447e-18         0
## GrLivArea  1.387779e-17 1.000000e+00  0.000000e+00  0.000000e+00         0
## GarageArea 4.163336e-17 0.000000e+00  1.000000e+00 -6.938894e-18         0
## PoolArea   1.040834e-17 5.551115e-17 -1.387779e-17  1.000000e+00         0
## SalePrice  8.326673e-17 0.000000e+00  1.110223e-16  0.000000e+00         1

Multiply the precision matrix by the correlation matrix.

prec_cor <-   prec_matrix %*% cor_matrix
prec_cor
##                  LotArea     GrLivArea    GarageArea      PoolArea
## LotArea     1.000000e+00 -2.775558e-17  0.000000e+00  6.938894e-18
## GrLivArea   5.551115e-17  1.000000e+00  0.000000e+00  5.551115e-17
## GarageArea  0.000000e+00  1.110223e-16  1.000000e+00 -1.387779e-17
## PoolArea   -1.734723e-17 -2.775558e-17 -6.938894e-18  1.000000e+00
## SalePrice   1.110223e-16  0.000000e+00  0.000000e+00  0.000000e+00
##                SalePrice
## LotArea    -2.775558e-17
## GrLivArea   0.000000e+00
## GarageArea  0.000000e+00
## PoolArea   -2.775558e-17
## SalePrice   1.000000e+00
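
Both products are the identity matrix up to floating-point noise. Entries such as 5.551115e-17 are rounding error, not real correlations. A minimal sketch of how to confirm this numerically, using a 3 x 3 block of the correlation matrix above so that the example is self-contained:

```r
# 3 x 3 block of the correlation matrix above (LotArea, GrLivArea, SalePrice)
R <- matrix(c(1.00000000, 0.26311617, 0.26384335,
              0.26311617, 1.00000000, 0.70862448,
              0.26384335, 0.70862448, 1.00000000),
            nrow = 3, byrow = TRUE)
P <- solve(R)

# all.equal() compares within a small numerical tolerance
all.equal(R %*% P, diag(3))
all.equal(P %*% R, diag(3))

# zapsmall() rounds floating-point noise to 0 for display
zapsmall(R %*% P)
```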

Conduct LU decomposition on the matrix.

#The function lu.decomposition is used from the matrixcalc package.
#install.packages('matrixcalc')
library('matrixcalc')
lu_decomp <- lu.decomposition(cor_matrix)

The lower triangular matrix.

L <- lu_decomp$L
L
##            [,1]      [,2]       [,3]        [,4] [,5]
## [1,] 1.00000000 0.0000000  0.0000000  0.00000000    0
## [2,] 0.26311617 1.0000000  0.0000000  0.00000000    0
## [3,] 0.18040276 0.4528838  1.0000000  0.00000000    0
## [4,] 0.07767239 0.1609082 -0.0267758  1.00000000    0
## [5,] 0.26384335 0.6867466  0.3687445 -0.02401248    1

The upper triangular matrix.

U <- lu_decomp$U
U
##      [,1]      [,2]      [,3]        [,4]        [,5]
## [1,]    1 0.2631162 0.1804028  0.07767239  0.26384335
## [2,]    0 0.9307699 0.4215306  0.14976847  0.63920303
## [3,]    0 0.0000000 0.7765505 -0.02079276  0.28634868
## [4,]    0 0.0000000 0.0000000  0.96931129 -0.02327557
## [5,]    0 0.0000000 0.0000000  0.00000000  0.38526781

Multiplying the lower triangular matrix by the upper triangular matrix results in the correlation matrix.

L %*% U
##            [,1]      [,2]       [,3]       [,4]       [,5]
## [1,] 1.00000000 0.2631162 0.18040276 0.07767239 0.26384335
## [2,] 0.26311617 1.0000000 0.46899748 0.17020534 0.70862448
## [3,] 0.18040276 0.4689975 1.00000000 0.06104727 0.62343144
## [4,] 0.07767239 0.1702053 0.06104727 1.00000000 0.09240355
## [5,] 0.26384335 0.7086245 0.62343144 0.09240355 1.00000000

LU is equivalent to the cor_matrix.

cor_matrix
##               LotArea GrLivArea GarageArea   PoolArea  SalePrice
## LotArea    1.00000000 0.2631162 0.18040276 0.07767239 0.26384335
## GrLivArea  0.26311617 1.0000000 0.46899748 0.17020534 0.70862448
## GarageArea 0.18040276 0.4689975 1.00000000 0.06104727 0.62343144
## PoolArea   0.07767239 0.1702053 0.06104727 1.00000000 0.09240355
## SalePrice  0.26384335 0.7086245 0.62343144 0.09240355 1.00000000
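
To show what the decomposition does internally, here is a minimal sketch of Doolittle LU decomposition without pivoting, written in base R. The helper lu_doolittle() is hypothetical and is not the matrixcalc implementation. It assumes a square matrix whose pivots are all non-zero. It is applied to the leading 3 x 3 block of cor_matrix, typed in from the output above.

```r
# Hypothetical helper: Doolittle LU decomposition without pivoting.
# Assumes a square matrix whose pivots are all non-zero.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]         # multiplier that eliminates U[i, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]  # row elimination
      U[i, k] <- 0                         # remove floating-point residue
    }
  }
  list(L = L, U = U)
}

# Leading 3 x 3 block of cor_matrix (LotArea, GrLivArea, GarageArea)
A <- matrix(c(1.00000000, 0.26311617, 0.18040276,
              0.26311617, 1.00000000, 0.46899748,
              0.18040276, 0.46899748, 1.00000000),
            nrow = 3, byrow = TRUE)
lu <- lu_doolittle(A)
lu$L
lu$U
all.equal(lu$L %*% lu$U, A)
```

The entries agree with the top-left 3 x 3 corners of L and U printed above, as expected when no pivoting is used.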

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Calculus-Based Probability & Statistics.

Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

I selected the variable BsmtUnfSF, Unfinished square feet of basement area.

fit_data <- train$BsmtUnfSF
fit_data <- fit_data[complete.cases(fit_data)]

The distribution is skewed to the right.

hist(fit_data)

summary(train$BsmtUnfSF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   223.0   477.5   567.2   808.0  2336.0
length(fit_data[fit_data == 0])
## [1] 118

Out of the 1460 houses, there are 118 that have no unfinished basement area.

fit_data <- fit_data + .01

Because the data measures area, adding a value of .01 should be negligible and gets rid of the zero values. A property with .01 square feet of unfinished basement area does not really have any unfinished basement area.

load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:rstatix':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
# Run fitdistr to fit an exponential probability density function.
BsmtUnfSF_exp_dist <- fitdistr(train$BsmtUnfSF,'exponential')

Find the optimal value of λ for this distribution.

BsmtUnfSF_lamb <- BsmtUnfSF_exp_dist$estimate
BsmtUnfSF_lamb
##        rate 
## 0.001762921
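
For the exponential distribution, the maximum likelihood estimate of the rate has a closed form: it is the reciprocal of the sample mean. That is why the rate above is approximately 1 / 567.2, the mean reported by summary(). A minimal sketch on synthetic data (not BsmtUnfSF) showing that fitdistr() and the closed form agree:

```r
library(MASS)

set.seed(2023)
x <- rexp(500, rate = 0.002)  # synthetic data, not BsmtUnfSF

fit <- fitdistr(x, "exponential")

# The maximum likelihood estimate of the exponential rate is 1 / sample mean
c(fitdistr = unname(fit$estimate), closed_form = 1 / mean(x))
```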

Take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).

set.seed(1000)
BsmtUnfSF_sample <- rexp(1000,BsmtUnfSF_lamb)

Plot a histogram and compare it with a histogram of your original variable.

hist(BsmtUnfSF_sample)

Compare it with a histogram of your original variable.

par(mfrow=c(1,2))
hist(fit_data)
hist(BsmtUnfSF_sample)

The histograms of fit_data and BsmtUnfSF_sample are both right-skewed; however, the second bin of BsmtUnfSF_sample has a frequency that is about double the frequency of the corresponding bin of fit_data. The overall shapes are similar, but the frequencies are not distributed across the bins in the same way.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

qexp(.05, rate=BsmtUnfSF_lamb)
## [1] 29.09563
qexp(.95, rate=BsmtUnfSF_lamb)
## [1] 1699.3
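
qexp() inverts the exponential CDF, F(x) = 1 - exp(-λx), so the p-th percentile is -log(1 - p) / λ. The sketch below works this out by hand with the rate estimated above. The helper name exp_quantile() is hypothetical.

```r
lambda <- 0.001762921  # rate estimated by fitdistr() above

# Inverse of the exponential CDF F(x) = 1 - exp(-lambda * x)
exp_quantile <- function(p, rate) -log(1 - p) / rate

q05 <- exp_quantile(0.05, lambda)
q95 <- exp_quantile(0.95, lambda)
c(q05, q95)

# Plugging the quantiles back into the CDF returns the probabilities
pexp(c(q05, q95), rate = lambda)
```

The two values agree with the qexp() results above up to the rounding of the rate.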

Generate a 95% confidence interval from the empirical data, assuming normality.

norm.interval = function(data, variance = var(data), conf.level = 0.95) 
{
      z = qnorm((1 - conf.level)/2, lower.tail = FALSE)
      xbar = mean(data)
      sdx = sqrt(variance/length(data))
      c(xbar - z * sdx, xbar + z * sdx)
}
norm.interval(fit_data, variance=var(fit_data), conf.level = 0.95)
## [1] 544.5850 589.9158

Provide the empirical 5th percentile and 95th percentile of the data. Discuss.

quantile(x=fit_data, probs=c(.05, .95))
##      5%     95% 
##    0.01 1468.01

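We are 95% confident that the mean unfinished basement area is between 544.5850 and 589.9158 square feet. The exponential distribution is only an approximate fit: it places the 5th and 95th percentiles at 29.10 and 1699.30, whereas the empirical values are 0.01 and 1468.01. The model misses the 118 houses at the low end that have no unfinished basement area, and its right tail is heavier than that of the data.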

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.

Modeling

Build some type of multiple regression model and submit your model to the competition board.

For my multiple regression model I chose the following variables:

Dependent Variable

SalePrice - the property’s sale price in dollars. This is the target variable that we are trying to predict.

Independent Variables

LotArea - Lot size in square feet

BsmtUnfSF - Unfinished square feet of basement area

TotalBsmtSF - Total square feet of basement area

GrLivArea - Above grade (ground) living area square feet

GarageArea - Size of garage in square feet

We want to build a model for estimating SalePrice based on the LotArea, BsmtUnfSF, TotalBsmtSF, GrLivArea, and GarageArea of each house.

# Subsetting LotArea, BsmtUnfSF, TotalBsmtSF, FullBath, GrLivArea, GarageArea, YearBuilt, YearRemodAdd, and SalePrice from dataset train.
new_subset_train <- subset(train, select =c("LotArea", "BsmtUnfSF", "TotalBsmtSF", "FullBath","GrLivArea", "GarageArea", "YearBuilt", "YearRemodAdd","SalePrice"))
head(new_subset_train)
##   LotArea BsmtUnfSF TotalBsmtSF FullBath GrLivArea GarageArea YearBuilt
## 1    8450       150         856        2      1710        548      2003
## 2    9600       284        1262        2      1262        460      1976
## 3   11250       434         920        2      1786        608      2001
## 4    9550       540         756        1      1717        642      1915
## 5   14260       490        1145        2      2198        836      2000
## 6   14115        64         796        1      1362        480      1993
##   YearRemodAdd SalePrice
## 1         2003    208500
## 2         1976    181500
## 3         2002    223500
## 4         1970    140000
## 5         2000    250000
## 6         1995    143000
# Preview of the test dataset
head(test) #  6 observations
##     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461         20       RH          80   11622   Pave  <NA>      Reg
## 2 1462         20       RL          81   14267   Pave  <NA>      IR1
## 3 1463         60       RL          74   13830   Pave  <NA>      IR1
## 4 1464         60       RL          78    9978   Pave  <NA>      IR1
## 5 1465        120       RL          43    5005   Pave  <NA>      IR1
## 6 1466         60       RL          75   10000   Pave  <NA>      IR1
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1         Lvl    AllPub    Inside       Gtl        NAmes      Feedr       Norm
## 2         Lvl    AllPub    Corner       Gtl        NAmes       Norm       Norm
## 3         Lvl    AllPub    Inside       Gtl      Gilbert       Norm       Norm
## 4         Lvl    AllPub    Inside       Gtl      Gilbert       Norm       Norm
## 5         HLS    AllPub    Inside       Gtl      StoneBr       Norm       Norm
## 6         Lvl    AllPub    Corner       Gtl      Gilbert       Norm       Norm
##   BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1     1Fam     1Story           5           6      1961         1961     Gable
## 2     1Fam     1Story           6           6      1958         1958       Hip
## 3     1Fam     2Story           5           5      1997         1998     Gable
## 4     1Fam     2Story           6           6      1998         1998     Gable
## 5   TwnhsE     1Story           8           5      1992         1992     Gable
## 6     1Fam     2Story           6           5      1993         1994     Gable
##   RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1  CompShg     VinylSd     VinylSd       None          0        TA        TA
## 2  CompShg     Wd Sdng     Wd Sdng    BrkFace        108        TA        TA
## 3  CompShg     VinylSd     VinylSd       None          0        TA        TA
## 4  CompShg     VinylSd     VinylSd    BrkFace         20        TA        TA
## 5  CompShg     HdBoard     HdBoard       None          0        Gd        TA
## 6  CompShg     HdBoard     HdBoard       None          0        TA        TA
##   Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1     CBlock       TA       TA           No          Rec        468
## 2     CBlock       TA       TA           No          ALQ        923
## 3      PConc       Gd       TA           No          GLQ        791
## 4      PConc       TA       TA           No          GLQ        602
## 5      PConc       Gd       TA           No          ALQ        263
## 6      PConc       Gd       TA           No          Unf          0
##   BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1          LwQ        144       270         882    GasA        TA          Y
## 2          Unf          0       406        1329    GasA        TA          Y
## 3          Unf          0       137         928    GasA        Gd          Y
## 4          Unf          0       324         926    GasA        Ex          Y
## 5          Unf          0      1017        1280    GasA        Ex          Y
## 6          Unf          0       763         763    GasA        Gd          Y
##   Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1      SBrkr       896         0            0       896            0
## 2      SBrkr      1329         0            0      1329            0
## 3      SBrkr       928       701            0      1629            0
## 4      SBrkr       926       678            0      1604            0
## 5      SBrkr      1280         0            0      1280            0
## 6      SBrkr       763       892            0      1655            0
##   BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1            0        1        0            2            1          TA
## 2            0        1        1            3            1          Gd
## 3            0        2        1            3            1          TA
## 4            0        2        1            3            1          Gd
## 5            0        2        0            2            1          Gd
## 6            0        2        1            3            1          TA
##   TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1            5        Typ          0        <NA>     Attchd        1961
## 2            6        Typ          0        <NA>     Attchd        1958
## 3            6        Typ          1          TA     Attchd        1997
## 4            7        Typ          1          Gd     Attchd        1998
## 5            5        Typ          0        <NA>     Attchd        1992
## 6            7        Typ          1          TA     Attchd        1993
##   GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1          Unf          1        730         TA         TA          Y
## 2          Unf          1        312         TA         TA          Y
## 3          Fin          2        482         TA         TA          Y
## 4          Fin          2        470         TA         TA          Y
## 5          RFn          2        506         TA         TA          Y
## 6          Fin          2        440         TA         TA          Y
##   WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1        140           0             0          0         120        0   <NA>
## 2        393          36             0          0           0        0   <NA>
## 3        212          34             0          0           0        0   <NA>
## 4        360          36             0          0           0        0   <NA>
## 5          0          82             0          0         144        0   <NA>
## 6        157          84             0          0           0        0   <NA>
##   Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1 MnPrv        <NA>       0      6   2010       WD        Normal
## 2  <NA>        Gar2   12500      6   2010       WD        Normal
## 3 MnPrv        <NA>       0      3   2010       WD        Normal
## 4  <NA>        <NA>       0      6   2010       WD        Normal
## 5  <NA>        <NA>       0      1   2010       WD        Normal
## 6  <NA>        <NA>       0      4   2010       WD        Normal
# Subsetting LotArea, BsmtUnfSF, TotalBsmtSF, FullBath, GrLivArea, GarageArea, YearBuilt, and YearRemodAdd from dataset test (SalePrice is not in the test set).
new_subset_test <- subset(test, select =c("LotArea", "BsmtUnfSF", "TotalBsmtSF", "FullBath","GrLivArea", "GarageArea", "YearBuilt", "YearRemodAdd"))
head(new_subset_test)
##   LotArea BsmtUnfSF TotalBsmtSF FullBath GrLivArea GarageArea YearBuilt
## 1   11622       270         882        1       896        730      1961
## 2   14267       406        1329        1      1329        312      1958
## 3   13830       137         928        2      1629        482      1997
## 4    9978       324         926        2      1604        470      1998
## 5    5005      1017        1280        2      1280        506      1992
## 6   10000       763         763        2      1655        440      1993
##   YearRemodAdd
## 1         1961
## 2         1958
## 3         1998
## 4         1998
## 5         1992
## 6         1994

Data Cleaning

Checked both the train and test subsets for missing values. The train subset has none; the test subset has 3.

# create a data frame 
stats <- data.frame(new_subset_train)
  
# find location of missing values
print("Position of missing values -")
## [1] "Position of missing values -"
which(is.na(stats))
## integer(0)
# count total missing values 
print("Count of total missing values - ")
## [1] "Count of total missing values - "
sum(is.na(stats))
## [1] 0
# create a data frame 
stats <- data.frame(new_subset_test)
  
# find location of missing values
print("Position of missing values -")
## [1] "Position of missing values -"
which(is.na(stats))
## [1] 2120 3579 8412
# count total missing values 
print("Count of total missing values - ")
## [1] "Count of total missing values - "
sum(is.na(stats))
## [1] 3
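The positions returned by which(is.na(stats)) are linear indices counted down the columns, which makes them hard to read. The sketch below, on a made-up toy data frame (not the competition data), shows one way to get per-column counts and row/column positions instead. Assuming the test set has 1459 rows (an assumption, the row count is not shown in this section), position 2120 would fall in row 661 of the second column, BsmtUnfSF.

```r
# Toy data frame standing in for new_subset_test (values are made up)
toy <- data.frame(
  LotArea    = c(11622, 14267, 13830),
  BsmtUnfSF  = c(270, NA, 137),
  GarageArea = c(730, 312, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Row and column index of every missing value
na_positions <- which(is.na(toy), arr.ind = TRUE)
print(na_positions)
```

Knowing which column holds the missing values makes it easier to decide how to handle them before calling predict().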

Compute the model coefficients

model_1 <- lm(SalePrice ~ LotArea + BsmtUnfSF + TotalBsmtSF + GrLivArea + GarageArea, data = new_subset_train)
summary(model_1)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + BsmtUnfSF + TotalBsmtSF + 
##     GrLivArea + GarageArea, data = new_subset_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -681901  -19033     362   19594  273221 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.318e+04  4.033e+03  -5.749 1.09e-08 ***
## LotArea      1.318e-01  1.279e-01   1.031    0.303    
## BsmtUnfSF   -1.235e+01  3.032e+00  -4.074 4.88e-05 ***
## TotalBsmtSF  5.364e+01  3.568e+00  15.034  < 2e-16 ***
## GrLivArea    6.914e+01  2.758e+00  25.065  < 2e-16 ***
## GarageArea   1.019e+02  6.802e+00  14.988  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45950 on 1454 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6654 
## F-statistic: 581.3 on 5 and 1454 DF,  p-value: < 2.2e-16
summary(model_1)$coefficients[, "Pr(>|t|)"]
##   (Intercept)       LotArea     BsmtUnfSF   TotalBsmtSF     GrLivArea 
##  1.093188e-08  3.028421e-01  4.877380e-05  1.362393e-47 1.541041e-115 
##    GarageArea 
##  2.483565e-47

The model is a multiple linear regression model with SalePrice as the response variable and LotArea, BsmtUnfSF, TotalBsmtSF, GrLivArea, and GarageArea as the predictor variables.

The coefficients table shows the estimated regression coefficients for each predictor variable, as well as their standard errors, t-values, and associated p-values.

The intercept coefficient (-2.318e+04) represents the estimated SalePrice when all predictor variables are zero. This is an extrapolation with no practical meaning, since no house has zero living area.

The p-values for each predictor variable show whether they are statistically significant in predicting SalePrice or not. In this case, LotArea is not significant (p-value = 0.303), while all the other predictor variables have very small p-values (less than 0.001), indicating strong evidence of a significant linear relationship between each predictor variable and SalePrice.

Multiple R-squared is 0.6665. The adjusted R-squared value of 0.6654 suggests that the model explains about 66.5% of the variation in SalePrice after accounting for the number of predictor variables in the model.

The F-statistic of 581.3 with a very small p-value (< 2.2e-16) suggests that the overall model is statistically significant in predicting SalePrice.
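As a rough check, the adjusted R-squared and the F-statistic can be recomputed from the multiple R-squared and the degrees of freedom in the summary. This is a sketch using the rounded values printed above, so the results match the reported figures only up to rounding.

```r
r2 <- 0.6665   # Multiple R-squared from summary(model_1)
p  <- 5        # number of predictors
n  <- 1460     # observations: 1454 residual df + 5 predictors + 1 intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 1))
```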

The residual standard error of 45950 is an estimate of the standard deviation of the residuals and measures the variability in the response variable that the model does not explain. In dollar terms, a typical prediction misses the actual sale price by roughly $46,000. Whether that is small depends on the scale of sale prices; it is consistent with about one third of the variation in SalePrice remaining unexplained.

To see which predictor variables are significant, you can examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values:

summary(model_1)$coefficients
##                  Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) -2.318421e+04 4032.843200 -5.748851  1.093188e-08
## LotArea      1.318336e-01    0.127904  1.030723  3.028421e-01
## BsmtUnfSF   -1.235047e+01    3.031774 -4.073676  4.877380e-05
## TotalBsmtSF  5.364311e+01    3.568100 15.034082  1.362393e-47
## GrLivArea    6.914123e+01    2.758440 25.065337 1.541041e-115
## GarageArea   1.019489e+02    6.801995 14.988084  2.483565e-47

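The table presents the estimates of the coefficients for each predictor variable. The “Estimate” column shows the estimated effect of each predictor variable on SalePrice, holding the other predictors constant. For example, the estimate for LotArea is 0.1318, so each additional square foot of lot area is associated with an increase of about $0.13 in predicted SalePrice.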

The “Std. Error” column shows the standard error of each coefficient estimate. The “t value” column shows the t-statistic for each coefficient, which measures the number of standard errors that the estimate is away from zero.

Finally, the “Pr(>|t|)” column shows the p-value for each coefficient, which is the probability of observing a t-statistic as extreme or more extreme than the observed value, assuming that the null hypothesis is true (i.e., the coefficient is equal to zero). A p-value less than the significance level (usually 0.05) indicates that the coefficient is statistically significant, meaning that we reject the null hypothesis and conclude that the predictor variable is associated with the response variable. In this case, all predictor variables except for LotArea have a p-value less than 0.05, indicating that they are statistically significant. The adjusted R-squared value of 0.6654 suggests that the model explains approximately 66.5% of the variation in SalePrice.
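The relationship between the columns can be checked by hand. The sketch below recomputes the t value and two-sided p-value for LotArea from its estimate and standard error in the table above, using the 1454 residual degrees of freedom reported by summary(model_1).

```r
estimate  <- 1.318336e-01   # LotArea estimate from the coefficients table
std_error <- 0.127904       # LotArea standard error from the same table
df_resid  <- 1454           # residual degrees of freedom of model_1

# t value: how many standard errors the estimate lies from zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(t_value)
print(p_value)
```

Both results should agree with the LotArea row of the table up to rounding of the printed inputs.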

plot(model_1$fitted.values, model_1$residuals,
     xlab="Fitted Values", ylab="Residuals", main="Fitted Values vs. Residuals")
abline(h=0, col='blue')

The resulting plot shows the residuals against the fitted (predicted) values and reveals any pattern in the residuals. If the points are randomly scattered around the horizontal line at 0 with roughly constant vertical spread, then the linearity and constant-variance assumptions are reasonable. If there is a clear pattern (e.g., a U-shape or a curve), then it suggests that the model is not adequately capturing some important nonlinear relationship between the predictor and outcome variables. This plot does not assess normality; the normal Q-Q plot is used for that.

This plot of residuals versus fits shows that the residual variance (vertical spread) increases as the fitted values (predicted values of sale price) increase. This violates the assumption of constant error variance.
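A visual impression of increasing spread can be backed by a formal test. The sketch below is a hand-rolled version of the studentized Breusch-Pagan test in base R, run on simulated data whose error spread grows with the predictor. The data and variable names are made up for illustration; this is not a result for model_1.

```r
set.seed(1)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 * x + rnorm(n, sd = x)   # error standard deviation grows with x

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic n * R-squared, compared to a chi-square distribution
# with one degree of freedom (one predictor in the auxiliary regression)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(bp_stat)
print(bp_p)
```

A small p-value is evidence against constant error variance. The same steps could be applied to a fitted house price model by regressing its squared residuals on its predictors.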

qqnorm(model_1$residuals); qqline(model_1$residuals)

Residuals are approximately normally distributed over most of their range.

The reference line in this plot is a straight line that passes through the first and third quartiles of the data, and it is used to check whether the residuals are approximately normally distributed. If the residuals are normally distributed, they will fall roughly along this line.

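The pattern of the normal probability plot is close to straight, so this plot also provides evidence that it is reasonable to assume that the errors have a normal distribution. The residual summary does show a minimum of -681901, roughly 15 residual standard errors below zero, so the lower tail contains at least one outlier that a normal distribution would not produce.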

Model 2

I was not happy with the fit of Model 1, especially the R-squared of 0.6665, so I added three variables to increase the R-squared.

Adding a variable never decreases the R-squared, so a higher value alone does not prove that the new variable matters. Some of the added independent variables will be statistically significant, which may reflect an actual relationship or just a chance correlation.
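This behaviour of R-squared is easy to reproduce. The sketch below uses simulated data (all names and values are made up) and adds a pure-noise predictor to a model: the multiple R-squared does not go down, while the adjusted R-squared is free to drop because it penalizes the extra term.

```r
set.seed(42)
n <- 200
x     <- rnorm(n)
noise <- rnorm(n)            # unrelated to y by construction
y     <- 3 * x + rnorm(n)

small <- lm(y ~ x)           # model without the noise variable
big   <- lm(y ~ x + noise)   # same model plus the noise variable

r2 <- c(small = summary(small)$r.squared,
        big   = summary(big)$r.squared)
adj_r2 <- c(small = summary(small)$adj.r.squared,
            big   = summary(big)$adj.r.squared)

print(r2)
print(adj_r2)
```

This is why adjusted R-squared and the individual p-values are better guides than raw R-squared when comparing models with different numbers of predictors.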

Added:

Independent Variables

FullBath - Full bathrooms above grade

YearBuilt - Original construction date

YearRemodAdd - Remodel date

We want to build a Model 2 for estimating SalePrice based on the LotArea, BsmtUnfSF, TotalBsmtSF, FullBath, GrLivArea, GarageArea, YearBuilt, and YearRemodAdd of each house.

model_2 <- lm(SalePrice ~ LotArea + BsmtUnfSF + TotalBsmtSF + FullBath + GrLivArea + GarageArea + YearBuilt + YearRemodAdd, data = new_subset_train)
summary(model_2)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + BsmtUnfSF + TotalBsmtSF + 
##     FullBath + GrLivArea + GarageArea + YearBuilt + YearRemodAdd, 
##     data = new_subset_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -626171  -17752   -3960   14599  287055 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.148e+06  1.258e+05 -17.071  < 2e-16 ***
## LotArea       4.073e-01  1.154e-01   3.530 0.000429 ***
## BsmtUnfSF    -1.324e+01  2.777e+00  -4.769 2.04e-06 ***
## TotalBsmtSF   4.168e+01  3.335e+00  12.497  < 2e-16 ***
## FullBath     -9.406e+02  2.920e+03  -0.322 0.747409    
## GrLivArea     6.911e+01  3.069e+00  22.515  < 2e-16 ***
## GarageArea    5.802e+01  6.566e+00   8.837  < 2e-16 ***
## YearBuilt     4.969e+02  5.212e+01   9.534  < 2e-16 ***
## YearRemodAdd  5.934e+02  6.705e+01   8.850  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41120 on 1451 degrees of freedom
## Multiple R-squared:  0.7335, Adjusted R-squared:  0.7321 
## F-statistic: 499.3 on 8 and 1451 DF,  p-value: < 2.2e-16

By adding the three variables, the R-squared increases to 0.7335. The adjusted R-squared value of 0.7321 suggests that the model explains about 73.2% of the variation in SalePrice after accounting for the number of predictor variables in the model.

This is a summary of a linear regression model fitted to the data set new_subset_train. The model is trying to predict the SalePrice of houses based on a set of predictor variables. The summary provides information on the goodness of fit of the model and the significance of the coefficients.

Residuals: This section provides summary statistics on the residuals (i.e., the actual values of the dependent variable minus the predicted values). The minimum and maximum residuals are -626171 and 287055, respectively. The median residual is -3960, so the model overestimates the sale price for somewhat more than half of the houses. The mean residual is zero by construction in least squares, so this reflects skew in the residuals rather than an average bias.

Coefficients: This section provides estimates of the regression coefficients (i.e., the slopes) and their significance levels. The intercept is -2.148e+06, meaning that the predicted sale price when all predictor variables are zero is -2.148 million, an extrapolation with no practical meaning. The coefficient for LotArea is 0.4073, indicating that for every additional square foot of LotArea, the predicted sale price increases by about $0.41 (about $407 per 1,000 square feet), holding the other predictors constant. The coefficient for BsmtUnfSF is -13.24, indicating that for every additional square foot of BsmtUnfSF, the predicted sale price decreases by $13.24. The coefficients for the other variables can be interpreted similarly.
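To make the interpretation concrete, a prediction can be assembled by hand as the intercept plus each coefficient times the corresponding predictor value. The sketch below uses the full-precision estimates reported by summary(model_2)$coefficients and the first test row shown in head(new_subset_test). The result should agree with the listed prediction for Id 1461 (131366.75) to within about a dollar, the gap coming from rounding in the printed coefficients.

```r
# Estimates as printed by summary(model_2)$coefficients
coefs <- c(
  "(Intercept)" = -2.147928e+06,
  LotArea       =  4.073197e-01,
  BsmtUnfSF     = -1.324043e+01,
  TotalBsmtSF   =  4.168114e+01,
  FullBath      = -9.406210e+02,
  GrLivArea     =  6.910657e+01,
  GarageArea    =  5.802276e+01,
  YearBuilt     =  4.968937e+02,
  YearRemodAdd  =  5.933854e+02
)

# First row of new_subset_test, with a 1 for the intercept term
house <- c(
  "(Intercept)" = 1,
  LotArea       = 11622,
  BsmtUnfSF     = 270,
  TotalBsmtSF   = 882,
  FullBath      = 1,
  GrLivArea     = 896,
  GarageArea    = 730,
  YearBuilt     = 1961,
  YearRemodAdd  = 1961
)

# Linear predictor: sum of coefficient * value over all terms
manual_prediction <- sum(coefs * house[names(coefs)])
print(manual_prediction)
```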

Significance: The significance levels (p-values) for the coefficients are also provided. All variables except FullBath have p-values less than 0.05, indicating that they are statistically significant predictors of SalePrice. The adjusted R-squared value of 0.7321 indicates that the model explains about 73% of the variability in SalePrice.

The coefficient for ‘FullBath’ is -940.62, but the p-value is 0.7474, which is greater than 0.05. Therefore, we cannot reject the null hypothesis that the coefficient is equal to zero, and we conclude that there is not a significant relationship between ‘FullBath’ and ‘SalePrice’.

Residual standard error: This is an estimate of the standard deviation of the residuals. It is a measure of the typical distance that the data points fall from the fitted regression surface. The residual standard error of 41120 means that a typical prediction is off by about $41,120, down from $45,950 in Model 1.

F-statistic: This is a test of whether the model as a whole is significant. The F-statistic of 499.3 and the associated p-value of < 2.2e-16 suggest that the model is significant and that at least one of the predictor variables is related to SalePrice.

summary(model_2)$coefficients
##                   Estimate   Std. Error     t value     Pr(>|t|)
## (Intercept)  -2.147928e+06 1.258257e+05 -17.0706563 1.105332e-59
## LotArea       4.073197e-01 1.153903e-01   3.5299309 4.286577e-04
## BsmtUnfSF    -1.324043e+01 2.776587e+00  -4.7685990 2.041667e-06
## TotalBsmtSF   4.168114e+01 3.335253e+00  12.4971448 4.118002e-34
## FullBath     -9.406210e+02 2.920109e+03  -0.3221185 7.474093e-01
## GrLivArea     6.910657e+01 3.069297e+00  22.5154419 1.592333e-96
## GarageArea    5.802276e+01 6.565973e+00   8.8368869 2.785968e-18
## YearBuilt     4.968937e+02 5.211734e+01   9.5341339 6.111098e-21
## YearRemodAdd  5.933854e+02 6.704808e+01   8.8501474 2.489138e-18
plot(model_2$fitted.values, model_2$residuals,
     xlab="Fitted Values", ylab="Residuals", main="Fitted Values vs. Residuals")
abline(h=0, col='blue')

The resulting plot shows the residuals against the fitted (predicted) values and reveals any pattern in the residuals. If the points are randomly scattered around the horizontal line at 0 with roughly constant vertical spread, then the linearity and constant-variance assumptions are reasonable. If there is a clear pattern (e.g., a U-shape or a curve), then it suggests that the model is not adequately capturing some important nonlinear relationship between the predictor and outcome variables. This plot does not assess normality; the normal Q-Q plot is used for that.

This plot of residuals versus fits shows that the residual variance (vertical spread) increases as the fitted values (predicted values of sale price) increase. This violates the assumption of constant error variance.

qqnorm(model_2$residuals); qqline(model_2$residuals)

Residuals are approximately normally distributed over most of their range.

The reference line in this plot is a straight line that passes through the first and third quartiles of the data, and it is used to check whether the residuals are approximately normally distributed. If the residuals are normally distributed, they will fall roughly along this line.

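The pattern of the normal probability plot is close to straight, so this plot also provides evidence that it is reasonable to assume that the errors have a normal distribution. The residual summary does show a minimum of -626171, roughly 15 residual standard errors below zero, so the lower tail contains at least one outlier that a normal distribution would not produce.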

House Sale Prices Prediction

mySalePrice <- predict(model_2,test)

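# Create the submission data frame
prediction <- data.frame(Id = test[, "Id"], SalePrice = mySalePrice)
# Set negative predictions to 0
prediction[prediction < 0] <- 0
# Replace NA predictions (test rows with missing predictors) with 0
prediction <- replace(prediction, is.na(prediction), 0)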
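Replacing NA predictions with 0 gives the affected houses a predicted price of 0 (row 661, Id 2121, in the listing). One alternative, sketched here on made-up toy data and not the method used in this report, is to fill missing predictor values with the training medians before calling predict(), so every house receives a model-based price.

```r
# Toy stand-ins for the train and test predictors (values are made up)
train_toy <- data.frame(TotalBsmtSF = c(800, 1000, 1200),
                        GarageArea  = c(300, 500, 400))
test_toy  <- data.frame(TotalBsmtSF = c(900, NA),
                        GarageArea  = c(NA, 450))

# Fill each missing test value with the median of that column in train
test_filled <- test_toy
for (col in names(test_filled)) {
  col_median <- median(train_toy[[col]], na.rm = TRUE)
  test_filled[[col]][is.na(test_filled[[col]])] <- col_median
}

print(test_filled)
```

The filled data frame could then be passed to predict() in place of the original test data.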
  
prediction
##        Id SalePrice
## 1    1461 131366.75
## 2    1462 151673.68
## 3    1463 211112.71
## 4    1464 205057.34
## 5    1465 181767.87
## 6    1466 189385.47
## 7    1467 186175.76
## 8    1468 178474.73
## 9    1469 192370.41
## 10   1470 130579.64
## 11   1471 207965.00
## 12   1472 100084.58
## 13   1473 113691.18
## 14   1474 161692.48
## 15   1475 104782.07
## 16   1476 296482.96
## 17   1477 247379.35
## 18   1478 250341.26
## 19   1479 259987.45
## 20   1480 381456.57
## 21   1481 293022.05
## 22   1482 202236.43
## 23   1483 197835.95
## 24   1484 175854.94
## 25   1485 173926.81
## 26   1486 207578.71
## 27   1487 308856.01
## 28   1488 247102.50
## 29   1489 202944.52
## 30   1490 233010.40
## 31   1491 206330.61
## 32   1492  84296.02
## 33   1493 202685.44
## 34   1494 286324.80
## 35   1495 276768.82
## 36   1496 229369.03
## 37   1497 210648.66
## 38   1498 167970.40
## 39   1499 170011.73
## 40   1500 172546.28
## 41   1501 189653.37
## 42   1502 156717.45
## 43   1503 255620.16
## 44   1504 229951.28
## 45   1505 227339.70
## 46   1506 215184.73
## 47   1507 289687.02
## 48   1508 228031.73
## 49   1509 167941.55
## 50   1510 156381.82
## 51   1511 159852.54
## 52   1512 189379.83
## 53   1513 161490.72
## 54   1514 206284.02
## 55   1515 246619.71
## 56   1516 143822.70
## 57   1517 163622.12
## 58   1518 178211.73
## 59   1519 236992.86
## 60   1520 133113.61
## 61   1521 139495.04
## 62   1522 170891.72
## 63   1523 118449.83
## 64   1524 104858.48
## 65   1525 108157.70
## 66   1526 110170.98
## 67   1527 115270.23
## 68   1528 144222.43
## 69   1529 146309.13
## 70   1530 182370.12
## 71   1531 118817.84
## 72   1532  63066.35
## 73   1533 168373.89
## 74   1534 127591.16
## 75   1535 182331.59
## 76   1536 108058.99
## 77   1537 101597.74
## 78   1538 138131.56
## 79   1539 182783.30
## 80   1540 139838.32
## 81   1541 156202.24
## 82   1542 178434.58
## 83   1543 198051.10
## 84   1544  55507.54
## 85   1545  89743.56
## 86   1546 110486.97
## 87   1547 120748.35
## 88   1548 120739.88
## 89   1549 117310.68
## 90   1550 135162.35
## 91   1551 101447.07
## 92   1552 157112.89
## 93   1553 128548.07
## 94   1554 109803.23
## 95   1555 179781.35
## 96   1556  79735.18
## 97   1557 108068.22
## 98   1558  73500.35
## 99   1559 109707.15
## 100  1560 135122.45
## 101  1561 187185.92
## 102  1562 142526.58
## 103  1563 110779.57
## 104  1564 194879.81
## 105  1565 158748.67
## 106  1566 226391.40
## 107  1567  66064.13
## 108  1568 233748.33
## 109  1569 181210.65
## 110  1570 130481.07
## 111  1571 138674.05
## 112  1572 159718.77
## 113  1573 247329.63
## 114  1574 150555.82
## 115  1575 211790.68
## 116  1576 249982.99
## 117  1577 199854.40
## 118  1578 129872.25
## 119  1579 168455.62
## 120  1580 208213.69
## 121  1581 158527.90
## 122  1582 124098.59
## 123  1583 302789.94
## 124  1584 241893.04
## 125  1585 152754.34
## 126  1586  70243.07
## 127  1587  83037.39
## 128  1588 151221.46
## 129  1589 107395.36
## 130  1590 138824.99
## 131  1591  82154.41
## 132  1592 132261.49
## 133  1593  84726.56
## 134  1594 179894.48
## 135  1595 112300.04
## 136  1596 215671.44
## 137  1597 219762.22
## 138  1598 198104.03
## 139  1599 165261.33
## 140  1600 168342.69
## 141  1601  44954.15
## 142  1602 116055.94
## 143  1603  60330.32
## 144  1604 244980.96
## 145  1605 247043.13
## 146  1606 170179.95
## 147  1607 251082.10
## 148  1608 214658.91
## 149  1609 193742.28
## 150  1610 175797.03
## 151  1611 155800.16
## 152  1612 215522.14
## 153  1613 205403.00
## 154  1614 122204.34
## 155  1615  92905.97
## 156  1616  90748.36
## 157  1617 113412.04
## 158  1618 134091.22
## 159  1619 140025.25
## 160  1620 254636.61
## 161  1621 171417.98
## 162  1622 150285.42
## 163  1623 250663.88
## 164  1624 234179.09
## 165  1625 111647.68
## 166  1626 188250.34
## 167  1627 208109.64
## 168  1628 273186.43
## 169  1629 180218.84
## 170  1630 334501.64
## 171  1631 207279.35
## 172  1632 234568.41
## 173  1633 164109.12
## 174  1634 204183.28
## 175  1635 184935.25
## 176  1636 166845.39
## 177  1637 214277.54
## 178  1638 208185.22
## 179  1639 204912.59
## 180  1640 268461.03
## 181  1641 212343.39
## 182  1642 235350.92
## 183  1643 228110.84
## 184  1644 228592.43
## 185  1645 213042.93
## 186  1646 157788.45
## 187  1647 161663.94
## 188  1648 144953.41
## 189  1649 131879.44
## 190  1650 129162.67
## 191  1651 145385.75
## 192  1652 103562.70
## 193  1653 105851.04
## 194  1654 164947.48
## 195  1655 146915.32
## 196  1656 162246.09
## 197  1657 167336.21
## 198  1658 161886.03
## 199  1659 115044.37
## 200  1660 170093.27
## 201  1661 347077.94
## 202  1662 333497.07
## 203  1663 309847.93
## 204  1664 378056.06
## 205  1665 261920.43
## 206  1666 290752.74
## 207  1667 318330.36
## 208  1668 280831.39
## 209  1669 244642.49
## 210  1670 284814.36
## 211  1671 262078.78
## 212  1672 356945.65
## 213  1673 278059.90
## 214  1674 248195.99
## 215  1675 210633.95
## 216  1676 213554.65
## 217  1677 215096.09
## 218  1678 368312.07
## 219  1679 315794.89
## 220  1680 275103.10
## 221  1681 215268.50
## 222  1682 263864.34
## 223  1683 200226.31
## 224  1684 184604.66
## 225  1685 187328.49
## 226  1686 180344.63
## 227  1687 168629.11
## 228  1688 201583.91
## 229  1689 198633.01
## 230  1690 194653.84
## 231  1691 179285.81
## 232  1692 247399.67
## 233  1693 172845.04
## 234  1694 193297.57
## 235  1695 166361.76
## 236  1696 279650.26
## 237  1697 166388.52
## 238  1698 325530.35
## 239  1699 299928.63
## 240  1700 250796.80
## 241  1701 260490.03
## 242  1702 265869.27
## 243  1703 246235.05
## 244  1704 274637.09
## 245  1705 213981.98
## 246  1706 358010.19
## 247  1707 214092.45
## 248  1708 216016.18
## 249  1709 250450.94
## 250  1710 220495.64
## 251  1711 245656.38
## 252  1712 252468.86
## 253  1713 264239.19
## 254  1714 238755.00
## 255  1715 216524.20
## 256  1716 195467.77
## 257  1717 181385.83
## 258  1718 152065.58
## 259  1719 216143.28
## 260  1720 252841.81
## 261  1721 184057.58
## 262  1722 150043.09
## 263  1723 192842.33
## 264  1724 230215.14
## 265  1725 245141.77
## 266  1726 204070.57
## 267  1727 184743.62
## 268  1728 192875.43
## 269  1729 167008.37
## 270  1730 178962.77
## 271  1731 125640.91
## 272  1732 128122.94
## 273  1733 114117.59
## 274  1734 119909.36
## 275  1735 128143.23
## 276  1736 106421.92
## 277  1737 286084.43
## 278  1738 226124.38
## 279  1739 324524.46
## 280  1740 241948.84
## 281  1741 211422.06
## 282  1742 190203.56
## 283  1743 187817.48
## 284  1744 250926.30
## 285  1745 219920.81
## 286  1746 222916.84
## 287  1747 225279.21
## 288  1748 255637.48
## 289  1749 168214.55
## 290  1750 152314.79
## 291  1751 240956.30
## 292  1752 109578.55
## 293  1753 178105.03
## 294  1754 220011.43
## 295  1755 173570.06
## 296  1756 111010.24
## 297  1757  97785.60
## 298  1758 167181.29
## 299  1759 181853.42
## 300  1760 187509.57
## 301  1761 166133.50
## 302  1762 202282.81
## 303  1763 165984.20
## 304  1764 103993.29
## 305  1765 211826.22
## 306  1766 182394.61
## 307  1767 227114.65
## 308  1768 127519.08
## 309  1769 155513.74
## 310  1770 135937.11
## 311  1771 130606.49
## 312  1772 149982.10
## 313  1773 134051.68
## 314  1774 199564.17
## 315  1775 112163.46
## 316  1776 110311.38
## 317  1777  78694.22
## 318  1778 140155.39
## 319  1779 103690.65
## 320  1780 168359.32
## 321  1781 117580.24
## 322  1782  86619.91
## 323  1783 162832.43
## 324  1784  86723.24
## 325  1785  88711.80
## 326  1786 195329.15
## 327  1787 162631.65
## 328  1788  27632.63
## 329  1789  79030.69
## 330  1790  66830.99
## 331  1791 264817.49
## 332  1792 156300.79
## 333  1793 142034.96
## 334  1794 148719.12
## 335  1795 121955.97
## 336  1796  95314.93
## 337  1797 141811.39
## 338  1798 107497.31
## 339  1799  77311.19
## 340  1800 110718.20
## 341  1801 154341.24
## 342  1802 168472.71
## 343  1803 170538.39
## 344  1804 148184.33
## 345  1805 122773.49
## 346  1806 130120.44
## 347  1807 149725.85
## 348  1808 119593.86
## 349  1809 106754.14
## 350  1810 130394.77
## 351  1811  99049.76
## 352  1812  84206.11
## 353  1813 128058.81
## 354  1814  65983.61
## 355  1815  34784.53
## 356  1816  56362.04
## 357  1817 125651.43
## 358  1818 140303.19
## 359  1819 119070.33
## 360  1820  32782.24
## 361  1821 106401.38
## 362  1822 155435.53
## 363  1823  18202.64
## 364  1824 107014.08
## 365  1825 129550.81
## 366  1826  80410.68
## 367  1827 107704.33
## 368  1828 127287.79
## 369  1829 143314.58
## 370  1830 147548.20
## 371  1831 154985.03
## 372  1832 140181.30
## 373  1833 146707.11
## 374  1834 127335.46
## 375  1835 144499.05
## 376  1836 139503.98
## 377  1837  62657.92
## 378  1838 114691.40
## 379  1839  84537.10
## 380  1840 153572.64
## 381  1841 153308.23
## 382  1842  78866.00
## 383  1843 148802.35
## 384  1844 130674.20
## 385  1845 155076.10
## 386  1846 140960.64
## 387  1847 176144.15
## 388  1848  24961.40
## 389  1849 112375.14
## 390  1850 115414.76
## 391  1851 131961.30
## 392  1852 107708.76
## 393  1853 145303.09
## 394  1854 165752.96
## 395  1855 168557.91
## 396  1856 222271.42
## 397  1857 159709.46
## 398  1858 206457.73
## 399  1859 139999.59
## 400  1860 182124.54
## 401  1861 138474.77
## 402  1862 317951.30
## 403  1863 316356.24
## 404  1864 316371.31
## 405  1865 285828.27
## 406  1866 275363.01
## 407  1867 238861.79
## 408  1868 271316.51
## 409  1869 217097.84
## 410  1870 224544.98
## 411  1871 236952.02
## 412  1872 185359.44
## 413  1873 233245.11
## 414  1874 161220.71
## 415  1875 221142.17
## 416  1876 217553.40
## 417  1877 227347.95
## 418  1878 217523.76
## 419  1879 130982.82
## 420  1880 136899.16
## 421  1881 281399.59
## 422  1882 253420.31
## 423  1883 211504.94
## 424  1884 227776.97
## 425  1885 229533.58
## 426  1886 265228.17
## 427  1887 214220.03
## 428  1888 264691.76
## 429  1889 182184.51
## 430  1890 144046.52
## 431  1891 166575.27
## 432  1892 101231.24
## 433  1893 132753.29
## 434  1894 154182.47
## 435  1895 154413.93
## 436  1896 109368.70
## 437  1897 110590.78
## 438  1898  84902.42
## 439  1899 127320.03
## 440  1900 102327.24
## 441  1901 145128.30
## 442  1902 146548.44
## 443  1903 198599.92
## 444  1904 127572.93
## 445  1905 184153.12
## 446  1906 178284.90
## 447  1907 207200.72
## 448  1908 122809.79
## 449  1909 151540.92
## 450  1910 138234.56
## 451  1911 209345.55
## 452  1912 291713.33
## 453  1913 185292.61
## 454  1914  36309.78
## 455  1915 247328.71
## 456  1916  38394.23
## 457  1917 255209.95
## 458  1918 130752.37
## 459  1919 170645.23
## 460  1920 220383.41
## 461  1921 316465.55
## 462  1922 271550.80
## 463  1923 234867.53
## 464  1924 226397.60
## 465  1925 231067.48
## 466  1926 311570.58
## 467  1927 129552.02
## 468  1928 188718.12
## 469  1929 111976.55
## 470  1930 145269.41
## 471  1931 162717.68
## 472  1932 163802.04
## 473  1933 199714.88
## 474  1934 205417.72
## 475  1935 180804.70
## 476  1936 208289.81
## 477  1937 195847.36
## 478  1938 196120.70
## 479  1939 233854.76
## 480  1940 178312.11
## 481  1941 199172.34
## 482  1942 199356.66
## 483  1943 182630.60
## 484  1944 286602.86
## 485  1945 281740.22
## 486  1946 169072.53
## 487  1947 254502.90
## 488  1948 183464.47
## 489  1949 230772.10
## 490  1950 184121.47
## 491  1951 265548.23
## 492  1952 236923.58
## 493  1953 167203.12
## 494  1954 208543.02
## 495  1955 144518.72
## 496  1956 337830.95
## 497  1957 177121.66
## 498  1958 288288.19
## 499  1959 183240.59
## 500  1960 113435.92
## 501  1961 130545.02
## 502  1962 106263.75
## 503  1963 110696.02
## 504  1964 128752.24
## 505  1965 163738.20
## 506  1966 140150.17
## 507  1967 263785.45
## 508  1968 350155.07
## 509  1969 293640.74
## 510  1970 353397.21
## 511  1971 361834.95
## 512  1972 317186.63
## 513  1973 251854.74
## 514  1974 288422.78
## 515  1975 358595.79
## 516  1976 251150.02
## 517  1977 321962.12
## 518  1978 322943.47
## 519  1979 280790.37
## 520  1980 202907.21
## 521  1981 288627.66
## 522  1982 214448.30
## 523  1983 203564.67
## 524  1984 197209.31
## 525  1985 227992.17
## 526  1986 238396.56
## 527  1987 184019.27
## 528  1988 188725.08
## 529  1989 202748.37
## 530  1990 223956.69
## 531  1991 213871.69
## 532  1992 225579.76
## 533  1993 177293.93
## 534  1994 260014.25
## 535  1995 196915.75
## 536  1996 246679.11
## 537  1997 268392.26
## 538  1998 304725.82
## 539  1999 253259.00
## 540  2000 281890.70
## 541  2001 270062.48
## 542  2002 248380.81
## 543  2003 257556.35
## 544  2004 260193.63
## 545  2005 221060.60
## 546  2006 210507.83
## 547  2007 242932.43
## 548  2008 199591.40
## 549  2009 218946.78
## 550  2010 206223.62
## 551  2011 158736.37
## 552  2012 183527.86
## 553  2013 196632.62
## 554  2014 203935.73
## 555  2015 204040.88
## 556  2016 217323.26
## 557  2017 203832.12
## 558  2018 124604.21
## 559  2019 138530.05
## 560  2020 116714.81
## 561  2021 100709.75
## 562  2022 188515.90
## 563  2023 188161.59
## 564  2024 272817.98
## 565  2025 317078.16
## 566  2026 187303.83
## 567  2027 170502.87
## 568  2028 186800.71
## 569  2029 191605.82
## 570  2030 234109.13
## 571  2031 207331.24
## 572  2032 219866.94
## 573  2033 226997.44
## 574  2034 182020.67
## 575  2035 224906.74
## 576  2036 203596.45
## 577  2037 208302.40
## 578  2038 285661.32
## 579  2039 195312.70
## 580  2040 302961.48
## 581  2041 235494.89
## 582  2042 215493.99
## 583  2043 181576.21
## 584  2044 204727.06
## 585  2045 204477.83
## 586  2046 183158.14
## 587  2047 159344.31
## 588  2048 153337.70
## 589  2049 199082.08
## 590  2050 179024.77
## 591  2051  72454.66
## 592  2052 128037.33
## 593  2053 147796.63
## 594  2054  60204.81
## 595  2055 166113.34
## 596  2056 142799.04
## 597  2057 109728.57
## 598  2058 223483.38
## 599  2059 128984.10
## 600  2060 189998.37
## 601  2061 177719.19
## 602  2062 129371.65
## 603  2063  81377.76
## 604  2064 142756.79
## 605  2065 112452.61
## 606  2066 173088.38
## 607  2067 141670.56
## 608  2068 194941.88
## 609  2069  62987.45
## 610  2070  85861.14
## 611  2071  90465.43
## 612  2072 178620.57
## 613  2073 130467.93
## 614  2074 181176.23
## 615  2075 150870.92
## 616  2076 105791.24
## 617  2077 143575.17
## 618  2078 122156.35
## 619  2079 129181.06
## 620  2080  95007.50
## 621  2081 119912.52
## 622  2082 167280.93
## 623  2083 157014.17
## 624  2084  85254.15
## 625  2085 118719.37
## 626  2086 147318.90
## 627  2087 149379.95
## 628  2088 123176.80
## 629  2089  54567.66
## 630  2090 134059.71
## 631  2091 122014.04
## 632  2092 157893.33
## 633  2093 129933.36
## 634  2094  95148.91
## 635  2095 155898.68
## 636  2096  47463.31
## 637  2097  71155.40
## 638  2098 164469.09
## 639  2099  35100.71
## 640  2100 128114.54
## 641  2101 147817.05
## 642  2102 109546.97
## 643  2103  85497.61
## 644  2104 187147.01
## 645  2105 112587.96
## 646  2106  85709.53
## 647  2107 209503.88
## 648  2108 107465.36
## 649  2109 135968.80
## 650  2110 122080.22
## 651  2111 126719.40
## 652  2112 137110.19
## 653  2113 132570.83
## 654  2114 100899.59
## 655  2115 162063.12
## 656  2116 112190.76
## 657  2117 172602.50
## 658  2118 104392.97
## 659  2119  74788.80
## 660  2120 114007.59
## 661  2121      0.00
## 662  2122  88364.98
## 663  2123  69240.38
## 664  2124 170632.98
## 665  2125 164146.30
## 666  2126 175226.45
## 667  2127 175152.17
## 668  2128 106131.20
## 669  2129  70246.76
## 670  2130 134324.57
## 671  2131 173276.85
## 672  2132 118290.62
## 673  2133 112624.86
## 674  2134  97807.60
## 675  2135  85526.00
## 676  2136  71170.95
## 677  2137 121578.17
## 678  2138 134026.14
## 679  2139 152487.54
## 680  2140 146302.23
## 681  2141 130297.44
## 682  2142 123036.58
## 683  2143 168615.04
## 684  2144 123699.55
## 685  2145 138294.78
## 686  2146 145996.03
## 687  2147 187863.54
## 688  2148 140267.76
## 689  2149 152419.24
## 690  2150 200437.66
## 691  2151  83520.93
## 692  2152 177536.93
## 693  2153 165017.34
## 694  2154 103961.86
## 695  2155 143107.94
## 696  2156 304468.22
## 697  2157 233997.50
## 698  2158 235595.73
## 699  2159 217950.62
## 700  2160 182357.55
## 701  2161 249858.07
## 702  2162 371965.22
## 703  2163 282088.83
## 704  2164 231480.29
## 705  2165 183978.69
## 706  2166 165886.92
## 707  2167 219631.02
## 708  2168 223714.93
## 709  2169 208664.76
## 710  2170 233783.53
## 711  2171 159374.02
## 712  2172 150461.37
## 713  2173 191584.41
## 714  2174 254237.84
## 715  2175 267412.11
## 716  2176 275417.88
## 717  2177 265963.97
## 718  2178 213585.21
## 719  2179 153708.55
## 720  2180 276559.53
## 721  2181 207625.97
## 722  2182 243421.23
## 723  2183 210491.51
## 724  2184 128525.31
## 725  2185 130089.66
## 726  2186 159634.77
## 727  2187 154906.34
## 728  2188 168698.14
## 729  2189 326402.56
## 730  2190  63746.21
## 731  2191  64085.51
## 732  2192  72307.96
## 733  2193  99404.29
## 734  2194  83053.85
## 735  2195 101252.90
## 736  2196  97808.83
## 737  2197 131406.97
## 738  2198 116230.74
## 739  2199 170025.52
## 740  2200 104592.25
## 741  2201 122115.28
## 742  2202 137805.84
## 743  2203 135521.35
## 744  2204 183912.45
## 745  2205 108134.06
## 746  2206 114447.77
## 747  2207 168776.18
## 748  2208 219073.09
## 749  2209 221401.04
## 750  2210 128295.49
## 751  2211 101322.03
## 752  2212 117609.36
## 753  2213  75671.32
## 754  2214 150455.35
## 755  2215  95653.79
## 756  2216 181487.40
## 757  2217  64208.97
## 758  2218  61705.85
## 759  2219  82999.20
## 760  2220  55012.82
## 761  2221 249823.69
## 762  2222 243823.26
## 763  2223 283336.19
## 764  2224 230483.92
## 765  2225 166212.11
## 766  2226 232884.54
## 767  2227 198727.89
## 768  2228 251422.91
## 769  2229 238496.23
## 770  2230 174748.71
## 771  2231 217048.35
## 772  2232 195253.63
## 773  2233 191957.14
## 774  2234 219544.42
## 775  2235 238044.91
## 776  2236 259600.91
## 777  2237 305596.44
## 778  2238 225638.29
## 779  2239 150215.90
## 780  2240 179479.67
## 781  2241 179297.49
## 782  2242 146737.78
## 783  2243 146948.69
## 784  2244 129741.37
## 785  2245 107338.09
## 786  2246 141247.81
## 787  2247 116330.77
## 788  2248 117785.56
## 789  2249 134593.82
## 790  2250 141952.17
## 791  2251 137545.40
## 792  2252 199840.94
## 793  2253 160285.70
## 794  2254 182111.66
## 795  2255 216170.44
## 796  2256 186421.85
## 797  2257 247667.09
## 798  2258 166130.50
## 799  2259 188728.53
## 800  2260 149951.87
## 801  2261 165369.61
## 802  2262 191819.33
## 803  2263 303913.23
## 804  2264 375260.71
## 805  2265 218894.87
## 806  2266 240623.18
## 807  2267 324126.89
## 808  2268 291454.56
## 809  2269 174260.14
## 810  2270 183174.19
## 811  2271 230457.94
## 812  2272 205976.56
## 813  2273 171324.08
## 814  2274 192956.20
## 815  2275 197842.70
## 816  2276 184869.56
## 817  2277 208870.99
## 818  2278 180193.57
## 819  2279 140974.64
## 820  2280 105156.28
## 821  2281 193585.12
## 822  2282 207485.20
## 823  2283 112995.19
## 824  2284 121692.51
## 825  2285 137741.20
## 826  2286 124396.88
## 827  2287 276965.87
## 828  2288 257613.38
## 829  2289 325340.74
## 830  2290 360097.97
## 831  2291 289967.94
## 832  2292 311059.94
## 833  2293 371048.75
## 834  2294 328644.93
## 835  2295 360026.51
## 836  2296 281711.18
## 837  2297 269475.13
## 838  2298 267529.91
## 839  2299 346381.29
## 840  2300 301993.06
## 841  2301 248357.98
## 842  2302 264819.55
## 843  2303 241569.02
## 844  2304 256326.40
## 845  2305 214408.68
## 846  2306 216580.93
## 847  2307 208994.68
## 848  2308 227446.98
## 849  2309 237179.17
## 850  2310 201215.43
## 851  2311 213369.80
## 852  2312 202295.16
## 853  2313 192162.97
## 854  2314 187654.17
## 855  2315 190732.08
## 856  2316 202666.72
## 857  2317 205191.65
## 858  2318 192586.69
## 859  2319 186055.91
## 860  2320 190695.99
## 861  2321 236199.88
## 862  2322 185899.99
## 863  2323 179058.50
## 864  2324 171249.97
## 865  2325 221062.46
## 866  2326 169352.40
## 867  2327 221118.43
## 868  2328 231485.61
## 869  2329 209957.87
## 870  2330 201246.91
## 871  2331 329232.26
## 872  2332 341287.16
## 873  2333 309437.88
## 874  2334 262914.24
## 875  2335 245444.20
## 876  2336 314096.44
## 877  2337 211063.68
## 878  2338 259206.36
## 879  2339 219420.21
## 880  2340 319340.28
## 881  2341 218402.64
## 882  2342 228096.77
## 883  2343 236012.03
## 884  2344 239060.70
## 885  2345 254202.70
## 886  2346 207434.03
## 887  2347 205138.84
## 888  2348 230089.82
## 889  2349 204592.65
## 890  2350 261362.75
## 891  2351 231329.57
## 892  2352 244653.16
## 893  2353 284368.71
## 894  2354 149181.78
## 895  2355 152782.93
## 896  2356 190415.28
## 897  2357 217406.86
## 898  2358 215470.32
## 899  2359 134974.69
## 900  2360 120726.25
## 901  2361 151850.02
## 902  2362 277905.46
## 903  2363 144306.21
## 904  2364 173876.45
## 905  2365 213914.01
## 906  2366 180189.20
## 907  2367 218964.51
## 908  2368 210681.86
## 909  2369 221889.57
## 910  2370 188487.31
## 911  2371 192868.58
## 912  2372 201427.35
## 913  2373 254175.15
## 914  2374 305967.58
## 915  2375 222848.93
## 916  2376 288612.32
## 917  2377 331635.00
## 918  2378 131020.60
## 919  2379 241771.52
## 920  2380 144624.20
## 921  2381 170575.06
## 922  2382 191440.34
## 923  2383 209883.03
## 924  2384 241741.06
## 925  2385 143242.12
## 926  2386 128078.18
## 927  2387 142558.87
## 928  2388 111277.03
## 929  2389 117902.37
## 930  2390 133478.03
## 931  2391 141320.19
## 932  2392  96924.56
## 933  2393 147770.65
## 934  2394 130051.82
## 935  2395 224309.04
## 936  2396 148665.36
## 937  2397 222323.06
## 938  2398 125445.76
## 939  2399  46150.27
## 940  2400  34848.10
## 941  2401 116952.57
## 942  2402 118252.20
## 943  2403 197518.70
## 944  2404 141819.73
## 945  2405 152785.83
## 946  2406 134438.44
## 947  2407 114732.73
## 948  2408 141392.16
## 949  2409 122875.00
## 950  2410 190275.44
## 951  2411 103376.03
## 952  2412 159530.03
## 953  2413 128647.21
## 954  2414 145822.61
## 955  2415 168504.58
## 956  2416 105919.76
## 957  2417 115901.52
## 958  2418 130596.68
## 959  2419 148218.23
## 960  2420 109351.63
## 961  2421 158386.32
## 962  2422 104154.60
## 963  2423 118439.27
## 964  2424 193785.62
## 965  2425 312164.05
## 966  2426 162942.83
## 967  2427 110263.49
## 968  2428 176208.65
## 969  2429  96712.33
## 970  2430 122362.10
## 971  2431 116669.26
## 972  2432 141795.92
## 973  2433 130894.92
## 974  2434 141035.48
## 975  2435 162602.82
## 976  2436  73350.58
## 977  2437 101595.78
## 978  2438 109608.35
## 979  2439 109832.27
## 980  2440 118861.21
## 981  2441  85811.16
## 982  2442  92061.06
## 983  2443  94192.59
## 984  2444 124839.80
## 985  2445  71975.13
## 986  2446 129987.87
## 987  2447 180068.77
## 988  2448 131018.65
## 989  2449  91896.06
## 990  2450 154461.44
## 991  2451 138405.65
## 992  2452 194711.44
## 993  2453  73331.77
## 994  2454 129186.78
## 995  2455 113482.19
## 996  2456 135677.73
## 997  2457 127542.02
## 998  2458  92602.97
## 999  2459  74360.09
## 1000 2460 126937.94
## 1001 2461 102791.44
## 1002 2462 134066.78
## 1003 2463 104407.92
## 1004 2464 201611.09
## 1005 2465 116228.05
## 1006 2466 114428.89
## 1007 2467 135696.78
## 1008 2468  66025.40
## 1009 2469  79128.99
## 1010 2470 217594.04
## 1011 2471 203755.41
## 1012 2472 193747.52
## 1013 2473 136572.59
## 1014 2474 100362.98
## 1015 2475 211723.43
## 1016 2476 120289.16
## 1017 2477 121389.26
## 1018 2478 190821.48
## 1019 2479 130528.03
## 1020 2480 158674.13
## 1021 2481 129516.62
## 1022 2482 139929.63
## 1023 2483 108588.61
## 1024 2484 130267.98
## 1025 2485 113807.53
## 1026 2486 175465.16
## 1027 2487 237564.22
## 1028 2488 129253.94
## 1029 2489 166178.84
## 1030 2490 161101.44
## 1031 2491  74016.10
## 1032 2492 204625.11
## 1033 2493 155614.14
## 1034 2494 177338.47
## 1035 2495 101533.15
## 1036 2496 247811.91
## 1037 2497 152050.04
## 1038 2498 101527.47
## 1039 2499  94988.76
## 1040 2500 145196.30
## 1041 2501 135183.27
## 1042 2502 170276.07
## 1043 2503 101820.33
## 1044 2504 210268.69
## 1045 2505 225156.04
## 1046 2506 253422.36
## 1047 2507 302776.90
## 1048 2508 253852.90
## 1049 2509 238839.00
## 1050 2510 226451.06
## 1051 2511 194032.76
## 1052 2512 223171.40
## 1053 2513 225358.54
## 1054 2514 206046.97
## 1055 2515 181257.56
## 1056 2516 181325.72
## 1057 2517 152994.56
## 1058 2518 162394.19
## 1059 2519 230769.96
## 1060 2520 219028.53
## 1061 2521 206546.80
## 1062 2522 228986.86
## 1063 2523 128811.40
## 1064 2524 135926.56
## 1065 2525 151076.59
## 1066 2526 162258.73
## 1067 2527 125033.13
## 1068 2528 129310.33
## 1069 2529 147295.78
## 1070 2530 143025.86
## 1071 2531 251093.26
## 1072 2532 242994.14
## 1073 2533 211645.27
## 1074 2534 223572.32
## 1075 2535 297472.27
## 1076 2536 243664.69
## 1077 2537 209102.68
## 1078 2538 196717.87
## 1079 2539 198181.01
## 1080 2540 196372.69
## 1081 2541 192795.22
## 1082 2542 191084.06
## 1083 2543 125690.56
## 1084 2544 143919.72
## 1085 2545 126908.20
## 1086 2546 142533.37
## 1087 2547 144842.66
## 1088 2548 199655.42
## 1089 2549 194295.15
## 1090 2550 673124.18
## 1091 2551 164922.65
## 1092 2552 142336.64
## 1093 2553  64100.58
## 1094 2554  94377.05
## 1095 2555  89730.13
## 1096 2556  88673.06
## 1097 2557 109902.72
## 1098 2558 188150.41
## 1099 2559 127798.25
## 1100 2560 120876.71
## 1101 2561 102340.70
## 1102 2562  94607.17
## 1103 2563 126023.73
## 1104 2564 150919.93
## 1105 2565  91539.05
## 1106 2566 164565.55
## 1107 2567 126334.54
## 1108 2568 204941.22
## 1109 2569 192593.20
## 1110 2570 128677.71
## 1111 2571 233488.12
## 1112 2572 156414.35
## 1113 2573 215010.53
## 1114 2574 259248.16
## 1115 2575 126241.51
## 1116 2576 135490.99
## 1117 2577      0.00
## 1118 2578  50159.38
## 1119 2579  49466.72
## 1120 2580 150305.00
## 1121 2581 160377.85
## 1122 2582 118800.98
## 1123 2583 246736.09
## 1124 2584 172493.59
## 1125 2585 219642.80
## 1126 2586 228406.04
## 1127 2587 214830.65
## 1128 2588 126539.68
## 1129 2589 148129.12
## 1130 2590 206918.13
## 1131 2591 215922.20
## 1132 2592 231173.92
## 1133 2593 230162.80
## 1134 2594 174814.62
## 1135 2595 191916.87
## 1136 2596 282768.65
## 1137 2597 208129.79
## 1138 2598 278557.03
## 1139 2599 275033.37
## 1140 2600 199129.56
## 1141 2601 161316.69
## 1142 2602 110783.96
## 1143 2603 107125.86
## 1144 2604  95667.72
## 1145 2605 113880.47
## 1146 2606 157802.37
## 1147 2607 187097.16
## 1148 2608 233071.38
## 1149 2609 167153.36
## 1150 2610  90997.15
## 1151 2611 187777.16
## 1152 2612 181992.45
## 1153 2613 120698.05
## 1154 2614 127913.54
## 1155 2615 147505.22
## 1156 2616 115837.18
## 1157 2617 230570.95
## 1158 2618 192008.63
## 1159 2619 211646.71
## 1160 2620 192414.46
## 1161 2621 192842.56
## 1162 2622 202160.06
## 1163 2623 245307.16
## 1164 2624 289253.99
## 1165 2625 309738.83
## 1166 2626 173436.60
## 1167 2627 161908.75
## 1168 2628 330502.22
## 1169 2629 365417.18
## 1170 2630 286191.98
## 1171 2631 341566.71
## 1172 2632 313319.17
## 1173 2633 252482.18
## 1174 2634 296814.43
## 1175 2635 163424.90
## 1176 2636 194226.47
## 1177 2637 162074.62
## 1178 2638 269984.28
## 1179 2639 205844.40
## 1180 2640 150872.83
## 1181 2641 129871.33
## 1182 2642 220954.73
## 1183 2643 119302.25
## 1184 2644 150072.37
## 1185 2645 106213.13
## 1186 2646 100068.56
## 1187 2647 112617.90
## 1188 2648 146638.37
## 1189 2649 149964.17
## 1190 2650 135035.97
## 1191 2651 158023.32
## 1192 2652 338655.54
## 1193 2653 250585.35
## 1194 2654 262175.36
## 1195 2655 307320.63
## 1196 2656 264810.98
## 1197 2657 278331.18
## 1198 2658 265479.96
## 1199 2659 275701.18
## 1200 2660 304974.75
## 1201 2661 287185.87
## 1202 2662 281035.68
## 1203 2663 269594.79
## 1204 2664 229637.68
## 1205 2665 270615.66
## 1206 2666 241258.35
## 1207 2667 184916.59
## 1208 2668 191086.86
## 1209 2669 196163.63
## 1210 2670 271021.13
## 1211 2671 190612.74
## 1212 2672 198861.15
## 1213 2673 214290.00
## 1214 2674 218694.58
## 1215 2675 175160.86
## 1216 2676 199113.10
## 1217 2677 227446.66
## 1218 2678 266982.08
## 1219 2679 260573.35
## 1220 2680 273260.26
## 1221 2681 336167.73
## 1222 2682 304974.88
## 1223 2683 398907.70
## 1224 2684 291858.56
## 1225 2685 316122.62
## 1226 2686 252971.27
## 1227 2687 280920.09
## 1228 2688 218048.03
## 1229 2689 199678.60
## 1230 2690 352321.47
## 1231 2691 218183.06
## 1232 2692 152007.78
## 1233 2693 213360.39
## 1234 2694 153385.05
## 1235 2695 208038.55
## 1236 2696 197192.90
## 1237 2697 225085.26
## 1238 2698 206679.60
## 1239 2699 203151.09
## 1240 2700 174338.10
## 1241 2701 168204.72
## 1242 2702 127025.39
## 1243 2703 154247.20
## 1244 2704 167543.34
## 1245 2705 118630.16
## 1246 2706 110923.87
## 1247 2707 115578.69
## 1248 2708 126033.00
## 1249 2709  93196.43
## 1250 2710 115693.75
## 1251 2711 292847.15
## 1252 2712 326673.13
## 1253 2713 183142.61
## 1254 2714 166669.50
## 1255 2715 188993.81
## 1256 2716 154319.34
## 1257 2717 189889.37
## 1258 2718 221036.39
## 1259 2719 162604.32
## 1260 2720 164927.44
## 1261 2721 142513.06
## 1262 2722 190162.38
## 1263 2723 156487.62
## 1264 2724 114737.48
## 1265 2725 127014.67
## 1266 2726 157172.64
## 1267 2727 191468.25
## 1268 2728 166457.81
## 1269 2729 148856.46
## 1270 2730 139179.17
## 1271 2731  73832.20
## 1272 2732  94241.71
## 1273 2733 182599.45
## 1274 2734 151531.63
## 1275 2735 125677.23
## 1276 2736 147196.63
## 1277 2737 109690.22
## 1278 2738 182276.46
## 1279 2739 157177.44
## 1280 2740 118587.31
## 1281 2741 143476.58
## 1282 2742 145023.77
## 1283 2743 142661.41
## 1284 2744 152628.22
## 1285 2745 142058.01
## 1286 2746 138110.04
## 1287 2747 120603.73
## 1288 2748 117817.30
## 1289 2749 111224.03
## 1290 2750 122565.46
## 1291 2751 119994.79
## 1292 2752 202319.05
## 1293 2753 198877.21
## 1294 2754 303815.79
## 1295 2755 129362.19
## 1296 2756  92818.50
## 1297 2757  79227.91
## 1298 2758  62412.72
## 1299 2759 125969.12
## 1300 2760 151563.46
## 1301 2761 130233.19
## 1302 2762 148072.07
## 1303 2763 207232.94
## 1304 2764 120600.92
## 1305 2765 261263.18
## 1306 2766 153778.48
## 1307 2767  74795.32
## 1308 2768 142388.71
## 1309 2769 131250.87
## 1310 2770 136557.81
## 1311 2771 103083.47
## 1312 2772 105714.54
## 1313 2773 183791.68
## 1314 2774 147412.07
## 1315 2775 142610.07
## 1316 2776 136576.62
## 1317 2777 121935.59
## 1318 2778  64115.76
## 1319 2779 149470.66
## 1320 2780  87171.55
## 1321 2781  61910.78
## 1322 2782  86834.51
## 1323 2783  86529.78
## 1324 2784 119764.16
## 1325 2785 124239.30
## 1326 2786  31061.09
## 1327 2787 108212.34
## 1328 2788  59860.82
## 1329 2789 196896.17
## 1330 2790  67754.58
## 1331 2791 119721.05
## 1332 2792  62712.74
## 1333 2793 160633.95
## 1334 2794  90301.76
## 1335 2795 112921.65
## 1336 2796  61291.12
## 1337 2797 253391.99
## 1338 2798 104761.03
## 1339 2799 118208.29
## 1340 2800  39385.15
## 1341 2801  88569.69
## 1342 2802 125566.14
## 1343 2803 216844.50
## 1344 2804 122076.03
## 1345 2805  59486.74
## 1346 2806  67140.12
## 1347 2807 167983.70
## 1348 2808 145064.94
## 1349 2809 112678.79
## 1350 2810 128418.14
## 1351 2811 179076.45
## 1352 2812 156055.44
## 1353 2813 235137.00
## 1354 2814 239543.17
## 1355 2815  75141.09
## 1356 2816 235399.75
## 1357 2817 140206.75
## 1358 2818 125899.23
## 1359 2819 171306.81
## 1360 2820 145974.21
## 1361 2821  93560.28
## 1362 2822 177504.96
## 1363 2823 359752.24
## 1364 2824 206543.65
## 1365 2825 149778.60
## 1366 2826 132838.84
## 1367 2827 147853.06
## 1368 2828 261854.86
## 1369 2829 235725.08
## 1370 2830 257323.39
## 1371 2831 198961.59
## 1372 2832 272115.64
## 1373 2833 254673.50
## 1374 2834 230176.75
## 1375 2835 231599.06
## 1376 2836 207980.67
## 1377 2837 177725.69
## 1378 2838 166738.49
## 1379 2839 183611.75
## 1380 2840 216603.94
## 1381 2841 214751.99
## 1382 2842 231546.96
## 1383 2843 150260.44
## 1384 2844 180051.72
## 1385 2845 111394.38
## 1386 2846 222561.33
## 1387 2847 217900.34
## 1388 2848 206370.70
## 1389 2849 223902.81
## 1390 2850 267628.87
## 1391 2851 234737.63
## 1392 2852 239867.14
## 1393 2853 239254.58
## 1394 2854 153350.16
## 1395 2855 218062.07
## 1396 2856 218564.39
## 1397 2857 206439.30
## 1398 2858 213005.04
## 1399 2859 145455.40
## 1400 2860 117302.25
## 1401 2861 151280.89
## 1402 2862 215051.19
## 1403 2863 136529.12
## 1404 2864 261458.60
## 1405 2865 166575.27
## 1406 2866 203876.84
## 1407 2867  76822.18
## 1408 2868 108793.30
## 1409 2869 130938.28
## 1410 2870 169459.13
## 1411 2871  88911.88
## 1412 2872  30635.63
## 1413 2873 108902.64
## 1414 2874 138221.85
## 1415 2875 127954.57
## 1416 2876 139663.43
## 1417 2877 106659.42
## 1418 2878 118633.96
## 1419 2879 126842.50
## 1420 2880 106679.73
## 1421 2881 119942.96
## 1422 2882 153298.82
## 1423 2883 146804.49
## 1424 2884 159889.75
## 1425 2885 143828.89
## 1426 2886 213410.41
## 1427 2887  87888.63
## 1428 2888 115702.48
## 1429 2889  26955.59
## 1430 2890  45926.98
## 1431 2891 150536.98
## 1432 2892  30107.00
## 1433 2893  96235.04
## 1434 2894  34578.79
## 1435 2895 256292.43
## 1436 2896 247522.51
## 1437 2897 219667.95
## 1438 2898 208239.13
## 1439 2899 209426.75
## 1440 2900 161967.90
## 1441 2901 208838.05
## 1442 2902 214128.52
## 1443 2903 286238.11
## 1444 2904 279637.76
## 1445 2905 117231.54
## 1446 2906 223454.86
## 1447 2907 124817.87
## 1448 2908 129234.13
## 1449 2909 219827.91
## 1450 2910  67946.50
## 1451 2911 110775.81
## 1452 2912 160286.03
## 1453 2913 112595.47
## 1454 2914  90596.02
## 1455 2915  90763.02
## 1456 2916 110677.01
## 1457 2917 186612.19
## 1458 2918 124653.74
## 1459 2919 241923.02

Write the predictions to a CSV file for upload to Kaggle.

write.csv(prediction, file="prediction.csv", row.names = FALSE)
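
Before uploading, it may be worth running a quick sanity check on the submission. The printed output above has 1459 rows (Ids 1461 to 2919), and one of them (Id 2577) has a predicted price of 0.00. That value is likely to cause trouble, because the competition scores submissions on the error of the logarithm of the predicted price. The sketch below is one possible check, not part of the original analysis: `check_submission` is a hypothetical helper, and it assumes the data frame has columns named `Id` and `SalePrice`, which may differ from the column names in `prediction`.

```r
# Hypothetical helper (not from the original analysis): returns a character
# vector describing problems found in a submission data frame. An empty
# vector means none of these checks failed.
# Assumption: the columns are named Id and SalePrice.
check_submission <- function(df, expected_rows = 1459) {
  if (!all(c("Id", "SalePrice") %in% names(df))) {
    return("columns Id and SalePrice are required")
  }
  problems <- character(0)
  if (nrow(df) != expected_rows) {
    problems <- c(problems,
                  sprintf("expected %d rows, found %d", expected_rows, nrow(df)))
  }
  if (anyDuplicated(df$Id) > 0) {
    problems <- c(problems, "duplicate Id values")
  }
  if (anyNA(df$SalePrice)) {
    problems <- c(problems, "missing SalePrice values")
  }
  # log(0) is -Inf and log of a negative number is NaN, so flag both cases
  n_bad <- sum(df$SalePrice <= 0, na.rm = TRUE)
  if (n_bad > 0) {
    problems <- c(problems,
                  sprintf("%d non-positive SalePrice value(s)", n_bad))
  }
  problems
}

# Toy data mirroring three rows of the output above, including the 0.00 row
toy <- data.frame(Id = c(2576, 2577, 2578),
                  SalePrice = c(135490.99, 0.00, 50159.38))
print(check_submission(toy, expected_rows = 3))
```

On the real data the call would be `check_submission(prediction)`, once the column names have been confirmed. How to handle a flagged row (for example, replacing it with the median of the other predictions) is a modelling decision that this sketch does not make.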