DATA 605 - Final Project

Part 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\).

set.seed(5432)

n <- 10000
N <- 10
sigma <- (N + 1)/2

df <- data.frame(X = runif(n, min = 1, max = N), Y = rnorm(n, mean = sigma, sd = sigma))
head(df,5)

##          X         Y
## 1 5.095135  3.755059
## 2 3.288050  8.498194
## 3 9.097577  4.651931
## 4 4.226510  6.247689
## 5 2.431395 -3.720824

Checking that the distribution of 10000 random numbers is normal.

hist(df$Y, col = 'lightblue', breaks = 20)

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

Median of the x variable as median of IQR

x = quantile(df$X, 0.5)
x

##      50% 
## 5.492108

Median of the y variable as 1st quartile

y = quantile(df$Y, 0.25)
y

##      25% 
## 1.734326

\(P(X>x | X>y)\)

prob_Xx_Xy = df %>% filter(X > x, X > y) %>% nrow()/n
prob_Xy = df %>% filter(X > y) %>% nrow()/n
prob1 = prob_Xx_Xy/prob_Xy
round(prob1,3)

## [1] 0.543

The probability that X is greater than its median given that X is greater than Q1 of Y is 0.543

\(P(X>x | Y>y)\)

prob_Xx_Yy <- df %>% filter(X > x, Y > y) %>% nrow()/n
round(prob_Xx_Yy,3)

## [1] 0.372

The probability of X>x and Y>y is 0.372

\(P(X<x | X>y)\)

prob_xX_Xy = df %>% filter(X < x, X > y) %>% nrow()/n

prob_Xy = df %>% filter(X > y) %>% nrow()/n

prob2 = prob_xX_Xy/prob_Xy
round(prob2,3)

## [1] 0.457

The probability of X being less than its median and greater than Q1 of Y is 0.457

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Joint probability:

joint_prob_AB = df %>%
  mutate(A = ifelse(X > x, "X > x", "X < x"), B = ifelse(Y > y, " Y > y", " Y < y")) %>%
    group_by(A, B) %>%
      summarise(count = n()) %>%
        mutate(probability = count/n)

joint_prob_AB

## # A tibble: 4 x 4
## # Groups:   A [2]
##   A     B        count probability
##   <chr> <chr>    <int>       <dbl>
## 1 X < x " Y < y"  1220       0.122
## 2 X < x " Y > y"  3780       0.378
## 3 X > x " Y < y"  1280       0.128
## 4 X > x " Y > y"  3720       0.372

Marginal probabilities:

marg_prob_A = joint_prob_AB %>% 
                ungroup() %>% 
                  group_by(A) %>% 
                    summarize(count = sum(count), probability = sum(probability))
marg_prob_A

## # A tibble: 2 x 3
##   A     count probability
## * <chr> <int>       <dbl>
## 1 X < x  5000         0.5
## 2 X > x  5000         0.5

marg_prob_B = joint_prob_AB %>% 
                ungroup() %>% 
                  group_by(B) %>% 
                    summarize(count = sum(count), probability = sum(probability))
marg_prob_A

## # A tibble: 2 x 3
##   A     count probability
## * <chr> <int>       <dbl>
## 1 X < x  5000         0.5
## 2 X > x  5000         0.5

Build the table

tab <- bind_rows(joint_prob_AB, marg_prob_A, marg_prob_B) %>% 
          dplyr::select(-count) %>% 
            spread(A, probability)

tab$B[is.na(tab$B)] <- "Total"

tab <- tab %>% 
          rename(Event = 'B') 

names(tab)[4] <- "Total"

tab[,4][is.na(tab[,4])] <- 1

tab

## # A tibble: 3 x 4
##   Event    `X < x` `X > x` Total
##   <chr>      <dbl>   <dbl> <dbl>
## 1 " Y < y"   0.122   0.128  0.25
## 2 " Y > y"   0.378   0.372  0.75
## 3 "Total"    0.5     0.5    1

Evaluation:

P(X>x and Y>y) = 0.372

P(X>x)P(Y>y) = (0.5000)*(0.75) = 0.375

This is close enough to say that P(X>x and Y>y) = P(X>x)P(Y>y).

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Fisher’s Exact Test is used with small sample sets and the Chi Squared Test with larger sample sets. Since we are using 10000 samples I would say Chi Square is more appropriate. With the results in both tests coming in with p-values above 0.05 (0.173 for both), I would say we cannot reject the null hypothesis and say that independence does hold for X and Y.

Fisher’s Exact Test

df_table <- table(df$X > x, df$Y>y)
fisher.test(df_table)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  df_table
## p-value = 0.173
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.8559101 1.0279714
## sample estimates:
## odds ratio 
##  0.9380243

Chi Square Test

chisq.test(df_table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df_table
## X-squared = 1.8565, df = 1, p-value = 0.173

Part 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Load the train and test datasets from my GitHub coursedata repo.

house_train <- read.csv("https://raw.githubusercontent.com/douglasbarley/coursedata/master/train.csv", header = TRUE, stringsAsFactors = FALSE)
house_test <- read.csv("https://raw.githubusercontent.com/douglasbarley/coursedata/master/test.csv", header = TRUE, stringsAsFactors = FALSE)

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set.

Look at the structure of the data.

str(house_train)

## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

and look at summary stats of the data.

summary(house_train)

##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
##

Let’s look at some dimensions that we intuitively think might impact price.

hist(house_train$GrLivArea, main = "Above Ground Living Area", xlab = "Above Ground Living Area (sq ft)", ylab = "Count", col="lightblue", breaks = 20)

Most houses on the market appear to have between 1200 and 1800 sq ft.

hist(house_train$TotRmsAbvGrd, main = "Total Rooms Above Ground", xlab = "Total # Rooms Above Ground", ylab = "Count", col="lightblue", breaks = 10)

Most houses appear to have between 5 and 7 rooms above ground.

hist(house_train$GarageCars, main = "Garage Capacity", xlab = "# Cars", ylab = "Count", col="lightblue", breaks = 5)

Most houses appear to have a 2-car garage.

hist(house_train$FullBath, main = "Number of Baths", xlab = "# Baths", ylab = "Count", col="lightblue", breaks = 2)

Most houses appear to have a 1-2 bathrooms.

hist(house_train$YearBuilt, main = "Year Built", xlab = "Decade Built", ylab = "Count", col="lightblue", breaks = 15)

It appears there were building booms for housing in the 1950s, 60s and 70s, a lull in the 1980s and a spike in the 1990s and 2000s. This should mean that there are more newer homes on the market than older homes.

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

In terms of training and predicting the pricing, which is the dependent variable (SalePrice), my hunch is that “Above grade (ground) living area square feet” (GrLivArea), the configuration of total square feet in terms of “Total rooms above grade (does not include bathrooms)” (TotRmsAbvGrd), Size of garage in car capacity (GarageCars), “Full bathrooms above grade” (FullBath), and year built (YearBuilt) might have significant impacts on price.

corr_df <- house_train %>%
  dplyr::select("SalePrice", "GrLivArea", "TotRmsAbvGrd", "GarageCars", "FullBath", "YearBuilt") %>%
  replace(is.na(.),0)

pairs(corr_df)

There appear to be positive correlations between Sales Price and all five other dimensions: above-ground living area, total rooms above ground, the number of cars that fit in the garage, the number of full bathrooms and the year built.

Derive a correlation matrix for any three quantitative variables in the dataset.

Let’s see if we can more accurately quantify the correlations among the variables by looking at a correlation matrix.

corr_matrix <- cor(corr_df, method = "pearson")
round(corr_matrix, 4)

##              SalePrice GrLivArea TotRmsAbvGrd GarageCars FullBath YearBuilt
## SalePrice       1.0000    0.7086       0.5337     0.6404   0.5607    0.5229
## GrLivArea       0.7086    1.0000       0.8255     0.4672   0.6300    0.1990
## TotRmsAbvGrd    0.5337    0.8255       1.0000     0.3623   0.5548    0.0956
## GarageCars      0.6404    0.4672       0.3623     1.0000   0.4697    0.5379
## FullBath        0.5607    0.6300       0.5548     0.4697   1.0000    0.4683
## YearBuilt       0.5229    0.1990       0.0956     0.5379   0.4683    1.0000

This correlation matrix clarifies quantitatively what the scatterplot matrix suggested visually, since we can see more clearly that all five independent variables correlate with price > 0.5, with total above-ground living area and space for cars in the garage correlating 0.0786 and 0.6404 respectively with regard to sale price.

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

I ran Chi-Squared tests on each of the independent variables against the Sale Price. All five independent variables return a p-value of < 2.2e-16 against Sales Price, meaning we can reject the null hypothesis for each independent variable and accept that there is a correlation between each independent variable and sale price. The individual correlation tests confirm the correlation values in the correlation matrix, with above-ground living area having the greatest correlation to sale price, and the year built having the lowest positive correlation.

Given that the correlation to sale price for each independent variable is > 0.5, I would not be worried about familywise error (also known as Type 1 error, or false positives). If the correlations were more weak perhaps I would be more concerned about it.

cor.test(house_train$GrLivArea, house_train$SalePrice, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  house_train$GrLivArea and house_train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

cor.test(house_train$TotRmsAbvGrd, house_train$SalePrice, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  house_train$TotRmsAbvGrd and house_train$SalePrice
## t = 24.099, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5092841 0.5573021
## sample estimates:
##       cor 
## 0.5337232

cor.test(house_train$GarageCars, house_train$SalePrice, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  house_train$GarageCars and house_train$SalePrice
## t = 31.839, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6201771 0.6597899
## sample estimates:
##       cor 
## 0.6404092

cor.test(house_train$FullBath, house_train$SalePrice, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  house_train$FullBath and house_train$SalePrice
## t = 25.854, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5372107 0.5832505
## sample estimates:
##       cor 
## 0.5606638

cor.test(house_train$YearBuilt, house_train$SalePrice, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  house_train$YearBuilt and house_train$SalePrice
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4980766 0.5468619
## sample estimates:
##       cor 
## 0.5228973

Linear Algebra and Correlation.

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

precision <- solve(corr_matrix)
precision

##               SalePrice  GrLivArea TotRmsAbvGrd  GarageCars    FullBath
## SalePrice     3.2003001 -2.0147950   0.23692286 -0.74378555  0.14355914
## GrLivArea    -2.0147950  5.1001626  -2.65302497 -0.12292794 -0.92441627
## TotRmsAbvGrd  0.2369229 -2.6530250   3.28250258 -0.09122621 -0.39053841
## GarageCars   -0.7437855 -0.1229279  -0.09122621  1.91224091 -0.08850805
## FullBath      0.1435591 -0.9244163  -0.39053841 -0.08850805  2.13823869
## YearBuilt    -0.9622913  0.7911433   0.32226360 -0.56494565 -0.80743830
##               YearBuilt
## SalePrice    -0.9622913
## GrLivArea     0.7911433
## TotRmsAbvGrd  0.3222636
## GarageCars   -0.5649456
## FullBath     -0.8074383
## YearBuilt     1.9968853

It appears that above-ground living area has the greatest variance inflation factor, and garage capacity the least.

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

corr_precision <- corr_matrix %*% precision
corr_precision

##                 SalePrice    GrLivArea  TotRmsAbvGrd    GarageCars
## SalePrice    1.000000e+00 1.665335e-16  5.551115e-17  5.551115e-17
## GrLivArea    1.387779e-16 1.000000e+00  1.804112e-16  8.326673e-17
## TotRmsAbvGrd 6.938894e-17 4.579670e-16  1.000000e+00 -6.938894e-17
## GarageCars   0.000000e+00 2.775558e-16  0.000000e+00  1.000000e+00
## FullBath     2.220446e-16 1.110223e-16 -1.387779e-16  0.000000e+00
## YearBuilt    0.000000e+00 0.000000e+00 -5.551115e-17  2.220446e-16
##                   FullBath     YearBuilt
## SalePrice     0.000000e+00  0.000000e+00
## GrLivArea    -8.326673e-17 -5.551115e-17
## TotRmsAbvGrd  2.775558e-17  2.775558e-17
## GarageCars   -1.665335e-16  0.000000e+00
## FullBath      1.000000e+00 -1.110223e-16
## YearBuilt    -1.110223e-16  1.000000e+00

and then multiply the precision matrix by the correlation matrix.

precision_corr <- precision %*% corr_matrix
precision_corr

##                  SalePrice     GrLivArea  TotRmsAbvGrd    GarageCars
## SalePrice     1.000000e+00  3.885781e-16  2.359224e-16  2.220446e-16
## GrLivArea     2.220446e-16  1.000000e+00  4.440892e-16  3.330669e-16
## TotRmsAbvGrd  0.000000e+00  9.714451e-17  1.000000e+00 -2.775558e-17
## GarageCars   -1.665335e-16 -1.249001e-16 -1.665335e-16  1.000000e+00
## FullBath      1.110223e-16 -8.326673e-17  2.775558e-17 -1.665335e-16
## YearBuilt    -2.220446e-16 -5.551115e-17  2.775558e-17  0.000000e+00
##                   FullBath     YearBuilt
## SalePrice     3.885781e-16  2.220446e-16
## GrLivArea     1.665335e-16 -1.110223e-16
## TotRmsAbvGrd -2.498002e-16 -5.551115e-17
## GarageCars   -1.665335e-16  0.000000e+00
## FullBath      1.000000e+00 -1.110223e-16
## YearBuilt     0.000000e+00  1.000000e+00

Conduct LU decomposition on the matrix.

LU_decomp <- lu.decomposition(corr_precision)
LU_decomp

## $L
##              [,1]         [,2]          [,3]          [,4]          [,5] [,6]
## [1,] 1.000000e+00 0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00    0
## [2,] 1.387779e-16 1.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00    0
## [3,] 6.938894e-17 4.579670e-16  1.000000e+00  0.000000e+00  0.000000e+00    0
## [4,] 0.000000e+00 2.775558e-16 -5.007418e-32  1.000000e+00  0.000000e+00    0
## [5,] 2.220446e-16 1.110223e-16 -1.387779e-16 -3.120007e-32  1.000000e+00    0
## [6,] 0.000000e+00 0.000000e+00 -5.551115e-17  2.220446e-16 -1.110223e-16    1
## 
## $U
##      [,1]         [,2]         [,3]          [,4]          [,5]          [,6]
## [1,]    1 1.665335e-16 5.551115e-17  5.551115e-17  0.000000e+00  0.000000e+00
## [2,]    0 1.000000e+00 1.804112e-16  8.326673e-17 -8.326673e-17 -5.551115e-17
## [3,]    0 0.000000e+00 1.000000e+00 -6.938894e-17  2.775558e-17  2.775558e-17
## [4,]    0 0.000000e+00 0.000000e+00  1.000000e+00 -1.665335e-16  1.540744e-32
## [5,]    0 0.000000e+00 0.000000e+00  0.000000e+00  1.000000e+00 -1.110223e-16
## [6,]    0 0.000000e+00 0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Wood deck is right skewed, with a minimum already at 0.

hist(house_train$WoodDeckSF, breaks = 30)

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

deck_exp <- fitdistr(house_train$WoodDeckSF, densfun = "exponential")
deck_exp

##        rate    
##   0.0106106965 
##  (0.0002776946)

Find the optimal value of \(\lambda\) for this distribution,

deck_lambda <- deck_exp$estimate
opt_val <- 1/deck_lambda
opt_val

##     rate 
## 94.24452

and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\)).

deck_samples <- rexp(1000, deck_lambda)

Plot a histogram and compare it with a histogram of your original variable.

par(mfrow = c(1, 2))
hist(deck_samples, breaks = 30, col="lightgreen", main = "Exponential - Wood Decks")
hist(house_train$WoodDeckSF, breaks = 30, xlab = "Wood Decks", col="lightblue", main = "Original - Wood Decks")

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

qexp(c(0.05, 0.95), rate = deck_lambda)

## [1]   4.834112 282.331352

Also generate a 95% confidence interval from the empirical data, assuming normality.

qnorm(c(0.025, 0.975), mean=mean(house_train$WoodDeckSF), sd=sd(house_train$WoodDeckSF))

## [1] -151.415  339.904

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

quantile(house_train$WoodDeckSF, c(0.05, 0.95))

##  5% 95% 
##   0 335

The side-by-side histograms show visually that the original data was much less smoothed and gradual than the exponential distribution. The empirical 5th and 95th percentiles are considerably different (0 and 335) than the normalized confidence intervals (-151.4 and 339.9), possibly because there are so many 0 values in the data for Wood Decks. Even though the Wood Decks variable is not normally distributed, the empirical values are within the 95% confidence interval.

Modeling.

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

train_price_lm <- lm(SalePrice ~ GrLivArea + TotRmsAbvGrd + GarageCars + FullBath + YearBuilt, data = house_train)

summary(train_price_lm)

## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + TotRmsAbvGrd + GarageCars + 
##     FullBath + YearBuilt, data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -425832  -24082   -3316   18898  308703 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.532e+06  9.775e+04 -15.676  < 2e-16 ***
## GrLivArea     9.518e+01  4.338e+00  21.939  < 2e-16 ***
## TotRmsAbvGrd -3.618e+03  1.295e+03  -2.795  0.00526 ** 
## GarageCars    2.471e+04  2.055e+03  12.021  < 2e-16 ***
## FullBath     -6.469e+03  3.086e+03  -2.096  0.03627 *  
## YearBuilt     7.909e+02  5.039e+01  15.697  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44480 on 1454 degrees of freedom
## Multiple R-squared:  0.6875, Adjusted R-squared:  0.6865 
## F-statistic: 639.8 on 5 and 1454 DF,  p-value: < 2.2e-16

We see that all p-values for the 5 independent variables are < 0.05, and the R-squared is 0.6865, so this is a good model.

But could it be better? Perhaps. If we could find independent variables that perform better than the # of rooms and the # of full baths, the model may improve.

Let’s try Lot Area and Overall Quality of the building materials.

train_price_lm2 <- lm(SalePrice ~ GrLivArea + LotArea + GarageCars + OverallQual + YearBuilt, data = house_train)

summary(train_price_lm2)

## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + LotArea + GarageCars + OverallQual + 
##     YearBuilt, data = house_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -389546  -21413   -2184   17207  300512 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.179e+05  8.554e+04  -9.561  < 2e-16 ***
## GrLivArea    5.194e+01  2.612e+00  19.882  < 2e-16 ***
## LotArea      8.461e-01  1.065e-01   7.945 3.85e-15 ***
## GarageCars   1.492e+04  1.853e+03   8.049 1.71e-15 ***
## OverallQual  2.355e+04  1.153e+03  20.426  < 2e-16 ***
## YearBuilt    3.760e+02  4.494e+01   8.368  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38960 on 1454 degrees of freedom
## Multiple R-squared:  0.7603, Adjusted R-squared:  0.7595 
## F-statistic: 922.6 on 5 and 1454 DF,  p-value: < 2.2e-16

These substitutions appear to be better fits and have improved the model for an adjusted R-squared of 0.7595, which is an improvement of 0.0730 in explaining the data.

This final model shows that 75.95% of the variance in sale price can be explained using the five independent variables identified in the model.

This results in a final model of: \(SalesPrice= 5.194e+01 * GrLivArea + 8.461e-01 * LotArea + 1.492e+04 * GarageCars + 2.355e+04 * OverallQual + 3.760e+02 * YearBuilt\)

Let’s check some residuals and the QQ plot to see if the basic assumptions for linear regression have been met.

plot(fitted(train_price_lm2), resid(train_price_lm2))

The residuals appear to be non-linear, almost slightly parabolic, with a much greater deal of variation as the prce increases.

qqnorm(resid(train_price_lm2))
qqline(resid(train_price_lm2), col = "red")

The QQ plot confirms that the lower and mid-range priced homes follow a linear pattern, but the upper range homes veer away from a good fit.

So let’s test the model we have build against the test data set.

test_price <- house_test %>%
  dplyr::select(GrLivArea, LotArea, GarageCars, OverallQual, YearBuilt, Id)

prediction <- predict(train_price_lm2, newdata = test_price, type = "response")

prediction <- as.data.frame(prediction)

pred_prices <- data.frame(Id = test_price$Id, SalePrice = prediction$prediction)

pred_prices$SalePrice[is.na(pred_prices$SalePrice)] <- "0"

write_csv(pred_prices, file = "dbarley_price_prediction.csv")

Kaggle Submission

My Kaggle user name is Douglas Barley, and the resulting score on Kaggle.com from this model is 0.55808.

Kaggle Submission Results.