Final Project

Load Packages

library(ggplot2)
library(tidyverse)
library(knitr)
library(data.table)
library(RCurl)
library(magrittr)
library(psych)
library(Matrix)
library(matrixcalc)
library(MASS)
library(Rmisc)

Problem 1

Q. Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu\)=\(\sigma\)=(N+1)/2.

Answer

Please note that in this RMD, the questions are answered in piecemeal basis, and italicized and marked with Q, wherever they appear.

Let’s choose N = 7. Right below, I created a function f1(), where, I am initializing the variables: mean, sd, df, X and Y. f1() is being called from code chunk, right below it, for printing whatever is required.

f1 <- function(N, seed_value)
{
  mean <- (N + 1) / 2
  sd <- mean # Since it's given that mu = sigma
  #
  set.seed(seed_value)
  X <- runif(10000, min = 1, max = N)
  Y <- rnorm(X, mean, sd)
  #
  df <- data.frame(X, Y)
  #
  list1 <- list(mean = mean, sd = sd, df = df, X = X, Y = Y)
  return(list1)
}

Below code chunk, calls the above function f1(). Gets the salient variables, and prints them.

N <- 7
seed_value = 121
list1 = f1(N, seed_value)
#
X = list1$X
Y = list1$Y
#
cat('mean(X) = ', mean(X), 'median(X) = ', median(X), 'sd(X) = ', sd(X))
## mean(X) =  4.020999 median(X) =  4.040544 sd(X) =  1.733148
cat('mean(Y) = ', mean(Y), 'median(Y) = ', median(Y), 'sd(Y) = ', sd(Y))
## mean(Y) =  3.967011 median(Y) =  4.000224 sd(Y) =  4.021847
#
head(list1$df, n = 10)
##           X          Y
## 1  3.395334  1.3882545
## 2  6.709554  9.4388717
## 3  4.258905  4.4950922
## 4  5.576774  4.0064366
## 5  4.305032  4.3511589
## 6  3.526142 -2.7237422
## 7  3.806151  1.7920659
## 8  4.679762  8.6216448
## 9  2.426268  3.6394108
## 10 2.528021 -0.1594078
#
hist(X)

hist(Y, probability = TRUE, main = "Normal, mean = 4, sd = 4")
abline(v = (N + 1) / 2, col = "blue")

So, we observe that means and medians of both X and Y are almost equal. And from the histograms, while the distribution of X is uniform, the distribution of Y is Normal.

Q. Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

x <- median(X)                     # Median of X
y <- quantile(Y, prob = c(.25))    # First quartile of Y
cat('Median(X) = ', x, ', ', '1st quartile of the Y = ', y)
## Median(X) =  4.040544 ,  1st quartile of the Y =  1.252943

a. P(X>x | X>y)

Meaning: Conditional probability: P(X > x | X > y) = P(X > x and X > y) / P(X > y)

cat('P(X > x | X > y) = ', sum(X > x & X > y) / sum(X > y))
## P(X > x | X > y) =  0.5216484

b. P(X>x, Y>y)

Meaning: Joint probability: P(X > x, Y > y) = P(X > x and Y > y)

cat('P(X > x, Y > y) = ', sum(X > x & Y > y) / length(X))
## P(X > x, Y > y) =  0.3745

c. P(X<x | X>y)

Meaning: Conditional probability: P(X < x | X > y) = P(X < x and X > y) / P(X > y)

cat('P(X < x | X > y) = ', sum(X < x & X > y) / sum(X > y))
## P(X < x | X > y) =  0.4783516

Q. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

prob_Y_lesser_than_y <- c( (sum(Y < y & X < x) / length(X)), (sum(Y < y & X > x) / length(X)) )
prob_Y_greater_than_y <- c( (sum(Y > y & X < x) / length(X)), (sum(Y > y & X > x) / length(X)) )
#
table1 <- rbind.data.frame(prob_Y_lesser_than_y,  prob_Y_greater_than_y)  # Building the table of Joint Probability Distributions.
#
colnames(table1) <- c("P(X < x)", "P(X > x)")                             # Setting the headiings
rownames(table1) <- c("P(Y < y)", "P(Y > y)")                             # Setting the columns names
table1
##          P(X < x) P(X > x)
## P(Y < y)   0.1245   0.1255
## P(Y > y)   0.3755   0.3745

Now, let’s compute the important observations from above table:

print(paste('P(X > x) = ', sum(table1[,2])))        # P(X > x) = 0.1255 + 0.3745 = 0.5  (added contents of right most column)
## [1] "P(X > x) =  0.5"
print(paste('P(Y > y) = ', sum(table1[2,])))        # P(Y > y) = 0.3755 + 0.3745 = 0.75 (added contents of bottom most row)
## [1] "P(Y > y) =  0.75"
#
print(paste('P(X > x and Y > y) = ', table1[2,2]))  # P(X > x and Y > y) (right most, bottom most element of table)
## [1] "P(X > x and Y > y) =  0.3745"
print(paste('Therefore, P(X > x).P(Y > y) = ', sum(table1[,2]) * sum(table1[2,])))    # The product: P(X>x)P(Y>y)
## [1] "Therefore, P(X > x).P(Y > y) =  0.375"

So, we observe that P(X > x and Y > y) = 0.3745 and P(X>x)P(Y>y) = 0.375. They are almost equal.

Q. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Let’s state the null and alternative hypotheses

H0: X > x and Y > y are independent events.
HA: X > x and Y > y are not independent events.

Computing the Contingency Table:

count_Y_lesser_than_y <- c( (sum(Y < y & X < x)), (sum(Y < y & X > x)) )
count_Y_greater_than_y <- c( (sum(Y > y & X < x)), (sum(Y > y & X > x)) )
#
table2 <- rbind.data.frame(count_Y_lesser_than_y,  count_Y_greater_than_y)  # Building the table of counts.
#
colnames(table2) <- c("Count(X < x)", "Count(X > x)")                       # Setting the headiings
rownames(table2) <- c("Count(Y < y)", "Count(Y > y)")                       # Setting the columns names
table2
##              Count(X < x) Count(X > x)
## Count(Y < y)         1245         1255
## Count(Y > y)         3755         3745

Fisher’s Exact test, below:

fisher.test(table2)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table2
## p-value = 0.8354
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9028128 1.0842857
## sample estimates:
## odds ratio 
##  0.9893898

Chi-squared test, below:

chisq.test(table2)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table2
## X-squared = 0.0432, df = 1, p-value = 0.8353

We observe that p-values from both tests are almost equal and greater than 0.05. Therefore, We can’t rule out independence.

On the issue of appropriateness of applicability of Fisher vs Chi-squared, I read several competing views. One view, states that While Chi-squared test is applicable on contingency tables having order greater than or equal to 2 x 2, Fisher’s Exact test is typically used for 2 x 2 tables (reference: https://www.datascienceblog.net/post/statistical_test/contingency_table_tests/). However, in this case, since order of the contingency table is 2 x 2, both should be appropriate.

Problem 2

Q. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.

Answer

I registered with Kaggle.com, and will report my Kaggle.com user and score in the sequel.

Q. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

A very basic descriptive statistics of train.csv is provided below:

train_csv <- read.csv("./house-prices-advanced-regression-techniques/train.csv")
head(train_csv)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
dim(train_csv)
## [1] 1460   81
describe(train_csv)
##                vars    n      mean       sd   median   trimmed      mad   min
## Id                1 1460    730.50   421.61    730.5    730.50   541.15     1
## MSSubClass        2 1460     56.90    42.30     50.0     49.15    44.48    20
## MSZoning*         3 1460      4.03     0.63      4.0      4.06     0.00     1
## LotFrontage       4 1201     70.05    24.28     69.0     68.94    16.31    21
## LotArea           5 1460  10516.83  9981.26   9478.5   9563.28  2962.23  1300
## Street*           6 1460      2.00     0.06      2.0      2.00     0.00     1
## Alley*            7   91      1.45     0.50      1.0      1.44     0.00     1
## LotShape*         8 1460      2.94     1.41      4.0      3.05     0.00     1
## LandContour*      9 1460      3.78     0.71      4.0      4.00     0.00     1
## Utilities*       10 1460      1.00     0.03      1.0      1.00     0.00     1
## LotConfig*       11 1460      4.02     1.62      5.0      4.27     0.00     1
## LandSlope*       12 1460      1.06     0.28      1.0      1.00     0.00     1
## Neighborhood*    13 1460     13.15     5.89     13.0     13.11     7.41     1
## Condition1*      14 1460      3.03     0.87      3.0      3.00     0.00     1
## Condition2*      15 1460      3.01     0.26      3.0      3.00     0.00     1
## BldgType*        16 1460      1.49     1.20      1.0      1.14     0.00     1
## HouseStyle*      17 1460      4.04     1.91      3.0      4.03     1.48     1
## OverallQual      18 1460      6.10     1.38      6.0      6.08     1.48     1
## OverallCond      19 1460      5.58     1.11      5.0      5.48     0.00     1
## YearBuilt        20 1460   1971.27    30.20   1973.0   1974.13    37.06  1872
## YearRemodAdd     21 1460   1984.87    20.65   1994.0   1986.37    19.27  1950
## RoofStyle*       22 1460      2.41     0.83      2.0      2.26     0.00     1
## RoofMatl*        23 1460      2.08     0.60      2.0      2.00     0.00     1
## Exterior1st*     24 1460     10.62     3.20     13.0     10.93     1.48     1
## Exterior2nd*     25 1460     11.34     3.54     14.0     11.65     2.97     1
## MasVnrType*      26 1452      2.76     0.62      3.0      2.73     0.00     1
## MasVnrArea       27 1452    103.69   181.07      0.0     63.15     0.00     0
## ExterQual*       28 1460      3.54     0.69      4.0      3.65     0.00     1
## ExterCond*       29 1460      4.73     0.73      5.0      4.95     0.00     1
## Foundation*      30 1460      2.40     0.72      2.0      2.46     1.48     1
## BsmtQual*        31 1423      3.26     0.87      3.0      3.43     1.48     1
## BsmtCond*        32 1423      3.81     0.66      4.0      4.00     0.00     1
## BsmtExposure*    33 1422      3.27     1.15      4.0      3.46     0.00     1
## BsmtFinType1*    34 1423      3.73     1.83      3.0      3.79     2.97     1
## BsmtFinSF1       35 1460    443.64   456.10    383.5    386.08   568.58     0
## BsmtFinType2*    36 1422      5.71     0.94      6.0      5.98     0.00     1
## BsmtFinSF2       37 1460     46.55   161.32      0.0      1.38     0.00     0
## BsmtUnfSF        38 1460    567.24   441.87    477.5    519.29   426.99     0
## TotalBsmtSF      39 1460   1057.43   438.71    991.5   1036.70   347.67     0
## Heating*         40 1460      2.04     0.30      2.0      2.00     0.00     1
## HeatingQC*       41 1460      2.54     1.74      1.0      2.42     0.00     1
## CentralAir*      42 1460      1.93     0.25      2.0      2.00     0.00     1
## Electrical*      43 1459      4.68     1.05      5.0      5.00     0.00     1
## X1stFlrSF        44 1460   1162.63   386.59   1087.0   1129.99   347.67   334
## X2ndFlrSF        45 1460    346.99   436.53      0.0    285.36     0.00     0
## LowQualFinSF     46 1460      5.84    48.62      0.0      0.00     0.00     0
## GrLivArea        47 1460   1515.46   525.48   1464.0   1467.67   483.33   334
## BsmtFullBath     48 1460      0.43     0.52      0.0      0.39     0.00     0
## BsmtHalfBath     49 1460      0.06     0.24      0.0      0.00     0.00     0
## FullBath         50 1460      1.57     0.55      2.0      1.56     0.00     0
## HalfBath         51 1460      0.38     0.50      0.0      0.34     0.00     0
## BedroomAbvGr     52 1460      2.87     0.82      3.0      2.85     0.00     0
## KitchenAbvGr     53 1460      1.05     0.22      1.0      1.00     0.00     0
## KitchenQual*     54 1460      3.34     0.83      4.0      3.50     0.00     1
## TotRmsAbvGrd     55 1460      6.52     1.63      6.0      6.41     1.48     2
## Functional*      56 1460      6.75     0.98      7.0      7.00     0.00     1
## Fireplaces       57 1460      0.61     0.64      1.0      0.53     1.48     0
## FireplaceQu*     58  770      3.73     1.13      3.0      3.80     1.48     1
## GarageType*      59 1379      3.28     1.79      2.0      3.11     0.00     1
## GarageYrBlt      60 1379   1978.51    24.69   1980.0   1981.07    31.13  1900
## GarageFinish*    61 1379      2.18     0.81      2.0      2.23     1.48     1
## GarageCars       62 1460      1.77     0.75      2.0      1.77     0.00     0
## GarageArea       63 1460    472.98   213.80    480.0    469.81   177.91     0
## GarageQual*      64 1379      4.86     0.61      5.0      5.00     0.00     1
## GarageCond*      65 1379      4.90     0.52      5.0      5.00     0.00     1
## PavedDrive*      66 1460      2.86     0.50      3.0      3.00     0.00     1
## WoodDeckSF       67 1460     94.24   125.34      0.0     71.76     0.00     0
## OpenPorchSF      68 1460     46.66    66.26     25.0     33.23    37.06     0
## EnclosedPorch    69 1460     21.95    61.12      0.0      3.87     0.00     0
## X3SsnPorch       70 1460      3.41    29.32      0.0      0.00     0.00     0
## ScreenPorch      71 1460     15.06    55.76      0.0      0.00     0.00     0
## PoolArea         72 1460      2.76    40.18      0.0      0.00     0.00     0
## PoolQC*          73    7      2.14     0.90      2.0      2.14     1.48     1
## Fence*           74  281      2.43     0.86      3.0      2.48     0.00     1
## MiscFeature*     75   54      2.91     0.45      3.0      3.00     0.00     1
## MiscVal          76 1460     43.49   496.12      0.0      0.00     0.00     0
## MoSold           77 1460      6.32     2.70      6.0      6.25     2.97     1
## YrSold           78 1460   2007.82     1.33   2008.0   2007.77     1.48  2006
## SaleType*        79 1460      8.51     1.56      9.0      8.92     0.00     1
## SaleCondition*   80 1460      4.77     1.10      5.0      5.00     0.00     1
## SalePrice        81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
##                   max  range   skew kurtosis      se
## Id               1460   1459   0.00    -1.20   11.03
## MSSubClass        190    170   1.40     1.56    1.11
## MSZoning*           5      4  -1.73     6.25    0.02
## LotFrontage       313    292   2.16    17.34    0.70
## LotArea        215245 213945  12.18   202.26  261.22
## Street*             2      1 -15.49   238.01    0.00
## Alley*              2      1   0.20    -1.98    0.05
## LotShape*           4      3  -0.61    -1.60    0.04
## LandContour*        4      3  -3.16     8.65    0.02
## Utilities*          2      1  38.13  1453.00    0.00
## LotConfig*          5      4  -1.13    -0.59    0.04
## LandSlope*          3      2   4.80    24.47    0.01
## Neighborhood*      25     24   0.02    -1.06    0.15
## Condition1*         9      8   3.01    16.34    0.02
## Condition2*         8      7  13.14   247.54    0.01
## BldgType*           5      4   2.24     3.41    0.03
## HouseStyle*         8      7   0.31    -0.96    0.05
## OverallQual        10      9   0.22     0.09    0.04
## OverallCond         9      8   0.69     1.09    0.03
## YearBuilt        2010    138  -0.61    -0.45    0.79
## YearRemodAdd     2010     60  -0.50    -1.27    0.54
## RoofStyle*          6      5   1.47     0.61    0.02
## RoofMatl*           8      7   8.09    66.28    0.02
## Exterior1st*       15     14  -0.72    -0.37    0.08
## Exterior2nd*       16     15  -0.69    -0.52    0.09
## MasVnrType*         4      3  -0.07    -0.13    0.02
## MasVnrArea       1600   1600   2.66    10.03    4.75
## ExterQual*          4      3  -1.83     3.86    0.02
## ExterCond*          5      4  -2.56     5.29    0.02
## Foundation*         6      5   0.09     1.02    0.02
## BsmtQual*           4      3  -1.31     1.27    0.02
## BsmtCond*           4      3  -3.39    10.14    0.02
## BsmtExposure*       4      3  -1.15    -0.39    0.03
## BsmtFinType1*       6      5  -0.02    -1.39    0.05
## BsmtFinSF1       5644   5644   1.68    11.06   11.94
## BsmtFinType2*       6      5  -3.56    12.32    0.02
## BsmtFinSF2       1474   1474   4.25    20.01    4.22
## BsmtUnfSF        2336   2336   0.92     0.46   11.56
## TotalBsmtSF      6110   6110   1.52    13.18   11.48
## Heating*            6      5   9.83   110.98    0.01
## HeatingQC*          5      4   0.48    -1.51    0.05
## CentralAir*         2      1  -3.52    10.42    0.01
## Electrical*         5      4  -3.06     7.49    0.03
## X1stFlrSF        4692   4358   1.37     5.71   10.12
## X2ndFlrSF        2065   2065   0.81    -0.56   11.42
## LowQualFinSF      572    572   8.99    82.83    1.27
## GrLivArea        5642   5308   1.36     4.86   13.75
## BsmtFullBath        3      3   0.59    -0.84    0.01
## BsmtHalfBath        2      2   4.09    16.31    0.01
## FullBath            3      3   0.04    -0.86    0.01
## HalfBath            2      2   0.67    -1.08    0.01
## BedroomAbvGr        8      8   0.21     2.21    0.02
## KitchenAbvGr        3      3   4.48    21.42    0.01
## KitchenQual*        4      3  -1.42     1.72    0.02
## TotRmsAbvGrd       14     12   0.67     0.87    0.04
## Functional*         7      6  -4.08    16.37    0.03
## Fireplaces          3      3   0.65    -0.22    0.02
## FireplaceQu*        5      4  -0.16    -0.98    0.04
## GarageType*         6      5   0.76    -1.30    0.05
## GarageYrBlt      2010    110  -0.65    -0.42    0.66
## GarageFinish*       3      2  -0.35    -1.41    0.02
## GarageCars          4      4  -0.34     0.21    0.02
## GarageArea       1418   1418   0.18     0.90    5.60
## GarageQual*         5      4  -4.43    18.25    0.02
## GarageCond*         5      4  -5.28    26.77    0.01
## PavedDrive*         3      2  -3.30     9.22    0.01
## WoodDeckSF        857    857   1.54     2.97    3.28
## OpenPorchSF       547    547   2.36     8.44    1.73
## EnclosedPorch     552    552   3.08    10.37    1.60
## X3SsnPorch        508    508  10.28   123.06    0.77
## ScreenPorch       480    480   4.11    18.34    1.46
## PoolArea          738    738  14.80   222.19    1.05
## PoolQC*             3      2  -0.22    -1.90    0.34
## Fence*              4      3  -0.57    -0.88    0.05
## MiscFeature*        4      3  -2.93    10.71    0.06
## MiscVal         15500  15500  24.43   697.64   12.98
## MoSold             12     11   0.21    -0.41    0.07
## YrSold           2010      4   0.10    -1.19    0.03
## SaleType*           9      8  -3.83    14.57    0.04
## SaleCondition*      6      5  -2.74     6.82    0.03
## SalePrice      755000 720100   1.88     6.50 2079.11

Facts:
1) There are 1460 observations and 81 columns in train.csv. If we exclude Id and SalePrice, then there are 79 regressor variables.
2) SalePrice is the dependent variable.

Since price of a house may depend quite a bit on spatial characteristics, I’ll consider 3 area related variables, and the dependent variable SalePrice, for statistical analysis. They are:
1) LotArea: Lot size in square feet
2) GrLivArea: Above grade (ground) living area square feet
3) GarageArea: Size of garage in square feet
4) SalePrice: Dependent variable

Univariate descriptive statistics of LotArea:

summary(train_csv$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
hist(train_csv$LotArea, main = "LotArea: Lot size in square feet")

Univariate descriptive statistics of GrLivArea:

summary(train_csv$GrLivArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642
hist(train_csv$GrLivArea, main = "GrLivArea: Above grade (ground) living area square feet")

Univariate descriptive statistics of GarageArea:

summary(train_csv$GarageArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   334.5   480.0   473.0   576.0  1418.0
hist(train_csv$GarageArea, main = "GarageArea: Size of garage in square feet")

Univariate descriptive statistics of SalePrice:

summary(train_csv$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
hist(train_csv$SalePrice, main = "SalePrice: Dependent variable")

Now, I’ll provide the scatter plots.

Scatter plot of LotArea vs SalePrice:

plot(train_csv$LotArea, train_csv$SalePrice, xlab = 'Lot size in square feet', ylab = 'Sale Price', main = 'Scatter Plot of Lot size vs. Sale Price')

Scatter plot of GrLivArea vs SalePrice:

plot(train_csv$GrLivArea, train_csv$SalePrice, xlab = 'Above grade (ground) living area square feet', ylab = 'Sale Price', main = 'Scatter Plot of Above grade (ground) living area vs. Sale Price')

Scatter plot of GarageArea vs SalePrice:

plot(train_csv$GarageArea, train_csv$SalePrice, xlab = 'Size of garage in square feet', ylab = 'Sale Price', main = 'Scatter Plot of Size of garage vs. Sale Price')

Now, I’ll provide the correlation matrix.

Correlation matrix of the three variables LotArea, GrLivArea, GarageArea:

cor_df <- data.frame(train_csv$LotArea, train_csv$GrLivArea, train_csv$GarageArea)
cor_matrix <- cor(cor_df)
colnames(cor_matrix) <- c("LotArea", "GrLivArea", "GarageArea")                   # Setting the headiings
rownames(cor_matrix) <- c("LotArea", "GrLivArea", "GarageArea")                   # Setting the columns names
cor_matrix
##              LotArea GrLivArea GarageArea
## LotArea    1.0000000 0.2631162  0.1804028
## GrLivArea  0.2631162 1.0000000  0.4689975
## GarageArea 0.1804028 0.4689975  1.0000000

Now, for a comparative visualization, I’ll display pairwise scatter plot matrices.

Pairwise scatter plot matrix of four variables SalePrice, LotArea, GrLivArea, GarageArea:

pairs(~SalePrice+LotArea+GrLivArea+GarageArea, data = train_csv, main = "Pairwise scatter plot matrix")

Now, we’ll test this hypothesis, the correlations between each pairwise set of variables is 0, and provide 80% Confidence Level.
For the following analysis, I referred chapter 6 of Introductory Statistics with R, by Peter Dalgaard, Springer, and some internet resources e.g https://rcompanion.org/handbook/D_01.html

Hypothesis testing: LotArea, GrLivArea

cor.test(train_csv$LotArea, train_csv$GrLivArea, method = "pearson", conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train_csv$LotArea and train_csv$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162

Hypothesis testing: LotArea, GarageArea

cor.test(train_csv$LotArea, train_csv$GarageArea, method = "pearson", conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train_csv$LotArea and train_csv$GarageArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1477356 0.2126767
## sample estimates:
##       cor 
## 0.1804028

Hypothesis testing: GarageArea, GrLivArea

cor.test(train_csv$GarageArea, train_csv$GrLivArea, method = "pearson", conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train_csv$GarageArea and train_csv$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975

Assuming an Alpha of 0.05, we observe that each p-value is less than Alpha. Therefore the null hypothesis can be rejected.

The issue of Familywise Error:
First, the definition of Familywise Error Rate (from https://www.statisticshowto.com/familywise-error-rate/):
The familywise error rate (FWE or FWER) is the probability of a coming to at least one false conclusion in a series of hypothesis tests . In other words, it’s the probability of making at least one Type I Error. The term “familywise” error rate comes from family of tests, which is the technical definition for a series of tests on data.

In my case the tests are very few, so I wouldn’t worry about it.

Q. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Inversion

prec_matrix <- solve(cor_matrix)
prec_matrix
##                LotArea  GrLivArea  GarageArea
## LotArea     1.07920917 -0.2469705 -0.07886378
## GrLivArea  -0.24697046  1.3385010 -0.58319943
## GarageArea -0.07886378 -0.5831994  1.28774631

Now, I’ll multiply correlation matrix by the precision matrix:

round(cor_matrix %*% prec_matrix)
##            LotArea GrLivArea GarageArea
## LotArea          1         0          0
## GrLivArea        0         1          0
## GarageArea       0         0          1

Then, I’ll multiply precision matrix by the correlation matrix:

round(prec_matrix %*% cor_matrix)
##            LotArea GrLivArea GarageArea
## LotArea          1         0          0
## GrLivArea        0         1          0
## GarageArea       0         0          1

LU Decomposition of correlation matrix (Lower Triangular):

expand(lu(cor_matrix))$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000         .         .
## [2,] 0.2631162 1.0000000         .
## [3,] 0.1804028 0.4528838 1.0000000

LU Decomposition correlation matrix (Upper Triangular):

expand(lu(cor_matrix))$U
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000 0.2631162 0.1804028
## [2,]         . 0.9307699 0.4215306
## [3,]         .         . 0.7765505

LU Decomposition of precision matrix (Lower Triangular). Used another function:

lu.decomposition(prec_matrix)$L
##             [,1]       [,2] [,3]
## [1,]  1.00000000  0.0000000    0
## [2,] -0.22884393  1.0000000    0
## [3,] -0.07307553 -0.4689975    1

LU Decomposition precision matrix (Upper Triangular):

lu.decomposition(prec_matrix)$U
##          [,1]       [,2]        [,3]
## [1,] 1.079209 -0.2469705 -0.07886378
## [2,] 0.000000  1.2819833 -0.60124693
## [3,] 0.000000  0.0000000  1.00000000

Q. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

I tested the right skewness of some of the variables MasVnrArea, BsmtFinSF1, TotalBsmtSF, BsmtUnfSF, WoodDeckSF, OpenPorchSF, EnclosedPorch, and finally selected TotalBsmtSF, which appeared skewed to the right. TotalBsmtSF is the Total square feet of basement area. The histogram and summary of TotalBsmtSF is as follows:

Histogram of Total square feet of basement area:

hist(train_csv$TotalBsmtSF, breaks = 200, main = "Total square feet of basement area")

Summary of Total square feet of basement area:

summary(train_csv$TotalBsmtSF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.4  1298.2  6110.0

From the histogram, since it’s not clear whether parts of skew side are zero, I’ll shift it to raise raise it, to be on the safe side. Pleasee see shifted histogram below.

TotalBsmtSF_shft <- train_csv$TotalBsmtSF[train_csv$TotalBsmtSF < 5000] + 0.01
hist(TotalBsmtSF_shft, breaks = 200, main = "Total square feet of basement area, shifted")

We’ll calculate optimal value of \(\lambda\), for this distribution.

lambda <- fitdistr(TotalBsmtSF_shft, 'exponential')
lambda
##        rate    
##   9.487878e-04 
##  (2.483942e-05)
lambda$estimate
##         rate 
## 0.0009487878

Then, we’ll take 1000 samples from this exponential distribution, using this value rexp(1000, \(\lambda\))) and plot histogram.

exp_samp_1000 <- rexp(1000, lambda$estimate)
hist(exp_samp_1000, breaks = 200, main = 'Fitted Distribution with 1000 samples')

Comparison: Compared with original histogram, this histogram with sample of 1000, looks more skewed.

5th and 95th percentiles using the cumulative distribution function (CDF).

print( paste('5th Percentile   = ', qexp(0.05, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)) )
## [1] "5th Percentile   =  54.0619225502357"
print( paste('95th Percentile   = ', qexp(0.95, rate = lambda$estimate, lower.tail = TRUE, log.p = FALSE)) )
## [1] "95th Percentile   =  3157.43116303767"

95% confidence interval from the empirical data, assuming normality. In the following, I computed Confidence Interval with CI, but it can be computed, using qnorm.

# Confidence Interval using shifted data:
CI(TotalBsmtSF_shft, ci = 0.95)
##    upper     mean    lower 
## 1075.464 1053.976 1032.489
# Confidence Interval using, without the shift:
CI(train_csv$TotalBsmtSF, ci = 0.95)
##    upper     mean    lower 
## 1079.951 1057.429 1034.908

Finally, provide the empirical 5th percentile and 95th percentile of the data.

quantile(TotalBsmtSF_shft, c(0.05, 0.95))
##      5%     95% 
##  518.61 1752.11

Discussion: The values of exponential model are way off the values of the given data set. Furthermore, the histograms are also very different. Histogram of exp_samp_1000 has gone more skewed to the right.

Q. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

I’ll select most of the quantitative variables and create a model, to determine the significant ones:

df2 <- data.frame(train_csv$LotArea, train_csv$OverallQual, train_csv$YearBuilt, train_csv$YearRemodAdd, train_csv$MasVnrArea, train_csv$BsmtFinSF1, train_csv$TotalBsmtSF, train_csv$X1stFlrSF, train_csv$X2ndFlrSF, train_csv$GrLivArea, train_csv$FullBath, train_csv$TotRmsAbvGrd, train_csv$Fireplaces, train_csv$GarageCars, train_csv$GarageArea, train_csv$WoodDeckSF, train_csv$OpenPorchSF, train_csv$SalePrice)

colnames(df2) <- c("LotArea", "OverallQual", "YearBuilt", "YearRemodAdd", "MasVnrArea", "BsmtFinSF1", "TotalBsmtSF", "X1stFlrSF", "X2ndFlrSF", "GrLivArea", "FullBath", "TotRmsAbvGrd", "Fireplaces", "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "SalePrice")

mod1 <- lm(SalePrice ~., data = df2)
summary(mod1)
## 
## Call:
## lm(formula = SalePrice ~ ., data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -517677  -17003   -1878   14217  289379 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.131e+06  1.261e+05  -8.973  < 2e-16 ***
## LotArea       4.928e-01  1.033e-01   4.769 2.04e-06 ***
## OverallQual   1.915e+04  1.172e+03  16.340  < 2e-16 ***
## YearBuilt     1.762e+02  4.948e+01   3.560 0.000383 ***
## YearRemodAdd  3.621e+02  6.163e+01   5.875 5.24e-09 ***
## MasVnrArea    2.968e+01  6.115e+00   4.853 1.35e-06 ***
## BsmtFinSF1    1.640e+01  2.582e+00   6.352 2.84e-10 ***
## TotalBsmtSF   1.050e+01  4.272e+00   2.457 0.014142 *  
## X1stFlrSF     2.363e+01  2.067e+01   1.143 0.253146    
## X2ndFlrSF     1.601e+01  2.033e+01   0.788 0.431023    
## GrLivArea     2.143e+01  2.020e+01   1.061 0.288838    
## FullBath     -1.710e+03  2.610e+03  -0.655 0.512480    
## TotRmsAbvGrd  1.744e+03  1.081e+03   1.614 0.106766    
## Fireplaces    6.677e+03  1.788e+03   3.734 0.000196 ***
## GarageCars    9.986e+03  2.938e+03   3.398 0.000696 ***
## GarageArea    9.015e+00  9.981e+00   0.903 0.366560    
## WoodDeckSF    2.709e+01  8.103e+00   3.343 0.000850 ***
## OpenPorchSF   6.811e+00  1.561e+01   0.436 0.662624    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36110 on 1434 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.795,  Adjusted R-squared:  0.7926 
## F-statistic: 327.2 on 17 and 1434 DF,  p-value: < 2.2e-16

Now, I’ll only consider significant quantitative variables and recreate the model:

df2 <- data.frame(train_csv$LotArea, train_csv$OverallQual, train_csv$YearBuilt, train_csv$YearRemodAdd, train_csv$MasVnrArea, train_csv$BsmtFinSF1, train_csv$TotalBsmtSF, train_csv$Fireplaces, train_csv$GarageCars, train_csv$WoodDeckSF, train_csv$SalePrice)
#
colnames(df2) <- c("LotArea", "OverallQual", "YearBuilt", "YearRemodAdd", "MasVnrArea", "BsmtFinSF1", "TotalBsmtSF", "Fireplaces", "GarageCars", "WoodDeckSF", "SalePrice")
#
mod1 <- lm(SalePrice ~., data = df2)
summary(mod1)
## 
## Call:
## lm(formula = SalePrice ~ ., data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -424514  -20402   -2604   16180  367067 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.217e+05  1.273e+05  -6.455 1.48e-10 ***
## LotArea       7.234e-01  1.115e-01   6.487 1.20e-10 ***
## OverallQual   2.541e+04  1.188e+03  21.392  < 2e-16 ***
## YearBuilt    -5.532e+01  4.854e+01  -1.140    0.255    
## YearRemodAdd  4.391e+02  6.651e+01   6.603 5.67e-11 ***
## MasVnrArea    4.773e+01  6.528e+00   7.311 4.37e-13 ***
## BsmtFinSF1    1.409e+01  2.733e+00   5.157 2.85e-07 ***
## TotalBsmtSF   2.212e+01  3.280e+00   6.742 2.25e-11 ***
## Fireplaces    1.293e+04  1.867e+03   6.925 6.54e-12 ***
## GarageCars    1.769e+04  1.855e+03   9.535  < 2e-16 ***
## WoodDeckSF    3.926e+01  8.779e+00   4.472 8.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39380 on 1441 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.755,  Adjusted R-squared:  0.7533 
## F-statistic: 444.2 on 10 and 1441 DF,  p-value: < 2.2e-16

Now, I’ll give histogram, for visualizations:

hist(mod1$residuals, breaks = 200)

Now, I’ll import the test data set:

test_csv <- read.csv("./house-prices-advanced-regression-techniques/test.csv")
test_csv[complete.cases(test_csv),]
##  [1] Id            MSSubClass    MSZoning      LotFrontage   LotArea      
##  [6] Street        Alley         LotShape      LandContour   Utilities    
## [11] LotConfig     LandSlope     Neighborhood  Condition1    Condition2   
## [16] BldgType      HouseStyle    OverallQual   OverallCond   YearBuilt    
## [21] YearRemodAdd  RoofStyle     RoofMatl      Exterior1st   Exterior2nd  
## [26] MasVnrType    MasVnrArea    ExterQual     ExterCond     Foundation   
## [31] BsmtQual      BsmtCond      BsmtExposure  BsmtFinType1  BsmtFinSF1   
## [36] BsmtFinType2  BsmtFinSF2    BsmtUnfSF     TotalBsmtSF   Heating      
## [41] HeatingQC     CentralAir    Electrical    X1stFlrSF     X2ndFlrSF    
## [46] LowQualFinSF  GrLivArea     BsmtFullBath  BsmtHalfBath  FullBath     
## [51] HalfBath      BedroomAbvGr  KitchenAbvGr  KitchenQual   TotRmsAbvGrd 
## [56] Functional    Fireplaces    FireplaceQu   GarageType    GarageYrBlt  
## [61] GarageFinish  GarageCars    GarageArea    GarageQual    GarageCond   
## [66] PavedDrive    WoodDeckSF    OpenPorchSF   EnclosedPorch X3SsnPorch   
## [71] ScreenPorch   PoolArea      PoolQC        Fence         MiscFeature  
## [76] MiscVal       MoSold        YrSold        SaleType      SaleCondition
## <0 rows> (or 0-length row.names)

Prediction the output for Kaggle score:

prediction <- predict(mod1, test_csv)
#
kaggle_score <- data.frame(Id = test_csv[,"Id"], SalePrice = prediction)
kaggle_score[kaggle_score < 0] <- 0
kaggle_score <- replace(kaggle_score, is.na(kaggle_score), 0)
write.csv(kaggle_score, file = "kaggle_score.csv", row.names = FALSE)

Kaggle and presentation related information

User Name
shovanbiswas

Display Name
Shovan Biswas

Kaggle score is copied below:

Presentation link:

https://youtu.be/WfWnoCWOjLc

Marker: 605-16