CUNY MSDS Data 605 Final Project

Problem 1
Problem 2

Problem 1

Using R, generate a variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=\frac{(N+1)}{2}\).

set.seed(525)
r = 10000
X = runif(r, 1, 8)

mu = (8+1)/2
sigma = mu
Y = rnorm(r, mean = mu, sd = sigma)

Question A

Calculate as a minimum the below probabilities. Assume the smaller letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

\(P(X > x | X > y)\)

This is equal to \(\frac {P(X > x \ \& \ X > y)}{P(X > y)}\).

\(P(X > x, Y > y)\)

This is equal to \(P(X > x)*P(Y > y)\).

\(P(X < x | X > y)\)

This is equal to \(\frac {P(X < x \ \& \ X > y)}{P(X > y)}\).

x = median(X)
y = quantile(Y, 1/4)

# Part 1: P(X > x | X > y) is
P1 = (length(X[X>x & X>y])/length(X)) /(length(X[X>y])/length(X))
P1

## [1] 0.5343593

# Part 2: P(X > x, Y > y) is
P2 = (length(X[X>x]) / length(X))*(length(Y[Y>y]) / length(Y))
P2

## [1] 0.375

# Part 3: P(X < x | X > y) is
P3 = (length(X[X<x & X>y])/length(X))/(length(X[X>y])/length(X))
P3

## [1] 0.4656407

Question B

Investigate whether \(P(X > x \ \&\ Y > y) = P(X > x)P(Y > y)\) by building a table and evaluating the marginal and joint probabilities.

a = length(X[X > x & Y > y])/10000 # P(X > x & Y > y)
b = length(X[X > x & Y < y])/10000 # P(X > x & Y < y)
c = length(X[X < x & Y > y])/10000 # P(X < x & Y > y)
d = length(X[X < x & Y < y])/10000 # P(X < x & Y < y)

ptable = matrix(c(a,b,a+b,c,d,c+d,a+c,b+d,a+b+c+d), nrow = 3)
colnames(ptable) = c('P(X > x)', 'P(X < x)', 'marginal')
rownames(ptable) = c('P(Y > y)', 'P(Y < y)', 'marginal')
ptable = as.table(ptable)
ptable

##          P(X > x) P(X < x) marginal
## P(Y > y)    0.377    0.373    0.750
## P(Y < y)    0.123    0.127    0.250
## marginal    0.500    0.500    1.000

# a is P(X > x & Y > y)
a

## [1] 0.377

# P(X > x)P(Y > y)
P2

## [1] 0.375

From the above, it is clear that the probabilities are very close to each other, 0.377 and 0.375 receptively.

Question C

Check to see if independence holds by using Fisher’s Exact Test and the Chi-Square Test. What is the difference between the two? Which is most appropriate?

Both the Fisher’s Exact Test and the Chi-Square Test are test of independence such that the probability distribution of one variable does not affect the presence of another.

\(H_{0} =\) the proportions at one variable are the same for different values of the second variable.

ptable = as.table(matrix(c(a,b,c,d)*10000, nrow = 2))
fisher.test(ptable)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  ptable
## p-value = 0.3678
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9522537 1.1436981
## sample estimates:
## odds ratio 
##   1.043608

The p-value of 0.3678 for the test does not provide any evidence against the assumption of independence. This means that we cannot confidently claim any difference in the two sets of data.

chisq.test(ptable)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  ptable
## X-squared = 0.8112, df = 1, p-value = 0.3678

The p-value of 0.3678 is greater than the 0.05 significance level, we do not reject the null hypothesis that the X dataset is independent of the Y dataset.

The chi-squared test applies an approximation assuming the sample is large, while the Fisher’s exact test is an exact significant test for small sample sizes. In the Fisher’s test, a false rejection equals to the significance level of the test, which is not necessarily true for approximate tests such as the chi-squared test. In short, Fisher’s exact test relies on computing the p-value according to the hypergeometric distribution using binomial coefficients. Since the computed factorials can become very large, Fisher’s exact test may not work for large sample sizes.\(^1\)

\(^1\) Kim, Hae-Young. “Statistical notes for clinical researchers: Chi-squared test and Fisher’s exact test.” Restorative dentistry & endodontics vol. 42,2 (2017): 152-155. doi:10.5395/rde.2017.42.2.152

Problem 2

House Prices: Advanced Regression Techniques is a Kaggle Challenge with a dataset that proves 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.

There is a training (n = 1460) and a test (n = 1459) dataset. Some of the variables of interested are listed below. Briefly, there is an interest in the square feet of the porch, lot, and garage.

Variables	Descriptions
MSZoning	Identifies the general zoning classification of the sale
LotFrontage	Linear feet of street connected to property
LotArea	Lot size in square feet
WoodDeckSF	Wood deck area in square feet
OpenPorchSF	Open porch area in square feet
EnclosedPorch	Enclosed porch area in square feet
3SsnPorch	Three season porch area in square feet
ScreenPorch	Screen porch area in square feet
PoolArea	Pool area in square feet
PoolQC	Pool quality
GarageArea	Size of garage in square feet
GarageQual	Garage quality
OverallCond	Rates the overall condition of the house
SalePrice	Selling Price of the House

# Libraries needed
library(tidyverse)
library(psych)
library(Matrix)
library(MASS)
library(scales)

testdata = read.csv("test.csv")
train = read.csv("train.csv")

Descriptive & Inferential Statistics

Decriptive Statistics of Numeric and Factor Variables

# Descriptive statistics of the numeric variables
train %>% keep(is.numeric) %>% describe()

##               vars    n      mean       sd   median   trimmed      mad
## Id               1 1460    730.50   421.61    730.5    730.50   541.15
## MSSubClass       2 1460     56.90    42.30     50.0     49.15    44.48
## LotFrontage      3 1201     70.05    24.28     69.0     68.94    16.31
## LotArea          4 1460  10516.83  9981.26   9478.5   9563.28  2962.23
## OverallQual      5 1460      6.10     1.38      6.0      6.08     1.48
## OverallCond      6 1460      5.58     1.11      5.0      5.48     0.00
## YearBuilt        7 1460   1971.27    30.20   1973.0   1974.13    37.06
## YearRemodAdd     8 1460   1984.87    20.65   1994.0   1986.37    19.27
## MasVnrArea       9 1452    103.69   181.07      0.0     63.15     0.00
## BsmtFinSF1      10 1460    443.64   456.10    383.5    386.08   568.58
## BsmtFinSF2      11 1460     46.55   161.32      0.0      1.38     0.00
## BsmtUnfSF       12 1460    567.24   441.87    477.5    519.29   426.99
## TotalBsmtSF     13 1460   1057.43   438.71    991.5   1036.70   347.67
## X1stFlrSF       14 1460   1162.63   386.59   1087.0   1129.99   347.67
## X2ndFlrSF       15 1460    346.99   436.53      0.0    285.36     0.00
## LowQualFinSF    16 1460      5.84    48.62      0.0      0.00     0.00
## GrLivArea       17 1460   1515.46   525.48   1464.0   1467.67   483.33
## BsmtFullBath    18 1460      0.43     0.52      0.0      0.39     0.00
## BsmtHalfBath    19 1460      0.06     0.24      0.0      0.00     0.00
## FullBath        20 1460      1.57     0.55      2.0      1.56     0.00
## HalfBath        21 1460      0.38     0.50      0.0      0.34     0.00
## BedroomAbvGr    22 1460      2.87     0.82      3.0      2.85     0.00
## KitchenAbvGr    23 1460      1.05     0.22      1.0      1.00     0.00
## TotRmsAbvGrd    24 1460      6.52     1.63      6.0      6.41     1.48
## Fireplaces      25 1460      0.61     0.64      1.0      0.53     1.48
## GarageYrBlt     26 1379   1978.51    24.69   1980.0   1981.07    31.13
## GarageCars      27 1460      1.77     0.75      2.0      1.77     0.00
## GarageArea      28 1460    472.98   213.80    480.0    469.81   177.91
## WoodDeckSF      29 1460     94.24   125.34      0.0     71.76     0.00
## OpenPorchSF     30 1460     46.66    66.26     25.0     33.23    37.06
## EnclosedPorch   31 1460     21.95    61.12      0.0      3.87     0.00
## X3SsnPorch      32 1460      3.41    29.32      0.0      0.00     0.00
## ScreenPorch     33 1460     15.06    55.76      0.0      0.00     0.00
## PoolArea        34 1460      2.76    40.18      0.0      0.00     0.00
## MiscVal         35 1460     43.49   496.12      0.0      0.00     0.00
## MoSold          36 1460      6.32     2.70      6.0      6.25     2.97
## YrSold          37 1460   2007.82     1.33   2008.0   2007.77     1.48
## SalePrice       38 1460 180921.20 79442.50 163000.0 170783.29 56338.80
##                 min    max  range  skew kurtosis      se
## Id                1   1460   1459  0.00    -1.20   11.03
## MSSubClass       20    190    170  1.40     1.56    1.11
## LotFrontage      21    313    292  2.16    17.34    0.70
## LotArea        1300 215245 213945 12.18   202.26  261.22
## OverallQual       1     10      9  0.22     0.09    0.04
## OverallCond       1      9      8  0.69     1.09    0.03
## YearBuilt      1872   2010    138 -0.61    -0.45    0.79
## YearRemodAdd   1950   2010     60 -0.50    -1.27    0.54
## MasVnrArea        0   1600   1600  2.66    10.03    4.75
## BsmtFinSF1        0   5644   5644  1.68    11.06   11.94
## BsmtFinSF2        0   1474   1474  4.25    20.01    4.22
## BsmtUnfSF         0   2336   2336  0.92     0.46   11.56
## TotalBsmtSF       0   6110   6110  1.52    13.18   11.48
## X1stFlrSF       334   4692   4358  1.37     5.71   10.12
## X2ndFlrSF         0   2065   2065  0.81    -0.56   11.42
## LowQualFinSF      0    572    572  8.99    82.83    1.27
## GrLivArea       334   5642   5308  1.36     4.86   13.75
## BsmtFullBath      0      3      3  0.59    -0.84    0.01
## BsmtHalfBath      0      2      2  4.09    16.31    0.01
## FullBath          0      3      3  0.04    -0.86    0.01
## HalfBath          0      2      2  0.67    -1.08    0.01
## BedroomAbvGr      0      8      8  0.21     2.21    0.02
## KitchenAbvGr      0      3      3  4.48    21.42    0.01
## TotRmsAbvGrd      2     14     12  0.67     0.87    0.04
## Fireplaces        0      3      3  0.65    -0.22    0.02
## GarageYrBlt    1900   2010    110 -0.65    -0.42    0.66
## GarageCars        0      4      4 -0.34     0.21    0.02
## GarageArea        0   1418   1418  0.18     0.90    5.60
## WoodDeckSF        0    857    857  1.54     2.97    3.28
## OpenPorchSF       0    547    547  2.36     8.44    1.73
## EnclosedPorch     0    552    552  3.08    10.37    1.60
## X3SsnPorch        0    508    508 10.28   123.06    0.77
## ScreenPorch       0    480    480  4.11    18.34    1.46
## PoolArea          0    738    738 14.80   222.19    1.05
## MiscVal           0  15500  15500 24.43   697.64   12.98
## MoSold            1     12     11  0.21    -0.41    0.07
## YrSold         2006   2010      4  0.10    -1.19    0.03
## SalePrice     34900 755000 720100  1.88     6.50 2079.11

# Descriptive statistics of the factor variables
train %>% keep(is.factor) %>% summary()

##     MSZoning     Street      Alley      LotShape  LandContour
##  C (all):  10   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  FV     :  65   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  RH     :  16               NA's:1369   IR3: 10   Low:  36   
##  RL     :1151                           Reg:925   Lvl:1311   
##  RM     : 218                                                
##                                                              
##                                                              
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle    RoofStyle       RoofMatl   
##  Norm   :1445   1Fam  :1220   1Story :726   Flat   :  13   CompShg:1434  
##  Feedr  :   6   2fmCon:  31   2Story :445   Gable  :1141   Tar&Grv:  11  
##  Artery :   2   Duplex:  52   1.5Fin :154   Gambrel:  11   WdShngl:   6  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Hip    : 286   WdShake:   5  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   Mansard:   7   ClyTile:   1  
##  PosA   :   1                 1.5Unf : 14   Shed   :   2   Membran:   1  
##  (Other):   2                 (Other): 19                  (Other):   2  
##   Exterior1st   Exterior2nd    MasVnrType  ExterQual ExterCond
##  VinylSd:515   VinylSd:504   BrkCmn : 15   Ex: 52    Ex:   3  
##  HdBoard:222   MetalSd:214   BrkFace:445   Fa: 14    Fa:  28  
##  MetalSd:220   HdBoard:207   None   :864   Gd:488    Gd: 146  
##  Wd Sdng:206   Wd Sdng:197   Stone  :128   TA:906    Po:   1  
##  Plywood:108   Plywood:142   NA's   :  8             TA:1282  
##  CemntBd: 61   CmentBd: 60                                    
##  (Other):128   (Other):136                                    
##   Foundation  BsmtQual   BsmtCond    BsmtExposure BsmtFinType1
##  BrkTil:146   Ex  :121   Fa  :  45   Av  :221     ALQ :220    
##  CBlock:634   Fa  : 35   Gd  :  65   Gd  :134     BLQ :148    
##  PConc :647   Gd  :618   Po  :   2   Mn  :114     GLQ :418    
##  Slab  : 24   TA  :649   TA  :1311   No  :953     LwQ : 74    
##  Stone :  6   NA's: 37   NA's:  37   NA's: 38     Rec :133    
##  Wood  :  3                                       Unf :430    
##                                                   NA's: 37    
##  BsmtFinType2  Heating     HeatingQC CentralAir Electrical   KitchenQual
##  ALQ :  19    Floor:   1   Ex:741    N:  95     FuseA:  94   Ex:100     
##  BLQ :  33    GasA :1428   Fa: 49    Y:1365     FuseF:  27   Fa: 39     
##  GLQ :  14    GasW :  18   Gd:241               FuseP:   3   Gd:586     
##  LwQ :  46    Grav :   7   Po:  1               Mix  :   1   TA:735     
##  Rec :  54    OthW :   2   TA:428               SBrkr:1334              
##  Unf :1256    Wall :   4                        NA's :   1              
##  NA's:  38                                                              
##  Functional  FireplaceQu   GarageType  GarageFinish GarageQual 
##  Maj1:  14   Ex  : 24    2Types :  6   Fin :352     Ex  :   3  
##  Maj2:   5   Fa  : 33    Attchd :870   RFn :422     Fa  :  48  
##  Min1:  31   Gd  :380    Basment: 19   Unf :605     Gd  :  14  
##  Min2:  34   Po  : 20    BuiltIn: 88   NA's: 81     Po  :   3  
##  Mod :  15   TA  :313    CarPort:  9                TA  :1311  
##  Sev :   1   NA's:690    Detchd :387                NA's:  81  
##  Typ :1360               NA's   : 81                           
##  GarageCond  PavedDrive  PoolQC       Fence      MiscFeature
##  Ex  :   2   N:  90     Ex  :   2   GdPrv:  59   Gar2:   2  
##  Fa  :  35   P:  30     Fa  :   2   GdWo :  54   Othr:   2  
##  Gd  :   9   Y:1340     Gd  :   3   MnPrv: 157   Shed:  49  
##  Po  :   7              NA's:1453   MnWw :  11   TenC:   1  
##  TA  :1326                          NA's :1179   NA's:1406  
##  NA's:  81                                                  
##                                                             
##     SaleType    SaleCondition 
##  WD     :1267   Abnorml: 101  
##  New    : 122   AdjLand:   4  
##  COD    :  43   Alloca :  12  
##  ConLD  :   9   Family :  20  
##  ConLI  :   5   Normal :1198  
##  ConLw  :   5   Partial: 125  
##  (Other):   9

Scatter Plots of Independent Variables with Dependent Variable

The dependent variable is the selling price of the house, SalePrice. The independent variables are the lot size in square feet, LotArea and size of garage in square feet, GarageArea. Moreover, one of independent variables of interest is the newly created variable of the total porch area which is the sum of all porch area (i.e. WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, and ScreenPorch) in square feet called TotalPorch.

# Scatterplots of Independent Variables with Dependent Variable: SalePrice

# vs LotArea
ggplot(train, aes(x = log(LotArea/1000), y = log(SalePrice/1000))) + geom_point(color = "#69b3a2", alpha = 0.5) + labs(title = "Home Sale Price vs Lot Size", x = "Lot Size in log(Kft^2)", y = "Price in log(K$)") + theme_minimal()

# vs TotalPorch: Sum of all porch area
train$TotalPorch = rowSums(train[,c(67:71)])

ggplot(train, aes(x = TotalPorch, y = (SalePrice/1000))) + geom_point(color = "#69b3a2", alpha = 0.5) + labs(title = "Home Sale Price vs Total Porch Area", x = "Total Porch Area in K ft^2", y = "Price in K$") + theme_minimal()

# vs GarageArea
ggplot(train, aes(x = (GarageArea/1000), y = (SalePrice/1000))) + geom_point(color = "#69b3a2", alpha = 0.5) + labs(title = " Home Sale Price vs Garage Area", x = "Garage Area in K ft^2", y = "Price in K$") + theme_minimal()

Initial Correlation Analysis

Correlation matrix will be of SalePrice, LotArea, PoolArea, GarageArea, and TotalPorch. Following it, will be a correlation hypothesis testing where:

Null: true correlation is equal to 0

test = train[,c(5,63,72,82,81)]
res = corr.test(test, y = NULL, use = "pairwise", method = "pearson", adjust = "holm", alpha =.2, ci = TRUE, minlength=5)
print(res, short = FALSE)

## Call:corr.test(x = test, y = NULL, use = "pairwise", method = "pearson", 
##     adjust = "holm", alpha = 0.2, ci = TRUE, minlength = 5)
## Correlation matrix 
##            LotArea GarageArea PoolArea TotalPorch SalePrice
## LotArea       1.00       0.18     0.08       0.19      0.26
## GarageArea    0.18       1.00     0.06       0.26      0.62
## PoolArea      0.08       0.06     1.00       0.12      0.09
## TotalPorch    0.19       0.26     0.12       1.00      0.39
## SalePrice     0.26       0.62     0.09       0.39      1.00
## Sample Size 
## [1] 1460
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##            LotArea GarageArea PoolArea TotalPorch SalePrice
## LotArea          0       0.00     0.01          0         0
## GarageArea       0       0.00     0.02          0         0
## PoolArea         0       0.02     0.00          0         0
## TotalPorch       0       0.00     0.00          0         0
## SalePrice        0       0.00     0.00          0         0
## 
##  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
##             raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## LotAr-GrgAr      0.15  0.18      0.21  0.00      0.13      0.23
## LotAr-PolAr      0.04  0.08      0.11  0.00      0.03      0.12
## LotAr-TtlPr      0.15  0.19      0.22  0.00      0.13      0.24
## LotAr-SlPrc      0.23  0.26      0.29  0.00      0.21      0.32
## GrgAr-PolAr      0.03  0.06      0.09  0.02      0.03      0.09
## GrgAr-TtlPr      0.23  0.26      0.29  0.00      0.20      0.31
## GrgAr-SlPrc      0.60  0.62      0.64  0.00      0.58      0.66
## PolAr-TtlPr      0.09  0.12      0.16  0.00      0.07      0.17
## PolAr-SlPrc      0.06  0.09      0.13  0.00      0.04      0.14
## TtlPr-SlPrc      0.36  0.39      0.42  0.00      0.34      0.44

Let’s take a closer look at the analysis above. Firstly, the correlation matrix, it highlights that no pairs of variable have a very strong relationship with another (r > 0.80). At most, the sale price and garage area have a moderate positive correlation of 0.62. In addition, there is a slight positive correlation between sale price with lot area, and sale price with total porch area, r = 0.26 and r = 0.39, respectively. Now at the 80% confidence level, the left diagonal of the probability values matrix suggests that there are evidence that a correlation exist since the p-values are less than 0.2. The familywise error rate is the probability of making at least one Type I Error in a series of hypothesis tests. Fortunately, corr.test adjust for multiple testing using Holm method where tests are run and then ordered from lowest to highest p-values. These values are in the right diagonal of the probability values matrix.

Linear Algebra and Correlation

Let’s invert the correlation matrix to find the precision matrix and conduct LU decomposition.

# Precision Matrix
precision = solve(res[["r"]])
precision

##                LotArea   GarageArea     PoolArea  TotalPorch   SalePrice
## LotArea     1.08690976 -0.025491008 -0.049657024 -0.09981525 -0.22726647
## GarageArea -0.02549101  1.637131857 -0.001944333 -0.02721284 -1.00309414
## PoolArea   -0.04965702 -0.001944333  1.019887469 -0.09892174 -0.04124968
## TotalPorch -0.09981525 -0.027212840 -0.098921736  1.20069881 -0.41702321
## SalePrice  -0.22726647 -1.003094142 -0.041249680 -0.41702321  1.85218795

# Correlation matrix times Precision matrix
cp = res[["r"]] %*% precision
cp

##                  LotArea   GarageArea      PoolArea   TotalPorch SalePrice
## LotArea     1.000000e+00 0.000000e+00  0.000000e+00 0.000000e+00         0
## GarageArea -2.775558e-17 1.000000e+00  3.469447e-18 0.000000e+00         0
## PoolArea    1.734723e-17 1.387779e-17  1.000000e+00 6.938894e-18         0
## TotalPorch  1.387779e-17 5.551115e-17  2.428613e-17 1.000000e+00         0
## SalePrice  -2.775558e-17 2.220446e-16 -2.081668e-17 0.000000e+00         1

# Precision matrix times Correlation matrix
pc = precision %*% res[["r"]]
pc

##                  LotArea    GarageArea      PoolArea   TotalPorch
## LotArea     1.000000e+00  0.000000e+00  1.734723e-17 4.163336e-17
## GarageArea  5.551115e-17  1.000000e+00  2.775558e-17 5.551115e-17
## PoolArea   -8.673617e-18 -3.469447e-18  1.000000e+00 3.469447e-18
## TotalPorch  1.387779e-17  5.551115e-17  4.163336e-17 1.000000e+00
## SalePrice  -1.110223e-16  0.000000e+00 -2.775558e-17 0.000000e+00
##                SalePrice
## LotArea     0.000000e+00
## GarageArea  2.220446e-16
## PoolArea   -6.938894e-18
## TotalPorch  5.551115e-17
## SalePrice   1.000000e+00

# LU decomposition
expand(lu(res[["r"]]))

## $L
## 5 x 5 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]       [,2]       [,3]       [,4]       [,5]      
## [1,] 1.00000000          .          .          .          .
## [2,] 0.18040276 1.00000000          .          .          .
## [3,] 0.07767239 0.04861721 1.00000000          .          .
## [4,] 0.18525613 0.23339423 0.09776712 1.00000000          .
## [5,] 0.26384335 0.59520439 0.04428321 0.22515167 1.00000000
## 
## $U
## 5 x 5 Matrix of class "dtrMatrix"
##      [,1]       [,2]       [,3]       [,4]       [,5]      
## [1,] 1.00000000 0.18040276 0.07767239 0.18525613 0.26384335
## [2,]          . 0.96745485 0.04703496 0.22579838 0.57583337
## [3,]          .          . 0.99168029 0.09695373 0.04391479
## [4,]          .          .          . 0.90350124 0.20342481
## [5,]          .          .          .          . 0.53990201
## 
## $P
## 5 x 5 sparse Matrix of class "pMatrix"
##               
## [1,] | . . . .
## [2,] . | . . .
## [3,] . . | . .
## [4,] . . . | .
## [5,] . . . . |

By multiplying the correlation matrix by the precision matrix and vice versa, result is the identity matrix. Moreover, the LU decomposition resulted in the correlation matrix after multiplying the two components.

Calculus-based Probability & Statistics

From the descriptive statistic, one of the variable of interest is positively skewed, namely LotArea, with a skewness of 12.18.

skew(train$LotArea)

## [1] 12.18262

# With the MASS library, fitdistr is used to fit the data to an exponential probability density function
LotArea_exp = fitdistr(train$LotArea, "exponential")
LotArea_exp

##        rate    
##   9.508570e-05 
##  (2.488507e-06)

# Generate 1000 samples from the exponential distribution
y = rexp(1000, LotArea_exp$estimate)

ggplot(train, aes(x = LotArea)) + geom_histogram(aes(y = ..density..,fill = "Observed"),bins = 50,alpha=0.4) + geom_histogram(data = data.frame(y),aes(x = y, y = ..density.., fill = "Simulated"), bins = 100, alpha = 0.5) + theme_minimal() + theme(legend.title = element_blank()) + scale_fill_brewer(palette = "Set2") + labs(title = "Plots of Lot Areas", x = "Square Feet", y = "Proportion") + xlim(0, 10^5)

From the graph, it is evident that the observed graph of the data does not fit too well with the simulated dataset. Thus, let’s compare the 5th and 95th percentiles of the exponential pdf and empirical data.

# Exponential PDF
qexp(c(0.05,0.95),LotArea_exp$estimate)

## [1]   539.4428 31505.6013

# Emperical Data
quantile(train$LotArea, probs = c(0.05,0.95))

##       5%      95% 
##  3311.70 17401.15

# 95% condifence interval of empirical data
z = qnorm(0.95)
a = z * sd(train$LotArea)/sqrt(length(train$LotArea))

CI = matrix(c(round(mean(train$LotArea - a),2), round(mean(train$LotArea + a),2)))
rownames(CI) = c("Lower CI","Upper CI")
CI

##              [,1]
## Lower CI 10087.16
## Upper CI 10946.50

# Exponential PDF and Emperical Data means
means = matrix(c(mean(train$LotArea), mean(y)))
rownames(means) = c("Mean of LotArea","Mean of EXP")
means

##                     [,1]
## Mean of LotArea 10516.83
## Mean of EXP     10518.08

The model’s sample mean lies within the 95% confidence interval for the mean. Thus, an exponential distribution does appear to be an appropriate model that fits the lot area. In spite of this, based on the graph of the observed and simulated data, it is still not the best fitted model.

Modeling

Now, finding the best fit model with the data using Stepwise Regression.

dataset = train %>% select_if(is.numeric) %>% filter(complete.cases(.))

# Remove Outlier
remove_outliers = function(x, na.rm = TRUE, ...) {
  qnt = quantile(x, probs = c(.30, .70), na.rm = na.rm, ...)
  caps = quantile(x, probs=c(.05, .95), na.rm = T)
  H = 1.5 * IQR(x, na.rm = na.rm)
  y = x
  y[x < (qnt[1] - H)] = caps[1]
  y[x > (qnt[2] + H)] = caps[2]
  y
}
remove_all_outliers = function(df){
  df[,sapply(df, is.numeric)] = lapply(df[,sapply(df, is.numeric)], remove_outliers)
  df
}

dataset_new = remove_all_outliers(dataset)

# Fit the full model
model = lm(SalePrice ~., data = dataset_new)

# Stepwise Regression model
step = stepAIC(model, direction = "both", trace = FALSE)
summary(step)

## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF2 + BsmtUnfSF + 
##     TotalBsmtSF + X2ndFlrSF + GrLivArea + BsmtHalfBath + BedroomAbvGr + 
##     Fireplaces + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + 
##     EnclosedPorch + ScreenPorch + TotalPorch, data = dataset_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -207243  -12221    -345   12895   95465 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.253e+06  9.522e+04 -13.163  < 2e-16 ***
## MSSubClass    -9.641e+01  2.267e+01  -4.253 2.28e-05 ***
## LotArea        1.601e+00  2.958e-01   5.413 7.59e-08 ***
## OverallQual    1.463e+04  9.184e+02  15.924  < 2e-16 ***
## OverallCond    7.118e+03  9.557e+02   7.448 1.92e-13 ***
## YearBuilt      3.887e+02  4.629e+01   8.397  < 2e-16 ***
## YearRemodAdd   2.053e+02  5.340e+01   3.844 0.000128 ***
## BsmtFinSF2    -2.201e+01  6.393e+00  -3.443 0.000597 ***
## BsmtUnfSF     -2.105e+01  1.990e+00 -10.577  < 2e-16 ***
## TotalBsmtSF    5.172e+01  4.246e+00  12.181  < 2e-16 ***
## X2ndFlrSF      1.469e+01  3.963e+00   3.707 0.000220 ***
## GrLivArea      3.946e+01  4.311e+00   9.154  < 2e-16 ***
## BsmtHalfBath  -6.259e+03  3.202e+03  -1.955 0.050856 .  
## BedroomAbvGr  -5.970e+03  1.498e+03  -3.986 7.17e-05 ***
## Fireplaces     5.791e+03  1.409e+03   4.110 4.25e-05 ***
## GarageCars     8.237e+03  2.365e+03   3.483 0.000516 ***
## GarageArea     1.429e+01  8.063e+00   1.773 0.076582 .  
## WoodDeckSF     4.104e+01  1.540e+01   2.665 0.007813 ** 
## OpenPorchSF    6.462e+01  2.230e+01   2.898 0.003835 ** 
## EnclosedPorch  4.729e+01  1.699e+01   2.784 0.005462 ** 
## ScreenPorch    5.069e+01  2.168e+01   2.338 0.019554 *  
## TotalPorch    -2.280e+01  1.447e+01  -1.576 0.115370    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23670 on 1099 degrees of freedom
## Multiple R-squared:  0.8844, Adjusted R-squared:  0.8822 
## F-statistic: 400.3 on 21 and 1099 DF,  p-value: < 2.2e-16

Prediction

# Using the test dataset to make the prediction.
# All tranformation and new variable from train data is created in testdata
testdata$TotalPorch = rowSums(testdata[,c(67:71)])

testdata$SalePrice = predict(step, testdata)

testdata[is.na(testdata)] = mean(testdata$SalePrice, na.rm = TRUE)

submission = subset(testdata, select = c("Id", "SalePrice"))

write.csv(submission, file = "sam13hopexlife_submission.csv")