Problem 3 Advanced Regression

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

library(knitr)
library(kableExtra)
library(corrplot)
library(Matrix)
library(MASS)

trainds <- read.csv("https://raw.githubusercontent.com/jtul333/Data605/main/train.csv", header = TRUE, stringsAsFactors = FALSE)

Data Summary

psych::describe(trainds)
##                vars    n      mean       sd   median   trimmed      mad   min
## Id                1 1460    730.50   421.61    730.5    730.50   541.15     1
## MSSubClass        2 1460     56.90    42.30     50.0     49.15    44.48    20
## MSZoning*         3 1460      4.03     0.63      4.0      4.06     0.00     1
## LotFrontage       4 1201     70.05    24.28     69.0     68.94    16.31    21
## LotArea           5 1460  10516.83  9981.26   9478.5   9563.28  2962.23  1300
## Street*           6 1460      2.00     0.06      2.0      2.00     0.00     1
## Alley*            7   91      1.45     0.50      1.0      1.44     0.00     1
## LotShape*         8 1460      2.94     1.41      4.0      3.05     0.00     1
## LandContour*      9 1460      3.78     0.71      4.0      4.00     0.00     1
## Utilities*       10 1460      1.00     0.03      1.0      1.00     0.00     1
## LotConfig*       11 1460      4.02     1.62      5.0      4.27     0.00     1
## LandSlope*       12 1460      1.06     0.28      1.0      1.00     0.00     1
## Neighborhood*    13 1460     13.15     5.89     13.0     13.11     7.41     1
## Condition1*      14 1460      3.03     0.87      3.0      3.00     0.00     1
## Condition2*      15 1460      3.01     0.26      3.0      3.00     0.00     1
## BldgType*        16 1460      1.49     1.20      1.0      1.14     0.00     1
## HouseStyle*      17 1460      4.04     1.91      3.0      4.03     1.48     1
## OverallQual      18 1460      6.10     1.38      6.0      6.08     1.48     1
## OverallCond      19 1460      5.58     1.11      5.0      5.48     0.00     1
## YearBuilt        20 1460   1971.27    30.20   1973.0   1974.13    37.06  1872
## YearRemodAdd     21 1460   1984.87    20.65   1994.0   1986.37    19.27  1950
## RoofStyle*       22 1460      2.41     0.83      2.0      2.26     0.00     1
## RoofMatl*        23 1460      2.08     0.60      2.0      2.00     0.00     1
## Exterior1st*     24 1460     10.62     3.20     13.0     10.93     1.48     1
## Exterior2nd*     25 1460     11.34     3.54     14.0     11.65     2.97     1
## MasVnrType*      26 1452      2.76     0.62      3.0      2.73     0.00     1
## MasVnrArea       27 1452    103.69   181.07      0.0     63.15     0.00     0
## ExterQual*       28 1460      3.54     0.69      4.0      3.65     0.00     1
## ExterCond*       29 1460      4.73     0.73      5.0      4.95     0.00     1
## Foundation*      30 1460      2.40     0.72      2.0      2.46     1.48     1
## BsmtQual*        31 1423      3.26     0.87      3.0      3.43     1.48     1
## BsmtCond*        32 1423      3.81     0.66      4.0      4.00     0.00     1
## BsmtExposure*    33 1422      3.27     1.15      4.0      3.46     0.00     1
## BsmtFinType1*    34 1423      3.73     1.83      3.0      3.79     2.97     1
## BsmtFinSF1       35 1460    443.64   456.10    383.5    386.08   568.58     0
## BsmtFinType2*    36 1422      5.71     0.94      6.0      5.98     0.00     1
## BsmtFinSF2       37 1460     46.55   161.32      0.0      1.38     0.00     0
## BsmtUnfSF        38 1460    567.24   441.87    477.5    519.29   426.99     0
## TotalBsmtSF      39 1460   1057.43   438.71    991.5   1036.70   347.67     0
## Heating*         40 1460      2.04     0.30      2.0      2.00     0.00     1
## HeatingQC*       41 1460      2.54     1.74      1.0      2.42     0.00     1
## CentralAir*      42 1460      1.93     0.25      2.0      2.00     0.00     1
## Electrical*      43 1459      4.68     1.05      5.0      5.00     0.00     1
## X1stFlrSF        44 1460   1162.63   386.59   1087.0   1129.99   347.67   334
## X2ndFlrSF        45 1460    346.99   436.53      0.0    285.36     0.00     0
## LowQualFinSF     46 1460      5.84    48.62      0.0      0.00     0.00     0
## GrLivArea        47 1460   1515.46   525.48   1464.0   1467.67   483.33   334
## BsmtFullBath     48 1460      0.43     0.52      0.0      0.39     0.00     0
## BsmtHalfBath     49 1460      0.06     0.24      0.0      0.00     0.00     0
## FullBath         50 1460      1.57     0.55      2.0      1.56     0.00     0
## HalfBath         51 1460      0.38     0.50      0.0      0.34     0.00     0
## BedroomAbvGr     52 1460      2.87     0.82      3.0      2.85     0.00     0
## KitchenAbvGr     53 1460      1.05     0.22      1.0      1.00     0.00     0
## KitchenQual*     54 1460      3.34     0.83      4.0      3.50     0.00     1
## TotRmsAbvGrd     55 1460      6.52     1.63      6.0      6.41     1.48     2
## Functional*      56 1460      6.75     0.98      7.0      7.00     0.00     1
## Fireplaces       57 1460      0.61     0.64      1.0      0.53     1.48     0
## FireplaceQu*     58  770      3.73     1.13      3.0      3.80     1.48     1
## GarageType*      59 1379      3.28     1.79      2.0      3.11     0.00     1
## GarageYrBlt      60 1379   1978.51    24.69   1980.0   1981.07    31.13  1900
## GarageFinish*    61 1379      2.18     0.81      2.0      2.23     1.48     1
## GarageCars       62 1460      1.77     0.75      2.0      1.77     0.00     0
## GarageArea       63 1460    472.98   213.80    480.0    469.81   177.91     0
## GarageQual*      64 1379      4.86     0.61      5.0      5.00     0.00     1
## GarageCond*      65 1379      4.90     0.52      5.0      5.00     0.00     1
## PavedDrive*      66 1460      2.86     0.50      3.0      3.00     0.00     1
## WoodDeckSF       67 1460     94.24   125.34      0.0     71.76     0.00     0
## OpenPorchSF      68 1460     46.66    66.26     25.0     33.23    37.06     0
## EnclosedPorch    69 1460     21.95    61.12      0.0      3.87     0.00     0
## X3SsnPorch       70 1460      3.41    29.32      0.0      0.00     0.00     0
## ScreenPorch      71 1460     15.06    55.76      0.0      0.00     0.00     0
## PoolArea         72 1460      2.76    40.18      0.0      0.00     0.00     0
## PoolQC*          73    7      2.14     0.90      2.0      2.14     1.48     1
## Fence*           74  281      2.43     0.86      3.0      2.48     0.00     1
## MiscFeature*     75   54      2.91     0.45      3.0      3.00     0.00     1
## MiscVal          76 1460     43.49   496.12      0.0      0.00     0.00     0
## MoSold           77 1460      6.32     2.70      6.0      6.25     2.97     1
## YrSold           78 1460   2007.82     1.33   2008.0   2007.77     1.48  2006
## SaleType*        79 1460      8.51     1.56      9.0      8.92     0.00     1
## SaleCondition*   80 1460      4.77     1.10      5.0      5.00     0.00     1
## SalePrice        81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
##                   max  range   skew kurtosis      se
## Id               1460   1459   0.00    -1.20   11.03
## MSSubClass        190    170   1.40     1.56    1.11
## MSZoning*           5      4  -1.73     6.25    0.02
## LotFrontage       313    292   2.16    17.34    0.70
## LotArea        215245 213945  12.18   202.26  261.22
## Street*             2      1 -15.49   238.01    0.00
## Alley*              2      1   0.20    -1.98    0.05
## LotShape*           4      3  -0.61    -1.60    0.04
## LandContour*        4      3  -3.16     8.65    0.02
## Utilities*          2      1  38.13  1453.00    0.00
## LotConfig*          5      4  -1.13    -0.59    0.04
## LandSlope*          3      2   4.80    24.47    0.01
## Neighborhood*      25     24   0.02    -1.06    0.15
## Condition1*         9      8   3.01    16.34    0.02
## Condition2*         8      7  13.14   247.54    0.01
## BldgType*           5      4   2.24     3.41    0.03
## HouseStyle*         8      7   0.31    -0.96    0.05
## OverallQual        10      9   0.22     0.09    0.04
## OverallCond         9      8   0.69     1.09    0.03
## YearBuilt        2010    138  -0.61    -0.45    0.79
## YearRemodAdd     2010     60  -0.50    -1.27    0.54
## RoofStyle*          6      5   1.47     0.61    0.02
## RoofMatl*           8      7   8.09    66.28    0.02
## Exterior1st*       15     14  -0.72    -0.37    0.08
## Exterior2nd*       16     15  -0.69    -0.52    0.09
## MasVnrType*         4      3  -0.07    -0.13    0.02
## MasVnrArea       1600   1600   2.66    10.03    4.75
## ExterQual*          4      3  -1.83     3.86    0.02
## ExterCond*          5      4  -2.56     5.29    0.02
## Foundation*         6      5   0.09     1.02    0.02
## BsmtQual*           4      3  -1.31     1.27    0.02
## BsmtCond*           4      3  -3.39    10.14    0.02
## BsmtExposure*       4      3  -1.15    -0.39    0.03
## BsmtFinType1*       6      5  -0.02    -1.39    0.05
## BsmtFinSF1       5644   5644   1.68    11.06   11.94
## BsmtFinType2*       6      5  -3.56    12.32    0.02
## BsmtFinSF2       1474   1474   4.25    20.01    4.22
## BsmtUnfSF        2336   2336   0.92     0.46   11.56
## TotalBsmtSF      6110   6110   1.52    13.18   11.48
## Heating*            6      5   9.83   110.98    0.01
## HeatingQC*          5      4   0.48    -1.51    0.05
## CentralAir*         2      1  -3.52    10.42    0.01
## Electrical*         5      4  -3.06     7.49    0.03
## X1stFlrSF        4692   4358   1.37     5.71   10.12
## X2ndFlrSF        2065   2065   0.81    -0.56   11.42
## LowQualFinSF      572    572   8.99    82.83    1.27
## GrLivArea        5642   5308   1.36     4.86   13.75
## BsmtFullBath        3      3   0.59    -0.84    0.01
## BsmtHalfBath        2      2   4.09    16.31    0.01
## FullBath            3      3   0.04    -0.86    0.01
## HalfBath            2      2   0.67    -1.08    0.01
## BedroomAbvGr        8      8   0.21     2.21    0.02
## KitchenAbvGr        3      3   4.48    21.42    0.01
## KitchenQual*        4      3  -1.42     1.72    0.02
## TotRmsAbvGrd       14     12   0.67     0.87    0.04
## Functional*         7      6  -4.08    16.37    0.03
## Fireplaces          3      3   0.65    -0.22    0.02
## FireplaceQu*        5      4  -0.16    -0.98    0.04
## GarageType*         6      5   0.76    -1.30    0.05
## GarageYrBlt      2010    110  -0.65    -0.42    0.66
## GarageFinish*       3      2  -0.35    -1.41    0.02
## GarageCars          4      4  -0.34     0.21    0.02
## GarageArea       1418   1418   0.18     0.90    5.60
## GarageQual*         5      4  -4.43    18.25    0.02
## GarageCond*         5      4  -5.28    26.77    0.01
## PavedDrive*         3      2  -3.30     9.22    0.01
## WoodDeckSF        857    857   1.54     2.97    3.28
## OpenPorchSF       547    547   2.36     8.44    1.73
## EnclosedPorch     552    552   3.08    10.37    1.60
## X3SsnPorch        508    508  10.28   123.06    0.77
## ScreenPorch       480    480   4.11    18.34    1.46
## PoolArea          738    738  14.80   222.19    1.05
## PoolQC*             3      2  -0.22    -1.90    0.34
## Fence*              4      3  -0.57    -0.88    0.05
## MiscFeature*        4      3  -2.93    10.71    0.06
## MiscVal         15500  15500  24.43   697.64   12.98
## MoSold             12     11   0.21    -0.41    0.07
## YrSold           2010      4   0.10    -1.19    0.03
## SaleType*           9      8  -3.83    14.57    0.04
## SaleCondition*      6      5  -2.74     6.82    0.03
## SalePrice      755000 720100   1.88     6.50 2079.11

Data View

kable(data.frame(head(trainds, n = 10L))) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#ea6323") %>%
    scroll_box(width = "100%", height = "300px")
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 60 RL 65 8450 Pave NA Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NA Attchd 2003 RFn 2 548 TA TA Y 0 61 0 0 0 0 NA NA NA 0 2 2008 WD Normal 208500
2 20 RL 80 9600 Pave NA Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976 RFn 2 460 TA TA Y 298 0 0 0 0 0 NA NA NA 0 5 2007 WD Normal 181500
3 60 RL 68 11250 Pave NA IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001 RFn 2 608 TA TA Y 0 42 0 0 0 0 NA NA NA 0 9 2008 WD Normal 223500
4 70 RL 60 9550 Pave NA IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998 Unf 3 642 TA TA Y 0 35 272 0 0 0 NA NA NA 0 2 2006 WD Abnorml 140000
5 60 RL 84 14260 Pave NA IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000 RFn 3 836 TA TA Y 192 84 0 0 0 0 NA NA NA 0 12 2008 WD Normal 250000
6 50 RL 85 14115 Pave NA IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin 5 5 1993 1995 Gable CompShg VinylSd VinylSd None 0 TA TA Wood Gd TA No GLQ 732 Unf 0 64 796 GasA Ex Y SBrkr 796 566 0 1362 1 0 1 1 1 1 TA 5 Typ 0 NA Attchd 1993 Unf 2 480 TA TA Y 40 30 0 320 0 0 NA MnPrv Shed 700 10 2009 WD Normal 143000
7 20 RL 75 10084 Pave NA Reg Lvl AllPub Inside Gtl Somerst Norm Norm 1Fam 1Story 8 5 2004 2005 Gable CompShg VinylSd VinylSd Stone 186 Gd TA PConc Ex TA Av GLQ 1369 Unf 0 317 1686 GasA Ex Y SBrkr 1694 0 0 1694 1 0 2 0 3 1 Gd 7 Typ 1 Gd Attchd 2004 RFn 2 636 TA TA Y 255 57 0 0 0 0 NA NA NA 0 8 2007 WD Normal 307000
8 60 RL NA 10382 Pave NA IR1 Lvl AllPub Corner Gtl NWAmes PosN Norm 1Fam 2Story 7 6 1973 1973 Gable CompShg HdBoard HdBoard Stone 240 TA TA CBlock Gd TA Mn ALQ 859 BLQ 32 216 1107 GasA Ex Y SBrkr 1107 983 0 2090 1 0 2 1 3 1 TA 7 Typ 2 TA Attchd 1973 RFn 2 484 TA TA Y 235 204 228 0 0 0 NA NA Shed 350 11 2009 WD Normal 200000
9 50 RM 51 6120 Pave NA Reg Lvl AllPub Inside Gtl OldTown Artery Norm 1Fam 1.5Fin 7 5 1931 1950 Gable CompShg BrkFace Wd Shng None 0 TA TA BrkTil TA TA No Unf 0 Unf 0 952 952 GasA Gd Y FuseF 1022 752 0 1774 0 0 2 0 2 2 TA 8 Min1 2 TA Detchd 1931 Unf 2 468 Fa TA Y 90 0 205 0 0 0 NA NA NA 0 4 2008 WD Abnorml 129900
10 190 RL 50 7420 Pave NA Reg Lvl AllPub Corner Gtl BrkSide Artery Artery 2fmCon 1.5Unf 5 6 1939 1950 Gable CompShg MetalSd MetalSd None 0 TA TA BrkTil TA TA No GLQ 851 Unf 0 140 991 GasA Ex Y SBrkr 1077 0 0 1077 1 0 1 0 2 2 TA 5 Typ 2 TA Attchd 1939 RFn 1 205 Gd TA Y 0 4 0 0 0 0 NA NA NA 0 1 2008 WD Normal 118000

Univariate Descriptive Statistics and Appropriate Plots

#Summary
summary(trainds$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
#Histogram
hist(trainds$SalePrice, main="Sale Price")

Normal QQ Plot

qqnorm(trainds$SalePrice)
qqline(trainds$SalePrice)

Scatterplot Matrix

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

pairs(~SalePrice+LotArea+GrLivArea+PoolArea,data=trainds, main="Scatterplot Matrix")

Correlation Matrix

Derive a correlation matrix for any three quantitative variables in the dataset.

sub_trainds <- data.frame(trainds$LotArea,trainds$GrLivArea,trainds$PoolArea)
#Correlation
cortrainmatrix <- cor(sub_trainds)
cortrainmatrix
##                   trainds.LotArea trainds.GrLivArea trainds.PoolArea
## trainds.LotArea        1.00000000         0.2631162       0.07767239
## trainds.GrLivArea      0.26311617         1.0000000       0.17020534
## trainds.PoolArea       0.07767239         0.1702053       1.00000000
#Correlation plot
corrplot(cortrainmatrix, method="circle")

Hypothesis Tests

Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval.

#LotArea vs. GrLivArea
cor.test(trainds$LotArea,trainds$GrLivArea,method = "pearson",conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  trainds$LotArea and trainds$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162
#LotArea vs. PoolArea
cor.test(trainds$LotArea,trainds$PoolArea,method = "pearson",conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  trainds$LotArea and trainds$PoolArea
## t = 2.9748, df = 1458, p-value = 0.00298
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.04422604 0.11094482
## sample estimates:
##        cor 
## 0.07767239
#GrLivArea vs. PoolArea
cor.test(trainds$GrLivArea,trainds$PoolArea,method = "pearson",conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  trainds$GrLivArea and trainds$PoolArea
## t = 6.5953, df = 1458, p-value = 5.918e-11
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1374287 0.2026096
## sample estimates:
##       cor 
## 0.1702053
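
The 80% intervals printed above can be reproduced by hand, which shows where they come from. The sketch below assumes the interval is built with the Fisher z-transformation, which is how base R's cor.test computes it for Pearson correlations, and plugs in the r and n reported for LotArea and GrLivArea. The helper name fisher_ci is made up for illustration.

```r
# Hypothetical helper: confidence interval for a Pearson correlation
# using the Fisher z-transformation.
fisher_ci <- function(r, n, conf_level = 0.80) {
  z      <- atanh(r)                      # move r to the z scale
  se     <- 1 / sqrt(n - 3)               # approximate standard error of z
  z_crit <- qnorm((1 + conf_level) / 2)   # two-sided critical value
  tanh(z + c(-1, 1) * z_crit * se)        # back-transform to the r scale
}

# r and n copied from the LotArea vs. GrLivArea output (df = 1458, so n = 1460)
ci <- fisher_ci(r = 0.2631162, n = 1460)
print(round(ci, 4))
```

Because the interval is symmetric on the z scale rather than on the r scale, it is slightly asymmetric around the sample correlation.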

Analysis Observation

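All three tests have p-values well below 0.05 (the largest is 0.00298), and none of the 80% confidence intervals contains 0, so the null hypothesis of zero correlation is rejected for each pair. The correlations themselves are weak: about 0.26, 0.08, and 0.17. Familywise error is the probability of making at least one Type I error when performing multiple hypothesis tests, and it is a concern here because three tests are run on the same data. At the 80% confidence level (α = 0.20 per test), the chance of at least one false rejection across three independent tests would be 1 - 0.8^3 ≈ 0.49. This can be mitigated by raising the confidence level or by applying a correction such as Bonferroni. With a Bonferroni-adjusted threshold of 0.05/3 ≈ 0.0167, all three p-values remain significant.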
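
To put numbers on the familywise error question, the sketch below computes the familywise error rate for three tests and applies a Bonferroni adjustment with base R's p.adjust. It assumes the three tests are independent, which is only an approximation since they share variables, and it uses the p-values printed above. The first p-value was reported only as "< 2.2e-16", so that bound is used in its place.

```r
# Per-test alpha implied by an 80% confidence level, and number of tests
alpha_per_test <- 0.20
m <- 3

# Probability of at least one false rejection, assuming independent tests
fwer <- 1 - (1 - alpha_per_test)^m

# p-values copied from the cor.test output; the first is an upper bound
p_values <- c(LotArea_GrLivArea  = 2.2e-16,
              LotArea_PoolArea   = 0.00298,
              GrLivArea_PoolArea = 5.918e-11)

# Bonferroni adjustment: multiply each p-value by m (capped at 1)
p_bonf <- p.adjust(p_values, method = "bonferroni")

print(fwer)
print(p_bonf)
```

Comparing the adjusted p-values against 0.05 shows whether each rejection survives the correction.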

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the

Inversion

precisionmatrix <- solve(cortrainmatrix)
precisionmatrix
##                   trainds.LotArea trainds.GrLivArea trainds.PoolArea
## trainds.LotArea        1.07566675        -0.2768243      -0.03643264
## trainds.GrLivArea     -0.27682428         1.1010752      -0.16590728
## trainds.PoolArea      -0.03643264        -0.1659073       1.03106811
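
The claim that the diagonal of the precision matrix holds the variance inflation factors can be checked directly. The sketch below rebuilds the correlation matrix from the values printed earlier (rounded to eight decimals) and compares the diagonal of its inverse with 1 / (1 - R²), where R² comes from regressing each variable on the other two. The object names cor_mat, precision, and vif_from_r2 are chosen here for illustration.

```r
vars <- c("LotArea", "GrLivArea", "PoolArea")

# Correlation matrix copied from the output above
cor_mat <- matrix(c(1,          0.26311617, 0.07767239,
                    0.26311617, 1,          0.17020534,
                    0.07767239, 0.17020534, 1),
                  nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

precision <- solve(cor_mat)

# VIF_j = 1 / (1 - R_j^2); for standardized variables R_j^2 can be
# computed from the correlation matrix alone
vif_from_r2 <- sapply(seq_along(vars), function(j) {
  r_jo <- cor_mat[j, -j]                      # correlations of j with the others
  r_oo <- cor_mat[-j, -j]                     # correlations among the others
  r2   <- drop(r_jo %*% solve(r_oo) %*% r_jo) # R^2 of j regressed on the others
  1 / (1 - r2)
})

print(round(diag(precision), 4))
print(round(vif_from_r2, 4))
```

All three values are close to 1, which indicates little collinearity among these three variables.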

Multiply the correlation matrix by the precision matrix

round(cortrainmatrix %*% precisionmatrix)
##                   trainds.LotArea trainds.GrLivArea trainds.PoolArea
## trainds.LotArea                 1                 0                0
## trainds.GrLivArea               0                 1                0
## trainds.PoolArea                0                 0                1

Multiply the precision matrix by the correlation matrix

round(precisionmatrix %*% cortrainmatrix)
##                   trainds.LotArea trainds.GrLivArea trainds.PoolArea
## trainds.LotArea                 1                 0                0
## trainds.GrLivArea               0                 1                0
## trainds.PoolArea                0                 0                1
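
Rounding to whole numbers confirms that both products are the identity, but it also hides how small the floating-point error actually is. As a hedged alternative, the sketch below measures the largest deviation from the identity and uses all.equal, which compares within a numerical tolerance. The matrix is rebuilt from the printed correlations so that the block runs on its own.

```r
# Correlation matrix copied from the output above
cor_mat <- matrix(c(1,          0.26311617, 0.07767239,
                    0.26311617, 1,          0.17020534,
                    0.07767239, 0.17020534, 1),
                  nrow = 3, byrow = TRUE)

product <- cor_mat %*% solve(cor_mat)

# Largest absolute deviation from the 3 x 3 identity matrix
max_dev <- max(abs(product - diag(3)))

# Tolerance-based comparison instead of rounding
is_identity <- isTRUE(all.equal(product, diag(3), check.attributes = FALSE))

print(max_dev)
print(is_identity)
```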

LU Decomposition

expand(lu(cortrainmatrix))$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]       [,2]       [,3]      
## [1,] 1.00000000          .          .
## [2,] 0.26311617 1.00000000          .
## [3,] 0.07767239 0.16090817 1.00000000
expand(lu(cortrainmatrix))$U
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]       [,2]       [,3]      
## [1,] 1.00000000 0.26311617 0.07767239
## [2,]          . 0.93076988 0.14976847
## [3,]          .          . 0.96986803
expand(lu(precisionmatrix))$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]        [,2]        [,3]       
## [1,]  1.00000000           .           .
## [2,] -0.25735134  1.00000000           .
## [3,] -0.03386982 -0.17020534  1.00000000
expand(lu(precisionmatrix))$U
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]        [,2]        [,3]       
## [1,]  1.07566675 -0.27682428 -0.03643264
## [2,]           .  1.02983415 -0.17528327
## [3,]           .           .  1.00000000
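
To show what the decomposition is doing, here is a minimal base R sketch of Doolittle LU factorization without pivoting, applied to the correlation matrix printed earlier. Skipping pivoting is an assumption that works for this matrix because every pivot is nonzero; a general-purpose routine may permute rows. The function name lu_doolittle is hypothetical and is not part of the Matrix package.

```r
# Hypothetical helper: Doolittle LU factorization (unit lower-triangular L),
# without row pivoting.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- matrix(0, n, n)
  for (i in seq_len(n)) {
    k <- seq_len(i - 1)                       # indices already eliminated
    for (j in i:n) {                          # row i of U
      U[i, j] <- A[i, j] - sum(L[i, k] * U[k, j])
    }
    if (i < n) {
      for (j in (i + 1):n) {                  # column i of L
        L[j, i] <- (A[j, i] - sum(L[j, k] * U[k, i])) / U[i, i]
      }
    }
  }
  list(L = L, U = U)
}

# Correlation matrix copied from the output above
A <- matrix(c(1,          0.26311617, 0.07767239,
              0.26311617, 1,          0.17020534,
              0.07767239, 0.17020534, 1),
            nrow = 3, byrow = TRUE)

f <- lu_doolittle(A)
print(round(f$L, 6))
print(round(f$U, 6))
print(isTRUE(all.equal(f$L %*% f$U, A)))      # L %*% U should rebuild A
```

The factors can be compared entry by entry with the L and U printed above for the correlation matrix.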

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Run fitdistr to fit an exponential probability density function

opt_lambda <- fitdistr(trainds$TotalBsmtSF,"exponential")
opt_lambda$estimate
##         rate 
## 0.0009456896
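
For the exponential distribution, the maximum-likelihood estimate of the rate has a closed form: λ = 1 / mean. The sketch below checks this on synthetic data (not the Kaggle data) by maximizing the log-likelihood numerically with base R's optimize. The seed, sample size, and true rate are arbitrary assumptions.

```r
set.seed(605)
x <- rexp(500, rate = 0.001)        # synthetic stand-in for a right-skewed variable

# Exponential log-likelihood as a function of the rate
loglik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))

# Numerical maximum over a plausible range of rates
rate_numeric <- optimize(loglik, interval = c(1e-5, 0.01),
                         maximum = TRUE, tol = 1e-10)$maximum

# Closed-form maximum-likelihood estimate
rate_closed <- 1 / mean(x)

print(c(numeric = rate_numeric, closed_form = rate_closed))
```

Applied to the mean of TotalBsmtSF reported in the summary table (1057.43), 1 / 1057.43 ≈ 0.000946, which agrees with the fitted rate above.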

Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable.

#draw the sample once so the histogram and its x-axis limit use the same values
exp_sample <- rexp(1000,opt_lambda$estimate)
hist(exp_sample,breaks = 200,main = "Fitted Exponential PDF",xlim = c(1,quantile(exp_sample,0.99)))

hist(trainds$TotalBsmtSF,breaks = 400,main = "Observed Basement Area Size",xlim = c(1,quantile(trainds$TotalBsmtSF,0.99)))

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF)

#5th percentile
qexp(0.05,rate = opt_lambda$estimate,lower.tail = TRUE,log.p = FALSE)
## [1] 54.23904
#95th percentile
qexp(0.95,rate = opt_lambda$estimate,lower.tail = TRUE,log.p = FALSE)
## [1] 3167.776
#95th percentile of a normal distribution with the empirical mean and sd (a one-sided bound, not a two-sided 95% confidence interval)
Bsmt_mean <- mean(trainds$TotalBsmtSF)
Bsmt_sd <- sd(trainds$TotalBsmtSF)
qnorm(0.95,Bsmt_mean,Bsmt_sd)
## [1] 1779.035
#empirical 5th and 95th percentile of the data
quantile(trainds$TotalBsmtSF,c(0.05,0.95))
##     5%    95% 
##  519.3 1753.0
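
The exponential percentiles come from inverting the CDF F(x) = 1 - exp(-λx), which gives x_p = -log(1 - p) / λ. The sketch below applies that formula with the fitted rate printed earlier and compares it with qexp. The helper name exp_quantile is made up for illustration.

```r
# Fitted rate copied from the fitdistr output above
rate <- 0.0009456896

# Hypothetical helper: quantile function from inverting F(x) = 1 - exp(-rate * x)
exp_quantile <- function(p, rate) -log(1 - p) / rate

q05 <- exp_quantile(0.05, rate)
q95 <- exp_quantile(0.95, rate)

print(c(q05 = q05, q95 = q95))
print(isTRUE(all.equal(c(q05, q95), qexp(c(0.05, 0.95), rate = rate))))
```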

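The exponential model does not fit the data well. Its 5th and 95th percentiles (54.2 and 3167.8) are far from the empirical ones (519.3 and 1753.0): the fitted distribution puts too much mass near zero and in the far right tail, whereas the observed basement areas are concentrated around the median of about 990. The normal-based 95th percentile (1779.0) is much closer to the empirical value.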
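
The normal-theory step in the code above returns a single quantile rather than an interval. One common reading of "a 95% confidence interval assuming normality" is an interval for the mean, and the sketch below computes that from the summary statistics reported for TotalBsmtSF in the describe output (n, mean, and sd rounded as printed). Treating the request as an interval for the mean is an assumption about the assignment's intent.

```r
# Summary statistics copied from the psych::describe output for TotalBsmtSF
n     <- 1460
x_bar <- 1057.43
s     <- 438.71

se     <- s / sqrt(n)               # standard error of the mean
z_crit <- qnorm(0.975)              # two-sided 95% critical value
ci_mean <- x_bar + c(-1, 1) * z_crit * se

print(round(se, 2))
print(round(ci_mean, 1))
```

With the raw data, t.test(trainds$TotalBsmtSF)$conf.int gives the corresponding t-based interval, which is nearly identical at this sample size.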

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

#select all the quantitative variables and eliminate the ones with low correlations
quantitative <- data.frame(trainds$OverallQual,trainds$YearBuilt,trainds$YearRemodAdd,trainds$MasVnrArea,trainds$BsmtFinSF1,trainds$TotalBsmtSF,trainds$X1stFlrSF,trainds$X2ndFlrSF,trainds$GrLivArea,trainds$FullBath,trainds$TotRmsAbvGrd,trainds$Fireplaces,trainds$GarageCars,trainds$GarageArea,trainds$WoodDeckSF,trainds$OpenPorchSF,trainds$SalePrice) 

#create a linear regression model
m1 <- lm(trainds.SalePrice ~.,data = quantitative)
summary(m1)
## 
## Call:
## lm(formula = trainds.SalePrice ~ ., data = quantitative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -512233  -17548   -1737   14681  283280 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.094e+06  1.268e+05  -8.627  < 2e-16 ***
## trainds.OverallQual   1.856e+04  1.174e+03  15.807  < 2e-16 ***
## trainds.YearBuilt     1.638e+02  4.978e+01   3.290 0.001028 ** 
## trainds.YearRemodAdd  3.564e+02  6.208e+01   5.741 1.15e-08 ***
## trainds.MasVnrArea    2.881e+01  6.159e+00   4.678 3.17e-06 ***
## trainds.BsmtFinSF1    1.725e+01  2.596e+00   6.646 4.26e-11 ***
## trainds.TotalBsmtSF   1.165e+01  4.298e+00   2.711 0.006796 ** 
## trainds.X1stFlrSF     2.618e+01  2.082e+01   1.257 0.208871    
## trainds.X2ndFlrSF     1.753e+01  2.048e+01   0.856 0.392000    
## trainds.GrLivArea     2.135e+01  2.035e+01   1.049 0.294370    
## trainds.FullBath     -1.489e+03  2.630e+03  -0.566 0.571228    
## trainds.TotRmsAbvGrd  1.688e+03  1.089e+03   1.550 0.121402    
## trainds.Fireplaces    7.888e+03  1.783e+03   4.423 1.05e-05 ***
## trainds.GarageCars    1.011e+04  2.960e+03   3.414 0.000659 ***
## trainds.GarageArea    1.040e+01  1.005e+01   1.035 0.301006    
## trainds.WoodDeckSF    3.068e+01  8.129e+00   3.774 0.000167 ***
## trainds.OpenPorchSF   7.271e+00  1.572e+01   0.462 0.643861    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36380 on 1435 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7918, Adjusted R-squared:  0.7894 
## F-statistic:   341 on 16 and 1435 DF,  p-value: < 2.2e-16
#eliminate variables based on significance level
quantitative2 <- data.frame(trainds$OverallQual,trainds$YearRemodAdd,trainds$MasVnrArea,trainds$BsmtFinSF1,trainds$TotalBsmtSF,trainds$Fireplaces,trainds$GarageCars,trainds$WoodDeckSF,trainds$SalePrice)
colnames(quantitative2) <- c("OverallQual","YearRemodAdd","MasVnrArea","BsmtFinSF1","TotalBsmtSF","Fireplaces","GarageCars","WoodDeckSF","SalePrice")

#create a linear regression model
m2 <- lm(SalePrice ~.,data = quantitative2)
summary(m2)
## 
## Call:
## lm(formula = SalePrice ~ ., data = quantitative2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -407840  -21443   -2760   16410  363961 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.307e+05  1.210e+05  -6.867 9.70e-12 ***
## OverallQual   2.449e+04  1.183e+03  20.706  < 2e-16 ***
## YearRemodAdd  3.925e+02  6.256e+01   6.273 4.66e-10 ***
## MasVnrArea    4.651e+01  6.602e+00   7.045 2.85e-12 ***
## BsmtFinSF1    1.482e+01  2.752e+00   5.383 8.52e-08 ***
## TotalBsmtSF   2.504e+01  3.290e+00   7.611 4.89e-14 ***
## Fireplaces    1.551e+04  1.849e+03   8.389  < 2e-16 ***
## GarageCars    1.794e+04  1.820e+03   9.855  < 2e-16 ***
## WoodDeckSF    4.464e+01  8.848e+00   5.045 5.12e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39960 on 1443 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7474, Adjusted R-squared:  0.746 
## F-statistic: 533.6 on 8 and 1443 DF,  p-value: < 2.2e-16
hist(m2$residuals,breaks = 200)

Nearly Normal Residuals

The residuals are roughly bell-shaped, but there are some outliers and the Q-Q plot shows signs of left skew. The assumption of nearly normal residuals does not seem to have been met.

qqnorm(m2$residuals)
qqline(m2$residuals)

Independence

The independent variables show some positive correlation with each other.

data <- dplyr::select( trainds, GrLivArea, OverallQual, TotalBsmtSF, PoolArea) 
m<-cor(data)
corrplot.mixed(m, lower.col = "black", number.cex = .6)

Check the performance using test data.

testds <- read.csv(("https://raw.githubusercontent.com/jtul333/Data605/main/test.csv"))
temp1ds<-data.matrix(testds)
temp1ds<-data.frame(temp1ds)
pred <- predict(m2,temp1ds)

#Build the Kaggle submission; negative and missing predictions are set to 0
kaggle <- data.frame( Id = temp1ds[,"Id"],  SalePrice =pred)
kaggle[kaggle<0] <- 0
kaggle <- replace(kaggle,is.na(kaggle),0)

submissionds = cbind(kaggle$Id, kaggle$SalePrice)
colnames(submissionds) = c("Id", "SalePrice")
submissionds = as.data.frame(submissionds)
head(submissionds, 5)
##     Id SalePrice
## 1 1461  114540.8
## 2 1462  172103.9
## 3 1463  171662.5
## 4 1464  200837.6
## 5 1465  218790.4
write.csv(submissionds, file = "Kaggletest.csv", quote = FALSE, row.names = FALSE)
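
Setting missing predictions to 0 produces sale prices far from any realistic value, which can weigh heavily on the score. One hedged alternative is to fill missing predictor values in the test set with the training medians before calling predict, so that every row gets a model-based prediction. The sketch below uses small synthetic data frames (not the Kaggle files); the column names mirror two model variables, and the coefficients and seed are arbitrary assumptions.

```r
set.seed(605)

# Synthetic training data standing in for the real training set
train <- data.frame(GarageCars  = sample(0:3, 50, replace = TRUE),
                    TotalBsmtSF = runif(50, 0, 2000))
train$SalePrice <- 50000 + 20000 * train$GarageCars +
  40 * train$TotalBsmtSF + rnorm(50, sd = 5000)

# Synthetic test data with missing predictor values
test <- data.frame(GarageCars  = c(2, NA, 1),
                   TotalBsmtSF = c(900, 1200, NA))

fit <- lm(SalePrice ~ ., data = train)

# Without imputation, rows with a missing predictor get an NA prediction
pred_raw <- predict(fit, test)

# Fill each missing predictor with the median of that column in the training data
test_imputed <- test
for (v in setdiff(names(train), "SalePrice")) {
  missing_rows <- is.na(test_imputed[[v]])
  test_imputed[[v]][missing_rows] <- median(train[[v]], na.rm = TRUE)
}

pred_imputed <- predict(fit, test_imputed)
print(pred_raw)
print(pred_imputed)
```

Median imputation is only one simple choice; whether it improves the score on the real data would need to be checked by resubmitting.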

Kaggle Score

User: https://www.kaggle.com/jayaveluri

Score: 1.45878