Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
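#packages used in this analysis (psych and dplyr are called with ::)
library(knitr); library(kableExtra); library(corrplot); library(Matrix); library(MASS)
trainds <- read.csv("https://raw.githubusercontent.com/jtul333/Data605/main/train.csv", header = TRUE, stringsAsFactors = FALSE)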
psych::describe(trainds)
## vars n mean sd median trimmed mad min
## Id 1 1460 730.50 421.61 730.5 730.50 541.15 1
## MSSubClass 2 1460 56.90 42.30 50.0 49.15 44.48 20
## MSZoning* 3 1460 4.03 0.63 4.0 4.06 0.00 1
## LotFrontage 4 1201 70.05 24.28 69.0 68.94 16.31 21
## LotArea 5 1460 10516.83 9981.26 9478.5 9563.28 2962.23 1300
## Street* 6 1460 2.00 0.06 2.0 2.00 0.00 1
## Alley* 7 91 1.45 0.50 1.0 1.44 0.00 1
## LotShape* 8 1460 2.94 1.41 4.0 3.05 0.00 1
## LandContour* 9 1460 3.78 0.71 4.0 4.00 0.00 1
## Utilities* 10 1460 1.00 0.03 1.0 1.00 0.00 1
## LotConfig* 11 1460 4.02 1.62 5.0 4.27 0.00 1
## LandSlope* 12 1460 1.06 0.28 1.0 1.00 0.00 1
## Neighborhood* 13 1460 13.15 5.89 13.0 13.11 7.41 1
## Condition1* 14 1460 3.03 0.87 3.0 3.00 0.00 1
## Condition2* 15 1460 3.01 0.26 3.0 3.00 0.00 1
## BldgType* 16 1460 1.49 1.20 1.0 1.14 0.00 1
## HouseStyle* 17 1460 4.04 1.91 3.0 4.03 1.48 1
## OverallQual 18 1460 6.10 1.38 6.0 6.08 1.48 1
## OverallCond 19 1460 5.58 1.11 5.0 5.48 0.00 1
## YearBuilt 20 1460 1971.27 30.20 1973.0 1974.13 37.06 1872
## YearRemodAdd 21 1460 1984.87 20.65 1994.0 1986.37 19.27 1950
## RoofStyle* 22 1460 2.41 0.83 2.0 2.26 0.00 1
## RoofMatl* 23 1460 2.08 0.60 2.0 2.00 0.00 1
## Exterior1st* 24 1460 10.62 3.20 13.0 10.93 1.48 1
## Exterior2nd* 25 1460 11.34 3.54 14.0 11.65 2.97 1
## MasVnrType* 26 1452 2.76 0.62 3.0 2.73 0.00 1
## MasVnrArea 27 1452 103.69 181.07 0.0 63.15 0.00 0
## ExterQual* 28 1460 3.54 0.69 4.0 3.65 0.00 1
## ExterCond* 29 1460 4.73 0.73 5.0 4.95 0.00 1
## Foundation* 30 1460 2.40 0.72 2.0 2.46 1.48 1
## BsmtQual* 31 1423 3.26 0.87 3.0 3.43 1.48 1
## BsmtCond* 32 1423 3.81 0.66 4.0 4.00 0.00 1
## BsmtExposure* 33 1422 3.27 1.15 4.0 3.46 0.00 1
## BsmtFinType1* 34 1423 3.73 1.83 3.0 3.79 2.97 1
## BsmtFinSF1 35 1460 443.64 456.10 383.5 386.08 568.58 0
## BsmtFinType2* 36 1422 5.71 0.94 6.0 5.98 0.00 1
## BsmtFinSF2 37 1460 46.55 161.32 0.0 1.38 0.00 0
## BsmtUnfSF 38 1460 567.24 441.87 477.5 519.29 426.99 0
## TotalBsmtSF 39 1460 1057.43 438.71 991.5 1036.70 347.67 0
## Heating* 40 1460 2.04 0.30 2.0 2.00 0.00 1
## HeatingQC* 41 1460 2.54 1.74 1.0 2.42 0.00 1
## CentralAir* 42 1460 1.93 0.25 2.0 2.00 0.00 1
## Electrical* 43 1459 4.68 1.05 5.0 5.00 0.00 1
## X1stFlrSF 44 1460 1162.63 386.59 1087.0 1129.99 347.67 334
## X2ndFlrSF 45 1460 346.99 436.53 0.0 285.36 0.00 0
## LowQualFinSF 46 1460 5.84 48.62 0.0 0.00 0.00 0
## GrLivArea 47 1460 1515.46 525.48 1464.0 1467.67 483.33 334
## BsmtFullBath 48 1460 0.43 0.52 0.0 0.39 0.00 0
## BsmtHalfBath 49 1460 0.06 0.24 0.0 0.00 0.00 0
## FullBath 50 1460 1.57 0.55 2.0 1.56 0.00 0
## HalfBath 51 1460 0.38 0.50 0.0 0.34 0.00 0
## BedroomAbvGr 52 1460 2.87 0.82 3.0 2.85 0.00 0
## KitchenAbvGr 53 1460 1.05 0.22 1.0 1.00 0.00 0
## KitchenQual* 54 1460 3.34 0.83 4.0 3.50 0.00 1
## TotRmsAbvGrd 55 1460 6.52 1.63 6.0 6.41 1.48 2
## Functional* 56 1460 6.75 0.98 7.0 7.00 0.00 1
## Fireplaces 57 1460 0.61 0.64 1.0 0.53 1.48 0
## FireplaceQu* 58 770 3.73 1.13 3.0 3.80 1.48 1
## GarageType* 59 1379 3.28 1.79 2.0 3.11 0.00 1
## GarageYrBlt 60 1379 1978.51 24.69 1980.0 1981.07 31.13 1900
## GarageFinish* 61 1379 2.18 0.81 2.0 2.23 1.48 1
## GarageCars 62 1460 1.77 0.75 2.0 1.77 0.00 0
## GarageArea 63 1460 472.98 213.80 480.0 469.81 177.91 0
## GarageQual* 64 1379 4.86 0.61 5.0 5.00 0.00 1
## GarageCond* 65 1379 4.90 0.52 5.0 5.00 0.00 1
## PavedDrive* 66 1460 2.86 0.50 3.0 3.00 0.00 1
## WoodDeckSF 67 1460 94.24 125.34 0.0 71.76 0.00 0
## OpenPorchSF 68 1460 46.66 66.26 25.0 33.23 37.06 0
## EnclosedPorch 69 1460 21.95 61.12 0.0 3.87 0.00 0
## X3SsnPorch 70 1460 3.41 29.32 0.0 0.00 0.00 0
## ScreenPorch 71 1460 15.06 55.76 0.0 0.00 0.00 0
## PoolArea 72 1460 2.76 40.18 0.0 0.00 0.00 0
## PoolQC* 73 7 2.14 0.90 2.0 2.14 1.48 1
## Fence* 74 281 2.43 0.86 3.0 2.48 0.00 1
## MiscFeature* 75 54 2.91 0.45 3.0 3.00 0.00 1
## MiscVal 76 1460 43.49 496.12 0.0 0.00 0.00 0
## MoSold 77 1460 6.32 2.70 6.0 6.25 2.97 1
## YrSold 78 1460 2007.82 1.33 2008.0 2007.77 1.48 2006
## SaleType* 79 1460 8.51 1.56 9.0 8.92 0.00 1
## SaleCondition* 80 1460 4.77 1.10 5.0 5.00 0.00 1
## SalePrice 81 1460 180921.20 79442.50 163000.0 170783.29 56338.80 34900
## max range skew kurtosis se
## Id 1460 1459 0.00 -1.20 11.03
## MSSubClass 190 170 1.40 1.56 1.11
## MSZoning* 5 4 -1.73 6.25 0.02
## LotFrontage 313 292 2.16 17.34 0.70
## LotArea 215245 213945 12.18 202.26 261.22
## Street* 2 1 -15.49 238.01 0.00
## Alley* 2 1 0.20 -1.98 0.05
## LotShape* 4 3 -0.61 -1.60 0.04
## LandContour* 4 3 -3.16 8.65 0.02
## Utilities* 2 1 38.13 1453.00 0.00
## LotConfig* 5 4 -1.13 -0.59 0.04
## LandSlope* 3 2 4.80 24.47 0.01
## Neighborhood* 25 24 0.02 -1.06 0.15
## Condition1* 9 8 3.01 16.34 0.02
## Condition2* 8 7 13.14 247.54 0.01
## BldgType* 5 4 2.24 3.41 0.03
## HouseStyle* 8 7 0.31 -0.96 0.05
## OverallQual 10 9 0.22 0.09 0.04
## OverallCond 9 8 0.69 1.09 0.03
## YearBuilt 2010 138 -0.61 -0.45 0.79
## YearRemodAdd 2010 60 -0.50 -1.27 0.54
## RoofStyle* 6 5 1.47 0.61 0.02
## RoofMatl* 8 7 8.09 66.28 0.02
## Exterior1st* 15 14 -0.72 -0.37 0.08
## Exterior2nd* 16 15 -0.69 -0.52 0.09
## MasVnrType* 4 3 -0.07 -0.13 0.02
## MasVnrArea 1600 1600 2.66 10.03 4.75
## ExterQual* 4 3 -1.83 3.86 0.02
## ExterCond* 5 4 -2.56 5.29 0.02
## Foundation* 6 5 0.09 1.02 0.02
## BsmtQual* 4 3 -1.31 1.27 0.02
## BsmtCond* 4 3 -3.39 10.14 0.02
## BsmtExposure* 4 3 -1.15 -0.39 0.03
## BsmtFinType1* 6 5 -0.02 -1.39 0.05
## BsmtFinSF1 5644 5644 1.68 11.06 11.94
## BsmtFinType2* 6 5 -3.56 12.32 0.02
## BsmtFinSF2 1474 1474 4.25 20.01 4.22
## BsmtUnfSF 2336 2336 0.92 0.46 11.56
## TotalBsmtSF 6110 6110 1.52 13.18 11.48
## Heating* 6 5 9.83 110.98 0.01
## HeatingQC* 5 4 0.48 -1.51 0.05
## CentralAir* 2 1 -3.52 10.42 0.01
## Electrical* 5 4 -3.06 7.49 0.03
## X1stFlrSF 4692 4358 1.37 5.71 10.12
## X2ndFlrSF 2065 2065 0.81 -0.56 11.42
## LowQualFinSF 572 572 8.99 82.83 1.27
## GrLivArea 5642 5308 1.36 4.86 13.75
## BsmtFullBath 3 3 0.59 -0.84 0.01
## BsmtHalfBath 2 2 4.09 16.31 0.01
## FullBath 3 3 0.04 -0.86 0.01
## HalfBath 2 2 0.67 -1.08 0.01
## BedroomAbvGr 8 8 0.21 2.21 0.02
## KitchenAbvGr 3 3 4.48 21.42 0.01
## KitchenQual* 4 3 -1.42 1.72 0.02
## TotRmsAbvGrd 14 12 0.67 0.87 0.04
## Functional* 7 6 -4.08 16.37 0.03
## Fireplaces 3 3 0.65 -0.22 0.02
## FireplaceQu* 5 4 -0.16 -0.98 0.04
## GarageType* 6 5 0.76 -1.30 0.05
## GarageYrBlt 2010 110 -0.65 -0.42 0.66
## GarageFinish* 3 2 -0.35 -1.41 0.02
## GarageCars 4 4 -0.34 0.21 0.02
## GarageArea 1418 1418 0.18 0.90 5.60
## GarageQual* 5 4 -4.43 18.25 0.02
## GarageCond* 5 4 -5.28 26.77 0.01
## PavedDrive* 3 2 -3.30 9.22 0.01
## WoodDeckSF 857 857 1.54 2.97 3.28
## OpenPorchSF 547 547 2.36 8.44 1.73
## EnclosedPorch 552 552 3.08 10.37 1.60
## X3SsnPorch 508 508 10.28 123.06 0.77
## ScreenPorch 480 480 4.11 18.34 1.46
## PoolArea 738 738 14.80 222.19 1.05
## PoolQC* 3 2 -0.22 -1.90 0.34
## Fence* 4 3 -0.57 -0.88 0.05
## MiscFeature* 4 3 -2.93 10.71 0.06
## MiscVal 15500 15500 24.43 697.64 12.98
## MoSold 12 11 0.21 -0.41 0.07
## YrSold 2010 4 0.10 -1.19 0.03
## SaleType* 9 8 -3.83 14.57 0.04
## SaleCondition* 6 5 -2.74 6.82 0.03
## SalePrice 755000 720100 1.88 6.50 2079.11
kable(data.frame(head(trainds, n = 10L))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#ea6323") %>%
scroll_box(width = "100%", height = "300px")
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
| 6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
| 7 | 20 | RL | 75 | 10084 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Somerst | Norm | Norm | 1Fam | 1Story | 8 | 5 | 2004 | 2005 | Gable | CompShg | VinylSd | VinylSd | Stone | 186 | Gd | TA | PConc | Ex | TA | Av | GLQ | 1369 | Unf | 0 | 317 | 1686 | GasA | Ex | Y | SBrkr | 1694 | 0 | 0 | 1694 | 1 | 0 | 2 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 2004 | RFn | 2 | 636 | TA | TA | Y | 255 | 57 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 8 | 2007 | WD | Normal | 307000 |
| 8 | 60 | RL | NA | 10382 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NWAmes | PosN | Norm | 1Fam | 2Story | 7 | 6 | 1973 | 1973 | Gable | CompShg | HdBoard | HdBoard | Stone | 240 | TA | TA | CBlock | Gd | TA | Mn | ALQ | 859 | BLQ | 32 | 216 | 1107 | GasA | Ex | Y | SBrkr | 1107 | 983 | 0 | 2090 | 1 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 2 | TA | Attchd | 1973 | RFn | 2 | 484 | TA | TA | Y | 235 | 204 | 228 | 0 | 0 | 0 | NA | NA | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
| 9 | 50 | RM | 51 | 6120 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Artery | Norm | 1Fam | 1.5Fin | 7 | 5 | 1931 | 1950 | Gable | CompShg | BrkFace | Wd Shng | None | 0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 952 | 952 | GasA | Gd | Y | FuseF | 1022 | 752 | 0 | 1774 | 0 | 0 | 2 | 0 | 2 | 2 | TA | 8 | Min1 | 2 | TA | Detchd | 1931 | Unf | 2 | 468 | Fa | TA | Y | 90 | 0 | 205 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2008 | WD | Abnorml | 129900 |
| 10 | 190 | RL | 50 | 7420 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | BrkSide | Artery | Artery | 2fmCon | 1.5Unf | 5 | 6 | 1939 | 1950 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | BrkTil | TA | TA | No | GLQ | 851 | Unf | 0 | 140 | 991 | GasA | Ex | Y | SBrkr | 1077 | 0 | 0 | 1077 | 1 | 0 | 1 | 0 | 2 | 2 | TA | 5 | Typ | 2 | TA | Attchd | 1939 | RFn | 1 | 205 | Gd | TA | Y | 0 | 4 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 1 | 2008 | WD | Normal | 118000 |
#Summary
summary(trainds$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
#Histogram
hist(trainds$SalePrice, main="Sale Price")
qqnorm(trainds$SalePrice)
qqline(trainds$SalePrice)
### Scatterplot Matrix
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
pairs(~SalePrice+LotArea+GrLivArea+PoolArea,data=trainds, main="Scatterplot Matrix")
### Correlation Matrix
Derive a correlation matrix for any three quantitative variables in the dataset.
sub_trainds <- data.frame(trainds$LotArea,trainds$GrLivArea,trainds$PoolArea)
#Correlation
cortrainmatrix <- cor(sub_trainds)
cortrainmatrix
## trainds.LotArea trainds.GrLivArea trainds.PoolArea
## trainds.LotArea 1.00000000 0.2631162 0.07767239
## trainds.GrLivArea 0.26311617 1.0000000 0.17020534
## trainds.PoolArea 0.07767239 0.1702053 1.00000000
#Correlation plot
corrplot(cortrainmatrix, method="circle")
### Hypothesis Tests
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
#LotArea vs. GrLivArea
cor.test(trainds$LotArea,trainds$GrLivArea,method = "pearson",conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: trainds$LotArea and trainds$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
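The 80% interval reported by `cor.test` can be reproduced by hand with the Fisher z-transformation, which is the method documented for Pearson intervals in `cor.test`. The sketch below is illustrative only: it hard-codes the correlation and sample size printed above rather than reading the data, and the variable names are chosen here for the example.

```r
# Values copied from the cor.test() output above (LotArea vs. GrLivArea)
r <- 0.2631162      # sample correlation
n <- 1460           # sample size (df + 2)
conf_level <- 0.80

z    <- atanh(r)                            # Fisher z-transform of r
se   <- 1 / sqrt(n - 3)                     # approximate standard error of z
crit <- qnorm(1 - (1 - conf_level) / 2)     # two-sided critical value

# Build the interval on the z scale, then transform back to the r scale
ci <- tanh(z + c(-1, 1) * crit * se)
ci
```

The result should agree with the interval printed by `cor.test` to about four decimal places; small differences come from the rounded inputs.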
#LotArea vs. PoolArea
cor.test(trainds$LotArea,trainds$PoolArea,method = "pearson",conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: trainds$LotArea and trainds$PoolArea
## t = 2.9748, df = 1458, p-value = 0.00298
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.04422604 0.11094482
## sample estimates:
## cor
## 0.07767239
#GrLivArea vs. PoolArea
cor.test(trainds$GrLivArea,trainds$PoolArea,method = "pearson",conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: trainds$GrLivArea and trainds$PoolArea
## t = 6.5953, df = 1458, p-value = 5.918e-11
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1374287 0.2026096
## sample estimates:
## cor
## 0.1702053
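All three tests have p-values below 0.05 (the largest is 0.00298), and none of the 80% confidence intervals contains 0, so the null hypothesis of zero correlation is rejected for each pair. The correlations are all positive but weak, ranging from about 0.08 (LotArea and PoolArea) to about 0.26 (LotArea and GrLivArea). Familywise error is the probability of making at least one Type I error when performing multiple hypothesis tests, and it grows with the number of tests. Because three tests are run here, each at an 80% confidence level, it is a legitimate concern in principle: if the tests were independent, the chance of at least one false rejection would be 1 - 0.8^3, or about 0.49. This can be controlled by running each test at a higher confidence level, for example with a Bonferroni correction. In this case the p-values are small enough that all three results would remain significant at a familywise level of 0.05 after such a correction (0.05 / 3 ≈ 0.0167).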
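To put numbers on the familywise error question, the following sketch computes the familywise error rate for three tests and applies a Bonferroni adjustment to the p-values reported above. It assumes the three tests are independent, which is a simplification since they share variables, and it uses 2.2e-16 as a stand-in for the first p-value, which R reports only as an upper bound.

```r
# p-values copied from the three cor.test() outputs above
# (the first is reported as "< 2.2e-16", so the bound is used here)
p_values <- c(LotArea_GrLivArea  = 2.2e-16,
              LotArea_PoolArea   = 0.00298,
              GrLivArea_PoolArea = 5.918e-11)

m     <- length(p_values)   # number of tests in the family
alpha <- 0.20               # per-test level implied by an 80% interval

# Probability of at least one Type I error, assuming independent tests
fwer <- 1 - (1 - alpha)^m
fwer

# Bonferroni adjustment: multiply each p-value by m (capped at 1)
p_bonf <- p.adjust(p_values, method = "bonferroni")
p_bonf
```

Under these assumptions the unadjusted familywise rate is close to one half, while the adjusted p-values stay far below 0.05.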
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the
precisionmatrix <- solve(cortrainmatrix)
precisionmatrix
## trainds.LotArea trainds.GrLivArea trainds.PoolArea
## trainds.LotArea 1.07566675 -0.2768243 -0.03643264
## trainds.GrLivArea -0.27682428 1.1010752 -0.16590728
## trainds.PoolArea -0.03643264 -0.1659073 1.03106811
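The claim that the diagonal of the precision matrix holds the variance inflation factors can be checked directly. The sketch below is a minimal illustration, not part of the original analysis: it re-enters the correlation matrix printed earlier and computes each VIF as 1 / (1 - R²), where R² comes from regressing one standardized variable on the other two.

```r
vars <- c("LotArea", "GrLivArea", "PoolArea")

# Correlation matrix copied from the output above
R <- matrix(c(1.00000000, 0.26311617, 0.07767239,
              0.26311617, 1.00000000, 0.17020534,
              0.07767239, 0.17020534, 1.00000000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# VIF_j = 1 / (1 - R2_j), with R2_j obtained from the correlations alone
vif_from_r2 <- sapply(seq_len(nrow(R)), function(j) {
  r_j <- R[-j, j, drop = FALSE]                       # correlations of j with the others
  r2  <- drop(t(r_j) %*% solve(R[-j, -j]) %*% r_j)    # R-squared of j on the others
  1 / (1 - r2)
})
names(vif_from_r2) <- vars

# Compare with the diagonal of the inverted correlation matrix
rbind(vif_from_r2, precision_diag = diag(solve(R)))
```

All three values are close to 1, which indicates very little multicollinearity among these variables.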
round(cortrainmatrix %*% precisionmatrix)
## trainds.LotArea trainds.GrLivArea trainds.PoolArea
## trainds.LotArea 1 0 0
## trainds.GrLivArea 0 1 0
## trainds.PoolArea 0 0 1
round(precisionmatrix %*% cortrainmatrix)
## trainds.LotArea trainds.GrLivArea trainds.PoolArea
## trainds.LotArea 1 0 0
## trainds.GrLivArea 0 1 0
## trainds.PoolArea 0 0 1
expand(lu(cortrainmatrix))$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.00000000 . .
## [2,] 0.26311617 1.00000000 .
## [3,] 0.07767239 0.16090817 1.00000000
expand(lu(cortrainmatrix))$U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.00000000 0.26311617 0.07767239
## [2,] . 0.93076988 0.14976847
## [3,] . . 0.96986803
expand(lu(precisionmatrix))$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.00000000 . .
## [2,] -0.25735134 1.00000000 .
## [3,] -0.03386982 -0.17020534 1.00000000
expand(lu(precisionmatrix))$U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.07566675 -0.27682428 -0.03643264
## [2,] . 1.02983415 -0.17528327
## [3,] . . 1.00000000
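For intuition about what `lu` returns, here is a hand-rolled Doolittle factorization in base R. It is a teaching sketch under a stated assumption: no row pivoting is performed, which is safe for this correlation matrix because its leading diagonal entries are the largest in their columns, but it is not a general replacement for `Matrix::lu`. The function name `lu_doolittle` is made up for this example.

```r
# Doolittle LU without pivoting: A = L %*% U, with ones on the diagonal of L
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # multiplier that eliminates U[i, k]
      U[i, ]  <- U[i, ] - L[i, k] * U[k, ]  # row operation
    }
  }
  list(L = L, U = U)
}

# Correlation matrix copied from the output above
R <- matrix(c(1.00000000, 0.26311617, 0.07767239,
              0.26311617, 1.00000000, 0.17020534,
              0.07767239, 0.17020534, 1.00000000),
            nrow = 3, byrow = TRUE)

fac <- lu_doolittle(R)
fac$L
fac$U

# Multiplying the factors back together recovers the original matrix
max(abs(fac$L %*% fac$U - R))
```

The factors should match the `L` and `U` printed for the correlation matrix above, up to rounding of the inputs.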
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
opt_lambda <- fitdistr(trainds$TotalBsmtSF,"exponential")
opt_lambda$estimate
## rate
## 0.0009456896
#draw one sample so the histogram and its x-axis limit use the same values
exp_sample <- rexp(1000,opt_lambda$estimate)
hist(exp_sample,breaks = 200,main = "Fitted Exponential PDF",xlim = c(1,quantile(exp_sample,0.99)))
hist(trainds$TotalBsmtSF,breaks = 400,main = "Observed Basement Area Size",xlim = c(1,quantile(trainds$TotalBsmtSF,0.99)))
#5th percentile
qexp(0.05,rate = opt_lambda$estimate,lower.tail = TRUE,log.p = FALSE)
## [1] 54.23904
#95th percentile
qexp(0.95,rate = opt_lambda$estimate,lower.tail = TRUE,log.p = FALSE)
## [1] 3167.776
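The percentiles returned by `qexp` follow from inverting the exponential CDF, F(x) = 1 - exp(-λx), which gives x = -log(1 - p) / λ. The short sketch below checks this with the fitted rate printed above; the helper name `exp_quantile` is introduced here for illustration only.

```r
# Fitted rate copied from the fitdistr() output above
lambda <- 0.0009456896

# Inverse of the exponential CDF: solve p = 1 - exp(-lambda * x) for x
exp_quantile <- function(p, rate) -log(1 - p) / rate

q <- exp_quantile(c(0.05, 0.95), lambda)
q

# Same values from the built-in quantile function
qexp(c(0.05, 0.95), rate = lambda)
```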
#95th percentile of a normal distribution with the empirical mean and sd (an upper bound only, not a two-sided 95% confidence interval)
Bsmt_mean <- mean(trainds$TotalBsmtSF)
Bsmt_sd <- sd(trainds$TotalBsmtSF)
qnorm(0.95,Bsmt_mean,Bsmt_sd)
## [1] 1779.035
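The `qnorm(0.95, ...)` call returns a single normal percentile. If the task is read as asking for a two-sided 95% confidence interval for the mean, one way to compute it under normality is sketched below. This is an interpretation, not the original author's calculation, and it uses the rounded summary statistics for TotalBsmtSF from the `describe` output instead of the raw data.

```r
# Summary statistics for TotalBsmtSF copied from psych::describe() above
n         <- 1460
bsmt_mean <- 1057.43
bsmt_sd   <- 438.71

# Standard error of the mean and the two-sided 95% normal critical value
se   <- bsmt_sd / sqrt(n)
crit <- qnorm(0.975)

# 95% confidence interval for the mean, assuming normality
ci_mean <- bsmt_mean + c(-1, 1) * crit * se
ci_mean
```

The standard error computed here matches the `se` column reported by `describe` (11.48), and the interval is roughly 1035 to 1080 square feet.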
#empirical 5th and 95th percentile of the data
quantile(trainds$TotalBsmtSF,c(0.05,0.95))
## 5% 95%
## 519.3 1753.0
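The exponential distribution does not look like a good model for this variable. Its fitted 5th and 95th percentiles (about 54 and 3,168) are far from the empirical ones (519.3 and 1,753.0), so the fitted range does not match the actual data: the exponential puts too much mass near zero and in the far right tail. The normal-based 95th percentile (about 1,779) is much closer to the empirical 95th percentile.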
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
#select all the quantitative variables and eliminate the ones with low correlations
quantitative <- data.frame(trainds$OverallQual,trainds$YearBuilt,trainds$YearRemodAdd,trainds$MasVnrArea,trainds$BsmtFinSF1,trainds$TotalBsmtSF,trainds$X1stFlrSF,trainds$X2ndFlrSF,trainds$GrLivArea,trainds$FullBath,trainds$TotRmsAbvGrd,trainds$Fireplaces,trainds$GarageCars,trainds$GarageArea,trainds$WoodDeckSF,trainds$OpenPorchSF,trainds$SalePrice)
#create a linear regression model
m1 <- lm(trainds.SalePrice ~.,data = quantitative)
summary(m1)
##
## Call:
## lm(formula = trainds.SalePrice ~ ., data = quantitative)
##
## Residuals:
## Min 1Q Median 3Q Max
## -512233 -17548 -1737 14681 283280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.094e+06 1.268e+05 -8.627 < 2e-16 ***
## trainds.OverallQual 1.856e+04 1.174e+03 15.807 < 2e-16 ***
## trainds.YearBuilt 1.638e+02 4.978e+01 3.290 0.001028 **
## trainds.YearRemodAdd 3.564e+02 6.208e+01 5.741 1.15e-08 ***
## trainds.MasVnrArea 2.881e+01 6.159e+00 4.678 3.17e-06 ***
## trainds.BsmtFinSF1 1.725e+01 2.596e+00 6.646 4.26e-11 ***
## trainds.TotalBsmtSF 1.165e+01 4.298e+00 2.711 0.006796 **
## trainds.X1stFlrSF 2.618e+01 2.082e+01 1.257 0.208871
## trainds.X2ndFlrSF 1.753e+01 2.048e+01 0.856 0.392000
## trainds.GrLivArea 2.135e+01 2.035e+01 1.049 0.294370
## trainds.FullBath -1.489e+03 2.630e+03 -0.566 0.571228
## trainds.TotRmsAbvGrd 1.688e+03 1.089e+03 1.550 0.121402
## trainds.Fireplaces 7.888e+03 1.783e+03 4.423 1.05e-05 ***
## trainds.GarageCars 1.011e+04 2.960e+03 3.414 0.000659 ***
## trainds.GarageArea 1.040e+01 1.005e+01 1.035 0.301006
## trainds.WoodDeckSF 3.068e+01 8.129e+00 3.774 0.000167 ***
## trainds.OpenPorchSF 7.271e+00 1.572e+01 0.462 0.643861
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36380 on 1435 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7918, Adjusted R-squared: 0.7894
## F-statistic: 341 on 16 and 1435 DF, p-value: < 2.2e-16
#eliminate variables that were not significant in m1 (YearBuilt is also dropped, although it was significant)
quantitative2 <- data.frame(trainds$OverallQual,trainds$YearRemodAdd,trainds$MasVnrArea,trainds$BsmtFinSF1,trainds$TotalBsmtSF,trainds$Fireplaces,trainds$GarageCars,trainds$WoodDeckSF,trainds$SalePrice)
colnames(quantitative2) <- c("OverallQual","YearRemodAdd","MasVnrArea","BsmtFinSF1","TotalBsmtSF","Fireplaces","GarageCars","WoodDeckSF","SalePrice")
#create a linear regression model
m2 <- lm(SalePrice ~.,data = quantitative2)
summary(m2)
##
## Call:
## lm(formula = SalePrice ~ ., data = quantitative2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -407840 -21443 -2760 16410 363961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.307e+05 1.210e+05 -6.867 9.70e-12 ***
## OverallQual 2.449e+04 1.183e+03 20.706 < 2e-16 ***
## YearRemodAdd 3.925e+02 6.256e+01 6.273 4.66e-10 ***
## MasVnrArea 4.651e+01 6.602e+00 7.045 2.85e-12 ***
## BsmtFinSF1 1.482e+01 2.752e+00 5.383 8.52e-08 ***
## TotalBsmtSF 2.504e+01 3.290e+00 7.611 4.89e-14 ***
## Fireplaces 1.551e+04 1.849e+03 8.389 < 2e-16 ***
## GarageCars 1.794e+04 1.820e+03 9.855 < 2e-16 ***
## WoodDeckSF 4.464e+01 8.848e+00 5.045 5.12e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39960 on 1443 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7474, Adjusted R-squared: 0.746
## F-statistic: 533.6 on 8 and 1443 DF, p-value: < 2.2e-16
hist(m2$residuals,breaks = 200)
### Nearly Normal Residuals
The residuals are roughly bell-shaped, but there are some outliers, and the Q-Q plot shows signs of left skew. The assumption of nearly normal residuals does not seem to have been fully met.
qqnorm(m2$residuals)
qqline(m2$residuals)
### Independence
The independent variables show some positive correlation with each other.
data <- dplyr::select( trainds, GrLivArea, OverallQual, TotalBsmtSF, PoolArea)
m<-cor(data)
corrplot.mixed(m, lower.col = "black", number.cex = .6)
testds <- read.csv(("https://raw.githubusercontent.com/jtul333/Data605/main/test.csv"))
temp1ds<-data.matrix(testds)
temp1ds<-data.frame(temp1ds)
pred <- predict(m2,temp1ds)
#build the Kaggle submission; negative and missing predictions are set to 0
kaggle <- data.frame( Id = temp1ds[,"Id"], SalePrice =pred)
kaggle[kaggle<0] <- 0
kaggle <- replace(kaggle,is.na(kaggle),0)
submissionds = cbind(kaggle$Id, kaggle$SalePrice)
colnames(submissionds) = c("Id", "SalePrice")
submissionds = as.data.frame(submissionds)
head(submissionds, 5)
## Id SalePrice
## 1 1461 114540.8
## 2 1462 172103.9
## 3 1463 171662.5
## 4 1464 200837.6
## 5 1465 218790.4
write.csv(submissionds, file = "Kaggletest.csv", quote = FALSE, row.names = FALSE)
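The submission code sets any missing prediction to 0, which is far from a plausible sale price. One hedged alternative is to fill missing predictor values with the training median before calling `predict`. The sketch below uses a small simulated data set, with made-up column names `x1`, `x2`, and `y`, purely to show the mechanics; it does not reproduce the Kaggle data or score.

```r
set.seed(1)

# Toy training data and model (stand-ins for trainds and m2)
train_toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
train_toy$y <- 3 + 2 * train_toy$x1 - train_toy$x2 + rnorm(50, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = train_toy)

# Toy test data with missing predictor values
test_toy <- data.frame(x1 = c(0.5, NA, -1), x2 = c(NA, 0.2, 1))

# Replace each missing predictor with the median from the training data
test_imputed <- test_toy
for (v in c("x1", "x2")) {
  missing <- is.na(test_imputed[[v]])
  test_imputed[[v]][missing] <- median(train_toy[[v]], na.rm = TRUE)
}

pred_raw     <- predict(fit, test_toy)       # NA wherever a predictor is missing
pred_imputed <- predict(fit, test_imputed)   # a prediction for every row
cbind(pred_raw, pred_imputed)
```

Median imputation is only one simple choice; whether it improves the score on the real test set would need to be checked.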