Using R, generate a random variable \(X\) that has 10,000 random uniform numbers from \(1\) to \(N\), where \(N\) can be any number of your choosing greater than or equal to \(6\). Then generate a random variable \(Y\) that has 10,000 random normal numbers with mean and standard deviation \(\mu = \sigma = (N+1)/2\).
set.seed(100)
# Set N to any number greater than or equal to 6 (in this case, 8).
N <- 8
# Generate a random variable X that has 10,000 random uniform numbers from 1 to N.
X <- runif(10000, 1, N)
# Generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation (N+1)/2.
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
Calculate as a minimum the below probabilities \(a\) through \(c\). Assume the small letter "\(x\)" is estimated as the median of the \(X\) variable, and the small letter "\(y\)" is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.
# Small x is estimated as the median of the X variable.
x <- round(median(X), 2)
x
## [1] 4.48
# Small y is estimated as the 1st quartile of the Y variable.
y <- round(quantile(Y, 0.25), 2)
y
## 25%
## 1.45
Answer: Small x is equal to 4.48 (median of the X variable), small y is equal to 1.45 (1st quartile of the Y variable).
A. \(P(X>x \ | \ X>y)\)
Interpretation:
Probability that X is greater than its median given that X is greater than the first quartile of Y.
\[P(X>x \ | \ X>y) = \frac{P(X>x \ , \ X>y)}{P(X>y)}\]
# Define the events.
event_one <- (X > x)
event_two <- (X > y)
# P(X>x and X>y).
a_and_b <- length(X[event_one & event_two]) / length(X)
# P(X>y).
b <- length(X[event_two]) / length(X)
# P(X>x | X>y).
probability <- a_and_b / b
answer <- round(probability, 2)
answer
## [1] 0.53
Answer: \(P(X > x \ | \ X > y)\) = 0.53
B. \(P(X>x, Y>y)\)
Interpretation:
Probability that \(X\) is greater than \(x\), and \(Y\) is greater than \(y\).
# Define the events.
event_one <- (X > x)
event_two <- (Y > y)
# P(X > x).
X_gt_x <- length(X[event_one]) / length(X)
# P(Y > y).
Y_gt_y <- length(Y[event_two]) / length(Y)
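# X and Y were generated independently, so the joint probability is estimated as the product of the marginals.
probability <- X_gt_x * Y_gt_y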
answer <- round(probability, 2)
answer
## [1] 0.37
Answer: \(P(X>x, Y>y)\) = 0.37
C. \(P(X<x \ | \ X>y)\)
Interpretation:
Probability that \(X\) is less than its median given that it is greater than the first quartile of \(Y\).
# Define the events.
event_one <- (X < x)
event_two <- (X > y)
# P(X < x & X > y).
a_and_b <- length(X[event_one & event_two]) / length(X)
# P(X > y).
b <- length(X[event_two]) / length(X)
probability <- a_and_b / b
answer <- round(probability, 2)
answer
## [1] 0.47
Answer: \(P(X<x \ | \ X>y)\) = 0.47
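The simulated answers to parts A and C can be cross-checked against the theoretical Uniform(1, N) distribution. The following is a minimal sketch, assuming \(N = 8\) and plugging in the rounded estimates \(x = 4.48\) and \(y = 1.45\) reported above; the helper name `p_gt` is illustrative and not part of the original analysis.

```r
# Theoretical cross-check for parts A and C, assuming X ~ Uniform(1, N) with N = 8.
N <- 8
x <- 4.48  # median of X reported above
y <- 1.45  # first quartile of Y reported above

# P(X > t) for a Uniform(1, N) variable.
p_gt <- function(t) 1 - punif(t, min = 1, max = N)

# Because x > y, the event {X > x and X > y} is just {X > x}.
p_a <- p_gt(x) / p_gt(y)               # P(X > x | X > y)
p_c <- (p_gt(y) - p_gt(x)) / p_gt(y)   # P(X < x | X > y)

round(c(A = p_a, C = p_c), 4)
```

Both theoretical values are close to the simulated estimates of 0.53 and 0.47, and they sum to 1 because the two events are complementary once \(X > y\) is given.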
Investigate whether \(P(X>x \ and \ Y>y) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
library(data.table)
# Marginal probability P(X > x).
part_one <- (X > x)
prob_X_gt_x <- sum(part_one) / length(part_one)
# Marginal probability P(Y > y).
part_two <- (Y > y)
prob_Y_gt_y <- sum(part_two) / length(part_two)
results_table <- data.table(
Event = c('(X>x)', '(Y>y)', '(X>x)*(Y>y)', '(X>x and Y>y)'),
Xx = c(prob_X_gt_x, prob_Y_gt_y, prob_X_gt_x * prob_Y_gt_y, prob_X_gt_x * prob_Y_gt_y),
Yy = c(prob_Y_gt_y, prob_X_gt_x, prob_X_gt_x * prob_Y_gt_y, prob_X_gt_x * prob_Y_gt_y)
)
results_table
##            Event       Xx       Yy
## 1:         (X>x) 0.499900 0.749800
## 2:         (Y>y) 0.749800 0.499900
## 3:   (X>x)*(Y>y) 0.374825 0.374825
## 4: (X>x and Y>y) 0.374825 0.374825
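Answer: \(P(X>x \ and \ Y>y)\) and \(P(X>x)P(Y>y)\) agree in the table (both 0.374825). Note, however, that the joint row was filled in with the product of the marginals, so the match holds by construction. Because \(X\) and \(Y\) were generated independently, a directly counted joint proportion is expected to be close to this product, but not necessarily identical to it.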
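To evaluate the joint probability directly rather than through the product of the marginals, the joint proportion can be counted from the simulated data. This is a sketch under the same assumptions as above (seed 100, \(N = 8\)); the object names `joint_table`, `joint`, and `product` are illustrative, and the chi-squared test is one possible way to assess independence, not something the original analysis ran.

```r
# Regenerate the data with the same seed and N used earlier.
set.seed(100)
N <- 8
X <- runif(10000, 1, N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- round(median(X), 2)
y <- round(quantile(Y, 0.25), 2)

# 2 x 2 table of joint proportions, with marginal totals added.
joint_table <- prop.table(table(X_gt_x = X > x, Y_gt_y = Y > y))
addmargins(joint_table)

# Compare the counted joint proportion with the product of the marginals.
joint   <- mean(X > x & Y > y)
product <- mean(X > x) * mean(Y > y)
c(joint = joint, product = product)

# Chi-squared test of independence on the underlying counts.
chisq.test(table(X > x, Y > y))
```

If the two events are independent, `joint` and `product` should differ only by sampling noise, and the chi-squared test should not reject independence.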
Register for Kaggle.com and compete in the House Prices: Advanced Regression Techniques competition.
Competition Objective
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Import the training and test datasets and summarize them.
# Pull in the train and test datasets.
training_dataset <- read.csv('https://raw.githubusercontent.com/stephen-haslett/data605/data605-final-exam/train.csv')
test_dataset <- read.csv('https://raw.githubusercontent.com/stephen-haslett/data605/data605-final-exam/test.csv')
Snapshot of the training dataset.
head(training_dataset, 1)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
Summarize the training dataset.
summary(training_dataset)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour Utilities
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63 AllPub:1459
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50 NoSeWa: 1
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## LotConfig LandSlope Neighborhood Condition1 Condition2
## Corner : 263 Gtl:1382 NAmes :225 Norm :1260 Norm :1445
## CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81 Feedr : 6
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48 Artery : 2
## FR3 : 4 Edwards:100 RRAn : 26 PosN : 2
## Inside :1052 Somerst: 86 PosN : 19 RRNn : 2
## Gilbert: 79 RRAe : 11 PosA : 1
## (Other):707 (Other): 15 (Other): 2
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1Fam :1220 1Story :726 Min. : 1.000 Min. :1.000 Min. :1872
## 2fmCon: 31 2Story :445 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Duplex: 52 1.5Fin :154 Median : 6.000 Median :5.000 Median :1973
## Twnhs : 43 SLvl : 65 Mean : 6.099 Mean :5.575 Mean :1971
## TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## 1.5Unf : 14 Max. :10.000 Max. :9.000 Max. :2010
## (Other): 19
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## Min. :1950 Flat : 13 CompShg:1434 VinylSd:515 VinylSd:504
## 1st Qu.:1967 Gable :1141 Tar&Grv: 11 HdBoard:222 MetalSd:214
## Median :1994 Gambrel: 11 WdShngl: 6 MetalSd:220 HdBoard:207
## Mean :1985 Hip : 286 WdShake: 5 Wd Sdng:206 Wd Sdng:197
## 3rd Qu.:2004 Mansard: 7 ClyTile: 1 Plywood:108 Plywood:142
## Max. :2010 Shed : 2 Membran: 1 CemntBd: 61 CmentBd: 60
## (Other): 2 (Other):128 (Other):136
## MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual
## BrkCmn : 15 Min. : 0.0 Ex: 52 Ex: 3 BrkTil:146 Ex :121
## BrkFace:445 1st Qu.: 0.0 Fa: 14 Fa: 28 CBlock:634 Fa : 35
## None :864 Median : 0.0 Gd:488 Gd: 146 PConc :647 Gd :618
## Stone :128 Mean : 103.7 TA:906 Po: 1 Slab : 24 TA :649
## NA's : 8 3rd Qu.: 166.0 TA:1282 Stone : 6 NA's: 37
## Max. :1600.0 Wood : 3
## NA's :8
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Fa : 45 Av :221 ALQ :220 Min. : 0.0 ALQ : 19
## Gd : 65 Gd :134 BLQ :148 1st Qu.: 0.0 BLQ : 33
## Po : 2 Mn :114 GLQ :418 Median : 383.5 GLQ : 14
## TA :1311 No :953 LwQ : 74 Mean : 443.6 LwQ : 46
## NA's: 37 NA's: 38 Rec :133 3rd Qu.: 712.2 Rec : 54
## Unf :430 Max. :5644.0 Unf :1256
## NA's: 37 NA's: 38
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49
## Median : 0.00 Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 46.55 Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :1474.00 Max. :2336.0 Max. :6110.0 Wall : 4
##
## CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## N: 95 FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## Y:1365 FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Gd:586 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## Maj1: 14 Min. :0.000 Ex : 24 2Types : 6 Min. :1900
## Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870 1st Qu.:1961
## Min1: 31 Median :1.000 Gd :380 Basment: 19 Median :1980
## Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88 Mean :1979
## Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9 3rd Qu.:2002
## Sev : 1 Max. :3.000 NA's:690 Detchd :387 Max. :2010
## Typ :1360 NA's : 81 NA's :81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## Fin :352 Min. :0.000 Min. : 0.0 Ex : 3 Ex : 2
## RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48 Fa : 35
## Unf :605 Median :2.000 Median : 480.0 Gd : 14 Gd : 9
## NA's: 81 Mean :1.767 Mean : 473.0 Po : 3 Po : 7
## 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311 TA :1326
## Max. :4.000 Max. :1418.0 NA's: 81 NA's: 81
##
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Y:1340 Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## Min. : 0.00 Min. : 0.000 Ex : 2 GdPrv: 59 Gar2: 2
## 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2 GdWo : 54 Othr: 2
## Median : 0.00 Median : 0.000 Gd : 3 MnPrv: 157 Shed: 49
## Mean : 15.06 Mean : 2.759 NA's:1453 MnWw : 11 TenC: 1
## 3rd Qu.: 0.00 3rd Qu.: 0.000 NA's :1179 NA's:1406
## Max. :480.00 Max. :738.000
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 WD :1267
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 New : 122
## Median : 0.00 Median : 6.000 Median :2008 COD : 43
## Mean : 43.49 Mean : 6.322 Mean :2008 ConLD : 9
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 ConLI : 5
## Max. :15500.00 Max. :12.000 Max. :2010 ConLw : 5
## (Other): 9
## SaleCondition SalePrice
## Abnorml: 101 Min. : 34900
## AdjLand: 4 1st Qu.:129975
## Alloca : 12 Median :163000
## Family : 20 Mean :180921
## Normal :1198 3rd Qu.:214000
## Partial: 125 Max. :755000
##
Snapshot of the test dataset.
head(test_dataset, 1)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461 20 RH 80 11622 Pave <NA> Reg
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1 Lvl AllPub Inside Gtl NAmes Feedr Norm
## BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1 1Fam 1Story 5 6 1961 1961 Gable
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1 CompShg VinylSd VinylSd None 0 TA TA
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1 CBlock TA TA No Rec 468
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1 LwQ 144 270 882 GasA TA Y
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1 SBrkr 896 0 0 896 0
## BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1 0 1 0 2 1 TA
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1 5 Typ 0 <NA> Attchd 1961
## GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1 Unf 1 730 TA TA Y
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1 140 0 0 0 120 0 <NA>
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1 MnPrv <NA> 0 6 2010 WD Normal
Summarize the test dataset.
summary(test_dataset)
## Id MSSubClass MSZoning LotFrontage
## Min. :1461 Min. : 20.00 C (all): 15 Min. : 21.00
## 1st Qu.:1826 1st Qu.: 20.00 FV : 74 1st Qu.: 58.00
## Median :2190 Median : 50.00 RH : 10 Median : 67.00
## Mean :2190 Mean : 57.38 RL :1114 Mean : 68.58
## 3rd Qu.:2554 3rd Qu.: 70.00 RM : 242 3rd Qu.: 80.00
## Max. :2919 Max. :190.00 NA's : 4 Max. :200.00
## NA's :227
## LotArea Street Alley LotShape LandContour Utilities
## Min. : 1470 Grvl: 6 Grvl: 70 IR1:484 Bnk: 54 AllPub:1457
## 1st Qu.: 7391 Pave:1453 Pave: 37 IR2: 35 HLS: 70 NA's : 2
## Median : 9399 NA's:1352 IR3: 6 Low: 24
## Mean : 9819 Reg:934 Lvl:1311
## 3rd Qu.:11518
## Max. :56600
##
## LotConfig LandSlope Neighborhood Condition1 Condition2
## Corner : 248 Gtl:1396 NAmes :218 Norm :1251 Artery: 3
## CulDSac: 82 Mod: 60 OldTown:126 Feedr : 83 Feedr : 7
## FR2 : 38 Sev: 3 CollgCr:117 Artery : 44 Norm :1444
## FR3 : 10 Somerst: 96 RRAn : 24 PosA : 3
## Inside :1081 Edwards: 94 PosN : 20 PosN : 2
## NridgHt: 89 RRAe : 17
## (Other):719 (Other): 20
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1Fam :1205 1.5Fin:160 Min. : 1.000 Min. :1.000 Min. :1879
## 2fmCon: 31 1.5Unf: 5 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1953
## Duplex: 57 1Story:745 Median : 6.000 Median :5.000 Median :1973
## Twnhs : 53 2.5Unf: 13 Mean : 6.079 Mean :5.554 Mean :1971
## TwnhsE: 113 2Story:427 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001
## SFoyer: 46 Max. :10.000 Max. :9.000 Max. :2010
## SLvl : 63
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## Min. :1950 Flat : 7 CompShg:1442 VinylSd:510 VinylSd:510
## 1st Qu.:1963 Gable :1169 Tar&Grv: 12 MetalSd:230 MetalSd:233
## Median :1992 Gambrel: 11 WdShake: 4 HdBoard:220 HdBoard:199
## Mean :1984 Hip : 265 WdShngl: 1 Wd Sdng:205 Wd Sdng:194
## 3rd Qu.:2004 Mansard: 4 Plywood:113 Plywood:128
## Max. :2010 Shed : 3 (Other):180 (Other):194
## NA's : 1 NA's : 1
## MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual
## BrkCmn : 10 Min. : 0.0 Ex: 55 Ex: 9 BrkTil:165 Ex :137
## BrkFace:434 1st Qu.: 0.0 Fa: 21 Fa: 39 CBlock:601 Fa : 53
## None :878 Median : 0.0 Gd:491 Gd: 153 PConc :661 Gd :591
## Stone :121 Mean : 100.7 TA:892 Po: 2 Slab : 25 TA :634
## NA's : 16 3rd Qu.: 164.0 TA:1256 Stone : 5 NA's: 44
## Max. :1290.0 Wood : 2
## NA's :15
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Fa : 59 Av :197 ALQ :209 Min. : 0.0 ALQ : 33
## Gd : 57 Gd :142 BLQ :121 1st Qu.: 0.0 BLQ : 35
## Po : 3 Mn :125 GLQ :431 Median : 350.5 GLQ : 20
## TA :1295 No :951 LwQ : 80 Mean : 439.2 LwQ : 41
## NA's: 45 NA's: 44 Rec :155 3rd Qu.: 753.5 Rec : 51
## Unf :421 Max. :4010.0 Unf :1237
## NA's: 42 NA's :1 NA's: 42
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## Min. : 0.00 Min. : 0.0 Min. : 0 GasA:1446 Ex:752
## 1st Qu.: 0.00 1st Qu.: 219.2 1st Qu.: 784 GasW: 9 Fa: 43
## Median : 0.00 Median : 460.0 Median : 988 Grav: 2 Gd:233
## Mean : 52.62 Mean : 554.3 Mean :1046 Wall: 2 Po: 2
## 3rd Qu.: 0.00 3rd Qu.: 797.8 3rd Qu.:1305 TA:429
## Max. :1526.00 Max. :2140.0 Max. :5095
## NA's :1 NA's :1 NA's :1
## CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## N: 101 FuseA: 94 Min. : 407.0 Min. : 0 Min. : 0.000
## Y:1358 FuseF: 23 1st Qu.: 873.5 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 5 Median :1079.0 Median : 0 Median : 0.000
## SBrkr:1337 Mean :1156.5 Mean : 326 Mean : 3.543
## 3rd Qu.:1382.5 3rd Qu.: 676 3rd Qu.: 0.000
## Max. :5095.0 Max. :1862 Max. :1064.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 407 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:1118 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.000
## Median :1432 Median :0.0000 Median :0.0000 Median :2.000
## Mean :1486 Mean :0.4345 Mean :0.0652 Mean :1.571
## 3rd Qu.:1721 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:2.000
## Max. :5095 Max. :3.0000 Max. :2.0000 Max. :4.000
## NA's :2 NA's :2
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex :105 Min. : 3.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa : 31 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Gd :565 Median : 6.000
## Mean :0.3777 Mean :2.854 Mean :1.042 TA :757 Mean : 6.385
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 NA's: 1 3rd Qu.: 7.000
## Max. :2.0000 Max. :6.000 Max. :2.000 Max. :15.000
##
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## Typ :1357 Min. :0.0000 Ex : 19 2Types : 17 Min. :1895
## Min2 : 36 1st Qu.:0.0000 Fa : 41 Attchd :853 1st Qu.:1959
## Min1 : 34 Median :0.0000 Gd :364 Basment: 17 Median :1979
## Mod : 20 Mean :0.5812 Po : 26 BuiltIn: 98 Mean :1978
## Maj1 : 5 3rd Qu.:1.0000 TA :279 CarPort: 6 3rd Qu.:2002
## (Other): 5 Max. :4.0000 NA's:730 Detchd :392 Max. :2207
## NA's : 2 NA's : 76 NA's :78
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## Fin :367 Min. :0.000 Min. : 0.0 Fa : 76 Ex : 1
## RFn :389 1st Qu.:1.000 1st Qu.: 318.0 Gd : 10 Fa : 39
## Unf :625 Median :2.000 Median : 480.0 Po : 2 Gd : 6
## NA's: 78 Mean :1.766 Mean : 472.8 TA :1293 Po : 7
## 3rd Qu.:2.000 3rd Qu.: 576.0 NA's: 78 TA :1328
## Max. :5.000 Max. :1488.0 NA's: 78
## NA's :1 NA's :1
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## N: 126 Min. : 0.00 Min. : 0.00 Min. : 0.00
## P: 32 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Y:1301 Median : 0.00 Median : 28.00 Median : 0.00
## Mean : 93.17 Mean : 48.31 Mean : 24.24
## 3rd Qu.: 168.00 3rd Qu.: 72.00 3rd Qu.: 0.00
## Max. :1424.00 Max. :742.00 Max. :1012.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC Fence
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Ex : 2 GdPrv: 59
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 Gd : 1 GdWo : 58
## Median : 0.000 Median : 0.00 Median : 0.000 NA's:1456 MnPrv: 172
## Mean : 1.794 Mean : 17.06 Mean : 1.744 MnWw : 1
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000 NA's :1169
## Max. :360.000 Max. :576.00 Max. :800.000
##
## MiscFeature MiscVal MoSold YrSold SaleType
## Gar2: 3 Min. : 0.00 Min. : 1.000 Min. :2006 WD :1258
## Othr: 2 1st Qu.: 0.00 1st Qu.: 4.000 1st Qu.:2007 New : 117
## Shed: 46 Median : 0.00 Median : 6.000 Median :2008 COD : 44
## NA's:1408 Mean : 58.17 Mean : 6.104 Mean :2008 ConLD : 17
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 CWD : 8
## Max. :17000.00 Max. :12.000 Max. :2010 (Other): 14
## NA's : 1
## SaleCondition
## Abnorml: 89
## AdjLand: 8
## Alloca : 12
## Family : 26
## Normal :1204
## Partial: 120
##
1. Provide univariate descriptive statistics and appropriate plots for the training data set.
# Summarize the training dataset's SalePrice variable.
summary(training_dataset$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   34900  129975  163000  180921  214000  755000
Create a histogram of sales prices.
hist(training_dataset$SalePrice,
     xlab = 'Sale Price',
     main = 'Distribution of Sales Prices',
     col = 'darkgreen')
Create a QQ plot of sales prices.
qqnorm(training_dataset$SalePrice, col = 'darkred')
qqline(training_dataset$SalePrice)
2. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
# Define the variables to include in the matrix.
sale_price <- training_dataset$SalePrice
lot_area <- training_dataset$LotArea
gr_liv_area <- training_dataset$GrLivArea
garage_area <- training_dataset$GarageArea
# Plot the matrix.
plot_data <- data.frame(sale_price, lot_area, gr_liv_area, garage_area)
pairs(plot_data, main = 'Scatterplot Matrix', col = '#50394c')
3. Derive a correlation matrix for any three quantitative variables in the dataset.
# Create a dataframe containing the 3 variables to include in the matrix.
LotArea <- training_dataset$LotArea
GrLivArea <- training_dataset$GrLivArea
GarageArea <- training_dataset$GarageArea
matrix_variables <- data.frame(LotArea, GrLivArea, GarageArea)
# Create the correlation matrix and plot it (corrplot() comes from the corrplot package).
cor_matrix <- cor(matrix_variables)
library(corrplot)
corrplot(cor_matrix, method = "shade")
4. Test the hypotheses that the correlation between each pair of variables is 0 and provide an 80% confidence interval for each.
4(a). Test LotArea Vs. GrLivArea.
# Test LotArea Vs. GrLivArea using the Pearson method with 80% confidence level.
cor.test(training_dataset$LotArea, training_dataset$GrLivArea, method = 'pearson', conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: training_dataset$LotArea and training_dataset$GrLivArea
## t = 10.414, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
4(b). Test LotArea Vs. GarageArea.
# Test LotArea Vs. GarageArea using the Pearson method with 80% confidence level.
cor.test(training_dataset$LotArea, training_dataset$GarageArea, method = 'pearson', conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: training_dataset$LotArea and training_dataset$GarageArea
## t = 7.0034, df = 1458, p-value = 0.000000000003803
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1477356 0.2126767
## sample estimates:
## cor
## 0.1804028
4(c). Test GarageArea Vs. GrLivArea.
# Test GarageArea Vs. GrLivArea with 80% confidence level.
cor.test(training_dataset$GarageArea, training_dataset$GrLivArea, method = 'pearson', conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: training_dataset$GarageArea and training_dataset$GrLivArea
## t = 20.276, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4423993 0.4947713
## sample estimates:
## cor
## 0.4689975
5. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
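From the above results, all 3 sample correlations are positive (0.26, 0.18, and 0.47), and none of the 80% confidence intervals contains 0. The p-values for all 3 tests are far below 0.05, so we reject the null hypothesis of zero correlation in each case.
Familywise error is a legitimate concern whenever several tests are run together: with 3 independent tests at \(\alpha = 0.05\) each, the chance of at least one false rejection is \(1 - 0.95^3 \approx 0.14\). Here, however, the largest p-value (about \(3.8 \times 10^{-12}\)) is far below even the Bonferroni-corrected threshold of \(0.05/3 \approx 0.017\), so I would not be worried about familywise error when measuring relationships across the 3 attributes.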
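The familywise error concern can be made concrete with a multiple-comparison adjustment. The sketch below hard-codes the p-values printed by the three `cor.test()` calls above; where R reports only `< 2.2e-16`, the value `2.2e-16` is used as an upper bound, which is an assumption. Bonferroni and Holm are shown as two common corrections; the original analysis did not apply either.

```r
# p-values from the three cor.test() outputs above; "< 2.2e-16" is entered as 2.2e-16 (an upper bound).
p_values <- c(
  LotArea_GrLivArea    = 2.2e-16,
  LotArea_GarageArea   = 3.803e-12,
  GarageArea_GrLivArea = 2.2e-16
)

# Familywise error rate for 3 independent tests at alpha = 0.05.
alpha <- 0.05
fwer <- 1 - (1 - alpha)^length(p_values)
fwer

# Bonferroni- and Holm-adjusted p-values.
adjusted <- cbind(
  bonferroni = p.adjust(p_values, method = "bonferroni"),
  holm       = p.adjust(p_values, method = "holm")
)
adjusted

# Check whether every adjusted p-value is still below alpha.
all(adjusted < alpha)
```

Because every adjusted p-value stays many orders of magnitude below 0.05, the conclusions do not change after correction.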
1. Invert your correlation matrix from above.
# Invert the matrix.
inverted_matrix <- solve(cor_matrix)
round(inverted_matrix, 2)
##            LotArea GrLivArea GarageArea
## LotArea       1.08     -0.25      -0.08
## GrLivArea    -0.25      1.34      -0.58
## GarageArea   -0.08     -0.58       1.29
2. Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
2(a). Multiply the correlation matrix by the precision matrix.
correlation_x_precision <- cor_matrix %*% inverted_matrix
round(correlation_x_precision, 2)
##            LotArea GrLivArea GarageArea
## LotArea          1         0          0
## GrLivArea        0         1          0
## GarageArea       0         0          1
2(b). Multiply the precision matrix by the correlation matrix.
precision_x_correlation <- inverted_matrix %*% cor_matrix
round(precision_x_correlation, 2)
##            LotArea GrLivArea GarageArea
## LotArea          1         0          0
## GrLivArea        0         1          0
## GarageArea       0         0          1
3. Conduct LU decomposition on the matrix.
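# Perform LU decomposition on the precision (inverted correlation) matrix.
# lu.decomposition() comes from the matrixcalc package.
library(matrixcalc)
lu.decomposition(inverted_matrix)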
## $L
##             [,1]       [,2] [,3]
## [1,]  1.00000000  0.0000000    0
## [2,] -0.22884393  1.0000000    0
## [3,] -0.07307553 -0.4689975    1
##
## $U
##          [,1]       [,2]        [,3]
## [1,] 1.079209 -0.2469705 -0.07886378
## [2,] 0.000000  1.2819833 -0.60124693
## [3,] 0.000000  0.0000000  1.00000000
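The inversion and decomposition steps can be verified in base R. The following sketch rebuilds the correlation matrix from the three correlations printed by `cor.test()` above and uses a small hand-written Doolittle routine, `lu_doolittle`, which is an illustrative helper and not the `matrixcalc` implementation. It assumes no pivoting is needed, which holds for this positive-definite matrix.

```r
# Rebuild the correlation matrix from the three correlations reported by cor.test() above.
vars <- c("LotArea", "GrLivArea", "GarageArea")
cor_matrix <- matrix(c(1,         0.2631162, 0.1804028,
                       0.2631162, 1,         0.4689975,
                       0.1804028, 0.4689975, 1),
                     nrow = 3, byrow = TRUE, dimnames = list(vars, vars))
precision_matrix <- solve(cor_matrix)

# Doolittle LU decomposition without pivoting (illustrative helper).
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- unname(A)
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]        # multiplier that eliminates U[i, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ] # row reduction
    }
  }
  list(L = L, U = U)
}

lu <- lu_doolittle(precision_matrix)

# L %*% U should reproduce the precision matrix.
all.equal(lu$L %*% lu$U, unname(precision_matrix))
# The precision matrix times the correlation matrix should be the identity.
all.equal(unname(precision_matrix %*% cor_matrix), diag(3))
```

The entries of `lu$L` and `lu$U` should agree with the `$L` and `$U` output above up to the rounding of the input correlations.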
1. Select a variable in the Kaggle.com training dataset that is skewed to the right, and shift it so that the minimum value is strictly above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function.
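# Select the right-skewed TotalBsmtSF variable from the training dataset
# and run fitdistr (from the MASS package) to fit an exponential probability density function.
# TotalBsmtSF has a minimum of 0 and is used here without shifting.
library(MASS)
fit <- fitdistr(training_dataset$TotalBsmtSF, "exponential")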
fit
## rate
## 0.00094568957
## (0.00002474983)
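The fitted rate can be reproduced by hand, because the maximum-likelihood estimate of an exponential rate is the reciprocal of the sample mean. This sketch plugs in the mean of TotalBsmtSF (1057.429, as printed by `mean(empirical_data)` in part 3b) and assumes \(n = 1460\) training rows, consistent with `df = 1458` in the correlation tests; the variable names are illustrative.

```r
# For an exponential distribution the maximum-likelihood rate is 1 / mean.
mean_total_bsmt_sf <- 1057.429  # mean of TotalBsmtSF reported in part 3b
n <- 1460                       # assumed number of training rows

rate_mle <- 1 / mean_total_bsmt_sf

# Standard error of the exponential rate estimate: rate / sqrt(n).
rate_se <- rate_mle / sqrt(n)

c(rate = rate_mle, se = rate_se)
```

These values match the `rate` and the parenthesized standard error printed by `fitdistr` above.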
2. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
# Check the names that are available in the fit.
names(fit)
## [1] "estimate" "sd" "vcov" "n" "loglik"
lambda <- fit$estimate
samples <- rexp(1000, lambda)
Plot a histogram and compare it with a histogram of your original variable.
# Histogram of samples.
hist(samples, breaks = 100)
# Histogram of original variable.
original_variable <- training_dataset$TotalBsmtSF
hist(original_variable, breaks = 100)
Conclusion:
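The simulated exponential samples are more strongly right-skewed than the original variable: their mass is concentrated near zero and decays steadily, whereas the histogram of TotalBsmtSF peaks away from zero and has a long right tail. The exponential distribution is therefore only a rough fit for this variable.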
3a. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
# Estimate the percentiles from the 1000 simulated exponential samples.
quantile(samples, probs = c(0.05, 0.95))
## 5% 95%
## 53.62245 3062.35061
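The percentiles can also be obtained exactly from the exponential CDF, \(F(q) = 1 - e^{-\lambda q}\), instead of from simulated draws. Solving \(F(q) = p\) gives \(q = -\ln(1 - p)/\lambda\), which is what `qexp` computes. The sketch below hard-codes the rate printed by `fitdistr` above; `exact` and `closed_form` are illustrative names.

```r
# Exact 5th and 95th percentiles of the fitted exponential distribution.
lambda <- 0.00094568957  # rate estimated by fitdistr above
probs <- c(0.05, 0.95)

# qexp() inverts the exponential CDF F(q) = 1 - exp(-lambda * q).
exact <- qexp(probs, rate = lambda)

# Equivalent closed form obtained by solving F(q) = p for q.
closed_form <- -log(1 - probs) / lambda

rbind(exact, closed_form)
```

The exact percentiles come out at roughly 54.2 and 3167.8, close to the simulation-based values of 53.6 and 3062.4 shown above.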
3b. Also generate a 95% confidence interval from the empirical data, assuming normality.
empirical_data <- training_dataset$TotalBsmtSF
mean(empirical_data)
## [1] 1057.429
# Simulate normal draws with the same mean and standard deviation as the empirical data.
normality <- rnorm(length(empirical_data), mean(empirical_data), sd(empirical_data))
hist(normality)
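One common reading of "a 95% confidence interval assuming normality" is an interval for the mean of the form \(\bar{x} \pm z_{0.975}\, s/\sqrt{n}\). The sketch below implements that reading; `normal_ci` is a hypothetical helper, and `stand_in` is made-up data used only so the example runs on its own. In the analysis above the function would be applied to `training_dataset$TotalBsmtSF`.

```r
# 95% confidence interval for a mean under a normal approximation.
normal_ci <- function(v, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)            # two-sided critical value, about 1.96 for 95%
  half_width <- z * sd(v) / sqrt(length(v))  # z times the standard error of the mean
  c(lower = mean(v) - half_width, upper = mean(v) + half_width)
}

# Hypothetical stand-in data, not the Kaggle variable.
set.seed(1)
stand_in <- rexp(500, rate = 1 / 1000)
ci <- normal_ci(stand_in)
ci
```

The interval is centered on the sample mean and narrows as the sample size grows.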
3c. Finally, provide the empirical 5th percentile and 95th percentile of the data.
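# These are the percentiles of the simulated normal draws; quantile(empirical_data, probs = c(0.05, 0.95)) gives the percentiles of TotalBsmtSF itself.
quantile(normality, probs = c(0.05, 0.95))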
## 5% 95%
## 329.8326 1789.1667