Final Exam
Problem 1.
N = 46
set.seed(1973)
# random variable X
X <- runif(10000, min = 1, max = N)
# random variable Y
Y = rnorm(10000, mean = (N+1)/2, sd = (N+1)/2)
# x is median of X
x <- median(X)
# y is first quartile of Y
y <- quantile(Y)[[2]]
a. \(P(X > x | X > y )\)
library(tidyverse)  # attaches dplyr, so a separate library(dplyr) call is not needed
Xdf <- data.frame(value = as.numeric(X), stringsAsFactors = FALSE)
# keep the draws with X > y, then take the share of those that also exceed x
cond_prob <- Xdf %>%
  filter(value > y) %>%
  summarize(mean(value > x))
cat("The requested probability is",cond_prob[[1]])
The requested probability is 0.5858231
b. \(P(X > x\text{ and } Y > y)\)
Ydf <- data.frame(value = as.numeric(Y), stringsAsFactors = FALSE)
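# product of the marginal proportions; this treats X and Y as independent,
# which holds here because the two samples were generated separately
res_prob <- mean(Xdf$value > x) * mean(Ydf$value > y)
cat("The requested probability is",res_prob)
The requested probability is 0.375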
c. \(P(X < x\text{ and } X > y )\)
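# both events concern the same variable X, so they are not independent and the
# marginal proportions cannot be multiplied; count the draws with y < X < x directly
cond_prob2 <- mean(Xdf$value < x & Xdf$value > y)
cat("The requested probability is",cond_prob2)
The requested probability is 0.3535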
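As a cross-check on parts a through c, the same probabilities can be computed from the theoretical distributions rather than from the simulated draws. This is a minimal sketch that assumes the population median of X and the population first quartile of Y stand in for the sample values x and y; the `_theory` and `p_` names are introduced here for illustration only. Simulated estimates should land close to these values.

```r
N <- 46
# population median of X ~ Uniform(1, N) and first quartile of
# Y ~ Normal(mean = (N + 1) / 2, sd = (N + 1) / 2)
x_theory <- qunif(0.5, min = 1, max = N)
y_theory <- qnorm(0.25, mean = (N + 1) / 2, sd = (N + 1) / 2)

# upper-tail probabilities of X beyond each cutoff
p_X_gt_x <- punif(x_theory, min = 1, max = N, lower.tail = FALSE)
p_X_gt_y <- punif(y_theory, min = 1, max = N, lower.tail = FALSE)

# a. P(X > x | X > y): because y < x, the event {X > x and X > y} is just {X > x}
p_a <- p_X_gt_x / p_X_gt_y
# b. P(X > x and Y > y): X and Y are independent, so the marginals multiply
#    (P(Y > first quartile) is 0.75 by definition)
p_b <- p_X_gt_x * 0.75
# c. P(X < x and X > y) = P(y < X < x)
p_c <- p_X_gt_y - p_X_gt_x

round(c(a = p_a, b = p_b, c = p_c), 4)
```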
XYdf <- data.frame(cbind(Xdf$value,Ydf$value))
colnames(XYdf) <- c('xval','yval')
# compose dataframe with sums
tb_XY <- data.frame(c(sum(XYdf$xval > x & XYdf$yval > y),
sum(XYdf$xval <= x & XYdf$yval > y)),
c(sum(XYdf$xval > x & XYdf$yval <= y),
sum(XYdf$xval <= x & XYdf$yval <= y)),
row.names=c('X > x','X <= x'))
colnames(tb_XY) <- c('Y > y','Y <= y')
library(kableExtra)
tb_XY %>%
  kable(align = "c", format = "html",
        table.attr = 'class="table table-striped table-hover"')%>%
  kable_styling(bootstrap_options = c("striped", "hover"),full_width = FALSE)%>%
  column_spec(1, bold = TRUE, border_right = TRUE)%>%
  column_spec(2, border_right = TRUE)
| | Y > y | Y <= y |
|---|---|---|
| X > x | 3760 | 1240 |
| X <= x | 3740 | 1260 |
Marginal probabilities:
\(P(Y>y)=\) 0.75.
\(P(Y\le y)=\) 0.25.
\(P(X>x)=\) 0.5.
\(P(X\le x)=\) 0.5.
Joint probabilities:
\(P(X>x\text{ and }Y>y)=\) 0.376.
\(P(X>x\text{ and }Y\le y)=\) 0.124.
\(P(X\le x\text{ and }Y>y)=\) 0.374.
\(P(X\le x\text{ and }Y\le y)=\) 0.126.
These probabilities are confirmed in the table below.
# calculate proportion
tb_prop_XY <- tb_XY/10000
tb_prop_XY %>%
kable(align = "c", format = "html",
table.attr = 'class="table table-striped table-hover"')%>%
  kable_styling(bootstrap_options = c("striped", "hover"),full_width = FALSE)%>%
  column_spec(1, bold = TRUE, border_right = TRUE)%>%
  column_spec(2, border_right = TRUE)
| | Y > y | Y <= y |
|---|---|---|
| X > x | 0.376 | 0.124 |
| X <= x | 0.374 | 0.126 |
Further, the joint probability \(P(X>x\text{ and }Y>y)=\) 0.376.
The product of marginal probabilities
\(P(X>x)=\) 0.5 and
\(P(Y>y)=\) 0.75 is
\(P(X>x)P(Y>y)=\) 0.375.
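The comparison between the joint probability and the product of the marginals can be reproduced for every cell from the counts in the contingency table. A minimal sketch, with the counts typed in by hand from that table and the object names chosen here for illustration:

```r
# counts copied from the contingency table
# (rows: X > x, X <= x; columns: Y > y, Y <= y)
counts <- matrix(c(3760, 1240,
                   3740, 1260),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("X > x", "X <= x"), c("Y > y", "Y <= y")))

joint <- counts / sum(counts)      # observed joint probabilities
p_row <- rowSums(joint)            # marginal probabilities for X
p_col <- colSums(joint)            # marginal probabilities for Y
expected <- outer(p_row, p_col)    # joint probabilities implied by independence

joint - expected                   # cell-by-cell gap between observed and implied
max(abs(joint - expected))         # largest absolute gap
```

Under exact independence every gap would be zero; small gaps of this kind are what sampling variation alone would be expected to produce.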
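# fisher.test() lives in the stats package, which is attached by default
XY_matrix <- as.matrix(tb_XY)
# hybrid is omitted because it only applies to tables larger than 2 by 2
fisher.test(XY_matrix, alternative = "two.sided")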
Fisher's Exact Test for Count Data
data: XY_matrix
p-value = 0.6608
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.932132 1.119534
sample estimates:
odds ratio
1.021566
chisq.test(tb_XY)
Pearson's Chi-squared test with Yates' continuity correction
data: tb_XY
X-squared = 0.19253, df = 1, p-value = 0.6608
Fisher's exact test computes an exact p-value and is most often applied to 2-by-2 contingency tables, particularly when cell counts are small; the chi-square test relies on a large-sample approximation and extends naturally to tables with more than two categories per variable. With cell counts as large as these, either test is appropriate.
Since both tests return the same p-value (0.6608), which is greater than 0.05, we do not reject the null hypothesis that X and Y are independent at the 5% significance level. This is the expected result, as the variables were generated randomly and independently.
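To show where the chi-squared statistic comes from, it can be rebuilt by hand from the table counts. This is a sketch that assumes the counts match the contingency table in Problem 1; the object names are illustrative.

```r
counts <- matrix(c(3760, 1240,
                   3740, 1260), nrow = 2, byrow = TRUE)

# expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Yates' continuity correction subtracts 0.5 from each absolute deviation
chi_sq <- sum((abs(counts - expected) - 0.5)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

c(statistic = chi_sq, p.value = p_value)

# the built-in test should return the same statistic
unname(chisq.test(counts)$statistic)
```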
Problem 2.
Predict the final price, “Sale Price”, of each home.
Provide univariate descriptive statistics and appropriate plots for the training data set.
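# import strings as factors (stated explicitly because the default is FALSE in R 4.0 and later)
hp_train <- read.csv('https://raw.githubusercontent.com/sigmasigmaiota/DATA605/master/train.csv',
                     stringsAsFactors = TRUE)
# omit the iterative ID column
hp_train[,1] <- NULL
# univariate summary of every variable via Hmisc::describe
library(Hmisc)
describe(hp_train)
hp_train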
80 Variables 1460 Observations
--------------------------------------------------------------------------------
MSSubClass
n missing distinct Info Mean Gmd .05 .10
1460 0 15 0.94 56.9 43.19 20 20
.25 .50 .75 .90 .95
20 50 70 120 160
lowest : 20 30 40 45 50, highest: 90 120 160 180 190
Value 20 30 40 45 50 60 70 75 80 85 90
Frequency 536 69 4 12 144 299 60 16 58 20 52
Proportion 0.367 0.047 0.003 0.008 0.099 0.205 0.041 0.011 0.040 0.014 0.036
Value 120 160 180 190
Frequency 87 63 10 30
Proportion 0.060 0.043 0.007 0.021
--------------------------------------------------------------------------------
MSZoning
n missing distinct
1460 0 5
lowest : C (all) FV RH RL RM
highest: C (all) FV RH RL RM
Value C (all) FV RH RL RM
Frequency 10 65 16 1151 218
Proportion 0.007 0.045 0.011 0.788 0.149
--------------------------------------------------------------------------------
LotFrontage
n missing distinct Info Mean Gmd .05 .10
1201 259 110 0.998 70.05 24.61 34 44
.25 .50 .75 .90 .95
59 69 80 96 107
lowest : 21 24 30 32 33, highest: 160 168 174 182 313
--------------------------------------------------------------------------------
LotArea
n missing distinct Info Mean Gmd .05 .10
1460 0 1073 1 10517 5718 3312 5000
.25 .50 .75 .90 .95
7554 9478 11602 14382 17401
lowest : 1300 1477 1491 1526 1533, highest: 70761 115149 159000 164660 215245
--------------------------------------------------------------------------------
Street
n missing distinct
1460 0 2
Value Grvl Pave
Frequency 6 1454
Proportion 0.004 0.996
--------------------------------------------------------------------------------
Alley
n missing distinct
91 1369 2
Value Grvl Pave
Frequency 50 41
Proportion 0.549 0.451
--------------------------------------------------------------------------------
LotShape
n missing distinct
1460 0 4
Value IR1 IR2 IR3 Reg
Frequency 484 41 10 925
Proportion 0.332 0.028 0.007 0.634
--------------------------------------------------------------------------------
LandContour
n missing distinct
1460 0 4
Value Bnk HLS Low Lvl
Frequency 63 50 36 1311
Proportion 0.043 0.034 0.025 0.898
--------------------------------------------------------------------------------
Utilities
n missing distinct
1460 0 2
Value AllPub NoSeWa
Frequency 1459 1
Proportion 0.999 0.001
--------------------------------------------------------------------------------
LotConfig
n missing distinct
1460 0 5
lowest : Corner CulDSac FR2 FR3 Inside
highest: Corner CulDSac FR2 FR3 Inside
Value Corner CulDSac FR2 FR3 Inside
Frequency 263 94 47 4 1052
Proportion 0.180 0.064 0.032 0.003 0.721
--------------------------------------------------------------------------------
LandSlope
n missing distinct
1460 0 3
Value Gtl Mod Sev
Frequency 1382 65 13
Proportion 0.947 0.045 0.009
--------------------------------------------------------------------------------
Neighborhood
n missing distinct
1460 0 25
lowest : Blmngtn Blueste BrDale BrkSide ClearCr
highest: Somerst StoneBr SWISU Timber Veenker
--------------------------------------------------------------------------------
Condition1
n missing distinct
1460 0 9
lowest : Artery Feedr Norm PosA PosN , highest: PosN RRAe RRAn RRNe RRNn
Value Artery Feedr Norm PosA PosN RRAe RRAn RRNe RRNn
Frequency 48 81 1260 8 19 11 26 2 5
Proportion 0.033 0.055 0.863 0.005 0.013 0.008 0.018 0.001 0.003
--------------------------------------------------------------------------------
Condition2
n missing distinct
1460 0 8
lowest : Artery Feedr Norm PosA PosN , highest: PosA PosN RRAe RRAn RRNn
Value Artery Feedr Norm PosA PosN RRAe RRAn RRNn
Frequency 2 6 1445 1 2 1 1 2
Proportion 0.001 0.004 0.990 0.001 0.001 0.001 0.001 0.001
--------------------------------------------------------------------------------
BldgType
n missing distinct
1460 0 5
lowest : 1Fam 2fmCon Duplex Twnhs TwnhsE, highest: 1Fam 2fmCon Duplex Twnhs TwnhsE
Value 1Fam 2fmCon Duplex Twnhs TwnhsE
Frequency 1220 31 52 43 114
Proportion 0.836 0.021 0.036 0.029 0.078
--------------------------------------------------------------------------------
HouseStyle
n missing distinct
1460 0 8
lowest : 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf, highest: 2.5Fin 2.5Unf 2Story SFoyer SLvl
Value 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
Frequency 154 14 726 8 11 445 37 65
Proportion 0.105 0.010 0.497 0.005 0.008 0.305 0.025 0.045
--------------------------------------------------------------------------------
OverallQual
n missing distinct Info Mean Gmd .05 .10
1460 0 10 0.951 6.099 1.522 4 5
.25 .50 .75 .90 .95
5 6 7 8 8
lowest : 1 2 3 4 5, highest: 6 7 8 9 10
Value 1 2 3 4 5 6 7 8 9 10
Frequency 2 3 20 116 397 374 319 168 43 18
Proportion 0.001 0.002 0.014 0.079 0.272 0.256 0.218 0.115 0.029 0.012
--------------------------------------------------------------------------------
OverallCond
n missing distinct Info Mean Gmd
1460 0 9 0.814 5.575 1.111
lowest : 1 2 3 4 5, highest: 5 6 7 8 9
Value 1 2 3 4 5 6 7 8 9
Frequency 1 5 25 57 821 252 205 72 22
Proportion 0.001 0.003 0.017 0.039 0.562 0.173 0.140 0.049 0.015
--------------------------------------------------------------------------------
YearBuilt
n missing distinct Info Mean Gmd .05 .10
1460 0 112 1 1971 33.88 1916 1925
.25 .50 .75 .90 .95
1954 1973 2000 2006 2007
lowest : 1872 1875 1880 1882 1885, highest: 2006 2007 2008 2009 2010
--------------------------------------------------------------------------------
YearRemodAdd
n missing distinct Info Mean Gmd .05 .10
1460 0 61 0.997 1985 23.05 1950 1950
.25 .50 .75 .90 .95
1967 1994 2004 2006 2007
lowest : 1950 1951 1952 1953 1954, highest: 2006 2007 2008 2009 2010
--------------------------------------------------------------------------------
RoofStyle
n missing distinct
1460 0 6
lowest : Flat Gable Gambrel Hip Mansard
highest: Gable Gambrel Hip Mansard Shed
Value Flat Gable Gambrel Hip Mansard Shed
Frequency 13 1141 11 286 7 2
Proportion 0.009 0.782 0.008 0.196 0.005 0.001
--------------------------------------------------------------------------------
RoofMatl
n missing distinct
1460 0 8
lowest : ClyTile CompShg Membran Metal Roll
highest: Metal Roll Tar&Grv WdShake WdShngl
Value ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
Frequency 1 1434 1 1 1 11 5 6
Proportion 0.001 0.982 0.001 0.001 0.001 0.008 0.003 0.004
--------------------------------------------------------------------------------
Exterior1st
n missing distinct
1460 0 15
lowest : AsbShng AsphShn BrkComm BrkFace CBlock
highest: Stone Stucco VinylSd Wd Sdng WdShing
Value AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc
Frequency 20 1 2 50 1 61 222 1
Proportion 0.014 0.001 0.001 0.034 0.001 0.042 0.152 0.001
Value MetalSd Plywood Stone Stucco VinylSd Wd Sdng WdShing
Frequency 220 108 2 25 515 206 26
Proportion 0.151 0.074 0.001 0.017 0.353 0.141 0.018
--------------------------------------------------------------------------------
Exterior2nd
n missing distinct
1460 0 16
lowest : AsbShng AsphShn Brk Cmn BrkFace CBlock
highest: Stone Stucco VinylSd Wd Sdng Wd Shng
Value AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc
Frequency 20 3 7 25 1 60 207 10
Proportion 0.014 0.002 0.005 0.017 0.001 0.041 0.142 0.007
Value MetalSd Other Plywood Stone Stucco VinylSd Wd Sdng Wd Shng
Frequency 214 1 142 5 26 504 197 38
Proportion 0.147 0.001 0.097 0.003 0.018 0.345 0.135 0.026
--------------------------------------------------------------------------------
MasVnrType
n missing distinct
1452 8 4
Value BrkCmn BrkFace None Stone
Frequency 15 445 864 128
Proportion 0.010 0.306 0.595 0.088
--------------------------------------------------------------------------------
MasVnrArea
n missing distinct Info Mean Gmd .05 .10
1452 8 327 0.791 103.7 156.9 0 0
.25 .50 .75 .90 .95
0 0 166 335 456
lowest : 0 1 11 14 16, highest: 1115 1129 1170 1378 1600
--------------------------------------------------------------------------------
ExterQual
n missing distinct
1460 0 4
Value Ex Fa Gd TA
Frequency 52 14 488 906
Proportion 0.036 0.010 0.334 0.621
--------------------------------------------------------------------------------
ExterCond
n missing distinct
1460 0 5
lowest : Ex Fa Gd Po TA, highest: Ex Fa Gd Po TA
Value Ex Fa Gd Po TA
Frequency 3 28 146 1 1282
Proportion 0.002 0.019 0.100 0.001 0.878
--------------------------------------------------------------------------------
Foundation
n missing distinct
1460 0 6
lowest : BrkTil CBlock PConc Slab Stone , highest: CBlock PConc Slab Stone Wood
Value BrkTil CBlock PConc Slab Stone Wood
Frequency 146 634 647 24 6 3
Proportion 0.100 0.434 0.443 0.016 0.004 0.002
--------------------------------------------------------------------------------
BsmtQual
n missing distinct
1423 37 4
Value Ex Fa Gd TA
Frequency 121 35 618 649
Proportion 0.085 0.025 0.434 0.456
--------------------------------------------------------------------------------
BsmtCond
n missing distinct
1423 37 4
Value Fa Gd Po TA
Frequency 45 65 2 1311
Proportion 0.032 0.046 0.001 0.921
--------------------------------------------------------------------------------
BsmtExposure
n missing distinct
1422 38 4
Value Av Gd Mn No
Frequency 221 134 114 953
Proportion 0.155 0.094 0.080 0.670
--------------------------------------------------------------------------------
BsmtFinType1
n missing distinct
1423 37 6
lowest : ALQ BLQ GLQ LwQ Rec, highest: BLQ GLQ LwQ Rec Unf
Value ALQ BLQ GLQ LwQ Rec Unf
Frequency 220 148 418 74 133 430
Proportion 0.155 0.104 0.294 0.052 0.093 0.302
--------------------------------------------------------------------------------
BsmtFinSF1
n missing distinct Info Mean Gmd .05 .10
1460 0 637 0.967 443.6 484.5 0.0 0.0
.25 .50 .75 .90 .95
0.0 383.5 712.2 1065.5 1274.0
lowest : 0 2 16 20 24, highest: 1904 2096 2188 2260 5644
--------------------------------------------------------------------------------
BsmtFinType2
n missing distinct
1422 38 6
lowest : ALQ BLQ GLQ LwQ Rec, highest: BLQ GLQ LwQ Rec Unf
Value ALQ BLQ GLQ LwQ Rec Unf
Frequency 19 33 14 46 54 1256
Proportion 0.013 0.023 0.010 0.032 0.038 0.883
--------------------------------------------------------------------------------
BsmtFinSF2
n missing distinct Info Mean Gmd .05 .10
1460 0 144 0.305 46.55 86.58 0.0 0.0
.25 .50 .75 .90 .95
0.0 0.0 0.0 117.2 396.2
lowest : 0 28 32 35 40, highest: 1080 1085 1120 1127 1474
--------------------------------------------------------------------------------
BsmtUnfSF
n missing distinct Info Mean Gmd .05 .10
1460 0 780 0.999 567.2 486.6 0.0 74.9
.25 .50 .75 .90 .95
223.0 477.5 808.0 1232.0 1468.0
lowest : 0 14 15 23 26, highest: 2042 2046 2121 2153 2336
--------------------------------------------------------------------------------
TotalBsmtSF
n missing distinct Info Mean Gmd .05 .10
1460 0 721 1 1057 459.5 519.3 636.9
.25 .50 .75 .90 .95
795.8 991.5 1298.2 1602.2 1753.0
lowest : 0 105 190 264 270, highest: 3094 3138 3200 3206 6110
--------------------------------------------------------------------------------
Heating
n missing distinct
1460 0 6
lowest : Floor GasA GasW Grav OthW , highest: GasA GasW Grav OthW Wall
Value Floor GasA GasW Grav OthW Wall
Frequency 1 1428 18 7 2 4
Proportion 0.001 0.978 0.012 0.005 0.001 0.003
--------------------------------------------------------------------------------
HeatingQC
n missing distinct
1460 0 5
lowest : Ex Fa Gd Po TA, highest: Ex Fa Gd Po TA
Value Ex Fa Gd Po TA
Frequency 741 49 241 1 428
Proportion 0.508 0.034 0.165 0.001 0.293
--------------------------------------------------------------------------------
CentralAir
n missing distinct
1460 0 2
Value N Y
Frequency 95 1365
Proportion 0.065 0.935
--------------------------------------------------------------------------------
Electrical
n missing distinct
1459 1 5
lowest : FuseA FuseF FuseP Mix SBrkr, highest: FuseA FuseF FuseP Mix SBrkr
Value FuseA FuseF FuseP Mix SBrkr
Frequency 94 27 3 1 1334
Proportion 0.064 0.019 0.002 0.001 0.914
--------------------------------------------------------------------------------
X1stFlrSF
n missing distinct Info Mean Gmd .05 .10
1460 0 753 1 1163 416.4 673.0 756.9
.25 .50 .75 .90 .95
882.0 1087.0 1391.2 1680.0 1831.2
lowest : 334 372 438 480 483, highest: 2633 2898 3138 3228 4692
--------------------------------------------------------------------------------
X2ndFlrSF
n missing distinct Info Mean Gmd .05 .10
1460 0 417 0.817 347 450.2 0.0 0.0
.25 .50 .75 .90 .95
0.0 0.0 728.0 954.2 1141.0
lowest : 0 110 167 192 208, highest: 1611 1796 1818 1872 2065
--------------------------------------------------------------------------------
LowQualFinSF
n missing distinct Info Mean Gmd .05 .10
1460 0 24 0.052 5.845 11.55 0 0
.25 .50 .75 .90 .95
0 0 0 0 0
lowest : 0 53 80 120 144, highest: 513 514 515 528 572
--------------------------------------------------------------------------------
GrLivArea
n missing distinct Info Mean Gmd .05 .10
1460 0 861 1 1515 563.1 848 912
.25 .50 .75 .90 .95
1130 1464 1777 2158 2466
lowest : 334 438 480 520 605, highest: 3627 4316 4476 4676 5642
--------------------------------------------------------------------------------
BsmtFullBath
n missing distinct Info Mean Gmd
1460 0 4 0.733 0.4253 0.5085
Value 0 1 2 3
Frequency 856 588 15 1
Proportion 0.586 0.403 0.010 0.001
--------------------------------------------------------------------------------
BsmtHalfBath
n missing distinct Info Mean Gmd
1460 0 3 0.159 0.05753 0.1088
Value 0 1 2
Frequency 1378 80 2
Proportion 0.944 0.055 0.001
--------------------------------------------------------------------------------
FullBath
n missing distinct Info Mean Gmd
1460 0 4 0.766 1.565 0.5521
Value 0 1 2 3
Frequency 9 650 768 33
Proportion 0.006 0.445 0.526 0.023
--------------------------------------------------------------------------------
HalfBath
n missing distinct Info Mean Gmd
1460 0 3 0.706 0.3829 0.4852
Value 0 1 2
Frequency 913 535 12
Proportion 0.625 0.366 0.008
--------------------------------------------------------------------------------
BedroomAbvGr
n missing distinct Info Mean Gmd
1460 0 8 0.815 2.866 0.818
lowest : 0 1 2 3 4, highest: 3 4 5 6 8
Value 0 1 2 3 4 5 6 8
Frequency 6 50 358 804 213 21 7 1
Proportion 0.004 0.034 0.245 0.551 0.146 0.014 0.005 0.001
--------------------------------------------------------------------------------
KitchenAbvGr
n missing distinct Info Mean Gmd
1460 0 4 0.133 1.047 0.09174
Value 0 1 2 3
Frequency 1 1392 65 2
Proportion 0.001 0.953 0.045 0.001
--------------------------------------------------------------------------------
KitchenQual
n missing distinct
1460 0 4
Value Ex Fa Gd TA
Frequency 100 39 586 735
Proportion 0.068 0.027 0.401 0.503
--------------------------------------------------------------------------------
TotRmsAbvGrd
n missing distinct Info Mean Gmd .05 .10
1460 0 12 0.958 6.518 1.762 4 5
.25 .50 .75 .90 .95
5 6 7 9 10
lowest : 2 3 4 5 6, highest: 9 10 11 12 14
Value 2 3 4 5 6 7 8 9 10 11 12
Frequency 1 17 97 275 402 329 187 75 47 18 11
Proportion 0.001 0.012 0.066 0.188 0.275 0.225 0.128 0.051 0.032 0.012 0.008
Value 14
Frequency 1
Proportion 0.001
--------------------------------------------------------------------------------
Functional
n missing distinct
1460 0 7
lowest : Maj1 Maj2 Min1 Min2 Mod , highest: Min1 Min2 Mod Sev Typ
Value Maj1 Maj2 Min1 Min2 Mod Sev Typ
Frequency 14 5 31 34 15 1 1360
Proportion 0.010 0.003 0.021 0.023 0.010 0.001 0.932
--------------------------------------------------------------------------------
Fireplaces
n missing distinct Info Mean Gmd
1460 0 4 0.806 0.613 0.6566
Value 0 1 2 3
Frequency 690 650 115 5
Proportion 0.473 0.445 0.079 0.003
--------------------------------------------------------------------------------
FireplaceQu
n missing distinct
770 690 5
lowest : Ex Fa Gd Po TA, highest: Ex Fa Gd Po TA
Value Ex Fa Gd Po TA
Frequency 24 33 380 20 313
Proportion 0.031 0.043 0.494 0.026 0.406
--------------------------------------------------------------------------------
GarageType
n missing distinct
1379 81 6
lowest : 2Types Attchd Basment BuiltIn CarPort
highest: Attchd Basment BuiltIn CarPort Detchd
Value 2Types Attchd Basment BuiltIn CarPort Detchd
Frequency 6 870 19 88 9 387
Proportion 0.004 0.631 0.014 0.064 0.007 0.281
--------------------------------------------------------------------------------
GarageYrBlt
n missing distinct Info Mean Gmd .05 .10
1379 81 97 1 1979 27.63 1930 1945
.25 .50 .75 .90 .95
1961 1980 2002 2006 2007
lowest : 1900 1906 1908 1910 1914, highest: 2006 2007 2008 2009 2010
--------------------------------------------------------------------------------
GarageFinish
n missing distinct
1379 81 3
Value Fin RFn Unf
Frequency 352 422 605
Proportion 0.255 0.306 0.439
--------------------------------------------------------------------------------
GarageCars
n missing distinct Info Mean Gmd
1460 0 5 0.802 1.767 0.7609
lowest : 0 1 2 3 4, highest: 0 1 2 3 4
Value 0 1 2 3 4
Frequency 81 369 824 181 5
Proportion 0.055 0.253 0.564 0.124 0.003
--------------------------------------------------------------------------------
GarageArea
n missing distinct Info Mean Gmd .05 .10
1460 0 441 1 473 234.9 0.0 240.0
.25 .50 .75 .90 .95
334.5 480.0 576.0 757.1 850.1
lowest : 0 160 164 180 186, highest: 1220 1248 1356 1390 1418
--------------------------------------------------------------------------------
GarageQual
n missing distinct
1379 81 5
lowest : Ex Fa Gd Po TA, highest: Ex Fa Gd Po TA
Value Ex Fa Gd Po TA
Frequency 3 48 14 3 1311
Proportion 0.002 0.035 0.010 0.002 0.951
--------------------------------------------------------------------------------
GarageCond
n missing distinct
1379 81 5
lowest : Ex Fa Gd Po TA, highest: Ex Fa Gd Po TA
Value Ex Fa Gd Po TA
Frequency 2 35 9 7 1326
Proportion 0.001 0.025 0.007 0.005 0.962
--------------------------------------------------------------------------------
PavedDrive
n missing distinct
1460 0 3
Value N P Y
Frequency 90 30 1340
Proportion 0.062 0.021 0.918
--------------------------------------------------------------------------------
WoodDeckSF
n missing distinct Info Mean Gmd .05 .10
1460 0 274 0.858 94.24 125 0 0
.25 .50 .75 .90 .95
0 0 168 262 335
lowest : 0 12 24 26 28, highest: 668 670 728 736 857
--------------------------------------------------------------------------------
OpenPorchSF
n missing distinct Info Mean Gmd .05 .10
1460 0 202 0.909 46.66 62.43 0 0
.25 .50 .75 .90 .95
0 25 68 130 175
lowest : 0 4 8 10 11, highest: 406 418 502 523 547
--------------------------------------------------------------------------------
EnclosedPorch
n missing distinct Info Mean Gmd .05 .10
1460 0 120 0.369 21.95 39.39 0.0 0.0
.25 .50 .75 .90 .95
0.0 0.0 0.0 112.0 180.1
lowest : 0 19 20 24 30, highest: 301 318 330 386 552
--------------------------------------------------------------------------------
X3SsnPorch
n missing distinct Info Mean Gmd .05 .10
1460 0 20 0.049 3.41 6.739 0 0
.25 .50 .75 .90 .95
0 0 0 0 0
lowest : 0 23 96 130 140, highest: 290 304 320 407 508
Value 0 23 96 130 140 144 153 162 168 180 182
Frequency 1436 1 1 1 1 2 1 1 3 2 1
Proportion 0.984 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.002 0.001 0.001
Value 196 216 238 245 290 304 320 407 508
Frequency 1 2 1 1 1 1 1 1 1
Proportion 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
--------------------------------------------------------------------------------
ScreenPorch
n missing distinct Info Mean Gmd .05 .10
1460 0 76 0.22 15.06 28.27 0 0
.25 .50 .75 .90 .95
0 0 0 0 160
lowest : 0 40 53 60 63, highest: 385 396 410 440 480
--------------------------------------------------------------------------------
PoolArea
n missing distinct Info Mean Gmd
1460 0 8 0.014 2.759 5.497
lowest : 0 480 512 519 555, highest: 519 555 576 648 738
Value 0 480 512 519 555 576 648 738
Frequency 1453 1 1 1 1 1 1 1
Proportion 0.995 0.001 0.001 0.001 0.001 0.001 0.001 0.001
--------------------------------------------------------------------------------
PoolQC
n missing distinct
7 1453 3
Value Ex Fa Gd
Frequency 2 2 3
Proportion 0.286 0.286 0.429
--------------------------------------------------------------------------------
Fence
n missing distinct
281 1179 4
Value GdPrv GdWo MnPrv MnWw
Frequency 59 54 157 11
Proportion 0.210 0.192 0.559 0.039
--------------------------------------------------------------------------------
MiscFeature
n missing distinct
54 1406 4
Value Gar2 Othr Shed TenC
Frequency 2 2 49 1
Proportion 0.037 0.037 0.907 0.019
--------------------------------------------------------------------------------
MiscVal
n missing distinct Info Mean Gmd .05 .10
1460 0 21 0.103 43.49 85.67 0 0
.25 .50 .75 .90 .95
0 0 0 0 0
lowest : 0 54 350 400 450, highest: 2000 2500 3500 8300 15500
Value 0 50 350 400 450 500 550 600 700 800 1150
Frequency 1408 1 1 11 4 10 1 5 5 1 1
Proportion 0.964 0.001 0.001 0.008 0.003 0.007 0.001 0.003 0.003 0.001 0.001
Value 1200 1300 1400 2000 2500 3500 8300 15500
Frequency 2 1 1 4 1 1 1 1
Proportion 0.001 0.001 0.001 0.003 0.001 0.001 0.001 0.001
For the frequency table, variable is rounded to the nearest 50
--------------------------------------------------------------------------------
MoSold
n missing distinct Info Mean Gmd .05 .10
1460 0 12 0.985 6.322 3.041 2 3
.25 .50 .75 .90 .95
5 6 8 10 11
lowest : 1 2 3 4 5, highest: 8 9 10 11 12
Value 1 2 3 4 5 6 7 8 9 10 11
Frequency 58 52 106 141 204 253 234 122 63 89 79
Proportion 0.040 0.036 0.073 0.097 0.140 0.173 0.160 0.084 0.043 0.061 0.054
Value 12
Frequency 59
Proportion 0.040
--------------------------------------------------------------------------------
YrSold
n missing distinct Info Mean Gmd
1460 0 5 0.955 2008 1.498
lowest : 2006 2007 2008 2009 2010, highest: 2006 2007 2008 2009 2010
Value 2006 2007 2008 2009 2010
Frequency 314 329 304 338 175
Proportion 0.215 0.225 0.208 0.232 0.120
--------------------------------------------------------------------------------
SaleType
n missing distinct
1460 0 9
lowest : COD Con ConLD ConLI ConLw, highest: ConLw CWD New Oth WD
Value COD Con ConLD ConLI ConLw CWD New Oth WD
Frequency 43 2 9 5 5 4 122 3 1267
Proportion 0.029 0.001 0.006 0.003 0.003 0.003 0.084 0.002 0.868
--------------------------------------------------------------------------------
SaleCondition
n missing distinct
1460 0 6
lowest : Abnorml AdjLand Alloca Family Normal
highest: AdjLand Alloca Family Normal Partial
Value Abnorml AdjLand Alloca Family Normal Partial
Frequency 101 4 12 20 1198 125
Proportion 0.069 0.003 0.008 0.014 0.821 0.086
--------------------------------------------------------------------------------
SalePrice
n missing distinct Info Mean Gmd .05 .10
1460 0 663 1 180921 81086 88000 106475
.25 .50 .75 .90 .95
129975 163000 214000 278000 326100
lowest : 34900 35311 37900 39300 40000, highest: 582933 611657 625000 745000 755000
--------------------------------------------------------------------------------
library(corrplot)
# flag the numeric columns
nums <- vapply(hp_train, is.numeric, logical(1))
# replace NAs with 0s in the numeric columns only (factor columns cannot hold 0)
hp_train[nums] <- lapply(hp_train[nums], function(v) replace(v, is.na(v), 0))
hpnum_train <- hp_train[,nums]
# obtain correlations (named cor_mat rather than T, which would mask the shorthand for TRUE)
cor_mat <- cor(hpnum_train)
# plot correlation plot, shrinking the text so that all labels fit
cex.before <- par("cex")
par(cex = 0.5)
corrplot(cor_mat, type = "upper", insig = "blank", method = "color",
         addCoef.col="grey",
         order = "AOE", tl.cex = 1/par("cex"),
         cl.cex = 1/par("cex"), addCoefasPercent = TRUE)
# restore the previous text size
par(cex = cex.before)
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
Derive a correlation matrix for any three quantitative variables in the dataset.
Test the hypothesis that the correlation between each pair of variables is 0, and provide an 80% confidence interval for each.
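One way to carry out this step is to loop `cor.test()` over every pair of variables with `conf.level = 0.80`. The sketch below runs on simulated stand-in data, not on the housing data; the column names `area`, `qual`, and `price` and the coefficients used to generate them are made up for illustration, and three quantitative columns of `hp_train` would take their place.

```r
set.seed(605)
# simulated stand-in for three quantitative columns (illustrative values only)
n <- 200
area <- rnorm(n, mean = 1500, sd = 500)
qual <- 0.002 * area + rnorm(n, sd = 1)
price <- 100 * area + 20000 * qual + rnorm(n, sd = 40000)
vars <- data.frame(area, qual, price)

# correlation matrix for the three variables
cor(vars)

# test H0: rho = 0 for every pair and keep the 80% confidence interval
pair_list <- combn(names(vars), 2, simplify = FALSE)
results <- do.call(rbind, lapply(pair_list, function(p) {
  ct <- cor.test(vars[[p[1]]], vars[[p[2]]], conf.level = 0.80)
  data.frame(pair = paste(p, collapse = " ~ "),
             estimate = unname(ct$estimate),
             lower = ct$conf.int[1],
             upper = ct$conf.int[2],
             p.value = ct$p.value)
}))
results
```

An interval that excludes 0 corresponds to rejecting the null hypothesis of zero correlation at the 20% level.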
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
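The familywise question can be made concrete with a little arithmetic. A sketch, assuming three pairwise tests that are each run at the 20% level implied by an 80% interval, and treating the tests as independent (an approximation, since tests that share variables are correlated):

```r
alpha <- 0.20   # per-test error rate implied by an 80% confidence interval
m <- 3          # number of pairwise tests among three variables

# probability of at least one false rejection across the family of tests
fwer <- 1 - (1 - alpha)^m
fwer

# Bonferroni-adjusted per-test level that keeps the familywise rate at or below alpha
alpha_bonf <- alpha / m
alpha_bonf
```

Base R's `p.adjust()` applies the same kind of correction directly to a vector of p-values.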
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
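The matrix steps can be sketched in base R. The correlation matrix below contains made-up values for illustration and would be replaced by the one derived from the data; `lu_decompose` is a hypothetical helper written for this example (Doolittle's method without pivoting, which is adequate for a positive definite matrix), and applying it to the correlation matrix is one reading of "the matrix" in the prompt.

```r
# illustrative 3 x 3 correlation matrix (made-up values)
R <- matrix(c(1.0, 0.6, 0.3,
              0.6, 1.0, 0.5,
              0.3, 0.5, 1.0), nrow = 3, byrow = TRUE)

P <- solve(R)          # precision matrix
diag(P)                # variance inflation factors sit on the diagonal
round(R %*% P, 10)     # correlation times precision gives the identity
round(P %*% R, 10)     # precision times correlation gives the identity

# Doolittle LU decomposition without pivoting: A = L %*% U with a
# unit lower-triangular L and an upper-triangular U
lu_decompose <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (j in seq_len(n - 1)) {
    for (i in (j + 1):n) {
      L[i, j] <- U[i, j] / U[j, j]           # elimination multiplier
      U[i, ] <- U[i, ] - L[i, j] * U[j, ]    # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

lu <- lu_decompose(R)
lu$L
lu$U
lu$L %*% lu$U          # reproduces R
```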
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html).
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable.
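The fitting and sampling steps might look like the sketch below. It uses a simulated right-skewed variable as a stand-in (the name `skewed` and the log-normal parameters are assumptions for illustration); in the assignment a skewed column of the training data would be used instead.

```r
library(MASS)

set.seed(605)
# simulated right-skewed stand-in for the chosen variable
skewed <- rlnorm(1460, meanlog = 7, sdlog = 0.5)

# fit an exponential density; the estimate is the rate parameter lambda
fit <- fitdistr(skewed, densfun = "exponential")
lambda <- unname(fit$estimate)

# for the exponential, the maximum likelihood estimate equals 1 / mean
c(lambda = lambda, one_over_mean = 1 / mean(skewed))

# draw 1000 samples and compare the two histograms side by side
samples <- rexp(1000, rate = lambda)
old_par <- par(mfrow = c(1, 2))
hist(skewed, breaks = 30, main = "Original variable")
hist(samples, breaks = 30, main = "Exponential samples")
par(old_par)
```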
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
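As a worked illustration, suppose SalePrice were the chosen variable (an assumption made here, not something the prompt fixes). Its mean of 180921 is taken from the univariate summary above, and the exponential rate is its reciprocal. The percentiles then follow from the inverse CDF:

```r
# assumption for illustration: SalePrice is the chosen variable and its mean
# is read off the summary output
mean_price <- 180921
lambda <- 1 / mean_price          # maximum likelihood rate for an exponential fit

# percentiles from the inverse CDF
p <- c(0.05, 0.95)
exp_pct <- qexp(p, rate = lambda)

# the same values from solving F(q) = 1 - exp(-lambda * q) = p for q
exp_pct_manual <- -log(1 - p) / lambda

rbind(qexp = exp_pct, manual = exp_pct_manual)
```

The summary reports empirical 5th and 95th percentiles of 88000 and 326100 for SalePrice, so an exponential fit of this kind would place far more mass near zero and in the upper tail than the data show.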
Also generate a 95% confidence interval from the empirical data, assuming normality.
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
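The last two quantities can be sketched as follows, again on a simulated stand-in variable (the name `skewed` and its parameters are illustrative). The interval is read here as a 95% confidence interval for the mean under a normal approximation, which is one interpretation of the prompt.

```r
set.seed(605)
# simulated right-skewed stand-in for the chosen variable
skewed <- rlnorm(1460, meanlog = 7, sdlog = 0.5)

n <- length(skewed)
xbar <- mean(skewed)
se <- sd(skewed) / sqrt(n)        # standard error of the mean

# 95% confidence interval for the mean, assuming normality
ci_mean <- xbar + c(-1, 1) * qnorm(0.975) * se

# empirical 5th and 95th percentiles of the data
emp_pct <- quantile(skewed, probs = c(0.05, 0.95))

ci_mean
emp_pct
```

The confidence interval describes uncertainty about the mean, while the empirical percentiles describe the spread of individual observations, so the two ranges are not expected to match.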
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
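A multiple regression of this kind could be set up along the following lines. The sketch fits `lm()` to simulated data whose column names mirror the training set (`GrLivArea`, `OverallQual`, `SalePrice`); the generating coefficients, the `new_homes` rows, and the two-column `Id`, `SalePrice` submission layout are assumptions for illustration, and no competition score is implied.

```r
set.seed(605)
n <- 300
# simulated stand-in for the training data (illustrative values only)
train <- data.frame(
  GrLivArea = runif(n, 600, 3000),
  OverallQual = sample(1:10, n, replace = TRUE)
)
train$SalePrice <- 20000 + 60 * train$GrLivArea +
  15000 * train$OverallQual + rnorm(n, sd = 20000)

# fit the multiple regression and inspect coefficients, R-squared, and residuals
fit <- lm(SalePrice ~ GrLivArea + OverallQual, data = train)
summary(fit)

# predict for new rows and arrange them in an assumed Id, SalePrice layout
new_homes <- data.frame(Id = 1461:1463,
                        GrLivArea = c(900, 1500, 2400),
                        OverallQual = c(5, 6, 8))
submission <- data.frame(Id = new_homes$Id,
                         SalePrice = predict(fit, newdata = new_homes))
submission
```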