CUNY SPS DATA 605 Final

by Mike Silva

2019-05-24

NOTE: The instructions/questions are in bold. My responses and comments are in italics.

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=\frac{N+1}{2}\).

set.seed(12345)
N <- 10
n <- 10000
mu <- sigma <- (N + 1)/2
df <- data.frame(X = runif(n, min=1, max=N), 
                 Y = rnorm(n, mean=mu, sd=sigma))
summary(df$X)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.283   5.541   5.506   7.742  10.000 
summary(df$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-14.664   1.846   5.482   5.500   9.154  26.766 

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

x <- median(df$X)
y <- as.numeric(quantile(df$Y)["25%"])

Small letter “x” is 5.540952 and small letter “y” is 1.8460758.

  1. \(P(X>x | X>y)\)
P_a_and_b <- df %>%
  filter(X > x,
         X > y) %>%
  nrow() / n

P_b <- df %>%
  filter(X > y) %>%
  nrow() / n

problem_1a <- P_a_and_b / P_b

The probablity of a random number uniformly ranging from 1 to 10 being greater than 5.540952 given that it is greater than 1.8460758 is 0.5512679

  1. \(P(X>x, Y>y)\)
problem_1b <- df %>%
  filter(X > x,
         Y > y) %>%
  nrow() / n

The probablity of a random number uniformly ranging from 1 to 10 being greater than 5.540952 and a random normally distributed number with a mean and standard deviation of 5.5 being greater than 1.8460758 is 0.3808

  1. \(P(X<x | X>y)\)
P_a_and_b <- df %>%
  filter(X < x,
         X > y) %>%
  nrow() / n

P_b <- df %>%
  filter(X > y) %>%
  nrow() / n

problem_1c <- P_a_and_b / P_b

The probablity of a random number uniformly ranging from 1 to 10 being less than 5.540952 given that it is greater than 1.8460758 is 0.4487321

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

# Create Joint Probabilities
temp <- df %>%
  mutate(A = ifelse(X > x, " X greater than x", " X not greater than x"),
         B = ifelse(Y > y, " Y greater than y", " Y not greater than y")) %>%
  group_by(A, B) %>%
  summarise(count = n()) %>%
  mutate(probability = count / n)

# Create Marginal Probabilities
temp <- temp %>%
  ungroup() %>%
  group_by(A) %>%
  summarise(count = sum(count),
            probability = sum(probability)) %>%
  mutate(B = "Total") %>%
  bind_rows(temp)

temp <- temp %>%
  ungroup() %>%
  group_by(B) %>%
  summarise(count = sum(count),
            probability = sum(probability)) %>%
  mutate(A = "Total") %>%
  bind_rows(temp)

# Create Table
temp %>%
  select(-count) %>%
  spread(A, probability) %>%
  rename(" " = B) %>%
  kable() %>%
  kable_styling()
X greater than x X not greater than x Total
Y greater than y 0.3808 0.3692 0.75
Y not greater than y 0.1192 0.1308 0.25
Total 0.5000 0.5000 1.00

P(X>x and Y>y) is 0.3808. P(X>x)P(Y>y) is 0.5 \(\times\) 0.75 which is 0.375. They are not the same.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

count_data <- temp %>%
  filter(A != "Total",
         B != "Total") %>%
  select(-probability) %>%
  spread(A, count) %>%
  as.data.frame()

row.names(count_data) <- count_data$B

count_data <- count_data %>%
  select(-B) %>%
  as.matrix() 


fisher.test(count_data)

    Fisher's Exact Test for Count Data

data:  count_data
p-value = 0.007904
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.032680 1.240419
sample estimates:
odds ratio 
  1.131777 
chisq.test(count_data)

    Pearson's Chi-squared test with Yates' continuity correction

data:  count_data
X-squared = 7.0533, df = 1, p-value = 0.007912

Fisher’s Exact Test is for is used when you have small cell sizes (less than 5). The Chi Square Test is used when the cell sizes are large. It would be appropriate in this case.

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.

kaggle <- read.csv("Kaggle_House_Prices/train.csv") %>%
  # Removing outliers per http://jse.amstat.org/v19n3/decock.pdf
  filter(GrLivArea < 4000)
  
  
fill_holes <- function(df){
  df %>%
    # Filling in missing values with zeros where it makes sense
    mutate(BedroomAbvGr = replace_na(BedroomAbvGr, 0),
           BsmtFullBath = replace_na(BsmtFullBath, 0),
           BsmtHalfBath = replace_na(BsmtHalfBath, 0),
           BsmtUnfSF = replace_na(BsmtUnfSF, 0),
          EnclosedPorch = replace_na(EnclosedPorch, 0),
          Fireplaces = replace_na(Fireplaces, 0),
          GarageArea = replace_na(GarageArea, 0),
          GarageCars = replace_na(GarageCars, 0),
          HalfBath = replace_na(HalfBath, 0),
          KitchenAbvGr = replace_na(KitchenAbvGr, 0),
          LotFrontage = replace_na(LotFrontage, 0),
          OpenPorchSF = replace_na(OpenPorchSF, 0),
          PoolArea = replace_na(PoolArea, 0),
          ScreenPorch = replace_na(ScreenPorch, 0),
          TotRmsAbvGrd = replace_na(TotRmsAbvGrd, 0),
          WoodDeckSF = replace_na(WoodDeckSF, 0)) 
}

kaggle <- fill_holes(kaggle)

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set.

Id

Measure Value
Min 1.000
First Quartile 364.750
Median 730.500
Mean 729.967
Third Quartile 1094.250
Max 1460.000

MSSubClass

Measure Value
Min 20.00000
First Quartile 20.00000
Median 50.00000
Mean 56.88874
Third Quartile 70.00000
Max 190.00000

MSZoning

MSZoning count
C (all) 10
FV 65
RH 16
RL 1147
RM 218

LotFrontage

Measure Value
Min 0.00000
First Quartile 42.00000
Median 63.00000
Mean 57.29602
Third Quartile 79.00000
Max 313.00000

LotArea

Measure Value
Min 1300.00
First Quartile 7538.75
Median 9468.50
Mean 10448.78
Third Quartile 11588.00
Max 215245.00

Street

Street count
Grvl 6
Pave 1450

Alley

Alley count
Grvl 50
Pave 41
NA 1365

LotShape

LotShape count
IR1 481
IR2 41
IR3 9
Reg 925

LandContour

LandContour count
Bnk 61
HLS 50
Low 36
Lvl 1309

Utilities

Utilities count
AllPub 1455
NoSeWa 1

LotConfig

LotConfig count
Corner 260
CulDSac 94
FR2 47
FR3 4
Inside 1051

LandSlope

LandSlope count
Gtl 1378
Mod 65
Sev 13

Neighborhood

Neighborhood count
Blmngtn 17
Blueste 2
BrDale 16
BrkSide 58
ClearCr 28
CollgCr 150
Crawfor 51
Edwards 98
Gilbert 79
IDOTRR 37
MeadowV 17
Mitchel 49
NAmes 225
NoRidge 39
NPkVill 9
NridgHt 77
NWAmes 73
OldTown 113
Sawyer 74
SawyerW 59
Somerst 86
StoneBr 25
SWISU 25
Timber 38
Veenker 11

Condition1

Condition1 count
Artery 48
Feedr 80
Norm 1258
PosA 8
PosN 18
RRAe 11
RRAn 26
RRNe 2
RRNn 5

Condition2

Condition2 count
Artery 2
Feedr 6
Norm 1442
PosA 1
PosN 1
RRAe 1
RRAn 1
RRNn 2

BldgType

BldgType count
1Fam 1216
2fmCon 31
Duplex 52
Twnhs 43
TwnhsE 114

HouseStyle

HouseStyle count
1.5Fin 154
1.5Unf 14
1Story 726
2.5Fin 8
2.5Unf 11
2Story 441
SFoyer 37
SLvl 65

OverallQual

Measure Value
Min 1.000000
First Quartile 5.000000
Median 6.000000
Mean 6.088599
Third Quartile 7.000000
Max 10.000000

OverallCond

Measure Value
Min 1.000000
First Quartile 5.000000
Median 5.000000
Mean 5.576236
Third Quartile 6.000000
Max 9.000000

YearBuilt

Measure Value
Min 1872.000
First Quartile 1954.000
Median 1972.000
Mean 1971.185
Third Quartile 2000.000
Max 2010.000

YearRemodAdd

Measure Value
Min 1950.000
First Quartile 1966.750
Median 1993.500
Mean 1984.819
Third Quartile 2004.000
Max 2010.000

RoofStyle

RoofStyle count
Flat 13
Gable 1140
Gambrel 11
Hip 283
Mansard 7
Shed 2

RoofMatl

RoofMatl count
CompShg 1432
Membran 1
Metal 1
Roll 1
Tar&Grv 11
WdShake 5
WdShngl 5

Exterior1st

Exterior1st count
AsbShng 20
AsphShn 1
BrkComm 2
BrkFace 50
CBlock 1
CemntBd 60
HdBoard 221
ImStucc 1
MetalSd 220
Plywood 108
Stone 2
Stucco 24
VinylSd 515
Wd Sdng 205
WdShing 26

Exterior2nd

Exterior2nd count
AsbShng 20
AsphShn 3
Brk Cmn 7
BrkFace 25
CBlock 1
CmentBd 59
HdBoard 206
ImStucc 9
MetalSd 214
Other 1
Plywood 142
Stone 5
Stucco 25
VinylSd 504
Wd Sdng 197
Wd Shng 38

MasVnrType

MasVnrType count
BrkCmn 15
BrkFace 444
None 863
Stone 126
NA 8

MasVnrArea

Measure Value
Min 0.0000
First Quartile 0.0000
Median 0.0000
Mean 102.0877
Third Quartile 164.2500
Max 1600.0000

ExterQual

ExterQual count
Ex 49
Fa 14
Gd 487
TA 906

ExterCond

ExterCond count
Ex 3
Fa 28
Gd 146
Po 1
TA 1278

Foundation

Foundation count
BrkTil 146
CBlock 634
PConc 643
Slab 24
Stone 6
Wood 3

BsmtQual

BsmtQual count
Ex 117
Fa 35
Gd 618
TA 649
NA 37

BsmtCond

BsmtCond count
Fa 45
Gd 65
Po 2
TA 1307
NA 37

BsmtExposure

BsmtExposure count
Av 220
Gd 131
Mn 114
No 953
NA 38

BsmtFinType1

BsmtFinType1 count
ALQ 220
BLQ 148
GLQ 414
LwQ 74
Rec 133
Unf 430
NA 37

BsmtFinSF1

Measure Value
Min 0.0000
First Quartile 0.0000
Median 381.0000
Mean 436.9911
Third Quartile 706.5000
Max 2188.0000

BsmtFinType2

BsmtFinType2 count
ALQ 19
BLQ 33
GLQ 14
LwQ 46
Rec 54
Unf 1252
NA 38

BsmtFinSF2

Measure Value
Min 0.0000
First Quartile 0.0000
Median 0.0000
Mean 46.6772
Third Quartile 0.0000
Max 1474.0000

BsmtUnfSF

Measure Value
Min 0.0000
First Quartile 222.5000
Median 477.5000
Mean 566.9904
Third Quartile 808.0000
Max 2336.0000

TotalBsmtSF

Measure Value
Min 0.000
First Quartile 795.000
Median 990.500
Mean 1050.659
Third Quartile 1293.750
Max 3206.000

Heating

Heating count
Floor 1
GasA 1424
GasW 18
Grav 7
OthW 2
Wall 4

HeatingQC

HeatingQC count
Ex 737
Fa 49
Gd 241
Po 1
TA 428

CentralAir

CentralAir count
N 95
Y 1361

Electrical

Electrical count
FuseA 94
FuseF 27
FuseP 3
Mix 1
SBrkr 1330
NA 1

X1stFlrSF

Measure Value
Min 334.000
First Quartile 882.000
Median 1086.000
Mean 1157.109
Third Quartile 1389.250
Max 3228.000

X2ndFlrSF

Measure Value
Min 0.000
First Quartile 0.000
Median 0.000
Mean 343.533
Third Quartile 728.000
Max 1818.000

LowQualFinSF

Measure Value
Min 0.000000
First Quartile 0.000000
Median 0.000000
Mean 5.860577
Third Quartile 0.000000
Max 572.000000

GrLivArea

Measure Value
Min 334.000
First Quartile 1128.000
Median 1458.500
Mean 1506.502
Third Quartile 1775.250
Max 3627.000

BsmtFullBath

Measure Value
Min 0.0000000
First Quartile 0.0000000
Median 0.0000000
Mean 0.4237637
Third Quartile 1.0000000
Max 3.0000000

BsmtHalfBath

Measure Value
Min 0.0000000
First Quartile 0.0000000
Median 0.0000000
Mean 0.0570055
Third Quartile 0.0000000
Max 2.0000000

FullBath

Measure Value
Min 0.000000
First Quartile 1.000000
Median 2.000000
Mean 1.561813
Third Quartile 2.000000
Max 3.000000

HalfBath

Measure Value
Min 0.0000000
First Quartile 0.0000000
Median 0.0000000
Mean 0.3811813
Third Quartile 1.0000000
Max 2.0000000

BedroomAbvGr

Measure Value
Min 0.000000
First Quartile 2.000000
Median 3.000000
Mean 2.864698
Third Quartile 3.000000
Max 8.000000

KitchenAbvGr

Measure Value
Min 0.000000
First Quartile 1.000000
Median 1.000000
Mean 1.046703
Third Quartile 1.000000
Max 3.000000

KitchenQual

KitchenQual count
Ex 96
Fa 39
Gd 586
TA 735

TotRmsAbvGrd

Measure Value
Min 2.000000
First Quartile 5.000000
Median 6.000000
Mean 6.506181
Third Quartile 7.000000
Max 14.000000

Functional

Functional count
Maj1 14
Maj2 5
Min1 31
Min2 34
Mod 15
Sev 1
Typ 1356

Fireplaces

Measure Value
Min 0.0000000
First Quartile 0.0000000
Median 1.0000000
Mean 0.6092033
Third Quartile 1.0000000
Max 3.0000000

FireplaceQu

FireplaceQu count
Ex 23
Fa 33
Gd 378
Po 20
TA 312
NA 690

GarageType

GarageType count
2Types 6
Attchd 867
Basment 19
BuiltIn 87
CarPort 9
Detchd 387
NA 81

GarageYrBlt

Measure Value
Min 1900.00
First Quartile 1961.00
Median 1980.00
Mean 1978.44
Third Quartile 2002.00
Max 2010.00

GarageFinish

GarageFinish count
Fin 348
RFn 422
Unf 605
NA 81

GarageCars

Measure Value
Min 0.000000
First Quartile 1.000000
Median 2.000000
Mean 1.764423
Third Quartile 2.000000
Max 4.000000

GarageArea

Measure Value
Min 0.0000
First Quartile 329.5000
Median 478.5000
Mean 471.5687
Third Quartile 576.0000
Max 1390.0000

GarageQual

GarageQual count
Ex 3
Fa 48
Gd 14
Po 3
TA 1307
NA 81

GarageCond

GarageCond count
Ex 2
Fa 35
Gd 9
Po 7
TA 1322
NA 81

PavedDrive

PavedDrive count
N 90
P 30
Y 1336

WoodDeckSF

Measure Value
Min 0.00000
First Quartile 0.00000
Median 0.00000
Mean 93.83379
Third Quartile 168.00000
Max 857.00000

OpenPorchSF

Measure Value
Min 0.00000
First Quartile 0.00000
Median 24.00000
Mean 46.22115
Third Quartile 68.00000
Max 547.00000

EnclosedPorch

Measure Value
Min 0.00000
First Quartile 0.00000
Median 0.00000
Mean 22.01442
Third Quartile 0.00000
Max 552.00000

X3SsnPorch

Measure Value
Min 0.000000
First Quartile 0.000000
Median 0.000000
Mean 3.418956
Third Quartile 0.000000
Max 508.000000

ScreenPorch

Measure Value
Min 0.00000
First Quartile 0.00000
Median 0.00000
Mean 15.10234
Third Quartile 0.00000
Max 480.00000

PoolArea

Measure Value
Min 0.000000
First Quartile 0.000000
Median 0.000000
Mean 2.055632
Third Quartile 0.000000
Max 738.000000

PoolQC

PoolQC count
Ex 1
Fa 2
Gd 2
NA 1451

Fence

Fence count
GdPrv 59
GdWo 54
MnPrv 156
MnWw 11
NA 1176

MiscFeature

MiscFeature count
Gar2 2
Othr 2
Shed 49
TenC 1
NA 1402

MiscVal

Measure Value
Min 0.00000
First Quartile 0.00000
Median 0.00000
Mean 43.60852
Third Quartile 0.00000
Max 15500.00000

MoSold

Measure Value
Min 1.000000
First Quartile 5.000000
Median 6.000000
Mean 6.326236
Third Quartile 8.000000
Max 12.000000

YrSold

Measure Value
Min 2006.000
First Quartile 2007.000
Median 2008.000
Mean 2007.817
Third Quartile 2009.000
Max 2010.000

SaleType

SaleType count
COD 43
Con 2
ConLD 9
ConLI 5
ConLw 5
CWD 4
New 120
Oth 3
WD 1265

SaleCondition

SaleCondition count
Abnorml 100
AdjLand 4
Alloca 12
Family 20
Normal 1197
Partial 123

SalePrice

Measure Value
Min 34900.0
First Quartile 129900.0
Median 163000.0
Mean 180151.2
Third Quartile 214000.0
Max 625000.0

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

kaggle %>%
  select(TotalBsmtSF, GrLivArea, YearBuilt, SalePrice) %>%
  pairs()

Derive a correlation matrix for any three quantitative variables in the dataset.

correlation_matrix <- kaggle %>%
  select(TotalBsmtSF, GrLivArea, YearBuilt) %>%
  cor() %>%
  as.matrix()

correlation_matrix %>%
  kable() %>%
  kable_styling()
TotalBsmtSF GrLivArea YearBuilt
TotalBsmtSF 1.0000000 0.3948292 0.3998666
GrLivArea 0.3948292 1.0000000 0.1926450
YearBuilt 0.3998666 0.1926450 1.0000000

Test the hypothesis that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

zero_vars <- 0
test <- 0

variables <- kaggle %>%
  select(-SalePrice) %>%
  names()

cat("<table><thead><tr><th>Variable</th><th>Correlation Equal to Zero</th></tr></thread><tbody>")
for(variable in variables){
  d <- kaggle[,names(kaggle) == variable]
  if(is.numeric(d)){
    test <- test + 1
    results <- cor.test(kaggle$SalePrice, d, conf.level = 0.8)
    if(0 > results$conf.int[1] & results$conf.int[2] > 0){
      hypothesis_test_results <- "Yes"
      zero_vars <- zero_vars + 1
    } else {
      hypothesis_test_results <- "No"
    }
    cat(paste("<tr><td>",variable,"</td><td>", hypothesis_test_results, "</td>"))
  }
}
cat("</tbody></table>")
Variable Correlation Equal to Zero
Id Yes
MSSubClass No
LotFrontage No
LotArea No
OverallQual No
OverallCond No
YearBuilt No
YearRemodAdd No
MasVnrArea No
BsmtFinSF1 No
BsmtFinSF2 Yes
BsmtUnfSF No
TotalBsmtSF No
X1stFlrSF No
X2ndFlrSF No
LowQualFinSF Yes
GrLivArea No
BsmtFullBath No
BsmtHalfBath No
FullBath No
HalfBath No
BedroomAbvGr No
KitchenAbvGr No
TotRmsAbvGrd No
Fireplaces No
GarageYrBlt No
GarageCars No
GarageArea No
WoodDeckSF No
OpenPorchSF No
EnclosedPorch No
X3SsnPorch No
ScreenPorch No
PoolArea Yes
MiscVal Yes
MoSold No
YrSold Yes

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

There are 6 variables that have no correlation with the sale price. I would be worried be woried about family-wise error because I just ran 37 hypothesis tests. The family-wise error rate would be: \({FWER} = 1 - (1 - .2)^6 = 1 - 0.262144 = 0.737856\). This means there is a really high probability of committing a Type I error.

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

precision_matrix <- inv(correlation_matrix)

precision_matrix %>%
  kable() %>%
  kable_styling()
1.3601385 -0.4489077 -0.4573942
-0.4489077 1.1867025 -0.0491090
-0.4573942 -0.0491090 1.1923573

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

This will generate the identity matrix.

correlation_matrix %*% precision_matrix %>%
  round() %>%
  kable() %>%
  kable_styling()
TotalBsmtSF 1 0 0
GrLivArea 0 1 0
YearBuilt 0 0 1

This will too.

precision_matrix %*% correlation_matrix %>%
  round() %>%
  kable() %>%
  kable_styling()
TotalBsmtSF GrLivArea YearBuilt
1 0 0
0 1 0
0 0 1

Conduct LU decomposition on the matrix.

lu_decomposition <- Matrix::expand(lu(correlation_matrix))

The LU decomposition should yield the correlation matrix after multiplying the two components. In other words this:

A <- lu_decomposition$L %*% lu_decomposition$U %>%
  as.matrix() 
colnames(A) <- colnames(correlation_matrix)
rownames(A) <- rownames(correlation_matrix)

A %>%
  kable() %>%
  kable_styling()
TotalBsmtSF GrLivArea YearBuilt
TotalBsmtSF 1.0000000 0.3948292 0.3998666
GrLivArea 0.3948292 1.0000000 0.1926450
YearBuilt 0.3998666 0.1926450 1.0000000

Should match this:

correlation_matrix %>%
  kable() %>%
  kable_styling()
TotalBsmtSF GrLivArea YearBuilt
TotalBsmtSF 1.0000000 0.3948292 0.3998666
GrLivArea 0.3948292 1.0000000 0.1926450
YearBuilt 0.3998666 0.1926450 1.0000000

Which it does.

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).

I will be using the Square Footage of the Unfinished Basement for this part of the exam. I know it is right skewed because the mean (566.9903846) is larger than the median (477.5) as shown in the figure below:

The minimum (0) is not smaller than zero so we don’t need to shift the data. Remember that the maximum value is 2336. I have already loaded the MASS package.

lambda <- fitdistr(kaggle$BsmtUnfSF, densfun = "exponential")$estimate

The optimal value for \(\lambda\) is 0.0017637.

samples <- rexp(1000, rate = lambda)

Plot a histogram and compare it with a histogram of your original variable.

The original data is not quite exponential, but it’s not a bad approximation. The exponential distribution data has a longer tail than the basement square footage data.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

The PDF would be \(f(x;\lambda) = \lambda e^{-\lambda x}\) where \(x \geq 0\) and otherwise zero.

The CDF would be \(F(x;\lambda) = 1 - e^{-\lambda x}\). lambda is 0.0017637. To find the 5th percentile we need to solve for x in:

\[0.05 = 1 - e^{-0.0017637 x}\]

\[-0.95 = - e^{-0.0017637 x}\]

\[-ln(0.95) = 0.0017637 x\]

\[x = \frac{-ln(0.95)}{0.0017637} = 29.0828047\] For the 95th percentile we need to solve for x in:

\[0.95 = 1 - e^{-0.0017637 x}\]

\[-0.05 = - e^{-0.0017637 x}\]

\[-ln(0.05) = 0.0017637 x\]

\[x = \frac{-ln(0.05)}{0.0017637} = 1698.551394\]

So the 5th and 95th percentiles are approximately 29 and 1699, respectively.

Also generate a 95% confidence interval from the empirical data, assuming normality.

mu <- mean(kaggle$BsmtUnfSF)
s <- sd(kaggle$BsmtUnfSF)
n <- nrow(kaggle)
error <- qnorm(0.975) * s / sqrt(n)

ci <- c(mu - error, mu + error)
names(ci) <- c("5%", "95%")
ci
      5%      95% 
544.2769 589.7038 

The 95% confidence interval (assuming normality) is 544 to 590. Although what I think was meant was assume the data is normally distributed. Calculate the 5th and 95th percentile. That is found using \(x = \mu + Z \sigma\). The Z is -1.645 for the 5th percentile, and 1.645 for the 95th. So the percentiles would be -160 and 1294.

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

quantile(kaggle$BsmtUnfSF, c(0.05, 0.95))
  5%  95% 
   0 1468 

The actual 5th percentile is 0 and the 95th is 1468. So the findings are summarized in the following table:

Method 5% 95%
Exponential CDF 29 1699
Normal 95% CI 544 590
Normal Percentiles -160 1294
Emperical Percentiles 0 1468

If we model the data as exponentially distributed the 5th percentile is 29. If we model it as normally distributed the 5th is at -160 which in the context of square footage does not make any sense. The actual 5th percentile is 0. The difference is explained to the assumed shape/distibution of the underlying data.

Looking at the 95th percentile we have 393 if the data are exponentially distributed, 1294 if it is normally distributed and 1468 in reality. Again the difference is due to the assumed shape.

I have left out the 95% CI from the discussion because the confidence interval is a way of estimating the mean of the population. We would know if we took 100 estimates that the actual mean falls within the confidence intervals 95% of the time. Within this context it is meaningless.

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Since there are so many variables I will be using a random forest to detect which variables are the most important and will be using them to build the model. First I will drop out the variables that are missing a lot of data or otherwise meaningless.

set.seed(12345)

rf_data <- kaggle %>%
  # Dropping variables with a lot of missing values
  select(-Alley, -MasVnrType, -MasVnrArea, -BsmtQual, -BsmtCond, -BsmtExposure, -BsmtFinType1, -BsmtFinType2, -Electrical, -FireplaceQu, -GarageType, -GarageYrBlt, -GarageFinish, -GarageQual, -GarageCond, -PoolQC, -Fence, -MiscFeature) %>%
  # Dropping the ID because it is meaningless
  select(-Id)

ctrl <- trainControl(method = "cv", number = 5)
rf_fit <- train(SalePrice ~ ., 
                data = rf_data, 
                method = "rf", 
                ntree = 1000, 
                trControl = ctrl, 
                tuneGrid = data.frame(mtry = 10), 
                importance = TRUE, 
                verbose = TRUE)

Now we can examine the variables to in order of importance:

imp_var <- importance(rf_fit$finalModel) 
variables <- rownames(imp_var)

imp_var <- imp_var %>%
  as.data.frame()

names(imp_var) <- c("IncMSE", "IncNodePurity")
imp_var$Variable <- variables

imp_var %>%
  select(Variable, IncMSE, IncNodePurity) %>%
  arrange(-IncMSE) %>%
  kable() %>%
  kable_styling()
Variable IncMSE IncNodePurity
GrLivArea 27.7642054 558222457815
X1stFlrSF 23.8402736 438261813340
GarageArea 23.4461796 396371093603
TotalBsmtSF 23.0135452 466705746656
X2ndFlrSF 21.8530314 173856510098
BsmtFinSF1 21.4807603 258028180660
GarageCars 20.4468933 441435019266
LotArea 19.9500054 223742546295
YearRemodAdd 18.9757012 244497915498
TotRmsAbvGrd 18.6221084 225898592051
Fireplaces 18.2525067 183834633265
YearBuilt 18.2095526 365182036863
FullBath 18.0785926 267428897407
OverallQual07 17.8122831 54778740547
KitchenQualTA 17.3133398 212814029087
ExterQualTA 16.7886571 319459715203
OverallQual08 15.5552231 148055942628
ExterQualGd 15.4469920 168787089403
KitchenQualGd 14.9208736 81076347248
OverallQual05 14.6904962 71828292878
FoundationPConc 14.6041206 152501331460
HalfBath 14.1326475 64835240662
MSSubClassSC60 14.0719635 80260652096
OverallQual06 13.5570802 34216698228
BsmtUnfSF 13.0750197 92376101616
MSZoningRM 12.9698822 44695503360
BedroomAbvGr 12.8961520 64408140622
OpenPorchSF 12.6017606 150317042035
FoundationCBlock 12.0103478 61632768206
MSZoningRL 11.9412630 27030146717
HouseStyle2Story 11.9130210 37962672713
OverallQual09 11.8806930 125777334119
WoodDeckSF 11.1052709 101105003058
CentralAirY 11.0840199 29474398534
BsmtFullBath 10.7850346 47184296081
HouseStyle1Story 10.7231739 22313763589
MSSubClassSC30 10.0277032 23314258488
OverallCond05 9.9576233 36917969145
OverallQual04 9.9174075 29559767577
NeighborhoodNridgHt 9.6892514 111376178800
HeatingQCTA 9.5295453 34593868828
LotFrontage 9.4974677 132965031566
RoofStyleHip 9.1111191 56874431472
Exterior1stVinylSd 9.0163227 33230520180
OverallQual10 8.9706238 74465306490
MSSubClassSC160 8.8767743 7305541842
MSSubClassSC20 8.7685719 16412418205
PavedDriveY 8.6151987 19361352452
NeighborhoodCrawfor 8.2457088 12809035999
Exterior2ndVinylSd 8.2103380 41530410869
NeighborhoodEdwards 8.2012692 10704691960
KitchenAbvGr 7.9417898 15029863724
RoofStyleGable 7.7734354 51311533488
NeighborhoodOldTown 7.6717584 11896312325
OverallCond03 7.6473501 5288730615
NeighborhoodNAmes 7.5605677 11095991736
SaleTypeNew 7.4116705 55406944533
SaleConditionPartial 7.3684266 64585720051
LotShapeReg 7.2638065 31883580269
NeighborhoodGilbert 7.1491821 6147200232
BldgTypeDuplex 7.0724285 6136830586
BldgTypeTwnhsE 6.9237689 7383513502
OverallCond07 6.8054355 8063560161
NeighborhoodCollgCr 6.6595760 8424088310
NeighborhoodNWAmes 6.5626023 6555080248
OverallCond06 6.3219140 8471565422
NeighborhoodStoneBr 6.2252129 24890630342
MSSubClassSC90 6.1941219 6504393659
NeighborhoodNoRidge 6.1344144 36921906867
Exterior1stMetalSd 6.0971684 8541438797
MSSubClassSC50 5.8609017 8254128902
Exterior1stWd Sdng 5.8517990 7959702982
EnclosedPorch 5.8189704 18698479917
FunctionalTyp 5.7728467 10038339426
NeighborhoodSawyer 5.5022107 3603890796
MSSubClassSC80 5.3093292 3230952260
Exterior1stPlywood 5.2983259 5921642459
NeighborhoodClearCr 5.2490482 7049418796
OverallQual03 5.2486329 4976370867
NeighborhoodIDOTRR 5.1772984 7740132342
BldgTypeTwnhs 5.1029673 3399903454
SaleTypeWD 4.9979941 28116185911
MSZoningFV 4.9899826 4560551576
Condition1Feedr 4.9064343 4229460756
FoundationSlab 4.8125409 3320013219
OverallCond04 4.7987513 7337757055
Exterior2ndMetalSd 4.7147506 8078977399
KitchenQualFa 4.6807084 6580727317
LandContourLvl 4.5175216 10851868623
NeighborhoodBrDale 4.5006819 1244979412
BsmtFinSF2 4.4427439 14755269865
Exterior2ndPlywood 4.3389015 6779855429
MSSubClassSC70 4.3346720 5432789230
NeighborhoodMeadowV 4.3141175 1900477409
Exterior1stBrkFace 4.1725914 9310006897
Exterior2ndWd Sdng 4.1206877 7969996880
NeighborhoodNPkVill 4.1072830 307630325
MSSubClassSC180 3.9546190 897817473
Exterior2ndCmentBd 3.9475897 14208226735
ScreenPorch 3.9353891 20193048493
Condition1Norm 3.9270879 7969386646
ExterQualFa 3.8908805 3940621831
NeighborhoodSomerst 3.8351411 10594426163
Exterior1stHdBoard 3.8145594 6510562101
NeighborhoodTimber 3.8134536 6647192482
Condition1RRAe 3.7427599 813366813
BldgType2fmCon 3.6976839 2334975817
NeighborhoodSawyerW 3.6511495 3337881959
HeatingQCGd 3.5784565 7463053600
LandContourLow 3.4966691 7556532641
OverallQual02 3.3424402 1189891163
Exterior1stCemntBd 3.3372101 16787314267
SaleConditionNormal 3.2183362 17926761077
Exterior2ndBrk Cmn 3.2100712 1019961967
HeatingWall 3.2040764 341103999
MSSubClassSC190 3.1475311 2515495382
LandSlopeMod 3.1044339 10164110731
OverallCond08 2.9852142 4930064723
NeighborhoodMitchel 2.9809616 2900131238
HouseStyleSFoyer 2.9569055 1266022780
HeatingQCFa 2.8838835 4450317437
OverallCond09 2.8275507 7077081468
NeighborhoodBrkSide 2.7912815 3973978708
BsmtHalfBath 2.7906051 4658438070
ExterCondTA 2.7144591 8476120568
LotConfigCulDSac 2.5664585 13518708455
RoofMatlCompShg 2.5467057 6536927501
ExterCondFa 2.4539984 5256440740
PavedDriveP 2.3440645 1414425905
RoofMatlWdShngl 2.3253509 4676063657
NeighborhoodSWISU 2.2522396 1766072445
MSSubClassSC85 2.0506240 544540749
FunctionalMin2 2.0094515 1829415750
Exterior1stBrkComm 1.9782157 660262229
Exterior2ndHdBoard 1.9310587 7989689692
MoSold08 1.8776579 6510428093
Condition1PosN 1.8227573 2015016657
RoofMatlWdShake 1.7477976 925564602
HouseStyle1.5Unf 1.6977007 538926943
OverallCond02 1.6665478 1087968559
MoSold12 1.6045188 4069288751
FunctionalMod 1.5783282 3750376514
SaleTypeCon 1.5139422 785116346
LandSlopeSev 1.5054188 4478543599
SaleConditionFamily 1.4954567 1731293795
MoSold02 1.4563728 4061854353
Condition1RRNn 1.4536000 374244244
X3SsnPorch 1.4063978 3394784358
YrSold 1.4058075 27849586937
Exterior2ndBrkFace 1.4047286 4653607861
MiscVal 1.3745684 2961160418
Condition2Norm 1.3393934 1045544631
HouseStyleSLvl 1.3383236 2994722232
HeatingGrav 1.3114232 1169839448
HeatingOthW 1.1410317 179771859
Condition1RRNe 1.0005004 58389749
HeatingGasA 0.9356422 4105193137
NeighborhoodVeenker 0.8932208 2766468742
MoSold07 0.8385579 9273652119
PoolArea 0.8051327 890273272
Exterior2ndStucco 0.7526820 3057051233
MSZoningRH 0.7341767 728711503
Exterior2ndStone 0.7281763 609268079
LotShapeIR2 0.7130099 10648833393
Exterior2ndWd Shng 0.5948916 4089126186
LotConfigFR2 0.5930134 2261351395
LandContourHLS 0.5662390 10265622877
FunctionalMin1 0.5600259 1474125037
HeatingGasW 0.5060180 3264903027
SaleTypeOth 0.4152644 208526780
SaleConditionAdjLand 0.3927154 222853090
HouseStyle2.5Unf 0.3866142 1079826559
Exterior1stStucco 0.2917621 2546562371
StreetPave 0.2476584 1040113916
MoSold06 0.2367773 7808912454
RoofStyleMansard 0.0876502 377569410
Condition1RRAn 0.0838131 1787292874
Exterior1stStone 0.0810245 566527056
MSSubClassSC75 0.0573733 3640324222
UtilitiesNoSeWa 0.0000000 24405758
Condition2PosA 0.0000000 592259874
Condition2PosN 0.0000000 242273046
Condition2RRAe 0.0000000 49299817
Condition2RRAn 0.0000000 23914616
RoofMatlMembran 0.0000000 332916895
RoofMatlMetal 0.0000000 52599758
RoofMatlRoll 0.0000000 29326184
Exterior1stAsphShn 0.0000000 70708416
Exterior1stCBlock 0.0000000 34417577
Exterior1stImStucc 0.0000000 78026941
Exterior2ndCBlock 0.0000000 72049298
Exterior2ndOther 0.0000000 204369573
ExterCondPo 0.0000000 140093549
HeatingQCPo 0.0000000 23603682
FunctionalSev 0.0000000 306054523
MoSold03 -0.1111648 8529224563
LotConfigFR3 -0.1376133 207387605
Exterior2ndImStucc -0.1553358 831022002
Condition2Feedr -0.1704940 232263919
Exterior2ndAsphShn -0.1883214 172305536
SaleTypeConLD -0.3108024 634606083
ExterCondGd -0.3127218 7102118290
MoSold10 -0.3885219 5339049068
FunctionalMaj2 -0.4629922 1083900261
SaleConditionAlloca -0.4684010 2992620163
SaleTypeConLw -0.5665886 299701708
MSSubClassSC45 -0.6039514 447356945
MoSold05 -0.6376564 6685430086
LowQualFinSF -0.6460163 5453098982
RoofMatlTar&Grv -0.7076708 1287371031
SaleTypeConLI -0.7560331 1394789713
SaleTypeCWD -0.7636488 1460041716
MoSold04 -0.7864021 6759841893
Condition1PosA -0.8621558 1844154549
RoofStyleShed -0.9576812 235225273
RoofStyleGambrel -0.9858484 1637119066
MoSold11 -1.0121206 4346092468
LotShapeIR3 -1.2100498 3236521680
Condition2RRNn -1.2259136 77988132
FoundationWood -1.3235549 436203068
NeighborhoodBlueste -1.3694832 48936925
MSSubClassSC40 -1.5958567 395203657
LotConfigInside -1.6433612 10394962225
FoundationStone -2.0794747 647894612
Exterior1stWdShing -2.3497002 1973590525
HouseStyle2.5Fin -2.5904792 4054129797
MoSold09 -4.5658305 4229849479

The takeaways from the preceeding table is that size seems to consistently rise to the top as a variable to include in the model. I don’t want to have multiple variables that are measuring the same thing (i.e. Total SF of Living Area, Total SF of Basement, etc.). I will use this table to select the variables that I think should be included in the linear regression model.

fit <- lm(SalePrice ~ GrLivArea + GarageArea + LotArea + YearRemodAdd + Fireplaces + YearBuilt + OverallQual, kaggle)
summary(fit)

Call:
lm(formula = SalePrice ~ GrLivArea + GarageArea + LotArea + YearRemodAdd + 
    Fireplaces + YearBuilt + OverallQual, data = kaggle)

Residuals:
    Min      1Q  Median      3Q     Max 
-132049  -20552   -1677   16488  273585 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.324e+06  1.075e+05 -12.318  < 2e-16 ***
GrLivArea     5.024e+01  2.452e+00  20.491  < 2e-16 ***
GarageArea    5.632e+01  5.387e+00  10.456  < 2e-16 ***
LotArea       8.885e-01  9.500e-02   9.353  < 2e-16 ***
YearRemodAdd  3.119e+02  5.681e+01   5.491 4.72e-08 ***
Fireplaces    9.821e+03  1.619e+03   6.065 1.68e-09 ***
YearBuilt     3.266e+02  4.163e+01   7.846 8.25e-15 ***
OverallQual   2.032e+04  1.048e+03  19.388  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33860 on 1448 degrees of freedom
Multiple R-squared:  0.806, Adjusted R-squared:  0.805 
F-statistic: 859.3 on 7 and 1448 DF,  p-value: < 2.2e-16

These seven variables explain 80% of the variability in the sale price. I will use this model to make the Kaggle submission.

test <- read.csv("Kaggle_House_Prices/test.csv") %>%
  fill_holes()
y_hat <- predict(fit, test)
submission <- data.frame(Id = test$Id, SalePrice = y_hat)
summary(submission)
       Id         SalePrice     
 Min.   :1461   Min.   : -6594  
 1st Qu.:1826   1st Qu.:123487  
 Median :2190   Median :169122  
 Mean   :2190   Mean   :177836  
 3rd Qu.:2554   3rd Qu.:222814  
 Max.   :2919   Max.   :537193  

It appears that some of the predictions are incorrect since the minimum is negative.

write.csv(submission, "submission.csv", quote = FALSE, row.names =  FALSE)

My Kaggle score is 0.47769 and my username is mikesilva. I am ranked 4520 on the leaderboard. I will try to improve this score when I have time.

This analysis is available on https://github.com/mikeasilva/CUNY-SPS. A walkthough of this analysis is available on YouTube.