Problem 1
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=\frac{N+1}{2}\).
set.seed(12345)
N <- 10
n <- 10000
mu <- sigma <- (N + 1)/2
df <- data.frame(X = runif(n, min=1, max=N),
Y = rnorm(n, mean=mu, sd=sigma))
summary(df$X)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.283 5.541 5.506 7.742 10.000
Min. 1st Qu. Median Mean 3rd Qu. Max.
-14.664 1.846 5.482 5.500 9.154 26.766
Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
x <- median(df$X)
y <- as.numeric(quantile(df$Y)["25%"])
Small letter “x” is 5.540952 and small letter “y” is 1.8460758.
- \(P(X>x | X>y)\)
P_a_and_b <- df %>%
filter(X > x,
X > y) %>%
nrow() / n
P_b <- df %>%
filter(X > y) %>%
nrow() / n
problem_1a <- P_a_and_b / P_b
The probablity of a random number uniformly ranging from 1 to 10 being greater than 5.540952 given that it is greater than 1.8460758 is 0.5512679
- \(P(X>x, Y>y)\)
problem_1b <- df %>%
filter(X > x,
Y > y) %>%
nrow() / n
The probablity of a random number uniformly ranging from 1 to 10 being greater than 5.540952 and a random normally distributed number with a mean and standard deviation of 5.5 being greater than 1.8460758 is 0.3808
- \(P(X<x | X>y)\)
P_a_and_b <- df %>%
filter(X < x,
X > y) %>%
nrow() / n
P_b <- df %>%
filter(X > y) %>%
nrow() / n
problem_1c <- P_a_and_b / P_b
The probablity of a random number uniformly ranging from 1 to 10 being less than 5.540952 given that it is greater than 1.8460758 is 0.4487321
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
# Create Joint Probabilities
temp <- df %>%
mutate(A = ifelse(X > x, " X greater than x", " X not greater than x"),
B = ifelse(Y > y, " Y greater than y", " Y not greater than y")) %>%
group_by(A, B) %>%
summarise(count = n()) %>%
mutate(probability = count / n)
# Create Marginal Probabilities
temp <- temp %>%
ungroup() %>%
group_by(A) %>%
summarise(count = sum(count),
probability = sum(probability)) %>%
mutate(B = "Total") %>%
bind_rows(temp)
temp <- temp %>%
ungroup() %>%
group_by(B) %>%
summarise(count = sum(count),
probability = sum(probability)) %>%
mutate(A = "Total") %>%
bind_rows(temp)
# Create Table
temp %>%
select(-count) %>%
spread(A, probability) %>%
rename(" " = B) %>%
kable() %>%
kable_styling()
|
X greater than x
|
X not greater than x
|
Total
|
Y greater than y
|
0.3808
|
0.3692
|
0.75
|
Y not greater than y
|
0.1192
|
0.1308
|
0.25
|
Total
|
0.5000
|
0.5000
|
1.00
|
P(X>x and Y>y) is 0.3808. P(X>x)P(Y>y) is 0.5 \(\times\) 0.75 which is 0.375. They are not the same.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
count_data <- temp %>%
filter(A != "Total",
B != "Total") %>%
select(-probability) %>%
spread(A, count) %>%
as.data.frame()
row.names(count_data) <- count_data$B
count_data <- count_data %>%
select(-B) %>%
as.matrix()
fisher.test(count_data)
Fisher's Exact Test for Count Data
data: count_data
p-value = 0.007904
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.032680 1.240419
sample estimates:
odds ratio
1.131777
Pearson's Chi-squared test with Yates' continuity correction
data: count_data
X-squared = 7.0533, df = 1, p-value = 0.007912
Fisher’s Exact Test is for is used when you have small cell sizes (less than 5). The Chi Square Test is used when the cell sizes are large. It would be appropriate in this case.
Problem 2
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
kaggle <- read.csv("Kaggle_House_Prices/train.csv") %>%
# Removing outliers per http://jse.amstat.org/v19n3/decock.pdf
filter(GrLivArea < 4000)
fill_holes <- function(df){
df %>%
# Filling in missing values with zeros where it makes sense
mutate(BedroomAbvGr = replace_na(BedroomAbvGr, 0),
BsmtFullBath = replace_na(BsmtFullBath, 0),
BsmtHalfBath = replace_na(BsmtHalfBath, 0),
BsmtUnfSF = replace_na(BsmtUnfSF, 0),
EnclosedPorch = replace_na(EnclosedPorch, 0),
Fireplaces = replace_na(Fireplaces, 0),
GarageArea = replace_na(GarageArea, 0),
GarageCars = replace_na(GarageCars, 0),
HalfBath = replace_na(HalfBath, 0),
KitchenAbvGr = replace_na(KitchenAbvGr, 0),
LotFrontage = replace_na(LotFrontage, 0),
OpenPorchSF = replace_na(OpenPorchSF, 0),
PoolArea = replace_na(PoolArea, 0),
ScreenPorch = replace_na(ScreenPorch, 0),
TotRmsAbvGrd = replace_na(TotRmsAbvGrd, 0),
WoodDeckSF = replace_na(WoodDeckSF, 0))
}
kaggle <- fill_holes(kaggle)
Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for the training data set.
Id
Measure
|
Value
|
Min
|
1.000
|
First Quartile
|
364.750
|
Median
|
730.500
|
Mean
|
729.967
|
Third Quartile
|
1094.250
|
Max
|
1460.000
|
MSSubClass
Measure
|
Value
|
Min
|
20.00000
|
First Quartile
|
20.00000
|
Median
|
50.00000
|
Mean
|
56.88874
|
Third Quartile
|
70.00000
|
Max
|
190.00000
|
MSZoning
MSZoning
|
count
|
C (all)
|
10
|
FV
|
65
|
RH
|
16
|
RL
|
1147
|
RM
|
218
|
LotFrontage
Measure
|
Value
|
Min
|
0.00000
|
First Quartile
|
42.00000
|
Median
|
63.00000
|
Mean
|
57.29602
|
Third Quartile
|
79.00000
|
Max
|
313.00000
|
LotArea
Measure
|
Value
|
Min
|
1300.00
|
First Quartile
|
7538.75
|
Median
|
9468.50
|
Mean
|
10448.78
|
Third Quartile
|
11588.00
|
Max
|
215245.00
|
Street
Street
|
count
|
Grvl
|
6
|
Pave
|
1450
|
Alley
Alley
|
count
|
Grvl
|
50
|
Pave
|
41
|
NA
|
1365
|
LotShape
LotShape
|
count
|
IR1
|
481
|
IR2
|
41
|
IR3
|
9
|
Reg
|
925
|
LandContour
LandContour
|
count
|
Bnk
|
61
|
HLS
|
50
|
Low
|
36
|
Lvl
|
1309
|
Utilities
Utilities
|
count
|
AllPub
|
1455
|
NoSeWa
|
1
|
LotConfig
LotConfig
|
count
|
Corner
|
260
|
CulDSac
|
94
|
FR2
|
47
|
FR3
|
4
|
Inside
|
1051
|
LandSlope
LandSlope
|
count
|
Gtl
|
1378
|
Mod
|
65
|
Sev
|
13
|
Neighborhood
Neighborhood
|
count
|
Blmngtn
|
17
|
Blueste
|
2
|
BrDale
|
16
|
BrkSide
|
58
|
ClearCr
|
28
|
CollgCr
|
150
|
Crawfor
|
51
|
Edwards
|
98
|
Gilbert
|
79
|
IDOTRR
|
37
|
MeadowV
|
17
|
Mitchel
|
49
|
NAmes
|
225
|
NoRidge
|
39
|
NPkVill
|
9
|
NridgHt
|
77
|
NWAmes
|
73
|
OldTown
|
113
|
Sawyer
|
74
|
SawyerW
|
59
|
Somerst
|
86
|
StoneBr
|
25
|
SWISU
|
25
|
Timber
|
38
|
Veenker
|
11
|
Condition1
Condition1
|
count
|
Artery
|
48
|
Feedr
|
80
|
Norm
|
1258
|
PosA
|
8
|
PosN
|
18
|
RRAe
|
11
|
RRAn
|
26
|
RRNe
|
2
|
RRNn
|
5
|
Condition2
Condition2
|
count
|
Artery
|
2
|
Feedr
|
6
|
Norm
|
1442
|
PosA
|
1
|
PosN
|
1
|
RRAe
|
1
|
RRAn
|
1
|
RRNn
|
2
|
BldgType
BldgType
|
count
|
1Fam
|
1216
|
2fmCon
|
31
|
Duplex
|
52
|
Twnhs
|
43
|
TwnhsE
|
114
|
HouseStyle
HouseStyle
|
count
|
1.5Fin
|
154
|
1.5Unf
|
14
|
1Story
|
726
|
2.5Fin
|
8
|
2.5Unf
|
11
|
2Story
|
441
|
SFoyer
|
37
|
SLvl
|
65
|
OverallQual
Measure
|
Value
|
Min
|
1.000000
|
First Quartile
|
5.000000
|
Median
|
6.000000
|
Mean
|
6.088599
|
Third Quartile
|
7.000000
|
Max
|
10.000000
|
OverallCond
Measure
|
Value
|
Min
|
1.000000
|
First Quartile
|
5.000000
|
Median
|
5.000000
|
Mean
|
5.576236
|
Third Quartile
|
6.000000
|
Max
|
9.000000
|
YearBuilt
Measure
|
Value
|
Min
|
1872.000
|
First Quartile
|
1954.000
|
Median
|
1972.000
|
Mean
|
1971.185
|
Third Quartile
|
2000.000
|
Max
|
2010.000
|
YearRemodAdd
Measure
|
Value
|
Min
|
1950.000
|
First Quartile
|
1966.750
|
Median
|
1993.500
|
Mean
|
1984.819
|
Third Quartile
|
2004.000
|
Max
|
2010.000
|
RoofStyle
RoofStyle
|
count
|
Flat
|
13
|
Gable
|
1140
|
Gambrel
|
11
|
Hip
|
283
|
Mansard
|
7
|
Shed
|
2
|
RoofMatl
RoofMatl
|
count
|
CompShg
|
1432
|
Membran
|
1
|
Metal
|
1
|
Roll
|
1
|
Tar&Grv
|
11
|
WdShake
|
5
|
WdShngl
|
5
|
Exterior1st
Exterior1st
|
count
|
AsbShng
|
20
|
AsphShn
|
1
|
BrkComm
|
2
|
BrkFace
|
50
|
CBlock
|
1
|
CemntBd
|
60
|
HdBoard
|
221
|
ImStucc
|
1
|
MetalSd
|
220
|
Plywood
|
108
|
Stone
|
2
|
Stucco
|
24
|
VinylSd
|
515
|
Wd Sdng
|
205
|
WdShing
|
26
|
Exterior2nd
Exterior2nd
|
count
|
AsbShng
|
20
|
AsphShn
|
3
|
Brk Cmn
|
7
|
BrkFace
|
25
|
CBlock
|
1
|
CmentBd
|
59
|
HdBoard
|
206
|
ImStucc
|
9
|
MetalSd
|
214
|
Other
|
1
|
Plywood
|
142
|
Stone
|
5
|
Stucco
|
25
|
VinylSd
|
504
|
Wd Sdng
|
197
|
Wd Shng
|
38
|
MasVnrType
MasVnrType
|
count
|
BrkCmn
|
15
|
BrkFace
|
444
|
None
|
863
|
Stone
|
126
|
NA
|
8
|
MasVnrArea
Measure
|
Value
|
Min
|
0.0000
|
First Quartile
|
0.0000
|
Median
|
0.0000
|
Mean
|
102.0877
|
Third Quartile
|
164.2500
|
Max
|
1600.0000
|
ExterQual
ExterQual
|
count
|
Ex
|
49
|
Fa
|
14
|
Gd
|
487
|
TA
|
906
|
ExterCond
ExterCond
|
count
|
Ex
|
3
|
Fa
|
28
|
Gd
|
146
|
Po
|
1
|
TA
|
1278
|
Foundation
Foundation
|
count
|
BrkTil
|
146
|
CBlock
|
634
|
PConc
|
643
|
Slab
|
24
|
Stone
|
6
|
Wood
|
3
|
BsmtQual
BsmtQual
|
count
|
Ex
|
117
|
Fa
|
35
|
Gd
|
618
|
TA
|
649
|
NA
|
37
|
BsmtCond
BsmtCond
|
count
|
Fa
|
45
|
Gd
|
65
|
Po
|
2
|
TA
|
1307
|
NA
|
37
|
BsmtExposure
BsmtExposure
|
count
|
Av
|
220
|
Gd
|
131
|
Mn
|
114
|
No
|
953
|
NA
|
38
|
BsmtFinType1
BsmtFinType1
|
count
|
ALQ
|
220
|
BLQ
|
148
|
GLQ
|
414
|
LwQ
|
74
|
Rec
|
133
|
Unf
|
430
|
NA
|
37
|
BsmtFinSF1
Measure
|
Value
|
Min
|
0.0000
|
First Quartile
|
0.0000
|
Median
|
381.0000
|
Mean
|
436.9911
|
Third Quartile
|
706.5000
|
Max
|
2188.0000
|
BsmtFinType2
BsmtFinType2
|
count
|
ALQ
|
19
|
BLQ
|
33
|
GLQ
|
14
|
LwQ
|
46
|
Rec
|
54
|
Unf
|
1252
|
NA
|
38
|
BsmtFinSF2
Measure
|
Value
|
Min
|
0.0000
|
First Quartile
|
0.0000
|
Median
|
0.0000
|
Mean
|
46.6772
|
Third Quartile
|
0.0000
|
Max
|
1474.0000
|
BsmtUnfSF
Measure
|
Value
|
Min
|
0.0000
|
First Quartile
|
222.5000
|
Median
|
477.5000
|
Mean
|
566.9904
|
Third Quartile
|
808.0000
|
Max
|
2336.0000
|
TotalBsmtSF
Measure
|
Value
|
Min
|
0.000
|
First Quartile
|
795.000
|
Median
|
990.500
|
Mean
|
1050.659
|
Third Quartile
|
1293.750
|
Max
|
3206.000
|
Heating
Heating
|
count
|
Floor
|
1
|
GasA
|
1424
|
GasW
|
18
|
Grav
|
7
|
OthW
|
2
|
Wall
|
4
|
HeatingQC
HeatingQC
|
count
|
Ex
|
737
|
Fa
|
49
|
Gd
|
241
|
Po
|
1
|
TA
|
428
|
CentralAir
CentralAir
|
count
|
N
|
95
|
Y
|
1361
|
Electrical
Electrical
|
count
|
FuseA
|
94
|
FuseF
|
27
|
FuseP
|
3
|
Mix
|
1
|
SBrkr
|
1330
|
NA
|
1
|
X1stFlrSF
Measure
|
Value
|
Min
|
334.000
|
First Quartile
|
882.000
|
Median
|
1086.000
|
Mean
|
1157.109
|
Third Quartile
|
1389.250
|
Max
|
3228.000
|
X2ndFlrSF
Measure
|
Value
|
Min
|
0.000
|
First Quartile
|
0.000
|
Median
|
0.000
|
Mean
|
343.533
|
Third Quartile
|
728.000
|
Max
|
1818.000
|
LowQualFinSF
Measure
|
Value
|
Min
|
0.000000
|
First Quartile
|
0.000000
|
Median
|
0.000000
|
Mean
|
5.860577
|
Third Quartile
|
0.000000
|
Max
|
572.000000
|
GrLivArea
Measure
|
Value
|
Min
|
334.000
|
First Quartile
|
1128.000
|
Median
|
1458.500
|
Mean
|
1506.502
|
Third Quartile
|
1775.250
|
Max
|
3627.000
|
BsmtFullBath
Measure
|
Value
|
Min
|
0.0000000
|
First Quartile
|
0.0000000
|
Median
|
0.0000000
|
Mean
|
0.4237637
|
Third Quartile
|
1.0000000
|
Max
|
3.0000000
|
BsmtHalfBath
Measure
|
Value
|
Min
|
0.0000000
|
First Quartile
|
0.0000000
|
Median
|
0.0000000
|
Mean
|
0.0570055
|
Third Quartile
|
0.0000000
|
Max
|
2.0000000
|
FullBath
Measure
|
Value
|
Min
|
0.000000
|
First Quartile
|
1.000000
|
Median
|
2.000000
|
Mean
|
1.561813
|
Third Quartile
|
2.000000
|
Max
|
3.000000
|
HalfBath
Measure
|
Value
|
Min
|
0.0000000
|
First Quartile
|
0.0000000
|
Median
|
0.0000000
|
Mean
|
0.3811813
|
Third Quartile
|
1.0000000
|
Max
|
2.0000000
|
BedroomAbvGr
Measure
|
Value
|
Min
|
0.000000
|
First Quartile
|
2.000000
|
Median
|
3.000000
|
Mean
|
2.864698
|
Third Quartile
|
3.000000
|
Max
|
8.000000
|
KitchenAbvGr
Measure
|
Value
|
Min
|
0.000000
|
First Quartile
|
1.000000
|
Median
|
1.000000
|
Mean
|
1.046703
|
Third Quartile
|
1.000000
|
Max
|
3.000000
|
KitchenQual
KitchenQual
|
count
|
Ex
|
96
|
Fa
|
39
|
Gd
|
586
|
TA
|
735
|
TotRmsAbvGrd
Measure
|
Value
|
Min
|
2.000000
|
First Quartile
|
5.000000
|
Median
|
6.000000
|
Mean
|
6.506181
|
Third Quartile
|
7.000000
|
Max
|
14.000000
|
Functional
Functional
|
count
|
Maj1
|
14
|
Maj2
|
5
|
Min1
|
31
|
Min2
|
34
|
Mod
|
15
|
Sev
|
1
|
Typ
|
1356
|
Fireplaces
Measure
|
Value
|
Min
|
0.0000000
|
First Quartile
|
0.0000000
|
Median
|
1.0000000
|
Mean
|
0.6092033
|
Third Quartile
|
1.0000000
|
Max
|
3.0000000
|
FireplaceQu
FireplaceQu
|
count
|
Ex
|
23
|
Fa
|
33
|
Gd
|
378
|
Po
|
20
|
TA
|
312
|
NA
|
690
|
GarageType
GarageType
|
count
|
2Types
|
6
|
Attchd
|
867
|
Basment
|
19
|
BuiltIn
|
87
|
CarPort
|
9
|
Detchd
|
387
|
NA
|
81
|
GarageYrBlt
Measure
|
Value
|
Min
|
1900.00
|
First Quartile
|
1961.00
|
Median
|
1980.00
|
Mean
|
1978.44
|
Third Quartile
|
2002.00
|
Max
|
2010.00
|
GarageFinish
GarageFinish
|
count
|
Fin
|
348
|
RFn
|
422
|
Unf
|
605
|
NA
|
81
|
GarageCars
Measure
|
Value
|
Min
|
0.000000
|
First Quartile
|
1.000000
|
Median
|
2.000000
|
Mean
|
1.764423
|
Third Quartile
|
2.000000
|
Max
|
4.000000
|
GarageArea
Measure
|
Value
|
Min
|
0.0000
|
First Quartile
|
329.5000
|
Median
|
478.5000
|
Mean
|
471.5687
|
Third Quartile
|
576.0000
|
Max
|
1390.0000
|
GarageQual
GarageQual
|
count
|
Ex
|
3
|
Fa
|
48
|
Gd
|
14
|
Po
|
3
|
TA
|
1307
|
NA
|
81
|
GarageCond
GarageCond
|
count
|
Ex
|
2
|
Fa
|
35
|
Gd
|
9
|
Po
|
7
|
TA
|
1322
|
NA
|
81
|
PavedDrive
PavedDrive
|
count
|
N
|
90
|
P
|
30
|
Y
|
1336
|
WoodDeckSF
Measure
|
Value
|
Min
|
0.00000
|
First Quartile
|
0.00000
|
Median
|
0.00000
|
Mean
|
93.83379
|
Third Quartile
|
168.00000
|
Max
|
857.00000
|
OpenPorchSF
Measure
|
Value
|
Min
|
0.00000
|
First Quartile
|
0.00000
|
Median
|
24.00000
|
Mean
|
46.22115
|
Third Quartile
|
68.00000
|
Max
|
547.00000
|
EnclosedPorch
Measure
|
Value
|
Min
|
0.00000
|
First Quartile
|
0.00000
|
Median
|
0.00000
|
Mean
|
22.01442
|
Third Quartile
|
0.00000
|
Max
|
552.00000
|
X3SsnPorch
Measure
|
Value
|
Min
|
0.000000
|
First Quartile
|
0.000000
|
Median
|
0.000000
|
Mean
|
3.418956
|
Third Quartile
|
0.000000
|
Max
|
508.000000
|
ScreenPorch
Measure
|
Value
|
Min
|
0.00000
|
First Quartile
|
0.00000
|
Median
|
0.00000
|
Mean
|
15.10234
|
Third Quartile
|
0.00000
|
Max
|
480.00000
|
PoolArea
Measure
|
Value
|
Min
|
0.000000
|
First Quartile
|
0.000000
|
Median
|
0.000000
|
Mean
|
2.055632
|
Third Quartile
|
0.000000
|
Max
|
738.000000
|
PoolQC
PoolQC
|
count
|
Ex
|
1
|
Fa
|
2
|
Gd
|
2
|
NA
|
1451
|
Fence
Fence
|
count
|
GdPrv
|
59
|
GdWo
|
54
|
MnPrv
|
156
|
MnWw
|
11
|
NA
|
1176
|
MiscFeature
MiscFeature
|
count
|
Gar2
|
2
|
Othr
|
2
|
Shed
|
49
|
TenC
|
1
|
NA
|
1402
|
MiscVal
Measure
|
Value
|
Min
|
0.00000
|
First Quartile
|
0.00000
|
Median
|
0.00000
|
Mean
|
43.60852
|
Third Quartile
|
0.00000
|
Max
|
15500.00000
|
MoSold
Measure
|
Value
|
Min
|
1.000000
|
First Quartile
|
5.000000
|
Median
|
6.000000
|
Mean
|
6.326236
|
Third Quartile
|
8.000000
|
Max
|
12.000000
|
YrSold
Measure
|
Value
|
Min
|
2006.000
|
First Quartile
|
2007.000
|
Median
|
2008.000
|
Mean
|
2007.817
|
Third Quartile
|
2009.000
|
Max
|
2010.000
|
SaleType
SaleType
|
count
|
COD
|
43
|
Con
|
2
|
ConLD
|
9
|
ConLI
|
5
|
ConLw
|
5
|
CWD
|
4
|
New
|
120
|
Oth
|
3
|
WD
|
1265
|
SaleCondition
SaleCondition
|
count
|
Abnorml
|
100
|
AdjLand
|
4
|
Alloca
|
12
|
Family
|
20
|
Normal
|
1197
|
Partial
|
123
|
SalePrice
Measure
|
Value
|
Min
|
34900.0
|
First Quartile
|
129900.0
|
Median
|
163000.0
|
Mean
|
180151.2
|
Third Quartile
|
214000.0
|
Max
|
625000.0
|
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
kaggle %>%
select(TotalBsmtSF, GrLivArea, YearBuilt, SalePrice) %>%
pairs()

Derive a correlation matrix for any three quantitative variables in the dataset.
correlation_matrix <- kaggle %>%
select(TotalBsmtSF, GrLivArea, YearBuilt) %>%
cor() %>%
as.matrix()
correlation_matrix %>%
kable() %>%
kable_styling()
|
TotalBsmtSF
|
GrLivArea
|
YearBuilt
|
TotalBsmtSF
|
1.0000000
|
0.3948292
|
0.3998666
|
GrLivArea
|
0.3948292
|
1.0000000
|
0.1926450
|
YearBuilt
|
0.3998666
|
0.1926450
|
1.0000000
|
Test the hypothesis that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
zero_vars <- 0
test <- 0
variables <- kaggle %>%
select(-SalePrice) %>%
names()
cat("<table><thead><tr><th>Variable</th><th>Correlation Equal to Zero</th></tr></thread><tbody>")
Variable
|
Correlation Equal to Zero
|
for(variable in variables){
d <- kaggle[,names(kaggle) == variable]
if(is.numeric(d)){
test <- test + 1
results <- cor.test(kaggle$SalePrice, d, conf.level = 0.8)
if(0 > results$conf.int[1] & results$conf.int[2] > 0){
hypothesis_test_results <- "Yes"
zero_vars <- zero_vars + 1
} else {
hypothesis_test_results <- "No"
}
cat(paste("<tr><td>",variable,"</td><td>", hypothesis_test_results, "</td>"))
}
}
Id
|
Yes
|
MSSubClass
|
No
|
LotFrontage
|
No
|
LotArea
|
No
|
OverallQual
|
No
|
OverallCond
|
No
|
YearBuilt
|
No
|
YearRemodAdd
|
No
|
MasVnrArea
|
No
|
BsmtFinSF1
|
No
|
BsmtFinSF2
|
Yes
|
BsmtUnfSF
|
No
|
TotalBsmtSF
|
No
|
X1stFlrSF
|
No
|
X2ndFlrSF
|
No
|
LowQualFinSF
|
Yes
|
GrLivArea
|
No
|
BsmtFullBath
|
No
|
BsmtHalfBath
|
No
|
FullBath
|
No
|
HalfBath
|
No
|
BedroomAbvGr
|
No
|
KitchenAbvGr
|
No
|
TotRmsAbvGrd
|
No
|
Fireplaces
|
No
|
GarageYrBlt
|
No
|
GarageCars
|
No
|
GarageArea
|
No
|
WoodDeckSF
|
No
|
OpenPorchSF
|
No
|
EnclosedPorch
|
No
|
X3SsnPorch
|
No
|
ScreenPorch
|
No
|
PoolArea
|
Yes
|
MiscVal
|
Yes
|
MoSold
|
No
|
YrSold
|
Yes
|
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
There are 6 variables that have no correlation with the sale price. I would be worried be woried about family-wise error because I just ran 37 hypothesis tests. The family-wise error rate would be: \({FWER} = 1 - (1 - .2)^6 = 1 - 0.262144 = 0.737856\). This means there is a really high probability of committing a Type I error.
Linear Algebra and Correlation
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
precision_matrix <- inv(correlation_matrix)
precision_matrix %>%
kable() %>%
kable_styling()
|
|
|
1.3601385
|
-0.4489077
|
-0.4573942
|
-0.4489077
|
1.1867025
|
-0.0491090
|
-0.4573942
|
-0.0491090
|
1.1923573
|
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
This will generate the identity matrix.
correlation_matrix %*% precision_matrix %>%
round() %>%
kable() %>%
kable_styling()
|
|
|
|
TotalBsmtSF
|
1
|
0
|
0
|
GrLivArea
|
0
|
1
|
0
|
YearBuilt
|
0
|
0
|
1
|
This will too.
precision_matrix %*% correlation_matrix %>%
round() %>%
kable() %>%
kable_styling()
TotalBsmtSF
|
GrLivArea
|
YearBuilt
|
1
|
0
|
0
|
0
|
1
|
0
|
0
|
0
|
1
|
Conduct LU decomposition on the matrix.
lu_decomposition <- Matrix::expand(lu(correlation_matrix))
The LU decomposition should yield the correlation matrix after multiplying the two components. In other words this:
A <- lu_decomposition$L %*% lu_decomposition$U %>%
as.matrix()
colnames(A) <- colnames(correlation_matrix)
rownames(A) <- rownames(correlation_matrix)
A %>%
kable() %>%
kable_styling()
|
TotalBsmtSF
|
GrLivArea
|
YearBuilt
|
TotalBsmtSF
|
1.0000000
|
0.3948292
|
0.3998666
|
GrLivArea
|
0.3948292
|
1.0000000
|
0.1926450
|
YearBuilt
|
0.3998666
|
0.1926450
|
1.0000000
|
Should match this:
correlation_matrix %>%
kable() %>%
kable_styling()
|
TotalBsmtSF
|
GrLivArea
|
YearBuilt
|
TotalBsmtSF
|
1.0000000
|
0.3948292
|
0.3998666
|
GrLivArea
|
0.3948292
|
1.0000000
|
0.1926450
|
YearBuilt
|
0.3998666
|
0.1926450
|
1.0000000
|
Which it does.
Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
I will be using the Square Footage of the Unfinished Basement for this part of the exam. I know it is right skewed because the mean (566.9903846) is larger than the median (477.5) as shown in the figure below:

The minimum (0) is not smaller than zero so we don’t need to shift the data. Remember that the maximum value is 2336. I have already loaded the MASS package.
lambda <- fitdistr(kaggle$BsmtUnfSF, densfun = "exponential")$estimate
The optimal value for \(\lambda\) is 0.0017637.
samples <- rexp(1000, rate = lambda)
Plot a histogram and compare it with a histogram of your original variable.

The original data is not quite exponential, but it’s not a bad approximation. The exponential distribution data has a longer tail than the basement square footage data.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
The PDF would be \(f(x;\lambda) = \lambda e^{-\lambda x}\) where \(x \geq 0\) and otherwise zero.
The CDF would be \(F(x;\lambda) = 1 - e^{-\lambda x}\). lambda is 0.0017637. To find the 5th percentile we need to solve for x in:
\[0.05 = 1 - e^{-0.0017637 x}\]
\[-0.95 = - e^{-0.0017637 x}\]
\[-ln(0.95) = 0.0017637 x\]
\[x = \frac{-ln(0.95)}{0.0017637} = 29.0828047\] For the 95th percentile we need to solve for x in:
\[0.95 = 1 - e^{-0.0017637 x}\]
\[-0.05 = - e^{-0.0017637 x}\]
\[-ln(0.05) = 0.0017637 x\]
\[x = \frac{-ln(0.05)}{0.0017637} = 1698.551394\]
So the 5th and 95th percentiles are approximately 29 and 1699, respectively.
Also generate a 95% confidence interval from the empirical data, assuming normality.
mu <- mean(kaggle$BsmtUnfSF)
s <- sd(kaggle$BsmtUnfSF)
n <- nrow(kaggle)
error <- qnorm(0.975) * s / sqrt(n)
ci <- c(mu - error, mu + error)
names(ci) <- c("5%", "95%")
ci
5% 95%
544.2769 589.7038
The 95% confidence interval (assuming normality) is 544 to 590. Although what I think was meant was assume the data is normally distributed. Calculate the 5th and 95th percentile. That is found using \(x = \mu + Z \sigma\). The Z is -1.645 for the 5th percentile, and 1.645 for the 95th. So the percentiles would be -160 and 1294.
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(kaggle$BsmtUnfSF, c(0.05, 0.95))
5% 95%
0 1468
The actual 5th percentile is 0 and the 95th is 1468. So the findings are summarized in the following table:
Exponential CDF |
29 |
1699 |
Normal 95% CI |
544 |
590 |
Normal Percentiles |
-160 |
1294 |
Emperical Percentiles |
0 |
1468 |
If we model the data as exponentially distributed the 5th percentile is 29. If we model it as normally distributed the 5th is at -160 which in the context of square footage does not make any sense. The actual 5th percentile is 0. The difference is explained to the assumed shape/distibution of the underlying data.
Looking at the 95th percentile we have 393 if the data are exponentially distributed, 1294 if it is normally distributed and 1468 in reality. Again the difference is due to the assumed shape.
I have left out the 95% CI from the discussion because the confidence interval is a way of estimating the mean of the population. We would know if we took 100 estimates that the actual mean falls within the confidence intervals 95% of the time. Within this context it is meaningless.
Modeling
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Since there are so many variables I will be using a random forest to detect which variables are the most important and will be using them to build the model. First I will drop out the variables that are missing a lot of data or otherwise meaningless.
set.seed(12345)
rf_data <- kaggle %>%
# Dropping variables with a lot of missing values
select(-Alley, -MasVnrType, -MasVnrArea, -BsmtQual, -BsmtCond, -BsmtExposure, -BsmtFinType1, -BsmtFinType2, -Electrical, -FireplaceQu, -GarageType, -GarageYrBlt, -GarageFinish, -GarageQual, -GarageCond, -PoolQC, -Fence, -MiscFeature) %>%
# Dropping the ID because it is meaningless
select(-Id)
ctrl <- trainControl(method = "cv", number = 5)
rf_fit <- train(SalePrice ~ .,
data = rf_data,
method = "rf",
ntree = 1000,
trControl = ctrl,
tuneGrid = data.frame(mtry = 10),
importance = TRUE,
verbose = TRUE)
Now we can examine the variables to in order of importance:
imp_var <- importance(rf_fit$finalModel)
variables <- rownames(imp_var)
imp_var <- imp_var %>%
as.data.frame()
names(imp_var) <- c("IncMSE", "IncNodePurity")
imp_var$Variable <- variables
imp_var %>%
select(Variable, IncMSE, IncNodePurity) %>%
arrange(-IncMSE) %>%
kable() %>%
kable_styling()
Variable
|
IncMSE
|
IncNodePurity
|
GrLivArea
|
27.7642054
|
558222457815
|
X1stFlrSF
|
23.8402736
|
438261813340
|
GarageArea
|
23.4461796
|
396371093603
|
TotalBsmtSF
|
23.0135452
|
466705746656
|
X2ndFlrSF
|
21.8530314
|
173856510098
|
BsmtFinSF1
|
21.4807603
|
258028180660
|
GarageCars
|
20.4468933
|
441435019266
|
LotArea
|
19.9500054
|
223742546295
|
YearRemodAdd
|
18.9757012
|
244497915498
|
TotRmsAbvGrd
|
18.6221084
|
225898592051
|
Fireplaces
|
18.2525067
|
183834633265
|
YearBuilt
|
18.2095526
|
365182036863
|
FullBath
|
18.0785926
|
267428897407
|
OverallQual07
|
17.8122831
|
54778740547
|
KitchenQualTA
|
17.3133398
|
212814029087
|
ExterQualTA
|
16.7886571
|
319459715203
|
OverallQual08
|
15.5552231
|
148055942628
|
ExterQualGd
|
15.4469920
|
168787089403
|
KitchenQualGd
|
14.9208736
|
81076347248
|
OverallQual05
|
14.6904962
|
71828292878
|
FoundationPConc
|
14.6041206
|
152501331460
|
HalfBath
|
14.1326475
|
64835240662
|
MSSubClassSC60
|
14.0719635
|
80260652096
|
OverallQual06
|
13.5570802
|
34216698228
|
BsmtUnfSF
|
13.0750197
|
92376101616
|
MSZoningRM
|
12.9698822
|
44695503360
|
BedroomAbvGr
|
12.8961520
|
64408140622
|
OpenPorchSF
|
12.6017606
|
150317042035
|
FoundationCBlock
|
12.0103478
|
61632768206
|
MSZoningRL
|
11.9412630
|
27030146717
|
HouseStyle2Story
|
11.9130210
|
37962672713
|
OverallQual09
|
11.8806930
|
125777334119
|
WoodDeckSF
|
11.1052709
|
101105003058
|
CentralAirY
|
11.0840199
|
29474398534
|
BsmtFullBath
|
10.7850346
|
47184296081
|
HouseStyle1Story
|
10.7231739
|
22313763589
|
MSSubClassSC30
|
10.0277032
|
23314258488
|
OverallCond05
|
9.9576233
|
36917969145
|
OverallQual04
|
9.9174075
|
29559767577
|
NeighborhoodNridgHt
|
9.6892514
|
111376178800
|
HeatingQCTA
|
9.5295453
|
34593868828
|
LotFrontage
|
9.4974677
|
132965031566
|
RoofStyleHip
|
9.1111191
|
56874431472
|
Exterior1stVinylSd
|
9.0163227
|
33230520180
|
OverallQual10
|
8.9706238
|
74465306490
|
MSSubClassSC160
|
8.8767743
|
7305541842
|
MSSubClassSC20
|
8.7685719
|
16412418205
|
PavedDriveY
|
8.6151987
|
19361352452
|
NeighborhoodCrawfor
|
8.2457088
|
12809035999
|
Exterior2ndVinylSd
|
8.2103380
|
41530410869
|
NeighborhoodEdwards
|
8.2012692
|
10704691960
|
KitchenAbvGr
|
7.9417898
|
15029863724
|
RoofStyleGable
|
7.7734354
|
51311533488
|
NeighborhoodOldTown
|
7.6717584
|
11896312325
|
OverallCond03
|
7.6473501
|
5288730615
|
NeighborhoodNAmes
|
7.5605677
|
11095991736
|
SaleTypeNew
|
7.4116705
|
55406944533
|
SaleConditionPartial
|
7.3684266
|
64585720051
|
LotShapeReg
|
7.2638065
|
31883580269
|
NeighborhoodGilbert
|
7.1491821
|
6147200232
|
BldgTypeDuplex
|
7.0724285
|
6136830586
|
BldgTypeTwnhsE
|
6.9237689
|
7383513502
|
OverallCond07
|
6.8054355
|
8063560161
|
NeighborhoodCollgCr
|
6.6595760
|
8424088310
|
NeighborhoodNWAmes
|
6.5626023
|
6555080248
|
OverallCond06
|
6.3219140
|
8471565422
|
NeighborhoodStoneBr
|
6.2252129
|
24890630342
|
MSSubClassSC90
|
6.1941219
|
6504393659
|
NeighborhoodNoRidge
|
6.1344144
|
36921906867
|
Exterior1stMetalSd
|
6.0971684
|
8541438797
|
MSSubClassSC50
|
5.8609017
|
8254128902
|
Exterior1stWd Sdng
|
5.8517990
|
7959702982
|
EnclosedPorch
|
5.8189704
|
18698479917
|
FunctionalTyp
|
5.7728467
|
10038339426
|
NeighborhoodSawyer
|
5.5022107
|
3603890796
|
MSSubClassSC80
|
5.3093292
|
3230952260
|
Exterior1stPlywood
|
5.2983259
|
5921642459
|
NeighborhoodClearCr
|
5.2490482
|
7049418796
|
OverallQual03
|
5.2486329
|
4976370867
|
NeighborhoodIDOTRR
|
5.1772984
|
7740132342
|
BldgTypeTwnhs
|
5.1029673
|
3399903454
|
SaleTypeWD
|
4.9979941
|
28116185911
|
MSZoningFV
|
4.9899826
|
4560551576
|
Condition1Feedr
|
4.9064343
|
4229460756
|
FoundationSlab
|
4.8125409
|
3320013219
|
OverallCond04
|
4.7987513
|
7337757055
|
Exterior2ndMetalSd
|
4.7147506
|
8078977399
|
KitchenQualFa
|
4.6807084
|
6580727317
|
LandContourLvl
|
4.5175216
|
10851868623
|
NeighborhoodBrDale
|
4.5006819
|
1244979412
|
BsmtFinSF2
|
4.4427439
|
14755269865
|
Exterior2ndPlywood
|
4.3389015
|
6779855429
|
MSSubClassSC70
|
4.3346720
|
5432789230
|
NeighborhoodMeadowV
|
4.3141175
|
1900477409
|
Exterior1stBrkFace
|
4.1725914
|
9310006897
|
Exterior2ndWd Sdng
|
4.1206877
|
7969996880
|
NeighborhoodNPkVill
|
4.1072830
|
307630325
|
MSSubClassSC180
|
3.9546190
|
897817473
|
Exterior2ndCmentBd
|
3.9475897
|
14208226735
|
ScreenPorch
|
3.9353891
|
20193048493
|
Condition1Norm
|
3.9270879
|
7969386646
|
ExterQualFa
|
3.8908805
|
3940621831
|
NeighborhoodSomerst
|
3.8351411
|
10594426163
|
Exterior1stHdBoard
|
3.8145594
|
6510562101
|
NeighborhoodTimber
|
3.8134536
|
6647192482
|
Condition1RRAe
|
3.7427599
|
813366813
|
BldgType2fmCon
|
3.6976839
|
2334975817
|
NeighborhoodSawyerW
|
3.6511495
|
3337881959
|
HeatingQCGd
|
3.5784565
|
7463053600
|
LandContourLow
|
3.4966691
|
7556532641
|
OverallQual02
|
3.3424402
|
1189891163
|
Exterior1stCemntBd
|
3.3372101
|
16787314267
|
SaleConditionNormal
|
3.2183362
|
17926761077
|
Exterior2ndBrk Cmn
|
3.2100712
|
1019961967
|
HeatingWall
|
3.2040764
|
341103999
|
MSSubClassSC190
|
3.1475311
|
2515495382
|
LandSlopeMod
|
3.1044339
|
10164110731
|
OverallCond08
|
2.9852142
|
4930064723
|
NeighborhoodMitchel
|
2.9809616
|
2900131238
|
HouseStyleSFoyer
|
2.9569055
|
1266022780
|
HeatingQCFa
|
2.8838835
|
4450317437
|
OverallCond09
|
2.8275507
|
7077081468
|
NeighborhoodBrkSide
|
2.7912815
|
3973978708
|
BsmtHalfBath
|
2.7906051
|
4658438070
|
ExterCondTA
|
2.7144591
|
8476120568
|
LotConfigCulDSac
|
2.5664585
|
13518708455
|
RoofMatlCompShg
|
2.5467057
|
6536927501
|
ExterCondFa
|
2.4539984
|
5256440740
|
PavedDriveP
|
2.3440645
|
1414425905
|
RoofMatlWdShngl
|
2.3253509
|
4676063657
|
NeighborhoodSWISU
|
2.2522396
|
1766072445
|
MSSubClassSC85
|
2.0506240
|
544540749
|
FunctionalMin2
|
2.0094515
|
1829415750
|
Exterior1stBrkComm
|
1.9782157
|
660262229
|
Exterior2ndHdBoard
|
1.9310587
|
7989689692
|
MoSold08
|
1.8776579
|
6510428093
|
Condition1PosN
|
1.8227573
|
2015016657
|
RoofMatlWdShake
|
1.7477976
|
925564602
|
HouseStyle1.5Unf
|
1.6977007
|
538926943
|
OverallCond02
|
1.6665478
|
1087968559
|
MoSold12
|
1.6045188
|
4069288751
|
FunctionalMod
|
1.5783282
|
3750376514
|
SaleTypeCon
|
1.5139422
|
785116346
|
LandSlopeSev
|
1.5054188
|
4478543599
|
SaleConditionFamily
|
1.4954567
|
1731293795
|
MoSold02
|
1.4563728
|
4061854353
|
Condition1RRNn
|
1.4536000
|
374244244
|
X3SsnPorch
|
1.4063978
|
3394784358
|
YrSold
|
1.4058075
|
27849586937
|
Exterior2ndBrkFace
|
1.4047286
|
4653607861
|
MiscVal
|
1.3745684
|
2961160418
|
Condition2Norm
|
1.3393934
|
1045544631
|
HouseStyleSLvl
|
1.3383236
|
2994722232
|
HeatingGrav
|
1.3114232
|
1169839448
|
HeatingOthW
|
1.1410317
|
179771859
|
Condition1RRNe
|
1.0005004
|
58389749
|
HeatingGasA
|
0.9356422
|
4105193137
|
NeighborhoodVeenker
|
0.8932208
|
2766468742
|
MoSold07
|
0.8385579
|
9273652119
|
PoolArea
|
0.8051327
|
890273272
|
Exterior2ndStucco
|
0.7526820
|
3057051233
|
MSZoningRH
|
0.7341767
|
728711503
|
Exterior2ndStone
|
0.7281763
|
609268079
|
LotShapeIR2
|
0.7130099
|
10648833393
|
Exterior2ndWd Shng
|
0.5948916
|
4089126186
|
LotConfigFR2
|
0.5930134
|
2261351395
|
LandContourHLS
|
0.5662390
|
10265622877
|
FunctionalMin1
|
0.5600259
|
1474125037
|
HeatingGasW
|
0.5060180
|
3264903027
|
SaleTypeOth
|
0.4152644
|
208526780
|
SaleConditionAdjLand
|
0.3927154
|
222853090
|
HouseStyle2.5Unf
|
0.3866142
|
1079826559
|
Exterior1stStucco
|
0.2917621
|
2546562371
|
StreetPave
|
0.2476584
|
1040113916
|
MoSold06
|
0.2367773
|
7808912454
|
RoofStyleMansard
|
0.0876502
|
377569410
|
Condition1RRAn
|
0.0838131
|
1787292874
|
Exterior1stStone
|
0.0810245
|
566527056
|
MSSubClassSC75
|
0.0573733
|
3640324222
|
UtilitiesNoSeWa
|
0.0000000
|
24405758
|
Condition2PosA
|
0.0000000
|
592259874
|
Condition2PosN
|
0.0000000
|
242273046
|
Condition2RRAe
|
0.0000000
|
49299817
|
Condition2RRAn
|
0.0000000
|
23914616
|
RoofMatlMembran
|
0.0000000
|
332916895
|
RoofMatlMetal
|
0.0000000
|
52599758
|
RoofMatlRoll
|
0.0000000
|
29326184
|
Exterior1stAsphShn
|
0.0000000
|
70708416
|
Exterior1stCBlock
|
0.0000000
|
34417577
|
Exterior1stImStucc
|
0.0000000
|
78026941
|
Exterior2ndCBlock
|
0.0000000
|
72049298
|
Exterior2ndOther
|
0.0000000
|
204369573
|
ExterCondPo
|
0.0000000
|
140093549
|
HeatingQCPo
|
0.0000000
|
23603682
|
FunctionalSev
|
0.0000000
|
306054523
|
MoSold03
|
-0.1111648
|
8529224563
|
LotConfigFR3
|
-0.1376133
|
207387605
|
Exterior2ndImStucc
|
-0.1553358
|
831022002
|
Condition2Feedr
|
-0.1704940
|
232263919
|
Exterior2ndAsphShn
|
-0.1883214
|
172305536
|
SaleTypeConLD
|
-0.3108024
|
634606083
|
ExterCondGd
|
-0.3127218
|
7102118290
|
MoSold10
|
-0.3885219
|
5339049068
|
FunctionalMaj2
|
-0.4629922
|
1083900261
|
SaleConditionAlloca
|
-0.4684010
|
2992620163
|
SaleTypeConLw
|
-0.5665886
|
299701708
|
MSSubClassSC45
|
-0.6039514
|
447356945
|
MoSold05
|
-0.6376564
|
6685430086
|
LowQualFinSF
|
-0.6460163
|
5453098982
|
RoofMatlTar&Grv
|
-0.7076708
|
1287371031
|
SaleTypeConLI
|
-0.7560331
|
1394789713
|
SaleTypeCWD
|
-0.7636488
|
1460041716
|
MoSold04
|
-0.7864021
|
6759841893
|
Condition1PosA
|
-0.8621558
|
1844154549
|
RoofStyleShed
|
-0.9576812
|
235225273
|
RoofStyleGambrel
|
-0.9858484
|
1637119066
|
MoSold11
|
-1.0121206
|
4346092468
|
LotShapeIR3
|
-1.2100498
|
3236521680
|
Condition2RRNn
|
-1.2259136
|
77988132
|
FoundationWood
|
-1.3235549
|
436203068
|
NeighborhoodBlueste
|
-1.3694832
|
48936925
|
MSSubClassSC40
|
-1.5958567
|
395203657
|
LotConfigInside
|
-1.6433612
|
10394962225
|
FoundationStone
|
-2.0794747
|
647894612
|
Exterior1stWdShing
|
-2.3497002
|
1973590525
|
HouseStyle2.5Fin
|
-2.5904792
|
4054129797
|
MoSold09
|
-4.5658305
|
4229849479
|
The takeaways from the preceeding table is that size seems to consistently rise to the top as a variable to include in the model. I don’t want to have multiple variables that are measuring the same thing (i.e. Total SF of Living Area, Total SF of Basement, etc.). I will use this table to select the variables that I think should be included in the linear regression model.
fit <- lm(SalePrice ~ GrLivArea + GarageArea + LotArea + YearRemodAdd + Fireplaces + YearBuilt + OverallQual, kaggle)
summary(fit)
Call:
lm(formula = SalePrice ~ GrLivArea + GarageArea + LotArea + YearRemodAdd +
Fireplaces + YearBuilt + OverallQual, data = kaggle)
Residuals:
Min 1Q Median 3Q Max
-132049 -20552 -1677 16488 273585
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.324e+06 1.075e+05 -12.318 < 2e-16 ***
GrLivArea 5.024e+01 2.452e+00 20.491 < 2e-16 ***
GarageArea 5.632e+01 5.387e+00 10.456 < 2e-16 ***
LotArea 8.885e-01 9.500e-02 9.353 < 2e-16 ***
YearRemodAdd 3.119e+02 5.681e+01 5.491 4.72e-08 ***
Fireplaces 9.821e+03 1.619e+03 6.065 1.68e-09 ***
YearBuilt 3.266e+02 4.163e+01 7.846 8.25e-15 ***
OverallQual 2.032e+04 1.048e+03 19.388 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33860 on 1448 degrees of freedom
Multiple R-squared: 0.806, Adjusted R-squared: 0.805
F-statistic: 859.3 on 7 and 1448 DF, p-value: < 2.2e-16
These seven variables explain 80% of the variability in the sale price. I will use this model to make the Kaggle submission.
test <- read.csv("Kaggle_House_Prices/test.csv") %>%
fill_holes()
y_hat <- predict(fit, test)
submission <- data.frame(Id = test$Id, SalePrice = y_hat)
summary(submission)
Id SalePrice
Min. :1461 Min. : -6594
1st Qu.:1826 1st Qu.:123487
Median :2190 Median :169122
Mean :2190 Mean :177836
3rd Qu.:2554 3rd Qu.:222814
Max. :2919 Max. :537193
It appears that some of the predictions are incorrect since the minimum is negative.
write.csv(submission, "submission.csv", quote = FALSE, row.names = FALSE)
My Kaggle score is 0.47769 and my username is mikesilva. I am ranked 4520 on the leaderboard. I will try to improve this score when I have time.
This analysis is available on https://github.com/mikeasilva/CUNY-SPS. A walkthough of this analysis is available on YouTube.