Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = \frac{N+1}{2}\)
set.seed(1234)
N = 10
X = runif(10000, 1, N)
Y = rnorm(10000, (N+1)/2, (N+1)/2)
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter x is estimated as the median of the X variable, and the small letter y is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
x = median(X)
y = quantile(Y, 0.25)
5 points. a. P(X>x | X>y) b. P(X>x, Y>y) c. P(X<x | X>y)
\[ P(X>x | X>y) = \frac{P(X>x \cap X>y)}{P(X>y)} \]
prob_Xy = length(X[X > y]) / length(X)
prob_Xx_Xy = length(X[X > y & X > x]) / length(X)
prob_Xx_Xy /prob_Xy
## [1] 0.5543237
There is a 55.43% probability that X would be greater than the median of the X variable, given that X is greater than the first quartile of the Y variable.
\[ P(X>x, Y>y) = P(X>x) \cap P(Y>y) \]
prob_Xx_Yy = length(which(X > x & Y > y)) / length(X)
prob_Xx_Yy
## [1] 0.3738
There is a 37.38% probability that X is greater than its median and that Y is greater than its first quartile.
\[ P(X<x | X>y) = \frac{P(X<x \cap X>y)}{P(X>y)} \]
prob_Xy = length(X[X > y]) / length(X)
prob_Xx_Xy = length(X[X > y & X < x]) / length(X)
prob_Xx_Xy /prob_Xy
## [1] 0.4456763
There is a 44.57% probability that X would be less than the median of the X variable, given that X is greater than the first quartile of the Y variable.
5 points. Investigate whether \(P(X>x\text{ and }Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
probtable <- data.frame(c(length(which(X > x & Y > y)) / 10000,
length(which(X > x & Y <= y)) / 10000),
c(length(which(X <= x & Y > y)) / 10000,
length(which(X <= x & Y <= y)) / 10000)) %>%
mutate(Total = rowSums(.)) %>%
rbind(., colSums(.)) %>%
set_colnames(c('$X>x$', '$X\\leq x$', 'Total')) %>%
set_rownames(c('$Y>y$', '$Y\\leq y$', 'Total')) %>%
kable(escape = F) %>%
kable_styling(latex_options = "hold_position") %>%
kable_classic(full_width = F)
probtable
| \(X>x\) | \(X\leq x\) | Total | |
|---|---|---|---|
| \(Y>y\) | 0.3738 | 0.3762 | 0.75 |
| \(Y\leq y\) | 0.1262 | 0.1238 | 0.25 |
| Total | 0.5000 | 0.5000 | 1.00 |
prob_Xx = length(X[X > x]) / length(X)
prob_Yy = length(Y[Y > y]) / length(Y)
prob_Xx * prob_Yy
## [1] 0.375
As seen here, the joint probability is 0.3738 but the marginal probability is 0.375, which is found by multiplying \(P(X>x)\)=0.5 and \(P(Y>y)\)=0.75. They are close in value but are not exactly equal.
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
For both, FIsher’s Exact Test and the Chi Square Test, the hypothesis are as follows:
table(X > x, Y > y) %>%
fisher.test()
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.5953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8894241 1.0682094
## sample estimates:
## odds ratio
## 0.9747199
table(X > x, Y > y) %>%
chisq.test(.)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 0.28213, df = 1, p-value = 0.5953
The p-value for both tests is 0.5953. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. That means that according to the Fisher’s Exact Test and the Chi Square Test, \(P(X>x)\) and \(P(Y>y)\) are independent.
The Fisher’s Exact Test is better applied on small samples and produces an exact test for independence. On the other hand, the Chi Square Test only produces an approximate but is applied to large data sets and the accuracy increases as the data gets larger. It can be unreliable when the data set sample is too small.
Given that the Fisher’s Exact Test can be rather tedious with large samples, it would be preferable to use the Chi Square Test. However, since both tests produced the same results, either test is appropriate.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. I want you to do the following.
train <- read.csv("train.csv", stringsAsFactors = TRUE) %>%
#converts to the appropriate type
type.convert(.) %>%
#convert into factors
mutate(OverallQual = as.factor(OverallQual),
OverallCond = as.factor(OverallCond),
MSSubClass = as.factor(MSSubClass))
test <- read.csv("test.csv", stringsAsFactors = TRUE) %>%
#converts to the appropriate type
type.convert(.) %>%
#convert into factors
mutate(OverallQual = as.factor(OverallQual),
OverallCond = as.factor(OverallCond),
MSSubClass = as.factor(MSSubClass))
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset.
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis.
Would you be worried about familywise error? Why or why not?
The data set that describes houses sold between 2006 and 2010 in Ames, Iowa has 80 predictor variables and the response variable is the sales price.There are 1460 observations in the training data set.
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 20 :536 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 60 :299 FV : 65 1st Qu.: 59.00
## Median : 730.5 50 :144 RH : 16 Median : 69.00
## Mean : 730.5 120 : 87 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 30 : 69 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 160 : 63 Max. :313.00
## (Other):262 NA's :259
## LotArea Street Alley LotShape LandContour Utilities
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63 AllPub:1459
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50 NoSeWa: 1
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## LotConfig LandSlope Neighborhood Condition1 Condition2
## Corner : 263 Gtl:1382 NAmes :225 Norm :1260 Norm :1445
## CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81 Feedr : 6
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48 Artery : 2
## FR3 : 4 Edwards:100 RRAn : 26 PosN : 2
## Inside :1052 Somerst: 86 PosN : 19 RRNn : 2
## Gilbert: 79 RRAe : 11 PosA : 1
## (Other):707 (Other): 15 (Other): 2
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1Fam :1220 1Story :726 5 :397 5 :821 Min. :1872
## 2fmCon: 31 2Story :445 6 :374 6 :252 1st Qu.:1954
## Duplex: 52 1.5Fin :154 7 :319 7 :205 Median :1973
## Twnhs : 43 SLvl : 65 8 :168 8 : 72 Mean :1971
## TwnhsE: 114 SFoyer : 37 4 :116 4 : 57 3rd Qu.:2000
## 1.5Unf : 14 9 : 43 3 : 25 Max. :2010
## (Other): 19 (Other): 43 (Other): 28
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## Min. :1950 Flat : 13 CompShg:1434 VinylSd:515 VinylSd:504
## 1st Qu.:1967 Gable :1141 Tar&Grv: 11 HdBoard:222 MetalSd:214
## Median :1994 Gambrel: 11 WdShngl: 6 MetalSd:220 HdBoard:207
## Mean :1985 Hip : 286 WdShake: 5 Wd Sdng:206 Wd Sdng:197
## 3rd Qu.:2004 Mansard: 7 ClyTile: 1 Plywood:108 Plywood:142
## Max. :2010 Shed : 2 Membran: 1 CemntBd: 61 CmentBd: 60
## (Other): 2 (Other):128 (Other):136
## MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual
## BrkCmn : 15 Min. : 0.0 Ex: 52 Ex: 3 BrkTil:146 Ex :121
## BrkFace:445 1st Qu.: 0.0 Fa: 14 Fa: 28 CBlock:634 Fa : 35
## None :864 Median : 0.0 Gd:488 Gd: 146 PConc :647 Gd :618
## Stone :128 Mean : 103.7 TA:906 Po: 1 Slab : 24 TA :649
## NA's : 8 3rd Qu.: 166.0 TA:1282 Stone : 6 NA's: 37
## Max. :1600.0 Wood : 3
## NA's :8
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Fa : 45 Av :221 ALQ :220 Min. : 0.0 ALQ : 19
## Gd : 65 Gd :134 BLQ :148 1st Qu.: 0.0 BLQ : 33
## Po : 2 Mn :114 GLQ :418 Median : 383.5 GLQ : 14
## TA :1311 No :953 LwQ : 74 Mean : 443.6 LwQ : 46
## NA's: 37 NA's: 38 Rec :133 3rd Qu.: 712.2 Rec : 54
## Unf :430 Max. :5644.0 Unf :1256
## NA's: 37 NA's: 38
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49
## Median : 0.00 Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 46.55 Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :1474.00 Max. :2336.0 Max. :6110.0 Wall : 4
##
## CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## N: 95 FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## Y:1365 FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Gd:586 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## Maj1: 14 Min. :0.000 Ex : 24 2Types : 6 Min. :1900
## Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870 1st Qu.:1961
## Min1: 31 Median :1.000 Gd :380 Basment: 19 Median :1980
## Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88 Mean :1979
## Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9 3rd Qu.:2002
## Sev : 1 Max. :3.000 NA's:690 Detchd :387 Max. :2010
## Typ :1360 NA's : 81 NA's :81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## Fin :352 Min. :0.000 Min. : 0.0 Ex : 3 Ex : 2
## RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48 Fa : 35
## Unf :605 Median :2.000 Median : 480.0 Gd : 14 Gd : 9
## NA's: 81 Mean :1.767 Mean : 473.0 Po : 3 Po : 7
## 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311 TA :1326
## Max. :4.000 Max. :1418.0 NA's: 81 NA's: 81
##
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Y:1340 Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## Min. : 0.00 Min. : 0.000 Ex : 2 GdPrv: 59 Gar2: 2
## 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2 GdWo : 54 Othr: 2
## Median : 0.00 Median : 0.000 Gd : 3 MnPrv: 157 Shed: 49
## Mean : 15.06 Mean : 2.759 NA's:1453 MnWw : 11 TenC: 1
## 3rd Qu.: 0.00 3rd Qu.: 0.000 NA's :1179 NA's:1406
## Max. :480.00 Max. :738.000
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 WD :1267
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 New : 122
## Median : 0.00 Median : 6.000 Median :2008 COD : 43
## Mean : 43.49 Mean : 6.322 Mean :2008 ConLD : 9
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 ConLI : 5
## Max. :15500.00 Max. :12.000 Max. :2010 ConLw : 5
## (Other): 9
## SaleCondition SalePrice
## Abnorml: 101 Min. : 34900
## AdjLand: 4 1st Qu.:129975
## Alloca : 12 Median :163000
## Family : 20 Mean :180921
## Normal :1198 3rd Qu.:214000
## Partial: 125 Max. :755000
##
Here are some interesting plot for the training data set.
train %>%
mutate(TotalSqFt = GrLivArea + TotalBsmtSF) %>%
ggplot(., aes(x = TotalSqFt, y = SalePrice)) +
geom_point() +
geom_smooth() +
ggtitle("Total Square Footage vs Sales Price") +
scale_y_continuous(labels = scales::label_comma())
train %>%
ggplot(., aes(x = OverallQual, y = SalePrice)) +
geom_boxplot() +
labs(title = 'Distributions of Overall Quality vs Sales Price') +
scale_y_continuous(labels = scales::label_comma())
train %>%
ggplot(., aes(x = OverallCond, y = SalePrice)) +
geom_boxplot() +
labs(title = 'Distributions of Overall Condition vs Sales Price') +
scale_y_continuous(labels = scales::label_comma())
train %>%
mutate(GarageCars = as.factor(GarageCars)) %>%
ggplot(., aes(x = GarageCars, y = SalePrice)) +
geom_boxplot() +
labs(title = 'Distributions of Amount of Cars in Garage vs Sales Price') +
scale_y_continuous(labels = scales::label_comma())
train %>%
ggplot(., aes(x = CentralAir, y = SalePrice)) +
geom_boxplot() +
labs(title = 'Central Air Coniditioning vs Sales Price') +
scale_y_continuous(labels = scales::label_comma())
train %>%
mutate(LandSlope = recode(LandSlope, `Gtl` = "Gentle",
`Mod` = "Moderate",
`Sev` = "Severe")) %>%
ggplot(., aes(x = LandSlope, y = SalePrice)) +
geom_boxplot() +
labs(title = 'Land Slope vs Sales Price') +
scale_y_continuous(labels = scales::label_comma())
train %>%
mutate(Condition = ifelse(Condition1 == "PosN" | Condition1 == "PosA", "positive off-site",
ifelse(Condition1 == "Norm", "normal" ,
ifelse(Condition1 == "Artery" | Condition1 == "Feedr", "busy street",
"railroad")))) %>%
ggplot(., aes(x = Condition, y = SalePrice)) +
geom_boxplot() +
labs(title = 'Proximity to Various Land Features vs Sales Price') +
scale_y_continuous(labels = scales::label_comma())
train %>%
mutate(ExterQual = fct_relevel(ExterQual, "Ex", "Gd", "TA", "Fa"),
ExterQual = recode(ExterQual, `Ex` = "Excellent",
`TA` = "Average",
`Fa` = "Fair",
`Gd` = "Good")) %>%
ggplot(., aes(x = ExterQual, y = SalePrice)) +
geom_boxplot() +
labs(title = 'Exterior Quality vs Sales Price') +
scale_y_continuous(labels = scales::label_comma())
train %>%
mutate(TotalBath = BsmtFullBath + 0.5 * BsmtHalfBath + FullBath + 0.5 * HalfBath) %>%
ggplot(., aes(x = TotalBath, y = SalePrice)) +
geom_point() +
geom_smooth() +
ggtitle("Total Bathrooms vs Sales Price") +
scale_y_continuous(labels = scales::label_comma())
train %>%
mutate(Age = YrSold - YearBuilt) %>%
ggplot(., aes(x = Age, y = SalePrice)) +
geom_point() +
geom_smooth() +
ggtitle("Age of House vs Sales Price") +
scale_y_continuous(labels = scales::label_comma())
train %>%
ggplot(., aes(x = SaleCondition, y = SalePrice)) +
geom_boxplot() +
labs(title = 'Conidition of Sale vs Sales Price') +
scale_y_continuous(labels = scales::label_comma())
train %>%
ggplot(., aes(x = SaleType, y = SalePrice)) +
geom_boxplot() +
labs(title = 'Type of Sale vs Sales Price') +
scale_y_continuous(labels = scales::label_comma())
As seen by the scatterplots, the Sale Price tends to generally increase as the square footage of the above ground living area (GrLivArea), the basement (TotalBsmtSF), and the garage (GarageArea) increases. There is also a increasing relationship between the square footage of the house above ground and the basement. There also seems to outliers for extremely large areas. It should be noted that the compiler of the data set mentions that houses who have over 4000 sq. ft. above ground should be removed for modeling purposes.
pairs(~SalePrice + GrLivArea + TotalBsmtSF + GarageArea, data = train, lwd =0.5)
correlation_table <- train %>%
dplyr::select(SalePrice, GrLivArea, GarageArea) %>%
cor(., method = "pearson")
correlation_table %>%
kable(escape = F) %>%
kable_styling(latex_options = "hold_position") %>%
kable_classic(full_width = F)
| SalePrice | GrLivArea | GarageArea | |
|---|---|---|---|
| SalePrice | 1.0000000 | 0.7086245 | 0.6234314 |
| GrLivArea | 0.7086245 | 1.0000000 | 0.4689975 |
| GarageArea | 0.6234314 | 0.4689975 | 1.0000000 |
Testing the correlations between each pairwise set of variables
cor.test(~ SalePrice + GrLivArea, data = train, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: SalePrice and GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
cor.test(~ SalePrice + GarageArea, data = train, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: SalePrice and GarageArea
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
cor.test(~ GarageArea + GrLivArea, data = train, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: GarageArea and GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4423993 0.4947713
## sample estimates:
## cor
## 0.4689975
0 is not contained in any of the confidence intervals and all three p-values are less than our significance level of 0.05, which means that there correlations between each pairwise set of variables is not 0. It shows that there is indeed some relationships between these pairs that can be of further interest.
A familywise error is a type I error or false discovery in which the null hypotheses is rejected incorrectly. I would not be worried in this case because as we explored the variables in the plots above, we saw that there was some relationship between them. The p-values are extremely small and close to 0, that we would still reject the null hypothesis if we minimized the significance level.
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
# inverse of matrix
precision_mat <- solve(correlation_table)
precision_mat %>%
kable(escape = F, caption = "Precision Matrix") %>%
kable_styling(latex_options = "hold_position") %>%
kable_classic(full_width = F)
| SalePrice | GrLivArea | GarageArea | |
|---|---|---|---|
| SalePrice | 2.5692028 | -1.3709485 | -0.9587504 |
| GrLivArea | -1.3709485 | 2.0135330 | -0.0896496 |
| GarageArea | -0.9587504 | -0.0896496 | 1.6397606 |
# Multiply the correlation matrix by the precision matrix
corr_prec <- correlation_table %*% precision_mat
corr_prec %>%
kable(escape = F, caption = "Correlation Matrix $\\cdot$ Precision Matrix", digits = 24) %>%
kable_styling(latex_options = "hold_position") %>%
kable_classic(full_width = F)
| SalePrice | GrLivArea | GarageArea | |
|---|---|---|---|
| SalePrice | 1.000000e+00 | 4.163336e-17 | 0 |
| GrLivArea | 2.220446e-16 | 1.000000e+00 | 0 |
| GarageArea | 3.330669e-16 | 8.326673e-17 | 1 |
# multiply the precision matrix by the correlation matrix
prec_corr <- precision_mat %*% correlation_table
prec_corr %>%
kable(escape = F, caption = "Precision Matrix $\\cdot$ Correlation Matrix", digits = 24 ) %>%
kable_styling(latex_options = "hold_position") %>%
kable_classic(full_width = F)
| SalePrice | GrLivArea | GarageArea | |
|---|---|---|---|
| SalePrice | 1.000000e+00 | 4.440892e-16 | 4.440892e-16 |
| GrLivArea | -1.804112e-16 | 1.000000e+00 | -1.387779e-16 |
| GarageArea | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 |
# When multiplying by the inverse, the result should be identity matrix
# when rounded the products are equal
round(corr_prec, 15) == round(prec_corr, 15)
## SalePrice GrLivArea GarageArea
## SalePrice TRUE TRUE TRUE
## GrLivArea TRUE TRUE TRUE
## GarageArea TRUE TRUE TRUE
#Conduct LU decomposition on the matrix
clu <- lu.decomposition(correlation_table)
clu
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.00000000 0
## [2,] 0.7086245 1.00000000 0
## [3,] 0.6234314 0.05467234 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.7086245 0.6234314
## [2,] 0 0.4978513 0.0272187
## [3,] 0 0.0000000 0.6098451
# check to see if A=LU
correlation_table == clu$L %*% clu$U
## SalePrice GrLivArea GarageArea
## SalePrice TRUE TRUE TRUE
## GrLivArea TRUE TRUE TRUE
## GarageArea TRUE TRUE TRUE
plu <- lu.decomposition(precision_mat)
plu
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] -0.5336085 1.0000000 0
## [3,] -0.3731704 -0.4689975 1
##
## $U
## [,1] [,2] [,3]
## [1,] 2.569203 -1.370948 -0.9587504
## [2,] 0.000000 1.281983 -0.6012469
## [3,] 0.000000 0.000000 1.0000000
round(precision_mat,16) == round(plu$L %*% plu$U,16)
## SalePrice GrLivArea GarageArea
## SalePrice TRUE TRUE TRUE
## GrLivArea TRUE TRUE TRUE
## GarageArea TRUE TRUE TRUE
#side note: 2 values come up as false only when they are rounded to the 17th digit
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. See this. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
e1071::skewness(train$GrLivArea)
## [1] 1.363754
min(train$GrLivArea) > 0
## [1] TRUE
The skewness of GrLivArea is 1.363754 which indicates that it is highly skewed to the right. Shifting the variable is not necessary as it is strictly above zero.
exp_pdf <- fitdistr(train$GrLivArea, "exponential")
exp_pdf
## rate
## 6.598640e-04
## (1.726943e-05)
optimal_lambda <- exp_pdf$estimate %>% unname(.)
exp_sample <- rexp(1000, optimal_lambda)
As seen, the scales for both axes changed, and the peak of the distribution shifted to the right.
train %>%
ggplot(., aes(x = GrLivArea)) +
geom_histogram(bins = 50) +
ggtitle("Original Variable") +
scale_x_continuous(labels = scales::label_comma()) +
geom_vline(aes(xintercept = mean(GrLivArea)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(GrLivArea)),col='blue',size=0.5) +
geom_text(aes(x=mean(GrLivArea) + 70, label="Mean", y=20), colour="red", angle=90) +
geom_text(aes(x=median(GrLivArea) - 80, label="Median", y=35), colour="blue", angle=90)
exp_sample%>%
as.data.frame() %>%
ggplot(., aes(x = .)) +
geom_histogram(bins = 50) +
ggtitle("Exponential Distribution") +
scale_x_continuous(labels = scales::label_comma()) +
geom_vline(aes(xintercept = mean(.)),col='red',size=0.5) +
geom_vline(aes(xintercept = median(.)),col='blue',size=0.5) +
geom_text(aes(x=mean(.) + 70, label="Mean", y=20), colour="red", angle=90) +
geom_text(aes(x=median(.) - 120, label="Median", y=35), colour="blue", angle=90)
### 5th and 95th Percentile
quantile(exp_sample, c(0.05, 0.95))
## 5% 95%
## 81.22943 4376.21881
avg = mean(train$GrLivArea)
sd = sd(train$GrLivArea)
n = length(train$GrLivArea)
z = qnorm(0.975)
c(avg - z * sd/sqrt(n), avg + z * sd/sqrt(n))
## [1] 1488.509 1542.418
The 95% confidence interval from the empirical data is (1488.509, 1542.419). It has a mean of 1515.4636986. It can be seen on the histogram as the red line.
quantile(train$GrLivArea, c(0.05, 0.95))
## 5% 95%
## 848.0 2466.1
The 5th and 95th percentile for the empirical data is over a smaller range compared to the samples taken of the exponential function with a \(\lambda\) of 6.598640410^{-4}.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Prior to building a model, the data has to be cleaned as there are a lot of missing variables. Some of the factored variables can be imputed with the value “none” as the house may not have it and the description text for the data set states that “NA” may specify none. The remainder of the missing variables can be imputed with the mice package, using the random forest method.
train %>%
summarise_all(funs(sum(is.na(.)))) %>%
gather(variable, value) %>%
filter(value != 0)
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## variable value
## 1 LotFrontage 259
## 2 Alley 1369
## 3 MasVnrType 8
## 4 MasVnrArea 8
## 5 BsmtQual 37
## 6 BsmtCond 37
## 7 BsmtExposure 38
## 8 BsmtFinType1 37
## 9 BsmtFinType2 38
## 10 Electrical 1
## 11 FireplaceQu 690
## 12 GarageType 81
## 13 GarageYrBlt 81
## 14 GarageFinish 81
## 15 GarageQual 81
## 16 GarageCond 81
## 17 PoolQC 1453
## 18 Fence 1179
## 19 MiscFeature 1406
# replacing na values
train <- train %>%
mutate(Alley = fct_explicit_na(Alley, "none"),
MasVnrType = fct_explicit_na(MasVnrType, "none"),
BsmtQual = fct_explicit_na(BsmtQual, "none"),
BsmtCond = fct_explicit_na(BsmtCond, "none"),
BsmtExposure = fct_explicit_na(BsmtExposure, "none"),
BsmtFinType1 = fct_explicit_na(BsmtFinType1, "none"),
BsmtFinType2 = fct_explicit_na(BsmtFinType2, "none"),
FireplaceQu = fct_explicit_na(FireplaceQu, "none"),
GarageType = fct_explicit_na(GarageType, "none"),
GarageFinish = fct_explicit_na(GarageFinish, "none"),
GarageQual = fct_explicit_na(GarageQual, "none"),
GarageCond = fct_explicit_na(GarageCond, "none"),
PoolQC = fct_explicit_na(PoolQC, "none"),
Fence = fct_explicit_na(Fence, "none"),
MiscFeature = fct_explicit_na(MiscFeature, "none"))
train %>%
summarise_all(funs(sum(is.na(.)))) %>%
gather(variable, value) %>%
filter(value != 0)
## variable value
## 1 LotFrontage 259
## 2 MasVnrArea 8
## 3 Electrical 1
## 4 GarageYrBlt 81
#imputing by random forest
tempData <- mice(train, m=1, maxit=1, meth='cart',seed=1234)
##
## iter imp variable
## 1 1 LotFrontage MasVnrArea Electrical GarageYrBlt
## Warning: Number of logged events: 4
imputed <- complete(tempData,1)
SalesPrice with an increase of square footage, the natural log of it was taken.TotalSqFt is a combination of the square footage above the ground and below.TotalBath is the total of all the bathroom, with 0.5 given as weight to half bathrooms.Age is just the difference in years from when the house was built to when it was sold.SaleCondition was marked as “normal”, “partial”, and the rest was marked as “other” since they were abnormal sales or traded between family members. They also had less of an effect on the variation on sales price.PorchSqFt is just the square footage of the entire porch.NewHome denotes if the house is newly constructed or not.OverallCond was regrouped in groups of {1,2,3,4}, {5}, {6,7}, {8,9}.#transforming variables
transformed = imputed %>%
filter(GrLivArea < 4000) %>%
mutate(TotalSqFt = GrLivArea + TotalBsmtSF,
TotalBath = BsmtFullBath + 0.5 * BsmtHalfBath + FullBath + 0.5 * HalfBath,
Age = YrSold - YearBuilt,
SaleCondition = ifelse(SaleCondition == "Normal", "normal",
ifelse(SaleCondition == "Partial", "partial", "other")),
PorchSqFt = ScreenPorch + X3SsnPorch + EnclosedPorch + OpenPorchSF + WoodDeckSF,
OverallCond = ifelse(OverallCond %in% "5", "5" ,
ifelse(OverallCond %in% c("1", "2" ,"3" ,"4"), "less than 5",
ifelse(OverallCond %in% c("6", "7"), "6-7",
"8-9"))),
NewHome = ifelse(SaleType == 'New', 'new', 'other'))
model = lm(log(SalePrice) ~ TotalSqFt + OverallQual + Neighborhood + NewHome +
Age + CentralAir + Fireplaces + GarageArea + TotalBath + PorchSqFt + PoolArea +
SaleCondition + MSZoning + BldgType + OverallCond , data = transformed)
summary(model)
##
## Call:
## lm(formula = log(SalePrice) ~ TotalSqFt + OverallQual + Neighborhood +
## NewHome + Age + CentralAir + Fireplaces + GarageArea + TotalBath +
## PorchSqFt + PoolArea + SaleCondition + MSZoning + BldgType +
## OverallCond, data = transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69166 -0.06142 0.00437 0.06784 0.50492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.064e+01 1.269e-01 83.908 < 2e-16 ***
## TotalSqFt 1.813e-04 7.252e-06 25.006 < 2e-16 ***
## OverallQual2 9.797e-02 1.095e-01 0.895 0.370911
## OverallQual3 2.829e-01 8.869e-02 3.190 0.001456 **
## OverallQual4 3.502e-01 8.640e-02 4.054 5.32e-05 ***
## OverallQual5 3.860e-01 8.654e-02 4.461 8.82e-06 ***
## OverallQual6 4.341e-01 8.687e-02 4.997 6.57e-07 ***
## OverallQual7 5.068e-01 8.737e-02 5.800 8.17e-09 ***
## OverallQual8 5.821e-01 8.842e-02 6.583 6.49e-11 ***
## OverallQual9 7.228e-01 9.069e-02 7.970 3.26e-15 ***
## OverallQual10 7.564e-01 9.513e-02 7.951 3.77e-15 ***
## NeighborhoodBlueste -8.858e-02 9.158e-02 -0.967 0.333558
## NeighborhoodBrDale -7.403e-02 4.953e-02 -1.495 0.135213
## NeighborhoodBrkSide 1.283e-02 4.212e-02 0.305 0.760754
## NeighborhoodClearCr 1.038e-01 4.092e-02 2.536 0.011332 *
## NeighborhoodCollgCr 1.589e-02 3.410e-02 0.466 0.641268
## NeighborhoodCrawfor 1.604e-01 3.885e-02 4.129 3.85e-05 ***
## NeighborhoodEdwards -3.309e-02 3.700e-02 -0.894 0.371256
## NeighborhoodGilbert 3.248e-02 3.592e-02 0.904 0.366150
## NeighborhoodIDOTRR -3.407e-02 4.853e-02 -0.702 0.482686
## NeighborhoodMeadowV -1.499e-01 4.738e-02 -3.164 0.001592 **
## NeighborhoodMitchel -1.668e-02 3.775e-02 -0.442 0.658654
## NeighborhoodNAmes -8.865e-03 3.544e-02 -0.250 0.802498
## NeighborhoodNoRidge 7.228e-02 3.847e-02 1.879 0.060488 .
## NeighborhoodNPkVill -4.514e-02 5.083e-02 -0.888 0.374668
## NeighborhoodNridgHt 7.171e-02 3.464e-02 2.070 0.038628 *
## NeighborhoodNWAmes -4.988e-02 3.658e-02 -1.364 0.172938
## NeighborhoodOldTown -4.240e-02 4.374e-02 -0.969 0.332579
## NeighborhoodSawyer -2.206e-02 3.740e-02 -0.590 0.555393
## NeighborhoodSawyerW 6.852e-03 3.626e-02 0.189 0.850154
## NeighborhoodSomerst 2.913e-02 4.191e-02 0.695 0.487166
## NeighborhoodStoneBr 1.208e-01 3.889e-02 3.105 0.001938 **
## NeighborhoodSWISU -1.180e-03 4.452e-02 -0.027 0.978857
## NeighborhoodTimber 3.687e-02 3.815e-02 0.966 0.334046
## NeighborhoodVeenker 8.480e-02 4.787e-02 1.771 0.076719 .
## NewHomeother -1.376e-01 7.041e-02 -1.954 0.050860 .
## Age -2.175e-03 2.590e-04 -8.398 < 2e-16 ***
## CentralAirY 7.330e-02 1.557e-02 4.709 2.74e-06 ***
## Fireplaces 3.668e-02 6.130e-03 5.983 2.77e-09 ***
## GarageArea 1.987e-04 2.027e-05 9.803 < 2e-16 ***
## TotalBath 6.410e-02 5.970e-03 10.737 < 2e-16 ***
## PorchSqFt 1.336e-04 2.303e-05 5.804 8.00e-09 ***
## PoolArea 1.664e-04 8.919e-05 1.866 0.062282 .
## SaleConditionother -6.775e-02 1.117e-02 -6.063 1.72e-09 ***
## SaleConditionpartial -8.540e-02 6.964e-02 -1.226 0.220243
## MSZoningFV 3.396e-01 5.575e-02 6.092 1.44e-09 ***
## MSZoningRH 2.947e-01 5.604e-02 5.260 1.67e-07 ***
## MSZoningRL 3.070e-01 4.681e-02 6.558 7.63e-11 ***
## MSZoningRM 2.799e-01 4.386e-02 6.383 2.36e-10 ***
## BldgType2fmCon -6.620e-04 2.309e-02 -0.029 0.977134
## BldgTypeDuplex -5.707e-02 1.859e-02 -3.070 0.002180 **
## BldgTypeTwnhs -9.490e-02 2.541e-02 -3.735 0.000195 ***
## BldgTypeTwnhsE -4.295e-02 1.667e-02 -2.576 0.010110 *
## OverallCond6-7 7.539e-02 8.887e-03 8.483 < 2e-16 ***
## OverallCond8-9 1.243e-01 1.497e-02 8.307 2.29e-16 ***
## OverallCondless than 5 -9.790e-02 1.548e-02 -6.324 3.42e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1179 on 1400 degrees of freedom
## Multiple R-squared: 0.9147, Adjusted R-squared: 0.9114
## F-statistic: 273.1 on 55 and 1400 DF, p-value: < 2.2e-16
Overall, the model has an adjusted R-squared of 0.9114, meaning that 91.14% of the variation in sales price can be explained by the predictor variables.
ggplot(data = model, aes(x = .fitted, y = .resid)) +
geom_point() + geom_hline(yintercept = 0, linetype = 'dashed') +
geom_smooth(se = FALSE) + xlab('Fitted values') + ylab('Residuals')
ggplot(data = model, aes(x = .resid)) + geom_histogram() + xlab('Residuals')
ggplot(data = model) + stat_qq(aes(sample = .stdresid)) + geom_abline()
#replacing na values
test <- test %>%
mutate(Alley = fct_explicit_na(Alley, "none"),
MasVnrType = fct_explicit_na(MasVnrType, "none"),
BsmtQual = fct_explicit_na(BsmtQual, "none"),
BsmtCond = fct_explicit_na(BsmtCond, "none"),
BsmtExposure = fct_explicit_na(BsmtExposure, "none"),
BsmtFinType1 = fct_explicit_na(BsmtFinType1, "none"),
BsmtFinType2 = fct_explicit_na(BsmtFinType2, "none"),
FireplaceQu = fct_explicit_na(FireplaceQu, "none"),
GarageType = fct_explicit_na(GarageType, "none"),
GarageFinish = fct_explicit_na(GarageFinish, "none"),
GarageQual = fct_explicit_na(GarageQual, "none"),
GarageCond = fct_explicit_na(GarageCond, "none"),
PoolQC = fct_explicit_na(PoolQC, "none"),
Fence = fct_explicit_na(Fence, "none"),
MiscFeature = fct_explicit_na(MiscFeature, "none"))
#imputing by random forest
tempData <- mice(test, m=1, maxit=1, meth='cart',seed=1234)
##
## iter imp variable
## 1 1 MSZoning LotFrontage Exterior1st Exterior2nd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageYrBlt GarageCars GarageArea SaleType
## Warning: Number of logged events: 18
imputed_test <- complete(tempData,1)
#transforming variables
imputed_test <- imputed_test %>%
mutate(TotalSqFt = GrLivArea + TotalBsmtSF,
TotalBath = BsmtFullBath + 0.5 * BsmtHalfBath + FullBath + 0.5 * HalfBath,
Age = YrSold - YearBuilt,
SaleCondition = ifelse(SaleCondition == "Normal", "normal",
ifelse(SaleCondition == "Partial", "partial", "other")),
PorchSqFt = ScreenPorch + X3SsnPorch + EnclosedPorch + OpenPorchSF + WoodDeckSF,
OverallCond = ifelse(OverallCond %in% "5", "5" ,
ifelse(OverallCond %in% c("1", "2" ,"3" ,"4"), "less than 5",
ifelse(OverallCond %in% c("6", "7"), "6-7",
"8-9"))),
NewHome = ifelse(SaleType == 'New', 'new', 'other'))
#predicting sale price
imputed_test$SalePrice = exp(predict(model, imputed_test))
#creating file for submission
tosubmit <- imputed_test %>%
dplyr::select(Id, SalePrice)
write.csv(tosubmit, "finalmodel.csv")
Kaggle.com user name: llamas24 Score: 0.14102