The final exam features a mini-project along with several small and medium-sized questions, and serves as a comprehensive review of the course. Post your solutions to GitHub and either give a short 3-5 minute presentation or post a recorded presentation to Blackboard. This project demonstrates your ability to apply the elements of the class.
library(MASS)
library(agrmt)
library(VennDiagram)
## Loading required package: grid
## Loading required package: futile.logger
Register for free with Kaggle.com to compete in the House Prices: Advanced Regression Techniques competition.
Competition Details: Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. The potential for creative feature engineering provides a rich opportunity for fun and learning. This dataset lends itself to advanced regression techniques like random forests and gradient boosting with the popular XGBoost library. We encourage Kagglers to create benchmark code and tutorials on Kernels for community learning. Top kernels will be awarded swag prizes at the competition close.
Pick one of the quantitative independent variables from the training data set train.csv, and define that variable as \(X\). Make sure this variable is skewed to the right. Pick the dependent variable and define it as \(Y\).
train <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
"master/DATA%20605/train.csv"), stringsAsFactors = F)
Y <- train[,"SalePrice"]; X <- {}
skewed <- function (train) {
  # remove categorical variables
  quantitative <- train[ , sapply(train, is.numeric)]
  # remove ID and Y (dependent) variable
  predictors <- quantitative[ , -c(1, ncol(quantitative))]
  par(mfrow = c(3, 3))
  for (i in 1:ncol(predictors)) {
    x <- predictors[!sapply(predictors[ , i], is.na), i]
    n <- length(x); mu <- mean(x); se <- sd(x)
    skew <- sum(((x - mu) / se)^3) / n
    # Standard Error of Skewness
    ses <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    if (modes(x)$contiguous) { # One (contiguous) mode
      if (skew > 40 * ses) { # Very Significant Positive Skew
        d <- density(x, na.rm = TRUE)
        string <- paste(names(predictors[i]))
        plot(d, ylab = string, main = string)
        polygon(d, col="red")
        X <- c(X, i)
      }
    }
  }
  return(predictors[, X])
}
X <- skewed(train)
Negative skew means the left tail is longer; the mass of the distribution is concentrated on the right of the figure, and the distribution is said to be left-skewed, left-tailed, or skewed to the left. Positive skew means the right tail is longer; the mass of the distribution is concentrated on the left of the figure, and the distribution is said to be right-skewed, right-tailed, or skewed to the right. To allow a more thorough analysis, the data set \(\mathbf{X}\) containing all of the right-skewed independent variables was created.
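As a minimal sketch of the statistics used in skewed() above, the sample skewness and its standard error can be computed for a single column; LotArea is assumed here purely for illustration.
x <- train$LotArea[!is.na(train$LotArea)]   # one quantitative predictor, NAs dropped
n <- length(x); mu <- mean(x); se <- sd(x)
skew <- sum(((x - mu) / se)^3) / n          # third standardized moment (sample skewness)
ses <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
c(skewness = skew, SES = ses, ratio = skew / ses)   # a large positive ratio indicates right skew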
Calculate as a minimum the below probabilities a through c. Assume the small letter \(x\) is estimated as the 3rd quartile of the \(X\) variable, and the small letter \(y\) is estimated as the 2nd quartile of the \(Y\) variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below. Does splitting the training data in this fashion make them independent?
quantile(Y)
## 0% 25% 50% 75% 100%
## 34900 129975 163000 214000 755000
\[\textbf{a.} \quad P\left( { X > x }|{ Y > y } \right) \quad\quad\quad \textbf{b.} \quad P(X > x, Y > y) \quad\quad\quad \textbf{c.} \quad P\left( { X < x }|{ Y > y } \right)\]
probabilities <- function(X, Y) {
  y <- quantile(Y, 0.5); Q <- V <- A <- B <- C <- {}
  for (i in 1:ncol(X)) {
    x <- quantile(X[,i], 0.75, na.rm = T); n <- length(Y)
    # numerator for a.: P(X > x, Y > y) (falls back to P(Y > y) when the split is empty)
    a <- as.numeric(table(if (any(X[,i] > x)) {Y[X[,i] > x] > y}
                          else {Y > y})["TRUE"]) / n
    # numerator for c.: P(X < x, Y > y) (same fallback)
    c <- as.numeric(table(if (any(X[,i] < x)) {Y[X[,i] < x] > y}
                          else {Y > y})["TRUE"]) / n
    t5 <- as.numeric(table(Y[X[,i] > x] > y)["TRUE"]) / n   # b. P(X > x, Y > y)
    t6 <- as.numeric(table(Y > y)["TRUE"]) / n              # P(Y > y)
    Q <- rbind(Q, quantile(X[,i], na.rm = T))
    V <- c(V, names(X[i]))
    A <- c(A, a / t6); B <- c(B, t5); C <- c(C, c / t6)     # a., b., c. as defined above
  }
  colnames(Q) <- c("Q_0","Q_25","Q_50","Q_75","Q_100")
  print(list("Quartiles" = data.frame("X" = V, Q),
             "Probabilities"= data.frame("X" = V, "a." = A, "b." = B, "c." = C)))
}
probabilities(X, Y)
## $Quartiles
## X Q_0 Q_25 Q_50 Q_75 Q_100
## 1 LotArea 1300 7553.5 9478.5 11601.5 215245
## 2 MasVnrArea 0 0.0 0.0 166.0 1600
## 3 BsmtFinSF2 0 0.0 0.0 0.0 1474
## 4 LowQualFinSF 0 0.0 0.0 0.0 572
## 5 EnclosedPorch 0 0.0 0.0 0.0 552
## 6 X3SsnPorch 0 0.0 0.0 0.0 508
## 7 ScreenPorch 0 0.0 0.0 0.0 480
## 8 PoolArea 0 0.0 0.0 0.0 738
## 9 MiscVal 0 0.0 0.0 0.0 15500
##
## $Probabilities
## X a. b. c.
## 1 LotArea 0.379120879 0.189041096 0.6208791
## 2 MasVnrArea 0.365384615 0.182191781 0.6208791
## 3 BsmtFinSF2 0.096153846 0.047945205 1.0000000
## 4 LowQualFinSF 0.010989011 0.005479452 1.0000000
## 5 EnclosedPorch 0.068681319 0.034246575 1.0000000
## 6 X3SsnPorch 0.024725275 0.012328767 1.0000000
## 7 ScreenPorch 0.103021978 0.051369863 1.0000000
## 8 PoolArea 0.008241758 0.004109589 1.0000000
## 9 MiscVal 0.024725275 0.012328767 1.0000000
\(\textbf{d.}\quad x/y\) | \(Y \leq\) 2nd quartile | \(Y >\) 2nd quartile | Total |
---|---|---|---|
\(X \leq\) 3rd quartile | \(n(X\leq x,Y\leq y)\) | \(n(X\leq x,Y>y)\) | \(n(X\leq x)\) |
\(X >\) 3rd quartile | \(n(X>x,Y\leq y)\) | \(n(X>x,Y>y)\) | \(n(X>x)\) |
Total | \(n(Y\leq y)\) | \(n(Y>y)\) | \(n\) |
counts <- function(X, Y) {
  y <- quantile(Y, 0.5)
  for (i in 1:ncol(X)) {
    x <- quantile(X[,i], 0.75, na.rm = T); n <- length(Y)
    t1 <- as.numeric(table(Y[X[,i] <= x] <= y)["TRUE"])
    t4 <- as.numeric(table(Y[X[,i] <= x] > y)["TRUE"])
    t2 <- as.numeric(table(Y[X[,i] > x] <= y)["TRUE"])
    t5 <- as.numeric(table(Y[X[,i] > x] > y)["TRUE"])
    t7 <- as.numeric(table(X[,i] <= x)["TRUE"])
    t8 <- as.numeric(table(X[,i] > x)["TRUE"])
    t3 <- as.numeric(table(Y <= y)["TRUE"])
    t6 <- as.numeric(table(Y > y)["TRUE"])
    d <- data.frame(matrix(c(t1,t2,t3,t4,t5,t6,t7,t8,t7 + t8), nrow = 3))
    colnames(d) <- c("Y <= 2nd quartile", "Y > 2nd quartile", "Total")
    rownames(d) <- c("X <= 3rd quartile", "X > 3rd quartile", "Total")
    d <- list(d); names(d) <- names(X[i]); print(d)
  }
}
counts(X, Y)
## $LotArea
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 643 452 1095
## X > 3rd quartile 89 276 365
## Total 732 728 1460
##
## $MasVnrArea
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 637 454 1091
## X > 3rd quartile 95 266 361
## Total 732 728 1452
##
## $BsmtFinSF2
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 635 658 1293
## X > 3rd quartile 97 70 167
## Total 732 728 1460
##
## $LowQualFinSF
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 714 720 1434
## X > 3rd quartile 18 8 26
## Total 732 728 1460
##
## $EnclosedPorch
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 574 678 1252
## X > 3rd quartile 158 50 208
## Total 732 728 1460
##
## $X3SsnPorch
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 726 710 1436
## X > 3rd quartile 6 18 24
## Total 732 728 1460
##
## $ScreenPorch
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 691 653 1344
## X > 3rd quartile 41 75 116
## Total 732 728 1460
##
## $PoolArea
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 731 722 1453
## X > 3rd quartile 1 6 7
## Total 732 728 1460
##
## $MiscVal
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 698 710 1408
## X > 3rd quartile 34 18 52
## Total 732 728 1460
\(P(X > x, Y > y)\) is the probability that the events \(X > x\) and \(Y > y\) occur jointly. In Venn diagram terms, it is the area represented by the intersection of both events. \(P\left( { X > x }|{ Y > y } \right)\) and \(P\left( { X < x }|{ Y > y } \right)\) are the conditional probabilities of \(X > x\) or \(X < x\) given that \(Y > y\) has occurred. In Venn diagram terms, each is the area of the intersection of the two events divided by the area of the given event. The contingency table displays counts of the events specified in the top and left margins, and the sums of the joint events in the bottom and right margins. Splitting the training data in this fashion does not make the variables independent: for every variable, the conditional proportion differs from the marginal proportion. In count terms, independence would require \(\frac { n\left( X,Y \right) }{ n\left( Y \right) } = \frac { n\left( X \right) }{ n }\), which does not hold here.
grid.newpage()
draw.pairwise.venn(area1 = 1, area2 = 1, cross.area = 0.25, scaled = T, cat.prompts = T,
lty = rep("blank", 2), fill = c("blue", "green"), alpha = rep(0.5, 2),
category = c("X", "Y"), cat.pos = c(0, 0), cat.dist = rep(0.025, 2))
## INFO [2016-12-19 19:48:57] Placing category labels at default outer locations. Use 'cat.pos' and 'cat.dist' to modify location.
## INFO [2016-12-19 19:48:57] Current 'cat.pos': 0 degrees, 0 degrees
## INFO [2016-12-19 19:48:57] Current 'cat.dist': 0.025 , 0.025
## (polygon[GRID.polygon.1], polygon[GRID.polygon.2], polygon[GRID.polygon.3], polygon[GRID.polygon.4], text[GRID.text.5], text[GRID.text.6], text[GRID.text.7], text[GRID.text.8], text[GRID.text.9])
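As a quick numeric sketch of how these quantities come out of the contingency table, the LotArea counts above are assumed:
n <- 1460    # total observations
nB <- 728    # n(Y > y)
nAB <- 276   # n(X > x, Y > y) for LotArea
c(joint = nAB / n, conditional = nAB / nB)   # P(X > x, Y > y) and P(X > x | Y > y)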
Let \(A\) be the new variable counting those observations above the 3rd quartile for \(X\), and let \(B\) be the new variable counting those observations above the 2nd quartile for \(Y\). Does \(P(A|B)=P(A)P(B)\)? Check mathematically, and then evaluate by running a Chi Square test for association.
independence <- function(X, Y) {
  for (i in 1:ncol(X)) {
    y <- quantile(Y, 0.5); n <- length(Y)
    x <- quantile(X[,i], 0.75, na.rm = T)
    A <- as.numeric(table(X[,i] > x)["TRUE"])               # n(A): observations above Q3 of X
    B <- as.numeric(table(Y > y)["TRUE"])                   # n(B): observations above Q2 of Y
    AB <- as.numeric(table(Y[X[,i] > x] > y)["TRUE"]) / n   # P(A, B)
    Pa <- A / n; Pb <- B / n
    Pab1 <- AB / Pb                                         # P(A|B) = P(A, B) / P(B)
    Pab2 <- (A / n) * (B / n)                               # P(A) P(B)
    chi <- chisq.test(table(Y[X[,i] > x] > y), simulate.p.value = TRUE)
    c1 <- c(A, Pa); c2 <- c(B, Pb); c3 <- c(NA, Pab1); c4 <- c(NA, Pab2)
    c5 <- rep(Pab1 != Pab2, 2); c6 <- c(chi$statistic, chi$p.value)
    c7 <- rep(chi$p.value < 0.01, 2)
    d <- data.frame(round(c1,4),round(c2,4),round(c3,4),round(c4,4),c5,round(c6,4),c7)
    colnames(d) <- c("A","B","P(A|B)","P(A)P(B)","Dependent","Chi-test","Significant")
    rownames(d) <- c("Value", "P(.)")
    d <- list(d); names(d) <- names(X[i]); print(d)
  }
}
independence(X, Y)
## $LotArea
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 365.00 728.0000 NA NA TRUE 95.8055 TRUE
## P(.) 0.25 0.4986 0.3791 0.1247 TRUE 0.0005 TRUE
##
## $MasVnrArea
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 361.0000 728.0000 NA NA TRUE 81.0000 TRUE
## P(.) 0.2473 0.4986 0.3654 0.1233 TRUE 0.0005 TRUE
##
## $BsmtFinSF2
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 167.0000 728.0000 NA NA TRUE 4.3653 FALSE
## P(.) 0.1144 0.4986 0.0962 0.057 TRUE 0.0395 FALSE
##
## $LowQualFinSF
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 26.0000 728.0000 NA NA TRUE 3.8462 FALSE
## P(.) 0.0178 0.4986 0.011 0.0089 TRUE 0.0755 FALSE
##
## $EnclosedPorch
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 208.0000 728.0000 NA NA TRUE 56.0769 TRUE
## P(.) 0.1425 0.4986 0.0687 0.071 TRUE 0.0005 TRUE
##
## $X3SsnPorch
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 24.0000 728.0000 NA NA TRUE 6.000 FALSE
## P(.) 0.0164 0.4986 0.0247 0.0082 TRUE 0.025 FALSE
##
## $ScreenPorch
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 116.0000 728.0000 NA NA TRUE 9.9655 TRUE
## P(.) 0.0795 0.4986 0.103 0.0396 TRUE 0.0015 TRUE
##
## $PoolArea
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 7.0000 728.0000 NA NA TRUE 3.5714 FALSE
## P(.) 0.0048 0.4986 0.0082 0.0024 TRUE 0.1199 FALSE
##
## $MiscVal
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 52.0000 728.0000 NA NA TRUE 4.9231 FALSE
## P(.) 0.0356 0.4986 0.0247 0.0178 TRUE 0.0335 FALSE
\[P(A|B)=\frac { P\left( A,B \right) }{ P\left( B \right) } =\frac { P\left( A \right) P\left( B \right) }{ P\left( B \right) } =P\left( A \right) \quad\text{under independence, so}\quad P(A|B)=P\left( A \right) P\left( B \right) \Leftrightarrow P\left( B \right) =1 \tag{1.2.1}\]
Mathematically, \(P(A|B)=P(A)P(B)\) can hold for independent events only when \(P(B)=1\), in which case \(P(A|B)=P(A)\) and the criterion reduces to the usual definition of independence. Empirically, none of the right-skewed quantitative \(X\) variables meet this strict mathematical definition of independence. However, when the \(X\) variables are examined statistically using the chisq.test() function at a \(0.01\) significance level, only some of them show evidence of statistically significant dependence. The \(X\) variables that show statistically significant signs of dependence, listed from least to most dependent, are ScreenPorch, EnclosedPorch, MasVnrArea, and LotArea. Note that the ranking uses the \({ \chi }^{ 2 }\) test statistic rather than the \(p\)-value, because the function simulated \(p\)-values (very small expected counts could make the usual approximation of \(p\) unreliable). The simulation led to duplicate \(p\)-values in some cases, but the \({ \chi }^{ 2 }\) test statistics remained unique.
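The chisq.test() call above is applied to a one-way table of the \(Y\) split within the \(X > x\) subset. As an alternative sketch, the conventional test of association uses the full two-way table; the LotArea counts from the contingency table earlier are assumed:
tbl <- matrix(c(643, 89, 452, 276), nrow = 2,
              dimnames = list(X = c("<= Q3", "> Q3"), Y = c("<= Q2", "> Q2")))
chisq.test(tbl)   # a small p-value is evidence against independence of the two splits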
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of \(X\) and \(Y\). Provide a 95% CI for the difference in the mean of the variables.
X <- X[ , c("ScreenPorch" , "EnclosedPorch", "MasVnrArea", "LotArea")]
describe <- function(X) {
  par(mfrow=c(2,2))
  for (i in 1:ncol(X)) {
    xbar <- mean(na.omit(X[,i]))
    n <- length(na.omit(X[,i])); a <- 1 - .95
    l <- qchisq(1 - a / 2, df = 2 * n); u <- qchisq(a / 2, df = 2 * n)
    lb <- xbar - 2 * n * xbar / l; ub <- xbar + 2 * n * xbar / u
    d <- summary(na.omit(X[,i])); d["Lower CI"] <- lb; d["Upper CI"] <- ub
    d <- list(round(d, 3)); names(d) <- names(X[i]); print(d)
    plot(X[,i], Y, xlab = names(X[i]), ylab = names(train[81]),
         main = paste(names(X[i]), "vs.", names(train[81])))
  }
}
describe(X)
## $ScreenPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max. Lower CI Upper CI
## 0.000 0.000 0.000 15.060 0.000 480.000 0.744 30.925
## $EnclosedPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max. Lower CI Upper CI
## 0.000 0.000 0.000 21.950 0.000 552.000 1.084 45.079
## $MasVnrArea
## Min. 1st Qu. Median Mean 3rd Qu. Max. Lower CI Upper CI
## 0.000 0.000 0.000 103.700 166.000 1600.000 5.133 212.917
## $LotArea
## Min. 1st Qu. Median Mean 3rd Qu. Max. Lower CI
## 1300.00 7554.00 9478.00 10520.00 11600.00 215200.00 519.27
## Upper CI
## 21594.67
From the preceding parts, the training data set has been filtered down to include just the statistically dependent, quantitative, right-skewed variables. A summary of the univariate descriptive statistics for these culled data was produced using the summary() function. Given that the remaining variables appear to follow an exponential distribution, the confidence intervals for the exponential \({ 1 }/{ \lambda }\) means were calculated using the formula below, which applies to exponential distributions. The scatterplots of each \(X\) variable against \(Y\) show signs of clustering.
\[\frac { { 2n }\bar { x } }{ { \chi }_{ 1-{ \alpha }/{ 2 },2n }^{ 2 } } <\frac { 1 }{ \lambda } <\frac { { 2n }\bar { x } }{ { \chi }_{ { \alpha }/{ 2 },2n }^{ 2 } }\tag{2.1.1}\]
Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
cor(na.omit(X))
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 1.00000000 -0.08307443 0.06146554 0.04351116
## EnclosedPorch -0.08307443 1.00000000 -0.11020383 -0.02309390
## MasVnrArea 0.06146554 -0.11020383 1.00000000 0.10415982
## LotArea 0.04351116 -0.02309390 0.10415982 1.00000000
correlation <- function(X) {
  X <- na.omit(X)
  for (i in 1:(ncol(X) - 1)) {
    for (j in (i+1):ncol(X)) {
      r <- cor.test(X[,i], X[,j], conf.level = 0.99)
      d <- data.frame("r" = r$estimate, "Lower CI" = r$conf.int[1],
                      "Upper CI" = r$conf.int[2], "p.value" = r$p.value,
                      "Significant" = r$p.value < 0.01)
      d <- list(d); names(d) <- paste0(names(X[i]), " ~ ", names(X[j])); print(d)
    }
  }
}
correlation(X)
## $`ScreenPorch ~ EnclosedPorch`
## r Lower.CI Upper.CI p.value Significant
## cor -0.08307443 -0.1497985 -0.01559714 0.001533267 TRUE
##
## $`ScreenPorch ~ MasVnrArea`
## r Lower.CI Upper.CI p.value Significant
## cor 0.06146554 -0.006124728 0.1284967 0.01916276 FALSE
##
## $`ScreenPorch ~ LotArea`
## r Lower.CI Upper.CI p.value Significant
## cor 0.04351116 -0.02412459 0.1107504 0.09744771 FALSE
##
## $`EnclosedPorch ~ MasVnrArea`
## r Lower.CI Upper.CI p.value Significant
## cor -0.1102038 -0.1764548 -0.04295887 2.570285e-05 TRUE
##
## $`EnclosedPorch ~ LotArea`
## r Lower.CI Upper.CI p.value Significant
## cor -0.0230939 -0.0905175 0.04454043 0.3792086 FALSE
##
## $`MasVnrArea ~ LotArea`
## r Lower.CI Upper.CI p.value Significant
## cor 0.1041598 0.03685435 0.1705246 6.995925e-05 TRUE
The analysis was performed for the \({ _{ 4 }{ C }_{ 2 } }=6\) pairs of \(X\) variables that remain. The null hypothesis is that no statistically significant correlation exists between a pair of \(X\) variables; the alternative hypothesis is that the correlation differs significantly from zero because a relationship exists between the pair. At a significance level of \(0.01\), the null hypothesis of no relationship failed to be rejected for the correlations between ScreenPorch ~ MasVnrArea, ScreenPorch ~ LotArea, and EnclosedPorch ~ LotArea, which do not have significantly small \(p\)-values. The null hypothesis of no relationship was rejected for the correlations between ScreenPorch ~ EnclosedPorch, EnclosedPorch ~ MasVnrArea, and MasVnrArea ~ LotArea, which do have significantly small \(p\)-values. The significance of each relationship can also be seen in the confidence intervals: intervals that include zero as a possible correlation value do not indicate statistically significant relationships. The data support the conclusion that some of the \(X\) variables are correlated with each other.
Invert your correlation matrix. This is known as the precision matrix and contains variance inflation factors on the diagonal. Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
precision <- function(X) {
  S <- cor(na.omit(cbind(X)))
  Sinv <- solve(S)
  P <- solve(cov(na.omit(cbind(X))))
  print(list("Correlation Matrix (S)" = S,
             "Inverse Correlation Matrix" = Sinv,
             "Precision Matrix (P)" = P,
             "SP" = S %*% P, "PS" = P %*% S))
}
precision(X)
## $`Correlation Matrix (S)`
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 1.00000000 -0.08307443 0.06146554 0.04351116
## EnclosedPorch -0.08307443 1.00000000 -0.11020383 -0.02309390
## MasVnrArea 0.06146554 -0.11020383 1.00000000 0.10415982
## LotArea 0.04351116 -0.02309390 0.10415982 1.00000000
##
## $`Inverse Correlation Matrix`
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 1.01111953 0.077662314 -0.04973421 -0.037021156
## EnclosedPorch 0.07766231 1.018399086 0.10651581 0.009044966
## MasVnrArea -0.04973421 0.106515805 1.02543903 -0.102185688
## LotArea -0.03702116 0.009044966 -0.10218569 1.012463360
##
## $`Precision Matrix (P)`
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 3.235818e-04 2.274217e-05 -4.913700e-06 -6.629706e-08
## EnclosedPorch 2.274217e-05 2.728849e-04 9.629581e-06 1.482147e-08
## MasVnrArea -4.913700e-06 9.629581e-06 3.127772e-05 -5.649450e-08
## LotArea -6.629706e-08 1.482147e-08 -5.649450e-08 1.014580e-08
##
## $SP
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 3.213876e-04 6.649445e-07 -3.793628e-06 -7.055935e-08
## EnclosedPorch -3.596166e-06 2.699340e-04 6.592164e-06 2.632066e-08
## MasVnrArea 1.246225e-05 -1.904398e-05 2.990859e-05 -6.114608e-08
## LotArea 1.297611e-05 -4.294602e-06 2.765201e-06 1.034395e-09
##
## $PS
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 3.213876e-04 -3.596166e-06 1.246225e-05 1.297611e-05
## EnclosedPorch 6.649445e-07 2.699340e-04 -1.904398e-05 -4.294602e-06
## MasVnrArea -3.793628e-06 6.592164e-06 2.990859e-05 2.765201e-06
## LotArea -7.055935e-08 2.632066e-08 -6.114608e-08 1.034395e-09
The cor() function produces the correlation matrix \(\mathbf{S}\) for all of the remaining variables; missing values must first be removed, however. The inverse of the correlation matrix \(\mathbf{S}^{-1}\) is produced using the solve() function. The diagonal elements of the inverse correlation matrix, the variance inflation factors (VIFs), measure the extent to which each variable is a linear combination of the other variables, and therefore help detect multicollinearity. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction. Multiplying the correlation matrix by its inverse, and the inverse by the correlation matrix, both yield the identity matrix: \(\mathbf{S}\mathbf{S}^{-1}=\mathbf{S}^{-1}\mathbf{S}=\mathbf{I}\). Multiplying the correlation matrix by the inverse of the covariance matrix, known as the precision matrix \(\mathbf{P}\), and then the precision matrix by the correlation matrix, shows an interesting property: \(\mathbf{S}\mathbf{P}=(\mathbf{P}\mathbf{S})^{T}\), which holds because both matrices are symmetric.
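A small sketch of the two facts just described, assuming the same four-column X: the VIFs are the diagonal of the inverse correlation matrix, and the product of a matrix with its inverse is the identity up to floating-point error.
S <- cor(na.omit(X))      # correlation matrix of the remaining predictors
Sinv <- solve(S)          # inverse correlation matrix
diag(Sinv)                # variance inflation factors (all close to 1 here)
round(S %*% Sinv, 10)     # identity matrix up to rounding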
Conduct principal components analysis (research this) and interpret.
PCA <- function(X) {
  pairs(as.matrix(X))
  Xpca <- prcomp(na.omit(X), center = T, scale. = T)
  M <- as.matrix(X); R <- as.matrix(Xpca$rotation); score <- M %*% R
  print(list("Importance of Components" = summary(Xpca)$importance,
             "Rotation (Variable Loadings)" = Xpca$rotation,
             "Correlation between X and PC" = cor(na.omit(X), na.omit(score))))
  par(mfrow=c(2,3))
  barplot(Xpca$sdev^2, ylab = "Component Variance")
  barplot(cor(na.omit(cbind(X))), ylab = "Correlations")
  barplot(Xpca$rotation, ylab = "Loadings")
  biplot(Xpca); barplot(M); barplot(score); pairs(score)
}
PCA(X)
## $`Importance of Components`
## PC1 PC2 PC3 PC4
## Standard deviation 1.103575 0.9946552 0.9674999 0.9255957
## Proportion of Variance 0.304470 0.2473300 0.2340100 0.2141800
## Cumulative Proportion 0.304470 0.5518000 0.7858200 1.0000000
##
## $`Rotation (Variable Loadings)`
## PC1 PC2 PC3 PC4
## ScreenPorch 0.4488625 -0.4168237 0.7735512 0.1624777
## EnclosedPorch -0.5157049 0.4865127 0.4467899 0.5456487
## MasVnrArea 0.5916904 0.2062893 -0.3756023 0.6828397
## LotArea 0.4271689 0.7396000 0.2468202 -0.4578191
##
## $`Correlation between X and PC`
## PC1 PC2 PC3 PC4
## ScreenPorch 0.05137154 0.04032113 0.05853821 -0.04056489
## EnclosedPorch -0.03361681 -0.01936014 -0.01045982 0.02730270
## MasVnrArea 0.12999566 0.10854349 0.07657162 -0.07799089
## LotArea 0.99961438 0.99997638 0.99941730 -0.99962757
Principal Component Analysis is a dimension-reduction technique. Principal components are linear (weighted) combinations of variables, and applying the weights rotates (transforms) the orthogonal basis of the data to align with the principal components. To perform the analysis, the prcomp() function is used with the center = T argument to shift variables to a zero center and the scale. = T argument to impose unit variance on the variables before the analysis takes place. The standard deviations of the principal components are equal to the square roots of the eigenvalues of the correlation matrix. The rotation matrix has eigenvector columns that are the variable loadings; loadings are the correlations between the original variables and the unit-scaled components, and can be used to restore the original correlation matrix. The biplot() function provides a visual representation of both the observations and the variables for multivariate data. The effect of applying Principal Component Analysis to these data is best seen in the pairs() plots, which show the scattering of the original data and then the alignment of the rotated data. The first principal component is the linear combination of variables that has maximum variance (among all linear combinations), so it accounts for as much variation in the data as possible. Each successive principal component is the linear combination of the remaining variables that accounts for as much of the remaining variation as possible. Correlations between the original variables and each principal component are useful for interpreting the components. The primary purpose of this Principal Component Analysis is descriptive rather than inferential. Sometimes the principal component scores are used as explanatory variables in a regression; in those instances the estimated coefficients are free of multicollinearity because the component scores are orthogonal.
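A brief sketch of the eigen-decomposition relationship described above, assuming the same X: on centered and scaled data, the squared standard deviations from prcomp() equal the eigenvalues of the correlation matrix, and the loadings equal its eigenvectors up to sign.
Xc <- na.omit(X)
pca <- prcomp(Xc, center = TRUE, scale. = TRUE)
eig <- eigen(cor(Xc))
round(pca$sdev^2 - eig$values, 10)                 # component variances = eigenvalues
round(abs(pca$rotation) - abs(eig$vectors), 10)    # loadings = eigenvectors (up to sign)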
Fit a closed form distribution to the data. For variables skewed to the right, shift the minimum value above zero. Then run fitdistr() to fit an exponential probability density function. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value. Plot a histogram and compare it with a histogram of your original variable.
exponential <- function(X) {
  par(mfrow=c(2,4))
  for (i in 1:ncol(X)){
    shifted <- na.omit(X[,i]) - min(na.omit(X[,i])) + 1e-32
    fit <- fitdistr(shifted, "Exponential")
    lambda <- fit$estimate
    exp <- rexp(1000, lambda)
    hist(exp, prob=TRUE, col="grey", main = names(X[i]), xlab="Theoretical")
    lines(density(exp), col="blue", lwd=2)
    lines(density(shifted), col="red", lwd=2)
    hist(shifted, prob=TRUE, col="grey", main = names(X[i]), xlab="Empirical")
    lines(density(exp), col="blue", lwd=2)
    lines(density(shifted), col="red", lwd=2)
  }
}
exponential(X)
The variables were shifted to slightly above zero by subtracting the minimum value and then adding \(10^{-32}\). This also shifts data with a negative minimum in the appropriate direction, since subtracting a negative minimum equates to adding its magnitude. The fitdistr() function was used to fit an exponential distribution to the data, and then the rexp() function with rate \(\lambda\) was used to draw a sample of theoretical values from the fitted exponential distribution. After plotting the theoretical and empirical histograms for each variable, curves were overlaid representing the theoretical densities in blue and the empirical densities in red. These overlaid curves show that LotArea and MasVnrArea follow their respective fitted exponential distributions best.
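For the exponential distribution the maximum-likelihood estimate of the rate is just the reciprocal of the sample mean, so the fitted \(\lambda\) can be sanity-checked directly; a sketch assuming the shifted LotArea values:
shifted <- na.omit(X[, "LotArea"]) - min(na.omit(X[, "LotArea"])) + 1e-32
fitdistr(shifted, "Exponential")$estimate   # rate reported by MASS::fitdistr
1 / mean(shifted)                           # same value: lambda-hat = 1 / x-bar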
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data.
percentiles <- function(X) {
  par(mfrow=c(2,4))
  for (i in 1:ncol(X)){
    shifted <- na.omit(X[,i]) - min(na.omit(X[,i])) + 1e-32
    fit <- fitdistr(shifted, "Exponential")
    lambda <- fit$estimate
    lb_e <- -log(1 - .05) / lambda
    ub_e <- -log(1 - .95) / lambda
    mu <- s <- 1 / lambda
    n <- length(shifted)
    lb_n <- mu + qnorm(0.05/2) * s / sqrt(n)
    ub_n <- mu - qnorm(0.05/2) * s / sqrt(n)
    emp <- quantile(na.omit(X[,i]), c(0.05,0.95))
    d <- data.frame(c(lb_e, ub_e), c(lb_n, ub_n), c(emp[[1]], emp[[2]]))
    rownames(d) <- c("5%","95%")
    colnames(d) <- c("Exponential","Normal CI","Empirical")
    x <- paste0(names(X[i]), " ~ Exp(", round(fit$estimate,6),")")
    d <- list(d); names(d) <- x; print(d)
  }
}
percentiles(X)
## $`ScreenPorch ~ Exp(0.066397)`
## Exponential Normal CI Empirical
## 5% 0.7725262 14.28841 0
## 95% 45.1186007 15.83350 160
##
## $`EnclosedPorch ~ Exp(0.04555)`
## Exponential Normal CI Empirical
## 5% 1.126099 20.82798 0.00
## 95% 65.768635 23.08024 180.15
##
## $`MasVnrArea ~ Exp(0.009645)`
## Exponential Normal CI Empirical
## 5% 5.318359 98.35214 0
## 95% 310.613285 109.01839 456
##
## $`LotArea ~ Exp(0.000108)`
## Exponential Normal CI Empirical
## 5% 472.7615 8744.055 3311.70
## 95% 27611.1493 9689.602 17401.15
\[\int _{ 0 }^{ x }{ \lambda { e }^{ -\lambda t }dt } =F(x)-F(0)=\left[ -{ e }^{ -\lambda x } \right] -\left[ -{ e }^{ -\lambda \cdot 0 } \right] =-{ e }^{ -\lambda x }+1=1-{ e }^{ -\lambda x } \tag{4.2.1}\]
\[{ P }_{ i }=1-{ e }^{ -\lambda x }\Rightarrow { P }_{ i }-1=-{ e }^{ -\lambda x }\Rightarrow { e }^{ -\lambda x }=1-{ P }_{ i }\Rightarrow -\lambda x=\ln { \left( 1-{ P }_{ i } \right) } \Rightarrow x=-\frac { \ln { \left( 1-{ P }_{ i } \right) } }{ \lambda } \tag{4.2.2}\]
\[\bar { x } \pm { z }_{ { \alpha }/{ 2 } }\frac { s }{ \sqrt { n } } \Rightarrow \frac { 1 }{ \lambda } \pm { z }_{ { \alpha }/{ 2 } }\frac { { 1 }/{ \lambda } }{ \sqrt { n } } \tag{4.2.3}\]
The CDF was derived from the pdf as outlined in equation \(4.2.1\), and the percentiles were calculated using the derivation detailed in \(4.2.2\). Using the confidence interval equation for normally distributed data with exponential parameters, as detailed in \(4.2.3\), a confidence interval assuming normality was calculated next. The empirical percentiles of the data were computed using the quantile() function. The disparity among these intervals shows how different assumptions can lead to wildly different results.
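As a quick check, the closed-form percentiles in \(4.2.2\) agree with R's built-in exponential quantile function; a minimal sketch, with the rate assumed from the LotArea fit above:
lambda <- 0.000108                     # rate assumed from the LotArea fit above
-log(1 - c(0.05, 0.95)) / lambda       # percentiles via the derivation in (4.2.2)
qexp(c(0.05, 0.95), rate = lambda)     # identical values from qexp()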
The final includes a GLM component to be done in Kaggle.com. Register for free with Kaggle.com and compete in the House Prices: Advanced Regression Techniques competition. Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
Your submission should be in CSV format. We expect the solution file to have 1,459 predictions. The file should have a header row. Submissions are evaluated on the Root Mean Squared Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price. The use of RMSE is very common, and it makes an excellent general-purpose error metric for numerical predictions. Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors. Taking logs means that errors in predicting expensive houses and cheap houses affect the result equally. RMSLE penalizes an under-predicted estimate more than an over-predicted one because, when \(x\) is small, an equivalent change in \(x\) corresponds to a larger change in \(\log x\).
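A minimal sketch of the metric itself (the helper name and the toy vectors below are illustrative only, not part of the competition code):
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}
rmsle(c(100000, 200000), c(120000, 180000))   # toy example with two predictions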
\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i \tag{5.1.1}\] \[SalePrice_i = \beta_0 + \beta_1 \cdot MasVnrArea_i + \beta_2 \cdot LotArea_i + \varepsilon_i \tag{5.1.2}\]
cor(na.omit(cbind(X,Y)))
## ScreenPorch EnclosedPorch MasVnrArea LotArea Y
## ScreenPorch 1.00000000 -0.08307443 0.06146554 0.04351116 0.1130439
## EnclosedPorch -0.08307443 1.00000000 -0.11020383 -0.02309390 -0.1287781
## MasVnrArea 0.06146554 -0.11020383 1.00000000 0.10415982 0.4774930
## LotArea 0.04351116 -0.02309390 0.10415982 1.00000000 0.2646740
## Y 0.11304390 -0.12877809 0.47749305 0.26467400 1.0000000
fit <- lm(SalePrice ~ MasVnrArea + LotArea, data = train)
(sum <- summary(fit)); par(mfrow = c(2, 2)); plot(fit)
##
## Call:
## lm(formula = SalePrice ~ MasVnrArea + LotArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -250567 -41924 -11970 31784 576214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.418e+05 2.703e+03 52.47 <2e-16 ***
## MasVnrArea 1.992e+02 9.850e+00 20.22 <2e-16 ***
## LotArea 1.725e+00 1.785e-01 9.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67570 on 1449 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.2747, Adjusted R-squared: 0.2737
## F-statistic: 274.4 on 2 and 1449 DF, p-value: < 2.2e-16
test <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
"master/DATA%20605/test.csv"), stringsAsFactors = F)
X1 <- test[,"MasVnrArea"]; X1[is.na(X1)] <- median(na.omit(X1))
X2 <- test[,"LotArea"]; X2[is.na(X2)] <- median(na.omit(X2))
results <- data.frame(cbind("Id" = test[,"Id"],
                            "SalePrice" = fit$coefficients[1] +
                                          fit$coefficients[2] * X1 +
                                          fit$coefficients[3] * X2))
write.csv(results, "Kaggle.csv", row.names = F)
The two variables chosen for the model are those filtered through the preceding problems: right-skewed quantitative variables that show the strongest signs of correlation with \(Y\). Where values were missing, nulls were replaced with the median, the most likely value. This yielded the model below, whose summary appears above. The two variables and the intercept are all significant at the \(0.001\) level, and the \(R^2\) of 0.2747 (adjusted \(R^2\) of 0.2737) indicates that these variables account for roughly 27% of the variation in \(Y\).
\[\widehat{SalePrice}_i = 141842.3 + 199.175 \cdot MasVnrArea_i + 1.725 \cdot LotArea_i \tag{5.1.3}\]
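Equivalently, the manual coefficient arithmetic above could be generated with predict(); a minimal sketch assuming the same median-imputed test columns defined earlier:
newdata <- data.frame(MasVnrArea = X1, LotArea = X2)   # median-imputed test predictors from above
pred <- predict(fit, newdata = newdata)                # same values as the coefficient arithmetic
head(pred)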
Report your Kaggle.com username and score. There will be a video showing how to upload to Kaggle and to discuss methods for nonlinear optimization.
Predicted <- seq(1, 100, by=1)
a <- rep(40, 100)
# Per-observation contribution to the RMSLE (no summation); plotted only to show the
# asymmetry of the log penalty around the constant actual value a = 40.
RMSLE <- sqrt((1 / 100) * (log(Predicted + 1) - log(a + 1))^2)
plot(Predicted, RMSLE, type="l", col="blue",
     main="Basic RMSLE Pattern", xaxt='n', yaxt='n')
\[RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } =0.36631 \tag{5.2.1}\]
The score obtained by culling the variables as previously described yielded the above Root Mean Squared Logarithmic Error (RMSLE). A score of \(0.36631\) means that a typical prediction differed from the actual value by a multiplicative factor of roughly \(e^{0.36631}=1.44\) in either direction. At the time of posting, there were \(8,471\) submissions. The range of RMSLE scores was \([0.07186, 296.52145]\), and the “Sample Submission Benchmark” on the leaderboard was \(0.40890\). RMSLE values closer to zero are better. It is worth noting both that lower scores are skewed toward usernames with multiple (in several cases over 50) attempts, and that the maximum value appears to be an outlier, since the penultimate value is \(23.60064\). The RMSLE is one of several evaluation metrics commonly used in supervised machine learning.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
http://cnx.org/contents/bE-w34Vi@9/Descriptive-Statistics-Skewnes
http://brownmath.com/stat/shape.htm#Skewness
http://webstat.une.edu.au/unit_materials/c4_descriptive_statistics/determine_skew_kurt.html
https://estatistics.eu/what-is-statistics-standard-error-of-skewness-standard-error-of-kurtosis/
https://cran.r-project.org/web/packages/agrmt/vignettes/agrmt.pdf
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html
http://setosa.io/ev/principal-component-analysis/
https://onlinecourses.science.psu.edu/stat505/node/49
https://rstudio-pubs-static.s3.amazonaws.com/13301_6641d73cfac741a59c0a851feb99e98b.html
https://v8doc.sas.com/sashtml/insight/chap40/sect21.htm
https://onlinecourses.science.psu.edu/stat501/node/347
https://www.statlect.com/glossary/precision-matrix
http://beyondvalence.blogspot.com/2014/07/predicting-capital-bikeshare-demand-in.html