The final exam features a mini-project along with several small and medium-sized questions, and serves as a comprehensive review of the course. Post your solutions to GitHub and either give a short 3-5 minute presentation or post a recorded presentation to Blackboard. This project demonstrates your ability to apply the elements of the class.
library(MASS)
library(agrmt)
library(VennDiagram)
## Loading required package: grid
## Loading required package: futile.logger
Register for free with Kaggle.com to compete in the House Prices: Advanced Regression Techniques competition.
Competition Details: Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. The potential for creative feature engineering provides a rich opportunity for fun and learning. This dataset lends itself to advanced regression techniques like random forests and gradient boosting with the popular XGBoost library. We encourage Kagglers to create benchmark code and tutorials on Kernels for community learning. Top kernels will be awarded swag prizes at the competition close.
Pick one of the quantitative independent variables from the training data set train.csv, and define that variable as \(X\). Make sure this variable is skewed to the right. Pick the dependent variable and define it as \(Y\).
train <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
"master/DATA%20605/train.csv"), stringsAsFactors = F)
Y <- train[,"SalePrice"]; X <- {}
skewed <- function (train) {
  # remove categorical variables
  quantitative <- train[ , sapply(train, is.numeric)]
  # remove ID and Y (dependent) variable
  predictors <- quantitative[ , -c(1, ncol(quantitative))]
  par(mfrow = c(3, 3))
  for (i in 1:ncol(predictors)) {
    x <- predictors[!sapply(predictors[ , i], is.na), i]
    n <- length(x); mu <- mean(x); se <- sd(x)
    skew <- sum(((x - mu) / se)^3) / n
    # Standard Error of Skewness
    ses <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    if (modes(x)$contiguous) { # One (contiguous) mode
      if (skew > 40 * ses) { # Very Significant Positive Skew
        d <- density(x, na.rm = TRUE)
        string <- paste(names(predictors[i]))
        plot(d, ylab = string, main = string)
        polygon(d, col="red")
        X <- c(X, i)
      }
    }
  }
  return(predictors[, X])
}
X <- skewed(train)
Negative skew means the left tail is longer; the mass of the distribution is concentrated on the right of the figure, and the distribution is said to be left-skewed, left-tailed, or skewed to the left. Positive skew means the right tail is longer; the mass of the distribution is concentrated on the left of the figure, and the distribution is said to be right-skewed, right-tailed, or skewed to the right. To allow a more thorough analysis, the data set \(\mathbf{X}\) containing all of the right-skewed independent variables was created.
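As a minimal sketch of the statistics used in skewed() above, the sample skewness and its standard error can be computed for a single column; LotArea is assumed here purely for illustration.
x <- train$LotArea[!is.na(train$LotArea)]   # one quantitative predictor, NAs dropped
n <- length(x); mu <- mean(x); se <- sd(x)
skew <- sum(((x - mu) / se)^3) / n          # third standardized moment (sample skewness)
ses <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
c(skewness = skew, SES = ses, ratio = skew / ses)   # a large positive ratio indicates right skew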
Calculate as a minimum the below probabilities a through c. Assume the small letter \(x\) is estimated as the 3rd quartile of the \(X\) variable, and the small letter \(y\) is estimated as the 2nd quartile of the \(Y\) variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below. Does splitting the training data in this fashion make them independent?
quantile(Y)
## 0% 25% 50% 75% 100%
## 34900 129975 163000 214000 755000
\[\textbf{a.} \quad P\left( { X > x }|{ Y > y } \right) \quad\quad\quad \textbf{b.} \quad P(X > x, Y > y) \quad\quad\quad \textbf{c.} \quad P\left( { X < x }|{ Y > y } \right)\]
probabilities <- function(X, Y) {
  y <- quantile(Y, 0.5); Q <- V <- A <- B <- C <- {}
  for (i in 1:ncol(X)) {
    x <- quantile(X[,i], 0.75, na.rm = T); n <- length(Y)
    # numerator for a.: P(X > x, Y > y) (falls back to P(Y > y) when the split is empty)
    a <- as.numeric(table(if (any(X[,i] > x)) {Y[X[,i] > x] > y}
                          else {Y > y})["TRUE"]) / n
    # numerator for c.: P(X < x, Y > y) (same fallback)
    c <- as.numeric(table(if (any(X[,i] < x)) {Y[X[,i] < x] > y}
                          else {Y > y})["TRUE"]) / n
    t5 <- as.numeric(table(Y[X[,i] > x] > y)["TRUE"]) / n   # b. P(X > x, Y > y)
    t6 <- as.numeric(table(Y > y)["TRUE"]) / n              # P(Y > y)
    Q <- rbind(Q, quantile(X[,i], na.rm = T))
    V <- c(V, names(X[i]))
    A <- c(A, a / t6); B <- c(B, t5); C <- c(C, c / t6)     # a., b., c. as defined above
  }
  colnames(Q) <- c("Q_0","Q_25","Q_50","Q_75","Q_100")
  print(list("Quartiles" = data.frame("X" = V, Q),
             "Probabilities"= data.frame("X" = V, "a." = A, "b." = B, "c." = C)))
}
probabilities(X, Y)
## $Quartiles
## X Q_0 Q_25 Q_50 Q_75 Q_100
## 1 LotArea 1300 7553.5 9478.5 11601.5 215245
## 2 MasVnrArea 0 0.0 0.0 166.0 1600
## 3 BsmtFinSF2 0 0.0 0.0 0.0 1474
## 4 LowQualFinSF 0 0.0 0.0 0.0 572
## 5 EnclosedPorch 0 0.0 0.0 0.0 552
## 6 X3SsnPorch 0 0.0 0.0 0.0 508
## 7 ScreenPorch 0 0.0 0.0 0.0 480
## 8 PoolArea 0 0.0 0.0 0.0 738
## 9 MiscVal 0 0.0 0.0 0.0 15500
##
## $Probabilities
## X a. b. c.
## 1 LotArea 0.379120879 0.189041096 0.6208791
## 2 MasVnrArea 0.365384615 0.182191781 0.6208791
## 3 BsmtFinSF2 0.096153846 0.047945205 1.0000000
## 4 LowQualFinSF 0.010989011 0.005479452 1.0000000
## 5 EnclosedPorch 0.068681319 0.034246575 1.0000000
## 6 X3SsnPorch 0.024725275 0.012328767 1.0000000
## 7 ScreenPorch 0.103021978 0.051369863 1.0000000
## 8 PoolArea 0.008241758 0.004109589 1.0000000
## 9 MiscVal 0.024725275 0.012328767 1.0000000
\(\textbf{d.}\quad x/y\) | \(Y \leq\) 2nd quartile | \(Y >\) 2nd quartile | Total |
---|---|---|---|
\(X \leq\) 3rd quartile | \(n(X\leq x,Y\leq y)\) | \(n(X\leq x,Y>y)\) | \(n(X\leq x)\) |
\(X >\) 3rd quartile | \(n(X>x,Y\leq y)\) | \(n(X>x,Y>y)\) | \(n(X>x)\) |
Total | \(n(Y\leq y)\) | \(n(Y>y)\) | \(n\) |
counts <- function(X, Y) {
  y <- quantile(Y, 0.5)
  for (i in 1:ncol(X)) {
    x <- quantile(X[,i], 0.75, na.rm = T); n <- length(Y)
    t1 <- as.numeric(table(Y[X[,i] <= x] <= y)["TRUE"])
    t4 <- as.numeric(table(Y[X[,i] <= x] > y)["TRUE"])
    t2 <- as.numeric(table(Y[X[,i] > x] <= y)["TRUE"])
    t5 <- as.numeric(table(Y[X[,i] > x] > y)["TRUE"])
    t7 <- as.numeric(table(X[,i] <= x)["TRUE"])
    t8 <- as.numeric(table(X[,i] > x)["TRUE"])
    t3 <- as.numeric(table(Y <= y)["TRUE"])
    t6 <- as.numeric(table(Y > y)["TRUE"])
    d <- data.frame(matrix(c(t1,t2,t3,t4,t5,t6,t7,t8,t7 + t8), nrow = 3))
    colnames(d) <- c("Y <= 2nd quartile", "Y > 2nd quartile", "Total")
    rownames(d) <- c("X <= 3rd quartile", "X > 3rd quartile", "Total")
    d <- list(d); names(d) <- names(X[i]); print(d)
  }
}
counts(X, Y)
## $LotArea
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 643 452 1095
## X > 3rd quartile 89 276 365
## Total 732 728 1460
##
## $MasVnrArea
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 637 454 1091
## X > 3rd quartile 95 266 361
## Total 732 728 1452
##
## $BsmtFinSF2
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 635 658 1293
## X > 3rd quartile 97 70 167
## Total 732 728 1460
##
## $LowQualFinSF
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 714 720 1434
## X > 3rd quartile 18 8 26
## Total 732 728 1460
##
## $EnclosedPorch
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 574 678 1252
## X > 3rd quartile 158 50 208
## Total 732 728 1460
##
## $X3SsnPorch
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 726 710 1436
## X > 3rd quartile 6 18 24
## Total 732 728 1460
##
## $ScreenPorch
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 691 653 1344
## X > 3rd quartile 41 75 116
## Total 732 728 1460
##
## $PoolArea
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 731 722 1453
## X > 3rd quartile 1 6 7
## Total 732 728 1460
##
## $MiscVal
## Y <= 2nd quartile Y > 2nd quartile Total
## X <= 3rd quartile 698 710 1408
## X > 3rd quartile 34 18 52
## Total 732 728 1460
\(P(X > x, Y > y)\) is the probability that the events \(X > x\) and \(Y > y\) occur jointly. In Venn diagram terms, it is the area represented by the intersection of both events. \(P\left( { X > x }|{ Y > y } \right)\) and \(P\left( { X < x }|{ Y > y } \right)\) are the conditional probabilities of \(X > x\) or \(X < x\) given that \(Y > y\) has occurred. In Venn diagram terms, each is the area of the intersection of the two events divided by the area of the given event. The contingency table displays counts of the events specified in the top and left margins, and the sums of the joint events in the bottom and right margins. Splitting the training data in this fashion does not make the variables independent: for every variable, the conditional proportion differs from the marginal proportion. In count terms, independence would require \(\frac { n\left( X,Y \right) }{ n\left( Y \right) } = \frac { n\left( X \right) }{ n }\), which does not hold here.
grid.newpage()
draw.pairwise.venn(area1 = 1, area2 = 1, cross.area = 0.25, scaled = T, cat.prompts = T,
lty = rep("blank", 2), fill = c("blue", "green"), alpha = rep(0.5, 2),
category = c("X", "Y"), cat.pos = c(0, 0), cat.dist = rep(0.025, 2))
## INFO [2016-12-19 19:48:57] Placing category labels at default outer locations. Use 'cat.pos' and 'cat.dist' to modify location.
## INFO [2016-12-19 19:48:57] Current 'cat.pos': 0 degrees, 0 degrees
## INFO [2016-12-19 19:48:57] Current 'cat.dist': 0.025 , 0.025
## (polygon[GRID.polygon.1], polygon[GRID.polygon.2], polygon[GRID.polygon.3], polygon[GRID.polygon.4], text[GRID.text.5], text[GRID.text.6], text[GRID.text.7], text[GRID.text.8], text[GRID.text.9])
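As a quick numeric sketch of how these quantities come out of the contingency table, the LotArea counts above are assumed:
n <- 1460    # total observations
nB <- 728    # n(Y > y)
nAB <- 276   # n(X > x, Y > y) for LotArea
c(joint = nAB / n, conditional = nAB / nB)   # P(X > x, Y > y) and P(X > x | Y > y)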
Let \(A\) be the new variable counting those observations above the 3rd quartile for \(X\), and let \(B\) be the new variable counting those observations above the 2nd quartile for \(Y\). Does \(P(A|B)=P(A)P(B)\)? Check mathematically, and then evaluate by running a Chi Square test for association.
independence <- function(X, Y) {
  for (i in 1:ncol(X)) {
    y <- quantile(Y, 0.5); n <- length(Y)
    x <- quantile(X[,i], 0.75, na.rm = T)
    A <- as.numeric(table(X[,i] > x)["TRUE"])               # n(A): observations above Q3 of X
    B <- as.numeric(table(Y > y)["TRUE"])                   # n(B): observations above Q2 of Y
    AB <- as.numeric(table(Y[X[,i] > x] > y)["TRUE"]) / n   # P(A, B)
    Pa <- A / n; Pb <- B / n
    Pab1 <- AB / Pb                                         # P(A|B) = P(A, B) / P(B)
    Pab2 <- (A / n) * (B / n)                               # P(A) P(B)
    chi <- chisq.test(table(Y[X[,i] > x] > y), simulate.p.value = TRUE)
    c1 <- c(A, Pa); c2 <- c(B, Pb); c3 <- c(NA, Pab1); c4 <- c(NA, Pab2)
    c5 <- rep(Pab1 != Pab2, 2); c6 <- c(chi$statistic, chi$p.value)
    c7 <- rep(chi$p.value < 0.01, 2)
    d <- data.frame(round(c1,4),round(c2,4),round(c3,4),round(c4,4),c5,round(c6,4),c7)
    colnames(d) <- c("A","B","P(A|B)","P(A)P(B)","Dependent","Chi-test","Significant")
    rownames(d) <- c("Value", "P(.)")
    d <- list(d); names(d) <- names(X[i]); print(d)
  }
}
independence(X, Y)
## $LotArea
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 365.00 728.0000 NA NA TRUE 95.8055 TRUE
## P(.) 0.25 0.4986 0.3791 0.1247 TRUE 0.0005 TRUE
##
## $MasVnrArea
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 361.0000 728.0000 NA NA TRUE 81.0000 TRUE
## P(.) 0.2473 0.4986 0.3654 0.1233 TRUE 0.0005 TRUE
##
## $BsmtFinSF2
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 167.0000 728.0000 NA NA TRUE 4.3653 FALSE
## P(.) 0.1144 0.4986 0.0962 0.057 TRUE 0.0395 FALSE
##
## $LowQualFinSF
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 26.0000 728.0000 NA NA TRUE 3.8462 FALSE
## P(.) 0.0178 0.4986 0.011 0.0089 TRUE 0.0755 FALSE
##
## $EnclosedPorch
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 208.0000 728.0000 NA NA TRUE 56.0769 TRUE
## P(.) 0.1425 0.4986 0.0687 0.071 TRUE 0.0005 TRUE
##
## $X3SsnPorch
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 24.0000 728.0000 NA NA TRUE 6.000 FALSE
## P(.) 0.0164 0.4986 0.0247 0.0082 TRUE 0.025 FALSE
##
## $ScreenPorch
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 116.0000 728.0000 NA NA TRUE 9.9655 TRUE
## P(.) 0.0795 0.4986 0.103 0.0396 TRUE 0.0015 TRUE
##
## $PoolArea
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 7.0000 728.0000 NA NA TRUE 3.5714 FALSE
## P(.) 0.0048 0.4986 0.0082 0.0024 TRUE 0.1199 FALSE
##
## $MiscVal
## A B P(A|B) P(A)P(B) Dependent Chi-test Significant
## Value 52.0000 728.0000 NA NA TRUE 4.9231 FALSE
## P(.) 0.0356 0.4986 0.0247 0.0178 TRUE 0.0335 FALSE
\[P(A|B)=\frac { P\left( A,B \right) }{ P\left( B \right) } =\frac { P\left( A \right) P\left( B \right) }{ P\left( B \right) } =P\left( A \right) \quad\text{under independence, so}\quad P(A|B)=P\left( A \right) P\left( B \right) \Leftrightarrow P\left( B \right) =1 \tag{1.2.1}\]
Mathematically, \(P(A|B)=P(A)P(B)\) can hold for independent events only when \(P(B)=1\), in which case \(P(A|B)=P(A)\) and the criterion reduces to the usual definition of independence. Empirically, none of the right-skewed quantitative \(X\) variables meet this strict mathematical definition of independence. However, when the \(X\) variables are examined statistically using the chisq.test() function at a \(0.01\) significance level, only some of them show evidence of statistically significant dependence. The \(X\) variables that show statistically significant signs of dependence, listed from least to most dependent, are ScreenPorch, EnclosedPorch, MasVnrArea, and LotArea. Note that the ranking uses the \({ \chi }^{ 2 }\) test statistic rather than the \(p\)-value, because the function simulated \(p\)-values (very small expected counts could make the usual approximation of \(p\) unreliable). The simulation led to duplicate \(p\)-values in some cases, but the \({ \chi }^{ 2 }\) test statistics remained unique.
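The chisq.test() call above is applied to a one-way table of the \(Y\) split within the \(X > x\) subset. As an alternative sketch, the conventional test of association uses the full two-way table; the LotArea counts from the contingency table earlier are assumed:
tbl <- matrix(c(643, 89, 452, 276), nrow = 2,
              dimnames = list(X = c("<= Q3", "> Q3"), Y = c("<= Q2", "> Q2")))
chisq.test(tbl)   # a small p-value is evidence against independence of the two splits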
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of \(X\) and \(Y\). Provide a 95% CI for the difference in the mean of the variables.
X <- X[ , c("ScreenPorch" , "EnclosedPorch", "MasVnrArea", "LotArea")]
describe <- function(X) {
  par(mfrow=c(2,2))
  for (i in 1:ncol(X)) {
    xbar <- mean(na.omit(X[,i]))
    n <- length(na.omit(X[,i])); a <- 1 - .95
    l <- qchisq(1 - a / 2, df = 2 * n); u <- qchisq(a / 2, df = 2 * n)
    lb <- xbar - 2 * n * xbar / l; ub <- xbar + 2 * n * xbar / u
    d <- summary(na.omit(X[,i])); d["Lower CI"] <- lb; d["Upper CI"] <- ub
    d <- list(round(d, 3)); names(d) <- names(X[i]); print(d)
    plot(X[,i], Y, xlab = names(X[i]), ylab = names(train[81]),
         main = paste(names(X[i]), "vs.", names(train[81])))
  }
}
describe(X)
## $ScreenPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max. Lower CI Upper CI
## 0.000 0.000 0.000 15.060 0.000 480.000 0.744 30.925
## $EnclosedPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max. Lower CI Upper CI
## 0.000 0.000 0.000 21.950 0.000 552.000 1.084 45.079
## $MasVnrArea
## Min. 1st Qu. Median Mean 3rd Qu. Max. Lower CI Upper CI
## 0.000 0.000 0.000 103.700 166.000 1600.000 5.133 212.917
## $LotArea
## Min. 1st Qu. Median Mean 3rd Qu. Max. Lower CI
## 1300.00 7554.00 9478.00 10520.00 11600.00 215200.00 519.27
## Upper CI
## 21594.67
From the preceding parts, the training data set has been filtered down to include just the statistically dependent, quantitative, right-skewed variables. A summary of the univariate descriptive statistics for these culled data was produced using the summary() function. Given that the remaining variables appear to follow an exponential distribution, the confidence intervals for the exponential \({ 1 }/{ \lambda }\) means were calculated using the formula below, which applies to exponential distributions. The scatterplots of each \(X\) variable against \(Y\) show signs of clustering.
\[\frac { { 2n }\bar { x } }{ { \chi }_{ 1-{ \alpha }/{ 2 },2n }^{ 2 } } <\frac { 1 }{ \lambda } <\frac { { 2n }\bar { x } }{ { \chi }_{ { \alpha }/{ 2 },2n }^{ 2 } }\tag{2.1.1}\]
Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
cor(na.omit(X))
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 1.00000000 -0.08307443 0.06146554 0.04351116
## EnclosedPorch -0.08307443 1.00000000 -0.11020383 -0.02309390
## MasVnrArea 0.06146554 -0.11020383 1.00000000 0.10415982
## LotArea 0.04351116 -0.02309390 0.10415982 1.00000000
correlation <- function(X) {
  X <- na.omit(X)
  for (i in 1:(ncol(X) - 1)) {
    for (j in (i+1):ncol(X)) {
      r <- cor.test(X[,i], X[,j], conf.level = 0.99)
      d <- data.frame("r" = r$estimate, "Lower CI" = r$conf.int[1],
                      "Upper CI" = r$conf.int[2], "p.value" = r$p.value,
                      "Significant" = r$p.value < 0.01)
      d <- list(d); names(d) <- paste0(names(X[i]), " ~ ", names(X[j])); print(d)
    }
  }
}
correlation(X)
## $`ScreenPorch ~ EnclosedPorch`
## r Lower.CI Upper.CI p.value Significant
## cor -0.08307443 -0.1497985 -0.01559714 0.001533267 TRUE
##
## $`ScreenPorch ~ MasVnrArea`
## r Lower.CI Upper.CI p.value Significant
## cor 0.06146554 -0.006124728 0.1284967 0.01916276 FALSE
##
## $`ScreenPorch ~ LotArea`
## r Lower.CI Upper.CI p.value Significant
## cor 0.04351116 -0.02412459 0.1107504 0.09744771 FALSE
##
## $`EnclosedPorch ~ MasVnrArea`
## r Lower.CI Upper.CI p.value Significant
## cor -0.1102038 -0.1764548 -0.04295887 2.570285e-05 TRUE
##
## $`EnclosedPorch ~ LotArea`
## r Lower.CI Upper.CI p.value Significant
## cor -0.0230939 -0.0905175 0.04454043 0.3792086 FALSE
##
## $`MasVnrArea ~ LotArea`
## r Lower.CI Upper.CI p.value Significant
## cor 0.1041598 0.03685435 0.1705246 6.995925e-05 TRUE
The analysis was performed for the \({ _{ 4 }{ C }_{ 2 } }=6\) pairs of \(X\) variables that remain. The null hypothesis is that no statistically significant correlation exists between a pair of \(X\) variables; the alternative hypothesis is that the correlation differs significantly from zero because a relationship exists between the pair. At a significance level of \(0.01\), the null hypothesis of no relationship failed to be rejected for the correlations between ScreenPorch ~ MasVnrArea, ScreenPorch ~ LotArea, and EnclosedPorch ~ LotArea, which do not have significantly small \(p\)-values. The null hypothesis of no relationship was rejected for the correlations between ScreenPorch ~ EnclosedPorch, EnclosedPorch ~ MasVnrArea, and MasVnrArea ~ LotArea, which do have significantly small \(p\)-values. The significance of each relationship can also be seen in the confidence intervals: intervals that include zero as a possible correlation value do not indicate statistically significant relationships. The data support the conclusion that some of the \(X\) variables are correlated with each other.
Invert your correlation matrix. This is known as the precision matrix and contains variance inflation factors on the diagonal. Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
precision <- function(X) {
  S <- cor(na.omit(cbind(X)))
  Sinv <- solve(S)
  P <- solve(cov(na.omit(cbind(X))))
  print(list("Correlation Matrix (S)" = S,
             "Inverse Correlation Matrix" = Sinv,
             "Precision Matrix (P)" = P,
             "SP" = S %*% P, "PS" = P %*% S))
}
precision(X)
## $`Correlation Matrix (S)`
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 1.00000000 -0.08307443 0.06146554 0.04351116
## EnclosedPorch -0.08307443 1.00000000 -0.11020383 -0.02309390
## MasVnrArea 0.06146554 -0.11020383 1.00000000 0.10415982
## LotArea 0.04351116 -0.02309390 0.10415982 1.00000000
##
## $`Inverse Correlation Matrix`
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 1.01111953 0.077662314 -0.04973421 -0.037021156
## EnclosedPorch 0.07766231 1.018399086 0.10651581 0.009044966
## MasVnrArea -0.04973421 0.106515805 1.02543903 -0.102185688
## LotArea -0.03702116 0.009044966 -0.10218569 1.012463360
##
## $`Precision Matrix (P)`
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 3.235818e-04 2.274217e-05 -4.913700e-06 -6.629706e-08
## EnclosedPorch 2.274217e-05 2.728849e-04 9.629581e-06 1.482147e-08
## MasVnrArea -4.913700e-06 9.629581e-06 3.127772e-05 -5.649450e-08
## LotArea -6.629706e-08 1.482147e-08 -5.649450e-08 1.014580e-08
##
## $SP
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 3.213876e-04 6.649445e-07 -3.793628e-06 -7.055935e-08
## EnclosedPorch -3.596166e-06 2.699340e-04 6.592164e-06 2.632066e-08
## MasVnrArea 1.246225e-05 -1.904398e-05 2.990859e-05 -6.114608e-08
## LotArea 1.297611e-05 -4.294602e-06 2.765201e-06 1.034395e-09
##
## $PS
## ScreenPorch EnclosedPorch MasVnrArea LotArea
## ScreenPorch 3.213876e-04 -3.596166e-06 1.246225e-05 1.297611e-05
## EnclosedPorch 6.649445e-07 2.699340e-04 -1.904398e-05 -4.294602e-06
## MasVnrArea -3.793628e-06 6.592164e-06 2.990859e-05 2.765201e-06
## LotArea -7.055935e-08 2.632066e-08 -6.114608e-08 1.034395e-09
The cor() function produces the correlation matrix \(\mathbf{S}\) for all of the remaining variables; missing values must first be removed, however. The inverse of the correlation matrix \(\mathbf{S}^{-1}\) is produced using the solve() function. The diagonal elements of the inverse correlation matrix, the variance inflation factors (VIFs), measure the extent to which each variable is a linear combination of the other variables, and therefore help detect multicollinearity. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction. Multiplying the correlation matrix by its inverse, and the inverse by the correlation matrix, both yield the identity matrix: \(\mathbf{S}\mathbf{S}^{-1}=\mathbf{S}^{-1}\mathbf{S}=\mathbf{I}\). Multiplying the correlation matrix by the inverse of the covariance matrix, known as the precision matrix \(\mathbf{P}\), and then the precision matrix by the correlation matrix, shows an interesting property: \(\mathbf{S}\mathbf{P}=(\mathbf{P}\mathbf{S})^{T}\), which holds because both matrices are symmetric.
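A small sketch of the two facts just described, assuming the same four-column X: the VIFs are the diagonal of the inverse correlation matrix, and the product of a matrix with its inverse is the identity up to floating-point error.
S <- cor(na.omit(X))      # correlation matrix of the remaining predictors
Sinv <- solve(S)          # inverse correlation matrix
diag(Sinv)                # variance inflation factors (all close to 1 here)
round(S %*% Sinv, 10)     # identity matrix up to rounding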
Conduct principal components analysis (research this) and interpret.
PCA <- function(X) {
  pairs(as.matrix(X))
  Xpca <- prcomp(na.omit(X), center = T, scale. = T)
  M <- as.matrix(X); R <- as.matrix(Xpca$rotation); score <- M %*% R
  print(list("Importance of Components" = summary(Xpca)$importance,
             "Rotation (Variable Loadings)" = Xpca$rotation,
             "Correlation between X and PC" = cor(na.omit(X), na.omit(score))))
  par(mfrow=c(2,3))
  barplot(Xpca$sdev^2, ylab = "Component Variance")
  barplot(cor(na.omit(cbind(X))), ylab = "Correlations")
  barplot(Xpca$rotation, ylab = "Loadings")
  biplot(Xpca); barplot(M); barplot(score); pairs(score)
}
PCA(X)
## $`Importance of Components`
## PC1 PC2 PC3 PC4
## Standard deviation 1.103575 0.9946552 0.9674999 0.9255957
## Proportion of Variance 0.304470 0.2473300 0.2340100 0.2141800
## Cumulative Proportion 0.304470 0.5518000 0.7858200 1.0000000
##
## $`Rotation (Variable Loadings)`
## PC1 PC2 PC3 PC4
## ScreenPorch 0.4488625 -0.4168237 0.7735512 0.1624777
## EnclosedPorch -0.5157049 0.4865127 0.4467899 0.5456487
## MasVnrArea 0.5916904 0.2062893 -0.3756023 0.6828397
## LotArea 0.4271689 0.7396000 0.2468202 -0.4578191
##
## $`Correlation between X and PC`
## PC1 PC2 PC3 PC4
## ScreenPorch 0.05137154 0.04032113 0.05853821 -0.04056489
## EnclosedPorch -0.03361681 -0.01936014 -0.01045982 0.02730270
## MasVnrArea 0.12999566 0.10854349 0.07657162 -0.07799089
## LotArea 0.99961438 0.99997638 0.99941730 -0.99962757
Principal Component Analysis is a dimension-reduction technique. Principal components are linear (weighted) combinations of variables, and applying the weights rotates (transforms) the orthogonal basis of the data to align with the principal components. To perform the analysis, the prcomp() function is used with the center = T argument to shift variables to a zero center and the scale. = T argument to impose unit variance on the variables before the analysis takes place. The standard deviations of the principal components are equal to the square roots of the eigenvalues of the correlation matrix. The rotation matrix has eigenvector columns that are the variable loadings; loadings are the correlations between the original variables and the unit-scaled components, and can be used to restore the original correlation matrix. The biplot() function provides a visual representation of both the observations and the variables for multivariate data. The effect of applying Principal Component Analysis to these data is best seen in the pairs() plots, which show the scattering of the original data and then the alignment of the rotated data. The first principal component is the linear combination of variables that has maximum variance (among all linear combinations), so it accounts for as much variation in the data as possible. Each successive principal component is the linear combination of the remaining variables that accounts for as much of the remaining variation as possible. Correlations between the original variables and each principal component are useful for interpreting the components. The primary purpose of this Principal Component Analysis is descriptive rather than inferential. Sometimes the principal component scores are used as explanatory variables in a regression; in those instances the estimated coefficients are free of multicollinearity because the component scores are orthogonal.
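A brief sketch of the eigen-decomposition relationship described above, assuming the same X: on centered and scaled data, the squared standard deviations from prcomp() equal the eigenvalues of the correlation matrix, and the loadings equal its eigenvectors up to sign.
Xc <- na.omit(X)
pca <- prcomp(Xc, center = TRUE, scale. = TRUE)
eig <- eigen(cor(Xc))
round(pca$sdev^2 - eig$values, 10)                 # component variances = eigenvalues
round(abs(pca$rotation) - abs(eig$vectors), 10)    # loadings = eigenvectors (up to sign)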
Fit a closed form distribution to the data. For variables skewed to the right, shift the minimum value above zero. Then run fitdistr() to fit an exponential probability density function. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value. Plot a histogram and compare it with a histogram of your original variable.
exponential <- function(X) {
  par(mfrow=c(2,4))
  for (i in 1:ncol(X)){
    shifted <- na.omit(X[,i]) - min(na.omit(X[,i])) + 1e-32
    fit <- fitdistr(shifted, "Exponential")
    lambda <- fit$estimate
    exp <- rexp(1000, lambda)
    hist(exp, prob=TRUE, col="grey", main = names(X[i]), xlab="Theoretical")
    lines(density(exp), col="blue", lwd=2)
    lines(density(shifted), col="red", lwd=2)
    hist(shifted, prob=TRUE, col="grey", main = names(X[i]), xlab="Empirical")
    lines(density(exp), col="blue", lwd=2)
    lines(density(shifted), col="red", lwd=2)
  }
}
exponential(X)
The variables were shifted to slightly above zero by subtracting the minimum value and then adding \(10^{-32}\). This also shifts data with a negative minimum in the appropriate direction, since subtracting a negative minimum equates to adding its magnitude. The fitdistr() function was used to fit an exponential distribution to the data, and then the rexp() function with rate \(\lambda\) was used to draw a sample of theoretical values from the fitted exponential distribution. After plotting the theoretical and empirical histograms for each variable, curves were overlaid representing the theoretical densities in blue and the empirical densities in red. These overlaid curves show that LotArea and MasVnrArea follow their respective fitted exponential distributions best.
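For the exponential distribution the maximum-likelihood estimate of the rate is just the reciprocal of the sample mean, so the fitted \(\lambda\) can be sanity-checked directly; a sketch assuming the shifted LotArea values:
shifted <- na.omit(X[, "LotArea"]) - min(na.omit(X[, "LotArea"])) + 1e-32
fitdistr(shifted, "Exponential")$estimate   # rate reported by MASS::fitdistr
1 / mean(shifted)                           # same value: lambda-hat = 1 / x-bar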
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data.
percentiles <- function(X) {
  par(mfrow=c(2,4))
  for (i in 1:ncol(X)){
    shifted <- na.omit(X[,i]) - min(na.omit(X[,i])) + 1e-32
    fit <- fitdistr(shifted, "Exponential")
    lambda <- fit$estimate
    lb_e <- -log(1 - .05) / lambda
    ub_e <- -log(1 - .95) / lambda
    mu <- s <- 1 / lambda
    n <- length(shifted)
    lb_n <- mu + qnorm(0.05/2) * s / sqrt(n)
    ub_n <- mu - qnorm(0.05/2) * s / sqrt(n)
    emp <- quantile(na.omit(X[,i]), c(0.05,0.95))
    d <- data.frame(c(lb_e, ub_e), c(lb_n, ub_n), c(emp[[1]], emp[[2]]))
    rownames(d) <- c("5%","95%")
    colnames(d) <- c("Exponential","Normal CI","Empirical")
    x <- paste0(names(X[i]), " ~ Exp(", round(fit$estimate,6),")")
    d <- list(d); names(d) <- x; print(d)
  }
}
percentiles(X)
## $`ScreenPorch ~ Exp(0.066397)`
## Exponential Normal CI Empirical
## 5% 0.7725262 14.28841 0
## 95% 45.1186007 15.83350 160
##
## $`EnclosedPorch ~ Exp(0.04555)`
## Exponential Normal CI Empirical
## 5% 1.126099 20.82798 0.00
## 95% 65.768635 23.08024 180.15
##
## $`MasVnrArea ~ Exp(0.009645)`
## Exponential Normal CI Empirical
## 5% 5.318359 98.35214 0
## 95% 310.613285 109.01839 456
##
## $`LotArea ~ Exp(0.000108)`
## Exponential Normal CI Empirical
## 5% 472.7615 8744.055 3311.70
## 95% 27611.1493 9689.602 17401.15
\[\int _{ 0 }^{ x }{ \lambda { e }^{ -\lambda t }dt } =F(x)-F(0)=\left[ -{ e }^{ -\lambda x } \right] -\left[ -{ e }^{ -\lambda \cdot 0 } \right] =-{ e }^{ -\lambda x }+1=1-{ e }^{ -\lambda x } \tag{4.2.1}\]
\[{ P }_{ i }=1-{ e }^{ -\lambda x }\Rightarrow { P }_{ i }-1=-{ e }^{ -\lambda x }\Rightarrow { e }^{ -\lambda x }=1-{ P }_{ i }\Rightarrow -\lambda x=\ln { \left( 1-{ P }_{ i } \right) } \Rightarrow x=-\frac { \ln { \left( 1-{ P }_{ i } \right) } }{ \lambda } \tag{4.2.2}\]
\[\bar { x } \pm { z }_{ { \alpha }/{ 2 } }\frac { s }{ \sqrt { n } } \Rightarrow \frac { 1 }{ \lambda } \pm { z }_{ { \alpha }/{ 2 } }\frac { { 1 }/{ \lambda } }{ \sqrt { n } } \tag{4.2.3}\]
The CDF was derived from the pdf as outlined in equation \(4.2.1\), and the percentiles were calculated using the derivation detailed in \(4.2.2\). Using the confidence interval equation for normally distributed data with exponential parameters, as detailed in \(4.2.3\), a confidence interval assuming normality was calculated next. The empirical percentiles of the data were computed using the quantile() function. The disparity among these intervals shows how different assumptions can lead to wildly different results.
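As a quick check, the closed-form percentiles in \(4.2.2\) agree with R's built-in exponential quantile function; a minimal sketch, with the rate assumed from the LotArea fit above:
lambda <- 0.000108                     # rate assumed from the LotArea fit above
-log(1 - c(0.05, 0.95)) / lambda       # percentiles via the derivation in (4.2.2)
qexp(c(0.05, 0.95), rate = lambda)     # identical values from qexp()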
The final includes a GLM component to be done in Kaggle.com. Register for free with Kaggle.com and compete in the House Prices: Advanced Regression Techniques competition. Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
Your submission should be in CSV format. We expect the solution file to have 1,459 predictions. The file should have a header row. Submissions are evaluated on the Root Mean Squared Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price. The use of RMSE is very common, and it makes an excellent general-purpose error metric for numerical predictions. Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors. Taking logs means that errors in predicting expensive houses and cheap houses affect the result equally. RMSLE penalizes an under-predicted estimate more than an over-predicted one because, when \(x\) is small, an equivalent change in \(x\) corresponds to a larger change in \(\log x\).
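A minimal sketch of the metric itself (the helper name and the toy vectors below are illustrative only, not part of the competition code):
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}
rmsle(c(100000, 200000), c(120000, 180000))   # toy example with two predictions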
\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i \tag{5.1.1}\] \[SalePrice_i = \beta_0 + \beta_1 \cdot MasVnrArea_i + \beta_2 \cdot LotArea_i + \varepsilon_i \tag{5.1.2}\]
cor(na.omit(cbind(X,Y)))
## ScreenPorch EnclosedPorch MasVnrArea LotArea Y
## ScreenPorch 1.00000000 -0.08307443 0.06146554 0.04351116 0.1130439
## EnclosedPorch -0.08307443 1.00000000 -0.11020383 -0.02309390 -0.1287781
## MasVnrArea 0.06146554 -0.11020383 1.00000000 0.10415982 0.4774930
## LotArea 0.04351116 -0.02309390 0.10415982 1.00000000 0.2646740
## Y 0.11304390 -0.12877809 0.47749305 0.26467400 1.0000000
fit <- lm(SalePrice ~ MasVnrArea + LotArea, data = train)
(sum <- summary(fit)); par(mfrow = c(2, 2)); plot(fit)
##
## Call:
## lm(formula = SalePrice ~ MasVnrArea + LotArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -250567 -41924 -11970 31784 576214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.418e+05 2.703e+03 52.47 <2e-16 ***
## MasVnrArea 1.992e+02 9.850e+00 20.22 <2e-16 ***
## LotArea 1.725e+00 1.785e-01 9.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67570 on 1449 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.2747, Adjusted R-squared: 0.2737
## F-statistic: 274.4 on 2 and 1449 DF, p-value: < 2.2e-16
test <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
"master/DATA%20605/test.csv"), stringsAsFactors = F)
X1 <- test[,"MasVnrArea"]; X1[is.na(X1)] <- median(na.omit(X1))
X2 <- test[,"LotArea"]; X2[is.na(X2)] <- median(na.omit(X2))
results <- data.frame(cbind("Id" = test[,"Id"],
                            "SalePrice" = fit$coefficients[1] +
                                          fit$coefficients[2] * X1 +
                                          fit$coefficients[3] * X2))
write.csv(results, "Kaggle.csv", row.names = F)
The two variables chosen for the model are those filtered through the preceding problems: right-skewed quantitative variables that show the strongest signs of correlation with \(Y\). Where values were missing, nulls were replaced with the median, the most likely value. This yielded the model below, whose summary appears above. The two variables and the intercept are all significant at the \(0.001\) level, and the \(R^2\) of 0.2747 (adjusted \(R^2\) of 0.2737) indicates that these variables account for roughly 27% of the variation in \(Y\).
\[\widehat{SalePrice}_i = 141842.3 + 199.175 \cdot MasVnrArea_i + 1.725 \cdot LotArea_i \tag{5.1.3}\]
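Equivalently, the manual coefficient arithmetic above could be generated with predict(); a minimal sketch assuming the same median-imputed test columns defined earlier:
newdata <- data.frame(MasVnrArea = X1, LotArea = X2)   # median-imputed test predictors from above
pred <- predict(fit, newdata = newdata)                # same values as the coefficient arithmetic
head(pred)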
Report your Kaggle.com username and score. There will be a video showing how to upload to Kaggle and to discuss methods for nonlinear optimization.
Predicted <- seq(1, 100, by=1)
a <- rep(40, 100)
# Per-observation contribution to the RMSLE (no summation); plotted only to show the
# asymmetry of the log penalty around the constant actual value a = 40.
RMSLE <- sqrt((1 / 100) * (log(Predicted + 1) - log(a + 1))^2)
plot(Predicted, RMSLE, type="l", col="blue",
     main="Basic RMSLE Pattern", xaxt='n', yaxt='n')
\[RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } =0.36631 \tag{5.2.1}\]
The score obtained by culling the variables as previously described yielded the above Root Mean Squared Logarithmic Error (RMSLE). A score of \(0.36631\) means that a typical prediction differed from the actual value by a multiplicative factor of roughly \(e^{0.36631}=1.44\) in either direction. At the time of posting, there were \(8,471\) submissions. The range of RMSLE scores was \([0.07186, 296.52145]\), and the “Sample Submission Benchmark” on the leaderboard was \(0.40890\). RMSLE values closer to zero are better. It is worth noting both that lower scores are skewed toward usernames with multiple (in several cases over 50) attempts, and that the maximum value appears to be an outlier, since the penultimate value is \(23.60064\). The RMSLE is one of several evaluation metrics commonly used in supervised machine learning.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
http://cnx.org/contents/bE-w34Vi@9/Descriptive-Statistics-Skewnes
http://brownmath.com/stat/shape.htm#Skewness
http://webstat.une.edu.au/unit_materials/c4_descriptive_statistics/determine_skew_kurt.html
https://estatistics.eu/what-is-statistics-standard-error-of-skewness-standard-error-of-kurtosis/
https://cran.r-project.org/web/packages/agrmt/vignettes/agrmt.pdf
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html
http://setosa.io/ev/principal-component-analysis/
https://onlinecourses.science.psu.edu/stat505/node/49
https://rstudio-pubs-static.s3.amazonaws.com/13301_6641d73cfac741a59c0a851feb99e98b.html
https://v8doc.sas.com/sashtml/insight/chap40/sect21.htm
https://onlinecourses.science.psu.edu/stat501/node/347
https://www.statlect.com/glossary/precision-matrix
http://beyondvalence.blogspot.com/2014/07/predicting-capital-bikeshare-demand-in.html