IS605 Final

CUNY MSDA DATA 605

Tom Detzel, May 20, 2018




Part One: Probability


1.1 Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the next analysis.

For a slight departure (because it will be useful later in modeling to avoid multicollinearity), our X will be a new variable composed of the sum of all finished space in the basement and on the first and second floors. It is highly correlated (0.708) with SalePrice; as X increases, Y increases.

1.2 Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | Y>y)

From Grinstead and Snell, p. 164:

“If E and F are two events with positive probability in a continuous sample space, then, as in the case of discrete sample spaces, we define E and F to be independent if P(E|F) = P(E) and P(F|E) = P(F).”

Grinstead and Snell (p. 162) give the conditional probability for continuous random variables as:

\(P(F|E) = \frac{P(E \cap F)}{P(E)}\)

Here, E is the event Y>y, F is the event X>x, and P(E,F) is the probability of their intersection.

The R code below shows that P(E,F) = 0.6589 and P(E) = 0.75, so P(X>x | Y>y) = 0.6589/0.75 = 0.8785.

In other words, a home priced above the first quartile has a high probability (0.88) of also having above-first-quartile living space. By way of explanation, these solutions count relative frequencies: first the proportion of observations with Y above the first quartile of Y (0.75, as expected), then the proportion with X above the first quartile of X, and then the proportion where both hold. Applying the general formula for conditional probability gives the result.

# create our variable
# 
X <- train$BsmtFinSF1 + train$BsmtFinSF2 + train$X1stFlrSF + train$X2ndFlrSF
Y <- train$SalePrice

# check correlations
# cor(cbind(Y, X, train$BsmtFinSF2, train$X1stFlrSF, train$X2ndFlrSF))

# get the probability that Y > 1st quartile
# 
`P(E)` <- `P(Y>y)` <- length(which(Y>quantile(Y, probs=0.25)))/length(Y)

# 0.75

# get probability that X > 1st quartile
# 
`P(F)` <- `P(X>x)` <- length(which(X>quantile(X, probs=0.25)))/length(X)

# 0.75

# get probability that X>x and Y>y (intersection)
# 
`P(E,F)` <- length(which(X>quantile(X, probs=0.25) & Y>quantile(Y, probs=0.25)))/length(Y)

# 0.6589041

round(`P(E,F)`/`P(E)`,4)
## [1] 0.8785
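Connecting this back to the definition of independence quoted from Grinstead and Snell: if X and Y were independent, the conditional probability would equal the unconditional P(X>x) = 0.75. A quick check with the objects computed above (a minimal sketch):

# under independence, P(X>x | Y>y) would equal P(X>x); here 0.8785 vs. 0.75, so they are not independent
round(`P(E,F)`/`P(E)`, 4) == round(`P(F)`, 4)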

b. P(X>x, Y>y)

We similarly compute this probability as the intersection where X and Y are both greater than the 1st quartile. Actually, we’ve already calculated it in our prior example.

P(X>x, Y>y) = 0.6589041

# get probability that X>x and Y>y (intersection)
`P(X>x, Y>y)` <- length(which(X>quantile(X, probs=0.25) & Y>quantile(Y, probs=0.25)))/length(Y)

`P(X>x, Y>y)`
## [1] 0.6589041

c. P(X&lt;x | Y&gt;y)

Computed below as in part (a).

# get the probability that Y > 1st quartile
`P(E)` <- `P(Y>y)` <- length(which(Y>quantile(Y, probs=0.25)))/length(Y)

# 0.75

# get probability that X < 1st quartile
`P(F)` <- `P(X<x)` <- length(which(X<quantile(X, probs=0.25)))/length(X)

# 0.25

# get probability that X<x and Y>y (intersection)
`P(E,F)` <- length(which(X<quantile(X, probs=0.25) & Y>quantile(Y, probs=0.25)))/length(Y)

# 0.09109589

round(`P(E,F)`/`P(E)`,4)
## [1] 0.1215
# 0.1215


1.3 In addition, make a table of counts as shown below.

# get the counts
a <- length(which(X<=quantile(X, probs=0.25) & Y<=quantile(Y, probs=0.25)))
b <- length(which(X>quantile(X, probs=0.25) & Y<=quantile(Y, probs=0.25)))
c <- length(which(X<=quantile(X, probs=0.25) & Y>quantile(Y, probs=0.25)))
d <- length(which(X>quantile(X, probs=0.25) & Y>quantile(Y, probs=0.25)))

# make a table
counts <- as.data.frame(matrix(c(a, b, c, d), byrow=T, nrow=2, ncol=2))
counts <- cbind(counts, rowSums(counts))
counts <- rbind(counts, colSums(counts))
colnames(counts) <- c('X<=1st Quartile', 'X>1st Quartile', "Total")
rownames(counts) <- c('Y<=1st Quartile', 'Y>1st Quartile', "Total")

pander(counts)
                  X<=1st Quartile   X>1st Quartile   Total
Y<=1st Quartile               232              133     365
Y>1st Quartile                133              962    1095
Total                         365             1095    1460


Here is the corresponding table of probabilities. The joint probability that SalePrice and TotSqft are both in the bottom quartile is only about 0.16.

# make a table of probabilities
vals <- c(a, b, c, d)
tbl <- matrix(vals, byrow=T, nrow=2, ncol=2)
tbl <- as.data.frame(round(addmargins(prop.table((tbl))),4))
colnames(tbl) <- c('P(X<=1st Quartile)', 'P(X>1st Quartile)', "Total")
rownames(tbl) <- c('P(Y<=1st Quartile)', 'P(Y>1st Quartile)', "Total")
tbl
##                    P(X<=1st Quartile) P(X>1st Quartile) Total
## P(Y<=1st Quartile)             0.1589            0.0911  0.25
## P(Y>1st Quartile)              0.0911            0.6589  0.75
## Total                          0.2500            0.7500  1.00

1.4 Does splitting the training data in this fashion make them independent?

The question is a little ambiguous. There are several ways to consider independence here. 1) Are the observations independent of each other? Probably not. They are related by time of observation, neighborhood, and other attributes that are correlated to some degree. 2) Are the predictors independent of each other? Attributes like lot size and living space also can be correlated; one goal of modeling is to eliminate multicollinearity. 3) Are the training and test sets independent? Yes, in the sense that a model built on the training data does not include any information from the unseen test data. But splitting the data doesn’t affect associations among the variables in either set, as long as the split is random.
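As an illustration of the last point, a random split can be made with sample(). This is only a sketch; the 70/30 proportion and the seed are arbitrary choices, not part of the assignment.

# randomly partition the rows of train into 70 percent training and 30 percent validation
set.seed(605)   # arbitrary seed, for reproducibility
idx <- sample(seq_len(nrow(train)), size = floor(0.7 * nrow(train)))
train_part <- train[idx, ]
valid_part <- train[-idx, ]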

1.5 Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y.

# make the new variables
A <- X[which(X>quantile(X, prob=0.25))]
B <- Y[which(Y>quantile(Y, prob=0.25))]

# min(A)
# min(B)

print(summary(A))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1494    1768    2070    2263    2602   11286
print(summary(B))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  130000  153418  183000  205951  235000  755000

1.6 Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

The next code block computes the frequencies and tests the equality; in this case A and B are not independent because P(A,B) \(\neq\) P(A)P(B).

# probabilities of A, B both > 1st quartile
pA <- length(A)/1460
pB <- length(B)/1460

pAB <- length(which(X>quantile(X, prob=0.25) & Y>quantile(Y, prob=0.25)))/1460

# test equality; they are not equal and therefore A and B are not independent
pAB == pA*pB
## [1] FALSE

Statistical tests for association reach the same conclusion. Below we use a Spearman rank correlation test rather than the Chi-square test, since Chi-square is normally employed with categorical data (a grouped Chi-square version follows further down).

The Spearman rank correlation coefficient tests the null hypothesis that the rank correlation between two variables is zero. It does not require normally distributed data and assumes only a monotonic (not necessarily linear) relationship. The test can be run in two ways; we show both.

For our variables, the Spearman coefficient is 0.0699 with a p-value of less than 0.05, so we reject the null hypothesis that A and B are uncorrelated.

spearman_test(A~B, distribution='approximate')
## 
##  Approximative Spearman Correlation Test
## 
## data:  A by B
## Z = 2.314, p-value = 0.0233
## alternative hypothesis: true rho is not equal to 0
cor.test(A, B, method ="spearman", alternative ="two.sided")
## 
##  Spearman's rank correlation rho
## 
## data:  A and B
## S = 203510000, p-value = 0.0206
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## 0.06996163

Another way to look at the problem is to group each variable by quartile and count frequencies by quartile. Then a Chi-square test can be applied to the groups Q1-Q4 for A and B.

# uses the cut function to give each variable a quartile value
dfA <- data.frame(A)
dfA$QA <- cut(dfA$A, breaks=c(quantile(dfA$A, probs = seq(0, 1, by = 0.25))), 
                labels=c("Q1","Q2","Q3","Q4"), include.lowest=TRUE)
               
dfB <- data.frame(B)
dfB$QB <- cut(dfB$B, breaks=c(quantile(dfB$B, probs = seq(0, 1, by = 0.25))), 
                labels=c("Q1","Q2","Q3","Q4"), include.lowest=TRUE)

# make a two-way table to conduct the test
chi_df <- cbind(dfA, dfB)
chi_tbl <- xtabs(~QA+QB, data=chi_df)
ftable(addmargins(chi_tbl))
##     QB   Q1   Q2   Q3   Q4  Sum
## QA                             
## Q1       73   68   68   66  275
## Q2       69   75   75   54  273
## Q3       72   69   68   64  273
## Q4       60   62   66   86  274
## Sum     274  274  277  270 1095
chisq.test(chi_tbl)
## 
##  Pearson's Chi-squared test
## 
## data:  chi_tbl
## X-squared = 11.428, df = 9, p-value = 0.2475

The test fails to reject the null hypothesis that the quartile groupings of A and B are independent (p > 0.05), quite a different result from the ungrouped data. Binning each variable into quartiles discards most of the information about the weak monotonic association, so the grouped test has far less power.
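As a supplementary check (not part of the question wording), a Chi-square test can also be run directly on the 2 x 2 table of counts from section 1.3; with the diagonal cells far exceeding their expected counts under independence, it rejects independence decisively, consistent with the mathematical check of P(AB) = P(A)P(B).

# 2 x 2 counts from the table in 1.3 (rows: Y below/above 1st quartile; columns: X below/above 1st quartile)
tbl2x2 <- matrix(c(232, 133, 133, 962), byrow=TRUE, nrow=2)
chisq.test(tbl2x2)   # p-value is vanishingly small, so independence is rejected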



Part Two: Descriptive and Inferential Statistics


2.1 Provide univariate descriptive statistics and appropriate plots for the training data set.

Following are univariate statistics for X and Y.

df <- data.frame(X = X, Y = Y)   # two-column data frame of the derived X and SalePrice
stats <- data.frame(rbind(describe(df$X), describe(df$Y)), row.names=c('X', 'Y'))
pander(round(stats[,-1],2))
     n    mean     sd  median  trimmed    mad    min     max   range  skew  kurtosis     se
X  1460    2000    774    1864     1931  627.1    334   11286   10952  2.16     16.07  20.26
Y  1460  180921  79442  163000   170783  56339  34900  755000  720100  1.88       6.5   2079

In the interests of efficiency, here is just a sampling of stats for other training data variables. Stats for all the variables can be obtained the same way.

describe(train[, c(2:11)])
##              vars    n     mean      sd median trimmed     mad  min    max
## MSSubClass      1 1460    56.90   42.30   50.0   49.15   44.48   20    190
## MSZoning*       2 1460     4.03    0.63    4.0    4.06    0.00    1      5
## LotFrontage     3 1201    70.05   24.28   69.0   68.94   16.31   21    313
## LotArea         4 1460 10516.83 9981.26 9478.5 9563.28 2962.23 1300 215245
## Street*         5 1460     2.00    0.06    2.0    2.00    0.00    1      2
## Alley*          6   91     1.45    0.50    1.0    1.44    0.00    1      2
## LotShape*       7 1460     2.94    1.41    4.0    3.05    0.00    1      4
## LandContour*    8 1460     3.78    0.71    4.0    4.00    0.00    1      4
## Utilities*      9 1460     1.00    0.03    1.0    1.00    0.00    1      2
## LotConfig*     10 1460     4.02    1.62    5.0    4.27    0.00    1      5
##               range   skew kurtosis     se
## MSSubClass      170   1.40     1.56   1.11
## MSZoning*         4  -1.73     6.25   0.02
## LotFrontage     292   2.16    17.34   0.70
## LotArea      213945  12.18   202.26 261.22
## Street*           1 -15.49   238.01   0.00
## Alley*            1   0.20    -1.98   0.05
## LotShape*         3  -0.61    -1.60   0.04
## LandContour*      3  -3.16     8.65   0.02
## Utilities*        1  38.13  1453.00   0.00
## LotConfig*        4  -1.13    -0.59   0.04


2.2 Provide a scatterplot of X and Y.

plot(df$X,df$Y, main='Scatterplot of X, Y', col='blue')
abline(lm(df$Y~df$X), col='red', lwd=3)
text(9000, 400000, paste0('Cor = ', round(cor(X,Y),4)))


And some histograms:

# histograms of the first six numeric attributes
numcols <- sapply(train, is.numeric)
nums <- train[, numcols]

multi.hist(nums[,1:6], bcol='lightblue',
           dcol='red', lwd=3,
           nrow=3, ncol=2)


# multi.hist(nums[,18:34], bcol='lightblue', dcol='red', lwd=3)


2.3 Derive a correlation matrix for any THREE quantitative variables in the dataset.

We'll look at TotSqft, LotArea and YearBuilt. The corr.test() function provides confidence intervals. There are modest positive correlations between TotSqft and LotArea (0.33) and between TotSqft and YearBuilt (0.28). However, the correlation between LotArea and YearBuilt is not statistically significant (p = 0.59).

print(corr.test(train[, c(5,20,82)]), short=FALSE)
## Call:corr.test(x = train[, c(5, 20, 82)])
## Correlation matrix 
##           LotArea YearBuilt TotSqft
## LotArea      1.00      0.01    0.33
## YearBuilt    0.01      1.00    0.28
## TotSqft      0.33      0.28    1.00
## Sample Size 
## [1] 1460
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##           LotArea YearBuilt TotSqft
## LotArea      0.00      0.59       0
## YearBuilt    0.59      0.00       0
## TotSqft      0.00      0.00       0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
## 
##  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
##             raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## LotAr-YrBlt     -0.04  0.01      0.07  0.59     -0.04      0.07
## LotAr-TtSqf      0.28  0.33      0.37  0.00      0.27      0.38
## YrBlt-TtSqf      0.24  0.28      0.33  0.00      0.23      0.34

2.4 Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval.

We’ll do them in pairs.

  1. Correlation of TotSqft with LotArea (0.33) is significant (p<0.05), with the 92 percent confidence interval shown.
cor.test(train$TotSqft, train$LotArea, method="pearson", conf.level=0.92, alternative='two.sided')
## 
##  Pearson's product-moment correlation
## 
## data:  train$TotSqft and train$LotArea
## t = 13.243, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 92 percent confidence interval:
##  0.2861367 0.3679783
## sample estimates:
##       cor 
## 0.3276721
  2. Correlation of TotSqft with YearBuilt (0.28) is significant (p<0.05), with the 92 percent confidence interval shown.
cor.test(train$TotSqft, train$YearBuilt, method="pearson", conf.level=0.92, alternative='two.sided')
## 
##  Pearson's product-moment correlation
## 
## data:  train$TotSqft and train$YearBuilt
## t = 11.286, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 92 percent confidence interval:
##  0.2407445 0.3250592
## sample estimates:
##       cor 
## 0.2834495
  3. Correlation of LotArea with YearBuilt (0.014) is not significant (p=0.59); the 92 percent confidence interval shown crosses zero.
cor.test(train$LotArea, train$YearBuilt, method="pearson", conf.level=0.92, alternative='two.sided')
## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$YearBuilt
## t = 0.54332, df = 1458, p-value = 0.587
## alternative hypothesis: true correlation is not equal to 0
## 92 percent confidence interval:
##  -0.03162553  0.06002107
## sample estimates:
##        cor 
## 0.01422765

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

In these examples, familywise error is probably not a concern. In the two cases where we rejected the null hypothesis (i.e., concluded a true correlation exists), the p-values were so small that any multiple-comparison adjustment across the three tests would leave them significant. In the third case, the p-value was far from any significance threshold and the correlation was negligible. For the models we are building, one goal is to reduce multicollinearity, or high correlation among predictors, so it is convenient that LotArea and YearBuilt are essentially uncorrelated.
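To make the familywise point concrete, the three p-values can be adjusted for multiple comparisons. This is a sketch using the p-values reported above, with the machine floor 2.2e-16 standing in for the two vanishingly small ones.

# Holm and Bonferroni adjustments of the three pairwise p-values; the conclusions do not change
pvals <- c(TotSqft_LotArea = 2.2e-16, TotSqft_YearBuilt = 2.2e-16, LotArea_YearBuilt = 0.587)
p.adjust(pvals, method = "holm")
p.adjust(pvals, method = "bonferroni")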



Part Three: Linear Algebra and Correlation


3.1 Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

Given a single matrix argument, the solve() function returns its inverse. First the correlation matrix, then its inverse.

myMat <- cor(train[, c(5,20,82)])
# round(myMat,4)
# solve() in base R computes inverse matrix
myMat
##              LotArea  YearBuilt   TotSqft
## LotArea   1.00000000 0.01422765 0.3276721
## YearBuilt 0.01422765 1.00000000 0.2834495
## TotSqft   0.32767207 0.28344950 1.0000000
round(solve(myMat),4)
##           LotArea YearBuilt TotSqft
## LotArea    1.1288    0.0965 -0.3972
## YearBuilt  0.0965    1.0956 -0.3422
## TotSqft   -0.3972   -0.3422  1.2272
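As a check on the parenthetical claim in 3.1 that the diagonal holds variance inflation factors: each diagonal entry should equal 1/(1 - R^2) from regressing that variable on the other two. A small sketch (it assumes the derived TotSqft column is present in train, as in 2.3):

# VIF for LotArea: 1/(1 - R^2) from regressing it on the other two variables;
# this should match the LotArea diagonal entry of the precision matrix (about 1.13)
r2_lot <- summary(lm(LotArea ~ YearBuilt + TotSqft, data = train))$r.squared
1 / (1 - r2_lot)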

3.2 Multiply the correlation matrix by the precision matrix …

Voila! We have the identity matrix.

round(myMat%*%solve(myMat),4)
##           LotArea YearBuilt TotSqft
## LotArea         1         0       0
## YearBuilt       0         1       0
## TotSqft         0         0       1

3.3 … and then multiply the precision matrix by the correlation matrix.

Another identity matrix!

round(solve(myMat)%*%myMat,4)
##           LotArea YearBuilt TotSqft
## LotArea         1         0       0
## YearBuilt       0         1       0
## TotSqft         0         0       1

3.4 Conduct LU decomposition on the matrix.

The lu() function in the Matrix package performs the LU decomposition; expand() extracts the L and U factors.

luMyMat <- lu(myMat, method='LU')
L <- expand(luMyMat)$L
U <- expand(luMyMat)$U

L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]       [,2]       [,3]      
## [1,] 1.00000000          .          .
## [2,] 0.01422765 1.00000000          .
## [3,] 0.32767207 0.27884394 1.00000000
U
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]       [,2]       [,3]      
## [1,] 1.00000000 0.01422765 0.32767207
## [2,]          . 0.99979757 0.27878750
## [3,]          .          . 0.81489281

Testing it shows that the product of LU is indeed the original matrix.

L%*%U
## 3 x 3 Matrix of class "dgeMatrix"
##            [,1]       [,2]      [,3]
## [1,] 1.00000000 0.01422765 0.3276721
## [2,] 0.01422765 1.00000000 0.2834495
## [3,] 0.32767207 0.28344950 1.0000000
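One practical use of the factorization: with L and U in hand, a system myMat %*% beta = b can be solved by forward substitution followed by back substitution. A minimal sketch with an arbitrary right-hand side (no row permutation is needed here, since L %*% U already reproduces myMat exactly):

# solve myMat %*% beta = b using the LU factors
b <- c(1, 2, 3)                        # arbitrary right-hand side, for illustration only
y <- forwardsolve(as.matrix(L), b)     # solve L y = b
beta <- backsolve(as.matrix(U), y)     # solve U beta = y
# all.equal(as.numeric(beta), solve(myMat, b))   # should return TRUE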


Part Four: Calculus-Based Probability & Statistics


4.1 Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift (if necessary) it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice.

(See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html).

First, let's have a look at the data. We're working with the new feature we created above, TotSqft, whose minimum value is 334; since that is already above zero, no location shift is needed.

# min of 334 is above zero
# min(X)

hist(X, freq=F, col='lightblue',
     breaks=50, main='Total Finished Square Feet')
lines(density(X), col='red', lwd=3)
abline(v=quantile(X, probs=0.025), col='lightgray', lty=2, lwd=2)
abline(v=quantile(X, probs=0.975), col='lightgray', lty=2, lwd=2)

4.2 Run fitdistr to find an optimal distribution.

Gamma appears to be the best match among the candidates tried below (an AIC comparison is sketched after the code block).

library(MASS)
# distributions <- c("beta", "cauchy", "chi-squared", "exponential", "gamma", "geometric", "log-normal", "lognormal", "logistic", "negative binomial", "normal", "Poisson", "t")

# MASS::fitdistr(X, "normal")
# MASS::fitdistr(X, "lognormal")
# MASS::fitdistr(X, "Poisson")
# MASS::fitdistr(X, "exponential")
MASS::fitdistr(X, "gamma")
##       shape           rate    
##   7.6800697027   0.0038404034 
##  (0.2151646143) (0.0001073509)
# MASS::fitdistr(X, "cauchy")
# MASS::fitdistr(X, "t")
# MASS::fitdistr(X, "weibull", start=list(scale=1,shape=2))

4.3 Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, \(\lambda\)) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.

The fitted gamma parameters (shape and rate) are shown above. Below are the histograms; the fit is fairly close.

par(mfrow=c(1,2))
hist(X, freq=F, col='lightblue',
     breaks=50, main='TotSqft')

hist(rgamma(1000, 7.6800697027, 0.0038404034), freq=F,
     breaks=50, xlim=c(min(X), max(X)),
     col='lightblue', main='Gamma distribution')

4.4 Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

The exponential distribution is a poor fit to the data, and much worse than the gamma. The table below compares the 2.5th and 97.5th percentiles (the bounds of a 95% interval) of the actual data with those of the two fitted distributions.

MASS::fitdistr(X, "exponential")
##        rate     
##   0.00050004795 
##  (0.00001308685)
#rate = 0.00050004795

e5 <- qexp(0.025, rate=0.00050004795)
e95 <- qexp(0.975, rate=0.00050004795)
g5 <- qgamma(0.025, shape=7.6800697027, rate=0.0038404034)
g95 <- qgamma(0.975, shape=7.6800697027, rate=0.0038404034)
x5 <- quantile(X, probs=0.025)
x95 <- quantile(X, probs=0.975)

tbl3 <- as.data.frame(matrix(c(e5, e95, g5, g95, x5, x95), byrow=T, ncol=2, nrow=3))
colnames(tbl3) <- c('2.5th Percentile', '97.5th Percentile')
rownames(tbl3) <- c('Exponential', 'Gamma', 'Data')

pander(tbl3)
              2.5th Percentile   97.5th Percentile
Exponential              50.63                7377
Gamma                    845.4                3643
Data                       894                3637


4.5 Also generate a 95% confidence interval from the empirical data, assuming normality.

The empirical interval (2.5th to 97.5th percentiles) runs from 894 to 3637; it is marked by the dashed lines in the plot below and appears in the table above. A normal-theory version built from the mean and standard deviation is sketched after the plot code.

hist(X, freq=F, col='lightblue',
     breaks=50, main='Total Finished Square Feet')
lines(density(X), col='red', lwd=3)
abline(v=quantile(X, probs=0.025), col='lightgray', lty=2, lwd=2)
abline(v=quantile(X, probs=0.975), col='lightgray', lty=2, lwd=2)
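If "assuming normality" is read literally, the interval can instead be built from the sample mean and standard deviation. A sketch of that version (mean ± 1.96·sd, which covers the central 95% of a normal population):

# normal-theory 95% interval for X from the sample mean and standard deviation
m <- mean(X); s <- sd(X)
c(lower = m - qnorm(0.975) * s, upper = m + qnorm(0.975) * s)
# roughly 480 to 3520 given the mean (~2000) and sd (~774) reported in Part Two;
# it reaches lower than the empirical interval because X is right-skewed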


4.6 Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Below are the empirical 5th and 95th percentiles of the data; the commented calculations at the end compare them with the corresponding quantiles of the fitted distributions. The exponential estimates are off by roughly 80 to 90 percent, versus about 1 to 4 percent for the gamma.

So, it pays to use fitdistr().

lower <- quantile(X, probs=0.05)
upper <- quantile(X, probs=0.95)

lower; upper
##      5% 
## 1015.85
##     95% 
## 3334.15
# percentage error of each fitted distribution's 5th/95th percentiles relative to the data
# gam_error_low  <- (1016-978)/1016
# gam_error_high <- (3334-3315)/3334
# exp_error_low  <- (1016-102.6)/1016
# exp_error_high <- (3334-5991)/3334