CUNY MSDS DATA 605 Final

library(tidyverse)
library(plotly)
library(psych)
library(reshape)
library(readxl)
library(rmdformats)
library(scales)
library(corrgram)
library(MASS)
library(randomForest)
library(Amelia)
select <- dplyr::select  #MASS also exports a select() function; keep dplyr's version

train <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-605/master/train.csv")
test <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-605/master/test.csv")
data605 <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-605/master/605final.csv")

Your final is due by the end of day on 12/16/2018. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

Problem 1

Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.

data605
## # A tibble: 20 x 8
##       Y1    Y2    Y3    Y4    X1    X2    X3    X4
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  20.3  20.8  28.4  20.2   9.3   7.4   9.5   9.3
##  2  19.1  14.6  21.5  18.6   4.1   6.4   3.7  12.4
##  3  19.3  18    20.8  22.6  22.4   8.5  11.7  19.9
##  4  20.9   7.3  22.2  11.4   9.1   9.5   7.4   6.9
##  5  22    19.4  21.6  23.6  15.8  11.8   5.3  -1  
##  6  23.5  13.5  21.8  24     7.1   8.8   7.4  10.6
##  7  13.8  14.7  25.2  26    15.9   8.4   7.4   6.4
##  8  18.8  15.3  22.5  26.8   6.9   5.1   8.6  10.6
##  9  20.9  12.6  21.1  19.7  16    11.4   9.1   1.2
## 10  18.6  13    21.7  22.7   6.7  15.1  11.4   7.7
## 11  22.3  13.1  21.4  16.8   8.2  12.6   8.4  15.5
## 12  17.6  10.3  20.8  20.2  16     8     7.3   6.9
## 13  20.8  14.9  23    21.7   6.4  10.3  11.3  13.7
## 14  28.7  14.8  17.4  20.9  11.8  10.4   4.4   3.7
## 15  15.2  16.2  21.3  26.9   3.5   9.5   9.3   4.4
## 16  20.9  15.7  15.1  16.3  21.7   9.5  10.9  11.5
## 17  18.4  16.3  17.8  19.9  12.2  15.1  10.9   4.2
## 18  10.3  11.5  26.4  15.5   9.3   6.6   7.7  13.9
## 19  26.3  12.2  21.6  26.5   8    15.4   7.7  12.9
## 20  28.1  11.8  22.5  21.7   6.2   8.2  11.5   1.2

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

#use X4 as the independent variable and Y4 as the dependent variable
X <- data605$X4
Y <- data605$Y4
XY <- cbind(X, Y)

n <- nrow(XY)            #total number of observations
x <- quantile(X, 0.75)   #3rd quartile of X
y <- quantile(Y, 0.25)   #1st quartile of Y

\[P(X>x|Y>y)\]

YGy <- data.frame(subset(XY, Y > y))
lenYGy <- nrow(YGy)
XGx <- subset(YGy, X > x)
lenXGx <- nrow(XGx)
lenXGx/lenYGy
## [1] 0.2

Given that Y is above its 1st quartile, there is a 20% chance that X is above its 3rd quartile.

\[P(X>x \cap Y>y)\]

nrow(subset(XY, Y > y & X > x))/n
## [1] 0.15

There is a 15% chance that an observation has X above its 3rd quartile and Y above its 1st quartile.

\[P(X<x|Y>y)\]

YGy <- data.frame(subset(XY, Y > y))
lenYGy <- nrow(YGy)
XGx <- subset(YGy, X < x)
lenXGx <- nrow(XGx)
lenXGx/lenYGy
## [1] 0.8

Given that Y is above its 1st quartile, there is an 80% chance that X is below its 3rd quartile.

Table of Counts

5 points. In addition, make a table of counts as shown below.

nums <- c(0.15, 0.10, 0.25, 0.60, 0.15, 0.75, 0.75, 0.25, 1)
nums <- matrix(nums, byrow = T, ncol = 3)
nums
##      [,1] [,2] [,3]
## [1,] 0.15 0.10 0.25
## [2,] 0.60 0.15 0.75
## [3,] 0.75 0.25 1.00
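
As a cross-check, the same values can be generated directly from the data. The hand-entered matrix above appears to be laid out with rows 1 and 2 for X above and at-or-below its 3rd quartile, columns 1 and 2 for Y above and at-or-below its 1st quartile, and totals in the last row and column (the cell values match the probabilities computed above). A minimal sketch, noting that table() arranges the cells by FALSE/TRUE so the layout differs from the matrix even though the cell values agree:

counts <- table(X > x, Y > y)      #2 x 2 table of counts from the data
addmargins(counts)                 #counts with row and column totals
addmargins(counts) / nrow(XY)      #the same table as proportions

The counts version of this table is also what feeds the chi-square test in the next section.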

Chi-Square

5 points. Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
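
Checking mathematically first, here is a minimal sketch using the X, Y, x, and y objects defined earlier (the splits actually used above are X above its 3rd quartile and Y above its 1st quartile):

#compare the joint probability with the product of the marginal probabilities
p_x  <- mean(X > x)                    #P(X > x) = 0.25
p_y  <- mean(Y > y)                    #P(Y > y) = 0.75
p_xy <- mean(X > x & Y > y)            #P(X > x and Y > y) = 0.15
c(joint = p_xy, product = p_x * p_y)   #0.15 vs. 0.1875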

#build the 2 x 2 contingency table of counts (X vs. its 3rd quartile, Y vs. its 1st quartile)
chitest <- table(X > x, Y > y)
chisq.test(chitest)
## Warning in chisq.test(chitest): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  chitest
## X-squared = 0.088889, df = 1, p-value = 0.7656

Mathematically, P(X > x)P(Y > y) = 0.25 × 0.75 = 0.1875, while P(X > x and Y > y) = 0.15, so the two quantities are close but not exactly equal. Consistent with this, the chi-square p-value is high (0.7656), so we fail to reject the null hypothesis of independence: splitting the data this way gives no evidence of an association between the two variables.

Problem 2 Kaggle

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.

Descriptive and Inferential Statistics

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Y <- train$SalePrice
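
The assignment also asks for univariate descriptive statistics. A minimal sketch using describe() from the psych package loaded above, restricted to the numeric columns:

#univariate descriptive statistics (n, mean, sd, median, skew, kurtosis, se) for the numeric columns
describe(select_if(train, is.numeric))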

Plots

#keep only the numeric columns, then reshape to long format for faceted histograms
trainhist <- select_if(train, is.numeric)

trainhist1 <- trainhist %>%
  gather()

p2 <- ggplot(trainhist1, aes(value)) +                     
    facet_wrap(~ key, scales = "free") +   
    geom_bar()    

ggplotly(p2)
## Warning: Removed 348 rows containing non-finite values (stat_count).
p3 <- ggplot(train, aes(GarageCars, SalePrice)) +
  geom_point() +
  xlab("Garage Cars") +
  ylab("Sales Price") +
  ggtitle("Sales Price vs. Garage Cars") +
  scale_y_continuous(labels = comma)

ggplotly(p3)
p4 <- ggplot(train, aes(`1stFlrSF`, SalePrice)) +
  geom_point(aes(color = `1stFlrSF`)) +
  xlab("1st floor Sqft") +
  ylab("Sales Price") +
  ggtitle("Sales Price vs. 1st floor Sqft") +
  scale_y_continuous(labels = comma)

ggplotly(p4)
p5 <- ggplot(train, aes(LotArea, SalePrice)) +
  geom_point(aes(color = `1stFlrSF`)) +
  xlab("Lot Area") +
  ylab("Sales Price") +
  ggtitle("Sales Price vs. Lot Area") +
  scale_y_continuous(labels = comma)

ggplotly(p5)

Correlation Plot

corr <- select(train, LotArea, `1stFlrSF`, GarageCars, SalePrice)
corrcor <- cor(corr)
corrgram(drop_na(corr), order=TRUE, upper.panel=panel.cor, main= "train")

cor.test(train$LotArea, train$SalePrice, method = "pearson" , conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434
cor.test(train$`1stFlrSF`, train$SalePrice, method = "pearson" , conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$`1stFlrSF` and train$SalePrice
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5841687 0.6266715
## sample estimates:
##       cor 
## 0.6058522
cor.test(train$GarageCars, train$SalePrice, method = "pearson" , conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$GarageCars and train$SalePrice
## t = 31.839, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6201771 0.6597899
## sample estimates:
##       cor 
## 0.6404092

The p-values for all three tests are far below 0.05, so we reject the null hypothesis that each pairwise correlation is zero; none of the 80% confidence intervals contains zero.

Familywise error

According to StatisticsHowTo, the familywise error rate (FWE or FWER) is the probability of coming to at least one false conclusion in a series of hypothesis tests. In other words, it is the probability of making at least one Type I error.

\[1-(1-\alpha )^c\]
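
Plugging in α = 0.20 (the complement of the 80% confidence level used above) and c = 3 tests gives a sense of how large that rate is here; a quick sketch:

#familywise error rate for c = 3 tests, each run at alpha = 0.20
1 - (1 - 0.20)^3   #about 0.488, i.e., nearly a 50% chance of at least one Type I error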

Because that rate is so high, familywise error is definitely a concern with three simultaneous tests. A common remedy is a Bonferroni correction, which divides α by the number of tests; the corresponding per-test confidence level is:

1-(.20/3)
## [1] 0.9333333

Rerunning the correlation tests with this Bonferroni-adjusted confidence level:

cor.test(train$LotArea, train$SalePrice, method = "pearson" , conf.level = 0.9333333)
## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93.33333 percent confidence interval:
##  0.2186041 0.3079508
## sample estimates:
##       cor 
## 0.2638434
cor.test(train$`1stFlrSF`, train$SalePrice, method = "pearson" , conf.level = 0.9333333)
## 
##  Pearson's product-moment correlation
## 
## data:  train$`1stFlrSF` and train$SalePrice
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93.33333 percent confidence interval:
##  0.5745554 0.6353798
## sample estimates:
##       cor 
## 0.6058522
cor.test(train$GarageCars, train$SalePrice, method = "pearson" , conf.level = 0.9333333)
## 
##  Pearson's product-moment correlation
## 
## data:  train$GarageCars and train$SalePrice
## t = 31.839, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 93.33333 percent confidence interval:
##  0.6111920 0.6678834
## sample estimates:
##       cor 
## 0.6404092

Linear Algebra and Correlation

5 points. Linear Algebra and Correlation. Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

corrcor
##              LotArea  1stFlrSF GarageCars SalePrice
## LotArea    1.0000000 0.2994746  0.1548707 0.2638434
## 1stFlrSF   0.2994746 1.0000000  0.4393168 0.6058522
## GarageCars 0.1548707 0.4393168  1.0000000 0.6404092
## SalePrice  0.2638434 0.6058522  0.6404092 1.0000000
precision_matrix <- solve(corrcor)
precision_matrix
##                LotArea   1stFlrSF  GarageCars  SalePrice
## LotArea     1.11298710 -0.2494367  0.04830118 -0.1734650
## 1stFlrSF   -0.24943672  1.6470490 -0.14926630 -0.8364645
## GarageCars  0.04830118 -0.1492663  1.70941294 -1.0170344
## SalePrice  -0.17346499 -0.8364645 -1.01703440  2.2038596
corrcor %*% precision_matrix
##                  LotArea     1stFlrSF   GarageCars    SalePrice
## LotArea     1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## 1stFlrSF   -1.387779e-17 1.000000e+00 0.000000e+00 0.000000e+00
## GarageCars -2.775558e-17 1.110223e-16 1.000000e+00 2.220446e-16
## SalePrice  -2.775558e-17 1.110223e-16 2.220446e-16 1.000000e+00
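
The assignment also asks for the product in the other order and for an LU decomposition of the correlation matrix. A minimal sketch using lu() and expand() from the Matrix package (an assumption here; any LU routine would do), where the precision-times-correlation product should again be the identity up to floating-point error:

library(Matrix)

#precision matrix times correlation matrix (identity, up to rounding error)
precision_matrix %*% corrcor

#LU decomposition of the correlation matrix
corr_lu <- expand(lu(corrcor))
corr_lu$L   #lower-triangular factor
corr_lu$U   #upper-triangular factor
corr_lu$P   #permutation matrix from the pivoted factorization

Up to that pivoting, the product of the L and U factors reconstructs the correlation matrix.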

Calculus-Based Probability & Statistics

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

xfit <- fitdistr(train$`1stFlrSF`, "exponential")
hist(train$`1stFlrSF`, breaks = 30)

rand_samp <- rexp(1000, xfit$estimate[[1]])
hist(rand_samp, breaks = 30)

xfit2 <- fitdistr(train$LotArea, "exponential")
hist(train$LotArea, breaks = 30)

rand_samp <- rexp(1000, xfit2$estimate[[1]])
hist(rand_samp, breaks = 30)
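
The rest of this question asks for the 5th and 95th percentiles from the fitted exponential CDF, a 95% confidence interval from the empirical data assuming normality, and the empirical 5th and 95th percentiles. A minimal sketch for the 1stFlrSF fit, reusing the xfit object from above:

#5th and 95th percentiles implied by the fitted exponential CDF
lambda <- xfit$estimate[[1]]
qexp(c(0.05, 0.95), rate = lambda)

#95% confidence interval for the mean of the observed data, assuming normality
n_obs <- length(train$`1stFlrSF`)
xbar  <- mean(train$`1stFlrSF`)
se    <- sd(train$`1stFlrSF`) / sqrt(n_obs)
xbar + c(-1, 1) * qnorm(0.975) * se

#empirical 5th and 95th percentiles of the observed data
quantile(train$`1stFlrSF`, c(0.05, 0.95))

Since the exponential density is highest at zero while the actual square footage never gets near zero, the fitted exponential's lower percentile can be expected to fall well below the empirical 5th percentile.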

Clean and Tidy Data

missmap(train, main = "missing vs. observed")

Clean train Data

When running the linear models, I ran into errors caused by missing values and by factor variables with fewer than two levels. I used the following code to clean the data and fix these issues.

#clean the column names, then impute missing values:
#"Others" for character columns, -999 for numeric columns
names(train) <- make.names(names(train))
features <- setdiff(colnames(train), c("Id", "SalePrice"))
for (f in features) {
  if (any(is.na(train[[f]]))) 
    if (is.character(train[[f]])){ 
      train[[f]][is.na(train[[f]])] <- "Others"
    }else{
      train[[f]][is.na(train[[f]])] <- -999  
    }
}

#keep only columns that are non-factors or factors with more than one level
column_class <- lapply(train, class)
column_class <- column_class[column_class != "factor"]
factor_levels <- lapply(train, nlevels)
factor_levels <- factor_levels[factor_levels > 1]
train <- train[,names(train) %in% c(names(factor_levels), names(column_class))]

#convert character columns to factors and keep a numeric-only copy for modeling
train <- as.data.frame(unclass(train))
ntrain <- select_if(train, is.numeric)

Clean test data

names(test) <- make.names(names(test))
features <- setdiff(colnames(test), c("Id", "SalePrice"))
for (f in features) {
  if (any(is.na(test[[f]]))) 
    if (is.character(test[[f]])){ 
      test[[f]][is.na(test[[f]])] <- "Others"
    }else{
      test[[f]][is.na(test[[f]])] <- -999  
    }
}

column_class <- lapply(test,class)
column_class <- column_class[column_class != "factor"]
factor_levels <- lapply(test, nlevels)
factor_levels <- factor_levels[factor_levels > 1]
test <- test[,names(test) %in% c(names(factor_levels), names(column_class))]

test <- as.data.frame(unclass(test))
ntest<-select_if(test, is.numeric)

Models

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

LM Baseline Model

All of the numeric variables are included first to establish a baseline model. This shows which variables are significant in our dataset and guides the models that follow.

Looking at the summary, many of the variables are not statistically significant based on their p-values, so I will create a second model using only the significant variables.

model <- lm(SalePrice ~ . , data=ntrain)
summary(model)
## 
## Call:
## lm(formula = SalePrice ~ ., data = ntrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -475262  -16290   -2217   14029  295097 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.143e+05  1.403e+06   0.295 0.767771    
## Id            -1.048e+00  2.171e+00  -0.483 0.629203    
## MSSubClass    -1.687e+02  2.608e+01  -6.468 1.36e-10 ***
## LotFrontage    3.109e+00  2.299e+00   1.352 0.176463    
## LotArea        4.024e-01  1.003e-01   4.013 6.30e-05 ***
## OverallQual    1.736e+04  1.181e+03  14.697  < 2e-16 ***
## OverallCond    5.153e+03  1.025e+03   5.027 5.61e-07 ***
## YearBuilt      3.497e+02  6.067e+01   5.764 1.01e-08 ***
## YearRemodAdd   1.115e+02  6.619e+01   1.684 0.092412 .  
## MasVnrArea     2.150e+01  5.193e+00   4.141 3.66e-05 ***
## BsmtFinSF1     1.900e+01  4.628e+00   4.105 4.27e-05 ***
## BsmtFinSF2     8.632e+00  7.011e+00   1.231 0.218432    
## BsmtUnfSF      8.368e+00  4.176e+00   2.004 0.045279 *  
## TotalBsmtSF           NA         NA      NA       NA    
## X1stFlrSF      4.812e+01  5.731e+00   8.397  < 2e-16 ***
## X2ndFlrSF      4.889e+01  4.917e+00   9.943  < 2e-16 ***
## LowQualFinSF   1.731e+01  1.970e+01   0.879 0.379726    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   8.486e+03  2.598e+03   3.267 0.001114 ** 
## BsmtHalfBath   1.862e+03  4.058e+03   0.459 0.646417    
## FullBath       3.219e+03  2.803e+03   1.148 0.250967    
## HalfBath      -1.563e+03  2.644e+03  -0.591 0.554423    
## BedroomAbvGr  -1.021e+04  1.683e+03  -6.066 1.68e-09 ***
## KitchenAbvGr  -1.543e+04  5.200e+03  -2.968 0.003048 ** 
## TotRmsAbvGrd   4.836e+03  1.230e+03   3.932 8.84e-05 ***
## Fireplaces     4.314e+03  1.766e+03   2.443 0.014694 *  
## GarageYrBlt   -9.649e+00  1.773e+00  -5.441 6.23e-08 ***
## GarageCars     1.568e+04  2.978e+03   5.266 1.61e-07 ***
## GarageArea     5.190e+00  9.707e+00   0.535 0.592957    
## WoodDeckSF     2.568e+01  7.923e+00   3.242 0.001215 ** 
## OpenPorchSF   -5.835e+00  1.509e+01  -0.387 0.698981    
## EnclosedPorch  1.222e+01  1.674e+01   0.730 0.465505    
## X3SsnPorch     2.017e+01  3.118e+01   0.647 0.517861    
## ScreenPorch    5.712e+01  1.707e+01   3.347 0.000837 ***
## PoolArea      -3.455e+01  2.346e+01  -1.473 0.140989    
## MiscVal       -3.759e-01  1.847e+00  -0.204 0.838748    
## MoSold        -5.100e+01  3.424e+02  -0.149 0.881631    
## YrSold        -6.839e+02  6.974e+02  -0.981 0.326921    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34500 on 1424 degrees of freedom
## Multiple R-squared:  0.816,  Adjusted R-squared:  0.8114 
## F-statistic: 180.4 on 35 and 1424 DF,  p-value: < 2.2e-16

Significant Variables only

sigvars <- data.frame(summary(model)$coef[summary(model)$coef[,4] <= .05, 4])
sigvars <- tibble::rownames_to_column(sigvars, "vars")
colist <- dplyr::pull(sigvars, vars)
matchid <- match(colist, names(ntrain))
train2 <- cbind(ntrain[,matchid], ntrain['SalePrice'])
model2<-lm(SalePrice ~ ., data=train2)
summary(model2)
## 
## Call:
## lm(formula = SalePrice ~ ., data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -485362  -16106   -2176   14198  277137 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.060e+05  8.784e+04  -9.176  < 2e-16 ***
## MSSubClass   -1.648e+02  2.561e+01  -6.436 1.67e-10 ***
## LotArea       4.076e-01  9.914e-02   4.111 4.16e-05 ***
## OverallQual   1.821e+04  1.146e+03  15.884  < 2e-16 ***
## OverallCond   5.692e+03  9.195e+02   6.190 7.84e-10 ***
## YearBuilt     3.809e+02  4.455e+01   8.550  < 2e-16 ***
## MasVnrArea    2.041e+01  5.089e+00   4.010 6.39e-05 ***
## BsmtFinSF1    1.559e+01  3.862e+00   4.036 5.71e-05 ***
## BsmtUnfSF     6.304e+00  3.604e+00   1.749 0.080498 .  
## X1stFlrSF     5.211e+01  5.039e+00  10.341  < 2e-16 ***
## X2ndFlrSF     4.860e+01  4.029e+00  12.062  < 2e-16 ***
## BsmtFullBath  8.699e+03  2.379e+03   3.657 0.000264 ***
## BedroomAbvGr -1.037e+04  1.629e+03  -6.363 2.65e-10 ***
## KitchenAbvGr -1.610e+04  5.049e+03  -3.188 0.001463 ** 
## TotRmsAbvGrd  5.168e+03  1.200e+03   4.307 1.76e-05 ***
## Fireplaces    3.302e+03  1.718e+03   1.922 0.054783 .  
## GarageYrBlt  -1.041e+01  1.716e+00  -6.067 1.66e-09 ***
## GarageCars    1.766e+04  2.031e+03   8.696  < 2e-16 ***
## WoodDeckSF    2.517e+01  7.819e+00   3.219 0.001317 ** 
## ScreenPorch   5.360e+01  1.674e+01   3.203 0.001391 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34490 on 1440 degrees of freedom
## Multiple R-squared:  0.814,  Adjusted R-squared:  0.8115 
## F-statistic: 331.6 on 19 and 1440 DF,  p-value: < 2.2e-16

Random Forest Model

A random forest is an “ensemble learning method for classification, regression, and other tasks.” To use the random forest method in R, the randomForest package (loaded above) is used to create and analyze the forest.

#pull out the Y variable and store it in a separate variable
y_train <- ntrain$SalePrice
#select all columns except the Y variable
x_train_df <- subset(ntrain, select = -SalePrice)

#run the randomForest model
modelrf <- randomForest(x_train_df, y = y_train, ntree = 500, importance = T)

plot(modelrf)
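
Because the forest was grown with importance = T, it is worth checking which predictors it relies on most; a quick sketch using varImpPlot() from the randomForest package:

#variable importance from the fitted random forest (top 15 predictors)
varImpPlot(modelrf, n.var = 15, main = "Random Forest Variable Importance")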

Predictions

pred1<-predict(model, ntest)

kaggle <- as.data.frame(cbind(ntest$Id, pred1))
colnames(kaggle) <- c("Id", "SalePrice")

write.csv(kaggle, file = "Kaggle_Submission1.csv", quote=FALSE, row.names=FALSE)
pred2<-predict(model2, ntest)

#export data for Kaggle
kaggle <- as.data.frame(cbind(ntest$Id, pred2))
colnames(kaggle) <- c("Id", "SalePrice")

write.csv(kaggle, file = "Kaggle_Submission2.csv", quote=FALSE, row.names=FALSE)
pred3<-predict(modelrf, ntest)

kaggle <- as.data.frame(cbind(ntest$Id, pred3))
colnames(kaggle) <- c("Id", "SalePrice")

write.csv(kaggle, file = "Kaggle_Submission3.csv", quote=FALSE, row.names=FALSE)

Conclusions

Out of the three models created (the baseline LM, the LM with only highly significant variables, and the random forest), the random forest made the best predictions.

Kaggle username: Drubsteps

knitr::include_graphics('final605.png')

Nicholas Schettini

12/8/2018