Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to \(N\), where \(N\) can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu = \sigma = \frac{N+1}{2}\).
We will generate the requested random variables X and Y below:
set.seed(9)
N <- 25
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N+1)/2)
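Note that the prompt asks for the standard deviation to equal \(\frac{N+1}{2}\) as well; the call above leaves sd at its default of 1, and all of the results that follow are based on that default. If we also wanted \(\sigma = \frac{N+1}{2}\), the call would look like the sketch below (not run here):
# Sketch (not run): Y <- rnorm(10000, mean = (N+1)/2, sd = (N+1)/2)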
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.002 7.136 13.149 13.095 19.121 25.000
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.182 12.315 13.001 13.004 13.675 16.793
a. \(P(X>x\quad |\quad X>y)\)
To find this probability we have to remember that \(P(A\quad |\quad B)=P(A\quad and\quad B)/P(B)\); thus it will be equal to \(P(X>x\quad and\quad X>y)/P(X>y)\).
x <- median(X)
y <- quantile(Y, 0.25)
sum(X>x & X>y)/sum(X>y)
## [1] 0.9384384
This means that there is roughly a 93.8% probability that the variable X is greater than its median of 13.149 given that this same variable is greater than the 1st quartile of the Y variable at 12.315.
b. \(P(X>x,\quad Y>y)\)
This is equivalent to finding \(P(X>x\quad and\quad Y>y)\)
sum(X>x & Y>y)/length(X)
## [1] 0.3783
This means that there is a 37.83% probability that the variable X is greater than its median of 13.149 while variable Y is greater than its 1st quartile of 12.315.
c. \(P(X<x\quad |\quad X>y)\)
This probability is similar to the first one, but here we will find the probability that \(X<x\) given \(X>y\); thus it is the same as \(P(X<x\quad and\quad X>y)/P(X>y)\).
sum(X<x & X>y)/sum(X>y)
## [1] 0.06156156
This means that there is roughly a 6.2% probability that the variable X is less than its median of 13.149 given that this same variable is greater than the 1st quartile of the Y variable at 12.315.
Investigate whether \(P(X>x\quad and\quad Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
library(kableExtra)
tabl <- c(sum(X<x & Y<y), sum(X<x & Y>y))
tabl <- rbind(tabl, c(sum(X>x & Y<y), sum(X>x & Y>y)))
tabl <- cbind(tabl, tabl[,1] + tabl[,2])
tabl <- rbind(tabl, tabl[1,] + tabl[2,])
colnames(tabl) <- c("Y<y", "Y>y", "Total")
rownames(tabl) <- c("X<x", "X>x", "Total")
tabl %>% kable() %>% kable_styling()
|       | Y<y  | Y>y  | Total |
|-------|------|------|-------|
| X<x   | 1283 | 3717 | 5000  |
| X>x   | 1217 | 3783 | 5000  |
| Total | 2500 | 7500 | 10000 |
In part (b) above we established that \(P(X>x\quad and\quad Y>y)\) was equal to 37.83%; we will now use the table we have created above to verify this and also to find \(P(X>x)P(Y>y)\).
Please note that in order to simplify our variable names we will refer to \(X>x\) simply as \(Xx\), and \(Y>y\) simply as \(Yy\) in our calculations below:
Xx <- tabl[2, 3] #this is total X>x located in row 2 column 3
Yy <- tabl[3, 2] #this is total Y>y located in row 3 column 2
Xx_Yy <- tabl[2, 2] #this is the intersection of X>x and Y>y located in row 2 column 2
total <- tabl[3, 3]
prob_Xx <- Xx/total
prob_Yy <- Yy/total
prob_Xx_Yy <- Xx_Yy/total
prob_Xx
## [1] 0.5
prob_Yy
## [1] 0.75
prob_Xx_Yy
## [1] 0.3783
prob_Xx*prob_Yy
## [1] 0.375
As we can see from our results, we were able to verify that \(P(X>x\quad and\quad Y>y)\) is equal to 37.83%. We also found \(P(X>x)P(Y>y)\), which is 37.5%. The two values are not exactly equal, but they are very close, which is what we would expect if the two events were independent.
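As a quick cross-check (a minimal sketch using the raw vectors rather than the table), the same two quantities can be computed directly:
mean(X > x & Y > y) # joint probability P(X>x and Y>y), 0.3783 as above
mean(X > x) * mean(Y > y) # product of the marginals, 0.375 as above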
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
We will use the built-in fisher.test and chisq.test functions in R to assess independence.
fisher.test(table(X>x, Y>y))
##
## Fisher's Exact Test for Count Data
##
## data: table(X > x, Y > y)
## p-value = 0.1333
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.979014 1.175895
## sample estimates:
## odds ratio
## 1.072926
The Fisher's Exact Test returns a p-value of 0.1333, which is larger than a significance level of 0.05, so we fail to reject the null hypothesis. In other words, there is no evidence of a relationship between X>x and Y>y; the data are consistent with independence.
chisq.test(table(X>x, Y>y))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(X > x, Y > y)
## X-squared = 2.2533, df = 1, p-value = 0.1333
We first see that the X-squared statistic is small, which provides little evidence against the null hypothesis. Second, the p-value of the Chi-Squared test (0.1333) is larger than a significance level of 0.05, so we again fail to reject the null hypothesis; the data are consistent with X>x and Y>y being independent.
In this case I would say that the Chi-Square Test is the more appropriate of the two since we have a sample of 10,000. Fisher's Exact Test is typically used when cell counts are small; with counts this large, both tests give essentially the same p-value.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
I have saved the data in a GitHub repository, so I will load it from there:
train_df <- read.csv ('https://raw.githubusercontent.com/marioipena/FinalProjectDATA605/master/train.csv', header=TRUE, stringsAsFactors = FALSE)
Let’s take a look at the descriptive statistics of our variables and at some of the plots:
summary(train_df)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
We will pick a few independent variables, along with the dependent variable "SalePrice", from our data to provide the required plots.
library(ggplot2)
plot1 <- ggplot(train_df, aes(LotArea)) + geom_freqpoly(col = "red", binwidth = 4000, lwd = 1, na.rm = TRUE, position = "identity") + labs(title = "Frequency Polygon Histogram of Lot Area", x = "Area (sq ft)", y = "Count") + theme(plot.title = element_text(size = 11))
plot2 <- ggplot(train_df, aes(YearBuilt)) + geom_histogram(col = "blue", binwidth = 5, lwd = 1, na.rm = TRUE, position = "identity") + labs(title = "Histogram of Year Built", x = "Year", y = "Count")
plot3 <- ggplot(train_df, aes(TotalBsmtSF)) + geom_histogram(col = "green", binwidth = 300, fill = "red", alpha = 0.2, lwd = 1, na.rm = TRUE, position = "identity") + labs(title = "Histogram of Total Basement Area", x = "Area (sq ft)", y = "Count")
plot4 <- ggplot(train_df, aes(GrLivArea)) + geom_histogram(col = "black", binwidth = 400, fill = "deeppink4", alpha = 0.4, lwd = 1, na.rm = TRUE, position = "identity") + labs(title = "Histogram of Total Living Area", x = "Area (sq ft)", y = "Count") + theme(plot.title = element_text(size = 12))
plot5 <- ggplot(train_df, aes(TotRmsAbvGrd)) + geom_bar(fill = "khaki4", alpha = 0.7, lwd = 1, na.rm = TRUE, position = "identity") + labs(title = "Histogram of Total Rooms", x = "Number of Rooms", y = "Count") + theme(plot.title = element_text(size = 12))
plot6 <- ggplot(train_df, aes(SalePrice)) + geom_histogram(col = "gray", fill = "dark blue", alpha = 0.7, lwd = 1, na.rm = TRUE, position = "identity", binwidth = 10000) + labs(title = "Histogram of Sale Price", x = "Price in USD", y = "Count") + theme(plot.title = element_text(size = 12))
require(gridExtra)
## Loading required package: gridExtra
grid.arrange(plot1, plot2, ncol=2)
grid.arrange(plot3, plot4, ncol=2)
grid.arrange(plot5, plot6, ncol=2)
Let’s examine the scatterplot matrix and correlation matrix:
train_var <- train_df[ , c(5,20,39,47,55,81)] # LotArea, YearBuilt, TotalBsmtSF, GrLivArea, TotRmsAbvGrd, SalePrice
plot(train_var , pch=20 , cex=1.5 , col="dark blue")
Although not entirely clear from the scatterplot matrix alone, the variables that appear to have the strongest correlation with "SalePrice" are "TotalBsmtSF" and "GrLivArea".
We will explore these variables further:
library(corrplot)
## corrplot 0.84 loaded
train_cor <- cor(train_var)
print(train_cor)
## LotArea YearBuilt TotalBsmtSF GrLivArea TotRmsAbvGrd SalePrice
## LotArea 1.00000000 0.01422765 0.2608331 0.2631162 0.19001478 0.2638434
## YearBuilt 0.01422765 1.00000000 0.3914520 0.1990097 0.09558913 0.5228973
## TotalBsmtSF 0.26083313 0.39145200 1.0000000 0.4548682 0.28557256 0.6135806
## GrLivArea 0.26311617 0.19900971 0.4548682 1.0000000 0.82548937 0.7086245
## TotRmsAbvGrd 0.19001478 0.09558913 0.2855726 0.8254894 1.00000000 0.5337232
## SalePrice 0.26384335 0.52289733 0.6135806 0.7086245 0.53372316 1.0000000
corrplot(train_cor)
As expected, "TotalBsmtSF" and "GrLivArea" have the strongest correlation coefficients with respect to "SalePrice" among the variables we chose to look at.
Lastly, we will test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
SalePrice and TotalBsmtSF:
cor.test(train_var$SalePrice, train_var$TotalBsmtSF, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train_var$SalePrice and train_var$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
We can see from our results that the p-value is close to 0 and well below a significance level of 0.05, thus we reject the null hypothesis and conclude that there is a correlation between "SalePrice" and "TotalBsmtSF". We are 80% confident that the correlation coefficient of these two variables is roughly between 0.59 and 0.63.
SalePrice and GrLivArea:
cor.test(train_var$SalePrice, train_var$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train_var$SalePrice and train_var$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
In this case we can also see that the p-value is close to 0 and well below a significance level of 0.05, thus we reject the null hypothesis and conclude that there is a correlation between "SalePrice" and "GrLivArea". We are 80% confident that the correlation coefficient of these two variables is roughly between 0.69 and 0.72.
TotalBsmtSF and GrLivArea:
cor.test(train_var$TotalBsmtSF, train_var$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train_var$TotalBsmtSF and train_var$GrLivArea
## t = 19.503, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4278380 0.4810855
## sample estimates:
## cor
## 0.4548682
Lastly, the resulting p-value from this test is also close to 0 and well below a significance level of 0.05, thus we reject the null hypothesis and conclude that there is a correlation between "TotalBsmtSF" and "GrLivArea". We are 80% confident that the correlation coefficient of these two variables is roughly between 0.43 and 0.48.
We have a large number of variables in this dataset, which means we could conduct many hypothesis tests looking for correlations and other indicators of relationships between variables. The more tests we conduct, the greater the concern about familywise error, i.e. the probability of making one or more false discoveries (Type I errors), so yes, familywise error would be a worry here.
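To make this concrete, for \(m\) independent tests each run at significance level \(\alpha\), the familywise error rate is \(1-(1-\alpha)^{m}\); a small sketch for the three correlation tests above, assuming a per-test \(\alpha\) of 0.05:
alpha <- 0.05 # per-test significance level (assumed)
m <- 3 # number of pairwise correlation tests above
1 - (1 - alpha)^m # familywise error rate, about 0.14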
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
To simplify the correlation matrix created above, I will reduce it to three variables to make it clearer. I will use the variable "SalePrice" and the two variables with the strongest correlation to it, "TotalBsmtSF" and "GrLivArea":
train_var2 <- train_df[ , c(39,47,81)] # TotalBsmtSF, GrLivArea, SalePrice
correlation_matx <- cor(train_var2)
precision_matx <- solve(correlation_matx)
cm_pm <- round((correlation_matx %*% precision_matx), 2)
cm_pm
## TotalBsmtSF GrLivArea SalePrice
## TotalBsmtSF 1 0 0
## GrLivArea 0 1 0
## SalePrice 0 0 1
pm_cm <- round((precision_matx %*% correlation_matx), 2)
pm_cm
## TotalBsmtSF GrLivArea SalePrice
## TotalBsmtSF 1 0 0
## GrLivArea 0 1 0
## SalePrice 0 0 1
As we can see, both of the multiplications produce an identity matrix.
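As noted in the prompt, the diagonal of the precision matrix holds the variance inflation factors for these variables; a quick check (the values match the diagonal of the precision matrix printed further below):
diag(precision_matx) # roughly 1.61 (TotalBsmtSF), 2.01 (GrLivArea), 2.56 (SalePrice)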
LU Decomposition:
library(Matrix)
lu_cor <- lu(correlation_matx)
lud_cor <- expand(lu_cor)
lu_pre <- lu(precision_matx)
lud_pre <- expand(lu_pre)
lud_cor
## $L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.0000000 . .
## [2,] 0.4548682 1.0000000 .
## [3,] 0.6135806 0.5415823 1.0000000
##
## $U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.4548682 0.6135806
## [2,] . 0.7930949 0.4295262
## [3,] . . 0.3908951
##
## $P
## 3 x 3 sparse Matrix of class "pMatrix"
##
## [1,] | . .
## [2,] . | .
## [3,] . . |
lud_pre
## $L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.00000000 . .
## [2,] -0.04031325 1.00000000 .
## [3,] -0.58501360 -0.70862448 1.00000000
##
## $U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.60588442 -0.06473842 -0.93946422
## [2,] . 2.00863169 -1.42336558
## [3,] . . 1.00000000
##
## $P
## 3 x 3 sparse Matrix of class "pMatrix"
##
## [1,] | . .
## [2,] . | .
## [3,] . . |
We know that \(A = LU\) (here the permutation matrix \(P\) is the identity), so we will check whether we recover our original matrices when multiplying the two factors together.
cor_L <- lud_cor$L
cor_U <- lud_cor$U
pre_L <- lud_pre$L
pre_U <- lud_pre$U
correlation_matx
## TotalBsmtSF GrLivArea SalePrice
## TotalBsmtSF 1.0000000 0.4548682 0.6135806
## GrLivArea 0.4548682 1.0000000 0.7086245
## SalePrice 0.6135806 0.7086245 1.0000000
cor_L %*% cor_U
## 3 x 3 Matrix of class "dgeMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.4548682 0.6135806
## [2,] 0.4548682 1.0000000 0.7086245
## [3,] 0.6135806 0.7086245 1.0000000
precision_matx
## TotalBsmtSF GrLivArea SalePrice
## TotalBsmtSF 1.60588442 -0.06473842 -0.9394642
## GrLivArea -0.06473842 2.01124151 -1.3854927
## SalePrice -0.93946422 -1.38549273 2.5582310
pre_L %*% pre_U
## 3 x 3 Matrix of class "dgeMatrix"
## [,1] [,2] [,3]
## [1,] 1.60588442 -0.06473842 -0.9394642
## [2,] -0.06473842 2.01124151 -1.3854927
## [3,] -0.93946422 -1.38549273 2.5582310
We get the corresponding matrices when multiplying L and U.
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
hist(train_var2$GrLivArea, main = "Distribution of Gross Living Area", xlab = "Area (sq ft)")
min(train_var2$GrLivArea)
## [1] 334
The minimum value of this variable is 334 so we will not need to shift it.
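For reference, had the minimum been at or below zero, a simple shift such as the sketch below (not needed here) would make every value strictly positive before fitting:
# Hypothetical shift so the minimum becomes 1; not applied since min(GrLivArea) = 334 > 0
shifted <- train_var2$GrLivArea - min(train_var2$GrLivArea) + 1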
library(MASS)
set.seed(20)
(fexp <- fitdistr(train_var2$GrLivArea, "exponential"))
## rate
## 6.598640e-04
## (1.726943e-05)
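As a sanity check, the maximum-likelihood estimate of the exponential rate is simply the reciprocal of the sample mean, which agrees with the fitdistr output:
1 / mean(train_var2$GrLivArea) # about 6.6e-04, the same rate reported by fitdistr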
samples <- rexp(1000, fexp$estimate)
hist(train_var2$GrLivArea, breaks=30, prob=TRUE, main = "Distribution of Gross Living Area", xlab = "Area (sq ft)")
hist(samples, breaks=30, prob=TRUE, main = "Distribution of Sample Gross Living Area", xlab = "Sample Area (sq ft)")
We see that the simulated exponential sample is much more strongly skewed to the right, with its highest frequency near zero.
Finding the 5th and 95th percentiles using the cumulative distribution function (CDF):
Samp_cdf <- ecdf(samples) # empirical CDF of the simulated exponential sample
samples[Samp_cdf(samples)==0.05]
## [1] 51.37788
samples[Samp_cdf(samples)==0.95]
## [1] 4538.718
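Equivalently, since the exponential CDF is \(F(x)=1-e^{-\lambda x}\), the theoretical 5th and 95th percentiles can be obtained by inverting the CDF directly (a sketch; these will differ somewhat from the values drawn from our 1,000 simulated samples above):
qexp(c(0.05, 0.95), rate = fexp$estimate) # roughly 78 and 4540 sq ft for the fitted rate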
Generating a 95% confidence interval from the empirical data, assuming normality:
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
CI(train_var2$GrLivArea, ci = 0.95)
## upper mean lower
## 1542.440 1515.464 1488.487
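The same interval can be reproduced without Rmisc, for example with the one-sample t interval (a quick cross-check):
t.test(train_var2$GrLivArea, conf.level = 0.95)$conf.int # about (1488.5, 1542.4)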
Providing the empirical 5th percentile and 95th percentile of the data:
quantile(train_var2$GrLivArea, probs = c(0.05, 0.95))
## 5% 95%
## 848.0 2466.1
The differences in the percentiles between our empirical data and the simulated exponential sample can be explained by their distribution plots: the frequencies are distributed quite differently. The minimum of the empirical data is 334, whereas the simulated sample has its highest frequency near zero and much greater skewness to the right.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Kaggle Username: mariopena
Kaggle Competition Score: 0.85277
We will first find the variables that have a correlation coefficient greater than 0.5 with SalePrice in order to build our first multiple regression model:
num_Vars <- which(sapply(train_df, is.numeric)) # indices of the numeric columns
all_num_Var <- train_df[, num_Vars]
all_num_Var <- all_num_Var[-c(1)] # drop the Id column
cor_num_Var <- cor(all_num_Var, use="pairwise.complete.obs")
cor_sorted <- as.matrix(sort(cor_num_Var[,'SalePrice'], decreasing = TRUE))
Cor_High <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_num_Var <- cor_num_Var[Cor_High, Cor_High]
corrplot(cor_num_Var)
model <- lm(SalePrice ~ ., data = all_num_Var[Cor_High])
summary(model)
##
## Call:
## lm(formula = SalePrice ~ ., data = all_num_Var[Cor_High])
##
## Residuals:
## Min 1Q Median 3Q Max
## -489958 -19316 -1948 16020 290558
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.186e+06 1.291e+05 -9.187 < 2e-16 ***
## OverallQual 1.960e+04 1.190e+03 16.472 < 2e-16 ***
## GrLivArea 5.130e+01 4.233e+00 12.119 < 2e-16 ***
## GarageCars 1.042e+04 3.044e+03 3.422 0.000639 ***
## GarageArea 1.495e+01 1.031e+01 1.450 0.147384
## TotalBsmtSF 1.986e+01 4.295e+00 4.625 4.09e-06 ***
## X1stFlrSF 1.417e+01 4.930e+00 2.875 0.004097 **
## FullBath -6.791e+03 2.682e+03 -2.532 0.011457 *
## TotRmsAbvGrd 3.310e+01 1.119e+03 0.030 0.976404
## YearBuilt 2.682e+02 5.035e+01 5.328 1.15e-07 ***
## YearRemodAdd 2.965e+02 6.363e+01 4.659 3.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37920 on 1449 degrees of freedom
## Multiple R-squared: 0.7737, Adjusted R-squared: 0.7721
## F-statistic: 495.4 on 10 and 1449 DF, p-value: < 2.2e-16
We get an \(R^2\) of 0.7737 in our first model, which is pretty good. This means that roughly 77% of the variance in sale price can be explained by the predictor variables in this particular model. The model's overall p-value is also close to zero, which supports the idea that we've constructed a useful model.
Now let's remove some variables to see if we can improve the model. We can eliminate the predictors whose coefficients have p-values greater than 0.05, namely GarageArea and TotRmsAbvGrd:
model2 <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF + X1stFlrSF + FullBath + YearBuilt + YearRemodAdd, data = all_num_Var)
summary(model2)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## TotalBsmtSF + X1stFlrSF + FullBath + YearBuilt + YearRemodAdd,
## data = all_num_Var)
##
## Residuals:
## Min 1Q Median 3Q Max
## -482525 -19191 -1801 16208 289639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.188e+06 1.284e+05 -9.255 < 2e-16 ***
## OverallQual 1.959e+04 1.188e+03 16.486 < 2e-16 ***
## GrLivArea 5.177e+01 3.097e+00 16.714 < 2e-16 ***
## GarageCars 1.395e+04 1.817e+03 7.680 2.92e-14 ***
## TotalBsmtSF 2.039e+01 4.269e+00 4.775 1.98e-06 ***
## X1stFlrSF 1.465e+01 4.919e+00 2.979 0.00294 **
## FullBath -7.184e+03 2.644e+03 -2.717 0.00666 **
## YearBuilt 2.699e+02 5.017e+01 5.380 8.69e-08 ***
## YearRemodAdd 2.957e+02 6.362e+01 4.649 3.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37920 on 1451 degrees of freedom
## Multiple R-squared: 0.7734, Adjusted R-squared: 0.7721
## F-statistic: 618.9 on 8 and 1451 DF, p-value: < 2.2e-16
Our \(R^2\) has decreased slightly to 0.7734, which means we did not improve the overall fit; the adjusted \(R^2\) is unchanged at 0.7721. However, we have ensured that our model consists only of variables that are significant predictors of Sale Price. We will use this second model for our prediction.
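As an additional check (a sketch), a nested-model F-test can confirm whether dropping GarageArea and TotRmsAbvGrd significantly worsens the fit:
anova(model2, model) # compare the reduced model against the full 10-predictor model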
Let’s also perform some residual analysis to determine whether our model is appropriate:
plot(fitted(model2), resid(model2))
abline(h = 0, lty = 3)
hist(model2$residuals)
qqnorm(model2$residuals)
qqline(model2$residuals)
Although the bulk of the residuals follow the line in the normal Q-Q plot, there are significant outliers in the tails, evident in both the residual plot and the Q-Q plot, so the residuals are not nearly normal. The model we created may therefore not be fully appropriate for predicting Sale Price; for the purpose of this assignment we will proceed with the prediction anyway.
We will make the predictions using our second model:
test_df <- read.csv ('https://raw.githubusercontent.com/marioipena/FinalProjectDATA605/master/test.csv', header=TRUE, stringsAsFactors = FALSE)
predict_model2 <- test_df
predict_model2$salePrice <- predict(model2, test_df)
# Kaggle dataset
Id <- test_df$Id
salePrice <- predict_model2$salePrice
kaggle_modelDF <- data.frame(cbind(Id, salePrice))
kaggle_modelDF[is.na(kaggle_modelDF)] <- 0 # replace NA predictions (from missing predictor values in the test set) with 0
head(kaggle_modelDF)
## Id salePrice
## 1 1461 103366.9
## 2 1462 159127.7
## 3 1463 170144.3
## 4 1464 188636.7
## 5 1465 220046.7
## 6 1466 183033.5
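Finally, to produce the submission file, something like the following could be used (a sketch; the file name is hypothetical, and Kaggle expects the columns to be named Id and SalePrice):
colnames(kaggle_modelDF) <- c("Id", "SalePrice") # capitalize to match the required header
write.csv(kaggle_modelDF, "kaggle_submission.csv", row.names = FALSE) # hypothetical file name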