library(tidyverse)
library(DT)
library(dplyr)
library(ggplot2)
library(matrixcalc)
library(Rmisc)

Problem 1.

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu=\sigma=\frac{N+1}{2}\).

Given data

#10000 random numbers
n <- 10000

#N can be any number of your choosing greater than or equal to 6
N <- 7

mu <- sig <- (N+1)/2

Random variable X - uniform

#setting the seed for creating reproducible results
set.seed(1234)

X <- runif(n, min = 1, max = N)
hist(X)

Random variable Y

#setting the seed for creating reproducible results
set.seed(1234)
Y <- rnorm(n, mean=mu, sd=sig)
hist(Y)

Create a dataframe

df <- data.frame(X,Y)

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

x<-quantile(df$X, probs = 0.50)
x
##      50% 
## 4.007595
y<-quantile(df$Y, probs = 0.25)
y
##      25% 
## 1.356831

a) P(X>x | X>y)

\(P(A|B) = \frac{P(A,B)}{P(B)}\)

\(P(X>x | X>y) = \frac{P(X>x,X>y)}{P(X>y)}\)

P_Xx_Xy <- df %>% filter(X > x, X > y) %>% nrow()/n
P_Xy <- df %>% filter(X > y) %>% nrow()/n

P_a <- P_Xx_Xy/P_Xy

P_a
## [1] 0.5315756

Interpretation: given that X exceeds y (the first quartile of Y), the probability that X also exceeds its own median x is about 0.53. Since y < x here, the event X > x is contained in X > y, so this is simply P(X>x)/P(X>y), roughly 0.5/0.94.

b) P(X>x, Y>y)

This is the joint probability that X is greater than x (the median of X) and Y is greater than y (the first quartile of Y).

P_Xx_Yy <- df %>% filter(X > x, Y > y) %>% nrow()/n
P_Xx_Yy
## [1] 0.3762

c) P(X<x | X>y)

\(P(X<x | X>y) = \frac{P(X<x,X>y)}{P(X>y)}\)

P_Xx_Xy <- df %>% filter(X<x, X>y) %>% nrow()/n
P_X_y <- df %>% filter(X>y) %>% nrow()/n

P_c <- P_Xx_Xy/P_X_y
P_c
## [1] 0.4684244

Interpretation: given that X exceeds y, the probability that X falls below its median is about 0.47, the complement of part (a).

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Joint Probability
jointProb <- df %>% mutate(A = ifelse(X > x, " X > x", " X < x"), B = ifelse(Y > y, 
    " Y > y", " Y < y")) %>% group_by(A, B) %>% dplyr::summarise(count = n()) %>% mutate(probability = count/n)
datatable(jointProb)
Marginal probability
marProb_A <- jointProb %>% ungroup() %>% group_by(A) %>% dplyr::summarize(count = sum(count), probability = sum(probability)) 
datatable(marProb_A)
marProb_B <- jointProb %>% ungroup() %>% group_by(B) %>% dplyr::summarize(count = sum(count), probability = sum(probability)) 
datatable(marProb_B)
res <- jointProb %>% select(-count) %>% spread(A, probability) %>% dplyr::rename(` ` = B)
datatable(res)
Evaluate

From the joint and marginal probability tables above, P(X>x and Y>y) = 0.3762, P(X>x) = 0.1238 + 0.3762 = 0.5, and P(Y>y) = 0.3738 + 0.3762 = 0.75.

So P(X>x)P(Y>y) = 0.5 × 0.75 = 0.375, while P(X>x and Y>y) = 0.3762. The difference is only about 0.001, so the data are consistent with P(X>x and Y>y) = P(X>x)P(Y>y), i.e. with independence.
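
The same comparison can also be made directly from the simulated data, without building the table (a quick sketch using the df, x and y defined above):

#direct numeric check of P(X>x, Y>y) against P(X>x) * P(Y>y)
p_joint <- mean(df$X > x & df$Y > y)
p_prod  <- mean(df$X > x) * mean(df$Y > y)
c(joint = p_joint, product = p_prod, difference = p_joint - p_prod)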

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
#we will create a table out of the above dataframe df
dfTest <- table(df$X > x, df$Y>y)
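
The resulting 2×2 contingency table can be displayed with its margins (a quick sketch):

#display the contingency table with row and column totals
addmargins(dfTest)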

We define the hypotheses below:

\(H_0: X>x \text{ and } Y>y \text{ are independent events}\)

\(H_1: X>x \text{ and } Y>y \text{ are dependent events}\)

Fisher’s Test
fisher.test(dfTest)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  dfTest
## p-value = 0.5953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9361461 1.1243231
## sample estimates:
## odds ratio 
##   1.025936

From Fisher's test we see that the p-value (0.5953) is higher than 0.05, so we fail to reject the null hypothesis; there is no evidence that the events are dependent, which is consistent with independence.

chisq.test(dfTest)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  dfTest
## X-squared = 0.28213, df = 1, p-value = 0.5953

The Chi-square test gives the same p-value (0.5953), so again we fail to reject the null hypothesis; this is also consistent with the events being independent.

Fisher's exact test is typically used when one or more expected cell counts in the cross-tabulation are less than 5; the Chi-square test relies on a large-sample approximation and is appropriate when the expected counts are large, which is the case here with 10,000 observations. The Chi-square test is therefore the more appropriate choice, although both tests reach the same conclusion.
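
One way to confirm that the large-sample condition holds is to inspect the expected cell counts under independence, which chisq.test() stores in its return value (a quick sketch):

#expected cell counts under the independence hypothesis; all should be far above 5
chisq.test(dfTest)$expected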

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Loading of the data

I uploaded the training and testing data from Kaggle to my GitHub account and load them from there as shown below.

training_set <- read.csv('https://raw.githubusercontent.com/petferns/605-final/main/train.csv?token=AJEOBJGU45H6WWBZIQBP4T3AVPFEC')



testing_set <- read.csv('https://raw.githubusercontent.com/petferns/605-final/main/test.csv?token=AJEOBJEX3TBZ5AZMSQWLSDTAVPFJO')

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

I chose LotArea and OverallQual as the two independent variables and SalePrice as the dependent variable.

The larger the lot area and the better the overall quality, the higher the sale price tends to be, so SalePrice plausibly depends on both LotArea and OverallQual.

df <- data.frame(training_set$LotArea, training_set$OverallQual,training_set$SalePrice)
datatable(df)
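
The prompt also asks for univariate descriptive statistics and a scatterplot matrix; a minimal sketch using base R on the three selected variables:

#univariate summaries (min, quartiles, mean, max) for the three variables
summary(df)

#scatterplot matrix of LotArea, OverallQual and SalePrice
pairs(df)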

Scatterplot for LotArea and SalePrice

ggplot(training_set, aes(x = LotArea, y = SalePrice)) + geom_point() + theme(axis.text.x = element_text(angle = 60, 
    hjust = 1))

Scatterplot for OverallQual and SalePrice

ggplot(training_set, aes(x = OverallQual, y = SalePrice)) + geom_point() + theme(axis.text.x = element_text(angle = 60, 
    hjust = 1))

Correlation matrix

cor_mat = training_set %>% select(LotArea, OverallQual, SalePrice) %>% cor() %>% 
    as.matrix()
cor_mat
##               LotArea OverallQual SalePrice
## LotArea     1.0000000   0.1058057 0.2638434
## OverallQual 0.1058057   1.0000000 0.7909816
## SalePrice   0.2638434   0.7909816 1.0000000

Testing of the hypotheses

Correlation between LotArea and SalePrice
cor.test(training_set$LotArea, training_set$SalePrice, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  training_set$LotArea and training_set$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434
Correlation between OverallQual and SalePrice
cor.test(training_set$OverallQual, training_set$SalePrice, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  training_set$OverallQual and training_set$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816

We see from the above that there is a strong correlation between SalePrice and OverallQual (about 0.79) and a weak correlation between SalePrice and LotArea (about 0.26).

Familywise error

The familywise error rate is the probability of making at least one false rejection (Type I error) across a series of hypothesis tests. I increased the confidence level to 90% and did not see any meaningful change in the intervals, and the p-values are far smaller than any reasonable adjusted threshold, so I would not be worried about familywise error for these few tests.

Correlation between LotArea and SalePrice at 90% confidence
cor.test(training_set$LotArea, training_set$SalePrice, conf.level = 0.9)
## 
##  Pearson's product-moment correlation
## 
## data:  training_set$LotArea and training_set$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
##  0.2233154 0.3034607
## sample estimates:
##       cor 
## 0.2638434
Correlation between OverallQual and SalePrice at 90% confidence
cor.test(training_set$OverallQual, training_set$SalePrice, conf.level = 0.9)
## 
##  Pearson's product-moment correlation
## 
## data:  training_set$OverallQual and training_set$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 90 percent confidence interval:
##  0.7742916 0.8065720
## sample estimates:
##       cor 
## 0.7909816
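
As an additional check on familywise error, one could Bonferroni-adjust the p-values of the three pairwise correlation tests; they remain far below 0.05, so the individual conclusions are unchanged (a minimal sketch):

#Bonferroni-adjusted p-values for the three pairwise correlation tests
vars <- c("LotArea", "OverallQual", "SalePrice")
pair_list <- combn(vars, 2, simplify = FALSE)
p_vals <- sapply(pair_list, function(v)
  cor.test(training_set[[v[1]]], training_set[[v[2]]])$p.value)
p.adjust(p_vals, method = "bonferroni")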

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Precision matrix

prec_mat <- solve(cor_mat)
prec_mat
##                LotArea OverallQual  SalePrice
## LotArea      1.1085153   0.3046752 -0.5334669
## OverallQual  0.3046752   2.7550503 -2.2595806
## SalePrice   -0.5334669  -2.2595806  2.9280384

Multiply the correlation matrix by the precision matrix

cor_pre <- round(cor_mat %*% prec_mat)

Then multiply the precision matrix by the correlation matrix

pre_cor <- round(prec_mat %*%  cor_mat)
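
Both products should be the 3×3 identity matrix (up to rounding), since the precision matrix is by definition the inverse of the correlation matrix; this can be verified on the unrounded products:

#verify that the unrounded products are numerically equal to the identity matrix
all.equal(cor_mat %*% prec_mat, diag(3), check.attributes = FALSE)
all.equal(prec_mat %*% cor_mat, diag(3), check.attributes = FALSE)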

LU decomposition

lu.decomposition(cor_pre)
## $L
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
## 
## $U
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
lu.decomposition(pre_cor)
## $L
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
## 
## $U
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
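
Because the rounded products are identity matrices, their LU factors are trivially the identity as well. Applying the decomposition to the correlation matrix itself (a sketch; arguably closer to what the prompt intends) gives non-trivial triangular factors whose product recovers the matrix:

#LU decomposition of the correlation matrix; luA$L %*% luA$U reproduces cor_mat
luA <- lu.decomposition(cor_mat)
luA
all.equal(luA$L %*% luA$U, cor_mat, check.attributes = FALSE)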

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

I chose the variable X1stFlrSF; let's verify that it is skewed to the right and that its minimum value is above zero.

Skewed right

hist(training_set$X1stFlrSF,breaks = 50)

Above 0

min(training_set$X1stFlrSF)
## [1] 334

Exponential probability density function - EPDF

library(MASS)
epdf <- fitdistr(training_set$X1stFlrSF, "exponential")
epdf
##        rate    
##   0.0008601213 
##  (0.0000225104)

Optimal value for λ and distribution

lambda <- epdf$estimate

#setting the seed for creating reproducible results
set.seed(1234)

#1000 samples from the fitted exponential distribution
expo <- rexp(1000, lambda)

Histogram

par(mfrow = c(1, 2))
hist(expo, breaks = 50, xlim = c(0, 6000), main = "Exponential"
   ) 
hist(training_set$X1stFlrSF, breaks = 50, main = "Original")

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

qexp(0.05, rate = lambda)
## [1] 59.63495
qexp(0.95, rate = lambda)
## [1] 3482.918
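
qexp() is the inverse of the exponential CDF \(F(q) = 1 - e^{-\lambda q}\); solving \(F(q) = p\) for q gives \(q_p = -\ln(1-p)/\lambda\), which can be checked directly against the values above:

#manual inverse-CDF calculation; should match qexp() above
-log(1 - 0.05) / lambda
-log(1 - 0.95) / lambda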

Generate a 95% confidence interval from the empirical data, assuming normality.

CI(na.exclude(training_set$X1stFlrSF), ci = 0.95)
##    upper     mean    lower 
## 1182.473 1162.627 1142.780

Provide the empirical 5th percentile and 95th percentile of the data

quantile(training_set$X1stFlrSF, 0.05)
##     5% 
## 672.95
quantile(training_set$X1stFlrSF, 0.95)
##     95% 
## 1831.25
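
Discussion: the exponential fit is a poor description of X1stFlrSF. Its 5th percentile (about 60) is far below the empirical 5th percentile (about 673), and its 95th percentile (about 3483) is far above the empirical 95th percentile (about 1831), because the exponential density puts its greatest mass near zero while the actual first-floor areas are centered around 1,160 square feet (see the confidence interval for the mean above). A side-by-side comparison (a sketch reusing the values computed above):

#fitted-exponential percentiles vs empirical percentiles of X1stFlrSF
rbind(
  exponential = qexp(c(0.05, 0.95), rate = lambda),
  empirical   = quantile(training_set$X1stFlrSF, c(0.05, 0.95))
)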

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

I started with OverallQual and LotArea as predictors and SalePrice as the response, and fit a linear regression, which gives a multiple R-squared of 0.6585.

Linear and multiple regression

lm <-lm(SalePrice ~ OverallQual + LotArea, data = training_set)

summary(lm)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + LotArea, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -271225  -26819   -1459   20172  385190 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.047e+05  5.547e+03  -18.88   <2e-16 ***
## OverallQual  4.433e+04  8.844e+02   50.12   <2e-16 ***
## LotArea      1.450e+00  1.225e-01   11.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46460 on 1457 degrees of freedom
## Multiple R-squared:  0.6585, Adjusted R-squared:  0.658 
## F-statistic:  1405 on 2 and 1457 DF,  p-value: < 2.2e-16

I then added a few more variables (KitchenQual, Neighborhood and GarageArea) and refit the model, which improved the multiple R-squared from 0.6585 to 0.7802. Adding further variables did not improve the fit significantly, so I stopped at multiple R-squared 0.7802 and adjusted R-squared 0.7756.

rlm <- lm(SalePrice ~ OverallQual + LotArea +KitchenQual+ factor(Neighborhood)+ GarageArea , data = training_set, na.action=na.omit)

summary(rlm)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + LotArea + KitchenQual + 
##     factor(Neighborhood) + GarageArea, data = training_set, na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -273065  -19355   -1546   15308  304198 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3.870e+04  1.382e+04   2.801  0.00517 ** 
## OverallQual                  2.392e+04  1.177e+03  20.325  < 2e-16 ***
## LotArea                      1.009e+00  1.102e-01   9.158  < 2e-16 ***
## KitchenQualFa               -6.446e+04  8.380e+03  -7.691 2.70e-14 ***
## KitchenQualGd               -5.508e+04  4.649e+03 -11.850  < 2e-16 ***
## KitchenQualTA               -6.674e+04  5.260e+03 -12.689  < 2e-16 ***
## factor(Neighborhood)Blueste -9.547e+03  2.826e+04  -0.338  0.73551    
## factor(Neighborhood)BrDale  -2.465e+04  1.337e+04  -1.844  0.06543 .  
## factor(Neighborhood)BrkSide  1.803e+03  1.078e+04   0.167  0.86717    
## factor(Neighborhood)ClearCr  2.806e+04  1.208e+04   2.322  0.02037 *  
## factor(Neighborhood)CollgCr  9.232e+03  9.686e+03   0.953  0.34069    
## factor(Neighborhood)Crawfor  4.180e+04  1.068e+04   3.913 9.53e-05 ***
## factor(Neighborhood)Edwards -2.170e+03  1.026e+04  -0.211  0.83260    
## factor(Neighborhood)Gilbert  1.559e+04  1.017e+04   1.534  0.12531    
## factor(Neighborhood)IDOTRR  -1.644e+04  1.145e+04  -1.436  0.15117    
## factor(Neighborhood)MeadowV -4.266e+03  1.333e+04  -0.320  0.74895    
## factor(Neighborhood)Mitchel  4.880e+03  1.089e+04   0.448  0.65406    
## factor(Neighborhood)NAmes    4.137e+03  9.811e+03   0.422  0.67332    
## factor(Neighborhood)NoRidge  9.669e+04  1.098e+04   8.803  < 2e-16 ***
## factor(Neighborhood)NPkVill -6.191e+03  1.569e+04  -0.395  0.69320    
## factor(Neighborhood)NridgHt  4.736e+04  1.037e+04   4.569 5.33e-06 ***
## factor(Neighborhood)NWAmes   1.317e+04  1.033e+04   1.275  0.20259    
## factor(Neighborhood)OldTown -1.136e+04  1.009e+04  -1.126  0.26043    
## factor(Neighborhood)Sawyer   6.377e+03  1.052e+04   0.606  0.54444    
## factor(Neighborhood)SawyerW  1.258e+04  1.044e+04   1.205  0.22832    
## factor(Neighborhood)Somerst  1.118e+04  1.003e+04   1.115  0.26489    
## factor(Neighborhood)StoneBr  5.974e+04  1.195e+04   4.998 6.51e-07 ***
## factor(Neighborhood)SWISU    1.015e+04  1.211e+04   0.838  0.40227    
## factor(Neighborhood)Timber   1.865e+04  1.123e+04   1.660  0.09711 .  
## factor(Neighborhood)Veenker  3.690e+04  1.467e+04   2.515  0.01202 *  
## GarageArea                   6.550e+01  6.037e+00  10.850  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37630 on 1429 degrees of freedom
## Multiple R-squared:  0.7802, Adjusted R-squared:  0.7756 
## F-statistic: 169.1 on 30 and 1429 DF,  p-value: < 2.2e-16

Residual Analysis

From the plot below we see that the residuals are not randomly scattered: they are dense near zero and fan out at larger fitted values, with an almost linear edge, suggesting non-constant variance.

plot(fitted(rlm), resid(rlm))

The normal Q-Q plot shows that the residuals fall on the theoretical normal line only in the center of the distribution and deviate from it in the tails.

qqnorm(resid(rlm))
qqline(resid(rlm))

I generated predictions for the test data with this model and submitted them on Kaggle. My Kaggle username is petferns.

Kaggle entry

#convert character columns in the test set to factors so predict() can handle them
testing_set <- mutate_if(testing_set, is.character, factor)

#build the submission; rows with missing predictor values yield NA predictions,
#which write_csv outputs as 0 via the na argument
prediction <- tibble(
  Id = testing_set$Id, SalePrice = predict(rlm, newdata = testing_set)
  )
prediction %>% write_csv("PFernandes_kaggle.csv", na = "0")
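
Note that if any predictor values are missing in the test set, predict() returns NA for those rows, and the na = "0" option writes them out as 0 in the submission file. A quick sanity check on how many rows are affected (a sketch):

#count test-set predictions that came back NA because of missing predictor values
sum(is.na(prediction$SalePrice))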