DATA605 - Final Exam

Author: Romerl Elizes

library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(matrixcalc)
library(MASS)
library(spatstat)
library(pander)

Summary

All requirements for the Final Project/Exam have been completed to the best of my knowledge and all done by me.
The presentation video for this Final Project/Exam can be found here: https://youtu.be/bw8zkzyiPtA

Caveats

The random draws are not seeded, so the sampled values change on each run of this program. That means every statistic derived from them (mean, standard deviation, confidence intervals, etc.) will change as well.
Objectives were completed based on my current and somewhat limited knowledge of R and Computational Mathematics. Because of the deadline, I relied heavily on built-in R functions to complete the tasks at hand.

I. Problem 1.

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation \(\mu = \sigma = (N+1)/2\).

numiterations <- 10000
x <- sample(1:10,numiterations, replace=TRUE)
y <- sample(1:1000,numiterations, replace=TRUE)
medianx <- median(x)
quantiley <- quantile(y)
medianx

## [1] 6

quantiley

##   0%  25%  50%  75% 100% 
##    1  244  498  746 1000

firstquartiley <- quantiley[2]
numtimesxgtmedian <- 0
numtimesygtfirstquartiley <- 0
probxgtmedian <- 0
probygtfirstquartiley <- 0
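The code above draws both X and Y with sample(), which gives discrete uniform integers. For comparison, here is a minimal sketch of the distributions as the problem statement describes them. The seed, the choice N = 10, and the variable names are assumptions made for illustration only.

```r
# Sketch only: the seed, N = 10, and all names below are illustrative assumptions.
set.seed(605)
n_draws <- 10000
N <- 10                                   # any N >= 6 is allowed by the problem

# X: continuous uniform on [1, N]
X_unif <- runif(n_draws, min = 1, max = N)

# Y: normal with mean and standard deviation both equal to (N + 1) / 2
mu <- (N + 1) / 2
Y_norm <- rnorm(n_draws, mean = mu, sd = mu)

# Small x is the median of X, small y is the first quartile of Y
x_med <- median(X_unif)
y_q1 <- unname(quantile(Y_norm, 0.25))
c(x = x_med, y = y_q1)
```

With this setup Y can take negative values, which is expected when the standard deviation equals the mean.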

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

(5 points)

a. P(X>x | X>y)

numtimes <- 0
for (i in 1:numiterations) {
  X <- x[i]
  Y <- y[i]
  if (X > medianx || Y > firstquartiley)
    numtimes = numtimes + 1
}
curprob <- numtimes / numiterations
print(curprob)

## [1] 0.8545

Approximate Probability: approximately 0.85

b. P(X>x, Y>y)

numtimes <- 0
for (i in 1:numiterations) {
  X <- x[i]
  Y <- y[i]
  if (X > medianx && Y > firstquartiley)
    numtimes = numtimes + 1
}
curprob <- numtimes / numiterations
print(curprob)

## [1] 0.298

Approximate Probability: approximately 0.30

c. P(X<x | X>y)

numtimes <- 0
for (i in 1:numiterations) {
  X <- x[i]
  Y <- y[i]
  if (X < medianx || Y > firstquartiley)
    numtimes = numtimes + 1
}
curprob <- numtimes / numiterations
print(curprob)

## [1] 0.8704

Approximate Probability: approximately 0.87
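In parts a and c the bar in P(A | B) denotes conditioning, whereas the loops above count with `||`, which gives the probability of the union "A or B". A minimal sketch of the conditional reading follows, taking the events literally as written (conditioning on X > y). The data are regenerated under assumed settings (seed, N = 10), so the numbers will not match the output above.

```r
# Sketch only: seed and N = 10 are assumptions; events follow the problem text literally.
set.seed(605)
X <- runif(10000, min = 1, max = 10)
Y <- rnorm(10000, mean = 5.5, sd = 5.5)
x <- median(X)                        # small x: median of X
y <- unname(quantile(Y, 0.25))        # small y: first quartile of Y

cond <- X > y                         # conditioning event X > y

p_a <- mean(X[cond] > x)              # a. P(X > x | X > y)
p_b <- mean(X > x & Y > y)            # b. P(X > x, Y > y), a joint probability
p_c <- mean(X[cond] < x)              # c. P(X < x | X > y)

round(c(a = p_a, b = p_b, c = p_c), 4)
```

Restricting to the rows where the condition holds and then averaging the indicator is the same as dividing the joint count by the count of the conditioning event. Since X is continuous here, the probabilities in a and c sum to 1.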

(5 points) Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

for (i in 1:numiterations) {
  X <- x[i]
  Y <- y[i]
  if (X > medianx)
    numtimesxgtmedian = numtimesxgtmedian + 1
  if (Y > firstquartiley)
    numtimesygtfirstquartiley = numtimesygtfirstquartiley + 1
}
probxgtmedian <- numtimesxgtmedian / numiterations
probygtfirstquartiley <- numtimesygtfirstquartiley / numiterations
probxlemedian <- 1 - probxgtmedian
probylefirstquariley <- 1 - probygtfirstquartiley
print(probxgtmedian)

## [1] 0.403

print(probygtfirstquartiley)

## [1] 0.7495

ScenarioC = c("X and Median", "Y and 1st Quartile", "TOTAL")
GreaterThanC = c(probxgtmedian,probygtfirstquartiley, probxgtmedian+probygtfirstquartiley)
LessThanEqualC = c(probxlemedian,probylefirstquariley, probxlemedian+probylefirstquariley)
TotalC = c(GreaterThanC[1]+LessThanEqualC[1],GreaterThanC[2]+LessThanEqualC[2],GreaterThanC[3]+LessThanEqualC[3])
DT <- data.table(Scenario = ScenarioC, GreaterThan = GreaterThanC, LessThanEqual = LessThanEqualC, TOTAL=TotalC)
print(DT)

##              Scenario GreaterThan LessThanEqual TOTAL
## 1:       X and Median      0.4030        0.5970     1
## 2: Y and 1st Quartile      0.7495        0.2505     1
## 3:              TOTAL      1.1525        0.8475     2

I am not satisfied with the table above. It lists only the marginal probabilities for X and for Y on separate rows, so its grand total is 2 rather than the 1 expected of a joint probability table. A proper joint and marginal table would cross the events X>x and Y>y, so that the four cell probabilities sum to 1 and the row and column totals give the marginals. The question can still be checked numerically: P(X>x) * P(Y>y) = 0.403 * 0.7495, or about 0.302, which is close to the joint probability P(X>x, Y>y) = 0.298 found in part b. This is consistent with X and Y being independent.
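One way to build a table whose cells are joint probabilities and whose totals are the marginals is to cross-tabulate the two indicator events. The following is a minimal sketch under assumed settings (seed, N = 10, regenerated data), not the table produced above.

```r
# Sketch only: data are regenerated with an assumed seed and N = 10.
set.seed(605)
X <- runif(10000, min = 1, max = 10)
Y <- rnorm(10000, mean = 5.5, sd = 5.5)
x <- median(X)
y <- unname(quantile(Y, 0.25))

# 2 x 2 table of joint proportions; the four cells sum to 1
joint <- prop.table(table(X_gt_x = X > x, Y_gt_y = Y > y))

# Row and column sums are the marginal probabilities
with_margins <- addmargins(joint)
round(with_margins, 4)

# Compare the joint probability with the product of the marginals
p_joint <- joint["TRUE", "TRUE"]
p_product <- with_margins["TRUE", "Sum"] * with_margins["Sum", "TRUE"]
c(joint = p_joint, product = p_product)
```

If the two events are independent, the joint cell and the product of its marginals should agree up to sampling noise.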

(5 points) Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Conducting Fisher’s Exact Test

testmatrix = cbind(x,y)
#testmatrix
fisher.test(testmatrix, simulate.p.value=TRUE)

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based
##  on 2000 replicates)
## 
## data:  testmatrix
## p-value = 0.0004998
## alternative hypothesis: two.sided

As the p-value of about 0.0005 is well below the .05 significance level, we have enough evidence to reject the null hypothesis that the X column is independent of the Y column.

Conducting Chi-Square Test

tbl = table(x,y)
chisq.test(tbl)

## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 8834.3, df = 8991, p-value = 0.8791

As the p-value 0.8791 is greater than the .05 significance level, we do not reject the null hypothesis that the X column is independent of the Y column.

Analysis

It appears that Fisher’s Exact Test is less appropriate than the Chi Square Test here. The two p-values differ sharply: Fisher’s Exact Test gives evidence to reject the null hypothesis of independence, while the Chi Square Test does not. Both tests operate on a contingency table of counts for categorical variables. Fisher’s Exact Test computes an exact p-value and is intended for small samples (expected cell counts less than 5), while the Chi Square Test relies on a large-sample approximation. The number of values in this experiment is 10,000, so the Chi Square Test is the more appropriate choice. Two cautions apply. First, fisher.test was given the raw 10,000 x 2 matrix of values, which it interprets as a table of counts rather than as paired observations, so its p-value is not directly comparable. Second, the warning from chisq.test reflects the very small expected counts in the 10 x 1000 table, so its p-value should also be read with care.
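Both tests are easier to compare when they are run on the same small contingency table. Below is a minimal sketch that applies them to the 2 x 2 table of the events X > x and Y > y. The data are regenerated with an assumed seed and N = 10, so the p-values are illustrative and are not the ones reported above.

```r
# Sketch only: regenerated data, assumed seed and N = 10.
set.seed(605)
X <- runif(10000, min = 1, max = 10)
Y <- rnorm(10000, mean = 5.5, sd = 5.5)

# Counts of observations in each combination of the two events
counts <- table(X_gt_x = X > median(X), Y_gt_y = Y > quantile(Y, 0.25))
counts

fisher_res <- fisher.test(counts)   # exact test on the 2 x 2 counts
chisq_res <- chisq.test(counts)     # large-sample approximation (Yates-corrected for 2 x 2)

c(fisher = fisher_res$p.value, chisq = chisq_res$p.value)

# Expected counts far above 5 mean the chi-square approximation is reasonable
min(chisq_res$expected)
```

On a table like this, with large expected counts in every cell, the two tests would normally lead to the same conclusion.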

II. Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

NOTE: I will focus on the relationship between Sale Price and the square footage of the finished basement, 1st floor, and 2nd floor of homes. The analysis is restricted to homes that have both a finished basement and a second floor, which leaves 359 rows of the training data.

trainingDF <- read.csv(file="training.csv",header=TRUE,sep=",")
trainingDF <- subset(trainingDF,trainingDF$BsmtFinSF1 > 0)
trainingDF <- subset(trainingDF,trainingDF$X2ndFlrSF > 0)
#trainingDF <- subset(trainingDF,trainingDF$X1stFlrSF > 0)
nrow(trainingDF)

## [1] 359

#head(trainingDF)

(5 points) Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Provide univariate descriptive statistics and appropriate plots for the training data set

For the calculations, I will first provide univariate statistics for each of the variables in this study: Sale Price, Basement, First Floor, and Second Floor square footage. Next, I will provide inferential statistics by comparing the Sale Price to each of the independent variables: Basement, First Floor, and Second Floor square footage. For each pair I will draw a scatter plot and fit a linear regression to the data.

Univariate Descriptive Statistics for Individual Variables

# ref: [UNI]
trainingDF %>%
  summarize(variable = "SalePrice", 
            mean = mean(SalePrice), 
            sd = sd(SalePrice), 
            q0.25 = quantile(SalePrice, 0.25),
            q0.75 = quantile(SalePrice, 0.75)) %>%
  pander()

variable	mean	sd	q0.25	q0.75
SalePrice	208285	89807	150250	239593

trainingDF %>%
  summarize(variable = "BsmtFinSF1", 
            mean = mean(BsmtFinSF1), 
            sd = sd(BsmtFinSF1), 
            q0.25 = quantile(BsmtFinSF1, 0.25),
            q0.75 = quantile(BsmtFinSF1, 0.75)) %>%
  pander()

variable	mean	sd	q0.25	q0.75
BsmtFinSF1	582.8	426.2	341	730

trainingDF %>%
  summarize(variable = "X2ndFlrSF", 
            mean = mean(X2ndFlrSF), 
            sd = sd(X2ndFlrSF), 
            q0.25 = quantile(X2ndFlrSF, 0.25),
            q0.75 = quantile(X2ndFlrSF, 0.75)) %>%
  pander()

variable	mean	sd	q0.25	q0.75
X2ndFlrSF	826.9	280	660	973

trainingDF %>%
  summarize(variable = "X1stFlrSF", 
            mean = mean(X1stFlrSF), 
            sd = sd(X1stFlrSF), 
            q0.25 = quantile(X1stFlrSF, 0.25),
            q0.75 = quantile(X1stFlrSF, 0.75)) %>%
  pander()

variable	mean	sd	q0.25	q0.75
X1stFlrSF	1062	390.8	812.5	1189

Inferential Statistics for Individual Variables

plot( trainingDF$BsmtFinSF1, trainingDF$SalePrice, main="Finished Basement SF vs. Sales Price", ylab="Sales Price", xlab="Finished Basement SF")
lmresult1 <- lm(SalePrice ~BsmtFinSF1, data = trainingDF)
abline(lmresult1)

summary(lmresult1)

## 
## Call:
## lm(formula = SalePrice ~ BsmtFinSF1, data = trainingDF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -509884  -45473  -11540   28776  467168 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 155131.03    7254.96  21.383   <2e-16 ***
## BsmtFinSF1      91.20      10.05   9.072   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81070 on 357 degrees of freedom
## Multiple R-squared:  0.1874, Adjusted R-squared:  0.1851 
## F-statistic: 82.31 on 1 and 357 DF,  p-value: < 2.2e-16

Based on the plot, Finished Basement Square Footage beyond about 1000 square feet does not appear to raise the sale price of the house much. The outliers suggest that the sale price does not keep increasing even as Finished Basement Square Footage approaches 1500 square feet. The linear regression model provides enough evidence to reject the null hypothesis of a zero slope (p < 2e-16), indicating that a finished basement may influence the sale price, although it explains only about 19% of the variance (R-squared = 0.1874).

plot( trainingDF$X2ndFlrSF, trainingDF$SalePrice, main="2nd Floor SF vs. Sales Price", ylab="Sales Price", xlab="2nd Floor SF")
lmresult1 <- lm(SalePrice ~ X2ndFlrSF, data = trainingDF)
abline(lmresult1)

summary(lmresult1)

## 
## Call:
## lm(formula = SalePrice ~ X2ndFlrSF, data = trainingDF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -174811  -35526  -11596   18655  337857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32358.5    11089.4   2.918  0.00375 ** 
## X2ndFlrSF      212.8       12.7  16.747  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67300 on 357 degrees of freedom
## Multiple R-squared:   0.44,  Adjusted R-squared:  0.4384 
## F-statistic: 280.5 on 1 and 357 DF,  p-value: < 2.2e-16

Based on the plot, 2nd Floor Square Footage has a roughly linear relationship with the sale price of the house. There is a cluster of homes with second floors between 500 and 1000 square feet. It appears that the majority of home buyers are satisfied with at most 1500 square feet of 2nd floor space. The linear regression model provides enough evidence to reject the null hypothesis of a zero slope (p < 2e-16), indicating that a 2nd floor may influence the sale price (R-squared = 0.44).

plot( trainingDF$X1stFlrSF, trainingDF$SalePrice, main="1st Floor SF vs. Sales Price", ylab="Sales Price", xlab="1st Floor SF")
lmresult1 <- lm(SalePrice ~ X1stFlrSF, data = trainingDF)
abline(lmresult1)

summary(lmresult1)

## 
## Call:
## lm(formula = SalePrice ~ X1stFlrSF, data = trainingDF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -541112  -38424     979   30222  359110 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 64055.895  11105.808   5.768 1.74e-08 ***
## X1stFlrSF     135.775      9.814  13.835  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 72560 on 357 degrees of freedom
## Multiple R-squared:  0.349,  Adjusted R-squared:  0.3472 
## F-statistic: 191.4 on 1 and 357 DF,  p-value: < 2.2e-16

Based on the plot, 1st Floor Square Footage has a roughly linear relationship with the sale price of the house up to about 1800 square feet. Beyond that it does not seem to matter much, despite the presence of outliers. There is a cluster of homes with first floors between 500 and 1500 square feet. It appears that the majority of home buyers are satisfied with at most 1500 square feet of 1st floor space. The linear regression model provides enough evidence to reject the null hypothesis of a zero slope (p < 2e-16), indicating that the 1st floor may influence the sale price (R-squared = 0.349).

Scatterplot Matrix for at least two of the independent variables and the dependent variable

#ref: [CRE]
salePrice <- trainingDF$SalePrice
finBsmtSF <- trainingDF$BsmtFinSF1
Flr1SF <- trainingDF$X1stFlrSF
Flr2SF <- trainingDF$X2ndFlrSF
data2A2 = data.frame(salePrice,finBsmtSF,Flr1SF,Flr2SF)
pairs(data2A2)

Derive a Correlation Matrix for any three quantitative variables in the dataset.

#ref: [COR1]
res <- cor(data2A2)
round(res,2)

##           salePrice finBsmtSF Flr1SF Flr2SF
## salePrice      1.00      0.43   0.59   0.66
## finBsmtSF      0.43      1.00   0.69   0.28
## Flr1SF         0.59      0.69   1.00   0.45
## Flr2SF         0.66      0.28   0.45   1.00

corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis.

#ref: [COR2]
res2 <- cor.test(salePrice,finBsmtSF, method="pearson", conf.level=0.8)
res2

## 
##  Pearson's product-moment correlation
## 
## data:  salePrice and finBsmtSF
## t = 9.0724, df = 357, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.3760735 0.4863915
## sample estimates:
##       cor 
## 0.4328517

The p-value of the test is less than \(2.2 \times 10^{-16}\), which is below the significance level alpha = 0.20. We can conclude that Sale Price and Finished Basement Square Footage are significantly correlated, with a correlation coefficient of 0.43. The t-statistic of 9.07 is far from 0, meaning the estimated correlation lies about nine standard errors from zero, which is consistent with the very low p-value. The 80% confidence interval, 0.376 to 0.486, does not contain 0.

#ref: [COR2]
res3 <- cor.test(salePrice,Flr1SF, method="pearson", conf.level=0.8)
res3

## 
##  Pearson's product-moment correlation
## 
## data:  salePrice and Flr1SF
## t = 13.835, df = 357, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5448047 0.6332404
## sample estimates:
##       cor 
## 0.5907942

The p-value of the test is less than \(2.2 \times 10^{-16}\), which is below the significance level alpha = 0.20. We can conclude that Sale Price and 1st Floor Square Footage are significantly correlated, with a correlation coefficient of 0.59. The t-statistic of 13.84 is far from 0, meaning the estimated correlation lies almost fourteen standard errors from zero, which is consistent with the very low p-value. The 80% confidence interval, 0.545 to 0.633, does not contain 0.

#ref: [COR2]
res4 <- cor.test(salePrice,Flr2SF, method="pearson", conf.level=0.8)
res4

## 
##  Pearson's product-moment correlation
## 
## data:  salePrice and Flr2SF
## t = 16.747, df = 357, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6235253 0.6996407
## sample estimates:
##      cor 
## 0.663295

The p-value of the test is less than \(2.2 \times 10^{-16}\), which is below the significance level alpha = 0.20. We can conclude that Sale Price and 2nd Floor Square Footage are significantly correlated, with a correlation coefficient of 0.66. The t-statistic of 16.75 is far from 0, meaning the estimated correlation lies almost seventeen standard errors from zero, which is consistent with the very low p-value. The 80% confidence interval, 0.624 to 0.700, does not contain 0.

Would you be worried about familywise error? Why or why not?

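I don’t think I should worry much about familywise error here. Running three tests at alpha = 0.20 does raise the chance of at least one false positive to 1 - 0.8^3, or about 0.49, if all three null hypotheses were true. However, each of the 3 pairs has a high t-statistic and a p-value below \(2.2 \times 10^{-16}\), so all three null hypotheses would still be rejected after a Bonferroni correction (alpha / 3, or about 0.067). The data set is also reasonably large, with 359 observations.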
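The familywise error rate for several independent tests, and the effect of a Bonferroni adjustment, can be sketched in a few lines. The p-values below are the upper bounds reported by cor.test above (p-value < 2.2e-16), used as stand-ins; treating the three tests as independent is a simplifying assumption.

```r
# Familywise error rate for m independent tests, each at level alpha
alpha <- 0.20
m <- 3
fwer <- 1 - (1 - alpha)^m
fwer                                   # 0.488

# Upper bounds on the three reported p-values (stand-ins, not exact values)
p_raw <- c(bsmt = 2.2e-16, flr1 = 2.2e-16, flr2 = 2.2e-16)

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf < alpha
```

Even with the inflated familywise rate, the adjusted p-values stay far below alpha, so the conclusions do not change.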

(5 points) Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Invert your correlation matrix from above

res

##           salePrice finBsmtSF    Flr1SF    Flr2SF
## salePrice 1.0000000 0.4328517 0.5907942 0.6632950
## finBsmtSF 0.4328517 1.0000000 0.6852469 0.2788414
## Flr1SF    0.5907942 0.6852469 1.0000000 0.4497692
## Flr2SF    0.6632950 0.2788414 0.4497692 1.0000000

invres <- solve(res)
invres

##            salePrice  finBsmtSF     Flr1SF     Flr2SF
## salePrice  2.2253001 -0.1792725 -0.6900508 -1.1156782
## finBsmtSF -0.1792725  1.9035304 -1.2701740  0.1594125
## Flr1SF    -0.6900508 -1.2701740  2.3979965 -0.2666606
## Flr2SF    -1.1156782  0.1594125 -0.2666606  1.8155086

Multiply the correlation matrix by the precision matrix

res %*% invres

##               salePrice     finBsmtSF        Flr1SF        Flr2SF
## salePrice  1.000000e+00  8.459451e-18  3.124601e-18  8.042068e-18
## finBsmtSF -9.588181e-17  1.000000e+00 -9.795823e-17 -9.862991e-17
## Flr1SF    -2.227974e-16 -6.641383e-17  1.000000e+00  9.903251e-17
## Flr2SF    -2.220446e-16  2.775558e-17  5.551115e-17  1.000000e+00

Multiply the precision matrix by the correlation matrix

invres %*% res

##               salePrice     finBsmtSF        Flr1SF        Flr2SF
## salePrice  1.000000e+00  2.154458e-17 -1.190629e-17  0.000000e+00
## finBsmtSF -2.274629e-16  1.000000e+00 -2.884584e-16 -1.110223e-16
## Flr1SF     2.251692e-16  3.461310e-16  1.000000e+00  1.665335e-16
## Flr2SF    -2.140025e-16 -2.096522e-16 -1.230121e-16  1.000000e+00
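Both products are the identity matrix up to floating-point error; the off-diagonal entries near 1e-16 are rounding noise. A minimal sketch of checking this programmatically follows, re-entering the correlation matrix from the printed values above (rounded to seven decimals, so this is an approximation of `res`).

```r
# Correlation matrix re-entered from the printed output (7-decimal rounding)
corr <- matrix(c(
  1.0000000, 0.4328517, 0.5907942, 0.6632950,
  0.4328517, 1.0000000, 0.6852469, 0.2788414,
  0.5907942, 0.6852469, 1.0000000, 0.4497692,
  0.6632950, 0.2788414, 0.4497692, 1.0000000
), nrow = 4, byrow = TRUE)
vars <- c("salePrice", "finBsmtSF", "Flr1SF", "Flr2SF")
dimnames(corr) <- list(vars, vars)

precision <- solve(corr)               # inverse of the correlation matrix

left_prod <- corr %*% precision
right_prod <- precision %*% corr

# all.equal compares with a numerical tolerance instead of exact equality
isTRUE(all.equal(left_prod, diag(4), check.attributes = FALSE))
isTRUE(all.equal(right_prod, diag(4), check.attributes = FALSE))

zapsmall(left_prod)                    # prints rounding noise as 0

diag(precision)                        # variance inflation factors
```

The diagonal of the precision matrix is above 1 for every variable, which is what variance inflation factors look like when the variables are correlated with each other.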

Conduct LU decomposition on the matrix

# ref: [LUR]
LUres <- lu.decomposition(res)
L <- LUres$L
U <- LUres$U
print(L)

##           [,1]        [,2]      [,3] [,4]
## [1,] 1.0000000  0.00000000 0.0000000    0
## [2,] 0.4328517  1.00000000 0.0000000    0
## [3,] 0.5907942  0.52855010 1.0000000    0
## [4,] 0.6632950 -0.01017292 0.1468793    1

print(U)

##      [,1]      [,2]      [,3]         [,4]
## [1,]    1 0.4328517 0.5907942  0.663294968
## [2,]    0 0.8126394 0.4295206 -0.008266919
## [3,]    0 0.0000000 0.4239391  0.062267857
## [4,]    0 0.0000000 0.0000000  0.550809829

print(L %*% U)

##           [,1]      [,2]      [,3]      [,4]
## [1,] 1.0000000 0.4328517 0.5907942 0.6632950
## [2,] 0.4328517 1.0000000 0.6852469 0.2788414
## [3,] 0.5907942 0.6852469 1.0000000 0.4497692
## [4,] 0.6632950 0.2788414 0.4497692 1.0000000
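To show what lu.decomposition is doing, here is a minimal hand-written sketch of Doolittle elimination without pivoting. The function name `lu_doolittle` is hypothetical and this is not the matrixcalc implementation; it assumes no zero pivots occur, which holds for this matrix. The correlation matrix is re-entered from the printed values above.

```r
# Hypothetical helper: Doolittle LU without pivoting (assumes nonzero pivots)
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)                         # unit lower-triangular factor
  U <- A                               # reduced in place to upper-triangular
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]     # multiplier that eliminates U[i, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]
    }
  }
  list(L = L, U = U)
}

corr <- matrix(c(
  1.0000000, 0.4328517, 0.5907942, 0.6632950,
  0.4328517, 1.0000000, 0.6852469, 0.2788414,
  0.5907942, 0.6852469, 1.0000000, 0.4497692,
  0.6632950, 0.2788414, 0.4497692, 1.0000000
), nrow = 4, byrow = TRUE)

res_lu <- lu_doolittle(corr)
round(res_lu$L, 4)
round(res_lu$U, 4)

# The product of the factors should reproduce the original matrix
max(abs(res_lu$L %*% res_lu$U - corr))
```

Each multiplier stored in L records how much of the pivot row was subtracted from the row below it, which is why multiplying L by U reverses the elimination.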

(5 points) Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, ???)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Through experimentation, I found that the Finished Basement Square Footage is skewed to the right, as shown by its histogram below.

hist(finBsmtSF)

sdfinBsmtSF <- sd(finBsmtSF)
sdfinBsmtSF

## [1] 426.2223

Load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

fit <- fitdistr(finBsmtSF, densfun = "exponential")
fit

##        rate    
##   1.715823e-03 
##  (9.055769e-05)

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, ???)).

lambda <- 1.7158*10^(-3)
epmean <- 1/lambda
epmean

## [1] 582.8185

rvalues <- rexp(1000,lambda)

Plot a histogram and compare it with a histogram of your original variable.

hist(rvalues)

I followed the instructions and found that the histogram of the simulated exponential sample is still right-skewed, which is expected because the exponential distribution itself is right-skewed. Compared with the histogram of the original variable, the simulated histogram decays more smoothly from left to right, with no empty bins.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

densityData <- density(rvalues)
cdfData <- CDF(densityData)
plot(cdfData)
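The plot above shows an estimated CDF of the simulated values. The percentiles of the fitted exponential distribution itself can also be read from its closed-form CDF, F(x) = 1 - exp(-lambda x), whose inverse is x = -log(1 - p) / lambda. A minimal sketch, using the rate estimated above:

```r
lambda <- 1.7158e-3                    # rate from fitdistr, as used above

# qexp is the inverse CDF (quantile function) of the exponential distribution
pct <- qexp(c(0.05, 0.95), rate = lambda)
names(pct) <- c("5%", "95%")
pct

# Same result from the closed-form inverse of F(x) = 1 - exp(-lambda * x)
manual <- -log(1 - c(0.05, 0.95)) / lambda
all.equal(unname(pct), manual)
```

These theoretical percentiles depend only on lambda, so unlike the percentiles of the simulated sample they do not change from run to run.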

Generate a 95% confidence interval from the empirical data, assuming normality.

# ref: [CAL]
epmean2 <- mean(rvalues)
epsd2 <- sd(rvalues)
eplen2 <- 1000
# Note: qnorm(0.95) gives a two-sided 90% interval; a 95% interval would use qnorm(0.975)
error <- qnorm(0.95)*epsd2/sqrt(eplen2)
left <- epmean2 - error
right <- epmean2 + error
left

## [1] 561.2449

right

## [1] 623.8408
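For a two-sided 95% interval under normality, the critical value is the 97.5th percentile of the standard normal, so that 2.5% is left in each tail. A minimal sketch using the sample mean and standard deviation printed for this run (they will differ on another run); the variable names are illustrative:

```r
# Summary statistics copied from the run shown in this report
sample_mean <- 592.5428
sample_sd <- 601.712
n <- 1000

z <- qnorm(0.975)                      # about 1.96 for a two-sided 95% interval
half_width <- z * sample_sd / sqrt(n)

ci_95 <- c(lower = sample_mean - half_width,
           upper = sample_mean + half_width)
ci_95
```

This interval is somewhat wider than the one computed with qnorm(0.95), since a higher confidence level requires a larger margin of error.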

Provide the empirical 5th percentile and 95th percentile of the data. Discuss.

epmean2

## [1] 592.5428

epsd2

## [1] 601.712

quantile(rvalues, probs = c(0.05,0.95))

##         5%        95% 
##   34.05061 1870.01164

Based on the 1000 values simulated from the fitted exponential distribution for Finished Basement Square Footage, the confidence interval for the mean computed above runs from about 561 to 624, around a sample mean of about 593. However, the standard deviation is about 602, which means individual values fluctuate widely. The empirical 5th and 95th percentiles confirm this variability, with a low value of about 34 and a high value of about 1870. These figures come from the run shown above and will change on each run.

(10 points) Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

I am going to regress SalePrice on 10 predictor variables.

First Pass

lmresult2D1 <- lm(SalePrice ~ MSSubClass + LotFrontage + LotArea + MasVnrArea + BsmtUnfSF + BsmtFinSF1 + X2ndFlrSF + X1stFlrSF + TotRmsAbvGrd + KitchenAbvGr, data = trainingDF)
summary(lmresult2D1)

## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea + 
##     MasVnrArea + BsmtUnfSF + BsmtFinSF1 + X2ndFlrSF + X1stFlrSF + 
##     TotRmsAbvGrd + KitchenAbvGr, data = trainingDF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -329042  -26668   -4967   21906  258686 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.006e+05  2.395e+04   4.201 3.61e-05 ***
## MSSubClass   -3.242e+02  1.152e+02  -2.813 0.005264 ** 
## LotFrontage  -4.500e+02  1.838e+02  -2.449 0.014976 *  
## LotArea      -2.635e-01  8.297e-01  -0.318 0.751076    
## MasVnrArea    5.889e+01  1.670e+01   3.526 0.000495 ***
## BsmtUnfSF     1.951e+01  1.991e+01   0.980 0.327875    
## BsmtFinSF1    2.421e+01  1.726e+01   1.402 0.161961    
## X2ndFlrSF     1.726e+02  1.829e+01   9.434  < 2e-16 ***
## X1stFlrSF     5.564e+01  2.117e+01   2.629 0.009057 ** 
## TotRmsAbvGrd  2.063e+03  3.682e+03   0.560 0.575863    
## KitchenAbvGr -7.772e+04  2.121e+04  -3.664 0.000298 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59990 on 272 degrees of freedom
##   (76 observations deleted due to missingness)
## Multiple R-squared:  0.6322, Adjusted R-squared:  0.6186 
## F-statistic: 46.74 on 10 and 272 DF,  p-value: < 2.2e-16

Analysis: Based on the initial pass, we have identified that LotArea, Unfinished Basement SF, Finished Basement SF, and Total Rooms Above Ground should be eliminated from the second pass since their p-values are greater than 0.05.

Second Pass

lmresult2D2 <- lm(SalePrice ~ MSSubClass + LotFrontage + MasVnrArea + X2ndFlrSF + X1stFlrSF + KitchenAbvGr, data = trainingDF)
summary(lmresult2D2)

## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + MasVnrArea + 
##     X2ndFlrSF + X1stFlrSF + KitchenAbvGr, data = trainingDF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -333902  -26830   -6670   21177  261020 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  100476.09   22698.38   4.427 1.38e-05 ***
## MSSubClass     -303.57     109.48  -2.773 0.005938 ** 
## LotFrontage    -409.27     172.41  -2.374 0.018291 *  
## MasVnrArea       62.50      16.19   3.860 0.000141 ***
## X2ndFlrSF       176.13      14.89  11.832  < 2e-16 ***
## X1stFlrSF        78.79      11.82   6.665 1.43e-10 ***
## KitchenAbvGr -76140.46   19927.94  -3.821 0.000164 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59790 on 276 degrees of freedom
##   (76 observations deleted due to missingness)
## Multiple R-squared:  0.6293, Adjusted R-squared:  0.6213 
## F-statistic:  78.1 on 6 and 276 DF,  p-value: < 2.2e-16

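Analysis: After the second pass, we have arrived at our final model. All remaining variables have p-values less than 0.05. Moreover, the Multiple R-squared (0.6293) and Adjusted R-squared (0.6213) are close to each other and nearly unchanged from the first pass, indicating that dropping the four variables cost little explanatory power. The model explains about 62% of the variance in SalePrice.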
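Dropping several variables at once based on their individual p-values can be double-checked with a partial F-test between the nested models. The sketch below illustrates the idea on simulated data only; the data frame `sim`, its columns, and all coefficients are hypothetical and unrelated to the Kaggle data.

```r
# Sketch only: simulated data with two real predictors and two noise predictors
set.seed(605)
n <- 300
sim <- data.frame(
  area1 = rnorm(n, mean = 1000, sd = 300),
  area2 = rnorm(n, mean = 800, sd = 250),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
sim$price <- 50000 + 80 * sim$area1 + 170 * sim$area2 + rnorm(n, sd = 60000)

full <- lm(price ~ area1 + area2 + noise1 + noise2, data = sim)
reduced <- lm(price ~ area1 + area2, data = sim)

# Partial F-test: do the dropped variables jointly improve the fit?
cmp <- anova(reduced, full)
cmp
```

A large p-value in the second row would suggest that the dropped variables add little as a group. Both models must be fitted on the same rows, which matters when some predictors have missing values.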

References

[CAL] Calculating Confidence Intervals. Retrieved from website: https://www.cyclismo.org/tutorial/R/confidence.html

[COR1] Correlation matrix : A quick start guide to analyze, format and visualize a correlation matrix using R software. Retrieved from website: http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software

[COR2] Correlation Test Between Two Variables in R. Retrieved from website: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r

[CRE] Creating and Interpretting a Scatterplot Matrix in R. Retrieved from website: https://www.youtube.com/watch?v=tS7dX-wTa9I

[LUR] lu.decomposition. Retrieved from website: https://www.rdocumentation.org/packages/matrixcalc/versions/1.0-3/topics/lu.decomposition

[UNI] Univariate and bivariate descriptive analysis. Retrieved from website: https://beta.rstudioconnect.com/content/3350/dplyr_tutorial.html