Data 605 - Final Exam

Video Link: https://youtu.be/KD21iznAR9g

library(MASS)
library(Matrix)
library(matlib)
library(dplyr)
library(ggplot2)
library(tidyr)
library(kableExtra)
library(purrr)
library(Hmisc)

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with $\mu = \sigma = (N+1)/2$.

Solution:

Generate a random variable X

# set seed value
set.seed(1)
N <- 6
X <- runif(10000, min = 1, max = N)

Generate a random variable Y

# mean of Y; sd is not passed, so rnorm() falls back to its default sd = 1
mu <- (N+1)/2
Y <- rnorm(10000, mean = mu)
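The problem statement sets $\sigma$ equal to $\mu$. A minimal sketch of that variant passes `sd = mu` explicitly. `Y_alt` is a hypothetical name, and the results reported in the rest of this problem come from the call above, so they would change under this variant.

```r
# Variant that follows the problem statement: mu = sigma = (N + 1) / 2
set.seed(1)
N <- 6
mu <- (N + 1) / 2
Y_alt <- rnorm(10000, mean = mu, sd = mu)  # Y_alt is a hypothetical name

# The sample mean and standard deviation should both be close to mu
c(mean = mean(Y_alt), sd = sd(Y_alt))
```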

a. P(X>x | X>y)

# x is the median of X, y is the first quartile of Y
x <- median(X)
y <- summary(Y)[2][[1]]

#p(A|B) = P(AB)/P(B)
p1 <- (sum(X>x & X>y)/length(X))/(sum(X>y)/length(X)) 

p1

## [1] 0.7875256

The probability that X exceeds its median, given that X exceeds the first quartile of Y, is 0.7875.
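As a rough cross-check, the same conditional probability can be computed from the theoretical distributions instead of the sample. The sketch below assumes X is uniform on [1, N] and Y is normal with mean (N+1)/2 and sd = 1 (the value actually used by the `rnorm()` call above); `x_theory` and `y_theory` are hypothetical names.

```r
N <- 6
mu <- (N + 1) / 2

x_theory <- qunif(0.5, min = 1, max = N)     # theoretical median of X
y_theory <- qnorm(0.25, mean = mu, sd = 1)   # theoretical first quartile of Y

# P(A | B) = P(A and B) / P(B), with A = {X > x} and B = {X > y}
p_B  <- punif(y_theory, 1, N, lower.tail = FALSE)
p_AB <- punif(max(x_theory, y_theory), 1, N, lower.tail = FALSE)
p_theory <- p_AB / p_B
p_theory
```

The theoretical value should land close to the simulated 0.7875, since 10,000 draws leave little sampling noise.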

b. P(X>x, Y>y)

Solution:

#P(AB)
p2 <- sum(X>x & Y>y)/length(X)

p2

## [1] 0.3754

The probability that X exceeds its median and Y exceeds its first quartile is 0.3754.

c. P(X<x | X>y)

Solution:

#p(A|B) = P(AB)/P(B)
p3 <- sum(X<x & X > y)/sum(X>y)  # simplified the n = length(X)

p3

## [1] 0.2124744

The probability that X is below its median, given that X exceeds the first quartile of Y, is 0.2125. This is the complement of part a, since 0.7875 + 0.2125 = 1.

`Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.`

Solution:

Making the joint and marginal probability table:

matrix<-matrix( c(sum(X>x & Y<y),sum(X>x & Y>y), sum(X<x & Y<y),sum(X<x & Y>y)), nrow = 2,ncol = 2)
matrix<-cbind(matrix,c(matrix[1,1]+matrix[1,2],matrix[2,1]+matrix[2,2]))
matrix<-rbind(matrix,c(matrix[1,1]+matrix[2,1],matrix[1,2]+matrix[2,2],matrix[1,3]+matrix[2,3]))
contingency<-as.data.frame(matrix)
names(contingency) <- c("X>x","X<x", "Total")
row.names(contingency) <- c("Y<y","Y>y", "Total")
kable(contingency) %>%
  kable_styling(bootstrap_options = "bordered")

	X>x	X<x	Total
Y<y	1246	1254	2500
Y>y	3754	3746	7500
Total	5000	5000	10000

prob_matrix<-matrix/matrix[3,3]
contingency_p<-as.data.frame(prob_matrix)
names(contingency_p) <- c("X>x","X<x", "Total")
row.names(contingency_p) <- c("Y<y","Y>y", "Total")
kable(round(contingency_p,3)) %>%
  kable_styling(bootstrap_options = "bordered")

	X>x	X<x	Total
Y<y	0.125	0.125	0.25
Y>y	0.375	0.375	0.75
Total	0.500	0.500	1.00

Compute P(X>x)P(Y>y)

prob_matrix[3,1]*prob_matrix[2,3]

## [1] 0.375

Compute P(X>x and Y>y)

round(prob_matrix[2,1],digits = 3)

## [1] 0.375

Verify P(X>x and Y>y)=P(X>x)P(Y>y)

prob_matrix[3,1]*prob_matrix[2,3]==round(prob_matrix[2,1],digits = 3)

## [1] TRUE

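The joint probability P(X>x and Y>y) is 0.3754 (part b), while the product of the marginals is 0.5 × 0.75 = 0.375. The equality check above returns TRUE only because the joint probability was rounded to three decimals. The two values are still very close, which is consistent with X and Y being independent; exact equality is not expected from a finite sample.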
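Because a sample proportion rarely matches a product of marginals exactly, a tolerance-based comparison is a more robust check than `==`. The sketch below regenerates the data with the same seed and settings as above and measures the gap in units of a rough binomial standard error; the yardstick and the names `joint`, `prod_marg`, and `se` are assumptions made for illustration.

```r
set.seed(1)
n <- 10000
X <- runif(n, min = 1, max = 6)
Y <- rnorm(n, mean = 3.5)            # sd = 1, as in the call above

x <- median(X)
y <- quantile(Y, 0.25, names = FALSE)

joint     <- mean(X > x & Y > y)          # P(X > x and Y > y)
prod_marg <- mean(X > x) * mean(Y > y)    # P(X > x) * P(Y > y)

# Rough standard error of a proportion near prod_marg with n draws
se <- sqrt(prod_marg * (1 - prod_marg) / n)
gap_in_se <- abs(joint - prod_marg) / se
gap_in_se
```

A gap well below about two standard errors gives no reason to doubt independence.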

`Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?`

Solution:

Fisher’s Exact Test

fisher.test(table(X>x,Y>y))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(X > x, Y > y)
## p-value = 0.8716
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9202847 1.1052820
## sample estimates:
## odds ratio 
##    1.00857

Chi Square Test

chisq.test(table(X>x,Y>y))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(X > x, Y > y)
## X-squared = 0.026133, df = 1, p-value = 0.8716

$H_0$ : X and Y are independent.

$H_a$ : X and Y are not independent.

The chi-square test of independence checks whether two categorical variables are associated. A small p-value indicates a statistically significant association.

In both tests the p-value (0.8716) is far above the usual threshold of 0.05, so we fail to reject the null hypothesis: the data give no evidence that X and Y are dependent.

Fisher's exact test computes an exact p-value for the null hypothesis of independence and is typically used when the sample size, or any expected cell count, is small.

The chi-square test relies on a large-sample approximation, so it is suited to large samples.

With 10,000 observations and at least 1,246 in every cell, the chi-square test is appropriate here, and both tests lead to the same conclusion.
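The difference between the two tests shows up on small tables. The sketch below uses a hypothetical 2x2 table with only 8 observations (`small_tab` is made up for illustration): its expected counts fall below 5, where the chi-square approximation is usually considered unreliable, while Fisher's test remains exact.

```r
# Hypothetical 2x2 table with only 8 observations
small_tab <- matrix(c(3, 1, 1, 3), nrow = 2,
                    dimnames = list(A = c("yes", "no"), B = c("yes", "no")))

# chisq.test() warns that the approximation may be incorrect for such a table;
# the warning is suppressed here only to keep the output short
chi <- suppressWarnings(chisq.test(small_tab))
chi$expected   # expected counts under independence

# Large-sample approximation versus exact p-value
fisher_p <- fisher.test(small_tab)$p.value
c(chi_square = chi$p.value, fisher = fisher_p)
```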

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques .

`Descriptive and Inferential Statistics`

library(readr)
library(tidyverse)

train <- read_csv('train.csv')
test <- read_csv("test.csv")

descriptive statistics

train <- train[ ,1:81]
dim(train)

## [1] 1460   81

#summary(train)

Plots

MSSubClass is right skewed (the mean, 56.9, sits above the median, 50, with a tail reaching 190):

summary(train$MSSubClass)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    20.0    50.0    56.9    70.0   190.0

hist(train$MSSubClass, main="Distribution of MSSubClass",xlab="MSSubClass")

RL is the most frequent zoning class and C the least frequent:

summary(train$MSZoning)

##    Length     Class      Mode 
##      1460 character character

barplot(table(train$MSZoning), main="MS Zoning")

LotFrontage is right skewed (median 69, maximum 313):

summary(train$LotFrontage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   21.00   59.00   69.00   70.05   80.00  313.00     259

hist(train$LotFrontage,main="Histogram of Lot Frontage",xlab="LotFrontage")

Lot Area is strongly right skewed: most lots are small, with a few very large values:

hist(train$LotArea,main="Distribution of LotArea",xlab="Lot Area")

Ground Living Area is roughly bell-shaped with a longer right tail:

hist(train$GrLivArea,main="Distribution of Ground Living Area",xlab="Ground Living Area")

Sale price is roughly bell-shaped but right skewed, with a long upper tail:

hist(train$SalePrice,main="Distribution of Sale Price",xlab="Sale Price")

Scatterplot matrix for “SalePrice”,“GrLivArea”,“LotFrontage”

pairs(train[,c("SalePrice","GrLivArea","LotFrontage")])

The scatterplot matrix above shows that both LotFrontage and GrLivArea are positively correlated with SalePrice.

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:Hmisc':
## 
##     describe

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## The following object is masked from 'package:matlib':
## 
##     tr

#a scatterplot matrix of independent variables and the dependent variable
sp_train <- train
sp_train %>% 
  dplyr::select(c("SalePrice", "GrLivArea", "LotFrontage")) %>% 
  pairs.panels(method = "pearson", hist.col = "#c95656")

Correlation matrix for any three quantitative variables

We choose the following variables: SalePrice , GrLivArea and TotalBsmtSF

correlation_matrix <- cor(train[,c("SalePrice","GrLivArea","TotalBsmtSF")])
correlation_matrix

##             SalePrice GrLivArea TotalBsmtSF
## SalePrice   1.0000000 0.7086245   0.6135806
## GrLivArea   0.7086245 1.0000000   0.4548682
## TotalBsmtSF 0.6135806 0.4548682   1.0000000

From the matrix above, SalePrice has a strong positive correlation with GrLivArea (0.71) and a moderate positive correlation with TotalBsmtSF (0.61).

GrLivArea and TotalBsmtSF are the most weakly related pair, with a positive correlation of 0.45.

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

SalePrice vs GrLivArea

$H_0$ : The correlation between GrLivArea and SalePrice is 0

$H_a$ : The correlation between GrLivArea and SalePrice is other than 0

cor.test(train$SalePrice, train$GrLivArea, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

Since the p-value is far below 0.05, we reject the null hypothesis at the 5% level of significance and conclude that the correlation between GrLivArea and SalePrice is not 0.

The 80% confidence interval for the correlation is [0.6915, 0.7249].
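For Pearson correlations, `cor.test` builds its interval from Fisher's z-transformation. The sketch below reproduces the 80% interval by hand from the reported correlation and sample size (df = 1458, so n = 1460); the variable names are chosen here for illustration.

```r
r <- 0.7086245      # correlation between SalePrice and GrLivArea, from the output above
n <- 1460           # df + 2
conf_level <- 0.80

z  <- atanh(r)                        # Fisher z-transformation
se <- 1 / sqrt(n - 3)                 # approximate standard error of z
crit <- qnorm((1 + conf_level) / 2)   # two-sided critical value

ci <- tanh(z + c(-1, 1) * crit * se)  # back-transform to the correlation scale
ci
```

The result should agree with the interval printed by `cor.test` up to the rounding of `r`.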

SalePrice vs TotalBsmtSF

$H_0$ : The correlation between TotalBsmtSF and SalePrice is 0

$H_a$ : The correlation between TotalBsmtSF and SalePrice is other than 0

cor.test(train$SalePrice, train$TotalBsmtSF, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5922142 0.6340846
## sample estimates:
##       cor 
## 0.6135806

Since the p-value is far below 0.05, we reject the null hypothesis at the 5% level of significance and conclude that the correlation between TotalBsmtSF and SalePrice is not 0.

The 80% confidence interval for the correlation is [0.5922, 0.6341].

TotalBsmtSF vs GrLivArea

$H_0$ : The correlation between TotalBsmtSF and GrLivArea is 0

$H_a$ : The correlation between TotalBsmtSF and GrLivArea is other than 0

cor.test(train$GrLivArea, train$TotalBsmtSF, conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train$GrLivArea and train$TotalBsmtSF
## t = 19.503, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4278380 0.4810855
## sample estimates:
##       cor 
## 0.4548682

Since the p-value is far below 0.05, we reject the null hypothesis at the 5% level of significance and conclude that the correlation between GrLivArea and TotalBsmtSF is not 0.

The 80% confidence interval for the correlation is [0.4278, 0.4811].

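Familywise error

# three pairwise tests were run, each at the 5% level
family_wise_error <- 1 - (1 - .05)^3
family_wise_error

## [1] 0.142625

The familywise error rate is the probability of reaching at least one false conclusion (a Type I error) in a series of hypothesis tests.

With three tests at the 5% level, and treating them as independent, there is a 14.26% chance of at least one Type I error. All three p-values are below 2.2e-16, far under even a Bonferroni-adjusted threshold of 0.05/3, so familywise error is not a concern here.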
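One common way to guard against familywise error is a Bonferroni adjustment, which `p.adjust` provides. The sketch below assumes three pairwise tests treated as independent and uses 2.2e-16 as a stand-in upper bound for the p-values, since the output above only reports them as "< 2.2e-16".

```r
alpha <- 0.05
m <- 3                                   # number of pairwise correlation tests

# Familywise error rate for 1, 2 and 3 independent tests at level alpha
fwer <- 1 - (1 - alpha)^(1:m)
fwer

# Bonferroni: multiply each p-value by m (capped at 1) and compare with alpha
p_upper <- rep(2.2e-16, m)               # stand-in upper bounds, not exact p-values
p_bonf  <- p.adjust(p_upper, method = "bonferroni")
p_bonf < alpha
```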

`Linear Algebra and Correlation`

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Solution:

# find inverse
precision_matrix <- solve(correlation_matrix)
precision_matrix

##              SalePrice   GrLivArea TotalBsmtSF
## SalePrice    2.5582310 -1.38549273 -0.93946422
## GrLivArea   -1.3854927  2.01124151 -0.06473842
## TotalBsmtSF -0.9394642 -0.06473842  1.60588442

# Multiply the correlation matrix by the precision matrix
correlation_precision <- correlation_matrix %*% precision_matrix
correlation_precision

##                SalePrice     GrLivArea  TotalBsmtSF
## SalePrice   1.000000e+00 -2.081668e-17 0.000000e+00
## GrLivArea   5.551115e-17  1.000000e+00 1.110223e-16
## TotalBsmtSF 0.000000e+00  5.551115e-17 1.000000e+00

#  multiply the precision matrix by the correlation matrix
prec_cor <-   precision_matrix %*% correlation_matrix
prec_cor

##                SalePrice     GrLivArea   TotalBsmtSF
## SalePrice   1.000000e+00 -1.665335e-16 -1.110223e-16
## GrLivArea   2.012279e-16  1.000000e+00  1.665335e-16
## TotalBsmtSF 0.000000e+00  1.110223e-16  1.000000e+00

# LU decomposition
library(pracma)

## 
## Attaching package: 'pracma'

## The following objects are masked from 'package:psych':
## 
##     logit, polar

## The following object is masked from 'package:Hmisc':
## 
##     ceil

## The following object is masked from 'package:purrr':
## 
##     cross

## The following objects are masked from 'package:matlib':
## 
##     angle, inv

## The following objects are masked from 'package:Matrix':
## 
##     expm, lu, tril, triu

lu(precision_matrix)

## $L
##              SalePrice  GrLivArea TotalBsmtSF
## SalePrice    1.0000000  0.0000000           0
## GrLivArea   -0.5415823  1.0000000           0
## TotalBsmtSF -0.3672320 -0.4548682           1
## 
## $U
##             SalePrice GrLivArea TotalBsmtSF
## SalePrice    2.558231 -1.385493  -0.9394642
## GrLivArea    0.000000  1.260883  -0.5735356
## TotalBsmtSF  0.000000  0.000000   1.0000000
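To see what the factorization means, the sketch below rebuilds the precision matrix from the correlations printed above and factors it with a hand-rolled Doolittle elimination in base R. `lu_doolittle` is a hypothetical helper written for illustration (no pivoting, so it assumes nonzero pivots); it is not the `pracma` implementation.

```r
# Correlation matrix copied from the output above
R <- matrix(c(1,         0.7086245, 0.6135806,
              0.7086245, 1,         0.4548682,
              0.6135806, 0.4548682, 1),
            nrow = 3, byrow = TRUE)
P <- solve(R)   # precision matrix; its diagonal holds the variance inflation factors

# Doolittle LU without pivoting: L is unit lower triangular, U is upper triangular
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

f <- lu_doolittle(P)
max(abs(f$L %*% f$U - P))      # reconstruction error, should be near zero
max(abs(R %*% P - diag(3)))    # R times its inverse is the identity up to rounding
```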

`Calculus-Based Probability & Statistics`

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

We have selected the variable LotArea, because it is skewed to the right.

summary(train$LotArea)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245

optimal value of exponential for this distribution

library(MASS)
# Fitting of univariate distribution
(fit_dist <- fitdistr(train$LotArea, "exponential"))

##        rate    
##   9.508570e-05 
##  (2.488507e-06)

# optimal value of lambda
fit_dist$estimate

##        rate 
## 9.50857e-05

1000 samples from this exponential distribution using this value

values <- rexp(1000, rate = fit_dist$estimate)
par(mfrow=c(1,2))
# Actual vs simulated distribution
hist(train$LotArea, breaks=40, prob=TRUE, xlab="Lot Area",
     main="Original - LotArea")
hist(values, breaks=40, prob=TRUE, xlab="Generated Data",
     main="Exponential - LotArea")

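Both histograms are right skewed, but the exponential fit is only rough: an exponential density peaks at zero, while LotArea has a minimum of 1300 and its middle half lies between about 7,554 and 11,602.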

5th and 95th percentiles using the cumulative distribution function (CDF)

five_95 <- ecdf(values)
values[five_95(values)==0.05]

## [1] 402.4144

values[five_95(values)==0.95]

## [1] 30176.06

So the 5th percentile of the simulated sample is 402.4144 and the 95th percentile is 30176.06.
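The percentiles can also be read directly from the fitted distribution rather than from simulated draws, which avoids both sampling noise and an exact floating-point match on the ECDF. A minimal sketch, with the rate copied from the `fitdistr` output above, inverts the exponential CDF $F(q) = 1 - e^{-\lambda q}$, which gives $q = -\ln(1-p)/\lambda$:

```r
rate_hat <- 9.50857e-05          # fitted rate from the fitdistr output above
p <- c(0.05, 0.95)

q_fit    <- qexp(p, rate = rate_hat)   # quantile function of the exponential
q_closed <- -log(1 - p) / rate_hat     # same quantiles from the inverted CDF

rbind(q_fit, q_closed)
```

Both rows should agree, at roughly 539 and 31,500.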

95% confidence interval for the mean of the simulated data

t.test(values)$conf.int

## [1]  9667.527 10915.180
## attr(,"conf.level")
## [1] 0.95

95% confidence interval for the mean of the empirical data, assuming normality

t.test(train$LotArea)$conf.int

## [1] 10004.42 11029.24
## attr(,"conf.level")
## [1] 0.95
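The empirical 5th and 95th percentiles themselves come from `quantile()` rather than `t.test()`. The sketch below uses a simulated stand-in vector so that it runs on its own; `lot_area` is hypothetical and would be replaced by `train$LotArea` once `train.csv` is loaded, so no output values are quoted here.

```r
# Stand-in for train$LotArea (simulated, hypothetical values)
set.seed(1)
lot_area <- rexp(1460, rate = 9.50857e-05)

# Empirical 5th and 95th percentiles of the data
q_emp <- quantile(lot_area, probs = c(0.05, 0.95))
q_emp
```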

`Modeling`

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Missing values

sapply(train, function(x){sum(is.na(x))})

##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0           259             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1369             0             0             0 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             0             0 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             8             8             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            37            37            38            37             0 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            38             0             0             0             0 
##     HeatingQC    CentralAir    Electrical      1stFlrSF      2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             0             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             0             0           690            81            81 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            81             0             0            81            81 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch     3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1453          1179          1406 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0

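Let’s do some cleaning. We drop the variables with many missing values (Alley, PoolQC, Fence, MiscFeature, FireplaceQu, LotFrontage), along with YearRemodAdd, and drop Id from the training set.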

train <-train[, !colnames(train) %in% c("Id","Alley","PoolQC","Fence","MiscFeature","FireplaceQu","LotFrontage","YearRemodAdd")]
test <- test[, !colnames(test) %in% c("Alley","PoolQC","Fence","MiscFeature","FireplaceQu","LotFrontage","YearRemodAdd")]

# convert categorical to numeric
train <- train%>%
  mutate_if(is.character, as.factor)%>%
  mutate_if(is.factor, as.integer)
test <- test %>%
   mutate_if(is.character, as.factor)%>%
  mutate_if(is.factor, as.integer)
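One caveat with converting each data set separately: `as.factor` derives its levels from the values present, so the same category can receive different integer codes in train and test. The sketch below, on hypothetical toy data frames (`train_toy`, `test_toy`), shows the mismatch and one possible remedy using a shared level set.

```r
# Hypothetical toy data; the test set lacks the "C (all)" category
train_toy <- data.frame(Zone = c("RL", "RM", "C (all)", "RL"), stringsAsFactors = FALSE)
test_toy  <- data.frame(Zone = c("RM", "RL"), stringsAsFactors = FALSE)

# Separate encoding: the codes depend on which levels each data set contains
train_sep <- as.integer(as.factor(train_toy$Zone))   # RL = 2, RM = 3
test_sep  <- as.integer(as.factor(test_toy$Zone))    # RL = 1, RM = 2

# Shared encoding: build the levels once from both data sets
lv <- sort(unique(c(train_toy$Zone, test_toy$Zone)))
train_code <- as.integer(factor(train_toy$Zone, levels = lv))
test_code  <- as.integer(factor(test_toy$Zone, levels = lv))

rbind(test_sep, test_code)
```

With shared levels, "RL" and "RM" map to the same integers in both data sets, so a model fitted on train sees consistent codes at prediction time.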

Restricting the data to numeric variables

train <- select_if(train, is.numeric)
#test <- select_if(test, is.numeric)
#train <- na.omit(train)
train[is.na(train)] <- 0
test[is.na(test)] <- 0

Now we are in business, and we can start modeling!!!

Let’s fit a multiple regression model

`Stepwise Regression`

Stepwise regression is useful for data with many candidate predictors. Starting from the full model, stepAIC adds and drops variables (direction = "both") and returns the subset with the lowest AIC it finds, which gives a simpler model.

So let’s Do it!!

library(MASS)
res.lm <- lm(SalePrice ~., data = train)
step <- stepAIC(res.lm, direction = "both", trace = FALSE)
step

## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + Street + LotShape + 
##     LandContour + LandSlope + Neighborhood + Condition2 + HouseStyle + 
##     OverallQual + OverallCond + YearBuilt + RoofStyle + RoofMatl + 
##     Exterior1st + MasVnrType + MasVnrArea + ExterQual + BsmtQual + 
##     BsmtCond + BsmtExposure + BsmtFinSF1 + BsmtFinType2 + BsmtFinSF2 + 
##     BsmtUnfSF + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath + 
##     BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + 
##     Functional + Fireplaces + GarageYrBlt + GarageCars + PavedDrive + 
##     WoodDeckSF + ScreenPorch + PoolArea + YrSold + SaleCondition, 
##     data = train)
## 
## Coefficients:
##   (Intercept)     MSSubClass        LotArea         Street       LotShape  
##     1.795e+06     -1.485e+02      3.386e-01      3.361e+04     -9.974e+02  
##   LandContour      LandSlope   Neighborhood     Condition2     HouseStyle  
##     3.537e+03      7.013e+03      2.982e+02     -9.294e+03     -1.206e+03  
##   OverallQual    OverallCond      YearBuilt      RoofStyle       RoofMatl  
##     1.172e+04      5.256e+03      1.916e+02      2.105e+03      4.312e+03  
##   Exterior1st     MasVnrType     MasVnrArea      ExterQual       BsmtQual  
##    -5.467e+02      3.637e+03      2.626e+01     -9.784e+03     -7.088e+03  
##      BsmtCond   BsmtExposure     BsmtFinSF1   BsmtFinType2     BsmtFinSF2  
##     3.397e+03     -3.183e+03      1.736e+01      2.350e+03      2.504e+01  
##     BsmtUnfSF     `1stFlrSF`     `2ndFlrSF`   BsmtFullBath       FullBath  
##     8.234e+00      3.942e+01      4.666e+01      7.999e+03      3.810e+03  
##  BedroomAbvGr   KitchenAbvGr    KitchenQual   TotRmsAbvGrd     Functional  
##    -4.182e+03     -1.647e+04     -8.664e+03      3.660e+03      3.743e+03  
##    Fireplaces    GarageYrBlt     GarageCars     PavedDrive     WoodDeckSF  
##     4.799e+03     -9.056e+00      1.358e+04      2.917e+03      1.922e+01  
##   ScreenPorch       PoolArea         YrSold  SaleCondition  
##     4.176e+01     -3.031e+01     -1.112e+03      2.512e+03

Let’s fit the model that the stepwise procedure selected.

mf <- lm(SalePrice ~ MSSubClass + LotArea + Street + LotShape + 
    LandContour + LandSlope + Neighborhood + Condition2 + HouseStyle + 
    OverallQual + OverallCond + YearBuilt + RoofStyle + RoofMatl + 
    Exterior1st + MasVnrType + MasVnrArea + ExterQual + BsmtQual + 
    BsmtCond + BsmtExposure + BsmtFinSF1 + BsmtFinType2 + BsmtFinSF2 + 
    BsmtUnfSF + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath + 
    BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + 
    Functional + Fireplaces + GarageYrBlt + GarageCars + PavedDrive + 
    WoodDeckSF + ScreenPorch + PoolArea + YrSold + SaleCondition, 
    data = train)
summary(mf)

## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + Street + LotShape + 
##     LandContour + LandSlope + Neighborhood + Condition2 + HouseStyle + 
##     OverallQual + OverallCond + YearBuilt + RoofStyle + RoofMatl + 
##     Exterior1st + MasVnrType + MasVnrArea + ExterQual + BsmtQual + 
##     BsmtCond + BsmtExposure + BsmtFinSF1 + BsmtFinType2 + BsmtFinSF2 + 
##     BsmtUnfSF + `1stFlrSF` + `2ndFlrSF` + BsmtFullBath + FullBath + 
##     BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + 
##     Functional + Fireplaces + GarageYrBlt + GarageCars + PavedDrive + 
##     WoodDeckSF + ScreenPorch + PoolArea + YrSold + SaleCondition, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -456136  -14119   -1041   12562  293002 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.795e+06  1.284e+06   1.398 0.162217    
## MSSubClass    -1.485e+02  2.519e+01  -5.895 4.67e-09 ***
## LotArea        3.386e-01  1.026e-01   3.300 0.000992 ***
## Street         3.361e+04  1.366e+04   2.460 0.014023 *  
## LotShape      -9.974e+02  6.345e+02  -1.572 0.116175    
## LandContour    3.537e+03  1.313e+03   2.695 0.007125 ** 
## LandSlope      7.013e+03  3.765e+03   1.863 0.062676 .  
## Neighborhood   2.982e+02  1.492e+02   1.998 0.045894 *  
## Condition2    -9.294e+03  3.234e+03  -2.874 0.004109 ** 
## HouseStyle    -1.206e+03  5.939e+02  -2.030 0.042506 *  
## OverallQual    1.172e+04  1.153e+03  10.163  < 2e-16 ***
## OverallCond    5.256e+03  8.817e+02   5.962 3.15e-09 ***
## YearBuilt      1.916e+02  5.115e+01   3.746 0.000187 ***
## RoofStyle      2.105e+03  1.094e+03   1.925 0.054390 .  
## RoofMatl       4.312e+03  1.470e+03   2.933 0.003411 ** 
## Exterior1st   -5.467e+02  2.740e+02  -1.995 0.046196 *  
## MasVnrType     3.637e+03  1.431e+03   2.542 0.011125 *  
## MasVnrArea     2.626e+01  5.832e+00   4.503 7.26e-06 ***
## ExterQual     -9.784e+03  1.903e+03  -5.142 3.09e-07 ***
## BsmtQual      -7.088e+03  1.314e+03  -5.394 8.07e-08 ***
## BsmtCond       3.397e+03  1.251e+03   2.715 0.006699 ** 
## BsmtExposure  -3.183e+03  8.511e+02  -3.739 0.000192 ***
## BsmtFinSF1     1.736e+01  4.924e+00   3.525 0.000437 ***
## BsmtFinType2   2.350e+03  1.073e+03   2.190 0.028716 *  
## BsmtFinSF2     2.504e+01  7.638e+00   3.278 0.001072 ** 
## BsmtUnfSF      8.234e+00  4.787e+00   1.720 0.085614 .  
## `1stFlrSF`     3.942e+01  5.888e+00   6.694 3.12e-11 ***
## `2ndFlrSF`     4.666e+01  4.023e+00  11.597  < 2e-16 ***
## BsmtFullBath   7.999e+03  2.272e+03   3.521 0.000444 ***
## FullBath       3.810e+03  2.390e+03   1.594 0.111123    
## BedroomAbvGr  -4.182e+03  1.593e+03  -2.625 0.008750 ** 
## KitchenAbvGr  -1.647e+04  4.817e+03  -3.420 0.000644 ***
## KitchenQual   -8.664e+03  1.395e+03  -6.209 6.98e-10 ***
## TotRmsAbvGrd   3.660e+03  1.128e+03   3.245 0.001202 ** 
## Functional     3.743e+03  9.228e+02   4.056 5.26e-05 ***
## Fireplaces     4.799e+03  1.600e+03   2.999 0.002757 ** 
## GarageYrBlt   -9.056e+00  2.512e+00  -3.605 0.000323 ***
## GarageCars     1.358e+04  1.926e+03   7.054 2.72e-12 ***
## PavedDrive     2.917e+03  2.002e+03   1.457 0.145207    
## WoodDeckSF     1.922e+01  7.274e+00   2.642 0.008328 ** 
## ScreenPorch    4.176e+01  1.558e+01   2.680 0.007458 ** 
## PoolArea      -3.031e+01  2.156e+01  -1.406 0.159938    
## YrSold        -1.112e+03  6.369e+02  -1.745 0.081192 .  
## SaleCondition  2.512e+03  7.929e+02   3.168 0.001566 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31570 on 1416 degrees of freedom
## Multiple R-squared:  0.8467, Adjusted R-squared:  0.8421 
## F-statistic: 181.9 on 43 and 1416 DF,  p-value: < 2.2e-16

The fitted model explains about 84.7% of the variance in SalePrice (multiple R-squared 0.8467, adjusted 0.8421). Since the F-test p-value is below 0.05, the model as a whole is statistically significant.
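R-squared measures the share of variance explained, not prediction accuracy. The sketch below makes the definition concrete by computing it by hand and comparing it with `summary()`; it uses the built-in `mtcars` data as a stand-in, since the house-price data is not bundled with R.

```r
# mtcars is a stand-in data set, not the Kaggle data
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)                     # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares
r2_manual <- 1 - rss / tss                       # share of variance explained

# Typical size of a prediction error, in the units of the response
rmse <- sqrt(mean(residuals(fit)^2))

c(r2_manual = r2_manual, r2_summary = summary(fit)$r.squared, rmse = rmse)
```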

Residual Analysis

par(mfrow=c(2,2))
# plotting the step model
plot(mf)

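In the diagnostic plots the residuals look approximately normal, with no obvious pattern or heteroscedasticity, so the assumptions of the multiple regression model appear reasonably satisfied. The extreme residuals in the summary (minimum -456,136 and maximum 293,002, against a residual standard error of 31,570) point to a few outliers worth checking.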
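Visual checks can be complemented with formal tests. The sketch below runs a Shapiro-Wilk test for normality and a hand-computed studentized (Koenker) Breusch-Pagan test for constant variance, using only base R. It uses the built-in `mtcars` data as a stand-in model; applying it to the house-price model would mean swapping in `mf` and its predictors.

```r
# mtcars is a stand-in data set, not the Kaggle data
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test
shapiro_p <- shapiro.test(res)$p.value

# Constant variance: regress squared residuals on the predictors;
# n * R^2 is approximately chi-square with df = number of predictors
res_sq <- res^2
aux <- lm(res_sq ~ wt + hp, data = mtcars)
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

c(shapiro_p = shapiro_p, bp_p = bp_p)
```

Small p-values would indicate departures from normality or constant variance, respectively.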

How about some predictions

Predicted <- predict(mf, test)
kaggle_results <- data.frame(Id = test$Id, SalePrice=Predicted)
write.csv(kaggle_results,"kaggle_results.csv",row.names = FALSE)

info <- c("theoracley", 0.18542)
names(info) <- c("Username", "Score")
kable(info, col.names = "Kaggle") %>% 
  kable_styling(full_width = F)

	Kaggle
Username	theoracley
Score	0.18542
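For context on the score: as far as the competition's evaluation rules describe, submissions are ranked by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, so 0.18542 is an error on the log scale. A minimal sketch of that metric on hypothetical toy prices (`rmsle`, `actual_toy`, and `predicted_toy` are made up for illustration):

```r
# Root mean squared error on the log scale
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Hypothetical toy prices, not taken from the competition data
actual_toy    <- c(200000, 150000, 320000)
predicted_toy <- c(210000, 140000, 300000)

score_toy <- rmsle(predicted_toy, actual_toy)
score_toy

# A linear model can return non-positive predictions, for which log() is
# undefined, so it is worth checking before submitting
any(predicted_toy <= 0)
```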

Abdelmalek Hajjam

5/21/2020
