library(dplyr)
library(glue)
library(ggplot2)
library(scales)
library(corrplot)
library(matrixcalc)
library(tidyr)
library(purrr)
library(MASS)
library(Amelia)
library(randomForest)
library(caret)
library(e1071)
library(rpart)
library(readr)
library(Metrics)
library(forecast)
library(mice)
Probability Density 1: X~Gamma. Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely describe by n (a size parameter) and lambda (λ , a shape parameter). Choose any n greater 3 and an expected value (λ) between 2 and 10 (you choose).
n = 4
lambda = 3
observations = 10000
X = rgamma(observations , n, lambda)
head(X)
## [1] 1.3449226 1.6232740 0.6383956 1.7484828 3.4030763 0.9232465
Probability Density 2: Y~Sum of Exponentials. Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (λ). The n and λ must be the same as in the previous case. (e.g., mysum=rexp(10000,λ)+rexp(10000,λ)+..)
Y = rexp(observations, lambda) + rexp(observations, lambda) + rexp(observations, lambda) + rexp(observations, lambda)
head(Y)
## [1] 2.2679843 0.4105526 0.8449030 1.8832580 2.0976125 1.6004484
Probability Density 3: Z~ Exponential. Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (λ).
Z = rexp(observations, lambda)
head(Z)
## [1] 0.3498248 0.0651057 0.1445587 0.1936039 0.8894497 0.3311824
1a. Calculate the empirical expected value (means) and variances of all three pdfs.
PD1eev = mean(X)
PD1eev
## [1] 1.338276
PD1var = var(X)
PD1var
## [1] 0.4550475
PD2eev = mean(Y)
PD2eev
## [1] 1.337444
PD2var = var(Y)
PD2var
## [1] 0.4442312
PD3eev = mean(Z)
PD3eev
## [1] 0.336711
PD3var = var(Z)
PD3var
## [1] 0.1101664
glue('The empirical expected value of X is {round(PD1eev, 4)} and the variance of X is {round(PD1var, 4)}.\n The empirical expected value of Y is {round(PD2eev, 4)} and the variance of Y is {round(PD2var, 4)}.\n The empirical expected value of Z is {round(PD3eev, 4)} and the variance of Z is {round(PD3var, 4)}.')
## The empirical expected value of X is 1.3383 and the variance of X is 0.455.
## The empirical expected value of Y is 1.3374 and the variance of Y is 0.4442.
## The empirical expected value of Z is 0.3367 and the variance of Z is 0.1102.
Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y)
X
integrand <- function(x) x * dgamma(x, n, lambda)
expected_value_X <- integrate(integrand, lower = 0, upper = Inf)$value
expected_value_X
## [1] 1.333333
integrand <- function(x) x^2 * dgamma(x, n, lambda)
variance_X <- integrate(integrand, lower = 0, upper = Inf)$value - expected_value_X^2
variance_X
## [1] 0.4444444
Z The moment generating function of an exponential distribution is:
MX(t)=λ/λ−t
The first derivative of which, with λ=3 is:
M′X(t)=3/(3−t)^2
When we evaluate at t=0 we get:
M′X(0)= .33
Y The moment generating function of a sum of exponential distributions is:
MX(t)=(1−(t/λ))^−α
The first derivative of which, with α=4 and λ=3is:
M′X(t)=4/3(1−(t/3)^−5
When we evaluate at t=0 we get:
M′X(0)= 4/3
1c-e. Probability. For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds.
a. P(Z> λ| Z> λ/2) b. P(Z> 2λ | Z> λ) cb. P(Z>3λ | Z> λ)
Baye’s Law P(A|B) = P(B|A) * P(A) / P(B)
prob_a = sum(Z > lambda/2 & Z > lambda) * sum(Z > lambda) / sum(Z > lambda/2)
prob_b = sum(Z > lambda & Z > 2*lambda)* sum(Z > 2*lambda) / sum(Z > lambda)
prob_c = sum(Z > lambda & Z > 3*lambda)* sum(Z > 3*lambda) / sum(Z > lambda)
glue('For pdf Z (the exponential), the empiricall probabilities of a = {round(prob_a, 4)} // b = {round(prob_b, 4)} // c = {round(prob_c, 4)}')
## For pdf Z (the exponential), the empiricall probabilities of a = 0.0364 // b = 0 // c = 0
The memoryless property states that a random variable’s past does not
influence its future behavior. Therefore, all three probabilities hold
the memoryless property.
Loosely investigate whether P(YZ) = P(Y) P(Z) by building a
table with quartiles and evaluating the marginal and joint
probabilities. With a little help from this website ’https://www.tutorialspoint.com/create-a-quartile-column-for-each-value-in-an-r-data-frame-column'->
bin_Y = cut(Y, breaks = quantile(Y), include.lowest = TRUE)
levels(bin_Y) = c("1st Quartile Y", "2d Quartile Y", "3d Quartile Y", "4th Quartile Y")
bin_Z = cut(Z, breaks = quantile(Z), include.lowest = TRUE)
levels(bin_Z) = c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z")
combined_instances = paste(sample(bin_Z, 10000, replace = TRUE), sample(bin_Y, 10000, replace = TRUE))
instance_counts_matrix <- matrix(table(combined_instances), ncol = 4)
instance_counts <- prop.table(instance_counts_matrix)
df <- as.data.frame(instance_counts) %>%
mutate(Sum = rowSums(across(where(is.numeric)))) %>%
bind_rows(summarise(.,across(where(is.numeric), sum)))
colnames(df) <- c("1st Quartile Y", "2nd Quartile Y", "3rd Quartile Y", "4th Quartile Y", "Sum")
rownames(df) <- c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z", "Sum")
df
## 1st Quartile Y 2nd Quartile Y 3rd Quartile Y 4th Quartile Y
## 1st Quartile Z 0.0607 0.0625 0.0610 0.0619
## 2nd Quartile Z 0.0590 0.0630 0.0601 0.0640
## 3rd Quartile Z 0.0671 0.0620 0.0639 0.0633
## 4th Quartile Z 0.0657 0.0635 0.0621 0.0602
## Sum 0.2525 0.2510 0.2471 0.2494
## Sum
## 1st Quartile Z 0.2461
## 2nd Quartile Z 0.2461
## 3rd Quartile Z 0.2563
## 4th Quartile Z 0.2515
## Sum 1.0000
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher’s exact test is used when the sample size is small and the data do not follow a multinomial distribution, while the chi-square test is used when the sample size is large and the data follow a multinomial distribution. Fisher’s exact test is more computationally intensive than the chi-square test, but it provides exact probabilities of the observed data, while the chi-square test provides an approximate probability. The Chi Square Test is the more appropriate test in this case.
fisher.test(instance_counts_matrix, simulate.p.value = TRUE, B = 10000)
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 10000 replicates)
##
## data: instance_counts_matrix
## p-value = 0.6263
## alternative hypothesis: two.sided
Since the p-value is greater than the usual significance level of 0.05, we cannot reject the null hypothesis and conclude that there is no significant evidence of an association between the two categorical variables being compared. This means that any observed association between the variables could be due to chance, and further investigation may be needed to determine if there is a true relationship between the variables.
chisq.test(instance_counts_matrix)
##
## Pearson's Chi-squared test
##
## data: instance_counts_matrix
## X-squared = 7.1378, df = 9, p-value = 0.6228
This also suggests that there is no significant evidence of an association between the two categorical variables being compared.
Problem 2 You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following. 5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
train = read.csv("train_ames_data.csv", stringsAsFactors = F)
test = read.csv("test_ames_data.csv", stringsAsFactors = F)
head(train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
dim(train)
## [1] 1460 81
dim(test)
## [1] 1459 80
test_labels = test$Id
test$Id = NULL
train$Id = NULL
test$SalePrice = NA
fulldf = rbind(train, test)
dim(fulldf)
## [1] 2919 80
ggplot(data=fulldf[!is.na(fulldf$SalePrice),], aes(x=SalePrice)) +
geom_histogram(fill="blue", binwidth = 10000) +
scale_x_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
summary(fulldf$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 34900 129975 163000 180921 214000 755000 1459
numeric = which(sapply(fulldf, is.numeric))
all_num <- fulldf[, numeric]
cor_num <- cor(all_num, use="pairwise.complete.obs")
cor_sorted <- as.matrix(sort(cor_num[,'SalePrice'], decreasing = TRUE))
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_num <- cor_num[CorHigh, CorHigh]
corrplot.mixed(cor_num, tl.col="black", tl.pos = "lt", number.cex = 0.7)
par(mfrow = c(1,3))
plot(train$OverallQual, train$SalePrice,
xlab = 'Overall Quality', ylab = 'Sale Price')
plot(train$GrLivArea, train$SalePrice,
xlab = 'Ground Living Area', ylab = 'Sale Price')
plot(train$GarageCars, train$SalePrice,
xlab = 'Garage Cars', ylab = 'Sale Price')
cor_test1 <- cor.test(formula = ~ OverallQual + GarageCars,
data = all_num,
method = "pearson",
conf.level = 0.80)
cor_test2 <- cor.test(formula = ~ OverallQual + GrLivArea,
data = all_num,
method = "pearson",
conf.level = 0.80)
cor_test3 <- cor.test(formula = ~ GarageCars + GrLivArea,
data = all_num,
method = "pearson",
conf.level = 0.80)
cor_test1
##
## Pearson's product-moment correlation
##
## data: OverallQual and GarageCars
## t = 40.579, df = 2916, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5853570 0.6156978
## sample estimates:
## cor
## 0.6007437
cor_test2
##
## Pearson's product-moment correlation
##
## data: OverallQual and GrLivArea
## t = 37.97, df = 2917, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5590270 0.5907918
## sample estimates:
## cor
## 0.5751261
cor_test3
##
## Pearson's product-moment correlation
##
## data: GarageCars and GrLivArea
## t = 30.348, df = 2916, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4716853 0.5077613
## sample estimates:
## cor
## 0.489933
All three tests suggest strong evidence of a statistically significant correlation between the two variables. A p-value less than 0.05 is commonly used to indicate statistical significance, but a p-value of less than 2.2e-16 is even stronger evidence of significance.
I would not be worried about familywise error because:
To calculate the FWER, you can use the following formula:
FWER = 1 - (1 - alpha)^m
FWER = 1 - (1 - .05)^3
FWER
## [1] 0.142625
The probability of a type I error is just over 14%, which is low considering only 3 tests were performed.
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
cm <- dplyr::select(all_num, OverallQual, FullBath, GrLivArea)
cor_matrix <- round(cor(cm),digits = 2)
cor_matrix
## OverallQual FullBath GrLivArea
## OverallQual 1.00 0.53 0.58
## FullBath 0.53 1.00 0.63
## GrLivArea 0.58 0.63 1.00
precision_matrix <- solve(cor_matrix)
precision_matrix
## OverallQual FullBath GrLivArea
## OverallQual 1.6163527 -0.4411402 -0.6595663
## FullBath -0.4411402 1.7784972 -0.8645919
## GrLivArea -0.6595663 -0.8645919 1.9272413
cor_pre <- cor_matrix %*% precision_matrix
cor_pre
## OverallQual FullBath GrLivArea
## OverallQual 1.000000e+00 -4.315958e-17 -3.706001e-17
## FullBath 8.699900e-17 1.000000e+00 -1.624157e-16
## GrLivArea 1.110223e-16 0.000000e+00 1.000000e+00
pre_cor <- precision_matrix %*% cor_matrix
pre_cor
## OverallQual FullBath GrLivArea
## OverallQual 1.000000e+00 1.425102e-16 1.110223e-16
## FullBath -4.315958e-17 1.000000e+00 -1.110223e-16
## GrLivArea -3.706001e-17 -1.624157e-16 1.000000e+00
lu_cor_pre = matrixcalc::lu.decomposition(cor_pre)
L1 <- lu_cor_pre$L
U1 <- lu_cor_pre$U
print(L1)
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0.000000e+00 0
## [2,] 8.699900e-17 1.000000e+00 0
## [3,] 1.110223e-16 4.791676e-33 1
print(U1)
## [,1] [,2] [,3]
## [1,] 1 -4.315958e-17 -3.706001e-17
## [2,] 0 1.000000e+00 -1.624157e-16
## [3,] 0 0.000000e+00 1.000000e+00
print( L1 %*% U1)
## [,1] [,2] [,3]
## [1,] 1.000000e+00 -4.315958e-17 -3.706001e-17
## [2,] 8.699900e-17 1.000000e+00 -1.624157e-16
## [3,] 1.110223e-16 1.522449e-49 1.000000e+00
print(cor_pre)
## OverallQual FullBath GrLivArea
## OverallQual 1.000000e+00 -4.315958e-17 -3.706001e-17
## FullBath 8.699900e-17 1.000000e+00 -1.624157e-16
## GrLivArea 1.110223e-16 0.000000e+00 1.000000e+00
lu_pre_cor = matrixcalc::lu.decomposition(pre_cor)
L2 <- lu_pre_cor$L
U2 <- lu_pre_cor$U
print(L2)
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0.000000e+00 0
## [2,] -4.315958e-17 1.000000e+00 0
## [3,] -3.706001e-17 -1.624157e-16 1
print( L2 %*% U2)
## [,1] [,2] [,3]
## [1,] 1.000000e+00 1.425102e-16 1.110223e-16
## [2,] -4.315958e-17 1.000000e+00 -1.110223e-16
## [3,] -3.706001e-17 -1.624157e-16 1.000000e+00
print(pre_cor)
## OverallQual FullBath GrLivArea
## OverallQual 1.000000e+00 1.425102e-16 1.110223e-16
## FullBath -4.315958e-17 1.000000e+00 -1.110223e-16
## GrLivArea -3.706001e-17 -1.624157e-16 1.000000e+00
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
train %>%
dplyr::select(names(all_num)) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
The first one that jumped out at me as being skewed to the right was the
BsmtUnfSF (Unfinished square feet of basement area). Let’s take a closer
look.
ggplot(train, aes(BsmtUnfSF)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Histogram of Basement Unfinished Square Feet",
x = "Basement Unfinished Square Feet",
y = "Count")
summary(train$BsmtUnfSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 223.0 477.5 567.2 808.0 2336.0
fitdistr(train$BsmtUnfSF, "exponential")
## rate
## 1.762921e-03
## (4.613775e-05)
expo <- rexp(1000, 0.00176292094)
expdf <- data.frame(
type = c(rep("Original", 1460), rep("Exponential",1000)),
value = c(train$BsmtUnfSF, expo)
)
ggplot(expdf, aes(x=value, fill=type)) +
geom_histogram(binwidth = 50)
qexp(.05, rate= 0.00176292094)
## [1] 29.09563
qexp(.95, rate= 0.00176292094)
## [1] 1699.3
sigma <- sd(train$BsmtUnfSF)
n <- length(train$BsmtUnfSF)
ci <- qnorm(0.95)
me <- ci * (sigma/sqrt(n))
(mean(train$BsmtUnfSF) + me)
## [1] 586.2618
(mean(train$BsmtUnfSF) - me)
## [1] 548.219
quantile(train$BsmtUnfSF, 0.05)
## 5%
## 0
quantile(train$BsmtUnfSF, 0.95)
## 95%
## 1468
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
traindf = read.csv("train_ames_data.csv", stringsAsFactors = F)
testdf = read.csv("test_ames_data.csv", stringsAsFactors = F)
How many columns have text data?
sum(sapply(traindf[,1:81], typeof) == "character")
## [1] 43
How many columns have numerical data?
sum(sapply(traindf[,1:81], typeof) == "integer")
## [1] 38
Test has no ‘Saleslprice’ variable so let’s make it:
testdf$SalePrice<-rep(NA,1459)
ames<-bind_rows(traindf,testdf)
Remove Outliers:
traindf <- traindf[traindf$GrLivArea<=4000,]
How many NAs per column?
colSums(sapply(traindf, is.na))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1365 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1451 1176 1402
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
Let’s take all the missing data:
NAs <- sapply(traindf,function(x) sum(is.na(x)))
NAdf <- data.frame(index = names(traindf),Missing_Values=NAs)
NAdf[NAdf$Missing_Values > 0,]
## index Missing_Values
## LotFrontage LotFrontage 259
## Alley Alley 1365
## MasVnrType MasVnrType 8
## MasVnrArea MasVnrArea 8
## BsmtQual BsmtQual 37
## BsmtCond BsmtCond 37
## BsmtExposure BsmtExposure 38
## BsmtFinType1 BsmtFinType1 37
## BsmtFinType2 BsmtFinType2 38
## Electrical Electrical 1
## FireplaceQu FireplaceQu 690
## GarageType GarageType 81
## GarageYrBlt GarageYrBlt 81
## GarageFinish GarageFinish 81
## GarageQual GarageQual 81
## GarageCond GarageCond 81
## PoolQC PoolQC 1451
## Fence Fence 1176
## MiscFeature MiscFeature 1402
Creating new column for data after combining both to ensure we know if its test or train data:
testdf$SalePrice <- NA
traindf$isTrain <- 1
testdf$isTrain <- 0
ames <- rbind(traindf,testdf)
19 Columns must be ‘fixed’
MasVnrArea
ames$MasVnrArea[which(is.na(ames$MasVnrArea))] <- mean(ames$MasVnrArea,na.rm=T)
Alley
ames$Alley1 <- as.character(ames$Alley)
ames$Alley1[which(is.na(ames$Alley))] <- "None"
ames$Alley <- as.factor(ames$Alley1)
ames <- subset(ames,select = -Alley1)
MasVnrType1
ames$MasVnrType1 <- as.character(ames$MasVnrType)
ames$MasVnrType1[which(is.na(ames$MasVnrType))] <- "None"
ames$MasVnrType <- as.factor(ames$MasVnrType1)
ames <- subset(ames,select = -MasVnrType1)
LotFrontage
ames$LotFrontage[which(is.na(ames$LotFrontage))] <- median(ames$LotFrontage, na.rm = T)
FireplaceQu1
ames$FireplaceQu1 <- as.character(ames$FireplaceQu)
ames$FireplaceQu1[which(is.na(ames$FireplaceQu))] <- "None"
ames$FireplaceQu <- as.factor(ames$FireplaceQu1)
ames <- subset(ames, select = -FireplaceQu1)
PoolQC1
ames$PoolQC1 <- as.character(ames$PoolQC)
ames$PoolQC1[which(is.na(ames$PoolQC))] <- "None"
ames$PoolQC <- as.factor(ames$PoolQC1)
ames <- subset(ames, select = -PoolQC1)
Fence1
ames$Fence1 <- as.character(ames$Fence)
ames$Fence1[which(is.na(ames$Fence))] <- "None"
ames$Fence <- as.factor(ames$Fence1)
ames <- subset(ames, select = -Fence1)
MiscFeature1
ames$MiscFeature1 <- as.character(ames$MiscFeature)
ames$MiscFeature1[which(is.na(ames$MiscFeature))] <- "None"
ames$MiscFeature <- as.factor(ames$MiscFeature1)
ames <- subset(ames,select = -MiscFeature1)
GarageType1
ames$GarageType1 <- as.character(ames$GarageType)
ames$GarageType1[which(is.na(ames$GarageType))] <- "None"
ames$GarageType <- as.factor(ames$GarageType1)
ames <- subset(ames, select = -GarageType1)
GarageYrBlt
ames$GarageYrBlt[which(is.na(ames$GarageYrBlt))] <- 0
GarageFinish1
ames$GarageFinish1 <- as.character(ames$GarageFinish)
ames$GarageFinish1[which(is.na(ames$GarageFinish))] <- "None"
ames$GarageFinish <- as.factor(ames$GarageFinish1)
ames <- subset(ames,select = -GarageFinish1)
GarageQual1
ames$GarageQual1 <- as.character(ames$GarageQual)
ames$GarageQual1[which(is.na(ames$GarageQual))] <- "None"
ames$GarageQual <- as.factor(ames$GarageQual1)
ames <- subset(ames, select = -GarageQual1)
GarageCond1
ames$GarageCond1 <- as.character(ames$GarageCond)
ames$GarageCond1[which(is.na(ames$GarageCond))] <- "None"
ames$GarageCond <- as.factor(ames$GarageCond1)
ames <- subset(ames,select = -GarageCond1)
BsmtQual1
ames$BsmtQual1 <- as.character(ames$BsmtQual)
ames$BsmtQual1[which(is.na(ames$BsmtQual))] <- "None"
ames$BsmtQual <- as.factor(ames$BsmtQual1)
ames <- subset(ames, select = -BsmtQual1)
BsmtCond1
ames$BsmtCond1 <- as.character(ames$BsmtCond)
ames$BsmtCond1[which(is.na(ames$BsmtCond))] <- "None"
ames$BsmtCond <- as.factor(ames$BsmtCond1)
ames <- subset(ames, select = -BsmtCond1)
BsmtExposure1
ames$BsmtExposure1 <- as.character(ames$BsmtExposure)
ames$BsmtExposure1[which(is.na(ames$BsmtExposure))] <- "None"
ames$BsmtExposure <- as.factor(ames$BsmtExposure1)
ames <- subset(ames, select = -BsmtExposure1)
BsmtFinType11
ames$BsmtFinType11 <- as.character(ames$BsmtFinType1)
ames$BsmtFinType11[which(is.na(ames$BsmtFinType1))] <- "None"
ames$BsmtFinType1 <- as.factor(ames$BsmtFinType11)
ames <- subset(ames,select = -BsmtFinType11)
BsmtFinType21
ames$BsmtFinType21 <- as.character(ames$BsmtFinType2)
ames$BsmtFinType21[which(is.na(ames$BsmtFinType2))] <- "None"
ames$BsmtFinType2 <- as.factor(ames$BsmtFinType21)
ames <- subset(ames, select = -BsmtFinType21)
Electrical1
ames$Electrical1 <- as.character(ames$Electrical)
ames$Electrical1[which(is.na(ames$Electrical))] <- "None"
ames$Electrical <- as.factor(ames$Electrical1)
ames <- subset(ames, select = -Electrical1)
Factorize these columns:
ames$MSZoning <- factor(ames$MSZoning)
ames$Street <- factor(ames$Street)
ames$LotShape <- factor(ames$LotShape)
ames$LandContour <- factor(ames$LandContour)
ames$Utilities <- factor(ames$Utilities)
ames$LotConfig <- factor(ames$LotConfig)
ames$LandSlope <- factor(ames$LandSlope)
ames$Neighborhood <- factor(ames$Neighborhood)
ames$Condition1 <- factor(ames$Condition1)
ames$Condition2 <- factor(ames$Condition2)
ames$BldgType <- factor(ames$BldgType)
ames$HouseStyle <- factor(ames$HouseStyle)
ames$RoofStyle <- factor(ames$RoofStyle)
ames$RoofMatl <- factor(ames$RoofMatl)
ames$Exterior1st <- factor(ames$Exterior1st)
ames$Exterior2nd <- factor(ames$Exterior2nd)
ames$ExterQual <- factor(ames$ExterQual)
ames$ExterCond <- factor(ames$ExterCond)
ames$Foundation <- factor(ames$Foundation)
ames$Heating <- factor(ames$Heating)
ames$HeatingQC <- factor(ames$HeatingQC)
ames$CentralAir <- factor(ames$CentralAir)
ames$KitchenQual <- factor(ames$KitchenQual)
ames$Functional <- factor(ames$Functional)
ames$PavedDrive <- factor(ames$PavedDrive)
ames$SaleType <- factor(ames$SaleType)
ames$SaleCondition <- factor(ames$SaleCondition)
i <- 1
na <- 1
for (i in 1:length(ames)){
na[i] <- ifelse(sum(is.na(ames[i]))>0, sum(is.na(ames[i])), 0)
if (na[i]>0) { cat(names(ames[i]), '=', na[i],'
')}
}
## MSZoning = 4
## Utilities = 2
## Exterior1st = 1
## Exterior2nd = 1
## BsmtFinSF1 = 1
## BsmtFinSF2 = 1
## BsmtUnfSF = 1
## TotalBsmtSF = 1
## BsmtFullBath = 2
## BsmtHalfBath = 2
## KitchenQual = 1
## Functional = 2
## GarageCars = 1
## GarageArea = 1
## SaleType = 1
## SalePrice = 1459
mice_mod <- mice(ames[, c("MSZoning", "Utilities", "Exterior1st", "Exterior2nd", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath", "KitchenQual", "Functional", "GarageCars", "GarageArea", "SaleType")], method='pmm')
##
## iter imp variable
## 1 1 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 1 2 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 1 3 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 1 4 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 1 5 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 2 1 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 2 2 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 2 3 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 2 4 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 2 5 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 3 1 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 3 2 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 3 3 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 3 4 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 3 5 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 4 1 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 4 2 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 4 3 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 4 4 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 4 5 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 5 1 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 5 2 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 5 3 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 5 4 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
## 5 5 MSZoning Utilities Exterior1st Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath KitchenQual Functional GarageCars GarageArea SaleType
mice_complete <- complete(mice_mod)
ames$MSZoning = mice_complete$MSZoning
ames$LotFrontage = mice_complete$LotFrontage
ames$Utilities = mice_complete$Utilities
ames$Exterior1st = mice_complete$Exterior1st
ames$Exterior2nd = mice_complete$Exterior2nd
ames$BsmtFinSF1 = mice_complete$BsmtFinSF1
ames$BsmtFinSF2 = mice_complete$BsmtFinSF2
ames$BsmtUnfSF = mice_complete$BsmtUnfSF
ames$TotalBsmtSF = mice_complete$TotalBsmtSF
ames$BsmtFullBath = mice_complete$BsmtFullBath
ames$BsmtHalfBath = mice_complete$BsmtHalfBath
ames$KitchenQual = mice_complete$KitchenQual
ames$Functional = mice_complete$Functional
ames$GarageCars = mice_complete$GarageCars
ames$GarageArea = mice_complete$GarageArea
ames$SaleType = mice_complete$SaleType
i <- 1
na <- 1
for (i in 1:length(ames)){
na[i] <- ifelse(sum(is.na(ames[i]))>0, sum(is.na(ames[i])), 0)
if (na[i]>0) { cat(names(ames[i]), '=', na[i],'
')}
}
## SalePrice = 1459
train <- ames[ames$isTrain==1,]
test <- ames[ames$isTrain==0,]
train_indices <- createDataPartition(train$SalePrice, p = 0.8, list = FALSE)
training <- train[train_indices, ]
validation <- train[-train_indices, ]
test <- subset(test, select = -SalePrice)
forest <- randomForest(SalePrice ~ ., data=training, ntree = 400)
SalePrice <- predict(forest, newdata=test)
Id<-test$Id
to_csv <- data.frame(Id, SalePrice)
write_csv(to_csv, '605Final_4.csv')
Just checking if it worked:
testwithid = read.csv("605Final_4.csv", stringsAsFactors = F)
dim(testwithid)
## [1] 1459 2
I submitted 3 times. The first time something didn’t run correctly so
I did it again and got a 0.14616 and that was with 400 trees so I
thought I would try with 4000 trees because why not? With 4000 trees,
the model had a generally improved performance (0.14602) but it also
increased the computational cost of training and prediction and was
definitely not worth it.
Lets
take a look at how the model performed besides our Kaggle Score:
varImpPlot(forest, sort=TRUE, n.var=15,
type=NULL, class=NULL, scale=TRUE,
main= 'Dotchart of variable importance')
This makes perfect sense. The most important predictors in a model are
usually the variables most correlated with the independent variable.
My Model:
pred1 <- predict(forest, validation ,type = "response")
residuals <- validation$SalePrice - pred1
forest_pred <- data.frame("Predicted" = pred1, "Actual" = validation$SalePrice, "Residual" = residuals)
accuracy(pred1, validation$SalePrice)
## ME RMSE MAE MPE MAPE
## Test set -1741.781 23504.11 15490.34 -3.605621 9.740062
plot(pred1, validation$SalePrice, main = "Predicted vs. Actual SalePrice")
abline(0,1)
set.seed(123)
train_idx <- sample(nrow(train), 0.7 * nrow(train))
train_data <- train[train_idx, ]
test_data <- train[-train_idx, ]
target_col <- "SalePrice"
baseline_pred <- rep(mean(train_data[[target_col]]), nrow(test_data))
Baseline Model:
accuracy(baseline_pred, test_data$SalePrice)
## ME RMSE MAE MPE MAPE
## Test set -870.1055 78685.91 57195.77 -18.89121 37.33238
The regression model that I created showed an improvement over the baseline model that was established, as demonstrated above. However, despite the improvement over the baseline model, there is still potential for enhancement in the regression model. One way to increase the performance of the model could be to include some features and not others. Additionally, the model’s performance could be further optimized through hyperparameter tuning, which involves adjusting the model’s settings to achieve better performance, such as through cross-validation techniques. While the current regression model has shown an improvement over the baseline, there remains potential for further optimization and refinement of the model’s features and settings to achieve even better performance.