Probability Density 1: X~Gamma. Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely describe by n (a size parameter) and lambda (λ , a shape parameter). Choose any n greater 3 and an expected value (λ) between 2 and 10 (you choose).
*Note: Assuming in the above prompt that the n parameter is supposed to refer to the shape (alpha) and the lambda is referencing the rate (beta)
Probability Density 2: Y~Sum of Exponentials. Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (λ). The n and λ must be the same as in the previous case. (e.g., mysum=rexp(10000,λ)+rexp(10000,λ)+..)
Probability Density 3: Z~ Exponential. Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (λ).
set.seed(45)
lambda <- 6
alpha <- 4
n <-10000
X <- rgamma(n,alpha,lambda)
Y <- rexp(n,lambda)+rexp(n,lambda)+rexp(n,lambda)+rexp(n,lambda)
Z<- rexp(n,lambda)
1a: Calculate the empirical expected value (means) and variances of all three pdfs.
sprintf('Gamma Distribution: The mean of the X random variables is: %f, and the variance is: %f',mean(X),var(X))
## [1] "Gamma Distribution: The mean of the X random variables is: 0.664633, and the variance is: 0.114240"
#alpha/lambda,alpha/lambda^2
sprintf('Sum of Exponential Distribution: The mean of the X random variables is: %f, and the variance is: %f',mean(Y),var(Y))
## [1] "Sum of Exponential Distribution: The mean of the X random variables is: 0.666293, and the variance is: 0.109779"
#alpha/lambda,alpha/lambda^2
sprintf('Exponential Distribution: The mean of the X random variables is: %f, and the variance is: %f',mean(Z),var(Z))
## [1] "Exponential Distribution: The mean of the X random variables is: 0.166306, and the variance is: 0.027380"
#1/lambda,2/lambda^2
1b. Using calculus, calculate the expected value and variance of the Gamma pdf (X).
$$
sprintf('Expected Value approximation: %f, which is very close to the empirical calculation',alpha/lambda)
## [1] "Expected Value approximation: 0.666667, which is very close to the empirical calculation"
$$
sprintf('Variance approximation: %f, which is very close to the empirical calculation',alpha/lambda^2)
## [1] "Variance approximation: 0.111111, which is very close to the empirical calculation"
Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y)
$$ \[\begin{aligned} M(t) = E[e^{tX}] \\ \int_{0}^{\infty} e^{tx}\lambda e^{-\lambda x} dx \\ \lambda \int_{0}^{\infty} e^{t-\lambda x} dx \\ \frac{\lambda}{t- \lambda } \left. e^{(t-\lambda x)} \right|_{0}^{\infty} \\ \frac{\lambda}{\lambda - t} \\ E(X^n) = M^n_X(0) = \frac{d^n M_x}{dt^n}(0) \\ dt = \frac{\lambda}{(\lambda - t)^2} \end{aligned}\]$$
Taking the first moment of the function using the derivative calculated above for the first moment:
sprintf('The expected value of the first moment is: %f (very close approximation of other methods)',lambda/(lambda - 0)^2)
## [1] "The expected value of the first moment is: 0.166667 (very close approximation of other methods)"
Using this source as a reference for the sum of exponential moment generating function:
\(M_{x_1+x_2}= \frac{\lambda_1 \lambda_2}{(\lambda_1 -t)(\lambda_2 -t)}\)
In this case the lambda is the same value across the different random variables so we can take the nth power (alpha times) the moment function
$$ \[\begin{aligned} (\frac{\lambda}{\lambda - t})^\alpha = (\frac{6}{6 - t})^4 \\ M'(t) = \frac{4*6^4}{(6-t)^5} \end{aligned}\]$$
sprintf('The expected value of the first moment is: %f (very close approximation of other methods)',(4*lambda^4)/(lambda - 0)^5)
## [1] "The expected value of the first moment is: 0.666667 (very close approximation of other methods)"
1c-e. Probability. For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds. a. P(Z> λ| Z> λ/2) b. P(Z> 2λ | Z> λ) b. P(Z>3λ | Z> λ)
$$ \[\begin{aligned} P(Z>\lambda|Z>\lambda/2) = \frac{P(Z\leq \lambda \cap Z>\lambda/2 )}{P( Z>\lambda/2)} = \frac{P(\lambda/2 < Z \leq \lambda)}{P( Z>\lambda/2)} \\ \frac{e^{\frac{\lambda}{2}}-e^\lambda}{e^{\frac{\lambda}{2}}} \end{aligned}\]$$
(exp(-lambda/2) - exp(-lambda))/exp(-lambda/2)
## [1] 0.9502129
1-exp(-lambda/2)
## [1] 0.9502129
(exp(-lambda) - exp(-2*lambda))/exp(-lambda)
## [1] 0.9975212
1-exp(-lambda)
## [1] 0.9975212
(exp(-lambda) - exp(-3*lambda))/exp(-lambda)
## [1] 0.9999939
1-exp(-2*lambda)
## [1] 0.9999939
Loosely investigate whether P(YZ) = P(Y) P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities.
y_quant <- quantile(Y,probs=c(0.25,0.5,0.75,1))
z_quant <- quantile(Z,probs=c(0.25,0.5,0.75,1))
comb_prob <- as.matrix(z_quant) %*% t(as.matrix(y_quant))
prob_tbl <- rbind(cbind(comb_prob/sum(comb_prob),c(t(t(rowSums(comb_prob/sum(comb_prob)))),1)),colSums(comb_prob/sum(comb_prob)))
## Warning in cbind(comb_prob/sum(comb_prob), c(t(t(rowSums(comb_prob/
## sum(comb_prob)))), : number of rows of result is not a multiple of vector length
## (arg 2)
## Warning in rbind(cbind(comb_prob/sum(comb_prob), c(t(t(rowSums(comb_prob/
## sum(comb_prob)))), : number of columns of result is not a multiple of vector
## length (arg 2)
#validation
for (x in 1:4){
print(sum(prob_tbl[x,1:4])==prob_tbl[x,5])
print(sum(prob_tbl[1:4,x])==prob_tbl[5,x])
}
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
ind_test <- comb_prob
fisher.test(ind_test)
## Warning in fisher.test(ind_test): 'x' has been rounded to integer: Mean relative
## difference: 0.2727586
##
## Fisher's Exact Test for Count Data
##
## data: ind_test
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(ind_test)
## Warning in chisq.test(ind_test): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: ind_test
## X-squared = 7.6258e-32, df = 9, p-value = 1
frmt_mat <- cbind(t(t(y_quant)),t(t(z_quant)))
fisher.test(frmt_mat)
## Warning in fisher.test(frmt_mat): 'x' has been rounded to integer: Mean relative
## difference: 0.3324803
##
## Fisher's Exact Test for Count Data
##
## data: frmt_mat
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(frmt_mat)
## Warning in chisq.test(frmt_mat): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: frmt_mat
## X-squared = 0.32992, df = 3, p-value = 0.9543
Both tests indicate based on the p-value that you fail to reject the null hypothesis and the two distributions appear to be independent on both the quantiles merged together and the joint probability table. Both were used as the requested test’s are designed to compare two variables against one another for independence to see if a slightly alternative input would impact the results. Fisher’s Exact Test determines relationships that exist between two categorical variables that are not explained by randomness; however, the initial test was designed to assess 2 x 2 contingency tables and can often be computationally expensive with larger matrices. The Chi Square Test is also used for this same purpose and does not have the same potential performance constraints which is why it is more appropriate in this case to evaluate the question on independence.
Kaggle Housing Price Predictions
train_df <- read.csv('https://raw.githubusercontent.com/jforster19/Data605/main/House_Prices_Prediction_train.csv')
head(train_df)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
summary(train_df)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
hist(train_df$BedroomAbvGr)
hist(train_df$FullBath)
ggplot(train_df,aes(x=YearBuilt)) +
geom_bar()
ggplot(train_df,aes(x=BedroomAbvGr,y=SalePrice)) +
geom_point() +
labs(title='Comparing Sales Price of Homes Against Number of Bedrooms') +
xlab('Bedrooms Above Ground')
ggplot(train_df,aes(x=GrLivArea,y=SalePrice))+
geom_point()
ggplot(train_df,aes(x=OverallQual,y=SalePrice))+
geom_point()
ggplot(train_df,aes(x=PoolQC))+
geom_bar() +
geom_text(stat='count', aes(label=..count..),vjust=-0.25)
ggplot(train_df,aes(MiscFeature))+
geom_bar() +
labs(title='Review of Frequency of Miscellaneous Features of Property')
ggplot(train_df,aes(SaleType))+
geom_bar() +
labs(title='Review of Sale Type Frequency') +
geom_text(stat='count', aes(label=..count..),vjust=-0.25)
train_df <- train_df |> mutate(age=2022-YearBuilt)
ggplot(train_df,aes(x=age,y=SalePrice))+
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
corr_df <- train_df |> dplyr::select(c(age,BedroomAbvGr,X1stFlrSF))
cor_mat <- cor(corr_df,method='pearson')
pairs(corr_df)
first_cor_test <- cor.test(corr_df$age,corr_df$X1stFlrSF,conf.level = 0.8)
first_cor_test
##
## Pearson's product-moment correlation
##
## data: corr_df$age and corr_df$X1stFlrSF
## t = -11.223, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.3125892 -0.2507977
## sample estimates:
## cor
## -0.2819859
sec_cor_test <- cor.test(corr_df$age,corr_df$BedroomAbvGr,conf.level = 0.8)
sec_cor_test
##
## Pearson's product-moment correlation
##
## data: corr_df$age and corr_df$BedroomAbvGr
## t = 2.7045, df = 1458, p-value = 0.006921
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.03717773 0.10396633
## sample estimates:
## cor
## 0.07065122
third_cor_test <- cor.test(corr_df$BedroomAbvGr,corr_df$X1stFlrSF,conf.level = 0.8)
third_cor_test
##
## Pearson's product-moment correlation
##
## data: corr_df$BedroomAbvGr and corr_df$X1stFlrSF
## t = 4.9046, df = 1458, p-value = 1.041e-06
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.09424207 0.16027708
## sample estimates:
## cor
## 0.1274007
The correlation test results indicate to reject the null hypothesis given a P-Value below 0.05 for each pairwise comparison. The confidence interval for age vs 1st floor sq ft is (-0.3125892, -0.2507977) , age vs bedroooms above ground is ( 0.03717773, 0.10396633) and bedrooms above ground floor vs 1st floor sq ft (0.09424207,0.16027708) also substantiates that there is a correlation (i.e. that the correlation is not 0) given that 80% of samples taken to compare these two variables would not be expected to contain zero and the true correlation average is not zero. Family wise error, type 1 error rate, does not appear to be that worrying because the p-values provided in each test are far below the alpha threshold which should indicate there is less of a chance that the interpretation is incorrect. Also, in this case there aren’t that many risks to rejecting the null hypothesis as the visual eye test seems to indicate some type of linear relationship irrespective of their strength.
if (det(cor_mat)==0){
print('Matrix cannot be inverted')
}else{
prec_mat <- solve(cor_mat)
}
decomb_mat <- cor_mat %*% prec_mat %*% cor_mat
decomb_mat
## age BedroomAbvGr X1stFlrSF
## age 1.00000000 0.07065122 -0.2819859
## BedroomAbvGr 0.07065122 1.00000000 0.1274007
## X1stFlrSF -0.28198586 0.12740075 1.0000000
str(lum <- lu(decomb_mat))
## Formal class 'denseLU' [package "Matrix"] with 4 slots
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:3] "age" "BedroomAbvGr" "X1stFlrSF"
## .. ..$ : chr [1:3] "age" "BedroomAbvGr" "X1stFlrSF"
## ..@ x : num [1:9] 1 0.0707 -0.282 0.0707 0.995 ...
## ..@ perm : int [1:3] 1 2 3
## ..@ Dim : int [1:2] 3 3
elu <- expand(lum)
elu$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.00000000 . .
## [2,] 0.07065122 1.00000000 .
## [3,] -0.28198586 0.14806246 1.00000000
elu$U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.00000000 0.07065122 -0.28198586
## [2,] . 0.99500841 0.14732339
## [3,] . . 0.89867091
ggplot(train_df,aes(x=OpenPorchSF)) + #GarageArea WoodDeckSF
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sum(is.na(train_df$OpenPorchSF))
## [1] 0
fit_exp <- MASS::fitdistr(train_df$OpenPorchSF,'exponential')
fit_exp$estimate
## rate
## 0.02143151
exp_hist_data <- rep(sample(rexp(1000,fit_exp$estimate),100),1000)
par(mfrow=c(1,1))
par(mfrow=c(1,2))
hist(train_df$OpenPorchSF)
hist(exp_hist_data)
The shapes of distributions overall do look fairly similar although there is less right skewness from the transformed exponential and higher concentrations of values closer to zero.
qexp(0.05,fit_exp$estimate)
## [1] 2.393359
qexp(0.95,fit_exp$estimate)
## [1] 139.7817
len_exp <- length(train_df$OpenPorchSF)
sd_exp <- sd(train_df$OpenPorchSF)
std_err_exp <- sd_exp/sqrt(len_exp)
error_rate <- 0.05
t_var <- qt(p=error_rate/2,df=len_exp-1,lower.tail=F)
margin_error <- t_var * std_err_exp
paste0('Manually calculated 95% confidence interval: (',t_var-margin_error,",",t_var+margin_error,")")
## [1] "Manually calculated 95% confidence interval: (-1.4398070004854,5.36298953704289)"
#Empirical Percentiles
quantile(train_df$OpenPorchSF,c(0.05,0.95))
## 5% 95%
## 0.00 175.05
The alternative methods of estimating the 5th and 95th percentiles were not that substantially different, but the reason the confidence interval includes zero is likely because of the large concentration of 0 values where houses did not have porches in the data set. The fit distribution function also determined a lambda that was fairly close to zero so it is not that surprising that the mean value would estimate zero. When comparing the exponential 95th percentile against the original data there is some variance as the exponential transformation slightly minimized the effects of the skew leading to a smaller value as compared to the input data source.
quality <- c('Ex','Gd','TA','Fa','Po')
kitchen_rating <- c(0.9,0.75,0.5,0.25,0.1)
kitch_df <- data.frame(quality,kitchen_rating)
train_df <- train_df |>
mutate(new_home= ifelse(SaleType=='New',1,0),
court_officer=ifelse(SaleType=='COD',1,0),
has_porch=ifelse(OpenPorchSF==0 & EnclosedPorch==0,0,1),
single_fam = ifelse(BldgType=='1Fam',1,0),
twoFmCon = ifelse(BldgType=='2fmCon',1,0),
house_unf = ifelse(str_detect(HouseStyle, 'Unf'),1,0),
central_air = ifelse(CentralAir=='Yes',1,0),
functional_num = ifelse(Functional=='Typ',1,ifelse(str_detect(Functional,'Min'),0.5,ifelse(Functional=='Mod',0,ifelse(str_detect(Functional,'Major'),-0.5,-1)))),
kitchen_rating = ifelse(KitchenQual=='Ex',0.9,ifelse(KitchenQual=='Gd',0.75,ifelse(KitchenQual=='TA',0.5,ifelse(KitchenQual=='Fa',0.25,0.1))))
)
lm.out <- lm(SalePrice ~ BedroomAbvGr +age^2 +GrLivArea+ X2ndFlrSF+ GarageArea+ OverallCond + OverallQual + LotArea+ new_home + court_officer +single_fam +TotalBsmtSF +LowQualFinSF + kitchen_rating + functional_num,data=train_df)
summary(lm.out)
##
## Call:
## lm(formula = SalePrice ~ BedroomAbvGr + age^2 + GrLivArea + X2ndFlrSF +
## GarageArea + OverallCond + OverallQual + LotArea + new_home +
## court_officer + single_fam + TotalBsmtSF + LowQualFinSF +
## kitchen_rating + functional_num, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -541573 -17542 -1634 13515 277980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.158e+04 8.987e+03 -10.191 < 2e-16 ***
## BedroomAbvGr -8.318e+03 1.475e+03 -5.639 2.05e-08 ***
## age -3.887e+02 4.759e+01 -8.167 6.81e-16 ***
## GrLivArea 7.243e+01 4.854e+00 14.921 < 2e-16 ***
## X2ndFlrSF -1.365e+01 4.731e+00 -2.886 0.003965 **
## GarageArea 2.779e+01 5.915e+00 4.698 2.88e-06 ***
## OverallCond 4.532e+03 9.690e+02 4.677 3.18e-06 ***
## OverallQual 1.620e+04 1.184e+03 13.688 < 2e-16 ***
## LotArea 5.495e-01 1.008e-01 5.452 5.87e-08 ***
## new_home 1.795e+04 3.708e+03 4.841 1.43e-06 ***
## court_officer -1.621e+04 5.581e+03 -2.905 0.003727 **
## single_fam 1.619e+04 2.662e+03 6.081 1.52e-09 ***
## TotalBsmtSF 1.342e+01 4.030e+00 3.329 0.000893 ***
## LowQualFinSF -4.268e+01 2.023e+01 -2.110 0.035038 *
## kitchen_rating 4.387e+04 8.933e+03 4.911 1.01e-06 ***
## functional_num 1.308e+04 3.650e+03 3.584 0.000350 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35620 on 1444 degrees of freedom
## Multiple R-squared: 0.801, Adjusted R-squared: 0.7989
## F-statistic: 387.5 on 15 and 1444 DF, p-value: < 2.2e-16
When reviewing the model summary detail there was one initial item that might be slightly concerning and cause issues for the model. The residual spread which can be further reviewed by the diagnostic plots does not seem to be centered at zero although the remainder of the residuals show somewhat even spread at the interquartile range. The F-statistic confirms that the there is some significance to this linear model and the R-Squared adjusted seems to indicate that just under 80% of the variability in sales price is explained by the independent variables.
subset_df <- train_df |> dplyr::select(c(SalePrice,BedroomAbvGr,age,GrLivArea,X2ndFlrSF,GarageArea,OverallCond,OverallQual,LotArea,TotalBsmtSF,LowQualFinSF,kitchen_rating))
ResourceSelection::kdepairs(subset_df[,c(1:10)])
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
vif(lm.out)
## BedroomAbvGr age GrLivArea X2ndFlrSF GarageArea
## 1.664722 2.375276 7.481039 4.902881 1.838649
## OverallCond OverallQual LotArea new_home court_officer
## 1.336816 3.080926 1.164042 1.211424 1.024345
## single_fam TotalBsmtSF LowQualFinSF kitchen_rating functional_num
## 1.120262 3.594789 1.112440 2.117113 1.113141
Best practice indicates that any Variance Inflation factor over 5 is likely subject to multicollinearity and therefore the model will be subsequently revised to account for this issue.
lm2.out <- lm(SalePrice ~ BedroomAbvGr +age^2 + X2ndFlrSF+ GarageArea+ OverallCond + OverallQual + LotArea+ new_home + court_officer +single_fam +TotalBsmtSF +LowQualFinSF + kitchen_rating + functional_num,data=train_df)
summary(lm2.out)
##
## Call:
## lm(formula = SalePrice ~ BedroomAbvGr + age^2 + X2ndFlrSF + GarageArea +
## OverallCond + OverallQual + LotArea + new_home + court_officer +
## single_fam + TotalBsmtSF + LowQualFinSF + kitchen_rating +
## functional_num, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -514051 -19267 -2830 12713 314852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.239e+04 9.651e+03 -9.573 < 2e-16 ***
## BedroomAbvGr -1.560e+03 1.508e+03 -1.035 0.300984
## age -2.464e+02 5.007e+01 -4.921 9.58e-07 ***
## X2ndFlrSF 4.091e+01 3.223e+00 12.694 < 2e-16 ***
## GarageArea 4.815e+01 6.181e+00 7.791 1.26e-14 ***
## OverallCond 3.794e+03 1.039e+03 3.651 0.000271 ***
## OverallQual 1.975e+04 1.245e+03 15.861 < 2e-16 ***
## LotArea 8.042e-01 1.067e-01 7.538 8.41e-14 ***
## new_home 1.854e+04 3.982e+03 4.656 3.53e-06 ***
## court_officer -1.096e+04 5.982e+03 -1.833 0.067058 .
## single_fam 1.273e+04 2.848e+03 4.469 8.47e-06 ***
## TotalBsmtSF 5.172e+01 3.337e+00 15.496 < 2e-16 ***
## LowQualFinSF 2.196e+01 2.122e+01 1.035 0.300972
## kitchen_rating 5.610e+04 9.553e+03 5.872 5.32e-09 ***
## functional_num 1.831e+03 3.836e+03 0.477 0.633139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38260 on 1445 degrees of freedom
## Multiple R-squared: 0.7703, Adjusted R-squared: 0.7681
## F-statistic: 346.2 on 14 and 1445 DF, p-value: < 2.2e-16
Given that certain variables are not considered significant after this change it is necessary to test the impact to the model using backward testing.
lm3.out <- lm(SalePrice ~ age^2 + X2ndFlrSF+ GarageArea+ OverallCond + OverallQual + LotArea+ new_home +single_fam +TotalBsmtSF + kitchen_rating,data=train_df)
summary(lm3.out)
##
## Call:
## lm(formula = SalePrice ~ age^2 + X2ndFlrSF + GarageArea + OverallCond +
## OverallQual + LotArea + new_home + single_fam + TotalBsmtSF +
## kitchen_rating, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -509332 -19364 -2599 13022 315897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.647e+04 8.385e+03 -11.506 < 2e-16 ***
## age -2.449e+02 4.882e+01 -5.017 5.88e-07 ***
## X2ndFlrSF 3.942e+01 2.733e+00 14.425 < 2e-16 ***
## GarageArea 4.752e+01 6.170e+00 7.703 2.45e-14 ***
## OverallCond 3.894e+03 1.022e+03 3.812 0.000144 ***
## OverallQual 1.999e+04 1.236e+03 16.168 < 2e-16 ***
## LotArea 8.051e-01 1.067e-01 7.546 7.87e-14 ***
## new_home 1.910e+04 3.975e+03 4.807 1.69e-06 ***
## single_fam 1.242e+04 2.822e+03 4.401 1.16e-05 ***
## TotalBsmtSF 5.073e+01 3.238e+00 15.669 < 2e-16 ***
## kitchen_rating 5.787e+04 9.485e+03 6.101 1.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38280 on 1449 degrees of freedom
## Multiple R-squared: 0.7694, Adjusted R-squared: 0.7678
## F-statistic: 483.6 on 10 and 1449 DF, p-value: < 2.2e-16
The residuals remains somewhat distant from averaging zero, but the adjusted r-squared hasn’t changed dramatically with the substitutions from the earlier version of the model and all of the p-values of the dependent variables are significant.
par(mfrow=c(2,2))
plot(lm3.out)
The residuals are still somewhat far from zero and there is somewhat of pattern in the distribution across the fitted values and that is a potential concern of a violation of this assumption. The normal QQ plot follows a normal approximation for the majority of values; however, the third quantile appear to diverge more substantially
vif(lm3.out)
## age X2ndFlrSF GarageArea OverallCond OverallQual
## 2.164644 1.417250 1.732802 1.287010 2.911483
## LotArea new_home single_fam TotalBsmtSF kitchen_rating
## 1.129041 1.205520 1.089813 2.009204 2.066997
The variance factors have lowered across the board after revising the model and are no longer of concern.
test_df <- read.csv('https://raw.githubusercontent.com/jforster19/Data605/main/House_Prices_Prediction_test.csv')
dim(test_df)
## [1] 1459 80
test_df <- test_df |>
mutate(age=2022-YearBuilt, new_home= ifelse(is.na(SaleType),0,ifelse(SaleType=='New',1,0)),
GarageArea = ifelse(is.na(GarageArea),0,GarageArea),
TotalBsmtSF = ifelse(is.na(TotalBsmtSF),0,TotalBsmtSF),
single_fam = ifelse(BldgType=='1Fam',1,0),
twoFmCon = ifelse(BldgType=='2fmCon',1,0),
house_unf = ifelse(str_detect(HouseStyle, 'Unf'),1,0),
central_air = ifelse(CentralAir=='Yes',1,ifelse(is.na(CentralAir),0,0)),
functional_num = ifelse(is.na(Functional),0,ifelse(Functional=='Typ',1,ifelse(str_detect(Functional,'Min'),0.5,ifelse(Functional=='Mod',0,ifelse(str_detect(Functional,'Major'),-0.5,-1))))),
kitchen_rating = ifelse(is.na(KitchenQual),0.5,ifelse(KitchenQual=='Ex',0.9,ifelse(KitchenQual=='Gd',0.75,ifelse(KitchenQual=='TA',0.5,ifelse(KitchenQual=='Fa',0.25,0.1))))))
predictions_lm <- cbind(test_df$Id,predict(lm3.out,test_df,interval='prediction'))
sum(is.na(test_df))
## [1] 6998
sapply(test_df, function(x) sum(is.na(x)))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 4 227 0
## Street Alley LotShape LandContour Utilities
## 0 1352 0 0 2
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 1 1
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 16 15 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 44 45 44 42 1
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 42 1 1 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 0 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 2 2 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 1 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 2 0 730 76 78
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 78 1 0 78 78
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1456 1169 1408
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 1 0
## age new_home single_fam twoFmCon house_unf
## 0 0 0 0 0
## central_air functional_num kitchen_rating
## 0 0 0
head(predictions_lm)
## fit lwr upr
## 1 1461 142049.0 66771.36 217326.6
## 2 1462 180713.1 105478.15 255948.0
## 3 1463 166932.9 91638.32 242227.5
## 4 1464 200849.3 125622.15 276076.4
## 5 1465 211985.6 136609.29 287361.9
## 6 1466 180022.3 104764.47 255280.2
colnames(predictions_lm) <- c('Id','SalePrice','lwr','upr')
output_df <- as_tibble(predictions_lm )|> dplyr::select(c('Id','SalePrice'))
sapply(output_df, function(x) sum(is.na(x)))
## Id SalePrice
## 0 0
head(output_df)
## # A tibble: 6 × 2
## Id SalePrice
## <dbl> <dbl>
## 1 1461 142049.
## 2 1462 180713.
## 3 1463 166933.
## 4 1464 200849.
## 5 1465 211986.
## 6 1466 180022.
write_delim(output_df,'house_price_predictions.csv',delim=',')