## corrplot 0.84 loaded
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## -- Attaching packages ------------------------ tidyverse 1.3.0 --
## v tibble 3.0.1 v purrr 0.3.4
## v tidyr 1.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
Computational Mathematics
Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Problem 1
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with mean μ and standard deviation σ both equal to (N+1)/2.
#random variable X
# Enter any number of your choosing greater than or equal to 6 for N
N = 80
n = 10000
X = round(runif(n, min=1, max=N))
hist(X)
Probability.
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)
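The code that generates Y and defines x and y does not appear in the rendered output. A minimal sketch consistent with the printed values below (the original run was not necessarily seeded, so re-running would give slightly different numbers):
# Y: 10,000 normal draws with mean = sd = (N+1)/2 (reconstructed; not shown above)
Y = rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)
# x is the median of X, y is the 1st quartile of Y
x = median(X)
y = quantile(Y, 0.25)
x
y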
## [1] 40
## 25%
## 13.05264
a. P(X>x | X>y)
# Conditional probability: P(X>x and X>y) / P(X>y)
Px_and_Py = sum(X>x & X>y)/n
Py = sum(X>y)/n
round(Px_and_Py/Py, 2)
## [1] 0.59
Given that X exceeds y (the first quartile of Y), the probability that X also exceeds its own median x is about 0.59.
b. P(X>x, Y>y)
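The code for this part is not shown in the rendered output; a sketch consistent with the printed value:
# Joint probability that X exceeds its median and Y exceeds its first quartile
round(sum(X > x & Y > y) / n, 2)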
## [1] 0.37
This is the joint probability that X exceeds its median and Y exceeds its first quartile; it is about 0.37.
c. P(X<x | X>y)
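Again the code is not shown; a sketch consistent with the printed value:
# Conditional probability P(X<x | X>y) = P(X<x and X>y) / P(X>y)
round(sum(X < x & X > y) / sum(X > y), 2)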
## [1] 0.39
Given that X exceeds y, the probability that X falls below its median is about 0.39 (roughly the complement of part a; the small gap is due to ties at the median, since X takes integer values).
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
#form the matrix
matrix = matrix( c(sum(X>x & Y<y),sum(X>x & Y>y), sum(X<x & Y<y),sum(X<x & Y>y)), nrow = 2,ncol = 2)
#add total
matrix = cbind(matrix,c(matrix[1,1]+matrix[1,2],matrix[2,1]+matrix[2,2]))
# add the Total row
matrix = rbind(matrix,c(matrix[1,1]+matrix[2,1],matrix[1,2]+matrix[2,2],matrix[1,3]+matrix[2,3]))
# convert into dataframe
matrix_df = as.data.frame(matrix)
# change the names of the columns and rows
names(matrix_df) = c("X>x","X<x", "Total")
row.names(matrix_df) = c("Y<y","Y>y", "Total")
# joint and marginal probabilities: divide every cell by the grand total
prob_matrix = matrix/matrix[3,3]
prob_matrix = as.data.frame(prob_matrix)
names(prob_matrix) = c("X>x","X<x", "Total")
row.names(prob_matrix) = c("Y<y","Y>y", "Total")
prob_matrix
## X>x X<x Total
## Y<y 0.1293653 0.1207612 0.2501265
## Y>y 0.3751392 0.3747343 0.7498735
## Total 0.5045045 0.4954955 1.0000000
The Total row and column contain the marginal probability distributions.
The joint probability distribution is given by the four interior cells of the table: (X>x, Y<y), (X>x, Y>y), (X<x, Y<y), and (X<x, Y>y).
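Reading off the table, P(X>x and Y>y) ≈ 0.3751, while P(X>x)·P(Y>y) ≈ 0.5045 × 0.7499 ≈ 0.3783, so the joint probability is very close to the product of the marginals, as we would expect for independent variables. A short check using the prob_matrix built above:
# Compare the joint probability with the product of the marginal probabilities
p_joint   <- prob_matrix["Y>y", "X>x"]                                   # P(X>x and Y>y)
p_product <- prob_matrix["Total", "X>x"] * prob_matrix["Y>y", "Total"]   # P(X>x) * P(Y>y)
round(c(joint = p_joint, product = p_product), 4)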
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher’s Exact Test is a non-parametric alternative to the Chi-Square test and is typically used when expected cell counts are small (below 5). Because the cell counts here are in the thousands, the Chi-Square test is the more appropriate choice in this scenario.
# Chi Square Test
CHI = chisq.test(matrix_df, correct=T)
# Fisher’s Exact Test
fisher.test(matrix_df, simulate.p.value=T)
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: matrix_df
## p-value = 0.6937
## alternative hypothesis: two.sided
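Note that the Chi-Square result is stored in CHI above but never printed, and both tests are run on matrix_df, which still includes the Total row and column. Strictly, the tests should be applied to the 2x2 interior counts only; a sketch of that alternative (results not shown here):
# Run both tests on the 2x2 table of counts, excluding the marginal totals
inner_counts <- as.matrix(matrix_df[1:2, 1:2])
chisq.test(inner_counts, correct = TRUE)
fisher.test(inner_counts)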
Problem 2
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Descriptive and Inferential Statistics.
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
#Provide univariate descriptive statistics and appropriate plots for the training data set.
training_dataset = read.csv("train.csv")
head(training_dataset)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
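The block that follows appears to be the head of the Kaggle test set, although the corresponding code is not shown in the rendered output; the presumed calls are:
# Read and preview the Kaggle test set (presumed; code not shown above)
test_dataset = read.csv("test.csv")
head(test_dataset)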
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461 20 RH 80 11622 Pave <NA> Reg
## 2 1462 20 RL 81 14267 Pave <NA> IR1
## 3 1463 60 RL 74 13830 Pave <NA> IR1
## 4 1464 60 RL 78 9978 Pave <NA> IR1
## 5 1465 120 RL 43 5005 Pave <NA> IR1
## 6 1466 60 RL 75 10000 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1 Lvl AllPub Inside Gtl NAmes Feedr Norm
## 2 Lvl AllPub Corner Gtl NAmes Norm Norm
## 3 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 4 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 5 HLS AllPub Inside Gtl StoneBr Norm Norm
## 6 Lvl AllPub Corner Gtl Gilbert Norm Norm
## BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1 1Fam 1Story 5 6 1961 1961 Gable
## 2 1Fam 1Story 6 6 1958 1958 Hip
## 3 1Fam 2Story 5 5 1997 1998 Gable
## 4 1Fam 2Story 6 6 1998 1998 Gable
## 5 TwnhsE 1Story 8 5 1992 1992 Gable
## 6 1Fam 2Story 6 5 1993 1994 Gable
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1 CompShg VinylSd VinylSd None 0 TA TA
## 2 CompShg Wd Sdng Wd Sdng BrkFace 108 TA TA
## 3 CompShg VinylSd VinylSd None 0 TA TA
## 4 CompShg VinylSd VinylSd BrkFace 20 TA TA
## 5 CompShg HdBoard HdBoard None 0 Gd TA
## 6 CompShg HdBoard HdBoard None 0 TA TA
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1 CBlock TA TA No Rec 468
## 2 CBlock TA TA No ALQ 923
## 3 PConc Gd TA No GLQ 791
## 4 PConc TA TA No GLQ 602
## 5 PConc Gd TA No ALQ 263
## 6 PConc Gd TA No Unf 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1 LwQ 144 270 882 GasA TA Y
## 2 Unf 0 406 1329 GasA TA Y
## 3 Unf 0 137 928 GasA Gd Y
## 4 Unf 0 324 926 GasA Ex Y
## 5 Unf 0 1017 1280 GasA Ex Y
## 6 Unf 0 763 763 GasA Gd Y
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1 SBrkr 896 0 0 896 0
## 2 SBrkr 1329 0 0 1329 0
## 3 SBrkr 928 701 0 1629 0
## 4 SBrkr 926 678 0 1604 0
## 5 SBrkr 1280 0 0 1280 0
## 6 SBrkr 763 892 0 1655 0
## BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1 0 1 0 2 1 TA
## 2 0 1 1 3 1 Gd
## 3 0 2 1 3 1 TA
## 4 0 2 1 3 1 Gd
## 5 0 2 0 2 1 Gd
## 6 0 2 1 3 1 TA
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1 5 Typ 0 <NA> Attchd 1961
## 2 6 Typ 0 <NA> Attchd 1958
## 3 6 Typ 1 TA Attchd 1997
## 4 7 Typ 1 Gd Attchd 1998
## 5 5 Typ 0 <NA> Attchd 1992
## 6 7 Typ 1 TA Attchd 1993
## GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1 Unf 1 730 TA TA Y
## 2 Unf 1 312 TA TA Y
## 3 Fin 2 482 TA TA Y
## 4 Fin 2 470 TA TA Y
## 5 RFn 2 506 TA TA Y
## 6 Fin 2 440 TA TA Y
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1 140 0 0 0 120 0 <NA>
## 2 393 36 0 0 0 0 <NA>
## 3 212 34 0 0 0 0 <NA>
## 4 360 36 0 0 0 0 <NA>
## 5 0 82 0 0 144 0 <NA>
## 6 157 84 0 0 0 0 <NA>
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1 MnPrv <NA> 0 6 2010 WD Normal
## 2 <NA> Gar2 12500 6 2010 WD Normal
## 3 MnPrv <NA> 0 3 2010 WD Normal
## 4 <NA> <NA> 0 6 2010 WD Normal
## 5 <NA> <NA> 0 1 2010 WD Normal
## 6 <NA> <NA> 0 4 2010 WD Normal
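The three values printed next appear to be the mean, median, and standard deviation of SalePrice; the presumed calls are:
# Univariate descriptive statistics for the dependent variable (presumed calls)
mean(training_dataset$SalePrice)
median(training_dataset$SalePrice)
sd(training_dataset$SalePrice)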
## [1] 180921.2
## [1] 163000
## [1] 79442.5
#linear model
training_dataset.lm = lm(training_dataset$SalePrice ~ training_dataset$BldgType)
summary(training_dataset.lm)
##
## Call:
## lm(formula = training_dataset$SalePrice ~ training_dataset$BldgType)
##
## Residuals:
## Min 1Q Median 3Q Max
## -150864 -50764 -14701 31861 569236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 185764 2238 83.009 < 2e-16 ***
## training_dataset$BldgType2fmCon -57332 14216 -4.033 5.80e-05 ***
## training_dataset$BldgTypeDuplex -52223 11068 -4.718 2.61e-06 ***
## training_dataset$BldgTypeTwnhs -49852 12128 -4.110 4.17e-05 ***
## training_dataset$BldgTypeTwnhsE -3804 7655 -0.497 0.619
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 78170 on 1455 degrees of freedom
## Multiple R-squared: 0.03453, Adjusted R-squared: 0.03188
## F-statistic: 13.01 on 4 and 1455 DF, p-value: 2.057e-10
# residual analysis to evaluate model quality
plot(fitted(training_dataset.lm), resid(training_dataset.lm))
# Data subset with selected columns
# Create a correlation matrix for selected quantitative variables
correlation_data<-dplyr::select(training_dataset, SalePrice, GrLivArea, LotArea, YearBuilt, GarageArea, FullBath, OverallQual)
correlation_matrix<-round(cor(correlation_data),4)
correlation_matrix
## SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice 1.0000 0.7086 0.2638 0.5229 0.6234 0.5607
## GrLivArea 0.7086 1.0000 0.2631 0.1990 0.4690 0.6300
## LotArea 0.2638 0.2631 1.0000 0.0142 0.1804 0.1260
## YearBuilt 0.5229 0.1990 0.0142 1.0000 0.4790 0.4683
## GarageArea 0.6234 0.4690 0.1804 0.4790 1.0000 0.4057
## FullBath 0.5607 0.6300 0.1260 0.4683 0.4057 1.0000
## OverallQual 0.7910 0.5930 0.1058 0.5723 0.5620 0.5506
## OverallQual
## SalePrice 0.7910
## GrLivArea 0.5930
## LotArea 0.1058
## YearBuilt 0.5723
## GarageArea 0.5620
## FullBath 0.5506
## OverallQual 1.0000
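The assignment also asks for a scatterplot matrix of at least two independent variables and the dependent variable; that plot is not shown above. A sketch using base R's pairs() on the same subset:
# Scatterplot matrix of the selected quantitative variables and SalePrice
pairs(correlation_data, pch = 19, cex = 0.3)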
# SalePrice vs GrLivArea
cor.test(correlation_data$SalePrice,correlation_data$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: correlation_data$SalePrice and correlation_data$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
We are 80% confident that the correlation between these two variables is between 0.6915087 and 0.7249450.
# SalePrice vs YearBuilt
cor.test(correlation_data$SalePrice,correlation_data$YearBuilt, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: correlation_data$SalePrice and correlation_data$YearBuilt
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4980766 0.5468619
## sample estimates:
## cor
## 0.5228973
We are 80% confident that the correlation between these two variables is between 0.4980766 and 0.5468619.
# SalePrice vs OverallQual
cor.test(correlation_data$SalePrice,correlation_data$OverallQual, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: correlation_data$SalePrice and correlation_data$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
We are 80% confident that the correlation between these two variables is between 0.7780752 and 0.8032204.
Familywise error is the chance of at least one false positive when several hypothesis tests are run together; at the 80% confidence level (α = 0.20 per test), that chance across these three tests is 1 - 0.8^3 ≈ 0.49. In practice we would not be worried here, because the observed p-values are all below 2.2e-16 and would remain significant under any multiplicity correction.
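A quick calculation of that familywise error rate:
# Probability of at least one false positive across three tests at alpha = 0.20
alpha <- 0.20
1 - (1 - alpha)^3   # = 0.488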
Linear Algebra and Correlation.
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
# Invert the correlation matrix to obtain the precision matrix
precision_matrix<-solve(correlation_matrix)
round(precision_matrix,4)
## SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice 4.2107 -1.5406 -0.4289 -0.6971 -0.5858 0.1904
## GrLivArea -1.5406 3.0440 -0.1666 1.1284 -0.2560 -1.2455
## LotArea -0.4289 -0.1666 1.1362 0.1012 -0.0829 0.0284
## YearBuilt -0.6971 1.1284 0.1012 2.0790 -0.4289 -0.7739
## GarageArea -0.5858 -0.2560 -0.0829 -0.4289 1.7684 0.0748
## FullBath 0.1904 -1.2455 0.0284 -0.7739 0.0748 2.1003
## OverallQual -1.7484 -0.3850 0.2909 -0.6512 -0.1656 -0.1706
## OverallQual
## SalePrice -1.7484
## GrLivArea -0.3850
## LotArea 0.2909
## YearBuilt -0.6512
## GarageArea -0.1656
## FullBath -0.1706
## OverallQual 3.1402
# Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
round(correlation_matrix %*% precision_matrix,4)
## SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice 1 0 0 0 0 0
## GrLivArea 0 1 0 0 0 0
## LotArea 0 0 1 0 0 0
## YearBuilt 0 0 0 1 0 0
## GarageArea 0 0 0 0 1 0
## FullBath 0 0 0 0 0 1
## OverallQual 0 0 0 0 0 0
## OverallQual
## SalePrice 0
## GrLivArea 0
## LotArea 0
## YearBuilt 0
## GarageArea 0
## FullBath 0
## OverallQual 1
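The second identity matrix below presumably comes from multiplying in the other order; the corresponding call is not shown, but would be:
# Multiply the precision matrix by the correlation matrix (presumed call)
round(precision_matrix %*% correlation_matrix, 4)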
## SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice 1 0 0 0 0 0
## GrLivArea 0 1 0 0 0 0
## LotArea 0 0 1 0 0 0
## YearBuilt 0 0 0 1 0 0
## GarageArea 0 0 0 0 1 0
## FullBath 0 0 0 0 0 1
## OverallQual 0 0 0 0 0 0
## OverallQual
## SalePrice 0
## GrLivArea 0
## LotArea 0
## YearBuilt 0
## GarageArea 0
## FullBath 0
## OverallQual 1
# Conduct LU decomposition on the matrix.
# lu.decomposition() is provided by the matrixcalc package
library(matrixcalc)
A <- correlation_matrix
luA <- lu.decomposition( A )
L <- luA$L
U <- luA$U
print( L )
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0000 0.00000000 0.00000000 0.0000000 0.00000000 0.00000000 0
## [2,] 0.7086 1.00000000 0.00000000 0.0000000 0.00000000 0.00000000 0
## [3,] 0.2638 0.15298947 1.00000000 0.0000000 0.00000000 0.00000000 0
## [4,] 0.5229 -0.34451044 -0.10612087 1.0000000 0.00000000 0.00000000 0
## [5,] 0.6234 0.05474899 0.01281817 0.2490577 1.00000000 0.00000000 0
## [6,] 0.5607 0.46735189 -0.06259710 0.3791760 -0.03146122 1.00000000 0
## [7,] 0.7910 0.06527076 -0.11737343 0.2411038 0.05102839 0.05433795 1
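The three blocks that follow appear to be the U factor, the product L %*% U (which reproduces the correlation matrix), and the original correlation matrix itself; the presumed calls, not shown in the rendered output, are:
# Presumed calls for the output below
print(U)
round(L %*% U, 4)
correlation_matrix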
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1 0.708600 2.638000e-01 0.5229000 0.62340000 0.56070000 0.79100000
## [2,] 0 0.497886 7.617132e-02 -0.1715269 0.02725876 0.23268798 0.03249740
## [3,] 0 0.000000 9.187562e-01 -0.0974992 0.01177678 -0.05751147 -0.10783756
## [4,] 0 0.000000 0.000000e+00 0.6571361 0.16366483 0.24917024 0.15843798
## [5,] 0 0.000000 0.000000e+00 0.0000000 0.56896710 -0.01790040 0.02903347
## [6,] 0 0.000000 -6.938894e-18 0.0000000 0.00000000 0.47822574 0.02598581
## [7,] 0 0.000000 3.770453e-19 0.0000000 0.00000000 0.00000000 0.31844707
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0000 0.7086 0.2638 0.5229 0.6234 0.5607 0.7910
## [2,] 0.7086 1.0000 0.2631 0.1990 0.4690 0.6300 0.5930
## [3,] 0.2638 0.2631 1.0000 0.0142 0.1804 0.1260 0.1058
## [4,] 0.5229 0.1990 0.0142 1.0000 0.4790 0.4683 0.5723
## [5,] 0.6234 0.4690 0.1804 0.4790 1.0000 0.4057 0.5620
## [6,] 0.5607 0.6300 0.1260 0.4683 0.4057 1.0000 0.5506
## [7,] 0.7910 0.5930 0.1058 0.5723 0.5620 0.5506 1.0000
## SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice 1.0000 0.7086 0.2638 0.5229 0.6234 0.5607
## GrLivArea 0.7086 1.0000 0.2631 0.1990 0.4690 0.6300
## LotArea 0.2638 0.2631 1.0000 0.0142 0.1804 0.1260
## YearBuilt 0.5229 0.1990 0.0142 1.0000 0.4790 0.4683
## GarageArea 0.6234 0.4690 0.1804 0.4790 1.0000 0.4057
## FullBath 0.5607 0.6300 0.1260 0.4683 0.4057 1.0000
## OverallQual 0.7910 0.5930 0.1058 0.5723 0.5620 0.5506
## OverallQual
## SalePrice 0.7910
## GrLivArea 0.5930
## LotArea 0.1058
## YearBuilt 0.5723
## GarageArea 0.5620
## FullBath 0.5506
## OverallQual 1.0000
# Compare L %*% U to the original correlation matrix
round(L %*% U ,4) == round(correlation_matrix,4)
## SalePrice GrLivArea LotArea YearBuilt GarageArea FullBath
## SalePrice TRUE TRUE TRUE TRUE TRUE TRUE
## GrLivArea TRUE TRUE TRUE TRUE TRUE TRUE
## LotArea TRUE TRUE TRUE TRUE TRUE TRUE
## YearBuilt TRUE TRUE TRUE TRUE TRUE TRUE
## GarageArea TRUE TRUE TRUE TRUE TRUE TRUE
## FullBath TRUE TRUE TRUE TRUE TRUE TRUE
## OverallQual TRUE TRUE TRUE TRUE TRUE TRUE
## OverallQual
## SalePrice TRUE
## GrLivArea TRUE
## LotArea TRUE
## YearBuilt TRUE
## GarageArea TRUE
## FullBath TRUE
## OverallQual TRUE
Calculus-Based Probability & Statistics.
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
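GrLivArea is right-skewed and strictly positive, and is the variable used below. The list printed next is the histogram object returned by hist(); the presumed call (hist() returns invisibly, so the object must have been printed explicitly) is:
# Histogram of the original right-skewed variable (presumed call)
print(hist(training_dataset$GrLivArea))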
## $breaks
## [1] 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000
##
## $counts
## [1] 3 228 554 461 144 52 12 2 2 1 0 1
##
## $density
## [1] 4.109589e-06 3.123288e-04 7.589041e-04 6.315068e-04 1.972603e-04
## [6] 7.123288e-05 1.643836e-05 2.739726e-06 2.739726e-06 1.369863e-06
## [11] 0.000000e+00 1.369863e-06
##
## $mids
## [1] 250 750 1250 1750 2250 2750 3250 3750 4250 4750 5250 5750
##
## $xname
## [1] "training_dataset$GrLivArea"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
fit = fitdistr(training_dataset$GrLivArea, "exponential")
# Histogram of the sim and the fit model
l<-fit$estimate
sim<- rexp(1000,l)
hist(sim, breaks = 50)
sim.df <- data.frame(length = sim)
fit.df <- data.frame(length = training_dataset$GrLivArea)
sim.df$from <- 'sim'
fit.df$from <- 'fit'
both.df <- rbind(sim.df,fit.df)
ggplot(both.df, aes(length, fill = from)) + geom_density(alpha = 0.2)
# Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution
quantile(sim, probs=c(0.05, 0.95))
## 5% 95%
## 72.18889 4289.87669
## 5% 95%
## 848.0 2466.1
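The values 848.0 and 2466.1 printed above appear to be the empirical 5th and 95th percentiles of GrLivArea, but the code for them, for the theoretical exponential percentiles, and for the 95% confidence interval assuming normality is not shown. A sketch of those remaining steps (the calls below are assumptions):
# Theoretical 5th and 95th percentiles from the fitted exponential CDF
qexp(c(0.05, 0.95), rate = l)
# 95% confidence interval for the mean of GrLivArea, assuming normality
t.test(training_dataset$GrLivArea, conf.level = 0.95)$conf.int
# Empirical 5th and 95th percentiles of the original data
quantile(training_dataset$GrLivArea, probs = c(0.05, 0.95))
Comparing the simulated exponential percentiles above (roughly 72 and 4290) with the empirical percentiles (848.0 and 2466.1) shows that the fitted exponential spreads the data much more widely than the observed GrLivArea distribution, so the exponential is only a rough approximation for this variable.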
Modeling.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
We will analyze the Kaggle training dataset to see which factors help predict the sale price of a home. The predictors used in the model below are above-grade living area (GrLivArea), lot area, overall quality, number of fireplaces, year built, bedrooms above grade, and a numeric neighborhood average-price variable derived from the data.
# Load dataset
df = read.csv('train.csv', header = T, na.strings = "NA")
df1 = read.csv('test.csv', header = T, na.strings = "NA")
head(df)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
#create a neighborhood numeric variable
Neighborhood_var <- aggregate(df[, 81], list(df$Neighborhood), mean)
colnames(Neighborhood_var)<- c("Neighborhood","Neighborhood_Average")
df <- merge(df,Neighborhood_var)
df1 <- merge(df1,Neighborhood_var)
# Fit the multiple regression model and summarize it
df_lm = lm(SalePrice ~ GrLivArea + LotArea + OverallQual+ Fireplaces + YearBuilt + BedroomAbvGr + Neighborhood_Average, data = df)
summary(df_lm)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + LotArea + OverallQual +
## Fireplaces + YearBuilt + BedroomAbvGr + Neighborhood_Average,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -363910 -18585 -1056 16636 274616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.132e+05 8.212e+04 -6.249 5.42e-10 ***
## GrLivArea 5.697e+01 3.006e+00 18.949 < 2e-16 ***
## LotArea 6.547e-01 1.010e-01 6.485 1.22e-10 ***
## OverallQual 1.755e+04 1.144e+03 15.342 < 2e-16 ***
## Fireplaces 7.106e+03 1.731e+03 4.105 4.28e-05 ***
## YearBuilt 2.248e+02 4.323e+01 5.200 2.28e-07 ***
## BedroomAbvGr -7.434e+03 1.453e+03 -5.116 3.53e-07 ***
## Neighborhood_Average 3.739e-01 2.483e-02 15.060 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36220 on 1452 degrees of freedom
## Multiple R-squared: 0.7931, Adjusted R-squared: 0.7921
## F-statistic: 795.3 on 7 and 1452 DF, p-value: < 2.2e-16
At this point, the p-values for all of the predictors are less than 0.05.
predicted_lm = predict(df_lm, newdata = df1)
# add predicted_lm prices to the test data
df1 = arrange(df1, Id)
df1$SalePrice = 0
df1$SalePrice = predicted_lm
#select the id and saleprices and write a csv file
selected_df1 = df1 %>%
dplyr::select(2, 82)
write.csv(selected_df1, file = "Test_House_Prices.csv", row.names = F)
Data Analysis
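The residual plot that the next sentence interprets is not shown in the rendered output; a presumed residuals-vs-fitted call for the final model:
# Residuals versus fitted values for the final model (presumed call)
plot(fitted(df_lm), resid(df_lm))
abline(h = 0, lty = 2)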
Most of the residuals are distributed evenly around zero.
# plot a normal qqpplot and see where the residual points fall on the line
qqnorm(resid(df_lm))
qqline(resid(df_lm))
Most of the points follow the line, although there are some outliers at the tails; the residuals show a slight right skew.
Conclusion
The p-values for all of the variables are less than 0.05, so we do not need to run a backward elimination process.
The Adjusted R-squared is 0.7921 on 1452 residual degrees of freedom.
As someone who has bought a property, I can attest that factors like the ones in this model (living area, lot size, overall quality, year built, and neighborhood) drive real-estate prices. One factor that is in the dataset but not in this model is the overall condition of the home (OverallCond), and factors outside the dataset, such as local market conditions at the time of sale, also affect sale prices.
Kaggle.com username: tony #2 score : 0.58291