Using R, generate a variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=\frac{(N+1)}{2}\).
set.seed(525)
r = 10000
X = runif(r, 1, 8)
mu = (8+1)/2
sigma = mu
Y = rnorm(r, mean = mu, sd = sigma)
Calculate as a minimum the below probabilities. Assume the smaller letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
This is equal to \(\frac {P(X > x \ \& \ X > y)}{P(X > y)}\).
This is equal to \(P(X > x)*P(Y > y)\).
This is equal to \(\frac {P(X < x \ \& \ X > y)}{P(X > y)}\).
x = median(X)
y = quantile(Y, 1/4)
# Part 1: P(X > x | X > y) is
P1 = (length(X[X>x & X>y])/length(X)) /(length(X[X>y])/length(X))
P1
## [1] 0.5343593
# Part 2: P(X > x, Y > y) is
P2 = (length(X[X>x]) / length(X))*(length(Y[Y>y]) / length(Y))
P2
## [1] 0.375
# Part 3: P(X < x | X > y) is
P3 = (length(X[X<x & X>y])/length(X))/(length(X[X>y])/length(X))
P3
## [1] 0.4656407
Investigate whether \(P(X > x \ \&\ Y > y) = P(X > x)P(Y > y)\) by building a table and evaluating the marginal and joint probabilities.
a = length(X[X > x & Y > y])/10000 # P(X > x & Y > y)
b = length(X[X > x & Y < y])/10000 # P(X > x & Y < y)
c = length(X[X < x & Y > y])/10000 # P(X < x & Y > y)
d = length(X[X < x & Y < y])/10000 # P(X < x & Y < y)
ptable = matrix(c(a,b,a+b,c,d,c+d,a+c,b+d,a+b+c+d), nrow = 3)
colnames(ptable) = c('P(X > x)', 'P(X < x)', 'marginal')
rownames(ptable) = c('P(Y > y)', 'P(Y < y)', 'marginal')
ptable = as.table(ptable)
ptable
## P(X > x) P(X < x) marginal
## P(Y > y) 0.377 0.373 0.750
## P(Y < y) 0.123 0.127 0.250
## marginal 0.500 0.500 1.000
# a is P(X > x & Y > y)
a
## [1] 0.377
# P(X > x)P(Y > y)
P2
## [1] 0.375
From the above, it is clear that the probabilities are very close to each other, 0.377 and 0.375 receptively.
Check to see if independence holds by using Fisher’s Exact Test and the Chi-Square Test. What is the difference between the two? Which is most appropriate?
Both the Fisher’s Exact Test and the Chi-Square Test are test of independence such that the probability distribution of one variable does not affect the presence of another.
\(H_{0} =\) the proportions at one variable are the same for different values of the second variable.
ptable = as.table(matrix(c(a,b,c,d)*10000, nrow = 2))
fisher.test(ptable)
##
## Fisher's Exact Test for Count Data
##
## data: ptable
## p-value = 0.3678
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9522537 1.1436981
## sample estimates:
## odds ratio
## 1.043608
The p-value of 0.3678 for the test does not provide any evidence against the assumption of independence. This means that we cannot confidently claim any difference in the two sets of data.
chisq.test(ptable)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: ptable
## X-squared = 0.8112, df = 1, p-value = 0.3678
The p-value of 0.3678 is greater than the 0.05 significance level, we do not reject the null hypothesis that the X dataset is independent of the Y dataset.
The chi-squared test applies an approximation assuming the sample is large, while the Fisher’s exact test is an exact significant test for small sample sizes. In the Fisher’s test, a false rejection equals to the significance level of the test, which is not necessarily true for approximate tests such as the chi-squared test. In short, Fisher’s exact test relies on computing the p-value according to the hypergeometric distribution using binomial coefficients. Since the computed factorials can become very large, Fisher’s exact test may not work for large sample sizes.\(^1\)
\(^1\) Kim, Hae-Young. “Statistical notes for clinical researchers: Chi-squared test and Fisher’s exact test.” Restorative dentistry & endodontics vol. 42,2 (2017): 152-155. doi:10.5395/rde.2017.42.2.152
House Prices: Advanced Regression Techniques is a Kaggle Challenge with a dataset that proves 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.
There is a training (n = 1460) and a test (n = 1459) dataset. Some of the variables of interested are listed below. Briefly, there is an interest in the square feet of the porch, lot, and garage.
| Variables | Descriptions |
|---|---|
| MSZoning | Identifies the general zoning classification of the sale |
| LotFrontage | Linear feet of street connected to property |
| LotArea | Lot size in square feet |
| WoodDeckSF | Wood deck area in square feet |
| OpenPorchSF | Open porch area in square feet |
| EnclosedPorch | Enclosed porch area in square feet |
| 3SsnPorch | Three season porch area in square feet |
| ScreenPorch | Screen porch area in square feet |
| PoolArea | Pool area in square feet |
| PoolQC | Pool quality |
| GarageArea | Size of garage in square feet |
| GarageQual | Garage quality |
| OverallCond | Rates the overall condition of the house |
| SalePrice | Selling Price of the House |
# Libraries needed
library(tidyverse)
library(psych)
library(Matrix)
library(MASS)
library(scales)
testdata = read.csv("test.csv")
train = read.csv("train.csv")
# Descriptive statistics of the numeric variables
train %>% keep(is.numeric) %>% describe()
## vars n mean sd median trimmed mad
## Id 1 1460 730.50 421.61 730.5 730.50 541.15
## MSSubClass 2 1460 56.90 42.30 50.0 49.15 44.48
## LotFrontage 3 1201 70.05 24.28 69.0 68.94 16.31
## LotArea 4 1460 10516.83 9981.26 9478.5 9563.28 2962.23
## OverallQual 5 1460 6.10 1.38 6.0 6.08 1.48
## OverallCond 6 1460 5.58 1.11 5.0 5.48 0.00
## YearBuilt 7 1460 1971.27 30.20 1973.0 1974.13 37.06
## YearRemodAdd 8 1460 1984.87 20.65 1994.0 1986.37 19.27
## MasVnrArea 9 1452 103.69 181.07 0.0 63.15 0.00
## BsmtFinSF1 10 1460 443.64 456.10 383.5 386.08 568.58
## BsmtFinSF2 11 1460 46.55 161.32 0.0 1.38 0.00
## BsmtUnfSF 12 1460 567.24 441.87 477.5 519.29 426.99
## TotalBsmtSF 13 1460 1057.43 438.71 991.5 1036.70 347.67
## X1stFlrSF 14 1460 1162.63 386.59 1087.0 1129.99 347.67
## X2ndFlrSF 15 1460 346.99 436.53 0.0 285.36 0.00
## LowQualFinSF 16 1460 5.84 48.62 0.0 0.00 0.00
## GrLivArea 17 1460 1515.46 525.48 1464.0 1467.67 483.33
## BsmtFullBath 18 1460 0.43 0.52 0.0 0.39 0.00
## BsmtHalfBath 19 1460 0.06 0.24 0.0 0.00 0.00
## FullBath 20 1460 1.57 0.55 2.0 1.56 0.00
## HalfBath 21 1460 0.38 0.50 0.0 0.34 0.00
## BedroomAbvGr 22 1460 2.87 0.82 3.0 2.85 0.00
## KitchenAbvGr 23 1460 1.05 0.22 1.0 1.00 0.00
## TotRmsAbvGrd 24 1460 6.52 1.63 6.0 6.41 1.48
## Fireplaces 25 1460 0.61 0.64 1.0 0.53 1.48
## GarageYrBlt 26 1379 1978.51 24.69 1980.0 1981.07 31.13
## GarageCars 27 1460 1.77 0.75 2.0 1.77 0.00
## GarageArea 28 1460 472.98 213.80 480.0 469.81 177.91
## WoodDeckSF 29 1460 94.24 125.34 0.0 71.76 0.00
## OpenPorchSF 30 1460 46.66 66.26 25.0 33.23 37.06
## EnclosedPorch 31 1460 21.95 61.12 0.0 3.87 0.00
## X3SsnPorch 32 1460 3.41 29.32 0.0 0.00 0.00
## ScreenPorch 33 1460 15.06 55.76 0.0 0.00 0.00
## PoolArea 34 1460 2.76 40.18 0.0 0.00 0.00
## MiscVal 35 1460 43.49 496.12 0.0 0.00 0.00
## MoSold 36 1460 6.32 2.70 6.0 6.25 2.97
## YrSold 37 1460 2007.82 1.33 2008.0 2007.77 1.48
## SalePrice 38 1460 180921.20 79442.50 163000.0 170783.29 56338.80
## min max range skew kurtosis se
## Id 1 1460 1459 0.00 -1.20 11.03
## MSSubClass 20 190 170 1.40 1.56 1.11
## LotFrontage 21 313 292 2.16 17.34 0.70
## LotArea 1300 215245 213945 12.18 202.26 261.22
## OverallQual 1 10 9 0.22 0.09 0.04
## OverallCond 1 9 8 0.69 1.09 0.03
## YearBuilt 1872 2010 138 -0.61 -0.45 0.79
## YearRemodAdd 1950 2010 60 -0.50 -1.27 0.54
## MasVnrArea 0 1600 1600 2.66 10.03 4.75
## BsmtFinSF1 0 5644 5644 1.68 11.06 11.94
## BsmtFinSF2 0 1474 1474 4.25 20.01 4.22
## BsmtUnfSF 0 2336 2336 0.92 0.46 11.56
## TotalBsmtSF 0 6110 6110 1.52 13.18 11.48
## X1stFlrSF 334 4692 4358 1.37 5.71 10.12
## X2ndFlrSF 0 2065 2065 0.81 -0.56 11.42
## LowQualFinSF 0 572 572 8.99 82.83 1.27
## GrLivArea 334 5642 5308 1.36 4.86 13.75
## BsmtFullBath 0 3 3 0.59 -0.84 0.01
## BsmtHalfBath 0 2 2 4.09 16.31 0.01
## FullBath 0 3 3 0.04 -0.86 0.01
## HalfBath 0 2 2 0.67 -1.08 0.01
## BedroomAbvGr 0 8 8 0.21 2.21 0.02
## KitchenAbvGr 0 3 3 4.48 21.42 0.01
## TotRmsAbvGrd 2 14 12 0.67 0.87 0.04
## Fireplaces 0 3 3 0.65 -0.22 0.02
## GarageYrBlt 1900 2010 110 -0.65 -0.42 0.66
## GarageCars 0 4 4 -0.34 0.21 0.02
## GarageArea 0 1418 1418 0.18 0.90 5.60
## WoodDeckSF 0 857 857 1.54 2.97 3.28
## OpenPorchSF 0 547 547 2.36 8.44 1.73
## EnclosedPorch 0 552 552 3.08 10.37 1.60
## X3SsnPorch 0 508 508 10.28 123.06 0.77
## ScreenPorch 0 480 480 4.11 18.34 1.46
## PoolArea 0 738 738 14.80 222.19 1.05
## MiscVal 0 15500 15500 24.43 697.64 12.98
## MoSold 1 12 11 0.21 -0.41 0.07
## YrSold 2006 2010 4 0.10 -1.19 0.03
## SalePrice 34900 755000 720100 1.88 6.50 2079.11
# Descriptive statistics of the factor variables
train %>% keep(is.factor) %>% summary()
## MSZoning Street Alley LotShape LandContour
## C (all): 10 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## FV : 65 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## RH : 16 NA's:1369 IR3: 10 Low: 36
## RL :1151 Reg:925 Lvl:1311
## RM : 218
##
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle RoofStyle RoofMatl
## Norm :1445 1Fam :1220 1Story :726 Flat : 13 CompShg:1434
## Feedr : 6 2fmCon: 31 2Story :445 Gable :1141 Tar&Grv: 11
## Artery : 2 Duplex: 52 1.5Fin :154 Gambrel: 11 WdShngl: 6
## PosN : 2 Twnhs : 43 SLvl : 65 Hip : 286 WdShake: 5
## RRNn : 2 TwnhsE: 114 SFoyer : 37 Mansard: 7 ClyTile: 1
## PosA : 1 1.5Unf : 14 Shed : 2 Membran: 1
## (Other): 2 (Other): 19 (Other): 2
## Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
## VinylSd:515 VinylSd:504 BrkCmn : 15 Ex: 52 Ex: 3
## HdBoard:222 MetalSd:214 BrkFace:445 Fa: 14 Fa: 28
## MetalSd:220 HdBoard:207 None :864 Gd:488 Gd: 146
## Wd Sdng:206 Wd Sdng:197 Stone :128 TA:906 Po: 1
## Plywood:108 Plywood:142 NA's : 8 TA:1282
## CemntBd: 61 CmentBd: 60
## (Other):128 (Other):136
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
## BrkTil:146 Ex :121 Fa : 45 Av :221 ALQ :220
## CBlock:634 Fa : 35 Gd : 65 Gd :134 BLQ :148
## PConc :647 Gd :618 Po : 2 Mn :114 GLQ :418
## Slab : 24 TA :649 TA :1311 No :953 LwQ : 74
## Stone : 6 NA's: 37 NA's: 37 NA's: 38 Rec :133
## Wood : 3 Unf :430
## NA's: 37
## BsmtFinType2 Heating HeatingQC CentralAir Electrical KitchenQual
## ALQ : 19 Floor: 1 Ex:741 N: 95 FuseA: 94 Ex:100
## BLQ : 33 GasA :1428 Fa: 49 Y:1365 FuseF: 27 Fa: 39
## GLQ : 14 GasW : 18 Gd:241 FuseP: 3 Gd:586
## LwQ : 46 Grav : 7 Po: 1 Mix : 1 TA:735
## Rec : 54 OthW : 2 TA:428 SBrkr:1334
## Unf :1256 Wall : 4 NA's : 1
## NA's: 38
## Functional FireplaceQu GarageType GarageFinish GarageQual
## Maj1: 14 Ex : 24 2Types : 6 Fin :352 Ex : 3
## Maj2: 5 Fa : 33 Attchd :870 RFn :422 Fa : 48
## Min1: 31 Gd :380 Basment: 19 Unf :605 Gd : 14
## Min2: 34 Po : 20 BuiltIn: 88 NA's: 81 Po : 3
## Mod : 15 TA :313 CarPort: 9 TA :1311
## Sev : 1 NA's:690 Detchd :387 NA's: 81
## Typ :1360 NA's : 81
## GarageCond PavedDrive PoolQC Fence MiscFeature
## Ex : 2 N: 90 Ex : 2 GdPrv: 59 Gar2: 2
## Fa : 35 P: 30 Fa : 2 GdWo : 54 Othr: 2
## Gd : 9 Y:1340 Gd : 3 MnPrv: 157 Shed: 49
## Po : 7 NA's:1453 MnWw : 11 TenC: 1
## TA :1326 NA's :1179 NA's:1406
## NA's: 81
##
## SaleType SaleCondition
## WD :1267 Abnorml: 101
## New : 122 AdjLand: 4
## COD : 43 Alloca : 12
## ConLD : 9 Family : 20
## ConLI : 5 Normal :1198
## ConLw : 5 Partial: 125
## (Other): 9
The dependent variable is the selling price of the house, SalePrice. The independent variables are the lot size in square feet, LotArea and size of garage in square feet, GarageArea. Moreover, one of independent variables of interest is the newly created variable of the total porch area which is the sum of all porch area (i.e. WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, and ScreenPorch) in square feet called TotalPorch.
# Scatterplots of Independent Variables with Dependent Variable: SalePrice
# vs LotArea
ggplot(train, aes(x = log(LotArea/1000), y = log(SalePrice/1000))) + geom_point(color = "#69b3a2", alpha = 0.5) + labs(title = "Home Sale Price vs Lot Size", x = "Lot Size in log(Kft^2)", y = "Price in log(K$)") + theme_minimal()
# vs TotalPorch: Sum of all porch area
train$TotalPorch = rowSums(train[,c(67:71)])
ggplot(train, aes(x = TotalPorch, y = (SalePrice/1000))) + geom_point(color = "#69b3a2", alpha = 0.5) + labs(title = "Home Sale Price vs Total Porch Area", x = "Total Porch Area in K ft^2", y = "Price in K$") + theme_minimal()
# vs GarageArea
ggplot(train, aes(x = (GarageArea/1000), y = (SalePrice/1000))) + geom_point(color = "#69b3a2", alpha = 0.5) + labs(title = " Home Sale Price vs Garage Area", x = "Garage Area in K ft^2", y = "Price in K$") + theme_minimal()
Correlation matrix will be of SalePrice, LotArea, PoolArea, GarageArea, and TotalPorch. Following it, will be a correlation hypothesis testing where:
Null: true correlation is equal to 0
test = train[,c(5,63,72,82,81)]
res = corr.test(test, y = NULL, use = "pairwise", method = "pearson", adjust = "holm", alpha =.2, ci = TRUE, minlength=5)
print(res, short = FALSE)
## Call:corr.test(x = test, y = NULL, use = "pairwise", method = "pearson",
## adjust = "holm", alpha = 0.2, ci = TRUE, minlength = 5)
## Correlation matrix
## LotArea GarageArea PoolArea TotalPorch SalePrice
## LotArea 1.00 0.18 0.08 0.19 0.26
## GarageArea 0.18 1.00 0.06 0.26 0.62
## PoolArea 0.08 0.06 1.00 0.12 0.09
## TotalPorch 0.19 0.26 0.12 1.00 0.39
## SalePrice 0.26 0.62 0.09 0.39 1.00
## Sample Size
## [1] 1460
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## LotArea GarageArea PoolArea TotalPorch SalePrice
## LotArea 0 0.00 0.01 0 0
## GarageArea 0 0.00 0.02 0 0
## PoolArea 0 0.02 0.00 0 0
## TotalPorch 0 0.00 0.00 0 0
## SalePrice 0 0.00 0.00 0 0
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## raw.lower raw.r raw.upper raw.p lower.adj upper.adj
## LotAr-GrgAr 0.15 0.18 0.21 0.00 0.13 0.23
## LotAr-PolAr 0.04 0.08 0.11 0.00 0.03 0.12
## LotAr-TtlPr 0.15 0.19 0.22 0.00 0.13 0.24
## LotAr-SlPrc 0.23 0.26 0.29 0.00 0.21 0.32
## GrgAr-PolAr 0.03 0.06 0.09 0.02 0.03 0.09
## GrgAr-TtlPr 0.23 0.26 0.29 0.00 0.20 0.31
## GrgAr-SlPrc 0.60 0.62 0.64 0.00 0.58 0.66
## PolAr-TtlPr 0.09 0.12 0.16 0.00 0.07 0.17
## PolAr-SlPrc 0.06 0.09 0.13 0.00 0.04 0.14
## TtlPr-SlPrc 0.36 0.39 0.42 0.00 0.34 0.44
Let’s take a closer look at the analysis above. Firstly, the correlation matrix, it highlights that no pairs of variable have a very strong relationship with another (r > 0.80). At most, the sale price and garage area have a moderate positive correlation of 0.62. In addition, there is a slight positive correlation between sale price with lot area, and sale price with total porch area, r = 0.26 and r = 0.39, respectively. Now at the 80% confidence level, the left diagonal of the probability values matrix suggests that there are evidence that a correlation exist since the p-values are less than 0.2. The familywise error rate is the probability of making at least one Type I Error in a series of hypothesis tests. Fortunately, corr.test adjust for multiple testing using Holm method where tests are run and then ordered from lowest to highest p-values. These values are in the right diagonal of the probability values matrix.
Let’s invert the correlation matrix to find the precision matrix and conduct LU decomposition.
# Precision Matrix
precision = solve(res[["r"]])
precision
## LotArea GarageArea PoolArea TotalPorch SalePrice
## LotArea 1.08690976 -0.025491008 -0.049657024 -0.09981525 -0.22726647
## GarageArea -0.02549101 1.637131857 -0.001944333 -0.02721284 -1.00309414
## PoolArea -0.04965702 -0.001944333 1.019887469 -0.09892174 -0.04124968
## TotalPorch -0.09981525 -0.027212840 -0.098921736 1.20069881 -0.41702321
## SalePrice -0.22726647 -1.003094142 -0.041249680 -0.41702321 1.85218795
# Correlation matrix times Precision matrix
cp = res[["r"]] %*% precision
cp
## LotArea GarageArea PoolArea TotalPorch SalePrice
## LotArea 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0
## GarageArea -2.775558e-17 1.000000e+00 3.469447e-18 0.000000e+00 0
## PoolArea 1.734723e-17 1.387779e-17 1.000000e+00 6.938894e-18 0
## TotalPorch 1.387779e-17 5.551115e-17 2.428613e-17 1.000000e+00 0
## SalePrice -2.775558e-17 2.220446e-16 -2.081668e-17 0.000000e+00 1
# Precision matrix times Correlation matrix
pc = precision %*% res[["r"]]
pc
## LotArea GarageArea PoolArea TotalPorch
## LotArea 1.000000e+00 0.000000e+00 1.734723e-17 4.163336e-17
## GarageArea 5.551115e-17 1.000000e+00 2.775558e-17 5.551115e-17
## PoolArea -8.673617e-18 -3.469447e-18 1.000000e+00 3.469447e-18
## TotalPorch 1.387779e-17 5.551115e-17 4.163336e-17 1.000000e+00
## SalePrice -1.110223e-16 0.000000e+00 -2.775558e-17 0.000000e+00
## SalePrice
## LotArea 0.000000e+00
## GarageArea 2.220446e-16
## PoolArea -6.938894e-18
## TotalPorch 5.551115e-17
## SalePrice 1.000000e+00
# LU decomposition
expand(lu(res[["r"]]))
## $L
## 5 x 5 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 . . . .
## [2,] 0.18040276 1.00000000 . . .
## [3,] 0.07767239 0.04861721 1.00000000 . .
## [4,] 0.18525613 0.23339423 0.09776712 1.00000000 .
## [5,] 0.26384335 0.59520439 0.04428321 0.22515167 1.00000000
##
## $U
## 5 x 5 Matrix of class "dtrMatrix"
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.18040276 0.07767239 0.18525613 0.26384335
## [2,] . 0.96745485 0.04703496 0.22579838 0.57583337
## [3,] . . 0.99168029 0.09695373 0.04391479
## [4,] . . . 0.90350124 0.20342481
## [5,] . . . . 0.53990201
##
## $P
## 5 x 5 sparse Matrix of class "pMatrix"
##
## [1,] | . . . .
## [2,] . | . . .
## [3,] . . | . .
## [4,] . . . | .
## [5,] . . . . |
By multiplying the correlation matrix by the precision matrix and vice versa, result is the identity matrix. Moreover, the LU decomposition resulted in the correlation matrix after multiplying the two components.
From the descriptive statistic, one of the variable of interest is positively skewed, namely LotArea, with a skewness of 12.18.
skew(train$LotArea)
## [1] 12.18262
# With the MASS library, fitdistr is used to fit the data to an exponential probability density function
LotArea_exp = fitdistr(train$LotArea, "exponential")
LotArea_exp
## rate
## 9.508570e-05
## (2.488507e-06)
# Generate 1000 samples from the exponential distribution
y = rexp(1000, LotArea_exp$estimate)
ggplot(train, aes(x = LotArea)) + geom_histogram(aes(y = ..density..,fill = "Observed"),bins = 50,alpha=0.4) + geom_histogram(data = data.frame(y),aes(x = y, y = ..density.., fill = "Simulated"), bins = 100, alpha = 0.5) + theme_minimal() + theme(legend.title = element_blank()) + scale_fill_brewer(palette = "Set2") + labs(title = "Plots of Lot Areas", x = "Square Feet", y = "Proportion") + xlim(0, 10^5)
From the graph, it is evident that the observed graph of the data does not fit too well with the simulated dataset. Thus, let’s compare the 5th and 95th percentiles of the exponential pdf and empirical data.
# Exponential PDF
qexp(c(0.05,0.95),LotArea_exp$estimate)
## [1] 539.4428 31505.6013
# Emperical Data
quantile(train$LotArea, probs = c(0.05,0.95))
## 5% 95%
## 3311.70 17401.15
# 95% condifence interval of empirical data
z = qnorm(0.95)
a = z * sd(train$LotArea)/sqrt(length(train$LotArea))
CI = matrix(c(round(mean(train$LotArea - a),2), round(mean(train$LotArea + a),2)))
rownames(CI) = c("Lower CI","Upper CI")
CI
## [,1]
## Lower CI 10087.16
## Upper CI 10946.50
# Exponential PDF and Emperical Data means
means = matrix(c(mean(train$LotArea), mean(y)))
rownames(means) = c("Mean of LotArea","Mean of EXP")
means
## [,1]
## Mean of LotArea 10516.83
## Mean of EXP 10518.08
The model’s sample mean lies within the 95% confidence interval for the mean. Thus, an exponential distribution does appear to be an appropriate model that fits the lot area. In spite of this, based on the graph of the observed and simulated data, it is still not the best fitted model.
Now, finding the best fit model with the data using Stepwise Regression.
dataset = train %>% select_if(is.numeric) %>% filter(complete.cases(.))
# Remove Outlier
remove_outliers = function(x, na.rm = TRUE, ...) {
qnt = quantile(x, probs = c(.30, .70), na.rm = na.rm, ...)
caps = quantile(x, probs=c(.05, .95), na.rm = T)
H = 1.5 * IQR(x, na.rm = na.rm)
y = x
y[x < (qnt[1] - H)] = caps[1]
y[x > (qnt[2] + H)] = caps[2]
y
}
remove_all_outliers = function(df){
df[,sapply(df, is.numeric)] = lapply(df[,sapply(df, is.numeric)], remove_outliers)
df
}
dataset_new = remove_all_outliers(dataset)
# Fit the full model
model = lm(SalePrice ~., data = dataset_new)
# Stepwise Regression model
step = stepAIC(model, direction = "both", trace = FALSE)
summary(step)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF2 + BsmtUnfSF +
## TotalBsmtSF + X2ndFlrSF + GrLivArea + BsmtHalfBath + BedroomAbvGr +
## Fireplaces + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF +
## EnclosedPorch + ScreenPorch + TotalPorch, data = dataset_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -207243 -12221 -345 12895 95465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.253e+06 9.522e+04 -13.163 < 2e-16 ***
## MSSubClass -9.641e+01 2.267e+01 -4.253 2.28e-05 ***
## LotArea 1.601e+00 2.958e-01 5.413 7.59e-08 ***
## OverallQual 1.463e+04 9.184e+02 15.924 < 2e-16 ***
## OverallCond 7.118e+03 9.557e+02 7.448 1.92e-13 ***
## YearBuilt 3.887e+02 4.629e+01 8.397 < 2e-16 ***
## YearRemodAdd 2.053e+02 5.340e+01 3.844 0.000128 ***
## BsmtFinSF2 -2.201e+01 6.393e+00 -3.443 0.000597 ***
## BsmtUnfSF -2.105e+01 1.990e+00 -10.577 < 2e-16 ***
## TotalBsmtSF 5.172e+01 4.246e+00 12.181 < 2e-16 ***
## X2ndFlrSF 1.469e+01 3.963e+00 3.707 0.000220 ***
## GrLivArea 3.946e+01 4.311e+00 9.154 < 2e-16 ***
## BsmtHalfBath -6.259e+03 3.202e+03 -1.955 0.050856 .
## BedroomAbvGr -5.970e+03 1.498e+03 -3.986 7.17e-05 ***
## Fireplaces 5.791e+03 1.409e+03 4.110 4.25e-05 ***
## GarageCars 8.237e+03 2.365e+03 3.483 0.000516 ***
## GarageArea 1.429e+01 8.063e+00 1.773 0.076582 .
## WoodDeckSF 4.104e+01 1.540e+01 2.665 0.007813 **
## OpenPorchSF 6.462e+01 2.230e+01 2.898 0.003835 **
## EnclosedPorch 4.729e+01 1.699e+01 2.784 0.005462 **
## ScreenPorch 5.069e+01 2.168e+01 2.338 0.019554 *
## TotalPorch -2.280e+01 1.447e+01 -1.576 0.115370
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23670 on 1099 degrees of freedom
## Multiple R-squared: 0.8844, Adjusted R-squared: 0.8822
## F-statistic: 400.3 on 21 and 1099 DF, p-value: < 2.2e-16
# Using the test dataset to make the prediction.
# All tranformation and new variable from train data is created in testdata
testdata$TotalPorch = rowSums(testdata[,c(67:71)])
testdata$SalePrice = predict(step, testdata)
testdata[is.na(testdata)] = mean(testdata$SalePrice, na.rm = TRUE)
submission = subset(testdata, select = c("Id", "SalePrice"))
write.csv(submission, file = "sam13hopexlife_submission.csv")