Question 1: Probability
First, we load train.csv from the Kaggle competition to start our analysis. Next, we pick two variables: one independent (X) and one dependent (Y).
For the X variable we use GrLivArea, and for the Y variable we use SalePrice.
| | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| GrLivArea | 334 | 1130 | 1464 | 1515 | 1777 | 5642 |
| SalePrice | 34900 | 130000 | 163000 | 180900 | 214000 | 755000 |

The first-quartile cutoffs used below are x = 1129.5 for GrLivArea and y = 129975 for SalePrice.
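A minimal sketch of how these summaries and cutoffs could be produced (the object name kaggletrain matches the output later in this report; the file path is an assumption):

```r
# Load the Kaggle House Prices training data (path is an assumption)
kaggletrain <- read.csv("train.csv")

# Five-number summaries for the two chosen variables
summary(kaggletrain$GrLivArea)
summary(kaggletrain$SalePrice)

# First-quartile cutoffs used as x and y in the probabilities below
x <- quantile(kaggletrain$GrLivArea, 0.25)   # 1129.5
y <- quantile(kaggletrain$SalePrice, 0.25)   # 129975
```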
Problem 1: P(X > x | Y > y)
[1] 0.8712329
This says that the probability GrLivArea is greater than 1129.5, given that SalePrice is greater than 129975, is 0.8712. Since this is well above the unconditional P(X > x) = 0.75, the two variables appear to be dependent.
Problem 2: P(X > x, Y > y)
[1] 0.6534247
This says that the probability GrLivArea is greater than 1129.5 and SalePrice is greater than 129975 is 0.6534.
Problem 3: P(X < x | Y > y)
[1] 0.1287671
This says that the probability GrLivArea is less than 1129.5, given that SalePrice is greater than 129975, is 0.1288. Together with Problem 1, the two conditional probabilities sum to 1, as expected.
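A sketch of how the three probabilities above could be computed, using the x and y cutoffs defined earlier:

```r
X <- kaggletrain$GrLivArea
Y <- kaggletrain$SalePrice

sum(X > x & Y > y) / sum(Y > y)   # Problem 1: P(X > x | Y > y)
sum(X > x & Y > y) / length(X)    # Problem 2: P(X > x, Y > y)
sum(X < x & Y > y) / sum(Y > y)   # Problem 3: P(X < x | Y > y)
```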
Next we build a contingency table of counts using the first-quartile cutoffs:
| | Y <= 1st quartile | Y > 1st quartile | Total |
|---|---|---|---|
| X <= 1st quartile | 224 | 141 | 365 |
| X > 1st quartile | 141 | 954 | 1095 |
| Total | 365 | 1095 | 1460 |
Splitting the data at the first quartiles does not make the variables independent. Of the 1095 houses in the top 75% of GrLivArea, 954 (about 87%) are also in the top 75% of SalePrice, noticeably more than the 75% we would expect under independence. This suggests the variables are dependent.
Let A be the event that an observation is above the 1st quartile of X, and B the event that it is above the 1st quartile of Y. Does P(AB) = P(A)P(B)?
P(A)P(B)
[1] 0.5625
P(AB)
[1] 0.6534247
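A sketch of the comparison between P(A)P(B) and P(AB):

```r
pA  <- mean(X > x)           # P(A) = 0.75 by construction of the quartile split
pB  <- mean(Y > y)           # P(B) = 0.75
pAB <- mean(X > x & Y > y)   # P(AB)

pA * pB   # 0.5625
pAB       # 0.6534247
```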
Null hypothesis: GrLivArea and SalePrice are independent.
FALSE TRUE
FALSE 224 141
TRUE 141 954
Pearson's Chi-squared test with Yates' continuity correction
data: chiinput
X-squared = 340.75, df = 1, p-value < 2.2e-16
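A sketch of the chi-squared test on the 2x2 table of quartile indicators:

```r
chiinput <- table(X > x, Y > y)   # the FALSE/TRUE table shown above
chisq.test(chiinput)              # Pearson's chi-squared with Yates' correction
```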
From the above we can say that, mathematically,
\[ P(AB) \neq P(A)P(B) \]
The chi-squared test gives a p-value far below 0.05, so we reject the null hypothesis and accept the alternative: GrLivArea and SalePrice are dependent.
Question 2: Descriptive and Inferential Statistics
First we provide univariate descriptive statistics and plots for the training data set.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GrLivArea | 1 | 1460 | 1515.464 | 525.4804 | 1464 | 1467.67 | 483.3276 | 334 | 5642 | 5308 | 1.363754 | 4.863483 | 13.75245 |
| SalePrice | 2 | 1460 | 180921.196 | 79442.5029 | 163000 | 170783.29 | 56338.8000 | 34900 | 755000 | 720100 | 1.879009 | 6.496789 | 2079.10532 |
From the descriptive statistics above, the mean/SD is 1515/525 for GrLivArea and 180921/79442 for SalePrice.
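A sketch of the descriptive statistics, assuming the psych package (whose describe() function produces the columns shown above):

```r
library(psych)
describe(kaggletrain[, c("GrLivArea", "SalePrice")])
```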
Next we plot histograms of X and Y and a scatter plot of X against Y.
Both the GrLivArea and SalePrice histograms are right-skewed, meaning there are outliers to the right of the mean.
Next we draw a scatter plot of X against Y; a sketch of all three plots follows.
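A minimal sketch of the plots using base graphics:

```r
par(mfrow = c(1, 3))
hist(kaggletrain$GrLivArea, main = "GrLivArea", xlab = "GrLivArea")
hist(kaggletrain$SalePrice, main = "SalePrice", xlab = "SalePrice")
plot(kaggletrain$GrLivArea, kaggletrain$SalePrice,
     xlab = "GrLivArea", ylab = "SalePrice", main = "X vs Y")
par(mfrow = c(1, 1))
```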
Next we derive a correlation matrix for three variables from the Kaggle data: LotArea, GrLivArea, and GarageArea. We will see how they are correlated.
LotArea GrLivArea GarageArea
LotArea 1.0000000 0.2631162 0.1804028
GrLivArea 0.2631162 1.0000000 0.4689975
GarageArea 0.1804028 0.4689975 1.0000000
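A sketch of how this correlation matrix could be derived:

```r
corrmatrix <- cor(kaggletrain[, c("LotArea", "GrLivArea", "GarageArea")])
corrmatrix
```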
We test each of the three unique pairwise correlations against the null hypothesis that the correlation is zero, using a 92% confidence level.
Pearson's product-moment correlation
data: kaggletrain$LotArea and kaggletrain$GrLivArea
t = 10.414, df = 1458, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
92 percent confidence interval:
0.2199359 0.3052674
sample estimates:
cor
0.2631162
Pearson's product-moment correlation
data: kaggletrain$LotArea and kaggletrain$GarageArea
t = 7.0034, df = 1458, p-value = 3.803e-12
alternative hypothesis: true correlation is not equal to 0
92 percent confidence interval:
0.1356921 0.2243801
sample estimates:
cor
0.1804028
Pearson's product-moment correlation
data: kaggletrain$GrLivArea and kaggletrain$LotArea
t = 10.414, df = 1458, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
92 percent confidence interval:
0.2199359 0.3052674
sample estimates:
cor
0.2631162
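A sketch of the three pairwise tests at the 92% confidence level (note that the third unique pair is GrLivArea versus GarageArea):

```r
cor.test(kaggletrain$LotArea,   kaggletrain$GrLivArea,  conf.level = 0.92)
cor.test(kaggletrain$LotArea,   kaggletrain$GarageArea, conf.level = 0.92)
cor.test(kaggletrain$GrLivArea, kaggletrain$GarageArea, conf.level = 0.92)
```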
For all of the above tests the p-value is far below 0.05, so we reject the null hypothesis that the correlation between the variables is zero.
Since we reject the null hypothesis for all three tests, we should be concerned about familywise Type I error. To control the familywise error rate there are two kinds of corrections, single-step (such as Bonferroni) and sequential (such as Holm), both of which adjust the p-values. Because our p-values are extremely small, the conclusions would not change after adjustment, so the familywise Type I error risk is negligible here.
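For illustration, a sketch of applying both kinds of correction to the p-values reported above (the 2.2e-16 values are the upper bounds printed by R):

```r
pvals <- c(2.2e-16, 3.803e-12, 2.2e-16)   # p-values from the tests above
p.adjust(pvals, method = "bonferroni")    # single-step correction
p.adjust(pvals, method = "holm")          # sequential correction
```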
Question 3: Linear Algebra and Correlation
The precision matrix is the inverse of the correlation matrix.
LotArea GrLivArea GarageArea
LotArea 1.07920917 -0.2469705 -0.07886378
GrLivArea -0.24697046 1.3385010 -0.58319943
GarageArea -0.07886378 -0.5831994 1.28774631
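A sketch of how the precision matrix could be obtained:

```r
precmatrix <- solve(corrmatrix)   # matrix inverse of the correlation matrix
precmatrix
```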
Now we multiply the correlation matrix by the precision matrix.
LotArea GrLivArea GarageArea
LotArea 1 0 0
GrLivArea 0 1 0
GarageArea 0 0 1
The precision matrix multiplied by the correlation matrix:
LotArea GrLivArea GarageArea
LotArea 1 0 0
GrLivArea 0 1 0
GarageArea 0 0 1
In both cases the product is the identity matrix, as expected, since the precision matrix is the inverse of the correlation matrix.
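A sketch of both products (%*% is matrix multiplication in R; rounding suppresses floating-point noise):

```r
round(corrmatrix %*% precmatrix, 10)   # correlation matrix times precision matrix
round(precmatrix %*% corrmatrix, 10)   # precision matrix times correlation matrix
```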
Next we perform the LU decomposition of the precision matrix, as discussed in the discussion forum.
[1] "Value of L"
[,1] [,2] [,3]
[1,] 1.00000000 0.0000000 0
[2,] -0.22884393 1.0000000 0
[3,] -0.07307553 -0.4689975 1
[1] "Value of U"
LotArea GrLivArea GarageArea
[1,] 1.079209 -0.2469705 -0.07886378
[2,] 0.000000 1.2819833 -0.60124693
[3,] 0.000000 0.0000000 1.00000000
[1] "A = LU"
LotArea GrLivArea GarageArea
[1,] 1.07920917 -0.2469705 -0.07886378
[2,] -0.24697046 1.3385010 -0.58319943
[3,] -0.07886378 -0.5831994 1.28774631
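A sketch of the LU decomposition, assuming the matrixcalc package is available (Matrix::lu or a hand-rolled Doolittle factorization would work as well):

```r
library(matrixcalc)
lu <- lu.decomposition(precmatrix)
lu$L            # unit lower-triangular factor
lu$U            # upper-triangular factor
lu$L %*% lu$U   # should reproduce the precision matrix (A = LU)
```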
Question 4: Calculus-Based Probability and Statistics
In our data set the minimum value of each variable is greater than zero, so no shift is needed. Both variables are right-skewed; we will use GrLivArea for this problem.
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GrLivArea | 1 | 1460 | 1515.46 | 525.48 | 1464 | 1467.67 | 483.33 | 334 | 5642 | 5308 | 1.36 | 4.86 | 13.75 |
| SalePrice | 1 | 1460 | 180921.2 | 79442.5 | 163000 | 170783.3 | 56338.8 | 34900 | 755000 | 720100 | 1.88 | 6.5 | 2079.11 |
Then we load the MASS package and run fitdistr to fit an exponential probability density function.
rate
6.598640e-04
(1.726943e-05)
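A sketch of the exponential fit with MASS:

```r
library(MASS)
expfit <- fitdistr(kaggletrain$GrLivArea, densfun = "exponential")
expfit$estimate   # fitted rate (lambda), about 6.6e-04
```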
Find the optimal value of lambda for this distribution, then take 1000 samples from this exponential PDF using that lambda.
The optimal value of lambda:
rate
0.000659864
Next we draw 1000 samples from the fitted exponential distribution using this lambda value.
Plot the histogram and compare it with the original histogram.
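A sketch of the sampling and the histogram comparison (the seed is arbitrary):

```r
lambda <- expfit$estimate["rate"]
set.seed(605)                            # arbitrary seed for reproducibility
expsample <- rexp(1000, rate = lambda)   # 1000 draws from the fitted exponential

par(mfrow = c(1, 2))
hist(kaggletrain$GrLivArea, main = "Observed GrLivArea", xlab = "GrLivArea")
hist(expsample, main = "Simulated exponential", xlab = "Simulated value")
par(mfrow = c(1, 1))
```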
Using the exponential PDF, find the 5th and 95th percentile using the cumulative distribution function.
The 5th and 95th percentiles for the sampled exponential distribution are 86.8844 (5%) and 4326.99 (95%).
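A sketch of the percentile calculation; the values above appear to come from the simulated sample, while qexp() gives the theoretical percentiles of the fitted distribution:

```r
quantile(expsample, c(0.05, 0.95))   # percentiles of the simulated sample
qexp(c(0.05, 0.95), rate = lambda)   # theoretical percentiles from the fitted CDF
```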
Also generate a 95% confidence interval from the empirical data, assuming normality.
One Sample t-test
data: CalXYData$GrLivArea
t = 110.2, df = 1459, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1488.487 1542.440
sample estimates:
mean of x
1515.464
The 95% confidence interval for the mean of GrLivArea is 1488.487 to 1542.440.
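The output above uses an object named CalXYData; the same interval can be reproduced from the training data, assuming normality, with a one-sample t-test:

```r
t.test(kaggletrain$GrLivArea, conf.level = 0.95)$conf.int   # 1488.487 to 1542.440
```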
Finally, provide the empirical 5th percentile and 95th percentile of the data.
The empirical 5th percentile is 848 and the 95th percentile is 2466.1.
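A sketch of the empirical percentiles:

```r
quantile(kaggletrain$GrLivArea, c(0.05, 0.95))   # 848 and 2466.1
```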
From the above analysis we can see that the 5th and 95th percentiles of the exponential fit deviate substantially from the 5th and 95th percentiles of the empirical data. Instead of an exponential fit, we should consider a normal distribution.
Question 5: Modelling
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your kaggle.com user name and score.
Data cleansing: removal of NAs. First we inspect the classes of the columns in the training data.
Id MSSubClass MSZoning LotFrontage LotArea
"integer" "integer" "factor" "integer" "integer"
Street Alley LotShape LandContour Utilities
"factor" "factor" "factor" "factor" "factor"
LotConfig LandSlope Neighborhood Condition1 Condition2
"factor" "factor" "factor" "factor" "factor"
BldgType HouseStyle OverallQual OverallCond YearBuilt
"factor" "factor" "integer" "integer" "integer"
YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
"integer" "factor" "factor" "factor" "factor"
MasVnrType MasVnrArea ExterQual ExterCond Foundation
"factor" "integer" "factor" "factor" "factor"
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
"factor" "factor" "factor" "factor" "integer"
BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
"factor" "integer" "integer" "integer" "factor"
HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
"factor" "factor" "factor" "integer" "integer"
LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
"integer" "integer" "integer" "integer" "integer"
HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
"integer" "integer" "integer" "factor" "integer"
Functional Fireplaces FireplaceQu GarageType GarageYrBlt
"factor" "integer" "factor" "factor" "integer"
GarageFinish GarageCars GarageArea GarageQual GarageCond
"factor" "integer" "integer" "factor" "factor"
PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
"factor" "integer" "integer" "integer" "integer"
ScreenPorch PoolArea PoolQC Fence MiscFeature
"integer" "integer" "factor" "factor" "factor"
[ reached getOption("max.print") -- omitted 6 entries ]
Step 1: First we examine the linearity between the dependent variable and the independent variables. For us the dependent variable is SalePrice and the independent variables are the rest.
From the entire data set, we picked a few numeric variables to see how they relate to SalePrice.
From the above plot we can see a strong correlation between SalePrice and each of GrLivArea, TotalBsmtSF, FullBath, and TotRmsAbvGrd.
Step 2: Next we check the linearity and correlation between the independent variables. This is needed to check for multicollinearity.
From the above analysis we can see that GrLivArea and TotRmsAbvGrd are highly correlated, so we might have to drop one. We cannot drop GrLivArea, so TotRmsAbvGrd is the candidate. For the initial analysis, however, we keep all of them and decide based on p-values from the model which to reject; a sketch of this correlation check follows.
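A sketch of this multicollinearity check on a few of the candidate numeric predictors (the variable selection here is illustrative):

```r
candidates <- c("SalePrice", "GrLivArea", "TotalBsmtSF", "FullBath", "TotRmsAbvGrd")
round(cor(kaggletrain[, candidates], use = "pairwise.complete.obs"), 2)
pairs(kaggletrain[, candidates])   # pairwise scatter plots
```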
Step 3: Linear Model
First we create a model of SalePrice on all of the independent variables we are interested in.
Call:
lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea +
Street + YearBuilt + OverallQual + OverallCond + MasVnrArea +
BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + BsmtFullBath +
BsmtHalfBath + X1stFlrSF + X2ndFlrSF + GrLivArea + FullBath +
HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageCars +
GarageArea + WoodDeckSF + OpenPorchSF + PoolArea + EnclosedPorch +
Fireplaces + YrSold, data = kaggletrain)
Residuals:
Min 1Q Median 3Q Max
-454792 -18057 -2458 14127 317595
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.678e+05 1.628e+06 -0.349 0.727380
MSSubClass -1.911e+02 3.236e+01 -5.905 4.61e-09 ***
LotFrontage -9.751e+01 5.841e+01 -1.670 0.095283 .
LotArea 5.613e-01 1.561e-01 3.595 0.000338 ***
StreetPave 1.140e+04 1.678e+04 0.679 0.497295
YearBuilt 3.474e+02 6.337e+01 5.482 5.17e-08 ***
OverallQual 1.785e+04 1.382e+03 12.916 < 2e-16 ***
OverallCond 5.560e+03 1.132e+03 4.912 1.03e-06 ***
MasVnrArea 3.475e+01 6.878e+00 5.052 5.08e-07 ***
BsmtFinSF1 1.744e+01 5.520e+00 3.160 0.001619 **
BsmtFinSF2 7.633e+00 8.442e+00 0.904 0.366111
BsmtUnfSF 7.558e+00 4.935e+00 1.532 0.125874
TotalBsmtSF NA NA NA NA
BsmtFullBath 1.086e+04 3.014e+03 3.604 0.000326 ***
BsmtHalfBath 2.735e+03 4.837e+03 0.565 0.571863
X1stFlrSF 8.461e+00 2.262e+01 0.374 0.708369
X2ndFlrSF 6.584e+00 2.216e+01 0.297 0.766448
GrLivArea 4.000e+01 2.189e+01 1.827 0.067889 .
FullBath 8.446e+03 3.264e+03 2.587 0.009790 **
HalfBath 8.532e+02 3.129e+03 0.273 0.785130
BedroomAbvGr -1.080e+04 1.933e+03 -5.588 2.86e-08 ***
KitchenAbvGr -1.393e+04 5.817e+03 -2.394 0.016811 *
TotRmsAbvGrd 5.148e+03 1.425e+03 3.612 0.000317 ***
GarageCars 1.190e+04 3.280e+03 3.628 0.000298 ***
GarageArea -2.243e+00 1.139e+01 -0.197 0.843997
WoodDeckSF 2.063e+01 9.623e+00 2.144 0.032238 *
OpenPorchSF 6.106e+00 1.793e+01 0.341 0.733511
PoolArea -6.052e+01 2.921e+01 -2.072 0.038475 *
EnclosedPorch 7.490e+00 1.951e+01 0.384 0.701088
Fireplaces 4.749e+03 2.089e+03 2.273 0.023208 *
YrSold -9.695e+01 8.085e+02 -0.120 0.904569
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36720 on 1165 degrees of freedom
(265 observations deleted due to missingness)
Multiple R-squared: 0.8098, Adjusted R-squared: 0.8051
F-statistic: 171 on 29 and 1165 DF, p-value: < 2.2e-16
From the above analysis we can eliminate variables with high p-values, such as OpenPorchSF, Street, X2ndFlrSF, and several others, and see how \[ R^2 \] and adjusted \[ R^2 \] change.
Call:
lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea +
YearBuilt + OverallQual + OverallCond + MasVnrArea + BsmtFinSF1 +
TotalBsmtSF + BsmtFullBath + GrLivArea + FullBath + BedroomAbvGr +
KitchenAbvGr + TotRmsAbvGrd + GarageCars + WoodDeckSF + PoolArea +
Fireplaces, data = kaggletrain)
Residuals:
Min 1Q Median 3Q Max
-456447 -17968 -2272 14016 316480
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.560e+05 1.078e+05 -7.010 4.01e-12 ***
MSSubClass -1.921e+02 3.126e+01 -6.146 1.08e-09 ***
LotFrontage -9.903e+01 5.749e+01 -1.723 0.085231 .
LotArea 5.544e-01 1.541e-01 3.598 0.000334 ***
YearBuilt 3.500e+02 5.482e+01 6.384 2.48e-10 ***
OverallQual 1.792e+04 1.360e+03 13.183 < 2e-16 ***
OverallCond 5.626e+03 1.102e+03 5.104 3.88e-07 ***
MasVnrArea 3.459e+01 6.795e+00 5.090 4.16e-07 ***
BsmtFinSF1 1.030e+01 3.564e+00 2.890 0.003925 **
TotalBsmtSF 8.178e+00 3.662e+00 2.233 0.025706 *
BsmtFullBath 1.043e+04 2.782e+03 3.747 0.000187 ***
GrLivArea 4.753e+01 4.902e+00 9.696 < 2e-16 ***
FullBath 8.120e+03 2.964e+03 2.739 0.006251 **
BedroomAbvGr -1.073e+04 1.892e+03 -5.673 1.76e-08 ***
KitchenAbvGr -1.367e+04 5.547e+03 -2.464 0.013901 *
TotRmsAbvGrd 5.103e+03 1.414e+03 3.609 0.000320 ***
GarageCars 1.142e+04 1.906e+03 5.991 2.78e-09 ***
WoodDeckSF 2.091e+01 9.522e+00 2.196 0.028287 *
PoolArea -6.071e+01 2.882e+01 -2.107 0.035365 *
Fireplaces 4.978e+03 2.028e+03 2.454 0.014266 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36590 on 1175 degrees of freedom
(265 observations deleted due to missingness)
Multiple R-squared: 0.8096, Adjusted R-squared: 0.8065
F-statistic: 262.9 on 19 and 1175 DF, p-value: < 2.2e-16
From the above model we can see that the R-squared remained more or less the same, while the adjusted R-squared improved marginally.
Step 4: Residual Analysis
Comparing the Q-Q plots of Model 1 and Model 2, the Model 2 plot looks slightly better, with a less heavy upper tail. This suggests the second model is better than the first.
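A sketch of the Q-Q comparison, assuming model1 and model2 hold the two fitted lm objects summarized above:

```r
par(mfrow = c(1, 2))
qqnorm(resid(model1), main = "Model 1 residuals"); qqline(resid(model1))
qqnorm(resid(model2), main = "Model 2 residuals"); qqline(resid(model2))
par(mfrow = c(1, 1))
```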
Step 5: Predicting and Submitting to Kaggle
Id SalePrice
1 1461 115907.6
2 1462 166464.8
3 1463 175943.6
4 1464 203051.6
5 1465 191692.4
6 1466 185939.8
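A sketch of how the predictions and the submission file could be generated, assuming model2 is the second model above and test.csv is in the working directory:

```r
kaggletest <- read.csv("test.csv")
pred <- predict(model2, newdata = kaggletest)
# rows with missing predictor values yield NA predictions and need simple
# imputation (e.g. the training medians) before submitting

submission <- data.frame(Id = kaggletest$Id, SalePrice = pred)
head(submission)
write.csv(submission, "submission.csv", row.names = FALSE)
```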
Kaggle Website Details
UserName: dillynesan, DisplayName: DilipGanesan, Email: dilipgan@gmail.com
Final Score
Still, we have some scope for improvement.
Recall that our dependent variable was right-skewed. To bring it closer to a normal distribution, we can apply a log transformation to the dependent variable. We would therefore like to repeat the analysis with a log transformation of SalePrice and see what impact it has; a sketch follows.
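A sketch of the log-transformed variant, refitting the second model with log(SalePrice) as the response:

```r
logmodel <- update(model2, log(SalePrice) ~ .)
summary(logmodel)$adj.r.squared   # compare with the untransformed model
# predictions must be back-transformed with exp() to the original price scale
logpred <- exp(predict(logmodel, newdata = kaggletest))
```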