Question 1: Probability

First we load train.csv from the Kaggle competition to start our analysis. Next we pick two variables, one independent (X) and one dependent (Y).

For the X variable we use GrLivArea, and for the Y variable we use SalePrice.

Summary of X (GrLivArea):
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    334    1130    1464    1515    1777    5642 
Summary of Y (SalePrice):
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34900  130000  163000  180900  214000  755000 
1st quartile of X (x):
   25% 
1129.5 
1st quartile of Y (y):
   25% 
129975 

Problem 1: P(X > x | Y > y)

[1] 0.8712329

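The probability that GrLivArea exceeds its 1st quartile (1129.5), given that SalePrice exceeds its 1st quartile (129975), is 0.8712329. Since P(X > x) on its own is 0.75, conditioning on a higher SalePrice raises the probability, which suggests the two variables are dependent.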

Problem 2: P(X > x, Y > y)

[1] 0.6534247

The probability that GrLivArea exceeds 1129.5 and SalePrice exceeds 129975 is 0.6534247.

Problem 3: P(X < x | Y > y)

[1] 0.1287671

The probability that GrLivArea is below 1129.5, given that SalePrice exceeds 129975, is 0.1287671. This is the complement of the Problem 1 result (1 - 0.8712329).

Contingency table of counts:

                     (Y <= 1st quartile)  (Y > 1st quartile)  Total
(X <= 1st quartile)                  224                 141    365
(X > 1st quartile)                   141                 954   1095
Total                                365                1095   1460
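The three probabilities from Problems 1 to 3 can be recomputed directly from the table counts. The following is a minimal sketch in R; the object name `counts` and the row and column labels are my own, and the cell values are copied from the table.

```r
# Counts copied from the table; rows index X (GrLivArea), columns index Y (SalePrice).
counts <- matrix(c(224, 141,
                   141, 954),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("<=Q1", ">Q1"), Y = c("<=Q1", ">Q1")))
n <- sum(counts)
y_high <- sum(counts[, ">Q1"])                          # observations with Y > y

p_x_hi_given_y_hi <- counts[">Q1", ">Q1"] / y_high      # Problem 1: P(X > x | Y > y)
p_joint           <- counts[">Q1", ">Q1"] / n           # Problem 2: P(X > x, Y > y)
p_x_lo_given_y_hi <- counts["<=Q1", ">Q1"] / y_high     # Problem 3: P(X <= x | Y > y)

print(c(p_x_hi_given_y_hi, p_joint, p_x_lo_given_y_hi))
```

Working from the counts rather than the raw data makes it easy to see which cell and which margin each probability uses.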

Splitting the data this way does not make the variables independent. Of the houses above the 1st quartile of GrLivArea (the top 75%), about 87% are also above the 1st quartile of SalePrice: 954 out of 1095. Under independence we would expect about 75%, so the variables appear dependent.

Let A be the event that an observation is above the 1st quartile of X, and let B be the event that it is above the 1st quartile of Y.

P(A)P(B)

[1] 0.5625

P(AB)

[1] 0.6534247

Null Hypothesis : GrLivArea and SalePrice are Independent.

       
        FALSE TRUE
  FALSE   224  141
  TRUE    141  954

    Pearson's Chi-squared test with Yates' continuity correction

data:  chiinput
X-squared = 340.75, df = 1, p-value < 2.2e-16

From the above we can say that, mathematically:

\[ P(AB) \neq P(A)P(B) \]

The chi-squared test gives a p-value far below 0.05, so we reject the null hypothesis of independence and conclude that GrLivArea and SalePrice are dependent.
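The independence check can be reproduced from the 2x2 counts alone. This is a sketch under the assumption that the test was run on the table shown above; the object names are illustrative.

```r
# 2x2 counts: rows are A (X > 1st quartile) FALSE/TRUE, columns are B (Y > 1st quartile) FALSE/TRUE.
counts <- matrix(c(224, 141,
                   141, 954), nrow = 2, byrow = TRUE)
n <- sum(counts)

p_a  <- sum(counts[2, ]) / n   # P(A)
p_b  <- sum(counts[, 2]) / n   # P(B)
p_ab <- counts[2, 2] / n       # P(AB)
cat("P(A)P(B) =", p_a * p_b, " P(AB) =", p_ab, "\n")

# chisq.test applies Yates' continuity correction by default for 2x2 tables.
chi_result <- chisq.test(counts)
print(chi_result)
```

If A and B were independent, P(AB) would be close to P(A)P(B); the gap between the two values is what the chi-squared statistic measures.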

Question 2: Descriptive and Inferential Statistics :

First we provide univariate descriptive statistics and plots for the training data set.

          vars    n       mean         sd median   trimmed        mad   min    max  range     skew kurtosis         se
GrLivArea    1 1460   1515.464   525.4804   1464   1467.67   483.3276   334   5642   5308 1.363754 4.863483   13.75245
SalePrice    2 1460 180921.196 79442.5029 163000 170783.29 56338.8000 34900 755000 720100 1.879009 6.496789 2079.10532

From the descriptive statistics above, the mean and standard deviation are about 1515 and 525 for GrLivArea, and about 180921 and 79443 for SalePrice.

Next we plot histograms of X and Y, followed by a scatter plot of X against Y.

The distribution of GrLivArea is right-skewed: the long tail and the outliers lie to the right of the mean.

The distribution of SalePrice is also right-skewed, with outliers to the right of the mean.
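Right skew can also be checked numerically. Below is a minimal sketch of a moment-based skewness function. The helper name and the toy vectors are hypothetical. The skew values in the descriptive-statistics output may use a slightly different small-sample adjustment, so this illustrates the idea and does not reproduce those numbers.

```r
# Moment-based sample skewness: third central moment divided by variance^(3/2).
# Positive values indicate a longer right tail.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric    <- c(-2, -1, 0, 1, 2)       # toy data, no skew
right_tailed <- c(1, 1, 2, 2, 3, 20)     # toy data with one large value on the right

print(moment_skewness(symmetric))
print(moment_skewness(right_tailed))
```

With the real data, the same function could be applied to the GrLivArea and SalePrice columns.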

Next we plot a scatter plot between X and Y

Next we check the correlation among three variables from the Kaggle data: LotArea, GrLivArea and GarageArea. We look at how they are correlated, both numerically and in plots.

             LotArea GrLivArea GarageArea
LotArea    1.0000000 0.2631162  0.1804028
GrLivArea  0.2631162 1.0000000  0.4689975
GarageArea 0.1804028 0.4689975  1.0000000

We test the pairwise correlations, reporting a 92% confidence interval for each.


    Pearson's product-moment correlation

data:  kaggletrain$LotArea and kaggletrain$GrLivArea
t = 10.414, df = 1458, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
92 percent confidence interval:
 0.2199359 0.3052674
sample estimates:
      cor 
0.2631162 

    Pearson's product-moment correlation

data:  kaggletrain$LotArea and kaggletrain$GarageArea
t = 7.0034, df = 1458, p-value = 3.803e-12
alternative hypothesis: true correlation is not equal to 0
92 percent confidence interval:
 0.1356921 0.2243801
sample estimates:
      cor 
0.1804028 

    Pearson's product-moment correlation

data:  kaggletrain$GrLivArea and kaggletrain$LotArea
t = 10.414, df = 1458, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
92 percent confidence interval:
 0.2199359 0.3052674
sample estimates:
      cor 
0.2631162 

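For all of the tests above, the p-value is far below 0.05, so we reject the null hypothesis that the correlation between the variables is zero. Note that the third test repeats the LotArea and GrLivArea pair; the remaining unique pair, GrLivArea and GarageArea (correlation 0.4689975 in the matrix above), is not shown.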
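The test statistic and confidence interval for a Pearson correlation depend only on r and n, so the pair not shown above (GrLivArea and GarageArea) can be examined from the correlation matrix. This is a sketch assuming n = 1460 complete observations, as implied by df = 1458 in the output. The helper `cor_inference` is my own and uses the standard t statistic and Fisher z interval.

```r
# Inference for a Pearson correlation from r and n alone.
cor_inference <- function(r, n, conf_level = 0.92) {
  t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)          # t statistic with n - 2 df
  p_value <- 2 * pt(-abs(t_stat), df = n - 2)         # two-sided p-value
  z <- atanh(r)                                       # Fisher z transform
  half_width <- qnorm(1 - (1 - conf_level) / 2) / sqrt(n - 3)
  list(t = t_stat, p_value = p_value,
       ci = tanh(c(z - half_width, z + half_width)))
}

# Sanity check against the reported LotArea / GrLivArea output.
check <- cor_inference(0.2631162, 1460)
print(check)

# The pair not shown in the output: GrLivArea / GarageArea.
missing_pair <- cor_inference(0.4689975, 1460)
print(missing_pair)
```

The first call should agree with the reported t value and 92 percent interval, which gives some confidence that the second call is a fair stand-in for the missing test.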

Since we reject the null hypothesis in all three cases, we should consider the family-wise Type I error rate. Two kinds of correction control the family-wise error rate: single-step (such as Bonferroni) and sequential (such as Holm). Both adjust the p-values upward. Because the p-values here are extremely small (at most 3.803e-12), they remain far below 0.05 after adjusting for three tests, so a Type I error is very unlikely.
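The effect of a family-wise correction can be checked with `p.adjust` from base R. In this sketch the p-values reported as "< 2.2e-16" are replaced by the upper bound 2.2e-16, which is a conservative assumption; the vector names are illustrative.

```r
# Reported p-values; "< 2.2e-16" is replaced by its upper bound (conservative).
p_raw <- c(test1 = 2.2e-16, test2 = 3.803e-12, test3 = 2.2e-16)

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # single-step correction
p_holm <- p.adjust(p_raw, method = "holm")        # sequential correction

print(rbind(raw = p_raw, bonferroni = p_bonf, holm = p_holm))
```

Even the Bonferroni adjustment, which multiplies each p-value by the number of tests, leaves all three far below any conventional threshold.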

Question 3: Linear Algebra and Correlation:

The precision matrix is the inverse of the correlation matrix.

               LotArea  GrLivArea  GarageArea
LotArea     1.07920917 -0.2469705 -0.07886378
GrLivArea  -0.24697046  1.3385010 -0.58319943
GarageArea -0.07886378 -0.5831994  1.28774631

Next we multiply the correlation matrix by the precision matrix.

           LotArea GrLivArea GarageArea
LotArea          1         0          0
GrLivArea        0         1          0
GarageArea       0         0          1

Then we multiply the precision matrix by the correlation matrix.

           LotArea GrLivArea GarageArea
LotArea          1         0          0
GrLivArea        0         1          0
GarageArea       0         0          1

In both cases we get the identity matrix.
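The inversion and both products can be reproduced from the printed correlation matrix. This is a sketch; the object names `corr_matrix` and `prec_matrix` are assumptions, and the entries are the rounded values shown earlier, so the last digits may differ slightly from the original output.

```r
vars <- c("LotArea", "GrLivArea", "GarageArea")

# Correlation matrix copied from the output above (rounded to 7 decimals).
corr_matrix <- matrix(c(1.0000000, 0.2631162, 0.1804028,
                        0.2631162, 1.0000000, 0.4689975,
                        0.1804028, 0.4689975, 1.0000000),
                      nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# The precision matrix is the matrix inverse.
prec_matrix <- solve(corr_matrix)
print(round(prec_matrix, 7))

# Both products should equal the identity matrix up to floating-point error.
print(round(corr_matrix %*% prec_matrix, 10))
print(round(prec_matrix %*% corr_matrix, 10))
```

A matrix and its inverse commute, which is why the order of multiplication does not matter here.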

Next we perform the LU decomposition of the precision matrix, as discussed in the discussion forum.

[1] "Value of L"
            [,1]       [,2] [,3]
[1,]  1.00000000  0.0000000    0
[2,] -0.22884393  1.0000000    0
[3,] -0.07307553 -0.4689975    1
[1] "Value of U"
      LotArea  GrLivArea  GarageArea
[1,] 1.079209 -0.2469705 -0.07886378
[2,] 0.000000  1.2819833 -0.60124693
[3,] 0.000000  0.0000000  1.00000000
[1] "A = LU"
         LotArea  GrLivArea  GarageArea
[1,]  1.07920917 -0.2469705 -0.07886378
[2,] -0.24697046  1.3385010 -0.58319943
[3,] -0.07886378 -0.5831994  1.28774631
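A Doolittle-style LU decomposition without pivoting yields factors of the same form as those printed above (unit diagonal in L). The function below is a minimal sketch and not necessarily the routine used for the original output; the name `lu_doolittle` is hypothetical, and the input matrix is copied from the printed precision matrix.

```r
# Doolittle LU decomposition without pivoting: A = L %*% U,
# with L unit lower triangular and U upper triangular.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (j in seq_len(n - 1)) {
    for (i in (j + 1):n) {
      L[i, j] <- U[i, j] / U[j, j]          # elimination multiplier
      U[i, ]  <- U[i, ] - L[i, j] * U[j, ]  # zero out the entry below the pivot
    }
  }
  list(L = L, U = U)
}

# Precision matrix as printed above (rounded values).
A <- matrix(c( 1.07920917, -0.2469705, -0.07886378,
              -0.24697046,  1.3385010, -0.58319943,
              -0.07886378, -0.5831994,  1.28774631),
            nrow = 3, byrow = TRUE)

lu_result <- lu_doolittle(A)
print(lu_result$L)
print(lu_result$U)
print(lu_result$L %*% lu_result$U)   # should reproduce A
```

Skipping pivoting is safe here because the leading pivots of this matrix are all well away from zero; a general-purpose routine would pivot.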

Question 4: Calculus-Based Probability and Statistics

In our data set the minimum of both variables is greater than zero, so no shift is needed. Both variables are right-skewed; we use GrLivArea for this problem.

   vars    n    mean     sd median trimmed    mad min  max range skew
X1    1 1460 1515.46 525.48   1464 1467.67 483.33 334 5642  5308 1.36
   kurtosis    se
X1     4.86 13.75
   vars    n     mean      sd median  trimmed     mad   min    max  range
X1    1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
   skew kurtosis      se
X1 1.88      6.5 2079.11

Then we load the MASS package and run fitdistr to fit an exponential probability density function.

       rate    
  6.598640e-04 
 (1.726943e-05)

Find the optimal value of lambda for this distribution, then take 1000 samples from this exponential distribution using this lambda value.

The optimal value of lambda:

       rate 
0.000659864 

Next we draw 1000 samples from the fitted exponential distribution using this lambda value.
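For an exponential distribution the maximum-likelihood estimate of the rate is 1 divided by the sample mean, and its asymptotic standard error is the rate divided by sqrt(n), so the fitted values above can be cross-checked by hand. This sketch uses the rounded mean from the descriptive statistics; the seed is arbitrary, so the sample will not match the one used in the original analysis.

```r
n      <- 1460
mean_x <- 1515.464                 # mean of GrLivArea from the descriptive statistics

rate_hat <- 1 / mean_x             # maximum-likelihood estimate of lambda
rate_se  <- rate_hat / sqrt(n)     # asymptotic standard error of the estimate
print(c(rate = rate_hat, se = rate_se))

set.seed(123)                      # arbitrary seed, an assumption for reproducibility
exp_sample <- rexp(1000, rate = rate_hat)
print(quantile(exp_sample, c(0.05, 0.95)))
```

Because the draws are random, sample percentiles will vary from run to run unless a seed is fixed.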

Plot the histogram and compare it with the original histogram.

Using the exponential PDF, find the 5th and 95th percentile using the cumulative distribution function.

The 5th and 95th percentiles of the exponential sample:

     5% 
86.8844 
    95% 
4326.99 
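The percentiles reported above come from the 1000 random draws. The exact percentiles of the fitted exponential distribution follow from inverting its cumulative distribution function, F(x) = 1 - exp(-lambda x), which gives x = -log(1 - p) / lambda. A short sketch using the fitted rate:

```r
rate_hat <- 0.000659864            # fitted lambda from the output above
probs    <- c(0.05, 0.95)

q_exact   <- qexp(probs, rate = rate_hat)    # built-in exponential quantile function
q_by_hand <- -log(1 - probs) / rate_hat      # inverse CDF written out explicitly

print(rbind(qexp = q_exact, by_hand = q_by_hand))
```

The exact values are roughly 78 and 4540, so the sample percentiles differ from them only through sampling variation.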

Also generate a 95% confidence interval from the empirical data, assuming normality.


    One Sample t-test

data:  CalXYData$GrLivArea
t = 110.2, df = 1459, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1488.487 1542.440
sample estimates:
mean of x 
 1515.464 

The 95% confidence interval for the mean of GrLivArea runs from 1488.487 to 1542.440.
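The interval in the t-test output can be rebuilt from the summary statistics alone, which makes explicit that it is an interval for the mean and not for individual observations. A minimal sketch using the rounded mean and standard deviation reported earlier:

```r
n      <- 1460
mean_x <- 1515.464
sd_x   <- 525.4804

std_error <- sd_x / sqrt(n)                      # standard error of the mean
t_crit    <- qt(0.975, df = n - 1)               # two-sided 95% critical value
ci_mean   <- mean_x + c(-1, 1) * t_crit * std_error

print(c(se = std_error, lower = ci_mean[1], upper = ci_mean[2]))
```

The interval is narrow because the standard error shrinks with sqrt(n), even though the data themselves are widely spread.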

Finally, provide the empirical 5th percentile and 95th percentile of the data.

 5% 
848 
   95% 
2466.1 

From the above analysis, the 5th and 95th percentiles of the exponential sample (86.8844 and 4326.99) deviate substantially from the empirical percentiles (848 and 2466.1). The exponential distribution is therefore a poor fit for GrLivArea, and a normal distribution would be a better choice than the exponential.

Question 5: Modelling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your kaggle.com user name and score.

Data cleansing: removal of NAs

           Id    MSSubClass      MSZoning   LotFrontage       LotArea 
    "integer"     "integer"      "factor"     "integer"     "integer" 
       Street         Alley      LotShape   LandContour     Utilities 
     "factor"      "factor"      "factor"      "factor"      "factor" 
    LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
     "factor"      "factor"      "factor"      "factor"      "factor" 
     BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
     "factor"      "factor"     "integer"     "integer"     "integer" 
 YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
    "integer"      "factor"      "factor"      "factor"      "factor" 
   MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
     "factor"     "integer"      "factor"      "factor"      "factor" 
     BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
     "factor"      "factor"      "factor"      "factor"     "integer" 
 BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
     "factor"     "integer"     "integer"     "integer"      "factor" 
    HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
     "factor"      "factor"      "factor"     "integer"     "integer" 
 LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
    "integer"     "integer"     "integer"     "integer"     "integer" 
     HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
    "integer"     "integer"     "integer"      "factor"     "integer" 
   Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
     "factor"     "integer"      "factor"      "factor"     "integer" 
 GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
     "factor"     "integer"     "integer"      "factor"      "factor" 
   PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
     "factor"     "integer"     "integer"     "integer"     "integer" 
  ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
    "integer"     "integer"      "factor"      "factor"      "factor" 
 [ reached getOption("max.print") -- omitted 6 entries ]

Step 1: First we check the linearity between the dependent variable and the independent variables. The dependent variable is SalePrice; the remaining variables are candidate independent variables.

From the entire data set, we picked a few numeric variables to see how they relate to SalePrice.

From the above plot we can see strong correlations between SalePrice and each of GrLivArea, TotalBsmtSF, FullBath and TotRmsAbvGrd.

Step 2: Next we check the linearity and correlation between the independent variables. This is needed to check for multicollinearity.

From the above analysis we can see that GrLivArea and TotRmsAbvGrd are highly correlated, so we might have to drop one of them. GrLivArea is too important to drop, so TotRmsAbvGrd is the candidate. For the initial analysis, however, we keep both and decide which to reject based on the p-values when we fit the model.

Step 3: Linear model

First we create a model with SalePrice as the response and all the independent variables we are interested in.


Call:
lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea + 
    Street + YearBuilt + OverallQual + OverallCond + MasVnrArea + 
    BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + BsmtFullBath + 
    BsmtHalfBath + X1stFlrSF + X2ndFlrSF + GrLivArea + FullBath + 
    HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageCars + 
    GarageArea + WoodDeckSF + OpenPorchSF + PoolArea + EnclosedPorch + 
    Fireplaces + YrSold, data = kaggletrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-454792  -18057   -2458   14127  317595 

Coefficients: (1 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -5.678e+05  1.628e+06  -0.349 0.727380    
MSSubClass    -1.911e+02  3.236e+01  -5.905 4.61e-09 ***
LotFrontage   -9.751e+01  5.841e+01  -1.670 0.095283 .  
LotArea        5.613e-01  1.561e-01   3.595 0.000338 ***
StreetPave     1.140e+04  1.678e+04   0.679 0.497295    
YearBuilt      3.474e+02  6.337e+01   5.482 5.17e-08 ***
OverallQual    1.785e+04  1.382e+03  12.916  < 2e-16 ***
OverallCond    5.560e+03  1.132e+03   4.912 1.03e-06 ***
MasVnrArea     3.475e+01  6.878e+00   5.052 5.08e-07 ***
BsmtFinSF1     1.744e+01  5.520e+00   3.160 0.001619 ** 
BsmtFinSF2     7.633e+00  8.442e+00   0.904 0.366111    
BsmtUnfSF      7.558e+00  4.935e+00   1.532 0.125874    
TotalBsmtSF           NA         NA      NA       NA    
BsmtFullBath   1.086e+04  3.014e+03   3.604 0.000326 ***
BsmtHalfBath   2.735e+03  4.837e+03   0.565 0.571863    
X1stFlrSF      8.461e+00  2.262e+01   0.374 0.708369    
X2ndFlrSF      6.584e+00  2.216e+01   0.297 0.766448    
GrLivArea      4.000e+01  2.189e+01   1.827 0.067889 .  
FullBath       8.446e+03  3.264e+03   2.587 0.009790 ** 
HalfBath       8.532e+02  3.129e+03   0.273 0.785130    
BedroomAbvGr  -1.080e+04  1.933e+03  -5.588 2.86e-08 ***
KitchenAbvGr  -1.393e+04  5.817e+03  -2.394 0.016811 *  
TotRmsAbvGrd   5.148e+03  1.425e+03   3.612 0.000317 ***
GarageCars     1.190e+04  3.280e+03   3.628 0.000298 ***
GarageArea    -2.243e+00  1.139e+01  -0.197 0.843997    
WoodDeckSF     2.063e+01  9.623e+00   2.144 0.032238 *  
OpenPorchSF    6.106e+00  1.793e+01   0.341 0.733511    
PoolArea      -6.052e+01  2.921e+01  -2.072 0.038475 *  
EnclosedPorch  7.490e+00  1.951e+01   0.384 0.701088    
Fireplaces     4.749e+03  2.089e+03   2.273 0.023208 *  
YrSold        -9.695e+01  8.085e+02  -0.120 0.904569    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36720 on 1165 degrees of freedom
  (265 observations deleted due to missingness)
Multiple R-squared:  0.8098,    Adjusted R-squared:  0.8051 
F-statistic:   171 on 29 and 1165 DF,  p-value: < 2.2e-16

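From the above summary we can eliminate variables with high p-values, such as OpenPorchSF, Street and X2ndFlrSF among others, and see how R^2 and adjusted R^2 change. TotalBsmtSF has no estimate (NA) in this first model because it is an exact linear combination of BsmtFinSF1, BsmtFinSF2 and BsmtUnfSF; once BsmtFinSF2 and BsmtUnfSF are dropped, it can be estimated.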


Call:
lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea + 
    YearBuilt + OverallQual + OverallCond + MasVnrArea + BsmtFinSF1 + 
    TotalBsmtSF + BsmtFullBath + GrLivArea + FullBath + BedroomAbvGr + 
    KitchenAbvGr + TotRmsAbvGrd + GarageCars + WoodDeckSF + PoolArea + 
    Fireplaces, data = kaggletrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-456447  -17968   -2272   14016  316480 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -7.560e+05  1.078e+05  -7.010 4.01e-12 ***
MSSubClass   -1.921e+02  3.126e+01  -6.146 1.08e-09 ***
LotFrontage  -9.903e+01  5.749e+01  -1.723 0.085231 .  
LotArea       5.544e-01  1.541e-01   3.598 0.000334 ***
YearBuilt     3.500e+02  5.482e+01   6.384 2.48e-10 ***
OverallQual   1.792e+04  1.360e+03  13.183  < 2e-16 ***
OverallCond   5.626e+03  1.102e+03   5.104 3.88e-07 ***
MasVnrArea    3.459e+01  6.795e+00   5.090 4.16e-07 ***
BsmtFinSF1    1.030e+01  3.564e+00   2.890 0.003925 ** 
TotalBsmtSF   8.178e+00  3.662e+00   2.233 0.025706 *  
BsmtFullBath  1.043e+04  2.782e+03   3.747 0.000187 ***
GrLivArea     4.753e+01  4.902e+00   9.696  < 2e-16 ***
FullBath      8.120e+03  2.964e+03   2.739 0.006251 ** 
BedroomAbvGr -1.073e+04  1.892e+03  -5.673 1.76e-08 ***
KitchenAbvGr -1.367e+04  5.547e+03  -2.464 0.013901 *  
TotRmsAbvGrd  5.103e+03  1.414e+03   3.609 0.000320 ***
GarageCars    1.142e+04  1.906e+03   5.991 2.78e-09 ***
WoodDeckSF    2.091e+01  9.522e+00   2.196 0.028287 *  
PoolArea     -6.071e+01  2.882e+01  -2.107 0.035365 *  
Fireplaces    4.978e+03  2.028e+03   2.454 0.014266 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 36590 on 1175 degrees of freedom
  (265 observations deleted due to missingness)
Multiple R-squared:  0.8096,    Adjusted R-squared:  0.8065 
F-statistic: 262.9 on 19 and 1175 DF,  p-value: < 2.2e-16

From the above model we can see that R-squared stayed essentially the same (0.8098 versus 0.8096), while adjusted R-squared improved marginally (0.8051 to 0.8065).
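Adjusted R-squared penalizes the number of predictors, which is why it can rise when weak predictors are removed even though R-squared falls slightly. The sketch below recomputes it for both models from the reported R-squared values, using the standard formula; the function name is my own.

```r
# Adjusted R-squared for a model with p predictors fitted on n observations.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_used <- 1460 - 265        # rows left after 265 were dropped for missing values

adj_model1 <- adjusted_r2(0.8098, n_used, p = 29)   # first model: 29 estimated slopes
adj_model2 <- adjusted_r2(0.8096, n_used, p = 19)   # second model: 19 estimated slopes

print(round(c(model1 = adj_model1, model2 = adj_model2), 4))
```

Note that both models were fitted on the same 1195 complete rows, so the two values are directly comparable.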

Step 4: Residual analysis

Comparing the Q-Q plots of Model 1 and Model 2, the Model 2 plot looks slightly better, with a less heavy upper tail. We therefore prefer the second model over the first.

Step 5: Predicting and submitting to Kaggle

    Id SalePrice
1 1461  115907.6
2 1462  166464.8
3 1463  175943.6
4 1464  203051.6
5 1465  191692.4
6 1466  185939.8
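The prediction step can be sketched as a small helper that writes the two-column file shown above. Everything here is illustrative: `make_submission` is a hypothetical name, the toy data frames stand in for the Kaggle train and test files, and filling missing predictions with the mean prediction is an assumed placeholder, since the original handling of missing predictors is not shown.

```r
make_submission <- function(model, test_df, path) {
  pred <- predict(model, newdata = test_df)
  # Assumption: rows with missing predictors get the mean prediction, a crude
  # placeholder so that every Id has a value.
  pred[is.na(pred)] <- mean(pred, na.rm = TRUE)
  out <- data.frame(Id = test_df$Id, SalePrice = as.numeric(pred))
  write.csv(out, path, row.names = FALSE)
  invisible(out)
}

# Toy data standing in for the Kaggle files (hypothetical values).
toy_train <- data.frame(GrLivArea = c(800, 1200, 1500, 2000, 2500),
                        SalePrice = c(100000, 150000, 180000, 240000, 300000))
toy_test  <- data.frame(Id = 1461:1463, GrLivArea = c(1000, NA, 1800))

toy_model  <- lm(SalePrice ~ GrLivArea, data = toy_train)
out_path   <- tempfile(fileext = ".csv")
submission <- make_submission(toy_model, toy_test, out_path)
print(submission)
```

With the real data, `toy_model` would be replaced by the second model and `toy_test` by the competition test set.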

Kaggle Website Details

UserName:dillynesan DisplayName:DilipGanesan Email:dilipgan@gmail.com

Final Score.

There is still scope for improvement.

Recall that the dependent variable is right-skewed. To bring it closer to a normal distribution, we can apply a log transformation to SalePrice and repeat the analysis to see what impact it has.
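The proposed log transformation could look roughly like this. The sketch uses simulated data, not the Kaggle file, with a single predictor, purely to show the mechanics: fit on log(SalePrice), then back-transform predictions with exp. The data-generating values are arbitrary assumptions.

```r
set.seed(42)   # arbitrary seed for the simulated data

# Simulated stand-in for the training data: prices are log-linear in GrLivArea.
toy <- data.frame(GrLivArea = runif(200, 500, 3000))
toy$SalePrice <- exp(11 + 0.0005 * toy$GrLivArea + rnorm(200, sd = 0.1))

# Fit the model on the log scale.
log_fit <- lm(log(SalePrice) ~ GrLivArea, data = toy)
print(coef(log_fit))

# Predictions come back on the log scale, so exponentiate to get prices.
pred_price <- exp(predict(log_fit, newdata = toy))
print(head(pred_price))
```

One practical benefit of the log scale is that back-transformed predictions are always positive, which a model fitted on raw prices does not guarantee.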

Reference : https://stats.stackexchange.com/questions/298/in-linear-regression-when-is-it-appropriate-to-use-the-log-of-an-independent-va/3530#3530