Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation \(\mu = \sigma = (N+1)/2\).

X Y
1.624248 2.013936
8.359977 -4.616887
9.483596 6.414498
3.424437 8.840356
2.524133 5.920598
1.305061 4.692422
2.609065 -3.742906
6.774988 10.915229
1.205900 6.883874
1.074923 7.445159
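
A minimal sketch of how the two variables could be generated. The original code is not shown, so the seed and the choice N = 10 are assumptions (N = 10 is merely consistent with the sample values printed above).

```r
set.seed(42)   # assumed seed, used only so the sketch is reproducible
N <- 10        # assumed; any N >= 6 satisfies the prompt
n <- 10000

X <- runif(n, min = 1, max = N)                       # uniform on [1, N]
Y <- rnorm(n, mean = (N + 1) / 2, sd = (N + 1) / 2)   # mu = sigma = (N + 1) / 2

# First ten rows, in the same layout as the table above
head(data.frame(X, Y), 10)
```

Because the seed is an assumption, the simulated values will differ from the ones printed above.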

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

## [1] 5.540831
##      25% 
## 1.813683

a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)

## [1] 0.5514503
## [1] 0.3742
## [1] 0.4485497
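
The three probabilities can be estimated empirically as relative frequencies. The sketch below regenerates X and Y under the same assumptions as before (seed 42 and N = 10 are assumed, not taken from the original code), so the numbers will be close to, but not identical to, the output above.

```r
set.seed(42)                                  # assumed seed
N <- 10                                       # assumed N
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)

x <- median(X)                                # "x" = median of X
y <- unname(quantile(Y, 0.25))                # "y" = 1st quartile of Y

# a. P(X > x | X > y): among draws with X above y, the share that also exceed x
p_a <- sum(X > x & X > y) / sum(X > y)

# b. P(X > x, Y > y): share of draws where both events occur together
p_b <- mean(X > x & Y > y)

# c. P(X < x | X > y): among draws with X above y, the share that fall below x
p_c <- sum(X < x & X > y) / sum(X > y)

c(a = p_a, b = p_b, c = p_c)
```

Since (a) and (c) condition on the same event and split it into X > x and X < x, they should sum to 1, which matches the reported 0.5514503 + 0.4485497.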

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

## [1] 0.3742
## [1] 0.375
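
One way to build the requested table is a 2x2 contingency table of the two events, converted to proportions with margins added. This is a sketch under the same assumed seed and N as before; the chi-square test at the end is an optional extra, not something the original output shows.

```r
set.seed(42)                                  # assumed seed
N <- 10                                       # assumed N
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- unname(quantile(Y, 0.25))

# Counts of the four combinations of the two events
tab <- table(X_gt_x = X > x, Y_gt_y = Y > y)

# Joint probabilities in the cells, marginal probabilities in the "Sum" row/column
joint <- prop.table(tab)
addmargins(joint)

p_joint   <- joint["TRUE", "TRUE"]            # P(X > x and Y > y)
p_product <- mean(X > x) * mean(Y > y)        # P(X > x) * P(Y > y)
c(joint = p_joint, product = p_product)

# Optional: formal test of independence on the counts
chisq.test(tab)
```

If X and Y are independent, the joint probability should be close to the product of the marginals, as in the reported 0.3742 versus 0.375.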

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypothesis that the correlation between each pair of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 60 RL 65 8450 Pave NA Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NA Attchd 2003 RFn 2 548 TA TA Y 0 61 0 0 0 0 NA NA NA 0 2 2008 WD Normal 208500
2 20 RL 80 9600 Pave NA Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976 RFn 2 460 TA TA Y 298 0 0 0 0 0 NA NA NA 0 5 2007 WD Normal 181500
3 60 RL 68 11250 Pave NA IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001 RFn 2 608 TA TA Y 0 42 0 0 0 0 NA NA NA 0 9 2008 WD Normal 223500
4 70 RL 60 9550 Pave NA IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998 Unf 3 642 TA TA Y 0 35 272 0 0 0 NA NA NA 0 2 2006 WD Abnorml 140000
5 60 RL 84 14260 Pave NA IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000 RFn 3 836 TA TA Y 192 84 0 0 0 0 NA NA NA 0 12 2008 WD Normal 250000
6 50 RL 85 14115 Pave NA IR1 Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam 1.5Fin 5 5 1993 1995 Gable CompShg VinylSd VinylSd None 0 TA TA Wood Gd TA No GLQ 732 Unf 0 64 796 GasA Ex Y SBrkr 796 566 0 1362 1 0 1 1 1 1 TA 5 Typ 0 NA Attchd 1993 Unf 2 480 TA TA Y 40 30 0 320 0 0 NA MnPrv Shed 700 10 2009 WD Normal 143000
7 20 RL 75 10084 Pave NA Reg Lvl AllPub Inside Gtl Somerst Norm Norm 1Fam 1Story 8 5 2004 2005 Gable CompShg VinylSd VinylSd Stone 186 Gd TA PConc Ex TA Av GLQ 1369 Unf 0 317 1686 GasA Ex Y SBrkr 1694 0 0 1694 1 0 2 0 3 1 Gd 7 Typ 1 Gd Attchd 2004 RFn 2 636 TA TA Y 255 57 0 0 0 0 NA NA NA 0 8 2007 WD Normal 307000
8 60 RL NA 10382 Pave NA IR1 Lvl AllPub Corner Gtl NWAmes PosN Norm 1Fam 2Story 7 6 1973 1973 Gable CompShg HdBoard HdBoard Stone 240 TA TA CBlock Gd TA Mn ALQ 859 BLQ 32 216 1107 GasA Ex Y SBrkr 1107 983 0 2090 1 0 2 1 3 1 TA 7 Typ 2 TA Attchd 1973 RFn 2 484 TA TA Y 235 204 228 0 0 0 NA NA Shed 350 11 2009 WD Normal 200000
9 50 RM 51 6120 Pave NA Reg Lvl AllPub Inside Gtl OldTown Artery Norm 1Fam 1.5Fin 7 5 1931 1950 Gable CompShg BrkFace Wd Shng None 0 TA TA BrkTil TA TA No Unf 0 Unf 0 952 952 GasA Gd Y FuseF 1022 752 0 1774 0 0 2 0 2 2 TA 8 Min1 2 TA Detchd 1931 Unf 2 468 Fa TA Y 90 0 205 0 0 0 NA NA NA 0 4 2008 WD Abnorml 129900
10 190 RL 50 7420 Pave NA Reg Lvl AllPub Corner Gtl BrkSide Artery Artery 2fmCon 1.5Unf 5 6 1939 1950 Gable CompShg MetalSd MetalSd None 0 TA TA BrkTil TA TA No GLQ 851 Unf 0 140 991 GasA Ex Y SBrkr 1077 0 0 1077 1 0 1 0 2 2 TA 5 Typ 2 TA Attchd 1939 RFn 1 205 Gd TA Y 0 4 0 0 0 0 NA NA NA 0 1 2008 WD Normal 118000

Univariate descriptive statistics

-Provide univariate descriptive statistics and appropriate plots for the training data set

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Scatterplot Matrix

-Provide a scatterplot matrix for at least two of the independent variables and the dependent variable

Correlation Matrix

-Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypothesis that the correlation between each pair of variables is 0 and provide an 80% confidence interval

##                  train.LotArea train.GrLivArea train.GarageArea
## train.LotArea        1.0000000       0.2631162        0.1804028
## train.GrLivArea      0.2631162       1.0000000        0.4689975
## train.GarageArea     0.1804028       0.4689975        1.0000000

Hypotheses Test

## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162
## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$GarageArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1477356 0.2126767
## sample estimates:
##       cor 
## 0.1804028
## 
##  Pearson's product-moment correlation
## 
## data:  train$GarageArea and train$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975
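
As a check on the output above, the t statistic and the 80% interval for the first pair can be reproduced by hand from the reported correlation and degrees of freedom. The sketch uses the Fisher z-transformation, which is how `cor.test` builds its Pearson confidence interval; only the two input numbers are taken from the output, everything else is computed.

```r
r  <- 0.2631162      # reported correlation of LotArea and GrLivArea
df <- 1458           # reported degrees of freedom, so n = df + 2
n  <- df + 2
conf_level <- 0.80

# Test statistic for H0: true correlation is 0
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# Fisher z interval: transform, add the normal margin, transform back
z    <- atanh(r)
se   <- 1 / sqrt(n - 3)
crit <- qnorm(1 - (1 - conf_level) / 2)
ci   <- tanh(z + c(-1, 1) * crit * se)

round(c(t = t_stat, lower = ci[1], upper = ci[2]), 4)
```

The result should agree with the reported t = 10.414 and the interval 0.2315997 to 0.2940809, up to rounding of the inputs.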

Analysis Observation

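All three tests have p-values far below 0.05 (the largest is 3.803e-12), and none of the 80% confidence intervals contains 0, so the null hypothesis of zero correlation is rejected for each pair. Familywise error (FWE) is the probability of making at least one Type I error when performing multiple hypothesis tests. Three tests are run here, each at an 80% confidence level (α = 0.20), so the familywise error rate could be as high as 1 − 0.8³ ≈ 0.49 if the tests were independent, which would normally be a concern. This problem can be avoided by running the correlation tests at a higher confidence level. In this case the p-values are so small that the conclusions would not change.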
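
The familywise error rate for the three tests above can be quantified with a short calculation. The sketch assumes independent tests for the first figure (an assumption, since the three pairs share variables) and uses 2.2e-16 as a stand-in upper bound for the two p-values reported only as "< 2.2e-16".

```r
alpha <- 0.20    # per-test Type I error rate implied by an 80% confidence level
m     <- 3       # number of pairwise correlation tests

# Chance of at least one false rejection if the tests were independent
fwer_independent <- 1 - (1 - alpha)^m

# Bonferroni: test each pair at alpha / m to keep the familywise rate <= alpha
alpha_bonferroni <- alpha / m

# Reported p-values; 2.2e-16 is an upper bound for the "< 2.2e-16" entries
p_values <- c(2.2e-16, 3.803e-12, 2.2e-16)
p_adjusted <- p.adjust(p_values, method = "bonferroni")

fwer_independent
alpha_bonferroni
all(p_adjusted < alpha)
```

Even after the Bonferroni adjustment, all three adjusted p-values remain far below 0.20.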

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Inversion

##                  train.LotArea train.GrLivArea train.GarageArea
## train.LotArea       1.07920917      -0.2469705      -0.07886378
## train.GrLivArea    -0.24697046       1.3385010      -0.58319943
## train.GarageArea   -0.07886378      -0.5831994       1.28774631
##                  train.LotArea train.GrLivArea train.GarageArea
## train.LotArea                1               0                0
## train.GrLivArea              0               1                0
## train.GarageArea             0               0                1

Identity Matrix

-Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix

##                  train.LotArea train.GrLivArea train.GarageArea
## train.LotArea                1               0                0
## train.GrLivArea              0               1                0
## train.GarageArea             0               0                1
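
The inversion and both products can be reproduced directly from the correlation values reported earlier. In this sketch the matrix is typed in from that output rather than computed from the training data, and the shortened variable names are a convenience.

```r
vars <- c("LotArea", "GrLivArea", "GarageArea")

# Correlation matrix, entered from the reported output
R <- matrix(c(1.0000000, 0.2631162, 0.1804028,
              0.2631162, 1.0000000, 0.4689975,
              0.1804028, 0.4689975, 1.0000000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(R)          # precision matrix (inverse of the correlation matrix)
diag(P)                # variance inflation factors sit on the diagonal

# Both products should be the identity, up to floating-point error
round(R %*% P, 10)
round(P %*% R, 10)
```

The diagonal of `P` should match the reported 1.079, 1.339 and 1.288. All three are close to 1, which suggests little collinearity among these variables.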

LU Decomposition

-Conduct LU decomposition on the matrix.

## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000         .         .
## [2,] 0.2631162 1.0000000         .
## [3,] 0.1804028 0.4528838 1.0000000
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000 0.2631162 0.1804028
## [2,]         . 0.9307699 0.4215306
## [3,]         .         . 0.7765505
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]        [,2]        [,3]       
## [1,]  1.00000000           .           .
## [2,] -0.22884393  1.00000000           .
## [3,] -0.07307553 -0.46899748  1.00000000
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]        [,2]        [,3]       
## [1,]  1.07920917 -0.24697046 -0.07886378
## [2,]           .  1.28198329 -0.60124693
## [3,]           .           .  1.00000000
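
The output above has the look of the Matrix package (`dtrMatrix` objects), but the original call is not shown. As an independent check, here is a hand-rolled Doolittle factorization in base R, without pivoting, applied to the correlation matrix. The function name `lu_doolittle` is hypothetical, and skipping pivoting is safe here only because no pivot is zero or tiny.

```r
# Doolittle LU: A = L %*% U with a unit lower-triangular L and upper-triangular U
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # multiplier that eliminates U[i, k]
      U[i, ]  <- U[i, ] - L[i, k] * U[k, ]  # row elimination step
    }
  }
  list(L = L, U = U)
}

# Correlation matrix, entered from the reported output
R <- matrix(c(1.0000000, 0.2631162, 0.1804028,
              0.2631162, 1.0000000, 0.4689975,
              0.1804028, 0.4689975, 1.0000000),
            nrow = 3, byrow = TRUE)

dec <- lu_doolittle(R)
round(dec$L, 7)
round(dec$U, 7)
max(abs(dec$L %*% dec$U - R))   # reconstruction error, should be ~0
```

The factors should agree with the first pair of matrices printed above. Applying the same function to `solve(R)` should give the second pair.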

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

##         rate 
## 0.0009456896
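
Given the fitted rate, the percentiles and the simulated sample follow directly. The sketch starts from the reported \(\lambda\) rather than refitting, because the variable that was fitted is not named in the output; the seed is an assumption.

```r
set.seed(42)               # assumed seed
rate <- 0.0009456896       # lambda reported by fitdistr above

# Percentiles from the exponential quantile function
p05 <- qexp(0.05, rate = rate)
p95 <- qexp(0.95, rate = rate)

# Same values from inverting the CDF F(q) = 1 - exp(-lambda * q)
p05_manual <- -log(1 - 0.05) / rate
p95_manual <- -log(1 - 0.95) / rate

# 1000 draws from the fitted distribution, for the histogram comparison
samples <- rexp(1000, rate = rate)

c(p05 = p05, p95 = p95, sample_mean = mean(samples), theoretical_mean = 1 / rate)
```

The maximum likelihood estimate of an exponential rate is the reciprocal of the sample mean, so 1/\(\lambda\) ≈ 1057 is the mean of the shifted variable. Calling `hist(samples)` and `hist()` on the original variable would give the two histograms to compare.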

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

## 
## Call:
## lm(formula = train.SalePrice ~ ., data = quantitative)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -512233  -17548   -1737   14681  283280 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.094e+06  1.268e+05  -8.627  < 2e-16 ***
## train.OverallQual   1.856e+04  1.174e+03  15.807  < 2e-16 ***
## train.YearBuilt     1.638e+02  4.978e+01   3.290 0.001028 ** 
## train.YearRemodAdd  3.564e+02  6.208e+01   5.741 1.15e-08 ***
## train.MasVnrArea    2.881e+01  6.159e+00   4.678 3.17e-06 ***
## train.BsmtFinSF1    1.725e+01  2.596e+00   6.646 4.26e-11 ***
## train.TotalBsmtSF   1.165e+01  4.298e+00   2.711 0.006796 ** 
## train.X1stFlrSF     2.618e+01  2.082e+01   1.257 0.208871    
## train.X2ndFlrSF     1.753e+01  2.048e+01   0.856 0.392000    
## train.GrLivArea     2.135e+01  2.035e+01   1.049 0.294370    
## train.FullBath     -1.489e+03  2.630e+03  -0.566 0.571228    
## train.TotRmsAbvGrd  1.688e+03  1.089e+03   1.550 0.121402    
## train.Fireplaces    7.888e+03  1.783e+03   4.423 1.05e-05 ***
## train.GarageCars    1.011e+04  2.960e+03   3.414 0.000659 ***
## train.GarageArea    1.040e+01  1.005e+01   1.035 0.301006    
## train.WoodDeckSF    3.068e+01  8.129e+00   3.774 0.000167 ***
## train.OpenPorchSF   7.271e+00  1.572e+01   0.462 0.643861    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36380 on 1435 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7918, Adjusted R-squared:  0.7894 
## F-statistic:   341 on 16 and 1435 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = SalePrice ~ ., data = quantitative2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -407840  -21443   -2760   16410  363961 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.307e+05  1.210e+05  -6.867 9.70e-12 ***
## OverallQual   2.449e+04  1.183e+03  20.706  < 2e-16 ***
## YearRemodAdd  3.925e+02  6.256e+01   6.273 4.66e-10 ***
## MasVnrArea    4.651e+01  6.602e+00   7.045 2.85e-12 ***
## BsmtFinSF1    1.482e+01  2.752e+00   5.383 8.52e-08 ***
## TotalBsmtSF   2.504e+01  3.290e+00   7.611 4.89e-14 ***
## Fireplaces    1.551e+04  1.849e+03   8.389  < 2e-16 ***
## GarageCars    1.794e+04  1.820e+03   9.855  < 2e-16 ***
## WoodDeckSF    4.464e+01  8.848e+00   5.045 5.12e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39960 on 1443 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7474, Adjusted R-squared:  0.746 
## F-statistic: 533.6 on 8 and 1443 DF,  p-value: < 2.2e-16

The residuals appear nearly normally distributed, with perhaps some outliers. The fit is not perfect (adjusted R-squared of 0.746), but all independent variables in the reduced model are statistically significant. Let's check the performance using the test data.
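
A sketch of the fit-then-predict workflow. The data here are simulated stand-ins (they are NOT the Kaggle data), the column names borrow three predictors from the reduced model above, and the frames `train_sim` and `test_sim` are hypothetical.

```r
set.seed(42)   # assumed seed
n <- 200

# Simulated stand-in for the training data; values are synthetic
train_sim <- data.frame(
  OverallQual = sample(1:10, n, replace = TRUE),
  GarageCars  = sample(0:3,  n, replace = TRUE),
  Fireplaces  = sample(0:2,  n, replace = TRUE)
)
train_sim$SalePrice <- 20000 +
  25000 * train_sim$OverallQual +
  15000 * train_sim$GarageCars +
  10000 * train_sim$Fireplaces +
  rnorm(n, sd = 30000)

# Multiple regression on all remaining columns
fit <- lm(SalePrice ~ ., data = train_sim)
summary(fit)$adj.r.squared

# Hypothetical test rows; predict() needs the same column names as the model
test_sim <- data.frame(OverallQual = c(5, 8),
                       GarageCars  = c(1, 2),
                       Fireplaces  = c(0, 1))
pred <- predict(fit, newdata = test_sim)

# Two-column layout (Id, SalePrice), assumed here for a submission file
submission <- data.frame(Id = seq_along(pred), SalePrice = pred)
submission
```

With the real data, `train_sim` and `test_sim` would be replaced by the selected columns of the training and test sets, and the `Id` values would come from the test file.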