Computational Mathematics.

Your final is due by the end of day on 19 May. This project will show off your ability to understand the elements of the class.

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

Introduction

For this final exam, we were asked to go to Kaggle.com and download the train.csv data set for the House Prices: Advanced Regression Techniques competition. We need to use the data to build a model of housing prices and then enter the model's final output in the Kaggle competition. The submission receives a score, and I will report my Kaggle username along with my score.

Data Import

First, I import the train.csv data set that I downloaded from Kaggle.com, using the read.csv command.

## [1] 1460   81
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 
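The import step described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the four-column CSV written to a temporary file is a hypothetical stand-in for train.csv so that the example runs without the Kaggle download, and the name `housing` is an assumption.

```r
# Hypothetical miniature of train.csv, written to a temporary file so the
# example runs without the Kaggle download.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Id,LotFrontage,LotArea,SalePrice",
  "1,65,8450,208500",
  "2,80,9600,181500",
  "3,NA,11250,223500"
), csv_path)

# read.csv treats the literal string NA as a missing value by default.
housing <- read.csv(csv_path, stringsAsFactors = FALSE)

dim(housing)      # number of rows and columns
summary(housing)  # per-column summary, including NA counts
```

With the real file, `csv_path` would simply point to the downloaded train.csv.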

Data Tidy

To clean the data, I use mean imputation: each NA in a numeric column is replaced with the mean of that column. This method was applied to LotFrontage, MasVnrArea, and GarageYrBlt.

##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 60.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 70.05  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 79.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                                      
##     LotArea          Street            LotShape         LandContour       
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##   Utilities          LotConfig          LandSlope         Neighborhood      
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   Condition1         Condition2          BldgType          HouseStyle       
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   OverallQual      OverallCond      YearBuilt     YearRemodAdd 
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967  
##  Median : 6.000   Median :5.000   Median :1973   Median :1994  
##  Mean   : 6.099   Mean   :5.575   Mean   :1971   Mean   :1985  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##   RoofStyle           RoofMatl         Exterior1st        Exterior2nd       
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   MasVnrType          MasVnrArea      ExterQual          ExterCond        
##  Length:1460        Min.   :   0.0   Length:1460        Length:1460       
##  Class :character   1st Qu.:   0.0   Class :character   Class :character  
##  Mode  :character   Median :   0.0   Mode  :character   Mode  :character  
##                     Mean   : 103.7                                        
##                     3rd Qu.: 164.2                                        
##                     Max.   :1600.0                                        
##                                                                           
##   Foundation          BsmtQual           BsmtCond         BsmtExposure      
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtFinType1         BsmtFinSF1     BsmtFinType2         BsmtFinSF2     
##  Length:1460        Min.   :   0.0   Length:1460        Min.   :   0.00  
##  Class :character   1st Qu.:   0.0   Class :character   1st Qu.:   0.00  
##  Mode  :character   Median : 383.5   Mode  :character   Median :   0.00  
##                     Mean   : 443.6                      Mean   :  46.55  
##                     3rd Qu.: 712.2                      3rd Qu.:   0.00  
##                     Max.   :5644.0                      Max.   :1474.00  
##                                                                          
##    BsmtUnfSF       TotalBsmtSF       Heating           HeatingQC        
##  Min.   :   0.0   Min.   :   0.0   Length:1460        Length:1460       
##  1st Qu.: 223.0   1st Qu.: 795.8   Class :character   Class :character  
##  Median : 477.5   Median : 991.5   Mode  :character   Mode  :character  
##  Mean   : 567.2   Mean   :1057.4                                        
##  3rd Qu.: 808.0   3rd Qu.:1298.2                                        
##  Max.   :2336.0   Max.   :6110.0                                        
##                                                                         
##   CentralAir         Electrical          X1stFlrSF      X2ndFlrSF   
##  Length:1460        Length:1460        Min.   : 334   Min.   :   0  
##  Class :character   Class :character   1st Qu.: 882   1st Qu.:   0  
##  Mode  :character   Mode  :character   Median :1087   Median :   0  
##                                        Mean   :1163   Mean   : 347  
##                                        3rd Qu.:1391   3rd Qu.: 728  
##                                        Max.   :4692   Max.   :2065  
##                                                                     
##   LowQualFinSF       GrLivArea     BsmtFullBath     BsmtHalfBath    
##  Min.   :  0.000   Min.   : 334   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :  0.000   Median :1464   Median :0.0000   Median :0.00000  
##  Mean   :  5.845   Mean   :1515   Mean   :0.4253   Mean   :0.05753  
##  3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :572.000   Max.   :5642   Max.   :3.0000   Max.   :2.00000  
##                                                                     
##     FullBath        HalfBath       BedroomAbvGr    KitchenAbvGr  
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.000   Median :0.0000   Median :3.000   Median :1.000  
##  Mean   :1.565   Mean   :0.3829   Mean   :2.866   Mean   :1.047  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :2.0000   Max.   :8.000   Max.   :3.000  
##                                                                  
##  KitchenQual         TotRmsAbvGrd     Functional          Fireplaces   
##  Length:1460        Min.   : 2.000   Length:1460        Min.   :0.000  
##  Class :character   1st Qu.: 5.000   Class :character   1st Qu.:0.000  
##  Mode  :character   Median : 6.000   Mode  :character   Median :1.000  
##                     Mean   : 6.518                      Mean   :0.613  
##                     3rd Qu.: 7.000                      3rd Qu.:1.000  
##                     Max.   :14.000                      Max.   :3.000  
##                                                                        
##  FireplaceQu         GarageType         GarageYrBlt   GarageFinish      
##  Length:1460        Length:1460        Min.   :1900   Length:1460       
##  Class :character   Class :character   1st Qu.:1962   Class :character  
##  Mode  :character   Mode  :character   Median :1979   Mode  :character  
##                                        Mean   :1979                     
##                                        3rd Qu.:2001                     
##                                        Max.   :2010                     
##                                                                         
##    GarageCars      GarageArea      GarageQual         GarageCond       
##  Min.   :0.000   Min.   :   0.0   Length:1460        Length:1460       
##  1st Qu.:1.000   1st Qu.: 334.5   Class :character   Class :character  
##  Median :2.000   Median : 480.0   Mode  :character   Mode  :character  
##  Mean   :1.767   Mean   : 473.0                                        
##  3rd Qu.:2.000   3rd Qu.: 576.0                                        
##  Max.   :4.000   Max.   :1418.0                                        
##                                                                        
##   PavedDrive          WoodDeckSF      OpenPorchSF     EnclosedPorch   
##  Length:1460        Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  Class :character   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Median :  0.00   Median : 25.00   Median :  0.00  
##                     Mean   : 94.24   Mean   : 46.66   Mean   : 21.95  
##                     3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00  
##                     Max.   :857.00   Max.   :547.00   Max.   :552.00  
##                                                                       
##    X3SsnPorch      ScreenPorch        PoolArea          PoolQC         
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Length:1460       
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Class :character  
##  Median :  0.00   Median :  0.00   Median :  0.000   Mode  :character  
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759                     
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000                     
##  Max.   :508.00   Max.   :480.00   Max.   :738.000                     
##                                                                        
##     Fence           MiscFeature           MiscVal             MoSold   
##  Length:1460        Length:1460        Min.   :    0.00   6      :253  
##  Class :character   Class :character   1st Qu.:    0.00   7      :234  
##  Mode  :character   Mode  :character   Median :    0.00   5      :204  
##                                        Mean   :   43.49   4      :141  
##                                        3rd Qu.:    0.00   8      :122  
##                                        Max.   :15500.00   3      :106  
##                                                           (Other):400  
##      YrSold       SaleType         SaleCondition        SalePrice     
##  Min.   :2006   Length:1460        Length:1460        Min.   : 34900  
##  1st Qu.:2007   Class :character   Class :character   1st Qu.:129975  
##  Median :2008   Mode  :character   Mode  :character   Median :163000  
##  Mean   :2008                                         Mean   :180921  
##  3rd Qu.:2009                                         3rd Qu.:214000  
##  Max.   :2010                                         Max.   :755000  
## 
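A minimal sketch of the mean imputation described in this section, assuming a small hypothetical data frame in place of the training data. The column names come from the report; the values and the loop are illustrative only.

```r
# Hypothetical data frame standing in for the training data.
housing <- data.frame(
  LotFrontage = c(65, 80, NA, 60),
  MasVnrArea  = c(196, 0, 162, NA),
  GarageYrBlt = c(2003, 1976, NA, 1998)
)

impute_cols <- c("LotFrontage", "MasVnrArea", "GarageYrBlt")

# Replace each NA with the mean of the observed values in that column.
for (col in impute_cols) {
  col_mean <- mean(housing[[col]], na.rm = TRUE)
  housing[[col]][is.na(housing[[col]])] <- col_mean
}

colSums(is.na(housing))  # count of remaining NAs per column
```

Mean imputation leaves each column's mean unchanged but pulls the median and quartiles toward the mean, which matches the shift visible in the LotFrontage summary above.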

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

To pick a variable X that is skewed to the right, I plot several variables and check which ones have a long right tail. The variables plotted are LotFrontage, LotArea, TotalBsmtSF, GrLivArea, and SalePrice.

Based on the plots, the right-skewed variable is LotArea. Therefore, variable X is LotArea and variable Y is SalePrice.

Probability.

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.

\[A. P(X>x | Y>y) \]

## [1] 0.8200913

\[B. P(X>x, Y>y) \]

## [1] 0.6150685

\[C. P(X<x | Y>y) \]

## [1] 0.1799087

Does splitting the training data in this fashion make them independent?

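No. Splitting the training data at these quartiles does not make X and Y independent. Independence would require P(X>x, Y>y) = P(X>x) P(Y>y), and the split only changes how the observations are grouped, not the relationship between the two variables.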
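The three probabilities, the table of counts, and the independence check can be sketched as below. The data are simulated stand-ins for LotArea and SalePrice, not the Kaggle data, and all variable names are assumptions. Because x is the third quartile, P(X > x) is about 0.25 by construction, so the joint probability in B cannot exceed roughly 0.25 and the conditional probability in A cannot exceed roughly 0.5; this gives a quick sanity check on any computed values.

```r
set.seed(1)
# Simulated stand-ins for LotArea (X) and SalePrice (Y); not the Kaggle data.
n <- 1000
X <- rexp(n, rate = 1 / 10000)
Y <- 100000 + 5 * X + rnorm(n, sd = 40000)

x <- quantile(X, 0.75)  # third quartile of X
y <- quantile(Y, 0.50)  # second quartile (median) of Y

p_joint <- mean(X > x & Y > y)                # B: P(X > x, Y > y)
p_a     <- p_joint / mean(Y > y)              # A: P(X > x | Y > y)
p_c     <- mean(X < x & Y > y) / mean(Y > y)  # C: P(X < x | Y > y)
c(A = p_a, B = p_joint, C = p_c)

# Table of counts with row and column totals.
counts <- table(X_gt_x = X > x, Y_gt_y = Y > y)
addmargins(counts)

# Independence would require P(X > x, Y > y) = P(X > x) * P(Y > y).
c(joint = p_joint, product = mean(X > x) * mean(Y > y))
chisq.test(counts)  # formal test of independence on the 2 x 2 table
```

A and C condition on the same event and cover complementary outcomes, so they should sum to 1 whenever no observation equals x exactly.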

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

Below is a boxplot of variable X and variable Y, which shows where the outliers lie. Since the data have over 1,400 rows, I used a log scale on the boxplot to see the outliers more clearly.

Univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y.

I begin with univariate descriptive statistics and a scatterplot for variables X and Y, as required. The results are below. In the scatterplot, most points cluster near the low end of the axes, while the remaining points are scattered widely and may be outliers.

##     LotArea         SalePrice     
##  Min.   :  1300   Min.   : 34900  
##  1st Qu.:  7554   1st Qu.:129975  
##  Median :  9478   Median :163000  
##  Mean   : 10517   Mean   :180921  
##  3rd Qu.: 11602   3rd Qu.:214000  
##  Max.   :215245   Max.   :755000
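The prompt also asks for a 95% confidence interval for the difference in the means of the two variables. One way to obtain it, sketched here on simulated stand-ins for LotArea and SalePrice, is a Welch two-sample t-test; the choice of an unpaired test and all names below are assumptions, not the report's method.

```r
set.seed(2)
# Simulated stand-ins for LotArea and SalePrice; not the Kaggle data.
lot_area   <- rexp(200, rate = 1 / 10000)
sale_price <- 100000 + 5 * lot_area + rnorm(200, sd = 40000)

# Univariate descriptive statistics for both variables.
summary(data.frame(LotArea = lot_area, SalePrice = sale_price))

# Welch t-test; conf.int holds the 95% CI for mean(lot_area) - mean(sale_price).
tt <- t.test(lot_area, sale_price, conf.level = 0.95)
tt$conf.int
```

Because both variables are measured on the same houses, `t.test(..., paired = TRUE)` is a reasonable alternative.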

Since the data consist of both numerical and categorical values, I keep only the numerical columns of the housing market value data.

Below, I filter the data to the variables whose correlation with SalePrice is greater than 0.70.

##             OverallQual GrLivArea SalePrice
## OverallQual   1.0000000 0.5930074 0.7909816
## GrLivArea     0.5930074 1.0000000 0.7086245
## SalePrice     0.7909816 0.7086245 1.0000000
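The filtering step above can be sketched as follows: keep the numeric columns, compute the correlation matrix, and retain the variables whose correlation with SalePrice exceeds 0.70. The data frame is simulated and the names `numeric_data`, `cor_matrix`, and `keep` are assumptions.

```r
set.seed(3)
n <- 300
quality <- sample(1:10, n, replace = TRUE)

# Simulated stand-in for the training data, with one categorical column.
housing <- data.frame(
  OverallQual = quality,
  LotArea     = rexp(n, rate = 1 / 10000),
  Street      = sample(c("Pave", "Grvl"), n, replace = TRUE),
  SalePrice   = 50000 + 25000 * quality + rnorm(n, sd = 30000),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, then correlate every pair.
numeric_data <- housing[sapply(housing, is.numeric)]
cor_matrix   <- cor(numeric_data, use = "pairwise.complete.obs")

# Variables whose correlation with SalePrice exceeds 0.70
# (SalePrice itself always qualifies, with correlation 1).
keep <- names(which(cor_matrix[, "SalePrice"] > 0.70))
cor_matrix[keep, keep]
```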

Derive a correlation matrix for two of the quantitative variables you selected

I derive a correlation matrix for two of the quantitative variables, OverallQual and SalePrice, and test the null hypothesis that their correlation is 0. The sample correlation is 0.7909816, with a 99% confidence interval of 0.7643382 to 0.8149288. Because the interval does not contain 0 and the p-value is below 2.2e-16, I reject the null hypothesis: there is a strong positive correlation between OverallQual and the SalePrice of a house.

Correlation Test of SalePrice and OverallQual.

## 
##  Pearson's product-moment correlation
## 
## data:  Housing_MarketValue_subset$OverallQual and Housing_MarketValue_subset$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.7643382 0.8149288
## sample estimates:
##       cor 
## 0.7909816
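Output of this form comes from `cor.test` with `conf.level = 0.99`. A minimal sketch on simulated data follows; the variables are stand-ins for OverallQual and SalePrice, and the decision rule in the last line is one common reading of the interval.

```r
set.seed(4)
n <- 500
# Simulated stand-ins for OverallQual and SalePrice; not the Kaggle data.
overall_qual <- sample(1:10, n, replace = TRUE)
sale_price   <- 50000 + 25000 * overall_qual + rnorm(n, sd = 30000)

ct <- cor.test(overall_qual, sale_price,
               method = "pearson", conf.level = 0.99)

ct$estimate  # sample correlation
ct$conf.int  # 99% confidence interval for the true correlation
ct$p.value   # p-value for H0: correlation = 0

# H0 is rejected at the 1% level when the 99% interval excludes zero.
excludes_zero <- ct$conf.int[1] > 0 || ct$conf.int[2] < 0
excludes_zero
```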

Linear Algebra and Correlation

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.

Precision matrix (inverse of the correlation matrix)

##           [,1]      [,2]
## [1,]  2.671310 -2.112957
## [2,] -2.112957  2.671310

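Multiply the precision matrix by the correlation matrix

The product of the correlation matrix and the precision matrix should be the identity matrix in either order. The first output below is the correlation matrix multiplied by the precision matrix, which is the identity. The second output, labeled as the precision matrix multiplied by the correlation matrix, is not the identity: its entries (11.60049 and -11.28873) equal the precision matrix multiplied by itself, so that product appears to have been computed with the wrong operand.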

Correlation Matrix by Precision Matrix

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

Precision Matrix by Correlation Matrix

##           [,1]      [,2]
## [1,]  11.60049 -11.28873
## [2,] -11.28873  11.60049
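For reference, the inversion and both products can be sketched directly from the reported correlation of 0.7909816. The names `cor_mat` and `prec_mat` are assumptions; `solve` and `%*%` are base R.

```r
r <- 0.7909816  # correlation between OverallQual and SalePrice reported above

cor_mat  <- matrix(c(1, r, r, 1), nrow = 2)
prec_mat <- solve(cor_mat)  # precision matrix = inverse of the correlation matrix
prec_mat

# Both products are the 2 x 2 identity matrix, up to floating-point rounding.
round(cor_mat %*% prec_mat, 10)
round(prec_mat %*% cor_mat, 10)

# The diagonal of the precision matrix holds the variance inflation factors,
# which for two variables equal 1 / (1 - r^2).
diag(prec_mat)
```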

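Conduct principal components analysis.

Principal components analysis centers the data around 0 by shifting the variables and rescales each variable to unit variance. The eigenvalues (the squared standard deviations below) measure the amount of variation captured by each principal component (PC) and are used to decide how many components to keep. Here the first component has a standard deviation of 1.414 (eigenvalue 2) and the second is effectively zero (7.85e-17), so PC1 carries all of the variation. A second component of essentially zero is what one would expect if the analysis were run on a two-row matrix, such as the 2 x 2 correlation matrix itself, rather than on the underlying observations, so this output should be interpreted with caution.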

## Standard deviations (1, .., p=2):
## [1] 1.414214e+00 7.850462e-17
## 
## Rotation (n x k) = (2 x 2):
##             PC1       PC2
## [1,] -0.7071068 0.7071068
## [2,]  0.7071068 0.7071068
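A sketch of principal components analysis run on the observations themselves, using `prcomp` with centering and scaling. The data are simulated stand-ins for OverallQual and SalePrice. For two standardized variables with correlation r, the eigenvalues are 1 + |r| and 1 - |r|, which the code checks.

```r
set.seed(5)
n <- 500
# Simulated stand-ins for OverallQual and SalePrice; not the Kaggle data.
overall_qual <- sample(1:10, n, replace = TRUE)
sale_price   <- 50000 + 25000 * overall_qual + rnorm(n, sd = 30000)
obs <- data.frame(OverallQual = overall_qual, SalePrice = sale_price)

# Center each variable at 0 and rescale it to unit variance before rotating.
pca <- prcomp(obs, center = TRUE, scale. = TRUE)

eigenvalues <- pca$sdev^2
eigenvalues                     # variance captured by PC1 and PC2
eigenvalues / sum(eigenvalues)  # proportion of total variance per component
pca$rotation                    # loadings of each variable on each component

r <- cor(overall_qual, sale_price)
c(1 + abs(r), 1 - abs(r))       # matches the eigenvalues above
```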

Eigenvalues of Principal Components Analysis.

Plot of Correlation matrix.

Below is a correlation plot of housing market value data set with correlation greater than 0.70.

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Comparing the mean and the median of each variable in the housing market value data gives a quick check for right skew: a positive value in the table below indicates that the mean is above the median. Based on these results, MSSubClass and LotArea are both right skewed.

| rowname | V1 |
|---|---|
| MSSubClass | 0.1630535974320969361262 |
| LotFrontage | -0.0000000000000002427602 |
| LotArea | 0.1040277048276167931595 |
| OverallQual | 0.0718115086600481927759 |
| OverallCond | 0.5170226533860245998753 |
| YearBuilt | -0.0573518287639829746349 |
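
The values in the table appear consistent with the nonparametric skew, (mean - median) / sd. For MSSubClass, (56.9 - 50) divided by a standard deviation of about 42.3 (implied by the t-test output later in the report) gives about 0.163. This reading is an inference from the numbers, not something the report states. A minimal sketch of that measure on simulated data, where `np_skew` is a hypothetical helper name:

```r
# Nonparametric skew: (mean - median) / sd; positive values suggest right skew
np_skew <- function(x) {
  x <- x[!is.na(x)]
  (mean(x) - median(x)) / sd(x)
}

set.seed(1)
right_skewed <- rexp(1000, rate = 0.02)  # simulated, not the Kaggle data
symmetric    <- rnorm(1000)              # simulated symmetric comparison

np_skew(right_skewed)
np_skew(symmetric)
```

The measure is bounded between -1 and 1, so it is easier to compare across variables than a raw difference between mean and median.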

Skewness Summary

Earlier I plotted several variables from the train.csv data to find one that is right skewed and set it as the independent variable X. The output below checks how strongly right skewed X is: the skewness is 12.19514, and the summary (minimum 1300, maximum 215245) corresponds to LotArea.

## [1] 12.19514
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
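
The report does not show which function produced the skewness value above. As a minimal sketch, the moment-based sample skewness can be computed in base R as follows; `moment_skewness` is a hypothetical helper name, and the data are simulated rather than taken from train.csv.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(2)
lot_like <- rlnorm(1000, meanlog = 9, sdlog = 0.5)  # simulated right-skewed values

moment_skewness(lot_like)      # positive for a long right tail
moment_skewness(c(1, 2, 3))    # 0 for a symmetric sample
```

A value far above 1 indicates a long right tail driven by a few very large observations.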

For your variable that is skewed to the right, shift it so that the minimum value is above zero.

Since MSSubClass and LotArea are both right skewed, I will use MSSubClass for the remainder of this question.


Minimum Value is above zero.

The minimum value of MSSubClass is 20, which is already above zero, so no shift is needed.

## [1] 20

Load the MASS package and run fitdistr to fit an exponential probability density function.

I use the fitdistr function to fit an exponential distribution to MSSubClass; the fitted parameter is the rate \(\lambda\).

##        rate    
##   0.0175755387 
##  (0.0004599729)
##       rate 
## 0.01757554

The fitted rate is \(\lambda \approx 0.0175755\), which equals the reciprocal of the sample mean (1/56.89726).
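
A minimal sketch of the fitting step on simulated data (the vector `x` is a stand-in, not the MSSubClass column). It also checks that, for the exponential distribution, the maximum-likelihood rate is the reciprocal of the sample mean.

```r
library(MASS)  # recommended package that ships with standard R installations

set.seed(42)
x <- rexp(500, rate = 0.02)   # simulated stand-in for the shifted variable

# Fit an exponential density; the only parameter is the rate
fit <- fitdistr(x, densfun = "exponential")
lambda <- fit$estimate[["rate"]]

# For the exponential distribution the maximum-likelihood rate is 1 / mean(x)
c(fitted = lambda, closed_form = 1 / mean(x))
```

The value in parentheses in the fitdistr printout is the standard error of the estimated rate.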

Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).

I now take 1000 samples from this distribution; the first six values are printed below.

##           .
## 1 80.679153
## 2  8.069586
## 3 28.547301
## 4  9.181433
## 5 85.706641
## 6 57.165496
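
A minimal sketch of the sampling step and of one way to compare the two histograms on shared bins. The rate is the fitted value reported above; the `obs` vector is a hypothetical stand-in for the observed variable (discrete codes starting at 20), not the real column.

```r
set.seed(123)
lambda <- 0.0175755387                   # fitted rate reported above
sim <- rexp(1000, rate = lambda)         # 1000 draws from the fitted exponential

# Hypothetical stand-in for the observed variable: discrete codes starting at 20
obs <- sample(c(20, 30, 50, 60, 70, 120, 160, 190), 1000, replace = TRUE)

# Shared breaks make the two histograms directly comparable
breaks <- seq(0, max(c(sim, obs)) + 20, by = 20)
h_sim <- hist(sim, breaks = breaks, right = FALSE, plot = FALSE)
h_obs <- hist(obs, breaks = breaks, right = FALSE, plot = FALSE)

# Bin counts side by side; set plot = TRUE to draw the histograms instead
data.frame(bin_start = head(breaks, -1),
           simulated = h_sim$counts,
           observed  = h_obs$counts)
```

With shared bins it is easy to see that the exponential sample puts many values below 20, where the stand-in variable has none.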

Plot a histogram and compare it with a histogram of your original variable.

The original histogram and the sample histogram differ. The original variable is right skewed but spread across its range, while the simulated exponential sample is more strongly right skewed, with most values between 0 and 100.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

## [1]   2.918448 170.448959
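
The percentiles above follow from inverting the exponential CDF, \(F(x) = 1 - e^{-\lambda x}\), which gives \(x_p = -\ln(1 - p)/\lambda\). A minimal sketch that computes them both by hand and with the built-in quantile function, using the fitted rate reported earlier:

```r
lambda <- 0.0175755387   # fitted rate reported above

# Closed form from the CDF: F(x) = 1 - exp(-lambda * x)  =>  x_p = -log(1 - p) / lambda
p <- c(0.05, 0.95)
by_hand <- -log(1 - p) / lambda

# qexp() is the built-in exponential quantile function
built_in <- qexp(p, rate = lambda)

round(rbind(by_hand, built_in), 4)
```

Both rows should agree with the 2.918448 and 170.448959 shown in the output above.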

Generate a 95% confidence interval from the empirical data, assuming normality.

## 
##  One Sample t-test
## 
## data:  Housing_MarketValue$MSSubClass
## t = 51.395, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  54.72567 59.06885
## sample estimates:
## mean of x 
##  56.89726
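
A minimal sketch of what the t-test interval computes, using simulated data rather than the MSSubClass column. The interval is the sample mean plus or minus a t quantile times the standard error, so it describes uncertainty about the mean, not the spread of individual values.

```r
set.seed(7)
x <- rexp(200, rate = 0.02)   # simulated data, not the Kaggle column

n  <- length(x)
se <- sd(x) / sqrt(n)                                   # standard error of the mean
manual <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

# t.test() reports the same interval
from_t_test <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

rbind(manual, from_t_test)
```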

Finally, provide the empirical 5th percentile and 95th percentile of the data.

##  5% 95% 
##  20 160

Discuss.

I plotted the original histogram of MSSubClass and confirmed that its minimum value, 20, is above zero. Using fitdistr from the MASS package, I fitted an exponential distribution and obtained a rate of \(\lambda = 0.01757554\). I then drew 1000 samples from the exponential distribution with this rate and plotted them as a histogram. The exponential histogram looks very different from the original: it is more strongly right skewed, with most of its mass near zero. The percentiles show the same mismatch. The fitted exponential places the 5th and 95th percentiles at 2.918448 and 170.448959, while the empirical 5th and 95th percentiles are 20 and 160. The t-test gives a 95% confidence interval for the mean of 54.72567 to 59.06885, which contains the sample mean of 56.89726. This interval describes uncertainty about the mean, not the range of the data, so it is much narrower than the percentile ranges.

Modeling.

Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com username and score.

From the housing market value data I will build a training set, fit the model on it, and then apply the fitted model to the test data.

Train Data

# Train data set

Housing_MarketValue_train = Housing_MarketValue %>% 
  select_if(is.numeric) %>% 
  dplyr::select(-Id)

summary(Housing_MarketValue_train)
##    MSSubClass     LotFrontage        LotArea        OverallQual    
##  Min.   : 20.0   Min.   : 21.00   Min.   :  1300   Min.   : 1.000  
##  1st Qu.: 20.0   1st Qu.: 60.00   1st Qu.:  7554   1st Qu.: 5.000  
##  Median : 50.0   Median : 70.05   Median :  9478   Median : 6.000  
##  Mean   : 56.9   Mean   : 70.05   Mean   : 10517   Mean   : 6.099  
##  3rd Qu.: 70.0   3rd Qu.: 79.00   3rd Qu.: 11602   3rd Qu.: 7.000  
##  Max.   :190.0   Max.   :313.00   Max.   :215245   Max.   :10.000  
##   OverallCond      YearBuilt     YearRemodAdd    MasVnrArea    
##  Min.   :1.000   Min.   :1872   Min.   :1950   Min.   :   0.0  
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   1st Qu.:   0.0  
##  Median :5.000   Median :1973   Median :1994   Median :   0.0  
##  Mean   :5.575   Mean   :1971   Mean   :1985   Mean   : 103.7  
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   3rd Qu.: 164.2  
##  Max.   :9.000   Max.   :2010   Max.   :2010   Max.   :1600.0  
##    BsmtFinSF1       BsmtFinSF2        BsmtUnfSF       TotalBsmtSF    
##  Min.   :   0.0   Min.   :   0.00   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8  
##  Median : 383.5   Median :   0.00   Median : 477.5   Median : 991.5  
##  Mean   : 443.6   Mean   :  46.55   Mean   : 567.2   Mean   :1057.4  
##  3rd Qu.: 712.2   3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2  
##  Max.   :5644.0   Max.   :1474.00   Max.   :2336.0   Max.   :6110.0  
##    X1stFlrSF      X2ndFlrSF     LowQualFinSF       GrLivArea   
##  Min.   : 334   Min.   :   0   Min.   :  0.000   Min.   : 334  
##  1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130  
##  Median :1087   Median :   0   Median :  0.000   Median :1464  
##  Mean   :1163   Mean   : 347   Mean   :  5.845   Mean   :1515  
##  3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777  
##  Max.   :4692   Max.   :2065   Max.   :572.000   Max.   :5642  
##   BsmtFullBath     BsmtHalfBath        FullBath        HalfBath     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :2.000   Median :0.0000  
##  Mean   :0.4253   Mean   :0.05753   Mean   :1.565   Mean   :0.3829  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :3.0000   Max.   :2.00000   Max.   :3.000   Max.   :2.0000  
##   BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd      Fireplaces   
##  Min.   :0.000   Min.   :0.000   Min.   : 2.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000   1st Qu.:0.000  
##  Median :3.000   Median :1.000   Median : 6.000   Median :1.000  
##  Mean   :2.866   Mean   :1.047   Mean   : 6.518   Mean   :0.613  
##  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000   3rd Qu.:1.000  
##  Max.   :8.000   Max.   :3.000   Max.   :14.000   Max.   :3.000  
##   GarageYrBlt     GarageCars      GarageArea       WoodDeckSF    
##  Min.   :1900   Min.   :0.000   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:1962   1st Qu.:1.000   1st Qu.: 334.5   1st Qu.:  0.00  
##  Median :1979   Median :2.000   Median : 480.0   Median :  0.00  
##  Mean   :1979   Mean   :1.767   Mean   : 473.0   Mean   : 94.24  
##  3rd Qu.:2001   3rd Qu.:2.000   3rd Qu.: 576.0   3rd Qu.:168.00  
##  Max.   :2010   Max.   :4.000   Max.   :1418.0   Max.   :857.00  
##   OpenPorchSF     EnclosedPorch      X3SsnPorch      ScreenPorch    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 25.00   Median :  0.00   Median :  0.00   Median :  0.00  
##  Mean   : 46.66   Mean   : 21.95   Mean   :  3.41   Mean   : 15.06  
##  3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :547.00   Max.   :552.00   Max.   :508.00   Max.   :480.00  
##     PoolArea          MiscVal             YrSold       SalePrice     
##  Min.   :  0.000   Min.   :    0.00   Min.   :2006   Min.   : 34900  
##  1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.:2007   1st Qu.:129975  
##  Median :  0.000   Median :    0.00   Median :2008   Median :163000  
##  Mean   :  2.759   Mean   :   43.49   Mean   :2008   Mean   :180921  
##  3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.:2009   3rd Qu.:214000  
##  Max.   :738.000   Max.   :15500.00   Max.   :2010   Max.   :755000

Linear Model

## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + MasVnrArea + X1stFlrSF + X2ndFlrSF + 
##     BsmtFullBath + BedroomAbvGr + GarageCars + WoodDeckSF + ScreenPorch + 
##     PoolArea, data = Housing_MarketValue_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -452376  -17295   -1323   13837  292715 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.468e+05  8.790e+04  -9.634  < 2e-16 ***
## MSSubClass   -1.909e+02  2.420e+01  -7.886 6.09e-15 ***
## LotArea       4.468e-01  1.002e-01   4.458 8.93e-06 ***
## OverallQual   1.978e+04  1.096e+03  18.055  < 2e-16 ***
## OverallCond   5.408e+03  9.265e+02   5.837 6.54e-09 ***
## YearBuilt     3.915e+02  4.493e+01   8.714  < 2e-16 ***
## MasVnrArea    3.208e+01  5.865e+00   5.470 5.29e-08 ***
## X1stFlrSF     7.072e+01  3.717e+00  19.027  < 2e-16 ***
## X2ndFlrSF     6.072e+01  3.331e+00  18.230  < 2e-16 ***
## BsmtFullBath  1.368e+04  1.922e+03   7.117 1.73e-12 ***
## BedroomAbvGr -8.115e+03  1.446e+03  -5.611 2.41e-08 ***
## GarageCars    1.056e+04  1.707e+03   6.187 7.96e-10 ***
## WoodDeckSF    2.811e+01  7.955e+00   3.534 0.000423 ***
## ScreenPorch   5.623e+01  1.688e+01   3.332 0.000885 ***
## PoolArea     -2.959e+01  2.352e+01  -1.258 0.208649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35270 on 1445 degrees of freedom
## Multiple R-squared:  0.8048, Adjusted R-squared:  0.8029 
## F-statistic: 425.4 on 14 and 1445 DF,  p-value: < 2.2e-16

Linear Model Plot

Linear Model with Log Transform

## 
## Call:
## lm(formula = log(SalePrice) ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + MasVnrArea + X1stFlrSF + X2ndFlrSF + 
##     BsmtFullBath + BedroomAbvGr + GarageCars + WoodDeckSF + ScreenPorch + 
##     PoolArea, data = Housing_MarketValue_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.88574 -0.06865  0.00413  0.07766  0.50022 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.743e+00  3.776e-01   9.913  < 2e-16 ***
## MSSubClass   -7.229e-04  1.040e-04  -6.951 5.47e-12 ***
## LotArea       2.281e-06  4.307e-07   5.297 1.36e-07 ***
## OverallQual   9.955e-02  4.708e-03  21.146  < 2e-16 ***
## OverallCond   5.462e-02  3.981e-03  13.720  < 2e-16 ***
## YearBuilt     3.421e-03  1.930e-04  17.720  < 2e-16 ***
## MasVnrArea   -1.940e-06  2.520e-05  -0.077 0.938635    
## X1stFlrSF     3.162e-04  1.597e-05  19.800  < 2e-16 ***
## X2ndFlrSF     2.552e-04  1.431e-05  17.836  < 2e-16 ***
## BsmtFullBath  7.461e-02  8.260e-03   9.034  < 2e-16 ***
## BedroomAbvGr  1.492e-03  6.214e-03   0.240 0.810316    
## GarageCars    7.505e-02  7.334e-03  10.233  < 2e-16 ***
## WoodDeckSF    1.392e-04  3.418e-05   4.072 4.92e-05 ***
## ScreenPorch   4.031e-04  7.251e-05   5.560 3.21e-08 ***
## PoolArea     -3.599e-04  1.011e-04  -3.562 0.000381 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1515 on 1445 degrees of freedom
## Multiple R-squared:  0.8575, Adjusted R-squared:  0.8561 
## F-statistic: 620.8 on 14 and 1445 DF,  p-value: < 2.2e-16
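
A minimal sketch of how a log-transformed linear model is fitted and read, using the built-in mtcars data as a stand-in (the variables `mpg`, `wt`, and `hp` are not from the housing data). Predictions come out on the log scale and need exp() to return to the original units, and each coefficient can be read as an approximate percentage change in the response per unit of the predictor.

```r
# Fit a linear model to the log of the response
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Predictions are on the log scale; exp() returns them to the original units
pred_log <- predict(fit_log, newdata = mtcars)
pred     <- exp(pred_log)

# Percentage change in the response for a one-unit increase in each predictor
pct_change <- 100 * (exp(coef(fit_log)[-1]) - 1)

head(pred)
round(pct_change, 2)
```

Because the two models have different response scales, their residual standard errors (35270 versus 0.1515 above) are not directly comparable.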

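Linear Model with Log Transform Plot

Linear Model Trained with NAs Removed

I use the training data with missing values (NAs) removed before applying the model to the test.csv file. The summary below is identical to the previous log-transformed model, including the 1445 residual degrees of freedom, so no rows were dropped.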

## 
## Call:
## lm(formula = log(SalePrice) ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + MasVnrArea + X1stFlrSF + X2ndFlrSF + 
##     BsmtFullBath + BedroomAbvGr + GarageCars + WoodDeckSF + ScreenPorch + 
##     PoolArea, data = Housing_MarketValue_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.88574 -0.06865  0.00413  0.07766  0.50022 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.743e+00  3.776e-01   9.913  < 2e-16 ***
## MSSubClass   -7.229e-04  1.040e-04  -6.951 5.47e-12 ***
## LotArea       2.281e-06  4.307e-07   5.297 1.36e-07 ***
## OverallQual   9.955e-02  4.708e-03  21.146  < 2e-16 ***
## OverallCond   5.462e-02  3.981e-03  13.720  < 2e-16 ***
## YearBuilt     3.421e-03  1.930e-04  17.720  < 2e-16 ***
## MasVnrArea   -1.940e-06  2.520e-05  -0.077 0.938635    
## X1stFlrSF     3.162e-04  1.597e-05  19.800  < 2e-16 ***
## X2ndFlrSF     2.552e-04  1.431e-05  17.836  < 2e-16 ***
## BsmtFullBath  7.461e-02  8.260e-03   9.034  < 2e-16 ***
## BedroomAbvGr  1.492e-03  6.214e-03   0.240 0.810316    
## GarageCars    7.505e-02  7.334e-03  10.233  < 2e-16 ***
## WoodDeckSF    1.392e-04  3.418e-05   4.072 4.92e-05 ***
## ScreenPorch   4.031e-04  7.251e-05   5.560 3.21e-08 ***
## PoolArea     -3.599e-04  1.011e-04  -3.562 0.000381 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1515 on 1445 degrees of freedom
## Multiple R-squared:  0.8575, Adjusted R-squared:  0.8561 
## F-statistic: 620.8 on 14 and 1445 DF,  p-value: < 2.2e-16

Test data Loaded

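I load the test.csv data set and predict SalePrice with the model trained after NA removal (lm3). Because the model predicts log(SalePrice), the predictions are passed through exp() to return them to the original price scale.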

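# Load the Kaggle test set
Housing_MarketValue_test = read.csv('test.csv') 

# Predict log(SalePrice), back-transform with exp(), and pair with the test Ids
pred = predict(lm3, Housing_MarketValue_test) %>% 
  exp() %>% 
  cbind(Housing_MarketValue_test$Id, .) %>% 
  as.data.frame() %>% 
  set_names(c("Id","SalePrice"))

# Show the first six predictions
head(pred) %>% 
  flextable()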

| Id | SalePrice |
|---|---|
| 1,461 | 122,794.4 |
| 1,462 | 153,066.0 |
| 1,463 | 161,158.5 |
| 1,464 | 189,662.3 |
| 1,465 | 192,457.6 |
| 1,466 | 172,162.8 |

Export to CSV

I now export the predictions to a new CSV file, as shown below.

# Replace missing predictions with 0 and write the submission file
pred %>% 
  replace(is.na(.), 0) %>% 
  write.csv("Housing_MarketValue_Final.csv", row.names = FALSE)
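
predict() returns NA for any test row with a missing predictor, and the export above writes those rows as a price of 0. One possible alternative, sketched here on the built-in mtcars data (an assumption; none of these column names come from the Kaggle files), is to fill missing predictors with the training-set median before predicting.

```r
train <- mtcars[1:24, ]
test  <- mtcars[25:32, c("wt", "hp")]
test$hp[2] <- NA                      # inject a missing predictor for illustration

fit <- lm(log(mpg) ~ wt + hp, data = train)

# Without imputation the affected row has an NA prediction
raw_pred <- exp(predict(fit, newdata = test))

# Fill each missing predictor with the training-set median of that column
test_imputed <- test
for (col in names(test_imputed)) {
  missing <- is.na(test_imputed[[col]])
  test_imputed[[col]][missing] <- median(train[[col]], na.rm = TRUE)
}
imputed_pred <- exp(predict(fit, newdata = test_imputed))

data.frame(raw = raw_pred, imputed = imputed_pred)
```

Median imputation is only one simple choice; whether it improves the Kaggle score for this model is not something the report establishes.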

Conclusion

In conclusion, after performing these tests and training the model, I uploaded the final predictions to Kaggle.com. Using the log-transformed linear model, I initially received either errors or a score of 5.54830, which required me to revise the model. After the revision I re-ran the model on the training data, predicted again on the test.csv file, and re-uploaded the Housing_MarketValue_Final file to Kaggle.com. The new score was lower than in my earlier trials, although I am unsure how to interpret the score. Below are my Kaggle username and scores, along with screenshots of my Kaggle.com scores.

Kaggle.com Username and Score

My Kaggle username is Valor383. My scores across multiple submissions were 1.36971, 4.78295, and 5.54830.

[Screenshot: Kaggle Score]

[Screenshot: Kaggle Score 1]