Problem 1

Using R, generate a random variable \(X\) that has \(10,000\) random uniform numbers from \(1\) to \(N\), where \(N\) can be any number of your choosing greater than or equal to \(6\). Then generate a random variable \(Y\) that has \(10,000\) random normal numbers with mean and standard deviation \(\mu =\sigma=(N+1)/2\).

set.seed(100)

# Set N to any number greater than or equal to 6 (in this case, 8).
N <- 8

# Generate a random variable X that has 10,000 random uniform numbers from 1 to N.
X <- runif(10000, 1, N)

# Generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation (N+1)/2.
Y <- rnorm(10000, mean = (N+1)/2, sd = (N+1)/2)

 

Problem 1 - Probability

Calculate as a minimum the below probabilities \(a\) through \(c\). Assume the small letter "\(x\)" is estimated as the median of the \(X\) variable, and the small letter "\(y\)" is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.

# Small x is estimated as the median of the X variable.
x <- round(median(X), 2)
x
## [1] 4.48
# Small y is estimated as the 1st quartile of the Y variable.
y <- round(quantile(Y, 0.25), 2)
y
##  25% 
## 1.45

Answer: Small x is equal to 4.48 (the median of the X variable), and small y is equal to 1.45 (the 1st quartile of the Y variable).
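As a sanity check, the sample values can be compared with the theoretical quantiles of the two distributions. The sketch below assumes the same N = 8 used above; `x_theory` and `y_theory` are illustrative names, not part of the original analysis.

```r
# Theoretical counterparts of x and y, assuming N = 8 as above.
N <- 8
mu <- (N + 1) / 2

# Median of Uniform(1, N).
x_theory <- qunif(0.5, min = 1, max = N)

# First quartile of Normal(mean = mu, sd = mu).
y_theory <- qnorm(0.25, mean = mu, sd = mu)

round(c(x_theory = x_theory, y_theory = y_theory), 2)
```

Both values should land close to the sample estimates of 4.48 and 1.45.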

A. \(P(X>x \ | \ X>y)\)

Interpretation:

Probability that X is greater than its median given that X is greater than the first quartile of Y.

\[P(X>x \ | \ X>y) = \frac{P(X>x \ , \ X>y)}{P(X>y)}\]

# Define the events.
event_one <- (X > x)
event_two <- (X > y)

# P(X>x and X>y).
a_and_b <- length(X[event_one & event_two]) / length(X)

# P(X>y).
b <- length(X[event_two]) / length(X)

# P(X>x | X>y).
probability <- a_and_b / b

answer <- round(probability, 2)
answer
## [1] 0.53

Answer: \(P(X > x \ | \ X > y)\) = 0.53
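Because X is drawn from Uniform(1, N), the same conditional probability can be approximated analytically as a cross-check. This is a sketch that plugs in the rounded sample values x = 4.48 and y = 1.45 reported above; `p_theory` is an illustrative name.

```r
N <- 8
x <- 4.48
y <- 1.45

# P(X > y) for X ~ Uniform(1, N).
p_b <- punif(y, min = 1, max = N, lower.tail = FALSE)

# P(X > x and X > y) = P(X > max(x, y)).
p_a_and_b <- punif(max(x, y), min = 1, max = N, lower.tail = FALSE)

p_theory <- p_a_and_b / p_b
round(p_theory, 2)
```

The analytic value is 3.52 / 6.55, or about 0.54, in line with the simulated 0.53.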

 

B. \(P(X>x, Y>y)\)

Interpretation:

Probability that \(X\) is greater than \(x\), and \(Y\) is greater than \(y\).

# Define the events.
event_one <- (X > x)
event_two <- (Y > y)

# P(X > x).
X_gt_x <- length(X[event_one]) / length(X)

# P(Y > y).
Y_gt_y <- length(Y[event_two]) / length(Y)

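# Multiply the marginal probabilities. This equals the joint probability
# only because X and Y were generated independently of each other.
probability <- X_gt_x * Y_gt_y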

answer <- round(probability, 2)
answer
## [1] 0.37

Answer: \(P(X>x, Y>y)\) = 0.37
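The joint probability can also be estimated directly by counting how often both events occur in the same draw, without assuming independence. The sketch below regenerates the data with the same seed and N so that it is self-contained; `joint` and `product` are illustrative names.

```r
set.seed(100)
N <- 8
X <- runif(10000, 1, N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- round(median(X), 2)
y <- round(quantile(Y, 0.25), 2)

# Empirical joint probability: share of draws where both events hold.
joint <- mean(X > x & Y > y)

# Product of the empirical marginals, valid under independence.
product <- mean(X > x) * mean(Y > y)

round(c(joint = joint, product = product), 4)
```

If X and Y are independent, the two numbers should differ only by sampling noise.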

 

C. \(P(X<x \ | \ X>y)\)

Interpretation:

Probability that \(X\) is less than its median given that it is greater than the first quartile of \(Y\).

# Define the events.
event_one <- (X < x)
event_two <- (X > y)

# P(X < x & X > y).
a_and_b  <- length(X[event_one & event_two]) / length(X)

# P(X > y).
b <- length(X[event_two]) / length(X)

probability <- a_and_b / b 

answer <- round(probability, 2)
answer
## [1] 0.47

Answer: \(P(X<x \ | \ X>y)\) = 0.47
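Since the events \(X>x\) and \(X<x\) are complementary for a continuous variable, the answer to C can be checked against the answer to A:

\[P(X<x \ | \ X>y) = 1 - P(X>x \ | \ X>y) = 1 - 0.53 = 0.47\]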

 

Investigate whether \(P(X>x \ and \ Y>y) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.

part_one <- (X > x)
prob_X_gt_x <- (length(part_one[part_one == TRUE])) / (length(part_one))

part_two <- (Y > y)
prob_Y_gt_y <- (length(part_two[part_two == TRUE])) / (length(part_two))

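# data.table() comes from the data.table package.
library(data.table)

# Note: the '(X>x and Y>y)' row below is filled with the product of the marginals,
# not with a count of the draws in which both events occur.
results_table <- data.table(
  Event = c('(X>x)', '(Y>y)', '(X>x)*(Y>y)', '(X>x and Y>y)'),
  Xx = c(prob_X_gt_x, prob_Y_gt_y, prob_X_gt_x * prob_Y_gt_y, prob_X_gt_x * prob_Y_gt_y),
  Yy = c(prob_Y_gt_y, prob_X_gt_x, prob_X_gt_x * prob_Y_gt_y, prob_X_gt_x * prob_Y_gt_y)
)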

results_table
##            Event       Xx       Yy
## 1:         (X>x) 0.499900 0.749800
## 2:         (Y>y) 0.749800 0.499900
## 3:   (X>x)*(Y>y) 0.374825 0.374825
## 4: (X>x and Y>y) 0.374825 0.374825

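Answer: The two values in the table are equal (0.374825), but only because the '(X>x and Y>y)' row was computed as the product of the marginals rather than counted from the data. Since \(X\) and \(Y\) were generated independently, the empirical joint proportion should be close to this product, though not necessarily identical to it.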
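A contingency table built from the raw draws gives the joint and marginal proportions directly, and a chi-square test offers a formal check of independence. This is a sketch that regenerates the data with the same seed; the object names are illustrative.

```r
set.seed(100)
N <- 8
X <- runif(10000, 1, N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- round(median(X), 2)
y <- round(quantile(Y, 0.25), 2)

# Cross-tabulate the two events.
counts <- table(X_gt_x = X > x, Y_gt_y = Y > y)

# Joint proportions in the cells, marginal proportions in the "Sum" row and column.
joint_and_marginals <- addmargins(prop.table(counts))
round(joint_and_marginals, 4)

# Chi-square test of independence; a large p-value is consistent with independence.
independence_test <- chisq.test(counts)
independence_test$p.value
```

The cell in the TRUE row and TRUE column is the empirical \(P(X>x \ and \ Y>y)\), which can be compared with the product of the two "Sum" margins.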

 

Problem 2

Register for Kaggle.com and compete in the House Prices: Advanced Regression Techniques competition.

Competition Objective

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

 

Import the training and test datasets and summarize them.

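# Pull in the train and test datasets.
# stringsAsFactors = TRUE keeps text columns as factors, which the summaries below rely on
# (since R 4.0, read.csv no longer converts strings to factors by default).
training_dataset <- read.csv('https://raw.githubusercontent.com/stephen-haslett/data605/data605-final-exam/train.csv', stringsAsFactors = TRUE)
test_dataset <- read.csv('https://raw.githubusercontent.com/stephen-haslett/data605/data605-final-exam/test.csv', stringsAsFactors = TRUE)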

 

Snapshot of the training dataset.

head(training_dataset, 1)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500

Summarize the training dataset.

summary(training_dataset)
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour  Utilities   
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63    AllPub:1459  
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50    NoSeWa:   1  
##  Median :  9478               NA's:1369   IR3: 10   Low:  36                 
##  Mean   : 10517                           Reg:925   Lvl:1311                 
##  3rd Qu.: 11602                                                              
##  Max.   :215245                                                              
##                                                                              
##    LotConfig    LandSlope   Neighborhood   Condition1     Condition2  
##  Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260   Norm   :1445  
##  CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81   Feedr  :   6  
##  FR2    :  47   Sev:  13   OldTown:113   Artery :  48   Artery :   2  
##  FR3    :   4              Edwards:100   RRAn   :  26   PosN   :   2  
##  Inside :1052              Somerst: 86   PosN   :  19   RRNn   :   2  
##                            Gilbert: 79   RRAe   :  11   PosA   :   1  
##                            (Other):707   (Other):  15   (Other):   2  
##    BldgType      HouseStyle   OverallQual      OverallCond      YearBuilt   
##  1Fam  :1220   1Story :726   Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  2fmCon:  31   2Story :445   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Duplex:  52   1.5Fin :154   Median : 6.000   Median :5.000   Median :1973  
##  Twnhs :  43   SLvl   : 65   Mean   : 6.099   Mean   :5.575   Mean   :1971  
##  TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                1.5Unf : 14   Max.   :10.000   Max.   :9.000   Max.   :2010  
##                (Other): 19                                                  
##   YearRemodAdd    RoofStyle       RoofMatl     Exterior1st   Exterior2nd 
##  Min.   :1950   Flat   :  13   CompShg:1434   VinylSd:515   VinylSd:504  
##  1st Qu.:1967   Gable  :1141   Tar&Grv:  11   HdBoard:222   MetalSd:214  
##  Median :1994   Gambrel:  11   WdShngl:   6   MetalSd:220   HdBoard:207  
##  Mean   :1985   Hip    : 286   WdShake:   5   Wd Sdng:206   Wd Sdng:197  
##  3rd Qu.:2004   Mansard:   7   ClyTile:   1   Plywood:108   Plywood:142  
##  Max.   :2010   Shed   :   2   Membran:   1   CemntBd: 61   CmentBd: 60  
##                                (Other):   2   (Other):128   (Other):136  
##    MasVnrType    MasVnrArea     ExterQual ExterCond  Foundation  BsmtQual  
##  BrkCmn : 15   Min.   :   0.0   Ex: 52    Ex:   3   BrkTil:146   Ex  :121  
##  BrkFace:445   1st Qu.:   0.0   Fa: 14    Fa:  28   CBlock:634   Fa  : 35  
##  None   :864   Median :   0.0   Gd:488    Gd: 146   PConc :647   Gd  :618  
##  Stone  :128   Mean   : 103.7   TA:906    Po:   1   Slab  : 24   TA  :649  
##  NA's   :  8   3rd Qu.: 166.0             TA:1282   Stone :  6   NA's: 37  
##                Max.   :1600.0                       Wood  :  3             
##                NA's   :8                                                   
##  BsmtCond    BsmtExposure BsmtFinType1   BsmtFinSF1     BsmtFinType2
##  Fa  :  45   Av  :221     ALQ :220     Min.   :   0.0   ALQ :  19   
##  Gd  :  65   Gd  :134     BLQ :148     1st Qu.:   0.0   BLQ :  33   
##  Po  :   2   Mn  :114     GLQ :418     Median : 383.5   GLQ :  14   
##  TA  :1311   No  :953     LwQ : 74     Mean   : 443.6   LwQ :  46   
##  NA's:  37   NA's: 38     Rec :133     3rd Qu.: 712.2   Rec :  54   
##                           Unf :430     Max.   :5644.0   Unf :1256   
##                           NA's: 37                      NA's:  38   
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741   
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49   
##  Median :   0.00   Median : 477.5   Median : 991.5   GasW :  18   Gd:241   
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1   
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428   
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0   Wall :   4            
##                                                                            
##  CentralAir Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  N:  95     FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  Y:1365     FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##             FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##             Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##             SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##             NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                         
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual  TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100      Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39      1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586      Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735      Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000               3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000               Max.   :14.000  
##                                                                               
##  Functional    Fireplaces    FireplaceQu   GarageType   GarageYrBlt  
##  Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6   Min.   :1900  
##  Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870   1st Qu.:1961  
##  Min1:  31   Median :1.000   Gd  :380    Basment: 19   Median :1980  
##  Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88   Mean   :1979  
##  Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9   3rd Qu.:2002  
##  Sev :   1   Max.   :3.000   NA's:690    Detchd :387   Max.   :2010  
##  Typ :1360                               NA's   : 81   NA's   :81    
##  GarageFinish   GarageCars      GarageArea     GarageQual  GarageCond 
##  Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3   Ex  :   2  
##  RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48   Fa  :  35  
##  Unf :605     Median :2.000   Median : 480.0   Gd  :  14   Gd  :   9  
##  NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3   Po  :   7  
##               3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311   TA  :1326  
##               Max.   :4.000   Max.   :1418.0   NA's:  81   NA's:  81  
##                                                                       
##  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Y:1340     Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##             Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##             3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##             Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                                
##   ScreenPorch        PoolArea        PoolQC       Fence      MiscFeature
##  Min.   :  0.00   Min.   :  0.000   Ex  :   2   GdPrv:  59   Gar2:   2  
##  1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2   GdWo :  54   Othr:   2  
##  Median :  0.00   Median :  0.000   Gd  :   3   MnPrv: 157   Shed:  49  
##  Mean   : 15.06   Mean   :  2.759   NA's:1453   MnWw :  11   TenC:   1  
##  3rd Qu.:  0.00   3rd Qu.:  0.000               NA's :1179   NA's:1406  
##  Max.   :480.00   Max.   :738.000                                       
##                                                                         
##     MiscVal             MoSold           YrSold        SaleType   
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   WD     :1267  
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   New    : 122  
##  Median :    0.00   Median : 6.000   Median :2008   COD    :  43  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008   ConLD  :   9  
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   ConLI  :   5  
##  Max.   :15500.00   Max.   :12.000   Max.   :2010   ConLw  :   5  
##                                                     (Other):   9  
##  SaleCondition    SalePrice     
##  Abnorml: 101   Min.   : 34900  
##  AdjLand:   4   1st Qu.:129975  
##  Alloca :  12   Median :163000  
##  Family :  20   Mean   :180921  
##  Normal :1198   3rd Qu.:214000  
##  Partial: 125   Max.   :755000  
## 

 

Snapshot of the test dataset.

head(test_dataset, 1)
##     Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461         20       RH          80   11622   Pave  <NA>      Reg
##   LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1         Lvl    AllPub    Inside       Gtl        NAmes      Feedr       Norm
##   BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1     1Fam     1Story           5           6      1961         1961     Gable
##   RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1  CompShg     VinylSd     VinylSd       None          0        TA        TA
##   Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1     CBlock       TA       TA           No          Rec        468
##   BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1          LwQ        144       270         882    GasA        TA          Y
##   Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1      SBrkr       896         0            0       896            0
##   BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1            0        1        0            2            1          TA
##   TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1            5        Typ          0        <NA>     Attchd        1961
##   GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1          Unf          1        730         TA         TA          Y
##   WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1        140           0             0          0         120        0   <NA>
##   Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1 MnPrv        <NA>       0      6   2010       WD        Normal

Summarize the test dataset.

summary(test_dataset)
##        Id         MSSubClass        MSZoning     LotFrontage    
##  Min.   :1461   Min.   : 20.00   C (all):  15   Min.   : 21.00  
##  1st Qu.:1826   1st Qu.: 20.00   FV     :  74   1st Qu.: 58.00  
##  Median :2190   Median : 50.00   RH     :  10   Median : 67.00  
##  Mean   :2190   Mean   : 57.38   RL     :1114   Mean   : 68.58  
##  3rd Qu.:2554   3rd Qu.: 70.00   RM     : 242   3rd Qu.: 80.00  
##  Max.   :2919   Max.   :190.00   NA's   :   4   Max.   :200.00  
##                                                 NA's   :227     
##     LotArea       Street      Alley      LotShape  LandContour  Utilities   
##  Min.   : 1470   Grvl:   6   Grvl:  70   IR1:484   Bnk:  54    AllPub:1457  
##  1st Qu.: 7391   Pave:1453   Pave:  37   IR2: 35   HLS:  70    NA's  :   2  
##  Median : 9399               NA's:1352   IR3:  6   Low:  24                 
##  Mean   : 9819                           Reg:934   Lvl:1311                 
##  3rd Qu.:11518                                                              
##  Max.   :56600                                                              
##                                                                             
##    LotConfig    LandSlope   Neighborhood   Condition1    Condition2  
##  Corner : 248   Gtl:1396   NAmes  :218   Norm   :1251   Artery:   3  
##  CulDSac:  82   Mod:  60   OldTown:126   Feedr  :  83   Feedr :   7  
##  FR2    :  38   Sev:   3   CollgCr:117   Artery :  44   Norm  :1444  
##  FR3    :  10              Somerst: 96   RRAn   :  24   PosA  :   3  
##  Inside :1081              Edwards: 94   PosN   :  20   PosN  :   2  
##                            NridgHt: 89   RRAe   :  17                
##                            (Other):719   (Other):  20                
##    BldgType     HouseStyle   OverallQual      OverallCond      YearBuilt   
##  1Fam  :1205   1.5Fin:160   Min.   : 1.000   Min.   :1.000   Min.   :1879  
##  2fmCon:  31   1.5Unf:  5   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1953  
##  Duplex:  57   1Story:745   Median : 6.000   Median :5.000   Median :1973  
##  Twnhs :  53   2.5Unf: 13   Mean   : 6.079   Mean   :5.554   Mean   :1971  
##  TwnhsE: 113   2Story:427   3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2001  
##                SFoyer: 46   Max.   :10.000   Max.   :9.000   Max.   :2010  
##                SLvl  : 63                                                  
##   YearRemodAdd    RoofStyle       RoofMatl     Exterior1st   Exterior2nd 
##  Min.   :1950   Flat   :   7   CompShg:1442   VinylSd:510   VinylSd:510  
##  1st Qu.:1963   Gable  :1169   Tar&Grv:  12   MetalSd:230   MetalSd:233  
##  Median :1992   Gambrel:  11   WdShake:   4   HdBoard:220   HdBoard:199  
##  Mean   :1984   Hip    : 265   WdShngl:   1   Wd Sdng:205   Wd Sdng:194  
##  3rd Qu.:2004   Mansard:   4                  Plywood:113   Plywood:128  
##  Max.   :2010   Shed   :   3                  (Other):180   (Other):194  
##                                               NA's   :  1   NA's   :  1  
##    MasVnrType    MasVnrArea     ExterQual ExterCond  Foundation  BsmtQual  
##  BrkCmn : 10   Min.   :   0.0   Ex: 55    Ex:   9   BrkTil:165   Ex  :137  
##  BrkFace:434   1st Qu.:   0.0   Fa: 21    Fa:  39   CBlock:601   Fa  : 53  
##  None   :878   Median :   0.0   Gd:491    Gd: 153   PConc :661   Gd  :591  
##  Stone  :121   Mean   : 100.7   TA:892    Po:   2   Slab  : 25   TA  :634  
##  NA's   : 16   3rd Qu.: 164.0             TA:1256   Stone :  5   NA's: 44  
##                Max.   :1290.0                       Wood  :  2             
##                NA's   :15                                                  
##  BsmtCond    BsmtExposure BsmtFinType1   BsmtFinSF1     BsmtFinType2
##  Fa  :  59   Av  :197     ALQ :209     Min.   :   0.0   ALQ :  33   
##  Gd  :  57   Gd  :142     BLQ :121     1st Qu.:   0.0   BLQ :  35   
##  Po  :   3   Mn  :125     GLQ :431     Median : 350.5   GLQ :  20   
##  TA  :1295   No  :951     LwQ : 80     Mean   : 439.2   LwQ :  41   
##  NA's:  45   NA's: 44     Rec :155     3rd Qu.: 753.5   Rec :  51   
##                           Unf :421     Max.   :4010.0   Unf :1237   
##                           NA's: 42     NA's   :1        NA's:  42   
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF   Heating     HeatingQC
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0   GasA:1446   Ex:752   
##  1st Qu.:   0.00   1st Qu.: 219.2   1st Qu.: 784   GasW:   9   Fa: 43   
##  Median :   0.00   Median : 460.0   Median : 988   Grav:   2   Gd:233   
##  Mean   :  52.62   Mean   : 554.3   Mean   :1046   Wall:   2   Po:  2   
##  3rd Qu.:   0.00   3rd Qu.: 797.8   3rd Qu.:1305               TA:429   
##  Max.   :1526.00   Max.   :2140.0   Max.   :5095                        
##  NA's   :1         NA's   :1        NA's   :1                           
##  CentralAir Electrical     X1stFlrSF        X2ndFlrSF     LowQualFinSF     
##  N: 101     FuseA:  94   Min.   : 407.0   Min.   :   0   Min.   :   0.000  
##  Y:1358     FuseF:  23   1st Qu.: 873.5   1st Qu.:   0   1st Qu.:   0.000  
##             FuseP:   5   Median :1079.0   Median :   0   Median :   0.000  
##             SBrkr:1337   Mean   :1156.5   Mean   : 326   Mean   :   3.543  
##                          3rd Qu.:1382.5   3rd Qu.: 676   3rd Qu.:   0.000  
##                          Max.   :5095.0   Max.   :1862   Max.   :1064.000  
##                                                                            
##    GrLivArea     BsmtFullBath     BsmtHalfBath       FullBath    
##  Min.   : 407   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:1118   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.000  
##  Median :1432   Median :0.0000   Median :0.0000   Median :2.000  
##  Mean   :1486   Mean   :0.4345   Mean   :0.0652   Mean   :1.571  
##  3rd Qu.:1721   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:2.000  
##  Max.   :5095   Max.   :3.0000   Max.   :2.0000   Max.   :4.000  
##                 NA's   :2        NA's   :2                       
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual  TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex  :105    Min.   : 3.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa  : 31    1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Gd  :565    Median : 6.000  
##  Mean   :0.3777   Mean   :2.854   Mean   :1.042   TA  :757    Mean   : 6.385  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000   NA's:  1    3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :6.000   Max.   :2.000               Max.   :15.000  
##                                                                               
##    Functional     Fireplaces     FireplaceQu   GarageType   GarageYrBlt  
##  Typ    :1357   Min.   :0.0000   Ex  : 19    2Types : 17   Min.   :1895  
##  Min2   :  36   1st Qu.:0.0000   Fa  : 41    Attchd :853   1st Qu.:1959  
##  Min1   :  34   Median :0.0000   Gd  :364    Basment: 17   Median :1979  
##  Mod    :  20   Mean   :0.5812   Po  : 26    BuiltIn: 98   Mean   :1978  
##  Maj1   :   5   3rd Qu.:1.0000   TA  :279    CarPort:  6   3rd Qu.:2002  
##  (Other):   5   Max.   :4.0000   NA's:730    Detchd :392   Max.   :2207  
##  NA's   :   2                                NA's   : 76   NA's   :78    
##  GarageFinish   GarageCars      GarageArea     GarageQual  GarageCond 
##  Fin :367     Min.   :0.000   Min.   :   0.0   Fa  :  76   Ex  :   1  
##  RFn :389     1st Qu.:1.000   1st Qu.: 318.0   Gd  :  10   Fa  :  39  
##  Unf :625     Median :2.000   Median : 480.0   Po  :   2   Gd  :   6  
##  NA's: 78     Mean   :1.766   Mean   : 472.8   TA  :1293   Po  :   7  
##               3rd Qu.:2.000   3rd Qu.: 576.0   NA's:  78   TA  :1328  
##               Max.   :5.000   Max.   :1488.0               NA's:  78  
##               NA's   :1       NA's   :1                               
##  PavedDrive   WoodDeckSF       OpenPorchSF     EnclosedPorch    
##  N: 126     Min.   :   0.00   Min.   :  0.00   Min.   :   0.00  
##  P:  32     1st Qu.:   0.00   1st Qu.:  0.00   1st Qu.:   0.00  
##  Y:1301     Median :   0.00   Median : 28.00   Median :   0.00  
##             Mean   :  93.17   Mean   : 48.31   Mean   :  24.24  
##             3rd Qu.: 168.00   3rd Qu.: 72.00   3rd Qu.:   0.00  
##             Max.   :1424.00   Max.   :742.00   Max.   :1012.00  
##                                                                 
##    X3SsnPorch       ScreenPorch        PoolArea        PoolQC       Fence     
##  Min.   :  0.000   Min.   :  0.00   Min.   :  0.000   Ex  :   2   GdPrv:  59  
##  1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.000   Gd  :   1   GdWo :  58  
##  Median :  0.000   Median :  0.00   Median :  0.000   NA's:1456   MnPrv: 172  
##  Mean   :  1.794   Mean   : 17.06   Mean   :  1.744               MnWw :   1  
##  3rd Qu.:  0.000   3rd Qu.:  0.00   3rd Qu.:  0.000               NA's :1169  
##  Max.   :360.000   Max.   :576.00   Max.   :800.000                           
##                                                                               
##  MiscFeature    MiscVal             MoSold           YrSold        SaleType   
##  Gar2:   3   Min.   :    0.00   Min.   : 1.000   Min.   :2006   WD     :1258  
##  Othr:   2   1st Qu.:    0.00   1st Qu.: 4.000   1st Qu.:2007   New    : 117  
##  Shed:  46   Median :    0.00   Median : 6.000   Median :2008   COD    :  44  
##  NA's:1408   Mean   :   58.17   Mean   : 6.104   Mean   :2008   ConLD  :  17  
##              3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   CWD    :   8  
##              Max.   :17000.00   Max.   :12.000   Max.   :2010   (Other):  14  
##                                                                 NA's   :   1  
##  SaleCondition 
##  Abnorml:  89  
##  AdjLand:   8  
##  Alloca :  12  
##  Family :  26  
##  Normal :1204  
##  Partial: 120  
## 

 

Descriptive and Inferential Statistics

1. Provide univariate descriptive statistics and appropriate plots for the training data set.

# Summarize the training dataset's SalePrice variable.
summary(training_dataset$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

 

Create a histogram of sales prices.

hist(training_dataset$SalePrice,
     xlab = 'Sale Price',
     main = 'Distribution of Sales Prices',
     col = 'darkgreen')

 

Create a QQ plot of sales prices.

qqnorm(training_dataset$SalePrice, col = 'darkred')
qqline(training_dataset$SalePrice)

 

2. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

# Define the variables to include in the matrix.
sale_price <- training_dataset$SalePrice
lot_area <- training_dataset$LotArea
gr_liv_area <- training_dataset$GrLivArea
garage_area <- training_dataset$GarageArea

# Plot the matrix.
plot_data <- data.frame(sale_price, lot_area, gr_liv_area, garage_area)
pairs(plot_data, main = 'Scatterplot Matrix', col = '#50394c')

 

3. Derive a correlation matrix for any three quantitative variables in the dataset.

# Create a dataframe containing the 3 variables to include in the matrix.
LotArea <- training_dataset$LotArea
GrLivArea <- training_dataset$GrLivArea
GarageArea <- training_dataset$GarageArea
matrix_variables <- data.frame(LotArea, GrLivArea, GarageArea)

# Create the correlation matrix and plot it (corrplot() comes from the corrplot package).
library(corrplot)
cor_matrix <- cor(matrix_variables)
corrplot(cor_matrix, method = "shade")

 

4. Test the hypothesis that the correlation between each pair of variables is 0, and provide an 80% confidence interval for each.

4(a). Test LotArea Vs. GrLivArea.

# Test LotArea Vs. GrLivArea using the Pearson method with 80% confidence level.
cor.test(training_dataset$LotArea, training_dataset$GrLivArea, method = 'pearson', conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  training_dataset$LotArea and training_dataset$GrLivArea
## t = 10.414, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162

4(b). Test LotArea Vs. GarageArea.

# Test LotArea Vs. GarageArea with 80% confidence level.
cor.test(training_dataset$LotArea, training_dataset$GarageArea, method = 'pearson', conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  training_dataset$LotArea and training_dataset$GarageArea
## t = 7.0034, df = 1458, p-value = 0.000000000003803
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1477356 0.2126767
## sample estimates:
##       cor 
## 0.1804028

4(c). Test GarageArea Vs. GrLivArea.

# Test GarageArea Vs. GrLivArea with 80% confidence level.
cor.test(training_dataset$GarageArea, training_dataset$GrLivArea, method = 'pearson', conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  training_dataset$GarageArea and training_dataset$GrLivArea
## t = 20.276, df = 1458, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975
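cor.test builds the Pearson confidence interval from Fisher's z-transformation. As a worked check, the sketch below reproduces the 80% interval for LotArea Vs. GrLivArea from the reported r = 0.2631162 and n = 1460 (df + 2); the variable names are illustrative.

```r
r <- 0.2631162
n <- 1460
conf_level <- 0.80

# Fisher z-transform of r, with standard error 1 / sqrt(n - 3).
z <- atanh(r)
se <- 1 / sqrt(n - 3)
z_crit <- qnorm(1 - (1 - conf_level) / 2)

# Back-transform the interval to the correlation scale.
ci <- tanh(z + c(-1, 1) * z_crit * se)
round(ci, 4)
```

The result should agree with the interval reported in 4(a), roughly 0.2316 to 0.2941.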

 

5. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

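From the above results, the estimated correlation is positive in all 3 comparisons, and none of the 80% confidence intervals contains 0. The p-values for all 3 tests are far below 0.05 (the largest is about 3.8e-12), so we can reject the null hypothesis of zero correlation for each pair. The strongest relationship is between GarageArea and GrLivArea (r = 0.47), and the weakest is between LotArea and GarageArea (r = 0.18).

I would not be worried about familywise error in this case. Running 3 tests does raise the chance of at least one false positive, but the p-values are so small that all 3 results remain significant even under a Bonferroni correction, which for 3 tests at an overall 0.05 level requires p < 0.05 / 3 ≈ 0.0167.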
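To make the familywise error concrete, the sketch below computes the chance of at least one false positive across three tests (treating them as independent, which is a simplification) and applies a Bonferroni adjustment to the p-values reported above. The two p-values that R printed as "< 2.2e-16" are entered as that upper bound, which is an assumption made for illustration.

```r
n_tests <- 3

# Familywise error rate for m independent tests at level alpha.
fwer <- function(alpha, m) 1 - (1 - alpha)^m
fwer(0.20, n_tests)  # level implied by 80% confidence intervals
fwer(0.05, n_tests)  # conventional 5% level

# Reported p-values (upper bounds where R printed "< 2.2e-16").
p_values <- c(LotArea_GrLivArea    = 2.2e-16,
              LotArea_GarageArea   = 3.803e-12,
              GarageArea_GrLivArea = 2.2e-16)

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
p_adjusted <- p.adjust(p_values, method = "bonferroni")
p_adjusted < 0.05
```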

 

Linear Algebra and Correlation

1. Invert your correlation matrix from above.

# Invert the matrix.
inverted_matrix <- solve(cor_matrix)
round(inverted_matrix, 2)
##            LotArea GrLivArea GarageArea
## LotArea       1.08     -0.25      -0.08
## GrLivArea    -0.25      1.34      -0.58
## GarageArea   -0.08     -0.58       1.29

 

2. Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

2(a). Multiply the correlation matrix by the precision matrix.

correlation_x_precision <- cor_matrix %*% inverted_matrix
round(correlation_x_precision, 2)
##            LotArea GrLivArea GarageArea
## LotArea          1         0          0
## GrLivArea        0         1          0
## GarageArea       0         0          1

2(b). Multiply the precision matrix by the correlation matrix.

precision_x_correlation <- inverted_matrix %*% cor_matrix
round(precision_x_correlation, 2)
##            LotArea GrLivArea GarageArea
## LotArea          1         0          0
## GrLivArea        0         1          0
## GarageArea       0         0          1
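The same identity can be checked without rounding. The sketch below rebuilds the correlation matrix from the three pairwise correlations reported by cor.test above (an assumption made so the example runs without the dataset) and compares both products with the identity matrix.

```r
vars <- c("LotArea", "GrLivArea", "GarageArea")
cor_mat <- matrix(c(1,         0.2631162, 0.1804028,
                    0.2631162, 1,         0.4689975,
                    0.1804028, 0.4689975, 1),
                  nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

precision_mat <- solve(cor_mat)

# Both products should equal the 3 x 3 identity up to floating-point error.
identity_3 <- diag(3)
left  <- all.equal(cor_mat %*% precision_mat, identity_3, check.attributes = FALSE)
right <- all.equal(precision_mat %*% cor_mat, identity_3, check.attributes = FALSE)
c(left = left, right = right)
```

Using all.equal rather than round() makes the tolerance explicit instead of hiding small floating-point differences.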

 

3. Conduct LU decomposition on the matrix.

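# Perform LU decomposition on the precision (inverted correlation) matrix.
# lu.decomposition() comes from the matrixcalc package.
library(matrixcalc)
lu.decomposition(inverted_matrix)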
## $L
##             [,1]       [,2] [,3]
## [1,]  1.00000000  0.0000000    0
## [2,] -0.22884393  1.0000000    0
## [3,] -0.07307553 -0.4689975    1
## 
## $U
##          [,1]       [,2]        [,3]
## [1,] 1.079209 -0.2469705 -0.07886378
## [2,] 0.000000  1.2819833 -0.60124693
## [3,] 0.000000  0.0000000  1.00000000
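To see what the decomposition produces, here is a minimal base-R sketch of Doolittle LU factorization without pivoting. `lu_doolittle` is a hypothetical helper written for illustration, not the matrixcalc implementation, and the input is the precision matrix rebuilt from the pairwise correlations reported above.

```r
# Doolittle LU without pivoting: A = L %*% U, with a unit diagonal on L.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # clear the entry below the pivot
    }
  }
  list(L = L, U = U)
}

# Correlation matrix rebuilt from the reported pairwise correlations (assumption).
cor_mat <- matrix(c(1,         0.2631162, 0.1804028,
                    0.2631162, 1,         0.4689975,
                    0.1804028, 0.4689975, 1),
                  nrow = 3, byrow = TRUE)
precision_mat <- solve(cor_mat)

lu <- lu_doolittle(precision_mat)

# Multiplying the factors back together should recover the input matrix.
max(abs(lu$L %*% lu$U - precision_mat))
```

The reconstruction error should be on the order of floating-point precision, and the factors should match the $L and $U output above to the digits shown.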

 

Calculus-Based Probability & Statistics

1. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function.

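# Select the right-skewed TotalBsmtSF variable from the training dataset
# and run fitdistr (from the MASS package) to fit an exponential probability density function.
# TotalBsmtSF has a minimum of 0, which the exponential fit accepts, so no shift is applied.
library(MASS)
fit <- fitdistr(training_dataset$TotalBsmtSF, "exponential")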
fit
##        rate     
##   0.00094568957 
##  (0.00002474983)
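For the exponential distribution, the maximum-likelihood rate has a closed form: the reciprocal of the sample mean, with approximate standard error rate / sqrt(n). The sketch below checks the fitdistr output against the mean of TotalBsmtSF (1057.429) and the 1,460 training rows; the variable names are illustrative.

```r
mean_bsmt <- 1057.429   # sample mean of TotalBsmtSF
n <- 1460               # number of training rows

# Closed-form MLE of the exponential rate and its approximate standard error.
rate_hat <- 1 / mean_bsmt
se_hat <- rate_hat / sqrt(n)

signif(c(rate = rate_hat, se = se_hat), 4)
```

Both numbers should agree with the rate and the parenthesized standard error printed by fitdistr.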

2. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).

# Check the names that are available in the fit.
names(fit)
## [1] "estimate" "sd"       "vcov"     "n"        "loglik"
lambda <- fit$estimate
samples <- rexp(1000, lambda)

Plot a histogram and compare it with a histogram of your original variable.

# Histogram of samples.
hist(samples, breaks = 100)

# Histogram of original variable.
original_variable <- training_dataset$TotalBsmtSF
hist(original_variable, breaks = 100)

Conclusion:

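The simulated exponential samples are strongly right-skewed, with the highest frequencies near zero, whereas the original variable is concentrated around its median (about 990) with a long right tail. The exponential fit therefore reproduces the right skew but not the overall shape of the original data.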

3a. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

# These percentiles are estimated from the 1,000 simulated samples.
quantile(samples, probs = c(0.05, 0.95))
##         5%        95% 
##   53.62245 3062.35061
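The exponential CDF, F(q) = 1 - exp(-λq), can also be inverted directly to give q = -log(1 - p) / λ. The sketch below applies this with the fitted rate reported above and confirms that it matches R's built-in qexp; the names are illustrative.

```r
lambda <- 0.00094568957
p <- c(0.05, 0.95)

# Invert the CDF by hand.
q_manual <- -log(1 - p) / lambda

# Built-in exponential quantile function.
q_builtin <- qexp(p, rate = lambda)

rbind(q_manual, q_builtin)
```

These theoretical percentiles will differ somewhat from sample-based values, which depend on the particular 1,000 random draws.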

3b. Also generate a 95% confidence interval from the empirical data, assuming normality.

empirical_data <- training_dataset$TotalBsmtSF
mean(empirical_data)
## [1] 1057.429
# Simulate normal data with the same mean and standard deviation as the empirical data.
normality <- rnorm(length(empirical_data), mean(empirical_data), sd(empirical_data))
hist(normality)

3c. Finally, provide the empirical 5th percentile and 95th percentile of the data.

# Percentiles of the simulated normal data generated in 3b.
quantile(normality, probs = c(0.05, 0.95))
##        5%       95% 
##  329.8326 1789.1667
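For reference, a normal-theory 95% confidence interval and the empirical percentiles of the raw data could be computed as follows, reading 3b as an interval for the mean. `normal_ci` is a hypothetical helper, and `v` is simulated stand-in data so that the example runs without the dataset; in this analysis it would be training_dataset$TotalBsmtSF.

```r
# Normal-theory confidence interval for the mean of a numeric vector.
normal_ci <- function(v, level = 0.95) {
  v <- v[!is.na(v)]
  z_crit <- qnorm(1 - (1 - level) / 2)
  half_width <- z_crit * sd(v) / sqrt(length(v))
  c(lower = mean(v) - half_width, upper = mean(v) + half_width)
}

# Stand-in data (assumption): replace with training_dataset$TotalBsmtSF.
set.seed(1)
v <- rexp(1460, rate = 1 / 1000)

normal_ci(v)

# Empirical 5th and 95th percentiles, taken from the data itself.
quantile(v, probs = c(0.05, 0.95))
```

Calling quantile on the observed values, rather than on simulated draws, gives percentiles that do not change from run to run.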