Your final is due by the end of day on 19 May This project will show off your ability to understand the elements of the class.

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

training_data_url <- 'https://raw.githubusercontent.com/folushoa/Data-Science/Data-605/Final/train.csv'
test_data_url <- 'https://raw.githubusercontent.com/folushoa/Data-Science/Data-605/Final/test.csv'

Pick one of the quantitative independent variables from the training data set (train.csv) , and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

training_data_df <- read_csv(training_data_url) %>% 
  rename(FirstFlrSF = '1stFlrSF', SecondFlrSF = '2ndFlrSF')
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_data_df <- read_csv(test_data_url)%>% 
  rename(FirstFlrSF = '1stFlrSF', SecondFlrSF = '2ndFlrSF')
## Rows: 1459 Columns: 80
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(training_data_df)
## # A tibble: 6 × 81
##      Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
##   <dbl>      <dbl> <chr>          <dbl>   <dbl> <chr>  <chr> <chr>   
## 1     1         60 RL                65    8450 Pave   <NA>  Reg     
## 2     2         20 RL                80    9600 Pave   <NA>  Reg     
## 3     3         60 RL                68   11250 Pave   <NA>  IR1     
## 4     4         70 RL                60    9550 Pave   <NA>  IR1     
## 5     5         60 RL                84   14260 Pave   <NA>  IR1     
## 6     6         50 RL                85   14115 Pave   <NA>  IR1     
## # ℹ 73 more variables: LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## #   LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## #   BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## #   YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## #   Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>,
## #   ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <chr>,
## #   BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …

Variable X is to be skewed to the right and quantitative so going to check histogram of all quantitative variables

# Data-Frame of only numeric variables
training_data_numeric_df <- training_data_df %>% 
  dplyr::select(where(is.numeric))

# convert data to long
training_data_numeric_df_long <- training_data_numeric_df %>% pivot_longer(cols = everything(), 
                                                           names_to = 'variable', 
                                                           values_to = 'value')

# Plot histogram of all variables
training_data_numeric_df_long %>% ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = 'blue', color = 'black') +
  facet_wrap(~ variable, scales = 'free') +
  labs(title = 'Histogram of Quantitative Variables from Test Data') +
  theme_minimal()
## Warning: Removed 348 rows containing non-finite values (`stat_bin()`).

I will set X to LotFrontage, and Y to SalePrice.
Note LotFrontage has missing data, and since it is right skewed going to replace missing data with median

# X and Y
X <- training_data_df$LotFrontage
Y <- training_data_df$SalePrice

# calculate median of X
median_x <- median(X, na.rm = TRUE)

# Replace NA values in X with the median
X <- ifelse(is.na(X), median_x, X)

Quesition 1

Probability Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.

# Define quantiles
x <- quantile(X, 0.75) # third quantile of X
y <- quantile(Y, 0.50) # second quantile of y

a. \(P(X>x|Y>y)\)

prob_a <- sum(X>x & Y>y)/sum(Y>y)
print(paste('P(X>x|Y>y) =', round(prob_a, 3)))
## [1] "P(X>x|Y>y) = 0.334"

Meaning: \(P(X>x|Y>y)\) is 33.4%. This means that the probability of the LotFrontage being larger than 79 given that the SalePrice is larger than 163000 is 33.4%

b. \(P(X>x, Y>y)\)

prob_b <- sum(X>x & Y>y)/nrow(training_data_df)
print(paste('P(X>x,Y>y) =', round(prob_b, 3)))
## [1] "P(X>x,Y>y) = 0.166"

Meaning: \(P(X>x, Y>y)\) is 16.6%. This means that the probability of the LotFrontage being larger than 79 and the SalePrice is larger than 163000 is 16.6%

c. \(P(X<x|Y>y)\)

prob_c <- sum(X<x & Y>y)/sum(Y>y)
print(paste('P(X<x|Y>y) =', round(prob_c, 3)))
## [1] "P(X<x|Y>y) = 0.651"

Meaning: \(P(X<x|Y>y)\) is 65.1%. This means that the probability of the LotFrontage being smaller than 79 given that the SalePrice is larger than 163000 is 65.1%

Table of Counts:

table_of_counts <- addmargins(table(X>x, Y>y)) # table of counts with sum

# define column names and row names of table
rownames(table_of_counts) <- c('<=3d quartile', '>3d quartile', 'Total')
colnames(table_of_counts) <- c('<=2d quartile', '>2d quartile', 'Total')
print(table_of_counts)
##                
##                 <=2d quartile >2d quartile Total
##   <=3d quartile           622          485  1107
##   >3d quartile            110          243   353
##   Total                   732          728  1460

Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

A <- X > x # observations above the 3rd quartile
B <- Y > y # observations above the 2nd quartile

P_A_given_B <- sum(A*B)/sum(B) # P(A|B)

# P(A)P(B)
P_A <- sum(A)/nrow(training_data_df)
P_B <- sum(B)/nrow(training_data_df)
P_A_P_B <- P_A*P_B

# check if P(A|B) = P(A)P(B)
print(paste('P(A|B) = P(A)P(B):', P_A_given_B == P_A_P_B))
## [1] "P(A|B) = P(A)P(B): FALSE"

\(P(A|B) \neq P(A)P(B)\), therefore splitting the training data does not make them independent.

Confirming using the Chi square test

association_table <- table(A, B)

# Chi-square test
chi_test <- chisq.test(association_table)
print(chi_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  association_table
## X-squared = 66.058, df = 1, p-value = 4.38e-16

The p-value is less than 0.05, so we reject the null hypothesis, further proof that A and B are independent.

Question 2

Descriptive and Inferential Statistics: Provide uni-variate descriptive statistics and appropriate plots for the training data set.

Descriptive statistics:

# summary of data
print('Statistical summary of the training data set:')
## [1] "Statistical summary of the training data set:"
summary(training_data_df)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          FirstFlrSF  
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##   SecondFlrSF    LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      3SsnPorch       ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 
# structure of data
print('Strucure of the training data set:')
## [1] "Strucure of the training data set:"
str(training_data_df)
## spc_tbl_ [1,460 × 81] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id           : num [1:1460] 1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : num [1:1460] 60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr [1:1460] "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : num [1:1460] 65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : num [1:1460] 8450 9600 11250 9550 14260 ...
##  $ Street       : chr [1:1460] "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr [1:1460] NA NA NA NA ...
##  $ LotShape     : chr [1:1460] "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr [1:1460] "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr [1:1460] "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr [1:1460] "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr [1:1460] "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr [1:1460] "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr [1:1460] "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr [1:1460] "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr [1:1460] "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr [1:1460] "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : num [1:1460] 7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : num [1:1460] 5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : num [1:1460] 2003 1976 2001 1915 2000 ...
##  $ YearRemodAdd : num [1:1460] 2003 1976 2002 1970 2000 ...
##  $ RoofStyle    : chr [1:1460] "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr [1:1460] "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr [1:1460] "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : num [1:1460] 196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr [1:1460] "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr [1:1460] "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr [1:1460] "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr [1:1460] "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr [1:1460] "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr [1:1460] "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr [1:1460] "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : num [1:1460] 706 978 486 216 655 ...
##  $ BsmtFinType2 : chr [1:1460] "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : num [1:1460] 0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : num [1:1460] 150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : num [1:1460] 856 1262 920 756 1145 ...
##  $ Heating      : chr [1:1460] "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr [1:1460] "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr [1:1460] "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr [1:1460] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ FirstFlrSF   : num [1:1460] 856 1262 920 961 1145 ...
##  $ SecondFlrSF  : num [1:1460] 854 0 866 756 1053 ...
##  $ LowQualFinSF : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : num [1:1460] 1710 1262 1786 1717 2198 ...
##  $ BsmtFullBath : num [1:1460] 1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : num [1:1460] 0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : num [1:1460] 2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : num [1:1460] 1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : num [1:1460] 3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : num [1:1460] 1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr [1:1460] "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : num [1:1460] 8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr [1:1460] "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : num [1:1460] 0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr [1:1460] NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr [1:1460] "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : num [1:1460] 2003 1976 2001 1998 2000 ...
##  $ GarageFinish : chr [1:1460] "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : num [1:1460] 2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : num [1:1460] 548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr [1:1460] "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr [1:1460] "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr [1:1460] "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : num [1:1460] 0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : num [1:1460] 61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: num [1:1460] 0 0 0 272 0 0 0 228 205 0 ...
##  $ 3SsnPorch    : num [1:1460] 0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr [1:1460] NA NA NA NA ...
##  $ Fence        : chr [1:1460] NA NA NA NA ...
##  $ MiscFeature  : chr [1:1460] NA NA NA NA ...
##  $ MiscVal      : num [1:1460] 0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : num [1:1460] 2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : num [1:1460] 2008 2007 2008 2006 2008 ...
##  $ SaleType     : chr [1:1460] "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr [1:1460] "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : num [1:1460] 208500 181500 223500 140000 250000 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   MSSubClass = col_double(),
##   ..   MSZoning = col_character(),
##   ..   LotFrontage = col_double(),
##   ..   LotArea = col_double(),
##   ..   Street = col_character(),
##   ..   Alley = col_character(),
##   ..   LotShape = col_character(),
##   ..   LandContour = col_character(),
##   ..   Utilities = col_character(),
##   ..   LotConfig = col_character(),
##   ..   LandSlope = col_character(),
##   ..   Neighborhood = col_character(),
##   ..   Condition1 = col_character(),
##   ..   Condition2 = col_character(),
##   ..   BldgType = col_character(),
##   ..   HouseStyle = col_character(),
##   ..   OverallQual = col_double(),
##   ..   OverallCond = col_double(),
##   ..   YearBuilt = col_double(),
##   ..   YearRemodAdd = col_double(),
##   ..   RoofStyle = col_character(),
##   ..   RoofMatl = col_character(),
##   ..   Exterior1st = col_character(),
##   ..   Exterior2nd = col_character(),
##   ..   MasVnrType = col_character(),
##   ..   MasVnrArea = col_double(),
##   ..   ExterQual = col_character(),
##   ..   ExterCond = col_character(),
##   ..   Foundation = col_character(),
##   ..   BsmtQual = col_character(),
##   ..   BsmtCond = col_character(),
##   ..   BsmtExposure = col_character(),
##   ..   BsmtFinType1 = col_character(),
##   ..   BsmtFinSF1 = col_double(),
##   ..   BsmtFinType2 = col_character(),
##   ..   BsmtFinSF2 = col_double(),
##   ..   BsmtUnfSF = col_double(),
##   ..   TotalBsmtSF = col_double(),
##   ..   Heating = col_character(),
##   ..   HeatingQC = col_character(),
##   ..   CentralAir = col_character(),
##   ..   Electrical = col_character(),
##   ..   `1stFlrSF` = col_double(),
##   ..   `2ndFlrSF` = col_double(),
##   ..   LowQualFinSF = col_double(),
##   ..   GrLivArea = col_double(),
##   ..   BsmtFullBath = col_double(),
##   ..   BsmtHalfBath = col_double(),
##   ..   FullBath = col_double(),
##   ..   HalfBath = col_double(),
##   ..   BedroomAbvGr = col_double(),
##   ..   KitchenAbvGr = col_double(),
##   ..   KitchenQual = col_character(),
##   ..   TotRmsAbvGrd = col_double(),
##   ..   Functional = col_character(),
##   ..   Fireplaces = col_double(),
##   ..   FireplaceQu = col_character(),
##   ..   GarageType = col_character(),
##   ..   GarageYrBlt = col_double(),
##   ..   GarageFinish = col_character(),
##   ..   GarageCars = col_double(),
##   ..   GarageArea = col_double(),
##   ..   GarageQual = col_character(),
##   ..   GarageCond = col_character(),
##   ..   PavedDrive = col_character(),
##   ..   WoodDeckSF = col_double(),
##   ..   OpenPorchSF = col_double(),
##   ..   EnclosedPorch = col_double(),
##   ..   `3SsnPorch` = col_double(),
##   ..   ScreenPorch = col_double(),
##   ..   PoolArea = col_double(),
##   ..   PoolQC = col_character(),
##   ..   Fence = col_character(),
##   ..   MiscFeature = col_character(),
##   ..   MiscVal = col_double(),
##   ..   MoSold = col_double(),
##   ..   YrSold = col_double(),
##   ..   SaleType = col_character(),
##   ..   SaleCondition = col_character(),
##   ..   SalePrice = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# shape of date
training_data_df_size <- dim(training_data_df)
print(paste('The size of the training data set is', training_data_df_size[1], 'by', training_data_df_size[2]))
## [1] "The size of the training data set is 1460 by 81"
# count number of missing data
print('Variables and the number of missing data:')
## [1] "Variables and the number of missing data:"
missing_values <- colSums(is.na(training_data_df)) # count missing data in each column

# create data frame of Variable and Count of missing data
missing_values_df <- tibble(
  Variable = names(missing_values),
  'Missing Value Count' = missing_values
)
print(missing_values_df)
## # A tibble: 81 × 2
##    Variable    `Missing Value Count`
##    <chr>                       <dbl>
##  1 Id                              0
##  2 MSSubClass                      0
##  3 MSZoning                        0
##  4 LotFrontage                   259
##  5 LotArea                         0
##  6 Street                          0
##  7 Alley                        1369
##  8 LotShape                        0
##  9 LandContour                     0
## 10 Utilities                       0
## # ℹ 71 more rows

Plots: Histogram of numeric columns:

# Data-Frame of only numeric variables
training_data_numeric_df <- training_data_df %>% 
  dplyr::select(where(~is.numeric(.)))

# convert data to long
training_data_numeric_df_long <- training_data_numeric_df %>% pivot_longer(cols = everything(), 
                                                           names_to = 'variable', 
                                                           values_to = 'value')

# Plot histogram of all variables
training_data_numeric_df_long %>% ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = 'blue', color = 'black') +
  facet_wrap(~ variable, scales = 'free') +
  labs(title = 'Histogram of Quantitative Variables from Test Data') +
  theme_minimal()
## Warning: Removed 348 rows containing non-finite values (`stat_bin()`).

Bar plot of categorical columns:

# Data-Frame of only numeric variables
training_data_categorical_df <- training_data_df %>% 
  dplyr::select(where(~is.character(.)))

# convert data to long
training_data_categorical_df_long <- training_data_categorical_df %>% pivot_longer(cols = everything(), 
                                                           names_to = 'variable', 
                                                           values_to = 'value')

# Plot histogram of all variables
training_data_categorical_df_long %>% ggplot(aes(x = value)) +
  geom_bar(fill = 'blue', color = 'black') +
  facet_wrap(~ variable, scales = 'free') +
  labs(title = 'Distribution of Categorical Variables from Test Data') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_minimal()

Scatter plot of X and Y

# created a new data frame of LotFrontage and SalePrice because I replaced the missing data from LotFrontage with the mean
df <- tibble(LotFrontage = X, SalePrice = Y)

# plot scatter plot
df %>% ggplot(aes(x = LotFrontage, y = SalePrice)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(title = 'Scatter Plot of SalePrice vs LotFrontage') +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

There is a positive relationship between LotFrontage and SalePrice. However the larger the LotFrontage becomes the further the less dense the data is in relation to the linear model drawn.

Provide a 95% CI for the difference in the mean of the variables.

CI <- t.test(X, Y, conf.level = 0.95)
print(paste('95% confidence interval:', round(CI$conf.int[1], 3), round(CI$conf.int[2], 3)))
## [1] "95% confidence interval: -184929.687 -176772.977"

This means that 95% of the difference in means between SalePrice and LotFrontage is between -184929.687 and 176772.977.

Derive a correlation matrix for two of the quantitative variables you selected:

correlation_matrix <- cor(df[, c('LotFrontage', 'SalePrice')])
print(correlation_matrix)
##             LotFrontage SalePrice
## LotFrontage   1.0000000 0.3347709
## SalePrice     0.3347709 1.0000000

Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval

correlation_test <- cor.test(df$LotFrontage, df$SalePrice, conf.level = 0.99)
print(paste('p-value:', correlation_test$p.value))
## [1] "p-value: 1.44698460070571e-39"
print(paste('99% confidence interval:', 
            round(correlation_test$conf.int[1], 3), 
            round(correlation_test$conf.int[2], 3)))
## [1] "99% confidence interval: 0.274 0.393"

The p-value is less than 0.01 (significance level). Thus, we reject the null hypothesis that correlation between LotFrontage and SalePrice is 0. This is evident from the earlier correlation matrix showing that the correlation between these two variables is 0.335.

The 99% confidence interval of the correlation relationship is 0.274 and 0.393. This means that there is a 99% chance the correlation between LotFrontage and SalePrice will be between 0.274 and 0.393, which is the case (the correlation is 0.335)

Question 3

Linear Algebra and Correlation Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

precision_matrix <- solve(correlation_matrix)
print(precision_matrix)
##             LotFrontage  SalePrice
## LotFrontage   1.1262168 -0.3770246
## SalePrice    -0.3770246  1.1262168

Multiply the correlation matrix by the precision matrix

c_by_p <- correlation_matrix%*%precision_matrix
print(round(c_by_p), 3)
##             LotFrontage SalePrice
## LotFrontage           1         0
## SalePrice             0         1

Multiply the precision matrix by the correlation matrix

p_by_c <- precision_matrix%*%correlation_matrix
print(round(p_by_c), 3)
##             LotFrontage SalePrice
## LotFrontage           1         0
## SalePrice             0         1

Since the precision matrix is the inverse of the correlation matrix I would expect the result of the multiplication between them to result in the identity matrix, which it did.

Conduct principle components analysis Principle component analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while retaining the variability.

# PCA between LotFrontage and SalePrice
pca <- prcomp(df, scale. = TRUE)

# Summary of PCA
print('PCA summary:')
## [1] "PCA summary:"
print(summary(pca))
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.1553 0.8156
## Proportion of Variance 0.6674 0.3326
## Cumulative Proportion  0.6674 1.0000
# PCA component
print('PCA loading:')
## [1] "PCA loading:"
print(pca$rotation)
##                   PC1        PC2
## LotFrontage 0.7071068 -0.7071068
## SalePrice   0.7071068  0.7071068

PC1 accounts for 66.7% of the variability of the transformed data set while PC2 accounts for 33.3% of the variability of of the transformed data set. LotFrontage and SalePrice both contribute equally to PC1, while for PC2, the both contribute equally but in opposite directions.

Question 4

Calculus-Based Probability & Statistics Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero.

# check if minimum value of chosen X > 0
print(summary(X))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   60.00   69.00   69.86   79.00  313.00

The minimum value is greater than 0 so no shifting is needed.

Fit an exponential probability density function and find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\)))

# fit an exponential probability density funciton
exp_fit <- fitdistr(X, "exponential")

# optimal value of lambda
lambda <- exp_fit$estimate
print(paste("Optimal value of lambda:", lambda))
## [1] "Optimal value of lambda: 0.0143135851609298"
# 1000 samples from this exponential distribution
set.seed(42)
sample <- rexp(1000, lambda)

Plot a histogram and compare it with a histogram of your original variable

ggplot() +
  geom_histogram(aes(X, fill = "Original Variable"), bins = 30, alpha = 0.5) +
  geom_histogram(aes(sample, fill = "Exponential Sample"), bins = 30, alpha = 0.5) +
  labs(title = "Histogram of Expnential Sample and Original Variable")

  theme_minimal()
## List of 97
##  $ line                      :List of 6
##   ..$ colour       : chr "black"
##   ..$ linewidth    : num 0.5
##   ..$ linetype     : num 1
##   ..$ lineend      : chr "butt"
##   ..$ arrow        : logi FALSE
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_line" "element"
##  $ rect                      :List of 5
##   ..$ fill         : chr "white"
##   ..$ colour       : chr "black"
##   ..$ linewidth    : num 0.5
##   ..$ linetype     : num 1
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_rect" "element"
##  $ text                      :List of 11
##   ..$ family       : chr ""
##   ..$ face         : chr "plain"
##   ..$ colour       : chr "black"
##   ..$ size         : num 11
##   ..$ hjust        : num 0.5
##   ..$ vjust        : num 0.5
##   ..$ angle        : num 0
##   ..$ lineheight   : num 0.9
##   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : logi FALSE
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ title                     : NULL
##  $ aspect.ratio              : NULL
##  $ axis.title                : NULL
##  $ axis.title.x              :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 2.75points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.title.x.top          :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 0
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 2.75points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.title.x.bottom       : NULL
##  $ axis.title.y              :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : num 90
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 2.75points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.title.y.left         : NULL
##  $ axis.title.y.right        :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 0
##   ..$ angle        : num -90
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.75points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text                 :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : chr "grey30"
##   ..$ size         : 'rel' num 0.8
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.x               :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 2.2points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.x.top           :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 0
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 2.2points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.x.bottom        : NULL
##  $ axis.text.y               :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 1
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.y.left          : NULL
##  $ axis.text.y.right         :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 0
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.2points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.ticks                : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ axis.ticks.x              : NULL
##  $ axis.ticks.x.top          : NULL
##  $ axis.ticks.x.bottom       : NULL
##  $ axis.ticks.y              : NULL
##  $ axis.ticks.y.left         : NULL
##  $ axis.ticks.y.right        : NULL
##  $ axis.ticks.length         : 'simpleUnit' num 2.75points
##   ..- attr(*, "unit")= int 8
##  $ axis.ticks.length.x       : NULL
##  $ axis.ticks.length.x.top   : NULL
##  $ axis.ticks.length.x.bottom: NULL
##  $ axis.ticks.length.y       : NULL
##  $ axis.ticks.length.y.left  : NULL
##  $ axis.ticks.length.y.right : NULL
##  $ axis.line                 : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ axis.line.x               : NULL
##  $ axis.line.x.top           : NULL
##  $ axis.line.x.bottom        : NULL
##  $ axis.line.y               : NULL
##  $ axis.line.y.left          : NULL
##  $ axis.line.y.right         : NULL
##  $ legend.background         : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ legend.margin             : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
##   ..- attr(*, "unit")= int 8
##  $ legend.spacing            : 'simpleUnit' num 11points
##   ..- attr(*, "unit")= int 8
##  $ legend.spacing.x          : NULL
##  $ legend.spacing.y          : NULL
##  $ legend.key                : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ legend.key.size           : 'simpleUnit' num 1.2lines
##   ..- attr(*, "unit")= int 3
##  $ legend.key.height         : NULL
##  $ legend.key.width          : NULL
##  $ legend.text               :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : 'rel' num 0.8
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ legend.text.align         : NULL
##  $ legend.title              :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 0
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ legend.title.align        : NULL
##  $ legend.position           : chr "right"
##  $ legend.direction          : NULL
##  $ legend.justification      : chr "center"
##  $ legend.box                : NULL
##  $ legend.box.just           : NULL
##  $ legend.box.margin         : 'margin' num [1:4] 0cm 0cm 0cm 0cm
##   ..- attr(*, "unit")= int 1
##  $ legend.box.background     : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ legend.box.spacing        : 'simpleUnit' num 11points
##   ..- attr(*, "unit")= int 8
##  $ panel.background          : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ panel.border              : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ panel.spacing             : 'simpleUnit' num 5.5points
##   ..- attr(*, "unit")= int 8
##  $ panel.spacing.x           : NULL
##  $ panel.spacing.y           : NULL
##  $ panel.grid                :List of 6
##   ..$ colour       : chr "grey92"
##   ..$ linewidth    : NULL
##   ..$ linetype     : NULL
##   ..$ lineend      : NULL
##   ..$ arrow        : logi FALSE
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_line" "element"
##  $ panel.grid.major          : NULL
##  $ panel.grid.minor          :List of 6
##   ..$ colour       : NULL
##   ..$ linewidth    : 'rel' num 0.5
##   ..$ linetype     : NULL
##   ..$ lineend      : NULL
##   ..$ arrow        : logi FALSE
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_line" "element"
##  $ panel.grid.major.x        : NULL
##  $ panel.grid.major.y        : NULL
##  $ panel.grid.minor.x        : NULL
##  $ panel.grid.minor.y        : NULL
##  $ panel.ontop               : logi FALSE
##  $ plot.background           : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ plot.title                :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : 'rel' num 1.2
##   ..$ hjust        : num 0
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 5.5points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ plot.title.position       : chr "panel"
##  $ plot.subtitle             :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 0
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 5.5points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ plot.caption              :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : 'rel' num 0.8
##   ..$ hjust        : num 1
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 5.5points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ plot.caption.position     : chr "panel"
##  $ plot.tag                  :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : 'rel' num 1.2
##   ..$ hjust        : num 0.5
##   ..$ vjust        : num 0.5
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ plot.tag.position         : chr "topleft"
##  $ plot.margin               : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
##   ..- attr(*, "unit")= int 8
##  $ strip.background          : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ strip.background.x        : NULL
##  $ strip.background.y        : NULL
##  $ strip.clip                : chr "inherit"
##  $ strip.placement           : chr "inside"
##  $ strip.text                :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : chr "grey10"
##   ..$ size         : 'rel' num 0.8
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 4.4points 4.4points 4.4points 4.4points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ strip.text.x              : NULL
##  $ strip.text.x.bottom       : NULL
##  $ strip.text.x.top          : NULL
##  $ strip.text.y              :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : num -90
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ strip.text.y.left         :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : num 90
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ strip.text.y.right        : NULL
##  $ strip.switch.pad.grid     : 'simpleUnit' num 2.75points
##   ..- attr(*, "unit")= int 8
##  $ strip.switch.pad.wrap     : 'simpleUnit' num 2.75points
##   ..- attr(*, "unit")= int 8
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi TRUE
##  - attr(*, "validate")= logi TRUE

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF)

exp_percentile <- quantile(sample, c(0.05, 0.95))

print(paste('The 5th percentile:', round(exp_percentile[1], 3)))
## [1] "The 5th percentile: 3.429"
print(paste('The 95th percentile:', round(exp_percentile[2], 3)))
## [1] "The 95th percentile: 225.481"

Generate a 95% confidence interval from the empirical data, assuming normality

# 95% confidence interval
CI <- t.test(X)
print(paste("95% confidence interval of X:", round(CI$conf.int[1], 3), round(CI$conf.int[2], 3)))
## [1] "95% confidence interval of X: 68.733 70.995"

Thus, there is a 95% confident that the mean of X falls between 68.733, and 70.995.

Checking:

mean_X <- mean(X)
print(paste("Mean of X:", round(mean_X, 3)))
## [1] "Mean of X: 69.864"

The mean of X (69.864) falls between 68.733 and 70.995.

The empirical 5th percentile and 95th percentile of the data

percentile <- quantile(X, probs = c(0.05, 0.95))
print(paste("The 5th percentile:", percentile[1]))
## [1] "The 5th percentile: 35.95"
print(paste("The 95the percentile:", percentile[2]))
## [1] "The 95the percentile: 104"

Question 5

Modeling: Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Going to use data frame with only numeric columns.

# tidy data frame
# remove columns with more than half the data missing
# identify those columns
columns_to_remove <- 
  colnames(training_data_numeric_df)[colSums(is.na(training_data_numeric_df)) > nrow(training_data_numeric_df)/2]

# remove those columns
new_train_df <- training_data_numeric_df %>% 
  dplyr::select(-all_of(columns_to_remove)) %>% 
  rename_all(make.names)

# replace missing data
# perform imputation
imputed_object <- mice(new_train_df, m = 1, maxit = 50, meth = "pmm", seed = 42)
## 
##  iter imp variable
##   1   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   2   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   3   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   4   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   5   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   6   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   7   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   8   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   9   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   10   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   11   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   12   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   13   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   14   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   15   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   16   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   17   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   18   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   19   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   20   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   21   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   22   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   23   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   24   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   25   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   26   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   27   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   28   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   29   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   30   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   31   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   32   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   33   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   34   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   35   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   36   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   37   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   38   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   39   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   40   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   41   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   42   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   43   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   44   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   45   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   46   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   47   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   48   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   49   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
##   50   1  LotFrontage*  MasVnrArea*  GarageYrBlt*
## Warning: Number of logged events: 300
imputed_train_df <- complete(imputed_object, 1) %>% 
  dplyr::select(-c(Id))

Create a linear regression model and using manual elimination optimize the model

model <- lm(SalePrice ~., data = imputed_train_df)
model_summary <- summary(model)

print(model_summary)
## 
## Call:
## lm(formula = SalePrice ~ ., data = imputed_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470962  -16652   -2096   13593  303914 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.864e+05  1.413e+06   0.273 0.784608    
## MSSubClass    -1.767e+02  2.647e+01  -6.677 3.49e-11 ***
## LotFrontage   -5.144e+01  1.908e+01  -2.696 0.007094 ** 
## LotArea        4.302e-01  1.012e-01   4.251 2.26e-05 ***
## OverallQual    1.722e+04  1.188e+03  14.495  < 2e-16 ***
## OverallCond    4.563e+03  1.028e+03   4.440 9.69e-06 ***
## YearBuilt      3.615e+02  7.070e+01   5.113 3.60e-07 ***
## YearRemodAdd   1.704e+02  6.786e+01   2.512 0.012129 *  
## MasVnrArea     3.087e+01  5.934e+00   5.203 2.25e-07 ***
## BsmtFinSF1     1.864e+01  4.660e+00   4.000 6.67e-05 ***
## BsmtFinSF2     7.528e+00  7.046e+00   1.068 0.285480    
## BsmtUnfSF      8.797e+00  4.194e+00   2.097 0.036134 *  
## TotalBsmtSF           NA         NA      NA       NA    
## FirstFlrSF     4.846e+01  5.786e+00   8.376  < 2e-16 ***
## SecondFlrSF    4.845e+01  4.959e+00   9.770  < 2e-16 ***
## LowQualFinSF   3.343e+01  1.999e+01   1.672 0.094677 .  
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   9.385e+03  2.610e+03   3.595 0.000335 ***
## BsmtHalfBath   1.500e+03  4.088e+03   0.367 0.713745    
## FullBath       4.254e+03  2.819e+03   1.509 0.131433    
## HalfBath      -1.557e+03  2.661e+03  -0.585 0.558395    
## BedroomAbvGr  -1.012e+04  1.697e+03  -5.963 3.12e-09 ***
## KitchenAbvGr  -1.259e+04  5.199e+03  -2.422 0.015559 *  
## TotRmsAbvGrd   5.049e+03  1.235e+03   4.087 4.62e-05 ***
## Fireplaces     3.941e+03  1.783e+03   2.211 0.027208 *  
## GarageYrBlt   -5.215e+01  7.547e+01  -0.691 0.489684    
## GarageCars     1.071e+04  2.850e+03   3.759 0.000178 ***
## GarageArea     9.382e-01  1.006e+01   0.093 0.925700    
## WoodDeckSF     2.730e+01  7.992e+00   3.416 0.000654 ***
## OpenPorchSF    2.817e-01  1.515e+01   0.019 0.985162    
## EnclosedPorch  1.374e+01  1.687e+01   0.814 0.415672    
## X3SsnPorch     2.116e+01  3.134e+01   0.675 0.499592    
## ScreenPorch    5.401e+01  1.717e+01   3.145 0.001697 ** 
## PoolArea      -3.298e+01  2.359e+01  -1.398 0.162252    
## MiscVal       -4.047e-01  1.856e+00  -0.218 0.827430    
## MoSold        -6.304e+01  3.444e+02  -0.183 0.854782    
## YrSold        -6.922e+02  7.024e+02  -0.985 0.324559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34710 on 1425 degrees of freedom
## Multiple R-squared:  0.8135, Adjusted R-squared:  0.8091 
## F-statistic: 182.9 on 34 and 1425 DF,  p-value: < 2.2e-16

There are a few variables with high p_value (p_value > 0.05). These variables will be removed to reduce the risk of over-fitting.

# create a data frame containing the variables and their associated p-value
p_values <- coef(model_summary)[, "Pr(>|t|)"]

p_value_df <- tibble(
  Predictors = names(p_values),
  p_value = p_values
)

# indicate rows with p-value < 0.05
signifcant_predictor_df <- p_value_df %>% 
  filter(p_value <= 0.05)

# redo linear regression taking out non-significant predictors
updated_predictors_formula <- paste("SalePrice ~ ", 
                                    paste(signifcant_predictor_df$Predictors, collapse = " + "))
updated_model <- lm(as.formula(updated_predictors_formula), data = imputed_train_df)
updated_model_summary <- summary(updated_model)

print(updated_model_summary)
## 
## Call:
## lm(formula = as.formula(updated_predictors_formula), data = imputed_train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -486813  -16571   -1918   13768  287972 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.048e+06  1.156e+05  -9.064  < 2e-16 ***
## MSSubClass   -1.708e+02  2.596e+01  -6.580 6.58e-11 ***
## LotFrontage  -5.226e+01  1.891e+01  -2.763 0.005799 ** 
## LotArea       4.505e-01  1.003e-01   4.494 7.56e-06 ***
## OverallQual   1.767e+04  1.167e+03  15.144  < 2e-16 ***
## OverallCond   4.340e+03  1.005e+03   4.319 1.67e-05 ***
## YearBuilt     3.248e+02  5.192e+01   6.257 5.17e-10 ***
## YearRemodAdd  1.780e+02  6.476e+01   2.749 0.006044 ** 
## MasVnrArea    3.025e+01  5.863e+00   5.161 2.81e-07 ***
## BsmtFinSF1    1.567e+01  3.901e+00   4.018 6.19e-05 ***
## BsmtUnfSF     6.494e+00  3.627e+00   1.790 0.073587 .  
## FirstFlrSF    5.223e+01  5.101e+00  10.238  < 2e-16 ***
## SecondFlrSF   4.794e+01  4.096e+00  11.706  < 2e-16 ***
## BsmtFullBath  9.191e+03  2.397e+03   3.834 0.000131 ***
## BedroomAbvGr -9.569e+03  1.661e+03  -5.762 1.02e-08 ***
## KitchenAbvGr -1.288e+04  5.051e+03  -2.549 0.010892 *  
## TotRmsAbvGrd  5.290e+03  1.208e+03   4.378 1.28e-05 ***
## Fireplaces    3.777e+03  1.747e+03   2.162 0.030786 *  
## GarageCars    1.067e+04  1.687e+03   6.323 3.41e-10 ***
## WoodDeckSF    2.610e+01  7.868e+00   3.317 0.000933 ***
## ScreenPorch   5.111e+01  1.684e+01   3.034 0.002455 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34680 on 1439 degrees of freedom
## Multiple R-squared:  0.812,  Adjusted R-squared:  0.8094 
## F-statistic: 310.8 on 20 and 1439 DF,  p-value: < 2.2e-16

The p-values for the remaining predictors are all less than 0.05. Signifying that they are all significant.Note the r-squared value didn’t change much.

Plot the model to verify that the conditons for a reasonable model are met.

plot(updated_model)

Residuals vs Fitted: Most of the data is about the horizontal line at zero, this indicates linearity. However, there is a slight curve to the distribution of data indication it is moderately linear.
Q-Q Residuals: Most of the data is is distributed on a straight line. At the ends it curves up at the positive end and down at the negative indicating that the distribution is a Laplace distribution

Predict the Test Set Using the Final Model Since my model was done using only numeric columns I will perform the prediction on the numeric columns. Also remove Id columns

test_data_numeric_df <- test_data_df %>% 
  dplyr::select(-Id)%>% 
  dplyr::select(where(is.numeric)) 

# Identify columns with missing data
missing_data_columns <- colnames(test_data_numeric_df)[colSums(is.na(test_data_numeric_df)) > 0]

# Calculate the median for each column with missing data and replace missing values
for (col in missing_data_columns) {
  median_value <- median(test_data_numeric_df[[col]], na.rm = TRUE)
  test_data_numeric_df[[col]][is.na(test_data_numeric_df[[col]])] <- median_value
}

# perform predictions
predictions <- predict(updated_model, test_data_numeric_df)
prediction_df <- data.frame(Id = test_data_df$Id, SalePrice = predictions)

# save to csv for submision
write.csv(prediction_df, "FA_LF_Final.csv", row.names = FALSE)

**Kaggle Submission:* My Kaggle account user name is FolushoAta and my score for this assignment is 0.23010. Note I only used the numeric columns in both the training and test data set. I tried to include the categorical columns but had issue imputing the missing data in the categorical columns.