Your final is due by the end of day on 19 May This project will show off your ability to understand the elements of the class.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
training_data_url <- 'https://raw.githubusercontent.com/folushoa/Data-Science/Data-605/Final/train.csv'
test_data_url <- 'https://raw.githubusercontent.com/folushoa/Data-Science/Data-605/Final/test.csv'
Pick one of the quantitative independent variables from the training data set (train.csv) , and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
training_data_df <- read_csv(training_data_url) %>%
rename(FirstFlrSF = '1stFlrSF', SecondFlrSF = '2ndFlrSF')
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
test_data_df <- read_csv(test_data_url)%>%
rename(FirstFlrSF = '1stFlrSF', SecondFlrSF = '2ndFlrSF')
## Rows: 1459 Columns: 80
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(training_data_df)
## # A tibble: 6 × 81
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## # ℹ 73 more variables: LandContour <chr>, Utilities <chr>, LotConfig <chr>,
## # LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>,
## # BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>,
## # YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>, RoofMatl <chr>,
## # Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>,
## # ExterQual <chr>, ExterCond <chr>, Foundation <chr>, BsmtQual <chr>,
## # BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>, BsmtFinSF1 <dbl>, …
Variable X is to be skewed to the right and quantitative so going to check histogram of all quantitative variables
# Data-Frame of only numeric variables
training_data_numeric_df <- training_data_df %>%
dplyr::select(where(is.numeric))
# convert data to long
training_data_numeric_df_long <- training_data_numeric_df %>% pivot_longer(cols = everything(),
names_to = 'variable',
values_to = 'value')
# Plot histogram of all variables
training_data_numeric_df_long %>% ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = 'blue', color = 'black') +
facet_wrap(~ variable, scales = 'free') +
labs(title = 'Histogram of Quantitative Variables from Test Data') +
theme_minimal()
## Warning: Removed 348 rows containing non-finite values (`stat_bin()`).
I will set X to LotFrontage, and Y to SalePrice.
Note LotFrontage has missing data, and since it is
right skewed going to replace missing data with median
# X and Y
X <- training_data_df$LotFrontage
Y <- training_data_df$SalePrice
# calculate median of X
median_x <- median(X, na.rm = TRUE)
# Replace NA values in X with the median
X <- ifelse(is.na(X), median_x, X)
Probability Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
# Define quantiles
x <- quantile(X, 0.75) # third quantile of X
y <- quantile(Y, 0.50) # second quantile of y
a. \(P(X>x|Y>y)\)
prob_a <- sum(X>x & Y>y)/sum(Y>y)
print(paste('P(X>x|Y>y) =', round(prob_a, 3)))
## [1] "P(X>x|Y>y) = 0.334"
Meaning: \(P(X>x|Y>y)\) is 33.4%. This means that the probability of the LotFrontage being larger than 79 given that the SalePrice is larger than 163000 is 33.4%
b. \(P(X>x, Y>y)\)
prob_b <- sum(X>x & Y>y)/nrow(training_data_df)
print(paste('P(X>x,Y>y) =', round(prob_b, 3)))
## [1] "P(X>x,Y>y) = 0.166"
Meaning: \(P(X>x, Y>y)\) is 16.6%. This means that the probability of the LotFrontage being larger than 79 and the SalePrice is larger than 163000 is 16.6%
c. \(P(X<x|Y>y)\)
prob_c <- sum(X<x & Y>y)/sum(Y>y)
print(paste('P(X<x|Y>y) =', round(prob_c, 3)))
## [1] "P(X<x|Y>y) = 0.651"
Meaning: \(P(X<x|Y>y)\) is 65.1%. This means that the probability of the LotFrontage being smaller than 79 given that the SalePrice is larger than 163000 is 65.1%
Table of Counts:
table_of_counts <- addmargins(table(X>x, Y>y)) # table of counts with sum
# define column names and row names of table
rownames(table_of_counts) <- c('<=3d quartile', '>3d quartile', 'Total')
colnames(table_of_counts) <- c('<=2d quartile', '>2d quartile', 'Total')
print(table_of_counts)
##
## <=2d quartile >2d quartile Total
## <=3d quartile 622 485 1107
## >3d quartile 110 243 353
## Total 732 728 1460
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
A <- X > x # observations above the 3rd quartile
B <- Y > y # observations above the 2nd quartile
P_A_given_B <- sum(A*B)/sum(B) # P(A|B)
# P(A)P(B)
P_A <- sum(A)/nrow(training_data_df)
P_B <- sum(B)/nrow(training_data_df)
P_A_P_B <- P_A*P_B
# check if P(A|B) = P(A)P(B)
print(paste('P(A|B) = P(A)P(B):', P_A_given_B == P_A_P_B))
## [1] "P(A|B) = P(A)P(B): FALSE"
\(P(A|B) \neq P(A)P(B)\), therefore splitting the training data does not make them independent.
Confirming using the Chi square test
association_table <- table(A, B)
# Chi-square test
chi_test <- chisq.test(association_table)
print(chi_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: association_table
## X-squared = 66.058, df = 1, p-value = 4.38e-16
The p-value is less than 0.05, so we reject the null hypothesis, further proof that A and B are independent.
Descriptive and Inferential Statistics: Provide uni-variate descriptive statistics and appropriate plots for the training data set.
Descriptive statistics:
# summary of data
print('Statistical summary of the training data set:')
## [1] "Statistical summary of the training data set:"
summary(training_data_df)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical FirstFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## SecondFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
# structure of data
print('Strucure of the training data set:')
## [1] "Strucure of the training data set:"
str(training_data_df)
## spc_tbl_ [1,460 × 81] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:1460] 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : num [1:1460] 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr [1:1460] "RL" "RL" "RL" "RL" ...
## $ LotFrontage : num [1:1460] 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : num [1:1460] 8450 9600 11250 9550 14260 ...
## $ Street : chr [1:1460] "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr [1:1460] NA NA NA NA ...
## $ LotShape : chr [1:1460] "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr [1:1460] "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr [1:1460] "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr [1:1460] "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr [1:1460] "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr [1:1460] "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr [1:1460] "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr [1:1460] "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr [1:1460] "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr [1:1460] "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : num [1:1460] 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : num [1:1460] 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : num [1:1460] 2003 1976 2001 1915 2000 ...
## $ YearRemodAdd : num [1:1460] 2003 1976 2002 1970 2000 ...
## $ RoofStyle : chr [1:1460] "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr [1:1460] "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr [1:1460] "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr [1:1460] "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : num [1:1460] 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr [1:1460] "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr [1:1460] "TA" "TA" "TA" "TA" ...
## $ Foundation : chr [1:1460] "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr [1:1460] "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr [1:1460] "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr [1:1460] "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr [1:1460] "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : num [1:1460] 706 978 486 216 655 ...
## $ BsmtFinType2 : chr [1:1460] "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : num [1:1460] 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : num [1:1460] 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : num [1:1460] 856 1262 920 756 1145 ...
## $ Heating : chr [1:1460] "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr [1:1460] "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr [1:1460] "Y" "Y" "Y" "Y" ...
## $ Electrical : chr [1:1460] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ FirstFlrSF : num [1:1460] 856 1262 920 961 1145 ...
## $ SecondFlrSF : num [1:1460] 854 0 866 756 1053 ...
## $ LowQualFinSF : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : num [1:1460] 1710 1262 1786 1717 2198 ...
## $ BsmtFullBath : num [1:1460] 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : num [1:1460] 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : num [1:1460] 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : num [1:1460] 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : num [1:1460] 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : num [1:1460] 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr [1:1460] "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : num [1:1460] 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr [1:1460] "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : num [1:1460] 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr [1:1460] NA "TA" "TA" "Gd" ...
## $ GarageType : chr [1:1460] "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : num [1:1460] 2003 1976 2001 1998 2000 ...
## $ GarageFinish : chr [1:1460] "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : num [1:1460] 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : num [1:1460] 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr [1:1460] "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr [1:1460] "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr [1:1460] "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : num [1:1460] 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : num [1:1460] 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: num [1:1460] 0 0 0 272 0 0 0 228 205 0 ...
## $ 3SsnPorch : num [1:1460] 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : num [1:1460] 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr [1:1460] NA NA NA NA ...
## $ Fence : chr [1:1460] NA NA NA NA ...
## $ MiscFeature : chr [1:1460] NA NA NA NA ...
## $ MiscVal : num [1:1460] 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : num [1:1460] 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : num [1:1460] 2008 2007 2008 2006 2008 ...
## $ SaleType : chr [1:1460] "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr [1:1460] "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : num [1:1460] 208500 181500 223500 140000 250000 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. MSSubClass = col_double(),
## .. MSZoning = col_character(),
## .. LotFrontage = col_double(),
## .. LotArea = col_double(),
## .. Street = col_character(),
## .. Alley = col_character(),
## .. LotShape = col_character(),
## .. LandContour = col_character(),
## .. Utilities = col_character(),
## .. LotConfig = col_character(),
## .. LandSlope = col_character(),
## .. Neighborhood = col_character(),
## .. Condition1 = col_character(),
## .. Condition2 = col_character(),
## .. BldgType = col_character(),
## .. HouseStyle = col_character(),
## .. OverallQual = col_double(),
## .. OverallCond = col_double(),
## .. YearBuilt = col_double(),
## .. YearRemodAdd = col_double(),
## .. RoofStyle = col_character(),
## .. RoofMatl = col_character(),
## .. Exterior1st = col_character(),
## .. Exterior2nd = col_character(),
## .. MasVnrType = col_character(),
## .. MasVnrArea = col_double(),
## .. ExterQual = col_character(),
## .. ExterCond = col_character(),
## .. Foundation = col_character(),
## .. BsmtQual = col_character(),
## .. BsmtCond = col_character(),
## .. BsmtExposure = col_character(),
## .. BsmtFinType1 = col_character(),
## .. BsmtFinSF1 = col_double(),
## .. BsmtFinType2 = col_character(),
## .. BsmtFinSF2 = col_double(),
## .. BsmtUnfSF = col_double(),
## .. TotalBsmtSF = col_double(),
## .. Heating = col_character(),
## .. HeatingQC = col_character(),
## .. CentralAir = col_character(),
## .. Electrical = col_character(),
## .. `1stFlrSF` = col_double(),
## .. `2ndFlrSF` = col_double(),
## .. LowQualFinSF = col_double(),
## .. GrLivArea = col_double(),
## .. BsmtFullBath = col_double(),
## .. BsmtHalfBath = col_double(),
## .. FullBath = col_double(),
## .. HalfBath = col_double(),
## .. BedroomAbvGr = col_double(),
## .. KitchenAbvGr = col_double(),
## .. KitchenQual = col_character(),
## .. TotRmsAbvGrd = col_double(),
## .. Functional = col_character(),
## .. Fireplaces = col_double(),
## .. FireplaceQu = col_character(),
## .. GarageType = col_character(),
## .. GarageYrBlt = col_double(),
## .. GarageFinish = col_character(),
## .. GarageCars = col_double(),
## .. GarageArea = col_double(),
## .. GarageQual = col_character(),
## .. GarageCond = col_character(),
## .. PavedDrive = col_character(),
## .. WoodDeckSF = col_double(),
## .. OpenPorchSF = col_double(),
## .. EnclosedPorch = col_double(),
## .. `3SsnPorch` = col_double(),
## .. ScreenPorch = col_double(),
## .. PoolArea = col_double(),
## .. PoolQC = col_character(),
## .. Fence = col_character(),
## .. MiscFeature = col_character(),
## .. MiscVal = col_double(),
## .. MoSold = col_double(),
## .. YrSold = col_double(),
## .. SaleType = col_character(),
## .. SaleCondition = col_character(),
## .. SalePrice = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# shape of date
training_data_df_size <- dim(training_data_df)
print(paste('The size of the training data set is', training_data_df_size[1], 'by', training_data_df_size[2]))
## [1] "The size of the training data set is 1460 by 81"
# count number of missing data
print('Variables and the number of missing data:')
## [1] "Variables and the number of missing data:"
missing_values <- colSums(is.na(training_data_df)) # count missing data in each column
# create data frame of Variable and Count of missing data
missing_values_df <- tibble(
Variable = names(missing_values),
'Missing Value Count' = missing_values
)
print(missing_values_df)
## # A tibble: 81 × 2
## Variable `Missing Value Count`
## <chr> <dbl>
## 1 Id 0
## 2 MSSubClass 0
## 3 MSZoning 0
## 4 LotFrontage 259
## 5 LotArea 0
## 6 Street 0
## 7 Alley 1369
## 8 LotShape 0
## 9 LandContour 0
## 10 Utilities 0
## # ℹ 71 more rows
Plots: Histogram of numeric columns:
# Data-Frame of only numeric variables
training_data_numeric_df <- training_data_df %>%
dplyr::select(where(~is.numeric(.)))
# convert data to long
training_data_numeric_df_long <- training_data_numeric_df %>% pivot_longer(cols = everything(),
names_to = 'variable',
values_to = 'value')
# Plot histogram of all variables
training_data_numeric_df_long %>% ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = 'blue', color = 'black') +
facet_wrap(~ variable, scales = 'free') +
labs(title = 'Histogram of Quantitative Variables from Test Data') +
theme_minimal()
## Warning: Removed 348 rows containing non-finite values (`stat_bin()`).
Bar plot of categorical columns:
# Data-Frame of only numeric variables
training_data_categorical_df <- training_data_df %>%
dplyr::select(where(~is.character(.)))
# convert data to long
training_data_categorical_df_long <- training_data_categorical_df %>% pivot_longer(cols = everything(),
names_to = 'variable',
values_to = 'value')
# Plot histogram of all variables
training_data_categorical_df_long %>% ggplot(aes(x = value)) +
geom_bar(fill = 'blue', color = 'black') +
facet_wrap(~ variable, scales = 'free') +
labs(title = 'Distribution of Categorical Variables from Test Data') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_minimal()
Scatter plot of X and Y
# created a new data frame of LotFrontage and SalePrice because I replaced the missing data from LotFrontage with the mean
df <- tibble(LotFrontage = X, SalePrice = Y)
# plot scatter plot
df %>% ggplot(aes(x = LotFrontage, y = SalePrice)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = 'Scatter Plot of SalePrice vs LotFrontage') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
There is a positive relationship between LotFrontage and SalePrice.
However the larger the LotFrontage becomes the further the less dense
the data is in relation to the linear model drawn.
Provide a 95% CI for the difference in the mean of the variables.
CI <- t.test(X, Y, conf.level = 0.95)
print(paste('95% confidence interval:', round(CI$conf.int[1], 3), round(CI$conf.int[2], 3)))
## [1] "95% confidence interval: -184929.687 -176772.977"
This means that 95% of the difference in means between SalePrice and LotFrontage is between -184929.687 and 176772.977.
Derive a correlation matrix for two of the quantitative variables you selected:
correlation_matrix <- cor(df[, c('LotFrontage', 'SalePrice')])
print(correlation_matrix)
## LotFrontage SalePrice
## LotFrontage 1.0000000 0.3347709
## SalePrice 0.3347709 1.0000000
Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval
correlation_test <- cor.test(df$LotFrontage, df$SalePrice, conf.level = 0.99)
print(paste('p-value:', correlation_test$p.value))
## [1] "p-value: 1.44698460070571e-39"
print(paste('99% confidence interval:',
round(correlation_test$conf.int[1], 3),
round(correlation_test$conf.int[2], 3)))
## [1] "99% confidence interval: 0.274 0.393"
The p-value is less than 0.01 (significance level). Thus, we reject the null hypothesis that correlation between LotFrontage and SalePrice is 0. This is evident from the earlier correlation matrix showing that the correlation between these two variables is 0.335.
The 99% confidence interval of the correlation relationship is 0.274 and 0.393. This means that there is a 99% chance the correlation between LotFrontage and SalePrice will be between 0.274 and 0.393, which is the case (the correlation is 0.335)
Linear Algebra and Correlation Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
precision_matrix <- solve(correlation_matrix)
print(precision_matrix)
## LotFrontage SalePrice
## LotFrontage 1.1262168 -0.3770246
## SalePrice -0.3770246 1.1262168
Multiply the correlation matrix by the precision matrix
c_by_p <- correlation_matrix%*%precision_matrix
print(round(c_by_p), 3)
## LotFrontage SalePrice
## LotFrontage 1 0
## SalePrice 0 1
Multiply the precision matrix by the correlation matrix
p_by_c <- precision_matrix%*%correlation_matrix
print(round(p_by_c), 3)
## LotFrontage SalePrice
## LotFrontage 1 0
## SalePrice 0 1
Since the precision matrix is the inverse of the correlation matrix I would expect the result of the multiplication between them to result in the identity matrix, which it did.
Conduct principle components analysis Principle component analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while retaining the variability.
# PCA between LotFrontage and SalePrice
pca <- prcomp(df, scale. = TRUE)
# Summary of PCA
print('PCA summary:')
## [1] "PCA summary:"
print(summary(pca))
## Importance of components:
## PC1 PC2
## Standard deviation 1.1553 0.8156
## Proportion of Variance 0.6674 0.3326
## Cumulative Proportion 0.6674 1.0000
# PCA component
print('PCA loading:')
## [1] "PCA loading:"
print(pca$rotation)
## PC1 PC2
## LotFrontage 0.7071068 -0.7071068
## SalePrice 0.7071068 0.7071068
PC1 accounts for 66.7% of the variability of the transformed data set while PC2 accounts for 33.3% of the variability of of the transformed data set. LotFrontage and SalePrice both contribute equally to PC1, while for PC2, the both contribute equally but in opposite directions.
Calculus-Based Probability & Statistics Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero.
# check if minimum value of chosen X > 0
print(summary(X))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 60.00 69.00 69.86 79.00 313.00
The minimum value is greater than 0 so no shifting is needed.
Fit an exponential probability density function and find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\)))
# fit an exponential probability density funciton
exp_fit <- fitdistr(X, "exponential")
# optimal value of lambda
lambda <- exp_fit$estimate
print(paste("Optimal value of lambda:", lambda))
## [1] "Optimal value of lambda: 0.0143135851609298"
# 1000 samples from this exponential distribution
set.seed(42)
sample <- rexp(1000, lambda)
Plot a histogram and compare it with a histogram of your original variable
ggplot() +
geom_histogram(aes(X, fill = "Original Variable"), bins = 30, alpha = 0.5) +
geom_histogram(aes(sample, fill = "Exponential Sample"), bins = 30, alpha = 0.5) +
labs(title = "Histogram of Expnential Sample and Original Variable")
theme_minimal()
## List of 97
## $ line :List of 6
## ..$ colour : chr "black"
## ..$ linewidth : num 0.5
## ..$ linetype : num 1
## ..$ lineend : chr "butt"
## ..$ arrow : logi FALSE
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_line" "element"
## $ rect :List of 5
## ..$ fill : chr "white"
## ..$ colour : chr "black"
## ..$ linewidth : num 0.5
## ..$ linetype : num 1
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_rect" "element"
## $ text :List of 11
## ..$ family : chr ""
## ..$ face : chr "plain"
## ..$ colour : chr "black"
## ..$ size : num 11
## ..$ hjust : num 0.5
## ..$ vjust : num 0.5
## ..$ angle : num 0
## ..$ lineheight : num 0.9
## ..$ margin : 'margin' num [1:4] 0points 0points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : logi FALSE
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ title : NULL
## $ aspect.ratio : NULL
## $ axis.title : NULL
## $ axis.title.x :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 1
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 2.75points 0points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.title.x.top :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 0
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 2.75points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.title.x.bottom : NULL
## $ axis.title.y :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 1
## ..$ angle : num 90
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 2.75points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.title.y.left : NULL
## $ axis.title.y.right :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 0
## ..$ angle : num -90
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 0points 2.75points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "grey30"
## ..$ size : 'rel' num 0.8
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text.x :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 1
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 2.2points 0points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text.x.top :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 0
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 2.2points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text.x.bottom : NULL
## $ axis.text.y :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 1
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 2.2points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text.y.left : NULL
## $ axis.text.y.right :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 0
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 0points 2.2points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.ticks : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ axis.ticks.x : NULL
## $ axis.ticks.x.top : NULL
## $ axis.ticks.x.bottom : NULL
## $ axis.ticks.y : NULL
## $ axis.ticks.y.left : NULL
## $ axis.ticks.y.right : NULL
## $ axis.ticks.length : 'simpleUnit' num 2.75points
## ..- attr(*, "unit")= int 8
## $ axis.ticks.length.x : NULL
## $ axis.ticks.length.x.top : NULL
## $ axis.ticks.length.x.bottom: NULL
## $ axis.ticks.length.y : NULL
## $ axis.ticks.length.y.left : NULL
## $ axis.ticks.length.y.right : NULL
## $ axis.line : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ axis.line.x : NULL
## $ axis.line.x.top : NULL
## $ axis.line.x.bottom : NULL
## $ axis.line.y : NULL
## $ axis.line.y.left : NULL
## $ axis.line.y.right : NULL
## $ legend.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ legend.margin : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
## ..- attr(*, "unit")= int 8
## $ legend.spacing : 'simpleUnit' num 11points
## ..- attr(*, "unit")= int 8
## $ legend.spacing.x : NULL
## $ legend.spacing.y : NULL
## $ legend.key : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ legend.key.size : 'simpleUnit' num 1.2lines
## ..- attr(*, "unit")= int 3
## $ legend.key.height : NULL
## $ legend.key.width : NULL
## $ legend.text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : 'rel' num 0.8
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ legend.text.align : NULL
## $ legend.title :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 0
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ legend.title.align : NULL
## $ legend.position : chr "right"
## $ legend.direction : NULL
## $ legend.justification : chr "center"
## $ legend.box : NULL
## $ legend.box.just : NULL
## $ legend.box.margin : 'margin' num [1:4] 0cm 0cm 0cm 0cm
## ..- attr(*, "unit")= int 1
## $ legend.box.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ legend.box.spacing : 'simpleUnit' num 11points
## ..- attr(*, "unit")= int 8
## $ panel.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ panel.border : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ panel.spacing : 'simpleUnit' num 5.5points
## ..- attr(*, "unit")= int 8
## $ panel.spacing.x : NULL
## $ panel.spacing.y : NULL
## $ panel.grid :List of 6
## ..$ colour : chr "grey92"
## ..$ linewidth : NULL
## ..$ linetype : NULL
## ..$ lineend : NULL
## ..$ arrow : logi FALSE
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_line" "element"
## $ panel.grid.major : NULL
## $ panel.grid.minor :List of 6
## ..$ colour : NULL
## ..$ linewidth : 'rel' num 0.5
## ..$ linetype : NULL
## ..$ lineend : NULL
## ..$ arrow : logi FALSE
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_line" "element"
## $ panel.grid.major.x : NULL
## $ panel.grid.major.y : NULL
## $ panel.grid.minor.x : NULL
## $ panel.grid.minor.y : NULL
## $ panel.ontop : logi FALSE
## $ plot.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ plot.title :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : 'rel' num 1.2
## ..$ hjust : num 0
## ..$ vjust : num 1
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 5.5points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ plot.title.position : chr "panel"
## $ plot.subtitle :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 0
## ..$ vjust : num 1
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 5.5points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ plot.caption :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : 'rel' num 0.8
## ..$ hjust : num 1
## ..$ vjust : num 1
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 5.5points 0points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ plot.caption.position : chr "panel"
## $ plot.tag :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : 'rel' num 1.2
## ..$ hjust : num 0.5
## ..$ vjust : num 0.5
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ plot.tag.position : chr "topleft"
## $ plot.margin : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
## ..- attr(*, "unit")= int 8
## $ strip.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ strip.background.x : NULL
## $ strip.background.y : NULL
## $ strip.clip : chr "inherit"
## $ strip.placement : chr "inside"
## $ strip.text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "grey10"
## ..$ size : 'rel' num 0.8
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 4.4points 4.4points 4.4points 4.4points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ strip.text.x : NULL
## $ strip.text.x.bottom : NULL
## $ strip.text.x.top : NULL
## $ strip.text.y :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : num -90
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ strip.text.y.left :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : num 90
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ strip.text.y.right : NULL
## $ strip.switch.pad.grid : 'simpleUnit' num 2.75points
## ..- attr(*, "unit")= int 8
## $ strip.switch.pad.wrap : 'simpleUnit' num 2.75points
## ..- attr(*, "unit")= int 8
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi TRUE
## - attr(*, "validate")= logi TRUE
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF)
exp_percentile <- quantile(sample, c(0.05, 0.95))
print(paste('The 5th percentile:', round(exp_percentile[1], 3)))
## [1] "The 5th percentile: 3.429"
print(paste('The 95th percentile:', round(exp_percentile[2], 3)))
## [1] "The 95th percentile: 225.481"
Generate a 95% confidence interval from the empirical data, assuming normality
# 95% confidence interval
CI <- t.test(X)
print(paste("95% confidence interval of X:", round(CI$conf.int[1], 3), round(CI$conf.int[2], 3)))
## [1] "95% confidence interval of X: 68.733 70.995"
Thus, there is a 95% confident that the mean of X falls between 68.733, and 70.995.
Checking:
mean_X <- mean(X)
print(paste("Mean of X:", round(mean_X, 3)))
## [1] "Mean of X: 69.864"
The mean of X (69.864) falls between 68.733 and 70.995.
The empirical 5th percentile and 95th percentile of the data
percentile <- quantile(X, probs = c(0.05, 0.95))
print(paste("The 5th percentile:", percentile[1]))
## [1] "The 5th percentile: 35.95"
print(paste("The 95the percentile:", percentile[2]))
## [1] "The 95the percentile: 104"
Modeling: Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Going to use data frame with only numeric columns.
# tidy data frame
# remove columns with more than half the data missing
# identify those columns
columns_to_remove <-
colnames(training_data_numeric_df)[colSums(is.na(training_data_numeric_df)) > nrow(training_data_numeric_df)/2]
# remove those columns
new_train_df <- training_data_numeric_df %>%
dplyr::select(-all_of(columns_to_remove)) %>%
rename_all(make.names)
# replace missing data
# perform imputation
imputed_object <- mice(new_train_df, m = 1, maxit = 50, meth = "pmm", seed = 42)
##
## iter imp variable
## 1 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 6 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 7 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 8 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 9 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 10 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 11 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 12 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 13 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 14 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 15 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 16 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 17 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 18 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 19 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 20 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 21 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 22 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 23 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 24 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 25 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 26 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 27 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 28 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 29 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 30 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 31 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 32 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 33 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 34 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 35 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 36 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 37 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 38 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 39 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 40 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 41 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 42 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 43 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 44 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 45 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 46 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 47 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 48 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 49 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 50 1 LotFrontage* MasVnrArea* GarageYrBlt*
## Warning: Number of logged events: 300
imputed_train_df <- complete(imputed_object, 1) %>%
dplyr::select(-c(Id))
Create a linear regression model and using manual elimination optimize the model
model <- lm(SalePrice ~., data = imputed_train_df)
model_summary <- summary(model)
print(model_summary)
##
## Call:
## lm(formula = SalePrice ~ ., data = imputed_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470962 -16652 -2096 13593 303914
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.864e+05 1.413e+06 0.273 0.784608
## MSSubClass -1.767e+02 2.647e+01 -6.677 3.49e-11 ***
## LotFrontage -5.144e+01 1.908e+01 -2.696 0.007094 **
## LotArea 4.302e-01 1.012e-01 4.251 2.26e-05 ***
## OverallQual 1.722e+04 1.188e+03 14.495 < 2e-16 ***
## OverallCond 4.563e+03 1.028e+03 4.440 9.69e-06 ***
## YearBuilt 3.615e+02 7.070e+01 5.113 3.60e-07 ***
## YearRemodAdd 1.704e+02 6.786e+01 2.512 0.012129 *
## MasVnrArea 3.087e+01 5.934e+00 5.203 2.25e-07 ***
## BsmtFinSF1 1.864e+01 4.660e+00 4.000 6.67e-05 ***
## BsmtFinSF2 7.528e+00 7.046e+00 1.068 0.285480
## BsmtUnfSF 8.797e+00 4.194e+00 2.097 0.036134 *
## TotalBsmtSF NA NA NA NA
## FirstFlrSF 4.846e+01 5.786e+00 8.376 < 2e-16 ***
## SecondFlrSF 4.845e+01 4.959e+00 9.770 < 2e-16 ***
## LowQualFinSF 3.343e+01 1.999e+01 1.672 0.094677 .
## GrLivArea NA NA NA NA
## BsmtFullBath 9.385e+03 2.610e+03 3.595 0.000335 ***
## BsmtHalfBath 1.500e+03 4.088e+03 0.367 0.713745
## FullBath 4.254e+03 2.819e+03 1.509 0.131433
## HalfBath -1.557e+03 2.661e+03 -0.585 0.558395
## BedroomAbvGr -1.012e+04 1.697e+03 -5.963 3.12e-09 ***
## KitchenAbvGr -1.259e+04 5.199e+03 -2.422 0.015559 *
## TotRmsAbvGrd 5.049e+03 1.235e+03 4.087 4.62e-05 ***
## Fireplaces 3.941e+03 1.783e+03 2.211 0.027208 *
## GarageYrBlt -5.215e+01 7.547e+01 -0.691 0.489684
## GarageCars 1.071e+04 2.850e+03 3.759 0.000178 ***
## GarageArea 9.382e-01 1.006e+01 0.093 0.925700
## WoodDeckSF 2.730e+01 7.992e+00 3.416 0.000654 ***
## OpenPorchSF 2.817e-01 1.515e+01 0.019 0.985162
## EnclosedPorch 1.374e+01 1.687e+01 0.814 0.415672
## X3SsnPorch 2.116e+01 3.134e+01 0.675 0.499592
## ScreenPorch 5.401e+01 1.717e+01 3.145 0.001697 **
## PoolArea -3.298e+01 2.359e+01 -1.398 0.162252
## MiscVal -4.047e-01 1.856e+00 -0.218 0.827430
## MoSold -6.304e+01 3.444e+02 -0.183 0.854782
## YrSold -6.922e+02 7.024e+02 -0.985 0.324559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34710 on 1425 degrees of freedom
## Multiple R-squared: 0.8135, Adjusted R-squared: 0.8091
## F-statistic: 182.9 on 34 and 1425 DF, p-value: < 2.2e-16
There are a few variables with high p_value (p_value > 0.05). These variables will be removed to reduce the risk of over-fitting.
# create a data frame containing the variables and their associated p-value
p_values <- coef(model_summary)[, "Pr(>|t|)"]
p_value_df <- tibble(
Predictors = names(p_values),
p_value = p_values
)
# indicate rows with p-value < 0.05
signifcant_predictor_df <- p_value_df %>%
filter(p_value <= 0.05)
# redo linear regression taking out non-significant predictors
updated_predictors_formula <- paste("SalePrice ~ ",
paste(signifcant_predictor_df$Predictors, collapse = " + "))
updated_model <- lm(as.formula(updated_predictors_formula), data = imputed_train_df)
updated_model_summary <- summary(updated_model)
print(updated_model_summary)
##
## Call:
## lm(formula = as.formula(updated_predictors_formula), data = imputed_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -486813 -16571 -1918 13768 287972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.048e+06 1.156e+05 -9.064 < 2e-16 ***
## MSSubClass -1.708e+02 2.596e+01 -6.580 6.58e-11 ***
## LotFrontage -5.226e+01 1.891e+01 -2.763 0.005799 **
## LotArea 4.505e-01 1.003e-01 4.494 7.56e-06 ***
## OverallQual 1.767e+04 1.167e+03 15.144 < 2e-16 ***
## OverallCond 4.340e+03 1.005e+03 4.319 1.67e-05 ***
## YearBuilt 3.248e+02 5.192e+01 6.257 5.17e-10 ***
## YearRemodAdd 1.780e+02 6.476e+01 2.749 0.006044 **
## MasVnrArea 3.025e+01 5.863e+00 5.161 2.81e-07 ***
## BsmtFinSF1 1.567e+01 3.901e+00 4.018 6.19e-05 ***
## BsmtUnfSF 6.494e+00 3.627e+00 1.790 0.073587 .
## FirstFlrSF 5.223e+01 5.101e+00 10.238 < 2e-16 ***
## SecondFlrSF 4.794e+01 4.096e+00 11.706 < 2e-16 ***
## BsmtFullBath 9.191e+03 2.397e+03 3.834 0.000131 ***
## BedroomAbvGr -9.569e+03 1.661e+03 -5.762 1.02e-08 ***
## KitchenAbvGr -1.288e+04 5.051e+03 -2.549 0.010892 *
## TotRmsAbvGrd 5.290e+03 1.208e+03 4.378 1.28e-05 ***
## Fireplaces 3.777e+03 1.747e+03 2.162 0.030786 *
## GarageCars 1.067e+04 1.687e+03 6.323 3.41e-10 ***
## WoodDeckSF 2.610e+01 7.868e+00 3.317 0.000933 ***
## ScreenPorch 5.111e+01 1.684e+01 3.034 0.002455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34680 on 1439 degrees of freedom
## Multiple R-squared: 0.812, Adjusted R-squared: 0.8094
## F-statistic: 310.8 on 20 and 1439 DF, p-value: < 2.2e-16
The p-values for the remaining predictors are all less than 0.05. Signifying that they are all significant.Note the r-squared value didn’t change much.
Plot the model to verify that the conditons for a reasonable model are met.
plot(updated_model)
Residuals vs Fitted: Most of the data is about the
horizontal line at zero, this indicates linearity. However, there is a
slight curve to the distribution of data indication it is moderately
linear.
Q-Q Residuals: Most of the data is is distributed on a
straight line. At the ends it curves up at the positive end and down at
the negative indicating that the distribution is a Laplace
distribution
Predict the Test Set Using the Final Model Since my model was done using only numeric columns I will perform the prediction on the numeric columns. Also remove Id columns
test_data_numeric_df <- test_data_df %>%
dplyr::select(-Id)%>%
dplyr::select(where(is.numeric))
# Identify columns with missing data
missing_data_columns <- colnames(test_data_numeric_df)[colSums(is.na(test_data_numeric_df)) > 0]
# Calculate the median for each column with missing data and replace missing values
for (col in missing_data_columns) {
median_value <- median(test_data_numeric_df[[col]], na.rm = TRUE)
test_data_numeric_df[[col]][is.na(test_data_numeric_df[[col]])] <- median_value
}
# perform predictions
predictions <- predict(updated_model, test_data_numeric_df)
prediction_df <- data.frame(Id = test_data_df$Id, SalePrice = predictions)
# save to csv for submision
write.csv(prediction_df, "FA_LF_Final.csv", row.names = FALSE)
**Kaggle Submission:* My Kaggle account user name is FolushoAta and my score for this assignment is 0.23010. Note I only used the numeric columns in both the training and test data set. I tried to include the categorical columns but had issue imputing the missing data in the categorical columns.