Generating random numbers…
# Set seed and define N, n
N <- 6
n <- 10000
# Random uniform numbers
X <- runif(n, 1, N)
# Random normal numbers
u <- (N + 1)/2
Y <- rnorm(n, mean=u, sd = u)
Defining x and y…
x <- quantile(X, 0.5)
y <- quantile(Y, 0.25)
We are going to use the following probabilities formulas:
P( A/B) = ; P( A,B) = P( A) *P( B)
Let P( X >x|X >y) be p1
p1 <- (length(X[X>x & X>y]) / n) / (length(X[X>y]) / n)
paste0("The probability is: ", round(p1, 4))
## [1] "The probability is: 0.5164"
Let P( X >x,Y>y) be p2
p2 <- (length(X[X>x]) / n) * (length(Y[Y>y]) / n)
paste0("The probability is: ", round(p2, 4))
## [1] "The probability is: 0.375"
Let P( X <x| X>y) be p3
p3 <- (length(X[X<x & X>y]) / n) / (length(X[X>y]) / n)
paste0("The probability is: ", round(p3, 4))
## [1] "The probability is: 0.4836"
Let investigate P( X >x and Y >y) = P( X >x) P( Y >y)
Building a probability table…
df <- data.frame(c(length(X[X<x])/n * length(Y[Y<y])/n, length(X[X<x])/n * length(Y[Y>y])/n),
c(length(X[X>x])/n * length(Y[Y<y])/n, length(X[X>x])/n * length(Y[Y>y])/n))
colnames(df) <- c("X<x", "X>x")
# Add the total colums
df$total <- df$`X<x` + df$`X>x`
# Add rows
new_row <- c(sum(df$`X<x`), sum(df$`X>x`), sum(df$total))
df1 <- rbind(df, new_row)
rownames(df1) <- c("Y<y", "Y>y", "total")
df1
## X<x X>x total
## Y<y 0.125 0.125 0.25
## Y>y 0.375 0.375 0.75
## total 0.500 0.500 1.00
From the table, we are going to evaluate the joint and marginal probabilities.
Joint probability:
joint_p <- df1[2,2]
paste0("P(X>x and Y>y) is : ", joint_p )
## [1] "P(X>x and Y>y) is : 0.375"
Marginal probability:
marginal_p <- df1[3,2]*df1[2,3]
paste0("P(X>x)P(Y>y) is : ", marginal_p )
## [1] "P(X>x)P(Y>y) is : 0.375"
Let check now if P(X>x and Y>y) = P(X>x)P(Y>y)
joint_p==marginal_p
## [1] TRUE
As proven above, we can conclude that P(X>x and Y>y) = P(X>x)P(Y>y)
Fisher’s Exact Test and Chi Square Test
Let build our matrix again
gen_matrix <- matrix(c(length(X[X<x])/n * length(Y[Y<y])/n, length(X[X<x])/n * length(Y[Y>y])/n,
length(X[X>x])/n * length(Y[Y<y])/n, length(X[X>x])/n * length(Y[Y>y])/n), nrow = 2)
Fisher’s Exact Test…
fisher.test(gen_matrix*n)
##
## Fisher's Exact Test for Count Data
##
## data: gen_matrix * n
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9124854 1.0959080
## sample estimates:
## odds ratio
## 1
Chi Square Test…
chisq.test(gen_matrix*n)
##
## Pearson's Chi-squared test
##
## data: gen_matrix * n
## X-squared = 0, df = 1, p-value = 1
Both the Fisher’s Exact Test and Chi Square Test give us a p-value > 0.05, that’s it, we fail to reject to the null hypothesis: The relative proportion of one variable are independent of the second variable.
Furthermore, the difference is that Fisher’s Exact Test is used for a small sample size and Chi Square Test is suitable for bigger sample size.
That’s said, It will be more appropriate to use Chi Square Test in this case where the sample size is bigger.
Libraries
library(tidyverse)
library(corrplot)
library(Hmisc)
library(MASS)
library(pracma)
library(matrixcalc)
library(fitdistrplus)
library(Rmisc)
library(funModeling)
Get the data
data <- read.csv("https://raw.githubusercontent.com/jnataky/Computational_Mathematics/main/Assignments/train.csv")
glimpse(data)
## Rows: 1,460
## Columns: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,~
## $ MSZoning <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "R~
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, ~
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612~
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ~
## $ Alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ LotShape <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1", ~
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", ~
## $ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu~
## $ LotConfig <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Inside", "I~
## $ LandSlope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", ~
## $ Neighborhood <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "~
## $ Condition1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm",~
## $ Condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ~
## $ BldgType <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ~
## $ HouseStyle <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fi~
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,~
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,~
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19~
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19~
## $ RoofStyle <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "G~
## $ RoofMatl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "~
## $ Exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "~
## $ Exterior2nd <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "VinylSd", "~
## $ MasVnrType <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace", "None",~
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, ~
## $ ExterQual <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ ExterCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ Foundation <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "Wood", "~
## $ BsmtQual <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", "TA", "T~
## $ BsmtCond <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "T~
## $ BsmtExposure <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", "No", "N~
## $ BsmtFinType1 <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ", ~
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99~
## $ BsmtFinType2 <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "BLQ", ~
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17~
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10~
## $ Heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ~
## $ HeatingQC <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", "Gd", "E~
## $ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ Electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S~
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, ~
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,~
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10~
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,~
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,~
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,~
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,~
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,~
## $ KitchenQual <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6~
## $ Functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", ~
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,~
## $ FireplaceQu <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA", "TA", ~
## $ GarageType <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attch~
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19~
## $ GarageFinish <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "RFn", ~
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,~
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7~
## $ GarageQual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "Fa", "G~
## $ GarageCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ PavedDrive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160~
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,~
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, ~
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, ~
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ PoolQC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Fence <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, NA,~
## $ MiscFeature <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, NA, NA, ~
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,~
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10~
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20~
## $ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W~
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm~
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, 307000, ~
We are going go skim the quantitative variables
data %>%
select_if(is.numeric) %>%
skimr::skim()
| Name | Piped data |
| Number of rows | 1460 |
| Number of columns | 38 |
| _______________________ | |
| Column type frequency: | |
| numeric | 38 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 0 | 1.00 | 730.50 | 421.61 | 1 | 365.75 | 730.5 | 1095.25 | 1460 | ▇▇▇▇▇ |
| MSSubClass | 0 | 1.00 | 56.90 | 42.30 | 20 | 20.00 | 50.0 | 70.00 | 190 | ▇▅▂▁▁ |
| LotFrontage | 259 | 0.82 | 70.05 | 24.28 | 21 | 59.00 | 69.0 | 80.00 | 313 | ▇▃▁▁▁ |
| LotArea | 0 | 1.00 | 10516.83 | 9981.26 | 1300 | 7553.50 | 9478.5 | 11601.50 | 215245 | ▇▁▁▁▁ |
| OverallQual | 0 | 1.00 | 6.10 | 1.38 | 1 | 5.00 | 6.0 | 7.00 | 10 | ▁▂▇▅▁ |
| OverallCond | 0 | 1.00 | 5.58 | 1.11 | 1 | 5.00 | 5.0 | 6.00 | 9 | ▁▁▇▅▁ |
| YearBuilt | 0 | 1.00 | 1971.27 | 30.20 | 1872 | 1954.00 | 1973.0 | 2000.00 | 2010 | ▁▂▃▆▇ |
| YearRemodAdd | 0 | 1.00 | 1984.87 | 20.65 | 1950 | 1967.00 | 1994.0 | 2004.00 | 2010 | ▅▂▂▃▇ |
| MasVnrArea | 8 | 0.99 | 103.69 | 181.07 | 0 | 0.00 | 0.0 | 166.00 | 1600 | ▇▁▁▁▁ |
| BsmtFinSF1 | 0 | 1.00 | 443.64 | 456.10 | 0 | 0.00 | 383.5 | 712.25 | 5644 | ▇▁▁▁▁ |
| BsmtFinSF2 | 0 | 1.00 | 46.55 | 161.32 | 0 | 0.00 | 0.0 | 0.00 | 1474 | ▇▁▁▁▁ |
| BsmtUnfSF | 0 | 1.00 | 567.24 | 441.87 | 0 | 223.00 | 477.5 | 808.00 | 2336 | ▇▅▂▁▁ |
| TotalBsmtSF | 0 | 1.00 | 1057.43 | 438.71 | 0 | 795.75 | 991.5 | 1298.25 | 6110 | ▇▃▁▁▁ |
| X1stFlrSF | 0 | 1.00 | 1162.63 | 386.59 | 334 | 882.00 | 1087.0 | 1391.25 | 4692 | ▇▅▁▁▁ |
| X2ndFlrSF | 0 | 1.00 | 346.99 | 436.53 | 0 | 0.00 | 0.0 | 728.00 | 2065 | ▇▃▂▁▁ |
| LowQualFinSF | 0 | 1.00 | 5.84 | 48.62 | 0 | 0.00 | 0.0 | 0.00 | 572 | ▇▁▁▁▁ |
| GrLivArea | 0 | 1.00 | 1515.46 | 525.48 | 334 | 1129.50 | 1464.0 | 1776.75 | 5642 | ▇▇▁▁▁ |
| BsmtFullBath | 0 | 1.00 | 0.43 | 0.52 | 0 | 0.00 | 0.0 | 1.00 | 3 | ▇▆▁▁▁ |
| BsmtHalfBath | 0 | 1.00 | 0.06 | 0.24 | 0 | 0.00 | 0.0 | 0.00 | 2 | ▇▁▁▁▁ |
| FullBath | 0 | 1.00 | 1.57 | 0.55 | 0 | 1.00 | 2.0 | 2.00 | 3 | ▁▇▁▇▁ |
| HalfBath | 0 | 1.00 | 0.38 | 0.50 | 0 | 0.00 | 0.0 | 1.00 | 2 | ▇▁▅▁▁ |
| BedroomAbvGr | 0 | 1.00 | 2.87 | 0.82 | 0 | 2.00 | 3.0 | 3.00 | 8 | ▁▇▂▁▁ |
| KitchenAbvGr | 0 | 1.00 | 1.05 | 0.22 | 0 | 1.00 | 1.0 | 1.00 | 3 | ▁▇▁▁▁ |
| TotRmsAbvGrd | 0 | 1.00 | 6.52 | 1.63 | 2 | 5.00 | 6.0 | 7.00 | 14 | ▂▇▇▁▁ |
| Fireplaces | 0 | 1.00 | 0.61 | 0.64 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▁▁ |
| GarageYrBlt | 81 | 0.94 | 1978.51 | 24.69 | 1900 | 1961.00 | 1980.0 | 2002.00 | 2010 | ▁▁▅▅▇ |
| GarageCars | 0 | 1.00 | 1.77 | 0.75 | 0 | 1.00 | 2.0 | 2.00 | 4 | ▁▃▇▂▁ |
| GarageArea | 0 | 1.00 | 472.98 | 213.80 | 0 | 334.50 | 480.0 | 576.00 | 1418 | ▂▇▃▁▁ |
| WoodDeckSF | 0 | 1.00 | 94.24 | 125.34 | 0 | 0.00 | 0.0 | 168.00 | 857 | ▇▂▁▁▁ |
| OpenPorchSF | 0 | 1.00 | 46.66 | 66.26 | 0 | 0.00 | 25.0 | 68.00 | 547 | ▇▁▁▁▁ |
| EnclosedPorch | 0 | 1.00 | 21.95 | 61.12 | 0 | 0.00 | 0.0 | 0.00 | 552 | ▇▁▁▁▁ |
| X3SsnPorch | 0 | 1.00 | 3.41 | 29.32 | 0 | 0.00 | 0.0 | 0.00 | 508 | ▇▁▁▁▁ |
| ScreenPorch | 0 | 1.00 | 15.06 | 55.76 | 0 | 0.00 | 0.0 | 0.00 | 480 | ▇▁▁▁▁ |
| PoolArea | 0 | 1.00 | 2.76 | 40.18 | 0 | 0.00 | 0.0 | 0.00 | 738 | ▇▁▁▁▁ |
| MiscVal | 0 | 1.00 | 43.49 | 496.12 | 0 | 0.00 | 0.0 | 0.00 | 15500 | ▇▁▁▁▁ |
| MoSold | 0 | 1.00 | 6.32 | 2.70 | 1 | 5.00 | 6.0 | 8.00 | 12 | ▃▆▇▃▃ |
| YrSold | 0 | 1.00 | 2007.82 | 1.33 | 2006 | 2007.00 | 2008.0 | 2009.00 | 2010 | ▇▇▇▇▅ |
| SalePrice | 0 | 1.00 | 180921.20 | 79442.50 | 34900 | 129975.00 | 163000.0 | 214000.00 | 755000 | ▇▅▁▁▁ |
We are going to consider few variables and study their distribution: SalePrice, YearBuilt, OverallQual,TotRmsAbvGrd, LotArea
ggplot(data=data, aes(x=SalePrice)) +
geom_histogram( fill = 'magenta') +
labs(title = "Houses sale price distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data, aes(x=YearBuilt)) +
geom_histogram(fill = "blue") +
labs(x = "Year built", title = "Houses year built distribution ")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = data, aes(x=factor(OverallQual), y = SalePrice))+
geom_boxplot() +
labs(x ="Overall quality", title=" Houses sale price by Overall quality")
ggplot(data = data, aes(x=factor(TotRmsAbvGrd), y = SalePrice)) +
geom_boxplot() +
labs(x ="Rooms", title="Houses sale price by bedrooms ")
ggplot(data = data, aes(x=LotArea, y=SalePrice)) +
geom_point(color = "green") +
labs(x="Lot Area", title ="Sale price by lot area")
ggplot(data = data, aes (x = YearBuilt, y=SalePrice)) +
geom_point(color = 'blue') +
labs(x='Year built', y='Sale price', title='Sale price over years')
We are going to consider same variables for the scatter plot matrix.
pairs(~SalePrice + YearBuilt + OverallQual + TotRmsAbvGrd + LotArea,
data = data)
We are going to consider three of our previous variables to build a correlation matrix
data1 <- data %>%
dplyr::select(OverallQual, TotRmsAbvGrd, LotArea)
res <- cor(data1)
res
## OverallQual TotRmsAbvGrd LotArea
## OverallQual 1.0000000 0.4274523 0.1058057
## TotRmsAbvGrd 0.4274523 1.0000000 0.1900148
## LotArea 0.1058057 0.1900148 1.0000000
data1 %>%
cor() %>%
corrplot()
cor.test(data$OverallQual, data$TotRmsAbvGrd, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: data$OverallQual and data$TotRmsAbvGrd
## t = 18.054, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3996237 0.4544938
## sample estimates:
## cor
## 0.4274523
p_value less than 0.05: Reject the null hypothesis.
We are 80% confident that the correlation is between 0.3996 and 0.4545
cor.test(data$OverallQual, data$LotArea, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: data$OverallQual and data$LotArea
## t = 4.0629, df = 1458, p-value = 5.106e-05
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.07250156 0.13887424
## sample estimates:
## cor
## 0.1058057
p_value less than 0.05: Reject the null hypothesis.
We are 80% confident that the correlation is between 0.0725 and 0.1389
cor.test(data$TotRmsAbvGrd, data$LotArea, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: data$TotRmsAbvGrd and data$LotArea
## t = 7.3901, df = 1458, p-value = 2.461e-13
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1574573 0.2221597
## sample estimates:
## cor
## 0.1900148
p_value less than 0.05: Reject the null hypothesis.
We are 80% confident that the correlation is between 0.1575 and 0.2222
In this case, as we are not performing multiple hypothesis test at once, I wouldn’t worry much about familywise error. I would consider each test as dependant and having the type error rate equal to the significance level of 0.05.
We are going o use ginv function from MASS package to invert the correlation matrix
inv_res <-round(ginv(res), 3)
inv_res
## [,1] [,2] [,3]
## [1,] 1.225 -0.517 -0.031
## [2,] -0.517 1.256 -0.184
## [3,] -0.031 -0.184 1.038
m1 <- round(res%*%inv_res,2)
m1
## [,1] [,2] [,3]
## OverallQual 1 0 0
## TotRmsAbvGrd 0 1 0
## LotArea 0 0 1
m2 <- round(inv_res%*%res, 2)
m2
## OverallQual TotRmsAbvGrd LotArea
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
lu.decomposition function from matrixcalc will be used here to conduct LU decomposition on the matrix.
lu.decomposition(res)
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.4274523 1.0000000 0
## [3,] 0.1058057 0.1771572 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.4274523 0.1058057
## [2,] 0 0.8172845 0.1447879
## [3,] 0 0.0000000 0.9631549
ggplot(data = data, aes(x=X1stFlrSF)) +
geom_histogram(fill = 'gold', bins = 20) +
labs(title ='First floor square feet distribution' )
skimr::skim(data$X1stFlrSF)
| Name | data$X1stFlrSF |
| Number of rows | 1460 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 1162.63 | 386.59 | 334 | 882 | 1087 | 1391.25 | 4692 | ▇▅▁▁▁ |
x1_exp <-fitdistr(data$X1stFlrSF, densfun = 'exponential')
lambda <- x1_exp$estimate
lambda
## rate
## 0.0008601213
n <- 1000
x1_exp_dist <- rexp(n, lambda)
par(mfrow=c(1, 2))
hist(x1_exp_dist, col = 'blue', main = 'Histogram with exp prob densfunc', xlab ='1st floor exp dist')
hist(data$X1stFlrSF, col = 'gold', xlab = 'Fist floor sq', main='Histogram prior fitting exp funct')
The exponential probability density function transformed the first histogram to a clean right skewed histogram where we can see at first glance its exponential distribution.
x1_exp_dist_cum <- plotdist(x1_exp_dist)
quantile(x1_exp_dist, c(0.05, 0.95))
## 5% 95%
## 43.20805 3633.01347
CI(data$X1stFlrSF, ci = 0.95)
## upper mean lower
## 1182.473 1162.627 1142.780
We are 95% percent that the mean of first floor sq of houses in Boston falls in the interval (1142.780, 1182.473)
5th and 95th percentile of empirical data
quantile(data$X1stFlrSF, c(0.05, 0.95))
## 5% 95%
## 672.95 1831.25
Regarding the houses first floor square ft from data collected,
5% of first floor square ft is 672.95, The mean is between 1142.780 and 1182.473 with 95% of confidence and 95% is 1831.25
As the distribution is right skewed, there is more houses with first floor square ft above the mean of 1162.627
To build our model, we are going to have few steps that would help us to build the model that would fit the data better. This will include for example finding some variables of interest, those are variables which would contribute significantly to our model. One thing to mention is that we are not going to be very deep in this process. Elements such as optimization and tuning won’t be within our scope for this stage.
Build Build a correlation table to seek for some variables of interest
correlation_table(data = data, target = "SalePrice")
## Variable SalePrice
## 1 SalePrice 1.00
## 2 OverallQual 0.80
## 3 GrLivArea 0.71
## 4 GarageCars 0.65
## 5 TotalBsmtSF 0.62
## 6 GarageArea 0.62
## 7 X1stFlrSF 0.61
## 8 FullBath 0.57
## 9 TotRmsAbvGrd 0.55
## 10 YearBuilt 0.53
## 11 YearRemodAdd 0.52
## 12 GarageYrBlt 0.50
## 13 MasVnrArea 0.49
## 14 Fireplaces 0.46
## 15 BsmtFinSF1 0.39
## 16 LotFrontage 0.34
## 17 WoodDeckSF 0.34
## 18 OpenPorchSF 0.34
## 19 X2ndFlrSF 0.31
## 20 LotArea 0.30
## 21 HalfBath 0.27
## 22 BsmtFullBath 0.24
## 23 BsmtUnfSF 0.21
## 24 BedroomAbvGr 0.17
## 25 ScreenPorch 0.11
## 26 PoolArea 0.09
## 27 MoSold 0.05
## 28 X3SsnPorch 0.03
## 29 LowQualFinSF 0.00
## 30 YrSold -0.01
## 31 BsmtFinSF2 -0.03
## 32 BsmtHalfBath -0.04
## 33 MiscVal -0.04
## 34 Id -0.05
## 35 MSSubClass -0.09
## 36 OverallCond -0.12
## 37 KitchenAbvGr -0.14
## 38 EnclosedPorch -0.15
As we can see, they are few variables that are highly correlated to the target, and we are going to consider those first while building our model.
data1 <- data[, c('SalePrice', 'OverallQual','GrLivArea', 'GarageCars', 'TotalBsmtSF', 'X2ndFlrSF', 'LotArea', 'BedroomAbvGr', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'Neighborhood', 'HouseStyle')]
lm1 <- lm(SalePrice~., data = data1)
summary(lm1)
##
## Call:
## lm(formula = SalePrice ~ ., data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -398905 -14372 -59 13885 262903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.525e+05 1.623e+05 -5.869 5.46e-09 ***
## OverallQual 1.581e+04 1.175e+03 13.451 < 2e-16 ***
## GrLivArea 4.008e+01 5.391e+00 7.436 1.79e-13 ***
## GarageCars 1.068e+04 1.683e+03 6.349 2.91e-10 ***
## TotalBsmtSF 1.854e+01 3.895e+00 4.759 2.15e-06 ***
## X2ndFlrSF 3.145e+01 7.444e+00 4.225 2.54e-05 ***
## LotArea 6.383e-01 1.029e-01 6.202 7.30e-10 ***
## BedroomAbvGr -7.730e+03 1.675e+03 -4.614 4.31e-06 ***
## TotRmsAbvGrd 4.251e+03 1.181e+03 3.599 0.000330 ***
## YearBuilt 1.717e+02 7.069e+01 2.429 0.015276 *
## YearRemodAdd 2.713e+02 5.930e+01 4.575 5.17e-06 ***
## NeighborhoodBlueste 1.007e+04 2.546e+04 0.396 0.692511
## NeighborhoodBrDale 1.051e+04 1.244e+04 0.844 0.398592
## NeighborhoodBrkSide 3.181e+04 1.065e+04 2.987 0.002869 **
## NeighborhoodClearCr 3.335e+04 1.115e+04 2.991 0.002826 **
## NeighborhoodCollgCr 2.511e+04 8.867e+03 2.832 0.004686 **
## NeighborhoodCrawfor 4.963e+04 1.041e+04 4.768 2.05e-06 ***
## NeighborhoodEdwards 1.440e+04 9.707e+03 1.483 0.138251
## NeighborhoodGilbert 1.944e+04 9.400e+03 2.068 0.038805 *
## NeighborhoodIDOTRR 1.727e+04 1.123e+04 1.538 0.124183
## NeighborhoodMeadowV 1.959e+04 1.248e+04 1.570 0.116687
## NeighborhoodMitchel 1.641e+04 9.914e+03 1.656 0.098030 .
## NeighborhoodNAmes 2.244e+04 9.267e+03 2.421 0.015589 *
## NeighborhoodNoRidge 7.821e+04 1.027e+04 7.612 4.89e-14 ***
## NeighborhoodNPkVill 1.370e+04 1.422e+04 0.963 0.335513
## NeighborhoodNridgHt 7.329e+04 9.176e+03 7.987 2.83e-15 ***
## NeighborhoodNWAmes 1.727e+04 9.565e+03 1.805 0.071253 .
## NeighborhoodOldTown 1.079e+04 1.034e+04 1.044 0.296799
## NeighborhoodSawyer 2.195e+04 9.794e+03 2.241 0.025180 *
## NeighborhoodSawyerW 2.104e+04 9.592e+03 2.194 0.028408 *
## NeighborhoodSomerst 3.394e+04 9.167e+03 3.702 0.000222 ***
## NeighborhoodStoneBr 7.675e+04 1.074e+04 7.148 1.40e-12 ***
## NeighborhoodSWISU 2.090e+04 1.220e+04 1.714 0.086829 .
## NeighborhoodTimber 3.252e+04 1.018e+04 3.195 0.001429 **
## NeighborhoodVeenker 5.402e+04 1.326e+04 4.074 4.87e-05 ***
## HouseStyle1.5Unf 1.349e+04 9.970e+03 1.353 0.176266
## HouseStyle1Story 2.321e+04 4.659e+03 4.981 7.11e-07 ***
## HouseStyle2.5Fin -1.535e+04 1.329e+04 -1.155 0.248198
## HouseStyle2.5Unf -2.642e+04 1.097e+04 -2.408 0.016150 *
## HouseStyle2Story -9.881e+03 4.098e+03 -2.411 0.016032 *
## HouseStyleSFoyer 2.414e+04 7.235e+03 3.336 0.000871 ***
## HouseStyleSLvl 1.946e+04 5.801e+03 3.355 0.000814 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33700 on 1418 degrees of freedom
## Multiple R-squared: 0.8251, Adjusted R-squared: 0.82
## F-statistic: 163.1 on 41 and 1418 DF, p-value: < 2.2e-16
lm2 <- step(lm1)
## Start: AIC=30483.24
## SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF +
## X2ndFlrSF + LotArea + BedroomAbvGr + TotRmsAbvGrd + YearBuilt +
## YearRemodAdd + Neighborhood + HouseStyle
##
## Df Sum of Sq RSS AIC
## <none> 1.6105e+12 30483
## - YearBuilt 1 6.6995e+09 1.6172e+12 30487
## - TotRmsAbvGrd 1 1.4715e+10 1.6252e+12 30495
## - X2ndFlrSF 1 2.0274e+10 1.6308e+12 30500
## - YearRemodAdd 1 2.3775e+10 1.6343e+12 30503
## - BedroomAbvGr 1 2.4180e+10 1.6347e+12 30503
## - TotalBsmtSF 1 2.5718e+10 1.6362e+12 30504
## - HouseStyle 7 4.7554e+10 1.6581e+12 30512
## - LotArea 1 4.3687e+10 1.6542e+12 30520
## - GarageCars 1 4.5780e+10 1.6563e+12 30522
## - GrLivArea 1 6.2799e+10 1.6733e+12 30537
## - OverallQual 1 2.0549e+11 1.8160e+12 30657
## - Neighborhood 24 3.2156e+11 1.9321e+12 30701
There are two conditions that we are going to be focus on here for residual analysis: Nearly normal residuals and linearity.
Nearly normal residuals
Let’s plot an histogram…
ggplot(data = lm1, aes(x = .resid)) +
geom_histogram()+
xlab("Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram shows that the distribution is close to normal.
Or a normal probability plot of residuals…
qqnorm(resid(lm1))
qqline(resid(lm1))
As per Q-Q plot, the residuals distribution is also a bit normal.
Linearity
ggplot(data = lm1, aes(x = .fitted, y = .resid)) +
geom_point()+
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
As we look at on the graph, we can see that there is not a strong pattern in residuals and a linear model could fit the data but we will need to do more transformation
We are goig to test our first model…
test <- read.csv('https://raw.githubusercontent.com/jnataky/Computational_Mathematics/main/Assignments/test.csv')
sale_pred <- predict(lm1, test)
df_p <- data.frame(Id = test[, "Id"], SalePrice = sale_pred)
df_p[df_p < 0] <- 0
df_p <- replace(df_p, is.na(df_p), 0)
head(df_p)
## Id SalePrice
## 1 1461 117064.3
## 2 1462 155397.3
## 3 1463 158092.3
## 4 1464 174104.2
## 5 1465 261765.3
## 6 1466 177928.0
Save the results to csv
write.csv(df_p, "mypreds2_kaggle.csv", row.names = FALSE)
Kaggle results
Username: jnataky
Score: 0.46551
As we mentioned earlier, we didn’t optimize or tuning our model, doing so will help us to improve the model.
Work to be continued…