Problem 1

Generating random numbers…

# Set seed and define N, n


N <- 6
n <- 10000

# Random uniform numbers

X <- runif(n, 1, N)

# Random normal numbers

u <- (N + 1)/2

Y <- rnorm(n, mean=u, sd = u)

Defining x and y…

x <- quantile(X, 0.5)
y <- quantile(Y, 0.25)

We are going to use the following probabilities formulas:

P( A/B) = ; P( A,B)  = P( A) *P( B)

Let P( X >x|X >y)  be p1

p1 <- (length(X[X>x & X>y]) / n) / (length(X[X>y]) / n)

paste0("The probability is: ", round(p1, 4))
## [1] "The probability is: 0.5164"

Let P( X >x,Y>y)  be p2

p2 <- (length(X[X>x]) / n) * (length(Y[Y>y]) / n)

paste0("The probability is: ", round(p2, 4))
## [1] "The probability is: 0.375"

Let P( X <x| X>y)  be p3

p3 <- (length(X[X<x & X>y]) / n) / (length(X[X>y]) / n)

paste0("The probability is: ", round(p3, 4))
## [1] "The probability is: 0.4836"

Let investigate P( X >x and Y >y)  = P( X >x) P( Y >y)

Building a probability table…

df <- data.frame(c(length(X[X<x])/n * length(Y[Y<y])/n, length(X[X<x])/n * length(Y[Y>y])/n),
                c(length(X[X>x])/n * length(Y[Y<y])/n, length(X[X>x])/n * length(Y[Y>y])/n))
                  
                  

colnames(df) <- c("X<x", "X>x")

# Add the total colums

df$total <- df$`X<x` + df$`X>x`

# Add rows
new_row <- c(sum(df$`X<x`), sum(df$`X>x`), sum(df$total))

df1 <- rbind(df, new_row)

rownames(df1) <- c("Y<y", "Y>y", "total")

df1                   
##         X<x   X>x total
## Y<y   0.125 0.125  0.25
## Y>y   0.375 0.375  0.75
## total 0.500 0.500  1.00

From the table, we are going to evaluate the joint and marginal probabilities.

Joint probability:

joint_p <- df1[2,2]
paste0("P(X>x and Y>y) is : ", joint_p )
## [1] "P(X>x and Y>y) is : 0.375"

Marginal probability:

marginal_p <- df1[3,2]*df1[2,3]

paste0("P(X>x)P(Y>y) is : ", marginal_p )
## [1] "P(X>x)P(Y>y) is : 0.375"

Let check now if P(X>x and Y>y) = P(X>x)P(Y>y)

joint_p==marginal_p
## [1] TRUE

As proven above, we can conclude that P(X>x and Y>y) = P(X>x)P(Y>y)

Fisher’s Exact Test and Chi Square Test

Let build our matrix again

gen_matrix <- matrix(c(length(X[X<x])/n * length(Y[Y<y])/n, length(X[X<x])/n * length(Y[Y>y])/n,
                length(X[X>x])/n * length(Y[Y<y])/n, length(X[X>x])/n * length(Y[Y>y])/n), nrow = 2)

Fisher’s Exact Test…

fisher.test(gen_matrix*n)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  gen_matrix * n
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9124854 1.0959080
## sample estimates:
## odds ratio 
##          1

Chi Square Test…

chisq.test(gen_matrix*n)
## 
##  Pearson's Chi-squared test
## 
## data:  gen_matrix * n
## X-squared = 0, df = 1, p-value = 1

Both the Fisher’s Exact Test and Chi Square Test give us a p-value > 0.05, that’s it, we fail to reject to the null hypothesis: The relative proportion of one variable are independent of the second variable.

Furthermore, the difference is that Fisher’s Exact Test is used for a small sample size and Chi Square Test is suitable for bigger sample size.

That’s said, It will be more appropriate to use Chi Square Test in this case where the sample size is bigger.

Problem 2

Libraries

library(tidyverse)
library(corrplot)
library(Hmisc)
library(MASS)
library(pracma)
library(matrixcalc)
library(fitdistrplus)
library(Rmisc)
library(funModeling)

Get the data

data <- read.csv("https://raw.githubusercontent.com/jnataky/Computational_Mathematics/main/Assignments/train.csv")

Descriptive and Inferential Statistics

Explore data and get the insight

glimpse(data)
## Rows: 1,460
## Columns: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,~
## $ MSZoning      <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "R~
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, ~
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612~
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ~
## $ Alley         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ LotShape      <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1", ~
## $ LandContour   <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", ~
## $ Utilities     <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu~
## $ LotConfig     <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Inside", "I~
## $ LandSlope     <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", ~
## $ Neighborhood  <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "~
## $ Condition1    <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm",~
## $ Condition2    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ~
## $ BldgType      <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ~
## $ HouseStyle    <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fi~
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,~
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,~
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19~
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19~
## $ RoofStyle     <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "G~
## $ RoofMatl      <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "~
## $ Exterior1st   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "~
## $ Exterior2nd   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "VinylSd", "~
## $ MasVnrType    <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace", "None",~
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, ~
## $ ExterQual     <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ ExterCond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ Foundation    <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "Wood", "~
## $ BsmtQual      <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", "TA", "T~
## $ BsmtCond      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "T~
## $ BsmtExposure  <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", "No", "N~
## $ BsmtFinType1  <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ", ~
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99~
## $ BsmtFinType2  <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "BLQ", ~
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17~
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10~
## $ Heating       <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ~
## $ HeatingQC     <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", "Gd", "E~
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ Electrical    <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S~
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, ~
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,~
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10~
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,~
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,~
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,~
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,~
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,~
## $ KitchenQual   <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6~
## $ Functional    <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", ~
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,~
## $ FireplaceQu   <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA", "TA", ~
## $ GarageType    <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attch~
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19~
## $ GarageFinish  <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "RFn", ~
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,~
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7~
## $ GarageQual    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "Fa", "G~
## $ GarageCond    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ PavedDrive    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160~
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,~
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, ~
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, ~
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ PoolQC        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Fence         <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, NA,~
## $ MiscFeature   <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, NA, NA, ~
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,~
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10~
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20~
## $ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W~
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm~
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, 307000, ~

Descriptive statistics:

We are going go skim the quantitative variables

data %>%
  select_if(is.numeric) %>%
  skimr::skim()
Data summary
Name Piped data
Number of rows 1460
Number of columns 38
_______________________
Column type frequency:
numeric 38
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Id 0 1.00 730.50 421.61 1 365.75 730.5 1095.25 1460 ▇▇▇▇▇
MSSubClass 0 1.00 56.90 42.30 20 20.00 50.0 70.00 190 ▇▅▂▁▁
LotFrontage 259 0.82 70.05 24.28 21 59.00 69.0 80.00 313 ▇▃▁▁▁
LotArea 0 1.00 10516.83 9981.26 1300 7553.50 9478.5 11601.50 215245 ▇▁▁▁▁
OverallQual 0 1.00 6.10 1.38 1 5.00 6.0 7.00 10 ▁▂▇▅▁
OverallCond 0 1.00 5.58 1.11 1 5.00 5.0 6.00 9 ▁▁▇▅▁
YearBuilt 0 1.00 1971.27 30.20 1872 1954.00 1973.0 2000.00 2010 ▁▂▃▆▇
YearRemodAdd 0 1.00 1984.87 20.65 1950 1967.00 1994.0 2004.00 2010 ▅▂▂▃▇
MasVnrArea 8 0.99 103.69 181.07 0 0.00 0.0 166.00 1600 ▇▁▁▁▁
BsmtFinSF1 0 1.00 443.64 456.10 0 0.00 383.5 712.25 5644 ▇▁▁▁▁
BsmtFinSF2 0 1.00 46.55 161.32 0 0.00 0.0 0.00 1474 ▇▁▁▁▁
BsmtUnfSF 0 1.00 567.24 441.87 0 223.00 477.5 808.00 2336 ▇▅▂▁▁
TotalBsmtSF 0 1.00 1057.43 438.71 0 795.75 991.5 1298.25 6110 ▇▃▁▁▁
X1stFlrSF 0 1.00 1162.63 386.59 334 882.00 1087.0 1391.25 4692 ▇▅▁▁▁
X2ndFlrSF 0 1.00 346.99 436.53 0 0.00 0.0 728.00 2065 ▇▃▂▁▁
LowQualFinSF 0 1.00 5.84 48.62 0 0.00 0.0 0.00 572 ▇▁▁▁▁
GrLivArea 0 1.00 1515.46 525.48 334 1129.50 1464.0 1776.75 5642 ▇▇▁▁▁
BsmtFullBath 0 1.00 0.43 0.52 0 0.00 0.0 1.00 3 ▇▆▁▁▁
BsmtHalfBath 0 1.00 0.06 0.24 0 0.00 0.0 0.00 2 ▇▁▁▁▁
FullBath 0 1.00 1.57 0.55 0 1.00 2.0 2.00 3 ▁▇▁▇▁
HalfBath 0 1.00 0.38 0.50 0 0.00 0.0 1.00 2 ▇▁▅▁▁
BedroomAbvGr 0 1.00 2.87 0.82 0 2.00 3.0 3.00 8 ▁▇▂▁▁
KitchenAbvGr 0 1.00 1.05 0.22 0 1.00 1.0 1.00 3 ▁▇▁▁▁
TotRmsAbvGrd 0 1.00 6.52 1.63 2 5.00 6.0 7.00 14 ▂▇▇▁▁
Fireplaces 0 1.00 0.61 0.64 0 0.00 1.0 1.00 3 ▇▇▁▁▁
GarageYrBlt 81 0.94 1978.51 24.69 1900 1961.00 1980.0 2002.00 2010 ▁▁▅▅▇
GarageCars 0 1.00 1.77 0.75 0 1.00 2.0 2.00 4 ▁▃▇▂▁
GarageArea 0 1.00 472.98 213.80 0 334.50 480.0 576.00 1418 ▂▇▃▁▁
WoodDeckSF 0 1.00 94.24 125.34 0 0.00 0.0 168.00 857 ▇▂▁▁▁
OpenPorchSF 0 1.00 46.66 66.26 0 0.00 25.0 68.00 547 ▇▁▁▁▁
EnclosedPorch 0 1.00 21.95 61.12 0 0.00 0.0 0.00 552 ▇▁▁▁▁
X3SsnPorch 0 1.00 3.41 29.32 0 0.00 0.0 0.00 508 ▇▁▁▁▁
ScreenPorch 0 1.00 15.06 55.76 0 0.00 0.0 0.00 480 ▇▁▁▁▁
PoolArea 0 1.00 2.76 40.18 0 0.00 0.0 0.00 738 ▇▁▁▁▁
MiscVal 0 1.00 43.49 496.12 0 0.00 0.0 0.00 15500 ▇▁▁▁▁
MoSold 0 1.00 6.32 2.70 1 5.00 6.0 8.00 12 ▃▆▇▃▃
YrSold 0 1.00 2007.82 1.33 2006 2007.00 2008.0 2009.00 2010 ▇▇▇▇▅
SalePrice 0 1.00 180921.20 79442.50 34900 129975.00 163000.0 214000.00 755000 ▇▅▁▁▁

We are going to consider few variables and study their distribution: SalePrice, YearBuilt, OverallQual,TotRmsAbvGrd, LotArea

Visualization

ggplot(data=data, aes(x=SalePrice)) +
  geom_histogram( fill = 'magenta') +
  labs(title = "Houses sale price distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = data, aes(x=YearBuilt)) +
  geom_histogram(fill = "blue") +
  labs(x = "Year built", title = "Houses year built distribution ")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = data, aes(x=factor(OverallQual), y = SalePrice))+
  geom_boxplot() +
  labs(x ="Overall quality", title=" Houses sale price by Overall quality")

ggplot(data = data, aes(x=factor(TotRmsAbvGrd), y = SalePrice)) +
  geom_boxplot() +
  labs(x ="Rooms", title="Houses sale price by bedrooms ")

ggplot(data = data, aes(x=LotArea, y=SalePrice)) +
  geom_point(color = "green") +
  labs(x="Lot Area", title ="Sale price by lot area")  

ggplot(data = data, aes (x = YearBuilt, y=SalePrice)) +
  geom_point(color = 'blue') +
  labs(x='Year built', y='Sale price', title='Sale price over years')

Scatter plot matrix

We are going to consider same variables for the scatter plot matrix.

pairs(~SalePrice + YearBuilt + OverallQual + TotRmsAbvGrd + LotArea, 
      data = data)

Correlation matrix

We are going to consider three of our previous variables to build a correlation matrix

data1 <- data %>%
  dplyr::select(OverallQual, TotRmsAbvGrd, LotArea)

res <- cor(data1)
res
##              OverallQual TotRmsAbvGrd   LotArea
## OverallQual    1.0000000    0.4274523 0.1058057
## TotRmsAbvGrd   0.4274523    1.0000000 0.1900148
## LotArea        0.1058057    0.1900148 1.0000000
data1 %>%
  cor() %>%
  corrplot()

Testing the hypothesis

Overall quality and Total rooms
cor.test(data$OverallQual, data$TotRmsAbvGrd, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  data$OverallQual and data$TotRmsAbvGrd
## t = 18.054, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.3996237 0.4544938
## sample estimates:
##       cor 
## 0.4274523

p_value less than 0.05: Reject the null hypothesis.

We are 80% confident that the correlation is between 0.3996 and 0.4545

Overall quality and Lot area
cor.test(data$OverallQual, data$LotArea, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  data$OverallQual and data$LotArea
## t = 4.0629, df = 1458, p-value = 5.106e-05
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.07250156 0.13887424
## sample estimates:
##       cor 
## 0.1058057

p_value less than 0.05: Reject the null hypothesis.

We are 80% confident that the correlation is between 0.0725 and 0.1389

Total rooms and Lot area
cor.test(data$TotRmsAbvGrd, data$LotArea, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  data$TotRmsAbvGrd and data$LotArea
## t = 7.3901, df = 1458, p-value = 2.461e-13
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1574573 0.2221597
## sample estimates:
##       cor 
## 0.1900148

p_value less than 0.05: Reject the null hypothesis.

We are 80% confident that the correlation is between 0.1575 and 0.2222

In this case, as we are not performing multiple hypothesis test at once, I wouldn’t worry much about familywise error. I would consider each test as dependant and having the type error rate equal to the significance level of 0.05.

Linear Algebra and Correlation

Precision matrix

We are going o use ginv function from MASS package to invert the correlation matrix

inv_res <-round(ginv(res), 3)
inv_res
##        [,1]   [,2]   [,3]
## [1,]  1.225 -0.517 -0.031
## [2,] -0.517  1.256 -0.184
## [3,] -0.031 -0.184  1.038

Multiply the correlation matrix by the precision matrix

m1 <- round(res%*%inv_res,2)
m1
##              [,1] [,2] [,3]
## OverallQual     1    0    0
## TotRmsAbvGrd    0    1    0
## LotArea         0    0    1

Multiply the precision matrix by the correlation matrix

m2 <- round(inv_res%*%res, 2)
m2
##      OverallQual TotRmsAbvGrd LotArea
## [1,]           1            0       0
## [2,]           0            1       0
## [3,]           0            0       1

LU decomposition on res

lu.decomposition function from matrixcalc will be used here to conduct LU decomposition on the matrix.

lu.decomposition(res)
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.4274523 1.0000000    0
## [3,] 0.1058057 0.1771572    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.4274523 0.1058057
## [2,]    0 0.8172845 0.1447879
## [3,]    0 0.0000000 0.9631549

Calculus-Based Probability & Statistics

Exploring First floor square feet variable

ggplot(data = data, aes(x=X1stFlrSF)) +
  geom_histogram(fill = 'gold', bins = 20) +
  labs(title ='First floor square feet distribution' )

skimr::skim(data$X1stFlrSF)
Data summary
Name data$X1stFlrSF
Number of rows 1460
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 1162.63 386.59 334 882 1087 1391.25 4692 ▇▅▁▁▁

Exponential probability density function

x1_exp <-fitdistr(data$X1stFlrSF, densfun = 'exponential')

lambda <- x1_exp$estimate
lambda
##         rate 
## 0.0008601213

Plotting histograms

n <- 1000
x1_exp_dist <- rexp(n, lambda)

par(mfrow=c(1, 2))
hist(x1_exp_dist, col = 'blue', main = 'Histogram with exp prob densfunc', xlab ='1st floor exp dist')
hist(data$X1stFlrSF, col = 'gold', xlab = 'Fist floor sq', main='Histogram prior fitting exp funct')

The exponential probability density function transformed the first histogram to a clean right skewed histogram where we can see at first glance its exponential distribution.

Plotting the cumulative distribution

x1_exp_dist_cum <- plotdist(x1_exp_dist)

5th and 95th percentile:

quantile(x1_exp_dist, c(0.05, 0.95))
##         5%        95% 
##   43.20805 3633.01347

95% Confidence interval of empirical data

CI(data$X1stFlrSF, ci = 0.95)
##    upper     mean    lower 
## 1182.473 1162.627 1142.780

We are 95% percent that the mean of first floor sq of houses in Boston falls in the interval (1142.780, 1182.473)

5th and 95th percentile of empirical data

quantile(data$X1stFlrSF, c(0.05, 0.95))
##      5%     95% 
##  672.95 1831.25

Regarding the houses first floor square ft from data collected,

5% of first floor square ft is 672.95, The mean is between 1142.780 and 1182.473 with 95% of confidence and 95% is 1831.25

As the distribution is right skewed, there is more houses with first floor square ft above the mean of 1162.627

Modeling

To build our model, we are going to have few steps that would help us to build the model that would fit the data better. This will include for example finding some variables of interest, those are variables which would contribute significantly to our model. One thing to mention is that we are not going to be very deep in this process. Elements such as optimization and tuning won’t be within our scope for this stage.

Variables of interest

Build Build a correlation table to seek for some variables of interest

correlation_table(data = data, target = "SalePrice")
##         Variable SalePrice
## 1      SalePrice      1.00
## 2    OverallQual      0.80
## 3      GrLivArea      0.71
## 4     GarageCars      0.65
## 5    TotalBsmtSF      0.62
## 6     GarageArea      0.62
## 7      X1stFlrSF      0.61
## 8       FullBath      0.57
## 9   TotRmsAbvGrd      0.55
## 10     YearBuilt      0.53
## 11  YearRemodAdd      0.52
## 12   GarageYrBlt      0.50
## 13    MasVnrArea      0.49
## 14    Fireplaces      0.46
## 15    BsmtFinSF1      0.39
## 16   LotFrontage      0.34
## 17    WoodDeckSF      0.34
## 18   OpenPorchSF      0.34
## 19     X2ndFlrSF      0.31
## 20       LotArea      0.30
## 21      HalfBath      0.27
## 22  BsmtFullBath      0.24
## 23     BsmtUnfSF      0.21
## 24  BedroomAbvGr      0.17
## 25   ScreenPorch      0.11
## 26      PoolArea      0.09
## 27        MoSold      0.05
## 28    X3SsnPorch      0.03
## 29  LowQualFinSF      0.00
## 30        YrSold     -0.01
## 31    BsmtFinSF2     -0.03
## 32  BsmtHalfBath     -0.04
## 33       MiscVal     -0.04
## 34            Id     -0.05
## 35    MSSubClass     -0.09
## 36   OverallCond     -0.12
## 37  KitchenAbvGr     -0.14
## 38 EnclosedPorch     -0.15

As we can see, they are few variables that are highly correlated to the target, and we are going to consider those first while building our model.

Model 1

data1 <- data[, c('SalePrice', 'OverallQual','GrLivArea', 'GarageCars', 'TotalBsmtSF', 'X2ndFlrSF', 'LotArea', 'BedroomAbvGr', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'Neighborhood', 'HouseStyle')]
lm1 <- lm(SalePrice~., data = data1)

summary(lm1)
## 
## Call:
## lm(formula = SalePrice ~ ., data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -398905  -14372     -59   13885  262903 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -9.525e+05  1.623e+05  -5.869 5.46e-09 ***
## OverallQual          1.581e+04  1.175e+03  13.451  < 2e-16 ***
## GrLivArea            4.008e+01  5.391e+00   7.436 1.79e-13 ***
## GarageCars           1.068e+04  1.683e+03   6.349 2.91e-10 ***
## TotalBsmtSF          1.854e+01  3.895e+00   4.759 2.15e-06 ***
## X2ndFlrSF            3.145e+01  7.444e+00   4.225 2.54e-05 ***
## LotArea              6.383e-01  1.029e-01   6.202 7.30e-10 ***
## BedroomAbvGr        -7.730e+03  1.675e+03  -4.614 4.31e-06 ***
## TotRmsAbvGrd         4.251e+03  1.181e+03   3.599 0.000330 ***
## YearBuilt            1.717e+02  7.069e+01   2.429 0.015276 *  
## YearRemodAdd         2.713e+02  5.930e+01   4.575 5.17e-06 ***
## NeighborhoodBlueste  1.007e+04  2.546e+04   0.396 0.692511    
## NeighborhoodBrDale   1.051e+04  1.244e+04   0.844 0.398592    
## NeighborhoodBrkSide  3.181e+04  1.065e+04   2.987 0.002869 ** 
## NeighborhoodClearCr  3.335e+04  1.115e+04   2.991 0.002826 ** 
## NeighborhoodCollgCr  2.511e+04  8.867e+03   2.832 0.004686 ** 
## NeighborhoodCrawfor  4.963e+04  1.041e+04   4.768 2.05e-06 ***
## NeighborhoodEdwards  1.440e+04  9.707e+03   1.483 0.138251    
## NeighborhoodGilbert  1.944e+04  9.400e+03   2.068 0.038805 *  
## NeighborhoodIDOTRR   1.727e+04  1.123e+04   1.538 0.124183    
## NeighborhoodMeadowV  1.959e+04  1.248e+04   1.570 0.116687    
## NeighborhoodMitchel  1.641e+04  9.914e+03   1.656 0.098030 .  
## NeighborhoodNAmes    2.244e+04  9.267e+03   2.421 0.015589 *  
## NeighborhoodNoRidge  7.821e+04  1.027e+04   7.612 4.89e-14 ***
## NeighborhoodNPkVill  1.370e+04  1.422e+04   0.963 0.335513    
## NeighborhoodNridgHt  7.329e+04  9.176e+03   7.987 2.83e-15 ***
## NeighborhoodNWAmes   1.727e+04  9.565e+03   1.805 0.071253 .  
## NeighborhoodOldTown  1.079e+04  1.034e+04   1.044 0.296799    
## NeighborhoodSawyer   2.195e+04  9.794e+03   2.241 0.025180 *  
## NeighborhoodSawyerW  2.104e+04  9.592e+03   2.194 0.028408 *  
## NeighborhoodSomerst  3.394e+04  9.167e+03   3.702 0.000222 ***
## NeighborhoodStoneBr  7.675e+04  1.074e+04   7.148 1.40e-12 ***
## NeighborhoodSWISU    2.090e+04  1.220e+04   1.714 0.086829 .  
## NeighborhoodTimber   3.252e+04  1.018e+04   3.195 0.001429 ** 
## NeighborhoodVeenker  5.402e+04  1.326e+04   4.074 4.87e-05 ***
## HouseStyle1.5Unf     1.349e+04  9.970e+03   1.353 0.176266    
## HouseStyle1Story     2.321e+04  4.659e+03   4.981 7.11e-07 ***
## HouseStyle2.5Fin    -1.535e+04  1.329e+04  -1.155 0.248198    
## HouseStyle2.5Unf    -2.642e+04  1.097e+04  -2.408 0.016150 *  
## HouseStyle2Story    -9.881e+03  4.098e+03  -2.411 0.016032 *  
## HouseStyleSFoyer     2.414e+04  7.235e+03   3.336 0.000871 ***
## HouseStyleSLvl       1.946e+04  5.801e+03   3.355 0.000814 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33700 on 1418 degrees of freedom
## Multiple R-squared:  0.8251, Adjusted R-squared:   0.82 
## F-statistic: 163.1 on 41 and 1418 DF,  p-value: < 2.2e-16

Model 2: Stepwise

lm2 <- step(lm1)
## Start:  AIC=30483.24
## SalePrice ~ OverallQual + GrLivArea + GarageCars + TotalBsmtSF + 
##     X2ndFlrSF + LotArea + BedroomAbvGr + TotRmsAbvGrd + YearBuilt + 
##     YearRemodAdd + Neighborhood + HouseStyle
## 
##                Df  Sum of Sq        RSS   AIC
## <none>                       1.6105e+12 30483
## - YearBuilt     1 6.6995e+09 1.6172e+12 30487
## - TotRmsAbvGrd  1 1.4715e+10 1.6252e+12 30495
## - X2ndFlrSF     1 2.0274e+10 1.6308e+12 30500
## - YearRemodAdd  1 2.3775e+10 1.6343e+12 30503
## - BedroomAbvGr  1 2.4180e+10 1.6347e+12 30503
## - TotalBsmtSF   1 2.5718e+10 1.6362e+12 30504
## - HouseStyle    7 4.7554e+10 1.6581e+12 30512
## - LotArea       1 4.3687e+10 1.6542e+12 30520
## - GarageCars    1 4.5780e+10 1.6563e+12 30522
## - GrLivArea     1 6.2799e+10 1.6733e+12 30537
## - OverallQual   1 2.0549e+11 1.8160e+12 30657
## - Neighborhood 24 3.2156e+11 1.9321e+12 30701

Residual analysis

There are two conditions that we are going to be focus on here for residual analysis: Nearly normal residuals and linearity.

Nearly normal residuals

Let’s plot an histogram…

ggplot(data = lm1, aes(x = .resid)) +
  geom_histogram()+
  xlab("Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram shows that the distribution is close to normal.

Or a normal probability plot of residuals…

qqnorm(resid(lm1))
qqline(resid(lm1))

As per Q-Q plot, the residuals distribution is also a bit normal.

Linearity

ggplot(data = lm1, aes(x = .fitted, y = .resid)) +
  geom_point()+
  
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

As we look at on the graph, we can see that there is not a strong pattern in residuals and a linear model could fit the data but we will need to do more transformation

Test the model

We are goig to test our first model…

test <- read.csv('https://raw.githubusercontent.com/jnataky/Computational_Mathematics/main/Assignments/test.csv')

sale_pred <- predict(lm1, test)

df_p <- data.frame(Id = test[, "Id"], SalePrice = sale_pred)

df_p[df_p < 0] <- 0

df_p <- replace(df_p, is.na(df_p), 0)

head(df_p)
##     Id SalePrice
## 1 1461  117064.3
## 2 1462  155397.3
## 3 1463  158092.3
## 4 1464  174104.2
## 5 1465  261765.3
## 6 1466  177928.0

Kaggle results

Save the results to csv

write.csv(df_p, "mypreds2_kaggle.csv", row.names = FALSE)

Kaggle results

Username: jnataky

Score: 0.46551

As we mentioned earlier, we didn’t optimize or tuning our model, doing so will help us to improve the model.

Work to be continued…