1. Data Exploration
Variables Description
#vars.description <- read.table(file = "C:/Users/owner/OneDrive/Documents/R/Data605_final/data_description.txt", header = TRUE)
#vars.description
train.df <- read.csv("https://raw.githubusercontent.com/asmozo24/Data605_House-Prices/main/train.csv", stringsAsFactors=FALSE)
test.df <- read.csv("https://raw.githubusercontent.com/asmozo24/Data605_House-Prices/main/test.csv", stringsAsFactors=FALSE)
#View()
str(train.df)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
We have 1460 observations of 81 variables: the dataset has a large number of variables but is not large overall. We want to check the distribution of the response variable (SalePrice) and of three independent variables: YearBuilt (original construction date), LotArea (lot size in square feet), and OverallCond (rates the overall condition of the house, on a scale of 1 to 10). Based on real-life experience, we assume there is a relationship between the sale price and these independent variables.
Sample data of the new data frame, train.df1 (4 variables selected)
train.df1 <- dplyr::select (train.df, SalePrice, YearBuilt, LotArea, OverallCond)
head(train.df1)
## SalePrice YearBuilt LotArea OverallCond
## 1 208500 2003 8450 5
## 2 181500 1976 9600 8
## 3 223500 2001 11250 5
## 4 140000 1915 9550 5
## 5 250000 2000 14260 5
## 6 143000 1993 14115 5
Summary of train.df1
library(pastecs)
## Warning: package 'pastecs' was built under R version 4.0.5
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:data.table':
## 
##     first, last
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following objects are masked from 'package:dplyr':
## 
##     first, last
options(scipen=10)
options(digits=2)
stat.desc( train.df1, norm=FALSE, p=0.95) #basic=T,desc=F
##                  SalePrice   YearBuilt     LotArea OverallCond
## nbr.val            1460.00    1460.000     1460.00    1460.000
## nbr.null              0.00       0.000        0.00       0.000
## nbr.na                0.00       0.000        0.00       0.000
## min               34900.00    1872.000     1300.00       1.000
## max              755000.00    2010.000   215245.00       9.000
## range            720100.00     138.000   213945.00       8.000
## sum           264144946.00 2878051.000 15354569.00    8140.000
## median           163000.00    1973.000     9478.50       5.000
## mean             180921.20    1971.268    10516.83       5.575
## SE.mean            2079.11       0.790      261.22       0.029
## CI.mean.0.95       4078.35       1.551      512.41       0.057
## var          6311111264.30     912.215 99625649.65       1.238
## std.dev           79442.50      30.203     9981.26       1.113
## coef.var              0.44       0.015        0.95       0.200
Data distribution
#checking variable distribution
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.0.5
plot_H <- train.df1 %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 25) +
  theme_wsj() +
  scale_colour_wsj("colors6")
plot_H

Data Cleaning: Fixing Variables with Missing Values
Let's see the distribution of missing values across the selected training variables.
## The total of missing values is : 0

Plotting the response variable against the independent variables.
We will add a new variable called HouseAge. Since there is no note about when this data was collected, we will pick 2015 as the reference year; this does not mean the sale prices reflect the 2015 market.
Scatter plots + fitted regression line
train.df2 <- train.df1 %>%
mutate(HouseAge = 2015 - YearBuilt)
#View(train.df2)
plots <- function(df, x, y) {
  ggplot(df, aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    ggtitle("House Sale Price against Predictor") +
    xlab("x") + ylab("y") +
    # Customize the color, size, and face of the title and axis labels
    theme(
      plot.title = element_text(color = "red", size = 14, face = "bold.italic"),
      axis.title.x = element_text(color = "blue", size = 12, face = "bold"),
      axis.title.y = element_text(color = "#993333", size = 12, face = "bold"))
}
print(" y = House Sale Price (in dollars) Vs. x = Original construction date")
## [1] " y = House Sale Price (in dollars) Vs. x = Original construction date"
plots (train.df2, train.df2$YearBuilt,train.df2$SalePrice)
## `geom_smooth()` using formula 'y ~ x'

print(" y = House Sale Price (in dollars) Vs. x = House Age (in years) ")
## [1] " y = House Sale Price (in dollars) Vs. x = House Age (in years) "
plots (train.df2, train.df2$HouseAge,train.df2$SalePrice)
## `geom_smooth()` using formula 'y ~ x'

print(" y = House Sale Price (in dollars) Vs. x = Rates the overall condition of the house (scale 1 to 10)")
## [1] " y = House Sale Price (in dollars) Vs. x = Rates the overall condition of the house (scale 1 to 10)"
plots (train.df2, train.df2$OverallCond,train.df2$SalePrice)
## `geom_smooth()` using formula 'y ~ x'

print(" y = House Sale Price (in dollars) Vs. x = Lot size (in square feet)")
## [1] " y = House Sale Price (in dollars) Vs. x = Lot size (in square feet)"
plots (train.df2, train.df2$LotArea,train.df2$SalePrice)
## `geom_smooth()` using formula 'y ~ x'

The scatter plots show a weak linear relationship between the house sale price and the predictors, except for the original construction year (and, equivalently, house age). We suspect the relationship between sale price and house age may actually be negative exponential rather than linear. There is also a slight linear relationship between sale price and overall condition, and there are many outliers.
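One quick way to probe the exponential hypothesis is to refit the construction-year relationship on a log scale; a minimal sketch, assuming a log-linear form (this check is ours, not part of the original analysis):
# Sketch: if SalePrice grows roughly exponentially with YearBuilt, then
# log(SalePrice) should be approximately linear in YearBuilt.
log.fit <- lm(log(SalePrice) ~ YearBuilt, data = train.df2)
summary(log.fit)$r.squared  # compare against the untransformed fit's R-squared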
Correlation
library(corrr)
## Warning: package 'corrr' was built under R version 4.0.5
cor1 <- cor(train.df2, method = "pearson", use = "complete.obs")
kable(cor1)
|            | SalePrice| YearBuilt| LotArea| OverallCond| HouseAge|
|:-----------|---------:|---------:|-------:|-----------:|--------:|
|SalePrice   |      1.00|      0.52|    0.26|       -0.08|    -0.52|
|YearBuilt   |      0.52|      1.00|    0.01|       -0.38|    -1.00|
|LotArea     |      0.26|      0.01|    1.00|       -0.01|    -0.01|
|OverallCond |     -0.08|     -0.38|   -0.01|        1.00|     0.38|
|HouseAge    |     -0.52|     -1.00|   -0.01|        0.38|     1.00|
ggcorr(train.df2, label = TRUE , label_alpha =TRUE )

#corrplot(train_m, order="hclust", type="upper", tl.col="grey", tl.srt=45, addCoef.col = "yellow")
The correlation output does not show anything strong. As mentioned above, the most tangible relationship is between house sale price and construction year (or house age): correlation = 0.52, a moderate value. Not bad!
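Since corrr is already loaded, an equivalent way to pull only the correlations with SalePrice, as a sketch (correlate() and focus() are corrr functions):
# Sketch: correlations of each predictor with SalePrice via corrr.
train.df2 %>%
  correlate(method = "pearson", use = "complete.obs") %>%
  focus(SalePrice)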
Hypothesis Testing
Test the hypothesis that the correlation between each pairwise set of variables is 0, and provide an 80% confidence interval.
House Sale Price and Built Year
Null Hypothesis: the correlation between house sale price and built year is 0.
Alternative Hypothesis: the correlation between house sale price and built year is not 0.
cor.test(train.df2$YearBuilt, train.df2$SalePrice, method = "pearson" , conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.df2$YearBuilt and train.df2$SalePrice
## t = 23, df = 1458, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.50 0.55
## sample estimates:
## cor
## 0.52
The p-value < 2e-16 is far below the 0.2 threshold implied by the 80% confidence level, so the result is highly significant. Thus, we reject the null hypothesis: the true correlation is not equal to 0.
House Sale Price and Lot Size
Null Hypothesis: the correlation between house sale price and lot size is 0.
Alternative Hypothesis: the correlation between house sale price and lot size is not 0.
cor.test(train.df2$LotArea, train.df2$SalePrice, method = "pearson" , conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.df2$LotArea and train.df2$SalePrice
## t = 10, df = 1458, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.23 0.29
## sample estimates:
## cor
## 0.26
The p-value < 2e-16 is far below the 0.2 threshold, so we reject the null hypothesis: the true correlation is not equal to 0. However, the correlation is weak.
House Sale Price and Overall House Condition
Null Hypothesis: the correlation between house sale price and the overall house condition is 0.
Alternative Hypothesis: the correlation between house sale price and overall house condition is not 0.
cor.test(train.df2$OverallCond, train.df2$SalePrice, method = "pearson" , conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.df2$OverallCond and train.df2$SalePrice
## t = -3, df = 1458, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.111 -0.044
## sample estimates:
## cor
## -0.078
The p-value = 0.003 is below the 0.2 threshold, so we reject the null hypothesis: the true correlation is not equal to 0. However, the correlation is very weak.
Familywise Error Rate (Alpha Inflation)
The familywise error rate (FWE or FWER) is the probability of making at least one false rejection in a series of hypothesis tests. We will not adjust for it here because the p-values found are far below any reasonably adjusted threshold.
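For completeness, a short sketch of what a Bonferroni adjustment would look like (the p-values are the upper bounds reported by the cor.test calls above; p.adjust() is base R):
# Sketch: Bonferroni-adjust the three pairwise p-values. At the 80% confidence
# level alpha = 0.20; the adjusted p-values remain far below it, so the
# conclusions survive the familywise correction.
p.vals <- c(YearBuilt = 2e-16, LotArea = 2e-16, OverallCond = 0.003)
p.adjust(p.vals, method = "bonferroni")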
Linear Algebra and Correlation.
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
library(matlib)
## Warning: package 'matlib' was built under R version 4.0.5
library(Matrix)
library(matrixcalc)
##
## Attaching package: 'matrixcalc'
## The following object is masked from 'package:matlib':
##
## vec
(cor2 <- Ginv(cor1))
##       [,1]  [,2]   [,3]   [,4] [,5]
## [1,]  1.55 -0.89 -0.398 -0.215    0
## [2,] -0.89  1.67  0.213  0.561    0
## [3,] -0.40  0.21  1.102  0.055    0
## [4,] -0.21  0.56  0.055  1.194    0
## [5,]  0.00  0.00  0.000  0.000    0
Multiply the correlation matrix by the precision matrix.
(cor1 %*% cor2)
##                        [,1]          [,2]           [,3]           [,4] [,5]
## SalePrice    0.999999999675  0.0000000053  0.00000000102  0.00000000063    0
## YearBuilt    0.000000001686  1.0000000053  0.00000000197  0.00000000102    0
## LotArea      0.000000000041  0.0000000028  0.99999999839  0.00000000071    0
## OverallCond -0.000000000668 -0.0000000015 -0.00000000013  0.99999999800    0
## HouseAge    -0.000000001686 -1.0000000053 -0.00000000197 -0.00000000102    0
Multiply the precision matrix by the correlation matrix.
(cor2 %*% cor1)
## SalePrice YearBuilt LotArea OverallCond HouseAge
## [1,] 0.99999999968 0.0000000017 0.000000000041 -0.00000000067 -0.0000000017
## [2,] 0.00000000531 1.0000000053 0.000000002758 -0.00000000153 -1.0000000053
## [3,] 0.00000000102 0.0000000020 0.999999998394 -0.00000000013 -0.0000000020
## [4,] 0.00000000063 0.0000000010 0.000000000713 0.99999999800 -0.0000000010
## [5,] 0.00000000000 0.0000000000 0.000000000000 0.00000000000 0.0000000000
We do not see the identity matrix because cor1 is singular: HouseAge is a perfect linear function of YearBuilt (correlation -1), so the correlation matrix has no true inverse and Ginv() returns a generalized inverse instead, with a zero row and column. Rounding is not the issue.
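To confirm, a sketch that drops HouseAge (our choice of the collinear pair) so the remaining 4x4 correlation matrix is invertible with solve():
# Sketch: with HouseAge removed the matrix is non-singular, a true inverse
# exists, and the product recovers the identity matrix up to rounding.
cor.sub  <- cor1[1:4, 1:4]          # SalePrice, YearBuilt, LotArea, OverallCond
prec.sub <- solve(cor.sub)          # proper precision matrix
round(cor.sub %*% prec.sub, 10)     # identity matrix
diag(prec.sub)                      # variance inflation factors on the diagonal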
Conduct LU decomposition on the matrix.
lucor1 <- lu.decomposition( cor1 )
lucor1
## $L
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000 0.00 0.000 0 0
## [2,] 0.523 1.00 0.000 0 0
## [3,] 0.264 -0.17 1.000 0 0
## [4,] -0.078 -0.46 -0.046 1 0
## [5,] -0.523 -1.00 0.000 0 1
##
## $U
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0.52 0.26 -0.078 -0.52
## [2,] 0 0.73 -0.12 -0.335 -0.73
## [3,] 0 0.00 0.91 -0.042 0.00
## [4,] 0 0.00 0.00 0.837 0.00
## [5,] 0 0.00 0.00 0.000 0.00
Double-checking the LU decomposition
# all.equal() compares with a numeric tolerance; the element-wise == used
# previously triggered a warning and only tested the first element.
if (isTRUE(all.equal(lucor1$L %*% lucor1$U, cor1, check.attributes = FALSE))) {
  print("Good LU decomposition")
} else {
  print("LU decomposition needs a review")
}
## [1] "Good LU decomposition"
Modeling.
We have many missing values; some variables are nearly 100% missing. First, we will remove the variables with 45% or more missing values. Second, we will impute the variables with less than 45% missing values using the median.
These variables will be removed: PoolQC, MiscFeature, Alley, Fence, FireplaceQu.
These variables will be imputed: LotFrontage, GarageCond, GarageFinish, GarageQual, GarageType, GarageYrBlt, BsmtExposure, BsmtFinType2, BsmtCond, BsmtFinType1, BsmtQual, MasVnrArea, MasVnrType, Electrical.
Since multiple linear regression requires numeric inputs, we could transform the character variables into numeric ones; given the large number of variables, we will instead focus on the variables that already have a numeric data type (a sketch of one possible encoding follows).
We will use lm() to build our model and apply backward elimination (removing the least significant variables).
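If we did want to keep the character variables, a minimal sketch of one possible encoding (convert each to a factor, then to integer codes; this ignores any ordinal meaning and is not used in our final model):
# Sketch: integer-encode every character column of the training data.
train.enc <- train.df
chr.cols  <- sapply(train.enc, is.character)
train.enc[chr.cols] <- lapply(train.enc[chr.cols], function(x) as.integer(factor(x)))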
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.0.5
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
train.df3 <- dplyr::select (train.df,-Id, -PoolQC, -MiscFeature ,-Alley, -Fence, -FireplaceQu)
# impute missing value by median
train.df3$LotFrontage <- impute( train.df3$LotFrontage , median)
train.df3$GarageCond <- impute( train.df3$GarageCond , median)
train.df3$GarageFinish <- impute( train.df3$GarageFinish , median)
train.df3$GarageQual <- impute( train.df3$GarageQual , median)
train.df3$GarageType <- impute( train.df3$GarageType , median)
train.df3$GarageYrBlt <- impute( train.df3$GarageYrBlt , median)
train.df3$BsmtExposure <- impute( train.df3$BsmtExposure , median)
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA
train.df3$BsmtFinType2 <- impute( train.df3$BsmtFinType2 , median)
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA
train.df3$BsmtCond <- impute( train.df3$BsmtCond , median)
train.df3$BsmtFinType1 <- impute( train.df3$BsmtFinType1 , median)
train.df3$BsmtQual <- impute( train.df3$BsmtQual , median)
train.df3$MasVnrArea <- impute( train.df3$MasVnrArea , median)
train.df3$MasVnrType <- impute( train.df3$MasVnrType , median)
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA
train.df3$Electrical <- impute( train.df3$Electrical , median)
missing.values(train.df3)%>% kable()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
|variables    | missing|
|:------------|-------:|
|BsmtExposure |      38|
|BsmtFinType2 |      38|
|MasVnrType   |       8|
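Note that median imputation only makes sense for numeric columns; the warnings above come from applying median() to character columns, which is why a few categorical variables still show missing values. A minimal sketch of mode imputation for those columns (impute_mode is a hypothetical helper, not an Hmisc function):
# Sketch: fill NAs in a character column with its most frequent level.
impute_mode <- function(x) {
  tab <- table(x)                            # level frequencies (NAs excluded)
  x[is.na(x)] <- names(tab)[which.max(tab)]  # fill NAs with the modal level
  x
}
# e.g. train.df3$BsmtExposure <- impute_mode(train.df3$BsmtExposure)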
class(train.df3$BsmtExposure)
## [1] "impute"
#data.class(train.df3$BsmtExposure)
# View(train.df3)
#train.df3[complete.cases(train.df3), ]
train.df4 <- na.omit(train.df3)
dim(train.df4)
## [1] 1413 75
missing.values(train.df4)%>% kable()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
gg_miss_var(train.df4, show_pct = TRUE) + labs(y = "Missing Values in % to total record")+ theme_update()

# We could transform some of these variables into numeric; instead, we select the ones that are already numeric
train.df5 <- dplyr::select_if(train.df4, is.numeric)
lm1 <- lm(SalePrice ~., data = train.df5)
summary(lm1)
##
## Call:
## lm(formula = SalePrice ~ ., data = train.df5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -479749 -16470 -2345 13831 302053
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120649.949 1444577.512 0.08 0.93345
## MSSubClass -176.715 28.256 -6.25 0.00000000053 ***
## LotFrontage -72.718 52.305 -1.39 0.16467
## LotArea 0.433 0.103 4.21 0.00002706031 ***
## OverallQual 17391.550 1211.302 14.36 < 2e-16 ***
## OverallCond 4910.814 1055.495 4.65 0.00000359261 ***
## YearBuilt 282.119 69.490 4.06 0.00005186088 ***
## YearRemodAdd 166.463 70.430 2.36 0.01824 *
## MasVnrArea 30.792 6.019 5.12 0.00000035689 ***
## BsmtFinSF1 24.321 6.097 3.99 0.00006979028 ***
## BsmtFinSF2 14.110 8.206 1.72 0.08576 .
## BsmtUnfSF 15.219 5.940 2.56 0.01051 *
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 44.820 6.894 6.50 0.00000000011 ***
## X2ndFlrSF 49.724 5.065 9.82 < 2e-16 ***
## LowQualFinSF 26.110 20.060 1.30 0.19328
## GrLivArea NA NA NA NA
## BsmtFullBath 9801.366 2632.104 3.72 0.00020 ***
## BsmtHalfBath 2242.247 4108.576 0.55 0.58533
## FullBath 1901.747 2908.400 0.65 0.51330
## HalfBath -2633.272 2708.924 -0.97 0.33118
## BedroomAbvGr -9261.867 1740.810 -5.32 0.00000012069 ***
## KitchenAbvGr -15531.065 5777.685 -2.69 0.00727 **
## TotRmsAbvGrd 5430.652 1262.501 4.30 0.00001816001 ***
## Fireplaces 4443.424 1810.613 2.45 0.01425 *
## GarageYrBlt 109.794 70.253 1.56 0.11832
## GarageCars 11075.196 2923.323 3.79 0.00016 ***
## GarageArea -3.324 10.069 -0.33 0.74135
## WoodDeckSF 22.148 8.109 2.73 0.00639 **
## OpenPorchSF -8.946 15.378 -0.58 0.56084
## EnclosedPorch 14.707 17.191 0.86 0.39240
## X3SsnPorch 24.730 31.899 0.78 0.43832
## ScreenPorch 55.364 17.258 3.21 0.00137 **
## PoolArea -30.882 23.906 -1.29 0.19664
## MiscVal -1.136 1.894 -0.60 0.54876
## MoSold -22.920 351.524 -0.07 0.94802
## YrSold -638.413 716.910 -0.89 0.37335
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34800 on 1378 degrees of freedom
## Multiple R-squared: 0.812, Adjusted R-squared: 0.807
## F-statistic: 174 on 34 and 1378 DF, p-value: <2e-16
#plot(lm1)
## Model 2 with only the significant variables
# train.df5 <- dplyr::select (train.df4,MSZoning ,LotArea , Street , LotConfig,LandSlope ,Condition1, Neighborhood, HouseStyle, OverallQual , OverallCond,YearBuilt ,RoofStyle ,RoofMatl, Exterior2nd, MasVnrArea,ExterQual,BsmtQual ,BsmtCond , BsmtExposure , BsmtFinType2, BsmtFinSF2 , BsmtUnfSF ,X1stFlrSF, X2ndFlrSF, BedroomAbvGr, KitchenAbvGr, KitchenQual,Functional, GarageQual,SalePrice)
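Backward elimination can also be automated; a minimal sketch using AIC-based step() from base R (this is not the manual p-value selection we actually used for Model 2 below):
# Sketch: drop the two aliased predictors (TotalBsmtSF, GrLivArea) whose
# coefficients were NA above, then run automated backward elimination.
lm1b    <- lm(SalePrice ~ . - TotalBsmtSF - GrLivArea, data = train.df5)
lm.step <- step(lm1b, direction = "backward", trace = 0)
summary(lm.step)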
Model 2
train.df6 <- dplyr::select (train.df5,MSSubClass , LotArea , OverallQual , OverallCond,YearBuilt, YearRemodAdd , MasVnrArea,BsmtFinSF1, BsmtUnfSF, X1stFlrSF, X2ndFlrSF,BsmtFullBath, BedroomAbvGr, KitchenAbvGr,TotRmsAbvGrd,Fireplaces, GarageCars, WoodDeckSF,ScreenPorch,SalePrice)
lm2 <- lm(SalePrice ~., data = train.df6)
summary(lm2)
##
## Call:
## lm(formula = SalePrice ~ ., data = train.df6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -496010 -16475 -2014 13623 287203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1072422.887 118493.033 -9.05 < 2e-16 ***
## MSSubClass -157.617 26.193 -6.02 0.0000000023 ***
## LotArea 0.424 0.100 4.23 0.0000254270 ***
## OverallQual 17923.757 1191.524 15.04 < 2e-16 ***
## OverallCond 4374.403 1028.318 4.25 0.0000224056 ***
## YearBuilt 305.931 52.920 5.78 0.0000000092 ***
## YearRemodAdd 206.456 66.438 3.11 0.00192 **
## MasVnrArea 30.334 5.957 5.09 0.0000004026 ***
## BsmtFinSF1 16.861 4.473 3.77 0.00017 ***
## BsmtUnfSF 8.674 4.434 1.96 0.05065 .
## X1stFlrSF 49.624 5.556 8.93 < 2e-16 ***
## X2ndFlrSF 46.129 4.170 11.06 < 2e-16 ***
## BsmtFullBath 9725.108 2445.664 3.98 0.0000735367 ***
## BedroomAbvGr -9163.348 1696.903 -5.40 0.0000000782 ***
## KitchenAbvGr -16145.471 5650.937 -2.86 0.00434 **
## TotRmsAbvGrd 5692.693 1232.527 4.62 0.0000042177 ***
## Fireplaces 3577.771 1770.169 2.02 0.04346 *
## GarageCars 10446.171 1731.250 6.03 0.0000000020 ***
## WoodDeckSF 23.510 7.976 2.95 0.00326 **
## ScreenPorch 52.485 16.945 3.10 0.00199 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34900 on 1393 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.807
## F-statistic: 311 on 19 and 1393 DF, p-value: <2e-16
Prediction
#str(test.df)
#View(test.df)
# cleaning up
test.df1 <- dplyr::select (test.df, Id, MSSubClass , LotArea , OverallQual , OverallCond,YearBuilt, YearRemodAdd , MasVnrArea,BsmtFinSF1, X1stFlrSF, X2ndFlrSF,BsmtFullBath, BedroomAbvGr, KitchenAbvGr,TotRmsAbvGrd, GarageCars, WoodDeckSF,ScreenPorch)
missing.values(test.df1)%>% kable()
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
|variables    | missing|
|:------------|-------:|
|MasVnrArea   |      15|
|BsmtFullBath |       2|
|BsmtFinSF1   |       1|
|GarageCars   |       1|
test.df2 <- test.df1
test.df2$MasVnrArea <- impute( test.df2$MasVnrArea , median)
test.df2$BsmtFullBath <- impute( test.df2$BsmtFullBath , median)
test.df2$BsmtFinSF1 <- impute( test.df2$BsmtFinSF1 , median)
test.df2$GarageCars <- impute( test.df2$GarageCars , median)
#length(test.df2)
#str(test.df2)
#length(train.df8)
#str(train.df8)
predict1 <- predict(lm4, test.df2)
#str(predict1)
SalePrice <- predict1
Id <- test.df2$Id
Predi_SalePrice <- data.frame(Id, SalePrice)
#View(Predi_SalePrice)
write.csv(Predi_SalePrice, file = "Predicting_House_Sale_Price.csv", quote = F, row.names = F)
Interpretation of the summary output of multiple linear regression model 4
F-statistic: how good is the relationship between the predictors and the response (SalePrice)? Here the F-statistic is 346, which indicates there is a relationship between the predictors and the response; its strength cannot be judged from the F-statistic alone.
R-squared: 0.806, meaning 80.6% of the variability in SalePrice is explained by YearBuilt, LotArea, MSSubClass, OverallCond, etc. Because we have several independent variables, adjusted R-squared < R-squared. R-squared is measured on a scale from 0 to 1, with R^2 = 1 being a perfect fit; 80.6% indicates a good fit to the training dataset.
Residual standard error: a measure of the quality of the regression fit; every regression model carries some error. Here the actual SalePrice deviates from the fitted regression line by about 34,900 dollars on average, given the independent variables (YearBuilt, LotArea, MSSubClass, OverallCond, etc.).
p-values: at the 95% confidence level, 2.2e-16 < 0.05, which indicates that changes in the predictors (YearBuilt, LotArea, MSSubClass, OverallCond, etc.) are associated with changes in the response variable (SalePrice).
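As a sanity check on the quoted statistics, a short sketch that recomputes the residual standard error and R-squared by hand (from lm2, the last model whose code appears in this section; lm4 is fitted earlier in the full report):
# Sketch: recompute headline fit statistics for lm2 and compare to summary(lm2).
res <- residuals(lm2)
rse <- sqrt(sum(res^2) / df.residual(lm2))                                        # residual standard error
r2  <- 1 - sum(res^2) / sum((train.df6$SalePrice - mean(train.df6$SalePrice))^2)  # multiple R-squared
c(RSE = rse, R.squared = r2)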