Problem 1

Probability Density 1: X ~ Gamma.

Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by n (a size parameter) and lambda (\(\lambda\), a shape parameter). Choose any n greater than 3 and an expected value (\(\lambda\)) between 2 and 10 (you choose).

set.seed(123)
n <- 4
lambda <- 3
X <- rgamma(10000, n, lambda)
hist(X)

Probability Density 2: Y ~ Sum of Exponentials.

Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (\(\lambda\)). The n and \(\lambda\) must be the same as in the previous case. (e.g., mysum = rexp(10000, \(\lambda\)) + rexp(10000, \(\lambda\)) + …)

set.seed(123)

Y <- rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda)
hist(Y)
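As a quick visual check (a sketch, beyond what the prompt asks), a QQ-plot of X against Y should track the 45-degree line, since the sum of n independent Exp(\(\lambda\)) variables is itself Gamma(n, \(\lambda\)):

# X and Y should follow the same Gamma(n, lambda) distribution
qqplot(X, Y, main = "QQ-plot: Gamma draws vs. sum of exponentials")
abline(0, 1, col = "red")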

Probability Density 3: Z ~ Exponential.

Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (\(\lambda\)).

set.seed(123)

Z <- rexp(10000, rate = lambda)
hist(Z)

1a. Calculate the empirical expected values (means) and variances of all three pdfs:

# Calculate empirical mean and variance of the Gamma pdf (X)
m1 <- mean(X)
v1 <- var(X)

# Calculate empirical mean and variance of the sum-of-exponentials pdf (Y)
m2 <- mean(Y)
v2 <- var(Y)

# Calculate empirical mean and variance of the single exponential pdf (Z)
m3 <- mean(Z)
v3 <- var(Z)

cat("The empirical mean of Gamma pdf:", m1,"\n","The empirical variance of Gamma pdf:", v1,"\n","The empirical mean of Sum of Exponentials pdf:", m2,"\n","The empirical variance of Sum of Exponentials pdf:", v2,"\n","The empirical mean of Single Exponential pdf:", m3,"\n","The empirical variance of Single Exponential pdf:", v3,"\n")                                                  
## The empirical mean of Gamma pdf: 1.320105 
##  The empirical variance of Gamma pdf: 0.4303175 
##  The empirical mean of Sum of Exponentials pdf: 1.33651 
##  The empirical variance of Sum of Exponentials pdf: 0.437936 
##  The empirical mean of Single Exponential pdf: 0.3345938 
##  The empirical variance of Single Exponential pdf: 0.1110849

1b. Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y).

Expected value and variance of X (Gamma pdf) using calculus:

The expected value of the Gamma pdf is the integral of \(x\) times the pdf over the range of the distribution, i.e. \(E[X] = \int_0^\infty x\, f(x)\, dx\). Similarly, the variance is the integral of the squared deviation from the expected value, weighted by the pdf, over the same range: \(Var(X) = \int_0^\infty (x - E[X])^2 f(x)\, dx\). For a Gamma pdf with shape \(n\) and rate \(\lambda\), these integrals evaluate to \(E[X] = n/\lambda\) and \(Var(X) = n/\lambda^2\).

# Calculate expected value
x_times_pdf <- Vectorize(function(x) x * dgamma(x, n, lambda))
expected_X <- integrate(x_times_pdf, lower = 0, upper = Inf)$value

# calculate variance
integrand <- Vectorize(function(x) (x - expected_X)^2 * dgamma(x, n, lambda))
variance_X <- integrate(integrand, lower = 0, upper = Inf)$value

cat("Expected value: ", expected_X, "\n","Variance: ", variance_X, "\n")
## Expected value:  1.333333 
##  Variance:  0.4444444

Expected value of the exponential (Z) using the moment generating function: for the exponential distribution Z, the moment generating function is \(M_Z(x) = \frac{\lambda}{\lambda - x}\) for \(x < \lambda\), where \(\lambda\) is the rate parameter. Its first derivative is \(M_Z'(x) = \frac{\lambda}{(\lambda - x)^2}\), which evaluated at 0 gives the expected value of the exponential distribution, \(M_Z'(0) = 1/\lambda\).

moment_Z<- expression(lambda / (lambda - x))
derivative_moment_Z <- D(moment_Z,'x')
expected_Z <- eval(derivative_moment_Z ,list(x=0))

cat("The expected value of exponential pdf Z using moment generating function:",expected_Z,"\n")
## The expected value of exponential pdf Z using moment generating function: 0.3333333

Expected value of the sum of exponentials (Y) using the moment generating function: the MGF of a sum of independent random variables is the product of their individual MGFs, so \(M_Y(x) = \left(\frac{\lambda}{\lambda - x}\right)^{n}\). Its first derivative, \(M_Y'(x) = \frac{n\lambda^n}{(\lambda - x)^{n+1}}\), evaluated at 0 gives the expected value \(M_Y'(0) = n/\lambda\).

moment_Y <- expression((lambda / (lambda - x))^n)
derivative_moment_Y <- D(moment_Y,'x')
expected_Y <- eval(derivative_moment_Y ,list(x=0))

cat("The expected value of sum of exponential pdf Y using moment generating function:",expected_Y,"\n")
## The expected value of sum of exponential pdf Y using moment generating function: 1.333333
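As an optional extension (a sketch going beyond the prompt), the variance also falls out of the MGF as \(M_Y''(0) - M_Y'(0)^2 = n(n+1)/\lambda^2 - (n/\lambda)^2 = n/\lambda^2\), which matches the calculus result for X in 1b (4/9 ≈ 0.4444):

# Differentiate the MGF a second time; E[Y^2] - (E[Y])^2 gives the variance
second_derivative_moment_Y <- D(derivative_moment_Y, 'x')
variance_Y <- eval(second_derivative_moment_Y, list(x = 0)) - expected_Y^2
variance_Y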

1c. Probability. For pdf Z (the exponential), empirically calculate probabilities a through c. Then evaluate through calculus whether the memoryless property holds; for an exponential, memorylessness means \(P(Z > s + t \mid Z > s) = P(Z > t)\).

a. P(Z> λ| Z> λ/2) b. P(Z> 2λ| Z> λ) c. P(Z> 3λ| Z> λ)

## a: Calculate P(Z > λ | Z > λ/2):
# Count number of observations where Z > λ/2
n1 <- sum(Z > lambda/2)
# Count number of observations where Z > λ (any such observation also satisfies Z > λ/2)
n2 <- sum(Z > lambda)
# Calculate empirical probability of P(Z > λ | Z > λ/2)
prob_a <- n2/n1

## b: Calculate P(Z > 2λ | Z > λ):
# Count number of observations where Z > λ
n3 <- sum(Z > lambda)
# Count number of observations where Z > 2λ and Z > λ
n4 <- sum(Z > 2*lambda & Z > lambda)
# Calculate empirical probability of P(Z > 2λ | Z > λ)
prob_b <- n4/n3

## c: Calculate P(Z > 3λ | Z > λ):
# Count number of observations where Z > λ
n5 <- sum(Z > lambda)
# Count number of observations where Z > 3λ and Z > λ
n6 <- sum(Z > 3*lambda & Z > lambda)
# Calculate empirical probability of P(Z > 3λ | Z > λ)
prob_c<- n6/n5

cat("The probability of the exponential distribution Z for condition-a is:",prob_a,"\n","The probability of the exponential distribution Z for condition-b is:",prob_b,"\n","The probability of the exponential distribution Z for condition-c is:",prob_c,"\n")
## The probability of the exponential distribution Z for condition-a is: 0.0106383 
##  The probability of the exponential distribution Z for condition-b is: 0 
##  The probability of the exponential distribution Z for condition-c is: 0
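As a cross-check (a sketch going beyond the prompt), the theoretical conditional probabilities follow from the exponential survival function and explain the empirical results: condition a is \(e^{-\lambda^2/2} \approx 0.0111\), close to the empirical 0.0106, while b and c are about \(1.2\times10^{-4}\) and \(1.5\times10^{-8}\), far too rare to appear among 10,000 draws, which is why their empirical estimates are 0.

# Theoretical conditional probabilities via the survival function 1 - pexp(z, rate = lambda)
theo_a <- (1 - pexp(lambda, rate = lambda))   / (1 - pexp(lambda/2, rate = lambda))
theo_b <- (1 - pexp(2*lambda, rate = lambda)) / (1 - pexp(lambda, rate = lambda))
theo_c <- (1 - pexp(3*lambda, rate = lambda)) / (1 - pexp(lambda, rate = lambda))
c(theo_a, theo_b, theo_c)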
## Check for condition-a: memorylessness requires P(Z > lambda | Z > lambda/2) = P(Z > lambda/2)
# Unconditional probability: P(Z > lambda/2)
uncond_prob_a <- 1 - pexp(lambda/2, rate = lambda)

# Conditional probability: P(Z > lambda | Z > lambda/2) = P(Z > lambda) / P(Z > lambda/2)
cond_prob_a <- integrate(function(x) dexp(x, rate = lambda), lower = lambda, upper = Inf)$value / uncond_prob_a

# Check if memoryless property holds
if(abs(cond_prob_a - uncond_prob_a) < 1e-6){
  print("For P(Z > lambda | Z > lambda/2), memoryless property holds")
}else{
  print("For P(Z > lambda | Z > lambda/2), memoryless property does not hold")
}
## [1] "For P(Z > lambda | Z > lambda/2), memoryless property holds"
## Check for condition-b: memorylessness requires P(Z > 2lambda | Z > lambda) = P(Z > lambda)
# Unconditional probability: P(Z > lambda)
uncond_prob_b <- 1 - pexp(lambda, rate = lambda)

# Conditional probability: P(Z > 2lambda | Z > lambda) = P(Z > 2lambda) / P(Z > lambda)
cond_prob_b <- integrate(function(x) dexp(x, rate = lambda), lower = 2*lambda, upper = Inf)$value / uncond_prob_b

# Check if memoryless property holds
if(abs(cond_prob_b - uncond_prob_b) < 1e-6){
  print("For P(Z > 2lambda | Z > lambda), memoryless property holds")
}else{
  print("For P(Z > 2lambda | Z > lambda), memoryless property does not hold")
}
## [1] "For P(Z > 2lambda | Z > lambda), memoryless property holds"
## Check for condition-c: memorylessness requires P(Z > 3lambda | Z > lambda) = P(Z > 2lambda)
# Unconditional probability: P(Z > 2lambda)
uncond_prob_c <- 1 - pexp(2*lambda, rate = lambda)

# Conditional probability: P(Z > 3lambda | Z > lambda) = P(Z > 3lambda) / P(Z > lambda)
cond_prob_c <- integrate(function(x) dexp(x, rate = lambda), lower = 3*lambda, upper = Inf)$value / (1 - pexp(lambda, rate = lambda))

# Check if memoryless property holds
if(abs(cond_prob_c - uncond_prob_c) < 1e-6){
  print("For P(Z > 3lambda | Z > lambda), memoryless property holds")
}else{
  print("For P(Z > 3lambda | Z > lambda), memoryless property does not hold")
}
## [1] "For P(Z > 3lambda | Z > lambda), memoryless property holds"

1d. Loosely investigate whether P(YZ) = P(Y) P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities.

To investigate whether P(YZ) = P(Y) P(Z), we can build a table with quartiles of Y and Z and then evaluate the marginal and joint probabilities.

# Calculate quartiles of Y and Z
Y_quartiles <- quantile(Y, probs = c(0.25, 0.5, 0.75,1.0))
Z_quartiles <- quantile(Z, probs = c(0.25, 0.5, 0.75,1.0))

# Loosely pair the variables: sample 10,000 values from the quartile breakpoints of Y and of Z, and paste each pair into a single label
single_vector <- paste(sample(Y_quartiles,10000, replace = TRUE), sample(Z_quartiles,10000, replace = TRUE))

# Create a matrix of the counts of the unique combinations of quartile ranges from Y_quartiles and Z_quartiles
matrix_counts <- matrix(table(single_vector), ncol = 4)

# Convert the count matrix to a matrix of proportion that provides a table of joint probabilities for each quartile combination of Y and Z.
matrix_prop<- prop.table(matrix_counts)

# Convert the matrix to a data frame and add Sum columns and rows to the data frame
df <- data.frame(matrix_prop)
df$Sum <- rowSums(matrix_prop)
df <- rbind(df, colSums(df))
row.names(df)[nrow(df)] <- "Sum"

# Specify the column and row names
colnames(df) <- c("1st Quartile Y", "2nd Quartile Y", "3rd Quartile Y", "4th Quartile Y", "Sum")
rownames(df) <- c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z", "Sum")

knitr::kable(df)
|                | 1st Quartile Y | 2nd Quartile Y | 3rd Quartile Y | 4th Quartile Y |    Sum |
|----------------|----------------|----------------|----------------|----------------|--------|
| 1st Quartile Z |         0.0621 |         0.0606 |         0.0641 |         0.0679 | 0.2547 |
| 2nd Quartile Z |         0.0598 |         0.0635 |         0.0691 |         0.0625 | 0.2549 |
| 3rd Quartile Z |         0.0607 |         0.0607 |         0.0624 |         0.0643 | 0.2481 |
| 4th Quartile Z |         0.0615 |         0.0606 |         0.0596 |         0.0606 | 0.2423 |
| Sum            |         0.2441 |         0.2454 |         0.2552 |         0.2553 | 1.0000 |
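As an alternative (a sketch; the construction above only loosely pairs draws by sampling quartile breakpoints), one could bin the actual 10,000 paired draws by quartile and cross-tabulate; since Y and Z are simulated independently, this should likewise put roughly 1/16 of the mass in each cell.

# Bin each draw by its own quartile and cross-tabulate the (Z, Y) pairs
Y_bins <- cut(Y, breaks = quantile(Y, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
Z_bins <- cut(Z, breaks = quantile(Z, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
prop.table(table(Z_bins, Y_bins))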

To compare P(YZ) with P(Y)*P(Z), we can check whether the ratio of these two is close to 1. Now, create a new table for the product of marginal probabilities and calculate the ratio.

# Marginal probabilities, taken from the table's Sum row and Sum column
marginal_Y <- as.numeric(df["Sum", -5])
names(marginal_Y) <- colnames(df)[-5]
marginal_Y
## 1st Quartile Y 2nd Quartile Y 3rd Quartile Y 4th Quartile Y 
##         0.2441         0.2454         0.2552         0.2553
marginal_Z <- df[-5, "Sum"]
names(marginal_Z) <- rownames(df)[-5]
marginal_Z
## 1st Quartile Z 2nd Quartile Z 3rd Quartile Z 4th Quartile Z 
##         0.2547         0.2549         0.2481         0.2423
# Product of marginals, oriented like the joint table (rows = Z, columns = Y)
product_marginals <- outer(marginal_Z, marginal_Y)

# Calculate the ratio of joint probabilities to the product of marginals
ratio <- as.matrix(df[-5, -5]) / product_marginals
round(ratio, 3)
##                1st Quartile Y 2nd Quartile Y 3rd Quartile Y 4th Quartile Y
## 1st Quartile Z          0.999          0.970          0.986          1.044
## 2nd Quartile Z          0.961          1.015          1.062          0.960
## 3rd Quartile Z          1.002          0.997          0.986          1.015
## 4th Quartile Z          1.040          1.019          0.964          0.980

The ratios in the table above are all close to 1 (between roughly 0.96 and 1.06), so the joint probabilities P(YZ) are approximately equal to the products of the marginal probabilities P(Y)P(Z) across the quartiles of Y and Z. This is what we would expect, since Y and Z were simulated independently; loosely, P(YZ) = P(Y)*P(Z) appears to hold.

1e. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Both Fisher's Exact Test and the Chi Square Test are used to determine whether there is a statistically significant association between two categorical variables.

Fisher's Exact Test is typically used when the sample size is small: it calculates the exact probability of the observed table under the null hypothesis of independence. The Chi Square Test is used when the sample size is large: it approximates the distribution of the test statistic with a chi-squared distribution.

In both tests, if the p-value is very low (usually less than 0.05), we reject the null hypothesis and conclude that there is a statistically significant association between the two variables.

Perform Fisher's Exact Test on the table of counts:

# Recover the table of counts from the table of proportions
count_table <- matrix_prop*10000
fisher.test(count_table, simulate.p.value = TRUE)
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  count_table
## p-value = 0.4228
## alternative hypothesis: two.sided

Perform Chi Square Test:

chisq.test(count_table)
## 
##  Pearson's Chi-squared test
## 
## data:  count_table
## X-squared = 9.1087, df = 9, p-value = 0.4273

Based on the p-values above (≈0.42 for both tests), we fail to reject the null hypothesis: there is no evidence of an association between Y and Z, so independence holds under both Fisher's Exact Test and the Chi Square Test. For the sample at hand, the Chi Square Test is the more appropriate choice: with 10,000 observations the large-sample approximation is reliable (see the expected-count sketch below), whereas Fisher's exact test already needed a simulated p-value to remain computationally feasible on a table of this size.
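A quick way to confirm that the chi-square approximation is safe (a hypothetical check, not part of the original analysis) is to inspect the expected cell counts, which should all comfortably exceed 5:

# Expected counts under independence; each is around 600, far above the usual minimum of 5
chisq.test(count_table)$expected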

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Load the libraries

library(readr)
library(tidyverse)
library(ggplot2)
library(pracma)
library(MASS)

Read data

df_train<-read.csv("https://raw.githubusercontent.com/Raji030/data605_final/main/train.csv")
str(df_train)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
# Get names of columns that have at least one NA value
na_cols <- names(df_train)[colSums(is.na(df_train)) > 0]
na_cols
##  [1] "LotFrontage"  "Alley"        "MasVnrType"   "MasVnrArea"   "BsmtQual"    
##  [6] "BsmtCond"     "BsmtExposure" "BsmtFinType1" "BsmtFinType2" "Electrical"  
## [11] "FireplaceQu"  "GarageType"   "GarageYrBlt"  "GarageFinish" "GarageQual"  
## [16] "GarageCond"   "PoolQC"       "Fence"        "MiscFeature"
# Subset data for a few columns
df_subset <- subset(df_train, select = c("SalePrice", "LotArea", "OverallQual", "OverallCond", "YearBuilt"))
head(df_subset)
##   SalePrice LotArea OverallQual OverallCond YearBuilt
## 1    208500    8450           7           5      2003
## 2    181500    9600           6           8      1976
## 3    223500   11250           7           5      2001
## 4    140000    9550           7           5      1915
## 5    250000   14260           8           5      2000
## 6    143000   14115           5           5      1993
df_test<-read.csv("https://raw.githubusercontent.com/Raji030/data605_final/main/test.csv")

Univariate descriptive statistics and appropriate plots

# Summary statistics
summary(df_subset)
##    SalePrice         LotArea        OverallQual      OverallCond   
##  Min.   : 34900   Min.   :  1300   Min.   : 1.000   Min.   :1.000  
##  1st Qu.:129975   1st Qu.:  7554   1st Qu.: 5.000   1st Qu.:5.000  
##  Median :163000   Median :  9478   Median : 6.000   Median :5.000  
##  Mean   :180921   Mean   : 10517   Mean   : 6.099   Mean   :5.575  
##  3rd Qu.:214000   3rd Qu.: 11602   3rd Qu.: 7.000   3rd Qu.:6.000  
##  Max.   :755000   Max.   :215245   Max.   :10.000   Max.   :9.000  
##    YearBuilt   
##  Min.   :1872  
##  1st Qu.:1954  
##  Median :1973  
##  Mean   :1971  
##  3rd Qu.:2000  
##  Max.   :2010
# Histogram for SalePrice
ggplot(df_subset, aes(x = SalePrice)) + 
  geom_histogram(binwidth = 50000) + 
  labs(title = "Histogram of SalePrice", x = "SalePrice", y = "Count")

# Boxplot for OverallQual (treated as a factor so each rating gets its own box)
ggplot(df_subset, aes(x = factor(OverallQual), y = SalePrice)) + 
  geom_boxplot() + 
  labs(title = "Boxplot of SalePrice by OverallQual", x = "OverallQual", y = "SalePrice")

Scatterplot matrix

# Scatterplot matrix for SalePrice, LotArea, and YearBuilt
pairs(df_subset[c("SalePrice", "LotArea", "YearBuilt")], 
      main = "Scatterplot Matrix for SalePrice, LotArea, and YearBuilt")

Correlation matrix and hypothesis testing

# Correlation matrix for SalePrice, LotArea, and OverallQual
cor_matrix<-cor(df_subset[c("SalePrice", "LotArea", "OverallQual")])
cor_matrix
##             SalePrice   LotArea OverallQual
## SalePrice   1.0000000 0.2638434   0.7909816
## LotArea     0.2638434 1.0000000   0.1058057
## OverallQual 0.7909816 0.1058057   1.0000000
# Hypothesis testing for pairwise correlations
cor.test(df_subset$SalePrice, df_subset$LotArea, method = "pearson", conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  df_subset$SalePrice and df_subset$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434
cor.test(df_subset$SalePrice, df_subset$OverallQual, method = "pearson", conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  df_subset$SalePrice and df_subset$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816
cor.test(df_subset$LotArea, df_subset$OverallQual, method = "pearson", conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  df_subset$LotArea and df_subset$OverallQual
## t = 4.0629, df = 1458, p-value = 5.106e-05
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.07250156 0.13887424
## sample estimates:
##       cor 
## 0.1058057

From the univariate analysis, the mean sale price of the houses is USD 180,921, with a minimum of USD 34,900 and a maximum of USD 755,000. The lot area ranges from 1,300 sq. ft. to 215,245 sq. ft., with a mean of 10,517 sq. ft. Overall quality ranges from 1 to 10 with a mean of 6.099, and overall condition ranges from 1 to 9 with a mean of 5.575. Year built ranges from 1872 to 2010, with a mean of 1971. The bivariate analysis shows that sale price is positively correlated with the overall quality of the houses and with the lot area; the scatterplot matrix also suggests a positive association between sale price and year built.
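As a quick check of that last claim (a sketch; output not shown), the correlation can be computed directly:

# Correlation between sale price and year built
cor(df_subset$SalePrice, df_subset$YearBuilt)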

The correlation matrix for the three quantitative variables (sale price, lot area, and overall quality) shows a strong positive correlation between sale price and overall quality (r = 0.79), a moderate positive correlation between sale price and lot area (r = 0.26), and a weak positive correlation between overall quality and lot area (r = 0.11).

Based on the correlation tests above, all three 80% confidence intervals exclude zero and the p-values are far below 0.05, indicating statistically significant positive correlations between the variables. Because the p-values are so small (two are below 2.2e-16 and the third is 5.1e-05), the conclusions survive even a conservative multiple-testing correction, so I am not worried about a familywise error causing us to falsely reject a null hypothesis.
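As a quick, hypothetical robustness check (using the p-values reported above; the first two are upper bounds), a Bonferroni correction still leaves every adjusted p-value far below 0.05:

# Bonferroni-adjust the three pairwise correlation p-values
pvals <- c(2.2e-16, 2.2e-16, 5.106e-05)
p.adjust(pvals, method = "bonferroni")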

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

# Invert the correlation matrix to obtain the precision matrix
preci_matrix <- solve(cor_matrix)
preci_matrix
##              SalePrice    LotArea OverallQual
## SalePrice    2.9280384 -0.5334669  -2.2595806
## LotArea     -0.5334669  1.1085153   0.3046752
## OverallQual -2.2595806  0.3046752   2.7550503
# Multiply correlation matrix by precision matrix
cor_preci_prod <- cor_matrix %*% preci_matrix
cor_preci_prod
##                 SalePrice      LotArea   OverallQual
## SalePrice    1.000000e+00 5.551115e-17  0.000000e+00
## LotArea     -5.551115e-17 1.000000e+00 -5.551115e-17
## OverallQual -4.440892e-16 0.000000e+00  1.000000e+00
# Multiply precision matrix by correlation matrix
preci_cor_prod <- preci_matrix %*% cor_matrix
preci_cor_prod
##                  SalePrice       LotArea   OverallQual
## SalePrice    1.000000e+00 -5.551115e-17 -4.440892e-16
## LotArea      5.551115e-17  1.000000e+00  0.000000e+00
## OverallQual  0.000000e+00 -5.551115e-17  1.000000e+00

Both products are the identity matrix up to floating-point rounding, as expected when a matrix is multiplied by its inverse in either order.
# Perform LU decomposition
lu_decom <- lu(cor_matrix)
lu_decom
## $L
##             SalePrice    LotArea OverallQual
## SalePrice   1.0000000  0.0000000           0
## LotArea     0.2638434  1.0000000           0
## OverallQual 0.7909816 -0.1105879           1
## 
## $U
##             SalePrice   LotArea OverallQual
## SalePrice           1 0.2638434   0.7909816
## LotArea             0 0.9303867  -0.1028895
## OverallQual         0 0.0000000   0.3629698
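As a sanity check (a sketch), multiplying the two factors back together should reproduce the correlation matrix:

# L %*% U should equal cor_matrix up to floating-point rounding
all.equal(lu_decom$L %*% lu_decom$U, cor_matrix)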

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

The histogram of SalePrice above shows that the variable is right-skewed. From the summary of SalePrice, its minimum value (USD 34,900) is already well above zero, and a house's sale price can never be zero in practice, so no shifting is needed here.

# Fit exponential distribution by using 'fitdistr' function from the MASS package
fit <- fitdistr(df_subset$SalePrice, "exponential")

# Find the optimal (maximum-likelihood) value of lambda; for an exponential this is 1/mean(x)
lambda <- fit$estimate

# Generate 1000 random samples from the exponential distribution using the estimated lambda value
samples <- rexp(1000, rate = lambda)

Plot histograms to compare the original variable and the generated exponential samples

# Histogram of original variable
hist(df_subset$SalePrice,main = "Histogram for original variable")

# Histogram of exponential samples
hist(samples, main = "Histogram of exponential samples")
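A direct overlay makes the comparison easier (a sketch using base graphics and the fitted lambda from above):

# Overlay the fitted exponential density on the SalePrice histogram
hist(df_subset$SalePrice, freq = FALSE, main = "Histogram of SalePrice with fitted exponential density")
curve(dexp(x, rate = lambda), add = TRUE, col = "red")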

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF)

# Find percentiles using exponential CDF 
p_5th <- qexp(0.05, rate = lambda)
p_95th <- qexp(0.95, rate = lambda)
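Equivalently (a sketch), the exponential inverse CDF has the closed form \(q(p) = -\ln(1-p)/\lambda\), which is what qexp computes:

# Closed-form exponential quantiles; these should match p_5th and p_95th above
lambda_hat <- as.numeric(lambda)  # fitdistr returns a named estimate
c(p5 = -log(1 - 0.05) / lambda_hat, p95 = -log(1 - 0.95) / lambda_hat)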

Calculate a 95% confidence interval from the empirical data assuming normality

# Generate 95% confidence interval
mean_value <- mean(df_subset$SalePrice)
sd_value <- sd(df_subset$SalePrice)
n <- length(df_subset$SalePrice)
alpha <- 0.05

lower_ci <- mean_value - qnorm(1 - alpha/2) * (sd_value / sqrt(n))
upper_ci <- mean_value + qnorm(1 - alpha/2) * (sd_value / sqrt(n))
cat("lower interval is:",lower_ci,"and upper interval is:",upper_ci)
## lower interval is: 176846.2 and upper interval is: 184996.2

Determine the empirical 5th and 95th percentiles of the original variable, and discuss

# Calculate empirical percentiles
empirical_5th <- quantile(df_subset$SalePrice, 0.05)
empirical_95th <- quantile(df_subset$SalePrice, 0.95)

Discussion: with the fitted rate \(\hat{\lambda} = 1/\bar{x}\) (about 1/180,921), the exponential's 5th and 95th percentiles are roughly USD 9,300 and USD 542,000. The exponential 5th percentile falls far below the observed minimum sale price of USD 34,900, so the exponential places far too much mass on implausibly low prices, and its heavy right tail likely overstates the spread at the top as well; it is a poor closed-form fit for SalePrice. Note also that the 95% confidence interval computed above (USD 176,846 to USD 184,996) describes the mean sale price under a normality assumption, not the spread of individual prices, which is why it is so much narrower than either pair of percentiles.

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.

Build the multiple regression model

To build the multiple regression model I chose the variables OverallQual, GrLivArea, TotalBsmtSF, GarageCars, YearBuilt, Neighborhood, and ExterQual, because I expect them to be among the most impactful predictors of sale price. These variables also have no missing values in the training data.

# Select few variables that look significant to SalePrice
selected_vars <- c("SalePrice", "OverallQual", "GrLivArea", "TotalBsmtSF", "GarageCars", "YearBuilt", "Neighborhood", "ExterQual")

# Get a subset of the train data
subset_data <- df_train[, selected_vars]

# Convert qualitative variables to factors
subset_data$Neighborhood <- as.factor(subset_data$Neighborhood)
subset_data$ExterQual <- as.factor(subset_data$ExterQual)

# Build regression model
model <- lm(SalePrice ~ OverallQual + GrLivArea + TotalBsmtSF + GarageCars + YearBuilt + Neighborhood + ExterQual, data = subset_data)

# Get model summary
summary(model)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + TotalBsmtSF + 
##     GarageCars + YearBuilt + Neighborhood + ExterQual, data = subset_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -414812  -14617     223   13275  269280 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -3.958e+05  1.368e+05  -2.893 0.003876 ** 
## OverallQual          1.386e+04  1.207e+03  11.487  < 2e-16 ***
## GrLivArea            4.673e+01  2.464e+00  18.969  < 2e-16 ***
## TotalBsmtSF          1.966e+01  2.700e+00   7.284 5.35e-13 ***
## GarageCars           1.223e+04  1.683e+03   7.266 6.06e-13 ***
## YearBuilt            2.106e+02  6.869e+01   3.066 0.002208 ** 
## NeighborhoodBlueste -1.325e+04  2.565e+04  -0.516 0.605607    
## NeighborhoodBrDale  -1.670e+04  1.237e+04  -1.350 0.177354    
## NeighborhoodBrkSide  1.386e+04  1.062e+04   1.304 0.192312    
## NeighborhoodClearCr  3.838e+04  1.095e+04   3.504 0.000473 ***
## NeighborhoodCollgCr  1.797e+04  8.776e+03   2.047 0.040807 *  
## NeighborhoodCrawfor  3.935e+04  1.046e+04   3.763 0.000175 ***
## NeighborhoodEdwards  7.321e+02  9.733e+03   0.075 0.940051    
## NeighborhoodGilbert  1.062e+04  9.312e+03   1.140 0.254474    
## NeighborhoodIDOTRR  -3.768e+02  1.132e+04  -0.033 0.973456    
## NeighborhoodMeadowV -4.498e+02  1.229e+04  -0.037 0.970805    
## NeighborhoodMitchel  8.616e+03  9.973e+03   0.864 0.387750    
## NeighborhoodNAmes    9.639e+03  9.272e+03   1.040 0.298736    
## NeighborhoodNoRidge  7.223e+04  1.009e+04   7.160 1.29e-12 ***
## NeighborhoodNPkVill -6.581e+03  1.436e+04  -0.458 0.646763    
## NeighborhoodNridgHt  5.508e+04  9.365e+03   5.881 5.06e-09 ***
## NeighborhoodNWAmes   7.472e+03  9.616e+03   0.777 0.437288    
## NeighborhoodOldTown -3.199e+03  1.037e+04  -0.308 0.757864    
## NeighborhoodSawyer   1.132e+04  9.820e+03   1.152 0.249310    
## NeighborhoodSawyerW  1.160e+04  9.543e+03   1.215 0.224402    
## NeighborhoodSomerst  2.244e+04  9.081e+03   2.471 0.013597 *  
## NeighborhoodStoneBr  6.941e+04  1.080e+04   6.430 1.74e-10 ***
## NeighborhoodSWISU    2.215e+01  1.204e+04   0.002 0.998532    
## NeighborhoodTimber   3.270e+04  1.004e+04   3.259 0.001145 ** 
## NeighborhoodVeenker  5.382e+04  1.327e+04   4.057 5.25e-05 ***
## ExterQualFa         -5.835e+04  1.174e+04  -4.969 7.55e-07 ***
## ExterQualGd         -5.277e+04  5.676e+03  -9.296  < 2e-16 ***
## ExterQualTA         -5.581e+04  6.447e+03  -8.656  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34050 on 1427 degrees of freedom
## Multiple R-squared:  0.8204, Adjusted R-squared:  0.8163 
## F-statistic: 203.7 on 32 and 1427 DF,  p-value: < 2.2e-16

The R-squared value for the model is 0.8204, indicating a good fit: the model explains 82.04% of the variability in sale price. Additionally, the F-test p-value is well below 0.05, so the overall model is statistically significant, meaning at least one of the independent variables has a significant relationship with sale price. Overall, based on the R-squared value and the statistical significance of the predictors, the model appears well-fitted and suitable for predicting sale price from the selected independent variables.
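As an additional, hypothetical in-sample check (not part of the original submission), the model's prediction error can be summarized as a root-mean-squared error in dollars, which should be close to the residual standard error reported above:

# In-sample RMSE of the fitted model, in dollars
sqrt(mean(residuals(model)^2))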

Residual analysis

par(mfrow=c(2,2))
plot(model)

From the residual plots above, the distribution of residuals appears approximately normal, and there is no clear evidence of heteroscedasticity: the spread of the residuals remains fairly consistent across the range of fitted values, with no discernible patterns or trends. This suggests the model adequately captures the relationships between the independent variables and the dependent variable, and that the assumptions of the multiple regression model are reasonably met.

Subset test data and prepare for prediction

# Select predictor variables 
selected_vars1 <- c("Id", "OverallQual", "GrLivArea", "TotalBsmtSF", "GarageCars", "YearBuilt", "Neighborhood", "ExterQual")

# Get a subset of the test data
subset_test <- df_test[, selected_vars1]

# Check if any missing value present in the subset test data
sapply(subset_test, function(x){sum(is.na(x))})
##           Id  OverallQual    GrLivArea  TotalBsmtSF   GarageCars    YearBuilt 
##            0            0            0            1            1            0 
## Neighborhood    ExterQual 
##            0            0
# Replace missing values with mean of specific columns
subset_test <- subset_test%>% 
  mutate(across(c("TotalBsmtSF", "GarageCars"), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Change categorical variables into factor in test dataset
subset_test <- subset_test %>%
  mutate_at(vars(Neighborhood, ExterQual), as.factor)

Predict sale price for test dataset

predicted <- predict(model, subset_test)
data1 <- data.frame(Id = subset_test$Id, SalePrice=predicted)
write.csv(data1,"kagglesubmission.csv",row.names = FALSE)

Kaggle submission

My Kaggle username is MHAR03. The final score is 0.17538.