Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely described by n (a size parameter) and lambda (\(\lambda\), a shape parameter). Choose any n greater than 3 and an expected value (\(\lambda\)) between 2 and 10 (you choose).
set.seed(123)
# Gamma parameters: shape n = 4, rate lambda = 3
n <- 4
lambda <- 3
X <- rgamma(10000, shape = n, rate = lambda)
hist(X)
Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (\(\lambda\)). The n and \(\lambda\) must be the same as in the previous case. (e.g., mysum = rexp(10000, \(\lambda\)) + rexp(10000, \(\lambda\)) + …)
set.seed(123)
# Y is the sum of n = 4 independent Exp(lambda) draws
Y <- rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda) + rexp(10000, lambda)
hist(Y)
Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (\(\lambda\)).
set.seed(123)
Z <- rexp(10000, rate = lambda)
hist(Z)
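As a quick optional check: the sum of n independent Exp(\(\lambda\)) variables is Gamma(n, \(\lambda\)), so X and Y should follow the same distribution. A minimal Q-Q sketch:
# Optional sketch: quantiles of X vs. Y should fall on the y = x line if the
# two samples come from the same Gamma(n, lambda) distribution
qqplot(X, Y, main = "Q-Q plot: Gamma(4, 3) draws vs. sum of four Exp(3) draws")
abline(0, 1)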
# Calculate empirical mean and variance of the Gamma pdf (X)
m1 <- mean(X)
v1 <- var(X)
# Calculate empirical mean and variance of the sum-of-exponentials pdf (Y)
m2 <- mean(Y)
v2 <- var(Y)
# Calculate empirical mean and variance of the single exponential pdf (Z)
m3 <- mean(Z)
v3 <- var(Z)
cat("The empirical mean of Gamma pdf:", m1,"\n","The empirical variance of Gamma pdf:", v1,"\n","The empirical mean of Sum of Exponentials pdf:", m2,"\n","The empirical variance of Sum of Exponentials pdf:", v2,"\n","The empirical mean of Single Exponential pdf:", m3,"\n","The empirical variance of Single Exponential pdf:", v3,"\n")
## The empirical mean of Gamma pdf: 1.320105
## The empirical variance of Gamma pdf: 0.4303175
## The empirical mean of Sum of Exponentials pdf: 1.33651
## The empirical variance of Sum of Exponentials pdf: 0.437936
## The empirical mean of Single Exponential pdf: 0.3345938
## The empirical variance of Single Exponential pdf: 0.1110849
Expected value and variance of X (Gamma pdf) using calculus:
The expected value is the integral of \(x\) times the gamma pdf over the support of the distribution, i.e. \(E[X] = \int_0^\infty x f(x)\,dx = n/\lambda\). The variance is the integral of the squared difference between \(x\) and the expected value, weighted by the gamma pdf over the same range, i.e. \(Var(X) = \int_0^\infty (x - E[X])^2 f(x)\,dx = n/\lambda^2\). With \(n = 4\) and \(\lambda = 3\), these are \(4/3 \approx 1.3333\) and \(4/9 \approx 0.4444\), matching the numerical results below.
# Expected value: integrate x * f(x) over (0, Inf); dgamma is already vectorized
expected_X <- integrate(function(x) x * dgamma(x, n, lambda), lower = 0, upper = Inf)$value
# Variance: integrate (x - E[X])^2 * f(x) over (0, Inf)
variance_X <- integrate(function(x) (x - expected_X)^2 * dgamma(x, n, lambda), lower = 0, upper = Inf)$value
cat("Expected value: ", expected_X, "\n","Variance: ", variance_X, "\n")
## Expected value: 1.333333
## Variance: 0.4444444
Expected value of exponential (Z) using the moment generating function: For the exponential distribution Z, the moment generating function is \(M_Z(x) = \frac{\lambda}{\lambda - x}\), where \(\lambda\) is the rate parameter and \(x < \lambda\). Its first derivative evaluated at 0 gives the expected value of the exponential distribution, which is \(1/\lambda\).
moment_Z <- expression(lambda / (lambda - x))
derivative_moment_Z <- D(moment_Z, 'x')
expected_Z <- eval(derivative_moment_Z, list(x = 0))
cat("The expected value of exponential pdf Z using moment generating function:",expected_Z,"\n")
## The expected value of exponential pdf Z using moment generating function: 0.3333333
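The same MGF can also give the variance (an optional sketch): the second derivative at 0 is \(E[Z^2] = 2/\lambda^2\), so \(Var(Z) = M_Z''(0) - (M_Z'(0))^2 = 2/\lambda^2 - 1/\lambda^2 = 1/\lambda^2\).
# Optional sketch: variance of Z from the second derivative of the MGF
second_derivative_moment_Z <- D(D(moment_Z, 'x'), 'x')
variance_Z <- eval(second_derivative_moment_Z, list(x = 0)) - expected_Z^2
variance_Z  # 1/lambda^2 = 1/9 ~ 0.1111, close to the empirical variance v3 above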
Expected value of sum of exponentials (Y) using the moment generating function: Because the n exponentials are independent, the moment generating function of their sum is the product of the individual MGFs, \(M_Y(x) = \left(\frac{\lambda}{\lambda - x}\right)^{n}\), and its first derivative evaluated at 0 gives the expected value, which is \(n/\lambda\).
moment_Y <- expression((lambda / (lambda - x))^n)
derivative_moment_Y <- D(moment_Y, 'x')
expected_Y <- eval(derivative_moment_Y, list(x = 0))
cat("The expected value of sum of exponential pdf Y using moment generating function:",expected_Y,"\n")
## The expected value of sum of exponential pdf Y using moment generating function: 1.333333
## a: Calculate P(Z > λ | Z > λ/2):
# Count observations where Z > λ/2 (the conditioning event)
n1 <- sum(Z > lambda/2)
# Count observations where Z > λ; since Z > λ implies Z > λ/2, this is the joint event
n2 <- sum(Z > lambda)
# Empirical estimate of P(Z > λ | Z > λ/2)
prob_a <- n2/n1
## b: Calculate P(Z > 2λ | Z > λ):
# Count observations where Z > λ (the conditioning event)
n3 <- sum(Z > lambda)
# Count observations where Z > 2λ; since Z > 2λ implies Z > λ, this is the joint event
n4 <- sum(Z > 2*lambda)
# Empirical estimate of P(Z > 2λ | Z > λ)
prob_b <- n4/n3
## c: Calculate P(Z > 3λ | Z > λ):
# Count observations where Z > λ (the conditioning event)
n5 <- sum(Z > lambda)
# Count observations where Z > 3λ; since Z > 3λ implies Z > λ, this is the joint event
n6 <- sum(Z > 3*lambda)
# Empirical estimate of P(Z > 3λ | Z > λ)
prob_c <- n6/n5
cat("The probability of the exponential distribution Z for condition-a is:",prob_a,"\n","The probability of the exponential distribution Z for condition-b is:",prob_b,"\n","The probability of the exponential distribution Z for condition-c is:",prob_c,"\n")
## The probability of the exponential distribution Z for condition-a is: 0.0106383
## The probability of the exponential distribution Z for condition-b is: 0
## The probability of the exponential distribution Z for condition-c is: 0
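For reference, the exact values under \(Z \sim Exp(\lambda)\) explain these numbers (a sketch using the survival function \(P(Z > z) = e^{-\lambda z}\)):
# Sketch: exact conditional probabilities to compare with the empirical ones
exp(-lambda * lambda/2)  # P(Z > lambda | Z > lambda/2) = e^(-4.5) ~ 0.0111, near prob_a
exp(-lambda * lambda)    # P(Z > 2*lambda | Z > lambda)  = e^(-9)  ~ 1.2e-4
exp(-lambda * 2*lambda)  # P(Z > 3*lambda | Z > lambda)  = e^(-18) ~ 1.5e-8
# With 10,000 draws essentially no observations exceed 2*lambda = 6, hence prob_b = prob_c = 0.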
## Check for condition-a:
# Unconditional probability: P(Z > lambda/2)
uncond_prob_a <- 1 - pexp(lambda/2, rate = lambda)
# Conditional probability: P(Z > lambda | Z > lambda/2) = P(Z > lambda) / P(Z > lambda/2)
cond_prob_a <- (1 - pexp(lambda, rate = lambda)) / uncond_prob_a
# Memorylessness: the conditional probability should equal the unconditional P(Z > lambda/2)
if(abs(cond_prob_a - uncond_prob_a) < 1e-6){
print("For P(Z > lambda | Z > lambda/2), memoryless property holds")
}else{
print("For P(Z > lambda | Z > lambda/2), memoryless property does not hold")
}
## [1] "For P(Z > lambda | Z > lambda/2), memoryless property holds"
## Check for condition-b:
# Unconditional probability: P(Z > lambda)
uncond_prob_b <- 1 - pexp(lambda, rate = lambda)
# Conditional probability: P(Z > 2*lambda | Z > lambda) = P(Z > 2*lambda) / P(Z > lambda)
cond_prob_b <- (1 - pexp(2*lambda, rate = lambda)) / uncond_prob_b
# Memorylessness: the conditional probability should equal the unconditional P(Z > lambda)
if(abs(cond_prob_b - uncond_prob_b) < 1e-6){
print("For P(Z > 2lambda | Z > lambda), memoryless property holds")
}else{
print("For P(Z > 2lambda | Z > lambda), memoryless property does not hold")
}
## [1] "For P(Z > 2lambda | Z > lambda), memoryless property holds"
## Check for condition-c:
# Unconditional benchmark here: P(Z > 2*lambda)
uncond_prob_c <- 1 - pexp(2*lambda, rate = lambda)
# Conditional probability: P(Z > 3*lambda | Z > lambda) = P(Z > 3*lambda) / P(Z > lambda)
cond_prob_c <- (1 - pexp(3*lambda, rate = lambda)) / (1 - pexp(lambda, rate = lambda))
# Memorylessness: the conditional probability should equal the unconditional P(Z > 2*lambda)
if(abs(cond_prob_c - uncond_prob_c) < 1e-6){
print("For P(Z > 3lambda | Z > lambda), memoryless property holds")
}else{
print("For P(Z > 3lambda | Z > lambda), memoryless property does not hold")
}
## [1] "For P(Z > 3lambda | Z > lambda), memoryless property holds"
To investigate whether P(YZ) = P(Y)P(Z), we can build a table from the quartiles of Y and Z and then evaluate the marginal and joint probabilities.
# Calculate quartiles of Y and Z
Y_quartiles <- quantile(Y, probs = c(0.25, 0.5, 0.75,1.0))
Z_quartiles <- quantile(Z, probs = c(0.25, 0.5, 0.75,1.0))
# Pair 10,000 draws (with replacement) from the quartile breakpoints of Y with 10,000 independent draws from those of Z
single_vector <- paste(sample(Y_quartiles, 10000, replace = TRUE), sample(Z_quartiles, 10000, replace = TRUE))
# Tabulate the counts of the 16 unique quartile combinations into a 4 x 4 matrix
matrix_counts <- matrix(table(single_vector), ncol = 4)
# Convert the counts to proportions: a table of joint probabilities for each quartile combination of Y and Z
matrix_prop <- prop.table(matrix_counts)
# Convert the matrix to a data frame and add Sum columns and rows to the data frame
df <- data.frame(matrix_prop)
df$Sum <- rowSums(matrix_prop)
df <- rbind(df, colSums(df))
row.names(df)[nrow(df)] <- "Sum"
# Specify the column and row names
colnames(df) <- c("1st Quartile Y", "2nd Quartile Y", "3rd Quartile Y", "4th Quartile Y", "Sum")
rownames(df) <- c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z", "Sum")
knitr::kable(df)
| | 1st Quartile Y | 2nd Quartile Y | 3rd Quartile Y | 4th Quartile Y | Sum |
|---|---|---|---|---|---|
| 1st Quartile Z | 0.0621 | 0.0606 | 0.0641 | 0.0679 | 0.2547 |
| 2nd Quartile Z | 0.0598 | 0.0635 | 0.0691 | 0.0625 | 0.2549 |
| 3rd Quartile Z | 0.0607 | 0.0607 | 0.0624 | 0.0643 | 0.2481 |
| 4th Quartile Z | 0.0615 | 0.0606 | 0.0596 | 0.0606 | 0.2423 |
| Sum | 0.2441 | 0.2454 | 0.2552 | 0.2553 | 1.0000 |
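An alternative, more direct construction (a sketch, different from the sampling approach used above) bins the actual draws of Y and Z into their own quartiles with cut() and cross-tabulates the observed pairs:
# Sketch: joint quartile table built from the observed (Y, Z) pairs
Y_bins <- cut(Y, breaks = quantile(Y, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
Z_bins <- cut(Z, breaks = quantile(Z, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
prop.table(table(Z_bins, Y_bins))  # each cell should be near 1/16 if Y and Z are independent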
To compare P(YZ) with P(Y)*P(Z), we can check whether the ratio of the two is close to 1. Now, build the table of products of marginal probabilities and calculate the ratio.
# Calculate the marginal probabilities from the joint table (excluding the Sum row and column)
marginal_Y <- colSums(df[-5, -5])
marginal_Y
## 1st Quartile Y 2nd Quartile Y 3rd Quartile Y 4th Quartile Y
##         0.2441         0.2454         0.2552         0.2553
marginal_Z <- rowSums(df[-5, -5])
marginal_Z
## 1st Quartile Z 2nd Quartile Z 3rd Quartile Z 4th Quartile Z
##         0.2547         0.2549         0.2481         0.2423
# Product of the marginals, oriented like the joint table (rows = Z quartiles, columns = Y quartiles)
product_marginals <- outer(marginal_Z, marginal_Y)
# Calculate the ratio of joint probabilities to the product of marginals
ratio <- df[-5, -5] / product_marginals
round(ratio, 4)
##                1st Quartile Y 2nd Quartile Y 3rd Quartile Y 4th Quartile Y
## 1st Quartile Z         0.9988         0.9695         0.9862         1.0442
## 2nd Quartile Z         0.9611         1.0151         1.0623         0.9604
## 3rd Quartile Z         1.0023         0.9970         0.9855         1.0152
## 4th Quartile Z         1.0398         1.0192         0.9639         0.9796
Based on the values in the ratio table above, the joint probabilities P(YZ) are close to the product of the marginal probabilities P(Y)P(Z) for every quartile combination (all ratios are near 1). Therefore, P(YZ) ≈ P(Y)*P(Z), which is consistent with Y and Z having been generated independently.
Both Fisher’s Exact Test and the Chi-Square Test are used to determine whether there is a statistically significant association between two categorical variables.
Fisher’s Exact Test is typically used when the sample size is small; it computes the exact probability of observing a table at least as extreme as the one observed. The Chi-Square Test, on the other hand, is used when the sample size is large; it relies on a large-sample approximation to the distribution of the test statistic.
In both tests, if the p-value is very low (usually less than 0.05), we reject the null hypothesis of independence and conclude that there is a statistically significant association between the two variables.
Perform Fisher’s exact test on the table of counts:
# Recover the count table from the proportions (10,000 total observations)
count_table <- matrix_prop * 10000
fisher.test(count_table, simulate.p.value = TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: count_table
## p-value = 0.4228
## alternative hypothesis: two.sided
Perform Chi Square Test:
chisq.test(count_table)
##
## Pearson's Chi-squared test
##
## data: count_table
## X-squared = 9.1087, df = 9, p-value = 0.4273
Based on the p-values above for both tests (0.42 and 0.43, both well above 0.05), we fail to reject the null hypothesis, so there is no evidence of association between Y and Z. Independence therefore holds under both Fisher’s exact test and the Chi-Square test. Also, for a sample of this size, the Chi-Square test is the more appropriate of the two.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
library(readr)
library(tidyverse)
library(ggplot2)
library(pracma)
library(MASS)
df_train<-read.csv("https://raw.githubusercontent.com/Raji030/data605_final/main/train.csv")
str(df_train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
# Get names of columns that have at least one NA value
na_cols <- names(df_train)[colSums(is.na(df_train)) > 0]
na_cols
## [1] "LotFrontage" "Alley" "MasVnrType" "MasVnrArea" "BsmtQual"
## [6] "BsmtCond" "BsmtExposure" "BsmtFinType1" "BsmtFinType2" "Electrical"
## [11] "FireplaceQu" "GarageType" "GarageYrBlt" "GarageFinish" "GarageQual"
## [16] "GarageCond" "PoolQC" "Fence" "MiscFeature"
# Subset data for a few columns
df_subset <- subset(df_train, select = c("SalePrice", "LotArea", "OverallQual", "OverallCond", "YearBuilt"))
head(df_subset)
## SalePrice LotArea OverallQual OverallCond YearBuilt
## 1 208500 8450 7 5 2003
## 2 181500 9600 6 8 1976
## 3 223500 11250 7 5 2001
## 4 140000 9550 7 5 1915
## 5 250000 14260 8 5 2000
## 6 143000 14115 5 5 1993
df_test<-read.csv("https://raw.githubusercontent.com/Raji030/data605_final/main/test.csv")
# Summary statistics
summary(df_subset)
## SalePrice LotArea OverallQual OverallCond
## Min. : 34900 Min. : 1300 Min. : 1.000 Min. :1.000
## 1st Qu.:129975 1st Qu.: 7554 1st Qu.: 5.000 1st Qu.:5.000
## Median :163000 Median : 9478 Median : 6.000 Median :5.000
## Mean :180921 Mean : 10517 Mean : 6.099 Mean :5.575
## 3rd Qu.:214000 3rd Qu.: 11602 3rd Qu.: 7.000 3rd Qu.:6.000
## Max. :755000 Max. :215245 Max. :10.000 Max. :9.000
## YearBuilt
## Min. :1872
## 1st Qu.:1954
## Median :1973
## Mean :1971
## 3rd Qu.:2000
## Max. :2010
# Histogram for SalePrice
ggplot(df_subset, aes(x = SalePrice)) +
geom_histogram(binwidth = 50000) +
labs(title = "Histogram of SalePrice", x = "SalePrice", y = "Count")
# Boxplot for SalePrice by OverallQual (convert the integer scale to a factor so each quality level gets its own box)
ggplot(df_subset, aes(x = factor(OverallQual), y = SalePrice)) +
geom_boxplot() +
labs(title = "Boxplot of SalePrice by OverallQual", x = "OverallQual", y = "SalePrice")
# Scatterplot matrix for SalePrice, LotArea, and YearBuilt
pairs(df_subset[c("SalePrice", "LotArea", "YearBuilt")],
main = "Scatterplot Matrix for SalePrice, LotArea, and YearBuilt")
# Correlation matrix for SalePrice, LotArea, and OverallQual
cor_matrix<-cor(df_subset[c("SalePrice", "LotArea", "OverallQual")])
cor_matrix
## SalePrice LotArea OverallQual
## SalePrice 1.0000000 0.2638434 0.7909816
## LotArea 0.2638434 1.0000000 0.1058057
## OverallQual 0.7909816 0.1058057 1.0000000
# Hypothesis testing for pairwise correlations
cor.test(df_subset$SalePrice, df_subset$LotArea, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df_subset$SalePrice and df_subset$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(df_subset$SalePrice, df_subset$OverallQual, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df_subset$SalePrice and df_subset$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
cor.test(df_subset$LotArea, df_subset$OverallQual, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: df_subset$LotArea and df_subset$OverallQual
## t = 4.0629, df = 1458, p-value = 5.106e-05
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.07250156 0.13887424
## sample estimates:
## cor
## 0.1058057
From the univariate analysis, the mean sale price of the houses is USD 180,921, with a minimum of USD 34,900 and a maximum of USD 755,000. The lot area ranges from 1,300 sq. ft. to 215,245 sq. ft. with a mean of 10,517 sq. ft. The overall quality of the houses ranges from 1 to 10 with a mean of 6.099, and the overall condition ranges from 1 to 9 with a mean of 5.575. The year built ranges from 1872 to 2010, with a mean of 1971. The bivariate analysis shows that sale price is positively correlated with the overall quality of the houses and with lot area, and the scatterplot matrix suggests a weak positive relationship between sale price and year built.
The correlation matrix for the three quantitative variables (sale price, lot area, and overall quality) shows a strong positive correlation between sale price and overall quality (r = 0.79), a moderate positive correlation between sale price and lot area (r = 0.26), and a weak positive correlation between overall quality and lot area (r = 0.11).
Based on the correlation tests above, all three 80% confidence intervals exclude zero, indicating statistically significant positive correlations between the variables. Given how far the p-values fall below the significance level, I am not worried about a familywise error in which we accidentally reject a true null hypothesis.
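To make that concrete, a Bonferroni-style sketch: with three pairwise tests at the 80% confidence level (\(\alpha = 0.2\)), the adjusted per-test threshold is \(\alpha/3 \approx 0.067\), and all three p-values sit far below it:
# Sketch: Bonferroni adjustment for the three pairwise correlation tests
alpha_fw <- 0.2   # familywise alpha implied by the 80% confidence intervals
alpha_fw / 3      # adjusted per-test threshold ~ 0.067
# The three p-values above (< 2.2e-16, < 2.2e-16, and 5.1e-05) are all far below this.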
# Invert the correlation matrix to obtain the precision matrix
preci_matrix <- solve(cor_matrix)
preci_matrix
## SalePrice LotArea OverallQual
## SalePrice 2.9280384 -0.5334669 -2.2595806
## LotArea -0.5334669 1.1085153 0.3046752
## OverallQual -2.2595806 0.3046752 2.7550503
# Multiply correlation matrix by precision matrix
cor_preci_prod <- cor_matrix %*% preci_matrix
cor_preci_prod
## SalePrice LotArea OverallQual
## SalePrice 1.000000e+00 5.551115e-17 0.000000e+00
## LotArea -5.551115e-17 1.000000e+00 -5.551115e-17
## OverallQual -4.440892e-16 0.000000e+00 1.000000e+00
# Multiply precision matrix by correlation matrix
preci_cor_prod <- preci_matrix %*% cor_matrix
preci_cor_prod
## SalePrice LotArea OverallQual
## SalePrice 1.000000e+00 5.551115e-17 0.000000e+00
## LotArea -5.551115e-17 1.000000e+00 -5.551115e-17
## OverallQual -4.440892e-16 0.000000e+00 1.000000e+00
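As an aside (a sketch, not part of the assignment): rescaling the off-diagonal entries of the precision matrix yields partial correlations, e.g. between SalePrice and OverallQual controlling for LotArea:
# Sketch: partial correlation from the precision matrix, -p_ij / sqrt(p_ii * p_jj)
-preci_matrix["SalePrice", "OverallQual"] /
  sqrt(preci_matrix["SalePrice", "SalePrice"] * preci_matrix["OverallQual", "OverallQual"])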
# Perform LU decomposition (lu() from the pracma package)
lu_decom <- lu(cor_matrix)
lu_decom
## $L
## SalePrice LotArea OverallQual
## SalePrice 1.0000000 0.0000000 0
## LotArea 0.2638434 1.0000000 0
## OverallQual 0.7909816 -0.1105879 1
##
## $U
## SalePrice LotArea OverallQual
## SalePrice 1 0.2638434 0.7909816
## LotArea 0 0.9303867 -0.1028895
## OverallQual 0 0.0000000 0.3629698
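A quick optional sanity check on the decomposition: multiplying the factors back together should reproduce the correlation matrix.
# Sketch: L %*% U should equal cor_matrix up to floating-point error
all.equal(lu_decom$L %*% lu_decom$U, cor_matrix, check.attributes = FALSE)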
From the histogram of SalePrice drawn above, it is found that the distribution of this variable is right-skewed. The summary of the SalePrice data shows that its minimum value is above zero, and in practice the sale price of a house can never be zero, so we do not perform any shifting here.
# Fit an exponential distribution using the 'fitdistr' function from the MASS package
fit <- fitdistr(df_subset$SalePrice, "exponential")
# The optimal (maximum-likelihood) value of lambda; note this overwrites the lambda from the earlier section
lambda <- fit$estimate
# Generate 1000 random samples from the exponential distribution using the estimated lambda value
samples <- rexp(1000, rate = lambda)
# Histogram of the original variable
hist(df_subset$SalePrice, main = "Histogram for original variable")
# Histogram of the exponential samples
hist(samples, main = "Histogram of exponential samples")
# Find the 5th and 95th percentiles using the exponential quantile function (inverse CDF)
p_5th <- qexp(0.05, rate = lambda)
p_95th <- qexp(0.95, rate = lambda)
# Generate a 95% confidence interval for the mean of SalePrice (normal approximation)
mean_value <- mean(df_subset$SalePrice)
sd_value <- sd(df_subset$SalePrice)
n <- length(df_subset$SalePrice)
alpha <- 0.05
lower_ci <- mean_value - qnorm(1 - alpha/2) * (sd_value / sqrt(n))
upper_ci <- mean_value + qnorm(1 - alpha/2) * (sd_value / sqrt(n))
cat("lower interval is:",lower_ci,"and upper interval is:",upper_ci)
## lower interval is: 176846.2 and upper interval is: 184996.2
# Calculate empirical percentiles
empirical_5th <- quantile(df_subset$SalePrice, 0.05)
empirical_95th <- quantile(df_subset$SalePrice, 0.95)
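Placing the two sets of percentiles side by side (a sketch) shows how loosely the one-parameter exponential tracks the data; its 5th percentile, roughly \(0.0513 \times 180{,}921 \approx 9{,}281\), even falls below the observed minimum sale price of USD 34,900.
# Sketch: theoretical (exponential) vs. empirical percentiles
cat("Exponential 5th / 95th percentiles:", p_5th, "/", p_95th, "\n",
    "Empirical 5th / 95th percentiles:", empirical_5th, "/", empirical_95th, "\n")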
For building the multiple regression model I chose the variables OverallQual, GrLivArea, TotalBsmtSF, GarageCars, YearBuilt, Neighborhood, and ExterQual, because I think these are the most impactful variables for predicting the sale price of the houses. These variables also have no missing values in the training data.
# Select few variables that look significant to SalePrice
selected_vars <- c("SalePrice", "OverallQual", "GrLivArea", "TotalBsmtSF", "GarageCars", "YearBuilt", "Neighborhood", "ExterQual")
# Get a subset of the train data
subset_data <- df_train[, selected_vars]
# Convert qualitative variables to factors
subset_data$Neighborhood <- as.factor(subset_data$Neighborhood)
subset_data$ExterQual <- as.factor(subset_data$ExterQual)
# Build regression model
model <- lm(SalePrice ~ OverallQual + GrLivArea + TotalBsmtSF + GarageCars + YearBuilt + Neighborhood + ExterQual, data = subset_data)
# Get model summary
summary(model)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + TotalBsmtSF +
## GarageCars + YearBuilt + Neighborhood + ExterQual, data = subset_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -414812 -14617 223 13275 269280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.958e+05 1.368e+05 -2.893 0.003876 **
## OverallQual 1.386e+04 1.207e+03 11.487 < 2e-16 ***
## GrLivArea 4.673e+01 2.464e+00 18.969 < 2e-16 ***
## TotalBsmtSF 1.966e+01 2.700e+00 7.284 5.35e-13 ***
## GarageCars 1.223e+04 1.683e+03 7.266 6.06e-13 ***
## YearBuilt 2.106e+02 6.869e+01 3.066 0.002208 **
## NeighborhoodBlueste -1.325e+04 2.565e+04 -0.516 0.605607
## NeighborhoodBrDale -1.670e+04 1.237e+04 -1.350 0.177354
## NeighborhoodBrkSide 1.386e+04 1.062e+04 1.304 0.192312
## NeighborhoodClearCr 3.838e+04 1.095e+04 3.504 0.000473 ***
## NeighborhoodCollgCr 1.797e+04 8.776e+03 2.047 0.040807 *
## NeighborhoodCrawfor 3.935e+04 1.046e+04 3.763 0.000175 ***
## NeighborhoodEdwards 7.321e+02 9.733e+03 0.075 0.940051
## NeighborhoodGilbert 1.062e+04 9.312e+03 1.140 0.254474
## NeighborhoodIDOTRR -3.768e+02 1.132e+04 -0.033 0.973456
## NeighborhoodMeadowV -4.498e+02 1.229e+04 -0.037 0.970805
## NeighborhoodMitchel 8.616e+03 9.973e+03 0.864 0.387750
## NeighborhoodNAmes 9.639e+03 9.272e+03 1.040 0.298736
## NeighborhoodNoRidge 7.223e+04 1.009e+04 7.160 1.29e-12 ***
## NeighborhoodNPkVill -6.581e+03 1.436e+04 -0.458 0.646763
## NeighborhoodNridgHt 5.508e+04 9.365e+03 5.881 5.06e-09 ***
## NeighborhoodNWAmes 7.472e+03 9.616e+03 0.777 0.437288
## NeighborhoodOldTown -3.199e+03 1.037e+04 -0.308 0.757864
## NeighborhoodSawyer 1.132e+04 9.820e+03 1.152 0.249310
## NeighborhoodSawyerW 1.160e+04 9.543e+03 1.215 0.224402
## NeighborhoodSomerst 2.244e+04 9.081e+03 2.471 0.013597 *
## NeighborhoodStoneBr 6.941e+04 1.080e+04 6.430 1.74e-10 ***
## NeighborhoodSWISU 2.215e+01 1.204e+04 0.002 0.998532
## NeighborhoodTimber 3.270e+04 1.004e+04 3.259 0.001145 **
## NeighborhoodVeenker 5.382e+04 1.327e+04 4.057 5.25e-05 ***
## ExterQualFa -5.835e+04 1.174e+04 -4.969 7.55e-07 ***
## ExterQualGd -5.277e+04 5.676e+03 -9.296 < 2e-16 ***
## ExterQualTA -5.581e+04 6.447e+03 -8.656 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34050 on 1427 degrees of freedom
## Multiple R-squared: 0.8204, Adjusted R-squared: 0.8163
## F-statistic: 203.7 on 32 and 1427 DF, p-value: < 2.2e-16
The R-squared value for the model is 0.8204, indicating that the model is a good fit: it explains 82.04% of the variability in the sale price data. Additionally, the F-test p-value is less than 0.05, suggesting that the overall model is statistically significant, i.e. at least one of the independent variables has a significant effect on sale price, which supports the validity of the model. Overall, based on the R-squared value and the statistical significance, the model is well fitted and valid for predicting sale price from the selected independent variables.
par(mfrow=c(2,2))
plot(model)
From the diagnostic plots above, the distribution of the residuals appears approximately normal, apart from a few large outliers in the tails. There is no strong evidence of heteroscedasticity: the spread of the residuals remains fairly consistent across the range of fitted values, and no discernible patterns or trends appear in the residuals. This suggests the model adequately captures the relationships between the independent variables and the dependent variable, so the assumptions of the multiple regression model are reasonably met.
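These visual checks can optionally be supplemented with simple numeric ones (a sketch; note that formal normality tests are far stricter than a Q-Q plot and may flag the large outliers visible in the residual range above):
# Sketch: numeric companions to the diagnostic plots
shapiro.test(residuals(model))             # normality of residuals (n = 1460, within shapiro's 5000 limit)
cor(abs(residuals(model)), fitted(model))  # a value near 0 suggests roughly constant variance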
# Select predictor variables
selected_vars1 <- c("Id", "OverallQual", "GrLivArea", "TotalBsmtSF", "GarageCars", "YearBuilt", "Neighborhood", "ExterQual")
# Get a subset of the test data
subset_test <- df_test[, selected_vars1]
# Check if any missing value present in the subset test data
sapply(subset_test, function(x){sum(is.na(x))})
## Id OverallQual GrLivArea TotalBsmtSF GarageCars YearBuilt
## 0 0 0 1 1 0
## Neighborhood ExterQual
## 0 0
# Replace missing values with mean of specific columns
subset_test <- subset_test%>%
mutate(across(c("TotalBsmtSF", "GarageCars"), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Change categorical variables into factor in test dataset
subset_test <- subset_test %>%
mutate_at(vars(Neighborhood, ExterQual), as.factor)
predicted <- predict(model, subset_test)
data1 <- data.frame(Id = subset_test$Id, SalePrice=predicted)
write.csv(data1,"kagglesubmission.csv",row.names = FALSE)
Kaggle username is MHAR03. Final score is 0.17538.