set.seed(123)
n = 4
lambda = 5
X = rgamma(10000,shape = n, scale = 10/lambda) #scale 10/lambda = 2, equivalently rate lambda/10 = 0.5
hist(X, breaks = 50, col = "blue", xlab = "Value of X", main = "Gamma-distributed random variable X")
set.seed(123)
Y = rowSums(replicate(4, rexp(10000, rate = lambda/10))) #element-wise sum of four iid Exp(lambda/10) vectors, drawn in the same order as four separate rexp() calls
hist(Y, breaks = 50, col = "blue", xlab = "Value of Y", main = "Sum of Exponential-distributed random variable Y")
set.seed(123)
Z = rexp(10000,rate = lambda/10)
hist(Z, breaks = 50, col = "blue", xlab = "Value of Z", main = "Single Exponential-distributed random variable Z")
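These three simulations illustrate the same distribution from two directions: a Gamma variable with shape 4 and rate λ/10 = 0.5 is, in distribution, the sum of four independent exponentials with that rate, \[X \overset{d}{=} \sum_{i=1}^{4} Z_i, \qquad Z_i \sim Exp(\lambda/10),\] while Z shows a single such exponential.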
##### 1a. Calculate the empirical expected value (means) and variances of all three pdfs.
X_mean = mean(X)
X_var = var(X)
print(paste0('The empirical mean of X is:',X_mean))
## [1] "The empirical mean of X is:7.92062845373995"
print(paste0('The empirical variance of X is:',X_var))
## [1] "The empirical variance of X is:15.4914314793823"
Y_mean = mean(Y)
Y_var = var(Y)
print(paste0('The empirical mean of Y is:',Y_mean))
## [1] "The empirical mean of Y is:8.01905804579485"
print(paste0('The empirical variance of Y is:',Y_var))
## [1] "The empirical variance of Y is:15.7656963882112"
Z_mean = mean(Z)
Z_var = var(Z)
print(paste0('The empirical mean of Z is:',Z_mean))
## [1] "The empirical mean of Z is:2.0075625394718"
print(paste0('The empirical variance of Z is:',Z_var))
## [1] "The empirical variance of Z is:3.99905760034308"
Derivation of the expected value of the Gamma pdf (X)
\[X\sim Gamma(\alpha,\lambda)\] The expected value is the probability-weighted average over all possible values: \[E(X)=\int x f_X(x)dx\]
Using the pdf of the gamma distribution: \[E(X)=\int_0^\infty x\frac{\lambda^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\lambda x}dx\] \[=\int_0^\infty \frac{1}{\lambda}\frac{\lambda^{\alpha+1}}{\Gamma(\alpha)}x^{(\alpha+1)-1}e^{-\lambda x}dx\]
Using the relation \[\Gamma(x+1)=\Gamma(x)x\] which gives \[\frac{1}{\Gamma(\alpha)}=\frac{\alpha}{\Gamma(\alpha+1)}\] \[E(X)=\int_0^\infty \frac{\alpha}{\lambda}\frac{\lambda^{\alpha+1}}{\Gamma(\alpha+1)}x^{(\alpha+1)-1}e^{-\lambda x}dx\]
The integrand is now the density of a Gamma(α+1, λ) distribution, which integrates to 1: \[E(X) = \frac{\alpha}{\lambda}\int_0^\infty f(x;\alpha+1,\lambda)dx = \frac{\alpha}{\lambda}\]
In the simulation, the shape is α = 4 and the rate is λ/10 = 0.5, so the expected value of X is \[E(X) = \frac{4}{0.5} = 8\] which matches the empirical mean of ≈7.92.
Derivation of the variance of the Gamma pdf (X): The variance is the expectation of the square minus the square of the expectation: \[Var(X)=E(X^2)-(E(X))^2\] \[E(X^2)=\int_0^\infty x^2 \frac{\lambda^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\lambda x}dx=\frac{\lambda^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{\alpha+1}e^{-\lambda x}dx=\frac{\Gamma(\alpha+2)}{\lambda^2 \Gamma(\alpha)}\] \[Var(X)=\frac{\Gamma(\alpha+2)}{\lambda^2\Gamma(\alpha)}-(\frac{\alpha}{\lambda})^2=\frac{\Gamma(\alpha+2)-\alpha^2\Gamma(\alpha)}{\lambda^2\Gamma(\alpha)}=\frac{\alpha(\alpha+1)\Gamma(\alpha)-\alpha^2\Gamma(\alpha)}{\lambda^2\Gamma(\alpha)}=\frac{\alpha\Gamma(\alpha)(\alpha+1-\alpha)}{\lambda^2\Gamma(\alpha)}=\frac{\alpha}{\lambda^2}\] With the simulated shape α = 4 and rate λ/10 = 0.5, Var(X) = 4/0.5² = 16, which matches the empirical variance of ≈15.5.
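As a numerical sanity check (a sketch using the simulated shape and rate), integrate() can verify both moments:

#numerical check of E(X) and Var(X) for shape 4, rate 0.5
EX  <- integrate(function(x) x   * dgamma(x, shape = 4, rate = 0.5), 0, Inf)$value  #~8
EX2 <- integrate(function(x) x^2 * dgamma(x, shape = 4, rate = 0.5), 0, Inf)$value
EX2 - EX^2   #~16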
Expected value of the single exponential (Z)
Moment generating function, for a generic rate λ and t < λ: \[g(t)=\int_0^{+\infty}e^{tx}\lambda e^{-\lambda x}dx\] \[=\int_0^{+\infty}\lambda e^{x(t-\lambda)} dx=\frac{\lambda e^{x(t-\lambda)}}{t-\lambda}|_0^{+\infty}\] \[=-\frac{\lambda}{t-\lambda}=\frac{\lambda}{\lambda-t}\]
The first moment, the mean, is the derivative at t = 0; the quotient rule gives: \[M'(0)=E(Z)=\frac{d}{dt}\frac{\lambda}{\lambda-t}|_{t=0}=\frac{\lambda}{(\lambda -t)^2}|_{t=0}=\frac{1}{\lambda}\] The rate used in the simulation is λ/10 = 0.5, so \[E(Z)=\frac{1}{\lambda/10}=\frac{10}{\lambda}=2\] which matches the empirical mean of ≈2.01.
Expected value of the sum of exponentials (Y)
Y is the sum of four independent exponentials, so its moment generating function is the product of the four individual MGFs: \[g(t)=\left(\frac{\lambda}{\lambda-t}\right)^4\]
The first moment, the mean, is the derivative at t = 0: \[M'(0)=E(Y)=\frac{d}{dt}\left(\frac{\lambda}{\lambda-t}\right)^4|_{t=0}=\frac{4\lambda^4}{(\lambda -t)^5}|_{t=0}=\frac{4\lambda^4}{\lambda^5}=\frac{4}{\lambda}\] With the simulated rate λ/10 = 0.5, \[E(Y)=\frac{4}{\lambda/10}=\frac{40}{\lambda}=8\] which matches the empirical mean of ≈8.02.
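As a numerical sanity check (a sketch), the derivative of this MGF at t = 0 can be approximated by a central difference; with the simulated rate it should return E(Y) = 8:

r <- lambda/10                      #rate used in the simulation (0.5)
M_Y <- function(t) (r/(r - t))^4    #MGF of the sum of four iid Exp(r) variables
(M_Y(1e-6) - M_Y(-1e-6))/(2e-6)     #central difference approximation of M'(0); ~4/r = 8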
#By the memoryless property, P(Z > lambda | Z > lambda/2) = P(Z > lambda/2), evaluated empirically and via the CDF below
length(Z[Z>lambda/2])/length(Z)
## [1] 0.2884
1-pexp(lambda/2,lambda/10)
## [1] 0.2865048
Calculus proof that the memoryless property holds:
Using the CDF of the exponential distribution with rate \[r=\frac{\lambda}{10}=0.5\] \[P(Z \le \lambda|Z>\frac{\lambda}{2})=\frac{P(Z\le\lambda\cap Z>\frac{\lambda}{2})}{P(Z>\frac{\lambda}{2})}=\frac{P(\frac{\lambda}{2}<Z\le\lambda)}{P(Z>\frac{\lambda}{2})}\] \[=\frac{e^{-r\cdot\frac{\lambda}{2}}-e^{-r\cdot\lambda}}{e^{-r\cdot\frac{\lambda}{2}}}=1-e^{-r\cdot\frac{\lambda}{2}}\]
\[P(Z>\lambda | Z>\frac{\lambda}{2}) = 1-P(Z \le \lambda|Z>\frac{\lambda}{2})=e^{-r\cdot\frac{\lambda}{2}}=e^{-1.25}\approx0.2865\]
This is the same as P(Z > λ/2): because the exponential is memoryless, it does not matter what happened before, so the conditional probability can be found by simply evaluating P(Z > λ/2).
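The conditional probability can also be estimated directly from the samples (a sketch); memorylessness predicts it should be close to the values above:

#empirical P(Z > lambda | Z > lambda/2); expected to be near 0.2865
sum(Z > lambda)/sum(Z > lambda/2)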
#By memorylessness, P(Z > 2*lambda | Z > lambda) = P(Z > lambda)
emp_b <- sum(Z > 2*lambda)/sum(Z > lambda) #empirical conditional estimate; should be near 0.082
1-pexp(lambda,rate = lambda/10)
## [1] 0.082085
Calculus proof that the memoryless property holds:
Using the CDF of the exponential distribution with rate \[r=\frac{\lambda}{10}=0.5\] \[P(Z \le 2\lambda|Z>\lambda)=\frac{P(Z\le2\lambda\cap Z>\lambda)}{P(Z>\lambda)}=\frac{P(\lambda<Z\le2\lambda)}{P(Z>\lambda)}\] \[=\frac{e^{-r\cdot\lambda}-e^{-r\cdot2\lambda}}{e^{-r\cdot\lambda}}=1-e^{-r\cdot\lambda}\]
\[P(Z>2\lambda | Z>\lambda) = 1-P(Z \le 2\lambda|Z>\lambda)=e^{-r\cdot\lambda}=e^{-2.5}\approx0.0821\]
This is the same as P(Z > λ): having already survived past λ, Z behaves like a fresh exponential, so the conditional probability can be found by simply evaluating P(Z > λ).
#By memorylessness, P(Z > 3*lambda | Z > lambda) = P(Z > 2*lambda)
length(Z[Z>2*lambda])/length(Z)
## [1] 0.006
1-pexp(2*lambda,rate = lambda/10)
## [1] 0.006737947
Calculus proof that the memoryless property holds:
Using the CDF of the exponential distribution with rate \[r=\frac{\lambda}{10}=0.5\] \[P(Z \le 3\lambda|Z>\lambda)=\frac{P(Z\le3\lambda\cap Z>\lambda)}{P(Z>\lambda)}=\frac{P(\lambda<Z\le3\lambda)}{P(Z>\lambda)}\] \[=\frac{e^{-r\cdot\lambda}-e^{-r\cdot3\lambda}}{e^{-r\cdot\lambda}}=1-e^{-r\cdot2\lambda}\]
\[P(Z>3\lambda | Z>\lambda) = 1-P(Z \le 3\lambda|Z>\lambda)=e^{-r\cdot2\lambda}=e^{-5}\approx0.0067\]
This is the same as P(Z > 2λ): the process is memoryless, so the conditional probability is obtained by simply evaluating P(Z > 2λ).
quartiles_Y <- quantile(Y, probs = c(0, 0.25, 0.5, 0.75, 1))
quartiles_Z <- quantile(Z, probs = c(0, 0.25, 0.5, 0.75, 1))
#each observation of Y and Z is binned by quartile; the table counts every combination of bins
contingency_table <- table(cut(Y, breaks = quartiles_Y), cut(Z, breaks = quartiles_Z))
marginal_prob_Y <- margin.table(contingency_table, 1) / sum(contingency_table)
marginal_prob_Z <- margin.table(contingency_table, 2) / sum(contingency_table)
joint_prob_YZ <- contingency_table / sum(contingency_table)
joint_prob_YZ
##
## (0.000676,0.579] (0.579,1.4] (1.4,2.77] (2.77,18.2]
## (0.372,5.11] 0.10692138 0.08571714 0.04910982 0.00810162
## (5.11,7.35] 0.06461292 0.07021404 0.07161432 0.04360872
## (7.35,10.2] 0.04590918 0.05621124 0.07121424 0.07671534
## (10.2,31.5] 0.03240648 0.03790758 0.05811162 0.12162432
#The probability of P(Y = Q2, Z = Q2)
joint_prob_YZ[2,2]
## [1] 0.07021404
#The probability of P(Y = Q2)*P(Z = Q2)
sum(joint_prob_YZ[1:4,2])*sum(joint_prob_YZ[2,1:4])
## [1] 0.06252501
#The probability of P(Y = Q3, Z = Q3)
joint_prob_YZ[3,3]
## [1] 0.07121424
#The probability of P(Y = Q3)*P(Z = Q3)
sum(joint_prob_YZ[1:4,3])*sum(joint_prob_YZ[3,1:4])
## [1] 0.06252501
#The probability of P(Y = Q1, Z = Q4)
joint_prob_YZ[1,4]
## [1] 0.00810162
#The probability of P(Y = Q1)*P(Z = Q4)
sum(joint_prob_YZ[1:4,4])*sum(joint_prob_YZ[1,1:4])
## [1] 0.06247499
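The same comparison can be made for the whole table at once (a sketch using the marginals computed above): under independence, the joint table would equal the outer product of the marginals.

#expected joint probabilities under independence, and the largest departure from them
indep_prob <- outer(marginal_prob_Y, marginal_prob_Z)
max(abs(joint_prob_YZ - indep_prob))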
Fisher’s exact test is recommended for small sample sizes (i.e., fewer than 20 or 30 observations), while the chi-square test is recommended for larger samples. Fisher’s test can handle small samples because it calculates the exact probabilities of all possible outcomes rather than relying on a large-sample approximation as the chi-square test does. This makes Fisher’s test computationally expensive and therefore less practical for large contingency tables. For this reason, the contingency table of Y and Z quartile ranges, with its large sample size of 10,000 points, is more appropriately tested with the chi-square test; Fisher’s test on a sample of 10,000 would be impractical.
Additionally, the chi-square test is appropriate here because its sampling distribution rests on a large-sample approximation, justified by the central limit theorem, which holds well with n = 10,000. Both the Fisher and chi-square tests consider categorical data and the marginal totals, which are typically integers. With a large sample size, however, the frequencies can be treated as approximately continuous, so non-integer expected counts are not a major concern.
The chi-square test is performed below to decide whether to reject the null hypothesis that independence holds. An X-squared of 1884 is reported, which measures the discrepancy between the observed counts in the contingency table and the counts expected if the null hypothesis of independence were true. This large value provides strong evidence against the null hypothesis. The p-value, far below the significance level of 0.05, shows that observing a test statistic of 1884 under the assumption of independence is very unlikely. This indicates strong evidence against the null hypothesis, suggesting a significant association between the quartile ranges of Y and Z.
n<-10000
table <- joint_prob_YZ*n
chisq.test(table)
##
## Pearson's Chi-squared test
##
## data: table
## X-squared = 1884, df = 9, p-value < 2.2e-16
n<-30
table <- joint_prob_YZ*n
fisher.test(table)
## Warning in fisher.test(table): 'x' has been rounded to integer: Mean relative
## difference: 0.1292992
##
## Fisher's Exact Test for Count Data
##
## data: table
## p-value = 0.6649
## alternative hypothesis: two.sided
#libraries
library(readr)
library(psych)
library(GGally)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(corrplot)
## corrplot 0.92 loaded
By reviewing the list of possible predictors, two quantitative variables are chosen (GrLivArea & GarageArea) and one qualitative (OverallQual). The descriptive statistics (i.e. mean, median, standard dev, etc.) are shown for the chosen variables.
OverallQual: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
urlfile<-"https://raw.githubusercontent.com/catcho1632/DATA605/main/house-prices-advanced-regression-techniques/train.csv"
train_data<-read_csv(url(urlfile))
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_var <- train_data[c('SalePrice','GrLivArea','GarageArea','OverallQual')]
describe(train_var)
## vars n mean sd median trimmed mad min max
## SalePrice 1 1460 180921.20 79442.50 163000 170783.29 56338.80 34900 755000
## GrLivArea 2 1460 1515.46 525.48 1464 1467.67 483.33 334 5642
## GarageArea 3 1460 472.98 213.80 480 469.81 177.91 0 1418
## OverallQual 4 1460 6.10 1.38 6 6.08 1.48 1 10
## range skew kurtosis se
## SalePrice 720100 1.88 6.50 2079.11
## GrLivArea 5308 1.36 4.86 13.75
## GarageArea 1418 0.18 0.90 5.60
## OverallQual 9 0.22 0.09 0.04
par(mfrow = c(2, 2))
hist(train_data$SalePrice, main = "Histogram of Sale Price")
hist(train_data$GrLivArea, main = "Histogram of GrLivArea", breaks = 30)
hist(train_data$GarageArea, main = "Histogram of Garage Area")
hist(train_data$OverallQual, main = "Histogram of Overall Quality", breaks = 10)
The scatter plot matrix: The scatter matrix shows a positive correlation between each independent variable and sale price, with Pearson correlations of roughly 0.6 to 0.7. The positive trend implies that, in general, greater living area, garage area, or overall quality is associated with a higher sale price.
train_var %>%
ggpairs()
# create df of 3 quantitative variables
train_3 <- train_data[c('GrLivArea','GarageArea','SalePrice')]
# Generate a correlation matrix
cor_mat <- cor(train_3)
# Create a correlation plot
corrplot(cor_mat, method="color")
Null hypothesis: the Pearson correlation is 0. Alternate hypothesis: the Pearson correlation is not 0.
The hypothesis is tested for different combinations of the three quantitative variables.
cor.1 tests whether the correlation between GrLivArea and GarageArea is 0. The Pearson correlation is 0.469 with a p-value below 2.2e-16. Since the p-value is less than 0.05 (the significance level), the null hypothesis can be rejected: observing a correlation at least this extreme would be exceedingly unlikely if the true correlation were 0. We accept the alternative hypothesis that the correlation is not 0. Additionally, we are 80% confident that the true correlation between GrLivArea and GarageArea is between 0.4424 and 0.4948.
cor.2 tests whether the correlation between GrLivArea and SalePrice is 0. The Pearson correlation is 0.7086, and the probability of observing a correlation this extreme if the null hypothesis were true is below 2.2e-16. Therefore, the null hypothesis that the correlation is 0 can be rejected. We can be 80% confident that the true correlation between GrLivArea and SalePrice is between 0.69 and 0.72.
cor.3 tests whether the correlation between GarageArea and SalePrice is 0. The Pearson correlation is 0.623, and the probability of observing a correlation this extreme if the null hypothesis were true is below 2.2e-16. Therefore, the null hypothesis that the correlation is 0 can be rejected. We can be 80% confident that the true correlation between GarageArea and SalePrice is between 0.602 and 0.644.
#defaults to pearson method
#two.sided: The alternative hypothesis is that the true correlation coefficient is not equal to zero (default).
cor.1 <- cor.test(train_3$GrLivArea, train_3$GarageArea, conf.level = 0.8)
cor.2 <- cor.test(train_3$GrLivArea, train_3$SalePrice, conf.level = 0.8)
cor.3 <- cor.test(train_3$GarageArea, train_3$SalePrice, conf.level = 0.8)
cor.1
##
## Pearson's product-moment correlation
##
## data: train_3$GrLivArea and train_3$GarageArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4423993 0.4947713
## sample estimates:
## cor
## 0.4689975
cor.2
##
## Pearson's product-moment correlation
##
## data: train_3$GrLivArea and train_3$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
cor.3
##
## Pearson's product-moment correlation
##
## data: train_3$GarageArea and train_3$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
Would you be worried about familywise error? Why or why not?
I would be concerned about familywise error, since there is high correlation among these 3 variables. The correlation is problematic because highly correlated variables lead to redundant testing: they explain overlapping variation in the dependent variable. This can increase the number of tests performed and therefore the chance of a false positive across all of them. Hypothesis tests are treated as independent, but when high correlation exists there are effectively fewer independent tests than the number actually run.
The significance level is an important factor contributing to familywise error (FWE) because it represents the probability of making a false positive (Type I error) in a single hypothesis test. When multiple hypothesis tests are conducted on the same dataset, the probability of making at least one false positive across all tests increases. For example, with a 0.05 significance level each test has a 5% chance of a Type I error; across multiple tests the overall error rate is no longer 5% but greater (for three independent tests, 1 − 0.95³ ≈ 14%).
To address this issue there are methods to correct familywise error, such as the Bonferroni correction. The Bonferroni correction divides the desired alpha level by the number of tests being conducted, so that the overall probability of making a false positive across all tests is controlled at the desired level.
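As a sketch of how such a correction could be applied to the three correlation tests above, p.adjust() scales the raw p-values (equivalently, the raw p-values can be compared against alpha divided by the number of tests):

#Bonferroni adjustment of the three correlation-test p-values
pvals <- c(cor.1$p.value, cor.2$p.value, cor.3$p.value)
p.adjust(pvals, method = "bonferroni")   #each p-value is multiplied by 3, capped at 1
0.05/length(pvals)                       #or compare raw p-values to ~0.0167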
The precision matrix provides information about the partial correlations between pairs of variables in a multivariate normal distribution. Partial correlations are a type of correlation coefficient that measures the strength and direction of the linear relationship between two variables while controlling for the effects of one or more additional variables. In other words, partial correlations provide a measure of the correlation between two variables that is not explained by the other variables in a multivariate data set.
#correlation matrix
cor_mat
## GrLivArea GarageArea SalePrice
## GrLivArea 1.0000000 0.4689975 0.7086245
## GarageArea 0.4689975 1.0000000 0.6234314
## SalePrice 0.7086245 0.6234314 1.0000000
#inverse of correlation matrix
precision<-solve(cor_mat)
precision
## GrLivArea GarageArea SalePrice
## GrLivArea 2.01353305 -0.08964955 -1.3709485
## GarageArea -0.08964955 1.63976057 -0.9587504
## SalePrice -1.37094845 -0.95875043 2.5692028
#multiplying correlation matrix by precision matrix
round(cor_mat %*% precision)
## GrLivArea GarageArea SalePrice
## GrLivArea 1 0 0
## GarageArea 0 1 0
## SalePrice 0 0 1
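As a sketch, the precision matrix computed above can be converted into the partial correlations it encodes, using the standard identity \[\rho_{ij\cdot rest}=-\frac{p_{ij}}{\sqrt{p_{ii}p_{jj}}}\]

#partial correlations from the precision matrix
partial_cor <- -cov2cor(precision)   #off-diagonals become -p_ij/sqrt(p_ii*p_jj)
diag(partial_cor) <- 1
partial_cor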
library(matrixcalc)
LU <- lu.decomposition(cor_mat)
LU
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.4689975 1.0000000 0
## [3,] 0.7086245 0.3731704 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.4689975 0.7086245
## [2,] 0 0.7800414 0.2910883
## [3,] 0 0.0000000 0.3892258
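A quick check (a sketch) confirms that the factors multiply back to the correlation matrix:

#verify L %*% U reproduces the correlation matrix
all.equal(LU$L %*% LU$U, cor_mat, check.attributes = FALSE)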
library(MASS)
hist(train_data$`1stFlrSF`, breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "Observed Data")
The exponential probability distribution is fitted to 1stFlrSF. The $estimate element returns the estimated rate parameter of the exponential distribution, denoted by lambda (λ).
exp <- fitdistr(train_data$`1stFlrSF`,'exponential')
lambda <- exp$estimate
lambda
## rate
## 0.0008601213
The fitted value of lambda for this distribution is used to draw 1,000 random samples with rexp(). The 1,000 samples are plotted in a histogram below, alongside the observed data.
sample_dist <- rexp(1000,lambda)
par(mfrow=c(1,2))
hist(sample_dist, breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "exp distribution of 1K samples")
hist(train_data$`1stFlrSF`, breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "Observed Data")
Conclusion on fit: The observed data has a less severe right skew than the exponential samples. The exponential samples have a greater spread, ranging up to roughly 8,000 sq ft. Additionally, the observed data peaks at about 1,000 sq ft, whereas the exponential peaks at 0.
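The visual comparison could be supplemented with a formal goodness-of-fit check; one option (a sketch) is a Kolmogorov–Smirnov test against the fitted exponential, which would be expected to reject given the differences noted above:

#KS test of 1stFlrSF against the fitted Exp(rate = lambda); ties in the data produce a warning
ks.test(train_data$`1stFlrSF`, "pexp", lambda)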
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
Using the rate parameter lambda, the 5th and 95th percentiles are calculated below.
qexp(c(0.05,0.95), lambda)
## [1] 59.63495 3482.91836
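These match the closed-form exponential quantile function \[q(p) = \frac{-\ln(1-p)}{\lambda}\] evaluated at p = 0.05 and p = 0.95:

#closed-form quantiles; same values as qexp() above
-log(1 - c(0.05, 0.95))/lambda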
Also generate a 95% confidence interval from the empirical data, assuming normality.
t.test(train_data$`1stFlrSF`, conf.level = 0.95)
##
## One Sample t-test
##
## data: train_data$`1stFlrSF`
## t = 114.91, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1142.780 1182.473
## sample estimates:
## mean of x
## 1162.627
#assuming normality by transforming the data using log().
par(mfrow=c(1,2))
hist(train_data$`1stFlrSF`, breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "Observed Data")
hist(log(train_data$`1stFlrSF`), breaks = 50, col = "blue", xlab = "`1stFlrSF`", main = "log(Observed Data)")
log_95<-t.test(log(train_data$`1stFlrSF`), conf.level = 0.95)
log_95
##
## One Sample t-test
##
## data: log(train_data$`1stFlrSF`)
## t = 842.72, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 6.991190 7.023813
## sample estimates:
## mean of x
## 7.007501
#back-transforming from the log scale; note this interval is for the geometric mean (the median under lognormality) rather than the arithmetic mean
exp(6.99)
## [1] 1085.721
exp(7.02)
## [1] 1118.787
Finally, provide the empirical 5th percentile and 95th percentile of the data.
quantile(train_data$`1stFlrSF`, probs = c(0.05, 0.95))
## 5% 95%
## 672.95 1831.25
The 5th and 95th percentiles of the empirical data are 672.95 and 1831.25. The distribution is only slightly skewed and looks arguably normal, so it makes sense that the normal-theory confidence interval (≈1143 to ≈1182) and the back-transformed log interval (≈1086 to ≈1119) are similar. Both confidence intervals lie well inside the 5th–95th percentile range, as expected: the percentiles describe the spread of the data, while the confidence interval describes the uncertainty in the mean.
num_index<-sapply(train_data,is.numeric)
train_data<-train_data[,num_index]
#note lm() will remove NA data prior to analysis
lm_saleprice<-lm(SalePrice ~ ., data = train_data)
summ<-summary(lm_saleprice)
summ
##
## Call:
## lm(formula = SalePrice ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -442182 -16955 -2824 15125 318183
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.351e+05 1.701e+06 -0.197 0.843909
## Id -1.205e+00 2.658e+00 -0.453 0.650332
## MSSubClass -2.001e+02 3.451e+01 -5.797 8.84e-09 ***
## LotFrontage -1.160e+02 6.126e+01 -1.894 0.058503 .
## LotArea 5.422e-01 1.575e-01 3.442 0.000599 ***
## OverallQual 1.866e+04 1.482e+03 12.592 < 2e-16 ***
## OverallCond 5.239e+03 1.368e+03 3.830 0.000135 ***
## YearBuilt 3.164e+02 8.766e+01 3.610 0.000321 ***
## YearRemodAdd 1.194e+02 8.668e+01 1.378 0.168607
## MasVnrArea 3.141e+01 7.022e+00 4.473 8.54e-06 ***
## BsmtFinSF1 1.736e+01 5.838e+00 2.973 0.003014 **
## BsmtFinSF2 8.342e+00 8.766e+00 0.952 0.341532
## BsmtUnfSF 5.005e+00 5.277e+00 0.948 0.343173
## TotalBsmtSF NA NA NA NA
## `1stFlrSF` 4.597e+01 7.360e+00 6.246 6.02e-10 ***
## `2ndFlrSF` 4.663e+01 6.102e+00 7.641 4.72e-14 ***
## LowQualFinSF 3.341e+01 2.794e+01 1.196 0.232009
## GrLivArea NA NA NA NA
## BsmtFullBath 9.043e+03 3.198e+03 2.828 0.004776 **
## BsmtHalfBath 2.465e+03 5.073e+03 0.486 0.627135
## FullBath 5.433e+03 3.531e+03 1.539 0.124182
## HalfBath -1.098e+03 3.321e+03 -0.331 0.740945
## BedroomAbvGr -1.022e+04 2.155e+03 -4.742 2.40e-06 ***
## KitchenAbvGr -2.202e+04 6.710e+03 -3.282 0.001063 **
## TotRmsAbvGrd 5.464e+03 1.487e+03 3.674 0.000251 ***
## Fireplaces 4.372e+03 2.189e+03 1.998 0.046020 *
## GarageYrBlt -4.728e+01 9.106e+01 -0.519 0.603742
## GarageCars 1.685e+04 3.491e+03 4.827 1.58e-06 ***
## GarageArea 6.274e+00 1.213e+01 0.517 0.605002
## WoodDeckSF 2.144e+01 1.002e+01 2.139 0.032662 *
## OpenPorchSF -2.252e+00 1.949e+01 -0.116 0.907998
## EnclosedPorch 7.295e+00 2.062e+01 0.354 0.723590
## `3SsnPorch` 3.349e+01 3.758e+01 0.891 0.373163
## ScreenPorch 5.805e+01 2.041e+01 2.844 0.004532 **
## PoolArea -6.052e+01 2.990e+01 -2.024 0.043204 *
## MiscVal -3.761e+00 6.960e+00 -0.540 0.589016
## MoSold -2.217e+02 4.229e+02 -0.524 0.600188
## YrSold -2.474e+02 8.458e+02 -0.293 0.769917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36800 on 1085 degrees of freedom
## (339 observations deleted due to missingness)
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.8034
## F-statistic: 131.8 on 35 and 1085 DF, p-value: < 2.2e-16
First, in order to reduce the list of predictors, each p-value is assessed to determine whether the null hypothesis, that the predictor has no relationship with SalePrice, can be rejected. The significance level for this analysis is 0.05, so any predictor with a p-value greater than 0.05 is removed. Additionally, we can look at the ratio of the estimated coefficient to its standard error (the t statistic). For a reliable predictor we typically like to see this ratio above roughly 5 to 10 in absolute value; the larger the ratio, the smaller the relative variability in the slope estimate.
list_summ<-data.frame(summ$coefficients)
names(list_summ) <- c('Estimate','Error','tvalue','pvalue')
#variables listed with pvalue less than 0.05
list_summ<-list_summ[list_summ$pvalue<0.05,]
#the test statistic (estimate divided by standard error, i.e. the t value) is added as a column
list_summ$test_statistic<-list_summ$Estimate/list_summ$Error
#filter through list_summ for test_statistic greater than 4. Rule of thumb is >5, but that would leave us with very few variables
list_summ<-list_summ[abs(list_summ$test_statistic)>4,]
list_summ
## Estimate Error tvalue pvalue test_statistic
## MSSubClass -200.06229 34.511335 -5.797002 8.843883e-09 -5.797002
## OverallQual 18656.24965 1481.619499 12.591795 4.941611e-34 12.591795
## MasVnrArea 31.40756 7.022059 4.472700 8.536692e-06 4.472700
## `1stFlrSF` 45.97355 7.360064 6.246353 6.022410e-10 6.246353
## `2ndFlrSF` 46.62535 6.102050 7.640933 4.724807e-14 7.640933
## BedroomAbvGr -10218.91046 2155.037820 -4.741871 2.399480e-06 -4.741871
## GarageCars 16849.71719 3490.579204 4.827198 1.582910e-06 4.827198
The categorical variables are:
- OverallQual (Rates the overall material and finish of the house, 1-10)
- BedroomAbvGr (Bedrooms above grade; does NOT include basement bedrooms)
- MSSubClass (Identifies the type of dwelling involved in the sale)
- GarageCars (Size of garage in car capacity)

The continuous variables are:
- MasVnrArea (Masonry veneer area in square feet)
- 1stFlrSF (First floor square feet)
- 2ndFlrSF (Second floor square feet)
The new linear regression model is built:
# 1stFlrSF and 2ndFlrSF are renamed to match the test dataset, where read.csv() prepends 'X' to names that start with a digit
train_data2<-train_data
colnames(train_data2)[14]<-'X1stFlrSF'
colnames(train_data2)[15]<-'X2ndFlrSF'
lm_saleprice2<-lm(SalePrice ~ OverallQual+BedroomAbvGr+MSSubClass+GarageCars+MasVnrArea+`X1stFlrSF`+`X2ndFlrSF`, data = train_data2)
summ2<-summary(lm_saleprice2)
summ2
##
## Call:
## lm(formula = SalePrice ~ OverallQual + BedroomAbvGr + MSSubClass +
## GarageCars + MasVnrArea + X1stFlrSF + X2ndFlrSF, data = train_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -440242 -19212 -1179 17211 277235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64151.794 6515.242 -9.846 < 2e-16 ***
## OverallQual 23750.794 1043.683 22.757 < 2e-16 ***
## BedroomAbvGr -9564.999 1538.501 -6.217 6.62e-10 ***
## MSSubClass -186.695 25.527 -7.314 4.29e-13 ***
## GarageCars 15601.329 1720.397 9.068 < 2e-16 ***
## MasVnrArea 37.766 6.226 6.066 1.67e-09 ***
## X1stFlrSF 75.040 3.659 20.507 < 2e-16 ***
## X2ndFlrSF 56.245 3.388 16.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37730 on 1444 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7747, Adjusted R-squared: 0.7736
## F-statistic: 709.2 on 7 and 1444 DF, p-value: < 2.2e-16
From here, the conditions for linear regression are checked.
predictors<-train_data2[c('SalePrice','OverallQual','BedroomAbvGr','MSSubClass','GarageCars','MasVnrArea','X1stFlrSF','X2ndFlrSF')]
predictors %>%
ggpairs()
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values
## Warning: Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).
## Warning: Removed 8 rows containing non-finite values (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 8 rows containing missing values
## Warning: Removed 8 rows containing missing values (`geom_point()`).
## Removed 8 rows containing missing values (`geom_point()`).
The predictors show a fairly linear relationship to SalePrice, except for MSSubClass, although it does add some explanatory power to the model (its coefficient remains highly significant).
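Residual diagnostics would round out the condition check; a minimal sketch using base R's plot() method for lm objects, which produces residuals-vs-fitted, normal Q-Q, scale-location, and leverage plots:

#residual diagnostics for the reduced model
par(mfrow = c(2, 2))
plot(lm_saleprice2)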
An F-statistic of 709.2 with 7 numerator degrees of freedom and 1444 denominator degrees of freedom, and a p-value < 2.2e-16 (essentially 0), shows that the model is statistically significant and that the predictor variables collectively contribute to the prediction of the outcome variable. Since the p-value is below 0.05, the null hypothesis that there is no relationship between the predictors and the outcome can be rejected.
The baseline model, where all the numeric variables in the training data set were used, showed the following: F-statistic: 131.8 on 35 and 1085 DF, p-value < 2.2e-16.
Reducing the number of predictors increased the F-statistic while using far fewer degrees of freedom. This was an improvement in the model.
test <- read.csv("https://raw.githubusercontent.com/catcho1632/DATA605/main/house-prices-advanced-regression-techniques/test.csv")
#replace NAs with 0 so predict() returns a prediction for every test row
test[is.na(test)] = 0
prediction <- predict(lm_saleprice2, test)
prediction <- data.frame(cbind(test$Id, prediction))
colnames(prediction) = c('Id', 'SalePrice')
#write_csv(prediction, "/Users/catherinecho/Documents/DATA 605/SalePrice.csv")
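As a final sanity check (a sketch), the predictions can be summarized before writing the file, to confirm there are no missing values and the range is plausible:

#summary of predicted sale prices prior to submission
summary(prediction$SalePrice)
sum(is.na(prediction$SalePrice))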