Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatter plot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the data set. Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about family wise error? Why or why not?
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
training_data <- read.csv(file="train.csv", header=TRUE, sep=",")
The descriptive statistics below give the minimum, 1st quartile, median, 3rd quartile, maximum, and the number of missing values, if any, for each of the 81 variables in the training data. For categorical variables, the summary provides the frequency of the first few levels.
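The wide summary table itself is not reproduced here; a minimal sketch of the call that would produce it, using base R:
summary(training_data)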
The descriptive statistics below come from the psych package and provide information such as the count of non-missing values, the standard deviation, the interquartile range, and the standard error. By default, categorical variables are converted to numeric; these are marked with *.
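The generating call is not shown in the extract; a minimal sketch, assuming the psych package:
library(psych)
describe(training_data)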
Reference: https://www.rdocumentation.org/packages/psych/versions/1.8.10/topics/describe
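The SalePrice summary below presumably comes from base R's summary() applied to the dependent variable:
summary(training_data$SalePrice)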
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
As you can see, the distribution is skewed to the right, with some outlier prices in the tail. The minimum sale price is 34,900 USD and the maximum sale price is 755,000 USD. The median sale price is 163,000 USD.
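The skew is easiest to see in a histogram; a minimal sketch, assuming base graphics:
hist(training_data$SalePrice)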
The lot frontage measures the linear feet of street connected to the property.
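A univariate plot of this variable is not reproduced here; a hedged sketch, assuming base graphics (hist drops the missing values in LotFrontage):
hist(training_data$LotFrontage)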
Data for Correlation
The corr_data data frame only includes complete cases.
Each of the variables is continuous. These variables measure area (in square feet) or distance (in feet).
corr_data <- data.frame(training_data$GarageArea, training_data$GrLivArea, training_data$LotFrontage)
corr_data <- corr_data[complete.cases(corr_data), ]
colnames(corr_data) <- c("GarageArea", "GrLivArea", "LotFrontage")
The Pearson correlation assumes that each variable is continuous, each observation is complete (no missing data), there are no outliers in any variable, and the relationship between variables is linear and homoscedastic. Missing data were removed from corr_data. The remaining assumptions are investigated below.
The plots below show linear relationships; I do not observe any noticeable curved patterns.
[Scatter plots (not reproduced): Garage Area vs. Above Ground Living Area; Garage Area vs. Lot Frontage; Above Ground Living Area vs. Lot Frontage]
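The plotting code is not shown in the extract; a minimal sketch that would produce a scatter plot matrix covering these pairs, assuming base graphics:
# Scatter plot matrix of all three pairwise relationships
pairs(corr_data)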
One of the assumptions of the Pearson correlation is that the data has a normal distribution.
The Shapiro-Wilk normality test for Garage Area has a p-value much less than 0.05, which suggests that the distribution of this data is not normal. If the null hypothesis that the distribution is normal were true, the probability of observing data at least this extreme would be 2.195e-12 (reject H0).
shapiro.test(corr_data$GarageArea)
##
## Shapiro-Wilk normality test
##
## data: corr_data$GarageArea
## W = 0.97847, p-value = 2.195e-12
The Shapiro-Wilk normality test for Above Ground Living Area has a p-value much less than 0.05, which suggests that the distribution of this data is not normal. If the null hypothesis that the distribution is normal were true, the probability of observing data at least this extreme would be less than 2.2e-16 (reject H0).
shapiro.test(corr_data$GrLivArea)
##
## Shapiro-Wilk normality test
##
## data: corr_data$GrLivArea
## W = 0.91971, p-value < 2.2e-16
The Shapiro-Wilk normality test for Lot Frontage has a p-value much less than 0.05, which suggests that the distribution of this data is not normal. If the null hypothesis that the distribution is normal were true, the probability of observing data at least this extreme would be less than 2.2e-16 (reject H0).
shapiro.test(corr_data$LotFrontage)
##
## Shapiro-Wilk normality test
##
## data: corr_data$LotFrontage
## W = 0.8804, p-value < 2.2e-16
Below is a visual representation of the Q-Q plots for each variable.
library(ggpubr)     # for ggqqplot
library(gridExtra)  # for grid.arrange
q1 <- ggqqplot(corr_data$GarageArea, ylab = "Garage Area")
q2 <- ggqqplot(corr_data$GrLivArea, ylab = "Above Ground Area")
q3 <- ggqqplot(corr_data$LotFrontage, ylab = "Lot Frontage")
grid.arrange(q1, q2, q3, nrow = 2, ncol = 2)
The Pearson correlation assumes linear relationship and homoscedasticity (homogeneity of variance).
A Residual vs Fitted plot that shows a horizontal red line that is close to zero suggests that the relationship is linear.
A Scale-location plot that shows a horizontal red line with points approximately equally spread out suggests a constant variance in the residual errors (homoscedasticity).
Overall, I think there is a linear relationship between the variables; however, there are some outliers that cause the red line in the Residual vs Fitted plots to deviate more.
Overall, I think that the variance in the residual errors is somewhat constant. The residuals appear to be equally spread out.
[Diagnostic plots (not reproduced): Garage Area vs. Above Ground Living Area; Garage Area vs. Lot Frontage; Above Ground Living Area vs. Lot Frontage]
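The code behind these diagnostics is not shown; a hedged sketch for one pair, using a simple linear model (the fit_pair name is mine, for illustration):
# Residuals vs Fitted (which = 1) and Scale-Location (which = 3)
fit_pair <- lm(GrLivArea ~ GarageArea, data = corr_data)
plot(fit_pair, which = c(1, 3))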
The cor function calculates the Pearson correlation of each variable with every other variable. The assumptions of the Pearson correlation were investigated above.
library(knitr)  # for kable
cor <- cor(corr_data, method = "pearson", use = "complete.obs")
kable(cor)
|             | GarageArea | GrLivArea | LotFrontage |
|---|---|---|---|
| GarageArea  | 1.0000000 | 0.4737098 | 0.3449967 |
| GrLivArea   | 0.4737098 | 1.0000000 | 0.4027974 |
| LotFrontage | 0.3449967 | 0.4027974 | 1.0000000 |
The cor.test function from the stats package (loaded by default) is used to perform the hypothesis tests.
corr_GarageArea_GrLivArea <- cor.test(corr_data$GarageArea, corr_data$GrLivArea, method = "pearson")
corr_GarageArea_LotFrontage <- cor.test(corr_data$GarageArea, corr_data$LotFrontage, method = "pearson")
corr_GrLivArea_LotFrontage <- cor.test(corr_data$GrLivArea, corr_data$LotFrontage , method = "pearson")
Results for Garage Area and Above Ground Living Area
The correlation coefficient is 0.4737098, which is significant with a p-value (< 2.2e-16) much less than the significance level of 0.05. If the null hypothesis were true (correlation is zero), the probability of observing data at least this extreme would be less than 2.2e-16. This is evidence to reject the null hypothesis in favor of the alternative that the correlation is not zero.
corr_GarageArea_GrLivArea
##
## Pearson's product-moment correlation
##
## data: corr_data$GarageArea and corr_data$GrLivArea
## t = 18.625, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4286292 0.5164374
## sample estimates:
## cor
## 0.4737098
Results for Garage Area and Lot Frontage
The correlation coefficient is 0.3449967, which is significant with a p-value (< 2.2e-16) much less than the significance level of 0.05. If the null hypothesis were true (correlation is zero), the probability of observing data at least this extreme would be less than 2.2e-16. This is evidence to reject the null hypothesis in favor of the alternative that the correlation is not zero.
corr_GarageArea_LotFrontage
##
## Pearson's product-moment correlation
##
## data: corr_data$GarageArea and corr_data$LotFrontage
## t = 12.727, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2941715 0.3938762
## sample estimates:
## cor
## 0.3449967
Results for Above Ground Living Area and Lot Frontage
The correlation coefficient is 0.4027974, which is significant with a p-value (< 2.2e-16) much less than the significance level of 0.05. If the null hypothesis were true (correlation is zero), the probability of observing data at least this extreme would be less than 2.2e-16. This is evidence to reject the null hypothesis in favor of the alternative that the correlation is not zero.
corr_GrLivArea_LotFrontage
##
## Pearson's product-moment correlation
##
## data: corr_data$GrLivArea and corr_data$LotFrontage
## t = 15.238, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3543041 0.4491302
## sample estimates:
## cor
## 0.4027974
The CIr function of the psychometric package is used to calculate the 80% confidence interval.
library(psychometric)  # for CIr
corr_data_n <- nrow(corr_data)
As you can see, zero does not fall within the range [0.4444933, 0.5019195].
CIr(r=corr_GarageArea_GrLivArea$estimate , n = corr_data_n, level = .80)
## [1] 0.4444933 0.5019195
As you can see, zero does not fall within the range [0.3119708, 0.3771899].
CIr(r=corr_GarageArea_LotFrontage$estimate , n = corr_data_n, level = .80)
## [1] 0.3119708 0.3771899
As you can see, zero does not fall within the range [0.3713236, 0.4333466].
CIr(r=corr_GrLivArea_LotFrontage$estimate , n = corr_data_n, level = .80)
## [1] 0.3713236 0.4333466
The variables have positive correlations ranging from 0.34 to 0.47 (weak to moderate), and all of the correlation coefficients were found to be significant. Garage Area and Above Ground Living Area have the strongest correlation at 0.47; Garage Area and Lot Frontage have the weakest at 0.34.
Houses with large garages tend to have large above ground living areas and more street frontage connected to the property, and houses with large above ground living areas tend to have more street frontage as well.
|             | GarageArea | GrLivArea | LotFrontage |
|---|---|---|---|
| GarageArea  | 1.0000000 | 0.4737098 | 0.3449967 |
| GrLivArea   | 0.4737098 | 1.0000000 | 0.4027974 |
| LotFrontage | 0.3449967 | 0.4027974 | 1.0000000 |
The familywise error rate needs to be considered when multiple statistical analyses are conducted on the same data. For this pairwise correlation of 3 variables, a total of 3 different comparisons were made. The significance level was set to 0.05 per analysis, which means that there is a 5% chance of committing a Type I error. Type I error happens when the null hypothesis is rejected when in fact it is true in the population. This 5% Type I error rate is only for a single analysis. When multiple analyses are made, we need to consider the familywise error rate. This is the overall Type I error rate of the multiple comparisons, which is known to be larger than the per analysis error rate.
The formula for calculating the familywise error rate is 1 - (1 - alpha)^k, where alpha is the significance level and k is the number of comparisons.
The familywise error rate in this case is 0.142625. This means the probability of committing at least one Type I error is 14.26%.
k <- 3
alpha <- .05
1 - (1-alpha)^k
## [1] 0.142625
To control the familywise error rate, a common technique is the Bonferroni correction. This simple approach adjusts alpha by dividing it by the number of comparisons. In this case, the adjusted alpha would be .05/3, or about 0.017. The p-values of the results would then need to be less than 0.017 for the null hypothesis to be rejected. This adjustment keeps the overall Type I error rate of the family of analyses at or below 0.05.
Below are the p-values of the analyses, and all are less than 0.017. This is evidence to reject the null hypothesis that the correlation is zero.
| Comparison | p-value | vs. adjusted alpha |
|---|---|---|
| GarageArea-GrLivArea   | 3.33606210043948e-68 | less than 0.017 |
| GarageArea-LotFrontage | 6.73480729281281e-35 | less than 0.017 |
| GrLivArea-LotFrontage  | 4.61242260931362e-48 | less than 0.017 |
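The exact p-values in this table can be pulled from the cor.test objects created above; a minimal sketch, using base R only:
# Extract p-values and compare against the Bonferroni-adjusted alpha
p_values <- c(GarageArea_GrLivArea   = corr_GarageArea_GrLivArea$p.value,
              GarageArea_LotFrontage = corr_GarageArea_LotFrontage$p.value,
              GrLivArea_LotFrontage  = corr_GrLivArea_LotFrontage$p.value)
p_values < 0.05 / 3
# Equivalently, adjust the p-values themselves and compare to 0.05
p.adjust(p_values, method = "bonferroni") < 0.05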
Since we are doing multiple analyses on the same data, I would consider controlling for the familywise error rate.
Reference: https://www.youtube.com/watch?v=rMuNniCTsOw
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
cor_matrix is the 3 x 3 matrix of the correlation coefficients.
cor_matrix <- data.matrix(cor)
rownames(cor_matrix) <- c()
colnames(cor_matrix) <- c()
cor_matrix
## [,1] [,2] [,3]
## [1,] 1.0000000 0.4737098 0.3449967
## [2,] 0.4737098 1.0000000 0.4027974
## [3,] 0.3449967 0.4027974 1.0000000
The determinant of the correlation matrix is not zero. This means that this matrix has an inverse.
det(cor_matrix)
## [1] 0.6259876
The inv function from the matlib package is used to compute the inverse of the correlation matrix (base R's solve would work as well).
library(matlib)  # for inv
precision_matrix <- inv(cor_matrix)
precision_matrix
## [,1] [,2] [,3]
## [1,] 1.3382921 -0.5347486 -0.2463111
## [2,] -0.5347486 1.4073399 -0.3823863
## [3,] -0.2463111 -0.3823863 1.2390007
The %*% operator performs matrix multiplication.
result1 <- cor_matrix %*% precision_matrix
result1
## [,1] [,2] [,3]
## [1,] 1.000000e+00 -4.854738e-09 5.875876e-09
## [2,] -2.944996e-09 1.000000e+00 6.229884e-09
## [3,] 1.509716e-09 5.486613e-10 1.000000e+00
We expect to get the identity matrix; however, at first glance it does not look like it is the identity matrix. The off diagonal values are very tiny. We can use the zapsmall function to round very small values to zero.
result1 <- zapsmall(result1)
result1
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
As expected, multiplying the precision matrix by the correlation matrix also results in the identity matrix.
result2 <- zapsmall(precision_matrix %*% cor_matrix)
result2
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
The lu.decomposition function from the matrixcalc package is used.
library(matrixcalc)  # for lu.decomposition
lu_decomp <- lu.decomposition(cor_matrix)
The lower triangular matrix L
L <- lu_decomp$L
L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.4737098 1.0000000 0
## [3,] 0.3449967 0.3086248 1
The upper triangular matrix U
U <- lu_decomp$U
U
## [,1] [,2] [,3]
## [1,] 1 0.4737098 0.3449967
## [2,] 0 0.7755991 0.2393691
## [3,] 0 0.0000000 0.8071020
Multiplying L and U should result in the correlation matrix.
L %*% U
## [,1] [,2] [,3]
## [1,] 1.0000000 0.4737098 0.3449967
## [2,] 0.4737098 1.0000000 0.4027974
## [3,] 0.3449967 0.4027974 1.0000000
As you can see, LU is equivalent to cor_matrix.
cor_matrix
## [,1] [,2] [,3]
## [1,] 1.0000000 0.4737098 0.3449967
## [2,] 0.4737098 1.0000000 0.4027974
## [3,] 0.3449967 0.4027974 1.0000000
References: https://cran.r-project.org/web/packages/matlib/vignettes/inv-ex1.html, https://www.youtube.com/watch?v=UlWcofkUDDU&t=350s
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training data set that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
I selected the variable MasVnrArea.
MasVnrArea is the masonry veneer area measured in square feet.
fit_data <- training_data$MasVnrArea
fit_data <- fit_data[complete.cases(fit_data)]
As you can see, the distribution is skewed to the right.
hist(fit_data)
As you can see, we have values that are zeroes. This would represent houses that do not have masonry veneers. The total number of observations is 1452.
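The statistics below appear to come from base summary() and psych's describe(); a minimal sketch:
summary(fit_data)
describe(fit_data)  # psych::describe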
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 103.7 166.0 1600.0
## vars n mean sd min max range se
## X1 1 1452 103.69 181.07 0 1600 1600 4.75
Out of the 1452 houses, there are 861 that have no masonry veneers (area is zero).
length(fit_data[fit_data == 0])
## [1] 861
Because the data measure area, adding a value of .01 should be negligible and gets rid of the zero values. A property with a masonry veneer area of .01 square feet effectively has no masonry veneer.
fit_data <- fit_data + .01
Below is the updated histogram and summary after the adjustment.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 0.01 0.01 103.70 166.01 1600.01
## vars n mean sd min max range se
## X1 1 1452 103.7 181.07 0.01 1600.01 1600 4.75
library(MASS)
fit <- fitdistr(fit_data, densfun="exponential")
Optimal value of λ for the fitted distribution
fit$estimate
## rate
## 0.009643642
Generate an exponential distribution with the same rate.
exp_dist <- rexp(length(fit_data), rate = fit$estimate)
[Histograms (not reproduced): fit_data (original, shifted) and exp_dist (simulated exponential with the fitted rate)]
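The plotting calls are not shown; a minimal sketch, assuming base graphics (the titles are mine):
# Side-by-side histograms of the original data and the simulated draws
par(mfrow = c(1, 2))
hist(fit_data, main = "MasVnrArea (shifted)")
hist(exp_dist, main = "Simulated exponential")
par(mfrow = c(1, 1))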
The histograms of fit_data and exp_dist have a similar general shape; however, the first bin of fit_data has a frequency that is about double the frequency of exp_dist. Both have the same count of observations, but the frequencies are distributed differently. Below, qexp (the inverse of the exponential CDF) gives the 5th and 95th percentiles of the fitted distribution.
qexp(.05, rate=fit$estimate)
## [1] 5.318872
qexp(.95, rate=fit$estimate)
## [1] 310.6432
This is a function that calculates the confidence interval assuming normality.
norm.interval = function(data, variance = var(data), conf.level = 0.95)
{
z = qnorm((1 - conf.level)/2, lower.tail = FALSE)
xbar = mean(data)
sdx = sqrt(variance/length(data))
c(xbar - z * sdx, xbar + z * sdx)
}
Reference: https://www.stat.wisc.edu/~yandell/st571/R/append7.pdf
95% confidence interval of the mean of the empirical data
norm.interval(fit_data, variance=var(fit_data), conf.level = 0.95)
## [1] 94.38199 113.00853
quantile(x=fit_data, probs=c(.05, .95))
## 5% 95%
## 0.01 456.01