Part 1: Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatter plot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the data set. Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?



Load training data set

Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

  • No. of observations: 1460
  • No. of attributes: 81
training_data <- read.csv(file="train.csv", header=TRUE, sep=",")


Preview of training data set

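A minimal sketch of how the data could be previewed (the original code chunk is not shown):

dim(training_data)           # 1460 rows, 81 columns
head(training_data[, 1:8])   # first rows of the first eight columns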


Provide univariate descriptive statistics and appropriate plots for the training data set


The descriptive statistics below give the minimum, 1st quartile, median, 3rd quartile, maximum, and the number of missing values, if any. For categorical data, the summary provides the frequency of the first few levels. Please scroll to the right to view all 81 variables in the training data.
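
These statistics are the output of base R's summary function (a sketch of the likely call):

summary(training_data)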


The descriptive statistics below come from the psych package and provide information such as the count of non-missing values, standard deviation, interquartile range, and standard error. By default, categorical variables are converted to numeric; these are marked with *.

Reference: https://www.rdocumentation.org/packages/psych/versions/1.8.10/topics/describe
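
A sketch of the likely call, assuming the psych package is installed:

library(psych)
describe(training_data)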



Visualize the dependent variable Sale Price

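A sketch of code that could produce the summary below and a histogram of the sale prices (the histogram itself is not reproduced here):

summary(training_data$SalePrice)
hist(training_data$SalePrice, breaks = 50,
     main = "Distribution of Sale Price", xlab = "Sale Price (USD)")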

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

As you can see, the distribution is skewed to the right, with some outlier prices in the tail. The minimum sale price is 34,900 USD and the maximum is 755,000 USD. The median sale price is 163,000 USD.


Visualize the distribution of some of the explanatory variables in the training set

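One possible sketch of these plots (the specific variables shown here are an assumption; the original figures are not reproduced):

par(mfrow = c(2, 2))
hist(training_data$GrLivArea, main = "Above Ground Living Area", xlab = "Square feet")
hist(training_data$GarageArea, main = "Garage Area", xlab = "Square feet")
hist(training_data$LotFrontage, main = "Lot Frontage", xlab = "Feet")
hist(training_data$YearBuilt, main = "Year Built", xlab = "Year")
par(mfrow = c(1, 1))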



Provide a scatterplot matrix for at least two of the independent variables and the dependent variable


  • Dependent variable: Sale Price
  • Selected independent continuous variables: Garage Area, Above Ground Living Area, and Lot Frontage
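
A minimal sketch of how the scatterplot matrix could be produced with base R's pairs function (an assumption; the original code is not shown):

pairs(training_data[, c("SalePrice", "GarageArea", "GrLivArea", "LotFrontage")],
      main = "Scatterplot Matrix")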




The lot frontage measures the linear feet of street connected to the property.



Derive a correlation matrix for any THREE quantitative variables in the dataset


Data for Correlation

The corr_data data frame only includes complete cases.

Each of the variables is continuous. These variables measure area (in square feet) or distance (in feet).

corr_data <- data.frame(training_data$GarageArea, training_data$GrLivArea, training_data$LotFrontage)
corr_data <- corr_data[complete.cases(corr_data), ]
colnames(corr_data) <- c("GarageArea", "GrLivArea", "LotFrontage")


Assumptions of Pearson Correlation

The Pearson correlation assumes that each of the variables is continuous, each observation is complete (no missing data), there are no outliers in any variable, and the relationship between the variables is linear and homoscedastic. Missing data were removed from corr_data. The remaining assumptions are investigated below.


Linear Relationship: Scatterplot of the Variables

The plots below show a linear relationship. I do not observe any noticeable curved patterns.


Garage Area and Above Ground Living Area

Garage Area and Lot Frontage

Above Ground Living Area and Lot Frontage
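
A sketch of how these three scatterplots could be drawn, assuming base graphics:

par(mfrow = c(2, 2))
plot(corr_data$GarageArea, corr_data$GrLivArea,
     xlab = "Garage Area", ylab = "Above Ground Living Area")
plot(corr_data$GarageArea, corr_data$LotFrontage,
     xlab = "Garage Area", ylab = "Lot Frontage")
plot(corr_data$GrLivArea, corr_data$LotFrontage,
     xlab = "Above Ground Living Area", ylab = "Lot Frontage")
par(mfrow = c(1, 1))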


Normality

One of the assumptions of the Pearson correlation is that the data has a normal distribution.

The Shapiro-Wilk normality test for Garage Area has a p-value much less than 0.05, which suggests that the distribution of this data is not normal. If the null hypothesis that the distribution is normal were true, the probability of observing data at least this extreme would be 2.195e-12 (reject H0).

shapiro.test(corr_data$GarageArea)
## 
##  Shapiro-Wilk normality test
## 
## data:  corr_data$GarageArea
## W = 0.97847, p-value = 2.195e-12

The Shapiro-Wilk normality test for Above Ground Living Area has a p-value much less than 0.05, which suggests that the distribution of this data is not normal. If the null hypothesis that the distribution is normal were true, the probability of observing data at least this extreme would be less than 2.2e-16 (reject H0).

shapiro.test(corr_data$GrLivArea)
## 
##  Shapiro-Wilk normality test
## 
## data:  corr_data$GrLivArea
## W = 0.91971, p-value < 2.2e-16

The Shapiro-Wilk normality test for Lot Frontage has a p-value much less than 0.05, which suggests that the distribution of this data is not normal. If the null hypothesis that the distribution is normal were true, the probability of observing data at least this extreme would be less than 2.2e-16 (reject H0).

shapiro.test(corr_data$LotFrontage)
## 
##  Shapiro-Wilk normality test
## 
## data:  corr_data$LotFrontage
## W = 0.8804, p-value < 2.2e-16

Below are the Q-Q plots of each of the variables.

library(ggpubr)     # provides ggqqplot
library(gridExtra)  # provides grid.arrange

q1 <- ggqqplot(corr_data$GarageArea, ylab = "Garage Area")
q2 <- ggqqplot(corr_data$GrLivArea, ylab = "Above Ground Area")
q3 <- ggqqplot(corr_data$LotFrontage, ylab = "Lot Frontage")
grid.arrange(q1, q2, q3, nrow = 2, ncol = 2)


Linearity and Homoscedasticity


The Pearson correlation assumes linear relationship and homoscedasticity (homogeneity of variance).

A Residual vs Fitted plot that shows a horizontal red line that is close to zero suggests that the relationship is linear.

A Scale-location plot that shows a horizontal red line with points approximately equally spread out suggests a constant variance in the residual errors (homoscedasticity).

Overall, I think that there is a linear relationship between the variables; however, some outliers cause the red line in the Residual vs Fitted plots to deviate more noticeably.

Overall, I think that the variance in the residual errors is somewhat constant. The residuals appear to be equally spread out.


Garage Area and Above Ground Living Area

Garage Area and Lot Frontage

Above Ground Living Area and Lot Frontage
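
These diagnostics come from plotting simple linear models fitted to each pair of variables. A sketch for the first pair (the name fit_pair is hypothetical; the other two pairs follow the same pattern):

fit_pair <- lm(GrLivArea ~ GarageArea, data = corr_data)
plot(fit_pair, which = 1)  # Residuals vs Fitted
plot(fit_pair, which = 3)  # Scale-Location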

Reference: http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials


Correlation Matrix

The cor function calculates the Pearson correlation of each variable with every other variable. The assumptions of the Pearson correlation were investigated above.

library(knitr)  # provides kable

cor <- cor(corr_data, method = "pearson", use = "complete.obs")
kable(cor)
             GarageArea   GrLivArea   LotFrontage
GarageArea    1.0000000   0.4737098     0.3449967
GrLivArea     0.4737098   1.0000000     0.4027974
LotFrontage   0.3449967   0.4027974     1.0000000


Test the hypotheses that the correlations between each pairwise set of variables is 0


  • H0: The correlation is zero.
  • H1: The correlation is not zero.
  • Significance level is 0.05.

The cor.test function from base R's stats package is used to perform the hypothesis tests.


corr_GarageArea_GrLivArea <- cor.test(corr_data$GarageArea, corr_data$GrLivArea, method = "pearson")
corr_Garagearea_LotFrontage <- cor.test(corr_data$GarageArea, corr_data$LotFrontage , method = "pearson")
corr_GrLivArea_LotFrontage <- cor.test(corr_data$GrLivArea, corr_data$LotFrontage , method = "pearson")


Results for Garage Area and Above Ground Living Area

The correlation coefficient is 0.4737098, which is significant: the p-value (< 2.2e-16) is much less than the significance level of 0.05. If the null hypothesis were true (correlation is zero), the probability of observing a correlation at least this extreme would be less than 2.2e-16. This is evidence to reject the null hypothesis in favor of the alternative that the correlation is not zero.

corr_GarageArea_GrLivArea
## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$GarageArea and corr_data$GrLivArea
## t = 18.625, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4286292 0.5164374
## sample estimates:
##       cor 
## 0.4737098


Results for Garage Area and Lot Frontage

The correlation coefficient is 0.3449967, which is significant: the p-value (< 2.2e-16) is much less than the significance level of 0.05. If the null hypothesis were true (correlation is zero), the probability of observing a correlation at least this extreme would be less than 2.2e-16. This is evidence to reject the null hypothesis in favor of the alternative that the correlation is not zero.

corr_Garagearea_LotFrontage
## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$GarageArea and corr_data$LotFrontage
## t = 12.727, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2941715 0.3938762
## sample estimates:
##       cor 
## 0.3449967


Results for Above Ground Living Area and Lot Frontage

The correlation coefficient is 0.4027974, which is significant: the p-value (< 2.2e-16) is much less than the significance level of 0.05. If the null hypothesis were true (correlation is zero), the probability of observing a correlation at least this extreme would be less than 2.2e-16. This is evidence to reject the null hypothesis in favor of the alternative that the correlation is not zero.

corr_GrLivArea_LotFrontage
## 
##  Pearson's product-moment correlation
## 
## data:  corr_data$GrLivArea and corr_data$LotFrontage
## t = 15.238, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3543041 0.4491302
## sample estimates:
##       cor 
## 0.4027974


Provide an 80% confidence interval


The CIr function from the psychometric package is used to calculate the 80% confidence intervals.

library(psychometric)  # provides CIr

corr_data_n <- nrow(corr_data)


80% confidence interval of the correlation between GarageArea and GrLivArea.

As you can see, zero does not fall within the range [0.4444933, 0.5019195].

CIr(r=corr_GarageArea_GrLivArea$estimate , n = corr_data_n, level = .80)
## [1] 0.4444933 0.5019195
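
CIr computes this interval via the Fisher z transformation; a manual sketch reproduces the same bounds:

r <- corr_GarageArea_GrLivArea$estimate
z <- atanh(r)                          # Fisher transform of r
se <- 1 / sqrt(corr_data_n - 3)        # standard error of z
tanh(z + c(-1, 1) * qnorm(0.90) * se)  # back-transform; matches [0.4444933, 0.5019195]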


80% confidence interval of the correlation between GarageArea and LotFrontage.

As you can see, zero does not fall within the range [0.3119708, 0.3771899].

CIr(r=corr_Garagearea_LotFrontage$estimate , n = corr_data_n, level = .80)
## [1] 0.3119708 0.3771899


80% confidence interval of the correlation between GrLivArea and LotFrontage.

As you can see, zero does not fall within the range [0.3713236, 0.4333466].

CIr(r=corr_GrLivArea_LotFrontage$estimate , n = corr_data_n, level = .80)
## [1] 0.3713236 0.4333466


Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?


The variables have positive correlations ranging from 0.34 to 0.47 (weak to moderate), and all of the correlation coefficients were found to be significant. Garage Area and Above Ground Living Area have the strongest correlation at 0.47; Garage Area and Lot Frontage have the weakest at 0.34.

Houses with larger garages tend to have larger above-ground living areas and more linear feet of street connected to the property, and houses with larger above-ground living areas likewise tend to have more street frontage.


             GarageArea   GrLivArea   LotFrontage
GarageArea    1.0000000   0.4737098     0.3449967
GrLivArea     0.4737098   1.0000000     0.4027974
LotFrontage   0.3449967   0.4027974     1.0000000


Considering Familywise Error Rate


The familywise error rate needs to be considered when multiple statistical analyses are conducted on the same data. For this pairwise correlation of 3 variables, a total of 3 different comparisons were made. The significance level was set to 0.05 per analysis, which means that there is a 5% chance of committing a Type I error. Type I error happens when the null hypothesis is rejected when in fact it is true in the population. This 5% Type I error rate is only for a single analysis. When multiple analyses are made, we need to consider the familywise error rate. This is the overall Type I error rate of the multiple comparisons, which is known to be larger than the per analysis error rate.

The formula for calculating the familywise error rate, assuming the comparisons are independent, is 1 - (1 - alpha)^k, where alpha is the per-comparison significance level and k is the number of comparisons.

The familywise error rate in this case is 0.142625, meaning the probability of committing at least one Type I error across the three tests is about 14.26%.

k <- 3
alpha <- .05
1 - (1-alpha)^k
## [1] 0.142625

To control the familywise error rate, a common technique is the Bonferroni correction. This simple approach divides alpha by the number of comparisons; in this case, the adjusted alpha would be .05/3, or around 0.017. A p-value would then need to be less than 0.017 for the null hypothesis to be rejected. This adjustment keeps the overall Type I error rate of the family of analyses at or below 0.05.
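
Equivalently, base R's p.adjust scales the p-values themselves, which are then compared against the original alpha of 0.05 (a sketch):

pvals <- c(corr_GarageArea_GrLivArea$p.value,
           corr_Garagearea_LotFrontage$p.value,
           corr_GrLivArea_LotFrontage$p.value)
p.adjust(pvals, method = "bonferroni")  # all remain far below 0.05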

Below are the p-values of the analyses; all are less than 0.017, which is evidence to reject the null hypothesis that each correlation is zero.

Comparison                P-value                 Compared to adjusted alpha
GarageArea-GrLivArea      3.33606210043948e-68    less than 0.017
GarageArea-LotFrontage    6.73480729281281e-35    less than 0.017
GrLivArea-LotFrontage     4.61242260931362e-48    less than 0.017

Since we are doing multiple analyses on the same data, I would consider controlling for the familywise error rate.

Reference: https://www.youtube.com/watch?v=rMuNniCTsOw



Part 2: Linear Algebra and Correlation

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.



cor_matrix is the 3 x 3 matrix of the correlation coefficients.

cor_matrix <- data.matrix(cor)
rownames(cor_matrix) <- c()
colnames(cor_matrix) <- c()
cor_matrix
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.4737098 0.3449967
## [2,] 0.4737098 1.0000000 0.4027974
## [3,] 0.3449967 0.4027974 1.0000000


Precision Matrix: invert your 3 x 3 correlation matrix


The determinant of the correlation matrix is not zero. This means that this matrix has an inverse.

det(cor_matrix)
## [1] 0.6259876

The inv function from the matlib package is used to generate the inverse of the correlation matrix.

library(matlib)  # provides inv()

precision_matrix <- inv(cor_matrix)
precision_matrix
##            [,1]       [,2]       [,3]
## [1,]  1.3382921 -0.5347486 -0.2463111
## [2,] -0.5347486  1.4073399 -0.3823863
## [3,] -0.2463111 -0.3823863  1.2390007
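
Base R's solve gives the same inverse without an extra package, which serves as a quick cross-check:

solve(cor_matrix)  # matches inv(cor_matrix) up to floating-point error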


Multiply the correlation matrix by the precision matrix


The operator %*% performs the matrix multiplication.

result1 <- cor_matrix %*% precision_matrix
result1
##               [,1]          [,2]         [,3]
## [1,]  1.000000e+00 -4.854738e-09 5.875876e-09
## [2,] -2.944996e-09  1.000000e+00 6.229884e-09
## [3,]  1.509716e-09  5.486613e-10 1.000000e+00

We expect to get the identity matrix; at first glance the result does not look like one, but the off-diagonal values are tiny floating-point round-off errors. The zapsmall function rounds such very small values to zero.

result1 <- zapsmall(result1)
result1
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1


Multiply the precision matrix by the correlation matrix


As expected, multiplying the precision matrix by the correlation matrix also results in the identity matrix.

result2 <- zapsmall(precision_matrix %*% cor_matrix)
result2
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1


Conduct LU decomposition on the matrix


The lu.decomposition function from the matrixcalc package is used.

library(matrixcalc)  # provides lu.decomposition
lu_decomp <- lu.decomposition(cor_matrix)


The lower triangular matrix L

L <- lu_decomp$L
L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.4737098 1.0000000    0
## [3,] 0.3449967 0.3086248    1


The upper triangular matrix U

U <- lu_decomp$U
U
##      [,1]      [,2]      [,3]
## [1,]    1 0.4737098 0.3449967
## [2,]    0 0.7755991 0.2393691
## [3,]    0 0.0000000 0.8071020


Multiplying L and U should result in the correlation matrix.

L %*% U
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.4737098 0.3449967
## [2,] 0.4737098 1.0000000 0.4027974
## [3,] 0.3449967 0.4027974 1.0000000


As you can see, LU is equivalent to the cor_matrix.

cor_matrix
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.4737098 0.3449967
## [2,] 0.4737098 1.0000000 0.4027974
## [3,] 0.3449967 0.4027974 1.0000000


References: https://cran.r-project.org/web/packages/matlib/vignettes/inv-ex1.html, https://www.youtube.com/watch?v=UlWcofkUDDU&t=350s



Part 3: Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training data set that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.



Select a variable in the training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.


I selected the variable MasVnrArea.

MasVnrArea is the masonry veneer area measured in square feet.

fit_data <- training_data$MasVnrArea
fit_data <- fit_data[complete.cases(fit_data)]


As you can see, the distribution is skewed to the right.

hist(fit_data)


As you can see, there are values that are zero; these represent houses that do not have masonry veneer. The total number of observations is 1452.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   103.7   166.0  1600.0
##    vars    n   mean     sd min  max range   se
## X1    1 1452 103.69 181.07   0 1600  1600 4.75

Out of the 1452 houses, there are 861 that have no masonry veneers (area is zero).

length(fit_data[fit_data == 0])
## [1] 861

Because the data measure area, adding .01 should be negligible and removes the zero values. A property with a masonry veneer area of .01 square feet effectively has no masonry veneer.

fit_data <- fit_data + .01

Below is the updated histogram and summary after the adjustment.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.01    0.01    0.01  103.70  166.01 1600.01
##    vars    n  mean     sd  min     max range   se
## X1    1 1452 103.7 181.07 0.01 1600.01  1600 4.75


Load the MASS package and run fitdistr to fit an exponential probability density function.

library(MASS)
fit <- fitdistr(fit_data, densfun="exponential")


Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable.


Optimal value (the rate λ) for the fitted distribution

fit$estimate
##        rate 
## 0.009643642
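
For an exponential distribution, the maximum-likelihood estimate of the rate is the reciprocal of the sample mean, which gives a quick sanity check:

1 / mean(fit_data)  # should match fit$estimate (about 0.00964)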


Generate exponential distribution with the same rate

# the prompt suggests 1000 draws; using length(fit_data) = 1452 draws instead
# keeps the counts identical, making the two histograms directly comparable
exp_dist <- rexp(length(fit_data), rate = fit$estimate)


Histogram of fitted data


Histogram of exponentially distributed data with same rate as fitted data

The histograms of fit_data and exp_dist have a similar general shape; however, the first bin of fit_data has a frequency roughly double that of exp_dist. Both have the same number of observations, but the frequencies are distributed differently.
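
A sketch of how the two histograms could be placed side by side for comparison:

par(mfrow = c(1, 2))
hist(fit_data, breaks = 30, main = "Observed MasVnrArea", xlab = "Square feet")
hist(exp_dist, breaks = 30, main = "Simulated exponential", xlab = "Square feet")
par(mfrow = c(1, 1))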


Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).


qexp(.05, rate=fit$estimate)
## [1] 5.318872
qexp(.95, rate=fit$estimate)
## [1] 310.6432
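
qexp is the inverse of the exponential CDF, so plugging these quantiles back into pexp recovers the probabilities:

pexp(qexp(.05, rate = fit$estimate), rate = fit$estimate)  # 0.05
pexp(qexp(.95, rate = fit$estimate), rate = fit$estimate)  # 0.95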


Generate a 95% confidence interval from the empirical data, assuming normality.


This is a function that calculates the confidence interval assuming normality.

norm.interval = function(data, variance = var(data), conf.level = 0.95) 
{
      # critical z value for a two-sided interval at the given confidence level
      z = qnorm((1 - conf.level)/2, lower.tail = FALSE)
      xbar = mean(data)
      # standard error of the mean
      sdx = sqrt(variance/length(data))
      c(xbar - z * sdx, xbar + z * sdx)
}

Reference: https://www.stat.wisc.edu/~yandell/st571/R/append7.pdf


95% confidence interval of the mean of the empirical data

norm.interval(fit_data, variance=var(fit_data), conf.level = 0.95)
## [1]  94.38199 113.00853
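
As a cross-check, base R's t.test produces a nearly identical interval, since the t and z critical values converge at this sample size (n = 1452):

t.test(fit_data)$conf.int  # nearly identical to the normal-based interval above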


Provide the empirical 5th percentile and 95th percentile of the data.


quantile(x=fit_data, probs=c(.05, .95))
##     5%    95% 
##   0.01 456.01
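

Discuss

The exponential fit captures the right skew of MasVnrArea but not the large spike at zero (861 of the 1452 houses have no masonry veneer). As a result, the theoretical 5th percentile (about 5.3) sits far above the empirical 5th percentile (0.01), while the theoretical 95th percentile (about 311) falls well below the empirical 456.01. The 95% confidence interval [94.38, 113.01] describes the mean, not the tails, so it is not directly comparable to the percentiles. Overall, the exponential distribution is at best a rough approximation for this zero-inflated variable.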