I listed the quantitative variables and ranked them by skewness, from most to least skewed. The most skewed variable is MiscVal, but I chose to use LotArea because it's more interesting and relevant.
The dependent variable is SalePrice.
##      MiscVal     PoolArea      LotArea LowQualFinSF   BsmtFinSF2 
##    24.426522    14.797918    12.182615     8.992833     4.246521
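A minimal sketch of the ranking step, assuming a moment-based definition of skewness. The helper `skew` and the toy data frame are hypothetical; the report does not show which skewness function was actually used, and other definitions differ by a small-sample correction.

```r
# Moment-based sample skewness: m3 / m2^(3/2), where m_k is the k-th central moment.
# `skew` is a hypothetical helper, not necessarily the function used in the report.
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy data, made up for illustration only.
toy <- data.frame(
  symmetric  = c(1, 2, 3, 4, 5),
  right_tail = c(1, 1, 2, 2, 50)
)

# Skewness of every column, sorted from most to least skewed.
skews <- sort(sapply(toy, skew), decreasing = TRUE)
print(skews)
```

On the real data the same `sapply` call would run over the numeric columns only.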
Calculating the probabilities:
## P(X > x | Y > y): 0.3791209
## P(X > x, Y > y): 0.1890411
## P(X < x | Y > y): 0.6208791
Getting the table of counts based on the probabilities:
##                 <= 2nd quartile > 2nd quartile Total
## <= 3rd quartile             643            452  1095
## > 3rd quartile               89            276   365
## Total                       732            728  1460
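The three probabilities can be recovered directly from the counts in the table. The sketch below assumes that the rows split X at its 3rd quartile and the columns split Y at its 2nd quartile, which is what the row and column totals suggest; the object names are illustrative.

```r
# Counts copied from the table above.
# Rows: X relative to its 3rd quartile; columns: Y relative to its 2nd quartile.
counts <- matrix(c(643, 452,
                    89, 276),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("<= Q3", "> Q3"),
                                 Y = c("<= Q2", "> Q2")))
n <- sum(counts)                    # 1460 observations
n_y_high <- sum(counts[, "> Q2"])   # 728 observations with Y > y

p_joint   <- counts["> Q3", "> Q2"] / n           # P(X > x, Y > y)
p_cond_gt <- counts["> Q3", "> Q2"] / n_y_high    # P(X > x | Y > y)
p_cond_lt <- counts["<= Q3", "> Q2"] / n_y_high   # P(X < x | Y > y)

print(c(joint = p_joint, cond_gt = p_cond_gt, cond_lt = p_cond_lt))
```

The two conditional probabilities sum to 1 because they condition on the same event.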
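Let A be the event X > x and B the event Y > y. If the two events were independent, we would have P(A|B) = P(A), or equivalently P(A, B) = P(A)P(B). Instead, P(A|B) = 0.379 differs from P(A) = 0.25, and P(A, B) = 0.189 differs from P(A)P(B) = 0.125. Using the chi-squared test, we see that the p-value is much lower than 0.05. Given this information, we can confidently say that there is an association between the variables.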
## P(A): 0.25
## P(B): 0.4986301
## P(A|B): 0.3791209
## P(A) * P(B): 0.1246575
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_table
## X-squared = 127.74, df = 1, p-value < 2.2e-16
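As a sketch, the independence check and the test can be reproduced from the four cell counts alone. `chisq.test` applies Yates' continuity correction to 2x2 tables by default, which matches the output above; the variable names are illustrative.

```r
# Cell counts from the contingency table (rows: X <= Q3, X > Q3; columns: Y <= Q2, Y > Q2).
contingency_table <- matrix(c(643, 452,
                               89, 276),
                            nrow = 2, byrow = TRUE)
n <- sum(contingency_table)

p_a  <- sum(contingency_table[2, ]) / n   # P(A) = P(X > x)
p_b  <- sum(contingency_table[, 2]) / n   # P(B) = P(Y > y)
p_ab <- contingency_table[2, 2] / n       # P(A, B)
p_a_given_b <- p_ab / p_b                 # P(A | B)

# Under independence, P(A | B) would equal P(A) and P(A, B) would equal P(A) * P(B).
print(c(p_a = p_a, p_a_given_b = p_a_given_b, p_ab = p_ab, product = p_a * p_b))

# Pearson chi-squared test; Yates' correction is the default for 2x2 tables.
test <- chisq.test(contingency_table)
print(test)
```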
Below is a summary of the univariate statistics. The mean lot area is 10516.83, and the lot area standard deviation is 9981.265. Interestingly, there is a maximum lot area of 215,245, which must be either a mistake or an extreme outlier, perhaps a large building with multiple family units or a farm (quite possible for Iowa).
Sale prices also vary greatly, with a mean sale price of 180,921.2 and a standard deviation of 79,442.5. The surprising part was the minimum price of 34,900.
##   LotArea_mean LotArea_median LotArea_sd LotArea_min LotArea_max SalePrice_mean
## 1     10516.83         9478.5   9981.265        1300      215245       180921.2
##   SalePrice_median SalePrice_sd SalePrice_min SalePrice_max
## 1           163000      79442.5         34900        755000
The histograms below show us that both variables skew right. Most
values are to the left of the mean, which means that the extreme values
are to the right. This tells us that it's more likely for houses to be
smaller and cheaper rather than larger and expensive.
The boxplots tell us the same thing: there's a low median sale
price/home lot size and a lot of the houses are around there, with many
outliers on the higher end.
The scatterplot below shows us that there is at least some positive
correlation between LotArea and SalePrice.
The 95% confidence interval for the difference in means between LotArea and SalePrice runs from -174514.8 to -166293.9. Since the whole interval is negative, the mean SalePrice is higher than the mean LotArea. Considering the units (dollars vs. square feet), this makes sense.
## 95% CI for the difference in means: -174514.8 to -166293.9
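A sketch of how an interval like this can be rebuilt from the summary statistics alone, assuming a two-sample Welch t-interval with n = 1460 observations per variable. The report does not show which procedure or degrees of freedom were used, so this approximately reproduces the interval (to within about one unit) rather than matching it exactly.

```r
# Summary statistics copied from the report; n is taken from the table total.
n   <- 1460
m_x <- 10516.83; s_x <- 9981.265   # LotArea mean and standard deviation
m_y <- 180921.2; s_y <- 79442.5    # SalePrice mean and standard deviation

# Welch standard error and Welch-Satterthwaite degrees of freedom (assumed method).
v_x <- s_x^2 / n
v_y <- s_y^2 / n
se  <- sqrt(v_x + v_y)
df_welch <- (v_x + v_y)^2 / (v_x^2 / (n - 1) + v_y^2 / (n - 1))

# 95% confidence interval for mean(LotArea) - mean(SalePrice).
ci <- (m_x - m_y) + c(-1, 1) * qt(0.975, df_welch) * se
print(ci)
```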
This correlation matrix shows us that the correlation coefficient between LotArea and SalePrice is 0.2638434. This suggests that there is a positive linear relationship between LotArea and SalePrice, but it is not a very strong relationship.
##             LotArea SalePrice
## LotArea   1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
The correlation between LotArea and SalePrice is between 0.2000196 and 0.3254375 with 99% confidence. This further supports a positive correlation between the two variables, especially considering the very low p-value of 1.123139e-24.
## Correlation coefficient: 0.2638434
## 99% CI for the correlation: 0.2000196 to 0.3254375
## p-value of the correlation test: 1.123139e-24
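For reference, the interval can be reconstructed from r and n using the Fisher z-transformation, which is the standard approach for a Pearson correlation confidence interval. This is a sketch from the reported values, with n = 1460 taken from the table total.

```r
r <- 0.2638434   # reported Pearson correlation
n <- 1460        # number of observations

# Fisher z-transformation: atanh(r) is approximately normal with SE 1 / sqrt(n - 3).
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# 99% confidence interval, transformed back to the correlation scale.
ci <- tanh(z + c(-1, 1) * qnorm(0.995) * se)
print(ci)

# Two-sided p-value from the t statistic with n - 2 degrees of freedom.
t_stat  <- r * sqrt((n - 2) / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
print(p_value)
```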
Overall, we can be confident that the correlation between LotArea and SalePrice is positive, but the relationship isn't very strong. This suggests that LotArea is a useful factor for predicting SalePrice, but that other factors affect it as well.
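Below we compute the correlation matrix and invert it to obtain the precision matrix. With only two variables there is nothing else to control for, so the inverse simply rescales the correlation: the diagonal entries equal 1/(1 - r^2), which is also the variance inflation factor, and the off-diagonal entries equal -r/(1 - r^2).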
##             LotArea SalePrice
## LotArea   1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
## [1] "Precision Matrix (Inverse of Correlation Matrix):"
##              LotArea  SalePrice
## LotArea    1.0748219 -0.2835846
## SalePrice -0.2835846  1.0748219
We then check to make sure that the inversion was correct by multiplying the correlation matrix and its inverse and vice versa. Both multiplications result in the identity matrix so we are all good.
## [1] "Correlation Matrix * Precision Matrix:"
##           LotArea SalePrice
## LotArea         1         0
## SalePrice       0         1
## [1] "Precision Matrix * Correlation Matrix:"
##           LotArea SalePrice
## LotArea         1         0
## SalePrice       0         1
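The inversion and the identity check can be sketched in base R from the reported correlation alone. The object names `cor_mat` and `prec_mat` are illustrative; `solve()` with a single matrix argument returns its inverse.

```r
r <- 0.2638434   # reported correlation between LotArea and SalePrice
vars <- c("LotArea", "SalePrice")
cor_mat <- matrix(c(1, r,
                    r, 1),
                  nrow = 2, byrow = TRUE, dimnames = list(vars, vars))

# Precision matrix: the inverse of the correlation matrix.
prec_mat <- solve(cor_mat)
print(prec_mat)

# For a 2x2 correlation matrix the inverse has a closed form.
closed_form <- matrix(c(1, -r,
                        -r, 1), nrow = 2, byrow = TRUE) / (1 - r^2)

# Multiplying in either order should give the identity, up to floating-point error.
print(round(cor_mat %*% prec_mat, 10))
print(round(prec_mat %*% cor_mat, 10))
```

Rounding before printing hides tiny floating-point residue such as `1e-17` in the off-diagonal cells.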
To go further into the relationship between these two variables, we turn to principal component analysis. The PCA performed below decomposes the two standardized variables into two uncorrelated components, PC1 and PC2, each of which accounts for a share of the total variance. We see from the PCA summary below that PC1 explains 63.19% of the total variance, while PC2 explains 36.81%.
## [1] "PCA Summary:"
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.1242 0.8580
## Proportion of Variance 0.6319 0.3681
## Cumulative Proportion  0.6319 1.0000
PCA loadings show us how each variable contributes to the principal components. We see that LotArea and SalePrice both contribute positively to PC1, so PC1 increases as either one increases. PC2 is a little different: LotArea contributes positively to PC2 while SalePrice contributes negatively. This means that if SalePrice increases while LotArea stays the same, PC2 will decrease.
## [1] "PCA Loadings:"
##                 PC1        PC2
## LotArea   0.7071068  0.7071068
## SalePrice 0.7071068 -0.7071068
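The summary and loadings are consistent with a PCA on standardized variables, which is equivalent to an eigendecomposition of the 2x2 correlation matrix. The sketch below works from the reported correlation only. The eigenvalues are 1 + r and 1 - r, and the signs of the eigenvectors are arbitrary, so they may be flipped relative to the output above.

```r
r <- 0.2638434   # reported correlation between LotArea and SalePrice
vars <- c("LotArea", "SalePrice")
cor_mat <- matrix(c(1, r,
                    r, 1),
                  nrow = 2, byrow = TRUE, dimnames = list(vars, vars))

# Eigendecomposition of the correlation matrix.
eig <- eigen(cor_mat)

sdev     <- sqrt(eig$values)               # standard deviations of PC1 and PC2
prop_var <- eig$values / sum(eig$values)   # proportion of variance explained

loadings <- eig$vectors                    # columns are PC1 and PC2 (sign is arbitrary)
dimnames(loadings) <- list(vars, c("PC1", "PC2"))

print(round(sdev, 4))
print(round(prop_var, 4))
print(loadings)
```

With two standardized variables the loadings are always ±1/√2, whatever the correlation; only the variance split between PC1 and PC2 depends on r.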
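The PCA scatterplot below plots PC1 against PC2. PC1 increases when
either LotArea or SalePrice increases, while PC2 measures the contrast
between them: it rises when LotArea is high relative to SalePrice and
falls when SalePrice is high relative to LotArea. Note that PC1 and PC2
are uncorrelated by construction, so an increase in PC1 does not by
itself imply a decrease in PC2.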
## [1] "Variance Inflation Factors (VIFs):"
##   LotArea SalePrice 
##  1.074822  1.074822
First, we check whether the variable needs to be shifted so that its minimum value is above zero.
## [1] 1300
Since the minimum value (1300) is already above 0, we can skip the shift step. We then fit an exponential distribution to this variable.
The rate parameter lambda controls how quickly the exponential density decays: a lower lambda puts more probability on large LotArea values. The exponential distribution has a mean of 1/lambda, which here equals 10517.
We then draw 1000 values from the fitted exponential distribution and compare them with the original values of LotArea.
## [1] "Optimal value of λ:"
## rate
## 9.50857e-05
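For an exponential distribution the maximum-likelihood estimate of the rate is the reciprocal of the sample mean, so the fitted value can be checked from the reported mean alone. The `rate` label in the output suggests a fitting routine such as `MASS::fitdistr`, but that is an assumption. The seed below is arbitrary, so the simulated values will not match the report's draw.

```r
# Reported sample mean of LotArea.
mean_lot_area <- 10516.83

# Maximum-likelihood estimate of the exponential rate: 1 / sample mean.
lambda_hat <- 1 / mean_lot_area
print(lambda_hat)

# Draw 1000 values from the fitted distribution (arbitrary seed for reproducibility).
set.seed(42)
sim <- rexp(1000, rate = lambda_hat)
print(summary(sim))
```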
The overlaid histograms show that the exponential samples capture the right skew of the variable but not its shape: the simulated values are much more right-skewed and miss the concentration of observations around the most common lot sizes.
Comparing the theoretical percentiles with the empirical ones, we see that the 5th percentile from the exponential CDF is much lower than the empirical 5th percentile: 539.4428 vs. 3311.7.
We see that the CDF 95th percentile is 31505.6, much higher than the empirical 95th percentile of 17401.15.
This shows that in addition to not capturing the shape of the data very well, this exponential distribution also doesn’t capture data well at the tails.
## 5th percentile using exponential CDF: 539.4428
## 95th percentile using exponential CDF: 31505.6
## Empirical 5th percentile: 3311.7
## Empirical 95th percentile: 17401.15
## 95% CI for the empirical data: 10004.42 to 11029.24
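The theoretical percentiles follow from inverting the exponential CDF, x_p = -log(1 - p) / lambda, which is what `qexp` computes. The sketch below uses the fitted rate from the report. The last interval appears consistent with a t-based 95% confidence interval for the mean of LotArea, which is an assumption since the report does not show how it was computed.

```r
lambda_hat <- 9.50857e-05   # fitted rate from the report

# 5th and 95th percentiles of the fitted exponential distribution.
theoretical <- qexp(c(0.05, 0.95), rate = lambda_hat)
print(theoretical)

# The same values from the closed-form inverse CDF.
by_hand <- -log(1 - c(0.05, 0.95)) / lambda_hat

# Assumed method: t-based 95% CI for the mean, from the reported summary statistics.
n <- 1460
m <- 10516.83
s <- 9981.265
ci_mean <- m + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)
print(ci_mean)
```

The empirical percentiles would come from `quantile()` applied to the LotArea column with `probs = c(0.05, 0.95)`.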
##
## Call:
## randomForest(x = sale_data, y = sale_price, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 26
##
## Mean of squared residuals: 797529969
## % Var explained: 87.35