Probability

I listed the quantitative variables, computed the skew of each, and ordered them from most to least skewed. The most skewed variable is MiscVal, but I chose to use LotArea because it's more interesting and relevant.

The dependent variable is SalePrice.

##      MiscVal     PoolArea      LotArea LowQualFinSF   BsmtFinSF2 
##    24.426522    14.797918    12.182615     8.992833     4.246521
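As a rough illustration of the ranking step, here is a minimal Python sketch of a moment-based skewness. The toy columns and their names are hypothetical stand-ins, not the housing data, and the skewness function used for the numbers above may apply a slightly different sample correction.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical toy columns, used only to show the ranking step.
columns = {
    "symmetric": [1, 2, 3, 4, 5],
    "right_tail": [1, 1, 2, 2, 3, 50],
    "mild_tail": [1, 2, 2, 3, 3, 8],
}

# Order column names from most skewed to least skewed.
ranked = sorted(columns, key=lambda name: skewness(columns[name]), reverse=True)
print(ranked)
```

A single large value in the right tail is enough to push a column to the top of the ranking, which is the pattern MiscVal and PoolArea show above.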

Calculating the probabilities:

## P(X > x | Y > y): 0.3791209
## P(X > x, Y > y): 0.1890411
## P(X < x | Y > y): 0.6208791

Getting the table of counts based on the probabilities:

##                 <= 2nd quartile > 2nd quartile Total
## <= 3rd quartile             643            452  1095
## > 3rd quartile               89            276   365
## Total                       732            728  1460
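The three probabilities reported earlier can be recovered from these counts. The Python sketch below is only a cross-check; the dictionary keys are labels I made up for the four cells.

```python
# Counts copied from the table above.
# Rows: X relative to its 3rd quartile; columns: Y relative to its 2nd quartile.
counts = {
    ("x_low", "y_low"): 643,
    ("x_low", "y_high"): 452,
    ("x_high", "y_low"): 89,
    ("x_high", "y_high"): 276,
}

total = sum(counts.values())
y_high_total = counts[("x_low", "y_high")] + counts[("x_high", "y_high")]

# Conditional probabilities divide by the size of the conditioning event;
# the joint probability divides by the grand total.
p_x_high_given_y_high = counts[("x_high", "y_high")] / y_high_total
p_x_high_and_y_high = counts[("x_high", "y_high")] / total
p_x_low_given_y_high = counts[("x_low", "y_high")] / y_high_total

print(p_x_high_given_y_high, p_x_high_and_y_high, p_x_low_given_y_high)
```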

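Here A is the event X > x and B is the event Y > y. If the two were independent we would have P(A|B) = P(A), or equivalently P(A, B) = P(A)P(B). Neither holds: P(A|B) = 0.379 is well above P(A) = 0.25, and the joint probability 0.189 is well above P(A)P(B) = 0.125. Using the chi-squared test we see that the p-value is much lower than 0.05. Given this information, we can confidently say that there is an association between the variables.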

## P(A): 0.25
## P(B): 0.4986301
## P(A|B): 0.3791209
## P(A) * P(B): 0.1246575
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 127.74, df = 1, p-value < 2.2e-16
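The independence check and the test statistic can both be reproduced from the 2x2 counts. This is a minimal Python sketch using the standard closed form for a 2x2 chi-squared statistic with Yates' correction; the p-value uses the fact that a chi-squared variable with one degree of freedom is a squared standard normal.

```python
import math

# Cell counts from the contingency table:
# a, b = X <= 3rd quartile row; c, d = X > 3rd quartile row.
a, b, c, d = 643, 452, 89, 276
n = a + b + c + d

p_a = (c + d) / n        # P(A): X above its 3rd quartile
p_b = (b + d) / n        # P(B): Y above its 2nd quartile
p_joint = d / n          # P(A, B)
p_product = p_a * p_b    # value P(A, B) would take under independence

# Yates-corrected chi-squared statistic for a 2x2 table.
chi2 = n * (abs(a * d - b * c) - n / 2) ** 2 / (
    (a + b) * (c + d) * (a + c) * (b + d)
)

# Upper-tail probability for 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(p_joint, 7), round(p_product, 7), round(chi2, 2), p_value)
```

The gap between the joint probability and the product is what the test statistic is measuring.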

Descriptive and Inferential Statistics

Summary Stats

Below is a summary of the univariate stats. The mean lot area is 10,516.83, and the lot area standard deviation is 9,981.265. Interestingly, there is a max lot area of 215,245, which has to be either a mistake or a wild outlier, perhaps a large building with multiple family units or a farm (quite possible for Iowa).

Sales prices also vary greatly, with a mean sale price of 180,921.2 and a standard deviation of 79,442.5. The surprising part was the minimum price of 34,900.

##   LotArea_mean LotArea_median LotArea_sd LotArea_min LotArea_max SalePrice_mean
## 1     10516.83         9478.5   9981.265        1300      215245       180921.2
##   SalePrice_median SalePrice_sd SalePrice_min SalePrice_max
## 1           163000      79442.5         34900        755000

Histograms

The histograms below show us that both variables skew right. Most values fall below the mean (both medians are lower than the corresponding means), and the extreme values sit in the right tail. This tells us that it's more likely for houses to be smaller and cheaper rather than larger and expensive.

Boxplots

The boxplots tell us the same thing: the median sale price and lot area are low and a lot of the houses are around there, with many outliers on the higher end.

Scatterplot

The scatterplot below shows us that there is at least some positive correlation between LotArea and SalePrice.

Confidence Interval 95%

The difference in means between LotArea and SalePrice is likely to be between -174514.8 and -166293.9. Since the whole interval is negative, mean SalePrice is higher than mean LotArea. Considering the units (dollars vs square feet), this makes sense, although it also means the size of the difference has no practical interpretation.

## 95% CI for the difference in means: -174514.8 to -166293.9

Correlation Matrix

This correlation matrix shows us that the correlation coefficient between LotArea and SalePrice is 0.2638434. This suggests that there is a positive linear relationship between LotArea and SalePrice, but it is not a very strong relationship.

##             LotArea SalePrice
## LotArea   1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000

Confidence Interval 99%

The correlation between LotArea and SalePrice is between 0.2000196 and 0.3254375 with 99% confidence. This further supports a positive correlation between the two variables, especially considering the very low p-value of 1.123139e-24.

## Correlation coefficient: 0.2638434
## 99% CI for the correlation: 0.2000196 to 0.3254375
## p-value of the correlation test: 1.123139e-24
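For reference, an interval like this is commonly built with the Fisher z-transformation. The Python sketch below assumes that method and takes n = 1460 from the total in the count table earlier; `correlation_ci` is a helper name I made up.

```python
import math
from statistics import NormalDist

r = 0.2638434   # correlation coefficient reported above
n = 1460        # number of observations (total from the count table)

def correlation_ci(r, n, level):
    """Confidence interval for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)                  # transform r to the z scale
    se = 1 / math.sqrt(n - 3)          # approximate standard error of z
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Build the interval on the z scale, then transform back to r.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

low, high = correlation_ci(r, n, 0.99)
print(low, high)
```

Because the interval is built on the z scale and then transformed back, it is not symmetric around r.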

Overall, we can be confident that the correlation between LotArea and SalePrice is positive, but the relationship isn’t very strong. This suggests that LotArea is a useful factor for predicting SalePrice, but that there are other factors that affect it as well.

Linear Algebra and Correlation

Correlation Matrix

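Below we find and invert the correlation matrix to get the precision matrix for SalePrice and LotArea. With only two variables there are no other variables to control for, so the precision matrix is fully determined by the correlation r: the diagonal entries are 1/(1 - r^2) and the off-diagonal entries are -r/(1 - r^2).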

##             LotArea SalePrice
## LotArea   1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
## [1] "Precision Matrix (Inverse of Correlation Matrix):"
##              LotArea  SalePrice
## LotArea    1.0748219 -0.2835846
## SalePrice -0.2835846  1.0748219

Checks

We then check to make sure that the inversion was correct by multiplying the correlation matrix and its inverse and vice versa. Both multiplications result in the identity matrix so we are all good.

## [1] "Correlation Matrix * Precision Matrix:"
##           LotArea SalePrice
## LotArea         1         0
## SalePrice       0         1
## [1] "Precision Matrix * Correlation Matrix:"
##           LotArea SalePrice
## LotArea         1         0
## SalePrice       0         1
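The inversion and both checks are easy to reproduce by hand for a 2x2 matrix. This is a minimal Python sketch; `invert_2x2` and `matmul_2x2` are helper names of my own.

```python
r = 0.2638434
corr = [[1.0, r], [r, 1.0]]

def invert_2x2(m):
    """Invert a 2x2 matrix using the adjugate over the determinant."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul_2x2(p, q):
    """Multiply two 2x2 matrices."""
    return [
        [sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]

precision = invert_2x2(corr)
left = matmul_2x2(corr, precision)    # correlation * precision
right = matmul_2x2(precision, corr)   # precision * correlation

print(precision)
print(left)
print(right)
```

For a correlation matrix the determinant is 1 - r^2, so the diagonal of the inverse is 1/(1 - r^2).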

PCA

To go further into the relationship between these two variables, we turn to principal component analysis. The PCA performed below re-expresses the two variables as two uncorrelated components, PC1 and PC2. These components each account for a portion of the total variance of the two variables. We see from the PCA summary below that PC1 explains 63.19% of the total variance, while PC2 explains 36.81%.

## [1] "PCA Summary:"
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.1242 0.8580
## Proportion of Variance 0.6319 0.3681
## Cumulative Proportion  0.6319 1.0000

PCA loadings show us how each variable contributes to the principal components. We see that LotArea and SalePrice both contribute positively to PC1: as either increases, PC1 will increase. PC2 is a little different: LotArea contributes positively to PC2 while SalePrice contributes negatively. This means if SalePrice increases, PC2 will decrease.

## [1] "PCA Loadings:"
##                 PC1        PC2
## LotArea   0.7071068  0.7071068
## SalePrice 0.7071068 -0.7071068
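With two standardized variables, the PCA has a closed form: the eigenvalues of the correlation matrix are 1 + r and 1 - r, and the eigenvectors are (1, 1)/sqrt(2) and (1, -1)/sqrt(2). The Python sketch below assumes the PCA was run on scaled variables, which the printed standard deviations are consistent with. The sign of each loading vector is arbitrary.

```python
import math

r = 0.2638434
corr = [[1.0, r], [r, 1.0]]

# Eigenvalues of a 2x2 correlation matrix, largest first.
eigenvalues = [1 + r, 1 - r]
std_devs = [math.sqrt(v) for v in eigenvalues]
proportions = [v / sum(eigenvalues) for v in eigenvalues]

s = 1 / math.sqrt(2)
loadings = {"PC1": (s, s), "PC2": (s, -s)}

def is_eigenpair(matrix, vector, value, tol=1e-12):
    """Check that matrix @ vector equals value * vector."""
    product = [sum(matrix[i][k] * vector[k] for k in range(2)) for i in range(2)]
    return all(abs(product[i] - value * vector[i]) < tol for i in range(2))

print(std_devs, proportions)
print(is_eigenpair(corr, loadings["PC1"], eigenvalues[0]))
print(is_eigenpair(corr, loadings["PC2"], eigenvalues[1]))
```

This also shows why the loadings are all 0.7071068 in absolute value: with only two standardized variables, the components are always the scaled sum and the scaled difference.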

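The PCA scatterplot below plots PC1 against PC2. PC1 increases when either LotArea or SalePrice increases, while PC2 measures the contrast between them: it rises when LotArea is large relative to SalePrice and falls when SalePrice is large relative to LotArea. By construction the two components are uncorrelated, so an increase in PC1 does not by itself imply a decrease in PC2.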

## [1] "Variance Inflation Factors (VIFs):"
##   LotArea SalePrice 
##  1.074822  1.074822

Calculus Based Probability and Statistics

Optimal Value of Lambda

First, we check the minimum value of LotArea to see whether the variable needs to be shifted so that it is above zero.

## [1] 1300

Since the minimum value is already above 0, we can skip the shift step. We are fitting an exponential distribution to this variable.

The rate parameter lambda controls how quickly the exponential density decays: a lower value puts more probability on large values of LotArea. The exponential distribution will have a mean of 1/lambda, which equals 10517, the sample mean of LotArea.

We will then draw 1000 samples from the fitted exponential distribution and compare them with the original values of LotArea.

## [1] "Optimal value of λ:"
##        rate 
## 9.50857e-05
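Assuming the fit is a maximum-likelihood fit, the rate is just the reciprocal of the sample mean. The Python sketch below cross-checks the value above from the LotArea mean in the summary table and draws 1000 samples; the seed is an arbitrary choice of mine, so the draws will not match the ones used for the plots.

```python
import random

sample_mean = 10516.83   # LotArea mean from the summary table

# Maximum-likelihood estimate of the exponential rate parameter.
rate = 1 / sample_mean

random.seed(42)  # arbitrary seed, only for reproducibility of this sketch
samples = [random.expovariate(rate) for _ in range(1000)]
simulated_mean = sum(samples) / len(samples)

print(rate, simulated_mean)
```

The simulated mean should land near 1/lambda, though with 1000 draws it will not match exactly.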

Histograms

The histograms plotted on top of each other show that the exponential samples capture the right-skewness of the variable, but they miss the tall concentration of values around the most frequent lot sizes. The exponential fit also doesn’t do a great job capturing the shape of the data, putting far more mass out to the right than the bulk of the observed values.

Compare Percentiles

Comparing the theoretical percentiles with the empirical ones, we see that the 5th percentile using the exponential CDF is much lower than the empirical 5th percentile: 539.4428 vs 3311.7.

We see that the CDF 95th percentile is 31505.6, much higher than the empirical 95th percentile of 17401.15.

This shows that in addition to not capturing the shape of the data very well, this exponential distribution also doesn’t capture data well at the tails.

## 5th percentile using exponential CDF: 539.4428
## 95th percentile using exponential CDF: 31505.6
## Empirical 5th percentile: 3311.7
## Empirical 95th percentile: 17401.15
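The theoretical percentiles come from inverting the exponential CDF, F(x) = 1 - exp(-lambda * x), which gives x = -ln(1 - p) / lambda. A minimal Python sketch using the rate reported earlier (`exponential_quantile` is a helper name of my own):

```python
import math

rate = 9.50857e-05   # fitted rate reported earlier

def exponential_quantile(p, rate):
    """Inverse CDF of the exponential distribution."""
    return -math.log(1 - p) / rate

q05 = exponential_quantile(0.05, rate)
q95 = exponential_quantile(0.95, rate)

print(q05, q95)
```

The empirical values, 3311.7 and 17401.15, both fall well inside this range, which is the mismatch at the tails described above.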

Empirical Confidence Intervals

## 95% CI for the empirical data: 10004.42 to 11029.24
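An interval of this kind can be approximated from the summary statistics alone. The Python sketch below uses a normal quantile, which is an assumption on my part. The printed interval is slightly wider, which would be consistent with a t quantile, but with n = 1460 the two differ by less than one square foot.

```python
import math
from statistics import NormalDist

mean = 10516.83   # LotArea mean from the summary table
sd = 9981.265     # LotArea standard deviation from the summary table
n = 1460          # number of observations

# Normal-approximation 95% interval for the mean.
crit = NormalDist().inv_cdf(0.975)
margin = crit * sd / math.sqrt(n)
lower, upper = mean - margin, mean + margin

print(lower, upper)
```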

Modeling

## 
## Call:
##  randomForest(x = sale_data, y = sale_price, ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 26
## 
##           Mean of squared residuals: 797529969
##                     % Var explained: 87.35
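As a sanity check, the "% Var explained" figure can be tied back to the SalePrice standard deviation reported earlier. The Python sketch below assumes the figure is computed as 1 - MSE / var(y), with the variance taken over n rather than n - 1. The variable names are my own.

```python
mse = 797529969          # mean of squared residuals from the output above
sd_sale_price = 79442.5  # SalePrice standard deviation from the summary table
n = 1460                 # number of observations

# Convert the sample variance (n - 1 denominator) to a population variance.
population_variance = sd_sale_price ** 2 * (n - 1) / n

pct_var_explained = 100 * (1 - mse / population_variance)
rmse = mse ** 0.5        # typical out-of-bag error in dollars

print(pct_var_explained, rmse)
```

The square root of the mean squared residuals puts the typical out-of-bag error at roughly 28,000 dollars, which is easier to read against the mean sale price than the raw MSE.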

My Kaggle username: Ahmed Elsaeyed

My score: 0.14650

Thank you professor!