Question 1

Part 1:Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely describe by n (a size parameter) and lambda ( a shape parameter). Choose any n greater 3 and an expected value between 2 and 10 (you choose).

# Set the seed

# Generate random variable with Gamma distribution
set.seed(0)
n <- 5
lambda <- 5
X <-rgamma(10000, rate=n, shape=lambda) 

# Check the summary statistics of the generated random variable
summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0796  0.6709  0.9330  1.0001  1.2577  4.0783

Part 2: Probability Density 2: Y~Sum of Exponentials

# Generate observations from the sum of n exponential PDFs
Y <- replicate(n, rexp(10000, lambda))
Y <- rowSums(Y)
# Check the summary statistics of the generated observations
summary(Y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09313 0.67040 0.93495 1.00016 1.25784 3.23172

Part 3: Probability Density 3: Z~ Exponential

# Generate observations from a single exponential PDF
Z <- rexp(10000, rate = lambda)

# Check the summary statistics of the generated observations
summary(Z)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000208 0.0579385 0.1371881 0.1995976 0.2784449 1.9065101

Question 1a-Actual Values

# Calculate empirical mean and variance of the Gamma PDF
gamma_mean <- mean(X)
gamma_variance <- var(X)
cat("Gamma PDF - Empirical Mean:", gamma_mean, "\n")
## Gamma PDF - Empirical Mean: 1.000132
cat("Gamma PDF - Empirical Variance:", gamma_variance, "\n")
## Gamma PDF - Empirical Variance: 0.1993912
sum_exp_mean <- mean(Y)
sum_exp_variance <- var(Y)
cat("Sum of Exponential PDFs - Empirical Mean:", sum_exp_mean, "\n")
## Sum of Exponential PDFs - Empirical Mean: 1.000158
cat("Sum of Exponential PDFs - Empirical Variance:", sum_exp_variance, "\n")
## Sum of Exponential PDFs - Empirical Variance: 0.201989
exp_mean <- mean(Z)
exp_variance <- var(Z)
cat("Single Exponential PDF - Empirical Mean:", exp_mean, "\n")
## Single Exponential PDF - Empirical Mean: 0.1995976
cat("Single Exponential PDF - Empirical Variance:", exp_variance, "\n")
## Single Exponential PDF - Empirical Variance: 0.03936069

Question 1a-Derived Values

Question 1: For X

with shape parameter n and scale parameter lambda

E(X)=n * (1/lambda) var(X)=(1/lambda)^2 * n

gamma_mean_derived <- n * 1/lambda
gamma_variance_derived <- n * (1/lambda)^2
cat("Gamma PDF - Expected Value:", gamma_mean_derived, "\n")
## Gamma PDF - Expected Value: 1
cat("Gamma PDF - Expected Variance:", gamma_variance_derived, "\n")
## Gamma PDF - Expected Variance: 0.2

Question 1b: For Y

with shape parameter n and scale parameter lambda

E(X)=n / lambda var(X)= n/ lambda^2

sum_exp_mean_derived <- n / (lambda)
sum_exp_variance_derived <- n / (lambda^2)
cat("Sum of Exponential PDFs - Expected Value:", sum_exp_mean_derived, "\n")
## Sum of Exponential PDFs - Expected Value: 1
cat("Sum of Exponential PDFs - Expected Variance:", sum_exp_variance_derived, "\n")
## Sum of Exponential PDFs - Expected Variance: 0.2

Question 1b: For Z

E(X)= 1/ lambda var(X)= 1/ lambda^2

exp_mean_derived <- 1/ (lambda)
exp_variance_derived <- 1 / (lambda^2)
cat("Exponential PDFs - Expected Value:", exp_mean_derived, "\n")
## Exponential PDFs - Expected Value: 0.2
cat("Exponential PDFs - Expected Variance:", exp_variance_derived, "\n")
## Exponential PDFs - Expected Variance: 0.04

Question 1B-Expected Value & Variance Calculations

Calculate Expected Value Gamma (X)

alt text

Calculate Variance Gamma (X)

alt text

Calculate Expected Value for Z

alt text

Calculate Expected Value for Y

I liked this simple linear method as it was super clear, and easily noticed from the prior Expected Value alt text And here is the method with calc: alt text

Question 1C-E

Question 1C-E Calculations

alt text

Memoryless property derivation

alt text

Yes the property holds!

1F: Investigate P(YZ) = P(Y)P(Z)

quantiles_Y <- quantile(Y, probs = c(0.25, 0.5, 0.75, 1))
quantiles_Z <- quantile(Z, probs = c(0.25, 0.5, 0.75, 1))
outer_prod <- outer(quantiles_Z, quantiles_Y, FUN = "*")

row_sums <- rowSums(outer_prod)
col_sums <- colSums(outer_prod)

base_grid <- rbind(outer_prod, row_sums)
base_grid <- cbind(base_grid, c(col_sums, sum(col_sums)))


# Create Pretty Columns & Row Names
colnames(base_grid) <- c("1st Quartile Y", "2nd Quartile Y", "3rd Quartile Y", "4th Quartile Y", "Total")
rownames(base_grid) <- c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z", "Total")

#Print the raw data
as.data.frame(base_grid)
# Create a scalar
prodSum <- sum(outer_prod)


# Scale
grid <- outer_prod / prodSum


row_sums <- rowSums(grid)
col_sums <- colSums(grid)

grid_with_sums <- rbind(grid, row_sums)
grid_with_sums <- cbind(grid_with_sums, c(col_sums, sum(col_sums)))


# Create Pretty Columns & Row Names
colnames(grid_with_sums) <- c("1st Quartile Y", "2nd Quartile Y", "3rd Quartile Y", "4th Quartile Y", "Total")
rownames(grid_with_sums) <- c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z", "Total")


# Print the updated grid
as.data.frame(grid_with_sums)

Essentially, the joint probabilities are the first four rows and columns. The Marginal Probabilities are the fifth row and column respectively.

Looking at the data, these appear to converge only as they approach higher quartiles At lower quartiles, there is limited intersection, whereas at higher values, the rate suddenly spikes and the rate of intersection increases. Other than that, I think other methods are in order to effectively evaluate this.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test

Fisher Test

If we define the null hypothesis as the events are independent If we define the alternative hypothesis as the events are dependent

fisher.test(base_grid)
## Warning in fisher.test(base_grid): 'x' has been rounded to integer: Mean
## relative difference: 0.09959902
## 
##  Fisher's Exact Test for Count Data
## 
## data:  base_grid
## p-value = 0.8625
## alternative hypothesis: two.sided

The P value of ~0.86 does not yield enough weight to reject the null hypothesis.

Chi Square

chisq.test(base_grid)
## Warning in chisq.test(base_grid): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  base_grid
## X-squared = 5.2746, df = 16, p-value = 0.9942

Practically, the X-squared value of 5.27 coupled with the p-value of 0.9942 indicates that there is a high degree of independence between the two variables.

In terms of analysis:

The Fisher Exact Test is targeted towards small sample sizes, and smaller contingency tables.

The Chi Square is targeted toward larger data sets.

Given the size of this data set is large, the appropriate metric to implement is the Chi Square.

Problem 2

Library & Data Read in

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
library(vcd)
## Warning: package 'vcd' was built under R version 4.2.3
## Loading required package: grid
library(Matrix)
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
home_data <- read.csv("kaggle_data/train.csv", header = TRUE)
summary(home_data)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 
# Identify columns with non-numeric data types
non_numeric_cols <- sapply(home_data, function(x) !is.numeric(x))

# Filter the table to exclude rows with non-numeric values
filtered_home_data <-home_data[,!non_numeric_cols]
correlations <- cor(filtered_home_data[, -ncol(filtered_home_data)], filtered_home_data$SalePrice)
correlations
##                      [,1]
## Id            -0.02191672
## MSSubClass    -0.08428414
## LotFrontage            NA
## LotArea        0.26384335
## OverallQual    0.79098160
## OverallCond   -0.07785589
## YearBuilt      0.52289733
## YearRemodAdd   0.50710097
## MasVnrArea             NA
## BsmtFinSF1     0.38641981
## BsmtFinSF2    -0.01137812
## BsmtUnfSF      0.21447911
## TotalBsmtSF    0.61358055
## X1stFlrSF      0.60585218
## X2ndFlrSF      0.31933380
## LowQualFinSF  -0.02560613
## GrLivArea      0.70862448
## BsmtFullBath   0.22712223
## BsmtHalfBath  -0.01684415
## FullBath       0.56066376
## HalfBath       0.28410768
## BedroomAbvGr   0.16821315
## KitchenAbvGr  -0.13590737
## TotRmsAbvGrd   0.53372316
## Fireplaces     0.46692884
## GarageYrBlt            NA
## GarageCars     0.64040920
## GarageArea     0.62343144
## WoodDeckSF     0.32441344
## OpenPorchSF    0.31585623
## EnclosedPorch -0.12857796
## X3SsnPorch     0.04458367
## ScreenPorch    0.11144657
## PoolArea       0.09240355
## MiscVal       -0.02118958
## MoSold         0.04643225
## YrSold        -0.02892259
sorted_correlations <- as.data.frame(correlations) %>%
  arrange(desc(abs(correlations)))

# Print the sorted correlations
print(sorted_correlations)
##                        V1
## OverallQual    0.79098160
## GrLivArea      0.70862448
## GarageCars     0.64040920
## GarageArea     0.62343144
## TotalBsmtSF    0.61358055
## X1stFlrSF      0.60585218
## FullBath       0.56066376
## TotRmsAbvGrd   0.53372316
## YearBuilt      0.52289733
## YearRemodAdd   0.50710097
## Fireplaces     0.46692884
## BsmtFinSF1     0.38641981
## WoodDeckSF     0.32441344
## X2ndFlrSF      0.31933380
## OpenPorchSF    0.31585623
## HalfBath       0.28410768
## LotArea        0.26384335
## BsmtFullBath   0.22712223
## BsmtUnfSF      0.21447911
## BedroomAbvGr   0.16821315
## KitchenAbvGr  -0.13590737
## EnclosedPorch -0.12857796
## ScreenPorch    0.11144657
## PoolArea       0.09240355
## MSSubClass    -0.08428414
## OverallCond   -0.07785589
## MoSold         0.04643225
## X3SsnPorch     0.04458367
## YrSold        -0.02892259
## LowQualFinSF  -0.02560613
## Id            -0.02191672
## MiscVal       -0.02118958
## BsmtHalfBath  -0.01684415
## BsmtFinSF2    -0.01137812
## LotFrontage            NA
## MasVnrArea             NA
## GarageYrBlt            NA

Okay so I quickly took the numeric values and shoved them into corr, which is a great way to show what has the greatest impact on the SalePrice.

Looking at it,

OverallQual GrLivArea GarageCars

appear to have significant impact. In addition, I want to add in CentralAir, SaleType, and SaleCondition as I believe these categorical instances can be very valuable.

With that in mind, lets make a scatterplots for the quantitative variables: For something like this, I’d love to see a box & Whisker plot

# Set the minimum and maximum range for the y-axis, lets make them pretty & clean
y_min <- 35000
y_max <- 800000
lm_SalePrice <- lm(SalePrice ~ OverallQual, data = home_data)
plot(SalePrice ~ OverallQual, data = home_data, ylim = c(y_min, y_max), xlim = c(0,10),
     main = "Overall Quality vs. SalePrice", xlab = "Rating", ylab = "Sale Price" )
abline(lm_SalePrice, col = "Blue")

lm_GrLivArea <- lm(SalePrice ~ GrLivArea, data = home_data)
plot(SalePrice ~ GrLivArea, data = home_data, ylim = c(y_min, y_max),
     main = "Living Area Square Footage vs. SalePrice", xlab = "Rating", ylab = "Sale Price")
abline(lm_GrLivArea, col = "blue")

lm_Garage <- lm(SalePrice ~ GarageCars, data = home_data)
plot(SalePrice ~ GarageCars, data = home_data, ylim = c(y_min, y_max),
     main = "Garage Capacity vs. SalePrice", xlab = "Rating", ylab = "Sale Price")
abline(lm_Garage, col = "blue")

Looking at this, they all appear to be nicely linear. This is perfect!

Lets Make a corr Matrix

selected_vars <- home_data[, c("OverallQual", "GrLivArea", "GarageCars", "SalePrice")]

# Create the correlation matrix
corr_matrix <- cor(selected_vars)

# Print the correlation matrix
print(corr_matrix)
##             OverallQual GrLivArea GarageCars SalePrice
## OverallQual   1.0000000 0.5930074  0.6006707 0.7909816
## GrLivArea     0.5930074 1.0000000  0.4672474 0.7086245
## GarageCars    0.6006707 0.4672474  1.0000000 0.6404092
## SalePrice     0.7909816 0.7086245  0.6404092 1.0000000
corrplot(corr_matrix, method = 'number')

I really like Corr plot because it makes super pretty charts!

Descriptive and Inferential Statistics

cor.test(home_data$GrLivArea, home_data$SalePrice, 
        conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  home_data$GrLivArea and home_data$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245
cor.test(home_data$OverallQual, home_data$SalePrice, 
        conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  home_data$OverallQual and home_data$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816
cor.test(home_data$GarageCars, home_data$SalePrice, 
        conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  home_data$GarageCars and home_data$SalePrice
## t = 31.839, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6201771 0.6597899
## sample estimates:
##       cor 
## 0.6404092

Seeing as we are doing 3 comparisons with an 80% ci:

fwe <- 1-(1-0.2)^3
fwe
## [1] 0.488

We have a fairly high chance of a familywise error.In terms of this data, conducting multiple hypothesis tests without adjustment increases the likelyhood of a type 1 error, which is incorrectly rejecting the null hypothesis. Given that we are rejecting 3 null hypothesis, this is a real concern. To Address the concern of a familywise error, we could add in comparison correction methods in order to help control the overall familywise error rate.

In terms of result, we can definitely reject the null hypothesis for all of these, as the p-values for each of these is close to zero.

Linear Algebra and Correlation

matrix <- cor(selected_vars) #Start Matrix
precision_matrix <- solve(matrix) #Invert
int <- matrix %*% precision_matrix #Multiple corr by precision
int_2 <- int %*% matrix  #Multiple Precision matrix by corr matrix
expand(lu(int_2))
## $L
## 4 x 4 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]      [,2]      [,3]      [,4]     
## [1,] 1.0000000         .         .         .
## [2,] 0.5930074 1.0000000         .         .
## [3,] 0.6006707 0.1712756 1.0000000         .
## [4,] 0.7909816 0.3695063 0.2003591 1.0000000
## 
## $U
## 4 x 4 Matrix of class "dtrMatrix"
##      [,1]      [,2]      [,3]      [,4]     
## [1,] 1.0000000 0.5930074 0.6006707 0.7909816
## [2,]         . 0.6483422 0.1110452 0.2395665
## [3,]         .         . 0.6201753 0.1242578
## [4,]         .         .         . 0.2609306
## 
## $P
## 4 x 4 sparse Matrix of class "pMatrix"
##             
## [1,] | . . .
## [2,] . | . .
## [3,] . . | .
## [4,] . . . |

Data Cleaning

ggplot(data = home_data, aes(x = LotFrontage)) +  geom_histogram(aes(y=..density..), color = 'black', fill = 'white') + geom_density(alpha = .5, fill = 'blue') + labs(title = 'Distribution of Frontage')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 259 rows containing non-finite values (stat_bin).
## Warning: Removed 259 rows containing non-finite values (stat_density).

cleaned_data <- na.omit(home_data$LotFrontage)
fit <- fitdistr(cleaned_data, "exponential")
fit
##        rate    
##   0.0142755260 
##  (0.0004119273)

Let’s grab the optimal lambda:

lambda <- fit$estimate
lambda
##       rate 
## 0.01427553

Lets make samples:

# 1000 samples using optimal lambda
new_data <- data.frame(samples = rexp(1000, lambda))

# Plot Histogram
ggplot(data = new_data, aes(x = samples)) + 
    geom_histogram(aes(y=..density..), color = 'black', fill = 'white') + 
    geom_density(alpha = .2, fill = 'green') + 
  labs(title = 'Generated Frontage')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Lets Calculate the 5th and 95th percentiles

qexp(p = 0.05, rate = lambda)
## [1] 3.593093
qexp(p = 0.95, rate = lambda)
## [1] 209.8509

Lets Get the emperical 95th and 5th percentile:

quantile(new_data$samples, probs = 0.05)
##       5% 
## 3.703297
quantile(new_data$samples, probs = 0.95)
##      95% 
## 213.1795

Pretty much, the chart is shifted over, and is significantly cleaner.

Modeling

Okay so I quickly took the numeric values and shoved them into corr, which is a great way to show what has the greatest impact on the SalePrice.

Looking at it,

OverallQual GrLivArea GarageCars

appear to have significant impact. In addition, I want to add in CentralAir, SaleType, and SaleCondition as I believe these categorical instances can be very valuable.

With that in mind, lets make a scatterplots for the quantitative variables: For something like this, I’d love to see a box & Whisker plot

# Set the minimum and maximum range for the y-axis, lets make them pretty & clean
y_min <- 35000
y_max <- 800000
lm_SalePrice <- lm(SalePrice ~ OverallQual, data = home_data)
plot(SalePrice ~ OverallQual, data = home_data, ylim = c(y_min, y_max), xlim = c(0,10),
     main = "Overall Quality vs. SalePrice", xlab = "Rating", ylab = "Sale Price" )
abline(lm_SalePrice, col = "Blue")

lm_GrLivArea <- lm(SalePrice ~ GrLivArea, data = home_data)
plot(SalePrice ~ GrLivArea, data = home_data, ylim = c(y_min, y_max),
     main = "Living Area Square Footage vs. SalePrice", xlab = "Square Footage", ylab = "Sale Price")
abline(lm_GrLivArea, col = "blue")

lm_Garage <- lm(SalePrice ~ GarageCars, data = home_data)
plot(SalePrice ~ GarageCars, data = home_data, ylim = c(y_min, y_max),
     main = "Garage Capacity vs. SalePrice", xlab = "Num Car Garage", ylab = "Sale Price")
abline(lm_Garage, col = "blue")

Let’s Make a model

# Build a linear model for stopping distance as a function of speed
final_model <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + CentralAir + SaleType + SaleCondition, data = home_data)

Model Evaluation

summary(final_model)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars + 
##     CentralAir + SaleType + SaleCondition, data = home_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -367247  -21542   -1963   18924  307495 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.197e+05  8.294e+03 -14.428  < 2e-16 ***
## OverallQual           2.418e+04  1.102e+03  21.937  < 2e-16 ***
## GrLivArea             5.306e+01  2.514e+00  21.103  < 2e-16 ***
## GarageCars            1.936e+04  1.811e+03  10.687  < 2e-16 ***
## CentralAirY           1.674e+04  4.463e+03   3.750 0.000184 ***
## SaleTypeCon           5.806e+04  2.880e+04   2.016 0.043983 *  
## SaleTypeConLD         4.582e+03  1.488e+04   0.308 0.758186    
## SaleTypeConLI         2.373e+04  1.879e+04   1.263 0.206733    
## SaleTypeConLw         1.716e+04  1.897e+04   0.905 0.365878    
## SaleTypeCWD           2.683e+04  2.099e+04   1.278 0.201344    
## SaleTypeNew           3.897e+04  2.449e+04   1.592 0.111695    
## SaleTypeOth           5.355e+04  2.381e+04   2.249 0.024681 *  
## SaleTypeWD            1.378e+04  6.548e+03   2.105 0.035482 *  
## SaleConditionAdjLand  1.685e+04  2.041e+04   0.825 0.409253    
## SaleConditionAlloca  -2.094e+03  1.229e+04  -0.170 0.864703    
## SaleConditionFamily  -1.199e+04  9.893e+03  -1.212 0.225693    
## SaleConditionNormal   7.718e+03  4.456e+03   1.732 0.083447 .  
## SaleConditionPartial  1.127e+04  2.367e+04   0.476 0.634097    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39610 on 1442 degrees of freedom
## Multiple R-squared:  0.7543, Adjusted R-squared:  0.7515 
## F-statistic: 260.5 on 17 and 1442 DF,  p-value: < 2.2e-16

Residual Analysis

par(mfrow = c(2, 2))
plot(final_model, which = 1:4)

Looking at the Q-Q plot, both ends diverge. I honestly feel that this is reasonably valid, as cheap and overly expensive houses would quickly break this basic linear model.

Overall Analysis

Practically, we look for 4 key parts to analyze if a regression is reasonable.

Independence-This is a given spec of the data set There is a linear relationship, which is demonstrated loosely in the scatter plots above. There is constant variance, which is shown by the uniform spread of residuals. There is a normality of Residuals across a reasonable range, which is demonstrated by the chart above.

Cook’s disance indicates the potential of a few outliers, that merit additional attention, but overall, there is a linear relationship between speed and stopping distance.

Lets generate the data:

test_data <- read.csv("kaggle_data/test.csv", header = TRUE)
test_data$SalePrice <- predict(object = final_model,     # The regression model
        newdata = test_data)   # dataframe of new data
test_data
output_file <- "kaggle_data/submit.csv"
subset_data <- test_data[, c("Id", "SalePrice")]
#write.csv(subset_data, file = output_file, row.names = FALSE)

Posting: alt text

And the video can be found here: https://youtu.be/8uSFfm_ckI4