Problem 1
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu =\sigma =(N+1)/2\).
set.seed(123)
N <- 100
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N+1)/2, sd = (N+1)/2)
small_x <- median(X)            # "x" is estimated as the median of X
small_y <- quantile(Y, 0.25)    # "y" is estimated as the 1st quartile of Y
Probability. Calculate, at a minimum, probabilities a through c below. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
The median of \(X\) is 49.9621956. The 1st quartile of \(Y\) is 16.899406.
a. \(P(X>x|X>y)\)
This is a conditional probability. Here we are asking: given that \(X\) is greater than \(y\), what is the probability that \(X\) is greater than \(x\)? One way to compute it is with Bayes’ Theorem:
\(P(A|B)=\frac { P(B|A)\cdot P(A) }{ P(B) }\)
a <- X[X>small_x]
prob_b_given_a <- length(a[a>small_y])/length(a)
prob_a <- sum(X>small_x)/length(X)
prob_b <- sum(X>small_y)/length(X)
(prob_b_given_a * prob_a) / prob_b
## [1] 0.595309
Given that \(X\) is greater than \(y\), there is roughly a 59.5% chance that \(X\) is also greater than \(x\).
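As a sanity check, the same value can be computed directly from the definition of conditional probability, \(P(X>x|X>y)=\frac{P(X>x, X>y)}{P(X>y)}\). This is a minimal sketch, assuming the small_x and small_y objects defined above:
# direct computation of P(X > x | X > y)
sum(X > small_x & X > small_y) / sum(X > small_y)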
b. \(P(X>x,Y>y)\)
Here we are looking at the probability that \(X\) is greater than \(x\) AND that \(Y\) is greater than \(y\). To solve this problem, we’ll use the multiplication rule for independent events (since we randomly generated these distributions):
\(P(A and B) =P(A)\cdot P(B)\)
Because we built these distributions, we already know that the probability that \(X>x\) is 0.5, since \(x\) is the median of \(X\). Similarly, we know that the probability that \(Y>y\) is 0.75, since \(y\) is the first quartile of \(Y\). We can multiply these two values together to get our answer.
X_greater_than_x <- 0.5
Y_greater_than_y <- 0.75
probability <- X_greater_than_x * Y_greater_than_y
probability
## [1] 0.375
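The joint probability can also be estimated empirically from the simulated data (a sketch, assuming the small_x and small_y objects defined above). Because X and Y were generated independently, the result should be close to 0.375, though not exact due to sampling variability:
# empirical estimate of P(X > x, Y > y)
mean(X > small_x & Y > small_y)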
c. \(P(X<x|X>y)\)
This is very similar to (a). Here we are looking at the conditional probability that \(X\) is less than \(x\), given that \(X\) is greater than \(y\). We’ll use Bayes’ theorem again to answer this question.
a <- X[X<small_x]
prob_b_given_a <- length(a[a>small_y])/length(a)
prob_a <- sum(X<small_x)/length(X)
prob_b <- sum(X>small_y)/length(X)
(prob_b_given_a * prob_a) / prob_b
## [1] 0.404691
Given that \(X\) is greater than \(y\), there is roughly a 40.5% chance that \(X\) is less than \(x\).
Investigate whether \(P(X>x\cup Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
Marginal probability is the probability of a single event occurring, regardless of the outcome of the other variable - it is an unconditional probability. Joint probability is the probability of event A and event B occurring together.
df <- data.frame(X = X, Y = Y)
# joint probability P(X > x, Y > y)
df1 <- df %>%
  filter(X > small_x & Y > small_y) %>%
  nrow() / dim(df)[1]
# joint probability P(X < x, Y > y)
df2 <- df %>%
  filter(X < small_x & Y > small_y) %>%
  nrow() / dim(df)[1]
df3 <- rbind(df2, df1)
df4 <- rbind(df3, sum(df3))      # Y > y column plus its marginal total
# joint probability P(X < x, Y < y)
df5 <- df %>%
  filter(X < small_x & Y < small_y) %>%
  nrow() / dim(df)[1]
# joint probability P(X > x, Y < y)
df6 <- df %>%
  filter(X > small_x & Y < small_y) %>%
  nrow() / dim(df)[1]
df7 <- rbind(df5, df6)
df8 <- sum(df7)
df9 <- rbind(df7, df8)           # Y < y column plus its marginal total
df10 <- cbind(df9, df4)
df11 <- df10[,1] + df10[,2]      # row totals (marginal probabilities for X, plus grand total)
df12 <- cbind(df10, df11)
rownames(df12) <- c('X < x', 'X > x', 'Total')
colnames(df12) <- c('Y < y', 'Y > y', 'Total')
df12 %>% kable() %>%
  kable_styling()

|       | Y < y | Y > y | Total |
|---|---|---|---|
| X < x | 0.1256 | 0.3744 | 0.5 |
| X > x | 0.1244 | 0.3756 | 0.5 |
| Total | 0.2500 | 0.7500 | 1.0 |
Looking at the marginal probabilities in the totals, we can tell that this table is consistent with how the cut points were chosen: x is the median of X, so the row totals are each 0.5, and y is the first quartile of Y, so the column totals are 0.25 and 0.75. The joint probabilities in each row and column also sum to the corresponding marginal probabilities, which is what we would expect.
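For reference, a more compact way to build the same table (a sketch, not the code used above) is to cross-tabulate the two logical conditions and convert counts to proportions, again assuming the small_x and small_y objects defined earlier:
# rows are X > x (FALSE/TRUE), columns are Y > y (FALSE/TRUE)
tab <- table(X > small_x, Y > small_y)
prop.table(tab)               # joint probabilities
addmargins(prop.table(tab))   # adds the marginal totals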
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher’s Exact Test of Independence is used to examine the association (contingency - which is why we use contingency tables) between two classifications. In essence, we want to know whether the two classifications are associated (dependent) or whether they are independent. We typically use Fisher’s exact test when an experiment has a small number of observations - in particular, when:
- more than 20% of cells have expected counts less than 5, or any expected cell count is less than 1.
Looking at our contingency table above, we don’t HAVE to use Fisher’s Exact Test of Independence, but we can nonetheless, because the test still works for larger sample sizes. The real issue runs the other way: the chi-square test cannot be used for experiments with small sample sizes, because its distributional approximation is inadequate when expected cell counts are small.
Now let’s turn to our experiment to see if there is a relationship between our two variables. Here our hypothesis will be the same for both the Fisher’s Exact test and the Chi Square Test:
- \({ H }_{ 0 }:\) The variables are independent and there is no relationship between the two variables
- \({ H }_{ 1 }:\) The variables are dependent and there is a relationship between the two variables
#------------ adjusting dataframe to work with Fisher's test --------
df <- data.frame(X = X, Y = Y)
df1 <- df %>%
  filter(X > small_x & Y > small_y) %>%
  nrow()
df2 <- df %>%
  filter(X < small_x & Y > small_y) %>%
  nrow()
df3 <- rbind(df2, df1)
df5 <- df %>%
  filter(X < small_x & Y < small_y) %>%
  nrow()
df6 <- df %>%
  filter(X > small_x & Y < small_y) %>%
  nrow()
df7 <- rbind(df5, df6)
df10 <- cbind(df7, df3)
rownames(df10) <- c('X < x', 'X > x')
colnames(df10) <- c('Y < y', 'Y > y')
#-----------------------------------------------#
#--------------- Fisher's Test -----------------#
fisher.test(df10)
##
## Fisher's Exact Test for Count Data
##
## data: df10
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9242273 1.1100187
## sample estimates:
## odds ratio
## 1.012883
Looking at the test results, we can see that our p-value is well above the generally accepted 0.05 threshold. In this case, we fail to reject the null hypothesis and conclude that these two variables are independent.
Now, we’ll run the same test, but using the chi-square test.
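The output below is presumably produced by running chisq.test on the same contingency table (for a 2x2 table, R applies Yates’ continuity correction by default, which matches the output heading):
#--------------- Chi-Square Test ---------------#
chisq.test(df10)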
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df10
## X-squared = 0.064533, df = 1, p-value = 0.7995
Looking at these results, we come back with essentially the same p-value as we did with Fisher’s test. Here again we fail to reject the null hypothesis and conclude that the variables are independent.
So why are the p-values the same? As mentioned above, our sample size is 10,000, which is quite large, and every cell of our contingency table has more than enough observations for a valid chi-square test, so in this case we could run either test and expect nearly identical results. Fisher’s test is the one to reach for when the sample size is small, which isn’t the case here, though it does no harm to run it on a large sample - as we’ve seen, it gives just as accurate a result. Given the large sample size, the chi-square test is the more conventional choice for this data.
Problem 2
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following:
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
train <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Final/train.csv')
head(train)
## # A tibble: 6 x 81
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## # ... with 73 more variables: LandContour <chr>, Utilities <chr>,
## # LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>,
## # Condition2 <chr>, BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>,
## # OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>,
## # RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>,
## # MasVnrArea <dbl>, ExterQual <chr>, ExterCond <chr>, Foundation <chr>,
## # BsmtQual <chr>, BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>,
## # BsmtFinSF1 <dbl>, BsmtFinType2 <chr>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>,
## # TotalBsmtSF <dbl>, Heating <chr>, HeatingQC <chr>, CentralAir <chr>,
## # Electrical <chr>, `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>,
## # GrLivArea <dbl>, BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>,
## # HalfBath <dbl>, BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## # TotRmsAbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, FireplaceQu <chr>,
## # GarageType <chr>, GarageYrBlt <dbl>, GarageFinish <chr>, GarageCars <dbl>,
## # GarageArea <dbl>, GarageQual <chr>, GarageCond <chr>, PavedDrive <chr>,
## # WoodDeckSF <dbl>, OpenPorchSF <dbl>, EnclosedPorch <dbl>,
## # `3SsnPorch` <dbl>, ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <chr>,
## # Fence <chr>, MiscFeature <chr>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
First, let’s get a look at the dimensions of our dataset:
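The output below presumably comes from a call like:
dim(train)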
## [1] 1460 81
Our data has 1460 rows and 81 columns.
Next, we’ll generate some summary statistics of each feature by using the summary function from base R.
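The lengthy output that follows presumably comes from:
summary(train)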
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical 1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
We can see that there is a tremendous wealth of information in this summary. Some interesting points about the dataset in general are that we are looking at houses built between 1872 and 2010, and that our dependent variable, SalePrice, ranges from 34,900 to 755,000.
Let’s also take a look at the distribution of our dependent variable, sale price:
ggplot(train) +
aes(x = SalePrice) +
geom_histogram(bins =45 ) +
scale_x_continuous(labels=comma) +
labs(title="Home Sale Price Distribution",
x="Price") +
theme(plot.title = element_text(hjust = 0.425))
Now, we’ll take a look at the relationship between several of our independent variables and our dependent variable, SalePrice. We’ll select these variables using the hypothesis that above ground living area square feet (GrLivArea), total basement square feet (TotalBsmtSF), and garage area (GarageArea) will be highly predictive factors for the model.
ggplot(train) +
aes(x = GrLivArea, y = SalePrice) +
geom_point(alpha = 0.25) + scale_y_continuous(labels=comma) +
geom_smooth(method = lm ,se = FALSE) +
scale_x_continuous(labels=comma) +
labs(title="Home Sale Price vs. Above Ground Square Footage",
y="Home Price $",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
## `geom_smooth()` using formula 'y ~ x'
We can see by looking at the plot above that the above ground square footage has a fairly strong linear relationship with the home sale price.
Now, let’s look at the total square footage of the basement:
ggplot(train) +
aes(x = TotalBsmtSF, y = SalePrice) +
geom_point(alpha = 0.25) + scale_y_continuous(labels=comma) +
geom_smooth(method = lm ,se = FALSE) +
scale_x_continuous(labels=comma) +
labs(title="Home Sale Price vs. Total Basement Square Footage",
y="Home Price $",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
The total basement square footage also has a linear relationship with the home sale price, although the relationship is not as strong as what we saw with the previous feature.
Lastly, we’ll look at garage area:
ggplot(train) +
aes(x = GarageArea, y = SalePrice) +
geom_point(alpha = 0.25) + scale_y_continuous(labels=comma) +
geom_smooth(method = lm ,se = FALSE) +
scale_x_continuous(labels=comma) +
labs(title="Home Sale Price vs. Garage Square Footage",
y="Home Price $",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
Again, here we can see that garage square footage has a linear relationship with the home price. This relationship looks to be moderately strong.
How strong are these relationships? Let’s create a correlation matrix to determine the strength of each of these relationships:
df_cor <- train %>% dplyr::select(SalePrice, GarageArea, TotalBsmtSF, GrLivArea)
cor(df_cor, use = "complete.obs")
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 1.0000000 0.6234314 0.6135806 0.7086245
## GarageArea 0.6234314 1.0000000 0.4866655 0.4689975
## TotalBsmtSF 0.6135806 0.4866655 1.0000000 0.4548682
## GrLivArea 0.7086245 0.4689975 0.4548682 1.0000000
While the correlation matrix is fairly small, it still often helps to represent this matrix visually:
df_cor <- train %>% dplyr::select(SalePrice, GarageArea, TotalBsmtSF, GrLivArea)
corrplot(cor(df_cor, use = "complete.obs"), type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
We can see that all three of the variables we have chosen are strongly correlated with sales price, as we saw in our scatterplots above. We can also see that there may be some slight collinearity between these features as well.
Before moving on, let’s test the following hypotheses for each of our features as it relates to sales price:
- \({ H }_{0 }:\) Correlation between the pairwise set of variables is 0 and is not significant
- \({ H }_{1 }:\) Correlation between the pairwise set of variables is not 0 and is significant
We’ll use the cor.test method to compute this. In addition, this function has the ability to generate confidence intervals based on a specified input. We’ll generate 80% confidence intervals that will display as part of the output.
Let’s first look at the pairwise correlation between sales price and garage area:
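The output below (and the two tests that follow it) is presumably produced by cor.test with an 80% confidence level, for example:
cor.test(df_cor$SalePrice, df_cor$GarageArea, conf.level = 0.80)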
##
## Pearson's product-moment correlation
##
## data: df_cor$SalePrice and df_cor$GarageArea
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
We can see here that the p-value is extremely low, which means we will reject the null hypothesis and conclude that the correlation is not 0 and the relationship is significant. Additionally, you can see that 0 is not included in the 80% confidence interval.
Now, we’ll turn our attention to sales price and above ground square footage (GrLivArea):
##
## Pearson's product-moment correlation
##
## data: df_cor$SalePrice and df_cor$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
Here again, the p-value is extremely small which leads us to reject the null hypothesis and conclude that the correlation between these two variables is not 0 and the relationship is significant. Additionally, as we saw before, 0 is not part of the 80% confidence interval.
Lastly, we’ll look at sales price and total basement square footage (TotalBsmtSF)
##
## Pearson's product-moment correlation
##
## data: df_cor$SalePrice and df_cor$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
Following suit from our previous tests, the p-value is extremely small. We will reject the null hypothesis and conclude that the true correlation is not equal to 0. You can see in the 80% confidence interval that, once again, 0 is not included in the interval.
These tests give us reasonable assurance that there is, in fact, a significant relationship between each of these variables and sales price, and that they are likely to be useful predictors in our multiple regression model.
Now, because we’ve performed multiple statistical tests on the same dataset, it behooves us to check the family-wise error rate. We can do this using the following formula: \(1-(1-\alpha)^{m}\)
In the above formula, \(\alpha\) is our level of significance (the threshold for rejecting the null hypothesis) and \(m\) is the number of tests we performed. We set \(\alpha\) equal to 0.01, the threshold we used to judge significance in the tests above, and set \(m = 3\) since we ran 3 tests.
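A minimal sketch of the calculation that produces the value below:
alpha <- 0.01   # significance level used above
m <- 3          # number of pairwise correlation tests
1 - (1 - alpha)^m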
## [1] 0.029701
We can see from the output above that our family-wise error rate is very low (~3%). What does this mean? It means that the chance of committing at least one Type I error across our three tests - that is, rejecting a null hypothesis when it is actually true - is very low.
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
We’ll take our correlation matrix from above:
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 1.0000000 0.6234314 0.6135806 0.7086245
## GarageArea 0.6234314 1.0000000 0.4866655 0.4689975
## TotalBsmtSF 0.6135806 0.4866655 1.0000000 0.4548682
## GrLivArea 0.7086245 0.4689975 0.4548682 1.0000000
and use the solve method from R to get the inverse (which is also called the precision matrix):
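A minimal sketch of this step; the object names cor_matrix and precision_matrix are assumed here rather than taken from the original code:
cor_matrix <- cor(df_cor, use = "complete.obs")   # correlation matrix from above
precision_matrix <- solve(cor_matrix)             # matrix inverse = precision matrix
precision_matrix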
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 2.9597719 -0.82302478 -0.80327436 -1.34598629
## GarageArea -0.8230248 1.68692624 -0.27914378 -0.08097502
## TotalBsmtSF -0.8032744 -0.27914378 1.65207569 -0.05133909
## GrLivArea -1.3459863 -0.08097502 -0.05133909 2.01512843
We can do a quick check to make sure we calculated the inverse correctly by multiplying the original matrix and the inverse together. If we did it correctly, the calculation should return the identity matrix (within rounding error):
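A sketch of that check, assuming the cor_matrix and precision_matrix objects from above (rounding is used so floating-point noise displays as 0):
round(cor_matrix %*% precision_matrix, 10)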
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 1 0 0 0
## GarageArea 0 1 0 0
## TotalBsmtSF 0 0 1 0
## GrLivArea 0 0 0 1
Additionally multiplying the inverse by the correlation matrix results in the identity matrix as well (within rounding error):
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 1 0 0 0
## GarageArea 0 1 0 0
## TotalBsmtSF 0 0 1 0
## GrLivArea 0 0 0 1
Now, let’s perform lower-upper (LU) matrix decomposition on the correlation matrix. Decomposing a matrix into its parts can often make downstream calculations simpler. We’ll use the lu.decomposition method from the matrixcalc library to decompose this matrix into lower and upper triangular matrices:
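A sketch of the decomposition call, again assuming the cor_matrix object named above:
library(matrixcalc)
lu <- lu.decomposition(cor_matrix)   # returns a list with components $L and $U
lu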
## $L
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.00000000 0.00000000 0
## [2,] 0.6234314 1.00000000 0.00000000 0
## [3,] 0.6135806 0.17034908 1.00000000 0
## [4,] 0.7086245 0.04452351 0.02547683 1
##
## $U
## [,1] [,2] [,3] [,4]
## [1,] 1 6.234314e-01 0.6135806 0.70862448
## [2,] 0 6.113332e-01 0.1041401 0.02721870
## [3,] 0 -1.387779e-17 0.6057787 0.01543332
## [4,] 0 3.535621e-19 0.0000000 0.49624629
Multiplying the lower triangle by the upper triangle should result in our original correlation matrix:
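Using the lu object from the sketch above, the check is simply:
lu$L %*% lu$U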
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.6234314 0.6135806 0.7086245
## [2,] 0.6234314 1.0000000 0.4866655 0.4689975
## [3,] 0.6135806 0.4866655 1.0000000 0.4548682
## [4,] 0.7086245 0.4689975 0.4548682 1.0000000
Perfect!
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
First, we’ll choose a feature from our data that is right skewed. The unfinished square feet of basement area (BsmtUnfSF) is right skewed as we can see in the histogram below:
right <- ggplot(train) +
aes(x = BsmtUnfSF) +
geom_histogram(aes(y=..density..),bins = 45) +
geom_density(alpha=.1, color = "blue")+
labs(title="Unfinished Square Feet of Basement",
y="Density",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
right
Let’s check the minimum value of our feature to make sure that it is zero:
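The value shown below presumably comes from:
min(train$BsmtUnfSF)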
## [1] 0
Now, let’s use fitdistr from the MASS package to fit our feature to an exponential distribution.
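A minimal sketch of the fit; the object name fit is assumed:
library(MASS)
fit <- fitdistr(train$BsmtUnfSF, densfun = "exponential")
fit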
## rate
## 1.762921e-03
## (4.613775e-05)
Running the above function has given us a \(\lambda\) value of 0.0017629. We’ll now use this \(\lambda\) or rate to generate 1,000 samples from this exponential distribution using the rexp function from the stats library.
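A sketch of the sampling step; lambda is extracted from the fit object assumed above, and samples matches the object name used in the plotting code below:
lambda <- fit$estimate                 # fitted rate parameter
samples <- rexp(1000, rate = lambda)   # 1,000 draws from the fitted exponential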
Armed with samples from our exponential distribution, lets build another histogram and compare it to the one we made previously:
expo <- ggplot(tibble(samples)) +
aes(x = samples) +
geom_histogram(aes(y=..density..),bins = 45) +
geom_density(alpha=.1, color = "blue") +
labs(title="Fit Exponential Distribution",
y="Density",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
ggpubr::ggarrange(expo, right, ncol = 1, nrow = 2)
Looking at the two distributions above, especially at the density curves, we can see that the samples from the fitted exponential distribution differ noticeably in shape from the original variable. In the top plot, which shows the fitted exponential samples, we can clearly see an exponential decay, whereas the bottom plot looks much more like a gentle decline.
Now, let’s use the exponential probability density function (PDF) to find the 5th and 95th percentiles via the cumulative distribution function (CDF). To do this we’ll use the qexp method (the inverse CDF, or quantile function) from the stats library.
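A sketch of these quantile calls, using the lambda value fitted above:
qexp(0.05, rate = lambda)   # 5th percentile of the fitted exponential
qexp(0.95, rate = lambda)   # 95th percentile of the fitted exponential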
## [1] 29.09563
## [1] 1699.3
Next, we’ll generate a 95% confidence interval from the original data, assuming normality. First, let’s find the margin of error:
center <- mean(train$BsmtUnfSF)
stddev <- sd(train$BsmtUnfSF)
n <- length(train$BsmtUnfSF)
error <- qnorm(0.95)*stddev/sqrt(n)
error
## [1] 19.02139
Now utilizing our center point (mean), we can build out our confidence interval:
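The bounds below presumably come from adding and subtracting the margin of error from the mean:
center - error   # lower bound
center + error   # upper bound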
## [1] 548.219
## [1] 586.2618
For our mean of 567.240411, the lower bound of our confidence interval is 548.2190164 and the upper bound is 586.2618055.
Similar to what we did above, we’ll look at the 5th and 95th percentiles of the data:
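These empirical percentiles presumably come from quantile calls like:
quantile(train$BsmtUnfSF, 0.05)
quantile(train$BsmtUnfSF, 0.95)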
## 5%
## 0
## 95%
## 1468
The empirical 5th percentile is 0 (at least 5% of homes have no unfinished basement area), while the fitted exponential puts the 5th percentile at about 29 square feet. The empirical 95th percentile (1,468) is likewise below the exponential estimate of about 1,699. The exponential distribution captures the general right-skewed shape of the data, but it is only a rough approximation.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
My model and analysis can be found here. My Kaggle username is christianthieme and my score from the competition is: