Problem 1
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu =\sigma =(N+1)/2\).
set.seed(123)
N <- 100
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N+1)/2, sd = (N+1)/2)
small_x <- median(X)            # "x" is estimated as the median of X
small_y <- quantile(Y, 0.25)    # "y" is estimated as the 1st quartile of Y
Probability. Calculate, at a minimum, probabilities a through c below. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
The median of \(X\) is 49.9621956. The 1st quartile of \(Y\) is 16.899406.
a. \(P(X>x|X>y)\)
This is a conditional probability. Here we are asking: given that \(X\) is greater than \(y\), what is the probability that \(X\) is greater than \(x\)? One way to compute it is with Bayes’ Theorem:
\(P(A|B)=\frac { P(B|A)\cdot P(A) }{ P(B) }\)
a <- X[X>small_x]
prob_b_given_a <- length(a[a>small_y])/length(a)
prob_a <- sum(X>small_x)/length(X)
prob_b <- sum(X>small_y)/length(X)
(prob_b_given_a * prob_a) / prob_b
## [1] 0.595309
Given that \(X\) is greater than \(y\), there is roughly a 59.5% chance that \(X\) is also greater than \(x\).
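As a sanity check, the same value can be computed directly from the definition of conditional probability, \(P(X>x|X>y)=\frac{P(X>x, X>y)}{P(X>y)}\). This is a minimal sketch, assuming the small_x and small_y objects defined above:
# direct computation of P(X > x | X > y)
sum(X > small_x & X > small_y) / sum(X > small_y)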
b. \(P(X>x,Y>y)\)
Here we are looking at the probability that \(X\) is greater than \(x\) AND that \(Y\) is greater than \(y\). To solve this problem, we’ll use the multiplication rule for independent events (since we randomly generated these distributions):
\(P(A and B) =P(A)\cdot P(B)\)
Because we built these distributions, we already know that the probability that \(X>x\) is 0.5, since \(x\) is the median of \(X\). Similarly, we know that the probability that \(Y>y\) is 0.75, since \(y\) is the first quartile of \(Y\). We can multiply these two values together to get our answer.
X_greater_than_x <- 0.5
Y_greater_than_y <- 0.75
probability <- X_greater_than_x * Y_greater_than_y
probability
## [1] 0.375
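The joint probability can also be estimated empirically from the simulated data (a sketch, assuming the small_x and small_y objects defined above). Because X and Y were generated independently, the result should be close to 0.375, though not exact due to sampling variability:
# empirical estimate of P(X > x, Y > y)
mean(X > small_x & Y > small_y)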
c. \(P(X<x|X>y)\)
This is very similar to (a). Here we are looking at the conditional probability that \(X\) is less than \(x\), given that \(X\) is greater than \(y\). We’ll use Bayes’ theorem again to answer this question.
a <- X[X<small_x]
prob_b_given_a <- length(a[a>small_y])/length(a)
prob_a <- sum(X<small_x)/length(X)
prob_b <- sum(X>small_y)/length(X)
(prob_b_given_a * prob_a) / prob_b
## [1] 0.404691
Given that \(X\) is greater than \(y\), there is roughly a 40.5% chance that \(X\) is less than \(x\).
Investigate whether \(P(X>x\cup Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
Marginal probability is the probability of a single event occurring, regardless of the outcome of the other variable - it is an unconditional probability. Joint probability is the probability of event A and event B occurring together.
df <- data.frame(X = X, Y = Y)
# joint probability P(X > x, Y > y)
df1 <- df %>%
  filter(X > small_x & Y > small_y) %>%
  nrow() / dim(df)[1]
# joint probability P(X < x, Y > y)
df2 <- df %>%
  filter(X < small_x & Y > small_y) %>%
  nrow() / dim(df)[1]
df3 <- rbind(df2, df1)
df4 <- rbind(df3, sum(df3))      # Y > y column plus its marginal total
# joint probability P(X < x, Y < y)
df5 <- df %>%
  filter(X < small_x & Y < small_y) %>%
  nrow() / dim(df)[1]
# joint probability P(X > x, Y < y)
df6 <- df %>%
  filter(X > small_x & Y < small_y) %>%
  nrow() / dim(df)[1]
df7 <- rbind(df5, df6)
df8 <- sum(df7)
df9 <- rbind(df7, df8)           # Y < y column plus its marginal total
df10 <- cbind(df9, df4)
df11 <- df10[,1] + df10[,2]      # row totals (marginal probabilities for X, plus grand total)
df12 <- cbind(df10, df11)
rownames(df12) <- c('X < x', 'X > x', 'Total')
colnames(df12) <- c('Y < y', 'Y > y', 'Total')
df12 %>% kable() %>%
  kable_styling()

|       | Y < y | Y > y | Total |
|---|---|---|---|
| X < x | 0.1256 | 0.3744 | 0.5 |
| X > x | 0.1244 | 0.3756 | 0.5 |
| Total | 0.2500 | 0.7500 | 1.0 |
Looking at the marginal probabilities in the totals, we can tell that this table is consistent with how the cut points were chosen: x is the median of X, so the row totals are each 0.5, and y is the first quartile of Y, so the column totals are 0.25 and 0.75. The joint probabilities in each row and column also sum to the corresponding marginal probabilities, which is what we would expect.
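For reference, a more compact way to build the same table (a sketch, not the code used above) is to cross-tabulate the two logical conditions and convert counts to proportions, again assuming the small_x and small_y objects defined earlier:
# rows are X > x (FALSE/TRUE), columns are Y > y (FALSE/TRUE)
tab <- table(X > small_x, Y > small_y)
prop.table(tab)               # joint probabilities
addmargins(prop.table(tab))   # adds the marginal totals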
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher’s Exact Test of Independence is used to examine the association (contingency - which is why we use contingency tables) between two classifications. In essence, we want to know whether the two classifications are associated (dependent) or whether they are independent. We typically use Fisher’s exact test when an experiment has a small number of observations - in particular, when:
- more than 20% of cells have expected counts less than 5, or any expected cell count is less than 1.
Looking at our contingency table above, we don’t HAVE to use Fisher’s Exact Test of Independence, but we can nonetheless, because the test still works for larger sample sizes. The real issue runs the other way: the chi-square test cannot be used for experiments with small sample sizes, because its distributional approximation is inadequate when expected cell counts are small.
Now let’s turn to our experiment to see if there is a relationship between our two variables. Here our hypothesis will be the same for both the Fisher’s Exact test and the Chi Square Test:
- \({ H }_{ 0 }:\) The variables are independent and there is no relationship between the two variables
- \({ H }_{ 1 }:\) The variables are dependent and there is a relationship between the two variables
#------------ adjusting dataframe to work with Fisher's test --------
df <- data.frame(X = X, Y = Y)
df1 <- df %>%
  filter(X > small_x & Y > small_y) %>%
  nrow()
df2 <- df %>%
  filter(X < small_x & Y > small_y) %>%
  nrow()
df3 <- rbind(df2, df1)
df5 <- df %>%
  filter(X < small_x & Y < small_y) %>%
  nrow()
df6 <- df %>%
  filter(X > small_x & Y < small_y) %>%
  nrow()
df7 <- rbind(df5, df6)
df10 <- cbind(df7, df3)
rownames(df10) <- c('X < x', 'X > x')
colnames(df10) <- c('Y < y', 'Y > y')
#-----------------------------------------------#
#--------------- Fisher's Test -----------------#
fisher.test(df10)
##
## Fisher's Exact Test for Count Data
##
## data: df10
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9242273 1.1100187
## sample estimates:
## odds ratio
## 1.012883
Looking at the test results, we can see that our p-value is well above the generally accepted 0.05 threshold. In this case, we fail to reject the null hypothesis and conclude that these two variables are independent.
Now, we’ll run the same test, but using the chi-square test.
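The output below is presumably produced by running chisq.test on the same contingency table (for a 2x2 table, R applies Yates’ continuity correction by default, which matches the output heading):
#--------------- Chi-Square Test ---------------#
chisq.test(df10)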
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df10
## X-squared = 0.064533, df = 1, p-value = 0.7995
Looking at these results, we come back with essentially the same p-value as we did with Fisher’s test. Here again we fail to reject the null hypothesis and conclude that the variables are independent.
So why are the p-values the same? As mentioned above, our sample size is 10,000, which is quite large, and every cell of our contingency table has more than enough observations for a valid chi-square test, so in this case we could run either test and expect nearly identical results. Fisher’s test is the one to reach for when the sample size is small, which isn’t the case here, though it does no harm to run it on a large sample - as we’ve seen, it gives just as accurate a result. Given the large sample size, the chi-square test is the more conventional choice for this data.
Problem 2
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following:
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
train <- readr::read_csv('C:/Users/chris/OneDrive/Master Of Data Science - CUNY/Fall 2020/DATA605-Computational Mathematics/Final/train.csv')
head(train)
## # A tibble: 6 x 81
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## # ... with 73 more variables: LandContour <chr>, Utilities <chr>,
## # LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>,
## # Condition2 <chr>, BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>,
## # OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>,
## # RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>,
## # MasVnrArea <dbl>, ExterQual <chr>, ExterCond <chr>, Foundation <chr>,
## # BsmtQual <chr>, BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>,
## # BsmtFinSF1 <dbl>, BsmtFinType2 <chr>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>,
## # TotalBsmtSF <dbl>, Heating <chr>, HeatingQC <chr>, CentralAir <chr>,
## # Electrical <chr>, `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>,
## # GrLivArea <dbl>, BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>,
## # HalfBath <dbl>, BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## # TotRmsAbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, FireplaceQu <chr>,
## # GarageType <chr>, GarageYrBlt <dbl>, GarageFinish <chr>, GarageCars <dbl>,
## # GarageArea <dbl>, GarageQual <chr>, GarageCond <chr>, PavedDrive <chr>,
## # WoodDeckSF <dbl>, OpenPorchSF <dbl>, EnclosedPorch <dbl>,
## # `3SsnPorch` <dbl>, ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <chr>,
## # Fence <chr>, MiscFeature <chr>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>
First, let’s get a look at the dimensions of our dataset:
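The output below presumably comes from a call like:
dim(train)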
## [1] 1460 81
Our data has 1460 rows and 81 columns.
Next, we’ll generate some summary statistics of each feature by using the summary function from base R.
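The lengthy output that follows presumably comes from:
summary(train)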
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical 1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
We can see that there is a tremendous wealth of information in this summary. Some interesting points about the dataset in general are that we are looking at houses built between 1872 and 2010, and that our dependent variable, SalePrice, ranges from 34,900 to 755,000.
Let’s also take a look at the distribution of our dependent variable, sale price:
ggplot(train) +
aes(x = SalePrice) +
geom_histogram(bins =45 ) +
scale_x_continuous(labels=comma) +
labs(title="Home Sale Price Distribution",
x="Price") +
theme(plot.title = element_text(hjust = 0.425))
Now, we’ll take a look at the relationship between several of our independent variables and our dependent variable, SalePrice. We’ll select these variables using the hypothesis that above ground living area square feet (GrLivArea), total basement square feet (TotalBsmtSF), and garage area (GarageArea) will be highly predictive factors for the model.
ggplot(train) +
aes(x = GrLivArea, y = SalePrice) +
geom_point(alpha = 0.25) + scale_y_continuous(labels=comma) +
geom_smooth(method = lm ,se = FALSE) +
scale_x_continuous(labels=comma) +
labs(title="Home Sale Price vs. Above Ground Square Footage",
y="Home Price $",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
## `geom_smooth()` using formula 'y ~ x'
We can see by looking at the plot above that the above ground square footage has a fairly strong linear relationship with the home sale price.
Now, let’s look at the total square footage of the basement:
ggplot(train) +
aes(x = TotalBsmtSF, y = SalePrice) +
geom_point(alpha = 0.25) + scale_y_continuous(labels=comma) +
geom_smooth(method = lm ,se = FALSE) +
scale_x_continuous(labels=comma) +
labs(title="Home Sale Price vs. Total Basement Square Footage",
y="Home Price $",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
The total basement square footage also has a linear relationship with the home sale price, although the relationship is not as strong as what we saw with the previous feature.
Lastly, we’ll look at garage area:
ggplot(train) +
aes(x = GarageArea, y = SalePrice) +
geom_point(alpha = 0.25) + scale_y_continuous(labels=comma) +
geom_smooth(method = lm ,se = FALSE) +
scale_x_continuous(labels=comma) +
labs(title="Home Sale Price vs. Garage Square Footage",
y="Home Price $",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
Again, here we can see that garage square footage has a linear relationship with the home price. This relationship looks to be moderately strong.
How strong are these relationships? Let’s create a correlation matrix to determine the strength of each of these relationships:
df_cor <- train %>% dplyr::select(SalePrice, GarageArea, TotalBsmtSF, GrLivArea)
cor(df_cor, use = "complete.obs")
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 1.0000000 0.6234314 0.6135806 0.7086245
## GarageArea 0.6234314 1.0000000 0.4866655 0.4689975
## TotalBsmtSF 0.6135806 0.4866655 1.0000000 0.4548682
## GrLivArea 0.7086245 0.4689975 0.4548682 1.0000000
While the correlation matrix is fairly small, it still often helps to represent this matrix visually:
df_cor <- train %>% dplyr::select(SalePrice, GarageArea, TotalBsmtSF, GrLivArea)
corrplot(cor(df_cor, use = "complete.obs"), type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
We can see that all three of the variables we have chosen are strongly correlated with sales price, as we saw in our scatterplots above. We can also see that there may be some slight collinearity between these features as well.
Before moving on, let’s test the following hypotheses for each of our features as it relates to sales price:
- \({ H }_{0 }:\) Correlation between the pairwise set of variables is 0 and is not significant
- \({ H }_{1 }:\) Correlation between the pairwise set of variables is not 0 and is significant
We’ll use the cor.test method to compute this. In addition, this function has the ability to generate confidence intervals based on a specified input. We’ll generate 80% confidence intervals that will display as part of the output.
Let’s first look at the pairwise correlation between sales price and garage area:
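The output below (and the two tests that follow it) is presumably produced by cor.test with an 80% confidence level, for example:
cor.test(df_cor$SalePrice, df_cor$GarageArea, conf.level = 0.80)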
##
## Pearson's product-moment correlation
##
## data: df_cor$SalePrice and df_cor$GarageArea
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
We can see here that the p-value is extremely low, which means we will reject the null hypothesis and conclude that the correlation is not 0 and the relationship is significant. Additionally, you can see that 0 is not included in the 80% confidence interval.
Now, we’ll turn our attention to sales price and above ground square footage (GrLivArea):
##
## Pearson's product-moment correlation
##
## data: df_cor$SalePrice and df_cor$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
Here again, the p-value is extremely small which leads us to reject the null hypothesis and conclude that the correlation between these two variables is not 0 and the relationship is significant. Additionally, as we saw before, 0 is not part of the 80% confidence interval.
Lastly, we’ll look at sales price and total basement square footage (TotalBsmtSF)
##
## Pearson's product-moment correlation
##
## data: df_cor$SalePrice and df_cor$TotalBsmtSF
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
Following suit from our previous tests, the p-value is extremely small. We will reject the null hypothesis and conclude that the true correlation is not equal to 0. You can see in the 80% confidence interval that, once again, 0 is not included in the interval.
These tests give us reasonable assurance that there is, in fact, a significant relationship between each of these variables and sales price, and that they are likely to be useful predictors in our multiple regression model.
Now, because we’ve performed multiple statistical tests on the same dataset, it behooves us to check the family-wise error rate. We can do this using the following formula: \(1-(1-\alpha)^{m}\)
In the above formula, \(\alpha\) is our level of significance (the threshold for rejecting the null hypothesis) and \(m\) is the number of tests we performed. We set \(\alpha\) equal to 0.01, the threshold we used to judge significance in the tests above, and set \(m = 3\) since we ran 3 tests.
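A minimal sketch of the calculation that produces the value below:
alpha <- 0.01   # significance level used above
m <- 3          # number of pairwise correlation tests
1 - (1 - alpha)^m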
## [1] 0.029701
We can see from the output above that our family-wise error rate is very low (~3%). What does this mean? It means that the chance of committing at least one Type I error across our three tests - that is, rejecting a null hypothesis when it is actually true - is very low.
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
We’ll take our correlation matrix from above:
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 1.0000000 0.6234314 0.6135806 0.7086245
## GarageArea 0.6234314 1.0000000 0.4866655 0.4689975
## TotalBsmtSF 0.6135806 0.4866655 1.0000000 0.4548682
## GrLivArea 0.7086245 0.4689975 0.4548682 1.0000000
and use the solve method from R to get the inverse (which is also called the precision matrix):
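A minimal sketch of this step; the object names cor_matrix and precision_matrix are assumed here rather than taken from the original code:
cor_matrix <- cor(df_cor, use = "complete.obs")   # correlation matrix from above
precision_matrix <- solve(cor_matrix)             # matrix inverse = precision matrix
precision_matrix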
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 2.9597719 -0.82302478 -0.80327436 -1.34598629
## GarageArea -0.8230248 1.68692624 -0.27914378 -0.08097502
## TotalBsmtSF -0.8032744 -0.27914378 1.65207569 -0.05133909
## GrLivArea -1.3459863 -0.08097502 -0.05133909 2.01512843
We can do a quick check to make sure we calculated the inverse correctly by multiplying the original matrix and the inverse together. If we did it correctly, the calculation should return the identity matrix (within rounding error):
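A sketch of that check, assuming the cor_matrix and precision_matrix objects from above (rounding is used so floating-point noise displays as 0):
round(cor_matrix %*% precision_matrix, 10)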
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 1 0 0 0
## GarageArea 0 1 0 0
## TotalBsmtSF 0 0 1 0
## GrLivArea 0 0 0 1
Additionally multiplying the inverse by the correlation matrix results in the identity matrix as well (within rounding error):
## SalePrice GarageArea TotalBsmtSF GrLivArea
## SalePrice 1 0 0 0
## GarageArea 0 1 0 0
## TotalBsmtSF 0 0 1 0
## GrLivArea 0 0 0 1
Now, let’s perform lower-upper (LU) matrix decomposition on the correlation matrix. Decomposing a matrix into its parts can often make downstream calculations simpler. We’ll use the lu.decomposition method from the matrixcalc library to decompose this matrix into lower and upper triangular matrices:
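A sketch of the decomposition call, again assuming the cor_matrix object named above:
library(matrixcalc)
lu <- lu.decomposition(cor_matrix)   # returns a list with components $L and $U
lu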
## $L
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.00000000 0.00000000 0
## [2,] 0.6234314 1.00000000 0.00000000 0
## [3,] 0.6135806 0.17034908 1.00000000 0
## [4,] 0.7086245 0.04452351 0.02547683 1
##
## $U
## [,1] [,2] [,3] [,4]
## [1,] 1 6.234314e-01 0.6135806 0.70862448
## [2,] 0 6.113332e-01 0.1041401 0.02721870
## [3,] 0 -1.387779e-17 0.6057787 0.01543332
## [4,] 0 3.535621e-19 0.0000000 0.49624629
Multiplying the lower triangle by the upper triangle should result in our original correlation matrix:
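Using the lu object from the sketch above, the check is simply:
lu$L %*% lu$U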
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 0.6234314 0.6135806 0.7086245
## [2,] 0.6234314 1.0000000 0.4866655 0.4689975
## [3,] 0.6135806 0.4866655 1.0000000 0.4548682
## [4,] 0.7086245 0.4689975 0.4548682 1.0000000
Perfect!
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
First, we’ll choose a feature from our data that is right skewed. The unfinished square feet of basement area (BsmtUnfSF) is right skewed as we can see in the histogram below:
right <- ggplot(train) +
aes(x = BsmtUnfSF) +
geom_histogram(aes(y=..density..),bins = 45) +
geom_density(alpha=.1, color = "blue")+
labs(title="Unfinished Square Feet of Basement",
y="Density",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
right
Let’s check the minimum value of our feature to make sure that it is zero:
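The value shown below presumably comes from:
min(train$BsmtUnfSF)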
## [1] 0
Now, let’s use fitdistr from the MASS package to fit our feature to an exponential distribution.
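A minimal sketch of the fit; the object name fit is assumed:
library(MASS)
fit <- fitdistr(train$BsmtUnfSF, densfun = "exponential")
fit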
## rate
## 1.762921e-03
## (4.613775e-05)
Running the above function has given us a \(\lambda\) value of 0.0017629. We’ll now use this \(\lambda\) or rate to generate 1,000 samples from this exponential distribution using the rexp function from the stats library.
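A sketch of the sampling step; lambda is extracted from the fit object assumed above, and samples matches the object name used in the plotting code below:
lambda <- fit$estimate                 # fitted rate parameter
samples <- rexp(1000, rate = lambda)   # 1,000 draws from the fitted exponential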
Armed with samples from our exponential distribution, lets build another histogram and compare it to the one we made previously:
expo <- ggplot(tibble(samples)) +
aes(x = samples) +
geom_histogram(aes(y=..density..),bins = 45) +
geom_density(alpha=.1, color = "blue") +
labs(title="Fit Exponential Distribution",
y="Density",
x="Sq Ft") +
theme_classic() +
theme(panel.grid.major.y = element_line(colour = "grey90",linetype='dashed'),
plot.title = element_text(hjust = 0.30))
ggpubr::ggarrange(expo, right, ncol = 1, nrow = 2)
Looking at the two distributions above, especially at the density curves, we can see that the samples from the fitted exponential distribution differ noticeably in shape from the original variable. In the top plot, which shows the fitted exponential samples, we can clearly see an exponential decay, whereas the bottom plot looks much more like a gentle decline.
Now, let’s use the exponential probability density function (PDF) to find the 5th and 95th percentiles via the cumulative distribution function (CDF). To do this we’ll use the qexp method (the inverse CDF, or quantile function) from the stats library.
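A sketch of these quantile calls, using the lambda value fitted above:
qexp(0.05, rate = lambda)   # 5th percentile of the fitted exponential
qexp(0.95, rate = lambda)   # 95th percentile of the fitted exponential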
## [1] 29.09563
## [1] 1699.3
Next, we’ll generate a 95% confidence interval from the original data, assuming normality. First, let’s find the margin of error:
center <- mean(train$BsmtUnfSF)
stddev <- sd(train$BsmtUnfSF)
n <- length(train$BsmtUnfSF)
error <- qnorm(0.95)*stddev/sqrt(n)
error
## [1] 19.02139
Now utilizing our center point (mean), we can build out our confidence interval:
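The bounds below presumably come from adding and subtracting the margin of error from the mean:
center - error   # lower bound
center + error   # upper bound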
## [1] 548.219
## [1] 586.2618
For our mean of 567.240411, the lower bound of our confidence interval is 548.2190164 and the upper bound is 586.2618055.
Similar to what we did above, we’ll look at the 5th and 95th percentiles of the data:
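These empirical percentiles presumably come from quantile calls like:
quantile(train$BsmtUnfSF, 0.05)
quantile(train$BsmtUnfSF, 0.95)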
## 5%
## 0
## 95%
## 1468
The empirical 5th percentile is 0 (at least 5% of homes have no unfinished basement area), while the fitted exponential puts the 5th percentile at about 29 square feet. The empirical 95th percentile (1,468) is likewise below the exponential estimate of about 1,699. The exponential distribution captures the general right-skewed shape of the data, but it is only a rough approximation.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
My model and analysis can be found here. My Kaggle username is christianthieme and my score from the competition is: