Pick one of the quantitative independent variables \(X_i\) from the data set below, and define that variable as \(X\). Also, pick one of the dependent variables \(Y_i\) below, and define that as \(Y\).
I selected \(X_1\) and \(Y_1\).
# tidyverse provides tibble, dplyr, and ggplot2, which are used throughout
library(tidyverse)
data <- tibble(x = c(9.3, 4.1, 22.4, 9.1, 15.8, 7.1, 15.9, 6.9, 16.0, 6.7, 8.2, 16.0, 6.4, 11.8, 3.5, 21.7, 12.2, 9.3, 8.0, 6.2),
y = c(20.3, 19.1, 19.3, 20.9, 22.0, 23.5, 13.8, 18.8, 20.9, 18.6, 22.3, 17.6, 20.8, 28.7, 15.2, 20.9, 18.4, 10.3, 26.3, 28.1))
Calculate, at a minimum, the probabilities a through c below. Assume the small letter \(x\) is estimated as the 3rd quartile of the \(X\) variable, and the small letter \(y\) is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.
summary(data)
## x y
## Min. : 3.50 Min. :10.30
## 1st Qu.: 6.85 1st Qu.:18.55
## Median : 9.20 Median :20.55
## Mean :10.83 Mean :20.29
## 3rd Qu.:15.82 3rd Qu.:22.07
## Max. :22.40 Max. :28.70
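The two cut points used throughout this section can be pulled out directly with quantile(); a quick check:
quantile(data$x, 0.75)  # 3rd quartile of X, the 15.82 shown in the summary above
quantile(data$y, 0.25)  # 1st quartile of Y, the 18.55 shown in the summary above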
\(P(X > 15.82 \cup Y > 18.55) = \frac{17}{20}\) - The probability that a randomly selected row has X greater than 15.82 OR Y greater than 18.55.
data %>%
filter(x > 15.82 | y > 18.55) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 17
\(P(X > 15.82, Y>18.55)=\frac{3}{20}\) - The probability that a randomly selected row has X greater than 15.82 AND Y greater than 18.55.
data %>%
filter(x > 15.82, y > 18.55) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 3
\(P(X < 15.82 \cup Y > 18.55) = \frac{18}{20}=\frac{9}{10}\) - The probability that a randomly selected row has X less than 15.82 OR Y greater than 18.55.
data %>%
filter(x < 15.82 | y > 18.55) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 18
In addition, make a table of counts as shown below.
data %>%
mutate(x = x <= 15.82,
y = y <= 18.55) %>%
table()
## y
## x FALSE TRUE
## FALSE 3 2
## TRUE 12 3
|   | x <= 3rd Qu. | x > 3rd Qu. | Total |
|---|---|---|---|
| y <= 1st Qu. | 3 | 2 | 5 |
| y > 1st Qu. | 12 | 3 | 15 |
| Total | 15 | 5 | 20 |
Does splitting the training data in this fashion make them independent? Let \(A\) be the new variable counting those observations above the 1st quartile for \(X\), and let \(B\) be the new variable counting those observations above the 1st quartile for \(Y\). Does \(P(AB)=P(A)P(B)\)? Check mathematically, and then evaluate by running a Chi Square test for independence.
\(P(A)=\frac{15}{20}\)
\(P(B)=\frac{15}{20}\)
\(P(A)\times P(B)=\frac{15}{20}\times\frac{15}{20}=\frac{9}{16}\)
\(P(AB)=P(A)\times P(B|A)=\frac{15}{20}\times\frac{11}{15}=\frac{11}{20}\)
\(P(AB)=P(B)\times P(A|B)=\frac{15}{20}\times\frac{11}{15}=\frac{11}{20}\)
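These probabilities can be reproduced directly from the data, defining \(A\) and \(B\) as indicators for being above the respective 1st quartiles (A and B below are just scratch names for this check):
A <- data$x > quantile(data$x, 0.25)  # above the 1st quartile of X
B <- data$y > quantile(data$y, 0.25)  # above the 1st quartile of Y
mean(A)            # P(A)  = 15/20
mean(B)            # P(B)  = 15/20
mean(A & B)        # P(AB) = 11/20
mean(A) * mean(B)  # P(A)P(B) = 9/16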
Mathematically, \(P(AB)=\frac{11}{20}=0.55\) while \(P(A)P(B)=\frac{9}{16}=0.5625\), so the two are not equal and the events are not independent: the probability of \(B\) occurring changes once we know whether \(A\) has occurred, and vice versa, although the difference is small. The chi-squared test (run on the quartile split tabulated above) fails to reject the null hypothesis of independence (p = 0.77), so it is inconclusive, possibly because of the small number of observations.
data %>%
mutate(x = x <= 15.82,
y = y <= 18.55) %>%
table() %>%
chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 0.088889, df = 1, p-value = 0.7656
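One reason the test is unreliable here is the expected cell counts: with only 20 observations, three of the four expected counts (11.25, 3.75, 3.75, and 1.25) fall below 5, so the chi-squared approximation is questionable. A quick check (counts is just a scratch variable):
counts <- data %>%
mutate(x = x <= 15.82,
y = y <= 18.55) %>%
table()
chisq.test(counts)$expected  # expected counts under independence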
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables are 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
raw.data <- read_csv('C:\\Users\\Brian\\Desktop\\GradClasses\\Fall18\\605\\final\\all\\train.csv') %>%
select(-c(Id))
Below are the univariate descriptive statistics for all of the numeric variables.
raw.data %>%
select_if(is.numeric) %>%
summary()
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 50.0 Median : 69.00 Median : 9478 Median : 6.000
## Mean : 56.9 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :190.0 Max. :313.00 Max. :215245 Max. :10.000
## NA's :259
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1994 Median : 0.0
## Mean :5.575 Mean :1971 Mean :1985 Mean : 103.7
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## NA's :8
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## Median : 383.5 Median : 0.00 Median : 477.5 Median : 991.5
## Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000
## Median :3.000 Median :1.000 Median : 6.000 Median :1.000
## Mean :2.866 Mean :1.047 Mean : 6.518 Mean :0.613
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :8.000 Max. :3.000 Max. :14.000 Max. :3.000
##
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00
## Median :1980 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.767 Mean : 473.0 Mean : 94.24
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## NA's :81
## OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
## 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
##
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 2.759 Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :738.000 Max. :15500.00 Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
I plotted my three selected variables from the training set. The variables SalePrice and LotArea were heavily skewed, so I log-transformed them to make the data easier to visualize.
mod.data <- raw.data %>%
dplyr::select(SalePrice, TotRmsAbvGrd, LotArea) %>%
mutate(LotArea = log(LotArea),
SalePrice = log(SalePrice))
mod.data %>%
gather("key", "value") %>%
ggplot() +
geom_histogram(aes(value), bins=30) +
facet_wrap(~key, scales='free_x')
Below is the scatterplot matrix for my three selected variables. SalePrice and LotArea are still log-transformed.
mod.data %>%
select(SalePrice, LotArea, TotRmsAbvGrd) %>%
pairs()
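The numeric correlation matrix for the same three (log-transformed) variables can be computed with cor(); its off-diagonal entries should match the pairwise estimates reported by cor.test below (roughly 0.40, 0.53, and 0.36):
mod.data %>%
select(SalePrice, LotArea, TotRmsAbvGrd) %>%
cor()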
The 80% confidence interval for the correlation between SalePrice and LotArea is \((0.37, 0.42)\)
cor.test(mod.data$SalePrice, mod.data$LotArea, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: mod.data$SalePrice and mod.data$LotArea
## t = 16.661, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3713402 0.4277383
## sample estimates:
## cor
## 0.3999177
The 80% confidence interval for the correlation between SalePrice and TotRmsAbvGrd is \((0.51, 0.55)\)
cor.test(mod.data$SalePrice, mod.data$TotRmsAbvGrd, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: mod.data$SalePrice and mod.data$TotRmsAbvGrd
## t = 24.143, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5100081 0.5579759
## sample estimates:
## cor
## 0.5344222
The 80% confidence interval for the correlation between TotRmsAbvGrd and LotArea is \((0.33, 0.38)\)
cor.test(mod.data$TotRmsAbvGrd, mod.data$LotArea, conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: mod.data$TotRmsAbvGrd and mod.data$LotArea
## t = 14.74, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3305629 0.3889893
## sample estimates:
## cor
## 0.3601291
The correlations between all three variables are positive (roughly 0.40, 0.53, and 0.36) and significantly different from zero at the 80% confidence level. Familywise error can occur when performing multiple hypothesis tests: the likelihood of making at least one Type I error increases as the number of comparisons grows. Given the small number of tests here and the extremely small p-values, I am not worried that such an error has occurred. For completeness, however, I used the Bonferroni correction to adjust the p-values for the multiple tests. The adjusted p-values are still statistically significant, providing further evidence of a positive relationship between all of the selected variables.
p.adjust(c(2.2e-16, 2.2e-16, 2.2e-16), method="bonferroni")
## [1] 6.6e-16 6.6e-16 6.6e-16
Invert your \(3\times3\) correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct \(LU\) decomposition on the matrix.
cor.matrix <- raw.data %>%
select(SalePrice, TotRmsAbvGrd, LotArea) %>%
cor()
cor.matrix
## SalePrice TotRmsAbvGrd LotArea
## SalePrice 1.0000000 0.5337232 0.2638434
## TotRmsAbvGrd 0.5337232 1.0000000 0.1900148
## LotArea 0.2638434 0.1900148 1.0000000
prec.matrix <- solve(cor.matrix)
prec.matrix
## SalePrice TotRmsAbvGrd LotArea
## SalePrice 1.4539777 -0.72946544 -0.24501314
## TotRmsAbvGrd -0.7294654 1.40343330 -0.07420846
## LotArea -0.2450131 -0.07420846 1.07874579
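As the prompt notes, the diagonal of the precision matrix holds the variance inflation factors, which can be pulled out directly; all three are close to 1 (about 1.45, 1.40, and 1.08), suggesting little multicollinearity among these variables:
diag(prec.matrix)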
cor.matrix %*% prec.matrix %>%
round(2)
## SalePrice TotRmsAbvGrd LotArea
## SalePrice 1 0 0
## TotRmsAbvGrd 0 1 0
## LotArea 0 0 1
prec.matrix %*% cor.matrix %>%
round(2)
## SalePrice TotRmsAbvGrd LotArea
## SalePrice 1 0 0
## TotRmsAbvGrd 0 1 0
## LotArea 0 0 1
Conduct \(LU\) decomposition on the matrix.
library(matrixcalc)
lu.decomposition(cor.matrix)
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.00000000 0
## [2,] 0.5337232 1.00000000 0
## [3,] 0.2638434 0.06879142 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.5337232 0.26384335
## [2,] 0 0.7151396 0.04919547
## [3,] 0 0.0000000 0.92700246
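As a quick sanity check, the two factors can be multiplied back together; the product should reproduce the original correlation matrix (lu is just a scratch variable holding the decomposition):
lu <- lu.decomposition(cor.matrix)
all.equal(lu$L %*% lu$U, cor.matrix, check.attributes = FALSE)  # should be TRUE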
Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
I selected LotArea. It is heavily right-skewed, as seen in the boxplot below.
raw.data %>%
select(LotArea) %>%
gather("key", "value") %>%
ggplot() +
geom_boxplot(aes(key, value))
Then load the MASS package and run fitdistr to fit an exponential probability density function. Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value.
library(MASS)
fitdistr(raw.data$LotArea, densfun='exponential')
## rate
## 9.508570e-05
## (2.488507e-06)
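For an exponential distribution, the maximum-likelihood estimate of the rate is simply the reciprocal of the sample mean, so the fitted rate lines up with the mean LotArea of about 10,517 reported in the summary above:
1 / mean(raw.data$LotArea)  # roughly 9.51e-05, matching the rate returned by fitdistr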
Plot a histogram and compare it with a histogram of your original variable.
The theoretical exponential distribution is spread much more widely around its mean, while the actual data are concentrated in a very narrow range and then fall off more steeply than the theoretical draws. This suggests that the exponential distribution may not accurately describe the LotArea variable.
compare.data <- tibble(exp = rexp(1460, 9.508570e-5)) %>%
mutate(lotarea = raw.data$LotArea)
compare.data %>%
gather("key", "value") %>%
mutate(value = as.numeric(value)) %>%
ggplot() +
geom_histogram(aes(value), bins=70) +
facet_wrap(~key)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.
quantile(raw.data$LotArea, c(0.05, 0.95))
## 5% 95%
## 3311.70 17401.15
library(gmodels)
ci(compare.data$exp, confidence=0.95)
## Estimate CI lower CI upper Std. Error
## 10748.754 10184.438 11313.071 287.683
quantile(rexp(1000, 9.508570e-5), c(0.05, 0.95))
## 5% 95%
## 475.0215 30308.2985
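The 5th and 95th percentiles can also be taken directly from the fitted exponential CDF with qexp(), and a normal-theory confidence interval can be computed on the empirical LotArea values themselves (lambda below is just a scratch name for the fitted rate):
lambda <- 9.508570e-05                  # rate estimated by fitdistr above
qexp(c(0.05, 0.95), rate = lambda)      # closed-form percentiles, roughly 539 and 31,506

# 95% CI for the mean of the empirical data, assuming normality
ci(raw.data$LotArea, confidence = 0.95)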
As mentioned above, the 5th to 95th percentile range of the actual data (roughly 14,000) is far narrower than that of the theoretical draws (roughly 30,000). The 95% confidence interval from the simulated exponential data, (10184, 11313), does contain the empirical mean of 10517, but given the mismatch in spread I would not model LotArea with an exponential distribution.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Please see the other RMD file.
https://www.youtube.com/watch?v=njCM-Iaeq4A&feature=youtu.be