Using R, generate a random variable \(X\) that has 10,000 random uniform numbers from 1 to \(N\), where \(N\) can be any number of your choosing greater than or equal to 6. Then generate a random variable \(Y\) that has 10,000 random normal numbers with a mean of \(\mu=\sigma=(N+1)/2\)
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the \(X\) variable, and the small letter “y” is estimated as the 1st quartile of the \(Y\) variable.
set.seed(87)
X <- runif(10000, min=1, max=6)
Y <- rnorm(10000, mean=7/2, sd=7/2)
par(mfrow = c(1, 2))
hist(X)
hist(Y)
x <- median(X)
y <-quantile(Y,0.25)
x_probability_greater <- sum(X > x) / length(X)
x_probability_leq <- 1 - x_probability_greater
y_probability_greater <- sum(Y > y) / length(Y)
y_probability_leq <- 1 - y_probability_greater
x_count_greater <- sum(X > x)
x_count_leq <- length(X) - x_count_greater
y_count_greater <- sum(Y > y)
y_count_leq <- length(Y) - y_count_greater
\[P(A \space and \space B) = P(A) * P(B|A)\]
Interpret the meaning of all probabilities
\(a. P(X>x | Y>y)\)
a_probability <-
(x_probability_greater * y_probability_greater) / y_probability_greater
a_probability
## [1] 0.5
The probability of > x given > y 0.5
\(b. P(X>x, Y>y)\)
b_probability <-
(x_probability_greater) *
((x_probability_greater * y_probability_greater) / x_probability_greater)
b_probability
## [1] 0.375
The probability of > x and > y 0.375
\(c. P(X<x | Y>y)\)
c_probability <-
(x_probability_leq * y_probability_greater) / y_probability_greater
c_probability
## [1] 0.5
The probability of < x given > y 0.5
Investigate whether \(P(X>x \space and \space Y>y)=P(X>x) * P(Y>y)\) by building a table and evaluating the marginal and joint probabilities
#P(X>x and Y>y)
b_probability
## [1] 0.375
#P(X>x)
x_probability_greater
## [1] 0.5
#PY>y)
y_probability_greater
## [1] 0.75
Are_Probabilities_Equal <- (b_probability == (x_probability_greater * y_probability_greater))
Are_Probabilities_Equal
## [1] TRUE
Since the probabilities are equal, X and Y are independent events
Check to see if independence holds by using Fishers Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
#table of count
t <- cbind(X,Y)
table_of_count <- matrix(c(
nrow(subset(t, X > x & Y > y)),
nrow(subset(t, X <= x & Y > y)),
nrow(subset(t, X > x & Y < y)),
nrow(subset(t, X <= x & Y <= y))),
nrow=2)
colnames(table_of_count) <- c(">1st quantile","<=1st quantile")
rownames(table_of_count) <- c(">mean","<=mean")
table_of_count
## >1st quantile <=1st quantile
## >mean 3714 1286
## <=mean 3786 1214
#Fishers Exact Test
fisher.test(table_of_count)
##
## Fisher's Exact Test for Count Data
##
## data: table_of_count
## p-value = 0.1011
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8449983 1.0149039
## sample estimates:
## odds ratio
## 0.9260742
#Chi Square Test
chisq.test(table_of_count)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_of_count
## X-squared = 2.6885, df = 1, p-value = 0.1011
Since the p-values are greater than 0.05, it suggests no association between X and Y.
library(tidyr)
library(dplyr)
library(corrplot)
train <- read.csv('https://raw.githubusercontent.com/L-Velasco/DATA605_SP19/master/Final/train.csv', header=TRUE, sep=",", stringsAsFactors = FALSE)
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
dim(train)
## [1] 1460 81
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## LandSlope Neighborhood Condition1
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Condition2 BldgType HouseStyle OverallQual
## Length:1460 Length:1460 Length:1460 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 5.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 6.099
## 3rd Qu.: 7.000
## Max. :10.000
##
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Length:1460
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Class :character
## Median :5.000 Median :1973 Median :1994 Mode :character
## Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :9.000 Max. :2010 Max. :2010
##
## RoofMatl Exterior1st Exterior2nd
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## MasVnrType MasVnrArea ExterQual ExterCond
## Length:1460 Min. : 0.0 Length:1460 Length:1460
## Class :character 1st Qu.: 0.0 Class :character Class :character
## Mode :character Median : 0.0 Mode :character Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature
## Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 Length:1460
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 Class :character
## Median : 0.00 Median : 6.000 Median :2008 Mode :character
## Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :15500.00 Max. :12.000 Max. :2010
##
## SaleCondition SalePrice
## Length:1460 Min. : 34900
## Class :character 1st Qu.:129975
## Mode :character Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Independent Variables:
Dependent Variable:
plot(train$LotArea, train$SalePrice,
main = "LotArea vs SalePrice",
xlab = "LotArea",
ylab = "SalePrice")
plot(train$GrLivArea, train$SalePrice,
main = "GrLivArea vs SalePrice",
xlab = "OverallQuad",
ylab = "SalePrice")
Quantitative variables
cor_matrix <- cor(train[c("OverallQual","GarageArea","SalePrice")])
cor_matrix
## OverallQual GarageArea SalePrice
## OverallQual 1.0000000 0.5620218 0.7909816
## GarageArea 0.5620218 1.0000000 0.6234314
## SalePrice 0.7909816 0.6234314 1.0000000
cor.test(train$OverallQual, train$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
cor.test(train$GarageArea, train$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$GarageArea and train$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
cor.test(train$OverallQual, train$GarageArea, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$OverallQual and train$GarageArea
## t = 25.946, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5386198 0.5845573
## sample estimates:
## cor
## 0.5620218
Since the p-values are less than 0.05 for each pairwise variables, we can reject that null hypohethis that there is no correlation between them. Looking at the correlation matrix, there is relatively strong correlation between them.
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
cor_matrix_inv <- solve(cor_matrix);cor_matrix_inv
## OverallQual GarageArea SalePrice
## OverallQual 2.7278952 -0.3074414 -1.966046
## GarageArea -0.3074414 1.6704186 -0.798211
## SalePrice -1.9660463 -0.7982110 3.052736
cor_matrix %*% cor_matrix_inv
## OverallQual GarageArea SalePrice
## OverallQual 1.000000e+00 0 0
## GarageArea -2.220446e-16 1 0
## SalePrice -4.440892e-16 0 1
cor_matrix_inv %*% cor_matrix
## OverallQual GarageArea SalePrice
## OverallQual 1 0 -2.220446e-16
## GarageArea 0 1 0.000000e+00
## SalePrice 0 0 1.000000e+00
library(Matrix)
lu_matrix <- expand(lu(cor_matrix_inv))
lu_matrix$L;lu_matrix$U
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.0000000 . .
## [2,] -0.1127028 1.0000000 .
## [3,] -0.7207191 -0.6234314 1.0000000
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 2.7278952 -0.3074414 -1.9660463
## [2,] . 1.6357691 -1.0197899
## [3,] . . 1.0000000
lu_matrix$L %*% lu_matrix$U
## 3 x 3 Matrix of class "dgeMatrix"
## [,1] [,2] [,3]
## [1,] 2.7278952 -0.3074414 -1.966046
## [2,] -0.3074414 1.6704186 -0.798211
## [3,] -1.9660463 -0.7982110 3.052736
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000,\(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
summary(train$TotalBsmtSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.4 1298.2 6110.0
hist(train$TotalBsmtSF)
s <- train$TotalBsmtSF + 1
summary(s)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 796.8 992.5 1058.4 1299.2 6111.0
hist(s)
library(MASS)
fit <- fitdistr(s, "exponential")
lambda <- fit$estimate
lambda
## rate
## 0.0009447961
s_samples <- rexp(1000, lambda)
par(mfrow = c(1, 2))
hist(s_samples, main="Exponential")
hist(train$TotalBsmtSF, main="Empirical")
library(Rmisc)
#exponential
pct <- c(0.05, 0.95)
qexp(pct, rate = lambda)
## [1] 54.29033 3170.77127
#empirical confidence interval
CI(train$TotalBsmtSF, 0.95) #t.test(train$TotalBsmtSF)$conf.int[1:2]
## upper mean lower
## 1079.951 1057.429 1034.908
#empirical
quantile(train$TotalBsmtSF, pct)
## 5% 95%
## 519.3 1753.0
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
train_numeric <- select_if(train, is.numeric)
train_numeric_complete <- train_numeric[complete.cases(train_numeric),]
dim(train_numeric)
## [1] 1460 38
dim(train_numeric_complete)
## [1] 1121 38
id <- c(1:37)
for (val in id) {
qnt <- quantile(train_numeric_complete[,val], probs=c(.25, .75), na.rm = T)
caps <- quantile(train_numeric_complete[,val], probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(train_numeric_complete[,val], na.rm = T)
train_numeric_complete[,val][train_numeric_complete[,val] < (qnt[1] - H)] <- caps[1]
train_numeric_complete[,val][train_numeric_complete[,val] > (qnt[2] + H)] <- caps[2]
}
# Fit the full model
full.model <- lm(log(SalePrice) ~., data = train_numeric_complete)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both",
trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = log(SalePrice) ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtUnfSF +
## TotalBsmtSF + X2ndFlrSF + GrLivArea + FullBath + HalfBath +
## Fireplaces + GarageYrBlt + GarageCars + OpenPorchSF + EnclosedPorch +
## ScreenPorch + YrSold, data = train_numeric_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09569 -0.06036 0.00431 0.07190 0.43614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.146e+01 6.049e+00 1.895 0.05840 .
## MSSubClass -3.634e-04 1.279e-04 -2.841 0.00458 **
## LotArea 1.024e-05 1.635e-06 6.263 5.42e-10 ***
## OverallQual 8.139e-02 5.118e-03 15.904 < 2e-16 ***
## OverallCond 5.967e-02 5.319e-03 11.218 < 2e-16 ***
## YearBuilt 2.375e-03 3.037e-04 7.818 1.25e-14 ***
## YearRemodAdd 1.283e-03 3.032e-04 4.232 2.50e-05 ***
## BsmtFinSF1 6.268e-05 2.336e-05 2.683 0.00741 **
## BsmtUnfSF -5.016e-05 2.297e-05 -2.184 0.02920 *
## TotalBsmtSF 2.332e-04 3.301e-05 7.065 2.84e-12 ***
## X2ndFlrSF 5.032e-05 2.360e-05 2.132 0.03319 *
## GrLivArea 1.971e-04 2.378e-05 8.289 3.29e-16 ***
## FullBath 2.450e-02 1.223e-02 2.004 0.04532 *
## HalfBath 2.735e-02 1.199e-02 2.280 0.02277 *
## Fireplaces 4.049e-02 7.929e-03 5.107 3.86e-07 ***
## GarageYrBlt 7.980e-04 3.104e-04 2.571 0.01028 *
## GarageCars 4.491e-02 9.378e-03 4.789 1.90e-06 ***
## OpenPorchSF 1.653e-04 8.805e-05 1.878 0.06067 .
## EnclosedPorch 1.868e-04 7.238e-05 2.580 0.01000 *
## ScreenPorch 2.590e-04 9.148e-05 2.831 0.00472 **
## YrSold -4.919e-03 3.014e-03 -1.632 0.10293
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1336 on 1100 degrees of freedom
## Multiple R-squared: 0.8886, Adjusted R-squared: 0.8866
## F-statistic: 438.7 on 20 and 1100 DF, p-value: < 2.2e-16
It appears that we have a decent model passing linearity assumptions
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(step.model)
test <- read.csv('https://raw.githubusercontent.com/L-Velasco/DATA605_SP19/master/Final/test.csv', header=TRUE, sep=",", stringsAsFactors = FALSE)
dim(test)
## [1] 1459 80
testSalePrice <- predict(step.model, test, type = "response")
test$SalePrice <- exp(testSalePrice)
test[is.na(test)] <- mean(test$SalePrice, na.rm = TRUE)
submitTest <- subset(test, select = c("Id", "SalePrice"))
# Export
# write.csv(submitTest,file="LV_submit.csv")
Kaggle