Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\)
set.seed(123)
N <- 10
sigma <- (N + 1)/2
mu <- sigma
X <- runif(n = 10000, min = 1, max = N)
Y <- rnorm(n = 10000, mean = mu, sd = sigma)
df <- data.frame(cbind(X, Y))
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.001 3.276 5.451 5.478 7.691 9.999
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -15.649 1.841 5.475 5.511 9.330 26.663
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
x <- median(X)
y <- quantile(Y, 0.25)
\(P(X>x|X>y)\). The likelihood of event A, X > x, given that B, X > y), is true. $ = $
P_AB <- df %>%
filter(X > x & X > y) %>%
nrow()
P_B <- df %>%
filter(X > y) %>%
nrow()
P1 <- P_AB/P_B
round(P1, 2)
## [1] 0.55
\(P(X>x, Y>y)\). This is the probability of the intersection of P(A) and P(B)
P2 <- df %>%
filter (X > x & Y>y) %>%
nrow()/10000
round(P2, 2)
## [1] 0.38
\(P(X<x|X>y)\) The likelihood of event A, X < x, given that B, X > y, is true. $ = $
P_AB <- df %>%
filter(X < x & X > y) %>%
nrow()
P_B <- df %>%
filter(X > y) %>%
nrow()
P3 <- P_AB/P_B
round(P3, 2)
## [1] 0.45
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
p11 <- df %>%
filter (X > x & Y>y) %>%
nrow()/10000
p21 <- df %>%
filter (X > x & Y<y) %>%
nrow()/10000
p31 <- df %>%
filter (X > x) %>%
nrow()/10000
p12 <- df %>%
filter (X < x & Y>y) %>%
nrow()/10000
p22 <- df %>%
filter (X < x & Y<y) %>%
nrow()/10000
p23 <- df %>%
filter (X < x) %>%
nrow()/10000
m1 <- df %>%
filter (Y > y) %>%
nrow()/10000
m2 <- df %>%
filter (Y < y) %>%
nrow()/10000
prop_matrix <- matrix(c(p11, p21,p31,p12,p22,p23,m1,m2,1.00), nrow = 3, byrow=FALSE,
dimnames = list(c('Y>y','Y<y', 'Total'),
c('X>x','X<x', 'Total')))
prop_matrix
## X>x X<x Total
## Y>y 0.3756 0.3744 0.75
## Y<y 0.1244 0.1256 0.25
## Total 0.5000 0.5000 1.00
#P(A) * P(B)
m1 * p31
## [1] 0.375
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Both the Fisher’s Exact test and the Chi Square test are both used to determine if there is a significant relationship between variables. Chi-square test is used when the sample is large and the Fisher’s exact test is used when the sample is small. Here, the Chi-Square test is appropriate.
p11 <- p11 * 10000
p21 <- p21 * 10000
p12 <- p12 * 10000
p22 <- p22 * 10000
prop_matrix2 <- matrix(c(p11, p21,p12,p22), nrow = 2, byrow=FALSE,
dimnames = list(c('Y>y','Y<y'),
c('X>x','X<x')))
prop_matrix2
## X>x X<x
## Y>y 3756 3744
## Y<y 1244 1256
chisq.test(prop_matrix2)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: prop_matrix2
## X-squared = 0.064533, df = 1, p-value = 0.7995
fisher.test(prop_matrix2)
##
## Fisher's Exact Test for Count Data
##
## data: prop_matrix2
## p-value = 0.7995
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9242273 1.1100187
## sample estimates:
## odds ratio
## 1.012883
We cannot reject the null hypothesis. The independence remains true.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
path <- 'house prices - train.csv'
house_prices_train <- read.csv(path)
# Remove outliers
out <- boxplot.stats(house_prices_train$SalePrice)$out
out_rows <- which(house_prices_train$SalePrice %in% c(out))
house_prices_train <- house_prices_train[-out_rows,]
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
# Descriptive statistics
summary(house_prices_train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.00 Length:1399 Min. : 21.00
## 1st Qu.: 367.5 1st Qu.: 20.00 Class :character 1st Qu.: 59.00
## Median : 739.0 Median : 50.00 Mode :character Median : 68.00
## Mean : 733.5 Mean : 57.49 Mean : 69.06
## 3rd Qu.:1098.5 3rd Qu.: 70.00 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.00 Max. :313.00
## NA's :256
## LotArea Street Alley LotShape
## Min. : 1300 Length:1399 Length:1399 Length:1399
## 1st Qu.: 7442 Class :character Class :character Class :character
## Median : 9317 Mode :character Mode :character Mode :character
## Mean : 10155
## 3rd Qu.: 11316
## Max. :164660
##
## LandContour Utilities LotConfig LandSlope
## Length:1399 Length:1399 Length:1399 Length:1399
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1399 Length:1399 Length:1399 Length:1399
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1399 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1953
## Mode :character Median : 6.000 Median :5.000 Median :1971
## Mean : 5.984 Mean :5.591 Mean :1970
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:1999
## Max. :10.000 Max. :9.000 Max. :2009
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1399 Length:1399 Length:1399
## 1st Qu.:1966 Class :character Class :character Class :character
## Median :1992 Mode :character Mode :character Mode :character
## Mean :1984
## 3rd Qu.:2003
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1399 Length:1399 Min. : 0.00 Length:1399
## Class :character Class :character 1st Qu.: 0.00 Class :character
## Mode :character Mode :character Median : 0.00 Mode :character
## Mean : 90.18
## 3rd Qu.: 144.00
## Max. :1600.00
## NA's :7
## ExterCond Foundation BsmtQual BsmtCond
## Length:1399 Length:1399 Length:1399 Length:1399
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1399 Length:1399 Min. : 0.0 Length:1399
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 370.0 Mode :character
## Mean : 417.3
## 3rd Qu.: 686.0
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0 Length:1399
## 1st Qu.: 0.00 1st Qu.: 218.0 1st Qu.: 788 Class :character
## Median : 0.00 Median : 476.0 Median : 973 Mode :character
## Mean : 47.48 Mean : 559.5 Mean :1024
## 3rd Qu.: 0.00 3rd Qu.: 807.0 3rd Qu.:1252
## Max. :1474.00 Max. :2042.0 Max. :6110
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1399 Length:1399 Length:1399 Min. : 334.0
## Class :character Class :character Class :character 1st Qu.: 870.5
## Mode :character Mode :character Mode :character Median :1069.0
## Mean :1132.2
## 3rd Qu.:1346.5
## Max. :4692.0
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0.0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:1119 1st Qu.:0.0000
## Median : 0.0 Median : 0.000 Median :1437 Median :0.0000
## Mean : 336.5 Mean : 5.691 Mean :1474 Mean :0.4103
## 3rd Qu.: 720.0 3rd Qu.: 0.000 3rd Qu.:1728 3rd Qu.:1.0000
## Max. :1818.0 Max. :528.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.00 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.00 Median :0.0000 Median :3.000
## Mean :0.05861 Mean :1.54 Mean :0.3703 Mean :2.862
## 3rd Qu.:0.00000 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.00 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1399 Min. : 2.000 Length:1399
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.049 Mean : 6.417
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.0000 Length:1399 Length:1399 Min. :1900
## 1st Qu.:0.0000 Class :character Class :character 1st Qu.:1960
## Median :1.0000 Mode :character Mode :character Median :1978
## Mean :0.5833 Mean :1977
## 3rd Qu.:1.0000 3rd Qu.:2001
## Max. :3.0000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1399 Min. :0.000 Min. : 0.0 Length:1399
## Class :character 1st Qu.:1.000 1st Qu.: 312.0 Class :character
## Mode :character Median :2.000 Median : 471.0 Mode :character
## Mean :1.718 Mean : 458.8
## 3rd Qu.:2.000 3rd Qu.: 573.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1399 Length:1399 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 22.00
## Mean : 89.62 Mean : 44.89
## 3rd Qu.:165.00 3rd Qu.: 64.00
## Max. :736.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 0.000
## Mean : 22.38 Mean : 3.232 Mean : 14.63 Mean : 2.482
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.000 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1399 Length:1399 Length:1399 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 45.38
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1399 Length:1399
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.312 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129000
## Median :159500
## Mean :170237
## 3rd Qu.:203500
## Max. :340000
##
# Distribution of the Dependent Variable, Sale Price
ggplot(data = house_prices_train, aes(x = SalePrice)) +
geom_histogram(aes(y=..density..), color = 'black', fill = 'white') +
geom_density(alpha = .2, fill = '#FF6666') +
labs(title = 'Distribution of Sale Price')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Scatterplots
ggplot(data = house_prices_train, aes(x = OverallQual, y = SalePrice
, color = OverallQual)) +
geom_point() +
labs(title = 'Sale Price by Overall Quality')
ggplot(data = house_prices_train, aes(x = GrLivArea, y = SalePrice)) +
geom_point() +
labs(title = 'Sale Price by Above Ground Living Area')
# Correlation
(corr_matrix <- house_prices_train %>%
dplyr::select(SalePrice, OverallQual, GrLivArea) %>%
cor() %>%
as.matrix())
## SalePrice OverallQual GrLivArea
## SalePrice 1.0000000 0.7842943 0.6613246
## OverallQual 0.7842943 1.0000000 0.5379843
## GrLivArea 0.6613246 0.5379843 1.0000000
# Testing the Hypothesis
cor.test(house_prices_train$OverallQual, house_prices_train$SalePrice,
conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: house_prices_train$OverallQual and house_prices_train$SalePrice
## t = 47.251, df = 1397, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7707334 0.7971450
## sample estimates:
## cor
## 0.7842943
cor.test(house_prices_train$GrLivArea, house_prices_train$SalePrice,
conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: house_prices_train$GrLivArea and house_prices_train$SalePrice
## t = 32.953, df = 1397, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6415857 0.6801881
## sample estimates:
## cor
## 0.6613246
#Famimlywise Error Rate
(fwe <- 1-(1-.05)^2)
## [1] 0.0975
The independent variables have strong correlations with out dependent variable. Here we are not worried about a familywise error as the number of tests are low and the p-value from our tests are very low. Thus, we can reject the null hypothesis.
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
# Precision matrix
(prec_matrix <- solve(corr_matrix))
## SalePrice OverallQual GrLivArea
## SalePrice 3.286933 -1.98219009 -1.10734249
## OverallQual -1.982190 2.60267798 -0.08932896
## GrLivArea -1.107342 -0.08932896 1.78037038
# Correlation matrix * precision matrix
round(corr_matrix %*% prec_matrix,2)
## SalePrice OverallQual GrLivArea
## SalePrice 1 0 0
## OverallQual 0 1 0
## GrLivArea 0 0 1
# Precision matrix * correlation matrix
round(prec_matrix %*% corr_matrix,2)
## SalePrice OverallQual GrLivArea
## SalePrice 1 0 0
## OverallQual 0 1 0
## GrLivArea 0 0 1
#LU Decomposition
LU(corr_matrix)
## $P
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
##
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.00000000 0
## [2,] 0.7842943 1.00000000 0
## [3,] 0.6613246 0.05017437 1
##
## $U
## SalePrice OverallQual GrLivArea
## [1,] 1 0.7842943 0.66132457
## [2,] 0 0.3848824 0.01931123
## [3,] 0 0.0000000 0.56168088
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss. 10 points.
# Variable skewed to the right
ggplot(data = house_prices_train, aes(x = LotArea)) +
geom_histogram(aes(y=..density..), color = 'black', fill = 'white') +
geom_density(alpha = .2, fill = '#FF6666') +
labs(title = 'Distribution of Lot Area')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# fitting Lot Area to an exponential probability density function
(epd <- fitdistr(house_prices_train$LotArea, 'exponential'))
## rate
## 9.847093e-05
## (2.632687e-06)
# optimal lambda value
(lambda <- epd$estimate)
## rate
## 9.847093e-05
# 1000 samples using optimal lambda
opt_lam <- data.frame(samples = rexp(1000, lambda))
# Plot Histogram
ggplot(data = opt_lam, aes(x = samples)) +
geom_histogram(aes(y=..density..), color = 'black', fill = 'white') +
geom_density(alpha = .2, fill = '#FF6666') +
labs(title = 'Exponential Probability Distribution of Lot Area')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# 5th and 95th percentiles using CDF
qexp(p = 0.05, rate = lambda)
## [1] 520.8978
qexp(p = 0.95, rate = lambda)
## [1] 30422.5
# 95% confidence interval
sigma <- sd(house_prices_train$LotArea)
n <- length(house_prices_train$LotArea)
ci <- qnorm(0.95)
me <- ci * (sigma/sqrt(n))
# Upper bound
(mean(house_prices_train$LotArea) + me)
## [1] 10521.38
# Lower bound
(mean(house_prices_train$LotArea) - me)
## [1] 9789.186
# Percentiles
quantile(house_prices_train$LotArea, 0.05)
## 5%
## 3196
quantile(house_prices_train$LotArea, 0.95)
## 95%
## 16783.5
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
lr <- lm(SalePrice ~ GarageCars + OverallQual + GrLivArea + BldgType + CentralAir +
Utilities + MSZoning + OverallCond, data = house_prices_train)
# Linear regression
summary(lr)
##
## Call:
## lm(formula = SalePrice ~ GarageCars + OverallQual + GrLivArea +
## BldgType + CentralAir + Utilities + MSZoning + OverallCond,
## data = house_prices_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -260412 -16754 -1085 13861 109731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -90450.203 9982.587 -9.061 < 2e-16 ***
## GarageCars 17507.748 1319.348 13.270 < 2e-16 ***
## OverallQual 20510.961 852.005 24.074 < 2e-16 ***
## GrLivArea 38.017 2.003 18.978 < 2e-16 ***
## BldgType2fmCon -4649.219 5398.021 -0.861 0.389232
## BldgTypeDuplex -16526.140 4190.804 -3.943 8.43e-05 ***
## BldgTypeTwnhs -16086.251 4649.587 -3.460 0.000557 ***
## BldgTypeTwnhsE -140.767 3071.066 -0.046 0.963447
## CentralAirY 10631.039 3427.583 3.102 0.001964 **
## UtilitiesNoSeWa -40329.815 28421.620 -1.419 0.156129
## MSZoningFV 35529.362 10028.651 3.543 0.000409 ***
## MSZoningRH 18377.653 11593.665 1.585 0.113162
## MSZoningRL 31327.878 9283.545 3.375 0.000760 ***
## MSZoningRM 5755.675 9391.415 0.613 0.540066
## OverallCond 2857.578 729.230 3.919 9.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28400 on 1384 degrees of freedom
## Multiple R-squared: 0.7724, Adjusted R-squared: 0.7701
## F-statistic: 335.4 on 14 and 1384 DF, p-value: < 2.2e-16
# Residuals
ggplot(data = lr, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = lr, aes(x = .resid)) +
geom_histogram() +
xlab("Residuals")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = lr, aes(sample = .resid)) +
stat_qq()
Using the above linear regression, my kaggle score is a 0.75094, username dcorrea614.