Pick one of the quanititative independent variables from the training data set (train.csv) , and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
X <- AmesHomes$GrLivArea
Y <- AmesHomes$SalePrice
par(oma=c(3,3,0,0),mar=c(3,3,2,2),mfrow=c(2,1))
hist(AmesHomes$GrLivArea,breaks=25, main='X')
hist(AmesHomes$SalePrice,breaks=50, main='Y')
#"x" is 3d quartile of X variable
#"y" is 2d quartile of X variable
quantile(X)
## 0% 25% 50% 75% 100%
## 334.00 1129.50 1464.00 1776.75 5642.00
quantile(Y)
## 0% 25% 50% 75% 100%
## 34900 129975 163000 214000 755000
Calculate as a minimum the below probabilities a through c.
Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities.
Gross Living Area is not independent of Sales Price. What is the conditional probability of GLA being greater than 1776.75 sq ft given the sales price is greater than $163,000.
Gross Living Area is not independent of Sales Price. What is the joint probability of GLA being greater than 1776.75 sq ft and the sales price is greater than $163,000.
Gross Living Area is not independent of Sales Price. What is the conditional probability of GLA being greater than 1776.75 sq ft given the sales price is greater than $163,000.
count(subset(AmesHomes, ( X <= 1776.75 & Y <= 163000)))
## # A tibble: 1 × 1
## n
## <int>
## 1 682
count(subset(AmesHomes, ( X <= 1776.75 & Y > 163000)))
## # A tibble: 1 × 1
## n
## <int>
## 1 413
count(subset(AmesHomes, ( X > 1776.75 & Y <= 163000)))
## # A tibble: 1 × 1
## n
## <int>
## 1 50
count(subset(AmesHomes, ( X > 1776.75 & Y > 163000)))
## # A tibble: 1 × 1
## n
## <int>
## 1 315
<=2nd quartile | > 2nd quartile | Total | |
---|---|---|---|
<= 3rd quartile | 682 | 413 | 1095 |
> 3rd quartile | 50 | 315 | 365 |
Total | 732 | 728 | 1460 |
A + B represents 682 observations.
P(A|B) = P(B and A) / P(B)
= (P(B) * P(A))/P(B)
P(A) = 365/1460 = .25
P(B) = 728/1460 = .50
= (.50 * .25)/.50
= .25
P(A) * P(B) = .25 * .5 = .125
#Negligible p-value
#Provides strong evidence to suggest that X and Y have a dependent or have some association.
chisq.test(matrix(c(682,413,50,315), ncol=2))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix(c(682, 413, 50, 315), ncol = 2)
## X-squared = 256.53, df = 1, p-value < 2.2e-16
vars | id | n | mean | sd | median | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|
MSSubClass | 2 | 1460 | 57 | 42 | 50 | 20 | 190 | 170 | 1 | 2 | 1 |
LotFrontage | 4 | 1201 | 70 | 24 | 69 | 21 | 313 | 292 | 2 | 17 | 1 |
LotArea | 5 | 1460 | 10517 | 9981 | 9478 | 1300 | 215245 | 213945 | 12 | 202 | 261 |
OverallQual | 18 | 1460 | 6 | 1 | 6 | 1 | 10 | 9 | 0 | 0 | 0 |
OverallCond | 19 | 1460 | 6 | 1 | 5 | 1 | 9 | 8 | 1 | 1 | 0 |
YearBuilt | 20 | 1460 | 1971 | 30 | 1973 | 1872 | 2010 | 138 | -1 | 0 | 1 |
YearRemodAdd | 21 | 1460 | 1985 | 21 | 1994 | 1950 | 2010 | 60 | -1 | -1 | 1 |
MasVnrArea | 27 | 1452 | 104 | 181 | 0 | 0 | 1600 | 1600 | 3 | 10 | 5 |
BsmtFinSF1 | 35 | 1460 | 444 | 456 | 384 | 0 | 5644 | 5644 | 2 | 11 | 12 |
BsmtFinSF2 | 37 | 1460 | 47 | 161 | 0 | 0 | 1474 | 1474 | 4 | 20 | 4 |
BsmtUnfSF | 38 | 1460 | 567 | 442 | 478 | 0 | 2336 | 2336 | 1 | 0 | 12 |
TotalBsmtSF | 39 | 1460 | 1057 | 439 | 992 | 0 | 6110 | 6110 | 2 | 13 | 11 |
X1stFlrSF | 44 | 1460 | 1163 | 387 | 1087 | 334 | 4692 | 4358 | 1 | 6 | 10 |
X2ndFlrSF | 45 | 1460 | 347 | 437 | 0 | 0 | 2065 | 2065 | 1 | -1 | 11 |
-LowQualFinSF | 46 | 1460 | 6 | 49 | 0 | 0 | 572 | 572 | 9 | 83 | 1 |
GrLivArea | 47 | 1460 | 1515 | 525 | 1464 | 334 | 5642 | 5308 | 1 | 5 | 14 |
BsmtFullBath | 48 | 1460 | 0 | 1 | 0 | 0 | 3 | 3 | 1 | -1 | 0 |
BsmtHalfBath | 49 | 1460 | 0 | 0 | 0 | 0 | 2 | 2 | 4 | 16 | 0 |
FullBath | 50 | 1460 | 2 | 1 | 2 | 0 | 3 | 3 | 0 | -1 | 0 |
HalfBath | 51 | 1460 | 0 | 1 | 0 | 0 | 2 | 2 | 1 | -1 | 0 |
BedroomAbvGr | 52 | 1460 | 3 | 1 | 3 | 0 | 8 | 8 | 0 | 2 | 0 |
KitchenAbvGr | 53 | 1460 | 1 | 0 | 1 | 0 | 3 | 3 | 4 | 21 | 0 |
TotRmsAbvGrd | 55 | 1460 | 7 | 2 | 6 | 2 | 14 | 12 | 1 | 1 | 0 |
Fireplaces | 57 | 1460 | 1 | 1 | 1 | 0 | 3 | 3 | 1 | 0 | 0 |
GarageYrBlt | 60 | 1379 | 1979 | 25 | 1980 | 1900 | 2010 | 110 | -1 | 0 | 1 |
-GarageCars | 62 | 1460 | 2 | 1 | 2 | 0 | 4 | 4 | 0 | 0 | 0 |
GarageArea | 63 | 1460 | 473 | 214 | 480 | 0 | 1418 | 1418 | 0 | 1 | 6 |
WoodDeckSF | 67 | 1460 | 94 | 125 | 0 | 0 | 857 | 857 | 2 | 3 | 3 |
OpenPorchSF | 68 | 1460 | 47 | 66 | 25 | 0 | 547 | 547 | 2 | 8 | 2 |
EnclosedPorch | 69 | 1460 | 22 | 61 | 0 | 0 | 552 | 552 | 3 | 10 | 2 |
X3SsnPorch | 70 | 1460 | 3 | 29 | 0 | 0 | 508 | 508 | 10 | 123 | 1 |
ScreenPorch | 71 | 1460 | 15 | 56 | 0 | 0 | 480 | 480 | 4 | 18 | 1 |
PoolArea | 72 | 1460 | 3 | 40 | 0 | 0 | 738 | 738 | 15 | 222 | 1 |
MiscVal | 76 | 1460 | 43 | 496 | 0 | 0 | 15500 | 15500 | 24 | 698 | 13 |
MoSold | 77 | 1460 | 6 | 3 | 6 | 1 | 12 | 11 | 0 | 0 | 0 |
YrSold | 78 | 1460 | 2008 | 1 | 2008 | 2006 | 2010 | 4 | 0 | -1 | 0 |
SalePrice | 81 | 1460 | 180921 | 79443 | 163000 | 4900 | 755000 | 720100 | 2 | 6 | 2079 |
Provide a 95% CI for the difference in the mean of the variables.
Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
t1 <- t.test( X, Y, paired = TRUE, conf.level = 0.95)
t1
##
## Paired t-test
##
## data: X and Y
## t = -86.695, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183465.0 -175346.4
## sample estimates:
## mean of the differences
## -179405.7
#shows strong positive correlation
c <- cor(AmesHomes[,c("GrLivArea","SalePrice")])
c
## GrLivArea SalePrice
## GrLivArea 1.0000000 0.7086245
## SalePrice 0.7086245 1.0000000
cor.test(AmesHomes$GrLivArea, AmesHomes$SalePrice,conf.level=.99)
##
## Pearson's product-moment correlation
##
## data: AmesHomes$GrLivArea and AmesHomes$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.6733974 0.7406408
## sample estimates:
## cor
## 0.7086245
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
#invert the correlation for precision
p <- solve(c)
p
## GrLivArea SalePrice
## GrLivArea 2.008632 -1.423366
## SalePrice -1.423366 2.008632
#precision * correlation
p %*% c
## GrLivArea SalePrice
## GrLivArea 1 0
## SalePrice 0 1
#correlation * precision
c %*% p
## GrLivArea SalePrice
## GrLivArea 1 0
## SalePrice 0 1
par(oma=c(3,3,0,0),mar=c(3,3,2,2),mfrow=c(2,2))
# Pricipal Components Analysis
# entering raw data and extracting PCs
# from the correlation matrix
mydata <- data.frame(X,Y)
fit <- princomp(mydata, cor=TRUE)
summary(fit) # print variance accounted for
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.3071436 0.5397921
## Proportion of Variance 0.8543122 0.1456878
## Cumulative Proportion 0.8543122 1.0000000
loadings(fit) # pc loadings
##
## Loadings:
## Comp.1 Comp.2
## X 0.707 -0.707
## Y 0.707 0.707
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
plot(fit,type="lines") # scree plot
#fit$scores # the principal components
biplot(fit)
Many times, it makes sense to fit a closed form distribution to data.
For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load
the MASS package and run fitdistr to fit an exponential probability density function.
(See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential
distribution using this value (e.g., rexp(1000, )). Plot a histogram and compare it with a histogram of your original variable.
# fit exponential function to get optimal value of lambda for X
fit <- fitdistr(X,"exponential")
lambda <- fit$estimate
# get 1000 element sampling distribution
sample <- rexp(1000,lambda)
# plot historgram of exponential function
par(oma=c(3,3,0,0),mar=c(3,3,2,2),mfrow=c(2,1))
hist(sample,prob=TRUE,breaks=25)
curve(dexp(x,lambda),add=T)
# plot histogram of original x
hist(AmesHomes$GrLivArea, breaks=25)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
The exponential distribution is not a good fit for the right-skewed curve as evidenced by the big difference in upperbound–95 percentile–data points comparing actual and fit data. It appears that despite the skew, a normal distribution would still be a better model to fit the data which could then be adjusted by a transformation function, i.e. Box-Cox . . .
# lambda is an attribute of the exponential fit pdf
# use qexp function for cdf
# get the 5th & 95th pctile using cdf
qexp(.05, rate = lambda)
## [1] 77.73313
qexp(.95, rate = lambda)
## [1] 4539.924
# generate a 95% confidence interval from emperical data
# X is AmesHomes$GrLivArea and fit contains its descriptive statistics
# use this to calculate error and then Confidence Interval CI
err <- qnorm(.95) * (fit$sd / sqrt(fit$n))
CI <- data.frame(1 - err, 1 + err)
colnames(CI) <- c("se-","se+")
CI
## se- se+
## rate 0.9999993 1.000001
# Quantile gets emperical data from distribution using percentiles
quantile(X, c(.05,.95))
## 5% 95%
## 848.0 2466.1