Setup

Pick one of the quanititative independent variables from the training data set (train.csv) , and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

Probability.

X <- AmesHomes$GrLivArea
Y <- AmesHomes$SalePrice

par(oma=c(3,3,0,0),mar=c(3,3,2,2),mfrow=c(2,1))
hist(AmesHomes$GrLivArea,breaks=25, main='X')
hist(AmesHomes$SalePrice,breaks=50, main='Y')

#"x" is 3d quartile of X variable
#"y" is 2d quartile of X variable
quantile(X)
##      0%     25%     50%     75%    100% 
##  334.00 1129.50 1464.00 1776.75 5642.00
quantile(Y)
##     0%    25%    50%    75%   100% 
##  34900 129975 163000 214000 755000
  1. Calculate as a minimum the below probabilities a through c.

    Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities.

  1. Does splitting the training data in this fashion make them independent?
count(subset(AmesHomes, ( X <= 1776.75 & Y <= 163000)))
## # A tibble: 1 × 1
##       n
##   <int>
## 1   682
count(subset(AmesHomes, ( X <= 1776.75 & Y > 163000)))
## # A tibble: 1 × 1
##       n
##   <int>
## 1   413
count(subset(AmesHomes, ( X > 1776.75 & Y <= 163000)))
## # A tibble: 1 × 1
##       n
##   <int>
## 1    50
count(subset(AmesHomes, ( X > 1776.75 & Y > 163000)))
## # A tibble: 1 × 1
##       n
##   <int>
## 1   315
  1. In addition, make a table of counts as shown below.
    Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y.

TABLE OF OBSERVATION COUNTS:

<=2nd quartile > 2nd quartile Total
<= 3rd quartile 682 413 1095
> 3rd quartile 50 315 365
Total 732 728 1460

A + B represents 682 observations.

  1. Does P(A|B)=P(A)P(B)? No Check mathematically, and then
  • P(A|B) = P(B and A) / P(B)
    = (P(B) * P(A))/P(B)
    P(A) = 365/1460 = .25
    P(B) = 728/1460 = .50
    = (.50 * .25)/.50
    = .25

    P(A) * P(B) = .25 * .5 = .125

  1. Evaluate by running a Chi Square test for association.
#Negligible p-value
#Provides strong evidence to suggest that X and Y have a dependent or have some association.
chisq.test(matrix(c(682,413,50,315), ncol=2))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  matrix(c(682, 413, 50, 315), ncol = 2)
## X-squared = 256.53, df = 1, p-value < 2.2e-16

Descriptive and Inferential Statistics

Descriptive Statistics

vars id n mean sd median min max range skew kurtosis se
MSSubClass 2 1460 57 42 50 20 190 170 1 2 1
LotFrontage 4 1201 70 24 69 21 313 292 2 17 1
LotArea 5 1460 10517 9981 9478 1300 215245 213945 12 202 261
OverallQual 18 1460 6 1 6 1 10 9 0 0 0
OverallCond 19 1460 6 1 5 1 9 8 1 1 0
YearBuilt 20 1460 1971 30 1973 1872 2010 138 -1 0 1
YearRemodAdd 21 1460 1985 21 1994 1950 2010 60 -1 -1 1
MasVnrArea 27 1452 104 181 0 0 1600 1600 3 10 5
BsmtFinSF1 35 1460 444 456 384 0 5644 5644 2 11 12
BsmtFinSF2 37 1460 47 161 0 0 1474 1474 4 20 4
BsmtUnfSF 38 1460 567 442 478 0 2336 2336 1 0 12
TotalBsmtSF 39 1460 1057 439 992 0 6110 6110 2 13 11
X1stFlrSF 44 1460 1163 387 1087 334 4692 4358 1 6 10
X2ndFlrSF 45 1460 347 437 0 0 2065 2065 1 -1 11
-LowQualFinSF 46 1460 6 49 0 0 572 572 9 83 1
GrLivArea 47 1460 1515 525 1464 334 5642 5308 1 5 14
BsmtFullBath 48 1460 0 1 0 0 3 3 1 -1 0
BsmtHalfBath 49 1460 0 0 0 0 2 2 4 16 0
FullBath 50 1460 2 1 2 0 3 3 0 -1 0
HalfBath 51 1460 0 1 0 0 2 2 1 -1 0
BedroomAbvGr 52 1460 3 1 3 0 8 8 0 2 0
KitchenAbvGr 53 1460 1 0 1 0 3 3 4 21 0
TotRmsAbvGrd 55 1460 7 2 6 2 14 12 1 1 0
Fireplaces 57 1460 1 1 1 0 3 3 1 0 0
GarageYrBlt 60 1379 1979 25 1980 1900 2010 110 -1 0 1
-GarageCars 62 1460 2 1 2 0 4 4 0 0 0
GarageArea 63 1460 473 214 480 0 1418 1418 0 1 6
WoodDeckSF 67 1460 94 125 0 0 857 857 2 3 3
OpenPorchSF 68 1460 47 66 25 0 547 547 2 8 2
EnclosedPorch 69 1460 22 61 0 0 552 552 3 10 2
X3SsnPorch 70 1460 3 29 0 0 508 508 10 123 1
ScreenPorch 71 1460 15 56 0 0 480 480 4 18 1
PoolArea 72 1460 3 40 0 0 738 738 15 222 1
MiscVal 76 1460 43 496 0 0 15500 15500 24 698 13
MoSold 77 1460 6 3 6 1 12 11 0 0 0
YrSold 78 1460 2008 1 2008 2006 2010 4 0 -1 0
SalePrice 81 1460 180921 79443 163000 4900 755000 720100 2 6 2079

Barplots & Catagorical Data

Histograms for Ordinal Data

Scatterplot

Provide a 95% CI for the difference in the mean of the variables.

Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

  1. Provide a 95% CI for the difference in the mean of the variables.
  • If 100 sample distibutions are taken from the population, 95 of them are likely to have prices within the confidence interval -183465.0 -175346.4
t1 <- t.test( X, Y, paired = TRUE, conf.level = 0.95)  
t1  
## 
##  Paired t-test
## 
## data:  X and Y
## t = -86.695, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183465.0 -175346.4
## sample estimates:
## mean of the differences 
##               -179405.7
  1. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
  • If 100 sample distibutions are taken from the population, 99 of them are likely to have correlations within the confidence interval 0.6733974 0.7406408. The p-value is negligible, so reject the null hypothesis that there is no correlation
#shows strong positive correlation
c <- cor(AmesHomes[,c("GrLivArea","SalePrice")])
c
##           GrLivArea SalePrice
## GrLivArea 1.0000000 0.7086245
## SalePrice 0.7086245 1.0000000
cor.test(AmesHomes$GrLivArea, AmesHomes$SalePrice,conf.level=.99)
## 
##  Pearson's product-moment correlation
## 
## data:  AmesHomes$GrLivArea and AmesHomes$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.6733974 0.7406408
## sample estimates:
##       cor 
## 0.7086245

Linear Algebra and Correlation

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

  1. Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
#invert the correlation for precision
p <- solve(c) 
p
##           GrLivArea SalePrice
## GrLivArea  2.008632 -1.423366
## SalePrice -1.423366  2.008632
#precision * correlation
p %*% c
##           GrLivArea SalePrice
## GrLivArea         1         0
## SalePrice         0         1
#correlation * precision
c %*% p
##           GrLivArea SalePrice
## GrLivArea         1         0
## SalePrice         0         1
  1. Conduct principle components analysis (research this!) and interpret. Discuss.

Emperical vs. Theoretical Distributions

par(oma=c(3,3,0,0),mar=c(3,3,2,2),mfrow=c(2,2))
# Pricipal Components Analysis
# entering raw data and extracting PCs 
# from the correlation matrix 
mydata <- data.frame(X,Y)
fit <- princomp(mydata, cor=TRUE)
summary(fit) # print variance accounted for 
## Importance of components:
##                           Comp.1    Comp.2
## Standard deviation     1.3071436 0.5397921
## Proportion of Variance 0.8543122 0.1456878
## Cumulative Proportion  0.8543122 1.0000000
loadings(fit) # pc loadings 
## 
## Loadings:
##   Comp.1 Comp.2
## X  0.707 -0.707
## Y  0.707  0.707
## 
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0
plot(fit,type="lines") # scree plot 
#fit$scores # the principal components
biplot(fit)

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data.

  1. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load
    the MASS package and run fitdistr to fit an exponential probability density function.
    (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

    Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential
    distribution using this value (e.g., rexp(1000, )). Plot a histogram and compare it with a histogram of your original variable.

# fit exponential function to get optimal value of lambda for X
fit <- fitdistr(X,"exponential")
lambda <- fit$estimate

# get 1000 element sampling distribution
sample <- rexp(1000,lambda)

# plot historgram of exponential function
par(oma=c(3,3,0,0),mar=c(3,3,2,2),mfrow=c(2,1))
hist(sample,prob=TRUE,breaks=25)
curve(dexp(x,lambda),add=T)

# plot histogram of original x
hist(AmesHomes$GrLivArea, breaks=25)

  1. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

    The exponential distribution is not a good fit for the right-skewed curve as evidenced by the big difference in upperbound–95 percentile–data points comparing actual and fit data. It appears that despite the skew, a normal distribution would still be a better model to fit the data which could then be adjusted by a transformation function, i.e. Box-Cox . . .

# lambda is an attribute of the exponential fit pdf
# use qexp function for cdf
# get the 5th & 95th pctile using cdf
qexp(.05, rate = lambda)
## [1] 77.73313
qexp(.95, rate = lambda)
## [1] 4539.924
# generate a 95% confidence interval from emperical data 
# X is AmesHomes$GrLivArea and fit contains its descriptive statistics
# use this to calculate error and then Confidence Interval CI
err <- qnorm(.95) * (fit$sd / sqrt(fit$n))

CI <- data.frame(1 - err, 1 + err)
colnames(CI) <- c("se-","se+")
CI
##            se-      se+
## rate 0.9999993 1.000001
# Quantile gets emperical data from distribution using percentiles
quantile(X, c(.05,.95))
##     5%    95% 
##  848.0 2466.1

Model for AmesHousing

http://rpubs.com/smkarr/236666