Setup
Probability.
- TABLE OF OBSERVATION COUNTS:
Descriptive and Inferential Statistics
Linear Algebra and Correlation
- Emperical vs. Theoretical Distributions
Calculus-Based Probability & Statistics.
Model for AmesHousing

Setup

Pick one of the quanititative independent variables from the training data set (train.csv) , and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

Probability.

X <- AmesHomes$GrLivArea
Y <- AmesHomes$SalePrice

par(oma=c(3,3,0,0),mar=c(3,3,2,2),mfrow=c(2,1))
hist(AmesHomes$GrLivArea,breaks=25, main='X')
hist(AmesHomes$SalePrice,breaks=50, main='Y')

#"x" is 3d quartile of X variable
#"y" is 2d quartile of X variable
quantile(X)

##      0%     25%     50%     75%    100% 
##  334.00 1129.50 1464.00 1776.75 5642.00

quantile(Y)

##     0%    25%    50%    75%   100% 
##  34900 129975 163000 214000 755000

Calculate as a minimum the below probabilities a through c.

Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities.

Gross Living Area is not independent of Sales Price. What is the conditional probability of GLA being greater than 1776.75 sq ft given the sales price is greater than $163,000.
1. P(X > .75 | Y > .50)
  P(A) = P(X > .75) = .25
  P(B) = P(Y > .50) = .50
  P(A|B) = P(B and A) / P(B)
  = (.50 * .25 / .50)
  = .50
Gross Living Area is not independent of Sales Price. What is the joint probability of GLA being greater than 1776.75 sq ft and the sales price is greater than $163,000.
1. P(X > .75, Y > .50)
  P(A) = P(X > .75) = .25
  P(B) = P(Y > .50) = .50
  P(A,B) = P(A) + P(B)
  = (.25 + .50)
  = .75
Gross Living Area is not independent of Sales Price. What is the conditional probability of GLA being greater than 1776.75 sq ft given the sales price is greater than $163,000.
1. P(X < .75 | Y > .50)
  P(A) = P(X < .75) = .75
  P(B) = P(Y > .50) = .50
  P(A|B) = P(B and A) / P(B)
  = (.50 * .75 / .50)
  = .75

Does splitting the training data in this fashion make them independent?

No, independence describes whether there is a relation between X & Y. Splitting the data doesn’t change the relationship, it just changes the extent of problem domain.

count(subset(AmesHomes, ( X <= 1776.75 & Y <= 163000)))

## # A tibble: 1 × 1
##       n
##   <int>
## 1   682

count(subset(AmesHomes, ( X <= 1776.75 & Y > 163000)))

## # A tibble: 1 × 1
##       n
##   <int>
## 1   413

count(subset(AmesHomes, ( X > 1776.75 & Y <= 163000)))

## # A tibble: 1 × 1
##       n
##   <int>
## 1    50

count(subset(AmesHomes, ( X > 1776.75 & Y > 163000)))

## # A tibble: 1 × 1
##       n
##   <int>
## 1   315

In addition, make a table of counts as shown below.
Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y.

TABLE OF OBSERVATION COUNTS:

	<=2nd quartile	> 2nd quartile	Total
<= 3rd quartile	682	413	1095
> 3rd quartile	50	315	365
Total	732	728	1460

A + B represents 682 observations.

Does P(A|B)=P(A)P(B)? No Check mathematically, and then

P(A|B) = P(B and A) / P(B)
= (P(B) * P(A))/P(B)
P(A) = 365/1460 = .25
P(B) = 728/1460 = .50
= (.50 * .25)/.50
= .25

P(A) * P(B) = .25 * .5 = .125

Evaluate by running a Chi Square test for association.

#Negligible p-value
#Provides strong evidence to suggest that X and Y have a dependent or have some association.
chisq.test(matrix(c(682,413,50,315), ncol=2))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  matrix(c(682, 413, 50, 315), ncol = 2)
## X-squared = 256.53, df = 1, p-value < 2.2e-16

Descriptive and Inferential Statistics

Descriptive Statistics

vars	id	n	mean	sd	median	min	max	range	skew	kurtosis	se
MSSubClass	2	1460	57	42	50	20	190	170	1	2	1
LotFrontage	4	1201	70	24	69	21	313	292	2	17	1
LotArea	5	1460	10517	9981	9478	1300	215245	213945	12	202	261
OverallQual	18	1460	6	1	6	1	10	9	0	0	0
OverallCond	19	1460	6	1	5	1	9	8	1	1	0
YearBuilt	20	1460	1971	30	1973	1872	2010	138	-1	0	1
YearRemodAdd	21	1460	1985	21	1994	1950	2010	60	-1	-1	1
MasVnrArea	27	1452	104	181	0	0	1600	1600	3	10	5
BsmtFinSF1	35	1460	444	456	384	0	5644	5644	2	11	12
BsmtFinSF2	37	1460	47	161	0	0	1474	1474	4	20	4
BsmtUnfSF	38	1460	567	442	478	0	2336	2336	1	0	12
TotalBsmtSF	39	1460	1057	439	992	0	6110	6110	2	13	11
X1stFlrSF	44	1460	1163	387	1087	334	4692	4358	1	6	10
X2ndFlrSF	45	1460	347	437	0	0	2065	2065	1	-1	11
-LowQualFinSF	46	1460	6	49	0	0	572	572	9	83	1
GrLivArea	47	1460	1515	525	1464	334	5642	5308	1	5	14
BsmtFullBath	48	1460	0	1	0	0	3	3	1	-1	0
BsmtHalfBath	49	1460	0	0	0	0	2	2	4	16	0
FullBath	50	1460	2	1	2	0	3	3	0	-1	0
HalfBath	51	1460	0	1	0	0	2	2	1	-1	0
BedroomAbvGr	52	1460	3	1	3	0	8	8	0	2	0
KitchenAbvGr	53	1460	1	0	1	0	3	3	4	21	0
TotRmsAbvGrd	55	1460	7	2	6	2	14	12	1	1	0
Fireplaces	57	1460	1	1	1	0	3	3	1	0	0
GarageYrBlt	60	1379	1979	25	1980	1900	2010	110	-1	0	1
-GarageCars	62	1460	2	1	2	0	4	4	0	0	0
GarageArea	63	1460	473	214	480	0	1418	1418	0	1	6
WoodDeckSF	67	1460	94	125	0	0	857	857	2	3	3
OpenPorchSF	68	1460	47	66	25	0	547	547	2	8	2
EnclosedPorch	69	1460	22	61	0	0	552	552	3	10	2
X3SsnPorch	70	1460	3	29	0	0	508	508	10	123	1
ScreenPorch	71	1460	15	56	0	0	480	480	4	18	1
PoolArea	72	1460	3	40	0	0	738	738	15	222	1
MiscVal	76	1460	43	496	0	0	15500	15500	24	698	13
MoSold	77	1460	6	3	6	1	12	11	0	0	0
YrSold	78	1460	2008	1	2008	2006	2010	4	0	-1	0
SalePrice	81	1460	180921	79443	163000	4900	755000	720100	2	6	2079

Barplots & Catagorical Data

Histograms for Ordinal Data

Scatterplot

Provide a 95% CI for the difference in the mean of the variables.

Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

Provide a 95% CI for the difference in the mean of the variables.

If 100 sample distibutions are taken from the population, 95 of them are likely to have prices within the confidence interval -183465.0 -175346.4

t1 <- t.test( X, Y, paired = TRUE, conf.level = 0.95)  
t1

## 
##  Paired t-test
## 
## data:  X and Y
## t = -86.695, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183465.0 -175346.4
## sample estimates:
## mean of the differences 
##               -179405.7

Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

If 100 sample distibutions are taken from the population, 99 of them are likely to have correlations within the confidence interval 0.6733974 0.7406408. The p-value is negligible, so reject the null hypothesis that there is no correlation

#shows strong positive correlation
c <- cor(AmesHomes[,c("GrLivArea","SalePrice")])
c

##           GrLivArea SalePrice
## GrLivArea 1.0000000 0.7086245
## SalePrice 0.7086245 1.0000000

cor.test(AmesHomes$GrLivArea, AmesHomes$SalePrice,conf.level=.99)

## 
##  Pearson's product-moment correlation
## 
## data:  AmesHomes$GrLivArea and AmesHomes$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.6733974 0.7406408
## sample estimates:
##       cor 
## 0.7086245

Linear Algebra and Correlation

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

#invert the correlation for precision
p <- solve(c) 
p

##           GrLivArea SalePrice
## GrLivArea  2.008632 -1.423366
## SalePrice -1.423366  2.008632

#precision * correlation
p %*% c

##           GrLivArea SalePrice
## GrLivArea         1         0
## SalePrice         0         1

#correlation * precision
c %*% p

##           GrLivArea SalePrice
## GrLivArea         1         0
## SalePrice         0         1

Conduct principle components analysis (research this!) and interpret. Discuss.

Comp 1 explains most of the variability in the data by a factor of 0.8543122 : 0.1456878

Emperical vs. Theoretical Distributions

par(oma=c(3,3,0,0),mar=c(3,3,2,2),mfrow=c(2,2))
# Pricipal Components Analysis
# entering raw data and extracting PCs 
# from the correlation matrix 
mydata <- data.frame(X,Y)
fit <- princomp(mydata, cor=TRUE)
summary(fit) # print variance accounted for

## Importance of components:
##                           Comp.1    Comp.2
## Standard deviation     1.3071436 0.5397921
## Proportion of Variance 0.8543122 0.1456878
## Cumulative Proportion  0.8543122 1.0000000

loadings(fit) # pc loadings

## 
## Loadings:
##   Comp.1 Comp.2
## X  0.707 -0.707
## Y  0.707  0.707
## 
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0

plot(fit,type="lines") # scree plot 
#fit$scores # the principal components
biplot(fit)

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data.

For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load
the MASS package and run fitdistr to fit an exponential probability density function.
(See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential
distribution using this value (e.g., rexp(1000, )). Plot a histogram and compare it with a histogram of your original variable.

# fit exponential function to get optimal value of lambda for X
fit <- fitdistr(X,"exponential")
lambda <- fit$estimate

# get 1000 element sampling distribution
sample <- rexp(1000,lambda)

# plot historgram of exponential function
par(oma=c(3,3,0,0),mar=c(3,3,2,2),mfrow=c(2,1))
hist(sample,prob=TRUE,breaks=25)
curve(dexp(x,lambda),add=T)

# plot histogram of original x
hist(AmesHomes$GrLivArea, breaks=25)

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

The exponential distribution is not a good fit for the right-skewed curve as evidenced by the big difference in upperbound–95 percentile–data points comparing actual and fit data. It appears that despite the skew, a normal distribution would still be a better model to fit the data which could then be adjusted by a transformation function, i.e. Box-Cox . . .

# lambda is an attribute of the exponential fit pdf
# use qexp function for cdf
# get the 5th & 95th pctile using cdf
qexp(.05, rate = lambda)

## [1] 77.73313

qexp(.95, rate = lambda)

## [1] 4539.924

# generate a 95% confidence interval from emperical data 
# X is AmesHomes$GrLivArea and fit contains its descriptive statistics
# use this to calculate error and then Confidence Interval CI
err <- qnorm(.95) * (fit$sd / sqrt(fit$n))

CI <- data.frame(1 - err, 1 + err)
colnames(CI) <- c("se-","se+")
CI

##            se-      se+
## rate 0.9999993 1.000001

# Quantile gets emperical data from distribution using percentiles
quantile(X, c(.05,.95))

##     5%    95% 
##  848.0 2466.1

Model for AmesHousing

http://rpubs.com/smkarr/236666

DATA 605 Final Project

Scott Karr

Dec 23, 2016