Problem 1. Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of (N+1)/2.
#create variables: choose N = 9, so the uniform runs from 1 to 9 and the normal mean is (N+1)/2 = 5
X <- runif(10000, min = 1, max = 9)
Y <- rnorm(10000, mean = 5, sd = 5)
#x is the median of X; y is the 1st quartile of Y
x <- median(X)
y <- summary(Y)[2]
#calculate probabilities
Xgtx <- sum(X > x)/10000
Xgty <- sum(X > y)/10000
Ygtx <- sum(Y > x)/10000
Ygty <- sum(Y > y)/10000
Xltx <- sum(X < x)/10000
Xlty <- sum(X < y)/10000
Yltx <- sum(Y < x)/10000
Ylty <- sum(Y < y)/10000
Probability. Calculate as a minimum the below probabilities a through c.
Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. 5 points
a. P(X>x | X>y)
#first, find P(X>x and X>y), which is the probability that X is greater than the larger of x and y
(sum(X > max(x, y))/10000)/Xgty
## [1] 0.5457919
b. P(X>x, Y>y)
#X and Y were generated independently, so the joint probability is approximately the product of the marginals
Xgtx*Ygty
## [1] 0.375
c. P(X<x | X>y)
#find P(X<x and X>y), then divide by P(X>y)
Xltx_gty <- Xgty-Xgtx
(Xltx_gty)/Xgty
## [1] 0.4542081
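These conditional probabilities can also be computed directly with logical indexing; a quick check, assuming the X, Y, x, and y defined above:
mean(X > x & X > y) / mean(X > y)  #P(X>x | X>y)
mean(X < x & X > y) / mean(X > y)  #P(X<x | X>y)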
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
a <- c(.38, .38, .75)
b <- c(.12, .12, .25)
c <-c(.5, .5, 1)
tbl <- data.frame(rbind(a, b, c))
names(tbl) <- c("P(X>x)", "P(X<x)", "Marginal Probabilities")
row.names(tbl) <- c("P(Y>y)", "P(Y<y)", "Marginal Probabilities")
tbl
## P(X>x) P(X<x) Marginal Probabilities
## P(Y>y) 0.38 0.38 0.75
## P(Y<y) 0.12 0.12 0.25
## Marginal Probabilities 0.50 0.50 1.00
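The table entries above are hand-rounded; the joint and marginal probabilities can also be computed directly from the simulated values. A sketch, assuming the X, Y, x, and y defined above:
#joint distribution of the four quadrants, with marginal probabilities added
joint <- prop.table(table(X > x, Y > y))
addmargins(joint)
#compare the observed joint probability with the product of the marginals
mean(X > x & Y > y)
mean(X > x) * mean(Y > y)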
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
m <- rbind(a, b)
fisher.test(m)
## Warning in fisher.test(m): 'x' has been rounded to integer: Mean relative
## difference: 0.75
##
## Fisher's Exact Test for Count Data
##
## data: m
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(m)
## Warning in chisq.test(m): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: m
## X-squared = 0.00027031, df = 2, p-value = 0.9999
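Both tests above were run on rounded proportions (including the marginal column), which is what triggers the warnings; both tests expect a table of counts. A sketch using the raw counts, assuming the X, Y, x, and y defined above:
#2x2 contingency table of counts for the four quadrants
counts <- table(X > x, Y > y)
fisher.test(counts)
chisq.test(counts)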
Given the large p-values of both the Chi-Square test and Fisher's Exact Test, we fail to reject the null hypothesis, so it is safe to treat the variables as independent. The difference between the two is that Fisher's test computes an exact p-value for the table, while the Chi-Square test relies on a large-sample approximation (hence the warning above). Fisher's test is more appropriate here, since the contingency table's margins are fixed and the test does not depend on an approximation.
Problem 2 You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set.
train <- read.csv("train.csv")
head(train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
summary(train$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
hist(train$MoSold)
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
plot(train$SalePrice~train$LotArea)
plot(train$SalePrice~train$OverallQual)
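The two plots above are individual scatterplots; an actual scatterplot matrix can be produced with pairs(). A sketch using the same variables:
#scatterplot matrix of the dependent variable and two predictors
pairs(train[, c("SalePrice", "LotArea", "OverallQual")])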
Derive a correlation matrix for any three quantitative variables in the dataset.
#correlation matrix for three quantitative variables: SalePrice, LotArea, and YearBuilt
m <- cbind(train$SalePrice, train$LotArea, train$YearBuilt)
m <- cor(m)
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
cor.test(train$SalePrice, train$LotArea, method = "pearson", conf.level = .8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
cor.test(train$SalePrice, train$YearBuilt, method = "pearson", conf.level = .8)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$YearBuilt
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4980766 0.5468619
## sample estimates:
## cor
## 0.5228973
cor.test(train$YearBuilt, train$LotArea, method = "pearson", conf.level = .8)
##
## Pearson's product-moment correlation
##
## data: train$YearBuilt and train$LotArea
## t = 0.54332, df = 1458, p-value = 0.587
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.01934322 0.04776648
## sample estimates:
## cor
## 0.01422765
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Since the p-values for the tests between sale price and lot area and between sale price and year built are well below 0.05, it is safe to say sale price is correlated with both lot size and year built. The p-value for the correlation test between lot area and year built is 0.587, so we fail to reject the null hypothesis that their correlation is zero. I would be worried about familywise error: three hypothesis tests were run, which inflates the chance of at least one false rejection, and I rejected the null hypothesis in the first two tests.
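One common way to control familywise error is to adjust the p-values, for example with a Bonferroni correction. A sketch, plugging in the p-values reported above and using 2.2e-16 as a stand-in for the two that R prints as < 2.2e-16:
#Bonferroni-adjusted p-values for the three pairwise correlation tests
pvals <- c(2.2e-16, 2.2e-16, 0.587)
p.adjust(pvals, method = "bonferroni")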
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
#Gauss-Jordan elimination: augment the correlation matrix with the 3x3 identity
I <- diag(3)
m1 <- cbind(m, I)
#clear the first column below the pivot (m1[1,1] is already 1)
m1[2,] <- (-m1[2,1])*m1[1,]+m1[2,]
m1[3,] <- (-m1[3,1])*m1[1,]+m1[3,]
#scale the second pivot to 1 and clear the rest of column 2
m1[2,] <- (1/m1[2,2])*m1[2,]
m1[1,] <- (-m1[1,2])*m1[2,] + m1[1,]
m1[3,] <- (-m1[3,2])*m1[2,] + m1[3,]
#scale the third pivot to 1 and clear the rest of column 3
m1[3,] <- (1/m1[3,3])*m1[3,]
m1[2,] <- (-m1[2,3])*m1[3,] + m1[2,]
m1[1,] <- (-m1[1,3])*m1[3,] + m1[1,]
#the right-hand block is now the precision (inverse) matrix
inv <- m1[,4:6]
m*inv
## [,1] [,2] [,3]
## [1,] 1.5132664 -0.102393843 -0.410872545
## [2,] -0.1023938 1.099729254 0.002664589
## [3,] -0.4108725 0.002664589 1.408207956
inv*m
## [,1] [,2] [,3]
## [1,] 1.5132664 -0.102393843 -0.410872545
## [2,] -0.1023938 1.099729254 0.002664589
## [3,] -0.4108725 0.002664589 1.408207956
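Note that * performs element-wise multiplication in R, which is why the products above are not the identity matrix. The matrix products the prompt asks for use %*%, and both should come out as (approximately) the 3x3 identity. A sketch, which also cross-checks the hand-computed inverse against R's built-in solve():
round(m %*% inv, 10)  #correlation matrix times precision matrix
round(inv %*% m, 10)  #precision matrix times correlation matrix
all.equal(inv, solve(m), check.attributes = FALSE)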
#LU decomposition: record the row multipliers in L while reducing U
L <- diag(3)
U <- m
#eliminate column 1 (U[1,1] = 1, so the multipliers are the entries themselves)
L[2,1] <- U[2,1]
U[2,] <- (-U[2,1])*U[1,]+U[2,]
L[3,1] <- U[3,1]
U[3,] <- (-U[3,1])*U[1,]+U[3,]
#eliminate column 2
L[3,2] <- (U[3,2]/U[2,2])
U[3,] <- (-(U[3,2]/U[2,2]))*U[2,]+U[3,]
L*U
## [,1] [,2] [,3]
## [1,] 1 0.0000000 0.0000000
## [2,] 0 0.9303867 0.0000000
## [3,] 0 0.0000000 0.7101224
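As above, * is element-wise; verifying the LU decomposition requires the matrix product, which should reproduce the original correlation matrix. A sketch:
#the product of the LU factors should recover the correlation matrix
all.equal(L %*% U, m, check.attributes = FALSE)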
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
library(MASS)
fit <- fitdistr(train$SalePrice, densfun = "exponential")
lambda <- fit$estimate
redist <- rexp(1000, lambda)
summary(redist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 86.2 52696.5 124848.7 183935.2 261845.0 1378081.3
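The prompt asks for the optimal value of \(\lambda\). For an exponential distribution the maximum-likelihood estimate is simply 1 divided by the sample mean, which serves as a sanity check on fitdistr's estimate; a sketch:
lambda                    #rate estimated by fitdistr
1/mean(train$SalePrice)   #analytic MLE; both should be roughly 1/180921, about 5.5e-06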
hist(train$SalePrice)
hist(redist)
#exponential quantiles via the inverse CDF: q = -log(1 - p)/lambda
p05 <- round(log(1-0.05)/-lambda,2)
p95 <- round(log(1-0.95)/-lambda,2)
5th percentile: 9280; 95th percentile: 541991.
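Equivalently, the exponential quantile function returns the same percentiles; a quick check:
qexp(c(0.05, 0.95), rate = lambda)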
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
CI(train$SalePrice, ci=0.95)
## upper mean lower
## 184999.6 180921.2 176842.8
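The prompt also asks for the empirical 5th and 95th percentiles of the data, which can be read off with quantile(); a sketch:
#empirical 5th and 95th percentiles of the observed sale prices
quantile(train$SalePrice, probs = c(0.05, 0.95))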
While the data is heavily skewed right, I believe an exponential model is not the correct model. The 5th percentile of the exponential fit is 9280, while the minimum observed value is 34900, so the exponential model puts far too much probability on low sale prices. In addition, the 95% confidence interval computed above is an interval for the mean sale price (assuming normality), not a range covering 95% of the observations, and it is very narrow relative to the spread of the data. A better approach would be a linear regression model, possibly after removing the most extreme outliers to reduce the effect of the skew. It might also be useful to analyze sale price by neighborhood or by home type.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
n <- lm(train$SalePrice ~ train$YearBuilt + train$LotArea)
summary(n)
##
## Call:
## lm(formula = train$SalePrice ~ train$YearBuilt + train$LotArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -258044 -37455 -14472 21135 520542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.533e+06 1.104e+05 -22.93 <2e-16 ***
## train$YearBuilt 1.366e+03 5.602e+01 24.38 <2e-16 ***
## train$LotArea 2.041e+00 1.695e-01 12.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64620 on 1457 degrees of freedom
## Multiple R-squared: 0.3392, Adjusted R-squared: 0.3383
## F-statistic: 373.9 on 2 and 1457 DF, p-value: < 2.2e-16
Since all of the p-values are well below .05, the coefficients of the model are statistically significant. However, the \(R^2\) value is low, .339, which tells us this model explains only 33.9% of the variation in sale price.
To improve the model, I will try a different set of predictors.
n <- lm(train$SalePrice ~ train$BedroomAbvGr + train$LotArea +train$WoodDeckSF + train$OpenPorchSF + train$TotRmsAbvGrd)
summary(n)
##
## Call:
## lm(formula = train$SalePrice ~ train$BedroomAbvGr + train$LotArea +
## train$WoodDeckSF + train$OpenPorchSF + train$TotRmsAbvGrd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -310247 -33687 -3075 26523 467012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938e+04 6.657e+03 4.414 1.09e-05 ***
## train$BedroomAbvGr -3.011e+04 2.604e+03 -11.562 < 2e-16 ***
## train$LotArea 1.031e+00 1.602e-01 6.435 1.67e-10 ***
## train$WoodDeckSF 1.265e+02 1.274e+01 9.928 < 2e-16 ***
## train$OpenPorchSF 2.051e+02 2.421e+01 8.473 < 2e-16 ***
## train$TotRmsAbvGrd 3.153e+04 1.362e+03 23.144 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59270 on 1454 degrees of freedom
## Multiple R-squared: 0.4453, Adjusted R-squared: 0.4433
## F-statistic: 233.4 on 5 and 1454 DF, p-value: < 2.2e-16
These variables all have low p-values, meaning they are statistically significant, but the model is still weak (\(R^2=.445\)).
plot(fitted(n), resid(n))
The residual plot suggests the model fits reasonably well for smaller fitted values, but the residuals spread out and become less well-behaved as the fitted values increase.
qqnorm(resid(n))
qqline(resid(n))
As I suspected, the residuals are close to normal in the middle of the distribution, but there are a few extreme residuals in the tails. It would be better to rerun the analysis after removing those outliers.
#drop the last training row so train and test have the same number of rows
train <- train[-1460,]
test <- read.csv("test.csv")
#make sure the predictor columns in the test set are numeric
test[,c('BedroomAbvGr', 'LotArea','WoodDeckSF','OpenPorchSF', 'TotRmsAbvGrd')] <- lapply(test[,c('BedroomAbvGr', 'LotArea','WoodDeckSF','OpenPorchSF', 'TotRmsAbvGrd')], as.numeric)
#regress the training sale prices on the test-set predictors (see the note after the score below)
t <- lm(train$SalePrice ~ test$BedroomAbvGr + test$LotArea + test$WoodDeckSF + test$OpenPorchSF + test$TotRmsAbvGrd)
preds_2<- predict(t, test)
kaggle<- data.frame(test$Id, preds_2)
colnames(kaggle)<- c('Id', 'SalePrice')
write.csv(kaggle, 'submission.csv', row.names = FALSE)
Kaggle name: msiracusa. Kaggle score: 0.42322. I'm not sure how to interpret the score, but that is what I got!
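For reference, the more standard workflow is to fit the model on the training data with a formula and a data argument, then score the test set with predict(newdata = ...). A sketch under those assumptions (it may produce a different leaderboard score than the one reported above):
#fit on the training data using column names rather than full-length vectors
fit <- lm(SalePrice ~ BedroomAbvGr + LotArea + WoodDeckSF + OpenPorchSF + TotRmsAbvGrd, data = train)
#predict on the test set; any missing predictor values would need to be imputed first
preds <- predict(fit, newdata = test)
submission <- data.frame(Id = test$Id, SalePrice = preds)
write.csv(submission, "submission.csv", row.names = FALSE)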