Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of ????????????(N+1)/2.
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
# creating variables X and Y with random values, let's choose base value 1 to 9 only otherwise it will be a big messy data
X <- runif(10000, min = 1, max = 9)
Y <- rnorm(10000, 5, 5)
x <- median(X)
y <- summary(Y)[2]
# now let's calculate the most probabilities you can get from these X, x
Xgtx <- sum(X > x)/10000
Xgty <- sum(X > y)/10000
Xltx <- sum(X < x)/10000
Xlty <- sum(X < y)/10000
# for Y, y
Ygtx <- sum(Y > x)/10000
Ygty <- sum(Y > y)/10000
Yltx <- sum(Y < x)/10000
Ylty <- sum(Y < y)/10000
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. 5 points
# Here, the question is asking about the Probabilty of X that is greater than x and y
prob_a <- (sum(X > max(x, y))/10000) / Xgty
prob_a
## [1] 0.5414771
# Here, the question is asking about the Probabilty of X greater than x multiplied Y is greater than y
prob_b <- Xgtx * Ygty
prob_b
## [1] 0.375
# Here, the question is asking about the Probabilty of X less than x and X is greater than y
prob_c <- (Xgty - Xgtx)/Xgty
prob_c
## [1] 0.4585229
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
m <- c(0.38, 0.38, 0.75)
n <- c(0.12, 0.12, 0.25)
o <-c(0.5, 0.5, 1)
table <- (rbind(m, n, o))
names(table) <- c("P(X>x)", "P(X<x)", "Marginal Probabilities")
row.names(table) <- c("P(Y>y)", "P(Y<y)", "Marginal Probabilities")
table
## [,1] [,2] [,3]
## P(Y>y) 0.38 0.38 0.75
## P(Y<y) 0.12 0.12 0.25
## Marginal Probabilities 0.50 0.50 1.00
## attr(,"names")
## [1] "P(X>x)" "P(X<x)"
## [3] "Marginal Probabilities" NA
## [5] NA NA
## [7] NA NA
## [9] NA
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Answer:
p <- rbind(m, n)
fisher.test(p)
## Warning in fisher.test(p): 'x' has been rounded to integer: Mean relative
## difference: 0.75
##
## Fisher's Exact Test for Count Data
##
## data: p
## p-value = 1
## alternative hypothesis: two.sided
chisq.test(p)
## Warning in chisq.test(p): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: p
## X-squared = 0.00027031, df = 2, p-value = 0.9999
In both test p-value is approximately 1. We can say that the variables are independent.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of ??? for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, ???)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
# import data from github which downloaded from kaggle
train <- read.csv("https://raw.githubusercontent.com/maharjansudhan/DATA605/master/train.csv", sep =",")
head(train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
# for any housing pricing the location matters a lot so lets see if we can find somthing here.
summary(train$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
#let's see if there is the relationship between the quality of the house and the price of the house
plot(train$SalePrice~train$OverallQual)
#let's see if there is the relationship between the LotFrontage and the price of the house
plot(train$SalePrice~train$LotFrontage)
# to see if there is the correlation between the locaiton and yearbuilt with 80% confidence interval
corr <- cor.test(train$YearBuilt, train$LotArea, method = "pearson", conf.level = 0.8)
corr
##
## Pearson's product-moment correlation
##
## data: train$YearBuilt and train$LotArea
## t = 0.54332, df = 1458, p-value = 0.587
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.01934322 0.04776648
## sample estimates:
## cor
## 0.01422765
# to see if there is the correlation between the saleprice and yearbuilt with 80% confidence interval
corr <- cor.test(train$SalePrice, train$YearBuilt, method = "pearson", conf.level = 0.8)
corr
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$YearBuilt
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4980766 0.5468619
## sample estimates:
## cor
## 0.5228973
# to see if there is the correlation between the saleprice and lotarea with 80% confidence interval
corr <- cor.test(train$SalePrice, train$LotArea, method = "pearson", conf.level = 0.8)
corr
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
#to see the saleprice histogram
hist(train$SalePrice)
library(MASS)
fit <- fitdistr(train$SalePrice, densfun = "exponential")
lambda <- fit$estimate
redist <- rexp(500, lambda)
summary(redist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 166.6 52233.5 123229.4 180946.0 256535.1 1245643.3
#to see the redist histogram
hist(redist)
# to see the 5th precentile value
p05 <- round(log(1-0.05)/-lambda,2)
p05
## rate
## 9280.04
# to see the 95th percentile value
p95 <- round(log(1-0.95)/-lambda,2)
p95
## rate
## 541991.5
# to calculate the multiple regression, and create a model lets choose and compare correct variables
x <- lm(train$SalePrice ~ train$LotArea + train$YearBuilt)
summary(x)
##
## Call:
## lm(formula = train$SalePrice ~ train$LotArea + train$YearBuilt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -258044 -37455 -14472 21135 520542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.533e+06 1.104e+05 -22.93 <2e-16 ***
## train$LotArea 2.041e+00 1.695e-01 12.04 <2e-16 ***
## train$YearBuilt 1.366e+03 5.602e+01 24.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64620 on 1457 degrees of freedom
## Multiple R-squared: 0.3392, Adjusted R-squared: 0.3383
## F-statistic: 373.9 on 2 and 1457 DF, p-value: < 2.2e-16
It seems like 33.9% of the variation in the SalePrice can be predicted in the model.
# to see if location, bedroom, features and qualities of the house makes any difference on the price of the house we will create another model
y <- lm(train$SalePrice ~ train$TotRmsAbvGrd + train$BedroomAbvGr + train$FullBath + train$HalfBath + train$GarageCars + train$LotArea + train$OpenPorchSF + train$WoodDeckSF)
summary(y)
##
## Call:
## lm(formula = train$SalePrice ~ train$TotRmsAbvGrd + train$BedroomAbvGr +
## train$FullBath + train$HalfBath + train$GarageCars + train$LotArea +
## train$OpenPorchSF + train$WoodDeckSF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -242103 -28149 -2007 23342 418013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.066e+03 5.892e+03 -1.539 0.124
## train$TotRmsAbvGrd 1.703e+04 1.304e+03 13.063 < 2e-16 ***
## train$BedroomAbvGr -2.131e+04 2.235e+03 -9.531 < 2e-16 ***
## train$FullBath 3.015e+04 3.083e+03 9.778 < 2e-16 ***
## train$HalfBath 1.176e+04 2.820e+03 4.171 3.21e-05 ***
## train$GarageCars 3.751e+04 2.093e+03 17.924 < 2e-16 ***
## train$LotArea 8.947e-01 1.346e-01 6.645 4.29e-11 ***
## train$OpenPorchSF 1.119e+02 2.075e+01 5.392 8.11e-08 ***
## train$WoodDeckSF 7.916e+01 1.082e+01 7.317 4.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49550 on 1451 degrees of freedom
## Multiple R-squared: 0.6132, Adjusted R-squared: 0.611
## F-statistic: 287.5 on 8 and 1451 DF, p-value: < 2.2e-16
It seems 61.32% of the variation in the SalePrice can be predicted in the model. This model is more stronger than the previous one. Because of these variables the price of the house is predicted.
qqnorm(resid(y))
qqline(resid(y))
Even though there are outliers on this Normal Q-Q plot, this is a fit model.
My Kaggle account is : maharjansudhan