df <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA605/W15/FinalExam/train.csv")
Below is the dataset of house prices available from Kaggle.com. The dataset has 1,460 observations of houses in Ames, Iowa, and 79 variables potentially contributing to the house sale price.
The full dataset and dictionary are available at: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
#kable(head(df))
datatable(df, options = list( pageLength = 5, lengthMenu = c(5, 10, 40), initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#68080e', 'color': '#fff'});",
"}")), rownames=TRUE)
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
#test variable
X1<-df$YearBuilt
Y1<-df$SalePrice
#
# plot(X1,Y1)
# hist(X1, col="pink", main="Histogram of YearBuilt")
# hist(Y1, col="blue", main="Histogram of SalePrice")
#chosen variable
X<-df$YearBuilt
Y<-df$SalePrice
plot(X,Y, col="#1109f9", main="Scatterplot of Year Built and Sale Price", xlab = "Year Built", ylab="Sale Price")
abline(lm(Y~X), col="yellow", lwd=3) # regression line (y~x)
hist(X, col="#9df909", main="Histogram of Year Built", xlab = "Year Built")
hist(Y, col="#80cbc4", main="Histogram of Sale Price", xlab = "Sale Price")
print("Summary of X variable: Year Built")
## [1] "Summary of X variable: Year Built"
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1973 1971 2000 2010
print("Summary of Y variable: Sale Price")
## [1] "Summary of Y variable: Sale Price"
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
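As a quick check on the skewness requirement in the prompt, the sample skewness can be computed directly; a minimal sketch, assuming the e1071 package is installed (moments::skewness behaves the same way):
#sample skewness: > 0 indicates right skew, < 0 left skew
library(e1071)
skewness(X) #YearBuilt
skewness(Y) #SalePrice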
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
a. \(P(X > x \mid Y > y)\): given a sale price above its first quartile, the probability that the house was built after the first-quartile year.
XQ1<-quantile(X, probs=0.25) #1st quartile of X variable
YQ1<-quantile(Y, probs=0.25) #1st quartile of Y variable
n<-(nrow(df))
yearbuilt<-as.numeric(df$YearBuilt)
saleprice<-as.numeric(df$SalePrice)
nYQ1<-nrow(subset(df,saleprice>YQ1))
p1<-nrow(subset(df, yearbuilt > XQ1 & saleprice>YQ1))/nYQ1
p1
## [1] 0.8465753
Interpretation: among houses that sold above the first-quartile price ($129,975), about 84.7% were built after the first-quartile year (1954).
b. \(P(X > x, Y > y)\): the probability that a house was both built after the first-quartile year and sold above the first-quartile price.
p2<-nrow(subset(df, yearbuilt > XQ1 & saleprice>YQ1))/n
p2
## [1] 0.6349315
Interpretation: about 63.5% of all houses in the training set were both built after 1954 and sold for more than $129,975.
c. \(P(X \le x \mid Y > y)\): given a sale price above its first quartile, the probability that the house was built in or before the first-quartile year.
p3<-nrow(subset(df, yearbuilt <=XQ1 & saleprice>YQ1))/nYQ1
p3
## [1] 0.1534247
Interpretation: among houses that sold above the first-quartile price, about 15.3% were built in or before 1954. (Note that p1 + p3 = 1, since they partition the same conditional event.)
c1<-nrow(subset(df, yearbuilt <=XQ1 & saleprice<=YQ1))/n
c2<-nrow(subset(df, yearbuilt <=XQ1 & saleprice>YQ1))/n
c3<-c1+c2
c4<-nrow(subset(df, yearbuilt >XQ1 & saleprice<=YQ1))/n
c5<-nrow(subset(df, yearbuilt >XQ1 & saleprice>YQ1))/n
c6<-c4+c5
c7<-c1+c4
c8<-c2+c5
c9<-c3+c6
dfcounts<-matrix(round(c(c1,c2,c3,c4,c5,c6,c7,c8,c9),3), ncol=3, nrow=3, byrow=TRUE)
colnames(dfcounts)<-c(
"SalePrice <= 1st quartile",
"SalePrice > 1st quartile",
"Total")
rownames(dfcounts)<-c("YearBuilt <= 1st quartile","YearBuilt > 1st quartile","Total")
print("Quartile Matrix by Percentage")
## [1] "Quartile Matrix by Percentage"
dfcounts<-as.table(dfcounts)
dfcounts
## <=2d quartile >2d quartile Total
## <=3rd quartile 0.149 0.115 0.264
## >3rd quartile 0.101 0.635 0.736
## Total 0.250 0.750 1.000
print("Quartile Matrix by Count")
## [1] "Quartile Matrix by Count"
dfvals<-round(dfcounts*1460,0)
dfvals
## <=2d quartile >2d quartile Total
## <=3rd quartile 218 168 385
## >3rd quartile 147 927 1075
## Total 365 1095 1460
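The same table can be built more compactly with base R's table() and addmargins(); a sketch of that equivalent construction:
#contingency table of the quartile splits, with row/column totals
qtab <- table(YearBuilt = yearbuilt > XQ1, SalePrice = saleprice > YQ1)
addmargins(qtab) #counts with margins
round(addmargins(qtab)/n, 3) #proportions with margins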
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
papb<-c6*c8 #P(A)=P(X>x) and P(B)=P(Y>y) are the marginal totals
print (paste0("p(A)*p(B)=", round(papb,5)))
## [1] "p(A)*p(B)=0.55171"
\[P(A \mid B) = P(X > x \mid Y > y) = 0.846\]
\[P(A)\,P(B) = 0.736 \times 0.750 = 0.552\]
\[P(A \mid B) \neq P(A)\,P(B)\]
mat<-data.frame(df$YearBuilt, df$SalePrice)
chisq.test(mat, correct=TRUE)
##
## Pearson's Chi-squared test
##
## data: mat
## X-squared = 471530, df = 1459, p-value < 2.2e-16
We receive a p-value of < 2.2e-16, so we reject the null hypothesis of independence: YearBuilt and SalePrice are associated, not independent. This agrees with the check above, since \(P(A \mid B) = 0.846 \neq 0.736 = P(A)\) (equivalently, \(P(A \cap B) = 0.635 \neq 0.552 = P(A)P(B)\)).
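Note that chisq.test() above treats the raw two-column data frame as a 1460 x 2 contingency table. A test run on the 2 x 2 table of quartile splits addresses the split more directly; a sketch (output not shown):
#chi-square test of association on the 2 x 2 quartile-split table
qtab <- table(yearbuilt > XQ1, saleprice > YQ1)
chisq.test(qtab, correct = TRUE)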
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
isnum <- sapply(df, is.numeric)
dfnum<-df[ , isnum]
summary(dfnum)
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.0 Min. : 21.00 Min. : 1300
## 1st Qu.: 365.8 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554
## Median : 730.5 Median : 50.0 Median : 69.00 Median : 9478
## Mean : 730.5 Mean : 56.9 Mean : 70.05 Mean : 10517
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :1460.0 Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 0.0 Median : 383.5 Median : 0.00 Median : 477.5
## Mean : 103.7 Mean : 443.6 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 166.0 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :1600.0 Max. :5644.0 Max. :1474.00 Max. :2336.0
## NA's :8
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Min. : 0.0 Min. : 334 Min. : 0 Min. : 0.000
## 1st Qu.: 795.8 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## Median : 991.5 Median :1087 Median : 0 Median : 0.000
## Mean :1057.4 Mean :1163 Mean : 347 Mean : 5.845
## 3rd Qu.:1298.2 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## Max. :6110.0 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.000 Min. :1900 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5
## Median :1.000 Median :1980 Median :2.000 Median : 480.0
## Mean :0.613 Mean :1979 Mean :1.767 Mean : 473.0
## 3rd Qu.:1.000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :3.000 Max. :2010 Max. :4.000 Max. :1418.0
## NA's :81
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 15.06 Mean : 2.759 Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :480.00 Max. :738.000 Max. :15500.00 Max. :12.000
##
## YrSold SalePrice
## Min. :2006 Min. : 34900
## 1st Qu.:2007 1st Qu.:129975
## Median :2008 Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
Provide a scatterplot of X and Y. Create a density plot for the SalePrice variable.
#Collect summary statistics
SalePrice.mean <-mean(df$SalePrice)
SalePrice.median <-median(df$SalePrice)
SalePrice.mode <- as.numeric(names(sort(-table(df$SalePrice))))[1]
SalePrice.sd <- sd(df$SalePrice)
d_SalePrice <- density(df$SalePrice)
plot(d_SalePrice, main="SalePrice Probabilities", ylab="Probability", xlab="SalePrice")
polygon(d_SalePrice, col="red")
abline(v = SalePrice.median, col = "green", lwd = 2)
abline(v = SalePrice.mean, col = "blue", lwd = 2)
abline(v = SalePrice.mode, col = "purple", lwd = 2)
legend("topright", legend=c("median", "mean","mode"),col=c("green", "blue", "purple"), lty=1, cex=0.8)
#Collect summary statistics
YearBuilt.mean <-mean(df$YearBuilt)
YearBuilt.median <-median(df$YearBuilt)
YearBuilt.mode <- as.numeric(names(sort(-table(df$YearBuilt))))[1]
YearBuilt.sd <- sd(df$YearBuilt)
d_YearBuilt <- density(df$YearBuilt)
plot(d_YearBuilt, main="YearBuilt Probabilities", ylab="Probability", xlab="YearBuilt")
polygon(d_YearBuilt, col="#68080e")
abline(v = YearBuilt.median, col = "green", lwd = 2)
abline(v = YearBuilt.mean, col = "blue", lwd = 2)
abline(v = YearBuilt.mode, col = "purple", lwd = 2)
legend("topright", legend=c("median", "mean","mode"),col=c("green", "blue", "purple"), lty=1, cex=0.8)
Here X = YearBuilt and Y = SalePrice, as defined above.
**Would you be worried about familywise error? Why or why not?**
The analysis above compared three pairings from the data: 1. Bedrooms Above Ground vs. Sale Price, 2. Overall Quality and Condition vs. Sale Price, and 3. Ground Living Area vs. Sale Price. Regardless of the year built, the goal was to find correlation among these factors; some are moderately correlated with sale price and some are not. With n = 1,460 observations, the significant correlations carry extremely small p-values, so even a Bonferroni correction across the three tests would leave the conclusions unchanged; for that reason I am not greatly concerned about the familywise error rate here.
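The pairwise correlation tests with 92% confidence intervals that the prompt asks for can be run with cor.test(); a sketch using the three variables that appear in the correlation matrix below:
#test H0: rho = 0 for each pair, with a 92% confidence interval
cor.test(df$YearBuilt, df$OverallQual, conf.level = 0.92)
cor.test(df$YearBuilt, df$SalePrice, conf.level = 0.92)
cor.test(df$OverallQual, df$SalePrice, conf.level = 0.92)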
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
#warm-up: the same operations on a 2 x 2 correlation matrix (GrLivArea vs SalePrice)
myvars<-data.frame(df$GrLivArea,df$SalePrice)
mymx<-as.matrix(cor(myvars))
mymx
## df.GrLivArea df.SalePrice
## df.GrLivArea 1.0000000 0.7086245
## df.SalePrice 0.7086245 1.0000000
#inverse of my correlation matrix, i.e. the precision matrix (ginv() comes from MASS)
library(MASS)
ginvmymx<-ginv(mymx)
ginvmymx
## [,1] [,2]
## [1,] 2.008632 -1.423366
## [2,] -1.423366 2.008632
#corr mat x precision mat
mymxginv<-mymx%*%ginvmymx
round(mymxginv,2)
## [,1] [,2]
## df.GrLivArea 1 0
## df.SalePrice 0 1
#precision mat x corr mat (stored separately so the precision matrix is not overwritten)
precxcorr<-ginvmymx%*%mymx
round(precxcorr,2)
## df.GrLivArea df.SalePrice
## [1,] 1 0
## [2,] 0 1
#3 x 3 correlation matrix: YearBuilt, OverallQual, SalePrice
myvars<-data.frame(df$YearBuilt,df$OverallQual,df$SalePrice)
mymx<-as.matrix(cor(myvars))
mymx
## df.YearBuilt df.OverallQual df.SalePrice
## df.YearBuilt 1.0000000 0.5723228 0.5228973
## df.OverallQual 0.5723228 1.0000000 0.7909816
## df.SalePrice 0.5228973 0.7909816 1.0000000
#inverse of my correlation matrix, precision matrix
ginvmymx<-ginv(mymx)
ginvmymx
## [,1] [,2] [,3]
## [1,] 1.5168013 -0.6431116 -0.2844419
## [2,] -0.6431116 2.9439846 -1.9923563
## [3,] -0.2844419 -1.9923563 2.7246511
#Multiply corr mat by precision mat
mymxginv<-mymx%*%ginvmymx
round(mymxginv,2)
## [,1] [,2] [,3]
## df.YearBuilt 1 0 0
## df.OverallQual 0 1 0
## df.SalePrice 0 0 1
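The prompt also asks for the product in the other order for the 3 x 3 case; this should likewise return the identity:
#precision mat x corr mat
round(ginvmymx %*% mymx, 2)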
library("Matrix")
lu(mymx)
## 'MatrixFactorization' of Formal class 'denseLU' [package "Matrix"] with 4 slots
## ..@ x : num [1:9] 1 0.572 0.523 0.572 0.672 ...
## ..@ perm : int [1:3] 1 2 3
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:3] "df.YearBuilt" "df.OverallQual" "df.SalePrice"
## .. ..$ : chr [1:3] "df.YearBuilt" "df.OverallQual" "df.SalePrice"
## ..@ Dim : int [1:2] 3 3
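The structure printout above hides the factors themselves; the Matrix package's expand() extracts them. A short sketch:
#extract the P, L, U factors from the LU decomposition
lud <- expand(lu(mymx))
lud$L #unit lower-triangular factor
lud$U #upper-triangular factor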
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data.
For your first variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
The minimum value of SalePrice is already above zero, so no shift is needed.
#check that the minimum value is above zero
min(df$SalePrice)
## [1] 34900
Run fitdistr to fit an exponential probability density function.
fit <- fitdistr(df$SalePrice, "exponential") #fitdistr() comes from MASS
fit
## rate
## 5.527268e-06
## (1.446552e-07)
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
#optimal value of lambda for this distribution
lambda <- fit$estimate
sampledf <- rexp(1000, lambda)
lambda
## rate
## 5.527268e-06
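As a sanity check, the maximum-likelihood estimate of an exponential rate is the reciprocal of the sample mean, so this one-liner should reproduce the fitdistr() estimate:
#exponential MLE: rate = 1/mean
1/mean(df$SalePrice)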
Plot a histogram and compare it with a histogram of your original variable.
#Plot a histogram and compare it with a histogram of your original variable.
sampledf<-data.frame(as.numeric(sampledf))
colnames(sampledf)[1] <- "sample"
str(sampledf)
## 'data.frame': 1000 obs. of 1 variable:
## $ sample: num 17151 8009 26217 294766 233449 ...
head(sampledf)
## sample
## 1 17151.105
## 2 8008.771
## 3 26216.798
## 4 294765.508
## 5 233448.812
## 6 292053.120
hist(sampledf$sample, col="green", main="Histogram of Exponential Distribution", xlab = "SalePrice", breaks=30)
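To make the comparison direct, the simulated and observed histograms can be drawn side by side; a sketch:
#side-by-side comparison of the simulated and observed distributions
par(mfrow = c(1, 2))
hist(sampledf$sample, col = "green", main = "Simulated Exponential", xlab = "SalePrice", breaks = 30)
hist(df$SalePrice, col = "#80cbc4", main = "Observed SalePrice", xlab = "SalePrice", breaks = 30)
par(mfrow = c(1, 1))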
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
#find the 5th and 95th percentiles
print("5th percentile")
## [1] "5th percentile"
qexp(.05,rate = lambda)
## [1] 9280.044
print("95th percentile")
## [1] "95th percentile"
qexp(.95, rate = lambda)
## [1] 541991.5
Also generate a 95% confidence interval from the empirical data, assuming normality.
#95% confidence interval from the empirical data (CI() comes from the Rmisc package)
CI(df$SalePrice, 0.95)
## upper mean lower
## 184999.6 180921.2 176842.8
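CI() reports a t-based interval; the same result can be computed by hand under the normality assumption, using the standard error of the mean:
#95% CI for the mean: mean +/- t * SE
n_sp <- length(df$SalePrice)
se <- sd(df$SalePrice)/sqrt(n_sp)
mean(df$SalePrice) + c(-1, 1) * qt(0.975, df = n_sp - 1) * se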
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(df$SalePrice, .05)
## 5%
## 88000
quantile(df$SalePrice, .95)
## 95%
## 326100
With 95% confidence, the mean of SalePrice is between 176,842.8 and 184,999.6. The exponential distribution would not be a good fit in this case: its 5th and 95th percentiles (9,280 and 541,992) span a far wider range than the empirical 5th and 95th percentiles (88,000 and 326,100), its center is shifted left compared with the empirical data, and it shows much more spread.
Modeling. Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
#test of an alternate model: backward stepwise selection over the complete-case columns
modvars <- df[, which(sapply(df, function(x) sum(is.na(x))) == 0)]
model1 <- step(lm(SalePrice ~ ., data = modvars), direction = 'backward', trace = FALSE)
model1
#dfglm <- glm(df$SalePrice ~ ., family=binomial, data = df)
#dfstep <- stepAIC(dfglm, trace = FALSE)
#dfstep$anova
fit <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea, data=df)
summary(fit) # show results
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## GarageArea, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -372594 -21236 -1594 18625 301129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -98436.050 4820.467 -20.420 < 2e-16 ***
## OverallQual 26988.854 1067.393 25.285 < 2e-16 ***
## GrLivArea 49.573 2.555 19.402 < 2e-16 ***
## GarageCars 11317.522 3126.297 3.620 0.000305 ***
## GarageArea 41.478 10.627 3.903 9.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40420 on 1455 degrees of freedom
## Multiple R-squared: 0.7418, Adjusted R-squared: 0.7411
## F-statistic: 1045 on 4 and 1455 DF, p-value: < 2.2e-16
Using the coefficients from the regression summary, write the fitted multiple linear regression equation.
\[ \widehat{SalePrice} = -98436.050 + 26988.854 \cdot OverallQual + 49.573 \cdot GrLivArea + 11317.522 \cdot GarageCars + 41.478 \cdot GarageArea \]
par(mfrow=c(2,2))
X1<-df$OverallQual
X2<-df$GrLivArea
X3<-df$GarageCars
X4<-df$GarageArea
Y<-df$SalePrice
plot(X1,Y, col="#f06292", main="OverallQual", ylab="Sale Price")
abline(lm(Y~X1), col="yellow", lwd=3) # regression line (y~x)
plot(X2,Y, col="#9c27b0", main="GrLivArea", ylab="Sale Price")
abline(lm(Y~X2), col="yellow", lwd=3) # regression line (y~x)
plot(X3,Y, col="#ce93d8", main="GarageCars", ylab="Sale Price")
abline(lm(Y~X3), col="yellow", lwd=3) # regression line (y~x)
plot(X4,Y, col="#c2185b", main="GarageArea", ylab="Sale Price")
abline(lm(Y~X4), col="yellow", lwd=3) # regression line (y~x)
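Before trusting the model, the standard lm diagnostics (residuals vs. fitted, normal Q-Q, scale-location, leverage) are worth a look; a sketch:
#standard regression diagnostic plots
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))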
Load the test data set and create a calculated column using the multiple linear regression equation. Select the required columns and export to CSV for the contest entry.
dftest <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA605/W15/FinalExam/test.csv")
#str(dftest)
#nrow(dftest)
SalePrice<-((26988.854*dftest$OverallQual) + (49.573*dftest$GrLivArea) + (11317.522*dftest$GarageCars) + (41.478*dftest$GarageArea) -98436.050) #predictions must use the test predictors, not the training data
#head(SalePrice)
dftest<-dftest[,c("Id","OverallQual","GrLivArea","GarageCars","GarageArea")]
kable(head(dftest))
| Id   | OverallQual | GrLivArea | GarageCars | GarageArea |
|------|-------------|-----------|------------|------------|
| 1461 | 5           | 896       | 1          | 730        |
| 1462 | 6           | 1329      | 1          | 312        |
| 1463 | 5           | 1629      | 2          | 482        |
| 1464 | 6           | 1604      | 2          | 470        |
| 1465 | 8           | 1280      | 2          | 506        |
| 1466 | 6           | 1655      | 2          | 440        |
#tail(dftest)
submission <- cbind(dftest$Id, SalePrice)
colnames(submission) <- c("Id", "SalePrice")
#replace negative or missing predictions (caused by NAs in the test predictors) with the median prediction
submission[is.na(submission[,"SalePrice"]) | submission[,"SalePrice"] < 0, "SalePrice"] <- median(SalePrice, na.rm=TRUE)
submission <- as.data.frame(submission)
kable(head(submission))
| Id   | SalePrice |
|------|-----------|
| 1461 | 122522.1  |
| 1462 | 153638.2  |
| 1463 | 159890.1  |
| 1464 | 185141.9  |
| 1465 | 224551.1  |
| 1466 | 186425.8  |
#str(submission)
#dim(submission)
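Hard-coding the coefficients is fragile; since the model was fit with bare column names, predict() on the test frame yields the same predictions more robustly. A sketch (submission2 is a hypothetical alternative to the submission built above):
#generate predictions directly from the fitted model
pred <- predict(fit, newdata = dftest)
pred[is.na(pred) | pred < 0] <- median(pred, na.rm = TRUE) #guard against NAs and negatives
submission2 <- data.frame(Id = dftest$Id, SalePrice = pred)
head(submission2)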
The chunk below has eval set to FALSE for reader convenience.
write.csv(submission, file = "matstest.csv", quote=FALSE, row.names=FALSE)
Kaggle username: mathsanu71. Kaggle score: 0.60113.