df <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA605/W15/FinalExam/train.csv")
Below is the dataset of house prices available from Kaggle.com. The dataset has 1,460 observations of houses in Ames, Iowa, and 79 variables potentially contributing to the house sale price.
The full dataset and dictionary are available at: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
#kable(head(df))
datatable(df, options = list( pageLength = 5, lengthMenu = c(5, 10, 40), initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#68080e', 'color': '#fff'});",
"}")), rownames=TRUE)
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
#test variable
X1<-df$YearBuilt
Y1<-df$SalePrice
#
# plot(X1,Y1)
# hist(X1, col="pink", main="Histogram of YearBuilt")
# hist(Y1, col="blue", main="Histogram of SalePrice")
#chosen variable
X<-df$YearBuilt
Y<-df$SalePrice
plot(X,Y, col="#1109f9", main="Scatterplot of Year Built and Sale Price", xlab = "Year Built", ylab="Sale Price")
abline(lm(Y~X), col="yellow", lwd=3) # regression line (y~x)
hist(X, col="#9df909", main="Histogram of Year Built", xlab = "Year Built")
hist(Y, col="#80cbc4", main="Histogram of Sale Price", xlab = "Sale Price")
print("Summary of X variable: Year Built")
## [1] "Summary of X variable: Year Built"
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1973 1971 2000 2010
print("Summary of Y variable: Sale Price")
## [1] "Summary of Y variable: Sale Price"
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
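As a quick check on the skewness requirement in the prompt, the sample skewness can be computed directly; a minimal sketch, assuming the e1071 package is installed (moments::skewness behaves the same way):
#sample skewness: > 0 indicates right skew, < 0 left skew
library(e1071)
skewness(X) #YearBuilt
skewness(Y) #SalePrice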
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
a. \(P(X > x \mid Y > y)\): given a sale price above its first quartile, the probability that the house was built after the first-quartile year.
XQ1<-quantile(X, probs=0.25) #1st quartile of X variable
YQ1<-quantile(Y, probs=0.25) #1st quartile of Y variable
n<-(nrow(df))
yearbuilt<-as.numeric(df$YearBuilt)
saleprice<-as.numeric(df$SalePrice)
nYQ1<-nrow(subset(df,saleprice>YQ1))
p1<-nrow(subset(df, yearbuilt > XQ1 & saleprice>YQ1))/nYQ1
p1
## [1] 0.8465753
Interpretation: among houses that sold above the first-quartile price ($129,975), about 84.7% were built after the first-quartile year (1954).
b. \(P(X > x, Y > y)\): the probability that a house was both built after the first-quartile year and sold above the first-quartile price.
p2<-nrow(subset(df, yearbuilt > XQ1 & saleprice>YQ1))/n
p2
## [1] 0.6349315
Interpretation: about 63.5% of all houses in the training set were both built after 1954 and sold for more than $129,975.
c. \(P(X \le x \mid Y > y)\): given a sale price above its first quartile, the probability that the house was built in or before the first-quartile year.
p3<-nrow(subset(df, yearbuilt <=XQ1 & saleprice>YQ1))/nYQ1
p3
## [1] 0.1534247
Interpretation: among houses that sold above the first-quartile price, about 15.3% were built in or before 1954. (Note that p1 + p3 = 1, since they partition the same conditional event.)
c1<-nrow(subset(df, yearbuilt <=XQ1 & saleprice<=YQ1))/n
c2<-nrow(subset(df, yearbuilt <=XQ1 & saleprice>YQ1))/n
c3<-c1+c2
c4<-nrow(subset(df, yearbuilt >XQ1 & saleprice<=YQ1))/n
c5<-nrow(subset(df, yearbuilt >XQ1 & saleprice>YQ1))/n
c6<-c4+c5
c7<-c1+c4
c8<-c2+c5
c9<-c3+c6
dfcounts<-matrix(round(c(c1,c2,c3,c4,c5,c6,c7,c8,c9),3), ncol=3, nrow=3, byrow=TRUE)
colnames(dfcounts)<-c(
"SalePrice <= 1st quartile",
"SalePrice > 1st quartile",
"Total")
rownames(dfcounts)<-c("YearBuilt <= 1st quartile","YearBuilt > 1st quartile","Total")
print("Quartile Matrix by Percentage")
## [1] "Quartile Matrix by Percentage"
dfcounts<-as.table(dfcounts)
dfcounts
## <=2d quartile >2d quartile Total
## <=3rd quartile 0.149 0.115 0.264
## >3rd quartile 0.101 0.635 0.736
## Total 0.250 0.750 1.000
print("Quartile Matrix by Count")
## [1] "Quartile Matrix by Count"
dfvals<-round(dfcounts*1460,0)
dfvals
## <=2d quartile >2d quartile Total
## <=3rd quartile 218 168 385
## >3rd quartile 147 927 1075
## Total 365 1095 1460
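The same table can be built more compactly with base R's table() and addmargins(); a sketch of that equivalent construction:
#contingency table of the quartile splits, with row/column totals
qtab <- table(YearBuilt = yearbuilt > XQ1, SalePrice = saleprice > YQ1)
addmargins(qtab) #counts with margins
round(addmargins(qtab)/n, 3) #proportions with margins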
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
papb<-c6*c8 #P(A)=P(X>x) and P(B)=P(Y>y) are the marginal totals
print (paste0("p(A)*p(B)=", round(papb,5)))
## [1] "p(A)*p(B)=0.55171"
\[P(A \mid B) = P(X > x \mid Y > y) = 0.846\]
\[P(A)\,P(B) = 0.736 \times 0.750 = 0.552\]
\[P(A \mid B) \neq P(A)\,P(B)\]
mat<-data.frame(df$YearBuilt, df$SalePrice)
chisq.test(mat, correct=TRUE)
##
## Pearson's Chi-squared test
##
## data: mat
## X-squared = 471530, df = 1459, p-value < 2.2e-16
We receive a p-value of < 2.2e-16, so we reject the null hypothesis of independence: YearBuilt and SalePrice are associated, not independent. This agrees with the check above, since \(P(A \mid B) = 0.846 \neq 0.736 = P(A)\) (equivalently, \(P(A \cap B) = 0.635 \neq 0.552 = P(A)P(B)\)).
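Note that chisq.test() above treats the raw two-column data frame as a 1460 x 2 contingency table. A test run on the 2 x 2 table of quartile splits addresses the split more directly; a sketch (output not shown):
#chi-square test of association on the 2 x 2 quartile-split table
qtab <- table(yearbuilt > XQ1, saleprice > YQ1)
chisq.test(qtab, correct = TRUE)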
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
isnum <- sapply(df, is.numeric)
dfnum<-df[ , isnum]
summary(dfnum)
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.0 Min. : 21.00 Min. : 1300
## 1st Qu.: 365.8 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554
## Median : 730.5 Median : 50.0 Median : 69.00 Median : 9478
## Mean : 730.5 Mean : 56.9 Mean : 70.05 Mean : 10517
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :1460.0 Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 0.0 Median : 383.5 Median : 0.00 Median : 477.5
## Mean : 103.7 Mean : 443.6 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 166.0 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :1600.0 Max. :5644.0 Max. :1474.00 Max. :2336.0
## NA's :8
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Min. : 0.0 Min. : 334 Min. : 0 Min. : 0.000
## 1st Qu.: 795.8 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## Median : 991.5 Median :1087 Median : 0 Median : 0.000
## Mean :1057.4 Mean :1163 Mean : 347 Mean : 5.845
## 3rd Qu.:1298.2 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## Max. :6110.0 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.000 Min. :1900 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5
## Median :1.000 Median :1980 Median :2.000 Median : 480.0
## Mean :0.613 Mean :1979 Mean :1.767 Mean : 473.0
## 3rd Qu.:1.000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :3.000 Max. :2010 Max. :4.000 Max. :1418.0
## NA's :81
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 15.06 Mean : 2.759 Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :480.00 Max. :738.000 Max. :15500.00 Max. :12.000
##
## YrSold SalePrice
## Min. :2006 Min. : 34900
## 1st Qu.:2007 1st Qu.:129975
## Median :2008 Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
Provide a scatterplot of X and Y. Create a density plot for the SalePrice variable.
#Collect summary statistics
SalePrice.mean <-mean(df$SalePrice)
SalePrice.median <-median(df$SalePrice)
SalePrice.mode <- as.numeric(names(sort(-table(df$SalePrice))))[1]
SalePrice.sd <- sd(df$SalePrice)
d_SalePrice <- density(df$SalePrice)
plot(d_SalePrice, main="SalePrice Probabilities", ylab="Probability", xlab="SalePrice")
polygon(d_SalePrice, col="red")
abline(v = SalePrice.median, col = "green", lwd = 2)
abline(v = SalePrice.mean, col = "blue", lwd = 2)
abline(v = SalePrice.mode, col = "purple", lwd = 2)
legend("topright", legend=c("median", "mean","mode"),col=c("green", "blue", "purple"), lty=1, cex=0.8)
#Collect summary statistics
YearBuilt.mean <-mean(df$YearBuilt)
YearBuilt.median <-median(df$YearBuilt)
YearBuilt.mode <- as.numeric(names(sort(-table(df$YearBuilt))))[1]
YearBuilt.sd <- sd(df$YearBuilt)
d_YearBuilt <- density(df$YearBuilt)
plot(d_YearBuilt, main="YearBuilt Probabilities", ylab="Probability", xlab="YearBuilt")
polygon(d_YearBuilt, col="#68080e")
abline(v = YearBuilt.median, col = "green", lwd = 2)
abline(v = YearBuilt.mean, col = "blue", lwd = 2)
abline(v = YearBuilt.mode, col = "purple", lwd = 2)
legend("topright", legend=c("median", "mean","mode"),col=c("green", "blue", "purple"), lty=1, cex=0.8)
Here X = YearBuilt and Y = SalePrice, as defined above.
**Would you be worried about familywise error? Why or why not?**
The analysis above compared three pairings from the data: 1. Bedrooms Above Ground vs. Sale Price, 2. Overall Quality and Condition vs. Sale Price, and 3. Ground Living Area vs. Sale Price. Regardless of the year built, the goal was to find correlation among these factors; some are moderately correlated with sale price and some are not. With n = 1,460 observations, the significant correlations carry extremely small p-values, so even a Bonferroni correction across the three tests would leave the conclusions unchanged; for that reason I am not greatly concerned about the familywise error rate here.
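The pairwise correlation tests with 92% confidence intervals that the prompt asks for can be run with cor.test(); a sketch using the three variables that appear in the correlation matrix below:
#test H0: rho = 0 for each pair, with a 92% confidence interval
cor.test(df$YearBuilt, df$OverallQual, conf.level = 0.92)
cor.test(df$YearBuilt, df$SalePrice, conf.level = 0.92)
cor.test(df$OverallQual, df$SalePrice, conf.level = 0.92)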
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
#warm-up: the same operations on a 2 x 2 correlation matrix (GrLivArea vs SalePrice)
myvars<-data.frame(df$GrLivArea,df$SalePrice)
mymx<-as.matrix(cor(myvars))
mymx
## df.GrLivArea df.SalePrice
## df.GrLivArea 1.0000000 0.7086245
## df.SalePrice 0.7086245 1.0000000
#inverse of my correlation matrix, i.e. the precision matrix (ginv() comes from MASS)
library(MASS)
ginvmymx<-ginv(mymx)
ginvmymx
## [,1] [,2]
## [1,] 2.008632 -1.423366
## [2,] -1.423366 2.008632
#corr mat x precision mat
mymxginv<-mymx%*%ginvmymx
round(mymxginv,2)
## [,1] [,2]
## df.GrLivArea 1 0
## df.SalePrice 0 1
#precision mat x corr mat (stored separately so the precision matrix is not overwritten)
precxcorr<-ginvmymx%*%mymx
round(precxcorr,2)
## df.GrLivArea df.SalePrice
## [1,] 1 0
## [2,] 0 1
#3 x 3 correlation matrix: YearBuilt, OverallQual, SalePrice
myvars<-data.frame(df$YearBuilt,df$OverallQual,df$SalePrice)
mymx<-as.matrix(cor(myvars))
mymx
## df.YearBuilt df.OverallQual df.SalePrice
## df.YearBuilt 1.0000000 0.5723228 0.5228973
## df.OverallQual 0.5723228 1.0000000 0.7909816
## df.SalePrice 0.5228973 0.7909816 1.0000000
#inverse of my correlation matrix, precision matrix
ginvmymx<-ginv(mymx)
ginvmymx
## [,1] [,2] [,3]
## [1,] 1.5168013 -0.6431116 -0.2844419
## [2,] -0.6431116 2.9439846 -1.9923563
## [3,] -0.2844419 -1.9923563 2.7246511
#Multiply corr mat by precision mat
mymxginv<-mymx%*%ginvmymx
round(mymxginv,2)
## [,1] [,2] [,3]
## df.YearBuilt 1 0 0
## df.OverallQual 0 1 0
## df.SalePrice 0 0 1
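The prompt also asks for the product in the other order for the 3 x 3 case; this should likewise return the identity:
#precision mat x corr mat
round(ginvmymx %*% mymx, 2)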
library("Matrix")
lu(mymx)
## 'MatrixFactorization' of Formal class 'denseLU' [package "Matrix"] with 4 slots
## ..@ x : num [1:9] 1 0.572 0.523 0.572 0.672 ...
## ..@ perm : int [1:3] 1 2 3
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:3] "df.YearBuilt" "df.OverallQual" "df.SalePrice"
## .. ..$ : chr [1:3] "df.YearBuilt" "df.OverallQual" "df.SalePrice"
## ..@ Dim : int [1:2] 3 3
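The structure printout above hides the factors themselves; the Matrix package's expand() extracts them. A short sketch:
#extract the P, L, U factors from the LU decomposition
lud <- expand(lu(mymx))
lud$L #unit lower-triangular factor
lud$U #upper-triangular factor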
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data.
For your first variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
The minimum value of SalePrice is already above zero, so no shift is needed.
#check that the minimum value is above zero
min(df$SalePrice)
## [1] 34900
Run fitdistr to fit an exponential probability density function.
fit <- fitdistr(df$SalePrice, "exponential") #fitdistr() comes from MASS
fit
## rate
## 5.527268e-06
## (1.446552e-07)
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
#optimal value of lambda for this distribution
lambda <- fit$estimate
sampledf <- rexp(1000, lambda)
lambda
## rate
## 5.527268e-06
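As a sanity check, the maximum-likelihood estimate of an exponential rate is the reciprocal of the sample mean, so this one-liner should reproduce the fitdistr() estimate:
#exponential MLE: rate = 1/mean
1/mean(df$SalePrice)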
Plot a histogram and compare it with a histogram of your original variable.
#Plot a histogram and compare it with a histogram of your original variable.
sampledf<-data.frame(as.numeric(sampledf))
colnames(sampledf)[1] <- "sample"
str(sampledf)
## 'data.frame': 1000 obs. of 1 variable:
## $ sample: num 17151 8009 26217 294766 233449 ...
head(sampledf)
## sample
## 1 17151.105
## 2 8008.771
## 3 26216.798
## 4 294765.508
## 5 233448.812
## 6 292053.120
hist(sampledf$sample, col="green", main="Histogram of Exponential Distribution", xlab = "SalePrice", breaks=30)
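To make the comparison direct, the simulated and observed histograms can be drawn side by side; a sketch:
#side-by-side comparison of the simulated and observed distributions
par(mfrow = c(1, 2))
hist(sampledf$sample, col = "green", main = "Simulated Exponential", xlab = "SalePrice", breaks = 30)
hist(df$SalePrice, col = "#80cbc4", main = "Observed SalePrice", xlab = "SalePrice", breaks = 30)
par(mfrow = c(1, 1))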
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
#find the 5th and 95th percentiles
print("5th percentile")
## [1] "5th percentile"
qexp(.05,rate = lambda)
## [1] 9280.044
print("95th percentile")
## [1] "95th percentile"
qexp(.95, rate = lambda)
## [1] 541991.5
Also generate a 95% confidence interval from the empirical data, assuming normality.
#95% confidence interval from the empirical data (CI() comes from the Rmisc package)
CI(df$SalePrice, 0.95)
## upper mean lower
## 184999.6 180921.2 176842.8
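CI() reports a t-based interval; the same result can be computed by hand under the normality assumption, using the standard error of the mean:
#95% CI for the mean: mean +/- t * SE
n_sp <- length(df$SalePrice)
se <- sd(df$SalePrice)/sqrt(n_sp)
mean(df$SalePrice) + c(-1, 1) * qt(0.975, df = n_sp - 1) * se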
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(df$SalePrice, .05)
## 5%
## 88000
quantile(df$SalePrice, .95)
## 95%
## 326100
With 95% confidence, the mean of SalePrice is between 176,842.8 and 184,999.6. The exponential distribution would not be a good fit in this case: its 5th and 95th percentiles (9,280 and 541,992) span a far wider range than the empirical 5th and 95th percentiles (88,000 and 326,100), its center is shifted left compared with the empirical data, and it shows much more spread.
Modeling. Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
#test of an alternate model: backward stepwise selection over the complete-case columns
modvars <- df[, which(sapply(df, function(x) sum(is.na(x))) == 0)]
model1 <- step(lm(SalePrice ~ ., data = modvars), direction = 'backward', trace = FALSE)
model1
#dfglm <- glm(df$SalePrice ~ ., family=binomial, data = df)
#dfstep <- stepAIC(dfglm, trace = FALSE)
#dfstep$anova
fit <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea, data=df)
summary(fit) # show results
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## GarageArea, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -372594 -21236 -1594 18625 301129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -98436.050 4820.467 -20.420 < 2e-16 ***
## OverallQual 26988.854 1067.393 25.285 < 2e-16 ***
## GrLivArea 49.573 2.555 19.402 < 2e-16 ***
## GarageCars 11317.522 3126.297 3.620 0.000305 ***
## GarageArea 41.478 10.627 3.903 9.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40420 on 1455 degrees of freedom
## Multiple R-squared: 0.7418, Adjusted R-squared: 0.7411
## F-statistic: 1045 on 4 and 1455 DF, p-value: < 2.2e-16
Using the coefficients from the regression summary, write the fitted multiple linear regression equation.
\[ \widehat{SalePrice} = -98436.050 + 26988.854 \cdot OverallQual + 49.573 \cdot GrLivArea + 11317.522 \cdot GarageCars + 41.478 \cdot GarageArea \]
par(mfrow=c(2,2))
X1<-df$OverallQual
X2<-df$GrLivArea
X3<-df$GarageCars
X4<-df$GarageArea
Y<-df$SalePrice
plot(X1,Y, col="#f06292", main="OverallQual", ylab="Sale Price")
abline(lm(Y~X1), col="yellow", lwd=3) # regression line (y~x)
plot(X2,Y, col="#9c27b0", main="GrLivArea", ylab="Sale Price")
abline(lm(Y~X2), col="yellow", lwd=3) # regression line (y~x)
plot(X3,Y, col="#ce93d8", main="GarageCars", ylab="Sale Price")
abline(lm(Y~X3), col="yellow", lwd=3) # regression line (y~x)
plot(X4,Y, col="#c2185b", main="GarageArea", ylab="Sale Price")
abline(lm(Y~X4), col="yellow", lwd=3) # regression line (y~x)
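Before trusting the model, the standard lm diagnostics (residuals vs. fitted, normal Q-Q, scale-location, leverage) are worth a look; a sketch:
#standard regression diagnostic plots
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))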
Load the test data set and create a calculated column using the multiple linear regression equation. Select the required columns and export to CSV for the contest entry.
dftest <- read.csv("https://raw.githubusercontent.com/mathsanu/CUNY_MSDA/master/DATA605/W15/FinalExam/test.csv")
#str(dftest)
#nrow(dftest)
SalePrice<-((26988.854*dftest$OverallQual) + (49.573*dftest$GrLivArea) + (11317.522*dftest$GarageCars) + (41.478*dftest$GarageArea) -98436.050) #predictions must use the test predictors, not the training data
#head(SalePrice)
dftest<-dftest[,c("Id","OverallQual","GrLivArea","GarageCars","GarageArea")]
kable(head(dftest))
| Id   | OverallQual | GrLivArea | GarageCars | GarageArea |
|------|-------------|-----------|------------|------------|
| 1461 | 5           | 896       | 1          | 730        |
| 1462 | 6           | 1329      | 1          | 312        |
| 1463 | 5           | 1629      | 2          | 482        |
| 1464 | 6           | 1604      | 2          | 470        |
| 1465 | 8           | 1280      | 2          | 506        |
| 1466 | 6           | 1655      | 2          | 440        |
#tail(dftest)
submission <- cbind(dftest$Id, SalePrice)
colnames(submission) <- c("Id", "SalePrice")
#replace negative or missing predictions (caused by NAs in the test predictors) with the median prediction
submission[is.na(submission[,"SalePrice"]) | submission[,"SalePrice"] < 0, "SalePrice"] <- median(SalePrice, na.rm=TRUE)
submission <- as.data.frame(submission)
kable(head(submission))
| Id   | SalePrice |
|------|-----------|
| 1461 | 122522.1  |
| 1462 | 153638.2  |
| 1463 | 159890.1  |
| 1464 | 185141.9  |
| 1465 | 224551.1  |
| 1466 | 186425.8  |
#str(submission)
#dim(submission)
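Hard-coding the coefficients is fragile; since the model was fit with bare column names, predict() on the test frame yields the same predictions more robustly. A sketch (submission2 is a hypothetical alternative to the submission built above):
#generate predictions directly from the fitted model
pred <- predict(fit, newdata = dftest)
pred[is.na(pred) | pred < 0] <- median(pred, na.rm = TRUE) #guard against NAs and negatives
submission2 <- data.frame(Id = dftest$Id, SalePrice = pred)
head(submission2)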
The chunk below has eval set to FALSE for reader convenience.
write.csv(submission, file = "matstest.csv", quote=FALSE, row.names=FALSE)
Kaggle username: mathsanu71. Kaggle score: 0.60113.