Website: https://www.kaggle.com/c/house-prices-advanced-regression-techniques provides a dataset of 1,460 residential homes in Ames, Iowa, described by 79 explanatory variables covering nearly every aspect of each property; the task is to predict the final sale price of each home.
# read in training data, use ID column as row names
train <- read.csv('train.csv', row.names = 1)
head(train)
## MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 60 RL 65 8450 Pave <NA> Reg
## 2 20 RL 80 9600 Pave <NA> Reg
## 3 60 RL 68 11250 Pave <NA> IR1
## 4 70 RL 60 9550 Pave <NA> IR1
## 5 60 RL 84 14260 Pave <NA> IR1
## 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
Final sale price (in dollars) is defined as the dependent variable \(Y\) and the above grade living area (in square feet) is defined as the independent variable \(X\).
# Choose above grade living area as my quantitative independent variable
X <- train$GrLivArea
# Choose Sale Price as my dependent variable as per the Kaggle Competition
Y <- train$SalePrice
# Let's get the median and mean to show it is skewed to the right
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
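The mean (1515) exceeds the median (1464), which already suggests a right-skewed distribution.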
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
# Plot above grade living area to visually show it is skewed to the right
ggplot(data = train) + geom_density(aes(x=GrLivArea), fill="blue") + labs(x="Above Grade Living Area") + ggtitle("Above Grade Living Area Density") + theme_light()
# We have the quartiles for X above; here we get them for Y
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 130000 163000 180900 214000 755000
“x” is estimated as the 1st quartile of the X variable: x = 1130
“y” is estimated as the 2nd quartile (median) of the Y variable: y = 163,000
# count the number of rows with SalePrice > 163,000
sum(Y > 163000)
## [1] 728
# Total number of rows
length(Y)
## [1] 1460
length(X)
## [1] 1460
# Rows where both Y > y and X > x
sum(Y > 163000 & X > 1130)
## [1] 720
Q1x <- quantile(X,0.25)
Q1x
## 25%
## 1129.5
Q2y <- quantile(Y,0.5)
Q2y
## 50%
## 163000
# The probability P(X>x | Y>y)
(sum(Y > Q2y & X > Q1x) / length(X)) / (sum(Y > Q2y) / length(Y))
## [1] 0.989011
# The probability P(X>x and Y>y)
sum(X > Q1x & Y > Q2y) / length(X)
## [1] 0.4931507
# The probability P(X<x | Y>y)
(sum(Y > Q2y & X < Q1x) / length(X)) / (sum(Y > Q2y) / length(Y))
## [1] 0.01098901
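The same probabilities can be read off a 2×2 quadrant table of counts; a minimal sketch using the cutpoints computed above:
# quadrant counts relative to the cutpoints x = Q1x and y = Q2y
quad <- table(X > Q1x, Y > Q2y, dnn = c("X > x", "Y > y"))
quad
# P(X>x | Y>y) as the joint count over the Y>y column total: 720/728 = 0.989, as above
quad["TRUE", "TRUE"] / sum(quad[, "TRUE"])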
Does splitting the training data in this fashion make them independent? In other words, does \(P(X \mid Y) = P(X)\), equivalently \(P(X \cap Y) = P(X)\,P(Y)\)? Check mathematically, and then evaluate by running a Chi Square test for association. You might have to research this.
# P(X) = P(above the 1st quartile for X)
px <- sum(X > Q1x) / length(X)
px
## [1] 0.75
# P(Y) = P(above the 2nd quartile for Y)
py <- sum(Y > Q2y) / length(Y)
py
## [1] 0.4986301
# P(X)*P(Y)
pxy <- px*py
pxy
## [1] 0.3739726
# P(X|Y) = P(Y and X) / P(Y)
pxxyy <- (sum(X > Q1x & Y > Q2y) / length(X)) / py
pxxyy
## [1] 0.989011
Since \(P(X>x \mid Y>y) = 0.989 \ne 0.75 = P(X>x)\), and equivalently \(P(X>x \cap Y>y) = 0.493 \ne 0.374 = P(X>x)\,P(Y>y)\), my variables are not independent, so splitting the data in this fashion does not make them independent.
The Chi Square test evaluates the hypothesis that above grade living area (in square feet) is independent of the sale price. We could debate a 5% versus 1% significance level, but in this case that's not necessary: with a p-value well under 1%, we reject the null hypothesis that the above grade living area is independent of the sale price of a house.
if (!require(MASS)) install.packages("MASS")
## Loading required package: MASS
library(MASS)
tbl <- table(train$GrLivArea > Q1x, train$SalePrice > Q2y)
chisq.test(tbl)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 439.85, df = 1, p-value < 2.2e-16
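The chi-square statistic compares the observed quadrant counts against those expected under independence (row total × column total / n); the expected counts can be inspected directly from the test object:
# counts expected if the GrLivArea and SalePrice splits were independent
chisq.test(tbl)$expected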
Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of X and Y. Transform both variables simultaneously using Box-Cox transformations. You might have to research this.
Running dim() and str() on the training data set shows that we have 1,460 observations and 80 variables.
Summary statistics for \(X\) and \(Y\) are provided in the table below:
library(pander)
## Warning: package 'pander' was built under R version 3.3.3
# prepare summary table supplemented with standard deviation
sum_tbl <- rbind(c(summary(X), 'Std. Dev.'=round(sd(X), 0)),
c(summary(Y), 'Std. Dev.'=round(sd(Y), 0)))
row.names(sum_tbl) <- c('X', 'Y')
pander(sum_tbl)
|   | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Std. Dev. |
|---|---|---|---|---|---|---|---|
| X | 334 | 1130 | 1464 | 1515 | 1777 | 5642 | 525 |
| Y | 34900 | 130000 | 163000 | 180900 | 214000 | 755000 | 79443 |
The plots below show the marginal distributions and the joint distribution of the two variables.
library(ggplot2)
library(scales)
# scatter of X & Y
pscat <- ggplot(data.frame(X, Y), aes(X, Y / 1000)) +
geom_point(alpha = 0.25) +
scale_x_continuous('Above Grade Living Area [square feet]') +
scale_y_continuous('Sale Price [thousands]', labels = dollar)
# histogram of X
px <- ggplot(data.frame(X), aes(X, ..count.. / sum(..count..))) +
geom_histogram(alpha = 0.5, col = 'black', binwidth = 200) +
scale_x_continuous(name = NULL, labels = NULL) +
scale_y_continuous(name = '', labels = percent)
# histogram of Y
py <- ggplot(data.frame(Y), aes(Y, ..count.. / sum(..count..))) +
geom_histogram(alpha = 0.5, col = 'black', binwidth = 25000) +
scale_x_continuous(name = NULL, labels = NULL) +
scale_y_continuous(name = '', labels = percent) +
coord_flip()
# combined plots
library(grid)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.3.3
grid.arrange(px, nullGrob(), pscat, py,
nrow=2, widths = c(0.8, 0.2), heights = c(0.2, 0.8),
top = 'Histograms and Scatterplot of X & Y Variables')
Below is a scatterplot of above grade living area (in square feet) versus sale price. I overlaid the regression line; the lighter band around the line is the 95% confidence interval.
# Plot above grade living area versus sale price and overlay a regression line
ggplot(train, aes(x = GrLivArea, y = SalePrice)) + geom_point(color="red") + labs(x="Above Grade Living Area", y = "Sale Price") + ggtitle("Above Grade Living Area vs. Sale Price") + geom_smooth(method = "lm", se = TRUE) + theme_light()
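The prompt also asks for a simultaneous Box-Cox transformation of both variables. MASS::boxcox handles only a single response, so the sketch below assumes the car package, whose powerTransform estimates the two power parameters jointly:
# jointly estimate Box-Cox lambdas for X and Y (assumes the car package is available)
library(car)
bc <- powerTransform(cbind(X, Y) ~ 1)
summary(bc)                              # estimated lambda for each variable
XY_bc <- bcPower(cbind(X, Y), coef(bc))  # apply the estimated transformation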
Provide a 95% CI for the difference in the mean of the variables.
The Welch t-test below shows, with 95% confidence, that the difference in the means of Y and X lies between 175.3K and 183.5K.
# Run the T test
t.test(Y, X)
##
## Welch Two Sample t-test
##
## data: Y and X
## t = 86.288, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 175327.3 183484.2
## sample estimates:
## mean of x mean of y
## 180921.196 1515.464
# Welch's t-test above assumes each group is approximately normal (or n is large) but does not assume equal variances; the paired test below assumes the per-house differences are approximately normal
diffmean <- t.test(X, Y, paired = TRUE)$conf.int
diffmean
## [1] -183465.0 -175346.4
## attr(,"conf.level")
## [1] 0.95
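Because X and Y are paired by house, the paired interval above is identical to a one-sample t-interval on the per-house differences:
# one-sample t-interval on the differences reproduces the paired CI above
t.test(X - Y)$conf.int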
Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval.
# Setup and run correlation test
data <- data.frame(X, Y)
cm <- cor(data)
cm
## X Y
## X 1.0000000 0.7086245
## Y 0.7086245 1.0000000
cor.test(Y, X, conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: Y and X
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.6733974 0.7406408
## sample estimates:
## cor
## 0.7086245
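As a cross-check, the 99% interval can be derived by hand with the Fisher z-transformation \(z = \operatorname{atanh}(r)\), which is approximately normal with standard error \(1/\sqrt{n-3}\); this is the same construction cor.test uses:
r <- cor(X, Y)
n <- length(X)
z <- atanh(r)                 # Fisher z-transform of the sample correlation
se <- 1 / sqrt(n - 3)
crit <- qnorm(0.995)          # two-sided 99% critical value
# back-transform to the correlation scale; should match 0.6734 to 0.7406 above
tanh(c(z - crit * se, z + crit * se))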
The Pearson correlation is 0.7086. The Pearson correlation coefficient ranges from -1 to +1: a value of 0 indicates no linear association between the two variables, and a value greater than 0 indicates a positive association, i.e., as one variable increases, so does the other. This result is a strong indication of a correlation between above grade living area and sale price.
Similarly, the t-test for the difference in the means also indicates an association between above grade living area and sale price, because the confidence interval for the true difference in means excludes 0.
Using at least three untransformed variables, build a correlation matrix. Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
Assuming our square correlation matrix is invertible, the process outlined above should yield the identity matrix, since \(AA^{-1} = A^{-1}A = I\).
# Correlation matrix
cm
## X Y
## X 1.0000000 0.7086245
## Y 0.7086245 1.0000000
# Invert the correlation matrix to get the precision matrix
pm <- solve(cm)
pm
## X Y
## X 2.008632 -1.423366
## Y -1.423366 2.008632
# Multiply the correlation matrix by the precision matrix
# A %*% A^-1
AAn1 <- cm%*%pm
AAn1
## X Y
## X 1 0
## Y 0 1
# Multiply the precision matrix by the correlation matrix
# A^-1 %*% A
An1A <- pm %*% cm
An1A
## X Y
## X 1 0
## Y 0 1
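The prompt asks for at least three untransformed variables; the same computation extends directly. A sketch choosing OverallQual as the third variable:
# correlation and precision matrices for three numeric variables
cm3 <- cor(train[, c("GrLivArea", "OverallQual", "SalePrice")])
pm3 <- solve(cm3)         # precision matrix; diagonal entries are the VIFs
round(cm3 %*% pm3, 10)    # identity, up to floating-point error
round(pm3 %*% cm3, 10)    # identity again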
Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift it (if necessary) so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice.
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable.
# min(GrLivArea) is 334, already above zero, so no location shift is needed; run fitdistr to fit an exponential probability density function
ep <- fitdistr(train$GrLivArea, "exponential")
ep
## rate
## 6.598640e-04
## (1.726943e-05)
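This fitted rate is simply the maximum-likelihood estimate for an exponential, \(\hat{\lambda} = 1/\bar{x} = 1/1515.46 \approx 6.599 \times 10^{-4}\), matching the mean of GrLivArea reported earlier.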
# Take 1000 samples from the exponential distribution; set.seed keeps the percentiles below reproducible
set.seed(1)
esamp <- rexp(1000, 6.598640e-04)  # rate taken from fitdistr above
# Plot a histogram and compare it with a histogram of your original variable
hist(esamp)
hist(train$GrLivArea)
Using the empirical quantiles (the inverse of the CDF) of the simulated sample to find the 5th and 95th percentiles:
quantile(esamp, probs=c(.05,.95))
## 5% 95%
## 78.87384 4481.40505
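For comparison, the theoretical percentiles of the fitted exponential follow from the closed-form inverse CDF \(F^{-1}(p) = -\ln(1-p)/\lambda\):
# theoretical 5th and 95th percentiles of the fitted exponential
qexp(c(0.05, 0.95), rate = ep$estimate)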
To generate a confidence interval for the mean from the empirical data: CI = sample mean \(\pm z \times\) standard error. Note that qnorm(0.95) = 1.645 below gives a two-sided 90% interval (5% in each tail); the usual two-sided 95% interval would use qnorm(0.975) = 1.96.
lower or left side of the interval:
# The lower or left side of the interval for GrLivArea = mean - error
mean(train$GrLivArea) - qnorm(0.95) * sd(train$GrLivArea) / sqrt(length(train$GrLivArea))
## [1] 1492.843
# The upper or right side of the interval for GrLivArea = mean + error
mean(train$GrLivArea) + qnorm(0.95) * sd(train$GrLivArea) / sqrt(length(train$GrLivArea))
## [1] 1538.084
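A standard two-sided 95% interval for the mean can also be read directly from a one-sample t-test, which uses the t rather than the normal critical value:
# two-sided 95% confidence interval for the mean of GrLivArea
t.test(train$GrLivArea)$conf.int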
# Find the empirical 5th percentile and 95th percentile of the data
quantile(train$GrLivArea, probs=c(.05,.95))
## 5% 95%
## 848.0 2466.1
For above grade living area, the interval for the mean is 1492.8 < mean(X) < 1538.1.
The empirical 5th percentile is 848 and the 95th percentile is 2466.
To predict the sale price of houses from the characteristics of each house, a regression model is constructed and its predictions are submitted to the Kaggle competition.
The model considers only the numeric variables examined in the previous section. The leaps package is used to find the subset of predictors that minimizes the Bayesian Information Criterion (BIC), as shown below.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# first variable is categorical values stored by numeric key; convert to factor
train$MSSubClass <- factor(train$MSSubClass)
# impute LotFrontage and MasVnrArea with zero
train$LotFrontage[is.na(train$LotFrontage)] <- 0
train$MasVnrArea[is.na(train$MasVnrArea)] <- 0
# impute GarageYrBlt as YearBuilt plus the median gap between GarageYrBlt and YearBuilt
train$GarageYrBlt[is.na(train$GarageYrBlt)] <-
train$YearBuilt[is.na(train$GarageYrBlt)] +
median(train$GarageYrBlt - train$YearBuilt, na.rm = TRUE)
# gather all explanatory variables and exclude non-numeric variables
pca_data <- train[, sapply(train, is.numeric)] %>% select(-SalePrice)
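A quick sanity check that the imputations above left no missing values among the numeric predictors:
# should be zero after imputing LotFrontage, MasVnrArea, and GarageYrBlt
sum(is.na(pca_data))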
# combine the numeric explanatory variables with the target variable
model_data <- cbind(pca_data, SalePrice = train$SalePrice)
# use regsubsets to get best subsets per number of variables
library(leaps)
## Warning: package 'leaps' was built under R version 3.3.3
regfit <- regsubsets(SalePrice~., data = model_data, nvmax = 20)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 2 linear dependencies found
## Reordering variables and trying again:
reg.summary <- summary(regfit)
# plot variables vs BIC & BIC vs # variables
par(mfrow = c(1, 2), mar=c(1,1,1,1))
plot(regfit, scale = "bic", main = "Predictor Variables vs. BIC")
plot(reg.summary$bic, xlab = "Number of Predictors", ylab = "BIC",
type = "l", main = "Best Subset Selection Using BIC")
# annotate lowest BIC
minbic <- which.min(reg.summary$bic)
points(minbic, reg.summary$bic[minbic], col = "brown", cex = 2, pch = 20)
The parameters of this model are presented below:
# build the model formula from the 15 predictors in the minimum-BIC subset
usedvars <- names(coef(regfit, minbic))[-1]
BIC_formula <- as.formula(paste("SalePrice ~", paste(usedvars, collapse = " + ")))
# fit the selected model (avoids eval(parse()) on a pasted string)
bestsubset <- glm(BIC_formula, data = model_data)
# print model summary
pander(summary(bestsubset))
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1066852 | 113116 | -9.431 | 1.553e-20 |
| LotArea | 0.6162 | 0.1077 | 5.721 | 1.286e-08 |
| OverallQual | 24431 | 1123 | 21.75 | 5.986e-91 |
| OverallCond | 5557 | 1007 | 5.517 | 4.085e-08 |
| YearBuilt | 273.4 | 71.76 | 3.81 | 0.0001446 |
| MasVnrArea | 39.46 | 6.257 | 6.307 | 3.779e-10 |
| BsmtFinSF1 | 19.94 | 2.592 | 7.692 | 2.657e-14 |
| X1stFlrSF | 26.85 | 3.475 | 7.726 | 2.06e-14 |
| BsmtHalfBath | -4250 | 4211 | -1.009 | 0.313 |
| FullBath | 6379 | 2583 | 2.47 | 0.01363 |
| TotRmsAbvGrd | 8901 | 828.8 | 10.74 | 6.092e-26 |
| Fireplaces | 8832 | 1820 | 4.853 | 1.346e-06 |
| GarageYrBlt | 203.1 | 72.73 | 2.792 | 0.005306 |
| WoodDeckSF | 32.95 | 8.486 | 3.883 | 0.000108 |
| EnclosedPorch | 25.93 | 17.95 | 1.445 | 0.1488 |
| MiscVal | -1.491 | 2.008 | -0.7423 | 0.458 |
(Dispersion parameter for gaussian family taken to be 1432794491)

Null deviance: 9.208e+12 on 1459 degrees of freedom
Residual deviance: 2.069e+12 on 1444 degrees of freedom
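From these deviances, the selected model explains roughly \(1 - 2.069/9.208 \approx 78\%\) of the variance in sale price.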
# read in test data, using ID column as row names
test <- read.csv('test.csv', row.names = 1)
# apply same imputations to test data as train data
test$MSSubClass <- factor(test$MSSubClass)
test$LotFrontage[is.na(test$LotFrontage)] <- 0
test$MasVnrArea[is.na(test$MasVnrArea)] <- 0
test$GarageYrBlt[is.na(test$GarageYrBlt)] <-
test$YearBuilt[is.na(test$GarageYrBlt)] +
median(test$GarageYrBlt - test$YearBuilt, na.rm = TRUE)
head(test)
## MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1461 20 RH 80 11622 Pave <NA> Reg
## 1462 20 RL 81 14267 Pave <NA> IR1
## 1463 60 RL 74 13830 Pave <NA> IR1
## 1464 60 RL 78 9978 Pave <NA> IR1
## 1465 120 RL 43 5005 Pave <NA> IR1
## 1466 60 RL 75 10000 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1461 Lvl AllPub Inside Gtl NAmes Feedr
## 1462 Lvl AllPub Corner Gtl NAmes Norm
## 1463 Lvl AllPub Inside Gtl Gilbert Norm
## 1464 Lvl AllPub Inside Gtl Gilbert Norm
## 1465 HLS AllPub Inside Gtl StoneBr Norm
## 1466 Lvl AllPub Corner Gtl Gilbert Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1461 Norm 1Fam 1Story 5 6 1961
## 1462 Norm 1Fam 1Story 6 6 1958
## 1463 Norm 1Fam 2Story 5 5 1997
## 1464 Norm 1Fam 2Story 6 6 1998
## 1465 Norm TwnhsE 1Story 8 5 1992
## 1466 Norm 1Fam 2Story 6 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1461 1961 Gable CompShg VinylSd VinylSd None
## 1462 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace
## 1463 1998 Gable CompShg VinylSd VinylSd None
## 1464 1998 Gable CompShg VinylSd VinylSd BrkFace
## 1465 1992 Gable CompShg HdBoard HdBoard None
## 1466 1994 Gable CompShg HdBoard HdBoard None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond
## 1461 0 TA TA CBlock TA TA
## 1462 108 TA TA CBlock TA TA
## 1463 0 TA TA PConc Gd TA
## 1464 20 TA TA PConc TA TA
## 1465 0 Gd TA PConc Gd TA
## 1466 0 TA TA PConc Gd TA
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## 1461 No Rec 468 LwQ 144
## 1462 No ALQ 923 Unf 0
## 1463 No GLQ 791 Unf 0
## 1464 No GLQ 602 Unf 0
## 1465 No ALQ 263 Unf 0
## 1466 No Unf 0 Unf 0
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1461 270 882 GasA TA Y SBrkr
## 1462 406 1329 GasA TA Y SBrkr
## 1463 137 928 GasA Gd Y SBrkr
## 1464 324 926 GasA Ex Y SBrkr
## 1465 1017 1280 GasA Ex Y SBrkr
## 1466 763 763 GasA Gd Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1461 896 0 0 896 0 0
## 1462 1329 0 0 1329 0 0
## 1463 928 701 0 1629 0 0
## 1464 926 678 0 1604 0 0
## 1465 1280 0 0 1280 0 0
## 1466 763 892 0 1655 0 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 1461 1 0 2 1 TA 5
## 1462 1 1 3 1 Gd 6
## 1463 2 1 3 1 TA 6
## 1464 2 1 3 1 Gd 7
## 1465 2 0 2 1 Gd 5
## 1466 2 1 3 1 TA 7
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## 1461 Typ 0 <NA> Attchd 1961 Unf
## 1462 Typ 0 <NA> Attchd 1958 Unf
## 1463 Typ 1 TA Attchd 1997 Fin
## 1464 Typ 1 Gd Attchd 1998 Fin
## 1465 Typ 0 <NA> Attchd 1992 RFn
## 1466 Typ 1 TA Attchd 1993 Fin
## GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
## 1461 1 730 TA TA Y 140
## 1462 1 312 TA TA Y 393
## 1463 2 482 TA TA Y 212
## 1464 2 470 TA TA Y 360
## 1465 2 506 TA TA Y 0
## 1466 2 440 TA TA Y 157
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1461 0 0 0 120 0 <NA>
## 1462 36 0 0 0 0 <NA>
## 1463 34 0 0 0 0 <NA>
## 1464 36 0 0 0 0 <NA>
## 1465 82 0 0 144 0 <NA>
## 1466 84 0 0 0 0 <NA>
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1461 MnPrv <NA> 0 6 2010 WD Normal
## 1462 <NA> Gar2 12500 6 2010 WD Normal
## 1463 MnPrv <NA> 0 3 2010 WD Normal
## 1464 <NA> <NA> 0 6 2010 WD Normal
## 1465 <NA> <NA> 0 1 2010 WD Normal
## 1466 <NA> <NA> 0 4 2010 WD Normal
# apply the model to the test dataset
pred_bic = predict(bestsubset, test)
# impute median for NAs
pred_bic[is.na(pred_bic)] <- median(train$SalePrice)
# shift predictions so the minimum is zero (a crude guard against negative predicted prices; note this lowers every prediction by the same amount)
pred_bic <- pred_bic - min(pred_bic)
# store as data frames
pred_bic <- data.frame(Id = names(pred_bic), SalePrice = pred_bic)
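Before writing the file, a quick structural check that it matches Kaggle's expected two-column Id/SalePrice submission format:
# sanity check: no missing predictions and no negative prices
head(pred_bic)
stopifnot(!any(is.na(pred_bic$SalePrice)), all(pred_bic$SalePrice >= 0))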
The model's predictions are output to a csv file and submitted to the Kaggle competition under the username jim lung; the resulting score is 0.46252.
# store results
write.csv(pred_bic, 'jimlungBIC.csv', row.names = FALSE)