All of the files and associated data for this final can be found on my github here. The finished product can be found on here on rupbs. The required youtube video can be found here and my kaggle account can be found here.
Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.
Here I load the data, choose my X & Y and compute x & y.
#saved the data from the exam doc to my github as .csv and load it form there
p1.data = read.table("https://raw.githubusercontent.com/plb2018/DATA605/master/problem1Data.csv"
,sep=",",
header=T)
#display a few rows for inspection
kable(head(p1.data,5))| Y1 | Y2 | Y3 | Y4 | X1 | X2 | X3 | X4 |
|---|---|---|---|---|---|---|---|
| 20.3 | 20.8 | 28.4 | 20.2 | 9.3 | 7.4 | 9.5 | 9.3 |
| 19.1 | 14.6 | 21.5 | 18.6 | 4.1 | 6.4 | 3.7 | 12.4 |
| 19.3 | 18.0 | 20.8 | 22.6 | 22.4 | 8.5 | 11.7 | 19.9 |
| 20.9 | 7.3 | 22.2 | 11.4 | 9.1 | 9.5 | 7.4 | 6.9 |
| 22.0 | 19.4 | 21.6 | 23.6 | 15.8 | 11.8 | 5.3 | -1.0 |
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
A few general/theoretical things that will be helpful for a-c:
\[ P(X>x) = 0.25\\ P(X<x) = 0.75\\ P(Y>y) = 0.75 \\ P(Y<y) = 0.25 \\ P(X>x \cap Y>y) = P(X>x) P(Y>y) =0.1875\\ P(X<x \cap Y>y) = P(X<x) P(Y>y) =0.5625\\ P(X>x \cup Y>y) = P(X>x) + P(Y>y) - P(X>x \cap Y>y) = 0.8125 \]
Stated in words, we’re looking for the probability that X>x given that Y>y. Theoretically, we expect 25% of X values to be > the 3rd quartile (X>x) and 75% of Y values to be above the 1st quartile based on our knowledge of what quartiles represent, so the theoretical answer is:
\[ P(X>x|Y>y) = \frac{P(X>x \cap Y>y)}{P(Y>y)} = \frac{0.1875}{0.75}\\ P(X>x|Y>y) = 0.25 \]
Next we’ll check computationally. I realize that in this case, the set is so small enough that this knid of bootstrapping isn’t nessecary, but I figured that I would attempt this one a few different ways anyway. If we run the test using the actual data we get the following:
itr <- 50000
a.obs <- 0
b.obs <- 0
for (i in 1:itr){
a <- sample(X,1)
b<- sample(Y,1)
if (b >y){
b.obs <- b.obs+1
if (a > x) {
a.obs <- a.obs+1
}
}
}
a.obs /b.obs## [1] 0.2516456
And if we run using a random probabilities we get:
samples <- 50000
trials <- matrix(runif(samples*2),nrow=samples,ncol=2)
t2 <- trials[trials[,1] > 0.25,,drop=F] #subset cases that match the first condition
t3 <- trials[trials[,2] > 0.75,,drop=F] #cases that match both conditions
nrow(t3)/samples## [1] 0.25066
We can see that the all three methods yield approximately the same result of 25%.
Stated in words, we’re looking for the probability that both X>x and Y>y co-occur, or the intersection. Once again, the theoretical answer, based on quartiles is:
\[ P(X>x,Y>y) = P(X>x) P(Y>y) = 0.75*0.25 = 0.1875\\ \] We can check this another way by checking if the following relationship holds also:
\[ P(X>x,Y>y) = P(X>x|Y>y)P(X>x) = 0.25*0.75 = 0.1875\\ \]
In this case, we’ll skip the bootstrapping as it’s excessive here and just to straight to the emperical answer:
## [1] 0.2
We see that the answer is 20%, Which is approximately equal to the emperical solution of 18.75%
And if we run using a random probabilities we get:
samples <- 50000
trials <- matrix(runif(samples*2),nrow=samples,ncol=2)
t2 <- trials[trials[,1] > 0.25 & trials[,2] > 0.75 ,,drop=F] #cases that match both conditions
dim(t2)[1]/samples## [1] 0.18838
And based on this, we can say that 0.1875 appears to be correct
Similar to part a, but now we’re looking for the probability that \(X<x\) given that \(Y>y\)
\[ P(X<x|Y>y) = \frac{P(X<x \cap Y>y)}{P(Y>y)} = \frac{0.5625}{0.75}\\ P(X<x|Y>y) = 0.75\\ \]
And then we check computationally. If we run the test using the actual data we get the following:
## [1] 0.7333333
And if we run using a random probabilities we get:
samples <- 50000
trials <- matrix(runif(samples*2),nrow=samples,ncol=2)
t2 <- trials[trials[,1] > 0.25,,drop=F] #subset cases that match the first condition Y>y (>1st quartile)
t3 <- t2[t2[,2] < 0.75,,drop=F] #cases that match both conditions
nrow(t3)/nrow(t2)## [1] 0.7499402
Once again, we see that we get 75%.
#get the data into a dataframe so we can subselect
d <- data.frame(X,Y)
#set up al the cases (done manually so as not to confuse myself)
c1 <- dim(d[d$X <= quantile(X, 0.25) & d$Y <= quantile(Y, 0.75),])[1]
c2 <- dim(d[d$X > quantile(X, 0.25) & d$Y <= quantile(Y, 0.75),])[1]
c3 <- c1+c2
c4 <- dim(d[d$X <= quantile(X, 0.25) & d$Y > quantile(Y, 0.75),])[1]
c5 <- dim(d[d$X > quantile(X, 0.25) & d$Y > quantile(Y, 0.75),])[1]
c6 <- c4+c5
c7 <- c1+c4
c8 <- c2+c5
c9 <- c3+c6
#reshape and put it into a dataframe
c <- matrix(c(c1,c2,c3,c4,c5,c6,c7,c8,c9),nrow=3)
prob.table <- data.frame(c)
#rename cols and rows
colnames <- c('<= 3rd quartile','> 3rd quartile','Total')
rownames <- c('<= 1st quartile','>1st quartile','Total')
rownames(prob.table) <- rownames
colnames(prob.table) <- colnames
#plot
kable(prob.table)| <= 3rd quartile | > 3rd quartile | Total | |
|---|---|---|---|
| <= 1st quartile | 5 | 0 | 5 |
| >1st quartile | 11 | 4 | 15 |
| Total | 16 | 4 | 20 |
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
If we run the actual numbers, we get the following:
A <- length(X[X>quantile(X, 0.25)]) /length(X)
B <- length(Y[Y>quantile(Y, 0.25)]) / length(Y)
AB <- nrow(XY[XY$X > quantile(X, 0.25) & XY$Y >quantile(Y, 0.25) ,])/nrow(XY)
cbind(A,B,AB)## A B AB
## [1,] 0.75 0.75 0.6
## [1] FALSE
We see that the probability of \(P(A)P(B)\) does not equal \(P(AB)\). This suggests that the data are NOT independent, however, in practice, I think it would be hard to tell with a sample size of 20 observations per variable.
In the interest of thoroughness I’ll run the test for \((X,Y)\) and \((A,B)\)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: X and Y
## X-squared = 275, df = NA, p-value = 0.2449
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: X[X > quantile(X, 0.25)] and Y[Y > quantile(Y, 0.25)]
## X-squared = 157.5, df = NA, p-value = 0.2829
Given the large number of degrees of freedom and correspondingly small p-values, we can reject the null hypothesis and conclude that that the variables appear to be dependent.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition and do the following:
#Load kaggle data from github
p2.train = read.table("https://raw.githubusercontent.com/plb2018/DATA605/master/kaggle_train.csv"
,sep=",",
header=T,
stringsAsFactors = T)
p2.test = read.table("https://raw.githubusercontent.com/plb2018/DATA605/master/kaggle_test.csv"
,sep=",",
header=T)
#take a quick peek @ the data
kable(head(p2.train[,1:5],5))| Id | MSSubClass | MSZoning | LotFrontage | LotArea |
|---|---|---|---|---|
| 1 | 60 | RL | 65 | 8450 |
| 2 | 20 | RL | 80 | 9600 |
| 3 | 60 | RL | 68 | 11250 |
| 4 | 70 | RL | 60 | 9550 |
| 5 | 60 | RL | 84 | 14260 |
| LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 |
|---|---|---|---|---|
| Inside | Gtl | NAmes | Feedr | Norm |
| Corner | Gtl | NAmes | Norm | Norm |
| Inside | Gtl | Gilbert | Norm | Norm |
| Inside | Gtl | Gilbert | Norm | Norm |
| Inside | Gtl | StoneBr | Norm | Norm |
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Univariate and descriptive stats:
p2.workingData <- p2.train[c('LotArea','GrLivArea','SalePrice')]
kable(round(describe(p2.train)[c(2,3,4,7,8,9,10,11,12,13)],2)) | n | mean | sd | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|
| Id | 1460 | 730.50 | 421.61 | 541.15 | 1 | 1460 | 1459 | 0.00 | -1.20 | 11.03 |
| MSSubClass | 1460 | 56.90 | 42.30 | 44.48 | 20 | 190 | 170 | 1.40 | 1.56 | 1.11 |
| MSZoning* | 1460 | 4.03 | 0.63 | 0.00 | 1 | 5 | 4 | -1.73 | 6.25 | 0.02 |
| LotFrontage | 1201 | 70.05 | 24.28 | 16.31 | 21 | 313 | 292 | 2.16 | 17.34 | 0.70 |
| LotArea | 1460 | 10516.83 | 9981.26 | 2962.23 | 1300 | 215245 | 213945 | 12.18 | 202.26 | 261.22 |
| Street* | 1460 | 2.00 | 0.06 | 0.00 | 1 | 2 | 1 | -15.49 | 238.01 | 0.00 |
| Alley* | 91 | 1.45 | 0.50 | 0.00 | 1 | 2 | 1 | 0.20 | -1.98 | 0.05 |
| LotShape* | 1460 | 2.94 | 1.41 | 0.00 | 1 | 4 | 3 | -0.61 | -1.60 | 0.04 |
| LandContour* | 1460 | 3.78 | 0.71 | 0.00 | 1 | 4 | 3 | -3.16 | 8.65 | 0.02 |
| Utilities* | 1460 | 1.00 | 0.03 | 0.00 | 1 | 2 | 1 | 38.13 | 1453.00 | 0.00 |
| LotConfig* | 1460 | 4.02 | 1.62 | 0.00 | 1 | 5 | 4 | -1.13 | -0.59 | 0.04 |
| LandSlope* | 1460 | 1.06 | 0.28 | 0.00 | 1 | 3 | 2 | 4.80 | 24.47 | 0.01 |
| Neighborhood* | 1460 | 13.15 | 5.89 | 7.41 | 1 | 25 | 24 | 0.02 | -1.06 | 0.15 |
| Condition1* | 1460 | 3.03 | 0.87 | 0.00 | 1 | 9 | 8 | 3.01 | 16.34 | 0.02 |
| Condition2* | 1460 | 3.01 | 0.26 | 0.00 | 1 | 8 | 7 | 13.14 | 247.54 | 0.01 |
| BldgType* | 1460 | 1.49 | 1.20 | 0.00 | 1 | 5 | 4 | 2.24 | 3.41 | 0.03 |
| HouseStyle* | 1460 | 4.04 | 1.91 | 1.48 | 1 | 8 | 7 | 0.31 | -0.96 | 0.05 |
| OverallQual | 1460 | 6.10 | 1.38 | 1.48 | 1 | 10 | 9 | 0.22 | 0.09 | 0.04 |
| OverallCond | 1460 | 5.58 | 1.11 | 0.00 | 1 | 9 | 8 | 0.69 | 1.09 | 0.03 |
| YearBuilt | 1460 | 1971.27 | 30.20 | 37.06 | 1872 | 2010 | 138 | -0.61 | -0.45 | 0.79 |
| YearRemodAdd | 1460 | 1984.87 | 20.65 | 19.27 | 1950 | 2010 | 60 | -0.50 | -1.27 | 0.54 |
| RoofStyle* | 1460 | 2.41 | 0.83 | 0.00 | 1 | 6 | 5 | 1.47 | 0.61 | 0.02 |
| RoofMatl* | 1460 | 2.08 | 0.60 | 0.00 | 1 | 8 | 7 | 8.09 | 66.28 | 0.02 |
| Exterior1st* | 1460 | 10.62 | 3.20 | 1.48 | 1 | 15 | 14 | -0.72 | -0.37 | 0.08 |
| Exterior2nd* | 1460 | 11.34 | 3.54 | 2.97 | 1 | 16 | 15 | -0.69 | -0.52 | 0.09 |
| MasVnrType* | 1452 | 2.76 | 0.62 | 0.00 | 1 | 4 | 3 | -0.07 | -0.13 | 0.02 |
| MasVnrArea | 1452 | 103.69 | 181.07 | 0.00 | 0 | 1600 | 1600 | 2.66 | 10.03 | 4.75 |
| ExterQual* | 1460 | 3.54 | 0.69 | 0.00 | 1 | 4 | 3 | -1.83 | 3.86 | 0.02 |
| ExterCond* | 1460 | 4.73 | 0.73 | 0.00 | 1 | 5 | 4 | -2.56 | 5.29 | 0.02 |
| Foundation* | 1460 | 2.40 | 0.72 | 1.48 | 1 | 6 | 5 | 0.09 | 1.02 | 0.02 |
| BsmtQual* | 1423 | 3.26 | 0.87 | 1.48 | 1 | 4 | 3 | -1.31 | 1.27 | 0.02 |
| BsmtCond* | 1423 | 3.81 | 0.66 | 0.00 | 1 | 4 | 3 | -3.39 | 10.14 | 0.02 |
| BsmtExposure* | 1422 | 3.27 | 1.15 | 0.00 | 1 | 4 | 3 | -1.15 | -0.39 | 0.03 |
| BsmtFinType1* | 1423 | 3.73 | 1.83 | 2.97 | 1 | 6 | 5 | -0.02 | -1.39 | 0.05 |
| BsmtFinSF1 | 1460 | 443.64 | 456.10 | 568.58 | 0 | 5644 | 5644 | 1.68 | 11.06 | 11.94 |
| BsmtFinType2* | 1422 | 5.71 | 0.94 | 0.00 | 1 | 6 | 5 | -3.56 | 12.32 | 0.02 |
| BsmtFinSF2 | 1460 | 46.55 | 161.32 | 0.00 | 0 | 1474 | 1474 | 4.25 | 20.01 | 4.22 |
| BsmtUnfSF | 1460 | 567.24 | 441.87 | 426.99 | 0 | 2336 | 2336 | 0.92 | 0.46 | 11.56 |
| TotalBsmtSF | 1460 | 1057.43 | 438.71 | 347.67 | 0 | 6110 | 6110 | 1.52 | 13.18 | 11.48 |
| Heating* | 1460 | 2.04 | 0.30 | 0.00 | 1 | 6 | 5 | 9.83 | 110.98 | 0.01 |
| HeatingQC* | 1460 | 2.54 | 1.74 | 0.00 | 1 | 5 | 4 | 0.48 | -1.51 | 0.05 |
| CentralAir* | 1460 | 1.93 | 0.25 | 0.00 | 1 | 2 | 1 | -3.52 | 10.42 | 0.01 |
| Electrical* | 1459 | 4.68 | 1.05 | 0.00 | 1 | 5 | 4 | -3.06 | 7.49 | 0.03 |
| X1stFlrSF | 1460 | 1162.63 | 386.59 | 347.67 | 334 | 4692 | 4358 | 1.37 | 5.71 | 10.12 |
| X2ndFlrSF | 1460 | 346.99 | 436.53 | 0.00 | 0 | 2065 | 2065 | 0.81 | -0.56 | 11.42 |
| LowQualFinSF | 1460 | 5.84 | 48.62 | 0.00 | 0 | 572 | 572 | 8.99 | 82.83 | 1.27 |
| GrLivArea | 1460 | 1515.46 | 525.48 | 483.33 | 334 | 5642 | 5308 | 1.36 | 4.86 | 13.75 |
| BsmtFullBath | 1460 | 0.43 | 0.52 | 0.00 | 0 | 3 | 3 | 0.59 | -0.84 | 0.01 |
| BsmtHalfBath | 1460 | 0.06 | 0.24 | 0.00 | 0 | 2 | 2 | 4.09 | 16.31 | 0.01 |
| FullBath | 1460 | 1.57 | 0.55 | 0.00 | 0 | 3 | 3 | 0.04 | -0.86 | 0.01 |
| HalfBath | 1460 | 0.38 | 0.50 | 0.00 | 0 | 2 | 2 | 0.67 | -1.08 | 0.01 |
| BedroomAbvGr | 1460 | 2.87 | 0.82 | 0.00 | 0 | 8 | 8 | 0.21 | 2.21 | 0.02 |
| KitchenAbvGr | 1460 | 1.05 | 0.22 | 0.00 | 0 | 3 | 3 | 4.48 | 21.42 | 0.01 |
| KitchenQual* | 1460 | 3.34 | 0.83 | 0.00 | 1 | 4 | 3 | -1.42 | 1.72 | 0.02 |
| TotRmsAbvGrd | 1460 | 6.52 | 1.63 | 1.48 | 2 | 14 | 12 | 0.67 | 0.87 | 0.04 |
| Functional* | 1460 | 6.75 | 0.98 | 0.00 | 1 | 7 | 6 | -4.08 | 16.37 | 0.03 |
| Fireplaces | 1460 | 0.61 | 0.64 | 1.48 | 0 | 3 | 3 | 0.65 | -0.22 | 0.02 |
| FireplaceQu* | 770 | 3.73 | 1.13 | 1.48 | 1 | 5 | 4 | -0.16 | -0.98 | 0.04 |
| GarageType* | 1379 | 3.28 | 1.79 | 0.00 | 1 | 6 | 5 | 0.76 | -1.30 | 0.05 |
| GarageYrBlt | 1379 | 1978.51 | 24.69 | 31.13 | 1900 | 2010 | 110 | -0.65 | -0.42 | 0.66 |
| GarageFinish* | 1379 | 2.18 | 0.81 | 1.48 | 1 | 3 | 2 | -0.35 | -1.41 | 0.02 |
| GarageCars | 1460 | 1.77 | 0.75 | 0.00 | 0 | 4 | 4 | -0.34 | 0.21 | 0.02 |
| GarageArea | 1460 | 472.98 | 213.80 | 177.91 | 0 | 1418 | 1418 | 0.18 | 0.90 | 5.60 |
| GarageQual* | 1379 | 4.86 | 0.61 | 0.00 | 1 | 5 | 4 | -4.43 | 18.25 | 0.02 |
| GarageCond* | 1379 | 4.90 | 0.52 | 0.00 | 1 | 5 | 4 | -5.28 | 26.77 | 0.01 |
| PavedDrive* | 1460 | 2.86 | 0.50 | 0.00 | 1 | 3 | 2 | -3.30 | 9.22 | 0.01 |
| WoodDeckSF | 1460 | 94.24 | 125.34 | 0.00 | 0 | 857 | 857 | 1.54 | 2.97 | 3.28 |
| OpenPorchSF | 1460 | 46.66 | 66.26 | 37.06 | 0 | 547 | 547 | 2.36 | 8.44 | 1.73 |
| EnclosedPorch | 1460 | 21.95 | 61.12 | 0.00 | 0 | 552 | 552 | 3.08 | 10.37 | 1.60 |
| X3SsnPorch | 1460 | 3.41 | 29.32 | 0.00 | 0 | 508 | 508 | 10.28 | 123.06 | 0.77 |
| ScreenPorch | 1460 | 15.06 | 55.76 | 0.00 | 0 | 480 | 480 | 4.11 | 18.34 | 1.46 |
| PoolArea | 1460 | 2.76 | 40.18 | 0.00 | 0 | 738 | 738 | 14.80 | 222.19 | 1.05 |
| PoolQC* | 7 | 2.14 | 0.90 | 1.48 | 1 | 3 | 2 | -0.22 | -1.90 | 0.34 |
| Fence* | 281 | 2.43 | 0.86 | 0.00 | 1 | 4 | 3 | -0.57 | -0.88 | 0.05 |
| MiscFeature* | 54 | 2.91 | 0.45 | 0.00 | 1 | 4 | 3 | -2.93 | 10.71 | 0.06 |
| MiscVal | 1460 | 43.49 | 496.12 | 0.00 | 0 | 15500 | 15500 | 24.43 | 697.64 | 12.98 |
| MoSold | 1460 | 6.32 | 2.70 | 2.97 | 1 | 12 | 11 | 0.21 | -0.41 | 0.07 |
| YrSold | 1460 | 2007.82 | 1.33 | 1.48 | 2006 | 2010 | 4 | 0.10 | -1.19 | 0.03 |
| SaleType* | 1460 | 8.51 | 1.56 | 0.00 | 1 | 9 | 8 | -3.83 | 14.57 | 0.04 |
| SaleCondition* | 1460 | 4.77 | 1.10 | 0.00 | 1 | 6 | 5 | -2.74 | 6.82 | 0.03 |
| SalePrice | 1460 | 180921.20 | 79442.50 | 56338.80 | 34900 | 755000 | 720100 | 1.88 | 6.50 | 2079.11 |
And we’ll also look at a few plots.
We see a lot of skew in these few quantitative variables (including the target variable) and we can mitigate it a bit by taking the log of both of variables. It’s not perfect, but my hunch is that it may be a better way to look at data to be used in a linear regression model.
Here I look at both linear and log scatter-plot matricies:
The scatter matrix shows that there is a strong positive linear relationship between Above Grade Living Aread (“GrLivArea”) and Sale price. The relationship between LotArea and the other variables appears to be slightly positive, but not strong. In addition, as we see above, if we look at the Log of the data, we can see a bit more.
Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
p2.corrData <- p2.train[c('LotArea','GrLivArea','TotalBsmtSF')]
cor.mat <- cor(p2.corrData)
corrplot(cor.mat)##
## Pearson's product-moment correlation
##
## data: p2.corrData$LotArea and p2.corrData$GrLivArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
##
## Pearson's product-moment correlation
##
## data: p2.corrData$LotArea and p2.corrData$TotalBsmtSF
## t = 10.317, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2292786 0.2918400
## sample estimates:
## cor
## 0.2608331
##
## Pearson's product-moment correlation
##
## data: p2.corrData$TotalBsmtSF and p2.corrData$GrLivArea
## t = 19.503, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4278380 0.4810855
## sample estimates:
## cor
## 0.4548682
Above we test each pariwise correlation to see if it is significantly different from zero with each case having the formulation: \[ H_0 : r=0 \\ H_a : r \ne 0 \]
And we see that in all 3 cases, the correlation is significantly different from zero at 80% confidence. Practically, this means that there is some relationship (though not nessecarily causal) between these variables on a pairwise basis. This makes intuitive sense given that they are all house-area-related variables and that bigger houses likely have bigger living-space, bigger basements etc. I believe that based on the family-wise error rate formula (FWER) shown below, we should be concerned here as the probabilty is nearly 50% (0.488) that we have at least 1 false conclusion:
\(FWER \leq 1 - (1-\alpha)^c\) where \(\alpha\) is the significance of the tests and \(c\) is the number of comparisons performed. In this case, we get a value of:
## [1] 0.488
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Here we invert the correlation matrix to create a precision matrix and multiply them to see whether we get the same answer independent of order (we don’t - as expected):
| LotArea | GrLivArea | TotalBsmtSF | |
|---|---|---|---|
| LotArea | 1.1041806 | -0.2011394 | -0.1965150 |
| GrLivArea | -0.2011394 | 1.2975230 | -0.5377381 |
| TotalBsmtSF | -0.1965150 | -0.5377381 | 1.2958576 |
## LotArea GrLivArea TotalBsmtSF
## LotArea 1.000000e+00 0.000000e+00 0
## GrLivArea -1.387779e-17 1.000000e+00 0
## TotalBsmtSF 2.775558e-17 1.110223e-16 1
## LotArea GrLivArea TotalBsmtSF
## LotArea 1.000000e+00 -4.163336e-17 -2.775558e-17
## GrLivArea 0.000000e+00 1.000000e+00 1.110223e-16
## TotalBsmtSF 5.551115e-17 0.000000e+00 1.000000e+00
## [1] FALSE
It’s not clear to me which of the above matricies to run the LU Decomposition on, so I’ve written general code and will run it on all three!
LU <- function(U){
colnames(U) <- NULL
rownames(U) <- NULL
L = diag(x = 1, ncol = ncol(U), nrow = nrow(U))
for (row in 1:dim(U)[1]){
col = 1
while (col< row) {
L[row,col] <- U[row,col] / U[col,col]
U[row,] <- -1 * U[row,col]/U[col,col] * U[col,] + U[row,]
col = col+1
}
}
return(list('L' = L, 'U' = U))
}
corMat.LU <- LU(data.matrix(cor.mat))
cXp.LU <- LU(data.matrix(cXp))
pXc.LU<- LU(data.matrix(pXc))And we display the 3 outputs:
| V1 | V2 | V3 |
|---|---|---|
| 1 | 0.2631162 | 0.2608331 |
| 0 | 0.9307699 | 0.3862388 |
| 0 | 0.0000000 | 0.7716897 |
| V1 | V2 | V3 |
|---|---|---|
| 1.0000000 | 0.000000 | 0 |
| 0.2631162 | 1.000000 | 0 |
| 0.2608331 | 0.414967 | 1 |
| V1 | V2 | V3 |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| V1 | V2 | V3 |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| V1 | V2 | V3 |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| V1 | V2 | V3 |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of ??? for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, ???)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
I chose to look at the sale price, which has a reasonable skew
library(MASS)
#select a var with right skew
r.skew <- (p2.train$SalePrice)
#plot and make sure
hist(r.skew,main="Hist of SalePrice Showing Skew")##Fit an Exponential Distribution & Create Histogram
#fit and generate 100 samples
dist.fit <- fitdistr(r.skew,densfun = "exponential")
samples <- rexp(1000,dist.fit$estimate)
#Compare histograms
hist(r.skew, col=rgb(1,0,0,0.5), breaks = 15, main="Emperical vs Simulated")
hist(samples, col=rgb(0,0,1,0.5),breaks = 15, add=T)
box()
legend("topright", legend=c("Emperical", "Simulated"),
col=c("red", "blue"), lty=1:2, cex=0.8)In addition to the required metrics, I have also shows summary stats as I think they are helpful here:
# simulated
sim.5th <- log(1 - .05)/-dist.fit$estimate
sim.95th <- log(1 - .95)/-dist.fit$estimate
#emperical
d.5th <- quantile(r.skew, 0.05)
d.95th <- quantile(r.skew, 0.95)
#reshape and put it into a dataframe
c <- matrix(c(sim.5th,sim.95th,d.5th,d.95th),nrow=2)
prob.table <- data.frame(c)
#rename cols and rows
colnames <- c('5th','95th')
rownames <- c('Sim','Emperical')
rownames(prob.table) <- rownames
colnames(prob.table) <- colnames
kable(prob.table)| 5th | 95th | |
|---|---|---|
| Sim | 9280.044 | 88000 |
| Emperical | 541991.465 | 326100 |
## Estimate CI lower CI upper Std. Error
## 180921.196 176842.841 184999.551 2079.105
## vars n mean sd median trimmed mad min max range
## X1 1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
## skew kurtosis se
## X1 1.88 6.5 2079.11
## vars n mean sd median trimmed mad min max
## X1 1 1000 174523 170357.4 121842.6 147104.8 121643.1 125.26 1291094
## range skew kurtosis se
## X1 1290968 2 5.84 5387.17
If we compare the data for the simulation agains the emperical data, we see that we do not get an ideal fit. The emperical data has a similar mean to the simulated data, but a much higher standard deviation and median. Looking at the histograms, we can see a distinct difference in the shapes of the distributions (simulated dist has much more mass towards zero).
Above we see a the 95% confidence interval which indicated the range in which we can be 95% confident that the mean of the population will be found. Given the nature of the distribution (clearly not normal) I would suggest staying away from the assumption of normality in this case and using something like bootstrapping instead.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
First we’ll run a regression on all the data and take a look at the result. We want to eliminate variables to the extent that we can.
#drop anything that contains an NA
df <- p2.train[ , apply(p2.train, 2, function(x) !any(is.na(x)))]
#train a model and look @ the resuls
m1 <- lm(SalePrice ~ ., data = df)##
## Call:
## lm(formula = SalePrice ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174399 -10591 13 9618 174399
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.181e+06 1.062e+06 -1.112 0.266296
## Id 6.688e-01 1.595e+00 0.419 0.675090
## MSSubClass -8.564e+00 8.528e+01 -0.100 0.920028
## MSZoningFV 3.099e+04 1.219e+04 2.543 0.011112 *
## MSZoningRH 2.391e+04 1.226e+04 1.950 0.051452 .
## MSZoningRL 2.586e+04 1.044e+04 2.477 0.013396 *
## MSZoningRM 2.495e+04 9.771e+03 2.554 0.010770 *
## LotArea 7.040e-01 1.080e-01 6.521 1.00e-10 ***
## StreetPave 3.885e+04 1.226e+04 3.170 0.001562 **
## LotShapeIR2 4.532e+03 4.313e+03 1.051 0.293556
## LotShapeIR3 3.422e+03 8.796e+03 0.389 0.697299
## LotShapeReg 5.598e+02 1.661e+03 0.337 0.736120
## LandContourHLS 1.374e+04 5.276e+03 2.604 0.009323 **
## LandContourLow -4.127e+03 6.507e+03 -0.634 0.526100
## LandContourLvl 7.273e+03 3.801e+03 1.914 0.055880 .
## UtilitiesNoSeWa -2.945e+04 2.638e+04 -1.116 0.264602
## LotConfigCulDSac 7.631e+03 3.321e+03 2.298 0.021735 *
## LotConfigFR2 -5.860e+03 4.152e+03 -1.412 0.158314
## LotConfigFR3 -1.357e+04 1.306e+04 -1.039 0.298978
## LotConfigInside -1.262e+03 1.803e+03 -0.700 0.483916
## LandSlopeMod 1.047e+04 4.030e+03 2.597 0.009506 **
## LandSlopeSev -2.561e+04 1.107e+04 -2.315 0.020792 *
## NeighborhoodBlueste -2.677e+03 1.932e+04 -0.139 0.889813
## NeighborhoodBrDale 8.298e+03 1.111e+04 0.747 0.455201
## NeighborhoodBrkSide -1.896e+03 9.478e+03 -0.200 0.841466
## NeighborhoodClearCr -1.285e+04 9.405e+03 -1.366 0.172043
## NeighborhoodCollgCr -9.730e+03 7.324e+03 -1.329 0.184239
## NeighborhoodCrawfor 9.549e+03 8.656e+03 1.103 0.270190
## NeighborhoodEdwards -1.678e+04 8.065e+03 -2.080 0.037700 *
## NeighborhoodGilbert -1.384e+04 7.839e+03 -1.766 0.077617 .
## NeighborhoodIDOTRR -7.324e+03 1.082e+04 -0.677 0.498597
## NeighborhoodMeadowV -1.542e+03 1.138e+04 -0.135 0.892294
## NeighborhoodMitchel -2.041e+04 8.266e+03 -2.470 0.013651 *
## NeighborhoodNAmes -1.452e+04 7.890e+03 -1.840 0.066022 .
## NeighborhoodNoRidge 2.876e+04 8.389e+03 3.428 0.000628 ***
## NeighborhoodNPkVill 8.088e+03 1.431e+04 0.565 0.572103
## NeighborhoodNridgHt 2.463e+04 7.367e+03 3.343 0.000853 ***
## NeighborhoodNWAmes -2.063e+04 8.138e+03 -2.535 0.011360 *
## NeighborhoodOldTown -1.298e+04 9.658e+03 -1.344 0.179227
## NeighborhoodSawyer -1.021e+04 8.217e+03 -1.243 0.214194
## NeighborhoodSawyerW -6.179e+03 7.842e+03 -0.788 0.430921
## NeighborhoodSomerst 1.716e+01 8.966e+03 0.002 0.998473
## NeighborhoodStoneBr 3.896e+04 8.372e+03 4.653 3.61e-06 ***
## NeighborhoodSWISU -9.568e+03 9.813e+03 -0.975 0.329727
## NeighborhoodTimber -6.004e+03 8.359e+03 -0.718 0.472682
## NeighborhoodVeenker 3.093e+03 1.072e+04 0.289 0.772968
## Condition1Feedr 2.991e+03 5.091e+03 0.587 0.556973
## Condition1Norm 1.220e+04 4.196e+03 2.907 0.003716 **
## Condition1PosA 7.495e+03 1.028e+04 0.729 0.466139
## Condition1PosN 8.064e+03 7.601e+03 1.061 0.288931
## Condition1RRAe -1.693e+04 9.358e+03 -1.809 0.070679 .
## Condition1RRAn 6.448e+03 7.001e+03 0.921 0.357203
## Condition1RRNe -7.142e+03 1.835e+04 -0.389 0.697190
## Condition1RRNn 3.887e+03 1.308e+04 0.297 0.766416
## Condition2Feedr -9.306e+03 2.302e+04 -0.404 0.686081
## Condition2Norm -7.438e+03 1.959e+04 -0.380 0.704237
## Condition2PosA 2.017e+04 3.796e+04 0.532 0.595162
## Condition2PosN -2.303e+05 2.757e+04 -8.352 < 2e-16 ***
## Condition2RRAe -1.289e+05 4.673e+04 -2.757 0.005908 **
## Condition2RRAn -1.206e+04 3.189e+04 -0.378 0.705223
## Condition2RRNn -8.647e+03 2.708e+04 -0.319 0.749501
## BldgType2fmCon -6.377e+03 1.285e+04 -0.496 0.619902
## BldgTypeDuplex -1.146e+03 7.443e+03 -0.154 0.877696
## BldgTypeTwnhs -2.552e+04 1.014e+04 -2.517 0.011966 *
## BldgTypeTwnhsE -2.334e+04 9.188e+03 -2.540 0.011201 *
## HouseStyle1.5Unf 1.116e+04 7.907e+03 1.412 0.158309
## HouseStyle1Story 8.920e+03 4.326e+03 2.062 0.039437 *
## HouseStyle2.5Fin -1.695e+04 1.228e+04 -1.380 0.167683
## HouseStyle2.5Unf -1.158e+04 9.366e+03 -1.236 0.216527
## HouseStyle2Story -6.378e+03 3.543e+03 -1.800 0.072066 .
## HouseStyleSFoyer 7.694e+03 6.183e+03 1.244 0.213590
## HouseStyleSLvl 7.340e+03 5.450e+03 1.347 0.178279
## OverallQual 8.014e+03 1.018e+03 7.874 7.29e-15 ***
## OverallCond 5.378e+03 8.692e+02 6.187 8.25e-10 ***
## YearBuilt 3.275e+02 7.360e+01 4.450 9.34e-06 ***
## YearRemodAdd 1.054e+02 5.523e+01 1.908 0.056590 .
## RoofStyleGable 1.177e+03 1.871e+04 0.063 0.949840
## RoofStyleGambrel 4.091e+03 2.047e+04 0.200 0.841666
## RoofStyleHip 2.769e+03 1.876e+04 0.148 0.882711
## RoofStyleMansard 1.712e+04 2.181e+04 0.785 0.432753
## RoofStyleShed 8.750e+04 3.547e+04 2.467 0.013772 *
## RoofMatlCompShg 6.493e+05 3.292e+04 19.724 < 2e-16 ***
## RoofMatlMembran 7.363e+05 4.765e+04 15.453 < 2e-16 ***
## RoofMatlMetal 6.963e+05 4.707e+04 14.793 < 2e-16 ***
## RoofMatlRoll 6.501e+05 4.150e+04 15.666 < 2e-16 ***
## RoofMatlTar&Grv 6.545e+05 3.784e+04 17.297 < 2e-16 ***
## RoofMatlWdShake 6.296e+05 3.666e+04 17.177 < 2e-16 ***
## RoofMatlWdShngl 7.269e+05 3.416e+04 21.275 < 2e-16 ***
## Exterior1stAsphShn -1.097e+04 3.386e+04 -0.324 0.745942
## Exterior1stBrkComm -1.132e+04 2.832e+04 -0.400 0.689358
## Exterior1stBrkFace 6.822e+03 1.251e+04 0.545 0.585724
## Exterior1stCBlock -2.809e+04 2.754e+04 -1.020 0.307801
## Exterior1stCemntBd -1.361e+04 1.917e+04 -0.710 0.477808
## Exterior1stHdBoard -1.243e+04 1.258e+04 -0.988 0.323327
## Exterior1stImStucc -6.780e+04 2.843e+04 -2.385 0.017224 *
## Exterior1stMetalSd -1.990e+03 1.448e+04 -0.137 0.890733
## Exterior1stPlywood -1.650e+04 1.245e+04 -1.325 0.185244
## Exterior1stStone -1.381e+04 2.414e+04 -0.572 0.567551
## Exterior1stStucco -3.653e+03 1.378e+04 -0.265 0.790949
## Exterior1stVinylSd -1.652e+04 1.318e+04 -1.253 0.210394
## Exterior1stWd Sdng -1.243e+04 1.204e+04 -1.032 0.302146
## Exterior1stWdShing -4.975e+03 1.304e+04 -0.381 0.702980
## Exterior2ndAsphShn 7.085e+03 2.264e+04 0.313 0.754349
## Exterior2ndBrk Cmn 1.407e+04 2.060e+04 0.683 0.494652
## Exterior2ndBrkFace -1.492e+03 1.319e+04 -0.113 0.909921
## Exterior2ndCBlock NA NA NA NA
## Exterior2ndCmentBd 1.237e+04 1.909e+04 0.648 0.517359
## Exterior2ndHdBoard 7.251e+03 1.234e+04 0.588 0.556776
## Exterior2ndImStucc 3.289e+04 1.431e+04 2.298 0.021710 *
## Exterior2ndMetalSd 2.294e+03 1.431e+04 0.160 0.872703
## Exterior2ndOther -6.655e+03 2.813e+04 -0.237 0.813060
## Exterior2ndPlywood 8.272e+03 1.198e+04 0.691 0.489868
## Exterior2ndStone -1.104e+04 1.720e+04 -0.642 0.520910
## Exterior2ndStucco 1.867e+03 1.356e+04 0.138 0.890471
## Exterior2ndVinylSd 1.596e+04 1.291e+04 1.236 0.216661
## Exterior2ndWd Sdng 9.997e+03 1.184e+04 0.845 0.398542
## Exterior2ndWd Shng 2.671e+03 1.237e+04 0.216 0.829035
## ExterQualFa -8.475e+03 1.085e+04 -0.781 0.434775
## ExterQualGd -3.089e+04 4.783e+03 -6.459 1.50e-10 ***
## ExterQualTA -3.078e+04 5.353e+03 -5.750 1.12e-08 ***
## ExterCondFa -2.694e+03 1.882e+04 -0.143 0.886199
## ExterCondGd -7.991e+03 1.799e+04 -0.444 0.656932
## ExterCondPo 1.198e+04 3.276e+04 0.366 0.714683
## ExterCondTA -5.285e+03 1.795e+04 -0.294 0.768536
## FoundationCBlock 1.834e+03 3.182e+03 0.576 0.564477
## FoundationPConc 4.853e+03 3.498e+03 1.387 0.165582
## FoundationSlab 8.600e+03 7.834e+03 1.098 0.272465
## FoundationStone 9.706e+02 1.087e+04 0.089 0.928841
## FoundationWood -3.335e+04 1.510e+04 -2.209 0.027325 *
## BsmtFinSF1 3.711e+01 4.409e+00 8.416 < 2e-16 ***
## BsmtFinSF2 2.455e+01 5.787e+00 4.242 2.38e-05 ***
## BsmtUnfSF 1.498e+01 4.060e+00 3.689 0.000234 ***
## TotalBsmtSF NA NA NA NA
## HeatingGasA -6.580e+03 2.537e+04 -0.259 0.795359
## HeatingGasW -1.513e+04 2.618e+04 -0.578 0.563344
## HeatingGrav -1.433e+04 2.747e+04 -0.522 0.601921
## HeatingOthW -4.540e+04 3.166e+04 -1.434 0.151839
## HeatingWall 9.296e+03 2.922e+04 0.318 0.750423
## HeatingQCFa -1.496e+03 4.798e+03 -0.312 0.755158
## HeatingQCGd -3.653e+03 2.139e+03 -1.708 0.087903 .
## HeatingQCPo 7.920e+03 2.743e+04 0.289 0.772847
## HeatingQCTA -4.355e+03 2.116e+03 -2.058 0.039822 *
## CentralAirY -3.452e+03 3.867e+03 -0.893 0.372264
## X1stFlrSF 5.513e+01 5.318e+00 10.367 < 2e-16 ***
## X2ndFlrSF 7.013e+01 5.243e+00 13.375 < 2e-16 ***
## LowQualFinSF 2.487e+01 1.859e+01 1.338 0.181237
## GrLivArea NA NA NA NA
## BsmtFullBath 1.541e+03 1.964e+03 0.785 0.432883
## BsmtHalfBath 4.014e+02 3.110e+03 0.129 0.897325
## FullBath 2.522e+03 2.237e+03 1.127 0.259762
## HalfBath -1.701e+02 2.133e+03 -0.080 0.936444
## BedroomAbvGr -5.506e+03 1.378e+03 -3.994 6.86e-05 ***
## KitchenAbvGr -1.580e+04 5.723e+03 -2.760 0.005862 **
## KitchenQualFa -2.103e+04 6.345e+03 -3.315 0.000943 ***
## KitchenQualGd -2.776e+04 3.482e+03 -7.972 3.45e-15 ***
## KitchenQualTA -2.528e+04 3.990e+03 -6.337 3.24e-10 ***
## TotRmsAbvGrd 1.348e+03 9.738e+02 1.384 0.166551
## FunctionalMaj2 2.507e+02 1.359e+04 0.018 0.985287
## FunctionalMin1 4.244e+03 8.617e+03 0.493 0.622430
## FunctionalMin2 8.527e+03 8.558e+03 0.996 0.319263
## FunctionalMod -7.214e+03 1.055e+04 -0.684 0.494057
## FunctionalSev -6.024e+04 2.753e+04 -2.188 0.028829 *
## FunctionalTyp 1.963e+04 7.389e+03 2.656 0.007999 **
## Fireplaces 2.774e+03 1.370e+03 2.025 0.043076 *
## GarageCars 4.294e+03 2.216e+03 1.937 0.052937 .
## GarageArea 1.304e+01 7.617e+00 1.712 0.087196 .
## PavedDriveP -3.287e+03 5.560e+03 -0.591 0.554567
## PavedDriveY -2.047e+03 3.436e+03 -0.596 0.551468
## WoodDeckSF 1.367e+01 5.947e+00 2.298 0.021728 *
## OpenPorchSF 1.215e+01 1.180e+01 1.029 0.303446
## EnclosedPorch 5.447e+00 1.277e+01 0.426 0.669822
## X3SsnPorch 2.446e+01 2.311e+01 1.058 0.290148
## ScreenPorch 3.706e+01 1.257e+01 2.948 0.003259 **
## PoolArea 7.097e+01 1.831e+01 3.876 0.000112 ***
## MiscVal -3.138e-01 1.466e+00 -0.214 0.830587
## MoSold -6.327e+02 2.531e+02 -2.499 0.012566 *
## YrSold -1.807e+02 5.236e+02 -0.345 0.730049
## SaleTypeCon 3.536e+04 1.835e+04 1.927 0.054204 .
## SaleTypeConLD 1.673e+04 9.998e+03 1.673 0.094580 .
## SaleTypeConLI 9.914e+03 1.190e+04 0.833 0.405035
## SaleTypeConLw -2.240e+03 1.239e+04 -0.181 0.856546
## SaleTypeCWD 2.307e+04 1.334e+04 1.730 0.083921 .
## SaleTypeNew 3.453e+04 1.602e+04 2.155 0.031316 *
## SaleTypeOth 1.831e+04 1.501e+04 1.220 0.222807
## SaleTypeWD 5.104e+02 4.330e+03 0.118 0.906193
## SaleConditionAdjLand 8.466e+03 1.448e+04 0.585 0.558816
## SaleConditionAlloca 5.075e+03 8.757e+03 0.580 0.562322
## SaleConditionFamily -1.410e+03 6.308e+03 -0.224 0.823156
## SaleConditionNormal 6.517e+03 2.971e+03 2.194 0.028439 *
## SaleConditionPartial -9.350e+03 1.544e+04 -0.606 0.544843
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23970 on 1273 degrees of freedom
## Multiple R-squared: 0.9206, Adjusted R-squared: 0.909
## F-statistic: 79.31 on 186 and 1273 DF, p-value: < 2.2e-16
From this we can see that it appears as though there are a few variables that are more important than others. We’ll collect those and work with them directly. It’s also worth noting that this model appears to predict a high amount of variability in the target variable (\(r^2\) > 0.9) but given the number of variables, we can be reasonably confident that there’s some overfitting going on here. It appears as though we should be able to cut about 75% of the original 80 variables
Importantly, I also transformed a few of teh quantitative variables, including the target variable, using the log() function as a linear model should perfom better post-transformation.
df.reduced <-df[c("LotArea","Street","LandContour","LotConfig","LandSlope",
"Neighborhood","Condition2","OverallQual","OverallCond",
"YearBuilt","RoofMatl","ExterQual","BsmtFinSF1","BsmtFinSF2",
"BsmtUnfSF","X1stFlrSF","X2ndFlrSF","KitchenAbvGr","KitchenQual",
"ScreenPorch","PoolArea","SalePrice")]
#some of these vars are pretty non-linear, so i'm going modify them:
df.reduced$LotArea <- log(df.reduced$LotArea)
df.reduced$SalePrice <- log(df.reduced$SalePrice)
m2 <- lm(SalePrice ~ ., data = df.reduced)##
## Call:
## lm(formula = SalePrice ~ ., data = df.reduced)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82329 -0.05441 0.00415 0.06361 0.75764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.514e-02 5.917e-01 0.026 0.979592
## LotArea 1.170e-01 1.041e-02 11.232 < 2e-16 ***
## StreetPave 1.161e-01 5.605e-02 2.072 0.038459 *
## LandContourHLS 5.497e-02 2.531e-02 2.171 0.030076 *
## LandContourLow 2.781e-03 3.070e-02 0.091 0.927827
## LandContourLvl 2.467e-02 1.800e-02 1.370 0.170875
## LotConfigCulDSac 6.854e-03 1.565e-02 0.438 0.661432
## LotConfigFR2 -3.483e-02 1.995e-02 -1.745 0.081136 .
## LotConfigFR3 -8.824e-02 6.394e-02 -1.380 0.167807
## LotConfigInside -8.415e-03 8.728e-03 -0.964 0.335163
## LandSlopeMod 1.944e-02 1.939e-02 1.003 0.316153
## LandSlopeSev -6.073e-02 4.425e-02 -1.372 0.170149
## NeighborhoodBlueste -3.823e-02 9.359e-02 -0.409 0.682963
## NeighborhoodBrDale -1.708e-01 4.558e-02 -3.748 0.000185 ***
## NeighborhoodBrkSide -7.933e-02 3.935e-02 -2.016 0.043958 *
## NeighborhoodClearCr -7.083e-02 4.371e-02 -1.620 0.105379
## NeighborhoodCollgCr -9.477e-02 3.366e-02 -2.816 0.004933 **
## NeighborhoodCrawfor 1.095e-02 3.963e-02 0.276 0.782329
## NeighborhoodEdwards -1.769e-01 3.644e-02 -4.854 1.35e-06 ***
## NeighborhoodGilbert -9.292e-02 3.625e-02 -2.563 0.010478 *
## NeighborhoodIDOTRR -2.169e-01 4.193e-02 -5.172 2.66e-07 ***
## NeighborhoodMeadowV -2.073e-01 4.504e-02 -4.603 4.55e-06 ***
## NeighborhoodMitchel -1.555e-01 3.822e-02 -4.068 5.00e-05 ***
## NeighborhoodNAmes -1.326e-01 3.526e-02 -3.761 0.000176 ***
## NeighborhoodNoRidge -6.513e-02 3.873e-02 -1.681 0.092898 .
## NeighborhoodNPkVill -5.624e-02 5.237e-02 -1.074 0.282969
## NeighborhoodNridgHt 2.035e-02 3.552e-02 0.573 0.566748
## NeighborhoodNWAmes -1.577e-01 3.680e-02 -4.284 1.96e-05 ***
## NeighborhoodOldTown -1.441e-01 3.840e-02 -3.752 0.000183 ***
## NeighborhoodSawyer -1.506e-01 3.742e-02 -4.025 6.01e-05 ***
## NeighborhoodSawyerW -1.343e-01 3.632e-02 -3.699 0.000225 ***
## NeighborhoodSomerst -5.725e-03 3.392e-02 -0.169 0.865978
## NeighborhoodStoneBr 2.249e-02 4.051e-02 0.555 0.578884
## NeighborhoodSWISU -5.824e-02 4.443e-02 -1.311 0.190146
## NeighborhoodTimber -8.420e-02 3.937e-02 -2.139 0.032623 *
## NeighborhoodVeenker -5.845e-02 5.083e-02 -1.150 0.250334
## Condition2Feedr 3.827e-02 1.038e-01 0.369 0.712360
## Condition2Norm 2.286e-02 9.007e-02 0.254 0.799717
## Condition2PosA 1.769e-01 1.563e-01 1.132 0.257893
## Condition2PosN -8.904e-01 1.278e-01 -6.968 4.94e-12 ***
## Condition2RRAe -7.202e-02 1.529e-01 -0.471 0.637699
## Condition2RRAn -1.033e-01 1.524e-01 -0.678 0.498001
## Condition2RRNn -5.669e-02 1.264e-01 -0.449 0.653781
## OverallQual 6.284e-02 4.673e-03 13.447 < 2e-16 ***
## OverallCond 5.243e-02 3.492e-03 15.015 < 2e-16 ***
## YearBuilt 3.542e-03 2.638e-04 13.425 < 2e-16 ***
## RoofMatlCompShg 2.819e+00 1.455e-01 19.381 < 2e-16 ***
## RoofMatlMembran 2.983e+00 1.984e-01 15.033 < 2e-16 ***
## RoofMatlMetal 3.000e+00 1.970e-01 15.227 < 2e-16 ***
## RoofMatlRoll 2.809e+00 1.910e-01 14.703 < 2e-16 ***
## RoofMatlTar&Grv 2.754e+00 1.495e-01 18.421 < 2e-16 ***
## RoofMatlWdShake 2.761e+00 1.566e-01 17.632 < 2e-16 ***
## RoofMatlWdShngl 2.894e+00 1.527e-01 18.953 < 2e-16 ***
## ExterQualFa -6.737e-02 4.715e-02 -1.429 0.153250
## ExterQualGd -1.984e-02 2.305e-02 -0.861 0.389401
## ExterQualTA -2.466e-02 2.564e-02 -0.962 0.336431
## BsmtFinSF1 1.897e-04 1.590e-05 11.933 < 2e-16 ***
## BsmtFinSF2 1.361e-04 2.490e-05 5.466 5.46e-08 ***
## BsmtUnfSF 8.806e-05 1.544e-05 5.702 1.44e-08 ***
## X1stFlrSF 2.887e-04 1.787e-05 16.163 < 2e-16 ***
## X2ndFlrSF 2.765e-04 1.048e-05 26.384 < 2e-16 ***
## KitchenAbvGr -6.725e-02 1.707e-02 -3.940 8.57e-05 ***
## KitchenQualFa -1.220e-01 2.933e-02 -4.160 3.38e-05 ***
## KitchenQualGd -5.369e-02 1.696e-02 -3.165 0.001585 **
## KitchenQualTA -8.287e-02 1.900e-02 -4.361 1.39e-05 ***
## ScreenPorch 2.469e-04 5.993e-05 4.119 4.03e-05 ***
## PoolArea 9.656e-05 8.689e-05 1.111 0.266630
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1228 on 1393 degrees of freedom
## Multiple R-squared: 0.9098, Adjusted R-squared: 0.9055
## F-statistic: 212.9 on 66 and 1393 DF, p-value: < 2.2e-16
We see a slight reduction in model performance, but it is likely worth it given the reduction in parameters. Next we’ll use look at a few visualizations of the residuals.
in.sample <- predict(m2,data=p2.train)
plot(cbind(exp(in.sample),p2.train$SalePrice), main = "In Sample Model Result")We can see that the a first check, the in-sample result appears to show a good fit (as indicated in the summary report) however, the distribution of residuals is a bitleptokurtic and the QQ-norm plot indicates that the residuals are not normally distributed (tails appear to be long here!). This indicates that the model likely doesn’t meet the assumptions for linear regression and as such, we don’t expect it to perform exceptionally well on kaggle. Based on my Kaggle performance, the model is much improved, however, using log-transformed data vs. using the raw data as is.
Next we run the model on the test dataset and create an output file which can be loaded to kaggle. We need to be careful to remember that our predictions will be in log-space and will need to be converted back in order to be coherent.
df.test <-p2.test[c("LotArea","Street","LandContour","LotConfig","LandSlope",
"Neighborhood","Condition2","OverallQual","OverallCond",
"YearBuilt","RoofMatl","ExterQual","BsmtFinSF1","BsmtFinSF2",
"BsmtUnfSF","X1stFlrSF","X2ndFlrSF","KitchenAbvGr","KitchenQual",
"ScreenPorch","PoolArea")]
#need to take the log of LotArea... and remember to exp() the prediction output!
df.test$LotArea <- log(df.test$LotArea)
prediction <- exp(predict(m2, newdata = df.test) )
prediction[is.na(prediction)] <- mean(prediction, na.rm = TRUE)
prediction.df <- as.data.frame(cbind(p2.test$Id,prediction))
colnames(prediction.df) <- c("Id","SalePrice")
write.csv(prediction.df, "data_605_kaggle.csv",row.names=F)The score for the model is ~0.134 which is not great but makes sense given the plots above and the high liklihood that the model doesn’t perfectly meet the assumptions of linear regression. This score puts me at about the 40% percentile, which is leaps and bounds better than my first attempt (97th percentile!). My results are available under the username plbcunyacct. Note also that I may try to improve upon this after the submission of this assignment, so my score may not be the same as reported here. See my progression below:
Kaggle History