#install.packages("GGally")
#install.packages('MASS')
suppressMessages(library(kableExtra))
suppressMessages(library(GGally))
suppressMessages(library(ggplot2))
suppressMessages(library(pracma))
## Warning: package 'pracma' was built under R version 3.5.2
suppressMessages(library(MASS))
## Warning: package 'MASS' was built under R version 3.5.2
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ=σ=(N+1)/2.
set.seed(101)
N <- 6
X <- runif(10000, 1, N)
Let’s take a look at the distribution of X
hist(X)
mu <- (N+1)/2
Y <- rnorm(10000, mean = mu)  # sd is left at its default of 1; the prompt's sigma = (N+1)/2 would require sd = mu
Let’s take a look at the distribution of Y
hist(Y)
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
x <- median(X)
x
## [1] 3.487115
y <- quantile(Y, 0.25)
y
## 25%
## 2.829464
To obtain the conditional probability P(X>x | X>y), we calculate P(X>x and X>y) and divide it by P(X>y).
#P(X>x and X>y)
P1<-sum(X>x & X>y)/10000
#P(X>y)
P2<-sum(X>y)/10000
round(P1/P2,3)
## [1] 0.785
#P(X>x and Y>y)
P3<-sum(X>x & Y>y)/10000
round(P3,3)
## [1] 0.378
#P(X<x and X>y)
P4<-sum((X<x) & (X>y))/10000
#P(X>y)
P2<-sum(X>y)/10000
round(P4/P2,3)
## [1] 0.215
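The three calculations above share one pattern: a conditional probability is the joint relative frequency divided by the relative frequency of the conditioning event. Here is a minimal sketch of that pattern, assuming the same seed and parameters as above; `cond_prob` is a hypothetical helper name that does not appear in the original analysis.

```r
# Regenerate the data with the same seed and parameters as above
set.seed(101)
N <- 6
X <- runif(10000, 1, N)
Y <- rnorm(10000, mean = (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

# Hypothetical helper: empirical P(a | b) = P(a and b) / P(b)
cond_prob <- function(a, b) mean(a & b) / mean(b)

p_a <- cond_prob(X > x, X > y)   # P(X>x | X>y)
p_b <- mean(X > x & Y > y)       # P(X>x and Y>y)
p_c <- cond_prob(X < x, X > y)   # P(X<x | X>y)

round(c(a = p_a, b = p_b, c = p_c), 3)
```

Because the events X>x and X<x are complementary (no simulated value equals the median exactly), the first and third probabilities sum to 1, which is a quick sanity check on the results.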
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
df<-data.frame("Xgx" =c(sum(X>x & Y<y), sum(X>x & Y>y), sum(X>x & Y<y)+sum(X>x & Y>y)),
"Xlx" = c(sum(X<x & Y<y), sum(X<x & Y>y), sum(X<x & Y<y)+sum(X<x & Y>y)),
"Total" = c(sum(X>x & Y<y)+sum(X<x & Y<y), sum(X>x & Y>y)+sum(X<x & Y>y), sum(X>x & Y<y)+sum(X>x & Y>y)+ sum(X<x & Y<y)+sum(X<x & Y>y)))
names(df) <- c("X>x","X<x","Total")
row.names(df) <- c("Y<y","Y>y", "Total")
df %>% kable(caption = "Table of Probabilities") %>% kable_styling("striped", full_width = TRUE)
| | X>x | X<x | Total |
|---|---|---|---|
| Y<y | 1222 | 1278 | 2500 |
| Y>y | 3778 | 3722 | 7500 |
| Total | 5000 | 5000 | 10000 |
Let’s use the table to locate P(X>x and Y>y) = 3778/10000.
df[2,1]/df[3,3]
## [1] 0.3778
Now, let’s find P(X>x)P(Y>y), P(X>x) = 5000/10000, P(Y>y) = 7500/10000.
#P(X>x)
df[3,1]/df[3,3]
## [1] 0.5
#P(Y>y)
df[2,3]/df[3,3]
## [1] 0.75
#P(X>x)P(Y>y)
(df[3,1]/df[3,3]) * (df[2,3]/df[3,3])
## [1] 0.375
The joint probability (0.3778) is very close to the product of the marginals (0.375), so the two events appear to be independent: P(X>x and Y>y) ≈ P(X>x)P(Y>y).
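The same joint and marginal probabilities can be read off a proportion table directly. The following is a minimal sketch, assuming the same seed and parameters as above; the dimension names `Y_gt_y` and `X_gt_x` are illustrative labels, not names from the original.

```r
# Regenerate the data with the same seed and parameters as above
set.seed(101)
X <- runif(10000, 1, 6)
Y <- rnorm(10000, mean = 3.5)
x <- median(X)
y <- quantile(Y, 0.25)

# Joint relative frequencies of the two events
joint <- prop.table(table(Y_gt_y = Y > y, X_gt_x = X > x))

# Append marginal totals as an extra row and column
addmargins(joint)

p_joint   <- joint["TRUE", "TRUE"]                        # P(X>x and Y>y)
p_product <- sum(joint[, "TRUE"]) * sum(joint["TRUE", ])  # P(X>x) * P(Y>y)
c(joint = p_joint, product = p_product)
```

Under independence the two printed numbers should differ only by sampling noise.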
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher’s Exact Test.
fisher.test(df, simulate.p.value=TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: df
## p-value = 0.7966
## alternative hypothesis: two.sided
The p-value is close to 1, so we fail to reject the null hypothesis; the data give no evidence against these variables being independent.
Chi Square Test.
chisq.test(df)
##
## Pearson's Chi-squared test
##
## data: df
## X-squared = 1.6725, df = 4, p-value = 0.7957
With a test statistic close to 0 and a p-value close to 1, we fail to reject the null hypothesis; the data give no evidence against these variables being independent.
Difference between Fisher’s Exact Test and the Chi Square Test:
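Fisher’s exact test always gives an exact p-value and works well with small sample sizes. The chi-square test relies on a large-sample approximation and is very accurate when the cell counts are large. Fisher’s exact test is valid at any sample size, so it is the safer choice, although with counts this large the two tests give nearly identical p-values (0.7966 and 0.7957).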
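The `df` object passed to both tests above includes the Total row and column, which is why the chi-square output reports 4 degrees of freedom. As a hedged alternative, the sketch below runs both tests on the 2x2 table of counts only; it assumes the same seed and parameters as above, and the name `counts` is illustrative.

```r
# Regenerate the data with the same seed and parameters as above
set.seed(101)
X <- runif(10000, 1, 6)
Y <- rnorm(10000, mean = 3.5)
x <- median(X)
y <- quantile(Y, 0.25)

# 2x2 contingency table of counts, without marginal totals
counts <- table(Y_gt_y = Y > y, X_gt_x = X > x)
counts

# Chi-square test (applies Yates' continuity correction by default for 2x2 tables)
chi <- chisq.test(counts)
chi

# Fisher's exact test on the same 2x2 table
fis <- fisher.test(counts)
fis
```

On a 2x2 table the chi-square test has 1 degree of freedom, and `fisher.test` also reports an odds ratio with its confidence interval.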
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
train <- read.csv('train.csv', sep = ',', header = T, stringsAsFactors = F)
test <- read.csv("test.csv", sep = ',', header = T, stringsAsFactors = F)
Let’s view the head of Train data.
head(train,20) %>% kable(caption = "Train") %>% kable_styling("striped", full_width = TRUE) %>% scroll_box("width:400px")
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
7 | 20 | RL | 75 | 10084 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Somerst | Norm | Norm | 1Fam | 1Story | 8 | 5 | 2004 | 2005 | Gable | CompShg | VinylSd | VinylSd | Stone | 186 | Gd | TA | PConc | Ex | TA | Av | GLQ | 1369 | Unf | 0 | 317 | 1686 | GasA | Ex | Y | SBrkr | 1694 | 0 | 0 | 1694 | 1 | 0 | 2 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 2004 | RFn | 2 | 636 | TA | TA | Y | 255 | 57 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 8 | 2007 | WD | Normal | 307000 |
8 | 60 | RL | NA | 10382 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NWAmes | PosN | Norm | 1Fam | 2Story | 7 | 6 | 1973 | 1973 | Gable | CompShg | HdBoard | HdBoard | Stone | 240 | TA | TA | CBlock | Gd | TA | Mn | ALQ | 859 | BLQ | 32 | 216 | 1107 | GasA | Ex | Y | SBrkr | 1107 | 983 | 0 | 2090 | 1 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 2 | TA | Attchd | 1973 | RFn | 2 | 484 | TA | TA | Y | 235 | 204 | 228 | 0 | 0 | 0 | NA | NA | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
9 | 50 | RM | 51 | 6120 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | OldTown | Artery | Norm | 1Fam | 1.5Fin | 7 | 5 | 1931 | 1950 | Gable | CompShg | BrkFace | Wd Shng | None | 0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 952 | 952 | GasA | Gd | Y | FuseF | 1022 | 752 | 0 | 1774 | 0 | 0 | 2 | 0 | 2 | 2 | TA | 8 | Min1 | 2 | TA | Detchd | 1931 | Unf | 2 | 468 | Fa | TA | Y | 90 | 0 | 205 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2008 | WD | Abnorml | 129900 |
10 | 190 | RL | 50 | 7420 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | BrkSide | Artery | Artery | 2fmCon | 1.5Unf | 5 | 6 | 1939 | 1950 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | BrkTil | TA | TA | No | GLQ | 851 | Unf | 0 | 140 | 991 | GasA | Ex | Y | SBrkr | 1077 | 0 | 0 | 1077 | 1 | 0 | 1 | 0 | 2 | 2 | TA | 5 | Typ | 2 | TA | Attchd | 1939 | RFn | 1 | 205 | Gd | TA | Y | 0 | 4 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 1 | 2008 | WD | Normal | 118000 |
11 | 20 | RL | 70 | 11200 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Sawyer | Norm | Norm | 1Fam | 1Story | 5 | 5 | 1965 | 1965 | Hip | CompShg | HdBoard | HdBoard | None | 0 | TA | TA | CBlock | TA | TA | No | Rec | 906 | Unf | 0 | 134 | 1040 | GasA | Ex | Y | SBrkr | 1040 | 0 | 0 | 1040 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 5 | Typ | 0 | NA | Detchd | 1965 | Unf | 1 | 384 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 129500 |
12 | 60 | RL | 85 | 11924 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | NridgHt | Norm | Norm | 1Fam | 2Story | 9 | 5 | 2005 | 2006 | Hip | CompShg | WdShing | Wd Shng | Stone | 286 | Ex | TA | PConc | Ex | TA | No | GLQ | 998 | Unf | 0 | 177 | 1175 | GasA | Ex | Y | SBrkr | 1182 | 1142 | 0 | 2324 | 1 | 0 | 3 | 0 | 4 | 1 | Ex | 11 | Typ | 2 | Gd | BuiltIn | 2005 | Fin | 3 | 736 | TA | TA | Y | 147 | 21 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 7 | 2006 | New | Partial | 345000 |
13 | 20 | RL | NA | 12968 | Pave | NA | IR2 | Lvl | AllPub | Inside | Gtl | Sawyer | Norm | Norm | 1Fam | 1Story | 5 | 6 | 1962 | 1962 | Hip | CompShg | HdBoard | Plywood | None | 0 | TA | TA | CBlock | TA | TA | No | ALQ | 737 | Unf | 0 | 175 | 912 | GasA | TA | Y | SBrkr | 912 | 0 | 0 | 912 | 1 | 0 | 1 | 0 | 2 | 1 | TA | 4 | Typ | 0 | NA | Detchd | 1962 | Unf | 1 | 352 | TA | TA | Y | 140 | 0 | 0 | 0 | 176 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 144000 |
14 | 20 | RL | 91 | 10652 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 1Story | 7 | 5 | 2006 | 2007 | Gable | CompShg | VinylSd | VinylSd | Stone | 306 | Gd | TA | PConc | Gd | TA | Av | Unf | 0 | Unf | 0 | 1494 | 1494 | GasA | Ex | Y | SBrkr | 1494 | 0 | 0 | 1494 | 0 | 0 | 2 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 2006 | RFn | 3 | 840 | TA | TA | Y | 160 | 33 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 8 | 2007 | New | Partial | 279500 |
15 | 20 | RL | NA | 10920 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 5 | 1960 | 1960 | Hip | CompShg | MetalSd | MetalSd | BrkFace | 212 | TA | TA | CBlock | TA | TA | No | BLQ | 733 | Unf | 0 | 520 | 1253 | GasA | TA | Y | SBrkr | 1253 | 0 | 0 | 1253 | 1 | 0 | 1 | 1 | 2 | 1 | TA | 5 | Typ | 1 | Fa | Attchd | 1960 | RFn | 1 | 352 | TA | TA | Y | 0 | 213 | 176 | 0 | 0 | 0 | NA | GdWo | NA | 0 | 5 | 2008 | WD | Normal | 157000 |
16 | 45 | RM | 51 | 6120 | Pave | NA | Reg | Lvl | AllPub | Corner | Gtl | BrkSide | Norm | Norm | 1Fam | 1.5Unf | 7 | 8 | 1929 | 2001 | Gable | CompShg | Wd Sdng | Wd Sdng | None | 0 | TA | TA | BrkTil | TA | TA | No | Unf | 0 | Unf | 0 | 832 | 832 | GasA | Ex | Y | FuseA | 854 | 0 | 0 | 854 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | NA | Detchd | 1991 | Unf | 2 | 576 | TA | TA | Y | 48 | 112 | 0 | 0 | 0 | 0 | NA | GdPrv | NA | 0 | 7 | 2007 | WD | Normal | 132000 |
17 | 20 | RL | NA | 11241 | Pave | NA | IR1 | Lvl | AllPub | CulDSac | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 7 | 1970 | 1970 | Gable | CompShg | Wd Sdng | Wd Sdng | BrkFace | 180 | TA | TA | CBlock | TA | TA | No | ALQ | 578 | Unf | 0 | 426 | 1004 | GasA | Ex | Y | SBrkr | 1004 | 0 | 0 | 1004 | 1 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 1 | TA | Attchd | 1970 | Fin | 2 | 480 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | Shed | 700 | 3 | 2010 | WD | Normal | 149000 |
18 | 90 | RL | 72 | 10791 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | Sawyer | Norm | Norm | Duplex | 1Story | 4 | 5 | 1967 | 1967 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | Slab | NA | NA | NA | NA | 0 | NA | 0 | 0 | 0 | GasA | TA | Y | SBrkr | 1296 | 0 | 0 | 1296 | 0 | 0 | 2 | 0 | 2 | 2 | TA | 6 | Typ | 0 | NA | CarPort | 1967 | Unf | 2 | 516 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | Shed | 500 | 10 | 2006 | WD | Normal | 90000 |
19 | 20 | RL | 66 | 13695 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | SawyerW | RRAe | Norm | 1Fam | 1Story | 5 | 5 | 2004 | 2004 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | TA | TA | No | GLQ | 646 | Unf | 0 | 468 | 1114 | GasA | Ex | Y | SBrkr | 1114 | 0 | 0 | 1114 | 1 | 0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | NA | Detchd | 2004 | Unf | 2 | 576 | TA | TA | Y | 0 | 102 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 6 | 2008 | WD | Normal | 159000 |
20 | 20 | RL | 70 | 7560 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 5 | 6 | 1958 | 1965 | Hip | CompShg | BrkFace | Plywood | None | 0 | TA | TA | CBlock | TA | TA | No | LwQ | 504 | Unf | 0 | 525 | 1029 | GasA | TA | Y | SBrkr | 1339 | 0 | 0 | 1339 | 0 | 0 | 1 | 0 | 3 | 1 | TA | 6 | Min1 | 0 | NA | Attchd | 1958 | Unf | 1 | 294 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | NA | MnPrv | NA | 0 | 5 | 2009 | COD | Abnorml | 139000 |
The next step is to review a summary of the training dataset to get a better idea of the types of variables available for analysis.
summary(train) %>% kable(caption = "Train Summary All Columns") %>% kable_styling("striped", full_width = TRUE) %>% scroll_box("width:400px")
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 1.0 | Min. : 20.0 | Length:1460 | Min. : 21.00 | Min. : 1300 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 1.000 | Min. :1.000 | Min. :1872 | Min. :1950 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.0 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.0 | Length:1460 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 334 | Min. : 0 | Min. : 0.000 | Min. : 334 | Min. :0.0000 | Min. :0.00000 | Min. :0.000 | Min. :0.0000 | Min. :0.000 | Min. :0.000 | Length:1460 | Min. : 2.000 | Length:1460 | Min. :0.000 | Length:1460 | Length:1460 | Min. :1900 | Length:1460 | Min. :0.000 | Min. : 0.0 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.000 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.00 | Min. : 1.000 | Min. :2006 | Length:1460 | Length:1460 | Min. : 34900 | |
1st Qu.: 365.8 | 1st Qu.: 20.0 | Class :character | 1st Qu.: 59.00 | 1st Qu.: 7554 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 5.000 | 1st Qu.:5.000 | 1st Qu.:1954 | 1st Qu.:1967 | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 0.0 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 0.0 | Class :character | 1st Qu.: 0.00 | 1st Qu.: 223.0 | 1st Qu.: 795.8 | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 882 | 1st Qu.: 0 | 1st Qu.: 0.000 | 1st Qu.:1130 | 1st Qu.:0.0000 | 1st Qu.:0.00000 | 1st Qu.:1.000 | 1st Qu.:0.0000 | 1st Qu.:2.000 | 1st Qu.:1.000 | Class :character | 1st Qu.: 5.000 | Class :character | 1st Qu.:0.000 | Class :character | Class :character | 1st Qu.:1961 | Class :character | 1st Qu.:1.000 | 1st Qu.: 334.5 | Class :character | Class :character | Class :character | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | Class :character | Class :character | Class :character | 1st Qu.: 0.00 | 1st Qu.: 5.000 | 1st Qu.:2007 | Class :character | Class :character | 1st Qu.:129975 | |
Median : 730.5 | Median : 50.0 | Mode :character | Median : 69.00 | Median : 9478 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 6.000 | Median :5.000 | Median :1973 | Median :1994 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 0.0 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 383.5 | Mode :character | Median : 0.00 | Median : 477.5 | Median : 991.5 | Mode :character | Mode :character | Mode :character | Mode :character | Median :1087 | Median : 0 | Median : 0.000 | Median :1464 | Median :0.0000 | Median :0.00000 | Median :2.000 | Median :0.0000 | Median :3.000 | Median :1.000 | Mode :character | Median : 6.000 | Mode :character | Median :1.000 | Mode :character | Mode :character | Median :1980 | Mode :character | Median :2.000 | Median : 480.0 | Mode :character | Mode :character | Mode :character | Median : 0.00 | Median : 25.00 | Median : 0.00 | Median : 0.00 | Median : 0.00 | Median : 0.000 | Mode :character | Mode :character | Mode :character | Median : 0.00 | Median : 6.000 | Median :2008 | Mode :character | Mode :character | Median :163000 | |
Mean : 730.5 | Mean : 56.9 | NA | Mean : 70.05 | Mean : 10517 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean : 6.099 | Mean :5.575 | Mean :1971 | Mean :1985 | NA | NA | NA | NA | NA | Mean : 103.7 | NA | NA | NA | NA | NA | NA | NA | Mean : 443.6 | NA | Mean : 46.55 | Mean : 567.2 | Mean :1057.4 | NA | NA | NA | NA | Mean :1163 | Mean : 347 | Mean : 5.845 | Mean :1515 | Mean :0.4253 | Mean :0.05753 | Mean :1.565 | Mean :0.3829 | Mean :2.866 | Mean :1.047 | NA | Mean : 6.518 | NA | Mean :0.613 | NA | NA | Mean :1979 | NA | Mean :1.767 | Mean : 473.0 | NA | NA | NA | Mean : 94.24 | Mean : 46.66 | Mean : 21.95 | Mean : 3.41 | Mean : 15.06 | Mean : 2.759 | NA | NA | NA | Mean : 43.49 | Mean : 6.322 | Mean :2008 | NA | NA | Mean :180921 | |
3rd Qu.:1095.2 | 3rd Qu.: 70.0 | NA | 3rd Qu.: 80.00 | 3rd Qu.: 11602 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 7.000 | 3rd Qu.:6.000 | 3rd Qu.:2000 | 3rd Qu.:2004 | NA | NA | NA | NA | NA | 3rd Qu.: 166.0 | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 712.2 | NA | 3rd Qu.: 0.00 | 3rd Qu.: 808.0 | 3rd Qu.:1298.2 | NA | NA | NA | NA | 3rd Qu.:1391 | 3rd Qu.: 728 | 3rd Qu.: 0.000 | 3rd Qu.:1777 | 3rd Qu.:1.0000 | 3rd Qu.:0.00000 | 3rd Qu.:2.000 | 3rd Qu.:1.0000 | 3rd Qu.:3.000 | 3rd Qu.:1.000 | NA | 3rd Qu.: 7.000 | NA | 3rd Qu.:1.000 | NA | NA | 3rd Qu.:2002 | NA | 3rd Qu.:2.000 | 3rd Qu.: 576.0 | NA | NA | NA | 3rd Qu.:168.00 | 3rd Qu.: 68.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.000 | NA | NA | NA | 3rd Qu.: 0.00 | 3rd Qu.: 8.000 | 3rd Qu.:2009 | NA | NA | 3rd Qu.:214000 | |
Max. :1460.0 | Max. :190.0 | NA | Max. :313.00 | Max. :215245 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :10.000 | Max. :9.000 | Max. :2010 | Max. :2010 | NA | NA | NA | NA | NA | Max. :1600.0 | NA | NA | NA | NA | NA | NA | NA | Max. :5644.0 | NA | Max. :1474.00 | Max. :2336.0 | Max. :6110.0 | NA | NA | NA | NA | Max. :4692 | Max. :2065 | Max. :572.000 | Max. :5642 | Max. :3.0000 | Max. :2.00000 | Max. :3.000 | Max. :2.0000 | Max. :8.000 | Max. :3.000 | NA | Max. :14.000 | NA | Max. :3.000 | NA | NA | Max. :2010 | NA | Max. :4.000 | Max. :1418.0 | NA | NA | NA | Max. :857.00 | Max. :547.00 | Max. :552.00 | Max. :508.00 | Max. :480.00 | Max. :738.000 | NA | NA | NA | Max. :15500.00 | Max. :12.000 | Max. :2010 | NA | NA | Max. :755000 | |
NA | NA | NA | NA’s :259 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :81 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
hist(train$SalePrice, main="Histogram of Sale Price", xlab="Sale Price")
Overall Quality, as suspected, has a strong relationship with price, based on the plot below.
boxplot(train$SalePrice~train$OverallQual)
Let’s see if having a fireplace has a connection with increase in price.
boxplot(train$SalePrice~train$Fireplaces)
Let’s explore the correlation between some of the variables I suspect will be strong indicators of the price by creating a scatterplot matrix.
pairs(~SalePrice+LotArea+YearBuilt+GrLivArea+GarageArea+TotRmsAbvGrd,data=train,
main="Scatterplot Matrix")
Most of these seem to display a positive correlation with SalePrice. I am going to build a correlation matrix to investigate further.
newtrain <- subset(train,select=c(LotArea, YearBuilt, GrLivArea, GarageArea, TotRmsAbvGrd, SalePrice))
ggcorr(newtrain, geom = "circle", nbreaks = 5)
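My assumption was largely confirmed: GrLivArea, GarageArea, TotRmsAbvGrd and YearBuilt show moderate to strong positive correlations with SalePrice, while LotArea is more weakly correlated.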
Null Hypothesis: The correlation between TotRmsAbvGrd and SalePrice is 0
Alternative Hypothesis: The correlation between TotRmsAbvGrd and SalePrice is other than 0
cor.test(newtrain$SalePrice, newtrain$TotRmsAbvGrd, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: newtrain$SalePrice and newtrain$TotRmsAbvGrd
## t = 24.099, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5092841 0.5573021
## sample estimates:
## cor
## 0.5337232
Since the p-value is below 2.2e-16 (effectively 0), we reject the null hypothesis and conclude that the correlation between TotRmsAbvGrd and SalePrice is not zero. The 80 percent confidence interval for the correlation is (0.5092841, 0.5573021).
Null Hypothesis: The correlation between GarageArea and SalePrice is 0
Alternative Hypothesis: The correlation between GarageArea and SalePrice is other than 0
cor.test(newtrain$SalePrice, newtrain$GarageArea, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: newtrain$SalePrice and newtrain$GarageArea
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
Since the p-value is below 2.2e-16 (effectively 0), we reject the null hypothesis and conclude that the correlation between GarageArea and SalePrice is not zero. The 80 percent confidence interval for the correlation is (0.6024756, 0.6435283).
Null Hypothesis: The correlation between GrLivArea and SalePrice is 0
Alternative Hypothesis: The correlation between GrLivArea and SalePrice is other than 0
cor.test(newtrain$SalePrice, newtrain$GrLivArea, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: newtrain$SalePrice and newtrain$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
Since the p-value is below 2.2e-16 (effectively 0), we reject the null hypothesis and conclude that the correlation between GrLivArea and SalePrice is not zero. The 80 percent confidence interval for the correlation is (0.6915087, 0.7249450).
Null Hypothesis: The correlation between YearBuilt and SalePrice is 0
Alternative Hypothesis: The correlation between YearBuilt and SalePrice is other than 0
cor.test(newtrain$SalePrice, newtrain$YearBuilt, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: newtrain$SalePrice and newtrain$YearBuilt
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4980766 0.5468619
## sample estimates:
## cor
## 0.5228973
Since the p-value is below 2.2e-16 (effectively 0), we reject the null hypothesis and conclude that the correlation between YearBuilt and SalePrice is not zero. The 80 percent confidence interval for the correlation is (0.4980766, 0.5468619).
Null Hypothesis: The correlation between LotArea and SalePrice is 0
Alternative Hypothesis: The correlation between LotArea and SalePrice is other than 0
cor.test(newtrain$SalePrice, newtrain$LotArea, method = "pearson", conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: newtrain$SalePrice and newtrain$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
Since the p-value is below 2.2e-16 (effectively 0), we reject the null hypothesis and conclude that the correlation between LotArea and SalePrice is not zero. The 80 percent confidence interval for the correlation is (0.2323391, 0.2947946).
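The five tests above repeat the same call with a different predictor each time, so they can be collected into one table with a loop. Here is a minimal sketch of that idea, using R's built-in `mtcars` data as a stand-in so that it runs without the Kaggle files; the variable choices and the names `dat`, `predictors` and `results` are illustrative assumptions.

```r
# Stand-in data: mpg plays the role of the response, the rest are predictors
dat <- mtcars[, c("mpg", "wt", "hp", "disp")]
predictors <- setdiff(names(dat), "mpg")

# Run one Pearson correlation test per predictor and keep the key numbers
results <- t(sapply(predictors, function(v) {
  ct <- cor.test(dat$mpg, dat[[v]], method = "pearson", conf.level = 0.8)
  c(cor     = unname(ct$estimate),
    lower   = ct$conf.int[1],   # lower bound of the 80% interval
    upper   = ct$conf.int[2],   # upper bound of the 80% interval
    p.value = ct$p.value)
}))

round(results, 4)
```

With the housing data, the same loop would take `newtrain` in place of `dat` and `SalePrice` in place of `mpg`.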
At a 5% significance level (95% confidence) for each of the 5 tests, let’s calculate the familywise error rate.
1 - (1 - 0.05)^5
## [1] 0.2262191
I will not worry about this error: the familywise Type I error rate across these five comparisons is about 23%, which is not too high, and every p-value here is below 2.2e-16, far smaller than any corrected threshold.
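One common way to control the familywise error rate is a Bonferroni correction, which is not part of the original analysis and is shown here only as a sketch. The p-values below are the upper bound `2.2e-16` that R printed for each test, not the exact values.

```r
alpha <- 0.05   # per-test significance level
m     <- 5      # number of correlation tests

# Familywise error rate if all m tests are independent
fwer <- 1 - (1 - alpha)^m
fwer

# Bonferroni: test each hypothesis at alpha / m instead
bonf_alpha <- alpha / m
bonf_alpha

# Equivalent view: inflate the p-values and compare them to alpha
p_vals <- rep(2.2e-16, m)   # upper bounds reported by cor.test above
p_adj  <- p.adjust(p_vals, method = "bonferroni")
all(p_adj < alpha)
```

Even after the correction, every test remains significant, which supports the decision not to worry about familywise error here.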
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Correlation Matrix:
#Saving Correlation Matrix
corM<-cor(newtrain)
round(corM, 2)
## LotArea YearBuilt GrLivArea GarageArea TotRmsAbvGrd SalePrice
## LotArea 1.00 0.01 0.26 0.18 0.19 0.26
## YearBuilt 0.01 1.00 0.20 0.48 0.10 0.52
## GrLivArea 0.26 0.20 1.00 0.47 0.83 0.71
## GarageArea 0.18 0.48 0.47 1.00 0.34 0.62
## TotRmsAbvGrd 0.19 0.10 0.83 0.34 1.00 0.53
## SalePrice 0.26 0.52 0.71 0.62 0.53 1.00
#Inverting Correlation Matrix
precM<-solve(corM)
round(precM, 2)
## LotArea YearBuilt GrLivArea GarageArea TotRmsAbvGrd SalePrice
## LotArea 1.11 0.18 -0.18 -0.07 0.08 -0.26
## YearBuilt 0.18 1.64 0.44 -0.43 0.14 -1.02
## GrLivArea -0.18 0.44 4.77 -0.29 -2.86 -1.85
## GarageArea -0.07 -0.43 -0.29 1.76 0.06 -0.68
## TotRmsAbvGrd 0.08 0.14 -2.86 0.06 3.21 0.18
## SalePrice -0.26 -1.02 -1.85 -0.68 0.18 3.24
#Multiply the correlation matrix by the precision matrix
round(corM %*% precM, 1)
## LotArea YearBuilt GrLivArea GarageArea TotRmsAbvGrd SalePrice
## LotArea 1 0 0 0 0 0
## YearBuilt 0 1 0 0 0 0
## GrLivArea 0 0 1 0 0 0
## GarageArea 0 0 0 1 0 0
## TotRmsAbvGrd 0 0 0 0 1 0
## SalePrice 0 0 0 0 0 1
#Multiply the precision matrix by the correlation matrix
round(precM %*% corM, 1)
## LotArea YearBuilt GrLivArea GarageArea TotRmsAbvGrd SalePrice
## LotArea 1 0 0 0 0 0
## YearBuilt 0 1 0 0 0 0
## GrLivArea 0 0 1 0 0 0
## GarageArea 0 0 0 1 0 0
## TotRmsAbvGrd 0 0 0 0 1 0
## SalePrice 0 0 0 0 0 1
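The task statement notes that the diagonal of the precision matrix holds variance inflation factors. The sketch below checks this on R's built-in `mtcars` data (a stand-in, not the housing data): the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on the others. The variable choices are illustrative.

```r
# Stand-in data with three correlated numeric variables
vars <- mtcars[, c("wt", "hp", "disp")]

corM  <- cor(vars)     # correlation matrix
precM <- solve(corM)   # precision matrix (inverse of the correlation matrix)

# VIF for wt computed from its definition
r2     <- summary(lm(wt ~ hp + disp, data = vars))$r.squared
vif_wt <- 1 / (1 - r2)

# The two numbers should agree up to floating-point error
c(diagonal = precM["wt", "wt"], vif = vif_wt)

# The product of the two matrices should be the identity matrix
round(corM %*% precM, 10)
```

This is also why both products above round to the identity matrix: a matrix multiplied by its inverse, in either order, gives the identity.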
LU Decomposition
z<-lu(corM)
round(z$L, 2)
## LotArea YearBuilt GrLivArea GarageArea TotRmsAbvGrd SalePrice
## LotArea 1.00 0.00 0.00 0.00 0.00 0
## YearBuilt 0.01 1.00 0.00 0.00 0.00 0
## GrLivArea 0.26 0.20 1.00 0.00 0.00 0
## GarageArea 0.18 0.48 0.37 1.00 0.00 0
## TotRmsAbvGrd 0.19 0.09 0.85 -0.03 1.00 0
## SalePrice 0.26 0.52 0.60 0.21 -0.05 1
round(z$U, 2)
## LotArea YearBuilt GrLivArea GarageArea TotRmsAbvGrd SalePrice
## LotArea 1 0.01 0.26 0.18 0.19 0.26
## YearBuilt 0 1.00 0.20 0.48 0.09 0.52
## GrLivArea 0 0.00 0.89 0.33 0.76 0.54
## GarageArea 0 0.00 0.00 0.62 -0.02 0.13
## TotRmsAbvGrd 0 0.00 0.00 0.00 0.31 -0.02
## SalePrice 0 0.00 0.00 0.00 0.00 0.31
Let’s compare our L*U Matrix to the original Correlation Matrix
round(z$L %*% z$U,2)
## LotArea YearBuilt GrLivArea GarageArea TotRmsAbvGrd SalePrice
## LotArea 1.00 0.01 0.26 0.18 0.19 0.26
## YearBuilt 0.01 1.00 0.20 0.48 0.10 0.52
## GrLivArea 0.26 0.20 1.00 0.47 0.83 0.71
## GarageArea 0.18 0.48 0.47 1.00 0.34 0.62
## TotRmsAbvGrd 0.19 0.10 0.83 0.34 1.00 0.53
## SalePrice 0.26 0.52 0.71 0.62 0.53 1.00
round(corM,2)
## LotArea YearBuilt GrLivArea GarageArea TotRmsAbvGrd SalePrice
## LotArea 1.00 0.01 0.26 0.18 0.19 0.26
## YearBuilt 0.01 1.00 0.20 0.48 0.10 0.52
## GrLivArea 0.26 0.20 1.00 0.47 0.83 0.71
## GarageArea 0.18 0.48 0.47 1.00 0.34 0.62
## TotRmsAbvGrd 0.19 0.10 0.83 0.34 1.00 0.53
## SalePrice 0.26 0.52 0.71 0.62 0.53 1.00
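To show what the decomposition does, here is a hand-rolled Doolittle LU factorization in base R. It is a minimal sketch without pivoting, not the implementation behind `lu()` in pracma; `lu_doolittle` is a hypothetical name, and the input is a correlation matrix from the built-in `mtcars` data used as a stand-in.

```r
# Doolittle LU without pivoting: L is unit lower triangular, U is upper triangular
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # multiplier that eliminates U[i, k]
      U[i, ]  <- U[i, ] - L[i, k] * U[k, ]  # row operation on U
    }
  }
  list(L = L, U = U)
}

A <- cor(mtcars[, c("mpg", "wt", "hp")])
f <- lu_doolittle(A)

round(f$L, 2)
round(f$U, 2)

# Reconstruction error should be at floating-point level
max(abs(f$L %*% f$U - A))
```

Skipping pivoting is reasonable in this sketch because a correlation matrix of non-collinear variables is positive definite, so the pivots `U[k, k]` stay nonzero; a general-purpose routine would need row exchanges.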
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
I am going to use the GrLivArea variable for this step.
hist(train$GrLivArea, breaks = 40)
LRA<-as.numeric(train$GrLivArea)
summary(LRA) # the minimum (334) is already above zero, so no shift is needed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1130    1464    1515    1777    5642
LRAfit <- fitdistr(LRA, densfun = "exponential")
LRAfit
## rate
## 6.598640e-04
## (1.726943e-05)
lambda <- LRAfit$estimate[1]
#take 1000 samples from this exponential distribution
LRAfitSample <- rexp(1000, lambda)
#Sample Histogram
hist(LRAfitSample, breaks = 40)
#Original Histogram
hist(train$GrLivArea, breaks = 40)
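For the exponential distribution the maximum-likelihood rate has a closed form, 1 / mean(x), with approximate standard error rate / sqrt(n). That agrees with the fit above: 1 / 1515 is roughly 6.6e-04. The sketch below checks this against fitdistr on simulated data; the sample `x` is a made-up stand-in, not the Kaggle variable.

```r
library(MASS)

set.seed(1)
x <- rexp(500, rate = 0.002)   # hypothetical data, not GrLivArea

fit <- fitdistr(x, densfun = "exponential")

# Closed-form maximum-likelihood estimate and its approximate standard error
rate_closed <- 1 / mean(x)
se_closed <- rate_closed / sqrt(length(x))

# Both comparisons should be TRUE
isTRUE(all.equal(unname(fit$estimate), rate_closed))
isTRUE(all.equal(unname(fit$sd), se_closed))
```

Because the rate depends only on the sample mean, the fit cannot adapt to the shape of the data, which is worth keeping in mind when comparing the two histograms.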
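The simulated exponential sample decays steadily from zero, while the original GrLivArea values start at 334 and cluster around the median of 1,464, so it is not obvious that the exponential model fits the data. Let’s investigate further.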
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
qexp(0.05, rate=lambda) # 5th percentile
## [1] 77.73313
qexp(0.95, rate=lambda) # 95th percentile
## [1] 4539.924
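The qexp calls above invert the exponential CDF, F(q) = 1 - exp(-λq), which gives q = -log(1 - p) / λ. As a small sketch, the helper below (`exp_quantile` is a name introduced here for illustration) applies that formula with the rate reported by fitdistr above and compares it with qexp.

```r
lambda <- 6.598640e-04   # rate reported by fitdistr above

# Hypothetical helper: quantile of an exponential distribution via the inverted CDF
exp_quantile <- function(p, rate) {
  -log(1 - p) / rate
}

p <- c(0.05, 0.95)
manual <- exp_quantile(p, lambda)
manual

# Should be TRUE: the closed form agrees with qexp
isTRUE(all.equal(manual, qexp(p, rate = lambda)))
```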
Generating a 95% confidence interval from the empirical data, assuming normality.
a <- mean(train$GrLivArea)
s <- sd(train$GrLivArea)
n <- length(train$GrLivArea)
error <- qnorm(0.975)*s/sqrt(n)
left <- a-error
right <- a+error
round(left,1)
## [1] 1488.5
round(right,1)
## [1] 1542.4
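Note that this interval describes uncertainty about the mean, not the spread of individual houses, so it is not directly comparable to the 5th and 95th percentiles. The sketch below wraps the calculation in a hypothetical helper, `mean_ci`, and also computes the 5th and 95th percentiles implied by a normal model, using simulated data as a stand-in for GrLivArea.

```r
# Hypothetical helper: normal-approximation confidence interval for the mean
mean_ci <- function(x, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  half_width <- z * sd(x) / sqrt(length(x))
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

set.seed(42)
x <- rnorm(1000, mean = 1500, sd = 500)   # simulated stand-in, not the Kaggle data

# Interval for the mean: narrow, because it shrinks with sqrt(n)
ci_mean <- mean_ci(x)

# Percentiles of individual values under the same normal model: much wider
normal_pct <- qnorm(c(0.05, 0.95), mean = mean(x), sd = sd(x))

ci_mean
normal_pct
```

The second pair of numbers is the quantity to set against the exponential and empirical percentiles when judging which model fits better.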
Let’s look at the empirical 5th percentile and 95th percentile of the data.
quantile(train$GrLivArea, 0.05)
## 5%
## 848
quantile(train$GrLivArea, 0.95)
## 95%
## 2466.1
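The exponential model’s 5th and 95th percentiles (77.7 and 4,539.9) are far from the empirical 5th and 95th percentiles (848 and 2,466.1), so the exponential pdf is a poor estimate of the empirical data even though the simulated histogram looked plausible at first. The normal-based interval (1,488.5 to 1,542.4) is a confidence interval for the mean rather than a range for individual observations, but it is consistent with data centered near 1,515. Overall a normal model fits better, so we shouldn’t use the exponentially fitted model in this case.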
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I will build a regression model by trying out various variables and keeping those that show high significance codes.
m1 <- lm(SalePrice ~ LotArea + YearBuilt + GrLivArea + LotFrontage + OverallQual + OverallCond + factor(Neighborhood) + factor (Condition2) + factor(BldgType) + factor(HouseStyle) + factor(SaleCondition) + factor(MoSold)+BedroomAbvGr + FullBath * HalfBath + TotalBsmtSF + factor(RoofMatl), data=train)
summary(m1)
##
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + GrLivArea + LotFrontage +
## OverallQual + OverallCond + factor(Neighborhood) + factor(Condition2) +
## factor(BldgType) + factor(HouseStyle) + factor(SaleCondition) +
## factor(MoSold) + BedroomAbvGr + FullBath * HalfBath + TotalBsmtSF +
## factor(RoofMatl), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -191823 -14738 -609 13217 204383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.657e+06 1.461e+05 -11.344 < 2e-16 ***
## LotArea 7.245e-01 1.304e-01 5.557 3.43e-08 ***
## YearBuilt 4.939e+02 6.955e+01 7.103 2.17e-12 ***
## GrLivArea 7.649e+01 4.258e+00 17.963 < 2e-16 ***
## LotFrontage 5.249e+01 5.201e+01 1.009 0.313051
## OverallQual 1.276e+04 1.197e+03 10.656 < 2e-16 ***
## OverallCond 7.501e+03 9.214e+02 8.140 1.03e-15 ***
## factor(Neighborhood)Blueste 1.388e+04 2.253e+04 0.616 0.537848
## factor(Neighborhood)BrDale 2.579e+04 1.219e+04 2.115 0.034681 *
## factor(Neighborhood)BrkSide 9.668e+03 1.071e+04 0.903 0.366973
## factor(Neighborhood)ClearCr -5.787e+03 1.259e+04 -0.460 0.645959
## factor(Neighborhood)CollgCr -1.632e+03 9.192e+03 -0.178 0.859145
## factor(Neighborhood)Crawfor 2.612e+04 1.046e+04 2.496 0.012690 *
## factor(Neighborhood)Edwards -1.263e+03 9.717e+03 -0.130 0.896591
## factor(Neighborhood)Gilbert -7.080e+03 9.951e+03 -0.712 0.476917
## factor(Neighborhood)IDOTRR 2.805e+03 1.117e+04 0.251 0.801838
## factor(Neighborhood)MeadowV 1.913e+04 1.209e+04 1.582 0.113906
## factor(Neighborhood)Mitchel -5.637e+03 1.022e+04 -0.552 0.581366
## factor(Neighborhood)NAmes -2.266e+03 9.511e+03 -0.238 0.811760
## factor(Neighborhood)NoRidge 4.053e+04 1.052e+04 3.855 0.000122 ***
## factor(Neighborhood)NPkVill 2.104e+04 1.424e+04 1.478 0.139633
## factor(Neighborhood)NridgHt 5.293e+04 9.148e+03 5.786 9.33e-09 ***
## factor(Neighborhood)NWAmes -1.619e+04 1.003e+04 -1.614 0.106850
## factor(Neighborhood)OldTown -3.557e+03 1.034e+04 -0.344 0.730931
## factor(Neighborhood)Sawyer -3.822e+02 1.020e+04 -0.037 0.970102
## factor(Neighborhood)SawyerW -1.312e+03 9.648e+03 -0.136 0.891866
## factor(Neighborhood)Somerst 1.357e+04 9.108e+03 1.490 0.136416
## factor(Neighborhood)StoneBr 6.781e+04 1.060e+04 6.395 2.34e-10 ***
## factor(Neighborhood)SWISU -4.961e+02 1.179e+04 -0.042 0.966431
## factor(Neighborhood)Timber 1.080e+04 1.038e+04 1.041 0.298300
## factor(Neighborhood)Veenker 2.871e+04 1.408e+04 2.039 0.041709 *
## factor(Condition2)Feedr -5.157e+03 2.532e+04 -0.204 0.838637
## factor(Condition2)Norm 5.428e+03 2.222e+04 0.244 0.807025
## factor(Condition2)PosA 2.564e+04 3.912e+04 0.655 0.512341
## factor(Condition2)PosN -2.000e+05 3.099e+04 -6.455 1.60e-10 ***
## factor(Condition2)RRNn 6.537e+03 3.050e+04 0.214 0.830300
## factor(BldgType)2fmCon -2.167e+03 6.162e+03 -0.352 0.725214
## factor(BldgType)Duplex -1.784e+04 5.290e+03 -3.372 0.000772 ***
## factor(BldgType)Twnhs -4.139e+04 6.915e+03 -5.986 2.88e-09 ***
## factor(BldgType)TwnhsE -2.445e+04 4.621e+03 -5.292 1.45e-07 ***
## factor(HouseStyle)1.5Unf 5.001e+03 9.121e+03 0.548 0.583592
## factor(HouseStyle)1Story 1.388e+04 3.849e+03 3.606 0.000325 ***
## factor(HouseStyle)2.5Fin -2.485e+04 1.201e+04 -2.070 0.038718 *
## factor(HouseStyle)2.5Unf -2.070e+04 1.045e+04 -1.980 0.047918 *
## factor(HouseStyle)2Story -3.547e+02 3.785e+03 -0.094 0.925366
## factor(HouseStyle)SFoyer 2.265e+04 7.156e+03 3.165 0.001592 **
## factor(HouseStyle)SLvl 1.441e+04 5.697e+03 2.530 0.011541 *
## factor(SaleCondition)AdjLand 3.338e+04 1.592e+04 2.097 0.036208 *
## factor(SaleCondition)Alloca 1.417e+04 1.090e+04 1.300 0.194005
## factor(SaleCondition)Family -9.539e+02 7.796e+03 -0.122 0.902635
## factor(SaleCondition)Normal 6.349e+03 3.464e+03 1.833 0.067067 .
## factor(SaleCondition)Partial 2.433e+04 4.741e+03 5.131 3.39e-07 ***
## factor(MoSold)2 -1.633e+04 6.224e+03 -2.624 0.008811 **
## factor(MoSold)3 -1.561e+04 5.450e+03 -2.864 0.004254 **
## factor(MoSold)4 -1.590e+04 5.153e+03 -3.085 0.002087 **
## factor(MoSold)5 -9.486e+03 4.957e+03 -1.914 0.055905 .
## factor(MoSold)6 -1.413e+04 4.849e+03 -2.913 0.003649 **
## factor(MoSold)7 -1.128e+04 4.855e+03 -2.323 0.020333 *
## factor(MoSold)8 -1.608e+04 5.224e+03 -3.078 0.002138 **
## factor(MoSold)9 -1.757e+04 5.871e+03 -2.993 0.002826 **
## factor(MoSold)10 -1.922e+04 5.578e+03 -3.447 0.000589 ***
## factor(MoSold)11 -1.416e+04 5.713e+03 -2.478 0.013358 *
## factor(MoSold)12 -1.862e+04 6.163e+03 -3.021 0.002575 **
## BedroomAbvGr -1.015e+04 1.543e+03 -6.576 7.36e-11 ***
## FullBath -3.007e+03 3.036e+03 -0.991 0.322093
## HalfBath -1.379e+04 5.420e+03 -2.544 0.011101 *
## TotalBsmtSF 2.667e+01 3.673e+00 7.260 7.21e-13 ***
## factor(RoofMatl)CompShg 6.120e+05 3.552e+04 17.228 < 2e-16 ***
## factor(RoofMatl)Membran 6.380e+05 4.679e+04 13.634 < 2e-16 ***
## factor(RoofMatl)Roll 6.127e+05 4.596e+04 13.332 < 2e-16 ***
## factor(RoofMatl)Tar&Grv 6.007e+05 3.671e+04 16.364 < 2e-16 ***
## factor(RoofMatl)WdShake 5.781e+05 4.139e+04 13.967 < 2e-16 ***
## factor(RoofMatl)WdShngl 6.764e+05 3.700e+04 18.282 < 2e-16 ***
## FullBath:HalfBath 1.272e+04 3.333e+03 3.817 0.000142 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29210 on 1127 degrees of freedom
## (259 observations deleted due to missingness)
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.8773
## F-statistic: 118.5 on 73 and 1127 DF, p-value: < 2.2e-16
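The multiple R-squared of 0.885 (adjusted 0.877) means that the model explains about 88.5% of the variance in SalePrice in the training data; it is a measure of fit, not of prediction accuracy on new data. The p-value of the overall F-test is below 2.2e-16, so at the 5% level of significance we reject the hypothesis that none of the predictors is related to SalePrice. Note that 259 observations were dropped because of missing values.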
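To make the R-squared figures concrete, the sketch below recomputes them by hand as 1 - RSS/TSS and the adjusted version. It uses the built-in mtcars data as a stand-in, since the point is the formula rather than the house-price model; the last line applies the same adjustment to the values reported in the summary above.

```r
# Stand-in model on a built-in dataset (not the Kaggle data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

y <- mtcars$mpg
rss <- sum(residuals(fit)^2)     # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
n <- length(y)
p <- length(coef(fit)) - 1       # number of predictors, excluding the intercept

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both should be TRUE: the manual values match summary()
isTRUE(all.equal(r2, summary(fit)$r.squared))
isTRUE(all.equal(adj_r2, summary(fit)$adj.r.squared))

# Same adjustment with the reported figures:
# R-squared 0.8848, 73 predictors, 1127 residual degrees of freedom
adj_reported <- 1 - (1 - 0.8848) * (1127 + 73) / 1127
round(adj_reported, 4)
```

The adjusted value penalizes the 73 estimated coefficients, which is why it sits slightly below the multiple R-squared.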
Let’s take a look at residuals:
hist(m1$residuals, breaks=40)
qqnorm(m1$residuals)
qqline(m1$residuals)
The residuals appear to be roughly normally distributed, although a few large residuals remain (they range from about -191,823 to 204,383).
pred <- predict(m1, test, type="response", se.fit=FALSE)
final <- data.frame(Id = test$Id, SalePrice=pred)
#replace missing values with mean
Mean = mean(final[, 2], na.rm = TRUE)
final[,2][is.na(final[,2])] <- Mean
#Writing file to submit for competition
write.csv(final,"EAsubmission.csv", row.names = FALSE)
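Predictions come back as NA whenever a test row is missing one of the predictors, which is why the mean fill above is needed. One possible alternative, sketched below on toy data, is to impute the missing predictors first so that every row still gets a model-based prediction. The data frames and the `impute_median` helper are hypothetical and only illustrate the idea; this is not the approach used for the submission.

```r
# Hypothetical toy data standing in for train and test
train_toy <- data.frame(y  = c(10, 12, 15, 19, 22, 25),
                        x1 = c(1, 2, 3, 4, 5, 6),
                        x2 = c(2, 1, 4, 3, 6, 5))
test_toy <- data.frame(x1 = c(2.5, NA, 5.5),
                       x2 = c(3, 2, NA))

fit <- lm(y ~ x1 + x2, data = train_toy)

# predict() returns NA for rows with a missing predictor
pred_raw <- predict(fit, test_toy)

# Hypothetical helper: fill numeric NAs with the median from a reference data set
impute_median <- function(newdata, reference) {
  for (col in names(newdata)) {
    if (is.numeric(newdata[[col]]) && col %in% names(reference)) {
      missing <- is.na(newdata[[col]])
      newdata[[col]][missing] <- median(reference[[col]], na.rm = TRUE)
    }
  }
  newdata
}

pred_imputed <- predict(fit, impute_median(test_toy, train_toy))

sum(is.na(pred_raw))       # rows that could not be predicted directly
sum(is.na(pred_imputed))   # none after imputation
```

Imputing predictors keeps the information from the columns that are present, whereas filling the prediction with the overall mean ignores everything known about that house.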
Overall, the variables that showed significant correlation with SalePrice in my initial analysis were also significant predictors in my final regression model.
My Kaggle username is EAzrilyan and my score is 0.22252. I will continue experimenting with the model to see if I can achieve a better score (lower is better on this leaderboard).
knitr::include_graphics("kagglesubmit.png")
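A score like this can be estimated locally before submitting. The sketch below assumes the leaderboard metric is the root-mean-squared error between the logarithm of the predicted price and the logarithm of the actual price; `rmsle` is a hypothetical helper and the prices are made-up numbers.

```r
# Hypothetical helper: RMSE on the log scale (assumed leaderboard metric)
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Made-up prices for illustration only
actual    <- c(200000, 150000, 320000)
predicted <- c(210000, 140000, 300000)

rmsle(actual, predicted)

# A perfect prediction scores 0, so lower is better
rmsle(actual, actual)
```

Applied to a held-out part of train, for example `rmsle(holdout$SalePrice, predict(m1, holdout))` with a hypothetical `holdout` data frame, this gives a rough preview of how a model change would move the score.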