library(knitr) # for table of counts
library(kableExtra)
library(tidyverse) # for %>% operator
library(moments) # skewness and kurtosis
library(hrbrthemes) # themes
library(DescTools) # mode
library(scales) # remove sci notation
library(corrplot)
library(factoextra) # pca scree plot
library(MASS) # fitdistr
train.df <- read.csv("train.csv")
# head(train.df)
hist(train.df$WoodDeckSF)
X <- train.df$WoodDeckSF
Y <- train.df$SalePrice
The independent variable is WoodDeckSF, which is wood
deck area in square feet.
X_Q3 <- quantile(X, 0.75) # 3rd quartile of the variable X
Y_Q2 <- quantile(Y, 0.50) # 2nd quartile of the variable Y
a <- length(which(X <= X_Q3 & Y <= Y_Q2))
b <- length(which(X <= X_Q3 & Y > Y_Q2))
c <- a + b
d <- length(which(X > X_Q3 & Y <= Y_Q2))
e <- length(which(X > X_Q3 & Y > Y_Q2))
f <- d + e
g <- a + d
h <- b + e
i <- g + h # also c + f
data <- data.frame(
"X/Y" = c("X <= Q3", "X > Q3", "Total"),
"Y <= Q2" = c(a, d, g),
"Y > Q2" = c(b, e, h),
"Total" = c(c, f, i),
check.names=F
)
# Create a table using kable
kable(data, caption = "Table of Counts", format = "html") %>%
kable_styling()
| X/Y | Y <= Q2 | Y > Q2 | Total |
|---|---|---|---|
| X <= Q3 | 624 | 480 | 1104 |
| X > Q3 | 108 | 248 | 356 |
| Total | 732 | 728 | 1460 |
This represents: Given the wood deck area is above the 3rd quartile, what is the probability that the sales price is above the 2nd quartile?
This represents the probability that the events occur together. The wood deck area is above the 3rd quartile and the sales price is above the 2nd quartile.
The number of records where x equals the 3rd quartile is 28, so we can subtract 28 from the number of records where x is less than or equal to the 3rd quartile and y is greater than the 2nd quartile. This probability represents when the wood deck area is strictly less than Q3 given that the sales price is strictly greater than the 2nd quartile.
Let A be the event \(X > Q3\). Let B be the event \(Y >Q2\).
\[\begin{align*} P(A|B) &= P(X > Q3 | Y > Q2) \\ &= 248/728 \\ &= 0.3407 \end{align*}\]
\[\begin{align*} P(A) \cdot P(B) &= P(X>Q3) \cdot P(Y>Q2) \\ &= \frac{356}{1460} \cdot \frac{728}{1460} \\ &= 0.1216 \end{align*}\]
probA <- length(which(X > X_Q3))/nrow(train.df)
probB <- length(which(Y > Y_Q2))/nrow(train.df)
prob_condAB <- length(which(X > X_Q3 & Y > Y_Q2))/length(which(Y > Y_Q2))
Since \(P(A|B) \neq P(A) \cdot P(B)\), the events are not independent. Verify with the chi-square test.
The condition for applying the chi-square test for independence is that all of the expected counts for the data are greater than five.
Check the expected counts:
data2 <- data.frame(
"Combinations" = c("X<=Q3, Y<=Q2", "X>Q3, Y>Q2", "X>Q3, Y<=Q2", "X<=Q3, Y>Q2"),
"Expected Counts" = c("1104*732/1460=553", "356*728/1460=178", "356*732/1460=179", "1104*728/1460=551"),
">5?" = c("Yes", "Yes", "Yes", "Yes"),
check.names=F
)
kable(data2, caption = "Checking Expected Counts", format = "html") %>%
kable_styling()
| Combinations | Expected Counts | >5? |
|---|---|---|
| X<=Q3, Y<=Q2 | 1104*732/1460=553 | Yes |
| X>Q3, Y>Q2 | 356*728/1460=178 | Yes |
| X>Q3, Y<=Q2 | 356*732/1460=179 | Yes |
| X<=Q3, Y>Q2 | 1104*728/1460=551 | Yes |
\[\begin{align*} \sum \frac{(O-E)^2}{E} &= \frac{(624-553)^2}{553} + \frac{(108-179)^2}{179} + \frac{(480-551)^2}{551} + \frac{(248-178)^2}{178} \\ &= 9.116 + 28.162 + 8.149 + 27.528 \\ &= 72.955 \end{align*}\]
The degrees of freedom is given by \(df = (r-1)(c-1)\).
\(df = (2-1)(2-1) = 1\)
Choose to be 0.05. \(X^2_{0.05, 1} = 3.841\)
Since the test statistic is larger, reject the null hypothesis of independence. Cannot assume that there is no relationship between the wood deck areas above the 3rd quartile and sales price above the 2nd quartile.
For the all the numeric variables in the training set:
exclude_cols <- c("YrSold", "GarageYrBlt", "YearRemodAdd", "YearBuilt", "Id", "SalePrice")
train_numeric <- dplyr::select_if(train.df, is.numeric) %>% dplyr::select(-all_of(exclude_cols))
summary(train_numeric)
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 50.0 Median : 69.00 Median : 9478 Median : 6.000
## Mean : 56.9 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :190.0 Max. :313.00 Max. :215245 Max. :10.000
## NA's :259
## OverallCond MasVnrArea BsmtFinSF1 BsmtFinSF2
## Min. :1.000 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.:5.000 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00
## Median :5.000 Median : 0.0 Median : 383.5 Median : 0.00
## Mean :5.575 Mean : 103.7 Mean : 443.6 Mean : 46.55
## 3rd Qu.:6.000 3rd Qu.: 166.0 3rd Qu.: 712.2 3rd Qu.: 0.00
## Max. :9.000 Max. :1600.0 Max. :5644.0 Max. :1474.00
## NA's :8
## BsmtUnfSF TotalBsmtSF X1stFlrSF X2ndFlrSF
## Min. : 0.0 Min. : 0.0 Min. : 334 Min. : 0
## 1st Qu.: 223.0 1st Qu.: 795.8 1st Qu.: 882 1st Qu.: 0
## Median : 477.5 Median : 991.5 Median :1087 Median : 0
## Mean : 567.2 Mean :1057.4 Mean :1163 Mean : 347
## 3rd Qu.: 808.0 3rd Qu.:1298.2 3rd Qu.:1391 3rd Qu.: 728
## Max. :2336.0 Max. :6110.0 Max. :4692 Max. :2065
##
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## Min. : 0.000 Min. : 334 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.000 Median :1464 Median :0.0000 Median :0.00000
## Mean : 5.845 Mean :1515 Mean :0.4253 Mean :0.05753
## 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :572.000 Max. :5642 Max. :3.0000 Max. :2.00000
##
## FullBath HalfBath BedroomAbvGr KitchenAbvGr
## Min. :0.000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :0.0000 Median :3.000 Median :1.000
## Mean :1.565 Mean :0.3829 Mean :2.866 Mean :1.047
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Fireplaces GarageCars GarageArea
## Min. : 2.000 Min. :0.000 Min. :0.000 Min. : 0.0
## 1st Qu.: 5.000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.: 334.5
## Median : 6.000 Median :1.000 Median :2.000 Median : 480.0
## Mean : 6.518 Mean :0.613 Mean :1.767 Mean : 473.0
## 3rd Qu.: 7.000 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :14.000 Max. :3.000 Max. :4.000 Max. :1418.0
##
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 15.06 Mean : 2.759 Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :480.00 Max. :738.000 Max. :15500.00 Max. :12.000
##
For the chosen quantitative variables, Wood Deck Area and Sale Price,
ggplot() +
geom_boxplot(aes(y = X), fill="#69b3a2", color="black", alpha=0.8) +
scale_x_discrete( ) +
labs(title = "Box Plot of Wood Deck Area",
y = "Square Feet") +
theme_ipsum() +
coord_flip()
ggplot(data=data.frame(X)) +
geom_density(aes(x=X), fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Density Plot of Wood Deck Area") +
theme_ipsum()
ggplot(data=data.frame(X), aes(x=X)) +
geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Histogram of Wood Deck Area") +
theme_ipsum()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot() +
geom_boxplot(aes(y = Y), fill="#ffd500", color="black", alpha=0.8) +
scale_x_discrete( ) +
labs(title = "Box Plot of Sale Price",
y = "Dollars") +
theme_ipsum() +
coord_flip()
ggplot(data=data.frame(X)) +
geom_density(aes(x=X), fill="#ffd500", color="#e9ecef", alpha=0.8) +
ggtitle("Density Plot of Sale Price") +
theme_ipsum()
ggplot(data=data.frame(X), aes(x=X)) +
geom_histogram(fill="#ffd500", color="#e9ecef", alpha=0.8) +
ggtitle("Histogram of Sale Price") +
theme_ipsum()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Wood Deck Area
A skewness value of 1.5398 suggests moderate skewness to the right. There is a tendency for the data to have higher values (outliers) on the right side of the distribution. This is visible in the box plot above.
A kurtosis value of 5.9786 suggests that the distribution has a heavy tail and a sharper peak than a normal distribution.
Sale Price
A skewness value of 1.8809 suggests moderate skewness to the right.
A kurtosis value of 9.5098 suggests that the distribution has a heavy tail and a sharper peak than a normal distribution.
ggplot(data=train.df, aes(x=WoodDeckSF, y=SalePrice)) + geom_point() +
ggtitle("Sale Price vs Wood Deck Area") +
scale_y_continuous(labels = scales::comma) + theme_ipsum()
t.test(X, Y, paired = TRUE)
##
## Paired t-test
##
## data: X and Y
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -184903.2 -176750.7
## sample estimates:
## mean difference
## -180827
M = cor(data.frame(X,Y))
corrplot(M, type = "upper", tl.col = "black", tl.srt = 45)
### Hypothesis Test
\(H_0:\) The correlation between Wood Deck Area and Sale Price is 0.
\(H_A:\) The correlation between Wood Deck Area and Sale Price is not 0.
cor.test(X,Y, conf.level=0.99)
##
## Pearson's product-moment correlation
##
## data: X and Y
## t = 13.096, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.2627778 0.3834121
## sample estimates:
## cor
## 0.3244134
The confidence interval is \((0.263, 0.383)\) , which means that we can say with 99% confidence that the true correlation between Wood Deck Area and Sale Price falls within this interval. The small p-value indicates strong evidence against the null hypothesis, suggesting that the observed correlation of \(0.324\) is unlikely to occur by chance alone.
data <- data.frame(WoodDeckSF=train_numeric$WoodDeckSF, SalePrice=Y, GarageArea=train_numeric$GarageArea)
corr_matrix <- cor(data)
corrplot(corr_matrix, type = "upper", tl.col = "black", tl.srt = 45)
precision_mat <- solve(corr_matrix)
precision_mat
## WoodDeckSF SalePrice GarageArea
## WoodDeckSF 1.11865091 -0.3373326 -0.04101942
## SalePrice -0.33733260 1.7374927 -1.00742033
## GarageArea -0.04101942 -1.0074203 1.63727319
corr_matrix %*% precision_mat %*% corr_matrix
## WoodDeckSF SalePrice GarageArea
## WoodDeckSF 1.0000000 0.3244134 0.2246663
## SalePrice 0.3244134 1.0000000 0.6234314
## GarageArea 0.2246663 0.6234314 1.0000000
house.pca <- prcomp(~. ,data=train_numeric, scale.=T)
fviz_eig(house.pca)
* A common rule of thumb is to retain the principal components before
the elbow bend, which appears to be at 3. However, less than 50% of the
variance is captured in the data according to the scree plot. We will
choose 6 principal components instead.
summary(house.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5689 1.74664 1.44279 1.34361 1.2107 1.10469 1.07217
## Proportion of Variance 0.2062 0.09534 0.06505 0.05642 0.0458 0.03814 0.03592
## Cumulative Proportion 0.2062 0.30156 0.36661 0.42303 0.4688 0.50697 0.54289
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.04171 1.03809 1.00844 0.99509 0.9814 0.96411 0.9466
## Proportion of Variance 0.03391 0.03368 0.03178 0.03094 0.0301 0.02905 0.0280
## Cumulative Proportion 0.57680 0.61048 0.64226 0.67320 0.7033 0.73234 0.7603
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.92006 0.91079 0.87151 0.8579 0.81152 0.79750 0.77821
## Proportion of Variance 0.02645 0.02592 0.02374 0.0230 0.02058 0.01988 0.01893
## Cumulative Proportion 0.78680 0.81272 0.83646 0.8595 0.88003 0.89991 0.91884
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.73423 0.65270 0.61820 0.56307 0.5122 0.50166 0.43485
## Proportion of Variance 0.01685 0.01331 0.01194 0.00991 0.0082 0.00786 0.00591
## Cumulative Proportion 0.93568 0.94900 0.96094 0.97085 0.9790 0.98691 0.99282
## PC29 PC30 PC31 PC32
## Standard deviation 0.36738 0.30810 1.191e-15 9.369e-16
## Proportion of Variance 0.00422 0.00297 0.000e+00 0.000e+00
## Cumulative Proportion 0.99703 1.00000 1.000e+00 1.000e+00
Cumulative Proportion
section of the output above at PC6. You can also sum the
numbers in the Proportion of Variance section until you
reach over 50%.sum(0.2062, 0.09534, 0.06505, 0.05642, 0.0458, 0.03814)
## [1] 0.50695
The Standard deviation section indicates the amount
of variance explained by each principal component. Larger standard
deviations indicate principal components that capture more variability
in the data.
If you wanted to, you could grab these six principal components and use any or all of them in a model.
pca_predictors <- house.pca$x[, 1:6]
What is x?
x matrix corresponds to an observation
(sample) from the original dataset.x matrix corresponds to a principal
component.x to help prevent problems that may arise as
a result of the curse of dimensionality (ex: overfitting).WoodDeckSF such that all observations are greater
than 0.X1 <- X+1
estimate attribute to see the optimal \(\lambda\) estimate.exp_dist <- MASS::fitdistr(x=X1, densfun="exponential")
lambda <- exp_dist$estimate
lambda
## rate
## 0.01049929
\(\lambda = 0.01049929\)
exp_sample = rexp(1000, rate=lambda)
ggplot() +
geom_histogram(data=data.frame(exp_sample), aes(x=exp_sample, fill="y"), color="#e9ecef", alpha=0.5) +
geom_histogram(data=data.frame(X), aes(x=X, fill="b"), color="#e9ecef", alpha=0.5) +
scale_colour_manual(name ="exp_sample", values = c("y" = "yellow", "b" = "blue"), labels=c("b" = "exp_sample", "y" = "WoodDeckSq")) +
scale_fill_manual(name ="Data", values = c("y" = "yellow", "b" = "blue"), labels=c("b" = "exp_sample", "y" = "WoodDeckSq")) +
ggtitle("Histogram: WoodDeckSq vs Exp Dist") +
xlab("")+
theme_ipsum()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
* The exponential distribution has a larger spread. There are also more
occurrences of zero. * Both are skewed to the right. * The exponential
distribution is centered farther to the left.
qexp(0.05, rate = lambda)
## [1] 4.885405
qexp(0.95, rate = lambda)
## [1] 285.3271
res <- t.test(data$WoodDeckSF, conf.level=0.95)
res$conf.int
## [1] 87.80998 100.67906
## attr(,"conf.level")
## [1] 0.95
quantile(x=data$WoodDeckSF, probs=c(0.05, 0.95))
## 5% 95%
## 0 335
WoodDeckSF variable. This suggests
that an exponential distribution is not a good fit for this
variable.m <- lm(Y ~ ., data=train_numeric)
summary(m)
##
## Call:
## lm(formula = Y ~ ., data = train_numeric)
##
## Residuals:
## Min 1Q Median 3Q Max
## -442775 -17713 -2277 15084 321710
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.038e+04 1.160e+04 -6.066 1.77e-09 ***
## MSSubClass -1.862e+02 3.273e+01 -5.689 1.62e-08 ***
## LotFrontage -7.440e+01 5.906e+01 -1.260 0.208025
## LotArea 4.824e-01 1.572e-01 3.069 0.002201 **
## OverallQual 2.060e+04 1.320e+03 15.604 < 2e-16 ***
## OverallCond 3.015e+03 1.061e+03 2.840 0.004584 **
## MasVnrArea 3.726e+01 6.934e+00 5.373 9.33e-08 ***
## BsmtFinSF1 1.950e+01 5.571e+00 3.500 0.000483 ***
## BsmtFinSF2 7.756e+00 8.550e+00 0.907 0.364525
## BsmtUnfSF 8.865e+00 4.989e+00 1.777 0.075838 .
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 4.077e+01 6.936e+00 5.878 5.42e-09 ***
## X2ndFlrSF 3.566e+01 5.577e+00 6.394 2.33e-10 ***
## LowQualFinSF 1.900e+01 2.188e+01 0.868 0.385378
## GrLivArea NA NA NA NA
## BsmtFullBath 1.237e+04 3.034e+03 4.078 4.85e-05 ***
## BsmtHalfBath 4.025e+03 4.898e+03 0.822 0.411393
## FullBath 1.634e+04 2.989e+03 5.467 5.59e-08 ***
## HalfBath 6.197e+03 2.999e+03 2.066 0.039050 *
## BedroomAbvGr -1.080e+04 1.957e+03 -5.521 4.16e-08 ***
## KitchenAbvGr -1.782e+04 5.828e+03 -3.057 0.002285 **
## TotRmsAbvGrd 4.883e+03 1.444e+03 3.382 0.000744 ***
## Fireplaces 3.043e+03 2.132e+03 1.428 0.153637
## GarageCars 1.355e+04 3.302e+03 4.103 4.36e-05 ***
## GarageArea -4.063e-01 1.148e+01 -0.035 0.971776
## WoodDeckSF 2.877e+01 9.827e+00 2.927 0.003486 **
## OpenPorchSF 4.016e+00 1.822e+01 0.220 0.825548
## EnclosedPorch -2.421e+01 1.868e+01 -1.296 0.195335
## X3SsnPorch 2.616e+01 3.775e+01 0.693 0.488494
## ScreenPorch 4.599e+01 2.035e+01 2.261 0.023965 *
## PoolArea -5.827e+01 2.971e+01 -1.962 0.050057 .
## MiscVal -4.045e-01 5.845e+00 -0.069 0.944843
## MoSold -2.892e+02 4.057e+02 -0.713 0.476169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37130 on 1164 degrees of freedom
## (265 observations deleted due to missingness)
## Multiple R-squared: 0.8057, Adjusted R-squared: 0.8007
## F-statistic: 160.9 on 30 and 1164 DF, p-value: < 2.2e-16
We will remove the TotalBsmtSF and
GrLivArea variables because coefficients could not be
calculated for them due to singularities.
m1 <- lm(Y ~ ., data=train_numeric %>% dplyr::select(-c(TotalBsmtSF, GrLivArea)))
summary(m1)
##
## Call:
## lm(formula = Y ~ ., data = train_numeric %>% dplyr::select(-c(TotalBsmtSF,
## GrLivArea)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -442775 -17713 -2277 15084 321710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.038e+04 1.160e+04 -6.066 1.77e-09 ***
## MSSubClass -1.862e+02 3.273e+01 -5.689 1.62e-08 ***
## LotFrontage -7.440e+01 5.906e+01 -1.260 0.208025
## LotArea 4.824e-01 1.572e-01 3.069 0.002201 **
## OverallQual 2.060e+04 1.320e+03 15.604 < 2e-16 ***
## OverallCond 3.015e+03 1.061e+03 2.840 0.004584 **
## MasVnrArea 3.726e+01 6.934e+00 5.373 9.33e-08 ***
## BsmtFinSF1 1.950e+01 5.571e+00 3.500 0.000483 ***
## BsmtFinSF2 7.756e+00 8.550e+00 0.907 0.364525
## BsmtUnfSF 8.865e+00 4.989e+00 1.777 0.075838 .
## X1stFlrSF 4.077e+01 6.936e+00 5.878 5.42e-09 ***
## X2ndFlrSF 3.566e+01 5.577e+00 6.394 2.33e-10 ***
## LowQualFinSF 1.900e+01 2.188e+01 0.868 0.385378
## BsmtFullBath 1.237e+04 3.034e+03 4.078 4.85e-05 ***
## BsmtHalfBath 4.025e+03 4.898e+03 0.822 0.411393
## FullBath 1.634e+04 2.989e+03 5.467 5.59e-08 ***
## HalfBath 6.197e+03 2.999e+03 2.066 0.039050 *
## BedroomAbvGr -1.080e+04 1.957e+03 -5.521 4.16e-08 ***
## KitchenAbvGr -1.782e+04 5.828e+03 -3.057 0.002285 **
## TotRmsAbvGrd 4.883e+03 1.444e+03 3.382 0.000744 ***
## Fireplaces 3.043e+03 2.132e+03 1.428 0.153637
## GarageCars 1.355e+04 3.302e+03 4.103 4.36e-05 ***
## GarageArea -4.063e-01 1.148e+01 -0.035 0.971776
## WoodDeckSF 2.877e+01 9.827e+00 2.927 0.003486 **
## OpenPorchSF 4.016e+00 1.822e+01 0.220 0.825548
## EnclosedPorch -2.421e+01 1.868e+01 -1.296 0.195335
## X3SsnPorch 2.616e+01 3.775e+01 0.693 0.488494
## ScreenPorch 4.599e+01 2.035e+01 2.261 0.023965 *
## PoolArea -5.827e+01 2.971e+01 -1.962 0.050057 .
## MiscVal -4.045e-01 5.845e+00 -0.069 0.944843
## MoSold -2.892e+02 4.057e+02 -0.713 0.476169
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37130 on 1164 degrees of freedom
## (265 observations deleted due to missingness)
## Multiple R-squared: 0.8057, Adjusted R-squared: 0.8007
## F-statistic: 160.9 on 30 and 1164 DF, p-value: < 2.2e-16
SalePrice.SalePrice.Here are the most significant variables in this model:
names(coef(m1)[summary(m1)$coef[, "Pr(>|t|)"] < 0.05])[-1]
## [1] "MSSubClass" "LotArea" "OverallQual" "OverallCond" "MasVnrArea"
## [6] "BsmtFinSF1" "X1stFlrSF" "X2ndFlrSF" "BsmtFullBath" "FullBath"
## [11] "HalfBath" "BedroomAbvGr" "KitchenAbvGr" "TotRmsAbvGrd" "GarageCars"
## [16] "WoodDeckSF" "ScreenPorch"
Simplifying the model to include fewer variables:
m2 <- lm(Y ~ ., data=train_numeric %>% dplyr::select(names(coef(m1)[summary(m1)$coef[, "Pr(>|t|)"] < 0.05])[-1]))
summary(m2)
##
## Call:
## lm(formula = Y ~ ., data = train_numeric %>% dplyr::select(names(coef(m1)[summary(m1)$coef[,
## "Pr(>|t|)"] < 0.05])[-1]))
##
## Residuals:
## Min 1Q Median 3Q Max
## -462356 -16562 -2390 14903 289054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.354e+04 9.526e+03 -7.720 2.17e-14 ***
## MSSubClass -1.655e+02 2.633e+01 -6.285 4.35e-10 ***
## LotArea 3.911e-01 1.007e-01 3.883 0.000108 ***
## OverallQual 2.122e+04 1.078e+03 19.688 < 2e-16 ***
## OverallCond 2.897e+03 8.817e+02 3.286 0.001042 **
## MasVnrArea 3.462e+01 5.954e+00 5.814 7.50e-09 ***
## BsmtFinSF1 1.209e+01 3.003e+00 4.025 6.00e-05 ***
## X1stFlrSF 4.875e+01 4.636e+00 10.516 < 2e-16 ***
## X2ndFlrSF 3.598e+01 4.546e+00 7.914 4.95e-15 ***
## BsmtFullBath 1.110e+04 2.423e+03 4.580 5.06e-06 ***
## FullBath 1.384e+04 2.531e+03 5.468 5.36e-08 ***
## HalfBath 4.894e+03 2.505e+03 1.954 0.050880 .
## BedroomAbvGr -1.072e+04 1.677e+03 -6.395 2.17e-10 ***
## KitchenAbvGr -2.005e+04 5.144e+03 -3.898 0.000101 ***
## TotRmsAbvGrd 4.996e+03 1.228e+03 4.068 5.00e-05 ***
## GarageCars 1.326e+04 1.671e+03 7.932 4.31e-15 ***
## WoodDeckSF 3.017e+01 7.989e+00 3.777 0.000165 ***
## ScreenPorch 4.653e+01 1.699e+01 2.739 0.006230 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35370 on 1434 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8033, Adjusted R-squared: 0.801
## F-statistic: 344.5 on 17 and 1434 DF, p-value: < 2.2e-16
test.df <- read.csv("test.csv")
# get the numeric columns of the test set.
exclude_cols <- c("YrSold", "GarageYrBlt", "YearRemodAdd", "YearBuilt", "Id")
test_numeric <- dplyr::select_if(test.df, is.numeric) %>% dplyr::select(-all_of(exclude_cols))
# Predict house sale price
X_pred <- test_numeric
y_pred <- predict(m2, X_pred)
med <- median(y_pred, na.rm = TRUE)
y_pred[is.na(y_pred)] <- med
res <- data.frame(Id=test.df$Id, SalePrice=y_pred)
#write.csv(res, "my_predictions.csv", row.names=FALSE)