Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of: \[ \mu = \sigma = \frac{N+1}{2} \]
set.seed(642)
N <- 10                  # any N >= 6 works; 10 is used here
N1 <- ((N+1)/2)          # serves as both the mean and the sd of Y
X <- runif(10000,1,N)    # 10,000 uniform draws between 1 and N
Y <- rnorm(10000,N1,N1)  # 10,000 normal draws with mean = sd = N1
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
x <- median(X)
y <- quantile(Y, .25, names = FALSE)
P(A|B) translates to “the probability of A given B”. With that in mind, we can first filter X to include only values greater than y, then take the proportion of the remaining observations that are greater than x.
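In symbols, the quantity estimated below is \[ P(X>x \mid X>y) = \frac{P(X>x \cap X>y)}{P(X>y)} \approx \frac{\#\{X_i > x \text{ and } X_i > y\}}{\#\{X_i > y\}} \]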
We see below that this probability is 54.82%.
a.df.0 <- data.frame(X)
a.df.1 <- filter(a.df.0,X>y)
a.df.2 <- filter(a.df.1, X>x)
a.prob <- nrow(a.df.2)/nrow(a.df.1)
a.prob
## [1] 0.5482456
P(A,B) translates to “the probability of A and B”. To accomplish this, we can join the vectors X and Y into one data frame and count the observations for which the X value is greater than x and the Y value is greater than y.
Additionally, in this case we can calculate the theoretical value of P(X>x, Y>y). As y is the 1st quartile of Y, we know that 75% of the observations should be greater than y. Similarly, as x is the median of X, 50% of the observations should be greater than x. Because X and Y were generated independently, P(X>x, Y>y) should therefore be approximately .75 * .5.
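Written out, treating the two events as independent: \[ P(X>x,\ Y>y) = P(X>x)\,P(Y>y) = 0.5 \times 0.75 = 0.375 \]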
We see below that this holds true: our theoretical probability, .375, is very close to our observed probability of .3784.
b.theoretical <- .5*.75
b.theoretical
## [1] 0.375
b.df.0 <- data.frame(X, Y)
b.df.1 <- filter(b.df.0,X>x & Y>y)
b.prob <- nrow(b.df.1)/nrow(b.df.0)
b.prob
## [1] 0.3784
We can approach this problem in the same way as part (a). We find below that P(X<x | X>y) is 45.18%.
c.df.0 <- data.frame(X)
c.df.1 <- filter(c.df.0,X>y)
c.df.2 <- filter(c.df.1, X<x)
c.prob <- nrow(c.df.2)/nrow(c.df.1)
c.prob
## [1] 0.4517544
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
Theory tells us that if A and B are independent, then P(A, B) = P(A) * P(B). Let's assess whether this is the case:
d.df.0 <- data.frame(X,Y)
d.df.1 <- filter(d.df.0,X>x & Y>y)
d.prob.1 <- nrow(d.df.1)/nrow(d.df.0)
d.df.2 <- filter(d.df.0, X>x)
d.prob.2a <- nrow(d.df.2)/nrow(d.df.0)
d.df.3 <- filter(d.df.0, Y>y)
d.prob.2b <- nrow(d.df.3)/nrow(d.df.0)
d.prob.1
## [1] 0.3784
d.prob.2a * d.prob.2b
## [1] 0.375
We can also do this using a contingency table:
d.table <- table(d.df.0$X > x, d.df.0$Y > y)
d.table
##
## FALSE TRUE
## FALSE 1284 3716
## TRUE 1216 3784
P(X>x and Y>y) is equal to the TRUE/TRUE cell above divided by our total observation count.
d.table[2,2] / 10000
## [1] 0.3784
Similarly, P(X>x) and P(Y>y) can each be found by dividing the total of the corresponding TRUE row or column by the total observation count. We then multiply these two marginal probabilities together to compare their product with the joint probability.
sum(d.table[,2])/10000 * sum(d.table[2,])/10000
## [1] 0.375
We see above that P(X>x and Y>y) = .3784 and that P(X>x) * P(Y>y) = .375. Since these values are nearly equal, the events X>x and Y>y appear to be independent. That said, we have formal tests to assess this:
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
The chi-square test can take a contingency table as its input; the p-value is the output with which we are most concerned. Our null hypothesis is that these variables are independent. We will use a significance level of 5%, meaning that if our p-value is less than .05, we must reject the null hypothesis.
chisq.test(d.table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: d.table
## X-squared = 2.3941, df = 1, p-value = 0.1218
As we see above, our p-value is greater than .05, so we fail to reject the null hypothesis; the data are consistent with these variables being independent.
When using R, there are few practical differences in how we run Fisher's Exact Test and the Chi-Square Test. Our null hypothesis, significance level, and input are all the same:
fisher.test(d.table)
##
## Fisher's Exact Test for Count Data
##
## data: d.table
## p-value = 0.1218
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9811225 1.1784053
## sample estimates:
## odds ratio
## 1.075221
We see above that our P-value is identical, and our findings are the same as our Chi Square test: we cannot reject our Null Hypothesis that the variables X and Y are independent.
What is the difference between the two? Which is most appropriate?
The math behind the Chi-Square Test is significantly more intuitive: one squares the difference between the observed count in a given category and the expected count for that category, then divides this by the expected count. You then sum across all categories in the contingency table and compare the result against the chi-square distribution to determine the p-value. This test is intended for use with larger samples.
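As an illustration of that formula, the Pearson statistic can be computed by hand from the contingency table above (a sketch; note that chisq.test applies Yates' continuity correction to 2x2 tables by default, so its reported statistic differs slightly from this uncorrected sum):
# (O - E)^2 / E summed over all four cells, with expected counts E
# built from the products of the row and column margins.
expected <- outer(rowSums(d.table), colSums(d.table)) / sum(d.table)
sum((d.table - expected)^2 / expected)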
Fisher's Exact Test was designed specifically for smaller samples and uses the hypergeometric distribution to compute an exact p-value for the same question.
As we are dealing with a sample of 10,000 observations, the Chi Square Test is more appropriate.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques .
#After downloading the data from Kaggle and uploading to my personal Github:
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/Misc/main/Data605House.csv")
train_raw <- read.csv(text = x, stringsAsFactors = FALSE)
dim(train_raw)
## [1] 1460 81
So the first thing we see is that this is a large dataset. We have 1460 observations of 81 variables. After removing the ID and our response variable, we see there are 79 explanatory variables.
After going through the Kaggle descriptions, I selected a small number of variables to explore in this dataset. Most of these are self-explanatory; however, I included notes where necessary.
Categorical Variables: MSSubClass (dwelling class), MSZoning, Utilities, Neighborhood, Condition1, BldgType, HouseStyle, Functional, PoolQC, SaleCondition, SaleType
Continuous/Numerical Variables: OverallQual, OverallCond, LotArea, YearRemodAdd, BedroomAbvGr, FullBath, YrSold, SalePrice
train_set <- data.frame(train_raw$MSSubClass, train_raw$MSZoning, train_raw$Utilities, train_raw$Neighborhood, train_raw$Condition1, train_raw$BldgType, train_raw$HouseStyle, train_raw$OverallQual, train_raw$OverallCond, train_raw$Functional, train_raw$PoolQC, train_raw$SaleCondition, train_raw$SaleType, train_raw$LotArea, train_raw$YearRemodAdd, train_raw$BedroomAbvGr, train_raw$FullBath, train_raw$YrSold, train_raw$SalePrice)
names(train_set) <- c("Class", "Zoning", "Utilities", "Neighborhood", "Conditional", "BldgType", "HouseStyle", "OverallQual", "OverallCond", "Functional", "PoolQC", "SaleCondition", "SaleType", "LotArea", "YearRemod", "Bedroom", "Bathroom", "YearSold", "Price")
# Convert the numerical Class variable to a categorical variable
train_set$Class <- as.factor(train_set$Class)
As we see above, I elected to change Class to a categorical variable, as it maps to assigned categories per the Kaggle documentation. While year variables are often treated as categorical, due to the range of years and the directional nature of their effect (inflation has been reasonably linear, and a newly remodeled home is arguably more valuable than one remodeled longer ago), I elected to keep them numerical.
We can start with some general statistics for each of our selected variables:
summary(train_set)
## Class Zoning Utilities Neighborhood Conditional
## 20 :536 C (all): 10 AllPub:1459 NAmes :225 Norm :1260
## 60 :299 FV : 65 NoSeWa: 1 CollgCr:150 Feedr : 81
## 50 :144 RH : 16 OldTown:113 Artery : 48
## 120 : 87 RL :1151 Edwards:100 RRAn : 26
## 30 : 69 RM : 218 Somerst: 86 PosN : 19
## 160 : 63 Gilbert: 79 RRAe : 11
## (Other):262 (Other):707 (Other): 15
## BldgType HouseStyle OverallQual OverallCond Functional
## 1Fam :1220 1Story :726 Min. : 1.000 Min. :1.000 Maj1: 14
## 2fmCon: 31 2Story :445 1st Qu.: 5.000 1st Qu.:5.000 Maj2: 5
## Duplex: 52 1.5Fin :154 Median : 6.000 Median :5.000 Min1: 31
## Twnhs : 43 SLvl : 65 Mean : 6.099 Mean :5.575 Min2: 34
## TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000 3rd Qu.:6.000 Mod : 15
## 1.5Unf : 14 Max. :10.000 Max. :9.000 Sev : 1
## (Other): 19 Typ :1360
## PoolQC SaleCondition SaleType LotArea YearRemod
## Ex : 2 Abnorml: 101 WD :1267 Min. : 1300 Min. :1950
## Fa : 2 AdjLand: 4 New : 122 1st Qu.: 7554 1st Qu.:1967
## Gd : 3 Alloca : 12 COD : 43 Median : 9478 Median :1994
## NA's:1453 Family : 20 ConLD : 9 Mean : 10517 Mean :1985
## Normal :1198 ConLI : 5 3rd Qu.: 11602 3rd Qu.:2004
## Partial: 125 ConLw : 5 Max. :215245 Max. :2010
## (Other): 9
## Bedroom Bathroom YearSold Price
## Min. :0.000 Min. :0.000 Min. :2006 Min. : 34900
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2007 1st Qu.:129975
## Median :3.000 Median :2.000 Median :2008 Median :163000
## Mean :2.866 Mean :1.565 Mean :2008 Mean :180921
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2009 3rd Qu.:214000
## Max. :8.000 Max. :3.000 Max. :2010 Max. :755000
##
Next, we can generate some plots to see what is happening with our data. The summaries above give us a decent idea of the range and distribution of our data; however, visualizations are more impactful.
I selected a few I found interesting:
ggplot(train_set, aes(LotArea)) + geom_histogram(bins = 50) + scale_x_log10() + ggtitle("Lot Area Hist. (Log Trans. base 10)")+ theme_minimal()
ggplot(train_set, aes(Bathroom)) + geom_bar() + labs(title = "Count of Houses with Given Number of Bathrooms", x = "Number of Bathrooms", y = "Number of Houses")+ theme_minimal()
ggplot(train_set, aes(Bedroom)) + geom_bar() + labs(title = "Count of Houses with Given Number of Bedrooms", x = "Number of Bedrooms", y = "Number of Houses")+ theme_minimal()
ggplot(train_set, aes(YearRemod)) + geom_histogram(bins = 50) + ggtitle("Year Remodeled (Or Built, if never remod.)")+ theme_minimal()
ggplot(train_set, aes(YearSold)) + geom_bar() + ggtitle("Count of Houses Sold per Year")+ theme_minimal()
ggplot(train_set, aes(Price)) + geom_histogram(bins = 50) + ggtitle("Sale Price")+ theme_minimal()
A Few Observations:
ggplot(train_set, aes(LotArea, Price)) + geom_jitter() + scale_x_log10()
ggplot(train_set, aes(YearRemod, Price)) +geom_jitter() + stat_smooth(method='lm', formula = y~poly(x,2))
I expected both of these pairings to be fairly linear, and it seems that this holds true. Thinking through the plot of remodel year vs. sale price, one would expect that the year in which a home was remodeled becomes increasingly relevant as the remodel year approaches the year the home was sold. As a result, I added a polynomial regression line to illustrate this relationship.
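For reference, the quadratic smoother drawn above corresponds to a model along the following lines; this is a sketch for illustration and is not used later in the analysis.
# Degree-2 polynomial fit of sale price on remodel year,
# mirroring the stat_smooth(formula = y ~ poly(x, 2)) layer above.
quad_fit <- lm(Price ~ poly(YearRemod, 2), data = train_set)
summary(quad_fit)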
While not technically part of the ask in this portion of the assignment, I wanted to explore the relationship between a few additional variables:
ggplot(train_set, aes(Bedroom, Bathroom)) + geom_jitter()
ggplot(train_set, aes(OverallQual, OverallCond)) + geom_jitter()
Again, these variables are strongly correlated. Of note, many homes are in better condition than their quality rating would suggest, which implies they have been prepared for sale. Additionally, nearly all of the homes with a quality rating of 10 have a condition rating of 5, which speaks to how the condition variable was used: 5 may simply have been a “default”.
cor_df <- data.frame(train_set$OverallQual, train_set$LotArea, train_set$Price)
names(cor_df) <- c("Quality", "Lot Area", "Price")
C_Matrix <- as.matrix(cor(cor_df))
C_Matrix
## Quality Lot Area Price
## Quality 1.0000000 0.1058057 0.7909816
## Lot Area 0.1058057 1.0000000 0.2638434
## Price 0.7909816 0.2638434 1.0000000
As expected, Overall Quality and Price are well correlated. I honestly expected Lot Area and Price to be better correlated.
cor.test(cor_df$Price, cor_df$Quality, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: cor_df$Price and cor_df$Quality
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
cor.test(cor_df$Price, cor_df$`Lot Area`, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: cor_df$Price and cor_df$`Lot Area`
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
As we see above, the p-value is extremely low for each of these relationships. In both cases we reject the null hypothesis that the true correlation is zero.
We have demonstrated that both Lot Size and Home Quality can be used to predict the price for which a home will sell. Because the relationship is so strong, even in the weaker of the two cases, I am not worried about familywise error.
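As a quick check of that familywise-error concern, a Bonferroni correction could be applied to the two correlation-test p-values (a sketch, not part of the original analysis):
# Adjust the two p-values from the cor.test calls above; both remain
# far below any conventional significance threshold after correction.
p_vals <- c(cor.test(cor_df$Price, cor_df$Quality)$p.value,
            cor.test(cor_df$Price, cor_df$`Lot Area`)$p.value)
p.adjust(p_vals, method = "bonferroni")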
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
# Invert your correlation matrix from above
P_Matrix <- solve(C_Matrix)
P_Matrix
## Quality Lot Area Price
## Quality 2.7550503 0.3046752 -2.2595806
## Lot Area 0.3046752 1.1085153 -0.5334669
## Price -2.2595806 -0.5334669 2.9280384
# Multiply the correlation matrix by the precision matrix ...
round(P_Matrix %*% C_Matrix)
## Quality Lot Area Price
## Quality 1 0 0
## Lot Area 0 1 0
## Price 0 0 1
# ... and then multiply the precision matrix by the correlation matrix.
round(C_Matrix %*% P_Matrix)
## Quality Lot Area Price
## Quality 1 0 0
## Lot Area 0 1 0
## Price 0 0 1
# Conduct LU decomposition on the matrix.*
lu.decomposition(C_Matrix)
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.1058057 1.0000000 0
## [3,] 0.7909816 0.1821926 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.1058057 0.7909816
## [2,] 0 0.9888051 0.1801530
## [3,] 0 0.0000000 0.3415256
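As a sanity check on the decomposition (lu.decomposition is assumed here to come from the matrixcalc package loaded in the first chunk), multiplying L and U back together should reproduce the correlation matrix:
LU <- lu.decomposition(C_Matrix)
round(LU$L %*% LU$U - C_Matrix, 10) # should be a matrix of zeros, up to rounding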
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
#Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
minLA <- min(train_set$LotArea)
minLA
## [1] 1300
LotArea2 <- train_set$LotArea - minLA + 1
min(LotArea2)
## [1] 1
LotAreadb <- data.frame(LotArea2)
names(LotAreadb) <- "LotArea"
#Then load the MASS package and run fitdistr to fit an exponential probability density function [loaded in the first chunk]
DisFit <- fitdistr(LotAreadb$LotArea, densfun = 'exponential')
Lambda <- DisFit$estimate
DisSamples <- rexp(1000, Lambda)
DsDF <- data.frame(DisSamples)
names(DsDF) <- "LotAreaExpSamples"
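As a quick check on the fitted rate, recall that the maximum-likelihood estimate of an exponential rate is the reciprocal of the sample mean, so the value returned by fitdistr can be verified directly (a sketch):
Lambda                        # rate estimated by fitdistr
1 / mean(LotAreadb$LotArea)   # exponential MLE: 1 / sample mean; should match Lambda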
ggplot(LotAreadb, aes(LotArea)) +geom_histogram(bins = 50)
# seeing how there were a small number of outliers, I elected to set limits on our x axis:
ggplot(LotAreadb, aes(LotArea)) +geom_histogram(bins = 50) +xlim(0,100000)
## Warning: Removed 4 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
ggplot(DsDF, aes(LotAreaExpSamples)) +geom_histogram(bins = 50) +xlim(0,100000)
## Warning: Removed 2 rows containing missing values (geom_bar).
As we see above, the generated samples do not appear to track our observations closely enough to be useful. Let's confirm:
# Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data.
round(quantile(DisSamples, c(.05, .95)), 3)
## 5% 95%
## 366.40 28425.21
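The prompt also asks for the 5th and 95th percentiles from the exponential CDF itself; those come from the quantile function of the fitted distribution rather than from the simulated sample (a sketch):
# Theoretical percentiles q solving F(q) = 0.05 and F(q) = 0.95
# for an exponential distribution with rate Lambda.
qexp(c(0.05, 0.95), rate = Lambda)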
t.test(LotArea2)
##
## One Sample t-test
##
## data: LotArea2
## t = 35.287, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8705.418 9730.238
## sample estimates:
## mean of x
## 9217.828
round(quantile(LotArea2, c(.05, .95)), 2)
## 5% 95%
## 2012.70 16102.15
As expected, we see substantial differences between the 5th and 95th percentiles of the generated sample and those of the observed data. I would not use this exponential distribution as a proxy for our data.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I will be fitting a linear regression to our data. As I have already isolated what I felt to be quality predictor variables, it makes sense to run a model with nearly all of these variables to get a sense of their predictive power, and then remove and adjust variables from there.
Model1 <- lm(data = train_set, formula = Price ~ Class + Zoning + Neighborhood + Conditional + BldgType + HouseStyle + OverallQual + OverallCond + SaleCondition + SaleType + LotArea + YearRemod + Bedroom + Bathroom + YearSold)
summary(Model1)
##
## Call:
## lm(formula = Price ~ Class + Zoning + Neighborhood + Conditional +
## BldgType + HouseStyle + OverallQual + OverallCond + SaleCondition +
## SaleType + LotArea + YearRemod + Bedroom + Bathroom + YearSold,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -202452 -19849 -1968 16133 338716
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.501e+04 1.567e+06 0.022 0.98218
## Class30 -1.877e+04 6.205e+03 -3.024 0.00254 **
## Class40 -6.153e+03 1.947e+04 -0.316 0.75201
## Class45 -2.634e+04 3.019e+04 -0.873 0.38307
## Class50 2.913e+02 1.273e+04 0.023 0.98175
## Class60 6.497e+03 1.106e+04 0.587 0.55711
## Class70 -1.628e+04 1.188e+04 -1.370 0.17094
## Class75 1.209e+04 2.179e+04 0.555 0.57917
## Class80 -1.041e+04 1.792e+04 -0.581 0.56153
## Class85 -4.303e+03 1.599e+04 -0.269 0.78794
## Class90 -1.238e+04 7.363e+03 -1.681 0.09298 .
## Class120 -2.067e+04 2.296e+04 -0.901 0.36797
## Class160 -2.859e+04 2.679e+04 -1.067 0.28608
## Class180 -1.574e+04 3.021e+04 -0.521 0.60236
## Class190 -1.897e+04 3.957e+04 -0.479 0.63172
## ZoningFV -8.210e+03 1.833e+04 -0.448 0.65423
## ZoningRH -3.471e+03 1.844e+04 -0.188 0.85073
## ZoningRL -3.546e+02 1.542e+04 -0.023 0.98166
## ZoningRM -9.310e+01 1.447e+04 -0.006 0.99487
## NeighborhoodBlueste 1.108e+04 3.040e+04 0.364 0.71559
## NeighborhoodBrDale -6.807e+02 1.717e+04 -0.040 0.96839
## NeighborhoodBrkSide -8.874e+03 1.367e+04 -0.649 0.51648
## NeighborhoodClearCr 1.002e+04 1.350e+04 0.742 0.45811
## NeighborhoodCollgCr -1.736e+04 1.101e+04 -1.577 0.11511
## NeighborhoodCrawfor 2.439e+04 1.251e+04 1.949 0.05146 .
## NeighborhoodEdwards -2.071e+04 1.184e+04 -1.749 0.08043 .
## NeighborhoodGilbert -3.087e+04 1.163e+04 -2.654 0.00804 **
## NeighborhoodIDOTRR -2.313e+04 1.559e+04 -1.484 0.13816
## NeighborhoodMeadowV 1.337e+04 1.696e+04 0.789 0.43051
## NeighborhoodMitchel -1.481e+04 1.216e+04 -1.218 0.22327
## NeighborhoodNAmes -1.403e+04 1.145e+04 -1.225 0.22083
## NeighborhoodNoRidge 7.254e+04 1.236e+04 5.868 5.53e-09 ***
## NeighborhoodNPkVill 1.077e+03 1.652e+04 0.065 0.94804
## NeighborhoodNridgHt 5.305e+04 1.095e+04 4.844 1.41e-06 ***
## NeighborhoodNWAmes -1.883e+04 1.189e+04 -1.584 0.11345
## NeighborhoodOldTown -2.358e+04 1.382e+04 -1.707 0.08804 .
## NeighborhoodSawyer -1.411e+04 1.207e+04 -1.169 0.24257
## NeighborhoodSawyerW -9.550e+03 1.172e+04 -0.815 0.41539
## NeighborhoodSomerst 1.157e+03 1.370e+04 0.084 0.93269
## NeighborhoodStoneBr 6.529e+04 1.238e+04 5.276 1.53e-07 ***
## NeighborhoodSWISU -2.144e+04 1.429e+04 -1.500 0.13382
## NeighborhoodTimber -3.530e+03 1.243e+04 -0.284 0.77638
## NeighborhoodVeenker 3.442e+04 1.570e+04 2.193 0.02851 *
## ConditionalFeedr -1.319e+03 7.508e+03 -0.176 0.86059
## ConditionalNorm 4.912e+03 6.171e+03 0.796 0.42616
## ConditionalPosA 1.527e+04 1.509e+04 1.012 0.31169
## ConditionalPosN 1.138e+04 1.088e+04 1.046 0.29565
## ConditionalRRAe -1.371e+04 1.369e+04 -1.001 0.31702
## ConditionalRRAn -3.051e+02 1.007e+04 -0.030 0.97583
## ConditionalRRNe -1.611e+04 2.812e+04 -0.573 0.56687
## ConditionalRRNn 6.771e+03 1.863e+04 0.363 0.71636
## BldgType2fmCon 6.482e+03 3.844e+04 0.169 0.86614
## BldgTypeDuplex NA NA NA NA
## BldgTypeTwnhs -3.040e+04 2.459e+04 -1.236 0.21663
## BldgTypeTwnhsE -1.785e+04 2.328e+04 -0.767 0.44330
## HouseStyle1.5Unf 8.597e+02 2.956e+04 0.029 0.97681
## HouseStyle1Story 9.526e+03 1.217e+04 0.783 0.43405
## HouseStyle2.5Fin 3.247e+04 2.292e+04 1.417 0.15684
## HouseStyle2.5Unf -2.640e+04 2.247e+04 -1.175 0.24021
## HouseStyle2Story 9.127e+03 1.162e+04 0.785 0.43231
## HouseStyleSFoyer 9.939e+03 1.679e+04 0.592 0.55398
## HouseStyleSLvl 1.562e+04 1.933e+04 0.808 0.41921
## OverallQual 2.674e+04 1.210e+03 22.106 < 2e-16 ***
## OverallCond 1.856e+03 1.112e+03 1.669 0.09537 .
## SaleConditionAdjLand -1.228e+04 2.028e+04 -0.605 0.54504
## SaleConditionAlloca 1.824e+04 1.267e+04 1.440 0.15012
## SaleConditionFamily -3.642e+03 9.712e+03 -0.375 0.70769
## SaleConditionNormal 6.114e+03 4.400e+03 1.390 0.16484
## SaleConditionPartial -1.565e+04 2.323e+04 -0.674 0.50056
## SaleTypeCon 4.270e+04 2.846e+04 1.500 0.13380
## SaleTypeConLD 7.808e+03 1.518e+04 0.514 0.60706
## SaleTypeConLI 1.547e+04 1.844e+04 0.839 0.40149
## SaleTypeConLw -2.210e+03 1.862e+04 -0.119 0.90555
## SaleTypeCWD 1.872e+04 2.031e+04 0.922 0.35693
## SaleTypeNew 4.442e+04 2.410e+04 1.843 0.06554 .
## SaleTypeOth 1.057e+04 2.307e+04 0.458 0.64708
## SaleTypeWD 1.180e+03 6.515e+03 0.181 0.85625
## LotArea 9.574e-01 1.143e-01 8.379 < 2e-16 ***
## YearRemod 2.408e+02 7.342e+01 3.280 0.00106 **
## Bedroom 7.417e+02 1.758e+03 0.422 0.67316
## Bathroom 2.001e+04 2.782e+03 7.195 1.02e-12 ***
## YearSold -2.780e+02 7.829e+02 -0.355 0.72260
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37920 on 1379 degrees of freedom
## Multiple R-squared: 0.7846, Adjusted R-squared: 0.7721
## F-statistic: 62.8 on 80 and 1379 DF, p-value: < 2.2e-16
With an R-squared of .78, this is a great start! That being said, we are looking at a large number of variables. With so many variables, we run the risk of overfitting our model.
Let's break this into test and train sets. While we are already working with a training set, we want to divide it further so that we can more objectively assess how predictive our variables are on unseen samples.
set.seed(642)
train_set$TestTrain <- sample(c("test","train"),replace = TRUE, size = nrow(train_set),prob = c(.2,.8))
Model_Test <- filter(train_set, train_set$TestTrain == "test")
nrow(Model_Test)
## [1] 309
Model_Train <- filter(train_set, train_set$TestTrain == "train")
nrow(Model_Train)
## [1] 1151
Now that we have our mini test and train sets, let's fit the regression on the new training set, use it to predict prices for the held-out test set, and calculate the Root Mean Squared Error (essentially the standard deviation of the residuals of our predicted price vs. the actual price).
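A note on rmse(): it is assumed here to come from a helper package such as Metrics, and it is equivalent to the short hand computation below.
# RMSE by hand: the square root of the mean squared difference
# between the actual and predicted prices.
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))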
Model1.b <- lm(data = Model_Train, formula = Price ~ Class + Zoning + Neighborhood + Conditional + BldgType + HouseStyle + OverallQual + OverallCond + SaleCondition + SaleType + LotArea + YearRemod + Bedroom + Bathroom + YearSold)
summary(Model1.b)$r.squared
## [1] 0.7866565
Model1.b.pred <- predict(Model1.b, newdata = Model_Test)
rmse(Model_Test$Price, Model1.b.pred)
## [1] 34770.97
summary(Model_Test$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 132250 167500 181391 212000 501837
An RMSE of roughly 35k is very solid when the mean price in our test set is about 181k.
Now that we know this model is off to a good start, let's investigate our variables. We will reference the model created with all of our data as we sort out the predictive power of these variables.
Model2 <- lm(data = train_set, formula = Price ~ Class + Neighborhood + OverallQual + OverallCond + SaleType + LotArea + YearRemod + Bathroom)
Model2.b <- lm(data = Model_Train, formula = Price ~ Class + Neighborhood + OverallQual + OverallCond + SaleType + LotArea + YearRemod + Bathroom)
summary(Model2)
##
## Call:
## lm(formula = Price ~ Class + Neighborhood + OverallQual + OverallCond +
## SaleType + LotArea + YearRemod + Bathroom, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -209968 -19693 -1926 15999 335548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.144e+05 1.402e+05 -3.669 0.000253 ***
## Class30 -1.925e+04 6.065e+03 -3.174 0.001535 **
## Class40 -6.745e+03 1.930e+04 -0.349 0.726770
## Class45 -3.639e+04 1.161e+04 -3.135 0.001752 **
## Class50 -1.009e+04 4.590e+03 -2.198 0.028111 *
## Class60 6.370e+03 3.235e+03 1.969 0.049119 *
## Class70 -1.675e+04 6.431e+03 -2.605 0.009278 **
## Class75 7.979e+02 1.066e+04 0.075 0.940330
## Class80 -4.070e+03 5.363e+03 -0.759 0.448030
## Class85 -4.778e+03 8.767e+03 -0.545 0.585845
## Class90 -1.338e+04 5.923e+03 -2.258 0.024078 *
## Class120 -3.916e+04 5.370e+03 -7.294 5.02e-13 ***
## Class160 -5.307e+04 7.222e+03 -7.349 3.38e-13 ***
## Class180 -3.900e+04 1.484e+04 -2.628 0.008688 **
## Class190 -1.701e+04 8.000e+03 -2.126 0.033670 *
## NeighborhoodBlueste 1.309e+04 2.968e+04 0.441 0.659309
## NeighborhoodBrDale -4.279e+03 1.566e+04 -0.273 0.784673
## NeighborhoodBrkSide -9.584e+03 1.250e+04 -0.766 0.443561
## NeighborhoodClearCr 9.861e+03 1.302e+04 0.757 0.449058
## NeighborhoodCollgCr -1.717e+04 1.079e+04 -1.591 0.111843
## NeighborhoodCrawfor 2.529e+04 1.222e+04 2.070 0.038668 *
## NeighborhoodEdwards -2.093e+04 1.152e+04 -1.818 0.069294 .
## NeighborhoodGilbert -3.158e+04 1.141e+04 -2.768 0.005721 **
## NeighborhoodIDOTRR -2.405e+04 1.310e+04 -1.836 0.066641 .
## NeighborhoodMeadowV 1.442e+04 1.575e+04 0.916 0.359965
## NeighborhoodMitchel -1.494e+04 1.187e+04 -1.258 0.208485
## NeighborhoodNAmes -1.396e+04 1.119e+04 -1.248 0.212297
## NeighborhoodNoRidge 7.221e+04 1.220e+04 5.920 4.03e-09 ***
## NeighborhoodNPkVill 1.716e+02 1.641e+04 0.010 0.991655
## NeighborhoodNridgHt 5.204e+04 1.083e+04 4.806 1.70e-06 ***
## NeighborhoodNWAmes -1.877e+04 1.158e+04 -1.622 0.105096
## NeighborhoodOldTown -2.535e+04 1.189e+04 -2.131 0.033227 *
## NeighborhoodSawyer -1.629e+04 1.177e+04 -1.384 0.166707
## NeighborhoodSawyerW -1.144e+04 1.146e+04 -0.998 0.318229
## NeighborhoodSomerst -5.664e+03 1.110e+04 -0.510 0.609892
## NeighborhoodStoneBr 6.610e+04 1.226e+04 5.390 8.26e-08 ***
## NeighborhoodSWISU -1.697e+04 1.374e+04 -1.235 0.216972
## NeighborhoodTimber -3.873e+03 1.223e+04 -0.317 0.751482
## NeighborhoodVeenker 3.297e+04 1.556e+04 2.119 0.034280 *
## OverallQual 2.680e+04 1.173e+03 22.854 < 2e-16 ***
## OverallCond 1.974e+03 1.096e+03 1.801 0.071879 .
## SaleTypeCon 4.590e+04 2.831e+04 1.622 0.105109
## SaleTypeConLD 8.927e+03 1.435e+04 0.622 0.534055
## SaleTypeConLI 1.589e+04 1.812e+04 0.877 0.380904
## SaleTypeConLw -4.568e+03 1.821e+04 -0.251 0.801919
## SaleTypeCWD 1.789e+04 2.007e+04 0.891 0.372972
## SaleTypeNew 2.666e+04 7.356e+03 3.624 0.000300 ***
## SaleTypeOth 9.792e+03 2.286e+04 0.428 0.668452
## SaleTypeWD 4.518e+03 6.079e+03 0.743 0.457479
## LotArea 9.675e-01 1.130e-01 8.561 < 2e-16 ***
## YearRemod 2.447e+02 7.227e+01 3.385 0.000731 ***
## Bathroom 2.066e+04 2.562e+03 8.064 1.56e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37910 on 1408 degrees of freedom
## Multiple R-squared: 0.7803, Adjusted R-squared: 0.7723
## F-statistic: 98.03 on 51 and 1408 DF, p-value: < 2.2e-16
Model2.b.pred <- predict(Model2.b, newdata = Model_Test)
rmse(Model_Test$Price, Model2.b.pred)
## [1] 34440.2
Looking at our RMSE, our model has very slightly improved. While I did not share it here, I quickly ran a few additional models to see if the R-squared and/or RMSE could be significantly improved, and I feel this is where I would like to leave it. With something like housing prices, so much is market dependent - an interesting follow-up might be to bring in loan interest rates during the period in question, or perhaps a different proxy for housing-market health such as the number of homes sold in a year vs. the quantity on the market. That being said, I believe this model is fairly strong given the data we have.
Reviewing the Kaggle Submission guidelines, we see we need to download the test file, and supply a predicted sale price for each value.
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/Misc/main/test.csv")
test_raw <- read.csv(text = x, stringsAsFactors = FALSE)
test_set <- data.frame(test_raw$MSSubClass, test_raw$Neighborhood, test_raw$OverallQual, test_raw$OverallCond, test_raw$SaleType, test_raw$LotArea, test_raw$YearRemodAdd, test_raw$FullBath, test_raw$Id)
names(test_set) <- c("Class", "Neighborhood", "OverallQual", "OverallCond", "SaleType", "LotArea", "YearRemod", "Bathroom","Id")
#test_pred <- predict(Model2, newdata = test_set)
At this stage, I ran into an error on the now commented-out line: “factor Class has new levels 150”. What we see here is that the testing set contains a Class level that did not exist in our training set.
There are a couple of ways around this. As Kaggle only looks for each entry's Id and SalePrice, I am choosing what I believe to be the most graceful workaround. The Kaggle documentation tells us that Class 150 corresponds to “1-1/2 STORY PUD - ALL AGES”. Looking through our other entries, I believe this to be most similar to Class 120, “1-STORY PUD (Planned Unit Development) - 1946 & NEWER”, especially when you consider that the one home this applies to was built or remodeled in 1981, as seen below (I would wager this was the year it was built).
filter(test_set, Class=="150")$YearRemod
## [1] 1981
Row150 <- which(test_set$Class == "150")
test_set$Class[Row150] <- "120"
nrow(filter(test_set, Class == "150"))
## [1] 0
# I also found that one value was missing a sale type, and will replace it with the most common and least impactful value:
test_set$SaleType[1030] <- "WD"
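Rather than hard-coding row 1030, the same fix could also be applied by locating missing SaleType values programmatically (a sketch, assuming the blank entry reads in as NA):
missing_rows <- which(is.na(test_set$SaleType))  # rows with no recorded sale type
test_set$SaleType[missing_rows] <- "WD"          # fill with the most common level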
With that sorted, we can apply our model, generate our submission file and submit our results!
test_pred <- predict(Model2, newdata = test_set)
Sub_file <- data.frame(test_set$Id, test_pred)
names(Sub_file) <- c("Id", "SalePrice")
# Commented out so as not to repeatedly save the file.
#write.csv(Sub_file,"C:\\Users\\bloom\\desktop\\subfile.csv", row.names = FALSE)
After submitting, I found that I scored 0.206, placing 3725th at the time of my submission. My Kaggle username is “Chris Bloome”.