Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of: \[ \mu = \sigma = \frac{(N+1)}{2} \]

set.seed(642)

N <- 10
N1 <- ((N+1)/2)

X <- runif(10000,1,N)
Y <- rnorm(10000,N1,N1)

Part A

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

x <- median(X)
y <- quantile(Y, .25, names = FALSE)

(a) P(X>x | X>y)

P(A|B) translates to “the probability of A given B”. With that in mind, we can first filter our X to only include values greater than y, then take the percentage of remaining observations greater than x.

We see below that this probability is 54.82%.

a.df.0 <- data.frame(X)

a.df.1 <- filter(a.df.0,X>y)

a.df.2 <- filter(a.df.1, X>x)

a.prob <- nrow(a.df.2)/nrow(a.df.1)

a.prob
## [1] 0.5482456
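As a quick cross-check (a minimal sketch assuming the same X, x, and y objects), the same conditional probability can be computed directly from logical vectors, without building intermediate data frames:

# P(X > x | X > y) = P(X > x and X > y) / P(X > y)
sum(X > x & X > y) / sum(X > y)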

(b) P(X>x, Y>y)

P(A,B) translates to “the probability of A and B”. To accomplish this we can combine the vectors X and Y into one data frame and count the observations for which the X value is greater than x and the Y value is greater than y.

Additionally, in this case we can calculate the theoretical value of P(X>x, Y>y). As y is the 1st quartile of Y, we know that 75% of the observations should be greater than y. Similarly, as x is the median of X, 50% of the observations should be greater than x. Assuming X and Y are independent, P(X>x, Y>y) should therefore be approximately .75 * .5.

We see below that this holds true. Our theoretical probability, .375, is very close to our observed probability .3784.

b.theoretical <- .5*.75
b.theoretical
## [1] 0.375
b.df.0 <- data.frame(X, Y)

b.df.1 <- filter(b.df.0,X>x & Y>y)

b.prob <- nrow(b.df.1)/nrow(b.df.0)

b.prob
## [1] 0.3784

(c) P(X<x | X>y)

We can approach this problem in the same way as part (a). We find below that P(X<x | X>y) is 45.18%.

c.df.0 <- data.frame(X)

c.df.1 <- filter(c.df.0,X>y)

c.df.2 <- filter(c.df.1, X<x)

c.prob <- nrow(c.df.2)/nrow(c.df.1)

c.prob
## [1] 0.4517544

Part B

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Theory tells us that if A and B are independent, then P(A, B) = P(A)*P(B). Let's assess whether this is the case:

d.df.0 <- data.frame(X,Y)

d.df.1 <- filter(d.df.0,X>x & Y>y)

d.prob.1 <- nrow(d.df.1)/nrow(d.df.0)

d.df.2 <- filter(d.df.0, X>x)

d.prob.2a <- nrow(d.df.2)/nrow(d.df.0)

d.df.3 <- filter(d.df.0, Y>y)

d.prob.2b <- nrow(d.df.3)/nrow(d.df.0)

d.prob.1
## [1] 0.3784
d.prob.2a * d.prob.2b
## [1] 0.375

We can also do this using a contingency table:

d.table <- table(d.df.0$X > x, d.df.0$Y > y)
d.table
##        
##         FALSE TRUE
##   FALSE  1284 3716
##   TRUE   1216 3784

The P(X>x and Y>y) is equal to the True/True value above divided by our total observation count.

d.table[2,2] / 10000
## [1] 0.3784

Similarly, P(X>x) and P(Y>y) can each be found by dividing the total of the corresponding TRUE row or column by the total observation count. Multiplying these two marginal probabilities together gives the value to compare against the joint probability.

sum(d.table[,2])/10000 * sum(d.table[2,])/10000
## [1] 0.375
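For reference, the joint and marginal proportions can also be displayed in a single table using base R (a sketch built from the same d.table as above):

# Joint proportions with marginal (row/column) sums appended
addmargins(prop.table(d.table))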

We see above that the probability P(X>x and Y>y) = .3784

and that P(X>x)*P(Y>y) = .375.

With this we can say that the events X>x and Y>y are likely independent. That being said, we have formal tests to measure this:

Part C

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Chi Square Test

The chi square test can take a contingency table as an input. The p-value is the output we are most concerned with.

For the chi square test, our Null Hypothesis is that these variables are independent. We will use a significance level of 5%, meaning, if our P-Value is less than .05, we must reject our Null Hypothesis.

chisq.test(d.table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  d.table
## X-squared = 2.3941, df = 1, p-value = 0.1218

As we see above, our p-value is greater than .05, meaning we cannot reject our Null Hypothesis; this result is consistent with the variables being independent.

Fisher's Exact Test

When using R, there are few practical differences between running Fisher's Exact Test and the Chi Square Test. Our Null Hypothesis, significance level and input are all the same:

fisher.test(d.table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  d.table
## p-value = 0.1218
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9811225 1.1784053
## sample estimates:
## odds ratio 
##   1.075221

We see above that our P-value is identical, and our findings are the same as our Chi Square test: we cannot reject our Null Hypothesis that the variables X and Y are independent.

Analysis

What is the difference between the two? Which is most appropriate?

The math behind the Chi Square Test is significantly more intuitive: one squares the difference between the observed count in a given category and the expected count for that category, then divides by the expected count. These values are summed across all categories in the contingency table, and the result is compared against the chi-squared distribution (or a statistical table) to determine the p-value. This test is intended for use with larger samples.
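As a minimal sketch of that arithmetic (using the d.table contingency table from Part B, and without the Yates continuity correction that chisq.test applies by default, so the statistic will differ slightly from the output above):

# Chi-square statistic by hand: sum((observed - expected)^2 / expected)
observed <- d.table
expected <- outer(rowSums(d.table), colSums(d.table)) / sum(d.table)
chi_sq <- sum((observed - expected)^2 / expected)
pchisq(chi_sq, df = 1, lower.tail = FALSE)  # p-value from the chi-squared distribution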

Fisher's Exact Test was designed specifically for smaller samples; it computes the exact probability of the observed table (and of more extreme tables) using the hypergeometric distribution.

As we are dealing with a sample of 10,000 observations, the Chi Square Test is more appropriate.

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques .

#After downloading the data from Kaggle and uploading to my personal Github:
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/Misc/main/Data605House.csv")

train_raw <- read.csv(text = x, stringsAsFactors = FALSE)

dim(train_raw) 
## [1] 1460   81

So the first thing we see is that this is a large dataset. We have 1460 observations of 81 variables. After removing the ID and our response variable, we see there are 79 explanatory variables.

After going through the Kaggle descriptions, I selected a small number of variables to explore in this dataset. Most of these are self-explanatory, however, I included notes when necessary.

Categorical Variables:

  • MSSubClass: this field uses numbers between 10 and 200 to indicate what “type” of house this is. Categories include things like “1 story 1945 and older,” “duplex” and “1 Story Planned Urban Development.”
  • MSZoning: this variable uses letter codes to indicate the general zoning classification.
  • Utilities
  • Neighborhood: letter codes indicate the specific neighborhood in which the house is located.
  • Condition1: This is an interesting variable; it indicates whether the house is located near a railroad, a major road or a “positive feature” like a bike path or park. There are two of these, Condition1 and Condition2. I will only be using Condition1.
  • BldgType
  • House Style
  • OverallQual
  • Overall Condition
  • Functional
  • PoolQC
  • SaleCondition
  • SaleType

Continuous/Numerical Variables:

  • LotArea
  • YearRemodAdd: This variable indicates the year the house was last remodeled, or in the case where the house has never been remodeled, the year the house was built.
  • Bedroom
  • FullBath
  • YrSold
train_set <- data.frame(train_raw$MSSubClass, train_raw$MSZoning, train_raw$Utilities, train_raw$Neighborhood, train_raw$Condition1, train_raw$BldgType, train_raw$HouseStyle, train_raw$OverallQual, train_raw$OverallCond, train_raw$Functional, train_raw$PoolQC, train_raw$SaleCondition, train_raw$SaleType, train_raw$LotArea, train_raw$YearRemodAdd, train_raw$BedroomAbvGr, train_raw$FullBath, train_raw$YrSold, train_raw$SalePrice)

names(train_set) <- c("Class", "Zoning", "Utilities", "Neighborhood", "Conditional", "BldgType", "HouseStyle", "OverallQual", "OverallCond", "Functional", "PoolQC", "SaleCondition", "SaleType", "LotArea", "YearRemod", "Bedroom", "Bathroom", "YearSold", "Price")

# Convert Class from a numerical variable to a categorical one
train_set$Class <- as.factor(train_set$Class)

As we see above, I elected to change Class to a categorical variable, as it maps to assigned categories per the Kaggle documentation. While year is often treated as a categorical variable, given the range of years and the directionality of their use case (inflation has been reasonably linear, and a newly remodeled home is arguably more valuable than one remodeled less recently), I elected to keep the year variables numerical.

Question 1

Provide univariate descriptive statistics and appropriate plots for the training data set.

We can start with some general statistics for each of our selected variables:

summary(train_set)
##      Class         Zoning      Utilities     Neighborhood  Conditional  
##  20     :536   C (all):  10   AllPub:1459   NAmes  :225   Norm   :1260  
##  60     :299   FV     :  65   NoSeWa:   1   CollgCr:150   Feedr  :  81  
##  50     :144   RH     :  16                 OldTown:113   Artery :  48  
##  120    : 87   RL     :1151                 Edwards:100   RRAn   :  26  
##  30     : 69   RM     : 218                 Somerst: 86   PosN   :  19  
##  160    : 63                                Gilbert: 79   RRAe   :  11  
##  (Other):262                                (Other):707   (Other):  15  
##    BldgType      HouseStyle   OverallQual      OverallCond    Functional 
##  1Fam  :1220   1Story :726   Min.   : 1.000   Min.   :1.000   Maj1:  14  
##  2fmCon:  31   2Story :445   1st Qu.: 5.000   1st Qu.:5.000   Maj2:   5  
##  Duplex:  52   1.5Fin :154   Median : 6.000   Median :5.000   Min1:  31  
##  Twnhs :  43   SLvl   : 65   Mean   : 6.099   Mean   :5.575   Min2:  34  
##  TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000   3rd Qu.:6.000   Mod :  15  
##                1.5Unf : 14   Max.   :10.000   Max.   :9.000   Sev :   1  
##                (Other): 19                                    Typ :1360  
##   PoolQC     SaleCondition     SaleType       LotArea         YearRemod   
##  Ex  :   2   Abnorml: 101   WD     :1267   Min.   :  1300   Min.   :1950  
##  Fa  :   2   AdjLand:   4   New    : 122   1st Qu.:  7554   1st Qu.:1967  
##  Gd  :   3   Alloca :  12   COD    :  43   Median :  9478   Median :1994  
##  NA's:1453   Family :  20   ConLD  :   9   Mean   : 10517   Mean   :1985  
##              Normal :1198   ConLI  :   5   3rd Qu.: 11602   3rd Qu.:2004  
##              Partial: 125   ConLw  :   5   Max.   :215245   Max.   :2010  
##                             (Other):   9                                  
##     Bedroom         Bathroom        YearSold        Price       
##  Min.   :0.000   Min.   :0.000   Min.   :2006   Min.   : 34900  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2007   1st Qu.:129975  
##  Median :3.000   Median :2.000   Median :2008   Median :163000  
##  Mean   :2.866   Mean   :1.565   Mean   :2008   Mean   :180921  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:2009   3rd Qu.:214000  
##  Max.   :8.000   Max.   :3.000   Max.   :2010   Max.   :755000  
## 

Next, we can generate some plots to see what is happening with our data. The summaries above give us a decent idea of the range of our data and their distributions, however, visualizations are more impactful.

I selected a few I found interesting:

ggplot(train_set, aes(LotArea)) + geom_histogram(bins = 50) + scale_x_log10() + ggtitle("Lot Area Hist. (Log Trans. base 10)")+ theme_minimal()

ggplot(train_set, aes(Bathroom)) + geom_bar() + labs(title = "Count of Houses with Given Number of Bathrooms", x = "Number of Bathrooms", y = "Number of Houses")+ theme_minimal()

ggplot(train_set, aes(Bedroom)) + geom_bar() + labs(title = "Count of Houses with Given Number of Bedrooms", x = "Number of Bedrooms", y = "Number of Houses")+ theme_minimal()

ggplot(train_set, aes(YearRemod)) + geom_histogram(bins = 50) + ggtitle("Year Remodeled (Or Built, if never remod.)")+ theme_minimal()

ggplot(train_set, aes(YearSold)) + geom_bar() + ggtitle("Count of Houses Sold per Year")+ theme_minimal()

ggplot(train_set, aes(Price)) + geom_histogram(bins = 50) + ggtitle("Sale Price")+ theme_minimal()

A Few Observations:

  • After transforming the x axis, Lot Area looks roughly normal. Thinking about the nature of this variable, this is to be expected. We see a bit of a floor effect: there is a practical minimum lot size but no maximum, so the raw distribution is right-skewed and the mean sits above the median, roughly midway between the median and the third quartile (see the quick check after this list).
  • Despite considerable variation in the number of bedrooms and other factors, nearly all of our houses have 1 or 2 full bathrooms.
  • The distribution of remodel years is interesting. There does not seem to be much logic to it at first glance; however, I would be interested to see how the peak of remodels around 2004-2005 lines up with the housing market crash.
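As a quick check of the floor effect described above (a sketch using the same training data), we can compare the mean and median of Lot Area on the raw and log10 scales; the gap between them shrinks considerably after the transform:

# Mean vs. median of Lot Area, raw and log-transformed
c(mean = mean(train_set$LotArea), median = median(train_set$LotArea))
c(mean = mean(log10(train_set$LotArea)), median = median(log10(train_set$LotArea)))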

Provide a scatter-plot matrix for at least two of the independent variables and the dependent variable.

ggplot(train_set, aes(LotArea, Price)) + geom_jitter() + scale_x_log10()

ggplot(train_set, aes(YearRemod, Price)) +geom_jitter() + stat_smooth(method='lm', formula = y~poly(x,2))

I expected both of these pairings to be fairly linear, and that seems to hold. Thinking through the plot of remodel year vs. sale price, one would expect the year in which a home was remodeled to become increasingly relevant as it approaches the year the home was sold. As a result, I added a polynomial regression line to illustrate this relationship.
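For reference, the quadratic trend drawn by stat_smooth above can also be fit explicitly (a sketch using the same formula; I have not tuned or validated this fit):

# Quadratic fit of Price on remodel year, mirroring the y ~ poly(x, 2) smoother
fit_quad <- lm(Price ~ poly(YearRemod, 2), data = train_set)
summary(fit_quad)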

While not technically part of the ask in this portion of the assignment, I wanted to explore the relationship between a few additional variables:

ggplot(train_set, aes(Bedroom, Bathroom)) + geom_jitter()

ggplot(train_set, aes(OverallQual, OverallCond)) + geom_jitter()

Again, these pairs of variables are related to one another. Of note, it seems that more homes are in better condition than their quality rating would suggest, which implies they have been prepared for sale. Additionally, nearly all of the homes with a quality rating of 10 have a condition of 5, which speaks to how this variable was recorded: 5 may have been used as a “default”.

Derive a correlation matrix for any three quantitative variables in the dataset.

cor_df <- data.frame(train_set$OverallQual, train_set$LotArea, train_set$Price)

names(cor_df) <- c("Quality", "Lot Area", "Price")

C_Matrix <- as.matrix(cor(cor_df))

C_Matrix
##            Quality  Lot Area     Price
## Quality  1.0000000 0.1058057 0.7909816
## Lot Area 0.1058057 1.0000000 0.2638434
## Price    0.7909816 0.2638434 1.0000000

As expected, Overall Quality and Price are well correlated. I honestly expected Lot Area and Price to be better correlated.

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.

cor.test(cor_df$Price, cor_df$Quality, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  cor_df$Price and cor_df$Quality
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816
cor.test(cor_df$Price, cor_df$`Lot Area`, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  cor_df$Price and cor_df$`Lot Area`
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434

As we see above, the p-value is extremely low for each of these relationships. In both cases we must reject the Null Hypothesis that the correlation is truly zero.

Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

We have demonstrated that both Lot Size and Home Quality can be used to predict the price for which a home will sell. Because both p-values are so far below our threshold, even in the weaker of the two relationships, I am not worried about familywise error.
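If we did want to guard against familywise error, a simple sketch would be to apply a correction such as Bonferroni to the two p-values; given how small they are, the conclusion would not change:

# Bonferroni-adjusted p-values for the two correlation tests
p_vals <- c(cor.test(cor_df$Price, cor_df$Quality)$p.value,
            cor.test(cor_df$Price, cor_df$`Lot Area`)$p.value)
p.adjust(p_vals, method = "bonferroni")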

Question 2

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

# Invert your correlation matrix from above

P_Matrix <-  solve(C_Matrix) 

P_Matrix
##             Quality   Lot Area      Price
## Quality   2.7550503  0.3046752 -2.2595806
## Lot Area  0.3046752  1.1085153 -0.5334669
## Price    -2.2595806 -0.5334669  2.9280384
# Multiply the correlation matrix by the precision matrix ...

round(P_Matrix %*% C_Matrix)
##          Quality Lot Area Price
## Quality        1        0     0
## Lot Area       0        1     0
## Price          0        0     1
# ... and then multiply the precision matrix by the correlation matrix.

round(C_Matrix %*% P_Matrix)
##          Quality Lot Area Price
## Quality        1        0     0
## Lot Area       0        1     0
## Price          0        0     1
# Conduct LU decomposition on the matrix.*

lu.decomposition(C_Matrix)
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.1058057 1.0000000    0
## [3,] 0.7909816 0.1821926    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.1058057 0.7909816
## [2,]    0 0.9888051 0.1801530
## [3,]    0 0.0000000 0.3415256
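As a sanity check (a sketch assuming lu.decomposition() comes from the matrixcalc package loaded earlier), multiplying L by U should reproduce the original correlation matrix:

# L %*% U should equal the correlation matrix (differences should be ~0)
LU <- lu.decomposition(C_Matrix)
round(LU$L %*% LU$U - C_Matrix, 10)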

Question 3

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

#Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

minLA <- min(train_set$LotArea)
minLA
## [1] 1300
LotArea2 <- train_set$LotArea - minLA + 1

min(LotArea2)
## [1] 1
LotAreadb <- data.frame(LotArea2)

names(LotAreadb) <- "LotArea"

#Then load the MASS package and run fitdistr to fit an exponential probability density function [Loaded in first chunk]

DisFit <- fitdistr(LotAreadb$LotArea, densfun = 'exponential')
Lambda <- DisFit$estimate
DisSamples <- rexp(1000, Lambda)

DsDF <- data.frame(DisSamples)
names(DsDF) <- "LotAreaExpSamples"

ggplot(LotAreadb, aes(LotArea)) +geom_histogram(bins = 50)

# Seeing that there were a small number of outliers, I elected to set limits on our x axis:
ggplot(LotAreadb, aes(LotArea)) +geom_histogram(bins = 50) +xlim(0,100000)
## Warning: Removed 4 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

ggplot(DsDF, aes(LotAreaExpSamples)) +geom_histogram(bins = 50) +xlim(0,100000)
## Warning: Removed 2 rows containing missing values (geom_bar).

As we see above, the generated samples do not appear close enough to our observations to be useful. Let's confirm:

#  Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).  Also generate a 95% confidence interval from the empirical data, assuming normality.  Finally, provide the empirical 5th percentile and 95th percentile of the data.

round(quantile(DisSamples, c(.05, .95)), 3)
##       5%      95% 
##   366.40 28425.21
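Note that the quantiles above come from the 1,000 simulated draws. The assignment also asks for the 5th and 95th percentiles of the fitted exponential distribution itself; those can be taken directly from its quantile function (the inverse CDF), a sketch using the Lambda estimated above, so the exact values depend on that fit:

# Theoretical exponential percentiles via the inverse CDF
qexp(c(0.05, 0.95), rate = Lambda)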
t.test(LotArea2)
## 
##  One Sample t-test
## 
## data:  LotArea2
## t = 35.287, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  8705.418 9730.238
## sample estimates:
## mean of x 
##  9217.828
round(quantile(LotArea2, c(.05, .95)), 2)
##       5%      95% 
##  2012.70 16102.15

As expected, the 5th and 95th percentiles differ substantially between the generated sample and the observed data. I would not use this exponential fit as a proxy for our data.

Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Question 4 (Regression)

I will be fitting a linear regression to our data. As I have already isolated what I felt were quality predictor variables, it makes sense to run a model with nearly all of them first, to get a sense of their predictive power, before removing and adjusting variables.

Model 1

Model1 <- lm(data = train_set, formula = Price ~ Class + Zoning + Neighborhood + Conditional + BldgType + HouseStyle + OverallQual + OverallCond + SaleCondition + SaleType + LotArea + YearRemod + Bedroom + Bathroom + YearSold)

summary(Model1)
## 
## Call:
## lm(formula = Price ~ Class + Zoning + Neighborhood + Conditional + 
##     BldgType + HouseStyle + OverallQual + OverallCond + SaleCondition + 
##     SaleType + LotArea + YearRemod + Bedroom + Bathroom + YearSold, 
##     data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -202452  -19849   -1968   16133  338716 
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.501e+04  1.567e+06   0.022  0.98218    
## Class30              -1.877e+04  6.205e+03  -3.024  0.00254 ** 
## Class40              -6.153e+03  1.947e+04  -0.316  0.75201    
## Class45              -2.634e+04  3.019e+04  -0.873  0.38307    
## Class50               2.913e+02  1.273e+04   0.023  0.98175    
## Class60               6.497e+03  1.106e+04   0.587  0.55711    
## Class70              -1.628e+04  1.188e+04  -1.370  0.17094    
## Class75               1.209e+04  2.179e+04   0.555  0.57917    
## Class80              -1.041e+04  1.792e+04  -0.581  0.56153    
## Class85              -4.303e+03  1.599e+04  -0.269  0.78794    
## Class90              -1.238e+04  7.363e+03  -1.681  0.09298 .  
## Class120             -2.067e+04  2.296e+04  -0.901  0.36797    
## Class160             -2.859e+04  2.679e+04  -1.067  0.28608    
## Class180             -1.574e+04  3.021e+04  -0.521  0.60236    
## Class190             -1.897e+04  3.957e+04  -0.479  0.63172    
## ZoningFV             -8.210e+03  1.833e+04  -0.448  0.65423    
## ZoningRH             -3.471e+03  1.844e+04  -0.188  0.85073    
## ZoningRL             -3.546e+02  1.542e+04  -0.023  0.98166    
## ZoningRM             -9.310e+01  1.447e+04  -0.006  0.99487    
## NeighborhoodBlueste   1.108e+04  3.040e+04   0.364  0.71559    
## NeighborhoodBrDale   -6.807e+02  1.717e+04  -0.040  0.96839    
## NeighborhoodBrkSide  -8.874e+03  1.367e+04  -0.649  0.51648    
## NeighborhoodClearCr   1.002e+04  1.350e+04   0.742  0.45811    
## NeighborhoodCollgCr  -1.736e+04  1.101e+04  -1.577  0.11511    
## NeighborhoodCrawfor   2.439e+04  1.251e+04   1.949  0.05146 .  
## NeighborhoodEdwards  -2.071e+04  1.184e+04  -1.749  0.08043 .  
## NeighborhoodGilbert  -3.087e+04  1.163e+04  -2.654  0.00804 ** 
## NeighborhoodIDOTRR   -2.313e+04  1.559e+04  -1.484  0.13816    
## NeighborhoodMeadowV   1.337e+04  1.696e+04   0.789  0.43051    
## NeighborhoodMitchel  -1.481e+04  1.216e+04  -1.218  0.22327    
## NeighborhoodNAmes    -1.403e+04  1.145e+04  -1.225  0.22083    
## NeighborhoodNoRidge   7.254e+04  1.236e+04   5.868 5.53e-09 ***
## NeighborhoodNPkVill   1.077e+03  1.652e+04   0.065  0.94804    
## NeighborhoodNridgHt   5.305e+04  1.095e+04   4.844 1.41e-06 ***
## NeighborhoodNWAmes   -1.883e+04  1.189e+04  -1.584  0.11345    
## NeighborhoodOldTown  -2.358e+04  1.382e+04  -1.707  0.08804 .  
## NeighborhoodSawyer   -1.411e+04  1.207e+04  -1.169  0.24257    
## NeighborhoodSawyerW  -9.550e+03  1.172e+04  -0.815  0.41539    
## NeighborhoodSomerst   1.157e+03  1.370e+04   0.084  0.93269    
## NeighborhoodStoneBr   6.529e+04  1.238e+04   5.276 1.53e-07 ***
## NeighborhoodSWISU    -2.144e+04  1.429e+04  -1.500  0.13382    
## NeighborhoodTimber   -3.530e+03  1.243e+04  -0.284  0.77638    
## NeighborhoodVeenker   3.442e+04  1.570e+04   2.193  0.02851 *  
## ConditionalFeedr     -1.319e+03  7.508e+03  -0.176  0.86059    
## ConditionalNorm       4.912e+03  6.171e+03   0.796  0.42616    
## ConditionalPosA       1.527e+04  1.509e+04   1.012  0.31169    
## ConditionalPosN       1.138e+04  1.088e+04   1.046  0.29565    
## ConditionalRRAe      -1.371e+04  1.369e+04  -1.001  0.31702    
## ConditionalRRAn      -3.051e+02  1.007e+04  -0.030  0.97583    
## ConditionalRRNe      -1.611e+04  2.812e+04  -0.573  0.56687    
## ConditionalRRNn       6.771e+03  1.863e+04   0.363  0.71636    
## BldgType2fmCon        6.482e+03  3.844e+04   0.169  0.86614    
## BldgTypeDuplex               NA         NA      NA       NA    
## BldgTypeTwnhs        -3.040e+04  2.459e+04  -1.236  0.21663    
## BldgTypeTwnhsE       -1.785e+04  2.328e+04  -0.767  0.44330    
## HouseStyle1.5Unf      8.597e+02  2.956e+04   0.029  0.97681    
## HouseStyle1Story      9.526e+03  1.217e+04   0.783  0.43405    
## HouseStyle2.5Fin      3.247e+04  2.292e+04   1.417  0.15684    
## HouseStyle2.5Unf     -2.640e+04  2.247e+04  -1.175  0.24021    
## HouseStyle2Story      9.127e+03  1.162e+04   0.785  0.43231    
## HouseStyleSFoyer      9.939e+03  1.679e+04   0.592  0.55398    
## HouseStyleSLvl        1.562e+04  1.933e+04   0.808  0.41921    
## OverallQual           2.674e+04  1.210e+03  22.106  < 2e-16 ***
## OverallCond           1.856e+03  1.112e+03   1.669  0.09537 .  
## SaleConditionAdjLand -1.228e+04  2.028e+04  -0.605  0.54504    
## SaleConditionAlloca   1.824e+04  1.267e+04   1.440  0.15012    
## SaleConditionFamily  -3.642e+03  9.712e+03  -0.375  0.70769    
## SaleConditionNormal   6.114e+03  4.400e+03   1.390  0.16484    
## SaleConditionPartial -1.565e+04  2.323e+04  -0.674  0.50056    
## SaleTypeCon           4.270e+04  2.846e+04   1.500  0.13380    
## SaleTypeConLD         7.808e+03  1.518e+04   0.514  0.60706    
## SaleTypeConLI         1.547e+04  1.844e+04   0.839  0.40149    
## SaleTypeConLw        -2.210e+03  1.862e+04  -0.119  0.90555    
## SaleTypeCWD           1.872e+04  2.031e+04   0.922  0.35693    
## SaleTypeNew           4.442e+04  2.410e+04   1.843  0.06554 .  
## SaleTypeOth           1.057e+04  2.307e+04   0.458  0.64708    
## SaleTypeWD            1.180e+03  6.515e+03   0.181  0.85625    
## LotArea               9.574e-01  1.143e-01   8.379  < 2e-16 ***
## YearRemod             2.408e+02  7.342e+01   3.280  0.00106 ** 
## Bedroom               7.417e+02  1.758e+03   0.422  0.67316    
## Bathroom              2.001e+04  2.782e+03   7.195 1.02e-12 ***
## YearSold             -2.780e+02  7.829e+02  -0.355  0.72260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37920 on 1379 degrees of freedom
## Multiple R-squared:  0.7846, Adjusted R-squared:  0.7721 
## F-statistic:  62.8 on 80 and 1379 DF,  p-value: < 2.2e-16

With an R-squared of .78, this is a great start! That being said, we are looking at a large number of variables, and with so many variables we run the risk of overfitting our model.

Test and Train sets

Let's break this into test and train sets. While it is true that we are already working with a training set, we want to further divide it so that we can more objectively assess how predictive our variables are on unseen samples.

set.seed(642)

train_set$TestTrain <- sample(c("test","train"),replace = TRUE, size = nrow(train_set),prob = c(.2,.8))

Model_Test <- filter(train_set, train_set$TestTrain == "test")
nrow(Model_Test)
## [1] 309
Model_Train <- filter(train_set, train_set$TestTrain == "train")
nrow(Model_Train)
## [1] 1151

Now that we have our mini test and train sets, let's fit a regression on the new training set, use it to predict prices for the held-out test set, and calculate the Root Mean Squared Error (the standard deviation of the residuals of our predicted prices vs. the actual prices).
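For clarity, the rmse() helper used below (assumed to come from a metrics package loaded earlier) is equivalent to this small sketch:

# RMSE by hand: square root of the mean squared residual
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))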

Model1.b <- lm(data = Model_Train, formula = Price ~ Class + Zoning + Neighborhood + Conditional + BldgType + HouseStyle + OverallQual + OverallCond + SaleCondition + SaleType + LotArea + YearRemod + Bedroom + Bathroom + YearSold)

summary(Model1.b)$r.squared
## [1] 0.7866565
Model1.b.pred <-  predict(Model1.b, newdata = Model_Test)

rmse(Model_Test$Price, Model1.b.pred)
## [1] 34770.97
summary(Model_Test$Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  132250  167500  181391  212000  501837

An RMSE of roughly 35k is very solid when the mean price in our data is about 181k.

Model Adjustments

Now that we know this model is off to a good start, let's investigate our variables. We will reference the model created with all of our data as we sort out the predictive power of these variables.

  • Lot Area / Year Remodeled / Bathroom / Overall Quality: These are clearly predictive; we will leave them in.
  • Zoning / Conditional / Building Type / House Style / Sale Condition / Year Sold: These are not predictive; I will remove them in our next model.
  • Class / Neighborhood: These two are very similar: while not every level is truly predictive per its p-value, several levels are. I am hesitant to combine any of these levels, as their estimated impacts on price vary wildly.
  • Overall Condition / Sale Type: These are currently not below our .05 significance level; however, I will leave them in for our next model. Overall Condition is likely held back by its correlation with Overall Quality, and it may prove more predictive with fewer competing variables.
  • Bedrooms: It is interesting that this is not predictive - likely because it is well correlated with other variables (see the quick check after this list).
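As a quick check of the suspected collinearity behind the last two bullets (a sketch on the full training data):

# Pairwise correlations among the variables suspected of overlapping
cor(train_set$Bedroom, train_set$Bathroom)
cor(train_set$OverallQual, train_set$OverallCond)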

Model 2

Model2 <- lm(data = train_set, formula = Price ~ Class + Neighborhood + OverallQual + OverallCond  + SaleType + LotArea + YearRemod + Bathroom)

Model2.b <- lm(data = Model_Train, formula = Price ~ Class + Neighborhood + OverallQual + OverallCond + SaleType + LotArea + YearRemod + Bathroom)

summary(Model2)
## 
## Call:
## lm(formula = Price ~ Class + Neighborhood + OverallQual + OverallCond + 
##     SaleType + LotArea + YearRemod + Bathroom, data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -209968  -19693   -1926   15999  335548 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.144e+05  1.402e+05  -3.669 0.000253 ***
## Class30             -1.925e+04  6.065e+03  -3.174 0.001535 ** 
## Class40             -6.745e+03  1.930e+04  -0.349 0.726770    
## Class45             -3.639e+04  1.161e+04  -3.135 0.001752 ** 
## Class50             -1.009e+04  4.590e+03  -2.198 0.028111 *  
## Class60              6.370e+03  3.235e+03   1.969 0.049119 *  
## Class70             -1.675e+04  6.431e+03  -2.605 0.009278 ** 
## Class75              7.979e+02  1.066e+04   0.075 0.940330    
## Class80             -4.070e+03  5.363e+03  -0.759 0.448030    
## Class85             -4.778e+03  8.767e+03  -0.545 0.585845    
## Class90             -1.338e+04  5.923e+03  -2.258 0.024078 *  
## Class120            -3.916e+04  5.370e+03  -7.294 5.02e-13 ***
## Class160            -5.307e+04  7.222e+03  -7.349 3.38e-13 ***
## Class180            -3.900e+04  1.484e+04  -2.628 0.008688 ** 
## Class190            -1.701e+04  8.000e+03  -2.126 0.033670 *  
## NeighborhoodBlueste  1.309e+04  2.968e+04   0.441 0.659309    
## NeighborhoodBrDale  -4.279e+03  1.566e+04  -0.273 0.784673    
## NeighborhoodBrkSide -9.584e+03  1.250e+04  -0.766 0.443561    
## NeighborhoodClearCr  9.861e+03  1.302e+04   0.757 0.449058    
## NeighborhoodCollgCr -1.717e+04  1.079e+04  -1.591 0.111843    
## NeighborhoodCrawfor  2.529e+04  1.222e+04   2.070 0.038668 *  
## NeighborhoodEdwards -2.093e+04  1.152e+04  -1.818 0.069294 .  
## NeighborhoodGilbert -3.158e+04  1.141e+04  -2.768 0.005721 ** 
## NeighborhoodIDOTRR  -2.405e+04  1.310e+04  -1.836 0.066641 .  
## NeighborhoodMeadowV  1.442e+04  1.575e+04   0.916 0.359965    
## NeighborhoodMitchel -1.494e+04  1.187e+04  -1.258 0.208485    
## NeighborhoodNAmes   -1.396e+04  1.119e+04  -1.248 0.212297    
## NeighborhoodNoRidge  7.221e+04  1.220e+04   5.920 4.03e-09 ***
## NeighborhoodNPkVill  1.716e+02  1.641e+04   0.010 0.991655    
## NeighborhoodNridgHt  5.204e+04  1.083e+04   4.806 1.70e-06 ***
## NeighborhoodNWAmes  -1.877e+04  1.158e+04  -1.622 0.105096    
## NeighborhoodOldTown -2.535e+04  1.189e+04  -2.131 0.033227 *  
## NeighborhoodSawyer  -1.629e+04  1.177e+04  -1.384 0.166707    
## NeighborhoodSawyerW -1.144e+04  1.146e+04  -0.998 0.318229    
## NeighborhoodSomerst -5.664e+03  1.110e+04  -0.510 0.609892    
## NeighborhoodStoneBr  6.610e+04  1.226e+04   5.390 8.26e-08 ***
## NeighborhoodSWISU   -1.697e+04  1.374e+04  -1.235 0.216972    
## NeighborhoodTimber  -3.873e+03  1.223e+04  -0.317 0.751482    
## NeighborhoodVeenker  3.297e+04  1.556e+04   2.119 0.034280 *  
## OverallQual          2.680e+04  1.173e+03  22.854  < 2e-16 ***
## OverallCond          1.974e+03  1.096e+03   1.801 0.071879 .  
## SaleTypeCon          4.590e+04  2.831e+04   1.622 0.105109    
## SaleTypeConLD        8.927e+03  1.435e+04   0.622 0.534055    
## SaleTypeConLI        1.589e+04  1.812e+04   0.877 0.380904    
## SaleTypeConLw       -4.568e+03  1.821e+04  -0.251 0.801919    
## SaleTypeCWD          1.789e+04  2.007e+04   0.891 0.372972    
## SaleTypeNew          2.666e+04  7.356e+03   3.624 0.000300 ***
## SaleTypeOth          9.792e+03  2.286e+04   0.428 0.668452    
## SaleTypeWD           4.518e+03  6.079e+03   0.743 0.457479    
## LotArea              9.675e-01  1.130e-01   8.561  < 2e-16 ***
## YearRemod            2.447e+02  7.227e+01   3.385 0.000731 ***
## Bathroom             2.066e+04  2.562e+03   8.064 1.56e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37910 on 1408 degrees of freedom
## Multiple R-squared:  0.7803, Adjusted R-squared:  0.7723 
## F-statistic: 98.03 on 51 and 1408 DF,  p-value: < 2.2e-16
Model2.b.pred <-  predict(Model2.b, newdata = Model_Test)

rmse(Model_Test$Price, Model2.b.pred)
## [1] 34440.2

Results

Looking at our RMSE, our model has improved very slightly. While I did not share it here, I quickly ran a few additional models to see if the R-squared and/or RMSE could be significantly improved, and I feel this is where I would like to leave it. With something like housing prices, so much is market-dependent - an interesting follow-up might be to bring in loan interest rates for the period in question, or perhaps a different proxy for housing-market health, such as the number of homes sold in a year versus the quantity “on the market”. That being said, I believe this model is fairly strong given the data we have.

Reviewing the Kaggle Submission guidelines, we see we need to download the test file, and supply a predicted sale price for each value.

x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/Misc/main/test.csv")

test_raw <- read.csv(text = x, stringsAsFactors = FALSE)

test_set <- data.frame(test_raw$MSSubClass, test_raw$Neighborhood, test_raw$OverallQual, test_raw$OverallCond, test_raw$SaleType, test_raw$LotArea, test_raw$YearRemodAdd, test_raw$FullBath, test_raw$Id)

names(test_set) <- c("Class", "Neighborhood", "OverallQual", "OverallCond",  "SaleType", "LotArea", "YearRemod", "Bathroom","Id")


#test_pred <- predict(Model2, newdata = test_set)

At this stage, I ran into an error on the now commented-out line: “factor Class has new levels 150”. This means our testing set contains a Class level that did not exist in our training set.

There are a couple of ways around this. As Kaggle only requires each entry's Id and predicted sale price, I chose what I believe to be the most graceful workaround. The Kaggle documentation tells us that Class 150 corresponds to “1-1/2 STORY PUD - ALL AGES”. Looking through our other entries, I believe this is most similar to Class 120, “1-STORY PUD (Planned Unit Development) - 1946 & NEWER”, especially when you consider that the one home this applies to was built or remodeled in 1981, as seen below (I would wager this was the year it was built).

filter(test_set, Class=="150")$YearRemod
## [1] 1981
Row150 <- which(test_set$Class == "150")

test_set$Class[Row150] <- "120"

nrow(filter(test_set, Class == "150"))
## [1] 0
# I also found that one value was missing a sale type, and will replace it with the most common and least impactful value:

test_set$SaleType[1030] <- "WD"
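A more general sketch for locating missing SaleType entries, rather than hard-coding row 1030 (assuming missing values read in as NA or empty strings):

# Find rows where SaleType is missing
which(is.na(test_set$SaleType) | test_set$SaleType == "")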

With that sorted, we can apply our model, generate our submission file and submit our results!

test_pred <- predict(Model2, newdata = test_set)

Sub_file <- data.frame(test_set$Id, test_pred)

names(Sub_file) <- c("Id", "SalePrice")

#Commented out so as not to repeatedly save the file.
#write.csv(Sub_file,"C:\\Users\\bloom\\desktop\\subfile.csv", row.names = FALSE)

Score

After submitting, I found that I scored a .206, placing 3725th at the time of my submission. My Kaggle username is “Chris Bloome”.