DATA 605 - Final Project
Background
The purpose of our Computational Mathematics Final Project is to explore / showcase (some of) what we’ve learned over the course of the semester.
CLICK HERE for video presentation.
…………………………………………………………………….
Problem 1
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu=\sigma=(N+1)/2\).
I ensured reproducibility with set.seed(), selected an N value, and then made use of the runif() function to generate the specified number of random uniform numbers between 1 and N:
#Ensure reproducibility:
set.seed(90210)
N <- 9 #chosen N value
#Generate X variable of 10000 random uniform numbers between 1 and N:
X <- runif(10000,1,N)
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.992 5.003 5.002 6.997 8.999
#generate Y variable of 10000 random normal numbers with specified mu=sigma=(N+1)/2
ms <- (N+1)/2 #mu=sigma=(N+1)/2
Y <- rnorm(10000,ms, ms)
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -12.768 1.590 5.014 4.995 8.397 24.104
Probability
Calculate as a minimum the below probabilities. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
- P(X>x | X>y)
- P(X>x, Y>y)
- P(X<x | X>y)
First we define our x and y variables:
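The defining chunk isn't echoed above; based on the problem statement, it presumably looked like this (a sketch, since the author may have used the rounded summary values 5.003 and 1.590 directly):
#x: the median of X; y: the 1st quartile of Y
x <- median(X)          #~5.003
y <- quantile(Y, 0.25)  #~1.590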
Then we explore P(X>x | X>y):
#P(X>x | X>y): conditional probability
#Calculate the probability P(X>x):
pa1 <- sum(X > x) / 10000
pa1
## [1] 0.5001
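The next two outputs correspond to P(X>y) and the conditional probability itself; the code isn't echoed above, but was presumably along these lines (pa1 and pa2 are the names referenced in the discussion below):
#Calculate the probability P(X>y):
pa2 <- sum(X > y) / 10000
pa2
#Since x > y, P(X>x | X>y) = P(X>x) / P(X>y):
round(pa1 / pa2, 3)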
## [1] 0.9247
## [1] 0.541
Since x > y, the event \(X > x\) implies \(X > y\), so the conditional probability expression simplifies to \(P(X>x)/P(X>y) = pa1/pa2\), and we arrive at P(X>x | X>y) = 0.541.
What we’re doing is exploring the probability that \(X > x\) given that \(X > y\) with the expectation of an output greater than 0.5 because of our “pre-filtering” for a higher lower bound (y) on our X set.
As a more in-depth explanation: rather than X ranging over its full support from 1 to N=9, the condition \(X > y\) restricts us to X values greater than 1.590. This smaller set starts with a higher lower bound, and thus we'd expect the proportion of values that meet the condition \(X > x\), where x = 5.003, to be higher (which it is).
Next we explore P(X>x, Y>y):
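The code for this chunk isn't echoed; a sketch of what presumably produced the three values below (the names pb1 and pb2 are assumptions):
pb1 <- sum(X > x) / 10000 #P(X>x)
pb1
pb2 <- sum(Y > y) / 10000 #P(Y>y)
pb2
round(pb1 * pb2, 3)       #P(X>x, Y>y) under independence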
## [1] 0.5001
## [1] 0.75
## [1] 0.375
Because X and Y were generated independently, the joint probability is simply the product of the two individual probabilities. We find P(X>x) and P(Y>y), compute their product P(X>x) * P(Y>y), and voila! we arrive at a joint probability of P(X>x, Y>y) = 0.375.
We're looking for X > x (the median value), which gives us ~0.5, and Y > y (the 1st quartile value), which gives us ~0.75. Taking 1/2 of 3/4, or 3/4 of 1/2, leads to 3/8 (their product), which makes perfect sense as the joint probability.
Next we explore P(X<x | X>y):
#P(X<x | X>y): conditional probability
#Calculate the probability P(X<x):
pa4 <- sum(X < x) / 10000
pa4
## [1] 0.4999
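The remaining two outputs are P(X>y) and the conditional probability; the code isn't echoed, but presumably resembled the following (the name pa5 is an assumption):
#Calculate the probability P(X>y):
pa5 <- sum(X > y) / 10000
pa5
#P(X<x | X>y) = P(y < X < x) / P(X>y):
round(sum(X < x & X > y) / sum(X > y), 3)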
## [1] 0.9247
## [1] 0.459
What we’re doing is exploring the probability that \(X < x\) given that \(X > y\) with the expectation of an output less than 0.5 because of our “pre-filtering” for a higher lower bound (y) on our X set.
Rather than X ranging over 1 to N=9, the condition \(X > y\) restricts us to X values greater than 1.590. This smaller set starts with a higher lower bound, and thus we'd expect the proportion of values that meet the condition \(X < x\), where x = 5.003, to be lower (which it is). This is because the range of values for which the expression holds is narrowed when we shift the lower bound up while holding the upper bound fixed (i.e. y < X < x).
Investigation
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
For building out our probability distribution table, we initialize the (marginal) probabilities of interest, calculate the joint probabilities, populate the table with the corresponding values, and then convert this table into a simple output table using kbl() and kable_minimal() from the kableExtra package. The implementation in R is shown below:
#Initialize probabilities of interest
X_gt <- round(sum(X > x) / 10000,3) #P(X>x)
X_lt <- round(sum(X < x) / 10000,3) #P(X<x)
Y_gt <- round(sum(Y > y) / 10000,3) #P(Y>y)
Y_lt <- round(sum(Y < y) / 10000,3) #P(Y<y)
total <- X_gt + X_lt #or Y_gt + Y_lt ... either way it's 1.0
#Calculate joint probabilities
p11 <- X_gt * Y_gt #P(X>x,Y>y)
p21 <- X_gt * Y_lt #P(X>x,Y<y)
p12 <- X_lt * Y_gt #P(X<x,Y>y)
p22 <- X_lt * Y_lt #P(X<x,Y<y)
#Populate table with marginal and joint probabilities
##The table is populated column by column, top to bottom
p <- matrix(c("", "", "Y", "", "", "", "", ">y", "<y", "marginal (X)", "", ">x",p11, p21, X_gt, "X", "<x", p12, p22, X_lt, "", "marginal (Y)",Y_gt, Y_lt, total), ncol=5)
#Convert into nicer form and display probability table
p %>%
kbl() %>%
kable_minimal()
|   |              | X     |       |              |
|   |              | >x    | <x    | marginal (Y) |
| Y | >y           | 0.375 | 0.375 | 0.75         |
|   | <y           | 0.125 | 0.125 | 0.25         |
|   | marginal (X) | 0.5   | 0.5   | 1            |
From the output table (above), and as we noted before, we observe that the joint probability is indeed just the product of corresponding marginal probabilities. As such, P(X>x and Y>y) = P(X>x)P(Y>y).
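Note that the joint cells in the table were themselves computed as products of the marginals, so they confirm the factorization by construction. A more direct check, not shown in the original output, is to count the joint events empirically; because X and Y were generated independently, this proportion should land very close to 0.375:
#Direct empirical joint probability, counted from the simulated data
sum(X > x & Y > y) / 10000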
Testing
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Fisher's Exact Test is a statistical test used to determine whether there is a non-random association between two categorical variables; it computes an exact p-value rather than relying on a large-sample approximation.
The Chi-Square Test, by contrast, quantifies the discrepancy between observed counts and the counts expected under independence and is often used in hypothesis testing.
To compare the output of each test, we initialize our contingency table, which is just our p-table joint probabilities multiplied by n (10,000), and use this table as input to the fisher.test() and chisq.test() functions. Each function outputs a p-value as well as a number of other quantities for consideration:
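The chunk itself isn't echoed; given that the test output below refers to an object named c_table, it presumably looked something like this (the exact construction is an assumption):
#Contingency table of joint counts (joint probabilities * n)
c_table <- matrix(c(p11, p21, p12, p22) * 10000, ncol = 2)
c_table
#Fisher's Exact Test
fisher.test(c_table)
#Pearson's Chi-squared Test (no continuity correction, per the output below)
chisq.test(c_table, correct = FALSE)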
## [,1] [,2]
## [1,] 3750 3750
## [2,] 1250 1250
##
## Fisher's Exact Test for Count Data
##
## data: c_table
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9124854 1.0959080
## sample estimates:
## odds ratio
## 1
##
## Pearson's Chi-squared test
##
## data: c_table
## X-squared = 0, df = 1, p-value = 1
When we observe the outputs above, we see that each test produces a p-value of 1. From this, we fail to reject the null hypothesis and conclude that independence holds.
One of the major differences between Fisher's Exact Test and the Chi-Square Test is that Fisher's is meant for relatively small sample sizes, while the Chi-Square Test is better suited to relatively large sample sizes. Given that we're dealing with a sample size of 10,000 (which is relatively large), the Chi-Square Test is the more appropriate test for our purposes.
…………………………………………………………………….
Problem 2
Register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
First, we read in our training and test data. The .csv files were downloaded from Kaggle (linked above), uploaded to GitHub, read in raw, converted to tibbles via as_tibble() (an S3 generic with more efficient methods for matrices and data frames), and then verified by printing the first six rows of each table:
#Read in training and test data .csv files:
hp_train <- read.csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-605/Final-Project/hp_train.csv")
train_table <- as_tibble(hp_train)
head(train_table)
hp_test <- read.csv("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-605/Final-Project/hp_test.csv")
test_table <- as_tibble(hp_test)
head(test_table)
One of the early contrasts we’ll observe between training and test data is that the training data has 81 columns (variables) whereas the test data has 80 columns (variables). The missing column is \(SalePrice\) which makes sense :)
Further EDA (exploratory data analysis) of the training dataset is shown below:
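The EDA calls themselves aren't echoed above; presumably they were along these lines (a sketch):
#Dimensions and column names of the training data
nrow(train_table)
ncol(train_table)
colnames(train_table)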
## [1] 1460
## [1] 81
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
We explore the number of rows (1460), the number of columns (81), the names of the columns (Id, MSSubClass, MSZoning, etc.), and the corresponding summary statistics. The summary statistics provide early insight into what sort of data each variable holds and, for numeric variables, what its minimum, maximum, quartile, median, and mean values are; the problem is that the full output is huge and a lot to peruse.
Instead of dissecting each variable, I marked column names of interest (those I thought could have an effect on the final SalePrice), explored these variables, and then selected independent variables based on this more tailored approach:
#Summary statistics for specific variables of interest
##Dependent variable
summary(train_table$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
## Length Class Mode
## 1460 character character
## Length Class Mode
## 1460 character character
## Length Class Mode
## 1460 character character
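The summary() calls behind these outputs aren't echoed; the first three "Length Class Mode" blocks belong to the (unnamed) character variables discussed below, and the three numeric summaries that follow presumably came from calls along these lines (a sketch):
##Numeric independent-variable candidates
summary(train_table$GrLivArea)
summary(train_table$TotRmsAbvGrd)
summary(train_table$Fireplaces)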
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.518 7.000 14.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 0.613 1.000 3.000
From the output (above), we see that our first three independent variables of interest are categorical / character-based. For the sake of simplicity, I elected instead to pursue the numerical variables \(GrLivArea\), \(TotRmsAbvGrd\), and \(Fireplaces\). I was looking for non-generic variables that might be predictive of a home's sale price, and I thought the above-grade living area (GrLivArea), the total rooms above grade (more rooms generally means a larger home and a higher price), and the number of fireplaces might be a good place to start.
Thus, our variables of interest are:
Dependent: \(SalePrice\)
Independent: \(GrLivArea, TotRmsAbvGrd, Fireplaces\)
With these variables selected, we then move on to visualizing our dependent and independent variables:
#Visualization of select variables of interest
##Histogram of dependent variable: SalePrice
hist(train_table$SalePrice, xlab = "SalePrice", main = "Sale Price Histogram", col = "blue")
##Histogram of independent variables: GrLivArea, TotRmsAbvGrd, Fireplaces
hist(train_table$GrLivArea, xlab = "Grand Living Rm Area", main = "Grand Living Rm Area Histogram", col = "blue")
hist(train_table$TotRmsAbvGrd, xlab = "Total Rooms Above Grade", main = "Total Rooms Above Grade Histogram", col = "blue")
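The echoed chunk stops short of the \(Fireplaces\) histogram named in the comment above; if it was plotted, the call was presumably:
hist(train_table$Fireplaces, xlab = "Fireplaces", main = "Fireplaces Histogram", col = "blue")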
Our \(SalePrice\), \(GrLivArea\), and \(TotRmsAbvGrd\) plots are all unimodal and right skewed (long tails toward higher values). The \(Fireplaces\) data does not fit the same pattern, but it may still be an indicator for very high value homes, and so we proceed to the scatterplot matrix and correlation matrix with all variables and a sense of curiosity:
#Scatter plot matrix for 2+ independent variables and the dependent variable
select_few <- subset(train_table, select=c(SalePrice, GrLivArea, TotRmsAbvGrd, Fireplaces))
pairs(select_few, pch=19, lower.panel=NULL)
From the above scatter plots, we can see that as the corresponding \(GrLivArea\) and \(TotRmsAbvGrd\) values increase, so does the \(SalePrice\). The same cannot be said for \(Fireplaces\), and thus that variable may not be as useful as we'd hoped. For this reason, we proceed to the correlation matrix with just three of the four variables: \(SalePrice\), \(GrLivArea\), and \(TotRmsAbvGrd\):
#Correlation matrix for (3) quantitative variables
select_fewer <- subset(train_table, select=c(SalePrice, GrLivArea, TotRmsAbvGrd))
select_fewer.cor = cor(select_fewer)
select_fewer.cor
## SalePrice GrLivArea TotRmsAbvGrd
## SalePrice 1.0000000 0.7086245 0.5337232
## GrLivArea 0.7086245 1.0000000 0.8254894
## TotRmsAbvGrd 0.5337232 0.8254894 1.0000000
From the correlation matrix we can proceed to use the cor.test() function to test, for each pairwise set of variables, the hypothesis that the correlation is 0, and to report an 80% confidence interval.
We start with \(SalePrice\) vs. \(GrLivArea\):
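The call isn't echoed above; based on the output's data line and the 80 percent interval, it was presumably of this form (a sketch), with the \(TotRmsAbvGrd\) test further below being the analogous call:
#Pairwise correlation test with an 80% confidence interval
cor.test(select_fewer$GrLivArea, select_fewer$SalePrice, conf.level = 0.80)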
##
## Pearson's product-moment correlation
##
## data: select_fewer$GrLivArea and select_fewer$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
With a p-value < 2.2e-16, far below the typical 0.05 threshold, we reject the null hypothesis of zero correlation. The sample estimate is a correlation of 0.7086, with an 80 percent confidence interval of (0.6915, 0.7249), a relatively high correlation for this pairwise set of variables. This tells us that \(GrLivArea\) may indeed be a good predictor of \(SalePrice\).
We then move on to \(SalePrice\) vs. \(TotRmsAbvGrd\):
##
## Pearson's product-moment correlation
##
## data: select_fewer$TotRmsAbvGrd and select_fewer$SalePrice
## t = 24.099, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5092841 0.5573021
## sample estimates:
## cor
## 0.5337232
With a p-value < 2.2e-16, far below the typical 0.05 threshold, we again reject the null hypothesis of zero correlation. The sample estimate is a correlation of 0.5337, with an 80 percent confidence interval of (0.5093, 0.5573), a moderate correlation, neither very strong nor weak. This tells us that \(TotRmsAbvGrd\) is an OK predictor of \(SalePrice\), and we'd likely be better off using a variable with a stronger correlation (if possible).
Regarding familywise error: for 2 tests, I would not be worried about it, but as the number of tests and variables under consideration increases, so does the likelihood of a familywise error. If we were to ramp testing up to, say, 5, 10, or more comparisons (i.e. considering all 80 variables in our set), then I would be worried about familywise error. As an aside, another way of dealing with this would be to reduce our alpha value for the individual tests (e.g., via a Bonferroni correction).
Linear Algebra and Correlation
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
We start by initializing our correlation matrix from the values of the matrix computed above (\(select_fewer.cor\)) and our precision matrix by taking the inverse of the correlation matrix via the inv() function.
We then multiply our correlation and precision matrices in both orders, \(corr \times prec\) and \(prec \times corr\), to confirm that indeed \(AA^{-1} = A^{-1}A = I\):
#initialize correlation and precision matrices
corr <- matrix(select_fewer.cor[,1:3],nrow=3)
prec <- inv(corr)
#matrix multiplication: corr %*% prec and prec %*% corr
corr_prec <- round(corr %*% prec, 10) #%*% is matrix multiplication; round off floating-point noise
prec_corr <- round(prec %*% corr, 10)
corr_prec == prec_corr #confirming A %*% inv(A) = inv(A) %*% A (both equal the identity)
## [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
From the above we see that all entries agree and that both products are (numerically) the identity matrix, confirming that \(corr \cdot prec = prec \cdot corr = I\). We then conduct LU decomposition using the lu.decomposition() function:
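The call isn't echoed; lu.decomposition() is provided by the matrixcalc package, and the \(U\) matrix below (whose first row matches the precision matrix) suggests it was applied to prec (an assumption):
#LU decomposition of the precision matrix
library(matrixcalc)
lu.decomposition(prec)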
## $L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.000000 0
## [2,] -0.5962348 1.000000 0
## [3,] 0.0858448 -0.586352 1
##
## $U
## [,1] [,2] [,3]
## [1,] 2.042442 -1.217775 0.175333
## [2,] 0.000000 3.858920 -2.262685
## [3,] 0.000000 0.000000 1.850111
Recalling \(A = LU\) for LU decomposition, we confirm above that (1) our lower triangular matrix has 1s on the diagonal, values beneath and 0s above the diagonal and (2) our upper triangular matrix has values on and above the diagonal with 0s beneath.
Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
The \(SalePrice\) variable’s histogram had a pronounced right skew and so we’ll reconsider this variable.
We revisit the summary statistics, apply the fitdistr() function to the \(SalePrice\) data, use the estimated rate as our optimal \(\lambda\), take 1000 samples from an exponential distribution with that rate, and then plot the resulting (exponential) histogram against the original \(SalePrice\) histogram:
#select right skewed variable:
#summary(train_table$SalePrice)
#shift min val from 34900 to 0: not necessary
#load MASS package and fit an exponential distribution
library(MASS)
fitted <- fitdistr(train_table$SalePrice, "exponential")
fitted
## rate
## 5.527268e-06
## (1.446552e-07)
#find optimal lambda and take 1000 samples
optimal <- fitted$estimate
samples <- rexp(1000, optimal)
head(samples)
## [1] 33466.214 69905.657 20180.087 1908.491 44843.872 16241.041
#Plot the result vs. original
par(mfrow = c(1, 2))
hist(train_table$SalePrice, xlab = "SalePrice", main = "Original Sale Price Histogram", col = "blue")
hist(samples, xlab = "SalePrice", main = "Exponential Sale Price Histogram", col = "gray")
The histograms highlight the difference in axes and data. When we compare the exponential histogram with the original, we see that the peak has been shifted toward 0 on the x-axis and that the x and y axes are scaled differently.
Next, we use the exponential pdf to obtain our 5th and 95th percentile values. To do so, we use the qexp() function with the desired percentile (as a decimal) and our optimal lambda value as the rate argument:
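The calls aren't echoed above; they were presumably the following (matching the two outputs below):
#5th and 95th percentiles of the fitted exponential distribution
qexp(0.05, rate = optimal)
qexp(0.95, rate = optimal)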
## [1] 9280.044
## [1] 541991.5
We then proceed to generating a 95% confidence interval …
The confidence interval follows as \(CI = \mu \pm z \frac{s}{\sqrt{n}}\) and thus we need to find the components of the equation to calculate our interval:
#95% confidence interval (assuming normality):
z <- 1.96 #for 95% confidence
mu <- mean(train_table$SalePrice)
s <- sd(train_table$SalePrice)
n <- nrow(as.matrix(train_table$SalePrice))
#Calculate the differential
diff <- (z * s) / sqrt(n)
#Calculate lower bound of confidence interval
CI_lower <- mu - diff
CI_lower
## [1] 176846.1
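The upper bound, the mean, and the cross-check that follow were presumably produced along these lines (the ci() call is inferred from the warning text below, which points to a ci.numeric method such as the one in the gmodels package):
#Calculate upper bound of confidence interval
CI_upper <- mu + diff
CI_upper
#Show the mean for reference
mu
#Cross-check the interval with ci() (assumed here to be gmodels::ci)
library(gmodels)
ci(train_table$SalePrice, confidence = 0.95)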
## [1] 184996.2
## [1] 180921.2
## Warning in ci.numeric(train_table$SalePrice, confidence = 0.95): No class or
## unkown class. Using default calcuation.
## Estimate CI lower CI upper Std. Error
## 180921.196 176842.841 184999.551 2079.105
Our 95% confidence interval ranges from 176846.1 to 184996.2, centered on the mean of 180921. Because it is built from the standard error \(s/\sqrt{n}\), it is an interval estimate for the mean itself rather than a range over which 95% of the individual values should lie.
Moving on from the confidence interval, we seek our empirical 5th and 95th percentiles with regard to the data. To do so, we use the quantile() function:
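The calls aren't echoed; presumably:
#Empirical 5th and 95th percentiles of SalePrice
quantile(train_table$SalePrice, 0.05)
quantile(train_table$SalePrice, 0.95)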
## 5%
## 88000
## 95%
## 326100
Comparing our 5th and 95th percentiles for the original data vs. those of the fitted exponential highlights the contrast between the two. We saw it in the scaling of axes and spread of values on the histograms, and we see it again in the percentiles: fitting our data to an exponential distribution resulted in a lower 5th percentile and a higher 95th percentile (as compared to the original data).
It's also worth revisiting the calculated confidence interval. It is not meant to provide a range that captures 95% of the data values; it is an interval estimate for the mean, which is why it is so much narrower than the spread in the summary statistics for the original data. A range intended to cover most individual sale prices would instead be based on the standard deviation itself or on the empirical percentiles above. I just thought it worth noting.
Modeling
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
After revisiting the earlier variable summary statistics, I selected 12 variables (LotArea, HouseStyle, OverallQual, OverallCond, TotalBsmtSF, X1stFlrSF, X2ndFlrSF, GrLivArea, BedroomAbvGr, TotRmsAbvGrd, GarageCars, GarageArea) based on (1) whether they were numeric (HouseStyle being the one categorical exception) and (2) whether I perceived a correlation between the variable and SalePrice. I cast a wide net to start, believing that the p-values from the model summary could be used to weed out any non-pertinent variables.
For each variable whose p-value was above the 0.05 threshold, I compared the R-squared value with and without that variable; if dropping it had no net negative effect, I dropped it. This resulted in the removal of X1stFlrSF and GarageArea, leaving the 10-variable multiple linear regression model shown below:
#Determine variables and run model
attach(train_table)
housing.lm <- lm(SalePrice ~ LotArea + HouseStyle + OverallQual + OverallCond + TotalBsmtSF + X2ndFlrSF + GrLivArea + BedroomAbvGr + TotRmsAbvGrd + GarageCars)
#Display summary statistics
summary(housing.lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea + HouseStyle + OverallQual +
## OverallCond + TotalBsmtSF + X2ndFlrSF + GrLivArea + BedroomAbvGr +
## TotRmsAbvGrd + GarageCars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -473707 -17648 -1121 16349 260604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.345e+05 8.852e+03 -15.194 < 2e-16 ***
## LotArea 5.940e-01 1.038e-01 5.722 1.28e-08 ***
## HouseStyle1.5Unf 1.553e+04 1.079e+04 1.438 0.15051
## HouseStyle1Story 3.525e+04 4.756e+03 7.412 2.12e-13 ***
## HouseStyle2.5Fin -3.184e+04 1.407e+04 -2.262 0.02384 *
## HouseStyle2.5Unf -4.711e+04 1.181e+04 -3.990 6.93e-05 ***
## HouseStyle2Story -2.680e+03 4.184e+03 -0.640 0.52198
## HouseStyleSFoyer 3.254e+04 7.408e+03 4.393 1.20e-05 ***
## HouseStyleSLvl 2.553e+04 5.971e+03 4.276 2.03e-05 ***
## OverallQual 2.344e+04 1.083e+03 21.650 < 2e-16 ***
## OverallCond 3.698e+03 9.039e+02 4.092 4.52e-05 ***
## TotalBsmtSF 2.575e+01 4.189e+00 6.147 1.02e-09 ***
## X2ndFlrSF 3.932e+01 8.015e+00 4.906 1.04e-06 ***
## GrLivArea 4.223e+01 5.684e+00 7.430 1.86e-13 ***
## BedroomAbvGr -1.119e+04 1.729e+03 -6.473 1.31e-10 ***
## TotRmsAbvGrd 4.110e+03 1.250e+03 3.287 0.00104 **
## GarageCars 1.570e+04 1.729e+03 9.080 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37020 on 1443 degrees of freedom
## Multiple R-squared: 0.7852, Adjusted R-squared: 0.7829
## F-statistic: 329.8 on 16 and 1443 DF, p-value: < 2.2e-16
With a high R-squared value (0.7852) and a very small p-value (< 2.2e-16), it appears we have a promising model on our hands, one that explains roughly 78.5% of the variance in SalePrice.
Exploring further via plotting leads to:
#Subset graphs so we can display all (3) on one output
par(mfrow=c(2,2))
#Histogram of residuals
hist(resid(housing.lm), breaks = 50, main = "Histogram of Residuals", xlab= "")
#Residuals plot
plot(resid(housing.lm), fitted(housing.lm), main = "Residuals Plot")
#Q-Q plot
qqnorm(resid(housing.lm))
qqline(resid(housing.lm))
A nearly normal histogram of residuals, a residuals plot that appears (although difficult to interpret) to be randomly scattered within a certain range, and a Q-Q plot that follows the line except at the extreme lower and upper quantiles (around -3 and 3) suggest that we're dealing with a reasonably strong model. For a stronger model, we'd still want to account for NA values and outliers.
At this point, we re-introduce our test data, attempt to account for NA values and outliers, test our model with the test data, interpret the result, and submit our findings to Kaggle:
#Identify outlier value ranges
#hist(test_table$LotArea, breaks=50) #30000 and up
#hist(test_table$GrLivArea, breaks=50) #3200 and up
#Remove Outliers from test data
RO_test <- test_table %>%
filter(LotArea < 30000) %>%
filter(GrLivArea < 3200)
#Subset the outlier-filtered data based on columns of interest
SRO_test <- subset(RO_test, select=c(LotArea, HouseStyle, OverallQual, OverallCond, TotalBsmtSF, X2ndFlrSF, GrLivArea, BedroomAbvGr, TotRmsAbvGrd, GarageCars))
#Account for NA values
NSRO_test <- na.omit(SRO_test)
After removing outlier and NA values from the test dataset, we’re ready to feed our model and observe the results:
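The prediction step itself isn't echoed; given that SP_pred is used in the plotting code below, it was presumably something like this (a sketch):
#Predict SalePrice on the cleaned test data and compare with the training distribution
SP_pred <- predict(housing.lm, NSRO_test)
summary(SP_pred)
summary(train_table$SalePrice)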
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -13608 128986 167928 177916 217400 609665
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Looking at our summary statistics: aside from the Min and Max values, the 1st Qu., Median, Mean, and 3rd Qu. of the predicted and actual SalePrice distributions are remarkably close.
Next, we compare output histograms:
#Compare actual vs. predictive plots
par(mfrow=c(1,2))
hist(SP_pred, breaks=25, xlab = "Housing Sales Price", main = 'Predictive Histogram [test]')
hist(train_table$SalePrice, breaks=30, xlim = c(0, 600000), xlab = "Housing Sales Price", main = 'Actual Histogram [train]')
Again, although it may be tough to read because the x-axis is zoomed out to accommodate the wide range of sale prices, our histograms are also (aside from outlier values) remarkably similar.
At this point, I felt quite confident in the model and just had to re-package the SalePrice predictions with corresponding IDs prior to submitting back to Kaggle:
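The line that produced the warning below isn't echoed; reconstructed from the warning text, it was presumably the cbind() call shown here. Note that the warning means SP_pred is shorter than test_table$Id (rows were dropped during cleaning), so R recycles the predictions and the Id-to-prediction alignment is questionable, which likely hurt the final score:
#Re-package predictions with the corresponding test Ids
rpckgd <- cbind(test_table$Id, SP_pred) #triggers the length-mismatch warning below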
## Warning in cbind(test_table$Id, SP_pred): number of rows of result is not a
## multiple of vector length (arg 2)
colnames(rpckgd) = c("Id", "SalePrice")
rpckgd[rpckgd<0] <- 0 #account for negative values
final = as.data.frame(rpckgd)
head(final) #verify output
##Does the number of decimal places matter for SalePrice (ie. round(,2))?
#Write to csv - COMMENTED OUT to knit .rmd file
#write.csv(final, file="kaggle_hp_sub.csv", row.names = FALSE)
#Submit and see my score
……………………………………………………………………
Closing Remarks
Based on the submission, I see that while I may have been confident in the model based on the R-squared and p-values from the summary statistics and the residual plots, and while I did adapt to what I thought the feedback was (accounting for NA values and outliers), my Kaggle score confirms that I still have a long way to go.
Kaggle score
To improve this model, I could have looked into and applied some more advanced methods (e.g., feature engineering or random forests). I'm not currently familiar with these methods, but now at least I can see where they may be applicable.
All in all, this project (and the Kaggle submission in particular) was a great way to review many of the concepts covered this past semester, while highlighting how much more there is to learn and how much further there is to go.