Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of \(\mu=\sigma=\frac{(N+1)}{2}\).
Following the instructions from above, here is my R syntax to generate a random variable X:
set.seed(123)
N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N+1)/2, (N+1)/2)
From the above syntax, you can see that I’ve created a variable \(X\) with 10,000 random uniform numbers from 1 to \(N\) (\(N=10\)). Additionally, I’ve created a random variable \(Y\) that has 10,000 random normal numbers with a \(\mu\) and \(\sigma\) of \(\frac{N+1}{2}\).
Probability: Calculate as a minimum the below probabilities a through c. Assume the small letter \(x\) is estimated as the median of the \(X\) variable, and the small letter \(y\) is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.
To work through this, I’ll first have to calculate \(x\) (the median) and \(y\) (the first quartile). Additionally, I’ve saved \(X\) and \(Y\) in a dataframe and stored the total number of rows (10,000) in a variable:
x <- quantile(X, 0.50)
y <- quantile(Y, 0.25)
df <- data.frame(X = X, Y = Y)
total_rows <- nrow(df)
a) \(P(X>x \ | \ X>y)\)
We can use the following equation to find the probability:
\[P(A \ | \ B) = \frac{P(A \ \cap \ B)}{P(B)} \\ where \ A = P(X>x) \ and \ B = P(X>y)\]
With this in mind, we can solve:
P_AandB <- nrow(df %>% filter(X > x & X > y)) / total_rows
P_B <- nrow(df %>% filter(X > y)) / total_rows
P_AgivenB <- P_AandB / P_B
P_AgivenB
## [1] 0.5509642
Solution: The probability \(P(X>x \ | \ X>y) = 0.5510\). In other words, given that a uniform draw from 1 to 10 is greater than \(y = 1.84\) (the first quartile of \(Y\)), the probability that it is also greater than the median of \(X\) (5.45) is 0.5510.
b) \(P(X>x, \ Y>y)\)
We can estimate this joint probability directly, by counting the observations that satisfy both conditions and dividing by the total:
\[P(A, \ B) = \frac{n(A \ \cap \ B)}{n_{total}} \\ where \ A = (X>x) \ and \ B = (Y>y)\]
With this in mind, we can solve:
P_AandB_2 <- nrow(df %>% filter(X > x & Y > y)) / total_rows
P_AandB_2
## [1] 0.3756
Solution: The probability \(P(X>x, \ Y>y) = 0.3756\). In other words, the probability that a uniform draw of \(X\) is greater than the median of \(X\) (5.45) and, at the same time, the corresponding draw of \(Y\) is greater than the first quartile of \(Y\) (1.84) is 0.3756.
c) \(P(X<x \ | \ X>y)\)
Similar to part a, we can use the following equation to solve:
\[P(A \ | \ B) = \frac{P(A \ \cap \ B)}{P(B)} \\ where \ A = P(X<x) \ and \ B = P(X>y)\]
With this in mind, we can solve:
P_AandB <- nrow(df %>% filter(X < x & X > y)) / total_rows
P_B <- nrow(df %>% filter(X > y)) / total_rows
P_AgivenB_2 <- P_AandB / P_B
P_AgivenB_2
## [1] 0.4490358
Solution: The probability \(P(X<x \ | \ X>y) = 0.4490\). In other words, given that a uniform draw from 1 to 10 is greater than \(y = 1.84\), the probability that it is less than the median of \(X\) (5.45) is 0.4490.
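As a quick consistency check (a small sketch using the objects already computed above), the events in parts (a) and (c) are complementary given \(X>y\), so the two conditional probabilities should sum to 1 (and indeed 0.5510 + 0.4490 = 1):
# P(X>x | X>y) + P(X<x | X>y) should equal 1
P_AgivenB + P_AgivenB_2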
Investigate whether \(P(X>x \ and \ Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
To do this investigation, we’ll first have to calculate the joint probabilities.
We’ll estimate each joint probability as the count of observations meeting a pair of conditions, divided by the total number of observations:
\[P(A \ \cap \ B) = \frac{n(A \ \cap \ B)}{n_{total}}\] Therefore, we first create two indicator columns recording whether each observation satisfies \(X > x\) (or not) and \(Y > y\) (or not).
Then, we can take the count of each of the four combinations and divide by the total to get their respective joint probabilities:
Below is the R syntax I used to do this, relying on the group_by() and summarise() functions from the tidyverse package:
# Joint probabilities
prob_df <- df %>%
mutate(A = ifelse(X > x, " X > x", " X < x"), B = ifelse(Y > y, " Y > y", " Y < y")) %>%
group_by(A, B) %>%
summarise(count = n()) %>%
mutate(prob = count / total_rows)
Next, we need to back up a bit and examine the marginal probabilities of \(P(A)\) and \(P(B)\) independently:
Intuitively, we know that \(x = median(X)\); therefore, half of the values of \(X\) are greater than \(x\) and the other half are less than \(x\). Similarly, since \(y\) is the first quartile of \(Y\), a quarter of the values of \(Y\) are less than \(y\) and the remaining three fourths are greater than \(y\). Therefore, the marginal probabilities are \(P(X>x) = P(X<x) = 0.50\), \(P(Y<y) = 0.25\), and \(P(Y>y) = 0.75\).
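Before reworking the table, these marginal probabilities can also be verified directly (a minimal sketch using the X, Y, x, and y objects defined earlier):
# Empirical marginal probabilities
mean(X > x)   # should be about 0.50, since x is the median of X
mean(Y > y)   # should be about 0.75, since y is the first quartile of Y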
We can check these assumptions by reworking our table:
prob_df <- prob_df %>%
ungroup() %>%
group_by(A) %>%
summarise(count = sum(count), prob = sum(prob)) %>%
mutate(B = "Total Prob") %>%
bind_rows(prob_df)
prob_df <- prob_df %>%
ungroup() %>%
group_by(B) %>%
summarise(count = sum(count), prob = sum(prob)) %>%
mutate(A = "Total Prob") %>%
bind_rows(prob_df)
And by printing our table, we can see that our marginal probabilities that we assumed are confirmed:
prob_df %>%
dplyr::select(-count) %>%
spread(A, prob) %>%
rename(" " = B) %>%
kable() %>%
kable_styling(bootstrap_options = c('striped'), full_width = FALSE)
|   | X < x | X > x | Total Prob |
|---|---|---|---|
| Y < y | 0.1256 | 0.1244 | 0.25 |
| Y > y | 0.3744 | 0.3756 | 0.75 |
| Total Prob | 0.5000 | 0.5000 | 1.00 |
Solution: From the table above, \(P(X>x \ and \ Y>y) = 0.3756\), while \(P(X>x)P(Y>y) = (0.50)(0.75) = 0.375\). The two values are essentially equal, with the small difference attributable to sampling variability, which is what we would expect if the joint probability factors into the product of the marginals.
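As a supplementary check (a minimal sketch, not part of the table output above), the same comparison can be made directly from the indicator events:
# Joint probability vs. product of marginals; these should be nearly equal under independence
mean(X > x & Y > y)
mean(X > x) * mean(Y > y)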
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
In order to run Fisher’s Exact Test and the Chi-Square Test, we’ll first need to find the counts for each of our conditions, similar to the previous question, and create a two-by-two table:
X_plus <- nrow(df %>%
filter(X > x))
X_lesseq <- nrow(df %>%
filter(X <= x))
Y_lesseq <- nrow(df %>%
filter(Y <= y))
Y_plus <- nrow(df %>%
filter(Y > y))
freq_matrix <- matrix(c(X_plus, X_lesseq, Y_plus, Y_lesseq),
nrow = 2, ncol = 2, byrow = TRUE,
dimnames = list(c("x", "y"),
c("X > x; Y > y", "X <= x; Y <= y")))
With our frequency table ready to go, we can see our counts below:
freq_matrix %>%
kable() %>%
kable_styling(bootstrap_options = c('striped'), full_width = FALSE)
|   | X > x; Y > y | X <= x; Y <= y |
|---|---|---|
| x | 5000 | 5000 |
| y | 7500 | 2500 |
Now, we are ready to run our Fisher’s Exact Test:
fisher.test(freq_matrix)
##
## Fisher's Exact Test for Count Data
##
## data: freq_matrix
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.3137986 0.3540659
## sample estimates:
## odds ratio
## 0.3333333
And here is the Chi-Square Test:
chisq.test(freq_matrix)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: freq_matrix
## X-squared = 1332.3, df = 1, p-value < 2.2e-16
Looking at both outputs, we can see that we reject the null hypothesis in both cases, since each p-value is far below the 0.05 significance level. Since Fisher’s Exact Test is typically used for smaller sample sizes, and our sample has 10,000 observations, it is more appropriate to use the Chi-Square Test.
Solution:
Our \(H_0\) = There is no relationship between X and Y.
Our \(H_a\) = There is a significant relationship between X and Y.
Given our outputs, independence does not hold when we run the Fisher’s Exact Test and Chi-Square test. We reject the null hypothesis that there is no relationship between X and Y. Since the Fisher’s Exact Test is typically used on smaller sample sizes, and we have a large sample size of 10,000 observations, it would be more appropriate to rely on the output from our Chi-Square Test.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Descriptive and Inferential Statistics: Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
kaggle_df <- read_csv('https://raw.githubusercontent.com/zachalexander/data605_cuny/master/Final%20Exam/train.csv')
# summary(kaggle_df)
On a quick look at the summary, I decided to pull out the numeric values to explore a bit more:
numerics <- data.frame(summary(kaggle_df))
numerics <- numerics %>%
filter(!is.na(Freq))
numerics <- data.frame(numerics %>%
filter(!grepl('Class', Freq), !grepl('Length', Freq), !grepl('Mode', Freq)) %>%
separate(Freq, c('Value', 'Count'), sep = ":", remove = TRUE))
numerics$Var2 <- trimws(numerics$Var2)
numerics$Value <- trimws(numerics$Value)
numerics$Count <- round(as.numeric(trimws(numerics$Count)), 1)
numerics <- numerics %>%
dplyr::select(Var2, Value, Count) %>%
spread(Value, Count) %>%
rename("Attribute" = "Var2", "Q1" = `1st Qu.`, "Q3" = `3rd Qu.`, "Max" = "Max.", "Min" = "Min.", "NA" = "NA's") %>%
dplyr::select(Attribute, Min, Q1, Median, Mean, Q3, Max, `NA`)
kable(numerics) %>%
kable_styling(bootstrap_options = 'striped')
| Attribute | Min | Q1 | Median | Mean | Q3 | Max | NA |
|---|---|---|---|---|---|---|---|
| 1stFlrSF | 334 | 882.0 | 1087.0 | 1163.0 | 1391.0 | 4692 | NA |
| 2ndFlrSF | 0 | 0.0 | 0.0 | 347.0 | 728.0 | 2065 | NA |
| 3SsnPorch | 0 | 0.0 | 0.0 | 3.4 | 0.0 | 508 | NA |
| BedroomAbvGr | 0 | 2.0 | 3.0 | 2.9 | 3.0 | 8 | NA |
| BsmtFinSF1 | 0 | 0.0 | 383.5 | 443.6 | 712.2 | 5644 | NA |
| BsmtFinSF2 | 0 | 0.0 | 0.0 | 46.5 | 0.0 | 1474 | NA |
| BsmtFullBath | 0 | 0.0 | 0.0 | 0.4 | 1.0 | 3 | NA |
| BsmtHalfBath | 0 | 0.0 | 0.0 | 0.1 | 0.0 | 2 | NA |
| BsmtUnfSF | 0 | 223.0 | 477.5 | 567.2 | 808.0 | 2336 | NA |
| EnclosedPorch | 0 | 0.0 | 0.0 | 21.9 | 0.0 | 552 | NA |
| Fireplaces | 0 | 0.0 | 1.0 | 0.6 | 1.0 | 3 | NA |
| FullBath | 0 | 1.0 | 2.0 | 1.6 | 2.0 | 3 | NA |
| GarageArea | 0 | 334.5 | 480.0 | 473.0 | 576.0 | 1418 | NA |
| GarageCars | 0 | 1.0 | 2.0 | 1.8 | 2.0 | 4 | NA |
| GarageYrBlt | 1900 | 1961.0 | 1980.0 | 1979.0 | 2002.0 | 2010 | 81 |
| GrLivArea | 334 | 1130.0 | 1464.0 | 1515.0 | 1777.0 | 5642 | NA |
| HalfBath | 0 | 0.0 | 0.0 | 0.4 | 1.0 | 2 | NA |
| Id | 1 | 365.8 | 730.5 | 730.5 | 1095.2 | 1460 | NA |
| KitchenAbvGr | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 3 | NA |
| LotArea | 1300 | 7554.0 | 9478.0 | 10517.0 | 11602.0 | 215245 | NA |
| LotFrontage | 21 | 59.0 | 69.0 | 70.0 | 80.0 | 313 | 259 |
| LowQualFinSF | 0 | 0.0 | 0.0 | 5.8 | 0.0 | 572 | NA |
| MasVnrArea | 0 | 0.0 | 0.0 | 103.7 | 166.0 | 1600 | 8 |
| MiscVal | 0 | 0.0 | 0.0 | 43.5 | 0.0 | 15500 | NA |
| MoSold | 1 | 5.0 | 6.0 | 6.3 | 8.0 | 12 | NA |
| MSSubClass | 20 | 20.0 | 50.0 | 56.9 | 70.0 | 190 | NA |
| OpenPorchSF | 0 | 0.0 | 25.0 | 46.7 | 68.0 | 547 | NA |
| OverallCond | 1 | 5.0 | 5.0 | 5.6 | 6.0 | 9 | NA |
| OverallQual | 1 | 5.0 | 6.0 | 6.1 | 7.0 | 10 | NA |
| PoolArea | 0 | 0.0 | 0.0 | 2.8 | 0.0 | 738 | NA |
| SalePrice | 34900 | 129975.0 | 163000.0 | 180921.0 | 214000.0 | 755000 | NA |
| ScreenPorch | 0 | 0.0 | 0.0 | 15.1 | 0.0 | 480 | NA |
| TotalBsmtSF | 0 | 795.8 | 991.5 | 1057.4 | 1298.2 | 6110 | NA |
| TotRmsAbvGrd | 2 | 5.0 | 6.0 | 6.5 | 7.0 | 14 | NA |
| WoodDeckSF | 0 | 0.0 | 0.0 | 94.2 | 168.0 | 857 | NA |
| YearBuilt | 1872 | 1954.0 | 1973.0 | 1971.0 | 2000.0 | 2010 | NA |
| YearRemodAdd | 1950 | 1967.0 | 1994.0 | 1985.0 | 2004.0 | 2010 | NA |
| YrSold | 2006 | 2007.0 | 2008.0 | 2008.0 | 2009.0 | 2010 | NA |
With my numerics dataframe ready, I can now set up some visuals by splitting the dataframe into two: one with all quantitative variables, and the other with categorical data:
kaggle_numerics <- kaggle_df[, (colnames(kaggle_df)) %in% numerics$Attribute]
kaggle_categorical <- kaggle_df[, !(colnames(kaggle_df) %in% numerics$Attribute)]
kaggle_numerics %>%
dplyr::select(2:11) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
kaggle_numerics %>%
dplyr::select(12:21) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
kaggle_numerics %>%
dplyr::select(22:31) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
kaggle_numerics %>%
dplyr::select(32:38) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
After creating a few plot matrices with our quantitative data, I thought there were a few initial takeaways:
Many of the living-space and square-footage attributes show stronger correlations with one another. This makes sense, since the square footage of one floor is often similar to that of another floor in the same house. For instance, there’s a strong correlation between the first floor square footage and the basement square footage. When thinking about our regression later, it’s nice to know that there are multiple attributes here that could serve as proxies (if needed).
Additionally, I’m seeing a gradual, positive trend toward larger garage areas the later the garage was built in the 20th century – for instance, garages built in the early part of the century tend to be smaller than those built in the later 1900s and early 2000s. This may be something to look into further, along with its relationship with the Sale Price, later on.
It’s also interesting that the year the house was built has a relationship with the rating of overall material and finish quality of the house. Again, this makes sense, but it would be valuable to see whether this has a relationship with Sale Price.
Now, let’s explore the categorical variables a bit.
table(kaggle_categorical$LandSlope)
##
## Gtl Mod Sev
## 1382 65 13
table(kaggle_categorical$HouseStyle)
##
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
## 154 14 726 8 11 445 37 65
table(kaggle_categorical$RoofMatl)
##
## ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
## 1 1434 1 1 1 11 5 6
table(kaggle_categorical$CentralAir)
##
## N Y
## 95 1365
table(kaggle_categorical$HeatingQC)
##
## Ex Fa Gd Po TA
## 741 49 241 1 428
table(kaggle_categorical$SaleType)
##
## COD Con ConLD ConLI ConLw CWD New Oth WD
## 43 2 9 5 5 4 122 3 1267
table(kaggle_categorical$SaleCondition)
##
## Abnorml AdjLand Alloca Family Normal Partial
## 101 4 12 20 1198 125
table(kaggle_categorical$Foundation)
##
## BrkTil CBlock PConc Slab Stone Wood
## 146 634 647 24 6 3
After looking through a large number of categorical values, and checking frequencies, a few things stood out to me:
It looks like the majority of houses are built on gentle slopes; however, some are built on moderate or severe slopes – this could be something to look into further when building out the regression later on. It may be interesting to see whether there is any correlation here with the Sale Price.
Similarly, the foundation material is split heavily between cinder blocks and poured concrete. This is another feature that may be useful to explore later, to see whether it has an effect on housing prices, or whether there is a connection between the year a house was built and the type of foundation used.
Finally, Sale Type and Sale Condition were interesting values and their frequencies suggest that there isn’t a “one-size-fits-all” approach to certain transactions, making this an enticing set of features to explore further.
After plotting all of my quantitative variables earlier, I thought I’d home in on a few whose relationships with the Sale Price variable may be worth exploring further. You can find my plot below:
pairs(kaggle_df[, c('1stFlrSF', 'GrLivArea', 'GarageArea', 'TotalBsmtSF', 'YearBuilt', 'SalePrice' )], pch = 19)
For our correlation matrix, I will use GrLivArea, 1stFlrSF, and SalePrice. Before computing it, I thought it would be interesting to visualize these three attributes together in a correlation panel:
kaggle_numerics %>%
dplyr::select(`1stFlrSF`, `GrLivArea`, SalePrice) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
corr_matrix <- kaggle_numerics %>%
dplyr::select(`1stFlrSF`, GrLivArea, SalePrice) %>%
cor() %>%
as.matrix()
As we can see from above, all three attributes – the square footage of the first floor, the above grade (ground) living area, and the Sale Price – appear to have positive relationships with one another. All three distributions also appear to be unimodal and right-skewed. Below is the correlation matrix itself:
kable(corr_matrix) %>%
kable_styling(bootstrap_options = 'striped', full_width = FALSE)
|   | 1stFlrSF | GrLivArea | SalePrice |
|---|---|---|---|
| 1stFlrSF | 1.0000000 | 0.5660240 | 0.6058522 |
| GrLivArea | 0.5660240 | 1.0000000 | 0.7086245 |
| SalePrice | 0.6058522 | 0.7086245 | 1.0000000 |
Although these correlations appear quite strong, we can formally test each pairwise correlation against zero, reporting an 80% confidence interval for each:
corr1 <- cor.test(kaggle_numerics$SalePrice, kaggle_numerics$`1stFlrSF`, method = 'pearson', conf.level = 0.80)
corr2 <- cor.test(kaggle_numerics$SalePrice, kaggle_numerics$GrLivArea, method = 'pearson', conf.level = 0.80)
corr3 <- cor.test(kaggle_numerics$`1stFlrSF`, kaggle_numerics$GrLivArea, method = 'pearson', conf.level = 0.80)
corr1
##
## Pearson's product-moment correlation
##
## data: kaggle_numerics$SalePrice and kaggle_numerics$`1stFlrSF`
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5841687 0.6266715
## sample estimates:
## cor
## 0.6058522
corr2
##
## Pearson's product-moment correlation
##
## data: kaggle_numerics$SalePrice and kaggle_numerics$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
corr3
##
## Pearson's product-moment correlation
##
## data: kaggle_numerics$`1stFlrSF` and kaggle_numerics$GrLivArea
## t = 26.217, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5427732 0.5884078
## sample estimates:
## cor
## 0.566024
Solution: After running the three pairwise correlation tests between SalePrice, 1stFlrSF, and GrLivArea, the outputs show that in every case we reject the null hypothesis that the true correlation is zero, in favor of the alternative that it is not; the estimated correlations are all positive. I know this to be the case because the p-values for all three Pearson correlation tests are far less than 0.05.
Solution: Additionally, since our correlation tests yield very small p-values, all less than 0.001, I’m not worried about familywise error across these three comparisons: even a conservative adjustment such as Bonferroni (multiplying each p-value by 3) would leave all of them far below 0.05, so a Type I error is very unlikely here.
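To make that argument concrete, here is a small sketch (not part of the original analysis) that applies a Bonferroni adjustment to the three p-values from the cor.test() objects above; even after the correction, all of them remain far below 0.05:
# Bonferroni-adjust the three pairwise correlation p-values
p_vals <- c(corr1$p.value, corr2$p.value, corr3$p.value)
p.adjust(p_vals, method = "bonferroni")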
Linear Algebra and Correlation: Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Remembering that the inverse of a matrix multiplied by a matrix will become the identity matrix:
\[A^{−1}A=AA^{−1}=I\]
We can find our precision matrix by using the solve() function in base R to invert our correlation matrix:
prec_matrix <- solve(corr_matrix)
prec_matrix
## 1stFlrSF GrLivArea SalePrice
## 1stFlrSF 1.6795240 -0.4611713 -0.690746
## GrLivArea -0.4611713 2.1352622 -1.233697
## SalePrice -0.6907460 -1.2336974 2.292718
Then, we can multiply the precision matrix by the correlation matrix (and vice versa) to confirm that we get the identity matrix:
\[ correlation \ matrix \times \ precision \ matrix = \]
\[\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{bmatrix}\]
I_1 <- round(prec_matrix %*% corr_matrix, 2)
I_2 <- round(corr_matrix %*% prec_matrix, 2)
I_1
## 1stFlrSF GrLivArea SalePrice
## 1stFlrSF 1 0 0
## GrLivArea 0 1 0
## SalePrice 0 0 1
I_2
## 1stFlrSF GrLivArea SalePrice
## 1stFlrSF 1 0 0
## GrLivArea 0 1 0
## SalePrice 0 0 1
As we can see above, we do indeed get the Identity Matrix when we multiply the precision matrix with the correlation matrix and vice-versa.
Finally, we will perform LU decomposition on the matrix. Fortunately, the matrixcalc package provides the lu.decomposition() function, which returns the lower and upper triangular factors of our precision matrix, \(L\) and \(U\) respectively. The relationship between these matrices is given by:
\[A = LU\] where \(A\) is our precision matrix, \(L\) is the lower triangular matrix, and \(U\) is the upper triangular matrix. We can see, when finding \(L\) and \(U\) below, that we get these respective matrices:
library(matrixcalc)
A <- prec_matrix
L <- lu.decomposition(A)$L
U <- lu.decomposition(A)$U
round(L, 2)
## [,1] [,2] [,3]
## [1,] 1.00 0.00 0
## [2,] -0.27 1.00 0
## [3,] -0.41 -0.71 1
round(U, 2)
## [,1] [,2] [,3]
## [1,] 1.68 -0.46 -0.69
## [2,] 0.00 2.01 -1.42
## [3,] 0.00 0.00 1.00
And when we multiply these together, we can obtain our original matrix (\(A\)):
round(A, 2)
## 1stFlrSF GrLivArea SalePrice
## 1stFlrSF 1.68 -0.46 -0.69
## GrLivArea -0.46 2.14 -1.23
## SalePrice -0.69 -1.23 2.29
round(L %*% U, 2)
## [,1] [,2] [,3]
## [1,] 1.68 -0.46 -0.69
## [2,] -0.46 2.14 -1.23
## [3,] -0.69 -1.23 2.29
\[\begin{equation} A = LU, \space where \space \begin{bmatrix} 1.68 & -0.46 & -0.69\\ -0.46 & 2.14 & -1.23\\ -0.69 & -1.23 & 2.29\\ \end{bmatrix} \space = \space \begin{bmatrix} 1.00 & 0.00 & 0.00\\ -0.27 & 1.00 & 0.00\\ -0.41 & -0.71 & 1.00\\ \end{bmatrix} \space \begin{bmatrix} 1.68 & -0.46 & -0.69\\ 0.00 & 2.01 & -1.42\\ 0.00 & 0.00 & 1.00\\ \end{bmatrix} \end{equation}\]
Calculus-Based Probability & Statistics: Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
For this exercise, we’ll use the GrLivArea variable from the training dataset. When we plot it below on a histogram we can quickly see that it is skewed to the right:
ggplot(kaggle_df, aes(GrLivArea)) +
geom_histogram(aes(y = ..density..), colour="#555555", fill="#dddddd") +
geom_density(alpha=.6, fill="#333333") +
geom_vline(xintercept = min(kaggle_df$GrLivArea)) +
labs(title = "Above Ground Living Area (square feet)", align='middle') +
theme_minimal()
We can also see that the minimum value is absolutely above zero, so we do not need to perform any shifts. We can double check the data by using the summary() function, and we see that the minimum value is 334 square feet:
summary(kaggle_df$GrLivArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
After loading the MASS package and running fitdistr() to fit an exponential probability density function to our variable, I found that the optimal value of \(\lambda\) for this distribution is 0.000659864.
require(MASS)
exp_fit <- fitdistr(kaggle_df$GrLivArea, "exponential")
est_rate <- exp_fit$estimate
est_rate
## rate
## 0.000659864
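As a sanity check (a small sketch, not part of the original output), recall that the maximum-likelihood estimate of an exponential rate parameter is the reciprocal of the sample mean, so the fitted rate should match:
# MLE of the exponential rate is 1 / sample mean
1 / mean(kaggle_df$GrLivArea)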
Using this fitted value of \(\lambda\), I then generated 1,000 samples from the exponential distribution:
set.seed(123)
pdf <- rexp(1000, est_rate)
pdf_df <- as.data.frame(pdf)
I then plotted this distribution below, along with my original distribution for GrLivArea:
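A minimal sketch of one way to produce that side-by-side comparison (using the pdf_df and kaggle_df objects above; the exact styling of the original figure may differ):
# Side-by-side histograms: simulated exponential draws vs. observed GrLivArea
par(mfrow = c(1, 2))
hist(pdf_df$pdf, breaks = 40, main = "Simulated exponential draws", xlab = "Square feet")
hist(kaggle_df$GrLivArea, breaks = 40, main = "Observed GrLivArea", xlab = "Square feet")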
A few takeaways when comparing the distributions:
The original data for GrLivArea appears to be shaped as a more normal distribution, albeit right skewed, the most frequent square footage tends to be around 1500 and 1650 square feet.
Alternatively, the exponential distribution fit to GrLivArea is even more right skewed, taking on an exponential shape with a much longer tail.
Next, I’ll find the 5th and 95th percentiles using the exponential cumulative distribution function (CDF). Thinking back to the formula from our reading, this function is:
\[F(x) = 1 - e^{-\lambda x}\] Since we already have found \(\lambda = 0.000659864\), we can solve the equation accordingly for the 5th percentile:
\[0.05 = 1 - e^{-0.000659864x}\] \[0.95 = e^{-0.000659864x}\] \[-ln(0.95) = 0.000659864x\] \[\frac{-ln(0.95)}{0.000659864} = x\] \[77.73313 = x\] Similarly, we can find the 95th percentile by:
\[0.95 = 1 - e^{-0.000659864x}\] \[0.05 = e^{-0.000659864x}\] \[-ln(0.05) = 0.000659864x\] \[\frac{-ln(0.05)}{0.000659864} = x\] \[4539.92 = x\] We can check these calculations using the qexp() function in R:
edf_5 <- qexp(0.05, rate=est_rate)
edf_95 <- qexp(0.95, rate=est_rate)
edf_5
## [1] 77.73313
edf_95
## [1] 4539.924
It looks like these match the calculations above!
Assuming normality, we can generate a 95% confidence interval for the mean from the empirical data using the following formula:
\[\overline{X} \pm Z * \frac{\sigma}{\sqrt{n}}\]
Since we have all of this information available to us, we can solve by utilizing the following:
\(Z\) = 1.96 (standard for a 95% confidence interval)
\(\overline{X}\) = mean value of GrLivArea
\(\sigma\) = standard deviation of GrLivArea
\(n\) = sample size of GrLivArea
X_cap <- mean(kaggle_df$GrLivArea)
Z <- 1.96
sigma <- sd(kaggle_df$GrLivArea)
n <- length(kaggle_df$GrLivArea)
upper_bound <- X_cap + Z * (sigma / sqrt(n))
lower_bound <- X_cap - Z * (sigma / sqrt(n))
upper_bound
## [1] 1542.419
lower_bound
## [1] 1488.509
mean(kaggle_df$GrLivArea)
## [1] 1515.464
median(kaggle_df$GrLivArea)
## [1] 1464
And below is the empirical 5th percentile and 95th percentile of the data:
emp_5 <- quantile(kaggle_df$GrLivArea, c(0.05, 0.95))[1]
emp_95 <- quantile(kaggle_df$GrLivArea, c(0.05, 0.95))[2]
calc_df <- data.frame(method= c('Exponential pdf percentiles', 'Empirical percentiles'), "percentile_5th" = c(edf_5, emp_5), "percentile_95th" = c(edf_95, emp_95))
calc_df %>%
kable() %>%
kable_styling(bootstrap_options = 'striped')
| method | percentile_5th | percentile_95th |
|---|---|---|
| Exponential pdf percentiles | 77.73313 | 4539.924 |
| Empirical percentiles | 848.00000 | 2466.100 |
Solution: My calculations for the exponential CDF percentiles and the empirical percentiles are summarized in the table above. Additionally, when calculating a 95% confidence interval from the empirical data (assuming normality), I found the interval to be \([1488.51, 1542.42]\). I also calculated the mean and median of the empirical data, which equal \(1515.46\) and \(1464\) respectively. Given that our GrLivArea variable is right skewed, this confidence interval needs careful interpretation: it says that if we were to repeatedly sample houses with the same sample size of 1,460, we should expect the sample mean of GrLivArea to fall within \([1488.51, 1542.42]\) about 95% of the time. Because the distribution is right skewed, the mean is pulled above the median, and the empirical median of 1,464 falls outside this interval, so an interval built around the mean does not describe a typical house particularly well.
Additionally, when using the exponential pdf to approximate our empirical distribution, we see a large disparity between the 5th percentile of the exponential fit (77.73) and that of the empirical data (848.00). The same is true of the 95th percentiles: the exponential fit yields roughly 4,540 square feet, while the empirical data yield 2,466.1 square feet. These 95th percentiles mean that 95% of the houses in our training dataset have above ground living areas below these markers: the exponential fit suggests 95% of houses are at or below about 4,540 square feet, while the empirical data show 95% of houses are at or below 2,466.1 square feet. This large difference shows that the fitted exponential distribution would not be a good estimator of our empirical data.
Modeling: Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
Before we get started on building our multiple regression model, I thought it would be helpful to take a closer look at the training dataset, and to hone in on a few variables that seem like good attributes to include in our analysis.
Earlier, I developed a large number of correlation matrices and scatter plots to visualize interactions between some of the quantitative variables in the dataset. When looking back at some of this data, I thought the following attributes stood out as possible additions to our regression model:
- 1stFlrSF (first floor square feet)
- GrLivArea (above ground living area square feet)
- GarageArea (size of garage in square feet)
- TotalBsmtSF (total square feet of basement area)
- YearBuilt (year house was built)
- TotRmsAbvGrd (total rooms above ground)
- BedroomAbvGr (bedrooms above ground)
- BsmtUnfSF (unfinished square feet of basement area)
- YrSold (year sold)

Additionally, I thought a few of the categorical variables stood out as possibly effective attributes for predicting the Sale Price:

- LandSlope (slope of property)
- HouseStyle (style of dwelling)
- CentralAir (does the house have central air conditioning)
- SaleType (type of sale)
- SaleCondition (condition of sale)
- Foundation (type of foundation)

With these variables of interest selected, I’ll need to recode the categorical variables into quantitative values:
kaggle_df$LandSlope <- recode(kaggle_df$LandSlope, Gtl=1, Mod=2, Sev=3)
kaggle_df$HouseStyle <- recode(kaggle_df$HouseStyle, "1.5Fin"=1, "1.5Unf"=2, "1Story"=3, "2.5Fin"=4, "2.5Unf"=5, "2Story"=6, "SFoyer"=7, "SLvl"=8)
kaggle_df$CentralAir <- recode(kaggle_df$CentralAir, N=1, Y=2)
kaggle_df$SaleType <- recode(kaggle_df$SaleType, "COD"=1, "Con"=2, "ConLD"=3, "ConLI"=4, "ConLw"=5, "CWD"=6, "New"=7, "Oth"=8, "WD"=9)
kaggle_df$SaleCondition <- recode(kaggle_df$SaleCondition, Abnorml=1, AdjLand=2, Alloca=3, Family=4, Normal=5, Partial=6)
kaggle_df$Foundation <- recode(kaggle_df$Foundation, BrkTil=1, CBlock=2, PConc=3, Slab=4, Stone=5, Wood=6)
Now, with the categorical values recoded to numeric values, I decided to take the training dataset and only select the columns needed for the analysis:
train_df <- kaggle_df %>%
dplyr::select(`1stFlrSF`, GrLivArea, GarageArea, TotalBsmtSF, YearBuilt, TotRmsAbvGrd, BedroomAbvGr, BsmtUnfSF, YrSold, LandSlope, HouseStyle, CentralAir, SaleType, SaleCondition, Foundation, SalePrice)
Finally, we can check for any null values. If there are null values, we’ll have to impute them:
anyNA(train_df)
## [1] FALSE
It looks like there aren’t any null values in our training dataset, so we are now ready to start building our multiple regression model.
For a first step, I’ll add in all of the variables from our subsetted dataframe. I’ll then use backward elimination to remove any variables with high p-values:
train_lm <- lm(data = train_df, SalePrice ~ `1stFlrSF` +`GrLivArea` + `GarageArea` + `TotalBsmtSF` + `YearBuilt` + `TotRmsAbvGrd` + `BedroomAbvGr` + `BsmtUnfSF` + `YrSold` + `LandSlope` + `HouseStyle` + `CentralAir` + `SaleType` + `SaleCondition` + `Foundation`)
summary(train_lm)
##
## Call:
## lm(formula = SalePrice ~ `1stFlrSF` + GrLivArea + GarageArea +
## TotalBsmtSF + YearBuilt + TotRmsAbvGrd + BedroomAbvGr + BsmtUnfSF +
## YrSold + LandSlope + HouseStyle + CentralAir + SaleType +
## SaleCondition + Foundation, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -630162 -18443 -2095 15229 280608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.802e+05 1.631e+06 -0.356 0.7221
## `1stFlrSF` -4.516e+00 5.592e+00 -0.808 0.4194
## GrLivArea 7.533e+01 4.338e+00 17.365 < 2e-16 ***
## GarageArea 5.149e+01 6.682e+00 7.706 2.39e-14 ***
## TotalBsmtSF 4.113e+01 4.863e+00 8.458 < 2e-16 ***
## YearBuilt 6.031e+02 5.843e+01 10.323 < 2e-16 ***
## TotRmsAbvGrd 5.593e+03 1.390e+03 4.024 6.02e-05 ***
## BedroomAbvGr -1.494e+04 1.884e+03 -7.928 4.41e-15 ***
## BsmtUnfSF -1.159e+01 2.810e+00 -4.123 3.96e-05 ***
## YrSold -3.234e+02 8.096e+02 -0.399 0.6896
## LandSlope 9.137e+03 3.987e+03 2.292 0.0221 *
## HouseStyle -8.950e+02 6.966e+02 -1.285 0.1990
## CentralAir 1.052e+04 4.771e+03 2.206 0.0275 *
## SaleType 1.278e+02 7.160e+02 0.179 0.8583
## SaleCondition 5.014e+03 1.029e+03 4.872 1.22e-06 ***
## Foundation 2.588e+03 2.013e+03 1.286 0.1988
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40970 on 1444 degrees of freedom
## Multiple R-squared: 0.7367, Adjusted R-squared: 0.734
## F-statistic: 269.4 on 15 and 1444 DF, p-value: < 2.2e-16
Then, using backward elimination (via the step() function), I removed variables that did not improve the model:
train_lm_2 <- step(train_lm, direction = 'backward', trace = FALSE)
summary(train_lm_2)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + GarageArea + TotalBsmtSF +
## YearBuilt + TotRmsAbvGrd + BedroomAbvGr + BsmtUnfSF + LandSlope +
## CentralAir + SaleCondition, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -630868 -18242 -2292 15344 280778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.275e+06 8.418e+04 -15.146 < 2e-16 ***
## GrLivArea 7.425e+01 4.103e+00 18.098 < 2e-16 ***
## GarageArea 5.085e+01 6.616e+00 7.686 2.78e-14 ***
## TotalBsmtSF 3.898e+01 3.298e+00 11.821 < 2e-16 ***
## YearBuilt 6.277e+02 4.429e+01 14.173 < 2e-16 ***
## TotRmsAbvGrd 5.468e+03 1.381e+03 3.958 7.92e-05 ***
## BedroomAbvGr -1.499e+04 1.866e+03 -8.031 1.98e-15 ***
## BsmtUnfSF -1.075e+01 2.757e+00 -3.901 0.0001 ***
## LandSlope 9.299e+03 3.973e+03 2.341 0.0194 *
## CentralAir 1.020e+04 4.744e+03 2.151 0.0317 *
## SaleCondition 5.138e+03 1.003e+03 5.123 3.41e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40960 on 1449 degrees of freedom
## Multiple R-squared: 0.736, Adjusted R-squared: 0.7342
## F-statistic: 404 on 10 and 1449 DF, p-value: < 2.2e-16
This process didn’t seem to affect the R-squared value much. Therefore, to try to boost the R-squared value a bit, I’ll look back and see whether I can add a few more variables and do more testing:
For this run, I decided to add in Neighborhood, OpenPorchSF, KitchenAbvGr, ExterQual, RoofMatl, YearRemodAdd, and OverallQual, while removing a few variables that seemed to be showing collinearity – CentralAir, LandSlope, YearBuilt, GarageArea, and TotRmsAbvGrd.
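One way to probe that collinearity concern (a minimal sketch, assuming the car package is available; not part of the original analysis) is to compute variance inflation factors for the previous backward-eliminated model:
# Variance inflation factors for train_lm_2; values well above ~5 suggest problematic collinearity
library(car)
vif(train_lm_2)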
train_df <- kaggle_df %>%
dplyr::select(`1stFlrSF`, GrLivArea, GarageArea, TotalBsmtSF, YearBuilt, TotRmsAbvGrd, BedroomAbvGr, BsmtUnfSF, YrSold, LandSlope, HouseStyle, CentralAir, SaleType, SaleCondition, Foundation, SalePrice, OverallQual, Fireplaces, YearRemodAdd, LotArea, RoofMatl, ExterQual, Utilities, KitchenAbvGr, Neighborhood, OpenPorchSF)
train_lm <- lm(data = train_df, SalePrice ~ `GrLivArea` + `TotalBsmtSF` + `BsmtUnfSF` + `SaleCondition` + OverallQual + YearRemodAdd + LotArea + RoofMatl + ExterQual + KitchenAbvGr + Neighborhood + OpenPorchSF)
summary(train_lm)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + TotalBsmtSF + BsmtUnfSF +
## SaleCondition + OverallQual + YearRemodAdd + LotArea + RoofMatl +
## ExterQual + KitchenAbvGr + Neighborhood + OpenPorchSF, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -367612 -12310 -288 12213 245180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.156e+06 1.096e+05 -10.547 < 2e-16 ***
## GrLivArea 5.576e+01 2.254e+00 24.741 < 2e-16 ***
## TotalBsmtSF 4.237e+01 2.679e+00 15.816 < 2e-16 ***
## BsmtUnfSF -2.263e+01 2.110e+00 -10.723 < 2e-16 ***
## SaleCondition 3.428e+03 7.376e+02 4.648 3.67e-06 ***
## OverallQual 1.195e+04 1.075e+03 11.117 < 2e-16 ***
## YearRemodAdd 3.071e+02 5.204e+01 5.901 4.52e-09 ***
## LotArea 5.168e-01 9.043e-02 5.715 1.34e-08 ***
## RoofMatlCompShg 5.932e+05 3.300e+04 17.974 < 2e-16 ***
## RoofMatlMembran 6.116e+05 4.471e+04 13.679 < 2e-16 ***
## RoofMatlMetal 6.192e+05 4.506e+04 13.744 < 2e-16 ***
## RoofMatlRoll 5.900e+05 4.455e+04 13.244 < 2e-16 ***
## RoofMatlTar&Grv 5.948e+05 3.403e+04 17.481 < 2e-16 ***
## RoofMatlWdShake 5.930e+05 3.565e+04 16.634 < 2e-16 ***
## RoofMatlWdShngl 6.694e+05 3.464e+04 19.323 < 2e-16 ***
## ExterQualFa -5.477e+04 1.030e+04 -5.319 1.21e-07 ***
## ExterQualGd -5.289e+04 4.968e+03 -10.646 < 2e-16 ***
## ExterQualTA -5.688e+04 5.649e+03 -10.070 < 2e-16 ***
## KitchenAbvGr -2.086e+04 3.874e+03 -5.385 8.49e-08 ***
## NeighborhoodBlueste -2.051e+04 2.234e+04 -0.918 0.35870
## NeighborhoodBrDale -2.479e+04 1.076e+04 -2.304 0.02136 *
## NeighborhoodBrkSide -6.248e+03 8.690e+03 -0.719 0.47225
## NeighborhoodClearCr -5.138e+03 9.940e+03 -0.517 0.60532
## NeighborhoodCollgCr 3.946e+03 7.690e+03 0.513 0.60794
## NeighborhoodCrawfor 1.275e+04 8.687e+03 1.468 0.14235
## NeighborhoodEdwards -1.688e+04 8.320e+03 -2.029 0.04260 *
## NeighborhoodGilbert 1.169e+03 8.143e+03 0.144 0.88589
## NeighborhoodIDOTRR -1.971e+04 9.305e+03 -2.118 0.03434 *
## NeighborhoodMeadowV -2.153e+04 1.070e+04 -2.012 0.04442 *
## NeighborhoodMitchel -7.903e+03 8.769e+03 -0.901 0.36761
## NeighborhoodNAmes -9.891e+03 8.004e+03 -1.236 0.21674
## NeighborhoodNoRidge 4.817e+04 8.922e+03 5.399 7.87e-08 ***
## NeighborhoodNPkVill -1.204e+04 1.253e+04 -0.960 0.33705
## NeighborhoodNridgHt 4.021e+04 8.204e+03 4.901 1.06e-06 ***
## NeighborhoodNWAmes -1.064e+04 8.435e+03 -1.261 0.20737
## NeighborhoodOldTown -2.234e+04 8.247e+03 -2.709 0.00684 **
## NeighborhoodSawyer -1.039e+04 8.508e+03 -1.222 0.22197
## NeighborhoodSawyerW -3.845e+03 8.407e+03 -0.457 0.64749
## NeighborhoodSomerst 1.497e+04 7.983e+03 1.875 0.06099 .
## NeighborhoodStoneBr 4.941e+04 9.480e+03 5.213 2.14e-07 ***
## NeighborhoodSWISU -2.698e+04 9.860e+03 -2.737 0.00628 **
## NeighborhoodTimber 9.319e+03 8.980e+03 1.038 0.29954
## NeighborhoodVeenker 1.746e+04 1.171e+04 1.491 0.13629
## OpenPorchSF 5.239e+00 1.310e+01 0.400 0.68924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29580 on 1416 degrees of freedom
## Multiple R-squared: 0.8654, Adjusted R-squared: 0.8613
## F-statistic: 211.7 on 43 and 1416 DF, p-value: < 2.2e-16
This boosted the R-squared value quite a lot. However, there are now many variables in the output, and I’m worried the model may overfit the testing data. To help cut down on this, I noticed that a few neighborhoods yielded very low p-values while most did not, so I recoded the Neighborhood variable into a smaller set of levels accordingly:
kaggle_df$Neighborhood <- recode(kaggle_df$Neighborhood,
Blmngtn = 1,
Blueste = 1,
BrDale = 1,
BrkSide = 1,
ClearCr = 1,
CollgCr = 1,
Crawfor = 1,
Edwards = 1,
Gilbert = 1,
IDOTRR = 1,
MeadowV = 1,
Mitchel = 1,
NAmes = 1,
NoRidge = 2,
NPkVill = 1,
NridgHt = 3,
NWAmes = 1,
OldTown = 1,
Sawyer = 1,
SawyerW = 1,
Somerst = 1,
StoneBr = 4,
SWISU = 1,
Timber = 1,
Veenker = 1)
table(kaggle_df$Neighborhood)
##
## 1 2 3 4
## 1317 41 77 25
train_df <- kaggle_df %>%
dplyr::select(`1stFlrSF`, GrLivArea, GarageArea, TotalBsmtSF, YearBuilt, TotRmsAbvGrd, BedroomAbvGr, BsmtUnfSF, YrSold, LandSlope, HouseStyle, CentralAir, SaleType, SaleCondition, Foundation, SalePrice, OverallQual, Fireplaces, YearRemodAdd, LotArea, RoofMatl, ExterQual, Utilities, KitchenAbvGr, Neighborhood, OpenPorchSF)
train_lm <- lm(data = train_df, SalePrice ~ `GrLivArea` + `TotalBsmtSF` + `BsmtUnfSF` + `SaleCondition` + OverallQual + YearRemodAdd + LotArea + RoofMatl + ExterQual + KitchenAbvGr + Neighborhood)
summary(train_lm)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + TotalBsmtSF + BsmtUnfSF +
## SaleCondition + OverallQual + YearRemodAdd + LotArea + RoofMatl +
## ExterQual + KitchenAbvGr + Neighborhood, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -398985 -13176 -516 13580 259408
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.363e+06 1.066e+05 -12.792 < 2e-16 ***
## GrLivArea 5.648e+01 2.111e+00 26.757 < 2e-16 ***
## TotalBsmtSF 4.513e+01 2.589e+00 17.432 < 2e-16 ***
## BsmtUnfSF -2.435e+01 2.084e+00 -11.682 < 2e-16 ***
## SaleCondition 3.730e+03 7.613e+02 4.900 1.07e-06 ***
## OverallQual 1.417e+04 1.050e+03 13.495 < 2e-16 ***
## YearRemodAdd 3.696e+02 5.111e+01 7.233 7.67e-13 ***
## LotArea 6.179e-01 8.794e-02 7.027 3.25e-12 ***
## RoofMatlCompShg 6.372e+05 3.366e+04 18.930 < 2e-16 ***
## RoofMatlMembran 6.500e+05 4.551e+04 14.280 < 2e-16 ***
## RoofMatlMetal 6.659e+05 4.570e+04 14.571 < 2e-16 ***
## RoofMatlRoll 6.381e+05 4.578e+04 13.940 < 2e-16 ***
## RoofMatlTar&Grv 6.342e+05 3.472e+04 18.264 < 2e-16 ***
## RoofMatlWdShake 6.316e+05 3.622e+04 17.438 < 2e-16 ***
## RoofMatlWdShngl 7.068e+05 3.538e+04 19.976 < 2e-16 ***
## ExterQualFa -5.806e+04 1.049e+04 -5.533 3.74e-08 ***
## ExterQualGd -4.648e+04 4.956e+03 -9.380 < 2e-16 ***
## ExterQualTA -5.988e+04 5.669e+03 -10.563 < 2e-16 ***
## KitchenAbvGr -2.279e+04 3.930e+03 -5.800 8.16e-09 ***
## Neighborhood 1.726e+04 1.610e+03 10.720 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30830 on 1440 degrees of freedom
## Multiple R-squared: 0.8513, Adjusted R-squared: 0.8493
## F-statistic: 433.9 on 19 and 1440 DF, p-value: < 2.2e-16
With this model above, we are able to capture about 85% of the variability in SalePrice from these attributes that make up the multiple regression. To be safe, we’ll do a bit of testing on the residuals to ensure that this is a valid regression model:
After running the summary above, we can see a few things:
The median residual value is -516, which, given the spread of the data, is close to zero; this is a good sign.
Additionally, the minimum and maximum residuals are of roughly comparable magnitude, though somewhat heavier on the minimum side, at -398,985 and 259,408 respectively.
Our Multiple R-squared value is 0.8513 (Adjusted R-squared 0.8493), which indicates that these terms account for about 85% of the variability in the Sale Price of the homes in this dataset. This seems to be a pretty good result for our regression model.
Although this seems to be a good sign, we need to check our residuals to see if this qualifies as a suitable regression model.
When plotting residuals (below), we can see that the residuals are pretty uniformly scattered around zero:
plot(fitted(train_lm),resid(train_lm))
And finally, the residuals, when plotted on a q-q plot, appear to be fairly normally distributed, although with some deviation in the tails. We can see from the histogram that the distribution is unimodal and slightly left skewed; however, it is close enough to normal that I feel this qualifies as a valid multiple regression model.
qqnorm(resid(train_lm))
qqline(resid(train_lm))
hist(resid(train_lm))
test_df <- read_csv('https://raw.githubusercontent.com/zachalexander/data605_cuny/master/Final%20Exam/test.csv')
sale_predictions <- predict(train_lm, test_df)
par(mfrow=c(1,2))
hist(sale_predictions, breaks=40, xlim=c(0, 600000), main = 'predicted sale price - test data')
hist(train_df$SalePrice, breaks=50, xlim=c(0, 600000), main = 'sales price - training data')
As we can see when comparing the two distributions, they both seem to be roughly the same shape – which is a good indication that our predicted values are similar to our training values. Now, we can get this model ready to submit to Kaggle.
kaggle_data <- data.frame(id = test_df[,"Id"], SalePrice = sale_predictions)
kaggle_data[kaggle_data < 0] <- 0
kaggle_data <- replace(kaggle_data, is.na(kaggle_data), 0)
write.csv(kaggle_data, file="kaggle.csv", row.names = FALSE)
Overall, I fell towards the middle of the pack on the leaderboard, but hope to work on this more in the future. It would be great to set this up to run SVM and/or Random Forest to see if I can improve my score.
kaggle_info <- c("zdalexander", 0.42696)
names(kaggle_info) <- c("Username", "My Score")
kable(kaggle_info, col.names = "Kaggle") %>%
kable_styling(full_width = F)
|   | Kaggle |
|---|---|
| Username | zdalexander |
| My Score | 0.42696 |
You can find my YouTube video here: https://www.youtube.com/watch?v=pvlqQwg9cag
Thanks!