Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of (N+1)/2. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
library(dplyr)
library(ggplot2)
set.seed(321)
# choose a value for N
N <- 13
# Generate uniform random numbers
X <- runif(10000, 1, N)
# generate normal random numbers
Y <- rnorm(10000, (N+1)/2, (N+1)/2)
# store X and Y in a dataframe
df <- data.frame(X, Y) %>% as_tibble()
head(df)
## # A tibble: 6 x 2
## X Y
## <dbl> <dbl>
## 1 12.5 6.34
## 2 12.2 6.28
## 3 3.86 9.26
## 4 4.06 12.0
## 5 5.69 19.2
## 6 5.09 6.21
# find x - median of X
x <- median(X)
paste("x = ", round(x,3))
## [1] "x = 7.022"
# find y - the 1st quartile of Y
y <- quantile(Y, 0.25 )
paste("y = ", round(y,3))
## [1] "y = 2.349"
# visual of X and Y
df %>% ggplot(aes(x=Y)) +
  geom_histogram(binwidth=1, color="black", fill="blue", alpha=.9) +
  geom_histogram(aes(x=X), color="black", fill="grey", alpha=.5) +
  geom_vline(aes(xintercept=x), color="blue", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=y), color="black", linetype="dashed", size=1) +
  geom_text(aes(x=x), y=1000, label="x", size=9) +
  geom_text(aes(x=y), y=1000, label="y", size=9) +
  theme_minimal()
P(X > x | X > y): this conditional probability is computed by first restricting the sample space to the events where X > y, then finding the proportion of those events in which X > x.
# get new sample space
x_gt_y <- df %>% filter(X > y)
# probability is the number of events where X > x divided by the number of events in the new sample space
p_a <- sum(x_gt_y$X > x)/nrow(x_gt_y)
paste("The probability that P(X>x | X>y) is approximately ", round(p_a,3 ))
## [1] "The probability that P(X>x | X>y) is approximately 0.565"
p_b <- sum(X > x & Y > y)/10000
paste("The probability that P(X>x, Y>y) is approximately", round(p_b,3 ))
## [1] "The probability that P(X>x, Y>y) is approximately 0.376"
P(X < x | X > y): computed the same way, restricting the sample space to the events where X > y and finding the proportion of those events in which X < x.
x_gt_y <- df %>% filter(X>y)
p_c <- sum(x_gt_y$X<x)/nrow(x_gt_y)
paste("The probability that P(X<x | X>y) is approximately", round(p_c,3 ))
## [1] "The probability that P(X<x | X>y) is approximately 0.435"
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y)
This is done by first building a table and evaluating the marginal and joint probabilities.
# find joint probabilities of each of the events
Xgtx_Ygty <- sum(X > x & Y >y)/10000
Xltx_Ygty <- sum(X < x & Y >y)/10000
Xgtx_Ylty <- sum(X > x & Y< y)/10000
Xltx_Ylty <- sum(X < x & Y <y)/10000
# create a dataframe to store the probability distribution - creating row and column names
df2 <- data.frame("Y_greater_than_y"=c(Xgtx_Ygty,Xltx_Ygty), "Y_less_than_y"=c(Xgtx_Ylty, Xltx_Ylty))
row.names(df2) <- c("X_greater_than_x","X_less_than_x")
# joint probability distribution
df2
## Y_greater_than_y Y_less_than_y
## X_greater_than_x 0.3765 0.1235
## X_less_than_x 0.3735 0.1265
# calculate row and column sums - the marginals
df2 <- cbind(df2, total=rowSums(df2))
df2 <- rbind(df2, total=colSums(df2))
df2
## Y_greater_than_y Y_less_than_y total
## X_greater_than_x 0.3765 0.1235 0.5
## X_less_than_x 0.3735 0.1265 0.5
## total 0.7500 0.2500 1.0
# Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y)
paste("P(X>x and Y>y)=P(X>x)P(Y>y): ", round(sum(X > x)/10000 * sum(Y>y)/10000,2) == round(df2[1,1],2))
## [1] "P(X>x and Y>y)=P(X>x)P(Y>y): TRUE"
The third column and row are the marginal probability distributions -> they sum to 1.
As far as the marginals go, the total for Y greater than y is 0.75 - the upper three quarters of the draws, as expected since y is the 1st quartile. The remaining marginals likewise make sense given how x and y were defined.
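As a quick check on those marginals (a sketch, assuming X, Y, x, and y are still in the workspace), the row and column totals follow directly from how x and y were defined:
# x is the median of X, so roughly half of the draws exceed it
mean(X > x)
# y is the 1st quartile of Y, so roughly three quarters of the draws exceed it
mean(Y > y)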
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Many applications of probability and statistics are predicated on the independence of variables. Fisher's Exact Test and the Chi-Square Test provide a way of evaluating independence.
Like all other statistical tests, we start with null and alternative hypotheses.
H0 : the variables are independent - there is no relationship between the variables.
H1 : the variables are dependent - knowing the value of one variable helps predict the other.
We will either reject or fail to reject the null hypothesis based on the p-values computed from the tests; I'll choose an alpha of 0.05.
The functions for the tests require counts, not probabilities, and data should be in a table format.
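As an alternative to rescaling the probability table, the same 2x2 table of counts can be tabulated directly from the raw vectors. A sketch, assuming X, Y, x, and y are still in the workspace:
# build the 2x2 contingency table of counts in one step
table(ifelse(X > x, "X_greater_than_x", "X_less_than_x"),
      ifelse(Y > y, "Y_greater_than_y", "Y_less_than_y"))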
# multiply by the number of events to convert the probability distribution table into a table of counts
df2 <- df2*10000
df2 <-df2[1:2,1:2]
# convert to table
table <- as.table(as.matrix(df2))
table
## Y_greater_than_y Y_less_than_y
## X_greater_than_x 3765 1235
## X_less_than_x 3735 1265
# Fishers Exact test
fisher.test(table, conf.int=T)
##
## Fisher's Exact Test for Count Data
##
## data: table
## p-value = 0.503
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9421511 1.1315373
## sample estimates:
## odds ratio
## 1.032527
# Chi Square Test
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 0.44853, df = 1, p-value = 0.503
For both the fisher’s and chi-square test, the p-value is much higher than the significance level chosen, so the null hypothesis, that there is no relationship between the variables and that they are independent, cannot be rejected. This makes sense intuitively when looking at the joint values from the table above. For example, knowing the value of Y_greater_than_y will not help us predict the if X is greater than or less than x.
Generally the chi-square test is used with large sample sizes, and provides an approximate p-value whereas fisher’s exact test provides an exact p-value on small sample sizes. In this case there was no real benefit in choosing one over the other since we selected the large sample size. In datasets with limited events, Fisher’s Test would be more appropriate.
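To illustrate the difference, here is a sketch with a small, made-up 2x2 table of counts (not from our data): the chi-square approximation becomes questionable when expected cell counts are small, while Fisher's exact test still returns an exact p-value.
# hypothetical small-sample table
small_table <- matrix(c(3, 1, 2, 5), nrow = 2)
chisq.test(small_table)   # warns that the approximation may be incorrect
fisher.test(small_table)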
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques .
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset.
First Look at the dataset
require(dplyr)
# require(PerformanceAnalytics)
# source("http://www.sthda.com/upload/rquery_cormat.r")
# load the training data
train <- read.csv('C:/NYBackup/DATA605/regression/train.csv') %>% as_tibble()
test <- read.csv('C:/NYBackup/DATA605/regression/test.csv') %>% as_tibble()
train
## # A tibble: 1,460 x 81
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <int> <int> <fct> <int> <int> <fct> <fct> <fct>
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## 7 7 20 RL 75 10084 Pave <NA> Reg
## 8 8 60 RL NA 10382 Pave <NA> IR1
## 9 9 50 RM 51 6120 Pave <NA> Reg
## 10 10 190 RL 50 7420 Pave <NA> Reg
## # ... with 1,450 more rows, and 73 more variables: LandContour <fct>,
## # Utilities <fct>, LotConfig <fct>, LandSlope <fct>, Neighborhood <fct>,
## # Condition1 <fct>, Condition2 <fct>, BldgType <fct>, HouseStyle <fct>,
## # OverallQual <int>, OverallCond <int>, YearBuilt <int>,
## # YearRemodAdd <int>, RoofStyle <fct>, RoofMatl <fct>,
## # Exterior1st <fct>, Exterior2nd <fct>, MasVnrType <fct>,
## # MasVnrArea <int>, ExterQual <fct>, ExterCond <fct>, Foundation <fct>,
## # BsmtQual <fct>, BsmtCond <fct>, BsmtExposure <fct>,
## # BsmtFinType1 <fct>, BsmtFinSF1 <int>, BsmtFinType2 <fct>,
## # BsmtFinSF2 <int>, BsmtUnfSF <int>, TotalBsmtSF <int>, Heating <fct>,
## # HeatingQC <fct>, CentralAir <fct>, Electrical <fct>, X1stFlrSF <int>,
## # X2ndFlrSF <int>, LowQualFinSF <int>, GrLivArea <int>,
## # BsmtFullBath <int>, BsmtHalfBath <int>, FullBath <int>,
## # HalfBath <int>, BedroomAbvGr <int>, KitchenAbvGr <int>,
## # KitchenQual <fct>, TotRmsAbvGrd <int>, Functional <fct>,
## # Fireplaces <int>, FireplaceQu <fct>, GarageType <fct>,
## # GarageYrBlt <int>, GarageFinish <fct>, GarageCars <int>,
## # GarageArea <int>, GarageQual <fct>, GarageCond <fct>,
## # PavedDrive <fct>, WoodDeckSF <int>, OpenPorchSF <int>,
## # EnclosedPorch <int>, X3SsnPorch <int>, ScreenPorch <int>,
## # PoolArea <int>, PoolQC <fct>, Fence <fct>, MiscFeature <fct>,
## # MiscVal <int>, MoSold <int>, YrSold <int>, SaleType <fct>,
## # SaleCondition <fct>, SalePrice <int>
# Iterative approach since there are 81 variables
# scatterplot matrix on the whole dataset
# output suppressed since there are so many
# pairs(train[,c(81,1:8)])
# pairs(train[,c(81,9:15)])
# pairs(train[,c(81,16:22)])
# pairs(train[,c(81,23:29)])
# pairs(train[,c(81,30:36)])
# pairs(train[,c(81,37:43)])
# pairs(train[,c(81,49:55)])
# pairs(train[,c(81,61:67)])
# pairs(train[,c(81,73:80)])
Select variables that are most correlated with SalePrice based on the scatterplot matrices above. Twelve variables were selected, including the target. Summary statistics, new scatterplot matrices, and a correlation matrix are given.
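Before narrowing the list, one quick way to rank candidate predictors is to correlate every numeric column with SalePrice. This sketch is not part of the original selection and simply supports the choices below:
# correlate each numeric column with SalePrice, tolerating NAs
numeric_cols <- train[, sapply(train, is.numeric)]
sale_cors <- cor(numeric_cols, use = "pairwise.complete.obs")[, "SalePrice"]
head(sort(sale_cors, decreasing = TRUE), 15)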
# initial selection of the training set - 12 variables selected including the target
train_selection1 <- train %>% dplyr::select(SalePrice, GrLivArea, OverallQual, YearBuilt, X1stFlrSF, FullBath, TotRmsAbvGrd, LotArea, OverallCond, BedroomAbvGr, YrSold, LotFrontage)
#summary statistics
summary(train_selection1)
## SalePrice GrLivArea OverallQual YearBuilt
## Min. : 34900 Min. : 334 Min. : 1.000 Min. :1872
## 1st Qu.:129975 1st Qu.:1130 1st Qu.: 5.000 1st Qu.:1954
## Median :163000 Median :1464 Median : 6.000 Median :1973
## Mean :180921 Mean :1515 Mean : 6.099 Mean :1971
## 3rd Qu.:214000 3rd Qu.:1777 3rd Qu.: 7.000 3rd Qu.:2000
## Max. :755000 Max. :5642 Max. :10.000 Max. :2010
##
## X1stFlrSF FullBath TotRmsAbvGrd LotArea
## Min. : 334 Min. :0.000 Min. : 2.000 Min. : 1300
## 1st Qu.: 882 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.: 7554
## Median :1087 Median :2.000 Median : 6.000 Median : 9478
## Mean :1163 Mean :1.565 Mean : 6.518 Mean : 10517
## 3rd Qu.:1391 3rd Qu.:2.000 3rd Qu.: 7.000 3rd Qu.: 11602
## Max. :4692 Max. :3.000 Max. :14.000 Max. :215245
##
## OverallCond BedroomAbvGr YrSold LotFrontage
## Min. :1.000 Min. :0.000 Min. :2006 Min. : 21.00
## 1st Qu.:5.000 1st Qu.:2.000 1st Qu.:2007 1st Qu.: 59.00
## Median :5.000 Median :3.000 Median :2008 Median : 69.00
## Mean :5.575 Mean :2.866 Mean :2008 Mean : 70.05
## 3rd Qu.:6.000 3rd Qu.:3.000 3rd Qu.:2009 3rd Qu.: 80.00
## Max. :9.000 Max. :8.000 Max. :2010 Max. :313.00
## NA's :259
# Scatterplot matrices - separated into two for clarity
pairs(train_selection1[,1:8])
pairs(train_selection1[,c(1,7:12)])
# correlation matrix
cor_matrix1 <- cor(train_selection1)
cor_matrix1
## SalePrice GrLivArea OverallQual YearBuilt X1stFlrSF
## SalePrice 1.00000000 0.70862448 0.79098160 0.52289733 0.60585218
## GrLivArea 0.70862448 1.00000000 0.59300743 0.19900971 0.56602397
## OverallQual 0.79098160 0.59300743 1.00000000 0.57232277 0.47622383
## YearBuilt 0.52289733 0.19900971 0.57232277 1.00000000 0.28198586
## X1stFlrSF 0.60585218 0.56602397 0.47622383 0.28198586 1.00000000
## FullBath 0.56066376 0.63001165 0.55059971 0.46827079 0.38063749
## TotRmsAbvGrd 0.53372316 0.82548937 0.42745234 0.09558913 0.40951598
## LotArea 0.26384335 0.26311617 0.10580574 0.01422765 0.29947458
## OverallCond -0.07785589 -0.07968587 -0.09193234 -0.37598320 -0.14420278
## BedroomAbvGr 0.16821315 0.52126951 0.10167636 -0.07065122 0.12740075
## YrSold -0.02892259 -0.03652582 -0.02734671 -0.01361768 -0.01360377
## LotFrontage NA NA NA NA NA
## FullBath TotRmsAbvGrd LotArea OverallCond BedroomAbvGr
## SalePrice 0.56066376 0.53372316 0.26384335 -0.07785589 0.16821315
## GrLivArea 0.63001165 0.82548937 0.26311617 -0.07968587 0.52126951
## OverallQual 0.55059971 0.42745234 0.10580574 -0.09193234 0.10167636
## YearBuilt 0.46827079 0.09558913 0.01422765 -0.37598320 -0.07065122
## X1stFlrSF 0.38063749 0.40951598 0.29947458 -0.14420278 0.12740075
## FullBath 1.00000000 0.55478425 0.12603063 -0.19414949 0.36325198
## TotRmsAbvGrd 0.55478425 1.00000000 0.19001478 -0.05758317 0.67661994
## LotArea 0.12603063 0.19001478 1.00000000 -0.00563627 0.11968991
## OverallCond -0.19414949 -0.05758317 -0.00563627 1.00000000 0.01298006
## BedroomAbvGr 0.36325198 0.67661994 0.11968991 0.01298006 1.00000000
## YrSold -0.01966884 -0.03451635 -0.01426141 0.04394975 -0.03601389
## LotFrontage NA NA NA NA NA
## YrSold LotFrontage
## SalePrice -0.02892259 NA
## GrLivArea -0.03652582 NA
## OverallQual -0.02734671 NA
## YearBuilt -0.01361768 NA
## X1stFlrSF -0.01360377 NA
## FullBath -0.01966884 NA
## TotRmsAbvGrd -0.03451635 NA
## LotArea -0.01426141 NA
## OverallCond 0.04394975 NA
## BedroomAbvGr -0.03601389 NA
## YrSold 1.00000000 NA
## LotFrontage NA 1
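The NA entries come from the 259 missing LotFrontage values, which propagate through cor(). A sketch of the same matrix computed on pairwise-complete observations instead:
# restrict each pairwise correlation to complete cases so LotFrontage gets a value
cor(train_selection1, use = "pairwise.complete.obs")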
The selection is refined to three variables, which show strong correlations with one another.
# make new selection based on the scatterplot results
train_selection2 <- train_selection1 %>% dplyr::select(SalePrice, GrLivArea, OverallQual)
pairs(train_selection2)
# create correlation matrix with 3 variables - using pearson
cor_matrix2 <- cor(train_selection2)
cor_matrix2
## SalePrice GrLivArea OverallQual
## SalePrice 1.0000000 0.7086245 0.7909816
## GrLivArea 0.7086245 1.0000000 0.5930074
## OverallQual 0.7909816 0.5930074 1.0000000
# 3 pairwise correlations to test
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
We test the hypothesis that the correlation between each pairwise set of variables is 0 and provide an 80% confidence interval. There are 3 pairs to test.
The null hypothesis for each pair is that the correlation between the two variables is 0. We look to reject or fail to reject the null based on the p-value of each test.
cor.test(train_selection2$SalePrice, train_selection2$GrLivArea, conf.level = .80)
##
## Pearson's product-moment correlation
##
## data: train_selection2$SalePrice and train_selection2$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
cor.test(train_selection2$SalePrice, train_selection2$OverallQual, conf.level = .80)
##
## Pearson's product-moment correlation
##
## data: train_selection2$SalePrice and train_selection2$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
cor.test(train_selection2$GrLivArea, train_selection2$OverallQual, conf.level = .80)
##
## Pearson's product-moment correlation
##
## data: train_selection2$GrLivArea and train_selection2$OverallQual
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5708061 0.6143422
## sample estimates:
## cor
## 0.5930074
From the results of the tests, the variables appear to be correlated: each Pearson correlation is reasonably high, and in each case we can reject the null hypothesis that the correlation is 0. 80% confidence intervals are provided in the output of each test. There is not much cause for concern about familywise error - that is, the chance of making at least one type I error across the tests. Only three hypothesis tests are performed, and with p-values below 2.2e-16, even a conservative correction would leave all three results significant. Type I errors should always be kept in mind, but here we can be confident in the results of the correlation tests.
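If an adjustment were desired anyway, a minimal sketch (collecting the three p-values from the tests above into a vector, a step not in the original code) shows that a Bonferroni correction leaves all three tests significant:
# Bonferroni-adjusted p-values for the three pairwise correlation tests
p_values <- c(
  cor.test(train_selection2$SalePrice, train_selection2$GrLivArea)$p.value,
  cor.test(train_selection2$SalePrice, train_selection2$OverallQual)$p.value,
  cor.test(train_selection2$GrLivArea, train_selection2$OverallQual)$p.value
)
p.adjust(p_values, method = "bonferroni")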
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
# invert the correlation matrix to get precision matrix
precision_matrix <- solve(cor_matrix2)
precision_matrix
## SalePrice GrLivArea OverallQual
## SalePrice 3.498623 -1.2927630 -2.0007280
## GrLivArea -1.292763 2.0200794 -0.1753704
## OverallQual -2.000728 -0.1753704 2.6865350
#multiply correlation matrix by precision matrix
mat <- cor_matrix2 %*% precision_matrix
mat
## SalePrice GrLivArea OverallQual
## SalePrice 1.000000e+00 0.000000e+00 0
## GrLivArea -4.440892e-16 1.000000e+00 0
## OverallQual -4.440892e-16 -8.326673e-17 1
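The question also asks for the product in the other order. As a sketch, multiplying the precision matrix by the correlation matrix should likewise give the identity up to floating-point error:
# precision matrix times correlation matrix - also the identity (numerically)
precision_matrix %*% cor_matrix2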
# LU decomposition
LU.factorize <- function(A){
# create a diagonal matrix to store L
L <- diag(nrow(A))
# U is set to A
U <- A
# loop over rows i and columns j (the inner loop is hard-coded for a 3x3 input)
for(i in 2:ncol(A)){
for( j in 1:2){
# This if statement ensures we work on the lower triangle - that is, indices where i is greater than j
if(j<i){
# This if/else control structure is needed to avoid dividing by 0
if(i-j == 1){
multiplier <- -1*(U[i,j]/U[i-1,j])
U[i,] <- multiplier*U[i-1,] + U[i,]
L[i,j] <- multiplier*-1}
else{
multiplier <- -1*(U[i,j]/U[1,j])
U[i,] <- multiplier*U[1,] + U[i,]
L[i,j] <- multiplier*-1
}
}
}
}
print("U")
print(U)
print("L")
print(L)
print("L times U equals A ?")
L%*%U == A
}
LU.factorize(mat)
## [1] "U"
## SalePrice GrLivArea OverallQual
## SalePrice 1 0 0
## GrLivArea 0 1 0
## OverallQual 0 0 1
## [1] "L"
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0.000000e+00 0
## [2,] -4.440892e-16 1.000000e+00 0
## [3,] -4.440892e-16 -8.326673e-17 1
## [1] "L times U equals A ?"
## SalePrice GrLivArea OverallQual
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
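As an optional cross-check (assuming the Matrix package is installed), the same factorization can be performed with Matrix::lu(); applying it to the correlation matrix gives a less trivial example than the near-identity product above:
# LU factorization via the Matrix package; expand() exposes the P, L, U factors
library(Matrix)
lu_factors <- expand(lu(Matrix(cor_matrix2)))
lu_factors$L
lu_factors$U
# LU.factorize(cor_matrix2) can be compared against these factors
# (allowing for any row permutation P that Matrix::lu applies)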
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable.
# find a variable that is skewed to the right
hist(train$X1stFlrSF, breaks=30)
The histogram is skewed to the right, so probabilities computed under a normality assumption would be inaccurate.
For the exponential distribution, the parameter of interest is the rate, most often referred to as lambda (λ).
require(MASS)
x <- fitdistr(train$X1stFlrSF, densfun = 'exponential')
# find the rate, lambda
lambda <- x$estimate
#generate exponential random numbers
random_exp <- rexp(1000,lambda)
# plot the result
hist(random_exp, breaks=100)
This now resembles a classic exponential distribution.
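As a sanity check on the fit (a sketch, assuming lambda from the chunk above), the maximum-likelihood rate for an exponential distribution is simply the reciprocal of the sample mean:
# the MLE of the exponential rate equals 1 / sample mean
1 / mean(train$X1stFlrSF)
lambda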
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Here they are estimated from the empirical CDF of the simulated exponential sample.
cdf <- ecdf(random_exp)
plot(cdf)
exp_quantiles <- quantile(cdf,c(.05,.95))
exp_quantiles
## 5% 95%
## 49.05657 3698.80675
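The percentiles can also be taken directly from the fitted exponential distribution rather than from the simulated sample, using the closed-form quantile function (a sketch assuming lambda from above):
# closed-form quantiles: solve 1 - exp(-lambda * q) = p for q
qexp(c(0.05, 0.95), rate = lambda)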
Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
# 95% confidence interval assuming normality
mean <- mean(train$X1stFlrSF)
sd <- sd(train$X1stFlrSF)
err <- qnorm(.975)*sd/sqrt(nrow(train))
low_bound <- mean -err
up_bound <- mean + err
paste("95% of the data is is between ", round(low_bound,3), " and ", round(up_bound,3))
## [1] "95% of the data is is between 1142.797 and 1182.457"
# empirical 5th and 95 percentile
quantile(train$X1stFlrSF, c(.05, 0.95))
## 5% 95%
## 672.95 1831.25
The three sets of results are quite different. The exponential percentiles do not seem reasonable for this variable: the exponential density is highest at zero and falls off from there, while the observed first-floor square footage is centered well above its minimum, so even the best-fitting rate parameter produces a poor match. The normal-theory interval is also not comparable to the percentiles, since it is a confidence interval for the mean rather than a range covering most of the data; the empirical 5th and 95th percentiles describe the data most faithfully.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Using the variables selected in a previous section, build a model and then remove variables with high p-values using backward elimination. The goal is to avoid overfitting by reducing the number of explanatory variables.
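For comparison (not what was submitted), base R's step() automates a similar backward search, dropping terms by AIC rather than by individual p-values. A sketch, fitting on complete cases so the candidate set stays fixed:
# automated backward selection by AIC on the 12-variable candidate set
full_model <- lm(SalePrice ~ GrLivArea + OverallQual + YearBuilt + X1stFlrSF +
                   FullBath + TotRmsAbvGrd + LotArea + OverallCond +
                   BedroomAbvGr + YrSold + LotFrontage,
                 data = na.omit(train_selection1))
step(full_model, direction = "backward", trace = 0)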
lm1 <- lm(SalePrice~ GrLivArea + OverallQual + YearBuilt + X1stFlrSF + FullBath + TotRmsAbvGrd + LotArea + OverallCond + BedroomAbvGr + YrSold + LotFrontage, data=train)
summary(lm1)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + YearBuilt +
##     X1stFlrSF + FullBath + TotRmsAbvGrd + LotArea + OverallCond +
##     BedroomAbvGr + YrSold + LotFrontage, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513412 -18598 -1289 15964 279374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.017e+05 1.739e+06 -0.403 0.68674
## GrLivArea 5.239e+01 4.914e+00 10.661 < 2e-16 ***
## OverallQual 2.208e+04 1.348e+03 16.384 < 2e-16 ***
## YearBuilt 5.926e+02 5.459e+01 10.855 < 2e-16 ***
## X1stFlrSF 2.687e+01 4.110e+00 6.537 9.31e-11 ***
## FullBath -1.289e+03 3.070e+03 -0.420 0.67471
## TotRmsAbvGrd 4.330e+03 1.448e+03 2.991 0.00283 **
## LotArea 8.430e-01 1.641e-01 5.137 3.26e-07 ***
## OverallCond 6.957e+03 1.176e+03 5.915 4.34e-09 ***
## BedroomAbvGr -1.214e+04 2.031e+03 -5.977 3.00e-09 ***
## YrSold -2.851e+02 8.645e+02 -0.330 0.74161
## LotFrontage 4.343e+01 5.780e+01 0.751 0.45256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39740 on 1189 degrees of freedom
## (259 observations deleted due to missingness)
## Multiple R-squared: 0.7749, Adjusted R-squared: 0.7728
## F-statistic: 372.2 on 11 and 1189 DF, p-value: < 2.2e-16
lm2 <- lm(SalePrice~ GrLivArea + OverallQual + YearBuilt + X1stFlrSF + TotRmsAbvGrd + LotArea + OverallCond + BedroomAbvGr, data=train)
summary(lm2)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + YearBuilt +
##     X1stFlrSF + TotRmsAbvGrd + LotArea + OverallCond + BedroomAbvGr,
##     data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -496265 -17810 -1641 15359 283630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.243e+06 8.796e+04 -14.133 < 2e-16 ***
## GrLivArea 5.205e+01 4.072e+00 12.783 < 2e-16 ***
## OverallQual 2.124e+04 1.145e+03 18.559 < 2e-16 ***
## YearBuilt 5.805e+02 4.486e+01 12.940 < 2e-16 ***
## X1stFlrSF 2.915e+01 3.316e+00 8.790 < 2e-16 ***
## TotRmsAbvGrd 3.932e+03 1.253e+03 3.138 0.00174 **
## LotArea 6.989e-01 1.050e-01 6.656 3.97e-11 ***
## OverallCond 6.687e+03 9.803e+02 6.821 1.32e-11 ***
## BedroomAbvGr -1.144e+04 1.735e+03 -6.594 5.97e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37630 on 1451 degrees of freedom
## Multiple R-squared: 0.7769, Adjusted R-squared: 0.7757
## F-statistic: 631.6 on 8 and 1451 DF, p-value: < 2.2e-16
lm3 <- lm(SalePrice~ GrLivArea + OverallQual + YearBuilt , data=train)
summary(lm3)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + OverallQual + YearBuilt,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -393773 -22639 -2424 18437 290554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.053e+06 8.376e+04 -12.57 <2e-16 ***
## GrLivArea 6.209e+01 2.581e+00 24.06 <2e-16 ***
## OverallQual 2.520e+04 1.172e+03 21.50 <2e-16 ***
## YearBuilt 5.001e+02 4.409e+01 11.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40750 on 1456 degrees of freedom
## Multiple R-squared: 0.7374, Adjusted R-squared: 0.7368
## F-statistic: 1363 on 3 and 1456 DF, p-value: < 2.2e-16
An analysis is performed to provide diagnostics on the residuals.
# visual residual analysis
train %>%
ggplot(aes(x=fitted(lm3),y=resid(lm3))) +
geom_point()
qqnorm(resid(lm3))
qqline(resid(lm3))
The residual plots give a less favorable view of the model. Residual variance is not particularly constant, especially on the high end. The Q-Q plot also shows residuals deviating from a normal distribution at the high and low ends.
username: RobetWelk Score: 0.61514
test <- read.csv(file="C:/NYBackup/DATA605/regression/test.csv") %>%
dplyr::select(Id, GrLivArea, OverallQual, YearBuilt)
predictions <- predict(lm3,test)
# Kaggle expects a two-column submission (Id, SalePrice), so row names are dropped
prediction_df <- data.frame(Id=test$Id, SalePrice=predictions)
prediction_df %>%
write.csv("C:/NYBackup/DATA605/regression/predictions_rjw.csv", row.names=FALSE)