---
title: "Data 607 Final Project"
author: "Nnaemezue Obi-Eyisi"
date: "May 18, 2017"
output: pdf_document
---
Load libraries
knitr::opts_chunk$set(echo = TRUE)
if("Hmisc" %in% rownames(installed.packages()) == FALSE) {install.packages("Hmisc")}
library(Hmisc)
if("pastecs" %in% rownames(installed.packages()) == FALSE) {install.packages("pastecs")}
library(pastecs)
if("MASS" %in% rownames(installed.packages()) == FALSE) {install.packages("MASS")}
library(MASS)
if("psych" %in% rownames(installed.packages()) == FALSE) {install.packages("psych")}
library(psych)
if("fitdistrplus" %in% rownames(installed.packages()) == FALSE) {install.packages("fitdistrplus")}
library(fitdistrplus)
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the next analysis.
Train_data <-read.csv(file = "https://raw.githubusercontent.com/nobieyi00/CUNY_MSDA_R/master/train.csv",
header = TRUE, sep = ",")
We will pick GrLivArea as our X variable and SalePrice as our Y variable.
plot(Train_data$GrLivArea,Train_data$SalePrice)
#Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter "x" is estimated as the 4th quartile (this is correct) of the X variable, and the small letter "y" is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities.
a.) P(X>x | Y>y)
We will use the small letter "x" for the 4th quartile of variable X. Let's get all quartiles of variable X (Train_data$GrLivArea).
quantile(Train_data$GrLivArea)
## 0% 25% 50% 75% 100%
## 334.00 1129.50 1464.00 1776.75 5642.00
We can see that the 4th quartile (the maximum) is 5642.00, so x = 5642.00.
Let's find the 2nd quartile of Y, where Y is Train_data$SalePrice.
quantile(Train_data$SalePrice)
## 0% 25% 50% 75% 100%
## 34900 129975 163000 214000 755000
The 2nd quartile is the 50% value, which gives y = 163000.
By definition, P(X>x | Y>y) = P(X>x and Y>y) / P(Y>y). A first, naive attempt instead divides the number of observations where X > 5642.00 and Y > 163000 by the number of observations where X > 5642.00:
Num <-Train_data[ which( Train_data$GrLivArea>5642.00 & Train_data$SalePrice > 163000), ]
Denom <- Train_data[ which( Train_data$GrLivArea>5642.00 ), ]
#P(X>x n Y>y)/P(X>x)
nrow(Num)/nrow(Denom)
## [1] NaN
Dividing by an empty count (no observation has GrLivArea above its maximum) gives NaN, so let's approach this a different way.
Recall that, by Bayes' theorem and the law of total probability, P(X>x | Y>y) can be written as
\[\frac{P(Y>y \mid X>x)\, P(X>x)}{P(Y>y \mid X>x)\, P(X>x) + P(Y>y \mid X\le x)\, P(X\le x)}\]
Writing each product of a conditional probability and its marginal as a joint probability gives
\[\frac{P(Y>y \cap X>x)}{P(Y>y \cap X>x) + P(Y>y \cap X\le x)}\]
Let's now find P(Y>163000 and X<=5642).
num <-nrow(Train_data[ which( Train_data$SalePrice > 163000 & Train_data$GrLivArea <= 5642), ])
#P(Y>163000 and X<=5642) =
P1<-num/nrow(Train_data)
P1
## [1] 0.4986301
Plugging the counts into the equation gives
\[\frac{P(Y>y \cap X>x)}{P(Y>y \cap X>x) + P(Y>y \cap X\le x)} = \frac{0}{0 + 0.4986301} = 0\]
Therefore P(X>x | Y>y) = 0: given that a house sold above the median price, the probability that its living area exceeds the maximum observed GrLivArea is zero.
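For reference, the same conditional probability can be computed directly from its definition, P(X>x | Y>y) = P(X>x and Y>y) / P(Y>y); a minimal one-line check using the quartile values found above:
#direct computation of P(X>x | Y>y), using means of logical vectors as proportions
with(Train_data, mean(GrLivArea > 5642 & SalePrice > 163000) / mean(SalePrice > 163000))
This returns 0, consistent with the result above.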
b.) P(X>x, Y>y) = P(X>5642.00, Y>163000)
Num <-nrow(Train_data[ which( Train_data$GrLivArea>5642.00 & Train_data$SalePrice > 163000), ])
Denom <- nrow(Train_data)
#P(X>x, Y>y) =P(X>5642.00, Y>163000)
Num/Denom
## [1] 0
We can see that P(X>x, Y>y) = P(X>5642.00, Y>163000) = 0: no observation has GrLivArea above its own maximum, so the joint event never occurs.
c.) P(X<x | Y>y) = P(X<5642.00 | Y>163000) = P(X<5642 and Y>163000) / P(Y>163000)
Num <-nrow(Train_data[ which( Train_data$GrLivArea<5642.00 & Train_data$SalePrice > 163000), ])
Denom <- nrow(Train_data[ which( Train_data$SalePrice > 163000), ])
#P(X<x | Y>y) = P(X<5642.00 | Y>163000)
Num/Denom
## [1] 1
P(X<5642.00 | Y>163000) = 1: every house that sold above the median price has living area below the maximum observed value, as expected since x is the maximum of X.
Does splitting the training data in this fashion make them independent? In other words, does P(XY)=P(X)P(Y) or does P(X|Y) = P(X)? Check mathematically, and then evaluate by running a Chi Square test for association. You might have to research this. A Chi Square test for independence (association) will require you to bin the data into logical groups. Build a table
Let's check whether P(X and Y) = P(X)P(Y). Because x = 5642 is the maximum of X, both sides of that equation would be trivially 0, so we use a more informative split, GrLivArea > 2500, and first compute P(Y>163000 and X>2500).
Num <-nrow(Train_data[ which( Train_data$GrLivArea>2500 & Train_data$SalePrice > 163000), ])
Denom <- nrow(Train_data)
#P(X>x, Y>y) =P(X>2500, Y>163000)
Num/Denom
## [1] 0.04520548
So P(Y>163000 and X>2500) = 0.04520548.
Now let's check P(X>2500) * P(Y>163000).
#P(X>2500)
Num <-nrow(Train_data[ which( Train_data$GrLivArea>2500 ), ])
Denom <- nrow(Train_data)
P1 <-Num/Denom
#P(Y>163000)
Num <-nrow(Train_data[ which( Train_data$SalePrice > 163000 ), ])
Denom <- nrow(Train_data)
P2 <-Num/Denom
P1*P2
## [1] 0.02390692
P(X>2500) * P(Y>163000) = 0.02390692.
Since 0.04520548 is not equal to 0.02390692, P(X and Y) is not equal to P(X)P(Y). This shows that X and Y are NOT independent.
CHI SQUARE TEST: the null hypothesis of independence is rejected if the p-value of the following Chi-squared test statistic is less than a given significance level α.
Load library MASS
tbl = table(Train_data$GrLivArea, Train_data$SalePrice)
chisq.test(tbl)
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 589730, df = 569320, p-value < 2.2e-16
Test the hypothesis that GrLivArea is independent of SalePrice at the .05 significance level.
Since the p-value (< 2.2e-16) is below the .05 significance level, we reject the null hypothesis that GrLivArea is independent of SalePrice; there is a dependency between the two variables.
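The warning above arises because the test was run on the raw continuous values, which produces a huge, sparse table. A sketch of the binned version the assignment asks for, using quartile groups (the labels Q1-Q4 and the names GrLivArea_bin, SalePrice_bin, binned_tbl are our own choices):
#bin both variables into quartile groups and rerun the chi-squared test
GrLivArea_bin <- cut(Train_data$GrLivArea, breaks = quantile(Train_data$GrLivArea),
                     include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))
SalePrice_bin <- cut(Train_data$SalePrice, breaks = quantile(Train_data$SalePrice),
                     include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))
binned_tbl <- table(GrLivArea_bin, SalePrice_bin)
binned_tbl
chisq.test(binned_tbl)
A small p-value from the binned table again indicates that GrLivArea and SalePrice are associated.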
Joint probability table (x = 5642, y = 163000):

                P(X>x)    P(X<=x)
    P(Y>y)      0         0.4986301
    P(Y<=y)     0         0.5013699
Num <-nrow(Train_data[ which( Train_data$GrLivArea>5642.00 & Train_data$SalePrice > 163000), ])
Denom <- nrow(Train_data)
Num/Denom
## [1] 0
Num <-nrow(Train_data[ which( Train_data$GrLivArea>5642.00 & Train_data$SalePrice <= 163000), ])
Denom <- nrow(Train_data)
Num/Denom
## [1] 0
Num <-nrow(Train_data[ which( Train_data$GrLivArea<=5642.00 & Train_data$SalePrice> 163000), ])
Denom <- nrow(Train_data)
Num/Denom
## [1] 0.4986301
Num <-nrow(Train_data[ which( Train_data$GrLivArea<=5642.00 & Train_data$SalePrice<= 163000), ])
Denom <- nrow(Train_data)
Num/Denom
## [1] 0.5013699
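The four joint probabilities above can also be produced in one step; a small sketch using prop.table (the dimension names are our own labels):
#joint proportions of (GrLivArea > 5642) by (SalePrice > 163000)
prop.table(table(GrLivArea_gt_x = Train_data$GrLivArea > 5642,
                 SalePrice_gt_y = Train_data$SalePrice > 163000))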
Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of X and Y. Transform both variables simultaneously using Box-Cox transformations. You might have to research this. Using the transformed variables, run a correlation analysis and interpret. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
Let’s get the descriptive statistics for GrLivArea
describe(Train_data$GrLivArea)
## vars n mean sd median trimmed mad min max range skew
## X1 1 1460 1515.46 525.48 1464 1467.67 483.33 334 5642 5308 1.36
## kurtosis se
## X1 4.86 13.75
Let’s get the descriptive statistics for SalePrice
describe(Train_data$SalePrice)
## vars n mean sd median trimmed mad min max range
## X1 1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
## skew kurtosis se
## X1 1.88 6.5 2079.11
Let’s now dive into relevant plots for the variables
Create Density plots for GrLivArea
plot(density(Train_data$GrLivArea), main="GrLivArea Probabilities", ylab="Probability", xlab="GrLivArea")
polygon(density(Train_data$GrLivArea), col="red")
Create a histogram for the variable
hist(Train_data$GrLivArea)
Create qq plot To see whether data can be assumed normally distributed, it is often useful to create a qq-plot. In a qq-plot, we plot the kth smallest observation against the expected value of the kth smallest observation out of n in a standard normal distribution.
qqnorm(Train_data$GrLivArea)
Box plot
boxplot(Train_data$GrLivArea)
Plotting the SalePrice variable
Create Density plots for SalePrice
plot(density(Train_data$SalePrice), main="SalePrice Probabilities", ylab="Probability", xlab="SalePrice")
polygon(density(Train_data$SalePrice), col="red")
Create a histogram for the variable
hist(Train_data$SalePrice)
Create qq plot To see whether data can be assumed normally distributed, it is often useful to create a qq-plot. In a qq-plot, we plot the kth smallest observation against the expected value of the kth smallest observation out of n in a standard normal distribution.
qqnorm(Train_data$SalePrice)
Box plot
boxplot(Train_data$SalePrice)
Scatter plot of X and Y (GrLivArea, SalePrice)
plot(Train_data$GrLivArea, Train_data$SalePrice, col="red", pch =19)
Transform both variables simultaneously using Box-Cox transformations
# run a linear model
GrLivArea<-Train_data$GrLivArea
SalePrice<-Train_data$SalePrice
Area_Price_lreg <- lm(SalePrice ~ GrLivArea, data = Train_data )
summary(Area_Price_lreg)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = Train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462999 -29800 -1124 21957 339832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18569.026 4480.755 4.144 3.61e-05 ***
## GrLivArea 107.130 2.794 38.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared: 0.5021, Adjusted R-squared: 0.5018
## F-statistic: 1471 on 1 and 1458 DF, p-value: < 2.2e-16
cor(GrLivArea,SalePrice)
## [1] 0.7086245
We can see there is a significant relationship between the two variables: the p-value (< 2.2e-16) is below 0.05 and the correlation is about 0.71.
bc<-boxcox(SalePrice ~ GrLivArea)
bc
## $x
## [1] -2.00000000 -1.95959596 -1.91919192 -1.87878788 -1.83838384
## [6] -1.79797980 -1.75757576 -1.71717172 -1.67676768 -1.63636364
## [11] -1.59595960 -1.55555556 -1.51515152 -1.47474747 -1.43434343
## [16] -1.39393939 -1.35353535 -1.31313131 -1.27272727 -1.23232323
## [21] -1.19191919 -1.15151515 -1.11111111 -1.07070707 -1.03030303
## [26] -0.98989899 -0.94949495 -0.90909091 -0.86868687 -0.82828283
## [31] -0.78787879 -0.74747475 -0.70707071 -0.66666667 -0.62626263
## [36] -0.58585859 -0.54545455 -0.50505051 -0.46464646 -0.42424242
## [41] -0.38383838 -0.34343434 -0.30303030 -0.26262626 -0.22222222
## [46] -0.18181818 -0.14141414 -0.10101010 -0.06060606 -0.02020202
## [51] 0.02020202 0.06060606 0.10101010 0.14141414 0.18181818
## [56] 0.22222222 0.26262626 0.30303030 0.34343434 0.38383838
## [61] 0.42424242 0.46464646 0.50505051 0.54545455 0.58585859
## [66] 0.62626263 0.66666667 0.70707071 0.74747475 0.78787879
## [71] 0.82828283 0.86868687 0.90909091 0.94949495 0.98989899
## [76] 1.03030303 1.07070707 1.11111111 1.15151515 1.19191919
## [81] 1.23232323 1.27272727 1.31313131 1.35353535 1.39393939
## [86] 1.43434343 1.47474747 1.51515152 1.55555556 1.59595960
## [91] 1.63636364 1.67676768 1.71717172 1.75757576 1.79797980
## [96] 1.83838384 1.87878788 1.91919192 1.95959596 2.00000000
##
## $y
## [1] -4816.056 -4766.444 -4717.639 -4669.655 -4622.505 -4576.201 -4530.758
## [8] -4486.186 -4442.497 -4399.704 -4357.815 -4316.843 -4276.796 -4237.684
## [15] -4199.515 -4162.296 -4126.037 -4090.742 -4056.418 -4023.071 -3990.707
## [22] -3959.328 -3928.940 -3899.545 -3871.147 -3843.747 -3817.349 -3791.952
## [29] -3767.559 -3744.170 -3721.786 -3700.407 -3680.032 -3660.661 -3642.294
## [36] -3624.930 -3608.568 -3593.208 -3578.847 -3565.485 -3553.121 -3541.754
## [43] -3531.383 -3522.006 -3513.622 -3506.231 -3499.832 -3494.423 -3490.004
## [50] -3486.575 -3484.135 -3482.683 -3482.219 -3482.742 -3484.253 -3486.752
## [57] -3490.238 -3494.711 -3500.171 -3506.618 -3514.053 -3522.474 -3531.883
## [64] -3542.278 -3553.660 -3566.028 -3579.382 -3593.721 -3609.044 -3625.351
## [71] -3642.640 -3660.909 -3680.158 -3700.383 -3721.584 -3743.756 -3766.898
## [78] -3791.006 -3816.077 -3842.105 -3869.088 -3897.021 -3925.897 -3955.712
## [85] -3986.460 -4018.134 -4050.728 -4084.234 -4118.645 -4153.953 -4190.149
## [92] -4227.225 -4265.172 -4303.980 -4343.639 -4384.140 -4425.471 -4467.623
## [99] -4510.583 -4554.342
Box-Cox transformations are designed to make the errors of a linear model closer to normal; this often improves the linearity of the relationship as well.
From the profile log-likelihood above, a sensible estimate for lambda is about 0.1 (the grid maximum is at 0.101), and the 95% confidence interval marked on the boxcox plot is narrow (roughly 0 to 0.2), not the full search range of -2 to 2.
Residuals are the differences between the observed values of the dependent variable (y) and the predicted values (ŷ).
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis.
Let's plot the residuals of the simple linear regression model fitted on Train_data against the independent variable GrLivArea, to judge whether a linear model is appropriate for these variables.
Area_Price_lreg.res = resid(Area_Price_lreg)
plot(GrLivArea, Area_Price_lreg.res,
ylab="Residuals", xlab="GrLivArea",
main="Residual plot")
abline(0, 0)
The residual plot looks fairly random, so a linear model is a reasonable fit for analyzing the relationship between GrLivArea and SalePrice.
Let's now transform the variables using the Box-Cox transformation.
#Get the lambda value at which the profile log-likelihood (bc$y) is maximized
trans <- bc$x[which.max(bc$y)]
#output lambda
trans
## [1] 0.1010101
#check out transformed scatter plot
plot(GrLivArea,SalePrice^trans)
# re-run with transformation
Area_Price_lreg_tran <- lm(SalePrice^trans ~ GrLivArea)
summary(Area_Price_lreg_tran)
##
## Call:
## lm(formula = SalePrice^trans ~ GrLivArea)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77261 -0.04995 0.00852 0.05367 0.31878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.094e+00 7.734e-03 400.06 <2e-16 ***
## GrLivArea 1.832e-04 4.822e-06 37.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09678 on 1458 degrees of freedom
## Multiple R-squared: 0.4975, Adjusted R-squared: 0.4971
## F-statistic: 1443 on 1 and 1458 DF, p-value: < 2.2e-16
trans (lambda) is 0.101. After the transformation the residuals and the standard error of the GrLivArea coefficient (4.822e-06) are much smaller, but note that these are on the transformed scale SalePrice^0.101, so they are not directly comparable to the untransformed model.
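Note that MASS::boxcox only estimates a lambda for the response (SalePrice). To transform the predictor as well, a lambda for GrLivArea can be estimated from an intercept-only model; a minimal sketch (bc_x, lambda_x, and GrLivArea_bc are names introduced here):
#estimate a Box-Cox lambda for GrLivArea alone via an intercept-only model
bc_x <- boxcox(GrLivArea ~ 1, plotit = FALSE)
lambda_x <- bc_x$x[which.max(bc_x$y)]
lambda_x
#apply the power transformation to the predictor and inspect the joint shape
GrLivArea_bc <- GrLivArea^lambda_x
plot(GrLivArea_bc, SalePrice^trans)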
Now let's compare the Q-Q plots of the residuals side by side, before and after the Box-Cox transformation. The Q-Q plot on the right (after the transformation) follows the reference line more closely.
# QQ-plot
op <- par(pty = "s", mfrow = c(1, 2))
qqnorm(Area_Price_lreg$residuals); qqline(Area_Price_lreg$residuals)
qqnorm(Area_Price_lreg_tran$residuals); qqline(Area_Price_lreg_tran$residuals)
par(op)
#cbind(Train_data$GrLivArea,Train_data$SalePrice)
#cor(cbind(Train_data$GrLivArea,Train_data$SalePrice), use="complete.obs", method="kendall")
Using the transformed variables, run a correlation analysis and interpret
Let’s look at the correlation coefficient of the transformed variables
#find correlation coefficient
cor(GrLivArea, SalePrice^trans)
## [1] 0.7053098
#find Pearson correlation matrix
rcorr(GrLivArea,SalePrice^trans, type="pearson")
## x y
## x 1.00 0.71
## y 0.71 1.00
##
## n= 1460
##
##
## P
## x y
## x 0
## y 0
This correlation of 0.7053098 indicates a strong positive (uphill) linear relationship.
Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis
The null hypothesis is that the correlation between these variables is 0.
Using the cor.test function we can compute the p-value and the 99 percent confidence interval.
#correlation analysis of model after boxcox transformation
cor.test(GrLivArea, SalePrice^trans, method = c("pearson"), conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: GrLivArea and SalePrice^trans
## t = 37.99, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.6697594 0.7376344
## sample estimates:
## cor
## 0.7053098
#correlation analysis of model before boxcox transformation
cor.test(GrLivArea, SalePrice, method = c("pearson"), conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: GrLivArea and SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.6733974 0.7406408
## sample estimates:
## cor
## 0.7086245
#confidence of regression model before boxcox
confint(Area_Price_lreg, level=0.99)
## 0.5 % 99.5 %
## (Intercept) 7012.23872 30125.8130
## GrLivArea 99.92504 114.3357
#confidence of regression model after boxcox
confint(Area_Price_lreg_tran, level=0.99)
## 0.5 % 99.5 %
## (Intercept) 3.073977776 3.1138712394
## GrLivArea 0.000170743 0.0001956154
We get a p-value of < 2.2e-16, which is far below any conventional significance level, so we reject the null hypothesis that the correlation between these variables is 0; the true correlation is not zero.
The 99 percent confidence interval for the model after transformation is (0.6697594, 0.7376344), a half-width of about 0.0339; this is essentially the same as, in fact marginally wider than, the half-width of about 0.0336 before the Box-Cox transformation.
In conclusion, the Box-Cox transformation makes the relationship somewhat more linear and stabilizes the variance, but it is not needed to establish the significance of the relationship between the two variables.
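As a quick check of that comparison, the interval half-widths can be computed directly from the cor.test results (ct_before and ct_after are names introduced here):
#99% confidence interval half-widths before and after the transformation
ct_before <- cor.test(GrLivArea, SalePrice, conf.level = 0.99)
ct_after <- cor.test(GrLivArea, SalePrice^trans, conf.level = 0.99)
diff(ct_before$conf.int) / 2
diff(ct_after$conf.int) / 2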
Invert your correlation matrix from the previous section. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
Derive correlation matrix
#find correlation matrix from previous section
cormatrix <- cor(cbind(GrLivArea,SalePrice^trans))
cormatrix
##           GrLivArea
## GrLivArea 1.0000000 0.7053098
##           0.7053098 1.0000000
#invert correlation matrix to get a precision matrix
prec_matrix <- solve(cormatrix)
prec_matrix
##           GrLivArea
## GrLivArea  1.989899 -1.403495
##           -1.403495  1.989899
The variance inflation factors on the diagonal are 1.989899
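These diagonal entries match the two-variable formula VIF = 1 / (1 - r^2); a one-line check:
#variance inflation factor implied by the pairwise correlation
1 / (1 - cor(GrLivArea, SalePrice^trans)^2)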
Multiply the correlation matrix by the precision matrix
cormatrix %*% prec_matrix
## GrLivArea
## GrLivArea 1 0
## 0 1
This gives us an identity matrix
Multiply the precision matrix by the correlation matrix
prec_matrix %*% cormatrix
## GrLivArea
## GrLivArea 1 0
## 0 1
This gives the identity matrix as well. Obtaining the identity on both sides confirms that the precision matrix really is the inverse of the correlation matrix (multiplying a matrix by its inverse on either side yields the identity).
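Numerically the products equal the identity only up to floating-point rounding; a quick check:
#both products should be the 2 x 2 identity matrix up to rounding error
round(prec_matrix %*% cormatrix, 10)
all.equal(prec_matrix %*% cormatrix, diag(2), check.attributes = FALSE)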
Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable ( X ), location shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, λ) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.
For your non-transformed independent variable ( X ), location shift it so that the minimum value is above zero
Let’s check the min of our X variable
min(GrLivArea)
## [1] 334
We can see it is greater than 0 so no need for a location shift
Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
Let’s determine which distribution makes sense for our variable with a histogram plot
hist(GrLivArea)
We notice that the distribution is skewed to the right, with the outliers mostly being large positive values, so we will use a log-normal distribution.
Let's check the density plot and empirical cumulative distribution.
plot(ecdf(GrLivArea), main="Empirical Cumulative Distribution")
z.norm <- (GrLivArea-mean(GrLivArea))/sd(GrLivArea) #standardized data
qqnorm(z.norm)
abline(0,1)
The normal Q-Q plot departs from the reference line, suggesting the data come from a different distribution (e.g., a lognormal).
#Plot with the Log-normal distribution to ensure we made right choice
x.lnorm <- rlnorm(1000,mean(log(GrLivArea)),sd(log(GrLivArea)))
hist(x.lnorm)
summary(x.lnorm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 475.5 1153.0 1423.0 1495.0 1759.0 4701.0
After choosing a log-normal distribution as the model for our X variable, we have to estimate the parameters of that model.
The method of maximum likelihood is used to estimate the parameters: the likelihood of a data set is the probability of obtaining that particular data set under the chosen probability model.
The fitdistr function returns the maximum-likelihood estimates of the log-normal parameters for the GrLivArea variable.
fitdistr(GrLivArea, densfun="log-normal")
## meanlog sdlog
## 7.267774383 0.333436175
## (0.008726424) (0.006170513)
We can now take 1000 samples from this distribution and plot the histogram using the optimal values from the fitdistr function.
GrLivArea_fit <- rlnorm(1000, meanlog=7.267774383,sdlog=0.333436175)
hist(GrLivArea_fit)
Let’s compare with histogram of original variable
hist(GrLivArea)
We see that the two histograms are very similar in shape.
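Since the fitdistrplus package was loaded at the top of the document but not used yet, the same fit can be examined with its diagnostic plots; a minimal sketch (fit_ln is a name introduced here):
#maximum-likelihood log-normal fit with density, CDF, Q-Q and P-P comparison plots
fit_ln <- fitdist(GrLivArea, "lnorm")
summary(fit_ln)
plot(fit_ln)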
Build some type of regression model and submit your model to the competition board. You can use as many variables as you like. Provide your complete model summary and results with analysis.
We will model SalePrice against these four variables: GarageArea, LotFrontage, LotArea, and GrLivArea.
Train_fit <- lm(SalePrice ~ GarageArea + LotFrontage + LotArea + GrLivArea, Train_data)
Train_fit_lm <-summary(Train_fit)
Train_fit_lm
##
## Call:
## lm(formula = SalePrice ~ GarageArea + LotFrontage + LotArea +
## GrLivArea, data = Train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -515157 -23329 -695 19986 312334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.151e+04 5.421e+03 -2.124 0.033877 *
## GarageArea 1.436e+02 7.833e+00 18.334 < 2e-16 ***
## LotFrontage -5.265e+01 7.286e+01 -0.723 0.470054
## LotArea 7.909e-01 2.120e-01 3.730 0.000201 ***
## GrLivArea 7.958e+01 3.390e+00 23.473 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51870 on 1196 degrees of freedom
## (259 observations deleted due to missingness)
## Multiple R-squared: 0.6144, Adjusted R-squared: 0.6132
## F-statistic: 476.5 on 4 and 1196 DF, p-value: < 2.2e-16
Let's write out the fitted equation:
\[\widehat{SalePrice} = -11510 + 143.6 \cdot GarageArea - 52.65 \cdot LotFrontage + 0.7909 \cdot LotArea + 79.58 \cdot GrLivArea\]
Let's find which of the 4 independent variables have a significant impact on SalePrice, assuming a significance level of 0.05.
p_values <-Train_fit_lm$coefficients[2:5,"Pr(>|t|)"]
p_values[which(p_values < .05)]
## GarageArea LotArea GrLivArea
## 2.331531e-66 2.007370e-04 1.620604e-100
We can conclude that GarageArea, LotArea, and GrLivArea have a significant impact on SalePrice, while LotFrontage does not (p = 0.47).
What are the standard errors on each of the coefficients?
Train_fit_lm$coefficients[2:5,"Std. Error"]
## GarageArea LotFrontage LotArea GrLivArea
## 7.8331639 72.8564129 0.2120493 3.3902597
Measure the 99% confidence intervals of the coefficients.
confint(Train_fit, level=0.99)
## 0.5 % 99.5 %
## (Intercept) -2.550083e+04 2471.936467
## GarageArea 1.234028e+02 163.821087
## LotFrontage -2.406132e+02 135.318028
## LotArea 2.437918e-01 1.337943
## GrLivArea 7.083186e+01 88.325238
Load test data
Test_data <-read.csv(file = "https://raw.githubusercontent.com/nobieyi00/CUNY_MSDA_R/master/test.csv",
header = TRUE, sep = ",")
#Predict SalePrice with the variables used to build the regression model
SalePrice_predicted <- -1.151e+04 + 1.436e+02*Test_data$GarageArea - 5.265e+01*Test_data$LotFrontage +
  7.909e-01*Test_data$LotArea + 7.958e+01*Test_data$GrLivArea
#Replace NA predictions (rows with missing predictors) with 0
SalePrice_predicted[is.na(SalePrice_predicted)] <- 0
#write.csv(cbind(Test_data$Id,SalePrice_predicted), file = "C:/Users/Mezue/Documents/data605/test_pred.csv")
#fitted(Train_fit) # predicted values
#residuals(Train_fit) # residuals
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(Train_fit)
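For completeness, the same predictions could be generated with predict() on the fitted model rather than hard-coded coefficients; a minimal sketch (the mean-price fallback for rows with missing predictors is our own choice; the code above used 0):
#predict on the test set directly from the fitted model
SalePrice_pred2 <- predict(Train_fit, newdata = Test_data)
#rows with missing predictors get NA; fall back to the training mean as a simple imputation
SalePrice_pred2[is.na(SalePrice_pred2)] <- mean(Train_data$SalePrice)
submission <- data.frame(Id = Test_data$Id, SalePrice = SalePrice_pred2)
#write.csv(submission, file = "test_pred.csv", row.names = FALSE)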
Based on the model summary, we reject the null hypothesis that there is no significant relationship between SalePrice and GarageArea, LotArea, and GrLivArea.
We conclude that these predictors have a significant relationship with SalePrice.
Kaggle user name: Nnaemezue Obieyisi. Score: 4.78145.