title: "Data 607 Final Project"
author: "Nnaemezue Obi-Eyisi"
date: "May 18, 2017"
output:
  pdf_document

Load libraries

knitr::opts_chunk$set(echo = TRUE)

if("Hmisc" %in% rownames(installed.packages()) == FALSE) {install.packages("Hmisc")}
library(Hmisc)
if("pastecs" %in% rownames(installed.packages()) == FALSE) {install.packages("pastecs")}
library(pastecs)
if("MASS" %in% rownames(installed.packages()) == FALSE) {install.packages("MASS")}
library(MASS)
if("psych" %in% rownames(installed.packages()) == FALSE) {install.packages("psych")}
library(psych)
if("fitdistrplus" %in% rownames(installed.packages()) == FALSE) {install.packages("fitdistrplus")}
library(fitdistrplus)

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the next analysis.

Train_data <-read.csv(file = "https://raw.githubusercontent.com/nobieyi00/CUNY_MSDA_R/master/train.csv", 
                      header = TRUE, sep = ",")

We will pick GrLivArea as our X variable and SalePrice as our Y variable.

plot(Train_data$GrLivArea,Train_data$SalePrice)

Probability.

Calculate as a minimum the below probabilities a through c. Assume the small letter "x" is estimated as the 4th quartile (this is correct) of the X variable, and the small letter "y" is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities.

a.) P(X>x | Y>y)

Small letter "x" denotes the 4th quartile of variable X. Let's get all quartiles of variable X (Train_data$GrLivArea).

quantile(Train_data$GrLivArea) 
##      0%     25%     50%     75%    100% 
##  334.00 1129.50 1464.00 1776.75 5642.00

We can see that the 4th quartile (the maximum) is 5642.00, so small x = 5642.00.

Let's find the 2nd quartile of Y, where Y is Train_data$SalePrice.

quantile(Train_data$SalePrice)
##     0%    25%    50%    75%   100% 
##  34900 129975 163000 214000 755000

The 2nd quartile is the 50th percentile, which gives small y = 163000.
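For readability, the two cutoffs can also be stored in variables (a small sketch; the names x and y are illustrative, and the chunks below keep the literal values):

x <- as.numeric(quantile(Train_data$GrLivArea, 1.00))  #4th quartile (maximum) of GrLivArea: 5642
y <- as.numeric(quantile(Train_data$SalePrice, 0.50))  #2nd quartile (median) of SalePrice: 163000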

Now let's answer P(X>x | Y>y) = P(X>5642.00 | Y>163000).

By definition,

\[P(X>x \mid Y>y)=\frac{P(X>x \cap Y>y)}{P(Y>y)}\]

#P(X>x and Y>y): rows where X > 5642.00 and Y > 163000
Num <-Train_data[ which( Train_data$GrLivArea>5642.00 & Train_data$SalePrice > 163000), ]

#P(Y>y): rows where Y > 163000
Denom <- Train_data[ which( Train_data$SalePrice > 163000 ), ]

#P(X>x and Y>y)/P(Y>y)
nrow(Num)/nrow(Denom)
## [1] 0

The numerator is 0 because x = 5642.00 is the maximum of GrLivArea, so no observation satisfies X > x. As a check, let's derive the same answer a different way.

Recall that P(X>x|Y>y) can be written as

\[\frac{P(Y>y \mid X>x)\,P(X>x)}{P(Y>y \mid X>x)\,P(X>x) + P(Y>y \mid X\le x)\,P(X\le x)}\]

Since P(Y>y | X>x) P(X>x) = P(Y>y and X>x), and similarly for the second term in the denominator, this simplifies to

\[\frac{P(Y>y \cap X>x)}{P(Y>y \cap X>x)+ P(Y>y \cap X\le x)}\]

Let's now find P(Y>163000 and X<=5642).

num <-nrow(Train_data[ which( Train_data$SalePrice > 163000 & Train_data$GrLivArea <= 5642), ])

#P(Y>163000 and X<=5642) =
P1<-num/nrow(Train_data)


P1
## [1] 0.4986301

Plugging into the equation, \[\frac{P(Y>y \cap X>x)}{P(Y>y \cap X>x)+ P(Y>y \cap X\le x)} = \frac{0}{0+0.4986301} = 0.\] Therefore P(X>x | Y>y) = 0: given that a house sold for more than the median price, the probability that its living area exceeds 5642 (the sample maximum) is zero.

b.) P(X>x, Y>y)

P(X>5642.00, Y>163000)

Num <-nrow(Train_data[ which( Train_data$GrLivArea>5642.00 & Train_data$SalePrice > 163000), ])
Denom  <- nrow(Train_data)
#P(X>x, Y>y) =P(X>5642.00, Y>163000)
Num/Denom
## [1] 0

P(X>x, Y>y) = P(X>5642.00, Y>163000) = 0: no house has living area above 5642 (the sample maximum), so this joint probability is 0.

c.) P(X<x | Y>y)

P(X<5642.00| Y>163000) = P(X<5642 and Y>163000)/P(Y>163000)

Num <-nrow(Train_data[ which( Train_data$GrLivArea<5642.00 & Train_data$SalePrice > 163000), ])
Denom <- nrow(Train_data[ which( Train_data$SalePrice > 163000), ])

#P(X<x | Y>y) = P(X<5642.00 | Y>163000)
Num/Denom
## [1] 1

P(X<5642.00 | Y>163000) = 1: given that a house sold for more than 163000 (the median price), it is certain that its living area is below 5642.

Does splitting the training data in this fashion make them independent? In other words, does P(XY)=P(X)P(Y) or does P(X|Y) = P(X)? Check mathematically, and then evaluate by running a Chi Square test for association. You might have to research this. A Chi Square test for independence (association) will require you to bin the data into logical groups. Build a table

Let's check whether P(X and Y) = P(X)P(Y). Since x = 5642 is the sample maximum of GrLivArea (both sides of the equation would be trivially 0), we use a non-degenerate cutoff of X > 2500 for this check.

First, P(Y>163000 and X>2500):

Num <-nrow(Train_data[ which( Train_data$GrLivArea>2500 & Train_data$SalePrice > 163000), ])
Denom <- nrow(Train_data)

#P(X>x, Y>y) =P(X>2500, Y>163000)
Num/Denom
## [1] 0.04520548

So P(Y>163000 and X>2500) = 0.04520548.

Now let's compute P(X>2500) * P(Y>163000).

#P(X>2500)
Num <-nrow(Train_data[ which( Train_data$GrLivArea>2500 ), ])
Denom <- nrow(Train_data)
P1 <-Num/Denom

#P(Y>163000)
Num <-nrow(Train_data[ which( Train_data$SalePrice > 163000 ), ])
Denom <- nrow(Train_data)
P2 <-Num/Denom

P1*P2
## [1] 0.02390692

P(X>2500) * P(Y>163000) = 0.02390692.

Since 0.04520548 is not equal to 0.02390692, P(X and Y) is not equal to P(X)P(Y). This shows that X and Y are NOT independent.
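The same comparison can be written more compactly by taking the mean of logical vectors (a sketch that reproduces the numbers above):

#P(X>2500 and Y>163000) versus P(X>2500)*P(Y>163000), using mean() of logicals
p_joint <- mean(Train_data$GrLivArea > 2500 & Train_data$SalePrice > 163000)
p_prod  <- mean(Train_data$GrLivArea > 2500) * mean(Train_data$SalePrice > 163000)
c(joint = p_joint, product = p_prod)  #0.04520548 versus 0.02390692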

CHI SQUARE TEST

The null hypothesis of independence is rejected if the p-value of the following Chi-squared test statistic is less than a given significance level α.

Load library MASS

tbl = table(Train_data$GrLivArea, Train_data$SalePrice) 

chisq.test(tbl)
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 589730, df = 569320, p-value < 2.2e-16

We test the hypothesis that GrLivArea is independent of SalePrice at the .05 significance level.

As the p-value (< 2.2e-16) is less than the .05 significance level, we reject the null hypothesis that GrLivArea is independent of SalePrice. There is therefore a dependency between the two variables.
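The prompt suggests binning the data into logical groups before running the chi-square test. A minimal sketch of that approach (quartile bins are an illustrative choice) could look like this:

#bin both variables into quartile groups and test for association
area_bin  <- cut(Train_data$GrLivArea, breaks = quantile(Train_data$GrLivArea), include.lowest = TRUE)
price_bin <- cut(Train_data$SalePrice, breaks = quantile(Train_data$SalePrice), include.lowest = TRUE)
table(area_bin, price_bin)
chisq.test(table(area_bin, price_bin))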

           P(X>x)   P(X<=x)
P(Y>y)     0        0.4986301
P(Y<=y)    0        0.5013699

#P(X>x and Y>y)
Num <-nrow(Train_data[ which( Train_data$GrLivArea>5642.00 & Train_data$SalePrice > 163000), ])
Denom <- nrow(Train_data)
Num/Denom
## [1] 0
#P(X>x and Y<=y)
Num <-nrow(Train_data[ which( Train_data$GrLivArea>5642.00 & Train_data$SalePrice <= 163000), ])
Denom <- nrow(Train_data)
Num/Denom
## [1] 0
#P(X<=x and Y>y)
Num <-nrow(Train_data[ which( Train_data$GrLivArea<=5642.00 & Train_data$SalePrice> 163000), ])
Denom <- nrow(Train_data)
Num/Denom
## [1] 0.4986301
#P(X<=x and Y<=y)
Num <-nrow(Train_data[ which( Train_data$GrLivArea<=5642.00 & Train_data$SalePrice<= 163000), ])
Denom <- nrow(Train_data)
Num/Denom
## [1] 0.5013699
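The same 2x2 table of joint probabilities can also be produced in a single call (a sketch; the dimension names are illustrative):

#joint probabilities for all four cells at once
prop.table(table(X_gt_x = Train_data$GrLivArea > 5642, Y_gt_y = Train_data$SalePrice > 163000))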

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of X and Y. Transform both variables simultaneously using Box-Cox transformations. You might have to research this. Using the transformed variables, run a correlation analysis and interpret. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

Let’s get the descriptive statistics for GrLivArea

describe(Train_data$GrLivArea) 
##    vars    n    mean     sd median trimmed    mad min  max range skew
## X1    1 1460 1515.46 525.48   1464 1467.67 483.33 334 5642  5308 1.36
##    kurtosis    se
## X1     4.86 13.75

Let’s get the descriptive statistics for SalePrice

describe(Train_data$SalePrice) 
##    vars    n     mean      sd median  trimmed     mad   min    max  range
## X1    1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
##    skew kurtosis      se
## X1 1.88      6.5 2079.11

Let’s now dive into relevant plots for the variables

Create Density plots for GrLivArea

plot(density(Train_data$GrLivArea), main="GrLivArea Probabilities", ylab="Probability", xlab="GrLivArea")
polygon(density(Train_data$GrLivArea), col="red")

Create a histogram for the variable

hist(Train_data$GrLivArea)

Create a qq plot. To see whether the data can be assumed normally distributed, it is often useful to create a qq-plot: we plot the kth smallest observation against the expected value of the kth smallest observation out of n from a standard normal distribution.

qqnorm(Train_data$GrLivArea)

Box plot

boxplot(Train_data$GrLivArea)

Plotting Saleprice variable

Create Density plots for SalePrice

plot(density(Train_data$SalePrice), main="SalePrice Probabilities", ylab="Probability", xlab="SalePrice")
polygon(density(Train_data$SalePrice), col="red")

Create a histogram for the variable

hist(Train_data$SalePrice)

Create a qq plot for SalePrice, again to check whether the data can be assumed normally distributed.

qqnorm(Train_data$SalePrice)

Box plot

boxplot(Train_data$SalePrice)

Scatter plot of X and Y (GrLivArea, SalePrice)

plot(Train_data$GrLivArea, Train_data$SalePrice, col="red", pch =19)

Transform both variables simultaneously using Box-Cox transformations

# run a linear model
GrLivArea<-Train_data$GrLivArea
SalePrice<-Train_data$SalePrice
Area_Price_lreg <- lm(SalePrice ~ GrLivArea, data = Train_data )
summary(Area_Price_lreg)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = Train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462999  -29800   -1124   21957  339832 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18569.026   4480.755   4.144 3.61e-05 ***
## GrLivArea     107.130      2.794  38.348  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared:  0.5021, Adjusted R-squared:  0.5018 
## F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16
cor(GrLivArea,SalePrice)
## [1] 0.7086245

We can see there is a significant relationship between the two variables; the p-value is < 2.2e-16, which is below 0.05.

bc<-boxcox(SalePrice ~ GrLivArea)

bc
## $x
##   [1] -2.00000000 -1.95959596 -1.91919192 -1.87878788 -1.83838384
##   [6] -1.79797980 -1.75757576 -1.71717172 -1.67676768 -1.63636364
##  [11] -1.59595960 -1.55555556 -1.51515152 -1.47474747 -1.43434343
##  [16] -1.39393939 -1.35353535 -1.31313131 -1.27272727 -1.23232323
##  [21] -1.19191919 -1.15151515 -1.11111111 -1.07070707 -1.03030303
##  [26] -0.98989899 -0.94949495 -0.90909091 -0.86868687 -0.82828283
##  [31] -0.78787879 -0.74747475 -0.70707071 -0.66666667 -0.62626263
##  [36] -0.58585859 -0.54545455 -0.50505051 -0.46464646 -0.42424242
##  [41] -0.38383838 -0.34343434 -0.30303030 -0.26262626 -0.22222222
##  [46] -0.18181818 -0.14141414 -0.10101010 -0.06060606 -0.02020202
##  [51]  0.02020202  0.06060606  0.10101010  0.14141414  0.18181818
##  [56]  0.22222222  0.26262626  0.30303030  0.34343434  0.38383838
##  [61]  0.42424242  0.46464646  0.50505051  0.54545455  0.58585859
##  [66]  0.62626263  0.66666667  0.70707071  0.74747475  0.78787879
##  [71]  0.82828283  0.86868687  0.90909091  0.94949495  0.98989899
##  [76]  1.03030303  1.07070707  1.11111111  1.15151515  1.19191919
##  [81]  1.23232323  1.27272727  1.31313131  1.35353535  1.39393939
##  [86]  1.43434343  1.47474747  1.51515152  1.55555556  1.59595960
##  [91]  1.63636364  1.67676768  1.71717172  1.75757576  1.79797980
##  [96]  1.83838384  1.87878788  1.91919192  1.95959596  2.00000000
## 
## $y
##   [1] -4816.056 -4766.444 -4717.639 -4669.655 -4622.505 -4576.201 -4530.758
##   [8] -4486.186 -4442.497 -4399.704 -4357.815 -4316.843 -4276.796 -4237.684
##  [15] -4199.515 -4162.296 -4126.037 -4090.742 -4056.418 -4023.071 -3990.707
##  [22] -3959.328 -3928.940 -3899.545 -3871.147 -3843.747 -3817.349 -3791.952
##  [29] -3767.559 -3744.170 -3721.786 -3700.407 -3680.032 -3660.661 -3642.294
##  [36] -3624.930 -3608.568 -3593.208 -3578.847 -3565.485 -3553.121 -3541.754
##  [43] -3531.383 -3522.006 -3513.622 -3506.231 -3499.832 -3494.423 -3490.004
##  [50] -3486.575 -3484.135 -3482.683 -3482.219 -3482.742 -3484.253 -3486.752
##  [57] -3490.238 -3494.711 -3500.171 -3506.618 -3514.053 -3522.474 -3531.883
##  [64] -3542.278 -3553.660 -3566.028 -3579.382 -3593.721 -3609.044 -3625.351
##  [71] -3642.640 -3660.909 -3680.158 -3700.383 -3721.584 -3743.756 -3766.898
##  [78] -3791.006 -3816.077 -3842.105 -3869.088 -3897.021 -3925.897 -3955.712
##  [85] -3986.460 -4018.134 -4050.728 -4084.234 -4118.645 -4153.953 -4190.149
##  [92] -4227.225 -4265.172 -4303.980 -4343.639 -4384.140 -4425.471 -4467.623
##  [99] -4510.583 -4554.342

Box-Cox transformations are designed to increase the normality of the errors in a linear model. This often increases the linearity of the relationship as well.
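For reference, the Box-Cox family of transformations indexed by lambda is

\[y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0 \\ \log(y), & \lambda = 0 \end{cases}\]

and boxcox() profiles the log-likelihood of the regression model over a grid of lambda values (the $x and $y components printed above).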

From the graph, a sensible estimate for lambda is around 0.1; the exact value that maximizes the profile log-likelihood is extracted below.

Residuals are the difference between the observed value of the dependent variable (y) and the predicted value (ŷ).

A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis

Let's plot the residuals to determine whether these variables are suitable for a linear model: we plot the residuals of the simple linear regression fit to Train_data against the independent variable GrLivArea.

Area_Price_lreg.res = resid(Area_Price_lreg)

 plot(GrLivArea, Area_Price_lreg.res, 
    ylab="Residuals", xlab="GrLivArea", 
    main="Residual plot") 
 abline(0, 0) 

The residual plot looks fairly random, so a linear model is a reasonable fit for analyzing the relationship between GrLivArea and SalePrice.

Let's now transform the variables using the Box-Cox transformation.

#Get max data point of bc for maximum y value
trans <- bc$x[which.max(bc$y)]

#output lambda
trans
## [1] 0.1010101
#check out transformed scatter plot
plot(GrLivArea,SalePrice^trans)

# re-run with transformation
Area_Price_lreg_tran <- lm(SalePrice^trans ~ GrLivArea)
summary(Area_Price_lreg_tran)
## 
## Call:
## lm(formula = SalePrice^trans ~ GrLivArea)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77261 -0.04995  0.00852  0.05367  0.31878 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.094e+00  7.734e-03  400.06   <2e-16 ***
## GrLivArea   1.832e-04  4.822e-06   37.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09678 on 1458 degrees of freedom
## Multiple R-squared:  0.4975, Adjusted R-squared:  0.4971 
## F-statistic:  1443 on 1 and 1458 DF,  p-value: < 2.2e-16

The optimal lambda (trans) is 0.10101. We can see that the residuals are greatly reduced and the standard error of the GrLivArea coefficient drops to 4.822e-06 (both, of course, on the transformed scale).

Now let's compare, side by side, the Q-Q plots before and after the Box-Cox transformation. The Q-Q plot on the right (after the transformation) lies closer to the line.

# QQ-plot
op <- par(pty = "s", mfrow = c(1, 2))
qqnorm(Area_Price_lreg$residuals); qqline(Area_Price_lreg$residuals)
qqnorm(Area_Price_lreg_tran$residuals); qqline(Area_Price_lreg_tran$residuals)

par(op)

#cbind(Train_data$GrLivArea,Train_data$SalePrice)

#cor(cbind(Train_data$GrLivArea,Train_data$SalePrice), use="complete.obs", method="kendall")

Using the transformed variables, run a correlation analysis and interpret

Let’s look at the correlation coefficient of the transformed variables

#find correlation coefficient
cor(GrLivArea, SalePrice^trans) 
## [1] 0.7053098
#find Pearson correlation matrix
rcorr(GrLivArea,SalePrice^trans, type="pearson")
##      x    y
## x 1.00 0.71
## y 0.71 1.00
## 
## n= 1460 
## 
## 
## P
##   x  y 
## x     0
## y  0

This correlation of 0.7053098 indicates a strong uphill (positive) linear relationship.

Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis

Assume the null hypothesis is that the correlation between these variables is 0.

Using the cor.test function we can calculate the p-value and the 99 percent confidence interval.

#correlation analysis of model after boxcox transformation
cor.test(GrLivArea, SalePrice^trans, method = c("pearson"), conf.level = 0.99)
## 
##  Pearson's product-moment correlation
## 
## data:  GrLivArea and SalePrice^trans
## t = 37.99, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.6697594 0.7376344
## sample estimates:
##       cor 
## 0.7053098
#correlation analysis of model before boxcox transformation
cor.test(GrLivArea, SalePrice, method = c("pearson"), conf.level = 0.99)
## 
##  Pearson's product-moment correlation
## 
## data:  GrLivArea and SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.6733974 0.7406408
## sample estimates:
##       cor 
## 0.7086245
#confidence of regression model before boxcox
confint(Area_Price_lreg, level=0.99)
##                  0.5 %     99.5 %
## (Intercept) 7012.23872 30125.8130
## GrLivArea     99.92504   114.3357
#confidence of regression model after boxcox
confint(Area_Price_lreg_tran, level=0.99)
##                   0.5 %       99.5 %
## (Intercept) 3.073977776 3.1138712394
## GrLivArea   0.000170743 0.0001956154

We get a p-value < 2.2e-16, which is less than the 0.05 significance level, so we reject the null hypothesis that the correlation between these variables is 0. The correlation is therefore nonzero.

The 99 percent confidence interval for the correlation after the transformation is (0.6697594, 0.7376344), a half-width of about 0.0339; this is essentially the same as the interval before the transformation, (0.6733974, 0.7406408), which is in fact slightly narrower.

In conclusion, the Box-Cox transformation helped make the model a bit more linear and reduced the residual variance; however, it is not necessary for establishing the significance of the relationship, or the dependency, between the two variables.

Linear Algebra and Correlation.

Invert your correlation matrix from the previous section. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

Derive correlation matrix

#find correlation matrix from previous secton
cormatrix <- cor(cbind(GrLivArea,SalePrice^trans))
cormatrix
##           GrLivArea          
## GrLivArea 1.0000000 0.7053098
##           0.7053098 1.0000000
#invert correlation matrix to get a precision matrix
prec_matrix <- solve(cormatrix)
prec_matrix
##           GrLivArea          
## GrLivArea  1.989899 -1.403495
##           -1.403495  1.989899

The variance inflation factors on the diagonal are both 1.989899.
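As a quick check, for two variables the diagonal of the precision matrix equals 1/(1 - r^2), the usual variance inflation factor formula (a small sketch using the correlation computed above):

#diagonal of a 2x2 precision matrix equals 1/(1 - r^2)
r <- 0.7053098
1/(1 - r^2)  #approximately 1.989899, matching the diagonal above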

Multiply the correlation matrix by the precision matrix

cormatrix %*% prec_matrix
##           GrLivArea  
## GrLivArea         1 0
##                   0 1

This gives us an identity matrix

Multiply the precision matrix by the correlation matrix.

prec_matrix %*% cormatrix
##           GrLivArea  
## GrLivArea         1 0
##                   0 1

This also gives the identity matrix, as expected: multiplying a matrix by its inverse (in either order) yields the identity, confirming that the precision matrix is indeed the inverse of the correlation matrix.

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable ( X ), location shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, λ) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.

For your non-transformed independent variable ( X ), location shift it so that the minimum value is above zero

Let’s check the min of our X variable

min(GrLivArea)
## [1] 334

We can see it is greater than 0, so no location shift is needed.
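Had the minimum been at or below zero, a simple location shift would have sufficed (a sketch for completeness; GrLivArea_shifted is an illustrative name and the shift is not needed here):

#shift so the minimum is strictly above zero (only needed if min(GrLivArea) <= 0)
GrLivArea_shifted <- GrLivArea - min(GrLivArea) + 1
min(GrLivArea_shifted)  #1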

Then load the MASS package and run fitdistr to fit a density function of your choice (see https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html).

Let’s determine which distribution makes sense for our variable with a histogram plot

hist(GrLivArea)

We notice the distribution is skewed to the right, with outliers that are mostly large positive values, so we will use a log-normal distribution.

Let's check the empirical cumulative distribution and a normal qq plot of the standardized data.

plot(ecdf(GrLivArea), main="Emperical Cumulative distribution")

z.norm <- (GrLivArea-mean(GrLivArea))/sd(GrLivArea) #standardized data
qqnorm(z.norm)
abline(0,1)

The qq plot shows the data depart from normality, suggesting a different pdf (i.e. the data may come from a lognormal pdf).

#Plot a sample from the log-normal distribution to ensure we made the right choice
x.lnorm <- rlnorm(1000,mean(log(GrLivArea)),sd(log(GrLivArea)))
hist(x.lnorm)

summary(x.lnorm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   475.5  1153.0  1423.0  1495.0  1759.0  4701.0

After choosing the log-normal distribution as a model that can mathematically represent our X variable, we have to estimate the parameters of the model.

The method of maximum likelihood is used in statistical inference to estimate parameters: the likelihood of a set of data is the probability of obtaining that particular set of data under the chosen probability model.

The fitdistr function returns the maximum-likelihood (optimal) parameter values for modeling the GrLivArea variable with a log-normal distribution.

fitdistr(GrLivArea, densfun="log-normal")
##      meanlog        sdlog   
##   7.267774383   0.333436175 
##  (0.008726424) (0.006170513)
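For the log-normal, the maximum-likelihood estimates are simply the mean and the (divide-by-n) standard deviation of log(X); a quick sketch to confirm the fitdistr output:

#MLEs of the log-normal parameters computed by hand
mean(log(GrLivArea))                                   #should match meanlog above (about 7.2678)
sqrt(mean((log(GrLivArea) - mean(log(GrLivArea)))^2))  #should match sdlog above (about 0.3334)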

We can now take 1000 samples from this distribution and plot the histogram using the optimal values from the fitdistr function.

GrLivArea_fit <- rlnorm(1000, meanlog=7.267774383,sdlog=0.333436175)

hist(GrLivArea_fit)

Let’s compare with histogram of original variable

hist(GrLivArea)

We see that the two histograms are very similar.
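For a more direct comparison, the fitted log-normal density can be overlaid on a density-scaled histogram of the original variable (a sketch using the fitted parameters above):

#overlay the fitted log-normal density on the original data
hist(GrLivArea, freq = FALSE, breaks = 30, main = "GrLivArea with fitted log-normal density")
curve(dlnorm(x, meanlog = 7.267774383, sdlog = 0.333436175), add = TRUE, col = "blue", lwd = 2)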

Modeling.

Build some type of regression model and submit your model to the competition board. You can use as many variables as you like. Provide your complete model summary and results with analysis.

We will model SalePrice against these 4 variables: GarageArea, LotFrontage, LotArea, and GrLivArea.

Train_fit <- lm(SalePrice ~ GarageArea + LotFrontage + LotArea + GrLivArea, Train_data)
 Train_fit_lm <-summary(Train_fit)
 Train_fit_lm
## 
## Call:
## lm(formula = SalePrice ~ GarageArea + LotFrontage + LotArea + 
##     GrLivArea, data = Train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -515157  -23329    -695   19986  312334 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.151e+04  5.421e+03  -2.124 0.033877 *  
## GarageArea   1.436e+02  7.833e+00  18.334  < 2e-16 ***
## LotFrontage -5.265e+01  7.286e+01  -0.723 0.470054    
## LotArea      7.909e-01  2.120e-01   3.730 0.000201 ***
## GrLivArea    7.958e+01  3.390e+00  23.473  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51870 on 1196 degrees of freedom
##   (259 observations deleted due to missingness)
## Multiple R-squared:  0.6144, Adjusted R-squared:  0.6132 
## F-statistic: 476.5 on 4 and 1196 DF,  p-value: < 2.2e-16

Let's write out the fitted equation:

\[\widehat{\text{SalePrice}} = -1.151\times 10^{4} + 1.436\times 10^{2}\,\text{GarageArea} - 5.265\times 10^{1}\,\text{LotFrontage} + 7.909\times 10^{-1}\,\text{LotArea} + 7.958\times 10^{1}\,\text{GrLivArea}\]

Let's find which of the 4 independent variables have a significant impact on SalePrice, assuming a significance level of 0.05.

p_values <-Train_fit_lm$coefficients[2:5,"Pr(>|t|)"]
p_values[which(p_values < .05)]
##    GarageArea       LotArea     GrLivArea 
##  2.331531e-66  2.007370e-04 1.620604e-100

We can conclude that GarageArea, LotArea, and GrLivArea have a significant impact on SalePrice; LotFrontage does not (p = 0.47).

What are the standard errors on each of the coefficients?

Train_fit_lm$coefficients[2:5,"Std. Error"]
##  GarageArea LotFrontage     LotArea   GrLivArea 
##   7.8331639  72.8564129   0.2120493   3.3902597

Measure the 99% confidence intervals of the coefficients.

confint(Train_fit, level=0.99)
##                     0.5 %      99.5 %
## (Intercept) -2.550083e+04 2471.936467
## GarageArea   1.234028e+02  163.821087
## LotFrontage -2.406132e+02  135.318028
## LotArea      2.437918e-01    1.337943
## GrLivArea    7.083186e+01   88.325238

Load test data

Test_data <-read.csv(file = "https://raw.githubusercontent.com/nobieyi00/CUNY_MSDA_R/master/test.csv", 
                      header = TRUE, sep = ",")

#Predict SalePrice on the test data using the coefficients of the regression model

SalePrice_predicted <- -1.151e+04 + 1.436e+02*Test_data$GarageArea - 5.265e+01*Test_data$LotFrontage + 7.909e-01*Test_data$LotArea + 7.958e+01*Test_data$GrLivArea

#Replace NA predictions (from missing predictor values) with 0

SalePrice_predicted[is.na(SalePrice_predicted)] <- 0


#write.csv(cbind(Test_data$Id,SalePrice_predicted), file = "C:/Users/Mezue/Documents/data605/test_pred.csv")
#fitted(Train_fit) # predicted values
#residuals(Train_fit) # residuals

# diagnostic plots 
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page 
plot(Train_fit)
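
An alternative to retyping the rounded coefficients is to call predict() on the fitted model (a sketch; SalePrice_predicted2 is an illustrative name, and rows with missing predictors will still come out as NA, as above):

#predict directly from the fitted model instead of hand-coding coefficients
SalePrice_predicted2 <- predict(Train_fit, newdata = Test_data)
SalePrice_predicted2[is.na(SalePrice_predicted2)] <- 0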

We can now reject the null hypothesis that there is no significant relationship between SalePrice and the GarageArea, LotArea, and GrLivArea variables.

We conclude that there is a significant relationship for these predictors.

Kaggle user name: Nnaemezue Obieyisi. Score: 4.78145.