Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.
X <- c(9.3, 7.4, 9.5, 9.3, 4.1, 6.4, 3.7, 12.4, 22.4, 8.5, 11.7, 19.9, 9.1, 9.5, 7.4, 6.9,
15.8, 11.8, 5.3, -1, 7.1, 8.8, 7.4, 10.6, 15.9, 8.4, 7.4, 6.4, 6.9, 5.1, 8.6, 10.6,
16, 11.4, 9.1, 1.2, 6.7, 15.1, 11.4, 7.7, 8.2, 12.6, 8.4, 15.5, 16, 8, 7.3, 6.9,
6.4, 10.3, 11.3, 13.7, 11.8, 10.4, 4.4, 3.7, 3.5, 9.5, 9.3, 4.4, 21.7, 9.5, 10.9, 11.5,
12.2, 15.1, 10.9, 4.2, 9.3, 6.6, 7.7, 13.9, 8, 15.4, 7.7, 12.9, 6.2, 8.2, 11.5, 1.2)
Y <- c(20.3, 20.8, 28.4, 20.2, 19.1, 14.6, 21.5, 18.6, 15.2, 16.2, 21.3, 26.9, 13.8, 14.7, 25.2, 26,
22.3, 13.1, 21.4, 16.8, 19.3, 18, 20.8, 22.6, 20.9, 15.7, 15.1, 16.3, 18.8, 15.3, 22.5, 26.8,
17.6, 10.3, 20.8, 20.2, 20.9, 7.3, 22.2, 11.4, 18.4, 16.3, 17.8, 19.9, 20.9, 12.6, 21.1, 19.7,
20.8, 14.9, 23, 21.7, 22, 19.4, 21.6, 23.6, 10.3, 11.5, 26.4, 15.5, 18.6, 13, 21.7, 22.7,
28.7, 14.8, 17.4, 20.9, 23.5, 13.5, 21.8, 24, 26.3, 12.2, 21.6, 26.5, 28.1, 11.8, 22.5, 21.7)
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
Q3_X <- quantile(X, 0.75)
Q1_Y <- quantile(Y, 0.25)
Q3_X
## 75%
## 11.55
Q1_Y
## 25%
## 15.65
- P(X>x | Y>y)
The conditional probability of X>x given Y>y is
\(P(X>x \mid Y>y) = \frac{P(X>x,\ Y>y)}{P(Y>y)}\)
(length(X[X > Q3_X & Y > Q1_Y])/length(X)) / (length(Y[Y > Q1_Y])/ length(Y))
## [1] 0.25
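Equivalently, since the 1/n factors cancel, the conditional probability is just a ratio of counts (a more compact sketch of the same computation):
sum(X > Q3_X & Y > Q1_Y) / sum(Y > Q1_Y)
## [1] 0.25
Interpretation: given that Y is above its 1st quartile, there is a 25% chance that X is above its 3rd quartile.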
- P(X>x, Y>y)
\(P(X>x,\ Y>y) = P(X>x \mid Y>y)\, P(Y>y)\)
(length(X[X > Q3_X & Y > Q1_Y])/length(X))
## [1] 0.1875
Interpretation: 18.75% of all observations have X above its 3rd quartile and, simultaneously, Y above its 1st quartile.
- P(X<x | Y>y)
\(P(X<x \mid Y>y) = \frac{P(X<x,\ Y>y)}{P(Y>y)}\)
(length(X[X < Q3_X & Y > Q1_Y])/length(X)) / (length(Y[Y > Q1_Y])/ length(Y))
## [1] 0.75
Interpretation: given that Y is above its 1st quartile, there is a 75% chance that X is below its 3rd quartile.
In addition, make a table of counts as shown below.
# X<=x, Y<=y
row1_col1 <- length(X[X <= Q3_X & Y <= Q1_Y])
# X>x, Y<=y
row1_col2 <- length(X[X > Q3_X & Y <= Q1_Y])
# X<=x, Y>y
row2_col1 <- length(X[X <= Q3_X & Y > Q1_Y])
# X>x, Y>y
row2_col2 <- length(X[X > Q3_X & Y > Q1_Y])
count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
count[3,] = count[1,] + count[2,]
count[,3] = count[,1] + count[,2]
names(count) <- c('X<=x', 'X>x', 'Total')
rownames(count) <- c('Y<=y', 'Y>y', 'Total')
count
## X<=x X>x Total
## Y<=y 15 5 20
## Y>y 45 15 60
## Total 60 20 80
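The same table can be built more compactly with table() and addmargins() (an alternative sketch; the labels are chosen to match the table above):
addmargins(table(ifelse(Y > Q1_Y, "Y>y", "Y<=y"), ifelse(X > Q3_X, "X>x", "X<=x")))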
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
A <- 20 # observations with X > x (the column total for X>x in the table)
B <- 60 # observations with Y > y (the row total for Y>y)
# joint probability from the table: 15 of 80 observations have both X>x and Y>y
P_AB <- 15/80
PA_PB <- (A/80) * (B/80)
P_AB
## [1] 0.1875
PA_PB
## [1] 0.1875
\(P(AB) = P(A)P(B) = 0.1875\)
By the multiplication check, A and B behave as independent events: the joint probability equals the product of the marginal probabilities exactly.
H_0: \(X>x\) and \(Y>y\) are independent
H_a: \(X>x\) and \(Y>y\) are not independent
\(\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}\)
# expected count = (row total) * (column total) / (grand total)
row1_col1 <- count[1,3] * count[3,1] / count[3,3]
row1_col2 <- count[1,3] * count[3,2] / count[3,3]
row2_col1 <- count[2,3] * count[3,1] / count[3,3]
row2_col2 <- count[2,3] * count[3,2] / count[3,3]
expected_count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
expected_count
## c.row1_col1..row2_col1. c.row1_col2..row2_col2.
## 1 15 5
## 2 45 15
diff_sq <- (count[1:2, 1:2] - expected_count)^2 / expected_count
diff_sq
## c.row1_col1..row2_col1. c.row1_col2..row2_col2.
## 1 0 0
## 2 0 0
chi_square <- sum(diff_sq)
chi_square
## [1] 0
p_value <- pchisq(chi_square, 1, lower.tail = FALSE)
p_value
## [1] 1
With a chi-square statistic of 0 the p-value is 1, so \(H_0\) cannot be rejected: the observed counts match the expected counts exactly, and there is no evidence of association between \(X>x\) and \(Y>y\). This agrees with the multiplication check above. Note, however, that splitting the data at quantiles does not by itself make the underlying X and Y independent; the test only addresses the dichotomized variables.
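As a cross-check, the built-in chi-square test of association gives the same answer (a sketch; correct = FALSE turns off the Yates continuity correction so the statistic matches the hand computation above):
chisq.test(as.matrix(count[1:2, 1:2]), correct = FALSE)
Because observed and expected counts coincide, this returns X-squared = 0 with a p-value of 1.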
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Below is a brief look at the training data; the full field definitions are in the competition’s data description file.
train.dt<-read.csv('https://raw.githubusercontent.com/Lidiia25/Data605_Final_Problem2/master/train.csv?token=Ac3_PlkWy_zjurpLjUxOWZj4UvMrBCdiks5cHachwA%3D%3D', header=TRUE)
head(train.dt)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
summary(train.dt$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
hist(train.dt$SalePrice, main="Sale Price")
qqnorm(train.dt$SalePrice)
qqline(train.dt$SalePrice)
pairs(~SalePrice + TotalBsmtSF + YearBuilt + LotFrontage + GrLivArea + GarageArea,
      data = train.dt, main = "Scatterplot Matrix")
Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypothesis that the correlation between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
selected <- data.frame(train.dt$SalePrice,train.dt$TotalBsmtSF,train.dt$GrLivArea)
cor_matrix <- cor(selected) # named cor_matrix to avoid masking base::matrix
cor_matrix
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 1.0000000 0.6135806
## train.dt.TotalBsmtSF 0.6135806 1.0000000
## train.dt.GrLivArea 0.7086245 0.4548682
## train.dt.GrLivArea
## train.dt.SalePrice 0.7086245
## train.dt.TotalBsmtSF 0.4548682
## train.dt.GrLivArea 1.0000000
cor.test(train.dt$GrLivArea, train.dt$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.dt$GrLivArea and train.dt$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
In the result above:
t is the t-test statistic (t = 38.348), df is the degrees of freedom (df = 1458), and the p-value is below 2.2e-16. conf.int is the 80% confidence interval for the correlation coefficient (0.69 to 0.72), and the sample estimate is the correlation coefficient itself (r = 0.71).
cor.test(train.dt$TotalBsmtSF, train.dt$SalePrice, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.dt$TotalBsmtSF and train.dt$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5922142 0.6340846
## sample estimates:
## cor
## 0.6135806
In the result above:
t is the t-test statistic (t = 29.671), df is the degrees of freedom (df = 1458), and the p-value is below 2.2e-16. conf.int is the 80% confidence interval for the correlation coefficient (0.59 to 0.63), and the sample estimate is the correlation coefficient itself (r = 0.61).
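For completeness, the third pairwise test, between TotalBsmtSF and GrLivArea, runs the same way (output omitted here; from the correlation matrix above the sample correlation is 0.45, which with n = 1460 is likewise significant at any conventional level):
cor.test(train.dt$TotalBsmtSF, train.dt$GrLivArea, conf.level = 0.80)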
The p-values of all three tests are far below the 0.05 significance level, so we conclude that each pair of variables is significantly correlated. Familywise error is a legitimate concern when several tests are run at once: with three tests at the 80% confidence level (\(\alpha = 0.20\)), the chance of at least one false positive could approach \(1 - 0.8^3 \approx 0.49\) if the tests were independent. Here, though, the p-values are so small that they survive any standard correction (e.g., Bonferroni), so the conclusions are not in doubt.
Linear Algebra and Correlation. Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
precision_matrix <- solve(cor_matrix)
precision_matrix
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 2.5582310 -0.93946422
## train.dt.TotalBsmtSF -0.9394642 1.60588442
## train.dt.GrLivArea -1.3854927 -0.06473842
## train.dt.GrLivArea
## train.dt.SalePrice -1.38549273
## train.dt.TotalBsmtSF -0.06473842
## train.dt.GrLivArea 2.01124151
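The diagonal entries of the precision matrix are variance inflation factors, \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing variable \(j\) on the other two. A quick sketch of that check for SalePrice, using the selected data frame from above:
r2 <- summary(lm(train.dt.SalePrice ~ ., data = selected))$r.squared
1 / (1 - r2) # matches the first diagonal entry, about 2.558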
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
matrix_2 <- cor_matrix%*% precision_matrix
matrix_2
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 1 1.387779e-17
## train.dt.TotalBsmtSF 0 1.000000e+00
## train.dt.GrLivArea 0 5.551115e-17
## train.dt.GrLivArea
## train.dt.SalePrice 0.000000e+00
## train.dt.TotalBsmtSF 1.110223e-16
## train.dt.GrLivArea 1.000000e+00
The product is the identity matrix, up to floating-point rounding, as expected for a matrix multiplied by its inverse. Next, multiply the precision matrix by the correlation matrix:
matrix_3 <-precision_matrix%*%cor_matrix
matrix_3
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 1.000000e+00 0
## train.dt.TotalBsmtSF 2.775558e-17 1
## train.dt.GrLivArea 0.000000e+00 0
## train.dt.GrLivArea
## train.dt.SalePrice 0.000000e+00
## train.dt.TotalBsmtSF 6.938894e-17
## train.dt.GrLivArea 1.000000e+00
Conduct LU decomposition on the matrix.
# LDU decomposition via Gaussian elimination: returns L, D, U with A = L %*% D %*% U
solver_func <- function(A){
  rows <- columns <- dim(A)[1]
  U <- A
  L <- D <- diag(rows)
  # forward elimination: store the multipliers in L and reduce U row by row
  for (j in 1:(columns - 1)){
    for (i in (j + 1):rows){
      L[i, j] <- U[i, j] / U[j, j]
      U[i, ] <- U[i, ] - U[j, ] * L[i, j]
    }
  }
  # factor the pivots out of U into the diagonal matrix D
  diag(D) <- diag(U)
  for (l in 1:rows){
    U[l, ] <- U[l, ] / U[l, l] # leave U with a unit diagonal
  }
  LDU <- list("Lower matrix" = L, "Diagonal matrix" = D, "Upper matrix" = U)
  return(LDU)
}
solver_func(matrix_3)
## $`Lower matrix`
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0 0
## [2,] 2.775558e-17 1 0
## [3,] 0.000000e+00 0 1
##
## $`Diagonal matrix`
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
##
## $`Upper matrix`
## train.dt.SalePrice train.dt.TotalBsmtSF
## train.dt.SalePrice 1 0
## train.dt.TotalBsmtSF 0 1
## train.dt.GrLivArea 0 0
## train.dt.GrLivArea
## train.dt.SalePrice 0.000000e+00
## train.dt.TotalBsmtSF 6.938894e-17
## train.dt.GrLivArea 1.000000e+00
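Because matrix_3 is numerically the identity, its decomposition above is trivially three identity factors. A more informative check (an aside, applying the same function to the correlation matrix itself) is to confirm that the factors multiply back to the original:
ldu <- solver_func(cor_matrix)
ldu$`Lower matrix` %*% ldu$`Diagonal matrix` %*% ldu$`Upper matrix` # reproduces cor_matrix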
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed-form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, and shift it so that the minimum value is strictly above zero if necessary. GrLivArea, whose distribution has a long right tail, is selected here.
min(train.dt$GrLivArea)
## [1] 334
The minimum is 334 square feet, already above zero, so no shift is needed. Next, load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html.)
library(MASS)
exp_prob <- fitdistr(train.dt$GrLivArea, "exponential")
exp_prob
## rate
## 6.598640e-04
## (1.726943e-05)
Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).
lambda <- exp_prob$estimate
lambda
## rate
## 0.000659864
set.seed(605) # arbitrary seed, so the simulated sample is reproducible
samples <- rexp(1000, lambda)
Plot a histogram and compare it with a histogram of your original variable.
par(mfrow = c(1, 2))
hist(samples, breaks = 50, xlim = c(0, 6000), main = "Exponential GrLivArea")
hist(train.dt$GrLivArea, breaks = 50, main = "Original GrLivArea")
The simulated exponential values pile up near zero and decay steadily, while the original GrLivArea is unimodal around 1,500 square feet: the exponential captures the right skew but not the shape around the mode.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
qexp(.05,rate = lambda)
## [1] 77.73313
qexp(.95,rate = lambda)
## [1] 4539.924
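These values follow from inverting the exponential CDF \(F(t) = 1 - e^{-\lambda t}\): setting \(F(t_p) = p\) gives \(t_p = -\ln(1 - p)/\lambda\), so \(t_{0.05} = -\ln(0.95)/\lambda \approx 77.7\) and \(t_{0.95} = -\ln(0.05)/\lambda \approx 4539.9\), matching qexp above.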
Also generate a 95% confidence interval from the empirical data, assuming normality.
The confidence interval for the mean is based on the sample mean and standard deviation. Assuming normality, its formula is
\(\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}\)
where \(z = 1.96\) for a 95% confidence level.
z <- 1.96
xbar <- mean(train.dt$GrLivArea)
s <- sd(train.dt$GrLivArea)
n <- length(train.dt$GrLivArea)
upper <- xbar + z * s / sqrt(n)
lower <- xbar - z * s / sqrt(n)
lower
## [1] 1488.509
upper
## [1] 1542.419
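The same interval can be cross-checked with the built-in t.test (a sketch; with n = 1460 the t-based and z-based intervals agree to well under one square foot):
t.test(train.dt$GrLivArea, conf.level = 0.95)$conf.int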
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
quantile(train.dt$GrLivArea, .05)
## 5%
## 848
quantile(train.dt$GrLivArea, .95)
## 95%
## 2466.1
The empirical 5th and 95th percentiles (848 and 2466 square feet) differ sharply from those of the fitted exponential (77.7 and 4539.9): the exponential puts far too much mass near zero and in the extreme right tail. Consistent with the histogram comparison above, GrLivArea is right-skewed but not exponential; a log-normal or gamma distribution would likely fit better.
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
only_numeric <- sapply(train.dt, is.numeric) # keep only the numeric columns
train_data <- train.dt[, only_numeric]
head(train_data)
## Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt
## 1 1 60 65 8450 7 5 2003
## 2 2 20 80 9600 6 8 1976
## 3 3 60 68 11250 7 5 2001
## 4 4 70 60 9550 7 5 1915
## 5 5 60 84 14260 8 5 2000
## 6 6 50 85 14115 5 5 1993
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 2003 196 706 0 150 856
## 2 1976 0 978 0 284 1262
## 3 2002 162 486 0 434 920
## 4 1970 0 216 0 540 756
## 5 2000 350 655 0 490 1145
## 6 1995 0 732 0 64 796
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1 856 854 0 1710 1 0
## 2 1262 0 0 1262 0 1
## 3 920 866 0 1786 1 0
## 4 961 756 0 1717 1 0
## 5 1145 1053 0 2198 1 0
## 6 796 566 0 1362 1 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## 1 2 1 3 1 8 0
## 2 2 0 3 1 6 1
## 3 2 1 3 1 6 1
## 4 1 0 3 1 7 1
## 5 2 1 4 1 9 1
## 6 1 1 1 1 5 0
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch
## 1 2003 2 548 0 61 0
## 2 1976 2 460 298 0 0
## 3 2001 2 608 0 42 0
## 4 1998 3 642 0 35 272
## 5 2000 3 836 192 84 0
## 6 1993 2 480 40 30 0
## X3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
## 1 0 0 0 0 2 2008 208500
## 2 0 0 0 0 5 2007 181500
## 3 0 0 0 0 9 2008 223500
## 4 0 0 0 0 2 2006 140000
## 5 0 0 0 0 12 2008 250000
## 6 320 0 0 700 10 2009 143000
cor_df <- as.data.frame(cor(train_data)) # named cor_df to avoid masking base::cor
cor_df <- cor_df[order(cor_df$SalePrice, decreasing = TRUE), ]
cor_saleprice <- cor_df$SalePrice
names(cor_saleprice) <- rownames(cor_df)
cor_saleprice
## SalePrice OverallQual GrLivArea GarageCars GarageArea
## 1.00000000 0.79098160 0.70862448 0.64040920 0.62343144
## TotalBsmtSF X1stFlrSF FullBath TotRmsAbvGrd YearBuilt
## 0.61358055 0.60585218 0.56066376 0.53372316 0.52289733
## YearRemodAdd Fireplaces BsmtFinSF1 WoodDeckSF X2ndFlrSF
## 0.50710097 0.46692884 0.38641981 0.32441344 0.31933380
## OpenPorchSF HalfBath LotArea BsmtFullBath BsmtUnfSF
## 0.31585623 0.28410768 0.26384335 0.22712223 0.21447911
## BedroomAbvGr ScreenPorch PoolArea MoSold X3SsnPorch
## 0.16821315 0.11144657 0.09240355 0.04643225 0.04458367
## BsmtFinSF2 BsmtHalfBath MiscVal Id LowQualFinSF
## -0.01137812 -0.01684415 -0.02118958 -0.02191672 -0.02560613
## YrSold OverallCond MSSubClass EnclosedPorch KitchenAbvGr
## -0.02892259 -0.07785589 -0.08428414 -0.12857796 -0.13590737
## LotFrontage MasVnrArea GarageYrBlt
## NA NA NA
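The NA entries for LotFrontage, MasVnrArea, and GarageYrBlt arise because those columns contain missing values, and cor() returns NA for them by default; this is presumably why they are left out of the model below. A sketch of one way to rank them anyway:
cor(train_data, use = "pairwise.complete.obs")[, "SalePrice"]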
model <- lm(SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea + TotalBsmtSF +
            X1stFlrSF + TotRmsAbvGrd + YearBuilt + YearRemodAdd + Fireplaces + BsmtFinSF1 +
            WoodDeckSF + X2ndFlrSF + OpenPorchSF + HalfBath + LotArea + BsmtFullBath +
            BsmtUnfSF + BedroomAbvGr + ScreenPorch, data = train.dt)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + GarageCars +
## GarageArea + TotalBsmtSF + X1stFlrSF + TotRmsAbvGrd + YearBuilt +
## YearRemodAdd + Fireplaces + BsmtFinSF1 + WoodDeckSF + X2ndFlrSF +
## OpenPorchSF + HalfBath + LotArea + BsmtFullBath + BsmtUnfSF +
## BedroomAbvGr + ScreenPorch, data = train.dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513055 -16892 -1319 14475 297422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.057e+06 1.215e+05 -8.703 < 2e-16 ***
## OverallQual 1.919e+04 1.177e+03 16.298 < 2e-16 ***
## GrLivArea 1.462e+01 2.017e+01 0.725 0.468876
## GarageCars 9.433e+03 2.929e+03 3.220 0.001308 **
## GarageArea 1.036e+01 9.935e+00 1.043 0.297272
## TotalBsmtSF 1.245e+01 7.221e+00 1.724 0.084876 .
## X1stFlrSF 3.192e+01 2.061e+01 1.549 0.121638
## TotRmsAbvGrd 4.258e+03 1.223e+03 3.481 0.000515 ***
## YearBuilt 2.171e+02 4.821e+01 4.503 7.24e-06 ***
## YearRemodAdd 2.842e+02 6.172e+01 4.605 4.49e-06 ***
## Fireplaces 5.142e+03 1.809e+03 2.843 0.004536 **
## BsmtFinSF1 1.304e+01 6.239e+00 2.090 0.036820 *
## WoodDeckSF 2.987e+01 8.163e+00 3.660 0.000261 ***
## X2ndFlrSF 2.798e+01 2.039e+01 1.372 0.170294
## OpenPorchSF 4.942e-01 1.556e+01 0.032 0.974666
## HalfBath -9.331e+02 2.547e+03 -0.366 0.714130
## LotArea 4.799e-01 1.037e-01 4.628 4.03e-06 ***
## BsmtFullBath 4.932e+03 2.530e+03 1.949 0.051436 .
## BsmtUnfSF 3.463e-02 6.326e+00 0.005 0.995633
## BedroomAbvGr -7.586e+03 1.707e+03 -4.443 9.54e-06 ***
## ScreenPorch 5.942e+01 1.761e+01 3.375 0.000757 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36090 on 1439 degrees of freedom
## Multiple R-squared: 0.7964, Adjusted R-squared: 0.7936
## F-statistic: 281.5 on 20 and 1439 DF, p-value: < 2.2e-16
par(mfrow=c(2,1))
hist(model$residuals, breaks=60, main = "Histogram of Residuals", xlab= "")
qqnorm(model$residuals)
qqline(model$residuals)
The multiple R-squared is 0.7964, meaning the model explains about 80 percent of the variation in SalePrice. The residuals are centered near zero and roughly symmetric, but the Q-Q plot shows heavy tails driven by a few extreme observations, so the normality assumption holds only approximately.
test.dt<-read.csv('https://raw.githubusercontent.com/Lidiia25/Data605_Final_Problem2/master/test.csv?token=Ac3_PjYkV8Ye1x60BeCC-8mqFmnuEmUiks5cHyU_wA%3D%3D', header=TRUE)
SalesPred <- predict(model, test.dt)
par(mfrow=c(1,2))
hist(SalesPred, breaks=40, main = 'Predicted Sales Prices from test data')
hist(train.dt$SalePrice, breaks=50, xlim=c(0, 600000), main = 'Sales Prices from train data')
summary(SalesPred)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1139 127385 167504 177967 221957 632904 3
summary(train.dt$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Report your Kaggle.com user name and score
kaggle <- data.frame(Id = test.dt[, "Id"], SalePrice = SalesPred)
kaggle[kaggle < 0] <- 0 # clip negative predictions to zero
kaggle <- replace(kaggle, is.na(kaggle), 0) # fill the 3 NA predictions with zero
write.csv(kaggle, file = "kaggle.csv", row.names = FALSE)
Username: Lidiia T
Score : 0.64354
Because I excluded the categorical variables and did not handle zero and NA values among the numeric predictors, I did not get a high score on kaggle.com. A few extreme outliers may also be distorting the fit. Addressing these issues, for example by imputing missing values as sketched below, would likely improve the score.
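A minimal sketch of such an imputation, assuming median-filling is acceptable for the numeric predictors (the loop and column handling are illustrative, not part of the submitted model):
num_cols <- names(train.dt)[sapply(train.dt, is.numeric)]
for (v in num_cols) {
  med <- median(train.dt[[v]], na.rm = TRUE) # training-set median
  train.dt[[v]][is.na(train.dt[[v]])] <- med
  if (v %in% names(test.dt)) test.dt[[v]][is.na(test.dt[[v]])] <- med # same value for test
}
Refitting the model on the imputed data would then produce predictions with no NAs on the test set.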