Problem 1.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = \frac{N+1}{2}\)
set.seed(100)
N <- 1000 # Max
X <- runif(10000, min=1, max=N) # 10,000 uniform random numbers between 1 and N
Y <- rnorm(10000, mean=(N+1)/2, sd=(N+1)/2) # mean and standard deviation as (N+1)/2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.199 253.881 497.497 500.795 749.039 999.854
## [1] 10000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1545.5 161.0 495.8 495.6 832.5 2492.2
## [1] 10000
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
“x” is estimated as the median of the X variable
## [1] 497.4966
“y” is estimated as the 1st quartile of the Y variable
## 25%
## 161.0403
5 points
a. P(X>x | X>y)
This is computed as \(\frac{P(X > x \enspace \text{and} \enspace X > y)}{P(X > y)}\)
prob <- mean(X > x & X > y) / mean(X > y) # proportion meeting both conditions divided by proportion meeting the condition X > y
prob
## [1] 0.5929088
c. P(X<x | X>y)
This is computed as \(\frac{P(X < x \enspace \text{and} \enspace X > y)}{P(X > y)}\)
prob <- mean(X < x & X > y) / mean(X > y) # proportion meeting both conditions divided by proportion meeting the condition X > y
prob
## [1] 0.4070912
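As a quick check on parts a and c, the sketch below recomputes both conditional probabilities with a small helper. The helper name `cond_prob` is an assumption made for illustration and is not part of the original analysis; the data are regenerated with the same seed and N used above.

```r
# Regenerate the data with the same seed and N as above
set.seed(100)
N <- 1000
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)          # "x": median of X
y <- quantile(Y, 0.25)  # "y": first quartile of Y

# Hypothetical helper: empirical P(A | B) = P(A and B) / P(B)
cond_prob <- function(A, B) mean(A & B) / mean(B)

p_a <- cond_prob(X > x, X > y)  # P(X > x | X > y)
p_c <- cond_prob(X < x, X > y)  # P(X < x | X > y)
p_a + p_c                       # the two events are complementary given X > y
```

In words, part a says that among the observations where X exceeds y, about 59% also exceed the median x; part c says the remaining 41% or so fall below it. Because X is continuous, no observation equals x exactly, so the two probabilities sum to 1.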
5 points
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
# Count the observations in each cell (rows: X vs. x, columns: Y vs. y)
prob_tab <- rbind(
  c(sum(X < x & Y < y), sum(X < x & Y == y), sum(X < x & Y > y)),
  c(sum(X == x & Y < y), sum(X == x & Y == y), sum(X == x & Y > y)),
  c(sum(X > x & Y < y), sum(X > x & Y == y), sum(X > x & Y > y))
)
prob_tab <- cbind(prob_tab, rowSums(prob_tab)) # add row totals
prob_tab <- rbind(prob_tab, colSums(prob_tab)) # add column totals
colnames(prob_tab) <- c("Y<y","Y=y","Y>y","Total")
rownames(prob_tab) <- c("X<x","X=x","X>x","Total")
knitr::kable(prob_tab)
 | Y<y | Y=y | Y>y | Total |
---|---|---|---|---|
X<x | 1255 | 0 | 3745 | 5000 |
X=x | 0 | 0 | 0 | 0 |
X>x | 1245 | 0 | 3755 | 5000 |
Total | 2500 | 0 | 7500 | 10000 |
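For P(X>x)P(Y>y):
## [1] 0.375
From the table, the joint probability P(X>x and Y>y) is 3755/10000 = 0.3755, while the product of the marginals is P(X>x)P(Y>y) = 0.5 × 0.75 = 0.375. The two values are approximately equal, which is consistent with P(X>x and Y>y) = P(X>x)P(Y>y), that is, with X and Y being independent.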
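The comparison of the joint probability with the product of the marginals can be reproduced directly from the counts in the table. This is a minimal sketch; the object names (`counts`, `p_joint`, `p_x`, `p_y`) are assumptions for illustration.

```r
# Counts copied from the table above (rows: X<x, X>x; columns: Y<y, Y>y)
counts <- matrix(c(1255, 3745,
                   1245, 3755),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("X<x", "X>x"), c("Y<y", "Y>y")))
n <- sum(counts)

p_joint <- counts["X>x", "Y>y"] / n   # joint P(X>x and Y>y)
p_x     <- sum(counts["X>x", ]) / n   # marginal P(X>x)
p_y     <- sum(counts[, "Y>y"]) / n   # marginal P(Y>y)

c(joint = p_joint, product = p_x * p_y)
```

The joint value is 0.3755 and the product is 0.375, a difference of 0.0005.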
5 points
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
##
## Fisher's Exact Test for Count Data
##
## data: fisher_dat
## p-value = 0.8354
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9222661 1.1076494
## sample estimates:
## odds ratio
## 1.010724
As the p-value (0.8354) is greater than 0.05, we fail to reject the null hypothesis of independence; the data are consistent with the two events being independent.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: chi_dat
## X-squared = 0.0432, df = 1, p-value = 0.8353
As the p-value (0.8353) is greater than 0.05, we again fail to reject the null hypothesis of independence; the data are consistent with the two events being independent.
Fisher’s exact test is mostly applied to small samples in practice, although it is valid for all sample sizes.
The chi-square test is used when the expected cell counts are large.
The chi-square test relies on a large-sample approximation, while Fisher’s exact test computes an exact p-value and is therefore preferred for small samples. With large cell counts, as here (every cell exceeds 1,000), the two tests give very similar answers and the chi-square test is appropriate.
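The code that produced the two test outputs above is not shown, so the following is only a sketch of how they could be obtained from the 2x2 counts in the table. The matrix name `counts` and its row and column order are assumptions; the original objects `fisher_dat` and `chi_dat` may have been built differently.

```r
# 2x2 counts from the table above (rows: X<x, X>x; columns: Y<y, Y>y)
counts <- matrix(c(1255, 3745,
                   1245, 3755),
                 nrow = 2, byrow = TRUE)

fisher_res <- fisher.test(counts)   # exact test of independence
chi_res    <- chisq.test(counts)    # Yates' continuity correction is the default for 2x2 tables

c(fisher = fisher_res$p.value, chisq = chi_res$p.value)
```

Both p-values are far above 0.05, so neither test gives evidence against independence.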
Problem 2.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
5 points
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Downloading the competition Files
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'pracma'
## The following objects are masked from 'package:psych':
##
## logit, polar
I ran a shell block here and used the Kaggle CLI to download the competition files.
I ran into too many issues installing the Kaggle API utility for R and devtools, including compatibility issues with my version of RStudio.
kaggle competitions list -s house-prices-advanced-regression-techniques
kaggle competitions download -c house-prices-advanced-regression-techniques --force
unzip -o house-prices-advanced-regression-techniques.zip
## ref deadline category reward teamCount userHasEntered
## ------------------------------------------- ------------------- --------------- --------- --------- --------------
## house-prices-advanced-regression-techniques 2030-01-01 00:00:00 Getting Started Knowledge 5179 True
##
## Downloading house-prices-advanced-regression-techniques.zip to /localdisk/Data-605/Final-Project
##
## Archive: house-prices-advanced-regression-techniques.zip
## inflating: data_description.txt
## inflating: sample_submission.csv
## inflating: test.csv
## inflating: train.csv
## [1] 1460 81
Number of columns: 81
Number of rows (excluding the header): 1460
Id | MSSubClass | MSZoning | LotFrontage | LotArea |
---|---|---|---|---|
1 | 60 | RL | 65 | 8450 |
2 | 20 | RL | 80 | 9600 |
3 | 60 | RL | 68 | 11250 |
4 | 70 | RL | 60 | 9550 |
5 | 60 | RL | 84 | 14260 |
6 | 50 | RL | 85 | 14115 |
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
Plotting the scatter plots between a few independent variables and the response variable
#par(mfrow=c(2,3))
ggplot(housing, aes(x=OverallQual, y=SalePrice)) + geom_jitter(color='skyblue') + theme_classic() + labs(title ='Scatter Plot of OverallQual vs Sale price')
Looking at the above scatterplots, OverallQual, TotalBsmtSF and GrLivArea each appear to have a roughly linear relationship with the sale price of the house.
Selected variables are: SalePrice,TotalBsmtSF,GrLivArea
Using R's t.test() function with an 80% confidence level:
Working on living area, GrLivArea (Y), and sale price (Z):
##
## Welch Two Sample t-test
##
## data: Y and Z
## t = -86.288, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 80 percent confidence interval:
## -182071.5 -176740.0
## sample estimates:
## mean of x mean of y
## 1515.464 180921.196
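From this result, the 80% confidence interval for the difference in the means of the two variables runs from -182071.5 to -176740.0, and the p-value is below 2.2e-16, so we reject the null hypothesis that the two means are equal. Note that a two-sample t-test compares means and does not test whether a correlation is 0; a correlation test such as cor.test() is the appropriate tool for that hypothesis. The correlation matrix in the next part reports a correlation of 0.709 between GrLivArea and SalePrice, which with 1,460 observations is far from 0, so living area and sale price are clearly related.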
Similarly, for the other pair: TotalBsmtSF (X) and sale price (Z)
##
## Welch Two Sample t-test
##
## data: X and Z
## t = -86.509, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 80 percent confidence interval:
## -182529.5 -177198.0
## sample estimates:
## mean of x mean of y
## 1057.429 180921.196
From this result, the 80% confidence interval for the difference in the means of the two variables runs from -182529.5 to -177198.0, and the p-value is again below 2.2e-16, so we reject the null hypothesis that the two means are equal. As before, this t-test compares means rather than testing the correlation. The correlation matrix in the next part reports a correlation of 0.614 between TotalBsmtSF and SalePrice, which with 1,460 observations is far from 0, so basement area and sale price are clearly related.
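To test whether a correlation is 0 and obtain an 80% confidence interval for it, base R provides cor.test(). The sketch below uses simulated stand-in data, because the housing data frame is not available here; the vectors `area` and `price` are hypothetical and would be replaced by housing$GrLivArea and housing$SalePrice.

```r
# Simulated stand-in data (hypothetical); in the report these would be
# housing$GrLivArea and housing$SalePrice
set.seed(1)
area  <- rnorm(200, mean = 1500, sd = 500)
price <- 100 * area + rnorm(200, sd = 40000)

# Test H0: correlation = 0 and report an 80% confidence interval
ct <- cor.test(area, price, conf.level = 0.80)
ct$estimate   # sample correlation
ct$p.value    # p-value for H0: correlation = 0
ct$conf.int   # 80% confidence interval for the correlation

# The same interval by hand, via the Fisher z-transformation
r <- unname(ct$estimate)
n <- length(area)
ci_manual <- tanh(atanh(r) + c(-1, 1) * qnorm(0.90) / sqrt(n - 3))
ci_manual
```

Running the same call on each pair of the three selected variables would give the three pairwise tests and intervals.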
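Other variables in this dataset may influence the correlations between the selected pairs. Familywise error is also a concern: there are three pairwise tests among the three variables, and at an 80% confidence level (significance level 0.20) the chance of at least one false rejection of a true null hypothesis would be about 1 - 0.8^3 = 0.49 if the tests were independent. Here the correlations are strong (0.45 to 0.71 on 1,460 observations), so a stricter per-test level would be unlikely to change the conclusions.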
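The size of the familywise error rate can be worked out in a few lines. This sketch assumes three pairwise tests at a significance level of 0.20 and, for the first formula, independent tests; the Bonferroni adjustment is shown as one common correction rather than the method used in this report.

```r
alpha <- 0.20   # significance level matching an 80% confidence level
m <- 3          # pairwise tests among three variables

fwer <- 1 - (1 - alpha)^m   # chance of at least one false rejection, assuming independent tests
alpha_bonf <- alpha / m     # Bonferroni-adjusted per-test significance level

c(fwer = fwer, alpha_bonf = alpha_bonf)
```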
5 points
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Correlation Matrix:
housing_cor <- housing[c("SalePrice", "TotalBsmtSF", "GrLivArea")]
housing_cor_matrix <- cor(housing_cor, use = "complete.obs")
housing_cor_matrix
## SalePrice TotalBsmtSF GrLivArea
## SalePrice 1.0000000 0.6135806 0.7086245
## TotalBsmtSF 0.6135806 1.0000000 0.4548682
## GrLivArea 0.7086245 0.4548682 1.0000000
Precision matrix:
## SalePrice TotalBsmtSF GrLivArea
## SalePrice 2.5582310 -0.93946422 -1.38549273
## TotalBsmtSF -0.9394642 1.60588442 -0.06473842
## GrLivArea -1.3854927 -0.06473842 2.01124151
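The code for the inversion is not shown above; a minimal sketch using base R's solve() follows. The correlation values are copied from the printed matrix, and the object names `cor_mat` and `precision_mat` are assumptions for illustration.

```r
vars <- c("SalePrice", "TotalBsmtSF", "GrLivArea")
cor_mat <- matrix(c(1.0000000, 0.6135806, 0.7086245,
                    0.6135806, 1.0000000, 0.4548682,
                    0.7086245, 0.4548682, 1.0000000),
                  nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

precision_mat <- solve(cor_mat)        # matrix inverse of the correlation matrix
diag(precision_mat)                    # diagonal holds the variance inflation factors
round(cor_mat %*% precision_mat, 10)   # identity matrix, up to rounding
round(precision_mat %*% cor_mat, 10)   # identity matrix, up to rounding
```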
Multiplying the correlation and precision matrices in both directions
## SalePrice TotalBsmtSF GrLivArea
## SalePrice 1.000000e+00 2.015979e-17 -3.026621e-19
## TotalBsmtSF -1.511271e-16 1.000000e+00 7.067800e-17
## GrLivArea -2.220446e-16 5.551115e-17 1.000000e+00
## SalePrice TotalBsmtSF GrLivArea
## SalePrice 1.000000e+00 -2.621494e-16 -2.220446e-16
## TotalBsmtSF 1.311821e-16 1.000000e+00 5.551115e-17
## GrLivArea -3.026621e-19 7.067800e-17 1.000000e+00
Both products equal the identity matrix, up to floating-point error (the off-diagonal entries are on the order of 1e-16 or smaller).
LU decomposition
housing_cor_matrix_L <- lu(housing_cor_matrix)$L # The numeric lower triangular matrix
housing_cor_matrix_U <- lu(housing_cor_matrix)$U # The numeric upper triangular matrix
Since LU = A, where A is the correlation matrix created above, multiplying L and U should reproduce the correlation matrix:
## [1] TRUE
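To make the decomposition itself visible, here is a small base-R sketch of LU factorization without pivoting (Doolittle form). The function name `lu_doolittle` is hypothetical; the report's own lu() call presumably comes from the pracma package attached earlier, which this sketch does not depend on.

```r
# Hypothetical helper: LU decomposition without pivoting (Doolittle form).
# L is unit lower triangular, U is upper triangular, and L %*% U equals A.
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]         # elimination multiplier
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]  # eliminate the entry below the pivot
    }
  }
  list(L = L, U = U)
}

# Correlation matrix values copied from the output above
A <- matrix(c(1.0000000, 0.6135806, 0.7086245,
              0.6135806, 1.0000000, 0.4548682,
              0.7086245, 0.4548682, 1.0000000),
            nrow = 3, byrow = TRUE)

f <- lu_doolittle(A)
all.equal(f$L %*% f$U, A)   # the product recovers the correlation matrix
```

Skipping pivoting is safe here because a correlation matrix of this kind is positive definite, so the pivots never vanish.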
5 points
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/Rdevel/library/MASS/html/fitdistr.html ). Find the optimal value of l for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, l)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Plotting each of the variables for their ranges:
par(mfrow=c(2,3))
boxplot(housing$MSSubClass, main='MSSubClass')
boxplot(housing$LotFrontage, main='LotFrontage')
boxplot(housing$LotArea, main='LotArea')
boxplot(housing$OverallQual, main='OverallQual')
boxplot(housing$OverallCond, main='OverallCond')
boxplot(housing$YearBuilt, main='YearBuilt')
par(mfrow=c(2,3))
boxplot(housing$YearRemodAdd, main='YearRemodAdd')
boxplot(housing$MasVnrArea, main='MasVnrArea')
boxplot(housing$BsmtFinSF1, main='BsmtFinSF1')
boxplot(housing$BsmtFinSF2, main='BsmtFinSF2')
boxplot(housing$BsmtUnfSF, main='BsmtUnfSF')
boxplot(housing$TotalBsmtSF, main='TotalBsmtSF')
par(mfrow=c(2,3))
boxplot(housing$X1stFlrSF, main='X1stFlrSF')
boxplot(housing$X2ndFlrSF, main='X2ndFlrSF')
boxplot(housing$LowQualFinSF, main='LowQualFinSF')
boxplot(housing$GrLivArea, main='GrLivArea')
boxplot(housing$BsmtFullBath, main='BsmtFullBath')
boxplot(housing$BsmtHalfBath, main='BsmtHalfBath')
par(mfrow=c(2,3))
boxplot(housing$FullBath, main='FullBath')
boxplot(housing$HalfBath, main='HalfBath')
boxplot(housing$BedroomAbvGr, main='BedroomAbvGr')
boxplot(housing$KitchenAbvGr, main='KitchenAbvGr')
boxplot(housing$TotRmsAbvGrd, main='TotRmsAbvGrd')
boxplot(housing$Fireplaces, main='Fireplaces')
par(mfrow=c(2,3))
boxplot(housing$GarageYrBlt, main='GarageYrBlt')
boxplot(housing$GarageCars, main='GarageCars')
boxplot(housing$GarageArea, main='GarageArea')
boxplot(housing$WoodDeckSF, main='WoodDeckSF')
boxplot(housing$PoolArea, main='PoolArea')
boxplot(housing$MiscVal, main='MiscVal')
par(mfrow=c(1,3))
boxplot(housing$MoSold, main='MoSold')
boxplot(housing$YrSold, main='YrSold')
boxplot(housing$SalePrice, main='SalePrice')
Looking at the box plots, I will use the variable LotArea. It is highly skewed to the right, and every row has a valid numeric value (no missing values), so the data quality is good.
Build a histogram.
As we see above in the histogram, this variable is highly skewed to the right.
Now we are going to fit this variable to an exponential distribution.
Getting the value of lambda for this exponential distribution
## rate
## 9.50857e-05
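The fitting code is not shown above, so the following is a sketch of the step under stated assumptions. For an exponential distribution the maximum-likelihood rate is the reciprocal of the sample mean, and MASS::fitdistr reports that same estimate. The vector `lot` is a simulated, hypothetical stand-in for housing$LotArea.

```r
set.seed(42)
lot <- rexp(1460, rate = 1e-4)   # hypothetical stand-in for housing$LotArea

rate_hat <- 1 / mean(lot)        # closed-form maximum-likelihood estimate of the rate

# MASS::fitdistr reports the same estimate when the package is installed
if (requireNamespace("MASS", quietly = TRUE)) {
  fit <- MASS::fitdistr(lot, "exponential")
  print(all.equal(unname(fit$estimate), rate_hat))
}

sim <- rexp(1000, rate_hat)      # 1000 draws from the fitted distribution
summary(sim)
```

The rate reported above, 9.50857e-05, is accordingly the reciprocal of the mean LotArea of about 10,517 sq. ft.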
Take 1000 samples from this exponential distribution using this value
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.4 3313.3 7722.9 10664.9 14313.7 78675.6
The histogram of the 1,000 simulated values, drawn from the exponential distribution fitted to LotArea, has the expected exponential shape.
Generating the 5th and 95th percentiles
## [1] 539.4428 31505.6013
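These percentiles follow from inverting the exponential CDF, F(q) = 1 - exp(-rate * q), which gives q = -log(1 - p) / rate. A short sketch using the fitted rate reported above; the object names are assumptions for illustration.

```r
rate <- 9.50857e-05                 # fitted rate reported above

p <- c(0.05, 0.95)
q_formula <- -log(1 - p) / rate     # inverse of the CDF F(q) = 1 - exp(-rate * q)
q_builtin <- qexp(p, rate = rate)   # built-in exponential quantile function

rbind(q_formula, q_builtin)
```

Both rows agree with the 5th and 95th percentiles printed above.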
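Generating a 95% interval from the data, assuming that the distribution is normal. If 95% of the area lies between -z and z, then 5% of the area must lie outside this range; since normal curves are symmetric, half of that amount (2.5%) must lie below -z, which gives z of about 1.96. The bounds below are the mean plus or minus 1.96 standard deviations, that is, the range expected to contain 95% of lot areas under normality. A confidence interval for the mean itself would instead use the standard error, sd/sqrt(n), and would be much narrower.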
## [1] "Average Lower levels"
## [1] -9046.092
## [1] "Average Upper levels"
## [1] 30079.75
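The difference between the range covering 95% of the values and a 95% confidence interval for the mean can be sketched as follows. The vector `v` is simulated, hypothetical data standing in for housing$LotArea, so the numbers will not match the output above.

```r
set.seed(7)
v <- rexp(1460, rate = 1e-4)   # hypothetical stand-in for housing$LotArea
z <- qnorm(0.975)              # about 1.96

# Range expected to contain 95% of the values under normality
spread_95 <- mean(v) + c(-1, 1) * z * sd(v)

# 95% confidence interval for the mean, using the standard error
ci_mean <- mean(v) + c(-1, 1) * z * sd(v) / sqrt(length(v))

rbind(spread_95, ci_mean)
```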
Using the actual data to get 5th and 95th percentile
## 5% 95%
## 3311.70 17401.15
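This indicates that the lowest 5% of the observations are below 3,312 sq. ft. of lot area and the highest 5% are above 17,401 sq. ft. The fitted exponential places these percentiles at about 539 and 31,506 sq. ft., a much wider range than the data show, and the normal-based interval has a negative lower bound, which is impossible for an area. Neither distribution describes LotArea well.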
10 points
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Creating a dataframe with all the numeric columns.
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.0 Min. : 21.00 Min. : 1300
## 1st Qu.: 365.8 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554
## Median : 730.5 Median : 50.0 Median : 69.00 Median : 9478
## Mean : 730.5 Mean : 56.9 Mean : 70.05 Mean : 10517
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :1460.0 Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 0.0 Median : 383.5 Median : 0.00 Median : 477.5
## Mean : 103.7 Mean : 443.6 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 166.0 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :1600.0 Max. :5644.0 Max. :1474.00 Max. :2336.0
## NA's :8
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Min. : 0.0 Min. : 334 Min. : 0 Min. : 0.000
## 1st Qu.: 795.8 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## Median : 991.5 Median :1087 Median : 0 Median : 0.000
## Mean :1057.4 Mean :1163 Mean : 347 Mean : 5.845
## 3rd Qu.:1298.2 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## Max. :6110.0 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.000 Min. :1900 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5
## Median :1.000 Median :1980 Median :2.000 Median : 480.0
## Mean :0.613 Mean :1979 Mean :1.767 Mean : 473.0
## 3rd Qu.:1.000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :3.000 Max. :2010 Max. :4.000 Max. :1418.0
## NA's :81
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 15.06 Mean : 2.759 Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :480.00 Max. :738.000 Max. :15500.00 Max. :12.000
##
## YrSold SalePrice
## Min. :2006 Min. : 34900
## 1st Qu.:2007 1st Qu.:129975
## Median :2008 Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
## [1] 38
## [1] "list"
## [1] "Dropping ID column"
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 50.0 Median : 69.00 Median : 9478 Median : 6.000
## Mean : 56.9 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :190.0 Max. :313.00 Max. :215245 Max. :10.000
## NA's :259
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1994 Median : 0.0
## Mean :5.575 Mean :1971 Mean :1985 Mean : 103.7
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## NA's :8
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## Median : 383.5 Median : 0.00 Median : 477.5 Median : 991.5
## Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000
## Median :3.000 Median :1.000 Median : 6.000 Median :1.000
## Mean :2.866 Mean :1.047 Mean : 6.518 Mean :0.613
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :8.000 Max. :3.000 Max. :14.000 Max. :3.000
##
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00
## Median :1980 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.767 Mean : 473.0 Mean : 94.24
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## NA's :81
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
## 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
##
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 2.759 Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :738.000 Max. :15500.00 Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
## [1] 37
MODEL 1.
##
## Call:
## lm(formula = SalePrice ~ ., data = housing_test_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -442865 -16873 -2581 14998 318042
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.232e+05 1.701e+06 -0.190 0.849317
## MSSubClass -2.005e+02 3.449e+01 -5.814 8.03e-09 ***
## LotFrontage -1.161e+02 6.124e+01 -1.896 0.058203 .
## LotArea 5.454e-01 1.573e-01 3.466 0.000548 ***
## OverallQual 1.870e+04 1.478e+03 12.646 < 2e-16 ***
## OverallCond 5.227e+03 1.367e+03 3.824 0.000139 ***
## YearBuilt 3.170e+02 8.762e+01 3.617 0.000311 ***
## YearRemodAdd 1.206e+02 8.661e+01 1.392 0.164174
## MasVnrArea 3.160e+01 7.006e+00 4.511 7.15e-06 ***
## BsmtFinSF1 1.739e+01 5.835e+00 2.980 0.002947 **
## BsmtFinSF2 8.362e+00 8.763e+00 0.954 0.340205
## BsmtUnfSF 5.006e+00 5.275e+00 0.949 0.342890
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 4.591e+01 7.356e+00 6.241 6.21e-10 ***
## X2ndFlrSF 4.668e+01 6.099e+00 7.654 4.28e-14 ***
## LowQualFinSF 3.415e+01 2.788e+01 1.225 0.220788
## GrLivArea NA NA NA NA
## BsmtFullBath 8.980e+03 3.194e+03 2.812 0.005018 **
## BsmtHalfBath 2.490e+03 5.071e+03 0.491 0.623487
## FullBath 5.390e+03 3.529e+03 1.527 0.126941
## HalfBath -1.119e+03 3.320e+03 -0.337 0.736244
## BedroomAbvGr -1.023e+04 2.154e+03 -4.750 2.30e-06 ***
## KitchenAbvGr -2.193e+04 6.704e+03 -3.271 0.001105 **
## TotRmsAbvGrd 5.440e+03 1.486e+03 3.661 0.000263 ***
## Fireplaces 4.375e+03 2.188e+03 2.000 0.045793 *
## GarageYrBlt -4.914e+01 9.093e+01 -0.540 0.589011
## GarageCars 1.679e+04 3.487e+03 4.815 1.68e-06 ***
## GarageArea 6.488e+00 1.211e+01 0.536 0.592338
## WoodDeckSF 2.155e+01 1.002e+01 2.151 0.031713 *
## OpenPorchSF -2.315e+00 1.948e+01 -0.119 0.905404
## EnclosedPorch 7.233e+00 2.061e+01 0.351 0.725733
## X3SsnPorch 3.458e+01 3.749e+01 0.922 0.356593
## ScreenPorch 5.797e+01 2.040e+01 2.842 0.004572 **
## PoolArea -6.126e+01 2.984e+01 -2.053 0.040326 *
## MiscVal -3.850e+00 6.955e+00 -0.554 0.579980
## MoSold -2.240e+02 4.227e+02 -0.530 0.596213
## YrSold -2.536e+02 8.454e+02 -0.300 0.764216
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36790 on 1086 degrees of freedom
## (339 observations deleted due to missingness)
## Multiple R-squared: 0.8095, Adjusted R-squared: 0.8036
## F-statistic: 135.7 on 34 and 1086 DF, p-value: < 2.2e-16
In the summary above, TotalBsmtSF and GrLivArea have NA coefficients because of singularities: each is an exact linear combination of other predictors in the model. We will remove these variables and refine the model over several iterations.
The R-squared of 0.8095 (adjusted 0.8036) indicates that the model explains about 81% of the variability in SalePrice.
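The NA coefficients can be reproduced on a toy example. In this sketch the data frame `d` and its columns are simulated and hypothetical; `total` is built as an exact sum of two other predictors, which is the kind of linear dependence that lm() reports as a singularity.

```r
set.seed(3)
d <- data.frame(a = runif(50, 0, 100), b = runif(50, 0, 100))
d$total <- d$a + d$b                    # exact linear combination of a and b
d$y <- 2 * d$a + 3 * d$b + rnorm(50)

fit <- lm(y ~ a + b + total, data = d)
coef(fit)    # the coefficient for 'total' is NA
alias(fit)   # reports which terms are linearly dependent
```

Dropping the redundant column leaves the fitted values unchanged, since it carries no information beyond the other predictors.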
MODEL 2.
model <- lm(SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF +
BsmtFullBath + FullBath + BedroomAbvGr + KitchenAbvGr +
TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF +
PoolArea, data = housing_test_df)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea +
## OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea +
## BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF +
## LowQualFinSF + BsmtFullBath + FullBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + PoolArea,
## data = housing_test_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -458575 -17710 -2478 14079 319298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.599e+05 1.417e+05 -6.772 2.01e-11 ***
## MSSubClass -1.894e+02 3.208e+01 -5.903 4.66e-09 ***
## LotFrontage -9.584e+01 5.771e+01 -1.661 0.097040 .
## LotArea 5.622e-01 1.550e-01 3.627 0.000299 ***
## OverallQual 1.754e+04 1.377e+03 12.737 < 2e-16 ***
## OverallCond 4.452e+03 1.215e+03 3.665 0.000259 ***
## YearBuilt 2.842e+02 6.182e+01 4.597 4.75e-06 ***
## YearRemodAdd 1.728e+02 7.565e+01 2.285 0.022511 *
## MasVnrArea 3.593e+01 6.845e+00 5.249 1.82e-07 ***
## BsmtFinSF1 1.909e+01 5.439e+00 3.509 0.000467 ***
## BsmtFinSF2 9.341e+00 8.347e+00 1.119 0.263309
## BsmtUnfSF 7.781e+00 4.876e+00 1.596 0.110786
## X1stFlrSF 4.706e+01 6.920e+00 6.801 1.66e-11 ***
## X2ndFlrSF 4.666e+01 5.051e+00 9.236 < 2e-16 ***
## LowQualFinSF 3.711e+01 2.180e+01 1.702 0.088936 .
## BsmtFullBath 9.736e+03 2.887e+03 3.372 0.000770 ***
## FullBath 7.073e+03 3.000e+03 2.358 0.018543 *
## BedroomAbvGr -9.862e+03 1.943e+03 -5.077 4.45e-07 ***
## KitchenAbvGr -1.338e+04 5.778e+03 -2.315 0.020776 *
## TotRmsAbvGrd 4.923e+03 1.417e+03 3.473 0.000533 ***
## Fireplaces 5.458e+03 2.070e+03 2.637 0.008477 **
## GarageCars 1.129e+04 1.914e+03 5.901 4.73e-09 ***
## WoodDeckSF 1.995e+01 9.538e+00 2.092 0.036649 *
## PoolArea -6.132e+01 2.889e+01 -2.123 0.033965 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36570 on 1171 degrees of freedom
## (265 observations deleted due to missingness)
## Multiple R-squared: 0.8104, Adjusted R-squared: 0.8067
## F-statistic: 217.7 on 23 and 1171 DF, p-value: < 2.2e-16
Some predictors have p-values greater than 0.05, so they are not statistically significant: the data do not provide enough evidence that their coefficients differ from 0.
Model 3 removes the predictors with p-values greater than 0.05 (LotFrontage, BsmtFinSF2, BsmtUnfSF and LowQualFinSF); FullBath and PoolArea are also dropped.
MODEL 3.
model <- lm(SalePrice ~ MSSubClass + LotArea + OverallQual +
OverallCond + YearBuilt + YearRemodAdd + MasVnrArea +
BsmtFinSF1 + X1stFlrSF + X2ndFlrSF + BsmtFullBath +
BedroomAbvGr + KitchenAbvGr +
TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF,
data = housing_test_df)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## X1stFlrSF + X2ndFlrSF + BsmtFullBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF, data = housing_test_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -487250 -16180 -2016 13330 285692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.043e+06 1.157e+05 -9.016 < 2e-16 ***
## MSSubClass -1.651e+02 2.583e+01 -6.393 2.20e-10 ***
## LotArea 4.223e-01 1.002e-01 4.215 2.66e-05 ***
## OverallQual 1.821e+04 1.139e+03 15.977 < 2e-16 ***
## OverallCond 4.017e+03 1.001e+03 4.012 6.32e-05 ***
## YearBuilt 3.062e+02 5.184e+01 5.907 4.34e-09 ***
## YearRemodAdd 1.929e+02 6.482e+01 2.976 0.002973 **
## MasVnrArea 3.232e+01 5.892e+00 5.486 4.87e-08 ***
## BsmtFinSF1 1.080e+01 2.983e+00 3.620 0.000304 ***
## X1stFlrSF 5.593e+01 4.632e+00 12.075 < 2e-16 ***
## X2ndFlrSF 4.715e+01 4.100e+00 11.501 < 2e-16 ***
## BsmtFullBath 8.545e+03 2.379e+03 3.591 0.000340 ***
## BedroomAbvGr -9.399e+03 1.665e+03 -5.645 1.99e-08 ***
## KitchenAbvGr -1.339e+04 5.097e+03 -2.627 0.008694 **
## TotRmsAbvGrd 5.166e+03 1.214e+03 4.255 2.22e-05 ***
## Fireplaces 3.938e+03 1.732e+03 2.273 0.023147 *
## GarageCars 1.058e+04 1.693e+03 6.253 5.31e-10 ***
## WoodDeckSF 2.099e+01 7.857e+00 2.672 0.007630 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34830 on 1434 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8093, Adjusted R-squared: 0.807
## F-statistic: 358 on 17 and 1434 DF, p-value: < 2.2e-16
As the summary above shows, no variable in this model has a p-value greater than 0.05, so we will use it as our final model. The R-squared of 0.8093 indicates that the model explains about 81% of the variability of the response data around its mean.
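To make the R-squared figures in the summary concrete, the sketch below recomputes them by hand from the residuals. It is illustrative only: mtcars stands in for the housing data, and the variable names (rss, tss, r2, adj_r2) are my own. The last line applies the adjusted R-squared formula to the numbers reported in the summary above, taking n = 1434 + 17 + 1 = 1452 observations and p = 17 predictors.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # illustrative stand-in model

rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2  <- 1 - rss / tss                            # share of variance explained

n <- nobs(fit)                # observations used in the fit
p <- length(coef(fit)) - 1    # predictors, excluding the intercept
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both should agree with what summary() reports.
all.equal(r2, summary(fit)$r.squared)
all.equal(adj_r2, summary(fit)$adj.r.squared)

# Same formula with the figures from the final housing model.
1 - (1 - 0.8093) * (1452 - 1) / (1452 - 17 - 1)
```

The last expression comes out at roughly 0.807, which agrees with the adjusted R-squared printed in the summary. The small gap between the two statistics reflects the penalty for using 17 predictors.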
Load the test data into a dataframe and then predict the results using this model.
test_data <- read.csv("test.csv", header = TRUE, sep = ",",stringsAsFactors = FALSE)
head(test_data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1461 20 RH 80 11622 Pave <NA> Reg
## 2 1462 20 RL 81 14267 Pave <NA> IR1
## 3 1463 60 RL 74 13830 Pave <NA> IR1
## 4 1464 60 RL 78 9978 Pave <NA> IR1
## 5 1465 120 RL 43 5005 Pave <NA> IR1
## 6 1466 60 RL 75 10000 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2
## 1 Lvl AllPub Inside Gtl NAmes Feedr Norm
## 2 Lvl AllPub Corner Gtl NAmes Norm Norm
## 3 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 4 Lvl AllPub Inside Gtl Gilbert Norm Norm
## 5 HLS AllPub Inside Gtl StoneBr Norm Norm
## 6 Lvl AllPub Corner Gtl Gilbert Norm Norm
## BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1 1Fam 1Story 5 6 1961 1961 Gable
## 2 1Fam 1Story 6 6 1958 1958 Hip
## 3 1Fam 2Story 5 5 1997 1998 Gable
## 4 1Fam 2Story 6 6 1998 1998 Gable
## 5 TwnhsE 1Story 8 5 1992 1992 Gable
## 6 1Fam 2Story 6 5 1993 1994 Gable
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## 1 CompShg VinylSd VinylSd None 0 TA TA
## 2 CompShg Wd Sdng Wd Sdng BrkFace 108 TA TA
## 3 CompShg VinylSd VinylSd None 0 TA TA
## 4 CompShg VinylSd VinylSd BrkFace 20 TA TA
## 5 CompShg HdBoard HdBoard None 0 Gd TA
## 6 CompShg HdBoard HdBoard None 0 TA TA
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 1 CBlock TA TA No Rec 468
## 2 CBlock TA TA No ALQ 923
## 3 PConc Gd TA No GLQ 791
## 4 PConc TA TA No GLQ 602
## 5 PConc Gd TA No ALQ 263
## 6 PConc Gd TA No Unf 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## 1 LwQ 144 270 882 GasA TA Y
## 2 Unf 0 406 1329 GasA TA Y
## 3 Unf 0 137 928 GasA Gd Y
## 4 Unf 0 324 926 GasA Ex Y
## 5 Unf 0 1017 1280 GasA Ex Y
## 6 Unf 0 763 763 GasA Gd Y
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## 1 SBrkr 896 0 0 896 0
## 2 SBrkr 1329 0 0 1329 0
## 3 SBrkr 928 701 0 1629 0
## 4 SBrkr 926 678 0 1604 0
## 5 SBrkr 1280 0 0 1280 0
## 6 SBrkr 763 892 0 1655 0
## BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## 1 0 1 0 2 1 TA
## 2 0 1 1 3 1 Gd
## 3 0 2 1 3 1 TA
## 4 0 2 1 3 1 Gd
## 5 0 2 0 2 1 Gd
## 6 0 2 1 3 1 TA
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 1 5 Typ 0 <NA> Attchd 1961
## 2 6 Typ 0 <NA> Attchd 1958
## 3 6 Typ 1 TA Attchd 1997
## 4 7 Typ 1 Gd Attchd 1998
## 5 5 Typ 0 <NA> Attchd 1992
## 6 7 Typ 1 TA Attchd 1993
## GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive
## 1 Unf 1 730 TA TA Y
## 2 Unf 1 312 TA TA Y
## 3 Fin 2 482 TA TA Y
## 4 Fin 2 470 TA TA Y
## 5 RFn 2 506 TA TA Y
## 6 Fin 2 440 TA TA Y
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC
## 1 140 0 0 0 120 0 <NA>
## 2 393 36 0 0 0 0 <NA>
## 3 212 34 0 0 0 0 <NA>
## 4 360 36 0 0 0 0 <NA>
## 5 0 82 0 0 144 0 <NA>
## 6 157 84 0 0 0 0 <NA>
## Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
## 1 MnPrv <NA> 0 6 2010 WD Normal
## 2 <NA> Gar2 12500 6 2010 WD Normal
## 3 MnPrv <NA> 0 3 2010 WD Normal
## 4 <NA> <NA> 0 6 2010 WD Normal
## 5 <NA> <NA> 0 1 2010 WD Normal
## 6 <NA> <NA> 0 4 2010 WD Normal
Some variables contain NA values, which cause the predicted values to come out as NA and led to errors when uploading the result file to Kaggle.
To avoid this, so that the predicted values can be submitted to Kaggle, we replace NA with 0.
test_data[is.na(test_data)] <- 0  # replace every NA in the test set with 0
predic_result <- predict(model, test_data)  # predicted SalePrice for each test row
predict_df <- data.frame(Id = test_data$Id, SalePrice = predic_result)  # the two submission columns
head(predict_df, 10)
## Id SalePrice
## 1 1461 114786.8
## 2 1462 166311.5
## 3 1463 173391.9
## 4 1464 199975.7
## 5 1465 188458.3
## 6 1466 183230.9
## 7 1467 197252.5
## 8 1468 172831.8
## 9 1469 211915.6
## 10 1470 114604.4
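Replacing every NA with 0 is the simplest way to get a complete set of predictions, but it treats a missing measurement as a true zero. One possible alternative, shown here only as a sketch and not used for the submission above, is to fill gaps in the model's predictors with the training-set median. The data frames train_toy and test_toy and the predictors vector are hypothetical stand-ins for the real data.

```r
# Hypothetical toy frames standing in for the training and test sets.
train_toy <- data.frame(GarageCars = c(1, 2, 2, 3),
                        BsmtFinSF1 = c(0, 400, 650, 900))
test_toy  <- data.frame(Id = 1:3,
                        GarageCars = c(2, NA, 1),
                        BsmtFinSF1 = c(NA, 500, 300))

# Only the columns the model actually uses need to be complete.
predictors <- c("GarageCars", "BsmtFinSF1")

# How many values are missing in each predictor of the test set?
colSums(is.na(test_toy[predictors]))

# Fill each gap with the training-set median of the same column.
for (v in predictors) {
  missing <- is.na(test_toy[[v]])
  test_toy[[v]][missing] <- median(train_toy[[v]], na.rm = TRUE)
}

test_toy
```

Restricting the fill to the predictors leaves the other columns untouched, and using the training median keeps the imputed values on the same scale the model was fitted on.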
Submitting my file to the competition:
##
## 0%| | 0.00/31.2k [00:00<?, ?B/s]
## 100%|██████████| 31.2k/31.2k [00:01<00:00, 31.7kB/s]
## 403 - Your team has used its submission allowance (10 of 10). This resets at midnight UTC (55 minutes from now).
kaggle competitions submissions -c house-prices-advanced-regression-techniques > score_array
#cat score_array
score=$(head -3 score_array)
printf "My Submission details:- \n\n %s" "$score"  # pass the text as an argument so any % in it is printed literally
## My Submission details:-
##
## fileName date description status publicScore privateScore
## ---------------------------- ------------------- ---------------------------------------------- -------- ----------- ------------
## submission-ashishsm1986.csv 2020-05-22 04:54:57 DATA605 Submittion Ashish Kumar (ashishsm1986) complete 0.26387 None
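To help read the publicScore column: as far as I understand the competition's evaluation page, submissions are scored by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, so lower is better. The sketch below shows that calculation under this assumption. The function name rmsle and the prices are made up for illustration, and the result is not the score reported above.

```r
# Hypothetical helper: RMSE computed on the log scale.
rmsle <- function(actual, predicted) {
  # log() needs strictly positive prices; a linear model can predict <= 0.
  stopifnot(all(actual > 0), all(predicted > 0))
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Made-up prices, for illustration only.
actual    <- c(200000, 150000, 310000)
predicted <- c(210000, 140000, 300000)
rmsle(actual, predicted)
```

Because the errors are measured on logs, a given percentage miss counts about the same for a cheap house as for an expensive one.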