3. Final Problem 3 (30 points)

3.1 Descriptive and Inferential Statistics

Load the data from GitHub. The training dataset contains 1,460 observations of 81 variables; the test dataset contains 1,459 observations of 80 variables.

train_house <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/final_exam/train.csv', header=TRUE, sep=",", check.names=FALSE)
test_house <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/final_exam/test.csv', header=TRUE, sep=",", check.names=FALSE)

Univariate descriptive statistics and appropriate plots for the training data set.
Descriptive statistics are provided by the psych package; the output includes the count of non-NA values, standard deviation, interquartile range, and standard error, with categorical variables converted to numeric (marked with *). The base summary() output below is condensed to its highlights.

(The full summary() table for all 81 variables is omitted here; the 43 character columns are reported only as Length:1460 / Class:character / Mode:character. Highlights from the numeric columns: SalePrice has Min 34,900, 1st Qu. 129,975, Median 163,000, Mean 180,921, 3rd Qu. 214,000, Max 755,000. Missing values are concentrated in LotFrontage (259 NAs), MasVnrArea (8 NAs), and GarageYrBlt (81 NAs).)

Let’s check the distribution of the sale price in the training data frame. The distribution is right-skewed; the minimum price is 34,900 and the maximum is 755,000.

train_house %>%
  ggplot(aes(x = SalePrice)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density(col = 'red', size = 2) +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "Distribution of house prices", x = "Price", y = "Density")  # y is a density here, not a count

We can also build a graph to see whether there is any connection between sale price and year built.

train_house %>%
  ggplot(aes(x = YearBuilt, y = SalePrice)) +
  geom_point() +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "House prices by year built", x = "Year built", y = "Sale price")

Most houses have more than five rooms above grade (bathrooms are not included in this count), as the boxplots below show.

# melt() comes from the reshape2 package
columns <- c('FullBath', 'TotRmsAbvGrd', 'GarageCars')
data <- melt(train_house[columns])

ggplot(data, aes(x = variable, y = value)) +
  geom_boxplot()
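We can also quantify “usually” directly; a quick sketch (TotRmsAbvGrd counts the rooms above grade, excluding bathrooms):

# Share of houses with more than five rooms above grade;
# the median of 6 from the summary above supports the claim
mean(train_house$TotRmsAbvGrd > 5)
summary(train_house$TotRmsAbvGrd)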

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
The dependent variable is SalePrice; the independent variables are GrLivArea (above grade living area in square feet) and GarageArea, since both appear to be correlated with the price.

train_house %>%
  ggplot(aes(x = SalePrice, y = GarageArea)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", color = 'red') +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "Sale price vs. garage area", x = "Sale price", y = "Garage area, sq. ft")  # labels fixed to match the plotted variables

train_house %>%
  ggplot(aes(x = SalePrice, y = GrLivArea)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", color = 'red') +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "Sale price vs. above grade living area", x = "Sale price", y = "Above grade living area, sq. ft")

Derive a correlation matrix for any three quantitative variables in the dataset.
The variables are LotArea, GarageArea, and GrLivArea, with SalePrice included for reference. There is a clear positive correlation between sale price and both garage area and above grade living area.

salePrice <- train_house$SalePrice
areaLot <- train_house$LotArea
garageArea <- train_house$GarageArea
livAreaF <- train_house$GrLivArea
df <- data.frame(salePrice,areaLot,garageArea,livAreaF)
colnames(df) <- c("SalePrice", "LotArea",  "GarageArea", "GrLivArea")

corr_results <- cor(df)
round(corr_results, 3)
##            SalePrice LotArea GarageArea GrLivArea
## SalePrice      1.000   0.264      0.623     0.709
## LotArea        0.264   1.000      0.180     0.263
## GarageArea     0.623   0.180      1.000     0.469
## GrLivArea      0.709   0.263      0.469     1.000
# corrplot() comes from the corrplot package
corrplot(corr_results, type = "lower", order = "original", tl.srt = 0)

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
H0: the correlation is 0.
H1: the correlation is not 0.
We use the function cor.test() (from the base stats package) to test the hypotheses. The data contain missing values and outliers, and many variables can affect the sale price; so far we have identified several: year built, above grade living area, and garage area.
Running the tests, we reject the null hypothesis that each correlation equals 0 in favor of the alternative: the correlations are not 0. The sample correlations are 0.18, 0.47, and 0.26, and every p-value (the largest is 3.8e-12) falls far below the 0.05 significance level.
As for familywise (Type I) error: yes, it is a concern. With three pairwise tests at the 0.05 level, the probability of at least one false positive is 1 - 0.95^3 ≈ 14.26%, as computed below. The observed p-values are so small, however, that the conclusions survive any standard multiplicity correction.

corr_GarageArea_LotArea <- cor.test(df$GarageArea, df$LotArea, method = "pearson", conf.level=0.8)
corr_GarageArea_LotArea
## 
##  Pearson's product-moment correlation
## 
## data:  df$GarageArea and df$LotArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1477356 0.2126767
## sample estimates:
##       cor 
## 0.1804028
corr_Garagearea_GrLivArea <- cor.test(df$GarageArea, df$GrLivArea , method = "pearson", conf.level=0.8)
corr_Garagearea_GrLivArea
## 
##  Pearson's product-moment correlation
## 
## data:  df$GarageArea and df$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975
corr_GrLivArea_LotArea <- cor.test(df$GrLivArea, df$LotArea , method = "pearson", conf.level=0.8)
corr_GrLivArea_LotArea
## 
##  Pearson's product-moment correlation
## 
## data:  df$GrLivArea and df$LotArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162
n <- 3          # number of pairwise tests
a <- .05        # per-test significance level
1 - (1 - a)^n   # familywise error rate
## [1] 0.142625
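If we wanted to control this familywise error rate, the three p-values could be adjusted for multiple comparisons. A minimal sketch using Holm’s method (method = "bonferroni" would work as well):

# Adjust the three pairwise p-values for multiple testing;
# all of them remain far below 0.05 after adjustment
pvals <- c(corr_GarageArea_LotArea$p.value,
           corr_Garagearea_GrLivArea$p.value,
           corr_GrLivArea_LotArea$p.value)
p.adjust(pvals, method = "holm")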

3.2 Linear Algebra and Correlation

Invert your correlation matrix from above. The original correlation matrix:

cor_data <- data.frame(areaLot, garageArea, livAreaF)
colnames(cor_data) <- c("LotArea", "GarageArea", "GrLivArea")
cor_data <- cor(cor_data, method = "pearson", use = "complete.obs")

cor_matrix <- data.matrix(cor_data)
rownames(cor_matrix) <- c("LotArea", "GarageArea", "GrLivArea")
colnames(cor_matrix) <- c("LotArea", "GarageArea", "GrLivArea")
cor_matrix
##              LotArea GarageArea GrLivArea
## LotArea    1.0000000  0.1804028 0.2631162
## GarageArea 0.1804028  1.0000000 0.4689975
## GrLivArea  0.2631162  0.4689975 1.0000000

The inverse (the precision matrix):

precision_matrix <- solve(cor_matrix)
precision_matrix
##                LotArea  GarageArea  GrLivArea
## LotArea     1.07920917 -0.07886378 -0.2469705
## GarageArea -0.07886378  1.28774631 -0.5831994
## GrLivArea  -0.24697046 -0.58319943  1.3385010

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
Both orders give the same result: the identity matrix (up to rounding).

matrix1 <- round(cor_matrix %*% precision_matrix, 2)
matrix1
##            LotArea GarageArea GrLivArea
## LotArea          1          0         0
## GarageArea       0          1         0
## GrLivArea        0          0         1
matrix2 <- round(precision_matrix %*% cor_matrix, 2)
matrix2
##            LotArea GarageArea GrLivArea
## LotArea          1          0         0
## GarageArea       0          1         0
## GrLivArea        0          0         1
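Rounding to two digits can hide small discrepancies; a stricter numerical check is the largest absolute deviation from the exact identity (a sketch):

# Should be on the order of machine epsilon (~1e-16)
max(abs(cor_matrix %*% precision_matrix - diag(3)))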

Conduct LU decomposition on the matrix.

# lu.decomposition() comes from the matrixcalc package
decomp <- lu.decomposition(cor_matrix)
# The lower triangular matrix L
decomp$L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.1804028 1.0000000    0
## [3,] 0.2631162 0.4357109    1
#The upper triangular matrix U
decomp$U
##      [,1]      [,2]      [,3]
## [1,]    1 0.1804028 0.2631162
## [2,]    0 0.9674548 0.4215306
## [3,]    0 0.0000000 0.7471044

To check the work, we can multiply L by U and get the correlation matrix again:

decomp$L %*% decomp$U
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.1804028 0.2631162
## [2,] 0.1804028 1.0000000 0.4689975
## [3,] 0.2631162 0.4689975 1.0000000
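As a cross-check, the Matrix package provides its own LU routine; a sketch (it uses partial pivoting, so the factors agree with the above up to the row permutation P):

library(Matrix)
elu <- expand(lu(cor_matrix))  # list with factors L, U and permutation P
elu$L
elu$U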

3.3 Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
To see which variables are skewed to the right, we build the grid of histograms below:

# keep() comes from purrr and gather() from tidyr (both in the tidyverse)
train_house %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 100)
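Skewness can also be ranked numerically rather than read off the histograms; a sketch using psych::skew():

# Ten most right-skewed numeric variables (larger value = more right-skewed)
train_house %>%
  keep(is.numeric) %>%
  sapply(psych::skew) %>%
  sort(decreasing = TRUE) %>%
  head(10)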

I chose the above grade living area, GrLivArea:

train_house %>%
  ggplot(aes(x = GrLivArea)) +
  geom_histogram() +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "Distribution of the living area", x = "Area, sq. ft", y = "Count")

data <- train_house$GrLivArea
describe(data)
##    vars    n    mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 1460 1515.46 525.48   1464 1467.67 483.33 334 5642  5308 1.36     4.86
##       se
## X1 13.75
summary(data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642

Load the MASS package and run fitdistr to fit an exponential probability density function:

library(MASS)
exp_prob <- fitdistr(data, densfun="exponential")

Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).
The lambda is 0.000659864.

lambda <- exp_prob$estimate   # lambda = 0.000659864

samples <- rexp(1000, lambda)
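As a sanity check, the maximum-likelihood estimate of the exponential rate is the reciprocal of the sample mean, so this should reproduce lambda:

# Should equal lambda = 0.000659864
1 / mean(data)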

Plot a histogram and compare it with a histogram of your original variable.
Both histograms are right-skewed, but the simulated exponential sample decays steadily from zero, while the original variable has a clear peak near its median (about 1,500 sq. ft).

par(mfrow=c(1,2))
summary(samples)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.659   431.393  1014.606  1489.212  2080.674 11304.289
hist(samples)
hist(data)

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
The 5th percentile is 77.73 and the 95th percentile is 4539.92.

qexp(.05, rate=lambda)
## [1] 77.73313
qexp(.95, rate=lambda)
## [1] 4539.924
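The same values follow from inverting the exponential CDF F(x) = 1 - exp(-λx) by hand, i.e. x_p = -log(1 - p)/λ:

# Manual inversion of the exponential CDF; matches qexp()
-log(1 - 0.05) / lambda   # 5th percentile
-log(1 - 0.95) / lambda   # 95th percentile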

Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
The 95% confidence interval for the mean from the t-test runs from 1488.487 to 1542.440. The empirical 5th percentile is 848 and the 95th percentile is 2466.1.
The fitted exponential reproduces the mean well (1/λ ≈ 1515.5, matching the sample mean of 1515.46; the simulated sample mean of 1489.2 is also close), but its percentiles are far from the empirical ones: 77.7 vs. 848 at the 5th and 4539.9 vs. 2466.1 at the 95th. The exponential puts much more mass near zero and in the far right tail than the actual living-area data, which cluster around their median. Both distributions are right-skewed, but the exponential is a poor closed-form model for this variable.

t <- t.test(data,conf.level=0.95)
t
## 
##  One Sample t-test
## 
## data:  data
## t = 110.2, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1488.487 1542.440
## sample estimates:
## mean of x 
##  1515.464
quantile(data, c(0.05, 0.95))
##     5%    95% 
##  848.0 2466.1
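The t-test interval can be reproduced by hand from mean ± t(0.975, n-1) * s / sqrt(n), assuming normality of the sample mean (a sketch):

# 95% CI for the mean; matches the t.test() interval above
n <- length(data)
mean(data) + c(-1, 1) * qt(0.975, n - 1) * sd(data) / sqrt(n)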

3.4 Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
For the first attempt, we include all numeric columns except the two with many NAs (LotFrontage and GarageYrBlt) to see how they relate to SalePrice. We then keep the variables with small p-values: MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea, BsmtFinSF1, 1stFlrSF, 2ndFlrSF, BedroomAbvGr, GarageCars.

df <- train_house %>% 
  select_if(is.numeric) 
colSums(is.na(df))
##            Id    MSSubClass   LotFrontage       LotArea   OverallQual 
##             0             0           259             0             0 
##   OverallCond     YearBuilt  YearRemodAdd    MasVnrArea    BsmtFinSF1 
##             0             0             0             8             0 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF      1stFlrSF      2ndFlrSF 
##             0             0             0             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces 
##             0             0             0             0             0 
##   GarageYrBlt    GarageCars    GarageArea    WoodDeckSF   OpenPorchSF 
##            81             0             0             0             0 
## EnclosedPorch     3SsnPorch   ScreenPorch      PoolArea       MiscVal 
##             0             0             0             0             0 
##        MoSold        YrSold     SalePrice 
##             0             0             0
df <- df[, -c(3, 26)]  # drop LotFrontage (259 NAs) and GarageYrBlt (81 NAs)


first <- lm(SalePrice~.,data=df)
summary(first)
## 
## Call:
## lm(formula = SalePrice ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -475057  -16584   -2090   13651  303497 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.050e+05  1.414e+06   0.286 0.774604    
## Id            -1.129e+00  2.187e+00  -0.516 0.605961    
## MSSubClass    -1.703e+02  2.629e+01  -6.479 1.27e-10 ***
## LotArea        4.083e-01  1.007e-01   4.056 5.26e-05 ***
## OverallQual    1.711e+04  1.191e+03  14.366  < 2e-16 ***
## OverallCond    4.397e+03  1.023e+03   4.299 1.83e-05 ***
## YearBuilt      3.250e+02  6.081e+01   5.345 1.05e-07 ***
## YearRemodAdd   1.737e+02  6.593e+01   2.634 0.008529 ** 
## MasVnrArea     3.173e+01  5.957e+00   5.326 1.16e-07 ***
## BsmtFinSF1     1.869e+01  4.660e+00   4.010 6.38e-05 ***
## BsmtFinSF2     7.697e+00  7.049e+00   1.092 0.275032    
## BsmtUnfSF      9.430e+00  4.192e+00   2.250 0.024622 *  
## TotalBsmtSF           NA         NA      NA       NA    
## `1stFlrSF`     4.779e+01  5.773e+00   8.277 2.88e-16 ***
## `2ndFlrSF`     4.862e+01  4.958e+00   9.805  < 2e-16 ***
## LowQualFinSF   2.933e+01  1.971e+01   1.488 0.137033    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   9.578e+03  2.611e+03   3.669 0.000253 ***
## BsmtHalfBath   1.922e+03  4.083e+03   0.471 0.637893    
## FullBath       4.365e+03  2.820e+03   1.548 0.121919    
## HalfBath      -2.176e+03  2.668e+03  -0.816 0.414913    
## BedroomAbvGr  -1.025e+04  1.694e+03  -6.053 1.81e-09 ***
## KitchenAbvGr  -1.224e+04  5.224e+03  -2.343 0.019244 *  
## TotRmsAbvGrd   5.072e+03  1.237e+03   4.099 4.38e-05 ***
## Fireplaces     3.782e+03  1.772e+03   2.134 0.033017 *  
## GarageCars     1.069e+04  2.851e+03   3.749 0.000185 ***
## GarageArea    -1.806e+00  9.699e+00  -0.186 0.852267    
## WoodDeckSF     2.475e+01  7.977e+00   3.103 0.001955 ** 
## OpenPorchSF   -4.028e+00  1.519e+01  -0.265 0.790973    
## EnclosedPorch  1.271e+01  1.688e+01   0.753 0.451683    
## `3SsnPorch`    1.813e+01  3.136e+01   0.578 0.563251    
## ScreenPorch    5.642e+01  1.716e+01   3.287 0.001038 ** 
## PoolArea      -3.303e+01  2.361e+01  -1.399 0.162098    
## MiscVal       -7.375e-01  1.853e+00  -0.398 0.690624    
## MoSold        -8.383e+01  3.461e+02  -0.242 0.808630    
## YrSold        -7.203e+02  7.029e+02  -1.025 0.305669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34710 on 1418 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8127, Adjusted R-squared:  0.8084 
## F-statistic: 186.5 on 33 and 1418 DF,  p-value: < 2.2e-16

We also add a factor variable that looks related to the sale price, ExterQual. The p-values are very small and the adjusted R-squared is 0.81: the model explains about 81% of the variance in sale price, which is good:

df <-  train_house %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea, `BsmtFinSF1`, `1stFlrSF`, `2ndFlrSF`, BedroomAbvGr, GarageCars, SalePrice, ExterQual) 
second <- lm(SalePrice~.,data=df)
summary(second)
## 
## Call:
## lm(formula = SalePrice ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -535152  -15659   -1198   12698  276986 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.681e+05  9.255e+04  -7.219 8.48e-13 ***
## MSSubClass   -1.785e+02  2.347e+01  -7.605 5.13e-14 ***
## LotArea       5.343e-01  9.705e-02   5.505 4.36e-08 ***
## OverallQual   1.607e+04  1.164e+03  13.802  < 2e-16 ***
## OverallCond   5.471e+03  9.118e+02   6.000 2.49e-09 ***
## YearBuilt     3.419e+02  4.648e+01   7.357 3.16e-13 ***
## MasVnrArea    2.016e+01  5.877e+00   3.430 0.000620 ***
## BsmtFinSF1    1.744e+01  2.329e+00   7.490 1.20e-13 ***
## `1stFlrSF`    6.179e+01  3.755e+00  16.456  < 2e-16 ***
## `2ndFlrSF`    5.666e+01  3.243e+00  17.469  < 2e-16 ***
## BedroomAbvGr -5.241e+03  1.445e+03  -3.626 0.000298 ***
## GarageCars    1.135e+04  1.660e+03   6.839 1.18e-11 ***
## ExterQualFa  -4.774e+04  1.169e+04  -4.083 4.69e-05 ***
## ExterQualGd  -4.766e+04  5.535e+03  -8.611  < 2e-16 ***
## ExterQualTA  -6.266e+04  6.194e+03 -10.116  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34280 on 1437 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8149, Adjusted R-squared:  0.8131 
## F-statistic: 451.8 on 14 and 1437 DF,  p-value: < 2.2e-16

We should check the model with residual diagnostics. The residuals are roughly centered at zero and approximately normally distributed, so we keep this model.

par(mfrow=c(2,2))
plot(second)

Next, we predict prices for the test data frame. Select the necessary columns and substitute zeros for the few missing values:

test_df <- test_house %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea, `BsmtFinSF1`, `1stFlrSF`, `2ndFlrSF`, BedroomAbvGr, GarageCars, ExterQual)  



test_df <- test_df %>% 
          mutate(BsmtFinSF1 = coalesce(BsmtFinSF1, 0)) %>% 
          mutate(GarageCars = coalesce(GarageCars, 0)) %>%
          mutate(MasVnrArea = coalesce(MasVnrArea, 0))  

colSums(is.na(test_df))
##   MSSubClass      LotArea  OverallQual  OverallCond    YearBuilt   MasVnrArea 
##            0            0            0            0            0            0 
##   BsmtFinSF1     1stFlrSF     2ndFlrSF BedroomAbvGr   GarageCars    ExterQual 
##            0            0            0            0            0            0

Using the predict() function and the model we built, we can predict the sale price. We make a few adjustments to fit the Kaggle submission criteria: add the Id column, drop the row indices, and write the result to a file. The file is also available in the GitHub repo: https://github.com/ex-pr/DATA607/blob/final_exam/prediction.csv

prediction <- predict(second, test_df)
prediction <- as.data.frame(prediction)
names(prediction)[1] <- "SalePrice"

prediction$Id <- test_house$Id
prediction <- prediction[,c(2,1)]
head(prediction)
##     Id SalePrice
## 1 1461  119945.7
## 2 1462  168030.5
## 3 1463  174259.6
## 4 1464  189765.9
## 5 1465  198404.9
## 6 1466  173746.7
write.csv(prediction,"C:/Users/daria/Documents/prediction.csv", row.names = FALSE)
Kaggle score: 0.17298; username: dariadubovskaia.
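Kaggle scores this competition by RMSE on the logarithm of the predicted price. As a rough in-sample analogue (not the leaderboard number, which is computed on the hidden test labels), a sketch on the training fit:

# In-sample RMSLE for the second model; pmax() guards against
# any non-positive fitted values before taking logs
fit_vals <- pmax(fitted(second), 1)
actual   <- model.frame(second)$SalePrice
sqrt(mean((log(fit_vals) - log(actual))^2))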