3. Final Problem 3 (30 points)

3.1 Descriptive and Inferential Statistics

Load the data from GitHub. The training dataset contains 1,460 observations of 81 variables; the test dataset contains 1,459 observations of 80 variables.

train_house <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/final_exam/train.csv', header=TRUE, sep=",", check.names=FALSE)
test_house <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/final_exam/test.csv', header=TRUE, sep=",", check.names=FALSE)

Univariate descriptive statistics and appropriate plots for the training data set.
Descriptive statistics are provided by the psych package; the output includes the count of non-NA values, standard deviation, interquartile range, and standard error, with categorical variables converted to numeric (marked with *). The base summary() output below is condensed to its highlights.

(The full summary() table for all 81 variables is omitted here; the 43 character columns are reported only as Length:1460 / Class:character / Mode:character. Highlights from the numeric columns: SalePrice has Min 34,900, 1st Qu. 129,975, Median 163,000, Mean 180,921, 3rd Qu. 214,000, Max 755,000. Missing values are concentrated in LotFrontage (259 NAs), MasVnrArea (8 NAs), and GarageYrBlt (81 NAs).)

Let’s check the distribution of the sale price in the training data frame. The distribution is right-skewed; the minimum price is 34,900 and the maximum is 755,000.

train_house %>%
  ggplot(aes(x = SalePrice)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density(col = 'red', size = 2) +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "Distribution of house prices", x = "Price", y = "Density")  # y is a density here, not a count

We can also build a graph to see whether there is any connection between sale price and year built.

train_house %>%
  ggplot(aes(x = YearBuilt, y = SalePrice)) +
  geom_point() +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "House prices by year built", x = "Year built", y = "Sale price")

Most houses have more than five rooms above grade (bathrooms are not included in this count), as the boxplots below show.

# melt() comes from the reshape2 package
columns <- c('FullBath', 'TotRmsAbvGrd', 'GarageCars')
data <- melt(train_house[columns])

ggplot(data, aes(x = variable, y = value)) +
  geom_boxplot()
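We can also quantify “usually” directly; a quick sketch (TotRmsAbvGrd counts the rooms above grade, excluding bathrooms):

# Share of houses with more than five rooms above grade;
# the median of 6 from the summary above supports the claim
mean(train_house$TotRmsAbvGrd > 5)
summary(train_house$TotRmsAbvGrd)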

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
The dependent variable is SalePrice; the independent variables are GrLivArea (above grade living area in square feet) and GarageArea, since both appear to be correlated with the price.

train_house %>%
  ggplot(aes(x = SalePrice, y = GarageArea)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", color = 'red') +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "Sale price vs. garage area", x = "Sale price", y = "Garage area, sq. ft")  # labels fixed to match the plotted variables

train_house %>%
  ggplot(aes(x = SalePrice, y = GrLivArea)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", color = 'red') +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "Sale price vs. above grade living area", x = "Sale price", y = "Above grade living area, sq. ft")

Derive a correlation matrix for any three quantitative variables in the dataset.
The variables are LotArea, GarageArea, and GrLivArea, with SalePrice included for reference. There is a clear positive correlation between sale price and both garage area and above grade living area.

salePrice <- train_house$SalePrice
areaLot <- train_house$LotArea
garageArea <- train_house$GarageArea
livAreaF <- train_house$GrLivArea
df <- data.frame(salePrice,areaLot,garageArea,livAreaF)
colnames(df) <- c("SalePrice", "LotArea",  "GarageArea", "GrLivArea")

corr_results <- cor(df)
round(corr_results, 3)
##            SalePrice LotArea GarageArea GrLivArea
## SalePrice      1.000   0.264      0.623     0.709
## LotArea        0.264   1.000      0.180     0.263
## GarageArea     0.623   0.180      1.000     0.469
## GrLivArea      0.709   0.263      0.469     1.000
# corrplot() comes from the corrplot package
corrplot(corr_results, type = "lower", order = "original", tl.srt = 0)

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
H0: the correlation is 0.
H1: the correlation is not 0.
We use the function cor.test() (from the base stats package) to test the hypotheses. The data contain missing values and outliers, and many variables can affect the sale price; so far we have identified several: year built, above grade living area, and garage area.
Running the tests, we reject the null hypothesis that each correlation equals 0 in favor of the alternative: the correlations are not 0. The sample correlations are 0.18, 0.47, and 0.26, and every p-value (the largest is 3.8e-12) falls far below the 0.05 significance level.
As for familywise (Type I) error: yes, it is a concern. With three pairwise tests at the 0.05 level, the probability of at least one false positive is 1 - 0.95^3 ≈ 14.26%, as computed below. The observed p-values are so small, however, that the conclusions survive any standard multiplicity correction.

corr_GarageArea_LotArea <- cor.test(df$GarageArea, df$LotArea, method = "pearson", conf.level=0.8)
corr_GarageArea_LotArea
## 
##  Pearson's product-moment correlation
## 
## data:  df$GarageArea and df$LotArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1477356 0.2126767
## sample estimates:
##       cor 
## 0.1804028
corr_Garagearea_GrLivArea <- cor.test(df$GarageArea, df$GrLivArea , method = "pearson", conf.level=0.8)
corr_Garagearea_GrLivArea
## 
##  Pearson's product-moment correlation
## 
## data:  df$GarageArea and df$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4423993 0.4947713
## sample estimates:
##       cor 
## 0.4689975
corr_GrLivArea_LotArea <- cor.test(df$GrLivArea, df$LotArea , method = "pearson", conf.level=0.8)
corr_GrLivArea_LotArea
## 
##  Pearson's product-moment correlation
## 
## data:  df$GrLivArea and df$LotArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162
n <- 3          # number of pairwise tests
a <- .05        # per-test significance level
1 - (1 - a)^n   # familywise error rate
## [1] 0.142625
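If we wanted to control this familywise error rate, the three p-values could be adjusted for multiple comparisons. A minimal sketch using Holm’s method (method = "bonferroni" would work as well):

# Adjust the three pairwise p-values for multiple testing;
# all of them remain far below 0.05 after adjustment
pvals <- c(corr_GarageArea_LotArea$p.value,
           corr_Garagearea_GrLivArea$p.value,
           corr_GrLivArea_LotArea$p.value)
p.adjust(pvals, method = "holm")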

3.2 Linear Algebra and Correlation

Invert your correlation matrix from above. The original correlation matrix:

cor_data <- data.frame(areaLot, garageArea, livAreaF)
colnames(cor_data) <- c("LotArea", "GarageArea", "GrLivArea")
cor_data <- cor(cor_data, method = "pearson", use = "complete.obs")

cor_matrix <- data.matrix(cor_data)
rownames(cor_matrix) <- c("LotArea", "GarageArea", "GrLivArea")
colnames(cor_matrix) <- c("LotArea", "GarageArea", "GrLivArea")
cor_matrix
##              LotArea GarageArea GrLivArea
## LotArea    1.0000000  0.1804028 0.2631162
## GarageArea 0.1804028  1.0000000 0.4689975
## GrLivArea  0.2631162  0.4689975 1.0000000

The inverse (the precision matrix):

precision_matrix <- solve(cor_matrix)
precision_matrix
##                LotArea  GarageArea  GrLivArea
## LotArea     1.07920917 -0.07886378 -0.2469705
## GarageArea -0.07886378  1.28774631 -0.5831994
## GrLivArea  -0.24697046 -0.58319943  1.3385010

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
Both orders give the same result: the identity matrix (up to rounding).

matrix1 <- round(cor_matrix %*% precision_matrix, 2)
matrix1
##            LotArea GarageArea GrLivArea
## LotArea          1          0         0
## GarageArea       0          1         0
## GrLivArea        0          0         1
matrix2 <- round(precision_matrix %*% cor_matrix, 2)
matrix2
##            LotArea GarageArea GrLivArea
## LotArea          1          0         0
## GarageArea       0          1         0
## GrLivArea        0          0         1
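Rounding to two digits can hide small discrepancies; a stricter numerical check is the largest absolute deviation from the exact identity (a sketch):

# Should be on the order of machine epsilon (~1e-16)
max(abs(cor_matrix %*% precision_matrix - diag(3)))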

Conduct LU decomposition on the matrix.

# lu.decomposition() comes from the matrixcalc package
decomp <- lu.decomposition(cor_matrix)
# The lower triangular matrix L
decomp$L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.1804028 1.0000000    0
## [3,] 0.2631162 0.4357109    1
#The upper triangular matrix U
decomp$U
##      [,1]      [,2]      [,3]
## [1,]    1 0.1804028 0.2631162
## [2,]    0 0.9674548 0.4215306
## [3,]    0 0.0000000 0.7471044

To check the work, we can multiply L by U and get the correlation matrix again:

decomp$L %*% decomp$U
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.1804028 0.2631162
## [2,] 0.1804028 1.0000000 0.4689975
## [3,] 0.2631162 0.4689975 1.0000000
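As a cross-check, the Matrix package provides its own LU routine; a sketch (it uses partial pivoting, so the factors agree with the above up to the row permutation P):

library(Matrix)
elu <- expand(lu(cor_matrix))  # list with factors L, U and permutation P
elu$L
elu$U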

3.3 Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
To see which variables are skewed to the right, we build the grid of histograms below:

# keep() comes from purrr and gather() from tidyr (both in the tidyverse)
train_house %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 100)
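Skewness can also be ranked numerically rather than read off the histograms; a sketch using psych::skew():

# Ten most right-skewed numeric variables (larger value = more right-skewed)
train_house %>%
  keep(is.numeric) %>%
  sapply(psych::skew) %>%
  sort(decreasing = TRUE) %>%
  head(10)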

I chose the above grade living area, GrLivArea:

train_house %>%
  ggplot(aes(x = GrLivArea)) +
  geom_histogram() +
  theme(axis.text.x = element_text(vjust = 0.5, hjust = 1),
        plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey")) +
  labs(title = "Distribution of the living area", x = "Area, sq. ft", y = "Count")

data <- train_house$GrLivArea
describe(data)
##    vars    n    mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 1460 1515.46 525.48   1464 1467.67 483.33 334 5642  5308 1.36     4.86
##       se
## X1 13.75
summary(data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642

Load the MASS package and run fitdistr to fit an exponential probability density function:

library(MASS)
exp_prob <- fitdistr(data, densfun="exponential")

Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).
The lambda is 0.000659864.

lambda <- exp_prob$estimate   # lambda = 0.000659864

samples <- rexp(1000, lambda)
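As a sanity check, the maximum-likelihood estimate of the exponential rate is the reciprocal of the sample mean, so this should reproduce lambda:

# Should equal lambda = 0.000659864
1 / mean(data)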

Plot a histogram and compare it with a histogram of your original variable.
Both histograms are right-skewed, but the simulated exponential sample decays steadily from zero, while the original variable has a clear peak near its median (about 1,500 sq. ft).

par(mfrow=c(1,2))
summary(samples)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.659   431.393  1014.606  1489.212  2080.674 11304.289
hist(samples)
hist(data)

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
The 5th percentile is 77.73 and the 95th percentile is 4539.92.

qexp(.05, rate=lambda)
## [1] 77.73313
qexp(.95, rate=lambda)
## [1] 4539.924
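The same values follow from inverting the exponential CDF F(x) = 1 - exp(-λx) by hand, i.e. x_p = -log(1 - p)/λ:

# Manual inversion of the exponential CDF; matches qexp()
-log(1 - 0.05) / lambda   # 5th percentile
-log(1 - 0.95) / lambda   # 95th percentile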

Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
The 95% confidence interval for the mean from the t-test runs from 1488.487 to 1542.440. The empirical 5th percentile is 848 and the 95th percentile is 2466.1.
The fitted exponential reproduces the mean well (1/λ ≈ 1515.5, matching the sample mean of 1515.46; the simulated sample mean of 1489.2 is also close), but its percentiles are far from the empirical ones: 77.7 vs. 848 at the 5th and 4539.9 vs. 2466.1 at the 95th. The exponential puts much more mass near zero and in the far right tail than the actual living-area data, which cluster around their median. Both distributions are right-skewed, but the exponential is a poor closed-form model for this variable.

t <- t.test(data,conf.level=0.95)
t
## 
##  One Sample t-test
## 
## data:  data
## t = 110.2, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1488.487 1542.440
## sample estimates:
## mean of x 
##  1515.464
quantile(data, c(0.05, 0.95))
##     5%    95% 
##  848.0 2466.1
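The t-test interval can be reproduced by hand from mean ± t(0.975, n-1) * s / sqrt(n), assuming normality of the sample mean (a sketch):

# 95% CI for the mean; matches the t.test() interval above
n <- length(data)
mean(data) + c(-1, 1) * qt(0.975, n - 1) * sd(data) / sqrt(n)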

3.4 Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
For the first attempt, we include all numeric columns except the two with many NAs (LotFrontage and GarageYrBlt) to see how they relate to SalePrice. We then keep the variables with small p-values: MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea, BsmtFinSF1, 1stFlrSF, 2ndFlrSF, BedroomAbvGr, GarageCars.

df <- train_house %>% 
  select_if(is.numeric) 
colSums(is.na(df))
##            Id    MSSubClass   LotFrontage       LotArea   OverallQual 
##             0             0           259             0             0 
##   OverallCond     YearBuilt  YearRemodAdd    MasVnrArea    BsmtFinSF1 
##             0             0             0             8             0 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF      1stFlrSF      2ndFlrSF 
##             0             0             0             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces 
##             0             0             0             0             0 
##   GarageYrBlt    GarageCars    GarageArea    WoodDeckSF   OpenPorchSF 
##            81             0             0             0             0 
## EnclosedPorch     3SsnPorch   ScreenPorch      PoolArea       MiscVal 
##             0             0             0             0             0 
##        MoSold        YrSold     SalePrice 
##             0             0             0
df <- df[, -c(3, 26)]  # drop LotFrontage (259 NAs) and GarageYrBlt (81 NAs)


first <- lm(SalePrice~.,data=df)
summary(first)
## 
## Call:
## lm(formula = SalePrice ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -475057  -16584   -2090   13651  303497 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.050e+05  1.414e+06   0.286 0.774604    
## Id            -1.129e+00  2.187e+00  -0.516 0.605961    
## MSSubClass    -1.703e+02  2.629e+01  -6.479 1.27e-10 ***
## LotArea        4.083e-01  1.007e-01   4.056 5.26e-05 ***
## OverallQual    1.711e+04  1.191e+03  14.366  < 2e-16 ***
## OverallCond    4.397e+03  1.023e+03   4.299 1.83e-05 ***
## YearBuilt      3.250e+02  6.081e+01   5.345 1.05e-07 ***
## YearRemodAdd   1.737e+02  6.593e+01   2.634 0.008529 ** 
## MasVnrArea     3.173e+01  5.957e+00   5.326 1.16e-07 ***
## BsmtFinSF1     1.869e+01  4.660e+00   4.010 6.38e-05 ***
## BsmtFinSF2     7.697e+00  7.049e+00   1.092 0.275032    
## BsmtUnfSF      9.430e+00  4.192e+00   2.250 0.024622 *  
## TotalBsmtSF           NA         NA      NA       NA    
## `1stFlrSF`     4.779e+01  5.773e+00   8.277 2.88e-16 ***
## `2ndFlrSF`     4.862e+01  4.958e+00   9.805  < 2e-16 ***
## LowQualFinSF   2.933e+01  1.971e+01   1.488 0.137033    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   9.578e+03  2.611e+03   3.669 0.000253 ***
## BsmtHalfBath   1.922e+03  4.083e+03   0.471 0.637893    
## FullBath       4.365e+03  2.820e+03   1.548 0.121919    
## HalfBath      -2.176e+03  2.668e+03  -0.816 0.414913    
## BedroomAbvGr  -1.025e+04  1.694e+03  -6.053 1.81e-09 ***
## KitchenAbvGr  -1.224e+04  5.224e+03  -2.343 0.019244 *  
## TotRmsAbvGrd   5.072e+03  1.237e+03   4.099 4.38e-05 ***
## Fireplaces     3.782e+03  1.772e+03   2.134 0.033017 *  
## GarageCars     1.069e+04  2.851e+03   3.749 0.000185 ***
## GarageArea    -1.806e+00  9.699e+00  -0.186 0.852267    
## WoodDeckSF     2.475e+01  7.977e+00   3.103 0.001955 ** 
## OpenPorchSF   -4.028e+00  1.519e+01  -0.265 0.790973    
## EnclosedPorch  1.271e+01  1.688e+01   0.753 0.451683    
## `3SsnPorch`    1.813e+01  3.136e+01   0.578 0.563251    
## ScreenPorch    5.642e+01  1.716e+01   3.287 0.001038 ** 
## PoolArea      -3.303e+01  2.361e+01  -1.399 0.162098    
## MiscVal       -7.375e-01  1.853e+00  -0.398 0.690624    
## MoSold        -8.383e+01  3.461e+02  -0.242 0.808630    
## YrSold        -7.203e+02  7.029e+02  -1.025 0.305669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34710 on 1418 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8127, Adjusted R-squared:  0.8084 
## F-statistic: 186.5 on 33 and 1418 DF,  p-value: < 2.2e-16

We also add a factor variable that looks related to the sale price, ExterQual. The p-values are very small and the adjusted R-squared is 0.81: the model explains about 81% of the variance in sale price, which is good:

df <-  train_house %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea, `BsmtFinSF1`, `1stFlrSF`, `2ndFlrSF`, BedroomAbvGr, GarageCars, SalePrice, ExterQual) 
second <- lm(SalePrice~.,data=df)
summary(second)
## 
## Call:
## lm(formula = SalePrice ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -535152  -15659   -1198   12698  276986 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.681e+05  9.255e+04  -7.219 8.48e-13 ***
## MSSubClass   -1.785e+02  2.347e+01  -7.605 5.13e-14 ***
## LotArea       5.343e-01  9.705e-02   5.505 4.36e-08 ***
## OverallQual   1.607e+04  1.164e+03  13.802  < 2e-16 ***
## OverallCond   5.471e+03  9.118e+02   6.000 2.49e-09 ***
## YearBuilt     3.419e+02  4.648e+01   7.357 3.16e-13 ***
## MasVnrArea    2.016e+01  5.877e+00   3.430 0.000620 ***
## BsmtFinSF1    1.744e+01  2.329e+00   7.490 1.20e-13 ***
## `1stFlrSF`    6.179e+01  3.755e+00  16.456  < 2e-16 ***
## `2ndFlrSF`    5.666e+01  3.243e+00  17.469  < 2e-16 ***
## BedroomAbvGr -5.241e+03  1.445e+03  -3.626 0.000298 ***
## GarageCars    1.135e+04  1.660e+03   6.839 1.18e-11 ***
## ExterQualFa  -4.774e+04  1.169e+04  -4.083 4.69e-05 ***
## ExterQualGd  -4.766e+04  5.535e+03  -8.611  < 2e-16 ***
## ExterQualTA  -6.266e+04  6.194e+03 -10.116  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34280 on 1437 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8149, Adjusted R-squared:  0.8131 
## F-statistic: 451.8 on 14 and 1437 DF,  p-value: < 2.2e-16

We should check the model with residual diagnostics. The residuals are roughly centered at zero and approximately normally distributed, so we keep this model.

par(mfrow=c(2,2))
plot(second)

Next, we predict prices for the test data frame. Select the necessary columns and substitute zeros for the few missing values:

test_df <- test_house %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea, `BsmtFinSF1`, `1stFlrSF`, `2ndFlrSF`, BedroomAbvGr, GarageCars, ExterQual)  



test_df <- test_df %>% 
          mutate(BsmtFinSF1 = coalesce(BsmtFinSF1, 0)) %>% 
          mutate(GarageCars = coalesce(GarageCars, 0)) %>%
          mutate(MasVnrArea = coalesce(MasVnrArea, 0))  

colSums(is.na(test_df))
##   MSSubClass      LotArea  OverallQual  OverallCond    YearBuilt   MasVnrArea 
##            0            0            0            0            0            0 
##   BsmtFinSF1     1stFlrSF     2ndFlrSF BedroomAbvGr   GarageCars    ExterQual 
##            0            0            0            0            0            0

Using the predict() function and the model we built, we can predict the sale price. We make a few adjustments to fit the Kaggle submission criteria: add the Id column, drop the row indices, and write the result to a file. The file is also available in the GitHub repo: https://github.com/ex-pr/DATA607/blob/final_exam/prediction.csv

prediction <- predict(second, test_df)
prediction <- as.data.frame(prediction)
names(prediction)[1] <- "SalePrice"

prediction$Id <- test_house$Id
prediction <- prediction[,c(2,1)]
head(prediction)
##     Id SalePrice
## 1 1461  119945.7
## 2 1462  168030.5
## 3 1463  174259.6
## 4 1464  189765.9
## 5 1465  198404.9
## 6 1466  173746.7
write.csv(prediction,"C:/Users/daria/Documents/prediction.csv", row.names = FALSE)
Kaggle score: 0.17298; username: dariadubovskaia.
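Kaggle scores this competition by RMSE on the logarithm of the predicted price. As a rough in-sample analogue (not the leaderboard number, which is computed on the hidden test labels), a sketch on the training fit:

# In-sample RMSLE for the second model; pmax() guards against
# any non-positive fitted values before taking logs
fit_vals <- pmax(fitted(second), 1)
actual   <- model.frame(second)$SalePrice
sqrt(mean((log(fit_vals) - log(actual))^2))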