Load the data from the Github. The training dataset contains 1460 observations of 81 variables, test data contains 1459 observations of 80 variables.
train_house <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/final_exam/train.csv', header=TRUE, sep=",", check.names=FALSE)
test_house <- read.csv('https://raw.githubusercontent.com/ex-pr/DATA607/final_exam/test.csv', header=TRUE, sep=",", check.names=FALSE)
Univariate descriptive statistics and appropriate plots for the training data set.
The descriptive statistics is provided by the package psych. The info available: count of not NA values, standard deviation, interquartile range, and standard error. Categorical variables are converted to numeric (marked with *).
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 1.0 | Min. : 20.0 | Length:1460 | Min. : 21.00 | Min. : 1300 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 1.000 | Min. :1.000 | Min. :1872 | Min. :1950 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.0 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.0 | Length:1460 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Length:1460 | Length:1460 | Length:1460 | Length:1460 | Min. : 334 | Min. : 0 | Min. : 0.000 | Min. : 334 | Min. :0.0000 | Min. :0.00000 | Min. :0.000 | Min. :0.0000 | Min. :0.000 | Min. :0.000 | Length:1460 | Min. : 2.000 | Length:1460 | Min. :0.000 | Length:1460 | Length:1460 | Min. :1900 | Length:1460 | Min. :0.000 | Min. : 0.0 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.000 | Length:1460 | Length:1460 | Length:1460 | Min. : 0.00 | Min. : 1.000 | Min. :2006 | Length:1460 | Length:1460 | Min. : 34900 | |
1st Qu.: 365.8 | 1st Qu.: 20.0 | Class :character | 1st Qu.: 59.00 | 1st Qu.: 7554 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 5.000 | 1st Qu.:5.000 | 1st Qu.:1954 | 1st Qu.:1967 | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 0.0 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 0.0 | Class :character | 1st Qu.: 0.00 | 1st Qu.: 223.0 | 1st Qu.: 795.8 | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 882 | 1st Qu.: 0 | 1st Qu.: 0.000 | 1st Qu.:1130 | 1st Qu.:0.0000 | 1st Qu.:0.00000 | 1st Qu.:1.000 | 1st Qu.:0.0000 | 1st Qu.:2.000 | 1st Qu.:1.000 | Class :character | 1st Qu.: 5.000 | Class :character | 1st Qu.:0.000 | Class :character | Class :character | 1st Qu.:1961 | Class :character | 1st Qu.:1.000 | 1st Qu.: 334.5 | Class :character | Class :character | Class :character | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | Class :character | Class :character | Class :character | 1st Qu.: 0.00 | 1st Qu.: 5.000 | 1st Qu.:2007 | Class :character | Class :character | 1st Qu.:129975 | |
Median : 730.5 | Median : 50.0 | Mode :character | Median : 69.00 | Median : 9478 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 6.000 | Median :5.000 | Median :1973 | Median :1994 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 0.0 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 383.5 | Mode :character | Median : 0.00 | Median : 477.5 | Median : 991.5 | Mode :character | Mode :character | Mode :character | Mode :character | Median :1087 | Median : 0 | Median : 0.000 | Median :1464 | Median :0.0000 | Median :0.00000 | Median :2.000 | Median :0.0000 | Median :3.000 | Median :1.000 | Mode :character | Median : 6.000 | Mode :character | Median :1.000 | Mode :character | Mode :character | Median :1980 | Mode :character | Median :2.000 | Median : 480.0 | Mode :character | Mode :character | Mode :character | Median : 0.00 | Median : 25.00 | Median : 0.00 | Median : 0.00 | Median : 0.00 | Median : 0.000 | Mode :character | Mode :character | Mode :character | Median : 0.00 | Median : 6.000 | Median :2008 | Mode :character | Mode :character | Median :163000 | |
Mean : 730.5 | Mean : 56.9 | NA | Mean : 70.05 | Mean : 10517 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean : 6.099 | Mean :5.575 | Mean :1971 | Mean :1985 | NA | NA | NA | NA | NA | Mean : 103.7 | NA | NA | NA | NA | NA | NA | NA | Mean : 443.6 | NA | Mean : 46.55 | Mean : 567.2 | Mean :1057.4 | NA | NA | NA | NA | Mean :1163 | Mean : 347 | Mean : 5.845 | Mean :1515 | Mean :0.4253 | Mean :0.05753 | Mean :1.565 | Mean :0.3829 | Mean :2.866 | Mean :1.047 | NA | Mean : 6.518 | NA | Mean :0.613 | NA | NA | Mean :1979 | NA | Mean :1.767 | Mean : 473.0 | NA | NA | NA | Mean : 94.24 | Mean : 46.66 | Mean : 21.95 | Mean : 3.41 | Mean : 15.06 | Mean : 2.759 | NA | NA | NA | Mean : 43.49 | Mean : 6.322 | Mean :2008 | NA | NA | Mean :180921 | |
3rd Qu.:1095.2 | 3rd Qu.: 70.0 | NA | 3rd Qu.: 80.00 | 3rd Qu.: 11602 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 7.000 | 3rd Qu.:6.000 | 3rd Qu.:2000 | 3rd Qu.:2004 | NA | NA | NA | NA | NA | 3rd Qu.: 166.0 | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 712.2 | NA | 3rd Qu.: 0.00 | 3rd Qu.: 808.0 | 3rd Qu.:1298.2 | NA | NA | NA | NA | 3rd Qu.:1391 | 3rd Qu.: 728 | 3rd Qu.: 0.000 | 3rd Qu.:1777 | 3rd Qu.:1.0000 | 3rd Qu.:0.00000 | 3rd Qu.:2.000 | 3rd Qu.:1.0000 | 3rd Qu.:3.000 | 3rd Qu.:1.000 | NA | 3rd Qu.: 7.000 | NA | 3rd Qu.:1.000 | NA | NA | 3rd Qu.:2002 | NA | 3rd Qu.:2.000 | 3rd Qu.: 576.0 | NA | NA | NA | 3rd Qu.:168.00 | 3rd Qu.: 68.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.00 | 3rd Qu.: 0.000 | NA | NA | NA | 3rd Qu.: 0.00 | 3rd Qu.: 8.000 | 3rd Qu.:2009 | NA | NA | 3rd Qu.:214000 | |
Max. :1460.0 | Max. :190.0 | NA | Max. :313.00 | Max. :215245 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :10.000 | Max. :9.000 | Max. :2010 | Max. :2010 | NA | NA | NA | NA | NA | Max. :1600.0 | NA | NA | NA | NA | NA | NA | NA | Max. :5644.0 | NA | Max. :1474.00 | Max. :2336.0 | Max. :6110.0 | NA | NA | NA | NA | Max. :4692 | Max. :2065 | Max. :572.000 | Max. :5642 | Max. :3.0000 | Max. :2.00000 | Max. :3.000 | Max. :2.0000 | Max. :8.000 | Max. :3.000 | NA | Max. :14.000 | NA | Max. :3.000 | NA | NA | Max. :2010 | NA | Max. :4.000 | Max. :1418.0 | NA | NA | NA | Max. :857.00 | Max. :547.00 | Max. :552.00 | Max. :508.00 | Max. :480.00 | Max. :738.000 | NA | NA | NA | Max. :15500.00 | Max. :12.000 | Max. :2010 | NA | NA | Max. :755000 | |
NA | NA | NA | NA’s :259 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :81 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Let’s check the distribution of the sale price for the training data frame. The distribution is right skewed. Min price is 34900, max price is 755000.
train_house %>%
ggplot(aes(x=SalePrice)) +
geom_histogram(aes(y = ..density..)) +
geom_density(col='red', size = 2) +
theme(axis.text.x = element_text(vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5), panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"), panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey"),) +
labs(title="Distribution of house prices", x= "Price", y = "Count")
We can also built some graphs to see if there is any connection between price and year built
train_house %>%
ggplot(aes(x=YearBuilt, y=SalePrice)) +
geom_point() +
theme(axis.text.x = element_text(vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5), panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"), panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey"),) +
labs(title="Distribution of house prices per year", x= "Year", y = "Sale Price")
Usually, people have more than 5 rooms above grade (does not include bathrooms)
columns <- c('FullBath', 'TotRmsAbvGrd', 'GarageCars')
data <- melt(train_house[columns])
ggplot(data, aes(x = variable, y = value)) + geom_boxplot()
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
The dependent variable is Sale Price, the independent variables are Above grade (ground) living area square feet, Garage area as it seems to be correlation between these variables.
train_house %>%
ggplot(aes(x = SalePrice, y = GarageArea)) +
geom_point(size = 3) +
geom_smooth(method = "lm", color = 'red') +
theme(axis.text.x = element_text(vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5), panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"), panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey"),) +
labs(title="Distribution of house prices per living area", x= "Sale Price", y = "Above ground living area, sq.ft")
train_house %>%
ggplot(aes(x = SalePrice, y = GrLivArea)) +
geom_point(size = 3) +
geom_smooth(method = "lm", color = 'red') +
theme(axis.text.x = element_text(vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5), panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"), panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey"),) +
labs(title="Distribution of house prices per Garage area", x= "Sale Price", y = "Garage area")
Derive a correlation matrix for any three quantitative variables in the dataset.
The variable are ‘LotArea’, ‘GarageArea’, ‘GrLivArea’. There is definitely the correlation between garage area/living area above ground and sale price.
salePrice <- train_house$SalePrice
areaLot <- train_house$LotArea
garageArea <- train_house$GarageArea
livAreaF <- train_house$GrLivArea
df <- data.frame(salePrice,areaLot,garageArea,livAreaF)
colnames(df) <- c("SalePrice", "LotArea", "GarageArea", "GrLivArea")
corr_results <- cor(df)
round(corr_results, 3)
## SalePrice LotArea GarageArea GrLivArea
## SalePrice 1.000 0.264 0.623 0.709
## LotArea 0.264 1.000 0.180 0.263
## GarageArea 0.623 0.180 1.000 0.469
## GrLivArea 0.709 0.263 0.469 1.000
corrplot(corr_results, type = "lower", order = "original", tl.srt = 0)
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
H0: the correlations is 0
H1: the correlations is not 0
We will use function cor.test of the ggpubr package to test the hypotheses. The data contains missing data and outliers. There are many variables that can affect the sale price. For now, we found some of them: the year built, living area above ground, garage area.
Once we run the test, we see that we should reject the null hypothesis about the correlation being equal to 0 and accept alternative hypothesis (the correlations is not 0). We should do it because the correlation is 0.18, 0.47, 0.26 which is significant as p-value < 2.2e-16 is much less than the significance level of 0.05.
And yes, we should expect the familywise (type I) error with the 80% confidence interval, the significance level of 0.05. The chance to get this error is 14.26% for the multiple comparisons we made.
corr_GarageArea_LotArea <- cor.test(df$GarageArea, df$LotArea, method = "pearson", conf.level=0.8)
corr_GarageArea_LotArea
##
## Pearson's product-moment correlation
##
## data: df$GarageArea and df$LotArea
## t = 7.0034, df = 1458, p-value = 3.803e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1477356 0.2126767
## sample estimates:
## cor
## 0.1804028
corr_Garagearea_GrLivArea <- cor.test(df$GarageArea, df$GrLivArea , method = "pearson", conf.level=0.8)
corr_Garagearea_GrLivArea
##
## Pearson's product-moment correlation
##
## data: df$GarageArea and df$GrLivArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4423993 0.4947713
## sample estimates:
## cor
## 0.4689975
corr_GrLivArea_LotArea <- cor.test(df$GrLivArea, df$LotArea , method = "pearson", conf.level=0.8)
corr_GrLivArea_LotArea
##
## Pearson's product-moment correlation
##
## data: df$GrLivArea and df$LotArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
n <- 3
a <- .05
1 - (1-a)^n
## [1] 0.142625
Invert your correlation matrix from above. The original correlation matrix:
cor_data <- data.frame(areaLot,garageArea,livAreaF)
colnames(df) <- c("LotArea", "GarageArea", "GrLivArea")
cor_data <- cor(cor_data, method = "pearson", use = "complete.obs")
cor_matrix <- data.matrix(cor_data )
rownames(cor_matrix) <- c("LotArea", "GarageArea", "GrLivArea")
colnames(cor_matrix) <- c("LotArea", "GarageArea", "GrLivArea")
cor_matrix
## LotArea GarageArea GrLivArea
## LotArea 1.0000000 0.1804028 0.2631162
## GarageArea 0.1804028 1.0000000 0.4689975
## GrLivArea 0.2631162 0.4689975 1.0000000
The inverted:
precision_matrix <- solve(cor_matrix)
precision_matrix
## LotArea GarageArea GrLivArea
## LotArea 1.07920917 -0.07886378 -0.2469705
## GarageArea -0.07886378 1.28774631 -0.5831994
## GrLivArea -0.24697046 -0.58319943 1.3385010
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
We get the same results: identity matrices.
matrix1 <- round(cor_matrix %*% precision_matrix, 2)
matrix1
## LotArea GarageArea GrLivArea
## LotArea 1 0 0
## GarageArea 0 1 0
## GrLivArea 0 0 1
matrix2 <- round(precision_matrix %*% cor_matrix, 2)
matrix2
## LotArea GarageArea GrLivArea
## LotArea 1 0 0
## GarageArea 0 1 0
## GrLivArea 0 0 1
Conduct LU decomposition on the matrix.
decomp <- lu.decomposition(cor_matrix)
#The lower triangular matrix L
decomp$L
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] 0.1804028 1.0000000 0
## [3,] 0.2631162 0.4357109 1
#The upper triangular matrix U
decomp$U
## [,1] [,2] [,3]
## [1,] 1 0.1804028 0.2631162
## [2,] 0 0.9674548 0.4215306
## [3,] 0 0.0000000 0.7471044
To check the work, we can multiply L by U and get the correlation matrix again:
decomp$L %*% decomp$U
## [,1] [,2] [,3]
## [1,] 1.0000000 0.1804028 0.2631162
## [2,] 0.1804028 1.0000000 0.4689975
## [3,] 0.2631162 0.4689975 1.0000000
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
To see what variables are skewed to the right, we will build the graph below:
train_house %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins = 100)
I decided to choose the living are above ground ‘GrLivArea’:
train_house %>%
ggplot(aes(x=GrLivArea)) +
geom_histogram() +
theme(axis.text.x = element_text(vjust = 0.5, hjust=1), plot.title = element_text(hjust = 0.5), panel.background = element_rect(fill = "white", colour = "grey", size = 0.5, linetype = "solid"), panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "grey"),) +
labs(title="Distribution of the living area", x= "Are, sq.ft", y = "Count")
data <- train_house$GrLivArea
describe(data)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1460 1515.46 525.48 1464 1467.67 483.33 334 5642 5308 1.36 4.86
## se
## X1 13.75
summary(data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
Load the MASS package and run fitdistr to fit an exponential probability density function:
library(MASS)
exp_prob <- fitdistr(data, densfun="exponential")
Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).
The lambda is 0.000659864.
#lambda is 0.000659864
lambda <- exp_prob$estimate
samples <- rexp(1000,lambda)
Plot a histogram and compare it with a histogram of your original variable.
The histogram is still right-skewed, though it looks more uniformly distributed.
par(mfrow=c(1,2))
summary(samples)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.659 431.393 1014.606 1489.212 2080.674 11304.289
hist(samples)
hist(data)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
The 5th percentile is 77.73and 95th percentile is 4539.92.
qexp(.05, rate=lambda)
## [1] 77.73313
qexp(.95, rate=lambda)
## [1] 4539.924
Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
The 95% confidence interval from the t-test: from 1488.487 to 1542.440. The empirical 5th percentile is 848, the 95th percentile 2466.1.
We used a fitted distribution instead of empirical data. The histograms for fitted distribution and empirical data are both right skewed, though empirical data slightly skewed to the left. The percentiles are reasonably close (77/848 versus 4539/2466). The population means are also close (1515/1530).
t <- t.test(data,conf.level=0.95)
t
##
## One Sample t-test
##
## data: data
## t = 110.2, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1488.487 1542.440
## sample estimates:
## mean of x
## 1515.464
quantile(data, c(0.05, 0.95))
## 5% 95%
## 848.0 2466.1
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
At the first attempt, we will include numeric columns that have no NA (LotFrontage, GarageYrBlt) to see how they are related to Sale Price. We will stay with the variables that have small p-value: MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea,BsmtFinSF1, 1stFlrSF, 2ndFlrSF, BedroomAbvGr, GarageCars.
df <- train_house %>%
select_if(is.numeric)
colSums(is.na(df))
## Id MSSubClass LotFrontage LotArea OverallQual
## 0 0 259 0 0
## OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## 0 0 0 8 0
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF
## 0 0 0 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## 0 0 0 0 0
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
## 81 0 0 0 0
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal
## 0 0 0 0 0
## MoSold YrSold SalePrice
## 0 0 0
df <- df[,-c(3,26)]
first <- lm(SalePrice~.,data=df)
summary(first)
##
## Call:
## lm(formula = SalePrice ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -475057 -16584 -2090 13651 303497
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.050e+05 1.414e+06 0.286 0.774604
## Id -1.129e+00 2.187e+00 -0.516 0.605961
## MSSubClass -1.703e+02 2.629e+01 -6.479 1.27e-10 ***
## LotArea 4.083e-01 1.007e-01 4.056 5.26e-05 ***
## OverallQual 1.711e+04 1.191e+03 14.366 < 2e-16 ***
## OverallCond 4.397e+03 1.023e+03 4.299 1.83e-05 ***
## YearBuilt 3.250e+02 6.081e+01 5.345 1.05e-07 ***
## YearRemodAdd 1.737e+02 6.593e+01 2.634 0.008529 **
## MasVnrArea 3.173e+01 5.957e+00 5.326 1.16e-07 ***
## BsmtFinSF1 1.869e+01 4.660e+00 4.010 6.38e-05 ***
## BsmtFinSF2 7.697e+00 7.049e+00 1.092 0.275032
## BsmtUnfSF 9.430e+00 4.192e+00 2.250 0.024622 *
## TotalBsmtSF NA NA NA NA
## `1stFlrSF` 4.779e+01 5.773e+00 8.277 2.88e-16 ***
## `2ndFlrSF` 4.862e+01 4.958e+00 9.805 < 2e-16 ***
## LowQualFinSF 2.933e+01 1.971e+01 1.488 0.137033
## GrLivArea NA NA NA NA
## BsmtFullBath 9.578e+03 2.611e+03 3.669 0.000253 ***
## BsmtHalfBath 1.922e+03 4.083e+03 0.471 0.637893
## FullBath 4.365e+03 2.820e+03 1.548 0.121919
## HalfBath -2.176e+03 2.668e+03 -0.816 0.414913
## BedroomAbvGr -1.025e+04 1.694e+03 -6.053 1.81e-09 ***
## KitchenAbvGr -1.224e+04 5.224e+03 -2.343 0.019244 *
## TotRmsAbvGrd 5.072e+03 1.237e+03 4.099 4.38e-05 ***
## Fireplaces 3.782e+03 1.772e+03 2.134 0.033017 *
## GarageCars 1.069e+04 2.851e+03 3.749 0.000185 ***
## GarageArea -1.806e+00 9.699e+00 -0.186 0.852267
## WoodDeckSF 2.475e+01 7.977e+00 3.103 0.001955 **
## OpenPorchSF -4.028e+00 1.519e+01 -0.265 0.790973
## EnclosedPorch 1.271e+01 1.688e+01 0.753 0.451683
## `3SsnPorch` 1.813e+01 3.136e+01 0.578 0.563251
## ScreenPorch 5.642e+01 1.716e+01 3.287 0.001038 **
## PoolArea -3.303e+01 2.361e+01 -1.399 0.162098
## MiscVal -7.375e-01 1.853e+00 -0.398 0.690624
## MoSold -8.383e+01 3.461e+02 -0.242 0.808630
## YrSold -7.203e+02 7.029e+02 -1.025 0.305669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34710 on 1418 degrees of freedom
## (8 пропущенных наблюдений удалены)
## Multiple R-squared: 0.8127, Adjusted R-squared: 0.8084
## F-statistic: 186.5 on 33 and 1418 DF, p-value: < 2.2e-16
We will also add some factor variables that look related to the Sale Price. The p-value is sagnificantly small, R-value is 0.81, the mdoel explains 81% of the data which is good:
df <- train_house %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea, `BsmtFinSF1`, `1stFlrSF`, `2ndFlrSF`, BedroomAbvGr, GarageCars, SalePrice, ExterQual)
second <- lm(SalePrice~.,data=df)
summary(second)
##
## Call:
## lm(formula = SalePrice ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -535152 -15659 -1198 12698 276986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.681e+05 9.255e+04 -7.219 8.48e-13 ***
## MSSubClass -1.785e+02 2.347e+01 -7.605 5.13e-14 ***
## LotArea 5.343e-01 9.705e-02 5.505 4.36e-08 ***
## OverallQual 1.607e+04 1.164e+03 13.802 < 2e-16 ***
## OverallCond 5.471e+03 9.118e+02 6.000 2.49e-09 ***
## YearBuilt 3.419e+02 4.648e+01 7.357 3.16e-13 ***
## MasVnrArea 2.016e+01 5.877e+00 3.430 0.000620 ***
## BsmtFinSF1 1.744e+01 2.329e+00 7.490 1.20e-13 ***
## `1stFlrSF` 6.179e+01 3.755e+00 16.456 < 2e-16 ***
## `2ndFlrSF` 5.666e+01 3.243e+00 17.469 < 2e-16 ***
## BedroomAbvGr -5.241e+03 1.445e+03 -3.626 0.000298 ***
## GarageCars 1.135e+04 1.660e+03 6.839 1.18e-11 ***
## ExterQualFa -4.774e+04 1.169e+04 -4.083 4.69e-05 ***
## ExterQualGd -4.766e+04 5.535e+03 -8.611 < 2e-16 ***
## ExterQualTA -6.266e+04 6.194e+03 -10.116 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34280 on 1437 degrees of freedom
## (8 пропущенных наблюдений удалены)
## Multiple R-squared: 0.8149, Adjusted R-squared: 0.8131
## F-statistic: 451.8 on 14 and 1437 DF, p-value: < 2.2e-16
We should check the model using the residuals. We will use the results as residuals are somewhere around 0 and along the normal distribution.
par(mfrow=c(2,2))
plot(second)
Next, we will predict the data using test data frame. Select the necessary columns and substitute :
test_df <- test_house %>% dplyr::select(MSSubClass, LotArea, OverallQual, OverallCond, YearBuilt, MasVnrArea, `BsmtFinSF1`, `1stFlrSF`, `2ndFlrSF`, BedroomAbvGr, GarageCars, ExterQual)
test_df <- test_df %>%
mutate(BsmtFinSF1 = coalesce(BsmtFinSF1, 0)) %>%
mutate(GarageCars = coalesce(GarageCars, 0)) %>%
mutate(MasVnrArea = coalesce(MasVnrArea, 0))
colSums(is.na(test_df))
## MSSubClass LotArea OverallQual OverallCond YearBuilt MasVnrArea
## 0 0 0 0 0 0
## BsmtFinSF1 1stFlrSF 2ndFlrSF BedroomAbvGr GarageCars ExterQual
## 0 0 0 0 0 0
Using function predict() and the model we built, we can predict the sale price. We will do some adjustments to fit the kaggle submission criteria: add id column, remove rows’ index and write to the file. File is also available at the Github repo: https://github.com/ex-pr/DATA607/blob/final_exam/prediction.csv
prediction <- predict(second, test_df)
prediction <- as.data.frame(prediction)
names(prediction )[1] <- "SalePrice"
prediction$Id <- test_house$Id
prediction <- prediction[,c(2,1)]
head(prediction)
## Id SalePrice
## 1 1461 119945.7
## 2 1462 168030.5
## 3 1463 174259.6
## 4 1464 189765.9
## 5 1465 198404.9
## 6 1466 173746.7
write.csv(prediction,"C:/Users/daria/Documents/prediction.csv", row.names = FALSE)
Kaggle results: 0.17298, User name: dariadubovskaia