Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=(N+1)/2\)
set.seed(10000)
N <- 43
X <- round(runif(10000, 1, N))
Y <- round(rnorm(10000, mean = (N+1)/2, sd = (N+1)/2))
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. (5 points)
x <- median(X)
y <- quantile(Y,0.25,names=FALSE)
(a<-min(pnorm(X>x | X>y)))
## [1] 0.5
(b<-min(pnorm(X>x ,Y>y)))
## [1] 0.1586553
(c<-min(pnorm(X<x, X>y)))
## [1] 0.1586553
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.(5 points)
a<-pnorm(X>x)*pnorm(Y>y)
b<-pnorm((X>x)*(Y>y))
r<-rbind(table(a),table(b))
## Warning in rbind(table(a), table(b)): number of columns of result is not a
## multiple of vector length (arg 2)
row.names(r)<-c('P(X>x and Y>y)','P(X>x)P(Y>y)')
colnames(r)<-names(table(round(a,2)))
rp<-round(addmargins(prop.table(r)),2)
ftable(round(a,2))
## 0.25 0.42 0.71
##
## 1302 5017 3681
ftable(round(b,2))
## 0.5 0.84
##
## 6319 3681
rp
## 0.25 0.42 0.71 Sum
## P(X>x and Y>y) 0.05 0.19 0.14 0.38
## P(X>x)P(Y>y) 0.24 0.14 0.24 0.62
## Sum 0.29 0.33 0.38 1.00
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? (5 points)
fst<-fisher.test(rp[1,],rp[2,])
cst<-chisq.test(rp[1,],rp[2,])
## Warning in chisq.test(rp[1, ], rp[2, ]): Chi-squared approximation may be
## incorrect
print(fst$p.value)
## [1] 1
print(cst$p.value)
## [1] 0.2381033
We got Fisher test value as 1 which fits the data better when compared to Chisquare having P value 0.24
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following. 5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, )). Plot a histogram and compare it with a histogram of your original variable. Using the exponential ptrain_data, find the 5th and 95th percentiles using the cumulative distribution function (Ctrain_data). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
library(MASS)
library(knitr)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(DT)
library(reshape)
library(corrplot)
library(Rmisc)
train_data <- read_csv(file="https://raw.githubusercontent.com/nathtrish334/Data-605/main/final/train.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## Id = col_double(),
## MSSubClass = col_double(),
## LotFrontage = col_double(),
## LotArea = col_double(),
## OverallQual = col_double(),
## OverallCond = col_double(),
## YearBuilt = col_double(),
## YearRemodAdd = col_double(),
## MasVnrArea = col_double(),
## BsmtFinSF1 = col_double(),
## BsmtFinSF2 = col_double(),
## BsmtUnfSF = col_double(),
## TotalBsmtSF = col_double(),
## `1stFlrSF` = col_double(),
## `2ndFlrSF` = col_double(),
## LowQualFinSF = col_double(),
## GrLivArea = col_double(),
## BsmtFullBath = col_double(),
## BsmtHalfBath = col_double(),
## FullBath = col_double()
## # ... with 18 more columns
## )
## i Use `spec()` for the full column specifications.
datatable(train_data, options = list( pageLength = 5, lengthMenu = c(5, 10, 40), initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#E9967A', 'color': '#fff'});",
"}")), rownames=TRUE)
#test variable
X1<-train_data$OverallQual
Y1<-train_data$SalePrice
plot(X1,Y1,main="OverallQual vs SalePrice", xlab="OverallQual", ylab="SalePrice",col=2)
hist(Y1, col="violet", main="Histogram of Overall Quality")
#selected variable
X<-train_data$YearBuilt
Y<-train_data$SalePrice
plot(X,Y, col="#a82f89", main="Scatterplot of Year Built and Sale Price", xlab = "Year Built", ylab="Sale Price")
# Include regression line (y~x)
abline(lm(Y~X), col="blue", lwd=3)
print("Summary of X variable (Year Built)")
## [1] "Summary of X variable (Year Built)"
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1973 1971 2000 2010
print("Summary of Y variable (Sale Price)")
## [1] "Summary of Y variable (Sale Price)"
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Probability that a house has a year built greater than the third quartile
XQ3<-quantile(X, probs=0.75) #2000 #3rd quartile of X variable
YQ2<-quantile (Y, probs=0.50) #163000 #2nd quartile, or median, of Y variable
n<-(nrow(train_data))
yearbuilt<-as.numeric(train_data$YearBuilt)
saleprice<-as.numeric(train_data$SalePrice)
nYQ2<-nrow(subset(train_data,saleprice>YQ2))
p1<-nrow(subset(train_data, yearbuilt > XQ3 & saleprice>YQ2))/nYQ2
p1
## [1] 0.4436813
Probability that a house has a year built greater than the third quartile and a sale price above median value
p2<-nrow(subset(train_data, yearbuilt > XQ3 & saleprice>YQ2))/n
p2
## [1] 0.2212329
Probability that a house has year built less than or equal to the third quartile
p3<-nrow(subset(train_data, yearbuilt <=XQ3 & saleprice>YQ2))/nYQ2
p3
## [1] 0.5563187
c1<-nrow(subset(train_data, yearbuilt <=XQ3 & saleprice<=YQ2))/n
c2<-nrow(subset(train_data, yearbuilt <=XQ3 & saleprice>YQ2))/n
c3<-c1+c2
c4<-nrow(subset(train_data, yearbuilt >XQ3 & saleprice<=YQ2))/n
c5<-nrow(subset(train_data, yearbuilt >XQ3 & saleprice>YQ2))/n
c6<-c4+c5
c7<-c1+c4
c8<-c2+c5
c9<-c3+c6
train_counts<-matrix(round(c(c1,c2,c3,c4,c5,c6,c7,c8,c9),3), ncol=3, nrow=3, byrow=TRUE)
colnames(train_counts)<-c(
"<=2d quartile",
">2d quartile",
"Total")
rownames(train_counts)<-c("<=3rd quartile",">3rd quartile","Total")
train_counts<-as.table(train_counts)
train_vals<-round(train_counts*1460,0)
print("Quartile Matrix by Percentage")
## [1] "Quartile Matrix by Percentage"
train_counts
## <=2d quartile >2d quartile Total
## <=3rd quartile 0.473 0.277 0.751
## >3rd quartile 0.028 0.221 0.249
## Total 0.501 0.499 1.000
print("Quartile Matrix by Count")
## [1] "Quartile Matrix by Count"
train_vals
## <=2d quartile >2d quartile Total
## <=3rd quartile 691 404 1096
## >3rd quartile 41 323 364
## Total 731 729 1460
A Chi Square test for association
#A = new variable for count of observations above the 3rd quartile for X,
#B = new variable for count of observations for the 2nd quartile for Y.
# Check if P(A|B)=P(A)P(B)
papb<-c4*c5
print (paste0("p(A)*p(B)=", round(papb,5)))
## [1] "p(A)*p(B)=0.00621"
mat <- matrix(c(691, 404, 41, 323), 2, 2, byrow=T)
chisq.test(mat, correct=TRUE)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mat
## X-squared = 291.61, df = 1, p-value < 2.2e-16
isnum <- sapply(train_data, is.numeric)
train_num<-train_data[ , isnum]
summary(train_num)
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.0 Min. : 21.00 Min. : 1300
## 1st Qu.: 365.8 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554
## Median : 730.5 Median : 50.0 Median : 69.00 Median : 9478
## Mean : 730.5 Mean : 56.9 Mean : 70.05 Mean : 10517
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :1460.0 Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 0.0 Median : 383.5 Median : 0.00 Median : 477.5
## Mean : 103.7 Mean : 443.6 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 166.0 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :1600.0 Max. :5644.0 Max. :1474.00 Max. :2336.0
## NA's :8
## TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF
## Min. : 0.0 Min. : 334 Min. : 0 Min. : 0.000
## 1st Qu.: 795.8 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## Median : 991.5 Median :1087 Median : 0 Median : 0.000
## Mean :1057.4 Mean :1163 Mean : 347 Mean : 5.845
## 3rd Qu.:1298.2 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## Max. :6110.0 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.000 Min. :1900 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5
## Median :1.000 Median :1980 Median :2.000 Median : 480.0
## Mean :0.613 Mean :1979 Mean :1.767 Mean : 473.0
## 3rd Qu.:1.000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :3.000 Max. :2010 Max. :4.000 Max. :1418.0
## NA's :81
## WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 15.06 Mean : 2.759 Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :480.00 Max. :738.000 Max. :15500.00 Max. :12.000
##
## YrSold SalePrice
## Min. :2006 Min. : 34900
## 1st Qu.:2007 1st Qu.:129975
## Median :2008 Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
# 95% CI for the difference in the mean of the variables
t.test(train_data$YearBuilt, train_data$SalePrice)
##
## Welch Two Sample t-test
##
## data: train_data$YearBuilt and train_data$SalePrice
## t = -86.071, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -183028.3 -174871.6
## sample estimates:
## mean of x mean of y
## 1971.268 180921.196
# Derive a correlation matrix for two of the quantitative variables
myvars<-data.frame(train_data$YearBuilt, train_data$SalePrice)
cor(myvars)
## train_data.YearBuilt train_data.SalePrice
## train_data.YearBuilt 1.0000000 0.5228973
## train_data.SalePrice 0.5228973 1.0000000
cor.test(train_data$YearBuilt, train_data$SalePrice, conf.level = 0.99)
##
## Pearson's product-moment correlation
##
## data: train_data$YearBuilt and train_data$SalePrice
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.4721529 0.5701878
## sample estimates:
## cor
## 0.5228973
t.test(train_data$YearBuilt, train_data$SalePrice, conf.level = 0.99)
##
## Welch Two Sample t-test
##
## data: train_data$YearBuilt and train_data$SalePrice
## t = -86.071, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## -184312.4 -173587.5
## sample estimates:
## mean of x mean of y
## 1971.268 180921.196
#Correlation matrix
mymx<-as.matrix(cor(myvars))
mymx
## train_data.YearBuilt train_data.SalePrice
## train_data.YearBuilt 1.0000000 0.5228973
## train_data.SalePrice 0.5228973 1.0000000
# precision matrix (inverse of correlation matrix)
ginvmymx<-ginv(mymx)
ginvmymx
## [,1] [,2]
## [1,] 1.3763140 -0.7196709
## [2,] -0.7196709 1.3763140
# Principal Components Analysis
# Correlation matrix for all qualitative variables
cormatrix<-cor(train_num)
cor_train<-as.data.frame(cormatrix)
kable(head(cor_train))
Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | 1.0000000 | 0.0111565 | NA | -0.0332255 | -0.0283648 | 0.0126089 | -0.0127127 | -0.0219976 | NA | -0.0050240 | -0.0059677 | -0.0079397 | -0.0154146 | 0.0104960 | 0.0055898 | -0.0442300 | 0.0082728 | 0.0022886 | -0.0201547 | 0.0055875 | 0.0067838 | 0.0377186 | 0.0029512 | 0.0272387 | -0.0197716 | NA | 0.0165697 | 0.0176338 | -0.0296432 | -0.0004769 | 0.0028892 | -0.0466348 | 0.0013302 | 0.0570439 | -0.0062424 | 0.0211722 | 0.0007118 | -0.0219167 |
MSSubClass | 0.0111565 | 1.0000000 | NA | -0.1397811 | 0.0326277 | -0.0593158 | 0.0278501 | 0.0405810 | NA | -0.0698357 | -0.0656486 | -0.1407595 | -0.2385184 | -0.2517584 | 0.3078857 | 0.0464738 | 0.0748532 | 0.0034910 | -0.0023325 | 0.1316082 | 0.1773544 | -0.0234380 | 0.2817210 | 0.0403801 | -0.0455693 | NA | -0.0401098 | -0.0986715 | -0.0125794 | -0.0061001 | -0.0120366 | -0.0438245 | -0.0260302 | 0.0082827 | -0.0076833 | -0.0135846 | -0.0214070 | -0.0842841 |
LotFrontage | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
LotArea | -0.0332255 | -0.1397811 | NA | 1.0000000 | 0.1058057 | -0.0056363 | 0.0142277 | 0.0137884 | NA | 0.2141031 | 0.1111697 | -0.0026184 | 0.2608331 | 0.2994746 | 0.0509859 | 0.0047790 | 0.2631162 | 0.1581545 | 0.0480456 | 0.1260306 | 0.0142595 | 0.1196899 | -0.0177839 | 0.1900148 | 0.2713640 | NA | 0.1548707 | 0.1804028 | 0.1716977 | 0.0847738 | -0.0183397 | 0.0204228 | 0.0431604 | 0.0776724 | 0.0380677 | 0.0012050 | -0.0142614 | 0.2638434 |
OverallQual | -0.0283648 | 0.0326277 | NA | 0.1058057 | 1.0000000 | -0.0919323 | 0.5723228 | 0.5506839 | NA | 0.2396660 | -0.0591187 | 0.3081589 | 0.5378085 | 0.4762238 | 0.2954929 | -0.0304293 | 0.5930074 | 0.1110978 | -0.0401502 | 0.5505997 | 0.2734581 | 0.1016764 | -0.1838822 | 0.4274523 | 0.3967650 | NA | 0.6006707 | 0.5620218 | 0.2389234 | 0.3088188 | -0.1139369 | 0.0303706 | 0.0648864 | 0.0651658 | -0.0314062 | 0.0708152 | -0.0273467 | 0.7909816 |
OverallCond | 0.0126089 | -0.0593158 | NA | -0.0056363 | -0.0919323 | 1.0000000 | -0.3759832 | 0.0737415 | NA | -0.0462309 | 0.0402292 | -0.1368406 | -0.1710975 | -0.1442028 | 0.0289421 | 0.0254943 | -0.0796859 | -0.0549415 | 0.1178209 | -0.1941495 | -0.0607693 | 0.0129801 | -0.0870009 | -0.0575832 | -0.0238200 | NA | -0.1857575 | -0.1515214 | -0.0033337 | -0.0325888 | 0.0703562 | 0.0255037 | 0.0548105 | -0.0019849 | 0.0687768 | -0.0035108 | 0.0439497 | -0.0778559 |
# fit an exponential probability density function.
fit <- fitdistr(train_data$YearBuilt, "exponential")
#optimal lambda value for the distribution
lambda <- fit$estimate
train_sample <- rexp(1000, lambda)
lambda
## rate
## 0.0005072877
# Histogram from the distribution
train_sample<-data.frame(as.numeric(train_sample))
colnames(train_sample)[1] <- "sample"
hist(train_sample$sample, col="green", main="Histogram for Exponential Distribution", xlab = "Year Built", breaks=30)
#95% confidence interval from the empirical data
CI(train_data$YearBuilt, 0.95)
## upper mean lower
## 1972.818 1971.268 1969.717
#test of alternate model
modvars <- train_data[, which(sapply(train_data, function(x) sum(is.na(x))) == 0)]
model1 <- step(lm(train_data$SalePrice ~ ., modvars), direction = 'backward', trace = FALSE)
model1
##
## Call:
## lm(formula = train_data$SalePrice ~ MSZoning + LotArea + Street +
## LandContour + LotConfig + LandSlope + Neighborhood + Condition1 +
## Condition2 + BldgType + HouseStyle + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + RoofStyle + RoofMatl + ExterQual +
## Foundation + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + `1stFlrSF` +
## `2ndFlrSF` + FullBath + BedroomAbvGr + KitchenAbvGr + KitchenQual +
## TotRmsAbvGrd + Functional + Fireplaces + GarageCars + GarageArea +
## WoodDeckSF + ScreenPorch + PoolArea + MoSold + SaleType,
## data = modvars)
##
## Coefficients:
## (Intercept) MSZoningFV MSZoningRH
## -1.590e+06 3.081e+04 2.235e+04
## MSZoningRL MSZoningRM LotArea
## 2.500e+04 2.399e+04 7.290e-01
## StreetPave LandContourHLS LandContourLow
## 3.855e+04 1.455e+04 -4.130e+03
## LandContourLvl LotConfigCulDSac LotConfigFR2
## 7.223e+03 7.028e+03 -5.531e+03
## LotConfigFR3 LotConfigInside LandSlopeMod
## -1.330e+04 -1.009e+03 1.085e+04
## LandSlopeSev NeighborhoodBlueste NeighborhoodBrDale
## -3.134e+04 1.053e+03 7.068e+03
## NeighborhoodBrkSide NeighborhoodClearCr NeighborhoodCollgCr
## -4.220e+02 -1.098e+04 -7.826e+03
## NeighborhoodCrawfor NeighborhoodEdwards NeighborhoodGilbert
## 8.985e+03 -1.417e+04 -1.344e+04
## NeighborhoodIDOTRR NeighborhoodMeadowV NeighborhoodMitchel
## -4.029e+03 3.845e+02 -1.931e+04
## NeighborhoodNAmes NeighborhoodNoRidge NeighborhoodNPkVill
## -1.268e+04 2.977e+04 4.911e+03
## NeighborhoodNridgHt NeighborhoodNWAmes NeighborhoodOldTown
## 2.459e+04 -2.299e+04 -1.200e+04
## NeighborhoodSawyer NeighborhoodSawyerW NeighborhoodSomerst
## -1.195e+04 -8.005e+03 1.200e+03
## NeighborhoodStoneBr NeighborhoodSWISU NeighborhoodTimber
## 3.612e+04 -5.338e+03 -6.451e+03
## NeighborhoodVeenker Condition1Feedr Condition1Norm
## 2.206e+03 3.626e+03 1.171e+04
## Condition1PosA Condition1PosN Condition1RRAe
## 1.132e+04 1.143e+04 -1.462e+04
## Condition1RRAn Condition1RRNe Condition1RRNn
## 8.111e+03 -5.697e+03 8.997e+03
## Condition2Feedr Condition2Norm Condition2PosA
## -5.155e+03 -2.705e+03 3.240e+04
## Condition2PosN Condition2RRAe Condition2RRAn
## -2.319e+05 -1.272e+05 -1.544e+03
## Condition2RRNn BldgType2fmCon BldgTypeDuplex
## -3.178e+03 -4.461e+03 -6.219e+01
## BldgTypeTwnhs BldgTypeTwnhsE HouseStyle1.5Unf
## -2.744e+04 -2.371e+04 1.274e+04
## HouseStyle1Story HouseStyle2.5Fin HouseStyle2.5Unf
## 7.971e+03 -5.760e+03 -1.377e+04
## HouseStyle2Story HouseStyleSFoyer HouseStyleSLvl
## -6.455e+03 5.604e+03 4.988e+03
## OverallQual OverallCond YearBuilt
## 7.793e+03 5.362e+03 3.420e+02
## YearRemodAdd RoofStyleGable RoofStyleGambrel
## 1.097e+02 -8.072e+03 -8.076e+02
## RoofStyleHip RoofStyleMansard RoofStyleShed
## -5.084e+03 1.088e+04 7.967e+04
## RoofMatlCompShg RoofMatlMembran RoofMatlMetal
## 6.545e+05 7.292e+05 6.996e+05
## RoofMatlRoll RoofMatlTar&Grv RoofMatlWdShake
## 6.566e+05 6.470e+05 6.286e+05
## RoofMatlWdShngl ExterQualFa ExterQualGd
## 7.311e+05 -1.446e+04 -3.204e+04
## ExterQualTA FoundationCBlock FoundationPConc
## -3.386e+04 5.553e+02 5.388e+03
## FoundationSlab FoundationStone FoundationWood
## 1.256e+04 -2.508e+02 -2.822e+04
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 3.795e+01 2.609e+01 1.573e+01
## `1stFlrSF` `2ndFlrSF` FullBath
## 5.531e+01 6.939e+01 2.933e+03
## BedroomAbvGr KitchenAbvGr KitchenQualFa
## -6.336e+03 -1.850e+04 -2.137e+04
## KitchenQualGd KitchenQualTA TotRmsAbvGrd
## -2.836e+04 -2.611e+04 1.555e+03
## FunctionalMaj2 FunctionalMin1 FunctionalMin2
## -9.426e+02 1.685e+03 9.192e+03
## FunctionalMod FunctionalSev FunctionalTyp
## -4.378e+03 -7.531e+04 1.799e+04
## Fireplaces GarageCars GarageArea
## 2.944e+03 3.978e+03 1.328e+01
## WoodDeckSF ScreenPorch PoolArea
## 9.906e+00 3.173e+01 7.234e+01
## MoSold SaleTypeCon SaleTypeConLD
## -5.716e+02 4.523e+04 1.747e+04
## SaleTypeConLI SaleTypeConLw SaleTypeCWD
## 1.142e+04 6.004e+03 2.210e+04
## SaleTypeNew SaleTypeOth SaleTypeWD
## 2.361e+04 1.601e+04 5.139e+03
# create multiple linear regression model using intercepts from regression summary.
fit <- lm(train_data$SalePrice ~ train_data$OverallQual + train_data$GrLivArea + train_data$GarageCars + train_data$GarageArea, data=train_data)
summary(fit)
##
## Call:
## lm(formula = train_data$SalePrice ~ train_data$OverallQual +
## train_data$GrLivArea + train_data$GarageCars + train_data$GarageArea,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -372594 -21236 -1594 18625 301129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -98436.050 4820.467 -20.420 < 2e-16 ***
## train_data$OverallQual 26988.854 1067.393 25.285 < 2e-16 ***
## train_data$GrLivArea 49.573 2.555 19.402 < 2e-16 ***
## train_data$GarageCars 11317.522 3126.297 3.620 0.000305 ***
## train_data$GarageArea 41.478 10.627 3.903 9.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40420 on 1455 degrees of freedom
## Multiple R-squared: 0.7418, Adjusted R-squared: 0.7411
## F-statistic: 1045 on 4 and 1455 DF, p-value: < 2.2e-16
# Plots
par(mfrow=c(2,2))
X1<-train_data$GarageCars
X2<-train_data$OverallQual
X3<-train_data$GarageArea
X4<-train_data$GrLivArea
Y<-train_data$SalePrice
plot(X1,Y, col="#b40da5", main="GarageCars", ylab="Sale Price")
abline(lm(Y~X1), col="Green", lwd=3) # regression line
plot(X2,Y, col="#f7f8b8", main="OverallQual", ylab="Sale Price")
abline(lm(Y~X2), col="Blue", lwd=3) # regression line
plot(X3,Y, col="#3cf8ed", main="GarageArea", ylab="Sale Price")
abline(lm(Y~X3), col="Brown", lwd=3) # regression line
plot(X4,Y, col="#abf395", main="GrLivArea", ylab="Sale Price")
abline(lm(Y~X4), col="Purple", lwd=3) # regression line
#Load test data set and create calculated column using equation for multiple linear regression.
test_data <- read_csv(file="https://raw.githubusercontent.com/nathtrish334/Data-605/main/final/test.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## Id = col_double(),
## MSSubClass = col_double(),
## LotFrontage = col_double(),
## LotArea = col_double(),
## OverallQual = col_double(),
## OverallCond = col_double(),
## YearBuilt = col_double(),
## YearRemodAdd = col_double(),
## MasVnrArea = col_double(),
## BsmtFinSF1 = col_double(),
## BsmtFinSF2 = col_double(),
## BsmtUnfSF = col_double(),
## TotalBsmtSF = col_double(),
## `1stFlrSF` = col_double(),
## `2ndFlrSF` = col_double(),
## LowQualFinSF = col_double(),
## GrLivArea = col_double(),
## BsmtFullBath = col_double(),
## BsmtHalfBath = col_double(),
## FullBath = col_double()
## # ... with 17 more columns
## )
## i Use `spec()` for the full column specifications.
SalePrice<-((26988.854*train_data$OverallQual) + (49.573*train_data$GrLivArea) + (11317.522*train_data$GarageCars) + (41.478*train_data$GarageArea) -98436.050)
test_data<-test_data[,c("Id","OverallQual","GrLivArea","GarageCars","GarageArea")]
kable(head(test_data))
Id | OverallQual | GrLivArea | GarageCars | GarageArea |
---|---|---|---|---|
1461 | 5 | 896 | 1 | 730 |
1462 | 6 | 1329 | 1 | 312 |
1463 | 5 | 1629 | 2 | 482 |
1464 | 6 | 1604 | 2 | 470 |
1465 | 8 | 1280 | 2 | 506 |
1466 | 6 | 1655 | 2 | 440 |
submission <- cbind(test_data$Id,SalePrice)
## Warning in cbind(test_data$Id, SalePrice): number of rows of result is not a
## multiple of vector length (arg 1)
colnames(submission)[1] <- "Id"
submission[submission<0] <- median(SalePrice) # remove negatives
submission<-as.data.frame(submission[1:1459,])
kable(head(submission))
Id | SalePrice |
---|---|
1461 | 220620.7 |
1462 | 167773.1 |
1463 | 226877.0 |
1464 | 236184.2 |
1465 | 295064.4 |
1466 | 146571.1 |
# Export to csv
write.csv(submission, file = "C:/FinalExam/kaggle_submission.csv", quote=FALSE, row.names=FALSE)
Kaggle Score