Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=(N+1)/2\)

set.seed(10000)
N <- 43
X <- round(runif(10000, 1, N))
Y <- round(rnorm(10000, mean = (N+1)/2, sd = (N+1)/2))

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. (5 points)

x <- median(X)
y <- quantile(Y,0.25,names=FALSE)

P(X>x | X>y)

(a<-min(pnorm(X>x | X>y)))

## [1] 0.5

P(X>x, Y>y)

(b<-min(pnorm(X>x ,Y>y)))

## [1] 0.1586553

P(X<x | X>y)

(c<-min(pnorm(X<x, X>y)))

## [1] 0.1586553

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.(5 points)

a<-pnorm(X>x)*pnorm(Y>y)
b<-pnorm((X>x)*(Y>y))
r<-rbind(table(a),table(b))

## Warning in rbind(table(a), table(b)): number of columns of result is not a
## multiple of vector length (arg 2)

row.names(r)<-c('P(X>x and Y>y)','P(X>x)P(Y>y)')
colnames(r)<-names(table(round(a,2)))
rp<-round(addmargins(prop.table(r)),2)
ftable(round(a,2))

##  0.25 0.42 0.71
##                
##  1302 5017 3681

ftable(round(b,2))

##   0.5 0.84
##           
##  6319 3681

rp

##                0.25 0.42 0.71  Sum
## P(X>x and Y>y) 0.05 0.19 0.14 0.38
## P(X>x)P(Y>y)   0.24 0.14 0.24 0.62
## Sum            0.29 0.33 0.38 1.00

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? (5 points)

fst<-fisher.test(rp[1,],rp[2,])
cst<-chisq.test(rp[1,],rp[2,])

## Warning in chisq.test(rp[1, ], rp[2, ]): Chi-squared approximation may be
## incorrect

print(fst$p.value)

## [1] 1

print(cst$p.value)

## [1] 0.2381033

We got Fisher test value as 1 which fits the data better when compared to Chisquare having P value 0.24

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following. 5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of  for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, )). Plot a histogram and compare it with a histogram of your original variable. Using the exponential ptrain_data, find the 5th and 95th percentiles using the cumulative distribution function (Ctrain_data). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

library(MASS)
library(knitr)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(DT)
library(reshape)
library(corrplot)
library(Rmisc)

Dataset

train_data <- read_csv(file="https://raw.githubusercontent.com/nathtrish334/Data-605/main/final/train.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   Id = col_double(),
##   MSSubClass = col_double(),
##   LotFrontage = col_double(),
##   LotArea = col_double(),
##   OverallQual = col_double(),
##   OverallCond = col_double(),
##   YearBuilt = col_double(),
##   YearRemodAdd = col_double(),
##   MasVnrArea = col_double(),
##   BsmtFinSF1 = col_double(),
##   BsmtFinSF2 = col_double(),
##   BsmtUnfSF = col_double(),
##   TotalBsmtSF = col_double(),
##   `1stFlrSF` = col_double(),
##   `2ndFlrSF` = col_double(),
##   LowQualFinSF = col_double(),
##   GrLivArea = col_double(),
##   BsmtFullBath = col_double(),
##   BsmtHalfBath = col_double(),
##   FullBath = col_double()
##   # ... with 18 more columns
## )
## i Use `spec()` for the full column specifications.

datatable(train_data, options = list( pageLength = 5, lengthMenu = c(5, 10, 40),   initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#E9967A', 'color': '#fff'});",
    "}")), rownames=TRUE)

1. Variables

#test variable
X1<-train_data$OverallQual
Y1<-train_data$SalePrice

plot(X1,Y1,main="OverallQual vs SalePrice", xlab="OverallQual", ylab="SalePrice",col=2)

hist(Y1, col="violet", main="Histogram of Overall Quality")

#selected variable
X<-train_data$YearBuilt
Y<-train_data$SalePrice

plot(X,Y, col="#a82f89", main="Scatterplot of Year Built and Sale Price", xlab = "Year Built", ylab="Sale Price")
# Include regression line (y~x)
abline(lm(Y~X), col="blue", lwd=3)

print("Summary of X variable (Year Built)")

## [1] "Summary of X variable (Year Built)"

summary(X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1872    1954    1973    1971    2000    2010

print("Summary of Y variable (Sale Price)")

## [1] "Summary of Y variable (Sale Price)"

summary(Y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

2. Probability

Probability that a house has a year built greater than the third quartile

XQ3<-quantile(X, probs=0.75) #2000 #3rd quartile of X variable
YQ2<-quantile (Y, probs=0.50) #163000 #2nd quartile, or median, of Y variable

n<-(nrow(train_data))
yearbuilt<-as.numeric(train_data$YearBuilt)
saleprice<-as.numeric(train_data$SalePrice)

nYQ2<-nrow(subset(train_data,saleprice>YQ2))

p1<-nrow(subset(train_data, yearbuilt > XQ3 & saleprice>YQ2))/nYQ2
p1

## [1] 0.4436813

Probability that a house has a year built greater than the third quartile and a sale price above median value

p2<-nrow(subset(train_data, yearbuilt > XQ3 & saleprice>YQ2))/n
p2

## [1] 0.2212329

Probability that a house has year built less than or equal to the third quartile

p3<-nrow(subset(train_data, yearbuilt <=XQ3 & saleprice>YQ2))/nYQ2
p3

## [1] 0.5563187

c1<-nrow(subset(train_data, yearbuilt <=XQ3 & saleprice<=YQ2))/n
c2<-nrow(subset(train_data, yearbuilt <=XQ3 & saleprice>YQ2))/n
c3<-c1+c2
c4<-nrow(subset(train_data, yearbuilt >XQ3 & saleprice<=YQ2))/n
c5<-nrow(subset(train_data, yearbuilt >XQ3 & saleprice>YQ2))/n
c6<-c4+c5
c7<-c1+c4
c8<-c2+c5
c9<-c3+c6

train_counts<-matrix(round(c(c1,c2,c3,c4,c5,c6,c7,c8,c9),3), ncol=3, nrow=3, byrow=TRUE)
colnames(train_counts)<-c(
"<=2d quartile",
">2d quartile",
"Total")
rownames(train_counts)<-c("<=3rd quartile",">3rd quartile","Total")

train_counts<-as.table(train_counts)
train_vals<-round(train_counts*1460,0)

print("Quartile Matrix by Percentage")

## [1] "Quartile Matrix by Percentage"

train_counts

##                <=2d quartile >2d quartile Total
## <=3rd quartile         0.473        0.277 0.751
## >3rd quartile          0.028        0.221 0.249
## Total                  0.501        0.499 1.000

print("Quartile Matrix by Count")

## [1] "Quartile Matrix by Count"

train_vals

##                <=2d quartile >2d quartile Total
## <=3rd quartile           691          404  1096
## >3rd quartile             41          323   364
## Total                    731          729  1460

3. Independence

A Chi Square test for association

#A = new variable for count of observations above the 3rd quartile for X, 
#B = new variable for count of observations for the 2nd quartile for Y.
# Check if P(A|B)=P(A)P(B)
papb<-c4*c5
print (paste0("p(A)*p(B)=", round(papb,5)))

## [1] "p(A)*p(B)=0.00621"

mat <- matrix(c(691, 404, 41, 323), 2, 2, byrow=T) 
chisq.test(mat, correct=TRUE)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mat
## X-squared = 291.61, df = 1, p-value < 2.2e-16

4. Statistics

isnum <- sapply(train_data, is.numeric)
train_num<-train_data[ , isnum]
summary(train_num)

##        Id           MSSubClass     LotFrontage        LotArea      
##  Min.   :   1.0   Min.   : 20.0   Min.   : 21.00   Min.   :  1300  
##  1st Qu.: 365.8   1st Qu.: 20.0   1st Qu.: 59.00   1st Qu.:  7554  
##  Median : 730.5   Median : 50.0   Median : 69.00   Median :  9478  
##  Mean   : 730.5   Mean   : 56.9   Mean   : 70.05   Mean   : 10517  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   3rd Qu.: 80.00   3rd Qu.: 11602  
##  Max.   :1460.0   Max.   :190.0   Max.   :313.00   Max.   :215245  
##                                   NA's   :259                      
##   OverallQual      OverallCond      YearBuilt     YearRemodAdd 
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967  
##  Median : 6.000   Median :5.000   Median :1973   Median :1994  
##  Mean   : 6.099   Mean   :5.575   Mean   :1971   Mean   :1985  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    MasVnrArea       BsmtFinSF1       BsmtFinSF2        BsmtUnfSF     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 223.0  
##  Median :   0.0   Median : 383.5   Median :   0.00   Median : 477.5  
##  Mean   : 103.7   Mean   : 443.6   Mean   :  46.55   Mean   : 567.2  
##  3rd Qu.: 166.0   3rd Qu.: 712.2   3rd Qu.:   0.00   3rd Qu.: 808.0  
##  Max.   :1600.0   Max.   :5644.0   Max.   :1474.00   Max.   :2336.0  
##  NA's   :8                                                           
##   TotalBsmtSF        1stFlrSF       2ndFlrSF     LowQualFinSF    
##  Min.   :   0.0   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  1st Qu.: 795.8   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  Median : 991.5   Median :1087   Median :   0   Median :  0.000  
##  Mean   :1057.4   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  3rd Qu.:1298.2   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  Max.   :6110.0   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                  
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000   Max.   :14.000  
##                                                                   
##    Fireplaces     GarageYrBlt     GarageCars      GarageArea    
##  Min.   :0.000   Min.   :1900   Min.   :0.000   Min.   :   0.0  
##  1st Qu.:0.000   1st Qu.:1961   1st Qu.:1.000   1st Qu.: 334.5  
##  Median :1.000   Median :1980   Median :2.000   Median : 480.0  
##  Mean   :0.613   Mean   :1979   Mean   :1.767   Mean   : 473.0  
##  3rd Qu.:1.000   3rd Qu.:2002   3rd Qu.:2.000   3rd Qu.: 576.0  
##  Max.   :3.000   Max.   :2010   Max.   :4.000   Max.   :1418.0  
##                  NA's   :81                                     
##    WoodDeckSF      OpenPorchSF     EnclosedPorch      3SsnPorch     
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##  Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##  3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                     
##   ScreenPorch        PoolArea          MiscVal             MoSold      
##  Min.   :  0.00   Min.   :  0.000   Min.   :    0.00   Min.   : 1.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 5.000  
##  Median :  0.00   Median :  0.000   Median :    0.00   Median : 6.000  
##  Mean   : 15.06   Mean   :  2.759   Mean   :   43.49   Mean   : 6.322  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000  
##  Max.   :480.00   Max.   :738.000   Max.   :15500.00   Max.   :12.000  
##                                                                        
##      YrSold       SalePrice     
##  Min.   :2006   Min.   : 34900  
##  1st Qu.:2007   1st Qu.:129975  
##  Median :2008   Median :163000  
##  Mean   :2008   Mean   :180921  
##  3rd Qu.:2009   3rd Qu.:214000  
##  Max.   :2010   Max.   :755000  
##

# 95% CI for the difference in the mean of the variables
t.test(train_data$YearBuilt, train_data$SalePrice)

## 
##  Welch Two Sample t-test
## 
## data:  train_data$YearBuilt and train_data$SalePrice
## t = -86.071, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183028.3 -174871.6
## sample estimates:
##  mean of x  mean of y 
##   1971.268 180921.196

# Derive a correlation matrix for two of the quantitative variables
myvars<-data.frame(train_data$YearBuilt, train_data$SalePrice)
cor(myvars)

##                      train_data.YearBuilt train_data.SalePrice
## train_data.YearBuilt            1.0000000            0.5228973
## train_data.SalePrice            0.5228973            1.0000000

cor.test(train_data$YearBuilt, train_data$SalePrice, conf.level = 0.99)

## 
##  Pearson's product-moment correlation
## 
## data:  train_data$YearBuilt and train_data$SalePrice
## t = 23.424, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.4721529 0.5701878
## sample estimates:
##       cor 
## 0.5228973

t.test(train_data$YearBuilt, train_data$SalePrice, conf.level = 0.99)

## 
##  Welch Two Sample t-test
## 
## data:  train_data$YearBuilt and train_data$SalePrice
## t = -86.071, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
##  -184312.4 -173587.5
## sample estimates:
##  mean of x  mean of y 
##   1971.268 180921.196

5. Correlation

#Correlation matrix
mymx<-as.matrix(cor(myvars))
mymx

##                      train_data.YearBuilt train_data.SalePrice
## train_data.YearBuilt            1.0000000            0.5228973
## train_data.SalePrice            0.5228973            1.0000000

# precision matrix (inverse of correlation matrix)
ginvmymx<-ginv(mymx)
ginvmymx

##            [,1]       [,2]
## [1,]  1.3763140 -0.7196709
## [2,] -0.7196709  1.3763140

# Principal Components Analysis

# Correlation matrix for all qualitative variables
cormatrix<-cor(train_num)
cor_train<-as.data.frame(cormatrix)
kable(head(cor_train))

	Id	MSSubClass	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	1stFlrSF	2ndFlrSF	LowQualFinSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	TotRmsAbvGrd	Fireplaces	GarageYrBlt	GarageCars	GarageArea	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold	SalePrice
Id	1.0000000	0.0111565	NA	-0.0332255	-0.0283648	0.0126089	-0.0127127	-0.0219976	NA	-0.0050240	-0.0059677	-0.0079397	-0.0154146	0.0104960	0.0055898	-0.0442300	0.0082728	0.0022886	-0.0201547	0.0055875	0.0067838	0.0377186	0.0029512	0.0272387	-0.0197716	NA	0.0165697	0.0176338	-0.0296432	-0.0004769	0.0028892	-0.0466348	0.0013302	0.0570439	-0.0062424	0.0211722	0.0007118	-0.0219167
MSSubClass	0.0111565	1.0000000	NA	-0.1397811	0.0326277	-0.0593158	0.0278501	0.0405810	NA	-0.0698357	-0.0656486	-0.1407595	-0.2385184	-0.2517584	0.3078857	0.0464738	0.0748532	0.0034910	-0.0023325	0.1316082	0.1773544	-0.0234380	0.2817210	0.0403801	-0.0455693	NA	-0.0401098	-0.0986715	-0.0125794	-0.0061001	-0.0120366	-0.0438245	-0.0260302	0.0082827	-0.0076833	-0.0135846	-0.0214070	-0.0842841
LotFrontage	NA	NA	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
LotArea	-0.0332255	-0.1397811	NA	1.0000000	0.1058057	-0.0056363	0.0142277	0.0137884	NA	0.2141031	0.1111697	-0.0026184	0.2608331	0.2994746	0.0509859	0.0047790	0.2631162	0.1581545	0.0480456	0.1260306	0.0142595	0.1196899	-0.0177839	0.1900148	0.2713640	NA	0.1548707	0.1804028	0.1716977	0.0847738	-0.0183397	0.0204228	0.0431604	0.0776724	0.0380677	0.0012050	-0.0142614	0.2638434
OverallQual	-0.0283648	0.0326277	NA	0.1058057	1.0000000	-0.0919323	0.5723228	0.5506839	NA	0.2396660	-0.0591187	0.3081589	0.5378085	0.4762238	0.2954929	-0.0304293	0.5930074	0.1110978	-0.0401502	0.5505997	0.2734581	0.1016764	-0.1838822	0.4274523	0.3967650	NA	0.6006707	0.5620218	0.2389234	0.3088188	-0.1139369	0.0303706	0.0648864	0.0651658	-0.0314062	0.0708152	-0.0273467	0.7909816
OverallCond	0.0126089	-0.0593158	NA	-0.0056363	-0.0919323	1.0000000	-0.3759832	0.0737415	NA	-0.0462309	0.0402292	-0.1368406	-0.1710975	-0.1442028	0.0289421	0.0254943	-0.0796859	-0.0549415	0.1178209	-0.1941495	-0.0607693	0.0129801	-0.0870009	-0.0575832	-0.0238200	NA	-0.1857575	-0.1515214	-0.0033337	-0.0325888	0.0703562	0.0255037	0.0548105	-0.0019849	0.0687768	-0.0035108	0.0439497	-0.0778559

6. Sampling

# fit an exponential probability density function.
fit <- fitdistr(train_data$YearBuilt, "exponential")

#optimal lambda value for the distribution
lambda <- fit$estimate
train_sample <- rexp(1000, lambda)
lambda

##         rate 
## 0.0005072877

# Histogram from the distribution
train_sample<-data.frame(as.numeric(train_sample))
colnames(train_sample)[1] <- "sample"

hist(train_sample$sample, col="green", main="Histogram for Exponential Distribution", xlab = "Year Built", breaks=30)

#95% confidence interval from the empirical data
CI(train_data$YearBuilt, 0.95)

##    upper     mean    lower 
## 1972.818 1971.268 1969.717

7. Modelling

#test of alternate model
modvars <- train_data[, which(sapply(train_data, function(x) sum(is.na(x))) == 0)]
model1 <- step(lm(train_data$SalePrice ~ ., modvars), direction = 'backward', trace = FALSE)
model1

## 
## Call:
## lm(formula = train_data$SalePrice ~ MSZoning + LotArea + Street + 
##     LandContour + LotConfig + LandSlope + Neighborhood + Condition1 + 
##     Condition2 + BldgType + HouseStyle + OverallQual + OverallCond + 
##     YearBuilt + YearRemodAdd + RoofStyle + RoofMatl + ExterQual + 
##     Foundation + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + `1stFlrSF` + 
##     `2ndFlrSF` + FullBath + BedroomAbvGr + KitchenAbvGr + KitchenQual + 
##     TotRmsAbvGrd + Functional + Fireplaces + GarageCars + GarageArea + 
##     WoodDeckSF + ScreenPorch + PoolArea + MoSold + SaleType, 
##     data = modvars)
## 
## Coefficients:
##         (Intercept)           MSZoningFV           MSZoningRH  
##          -1.590e+06            3.081e+04            2.235e+04  
##          MSZoningRL           MSZoningRM              LotArea  
##           2.500e+04            2.399e+04            7.290e-01  
##          StreetPave       LandContourHLS       LandContourLow  
##           3.855e+04            1.455e+04           -4.130e+03  
##      LandContourLvl     LotConfigCulDSac         LotConfigFR2  
##           7.223e+03            7.028e+03           -5.531e+03  
##        LotConfigFR3      LotConfigInside         LandSlopeMod  
##          -1.330e+04           -1.009e+03            1.085e+04  
##        LandSlopeSev  NeighborhoodBlueste   NeighborhoodBrDale  
##          -3.134e+04            1.053e+03            7.068e+03  
## NeighborhoodBrkSide  NeighborhoodClearCr  NeighborhoodCollgCr  
##          -4.220e+02           -1.098e+04           -7.826e+03  
## NeighborhoodCrawfor  NeighborhoodEdwards  NeighborhoodGilbert  
##           8.985e+03           -1.417e+04           -1.344e+04  
##  NeighborhoodIDOTRR  NeighborhoodMeadowV  NeighborhoodMitchel  
##          -4.029e+03            3.845e+02           -1.931e+04  
##   NeighborhoodNAmes  NeighborhoodNoRidge  NeighborhoodNPkVill  
##          -1.268e+04            2.977e+04            4.911e+03  
## NeighborhoodNridgHt   NeighborhoodNWAmes  NeighborhoodOldTown  
##           2.459e+04           -2.299e+04           -1.200e+04  
##  NeighborhoodSawyer  NeighborhoodSawyerW  NeighborhoodSomerst  
##          -1.195e+04           -8.005e+03            1.200e+03  
## NeighborhoodStoneBr    NeighborhoodSWISU   NeighborhoodTimber  
##           3.612e+04           -5.338e+03           -6.451e+03  
## NeighborhoodVeenker      Condition1Feedr       Condition1Norm  
##           2.206e+03            3.626e+03            1.171e+04  
##      Condition1PosA       Condition1PosN       Condition1RRAe  
##           1.132e+04            1.143e+04           -1.462e+04  
##      Condition1RRAn       Condition1RRNe       Condition1RRNn  
##           8.111e+03           -5.697e+03            8.997e+03  
##     Condition2Feedr       Condition2Norm       Condition2PosA  
##          -5.155e+03           -2.705e+03            3.240e+04  
##      Condition2PosN       Condition2RRAe       Condition2RRAn  
##          -2.319e+05           -1.272e+05           -1.544e+03  
##      Condition2RRNn       BldgType2fmCon       BldgTypeDuplex  
##          -3.178e+03           -4.461e+03           -6.219e+01  
##       BldgTypeTwnhs       BldgTypeTwnhsE     HouseStyle1.5Unf  
##          -2.744e+04           -2.371e+04            1.274e+04  
##    HouseStyle1Story     HouseStyle2.5Fin     HouseStyle2.5Unf  
##           7.971e+03           -5.760e+03           -1.377e+04  
##    HouseStyle2Story     HouseStyleSFoyer       HouseStyleSLvl  
##          -6.455e+03            5.604e+03            4.988e+03  
##         OverallQual          OverallCond            YearBuilt  
##           7.793e+03            5.362e+03            3.420e+02  
##        YearRemodAdd       RoofStyleGable     RoofStyleGambrel  
##           1.097e+02           -8.072e+03           -8.076e+02  
##        RoofStyleHip     RoofStyleMansard        RoofStyleShed  
##          -5.084e+03            1.088e+04            7.967e+04  
##     RoofMatlCompShg      RoofMatlMembran        RoofMatlMetal  
##           6.545e+05            7.292e+05            6.996e+05  
##        RoofMatlRoll      RoofMatlTar&Grv      RoofMatlWdShake  
##           6.566e+05            6.470e+05            6.286e+05  
##     RoofMatlWdShngl          ExterQualFa          ExterQualGd  
##           7.311e+05           -1.446e+04           -3.204e+04  
##         ExterQualTA     FoundationCBlock      FoundationPConc  
##          -3.386e+04            5.553e+02            5.388e+03  
##      FoundationSlab      FoundationStone       FoundationWood  
##           1.256e+04           -2.508e+02           -2.822e+04  
##          BsmtFinSF1           BsmtFinSF2            BsmtUnfSF  
##           3.795e+01            2.609e+01            1.573e+01  
##          `1stFlrSF`           `2ndFlrSF`             FullBath  
##           5.531e+01            6.939e+01            2.933e+03  
##        BedroomAbvGr         KitchenAbvGr        KitchenQualFa  
##          -6.336e+03           -1.850e+04           -2.137e+04  
##       KitchenQualGd        KitchenQualTA         TotRmsAbvGrd  
##          -2.836e+04           -2.611e+04            1.555e+03  
##      FunctionalMaj2       FunctionalMin1       FunctionalMin2  
##          -9.426e+02            1.685e+03            9.192e+03  
##       FunctionalMod        FunctionalSev        FunctionalTyp  
##          -4.378e+03           -7.531e+04            1.799e+04  
##          Fireplaces           GarageCars           GarageArea  
##           2.944e+03            3.978e+03            1.328e+01  
##          WoodDeckSF          ScreenPorch             PoolArea  
##           9.906e+00            3.173e+01            7.234e+01  
##              MoSold          SaleTypeCon        SaleTypeConLD  
##          -5.716e+02            4.523e+04            1.747e+04  
##       SaleTypeConLI        SaleTypeConLw          SaleTypeCWD  
##           1.142e+04            6.004e+03            2.210e+04  
##         SaleTypeNew          SaleTypeOth           SaleTypeWD  
##           2.361e+04            1.601e+04            5.139e+03

# create multiple linear regression model using intercepts from regression summary.
fit <- lm(train_data$SalePrice ~ train_data$OverallQual + train_data$GrLivArea + train_data$GarageCars + train_data$GarageArea, data=train_data)
summary(fit)

## 
## Call:
## lm(formula = train_data$SalePrice ~ train_data$OverallQual + 
##     train_data$GrLivArea + train_data$GarageCars + train_data$GarageArea, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -372594  -21236   -1594   18625  301129 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -98436.050   4820.467 -20.420  < 2e-16 ***
## train_data$OverallQual  26988.854   1067.393  25.285  < 2e-16 ***
## train_data$GrLivArea       49.573      2.555  19.402  < 2e-16 ***
## train_data$GarageCars   11317.522   3126.297   3.620 0.000305 ***
## train_data$GarageArea      41.478     10.627   3.903 9.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40420 on 1455 degrees of freedom
## Multiple R-squared:  0.7418, Adjusted R-squared:  0.7411 
## F-statistic:  1045 on 4 and 1455 DF,  p-value: < 2.2e-16

# Plots
par(mfrow=c(2,2))
X1<-train_data$GarageCars
X2<-train_data$OverallQual
X3<-train_data$GarageArea
X4<-train_data$GrLivArea
Y<-train_data$SalePrice

plot(X1,Y, col="#b40da5", main="GarageCars", ylab="Sale Price")
abline(lm(Y~X1), col="Green", lwd=3) # regression line

plot(X2,Y, col="#f7f8b8", main="OverallQual", ylab="Sale Price")
abline(lm(Y~X2), col="Blue", lwd=3) # regression line

plot(X3,Y, col="#3cf8ed", main="GarageArea", ylab="Sale Price")
abline(lm(Y~X3), col="Brown", lwd=3) # regression line

plot(X4,Y, col="#abf395", main="GrLivArea", ylab="Sale Price")
abline(lm(Y~X4), col="Purple", lwd=3) # regression line

#Load test data set and create calculated column using equation for multiple linear regression.

test_data <- read_csv(file="https://raw.githubusercontent.com/nathtrish334/Data-605/main/final/test.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   Id = col_double(),
##   MSSubClass = col_double(),
##   LotFrontage = col_double(),
##   LotArea = col_double(),
##   OverallQual = col_double(),
##   OverallCond = col_double(),
##   YearBuilt = col_double(),
##   YearRemodAdd = col_double(),
##   MasVnrArea = col_double(),
##   BsmtFinSF1 = col_double(),
##   BsmtFinSF2 = col_double(),
##   BsmtUnfSF = col_double(),
##   TotalBsmtSF = col_double(),
##   `1stFlrSF` = col_double(),
##   `2ndFlrSF` = col_double(),
##   LowQualFinSF = col_double(),
##   GrLivArea = col_double(),
##   BsmtFullBath = col_double(),
##   BsmtHalfBath = col_double(),
##   FullBath = col_double()
##   # ... with 17 more columns
## )
## i Use `spec()` for the full column specifications.

SalePrice<-((26988.854*train_data$OverallQual) + (49.573*train_data$GrLivArea) +  (11317.522*train_data$GarageCars) + (41.478*train_data$GarageArea) -98436.050)
test_data<-test_data[,c("Id","OverallQual","GrLivArea","GarageCars","GarageArea")]

kable(head(test_data))

Id	OverallQual	GrLivArea	GarageCars	GarageArea
1461	5	896	1	730
1462	6	1329	1	312
1463	5	1629	2	482
1464	6	1604	2	470
1465	8	1280	2	506
1466	6	1655	2	440

submission <- cbind(test_data$Id,SalePrice)

## Warning in cbind(test_data$Id, SalePrice): number of rows of result is not a
## multiple of vector length (arg 1)

colnames(submission)[1] <- "Id"
submission[submission<0] <- median(SalePrice) # remove negatives
submission<-as.data.frame(submission[1:1459,])
kable(head(submission))

Id	SalePrice
1461	220620.7
1462	167773.1
1463	226877.0
1464	236184.2
1465	295064.4
1466	146571.1

# Export to csv
write.csv(submission, file = "C:/FinalExam/kaggle_submission.csv", quote=FALSE, row.names=FALSE)

Kaggle Score

Data 605 Final Exam

Trishita Nath

5/25/2021