Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \({\mu}\)=\({\sigma}\)=(N+1)/2
The histograms below use runif to generate random uniform numbers, whose frequencies are close to even across the range, and rnorm to generate random normal numbers. The histogram of X shows a uniform distribution, while the histogram of Y shows a normal distribution.
N <- 6
mean = (N+1)/2
sd = (N+1)/2
X <- runif(10000, 1, N)
hist(X)
Y <- rnorm(X, mean, sd)
hist(Y)
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.
x<-median(X)
y<-quantile(Y)[2]
Interpret the meaning of all probabilities. 5 points
a<-min(pnorm(X>x | X>y))
The minimum probability of the random uniform number X being greater than the median x, given that X is greater than the 1st quartile value y, is 0.5.
b<-min(pnorm(X>x ,Y>y))
The minimum probability of the random uniform number X being greater than the median x and the random normal number Y being greater than the 1st quartile value y is 0.16.
c<-min(pnorm(X<x, X>y))
The minimum probability of the random uniform number X being less than the median x and X being greater than the 1st quartile value y is 0.16.
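As a cross-check on the three probabilities, they can also be estimated directly from the simulated samples as proportions of logical indicators. The following is a minimal sketch, not the method used above: the seed and the names p_a, p_b, and p_c are assumptions made for illustration, and the conditional forms for (a) and (c) assume the usual reading P(A | B) = P(A and B) / P(B).

```r
set.seed(605)  # hypothetical seed, only so the sketch is reproducible

N <- 6
X <- runif(10000, 1, N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)

x <- median(X)          # small x: median of X
y <- quantile(Y, 0.25)  # small y: 1st quartile of Y

# mean() of a logical vector is the proportion of TRUE values,
# i.e. an empirical probability.
p_a <- mean(X > x & X > y) / mean(X > y)  # P(X > x | X > y)
p_b <- mean(X > x & Y > y)                # P(X > x and Y > y)
p_c <- mean(X < x & X > y) / mean(X > y)  # P(X < x | X > y)

print(c(a = p_a, b = p_b, c = p_c))
```

Because (a) and (c) condition on the same event, the two estimates should sum to 1 whenever no sample equals the median exactly.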
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities. 5 points.
The probability table below compares P(X>x and Y>y) with P(X>x)P(Y>y) and shows that the two differ.
a<-pnorm(X>x)*pnorm(Y>y)
#a<-rbinom(n=6, size = 10000, prob =dnorm((X>x)*(Y>y)))/10000
b<-pnorm((X>x)*(Y>y))
#b<-rbinom(n=6, size = 10000, prob =dnorm(X>x)*dnorm(Y>y))/10000
r<-rbind(table(a),table(b))
#r<-rbind(a[1:6],b[1:6])
row.names(r)<-c('P(X>x and Y>y)','P(X>x)P(Y>y)')
colnames(r)<-names(table(round(a,2)))
#colnames(r)<-c(1,2,3,4,5,6)
rp<-round(addmargins(prop.table(r)),2)
ftable(round(a,2))
##  0.25 0.42 0.71
##
##  1225 5050 3725
ftable(round(b,2))
##   0.5 0.84
##
##  6275 3725
rp
##                0.25 0.42 0.71  Sum
## P(X>x and Y>y) 0.05 0.19 0.14 0.38
## P(X>x)P(Y>y)   0.24 0.14 0.24 0.62
## Sum            0.29 0.33 0.38 1.00
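An alternative way to set up the comparison is to cross-tabulate the two indicator events and read the joint and marginal probabilities off the same table. This is a sketch under assumptions: the seed and the names counts, joint, p_joint, and p_product are illustrative and do not come from the analysis above.

```r
set.seed(605)  # hypothetical seed for reproducibility

N <- 6
X <- runif(10000, 1, N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

# 2x2 table of counts for the events X > x and Y > y
counts <- table(X_gt_x = X > x, Y_gt_y = Y > y)

# Joint probabilities, with marginal sums added as an extra row and column
joint <- prop.table(counts)
with_margins <- addmargins(joint)
print(round(with_margins, 3))

# Joint probability vs. product of the marginals
p_joint   <- joint["TRUE", "TRUE"]
p_product <- unname(rowSums(joint)["TRUE"] * colSums(joint)["TRUE"])
print(c(joint = p_joint, product = p_product))
```

If X and Y are independent, the two printed numbers should agree up to sampling noise.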
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? 5 points.
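chisq.test and fisher.test both check for independence when comparing categorical data. fisher.test computes an exact p-value and can be used for smaller datasets (cell counts < 10), while chisq.test relies on a large-sample approximation. In our test, chisq.test has a p-value of about 0.24 and fisher.test has a p-value of 1. Neither p-value is below 0.05, so neither test rejects independence. We take fisher.test as the better fit for this data, as its p-value is equal to 1.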
ft<-fisher.test(rp[1,],rp[2,])
ct<-chisq.test(rp[1,],rp[2,])
print(ft$p.value)
## [1] 1
print(ct$p.value)
## [1] 0.2381033
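Both tests are defined on a contingency table of counts, so a useful comparison is to run them on the 2x2 count table of the two events. The sketch below assumes a fresh simulation with a hypothetical seed; the object names counts, ct, and ft are illustrative.

```r
set.seed(605)  # hypothetical seed for reproducibility

N <- 6
X <- runif(10000, 1, N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

# Counts (not proportions) of the four combinations of the two events
counts <- table(X_gt_x = X > x, Y_gt_y = Y > y)

ct <- chisq.test(counts)   # large-sample approximation (Yates correction for 2x2)
ft <- fisher.test(counts)  # exact test based on the hypergeometric distribution

print(c(chisq = ct$p.value, fisher = ft$p.value))
```

With 10,000 observations every cell count is large, so the two p-values are expected to be close to each other.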
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
#Qualitative: Neighborhood, YearBuilt, KitchenQual
#Quantitative: GrLivArea,FullBath, BedroomAbvGr
library(dplyr)
library(tidyr)
library(readxl)
library(ggplot2)
library(plotly)
library(corrplot)
tr<-read.csv('C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/train.csv')
#head(tr)
#glimpse(tr)
As a start, we select the columns of interest and remove rows with NA values.
tr.1<-tr%>%
select(Id,SalePrice,YearBuilt, KitchenQual,GrLivArea,FullBath, BedroomAbvGr, Neighborhood) %>%
na.omit()
#select( -Alley,-PoolQC,-Fence,-MiscFeature)%>%
The summaries show the basic statistical data points for all variables
summary(tr.1$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   34900  129975  163000  180921  214000  755000
summary(tr.1$GrLivArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1130    1464    1515    1777    5642
summary(tr.1$YearBuilt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1872    1954    1973    1971    2000    2010
summary(tr.1$KitchenQual)
##  Ex  Fa  Gd  TA
## 100  39 586 735
summary(tr.1$Neighborhood)
## Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert
##      17       2      16      58      28     150      51     100      79
##  IDOTRR MeadowV Mitchel   NAmes NoRidge NPkVill NridgHt  NWAmes OldTown
##      37      17      49     225      41       9      77      73     113
##  Sawyer SawyerW Somerst StoneBr   SWISU  Timber Veenker
##      74      59      86      25      25      38      11
The ggplots below show that all independent variables have a positive correlation with Sale Price except for Kitchen Quality. Kitchen Quality is bimodal.
GrLivArea vs. SalePrice shows some positive linearity in the qqplot when GrLivArea is less than 3,000.
qqplot(tr.1$GrLivArea,tr.1$SalePrice)
lm1<-lm(tr.1$SalePrice~tr.1$GrLivArea+tr.1$Neighborhood)
summary(lm1)
##
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$GrLivArea + tr.1$Neighborhood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -303847 -20271 -465 16433 278716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83466.633 10511.722 7.940 4.04e-15 ***
## tr.1$GrLivArea 78.017 2.398 32.528 < 2e-16 ***
## tr.1$NeighborhoodBlueste -54605.854 30631.543 -1.783 0.074852 .
## tr.1$NeighborhoodBrDale -68161.393 14288.908 -4.770 2.03e-06 ***
## tr.1$NeighborhoodBrkSide -52492.887 11313.997 -4.640 3.81e-06 ***
## tr.1$NeighborhoodClearCr -10404.664 12628.480 -0.824 0.410131
## tr.1$NeighborhoodCollgCr -1005.611 10486.955 -0.096 0.923620
## tr.1$NeighborhoodCrawfor -12618.482 11508.684 -1.096 0.273074
## tr.1$NeighborhoodEdwards -59793.362 10751.816 -5.561 3.19e-08 ***
## tr.1$NeighborhoodGilbert -18663.359 10967.328 -1.702 0.089024 .
## tr.1$NeighborhoodIDOTRR -72461.908 12025.616 -6.026 2.14e-09 ***
## tr.1$NeighborhoodMeadowV -67505.993 14082.499 -4.794 1.81e-06 ***
## tr.1$NeighborhoodMitchel -28166.940 11538.443 -2.441 0.014761 *
## tr.1$NeighborhoodNAmes -39846.611 10310.613 -3.865 0.000116 ***
## tr.1$NeighborhoodNoRidge 56094.460 12101.237 4.635 3.89e-06 ***
## tr.1$NeighborhoodNPkVill -38527.983 16896.812 -2.280 0.022743 *
## tr.1$NeighborhoodNridgHt 83326.716 11042.752 7.546 7.95e-14 ***
## tr.1$NeighborhoodNWAmes -29213.522 11058.228 -2.642 0.008337 **
## tr.1$NeighborhoodOldTown -70685.672 10660.249 -6.631 4.72e-11 ***
## tr.1$NeighborhoodSawyer -41475.175 11032.540 -3.759 0.000177 ***
## tr.1$NeighborhoodSawyerW -21349.902 11286.521 -1.892 0.058742 .
## tr.1$NeighborhoodSomerst 17346.641 10883.665 1.594 0.111196
## tr.1$NeighborhoodStoneBr 80431.442 12926.676 6.222 6.42e-10 ***
## tr.1$NeighborhoodSWISU -81403.324 12912.389 -6.304 3.85e-10 ***
## tr.1$NeighborhoodTimber 22299.649 11981.077 1.861 0.062915 .
## tr.1$NeighborhoodVeenker 35187.678 15858.060 2.219 0.026649 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40980 on 1434 degrees of freedom
## Multiple R-squared: 0.7385, Adjusted R-squared: 0.734
## F-statistic: 162 on 25 and 1434 DF, p-value: < 2.2e-16
The ggplots show a positive correlation in each separate plot, broken down by the categorical variables Neighborhood, Year Built, and Kitchen Quality.
ggplot(tr.1, aes(GrLivArea, SalePrice,
width = 800, height = 300)) +
geom_point(aes(group=Neighborhood,size = SalePrice, color = Neighborhood), alpha = 0.2)+
stat_smooth(method = "lm", col = "red")
lm2<-lm(tr.1$SalePrice~tr.1$GrLivArea+tr.1$YearBuilt)
summary(lm2)
##
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$GrLivArea + tr.1$YearBuilt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -452049 -25741 -2331 17873 310520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.025e+06 8.090e+04 -25.03 <2e-16 ***
## tr.1$GrLivArea 9.517e+01 2.377e+00 40.03 <2e-16 ***
## tr.1$YearBuilt 1.046e+03 4.136e+01 25.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 46760 on 1457 degrees of freedom
## Multiple R-squared: 0.654, Adjusted R-squared: 0.6535
## F-statistic: 1377 on 2 and 1457 DF, p-value: < 2.2e-16
ggplot(tr.1, aes(GrLivArea, SalePrice,
width = 800, height = 300)) +
geom_point(aes(group=YearBuilt,size = SalePrice, color = YearBuilt), alpha = 0.2)+
stat_smooth(method = "lm", col = "red")
lm3<-lm(tr.1$SalePrice~tr.1$GrLivArea+tr.1$KitchenQual)
summary(lm3)
##
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$GrLivArea + tr.1$KitchenQual)
##
## Residuals:
## Min 1Q Median 3Q Max
## -442830 -23711 -38 20935 263336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.648e+05 7.000e+03 23.54 <2e-16 ***
## tr.1$GrLivArea 7.764e+01 2.518e+00 30.84 <2e-16 ***
## tr.1$KitchenQualFa -1.566e+05 8.874e+03 -17.65 <2e-16 ***
## tr.1$KitchenQualGd -8.159e+04 5.062e+03 -16.12 <2e-16 ***
## tr.1$KitchenQualTA -1.283e+05 5.239e+03 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45600 on 1455 degrees of freedom
## Multiple R-squared: 0.6714, Adjusted R-squared: 0.6705
## F-statistic: 743.1 on 4 and 1455 DF, p-value: < 2.2e-16
ggplot(tr.1, aes(GrLivArea, SalePrice,
width = 800, height = 300)) +
geom_point(aes(group=KitchenQual,size = SalePrice, color = KitchenQual), alpha = 0.2)+
stat_smooth(method = "lm", col = "red")
Full Bath vs. Sale Price also shows positive linearity. When Neighborhood or Kitchen Quality is added to the linear model, the FullBath coefficient remains positive. However, when Year Built is added, the y-intercept estimate is negative.
qqplot(tr.1$FullBath,tr.1$SalePrice)
lm1<-lm(tr.1$SalePrice~tr.1$FullBath+tr.1$Neighborhood)
summary(lm1)
##
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$FullBath + tr.1$Neighborhood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -160552 -26404 -3605 20432 379902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111848 13443 8.320 < 2e-16 ***
## tr.1$FullBath 44106 2990 14.751 < 2e-16 ***
## tr.1$NeighborhoodBlueste -40507 37644 -1.076 0.282089
## tr.1$NeighborhoodBrDale -59730 17655 -3.383 0.000736 ***
## tr.1$NeighborhoodBrkSide -35683 14076 -2.535 0.011351 *
## tr.1$NeighborhoodClearCr 29833 15498 1.925 0.054434 .
## tr.1$NeighborhoodCollgCr 8785 12887 0.682 0.495515
## tr.1$NeighborhoodCrawfor 32185 14140 2.276 0.022984 *
## tr.1$NeighborhoodEdwards -43612 13297 -3.280 0.001063 **
## tr.1$NeighborhoodGilbert -6089 13460 -0.452 0.651086
## tr.1$NeighborhoodIDOTRR -58214 14954 -3.893 0.000104 ***
## tr.1$NeighborhoodMeadowV -62566 17415 -3.593 0.000338 ***
## tr.1$NeighborhoodMitchel -19487 14227 -1.370 0.171007
## tr.1$NeighborhoodNAmes -19516 12818 -1.523 0.128079
## tr.1$NeighborhoodNoRidge 130933 14534 9.009 < 2e-16 ***
## tr.1$NeighborhoodNPkVill -57365 20752 -2.764 0.005778 **
## tr.1$NeighborhoodNridgHt 114492 13496 8.483 < 2e-16 ***
## tr.1$NeighborhoodNWAmes -5572 13555 -0.411 0.681082
## tr.1$NeighborhoodOldTown -42561 13195 -3.225 0.001286 **
## tr.1$NeighborhoodSawyer -28101 13689 -2.053 0.040274 *
## tr.1$NeighborhoodSawyerW -2291 13861 -0.165 0.868767
## tr.1$NeighborhoodSomerst 25833 13364 1.933 0.053426 .
## tr.1$NeighborhoodStoneBr 113968 15824 7.202 9.52e-13 ***
## tr.1$NeighborhoodSWISU -41590 15840 -2.626 0.008739 **
## tr.1$NeighborhoodTimber 46830 14687 3.189 0.001461 **
## tr.1$NeighborhoodVeenker 66780 19539 3.418 0.000649 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50330 on 1434 degrees of freedom
## Multiple R-squared: 0.6054, Adjusted R-squared: 0.5986
## F-statistic: 88.02 on 25 and 1434 DF, p-value: < 2.2e-16
ggplot(data= tr.1) +
geom_bar(mapping = aes(x = FullBath, y = SalePrice, fill= Neighborhood), stat = "identity", position = "identity")+
theme(axis.text.x=element_text(size=9))+
geom_abline(intercept = lm1$coefficients[1], slope = lm1$coefficients[2])
lm2<-lm(tr.1$SalePrice~tr.1$FullBath+tr.1$YearBuilt)
summary(lm2)
##
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$FullBath + tr.1$YearBuilt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128819 -38034 -9222 21116 470440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.639e+06 1.166e+05 -14.05 <2e-16 ***
## tr.1$FullBath 5.833e+04 3.309e+03 17.63 <2e-16 ***
## tr.1$YearBuilt 8.771e+02 6.035e+01 14.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 61520 on 1457 degrees of freedom
## Multiple R-squared: 0.4012, Adjusted R-squared: 0.4003
## F-statistic: 488 on 2 and 1457 DF, p-value: < 2.2e-16
ggplot(data= tr.1) +
geom_bar(mapping = aes(x = FullBath, y = SalePrice, fill= YearBuilt), stat = "identity", position = "identity")+
theme(axis.text.x=element_text(size=9))+
geom_abline(intercept = lm2$coefficients[1], slope = lm2$coefficients[2])
lm3<-lm(tr.1$SalePrice~tr.1$FullBath+tr.1$KitchenQual)
summary(lm3)
##
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$FullBath + tr.1$KitchenQual)
##
## Residuals:
## Min 1Q Median 3Q Max
## -199462 -31878 -3311 22356 370788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 229608 7584 30.28 <2e-16 ***
## tr.1$FullBath 51535 2829 18.21 <2e-16 ***
## tr.1$KitchenQualFa -186149 10193 -18.26 <2e-16 ***
## tr.1$KitchenQualGd -111064 5733 -19.37 <2e-16 ***
## tr.1$KitchenQualTA -158499 5877 -26.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52920 on 1455 degrees of freedom
## Multiple R-squared: 0.5575, Adjusted R-squared: 0.5563
## F-statistic: 458.3 on 4 and 1455 DF, p-value: < 2.2e-16
ggplot(data= tr.1) +
geom_bar(mapping = aes(x = FullBath, y = SalePrice, fill= KitchenQual), stat = "identity", position = "identity")+
theme(axis.text.x=element_text(size=9))+
geom_abline(intercept = lm3$coefficients[1], slope = lm3$coefficients[2])
Bedroom Above Ground vs. Sale Price is bimodal. There is little correlation. Here also, when Year Built is added to the linear model, it results in a negative y-intercept estimate.
qqplot(tr.1$BedroomAbvGr,tr.1$SalePrice)
lm1<-lm(tr.1$SalePrice~tr.1$BedroomAbvGr+tr.1$Neighborhood)
lm1
##
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$BedroomAbvGr + tr.1$Neighborhood)
##
## Coefficients:
## (Intercept) tr.1$BedroomAbvGr
## 164971 16397
## tr.1$NeighborhoodBlueste tr.1$NeighborhoodBrDale
## -68463 -101469
## tr.1$NeighborhoodBrkSide tr.1$NeighborhoodClearCr
## -82825 161
## tr.1$NeighborhoodCollgCr tr.1$NeighborhoodCrawfor
## -13244 -3215
## tr.1$NeighborhoodEdwards tr.1$NeighborhoodGilbert
## -83974 -22967
## tr.1$NeighborhoodIDOTRR tr.1$NeighborhoodMeadowV
## -106061 -105940
## tr.1$NeighborhoodMitchel tr.1$NeighborhoodNAmes
## -53876 -67221
## tr.1$NeighborhoodNoRidge tr.1$NeighborhoodNPkVill
## 112736 -64179
## tr.1$NeighborhoodNridgHt tr.1$NeighborhoodNWAmes
## 107007 -29828
## tr.1$NeighborhoodOldTown tr.1$NeighborhoodSawyer
## -82889 -76260
## tr.1$NeighborhoodSawyerW tr.1$NeighborhoodSomerst
## -26494 16557
## tr.1$NeighborhoodStoneBr tr.1$NeighborhoodSWISU
## 107488 -84031
## tr.1$NeighborhoodTimber tr.1$NeighborhoodVeenker
## 29381 38027
ggplot(data= tr.1) +
geom_bar(mapping = aes(x = BedroomAbvGr, y = SalePrice, fill= Neighborhood), stat = "identity", position = "identity")+
theme(axis.text.x=element_text(size=9))+
geom_abline(intercept = lm1$coefficients[1], slope = lm1$coefficients[2])
lm2<-lm(tr.1$SalePrice~tr.1$BedroomAbvGr+tr.1$YearBuilt)
lm2
##
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$BedroomAbvGr + tr.1$YearBuilt)
##
## Coefficients:
## (Intercept) tr.1$BedroomAbvGr tr.1$YearBuilt
## -2663395 20079 1414
ggplot(data= tr.1) +
geom_bar(mapping = aes(x = BedroomAbvGr, y = SalePrice, fill= YearBuilt), stat = "identity", position = "identity")+
theme(axis.text.x=element_text(size=9))+
geom_abline(intercept = lm2$coefficients[1], slope = lm2$coefficients[2])
lm3<-lm(tr.1$SalePrice~tr.1$BedroomAbvGr+tr.1$KitchenQual)
lm3
##
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$BedroomAbvGr + tr.1$KitchenQual)
##
## Coefficients:
## (Intercept) tr.1$BedroomAbvGr tr.1$KitchenQualFa
## 278635 18153 -223339
## tr.1$KitchenQualGd tr.1$KitchenQualTA
## -118715 -190957
ggplot(data= tr.1) +
geom_bar(mapping = aes(x = BedroomAbvGr, y = SalePrice, fill= KitchenQual), stat = "identity", position = "identity")+
theme(axis.text.x=element_text(size=9))+
geom_abline(intercept = lm3$coefficients[1], slope = lm3$coefficients[2])
The correlation matrix confirms a fairly strong correlation between Sale Price and Ground Living Area (.71) and between Sale Price and Full Bath (.56), while Bedroom Above Ground has a lower correlation with Sale Price at .17.
tr.2<-tr.1%>%
select(SalePrice,GrLivArea,FullBath, BedroomAbvGr)
tr.c<-round(cor(tr.2),2)
tr.c
##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice         1.00      0.71     0.56         0.17
## GrLivArea         0.71      1.00     0.63         0.52
## FullBath          0.56      0.63     1.00         0.36
## BedroomAbvGr      0.17      0.52     0.36         1.00
tr.g<-cor.test(tr.2$SalePrice,tr.2$GrLivArea,conf.level = .8)
tr.f<-cor.test(tr.2$SalePrice,tr.2$FullBath,conf.level = .8)
tr.b<-cor.test(tr.2$SalePrice,tr.2$BedroomAbvGr,conf.level = .8)
tr.g
##
## Pearson's product-moment correlation
##
## data: tr.2$SalePrice and tr.2$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
tr.f
##
## Pearson's product-moment correlation
##
## data: tr.2$SalePrice and tr.2$FullBath
## t = 25.854, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5372107 0.5832505
## sample estimates:
## cor
## 0.5606638
tr.b
##
## Pearson's product-moment correlation
##
## data: tr.2$SalePrice and tr.2$BedroomAbvGr
## t = 6.5159, df = 1458, p-value = 9.927e-11
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1354160 0.2006421
## sample estimates:
## cor
## 0.1682132
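All three p-values are far below 0.05, so we reject the hypothesis that the correlation is 0 in each case. With 80% confidence, the correlation between SalePrice and GrLivArea is between 0.69 and 0.72, between SalePrice and FullBath between 0.54 and 0.58, and between SalePrice and BedroomAbvGr between 0.14 and 0.20.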
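On the familywise error question, one way to gauge the risk is to compute the chance of at least one false rejection across the three tests and to apply a multiple-comparison adjustment to the p-values. This is a sketch under assumptions: the tests are treated as independent for the error-rate formula, and the two p-values reported only as "< 2.2e-16" are entered as that upper bound.

```r
alpha <- 0.20  # per-test error rate implied by an 80% confidence level
m <- 3         # number of pairwise tests

# Familywise error rate if the m tests were independent
fwer <- 1 - (1 - alpha)^m
print(fwer)

# p-values taken from the cor.test output above; the first two are
# upper bounds because R reports them only as "< 2.2e-16"
p_raw <- c(GrLivArea = 2.2e-16, FullBath = 2.2e-16, BedroomAbvGr = 9.927e-11)

# Bonferroni adjustment multiplies each p-value by m (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
print(p_bonf)
```

Since the adjusted p-values stay far below any common threshold, the conclusions for these three tests would not change under the adjustment.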
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
For the precision matrix we use matlib to invert the correlation matrix, then multiply the correlation matrix by the precision matrix and the precision matrix by the correlation matrix. For the LU decomposition we use the upper.tri and lower.tri functions to get the upper and lower triangles of the matrix.
library(matlib)
tr.p<-round(inv(tr.c),2)
tr.p
##
## [1,]  2.40 -1.75 -0.48  0.68
## [2,] -1.75  3.26 -0.66 -1.16
## [3,] -0.48 -0.66  1.76 -0.21
## [4,]  0.68 -1.16 -0.21  1.56
tr.m<-round(tr.c%*%tr.p,2)
tr.m
##
## SalePrice    1.00 0 0 0
## GrLivArea    0.01 1 0 0
## FullBath     0.01 0 1 0
## BedroomAbvGr 0.01 0 0 1
tr.m2<-round(tr.p%*%tr.c,2)
tr.m2
##      SalePrice GrLivArea FullBath BedroomAbvGr
## [1,]         1      0.01     0.01         0.01
## [2,]         0      1.00     0.00         0.00
## [3,]         0      0.00     1.00         0.00
## [4,]         0      0.00     0.00         1.00
tr.m%*%tr.m2
##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice         1.00    0.0100   0.0100       0.0100
## GrLivArea         0.01    1.0001   0.0001       0.0001
## FullBath          0.01    0.0001   1.0001       0.0001
## BedroomAbvGr      0.01    0.0001   0.0001       1.0001
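Multiplying the correlation matrix by the precision matrix, and the precision matrix by the correlation matrix, gives the identity matrix in both cases (up to rounding), which confirms that the two matrices are inverses of each other.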
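The same check can be sketched with base R's solve in place of matlib's inv, without rounding the inverse before multiplying. The matrix below is typed in from the rounded correlation table above; the names C, P, and vif are illustrative assumptions.

```r
# Correlation matrix as reported above (rounded to two decimals)
vars <- c("SalePrice", "GrLivArea", "FullBath", "BedroomAbvGr")
C <- matrix(c(1.00, 0.71, 0.56, 0.17,
              0.71, 1.00, 0.63, 0.52,
              0.56, 0.63, 1.00, 0.36,
              0.17, 0.52, 0.36, 1.00),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(C)   # precision matrix: the inverse of the correlation matrix
vif <- diag(P)  # variance inflation factors sit on the diagonal

CP <- C %*% P   # correlation times precision
PC <- P %*% C   # precision times correlation

print(round(vif, 2))
print(round(CP, 6))
print(round(PC, 6))
```

Skipping the intermediate rounding avoids the 0.01 entries that appear off the diagonal in the rounded products.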
tr.c
##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice         1.00      0.71     0.56         0.17
## GrLivArea         0.71      1.00     0.63         0.52
## FullBath          0.56      0.63     1.00         0.36
## BedroomAbvGr      0.17      0.52     0.36         1.00
lower<-tr.c
lower[lower.tri(tr.c, diag=TRUE)]<-0
#lower<-as.data.frame(lower)
lower
##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice            0      0.71     0.56         0.17
## GrLivArea            0      0.00     0.63         0.52
## FullBath             0      0.00     0.00         0.36
## BedroomAbvGr         0      0.00     0.00         0.00
upper<-tr.c
upper[upper.tri(tr.c, diag=TRUE)]<-0
#upper<-as.data.frame(upper)
upper
##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice         0.00      0.00     0.00            0
## GrLivArea         0.71      0.00     0.00            0
## FullBath          0.56      0.63     0.00            0
## BedroomAbvGr      0.17      0.52     0.36            0
The LU decomposition of the matrix.
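The triangles extracted with upper.tri and lower.tri are parts of the matrix itself; a factorization A = LU, where L is unit lower triangular and U is upper triangular, can be sketched with Doolittle elimination. This is an illustrative sketch, not the code used above: lu_doolittle is a hypothetical helper, and it assumes no row pivoting is needed, which holds when every pivot is nonzero, as for a positive definite correlation matrix.

```r
# Hypothetical helper: Doolittle LU factorization without pivoting
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)  # unit lower triangular factor, filled in column by column
  U <- A        # reduced to upper triangular form by row elimination
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # multiplier for row i
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # eliminate entry (i, k)
    }
  }
  list(L = L, U = U)
}

# Correlation matrix as reported above (rounded to two decimals)
vars <- c("SalePrice", "GrLivArea", "FullBath", "BedroomAbvGr")
A <- matrix(c(1.00, 0.71, 0.56, 0.17,
              0.71, 1.00, 0.63, 0.52,
              0.56, 0.63, 1.00, 0.36,
              0.17, 0.52, 0.36, 1.00),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

lu <- lu_doolittle(A)
print(round(lu$L, 4))
print(round(lu$U, 4))

# Multiplying the factors back together should reproduce A
recon_error <- max(abs(lu$L %*% lu$U - A))
print(recon_error)
```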
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \({\lambda}\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \({\lambda}\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
library(MASS)
hist(tr.1$GrLivArea)
GrLivArea is skewed to the right in this histogram.
tr.log10<-log10(tr.1$GrLivArea)
hist(tr.log10)
Using log10, the histogram shifts toward the center.
fs<-fitdistr(tr.log10,"Poisson")
tr.rlog10<-rexp(1000,fs$estimate)
hist(tr.rlog10)
Sampling with the fitdistr estimate gives a similar right skew. For the MASS functions we used the Ground Living Area values. The histogram shows the data are right skewed. Using log10, the new histogram is shifted to the right and is closer to a normal distribution. After sampling with the fitdistr estimate, the histogram is right skewed again. Applying the exponential distribution to the original data vs. the log10 data at the .05 and .95 points, there is not much of a difference. The original log10 values were .149 and .941 respectively, while the new values were .162 and .939.
tr.e<-ecdf(tr.rlog10)
tr.e(.95)
## [1] 0.953
tr.e(.05)
## [1] 0.165
tr.1e<-rexp(1000,tr.log10)
tr.1ec<-ecdf(tr.1e)
tr.1ec(.95)
## [1] 0.946
tr.1ec(.05)
## [1] 0.138
The comparison does not show much of a difference.
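For reference, the steps named in the assignment (an exponential fit, percentiles from the fitted distribution, a normal-theory interval, and empirical percentiles) can be sketched as follows. Everything here is an assumption for illustration: the vector v is simulated stand-in data rather than the Kaggle variable, and the interval is taken to be a 95% confidence interval for the mean.

```r
library(MASS)

set.seed(605)  # hypothetical seed for reproducibility

# Hypothetical right-skewed stand-in for a training-set variable
v <- rgamma(1460, shape = 2, scale = 750) + 1

# Fit an exponential density; the estimate is the rate lambda
fit <- fitdistr(v, "exponential")
lambda <- unname(fit$estimate["rate"])

# 1000 samples from the fitted exponential distribution
sim <- rexp(1000, rate = lambda)

# 5th and 95th percentiles from the fitted exponential distribution
q_exp <- qexp(c(0.05, 0.95), rate = lambda)

# Empirical 5th and 95th percentiles of the data
q_emp <- quantile(v, c(0.05, 0.95))

# 95% confidence interval for the mean, assuming normality
ci_norm <- mean(v) + c(-1, 1) * qnorm(0.975) * sd(v) / sqrt(length(v))

print(lambda)
print(q_exp)
print(q_emp)
print(ci_norm)
```

Here qexp inverts the exponential CDF, so it returns the values below which 5% and 95% of the fitted distribution lies, while quantile does the same for the observed data.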
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
We begin with ggpairs and compare Sale Price to Year Built, Kitchen Quality, Ground Living Area, Full Bath, and Bedroom Above Ground. The ggpairs plots hint that Ground Living Area and Full Bath have the highest correlation with Sale Price.
library(GGally)
ggpairs(data=tr.1, columns = 2:7)
For all of the linear models we extract the coefficients, R-squared, adjusted R-squared, sigma, and F-statistic. The coefficients table provides the y-intercept and slope estimates, the p-value, which is the probability of seeing an estimate at least this extreme if the null hypothesis were true, and the t value, which gives the number of standard errors the estimated coefficient is from zero. The multiple R-squared and adjusted R-squared tell us how closely our data fit the linear regression model. The F-statistic tests the overall relationship between the dependent and independent variables; a large F-statistic means a strong relationship.
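The quantities described above can be pulled out of a summary object by name, which makes it explicit which coefficient row each value belongs to. This sketch uses the built-in mtcars data purely for illustration (it is not the Kaggle data), and the variable names are assumptions.

```r
# Illustrative fit on the built-in mtcars data (not the Kaggle data)
fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- s$coefficients
slope_t <- coefs["wt", "t value"]   # estimate divided by its standard error
slope_p <- coefs["wt", "Pr(>|t|)"]  # p-value for the slope, not the intercept

r2     <- s$r.squared      # multiple R-squared
adj_r2 <- s$adj.r.squared  # adjusted R-squared
sigma  <- s$sigma          # residual standard error

# Overall F-test p-value, recovered from the stored statistic and its df
f <- s$fstatistic
f_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(c(t = slope_t, p = slope_p, r2 = r2, adj_r2 = adj_r2,
        sigma = sigma, F = unname(f["value"]), F_p = f_p))
```

Indexing the coefficient table by row name rather than by position avoids reading the intercept row when the predictor row is intended.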
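We begin with summary statistics from a simple linear model of sale price on each variable. Each call extracts the first coefficient row (the intercept), followed by the R-squared, adjusted R-squared, sigma, and F-statistic. The Pr(>|t|) values are all far below .05; for Kitchen Quality the value prints as 0 because it is too small to be represented.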
s<-summary(lm(SalePrice~Neighborhood, tr.1, na.action = na.fail))
c(s$coefficients[1,1:4],s$r.squared,s$adj.r.squared,s$sigma,s$fstatistic)
##      Estimate    Std. Error       t value      Pr(>|t|)
##  1.948709e+05  1.309668e+04  1.487941e+01  1.120183e-46  5.455750e-01
##                                     value         numdf         dendf
##  5.379749e-01  5.399900e+04  7.178487e+01  2.400000e+01  1.435000e+03
y<-summary(lm(SalePrice~YearBuilt, tr.1, na.action = na.fail))
c(y$coefficients[1,1:4],y$r.squared,y$adj.r.squared,y$sigma,y$fstatistic)
##      Estimate    Std. Error       t value      Pr(>|t|)
## -2.530308e+06  1.157613e+05 -2.185799e+01  7.682137e-92  2.734216e-01
##                                     value         numdf         dendf
##  2.729233e-01  6.773966e+04  5.486658e+02  1.000000e+00  1.458000e+03
k<-summary(lm(SalePrice~KitchenQual, tr.1, na.action = na.fail))
c(k$coefficients[1,1:4],k$r.squared,k$adj.r.squared,k$sigma,k$fstatistic)
##      Estimate    Std. Error       t value      Pr(>|t|)
##  3.285547e+05  5.862195e+03  5.604636e+01  0.000000e+00  4.565986e-01
##                                     value         numdf         dendf
##  4.554790e-01  5.862195e+04  4.078064e+02  3.000000e+00  1.456000e+03
g<-summary(lm(SalePrice~GrLivArea, tr.1, na.action = na.fail))
c(g$coefficients[1,1:4],g$r.squared,g$adj.r.squared,g$sigma,g$fstatistic)
##      Estimate    Std. Error       t value      Pr(>|t|)
##  1.856903e+04  4.480755e+03  4.144174e+00  3.606554e-05  5.021487e-01
##                                     value         numdf         dendf
##  5.018072e-01  5.607272e+04  1.470585e+03  1.000000e+00  1.458000e+03
f<-summary(lm(SalePrice~FullBath, tr.1, na.action = na.fail))
c(f$coefficients[1,1:4],f$r.squared,f$adj.r.squared,f$sigma,f$fstatistic)
##      Estimate    Std. Error       t value      Pr(>|t|)
##  5.438828e+04  5.188295e+03  1.048288e+01  7.714661e-25  3.143439e-01
##                                     value         numdf         dendf
##  3.138736e-01  6.580441e+04  6.684303e+02  1.000000e+00  1.458000e+03
b<-summary(lm(SalePrice~BedroomAbvGr, tr.1, na.action = na.fail))
c(b$coefficients[1,1:4],b$r.squared,b$adj.r.squared,b$sigma,b$fstatistic)
##      Estimate    Std. Error       t value      Pr(>|t|)
##  1.339660e+05  7.492255e+03  1.788060e+01  8.332345e-65  2.829567e-02
##                                     value         numdf         dendf
##  2.762920e-02  7.833735e+04  4.245641e+01  1.000000e+00  1.458000e+03
Our first multiple linear model includes all variables vs. sale price: Neighborhood, YearBuilt, Kitchen Quality, Ground Living Area, Full Bath, and Bedroom Above Ground. Collectively it shows a p-value below .05 and a negative y-intercept. However, individually the FullBath p-value is .89.
nykgfb<-summary(lm(SalePrice~Neighborhood+YearBuilt+KitchenQual+GrLivArea+FullBath+BedroomAbvGr, tr.1, na.action = na.fail))
nykgfb
##
## Call:
## lm(formula = SalePrice ~ Neighborhood + YearBuilt + KitchenQual +
## GrLivArea + FullBath + BedroomAbvGr, data = tr.1, na.action = na.fail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -380892 -17342 185 15082 236136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.543e+05 1.436e+05 -5.948 3.40e-09 ***
## NeighborhoodBlueste -2.007e+04 2.749e+04 -0.730 0.465481
## NeighborhoodBrDale -3.093e+04 1.316e+04 -2.350 0.018916 *
## NeighborhoodBrkSide 6.115e+03 1.140e+04 0.536 0.591726
## NeighborhoodClearCr 2.411e+04 1.166e+04 2.068 0.038864 *
## NeighborhoodCollgCr 1.324e+04 9.493e+03 1.395 0.163231
## NeighborhoodCrawfor 3.794e+04 1.121e+04 3.385 0.000731 ***
## NeighborhoodEdwards -1.528e+04 1.035e+04 -1.477 0.139872
## NeighborhoodGilbert 3.168e+03 9.971e+03 0.318 0.750736
## NeighborhoodIDOTRR -1.204e+04 1.205e+04 -0.999 0.317970
## NeighborhoodMeadowV -3.591e+04 1.297e+04 -2.769 0.005699 **
## NeighborhoodMitchel 4.188e+03 1.065e+04 0.393 0.694299
## NeighborhoodNAmes 4.079e+03 9.964e+03 0.409 0.682353
## NeighborhoodNoRidge 7.519e+04 1.092e+04 6.883 8.73e-12 ***
## NeighborhoodNPkVill -4.005e+03 1.536e+04 -0.261 0.794323
## NeighborhoodNridgHt 6.512e+04 1.005e+04 6.479 1.27e-10 ***
## NeighborhoodNWAmes 7.727e+03 1.026e+04 0.753 0.451522
## NeighborhoodOldTown -1.026e+04 1.109e+04 -0.924 0.355388
## NeighborhoodSawyer 1.986e+03 1.053e+04 0.189 0.850345
## NeighborhoodSawyerW 3.111e+03 1.025e+04 0.304 0.761507
## NeighborhoodSomerst 2.138e+04 9.774e+03 2.187 0.028893 *
## NeighborhoodStoneBr 7.153e+04 1.161e+04 6.162 9.34e-10 ***
## NeighborhoodSWISU -1.012e+04 1.290e+04 -0.785 0.432809
## NeighborhoodTimber 3.684e+04 1.081e+04 3.408 0.000673 ***
## NeighborhoodVeenker 4.768e+04 1.432e+04 3.329 0.000893 ***
## YearBuilt 5.034e+02 7.145e+01 7.046 2.86e-12 ***
## KitchenQualFa -8.467e+04 7.904e+03 -10.712 < 2e-16 ***
## KitchenQualGd -5.782e+04 4.514e+03 -12.808 < 2e-16 ***
## KitchenQualTA -7.143e+04 5.026e+03 -14.212 < 2e-16 ***
## GrLivArea 7.446e+01 3.027e+00 24.599 < 2e-16 ***
## FullBath 3.804e+02 2.744e+03 0.139 0.889780
## BedroomAbvGr -7.095e+03 1.568e+03 -4.525 6.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36560 on 1428 degrees of freedom
## Multiple R-squared: 0.7927, Adjusted R-squared: 0.7882
## F-statistic: 176.1 on 31 and 1428 DF, p-value: < 2.2e-16
#c(nykgfb$coefficients[1,1:4],nykgfb$r.squared,nykgfb$adj.r.squared,nykgfb$sigma,nykgfb$fstatistic)
The next multiple linear model removes FullBath and YearBuilt. The p-values for KitchenQual, GrLivArea, and BedroomAbvGr are all below .05, while several individual Neighborhood levels remain above .05.
nkgb<-summary(lm(SalePrice~Neighborhood+KitchenQual+GrLivArea+BedroomAbvGr, tr.1, na.action = na.fail))
nkgb
##
## Call:
## lm(formula = SalePrice ~ Neighborhood + KitchenQual + GrLivArea +
## BedroomAbvGr, data = tr.1, na.action = na.fail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -360606 -18893 369 15255 234429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 158120.63 10777.69 14.671 < 2e-16 ***
## NeighborhoodBlueste -27886.24 27932.08 -0.998 0.318274
## NeighborhoodBrDale -43341.63 13225.57 -3.277 0.001074 **
## NeighborhoodBrkSide -26308.86 10604.43 -2.481 0.013218 *
## NeighborhoodClearCr 7618.53 11594.42 0.657 0.511232
## NeighborhoodCollgCr 11315.54 9634.25 1.175 0.240385
## NeighborhoodCrawfor 9116.82 10608.66 0.859 0.390278
## NeighborhoodEdwards -35566.42 10091.90 -3.524 0.000438 ***
## NeighborhoodGilbert 2805.38 10139.21 0.277 0.782061
## NeighborhoodIDOTRR -46296.96 11214.61 -4.128 3.87e-05 ***
## NeighborhoodMeadowV -48112.53 13022.81 -3.694 0.000229 ***
## NeighborhoodMitchel -3327.84 10752.17 -0.310 0.756984
## NeighborhoodNAmes -14270.91 9714.34 -1.469 0.142037
## NeighborhoodNoRidge 71819.51 11044.10 6.503 1.09e-10 ***
## NeighborhoodNPkVill -13780.10 15536.69 -0.887 0.375261
## NeighborhoodNridgHt 65678.81 10220.38 6.426 1.78e-10 ***
## NeighborhoodNWAmes -3009.12 10321.63 -0.292 0.770684
## NeighborhoodOldTown -47612.49 9950.99 -4.785 1.89e-06 ***
## NeighborhoodSawyer -14240.02 10384.78 -1.371 0.170516
## NeighborhoodSawyerW -2896.54 10368.79 -0.279 0.780015
## NeighborhoodSomerst 22250.99 9940.53 2.238 0.025348 *
## NeighborhoodStoneBr 68179.17 11792.12 5.782 9.06e-09 ***
## NeighborhoodSWISU -45026.51 12146.76 -3.707 0.000218 ***
## NeighborhoodTimber 32661.70 10970.68 2.977 0.002958 **
## NeighborhoodVeenker 37719.57 14444.91 2.611 0.009115 **
## KitchenQualFa -92303.29 7960.39 -11.595 < 2e-16 ***
## KitchenQualGd -58794.69 4568.61 -12.869 < 2e-16 ***
## KitchenQualTA -76664.45 5054.32 -15.168 < 2e-16 ***
## GrLivArea 74.88 2.81 26.645 < 2e-16 ***
## BedroomAbvGr -8134.35 1556.70 -5.225 2.00e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37190 on 1430 degrees of freedom
## Multiple R-squared: 0.7852, Adjusted R-squared: 0.7808
## F-statistic: 180.2 on 29 and 1430 DF, p-value: < 2.2e-16
#c(nkg$coefficients[1,1:4],nkg$r.squared,nkg$adj.r.squared,nkg$sigma,nkg$fstatistic)
When modeling SalePrice on Ground Living Area plus either Bedrooms or Full Bath, all p-values are under .05 in the Bedrooms model, but the intercept p-value is over .05 (0.508) in the Full Bath model, although both of its predictors are significant. However, when both Full Bath and Bedrooms are included, all p-values are below .05.
gb<-summary(lm(SalePrice~GrLivArea+BedroomAbvGr, tr.1, na.action = na.fail))
gb
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + BedroomAbvGr, data = tr.1,
## na.action = na.fail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -549234 -27252 -349 23139 298053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62686.434 5336.730 11.75 <2e-16 ***
## GrLivArea 128.899 3.087 41.76 <2e-16 ***
## BedroomAbvGr -26899.823 1988.164 -13.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52870 on 1457 degrees of freedom
## Multiple R-squared: 0.5577, Adjusted R-squared: 0.5571
## F-statistic: 918.6 on 2 and 1457 DF, p-value: < 2.2e-16
#c(gb$coefficients[1,1:4],gb$r.squared,gb$adj.r.squared,gb$sigma,gb$fstatistic)
gf<-summary(lm(SalePrice~GrLivArea+FullBath, tr.1, na.action = na.fail))
gf
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + FullBath, data = tr.1, na.action = na.fail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -400438 -26191 -2027 21488 343260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3162.993 4775.342 0.662 0.508
## GrLivArea 89.091 3.519 25.314 < 2e-16 ***
## FullBath 27311.090 3357.001 8.136 8.7e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54860 on 1457 degrees of freedom
## Multiple R-squared: 0.5238, Adjusted R-squared: 0.5231
## F-statistic: 801.3 on 2 and 1457 DF, p-value: < 2.2e-16
#c(gf$coefficients[1,1:4],gf$r.squared,gf$adj.r.squared,gf$sigma,gf$fstatistic)
gfb<-summary(lm(SalePrice~GrLivArea+FullBath+BedroomAbvGr, tr.1, na.action = na.fail))
gfb
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + FullBath + BedroomAbvGr,
## data = tr.1, na.action = na.fail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -484289 -26709 -61 23596 300291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47509.482 5426.062 8.756 <2e-16 ***
## GrLivArea 110.062 3.601 30.566 <2e-16 ***
## FullBath 29694.688 3145.948 9.439 <2e-16 ***
## BedroomAbvGr -27859.332 1933.328 -14.410 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51340 on 1456 degrees of freedom
## Multiple R-squared: 0.5832, Adjusted R-squared: 0.5824
## F-statistic: 679.2 on 3 and 1456 DF, p-value: < 2.2e-16
#c(gfb$coefficients[1,1:4],gfb$r.squared,gfb$adj.r.squared,gfb$sigma,gfb$fstatistic)
My first submission to Kaggle regressed SalePrice on GrLivArea+FullBath+BedroomAbvGr. Its intercept p-value was the lowest of the four submitted models at 5.512411e-18, and its F-statistic was high at 679. However, its Kaggle score was the highest of the submissions at .28, and a lower score is better.
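The p-value quoted here is taken from row 1 of the coefficient table, so it is the t-test for the intercept. The p-value for the model as a whole (printed as `< 2.2e-16` in the summaries) comes from the F-statistic instead. A minimal sketch of the difference, using the built-in `mtcars` data as a stand-in for `tr.1`:

```r
# Built-in data used purely as a stand-in; the variables are not from tr.1
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

# s$fstatistic is a named vector: value, numdf, dendf
f <- s$fstatistic

# Upper-tail probability of the F distribution = overall model p-value
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# p-value of the intercept's t-test, which is what coefficients[1, 4] returns
intercept_p <- coef(s)["(Intercept)", "Pr(>|t|)"]

c(overall = overall_p, intercept = intercept_p)
```

The two numbers answer different questions: the first tests whether any predictor helps at all, the second only whether the intercept differs from zero.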
tst<-read.csv('C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/test.csv')
sm<-lm(SalePrice~GrLivArea+FullBath+BedroomAbvGr, tr.1, na.action = na.fail)
sms<-summary(sm)
c(sms$coefficients[1,1:4],sms$r.squared,sms$adj.r.squared,sms$sigma,sms$fstatistic)
##     Estimate   Std. Error      t value     Pr(>|t|)
## 4.750948e+04 5.426062e+03 8.755794e+00 5.512411e-18 5.832213e-01
##                                  value        numdf        dendf
## 5.823626e-01 5.133962e+04 6.791536e+02 3.000000e+00 1.456000e+03
p<-data.frame(tst$Id,predict(sm, newdata=tst))
colnames(p)<-c('Id','SalePrice')
head(p)
## Id SalePrice
## 1 1461 120100.8
## 2 1462 139898.2
## 3 1463 202611.4
## 4 1464 199859.9
## 5 1465 192059.2
## 6 1466 205473.0
write.csv(file = "C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/predict.csv", x=p, row.names = FALSE)
#.28
The submission of SalePrice on GrLivArea+FullBath had, as expected, a high intercept p-value of 5.078456e-01 and the highest F-statistic of 801; its Kaggle score was also ~.28. Adding Neighborhood to that model reduced the intercept p-value to 3.142946e-14. The F-statistic fell to 155, but this reflects the larger number of terms (26 numerator degrees of freedom instead of 2) rather than a weaker fit: R-squared rose from 0.52 to 0.74. The Kaggle score improved to a lower value of .215.
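One way to read the drop in the F-statistic when Neighborhood is added is through the standard identity linking F and R-squared for a model with \(k\) predictor terms and \(n-k-1\) residual degrees of freedom. Plugging in the R-squared values and degrees of freedom printed in the output for the two models:

\[
F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}
\]

\[
F_{\text{GrLivArea+FullBath}} = \frac{0.5237819 / 2}{(1 - 0.5237819)/1457} \approx 801.3
\qquad
F_{\text{+Neighborhood}} = \frac{0.7385277 / 26}{(1 - 0.7385277)/1433} \approx 155.7
\]

The explained share of variance is divided by 26 instead of 2, so F can fall even though R-squared is higher.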
sm2<-lm(SalePrice~GrLivArea+FullBath, tr.1, na.action = na.fail)
sms2<-summary(sm2)
c(sms2$coefficients[1,1:4],sms2$r.squared,sms2$adj.r.squared,sms2$sigma,sms2$fstatistic)
##     Estimate   Std. Error      t value     Pr(>|t|)
## 3.162993e+03 4.775342e+03 6.623594e-01 5.078456e-01 5.237819e-01
##                                  value        numdf        dendf
## 5.231282e-01 5.485974e+04 8.012612e+02 2.000000e+00 1.457000e+03
p<-data.frame(tst$Id,predict(sm2, newdata=tst))
colnames(p)<-c('Id','SalePrice')
head(p)
## Id SalePrice
## 1 1461 110299.8
## 2 1462 148876.3
## 3 1463 202914.7
## 4 1464 200687.5
## 5 1465 171821.9
## 6 1466 205231.1
write.csv(file = "C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/predict2.csv", x=p, row.names = FALSE)
#.28
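The fit, predict, and write steps are repeated for every submission. As a sketch only, they could be wrapped in a small helper; `write_submission` and its arguments are hypothetical names introduced here, and the demonstration uses synthetic data and a temporary file rather than the real `tst` and output paths:

```r
# Hypothetical helper: builds the two-column submission and writes it to disk
write_submission <- function(fit, newdata, path) {
  out <- data.frame(Id = newdata$Id,
                    SalePrice = predict(fit, newdata = newdata))
  # Missing predictor values in newdata produce NA predictions
  if (anyNA(out$SalePrice)) warning("some predictions are NA")
  write.csv(out, file = path, row.names = FALSE)
  invisible(out)
}

# Toy demonstration on synthetic data (made-up values)
set.seed(1)
train <- data.frame(GrLivArea = runif(50, 600, 3000),
                    FullBath  = sample(1:3, 50, replace = TRUE))
train$SalePrice <- 20000 + 90 * train$GrLivArea + 25000 * train$FullBath +
  rnorm(50, sd = 20000)
test <- data.frame(Id = 1461:1465,
                   GrLivArea = c(900, 1300, 1600, 1600, 1300),
                   FullBath  = c(1, 1, 2, 2, 2))

path <- tempfile(fileext = ".csv")
sub  <- write_submission(lm(SalePrice ~ GrLivArea + FullBath, data = train),
                         test, path)
read.csv(path)
```

Naming the columns inside `data.frame()` removes the separate `colnames()` call, and the NA check flags rows a submission would otherwise carry silently.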
sm3<-lm(SalePrice~GrLivArea+FullBath+Neighborhood, tr.1, na.action = na.fail)
sms3<-summary(sm3)
c(sms3$coefficients[1,1:4],sms3$r.squared,sms3$adj.r.squared,sms3$sigma,sms3$fstatistic)
##     Estimate   Std. Error      t value     Pr(>|t|)
## 8.433413e+04 1.099440e+04 7.670647e+00 3.142946e-14 7.385277e-01
##                                  value        numdf        dendf
## 7.337836e-01 4.098928e+04 1.556733e+02 2.600000e+01 1.433000e+03
p<-data.frame(tst$Id,predict(sm3, newdata=tst))
colnames(p)<-c('Id','SalePrice')
head(p)
## Id SalePrice
## 1 1461 113510.2
## 2 1462 147483.4
## 3 1463 191868.0
## 4 1464 189906.5
## 5 1465 263431.4
## 6 1466 193907.9
write.csv(file = "C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/predict3.csv", x=p, row.names = FALSE)
#.215
Finally, I swapped FullBath for BedroomAbvGr, which reduced the intercept p-value to 1.318023e-17, with an F-statistic of 164 and a Kaggle score of .210.
sm4<-lm(SalePrice~Neighborhood+GrLivArea+BedroomAbvGr, tr.1, na.action = na.fail)
sms4<-summary(sm4)
c(sms4$coefficients[1,1:4],sms4$r.squared,sms4$adj.r.squared,sms4$sigma,sms4$fstatistic)
##     Estimate   Std. Error      t value     Pr(>|t|)
## 8.936337e+04 1.032679e+04 8.653546e+00 1.318023e-17 7.491618e-01
##                                  value        numdf        dendf
## 7.446107e-01 4.014711e+04 1.646095e+02 2.600000e+01 1.433000e+03
p<-data.frame(tst$Id,predict(sm4, newdata=tst))
colnames(p)<-c('Id','SalePrice')
head(p)
## Id SalePrice
## 1 1461 120424.8
## 2 1462 146678.4
## 3 1463 193042.7
## 4 1464 190785.6
## 5 1465 260520.1
## 6 1466 195390.1
write.csv(file = "C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/predict4.csv", x=p, row.names = FALSE)
#.210
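Neither the intercept p-value nor the F-statistic tracked the Kaggle score across the four submissions, so a held-out error estimate may be a better guide when choosing between formulas. The sketch below assumes the score is the root mean squared error between log predicted and log actual prices, as used by Kaggle's House Prices competition; this write-up does not name the metric. `rmsle` is a hypothetical helper and the data are synthetic, not `tr.1`:

```r
# Synthetic stand-in for the training data (made-up values)
set.seed(605)
n <- 300
toy <- data.frame(
  GrLivArea    = round(runif(n, 600, 3000)),
  BedroomAbvGr = sample(1:5, n, replace = TRUE),
  FullBath     = sample(1:3, n, replace = TRUE)
)
# Prices are generated on the log scale so they are always positive
toy$SalePrice <- exp(11 + 0.0004 * toy$GrLivArea - 0.05 * toy$BedroomAbvGr +
                       0.10 * toy$FullBath + rnorm(n, sd = 0.2))

# Assumed metric: RMSE on log prices (lower is better)
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# 80/20 split into fitting and validation rows
idx   <- sample(n, size = round(0.8 * n))
train <- toy[idx, ]
valid <- toy[-idx, ]

formulas <- list(
  gf  = SalePrice ~ GrLivArea + FullBath,
  gfb = SalePrice ~ GrLivArea + FullBath + BedroomAbvGr
)

scores <- sapply(formulas, function(f) {
  fit  <- lm(f, data = train)
  pred <- predict(fit, newdata = valid)
  pred <- pmax(pred, 1)  # a linear model can predict <= 0; clamp before log()
  rmsle(valid$SalePrice, pred)
})
scores
```

Comparing `scores` across candidate formulas gives a rough local preview of the ranking before spending a submission.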