Final Exam Project

Problem 1 (Probability)

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\).

The first histogram below uses runif to generate random uniform numbers, so the frequencies are close to even and the histogram shows a uniform distribution. The second histogram uses rnorm and shows the normal distribution.

N <- 6
mean = (N+1)/2
sd = (N+1)/2
X <- runif(10000, 1, N)
hist(X)

Y <- rnorm(X, mean, sd)
hist(Y)

Minimums

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable.

x<-median(X)
y<-quantile(Y)[2]

Interpret the meaning of all probabilities. 5 points

\({P(X>x | X>y)}\)

a<-min(pnorm(X>x | X>y))

The minimum probability of the random uniform number X being greater than the median x, given that X is greater than the 1st quartile value y, is 0.5.

\({P(X>x, Y>y)}\)

b<-min(pnorm(X>x ,Y>y))

The minimum probability of the random uniform number X being greater than the median x and the random normal number Y being greater than the 1st quartile value y is 0.16.

\({P(X<x, X>y)}\)

c<-min(pnorm(X<x, X>y))

The minimum probability of the random uniform number X being less than the median x and X being greater than the 1st quartile value y is 0.16.
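Because pnorm applied to a TRUE/FALSE vector can only return pnorm(0) = 0.5 or pnorm(1) ≈ 0.84 (and shifts by the second argument when one is given), a direct empirical estimate may be easier to interpret. The following is a minimal sketch, not the calculation used above: it regenerates X and Y with an assumed seed and estimates each probability as a proportion of the 10,000 draws. The names p_a, p_b and p_c are illustrative.

```r
set.seed(605)                       # assumed seed, only so the sketch is reproducible
N <- 6
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2, sd = (N + 1) / 2)
x <- median(X)                      # small x: median of X
y <- quantile(Y, 0.25)              # small y: 1st quartile of Y

# P(X > x | X > y): among draws with X > y, the share that also have X > x
p_a <- sum(X > x & X > y) / sum(X > y)
# P(X > x, Y > y): share of draws where both events occur
p_b <- mean(X > x & Y > y)
# P(X < x, X > y): share of draws with y < X < x
p_c <- mean(X < x & X > y)

round(c(a = p_a, b = p_b, c = p_c), 3)
```

Under these assumptions each value is a relative frequency, so it can be read directly as the estimated probability of the event in question.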

Marginal/Joint

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities. 5 points.

The probability table for P(X>x)P(Y>y) and P(X>x and Y>y) shows that the two sets of values differ.

a<-pnorm(X>x)*pnorm(Y>y)
#a<-rbinom(n=6, size = 10000, prob =dnorm((X>x)*(Y>y)))/10000
b<-pnorm((X>x)*(Y>y))
#b<-rbinom(n=6, size = 10000, prob =dnorm(X>x)*dnorm(Y>y))/10000
r<-rbind(table(a),table(b))
#r<-rbind(a[1:6],b[1:6])
row.names(r)<-c('P(X>x and Y>y)','P(X>x)P(Y>y)')
colnames(r)<-names(table(round(a,2)))
#colnames(r)<-c(1,2,3,4,5,6)
rp<-round(addmargins(prop.table(r)),2)
ftable(round(a,2))

##  0.25 0.42 0.71
##                
##  1225 5050 3725

ftable(round(b,2))

##   0.5 0.84
##           
##  6275 3725

rp

##                0.25 0.42 0.71  Sum
## P(X>x and Y>y) 0.05 0.19 0.14 0.38
## P(X>x)P(Y>y)   0.24 0.14 0.24 0.62
## Sum            0.29 0.33 0.38 1.00
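As an alternative view of the same question, the comparison can be sketched with a 2x2 table of the two events. This is a hedged illustration under an assumed seed, not the table built above: the joint cell for (X>x, Y>y) is compared with the product of the two marginal proportions. The names counts, joint, p_joint and p_product are illustrative.

```r
set.seed(605)                       # assumed seed for reproducibility
N <- 6
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

# Cross-tabulate the two events and convert counts to proportions;
# addmargins() appends the marginal probabilities as "Sum"
counts <- table(X_gt_x = X > x, Y_gt_y = Y > y)
joint  <- addmargins(prop.table(counts))
round(joint, 3)

# Joint probability vs. product of the marginals
p_joint   <- joint["TRUE", "TRUE"]
p_product <- joint["TRUE", "Sum"] * joint["Sum", "TRUE"]
c(joint = p_joint, product = p_product)
```

If X and Y are independent, the two numbers should agree up to sampling noise.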

Independence

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate? 5 points.

Both chisq.test and fisher.test check for independence between categorical variables. Fisher's exact test computes an exact p-value and is typically used when cell counts are small, while the chi-square test relies on a large-sample approximation and is the usual choice for a sample of this size. In our test chisq.test has a p-value of ~.24 and fisher.test has a p-value of 1. Both are above .05, so neither test rejects independence.

ft<-fisher.test(rp[1,],rp[2,])
ct<-chisq.test(rp[1,],rp[2,])
print(ft$p.value)

## [1] 1

print(ct$p.value)

## [1] 0.2381033
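Both tests are normally applied to a table of counts rather than to rows of proportions. The sketch below, under an assumed seed and with illustrative variable names, runs the two tests on the 2x2 count table of the events X>x and Y>y.

```r
set.seed(605)                       # assumed seed for reproducibility
N <- 6
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

# 2x2 contingency table of counts for the two events
counts <- table(X_gt_x = X > x, Y_gt_y = Y > y)

ft <- fisher.test(counts)           # exact test
ct <- chisq.test(counts)            # large-sample approximation
c(fisher = ft$p.value, chisq = ct$p.value)
```

With cell counts in the thousands the two p-values are expected to be close to each other, which is one way to see that the chi-square approximation is adequate here.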

Problem 2 (Kaggle)

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

#Qualitative: Neighborhood, YearBuilt, KitchenQual
#Quantitative: GrLivArea, FullBath, BedroomAbvGr
library(dplyr) 
library(tidyr) 
library(readxl)
library(ggplot2)
library(plotly)
library(corrplot)

tr<-read.csv('C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/train.csv')
#head(tr)
#glimpse(tr)

As a start, we keep only the columns of interest and remove rows with NA values.

tr.1<-tr%>%
    select(Id,SalePrice,YearBuilt, KitchenQual,GrLivArea,FullBath, BedroomAbvGr, Neighborhood) %>%
    na.omit()

#select( -Alley,-PoolQC,-Fence,-MiscFeature)%>%

Summary of data points

The summaries show the basic statistical data points for all variables

summary(tr.1$SalePrice)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

summary(tr.1$GrLivArea)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642

summary(tr.1$YearBuilt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1872    1954    1973    1971    2000    2010

summary(tr.1$KitchenQual)

##  Ex  Fa  Gd  TA 
## 100  39 586 735

summary(tr.1$Neighborhood)

## Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert 
##      17       2      16      58      28     150      51     100      79 
##  IDOTRR MeadowV Mitchel   NAmes NoRidge NPkVill NridgHt  NWAmes OldTown 
##      37      17      49     225      41       9      77      73     113 
##  Sawyer SawyerW Somerst StoneBr   SWISU  Timber Veenker 
##      74      59      86      25      25      38      11

The ggplots below show that all independent variables have a positive correlation except for Kitchen Quality. Kitchen Quality is bimodal.

GrLivArea vs SalePrice

The GrLivArea vs SalePrice qqplot shows some positive linearity when GrLivArea is less than 3k.

qqplot(tr.1$GrLivArea,tr.1$SalePrice)

lm1<-lm(tr.1$SalePrice~tr.1$GrLivArea+tr.1$Neighborhood)
summary(lm1)

## 
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$GrLivArea + tr.1$Neighborhood)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -303847  -20271    -465   16433  278716 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               83466.633  10511.722   7.940 4.04e-15 ***
## tr.1$GrLivArea               78.017      2.398  32.528  < 2e-16 ***
## tr.1$NeighborhoodBlueste -54605.854  30631.543  -1.783 0.074852 .  
## tr.1$NeighborhoodBrDale  -68161.393  14288.908  -4.770 2.03e-06 ***
## tr.1$NeighborhoodBrkSide -52492.887  11313.997  -4.640 3.81e-06 ***
## tr.1$NeighborhoodClearCr -10404.664  12628.480  -0.824 0.410131    
## tr.1$NeighborhoodCollgCr  -1005.611  10486.955  -0.096 0.923620    
## tr.1$NeighborhoodCrawfor -12618.482  11508.684  -1.096 0.273074    
## tr.1$NeighborhoodEdwards -59793.362  10751.816  -5.561 3.19e-08 ***
## tr.1$NeighborhoodGilbert -18663.359  10967.328  -1.702 0.089024 .  
## tr.1$NeighborhoodIDOTRR  -72461.908  12025.616  -6.026 2.14e-09 ***
## tr.1$NeighborhoodMeadowV -67505.993  14082.499  -4.794 1.81e-06 ***
## tr.1$NeighborhoodMitchel -28166.940  11538.443  -2.441 0.014761 *  
## tr.1$NeighborhoodNAmes   -39846.611  10310.613  -3.865 0.000116 ***
## tr.1$NeighborhoodNoRidge  56094.460  12101.237   4.635 3.89e-06 ***
## tr.1$NeighborhoodNPkVill -38527.983  16896.812  -2.280 0.022743 *  
## tr.1$NeighborhoodNridgHt  83326.716  11042.752   7.546 7.95e-14 ***
## tr.1$NeighborhoodNWAmes  -29213.522  11058.228  -2.642 0.008337 ** 
## tr.1$NeighborhoodOldTown -70685.672  10660.249  -6.631 4.72e-11 ***
## tr.1$NeighborhoodSawyer  -41475.175  11032.540  -3.759 0.000177 ***
## tr.1$NeighborhoodSawyerW -21349.902  11286.521  -1.892 0.058742 .  
## tr.1$NeighborhoodSomerst  17346.641  10883.665   1.594 0.111196    
## tr.1$NeighborhoodStoneBr  80431.442  12926.676   6.222 6.42e-10 ***
## tr.1$NeighborhoodSWISU   -81403.324  12912.389  -6.304 3.85e-10 ***
## tr.1$NeighborhoodTimber   22299.649  11981.077   1.861 0.062915 .  
## tr.1$NeighborhoodVeenker  35187.678  15858.060   2.219 0.026649 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40980 on 1434 degrees of freedom
## Multiple R-squared:  0.7385, Adjusted R-squared:  0.734 
## F-statistic:   162 on 25 and 1434 DF,  p-value: < 2.2e-16

The ggplots show a positive correlation in each separate plot, broken down by the categorical variables Neighborhood, Year Built and Kitchen Quality.

ggplot(tr.1, aes(GrLivArea, SalePrice,
    width = 800, height = 300)) +
    geom_point(aes(group=Neighborhood,size = SalePrice, color = Neighborhood), alpha = 0.2)+
    stat_smooth(method = "lm", col = "red")

lm2<-lm(tr.1$SalePrice~tr.1$GrLivArea+tr.1$YearBuilt)
summary(lm2)

## 
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$GrLivArea + tr.1$YearBuilt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -452049  -25741   -2331   17873  310520 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.025e+06  8.090e+04  -25.03   <2e-16 ***
## tr.1$GrLivArea  9.517e+01  2.377e+00   40.03   <2e-16 ***
## tr.1$YearBuilt  1.046e+03  4.136e+01   25.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46760 on 1457 degrees of freedom
## Multiple R-squared:  0.654,  Adjusted R-squared:  0.6535 
## F-statistic:  1377 on 2 and 1457 DF,  p-value: < 2.2e-16

ggplot(tr.1, aes(GrLivArea, SalePrice,
    width = 800, height = 300)) +
    geom_point(aes(group=YearBuilt,size = SalePrice, color = YearBuilt), alpha = 0.2)+
    stat_smooth(method = "lm", col = "red")

lm3<-lm(tr.1$SalePrice~tr.1$GrLivArea+tr.1$KitchenQual)
summary(lm3)

## 
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$GrLivArea + tr.1$KitchenQual)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -442830  -23711     -38   20935  263336 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.648e+05  7.000e+03   23.54   <2e-16 ***
## tr.1$GrLivArea      7.764e+01  2.518e+00   30.84   <2e-16 ***
## tr.1$KitchenQualFa -1.566e+05  8.874e+03  -17.65   <2e-16 ***
## tr.1$KitchenQualGd -8.159e+04  5.062e+03  -16.12   <2e-16 ***
## tr.1$KitchenQualTA -1.283e+05  5.239e+03  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45600 on 1455 degrees of freedom
## Multiple R-squared:  0.6714, Adjusted R-squared:  0.6705 
## F-statistic: 743.1 on 4 and 1455 DF,  p-value: < 2.2e-16

ggplot(tr.1, aes(GrLivArea, SalePrice,
    width = 800, height = 300)) +
    geom_point(aes(group=KitchenQual,size = SalePrice, color = KitchenQual), alpha = 0.2)+
    stat_smooth(method = "lm", col = "red")

FullBath vs SalePrice

Full Baths vs Sale Price also shows positive linearity. When Neighborhood or Kitchen Quality is added to the linear model, the relationship stays positive. However, when Year Built is added, the y-intercept estimate is negative.

qqplot(tr.1$FullBath,tr.1$SalePrice)

lm1<-lm(tr.1$SalePrice~tr.1$FullBath+tr.1$Neighborhood)
summary(lm1)

## 
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$FullBath + tr.1$Neighborhood)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -160552  -26404   -3605   20432  379902 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                111848      13443   8.320  < 2e-16 ***
## tr.1$FullBath               44106       2990  14.751  < 2e-16 ***
## tr.1$NeighborhoodBlueste   -40507      37644  -1.076 0.282089    
## tr.1$NeighborhoodBrDale    -59730      17655  -3.383 0.000736 ***
## tr.1$NeighborhoodBrkSide   -35683      14076  -2.535 0.011351 *  
## tr.1$NeighborhoodClearCr    29833      15498   1.925 0.054434 .  
## tr.1$NeighborhoodCollgCr     8785      12887   0.682 0.495515    
## tr.1$NeighborhoodCrawfor    32185      14140   2.276 0.022984 *  
## tr.1$NeighborhoodEdwards   -43612      13297  -3.280 0.001063 ** 
## tr.1$NeighborhoodGilbert    -6089      13460  -0.452 0.651086    
## tr.1$NeighborhoodIDOTRR    -58214      14954  -3.893 0.000104 ***
## tr.1$NeighborhoodMeadowV   -62566      17415  -3.593 0.000338 ***
## tr.1$NeighborhoodMitchel   -19487      14227  -1.370 0.171007    
## tr.1$NeighborhoodNAmes     -19516      12818  -1.523 0.128079    
## tr.1$NeighborhoodNoRidge   130933      14534   9.009  < 2e-16 ***
## tr.1$NeighborhoodNPkVill   -57365      20752  -2.764 0.005778 ** 
## tr.1$NeighborhoodNridgHt   114492      13496   8.483  < 2e-16 ***
## tr.1$NeighborhoodNWAmes     -5572      13555  -0.411 0.681082    
## tr.1$NeighborhoodOldTown   -42561      13195  -3.225 0.001286 ** 
## tr.1$NeighborhoodSawyer    -28101      13689  -2.053 0.040274 *  
## tr.1$NeighborhoodSawyerW    -2291      13861  -0.165 0.868767    
## tr.1$NeighborhoodSomerst    25833      13364   1.933 0.053426 .  
## tr.1$NeighborhoodStoneBr   113968      15824   7.202 9.52e-13 ***
## tr.1$NeighborhoodSWISU     -41590      15840  -2.626 0.008739 ** 
## tr.1$NeighborhoodTimber     46830      14687   3.189 0.001461 ** 
## tr.1$NeighborhoodVeenker    66780      19539   3.418 0.000649 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50330 on 1434 degrees of freedom
## Multiple R-squared:  0.6054, Adjusted R-squared:  0.5986 
## F-statistic: 88.02 on 25 and 1434 DF,  p-value: < 2.2e-16

ggplot(data= tr.1) +
    geom_bar(mapping = aes(x = FullBath, y = SalePrice, fill= Neighborhood), stat = "identity", position = "identity")+
    theme(axis.text.x=element_text(size=9))+
    geom_abline(intercept = lm1$coefficients[1], slope = lm1$coefficients[2])

lm2<-lm(tr.1$SalePrice~tr.1$FullBath+tr.1$YearBuilt)
summary(lm2)

## 
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$FullBath + tr.1$YearBuilt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -128819  -38034   -9222   21116  470440 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.639e+06  1.166e+05  -14.05   <2e-16 ***
## tr.1$FullBath   5.833e+04  3.309e+03   17.63   <2e-16 ***
## tr.1$YearBuilt  8.771e+02  6.035e+01   14.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61520 on 1457 degrees of freedom
## Multiple R-squared:  0.4012, Adjusted R-squared:  0.4003 
## F-statistic:   488 on 2 and 1457 DF,  p-value: < 2.2e-16

ggplot(data= tr.1) +
    geom_bar(mapping = aes(x = FullBath, y = SalePrice, fill= YearBuilt), stat = "identity", position = "identity")+
    theme(axis.text.x=element_text(size=9))+
    geom_abline(intercept = lm2$coefficients[1], slope = lm2$coefficients[2])

lm3<-lm(tr.1$SalePrice~tr.1$FullBath+tr.1$KitchenQual)
summary(lm3)

## 
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$FullBath + tr.1$KitchenQual)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -199462  -31878   -3311   22356  370788 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          229608       7584   30.28   <2e-16 ***
## tr.1$FullBath         51535       2829   18.21   <2e-16 ***
## tr.1$KitchenQualFa  -186149      10193  -18.26   <2e-16 ***
## tr.1$KitchenQualGd  -111064       5733  -19.37   <2e-16 ***
## tr.1$KitchenQualTA  -158499       5877  -26.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52920 on 1455 degrees of freedom
## Multiple R-squared:  0.5575, Adjusted R-squared:  0.5563 
## F-statistic: 458.3 on 4 and 1455 DF,  p-value: < 2.2e-16

ggplot(data= tr.1) +
    geom_bar(mapping = aes(x = FullBath, y = SalePrice, fill= KitchenQual), stat = "identity", position = "identity")+
    theme(axis.text.x=element_text(size=9))+
    geom_abline(intercept = lm3$coefficients[1], slope = lm3$coefficients[2])

BedroomAbvGr vs SalePrice

The Bedroom Above Ground vs SalePrice plot is bimodal, and there is little correlation. Here also, adding Year Built to the linear model results in a negative y-intercept estimate.

qqplot(tr.1$BedroomAbvGr,tr.1$SalePrice)

lm1<-lm(tr.1$SalePrice~tr.1$BedroomAbvGr+tr.1$Neighborhood)
lm1

## 
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$BedroomAbvGr + tr.1$Neighborhood)
## 
## Coefficients:
##              (Intercept)         tr.1$BedroomAbvGr  
##                   164971                     16397  
## tr.1$NeighborhoodBlueste   tr.1$NeighborhoodBrDale  
##                   -68463                   -101469  
## tr.1$NeighborhoodBrkSide  tr.1$NeighborhoodClearCr  
##                   -82825                       161  
## tr.1$NeighborhoodCollgCr  tr.1$NeighborhoodCrawfor  
##                   -13244                     -3215  
## tr.1$NeighborhoodEdwards  tr.1$NeighborhoodGilbert  
##                   -83974                    -22967  
##  tr.1$NeighborhoodIDOTRR  tr.1$NeighborhoodMeadowV  
##                  -106061                   -105940  
## tr.1$NeighborhoodMitchel    tr.1$NeighborhoodNAmes  
##                   -53876                    -67221  
## tr.1$NeighborhoodNoRidge  tr.1$NeighborhoodNPkVill  
##                   112736                    -64179  
## tr.1$NeighborhoodNridgHt   tr.1$NeighborhoodNWAmes  
##                   107007                    -29828  
## tr.1$NeighborhoodOldTown   tr.1$NeighborhoodSawyer  
##                   -82889                    -76260  
## tr.1$NeighborhoodSawyerW  tr.1$NeighborhoodSomerst  
##                   -26494                     16557  
## tr.1$NeighborhoodStoneBr    tr.1$NeighborhoodSWISU  
##                   107488                    -84031  
##  tr.1$NeighborhoodTimber  tr.1$NeighborhoodVeenker  
##                    29381                     38027

ggplot(data= tr.1) +
    geom_bar(mapping = aes(x = BedroomAbvGr, y = SalePrice, fill= Neighborhood), stat = "identity", position = "identity")+
    theme(axis.text.x=element_text(size=9))+
    geom_abline(intercept = lm1$coefficients[1], slope = lm1$coefficients[2])

lm2<-lm(tr.1$SalePrice~tr.1$BedroomAbvGr+tr.1$YearBuilt)
lm2

## 
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$BedroomAbvGr + tr.1$YearBuilt)
## 
## Coefficients:
##       (Intercept)  tr.1$BedroomAbvGr     tr.1$YearBuilt  
##          -2663395              20079               1414

ggplot(data= tr.1) +
    geom_bar(mapping = aes(x = BedroomAbvGr, y = SalePrice, fill= YearBuilt), stat = "identity", position = "identity")+
    theme(axis.text.x=element_text(size=9))+
    geom_abline(intercept = lm2$coefficients[1], slope = lm2$coefficients[2])

lm3<-lm(tr.1$SalePrice~tr.1$BedroomAbvGr+tr.1$KitchenQual)
lm3

## 
## Call:
## lm(formula = tr.1$SalePrice ~ tr.1$BedroomAbvGr + tr.1$KitchenQual)
## 
## Coefficients:
##        (Intercept)   tr.1$BedroomAbvGr  tr.1$KitchenQualFa  
##             278635               18153             -223339  
## tr.1$KitchenQualGd  tr.1$KitchenQualTA  
##            -118715             -190957

ggplot(data= tr.1) +
    geom_bar(mapping = aes(x = BedroomAbvGr, y = SalePrice, fill= KitchenQual), stat = "identity", position = "identity")+
    theme(axis.text.x=element_text(size=9))+
    geom_abline(intercept = lm3$coefficients[1], slope = lm3$coefficients[2])

Correlation Test

The correlation matrix confirms an above-average correlation with Sale Price for Ground Living Area (.71) and Full Bath (.56), while Bedroom Above Ground has a lower correlation (.17).

tr.2<-tr.1%>%
    select(SalePrice,GrLivArea,FullBath, BedroomAbvGr)
tr.c<-round(cor(tr.2),2)
tr.c

##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice         1.00      0.71     0.56         0.17
## GrLivArea         0.71      1.00     0.63         0.52
## FullBath          0.56      0.63     1.00         0.36
## BedroomAbvGr      0.17      0.52     0.36         1.00

tr.g<-cor.test(tr.2$SalePrice,tr.2$GrLivArea,conf.level = .8)
tr.f<-cor.test(tr.2$SalePrice,tr.2$FullBath,conf.level = .8)
tr.b<-cor.test(tr.2$SalePrice,tr.2$BedroomAbvGr,conf.level = .8)

tr.g

## 
##  Pearson's product-moment correlation
## 
## data:  tr.2$SalePrice and tr.2$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

tr.f

## 
##  Pearson's product-moment correlation
## 
## data:  tr.2$SalePrice and tr.2$FullBath
## t = 25.854, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5372107 0.5832505
## sample estimates:
##       cor 
## 0.5606638

tr.b

## 
##  Pearson's product-moment correlation
## 
## data:  tr.2$SalePrice and tr.2$BedroomAbvGr
## t = 6.5159, df = 1458, p-value = 9.927e-11
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1354160 0.2006421
## sample estimates:
##       cor 
## 0.1682132

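With 80% confidence, the correlation of SalePrice with GrLivArea is between 0.69 and 0.72, with FullBath between 0.54 and 0.58, and with BedroomAbvGr between 0.14 and 0.20. In all three tests the p-value is far below .05, so we reject the hypothesis that the correlation is 0.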
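On the familywise error question, a rough worked calculation may help. The sketch below assumes the three tests are independent, which is only an approximation since they all share SalePrice. It computes the chance of at least one false rejection at the 80% level and applies a Bonferroni adjustment to the p-values reported above. The first two p-values are entered as the upper bound 2.2e-16 printed by cor.test, not as exact values.

```r
alpha <- 0.20    # 1 minus the 0.80 confidence level used in cor.test
m     <- 3       # number of pairwise tests

# Probability of at least one false rejection across m independent tests
fwer <- 1 - (1 - alpha)^m
fwer             # 0.488

# Bonferroni adjustment of the reported p-values
# (2.2e-16 is the printed upper bound, used here as a stand-in)
p_raw <- c(GrLivArea = 2.2e-16, FullBath = 2.2e-16, BedroomAbvGr = 9.927e-11)
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj
```

At an 80% level the familywise error rate is large in principle, but the adjusted p-values remain far below any usual threshold, so the conclusions about these three correlations do not change.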

Invert Corr/LU

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

For the precision matrix we use matlib's inv to invert the correlation matrix, then multiply the correlation matrix by the precision matrix and the precision matrix by the correlation matrix. For the LU decomposition we use the upper.tri and lower.tri functions to get the upper and lower triangles of the matrix.

library(matlib)
tr.p<-round(inv(tr.c),2)
tr.p

##                             
## [1,]  2.40 -1.75 -0.48  0.68
## [2,] -1.75  3.26 -0.66 -1.16
## [3,] -0.48 -0.66  1.76 -0.21
## [4,]  0.68 -1.16 -0.21  1.56

tr.m<-round(tr.c%*%tr.p,2)
tr.m

##                        
## SalePrice    1.00 0 0 0
## GrLivArea    0.01 1 0 0
## FullBath     0.01 0 1 0
## BedroomAbvGr 0.01 0 0 1

tr.m2<-round(tr.p%*%tr.c,2)
tr.m2

##      SalePrice GrLivArea FullBath BedroomAbvGr
## [1,]         1      0.01     0.01         0.01
## [2,]         0      1.00     0.00         0.00
## [3,]         0      0.00     1.00         0.00
## [4,]         0      0.00     0.00         1.00

tr.m%*%tr.m2

##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice         1.00    0.0100   0.0100       0.0100
## GrLivArea         0.01    1.0001   0.0001       0.0001
## FullBath          0.01    0.0001   1.0001       0.0001
## BedroomAbvGr      0.01    0.0001   0.0001       1.0001

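Multiplying the correlation matrix by the precision matrix, and the precision matrix by the correlation matrix, gives the identity matrix in both cases (up to rounding), which confirms that the two matrices are inverses of each other.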
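The 0.01 entries in the products are most likely a side effect of rounding the inverse to two decimals before multiplying. A small sketch using base R's solve on the rounded correlation matrix shown above, with no rounding of the inverse, illustrates the check. The names C, P and I4 are illustrative.

```r
vars <- c("SalePrice", "GrLivArea", "FullBath", "BedroomAbvGr")
# Correlation matrix as printed above (already rounded to two decimals)
C <- matrix(c(1.00, 0.71, 0.56, 0.17,
              0.71, 1.00, 0.63, 0.52,
              0.56, 0.63, 1.00, 0.36,
              0.17, 0.52, 0.36, 1.00),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

P  <- solve(C)        # precision matrix, kept at full precision
I4 <- diag(4)

# Both products should equal the identity up to floating-point error
all.equal(unname(C %*% P), I4)
all.equal(unname(P %*% C), I4)

# Diagonal of the precision matrix (the variance inflation factors)
round(diag(P), 2)
```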

tr.c

##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice         1.00      0.71     0.56         0.17
## GrLivArea         0.71      1.00     0.63         0.52
## FullBath          0.56      0.63     1.00         0.36
## BedroomAbvGr      0.17      0.52     0.36         1.00

# keep the strict upper triangle: zero out the lower triangle and the diagonal
upper<-tr.c
upper[lower.tri(tr.c, diag=TRUE)]<-0
#upper<-as.data.frame(upper)
upper

##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice            0      0.71     0.56         0.17
## GrLivArea            0      0.00     0.63         0.52
## FullBath             0      0.00     0.00         0.36
## BedroomAbvGr         0      0.00     0.00         0.00

# keep the strict lower triangle: zero out the upper triangle and the diagonal
lower<-tr.c
lower[upper.tri(tr.c, diag=TRUE)]<-0
#lower<-as.data.frame(lower)
lower

##              SalePrice GrLivArea FullBath BedroomAbvGr
## SalePrice         0.00      0.00     0.00            0
## GrLivArea         0.71      0.00     0.00            0
## FullBath          0.56      0.63     0.00            0
## BedroomAbvGr      0.17      0.52     0.36            0

The matrices above show the upper and lower triangles used for the LU decomposition of the matrix.
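Extracting the triangles splits the matrix into its strictly upper and strictly lower parts, which is not the same as a factorization A = LU. The following is a minimal sketch of a Doolittle-style LU decomposition without pivoting, assuming no zero pivot is encountered (which holds for this correlation matrix). The function name lu_doolittle is hypothetical and not from any package.

```r
# Doolittle LU without pivoting: L is unit lower triangular, U is upper triangular
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]           # elimination multiplier
      U[i, ]  <- U[i, ] - L[i, k] * U[k, ]   # eliminate the entry below the pivot
      U[i, k] <- 0                           # clear floating-point residue
    }
  }
  list(L = L, U = U)
}

# Correlation matrix as printed above
C <- matrix(c(1.00, 0.71, 0.56, 0.17,
              0.71, 1.00, 0.63, 0.52,
              0.56, 0.63, 1.00, 0.36,
              0.17, 0.52, 0.36, 1.00),
            nrow = 4, byrow = TRUE)

f <- lu_doolittle(C)
f$L
f$U
all.equal(f$L %*% f$U, C)   # the factors should multiply back to C
```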

Calc Prob/Stats

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \({\lambda}\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \({\lambda}\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

library(MASS)
hist(tr.1$GrLivArea)

GrLivArea is skewed to the right in this histogram.

tr.log10<-log10(tr.1$GrLivArea)
hist(tr.log10)

Using log10, the histogram shifts toward the center.

fs<-fitdistr(tr.log10,"Poisson")
tr.rlog10<-rexp(1000,fs$estimate)
hist(tr.rlog10)

Sampling with the fitdistr estimate gives a similar right skew. For the MASS functions we used the Ground Living Area values. The histogram shows the data is right skewed. Using log10, the new histogram is shifted to the right and is closer to a normal distribution. After sampling with the fitdistr estimate, the histogram is right skewed again. Applying the exponential distribution to the original log10 data vs the fitted estimate at the .05 and .95 points, there is not much of a difference. The original log10 values were .149 and .941 respectively, while the new values were .162 and .939.

tr.e<-ecdf(tr.rlog10)
tr.e(.95)

## [1] 0.953

tr.e(.05)

## [1] 0.165

tr.1e<-rexp(1000,tr.log10)
tr.1ec<-ecdf(tr.1e)
tr.1ec(.95)

## [1] 0.946

tr.1ec(.05)

## [1] 0.138

The comparison does not show much of a difference.
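For reference, the task asks for an exponential fit and for percentiles taken from the fitted distribution. The sketch below shows one way this could look, using fitdistr with "exponential" and qexp. It uses simulated right-skewed data as a stand-in for the Kaggle variable, so the numbers it produces are not results for this dataset. The variable names are illustrative.

```r
set.seed(605)                                 # assumed seed for reproducibility
# Simulated right-skewed, strictly positive stand-in for a variable such as GrLivArea
v <- rgamma(1460, shape = 2, scale = 750) + 1

# Fit an exponential density; the estimate is the rate lambda
fit    <- MASS::fitdistr(v, "exponential")
lambda <- unname(fit$estimate)

# 1000 samples from the fitted exponential distribution
samp <- rexp(1000, rate = lambda)

# 5th and 95th percentiles from the fitted exponential (inverse of the CDF)
q_exp <- qexp(c(0.05, 0.95), rate = lambda)

# 95% confidence interval for the mean of the data, assuming normality
ci_norm <- mean(v) + c(-1, 1) * qnorm(0.975) * sd(v) / sqrt(length(v))

# Empirical 5th and 95th percentiles of the data
q_emp <- quantile(v, c(0.05, 0.95))

list(lambda = lambda, exponential = q_exp, normal_ci = ci_norm, empirical = q_emp)
```

Calling hist(samp) and hist(v) side by side then gives the visual comparison the task asks for.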

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Modeling

We begin with ggpairs and compare SalePrice to Year Built, Kitchen Quality, Ground Living Area, Full Bath and Bedroom Above Ground. The ggpairs plots hint that Ground Living Area and Full Bath have the highest correlation to sale price.

library(GGally)
ggpairs(data=tr.1, columns = 2:7)

For all of the linear models we extract the coefficients, R-squared, adjusted R-squared, sigma and the F-statistic. The coefficient table gives the y-intercept and slope estimates, the t value, which is the number of standard errors the estimated coefficient is from zero, and the p-value, which is the probability of seeing a t value at least that extreme if the true coefficient were zero. The multiple R-squared and adjusted R-squared tell us how closely the data fit the linear regression model. The F-statistic tests whether the independent variables jointly explain the dependent variable; a large F-statistic indicates a strong relationship.

We begin with linear model summary stats for each variable vs sale price. Because the code extracts the first row of the coefficient table, the Estimate, t value and Pr(>|t|) printed for each model belong to the intercept. Every model shows a Pr value below .05; the Kitchen Quality value prints as 0.

s<-summary(lm(SalePrice~Neighborhood, tr.1, na.action = na.fail))
c(s$coefficients[1,1:4],s$r.squared,s$adj.r.squared,s$sigma,s$fstatistic)

##     Estimate   Std. Error      t value     Pr(>|t|)              
## 1.948709e+05 1.309668e+04 1.487941e+01 1.120183e-46 5.455750e-01 
##                                  value        numdf        dendf 
## 5.379749e-01 5.399900e+04 7.178487e+01 2.400000e+01 1.435000e+03

y<-summary(lm(SalePrice~YearBuilt, tr.1, na.action = na.fail))
c(y$coefficients[1,1:4],y$r.squared,y$adj.r.squared,y$sigma,y$fstatistic)

##      Estimate    Std. Error       t value      Pr(>|t|)               
## -2.530308e+06  1.157613e+05 -2.185799e+01  7.682137e-92  2.734216e-01 
##                                     value         numdf         dendf 
##  2.729233e-01  6.773966e+04  5.486658e+02  1.000000e+00  1.458000e+03

k<-summary(lm(SalePrice~KitchenQual, tr.1, na.action = na.fail))
c(k$coefficients[1,1:4],k$r.squared,k$adj.r.squared,k$sigma,k$fstatistic)

##     Estimate   Std. Error      t value     Pr(>|t|)              
## 3.285547e+05 5.862195e+03 5.604636e+01 0.000000e+00 4.565986e-01 
##                                  value        numdf        dendf 
## 4.554790e-01 5.862195e+04 4.078064e+02 3.000000e+00 1.456000e+03

g<-summary(lm(SalePrice~GrLivArea, tr.1, na.action = na.fail))
c(g$coefficients[1,1:4],g$r.squared,g$adj.r.squared,g$sigma,g$fstatistic)

##     Estimate   Std. Error      t value     Pr(>|t|)              
## 1.856903e+04 4.480755e+03 4.144174e+00 3.606554e-05 5.021487e-01 
##                                  value        numdf        dendf 
## 5.018072e-01 5.607272e+04 1.470585e+03 1.000000e+00 1.458000e+03

f<-summary(lm(SalePrice~FullBath, tr.1, na.action = na.fail))
c(f$coefficients[1,1:4],f$r.squared,f$adj.r.squared,f$sigma,f$fstatistic)

##     Estimate   Std. Error      t value     Pr(>|t|)              
## 5.438828e+04 5.188295e+03 1.048288e+01 7.714661e-25 3.143439e-01 
##                                  value        numdf        dendf 
## 3.138736e-01 6.580441e+04 6.684303e+02 1.000000e+00 1.458000e+03

b<-summary(lm(SalePrice~BedroomAbvGr, tr.1, na.action = na.fail))
c(b$coefficients[1,1:4],b$r.squared,b$adj.r.squared,b$sigma,b$fstatistic)

##     Estimate   Std. Error      t value     Pr(>|t|)              
## 1.339660e+05 7.492255e+03 1.788060e+01 8.332345e-65 2.829567e-02 
##                                  value        numdf        dendf 
## 2.762920e-02 7.833735e+04 4.245641e+01 1.000000e+00 1.458000e+03
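The per-variable summaries above repeat the same extraction six times. A small helper could collect the same fit statistics into one table. The sketch below uses the built-in mtcars data as a stand-in for tr.1, and the function name fit_stats is hypothetical.

```r
# Collect the fit statistics of one linear model into a named vector
fit_stats <- function(formula, data) {
  s <- summary(lm(formula, data = data))
  c(r.squared     = s$r.squared,
    adj.r.squared = s$adj.r.squared,
    sigma         = s$sigma,
    F             = unname(s$fstatistic["value"]))
}

# mtcars ships with R and stands in for tr.1; mpg plays the role of SalePrice
predictors <- c("wt", "hp", "qsec")
stats <- t(sapply(predictors, function(p) {
  fit_stats(reformulate(p, response = "mpg"), mtcars)
}))
round(stats, 3)
```

Each row is one single-predictor model, which makes it easier to compare R-squared and F-statistic values across predictors at a glance.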

Our first multiple linear model includes all variables vs sale price: Neighborhood, YearBuilt, Kitchen Quality, Ground Living Area, FullBath and Bedroom Above Ground. Collectively it shows a p-value below .05 and a negative y-intercept. Individually, however, the FullBath p-value is .89.

nykgfb<-summary(lm(SalePrice~Neighborhood+YearBuilt+KitchenQual+GrLivArea+FullBath+BedroomAbvGr, tr.1, na.action = na.fail))
nykgfb

## 
## Call:
## lm(formula = SalePrice ~ Neighborhood + YearBuilt + KitchenQual + 
##     GrLivArea + FullBath + BedroomAbvGr, data = tr.1, na.action = na.fail)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -380892  -17342     185   15082  236136 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -8.543e+05  1.436e+05  -5.948 3.40e-09 ***
## NeighborhoodBlueste -2.007e+04  2.749e+04  -0.730 0.465481    
## NeighborhoodBrDale  -3.093e+04  1.316e+04  -2.350 0.018916 *  
## NeighborhoodBrkSide  6.115e+03  1.140e+04   0.536 0.591726    
## NeighborhoodClearCr  2.411e+04  1.166e+04   2.068 0.038864 *  
## NeighborhoodCollgCr  1.324e+04  9.493e+03   1.395 0.163231    
## NeighborhoodCrawfor  3.794e+04  1.121e+04   3.385 0.000731 ***
## NeighborhoodEdwards -1.528e+04  1.035e+04  -1.477 0.139872    
## NeighborhoodGilbert  3.168e+03  9.971e+03   0.318 0.750736    
## NeighborhoodIDOTRR  -1.204e+04  1.205e+04  -0.999 0.317970    
## NeighborhoodMeadowV -3.591e+04  1.297e+04  -2.769 0.005699 ** 
## NeighborhoodMitchel  4.188e+03  1.065e+04   0.393 0.694299    
## NeighborhoodNAmes    4.079e+03  9.964e+03   0.409 0.682353    
## NeighborhoodNoRidge  7.519e+04  1.092e+04   6.883 8.73e-12 ***
## NeighborhoodNPkVill -4.005e+03  1.536e+04  -0.261 0.794323    
## NeighborhoodNridgHt  6.512e+04  1.005e+04   6.479 1.27e-10 ***
## NeighborhoodNWAmes   7.727e+03  1.026e+04   0.753 0.451522    
## NeighborhoodOldTown -1.026e+04  1.109e+04  -0.924 0.355388    
## NeighborhoodSawyer   1.986e+03  1.053e+04   0.189 0.850345    
## NeighborhoodSawyerW  3.111e+03  1.025e+04   0.304 0.761507    
## NeighborhoodSomerst  2.138e+04  9.774e+03   2.187 0.028893 *  
## NeighborhoodStoneBr  7.153e+04  1.161e+04   6.162 9.34e-10 ***
## NeighborhoodSWISU   -1.012e+04  1.290e+04  -0.785 0.432809    
## NeighborhoodTimber   3.684e+04  1.081e+04   3.408 0.000673 ***
## NeighborhoodVeenker  4.768e+04  1.432e+04   3.329 0.000893 ***
## YearBuilt            5.034e+02  7.145e+01   7.046 2.86e-12 ***
## KitchenQualFa       -8.467e+04  7.904e+03 -10.712  < 2e-16 ***
## KitchenQualGd       -5.782e+04  4.514e+03 -12.808  < 2e-16 ***
## KitchenQualTA       -7.143e+04  5.026e+03 -14.212  < 2e-16 ***
## GrLivArea            7.446e+01  3.027e+00  24.599  < 2e-16 ***
## FullBath             3.804e+02  2.744e+03   0.139 0.889780    
## BedroomAbvGr        -7.095e+03  1.568e+03  -4.525 6.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36560 on 1428 degrees of freedom
## Multiple R-squared:  0.7927, Adjusted R-squared:  0.7882 
## F-statistic: 176.1 on 31 and 1428 DF,  p-value: < 2.2e-16

#c(nykgfb$coefficients[1,1:4],nykgfb$r.squared,nykgfb$adj.r.squared,nykgfb$sigma,nykgfb$fstatistic)

The next multiple linear model removes FullBath, whose p-value of .89 is well above .05, and also drops YearBuilt. Every other non-Neighborhood variable in the model above has a p-value below .05.

nkgb<-summary(lm(SalePrice~Neighborhood+KitchenQual+GrLivArea+BedroomAbvGr, tr.1, na.action = na.fail))
nkgb

## 
## Call:
## lm(formula = SalePrice ~ Neighborhood + KitchenQual + GrLivArea + 
##     BedroomAbvGr, data = tr.1, na.action = na.fail)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -360606  -18893     369   15255  234429 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         158120.63   10777.69  14.671  < 2e-16 ***
## NeighborhoodBlueste -27886.24   27932.08  -0.998 0.318274    
## NeighborhoodBrDale  -43341.63   13225.57  -3.277 0.001074 ** 
## NeighborhoodBrkSide -26308.86   10604.43  -2.481 0.013218 *  
## NeighborhoodClearCr   7618.53   11594.42   0.657 0.511232    
## NeighborhoodCollgCr  11315.54    9634.25   1.175 0.240385    
## NeighborhoodCrawfor   9116.82   10608.66   0.859 0.390278    
## NeighborhoodEdwards -35566.42   10091.90  -3.524 0.000438 ***
## NeighborhoodGilbert   2805.38   10139.21   0.277 0.782061    
## NeighborhoodIDOTRR  -46296.96   11214.61  -4.128 3.87e-05 ***
## NeighborhoodMeadowV -48112.53   13022.81  -3.694 0.000229 ***
## NeighborhoodMitchel  -3327.84   10752.17  -0.310 0.756984    
## NeighborhoodNAmes   -14270.91    9714.34  -1.469 0.142037    
## NeighborhoodNoRidge  71819.51   11044.10   6.503 1.09e-10 ***
## NeighborhoodNPkVill -13780.10   15536.69  -0.887 0.375261    
## NeighborhoodNridgHt  65678.81   10220.38   6.426 1.78e-10 ***
## NeighborhoodNWAmes   -3009.12   10321.63  -0.292 0.770684    
## NeighborhoodOldTown -47612.49    9950.99  -4.785 1.89e-06 ***
## NeighborhoodSawyer  -14240.02   10384.78  -1.371 0.170516    
## NeighborhoodSawyerW  -2896.54   10368.79  -0.279 0.780015    
## NeighborhoodSomerst  22250.99    9940.53   2.238 0.025348 *  
## NeighborhoodStoneBr  68179.17   11792.12   5.782 9.06e-09 ***
## NeighborhoodSWISU   -45026.51   12146.76  -3.707 0.000218 ***
## NeighborhoodTimber   32661.70   10970.68   2.977 0.002958 ** 
## NeighborhoodVeenker  37719.57   14444.91   2.611 0.009115 ** 
## KitchenQualFa       -92303.29    7960.39 -11.595  < 2e-16 ***
## KitchenQualGd       -58794.69    4568.61 -12.869  < 2e-16 ***
## KitchenQualTA       -76664.45    5054.32 -15.168  < 2e-16 ***
## GrLivArea               74.88       2.81  26.645  < 2e-16 ***
## BedroomAbvGr         -8134.35    1556.70  -5.225 2.00e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37190 on 1430 degrees of freedom
## Multiple R-squared:  0.7852, Adjusted R-squared:  0.7808 
## F-statistic: 180.2 on 29 and 1430 DF,  p-value: < 2.2e-16

#c(nkgb$coefficients[1,1:4],nkgb$r.squared,nkgb$adj.r.squared,nkgb$sigma,nkgb$fstatistic)
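
Whether dropping terms costs explanatory power can also be checked with a partial F-test between the smaller and the larger model, using `anova()`. The sketch below is a minimal illustration on simulated data, not on `tr.1`: the data frame `sim` and its columns `area`, `beds`, `noise`, and `price` are hypothetical stand-ins.

```r
# Simulated stand-in data (hypothetical; not the tr.1 training set)
set.seed(605)
n <- 500
sim <- data.frame(
  area  = runif(n, 500, 3000),            # plays the role of a useful predictor
  beds  = sample(1:5, n, replace = TRUE),
  noise = rnorm(n)                        # unrelated to price by construction
)
sim$price <- 50000 + 80 * sim$area - 7000 * sim$beds + rnorm(n, sd = 30000)

full    <- lm(price ~ area + beds + noise, data = sim)
reduced <- lm(price ~ area + beds, data = sim)

# Partial F-test: a large p-value suggests the dropped term added little
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared of both fits, side by side
c(full = summary(full)$adj.r.squared, reduced = summary(reduced)$adj.r.squared)
```

In this report `nykgfb` and `nkgb` hold `summary()` results, so the underlying `lm` fits would have to be stored separately before they could be passed to `anova()`.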

When SalePrice is regressed on GrLivArea plus either BedroomAbvGr or FullBath, every predictor has a p-value under .05 in both models. The only p-value over .05 is the intercept of the GrLivArea+FullBath model, at .508. When both FullBath and BedroomAbvGr are included, all terms, including the intercept, have p-values below .05.

gb<-summary(lm(SalePrice~GrLivArea+BedroomAbvGr, tr.1, na.action = na.fail))
gb

## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + BedroomAbvGr, data = tr.1, 
##     na.action = na.fail)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -549234  -27252    -349   23139  298053 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   62686.434   5336.730   11.75   <2e-16 ***
## GrLivArea       128.899      3.087   41.76   <2e-16 ***
## BedroomAbvGr -26899.823   1988.164  -13.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52870 on 1457 degrees of freedom
## Multiple R-squared:  0.5577, Adjusted R-squared:  0.5571 
## F-statistic: 918.6 on 2 and 1457 DF,  p-value: < 2.2e-16

#c(gb$coefficients[1,1:4],gb$r.squared,gb$adj.r.squared,gb$sigma,gb$fstatistic)

gf<-summary(lm(SalePrice~GrLivArea+FullBath, tr.1, na.action = na.fail))
gf

## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + FullBath, data = tr.1, na.action = na.fail)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -400438  -26191   -2027   21488  343260 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3162.993   4775.342   0.662    0.508    
## GrLivArea      89.091      3.519  25.314  < 2e-16 ***
## FullBath    27311.090   3357.001   8.136  8.7e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54860 on 1457 degrees of freedom
## Multiple R-squared:  0.5238, Adjusted R-squared:  0.5231 
## F-statistic: 801.3 on 2 and 1457 DF,  p-value: < 2.2e-16

#c(gf$coefficients[1,1:4],gf$r.squared,gf$adj.r.squared,gf$sigma,gf$fstatistic)

gfb<-summary(lm(SalePrice~GrLivArea+FullBath+BedroomAbvGr, tr.1, na.action = na.fail))
gfb

## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + FullBath + BedroomAbvGr, 
##     data = tr.1, na.action = na.fail)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -484289  -26709     -61   23596  300291 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   47509.482   5426.062   8.756   <2e-16 ***
## GrLivArea       110.062      3.601  30.566   <2e-16 ***
## FullBath      29694.688   3145.948   9.439   <2e-16 ***
## BedroomAbvGr -27859.332   1933.328 -14.410   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51340 on 1456 degrees of freedom
## Multiple R-squared:  0.5832, Adjusted R-squared:  0.5824 
## F-statistic: 679.2 on 3 and 1456 DF,  p-value: < 2.2e-16

#c(gfb$coefficients[1,1:4],gfb$r.squared,gfb$adj.r.squared,gfb$sigma,gfb$fstatistic)

My first submission to Kaggle regressed SalePrice on GrLivArea+FullBath+BedroomAbvGr. Its intercept p-value was the lowest of the 4 submissions at 5.512411e-18, and the F-statistic was high at 679. However, the Kaggle score was the highest (worst) at .28.

tst<-read.csv('C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/test.csv')

sm<-lm(SalePrice~GrLivArea+FullBath+BedroomAbvGr, tr.1, na.action = na.fail)
sms<-summary(sm)
c(sms$coefficients[1,1:4],sms$r.squared,sms$adj.r.squared,sms$sigma,sms$fstatistic)

##     Estimate   Std. Error      t value     Pr(>|t|)              
## 4.750948e+04 5.426062e+03 8.755794e+00 5.512411e-18 5.832213e-01 
##                                  value        numdf        dendf 
## 5.823626e-01 5.133962e+04 6.791536e+02 3.000000e+00 1.456000e+03
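
Note that `coefficients[1,1:4]` selects the `(Intercept)` row, so the `Pr(>|t|)` value printed above tests the intercept only, not the model as a whole. A minimal sketch of recovering the overall F-test p-value from a `summary.lm` object follows, using the built-in `mtcars` data as a stand-in for `tr.1`; the helper name `overall_p` is hypothetical and not part of base R.

```r
# Overall F-test p-value from a summary.lm object.
# `overall_p` is a hypothetical helper name, not part of base R.
overall_p <- function(s) {
  f <- s$fstatistic                      # named vector: value, numdf, dendf
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model on built-in data
s   <- summary(fit)

s$coefficients["(Intercept)", ]          # the row that coefficients[1, ] selects
overall_p(s)                             # p-value shown on the F-statistic line
```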

p<-data.frame(tst$Id,predict(sm, newdata=tst))
colnames(p)<-c('Id','SalePrice')
head(p)

##     Id SalePrice
## 1 1461  120100.8
## 2 1462  139898.2
## 3 1463  202611.4
## 4 1464  199859.9
## 5 1465  192059.2
## 6 1466  205473.0

write.csv(file = "C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/predict.csv", x=p, row.names = FALSE)
#.28

The submission for SalePrice on GrLivArea+FullBath had, as expected, a high intercept p-value of 5.078456e-01 and the highest F-statistic of 801; the Kaggle score was also ~.28. Adding Neighborhood to GrLivArea and FullBath reduced the intercept p-value significantly to 3.142946e-14. The F-statistic fell to 155.7, but this reflects the 26 numerator degrees of freedom rather than a weaker relationship between the dependent and independent variables: R-squared rose from 0.52 to 0.74. The Kaggle score improved to a lower value of .215.

sm2<-lm(SalePrice~GrLivArea+FullBath, tr.1, na.action = na.fail)
sms2<-summary(sm2)
c(sms2$coefficients[1,1:4],sms2$r.squared,sms2$adj.r.squared,sms2$sigma,sms2$fstatistic)

##     Estimate   Std. Error      t value     Pr(>|t|)              
## 3.162993e+03 4.775342e+03 6.623594e-01 5.078456e-01 5.237819e-01 
##                                  value        numdf        dendf 
## 5.231282e-01 5.485974e+04 8.012612e+02 2.000000e+00 1.457000e+03

p<-data.frame(tst$Id,predict(sm2, newdata=tst))
colnames(p)<-c('Id','SalePrice')
head(p)

##     Id SalePrice
## 1 1461  110299.8
## 2 1462  148876.3
## 3 1463  202914.7
## 4 1464  200687.5
## 5 1465  171821.9
## 6 1466  205231.1

write.csv(file = "C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/predict2.csv", x=p, row.names = FALSE)
#.28

sm3<-lm(SalePrice~GrLivArea+FullBath+Neighborhood, tr.1, na.action = na.fail)
sms3<-summary(sm3)
c(sms3$coefficients[1,1:4],sms3$r.squared,sms3$adj.r.squared,sms3$sigma,sms3$fstatistic)

##     Estimate   Std. Error      t value     Pr(>|t|)              
## 8.433413e+04 1.099440e+04 7.670647e+00 3.142946e-14 7.385277e-01 
##                                  value        numdf        dendf 
## 7.337836e-01 4.098928e+04 1.556733e+02 2.600000e+01 1.433000e+03

p<-data.frame(tst$Id,predict(sm3, newdata=tst))
colnames(p)<-c('Id','SalePrice')
head(p)

##     Id SalePrice
## 1 1461  113510.2
## 2 1462  147483.4
## 3 1463  191868.0
## 4 1464  189906.5
## 5 1465  263431.4
## 6 1466  193907.9

write.csv(file = "C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/predict3.csv", x=p, row.names = FALSE)
#.215
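
The F-statistic is tied to R-squared by F = (R²/k) / ((1 − R²)/dendf), where k is the numerator degrees of freedom. Neighborhood contributes 24 dummy terms, so k grows from 2 to 26, which lowers F even though R-squared rises. The sketch below recomputes both F-statistics from the R-squared values and degrees of freedom printed above; `f_from_r2` is a hypothetical helper name used only for this check.

```r
# F-statistic implied by R-squared, numerator df and denominator df.
# `f_from_r2` is a hypothetical helper name used only for this check.
f_from_r2 <- function(r2, numdf, dendf) {
  (r2 / numdf) / ((1 - r2) / dendf)
}

# Values copied from the summary output printed above
f_sm2 <- f_from_r2(0.5237819, numdf = 2,  dendf = 1457)  # GrLivArea + FullBath
f_sm3 <- f_from_r2(0.7385277, numdf = 26, dendf = 1433)  # ... + Neighborhood

round(c(sm2 = f_sm2, sm3 = f_sm3), 1)
```

Both results should agree with the `value` entries of `sms2$fstatistic` and `sms3$fstatistic` up to rounding.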

Finally, I replaced FullBath with BedroomAbvGr, which reduced the intercept p-value to 1.318023e-17, with an F-statistic of 164.6 and a Kaggle score of .210.

sm4<-lm(SalePrice~Neighborhood+GrLivArea+BedroomAbvGr, tr.1, na.action = na.fail)
sms4<-summary(sm4)
c(sms4$coefficients[1,1:4],sms4$r.squared,sms4$adj.r.squared,sms4$sigma,sms4$fstatistic)

##     Estimate   Std. Error      t value     Pr(>|t|)              
## 8.936337e+04 1.032679e+04 8.653546e+00 1.318023e-17 7.491618e-01 
##                                  value        numdf        dendf 
## 7.446107e-01 4.014711e+04 1.646095e+02 2.600000e+01 1.433000e+03

p<-data.frame(tst$Id,predict(sm4, newdata=tst))
colnames(p)<-c('Id','SalePrice')
head(p)

##     Id SalePrice
## 1 1461  120424.8
## 2 1462  146678.4
## 3 1463  193042.7
## 4 1464  190785.6
## 5 1465  260520.1
## 6 1466  195390.1

write.csv(file = "C:/apag101/OneDrive/Desktop/GitHub/CUNYSPS/Data605/finalsData/predict4.csv", x=p, row.names = FALSE)
#.210
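
Since p-values and F-statistics did not track the Kaggle scores, a hold-out estimate of prediction error may be a more direct guide before submitting. The sketch below assumes the Kaggle score is a root mean squared error on the log of the sale price; it runs on simulated stand-in data, and every name in it (`sim`, `train`, `valid`, `rmsle`) is hypothetical.

```r
# Hold-out estimate of a log-scale RMSE, on simulated stand-in data.
# Assumption: the leaderboard metric is RMSE between log(predicted) and log(actual).
set.seed(13)
n <- 1000
sim <- data.frame(area = runif(n, 600, 3000),
                  beds = sample(1:5, n, replace = TRUE))
sim$price <- exp(11 + 0.0004 * sim$area - 0.02 * sim$beds + rnorm(n, sd = 0.2))

idx   <- sample(n, size = 0.8 * n)     # 80/20 train/validation split
train <- sim[idx, ]
valid <- sim[-idx, ]

rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

fit  <- lm(price ~ area + beds, data = train)
pred <- predict(fit, newdata = valid)
pred <- pmax(pred, 1)                  # guard: log() needs positive predictions

score <- rmsle(valid$price, pred)
score
```

The same split could be applied to `tr.1` to compare candidate formulas on data the model has not seen.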

Data 605 Final Project Assignment 13

Anthony Pagan

May 26, 2019