Problem 1. Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of

\[\mu=\sigma=(N+1)/2\]

N<-15
X<-runif(10000,1,N)

print("Length, Mean, Min, and Max of X: ")
## [1] "Length, Mean, Min, and Max of X: "
print(length(X))
## [1] 10000
print(mean(X))
## [1] 8.010021
print(min(X))
## [1] 1.001735
print(max(X ))
## [1] 14.9988
m<-(N+1)/2

Y<-rnorm(10000,m,m)

print("Length, Mean, Min, and Max of Y: ")
## [1] "Length, Mean, Min, and Max of Y: "
print(length(Y))
## [1] 10000
print(mean(Y))
## [1] 8.084818
print(min(Y))
## [1] -30.24956
print(max(Y))
## [1] 46.7123

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

5 points.

a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)

Conditional Probability:

\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]

\[P(X>x|X>y)\]

x<-median(X)
y<-quantile(X,0.25)[[1]]  # note: y is taken here as the 1st quartile of X (the assignment defines y as the 1st quartile of Y)

#Calculate the probability of A and B, P(X>x and X>y), and the probability of B, P(X>y)

PAaB<-sum(X>x & X>y)/10000

PB<-sum(X>y)/10000


PAgB<-(PAaB)/PB

print(PAgB)
## [1] 0.6666667

There is a 66.67% chance that X is greater than the median of X, given that X is greater than y (the 1st quartile computed above).
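Since x and y are here both quantiles of X (the median and the first quartile), these conditional probabilities can also be derived analytically as a check:

\[P(X>x \mid X>y) = \frac{P(X>x)}{P(X>y)} = \frac{0.50}{0.75} = \frac{2}{3} \approx 0.667, \qquad P(X<x \mid X>y) = \frac{P(y<X<x)}{P(X>y)} = \frac{0.25}{0.75} = \frac{1}{3} \approx 0.333\]

These match the simulated values of 0.6667 here and 0.3333 in part c below.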

\[P(X>x|Y>y)\]

#Calculate the probability of A and B, P(X>x and Y>y), and the probability of B (note that the denominator below uses X>y)

PAaB<-sum(X>x & Y>y)/10000

PB<-sum(X>y)/10000


PAgB<-(PAaB)/PB

print(PAgB)
## [1] 0.4402667

There is a 44.03% chance that X is greater than the median of X and Y is greater than y, given that X is greater than y, which is the ratio computed above.
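For reference, the joint probability P(X>x, Y>y) asked for in part b can also be estimated directly; a minimal sketch (the contingency table in the next part puts it at roughly 0.33):

#Estimate the joint probability P(X>x and Y>y) directly
mean(X>x & Y>y)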

\[P(X<x|X>y)\]

#Calculate the probability of A and B, P(X<x and X>y), and the probability of B, P(X>y)

PAaB<-sum(X<x & X>y)/10000

PB<-sum(X>y)/10000


PAgB<-(PAaB)/PB

print(PAgB)
## [1] 0.3333333

There is a 33.33% chance that X is less than the median of X, given that X is greater than y.

5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

table<-c(sum(X<x&Y<y),sum(X>x&Y<y),sum(X<x&Y>y),sum(X>x&Y>y))
df<-data.frame(matrix(table, nrow = 2))
rownames(df) <- c("X<x", "X>x")
colnames(df) <- c("Y<y", "Y>y")
print(df)
##      Y<y  Y>y
## X<x 1665 3335
## X>x 1698 3302
tableratio<-c(round(sum(X<x&Y<y)/10000,2),round(sum(X>x&Y<y)/10000,2),round(sum(X<x&Y>y)/10000,2),round(sum(X>x&Y>y)/10000,2))

dfratio<-data.frame(matrix(tableratio, nrow = 2))
rownames(dfratio) <- c("X<x", "X>x")
colnames(dfratio) <- c("Y<y", "Y>y")
print(dfratio)
##      Y<y  Y>y
## X<x 0.17 0.33
## X>x 0.17 0.33
#P(X>x and Y>y), the joint probability estimated from the simulation

PA<-round(sum(X>x & Y>y)/10000,2)

#P(X>x)P(Y>y), the product of the marginal probabilities

PB<-round((sum(X>x)/10000)*(sum(Y>y)/10000),2)

print(PA)
## [1] 0.33
print(PB)
## [1] 0.33

Both P(X>x and Y>y) and P(X>x)P(Y>y) are approximately 0.33, which is consistent with X and Y being independent: up to sampling error, the joint probability factors into the product of the marginals.
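An equivalent, more compact way to build this table is with prop.table() and addmargins(), which append the marginal probabilities automatically; a minimal sketch:

joint<-prop.table(table(X>x, Y>y))   # joint probabilities of the four cells
addmargins(joint)                    # append the row and column (marginal) totals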

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

library(exact2x2)
## Loading required package: exactci
## Loading required package: ssanv
fisher.test(df)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  df
## p-value = 0.4982
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.8927801 1.0557771
## sample estimates:
## odds ratio 
##  0.9708552
chisq.test(df)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  df
## X-squared = 0.45878, df = 1, p-value = 0.4982

Both tests return a p-value of 0.4982, well above any conventional significance level, so we fail to reject the null hypothesis that the variables are independent; the data are consistent with independence holding.

Since the sample size is 10,000, the chi-squared test, which is well suited to large samples, can be used. However, Fisher's exact test can also be applied and may be preferable because it is an exact test rather than an asymptotic approximation. Fisher's exact test is also convenient here because the table is 2x2; for larger tables the chi-squared test would typically be used.

Since both tests returned essentially the same result, the choice makes little practical difference here, but Fisher's exact test is arguably preferable because it is exact and the table is 2x2.
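The chi-squared test compares the observed counts to the expected counts under independence, and its approximation is reliable when those expected counts are large (a common rule of thumb is at least 5 per cell); a quick check using the table above:

#Expected cell counts under the null hypothesis of independence
chisq.test(df)$expected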

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

testfile<-'https://raw.githubusercontent.com/agersowitz/ADG/master/test.csv'
dftest<-as.data.frame(read.csv(testfile))

trainfile<-'https://raw.githubusercontent.com/agersowitz/ADG/master/train.csv'
dftrain<-as.data.frame(read.csv(trainfile))

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set.

#f<-rbind(dftest,dftrain)

#Univariate and Descriptive Stats

summary(dftrain)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 
summary(dftrain$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
hist(dftrain$SalePrice)

qqnorm(dftrain$SalePrice)
qqline(dftrain$SalePrice)

We can see that Sale Price is skewed right and is not normally distributed.
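Given the right skew, a common follow-up (not required here, but a useful sketch) is to look at SalePrice on the log scale, which is usually much closer to normal for housing prices:

hist(log(dftrain$SalePrice))
qqnorm(log(dftrain$SalePrice))
qqline(log(dftrain$SalePrice))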

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

I view the subclass, overall quality and condition, and the square footage as highly relevant variables, so I will plot them, along with the dependent variable SalePrice, in the scatterplot matrix.

pairs(~SalePrice+MSSubClass+OverallQual+OverallCond+X1stFlrSF+X2ndFlrSF, data = dftrain, pch = 19,  cex = 0.5)

Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

data <- dftrain[, c(18,19,44)]   # columns: OverallQual, OverallCond, X1stFlrSF
head(data, 6)
##   OverallQual OverallCond X1stFlrSF
## 1           7           5       856
## 2           6           8      1262
## 3           7           5       920
## 4           7           5       961
## 5           8           5      1145
## 6           5           5       796
m <- as.matrix(cor(data))
round(m, 2)
##             OverallQual OverallCond X1stFlrSF
## OverallQual        1.00       -0.09      0.48
## OverallCond       -0.09        1.00     -0.14
## X1stFlrSF          0.48       -0.14      1.00
test <- cor.test(data$OverallQual, data$OverallCond, 
                    method = "pearson",conf.level=0.8)
test
## 
##  Pearson's product-moment correlation
## 
## data:  data$OverallQual and data$OverallCond
## t = -3.5253, df = 1458, p-value = 0.0004362
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.12510797 -0.05855136
## sample estimates:
##         cor 
## -0.09193234
test <- cor.test(data$OverallQual, data$X1stFlrSF, 
                    method = "pearson",conf.level=0.8)
test
## 
##  Pearson's product-moment correlation
## 
## data:  data$OverallQual and data$X1stFlrSF
## t = 20.68, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4498521 0.5017658
## sample estimates:
##       cor 
## 0.4762238
test <- cor.test(data$OverallCond, data$X1stFlrSF, 
                    method = "pearson",conf.level=0.8)
test
## 
##  Pearson's product-moment correlation
## 
## data:  data$OverallCond and data$X1stFlrSF
## t = -5.5644, df = 1458, p-value = 3.126e-08
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.1769082 -0.1111792
## sample estimates:
##        cor 
## -0.1442028

The output above contains both the correlation and the 80% confidence interval for each pair of variables. For all three pairs (Overall Quality and Overall Condition, Overall Quality and 1st floor square footage, and Overall Condition and 1st floor square footage) the p-value falls below the 0.2 significance level that corresponds to an 80% confidence interval, and none of the confidence intervals contains 0, so we reject the null hypothesis that each pairwise correlation is 0.

While these correlations range from weak to moderate, they are statistically significant.

As we can see, the chosen variables are correlated with one another. This makes sense when we consider that overall quality likely reflects many of the other attributes of the house.

We should be concerned about familywise error here: we are performing multiple comparisons on the same dataset, the variables being compared are themselves correlated, and the 0.2 significance level associated with the 80% confidence intervals is relatively high. To control the probability of making at least one type I error across the family of tests, we could lower the per-test significance level, for example with a Bonferroni correction, which divides alpha by the number of comparisons (three in this case).
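As a sketch, the Bonferroni adjustment can be applied directly to the three p-values reported above (the second is an upper bound, since it was reported as < 2.2e-16); all three remain far below alpha = 0.2 after adjustment:

pvals<-c(0.0004362, 2.2e-16, 3.126e-08)   # p-values from the three cor.test calls above
p.adjust(pvals, method="bonferroni")      # multiplies each p-value by the number of tests (3)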

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

library(matrixcalc)
precision<-solve(m)
print("Precision Matrix:")
## [1] "Precision Matrix:"
print(precision)
##             OverallQual OverallCond  X1stFlrSF
## OverallQual  1.29423305  0.03074254 -0.6119115
## OverallCond  0.03074254  1.02196628  0.1327301
## X1stFlrSF   -0.61191146  0.13273005  1.3105469
print("Correlation*Precision:")
## [1] "Correlation*Precision:"
m%*%precision
##              OverallQual   OverallCond     X1stFlrSF
## OverallQual 1.000000e+00  0.000000e+00  0.000000e+00
## OverallCond 1.387779e-17  1.000000e+00 -2.775558e-17
## X1stFlrSF   0.000000e+00 -2.775558e-17  1.000000e+00
print("Precision*Correlation:")
## [1] "Precision*Correlation:"
precision%*%m
##               OverallQual OverallCond     X1stFlrSF
## OverallQual  1.000000e+00           0  1.110223e-16
## OverallCond  0.000000e+00           1 -2.775558e-17
## X1stFlrSF   -1.110223e-16           0  1.000000e+00
l <- lu.decomposition(m)
L <- l$L
U <- l$U
print("Lower Triangular Matrix of correlation matrix:")
## [1] "Lower Triangular Matrix of correlation matrix:"
print( L )
##             [,1]       [,2] [,3]
## [1,]  1.00000000  0.0000000    0
## [2,] -0.09193234  1.0000000    0
## [3,]  0.47622383 -0.1012784    1
print("Upper Triangular Matrix of correlation matrix:")
## [1] "Upper Triangular Matrix of correlation matrix:"
print( U )
##      [,1]        [,2]       [,3]
## [1,]    1 -0.09193234  0.4762238
## [2,]    0  0.99154844 -0.1004224
## [3,]    0  0.00000000  0.7630402
print( L %*% U )
##             [,1]        [,2]       [,3]
## [1,]  1.00000000 -0.09193234  0.4762238
## [2,] -0.09193234  1.00000000 -0.1442028
## [3,]  0.47622383 -0.14420278  1.0000000
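As a check on the claim that the diagonal of the precision matrix holds variance inflation factors: each diagonal entry should equal 1/(1 - R^2), where R^2 comes from regressing that variable on the other two. A minimal sketch for OverallQual, which should come out near the 1.29 shown above:

r2<-summary(lm(OverallQual ~ OverallCond + X1stFlrSF, data=data))$r.squared
1/(1-r2)   # should approximately equal precision["OverallQual","OverallQual"]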

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

library(MASS)

hist(dftrain$X1stFlrSF)

fit_exp<-fitdistr(dftrain$X1stFlrSF, "exponential")   # maximum-likelihood fit of an exponential pdf

l<-fit_exp$estimate   # fitted rate lambda

opt<-1/l   # 1/lambda, the fitted mean (printed below as the "optimal value")

print ("Optimal Value: ")
## [1] "Optimal Value: "
print(opt)
##     rate 
## 1162.627
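For the exponential distribution, the maximum-likelihood estimate of the rate λ is simply the reciprocal of the sample mean, so the value printed above (1/λ ≈ 1162.6) is just the mean of X1stFlrSF (reported as 1163 in the earlier summary). A quick check:

mean(dftrain$X1stFlrSF)     # should match opt = 1/l above
1/mean(dftrain$X1stFlrSF)   # should match the fitted rate l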
exp_samples<-rexp(1000,l)

hist(dftrain$X1stFlrSF)

hist(exp_samples)

CI<-c(0.05,.95)
print("5th and 95th percentiles of Exponential Sample :")
## [1] "5th and 95th percentiles of Exponential Sample :"
qexp(CI,rate=l)
## [1]   59.63495 3482.91836
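These values match the closed-form quantile function of the exponential distribution. With 1/λ ≈ 1162.6:

\[F^{-1}(p) = \frac{-\ln(1-p)}{\lambda}, \qquad F^{-1}(0.05) \approx 1162.6 \times 0.0513 \approx 59.6, \qquad F^{-1}(0.95) \approx 1162.6 \times 2.996 \approx 3483\]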
CI95<-c(0.025,0.975)

me<-mean(dftrain$X1stFlrSF)   # sample mean
sdv<-sd(dftrain$X1stFlrSF)    # sample standard deviation

print("95% Confidence Interval from Original Data :")
## [1] "95% Confidence Interval from Original Data :"
qnorm(CI95,mean=me,sd=sdv)   # central 95% range of a normal with the sample mean and sd
## [1]  404.9287 1920.3248
print("Empirical 5th and 95th Percentile :")
## [1] "Empirical 5th and 95th Percentile :"
quantile(dftrain$X1stFlrSF,CI)
##      5%     95% 
##  672.95 1831.25

Comparing the two histograms, the fitted exponential is far more heavily right-skewed than the original variable: it puts much of its mass near zero and has a much longer right tail. The same pattern appears in the percentiles. The empirical 5th and 95th percentiles of X1stFlrSF are 672.95 and 1831.25 square feet, while the 5th and 95th percentiles of the fitted exponential are 59.63 and 3482.92, a far wider spread.

The normal-based interval of (404.93, 1920.32) is much closer to the empirical percentiles, although its lower bound falls well below the observed 5th percentile. Overall, while X1stFlrSF is right-skewed, the exponential model exaggerates that skew, so the empirical percentiles describe the data better than the fitted exponential does.

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Simple imputation: replace all missing values with 0 before building the model matrices
dftrain[is.na(dftrain)] <- 0
dftest[is.na(dftest)] <- 0

# Build dummy-coded design matrices for the training and test sets, dropping the intercept column
dft<-model.matrix(SalePrice ~.,dftrain)[,-1]
dfte<-model.matrix(~.,dftest)[,-1]

# Keep only the dummy-coded columns that appear in both the training and test matrices
drops<-c()
keeps<-c()

train_names<-colnames(dft)
test_names<-colnames(dfte)

dft<-as.data.frame(dft)
dfte<-as.data.frame(dfte)

for (x in train_names){
  if (x %in% test_names){
    keeps[[length(keeps) + 1]] <- x 
  }
  else{
      drops[[length(drops) + 1]] <- x  
  }
}

for (x in test_names){
  if (x %in% train_names){
    keeps[[length(keeps) + 1]] <- x 
  }
  else{
      drops[[length(drops) + 1]] <- x  
  }
}
dft<-dft[ , !(names(dft) %in% drops)]
dfte<-dfte[ , !(names(dfte) %in% drops)]

dft<-as.matrix(dft)
dfte<-as.matrix(dfte)


y <- (dftrain$SalePrice)

set.seed(33)

lasso <- cv.glmnet(
  x = dft,
  y = y,
  alpha = 1
)

plot(lasso)   # cross-validation curve: mean squared error versus log(lambda)

m<-min(lasso$cvm)   # minimum cross-validated mean squared error
m
## [1] 1193789100
ml<-lasso$lambda.min   # value of lambda that minimizes the cross-validated error
ml
## [1] 1668.449
summary(lasso)
##            Length Class  Mode     
## lambda     100    -none- numeric  
## cvm        100    -none- numeric  
## cvsd       100    -none- numeric  
## cvup       100    -none- numeric  
## cvlo       100    -none- numeric  
## nzero      100    -none- numeric  
## call         4    -none- call     
## name         1    -none- character
## glmnet.fit  12    elnet  list     
## lambda.min   1    -none- numeric  
## lambda.1se   1    -none- numeric
p<-predict(lasso,s=ml,newx=dfte)   # predict SalePrice for the test set at lambda.min

p<-cbind(dftest$Id,p)              # attach the Id column for the Kaggle submission


write.csv(p,"AGersowitz_Housing_sub.csv", row.names = FALSE)

I used a lasso regression for this analysis. The lasso shrinks coefficient estimates toward zero, which encourages simple, sparse models: coefficients of less relevant covariates are driven exactly to zero, so only the relevant covariates remain in the final model.
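Concretely, for a penalty weight lambda (chosen here by cross-validation as lambda.min), the lasso estimates the coefficients by solving

\[\hat{\beta} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \left(y_i - x_i^{T}\beta\right)^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}\]

The absolute-value penalty is what forces some coefficients exactly to zero; the surviving (non-zero) coefficients can be inspected with coef(lasso, s = "lambda.min").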

Kaggle username: Adam Gersowitz. Score: 0.14322