Problem 1. Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of
\[\mu=\sigma=(N+1)/2\]
N<-15
X<-runif(10000,1,N)
print("Length, Mean, Min, and Max of X: ")
## [1] "Length, Mean, Min, and Max of X: "
print(length(X))
## [1] 10000
print(mean(X))
## [1] 8.010021
print(min(X))
## [1] 1.001735
print(max(X))
## [1] 14.9988
m<-(N+1)/2
Y<-rnorm(10000,m,m)
print("Length, Mean, Min, and Max of Y: ")
## [1] "Length, Mean, Min, and Max of Y: "
print(length(Y))
## [1] 10000
print(mean(Y))
## [1] 8.084818
print(min(Y))
## [1] -30.24956
print(max(Y))
## [1] 46.7123
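Both sample means land near the theoretical value, which is a quick sanity check on the simulation. A minimal sketch, using the N already set above:
# Theoretical mean of Uniform(1, N), and the mean given to rnorm, is (N + 1)/2 = 8
(N + 1) / 2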
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
5 points.
a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)
Conditional Probability:
\[P(A|B) = \frac{P(A \text{ and } B)}{P(B)}\]
\[P(X>x|X>y)\]
x<-median(X)
y<-quantile(X,0.25)[[1]]
#Calculate the probability of A and B and the probability of B
PAaB<-sum(X>x & X>y)/10000
PB<-sum(X>y)/10000
PAgB<-(PAaB)/PB
print(PAgB)
## [1] 0.6666667
There is a 66.67% chance that X is greater than the median of X given that X is greater than the 1st quartile of Y.
\[P(X>x|Y>y)\]
#Calculate the probability of A and B and the probability of B
PAaB<-sum(X>x & Y>y)/10000
PB<-sum(X>y)/10000
PAgB<-(PAaB)/PB
print(PAgB)
## [1] 0.4402667
There is a 44.0% chance that X is greater than the median of X given that Y is greater than the 1st quartile of Y.
\[P(X<x|X>y)\]
#Calculate the probability of A and B and the probability of B
PAaB<-sum(X<x & X>y)/10000
PB<-sum(X>y)/10000
PAgB<-(PAaB)/PB
print(PAgB)
## [1] 0.3333333
There is a 33.3% chance that X is less than the median of X given that X is greater than the 1st quartile of Y.
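Parts (a) and (c) condition on the same event, so their probabilities must sum to one (for a continuous variable, ties at the median have probability zero). A quick check against the simulated vectors above:
# P(X>x | X>y) + P(X<x | X>y) should return 1
sum(X > x & X > y)/sum(X > y) + sum(X < x & X > y)/sum(X > y)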
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
table<-c(sum(X<x&Y<y),sum(X>x&Y<y),sum(X<x&Y>y),sum(X>x&Y>y))
df<-data.frame(matrix(table, nrow = 2))
rownames(df) <- c("X<x", "X>x")
colnames(df) <- c("Y<y", "Y>y")
print(df)
## Y<y Y>y
## X<x 1665 3335
## X>x 1698 3302
tableratio<-c(round(sum(X<x&Y<y)/10000,2),round(sum(X>x&Y<y)/10000,2),round(sum(X<x&Y>y)/10000,2),round(sum(X>x&Y>y)/10000,2))
dfratio<-data.frame(matrix(tableratio, nrow = 2))
rownames(dfratio) <- c("X<x", "X>x")
colnames(dfratio) <- c("Y<y", "Y>y")
print(dfratio)
## Y<y Y>y
## X<x 0.17 0.33
## X>x 0.17 0.33
#P(X>x and Y>y)
PA<-0.33
#P(X>x)P(Y>y)
PB<-round((0.17+0.33)*(0.33+0.33),2)
print(PA)
## [1] 0.33
print(PB)
## [1] 0.33
We can see that P(X>x and Y>y) and P(X>x)P(Y>y) are both approximately 0.33, so the joint probability factors as the product of the marginals, which is consistent with X and Y being independent.
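As a cross-check, the same comparison can be made without the rounding introduced by the table. A minimal sketch using the simulated vectors above (exact values will vary with the random draw):
# Joint probability versus the product of the marginal probabilities
p_joint <- mean(X > x & Y > y)
p_prod <- mean(X > x) * mean(Y > y)
c(joint = p_joint, product = p_prod)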
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
library(exact2x2) # note: fisher.test() below comes from base R's stats package
## Loading required package: exactci
## Loading required package: ssanv
fisher.test(df)
##
## Fisher's Exact Test for Count Data
##
## data: df
## p-value = 0.4982
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8927801 1.0557771
## sample estimates:
## odds ratio
## 0.9708552
chisq.test(df)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df
## X-squared = 0.45878, df = 1, p-value = 0.4982
With a p-value of 0.4982 we fail to reject the null hypothesis that the variables are independent; the data are consistent with independence holding.
Since the sample size is 10,000, the chi-squared test is appropriate; it is an approximation that is well suited to large samples. Fisher's exact test can also be used here, and may even be preferable, because it is an exact test rather than an approximation and because the table is 2x2; for larger tables the chi-squared test would be the practical choice. Given that the two tests returned the same p-value, the choice does not change the conclusion.
Problem 2
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
testfile<-'https://raw.githubusercontent.com/agersowitz/ADG/master/test.csv'
dftest<-as.data.frame(read.csv(testfile))
trainfile<-'https://raw.githubusercontent.com/agersowitz/ADG/master/train.csv'
dftrain<-as.data.frame(read.csv(trainfile))
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set.
#Univariate and Descriptive Stats
summary(dftrain)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical X1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
summary(dftrain$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
hist(dftrain$SalePrice)
qqnorm(dftrain$SalePrice)
qqline(dftrain$SalePrice)
We can see that SalePrice is skewed right and is not normally distributed: the histogram shows a long right tail and the Q-Q plot deviates from the reference line in the tails.
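A log transform is a common way to pull in a right tail like this one; a quick sketch, should a more symmetric response be wanted for modeling later:
# SalePrice is strictly positive, so the log is well defined
hist(log(dftrain$SalePrice))
qqnorm(log(dftrain$SalePrice))
qqline(log(dftrain$SalePrice))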
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
I consider the building subclass, the overall quality and condition, and the first- and second-floor square footage to be highly relevant variables, so I will use them in the scatterplot matrix.
pairs(~MSSubClass+OverallQual+OverallCond+X1stFlrSF+X2ndFlrSF, data = dftrain, pch = 19, cex = 0.5)
Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
data <- dftrain[, c("OverallQual", "OverallCond", "X1stFlrSF")] # select by name rather than position
head(data, 6)
## OverallQual OverallCond X1stFlrSF
## 1 7 5 856
## 2 6 8 1262
## 3 7 5 920
## 4 7 5 961
## 5 8 5 1145
## 6 5 5 796
m <- as.matrix(cor(data))
round(m, 2)
## OverallQual OverallCond X1stFlrSF
## OverallQual 1.00 -0.09 0.48
## OverallCond -0.09 1.00 -0.14
## X1stFlrSF 0.48 -0.14 1.00
test <- cor.test(data$OverallQual, data$OverallCond,
method = "pearson",conf.level=0.8)
test
##
## Pearson's product-moment correlation
##
## data: data$OverallQual and data$OverallCond
## t = -3.5253, df = 1458, p-value = 0.0004362
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.12510797 -0.05855136
## sample estimates:
## cor
## -0.09193234
test <- cor.test(data$OverallQual, data$X1stFlrSF,
method = "pearson",conf.level=0.8)
test
##
## Pearson's product-moment correlation
##
## data: data$OverallQual and data$X1stFlrSF
## t = 20.68, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4498521 0.5017658
## sample estimates:
## cor
## 0.4762238
test <- cor.test(data$OverallCond, data$X1stFlrSF,
method = "pearson",conf.level=0.8)
test
##
## Pearson's product-moment correlation
##
## data: data$OverallCond and data$X1stFlrSF
## t = -5.5644, df = 1458, p-value = 3.126e-08
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.1769082 -0.1111792
## sample estimates:
## cor
## -0.1442028
The output above contains both the correlation and the 80% confidence interval for each pair of variables. All three tests return p-values below the 0.2 significance level implied by an 80% confidence interval, so we reject the hypothesis of zero correlation for each pair: Overall Quality and Overall Condition, Overall Quality and 1st-floor square footage, and Overall Condition and 1st-floor square footage.
While these correlations range from weak to moderate, they are all statistically significant.
The chosen variables are clearly correlated with one another, which makes sense given that Overall Quality likely reflects many of the other characteristics of a house.
We should be concerned about familywise error here: we are running multiple comparisons on the same dataset, the variables being compared are themselves correlated, and the 0.2 significance level that accompanies an 80% confidence interval is unusually permissive. To reduce the rate of type I errors we could lower the significance level, for example with the Bonferroni correction, which divides alpha by the number of comparisons (three in this case).
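As a sketch of that correction, the three p-values reported by cor.test above can be adjusted directly; even after Bonferroni scaling they all remain below 0.05, let alone 0.2:
# Bonferroni adjustment multiplies each p-value by the number of tests (capped at 1)
pvals <- c(0.0004362, 2.2e-16, 3.126e-08)
p.adjust(pvals, method = "bonferroni")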
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
library(matrixcalc)
precision<-solve(m)
print("Precision Matrix:")
## [1] "Precision Matrix:"
print(precision)
## OverallQual OverallCond X1stFlrSF
## OverallQual 1.29423305 0.03074254 -0.6119115
## OverallCond 0.03074254 1.02196628 0.1327301
## X1stFlrSF -0.61191146 0.13273005 1.3105469
print("Correlation*Precision:")
## [1] "Correlation*Precision:"
m%*%precision
## OverallQual OverallCond X1stFlrSF
## OverallQual 1.000000e+00 0.000000e+00 0.000000e+00
## OverallCond 1.387779e-17 1.000000e+00 -2.775558e-17
## X1stFlrSF 0.000000e+00 -2.775558e-17 1.000000e+00
print("Precision*Correlation:")
## [1] "Precision*Correlation:"
precision%*%m
## OverallQual OverallCond X1stFlrSF
## OverallQual 1.000000e+00 0 1.110223e-16
## OverallCond 0.000000e+00 1 -2.775558e-17
## X1stFlrSF -1.110223e-16 0 1.000000e+00
l <- lu.decomposition(m)
L <- l$L
U <- l$U
print("Lower Triangular Matrix of correlation matrix:")
## [1] "Lower Triangular Matrix of correlation matrix:"
print( L )
## [,1] [,2] [,3]
## [1,] 1.00000000 0.0000000 0
## [2,] -0.09193234 1.0000000 0
## [3,] 0.47622383 -0.1012784 1
print("Upper Triangular Matrix of correlation matrix:")
## [1] "Upper Triangular Matrix of correlation matrix:"
print( U )
## [,1] [,2] [,3]
## [1,] 1 -0.09193234 0.4762238
## [2,] 0 0.99154844 -0.1004224
## [3,] 0 0.00000000 0.7630402
print( L %*% U )
## [,1] [,2] [,3]
## [1,] 1.00000000 -0.09193234 0.4762238
## [2,] -0.09193234 1.00000000 -0.1442028
## [3,] 0.47622383 -0.14420278 1.0000000
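The product L %*% U reproduces the original correlation matrix, which can be confirmed programmatically. A one-line check using the objects defined above:
# TRUE if L %*% U equals m up to floating-point tolerance (dimnames ignored)
all.equal(L %*% U, m, check.attributes = FALSE)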
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
library(MASS)
hist(dftrain$X1stFlrSF)
fit<-fitdistr(dftrain$X1stFlrSF, "exponential") # 'fit' avoids masking base exp()
l<-fit$estimate # the fitted rate lambda = 1/1162.627, the optimal value for rexp()
opt<-1/l # 1/lambda, the fitted mean
print("Fitted mean (1/lambda): ")
## [1] "Fitted mean (1/lambda): "
print(opt)
## rate
## 1162.627
exp_samples<-rexp(1000,l)
hist(dftrain$X1stFlrSF)
hist(exp_samples)
pct<-c(0.05,0.95) # percentile probabilities, not a confidence interval
print("5th and 95th percentiles of Exponential Sample :")
## [1] "5th and 95th percentiles of Exponential Sample :"
qexp(pct,rate=l)
## [1] 59.63495 3482.91836
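# The same two percentiles follow from inverting the exponential CDF by hand:
# F(q) = 1 - exp(-rate*q), so q = -log(1 - p)/rate, matching qexp above
-log(1 - pct) / l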
CI95<-c(0.025,0.975)
me<-mean(dftrain$X1stFlrSF)
sdv<-sd(dftrain$X1stFlrSF)
print("95% Confidence Interval from Original Data :")
## [1] "95% Confidence Interval from Original Data :"
qnorm(CI95,mean=me,sd=sdv)
## [1] 404.9287 1920.3248
print("Empirical 5th and 95th Percentile :")
## [1] "Empirical 5th and 95th Percentile :"
quantile(dftrain$X1stFlrSF,pct)
## 5% 95%
## 672.95 1831.25
Both histograms are right-skewed, but the fit is loose in the tails. The empirical 5th percentile of 672.95 square feet describes the actual data, while the fitted exponential puts its 5th percentile at only 59.63 because an exponential distribution concentrates mass near zero, and no house in the data has a first floor below 334 square feet. Likewise the exponential 95th percentile of 3482.92 sits far above the empirical 1831.25, while the normality-based interval of (404.93, 1920.32) tracks the empirical percentiles much more closely.
The empirical percentiles are the more useful description of the data themselves; the exponential quantiles mainly show what the fitted model implies. The large gap between the two at both tails indicates that a single-parameter exponential is a rough fit for this variable.
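The fit could likely be tightened by applying the shift the prompt mentions before fitting, so that the support starts near zero. A hedged sketch of that refinement (the shift must be added back when interpreting the quantiles):
# Hypothetical refinement: slide the variable toward zero before fitting
shifted <- dftrain$X1stFlrSF - min(dftrain$X1stFlrSF) + 1
fit_shift <- fitdistr(shifted, "exponential")
qexp(c(0.05, 0.95), rate = fit_shift$estimate) + min(dftrain$X1stFlrSF) - 1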
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# Simple imputation: replace every missing value with 0
dftrain[is.na(dftrain)] <- 0
dftest[is.na(dftest)] <- 0
dft<-model.matrix(SalePrice ~.,dftrain)[,-1]
dfte<-model.matrix(~.,dftest)[,-1]
# Align the train and test model matrices: factor levels present in only one
# file create dummy columns that exist in one matrix but not the other.
drops<-c()
keeps<-c()
train_names<-colnames(dft)
test_names<-colnames(dfte)
dft<-as.data.frame(dft)
dfte<-as.data.frame(dfte)
for (x in train_names){
if (x %in% test_names){
keeps[[length(keeps) + 1]] <- x
}
else{
drops[[length(drops) + 1]] <- x
}
}
for (x in test_names){
if (x %in% train_names){
keeps[[length(keeps) + 1]] <- x
}
else{
drops[[length(drops) + 1]] <- x
}
}
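# Equivalently, the two loops above can be replaced with set operations
# on the column names (a compact sketch):
# drops <- c(setdiff(train_names, test_names), setdiff(test_names, train_names))
# keeps <- intersect(train_names, test_names)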
dft<-dft[ , !(names(dft) %in% drops)]
dfte<-dfte[ , !(names(dfte) %in% drops)]
dft<-as.matrix(dft)
dfte<-as.matrix(dfte)
y <- (dftrain$SalePrice)
set.seed(33)
lasso <- cv.glmnet(
x = dft,
y = y,
alpha = 1
)
plot(lasso) # cv.glmnet's plot method takes no xvar argument; it plots against log(lambda) by default
m<-min(lasso$cvm)
m
## [1] 1193789100
ml<-lasso$lambda.min
ml
## [1] 1668.449
summary(lasso)
## Length Class Mode
## lambda 100 -none- numeric
## cvm 100 -none- numeric
## cvsd 100 -none- numeric
## cvup 100 -none- numeric
## cvlo 100 -none- numeric
## nzero 100 -none- numeric
## call 4 -none- call
## name 1 -none- character
## glmnet.fit 12 elnet list
## lambda.min 1 -none- numeric
## lambda.1se 1 -none- numeric
p<-predict(lasso,s=ml,dfte)
p<-cbind(dftest$Id,p)
write.csv(p,"AGersowitz_Housing_sub.csv", row.names = FALSE)
I used lasso regression for this analysis. The lasso shrinks coefficient estimates toward zero and drives some of them exactly to zero, which encourages simple, sparse models by effectively selecting the relevant covariates for the final model.
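Formally, the lasso minimizes the residual sum of squares plus an L1 penalty on the coefficients,
\[\hat{\beta} = \underset{\beta}{\arg\min}\left\{\lVert y - X\beta\rVert_2^2 + \lambda\sum_{j}|\beta_j|\right\}\]
and cv.glmnet chooses the penalty weight lambda (lambda.min above) by cross-validation.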
Kaggle User Name: Adam Gersowitz. Score: 0.14322