https://sites.google.com/site/cme250winter2015/mini-project
The goal of the competition, from the Kaggle website, is to “predict your classmates’ age”. The data set profiled here is the training data set.
ID - an anonymous random id unique to each sample; you should not use this to fit your model
Age - student’s age
Height - student’s height in inches
Sibling - number of siblings
FBfriends - number of facebook friends
Credits - number of enrolled credits this quarter
Contacts - number of contacts on phone
Exercise - number of times exercising per week
News - number of news articles read per week
ParentContact - number of times in a month the student speaks with their parents
Texts - number of texts/mobile messages sent yesterday
Emails - number of emails received yesterday
Pets - pets the student has or has had
Twitter - does the student have a twitter account?
Field - area of study/research
Origin - birthplace
Comments
There are five missing values in the data set. I am not deleting them outright, because I have only a small number of observations to fit models with; as a result, every piece of information is valuable. Instead, I will use linear regression to predict the missing values. There are no character columns: the data had already been converted to factor-like integer values. However, I found that the Field variable shows 6 levels even though only 5 values are possible; the extra level comes from the missing value. Given the small amount of data, I will use the leave-one-out method to calculate all the MSEs.
Estimating Missing Values
First, I plot the pairwise covariate graph to assess possible relationships between the covariates, and then build linear regression models to predict the missing values. It turned out this method was not suitable for every missing value. As alternatives, I can assign the column mean to a missing numeric value, and if the missing value is categorical, I can estimate it from the clustering results.
Based on the plot above, I guessed that Height is related to News, Texts, Emails, Field and Origin. I tried different models to see which one was meaningful for estimating Height. It turns out that in the regression of Height on Field and Origin, all the coefficients are significant, so I used this linear model to estimate the missing Height value. I estimated two other missing values in the same way. When I tried to estimate FBfriends, none of the models was significant, so I simply assigned the column mean to the missing value. I also guessed that observation 62's major is social science and humanities, based on the hierarchical clustering result.
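The imputation code is not shown in the report; a minimal sketch of this step, assuming the training data sits in a data frame called `train` with the column names from the data description above (e.g. Height, Field, Origin, FBfriends), might look like this:

```r
# Sketch only: `train` is an assumed name for the training data frame.
# Fit Height on Field and Origin using the complete rows, then predict the gap.
height_fit <- lm(Height ~ factor(Field) + factor(Origin),
                 data = train[!is.na(train$Height), ])
miss <- is.na(train$Height)
train$Height[miss] <- predict(height_fit, newdata = train[miss, ])

# For FBfriends, where no regression model was significant, use the column mean.
train$FBfriends[is.na(train$FBfriends)] <- mean(train$FBfriends, na.rm = TRUE)
```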
Summary Statistics
Graphical Model
The plot above shows that when the sparsity is about 1%, the regularization parameter is around 0.399. In other words, among the three candidate values of lambda, overfitting is smallest when lambda equals 0.399.
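The code for this step is not shown; a sketch of how such a sparse graphical model could be estimated, assuming the glasso package and the `train` data frame from the earlier sketch, is:

```r
library(glasso)                              # assumed package for the sparse estimate

X <- scale(data.matrix(train[, names(train) != "ID"]))   # drop ID, standardize
S <- cov(X)

# Estimate a sparse inverse covariance at the lambda mentioned above.
fit <- glasso(S, rho = 0.399)

# Sparsity level: share of non-zero off-diagonal entries in the precision matrix.
P <- fit$wi
mean(P[upper.tri(P)] != 0)
```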
Outlier Analysis
It is important to remove outliers before further analysis. First, I fit a full model; the p-values show that only FBfriends, Credits, Contacts and News are significant. I then build a new model by removing the covariates that do not explain the response variable well, and both the new model and all of its coefficients are significant. Finally, I check Cook's distance to identify outliers.
According to the plot above, I will remove observations 13, 48 and 63. I also try another method to detect outliers: outlier detection by clustering. The idea is to detect outliers with the k-means algorithm. With k-means, the data are partitioned into k groups by assigning each observation to the closest cluster center. After that, we calculate the distance between each observation and its cluster center, and pick those with the largest distances as outliers.
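A sketch of the two outlier checks, with the reduced linear model and the Cook's-distance cutoff being my assumptions rather than values from the report:

```r
# Cook's distance from the reduced linear model (formula assumed)
fit_small <- lm(Age ~ FBfriends + Credits + Contacts + News, data = train)
cd <- cooks.distance(fit_small)
plot(cd, type = "h", ylab = "Cook's distance")
which(cd > 4 / nrow(train))                   # rule-of-thumb cutoff, not from the report

# Outlier detection by clustering: distance of each point to its k-means center
X  <- scale(data.matrix(train[, names(train) != "ID"]))
km <- kmeans(X, centers = 2, nstart = 20)
d  <- sqrt(rowSums((X - km$centers[km$cluster, ])^2))
order(d, decreasing = TRUE)[1:3]              # the 3 observations farthest from their center
```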
For this data set, I prefer the Cook's-distance approach to detecting outliers, since the predictive models below seem to perform better when it is used.
PCA
The first thirteen principal components explain around 80% of the variance in the data, which is a large share, so I will treat the first thirteen principal components as the significant ones. From the scree plot, we see that each of the first four principal components explains a substantial amount of variance, the variance explained drops markedly for the following components, with the decline continuing from the sixth component through the thirteenth, and the curve flattens out after the thirteenth principal component. This suggests there is little benefit to examining more than thirteen principal components.
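A sketch of the PCA and scree plot behind this discussion, under the same assumptions about the `train` data frame:

```r
# PCA on the standardized covariates (ID and the response Age left out)
X   <- scale(data.matrix(train[, !(names(train) %in% c("ID", "Age"))]))
pca <- prcomp(X)

# Proportion of variance explained and its cumulative sum
pve <- pca$sdev^2 / sum(pca$sdev^2)
plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")   # scree plot
plot(cumsum(pve), type = "b", ylab = "Cumulative PVE")
cumsum(pve)[13]        # roughly 0.80 according to the text
```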
Hierarchical Clustering
The results above show that the choice of linkage certainly affects the clustering obtained. Typically, single linkage tends to yield trailing clusters: very large clusters onto which individual observations attach one by one. Complete linkage, on the other hand, tends to yield more balanced, attractive clusters, so I prefer complete linkage to single linkage. Clearly, people of similar age tend to be clustered together, but the groups differ in their other properties.
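A sketch of the two linkages being compared, labelling the leaves by Age as in the discussion above:

```r
# Hierarchical clustering on Euclidean distances of the standardized covariates
X <- scale(data.matrix(train[, !(names(train) %in% c("ID", "Age"))]))
d <- dist(X)
hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")

par(mfrow = c(1, 2))
plot(hc_complete, labels = train$Age, main = "Complete linkage", cex = 0.6)
plot(hc_single,   labels = train$Age, main = "Single linkage",   cex = 0.6)
```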
K-means Clustering
Since I have a small number of observations, I divide them into two groups. The plot above shows the observations separated into two groups in the space of the first two principal component scores.
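A sketch of the two-group k-means fit and the plot on the first two principal component scores:

```r
X   <- scale(data.matrix(train[, !(names(train) %in% c("ID", "Age"))]))
pca <- prcomp(X)

set.seed(1)
km <- kmeans(X, centers = 2, nstart = 20)

# Observations plotted on the first two PC scores, colored by cluster
plot(pca$x[, 1], pca$x[, 2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```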
Model Selection
For variable selection, I use best subset selection, forward stepwise selection, backward stepwise selection and lasso regression. Best subset selection returns different models under different criteria such as Cp, adjusted R-squared and AIC, so I wrote a loop to calculate the leave-one-out MSE for the best subset of each size. The model with 5 variables is the best subset, since it has the smallest MSE.
Leave-one-out MSE by subset size (1-8): 24.99610 19.61245 19.76116 13.56199 10.34055 13.16404 13.26169 11.91733
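A sketch of the loop that could produce these numbers, assuming the leaps package for best subset selection; the prediction helper follows the usual regsubsets idiom:

```r
library(leaps)                                  # assumed package providing regsubsets()

# Helper to predict from a regsubsets object for a given subset size
predict_subset <- function(object, newdata, id) {
  form  <- as.formula(object$call[[2]])
  mat   <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  drop(mat[, names(coefi), drop = FALSE] %*% coefi)
}

# Leave-one-out MSE for the best subset of each size (1 to 8)
n <- nrow(train)
loo_err <- matrix(NA, n, 8)
for (i in 1:n) {
  fit <- regsubsets(Age ~ . - ID, data = train[-i, ], nvmax = 8)
  for (k in 1:8) {
    pred <- predict_subset(fit, train[i, , drop = FALSE], id = k)
    loo_err[i, k] <- (train$Age[i] - pred)^2
  }
}
colMeans(loo_err)          # the row of MSE values above; smallest at size 5
```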
Both forward and backward stepwise selection choose the same 5 variables, Contacts, News, Credits, Field and FBfriends, which is consistent with the best subset. Finally, I use lasso regression; its MSE is 13.4838, higher than the best model obtained from the other variable selection methods. The lasso chooses 6 variables: FBfriends, Credits, Contacts, ParentContact, Emails and Field.
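A sketch of the stepwise and lasso fits, again assuming leaps and using the glmnet package loaded at the top; the leave-one-out choice of folds is my assumption:

```r
library(leaps)
library(glmnet)

# Forward and backward stepwise selection; inspect the 5-variable models
fwd <- regsubsets(Age ~ . - ID, data = train, nvmax = 8, method = "forward")
bwd <- regsubsets(Age ~ . - ID, data = train, nvmax = 8, method = "backward")
coef(fwd, 5)
coef(bwd, 5)

# Lasso: glmnet needs a numeric design matrix and a response vector
x <- model.matrix(Age ~ . - ID, data = train)[, -1]
y <- train$Age
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = nrow(train), grouped = FALSE)  # LOO CV
min(cv_lasso$cvm)                       # cross-validated MSE at the best lambda
coef(cv_lasso, s = "lambda.min")        # selected (non-zero) coefficients
```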
Check for Non-linear Effects
Splines
First, I build an additive model on all the variables except ID, with splines on FBfriends, News and Credits; the model shows that only FBfriends, News, Contacts, Field and Credits are significant. I decided to remove the other, unimportant variables, since they contribute little to explaining the response. This smaller model performs much better: the leave-one-out MSE decreases from 23.71507 to 8.482493. To improve the small model further, I replaced the two separate splines with an interaction spline, but the MSE changed only slightly. Adding Emails to the model also fails to improve the MSE. Another finding is that fitting the additive model with a Poisson family performs much better than the same model without it.
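A sketch of the reduced additive model and its leave-one-out MSE, using the mgcv package loaded at the top; the formula matches the one reported for gam2 in the Conclusion, and the column spelling FBfriends is taken from the data description:

```r
library(mgcv)

# Reduced additive model with smooth terms and a Poisson family
gam2 <- gam(Age ~ s(FBfriends) + s(Credits) + s(Contacts) + News + factor(Field),
            family = poisson, data = train)
summary(gam2)

# Leave-one-out MSE for the additive model
n   <- nrow(train)
err <- sapply(1:n, function(i) {
  fit <- gam(Age ~ s(FBfriends) + s(Credits) + s(Contacts) + News + factor(Field),
             family = poisson, data = train[-i, ])
  (train$Age[i] - predict(fit, train[i, , drop = FALSE], type = "response"))^2
})
mean(err)
```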
Polynomial Regression
To choose the degree of the polynomial terms, I fit a separate polynomial regression of Age against each independent variable and choose the degree using cross-validation. Then I build a full model including all the variables except ID together with the polynomial terms. The summary shows that only a few terms are significant, so I remove the unimportant terms as before. It turns out that the model without polynomial terms reaches the smallest MSE. After that, I add an interaction term between FBfriends and Credits, which does improve the model. The smallest MSE I get from polynomial regression is 10.06359.
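A sketch of the degree search for a single covariate (FBfriends used as the example); the degree range 1 to 5 is my assumption:

```r
# Leave-one-out MSE for polynomial degrees 1 to 5 of FBfriends
n <- nrow(train)
cv_mse <- sapply(1:5, function(d) {
  err <- sapply(1:n, function(i) {
    fit <- lm(Age ~ poly(FBfriends, d), data = train[-i, ])
    (train$Age[i] - predict(fit, train[i, , drop = FALSE]))^2
  })
  mean(err)
})
which.min(cv_mse)        # chosen degree for this covariate
```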
Linear Model
First, I build a full model with all the variables, an interaction term and a polynomial term. Then I keep the variables that are significant, ending up with a relatively small model. I diagnose this model with both plots and three tests to check whether the assumptions are satisfied. The residuals-versus-fitted plot shows that there is still non-linearity not captured by my model, since its red trend line is not flat. The tests indicate that the variance is non-constant, while the residuals appear normal and independent. Hence, I consider transformations to address the non-linearity in this model.
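A sketch of the diagnostics, with the formal tests coming from the car package loaded at the top; the exact reduced model (interaction and polynomial term) is an assumption of mine:

```r
library(car)

# Reduced linear model with an interaction and a polynomial term (formula assumed)
lm_fit <- lm(Age ~ FBfriends * Credits + poly(Contacts, 2) + News + factor(Field),
             data = train)

# Graphical diagnostics: residuals vs fitted, QQ plot, scale-location, leverage
par(mfrow = c(2, 2))
plot(lm_fit)

# Three formal tests of the assumptions
ncvTest(lm_fit)                      # constant variance
durbinWatsonTest(lm_fit)             # independence of residuals
shapiro.test(residuals(lm_fit))      # normality of residuals
```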
Transformation of the Response
I use the boxcox function in the MASS library to choose a possible transformation for the response variable. The plot above shows that the log-likelihood is maximized when lambda is close to -1, so I use 1/Y as the new response. It turns out the model with the transformed response performs even worse, since it still violates the constant-variance assumption. In short, transforming the response fails to improve the model.
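A sketch of the Box-Cox step with MASS::boxcox and the refit on the reciprocal response; lm_fit is the reduced model from the diagnostics sketch above:

```r
library(MASS)

# Profile log-likelihood over the Box-Cox parameter lambda
bc <- boxcox(lm_fit, lambda = seq(-2, 2, 0.1))
bc$x[which.max(bc$y)]                # lambda with the largest log-likelihood (near -1)

# Refit with the reciprocal response suggested by lambda close to -1
lm_inv <- lm(1 / Age ~ FBfriends * Credits + poly(Contacts, 2) + News + factor(Field),
             data = train)
```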
Weighted Linear Regression
WLS is usually used to fix the problem of non-constant residual variance in an ordinary least squares model; I just want to see whether it can improve my model.
A right-opening megaphone shape appears in the residual plot, suggesting non-constant variance. I therefore regress the absolute values of the residuals from the multiple linear regression on x to get an estimate of sigma, and define the weights as 1 over that estimate. With the weights in hand, I can easily fit the WLS model. To check its performance, I calculate its leave-one-out MSE, which is 25.2038; this is not good. I therefore try another WLS model based on the forward/backward stepwise selection, and its MSE is 23.6517. That does decrease the MSE, but it is still not good. Perhaps WLS is not suitable for my data set.
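A sketch of the weighted fit; regressing the absolute residuals on the fitted values stands in for the "on x" step described above, and the model formula is an assumption:

```r
# Step 1: ordinary least squares fit and its absolute residuals
ols <- lm(Age ~ FBfriends + Credits + Contacts + News + factor(Field), data = train)

# Step 2: model the absolute residuals to estimate sigma at each observation
sd_fit    <- lm(abs(residuals(ols)) ~ fitted(ols))
sigma_hat <- pmax(fitted(sd_fit), 1e-6)        # guard against non-positive estimates

# Step 3: weighted least squares with weights = 1 / sigma_hat
# (1 / sigma_hat^2 is the more common choice; the text describes 1 over the estimate)
wls <- lm(Age ~ FBfriends + Credits + Contacts + News + factor(Field),
          data = train, weights = 1 / sigma_hat)
```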
Ridge Regression
I also try ridge regression; its MSE is 15.19955, so the lasso regression is better. It seems that a smaller model suits my data set.
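A sketch of the ridge fit with glmnet (alpha = 0 gives the ridge penalty); the leave-one-out folds are my assumption:

```r
library(glmnet)

x <- model.matrix(Age ~ . - ID, data = train)[, -1]
y <- train$Age

# Ridge regression with the penalty parameter chosen by leave-one-out CV
cv_ridge <- cv.glmnet(x, y, alpha = 0, nfolds = nrow(train), grouped = FALSE)
min(cv_ridge$cvm)                    # cross-validated MSE at the best lambda
coef(cv_ridge, s = "lambda.min")
```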
Support Vector Regression
SVM is known as a classification method, but it can also be used to build regression models. I try two different kernel functions to fit the decision function. It is also important to choose the cost, which determines the number of support vectors. The results indicate that the model using the linear kernel reaches the smallest leave-one-out cross-validation MSE, which is 13.26392.
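A sketch of the support vector regression fits, assuming the e1071 package for svm(); the cost grid is an assumption:

```r
library(e1071)                                  # assumed package providing svm()

# Leave-one-out MSE for a linear-kernel SVR over a small grid of costs
n     <- nrow(train)
costs <- c(0.1, 1, 10)
loo_mse <- sapply(costs, function(C) {
  err <- sapply(1:n, function(i) {
    fit <- svm(Age ~ . - ID, data = train[-i, ], kernel = "linear", cost = C)
    (train$Age[i] - predict(fit, train[i, , drop = FALSE]))^2
  })
  mean(err)
})
costs[which.min(loo_mse)]

# A radial-kernel fit for comparison
svm_radial <- svm(Age ~ . - ID, data = train, kernel = "radial", cost = 1)
```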
KNN Regression
This is another common classification method, but it can also be used to fit regression models. The most important choice when fitting kNN regression is the number of neighbors used to predict new data. The leave-one-out cross-validation shows that the smallest MSE, 18.15405, is reached when the number of neighbors equals 7.
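A sketch of the kNN regression and the leave-one-out search over the number of neighbors, assuming the FNN package for knn.reg():

```r
library(FNN)                                    # assumed package providing knn.reg()

# Standardized numeric design matrix (ID and the response left out)
X <- scale(data.matrix(train[, !(names(train) %in% c("ID", "Age"))]))
y <- train$Age

# Leave-one-out MSE over a range of neighborhood sizes
loo_mse <- sapply(1:15, function(k) {
  pred <- sapply(1:nrow(X), function(i)
    knn.reg(train = X[-i, , drop = FALSE], test = X[i, , drop = FALSE],
            y = y[-i], k = k)$pred)
  mean((y - pred)^2)
})
which.min(loo_mse)                              # reported as 7 neighbors in the text
```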
Random Forest
Finally, I use random forest regression to fit my model, with ntree = 100. First, I fit the random forest on the full model and use leave-one-out cross-validation to calculate the MSE, which is 17.25867. I then try a smaller model including FBfriends, Credits, Contacts, News, Field, Origin, Pets and Twitter, and the MSE declines from 17.25867 to 14.20598.
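A sketch of the smaller random forest and its leave-one-out MSE, using the randomForest package loaded at the top:

```r
library(randomForest)

set.seed(1)
n <- nrow(train)

# Leave-one-out MSE for the reduced random forest with ntree = 100
err <- sapply(1:n, function(i) {
  fit <- randomForest(Age ~ FBfriends + Credits + Contacts + News +
                        Field + Origin + Pets + Twitter,
                      data = train[-i, ], ntree = 100)
  (train$Age[i] - predict(fit, train[i, , drop = FALSE]))^2
})
mean(err)
```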
Bootstrap to Estimate Confidence Intervals
The Confidence Interval for glm.fit3
My best model is the additive model, but I cannot get confidence intervals from that model, so I use my second-best model, glm.fit3. It turns out that the confidence intervals I get from different models vary greatly. Perhaps I need more data to fit my model and reach a more reliable conclusion.
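A sketch of a bootstrap confidence interval for the coefficients of a fitted glm, using the boot package loaded at the top; glm.fit3 itself is not shown in the report, so the formula below is an assumption:

```r
library(boot)

# Statistic: refit the model on a bootstrap resample and return its coefficients
boot_coef <- function(data, idx) {
  fit <- glm(Age ~ FBfriends + Credits + Contacts + News, data = data[idx, ])
  coef(fit)
}

set.seed(1)
b <- boot(train, boot_coef, R = 1000)

# Percentile interval for, e.g., the FBfriends coefficient (second column)
boot.ci(b, type = "perc", index = 2)
```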
Conclusion
Currently, my best model is the generalized additive model gam2. Its formula is Age ~ s(Fbfriends) + s(Credits) + s(Contacts) + News + Field, its family is Poisson, and its leave-one-out MSE is 8.482493. All of the R code is in the Appendix.
Appendix