Homework 3 - STAT 601

#Kevin Kuipers (Completed by myself)

#September 11, 2018

##Problem 1. Use the data from the library to answer the following questions

##Problem 1 Part A)

a) Construct graphical and numerical summaries that will show the relationship between 
tumor size and the number of recurrent tumors. Discuss your discovery. 
(Hint: mosaic plot may be a great way to assess this)

##My Assumption:

When starting out with any data set, I generally look at the simplest model first to see how it performs on the data. I believe the number of tumors will be best predicted by one variable.

I will create a data frame from the data set and load the needed libraries. After that, a mosaic plot will be created. In addition, the data will be split into two data sets to look at histograms for each category (>3cm vs <=3cm). These plots will be produced in both base R and ggplot. The ggplot versions can be a little more sophisticated: while base R draws two histograms, one for each category, ggplot can overlay the histograms on one plot using different colors. Even though the two styles differ a little, they display the same results. Likewise, the mosaic plots from the two functions display the same results, just with the >3cm and <=3cm categories flipped.

## 
##  Descriptive statistics by group 
## group: <=3cm
##    vars  n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 22 1.45 0.8      1    1.28   0   1   4     3 1.74      2.4 0.17
## -------------------------------------------------------- 
## group: >3cm
##    vars n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 9 1.78 1.09      1    1.78   0   1   4     3 0.89    -0.78 0.36
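The graphical and numerical summaries described above can be sketched in base R as follows. This is a minimal sketch on simulated data: the data frame `tumors_sim`, its values, and the plot titles are hypothetical stand-ins, not the homework data set.

```r
# Simulated stand-in data (hypothetical; not the homework data set)
set.seed(601)
tumors_sim <- data.frame(
  TumorSize = factor(rep(c("<=3cm", ">3cm"), times = c(22, 9))),
  NumberOfTumors = pmin(1 + rpois(31, lambda = 0.6), 4)
)

# Numerical summaries: cross-tabulation and group means
counts <- table(tumors_sim$TumorSize, tumors_sim$NumberOfTumors)
print(counts)
print(tapply(tumors_sim$NumberOfTumors, tumors_sim$TumorSize, mean))

# Graphical summaries: a mosaic plot and one histogram per size group
mosaicplot(counts, main = "Tumor size vs. number of tumors",
           xlab = "Tumor size", ylab = "Number of tumors")
op <- par(mfrow = c(1, 2))
for (g in levels(tumors_sim$TumorSize)) {
  hist(tumors_sim$NumberOfTumors[tumors_sim$TumorSize == g],
       breaks = seq(0.5, 4.5, by = 1), main = g,
       xlab = "Number of tumors")
}
par(op)
```

The histogram breaks are centered on the integer counts so that each bar corresponds to exactly one value of the number of tumors.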

##My Response:

The mosaic plot and histograms show that the frequency of the data is centered on 1 tumor, and that the frequency decreases as the number of tumors increases. This is the case for both tumor sizes. Most of the observations have 1 tumor, and the bulk of those belong to the <=3cm group. There is a positive skew in the overall data set and in each group (<=3cm vs >3cm). The descriptive statistics support this: the skew is 1.74 for <=3cm and 0.89 for >3cm, with a median of 1 in both groups.

Given this skewness, I wonder whether an exponentiated number of tumors would be a useful response when fitting a model. I will create five models:

1) NumberOfTumors ~ TumorSize
2) NumberOfTumors ~ TumorSize + Time
3) NumberOfTumors ~ TumorSize + Time + TumorSize*Time
4) NumberOfTumors ~ TumorSize*Time
5) exp(NumberOfTumors) ~ TumorSize

##Problem 1 Part B)

b) Build a Poisson regression that estimates the effect of size of 
tumor on the number of recurrent tumors.  Discuss your results.

Summary of Model 1 ( NumberOfTumors ~ TumorSize )

## 
## Call:
## glm(formula = NumberOfTumors ~ TumorSize, family = poisson(), 
##     data = bladdercancer_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6363  -0.3996  -0.3996   0.4277   1.7326  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     0.3747     0.1768   2.120    0.034 *
## TumorSize>3cm   0.2007     0.3062   0.655    0.512  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12.80  on 30  degrees of freedom
## Residual deviance: 12.38  on 29  degrees of freedom
## AIC: 87.191
## 
## Number of Fisher Scoring iterations: 4
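Because the Poisson model uses a log link, the coefficients are easier to read after exponentiating. The sketch below only re-uses the two estimates printed in the Model 1 summary above; the variable names are illustrative.

```r
# Coefficients copied from the Model 1 summary above
b0 <- 0.3747   # intercept: log mean count for the <=3cm group
b1 <- 0.2007   # TumorSize>3cm: log rate ratio, >3cm vs <=3cm

mean_small <- exp(b0)        # fitted mean count, <=3cm
mean_large <- exp(b0 + b1)   # fitted mean count, >3cm
rate_ratio <- exp(b1)        # multiplicative effect of tumor size

print(round(c(mean_small = mean_small,
              mean_large = mean_large,
              rate_ratio = rate_ratio), 2))
```

The two fitted means round to 1.45 and 1.78, which agree with the group means in the descriptive statistics from Part A. This is expected when the only predictor is a two-level factor.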

Summary of Model 2 ( NumberOfTumors ~ TumorSize + Time )

## 
## Call:
## glm(formula = NumberOfTumors ~ TumorSize + Time, family = poisson(), 
##     data = bladdercancer_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8183  -0.4753  -0.2923   0.3319   1.5446  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)    0.14568    0.34766   0.419    0.675
## TumorSize>3cm  0.20511    0.30620   0.670    0.503
## Time           0.01478    0.01883   0.785    0.433
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12.800  on 30  degrees of freedom
## Residual deviance: 11.757  on 28  degrees of freedom
## AIC: 88.568
## 
## Number of Fisher Scoring iterations: 4

Summary of Model 3 ( NumberOfTumors ~ TumorSize + Time + TumorSize*Time)

## 
## Call:
## glm(formula = NumberOfTumors ~ TumorSize + Time + TumorSize * 
##     Time, family = poisson(), data = bladdercancer_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6943  -0.5581  -0.2413   0.2932   1.4644  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)         0.03957    0.43088   0.092    0.927
## TumorSize>3cm       0.46717    0.66713   0.700    0.484
## Time                0.02138    0.02418   0.884    0.377
## TumorSize>3cm:Time -0.01676    0.03821  -0.439    0.661
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12.800  on 30  degrees of freedom
## Residual deviance: 11.566  on 27  degrees of freedom
## AIC: 90.377
## 
## Number of Fisher Scoring iterations: 4

Summary of Model 4 ( NumberOfTumors ~ TumorSize * Time )

## 
## Call:
## glm(formula = NumberOfTumors ~ TumorSize * Time, family = poisson(), 
##     data = bladdercancer_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6943  -0.5581  -0.2413   0.2932   1.4644  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)         0.03957    0.43088   0.092    0.927
## TumorSize>3cm       0.46717    0.66713   0.700    0.484
## Time                0.02138    0.02418   0.884    0.377
## TumorSize>3cm:Time -0.01676    0.03821  -0.439    0.661
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 12.800  on 30  degrees of freedom
## Residual deviance: 11.566  on 27  degrees of freedom
## AIC: 90.377
## 
## Number of Fisher Scoring iterations: 4
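The summaries of Model 3 and Model 4 are identical because, in an R formula, `TumorSize * Time` already expands to both main effects plus their interaction. A small check of the expanded term labels, using only the two formulas shown above:

```r
# The two formulas from Model 3 and Model 4
f3 <- NumberOfTumors ~ TumorSize + Time + TumorSize * Time
f4 <- NumberOfTumors ~ TumorSize * Time

# terms() expands the formula; "term.labels" lists the resulting terms
labels3 <- attr(terms(f3), "term.labels")
labels4 <- attr(terms(f4), "term.labels")

print(labels3)
print(labels4)
same_model <- identical(labels3, labels4)
print(same_model)
```

Both formulas expand to the terms TumorSize, Time, and TumorSize:Time, so they describe the same model.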

Summary of Model 5 ( eNumberOfTumors ~ TumorSize ), where eNumberOfTumors is the number of tumors after applying the exp function

## 
## Call:
## glm(formula = eNumberOfTumors ~ TumorSize, family = poisson(), 
##     data = bladdercancer_dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6574  -0.3121   0.1491   0.1491   0.2374  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -1.2564     0.3996  -3.144  0.00167 **
## TumorSize>3cm  -0.1624     0.7866  -0.206  0.83648   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2.5666  on 30  degrees of freedom
## Residual deviance: 2.5229  on 29  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4

Comparing the models using anova function with Chisq test
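The chi-square comparison of the nested models can also be recomputed by hand from the residual deviances and degrees of freedom printed in the summaries above. This is a minimal sketch of that arithmetic, not the output of the anova call itself; the variable names are illustrative.

```r
# Residual deviances and degrees of freedom copied from the summaries above
dev <- c(m1 = 12.38, m2 = 11.757, m3 = 11.566)
df  <- c(m1 = 29,    m2 = 28,     m3 = 27)

# Drop in deviance for each added term
lr_12 <- dev[["m1"]] - dev[["m2"]]   # adding Time
lr_23 <- dev[["m2"]] - dev[["m3"]]   # adding TumorSize:Time

# Upper-tail chi-square p-values on the difference in degrees of freedom
p_12 <- pchisq(lr_12, df = df[["m1"]] - df[["m2"]], lower.tail = FALSE)
p_23 <- pchisq(lr_23, df = df[["m2"]] - df[["m3"]], lower.tail = FALSE)

print(c(lr_12 = lr_12, p_12 = p_12, lr_23 = lr_23, p_23 = p_23))
```

Under this calculation the deviance drops by 0.623 and 0.191, each on 1 degree of freedom, and both p-values are well above 0.05.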

##My Response:

From the results above I believe Model 2 or Model 3 would be the best one for explaining the data set. Therefore, my original assumption was incorrect. It appears that adding more variables and their interactions tends to fit the data better: the residual deviance falls as TumorSize, Time, and TumorSize*Time are added (12.38, 11.757, and 11.566). Therefore I would reject my original hypothesis and accept that two variables explain NumberOfTumors better than one. However, the improvement is modest: Model 1 has the lowest AIC (87.191, against 88.568 and 90.377), none of the TumorSize, Time, or interaction coefficients is individually significant, and the intercept is significant only in the first model (NumberOfTumors ~ TumorSize).

The exponentiated-response model is interesting. The AIC is infinite, most likely because the transformed response is no longer integer-valued, in which case the Poisson log-likelihood is not finite. The null and residual deviances are very low, but they are computed on a different response and so are not comparable with those of Models 1 to 4. The p-value of the intercept is highly significant.
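An infinite AIC from a Poisson `glm` can be reproduced with any non-integer response, because the Poisson log-density is minus infinity at non-integer values. The sketch below uses a small made-up vector `y`; it is an illustration of the mechanism, not a re-run of Model 5.

```r
# Hypothetical non-integer response: exp() of a few small counts
y <- exp(c(1, 1, 2, 1, 3, 1, 2))

# R warns about non-integer values; the fit itself still converges
fit <- suppressWarnings(glm(y ~ 1, family = poisson()))

# The log-likelihood is -Inf, so the AIC is reported as Inf
aic_value <- suppressWarnings(AIC(fit))
print(aic_value)
```

The coefficient estimates are still produced, which is why the summary looks normal apart from the AIC line.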

##Problem 2. The following data is the number of new AIDS cases in Belgium between the years 1981-1993. Let \(t\) denote time

Do the following

##My Assumption:

My speculation is that AIDS cases will increase as time goes on.

##Problem 2 Part A

a) Plot the relationship between AIDS cases against time. Comment on the plot

I will create a data frame from the two variables and draw scatter plots in both base R and ggplot to see if there is a relationship.

##My Response:

It appears my original assumption looks correct thus far. As time goes on, the number of AIDS cases increases. The scatterplot suggests a positive relationship between the two variables.

##Problem 2 Part B

b) Fit a Poisson regression model $\log(\mu_i)=\beta_0+\beta_1t_i$. 
Comment on the model parameters and residuals (deviance) vs Fitted plot.

I will fit a regression model of AIDS cases vs Time and display the residual vs fitted plot and perform a summary of the model

## 
## Call:
## glm(formula = cases ~ time, family = poisson(), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6784  -1.5013  -0.2636   2.1760   2.7306  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.140590   0.078247   40.14   <2e-16 ***
## time        0.202121   0.007771   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
## 
## Number of Fisher Scoring iterations: 4

##My Response:

The Residuals vs Fitted plot reveals two things. First, the parabolic shape of the plot suggests that adding a quadratic term to the model might fit better. Second, there appear to be some outliers at points 1, 2, and 13.

The summary of the model shows a highly statistically significant relationship between the two variables, and the standard errors are low. However, the residual deviance (80.686 on 11 degrees of freedom) is high relative to its degrees of freedom, which points to a lack of fit.
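Two quantities help when reading this summary: the exponentiated slope, which gives the multiplicative change in expected cases per unit of time, and the ratio of residual deviance to its degrees of freedom. Both are computed below from the numbers in the summary above; the variable names are illustrative.

```r
# Values copied from the model summary above
beta_time    <- 0.202121
resid_dev    <- 80.686
resid_df     <- 11

# Multiplicative change in the expected count for one more unit of time
rate_ratio <- exp(beta_time)

# A ratio far above 1 suggests the model does not fit well
dev_ratio <- resid_dev / resid_df

print(round(c(rate_ratio = rate_ratio, dev_ratio = dev_ratio), 3))
```

A rate ratio of about 1.224 corresponds to roughly 22% more expected cases per unit of time, and a deviance ratio of about 7.3 is consistent with the curved pattern in the residual plot.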

##Problem 2 Part C

c) Now add a quadratic term in time 
(i.e., $\log(\mu_i)=\beta_0+\beta_1t_i+\beta_2t_i^2$) 
and fit the model. Comment on the model parameters and assess the residual plots.

##My Assumption:

I believe that adding a squared time term will fit the data better than the first model.

I will create a similar model but with an additional term, time squared, so that AIDS cases ~ time + time^2. Again I will look at the residuals vs fitted plot.
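One detail matters when writing the quadratic model in R: inside a formula, `time^2` is formula algebra and collapses to `time`, so the squared term has to be supplied either as a separate column or wrapped in `I()`. The sketch below illustrates this on simulated counts; the data frame `sim` and the coefficients used to simulate it are assumptions for illustration only.

```r
# Simulated stand-in data (hypothetical; not the Belgian AIDS counts)
set.seed(601)
sim <- data.frame(time = 1:13)
sim$timesq <- sim$time^2
sim$cases <- rpois(13, lambda = exp(1.9 + 0.55 * sim$time - 0.02 * sim$timesq))

# Two equivalent ways to include the quadratic term
fit_col <- glm(cases ~ time + timesq,    family = poisson(), data = sim)
fit_I   <- glm(cases ~ time + I(time^2), family = poisson(), data = sim)

# Without I(), time^2 is read as formula algebra and no new term is added
fit_plain <- glm(cases ~ time + time^2, family = poisson(), data = sim)

print(coef(fit_col))
print(coef(fit_I))
print(coef(fit_plain))   # only an intercept and a time coefficient

# Residuals vs fitted plot for the quadratic model
plot(fit_col, which = 1)
```

Using a precomputed `timesq` column, as in the model fitted in this homework, avoids the issue entirely.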

##Problem 2 Part D

d) Compare the two models using AIC. Which model is better?

Summary of the overall regression model for Model1: number of AIDS cases ~ time.

## 
## Call:
## glm(formula = cases ~ time, family = poisson(), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6784  -1.5013  -0.2636   2.1760   2.7306  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 3.140590   0.078247   40.14   <2e-16 ***
## time        0.202121   0.007771   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
## 
## Number of Fisher Scoring iterations: 4

Summary of the overall regression model for Model2: number of AIDS cases ~ time + time^2

## 
## Call:
## glm(formula = cases ~ time + timesq, family = poisson(), data = dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.45903  -0.64491   0.08927   0.67117   1.54596  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.901459   0.186877  10.175  < 2e-16 ***
## time         0.556003   0.045780  12.145  < 2e-16 ***
## timesq      -0.021346   0.002659  -8.029 9.82e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.2058  on 12  degrees of freedom
## Residual deviance:   9.2402  on 10  degrees of freedom
## AIC: 96.924
## 
## Number of Fisher Scoring iterations: 4

Comparing the two models AIC score

## Model 1 of AIDS Case ~ time AIC is: 166.3698

## Model 2 of AIDS Case ~ time + time^2 AIC is: 96.92358
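For two models fitted to the same response, the difference in AIC equals the drop in residual deviance minus twice the number of added parameters. The sketch below checks that relationship with the values reported in the two summaries above; the variable names are illustrative.

```r
# Values copied from the two model summaries above
aic_linear <- 166.3698
aic_quad   <- 96.92358
dev_linear <- 80.686
dev_quad   <- 9.2402
extra_params <- 1            # the added timesq coefficient

delta_aic <- aic_linear - aic_quad
delta_dev <- dev_linear - dev_quad

# These two numbers should agree up to rounding in the printed deviances
print(c(delta_aic = delta_aic,
        from_deviance = delta_dev - 2 * extra_params))
```

The two values agree to within rounding, so the AIC improvement of about 69 comes almost entirely from the drop in deviance.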

##My Response:

Consistent with my assumption, the second model containing time squared fits the data better. The AIC is much lower (96.92 vs 166.37) and the residual deviance is much lower (9.24 vs 80.69). The residuals vs fitted plot shows more of a cigar shape, with no clear curvature, which is a good sign.

##Problem 2 Part E

e) Use the anova() function to perform a $\chi^2$ test for model selection. 
Did adding the quadratic term improve the model?

Running the anova function on Model1 and Model2 with a Chisq test to see the statistical significance of the models and the overall residual deviance.
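The comparison can be sketched with `anova()` on two nested Poisson fits. The example below uses simulated counts, so the data frame `sim`, the simulation coefficients, and the resulting numbers are hypothetical and will not match the homework output.

```r
# Simulated stand-in data (hypothetical; not the Belgian AIDS counts)
set.seed(601)
sim <- data.frame(time = 1:13)
sim$timesq <- sim$time^2
sim$cases <- rpois(13, lambda = exp(1.9 + 0.55 * sim$time - 0.02 * sim$timesq))

# Nested models: linear in time, then linear plus quadratic
fit_linear <- glm(cases ~ time,          family = poisson(), data = sim)
fit_quad   <- glm(cases ~ time + timesq, family = poisson(), data = sim)

# Analysis of deviance table with a chi-square test on the added term
comparison <- anova(fit_linear, fit_quad, test = "Chisq")
print(comparison)

# p-value for the quadratic term (second row of the table)
p_value <- comparison[2, "Pr(>Chi)"]
```

The second row of the table reports the drop in deviance for the added term, its degrees of freedom, and the p-value of the chi-square test.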

##My Response:

The anova test confirms my previous arguments: the quadratic model has a significantly lower residual deviance. In addition, my original hypothesis is correct: as time goes on, the number of AIDS cases has been increasing. This is best fitted by Model 2: AIDS cases ~ time + time^2.

##Problem 3. Load the dataset from library. The dataset contains information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt. It is a 4 dimensional dataset with 10000 observations. You had developed a logistic regression model on HW #2. Now consider the following two models

For the two competing models do the following

##Problem 3 Part A

a) With the whole data compare the two models (Use AIC and/or error rate)

##My Assumption:

If I were going into this data set for the first time, I would assume that Model1 would produce better results. But in Homework 2 I produced scatter plots, boxplots, and descriptive statistics that seemed to indicate that student was not always a good indicator of defaulting. It seemed like balance was the statistically significant variable for defaulting. Therefore, I believe Model2 will perform better, with a lower AIC and lower error rate than Model1.

## Model1's AIC: 1577.682

## . Model2's AIC: 1600.452

## [1] "Model1 Confusion Matrix:"

##                True
## Model1_fit_pred   No  Yes
##             No  9628  228
##             Yes   39  105

## Model1 Error Rate 0.0267

##  Therefore the accuracy of the Model1 is: 97.33

## [1] "Model2 Confusion Matrix:"

##                True
## Model2_fit_pred   No  Yes
##             No  9625  233
##             Yes   42  100

## Model2 Error Rate 0.0275

##  Therefore the accuracy of the Model2 is: 97.25

## . The Model1 MSE is:  0.02130176

## . The Model2 MSE is:  0.02170579
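The error rates above follow directly from the two confusion matrices. The sketch below rebuilds the matrices from the printed counts and also computes the share of true defaulters that each model catches, which is worth checking because defaults are rare. The helper functions `error_rate` and `sensitivity` are illustrative names, not part of the original code.

```r
# Confusion matrices copied from the output above (rows = predicted, cols = true)
labels <- list(Predicted = c("No", "Yes"), True = c("No", "Yes"))
cm1 <- matrix(c(9628, 39, 228, 105), nrow = 2, dimnames = labels)
cm2 <- matrix(c(9625, 42, 233, 100), nrow = 2, dimnames = labels)

# Misclassification rate: share of off-diagonal counts
error_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

# Sensitivity: share of true "Yes" cases predicted as "Yes"
sensitivity <- function(cm) cm["Yes", "Yes"] / sum(cm[, "Yes"])

err1 <- error_rate(cm1)
err2 <- error_rate(cm2)
print(c(Model1 = err1, Model2 = err2))
print(c(Model1 = sensitivity(cm1), Model2 = sensitivity(cm2)))
```

The error rates reproduce the 0.0267 and 0.0275 reported above. The sensitivity shows how few of the 333 true defaulters are flagged at the 0.5 cutoff, even though overall accuracy is above 97%.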

##My Response:

It appears my hypothesis was incorrect. Model1 has a lower AIC than Model2 (1577.682 vs 1600.452) and a very slightly lower error rate (0.0267 vs 0.0275). However, the differences between the models are small.

##Problem 3 Part B

b) Use validation set approach and choose the best model. 
Be aware  that we have few people who defaulted in the data.

Splitting the data set into 70% for training and 30% for testing. The models will be fitted on the 70% training set, then compared by AIC and by MSE on the 30% test set.

Model 1 is default ~ student + balance. Model 2 is default ~ balance.

## [1] "After spliting the data set 70% for training and 30% for testing and developing models on the training data testing them against the mean standard errors we see that:"

## Training-test split at 70% training vs 30%$ test MSE for Model1:  0.02249964

##  Training-test split at 70% training vs 30%$ test MSE for Model2:  0.0230682

## Model1 AIC:  1085.103

## Model2 AIC:  1096.991

##My Response:

It appears Model 1 has a lower AIC and a lower test MSE; therefore, Model1 fits the data set best.
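Because few people default, a purely random 70/30 split can leave very few defaulters in the test set. One option is to sample 70% within each outcome class. The sketch below shows that idea on simulated data; the data frame `sim`, its coefficients, and the two candidate formulas are hypothetical stand-ins, not the homework data or its models.

```r
# Simulated stand-in data with a rare outcome (hypothetical)
set.seed(601)
n <- 2000
sim <- data.frame(
  student = factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.7, 0.3))),
  balance = pmax(rnorm(n, mean = 800, sd = 450), 0)
)
p_default <- plogis(-10 + 0.0055 * sim$balance - 0.7 * (sim$student == "Yes"))
sim$default <- rbinom(n, 1, p_default)

# Stratified split: draw 70% of the row indices within each outcome class
idx_by_class <- split(seq_len(n), sim$default)
train_idx <- unlist(lapply(idx_by_class,
                           function(i) sample(i, floor(0.7 * length(i)))))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit two candidate models on the training rows only
fit_a <- glm(default ~ student + balance, family = binomial(), data = train)
fit_b <- glm(default ~ balance,           family = binomial(), data = train)

# Misclassification rate on the held-out rows at a 0.5 cutoff
test_error <- function(fit) {
  predicted <- predict(fit, newdata = test, type = "response") > 0.5
  mean(predicted != test$default)
}
errs <- c(fit_a = test_error(fit_a), fit_b = test_error(fit_b))
print(errs)
```

Stratifying keeps the proportion of defaulters roughly equal in the training and test sets, which makes the two test error rates easier to compare.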

##Problem 3 Part C

c) Use LOOCV approach and choose the best model

##My Response:

Originally, I tried mimicking the code from the lecture for resampling using Leave One Out Cross Validation. However, after multiple tries RStudio kept hanging. To get around this I built an empty matrix with 500 rows and 1 column to loop through. In the loop I created a training set for each i in 1:500, then fitted the models for each ith iteration. I also computed the MSE for each ith iteration and stored it in the ith row of the previously empty matrix. From there I can calculate the mean error rate of the LOOCV method.

## The mean error rate of Model1 using Leave one out cross validation is: 0.02135832

## The mean error rate of Model2 using Leave one out cross validation is: 0.02229847
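The leave-one-out loop described above can be sketched as follows. This is a minimal illustration on a small simulated data set, not the code that produced the numbers above; the function `loocv_error`, the data frame `sim`, and the 0.5 cutoff are assumptions.

```r
# Simulated stand-in data (hypothetical), kept small so the loop runs quickly
set.seed(601)
sim <- data.frame(balance = rnorm(150))
sim$default <- rbinom(150, 1, plogis(-1 + 1.2 * sim$balance))

# Leave-one-out CV: refit without row i, then predict row i
loocv_error <- function(formula, data) {
  n <- nrow(data)
  y <- data[[all.vars(formula)[1]]]      # response column named in the formula
  wrong <- logical(n)
  for (i in seq_len(n)) {
    fit <- glm(formula, family = binomial(), data = data[-i, ])
    p_i <- predict(fit, newdata = data[i, , drop = FALSE], type = "response")
    wrong[i] <- (p_i > 0.5) != y[i]
  }
  mean(wrong)                            # share of held-out rows misclassified
}

loocv_rate <- loocv_error(default ~ balance, sim)
print(loocv_rate)
```

Each observation is held out exactly once, so the number of model fits equals the number of rows. That cost is why a full run over ten thousand observations is slow, and why limiting the loop to a subset of rows gives only an approximation of LOOCV.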

##Problem 3 Part D

d) Use 10-fold cross-validation approach and choose the best model

Creating cross-validation for both models with K = 10

## The mean error rate for 10-fold cross-validation for Model1 is: 0.02088694

##  The mean error rate for 10-fold cross-validation for Model2 is: 0.0211975
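A 10-fold cross-validation can be written by hand in a few lines. The sketch below is an illustration on simulated data, not the code used for the results above; the function `kfold_error`, the data frame `sim`, and the 0.5 cutoff are assumptions.

```r
# Simulated stand-in data (hypothetical)
set.seed(601)
sim <- data.frame(balance = rnorm(500))
sim$default <- rbinom(500, 1, plogis(-1 + 1.2 * sim$balance))

# K-fold CV: assign each row to one of k folds, hold each fold out in turn
kfold_error <- function(formula, data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  y <- data[[all.vars(formula)[1]]]      # response column named in the formula
  fold_err <- numeric(k)
  for (j in seq_len(k)) {
    held_out <- folds == j
    fit <- glm(formula, family = binomial(), data = data[!held_out, ])
    p <- predict(fit, newdata = data[held_out, ], type = "response")
    fold_err[j] <- mean((p > 0.5) != y[held_out])
  }
  mean(fold_err)                         # average error over the k folds
}

cv_rate <- kfold_error(default ~ balance, sim, k = 10)
print(cv_rate)
```

With K = 10 only ten models are fitted regardless of the number of rows, which makes this approach far cheaper than leave-one-out on a large data set.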

##Problem 3 Part E

Report validation misclassification (error) rate for both models 
in each of the three assessment methods. Discuss your results.

## [1] "Training-test split at 70% training vs 30%$ test MSE"

## Model1 MSE: 0.02249964    Model2 MSE: 0.0230682

## [1] "The mean error rate of Model1 using Leave one out cross validation"

## Model1: 0.02135832    Model2: 0.02229847

## [1] "The mean error rate for 10-fold cross validation"

## Model1: 0.02088694     Model2: 0.0211975

##My Response:

After going through all three assessment methods, Model1 yields the lowest error rate in each one. Therefore, having student and balance as predictors for defaulting yields a lower error rate in the model.

##Problem 4. In the library load the dataset. This contains Daily percentage returns for the S&P 500 stock index between 2001 and 2005. There are 1250 observations and 9 variables. The variable of interest is Direction which is a factor with levels Down and Up indicating whether the market had a positive or negative return on a given day. Since the goal is to predict the direction of the stock market in the future, here it would make sense to use the data from years 2001 - 2004 as training and 2005 as validation. According to this, create a training set and testing set. Perform logistic regression and assess the error rate.

##My Response:

Creating many different models for testing by using many different variables and the interactions between them. Also running a quick ggpairs plot of all the variables to gain some insight into the data.

The data will also be subsetted into two data sets: the years 2001 - 2004 for developing a model, and 2005 for testing it. Many different models will be created to see which one fits the 2005 data best.
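The year-based split described above can be sketched as follows. The example uses simulated data, so the data frame `market_sim`, its column names (`year`, `lag1`, `lag2`, `direction`), and the resulting error rate are hypothetical and only illustrate the mechanics.

```r
# Simulated stand-in data (hypothetical; not the S&P 500 data set)
set.seed(601)
market_sim <- data.frame(
  year = rep(2001:2005, each = 50),
  lag1 = rnorm(250),
  lag2 = rnorm(250)
)
market_sim$direction <- factor(ifelse(rnorm(250) > 0, "Up", "Down"),
                               levels = c("Down", "Up"))

# Train on 2001-2004, validate on 2005
train <- subset(market_sim, year <= 2004)
test  <- subset(market_sim, year == 2005)

# With a factor response, glm models the probability of the second level ("Up")
fit <- glm(direction ~ lag1 + lag2, family = binomial(), data = train)

# Classify 2005 days at a 0.5 cutoff and compute the misclassification rate
prob_up <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob_up > 0.5, "Up", "Down")
test_error <- mean(pred != test$direction)
print(test_error)
```

Splitting by year rather than at random keeps the validation data strictly later than the training data, which matches the goal of predicting future market direction.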

##My Response:

Running an anova function with a Chisq test to see which model has the most significance while maintaining a lower deviance.

##My Response:

Based on the anova Chisq test, I believe mod23 may obtain the best results without overfitting the data too much. Its residual deviance is lower than that of a simple model, and it is close to significant at the 0.05 level, with a chi-square p-value of 0.05798. The deviance is also not too large compared to the more complex models.

Running a summary of mod23: direction ~ lag1 + lag2 + lag3 + lag4 + lag5 + lag1\*lag2 + lag2\*lag3 + lag3\*lag4 + lag1\*lag3 + lag1\*lag4 + lag2\*lag4 + lag1\*lag5

## 
## Call:
## glm(formula = direction ~ lag1 + lag2 + lag3 + lag4 + lag5 + 
##     lag1 * lag2 + lag2 * lag3 + lag3 * lag4 + lag1 * lag3 + lag1 * 
##     lag4 + lag2 * lag4 + lag1 * lag5, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6181  -1.2021   0.9948   1.1459   1.5470  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.073213   0.056880   1.287   0.1980  
## lag1        -0.069594   0.052213  -1.333   0.1826  
## lag2        -0.049179   0.051867  -0.948   0.3430  
## lag3        -0.004783   0.051054  -0.094   0.9254  
## lag4         0.004348   0.050946   0.085   0.9320  
## lag5         0.009765   0.050373   0.194   0.8463  
## lag1:lag2   -0.016777   0.035746  -0.469   0.6388  
## lag2:lag3    0.015592   0.034913   0.447   0.6552  
## lag3:lag4   -0.043079   0.035381  -1.218   0.2234  
## lag1:lag3    0.042934   0.032360   1.327   0.1846  
## lag1:lag4   -0.017387   0.028682  -0.606   0.5444  
## lag2:lag4   -0.016467   0.032113  -0.513   0.6081  
## lag1:lag5   -0.064251   0.034209  -1.878   0.0604 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1721.1  on 1237  degrees of freedom
## AIC: 1747.1
## 
## Number of Fisher Scoring iterations: 4

##My Response:

Now it is time to test the model. I will use the same methods as in Problem 3: a 70%/30% split, LOOCV, and 10-fold cross-validation.

First I will do the leave-one-out (LOOCV) method. In the loop I create a training set for each i in 1:500, then fit the model for each ith iteration. I also compute the MSE for each ith iteration and store it in the ith row of the previously empty matrix. From there I can calculate the mean error rate of the LOOCV method.

Next I will perform the 10-fold cross-validation method.

Calculating the MSE for all methods

MSE for 70% vs 30%

## The MSE for mod23 is:  0.1700634

MSE for LOOCV method

## The MSE from the LOOCV method for mod23 is:  0.2591566

MSE for 10-fold Cross validation method

## The MSE from the 10-fold cross validation method for mod23 is:  0.2596989
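A useful reference point for these values is the MSE of a predictor that always outputs a probability of 0.5. The sketch below assumes the MSE values above are mean squared differences between the 0/1 outcome and the predicted probability; that is an assumption, since the code that produced them is not shown. The outcome vector `y` is made up.

```r
# Hypothetical 0/1 outcomes (1 = Up, 0 = Down)
y <- c(1, 0, 1, 1, 0, 0, 1, 0)

# Squared error of always predicting probability 0.5:
# every term is (1 - 0.5)^2 or (0 - 0.5)^2, so each term equals 0.25
baseline_mse <- mean((y - 0.5)^2)
print(baseline_mse)
```

Under that assumption, a cross-validated MSE near 0.25 means the model's probabilities are about as informative as a constant 0.5.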

##My Final Response:

Using the final model, the MSE is less than 0.26 (about 0.259 under both LOOCV and 10-fold cross-validation) for predicting whether the market will move up or down on a given day using the 5 lag variables and some interactions between them. If these values are squared errors of predicted probabilities, they are close to the 0.25 that a constant prediction of 0.5 would give, so the result should be read with caution. It is interesting that the lag variables seem to have some interaction between them that models how the market will act (up or down). In particular, the 1-day and 5-day lags together may provide some indication of how the stock market will move, whether up or down.