I. Content

This Markdown includes the following for each statistical approach covered:

  1. Major theoretical concepts in Statistical Learning.
  2. The Assumptions behind each statistical approach.
  3. Resampling and Feature Selection application in the context of each approach.

The Models included:

  1. Simple & Multiple Linear Regression
  2. Logistic Regression
  3. Linear & Quadratic Discriminant Analysis

II. Introduction to Statistical Learning

A. Workflow of a Data Science Project:

  1. Identification of the Problem: Theories and Hypotheses
  2. Data Collection: Acquisition and Screening
  3. Data Investigation: Visualization and Observing Possible Patterns
  4. Identification of Analysis Type: Regression, Classification …
  5. Identification of potential Models
  6. Training the Models
  7. Testing the Models and summarizing the results
  8. Choosing the Best Fit Model according to a set of Measurements.

B. Types of Variables

| Quantitative Variables | Qualitative Variables |
|---|---|
| Interval Scaled (no inherent 0 point) | Nominal or Binary (categorical, 2 or more classes) |
| Ratio Scaled (inherent 0 point) | Ordinal (e.g. No, Slight, Significant) |

  1. It is wise to use Mean, Median and Mode on Interval Scaled Data.
  2. It is also useful to report measures of dispersion (e.g. SD).
  3. For Qualitative Variables, it is important to check the Class Distribution.

Representation of Variables:

  1. Predictors as X0, X1, X2 … Xn-1
  2. Quantitative Output as Y
  3. Qualitative Output as G for Group

C. Goals

  1. Predict the Response given some Features
  2. Identify significant Features
  3. Understand the extent of the effect of Features on the Response

D. The Model Representation: Y = F(x) + Irreducible_Error

  1. The irreducible error is the error that we cannot remove with our model, or with any model. This error is caused by elements outside our control, such as statistical noise in the observations.

  2. F(x) is the predicting function. To construct the function we train the model using the training data.

E. Model Assessment

  1. The model must be Accurate in predicting the correct response. Then, for an accurate model, we seek the lowest mean squared error between the predicted response and the true response (in the case of regression), or the highest proportion of correct predictions (in the case of classification).

MSE: mean((predicted - Actual Response)^2)

As we train the Model, we care about reducing the MSE on the training data. Ultimately, after the model is trained, the MSE on unseen (test) data is what we really care about.

Therefore, as we train, we assess the model by the training MSE and we hope that the latter is reflected in the test MSE.

As the model function f(x) fits the training data more and more closely, the training MSE keeps decreasing, but beyond some point the test MSE starts to increase. Therefore, we adjust the model flexibility to reach the lowest test MSE.

  1. Generalizability is crucial for any model, in order to accurately predict on new observations.
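
A small simulated illustration of this point (hypothetical data, polynomial fits of increasing flexibility): the training MSE keeps falling as the degree grows, while the test MSE bottoms out and then rises again.

### Hypothetical sketch: train vs test MSE as flexibility increases
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
sim <- data.frame(x, y)
train <- sample(200, 100)
train_mse <- test_mse <- numeric(10)
for (d in 1:10) {
  fit <- lm(y ~ poly(x, d), data = sim, subset = train)
  train_mse[d] <- mean((sim$y[train] - predict(fit, sim[train, ]))^2)
  test_mse[d]  <- mean((sim$y[-train] - predict(fit, sim[-train, ]))^2)
}
plot(1:10, train_mse, type = "b", col = "blue",
     ylim = range(c(train_mse, test_mse)),
     xlab = "Flexibility (polynomial degree)", ylab = "MSE")
lines(1:10, test_mse, type = "b", col = "red")
legend("topright", c("Train MSE", "Test MSE"), col = c("blue", "red"), lty = 1)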

F. Trade-offs between different Models

Flexibility vs Interpretability:

An important property of any model is Flexibility: it determines how restricted the model's form is, with more flexible models placing fewer restrictions on the parameters.

  1. If the Flexibility Increases: The Model is Less Interpretable but might result in Higher Accuracy.
  2. Black Box Models such as Deep Learning Models have high flexibility and are hard to interpret compared to lower flexibility models such as simple Linear Regression with polynomial degree of one. However, these models usually have high accuracy.

How to choose:

| Lower Flexibility, Higher Interpretability | Higher Flexibility, Lower Interpretability |
|---|---|
| If we are looking to infer about the response, for example to understand how the features impact the response. | If we are looking to predict, the most accurate method is usually the one with the highest flexibility. |

Underfitting & Overfitting:

As mentioned before, generalization is a vital property of good models. Therefore, underfitting or overfitting the model would result in bad models with low generalizability.

  1. Underfitting: Poor performance on the training data & poor generalization. Underfitting occurs when the model cannot adequately capture the underlying structure of the data. Example: If the relation between the features and the response is quadratic, and we use a linear model, it will underfit the data. In this case the test and the train MSE are high.

  2. Overfitting: Good performance on the training data & poor generalization. Overfitting occurs when the model is too flexible and fits exactly the training data. The latter model would fail to predict on future samples. In this case, the training MSE is low, and the test MSE is high.

In the above figure: the green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and is likely to have a higher error rate on new, unseen data compared to the black line. Further, the fuchsia line represents an underfitted model, where the line does not adequately capture the structure of the data.

Bias-Variance Trade-off:

The mean of the test MSE, or the expected test MSE, can be decomposed into:

\[ E(\text{test MSE}) = Var(\hat{f}(x)) + [Bias(\hat{f}(x))]^2 + Var(\epsilon) \]

  1. Then, as flexibility increases, the variance of the predicted response will increase: changing one training sample could change f(x) drastically.

  2. On the other hand, as flexibility increases, the error of approximating a very complex problem by a much simpler one decreases. >> Generally: as Flexibility Increases, Variance Increases and Bias Decreases.

  3. The lowest expected test mse possible is equal to the Var(irreducible error). The latter is due to external reasons (data collection, sample size, sample similarity, etc…)

=============================================

III. Linear Regression

A. Defining Linear Regression

Linear Regression is a linear approach for modeling the relation between the Predictors and the Response.

In the Representation Y = f(x) + irreducible_error, f(x) = B0 + B1X1 + B2X2 + … + BpXp

Then, we say that the null Hypothesis assumes that the coefficients are zero, while the alternative hypothesis assumes that the coefficients are non-zero.

Our Goal therefore is to find the constants B0 to Bp in order to obtain the least test MSE when predicting on unseen data.

B. The assumptions of Linear Regression

  1. Linearity: The mean of the response variable (i.e. the expected value of the response) is a linear combination of the coefficients and the variables.
  2. Homoscedasticity: The variance of the residuals is the same for any value of X.
  3. Independence: Observations are independent of each other.
  4. Normality: For any fixed value of X, Y is normally distributed.
  5. For Multiple Linear Regression: The Independent Variables must not be highly correlated with each other.
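
A quick way to eyeball assumptions 1-4 in R is to inspect the standard residual diagnostic plots of a fitted model. A minimal sketch on simulated data (the variables here are hypothetical, not from the dataset used later):

### Residuals vs Fitted and Scale-Location check linearity and homoscedasticity,
### the Normal Q-Q plot checks normality of the residuals
set.seed(1)
x <- rnorm(200)
y <- 1 + 2*x + rnorm(200)
fit <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))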

C. Fitting the Model: Finding the Constants

Steps of Fitting the Model:

  1. Apply a resampling technique to split into Train and Test sets: Validation Set Approach, LOOCV, or k-fold CV.
  2. Training the Model using an Estimation Technique: Least Squares, or other forms of penalized Least Squares (Ridge & Lasso).
  3. Testing the model on the testing data and assessing the model accuracy.

D. Applying Least Squares

  1. In the Least Squares approach we aim to minimize the Residual Sum of Squares (RSS).

Therefore, for one predictor, we derive the values of the coefficients that minimize the RSS.
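
The standard closed-form least squares estimates, where \(\bar{x}\) and \(\bar{y}\) denote the sample means of the predictor and the response, are:

\[ \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} \]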

  2. Assessing the accuracy of the coefficient estimates:

Since the coefficient estimates are computed from one sample, they vary around the true population values, so the standard error (SE) is useful here.

Remember that the SE tells us, on average, how far an estimate falls from the true value.

We therefore compute the SE of each coefficient and compare it to the coefficient's own magnitude in order to understand whether we can reject the null hypothesis that the coefficient is zero (this ratio is the t-statistic shown after the list below).

  1. If SE(coefficient) is low (and smaller than the coefficient), then even a small coefficient can be distinguished from zero and the null hypothesis can be rejected.
  2. If SE(coefficient) is high, then the coefficient needs to be large in absolute value for the null hypothesis to be rejected.
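
In standard practice this comparison is formalized by the t-statistic reported for each coefficient (together with its p-value):

\[ t = \frac{\hat\beta_j - 0}{SE(\hat\beta_j)} \]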

Further, the SE can be used to compute confidence intervals: over repeated samples, the range of the 95% confidence interval will include the true unknown value of the parameter about 95% of the time.

\[ \text{95\% Confidence Interval}: \left[\hat\beta_i - 2\cdot SE(\hat\beta_i),\ \hat\beta_i + 2\cdot SE(\hat\beta_i)\right] \]
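
As a quick illustration in R (a minimal sketch on simulated data; the variables here are hypothetical), confint() returns these intervals for a fitted lm object:

### 95% confidence intervals for least squares coefficients
set.seed(1)
x <- rnorm(100)
y <- 2 + 3*x + rnorm(100)
fit <- lm(y ~ x)
confint(fit, level = 0.95)  # roughly estimate +/- 2*SE for each coefficient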

It is worth noting that if the training data does not spread out along the X axis, the range of lines that could plausibly fit the data is larger than when the data is well spread over the X axis, which means that the SE of the estimates increases when the data is not spread over the x-axis.

More spread-out data = lower SE = generally more accurate coefficient estimates.

  3. Assessing the accuracy of the Model:

RSE: the average amount by which the predicted response deviates from the actual response (a measure of the lack of fit).

For instance, if the RSE of the model = Z, then the predicted response would still be off by Z on average.
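
In standard notation (p predictors), the RSE is computed from the RSS as:

\[ RSE = \sqrt{\frac{RSS}{n-p-1}} \]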

R2: measures the proportion of variability in the response Y that can be explained by X.

\[ R^2 = 1 - \frac{RSS}{TSS}, \quad \text{where } RSS = \sum_i (y_i-\hat{y}_i)^2 \text{ and } TSS = \sum_i (y_i-\bar{y})^2 \]

F-Statistic: used for Hypothesis Testing; it must be greater than 1 if n is large, and substantially greater than 1 if n is small, in order to reject the null hypothesis.
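
In standard notation, the F-statistic for the overall regression with p predictors is:

\[ F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)} \]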

Note that the training RSS always decreases (and R2 increases) as the number of predictors increases, so these quantities underestimate the test error when comparing models of different sizes. To avoid this, when increasing the number of predictors we use unbiased estimates of the test MSE:

Mallows' Cp: an unbiased estimate of the test MSE; lower values mean lower estimated test error, and thus a better model.

\[ Cp = \frac{RSS + 2d\sigma^2}{n} \]

AIC: AIC stands for Akaike's Information Criterion, a metric developed by the Japanese statistician Hirotugu Akaike in the early 1970s. The basic idea of AIC is to penalize the inclusion of additional variables in a model: it adds a penalty that increases the error when additional terms are included. The lower the AIC, the better the model.

\[ AIC = \frac{Cp}{\sigma^2} \]

BIC: BIC (or Bayesian information criteria) is a variant of AIC with a stronger penalty for including additional variables to the model. BIC puts more penalty on models with more variables for n>7 since log(n) > 2 (check formula).

\[ BIC = \frac{RSS + \log(n)\cdot d\cdot\sigma^2}{n\sigma^2} \]

\[ \text{Adjusted } R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)} \]

  4. Things to account for in Multiple Linear Regression:

Interaction Effect: An interaction effect is the simultaneous effect of two or more independent variables on at least one dependent variable, in which their joint effect is significantly greater (or significantly less) than the sum of their individual effects.
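
For two predictors, this amounts to adding a product term to the linear model (as fitted later with the formula V11~V12*V8):

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon \]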

Practical

Loading the Libraries

## Visualization:
library(ggplot2)
library(GGally)

## Multi-Purpose
library(tidyverse)
library(dplyr)

## Correlation
#== Melt a Matrix
library(reshape2)


## Regularization (glmnet), Cross Validation (boot) & Inference (infer)
library(glmnet)
library(boot)
library(infer)

## Manual Cross Validation
library(caret)

## Subset Selection
library(leaps)

## Obtain Model Metrics
library(broom)

Importing the DataSet

data <- read.csv("Twelve.csv")
head(data)
##    V1   V2   V3  V4    V5 V6  V7      V8   V9  V10  V11 V12
## 1 6.7 0.28 0.28 2.4 0.012 36 100 0.99064 3.26 0.39 11.7   7
## 2 6.7 0.28 0.28 2.4 0.012 36 100 0.99064 3.26 0.39 11.7   7
## 3 5.1 0.47 0.02 1.3 0.034 18  44 0.99210 3.90 0.62 12.8   6
## 4 9.3 0.37 0.44 1.6 0.038 21  42 0.99526 3.24 0.81 10.8   7
## 5 6.4 0.38 0.14 2.2 0.038 15  25 0.99514 3.44 0.65 11.1   6
## 6 9.7 0.53 0.60 2.0 0.039  5  19 0.99585 3.30 0.86 12.4   6

Data Cleaning – More About this Later

## Check for Missing Values
length(which(is.na(data)))
## [1] 19
## Replace NA with Mean
data <- apply(data, 2, function(x){
  x[is.na(x)] = mean(x, na.rm = T)
  return (x)
})
data <- as.data.frame(data)
## Check for missing Values Again
length(which(is.na(data)))
## [1] 0

Training the Model - Validation Set Approach

split_sample <- function(data, split_prop, seed = NULL){
  if (!is.null(seed)) set.seed(seed)  # use the seed so the split is reproducible
  training <- sample(nrow(data), size=nrow(data)*split_prop, replace = FALSE)
  return(list(train = data[training,], test = data[-training,]))
}

data_split <- split_sample(data, 0.7, 123)

Simple Regression V11~V6

### The correlation between V6 and V11 is extremely low
cor(data$V11, data$V6)
## [1] -0.06967631
val_simple <- lm(V11~V6, data = data_split$train)
summary(val_simple)
## 
## Call:
## lm(formula = V11 ~ V6, data = data_split$train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1012 -0.8936 -0.2783  0.6754  4.5294 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.539633   0.058798 179.252   <2e-16 ***
## V6          -0.007684   0.003147  -2.442   0.0148 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.071 on 1117 degrees of freedom
## Multiple R-squared:  0.00531,    Adjusted R-squared:  0.00442 
## F-statistic: 5.963 on 1 and 1117 DF,  p-value: 0.01476
### Testing
pred <- predict(val_simple, newdata = data_split$test)
mse <- mean((pred-data_split$test$V11)^2)
mse
## [1] 1.093982

We analyze these results:

  1. The p-value of V6 is only marginally significant.
  2. The coefficient of V6 is extremely small compared to the average of the response.
  3. R2 is extremely low, which shows that V6 cannot explain the variation in the response.
  4. The SE of the V6 coefficient is very low, but since the coefficient itself is also extremely small, the null hypothesis (coefficient equal to zero) cannot be clearly rejected.
  5. MSE is mostly useful for comparative analysis between two models, to be seen later.

Multiple Linear Regression V11~highest correlated variables

corr <- cor(data)
corr <- reshape2::melt(corr, na.rm = TRUE, value.name = "corr")
corr <- corr[corr$Var1=='V11',]
corr[order(abs(corr$corr), decreasing = TRUE),]
##     Var1 Var2        corr
## 131  V11  V11  1.00000000
## 95   V11   V8 -0.49332813
## 143  V11  V12  0.47616632
## 59   V11   V5 -0.22007094
## 107  V11   V9  0.20437087
## 83   V11   V7 -0.20290459
## 23   V11   V2 -0.20098751
## 35   V11   V3  0.11016353
## 119  V11  V10  0.09174972
## 71   V11   V6 -0.06967631
## 11   V11   V1 -0.06172464
## 47   V11   V4  0.04210144
### Training
multi_lm <- lm(V11~V12+V8, data = data_split$train)
summary(multi_lm)
## 
## Call:
## lm(formula = V11 ~ V12 + V8, data = data_split$train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9337 -0.5985 -0.1154  0.5196  5.0293 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  244.13834   13.60357   17.95   <2e-16 ***
## V12            0.54924    0.03108   17.67   <2e-16 ***
## V8          -237.58408   13.61627  -17.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8324 on 1116 degrees of freedom
## Multiple R-squared:  0.3996, Adjusted R-squared:  0.3985 
## F-statistic: 371.4 on 2 and 1116 DF,  p-value: < 2.2e-16
### Testing
pred <- predict(multi_lm, newdata = data_split$test)
mse <- mean((pred - data_split$test$V11)^2)
mse
## [1] 0.6514234

We analyze these results:

  1. The P-values are extremely significant.
  2. The Coefficient of V12 is low, but the SE(CoeffV12) is at least 10 times smaller than the coefficient, therefore rejecting the null hypothesis for V11~V12.
  3. The Coefficient of V8 is large in absolute, and the SE(CoeffV8) is small enough to reject the null Hypothesis. For sure the coefficient of V8 is not zero.
  4. The RSE is 0.832, which is lower than in the previous model and an acceptable error compared to the mean of the response.
  5. The R2 is somewhat low, but indicates that about 40% of the variation in the response can be explained by the predictors.
  6. F-Statistic is high enough to show a significant relation between V11~V12,V8.
  7. MSE decreased from previous model to the current model, showing an improvement in the model.

Multiple Linear Regression with Interaction

multi_inter_lm <- lm(V11~V12*V8, data = data_split$train)
summary(multi_inter_lm)
## 
## Call:
## lm(formula = V11 ~ V12 * V8, data = data_split$train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9541 -0.5907 -0.1402  0.5106  4.9944 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -106.86      95.23  -1.122 0.262058    
## V12            60.44      16.08   3.758 0.000180 ***
## V8            114.60      95.55   1.199 0.230622    
## V12:V8        -60.10      16.14  -3.724 0.000206 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8276 on 1115 degrees of freedom
## Multiple R-squared:  0.407,  Adjusted R-squared:  0.4054 
## F-statistic: 255.1 on 3 and 1115 DF,  p-value: < 2.2e-16
pred <- predict(multi_inter_lm, newdata = data_split$test)
mse <- mean((pred-data_split$test$V11)^2)
mse
## [1] 0.6470563
### Extract the Metrics
results_multi_inter <- glance(multi_inter_lm) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value )
results_multi_inter
## # A tibble: 1 x 5
##   adj.r.squared sigma   AIC   BIC   p.value
##           <dbl> <dbl> <dbl> <dbl>     <dbl>
## 1         0.405 0.828 2758. 2783. 5.24e-126
### Average Prediction Error rate:
results_multi_inter$sigma/mean(data$V11)*100
## [1] 7.940152
### Average Error Rate = 7.94%

We can see that V12 and V8 have a significant interaction effect on the response. The test MSE improved only very slightly.

Linear Regression - Polynomial Regression

degree <- seq(1,3)
mse_s <- seq(1,3)

for( i in degree){
  poly_lm <- lm(V11~poly(V1, degree = i, raw = TRUE), data = data_split$train)
  pred <- predict(poly_lm, newdata = data_split$test)
  mse <- mean((pred-data_split$test$V11)^2)
  mse_s[i] <- mse
}
mse_s
## [1] 1.091766 1.036936 1.003532
plot(degree,mse_s, type = "b", col="purple",
      pch = 16, main = "Validation Set",
      ylab = "MSE", xlab = "Degree of Model")

Simple Linear Regression with LOOCV

Leave One Out Cross Validation works by training the model on n-1 samples and testing it on the remaining sample, repeating this for every sample. The model fit by glm() below is still estimated on the full data, so its coefficients are the same as for simple linear regression without resampling; LOOCV only changes how the test error is estimated.

lm_loocv <- glm(V11~V8, data = data)
mse <- cv.glm(data, lm_loocv)$delta[1]
mse
## [1] 0.861597

Simple Linear Regression with CV

k-fold CV works by splitting the data into k parts, training the model on k-1 parts and testing on the remaining part, repeated for each fold. In comparison with LOOCV, each training fold is smaller, which introduces somewhat more bias into the error estimate, but k-fold CV is much cheaper to compute and its estimate typically has lower variance.

lm_cv <- glm(V11~V8, data = data)
#A vector of length two. The first component is the raw cross-validation estimate
#of prediction error. The second component is the adjusted cross-validation estimate.
#The adjustment is designed to compensate for the bias introduced by not using leave-one-out cross-validation.
mse <- cv.glm(data, lm_cv, K=5)$delta[1]
mse
## [1] 0.8610209
### Manually: using Library Caret
shuffled_data <- data[sample(nrow(data)),]
folds <- createFolds(shuffled_data$V11, k=10, returnTrain = TRUE)
cv_error <- rep(1:10)
coeffi <- data.frame()
for(i in rep(1:10)){
  fold <- folds[i]
  training_indices = unlist(fold)
  cv_fit <- glm(V11~V8, data = data, subset = training_indices)
  coeffi <- rbind(coeffi,cv_fit$coefficients)
  pred <- predict(cv_fit, newdata = data[-training_indices,])
  mse <- mean((pred-data$V11[-training_indices])^2)
  cv_error[i] <- mse
}
mse <- mean(cv_error)
mse
## [1] 0.8627044
c(mean(unlist(coeffi[1])),mean(unlist(coeffi[2])))
## [1]  288.5858 -279.0698

blueprint of CV:

library(caret)
K <- 5                                   # number of folds (5, 10, ...)
shuffled_data <- data[sample(nrow(data)),]
folds <- createFolds(shuffled_data$response, k = K, returnTrain = TRUE)
cv_error <- rep(0, K)
for(i in 1:K){
  train <- unlist(folds[i])
  fit <- model_function(formula, shuffled_data, subset = train)  # e.g. glm()
  predictions <- predict(fit, shuffled_data[-train,])
  cv_error[i] <- mean((predictions - shuffled_data$response[-train])^2)
}
mse <- mean(cv_error)
mse

Polynomial Regression with CV and Validation Set

degree <- rep(1:3)
mse_s <- rep(1:3)
cv.error <- rep(1:3)
for(i in degree){
  poly_cv <- glm(V11~poly(V8, i, raw=TRUE), data = data_split$train)
  cv.error[i] <- cv.glm(data_split$train, poly_cv, K=10)$delta[2]
}
cv.error
## [1] 0.8914465 0.7611520 0.7635648
plot(x=degree, y=cv.error, xlab ="Degree of Polynomial", ylab="CV Error as a Function of the Degree of the Model", type = "l", col= "purple")

Linear Regression with Best Subset Selection

Best Subset selection is a feature selection approach which tries every combination of predictors for each model size from 1 to p predictors, and outputs, for each size, the model with the lowest RSS.

fits <- regsubsets(V11~., data = data_split$train, nvmax = length(data), method="exhaustive")
sum_fits <- summary(fits)
sum_fits$outmat
##           V1  V2  V3  V4  V5  V6  V7  V8  V9  V10 V12
## 1  ( 1 )  " " " " " " " " " " " " " " " " " " " " "*"
## 2  ( 1 )  " " " " " " " " " " " " " " "*" " " " " "*"
## 3  ( 1 )  "*" " " " " " " " " " " " " "*" "*" " " " "
## 4  ( 1 )  "*" " " " " "*" " " " " " " "*" "*" " " " "
## 5  ( 1 )  "*" " " " " "*" " " " " " " "*" "*" " " "*"
## 6  ( 1 )  "*" " " " " "*" " " " " " " "*" "*" "*" "*"
## 7  ( 1 )  "*" " " "*" "*" " " " " " " "*" "*" "*" "*"
## 8  ( 1 )  "*" "*" "*" "*" " " " " " " "*" "*" "*" "*"
## 9  ( 1 )  "*" "*" "*" "*" " " "*" " " "*" "*" "*" "*"
## 10  ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 11  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
models_metrics <- rbind(c(sum_fits$adjr2, which.max(sum_fits$adjr2)),c(sum_fits$bic, which.min(sum_fits$bic)), c(sum_fits$cp, which.min(sum_fits$cp)))
row.names(models_metrics) <- c("Adjusted R^2", "BIC", "Cp")
colnames(models_metrics) <- c(rep(1:11),"BestModel")
models_metrics[,12]
## Adjusted R^2          BIC           Cp 
##           10            9           10

In this case the 10-variable model is the best model according to Adjusted R^2 and Cp (BIC prefers the 9-variable model); it uses all features except V7.

coef(fits, id=10)
##   (Intercept)            V1            V2            V3            V4 
##  562.35514513    0.47737991    0.69120031    0.97411387    0.26099143 
##            V5            V6            V8            V9           V10 
##   -0.96880272   -0.00589145 -573.17404145    3.69319631    0.94673535 
##           V12 
##    0.25691417

Subset Selection with Cross Validation K = 5

shuffled_data <- data[sample(nrow(data)),]
folds <- createFolds(shuffled_data$V11, k=5, returnTrain = TRUE)
fit_mses <- data.frame()
for(i in rep(1:5)){
   train <- unlist(folds[i])
   fits <- regsubsets(V11~., data = data[train, ], nvmax = length(data), method = "exhaustive")
   ### For each of the models, calculate the test MSE
   mse_vector <- rep(1:11)
   for(j in rep(1:11)){
     coefis <- coef(fits, id=j)
     test_mat <- model.matrix(V11~., data = data[-train, ])
     pred <- test_mat[,names(coefis)]%*%coefis
     mse <- mean((pred - data[-train, ]$V11)^2)
     mse_vector[j] = mse
   }
   
   fit_mses <- rbind(fit_mses, mse_vector)
}
colnames(fit_mses) <- paste0("Model#",rep(1:11))
rownames(fit_mses) <- paste0("Fold#",rep(1:5))
fit_mses
##          Model#1   Model#2   Model#3   Model#4   Model#5   Model#6   Model#7
## Fold#1 0.8140748 0.6172570 0.5704373 0.4624975 0.4081447 0.3825859 0.3740590
## Fold#2 0.8996169 0.7328626 0.5556584 0.4019149 0.3683903 0.3607508 0.3587359
## Fold#3 0.9843834 0.7818061 0.5972123 0.4516946 0.4050209 0.3793136 0.3847262
## Fold#4 0.8591110 0.6106012 0.5869654 0.4440157 0.3755767 0.3476176 0.3548733
## Fold#5 0.8902292 0.6715887 0.5512878 0.4562417 0.4211041 0.4075759 0.4002592
##          Model#8   Model#9  Model#10  Model#11
## Fold#1 0.3652193 0.3576331 0.3550977 0.3534441
## Fold#2 0.3532529 0.3519224 0.3541863 0.3525643
## Fold#3 0.3793184 0.3829449 0.3864124 0.3843401
## Fold#4 0.3449430 0.3510512 0.3494137 0.3477075
## Fold#5 0.3971714 0.3976137 0.3938571 0.3919891
mse_model <- apply(fit_mses, 2, mean)
names(mse_model) <- paste0("Model#",rep(1:11))
mse_model
##   Model#1   Model#2   Model#3   Model#4   Model#5   Model#6   Model#7   Model#8 
## 0.8894831 0.6828231 0.5723122 0.4432729 0.3956473 0.3755687 0.3745307 0.3679810 
##   Model#9  Model#10  Model#11 
## 0.3682331 0.3677935 0.3660090

E. Penalized Least Squares: Ridge & Lasso

Ridge

\[ Minimize (RSS + \lambda\sum_{j=1}^{p}\beta_j^2) \] >> The latter will shrink the coefficients towards zero, but they will not reach exactly zero (no feature selection).

Lasso

\[ Minimize (RSS + \lambda\sum_{j=1}^{p}|\beta_j|) \]

Lasso can shrink some coefficients exactly to zero and thus performs feature selection.

Comparison: Both methods have lower variance compared to Least Squares, at the cost of a slight increase in the bias.

Lasso vs Ridge: Lasso tends to produce sparser, more interpretable models, and can achieve lower variance at the cost of a slight increase in bias.

Lasso can generally lead to better predictions if only a subset of the predictors strongly affect the response, while Ridge can generally lead to better predictions if many or all the predictors contribute to the variation in the response.

Ridge & Lasso in Practice

### 1. Create the Training Matrix
train_x <- model.matrix(V11~., data = data_split$train)[,-1]
train_y <- data_split$train$V11


### 2. Train the Models
ridge.fit <- cv.glmnet(x=train_x, y=train_y, type.measure = "mse", alpha = 0, family="gaussian", nfolds = 5, standardize = TRUE)
lasso.fit <- cv.glmnet(x=train_x, y=train_y, type.measure = "mse", alpha = 1, family="gaussian", nfolds = 5, standardize = TRUE)

### 3. Test the Models on the best Lambda
test_x <- model.matrix(V11~., data = data_split$test)[,-1]
ridge.pred <- predict(ridge.fit, newx = test_x,s=ridge.fit$lambda.min, type="response")
lasso.pred <- predict(lasso.fit, newx = test_x,s=lasso.fit$lambda.min, type="response")

### 4. Calculate the Test MSE
test_y <- data_split$test$V11
ridge.mse <- mean((ridge.pred - test_y)^2)
lasso.mse <- mean((lasso.pred - test_y)^2)

ridge.mse
## [1] 0.3528857
lasso.mse
## [1] 0.3339964
coef(ridge.fit)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  4.160454e+02
## V1           2.650636e-01
## V2           4.755286e-01
## V3           1.163782e+00
## V4           1.973570e-01
## V5          -1.937148e+00
## V6          -2.463649e-03
## V7          -2.213416e-03
## V8          -4.203386e+02
## V9           2.467182e+00
## V10          7.612801e-01
## V12          3.052834e-01
coef(lasso.fit)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  4.942606e+02
## V1           3.906121e-01
## V2           2.684068e-01
## V3           7.574988e-01
## V4           2.221385e-01
## V5          -6.131896e-01
## V6          -3.026886e-03
## V7          -5.332194e-04
## V8          -5.016082e+02
## V9           3.058848e+00
## V10          6.684729e-01
## V12          2.735268e-01
plot(ridge.fit)

plot(lasso.fit)
title("Variation of  MSE as a function of Lambda and the # of Predictors" , line = 2.6)

F. Least Angle Regression

===============================================

IV. Classification

Logistic Regression

Classification methods estimate the posterior probability of class membership, P(G=g|X=x), which by Bayes' theorem can be written as:

\[ P(G=g|X=x) = \frac{P(X=x|G=g) \times P(G=g)}{P(X=x)} \]
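
For a binary response, logistic regression models this posterior directly through the logistic (sigmoid) function, rather than through the class densities used by discriminant analysis:

\[ P(G=1|X=x) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}} \]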

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data <- read.csv("Diabetes.csv")
head(data)
##   id relwt glufast glutest steady insulin group
## 1  1  0.81      80     356    124      55     3
## 2  3  0.94     105     319    143     105     3
## 3  5  1.00      90     323    240     143     3
## 4  7  0.91     100     350    221     119     3
## 5  9  0.99      97     379    142      98     3
## 6 11  0.90      91     353    221      53     3
corr <- cor(data)
### Drop the first column (id) from the correlation matrix:
### diag(corr) is a vector of 1s, so select = -diag(corr) removes column 1
corr <- subset(corr, select = -diag(corr))

form_corr <- reshape2::melt(corr, na.rm = TRUE, value.name = "corr") 
form_corr <- subset(form_corr, form_corr$Var1 == "group")
form_corr[order(abs(form_corr$corr), decreasing = TRUE),]
##     Var1    Var2        corr
## 42 group   group  1.00000000
## 21 group glutest -0.83925229
## 35 group insulin -0.78435862
## 14 group glufast -0.72989486
## 7  group   relwt -0.23985592
## 28 group  steady  0.09966494
### Observe the Distribution over the Groups
data$group <- as.factor(data$group)
id_group <- reshape2::melt(data = data, id.vars = c("id", "group"))
head(id_group)
##   id group variable value
## 1  1     3    relwt  0.81
## 2  3     3    relwt  0.94
## 3  5     3    relwt  1.00
## 4  7     3    relwt  0.91
## 5  9     3    relwt  0.99
## 6 11     3    relwt  0.90
ggplot(data = melt(data = data, id.vars = c("id", "group")),
       mapping = aes(x = value, col = group)) +
       geom_density() + facet_wrap(~variable, ncol = 2, scales = "free")

ggpairs(data=data, aes(col = group))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.