Linear regression is a supervised modeling technique for continuous data that generates a response based on a set of input features. It is used to explain the linear relationship between a single variable Y, called the response (output or dependent variable), and one or more predictors (input, independent or explanatory variables).
It is a simple regression problem if only a single variable X is considered; it becomes a multiple regression problem if more than one predictor is used in the model.
In the current document, attention is focused only on simple linear regression, that is, the case of one response variable and a single predictor variable. The basic equation is the following:
\[
\begin{equation}
Y_i = \beta_0 + \beta_1 X_i + \epsilon_i
\end{equation}
\]
In the equation, \(\beta_0\), \(\beta_1\) and \(\epsilon_i\) represent the intercept, the slope and the error term, respectively.
The assumptions for linear regression are:
Linearity: the relationship between X and Y is linear;
Homoscedasticity: the variance of residuals is the same for any given value of X;
Independence: all observations are independent of each other;
Normality: Y is normally distributed for any given value of X.
Note: an error is introduced if the basic assumption about the linearity of the model does not correspond to reality (bias towards linearity). Ideally, the error term should be random and not depend on any input; it can be considered the part of Y that the regression model is unable to explain.
For the current analysis, the AIS dataset from the CRAN package DAAG has been used: it is a data frame with 202 observations and 13 variables. It comes from a study of a group of Australian athletes that investigated how some blood characteristics vary with sport, body size and sex.
Here is an overview of the dataset (i.e. a look at the first 10 rows):
library(DAAG) # it contains the AIS dataset and CVlm (Cross-Validation for Linear Regression)
head(ais, n=10)
## rcc wcc hc hg ferr bmi ssf pcBfat lbm ht wt sex sport
## 1 3.96 7.5 37.5 12.3 60 20.56 109.1 19.75 63.32 195.9 78.9 f B_Ball
## 2 4.41 8.3 38.2 12.7 68 20.67 102.8 21.30 58.55 189.7 74.4 f B_Ball
## 3 4.14 5.0 36.4 11.6 21 21.86 104.6 19.88 55.36 177.8 69.1 f B_Ball
## 4 4.11 5.3 37.3 12.6 69 21.88 126.4 23.66 57.18 185.0 74.9 f B_Ball
## 5 4.45 6.8 41.5 14.0 29 18.96 80.3 17.64 53.20 184.6 64.6 f B_Ball
## 6 4.10 4.4 37.4 12.5 42 21.04 75.2 15.58 53.77 174.0 63.7 f B_Ball
## 7 4.31 5.3 39.6 12.8 73 21.69 87.2 19.99 60.17 186.2 75.2 f B_Ball
## 8 4.42 5.7 39.9 13.2 44 20.62 97.9 22.43 48.33 173.8 62.3 f B_Ball
## 9 4.30 8.9 41.1 13.5 41 22.64 75.1 17.95 54.57 171.4 66.5 f B_Ball
## 10 4.51 4.4 41.6 12.7 44 19.44 65.1 15.07 53.42 179.9 62.9 f B_Ball
Then, some packages that will be needed during the analysis are loaded.
library(e1071) # it includes a function to compute skewness
library(plyr) # it provides tools to wrangle data
library(ggplot2) # it creates a number of different types of plots
It is now possible to subset the original dataset and select variables and observations for the simple regression analysis:
Only one sex is considered (i.e. male, in order to have a more homogeneous group);
No distinction between practised sports is made (assuming that there are no substantial differences in blood characteristics and body size between sports);
Focusing only on blood characteristics (leaving out body size), the goal is to build a simple regression model that can be used to predict hematocrit (hc) by establishing a statistically significant linear relationship with hemaglobin (hg).
ais2 <- subset(ais, sex=="m") # only male athletes
ais3 <- ais2[, c(3, 4)] # keep the columns that correspond to "hc" and "hg"
newdata <- rename(ais3, c("hg"="HEMAGLOBIN", "hc"="HEMATOCRIT")) # rename variables
str(newdata) # new dataset now includes 102 observations
## 'data.frame': 102 obs. of 2 variables:
## $ HEMATOCRIT: num 46.8 45.2 46.6 44.9 46.1 45.1 47.5 45.5 48.6 44.9 ...
## $ HEMAGLOBIN: num 15.9 15.2 15.9 15 15.6 15.2 16.3 15.2 16.5 15.4 ...
Before proceeding, it is important to check if there are any missing values in the dataset:
colSums(is.na(newdata)) # report how many missing values are per column
## HEMATOCRIT HEMAGLOBIN
## 0 0
Neither column shows any NA values.
Note: Generally, if many values are missing from the dataset, the effective sample may turn out to be too small to yield significant findings.
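If missing values were present, one minimal way to drop the incomplete rows (not needed here, since no NA values were found) would be:
newdata <- na.omit(newdata) # remove any rows containing NA values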
Here is the univariate summary information of the new dataset:
summary(newdata) # overview of the two selected variables
## HEMATOCRIT HEMAGLOBIN
## Min. :40.30 Min. :13.50
## 1st Qu.:44.23 1st Qu.:14.93
## Median :45.50 Median :15.50
## Mean :45.65 Mean :15.55
## 3rd Qu.:46.80 3rd Qu.:15.90
## Max. :59.70 Max. :19.20
The boxplots below display the data by quartiles and help detect any outliers:
par(mfrow=c(1, 2)) # it divides graph area in two parts
boxplot(newdata$HEMAGLOBIN, col = "yellow", border="blue",
main = "HEMAGLOBIN boxplot",
ylab = "g per decaliter")
boxplot(newdata$HEMATOCRIT, col = "orange", border="blue",
main = "HEMATROCRIT boxplot",
ylab = "percent values")
Here are the values of the outliers detected:
boxplot.stats(newdata$HEMAGLOBIN)$out # HEMAGLOBIN outliers
## [1] 18.0 19.2 18.5 17.7
boxplot.stats(newdata$HEMATOCRIT)$out #HEMATOCRIT outliers
## [1] 40.3 59.7
Note: Having outliers can affect the direction/slope of the line of best fit.
The histograms below display the probability distribution of the numerical data:
# Histogram of HEMAGLOBIN
qplot(HEMAGLOBIN, data = newdata, geom="histogram", binwidth=0.5,
fill=I("azure4"), col=I("azure3")) +
labs(title = "HEMAGLOBIN") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x ="Concentration (in g per decaliter)") +
labs(y = "Frequency") +
scale_y_continuous(breaks = c(0,5,10,15,20,25,30,35,40,45,50), minor_breaks = NULL) +
scale_x_continuous(breaks = c(10:25), minor_breaks = NULL) +
geom_vline(xintercept = mean(newdata$HEMAGLOBIN), color = "red") + # red line = mean
geom_vline(xintercept = median(newdata$HEMAGLOBIN), color = "blue") # blue line = median
# Histogram of HEMATOCRIT
qplot(HEMATOCRIT, data = newdata, geom="histogram", binwidth=1,
fill=I("azure4"), col=I("azure3")) +
labs(title = "HEMATOCRIT") +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x ="percent values") +
labs(y = "Frequency") +
scale_y_continuous(breaks = c(0,5,10,15,20,25), minor_breaks = NULL) +
scale_x_continuous(breaks = c(30:65), minor_breaks = NULL) +
geom_vline(xintercept = mean(newdata$HEMATOCRIT), color = "red") + # red line = mean
geom_vline(xintercept = median(newdata$HEMATOCRIT), color = "blue") # blue line = median
Note: The blue vertical line shows the median value and the red line the mean.
The density plots below check the shape of the distribution:
par(mfrow=c(1, 2)) # it divides graph area in two parts
plot(density(newdata$HEMAGLOBIN), main="Density: HEMAGLOBIN", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(newdata$HEMAGLOBIN), 2)))
polygon(density(newdata$HEMAGLOBIN), col="yellow")
plot(density(newdata$HEMATOCRIT), main="Density: HEMATOCRIT", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(newdata$HEMATOCRIT), 2)))
polygon(density(newdata$HEMATOCRIT), col="orange")
Note: ideally, the distribution should be close to normal, and not skewed (to the left or to the right).
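As an optional complement to the visual inspection (not part of the original analysis), a formal check such as the Shapiro-Wilk normality test can be run on each variable:
shapiro.test(newdata$HEMAGLOBIN) # Shapiro-Wilk normality test for HEMAGLOBIN
shapiro.test(newdata$HEMATOCRIT) # Shapiro-Wilk normality test for HEMATOCRIT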
The scatterplot below suggests whether there is any linear relationship between the dependent (response) variable and the independent (predictor) variable:
qplot(HEMAGLOBIN, HEMATOCRIT, data = newdata,
main = "HEMAGLOBIN and HEMATOCRIT relationship") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_point(colour = "blue", size = 1.5) +
scale_y_continuous(breaks = c(30:65), minor_breaks = NULL) +
scale_x_continuous(breaks = c(10:25), minor_breaks = NULL)
The plot shows a fairly strong positive relationship between hemaglobin level and hematocrit percentage (i.e. higher hemaglobin values correspond to higher hematocrit values).
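A quick way to quantify this visual impression is the sample correlation coefficient between the two variables (its square will reappear later as R-squared):
cor(newdata$HEMAGLOBIN, newdata$HEMATOCRIT) # correlation between HEMAGLOBIN and HEMATOCRIT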
The objective now is to build a linear model and see how well it fits the observed data. In its simplest form, the equation to solve is the following: \[ \begin{equation} Hematocrit = \beta_0 + \beta_1 Hemaglobin \end{equation} \]
Here, the intercept is the expected hematocrit value when hemaglobin is zero, and the slope is the increase in hematocrit associated with a one-unit increase in hemaglobin.
# Show the relationship creating a regression line
qplot(HEMAGLOBIN, HEMATOCRIT, data = newdata,
main = "HEMAGLOBIN and HEMATOCRIT relationship") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth(method="lm", col="red", size=1) +
geom_point(colour = "blue", size = 1.5) +
scale_y_continuous(breaks = c(30:65), minor_breaks = NULL) +
scale_x_continuous(breaks = c(10:25), minor_breaks = NULL)
Note: Ideally, the regression line should be as close as possible to all data points observed. Smoothing is set to a confidence level of 0.95 (by default).
An additional and interesting possibility is to create a new variable named “HEMAGLOBIN_CENT”, which centers the values of HEMAGLOBIN on their mean: this gives the intercept estimate a meaningful interpretation (the average HEMAGLOBIN level now sits at 0.0 on the X-axis).
set.seed(123) # setting seed to reproduce results of random sampling
HEMAGLOBIN_CENT = scale(newdata$HEMAGLOBIN, center=TRUE, scale=FALSE) # center the variable
# Show the relationship with new variable centered, creating a regression line
qplot(HEMAGLOBIN_CENT, HEMATOCRIT, data = newdata,
main = "HEMAGLOBIN_CENT and HEMATOCRIT relationship") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_smooth(method="lm", col="red", size=1) +
geom_point(colour = "blue", size = 1.5) +
scale_y_continuous(breaks = c(30:65), minor_breaks = NULL) +
scale_x_continuous(breaks = c(-2,-1.5,-1,-0.5,0,0.5,1,1.5,2,2.5,3,3.5,4), minor_breaks = NULL)
Summary statistics are very useful to interpret the key components of the linear model output:
mod1 = lm(HEMATOCRIT ~ HEMAGLOBIN_CENT, data = newdata)
summary(mod1)
##
## Call:
## lm(formula = HEMATOCRIT ~ HEMAGLOBIN_CENT, data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4183 -0.7043 -0.0072 0.6049 5.0765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.6500 0.1140 400.35 <2e-16 ***
## HEMAGLOBIN_CENT 2.4605 0.1227 20.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.152 on 100 degrees of freedom
## Multiple R-squared: 0.801, Adjusted R-squared: 0.799
## F-statistic: 402.4 on 1 and 100 DF, p-value: < 2.2e-16
Residuals show whether the fitted response values are close to the actual observed response values. In the current case, the distribution of residuals is quite symmetrical (further analysis of residuals is discussed later in the document).
The first row is the intercept: in the current case it represents the expected HEMATOCRIT value of 45.65 percent when HEMAGLOBIN is at its average value across the dataset.
The second row is the slope term: in the current case, every HEMAGLOBIN increase of 1 g per decaliter is associated with an increase in HEMATOCRIT of 2.46 percentage points.
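For illustration, the fitted coefficients can be combined by hand to obtain the expected HEMATOCRIT for a hypothetical athlete whose HEMAGLOBIN is one unit (1 g per decaliter) above the sample average:
coef(mod1)["(Intercept)"] + 1 * coef(mod1)["HEMAGLOBIN_CENT"] # expected HEMATOCRIT at +1 unit of HEMAGLOBIN above the mean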
The standard error measures how much the coefficient estimates would vary if the model were fit repeatedly on new samples of the response variable. In the current case, it says that the estimated effect of HEMAGLOBIN on HEMATOCRIT can vary by about 0.12 percentage points.
Note: ideally, a low standard error is the best situation in regression analysis. The standard error is also used to build confidence intervals.
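For example, 95% confidence intervals for the intercept and the slope can be obtained directly from the fitted model:
confint(mod1, level = 0.95) # confidence intervals for the model coefficients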
The test of significance of the model shows that there is strong evidence of a linear relationship between the variables. This is visually indicated by the significance stars *** at the end of the row; these significance levels are thresholds used to distinguish real findings from those due to chance alone.
For each estimated regression coefficient, the p-value Pr(>|t|) estimates the probability of observing an estimate at least as extreme as the one obtained if the true coefficient were zero. The more stars next to the p-value, the more significant the variable.
Each p-value has a hypothesis test associated with it. In linear regression, the null hypothesis is that the coefficient associated with the variable is equal to zero; the alternative hypothesis is that the coefficient is not equal to zero, i.e. that a relationship exists between the independent variable and the dependent variable.
So, if the p-values are less than the significance level (typically, a p-value < 0.05 is a good cut-off point), the null hypothesis can be safely rejected. In the current case, the p-values are well below the 0.05 threshold, so the model is indeed statistically significant.
The t-statistic is used in a t-test to decide whether to support or reject the null hypothesis.
Note: the t-statistic is the estimated value of the parameter (coefficient/slope) divided by its standard error.
This statistic is therefore a measure of the evidence that the actual value of the parameter is not zero: a larger t-value indicates that the observed estimate is less likely to have arisen purely by chance if the true coefficient were zero.
modSummary <- summary(mod1) # capture model summary as an object
modCoeff <- modSummary$coefficients # model coefficients
beta.estimate <- modCoeff["HEMAGLOBIN_CENT", "Estimate"] # get beta coefficient estimate
std.error <- modCoeff["HEMAGLOBIN_CENT", "Std. Error"] # get standard error
t_value <- beta.estimate/std.error # calculate t statistic
print(t_value) # print t-value
## [1] 20.0601
Note: if the p-value is lower than the significance level (< 0.05), the t-statistic should be greater than about 1.96 in absolute value.
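As a cross-check, the two-sided p-value can be recovered from the t-value computed above and the residual degrees of freedom:
2 * pt(abs(t_value), df = df.residual(mod1), lower.tail = FALSE) # two-sided p-value for the slope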
The residual standard error is the average amount by which the response deviates from the regression line (due to the presence of the error term), so it can be read as a measure of goodness of fit. In the current case, given that the expected HEMATOCRIT value is 45.65 and the residual standard error is 1.15, the percentage error of the prediction is only about 2.5%.
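The same percentage error can be reproduced directly from the fitted model:
sigma(mod1) / mean(newdata$HEMATOCRIT) # residual standard error relative to the average response (about 2.5%)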
Note: the degrees of freedom correspond to the number of observations (the number of independent pieces of information that went into calculating the estimate) minus the number of parameters estimated (intercept and slope).
For simple linear regression, R-squared is the square of the correlation between the two variables. Its value ranges between 0 and 1: a value close to 0 means that the regression model does not explain the variance in the response variable, while a value close to 1 means that the observed variance in the response variable is well explained. In the current case, R-squared suggests that the linear model fit explains about 80.1% of the variance observed in the data.
A high value of R-squared does not necessarily indicate that a regression model provides an adequate fit to the data. A good model could show a low R-squared value, while, on the other hand, a biased model could have a high R-squared value.
Note: the R-squared value tends to increase as more variables are included in the model. So, adjusted R-squared is the preferred measure, as it adjusts for the number of variables considered (this matters for multiple regression, which is out of the scope of the current document).
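As a quick verification of the statement above, the squared correlation between the two variables can be compared with the R-squared reported by the model:
cor(newdata$HEMAGLOBIN, newdata$HEMATOCRIT)^2 # squared correlation between the two variables
summary(mod1)$r.squared # R-squared reported by the linear model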
Basically, the F-test compares the model against a model with zero predictor variables (the intercept-only model), and suggests whether the added coefficients improve the model. If a significant result is obtained, then the coefficients included in the model (in the current case, being a simple regression, only one predictor is entered) improve the model’s fit.
So, the F statistic measures the collective effect of all predictor variables on the response variable. In this model, F = 402.4 is far greater than 1:
f_statistic <- summary(mod1)$fstatistic[1] # extract the F statistic value from the model summary
f <- summary(mod1)$fstatistic # F value and degrees of freedom, used for the model p-value calculation
print(f) # print F value
## value numdf dendf
## 402.4075 1.0000 100.0000
Note: ideally, the greater the F statistic (well above 1), the better the fit (so this statistic can also be read as a measure of goodness of fit).
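The model p-value reported in the summary can also be recovered from the F statistic and its degrees of freedom stored in f:
pf(f[1], f[2], f[3], lower.tail = FALSE) # model p-value from the F statistic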
Diagnostic plots are used to evaluate the model assumptions and to understand whether there are observations that strongly influence the analysis. The goal, as a consequence, is to take the proper actions to improve the model fit.
plot(mod1, pch=16, col="blue", lty=1, lwd=2, which=1)
The residuals of the simple linear regression model are the differences between the observed values of the dependent variable and the fitted values.
The plot is useful for checking the assumptions of linearity and homoscedasticity. To assess the assumption of linearity, residuals should not be too far from 0 (ideally, standardized values should be in the range of -2 and +2). To assess the assumption of homoscedasticity, residuals should be randomly and equally distributed around the horizontal red line (which is just a scatterplot smoother, showing the average value of the residuals at each fitted value) representing a residual error of zero.
In the current case, the red trend line is almost at zero except towards the right side, due to the presence of outliers. Some values, in particular observations “159”, “166” and “169”, are also outside the range between -2 and +2.
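These observations can also be identified programmatically, using the standardized residuals:
which(abs(rstandard(mod1)) > 2) # observations with standardized residuals outside the range -2 / +2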
plot(mod1, pch=16, col="blue", lty=1, lwd=2, which=2)
The normal Q-Q (quantile-quantile) plot is a scatterplot that allows one to see whether a set of data plausibly comes from a normal distribution.
It is created by plotting two sets of quantiles against one another: graphically, the points will fall along a straight line if both sets of quantiles come from the same distribution.
In the current case, the points form a line in the middle of the graph but tend to deviate from the diagonal in both the upper and lower extremities. Behaviour like this means that the tails are lighter (have smaller values) than would be expected under the standard modeling assumptions (of the normal distribution). Again, the observations that stand out in the tails are “159”, “166” and “169”.
plot(mod1, pch=16, col="blue", lty=1, lwd=2, which=3)
The scale-location plot shows the square root of the standardized residuals (a sort of square root of relative error) as a function of the fitted values. It is useful to see how the residuals are spread and to check the assumption of homoscedasticity (that is, whether the residuals have equal variance or not).
In the current case, the red trend line is almost horizontal except towards the right side. Again, observations “159”, “166” and “169” lie beyond the level of +1.5.
plot(mod1, pch=16, col="blue", lty=1, lwd=2, which=5)
The plot shows each point’s leverage, which is a measure of its importance in determining the regression result. Any point far beyond the dashed line can be regarded as an influential point.
Cook’s distance is a measure of the influence of each observation on the regression coefficients.
plot(mod1, pch=16, col="blue", lty=1, lwd=2, which=4)
Distances larger than 1 suggest the presence of a possible outlier. In the current case, the large Cook’s distance (higher than 2) of observation “166” is evident.
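The Cook’s distances used in the plot can also be inspected directly, for instance by listing the largest ones:
head(sort(cooks.distance(mod1), decreasing = TRUE), n = 3) # three largest Cook's distances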
The model summary as well as the diagnostic plots have given important information that can be used to improve the model fit. Alongside mod1, it is possible to explore a second model, mod2, that omits the outliers noticed above.
Note: of course, different models could be considered, e.g. including a quadratic term or adding one or more variables (or considering a transformation of the variables), but this is out of the scope of the current document (it becomes a multiple regression problem).
As noticed in all the main diagnostic plots, the observations “159”, “166”, “169” can be considered as outliers. They can also be shown graphically:
newdata1 <- setNames(cbind(rownames(newdata), newdata, row.names = NULL),
c("OBS", "HEMAGLOBIN", "HEMATOCRIT"))
newdata1$OUTLIER = ifelse(newdata1$OBS %in% c(159,166,169),"Y","N") # create condition Yes/No if outlier
qplot(HEMATOCRIT, HEMAGLOBIN, data = newdata1, colour = OUTLIER,
main = "HEMAGLOBIN and HEMATOCRIT relationship") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(breaks = c(30:65), minor_breaks = NULL) +
scale_x_continuous(breaks = c(10:25), minor_breaks = NULL)
Blue points represent the three outliers identified.
So a new dataset can be created excluding them:
newdata2 <- subset(newdata1, OBS != 159 & OBS != 166 & OBS != 169,
select=c(HEMAGLOBIN, HEMATOCRIT))
HEMAGLOBIN_CENT = scale(newdata2$HEMAGLOBIN, center=TRUE, scale=FALSE) # center the variable
A new model is then fitted, with the following results:
mod2 = lm(HEMATOCRIT ~ HEMAGLOBIN_CENT, data = newdata2)
summary(mod2)
##
## Call:
## lm(formula = HEMATOCRIT ~ HEMAGLOBIN_CENT, data = newdata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83224 -0.20845 -0.00573 0.21535 1.19873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.51212 0.03656 424.26 <2e-16 ***
## HEMAGLOBIN_CENT 0.35783 0.01687 21.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3638 on 97 degrees of freedom
## Multiple R-squared: 0.8227, Adjusted R-squared: 0.8209
## F-statistic: 450.1 on 1 and 97 DF, p-value: < 2.2e-16
After removing the outliers, the model fits better than the previous one: the F statistic and R-squared values of mod2 are higher than those of mod1. F is now 450.1 (vs 402.4), while R-squared is 82.3% (vs 80.1%).
Diagnostic plots are summarized in the graph below:
par(mfrow = c(2,2)) # display a unique layout for all graphs
plot(mod2)
Now, graphs of “residual vs fitted” and “scale-location” show a red trend line closer to the horizontal line. Q-Q plot still shows some tails, but they seem to be more acceptable.
Another way to support model selection is to calculate the AIC (Akaike’s information criterion) and BIC (Bayesian information criterion) metrics. They are measures of the goodness of fit of an estimated statistical model (both criteria depend on the maximized value of the likelihood function for the estimated model).
AIC(mod1)
## [1] 322.2403
AIC(mod2)
## [1] 84.71671
BIC(mod1)
## [1] 330.1153
BIC(mod2)
## [1] 92.50206
AIC and BIC values of mod2 are lower than those of mod1. Generally, small values correspond to models with a low test error, so this confirms that mod2 fits better than mod1.
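For reference, the AIC value can be reproduced by hand from the maximized log-likelihood, where the number of estimated parameters also counts the residual variance:
ll <- logLik(mod1) # maximized log-likelihood of mod1
-2 * as.numeric(ll) + 2 * attr(ll, "df") # AIC = -2*logLik + 2*(number of estimated parameters)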
Once the model has been improved, it is possible to run predictive analytics, the real goal of the regression analysis. The dataset can be split into a training (development) and a testing (validation) dataset; the test set will be used to evaluate the model by comparing the predicted response with the actual response values.
set.seed(123) # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(newdata2), 0.7*nrow(newdata2)) # training and testing: 70/30 split
trainingData <- newdata2[trainingRowIndex, ] # training data
testData <- newdata2[-trainingRowIndex, ] # test data
Now it’s possible to develop the model on the training data and use it to predict HEMATOCRIT on test data.
modTrain <- lm(HEMATOCRIT ~ HEMAGLOBIN, data=trainingData) # build the model
pred <- predict(modTrain, testData) # predicted values (stored as 'pred' rather than 'predict', to avoid reusing the function name)
summary(modTrain)
##
## Call:
## lm(formula = HEMATOCRIT ~ HEMAGLOBIN, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75636 -0.22814 0.00008 0.22879 1.11346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.86335 0.94719 -1.967 0.0533 .
## HEMAGLOBIN 0.38119 0.02067 18.439 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.381 on 67 degrees of freedom
## Multiple R-squared: 0.8354, Adjusted R-squared: 0.8329
## F-statistic: 340 on 1 and 67 DF, p-value: < 2.2e-16
The statistics in the summary show that the model is significant. The R-squared value calculated on the training data is comparable to that of the original model built on the full data.
Now, taking into consideration the test data, the correlation between actual values and predicted values can be used as a form of accuracy measure.
act_pred <- data.frame(cbind(actuals=testData$HEMATOCRIT, predicteds=pred)) # combine actual and predicted values
cor(act_pred) # correlation_accuracy
## actuals predicteds
## actuals 1.0000000 0.8963783
## predicteds 0.8963783 1.0000000
The correlation shows a high value of 89.6%, which means that the actual values and the predicted values have similar directional movement.
Here is an overview of the first 10 rows of the new data frame composed of actual values and predicted values:
head(act_pred, n=10)
## actuals predicteds
## 1 15.9 15.97616
## 7 16.3 16.24299
## 12 15.3 15.36626
## 17 15.5 15.44250
## 18 14.7 14.52765
## 19 15.4 15.09943
## 23 15.8 15.74744
## 28 15.6 15.48061
## 30 15.6 15.06131
## 33 16.4 16.39546
Actual values and predicted ones seem very close to each other. A good metric to quantify how close they are is the min-max accuracy, which for each observation takes the ratio between the smaller and the larger of the actual and predicted values, and then averages these ratios.
min_max <- mean(apply(act_pred, 1, min) / apply(act_pred, 1, max))
print(min_max) # show the result
## [1] 0.9829277
The result of 0.98 is a high value, meaning that the accuracy is very good.
Note: ideally, a result of 1 corresponds to a perfect prediction.
The mean absolute percentage error (MAPE) is computed as follows:
mape <- mean(abs((act_pred$predicteds - act_pred$actuals))/act_pred$actuals)
print(mape) # show the result
## [1] 0.01720556
The result of about 0.02 is very low, and it indicates a very good prediction accuracy.
Note: MAPE can only be computed with respect to data that are guaranteed to be strictly positive.
Resampling is based on drawing different samples of observations from a training set and repeatedly refitting a model on each sample. The goal is to estimate the variability and the robustness of a linear regression fit. Any differences across fits provide additional information that would not be available by using the original training sample only once.
In the current case, it has been seen that the model predicts well on the 30% split (test data) of the dataset, and now it has to be ensured that it also fits consistently well on different subsets. There are many resampling methods; in the current document, k-fold cross-validation will be used.
The k-fold method randomly divides the set of observations into k portions of approximately equal size. Keeping each portion in turn as test data, the model is re-fit on the remaining (k-1) portions and used to predict the held-out observations.
Then, the test error is estimated by computing the average of the mean squared errors over the k portions. In the current case, k = 5 has been used (usually 3, 5, 7 or 10 folds are used):
kfold <- CVlm(data = newdata2, form.lm = formula(HEMATOCRIT ~ HEMAGLOBIN), m=5,
dots = FALSE, seed=123, legend.pos="topleft",
main="Cross Validation; k=5",
plotit=TRUE, printit=FALSE)
The graph shows that the lines do not vary much in slope and level, and the points are not over-dispersed for any one particular colour, that is, for any one particular sample.
The mean squared error measures how close the regression line is to the set of points:
attr(kfold, 'ms')
## [1] 0.1445367
The value of 0.14 is low, which represents a good accuracy result.
Ideally, the smaller the mean squared error, the closer the line of best fit.
An Introduction to Statistical Learning with Applications in R, James G., Witten D., Hastie T., Tibshirani R. (Springer, 2013)
Package ‘DAAG’, Maindonald J.H., Braun W.J. (cran.r-project.org, September 2015)