Simple linear regression

TP2

Example 1

  1. We are going to use the Boston dataset, which is part of the MASS package. First install the MASS package and load it.
  2. # load MASS package
    library(MASS)
    
    # Check the dimensions of the Boston dataset
    dim(Boston)
    ## [1] 506  14
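    To see what the 14 columns measure, you can open the dataset's help page (the MASS documentation describes, among others, lstat as the percentage of lower-status population and medv as the median house value in $1000s):
    
    # Open the documentation of the Boston dataset (MASS must be loaded)
    ?Boston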
  3. Split the dataset into train and test subsets using only two variables: lstat and medv.
  4. # Split the data by using the first 400 observations as the training
    # data and the remaining as the testing data
    train = 1:400
    test = -train
    
    # Specify that we are going to use only two variables (lstat and medv)
    variables = c("lstat", "medv")
    training_data = Boston[train, variables]
    testing_data = Boston[test, variables]
    
    # Check the dimensions of the new dataset
    dim(training_data)
    ## [1] 400   2
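    Taking the first 400 rows is a simple deterministic split. If the rows happen to be ordered in some way, a random split is usually safer; a minimal sketch using base R's set.seed() and sample():
    
    # Alternative: a reproducible random split of the 506 rows
    set.seed(1)
    train = sample(1:nrow(Boston), 400)
    test = -train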
  5. Check for linearity between lstat and medv features.
  6. plot(training_data$lstat, training_data$medv)

  7. According to the plot, the relationship is not linear. Try transforming the explanatory variable lstat with the log function.
  8. lstatlog = log(training_data$lstat)
    plot(lstatlog, training_data$medv)

  9. Run the linear regression model using the log transformation.
  10. model = lm(medv ~ log(lstat), data = training_data)
    model
    ## 
    ## Call:
    ## lm(formula = medv ~ log(lstat), data = training_data)
    ## 
    ## Coefficients:
    ## (Intercept)   log(lstat)  
    ##       51.78       -12.20
  11. Plot the obtained regression model.
  12. plot(log(training_data$lstat), training_data$medv,
    xlab = "Log Transform of % of Houshold with Low Socioeconomic Income",
    ylab = "Median House Value",
    col = "red",
    pch = 20)
    
    
    abline(model, col = "blue", lwd = 3)

  13. Predict the median value of a house with lstat = 5%.
  14. predict(model, data.frame(lstat = c(5)))
    ##        1 
    ## 32.14277
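    As a sanity check, the fitted equation evaluated by hand gives almost the same number (it differs slightly because the coefficients printed above are rounded):
    
    # medv_hat = 51.78 - 12.20 * log(lstat), evaluated at lstat = 5
    51.78 - 12.20 * log(5)
    ## [1] 32.14486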
  15. Predict the median values of houses with lstat = 5%, 10%, and 15%.
  16. predict(model, data.frame(lstat = c(5,10,15)))
    ##        1        2        3 
    ## 32.14277 23.68432 18.73644
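    predict() can also quantify the uncertainty around these point predictions through its interval argument; a minimal sketch:
    
    # 95% prediction intervals for the same three lstat values
    predict(model, data.frame(lstat = c(5, 10, 15)), interval = "prediction")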
  17. Compute the mean squared error (MSE) using the test data.
  18. prediction = predict(model, newdata = testing_data)
    mse = mean((testing_data$medv - prediction)^2)
    mse
    ## [1] 17.69552
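    Taking the square root of the MSE (the RMSE) puts the error back on the scale of medv, i.e. thousands of dollars, which is easier to interpret:
    
    # RMSE: typical prediction error in the units of medv ($1000s)
    sqrt(mse)   # about 4.21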

Example 2

  1. Load the required packages (install them first if needed):
    • datarium: provides the marketing data set used in this example.
    • ggpubr: easily creates publication-ready plots.
  2. library(datarium)
    library(ggpubr)
    ## Le chargement a nécessité le package : ggplot2
  3. Load and inspect the marketing data.
  4. data("marketing", package = "datarium")
    head(marketing, 4)
    ##   youtube facebook newspaper sales
    ## 1  276.12    45.36     83.04 26.52
    ## 2   53.40    47.16     54.12 12.48
    ## 3   20.64    55.08     83.16 11.16
    ## 4  181.80    49.56     70.20 22.20
    str(marketing)
    ## 'data.frame':    200 obs. of  4 variables:
    ##  $ youtube  : num  276.1 53.4 20.6 181.8 217 ...
    ##  $ facebook : num  45.4 47.2 55.1 49.6 13 ...
    ##  $ newspaper: num  83 54.1 83.2 70.2 70.1 ...
    ##  $ sales    : num  26.5 12.5 11.2 22.2 15.5 ...
    summary(marketing)
    ##     youtube          facebook       newspaper          sales      
    ##  Min.   :  0.84   Min.   : 0.00   Min.   :  0.36   Min.   : 1.92  
    ##  1st Qu.: 89.25   1st Qu.:11.97   1st Qu.: 15.30   1st Qu.:12.45  
    ##  Median :179.70   Median :27.48   Median : 30.90   Median :15.48  
    ##  Mean   :176.45   Mean   :27.92   Mean   : 36.66   Mean   :16.83  
    ##  3rd Qu.:262.59   3rd Qu.:43.83   3rd Qu.: 54.12   3rd Qu.:20.88  
    ##  Max.   :355.68   Max.   :59.52   Max.   :136.80   Max.   :32.40
    data("marketing")
    head(marketing)
    ##   youtube facebook newspaper sales
    ## 1  276.12    45.36     83.04 26.52
    ## 2   53.40    47.16     54.12 12.48
    ## 3   20.64    55.08     83.16 11.16
    ## 4  181.80    49.56     70.20 22.20
    ## 5  216.96    12.96     70.08 15.48
    ## 6   10.44    58.68     90.00  8.64
  5. We want to predict future sales on the basis of advertising budget spent on youtube. Create a scatter plot displaying the sales units versus youtube advertising budget and add a smoothed line.
  6. # Scatter plot of sales versus youtube with a loess smoothed line
    scatter.smooth(marketing$youtube, marketing$sales,
                   xlab = "youtube", ylab = "sales")

    The graph above suggests a linearly increasing relationship between the sales and youtube variables. This is good news, because one important assumption of linear regression is that the relationship between the outcome and the predictor is linear.

    It’s also possible to compute the correlation coefficient between the two variables using the R function cor().
  7. Compute the correlation coefficient between sales and youtube features.
  8. cor(marketing$youtube, marketing$sales)
    ## [1] 0.7822244

    A correlation value close to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably indicates that much of the variation in the outcome variable (y) is not explained by the predictor (x). In such a case, we should probably look for better predictor variables.

    In our example, the correlation coefficient is large enough, so we can continue by building a linear model of y as a function of x.
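    To also check that this correlation is significantly different from zero, base R's cor.test() reports a p-value and a confidence interval for the correlation; a minimal sketch:
    
    # Test H0: the correlation between youtube and sales is zero
    cor.test(marketing$youtube, marketing$sales)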
  9. The simple linear regression tries to find the best line to predict sales on the basis of the youtube advertising budget. The linear model equation can be written as follows: sales = b0 + b1 * youtube. Use the R function lm() to determine the beta coefficients of the linear model.
  10. # Note: lm(sales ~ youtube, data = marketing) is the more idiomatic
    # form; it also lets predict() accept new data frames later on
    model = lm(marketing$sales ~ marketing$youtube)
  11. Display the summary table and give an interpretation of the obtained results.
  12. summary(model)
    ## 
    ## Call:
    ## lm(formula = marketing$sales ~ marketing$youtube)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -10.0632  -2.3454  -0.2295   2.4805   8.6548 
    ## 
    ## Coefficients:
    ##                   Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)       8.439112   0.549412   15.36   <2e-16 ***
    ## marketing$youtube 0.047537   0.002691   17.67   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 3.91 on 198 degrees of freedom
    ## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
    ## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
    # We can see that the linear regression gives a good model: sales = 8.44 + 0.048*youtube
  13. Add the regression line into the scatter plot, you can use the function stat_smooth() [ggplot2]. By default, the fitted line is presented with confidence interval around it. The confidence bands reflect the uncertainty about the line. If you don’t want to display it, specify the option se = FALSE in the function stat_smooth().
  14. library(ggplot2)
    ggplot(marketing, aes(x = youtube, y = sales)) +
      geom_point() +
      stat_smooth(se = FALSE, method = "lm") +
      stat_smooth(se = FALSE, col = "red")
    ## `geom_smooth()` using formula 'y ~ x'
    ## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

    The significance of the model

    In the previous section, we built a linear model of sales as a function of youtube advertising budget: sales = 8.44 + 0.048*youtube.

    Before using this formula to predict future sales, you should make sure that this model is statistically significant, that is:
    • There is a statistically significant relationship between the predictor and the outcome variables
    • The model that we built fits the data at hand well. Now we’ll check the quality of the linear regression model
  15. Use the help function to describe the summary outputs.
  16. # For lm fits specifically, the most relevant page is ?summary.lm
    help(summary)
    ## starting httpd help server ... done
  17. Use the p-values of the intercept and the predictor variable to check whether they are significant, which would mean that there is a significant association between the predictor and the outcome variables.
  18. summary(model)$coefficients[, 4]
    ##       (Intercept) marketing$youtube 
    ##       1.40630e-35       1.46739e-42
  19. Compute the 95% confidence interval for the coefficient b1 using the confint function.
  20. confint(model)
    ##                        2.5 %     97.5 %
    ## (Intercept)       7.35566312 9.52256140
    ## marketing$youtube 0.04223072 0.05284256
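    The same interval can be reproduced by hand from the coefficient table (estimate plus or minus a t quantile times the standard error); a sketch using base R:
    
    # 95% CI for b1: estimate +/- qt(0.975, df) * standard error
    est = coef(summary(model))["marketing$youtube", "Estimate"]
    se = coef(summary(model))["marketing$youtube", "Std. Error"]
    tcrit = qt(0.975, df = df.residual(model))  # 198 residual df
    c(est - tcrit * se, est + tcrit * se)  # matches confint(model) above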

    Model accuracy:

    Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostics by checking how well the model fits the data. This process is also referred to as the goodness-of-fit.

    The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:
    • The Residual Standard Error (RSE)
    • The R-squared (R2)
    • The F-statistic
  21. Find these three measures for our regression model.
  22. model_summary = summary(model)
    model_summary
    ## 
    ## Call:
    ## lm(formula = marketing$sales ~ marketing$youtube)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -10.0632  -2.3454  -0.2295   2.4805   8.6548 
    ## 
    ## Coefficients:
    ##                   Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)       8.439112   0.549412   15.36   <2e-16 ***
    ## marketing$youtube 0.047537   0.002691   17.67   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 3.91 on 198 degrees of freedom
    ## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
    ## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
    model_summary$sigma
    ## [1] 3.910388
    model_summary$r.squared
    ## [1] 0.6118751
    model_summary$fstatistic[1]
    ##   value 
    ## 312.145

    The RSE (also known as the model sigma) is the residual variation, representing the average variation of the observation points around the fitted regression line. It is the standard deviation of the residual errors.

    RSE provides an absolute measure of the patterns in the data that can’t be explained by the model. When comparing two models, the one with the smaller RSE fits the data better.

    Dividing the RSE by the average value of the outcome variable will give you the prediction error rate, which should be as small as possible.

    In our example, RSE = 3.91, meaning that the observed sales values deviate from the true regression line by approximately 3.9 units on average.

    Whether or not an RSE of 3.9 units is an acceptable prediction error is subjective and depends on the problem context. This is why we usually normalise it with respect to the predicted variable.
  23. Calculate the percentage error: RSE/mean(y)
  24. RSE = model_summary$sigma
    y = marketing$sales
    error = RSE/mean(y)
    error
    ## [1] 0.2323877

    The R-squared (R2) ranges from 0 to 1 and represents the proportion of information (i.e. variation) in the data that can be explained by the model. The adjusted R-squared adjusts for the degrees of freedom.

    R2 measures how well the model fits the data. For a simple linear regression, R2 is the square of the Pearson correlation coefficient.
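    A quick check of this identity on our data:
    
    # For a simple regression, R-squared equals the squared correlation
    cor(marketing$youtube, marketing$sales)^2
    ## [1] 0.6118751
    model_summary$r.squared
    ## [1] 0.6118751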

    A high value of R2 is a good indication. However, as the value of R2 tends to increase when more predictors are added in the model, such as in multiple linear regression model, you should mainly consider the adjusted R-squared, which is a penalized R2 for a higher number of predictors.

    An (adjusted) R2 that is close to 1 indicates that a large proportion of the variability in the outcome is explained by the regression model. A number near 0 indicates that the regression model did not explain much of the variability in the outcome.

    F-statistic: the F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient.

    In a simple linear regression, this test is not very interesting since it just duplicates the information given by the t-test, available in the coefficient table. In fact, the F-statistic is the square of the t-statistic: 312.1 ≈ (17.67)^2. This is true in any model with 1 degree of freedom.

    The F-statistic becomes more important once we start using multiple predictors as in multiple linear regression.

    A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 312.14, producing a p-value of 1.46e-42, which is highly significant.
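    A quick check of the F = t^2 identity on our model:
    
    # The slope's t value squared reproduces the F-statistic
    tval = coef(summary(model))["marketing$youtube", "t value"]
    tval^2
    model_summary$fstatistic[1]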
  25. Is the F test significant in our case?
  26. # The F-statistic gives p < 0.05, so the model is significant

    Make sure your data meet the assumptions: we can use R to check that our data meet the four main assumptions for linear regression. As we will see, this can only be done after the model has been constructed.
  27. Check the main assumptions of linear regression using the plots seen in the course. Recall that they may be summarised as: the residuals need to be normal, independent and of constant variance, and (X, y) need to have a good linear correlation.
  28. # There are four main assumptions:
    # linearity, homoscedasticity, independence, normality
    res = resid(model)
    
    # Residuals vs fitted values: checks linearity and constant variance
    plot(fitted(model), res)
    abline(0, 0)

    # Normal Q-Q plot: the residuals should follow the reference line
    qqnorm(res)
    qqline(res)

    # Density of the residuals: should look roughly bell-shaped
    plot(density(res))
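    The visual normality checks can be backed up with a formal test; a minimal sketch using base R's Shapiro-Wilk test:
    
    # H0: the residuals are normally distributed
    shapiro.test(res)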

  29. Give an interpretation of the Residuals vs. Leverage Plot.
  30. # plot() on an lm object produces four diagnostic plots; the fourth
    # is the Residuals vs Leverage plot
    plot(model)

    # A residuals vs. leverage plot is a type of diagnostic plot that
    # allows us to identify influential observations in a regression
    # model. Each observation from the dataset is shown as a single point
    # within the plot. The x-axis shows the leverage of each point and
    # the y-axis shows the standardized residual of each point.
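    The quantities behind this plot can also be inspected numerically; a minimal sketch using base R's influence helpers:
    
    # Numeric counterparts of the Residuals vs Leverage plot
    lev = hatvalues(model)  # leverage of each observation (x-axis)
    cook = cooks.distance(model)  # influence of each observation
    head(sort(lev, decreasing = TRUE))  # highest-leverage points
    head(sort(cook, decreasing = TRUE))  # most influential points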