Simple linear regression
TP2
Example 1
- We are going to use a dataset called Boston which is part of the MASS package. First install the MASS package and import it.
- Split the dataset into train and test subsets using only two variables: lstat and medv.
- Check for linearity between the lstat and medv features.
- According to the plot, we see that the relationship is not linear. Try a transformation of our explanatory variable lstat using the log function.
- Run the linear regression model using the log transformation.
- Plot the obtained regression model.
- Predict the median value of a house with lstat = 5%.
- Predict the median values of houses with lstat = 5%, 10%, and 15%.
- Compute the mean squared error (MSE) using the test data.
# load MASS package
library(MASS)

# Check the dimensions of the Boston dataset
dim(Boston)
## [1] 506 14
# Split the data by using the first 400 observations as the training
# data and the remaining as the testing data
train = 1:400
test = -train

# Specify that we are going to use only two variables (lstat and medv)
variables = c("lstat", "medv")
training_data = Boston[train, variables]
testing_data = Boston[test, variables]

# Check the dimensions of the new dataset
dim(training_data)
## [1] 400 2
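As a side note, the split above simply takes the first 400 rows. A common variant (not used in this TP, shown only as a sketch) draws the training indices at random:

# Alternative: a random split of 400 training rows;
# set.seed makes the draw reproducible
set.seed(1)
train_random = sample(nrow(Boston), 400)
test_random = -train_random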
# Scatter plot: the relationship between medv and lstat is not linear
plot(training_data$lstat, training_data$medv)

# Log transformation of the explanatory variable lstat
lstatlog = log(training_data$lstat)
plot(lstatlog, training_data$medv)

# Fit the regression of medv on log(lstat)
model = lm(medv ~ log(lstat), data = training_data)
model
##
## Call:
## lm(formula = medv ~ log(lstat), data = training_data)
##
## Coefficients:
## (Intercept) log(lstat)
## 51.78 -12.20
plot(log(training_data$lstat), training_data$medv,
     xlab = "Log Transform of % of Households with Low Socioeconomic Income",
     ylab = "Median House Value",
     col = "red",
     pch = 20)
abline(model, col = "blue", lwd = 3)

# Predict the median house value for lstat = 5%
predict(model, data.frame(lstat = c(5)))
##        1
## 32.14277

# Predict the median house values for lstat = 5%, 10% and 15%
predict(model, data.frame(lstat = c(5, 10, 15)))
##        1        2        3
## 32.14277 23.68432 18.73644
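Point predictions can be supplemented with uncertainty bounds. As a sketch beyond what the exercise asks, predict() accepts an interval argument; here, 95% prediction intervals for the same lstat values:

# 95% prediction intervals around the point predictions
predict(model, data.frame(lstat = c(5, 10, 15)), interval = "prediction")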
# Compute the mean squared error (MSE) on the test data
prediction = predict(model, newdata = testing_data)
mse = mean((testing_data$medv - prediction)^2)
mse
## [1] 17.69552
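Since the MSE is expressed in squared units of medv, a small optional follow-up (not required by the exercise) is to take its square root, the RMSE, which is directly comparable to the house values:

# RMSE: average prediction error in the same units as medv
sqrt(mse)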
Example 2
- Load the required packages (install them first if needed):
  - datarium: for data manipulation and visualization.
  - ggpubr: easily creates publication-ready plots.
- Load and inspect the marketing data.
- We want to predict future sales on the basis of the advertising budget spent on youtube. Create a scatter plot displaying the sales units versus the youtube advertising budget and add a smoothed line.
- Compute the correlation coefficient between the sales and youtube features.
- The simple linear regression tries to find the best line to predict sales on the basis of the youtube advertising budget. The linear model equation can be written as follows: sales = b0 + b1 * youtube. Use the R function lm() to determine the beta coefficients of the linear model.
- Print the summary table and give an interpretation of the obtained results.
- Add the regression line to the scatter plot; you can use the function stat_smooth() [ggplot2]. By default, the fitted line is presented with a confidence interval around it. The confidence bands reflect the uncertainty about the line. If you don't want to display it, specify the option se = FALSE in the function stat_smooth().
- Verify that there is a statistically significant relationship between the predictor and the outcome variables.
- The model that we built fits the data in our hands very well. Now we'll check the quality of the linear regression model:
  - Use the help function to describe the summary outputs.
  - Use the p-values for the intercept and the predictor variable to check whether they are significant, which would mean that there is a significant association between the predictor and the outcome variables.
  - Compute the 95% confidence interval for the coefficient b1 using the confint function.
  - Find the following three measures for our regression model: the Residual Standard Error (RSE), the R-squared (R2) and the F statistic.
- Calculate the percentage error: RSE/mean(y).
- Is the F test significant in our case?
- Check the main assumptions for linear regression using the plots seen in the course. Recall that they can be summarised as: the residuals need to be normal, independent and have the same variance, and (X, y) need to have a good linear correlation.
- Give an interpretation of the Residuals vs. Leverage plot.
library(datarium)
library(ggpubr)
## Loading required package: ggplot2

data("marketing", package = "datarium")
head(marketing, 4)
##   youtube facebook newspaper sales
## 1  276.12    45.36     83.04 26.52
## 2   53.40    47.16     54.12 12.48
## 3   20.64    55.08     83.16 11.16
## 4  181.80    49.56     70.20 22.20
str(marketing)
## 'data.frame': 200 obs. of 4 variables:
## $ youtube : num 276.1 53.4 20.6 181.8 217 ...
## $ facebook : num 45.4 47.2 55.1 49.6 13 ...
## $ newspaper: num 83 54.1 83.2 70.2 70.1 ...
## $ sales : num 26.5 12.5 11.2 22.2 15.5 ...
summary(marketing)
## youtube facebook newspaper sales
## Min. : 0.84 Min. : 0.00 Min. : 0.36 Min. : 1.92
## 1st Qu.: 89.25 1st Qu.:11.97 1st Qu.: 15.30 1st Qu.:12.45
## Median :179.70 Median :27.48 Median : 30.90 Median :15.48
## Mean :176.45 Mean :27.92 Mean : 36.66 Mean :16.83
## 3rd Qu.:262.59 3rd Qu.:43.83 3rd Qu.: 54.12 3rd Qu.:20.88
## Max. :355.68 Max. :59.52 Max. :136.80 Max. :32.40
data("marketing")
head(marketing)
## youtube facebook newspaper sales
## 1 276.12 45.36 83.04 26.52
## 2 53.40 47.16 54.12 12.48
## 3 20.64 55.08 83.16 11.16
## 4 181.80 49.56 70.20 22.20
## 5 216.96 12.96 70.08 15.48
## 6 10.44 58.68 90.00 8.64
# Scatter plot of sales versus the youtube advertising budget,
# with a smoothed line added
scatter.smooth(marketing$youtube, marketing$sales,
               xlab = "youtube", ylab = "sales")
The graph above suggests a linearly increasing relationship between the sales and youtube variables. This is a good thing, because one important assumption of linear regression is that the relationship between the outcome and predictor variables is linear.
It's also possible to compute the correlation coefficient between the two variables using the R function cor():
cor(marketing$youtube,marketing$sales)## [1] 0.7822244
A correlation value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation of the outcome variable (y) is not explained by the predictor (x). In such a case, we should probably look for better predictor variables.
In our example, the correlation coefficient is large enough, so we can continue by building a linear model of y as a function of x.
model = lm(marketing$sales ~ marketing$youtube)
summary(model)
##
## Call:
## lm(formula = marketing$sales ~ marketing$youtube)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0632 -2.3454 -0.2295 2.4805 8.6548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.439112 0.549412 15.36 <2e-16 ***
## marketing$youtube 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
# We can thus see that the linear regression is a good model: sales = 8.44 + 0.048*youtube

library(ggplot2)
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth(se = FALSE, method = "lm") +
  stat_smooth(se = FALSE, col = "red")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
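The task statement also mentions that stat_smooth() draws a confidence band by default. As a quick illustration (an addition to the original code), here is the same plot with the band kept:

# Regression line with its default 95% confidence band (se = TRUE is the default)
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth(method = "lm")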
The significance of the model
In the previous section, we built a linear model of sales as a function of the youtube advertising budget: sales = 8.44 + 0.048*youtube.
Before using this formula to predict future sales, you should make sure that this model is statistically significant, that is, that there is a statistically significant relationship between the predictor and the outcome variable.
help(summary)
## starting httpd help server ... done

# p-values for the intercept and the slope
summary(model)$coefficients[, 4]
##       (Intercept) marketing$youtube
##       1.40630e-35       1.46739e-42
confint(model)
##                        2.5 %     97.5 %
## (Intercept)       7.35566312 9.52256140
## marketing$youtube 0.04223072 0.05284256
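To see what confint() computes, here is a hand-rolled sketch (for illustration only) that rebuilds the 95% interval for b1 from the coefficient table:

# Manual 95% CI for the slope: estimate +/- t quantile * standard error
est = summary(model)$coefficients["marketing$youtube", "Estimate"]
se = summary(model)$coefficients["marketing$youtube", "Std. Error"]
tq = qt(0.975, df = df.residual(model))
c(lower = est - tq * se, upper = est + tq * se)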
Model accuracy:
Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostic by checking how well the model fits the data. This process is also referred to as assessing the goodness of fit.
The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary:
model_summary = summary(model)
model_summary
##
## Call:
## lm(formula = marketing$sales ~ marketing$youtube)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0632 -2.3454 -0.2295 2.4805 8.6548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.439112 0.549412 15.36 <2e-16 ***
## marketing$youtube 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.91 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
model_summary$sigma
## [1] 3.910388

model_summary$r.squared
## [1] 0.6118751

model_summary$fstatistic[1]
##   value
## 312.145
The RSE (also known as the model sigma) is the residual variation, representing the average variation of the observation points around the fitted regression line. It is the standard deviation of the residual errors.
RSE provides an absolute measure of the patterns in the data that cannot be explained by the model. When comparing two models, the one with the smaller RSE fits the data better. Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible.
In our example, RSE = 3.91, meaning that the observed sales values deviate from the true regression line by approximately 3.9 units on average. Whether or not an RSE of 3.9 units is an acceptable prediction error is subjective and depends on the problem context. This is why we usually normalise it with respect to the mean of the outcome variable.
RSE = model_summary$sigma
y= marketing$sales
error = RSE/mean(y)
error
## [1] 0.2323877
The R-squared (R2) ranges from 0 to 1 and represents the proportion of information (i.e. variation) in the data that can be explained by the model. The adjusted R-squared adjusts for the degrees of freedom.
R2 measures how well the model fits the data. For a simple linear regression, R2 is the square of the Pearson correlation coefficient. A high value of R2 is a good indication. However, as R2 tends to increase when more predictors are added to the model, as in multiple linear regression, you should mainly consider the adjusted R-squared, which is an R2 penalized for a higher number of predictors. An (adjusted) R2 that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model; a value near 0 indicates that the model did not explain much of the variability in the outcome.
F-statistic: the F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient. In a simple linear regression this test is not really interesting, since it just duplicates the information given by the t-test in the coefficient table: the F statistic is the square of the t statistic, 312.1 = (17.67)^2, which is true in any model with 1 degree of freedom. The F-statistic becomes more important once we start using multiple predictors, as in multiple linear regression. A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 312.14, producing a p-value of 1.46e-42, which is highly significant.
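As a quick sanity check of the claim that F equals the square of the t statistic in a one-predictor model, a minimal sketch using the fitted model from above:

# t statistic of the youtube slope and the F statistic from summary()
t_youtube = summary(model)$coefficients[2, "t value"]
f_stat = summary(model)$fstatistic[1]
c(t_squared = t_youtube^2, F = unname(f_stat))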
# The F test gives p < 0.05, so the model is significant

Make sure your data meet the assumptions: we can use R to check that our data meet the four main assumptions for linear regression. This can only be done after the model has been constructed, as we will see.

# There are 4 main assumptions:
# linearity, homoscedasticity, independence, normality
# Residuals vs. fitted values: checks linearity and constant variance
res = resid(model)
plot(fitted(model), res)
abline(0, 0)

# Normal Q-Q plot and density of the residuals: checks normality
qqnorm(res)
qqline(res)
plot(density(res))

# Standard diagnostic plots, including Residuals vs. Leverage
plot(model)

A residuals vs. leverage plot is a type of diagnostic plot that allows us to identify influential observations in a regression model. Each observation from the dataset is shown as a single point within the plot. The x-axis shows the leverage of each point and the y-axis shows the standardized residual of each point.
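The Residuals vs. Leverage plot can also be read numerically. As a sketch (not part of the original solution), the leverage and Cook's distance values it visualises can be extracted directly, with a common rule of thumb flagging points whose Cook's distance exceeds 4/n:

# Leverage (hat values) and Cook's distance for each observation
lev = hatvalues(model)
cook = cooks.distance(model)
n = nrow(marketing)
# Observations with the highest leverage
head(sort(lev, decreasing = TRUE), 3)
# Indices of potentially influential points by the 4/n rule of thumb
which(cook > 4 / n)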