library("ggplot2")

Example 16.1

The annual bonuses ($1,000s) of six employees with different years of experience were recorded as follows. We wish to determine the straight-line relationship between annual bonus and years of experience.

Load the data

years=1:6
bonus=c(6,1,9,5,17,12)
bonusdata=data.frame(years,bonus)

Plot the data as a scatter plot
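The plotting code for this step is not shown. A minimal sketch, assuming the Example 16.1 data (re-created here so the block runs on its own) and ggplot2's geom_point(); the object name p_scatter and the axis labels are illustrative choices, not taken from the original:

```r
library(ggplot2)

# Re-create the Example 16.1 data so the block is self-contained
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)
bonusdata <- data.frame(years, bonus)

# Scatter plot: one point per employee (p_scatter is an illustrative name)
p_scatter <- ggplot(bonusdata, aes(x = years, y = bonus)) +
  geom_point() +
  labs(x = "Years of experience", y = "Annual bonus ($1,000s)")
print(p_scatter)
```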

Plot the least-squares line with ggplot
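The code for this plot is not shown either. One possible sketch, assuming geom_smooth() with method = "lm" is an acceptable way to draw the least-squares line (the object name p_line is illustrative):

```r
library(ggplot2)

# Re-create the Example 16.1 data so the block is self-contained
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)
bonusdata <- data.frame(years, bonus)

# Scatter plot plus the least-squares line; se = FALSE hides the confidence band
p_line <- ggplot(bonusdata, aes(x = years, y = bonus)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p_line)
```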

Manual calculation of the regression coefficients
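The lines that define b0 and b1 are missing from this section. A minimal sketch, assuming the usual least-squares formulas (slope = sample covariance over the sample variance of x, intercept chosen so that the line passes through the point of means):

```r
# Example 16.1 data
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)

# Slope: sample covariance of (x, y) divided by the sample variance of x
b1 <- cov(years, bonus) / var(years)
# Intercept: the fitted line passes through (mean(x), mean(y))
b0 <- mean(bonus) - b1 * mean(years)

round(c(b0 = b0, b1 = b1), 3)
```

These values agree with the intercept 0.933 and the slope 2.114286 reported elsewhere in this section.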

round(b0,3)
[1] 0.933

The residuals
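The definition of the residual vector e is not shown. A sketch under the assumption that e holds the observed bonuses minus the fitted values on the least-squares line:

```r
# Example 16.1 data and the manually computed coefficients
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)
b1 <- cov(years, bonus) / var(years)
b0 <- mean(bonus) - b1 * mean(years)

# Residual = observed value minus fitted value b0 + b1 * x
e <- bonus - (b0 + b1 * years)
round(e, 3)
```

The residuals sum to zero up to floating-point rounding, which is why the sum is of the order of 1e-15 rather than exactly 0.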

sum(e)
[1] -3.108624e-15

Graphical representation of the residuals (or errors)
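No plotting code survives for this step. One way to draw the residuals, sketched here as an assumption about what the figure showed: each residual is the vertical segment between an observed point and the fitted line (fit, p_resid and the fitted column are illustrative names):

```r
library(ggplot2)

# Example 16.1 data and fitted values
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)
bonusdata <- data.frame(years, bonus)
fit <- lm(bonus ~ years, data = bonusdata)
bonusdata$fitted <- fitted(fit)

# Dashed vertical segments from each observation to the least-squares line
p_resid <- ggplot(bonusdata, aes(x = years, y = bonus)) +
  geom_segment(aes(xend = years, yend = fitted), linetype = "dashed") +
  geom_abline(intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2])) +
  geom_point()
print(p_resid)
```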

Illustration of least squares

To be done

Running a regression with the lm() function

lm stands for linear models.

Run the regression
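The call that creates LM is not shown here. Judging from the Call: line printed by summary(LM) for this example, it was presumably the following (the data are re-created so the block runs on its own):

```r
# Example 16.1 data
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)
bonusdata <- data.frame(years, bonus)

# Regress bonus on years; LM stores the fitted model object
LM <- lm(bonus ~ years, data = bonusdata)
LM$coefficients
```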

LM$coefficients[2]
   years 
2.114286 

The regression outputs

LM$fitted.values ## Fitted values (yhat_i = b0 + b1*x_i), i.e. the ordinates of the points on the line
        1         2         3         4         5         6 
 3.047619  5.161905  7.276190  9.390476 11.504762 13.619048 

A regression summary with summary()

summary(LM)

Call:
lm(formula = bonus ~ years, data = bonusdata)

Residuals:
     1      2      3      4      5      6 
 2.952 -4.162  1.724 -4.390  5.495 -1.619 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.9333     4.1920   0.223    0.835
years         2.1143     1.0764   1.964    0.121

Residual standard error: 4.503 on 4 degrees of freedom
Multiple R-squared:  0.491, Adjusted R-squared:  0.3637 
F-statistic: 3.858 on 1 and 4 DF,  p-value: 0.121

A more professional presentation of the results with stargazer()

##stargazer(LM,type = "latex")
## SSE from the shortcut formula (n-1)*(s_y^2 - s_xy^2/s_x^2), with n = 6
5*(var(bonusdata$bonus)-
    (cov(bonusdata$years,bonusdata$bonus))^2/var(bonusdata$years))
[1] 81.10476
## Estimate of the standard deviation of the errors: sqrt(SSE/(n-2)), formula p. 650.
## Here LM is the Example 16.2 model (Price ~ Odometer, n = 100), hence n - 2 = 98.
sqrt(sum((LM$residuals)^2)/98)
[1] 0.3264886
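The divisor 98 above is n - 2 for the 100 Camrys. Since that data set is not reproduced in this section, here is a sketch of the same formula on the Example 16.1 bonus data, with the degrees of freedom computed rather than hard-coded (fit, SSE and s_eps are illustrative names):

```r
# Example 16.1 data
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)
fit <- lm(bonus ~ years)

# Standard error of estimate: sqrt(SSE / (n - 2))
SSE <- sum(residuals(fit)^2)
n <- length(bonus)
s_eps <- sqrt(SSE / (n - 2))

# Compare with the value stored by summary()
c(manual = s_eps, from_summary = summary(fit)$sigma)
```

Both numbers correspond to the residual standard error of 4.503 on 4 degrees of freedom printed by summary(LM) for the bonus example.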

Odometer Reading and Prices of Used Toyota Camrys—Part 2

Find the standard error of estimate for Example 16.2 and describe what it tells you about the model’s fit.

SLM=summary(LM)
SLM$sigma

SSE=sum((LM$residuals)^2)
sqrt(SSE/98)

## Estimate of the standard error of b1

SLM$sigma/sqrt(99*var(data$Odometer))
[1] 0.004974639
SLM

Call:
lm(formula = Price ~ Odometer, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.68679 -0.27263  0.00521  0.23210  0.70071 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17.248727   0.182093   94.72   <2e-16 ***
Odometer    -0.066861   0.004975  -13.44   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3265 on 98 degrees of freedom
Multiple R-squared:  0.6483,    Adjusted R-squared:  0.6447 
F-statistic: 180.6 on 1 and 98 DF,  p-value: < 2.2e-16

Computing the p-value for the test that the slope is zero

## Half-width of the 95% confidence interval for the slope: t_{0.025,98} * s_b1
qt(0.025,98,lower.tail=FALSE)*SLM$sigma/sqrt(99*var(data$Odometer))
[1] 0.009872009
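The p-value itself appears in the Pr(>|t|) column of the summary table. As a sketch of how it can be reproduced by hand, here is the two-sided t test of a zero slope on the Example 16.1 bonus data (the Camry data are not reproduced in this section); se_b1, t_stat and p_value are illustrative names:

```r
# Example 16.1 data
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)
n <- length(bonus)
fit <- lm(bonus ~ years)

# Standard error of the slope: s_eps / sqrt((n - 1) * s_x^2)
s_eps <- summary(fit)$sigma
se_b1 <- s_eps / sqrt((n - 1) * var(years))

# t statistic and two-sided p-value with n - 2 degrees of freedom
t_stat <- coef(fit)[["years"]] / se_b1
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
c(t = t_stat, p = p_value)
```

The result can be checked against the years row of summary(LM) for the bonus example (t value 1.964, p-value 0.121).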

EXAMPLE 16.5 Measuring the Strength of the Linear Relationship between Odometer Reading and Price of Used Toyota Camrys

Find the coefficient of determination for Example 16.2 and describe what this statistic tells you about the regression model.

SLM$r.squared
(cor(data$Price,data$Odometer))^2

This statistic tells us that 64.83% of the variation in the auction selling prices is explained by the variation in the odometer readings.

ANOVA=anova(LM)
sum(ANOVA[,2])
99*var(data$Price)
ANOVA[1,2]/(99*var(data$Price))

EXAMPLE 16.7 Predicting the Price and Estimating the Mean Price of Used Toyota Camrys

  1. A used-car dealer is about to bid on a 3-year-old Toyota Camry equipped with all the standard features and with 40,000 (\(x_g = 40\)) miles on the odometer. To help him decide how much to bid, he needs to predict the selling price.
xg=data.frame(Odometer = c(40))

PREDICT=predict.lm(LM,xg,interval = "prediction")
SLM$coefficients[,1]%*%c(1,40)
?predict.lm
PREDICT[1]-PREDICT[2]
  2. The used-car dealer mentioned in part (a) has an opportunity to bid on a lot of cars offered by a rental company. The rental company has 250 Toyota Camrys all equipped with standard features. All the cars in this lot have about 40,000 (\(x_g = 40\)) miles on their odometers. The dealer would like an estimate of the selling price of all the cars in the lot.
PREDICTEXP=predict.lm(LM,xg,interval = "confidence")
PREDICTEXP[1]-PREDICTEXP[2]
PREDICTEXP[3]-PREDICTEXP[1]
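The differences computed above are the half-widths of the two intervals. A sketch of where they come from, using the standard textbook formulas on the Example 16.1 bonus data (the Camry data are not reproduced here). The value 4 years for the new observation is a hypothetical choice made only for illustration:

```r
# Example 16.1 data
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)
bonusdata <- data.frame(years, bonus)
fit <- lm(bonus ~ years, data = bonusdata)

xg <- data.frame(years = 4)          # hypothetical value of x
n <- nrow(bonusdata)
s_eps <- summary(fit)$sigma
t_crit <- qt(0.975, df = n - 2)      # 95% level, the default of predict()

# Common term: 1/n + (xg - xbar)^2 / ((n - 1) * s_x^2)
d <- 1 / n + (xg$years - mean(years))^2 / ((n - 1) * var(years))
half_pred <- t_crit * s_eps * sqrt(1 + d)  # prediction interval (one value)
half_conf <- t_crit * s_eps * sqrt(d)      # confidence interval (mean value)

pred <- predict(fit, xg, interval = "prediction")
conf <- predict(fit, xg, interval = "confidence")
c(half_pred = half_pred, half_conf = half_conf)
```

The prediction interval is always the wider of the two because of the extra 1 under the square root, which accounts for the variability of a single observation around its mean.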

With matrices

If we write \[ Y=\begin{bmatrix} y_1\\ \vdots\\ y_n \\ \end{bmatrix} \text{ and } X=\begin{bmatrix} 1 & x_1\\ \vdots & \vdots \\ 1 & x_n \\ \end{bmatrix}, \] the estimators are obtained by a matrix calculation: \[ \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}= (X^tX)^{-1}X^tY, \] where \[ X^t=\begin{bmatrix} 1 &\cdots& 1 \\ x_1 & \cdots &x_n \\ \end{bmatrix} \] is the transpose of \(X\).

X=matrix(c(rep(1,100),data$Odometer),nrow = 100,ncol=2)
dim(X)
Y=matrix(data$Price,nrow = 100,ncol=1)
solve(t(X)%*%X)%*%t(X)%*%Y

SLM$coefficients

The variance-covariance matrix of \((b_0,b_1)\) is given by

\[ \sigma^2(X^tX)^{-1}. \] Since \(\sigma^2\) is unknown, we estimate it by \(s_\varepsilon^2=SSE/(n-2)\), which gives \[ s_\varepsilon^2(X^tX)^{-1}. \]



SLM$sigma

sqrt(sum((SLM$residuals)^2)/98)
(SLM$sigma)^2*solve(t(X)%*%X)

(SLM$coefficients[2,2])^2

vcov(LM)
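The matrix code above relies on the Camry data, which are not reproduced in this section. A self-contained sketch of the same two formulas on the Example 16.1 bonus data, with the sample size computed instead of hard-coded (XtX_inv, b, s2 and V are illustrative names):

```r
# Example 16.1 data
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)
n <- length(bonus)

# Design matrix with a column of ones, and the response as a column vector
X <- cbind(1, years)
Y <- matrix(bonus, ncol = 1)

# Least-squares estimators: (X'X)^{-1} X'Y
XtX_inv <- solve(t(X) %*% X)
b <- XtX_inv %*% t(X) %*% Y

# Estimated variance-covariance matrix: s^2 (X'X)^{-1}, with s^2 = SSE / (n - 2)
s2 <- sum((Y - X %*% b)^2) / (n - 2)
V <- s2 * XtX_inv

# Standard errors of b0 and b1 are the square roots of the diagonal
sqrt(diag(V))
```

The square roots of the diagonal correspond to the Std. Error column of summary(), and V matches vcov() applied to the fitted lm object.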

A bit of algebra

Recall that the least-squares method solves the following problem

\[ \min_{b_0,b_1} \sum (y_i-\hat{y_i})^2,\text{ with } \hat{y_i}=b_0+b_1 x_i. \]

Differentiating with respect to \(b_0\) and \(b_1\) gives the following two first-order conditions (note that \(\partial \hat{y}_i/\partial b_0=1\) and \(\partial \hat{y}_i/\partial b_1= x_i\))

\[ \begin{equation} \begin{cases} \sum x_i(y_i-\hat{y_i})=0;\\ \sum (y_i-\hat{y_i})=0. \end{cases} \end{equation} \]

The second condition says that the residuals have mean zero, and the first, combined with it, says that the residuals and the explanatory variable are uncorrelated.

These two conditions can be rewritten as \[ \begin{equation} \begin{cases} \sum (x_i-\bar{x})(y_i-\hat{y_i})=0;\\ \sum (y_i-\hat{y_i})=0. \end{cases} \end{equation} \]

Note that \(\bar{y}=\bar{\hat{y}}\).

Finally, substituting the optimal values of \(b_0\) and \(b_1\), we find

\[ SSE=\sum (y_i-\hat{y_i})^2=\sum (y_i-\bar{y}+ \bar{y}-\hat{y_i})^2, \] which, since \(\hat{y}_i-\bar{y}=b_1(x_i-\bar{x})\) (because \(b_0=\bar{y}-b_1\bar{x}\)), gives

\[ SSE=\sum [y_i-\bar{y}-b_1(x_i-\bar{x})] ^2. \] Expanding, we find

\[ SSE=\sum (y_i-\bar{y})^2-2b_1\sum(x_i-\bar{x})(y_i-\bar{y})+b^2_1\sum (x_i-\bar{x})^2. \]

Noting that

\[ b^2_1\sum (x_i-\bar{x})^2=b_1\sum(x_i-\bar{x})(y_i-\bar{y}), \] we obtain

\[ SSE=\sum (y_i-\bar{y})^2-b_1\sum(x_i-\bar{x})(y_i-\bar{y}). \]

Using the value of \(b_1\), namely \(b_1=\sum(x_i-\bar{x})(y_i-\bar{y})/\sum (x_i-\bar{x})^2\), we find

\[ SSE=\sum (y_i-\bar{y})^2-\frac{[ \sum(x_i-\bar{x})(y_i-\bar{y}) ]^2}{\sum (x_i-\bar{x})^2} \]

We also have

\[ SSE=(1-R^2)\sum (y_i-\bar{y})^2, \] where \[ R^2=\frac{[ \sum(x_i-\bar{x})(y_i-\bar{y}) ]^2}{\sum (x_i-\bar{x})^2\sum (y_i-\bar{y})^2}=\frac{s^2_{xy}}{s^2_{x}s^2_{y}}. \] Equivalently,

\[ R^2=1-\frac{SSE }{\sum (y_i-\bar{y})^2}. \]

Noting that \[ \sum(y_i-\bar{y })^2=\sum(y_i-\hat{y }_i)^2+\sum(\hat{y }_i-\bar{y })^2=SSE+\sum(\hat{y }_i-\bar{y })^2, \] we also have \[ R^2=\frac{\sum (y_i-\bar{y})^2-SSE }{\sum (y_i-\bar{y})^2}=\frac{\sum(\hat{y }_i-\bar{y })^2 }{\sum (y_i-\bar{y})^2}. \]

\(R^2\) is thus the share of the variance explained by the model.
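The identities above can be checked numerically. A minimal sketch on the Example 16.1 bonus data, where Sxx, Sxy and SST are illustrative names for the sums of squares and cross-products:

```r
# Example 16.1 data
years <- 1:6
bonus <- c(6, 1, 9, 5, 17, 12)

# Sums of squares and cross-products around the means
Sxx <- sum((years - mean(years))^2)
Sxy <- sum((years - mean(years)) * (bonus - mean(bonus)))
SST <- sum((bonus - mean(bonus))^2)

# Least-squares coefficients
b1 <- Sxy / Sxx
b0 <- mean(bonus) - b1 * mean(years)

# Three expressions for SSE that should coincide
SSE_direct   <- sum((bonus - b0 - b1 * years)^2)
SSE_shortcut <- SST - Sxy^2 / Sxx
R2 <- Sxy^2 / (Sxx * SST)
SSE_from_R2  <- (1 - R2) * SST

c(SSE_direct, SSE_shortcut, SSE_from_R2, R2)
```

The three SSE values agree with the 81.10476 obtained earlier from the variance and covariance, and R2 agrees with the Multiple R-squared of 0.491 in the summary of the bonus regression.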

