QUESTION 1 (15 points) Reprenons le contexte du devoir 3. Le jeu de données Metro Montreal.csv contient le kilométrage mensuel parcouru par l’ensemble des wagons de métro de la Ville de Montréal, de janvier 2019 à septembre 2023, ainsi que le kilométrage calculé pour la préparation du budget d’exploitation du métro de 2019 à 2022.
On aimerait utiliser les régressions locales afin de mettre à jour les prévisions budgétaires au début de chaque mois. La variable dépendante correspond ici au kilométrage réel parcouru par les wagons de métro. La spécification du modèle contiendra les variables explicatives suivantes : t, mois, et la variable binaire créée à la question 3(c) du devoir 3.
Le mois de mars servira de mois de référence.
Pour déterminer le meilleur modèle, on testera différentes fonctions de pondération (uniforme, bisquare et gaussienne) ainsi que des rayons de sélection s’étalant de 24 à 36 mois.
Remarque : Lorsque nécessaire, justifiez vos réponses à l’aide de tableaux et/ou graphiques appropriés.
Justifiez votre réponse. Remarque : Pour les sous-questions suivantes, utilisez le modèle choisi en (a).
rm(list=ls())
setwd("~/USHERBROOKE/ESCUELA/ÉTÉ/MQG813 - Statistiques décisionnelles avancées/D4")
source("MQG813_librairie_temporel.R")
## Loading required package: DescTools
## Loading required package: vcd
## Loading required package: grid
## Loading required package: gmodels
## Registered S3 method overwritten by 'gdata':
## method from
## reorder.factor DescTools
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: fastDummies
##
## Loading required package: rio
##
## Loading required package: car
##
## Loading required package: carData
##
##
## Attaching package: 'car'
##
##
## The following object is masked from 'package:dplyr':
##
## recode
##
##
## The following object is masked from 'package:purrr':
##
## some
##
##
## The following object is masked from 'package:DescTools':
##
## Recode
##
##
## Loading required package: Hmisc
##
##
## Attaching package: 'Hmisc'
##
##
## The following objects are masked from 'package:dplyr':
##
## src, summarize
##
##
## The following objects are masked from 'package:DescTools':
##
## %nin%, Label, Mean, Quantile
##
##
## The following objects are masked from 'package:base':
##
## format.pval, units
##
##
## Loading required package: modelr
##
## Loading required package: pander
##
## Loading required package: nlme
##
##
## Attaching package: 'nlme'
##
##
## The following object is masked from 'package:dplyr':
##
## collapse
##
##
## Loading required package: knitr
##
## Loading required package: ggpubr
##
## Loading required package: forecast
##
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
##
## Attaching package: 'forecast'
##
##
## The following object is masked from 'package:ggpubr':
##
## gghistogram
##
##
## The following object is masked from 'package:nlme':
##
## getResponse
##
##
## The following object is masked from 'package:DescTools':
##
## BoxCox
##
##
## Loading required package: vars
##
## Loading required package: MASS
##
##
## Attaching package: 'MASS'
##
##
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Loading required package: strucchange
##
## Loading required package: zoo
##
##
## Attaching package: 'zoo'
##
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## Loading required package: sandwich
##
##
## Attaching package: 'strucchange'
##
##
## The following object is masked from 'package:stringr':
##
## boundary
##
##
## Loading required package: urca
##
## Loading required package: lmtest
##
##
## Attaching package: 'vars'
##
##
## The following object is masked from 'package:DescTools':
##
## Phi
#Verifie si une librairie est installee, l'installe au besoin et la charge
list.of.packages <- c("modelr", "rio", "tidyverse", "DescTools", "pastecs",
"psych", "pander", "gmodels", "vcd", "fastDummies",
"questionr", "PerformanceAnalytics", "Hmisc", "FSA",
"car", "robustHD", "hexView", "ggplot2", "dplyr",
"forecast", "latticeExtra", "numbers", "zoo", "lubridate",
"stringr", "fastDummies", "descr","tinytex","readr")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) {install.packages(new.packages)}
lapply(list.of.packages, require, character.only = TRUE)
## Loading required package: pastecs
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## The following object is masked from 'package:tidyr':
##
## extract
##
## Loading required package: psych
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:Hmisc':
##
## describe
##
## The following object is masked from 'package:car':
##
## logit
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
##
## The following objects are masked from 'package:DescTools':
##
## AUC, ICC, SD
##
## Loading required package: questionr
##
## Attaching package: 'questionr'
##
## The following object is masked from 'package:psych':
##
## describe
##
## The following objects are masked from 'package:Hmisc':
##
## describe, wtd.mean, wtd.table, wtd.var
##
## Loading required package: PerformanceAnalytics
## Loading required package: xts
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:pastecs':
##
## first, last
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
##
## Attaching package: 'PerformanceAnalytics'
##
## The following object is masked from 'package:vcd':
##
## Kappa
##
## The following object is masked from 'package:graphics':
##
## legend
##
## Loading required package: FSA
## Registered S3 methods overwritten by 'FSA':
## method from
## confint.boot car
## hist.boot car
## ## FSA v0.9.6. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.
##
## Attaching package: 'FSA'
##
## The following object is masked from 'package:psych':
##
## headtail
##
## The following object is masked from 'package:car':
##
## bootCase
##
## Loading required package: robustHD
## Loading required package: perry
## Loading required package: parallel
##
## Attaching package: 'perry'
##
## The following object is masked from 'package:modelr':
##
## mape
##
## Loading required package: robustbase
## Loading required package: hexView
## Loading required package: latticeExtra
## Loading required package: lattice
##
## Attaching package: 'latticeExtra'
##
## The following object is masked from 'package:ggplot2':
##
## layer
##
## The following object is masked from 'package:vcd':
##
## rootogram
##
## Loading required package: numbers
##
## Attaching package: 'numbers'
##
## The following object is masked from 'package:PerformanceAnalytics':
##
## Omega
##
## The following object is masked from 'package:psych':
##
## omega
##
## The following objects are masked from 'package:DescTools':
##
## GCD, LCM, Primes
##
## Loading required package: descr
##
## Attaching package: 'descr'
##
## The following object is masked from 'package:questionr':
##
## freq
##
## The following object is masked from 'package:gmodels':
##
## CrossTable
##
## Loading required package: tinytex
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] TRUE
##
## [[11]]
## [1] TRUE
##
## [[12]]
## [1] TRUE
##
## [[13]]
## [1] TRUE
##
## [[14]]
## [1] TRUE
##
## [[15]]
## [1] TRUE
##
## [[16]]
## [1] TRUE
##
## [[17]]
## [1] TRUE
##
## [[18]]
## [1] TRUE
##
## [[19]]
## [1] TRUE
##
## [[20]]
## [1] TRUE
##
## [[21]]
## [1] TRUE
##
## [[22]]
## [1] TRUE
##
## [[23]]
## [1] TRUE
##
## [[24]]
## [1] TRUE
##
## [[25]]
## [1] TRUE
##
## [[26]]
## [1] TRUE
##
## [[27]]
## [1] TRUE
##
## [[28]]
## [1] TRUE
##
## [[29]]
## [1] TRUE
install_formats()
metro <- import("Metro_Montreal.csv")
#Creation des variables t, annee et trimestre
metro <- metro %>% mutate(Date=as.Date(Date),
t = row_number(),
trimestre=factor(paste0("Q", quarter(Date)),
levels = c("Q1", "Q2", "Q3", "Q4")),
mois=factor(month(Date,label = TRUE, abbr = FALSE,)),
annee=year(Date))
metro$BINAIRE <- 0
metro[16, "BINAIRE"] <- 1
#chisisr 1 jour de la semaine comme reference
metro$mois <- factor(metro$mois,ordered=F)
metro$mois <- relevel(metro$mois, ref = "March") #
comparaison_rayons_uniforme <- tests_rayons(metro, Km_reel~ t + mois+ BINAIRE,
ponderation="uniforme", rayons=c(24:36))
## | | | 0% | |======== | 15% | |============ | 23% | |=============== | 31% | |=================== | 38% | |======================= | 46% | |=========================== | 54% | |=============================== | 62% | |=================================== | 69% | |====================================== | 77% | |========================================== | 85% | |============================================== | 92% | |==================================================| 100%
comparaison_rayons_gaussienne <- tests_rayons(metro, Km_reel~ t + mois+ BINAIRE,
ponderation="gaussienne", rayons=c(24:36))
## | | | 0% | |======== | 15% | |============ | 23% | |=============== | 31% | |=================== | 38% | |======================= | 46% | |=========================== | 54% | |=============================== | 62% | |=================================== | 69% | |====================================== | 77% | |========================================== | 85% | |============================================== | 92% | |==================================================| 100%
comparaison_rayons_bisquare <- tests_rayons(metro, Km_reel~ t + mois+ BINAIRE,
ponderation="bisquare", rayons=c(24:36))
## | | | 0%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |======== | 15%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |============ | 23%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |=============== | 31%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |=================== | 38%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |======================= | 46%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |=========================== | 54%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |=============================== | 62%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |=================================== | 69%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |====================================== | 77%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |========================================== | 85%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |============================================== | 92%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
## | |==================================================| 100%
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
commentaire: Selon le graphique, les résidus semblent suivre un comportement globalement aléatoirerish, on observe une légère tendance à revenir vers zéro. Les valeurs restent dans l’intervalle [–2, +2], ce qui indique une certaine stabilité, mais cela ne suffit pas pour conclure qu’ils sont stationnaires. Toutefois, le test de séquences donne une p-valeur de 0,08247, qui est supérieure au seuil de 0,05. On ne rejette donc pas l’hypothèse nulle d’indépendance, ce qui soutient l’idée que les résidus sont compatibles avec un comportement aléatoire (stationnaire).
# une fois on a identifier le meilleur rayon..
metro <- reg_locales(metro, Km_reel ~ t + mois + BINAIRE,
ponderation="bisquare", rayon=29)
## Warning in predict.lm(reg.temp, data[(i - rayon):i, ]): prediction from
## rank-deficient fit; attr(*, "non-estim") has doubtful cases
#############################
### Evolution des residus ###
#############################
metro$Zres <- scale(metro$resid_29)[,1]
#Grape sequentiel des residus standardises
ggplot(data=metro[25:57,], aes(x=Date, y = Zres)) +
geom_point() +
geom_line() +
geom_hline(yintercept = 0)
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
# cest aleatoire, on voit quils sont autour de 0 .. il revien a 0, on depasse pas +2 -2
# Test de séquences
RunsTest(metro[["Zres"]] <= mean(metro[["Zres"]], na.rm=T), na.rm = T)
##
## Runs Test for Randomness
##
## data: metro[["Zres"]] <= mean(metro[["Zres"]], na.rm = T)
## runs = 10, m = 14, n = 14, p-value = 0.08247
## alternative hypothesis: true number of runs is not equal the expected number
# comme pvalue 0.08247 >0.05 on rejet pas Ho.. cad... cest aleatoire
On observe que ce coefficient est négatif pendant presque toute l’année 2021, ce qui indique que les kilométrages de décembre étaient inférieurs à ceux de mars dans ces périodes.
Toutefois, à partir de la fin de 2021, on voit une hausse progressive du coefficient, qui dépasse la ligne de zéro à plusieurs reprises en 2022 et surtout en 2023.
➤ Cela montre que les kilomètres parcourus en décembre ne sont pas toujours inférieurs à ceux de mars, mais que cela varie selon les années et les données locales. La relation n’est donc pas constante, ce qui est typique dans une régression locale.
# On va regarder les coefficients des mois decembre
ggplot(data=metro[25:57,], aes(x=Date, y = coeff_December_29)) +
geom_point() +
geom_line() +
geom_hline(aes(yintercept = mean(metro$coeff_December_29, na.rm=TRUE)), color = "red")+
labs(title = "Evolution de Coefficients de décembre",
x = "Date",
y = "Coefficient estime")
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
(d) À partir de quel moment le kilométrage mensuel est-il en diminution
si l’on se fie au graphe séquentiel montrant l’évolution du coefficient
associé à la variable t ? commentaire:a partir du t=30, car on commence
en bas de 0, donc ici on parle dune diminution , meme si la diminution
“diminue” en montant, cest encore une diminution jusqua t=45 (apres la
prochaine diminution est au t=51).
#on va regarder les coefficient t
#Choix du coefficient (coeff_constante_9, coeff_t_29***,
nom_coeff <- "coeff_t_29"
#Graphe sequentiel
ggplot(data=metro[25:57,], aes(x=t, y = coeff_t_29)) +
geom_point() +
geom_line() +
geom_hline(aes(yintercept = 0), color = "red")
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
# quand les vente coeff diminue.. apres il est negati
Considérons la colonne contenant les coefficients associés à la variable binaire. Selon vous, pourquoi observe-t-on des valeurs manquantes à partir d’un certain moment, alors que les autres coefficients existent ? COmmentaire: Cetait un variable qu’ etait creer pour pour le mois davril 2020. Lorsque le rayon utilisé ne couvre pas cette observation, la régression locale n’a pas de contraste possible pour estimer son effet, ce qui entraîne une valeur manquante dans les coefficients associés.
Quel trimestre présente l’erreur mensuelle absolue moyenne la plus faible ?
trimestre 2 avec un erreur de 110545 km
# Calcul de l'erreur mensuelle absolue moyenne
metro %>%
group_by(trimestre) %>%
summarise(MAD = mean(abs(resid_29), na.rm = TRUE)) %>%
arrange(MAD)
## # A tibble: 4 × 2
## trimestre MAD
## <fct> <dbl>
## 1 Q2 110545.
## 2 Q1 194336.
## 3 Q4 201896.
## 4 Q3 208150.