library(tidyverse)
library(mlogit)
library(stargazer)
library(Ecdat)
Projet données Yogurt
Introduction
Chargement des packages
Chargement des data et statistiques descriptives
data("Yogurt")
On peut calculer
%>%select(choice)%>%table()%>%
Yogurtprop.table()%>%round(5)
choice
yoplait dannon hiland weight
0.33914 0.40216 0.02944 0.22927
# Moyenne et écart-type pour les variables numériques
%>%
Yogurt summarise(across(where(is.numeric) & !all_of("id"), list(mean = mean, sd = sd), na.rm = TRUE))
Warning: There was 1 warning in `summarise()`.
ℹ In argument: `across(...)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
feat.yoplait_mean feat.yoplait_sd feat.dannon_mean feat.dannon_sd
1 0.05597015 0.2299117 0.03772803 0.1905772
feat.hiland_mean feat.hiland_sd feat.weight_mean feat.weight_sd
1 0.03689884 0.1885525 0.03772803 0.1905772
price.yoplait_mean price.yoplait_sd price.dannon_mean price.dannon_sd
1 10.68213 1.906265 8.163474 1.062886
price.hiland_mean price.hiland_sd price.weight_mean price.weight_sd
1 5.362935 0.805391 7.949088 0.7735004
# Calcul de proportions pour la variable qualititative
%>%
Yogurt group_by(choice) %>%
summarise(proportion = n())
# A tibble: 4 × 2
choice proportion
<fct> <int>
1 yoplait 818
2 dannon 970
3 hiland 71
4 weight 553
set.seed(123)
<- tibble(
data groupe = sample(c("A", "B", "C"), 100, replace = TRUE),
x = rnorm(100, mean = 50, sd = 10), # Variable quantitative
y = rnorm(100, mean = 30, sd = 5) # Variable quantitative
)
%>%
data summarise(
mean_x = mean(x, na.rm = TRUE),
sd_x = sd(x, na.rm = TRUE),
min_x = min(x, na.rm = TRUE),
max_x = max(x, na.rm = TRUE),
mean_y = mean(y, na.rm = TRUE),
sd_y = sd(y, na.rm = TRUE)
)
# A tibble: 1 × 6
mean_x sd_x min_x max_x mean_y sd_y
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 49.4 9.57 26.9 71.9 30.0 4.71
#
Reproduire Table 1
Un exemple pedagogique pour comprendre pivot longer
Partonds d’un exemple simple
<- tibble(
df id = c(1, 2, 3),
math = c(80, 90, 85),
science = c(75, 95, 88),
english = c(78, 85, 82)
)
En réalité il n’y a qu’une seule variable, c’esl la note mais decilinée par élève et par matière. On pourrait créer 3 colonne id une colonne note et une colonne matière
<- df %>%
df_long pivot_longer(cols = c(math, science, english),
names_to = "matiere",
values_to = "note")
df_long
# A tibble: 9 × 3
id matiere note
<dbl> <chr> <dbl>
1 1 math 80
2 1 science 75
3 1 english 78
4 2 math 90
5 2 science 95
6 2 english 85
7 3 math 85
8 3 science 88
9 3 english 82
Pivoter le dataframe
Le data frame est dans un format peu commode qui s’appelle un format wide. Il y a beaucoup de colonne (1+4+4+1=10) colonnes. Il est plus commode et plus parlant statistiquement de créer un format long avec des colonne . id, feat, price, choice mais on sera obligé d’jouter une autre colonne qui identifie la marque en question
|>
Yogurt mutate(purshase = as.integer(row_number()),id=as.integer(id)) |>
pivot_longer(cols = starts_with("price") | starts_with("feat"),
names_to = c(".value", "brand"),
names_sep = "\\.") |>
relocate( choice, .after = last_col())->Yogurt_long
Yogurt_long
# A tibble: 9,648 × 6
id purshase brand price feat choice
<int> <int> <chr> <dbl> <dbl> <fct>
1 1 1 yoplait 10.8 0 weight
2 1 1 dannon 8.1 0 weight
3 1 1 hiland 6.10 0 weight
4 1 1 weight 7.90 0 weight
5 1 2 yoplait 10.8 0 dannon
6 1 2 dannon 9.80 0 dannon
7 1 2 hiland 6.40 0 dannon
8 1 2 weight 7.50 0 dannon
9 1 3 yoplait 10.8 0 dannon
10 1 3 dannon 9.80 0 dannon
# ℹ 9,638 more rows
Utiliser gtsummary
library(glue)
packageVersion("glue")
[1] '1.8.0'
library("gtsummary")
%>%
Yogurt_long mutate(price = price / 100, ms = as.numeric(brand == choice)) %>%
tbl_summary(
by = brand,
statistic = list(
~ "{mean}({sd})",
ms ~ "{mean}({sd})",
feat ~ "{mean}({sd})"
price
),type = list(feat ~ "continuous", ms ~ "continuous"),
digits = list(ms ~ 3, feat ~ 3, price ~ 3),
include = c(ms, feat, price),
label = list(ms = "Market Share", feat = "Feature", price = "Price"),
missing = "no"
%>%
) modify_header(
all_stat_cols() ~ "**{level}**" # Modifie les en-têtes pour supprimer "N = ..."
%>%
) modify_header(
label = "**Variable**"
%>%
) bold_labels() %>%modify_caption("Summary statistics for Yogurt data")
Variable | dannon1 | hiland1 | weight1 | yoplait1 |
---|---|---|---|---|
Market Share | 0.402(0.490) | 0.029(0.169) | 0.229(0.420) | 0.339(0.474) |
Feature | 0.038(0.191) | 0.037(0.189) | 0.038(0.191) | 0.056(0.230) |
Price | 0.082(0.011) | 0.054(0.008) | 0.079(0.008) | 0.107(0.019) |
1 Mean(SD) |
log(00.9)
[1] -0.1053605
Estimer un modèle logit avec le package mlogit
library(mlogit)
data(ModeCanada)
Apprentissage avec des données simples
library(stargazer)
<- ModeCanada |>
MC dfidx(subset = noalt == 4, alt.levels = levels(factor(unique(ModeCanada$alt)))
)
<-mlogit(choice ~ cost+freq+ovt|income ,reflevel = "bus" ,MC)
MNL%>%summary() MNL
Call:
mlogit(formula = choice ~ cost + freq + ovt | income, data = MC,
reflevel = "bus", method = "nr")
Frequencies of alternatives:choice
bus train air car
0.0035984 0.1666067 0.3738755 0.4559194
nr method
9 iterations, 0h:0m:1s
g'(-H)^-1g = 9.24E-05
successive function values within tolerance limits
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):train 4.8074276 0.6747002 7.1253 1.039e-12 ***
(Intercept):air 7.7020867 0.7861126 9.7977 < 2.2e-16 ***
(Intercept):car 3.1611780 0.6990468 4.5221 6.122e-06 ***
cost -0.0634143 0.0033165 -19.1208 < 2.2e-16 ***
freq 0.1163068 0.0043995 26.4362 < 2.2e-16 ***
ovt -0.0369923 0.0026639 -13.8867 < 2.2e-16 ***
income:train 0.0519669 0.0179975 2.8875 0.0038837 **
income:air 0.0889521 0.0181069 4.9126 8.988e-07 ***
income:car 0.0636173 0.0179400 3.5461 0.0003909 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -2031.4
McFadden R^2: 0.30034
Likelihood ratio test : chisq = 1744 (p.value = < 2.22e-16)
%>%stargazer(type = "text", title = "Résultats de la régression pour le MNL") MNL
Résultats de la régression pour le MNL
=============================================
Dependent variable:
---------------------------
choice
---------------------------------------------
(Intercept):train 4.807***
(0.675)
(Intercept):air 7.702***
(0.786)
(Intercept):car 3.161***
(0.699)
cost -0.063***
(0.003)
freq 0.116***
(0.004)
ovt -0.037***
(0.003)
income:train 0.052***
(0.018)
income:air 0.089***
(0.018)
income:car 0.064***
(0.018)
---------------------------------------------
Observations 2,779
R2 0.300
Log Likelihood -2,031.390
LR Test 1,743.974*** (df = 9)
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
?stargazer