Introduction to R Markdown

Remember that the Markdown documentation is available under Help > Markdown Quick Reference.

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Introduction to R

Variables persist throughout the session

x = 10
x
## [1] 10

If this block is executed several times in a row, the document can end up in a state that looks inconsistent. In that case, simply re-run all the cells above it.

x = x+10
x
## [1] 20

When knitting, by contrast, all the R code is re-executed from the beginning (but in a separate process!).
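
As a small sketch of the effect described above (the name `x_fresh` is purely illustrative and not part of the original notes), running the increment twice in an interactive session gives a different value from a clean top-to-bottom run:

```r
# Interactive session: the increment chunk is run twice by mistake.
x <- 10
x <- x + 10   # first execution of the chunk
x <- x + 10   # accidental second execution
x             # 30, not the 20 shown in the document

# A clean top-to-bottom run (what knitting does) executes each chunk once.
x_fresh <- 10
x_fresh <- x_fresh + 10
x_fresh       # 20
```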

Arrays (vectors)

A few examples of arrays and how to manipulate them.

tab = c(2,5, 65,33)
tab
## [1]  2  5 65 33
tab[2]
## [1] 5
length(tab)
## [1] 4

In R, always prefer vectorized code and avoid for loops.

# for(i in 1:length(tab)) {
#   tab[i] = tab[i]*tab[i]
# }
tab = tab*tab
tab
## [1]    4   25 4225 1089
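
To make the comparison concrete, here is a minimal sketch that computes the squares both ways and checks that the results agree. The names `squared_loop` and `squared_vec` are introduced only for this illustration.

```r
tab <- c(2, 5, 65, 33)

# Loop version: square each element one at a time.
squared_loop <- numeric(length(tab))
for (i in seq_along(tab)) {
  squared_loop[i] <- tab[i] * tab[i]
}

# Vectorized version: a single expression operates on the whole vector.
squared_vec <- tab * tab

identical(squared_loop, squared_vec)  # TRUE
```

The vectorized form is shorter and leaves the iteration to R's internal code, which is usually faster than an explicit loop.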

Vectors are automatically extended (recycled) to the same length when that makes sense.

log(tab + c(1,4))
## [1] 1.609438 3.367296 8.349011 6.996681
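
The recycling rule can be spelled out explicitly. The sketch below (variable names are illustrative) shows that the short vector is repeated to the length of the longer one, and that R warns when the longer length is not a multiple of the shorter one.

```r
tab <- c(4, 25, 4225, 1089)

# c(1, 4) is recycled to c(1, 4, 1, 4) before the addition.
recycled <- tab + c(1, 4)
explicit <- tab + rep(c(1, 4), length.out = length(tab))
identical(recycled, explicit)  # TRUE

# When the longer length is not a multiple of the shorter one,
# R still recycles but emits a warning; here we capture its message.
res <- tryCatch(1:4 + 1:3, warning = function(w) conditionMessage(w))
res
```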

A few initialization examples.

c(10, 20, 30, 50, 100, 200, 300)
## [1]  10  20  30  50 100 200 300
seq.int(from = 10, to = 200, by = 50)
## [1]  10  60 110 160
1:12
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
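
A few other base R constructors are often useful alongside the ones above. This is only a small selection, and the variable names are illustrative.

```r
grid <- seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
pattern <- rep(c(1, 2), times = 3)             # 1 2 1 2 1 2
zeros <- numeric(4)                            # 0 0 0 0

# seq_len(n) is safer than 1:n when n may be 0:
empty <- seq_len(0)                            # integer(0)
backwards <- 1:0                               # 1 0 (counts down!)
```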

Data frames

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
mtcars[1:5,3:5]
##                   disp  hp drat
## Mazda RX4          160 110 3.90
## Mazda RX4 Wag      160 110 3.90
## Datsun 710         108  93 3.85
## Hornet 4 Drive     258 110 3.08
## Hornet Sportabout  360 175 3.15
cars$newcol = cos(cars$speed) + sin(cars$dist)
cars
##    speed dist       newcol
## 1      4    2  0.255653806
## 2      4   10 -1.197664732
## 3      7    4 -0.002900241
## 4      7   22  0.745050945
## 5      8   16 -0.433403350
## 6      9   10 -1.455151373
## 7     10   18 -1.590058776
## 8     10   26 -0.076513079
## 9     10   34 -0.309988843
## 10    11   17 -0.956971794
## 11    11   28  0.275331486
## 12    12   14  1.834461314
## 13    12   20  1.756799209
## 14    12   24 -0.061724403
## 15    12   28  1.114759747
## 16    13   26  1.670005232
## 17    13   34  1.436529468
## 18    13   34  1.436529468
## 19    13   46  1.809235129
## 20    14   26  0.899295669
## 21    14   36 -0.855041635
## 22    14   60 -0.168073403
## 23    14   80 -0.857151436
## 24    15   20  0.153257338
## 25    15   26  0.002870538
## 26    15   54 -1.318476962
## 27    16   32 -0.406232799
## 28    16   40 -0.212546320
## 29    17   32  0.276263343
## 30    17   40  0.469949822
## 31    17   50 -0.537538192
## 32    18   42 -0.256204840
## 33    18   56  0.138765706
## 34    18   76  1.226424345
## 35    18   84  1.393507028
## 36    19   36 -0.003074235
## 37    19   46  1.890492966
## 38    19   68  0.090776937
## 39    20   32  0.959508743
## 40    20   48 -0.360172600
## 41    20   52  1.394709654
## 42    20   56 -0.113468940
## 43    20   64  1.328108100
## 44    22   66 -1.026511980
## 45    23   54 -1.091622069
## 46    24   70  1.198069689
## 47    24   92 -0.355287062
## 48    24   93 -0.524103134
## 49    24  120  1.004790192
## 50    25   85  0.815127192
mtcars[mtcars$mpg>30 | mtcars$mpg<15,]
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
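
The row filter above relies on a logical mask with one value per row. The following sketch makes that mask explicit, shows an equivalent call to `subset()`, and illustrates that adding a column to a copy leaves the built-in dataset unchanged. The names `extreme`, `sel`, and `cars_copy` are assumptions made for this example.

```r
# Logical mask: one TRUE/FALSE per row of mtcars.
extreme <- mtcars$mpg > 30 | mtcars$mpg < 15
sum(extreme)                       # 9 rows match

# Same filter written with subset(), keeping only two columns.
sel <- subset(mtcars, mpg > 30 | mpg < 15, select = c(mpg, cyl))
dim(sel)                           # 9 2

# Adding a column works on a copy; the packaged dataset is untouched.
cars_copy <- datasets::cars
cars_copy$newcol <- cos(cars_copy$speed) + sin(cars_copy$dist)
ncol(cars_copy)                    # 3
ncol(datasets::cars)               # 2
```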

Let's create a new data frame.

df = data.frame(col1 = 1:10, col2 = runif(10))
df
##    col1      col2
## 1     1 0.3796499
## 2     2 0.3528289
## 3     3 0.8419026
## 4     4 0.6249661
## 5     5 0.3731026
## 6     6 0.3816953
## 7     7 0.5830887
## 8     8 0.1275146
## 9     9 0.3156694
## 10   10 0.7647084
df = rbind(df,df)
dim(df)
## [1] 20  2
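
Because `runif()` draws random numbers, the values in `col2` change on every run. One common way to make such an example reproducible is to fix the seed first. The sketch below assumes an arbitrary seed of 42 and uses illustrative names.

```r
set.seed(42)  # arbitrary seed, chosen only for illustration
df1 <- data.frame(col1 = 1:10, col2 = runif(10))

set.seed(42)  # resetting the seed replays the same random draws
df2 <- data.frame(col1 = 1:10, col2 = runif(10))
identical(df1, df2)   # TRUE

# rbind() stacks the rows of both data frames.
stacked <- rbind(df1, df2)
dim(stacked)          # 20 2
```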

A/B test

We sample N = 200 people. Each person is shown either design A or design B, and at the end we know whether they were satisfied or not. One possible strategy is to show each design to N/2 people and pick the design that was liked the most. This is reasonable and often gives the right answer, i.e., it does identify the most popular design. But not always... How reliable is this approach really?

N = 200
pA = 0.85
pB = 0.9

NRepeat = 100000


# repA = rep.int(0, times=N/2)  # Initialisation
# for (i in 1:length(repA)) {   # Yuck!!!
#    if(runif(1)<pA)  {repA[i] = 1} else {repA[i] = 0}
# }

# repA = runif(N/2)<pA          # Better
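success = 0
for(i in 1:NRepeat) {
  # Simulate N/2 satisfaction answers (1 = satisfied) for each design
  repA = sample(size=N/2, x = c(0,1), prob = c(1-pA,pA), replace=TRUE)  # Much better!
  repB = sample(size=N/2, x = c(0,1), prob = c(1-pB,pB), replace=TRUE)  # Much better!
  # B is truly the better design (pB > pA): count the runs where B wins
  if(sum(repA)<sum(repB)) {
    success = success + 1
  }
}
success/NRepeat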
## [1] 0.83309
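
The simulation loop can itself be vectorized. The sketch below is one possible variant, not the original author's code: it uses `rbinom()` to draw the number of satisfied users per experiment directly, with an arbitrary seed and illustrative variable names. As in the loop, ties count as failures.

```r
set.seed(1)  # arbitrary seed so that the sketch is reproducible
N <- 200
pA <- 0.85
pB <- 0.9
NRepeat <- 100000

# Each draw is the number of satisfied users among N/2 in one experiment.
satisfiedA <- rbinom(NRepeat, size = N / 2, prob = pA)
satisfiedB <- rbinom(NRepeat, size = N / 2, prob = pB)

# Fraction of experiments in which B (the truly better design) wins.
reliability_est <- mean(satisfiedA < satisfiedB)
reliability_est
```

The estimate should land close to the value produced by the loop, up to simulation noise.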

And how does this reliability change with N?

# Estimate how often the truly better design B wins an A/B test of size N
reliability = function(N = 200, pA = 0.5, pB = 0.6, NRepeat = 100000) {
  success = 0
  for(i in 1:NRepeat) {
    repA = sample(size=N/2, x = c(0,1), prob = c(1-pA,pA), replace=TRUE)  # Much better!
    repB = sample(size=N/2, x = c(0,1), prob = c(1-pB,pB), replace=TRUE)  # Much better!
    if(sum(repA)<sum(repB)) {
      success = success + 1
    }
  }
  return(success/NRepeat)
}

Nsamples = c(10,50,100,200,300)
Rely = c()
for(N in Nsamples) {
  Rely = c(Rely, reliability(N))
}
plot(Nsamples,Rely, ylim=c(0,1))
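
For larger experiments, a vectorized variant of the function may be more convenient. The sketch below is a hypothetical alternative (the names `reliability_fast`, `Nvalues`, and `Rely_fast` are not from the original notes) that keeps the same defaults and the same convention that ties count as failures.

```r
# Hypothetical vectorized variant of reliability(); same default arguments.
reliability_fast <- function(N = 200, pA = 0.5, pB = 0.6, NRepeat = 100000) {
  a <- rbinom(NRepeat, size = N / 2, prob = pA)  # satisfied users, design A
  b <- rbinom(NRepeat, size = N / 2, prob = pB)  # satisfied users, design B
  mean(a < b)                                    # share of runs where B wins
}

set.seed(1)  # arbitrary seed, for reproducibility only
Nvalues <- c(10, 50, 100, 200, 300)
Rely_fast <- sapply(Nvalues, reliability_fast)
plot(Nvalues, Rely_fast, ylim = c(0, 1), type = "b",
     xlab = "N (total sample size)", ylab = "Estimated reliability")
```

Using `sapply()` avoids growing the result vector inside a loop, in line with the earlier advice to prefer vectorized code.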