Analyzing Data with R

Alessandra Santi
28 October 2017

Linux Day 2017 in Pisa

?R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.

https://www.r-project.org/about.html

Download R: https://cran.r-project.org/

For Linux, Mac and Windows

Linux Repository

HowTo Installation

Help Manuals in HTML, PDF, EPUB

Ubuntu -> sources.list

Ubuntu -> sources.list -> mirror

Ubuntu -> sudo apt-get update

Ubuntu -> sudo apt-get install r-base

Terminal -> R

R environment -> install package (e.g. dplyr)

?RStudio

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).

https://www.rstudio.com/products/rstudio/

https://www.rstudio.com/

Open Source Edition and Commercial License

Libraries

Cheatsheets

Focus and details

Packages and Help

Packages

install.packages('dplyr')   # Download and install
library(dplyr)              # Load
data(iris)                  # Load a built-in dataset

Help

?mean                       # Help for the function mean
help(package = 'dplyr')     # Help for a package

Working Directory

Where am I?

getwd()                     # Find the current working directory (where inputs are found and outputs are sent)
setwd('C://file/path')      # Change the current working directory
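A minimal sketch of the two calls above, assuming the session's temporary directory as the target (the names `new_wd` and `old_wd` are illustrative). `setwd()` invisibly returns the previous directory, which makes it easy to go back afterwards.

```r
# Remember where we are, move to a temporary directory, then go back.
new_wd <- tempdir()          # a directory that exists in every R session
old_wd <- setwd(new_wd)      # setwd() invisibly returns the previous directory
moved  <- normalizePath(getwd()) == normalizePath(new_wd)

setwd(old_wd)                # restore the original working directory
restored <- normalizePath(getwd()) == normalizePath(old_wd)
```

Both `moved` and `restored` should be TRUE; `normalizePath()` is used only so that symbolic links do not make equal paths look different.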

Table of contents

- Vectors
- Data Frames
- Libraries
  - magrittr
  - dplyr
  - ggplot2
  - rmarkdown

Vectors and Data Frame - Data Containers

1 2 3 …

a <- 5

y <- f(x)

Vectors

(v <- c(1, 2, 3, 4, 5))

[1] 1 2 3 4 5

(v1 <- 1:5)

[1] 1 2 3 4 5

(v2 <- seq(from=1, to=5, by=1))

[1] 1 2 3 4 5

Vectors

(v3 <- rep(1:2, times=3))

[1] 1 2 1 2 1 2

(v4 <- rep(1:2, each=3))

[1] 1 1 1 2 2 2

Selecting Vector Elements

v5 <- rnorm(3)
v5

[1]  1.019617 -1.726204 -1.382875

v5[2]

[1] -1.726204

v5[-2]

[1]  1.019617 -1.382875

Selecting Vector Elements

v5

[1]  1.019617 -1.726204 -1.382875

v5[2:3]

[1] -1.726204 -1.382875

v5[-(2:3)]

[1] 1.019617

Selecting Vector Elements

v5

[1]  1.019617 -1.726204 -1.382875

v5[c(1, 2)]

[1]  1.019617 -1.726204

v5[-c(1,2)]

[1] -1.382875

Selecting Vector Elements

v5

[1]  1.019617 -1.726204 -1.382875

v5[v5 < 0]

[1] -1.726204 -1.382875

fruits <- c("apple", "pear", "apple")
fruits[fruits == "apple"]

[1] "apple" "apple"
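The selection rules on these slides come down to three kinds of index: positive positions, negative positions, and logical vectors. A small sketch with fixed values (the vector `x` is made up for illustration, so the results do not change between runs as they do with `rnorm()`):

```r
x <- c(10, -3, 7, -8, 5)

x[c(1, 3)]        # positions 1 and 3                -> 10 7
x[-1]             # everything except the first      -> -3 7 -8 5
x[x < 0]          # logical index                    -> -3 -8
which(x < 0)      # positions where the test is TRUE -> 2 4
x[x > 0 & x < 8]  # two conditions combined with &   -> 7 5
```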

seq with Date Series

t <- seq(from = as.Date("2017-01-01"), 
         to = as.Date("2017-12-31"), 
         by = "days")
t[1:3]

[1] "2017-01-01" "2017-01-02" "2017-01-03"

t2 <- seq(from = as.Date("2017-01-01"), 
          to = as.Date("2017-12-31"), 
          by = "months")
t2[1:3]

[1] "2017-01-01" "2017-02-01" "2017-03-01"
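`seq()` on dates also accepts `length.out` in place of `to`, and `format()` converts the result to text. A short sketch (the variable names `w` and `d` are illustrative):

```r
# Four weekly dates starting on 1 January 2017
w <- seq(from = as.Date("2017-01-01"), by = "week", length.out = 4)
format(w, "%d/%m/%Y")   # "01/01/2017" "08/01/2017" "15/01/2017" "22/01/2017"

# Number of elements in the year-long daily sequence
d <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-12-31"), by = "day")
length(d)               # 365
```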

Data Frame

df <- data.frame(
        Fruit=c("apple","pear","melon"),
        kg=c(100, 250, 560),
        PrUn=c(2.0, 4.5, 1.2)
      )
df

  Fruit  kg PrUn
1 apple 100  2.0
2  pear 250  4.5
3 melon 560  1.2

Read and write external data

Read

df <- read.table("file.txt")
df <- read.csv("file.csv")
df <- read.csv(file.choose())
df <- readxl::read_excel(file.choose(), sheet = "Sheet1")  # requires the readxl package

Write

write.table(df, "file.txt")
write.csv(df, "file.csv")
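A sketch of a full write-then-read round trip, assuming a temporary file instead of a real path (the names `fruit`, `path` and `back` are illustrative). `row.names = FALSE` is set because `write.csv()` otherwise writes the row names as an extra first column.

```r
fruit <- data.frame(Fruit = c("apple", "pear", "melon"),
                    kg    = c(100, 250, 560),
                    PrUn  = c(2.0, 4.5, 1.2),
                    stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")         # throwaway file name
write.csv(fruit, path, row.names = FALSE)  # write without the row-name column
back <- read.csv(path, stringsAsFactors = FALSE)

same_names <- identical(names(fruit), names(back))  # column names survive
same_rows  <- nrow(back) == nrow(fruit)             # so does the row count
unlink(path)                                        # clean up the file
```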

Selecting Data Frame Elements

df

  Fruit  kg PrUn
1 apple 100  2.0
2  pear 250  4.5
3 melon 560  1.2

df[1:2, ]

  Fruit  kg PrUn
1 apple 100  2.0
2  pear 250  4.5

df[1, 1]

[1] apple
Levels: apple melon pear

Selecting Data Frame Elements

df

  Fruit  kg PrUn
1 apple 100  2.0
2  pear 250  4.5
3 melon 560  1.2

df[ , 'Fruit']

[1] apple pear  melon
Levels: apple melon pear
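Rows can also be selected with a logical condition, in the same way as vector elements. The sketch below rebuilds the fruit table under the illustrative name `fr` with `stringsAsFactors = FALSE`, so that `Fruit` stays a character column; the "Levels:" lines in the output above appear because R versions before 4.0 converted strings to factors by default.

```r
fr <- data.frame(Fruit = c("apple", "pear", "melon"),
                 kg    = c(100, 250, 560),
                 PrUn  = c(2.0, 4.5, 1.2),
                 stringsAsFactors = FALSE)

fr[fr$kg > 200, ]                     # rows where kg exceeds 200 (pear, melon)
fr[fr$kg > 200, c("Fruit", "PrUn")]   # same rows, two columns only
fr$Fruit[fr$PrUn == max(fr$PrUn)]     # "pear": highest unit price
subset(fr, kg > 200, select = Fruit)  # subset() expresses the same filter
```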

Add column in Data Frame

df['tot'] <- df['kg'] * df['PrUn']
df

  Fruit  kg PrUn  tot
1 apple 100  2.0  200
2  pear 250  4.5 1125
3 melon 560  1.2  672

df$tot <- df$kg * df$PrUn
df

  Fruit  kg PrUn  tot
1 apple 100  2.0  200
2  pear 250  4.5 1125
3 melon 560  1.2  672

Delete column in Data Frame

df['tot'] <- NULL
df

  Fruit  kg PrUn
1 apple 100  2.0
2  pear 250  4.5
3 melon 560  1.2

df$tot <- NULL
df

  Fruit  kg PrUn
1 apple 100  2.0
2  pear 250  4.5
3 melon 560  1.2

Structure of Data Frame

df

  Fruit  kg PrUn
1 apple 100  2.0
2  pear 250  4.5
3 melon 560  1.2

dim(df)

[1] 3 3

names(df)

[1] "Fruit" "kg"    "PrUn"

Structure of Data Frame

df

  Fruit  kg PrUn
1 apple 100  2.0
2  pear 250  4.5
3 melon 560  1.2

str(df)

'data.frame':   3 obs. of  3 variables:
 $ Fruit: Factor w/ 3 levels "apple","melon",..: 1 3 2
 $ kg   : num  100 250 560
 $ PrUn : num  2 4.5 1.2

head() and tail()

df1 <- data.frame(a=sample(x=1:100), b=rnorm(100))

head(df1, 2)

   a          b
1 40 -0.9167949
2 70  0.7531830

tail(df1, 2)

     a          b
99  73  0.5667689
100 42 -0.1077406
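The values in df1 come from `sample()` and `rnorm()`, so they differ on every run. A sketch of making such data reproducible with `set.seed()` (the seed 42 and the names `r1` and `r2` are arbitrary choices for illustration):

```r
set.seed(42)                          # fix the random number generator state
r1 <- data.frame(a = sample(x = 1:100), b = rnorm(100))

set.seed(42)                          # same seed, same draws
r2 <- data.frame(a = sample(x = 1:100), b = rnorm(100))

identical(r1, r2)                     # TRUE
nrow(head(r1, 2))                     # 2: head() keeps the first n rows
nrow(tail(r1, 2))                     # 2: tail() keeps the last n rows
```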

plot()

head(df1)

   a          b
1 40 -0.9167949
2 70  0.7531830
3 87  0.3538634
4 76  0.9907758
5 50  1.4268626
6 68  0.1808850

plot(df1)

plot()

(df1.10 <- head(df1, 10))

    a          b
1  40 -0.9167949
2  70  0.7531830
3  87  0.3538634
4  76  0.9907758
5  50  1.4268626
6  68  0.1808850
7  11  0.2417842
8   4  1.2177763
9  34 -0.7826337
10 80 -0.3418487

plot(df1.10$a,type="l", col="red")

summary()

summary(df1)

       a                b           
 Min.   :  1.00   Min.   :-2.11038  
 1st Qu.: 25.75   1st Qu.:-0.70474  
 Median : 50.50   Median :-0.10633  
 Mean   : 50.50   Mean   :-0.07583  
 3rd Qu.: 75.25   3rd Qu.: 0.53946  
 Max.   :100.00   Max.   : 2.97220

Linear Regression - lm()

lr <- lm(formula = a ~ b, data = df1)
lr


Call:
lm(formula = a ~ b, data = df1)

Coefficients:
(Intercept)            b  
     50.750        3.297

par(mfrow=c(2,2))
plot(lr)

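Printing the model shows only the coefficients. A sketch of extracting them and making a prediction, using made-up data with an exact linear relation (y = 3 + 2x) so the expected values are known in advance; the names `d` and `fit` are illustrative.

```r
x <- 1:10
y <- 3 + 2 * x                       # exact line: intercept 3, slope 2
d <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = d)
coef(fit)                            # intercept 3 and slope 2, up to rounding error
predict(fit, newdata = data.frame(x = 11))   # 25, up to rounding error
```

With real, noisy data such as df1, the same calls apply; `summary(fit)` then adds standard errors and R-squared.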

tidyverse - collection of packages

magrittr - more readable code

pipe %>%

Root Mean Square with %>%

v <- c(22, 3, 2, 5, 6, 77, 8, 11, 9, 7, 57)

sqrt(mean(v^2))

[1] 30.22792

library(magrittr)

v^2 %>% mean() %>% sqrt()

[1] 30.22792
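The pipe passes its left-hand side as the first argument of the call on its right, so `x %>% f(y)` is read as `f(x, y)`. The sketch below shows the same computation without loading magrittr, using the native `|>` operator; this assumes R 4.1 or later, an operator that did not exist when these slides were written.

```r
v <- c(22, 3, 2, 5, 6, 77, 8, 11, 9, 7, 57)

nested <- sqrt(mean(v^2))            # read inside-out
piped  <- v^2 |> mean() |> sqrt()    # read left to right

all.equal(nested, piped)             # TRUE
round(piped, 5)                      # 30.22792
```

Because `^` binds more tightly than either pipe operator, `v^2` is evaluated first and no parentheses are needed around it.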

dplyr - a grammar of data manipulation

dplyr

select() select columns
filter() filter rows
summarise() summarise values
arrange() re-order or arrange rows
mutate() create new columns
group_by() group data

dplyr in RStudio

dplyr

library(dplyr)
df_a <- data.frame(year = rep(2013:2017,each=12),
          month = rep(1:12, times=5),
          num = sample(1:20),
          weight = runif(n = 60, min = 1.0,
                    max = 10.0))
head(df_a, 4)

  year month num   weight
1 2013     1  20 6.249453
2 2013     2  18 4.340047
3 2013     3  12 4.000076
4 2013     4  10 7.860788

dplyr - select()

df_a_select <- select(df_a,
                      year, num, weight)
head(df_a_select)

  year num   weight
1 2013  20 6.249453
2 2013  18 4.340047
3 2013  12 4.000076
4 2013  10 7.860788
5 2013  14 1.957468
6 2013   6 9.132897

dplyr - filter()

df_a_filter <- filter(df_a, 
                      month == 2, num > 3)
head(df_a_filter)

  year month num   weight
1 2013     2  18 4.340047
2 2014     2  15 8.910923
3 2015     2   6 3.393272
4 2016     2  13 8.434680
5 2017     2   9 1.946008

dplyr - summarise()

df_a_summarise <- summarise(df_a, 
                        avg_n=mean(num),
                        avg_w=mean(weight))
head(df_a_summarise)

  avg_n    avg_w
1  10.5 5.630199

dplyr - arrange()

df_a_arrange <- arrange(df_a, desc(num))
head(df_a_arrange)

  year month num   weight
1 2013     1  20 6.249453
2 2014     9  20 6.409159
3 2016     5  20 9.276693
4 2014     1  19 1.851002
5 2015     9  19 1.687055
6 2017     5  19 9.145018

dplyr - mutate()

df_a_mutate <- mutate(df_a, tot=num*weight)
head(df_a_mutate)

  year month num   weight       tot
1 2013     1  20 6.249453 124.98906
2 2013     2  18 4.340047  78.12085
3 2013     3  12 4.000076  48.00091
4 2013     4  10 7.860788  78.60788
5 2013     5  14 1.957468  27.40455
6 2013     6   6 9.132897  54.79738
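The verbs shown one at a time above can be chained into a single pipeline. A sketch on a small fixed table (the name `fruit_df` and its values are illustrative, reusing the fruit example from the data frame slides) so that the result can be checked by hand:

```r
library(dplyr)

fruit_df <- data.frame(Fruit = c("apple", "pear", "melon"),
                       kg    = c(100, 250, 560),
                       PrUn  = c(2.0, 4.5, 1.2),
                       stringsAsFactors = FALSE)

res <- fruit_df %>%
  filter(kg > 150) %>%          # keep pear and melon
  mutate(tot = kg * PrUn) %>%   # add the total value: 1125 and 672
  arrange(desc(tot)) %>%        # largest total first
  select(Fruit, tot)            # keep two columns

res$Fruit                       # "pear" "melon"
```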

dplyr %>%

dplyr - group_by() - summarise() with %>%

df_a_gr <- df_a %>% 
           group_by(year) %>%
           summarise(avg_n=mean(num), 
                     avg_w=mean(weight))
head(df_a_gr)

# A tibble: 5 x 3
   year     avg_n    avg_w
  <int>     <dbl>    <dbl>
1  2013 10.083333 4.946174
2  2014 12.416667 5.868181
3  2015  9.916667 4.284383
4  2016 10.333333 6.556034
5  2017  9.750000 6.496225
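For comparison, base R's `aggregate()` can express the same grouped mean without dplyr. This is a different technique from the pipeline above, shown on a small made-up data frame (`toy`) so the result can be verified by hand.

```r
toy <- data.frame(year   = c(2016, 2016, 2017, 2017),
                  num    = c(2, 4, 10, 20),
                  weight = c(1.0, 3.0, 5.0, 7.0))

# Mean of num and weight within each year
by_year <- aggregate(cbind(num, weight) ~ year, data = toy, FUN = mean)
by_year
#   year num weight
# 1 2016   3      2
# 2 2017  15      6
```

The dplyr version reads more naturally once several steps are chained; `aggregate()` is handy when no extra packages are available.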

dplyr - group_by() - summarise_all() with %>%

df_a_gr <- df_a %>% 
           group_by(year) %>%
           summarise_all(mean)  # funs() is deprecated in recent dplyr versions
head(df_a_gr)

# A tibble: 5 x 4
   year month       num   weight
  <int> <dbl>     <dbl>    <dbl>
1  2013   6.5 10.083333 4.946174
2  2014   6.5 12.416667 5.868181
3  2015   6.5  9.916667 4.284383
4  2016   6.5 10.333333 6.556034
5  2017   6.5  9.750000 6.496225

ggplot2

library(ggplot2)

t <- seq(from=as.Date("2017-01-01"), 
         to=as.Date("2017-12-31"), 
         by="days")

val <- rnorm(365)

df <- data.frame(tempo=t, valori=val)

f <- ggplot(df, aes(x=tempo, y=valori)) +
    geom_line() +
    theme_bw()
f   # print the object to draw the plot

ggplot2 - examples

start <- as.Date("2017-01-01")
df9 <- data.frame(date=seq(from=start, to=start+99, by="days"),
                  descr=sample(x=c("cat","dog","mouse"), size=100, replace=TRUE),
                  val=sample(x=1:30, size=100, replace=TRUE))
head(df9, 4)

        date descr val
1 2017-01-01 mouse  20
2 2017-01-02 mouse  20
3 2017-01-03 mouse   2
4 2017-01-04 mouse  22

ggplot2 - examples

g   <- ggplot(df9, aes(descr, val))

gg  <- ggplot(df9, aes(x=val))

gg2 <- ggplot(df9, aes(x=val, fill=descr))

gg3 <- ggplot(df9, aes(x=descr, y=val, fill=descr))

g + geom_violin()

g + geom_violin(aes(fill=descr))

g + geom_violin() + geom_jitter()

g + geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))

gg + geom_histogram()

gg2 + geom_histogram()

gg2 + geom_density(alpha=.3)

gg2 + geom_density(alpha=.3) + facet_grid(descr ~.)

gg3 + geom_boxplot(alpha=.3)

gg3 + geom_boxplot() + stat_summary()

rmarkdown

R Notebook

R Presentation (used for these slides :)

Slides (R Presentation)

“Analyzing Data with R”

Author: Alessandra Santi

E-mail: santi.info@gmail.com

License: CC-BY-SA
