Data Analysis

Author

Nagender Aneja

Book

Modern Statistics with R

From wrangling and exploring data to inference and predictive modelling, https://www.modernstatisticswithr.com/

https://www.amazon.com/Modern-Statistics-wrangling-exploring-predictive/dp/9152701514

Måns Thulin

1 Basics

1.1 Installing R and RStudio

R
RStudio

1.2 A first look at RStudio

Three or four panels:
- Environment panel
  - shows a list of the data that has been imported or created
- Files, Plots and Help panel
  - list of available files
  - view graphs
  - help documents for different parts of R
- Console panel
  - used for running code
- Script panel
  - used for writing code

1.3 Running R code

1+1

[1] 2

2*2

[1] 4

1.3.1 R Scripts

Create a new script
- Press Ctrl + Shift + N
- File > New File > R Script

1+1

[1] 2

2*2

[1] 4

1+2*3-5

[1] 2

(1+2)*3-5

[1] 4

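Run the entire script
- Press the Source button
- Press Ctrl+Shift+Enter (runs the script and prints the code in the Console)
- Press Ctrl+Shift+S (runs the script without printing the code)
Run part of the script
- Press the Run button
- Press Ctrl+Enter (runs the current line or the selected lines)
Save the script
- File > Save
- Ctrl + S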

1.4 Variables and functions

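Variable names are case sensitive. Common naming styles:
- snake_case
- camelCase or CamelCase
- period.case (avoid)
- Characters not allowed in names
  - -, +, *, :, =, ! and $
Comments
- Start with #; R ignores the rest of the line
- Ctrl + Shift + C comments or uncomments the current line
- Select several lines and press Ctrl + Shift + C to comment them all at once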
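
The naming rules above can be tried out directly. This is a minimal sketch; the variable names (book_price, total_cost and so on) are made up for illustration and do not come from the book.

```r
# Names are case sensitive: these are two different variables
book_price <- 10
Book_Price <- 25
book_price == Book_Price   # FALSE

# snake_case and camelCase are both valid naming styles
total_cost <- book_price * 2
totalCost <- Book_Price * 2

# A character such as "-" cannot be part of a name:
# R reads total-cost as "total minus cost"
total <- 5
cost <- 3
total-cost                 # 2, a subtraction
```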

1.4.1 Storing Data

x <- 4
x

[1] 4

x + 1

[1] 5

x + x

[1] 8

2 + 2 -> y
y

[1] 4

income <- 200; taxes <- 30
income; taxes

[1] 200

[1] 30

income2 <- taxes2 <- 100
income2; taxes2

[1] 100

[1] 100

1.4.2 Vectors and data frames

age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
age

[1] 28 48 47 71 22 80 48 30 31

purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
purchase

[1]  20  59   2  12  22 160  34  34  29

age_months <- age * 12
age_months

[1] 336 576 564 852 264 960 576 360 372

Data Frames

bookstore <- data.frame(age, purchase)
bookstore

  age purchase
1  28       20
2  48       59
3  47        2
4  71       12
5  22       22
6  80      160
7  48       34
8  30       34
9  31       29

# A long vector can be split over several lines, breaking after the commas
distances <- c(687, 5076, 7270, 
               967, 6364, 1683, 
               9394, 5712, 5206,
               4317, 9411, 5625, 
               9725, 4977, 2730, 
               5648, 3818, 8241, 
               5547, 1637, 4428, 
               8584, 2962, 5729, 
               5325, 4370, 5989,
               9030, 5532, 9623)
distances

 [1]  687 5076 7270  967 6364 1683 9394 5712 5206 4317 9411 5625 9725 4977 2730
[16] 5648 3818 8241 5547 1637 4428 8584 2962 5729 5325 4370 5989 9030 5532 9623

height = c(155, 158, 160, 162, 166)
weight = c(50, 55, 60, 65, 70)
data = data.frame(height, weight)
data

  height weight
1    155     50
2    158     55
3    160     60
4    162     65
5    166     70

x <- 1:5
x

[1] 1 2 3 4 5

y <- 4:1
y

[1] 4 3 2 1

c(x, y)

[1] 1 2 3 4 5 4 3 2 1

1.4.3 Functions

# Compute the mean age of bookstore customers
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
mean(age)

[1] 45

# Compute the correlation between the variables age and purchase
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)
cor(age, purchase)

[1] 0.589402

cor(age, purchase, method = "spearman")

[1] 0.3487395
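
The method argument selects which correlation coefficient cor computes (the default is "pearson"). As a small check of what "spearman" means, the sketch below compares it with the Pearson correlation of the ranks of the same data; the variable names spearman and pearson_on_ranks are introduced here only for illustration.

```r
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)

# Spearman correlation, requested through the method argument
spearman <- cor(age, purchase, method = "spearman")

# Pearson correlation computed on the ranks of the observations
pearson_on_ranks <- cor(rank(age), rank(purchase))

# The two approaches agree
all.equal(spearman, pearson_on_ranks)
```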

Sys.time()

[1] "2022-12-25 16:27:09 EST"

height = c(155, 158, 160, 162, 166)
weight = c(50, 55, 60, 65, 70)
data = data.frame(height, weight)

mean(data$height)

[1] 160.2

cor(data$height, data$weight)

[1] 0.9912407

1.4.4 Mathematical operations

11 %% 2

[1] 1

abs(-4)

[1] 4

sqrt(2)

[1] 1.414214

log(100)

[1] 4.60517

log(8, base = 2)

[1] 3

2^3

[1] 8

exp(2)

[1] 7.389056

sum(c(1,2,3,4,5))

[1] 15

prod(c(1,2,3,4,5))

[1] 120

factorial(5)

[1] 120

choose(5, 3) # Binomial coefficient "5 choose 3"

[1] 10
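
As a quick arithmetic check, the binomial coefficient can also be written out with factorials. A minimal sketch, where n, k and by_formula are names chosen for this example:

```r
n <- 5
k <- 3

# n! / (k! * (n - k)!) = 120 / (6 * 2)
by_formula <- factorial(n) / (factorial(k) * factorial(n - k))
by_formula                   # 10

# Same value as the built-in function
by_formula == choose(n, k)   # TRUE
```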

1.5 Packages

#install.packages("ggplot2")
library(ggplot2)

1.6 Descriptive statistics

Two datasets from the ggplot2 package are used in this chapter
- diamonds: prices of more than 50,000 cut diamonds
- msleep: sleep times of 83 mammals
Explore the msleep dataset
- printing it shows that it is a tibble with 83 rows and 11 columns
- only the first 10 rows and the columns that fit on screen are displayed

library(ggplot2)
msleep

# A tibble: 83 × 11
   name         genus vore  order conse…¹ sleep…² sleep…³ sleep…⁴ awake  brainwt
   <chr>        <chr> <chr> <chr> <chr>     <dbl>   <dbl>   <dbl> <dbl>    <dbl>
 1 Cheetah      Acin… carni Carn… lc         12.1    NA    NA      11.9 NA      
 2 Owl monkey   Aotus omni  Prim… <NA>       17       1.8  NA       7    0.0155 
 3 Mountain be… Aplo… herbi Rode… nt         14.4     2.4  NA       9.6 NA      
 4 Greater sho… Blar… omni  Sori… lc         14.9     2.3   0.133   9.1  0.00029
 5 Cow          Bos   herbi Arti… domest…     4       0.7   0.667  20    0.423  
 6 Three-toed … Brad… herbi Pilo… <NA>       14.4     2.2   0.767   9.6 NA      
 7 Northern fu… Call… carni Carn… vu          8.7     1.4   0.383  15.3 NA      
 8 Vesper mouse Calo… <NA>  Rode… <NA>        7      NA    NA      17   NA      
 9 Dog          Canis carni Carn… domest…    10.1     2.9   0.333  13.9  0.07   
10 Roe deer     Capr… herbi Arti… lc          3      NA    NA      21    0.0982 
# … with 73 more rows, 1 more variable: bodywt <dbl>, and abbreviated variable
#   names ¹conservation, ²sleep_total, ³sleep_rem, ⁴sleep_cycle

View the whole dataset
- View(msleep) opens it in a spreadsheet-like viewer
- Some cells contain NA, the placeholder for missing data

#View(msleep)

Useful functions for finding information about a data frame

head(msleep)

# A tibble: 6 × 11
  name  genus vore  order conse…¹ sleep…² sleep…³ sleep…⁴ awake  brainwt  bodywt
  <chr> <chr> <chr> <chr> <chr>     <dbl>   <dbl>   <dbl> <dbl>    <dbl>   <dbl>
1 Chee… Acin… carni Carn… lc         12.1    NA    NA      11.9 NA        50    
2 Owl … Aotus omni  Prim… <NA>       17       1.8  NA       7    0.0155    0.48 
3 Moun… Aplo… herbi Rode… nt         14.4     2.4  NA       9.6 NA         1.35 
4 Grea… Blar… omni  Sori… lc         14.9     2.3   0.133   9.1  0.00029   0.019
5 Cow   Bos   herbi Arti… domest…     4       0.7   0.667  20    0.423   600    
6 Thre… Brad… herbi Pilo… <NA>       14.4     2.2   0.767   9.6 NA         3.85 
# … with abbreviated variable names ¹conservation, ²sleep_total, ³sleep_rem,
#   ⁴sleep_cycle

tail(msleep)

# A tibble: 6 × 11
  name   genus vore  order conse…¹ sleep…² sleep…³ sleep…⁴ awake brainwt  bodywt
  <chr>  <chr> <chr> <chr> <chr>     <dbl>   <dbl>   <dbl> <dbl>   <dbl>   <dbl>
1 Tenrec Tenr… omni  Afro… <NA>       15.6     2.3  NA       8.4  0.0026   0.9  
2 Tree … Tupa… omni  Scan… <NA>        8.9     2.6   0.233  15.1  0.0025   0.104
3 Bottl… Turs… carni Ceta… <NA>        5.2    NA    NA      18.8 NA      173.   
4 Genet  Gene… carni Carn… <NA>        6.3     1.3  NA      17.7  0.0175   2    
5 Arcti… Vulp… carni Carn… <NA>       12.5    NA    NA      11.5  0.0445   3.38 
6 Red f… Vulp… carni Carn… <NA>        9.8     2.4   0.35   14.2  0.0504   4.23 
# … with abbreviated variable names ¹conservation, ²sleep_total, ³sleep_rem,
#   ⁴sleep_cycle

dim(msleep)

[1] 83 11

names(msleep)

 [1] "name"         "genus"        "vore"         "order"        "conservation"
 [6] "sleep_total"  "sleep_rem"    "sleep_cycle"  "awake"        "brainwt"     
[11] "bodywt"

str
- returns information about the 11 variables
- in particular the data types of the variables
  - chr (character) or num (numeric)
- this tells us whether each variable is categorical or numerical

str(msleep)

tibble [83 × 11] (S3: tbl_df/tbl/data.frame)
 $ name        : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
 $ genus       : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
 $ vore        : chr [1:83] "carni" "omni" "herbi" "omni" ...
 $ order       : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
 $ conservation: chr [1:83] "lc" NA "nt" "lc" ...
 $ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
 $ sleep_rem   : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
 $ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
 $ awake       : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
 $ brainwt     : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
 $ bodywt      : num [1:83] 50 0.48 1.35 0.019 600 ...

Documentation about msleep

?msleep

Load msleep into the Environment panel of RStudio

data("msleep")

1.6.1 Numerical data

1.6.1.1 Summary of Each variable

For numeric variables, summary provides the smallest value, the first quartile, the median, the mean, the third quartile, the largest value, and the number of missing values (NA's).
The first quartile is a value such that 25 % of the observations are smaller than it.
The third quartile is a value such that 25 % of the observations are larger than it.

summary(msleep)

     name              genus               vore              order          
 Length:83          Length:83          Length:83          Length:83         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 conservation        sleep_total      sleep_rem      sleep_cycle    
 Length:83          Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
 Class :character   1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
 Mode  :character   Median :10.10   Median :1.500   Median :0.3333  
                    Mean   :10.43   Mean   :1.875   Mean   :0.4396  
                    3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
                    Max.   :19.90   Max.   :6.600   Max.   :1.5000  
                                    NA's   :22      NA's   :51      
     awake          brainwt            bodywt        
 Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
 1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
 Median :13.90   Median :0.01240   Median :   1.670  
 Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
 3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
 Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
                 NA's   :27
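
The quartile definitions can be checked on a small vector. The sketch below reuses the bookstore ages from earlier; with only nine observations the shares below and above the quartiles can only be roughly 25 %, and quantile uses its default interpolation rule.

```r
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)

# First and third quartiles
q <- quantile(age, probs = c(0.25, 0.75), names = FALSE)
q                  # 30 48

# Share of observations below the first quartile: 2 of 9
mean(age < q[1])   # 0.2222222

# Share of observations above the third quartile: 2 of 9
mean(age > q[2])   # 0.2222222
```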

1.6.1.2 Descriptive statistics of one numeric variable

mean(msleep$sleep_total) # Mean

[1] 10.43373

median(msleep$sleep_total) # Median

[1] 10.1

max(msleep$sleep_total) # Max

[1] 19.9

min(msleep$sleep_total) # Min

[1] 1.9

sd(msleep$sleep_total) # Standard deviation

[1] 4.450357

var(msleep$sleep_total) # Variance

[1] 19.80568

quantile(msleep$sleep_total) # Various quantiles

   0%   25%   50%   75%  100% 
 1.90  7.85 10.10 13.75 19.90

1.6.1.3 How many animals sleep for more than 8 hours a day

n = msleep$sleep_total > 8
n

 [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
[49] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[61]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[73]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE

sum(msleep$sleep_total > 8) # count (Frequency)

[1] 61

mean(msleep$sleep_total > 8) # Proportion (relative frequency)

[1] 0.7349398
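
Counting with sum and mean works because R treats TRUE as 1 and FALSE as 0 in arithmetic. A minimal sketch using the bookstore purchases from earlier (over_30 is a name chosen for this example):

```r
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)

# A logical vector: one TRUE/FALSE per customer
over_30 <- purchase > 30

as.numeric(over_30)   # 0 1 0 0 0 1 1 1 0
sum(over_30)          # 4, the number of TRUE values
mean(over_30)         # 0.4444444, the proportion of TRUE values (4 of 9)
```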

1.6.1.4 Compute Mean REM Sleep (rapid eye movement sleep)

mean(msleep$sleep_rem)

[1] NA

Pass na.rm = TRUE to ignore NA values when computing the mean

mean(msleep$sleep_rem, na.rm = TRUE)

[1] 1.87541

1.6.1.5 Correlation between sleep_total and sleep_rem

cor(msleep$sleep_total, msleep$sleep_rem)

[1] NA

cor takes a different argument, use, for ignoring NA values

cor(msleep$sleep_total, msleep$sleep_rem, use = "complete.obs")

[1] 0.751755
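
With use = "complete.obs", cor keeps only the pairs where both values are present. The sketch below illustrates this on made-up toy vectors x and y (not from the book) by comparing the result with a manual filter based on complete.cases.

```r
# Toy data: one NA in each vector, in different positions
x <- c(1, 2, NA, 4, 5)
y <- c(2.1, 3.9, 6.0, NA, 9.5)

# Let cor drop the incomplete pairs
with_use <- cor(x, y, use = "complete.obs")

# Drop the incomplete pairs by hand, then compute the correlation
keep <- complete.cases(x, y)   # TRUE TRUE FALSE FALSE TRUE
by_hand <- cor(x[keep], y[keep])

all.equal(with_use, by_hand)
```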

1.6.2 Categorical data

Categorical variables are often stored as factors in R; in msleep they are stored as character vectors.
Examples in the msleep dataset are
- vore (feeding behavior) and
- conservation (conservation status)

1.6.2.1 Table showing the frequencies of different categories

table(msleep$vore)


  carni   herbi insecti    omni 
     19      32       5      20

1.6.2.2 Proportion of different categories

Apply proportions to the table

proportions(table(msleep$vore))


     carni      herbi    insecti       omni 
0.25000000 0.42105263 0.06578947 0.26315789

1.6.2.3 Cross table

Counts for different combinations of two categorical variables

table(msleep$vore, msleep$conservation)

         
          cd domesticated en lc nt vu
  carni    1            2  1  5  1  4
  herbi    1            7  2 10  3  3
  insecti  0            0  1  2  0  0
  omni     0            1  0  8  0  0

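1.6.2.4 Cross table proportions

margin is a vector giving the margins to split by.
For a matrix or two-way table, 1 gives proportions within each row and 2 gives proportions within each column.
c(1, 2) indicates rows and columns.
The default is NULL, which gives proportions of the grand total.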

proportions(table(msleep$vore, msleep$conservation))

         
                  cd domesticated         en         lc         nt         vu
  carni   0.01923077   0.03846154 0.01923077 0.09615385 0.01923077 0.07692308
  herbi   0.01923077   0.13461538 0.03846154 0.19230769 0.05769231 0.05769231
  insecti 0.00000000   0.00000000 0.01923077 0.03846154 0.00000000 0.00000000
  omni    0.00000000   0.01923077 0.00000000 0.15384615 0.00000000 0.00000000

proportions(table(msleep$vore, msleep$conservation), margin = 1)

         
                  cd domesticated         en         lc         nt         vu
  carni   0.07142857   0.14285714 0.07142857 0.35714286 0.07142857 0.28571429
  herbi   0.03846154   0.26923077 0.07692308 0.38461538 0.11538462 0.11538462
  insecti 0.00000000   0.00000000 0.33333333 0.66666667 0.00000000 0.00000000
  omni    0.00000000   0.11111111 0.00000000 0.88888889 0.00000000 0.00000000

proportions(table(msleep$vore, msleep$conservation), margin = 2)

         
                 cd domesticated        en        lc        nt        vu
  carni   0.5000000    0.2000000 0.2500000 0.2000000 0.2500000 0.5714286
  herbi   0.5000000    0.7000000 0.5000000 0.4000000 0.7500000 0.4285714
  insecti 0.0000000    0.0000000 0.2500000 0.0800000 0.0000000 0.0000000
  omni    0.0000000    0.1000000 0.0000000 0.3200000 0.0000000 0.0000000
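
The effect of margin is easiest to see by checking what sums to 1. The sketch below uses a tiny made-up cross table (the vectors diet and status are hypothetical, not taken from msleep).

```r
# Toy data: five animals with a diet and a conservation status
diet <- c("carni", "carni", "herbi", "herbi", "herbi")
status <- c("lc", "vu", "lc", "lc", "vu")
tab <- table(diet, status)

overall <- proportions(tab)               # share of the grand total
by_row <- proportions(tab, margin = 1)    # share within each diet
by_col <- proportions(tab, margin = 2)    # share within each status

sum(overall)      # all cells together sum to 1
rowSums(by_row)   # each row sums to 1
colSums(by_col)   # each column sums to 1
```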

1.7 Plotting numerical data

R has several plotting options. One of them is the ggplot2 package, which uses the “grammar of graphics”.
The grammar of graphics is a set of structural rules for building a language for graphics.
All plots are constructed using functions that follow the same logic, or grammar.
Compare this to ignoring NA values when computing descriptive statistics: mean needed the argument na.rm, whereas cor needed the argument use.
With a common grammar for plots, there are fewer arguments to learn.
The three key components of grammar of graphics plots are:
- Data: the observations in your dataset
- Aesthetics: mappings from the data to visual properties (like axes and sizes of geometric objects), and
- Geoms: geometric objects, e.g. lines, representing what you see in the plot.

library(ggplot2)

1.7.1 First plot

plot(msleep$sleep_total, msleep$sleep_rem)

# pch sets the plotting character (16 is a filled circle)
plot(msleep$sleep_total, msleep$sleep_rem, pch = 16)
grid()

?ggplot

ggplot(msleep, 
       aes(x = sleep_total, y = sleep_rem)) + 
    geom_point(na.rm = TRUE) + 
    ggtitle("A scatterplot of mammal sleeping times")

library(ggplot2)
# The argument names x and y can be omitted in aes
ggplot(msleep, aes(sleep_total, sleep_rem)) + 
    geom_point(na.rm = TRUE) + 
    ggtitle("A scatterplot of mammal sleeping times")
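
The three components (data, aesthetics, geoms) can be made visible by building a plot step by step. This is a sketch rather than the book's own code; the object names base_plot, scatter and line_plot are made up, and geom_line is used only to show that a geom can be swapped.

```r
library(ggplot2)

# Data + aesthetics: which dataset, and which variables go on which axis
base_plot <- ggplot(msleep, aes(x = sleep_total, y = sleep_rem))

# Geom: points are the geometric objects that draw the observations
scatter <- base_plot + geom_point(na.rm = TRUE)

# A different geom reuses the same data and aesthetics
line_plot <- base_plot + geom_line(na.rm = TRUE)

# Printing the object draws the plot
scatter
```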

1.7.2 Colours, shapes and axis labels

ggplot(msleep, aes(sleep_total, sleep_rem)) +
  geom_point(na.rm = TRUE) +
  xlab("Total sleep time (h)")

1.7.2.1 Color

ggplot(msleep, aes(sleep_total, sleep_rem)) +
  geom_point(color = "red", na.rm = TRUE) +
  xlab("Total sleep time (h)")

1.7.2.2 Built-in color names

colors()[1:5]

[1] "white"         "aliceblue"     "antiquewhite"  "antiquewhite1"
[5] "antiquewhite2"

1.7.2.3 Color with Categories

ggplot(msleep, 
       aes(sleep_total, 
           sleep_rem, 
           colour = vore)) +
  geom_point(na.rm = TRUE) +
  xlab("Total sleep time (h)")

1.7.2.4 Numeric variable as Color

ggplot(msleep, 
       aes(sleep_total, sleep_rem, 
           colour = sleep_cycle)) +
  geom_point(na.rm = TRUE) +
  xlab("Total sleep time (h)")

1.7.3 Axis limits and scales

1.7.3.1 Scatter plot showing the relationship between animals’ brain sizes (brainwt) and their total sleep time (sleep_total)

Two animals have brains that are much heavier than the rest (African elephant and Asian elephant).

ggplot(msleep, 
       aes(brainwt, sleep_total, 
           colour = vore)) +
  geom_point(na.rm = TRUE) +
  xlab("Brain weight") +
  ylab("Total sleep time")

1.7.3.2 Limiting the x-axis to the range 0 to 1.5 by adding xlim to the plot, which removes the outliers

ggplot(msleep, 
       aes(brainwt, sleep_total, 
           colour = vore)) +
  geom_point(na.rm = TRUE) +
  xlab("Brain weight") +
  ylab("Total sleep time") +
  xlim(0, 1.5)
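
Setting xlim drops the observations outside the range from the plot. To see which animals that affects, one could filter the data directly; a minimal sketch, where outside is a name chosen for this example:

```r
library(ggplot2)

# Rows with a brain weight above the new upper x-axis limit;
# which() skips the rows where brainwt is NA
outside <- msleep[which(msleep$brainwt > 1.5), c("name", "brainwt")]
outside
```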

1.7.3.3 Changing y-axis limit

ggplot(msleep, 
       aes(brainwt, sleep_total, 
           colour = vore)) +
  geom_point(na.rm = TRUE) +
  xlab("Brain weight") +
  ylab("Total sleep time") +
  xlim(0, 1.5) +
  ylim(0, 10)

1.7.3.4 Applying a log transform to brain weights to rescale the x-axis

The plot looks better, and no outliers have to be removed.
It shows a weak declining trend.
However, the values on the x-axis are now harder to interpret.

ggplot(msleep, 
       aes(log(brainwt), sleep_total, 
           colour = vore)) +
  geom_point(na.rm = TRUE) + 
  xlab("log(Brain weight)") + 
  ylab("Total sleep time")

1.7.3.5 Changing the axis scale to log10 instead of transforming the data

This increases interpretability because the values shown at the ticks are still on the original x-scale.

ggplot(msleep, 
       aes(brainwt, sleep_total, 
           colour = vore)) +
  geom_point(na.rm = TRUE) +
  xlab("Brain weight (logarithmic scale)") +
  ylab("Total sleep time") +
  scale_x_log10()

1.7.4 Comparing groups - facetting

Facetting creates a grid of plots corresponding to the different groups.
In the plot of animal brain weight versus total sleep time, we may wish to separate the different feeding behaviours (omnivores, carnivores, etc.) in the msleep data using facetting instead of different coloured points.
In ggplot2 we do this by adding a call to facet_wrap to the plot.
Note that the x-axes and y-axes of the different plots in the grid all have the same scale and limits.

ggplot(msleep, 
       aes(brainwt, sleep_total)) +
  geom_point(na.rm = TRUE) +
  xlab("Brain weight (logarithmic scale)") +
  ylab("Total sleep time") +
  scale_x_log10() +
  facet_wrap(~ vore)
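
Shared axes are the default for facet_wrap. If each panel should instead get its own axis range, the scales argument can be set to "free" (or "free_x" / "free_y"). A sketch of that variation, with free_facets as a made-up object name:

```r
library(ggplot2)

# Same facetted plot, but every panel chooses its own axis limits
free_facets <- ggplot(msleep, aes(brainwt, sleep_total)) +
  geom_point(na.rm = TRUE) +
  xlab("Brain weight (logarithmic scale)") +
  ylab("Total sleep time") +
  scale_x_log10() +
  facet_wrap(~ vore, scales = "free")

free_facets
```

Free scales make each panel easier to read on its own, at the cost of making comparisons between panels harder.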

1.7.5 Boxplots

Another option for comparing groups is boxplots, also called box-and-whiskers plots.

Boxes visualise important descriptive statistics for the different groups, similar to what we got using summary:

Median: the thick black line inside the box.
First quartile: the bottom of the box.
Third quartile: the top of the box.
Minimum: the end of the line (“whisker”) that extends from the bottom of the box, not counting outliers.
Maximum: the end of the line that extends from the top of the box, not counting outliers.
Outliers: observations that lie more than 1.5 times the height of the box beyond the edges of the box are shown as separate points.
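
The outlier rule can be worked through by hand. The sketch below applies it to the bookstore purchases from earlier; the names q1, q3, box_height and the two fences are chosen for this example, and the quartiles come from quantile, which can differ slightly from the hinges that base R's boxplot uses for some samples.

```r
purchase <- c(20, 59, 2, 12, 22, 160, 34, 34, 29)

q1 <- quantile(purchase, 0.25, names = FALSE)   # 20, bottom of the box
q3 <- quantile(purchase, 0.75, names = FALSE)   # 34, top of the box
box_height <- q3 - q1                           # 14

# Anything beyond these limits is drawn as a separate point
lower_fence <- q1 - 1.5 * box_height            # -1
upper_fence <- q3 + 1.5 * box_height            # 55

outliers <- purchase[purchase < lower_fence | purchase > upper_fence]
outliers                                        # 59 160

# The whiskers end at the most extreme observations that are not outliers
whiskers <- range(purchase[purchase >= lower_fence & purchase <= upper_fence])
whiskers                                        # 2 34
```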

1.7.5.1 Animal sleep times, grouped by feeding behaviour - using base R

boxplot(sleep_total ~ vore, data = msleep)

1.7.5.2 Animal sleep times, grouped by feeding behaviour - using ggplot2

The code is similar to the scatter plot code, except that geom_boxplot is used

ggplot(msleep, 
       aes(vore, sleep_total)) +
  geom_boxplot() +
  ggtitle("Boxplots showing mammal sleeping times")

1.7.6 Histograms

Histograms show the distribution of a continuous variable.
The data is split into a number of bins, and the number of observations in each bin is shown by a bar.
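
The binning step can be inspected without drawing anything. A minimal sketch using the bookstore ages from earlier, with bin edges at 20, 40, 60 and 80 picked for illustration:

```r
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)

# plot = FALSE returns the bins instead of drawing the histogram
bins <- hist(age, breaks = seq(20, 80, by = 20), plot = FALSE)

bins$breaks   # 20 40 60 80, the bin edges
bins$counts   # 4 3 2, the bar heights (20-40, 40-60, 60-80)
```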

1.7.6.1 Using base R

hist(msleep$sleep_total)

1.7.6.2 Using ggplot2

Same approach as for the other plots, but using geom_histogram()

ggplot(msleep, 
       aes(sleep_total)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.7.6.3 Number of Bins

ggplot(msleep, 
       aes(sleep_total)) +
  geom_histogram(bins = 15)

1.7.6.4 Binwidth

ggplot(msleep, 
       aes(sleep_total)) +
  geom_histogram(binwidth = 5)

1.7.6.5 Smaller binwidth

ggplot(msleep, 
       aes(sleep_total)) +
  geom_histogram(binwidth = 2.5)

1.8 Plotting categorical data

When visualizing categorical data, we display the counts for each category.
Bar charts are common for this type of data.

1.8.1 Bar charts

Bar charts are discrete histograms where categories are represented by bars.

library(ggplot2)

1.8.1.1 Base R

barplot(table(msleep$vore))

1.8.1.2 Using ggplot2

using geom_bar()

ggplot(msleep, 
       aes(vore)) +
  geom_bar()

1.8.1.3 Stacked bar chart

ggplot(msleep, 
       aes(factor(1), fill = vore)) +
  geom_bar()

1.9 Saving your plot

myPlot <- ggplot(msleep,
                 aes(sleep_total,
                     sleep_rem)) +
  geom_point(na.rm = TRUE)

myPlot

myPlot + xlab("Total Sleep Time")

1.9.1 Save using ggsave()

# ggsave opens and closes the graphics device itself, so no dev.off() call is needed
ggsave("filename.pdf", 
       myPlot, 
       width = 5, height = 5)

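1.9.2 Save using base R

# Open a PDF graphics device, draw the plot, then close the device
pdf("filename.pdf", 
    width = 5, height = 5)
print(myPlot) # print explicitly so the plot is also drawn when the script is sourced
dev.off()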

quartz_off_screen 
                2

png("filename.png", 
    width = 500, height = 500)
plot(msleep$sleep_total,
     msleep$sleep_rem)
dev.off()

quartz_off_screen 
                2
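
To try saving without cluttering the working directory, the file can be written to R's temporary directory. This is a sketch under that assumption; sleep_plot and out_file are names made up for the example.

```r
library(ggplot2)

sleep_plot <- ggplot(msleep, aes(sleep_total, sleep_rem)) +
  geom_point(na.rm = TRUE)

# Build a path inside the session's temporary directory
out_file <- file.path(tempdir(), "sleep_scatter.pdf")

# width and height are given in inches by default
ggsave(out_file, sleep_plot, width = 5, height = 5)

# Check that the file was written
file.exists(out_file)
```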