[1] 2
[1] 4
Book
Modern Statistics with R
From wrangling and exploring data to inference and predictive modelling, https://www.modernstatisticswithr.com/
https://www.amazon.com/Modern-Statistics-wrangling-exploring-predictive/dp/9152701514
Måns Thulin
Three or four panels:
Ctrl + Shift + N
File > New File > R Script
Run the entire script
Press the Source button
Press Ctrl+Shift+Enter
Press Ctrl+Alt+Enter (without print code)
Run part of script
Press the Run button
Press Ctrl+Enter
Save the script
File > Save
Ctrl + S
Case Sensitive
snake_case
camelCase or CamelCase
period.case (avoid)
Chars not allowed
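A minimal sketch of the naming points above. It relies on R's general rules for syntactic names, not on the original list of disallowed characters, which is not reproduced in these notes. All variable names here are made up for illustration.

```r
# Names are case sensitive: these are two different variables
total_sleep <- 10
Total_sleep <- 12
total_sleep == Total_sleep   # FALSE, because they hold different values

# The three styles mentioned above are all valid R names
sleep_total_hours <- 10      # snake_case
sleepTotalHours   <- 10      # camelCase
sleep.total.hours <- 10      # period.case (valid, but best avoided)

# make.names() rewrites a string into a syntactically valid name,
# so a string that comes back unchanged was already valid
candidate_names <- c("sleep_total", "2nd_value", "my var", "sleep-total")
is_valid <- make.names(candidate_names) == candidate_names
is_valid                     # TRUE FALSE FALSE FALSE
```

Names that start with a digit, or that contain spaces or operators such as -, are the ones flagged as invalid here.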
Comments
Start a comment with #; R ignores the rest of the line
Select lines and press Ctrl + Shift + C to comment or uncomment them
[1] 28 48 47 71 22 80 48 30 31
[1] 20 59 2 12 22 160 34 34 29
Data Frames
# line breaks between the commas
distances <- c(687, 5076, 7270,
967, 6364, 1683,
9394, 5712, 5206,
4317, 9411, 5625,
9725, 4977, 2730,
5648, 3818, 8241,
5547, 1637, 4428,
8584, 2962, 5729,
5325, 4370, 5989,
9030, 5532, 9623)
distances
[1] 687 5076 7270 967 6364 1683 9394 5712 5206 4317 9411 5625 9725 4977 2730
[16] 5648 3818 8241 5547 1637 4428 8584 2962 5729 5325 4370 5989 9030 5532 9623
# Compute the mean age of bookstore customers
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)
mean(age)
[1] 45
diamonds: prices of more than 50,000 cut diamonds
msleep: sleep times of 83 mammals
Printing msleep reports that it is a tibble with 83 rows and 11 columns
The printout shows only the first 10 rows and as many columns as fit on the screen
# A tibble: 83 × 11
name genus vore order conse…¹ sleep…² sleep…³ sleep…⁴ awake brainwt
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9 NA
2 Owl monkey Aotus omni Prim… <NA> 17 1.8 NA 7 0.0155
3 Mountain be… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6 NA
4 Greater sho… Blar… omni Sori… lc 14.9 2.3 0.133 9.1 0.00029
5 Cow Bos herbi Arti… domest… 4 0.7 0.667 20 0.423
6 Three-toed … Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6 NA
7 Northern fu… Call… carni Carn… vu 8.7 1.4 0.383 15.3 NA
8 Vesper mouse Calo… <NA> Rode… <NA> 7 NA NA 17 NA
9 Dog Canis carni Carn… domest… 10.1 2.9 0.333 13.9 0.07
10 Roe deer Capr… herbi Arti… lc 3 NA NA 21 0.0982
# … with 73 more rows, 1 more variable: bodywt <dbl>, and abbreviated variable
# names ¹conservation, ²sleep_total, ³sleep_rem, ⁴sleep_cycle
View(msleep)
Some cells contain NA, R's placeholder for missing data
# A tibble: 6 × 11
name genus vore order conse…¹ sleep…² sleep…³ sleep…⁴ awake brainwt bodywt
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Chee… Acin… carni Carn… lc 12.1 NA NA 11.9 NA 50
2 Owl … Aotus omni Prim… <NA> 17 1.8 NA 7 0.0155 0.48
3 Moun… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6 NA 1.35
4 Grea… Blar… omni Sori… lc 14.9 2.3 0.133 9.1 0.00029 0.019
5 Cow Bos herbi Arti… domest… 4 0.7 0.667 20 0.423 600
6 Thre… Brad… herbi Pilo… <NA> 14.4 2.2 0.767 9.6 NA 3.85
# … with abbreviated variable names ¹conservation, ²sleep_total, ³sleep_rem,
# ⁴sleep_cycle
# A tibble: 6 × 11
name genus vore order conse…¹ sleep…² sleep…³ sleep…⁴ awake brainwt bodywt
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Tenrec Tenr… omni Afro… <NA> 15.6 2.3 NA 8.4 0.0026 0.9
2 Tree … Tupa… omni Scan… <NA> 8.9 2.6 0.233 15.1 0.0025 0.104
3 Bottl… Turs… carni Ceta… <NA> 5.2 NA NA 18.8 NA 173.
4 Genet Gene… carni Carn… <NA> 6.3 1.3 NA 17.7 0.0175 2
5 Arcti… Vulp… carni Carn… <NA> 12.5 NA NA 11.5 0.0445 3.38
6 Red f… Vulp… carni Carn… <NA> 9.8 2.4 0.35 14.2 0.0504 4.23
# … with abbreviated variable names ¹conservation, ²sleep_total, ³sleep_rem,
# ⁴sleep_cycle
[1] "name" "genus" "vore" "order" "conservation"
[6] "sleep_total" "sleep_rem" "sleep_cycle" "awake" "brainwt"
[11] "bodywt"
str returns information about the 11 variables,
in particular their data types
This tells us whether each variable is numerical or categorical
tibble [83 × 11] (S3: tbl_df/tbl/data.frame)
$ name : chr [1:83] "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
$ genus : chr [1:83] "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
$ vore : chr [1:83] "carni" "omni" "herbi" "omni" ...
$ order : chr [1:83] "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
$ conservation: chr [1:83] "lc" NA "nt" "lc" ...
$ sleep_total : num [1:83] 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
$ sleep_rem : num [1:83] NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
$ sleep_cycle : num [1:83] NA NA NA 0.133 0.667 ...
$ awake : num [1:83] 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
$ brainwt : num [1:83] NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
$ bodywt : num [1:83] 50 0.48 1.35 0.019 600 ...
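The printouts above are consistent with the standard inspection functions applied to msleep. The calls themselves are not preserved in these notes, so the following is a reconstruction under that assumption. It requires the ggplot2 package, which ships the msleep data.

```r
library(ggplot2)   # provides the msleep dataset

msleep             # prints the first 10 rows of the tibble
head(msleep)       # first 6 rows
tail(msleep)       # last 6 rows
dim(msleep)        # number of rows and columns: 83 11
names(msleep)      # variable names
str(msleep)        # data type of each variable

# View(msleep) opens the data in a spreadsheet-like viewer (RStudio)
```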
For numeric variables, summary provides the smallest value, the largest value, the first quartile, the median, the third quartile, the mean, and the number of missing values (NAs)
The first quartile is a value such that 25 % of the observations are smaller than it
The third quartile is a value such that 25 % of the observations are larger than it.
name genus vore order
Length:83 Length:83 Length:83 Length:83
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
conservation sleep_total sleep_rem sleep_cycle
Length:83 Min. : 1.90 Min. :0.100 Min. :0.1167
Class :character 1st Qu.: 7.85 1st Qu.:0.900 1st Qu.:0.1833
Mode :character Median :10.10 Median :1.500 Median :0.3333
Mean :10.43 Mean :1.875 Mean :0.4396
3rd Qu.:13.75 3rd Qu.:2.400 3rd Qu.:0.5792
Max. :19.90 Max. :6.600 Max. :1.5000
NA's :22 NA's :51
awake brainwt bodywt
Min. : 4.10 Min. :0.00014 Min. : 0.005
1st Qu.:10.25 1st Qu.:0.00290 1st Qu.: 0.174
Median :13.90 Median :0.01240 Median : 1.670
Mean :13.57 Mean :0.28158 Mean : 166.136
3rd Qu.:16.15 3rd Qu.:0.12550 3rd Qu.: 41.750
Max. :22.10 Max. :5.71200 Max. :6654.000
NA's :27
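To make the quartile definitions concrete, here is a small sketch using the age vector from earlier in these notes. With only nine observations the 25 % figures hold only approximately, and quantile uses R's default interpolation rule.

```r
age <- c(28, 48, 47, 71, 22, 80, 48, 30, 31)

# First quartile, median, and third quartile
q <- quantile(age, probs = c(0.25, 0.5, 0.75))
q                         # 25%: 30, 50%: 47, 75%: 48

# Share of observations strictly below the first quartile
# and strictly above the third quartile
share_below_q1 <- mean(age < q[["25%"]])
share_above_q3 <- mean(age > q[["75%"]])

# summary() reports the same quartiles together with min, mean, and max
summary(age)
```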
[1] 10.43373
[1] 10.1
[1] 19.9
[1] 1.9
[1] 4.450357
[1] 19.80568
0% 25% 50% 75% 100%
1.90 7.85 10.10 13.75 19.90
[1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
[25] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[49] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[61] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[73] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
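The single-number outputs above match the sleep_total column of msleep, so they were presumably produced by calls like the ones below. The calls are not preserved in these notes. In particular, the threshold 8 in the comparison is an assumption that merely agrees with the first TRUE/FALSE values shown.

```r
library(ggplot2)          # provides the msleep dataset

x <- msleep$sleep_total

mean(x)                   # 10.43373
median(x)                 # 10.1
max(x)                    # 19.9
min(x)                    # 1.9
sd(x)                     # 4.450357
var(x)                    # 19.80568
quantile(x)               # 1.90  7.85 10.10 13.75 19.90

# A comparison returns one TRUE/FALSE per animal (threshold assumed)
sleeps_more_than_8h <- x > 8
sum(sleeps_more_than_8h)  # number of animals above the threshold
mean(sleeps_more_than_8h) # proportion of animals above the threshold
```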
Categorical variables are also called factors
Examples in the msleep dataset are vore and conservation, tabulated below as proportions
cd domesticated en lc nt vu
carni 0.01923077 0.03846154 0.01923077 0.09615385 0.01923077 0.07692308
herbi 0.01923077 0.13461538 0.03846154 0.19230769 0.05769231 0.05769231
insecti 0.00000000 0.00000000 0.01923077 0.03846154 0.00000000 0.00000000
omni 0.00000000 0.01923077 0.00000000 0.15384615 0.00000000 0.00000000
cd domesticated en lc nt vu
carni 0.07142857 0.14285714 0.07142857 0.35714286 0.07142857 0.28571429
herbi 0.03846154 0.26923077 0.07692308 0.38461538 0.11538462 0.11538462
insecti 0.00000000 0.00000000 0.33333333 0.66666667 0.00000000 0.00000000
omni 0.00000000 0.11111111 0.00000000 0.88888889 0.00000000 0.00000000
cd domesticated en lc nt vu
carni 0.5000000 0.2000000 0.2500000 0.2000000 0.2500000 0.5714286
herbi 0.5000000 0.7000000 0.5000000 0.4000000 0.7500000 0.4285714
insecti 0.0000000 0.0000000 0.2500000 0.0800000 0.0000000 0.0000000
omni 0.0000000 0.1000000 0.0000000 0.3200000 0.0000000 0.0000000
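The three tables above look like overall, row-wise, and column-wise proportions of a vore by conservation cross-table. A sketch of how such tables can be computed with base R's table and prop.table follows. The exact calls used in the original are not preserved, so treat this as one possible reconstruction.

```r
library(ggplot2)   # provides the msleep dataset

# Counts for each combination; rows with NA in either variable are dropped
counts <- table(msleep$vore, msleep$conservation)

# Proportions of the whole table (all cells sum to 1)
overall <- prop.table(counts)

# Proportions within each row (each row sums to 1)
by_row <- prop.table(counts, margin = 1)

# Proportions within each column (each column sums to 1)
by_col <- prop.table(counts, margin = 2)

overall
by_row
by_col
```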
R has several plotting options; one of them is the ggplot2 package, which uses the “grammar of graphics”
The grammar of graphics is a set of structural rules for building a language for graphics.
All plots are constructed using functions that follow the same logic, or grammar.
Compare this to ignoring NA values when computing descriptive statistics: mean needed the argument na.rm, whereas cor needed the argument use.
With a common plot grammar, there are fewer arguments to learn.
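The inconsistency between mean and cor can be seen in a short sketch. The two vectors are the first ten sleep_total and sleep_rem values shown in the str output earlier in these notes.

```r
sleep_total <- c(12.1, 17, 14.4, 14.9, 4, 14.4, 8.7, 7, 10.1, 3)
sleep_rem   <- c(NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA)

# mean() returns NA unless told to remove missing values with na.rm
mean(sleep_rem)                       # NA
mean_rem <- mean(sleep_rem, na.rm = TRUE)

# cor() has no na.rm argument; missing values are handled through use
cor(sleep_total, sleep_rem)           # NA
cor_complete <- cor(sleep_total, sleep_rem, use = "complete.obs")
```

Two functions, two different argument names for the same idea. In ggplot2 the geoms share arguments such as na.rm.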
Three key components to grammar of graphics plots are:
Data: the observations in your dataset
Aesthetics: mappings from the data to visual properties (like axes and sizes of geometric objects), and
Geoms: geometric objects, e.g. lines, representing what you see in the plot.
?ggplot
ggplot(msleep, aes(sleep_total, sleep_rem)) +
geom_point(na.rm = TRUE) +
xlab("Total sleep time (h)")
looks better without removing outliers
weak declining trend
however, difficult to interpret now
Facetting creates a grid of plots corresponding to the different groups
For example, in a plot of animal brain weight versus total sleep time, we may wish to separate the different feeding behaviours (omnivores, carnivores, etc.) in the msleep data using facetting instead of differently coloured points.
In ggplot2 we do this by adding a call to facet_wrap to the plot
Note that the x-axes and y-axes of the different plots in the grid all have the same scale and limits.
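A sketch of such a facetted plot, assuming brainwt on the x-axis and sleep_total on the y-axis. The axis labels are my own wording, and the original code for this plot is not preserved in these notes.

```r
library(ggplot2)

# One panel per feeding behaviour (vore)
p <- ggplot(msleep, aes(brainwt, sleep_total)) +
  geom_point(na.rm = TRUE) +
  xlab("Brain weight (kg)") +
  ylab("Total sleep time (h)") +
  facet_wrap(~ vore)
p

# The shared scales can be relaxed with the scales argument
p_free <- ggplot(msleep, aes(brainwt, sleep_total)) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ vore, scales = "free")
```

Shared scales make panels directly comparable. Free scales can reveal structure within a panel, but comparisons across panels then require reading the axes.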
Another option for comparing groups is boxplots
also called box-and-whiskers plots
Boxes visualise important descriptive statistics for the different groups, similar to what we got using summary:
Median: the thick black line inside the box.
First quartile: the bottom of the box.
Third quartile: the top of the box.
Minimum: the end of the line (“whisker”) that extends from the bottom of the box (the smallest observation that is not an outlier).
Maximum: the end of the line that extends from the top of the box (the largest observation that is not an outlier).
Outliers: observations that lie more than 1.5 times the height of the box (the interquartile range) beyond the box edges are shown as separate points.
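A sketch of a grouped boxplot, followed by the 1.5 times box height rule computed by hand. The choice of vore and sleep_total is an assumption for illustration. Note that geom_boxplot applies the rule within each group, whereas the hand calculation below uses the whole variable.

```r
library(ggplot2)

# One box per feeding behaviour
p <- ggplot(msleep, aes(vore, sleep_total)) +
  geom_boxplot() +
  xlab("Feeding behaviour") +
  ylab("Total sleep time (h)")
p

# The outlier rule, computed for sleep_total as a whole
x  <- msleep$sleep_total
q1 <- quantile(x, 0.25, names = FALSE)   # bottom of the box
q3 <- quantile(x, 0.75, names = FALSE)   # top of the box
box_height <- q3 - q1                    # interquartile range

lower_fence <- q1 - 1.5 * box_height
upper_fence <- q3 + 1.5 * box_height

# Observations outside the fences would be drawn as separate points
outliers <- x[x < lower_fence | x > upper_fence]
```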
To show the distribution of a continuous variable, we can use a histogram:
the data is split into a number of bins, and the
number of observations in each bin is shown by a bar
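A minimal histogram sketch for sleep_total. The number of bins (15) is an arbitrary choice for illustration, not a value taken from the original.

```r
library(ggplot2)

# Split sleep_total into 15 bins and draw one bar per bin
p <- ggplot(msleep, aes(sleep_total)) +
  geom_histogram(bins = 15) +
  xlab("Total sleep time (h)")
p
```

Changing bins can noticeably alter the impression the histogram gives, so it is worth trying a few values.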
When visualizing categorical data, we display the counts for each category.
Bar charts are common for this type of data.
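A minimal bar chart sketch, assuming vore as the categorical variable. geom_bar counts the observations in each category itself, and missing values get their own NA bar.

```r
library(ggplot2)

# One bar per feeding behaviour, with height equal to the count
p <- ggplot(msleep, aes(vore)) +
  geom_bar() +
  xlab("Feeding behaviour")
p

# The same counts as a table, including the missing values
vore_counts <- table(msleep$vore, useNA = "ifany")
vore_counts
```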