The purpose of this lab is to show you how to compute basic descriptive statistics, including measures of central tendency (mean, mode, median) and variation (range, variance, standard deviation).
In this lab, you will learn the following:
We will use the gapminder project data for this lab.
To do descriptive statistics, we put some numbers in a variable. You
can also do this using the c() command, which stands for
“combine.”
my_numbers <- c(1,2,3,4)
my_numbers
## [1] 1 2 3 4
my_names <- c("Anna", "Bren", "Carlos", "Danny")
my_names
## [1] "Anna" "Bren" "Carlos" "Danny"
There are a few other handy ways to make numbers. We can use
seq() to make a sequence.
one_to_forty <- seq(1,40,1)
one_to_forty
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
even_numbers <- seq(2,39,2)
even_numbers
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
We can repeat things using rep(). Here’s making five
times the number 10 and eleven times the number 1:
rep(10,5)
## [1] 10 10 10 10 10
rep(1,11)
## [1] 1 1 1 1 1 1 1 1 1 1 1
all_together_now <- c(rep(10,5),rep(1,11))
all_together_now
## [1] 10 10 10 10 10 1 1 1 1 1 1 1 1 1 1 1
Note: the syntax is
rep(object, number of repetitions)
Store the letter grades below in a variable called
grades. A, B, B, B, B, A, B, C, B, C, C, C, A, C, B, B
Enter your code below:
A common task when analyzing data is the construction of frequency
tables. You can use several functions in R for that purpose, one of them
being table().
table(all_together_now)
## all_together_now
## 1 10
## 11 5
This shows the frequency of each value in the data.
Use R to tabulate the letter grades from the previous example.
Enter your code below:
Let’s play with the numbers from 1 to 100. To create this sequence of
numbers, we use the function seq() and then use the
sum() function to add them up.
one_to_one_hundred <- seq(1,100,1)
sum(one_to_one_hundred)
## [1] 5050
Use R to create the list of the numbers between 143 and 246 (integers), and calculate their sum. Use a variable called “my_numbers2”.
Enter your code below:
We put 100 numbers into the variable one_to_one_hundred.
We know how many numbers there are in there. How can we get R to tell
us? We use length() for that.
length(one_to_one_hundred)
## [1] 100
Find the length of the variable my_numbers2 from the
previous example.
Enter your code below:
Remember, the mean of some numbers is their sum divided by the number of numbers. We can compute the mean like this:
sum(one_to_one_hundred)/length(one_to_one_hundred)
## [1] 50.5
Or, we could use the mean() function like this:
mean(one_to_one_hundred)
## [1] 50.5
The median is the number in the exact middle of the numbers ordered
from smallest to largest. If there are an even number of values (no
number in the middle), then we take the number between the two (decimal
.5). Use the median() function.
median(c(1,2,3))
## [1] 2
median(c(1,4,3,2))
## [1] 2.5
Use R to calculate the mean and the median of a list of numbers that contains twenty times the number 15, six times the number 8, and thirteen times the number 9.
Enter your code below:
R does not have a base function for the Mode. You would have to write one for yourself. Here is an example of writing your mode function and then using it. I searched for how to do this on Google.
Remember, the mode is the most frequently occurring number in the set. Below 1 occurs the most, so the mode will be one.
Mode <- function(x) {
a <- table(x)
as.numeric(names(a)[a == max(a)])
}
x <- c(1,2,2,3,3,4)
Mode(x)
## [1] 2 3
Use R to calculate the mode of 2,1,2,2,4,4,5,2
Enter your code below:
We often want to know how variable the numbers are. We will look at descriptive statistics to describe this, such as the range, variance, the standard deviation, and a few others.
First, let’s remind ourselves what variation looks like (it’s when the numbers are different). We will sample 100 numbers from a normal distribution (if we haven’t covered this in the lecture yet, don’t worry about it) with a mean of 10 and a standard deviation of 5 and then make a histogram to see the variation around 10.
sample_numbers <- rnorm(100,10,5)
hist(sample_numbers)
The range in R is the minimum and maximum values in
the set; we use the range() function [this is not the same
as we computed in class].
range(sample_numbers)
## [1] -0.8023148 22.3451158
We can find the sample variance using var(). Note:
divides by (n-1).
var(sample_numbers)
## [1] 24.77019
We find the sample standard deviation using sd(). Note:
divides by (n-1)
sd(sample_numbers)
## [1] 4.976966
Remember that the standard deviation is just the square root of the variance, see:
sqrt(var(sample_numbers))
## [1] 4.976966
Let’s put all of the descriptives and other functions so far in one place:
sample_numbers <- rnorm(100,10,5)
sum(sample_numbers)
## [1] 972.1034
length(sample_numbers)
## [1] 100
mean(sample_numbers)
## [1] 9.721034
median(sample_numbers)
## [1] 9.840207
Mode(sample_numbers)
## [1] -5.876128 -3.557092 -1.610557 -1.364455 2.038547 2.446879 2.716807
## [8] 2.857915 3.203215 3.414219 3.560047 3.680509 3.915945 4.010134
## [15] 4.207583 4.259008 4.592616 4.810556 6.299655 6.313818 6.436958
## [22] 6.528781 6.534385 6.581815 6.710463 6.835617 6.860096 6.934805
## [29] 7.114593 7.181564 7.417618 7.533927 7.540664 7.698494 7.851187
## [36] 7.862933 7.949608 7.951708 8.385656 8.619645 8.752677 8.864796
## [43] 8.905462 9.065786 9.101403 9.106720 9.370581 9.667762 9.826138
## [50] 9.832184 9.848229 10.044995 10.106572 10.128780 10.151572 10.317754
## [57] 10.372619 11.028662 11.069949 11.288431 11.382409 11.435255 11.509361
## [64] 11.655566 11.760190 11.786068 11.814417 11.943194 12.022185 12.125303
## [71] 12.251735 12.377169 12.418439 12.542764 12.576148 12.938953 13.373730
## [78] 13.393517 13.415657 13.745406 13.855387 14.017188 14.409490 14.516839
## [85] 14.815887 14.957229 15.551579 15.642595 15.715567 15.945781 16.179648
## [92] 16.570004 16.695359 16.908128 17.567889 18.250611 18.344952 18.956197
## [99] 20.319324 23.115511
range(sample_numbers)
## [1] -5.876128 23.115511
var(sample_numbers)
## [1] 25.93281
sd(sample_numbers)
## [1] 5.092427
Use R to find all descriptives of the variable
my_numbers2.
Enter your code below:
Sometimes you will have a single variable with some numbers, and you can use the above functions to find the descriptives for that variable. Other times (most often in this course), you will have a big data frame of numbers with different numbers in different conditions (like categories). You will want to find descriptive statistics for each set of numbers inside each condition. Fortunately, this is where R shines; it does it all for you in one go.
Let’s illustrate the problem. Here I make a date frame with 10 numbers in each condition. There are 10 conditions; each labeled A, B, C, D, E, F, G, H, I, J. We will generate normal distributions for the numbers.
scores <- rnorm(100,10,5)
conditions <- rep(c("A","B","C","D","E","F","G","H","I","J"), each =10)
my_df <- data.frame(conditions,scores)
my_df
## conditions scores
## 1 A 1.9488356
## 2 A 7.6981907
## 3 A 15.1967531
## 4 A 5.2810534
## 5 A 10.2873946
## 6 A 12.9512402
## 7 A 15.3255847
## 8 A 9.7095527
## 9 A 5.8881785
## 10 A 4.6142500
## 11 B 0.6735651
## 12 B 15.3922816
## 13 B 15.0215193
## 14 B 15.8173150
## 15 B 6.0673248
## 16 B 6.0563548
## 17 B 3.4776414
## 18 B 5.2022458
## 19 B 10.8758230
## 20 B 3.5605096
## 21 C 9.0694576
## 22 C 13.8911559
## 23 C 12.0003001
## 24 C 7.3177749
## 25 C 13.6167187
## 26 C 8.5854365
## 27 C 10.0177840
## 28 C 0.2906766
## 29 C 1.7578970
## 30 C 9.9594927
## 31 D 14.7351400
## 32 D 6.7614749
## 33 D 8.4439133
## 34 D 11.7209461
## 35 D 18.8355190
## 36 D 14.3774039
## 37 D 9.3426156
## 38 D 13.1055677
## 39 D 15.4358847
## 40 D 11.5136795
## 41 E 7.5431744
## 42 E 7.6459553
## 43 E 16.7578872
## 44 E 11.0546467
## 45 E 5.3828260
## 46 E 9.7528000
## 47 E 7.7449872
## 48 E 2.5349572
## 49 E 15.4350087
## 50 E 2.5950722
## 51 F 8.8688935
## 52 F 8.4733282
## 53 F 4.9192875
## 54 F 10.0617912
## 55 F 6.6109064
## 56 F 2.9170453
## 57 F 13.6016567
## 58 F 3.1694675
## 59 F 7.5155358
## 60 F 14.9516434
## 61 G 11.9831329
## 62 G 8.3750147
## 63 G 14.2390604
## 64 G 21.8280457
## 65 G 10.1204978
## 66 G 13.7089846
## 67 G 15.9991271
## 68 G 17.0954494
## 69 G 6.7744162
## 70 G 7.5092844
## 71 H 11.7537669
## 72 H 13.1108306
## 73 H 8.6129383
## 74 H 5.7875353
## 75 H 6.4207179
## 76 H 10.5233025
## 77 H 7.9591960
## 78 H 16.4457328
## 79 H 11.3249000
## 80 H 15.4430849
## 81 I 20.8165536
## 82 I 18.9330352
## 83 I 16.5555362
## 84 I 20.5136299
## 85 I 6.2706811
## 86 I 7.3330056
## 87 I 3.4291746
## 88 I 13.1598561
## 89 I 11.2340328
## 90 I 6.7070449
## 91 J 6.3224070
## 92 J 1.4869977
## 93 J 11.1531062
## 94 J 13.9827597
## 95 J 12.9382041
## 96 J 12.9607289
## 97 J -2.8484868
## 98 J 1.0457738
## 99 J 8.5069350
## 100 J 17.4951273
If you look at the my_df data frame, you will see it has
100 rows; there are 10 rows for each condition with a label in the
conditions column and 10 scores for each condition in the
scores column.
What if you wanted to know the mean scores in each condition?
You would want to find 10 means. The slow way to do it would be like this:
mean(my_df[my_df$conditions=="A",]$scores)
## [1] 8.890103
mean(my_df[my_df$conditions=="B",]$scores)
## [1] 8.214458
mean(my_df[my_df$conditions=="C",]$scores)
## [1] 8.650669
# and then keep going
Nobody wants to do that!
We can do everything all at once using the group_by and
summarize functions from the dplyr
package.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
my_df %>%
group_by(conditions) %>%
summarize(means = mean(scores))
## # A tibble: 10 × 2
## conditions means
## <chr> <dbl>
## 1 A 8.89
## 2 B 8.21
## 3 C 8.65
## 4 D 12.4
## 5 E 8.64
## 6 F 8.11
## 7 G 12.8
## 8 H 10.7
## 9 I 12.5
## 10 J 8.30
A couple of things now. First, the printout of this was ugly. Let’s
fix that. we put the results of our code into a new variable, then use
knitr::kable to print it out nicely when this document is
built.
summary_df <- my_df %>%
group_by(conditions) %>%
summarize(means = mean(scores))
knitr::kable(summary_df)
| conditions | means |
|---|---|
| A | 8.890103 |
| B | 8.214458 |
| C | 8.650669 |
| D | 12.427214 |
| E | 8.644732 |
| F | 8.108956 |
| G | 12.763301 |
| H | 10.738200 |
| I | 12.495255 |
| J | 8.304355 |
The best thing about the dplyr method is that we can add
more than one function, and we’ll get more than one summary returned,
all in a nice format, let’s add the standard deviation:
summary_df <- my_df %>%
group_by(conditions) %>%
summarise(means = mean(scores),
sds = sd(scores))
knitr::kable(summary_df)
| conditions | means | sds |
|---|---|---|
| A | 8.890103 | 4.597434 |
| B | 8.214458 | 5.598078 |
| C | 8.650669 | 4.549511 |
| D | 12.427214 | 3.628954 |
| E | 8.644732 | 4.790673 |
| F | 8.108956 | 4.022723 |
| G | 12.763301 | 4.770295 |
| H | 10.738200 | 3.607898 |
| I | 12.495255 | 6.451468 |
| J | 8.304355 | 6.629084 |
We’ll add the min and the max too:
summary_df <- my_df %>%
group_by(conditions) %>%
summarize(means = mean(scores),
sds = sd(scores),
min = min(scores),
max = max(scores))
knitr::kable(summary_df)
| conditions | means | sds | min | max |
|---|---|---|---|---|
| A | 8.890103 | 4.597434 | 1.9488356 | 15.32558 |
| B | 8.214458 | 5.598078 | 0.6735651 | 15.81732 |
| C | 8.650669 | 4.549511 | 0.2906766 | 13.89116 |
| D | 12.427214 | 3.628954 | 6.7614749 | 18.83552 |
| E | 8.644732 | 4.790673 | 2.5349572 | 16.75789 |
| F | 8.108956 | 4.022723 | 2.9170453 | 14.95164 |
| G | 12.763301 | 4.770295 | 6.7744162 | 21.82805 |
| H | 10.738200 | 3.607898 | 5.7875353 | 16.44573 |
| I | 12.495255 | 6.451468 | 3.4291746 | 20.81655 |
| J | 8.304355 | 6.629084 | -2.8484868 | 17.49513 |
Create a data frame with 5 conditions for every 20 numbers with normal distribution with a mean of 12 and a standard deviation of 3.
Calculate the mean and standard deviation for each condition.
Enter your code below:
Now that we know how to get descriptive statistics from R, we can do this will some actual data. Let’s quickly ask a few questions about the gapminder data.
library(gapminder)
gapminder_df <-gapminder
gapminder_df
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
Copy the code from the last part of descriptives using
dplyr, then change the names like this:
summary_df <- gapminder_df %>%
group_by(continent) %>%
summarize(means = mean(lifeExp),
sds = sd(lifeExp),
min = min(lifeExp),
max = max(lifeExp))
knitr::kable(summary_df)
| continent | means | sds | min | max |
|---|---|---|---|---|
| Africa | 48.86533 | 9.150210 | 23.599 | 76.442 |
| Americas | 64.65874 | 9.345088 | 37.579 | 80.653 |
| Asia | 60.06490 | 11.864532 | 28.801 | 82.603 |
| Europe | 71.90369 | 5.433178 | 43.585 | 81.757 |
| Oceania | 74.32621 | 3.795611 | 69.120 | 81.235 |
Calculate the mean, standard deviation, minimum and maximum for each continent’s GDP per Capita.
Enter your code below:
The material used in this document contains excerpts and modifications from: