This week, we are focusing on descriptive statistics. Descriptive statistics is explained in the Handbook of Biological Statistics (http://www.biostathandbook.com/central.html) and in the R companion (https://rcompanion.org/rcompanion/c_01.html).
First, we need to load new packages that allow us to run some functions for descriptive statistics. Previously, we have only been using the base packages. The base packages have common tools that developers thought users would need, such as the functions plot(), str(), head(), etc. There are a lot, actually.
However, some functions are not part of the base packages, but are available because someone else wrote the code for the function and put it in a package. We can install those packages. One way to think of a package is as a library of books. Each book is a function. If you want a book, you need to go to the library (i.e., install and load the package), and then you can read the book (use the function). We will be working with two new packages for this assignment, called “psych” and “DescTools”. To install packages, we have to use the install.packages() function.
# install two packages called psych and DescTools
install.packages("psych")
install.packages("DescTools")
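Since a package only has to be installed once, it can be handy to check whether it is already there before installing again. Here is a minimal sketch of that check; missing_packages() is a made-up helper name for illustration, not a function from any package.

```r
# Hypothetical helper: returns the names in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be found, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("psych", "DescTools"))

if (length(to_install) > 0) {
  # We only print a reminder here; run install.packages(to_install) yourself
  message("Still need install.packages() for: ", paste(to_install, collapse = ", "))
} else {
  message("Both packages are already installed.")
}
```

If the list comes back empty, you can skip install.packages() and go straight to library().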
Then, we have to call up the package using the library() function. Note, once you install a package, you do not need to re-install it each time (although you can). You do, however, need to use the library() function to load it in every new session.
library(psych)
library(DescTools)
##
## Attaching package: 'DescTools'
## The following objects are masked from 'package:psych':
##
## AUC, ICC, SD
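The "masked" message above means that both packages have functions named AUC, ICC, and SD, and that the package loaded last (DescTools) wins when you type the short name. The sketch below shows the same idea using only base R, by masking mean() with a made-up function of our own; the :: operator lets you say exactly which package you mean.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Our own mean() now hides (masks) the one in the base package,
# just like a later package hides a same-named function in an earlier one.
mean <- function(x) "this is the masking version"
mean(x)

# package::function always reaches the version in that package
real_mean <- base::mean(x)
real_mean

# Remove our masking version so mean() behaves normally again
rm(mean)

# With the two packages from this assignment loaded, the same pattern would be:
# psych::SD(x)      # the psych version
# DescTools::SD(x)  # the DescTools version
```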
Great. Now you have installed a package and are ready to do some statistics. Let’s go!
Note 1, in future assignments I will pre-install packages, and you will just need to load them with library(). Another advantage of the cloud.
Note 2, it might be wise to start making a list of packages, and the functions you use in that package, and why. This will help you access the information quickly, for assignments and let’s say, quizzes.
Ok, let’s talk descriptive statistics.
Mean
First, we have to reload the dataset. I called my object finches2, but you can rename yours. Note that if you do, you will have to use your new object name throughout, but it is good practice.
finches2 <- read.csv("Finches_Dataset_BIO205Class.csv")
str(finches2)
## 'data.frame': 100 obs. of 12 variables:
## $ Band : int 9 12 276 278 283 288 293 294 298 307 ...
## $ Species : chr "Geospiza fortis" "Geospiza fortis" "Geospiza fortis" "Geospiza fortis" ...
## $ Sex : chr "unknown" "female" "unknown" "unknown" ...
## $ First.adult.year: int 1975 1975 1976 1976 1976 1976 1976 1976 1976 1975 ...
## $ Last.Year : int 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
## $ Survivor : chr "No" "No" "No" "No" ...
## $ Weight..g. : num 14.5 13.5 16.4 18.5 17.4 ...
## $ Wing..mm. : num 67 66 64.2 67.2 70.2 ...
## $ Tarsus..mm. : num 18 18.3 18.5 19.3 19.3 ...
## $ Beak.Length..mm.: num 9.2 9.5 9.93 11.13 12.13 ...
## $ Beak.Depth..mm. : num 8.3 7.5 8 10.6 11.2 9.1 9.5 10.5 8.4 8.6 ...
## $ Beak.Width..mm. : num 8.1 7.5 7.6 9.4 9.5 8.8 8.9 9.1 8.2 8.4 ...
Let’s say we want to know the average (arithmetic mean) of the beak depth. The mean, or average, is one way to describe the central value of a dataset. To get the mean, use the mean() function.
mean(finches2$Beak.Depth..mm.)
## [1] 9.3924
Simple huh!
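To see what mean() is doing, here is a small sketch using just the first five beak depths printed by str() above. It also shows what happens with a missing value (NA); the NA is added here only for illustration, and we are not claiming the finch dataset has any.

```r
# First five beak depths from the str() output above
x <- c(8.3, 7.5, 8.0, 10.6, 11.2)

# The arithmetic mean is the sum divided by the number of values
by_hand <- sum(x) / length(x)
by_hand
mean(x)

# Illustration only: one missing value makes the mean NA ...
x_with_gap <- c(x, NA)
mean(x_with_gap)

# ... unless we tell mean() to drop missing values first
mean(x_with_gap, na.rm = TRUE)
```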
How about the means of males and females? For this, we can use the brackets ([]) that we used previously to select specific groups within one variable. In this case, the variable is “Sex” and the groups are male or female.
# find mean beak depths in males and females
malemean <- mean(finches2$Beak.Depth..mm.[finches2$Sex == "male"])
femalemean <- mean(finches2$Beak.Depth..mm.[finches2$Sex == "female"])
The male mean is 9.6393617. The female mean is 9.2394737.
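The bracket approach works one group at a time. Base R also has tapply(), which applies a function to every group in one go. Here is a minimal sketch on a tiny made-up data frame (the numbers are invented for illustration and are not finch measurements).

```r
# Made-up toy data, NOT the finch dataset
toy <- data.frame(
  Sex   = c("male", "female", "male", "female", "unknown"),
  Depth = c(10, 9, 11, 8, 9.5)
)

# Bracket approach, one group at a time
male_only <- mean(toy$Depth[toy$Sex == "male"])
male_only

# tapply(values, groups, function) gives one mean per group
group_means <- tapply(toy$Depth, toy$Sex, mean)
group_means
```

Both approaches give the same male mean; tapply() just saves typing when there are several groups.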
Good. You try.
1. What is the mean of male and female weight?
2. Is there a difference between the weight of the females that survived, and those that didn’t?
Provide the numbers in your comments.
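A hint for combining two conditions inside the brackets: the & operator keeps only the rows where both conditions are TRUE. The sketch below uses a made-up data frame, and the value "Yes" for survivors is an assumption (str() above only shows "No"), so check your own data first.

```r
# Made-up toy data, NOT the finch dataset
toy <- data.frame(
  Sex      = c("female", "female", "female", "female", "male"),
  Survivor = c("Yes", "No", "Yes", "No", "Yes"),   # "Yes" is an assumed label
  Weight   = c(15, 14, 17, 13, 18)
)

# Rows that are female AND survived
f_surv <- mean(toy$Weight[toy$Sex == "female" & toy$Survivor == "Yes"])

# Rows that are female AND did not survive
f_nonsurv <- mean(toy$Weight[toy$Sex == "female" & toy$Survivor == "No"])

f_surv
f_nonsurv
```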
Median
Whether to use the mean or the median depends on how your data are distributed. We will discuss this in class, but briefly, outliers can severely skew the mean. For example, if the class mean for an exam was a 75, but two people in the class got perfect 100s, then the rest of the class really scored lower than 75 on average. There, the median may make more sense. For this, we use the median() function. Let’s find the median values for female and male beak depths.
malemed <- median(finches2$Beak.Depth..mm.[finches2$Sex == "male"])
femalemed <- median(finches2$Beak.Depth..mm.[finches2$Sex == "female"])
The male median is 9.6. The female median is 9.48.
Notice that in this case, the median and the mean are very close. Recall from lecture what factors may affect means and medians.
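To make the exam example concrete, here is a small sketch with made-up scores for a class of seven, where two people got 100.

```r
# Made-up exam scores: five ordinary scores and two perfect 100s
scores <- c(60, 62, 65, 68, 70, 100, 100)

mean(scores)    # pulled upward by the two 100s
median(scores)  # the middle score, not affected by how high the top scores are

# Without the two outliers, the mean drops
mean(scores[scores < 100])
```

With these numbers the mean is 75 but the median is 68, so the median is closer to what a typical student scored.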
Mode
We can also get the mode, which is the most common value. This is sometimes useful for discrete or categorical data. For example, if you asked 100 freshmen how many math classes they took in high school, the most popular answer might be informative.
Note, you need the DescTools package for this. We loaded it earlier, but you might need to run library() again if you closed your session.
library(DescTools)
Mode(finches2$Beak.Depth..mm.)
## [1] 8.8
## attr(,"freq")
## [1] 6
Here, the output shows the mode (8.8 mm) and, under attr(,"freq"), how many times it comes up in the dataset (6). If two or more values tie for the most common, Mode() returns all of them.
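If you are curious how a mode can be found without a package, here is a minimal base R sketch using table(), which counts how often each value appears. The values are made up so that two of them tie.

```r
# Made-up values with a tie: 8.8 and 10.2 both appear twice
x <- c(8.8, 9.1, 8.8, 10.2, 10.2, 9.5)

# Count how many times each value occurs
counts <- table(x)
counts

# Keep every value whose count equals the highest count
modes <- as.numeric(names(counts)[counts == max(counts)])
modes
```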
Oh wait. I forgot to get mode by sex. As Elliot would say, that’s so silly :). Give it a try and get the mode of beak depths, by female and male finches.
For some dispersion statistics, you need another package called Rmisc. This is one we use often. (The range(), var(), and sd() functions below are built into R, so they work even without it.)
install.packages("Rmisc")
library(Rmisc)
Range
range(finches2$Beak.Depth..mm.)
## [1] 7.50 11.21
Variance
var(finches2$Beak.Depth..mm.)
## [1] 0.8144851
Standard deviation
sd(finches2$Beak.Depth..mm.)
## [1] 0.9024883
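These three statistics are closely related, and it can help to compute them by hand once. Here is a short sketch on a made-up vector: the variance is the sum of squared distances from the mean divided by n - 1, and the standard deviation is its square root.

```r
# Made-up values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# range() returns the minimum and maximum; diff() turns that into a single width
range(x)
width <- diff(range(x))

# Sample variance by hand: squared deviations from the mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (length(x) - 1)
var_by_hand
var(x)

# Standard deviation is the square root of the variance
sqrt(var(x))
sd(x)
```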
The great thing with programming is that you can use functions that get you many statistics at once. Yes, we can easily calculate the mean, standard deviation, etc. with any program like Excel. But, after some familiarity with R, you can also use one function, in this case the describe() function, to get several summary values.
Again, we load a new package called Hmisc. Note that both psych and Hmisc have a function named describe(). The output below is from the psych version, so we call it as psych::describe() to be explicit.
install.packages("Hmisc")
library(Hmisc)
psych::describe(finches2$Beak.Depth..mm.)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 9.39 0.9 9.3 9.39 1.05 7.5 11.21 3.71 0.02 -0.83 0.09
As you can see, there is quite a bit of information in both describe() and summary(). These might come in handy in the future.
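For comparison, summary() is built into R and needs no package. A quick sketch, using only the first five beak depths printed by str() earlier:

```r
# First five beak depths from the str() output earlier
x <- c(8.3, 7.5, 8.0, 10.6, 11.2)

# summary() gives the minimum, quartiles, median, mean, and maximum
s <- summary(x)
s

# Individual pieces can be pulled out by name
s[["Median"]]
s[["Mean"]]
```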
Your turn. Let’s do some work comparing the survivors and non-survivors in this dataset.
1. What are the most likely dependent variables in this dataset?
2. Examine all the dependent variables. Which of these dependent variables has the greatest range of values?
3. Examine all the dependent variables. Identify the mean of all survivors and non-survivors for all dependent variables.
4. Examine all the dependent variables. Identify the standard deviation of all survivors and non-survivors for all dependent variables.
Challenge 1
5. Make a table that has the following columns: variable, means of survivors, means of non-survivors, sd of survivors, sd of non-survivors. Each row will be a different dependent variable. Examine your table. Of all the dependent variables, which has the greatest difference in mean values between survivors and non-survivors?
Provide the numbers and explanation in your comments.
Sometimes, it is helpful to see how the data points are distributed around a central point, such as the mean. For this, we can plot the data in a histogram. An easy way to do this in R is with the hist() function.
hist(finches2$Beak.Depth..mm.)
We can also add a line to indicate where a value, such as the mean, is located. For this, we use the abline() function.
hist(finches2$Beak.Depth..mm.)
abline(v=mean(finches2$Beak.Depth..mm.), col="red", lwd = 3)
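Behind the picture, hist() is just sorting values into bins and counting them. If you want to see those counts as numbers, you can ask hist() not to draw anything. A small sketch on made-up values, with the bin edges chosen by hand so the result is easy to check:

```r
# Made-up values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# plot = FALSE returns the histogram as an object instead of drawing it;
# breaks sets the bin edges (here 0-2, 2-4, 4-6, 6-8, 8-10)
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

h$breaks  # the bin edges
h$counts  # how many values fall in each bin
```

By default a value that sits exactly on an edge is counted in the bin to its left, so the three 4s land in the 2-4 bin.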
Challenge 2
1. Make histograms of your answer to question 2 above (biggest range), for the surviving and non-surviving finches (2 graphs). Add the line for the mean in red and add the lines for the standard deviation in blue.
2. Try changing the color of the bars to random colors. For more information on colors, see this page (https://www.r-bloggers.com/color-palettes-in-r/).
3. Make observations about the differences you observe.
Note, I did not include the results of depth_z, but you should get a 100-element vector of z scores. Now, we want to create a table that has our z scores with the bird identities, including the Band and the Survivor status. So, we have to make a table using the data.frame() function we learned last week.
depth_tab <- data.frame(finches2$Band, finches2$Survivor, depth_z)
Use the View() function to look at the table you created. Using this table, and the sorting functionality, we can sort through which have the highest and lowest z scores. Notice that many of the non-surviving finches have low z scores (less than the mean) whereas many of the surviving birds have higher z scores.
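As a reminder of what goes into a z score, here is a minimal sketch on a made-up data frame. It assumes the usual definition (value minus the mean, divided by the standard deviation); the names toy, toy_z, and toy_tab are invented for this example. It also shows order(), which sorts a table in code instead of by clicking in View().

```r
# Made-up toy data, NOT the finch dataset
toy <- data.frame(
  Band     = c(1, 2, 3, 4, 5),
  Survivor = c("No", "No", "Yes", "No", "Yes"),
  Depth    = c(8, 9, 10, 9, 11)
)

# z score: how many standard deviations each value is from the mean
toy_z <- (toy$Depth - mean(toy$Depth)) / sd(toy$Depth)

# scale() in base R does the same calculation (it returns a matrix, so we flatten it)
toy_z_scaled <- as.numeric(scale(toy$Depth))

# Put the z scores next to the bird identities
toy_tab <- data.frame(Band = toy$Band, Survivor = toy$Survivor, z = toy_z)

# Sort from lowest to highest z score
toy_sorted <- toy_tab[order(toy_tab$z), ]
toy_sorted
```

A set of z scores always has a mean of 0 and a standard deviation of 1, which is a quick way to check your own depth_z.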
Challenge 3
1. Make a table of z scores, similar to the one above, but include the independent variables Band and Survivor, and all the dependent variables. You can include only z scores and put them all in one table (8 total columns).
2. Make observations about the differences you observe from the data, such as which variables have more birds that are far from the mean, and whether the birds far from the mean are more likely to be survivors or non-survivors.
3. What kind of information do you think you learn when reading across a row, regarding an individual bird?
Challenge 4
Let’s take a more practical use (for you) - test scores. Let’s say Sam took the MCAT (freaked out emoji) and got a 505. Is that good, bad? Use the new dataset provided, called MCAT.csv. It should be in your working directory already. Using that, answer the following:
1. What are the mean and median GPA of applicants to the schools provided? Do these data look normal?
2. What are the mean and median MCAT score of applicants to the schools provided? Do these data look normal?
3. What is Sam’s z score for the MCAT compared to all schools (schools’ averages)? If you use a z-table (http://www.z-table.com/), you would be able to see what percentage of test-takers Sam scored above or below.
4. Given what you know about standard deviation, discuss the implications of the score and where Sam’s score likely falls within the population of test takers.
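Instead of looking up a z-table by hand, R can do the lookup with pnorm(), which gives the proportion of a normal distribution that falls below a given z score. The sketch below uses Sam’s 505 from the text, but the mean and standard deviation are made-up placeholders, not values from MCAT.csv; swap in the numbers you calculate. The helper name z_score() is also invented for this example.

```r
# Hypothetical helper: standard z score
z_score <- function(x, center, spread) {
  (x - center) / spread
}

sam_score        <- 505  # from the text
placeholder_mean <- 500  # made-up placeholder, replace with your value from MCAT.csv
placeholder_sd   <- 10   # made-up placeholder, replace with your value from MCAT.csv

sam_z <- z_score(sam_score, placeholder_mean, placeholder_sd)
sam_z

# Proportion of a normal distribution below this z score (what a z-table reports)
below <- pnorm(sam_z)
below
```

This only makes sense if the scores are roughly normal, which is why questions 1 and 2 ask you to check that first.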