Here is a brief exploration of the number of gold medals won by each country in the 2012 Olympics.
I begin by reading in a dataset that I previously created and published as a Google spreadsheet. I downloaded the file in CSV format and saved it in a folder named “markdown”, then created a new Project in RStudio that reads files from that folder.
olympics = read.csv("Olympics Medals 2012.csv")
To make sure I've read in the dataset correctly, I'll display the first few lines of the data frame.
head(olympics)
## Country Gold Silver Bronze Total
## 1 United States (USA) 46 29 29 104
## 2 China (CHN) 38 27 23 88
## 3 Great Britain (GBR)* 29 17 19 65
## 4 Russia (RUS) 24 26 32 82
## 5 South Korea (KOR) 13 8 7 28
## 6 Germany (GER) 11 19 14 44
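As a further check on the import, one could confirm that each Total equals the sum of the three medal columns. The sketch below rebuilds only the six rows printed above as a stand-in for the full file (the name `first_rows` is made up for illustration). With the real data, the same comparison would presumably be run on `olympics` itself.

```r
# Stand-in for the six rows shown by head(olympics);
# `first_rows` is a hypothetical name, not part of the original analysis.
first_rows <- data.frame(
  Country = c("United States (USA)", "China (CHN)", "Great Britain (GBR)*",
              "Russia (RUS)", "South Korea (KOR)", "Germany (GER)"),
  Gold    = c(46, 38, 29, 24, 13, 11),
  Silver  = c(29, 27, 17, 26, 8, 19),
  Bronze  = c(29, 23, 19, 32, 7, 14),
  Total   = c(104, 88, 65, 82, 28, 44)
)

# TRUE for every row whose Total matches Gold + Silver + Bronze
totals_ok <- with(first_rows, Gold + Silver + Bronze == Total)
all(totals_ok)
```

A `FALSE` anywhere in `totals_ok` would point to a column that was shifted or misread during the import.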
I am going to focus on the variable Gold, which contains the number of gold medals won by each country.
I load the LearnEDA package.
library(LearnEDA)
## Loading required package: aplpack
## Loading required package: tcltk
## Loading Tcl/Tk interface ...
## done
## Loading required package: vcd
## Loading required package: MASS
## Loading required package: grid
## Loading required package: colorspace
## Attaching package: 'LearnEDA'
## The following object(s) are masked from 'package:MASS':
##
## farms
I'll first construct a stemplot of the gold medal counts.
stem.leaf(olympics$Gold)
## 1 | 2: represents 1.2
## leaf unit: 0.1
## n: 85
## 31 0* | 0000000000000000000000000000000
## 0. |
## (19) 1* | 0000000000000000000
## 1. |
## 35 2* | 0000000000
## 2. |
## 25 3* | 00000
## 3. |
## 20 4* | 0000
## 4. |
## 16 5* | 0
## 5. |
## 15 6* | 000
## 6. |
## 12 7* | 000
## HI: 8 8 11 11 13 24 29 38 46
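Because the leaf unit is 0.1, each leaf of 0 stands for one country whose gold medal count is the whole number in the stem. Most countries won only a handful of gold medals (50 of the 85 won none or one), while a few countries, such as China and the U.S., won many. This is not a particularly effective graph: every leaf is a zero and the nine largest counts are pushed onto the HI line. We'll learn about ways of transforming the counts to make the display easier to read.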
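To put numbers on that impression without the data file, the counts can be read back off the stemplot itself. This is only a sketch: the vector `gold` is rebuilt by hand from the display (stems 0 to 7 plus the HI values), so it stands in for `olympics$Gold` rather than replacing it.

```r
# Gold counts rebuilt from the stemplot: 31 zeros, 19 ones, ..., 3 sevens,
# followed by the nine values listed on the HI line.
gold <- c(rep(0:7, times = c(31, 19, 10, 5, 4, 1, 3, 3)),
          8, 8, 11, 11, 13, 24, 29, 38, 46)

# Number of countries; should agree with n in the stemplot header
length(gold)

# Frequency of each distinct count, the same information as the leaf lengths
table(gold)

# Share of countries that won at most two gold medals
mean(gold <= 2)
```

Here 60 of the 85 countries, about 71%, won two gold medals or fewer, which is what the long rows at the top of the display are showing.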
One could construct a histogram, but it gives the same general impression of the distribution of the gold medal counts.
hist(olympics$Gold)
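If one wants the histogram to line up with the stemplot, one option is to give every integer count its own bin. The sketch below assumes a hand-rebuilt vector `gold` (read off the stemplot) in place of `olympics$Gold`, and uses `plot = FALSE` so that only the bin counts are computed.

```r
# Gold counts rebuilt from the stemplot, standing in for olympics$Gold
gold <- c(rep(0:7, times = c(31, 19, 10, 5, 4, 1, 3, 3)),
          8, 8, 11, 11, 13, 24, 29, 38, 46)

# One bin per integer count, centred on 0, 1, ..., 46.
# plot = FALSE returns the histogram object without drawing it.
bins <- hist(gold, breaks = seq(-0.5, 46.5, by = 1), plot = FALSE)

# Bin counts for 0 through 7 gold medals
bins$counts[1:8]
```

Dropping `plot = FALSE` draws the picture. The first eight bin counts match the leaf counts on stems 0 through 7, so the histogram carries the same right-skewed shape.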
We'll talk about finding a five-number summary of these counts. The individual summaries are called letter values.
lval(olympics$Gold)
## depth lo hi mids spreads
## M 43.0 1 1.0 1.00 0.0
## H 22.0 0 3.0 1.50 3.0
## E 11.5 0 7.0 3.50 7.0
## D 6.0 0 11.0 5.50 11.0
## C 3.5 0 26.5 13.25 26.5
## B 2.0 0 38.0 19.00 38.0
## A 1.0 0 46.0 23.00 46.0
This tells us that the median number of gold medals won was 1, that the middle half of the gold medal counts lie between 0 and 3 (the fourths, in row H), and so on.
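To see where the first rows of that table come from, here is a minimal sketch that computes the median, fourths, and eighths by counting in from both ends of the sorted data. It assumes the usual EDA depth rule (each new depth is one more than the whole part of the previous depth, divided by two) and a vector `gold` rebuilt by hand from the stemplot. The helper `letter_value` is made up for illustration and is not a LearnEDA function.

```r
# Gold counts rebuilt from the stemplot, standing in for olympics$Gold
gold <- c(rep(0:7, times = c(31, 19, 10, 5, 4, 1, 3, 3)),
          8, 8, 11, 11, 13, 24, 29, 38, 46)

# Hypothetical helper: values at a given depth, counted in from each end.
# A half-integer depth averages the two neighbouring sorted values.
letter_value <- function(x, depth) {
  s   <- sort(x)
  n   <- length(s)
  pos <- c(floor(depth), ceiling(depth))
  lo  <- mean(s[pos])
  hi  <- mean(s[n + 1 - pos])
  c(depth = depth, lo = lo, hi = hi, mid = (lo + hi) / 2, spread = hi - lo)
}

n       <- length(gold)
depth_M <- (n + 1) / 2                # median
depth_H <- (floor(depth_M) + 1) / 2   # fourths
depth_E <- (floor(depth_H) + 1) / 2   # eighths

rbind(M = letter_value(gold, depth_M),
      H = letter_value(gold, depth_H),
      E = letter_value(gold, depth_E))

# Base R's five-number summary: min, lower fourth, median, upper fourth, max
fivenum(gold)
```

The depths come out as 43, 22, and 11.5, matching the M, H, and E rows above, and `fivenum()` gives the extremes, fourths, and median in a single call.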