** Please click all the tabs (in sequence) to get the entire set of information in these pages. **
** To download code, see the instructions in Session 2: https://rpubs.com/hkb/DAX-Session2 **
Analytics at Dropbox: Jane You
Acquiring and Loading a Local Data Set
Basic Operations on Data Sets
Project Updates
In Session 6 we looked at how to create a toy data set “by hand” - that is, by writing code to specify the rows and columns and the values that go into each.
More generally, though, we bring in a data set by pointing to a data set (e.g., a csv file) on the web (like we did for the movies/boxoffice project, or that’s how you would find data related to covid for instance), or by reading a local file. Let’s see how to implement the local file approach in RStudio.
Let’s first identify the data source. In this case we’ll get data from a collaboration between Kaggle and The Guardian. The data set on Olympics medals history is located at https://www.kaggle.com/the-guardian/olympic-games.
We’ll download two data sets: (1) summer.csv, which lists all the medals won by each country (more on that below) and (2) dictionary.csv which does two things, i) it matches 3-digit country abbreviations to country full names, and ii) provides population and GDP data for each country.
Q: Why does someone use 3-digit abbrevations?
Suppose you’ve downloaded the two data sets to your local computer. The next thing you’d do is upload them into your RStudio.Cloud workspace - I’m adding them to a folder called “Nikhil” (why?).
Now you’re ready to bring it into your active workspace.
summer = read.csv('Nikhil/Olympics-full-summer.csv', header = TRUE)
Here’s a sample of the data
print(summer[sample(1000,2),])
NA
Notice that country names are abbreviated here. That might be good for some purposes. For others we might want full names.
Fortunately we have the dictionary data set that maps country abbrevations to full names, and also provides population data.
countries = read.csv('Nikhil/dictionary.csv', header = TRUE)
head(countries)
So, what we want to do is “merge” the two data sets. What does that mean? and What information do we need to do the merge?
summer <- summer %>% rename(Code = Country)
print(t(summer[sample(1000,2),]))
227
Year "1900"
City "Paris"
Sport "Archery"
Discipline "Archery"
Athlete "SERRURIER, Auguste"
Code "FRA"
Gender "Men"
Event "Sur La Perche À La Herse"
Medal "Silver"
710
Year "1904"
City "St Louis"
Sport "Archery"
Discipline "Archery"
Athlete "POLLOCK, Jessie"
Code "USA"
Gender "Women"
Event "Double Columbia Round (50Y - 40Y - 30Y)"
Medal "Bronze"
full.summer <- inner_join(summer, countries, by="Code")
print(t(full.summer[sample(1000,2),]))
980 282
Year "1904" "1900"
City "St Louis" "Paris"
Sport "Lacrosse" "Athletics"
Discipline "Lacrosse" "Athletics"
Athlete "WOODS" "CHAMPION, Emile"
Code "USA" "FRA"
Gender "Men" "Men"
Event "Lacrosse" "Marathon"
Medal "Silver" "Silver"
Country "United States" "France"
Population "321418820" " 66808385"
GDP.per.Capita "56116" "36206"
### Counting and Summing
The summer.csv data file lists the medals won at each event. Specifically, each row lists the type of event, year, the sportsperson and what kind of medal was won. This is a large data set, with 31165 rows. So, this is a lot of individual-level detail regarding who won what when.
This type of individual level detail has limited value for data analysis and insights. What kinds of interesting questions would you answer with it?
To go further, we can summarize the individual level information in various ways.
Suppose we care about the number of medals won by each country each year, across all the categories, and across all medal types.
To break this up, first we’ll count for each medal type.
medal.count.each <- full.summer %>% group_by(Code,Year,Medal) %>% dplyr::mutate(n.medals=n()) %>% select(Country,Code,Year,Medal,n.medals) %>% distinct()
head(medal.count.each)
Now, suppose we just care about total medals won each year by each country, across gold/silver/bronze.
medal.count <- medal.count.each %>% group_by(Code,Year) %>% dplyr::mutate(n=sum(n.medals)) %>% select(Country,Code,Year,n) %>% distinct()
head(medal.count)
Next let us rank the countries by medals won, for each year of the games.
country.ranks <- medal.count %>% group_by(Year) %>% arrange(desc(n)) %>% dplyr::mutate(rank=row_number()) %>% arrange(Year) %>% ungroup()
head(country.ranks)
Now suppose we only want to keep the top 5 ranks each year.
country.ranks <- country.ranks %>% filter(rank < 6)
head(country.ranks)