Source file ⇒ Sam_Popular_Names_Project.Rmd
Examine the data you have at hand (for this project, the data are the one table BabyNames) to find out what variables are available and what the meaning of a case is.
BabyNames consists of four variables: name, sex, count, and year.
Imagine what your end report will look like and sketch out your idea. Here, Figure A.1 will serve as the sketch of the goal.
The graph shows year on the x-axis; Popularity, a new, transformed variable, on the y-axis; and name mapped to color, as shown in the legend. Each line is distinguished by its color and represents one of the three names selected from the BabyNames data set.
Analyze the graphic to figure out what a glyph-ready data table should look like. Mostly, this involves figuring out what variables are represented in the graph. Write down a small example of a glyph-ready data frame that you think could be used to make something in the form of the graphic.
The glyph-ready table needs name as the aesthetic shown in the legend, and year and the new variable Popularity as the aesthetics mapped to the x- and y-axes, respectively. Popularity is a new variable: the total number of births for a given name in a given year divided by the total number of births in that year.
What variable(s) from the raw data table do not appear at all in the graph?
sex does not appear at all in the graph. count does not appear either, but it is used to compute Popularity.
What variable(s) in the graph are similar to corresponding variables in the raw data table, but might have been transformed in some way?
Popularity is similar to count, but it has been transformed: it is the total count of a given name in a year divided by the total count of all names in that year.
Consider how the cases differ between the raw input and the glyph-ready table.
Have cases been filtered out?
Yes. Only the cases for James, John, and Robert are kept; the cases for all other names have been filtered out.
Have cases been grouped and summarized within groups in any way?
Yes. From the BabyNames data frame, we generated a new data frame by grouping by name and year and summing count, which gives the yearly counts of people named James, John, and Robert. The result is a table with one case per name per year, holding that name's combined count for the year.
Have any new variables been introduced? If so, what’s the relationship between the new variables and the existing variables?
Yes. Popularity is the quotient of two counts: the count of one of the three names in a given year (numerator) and the count of all names in that year (denominator). It gives the relative frequency of each name in each year.
Using English, write down a sequence of steps that will accomplish the wrangling from the raw data table to your hypothesized glyph-ready data table.
First, we want to know the most popular names in the BabyNames data set. In other words, which three names (under the name variable) have the greatest total count in the data set? To find out, select the name and count variables, group by name, sum count within each group, and arrange the result in descending order. The most popular names of all time are James, John, and Robert.
Next, generate another data table, named TotalNames, by extracting the year and count variables from the raw data set, grouping by year, and summing count. This lets us inspect the total number of births, across all names, in every year.
Create another data table, named YearNames, by extracting the name, year, and count variables and grouping by name, to inspect the number of James births per year, the number of John births per year, and the number of Robert births per year.
Next, since Popularity is the variable mapped to the y-axis, we cannot simply use year and count as the aesthetics for the x- and y-axes. We have to create the Popularity variable by dividing the count in YearNames by the count (labeled total) in TotalNames.
Since the two counts are in different data tables, we use a left join to merge the tables on their common variable, year, generating a new data table, DivideNames. Then, with the count and total variables side by side, we use a transform function to divide count by total for every year. The ratio is stored as the Popularity variable, alongside the year and name variables, in a new data table called SumNames.
Now we are almost done. We can use ggplot to map year to the x-axis and Popularity to the y-axis. We map name to the color aesthetic, which produces a legend and three different lines. The lines are drawn by the geom_line layer and styled with attributes such as alpha and linetype.
Using paper and pen, translate your design, step by step, into R.
We can now list our code step-by-step below:
We want to know the most popular names in the BabyNames data set. In other words, which three names (under the name variable) have the greatest count in the data set?
James, John, and Robert.
Below, we select the most popular names by grouping by name and sorting in descending order by count:
## Source: local data frame [92,600 x 2]
##
## name count
## (chr) (int)
## 1 James 5114325
## 2 John 5095590
## 3 Robert 4809858
## 4 Michael 4315029
## 5 Mary 4127615
## 6 William 4054318
## 7 David 3578068
## 8 Joseph 2568379
## 9 Richard 2561839
## 10 Charles 2369238
## .. ... ...
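
The chunk that produced the table above is not shown in this report. A minimal sketch of the step, assuming dplyr is used, could look like the following. `ToyNames` is a hypothetical miniature stand-in for BabyNames with made-up counts, and `TopNames` is an illustrative name, not one taken from the original code.

```r
library(dplyr)

# Hypothetical miniature stand-in for BabyNames (all counts are made up).
ToyNames <- data.frame(
  name  = c("James", "James", "John", "Robert", "Mary",
            "James", "John", "Robert", "Mary"),
  sex   = c("M", "F", "M", "M", "F", "M", "M", "M", "F"),
  count = c(60L, 2L, 50L, 30L, 40L, 55L, 45L, 35L, 20L),
  year  = c(1880L, 1880L, 1880L, 1880L, 1880L,
            1881L, 1881L, 1881L, 1881L),
  stringsAsFactors = FALSE
)

# Total births per name over all years and both sexes, largest first.
TopNames <- ToyNames %>%
  select(name, count) %>%
  group_by(name) %>%
  summarise(count = sum(count)) %>%
  arrange(desc(count))

# The first three rows hold the three most popular names in the toy data.
head(TopNames, 3)
```

With the real BabyNames table in place of `ToyNames`, the same pipeline should give a ranking like the one printed above.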
Here we extract the year and count variables from the raw data set to generate the total number of births per year:
TotalNames Data Frame:
## Source: local data frame [134 x 2]
##
## year total
## (int) (int)
## 1 1880 201484
## 2 1881 192700
## 3 1882 221537
## 4 1883 216952
## 5 1884 243468
## 6 1885 240856
## 7 1886 255320
## 8 1887 247396
## 9 1888 299481
## 10 1889 288952
## .. ... ...
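
As a hedged sketch of how a table like TotalNames might be built with dplyr, the example below groups a hypothetical toy table (`ToyNames`, made-up counts) by year and sums count into a column named total, matching the column name in the output above.

```r
library(dplyr)

# Hypothetical miniature stand-in for BabyNames (all counts are made up).
ToyNames <- data.frame(
  name  = c("James", "James", "John", "Robert", "Mary",
            "James", "John", "Robert", "Mary"),
  sex   = c("M", "F", "M", "M", "F", "M", "M", "M", "F"),
  count = c(60L, 2L, 50L, 30L, 40L, 55L, 45L, 35L, 20L),
  year  = c(1880L, 1880L, 1880L, 1880L, 1880L,
            1881L, 1881L, 1881L, 1881L),
  stringsAsFactors = FALSE
)

# One row per year; total is the sum of count over every name and sex.
TotalNames <- ToyNames %>%
  select(year, count) %>%
  group_by(year) %>%
  summarise(total = sum(count))

TotalNames
```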
Here we extract the name, year, and count variables from BabyNames to generate the total number of births per name per year:
YearNames Data Frame:
## Source: local data frame [402 x 3]
## Groups: name [?]
##
## name year count
## (chr) (int) (int)
## 1 James 1880 5949
## 2 James 1881 5466
## 3 James 1882 5910
## 4 James 1883 5249
## 5 James 1884 5726
## 6 James 1885 5201
## 7 James 1886 5384
## 8 James 1887 4787
## 9 James 1888 5607
## 10 James 1889 5046
## .. ... ... ...
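
A table shaped like YearNames could be produced along the following lines. This is a sketch under assumptions: `ToyNames` is a hypothetical stand-in for BabyNames with made-up counts, and the filter on the three names is inferred from the 402 rows reported above (3 names × 134 years). The `.groups = "drop_last"` argument leaves the result grouped by name, as the `Groups: name` header in the output suggests.

```r
library(dplyr)

# Hypothetical miniature stand-in for BabyNames (all counts are made up).
ToyNames <- data.frame(
  name  = c("James", "James", "John", "Robert", "Mary",
            "James", "John", "Robert", "Mary"),
  sex   = c("M", "F", "M", "M", "F", "M", "M", "M", "F"),
  count = c(60L, 2L, 50L, 30L, 40L, 55L, 45L, 35L, 20L),
  year  = c(1880L, 1880L, 1880L, 1880L, 1880L,
            1881L, 1881L, 1881L, 1881L),
  stringsAsFactors = FALSE
)

# Keep only the three names of interest, then sum count per name and year
# (this combines the male and female rows for the same name and year).
YearNames <- ToyNames %>%
  filter(name %in% c("James", "John", "Robert")) %>%
  select(name, year, count) %>%
  group_by(name, year) %>%
  summarise(count = sum(count), .groups = "drop_last")

YearNames
```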
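For each case (one name in one year), we want to divide:
(number of births named X in year i) / (number of births of all names in year i), where i = 1880, …, 2013 and X is James, John, or Robert. To do this, we first left-join YearNames and TotalNames so that both counts sit in the same row: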
DivideNames Data Frame:
## Source: local data frame [402 x 4]
## Groups: name [?]
##
## name year count total
## (chr) (int) (int) (int)
## 1 James 1880 5949 201484
## 2 James 1881 5466 192700
## 3 James 1882 5910 221537
## 4 James 1883 5249 216952
## 5 James 1884 5726 243468
## 6 James 1885 5201 240856
## 7 James 1886 5384 255320
## 8 James 1887 4787 247396
## 9 James 1888 5607 299481
## 10 James 1889 5046 288952
## .. ... ... ... ...
The left join above combines variables from two different data frames by matching cases on their common variable, year.
Now, we would like to add another variable (column) to the DivideNames data frame:
Popularity = (number of births named X in year i) / (number of births of all names in year i), where i = 1880, …, 2013 and X is James, John, or Robert.
This new variable gives the share of each year's births that went to each name. The values below are expressed as percentages; for example, for James in 1880, 5949 / 201484 × 100 ≈ 2.95:
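
To illustrate the join, here is a small sketch using dplyr's `left_join()`. The two input tables are hypothetical toy versions of YearNames and TotalNames with made-up counts; only the column names follow the output shown above.

```r
library(dplyr)

# Hypothetical toy versions of the two tables (all counts are made up).
YearNames <- data.frame(
  name  = c("James", "James", "John", "John"),
  year  = c(1880L, 1881L, 1880L, 1881L),
  count = c(62L, 55L, 50L, 45L),
  stringsAsFactors = FALSE
)
TotalNames <- data.frame(
  year  = c(1880L, 1881L),
  total = c(182L, 155L)
)

# Every row of YearNames is kept; the matching year's total is attached to it.
DivideNames <- left_join(YearNames, TotalNames, by = "year")

DivideNames
```

Because each year appears once in the toy `TotalNames`, the join does not change the number of rows; it only adds the total column.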
SumNames Data Frame:
## name year count total Popularity
## 1 James 1880 5949 201484 2.952592
## 2 James 1881 5466 192700 2.836533
## 3 James 1882 5910 221537 2.667726
## 4 James 1883 5249 216952 2.419429
## 5 James 1884 5726 243468 2.351849
## 6 James 1885 5201 240856 2.159382
## 7 James 1886 5384 255320 2.108726
## 8 James 1887 4787 247396 1.934954
## 9 James 1888 5607 299481 1.872239
## 10 James 1889 5046 288952 1.746311
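
The Popularity column can be checked by hand. The sketch below rebuilds the first two James rows from the DivideNames output above and applies base R's `transform()`. Whether the original code used `transform()` or another function is an assumption, and so is the factor of 100, which is inferred from the printed values (5949 / 201484 × 100 ≈ 2.952592).

```r
# First two rows of DivideNames, copied from the output shown earlier.
DivideNames <- data.frame(
  name  = c("James", "James"),
  year  = c(1880L, 1881L),
  count = c(5949L, 5466L),
  total = c(201484L, 192700L),
  stringsAsFactors = FALSE
)

# Popularity as a percentage of all births recorded in that year.
SumNames <- transform(DivideNames, Popularity = 100 * count / total)

SumNames
```

The two computed values agree with the first two rows of the SumNames table above (2.952592 and 2.836533).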
Now, we would like to plot our data. The SumNames data frame holds the variables name, year, count, total, and Popularity. We are asked to create a two-dimensional graph displaying how the popularity of names changes over time. As such, we need year on the x-axis and Popularity on the y-axis. We also need the name variable to create three different lines, one for each name.
As such, we use ggplot and its layers to draw the graph. The plot is built from GraphNames, which contains only the variables mapped to visible aesthetics in the plot.
GraphNames Data Frame:
## name year Popularity
## 1 James 1880 2.952592
## 2 James 1881 2.836533
## 3 James 1882 2.667726
## 4 James 1883 2.419429
## 5 James 1884 2.351849
## 6 James 1885 2.159382
Note: The above data set is truncated to 6 cases for visual simplicity; the full table used for the plot has more than 6 cases.
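
The plotting code is not shown in this report, so the following is only a sketch of how GraphNames and the line plot might be produced with dplyr and ggplot2. The `SumNames` table here is a hypothetical toy version with made-up counts, and the specific `alpha` value and the choice to map `linetype` to name are illustrative assumptions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy version of SumNames (all counts are made up).
SumNames <- data.frame(
  name  = rep(c("James", "John", "Robert"), each = 3),
  year  = rep(1880:1882, times = 3),
  count = c(62L, 55L, 58L, 50L, 45L, 47L, 30L, 35L, 33L),
  total = rep(c(1000L, 1100L, 1200L), times = 3),
  stringsAsFactors = FALSE
)
SumNames$Popularity <- 100 * SumNames$count / SumNames$total

# Keep only the variables that are mapped to aesthetics in the plot.
GraphNames <- SumNames %>%
  select(name, year, Popularity)

# One line per name: year on x, Popularity on y, name shown by color.
p <- ggplot(GraphNames, aes(x = year, y = Popularity, color = name)) +
  geom_line(aes(linetype = name), alpha = 0.8) +
  labs(x = "year", y = "Popularity", color = "name", linetype = "name")

print(p)
```

Mapping name to both color and linetype keeps the three lines distinguishable even when the plot is printed in grayscale.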
Implement, test, and revise. This was done above.