Stat 133 Lab 4

Step 3:

Analyze the graphic to figure out what a glyph-ready data table should look like. Mostly, this involves figuring out what variables are represented in the graph. Write down a small example of a glyph-ready data frame that you think could be used to make something in the form of the graphic.

A glyph-ready data table would consist of all the variables mapped to aesthetics or attributes that are visible on the graph. We take the variable name as the aesthetic shown in the legend, and year and the new variable Popularity as the aesthetics mapped to the x and y axes, respectively. Popularity is a new variable: the total number of births for a given name in a given year (numerator) divided by the total number of births in that year (denominator).

What variable(s) from the raw data table do not appear at all in the graph?

sex does not appear at all in the graph. count does not appear either, but it is used to compute Popularity.

What variable(s) in the graph are similar to corresponding variables in the raw data table, but might have been transformed in some way?

Popularity is derived from count: it is the total count of a given name in a year divided by the total count of all names in that year.

Step 4:

Consider how the cases differ between the raw input and the glyph-ready table.

Have cases been filtered out?

All other names besides James, John, and Robert have been filtered out.

Have cases been grouped and summarized within groups in any way?

Yes. From the BabyNames data frame, we generated a new data frame by grouping by name and year and summing count, which combines the separate rows for each sex and gives the yearly count of each name, specifically the yearly counts of people named James, John, and Robert. The glyph-ready table is built from these per-name, per-year counts.

Have any new variables been introduced? If so, what’s the relationship between the new variables and the existing variables?

If you count variables transformed from existing variables, then yes. The new variable Popularity is the quotient of two counts: the count of one of the three names above in a given year (numerator) and the count of all names in that year (denominator). This gives the relative frequency of each name in each year.

Step 5:

Using English, write down a sequence of steps that will accomplish the wrangling from the raw data table to your hypothesized glyph-ready data table.

First, we want to know the most popular names in the BabyNames data set. In other words, which three names (in the name variable) have the greatest total count in the data set? To find out, select the name and count variables, group by name, summarise by summing count, and arrange in descending order. The most popular names of all time are James, John, and Robert.
Next, generate another data table, TotalNames, by taking the year and count variables from the raw data set, grouping by year, and summing count. This gives the number of births of all names in each year.
Create another data table, YearNames, by keeping only the three names, taking the name, year, and count variables, and grouping by name and year. This gives the number of births per year for James, for John, and for Robert.
Next, since Popularity is the variable mapped to the y-axis, we cannot simply use year and count as the position aesthetics for the x and y axes. We have to create the Popularity variable by dividing count in YearNames by the yearly total (labeled total) in TotalNames.
Since the two counts are in different data tables, we use a left join to merge the tables on their common variable, year, generating a new data table, DivideNames. Then, with the count and total variables side by side, we use a transform function to divide count by total for every year. The ratio is stored as the Popularity variable, alongside year and name, in a new data table called SumNames.
Now we are almost done. We use ggplot to map year to the x-axis and Popularity to the y-axis, and we map name to the color aesthetic, which produces a legend and three different lines. The lines are drawn by the geom_line layer and styled with attributes such as alpha and linetype.

Step 6:

Using paper and pen, translate your design, step by step, into R.

We can now list our code step-by-step below:

Data Wrangling Procedure:

We want to know the most popular names in the BabyNames data set. In other words, which three names (in the name variable) have the greatest total count in the data set?

The three names that have the greatest count in the data set are the names James, John, and Robert.

Below, we select the most popular names by grouping by name and sorting in descending order by count:

## Source: local data frame [92,600 x 2]
## 
##       name   count
##      (chr)   (int)
## 1    James 5114325
## 2     John 5095590
## 3   Robert 4809858
## 4  Michael 4315029
## 5     Mary 4127615
## 6  William 4054318
## 7    David 3578068
## 8   Joseph 2568379
## 9  Richard 2561839
## 10 Charles 2369238
## ..     ...     ...
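
The code behind this table is not shown in the report. A minimal dplyr sketch of the grouping and sorting described above might look like the following. The small data frame is a made-up stand-in for BabyNames (same column names, invented counts), and NameTotals is a hypothetical name.

```r
library(dplyr)

# Made-up stand-in for BabyNames (invented counts, same columns)
BabyNames <- data.frame(
  name  = c("James", "James", "John", "Mary", "James", "John", "Mary"),
  sex   = c("M", "F", "M", "F", "M", "M", "F"),
  count = c(50L, 10L, 30L, 10L, 40L, 60L, 100L),
  year  = c(1880L, 1880L, 1880L, 1880L, 1881L, 1881L, 1881L),
  stringsAsFactors = FALSE
)

# Total count per name over all years and both sexes, largest first
NameTotals <- BabyNames %>%
  group_by(name) %>%
  summarise(count = sum(count)) %>%
  arrange(desc(count))

print(NameTotals)
```

Reading the first three rows of the sorted result gives the three most popular names.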

Here we extract the year and count variables from the raw data set to generate the total number of births per year:

TotalNames Data Frame:

## Source: local data frame [134 x 2]
## 
##     year  total
##    (int)  (int)
## 1   1880 201484
## 2   1881 192700
## 3   1882 221537
## 4   1883 216952
## 5   1884 243468
## 6   1885 240856
## 7   1886 255320
## 8   1887 247396
## 9   1888 299481
## 10  1889 288952
## ..   ...    ...
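
One way a table of this shape could be produced is sketched below, assuming dplyr and a made-up stand-in for BabyNames (the counts are invented, not the real data).

```r
library(dplyr)

# Made-up stand-in for BabyNames (invented counts, same columns)
BabyNames <- data.frame(
  name  = c("James", "James", "John", "Mary", "James", "John", "Mary"),
  sex   = c("M", "F", "M", "F", "M", "M", "F"),
  count = c(50L, 10L, 30L, 10L, 40L, 60L, 100L),
  year  = c(1880L, 1880L, 1880L, 1880L, 1881L, 1881L, 1881L),
  stringsAsFactors = FALSE
)

# Sum the counts of every name within each year
TotalNames <- BabyNames %>%
  group_by(year) %>%
  summarise(total = sum(count))

print(TotalNames)
```

The result has one row per year, which is why the real TotalNames table has 134 rows.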

Here we extract the name, year, and count variables from BabyNames to get the total number of births per name per year:

YearNames Data Frame:

## Source: local data frame [402 x 3]
## Groups: name [?]
## 
##     name  year count
##    (chr) (int) (int)
## 1  James  1880  5949
## 2  James  1881  5466
## 3  James  1882  5910
## 4  James  1883  5249
## 5  James  1884  5726
## 6  James  1885  5201
## 7  James  1886  5384
## 8  James  1887  4787
## 9  James  1888  5607
## 10 James  1889  5046
## ..   ...   ...   ...
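
A possible dplyr sketch for this step follows, again on a made-up stand-in for BabyNames with invented counts. Only two of the three names are present in the toy data, and the filter-then-group order is an assumption about how the original code was written.

```r
library(dplyr)

# Made-up stand-in for BabyNames (invented counts, same columns)
BabyNames <- data.frame(
  name  = c("James", "James", "John", "Mary", "James", "John", "Mary"),
  sex   = c("M", "F", "M", "F", "M", "M", "F"),
  count = c(50L, 10L, 30L, 10L, 40L, 60L, 100L),
  year  = c(1880L, 1880L, 1880L, 1880L, 1881L, 1881L, 1881L),
  stringsAsFactors = FALSE
)

# Keep only the names of interest, then sum count within each name and year
YearNames <- BabyNames %>%
  filter(name %in% c("James", "John", "Robert")) %>%
  group_by(name, year) %>%
  summarise(count = sum(count))

print(YearNames)
```

Summing within name and year collapses the separate rows for each sex, so each name contributes exactly one row per year.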

For each case (a name in a given year), we want to divide:

(number of births named X in year i) / (number of births of all names in year i), where i = 1880, …, 2013 and X is James, John, or Robert:

DivideNames Data Frame:

## Source: local data frame [402 x 4]
## Groups: name [?]
## 
##     name  year count  total
##    (chr) (int) (int)  (int)
## 1  James  1880  5949 201484
## 2  James  1881  5466 192700
## 3  James  1882  5910 221537
## 4  James  1883  5249 216952
## 5  James  1884  5726 243468
## 6  James  1885  5201 240856
## 7  James  1886  5384 255320
## 8  James  1887  4787 247396
## 9  James  1888  5607 299481
## 10 James  1889  5046 288952
## ..   ...   ...   ...    ...

The left join above combines variables from two different data frames by matching rows on their common variable, year.
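
As a rough illustration of the join, the sketch below uses dplyr's left_join on two small tables with invented values. The table and column names follow the report, but the numbers are not the real data.

```r
library(dplyr)

# Invented per-name, per-year counts
YearNames <- data.frame(
  name  = c("James", "James", "John", "John"),
  year  = c(1880L, 1881L, 1880L, 1881L),
  count = c(60L, 40L, 30L, 60L),
  stringsAsFactors = FALSE
)

# Invented yearly totals over all names
TotalNames <- data.frame(
  year  = c(1880L, 1881L),
  total = c(100L, 200L)
)

# Every row of YearNames picks up the total for its year
DivideNames <- left_join(YearNames, TotalNames, by = "year")

print(DivideNames)
```

Because the join is keyed on year alone, the same yearly total is repeated for each name in that year.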

Now, we would like to add another variable (column) to the DivideNames data frame:

Popularity = 100 × (number of births named X in year i) / (number of births of all names in year i), where i = 1880, …, 2013 and X is James, John, or Robert.

This new variable gives the percentage of each year's births that received each name:

SumNames Data Frame:

##     name year count  total Popularity
## 1  James 1880  5949 201484   2.952592
## 2  James 1881  5466 192700   2.836533
## 3  James 1882  5910 221537   2.667726
## 4  James 1883  5249 216952   2.419429
## 5  James 1884  5726 243468   2.351849
## 6  James 1885  5201 240856   2.159382
## 7  James 1886  5384 255320   2.108726
## 8  James 1887  4787 247396   1.934954
## 9  James 1888  5607 299481   1.872239
## 10 James 1889  5046 288952   1.746311

Above, the Popularity column gives the percentage of births in that year with name X; the other columns are carried over from the DivideNames data frame.
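
As a check on the arithmetic, the sketch below recomputes Popularity for the first three rows of DivideNames using base R's transform(). The factor of 100 is an assumption inferred from the printed values (for example, 100 × 5949 / 201484 ≈ 2.9526); the original code is not shown.

```r
# First three rows of DivideNames as printed in the report
DivideNames <- data.frame(
  name  = c("James", "James", "James"),
  year  = c(1880L, 1881L, 1882L),
  count = c(5949L, 5466L, 5910L),
  total = c(201484L, 192700L, 221537L),
  stringsAsFactors = FALSE
)

# Assumption: Popularity is count / total scaled by 100 (a percentage)
SumNames <- transform(DivideNames, Popularity = 100 * count / total)

print(SumNames)
```

With dplyr, mutate(Popularity = 100 * count / total) would be an equivalent way to add the column.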

Now, we would like to plot our data. The SumNames data frame contains the variables name, year, count, total, and Popularity. We are asked to create a two-dimensional graph displaying how the popularity of names changes over time. As such, we need year on the x-axis and Popularity on the y-axis. We also need the name variable to draw three different lines, one for each name.

We therefore use ggplot and its geom layers to draw the graph:

The graph is built from a glyph-ready data frame, GraphNames, which contains only the variables mapped to visible aesthetics in the plot.

GraphNames Data Frame:

##    name year Popularity
## 1 James 1880   2.952592
## 2 James 1881   2.836533
## 3 James 1882   2.667726
## 4 James 1883   2.419429
## 5 James 1884   2.351849
## 6 James 1885   2.159382

Note: The above data set is truncated to 6 cases for visual simplicity; the plot uses more than 6 cases.
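
A minimal ggplot2 sketch of the plotting step is given below. The James rows are copied from the GraphNames table above, while the John rows are invented purely so that the sketch draws more than one line. The alpha and linetype values and the axis labels are illustrative choices, not taken from the original plot.

```r
library(ggplot2)

# James rows come from the GraphNames table in the report;
# the John rows are invented so the example shows two lines
GraphNames <- data.frame(
  name = c("James", "James", "James", "John", "John", "John"),
  year = c(1880L, 1881L, 1882L, 1880L, 1881L, 1882L),
  Popularity = c(2.952592, 2.836533, 2.667726, 4.8, 4.6, 4.3),
  stringsAsFactors = FALSE
)

# year on x, Popularity on y, one colored line per name
p <- ggplot(GraphNames, aes(x = year, y = Popularity, color = name)) +
  geom_line(alpha = 0.8, linetype = "solid") +
  labs(x = "Year", y = "Popularity")

# Call print(p) to draw the plot
```

Mapping name to color inside aes() is what produces both the separate lines and the legend.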

Step 7:

Implement, test, and revise. This was done above.

Samba Njie Jr.

February 12, 2016

Popular Names Project

Step 1:

Step 2: