Project 2 - Dataset 2 (Income by Gender), DATA607 Fall 2016

Income by Age and Sex Dataset from Census.gov - Posted by Ahsanul Choudhury

For this project, the objective is to select three untidy data sets posted by students on the class discussion board, and - for each of the three datasets - read the data into R, tidy the data, and complete the analysis suggested by the classmate in the discussion board post.

This R markdown file explains the steps used to complete the analysis for the second dataset I selected, which was posted by Ahsanul Choudhury. It is a dataset from the census.gov website showing income by age range by sex (http://www.census.gov/population/age/data/files/2012/2012gender_table17.csv).

The dataset in the .csv was loaded to a MySQL database table (see “Create Database Containing Census Data” section). To execute that load, the .csv was saved as an Excel file, the desired data was organized in the Excel file, and then INSERT SQL statements were generated in the Excel file.

You can see these steps by opening the .xlsx file, which is available at this GitHub location: https://github.com/LelandoSupreme/DATA607/blob/master/gender_income.xlsx

Columns Y thru AJ, rows 21-45 contain the formulas used to organize the data. Column AK, rows 22 thru 45 contain the formulas used to create the INSERT strings which were used in the SQL code.

Create Database Containing Census Data

The Crosstab data was loaded to a MySQL database schema called ‘project_two’ into a table called ‘gender_income’. The SQL scripts to build the database and load the data are available on GitHub at the link below. Please note that the GitHub file includes the code used for all three datasets selected for the project, not just this dataset. In the code, you can see the INSERT statements which were cut and pasted from the Excel file:

https://github.com/LelandoSupreme/DATA607/blob/master/Randles_Project2_DATA607.sql

Read gender_income MySQL Table into R

To facilitate the loading of data from MySQL to an R data frame, I loaded the “RMySQL” package.

install.packages("RMySQL",repos='http://cran.wustl.edu/')
library(RMySQL)

Once the package was installed, I connected to the database and created a data frame from the ‘gender_income’ table.

# Get the MySQL
drv = dbDriver("MySQL")
# Create a connection to the MySQL database
con <- dbConnect(drv, user = 'root', password = 'temp1002!', dbname = 'project_two')
# Create the gender_income data frame
gender_income <- dbReadTable(con, "gender_income")
# View the gender_income data frame
head(gender_income)

##   sex      age_range under_5000 X5000_to_9999 X10000_to_14999
## 1   M 15 to 17 years         10             3              20
## 2   M 18 to 24 years         66           122             394
## 3   M 25 to 29 years         23            77             193
## 4   M 30 to 34 years         55            55             162
## 5   M 35 to 39 years         34            30             155
## 6   M 40 to 44 years         42            28             126
##   X15000_to_19999 X20000_to_24999 X25000_to_34999 X35000_to_49999
## 1               4               7               9               8
## 2             617             707             892             572
## 3             471             671            1350            1526
## 4             332             509            1080            1605
## 5             263             467             873            1453
## 6             244             388             998            1506
##   X50000_to_74999 X75000_to_99999 X100000_and_over
## 1               3               0                0
## 2             180              48               52
## 3            1222             407              283
## 4            1655             688              664
## 5            1638             810             1080
## 6            1743             879             1340

As you can see, the ‘gender_income’ table has 12 columns: sex, age_range, under_5000, x5000_to_9999, x10000_to_14999, x15000_to_19999, x20000_to_24999, x25000_to_34999, x35000_to_49999, x50000_to_74999, x75000_to_99999, and x_100000_and_over. R has added x in front of the columns which started with a number character in RMySQL.

Tidy and Tranform Data

Hadley Wickham defines “tidy data” (http://vita.had.co.nz/papers/tidy-data.pdf.) as data which is structured such that:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

In the case of the gender_income data, there are four variables: sex, age range, income range, and number of persons. To make each row an observation, we need a row for every sex/age range/income range combination. In this case, there is only one type of observational unit, so one table will be appropriate.

To tidy and transform the data, I loaded the tidyr package (https://cran.r-project.org/web/packages/tidyr/tidyr.pdf) and the dplyr package (https://cran.r-project.org/web/packages/dplyr/dplyr.pdf).

install.packages("tidyr",repos='http://mirrors.nics.utk.edu/cran/')
library(tidyr)
library(dplyr)

Once the packages were loaded, I pipelined three functions to create the tidy data frame ‘gender_income’. The steps performed:
1. Turn the income range columns into values in an ‘income_range’ column using the gather function.
2. Rename the ‘n’ column created in step 1 to ‘num_persons’. 3. Replace the x’s in the values in the income_range column using the transmute function.

# Pipelined commands executing steps shown above
gender_income <- gender_income %>% gather("income_range", "n", 3:12) %>% rename(num_persons = n) %>% transmute(sex = sex, age_range = age_range, income_range = gsub("X","",income_range), num_persons = num_persons)
# View results
head(gender_income)

##   sex      age_range income_range num_persons
## 1   M 15 to 17 years   under_5000          10
## 2   M 18 to 24 years   under_5000          66
## 3   M 25 to 29 years   under_5000          23
## 4   M 30 to 34 years   under_5000          55
## 5   M 35 to 39 years   under_5000          34
## 6   M 40 to 44 years   under_5000          42

Income by Sex by Age Ranges

The suggested analysis is to “compare incomes between male and female for the same age group”. Because the total number of persons within each age_range varies by sex, I created a data frame containing the totals by sex by age_range (sex_age_ttls). I then joined the sex_age_ttls data frame to the gender_income data frame to compute percentages of persons within each income_range by sex/age_range. This will allow me to compare the percentages of persons in each income_range by sex for each age_range.

# Create data frame of total persons by sex and age_range
sex_age_ttls <- gender_income %>% group_by(sex, age_range) %>% summarize(sum(num_persons)) %>% rename(num_persons_by_age = `sum(num_persons)`) 
# Finalized gender_income dataset
gender_income <- gender_income %>% left_join(sex_age_ttls, by = c("sex", "age_range")) %>% mutate(income_range_perc = num_persons / num_persons_by_age)
# View results
head(gender_income)

##   sex      age_range income_range num_persons num_persons_by_age
## 1   M 15 to 17 years   under_5000          10                 64
## 2   M 18 to 24 years   under_5000          66               3650
## 3   M 25 to 29 years   under_5000          23               6223
## 4   M 30 to 34 years   under_5000          55               6805
## 5   M 35 to 39 years   under_5000          34               6803
## 6   M 40 to 44 years   under_5000          42               7294
##   income_range_perc
## 1       0.156250000
## 2       0.018082192
## 3       0.003695967
## 4       0.008082292
## 5       0.004997795
## 6       0.005758157

Next, I created line graphs for each age range plotting percentages within each income range by sex:

# Load ggplot2 to create line chart
library("ggplot2")
# Create line graph for age 15-17
ggplot(data=filter(gender_income, age_range == "15 to 17 years"), aes(x=factor(filter(gender_income, age_range == "15 to 17 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 15 to 17 years old")

# Create line graph for age 18-24
ggplot(data=filter(gender_income, age_range == "18 to 24 years"), aes(x=factor(filter(gender_income, age_range == "18 to 24 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 18 to 24 years old")

# Create line graph for age 25-29
ggplot(data=filter(gender_income, age_range == "25 to 29 years"), aes(x=factor(filter(gender_income, age_range == "25 to 29 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 25 to 29 years old")

# Create line graph for age 30-34
ggplot(data=filter(gender_income, age_range == "30 to 34 years"), aes(x=factor(filter(gender_income, age_range == "30 to 34 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 30 to 34 years old")

# Create line graph for age 35-39
ggplot(data=filter(gender_income, age_range == "35 to 39 years"), aes(x=factor(filter(gender_income, age_range == "35 to 39 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 35 to 39 years old")

# Create line graph for age 40-44
ggplot(data=filter(gender_income, age_range == "40 to 44 years"), aes(x=factor(filter(gender_income, age_range == "40 to 44 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 40 to 44 years old")

# Create line graph for age 45-49
ggplot(data=filter(gender_income, age_range == "45 to 49 years"), aes(x=factor(filter(gender_income, age_range == "45 to 49 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 45 to 49 years old")

# Create line graph for age 50-54
ggplot(data=filter(gender_income, age_range == "50 to 54 years"), aes(x=factor(filter(gender_income, age_range == "50 to 54 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 50 to 54 years old")

# Create line graph for age 55-59
ggplot(data=filter(gender_income, age_range == "55 to 59 years"), aes(x=factor(filter(gender_income, age_range == "55 to 59 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 55 to 59 years old")

# Create line graph for age 60-64
ggplot(data=filter(gender_income, age_range == "60 to 64 years"), aes(x=factor(filter(gender_income, age_range == "60 to 64 years")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 60 to 64 years old")

# Create line graph for age 65 years and over
ggplot(data=filter(gender_income, age_range == "65 years and over"), aes(x=factor(filter(gender_income, age_range == "65 years and over")$income_range, levels=c("under_5000", "5000_to_9999", "10000_to_14999", "15000_to_19999", "20000_to_24999", "25000_to_34999", "35000_to_49999", "50000_to_74999", "75000_to_99999", "100000_and_over")), y=income_range_perc, group=sex, colour=sex)) + geom_line(size = 1.5) + geom_point(size=3) + xlab("Income Range") + ylab("Percentage") + theme(axis.text.x=element_text(angle = -90, hjust = 0)) + ggtitle("Age Range 65 and Over")

The charts show a consistent pattern across all age ranges except the two lowest age ranges. Also, the pattern becomes more pronounced as the age ranges get older: a larger percentage of females in the lower income ranges and a higher percentage of males in the higher income ranges. The income range where the ratios cross is usually $50,000-75,000.