Rpubs link: http://rpubs.com/jefflittlejohn/Data_607_Project_2
Github link: https://github.com/littlejohnjeff/DATA607_Fall2018/blob/master/Data%20607%20-%20Week%205%20Assignment%20-%20Tidying%20and%20Transforming%20Data%20-%20Jeff%20Littlejohn.Rmd
Our task is to choose any three of the “wide” datasets identified in the Week 6 Discussion items and then:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
library(tidyr)
library(dplyr)
library(ggplot2)
library(stringr)
library(gdata)
library(RCurl)
library(magrittr)
library(scales)
Dataset 1: FBI Uniform Crime Reporting (UCR) Program - Law Enforcement Officers Killed - Feloniously and Accidentally
The FBI UCR program tracks police officer deaths in two separate data sets: Law Enforcement Officers Killed “Feloniously,” meaning killed in the line of duty as the result of intentional actions by an assailant, and Law Enforcement Officers Killed “Accidentally,” meaning killed in an unintentional incident while on duty.
Wrangling: We wish to combine these data sets, remove subtotals, and convert the horizontal counts by year into a single year column to fit tidy standards. Consider adding a region column to enable analysis by part of the country (a sketch of such a join appears after the combined data set below).
Analysis: Do more officers die in accidents than are killed feloniously? Has that trend been steady over the years?
On with the wrangling, starting with the LEOK Feloniously data. Link: https://ucr.fbi.gov/leoka/2017/topic-pages/tables/table-1.xls
#Initially retrieved the feloniously-killed data, which lives in xls format on an FBI web page, using read_xls, but availability was inconsistent, so a csv copy is read from GitHub instead
#Reading csv file into dataframe
df_felon <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/table-1%20-%20LEOFK.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
head(df_felon)
## # A tibble: 6 x 12
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 Table~ "" NA NA NA NA NA NA NA NA NA NA
## 2 Law E~ "" NA NA NA NA NA NA NA NA NA NA
## 3 Regio~ "" NA NA NA NA NA NA NA NA NA NA
## 4 Area Total 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 5 Numbe~ 496 41 48 55 72 49 27 51 41 66 46
## 6 NORTH~ 50 3 7 3 10 6 2 8 4 4 3
Issues with the read-in data frame: title and metadata rows at the top, year labels stored in a data row rather than as column names, and regional subtotal rows mixed in with the states. Let’s tidy the data.
#Credit for help creating a subset based on SQL NOT IN logic: #https://stackoverflow.com/questions/7494848/standard-way-to-remove-multiple-elements-from-a-dataframe
df_felon <- df_felon[!df_felon$V1 %in% c("Law Enforcement Officers Feloniously Killed", "Region, Geographic Division, and State, 2008â???2017", "Number of victim officers","NORTHEAST","New England","Middle Atlantic","MIDWEST","East North Central","West North Central","SOUTH","South Atlantic", "East South Central", "West South Central", "WEST", "Mountain", "Pacific", "PUERTO RICO AND OTHER OUTLYING AREAS"),]
#Drop the title row and the pesky region header row whose character encoding issue prevented it from being caught by the filter above
df_felon <- df_felon[3:59,]
#add column names
df_felon_names <- c("State", "Total", df_felon[1,3],df_felon[1,4],df_felon[1,5],df_felon[1,6],df_felon[1,7],df_felon[1,8],df_felon[1,9],df_felon[1,10],df_felon[1,11],df_felon[1,12])
names(df_felon) <- df_felon_names
#Drop the year header row now that the column names have been applied
df_felon <- df_felon[2:57,]
#Remove second column showing total 2008-2017 values by state
df_felon <- df_felon[-c(2)]
head(df_felon)
## # A tibble: 6 x 11
## State `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 Conn~ 0 0 0 0 0 0 0 0 0
## 2 Maine 0 0 0 0 0 0 0 0 0
## 3 Mass~ 0 0 1 0 1 1 0 0 1
## 4 New ~ 0 0 0 0 1 0 1 0 0
## 5 Rhod~ 0 0 0 0 0 0 0 0 0
## 6 Verm~ 0 0 0 0 0 0 0 0 0
## # ... with 1 more variable: `2017` <int>
Now gather the year columns into tidy form.
#first two arguments in gather name the key and value columns created by the unpivot
#the minus sign tells gather to leave State as is
df_felon <- df_felon %>%
gather(Year,Victims,-State)
#Add label Feloniously or Accidentally
df_felon$F_or_K <- "Felony"
head(df_felon)
## # A tibble: 6 x 4
## State Year Victims F_or_K
## <chr> <chr> <int> <chr>
## 1 Connecticut 2008 0 Felony
## 2 Maine 2008 0 Felony
## 3 Massachusetts 2008 0 Felony
## 4 New Hampshire 2008 0 Felony
## 5 Rhode Island 2008 0 Felony
## 6 Vermont 2008 0 Felony
Now we’re going to repeat the same process with the LEOK Accidentally data, showing less of the work this time.
#Initially retrieved the accidentally-killed data, which lives in xls format on an FBI web page, using read_xls, but availability was inconsistent, so a csv copy is read from GitHub instead
#Reading csv file into dataframe
df_accidental <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/table-48%20-%20LEOAK.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
#Remove subtotal rows
df_accidental <- df_accidental[!df_accidental$V1 %in% c("Law Enforcement Officers Feloniously Killed", "Region, Geographic Division, and State, 2008â???2017", "Number of victim officers","NORTHEAST","New England","Middle Atlantic","MIDWEST","East North Central","West North Central","SOUTH","South Atlantic", "East South Central", "West South Central", "WEST", "Mountain", "Pacific", "PUERTO RICO AND OTHER OUTLYING AREAS"),]
#Remove first three rows which are metadata
df_accidental <- df_accidental[4:61,]
#Remove second column showing total 2008-2017 values by state
df_accidental <- df_accidental[-c(2)]
#add column names
df_accidental_names <- c("State",df_accidental[1,2],df_accidental[1,3],df_accidental[1,4],df_accidental[1,5],df_accidental[1,6],df_accidental[1,7],df_accidental[1,8],df_accidental[1,9],df_accidental[1,10],df_accidental[1,11])
names(df_accidental) <- df_accidental_names
df_accidental <- df_accidental[2:58,]
#first two arguments in gather name the key and value columns created by the unpivot
#the minus sign tells gather to leave State as is
df_accidental <- df_accidental %>%
gather(Year,Victims,-State)
#Add label Feloniously or Accidentally
df_accidental$F_or_K <- "Accident"
head(df_accidental)
## # A tibble: 6 x 4
## State Year Victims F_or_K
## <chr> <chr> <int> <chr>
## 1 Connecticut 2008 1 Accident
## 2 Maine 2008 0 Accident
## 3 Massachusetts 2008 0 Accident
## 4 New Hampshire 2008 0 Accident
## 5 Rhode Island 2008 0 Accident
## 6 Vermont 2008 0 Accident
Now let’s combine the two data sets vertically.
df_leok <- rbind(df_felon, df_accidental)
df_leok
## # A tibble: 1,130 x 4
## State Year Victims F_or_K
## * <chr> <chr> <int> <chr>
## 1 Connecticut 2008 0 Felony
## 2 Maine 2008 0 Felony
## 3 Massachusetts 2008 0 Felony
## 4 New Hampshire 2008 0 Felony
## 5 Rhode Island 2008 0 Felony
## 6 Vermont 2008 0 Felony
## 7 New Jersey 2008 0 Felony
## 8 New York 2008 0 Felony
## 9 Pennsylvania 2008 3 Felony
## 10 Illinois 2008 3 Felony
## # ... with 1,120 more rows
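As mentioned in the wrangling plan, a region column would let us compare parts of the country. Base R ships state-to-region lookups (state.name and state.region in the datasets package), so a minimal sketch of that join, not used in the analysis below, might look like this:
#Sketch: attach a Census region to each state and join it onto the tidy data
#Non-state rows (e.g., District of Columbia, Puerto Rico) will get NA and would
#need to be mapped by hand
region_lookup <- data.frame(State = state.name,
                            Region = as.character(state.region),
                            stringsAsFactors = FALSE)
df_leok_region <- df_leok %>%
  left_join(region_lookup, by = "State")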
Let’s do some quick analysis.
#Aggregate to get summary stats
agg_leok_year <-aggregate(df_leok$Victims, by=list(df_leok$Year),
FUN=sum, na.rm=TRUE)
summary(agg_leok_year)
## Group.1 x
## Length:10 Min. : 76.00
## Class :character 1st Qu.: 93.75
## Mode :character Median : 96.50
## Mean :102.30
## 3rd Qu.:115.75
## Max. :127.00
hist(agg_leok_year$x)
#Hiding warning message
ggplot(df_leok,aes(x=Year,y=Victims,fill=F_or_K),na.rm = TRUE) + geom_bar(stat = "identity")
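To answer the headline question directly, a quick total by cause (a small addition using the dplyr verbs already loaded) can be pulled from the tidy data:
#Total victims by cause across 2008-2017
df_leok %>%
  group_by(F_or_K) %>%
  summarise(Total_Victims = sum(Victims, na.rm = TRUE))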
Overall officer deaths have been relatively steady since 2008. The 2016 total was inflated by a single event in Dallas that unfortunately killed numerous officers; 2009 and 2010 may have had similar outlier events. Over time, deaths appear to be almost evenly split between accidents and felony events.
In a world with infinite time, it would be very interesting to pull the dataset showing number of employed law enforcement officers by state to get state death rates. Additionally, it would be interesting to compare on-the-job police deaths with other dangerous occupations.
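As a sketch of that first idea, and assuming a hypothetical df_officers table of sworn-officer counts by state (not loaded anywhere in this report), the death-rate calculation could look like this:
#Hypothetical sketch only: df_officers stands in for a real table of officer
#counts by state; the NA placeholder below would be replaced by actual figures
df_officers <- data.frame(State = unique(df_leok$State),
                          Officer_Count = NA_real_,
                          stringsAsFactors = FALSE)
df_death_rate <- df_leok %>%
  group_by(State) %>%
  summarise(Deaths = sum(Victims, na.rm = TRUE)) %>%
  left_join(df_officers, by = "State") %>%
  mutate(Deaths_per_10k_Officers = 10000 * Deaths / Officer_Count)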
Dataset 2: Student Loan Defaults
Wrangling: This data set has merged row and column labels that need to be handled. It shows loan status horizontally by column - we’ll want to tidy that up by creating a single loan status column.
Analysis: Have student loan delinquencies increased as a percentage of outstanding loans during the time period conveyed by the dataset?
Let’s bring in the data.
#Reading csv file into dataframe
df_loan_del <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/DLPortfoliobyDelinquencyStatus.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
head(df_loan_del)
## # A tibble: 6 x 14
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Direc~ "" "" "" "" "" "" "" "" "" "" ""
## 2 Inclu~ "" "" "" "" "" "" "" "" "" "" ""
## 3 Data ~ "" "" "" "" "" "" "" "" "" "" ""
## 4 "" "" "" "" "" "" "" "" "" "" "" ""
## 5 "" "" Curr~ "" 31-9~ "" 91-1~ "" 181-~ "" 271-~ ""
## 6 Feder~ "" Doll~ Reci~ Doll~ Reci~ Doll~ Reci~ Doll~ Reci~ Doll~ Reci~
## # ... with 2 more variables: V13 <chr>, V14 <chr>
Not at all tidy! Begin the cleanup.
#Removing first few rows showing titles and last few rows conveying notes
df_loan_del <- df_loan_del[5:28,]
#Duplicate year values to subsequent rows from same period
#2013 Q1-Q2
df_loan_del[4,1] <- df_loan_del[3,1]
#2014
df_loan_del[6:8,1] <- df_loan_del[5,1]
#2015
df_loan_del[10:12,1] <- df_loan_del[9,1]
#2016
df_loan_del[14:16,1] <- df_loan_del[13,1]
#2017
df_loan_del[18:20,1] <- df_loan_del[17,1]
#2018 Q1-Q3
df_loan_del[22:23,1] <- df_loan_del[21,1]
#Duplicate loan status to subsequent column with same status
#Current
df_loan_del[1,4] <- df_loan_del[1,3]
#31-90
df_loan_del[1,6] <- df_loan_del[1,5]
#91-180
df_loan_del[1,8] <- df_loan_del[1,7]
#181-270
df_loan_del[1,10] <- df_loan_del[1,9]
#271-360
df_loan_del[1,12] <- df_loan_del[1,11]
#Collections
df_loan_del[1,14] <- df_loan_del[1,13]
#There is undoubtedly a better way to do this, but I don't know how to use gather when there are multiple rows containing headers (a vectorized alternative is sketched below)
#We're going to concatenate the Loan Status and Dollars vs. No. of Recipients columns before using gather and then separate them
df_loan_del[2,3] <- paste(c(df_loan_del[1,3], df_loan_del[2,3]), collapse = "--")
df_loan_del[2,4] <- paste(c(df_loan_del[1,4], df_loan_del[2,4]), collapse = "--")
df_loan_del[2,5] <- paste(c(df_loan_del[1,5], df_loan_del[2,5]), collapse = "--")
df_loan_del[2,6] <- paste(c(df_loan_del[1,6], df_loan_del[2,6]), collapse = "--")
df_loan_del[2,7] <- paste(c(df_loan_del[1,7], df_loan_del[2,7]), collapse = "--")
df_loan_del[2,8] <- paste(c(df_loan_del[1,8], df_loan_del[2,8]), collapse = "--")
df_loan_del[2,9] <- paste(c(df_loan_del[1,9], df_loan_del[2,9]), collapse = "--")
df_loan_del[2,10] <- paste(c(df_loan_del[1,10], df_loan_del[2,10]), collapse = "--")
df_loan_del[2,11] <- paste(c(df_loan_del[1,11], df_loan_del[2,11]), collapse = "--")
df_loan_del[2,12] <- paste(c(df_loan_del[1,12], df_loan_del[2,12]), collapse = "--")
df_loan_del[2,13] <- paste(c(df_loan_del[1,13], df_loan_del[2,13]), collapse = "--")
df_loan_del[2,14] <- paste(c(df_loan_del[1,14], df_loan_del[2,14]), collapse = "--")
#Remove the top row now that concatenation has occurred
df_loan_del <- df_loan_del[2:24,]
#Create "Quarter" value in row 1 before using it to create colnames for df
#Use row 1 as column names before deleting it
df_loan_del[1,2] <- "Quarter"
df_loan_names <- df_loan_del[1,]
names(df_loan_del) <- df_loan_names
#Remove row 1 now that names are there
df_loan_del <- df_loan_del[2:22,]
head(df_loan_del)
## # A tibble: 6 x 14
## `Federal Fiscal~ Quarter `Current Repaym~ `Current Repaym~
## <chr> <chr> <chr> <chr>
## 1 2013 Q3 "$193.5 " 8.35
## 2 2013 Q4 "$188.5 " 7.91
## 3 2014 Q1 "$224.8 " 9.08
## 4 2014 Q2 "$230.3 " 9.06
## 5 2014 Q3 "$248.6 " 9.64
## 6 2014 Q4 "$247.2 " 9.34
## # ... with 10 more variables: `31-90 Days Delinquent--Dollars Outstanding
## # (in billions)` <chr>, `31-90 Days Delinquent--Recipients (in
## # millions)` <chr>, `91-180 Days Delinquent--Dollars Outstanding (in
## # billions)` <chr>, `91-180 Days Delinquent--Recipients (in
## # millions)` <chr>, `181-270 Days Delinquent--Dollars Outstanding (in
## # billions)` <chr>, `181-270 Days Delinquent--Recipients (in
## # millions)` <chr>, `271-360 Days Delinquent--Dollars Outstanding (in
## # billions)` <chr>, `271-360 Days Delinquent--Recipients (in
## # millions)` <chr>, `Loans Transferring to the Debt Management and
## # Collections System (DMCS)--Dollars Outstanding (in billions)` <chr>,
## # `Loans Transferring to the Debt Management and Collections System
## # (DMCS)--Recipients (in millions)` <chr>
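As the comment above concedes, the twelve column-by-column paste() calls are repetitive. A more compact alternative (just a sketch; it assumes the pre-concatenation layout, with status labels in row 1 and units in row 2 of columns 3 through 14) would paste the two header rows in one vectorized step:
#Vectorized alternative to the column-by-column paste() calls
#(would be run in place of them, before the top row is dropped)
hdr_cols <- 3:14
df_loan_del[2, hdr_cols] <- as.list(paste(unlist(df_loan_del[1, hdr_cols]),
                                          unlist(df_loan_del[2, hdr_cols]),
                                          sep = "--"))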
#first two arguments in gather name the key and value columns created by the unpivot
#the minus signs tell gather to leave Federal Fiscal Year and Quarter as is
df_loan <- df_loan_del %>%
gather(Loan_Status_and_Units,Counts,-`Federal Fiscal Year`,-Quarter)
This is going to get kludgy.
#Split concatenated loan status column
df_loan_add <- data.frame(do.call('rbind',strsplit(df_loan$Loan_Status_and_Units,"--")))
df_loan <- cbind(df_loan,df_loan_add)
#rename new columns
colnames(df_loan)[5] <- "Loan_Status"
colnames(df_loan)[6] <- "Units"
#get rid of concatenated column
df_loan <- df_loan[-c(3)]
#subset out the dollars
df_loan_dollars <- subset(df_loan, str_detect(df_loan$Units,"Dollars"))
#rename count column
colnames(df_loan_dollars)[3] <- "Dollars Outstanding (billions)"
#subset out the number of loans
df_loan_numb <- subset(df_loan, Units=='Recipients (in millions)')
colnames(df_loan_numb)[3] <- "Recipients (millions)"
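For reference, tidyr’s separate() could replace the strsplit()/cbind() step above; a sketch, starting again from df_loan_del:
#Sketch: split the concatenated "Status--Units" column with separate()
df_loan_alt <- df_loan_del %>%
  gather(Loan_Status_and_Units, Counts, -`Federal Fiscal Year`, -Quarter) %>%
  separate(Loan_Status_and_Units, into = c("Loan_Status", "Units"), sep = "--")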
The weird workarounds continue.
#df_loan_dollars
#df_loan_numb
#merge the number of loans in as a new column
df_loan_final <- merge(df_loan_dollars, df_loan_numb, by.x = c("Federal Fiscal Year", "Quarter", "Loan_Status"), by.y = c("Federal Fiscal Year", "Quarter", "Loan_Status"))
#Get rid of Units columns
df_loan_final <- df_loan_final[-c(5)]
df_loan_final <- df_loan_final[-c(6)]
#Create numeric values
df_loan_final$`Dollars Outstanding (billions)` <- as.numeric(substr(df_loan_final$`Dollars Outstanding (billions)`,2,nchar(df_loan_final$`Dollars Outstanding (billions)`)))
df_loan_final$`Recipients (millions)` <- as.numeric(df_loan_final$`Recipients (millions)`)
#This should be our tidy data set
df_loan_final
## Federal Fiscal Year Quarter
## 1 2013 Q3
## 2 2013 Q3
## 3 2013 Q3
## 4 2013 Q3
## 5 2013 Q3
## 6 2013 Q3
## 7 2013 Q4
## 8 2013 Q4
## 9 2013 Q4
## 10 2013 Q4
## 11 2013 Q4
## 12 2013 Q4
## 13 2014 Q1
## 14 2014 Q1
## 15 2014 Q1
## 16 2014 Q1
## 17 2014 Q1
## 18 2014 Q1
## 19 2014 Q2
## 20 2014 Q2
## 21 2014 Q2
## 22 2014 Q2
## 23 2014 Q2
## 24 2014 Q2
## 25 2014 Q3
## 26 2014 Q3
## 27 2014 Q3
## 28 2014 Q3
## 29 2014 Q3
## 30 2014 Q3
## 31 2014 Q4
## 32 2014 Q4
## 33 2014 Q4
## 34 2014 Q4
## 35 2014 Q4
## 36 2014 Q4
## 37 2015 Q1
## 38 2015 Q1
## 39 2015 Q1
## 40 2015 Q1
## 41 2015 Q1
## 42 2015 Q1
## 43 2015 Q2
## 44 2015 Q2
## 45 2015 Q2
## 46 2015 Q2
## 47 2015 Q2
## 48 2015 Q2
## 49 2015 Q3
## 50 2015 Q3
## 51 2015 Q3
## 52 2015 Q3
## 53 2015 Q3
## 54 2015 Q3
## 55 2015 Q4
## 56 2015 Q4
## 57 2015 Q4
## 58 2015 Q4
## 59 2015 Q4
## 60 2015 Q4
## 61 2016 Q1
## 62 2016 Q1
## 63 2016 Q1
## 64 2016 Q1
## 65 2016 Q1
## 66 2016 Q1
## 67 2016 Q2
## 68 2016 Q2
## 69 2016 Q2
## 70 2016 Q2
## 71 2016 Q2
## 72 2016 Q2
## 73 2016 Q3
## 74 2016 Q3
## 75 2016 Q3
## 76 2016 Q3
## 77 2016 Q3
## 78 2016 Q3
## 79 2016 Q4
## 80 2016 Q4
## 81 2016 Q4
## 82 2016 Q4
## 83 2016 Q4
## 84 2016 Q4
## 85 2017 Q1
## 86 2017 Q1
## 87 2017 Q1
## 88 2017 Q1
## 89 2017 Q1
## 90 2017 Q1
## 91 2017 Q2
## 92 2017 Q2
## 93 2017 Q2
## 94 2017 Q2
## 95 2017 Q2
## 96 2017 Q2
## 97 2017 Q3
## 98 2017 Q3
## 99 2017 Q3
## 100 2017 Q3
## 101 2017 Q3
## 102 2017 Q3
## 103 2017 Q4
## 104 2017 Q4
## 105 2017 Q4
## 106 2017 Q4
## 107 2017 Q4
## 108 2017 Q4
## 109 2018 Q1
## 110 2018 Q1
## 111 2018 Q1
## 112 2018 Q1
## 113 2018 Q1
## 114 2018 Q1
## 115 2018 Q2
## 116 2018 Q2
## 117 2018 Q2
## 118 2018 Q2
## 119 2018 Q2
## 120 2018 Q2
## 121 2018 Q3
## 122 2018 Q3
## 123 2018 Q3
## 124 2018 Q3
## 125 2018 Q3
## 126 2018 Q3
## Loan_Status
## 1 181-270 Days Delinquent
## 2 271-360 Days Delinquent
## 3 31-90 Days Delinquent
## 4 91-180 Days Delinquent
## 5 Current Repayment
## 6 Loans Transferring to the Debt Management and Collections System (DMCS)
## 7 181-270 Days Delinquent
## 8 271-360 Days Delinquent
## 9 31-90 Days Delinquent
## 10 91-180 Days Delinquent
## 11 Current Repayment
## 12 Loans Transferring to the Debt Management and Collections System (DMCS)
## 13 181-270 Days Delinquent
## 14 271-360 Days Delinquent
## 15 31-90 Days Delinquent
## 16 91-180 Days Delinquent
## 17 Current Repayment
## 18 Loans Transferring to the Debt Management and Collections System (DMCS)
## 19 181-270 Days Delinquent
## 20 271-360 Days Delinquent
## 21 31-90 Days Delinquent
## 22 91-180 Days Delinquent
## 23 Current Repayment
## 24 Loans Transferring to the Debt Management and Collections System (DMCS)
## 25 181-270 Days Delinquent
## 26 271-360 Days Delinquent
## 27 31-90 Days Delinquent
## 28 91-180 Days Delinquent
## 29 Current Repayment
## 30 Loans Transferring to the Debt Management and Collections System (DMCS)
## 31 181-270 Days Delinquent
## 32 271-360 Days Delinquent
## 33 31-90 Days Delinquent
## 34 91-180 Days Delinquent
## 35 Current Repayment
## 36 Loans Transferring to the Debt Management and Collections System (DMCS)
## 37 181-270 Days Delinquent
## 38 271-360 Days Delinquent
## 39 31-90 Days Delinquent
## 40 91-180 Days Delinquent
## 41 Current Repayment
## 42 Loans Transferring to the Debt Management and Collections System (DMCS)
## 43 181-270 Days Delinquent
## 44 271-360 Days Delinquent
## 45 31-90 Days Delinquent
## 46 91-180 Days Delinquent
## 47 Current Repayment
## 48 Loans Transferring to the Debt Management and Collections System (DMCS)
## 49 181-270 Days Delinquent
## 50 271-360 Days Delinquent
## 51 31-90 Days Delinquent
## 52 91-180 Days Delinquent
## 53 Current Repayment
## 54 Loans Transferring to the Debt Management and Collections System (DMCS)
## 55 181-270 Days Delinquent
## 56 271-360 Days Delinquent
## 57 31-90 Days Delinquent
## 58 91-180 Days Delinquent
## 59 Current Repayment
## 60 Loans Transferring to the Debt Management and Collections System (DMCS)
## 61 181-270 Days Delinquent
## 62 271-360 Days Delinquent
## 63 31-90 Days Delinquent
## 64 91-180 Days Delinquent
## 65 Current Repayment
## 66 Loans Transferring to the Debt Management and Collections System (DMCS)
## 67 181-270 Days Delinquent
## 68 271-360 Days Delinquent
## 69 31-90 Days Delinquent
## 70 91-180 Days Delinquent
## 71 Current Repayment
## 72 Loans Transferring to the Debt Management and Collections System (DMCS)
## 73 181-270 Days Delinquent
## 74 271-360 Days Delinquent
## 75 31-90 Days Delinquent
## 76 91-180 Days Delinquent
## 77 Current Repayment
## 78 Loans Transferring to the Debt Management and Collections System (DMCS)
## 79 181-270 Days Delinquent
## 80 271-360 Days Delinquent
## 81 31-90 Days Delinquent
## 82 91-180 Days Delinquent
## 83 Current Repayment
## 84 Loans Transferring to the Debt Management and Collections System (DMCS)
## 85 181-270 Days Delinquent
## 86 271-360 Days Delinquent
## 87 31-90 Days Delinquent
## 88 91-180 Days Delinquent
## 89 Current Repayment
## 90 Loans Transferring to the Debt Management and Collections System (DMCS)
## 91 181-270 Days Delinquent
## 92 271-360 Days Delinquent
## 93 31-90 Days Delinquent
## 94 91-180 Days Delinquent
## 95 Current Repayment
## 96 Loans Transferring to the Debt Management and Collections System (DMCS)
## 97 181-270 Days Delinquent
## 98 271-360 Days Delinquent
## 99 31-90 Days Delinquent
## 100 91-180 Days Delinquent
## 101 Current Repayment
## 102 Loans Transferring to the Debt Management and Collections System (DMCS)
## 103 181-270 Days Delinquent
## 104 271-360 Days Delinquent
## 105 31-90 Days Delinquent
## 106 91-180 Days Delinquent
## 107 Current Repayment
## 108 Loans Transferring to the Debt Management and Collections System (DMCS)
## 109 181-270 Days Delinquent
## 110 271-360 Days Delinquent
## 111 31-90 Days Delinquent
## 112 91-180 Days Delinquent
## 113 Current Repayment
## 114 Loans Transferring to the Debt Management and Collections System (DMCS)
## 115 181-270 Days Delinquent
## 116 271-360 Days Delinquent
## 117 31-90 Days Delinquent
## 118 91-180 Days Delinquent
## 119 Current Repayment
## 120 Loans Transferring to the Debt Management and Collections System (DMCS)
## 121 181-270 Days Delinquent
## 122 271-360 Days Delinquent
## 123 31-90 Days Delinquent
## 124 91-180 Days Delinquent
## 125 Current Repayment
## 126 Loans Transferring to the Debt Management and Collections System (DMCS)
## Dollars Outstanding (billions) Recipients (millions)
## 1 6.8 0.46
## 2 3.8 0.27
## 3 18.7 1.05
## 4 10.5 0.69
## 5 193.5 8.35
## 6 4.1 0.22
## 7 6.2 0.43
## 8 4.2 0.31
## 9 23.2 1.29
## 10 12.2 0.77
## 11 188.5 7.91
## 12 2.0 0.14
## 13 7.6 0.48
## 14 3.6 0.27
## 15 20.6 1.10
## 16 13.6 0.82
## 17 224.8 9.08
## 18 2.2 0.15
## 19 7.6 0.47
## 20 4.2 0.28
## 21 17.7 0.94
## 22 14.9 0.85
## 23 230.3 9.06
## 24 1.4 0.09
## 25 8.6 0.52
## 26 4.5 0.30
## 27 23.4 1.19
## 28 13.4 0.76
## 29 248.6 9.64
## 30 1.6 0.10
## 31 7.6 0.46
## 32 5.4 0.34
## 33 25.4 1.25
## 34 15.2 0.84
## 35 247.2 9.34
## 36 1.0 0.06
## 37 9.2 0.53
## 38 5.1 0.32
## 39 26.3 1.25
## 40 15.7 0.82
## 41 291.5 10.59
## 42 1.9 0.11
## 43 8.8 0.48
## 44 5.6 0.33
## 45 23.1 1.07
## 46 18.0 0.90
## 47 310.8 11.07
## 48 1.9 0.11
## 49 10.8 0.57
## 50 5.6 0.32
## 51 26.5 1.20
## 52 17.6 0.87
## 53 331.3 11.58
## 54 2.2 0.12
## 55 10.5 0.54
## 56 6.7 0.37
## 57 28.0 1.27
## 58 17.4 0.83
## 59 332.0 11.37
## 60 2.5 0.14
## 61 10.7 0.53
## 62 6.8 0.36
## 63 27.9 1.20
## 64 18.1 0.85
## 65 374.4 12.45
## 66 2.1 0.11
## 67 9.9 0.48
## 68 6.2 0.31
## 69 27.3 1.14
## 70 17.9 0.80
## 71 377.8 12.45
## 72 1.7 0.09
## 73 10.4 0.49
## 74 6.5 0.33
## 75 30.6 1.24
## 76 18.4 0.83
## 77 400.8 12.98
## 78 1.4 0.07
## 79 10.5 0.49
## 80 7.3 0.35
## 81 32.2 1.30
## 82 20.3 0.88
## 83 406.7 12.84
## 84 1.5 0.07
## 85 11.1 0.50
## 86 6.6 0.32
## 87 34.1 1.32
## 88 20.5 0.87
## 89 448.3 13.74
## 90 1.9 0.09
## 91 11.5 0.50
## 92 6.8 0.31
## 93 30.9 1.18
## 94 23.2 0.94
## 95 461.4 13.90
## 96 1.7 0.08
## 97 13.8 0.59
## 98 7.6 0.35
## 99 35.7 1.33
## 100 22.8 0.93
## 101 472.2 14.20
## 102 1.6 0.08
## 103 12.5 0.54
## 104 8.3 0.37
## 105 35.0 1.27
## 106 22.2 0.89
## 107 467.8 13.77
## 108 1.5 0.07
## 109 12.3 0.51
## 110 8.0 0.35
## 111 37.3 1.31
## 112 21.3 0.82
## 113 518.9 14.99
## 114 2.1 0.10
## 115 11.0 0.44
## 116 7.0 0.30
## 117 28.4 0.98
## 118 22.7 0.85
## 119 533.7 15.25
## 120 1.2 0.06
## 121 12.5 0.50
## 122 6.8 0.29
## 123 43.3 1.53
## 124 26.0 0.98
## 125 540.5 15.24
## 126 1.0 0.05
Let’s look at outstanding dollars by status year-over-year to see if student loan debt has been growing.
#Subset to the 3rd quarter of each year so the partial 2013 and 2018 years don't mislead the year-over-year comparison
df_loan_final_3 <- subset(df_loan_final, Quarter=='Q3')
#Hiding warning message
ggplot(df_loan_final_3,aes(x= `Federal Fiscal Year`,y= `Dollars Outstanding (billions)`,fill = Loan_Status)) + geom_bar(stat = "identity") + theme(legend.position = "bottom")
Wow, that is a lot of student debt, and it is growing almost exponentially. Let’s take a closer look at the percentages of debt by status to see if there are any changes in the share of delinquent loans.
ggplot(df_loan_final_3,aes(x= `Federal Fiscal Year`,y= `Dollars Outstanding (billions)`,fill = Loan_Status)) +
geom_bar(position = "fill",stat = "identity") + theme(legend.position = "bottom") +
scale_y_continuous(labels = percent_format())
We see the amount of loans being sent to collections going down as a percentage of overall debt. More debt - at least in terms of dollars - is current in 2018 than in previous years. This could be because of an improving economy or perhaps much-needed reform to the student loan industry.
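To quantify that shift, here is a quick sketch (using the df_loan_final_3 subset built above) that computes each status’s share of Q3 outstanding dollars by year:
#Each loan status's share of Q3 outstanding dollars, by year
df_loan_share <- df_loan_final_3 %>%
  group_by(`Federal Fiscal Year`) %>%
  mutate(Share = `Dollars Outstanding (billions)` /
           sum(`Dollars Outstanding (billions)`, na.rm = TRUE)) %>%
  ungroup()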
Dataset 3: Messi Goals
Wikipedia has good stats for various sports stars, but they’re in non-tidy tables. We’re going to tidy them and then do some analysis. Based on the original posting of this untidy data, we’re going to produce some summary statistics on the number of goals per season. We’re going to aggregate goals by season, combining seasons split across multiple teams/leagues and amalgamating league and non-league tournament play into one total.
Note that the initial step was cutting and pasting the table from Wikipedia into an Excel file, which was saved as a csv and posted on Github. Wikipedia link: https://en.wikipedia.org/wiki/Lionel_Messi
#Reading csv file into dataframe
df_messi <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/Messi%20Career%20Stats.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
df_messi
## # A tibble: 24 x 13
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Club Seas~ Leag~ "" "" Copa~ "" Cham~ "" Other "" Total
## 2 "" "" "" "" "" "" "" Leag~ "" "" "" ""
## 3 "" "" Divi~ Apps Goals Apps Goals Apps Goals Apps Goals Apps
## 4 Barc~ 2003~ Terc~ 10 5 "" "" "" 10
## 5 Barc~ 2003~ Segu~ 5 0 "" "" "" 5
## 6 "" 2004~ Segu~ 17 6 "" "" "" 17
## 7 "" Total "" 22 6 "" "" "" 22
## 8 Barc~ 2004~ La L~ 7 1 1 0 1 0 "" 9
## 9 "" 2005~ La L~ 17 6 2 1 6 1 0 0 25
## 10 "" 2006~ La L~ 26 14 2 2 5 1 3[a] 0 36
## # ... with 14 more rows, and 1 more variable: V13 <chr>
#select the season column and the goal columns
df_messi_s <- select(df_messi, V2, V5, V7, V9, V11)
#clean up rows and apply names to columns
df_messi_s[3,1] <- df_messi_s[1,1]
df_messi_s <- df_messi_s[3:22,]
colnames(df_messi_s) <- df_messi_s[1,]
#remove the header row and the B-team total row
df_messi_s <- df_messi_s[-c(1,5),]
#Split the Season column on '[' to strip Wikipedia footnote markers
df_messi_split <- data.frame(do.call('rbind',strsplit(df_messi_s$Season,"[\\[]")))
df_messi_f <- cbind(df_messi_s,df_messi_split)
df_messi_f <- df_messi_f[-c(1,7)]
colnames(df_messi_f)[5] <- "Season"
#convert goals to numeric
df_messi_f$Goals <- as.numeric(df_messi_f$Goals)
df_messi_f$Goals.1 <- as.numeric(df_messi_f$Goals.1)
df_messi_f$Goals.2 <- as.numeric(df_messi_f$Goals.2)
df_messi_f$Goals.3 <- as.numeric(df_messi_f$Goals.3)
#create total goals column for all leagues for a season
#df_messi_f$TotalGoals <- df_messi_f$Goals + df_messi_f$Goals.1 + df_messi_f$Goals.2 + df_messi_f$Goals.3
df_messi_f$TotalGoals <- rowSums(df_messi_f[,c("Goals", "Goals.1", "Goals.2", "Goals.3")], na.rm=TRUE)
#drop individual goal columns
df_messi_g <- df_messi_f[,5:6]
#aggregate goals by season
df_messi_agg <- aggregate(df_messi_g$TotalGoals, by=list(df_messi_g$Season),
FUN=sum, na.rm=TRUE)
colnames(df_messi_agg) <- c("Season","Goals")
df_messi_agg
## Season Goals
## 1 200304 5
## 2 200405 7
## 3 200506 8
## 4 200607 17
## 5 200708 16
## 6 200809 38
## 7 200910 47
## 8 201011 53
## 9 201112 73
## 10 201213 60
## 11 201314 41
## 12 201415 58
## 13 201516 41
## 14 201617 54
## 15 201718 45
## 16 201819 11
On to the summary statistics.
summary(df_messi_agg)
## Season Goals
## 200304: 1 Min. : 5.00
## 200405: 1 1st Qu.:14.75
## 200506: 1 Median :41.00
## 200607: 1 Mean :35.88
## 200708: 1 3rd Qu.:53.25
## 200809: 1 Max. :73.00
## (Other):10
Boxplot.
boxplot(df_messi_agg$Goals)
Histogram.
hist(df_messi_agg$Goals)
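As one more view (a quick sketch with the ggplot2 already loaded), a bar chart of goals by season makes the peak years easier to spot:
#Goals by season
ggplot(df_messi_agg, aes(x = Season, y = Goals)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))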
A great career.