Rpubs link: http://rpubs.com/jefflittlejohn/Data_607_Project_2
Github link: https://github.com/littlejohnjeff/DATA607_Fall2018/blob/master/Data%20607%20-%20Week%205%20Assignment%20-%20Tidying%20and%20Transforming%20Data%20-%20Jeff%20Littlejohn.Rmd
Our task is to choose any three of the “wide” datasets identified in the Week 6 Discussion items and then:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
library(tidyr)
library(dplyr)
library(ggplot2)
library(stringr)
library(gdata)
library(RCurl)
library(magrittr)
library(scales)
Dataset 1: FBI Uniform Crime Reporting (UCR) Program - Law Enforcement Officers Killed - Feloniously and Accidentally
The FBI UCR program tracks police officer deaths in two separate data sets: Law Enforcement Officers Killed “Feloniously,” meaning killed in the line of duty as the result of intentional actions by an assailant, and Law Enforcement Officers Killed “Accidentally,” meaning killed in an unintentional incident while on duty.
Wrangling: We wish to combine these data sets, remove subtotals, and convert the horizontal counts by year into a single year column to fit tidy standards. Consider adding a region column to enable analysis by part of the country (a sketch of such a join appears after the combined data set below).
Analysis: Do more officers die in accidents than are killed feloniously? Has that trend been steady over the years?
On with the wrangling, starting with the LEOK Feloniously data. Link: https://ucr.fbi.gov/leoka/2017/topic-pages/tables/table-1.xls
#Initially retrieved the feloniously-killed data, which lives in xls format on an FBI web page, using read_xls, but availability was inconsistent, so a csv copy is read from GitHub instead
#Reading csv file into dataframe
df_felon <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/table-1%20-%20LEOFK.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
head(df_felon)
## # A tibble: 6 x 12
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 Table~ "" NA NA NA NA NA NA NA NA NA NA
## 2 Law E~ "" NA NA NA NA NA NA NA NA NA NA
## 3 Regio~ "" NA NA NA NA NA NA NA NA NA NA
## 4 Area Total 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 5 Numbe~ 496 41 48 55 72 49 27 51 41 66 46
## 6 NORTH~ 50 3 7 3 10 6 2 8 4 4 3
Issues with the read-in data frame: title and metadata rows at the top, year labels stored in a data row rather than as column names, and regional subtotal rows mixed in with the states. Let’s tidy the data.
#Credit for help creating a subset based on SQL NOT IN logic: #https://stackoverflow.com/questions/7494848/standard-way-to-remove-multiple-elements-from-a-dataframe
df_felon <- df_felon[!df_felon$V1 %in% c("Law Enforcement Officers Feloniously Killed", "Region, Geographic Division, and State, 2008â???2017", "Number of victim officers","NORTHEAST","New England","Middle Atlantic","MIDWEST","East North Central","West North Central","SOUTH","South Atlantic", "East South Central", "West South Central", "WEST", "Mountain", "Pacific", "PUERTO RICO AND OTHER OUTLYING AREAS"),]
#Drop the title row and the pesky region header row whose character encoding issue prevented it from being caught by the filter above
df_felon <- df_felon[3:59,]
#add column names
df_felon_names <- c("State", "Total", df_felon[1,3],df_felon[1,4],df_felon[1,5],df_felon[1,6],df_felon[1,7],df_felon[1,8],df_felon[1,9],df_felon[1,10],df_felon[1,11],df_felon[1,12])
names(df_felon) <- df_felon_names
#Drop the year header row now that the column names have been applied
df_felon <- df_felon[2:57,]
#Remove second column showing total 2008-2017 values by state
df_felon <- df_felon[-c(2)]
head(df_felon)
## # A tibble: 6 x 11
## State `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 Conn~ 0 0 0 0 0 0 0 0 0
## 2 Maine 0 0 0 0 0 0 0 0 0
## 3 Mass~ 0 0 1 0 1 1 0 0 1
## 4 New ~ 0 0 0 0 1 0 1 0 0
## 5 Rhod~ 0 0 0 0 0 0 0 0 0
## 6 Verm~ 0 0 0 0 0 0 0 0 0
## # ... with 1 more variable: `2017` <int>
Now gather the year columns into tidy form.
#first two arguments in gather name the key and value columns created by the unpivot
#the minus sign tells gather to leave State as is
df_felon <- df_felon %>%
gather(Year,Victims,-State)
#Add label Feloniously or Accidentally
df_felon$F_or_K <- "Felony"
head(df_felon)
## # A tibble: 6 x 4
## State Year Victims F_or_K
## <chr> <chr> <int> <chr>
## 1 Connecticut 2008 0 Felony
## 2 Maine 2008 0 Felony
## 3 Massachusetts 2008 0 Felony
## 4 New Hampshire 2008 0 Felony
## 5 Rhode Island 2008 0 Felony
## 6 Vermont 2008 0 Felony
Now we’re going to repeat the same process with the LEOK Accidentally data, showing less of the work this time.
#Initially retrieved the accidentally-killed data, which lives in xls format on an FBI web page, using read_xls, but availability was inconsistent, so a csv copy is read from GitHub instead
#Reading csv file into dataframe
df_accidental <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/table-48%20-%20LEOAK.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
#Remove subtotal rows
df_accidental <- df_accidental[!df_accidental$V1 %in% c("Law Enforcement Officers Feloniously Killed", "Region, Geographic Division, and State, 2008â???2017", "Number of victim officers","NORTHEAST","New England","Middle Atlantic","MIDWEST","East North Central","West North Central","SOUTH","South Atlantic", "East South Central", "West South Central", "WEST", "Mountain", "Pacific", "PUERTO RICO AND OTHER OUTLYING AREAS"),]
#Remove first three rows which are metadata
df_accidental <- df_accidental[4:61,]
#Remove second column showing total 2008-2017 values by state
df_accidental <- df_accidental[-c(2)]
#add column names
df_accidental_names <- c("State",df_accidental[1,2],df_accidental[1,3],df_accidental[1,4],df_accidental[1,5],df_accidental[1,6],df_accidental[1,7],df_accidental[1,8],df_accidental[1,9],df_accidental[1,10],df_accidental[1,11])
names(df_accidental) <- df_accidental_names
df_accidental <- df_accidental[2:58,]
#first two arguments in gather name the key and value columns created by the unpivot
#the minus sign tells gather to leave State as is
df_accidental <- df_accidental %>%
gather(Year,Victims,-State)
#Add label Feloniously or Accidentally
df_accidental$F_or_K <- "Accident"
head(df_accidental)
## # A tibble: 6 x 4
## State Year Victims F_or_K
## <chr> <chr> <int> <chr>
## 1 Connecticut 2008 1 Accident
## 2 Maine 2008 0 Accident
## 3 Massachusetts 2008 0 Accident
## 4 New Hampshire 2008 0 Accident
## 5 Rhode Island 2008 0 Accident
## 6 Vermont 2008 0 Accident
Now let’s combine the two data sets vertically.
df_leok <- rbind(df_felon, df_accidental)
df_leok
## # A tibble: 1,130 x 4
## State Year Victims F_or_K
## * <chr> <chr> <int> <chr>
## 1 Connecticut 2008 0 Felony
## 2 Maine 2008 0 Felony
## 3 Massachusetts 2008 0 Felony
## 4 New Hampshire 2008 0 Felony
## 5 Rhode Island 2008 0 Felony
## 6 Vermont 2008 0 Felony
## 7 New Jersey 2008 0 Felony
## 8 New York 2008 0 Felony
## 9 Pennsylvania 2008 3 Felony
## 10 Illinois 2008 3 Felony
## # ... with 1,120 more rows
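As mentioned in the wrangling plan, a region column would let us compare parts of the country. Base R ships state-to-region lookups (state.name and state.region in the datasets package), so a minimal sketch of that join, not used in the analysis below, might look like this:
#Sketch: attach a Census region to each state and join it onto the tidy data
#Non-state rows (e.g., District of Columbia, Puerto Rico) will get NA and would
#need to be mapped by hand
region_lookup <- data.frame(State = state.name,
                            Region = as.character(state.region),
                            stringsAsFactors = FALSE)
df_leok_region <- df_leok %>%
  left_join(region_lookup, by = "State")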
Let’s do some quick analysis.
#Aggregate to get summary stats
agg_leok_year <-aggregate(df_leok$Victims, by=list(df_leok$Year),
FUN=sum, na.rm=TRUE)
summary(agg_leok_year)
## Group.1 x
## Length:10 Min. : 76.00
## Class :character 1st Qu.: 93.75
## Mode :character Median : 96.50
## Mean :102.30
## 3rd Qu.:115.75
## Max. :127.00
hist(agg_leok_year$x)
#Hiding warning message
ggplot(df_leok,aes(x=Year,y=Victims,fill=F_or_K),na.rm = TRUE) + geom_bar(stat = "identity")
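To answer the headline question directly, a quick total by cause (a small addition using the dplyr verbs already loaded) can be pulled from the tidy data:
#Total victims by cause across 2008-2017
df_leok %>%
  group_by(F_or_K) %>%
  summarise(Total_Victims = sum(Victims, na.rm = TRUE))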
Overall officer deaths have been relatively steady since 2008. The 2016 total was inflated by a single event in Dallas that unfortunately killed numerous officers; 2009 and 2010 may have had similar outlier events. Over time, deaths appear to be almost evenly split between accidents and felony events.
In a world with infinite time, it would be very interesting to pull the dataset showing number of employed law enforcement officers by state to get state death rates. Additionally, it would be interesting to compare on-the-job police deaths with other dangerous occupations.
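As a sketch of that first idea, and assuming a hypothetical df_officers table of sworn-officer counts by state (not loaded anywhere in this report), the death-rate calculation could look like this:
#Hypothetical sketch only: df_officers stands in for a real table of officer
#counts by state; the NA placeholder below would be replaced by actual figures
df_officers <- data.frame(State = unique(df_leok$State),
                          Officer_Count = NA_real_,
                          stringsAsFactors = FALSE)
df_death_rate <- df_leok %>%
  group_by(State) %>%
  summarise(Deaths = sum(Victims, na.rm = TRUE)) %>%
  left_join(df_officers, by = "State") %>%
  mutate(Deaths_per_10k_Officers = 10000 * Deaths / Officer_Count)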
Dataset 2: Student Loan Defaults
Wrangling: This data set has merged row and column labels that need to be handled. It shows loan status horizontally by column - we’ll want to tidy that up by creating a single loan status column.
Analysis: Have student loan delinquencies increased as a percentage of outstanding loans during the time period conveyed by the dataset?
Let’s bring in the data.
#Reading csv file into dataframe
df_loan_del <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/DLPortfoliobyDelinquencyStatus.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
head(df_loan_del)
## # A tibble: 6 x 14
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Direc~ "" "" "" "" "" "" "" "" "" "" ""
## 2 Inclu~ "" "" "" "" "" "" "" "" "" "" ""
## 3 Data ~ "" "" "" "" "" "" "" "" "" "" ""
## 4 "" "" "" "" "" "" "" "" "" "" "" ""
## 5 "" "" Curr~ "" 31-9~ "" 91-1~ "" 181-~ "" 271-~ ""
## 6 Feder~ "" Doll~ Reci~ Doll~ Reci~ Doll~ Reci~ Doll~ Reci~ Doll~ Reci~
## # ... with 2 more variables: V13 <chr>, V14 <chr>
Not at all tidy! Begin the cleanup.
#Removing first few rows showing titles and last few rows conveying notes
df_loan_del <- df_loan_del[5:28,]
#Duplicate year values to subsequent rows from same period
#2013 Q1-Q2
df_loan_del[4,1] <- df_loan_del[3,1]
#2014
df_loan_del[6:8,1] <- df_loan_del[5,1]
#2015
df_loan_del[10:12,1] <- df_loan_del[9,1]
#2016
df_loan_del[14:16,1] <- df_loan_del[13,1]
#2017
df_loan_del[18:20,1] <- df_loan_del[17,1]
#2018 Q1-Q3
df_loan_del[22:23,1] <- df_loan_del[21,1]
#Duplicate loan status to subsequent column with same status
#Current
df_loan_del[1,4] <- df_loan_del[1,3]
#31-90
df_loan_del[1,6] <- df_loan_del[1,5]
#91-180
df_loan_del[1,8] <- df_loan_del[1,7]
#181-270
df_loan_del[1,10] <- df_loan_del[1,9]
#271-360
df_loan_del[1,12] <- df_loan_del[1,11]
#Collections
df_loan_del[1,14] <- df_loan_del[1,13]
#There is undoubtedly a better way to do this, but I don't know how to use gather when there are multiple rows containing headers (a vectorized alternative is sketched below)
#We're going to concatenate the Loan Status and Dollars vs. No. of Recipients columns before using gather and then separate them
df_loan_del[2,3] <- paste(c(df_loan_del[1,3], df_loan_del[2,3]), collapse = "--")
df_loan_del[2,4] <- paste(c(df_loan_del[1,4], df_loan_del[2,4]), collapse = "--")
df_loan_del[2,5] <- paste(c(df_loan_del[1,5], df_loan_del[2,5]), collapse = "--")
df_loan_del[2,6] <- paste(c(df_loan_del[1,6], df_loan_del[2,6]), collapse = "--")
df_loan_del[2,7] <- paste(c(df_loan_del[1,7], df_loan_del[2,7]), collapse = "--")
df_loan_del[2,8] <- paste(c(df_loan_del[1,8], df_loan_del[2,8]), collapse = "--")
df_loan_del[2,9] <- paste(c(df_loan_del[1,9], df_loan_del[2,9]), collapse = "--")
df_loan_del[2,10] <- paste(c(df_loan_del[1,10], df_loan_del[2,10]), collapse = "--")
df_loan_del[2,11] <- paste(c(df_loan_del[1,11], df_loan_del[2,11]), collapse = "--")
df_loan_del[2,12] <- paste(c(df_loan_del[1,12], df_loan_del[2,12]), collapse = "--")
df_loan_del[2,13] <- paste(c(df_loan_del[1,13], df_loan_del[2,13]), collapse = "--")
df_loan_del[2,14] <- paste(c(df_loan_del[1,14], df_loan_del[2,14]), collapse = "--")
#Remove the top row now that concatenation has occurred
df_loan_del <- df_loan_del[2:24,]
#Create "Quarter" value in row 1 before using it to create colnames for df
#Use row 1 as column names before deleting it
df_loan_del[1,2] <- "Quarter"
df_loan_names <- df_loan_del[1,]
names(df_loan_del) <- df_loan_names
#Remove row 1 now that names are there
df_loan_del <- df_loan_del[2:22,]
head(df_loan_del)
## # A tibble: 6 x 14
## `Federal Fiscal~ Quarter `Current Repaym~ `Current Repaym~
## <chr> <chr> <chr> <chr>
## 1 2013 Q3 "$193.5 " 8.35
## 2 2013 Q4 "$188.5 " 7.91
## 3 2014 Q1 "$224.8 " 9.08
## 4 2014 Q2 "$230.3 " 9.06
## 5 2014 Q3 "$248.6 " 9.64
## 6 2014 Q4 "$247.2 " 9.34
## # ... with 10 more variables: `31-90 Days Delinquent--Dollars Outstanding
## # (in billions)` <chr>, `31-90 Days Delinquent--Recipients (in
## # millions)` <chr>, `91-180 Days Delinquent--Dollars Outstanding (in
## # billions)` <chr>, `91-180 Days Delinquent--Recipients (in
## # millions)` <chr>, `181-270 Days Delinquent--Dollars Outstanding (in
## # billions)` <chr>, `181-270 Days Delinquent--Recipients (in
## # millions)` <chr>, `271-360 Days Delinquent--Dollars Outstanding (in
## # billions)` <chr>, `271-360 Days Delinquent--Recipients (in
## # millions)` <chr>, `Loans Transferring to the Debt Management and
## # Collections System (DMCS)--Dollars Outstanding (in billions)` <chr>,
## # `Loans Transferring to the Debt Management and Collections System
## # (DMCS)--Recipients (in millions)` <chr>
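As the comment above concedes, the twelve column-by-column paste() calls are repetitive. A more compact alternative (just a sketch; it assumes the pre-concatenation layout, with status labels in row 1 and units in row 2 of columns 3 through 14) would paste the two header rows in one vectorized step:
#Vectorized alternative to the column-by-column paste() calls
#(would be run in place of them, before the top row is dropped)
hdr_cols <- 3:14
df_loan_del[2, hdr_cols] <- as.list(paste(unlist(df_loan_del[1, hdr_cols]),
                                          unlist(df_loan_del[2, hdr_cols]),
                                          sep = "--"))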
#first two arguments in gather name the key and value columns created by the unpivot
#the minus signs tell gather to leave Federal Fiscal Year and Quarter as is
df_loan <- df_loan_del %>%
gather(Loan_Status_and_Units,Counts,-`Federal Fiscal Year`,-Quarter)
This is going to get kludgy.
#Split concatenated loan status column
df_loan_add <- data.frame(do.call('rbind',strsplit(df_loan$Loan_Status_and_Units,"--")))
df_loan <- cbind(df_loan,df_loan_add)
#rename new columns
colnames(df_loan)[5] <- "Loan_Status"
colnames(df_loan)[6] <- "Units"
#get rid of concatenated column
df_loan <- df_loan[-c(3)]
#subset out the dollars
df_loan_dollars <- subset(df_loan, str_detect(df_loan$Units,"Dollars"))
#rename count column
colnames(df_loan_dollars)[3] <- "Dollars Outstanding (billions)"
#subset out the number of loans
df_loan_numb <- subset(df_loan, Units=='Recipients (in millions)')
colnames(df_loan_numb)[3] <- "Recipients (millions)"
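For reference, tidyr’s separate() could replace the strsplit()/cbind() step above; a sketch, starting again from df_loan_del:
#Sketch: split the concatenated "Status--Units" column with separate()
df_loan_alt <- df_loan_del %>%
  gather(Loan_Status_and_Units, Counts, -`Federal Fiscal Year`, -Quarter) %>%
  separate(Loan_Status_and_Units, into = c("Loan_Status", "Units"), sep = "--")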
The weird workarounds continue.
#df_loan_dollars
#df_loan_numb
#merge the number of loans in as a new column
df_loan_final <- merge(df_loan_dollars, df_loan_numb, by.x = c("Federal Fiscal Year", "Quarter", "Loan_Status"), by.y = c("Federal Fiscal Year", "Quarter", "Loan_Status"))
#Get rid of Units columns
df_loan_final <- df_loan_final[-c(5)]
df_loan_final <- df_loan_final[-c(6)]
#Create numeric values
df_loan_final$`Dollars Outstanding (billions)` <- as.numeric(substr(df_loan_final$`Dollars Outstanding (billions)`,2,nchar(df_loan_final$`Dollars Outstanding (billions)`)))
df_loan_final$`Recipients (millions)` <- as.numeric(df_loan_final$`Recipients (millions)`)
#This should be our tidy data set
df_loan_final
## Federal Fiscal Year Quarter
## 1 2013 Q3
## 2 2013 Q3
## 3 2013 Q3
## 4 2013 Q3
## 5 2013 Q3
## 6 2013 Q3
## 7 2013 Q4
## 8 2013 Q4
## 9 2013 Q4
## 10 2013 Q4
## 11 2013 Q4
## 12 2013 Q4
## 13 2014 Q1
## 14 2014 Q1
## 15 2014 Q1
## 16 2014 Q1
## 17 2014 Q1
## 18 2014 Q1
## 19 2014 Q2
## 20 2014 Q2
## 21 2014 Q2
## 22 2014 Q2
## 23 2014 Q2
## 24 2014 Q2
## 25 2014 Q3
## 26 2014 Q3
## 27 2014 Q3
## 28 2014 Q3
## 29 2014 Q3
## 30 2014 Q3
## 31 2014 Q4
## 32 2014 Q4
## 33 2014 Q4
## 34 2014 Q4
## 35 2014 Q4
## 36 2014 Q4
## 37 2015 Q1
## 38 2015 Q1
## 39 2015 Q1
## 40 2015 Q1
## 41 2015 Q1
## 42 2015 Q1
## 43 2015 Q2
## 44 2015 Q2
## 45 2015 Q2
## 46 2015 Q2
## 47 2015 Q2
## 48 2015 Q2
## 49 2015 Q3
## 50 2015 Q3
## 51 2015 Q3
## 52 2015 Q3
## 53 2015 Q3
## 54 2015 Q3
## 55 2015 Q4
## 56 2015 Q4
## 57 2015 Q4
## 58 2015 Q4
## 59 2015 Q4
## 60 2015 Q4
## 61 2016 Q1
## 62 2016 Q1
## 63 2016 Q1
## 64 2016 Q1
## 65 2016 Q1
## 66 2016 Q1
## 67 2016 Q2
## 68 2016 Q2
## 69 2016 Q2
## 70 2016 Q2
## 71 2016 Q2
## 72 2016 Q2
## 73 2016 Q3
## 74 2016 Q3
## 75 2016 Q3
## 76 2016 Q3
## 77 2016 Q3
## 78 2016 Q3
## 79 2016 Q4
## 80 2016 Q4
## 81 2016 Q4
## 82 2016 Q4
## 83 2016 Q4
## 84 2016 Q4
## 85 2017 Q1
## 86 2017 Q1
## 87 2017 Q1
## 88 2017 Q1
## 89 2017 Q1
## 90 2017 Q1
## 91 2017 Q2
## 92 2017 Q2
## 93 2017 Q2
## 94 2017 Q2
## 95 2017 Q2
## 96 2017 Q2
## 97 2017 Q3
## 98 2017 Q3
## 99 2017 Q3
## 100 2017 Q3
## 101 2017 Q3
## 102 2017 Q3
## 103 2017 Q4
## 104 2017 Q4
## 105 2017 Q4
## 106 2017 Q4
## 107 2017 Q4
## 108 2017 Q4
## 109 2018 Q1
## 110 2018 Q1
## 111 2018 Q1
## 112 2018 Q1
## 113 2018 Q1
## 114 2018 Q1
## 115 2018 Q2
## 116 2018 Q2
## 117 2018 Q2
## 118 2018 Q2
## 119 2018 Q2
## 120 2018 Q2
## 121 2018 Q3
## 122 2018 Q3
## 123 2018 Q3
## 124 2018 Q3
## 125 2018 Q3
## 126 2018 Q3
## Loan_Status
## 1 181-270 Days Delinquent
## 2 271-360 Days Delinquent
## 3 31-90 Days Delinquent
## 4 91-180 Days Delinquent
## 5 Current Repayment
## 6 Loans Transferring to the Debt Management and Collections System (DMCS)
## 7 181-270 Days Delinquent
## 8 271-360 Days Delinquent
## 9 31-90 Days Delinquent
## 10 91-180 Days Delinquent
## 11 Current Repayment
## 12 Loans Transferring to the Debt Management and Collections System (DMCS)
## 13 181-270 Days Delinquent
## 14 271-360 Days Delinquent
## 15 31-90 Days Delinquent
## 16 91-180 Days Delinquent
## 17 Current Repayment
## 18 Loans Transferring to the Debt Management and Collections System (DMCS)
## 19 181-270 Days Delinquent
## 20 271-360 Days Delinquent
## 21 31-90 Days Delinquent
## 22 91-180 Days Delinquent
## 23 Current Repayment
## 24 Loans Transferring to the Debt Management and Collections System (DMCS)
## 25 181-270 Days Delinquent
## 26 271-360 Days Delinquent
## 27 31-90 Days Delinquent
## 28 91-180 Days Delinquent
## 29 Current Repayment
## 30 Loans Transferring to the Debt Management and Collections System (DMCS)
## 31 181-270 Days Delinquent
## 32 271-360 Days Delinquent
## 33 31-90 Days Delinquent
## 34 91-180 Days Delinquent
## 35 Current Repayment
## 36 Loans Transferring to the Debt Management and Collections System (DMCS)
## 37 181-270 Days Delinquent
## 38 271-360 Days Delinquent
## 39 31-90 Days Delinquent
## 40 91-180 Days Delinquent
## 41 Current Repayment
## 42 Loans Transferring to the Debt Management and Collections System (DMCS)
## 43 181-270 Days Delinquent
## 44 271-360 Days Delinquent
## 45 31-90 Days Delinquent
## 46 91-180 Days Delinquent
## 47 Current Repayment
## 48 Loans Transferring to the Debt Management and Collections System (DMCS)
## 49 181-270 Days Delinquent
## 50 271-360 Days Delinquent
## 51 31-90 Days Delinquent
## 52 91-180 Days Delinquent
## 53 Current Repayment
## 54 Loans Transferring to the Debt Management and Collections System (DMCS)
## 55 181-270 Days Delinquent
## 56 271-360 Days Delinquent
## 57 31-90 Days Delinquent
## 58 91-180 Days Delinquent
## 59 Current Repayment
## 60 Loans Transferring to the Debt Management and Collections System (DMCS)
## 61 181-270 Days Delinquent
## 62 271-360 Days Delinquent
## 63 31-90 Days Delinquent
## 64 91-180 Days Delinquent
## 65 Current Repayment
## 66 Loans Transferring to the Debt Management and Collections System (DMCS)
## 67 181-270 Days Delinquent
## 68 271-360 Days Delinquent
## 69 31-90 Days Delinquent
## 70 91-180 Days Delinquent
## 71 Current Repayment
## 72 Loans Transferring to the Debt Management and Collections System (DMCS)
## 73 181-270 Days Delinquent
## 74 271-360 Days Delinquent
## 75 31-90 Days Delinquent
## 76 91-180 Days Delinquent
## 77 Current Repayment
## 78 Loans Transferring to the Debt Management and Collections System (DMCS)
## 79 181-270 Days Delinquent
## 80 271-360 Days Delinquent
## 81 31-90 Days Delinquent
## 82 91-180 Days Delinquent
## 83 Current Repayment
## 84 Loans Transferring to the Debt Management and Collections System (DMCS)
## 85 181-270 Days Delinquent
## 86 271-360 Days Delinquent
## 87 31-90 Days Delinquent
## 88 91-180 Days Delinquent
## 89 Current Repayment
## 90 Loans Transferring to the Debt Management and Collections System (DMCS)
## 91 181-270 Days Delinquent
## 92 271-360 Days Delinquent
## 93 31-90 Days Delinquent
## 94 91-180 Days Delinquent
## 95 Current Repayment
## 96 Loans Transferring to the Debt Management and Collections System (DMCS)
## 97 181-270 Days Delinquent
## 98 271-360 Days Delinquent
## 99 31-90 Days Delinquent
## 100 91-180 Days Delinquent
## 101 Current Repayment
## 102 Loans Transferring to the Debt Management and Collections System (DMCS)
## 103 181-270 Days Delinquent
## 104 271-360 Days Delinquent
## 105 31-90 Days Delinquent
## 106 91-180 Days Delinquent
## 107 Current Repayment
## 108 Loans Transferring to the Debt Management and Collections System (DMCS)
## 109 181-270 Days Delinquent
## 110 271-360 Days Delinquent
## 111 31-90 Days Delinquent
## 112 91-180 Days Delinquent
## 113 Current Repayment
## 114 Loans Transferring to the Debt Management and Collections System (DMCS)
## 115 181-270 Days Delinquent
## 116 271-360 Days Delinquent
## 117 31-90 Days Delinquent
## 118 91-180 Days Delinquent
## 119 Current Repayment
## 120 Loans Transferring to the Debt Management and Collections System (DMCS)
## 121 181-270 Days Delinquent
## 122 271-360 Days Delinquent
## 123 31-90 Days Delinquent
## 124 91-180 Days Delinquent
## 125 Current Repayment
## 126 Loans Transferring to the Debt Management and Collections System (DMCS)
## Dollars Outstanding (billions) Recipients (millions)
## 1 6.8 0.46
## 2 3.8 0.27
## 3 18.7 1.05
## 4 10.5 0.69
## 5 193.5 8.35
## 6 4.1 0.22
## 7 6.2 0.43
## 8 4.2 0.31
## 9 23.2 1.29
## 10 12.2 0.77
## 11 188.5 7.91
## 12 2.0 0.14
## 13 7.6 0.48
## 14 3.6 0.27
## 15 20.6 1.10
## 16 13.6 0.82
## 17 224.8 9.08
## 18 2.2 0.15
## 19 7.6 0.47
## 20 4.2 0.28
## 21 17.7 0.94
## 22 14.9 0.85
## 23 230.3 9.06
## 24 1.4 0.09
## 25 8.6 0.52
## 26 4.5 0.30
## 27 23.4 1.19
## 28 13.4 0.76
## 29 248.6 9.64
## 30 1.6 0.10
## 31 7.6 0.46
## 32 5.4 0.34
## 33 25.4 1.25
## 34 15.2 0.84
## 35 247.2 9.34
## 36 1.0 0.06
## 37 9.2 0.53
## 38 5.1 0.32
## 39 26.3 1.25
## 40 15.7 0.82
## 41 291.5 10.59
## 42 1.9 0.11
## 43 8.8 0.48
## 44 5.6 0.33
## 45 23.1 1.07
## 46 18.0 0.90
## 47 310.8 11.07
## 48 1.9 0.11
## 49 10.8 0.57
## 50 5.6 0.32
## 51 26.5 1.20
## 52 17.6 0.87
## 53 331.3 11.58
## 54 2.2 0.12
## 55 10.5 0.54
## 56 6.7 0.37
## 57 28.0 1.27
## 58 17.4 0.83
## 59 332.0 11.37
## 60 2.5 0.14
## 61 10.7 0.53
## 62 6.8 0.36
## 63 27.9 1.20
## 64 18.1 0.85
## 65 374.4 12.45
## 66 2.1 0.11
## 67 9.9 0.48
## 68 6.2 0.31
## 69 27.3 1.14
## 70 17.9 0.80
## 71 377.8 12.45
## 72 1.7 0.09
## 73 10.4 0.49
## 74 6.5 0.33
## 75 30.6 1.24
## 76 18.4 0.83
## 77 400.8 12.98
## 78 1.4 0.07
## 79 10.5 0.49
## 80 7.3 0.35
## 81 32.2 1.30
## 82 20.3 0.88
## 83 406.7 12.84
## 84 1.5 0.07
## 85 11.1 0.50
## 86 6.6 0.32
## 87 34.1 1.32
## 88 20.5 0.87
## 89 448.3 13.74
## 90 1.9 0.09
## 91 11.5 0.50
## 92 6.8 0.31
## 93 30.9 1.18
## 94 23.2 0.94
## 95 461.4 13.90
## 96 1.7 0.08
## 97 13.8 0.59
## 98 7.6 0.35
## 99 35.7 1.33
## 100 22.8 0.93
## 101 472.2 14.20
## 102 1.6 0.08
## 103 12.5 0.54
## 104 8.3 0.37
## 105 35.0 1.27
## 106 22.2 0.89
## 107 467.8 13.77
## 108 1.5 0.07
## 109 12.3 0.51
## 110 8.0 0.35
## 111 37.3 1.31
## 112 21.3 0.82
## 113 518.9 14.99
## 114 2.1 0.10
## 115 11.0 0.44
## 116 7.0 0.30
## 117 28.4 0.98
## 118 22.7 0.85
## 119 533.7 15.25
## 120 1.2 0.06
## 121 12.5 0.50
## 122 6.8 0.29
## 123 43.3 1.53
## 124 26.0 0.98
## 125 540.5 15.24
## 126 1.0 0.05
Let’s look at outstanding dollars by status year-over-year to see if student loan debt has been growing.
#Subset to the 3rd quarter of each year so the partial 2013 and 2018 years don't mislead the year-over-year comparison
df_loan_final_3 <- subset(df_loan_final, Quarter=='Q3')
#Hiding warning message
ggplot(df_loan_final_3,aes(x= `Federal Fiscal Year`,y= `Dollars Outstanding (billions)`,fill = Loan_Status)) + geom_bar(stat = "identity") + theme(legend.position = "bottom")
Wow, that is a lot of student debt, and it is growing almost exponentially. Let’s take a closer look at the percentages of debt by status to see if there are any changes in the share of delinquent loans.
ggplot(df_loan_final_3,aes(x= `Federal Fiscal Year`,y= `Dollars Outstanding (billions)`,fill = Loan_Status)) +
geom_bar(position = "fill",stat = "identity") + theme(legend.position = "bottom") +
scale_y_continuous(labels = percent_format())
We see the amount of loans being sent to collections going down as a percentage of overall debt. More debt - at least in terms of dollars - is current in 2018 than in previous years. This could be because of an improving economy or perhaps much-needed reform to the student loan industry.
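To quantify that shift, here is a quick sketch (using the df_loan_final_3 subset built above) that computes each status’s share of Q3 outstanding dollars by year:
#Each loan status's share of Q3 outstanding dollars, by year
df_loan_share <- df_loan_final_3 %>%
  group_by(`Federal Fiscal Year`) %>%
  mutate(Share = `Dollars Outstanding (billions)` /
           sum(`Dollars Outstanding (billions)`, na.rm = TRUE)) %>%
  ungroup()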
Dataset 3: Messi Goals
Wikipedia has good stats for various sports stars, but they’re in non-tidy tables. We’re going to tidy them and then do some analysis. Based on the original posting of this untidy data, we’re going to produce some summary statistics on the number of goals per season. We’re going to aggregate goals by season, combining seasons split across multiple teams/leagues and amalgamating league and non-league tournament play into one total.
Note that the initial step was cutting and pasting the table from Wikipedia into an Excel file, which was saved as a csv and posted on Github. Wikipedia link: https://en.wikipedia.org/wiki/Lionel_Messi
#Reading csv file into dataframe
df_messi <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/Messi%20Career%20Stats.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
df_messi
## # A tibble: 24 x 13
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Club Seas~ Leag~ "" "" Copa~ "" Cham~ "" Other "" Total
## 2 "" "" "" "" "" "" "" Leag~ "" "" "" ""
## 3 "" "" Divi~ Apps Goals Apps Goals Apps Goals Apps Goals Apps
## 4 Barc~ 2003~ Terc~ 10 5 "" "" "" 10
## 5 Barc~ 2003~ Segu~ 5 0 "" "" "" 5
## 6 "" 2004~ Segu~ 17 6 "" "" "" 17
## 7 "" Total "" 22 6 "" "" "" 22
## 8 Barc~ 2004~ La L~ 7 1 1 0 1 0 "" 9
## 9 "" 2005~ La L~ 17 6 2 1 6 1 0 0 25
## 10 "" 2006~ La L~ 26 14 2 2 5 1 3[a] 0 36
## # ... with 14 more rows, and 1 more variable: V13 <chr>
#select the season column and the goal columns
df_messi_s <- select(df_messi, V2, V5, V7, V9, V11)
#clean up rows and apply names to columns
df_messi_s[3,1] <- df_messi_s[1,1]
df_messi_s <- df_messi_s[3:22,]
colnames(df_messi_s) <- df_messi_s[1,]
#remove the header row and the B-team total row
df_messi_s <- df_messi_s[-c(1,5),]
#Split the Season column on '[' to strip Wikipedia footnote markers
df_messi_split <- data.frame(do.call('rbind',strsplit(df_messi_s$Season,"[\\[]")))
df_messi_f <- cbind(df_messi_s,df_messi_split)
df_messi_f <- df_messi_f[-c(1,7)]
colnames(df_messi_f)[5] <- "Season"
#convert goals to numeric
df_messi_f$Goals <- as.numeric(df_messi_f$Goals)
df_messi_f$Goals.1 <- as.numeric(df_messi_f$Goals.1)
df_messi_f$Goals.2 <- as.numeric(df_messi_f$Goals.2)
df_messi_f$Goals.3 <- as.numeric(df_messi_f$Goals.3)
#create total goals column for all leagues for a season
#df_messi_f$TotalGoals <- df_messi_f$Goals + df_messi_f$Goals.1 + df_messi_f$Goals.2 + df_messi_f$Goals.3
df_messi_f$TotalGoals <- rowSums(df_messi_f[,c("Goals", "Goals.1", "Goals.2", "Goals.3")], na.rm=TRUE)
#drop individual goal columns
df_messi_g <- df_messi_f[,5:6]
#aggregate goals by season
df_messi_agg <- aggregate(df_messi_g$TotalGoals, by=list(df_messi_g$Season),
FUN=sum, na.rm=TRUE)
colnames(df_messi_agg) <- c("Season","Goals")
df_messi_agg
## Season Goals
## 1 200304 5
## 2 200405 7
## 3 200506 8
## 4 200607 17
## 5 200708 16
## 6 200809 38
## 7 200910 47
## 8 201011 53
## 9 201112 73
## 10 201213 60
## 11 201314 41
## 12 201415 58
## 13 201516 41
## 14 201617 54
## 15 201718 45
## 16 201819 11
On to the summary statistics.
summary(df_messi_agg)
## Season Goals
## 200304: 1 Min. : 5.00
## 200405: 1 1st Qu.:14.75
## 200506: 1 Median :41.00
## 200607: 1 Mean :35.88
## 200708: 1 3rd Qu.:53.25
## 200809: 1 Max. :73.00
## (Other):10
Boxplot.
boxplot(df_messi_agg$Goals)
Histogram.
hist(df_messi_agg$Goals)
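As one more view (a quick sketch with the ggplot2 already loaded), a bar chart of goals by season makes the peak years easier to spot:
#Goals by season
ggplot(df_messi_agg, aes(x = Season, y = Goals)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))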
A great career.