Tidy Data Project

1) Financial Complaints to the Consumer Financial Protection Bureau: 2011-2018

2) US Employment by NAICS Sector

3) Pokemon

Our original dataset was extremely large, so we first removed unneeded columns to speed up subsequent runs. The dataset is available at catalog.data.gov/dataset/consumer-complaint-database#topic=consumer_navigation

complaint.data<-read.csv('/Users/dawig/Documents/Consumer_Complaintsb.csv',header=TRUE)

complaint.data %<>%
  select(Product,Company,State,Submitted.via,Company.response.to.consumer,Timely.response.,Consumer.disputed.)
write.csv(complaint.data, file = "Consumer_Complaints_Selected.csv")

complaint.data<-read.csv('/Users/dawig/Documents/Consumer_Complaints_Selected.csv',header=TRUE)

complaint.data<-select(complaint.data,Product,Company,State,Submitted.via,Company.response.to.consumer,Timely.response.,Consumer.disputed. )


complaint.data %>% arrange(Company,State) -> complaint.data.2
head(complaint.data.2)
##                                                                        Product                                  Company State Submitted.via Company.response.to.consumer Timely.response. Consumer.disputed.
## 1 Credit reporting, credit repair services, or other personal consumer reports (Former)Shapiro, Swertfeger & Hasty, LLP    AL           Web            Untimely response               No                N/A
## 2 Credit reporting, credit repair services, or other personal consumer reports (Former)Shapiro, Swertfeger & Hasty, LLP    CA           Web            Untimely response               No                N/A
## 3                                                                     Mortgage (Former)Shapiro, Swertfeger & Hasty, LLP    DE      Referral            Untimely response               No                 No
## 4                                                              Debt collection (Former)Shapiro, Swertfeger & Hasty, LLP    GA      Referral            Untimely response               No                 No
## 5                                                                     Mortgage (Former)Shapiro, Swertfeger & Hasty, LLP    GA      Referral            Untimely response               No                 No
## 6                                                              Debt collection (Former)Shapiro, Swertfeger & Hasty, LLP    GA      Referral            Untimely response               No                 No

Apart from tidiness, our data contains no numerical values. We have to count the complaints and recombine the counts into a new data frame that contains the numerical data we are after.

While preparing our data frame, we found one company, PORTFOLIO RECOVERY ASSOCIATES INC, that never registered an untimely response despite having 7144 complaints. Mobiloans, LLC, however, didn’t answer any of its 308 complaints on time. Southwest Credit Systems, L.P. was second to Mobiloans with 304 untimely responses.
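
As a quick check, the timely-response breakdown for a single company can be tallied with count(). This is a minimal sketch; we match the company name on a substring, since the exact string in the raw data may vary.

# Sketch: tally timely vs. untimely responses for one company
complaint.data.2 %>%
  filter(grepl("PORTFOLIO RECOVERY", Company)) %>%
  count(Timely.response.)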

complaint.data.2 %>%
  group_by(Company,Timely.response.) %>% 
       summarise(number_complaints.y = n()) %>%
          filter(number_complaints.y>5000,Timely.response.=="Yes") %>% 
            arrange(desc(number_complaints.y),Timely.response.)->complaint.data.yes
complaint.data.2 %>%
  group_by(Company,Timely.response.) %>% 
       summarise(number_complaints.n = n()) %>%
          filter(Timely.response.=="No") %>% 
            arrange(desc(number_complaints.n),Timely.response.)->complaint.data.no
complaint.data.2 %>%
  group_by(Company) %>% 
       summarise(number_complaints = n()) %>%
          filter(number_complaints>5100) %>% 
            arrange(desc(number_complaints))->complaint.data.all  

We create numeric data from the individual complaints by counting them and filtering for the 25 most-complained-about companies. We put the data into three data frames so we can compare companies by their rate of answering complaints on time.

complaint.data.2<-left_join(complaint.data.all,complaint.data.yes[,-2])
## Joining, by = "Company"
complaint.data.2<-left_join(complaint.data.2,complaint.data.no[,-2])
## Joining, by = "Company"
complaint.data.2[is.na(complaint.data.2)] <- 0
complaint.data.2 %<>%
  mutate(percent_slow=number_complaints.n/number_complaints) %>% 
     mutate(slow_rank=(desc(rank(percent_slow))+26))

We take a look at our data. The top 25 companies are placed in order, and we rank each company within the top 25 by the proportion of complaints not answered on time.
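
One quick way to inspect this ranking is a minimal sketch like the one below; the column names follow the mutate() calls above, and a slow_rank of 1 marks the highest share of late answers.

# Sketch: view the top-25 companies ordered from slowest to fastest responder
complaint.data.2 %>%
  arrange(slow_rank) %>%
  select(Company, number_complaints, percent_slow, slow_rank)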

The credit reporting agencies, Bank of America, and Wells Fargo stand out as the greatest complaint attractors. It makes sense that the credit reporting agencies accumulated a lot of complaints: when Americans need credit, the banks they use turn to these three agencies. Among the large banks, Wells Fargo attracts attention for ranking first in failing to respond to complaints quickly.

To prepare to look at Wells Fargo’s data on its own, we created a separate file containing only complaints about Wells Fargo.

complaint.data<-read.csv('/Users/dawig/Documents/Consumer_Complaintsb.csv',header=TRUE)
pattern <- paste('WELLS', collapse = "|")
complaint.data.wells<-  subset(complaint.data,grepl(pattern,Company))  
head(complaint.data.wells)
write.csv(complaint.data.wells, file = "Consumer_Complaints_Wells.csv")

We want to look at Wells Fargo’s complaints over time. To order them as a time series, we split the dates into separate numbers and pad them to a fixed width that will be accepted as a date format.

complaint.data<-read.csv('/Users/dawig/Documents/Consumer_Complaints_Wells.csv',header=TRUE)
complaint.data %>% 
  separate(Date.received,c("Month","Day", "Year"),sep="/") %>% 
     group_by(Year,Month,Day) %>% 
       summarise(number_complaints.day = n()) ->Yearly.chart
Yearly.chart<-transform(Yearly.chart, Day = as.numeric(Day))
Yearly.chart[,3]<-sprintf("%02.0f", Yearly.chart[,3]) 
Yearly.chart<-transform(Yearly.chart, Month = as.numeric(Month))
Yearly.chart[,2]<-sprintf("%02.0f", Yearly.chart[,2]) 
Yearly.chart<- unite(Yearly.chart,Date,Year,Month,Day, sep = "-",remove=FALSE)
Yearly.chart[,1]<-as.Date(Yearly.chart[,1])
Yearly.chart<- unite(Yearly.chart,Year_Month,Year,Month, sep = ".",remove=FALSE)
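
A more compact alternative, sketched here with base R only and assuming Date.received is stored as mm/dd/yyyy strings, would parse the dates in a single step:

# Sketch: convert mm/dd/yyyy strings directly to Date objects
complaint.data$Date.parsed <- as.Date(complaint.data$Date.received, format = "%m/%d/%Y")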

We want to explore possible reasons for the changes in complaints over time. Google Trends will also be graphed as a proxy for public interest in Wells Fargo to see whether interest is tied to complaints. As it turns out below, Google Trends is not a perfect measure of public interest and did not respond directly to increased news coverage when the account fraud scandal began.
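
The daily complaint counts assembled in Yearly.chart can be graphed with a short ggplot2 call; this is a minimal sketch, assuming ggplot2 is loaded:

# Sketch: Wells Fargo complaints received per day, using the Date column built above
ggplot(Yearly.chart, aes(x = Date, y = number_complaints.day)) +
  geom_line(colour = '#af571c') +
  labs(title = 'Wells Fargo Complaints per Day', x = '', y = 'Complaints received')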

Two spikes can be found in Wells Fargo complaints. For the first, Google Trends shows a spike in December 2012, the month before the complaint spike.

In September 2016, Wells Fargo’s account manipulation scandal broke. This led to a spike in complaints to the CFPB but not to increased Google search interest; the public did not appear to turn to Google for information about the scandal.

(http://www.slate.com/blogs/moneybox/2016/09/08/wells_fargo_to_pay_185_million_for_account_opening_scandal_that_s_not_enough.html)

A longer Google Trends time series shows that our first spike was the result of fast growth and Wells Fargo’s long-standing position as market leader at a time when refinancing and bank foreclosures were part of the national conversation. Wells Fargo accounted for 1 in 4 mortgage originations at the time. An article in Forbes tells part of this story:

“For two days, financial services giant Wells Fargo, America’s largest residential-mortgage originator, took over 104,000 square feet for one of the 51 ‘Home Preservation Workshops’ it has held over the past three years.”

https://www.forbes.com/sites/halahtouryalai/2012/01/25/wells-fargo-the-bank-that-works/#6316f00f718e

The CFPB only began compiling complaint data around the time of the first spike, so we cannot tell how far back this trend extends. Finally, a look at Wells Fargo’s market capitalization brings our question into clearer focus: Wells Fargo continued to grow quickly after 2012, but complaints subsided until 2016. Mortgage difficulties and refinancing anxiety appear to have been the largest contributors to our first spike.

=======================================================================


Our second data set comes from the US Census Bureau. It includes employment information for all US workers, broken down by congressional district and NAICS sector.


We include graphs of the top and bottom states in three industries, measured by the proportion of employment in the named sector, followed by descriptions of the steps we took to tidy the data and produce the graphs.


From our graphs, we can see that Washington, D.C., Massachusetts, and Rhode Island have relatively large education sectors, while Alaska, Wyoming, and Nevada have relatively small ones. The former are large, early-established population centers with many colleges; the latter group is more rural and was not settled in large numbers early. Wyoming, North Dakota, and Alaska have many mining and fossil fuel jobs; Washington, D.C. has almost none. (Were they lobbyists?) California, Washington, D.C., and Washington have large information sectors, while Nevada, Mississippi, and Louisiana have relatively small information sectors.

We load our CSV file and then clean up the header, which contains information about the data but does not begin directly with column names followed by the data.

Business.size<-read.csv('/Users/dawig/Documents/cd_naicssector_2015.csv', header=FALSE)
Business.size<-as.matrix(Business.size)
Business.size<-Business.size[-c(1,2,3,4,6,7),]
Business.size<-Business.size[,-c(14,15,16)]
colnames(Business.size)<-Business.size[1,]
Business.size<-tbl_df(Business.size[-1,])
Business.size<-Business.size[,c(2,5,6,9)]
colnames(Business.size)<-c('State','Description','Business_size_category','Employment')
Business.size<-transform(Business.size, Employment = as.numeric(Employment))

We take the data from the separate congressional districts within each state and combine it, removing the “total” grouping so it does not interfere with our percentages. We spread the business-size categories so that each category’s employee count gets its own column rather than being stored as values within a single column.

Business.size %<>%
  filter(Description!='Total') %>% 
   arrange(State,Description,Business_size_category) %>% 
     group_by(State,Description,Business_size_category) %>% 
        summarise(Number_of_Employees=sum(Employment))
Business.size<-spread(Business.size,Business_size_category,Number_of_Employees)
Business.size[is.na(Business.size)] <- 0    

To create a column giving the proportion of a state’s employment in each industry, we build a table of total state employment, left join it to our main table, and compute the proportion in a new column.

Business.size %>%
  group_by(State) %>% 
     summarise(Number_of_Employees_State=sum(`1:  Total`))-> State.total.df
Business.size<-left_join(Business.size,State.total.df)

Business.size %<>% 
  mutate(Sector_share=`1:  Total`/Number_of_Employees_State)

Finally, we create our plots by filtering for the industry of interest and for Sector_share thresholds that select the top and bottom states.

Business.size %>%
  filter(Description=='Educational Services',Sector_share>.0374) %>% 
  ggplot(aes(x=State,y=Sector_share)) +
    geom_point(aes(x=State,y=Sector_share,size=2)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#52b781')) +
    theme(legend.position="none") +
    labs(title='Education Sector as a Proportion of Employment',subtitle='Top 10 and Bottom 10') +
    theme(plot.title = element_text(colour='#af571c', size=20)) +
    xlab("") +
    ylim(.02,.12) -> plot.a

Business.size %>%
  filter(Description=='Educational Services',Sector_share<.0177) %>% 
  ggplot(aes(x=State,y=Sector_share)) +
    geom_point(aes(x=State,y=Sector_share,size=2)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#52b781')) +
    theme(legend.position="none") +
    xlab("") +
    ylim(.009,.018) -> plot.b

grid.arrange(plot.a, plot.b, nrow = 2)

Business.size %>%
  filter(Description=='Mining, Quarrying, and Oil and Gas Extraction',Sector_share>.011) %>% 
  ggplot(aes(x=State,y=Sector_share)) +
    geom_point(aes(x=State,y=Sector_share,size=2)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#7fd893')) +
    theme(legend.position="none") +
    labs(title='Mining, Quarrying, and Oil and Gas Extraction Sector as a Proportion of Employment',subtitle='Top 10 and Bottom 10') +
    theme(plot.title = element_text(colour='#af571c', size=20)) +
    xlab("") +
    ylim(.008,.12) -> plot.a

Business.size %>%
  filter(Description=='Mining, Quarrying, and Oil and Gas Extraction',Sector_share<.0005) %>% 
  ggplot(aes(x=State,y=Sector_share)) +
    geom_point(aes(x=State,y=Sector_share,size=2)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#7fd893')) +
    theme(legend.position="none") +
    xlab("") +
    ylim(.000005,.0005) -> plot.b

grid.arrange(plot.a, plot.b, nrow = 2)

Business.size %>%
  filter(Description=='Information',Sector_share>.028) %>% 
  ggplot(aes(x=State,y=Sector_share)) +
    geom_point(aes(x=State,y=Sector_share,size=2)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#52b781')) +
    theme(legend.position="none") +
    labs(title='Information Sector as a Proportion of Employment',subtitle='Top 10 and Bottom 10') +
    theme(plot.title = element_text(colour='#af571c', size=20)) +
    xlab("") +
    ylim(.025,.055) -> plot.a

Business.size %>%
  filter(Description=='Information',Sector_share<.0187) %>% 
  ggplot(aes(x=State,y=Sector_share)) +
    geom_point(aes(x=State,y=Sector_share,size=2)) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    theme(panel.background = element_rect(fill = '#52b781')) +
    theme(legend.position="none") +
    xlab("") +
    ylim(.0142,.0188) -> plot.b

grid.arrange(plot.a, plot.b, nrow = 2)

write.csv(Business.size, file = "/Users/dawig/Documents/cd_Naic_Test.csv")

=======================================================================

Final Data Set: Pokemon

After downloading our Pokemon data, we transpose it so that observations are in rows and variables in columns. The data initially resisted this, but reading it with header=FALSE made it possible. That left us with the task of promoting the appropriate row to variable names and starting the data in the first row.

Poke.matrix<-matrix(ncol=41,nrow=801)
Poke.data <-read.csv(text=getURL("https://raw.githubusercontent.com/gabartomeo/data607-cunysps/master/Project02/pokemon_proj02.csv"), header=FALSE,stringsAsFactors = FALSE)

Poke.data<- t(Poke.data)
colnames(Poke.data)<- Poke.data[1,]
Poke.data<- Poke.data[-1,]
Poke.matrix[,39]<- as.numeric(as.character(Poke.data[,39]))
for(i in c(2:26)){
Poke.matrix[,i]<- as.numeric(as.character(Poke.data[,i]))
}

Our data was of mixed type, and the numerical data was particularly stubborn, resisting the usual efforts to cast it as numeric. We created a new matrix so that the numeric data we wanted to use would have a clean place to reside, which allowed us to recast it as numeric. To analyze the data, we look at the unadjusted R2 of appropriate variables to find promising correlations.

lin.model<-matrix(nrow=41)
for (i in c(2:24,26)) {
  # Fit a simple linear model of column 39 on each candidate predictor
  # and store its unadjusted R-squared
  lin.model.temp<-lm(Poke.matrix[,39]~Poke.matrix[,i])
  lin.model[i]<-summary(lin.model.temp)$r.squared
}
lin.model[1:26]
##  [1]           NA 0.0010181653 0.0015109262 0.0158768474 0.0105937777 0.0097068355 0.0261029767 0.0019469150 0.0160640089 0.0022914830 0.0025128752 0.0119267546 0.0003005317 0.0184437381 0.0415693352 0.0077716885 0.0127109526 0.0002607293 0.0146553462 0.1460637040 0.1965023900 0.1647058290 0.2098654149 0.1071815038           NA 0.1798325444
lin.model.20<-lm(Poke.matrix[,39]~Poke.matrix[,20])
lin.model.21<-lm(Poke.matrix[,39]~Poke.matrix[,21])
lin.model.22<-lm(Poke.matrix[,39]~Poke.matrix[,22])
lin.model.23<-lm(Poke.matrix[,39]~Poke.matrix[,23])
lin.model.24<-lm(Poke.matrix[,39]~Poke.matrix[,24])
lin.model.26<-lm(Poke.matrix[,39]~Poke.matrix[,26])
summary(lin.model.26)
## 
## Call:
## lm(formula = Poke.matrix[, 39] ~ Poke.matrix[, 26])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -277.21  -37.77  -14.51    8.30  893.37 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -48.5160     9.1257  -5.316 1.38e-07 ***
## Poke.matrix[, 26]    1.5053     0.1152  13.069  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99.1 on 779 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.1798, Adjusted R-squared:  0.1788 
## F-statistic: 170.8 on 1 and 779 DF,  p-value: < 2.2e-16
Poke.matrix<-tbl_df(Poke.matrix)

We ran 6 linear models after checking the unadjusted R2 for 24 variables. This type of exploratory method is potentially dangerous when making inferences. We can justify it for three reasons: (1) we have 801 observations, allowing sufficient degrees of freedom; (2) we’re not making important decisions based on our results; (3) this researcher was already in college when Pokemon was released and doesn’t really understand Pokemon.
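
If we did want to treat these scans as formal tests, the slope p-values could be collected and adjusted for multiple comparisons. The following is only a sketch, using the same column indices as above and base R's p.adjust(); it is not part of our original analysis.

# Sketch: gather the slope p-value from each simple regression and apply a
# Bonferroni correction for the 24 comparisons
p.vals <- sapply(c(2:24,26), function(i)
  summary(lm(Poke.matrix[[39]] ~ Poke.matrix[[i]]))$coefficients[2, 4])
p.adjust(p.vals, method = "bonferroni")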

Base egg steps and base happiness both take discrete values, which causes the data to line up in bands. This makes sense, given that these variables were set at fixed levels by a programmer or author at some point. Base egg steps shows a mixture of continuous and discrete data.
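
To see the banding directly, a scatterplot of one of these predictors against the response is enough. This is a minimal sketch; the column index standing in for base egg steps is an assumption and should be replaced with the correct one.

# Sketch: banded predictor vs. the response used in the models above
# (column 20 standing in for base egg steps -- an assumption, not verified)
plot(Poke.matrix[[20]], Poke.matrix[[39]],
     xlab = "base egg steps (assumed column 20)", ylab = "response (column 39)")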