library(tidyverse)
library(ggplot2)
library(sqldf)

First we will look at the Ohio fraud data. Here is a briefing on the set:

Metropolitan Areas are defined by the Office of Management and Budget, and population estimates are based on 2018 U.S. Census figures. Metropolitan Areas are ranked based on the number of reports per 100,000 population. Reports exclude state-specific data contributor reports.

I did most of my editting in excel to the database.

Ohiofraud <- read.csv("Desktop:/Ohio_normalized.csv")
knitr::kable(Ohiofraud, caption = "Ohio Fraud")
Ohio Fraud
metro_area_f state reports
Cleveland-Elyria OH Metropolitan Statistical Area 714
Columbus OH Metropolitan Statistical Area 705
Dayton OH Metropolitan Statistical Area 621
Weirton-Steubenville WV-OH Metropolitan Statistical Area 614
Akron OH Metropolitan Statistical Area 596
Canton-Massillon OH Metropolitan Statistical Area 588
Youngstown-Warren-Boardman OH-PA Metropolitan Statistical Area 587
Cincinnati OH-KY-IN Metropolitan Statistical Area 586
Springfield OH Metropolitan Statistical Area 561
Toledo OH Metropolitan Statistical Area 545
Huntington-Ashland WV-KY-OH Metropolitan Statistical Area 489
Mansfield OH Metropolitan Statistical Area 489
Salem OH Micropolitan Statistical Area 484
Lima OH Metropolitan Statistical Area 467
Wheeling WV-OH Metropolitan Statistical Area 459
Wooster OH Micropolitan Statistical Area 454

Which cities have the highest number of incidents?

knitr::kable((Ohiofraud[1:3,] %>%
  arrange(desc(reports))), caption ="Cities with the Highest Number of Incidents")
Cities with the Highest Number of Incidents
metro_area_f state reports
Cleveland-Elyria OH Metropolitan Statistical Area 714
Columbus OH Metropolitan Statistical Area 705
Dayton OH Metropolitan Statistical Area 621

Cleveland with 714, Columbus with 705, and Dayton with 621

Where’s the nasty Nati?

Ohiofraud %>%filter(str_detect(metro_area,"Cincinn")

mean(Ohiofraud$reports)# 559.9375

The nasty Nati is just above the average at 586 reports. Let see if Cincinnati falls within the upper interquartile range.

summary(Ohiofraud$reports)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   454.0   487.8   573.5   559.9   600.5   714.0

Cincinnati is just under the 3rd Quartile which is 600. Was the average the best summary statistic to use? Let’s see a visual of the data.

The scatterplot shows several points that are scattered about in what appear to be clusters.

d <- Ohiofraud$reports

View(d)

qplot(Ohiofraud$reports,Ohiofraud$metro_area)

Now let’s also look at add a second column which will contain the data on id theft. I will need to revert to SQL for this query.

Ohioidtheft <- read.csv("Desktop:/Ohio_idtheft_BCNF.csv")
View(Ohioidtheft)

Ohioidtheft <- Ohioidtheft %>%
  rename(metro_area_t=Metropolitan.Area)%>%
  rename(reportsoftheft= X) %>%
  rename(state=Reports.per.100K.Population)

sqldf("SELECT *
      FROM Ohioidtheft")
##                  metro_area_t                                   state
## 1            Cleveland-Elyria        OH Metropolitan Statistical Area
## 2                    Columbus        OH Metropolitan Statistical Area
## 3                      Dayton        OH Metropolitan Statistical Area
## 4                       Akron        OH Metropolitan Statistical Area
## 5                      Toledo        OH Metropolitan Statistical Area
## 6  Youngstown-Warren-Boardman     OH-PA Metropolitan Statistical Area
## 7                  Cincinnati  OH-KY-IN Metropolitan Statistical Area
## 8                     Wooster        OH Micropolitan Statistical Area
## 9            Canton-Massillon        OH Metropolitan Statistical Area
## 10                  Mansfield        OH Metropolitan Statistical Area
## 11                       Lima        OH Metropolitan Statistical Area
## 12                      Salem        OH Micropolitan Statistical Area
## 13                Springfield        OH Metropolitan Statistical Area
## 14       Weirton-Steubenville     WV-OH Metropolitan Statistical Area
## 15         Huntington-Ashland  WV-KY-OH Metropolitan Statistical Area
## 16                   Wheeling     WV-OH Metropolitan Statistical Area
##    reportsoftheft
## 1             226
## 2             121
## 3             116
## 4             115
## 5             104
## 6              96
## 7              95
## 8              88
## 9              77
## 10             73
## 11             72
## 12             68
## 13             64
## 14             59
## 15             51
## 16             43

Do we have a primary key? Is it unique, irreducible, and not null… There should be 16 rows in this next query.

sqldf("SELECT DISTINCT metro_area_f
      FROM Ohiofraud")

So the easiest way I know of as of now, is to sort based on my primary key and add the data that way. If I had more data, I would need to first create a database, but for our purposes, this is fine.

sqldf("SELECT *
      FROM Ohiofraud f
      LEFT JOIN Ohioidtheft t
      ON metro_area.f = metro_area.t")#Error: No such column: metro_area.f?

Since the SQL join statment isn’t working. I will have to resort to the join function within…tidyverse. I am not going to show all of the queries I ran, but I used the sample queiries for the tidyverse package to figure out how to apply the same method to my dataset.

Using left join?

View(Ohiofraud)
View(Ohioidtheft)

Ohio_reports <- Ohiofraud %>%
left_join( Ohioidtheft, by=c("metro_area_f"="metro_area_t"))

View(Ohio_reports)

I need to remove a column and rename a column. For Simplicities sake we will call our collection of data ‘r’

r <- data.frame(Ohio_reports$metro_area_f, Ohio_reports$state.x, Ohio_reports$reports, Ohio_reports$reportsoftheft)
View(r)

View(
r %>%
  rename(metro_area= Ohio_reports.metro_area_f) %>%
  rename(state=Ohio_reports.state.x) %>%
  rename(fraud_reports=Ohio_reports.reports) %>%
  rename(theft_reports=Ohio_reports.reportsoftheft)
)

r <- r %>%
  rename(metro_area= Ohio_reports.metro_area_f) %>%
  rename(state=Ohio_reports.state.x) %>%
  rename(fraud_reports=Ohio_reports.reports) %>%
  rename(theft_reports=Ohio_reports.reportsoftheft)

View(r)

Now we finally have our dataset to work with. Truly 80% of the work.

rp <- ggplot(r, aes(r$metro_area, r$fraud_reports))

rp+ geom_point(color = "blue")+coord_flip()

rp2 <- ggplot(r, aes(r$metro_area, r$theft_reports))
 
rp2 + geom_point(color ="red")+coord_flip()

What is the correlation between the two variables?

cor(r$theft_reports, r$fraud_reports)
## [1] 0.7238463
# 0.7238 
rp <- ggplot(r, aes(r$theft_reports, r$fraud_reports, colour = r$metro_area))

rp+geom_point(mapping = aes(color=r$metro_area))+
  labs(title = "The Imperfect High Fraud High Theft Relationship",
         subtitle = "The relationship between fraud reports and theft reports.")

As you can see there is somewhat of an upward trend. If the fraud reports increase, so do the theft reports. Although a definite outlier does seem to be Cleveland-Elyria.

theft<- ggplot(r, aes(r$metro_area,r$theft_reports, colour = r$metro_area))

theft+coord_flip()+geom_col()

Clevland-Elyria sticks out like a sore thumb with 226 theft reports.

fraud<- ggplot(r, aes(r$metro_area,r$fraud_reports, colour = r$metro_area))

fraud+coord_flip()+geom_col()

Clevland-ELyria also has the highest number of reports here too with 714 fraud repots.

Take a look at more data at FTC.gov/data. Check out their incredible Tableau vizzes here: https://www.ftc.gov/enforcement/data-visualizations/explore-data.

https://rpubs.com/natester