Got Scammed?

library(tidyverse)
library(ggplot2)
library(sqldf)

First we will look at the Ohio fraud data. Here is a briefing on the set:

Metropolitan Areas are defined by the Office of Management and Budget, and population estimates are based on 2018 U.S. Census figures. Metropolitan Areas are ranked based on the number of reports per 100,000 population. Reports exclude state-specific data contributor reports.

I did most of my editting in excel to the database.

Ohiofraud <- read.csv("Desktop:/Ohio_normalized.csv")

knitr::kable(Ohiofraud, caption = "Ohio Fraud")

Ohio Fraud
metro_area_f	state	reports
Cleveland-Elyria	OH Metropolitan Statistical Area	714
Columbus	OH Metropolitan Statistical Area	705
Dayton	OH Metropolitan Statistical Area	621
Weirton-Steubenville	WV-OH Metropolitan Statistical Area	614
Akron	OH Metropolitan Statistical Area	596
Canton-Massillon	OH Metropolitan Statistical Area	588
Youngstown-Warren-Boardman	OH-PA Metropolitan Statistical Area	587
Cincinnati	OH-KY-IN Metropolitan Statistical Area	586
Springfield	OH Metropolitan Statistical Area	561
Toledo	OH Metropolitan Statistical Area	545
Huntington-Ashland	WV-KY-OH Metropolitan Statistical Area	489
Mansfield	OH Metropolitan Statistical Area	489
Salem	OH Micropolitan Statistical Area	484
Lima	OH Metropolitan Statistical Area	467
Wheeling	WV-OH Metropolitan Statistical Area	459
Wooster	OH Micropolitan Statistical Area	454

Which cities have the highest number of incidents?

knitr::kable((Ohiofraud[1:3,] %>%
  arrange(desc(reports))), caption ="Cities with the Highest Number of Incidents")

Cities with the Highest Number of Incidents
metro_area_f	state	reports
Cleveland-Elyria	OH Metropolitan Statistical Area	714
Columbus	OH Metropolitan Statistical Area	705
Dayton	OH Metropolitan Statistical Area	621

Cleveland with 714, Columbus with 705, and Dayton with 621

Where’s the nasty Nati?

Ohiofraud %>%filter(str_detect(metro_area,"Cincinn")

mean(Ohiofraud$reports)# 559.9375

The nasty Nati is just above the average at 586 reports. Let see if Cincinnati falls within the upper interquartile range.

summary(Ohiofraud$reports)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   454.0   487.8   573.5   559.9   600.5   714.0

Cincinnati is just under the 3rd Quartile which is 600. Was the average the best summary statistic to use? Let’s see a visual of the data.

The scatterplot shows several points that are scattered about in what appear to be clusters.

d <- Ohiofraud$reports

View(d)

qplot(Ohiofraud$reports,Ohiofraud$metro_area)

Now let’s also look at add a second column which will contain the data on id theft. I will need to revert to SQL for this query.

Ohioidtheft <- read.csv("Desktop:/Ohio_idtheft_BCNF.csv")

View(Ohioidtheft)

Ohioidtheft <- Ohioidtheft %>%
  rename(metro_area_t=Metropolitan.Area)%>%
  rename(reportsoftheft= X) %>%
  rename(state=Reports.per.100K.Population)

sqldf("SELECT *
      FROM Ohioidtheft")

##                  metro_area_t                                   state
## 1            Cleveland-Elyria        OH Metropolitan Statistical Area
## 2                    Columbus        OH Metropolitan Statistical Area
## 3                      Dayton        OH Metropolitan Statistical Area
## 4                       Akron        OH Metropolitan Statistical Area
## 5                      Toledo        OH Metropolitan Statistical Area
## 6  Youngstown-Warren-Boardman     OH-PA Metropolitan Statistical Area
## 7                  Cincinnati  OH-KY-IN Metropolitan Statistical Area
## 8                     Wooster        OH Micropolitan Statistical Area
## 9            Canton-Massillon        OH Metropolitan Statistical Area
## 10                  Mansfield        OH Metropolitan Statistical Area
## 11                       Lima        OH Metropolitan Statistical Area
## 12                      Salem        OH Micropolitan Statistical Area
## 13                Springfield        OH Metropolitan Statistical Area
## 14       Weirton-Steubenville     WV-OH Metropolitan Statistical Area
## 15         Huntington-Ashland  WV-KY-OH Metropolitan Statistical Area
## 16                   Wheeling     WV-OH Metropolitan Statistical Area
##    reportsoftheft
## 1             226
## 2             121
## 3             116
## 4             115
## 5             104
## 6              96
## 7              95
## 8              88
## 9              77
## 10             73
## 11             72
## 12             68
## 13             64
## 14             59
## 15             51
## 16             43

Do we have a primary key? Is it unique, irreducible, and not null… There should be 16 rows in this next query.

sqldf("SELECT DISTINCT metro_area_f
      FROM Ohiofraud")

So the easiest way I know of as of now, is to sort based on my primary key and add the data that way. If I had more data, I would need to first create a database, but for our purposes, this is fine.

sqldf("SELECT *
      FROM Ohiofraud f
      LEFT JOIN Ohioidtheft t
      ON metro_area.f = metro_area.t")#Error: No such column: metro_area.f?

Since the SQL join statment isn’t working. I will have to resort to the join function within…tidyverse. I am not going to show all of the queries I ran, but I used the sample queiries for the tidyverse package to figure out how to apply the same method to my dataset.

Using left join?

View(Ohiofraud)
View(Ohioidtheft)

Ohio_reports <- Ohiofraud %>%
left_join( Ohioidtheft, by=c("metro_area_f"="metro_area_t"))

View(Ohio_reports)

I need to remove a column and rename a column. For Simplicities sake we will call our collection of data ‘r’

r <- data.frame(Ohio_reports$metro_area_f, Ohio_reports$state.x, Ohio_reports$reports, Ohio_reports$reportsoftheft)
View(r)

View(
r %>%
  rename(metro_area= Ohio_reports.metro_area_f) %>%
  rename(state=Ohio_reports.state.x) %>%
  rename(fraud_reports=Ohio_reports.reports) %>%
  rename(theft_reports=Ohio_reports.reportsoftheft)
)

r <- r %>%
  rename(metro_area= Ohio_reports.metro_area_f) %>%
  rename(state=Ohio_reports.state.x) %>%
  rename(fraud_reports=Ohio_reports.reports) %>%
  rename(theft_reports=Ohio_reports.reportsoftheft)

View(r)

Now we finally have our dataset to work with. Truly 80% of the work.

rp <- ggplot(r, aes(r$metro_area, r$fraud_reports))

rp+ geom_point(color = "blue")+coord_flip()

rp2 <- ggplot(r, aes(r$metro_area, r$theft_reports))
 
rp2 + geom_point(color ="red")+coord_flip()

What is the correlation between the two variables?

cor(r$theft_reports, r$fraud_reports)

## [1] 0.7238463

# 0.7238

rp <- ggplot(r, aes(r$theft_reports, r$fraud_reports, colour = r$metro_area))

rp+geom_point(mapping = aes(color=r$metro_area))+
  labs(title = "The Imperfect High Fraud High Theft Relationship",
         subtitle = "The relationship between fraud reports and theft reports.")

As you can see there is somewhat of an upward trend. If the fraud reports increase, so do the theft reports. Although a definite outlier does seem to be Cleveland-Elyria.

theft<- ggplot(r, aes(r$metro_area,r$theft_reports, colour = r$metro_area))

theft+coord_flip()+geom_col()

Clevland-Elyria sticks out like a sore thumb with 226 theft reports.

fraud<- ggplot(r, aes(r$metro_area,r$fraud_reports, colour = r$metro_area))

fraud+coord_flip()+geom_col()

Clevland-ELyria also has the highest number of reports here too with 714 fraud repots.

Take a look at more data at FTC.gov/data. Check out their incredible Tableau vizzes here: https://www.ftc.gov/enforcement/data-visualizations/explore-data.

https://rpubs.com/natester