library(tidyverse)
library(ggplot2)
library(sqldf)
First we will look at the Ohio fraud data. Here is a briefing on the set:
Metropolitan Areas are defined by the Office of Management and Budget, and population estimates are based on 2018 U.S. Census figures. Metropolitan Areas are ranked based on the number of reports per 100,000 population. Reports exclude state-specific data contributor reports.
I did most of my editting in excel to the database.
Ohiofraud <- read.csv("Desktop:/Ohio_normalized.csv")
knitr::kable(Ohiofraud, caption = "Ohio Fraud")
| metro_area_f | state | reports |
|---|---|---|
| Cleveland-Elyria | OH Metropolitan Statistical Area | 714 |
| Columbus | OH Metropolitan Statistical Area | 705 |
| Dayton | OH Metropolitan Statistical Area | 621 |
| Weirton-Steubenville | WV-OH Metropolitan Statistical Area | 614 |
| Akron | OH Metropolitan Statistical Area | 596 |
| Canton-Massillon | OH Metropolitan Statistical Area | 588 |
| Youngstown-Warren-Boardman | OH-PA Metropolitan Statistical Area | 587 |
| Cincinnati | OH-KY-IN Metropolitan Statistical Area | 586 |
| Springfield | OH Metropolitan Statistical Area | 561 |
| Toledo | OH Metropolitan Statistical Area | 545 |
| Huntington-Ashland | WV-KY-OH Metropolitan Statistical Area | 489 |
| Mansfield | OH Metropolitan Statistical Area | 489 |
| Salem | OH Micropolitan Statistical Area | 484 |
| Lima | OH Metropolitan Statistical Area | 467 |
| Wheeling | WV-OH Metropolitan Statistical Area | 459 |
| Wooster | OH Micropolitan Statistical Area | 454 |
Which cities have the highest number of incidents?
knitr::kable((Ohiofraud[1:3,] %>%
arrange(desc(reports))), caption ="Cities with the Highest Number of Incidents")
| metro_area_f | state | reports |
|---|---|---|
| Cleveland-Elyria | OH Metropolitan Statistical Area | 714 |
| Columbus | OH Metropolitan Statistical Area | 705 |
| Dayton | OH Metropolitan Statistical Area | 621 |
Cleveland with 714, Columbus with 705, and Dayton with 621
Where’s the nasty Nati?
Ohiofraud %>%filter(str_detect(metro_area,"Cincinn")
mean(Ohiofraud$reports)# 559.9375
The nasty Nati is just above the average at 586 reports. Let see if Cincinnati falls within the upper interquartile range.
summary(Ohiofraud$reports)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 454.0 487.8 573.5 559.9 600.5 714.0
Cincinnati is just under the 3rd Quartile which is 600. Was the average the best summary statistic to use? Let’s see a visual of the data.
The scatterplot shows several points that are scattered about in what appear to be clusters.
d <- Ohiofraud$reports
View(d)
qplot(Ohiofraud$reports,Ohiofraud$metro_area)
Now let’s also look at add a second column which will contain the data on id theft. I will need to revert to SQL for this query.
Ohioidtheft <- read.csv("Desktop:/Ohio_idtheft_BCNF.csv")
View(Ohioidtheft)
Ohioidtheft <- Ohioidtheft %>%
rename(metro_area_t=Metropolitan.Area)%>%
rename(reportsoftheft= X) %>%
rename(state=Reports.per.100K.Population)
sqldf("SELECT *
FROM Ohioidtheft")
## metro_area_t state
## 1 Cleveland-Elyria OH Metropolitan Statistical Area
## 2 Columbus OH Metropolitan Statistical Area
## 3 Dayton OH Metropolitan Statistical Area
## 4 Akron OH Metropolitan Statistical Area
## 5 Toledo OH Metropolitan Statistical Area
## 6 Youngstown-Warren-Boardman OH-PA Metropolitan Statistical Area
## 7 Cincinnati OH-KY-IN Metropolitan Statistical Area
## 8 Wooster OH Micropolitan Statistical Area
## 9 Canton-Massillon OH Metropolitan Statistical Area
## 10 Mansfield OH Metropolitan Statistical Area
## 11 Lima OH Metropolitan Statistical Area
## 12 Salem OH Micropolitan Statistical Area
## 13 Springfield OH Metropolitan Statistical Area
## 14 Weirton-Steubenville WV-OH Metropolitan Statistical Area
## 15 Huntington-Ashland WV-KY-OH Metropolitan Statistical Area
## 16 Wheeling WV-OH Metropolitan Statistical Area
## reportsoftheft
## 1 226
## 2 121
## 3 116
## 4 115
## 5 104
## 6 96
## 7 95
## 8 88
## 9 77
## 10 73
## 11 72
## 12 68
## 13 64
## 14 59
## 15 51
## 16 43
Do we have a primary key? Is it unique, irreducible, and not null… There should be 16 rows in this next query.
sqldf("SELECT DISTINCT metro_area_f
FROM Ohiofraud")
So the easiest way I know of as of now, is to sort based on my primary key and add the data that way. If I had more data, I would need to first create a database, but for our purposes, this is fine.
sqldf("SELECT *
FROM Ohiofraud f
LEFT JOIN Ohioidtheft t
ON metro_area.f = metro_area.t")#Error: No such column: metro_area.f?
Since the SQL join statment isn’t working. I will have to resort to the join function within…tidyverse. I am not going to show all of the queries I ran, but I used the sample queiries for the tidyverse package to figure out how to apply the same method to my dataset.
Using left join?
View(Ohiofraud)
View(Ohioidtheft)
Ohio_reports <- Ohiofraud %>%
left_join( Ohioidtheft, by=c("metro_area_f"="metro_area_t"))
View(Ohio_reports)
I need to remove a column and rename a column. For Simplicities sake we will call our collection of data ‘r’
r <- data.frame(Ohio_reports$metro_area_f, Ohio_reports$state.x, Ohio_reports$reports, Ohio_reports$reportsoftheft)
View(r)
View(
r %>%
rename(metro_area= Ohio_reports.metro_area_f) %>%
rename(state=Ohio_reports.state.x) %>%
rename(fraud_reports=Ohio_reports.reports) %>%
rename(theft_reports=Ohio_reports.reportsoftheft)
)
r <- r %>%
rename(metro_area= Ohio_reports.metro_area_f) %>%
rename(state=Ohio_reports.state.x) %>%
rename(fraud_reports=Ohio_reports.reports) %>%
rename(theft_reports=Ohio_reports.reportsoftheft)
View(r)
Now we finally have our dataset to work with. Truly 80% of the work.
rp <- ggplot(r, aes(r$metro_area, r$fraud_reports))
rp+ geom_point(color = "blue")+coord_flip()
rp2 <- ggplot(r, aes(r$metro_area, r$theft_reports))
rp2 + geom_point(color ="red")+coord_flip()
What is the correlation between the two variables?
cor(r$theft_reports, r$fraud_reports)
## [1] 0.7238463
# 0.7238
rp <- ggplot(r, aes(r$theft_reports, r$fraud_reports, colour = r$metro_area))
rp+geom_point(mapping = aes(color=r$metro_area))+
labs(title = "The Imperfect High Fraud High Theft Relationship",
subtitle = "The relationship between fraud reports and theft reports.")
As you can see there is somewhat of an upward trend. If the fraud reports increase, so do the theft reports. Although a definite outlier does seem to be Cleveland-Elyria.
theft<- ggplot(r, aes(r$metro_area,r$theft_reports, colour = r$metro_area))
theft+coord_flip()+geom_col()
Clevland-Elyria sticks out like a sore thumb with 226 theft reports.
fraud<- ggplot(r, aes(r$metro_area,r$fraud_reports, colour = r$metro_area))
fraud+coord_flip()+geom_col()
Clevland-ELyria also has the highest number of reports here too with 714 fraud repots.
Take a look at more data at FTC.gov/data. Check out their incredible Tableau vizzes here: https://www.ftc.gov/enforcement/data-visualizations/explore-data.