Abstract

The project utilizes information acquired from the Financial Crimes Enforcement Network on a number of suspected subjects of mortgage fraud between fiscal year 2006 and 2012. County population size was shown to have a strong correlation with the number of subjects reported. The majority of the suspected subjects were in counties above 122,112 residents and states above 28,251,324 residents. A majority of cases were identified in the southern region of the United States with the southern counties of Florida having the highest number of subjects per capita of 1,000,000 residents.

Introduction

The Financial Crimes Enforcement Network (FinCEN) is a bureau of the U.S. Department of the Treasury. It is the intermediary between the United States financial system and law enforcement. As a regulator, FinCEN gathers reports from financial institutions regarding suspicious activity relating to illicit use of funds. FinCEN published a report on the number of subjects listed in suspicious activity reports involving mortgage fraud from fiscal year 2006 to fiscal year 2012. Mortgage fraud is crime characterized by some type of material misstatement, misrepresentation, or omission in relation to a mortgage loan which is then relied upon by a lender. The aim of the project is to observe if subjects of mortgage fraud are more prevalent in higher populated areas.

Packages

#load required packages
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.5

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## Warning: package 'tibble' was built under R version 4.0.5

## Warning: package 'tidyr' was built under R version 4.0.5

## Warning: package 'stringr' was built under R version 4.0.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(stringr)
library(zoo)

## Warning: package 'zoo' was built under R version 4.0.5

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(highcharter)

## Warning: package 'highcharter' was built under R version 4.0.5

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Highcharts (www.highcharts.com) is a Highsoft software product which is

## not free for commercial and Governmental use

library(RColorBrewer)
library(tmap)

## Warning: package 'tmap' was built under R version 4.0.5

library(tmaptools)

## Warning: package 'tmaptools' was built under R version 4.0.5

library(leaflet)

## Warning: package 'leaflet' was built under R version 4.0.5

library(sf)

## Warning: package 'sf' was built under R version 4.0.5

## Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1

library(leaflet.extras)

## Warning: package 'leaflet.extras' was built under R version 4.0.5

library(dplyr)
library(rio)

## Warning: package 'rio' was built under R version 4.0.5

library(sp)

## Warning: package 'sp' was built under R version 4.0.5

library(raster)

## Warning: package 'raster' was built under R version 4.0.5

## 
## Attaching package: 'raster'

## The following object is masked from 'package:dplyr':
## 
##     select

## The following object is masked from 'package:tidyr':
## 
##     extract

library(rgdal)

## Warning: package 'rgdal' was built under R version 4.0.5

## rgdal: version: 1.5-23, (SVN revision 1121)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.2.1, released 2020/12/29
## Path to GDAL shared files: C:/Users/mivul/OneDrive/Documents/R/win-library/4.0/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 7.2.1, January 1st, 2021, [PJ_VERSION: 721]
## Path to PROJ shared files: C:/Users/mivul/OneDrive/Documents/R/win-library/4.0/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.4-5
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading rgdal.
## Overwritten PROJ_LIB was C:/Users/mivul/OneDrive/Documents/R/win-library/4.0/rgdal/proj

Setup Workstation

setwd("C:/Users/mivul/OneDrive/Desktop/Data 110/Datasets")
county<-read.csv("FinCEN county total subjects.csv")
state<-read.csv("FinCEN by State.csv")
census_state<-read.csv("nst-est2019-alldata.csv")
state_FIP<-read.csv("state-geocodes-v2017copy.csv")
census_county<-read.csv("county pop.csv")
census_county2<-read.csv("county pop2.csv")

Dataset

The mortgage fraud datasets were aquired from https://www.fincen.gov/fincen-mortgage-fraud-sar-datasets. The datasets offered by FinCEN showed mortgage fraud activity at the state, county. and metropolitan level. The state and county datasets were used in this project. The number of reported subjects were defined by the number of unique names and addresses in the suspicious activity reports. Name variations in reports could have caused an inflation in the reported numbers. Due to the methodology the data that was collected by FinCEN, the state and county values of reported subjects are not equal even though information come from the same source. The two datasets will be used conduct separate analysis.

Variable Description

County Dataset Description

County: Name of the county

State: Name of the State

County.FIPS: A code used by the census that uniquely identifies each county in the United States

XYear.Qnumber: Number of reported subjects for the fiscal year quarter

State Dataset Description

State.Name: Name of the State

State.Abbre.Viation: Two letter abbreviation of the state

XYear.Qnumber: Number of reported subjects for the fiscal year quarter

Data Exploration

Clean State Data

#Convert variables from numeric to characters
state$X2011.Q4<-as.character(state$X2011.Q4)
state$X2012.Q1<-as.character(state$X2012.Q1)
state$X2012.Q2<-as.character(state$X2012.Q2)

#Convert to long format
state_long<-state %>%
  pivot_longer(3:28, names_to = "Y.Q", values_to = "num_reported",  values_ptypes = list(test2=character()))

#Remove first character in Y.Q variables and set to clean_date
clean_date<-str_remove(state_long$Y.Q, "^(.)")

#Set Y.Q values to clean_date values
state_long$Y.Q<-clean_date

#Remove comma from num_reported values
state_long$num_reported <- gsub(",","",state_long$num_reported) 

#Convert num_reported values to numeric
state_long$num_reported <- as.numeric(state_long$num_reported)

#Convert Y.Q format to years and quarter
state_long$Y.Q<-as.yearqtr(state_long$Y.Q, format = "%Y.Q%q")

top_10_states<-state_long %>%
  group_by(State.Name) %>%
  summarise(total_reported = sum(num_reported)) %>%
  arrange(desc(total_reported))%>%
  head(10)

Clean County Data

#Convert variables from numeric to characters
county$X2006.Q1<-as.character(county$X2006.Q1)
county$X2006.Q2<-as.character(county$X2006.Q2)
county$X2006.Q3<-as.character(county$X2006.Q3)
county$X2006.Q4<-as.character(county$X2006.Q4)
county$X2011.Q1<-as.character(county$X2011.Q1)
county$X2011.Q2<-as.character(county$X2011.Q2)
county$X2011.Q3<-as.character(county$X2011.Q3)
county$X2011.Q4<-as.character(county$X2011.Q4)
county$X2012.Q1<-as.character(county$X2012.Q1)
county$X2012.Q2<-as.character(county$X2012.Q2)

#Convert format to long
county_long<-county %>%
  pivot_longer(4:29, names_to = "Y.Q", values_to = "num_reported")

#Remove first character in Y.Q variables and set to clean_date2
clean_date2<-str_remove(county_long$Y.Q, "^(.)")

#Set Y.Q values to clean_date2 values
county_long$Y.Q<-clean_date2

#Remove comma from num_reported values
county_long$num_reported <- gsub(",","",county_long$num_reported) 

#Convert num_reported values to numeric
county_long$num_reported <- as.numeric(county_long$num_reported)

#Convert Y.Q format to years and quarter
county_long$Y.Q<-as.yearqtr(county_long$Y.Q, format = "%Y.Q%q")

Add Census Data to County Data

#Convert 2005 to 2009 census table from wide to long format
census_county_long<-census_county %>%
  pivot_longer(3:12, names_to = "year", values_to = "population")


#Remove the period at the start of all the county names
census_county_long$county<-str_remove(census_county_long$county, "^[.]")

#Substitute comma in the population values an empty string to remove the comma
census_county_long$population <- gsub(",","",census_county_long$population)

#Remove the first character in the year variable to only keep the numbers
census_county_long$year<-str_remove(census_county_long$year, "^(.)")

#Convert population values from character to numeric
census_county_long$population <- as.numeric(census_county_long$population)

#Remove white space at the end of county names
census_county_long$county<-str_remove(census_county_long$county, "\\s$")

#Convert 2010 to 2012 census table from wide to long format
census_county_long2<-census_county2 %>%
  pivot_longer(3:5, names_to = "year", values_to = "population")

#Remove the first character in the year variable to only keep the numbers
census_county_long2$year<-str_remove(census_county_long2$year, "^(.)")

#Convert population variable from character to numeric
census_county_long2$population <- as.numeric(census_county_long2$population)

#Extract the four digit year value from the Y.Q variable
state_year2<-str_extract(county_long$Y.Q, "\\d\\d\\d\\d")

#Join county and 2005 to 2009 census table
county_census_long1<- county_long %>%
  #Create a year variable
  mutate(year = state_year2)

#Change year for Q1 to the correct calender year
county_census_long1_1<-county_census_long1%>%
  #filter observations other than first quarter
  filter(grepl("Q1", Y.Q))%>%
  #convert year to numeric subtract 1 to match calender year
  transmute(County=County, State = State,County.FIPS = County.FIPS, Y.Q = Y.Q, num_reported = num_reported, year = as.numeric(year) - 1)
 
#filter out first quarter observations 
county_census_long1_2<-county_census_long1%>%
  filter(!grepl("Q1", Y.Q))

#Convert year from character to numeric
county_census_long1_2$year<-as.numeric(county_census_long1_2$year)

#Reunite tables into one
county_census_long1<-county_census_long1_1 %>%
  #join rows
  rbind(county_census_long1_2)
  
#Convert year from character to numeric to conduct join
census_county_long$year<-as.numeric(census_county_long$year)

county_census_long3<-county_census_long1%>%
  #join 2005 to 2009 census information the county table
  left_join(census_county_long, by = c("State"="state","County"="county","year"="year"))%>%
  #Filter out observations from 2010 to 2012 that have no population information
  filter(!is.na(population))%>%
  filter(!grepl("2010 Q1",Y.Q))

#Join county and 2010 to 2012 census table
county_census_long2<- county_long %>%
  #Create a year variable
  mutate(year = state_year2)%>%
   #Join 2010 to 2012 census information the county table
  left_join(census_county_long2, by = c("State"="state","County"="county","year"="year"))%>%
  #Filter out observations from 2005 to 2009 that have no population information
  filter(!is.na(population))

#Convert year from character to numeric 
county_census_long2$year<-as.numeric(county_census_long2$year)

#create table by combining the two previous census tables
county_census_full<-county_census_long3 %>%
  #join the tables by rows
  rbind(county_census_long2)%>%
  #create variable of subjects per_capita per 1,000,000 residents
  mutate(per_capita= (num_reported/population)*1000000)%>%
  arrange(State,County, Y.Q)%>%
  group_by(County, State)%>%
  #create variable of the population growth rate of each county
  mutate(pop_growth_rate = ((population - lag(population))/lag(population))*100)

Add Census Data to State Data

# Reload .Renviron
readRenviron("~/.Renviron")


#load census API
library(censusapi)

## Warning: package 'censusapi' was built under R version 4.0.5

## 
## Attaching package: 'censusapi'

## The following object is masked from 'package:methods':
## 
##     getFunction

#Get population information from 2000/pep/int_population
census_years <- getCensus(
    name = "2000/pep/int_population",
    vars = c("POP", 
"DATE_DESC"),
    region = "state:*")


#Extract the four digit year
year<-str_extract(census_years$DATE_DESC, "\\d\\d\\d\\d")

#create a table with census population date to join
census<-census_years%>% 
  mutate(years = year)%>%
  filter(!grepl("base",DATE_DESC))%>%
  dplyr::select(!DATE_DESC)

#Convert state data from character to numeric
census$state<-as.numeric(census$state)


state_long_join<-state_long %>%
  left_join(state_FIP, by = c("State.Name" = "Name"))

#Extract four digit year
state_long_join$years<-str_extract(state_long_join$Y.Q, "\\d\\d\\d\\d")
  
#join census data with state data
state_census<-state_long_join%>%
  left_join(census, by = c("StateFIPS" = "state", "years"))

#subset census table
census_11_12<-census_state[2:10]

#Prepare table to join 2011 and 2012 census data
census_11_12<-census_11_12%>%
  dplyr::select(NAME,POPESTIMATE2011, POPESTIMATE2012) %>%
  #Rename variables
  rename("2011" = "POPESTIMATE2011", "2012" = "POPESTIMATE2012", "State.Name" = "NAME") %>%
  #convert table to long format
  pivot_longer(2:3, names_to = "years", values_to = "POP")

#Join census data with state data
state_census_2<-state_census%>%
  dplyr::select(!POP) %>%
  left_join(census_11_12, by = c("State.Name", "years"))%>%
  #filter for data from year 2011 and 2012
  filter(years %in% c("2011", "2012"))

#Combine tables
state_census<-state_census%>%
  filter(!years %in% c("2011", "2012"))%>%
  rbind(state_census_2)

#Convert region from numeric to factor
state_census$Region<-as.factor(state_census$Region)

Region Distribution

Utilizing the state data, a regional distribution of reported subjects portrays that a higher portion of subjects are located in the Southern region of the United States. Subjects identified in the Southern region make up a third of the total reported subjects. The median level of the of the Southern boxplot is lower than the median of the Midwest and West regions. This means that the fifth percentile of the number of values in the Southern distribution are lower than those regions. The third level of the Southern region is higher than other regions which shows that states in Southern counties have higher number of reported subjects.

Ten states were found to have values which were outliers in their respective regions. In the Northeast region, New York and New Jersey had 20 and 2 outlier values, respectively, in their region. In the Midwest region, Illinois, Michigan, and Minnesota had 30, 9, and 3 outlier values, respectively, in their region. In the Southern region, Florida, Georgia, and Texas had 25, 2, and 2 outlier values respectively in their region. Finally, in the West region, Arizona and California had 9 and 26 outliers values, respectively, in their region.

#create boxplot to show distribution of of subjects by region
ggplot(state_census, aes(y= num_reported, x=Region, group = Region, fill = Region))+
  #create boxplot
  geom_boxplot()+
  #Label x axis
  xlab("Region")+
  #Label title
ggtitle("Subjects Reported Distribution by Region")+
  #Label y axis
  ylab("Number of Subjects Reported")+
  #Set theme
  theme_classic()+
  #Change name of legend values
  scale_fill_discrete(name = "Regions", labels = c("North East", "Midwest", "South", "West"))

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

#Region 1 outlier (429-43)*1.5+429
state_census_region_1<-state_census %>%
  filter(Region == "1")%>%
  filter(num_reported>1137)%>%
  count(State.Name)

#Region 2 outlier (516.50-42.75)*1.5+516.50
state_census_region_2<-state_census %>%
  filter(Region == "2")%>%
  filter(num_reported>1008)%>%
  count(State.Name)

#Region 3 outlier (578-69)*1.5+578-69
state_census_region_3<-state_census %>%
  filter(Region == "3")%>%
  filter(num_reported>1272.5)%>%
  count(State.Name)

#Region 4 outlier (494.75-40.50)*1.5+494.75
state_census_region_4<-state_census %>%
  filter(Region == "4")%>%
  filter(num_reported>1176.125)%>%
  count(State.Name)

#Total subjects reported by region
state_census %>%
  count(Region)

## # A tibble: 4 x 2
##   Region     n
##   <fct>  <int>
## 1 1        234
## 2 2        312
## 3 3        442
## 4 4        338

Map

Utilizing the county data, the map provides a portrayal of the counties which have had a possible case of mortgage fraud from fiscal year 2006 to 2012. The map uses a gradient of red to display the average number of reported mortgage fraud suspects per capita (per 1,000,000 residents). The majority of counties with suspected subjects are located in the Northest region of the United States. The states identified with outlier values in the bloxplot graph all have hot spots signified by a darker shade of red. The two major hot spots for suspects are Clark county in Nevada and the southern counties in Florida.

#set new work environment
setwd("C:/Users/mivul/OneDrive/Desktop/Data 110/Datasets/GIS")

#Set shapefile to usgeo variable
usgeo <- shapefile("cb_2014_us_county_5m/cb_2014_us_county_5m.shp")

## Warning in rgdal::readOGR(dirname(x), fn, stringsAsFactors = stringsAsFactors, :
## Z-dimension discarded

#Create table with average values for population, per capita, and growth rate for county table
whole<-county_census_full%>%
  group_by(County, State, County.FIPS )%>%
  #Calculate averages
  summarise(Total_subjects_reported = sum(num_reported), Avg_pop = mean(population),Avg_per_capita = mean(per_capita), Avg_growth_rate = mean(pop_growth_rate), na.rm = TRUE, .groups = "keep")
 
#Order usgeo by GEOID
usgeo <- usgeo[order(usgeo$GEOID),]
#Order usgeo by County FIPS code
whole <- whole[order(whole$County.FIPS),]

# Merge geo and whole datasets
if (identical(usgeo$GEOID,whole$County.FIPS)) {
usmap <- merge(usgeo, whole, by.x = "GEOID", by.y = "County.FIPS")  
} else {stop}

## function (..., call. = TRUE, domain = NULL) 
## {
##     args <- list(...)
##     if (length(args) == 1L && inherits(args[[1L]], "condition")) {
##         cond <- args[[1L]]
##         if (nargs() > 1L) 
##             warning("additional arguments ignored in stop()")
##         message <- conditionMessage(cond)
##         call <- conditionCall(cond)
##         .Internal(.signalCondition(cond, message, call))
##         .Internal(.dfltStop(message, call))
##     }
##     else .Internal(stop(call., .makeMessage(..., domain = domain)))
## }
## <bytecode: 0x00000000122daf98>
## <environment: namespace:base>

usmap <- merge(usgeo, whole, by.x = "GEOID", by.y = "County.FIPS")


library(scales)

## Warning: package 'scales' was built under R version 4.0.5

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

#Create color pallete
USPalette <- colorNumeric(palette = "Reds", domain=usmap$Avg_per_capita) 

#Create popup information
uspopup <- paste0("State: ", usmap$State,", County: ", usmap$County,
", Number of Subjects Reported ", usmap$Total_subjects_reported, ",Per Capita Per 1 Million: ", usmap$Avg_per_capita)  

usmap_projected <- sp::spTransform(usmap, "+proj=longlat +datum=WGS84")  

#Create Map
leaflet(usmap_projected) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(stroke=FALSE, 
              smoothFactor = 0.2, 
              fillOpacity = .8, 
              popup=uspopup,
              color= ~USPalette(usmap$Avg_per_capita))

Time Series

Utilizing the state data, the national number of reported subjects increased from the first quarter of 2006 through the first quarter of 2008, along with the rise of the housing bubble. The reported numbers had minor fluctuations around the 29,5000 reported subject level until the increase in first quarter of 2011 to 40,734 reported subjects. The peak of the trend was hit on the second quarter of 2011, at 44,604 reported subjects, and subsequently dropped. The original dataset had numbers on new and old subjects reported from fiscal years 2010 through 2012. Financial institutions have windows when they are supposed to report illicit activity, when detected. For example, a New subject committed a case in the first quarter of 2010, and a report was filed in the first quarter of 2010. Old subjects are subjects who possibly committed a case first quarter of 2010 but the report was filed after that quarter. The increase in reported subjects is a result of financial institutions retroactively reporting subjects. This can be seen by the old reported subjects being higher then the new reported subjects. The top ten states with mortgage fraud suspects are Arizona, Georgia, Minnesota, California, Illinois, New Jersey, Florida, Michigan, and New York. California had the highest number of reported subjects followed by Florida. The peak in the second quarter of 2011 was primarily caused by the discovery of suspect in these two states.

#Create Time series graph to display the trend of total subjects reported
state_long %>%
  group_by(Y.Q)%>%
  summarise(Total_num_reported = sum(num_reported, na.rm = TRUE), .groups = "keep")%>%
  #plot line chart
hchart(
    "line", 
    hcaes(x = Y.Q, y = Total_num_reported))%>%
  #Label x axis
  hc_xAxis(title = list(text = "Fiscal Year")) %>%
  #Label y axis
  hc_yAxis(title = list(text = "Number of Subjects Reported"))%>%
  #Label Title
  hc_title(text = "National Reported Numbers")

#Create a bar chart to display the top 10 states
top_10_states %>%
  #create plot
  ggplot(aes(x=State.Name, y = total_reported, fill = State.Name))+
  #Add bar plot layer
  geom_bar(stat = "identity")+
  #Flip x and y axis
  coord_flip()+
   #Label y axis
  ylab("Number of Subjects Reported")+
  #Label x axis
  xlab("State")+
  #Label Title
  ggtitle("Top 10 States with Mortgage Fraud")+
  #set Theme
  theme_classic()+
  #Remove legend
  theme(legend.position = "none")+
  #Set color palette
   scale_fill_brewer(palette = "Spectral")

#Create Time series graph to display the trend of  subjects reported in top 10 states
state_long %>%
  semi_join(top_10_states, by ="State.Name")%>%
  #plot line chart
  hchart(
    "line", 
    hcaes(x = Y.Q, y = num_reported, group = State.Name))%>%
  #Label x axis
  hc_xAxis(title = list(text = "Fiscal Year")) %>%
   #Label y axis
  hc_yAxis(title = list(text = "Number of Subjects Reported"))%>%
   #Label Title
  hc_title(text = "Top 10 States Reported Numbers")

Data Analysis

Utilizing the county data, the scatter plot displays that the number of reported subjects and the size of the county population have a positive correlation. The correlation coefficient between the number subjects reported and the population is 0.837754. The correlation coefficient between the number subjects reported and calender year is 0.05250068. This means that the number of subjects reported and the calender year have a weak correlation. The correlation coefficient between the calender year and the population is 0.009998235. The weak correlation between the two variable means they can be used as independent variables to predict the number of reported subjects.

The multilinear regression model utilizes the population size and calender year as explanatory variables to predict the number of subjects. Approximately 70% of the variation in the number of subjects reported can explained by the variation in the population size and calender year. Approximately 30% variation in the subjects reported cannot be explained by the model.

#remove turn off scientific numbers
options(scipen = 999) 

#Scatter plot of affects of population on number subjects reported 
county_census_full%>%
  #Filter out first quarter variables 
  filter(!grepl( "Q1", Y.Q))%>%
  #create graph
ggplot(aes(x=population, y = num_reported))+
  #Add scatter plot layer
  geom_point()+
  #Add linear regression line layer
  geom_smooth(method = "lm")+
  #Label Y axis
  ylab("Number of Subjects Reported")+
  #Label title
  ggtitle("Number of Subjects v Population")+
  #set theme
  theme_classic()

## `geom_smooth()` using formula 'y ~ x'

#correlation between number of subject reported and calendar year of event
cor(county_census_full$num_reported, county_census_full$year)

## [1] 0.0535074

#correlation between number of subject reported and county population
cor(county_census_full$num_reported, county_census_full$population)

## [1] 0.8371687

#correlation between variables in linear model
cor(county_census_full$year, county_census_full$population)

## [1] 0.01017942

#linear regression model  
model<-lm(num_reported ~ population + year, data = county_census_full)

#view model stats
summary(model)

## 
## Call:
## lm(formula = num_reported ~ population + year, data = county_census_full)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1200.55    -7.53     5.43    13.14  1787.90 
## 
## Coefficients:
##                     Estimate       Std. Error t value            Pr(>|t|)    
## (Intercept) -5172.7147369830   391.6996980041  -13.21 <0.0000000000000002 ***
## population      0.0001866894     0.0000007636  244.50 <0.0000000000000002 ***
## year            2.5637178547     0.1950107347   13.15 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.77 on 25373 degrees of freedom
## Multiple R-squared:  0.7029, Adjusted R-squared:  0.7029 
## F-statistic: 3.001e+04 on 2 and 25373 DF,  p-value: < 0.00000000000000022

#plot model
plot(model)

Utilizing the county data, the histogram of the population distribution for county and state displays a right skew. The median population for county and state was used as the measure of center due to outliers in the histogram. 0.002653192% of the reported subjects were located in states under the median threshold of 24,663,015 estimated residents. 4.111944% of the reported subjects were located in counties under the median threshold of 122,213 estimated residents. The proportion of reported subjects from states and counties below the median thresholds is 0.0001090978%. This signifies a higher proportion of reported subjects are being identified in counties and states above the median population threshold.

#Sum of county population and subjects by state
state_county_full<-county_census_full %>%
    group_by(County,State, year, population)%>%
  summarise(tot = sum(num_reported, na.rm = TRUE))%>%
  ungroup()%>%
  group_by(State)%>%
  summarise(total_population = sum(population),total_subjects = sum(tot, na.rm = TRUE))

## `summarise()` has grouped output by 'County', 'State', 'year'. You can override using the `.groups` argument.

#Histogram of distribution state populations
ggplot(county_census_full, aes(x=population, fill = "blue", color="black"))+
  geom_histogram(bins = 30)+
  ggtitle("County Population Distribution")+
  #Set theme
  theme_classic()+
  #Remove legend
  theme(legend.position = "none")

#Histogram of distribution of state populations
ggplot(state_county_full, aes(x=total_population, fill = "blue",color="black"))+
  geom_histogram(bins = 30)+
  ggtitle("State Population Distribution")+
  #Set theme
  theme_classic()+
  #Remove legend
  theme(legend.position = "none")

#Median populations for State and County 
median(county_census_full$population)

## [1] 122112

median(state_county_full$total_population)

## [1] 28251324

#Table displaying counties below median threshold 
below_county_median<-county_census_full%>%
 filter(population <=122112)

#Table displaying counties above median threshold 
above_county_median<-county_census_full%>%
 filter(population >122112)

#Table displaying states below median threshold 
below_state_median<-state_county_full%>%
 filter(total_population <=28251324)

#Table displaying states above median threshold 
above_state_median<-state_county_full%>%
 filter(total_population >28251324)


#Percent of subjects in counties below median threshold
sum(below_county_median$num_reported)/sum(county_census_full$num_reported)*100

## [1] 4.111944

#Percent of subjects in states below median threshold
sum(below_state_median$total_subjects)/sum(state_county_full$total_population)*100

## [1] 0.002653192

#Percent of subjects in counties and states below  median threshold
(sum(below_county_median$num_reported)/sum(county_census_full$num_reported))*(sum(below_state_median$total_subjects)/sum(state_county_full$total_population))*100

## [1] 0.0001090978

Conculsion

I can conclude by my analysis of the possible mortgage fraud data that states and counties with higher than the median threshold population had a majority of the reported subjects. The multilinear regression model showed population size and calendar year are good predictive variables of for the number of reported subjects. The top states identified for possible mortgage fraud were Arizona, Georgia, Minnesota, California, Illinois, New Jersey, Florida, Michigan, and New York. California and Florida had the highest number of reported subjects by state. According to the state data, a third of the reported subjects were located in the Southern region of the country. County data revealed activity in the south appeared to be concentrated in the southern counties of Florida. Miami-Dade county had the highest number of subjects per capita of one million residents. Another hot spot identified in the county data was in the Western Region is Clark county in Nevada. Suspicious activity reporting by financial institutions is an important tool in order to identify illicit financial activity.

References

FinCEN Mortgage Fraud SAR Datasets. FinCEN Mortgage Fraud SAR Datasets | FinCEN.gov. (n.d.). https://www.fincen.gov/fincen-mortgage-fraud-sar-datasets.

FBI. (2016, May 3). Financial Institution/Mortgage Fraud. FBI. https://www.fbi.gov/investigate/white-collar-crime/mortgage-fraud.

Index of /programs-surveys/popest/tables/2000-2010/intercensal/county. (n.d.). https://www2.census.gov/programs-surveys/popest/tables/2000-2010/intercensal/county/.

Data 101 Final

Michael Vasquez

5/11/2021