title: “Data 607 Fall 2019 - Final Project: Medical care provider fraud detection tool”
author: “Sufian”
date: “11/14/2019”
output: html_document
The purpose of this project is two-fold:
have occured and a flag is thrown up for further investigation.
As an example: If a provider claim to be general practice but the model predicts it was, a cardiologist
provider, then possible medical fraud could’ve occured because procedures performed were not
consistent with actual provider type. The reason this could result in a fraud is because providers
might then up-charged medicare for these procedures or worse, made multiple lower-charges
fraudelently. Either of which costs tax payers money and a financial drain to the entire Medicare
system as a whole over the long run.
The prediction hypothesis is as follows (2-tail test):
charged amounts along with the number of times claims that were submitted and compared it to the
Standardized Medicare Average for that provider type. This extra 2nd step is for us to gauge
numerically the impact of fraudelent charges (if any) relative to average normal standardized charges.
In addition, this check allows us to see if the Office of Inspector General’s (OIG) exclusion list
were effective in catching thses “bad doctors” that have commited frauds in the past. This extra
after-the-fact exercise is simply to check the predictive and accuracy of power of this tool.
https://oig.hhs.gov/exclusions/authorities.asp
library(dplyr)
library("tidyverse")
require(data.table)
#Reading in the Medicare database (Currently, this huge database resides in my local computer but for #the final project, it will be read from a public URL)
# This is solely for the initial proposal only
x <- fread("C:/Desk top Stuff/CUNY Fall 2019 Folder/Data Acquisition/Project Final/Final Project Files Folder/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv") %>% as_tibble()
#Peeking at the Medicare data
head(x,n=3)
## # A tibble: 3 x 26
## `National Provi~ `Last Name/Orga~ `First Name of ~ `Middle Initial~
## <int> <chr> <chr> <chr>
## 1 1003000126 ENKESHAFI ARDALAN ""
## 2 1003000126 ENKESHAFI ARDALAN ""
## 3 1003000126 ENKESHAFI ARDALAN ""
## # ... with 22 more variables: `Credentials of the Provider` <chr>, `Gender
## # of the Provider` <chr>, `Entity Type of the Provider` <chr>, `Street
## # Address 1 of the Provider` <chr>, `Street Address 2 of the
## # Provider` <chr>, `City of the Provider` <chr>, `Zip Code of the
## # Provider` <chr>, `State Code of the Provider` <chr>, `Country Code of
## # the Provider` <chr>, `Provider Type` <chr>, `Medicare Participation
## # Indicator` <chr>, `Place of Service` <chr>, `HCPCS Code` <chr>, `HCPCS
## # Description` <chr>, `HCPCS Drug Indicator` <chr>, `Number of
## # Services` <dbl>, `Number of Medicare Beneficiaries` <int>, `Number of
## # Distinct Medicare Beneficiary/Per Day Services` <int>, `Average
## # Medicare Allowed Amount` <dbl>, `Average Submitted Charge
## # Amount` <dbl>, `Average Medicare Payment Amount` <dbl>, `Average
## # Medicare Standardized Amount` <dbl>
# Reading & Loading data base where physicians have a fraudelent past
url <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/ELIE.csv'
df <- read.csv(file = url ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)
head(df)
## LASTNAME FIRSTNAME MIDNAME BUSNAME
## 1 <NA> <NA> <NA> 14 LAWRENCE AVE PHARMACY
## 2 <NA> <NA> <NA> 143 MEDICAL EQUIPMENT CO
## 3 <NA> <NA> <NA> 184TH STREET PHARMACY CORP
## 4 <NA> <NA> <NA> 1951 FLATBUSH AVENUE PHARMACY
## 5 <NA> <NA> <NA> 1ST COMMUNITY HEALTH CTR, LTD
## 6 <NA> <NA> <NA> 1ST REHABILITATION OF PORT ST
## GENERAL SPECIALTY UPIN NPI DOB
## 1 PHARMACY <NA> <NA> 0 NA
## 2 DME COMPANY DME - OXYGEN <NA> 0 NA
## 3 OTHER BUSINESS PHARMACY <NA> 1922348218 NA
## 4 PHARMACY <NA> <NA> 0 NA
## 5 CLINIC <NA> <NA> 0 NA
## 6 MANAGEMENT SVCS CO <NA> <NA> 0 NA
## ADDRESS CITY STATE ZIP EXCLTYPE EXCLDATE REINDATE
## 1 14 LAWRENCE AVENUE SMITHTOWN NY 11787 1128a1 19880830 0
## 2 701 NW 36 AVENUE MIAMI FL 33125 1128b7 19970620 0
## 3 69 E 184TH ST BRONX NY 10468 1128a1 20180419 0
## 4 1951 FLATBUSH AVE BROOKLYN NY 11234 1128b5 20090319 0
## 5 3138 W CERMAK ROAD CHICAGO IL 60623 1128a1 19940524 0
## 6 C/O 3659 MAGUIRE BLVD ORLANDO FL 32803 1128b8 20020919 0
## WAIVERDATE WVRSTATE
## 1 0 <NA>
## 2 0 <NA>
## 3 0 <NA>
## 4 0 <NA>
## 5 0 <NA>
## 6 0 <NA>
It is a 9847443 X 26 dataframe. Breifly, the columns are information regarding the medical providers,
locations of practices, National provider ID, Procedure codes, Text descriptions of procedures (HCPCS),
standard, actual procedural codes or HCPCS codes, average medicare charges and its related average
charges for the procedure types just to name a few.
procedure types for the same physician.
of Inspector General (OIG). This data set is also called the exclusion list. Its where the “bad doctor”
with have fraudlent past is listed along with the types of fraud they have commited.
column. This is the column that describe the medical procedure that actually took place.
there were any extreme outliers, skewness and other typical statistical checks to understand the data
Geographic areas
medical practice types: groups = provider types
Physicians’ gender
ANOVA on scores
Correlation Matrix (This might be challenging to visualize given the high no. of variables)
improve accuracy level if needed.
our academic purpose)
smaller branches)
Fig 1: Data Work Flow