title: “Data 607 Fall 2019 - Final Project: Medical care provider fraud detection tool”

author: “Sufian”

date: “11/14/2019”

output: html_document

Project Statement


The purpose of this project is two-fold:

  1. To predict the type of medical providers based on the type and number of procedures performed:

have occured and a flag is thrown up for further investigation.

Why is this Important?

As an example: If a provider claim to be general practice but the model predicts it was, a cardiologist

provider, then possible medical fraud could’ve occured because procedures performed were not

consistent with actual provider type. The reason this could result in a fraud is because providers

might then up-charged medicare for these procedures or worse, made multiple lower-charges

fraudelently. Either of which costs tax payers money and a financial drain to the entire Medicare

system as a whole over the long run.

The prediction hypothesis is as follows (2-tail test):

Ho: there is no difference between actual and predicted provider types

Ha: there is a difference (not equal to) between actual and predicted provider types


  1. Of the ones that were predicted to be different from actual provider types, I will check the average

charged amounts along with the number of times claims that were submitted and compared it to the

Standardized Medicare Average for that provider type. This extra 2nd step is for us to gauge

numerically the impact of fraudelent charges (if any) relative to average normal standardized charges.

In addition, this check allows us to see if the Office of Inspector General’s (OIG) exclusion list

were effective in catching thses “bad doctors” that have commited frauds in the past. This extra

after-the-fact exercise is simply to check the predictive and accuracy of power of this tool.


Data Sources:

  1. Data.CMS.gov: Medicare Provider Utilization and Payment Data:

https://data.cms.gov/Medicare-Physician-Supplier/Medicare-Provider-Utilization-and-Payment-Data-Phy/fs4p-t5eq/data

  1. US department of Health and Human services: Office of Inspector General (OIG), Exclusion Program

https://oig.hhs.gov/exclusions/authorities.asp


library(dplyr)
library("tidyverse")
require(data.table)
#Reading in the Medicare database (Currently, this huge database resides in my local computer but for #the final project, it will be read from a public URL)

# This is solely for the initial proposal only
x <- fread("C:/Desk top Stuff/CUNY Fall 2019 Folder/Data Acquisition/Project Final/Final Project Files Folder/Medicare_Provider_Utilization_and_Payment_Data__Physician_and_Other_Supplier_PUF_CY2017.csv") %>% as_tibble()

#Peeking at the Medicare data
head(x,n=3)
## # A tibble: 3 x 26
##   `National Provi~ `Last Name/Orga~ `First Name of ~ `Middle Initial~
##              <int> <chr>            <chr>            <chr>           
## 1       1003000126 ENKESHAFI        ARDALAN          ""              
## 2       1003000126 ENKESHAFI        ARDALAN          ""              
## 3       1003000126 ENKESHAFI        ARDALAN          ""              
## # ... with 22 more variables: `Credentials of the Provider` <chr>, `Gender
## #   of the Provider` <chr>, `Entity Type of the Provider` <chr>, `Street
## #   Address 1 of the Provider` <chr>, `Street Address 2 of the
## #   Provider` <chr>, `City of the Provider` <chr>, `Zip Code of the
## #   Provider` <chr>, `State Code of the Provider` <chr>, `Country Code of
## #   the Provider` <chr>, `Provider Type` <chr>, `Medicare Participation
## #   Indicator` <chr>, `Place of Service` <chr>, `HCPCS Code` <chr>, `HCPCS
## #   Description` <chr>, `HCPCS Drug Indicator` <chr>, `Number of
## #   Services` <dbl>, `Number of Medicare Beneficiaries` <int>, `Number of
## #   Distinct Medicare Beneficiary/Per Day Services` <int>, `Average
## #   Medicare Allowed Amount` <dbl>, `Average Submitted Charge
## #   Amount` <dbl>, `Average Medicare Payment Amount` <dbl>, `Average
## #   Medicare Standardized Amount` <dbl>
# Reading & Loading data base where physicians have a fraudelent past 

url <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/ELIE.csv'

df <- read.csv(file = url ,sep = ",", na.strings = c("NA", " ", ""), strip.white = TRUE, stringsAsFactors = F,header=T)
head(df)
##   LASTNAME FIRSTNAME MIDNAME                       BUSNAME
## 1     <NA>      <NA>    <NA>      14 LAWRENCE AVE PHARMACY
## 2     <NA>      <NA>    <NA>      143 MEDICAL EQUIPMENT CO
## 3     <NA>      <NA>    <NA>    184TH STREET PHARMACY CORP
## 4     <NA>      <NA>    <NA> 1951 FLATBUSH AVENUE PHARMACY
## 5     <NA>      <NA>    <NA> 1ST COMMUNITY HEALTH CTR, LTD
## 6     <NA>      <NA>    <NA> 1ST REHABILITATION OF PORT ST
##              GENERAL    SPECIALTY UPIN        NPI DOB
## 1           PHARMACY         <NA> <NA>          0  NA
## 2        DME COMPANY DME - OXYGEN <NA>          0  NA
## 3     OTHER BUSINESS     PHARMACY <NA> 1922348218  NA
## 4           PHARMACY         <NA> <NA>          0  NA
## 5             CLINIC         <NA> <NA>          0  NA
## 6 MANAGEMENT SVCS CO         <NA> <NA>          0  NA
##                 ADDRESS      CITY STATE   ZIP EXCLTYPE EXCLDATE REINDATE
## 1    14 LAWRENCE AVENUE SMITHTOWN    NY 11787   1128a1 19880830        0
## 2      701 NW 36 AVENUE     MIAMI    FL 33125   1128b7 19970620        0
## 3         69 E 184TH ST     BRONX    NY 10468   1128a1 20180419        0
## 4     1951 FLATBUSH AVE  BROOKLYN    NY 11234   1128b5 20090319        0
## 5    3138 W CERMAK ROAD   CHICAGO    IL 60623   1128a1 19940524        0
## 6 C/O 3659 MAGUIRE BLVD   ORLANDO    FL 32803   1128b8 20020919        0
##   WAIVERDATE WVRSTATE
## 1          0     <NA>
## 2          0     <NA>
## 3          0     <NA>
## 4          0     <NA>
## 5          0     <NA>
## 6          0     <NA>

The Data Sets

Dataset No. 1

  1. The Medicare Provide Utilization & Payment Data Provides the majority of the data for this project.

It is a 9847443 X 26 dataframe. Breifly, the columns are information regarding the medical providers,

locations of practices, National provider ID, Procedure codes, Text descriptions of procedures (HCPCS),

standard, actual procedural codes or HCPCS codes, average medicare charges and its related average

charges for the procedure types just to name a few.

  1. Each row, contains medical procedures provided by the provider. So it can contain multiple

procedure types for the same physician.

Dataset No. 2

  1. US department of Health and Human services’ exclusion list. This is database is run by the Office

of Inspector General (OIG). This data set is also called the exclusion list. Its where the “bad doctor”

with have fraudlent past is listed along with the types of fraud they have commited.


Exploratory Data Exercise

column. This is the column that describe the medical procedure that actually took place.

Statistical Analysis

there were any extreme outliers, skewness and other typical statistical checks to understand the data

  1. Geographic areas

  2. medical practice types: groups = provider types

  3. Physicians’ gender

Statistical Results

  1. ANOVA on scores

  2. Correlation Matrix (This might be challenging to visualize given the high no. of variables)

Predictive Analystics

improve accuracy level if needed.

our academic purpose)

smaller branches)


Process Work Flow

Fig 1: Data Work Flow

Fig 1: Data Work Flow