This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.


Setup

# import packages
setwd("~/R/TUM Business Analytics/09 Analytics Cup")
library(tidyverse)
library(readr)
library(data.table)
library(tibble)
options(dplyr.width = Inf)
theme_set(theme_minimal())

Data import

# import raw data
physicians_raw = as_tibble(read.csv('data/physicians.csv'))  # 6000 x 17
payments_raw = as_tibble(read.csv('data/payments.csv'))      # 1402250 x 29
companies_raw = as_tibble(read.csv('data/companies.csv'))    # 2431 x 4

Data preprocessing

From payments_raw to payments: Change attributes to correct variable type and drop unnecessary columns.

# prepare payments data: change attribute type and drop unnecessary columns
payments = payments_raw %>% mutate(Date=as.Date(Date, format='%d/%m/%Y'))
payments = payments %>% mutate_at(vars(Form_of_Payment_or_Transfer_of_Value,Nature_of_Payment_or_Transfer_of_Value,
                                       City_of_Travel,State_of_Travel,Country_of_Travel,
                                       Ownership_Indicator,Third_Party_Recipient,Charity,Third_Party_Covered,
                                       Related_Product_Indicator,Product_Type_1,Product_Type_2,Product_Type_3,
                                       Product_Category_1,Product_Category_2,Product_Category_3),as.factor)
payments = payments %>% select(-City_of_Travel, -State_of_Travel, -Country_of_Travel,
                               -Charity, -Third_Party_Covered, -Contextual_Information,
                               -Product_Code_1, -Product_Code_2, -Product_Code_3,
                               -Product_Name_1, -Product_Name_2, -Product_Name_3)

From payments to payments_train: Filter out all transactions related to physicians from training set.

# prepare payments train: payments associated with training set physicians
payments_train = payments %>% merge(physicians_raw %>% filter(set == 'train') %>% select(id), by.x='Physician_ID', by.y='id')

Refine payments_train: Column Ownership_Indicator no longer refers to the payment but the physician.It tells whether the physician has an ownership interest, i.e. if she has at least one payment labeled as ownership interest. Whether the actual payment was labeled as ownership interest can be seen in the (new) column Ownership_Indicator_Payment.

physicians_id_fraud = payments_train %>% filter(Ownership_Indicator == "Yes") %>% select(Physician_ID) %>% distinct()
payments_train = payments_train %>% mutate(Physician_Ownership_Interest = case_when(Physician_ID %in% physicians_id_fraud$Physician_ID ~ "Yes",TRUE ~ "No"))
payments_train = payments_train %>% rename(Ownership_Indicator_Payment = Ownership_Indicator)
payments_train = payments_train %>% rename(Ownership_Indicator = Physician_Ownership_Interest)
payments_train = payments_train %>% relocate(Ownership_Indicator, .after = Physician_ID)
payments_train = payments_train %>% relocate(Ownership_Indicator_Payment, .after = Record_ID)
payments_train = payments_train %>% filter(Ownership_Indicator_Payment == "No")


Presenting payments_train:

 [1] "Physician_ID"                           "Ownership_Indicator"                   
 [3] "Record_ID"                              "Ownership_Indicator_Payment"           
 [5] "Company_ID"                             "Total_Amount_of_Payment_USDollars"     
 [7] "Date"                                   "Number_of_Payments"                    
 [9] "Form_of_Payment_or_Transfer_of_Value"   "Nature_of_Payment_or_Transfer_of_Value"
[11] "Third_Party_Recipient"                  "Related_Product_Indicator"             
[13] "Product_Type_1"                         "Product_Type_2"                        
[15] "Product_Type_3"                         "Product_Category_1"                    
[17] "Product_Category_2"                     "Product_Category_3"                    

Exploratory analysis

Procedure:
1. Group payments_train by Ownership_Indicator and compare different attributes.
2. Create additional feature from payments, if the attribute helps to distinguish between ownership interest.

Amount of Transactions per Ownership_Indicator

As we can see, 116,983 payments belong to physicians, which have an ownership interest (through other payments not in the data set).


Total amounts of payments

Total amount of payments is way higher for ownership interest. Create additional feature avg_total_amount_of_payments with high values signaling ownership interest.


Number of payment rates per payment

Average number of payments is slightly higher for ownership interest. Not sure if the difference is enough for an additional feature.


Form of Payment

Except for Cash and in-kind items and services, all other forms are 100% associated with ownership interest.
Create binary variable that signals, if Form_of_Payment_or_Transfer_of_Value does not contain ‘cash’ or ‘items’.


Nature of Payment

Not sure if binary features are justified here. We need to exclude attributes which do not help us to distinguish.


We need to exclude further attributes, which do not help to distinguish.


‘Education’ is associated with no ownership interest. ‘Current or prospective…’ is (obviously) associated with ownership interest. Not sure about the rest.


Third Party Recipient

Third party recipients - entities or individuals - are associated with ownership interest. Create binary feature third_party which signals ownership interest.


Product Type (1)

‘Biological’ and ‘Device or Medical Supply’ are associated with ownership interest. Not sure if enough evidence to construct a binary feature.



Physician Specialties

Pull out Primary_Specialty, factor in three different columns and join with physicians:

primary_specialties = physicians %>% select(id, Primary_Specialty)
primary_specialties = primary_specialties %>% separate(Primary_Specialty, into = c("First_Specialty","Second_Specialty","Third_Specialty"), sep = "\\|")

physicians = physicians %>% merge(primary_specialties)
physicians = physicians %>% relocate(First_Specialty,Second_Specialty,Third_Specialty,.before=Primary_Specialty)

We go from this:

To this:


Examine specialties


Examine absolute amount of first specialties


Lets examine share of second specialties based on first specialty


Lets look at first specialites apart from Allopathic & Osteopathic.


Merge specialties per physician with payments

payments_train_specialty = payments_train %>% inner_join(physicians %>% select(id,First_Specialty,Second_Specialty,Third_Specialty),by=c("Physician_ID" = "id"))

payments_train_specialty = payments_train_specialty %>% relocate(First_Specialty,Second_Specialty,Third_Specialty,.after=Physician_ID)


Examine relationship of ownership interest and first specialty.


Remove Allopathic & Osteopathic Physicians to get a closer view


Lets look at this numerically:

                                                
                                                     No    Yes
  Allopathic & Osteopathic Physicians            95.484 97.295
  Chiropractic Providers                          0.002  0.187
  Dental Providers                                1.246  1.307
  Eye and Vision Services Providers               2.160  0.652
  Podiatric Medicine & Surgery Service Providers  1.108  0.558


Lets have a final look into second specialties - nothing interesting here.


The recipe

??recipes
library(recipes)

We need to specify one outcome and n predictors. That means we need to have one data table in tidy format, i.e. the physicians. Each row represents one physician (ID) and each column represents one feature (which we engineer from payments and companies).

Lets make a mock data frame i_physicians:

Fraud = c(0,0,1,0,0)
Physician_ID = c(1,2,3,4,5)
License_State = c("NY","CA","TX","NY","CA")
Primary_Specialty = c("Allopathic","Radiology","Osteopathic","Physician","Cardiology")
Total_amount_payments = c(9945,2210,34250,12050,1256)
Total_number_of_payments = c(5,12,45,22,8)
i_physicians = data.frame(Fraud,Physician_ID,License_State,Primary_Specialty,Total_amount_payments,Total_number_of_payments)
i_physicians

Lets specify the outcome and the predictors:

rec_obj = recipe(Fraud ~ ., data = i_physicians) %>% update_role(Physician_ID, new_role = "ID")
rec_obj
Data Recipe

Inputs:

Convert categorical predictors into numeric dummy variables:

ind_obj = rec_obj %>% step_dummy(all_predictors(), -all_numeric())
ind_obj
Data Recipe

Inputs:

Operations:

Dummy variables from all_predictors(), -all_numeric()

As all variables are now numeric, we can center and scale them:

stand_obj = ind_obj %>% step_center(all_predictors()) %>% step_scale(all_predictors())
stand_obj
Data Recipe

Inputs:

Operations:

Dummy variables from all_predictors(), -all_numeric()
Centering for all_predictors()
Scaling for all_predictors()

Estimate means and sd from training set:

trained_rec = prep(stand_obj, training = i_physicians)
trained_rec
Data Recipe

Inputs:

Training data contained 5 data points and no missing data.

Operations:

Dummy variables from License_State, Primary_Specialty [trained]
Centering for Total_amount_payments, ... [trained]
Scaling for Total_amount_payments, ... [trained]

Now, the preprocessing can be applied to the actual training (and test) set:

i_train = bake(trained_rec, new_data = i_physicians)
#j_test = bake(trained_rec, new_data = j_physicians)
i_train
---
title: "Analytics Cup"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

***

### Setup
```{r, message=F}
# import packages
setwd("~/R/TUM Business Analytics/09 Analytics Cup")
library(tidyverse)
library(readr)
library(data.table)
library(tibble)
options(dplyr.width = Inf)
theme_set(theme_minimal())
```

### Data import
```{r}
# import raw data
physicians_raw = as_tibble(read.csv('data/physicians.csv'))  # 6000 x 17
payments_raw = as_tibble(read.csv('data/payments.csv'))      # 1402250 x 29
companies_raw = as_tibble(read.csv('data/companies.csv'))    # 2431 x 4
```

### Data preprocessing
From `payments_raw` to `payments`: Change attributes to correct variable type and drop unnecessary columns.
```{r}
# prepare payments data: change attribute type and drop unnecessary columns
payments = payments_raw %>% mutate(Date=as.Date(Date, format='%d/%m/%Y'))
payments = payments %>% mutate_at(vars(Form_of_Payment_or_Transfer_of_Value,Nature_of_Payment_or_Transfer_of_Value,
                                       City_of_Travel,State_of_Travel,Country_of_Travel,
                                       Ownership_Indicator,Third_Party_Recipient,Charity,Third_Party_Covered,
                                       Related_Product_Indicator,Product_Type_1,Product_Type_2,Product_Type_3,
                                       Product_Category_1,Product_Category_2,Product_Category_3),as.factor)
payments = payments %>% select(-City_of_Travel, -State_of_Travel, -Country_of_Travel,
                               -Charity, -Third_Party_Covered, -Contextual_Information,
                               -Product_Code_1, -Product_Code_2, -Product_Code_3,
                               -Product_Name_1, -Product_Name_2, -Product_Name_3)
```

From `payments` to `payments_train`: Filter out all transactions related to physicians from training set.
```{r}
# prepare payments train: payments associated with training set physicians
payments_train = payments %>% merge(physicians_raw %>% filter(set == 'train') %>% select(id), by.x='Physician_ID', by.y='id')
```

Refine `payments_train`: Column `Ownership_Indicator` no longer refers to the payment but the physician.It tells whether the physician has an ownership interest, i.e. if she has at least one payment labeled as ownership interest. Whether the actual payment was labeled as ownership interest can be seen in the (new) column `Ownership_Indicator_Payment`.
```{r, warning=F}
physicians_id_fraud = payments_train %>% filter(Ownership_Indicator == "Yes") %>% select(Physician_ID) %>% distinct()
payments_train = payments_train %>% mutate(Physician_Ownership_Interest = case_when(Physician_ID %in% physicians_id_fraud$Physician_ID ~ "Yes",TRUE ~ "No"))
payments_train = payments_train %>% rename(Ownership_Indicator_Payment = Ownership_Indicator)
payments_train = payments_train %>% rename(Ownership_Indicator = Physician_Ownership_Interest)
payments_train = payments_train %>% relocate(Ownership_Indicator, .after = Physician_ID)
payments_train = payments_train %>% relocate(Ownership_Indicator_Payment, .after = Record_ID)
payments_train = payments_train %>% filter(Ownership_Indicator_Payment == "No")
```
\
Presenting `payments_train`:
```{r, echo=F}
payments_train %>% names()
```
***

### Exploratory analysis
Procedure:\
1. Group `payments_train` by `Ownership_Indicator` and compare different attributes.\
2. Create additional feature from `payments`, if the attribute helps to distinguish between ownership interest.\
\

#### Amount of Transactions per Ownership_Indicator
As we can see, 116,983 payments belong to physicians, which have an ownership interest (through other payments not in the data set).
```{r, echo=F, message=F}
payments_train %>% count(Ownership_Indicator) %>% ggplot(aes(x=Ownership_Indicator, y=n)) +
  geom_bar(stat="identity", fill='turquoise') +
  geom_text(aes(label=n), vjust=-0.5) +
  labs(title = "Number of payments per physician ownership interest", y="Number of payments")
```
\

#### Total amounts of payments
Total amount of payments is way higher for ownership interest. Create additional feature `avg_total_amount_of_payments` with high values signaling ownership interest.
```{r, echo=F, message=F}
payments_train %>% group_by(Ownership_Indicator) %>% summarise(avg_Total_Amount_of_Payment_USDollars = mean(Total_Amount_of_Payment_USDollars)) %>% ggplot(aes(x=Ownership_Indicator, y=avg_Total_Amount_of_Payment_USDollars)) +
  geom_bar(stat='identity', fill='turquoise') +
  geom_text(aes(label=round(avg_Total_Amount_of_Payment_USDollars, digits=3)), vjust=-0.5) +
  labs(title = 'Average total amount of payments per ownership interest', y = 'Average Total amount of Payment')
```
\

#### Number of payment rates per payment
Average number of payments is slightly higher for ownership interest. Not sure if the difference is enough for an additional feature.
```{r, echo=F, message=F}
payments_train %>% group_by(Ownership_Indicator) %>% summarise(avg_Number_of_Payments = mean(Number_of_Payments)) %>% ggplot(aes(y=avg_Number_of_Payments, x=Ownership_Indicator)) + 
  geom_bar(stat='identity', fill='turquoise') +
  geom_text(aes(label=round(avg_Number_of_Payments, digits = 3)), vjust=-0.5) +
  labs(title = 'Average number of payments (per payment) per ownership interest', y = 'Average number of payments')
```
\

#### Form of Payment
Except for Cash and in-kind items and services, all other forms are 100% associated with ownership interest.\
Create binary variable that signals, if `Form_of_Payment_or_Transfer_of_Value` does not contain 'cash' or 'items'.
```{r, echo=F, message=F}
payments_train %>% count(Ownership_Indicator, Form_of_Payment_or_Transfer_of_Value) %>%  ggplot(aes(x=Ownership_Indicator,y=n,fill=Form_of_Payment_or_Transfer_of_Value)) + 
  geom_col(position = 'fill') + 
  labs(y='Relative frequency', title = 'Frequency of Form of payment per ownership interest', fill = 'Forms of payment')

# payments_train %>% count(Ownership_Indicator, Form_of_Payment_or_Transfer_of_Value) %>% arrange(desc(n))
```
\

#### Nature of Payment
Not sure if binary features are justified here. We need to exclude attributes which do not help us to distinguish.
```{r, echo=F, message=F}
payments_train %>% count(Ownership_Indicator,Nature_of_Payment_or_Transfer_of_Value) %>% ggplot(aes(x=Ownership_Indicator,y=n, fill=str_wrap(Nature_of_Payment_or_Transfer_of_Value,50))) + 
  geom_col(position='fill') +
  theme(legend.position="right", legend.text = element_text(size = 7)) +
  labs(fill='', title='Frequency of Nature of payment per ownership interest #1',y='Relative frequency')
```
\
We need to exclude further attributes, which do not help to distinguish.
```{r, echo=F, message=F}
payments_train %>% filter(!grepl("Food|Lodging|venue", Nature_of_Payment_or_Transfer_of_Value)) %>% count(Ownership_Indicator,Nature_of_Payment_or_Transfer_of_Value) %>% ggplot(aes(x=Ownership_Indicator,y=n,fill=str_wrap(Nature_of_Payment_or_Transfer_of_Value,50))) + geom_col(position="fill") +
theme(legend.position="right", legend.text = element_text(size = 8)) +
labs(fill='', title='Frequency of Nature of payment per ownership interest #2',y='Relative frequency')
```
\
'Education' is associated with no ownership interest. 'Current or prospective...' is (obviously) associated with ownership interest. Not sure about the rest.
```{r, echo=F, message=F}
payments_train %>% filter(!grepl("Food|Lodging|venue|Consulting|Gift|non-accredited", Nature_of_Payment_or_Transfer_of_Value)) %>% count(Ownership_Indicator,Nature_of_Payment_or_Transfer_of_Value) %>% ggplot(aes(x=Ownership_Indicator,y=n,fill=str_wrap(Nature_of_Payment_or_Transfer_of_Value,50))) + geom_col(position="fill") +
theme(legend.position="right", legend.text = element_text(size = 8)) +
labs(fill='', title='Frequency of Nature of payment per ownership interest #3',y='Relative frequency')
```
\

#### Third Party Recipient
Third party recipients - entities or individuals - are associated with ownership interest. Create binary feature `third_party` which signals ownership interest.
```{r, echo=F, message=F}
tab = prop.table(table(payments_train$Ownership_Indicator, payments_train$Third_Party_Recipient),1) %>% as.data.frame()
tab %>% ggplot(aes(x=Var2,y=Freq,fill=Var1)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  geom_text(position = position_dodge(width = 0.9), aes(label=round(Freq, digits = 3), vjust=-0.5)) +
  labs(title = 'Relative frequency of third party recipient per ownership interest', x='Third party recipient', y='Relative freqency', fill='Ownership interest')
```
\

#### Related product indicator (1)
'No' or 'None' product indicator is associated with ownership interest. Not sure if enough evidence to construct a binary feature.
```{r, echo=F, message=F}
positions = c('None','No','Non-Covered',"Combination","Covered","Yes")
tab = prop.table(table(payments_train$Ownership_Indicator, payments_train$Related_Product_Indicator),1) %>% as.data.frame()
tab %>% ggplot(aes(x=Var2,y=Freq,fill=Var1)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  scale_x_discrete(limits = positions) +
  labs(title = "Frequency of Related product indicator per ownership interest", x="Product indicator", y="Relative frequency", fill="Ownership interest") +
  geom_text(position = position_dodge(width = 0.9), aes(label=round(Freq, digits = 3), vjust=-0.5))
  
```
\

#### Product Type (1)
'Biological' and 'Device or Medical Supply' are associated with ownership interest. Not sure if enough evidence to construct a binary feature.
```{r, echo=F, message=F}
positions = c("Drug","Drug or Biological","Biological","Device","Device or Medical Supply","Medical Supply")
tab = prop.table(table(payments_train$Ownership_Indicator,payments_train$Product_Type_1),1) %>% as.data.frame()
tab %>% ggplot(aes(x=str_wrap(Var2,15),y=Freq,fill=Var1)) +
  geom_bar(position='dodge', stat = "identity") +
  geom_text(position = position_dodge(width = 0.9), aes(label=round(Freq, digits = 3), vjust=-0.5)) +
  labs(title = "Product type frequency per ownership interest", x="Product type",y="Relative frequency",fill="Ownership interest")
```
***
\

#### Physician Specialties
```{r,echo=F}
physicans_own_list <- payments_raw %>% filter(Ownership_Indicator == "Yes") %>% pull(Physician_ID) %>% unique()
physicians <- physicians %>% mutate(Fraud_label = case_when((id %in% physicans_own_list) ~ 1, TRUE ~ 0))
```

Pull out Primary_Specialty, factor in three different columns and join with `physicians`:
```{r, warning=F}
primary_specialties = physicians %>% select(id, Primary_Specialty)
primary_specialties = primary_specialties %>% separate(Primary_Specialty, into = c("First_Specialty","Second_Specialty","Third_Specialty"), sep = "\\|")

physicians = physicians %>% merge(primary_specialties)
physicians = physicians %>% relocate(First_Specialty,Second_Specialty,Third_Specialty,.before=Primary_Specialty)
```

We go from this:
```{r,echo=F}
physicians %>% select(Primary_Specialty) %>% head(3)
```

To this:
```{r, echo=F,warning=F}
physicians %>% separate(Primary_Specialty, into = c("First_Specialty","Second_Specialty","Third_Specialty"), sep = "\\|") %>% select(First_Specialty,Second_Specialty,Third_Specialty) %>% head(3)
```
\

Examine specialties
```{r, echo=F}
specialties = c("First specialty","Second specialty", "Third specialty")
n = c(table(physicians$First_Specialty,useNA = "always") %>% length(),table(physicians$Second_Specialty, useNA = "always") %>% length(),table(physicians$Third_Specialty,useNA = "always") %>% length())
nr_specialties = data.frame(specialties,n)
nr_specialties %>% ggplot(aes(x=specialties,y=n)) +
  geom_bar(stat = "identity", fill="turquoise") +
  geom_text(aes(label=n),vjust=-0.5) +
  labs(title = "Number of different specialties per cateogory")
```
\

Examine absolute amount of first specialties
```{r,echo=F}
table(physicians$First_Specialty,useNA = "always") %>% as.data.frame() %>% ggplot(aes(x=str_wrap(Var1,15),y=Freq)) +
  geom_bar(stat = "identity", fill="turquoise") +
  geom_text(aes(label=Freq),vjust=-0.3) +
  labs(title = "Number of first specialties",x="First specialties",y="n")
```
\

Lets examine share of second specialties based on first specialty
```{r,echo=F}
physicians %>% count(First_Specialty,Second_Specialty) %>% arrange(desc(n)) %>% as.data.frame() %>% ggplot(aes(x=str_wrap(First_Specialty,8),y=n,fill=str_wrap(Second_Specialty,15))) +
  geom_bar(stat = "identity") +
  theme(legend.position="right", legend.text = element_text(size = 6)) +
  labs(title = "Second specialities based on first specialities #1",x="First specialties",fill="") +
  guides(fill=guide_legend(ncol=3))
```
\

Lets look at first specialites apart from Allopathic & Osteopathic.
```{r,echo=F}
physicians %>% filter(!First_Specialty == "Allopathic & Osteopathic Physicians") %>% count(First_Specialty,Second_Specialty) %>% as.data.frame() %>% ggplot(aes(x=str_wrap(First_Specialty,20),y=n,fill=Second_Specialty)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label=n),vjust=-0.3) +
  labs(title = "Second specialties based on first specialties #2",x="First specialties other than Allopathic & Osteopathic",fill="Second specialty")
```
\

Merge specialties per physician with payments
```{r}
payments_train_specialty = payments_train %>% inner_join(physicians %>% select(id,First_Specialty,Second_Specialty,Third_Specialty),by=c("Physician_ID" = "id"))

payments_train_specialty = payments_train_specialty %>% relocate(First_Specialty,Second_Specialty,Third_Specialty,.after=Physician_ID)
```
\

Examine relationship of ownership interest and first specialty.
```{r,echo=F}
payments_train_specialty %>% count(Ownership_Indicator,First_Specialty) %>% as.data.frame() %>% ggplot(aes(x=Ownership_Indicator,y=n,fill=First_Specialty)) +
  geom_bar(stat = "identity",position = "fill") +
  labs(title = "First specialty ratio relative to ownership interest #1",y="relative frequency")
```
\

Remove Allopathic & Osteopathic Physicians to get a closer view
```{r,echo=F}
payments_train_specialty %>% filter(!grepl("Allopathic",First_Specialty)) %>% count(Ownership_Indicator,First_Specialty) %>% as.data.frame() %>% ggplot(aes(x=Ownership_Indicator,y=n,fill=First_Specialty)) +
  geom_bar(stat = "identity",position = "fill") +
    labs(title = "First specialty ratio relative to ownership interest #2",y="relative frequency")
```
\

Lets look at this numerically:
```{r,echo=F}
prop.table(table(payments_train_specialty$First_Specialty,payments_train_specialty$Ownership_Indicator),2) %>% `*`(100) %>% round(3)
```
\

Lets have a final look into second specialties - nothing interesting here.
```{r,echo=F,warning=F,message=F}
physicians %>% count(Second_Specialty) %>% arrange(desc(n)) %>% top_n(10) %>% ggplot(aes(x=str_wrap(Second_Specialty,8),y=n)) +
  geom_bar(stat = "identity", fill="turquoise") +
  geom_text(aes(label=n),vjust=-0.3) +
  labs(title = "Top 10 second specialties",x="Second specialties")
```

***
#### The recipe
```{r, message=F, warning=F}
??recipes
library(recipes)
```

We need to specify one outcome and n predictors. That means we need to have one data table in tidy format, i.e. the `physicians`. Each row represents one physician (ID) and each column represents one feature (which we engineer from `payments` and `companies`).\

Lets make a mock data frame `i_physicians`:
```{r}
Fraud = c(0,0,1,0,0)
Physician_ID = c(1,2,3,4,5)
License_State = c("NY","CA","TX","NY","CA")
Primary_Specialty = c("Allopathic","Radiology","Osteopathic","Physician","Cardiology")
Total_amount_payments = c(9945,2210,34250,12050,1256)
Total_number_of_payments = c(5,12,45,22,8)
i_physicians = data.frame(Fraud,Physician_ID,License_State,Primary_Specialty,Total_amount_payments,Total_number_of_payments)
i_physicians
```

Lets specify the outcome and the predictors:
```{r}
rec_obj = recipe(Fraud ~ ., data = i_physicians) %>% update_role(Physician_ID, new_role = "ID")
rec_obj
```

Convert categorical predictors into numeric dummy variables:
```{r}
ind_obj = rec_obj %>% step_dummy(all_predictors(), -all_numeric())
ind_obj
```

As all variables are now numeric, we can center and scale them:
```{r}
stand_obj = ind_obj %>% step_center(all_predictors()) %>% step_scale(all_predictors())
stand_obj
```

Estimate means and sd from training set:
```{r}
trained_rec = prep(stand_obj, training = i_physicians)
trained_rec
```

Now, the preprocessing can be applied to the actual training (and test) set:
```{r}
i_train = bake(trained_rec, new_data = i_physicians)
#j_test = bake(trained_rec, new_data = j_physicians)
i_train
```







