The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

  1. Choose any three of the “wide” datasets identified in the Week 5 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 5 assignment!)

For each of the three chosen datasets:

  • Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

  • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

  • Perform the analysis requested in the discussion item.

  • Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

  1. Please include in your homework submission, for each of the three chosen datasets:

The URL to the .Rmd file in your GitHub repository, and The URL for your rpubs.com web page.

PROCEDURE

Library definitions

library(knitr)
library(stringr)
library(tidyr)
library(dplyr)
library(RMySQL)
library(zipcode)
library(ggplot2)

Centers for Medicare & Medicaid Services Open Payment Data.

Dataset url location: https://openpaymentsdata.cms.gov/

I will be exploring the Physician Supplement File for all Program Years.

This is a supplementary file that displays a list of physicians indicated as recipients of payments reported in Open Payments. Each record includes the physicians demographic information, specialties, and license information, as well as a unique identification number (Physician Profile ID) that can be used to search for a specific physician in the general, research, and physician ownership files.

This is a complete Open Payments Dataset from the program from The Centers for Medicare & Medicaid Services Open Payment Data Report site that includes data about a federal government website managed by the Centers for Medicare & Medicaid Services.

Last Updated:

This is a complete set of all data from the Program.

Version: 1.0

Date: January 2017.

Data Provided by:

Centers for Medicare & Medicaid Services (CMS).

Dataset Owner:

Centers for Medicare & Medicaid Services (CMS) and the Government of the United States of America.

Dictionary

This dataset does not seem to have a dictionary. The download link is for a .zip file containing the desired data. However in the zip file there’s a .txt file describing the following:

Filename: OP_PH_PRFL_SPLMTL_README_P01172017.txt

1. Physician Profile Supplement File

The Physician Profile Supplement file contains information about physicians who were indicated as recipients of payments, other transfers of value, or ownership and investment interest in payment records as well as physicians identified as principal investigators associated with research payment records published by Open Payments.

However, this file contains only physicians who are included in at least one published payment record. The criteria used by the Centers for Medicare and Medicaid Services (CMS) to determine which payment records are eligible for publication is available in the Open Payments Data Dictionary and Methodology document. This document can be found on the Resources page of the Open Payments website (https://www.cms.gov/OpenPayments/About/Resources.html). The Data Dictionary and Methodology document also includes information on the data collection and reporting methodology, data fields included in the files, and any notes or special considerations that users should be aware of.

2. Considerations for using the CSV File

Microsoft Excel removes leading zeroes from data fields in comma-separated values (CSV) files. Certain fields in this data set may have leading zeroes. These zeroes will be missing when viewing the information within Microsoft Excel.

Additionally, the latest versions of Microsoft Excel cannot display data sets with more than 1,048,576 rows. This CSV file may exceed that limit. Displaying the data in its entirety may require the use of a spreadsheet program capable of handling very large numbers of records.

3. Details about the OP_PH_PRFL_SPLMTL_P01172017.zip File

This compressed (.zip) file contains one (1) comma-separated values (.csv) format file that uses commas as delimiters and one (1) README.txt file. A description of the CSV file is provided below.

  • OP_PH_PRFL_SPLMTL_P01172017.csv:

This supplementary file displays information on all of the physicians indicated as covered recipients of payments and/or physician principal investigators associated with payments in records published by Open Payments. Each record includes the physician’s demographic information, specialties, and license state, as well as the unique identification number (Physician Profile ID) assigned by Open Payments for each physician. The Physician Profile ID can be used to search Open Payments data to find payments made to or associated with that specific physician.

The physician profile information included in the data sets is submitted by the reporting entity. In contrast, the physician information included in the supplementary file can be derived from different sources including the National Plan and Provider Enumeration System (NPPES) and the Provider Enrollment, Chain and Ownership System (PECOS). As a result, the data in these sources may differ slightly. When searching for physicians using the Open Payments search tool on https://openpaymentsdata.cms.gov, use the physician profile information as listed in the supplementary file.

For simplicity reasons, I will read the raw data directly from the source.

URL and Raw data name and location definitions:

url <- "http://download.cms.gov/openpayments/"
zipfile <- "PHPRFL_P011717.ZIP"
csvfile <- "OP_PH_PRFL_SPLMTL_P01172017.csv"

Local MySQL definitions:

# Need to change to correct local root password for the local database
myLocalPassword <- 'pswrd'
myLocalUser <- 'root'
myLocalHost <- 'localhost'
myLocalMySQLSchema <- 'cms_OpenPaymentData'
myLocalTableName <- 'tbl_OpenPaymentData'

(1) Read information from .CSV file into R.

From the above .zip file, I will choose OP_PH_PRFL_SPLMTL_P01172017.csv which includes the latest information recorded, I am just keeping in mind that this file contains all the records for all the physicians that received payments from CMS.

Read .csv from url by employing read.csv()

For this project I will experiment a few things as follows:

I will do a combined data management procedure by employing MySQL; that is:

  1. I will create a procedure that will create a connection to MySQL.

  2. Once the connection is “ON”, I will check to see if a database named cms_OpenPaymentData exist.

  • If the database exist, do nothing.

  • If the database does NOT exist, create one.

  1. Once the previous step is performed, I will check to see if a table named tbl_OpenPaymentData exist.
  • If the table exist, read the information from it.

  • If the table does NOT exist, then do as follows:

    • Download the .zip file from url in a temp file.

    • Extract the OP_PH_PRFL_SPLMTL_P01172017.csv file containing the required information.

    • Write all the information contained in the .csv file into the tbl_OpenPaymentData table from the cms_OpenPaymentData MySQL scheme.

    • Delete the temp .zip file.

After that I will import all the data into a data frame.

This might take some time since the file contains a large number of records, but in a second or multiple iterations will save a ton of time. Also another advantage is that the information will stay local and it will be straight forward to update if needed.

Function to download .zip file, unzip and extract information from .csv

downloadZip <- function(myurl, myzipfile, mycsvfile){
  temp <- tempfile()
  url <- paste(myurl, myzipfile, sep="")
  download.file(url, temp)
  my.file <- unzip(myzipfile, files = mycsvfile)
  my.data <- read.csv(my.file, header=TRUE, sep=",", stringsAsFactors=FALSE)
# Deleting downloaded file
unlink(temp)
# Returning data
  return(my.data)
}

MySQL Procedure to Read or Write tables

# Establish MySQLconnection
mydbconnection <- dbConnect(MySQL(), 
                  user = myLocalUser,
                  password = myLocalPassword,
                  host = myLocalHost)
# creating a database if it doesn't exist by employing RMySQL() in R
MySQLcode <- paste0("CREATE SCHEMA IF NOT EXISTS ",myLocalMySQLSchema,";", sep="")
dbSendQuery(mydbconnection, MySQLcode)

# Table exists?
mydbconnection <- dbConnect(MySQL(), 
                  user = myLocalUser,
                  password = myLocalPassword,
                  host = myLocalHost,
                  dbname = myLocalMySQLSchema)
# Check to see if table data exist.
myLocalTableName <- tolower(myLocalTableName)
if (dbExistsTable(mydbconnection, name = myLocalTableName)  == FALSE){
# If the table does not exist, download .zip and write .csv file into MySQL
my.data <- downloadZip(myurl= url, myzipfile= zipfile,  mycsvfile= csvfile)
# Then Write the table in MySQL
dbWriteTable(mydbconnection, name= myLocalTableName , value= my.data)
} else {
# Read the data from the local table
 my.data <- dbReadTable(mydbconnection, name = myLocalTableName)
}

# Closing connection with local Schema
dbDisconnect(mydbconnection)

Imported file structure display

## 'data.frame':    814447 obs. of  27 variables:
##  $ Physician_Profile_ID                   : num  29708 29709 29719 29720 29721 ...
##  $ Physician_Profile_First_Name           : chr  "GREGORY" "JOHN" "GEORGE" "MARC" ...
##  $ Physician_Profile_Middle_Name          : chr  "" "D" "A" "S" ...
##  $ Physician_Profile_Last_Name            : chr  "SENSENICH" "BALUCH" "BLESSIOS" "FISK" ...
##  $ Physician_Profile_Suffix               : chr  "" "" "" "" ...
##  $ Physician_Profile_Alternate_First_Name : chr  "GREGORY" "" "" "MARC" ...
##  $ Physician_Profile_Alternate_Middle_Name: chr  "W" "" "" "SASLOW" ...
##  $ Physician_Profile_Alternate_Last_Name  : chr  "SENSENICH" "" "" "FISK" ...
##  $ Physician_Profile_Alternate_Suffix     : chr  "" "" "" "" ...
##  $ Physician_Profile_Address_Line_1       : chr  "861 FAIRWAY DR" "500 E MAIN ST" "550 ORCHARD PARK RD" "375 MOUNT PLEASANT AVE" ...
##  $ Physician_Profile_Address_Line_2       : chr  "" "STE 220" "A103" "" ...
##  $ Physician_Profile_City                 : chr  "CHILLICOTHE" "COLUMBUS" "WEST SENECA" "WEST ORANGE" ...
##  $ Physician_Profile_State                : chr  "MO" "OH" "NY" "NJ" ...
##  $ Physician_Profile_Zipcode              : chr  "64601-3673" "43215" "14224-2646" "07052-2724" ...
##  $ Physician_Profile_Country_Name         : chr  "UNITED STATES" "UNITED STATES" "UNITED STATES" "UNITED STATES" ...
##  $ Physician_Profile_Province_Name        : chr  "" "" "" "" ...
##  $ Physician_Profile_Primary_Specialty    : chr  "Allopathic & Osteopathic Physicians|Family Medicine" "Allopathic & Osteopathic Physicians|Urology" "Allopathic & Osteopathic Physicians|Surgery" "Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease" ...
##  $ Physician_Profile_OPS_Taxonomy_1       : chr  "207Q00000X" "208800000X" "208600000X" "207RC0000X" ...
##  $ Physician_Profile_OPS_Taxonomy_2       : chr  "" "" "204F00000X" "" ...
##  $ Physician_Profile_OPS_Taxonomy_3       : chr  "" "" "" "" ...
##  $ Physician_Profile_OPS_Taxonomy_4       : chr  "" "" "" "" ...
##  $ Physician_Profile_OPS_Taxonomy_5       : chr  "" "" "" "" ...
##  $ Physician_Profile_License_State_Code_1 : chr  "MO" "OH" "NY" "NJ" ...
##  $ Physician_Profile_License_State_Code_2 : chr  "" "" "PA" "" ...
##  $ Physician_Profile_License_State_Code_3 : chr  "" "" "" "" ...
##  $ Physician_Profile_License_State_Code_4 : chr  "" "" "" "" ...
##  $ Physician_Profile_License_State_Code_5 : chr  "" "" "" "" ...

In summary, this data frame contains 814447 independent observations with 27 recognizable variables.

Data transformation

Now that I have the data frame I will transform it in order to create some possible outcomes from the given information; for this, I will subset() by excluding small portion of it.

Excluding Information:

Excluding Alternate Name information:

This procedure will exclude the repeating of the alternate information columns for the physician’s name by performing a subset with the required exclusion.

my.new.data <-  my.data %>% subset(select=-(Physician_Profile_Alternate_First_Name:Physician_Profile_Alternate_Suffix ))
Physician_Profile_ID Physician_Profile_First_Name Physician_Profile_Middle_Name Physician_Profile_Last_Name Physician_Profile_Suffix Physician_Profile_Address_Line_1 Physician_Profile_Address_Line_2 Physician_Profile_City Physician_Profile_State Physician_Profile_Zipcode Physician_Profile_Country_Name Physician_Profile_Province_Name Physician_Profile_Primary_Specialty Physician_Profile_OPS_Taxonomy_1 Physician_Profile_OPS_Taxonomy_2 Physician_Profile_OPS_Taxonomy_3 Physician_Profile_OPS_Taxonomy_4 Physician_Profile_OPS_Taxonomy_5 Physician_Profile_License_State_Code_1 Physician_Profile_License_State_Code_2 Physician_Profile_License_State_Code_3 Physician_Profile_License_State_Code_4 Physician_Profile_License_State_Code_5
29708 GREGORY SENSENICH 861 FAIRWAY DR CHILLICOTHE MO 64601-3673 UNITED STATES Allopathic & Osteopathic Physicians|Family Medicine 207Q00000X MO
29709 JOHN D BALUCH 500 E MAIN ST STE 220 COLUMBUS OH 43215 UNITED STATES Allopathic & Osteopathic Physicians|Urology 208800000X OH
29719 GEORGE A BLESSIOS 550 ORCHARD PARK RD A103 WEST SENECA NY 14224-2646 UNITED STATES Allopathic & Osteopathic Physicians|Surgery 208600000X 204F00000X NY PA
29720 MARC S FISK 375 MOUNT PLEASANT AVE WEST ORANGE NJ 07052-2724 UNITED STATES Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease 207RC0000X NJ
29721 BENJAMIN MICHAEL JULIAN THOMPSON 1 HAMPTON RD SUITE 200 EXETER NH 03833-4848 UNITED STATES Allopathic & Osteopathic Physicians|Orthopaedic Surgery|Sports Medicine 207XX0005X 207X00000X MA NH
29722 TAISSA N CHERRY 2238 GEARY BLVD SAN FRANCISCO CA 94115-3416 UNITED STATES 207LP2900X 207L00000X CA

Excluding Street Address information:

This procedure will exclude the street address information columns Physician_Profile_Address_Line_1 and Physician_Profile_Address_Line_2 for the physician’s name by performing a subset with the required exclusion.

my.new.data <-  my.new.data %>% subset(select=-(Physician_Profile_Address_Line_1:Physician_Profile_Address_Line_2))
Physician_Profile_ID Physician_Profile_First_Name Physician_Profile_Middle_Name Physician_Profile_Last_Name Physician_Profile_Suffix Physician_Profile_City Physician_Profile_State Physician_Profile_Zipcode Physician_Profile_Country_Name Physician_Profile_Province_Name Physician_Profile_Primary_Specialty Physician_Profile_OPS_Taxonomy_1 Physician_Profile_OPS_Taxonomy_2 Physician_Profile_OPS_Taxonomy_3 Physician_Profile_OPS_Taxonomy_4 Physician_Profile_OPS_Taxonomy_5 Physician_Profile_License_State_Code_1 Physician_Profile_License_State_Code_2 Physician_Profile_License_State_Code_3 Physician_Profile_License_State_Code_4 Physician_Profile_License_State_Code_5
29708 GREGORY SENSENICH CHILLICOTHE MO 64601-3673 UNITED STATES Allopathic & Osteopathic Physicians|Family Medicine 207Q00000X MO
29709 JOHN D BALUCH COLUMBUS OH 43215 UNITED STATES Allopathic & Osteopathic Physicians|Urology 208800000X OH
29719 GEORGE A BLESSIOS WEST SENECA NY 14224-2646 UNITED STATES Allopathic & Osteopathic Physicians|Surgery 208600000X 204F00000X NY PA
29720 MARC S FISK WEST ORANGE NJ 07052-2724 UNITED STATES Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease 207RC0000X NJ
29721 BENJAMIN MICHAEL JULIAN THOMPSON EXETER NH 03833-4848 UNITED STATES Allopathic & Osteopathic Physicians|Orthopaedic Surgery|Sports Medicine 207XX0005X 207X00000X MA NH
29722 TAISSA N CHERRY SAN FRANCISCO CA 94115-3416 UNITED STATES 207LP2900X 207L00000X CA

Working physician’s name:

For this, I will be combining the columns related to the physician’s name information into one single column by employing the function unite() from the tidyr library.

This procedure will exclude the repeating of the alternate information columns for the physician’s name.

my.new.data <-  my.new.data %>%
                unite(Physician, c(Physician_Profile_First_Name, Physician_Profile_Middle_Name, Physician_Profile_Last_Name, Physician_Profile_Suffix), remove=TRUE, sep=" ")
Physician_Profile_ID Physician Physician_Profile_City Physician_Profile_State Physician_Profile_Zipcode Physician_Profile_Country_Name Physician_Profile_Province_Name Physician_Profile_Primary_Specialty Physician_Profile_OPS_Taxonomy_1 Physician_Profile_OPS_Taxonomy_2 Physician_Profile_OPS_Taxonomy_3 Physician_Profile_OPS_Taxonomy_4 Physician_Profile_OPS_Taxonomy_5 Physician_Profile_License_State_Code_1 Physician_Profile_License_State_Code_2 Physician_Profile_License_State_Code_3 Physician_Profile_License_State_Code_4 Physician_Profile_License_State_Code_5
29708 GREGORY SENSENICH CHILLICOTHE MO 64601-3673 UNITED STATES Allopathic & Osteopathic Physicians|Family Medicine 207Q00000X MO
29709 JOHN D BALUCH COLUMBUS OH 43215 UNITED STATES Allopathic & Osteopathic Physicians|Urology 208800000X OH
29719 GEORGE A BLESSIOS WEST SENECA NY 14224-2646 UNITED STATES Allopathic & Osteopathic Physicians|Surgery 208600000X 204F00000X NY PA
29720 MARC S FISK WEST ORANGE NJ 07052-2724 UNITED STATES Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease 207RC0000X NJ
29721 BENJAMIN MICHAEL JULIAN THOMPSON EXETER NH 03833-4848 UNITED STATES Allopathic & Osteopathic Physicians|Orthopaedic Surgery|Sports Medicine 207XX0005X 207X00000X MA NH
29722 TAISSA N CHERRY SAN FRANCISCO CA 94115-3416 UNITED STATES 207LP2900X 207L00000X CA

Zip Codes:

For this, I will be transforming the diverse representations for the zip codes into a 5 digit length zip code by employing clean.zipcodes() from the zipcodes library. That is, I will attempts to detect and clean up suspected ZIP codes. Will strip “ZIP+4” suffixes to match format of zipcode data.frame. Restores leading zeros, converts invalid entries to NAs, and returns character vector. Note that this function does not attempt to find a matching ZIP code in the database, but rather examines formatting alone.

my.new.data$Physician_Profile_Zipcode <-clean.zipcodes(my.new.data$Physician_Profile_Zipcode)
Physician_Profile_ID Physician Physician_Profile_City Physician_Profile_State Physician_Profile_Zipcode Physician_Profile_Country_Name Physician_Profile_Province_Name Physician_Profile_Primary_Specialty Physician_Profile_OPS_Taxonomy_1 Physician_Profile_OPS_Taxonomy_2 Physician_Profile_OPS_Taxonomy_3 Physician_Profile_OPS_Taxonomy_4 Physician_Profile_OPS_Taxonomy_5 Physician_Profile_License_State_Code_1 Physician_Profile_License_State_Code_2 Physician_Profile_License_State_Code_3 Physician_Profile_License_State_Code_4 Physician_Profile_License_State_Code_5
29708 GREGORY SENSENICH CHILLICOTHE MO 64601 UNITED STATES Allopathic & Osteopathic Physicians|Family Medicine 207Q00000X MO
29709 JOHN D BALUCH COLUMBUS OH 43215 UNITED STATES Allopathic & Osteopathic Physicians|Urology 208800000X OH
29719 GEORGE A BLESSIOS WEST SENECA NY 14224 UNITED STATES Allopathic & Osteopathic Physicians|Surgery 208600000X 204F00000X NY PA
29720 MARC S FISK WEST ORANGE NJ 07052 UNITED STATES Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease 207RC0000X NJ
29721 BENJAMIN MICHAEL JULIAN THOMPSON EXETER NH 03833 UNITED STATES Allopathic & Osteopathic Physicians|Orthopaedic Surgery|Sports Medicine 207XX0005X 207X00000X MA NH
29722 TAISSA N CHERRY SAN FRANCISCO CA 94115 UNITED STATES 207LP2900X 207L00000X CA

Specialties:

For this I tried splitting the rows y employing separate_rows() from the tidyr library but the limited resources on my computer returned the following Error: “Error: cannot allocate vector of size 332.9 Mb”; hence I have to skip it and go look for more power.

# my.new.data$Physician_Profile_Primary_Specialty <- my.new.data %>%
#                                                    separate_rows(Physician_Profile_Primary_Specialty,sep="|", convert = FALSE)
#View(my.new.data)

Taxonomies:

For this I will group all the Taxonomies under one Taxonomy Variable by employing the unite() function from the tidyr library.

my.new.data <-  my.new.data %>%
                unite(Taxonomy, c(Physician_Profile_OPS_Taxonomy_1, Physician_Profile_OPS_Taxonomy_2, Physician_Profile_OPS_Taxonomy_3, Physician_Profile_OPS_Taxonomy_4, Physician_Profile_OPS_Taxonomy_5), remove=TRUE, sep=" ")
Physician_Profile_ID Physician Physician_Profile_City Physician_Profile_State Physician_Profile_Zipcode Physician_Profile_Country_Name Physician_Profile_Province_Name Physician_Profile_Primary_Specialty Taxonomy Physician_Profile_License_State_Code_1 Physician_Profile_License_State_Code_2 Physician_Profile_License_State_Code_3 Physician_Profile_License_State_Code_4 Physician_Profile_License_State_Code_5
29708 GREGORY SENSENICH CHILLICOTHE MO 64601 UNITED STATES Allopathic & Osteopathic Physicians|Family Medicine 207Q00000X MO
29709 JOHN D BALUCH COLUMBUS OH 43215 UNITED STATES Allopathic & Osteopathic Physicians|Urology 208800000X OH
29719 GEORGE A BLESSIOS WEST SENECA NY 14224 UNITED STATES Allopathic & Osteopathic Physicians|Surgery 208600000X 204F00000X NY PA
29720 MARC S FISK WEST ORANGE NJ 07052 UNITED STATES Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease 207RC0000X NJ
29721 BENJAMIN MICHAEL JULIAN THOMPSON EXETER NH 03833 UNITED STATES Allopathic & Osteopathic Physicians|Orthopaedic Surgery|Sports Medicine 207XX0005X 207X00000X MA NH
29722 TAISSA N CHERRY SAN FRANCISCO CA 94115 UNITED STATES 207LP2900X 207L00000X CA

License: For this I will group all the Licenses under one License Variable by employing the unite() function from the tidyr library.

my.new.data <-  my.new.data %>%
                unite(License, c(Physician_Profile_License_State_Code_1, Physician_Profile_License_State_Code_2, Physician_Profile_License_State_Code_3, Physician_Profile_License_State_Code_4, Physician_Profile_License_State_Code_5), remove=TRUE, sep=" ")
Physician_Profile_ID Physician Physician_Profile_City Physician_Profile_State Physician_Profile_Zipcode Physician_Profile_Country_Name Physician_Profile_Province_Name Physician_Profile_Primary_Specialty Taxonomy License
29708 GREGORY SENSENICH CHILLICOTHE MO 64601 UNITED STATES Allopathic & Osteopathic Physicians|Family Medicine 207Q00000X MO
29709 JOHN D BALUCH COLUMBUS OH 43215 UNITED STATES Allopathic & Osteopathic Physicians|Urology 208800000X OH
29719 GEORGE A BLESSIOS WEST SENECA NY 14224 UNITED STATES Allopathic & Osteopathic Physicians|Surgery 208600000X 204F00000X NY PA
29720 MARC S FISK WEST ORANGE NJ 07052 UNITED STATES Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease 207RC0000X NJ
29721 BENJAMIN MICHAEL JULIAN THOMPSON EXETER NH 03833 UNITED STATES Allopathic & Osteopathic Physicians|Orthopaedic Surgery|Sports Medicine 207XX0005X 207X00000X MA NH
29722 TAISSA N CHERRY SAN FRANCISCO CA 94115 UNITED STATES 207LP2900X 207L00000X CA

Renaming Columns: I will provide friendly names for all remaining columns.

Original Structure:

## 'data.frame':    814447 obs. of  10 variables:
##  $ Physician_Profile_ID               : num  29708 29709 29719 29720 29721 ...
##  $ Physician                          : chr  "GREGORY  SENSENICH " "JOHN D BALUCH " "GEORGE A BLESSIOS " "MARC S FISK " ...
##  $ Physician_Profile_City             : chr  "CHILLICOTHE" "COLUMBUS" "WEST SENECA" "WEST ORANGE" ...
##  $ Physician_Profile_State            : chr  "MO" "OH" "NY" "NJ" ...
##  $ Physician_Profile_Zipcode          : chr  "64601" "43215" "14224" "07052" ...
##  $ Physician_Profile_Country_Name     : chr  "UNITED STATES" "UNITED STATES" "UNITED STATES" "UNITED STATES" ...
##  $ Physician_Profile_Province_Name    : chr  "" "" "" "" ...
##  $ Physician_Profile_Primary_Specialty: chr  "Allopathic & Osteopathic Physicians|Family Medicine" "Allopathic & Osteopathic Physicians|Urology" "Allopathic & Osteopathic Physicians|Surgery" "Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease" ...
##  $ Taxonomy                           : chr  "207Q00000X    " "208800000X    " "208600000X 204F00000X   " "207RC0000X    " ...
##  $ License                            : chr  "MO    " "OH    " "NY PA   " "NJ    " ...
names(my.new.data) <- c("ID","Physician", "City", "State", "Zipcode", "Country", "Province", "Specialty", "Taxonomy", "License")

Structure after renaming columns:

## 'data.frame':    814447 obs. of  10 variables:
##  $ ID       : num  29708 29709 29719 29720 29721 ...
##  $ Physician: chr  "GREGORY  SENSENICH " "JOHN D BALUCH " "GEORGE A BLESSIOS " "MARC S FISK " ...
##  $ City     : chr  "CHILLICOTHE" "COLUMBUS" "WEST SENECA" "WEST ORANGE" ...
##  $ State    : chr  "MO" "OH" "NY" "NJ" ...
##  $ Zipcode  : chr  "64601" "43215" "14224" "07052" ...
##  $ Country  : chr  "UNITED STATES" "UNITED STATES" "UNITED STATES" "UNITED STATES" ...
##  $ Province : chr  "" "" "" "" ...
##  $ Specialty: chr  "Allopathic & Osteopathic Physicians|Family Medicine" "Allopathic & Osteopathic Physicians|Urology" "Allopathic & Osteopathic Physicians|Surgery" "Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease" ...
##  $ Taxonomy : chr  "207Q00000X    " "208800000X    " "208600000X 204F00000X   " "207RC0000X    " ...
##  $ License  : chr  "MO    " "OH    " "NY PA   " "NJ    " ...

Final Tidy Table

ID Physician City State Zipcode Country Province Specialty Taxonomy License
29708 GREGORY SENSENICH CHILLICOTHE MO 64601 UNITED STATES Allopathic & Osteopathic Physicians|Family Medicine 207Q00000X MO
29709 JOHN D BALUCH COLUMBUS OH 43215 UNITED STATES Allopathic & Osteopathic Physicians|Urology 208800000X OH
29719 GEORGE A BLESSIOS WEST SENECA NY 14224 UNITED STATES Allopathic & Osteopathic Physicians|Surgery 208600000X 204F00000X NY PA
29720 MARC S FISK WEST ORANGE NJ 07052 UNITED STATES Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease 207RC0000X NJ
29721 BENJAMIN MICHAEL JULIAN THOMPSON EXETER NH 03833 UNITED STATES Allopathic & Osteopathic Physicians|Orthopaedic Surgery|Sports Medicine 207XX0005X 207X00000X MA NH
29722 TAISSA N CHERRY SAN FRANCISCO CA 94115 UNITED STATES 207LP2900X 207L00000X CA

Data Exploration

From the above table we can explore a few things as follows:

Total number of physicians:

The grand total of physicians listed in this database is 814447 physicians.

Total number of physicians by Country:

Country list with respective number of physicians and matching percentages.

Country n Physicians Percentage
UNITED STATES 814153 99.96 %
UNITED STATES MINOR OUTLYING ISLANDS 65 0.01 %
GERMANY 42 0.01 %
CANADA 38 0 %
JAPAN 26 0 %
GREAT BRITAIN (UK) 16 0 %
KOREA (REPUBLIC OF) 16 0 %
ITALY 12 0 %
ISRAEL 10 0 %
TURKEY 6 0 %
SAUDI ARABIA 5 0 %
UNITED ARAB EMIRATES 5 0 %
INDIA 4 0 %
MEXICO 4 0 %
PAKISTAN 4 0 %
AUSTRALIA 3 0 %
THAILAND 3 0 %
BAHRAIN 2 0 %
BRAZIL 2 0 %
CHINA 2 0 %
EGYPT 2 0 %
FRANCE 2 0 %
LEBANON 2 0 %
SPAIN 2 0 %
SWITZERLAND 2 0 %
ANTIGUA AND BARBUDA 1 0 %
BERMUDA 1 0 %
CAMEROON 1 0 %
GABON 1 0 %
GREECE 1 0 %
GUATEMALA 1 0 %
ICELAND 1 0 %
IRELAND 1 0 %
KOREA (DEMOCRATIC PEOPLE’S REPUBLIC OF) 1 0 %
MADAGASCAR 1 0 %
NETHERLANDS ANTILLES 1 0 %
NEW ZEALAND 1 0 %
NULL 1 0 %
PAPUA NEW GUINEA 1 0 %
PHILIPPINES 1 0 %
SOUTH AFRICA 1 0 %
TRINIDAD AND TOBAGO 1 0 %
UGANDA 1 0 %
VENEZUELA 1 0 %

Interesting is to observe how CMS provided payment to physicians around the world, I was not expecting to see those indicators.

Total number of physicians by state:

Top 5 states with highest number of physicians and respective percentages.

State n Physicians Percentage
CA 90242 11.08 %
NY 62140 7.63 %
TX 59782 7.34 %
FL 53902 6.62 %
PA 40456 4.97 %

Bottom 10 states with lowest number of physicians and respective percentages.

State n Physicians Percentage
AP 81 0.01 %
VI 65 0.01 %
GU 54 0.01 %
AA 29 0 %
MP 5 0 %
FM 4 0 %
PW 4 0 %
AS 3 0 %
KO 1 0 %
MH 1 0 %

Bar plot: Total of physicians by state sorted by the number of physicians.

Something very interesting is that from the above table and the bar-plot we can spot a record that shows MH as a state when in reality this entry should be updated since the correct entry is MD. See the below table for the filtered entry from the original table. Once, an address verification is performed with the corresponding zip code, I found out that this entry should be MD and not MH.

Physician_Profile_ID Physician_Profile_First_Name Physician_Profile_Middle_Name Physician_Profile_Last_Name Physician_Profile_Suffix Physician_Profile_Alternate_First_Name Physician_Profile_Alternate_Middle_Name Physician_Profile_Alternate_Last_Name Physician_Profile_Alternate_Suffix Physician_Profile_Address_Line_1 Physician_Profile_Address_Line_2 Physician_Profile_City Physician_Profile_State Physician_Profile_Zipcode Physician_Profile_Country_Name Physician_Profile_Province_Name Physician_Profile_Primary_Specialty Physician_Profile_OPS_Taxonomy_1 Physician_Profile_OPS_Taxonomy_2 Physician_Profile_OPS_Taxonomy_3 Physician_Profile_OPS_Taxonomy_4 Physician_Profile_OPS_Taxonomy_5 Physician_Profile_License_State_Code_1 Physician_Profile_License_State_Code_2 Physician_Profile_License_State_Code_3 Physician_Profile_License_State_Code_4 Physician_Profile_License_State_Code_5
711782 SUSAN G SMIGOCKI SUSAN GAIL SMIGOCKI 26005 RIDGE ROAD SUITE 200 DAMASCUS MH 20872-1899 UNITED STATES Allopathic & Osteopathic Physicians|Obstetrics & Gynecology|Gynecology 207VG0400X MD

Similar analysis can be performed for KO where it stands for an APO.

Physician_Profile_ID Physician_Profile_First_Name Physician_Profile_Middle_Name Physician_Profile_Last_Name Physician_Profile_Suffix Physician_Profile_Alternate_First_Name Physician_Profile_Alternate_Middle_Name Physician_Profile_Alternate_Last_Name Physician_Profile_Alternate_Suffix Physician_Profile_Address_Line_1 Physician_Profile_Address_Line_2 Physician_Profile_City Physician_Profile_State Physician_Profile_Zipcode Physician_Profile_Country_Name Physician_Profile_Province_Name Physician_Profile_Primary_Specialty Physician_Profile_OPS_Taxonomy_1 Physician_Profile_OPS_Taxonomy_2 Physician_Profile_OPS_Taxonomy_3 Physician_Profile_OPS_Taxonomy_4 Physician_Profile_OPS_Taxonomy_5 Physician_Profile_License_State_Code_1 Physician_Profile_License_State_Code_2 Physician_Profile_License_State_Code_3 Physician_Profile_License_State_Code_4 Physician_Profile_License_State_Code_5
835220 BLAKE CHARLES STUART 618TH DENTAL COMPANY UNIT#15653 APO KO UNITED STATES Dental Providers|Dentist 122300000X UT

Similar analysis can be performed for PW where it stands for different typos like PA or PR for example.

Physician_Profile_ID Physician_Profile_First_Name Physician_Profile_Middle_Name Physician_Profile_Last_Name Physician_Profile_Suffix Physician_Profile_Alternate_First_Name Physician_Profile_Alternate_Middle_Name Physician_Profile_Alternate_Last_Name Physician_Profile_Alternate_Suffix Physician_Profile_Address_Line_1 Physician_Profile_Address_Line_2 Physician_Profile_City Physician_Profile_State Physician_Profile_Zipcode Physician_Profile_Country_Name Physician_Profile_Province_Name Physician_Profile_Primary_Specialty Physician_Profile_OPS_Taxonomy_1 Physician_Profile_OPS_Taxonomy_2 Physician_Profile_OPS_Taxonomy_3 Physician_Profile_OPS_Taxonomy_4 Physician_Profile_OPS_Taxonomy_5 Physician_Profile_License_State_Code_1 Physician_Profile_License_State_Code_2 Physician_Profile_License_State_Code_3 Physician_Profile_License_State_Code_4 Physician_Profile_License_State_Code_5
512947 JEFFREY ROBERT WERT ROUTE 611 BOX 485 TANNERSVILLE PW 18372 UNITED STATES Dental Providers|Dentist|General Practice 1223G0001X PA
725235 MIRIAM KATZ 225 COBBS CREEK PARKWAY PHILADELPHIA PW 19139-3723 UNITED STATES Allopathic & Osteopathic Physicians|Pediatrics 208000000X PA
927562 ROSS M WEZMAR 900 SCHECHTER DRIVE WILKES BARRE PW 18702 UNITED STATES Dental Providers|Dentist|Pediatric Dentistry 1223P0221X PA
1047744 FLOREN E PEREZ SUITE 210 PROFESSIONAL BUILDING MENONITE HOSPITAL BO MONTE LLANO CAYEY PW 00736 UNITED STATES Allopathic & Osteopathic Physicians|Obstetrics & Gynecology 207V00000X PR

The above typos can be analyzed by comparing them to our zipcodes library, and further analysis could be performed in order to find out how reliable is the raw data before some cleanup is performed for the physician’s location.

Continuing with the data exploration I fond out that some data sanitation should be performed before any further state exploration and I should not take this data as 100% accurate as it is. If it is true the numbers so far are small these are not significant enough to create dramatic changes in our reports.

Total number of physicians by Zipcode:

Top 5 Zipcodes with highest number of physicians and respective percentages.

City State Zipcode n Physicians Percentage
Houston TX 77030 4642 0.57 %
Chicago IL 60612 2165 0.27 %
Philadelphia PA 19104 2088 0.26 %
Saint Louis MO 63110 1859 0.23 %
New York NY 10021 1799 0.22 %

Bottom 10 Zipcodes with lowest number of physicians and respective percentages.

City State Zipcode n Physicians Percentage
17955 Talkeetna AK 99676 1 0 %
17956 Willow AK 99688 1 0 %
17957 Dutch Harbor AK 99692 1 0 %
17958 Funny River AK 99699 1 0 %
17959 Fairbanks AK 99707 1 0 %
17960 Central AK 99730 1 0 %
17961 Fairbanks AK 99775 1 0 %
17962 Tok AK 99780 1 0 %
17963 Petersburg AK 99833 1 0 %
17964 Klawock AK 99925 1 0 %

Distribution by Region:

Distribution of physicians by the first digit of the zipcode (Region).

# First I will subset the data for all the Zipcodes within "United States"
USzipCodes <- my.new.data %>% subset(Country == "UNITED STATES", select=c(Country, State, Zipcode))
# First I will Create a Region with the first value from the ZipCode.
USzipCodes$region = substr(USzipCodes$Zipcode, 1, 1)
# Merge Zipcodes with the zipcode library
data(zipcode)
USzipCodes <- merge(USzipCodes, zipcode, by.x='Zipcode', by.y='zip')

# Creating ggplot of matches ZipCodes
g <- ggplot(data=USzipCodes) + geom_point(aes(x=longitude, y=latitude, colour=region))

# simplify display and limit to the "lower 48"
g <- g + theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NULL)
g <- g + scale_y_continuous(limits = c(25,50), breaks = NULL)
g

Conclusions

Country:

Based on simple observation, is easy to spot how CMS has made payments to physicians from out of this country, this is something I was not expecting to find and it definitely caught my attention since I was under the impression that those kind of payments for services were performed in the US territories.

State:

If we look at the chart and results, we noticed that over all California and New York, presented the highest physician populations. This, I believe is correlated due to major concentration of humans in those states, once again in this comparison comes to show how less populated areas tend to have less physician populations registered with the open payment data from CMS.

Zipcode:

This is an interesting analysis and I believe it can play a great role in local discoveries related to quality of health care and competition, since it covers immediate surrounding areas.

For example, from the Region distribution we can visualize how the distribution is over the country having a lot of empty spaces on the mid west. Also, something interesting is to observe that the top 3 Zip codes present on the Zip code table, belong to Houston, Chicago and Philadelphia; this is remarkable since these states did not figured as the top states, yet these zip codes have a concentrated number of physicians.

Final conclusion:

Since there’s more data available, it will be interesting to perform more comparisons in regards of the specialties, number of licenses or number of taxonomies that physicians have; this, to find out if some physicians are “better” prepared to deal with proper treatment for all the patients in the surrounding areas, the above conclusion is by assuming that patients tend to visit local health care practitioners in any given emergency situation.