Methods

Using the dplyr package we will subset, mutate, reorganize, and remove NULL values from our data.
## dplyr We first want to isnstall the dplyr package and add them to our library. We can do this using install.packages

#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Reading the data

Our dataset “describes average spending levels during hospitals’ Medicare Spending per Beneficiary (MSPB) episodes by Medicare claim type.” More information can be viewed about the dataset here: https://catalog.data.gov/dataset/medicare-hospital-spending-by-claim You can download the csv file here: Medicare Spending Dataset

We will read our data in using the read.csv() function.

Medicare_Hospital <- read.csv("Medicare_Hospital_Spending_by_Claim.csv")

Understanding the data

Now that we’ve read in our data, we can use str(), dim(), head(), and tail() to get a better overall better view of what data we are dealing with.

# head() gives us the first 6 observations (rows) and all of the variables (columns)
head(Medicare_Hospital)

# tail() gives us the last 6 observations, and all of the variables
tail(Medicare_Hospital)

#str() gives us the overall structure of the datset
str(Medicare_Hospital)

## 'data.frame':    65499 obs. of  13 variables:
##  $ Hospital_Name                    : Factor w/ 2890 levels "ABBEVILLE GENERAL HOSPITAL",..: 2248 2248 2248 2248 2248 2248 2248 2248 2248 2248 ...
##  $ Provider_ID                      : int  10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 ...
##  $ State                            : Factor w/ 47 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Period                           : Factor w/ 4 levels "1 through 30 days After Discharge from Index Hospital Admission",..: 2 2 2 2 2 2 2 4 4 4 ...
##  $ Claim_Type                       : Factor w/ 8 levels "Carrier","Durable Medical Equipment",..: 3 4 5 6 7 2 1 3 4 5 ...
##  $ Avg_Spending_Per_Episode_Hospital: int  21 1 9 215 1 6 569 0 0 10046 ...
##  $ Avg_Spending_Per_Episode_State   : int  15 1 7 100 2 8 527 0 0 9451 ...
##  $ Avg_Spending_Per_Episode_Nation  : int  13 0 6 138 2 7 561 0 0 9865 ...
##  $ Percent_of_Spending_Hospital     : num  0.1 0.01 0.04 1 0 ...
##  $ Percent_of_Spending_State        : num  0.08 0 0.03 0.5 0.01 0.04 2.61 0 0 46.8 ...
##  $ Percent_of_Spending_Nation       : num  0.06 0 0.03 0.65 0.01 ...
##  $ Start_Date                       : int  1012017 1012017 1012017 1012017 1012017 1012017 1012017 1012017 1012017 1012017 ...
##  $ End_Date                         : int  12312017 12312017 12312017 12312017 12312017 12312017 12312017 12312017 12312017 12312017 ...

# dim() gives us the number of observations and variables
dim(Medicare_Hospital)

## [1] 65499    13

We see that we have 65499 observations (rows) and 13 variables (columns) in this dataset. However, we are only interested in certain variables, and certain states within the “State” variable. We’ll talk about how we can extract just the information we need in the Subsetting Data section.

Removing Missing Values

Next we are going to remove any observations (rows) that have any missing values in them, using complete.cases(). We will name the new file “Medicare_Hospital_Complete”

Medicare_Hospital_Complete <- Medicare_Hospital[complete.cases(Medicare_Hospital), ]
Medicare_Hospital_Complete <- na.omit(Medicare_Hospital)

We had 65499 obersations and 13 variables in our original dataset. Lets use dim() to quickly find out the new number of obersations and variables after deleted any null values.

dim(Medicare_Hospital_Complete)

## [1] 65499    13

We have the exact same number of observations and variables so we didn’t have any missing data to begin with. Since we don’t have any missing values we can continue to use Medicare_Hospital instead of Medicare_Hospital_Complete

If you want to quickly check how many NA values are a dataset, you can do so by using sum(is.na()).

sum(is.na(Medicare_Hospital))

## [1] 0

We have 0 null values in our dataset. You can do this before removing any null values. It could poassibly save you time if you already have a complete data set.

Subsetting

Variables of interest

Now that we’re a little more familiar with our data set, we want to subset the data based on what we’ll need.
There’s 13 variables in our dataset, however we don’t need all of them.
Since the provider_ID, start and end date column provided very little useful information, so we can remove those.
We also only want to look at Average spending per Hospital and by State. We don’t need to look at Average spending per claim per Nation, and also do not need percentages. so we can remove the percentage variables as well. We can remove all of these variables at once by subsetting our data.

Medicare_Hospital_Subset <- Medicare_Hospital[c(1:65499), c(1, 3:7)]
Medicare_Hospital_Subset

# check our new data set
dim(Medicare_Hospital_Subset)

## [1] 65499     6

Using dim we see we now only have our 6 variables of interest.

States of interest

This dataset contains information on almost all 50 states. We are only intersted in data from the Northeast part of the United States, so we will subset the data again to include just the Northeast portion of the United States.
We can do this using OR where | == OR

Medicare_Hospital_Subset_NE <- subset(Medicare_Hospital_Subset, State == "CT" | State == "DE" | State == "ME" | State == "MA" | State == "MD"| State == "NH" | State == "NJ" | State == "NY" | State == "PA" | State == "RI" | State == "VT")

Now we can use the dim() function to see how many observations and variables our new dataset has.

dim(Medicare_Hospital_Subset_NE)

## [1] 10758     6

Our new subset of data has 10578 observations and 6 variables, a much smaller dataset than our original 65499 x 13.

Mutating

We can mutate and add a column to the data. Lets say we want to see the difference of the Average spending per Hospital and the Average spending per State.
This will give tell us if the average spending per Hospital, or per State, was greater for these claims.

Medicare_Hospital_Subset_NE <- mutate(Medicare_Hospital_Subset_NE, "Hospital_vs_State_Spending" = Avg_Spending_Per_Episode_Hospital - Avg_Spending_Per_Episode_State)
Medicare_Hospital_Subset_NE

If we scroll all the way to the right we can see we’ve added a new column “Hospital_vs_State_Spending” which gives the difference of “Avg_Spending_Per_Episode_Hospital” and “Avg_Spending_Per_Episode_State”

Reorganizing

Last we want to reorer the data, by State, alphabetically. We can do this using arrange()

Medicare_Hospital_Subset_NE_organized <- arrange(Medicare_Hospital_Subset_NE, (State))

Our data should now be ordered alphabetically, by State.

Reading and Cleaning Data

Alex Krasniewski

June 18th, 2019

Goals