Using large and complex data sets in a day to day is a reality, and in order to be able to gain valuable insights from data you must first clean and seperate the data you want to analyze.
In this assignment we will read and clean data using the dplyr package. Our goal is to clean our dataset to make it more managable for data analysis down the road.
Using the dplyr package we will subset, mutate, reorganize, and remove NULL values from our data.
## dplyr We first want to isnstall the dplyr package and add them to our library. We can do this using install.packages
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Our dataset “describes average spending levels during hospitals’ Medicare Spending per Beneficiary (MSPB) episodes by Medicare claim type.” More information can be viewed about the dataset here: https://catalog.data.gov/dataset/medicare-hospital-spending-by-claim You can download the csv file here: Medicare Spending Dataset
We will read our data in using the read.csv() function.
Medicare_Hospital <- read.csv("Medicare_Hospital_Spending_by_Claim.csv")
Now that we’ve read in our data, we can use str(), dim(), head(), and tail() to get a better overall better view of what data we are dealing with.
# head() gives us the first 6 observations (rows) and all of the variables (columns)
head(Medicare_Hospital)
# tail() gives us the last 6 observations, and all of the variables
tail(Medicare_Hospital)
#str() gives us the overall structure of the datset
str(Medicare_Hospital)
## 'data.frame': 65499 obs. of 13 variables:
## $ Hospital_Name : Factor w/ 2890 levels "ABBEVILLE GENERAL HOSPITAL",..: 2248 2248 2248 2248 2248 2248 2248 2248 2248 2248 ...
## $ Provider_ID : int 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 ...
## $ State : Factor w/ 47 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Period : Factor w/ 4 levels "1 through 30 days After Discharge from Index Hospital Admission",..: 2 2 2 2 2 2 2 4 4 4 ...
## $ Claim_Type : Factor w/ 8 levels "Carrier","Durable Medical Equipment",..: 3 4 5 6 7 2 1 3 4 5 ...
## $ Avg_Spending_Per_Episode_Hospital: int 21 1 9 215 1 6 569 0 0 10046 ...
## $ Avg_Spending_Per_Episode_State : int 15 1 7 100 2 8 527 0 0 9451 ...
## $ Avg_Spending_Per_Episode_Nation : int 13 0 6 138 2 7 561 0 0 9865 ...
## $ Percent_of_Spending_Hospital : num 0.1 0.01 0.04 1 0 ...
## $ Percent_of_Spending_State : num 0.08 0 0.03 0.5 0.01 0.04 2.61 0 0 46.8 ...
## $ Percent_of_Spending_Nation : num 0.06 0 0.03 0.65 0.01 ...
## $ Start_Date : int 1012017 1012017 1012017 1012017 1012017 1012017 1012017 1012017 1012017 1012017 ...
## $ End_Date : int 12312017 12312017 12312017 12312017 12312017 12312017 12312017 12312017 12312017 12312017 ...
# dim() gives us the number of observations and variables
dim(Medicare_Hospital)
## [1] 65499 13
We see that we have 65499 observations (rows) and 13 variables (columns) in this dataset. However, we are only interested in certain variables, and certain states within the “State” variable. We’ll talk about how we can extract just the information we need in the Subsetting Data section.
Next we are going to remove any observations (rows) that have any missing values in them, using complete.cases(). We will name the new file “Medicare_Hospital_Complete”
Medicare_Hospital_Complete <- Medicare_Hospital[complete.cases(Medicare_Hospital), ]
Medicare_Hospital_Complete <- na.omit(Medicare_Hospital)
We had 65499 obersations and 13 variables in our original dataset. Lets use dim() to quickly find out the new number of obersations and variables after deleted any null values.
dim(Medicare_Hospital_Complete)
## [1] 65499 13
We have the exact same number of observations and variables so we didn’t have any missing data to begin with. Since we don’t have any missing values we can continue to use Medicare_Hospital instead of Medicare_Hospital_Complete
If you want to quickly check how many NA values are a dataset, you can do so by using sum(is.na()).
sum(is.na(Medicare_Hospital))
## [1] 0
We have 0 null values in our dataset. You can do this before removing any null values. It could poassibly save you time if you already have a complete data set.
Now that we’re a little more familiar with our data set, we want to subset the data based on what we’ll need.
There’s 13 variables in our dataset, however we don’t need all of them.
Since the provider_ID, start and end date column provided very little useful information, so we can remove those.
We also only want to look at Average spending per Hospital and by State. We don’t need to look at Average spending per claim per Nation, and also do not need percentages. so we can remove the percentage variables as well. We can remove all of these variables at once by subsetting our data.
Medicare_Hospital_Subset <- Medicare_Hospital[c(1:65499), c(1, 3:7)]
Medicare_Hospital_Subset
# check our new data set
dim(Medicare_Hospital_Subset)
## [1] 65499 6
Using dim we see we now only have our 6 variables of interest.
This dataset contains information on almost all 50 states. We are only intersted in data from the Northeast part of the United States, so we will subset the data again to include just the Northeast portion of the United States.
We can do this using OR where | == OR
Medicare_Hospital_Subset_NE <- subset(Medicare_Hospital_Subset, State == "CT" | State == "DE" | State == "ME" | State == "MA" | State == "MD"| State == "NH" | State == "NJ" | State == "NY" | State == "PA" | State == "RI" | State == "VT")
Now we can use the dim() function to see how many observations and variables our new dataset has.
dim(Medicare_Hospital_Subset_NE)
## [1] 10758 6
Our new subset of data has 10578 observations and 6 variables, a much smaller dataset than our original 65499 x 13.
We can mutate and add a column to the data. Lets say we want to see the difference of the Average spending per Hospital and the Average spending per State.
This will give tell us if the average spending per Hospital, or per State, was greater for these claims.
Medicare_Hospital_Subset_NE <- mutate(Medicare_Hospital_Subset_NE, "Hospital_vs_State_Spending" = Avg_Spending_Per_Episode_Hospital - Avg_Spending_Per_Episode_State)
Medicare_Hospital_Subset_NE
If we scroll all the way to the right we can see we’ve added a new column “Hospital_vs_State_Spending” which gives the difference of “Avg_Spending_Per_Episode_Hospital” and “Avg_Spending_Per_Episode_State”
Last we want to reorer the data, by State, alphabetically. We can do this using arrange()
Medicare_Hospital_Subset_NE_organized <- arrange(Medicare_Hospital_Subset_NE, (State))
Our data should now be ordered alphabetically, by State.
We can read in our final data set.
Medicare_Hospital_Subset_NE_organized
dim(Medicare_Hospital_Subset_NE_organized)
## [1] 10758 7
We have a final dataset containing 10758 observations and 7 variables, ordered alphabetically by state.
Now that we have a subsetted data frame of our States and variables of interest, we could begin an analysis portion. We could also break down this dataset further if we were interested in different regions, different or individual states, percentages instead of averages, etc.