This dataset includes information about the cost of the military planes used during World War 2. The filetype is a csv with tab seperated values. In order to properly interpret the tabs, we simply indicate the seperator as being “\t”.
First we will import the file, then “tidy” it into a usable form for later analysis.
file<- "RawCost.csv"
rawdf<-read.csv(file, sep="\t",header=TRUE,fill=TRUE)
head(rawdf,5)
## Type Model X1941 X1942 X1943 X1944 X1945
## 1 Very Heavy Bombers
## 2 B-29 897,730 605,360 509,465
## 3 Heavy Bombers
## 4 B-17 301,221 258,949 204,370 187,742
## 5 B-24 379,162 304,391 215,516 -
As we see above, we make the following observations on the data:
We will correct each of these items.
We begin tidying the data by adding interpretable NAs (to be used later to fill in the missing values) to the “Type” column and remove the rows that do not have a value for the column “Model”.
empty_as_na <- function(x){
if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
ifelse(as.character(x)!="", x, NA)
}
df <- rawdf
df$Type <- empty_as_na(df$Type)
df$Model[df$Model == " "] <- NA
head(df,5)
## Type Model X1941 X1942 X1943 X1944 X1945
## 1 Very Heavy Bombers <NA>
## 2 <NA> B-29 897,730 605,360 509,465
## 3 Heavy Bombers <NA>
## 4 <NA> B-17 301,221 258,949 204,370 187,742
## 5 <NA> B-24 379,162 304,391 215,516 -
In order to populate the “Type” column’s missing values, we will fill in the NA values created in the prior step in the “Type” column with the previous value encountered in the same column using a solution from the R Graphics Cookbook by Winston Chang.1
fillNAgaps <- function(x, firstBack=FALSE) {
lvls <- NULL
if (is.factor(x)) {
lvls <- levels(x)
x <- as.integer(x)
}
goodIdx <- !is.na(x)
if (firstBack) goodVals <- c(x[goodIdx][1], x[goodIdx])
else goodVals <- c(NA, x[goodIdx])
fillIdx <- cumsum(goodIdx)+1
x <- goodVals[fillIdx]
if (!is.null(lvls)) {
x <- factor(x, levels=seq_along(lvls), labels=lvls)
}
x
}
df$Type <- fillNAgaps(df$Type)
df<- df[!is.na(df$Model),]
head(df,5)
## Type Model X1941 X1942 X1943 X1944 X1945
## 2 Very Heavy Bombers B-29 897,730 605,360 509,465
## 4 Heavy Bombers B-17 301,221 258,949 204,370 187,742
## 5 Heavy Bombers B-24 379,162 304,391 215,516 -
## 6 Heavy Bombers B-32 - 790,433 - 790,433 -
## 8 Medium Bombers B-25 180,031 153,396 151,894 142,194 116,752
As shown above, we now have a value in the Type column for each row corresponding to each aircraft model.
The dataframe is still wide though, having columns representing years. We will gather these into one “Year” column and a corresponding value “Cost” column in the chunk below.
Additionally, we will convert dashes and empty cells into NA.
At the end of this step, we will have a tidy dataframe which we can use to look at the average cost of each aircraft type.
colnames(df)[3:7] <- c("1941","1942","1943","1944","1945")
df <- gather(df, "Year", "Cost", 3:7)
df$Year <- as.integer(df$Year)
df$Cost <- str_trim(sub(",", "", df$Cost, fixed = TRUE))
df$Cost[df$Cost == "-" | df$Cost == ""] <- NA
df$Cost <- as.integer(df$Cost)
df$Model <- str_trim(df$Model)
head(df,5)
## Type Model Year Cost
## 1 Very Heavy Bombers B-29 1941 NA
## 2 Heavy Bombers B-17 1941 301221
## 3 Heavy Bombers B-24 1941 379162
## 4 Heavy Bombers B-32 1941 NA
## 5 Medium Bombers B-25 1941 180031
Now that we have tidied the data, we can perform some exploratory analysis on it such as looking at the average cost of each aircraft type over the course of the war. We will present that in a horizontal bar plot.
means<- df %>% group_by(Type) %>% summarize(Mean=round(mean(Cost,na.rm=TRUE),0))
ggplot(means, aes(x=reorder(Type, Mean), y= Mean)) +
xlab("Type") +
geom_col() +
coord_flip()
The next dataset we will manipulate concerns hospital complications and deaths. One possible insight we’d like to gain from this dataset is assessing the top 10 best and worst hospitals in the country. We will perform the necessary operations to rank and populate the 10 best and worst into a table.
First we load the data into a dataframe…
file <-"Complications_and_Deaths_-_Hospital.csv"
df2 <-read.csv(file,header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
datatable(head(df2,5),options=list(pageLength=5,autoHideNavigation=TRUE))
As we see there are many columns and rows (example above limited to 5 for space).
We are interested in finding out the most problematic Adverse Events for underperforming hospitals, as well as how hospitals perform in comparison to the national average and ranking the best and worst in terms of that variable.
Below we perform the following adjustments to the dataframe:
For reference, below we print the existing string levels and and show how the transformation changes these to numeric values.
df2 <- df2[df2$Score!="Not Available",]
df2$Compared.to.National <- as.factor(df2$Compared.to.National)
df2$Measure.Name <- as.factor(df2$Measure.Name)
levels(df2$Compared.to.National)
## [1] "Better than the National Rate"
## [2] "Better than the National Value"
## [3] "No Different than the National Rate"
## [4] "No Different than the National Value"
## [5] "Worse than the National Rate"
## [6] "Worse than the National Value"
levels(df2$Compared.to.National) <- c(1,1,0,0,-1,-1)
levels(df2$Compared.to.National)
## [1] "1" "0" "-1"
To ensure the transformation was done properly, here are 2 rows for each of the Compared.To.National levels:
ex1<-head(df2[df2$Compared.to.National==1,c("Provider.ID","Measure.Name","Compared.to.National")],2)
ex2<-head(df2[df2$Compared.to.National==0,c("Provider.ID","Measure.Name","Compared.to.National")],2)
ex3<-head(df2[df2$Compared.to.National==-1,c("Provider.ID","Measure.Name","Compared.to.National")],2)
rbind(ex1,ex2,ex3)
## Provider.ID Measure.Name
## 1198 520098 Blood stream infection after surgery
## 1346 290009 Pressure sores
## 1 271344 Death rate for pneumonia patients
## 3 280105 Accidental cuts and tears from medical treatment
## 22 010005 Death rate for COPD patients
## 23 010005 Death rate for heart failure patients
## Compared.to.National
## 1198 1
## 1346 1
## 1 0
## 3 0
## 22 -1
## 23 -1
Now that the levels are normalized lets answer our initial questions…
Below we isolate the instances where a provider performs a given measure worse than national average, count the occurences, then generate a bar plot.
This shows that the top five measures being performed worse than average are:
worst<- df2[df2$Compared.to.National==-1,]
par(mar=c(12,3,1,1))
barplot(sort(table(worst$Measure.ID),decreasing=TRUE), las=2)
Below we look for the 10 hospitals that have the most number of measures being performed better than the national average by their physicians. In order to rank them, we’ll group the dataframe by Hospital.Name then sum the Compared.to.National values for each. Since we previously converted that column into numeric values [-1,0,1] we can sum them to see how many more measures they are performing better than average in.
df2$Compared.to.National <- as.numeric(levels(df2$Compared.to.National))[df2$Compared.to.National]
Hospitals <- df2 %>% group_by(Hospital.Name) %>% summarize(sum=sum(Compared.to.National))
bestHospitals <- subset(Hospitals, sum>0)
tenBest <- bestHospitals[order(bestHospitals$sum, decreasing=TRUE),][1:10,]
datatable(tenBest)
Here we’ll perform the same analysis but inverted to find the worst hospitals.
worstHospitals <- subset(Hospitals, sum<0)
tenWorst <- worstHospitals[order(worstHospitals$sum),][1:10,]
datatable(tenWorst)
Here we will be working with a dataset containing all the Google Play Store apps and various variables related to each. Some possible analyses we can perform on this dataset includes finding out which are the top apps by installs. We can also look at the different categories of apps, the number of free and paid apps, and how ratings compare between free and paid.
Below we load the data and remove some duplicate rows.
apps <- read.csv("googleplaystore.csv")
apps <- apps[!duplicated(apps$App),] #remove duplicates
head(apps,5)
## App Category Rating
## 1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1
## 2 Coloring book moana ART_AND_DESIGN 3.9
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN 4.7
## 4 Sketch - Draw & Paint ART_AND_DESIGN 4.5
## 5 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3
## Reviews Size Installs Type Price Content.Rating
## 1 159 19M 10,000+ Free 0 Everyone
## 2 967 14M 500,000+ Free 0 Everyone
## 3 87510 8.7M 5,000,000+ Free 0 Everyone
## 4 215644 25M 50,000,000+ Free 0 Teen
## 5 967 2.8M 100,000+ Free 0 Everyone
## Genres Last.Updated Current.Ver
## 1 Art & Design January 7, 2018 1.0.0
## 2 Art & Design;Pretend Play January 15, 2018 2.0.0
## 3 Art & Design August 1, 2018 1.2.4
## 4 Art & Design June 8, 2018 Varies with device
## 5 Art & Design;Creativity June 20, 2018 1.1
## Android.Ver
## 1 4.0.3 and up
## 2 4.0.3 and up
## 3 4.0.3 and up
## 4 4.2 and up
## 5 4.4 and up
First we’d like to see which are the most installed apps. One issue with the dataset as we can see above, is that the Installs column is composed of text strings. Below, we will transform those values into integers using the gsub() function. Afterward we’ll be able to sort the dataframe by the Installs column.
apps$Installs <- suppressWarnings(as.numeric(gsub(",", "", str_extract_all(apps$Installs, ".+?(?=\\+)"))))
app_installs <- apps %>% group_by(App) %>% summarize(Installs=Installs) %>% arrange(desc(Installs))
datatable(app_installs)
Next let’s look at the different categories of apps and whether they are free or paid.
categoryDistr <- apps[-10473,]
categoryDistr <- categoryDistr[categoryDistr$Type=="Free" | categoryDistr$Type=="Paid",]
categoryDistr <- categoryDistr %>%
group_by(Category, Type) %>%
summarise(num = length(Category))
ggplot(categoryDistr, aes(x=Category, y=num, fill=Type)) +
xlab("Category") +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Distribution of apps by category & Type")
Finally let’s take a look at the distribution of ratings for both free and paid apps. Once again, the Price column is interpreted as characters with a dollar sign. In order to perform this analysis we’ll have to convert this into an interpretable number. This will be accomplished by using a regex isolate just the numbers, then cast that into numeric class.
apps$Price <- as.character(apps$Price)
apps$Price <- as.numeric(str_extract_all(apps$Price, "\\d+(\\.\\d+)?"))
freeRatings <- apps[apps$Price==0,]
paidRatings <- apps[apps$Price!=0,]
par(mfrow=c(1,2))
ggplot(paidRatings, aes(x=Rating)) +
geom_line(stat="density", aes(color="Paid")) +
geom_line(data=freeRatings, stat="density", aes(color="Free")) +
ggtitle("Distribution of Ratings for free and paid apps") +
scale_color_manual(values = c(
"Paid" = "blue",
"Free" = "red")) +
labs(color = "App Type")
theme(plot.title = element_text(hjust = 0.5))
## List of 1
## $ plot.title:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 0.5
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
Above we can see that the vast majority of apps are free, the top 3 categories are Family, Games, and tools. We can also see that paid apps tend to be rated higher than free apps.
Sources:
https://stackoverflow.com/questions/24172111/change-the-blank-cells-to-na
http://www.cookbook-r.com/Manipulating_data/Filling_in_NAs_with_last_non-NA_value/