Bringing the file into R is the most important of all the steps, as we cannot work with the file unless we have access to it in R in the first place.
pokemonFile <- "https://raw.githubusercontent.com/gabartomeo/data607-cunysps/master/Project02/pokemon_proj02.csv"
pokemon <- read.csv(pokemonFile, encoding = "UTF-8", stringsAsFactors = F)
Next we want to reformat the CSV. The most glaring issue is that the rows and columns have been swapped, with the columns bearing names that are actually values instead of variables. Our best bet is to make the data into a tibble and change it from there.
pokemon <- as.data.frame(t(pokemon), stringsAsFactors = F)
pokemon <- as_tibble(pokemon)
pokemon <- rownames_to_column(pokemon)
head(pokemon)
## # A tibble: 6 x 41
## rowname V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 X.U.FEFF~ again~ again~ again~ agai~ agai~ agai~ agai~ agai~ agai~ agai~
## 2 X..Overg~ 1 1 1 0.5 0.5 0.5 2 2 1 0.25
## 3 X..Overg~ 1 1 1 0.5 0.5 0.5 2 2 1 0.25
## 4 X..Overg~ 1 1 1 0.5 0.5 0.5 2 2 1 0.25
## 5 X..Blaze~ 0.5 1 1 1 0.5 1 0.5 1 1 0.5
## 6 X..Blaze~ 0.5 1 1 1 0.5 1 0.5 1 1 0.5
## # ... with 30 more variables: V11 <chr>, V12 <chr>, V13 <chr>, V14 <chr>,
## # V15 <chr>, V16 <chr>, V17 <chr>, V18 <chr>, V19 <chr>, V20 <chr>,
## # V21 <chr>, V22 <chr>, V23 <chr>, V24 <chr>, V25 <chr>, V26 <chr>,
## # V27 <chr>, V28 <chr>, V29 <chr>, V30 <chr>, V31 <chr>, V32 <chr>,
## # V33 <chr>, V34 <chr>, V35 <chr>, V36 <chr>, V37 <chr>, V38 <chr>,
## # V39 <chr>, V40 <chr>
By using the tranpose (here, written as t()) function, we have swapped the rows and columns. The rownames_to_column() function allows us to make the abilities into a column once more instead of being the names of the rows. From there, it’s about changing the first row to be more readable, and then into column names. We’ll be substituting out the underscores into spaces, making sure units of measurement are clear, and rearranging the column order to be clearer.
pokemon[1,] <- gsub("_", " ", pokemon[1,])
pokemon[1,] <- gsub("\\bsp\\b", "special", pokemon[1,])
pokemon[1,1] <- "abilities"
pokemon[1,39] <- "weight (kg)"
pokemon[1,28] <- "height (m)"
pokemon[1,25] <- "classification"
pokemon <- select(pokemon, 31, 33, everything())
head(pokemon)
## # A tibble: 6 x 41
## V30 V32 rowname V1 V2 V3 V4 V5 V6 V7 V8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 name poked~ abilities again~ agai~ agai~ agai~ agai~ agai~ agai~ agai~
## 2 Bulba~ 1 X..Overg~ 1 1 1 0.5 0.5 0.5 2 2
## 3 Ivysa~ 2 X..Overg~ 1 1 1 0.5 0.5 0.5 2 2
## 4 Venus~ 3 X..Overg~ 1 1 1 0.5 0.5 0.5 2 2
## 5 Charm~ 4 X..Blaze~ 0.5 1 1 1 0.5 1 0.5 1
## 6 Charm~ 5 X..Blaze~ 0.5 1 1 1 0.5 1 0.5 1
## # ... with 30 more variables: V9 <chr>, V10 <chr>, V11 <chr>, V12 <chr>,
## # V13 <chr>, V14 <chr>, V15 <chr>, V16 <chr>, V17 <chr>, V18 <chr>,
## # V19 <chr>, V20 <chr>, V21 <chr>, V22 <chr>, V23 <chr>, V24 <chr>,
## # V25 <chr>, V26 <chr>, V27 <chr>, V28 <chr>, V29 <chr>, V31 <chr>,
## # V33 <chr>, V34 <chr>, V35 <chr>, V36 <chr>, V37 <chr>, V38 <chr>,
## # V39 <chr>, V40 <chr>
Then it’s a matter of updating the column names to be the first row, and removing it.
colnames(pokemon) <- pokemon[1,]
pokemon <- pokemon[-1,]
head(pokemon)
## # A tibble: 6 x 41
## name `pokedex number` abilities `against bug` `against dark`
## <chr> <chr> <chr> <chr> <chr>
## 1 Bulbasa~ 1 X..Overgrow....C~ 1 1
## 2 Ivysaur 2 X..Overgrow....C~ 1 1
## 3 Venusaur 3 X..Overgrow....C~ 1 1
## 4 Charman~ 4 X..Blaze....Sola~ 0.5 1
## 5 Charmel~ 5 X..Blaze....Sola~ 0.5 1
## 6 Chariza~ 6 X..Blaze....Sola~ 0.25 1
## # ... with 36 more variables: `against dragon` <chr>, `against
## # electric` <chr>, `against fairy` <chr>, `against fight` <chr>,
## # `against fire` <chr>, `against flying` <chr>, `against ghost` <chr>,
## # `against grass` <chr>, `against ground` <chr>, `against ice` <chr>,
## # `against normal` <chr>, `against poison` <chr>, `against
## # psychic` <chr>, `against rock` <chr>, `against steel` <chr>, `against
## # water` <chr>, attack <chr>, `base egg steps` <chr>, `base
## # happiness` <chr>, `base total` <chr>, `capture rate` <chr>,
## # classification <chr>, defense <chr>, `experience growth` <chr>,
## # `height (m)` <chr>, hp <chr>, `japanese name` <chr>, `percentage
## # male` <chr>, `special attack` <chr>, `special defense` <chr>,
## # speed <chr>, type1 <chr>, type2 <chr>, `weight (kg)` <chr>,
## # generation <chr>, `is legendary` <chr>
Next we have to fix the abilities row. It had originally been a list of strings and we need to bring that back. We’ll need to remove the initial X and any trailing numbers, and replace the periods with either a comma, a space, or an empty string.
pokemon[["abilities"]] <- gsub("(\\bX\\b)|\\d", "", pokemon[["abilities"]])
pokemon[["abilities"]] <- gsub("(^\\.+)|(\\.+$)", "", pokemon[["abilities"]])
pokemon[["abilities"]] <- gsub("\\.{4}", ",", pokemon[["abilities"]])
pokemon[["abilities"]] <- gsub("\\.", " ", pokemon[["abilities"]])
Then comes making the lists, which results in the abilities column having lists as values.
pokemon[["abilities"]] <- lapply(pokemon[["abilities"]], strsplit, ",")
pokemon[["abilities"]] <- sapply(pokemon[["abilities"]], unlist, recursive=F, simplify=F)
And then, there are a number of columns that are given as characters but are actually number variables and we need to modify them as such. The only numeric column we will not be doing this to is the pokedex number.
pokemon[,c(4:26, 28:31, 33:36, 39:41)] <- lapply(pokemon[,c(4:26, 28:31, 33:36, 39:41)], as.numeric)
Now we are able to work with the data.
In order to analyze the data, we’ll need to pick out the data that’s numeric and normalize it using the scale() function. Then we’ll remove all rows with NAs.
invisible(gc(verbose=FALSE))
pokemon_num_vars <- sapply(pokemon, is.numeric)
working.pokemon <- na.omit(pokemon)
working.pokemon[pokemon_num_vars] <- lapply(working.pokemon[pokemon_num_vars], scale)
working.pokemon <- na.omit(working.pokemon)
summary(working.pokemon[pokemon_num_vars])
## against bug against dark against dragon
## Min. :-1.25712 Min. :-2.00772 Min. :-2.86025
## 1st Qu.:-0.83249 1st Qu.:-0.05611 1st Qu.: 0.06851
## Median : 0.01676 Median :-0.05611 Median : 0.06851
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.01676 3rd Qu.:-0.05611 3rd Qu.: 0.06851
## Max. : 5.11230 Max. : 2.54603 Max. : 2.99727
## against electric against fairy against fight against fire
## Min. :-1.6486 Min. :-1.5746 Min. :-1.45758 Min. :-1.2728
## 1st Qu.:-0.8999 1st Qu.:-0.1481 1st Qu.:-0.77190 1st Qu.:-0.9126
## Median :-0.1511 Median :-0.1481 Median :-0.08621 Median :-0.1922
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 1.3465 3rd Qu.:-0.1481 3rd Qu.:-0.08621 3rd Qu.: 1.2486
## Max. : 4.3417 Max. : 5.5580 Max. : 4.02790 Max. : 4.1302
## against flying against ghost against grass against ground
## Min. :-1.6489 Min. :-1.8298 Min. :-0.97745 Min. :-1.46862
## 1st Qu.:-0.3878 1st Qu.: 0.1031 1st Qu.:-0.66312 1st Qu.:-0.77488
## Median :-0.3878 Median : 0.1031 Median :-0.03447 Median :-0.08114
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.2937 3rd Qu.: 0.1031 3rd Qu.:-0.03447 3rd Qu.:-0.08114
## Max. : 4.6567 Max. : 2.0361 Max. : 3.73748 Max. : 4.08131
## against ice against normal against poison against psychic
## Min. :-1.2986 Min. :-3.5404 Min. :-1.87956 Min. :-2.05805
## 1st Qu.:-0.9642 1st Qu.: 0.3813 1st Qu.:-0.01296 1st Qu.:-0.06413
## Median :-0.2953 Median : 0.3813 Median :-0.01296 Median :-0.06413
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.0424 3rd Qu.: 0.3813 3rd Qu.:-0.01296 3rd Qu.:-0.06413
## Max. : 3.7179 Max. : 0.3813 Max. : 5.58682 Max. : 5.91764
## against rock against steel against water attack
## Min. :-1.4813 Min. :-1.54197 Min. :-1.32883 Min. :-2.2633
## 1st Qu.:-0.4069 1st Qu.:-1.01382 1st Qu.:-0.90568 1st Qu.:-0.7220
## Median :-0.4069 Median : 0.04247 Median :-0.05939 Median :-0.1653
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 1.0258 3rd Qu.: 0.04247 3rd Qu.:-0.05939 3rd Qu.: 0.6417
## Max. : 3.8911 Max. : 6.38021 Max. : 5.01836 Max. : 3.5466
## base egg steps base happiness base total capture rate
## Min. :-1.4934 Min. :-5.3932 Min. :-2.1410 Min. :-1.3935
## 1st Qu.:-0.1505 1st Qu.: 0.1333 1st Qu.:-0.8643 1st Qu.:-0.8248
## Median :-0.1505 Median : 0.1333 Median : 0.0331 Median :-0.4185
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.1505 3rd Qu.: 0.1333 3rd Qu.: 0.7640 3rd Qu.: 1.1388
## Max. : 8.8021 Max. : 5.6599 Max. : 2.6698 Max. : 2.0190
## defense experience growth height (m) hp
## Min. :-2.1692 Min. :-2.7614 Min. :-1.0420 Min. :-1.8862
## 1st Qu.:-0.6682 1st Qu.:-0.2460 1st Qu.:-0.6094 1st Qu.:-0.6843
## Median :-0.1679 Median :-0.2460 Median :-0.1769 Median :-0.0834
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4992 3rd Qu.: 0.1305 3rd Qu.: 0.3638 3rd Qu.: 0.4374
## Max. : 5.3357 Max. : 3.7787 Max. :14.5294 Max. : 7.5284
## percentage male special attack special defense speed
## Min. :-2.7067 Min. :-1.9542 Min. :-1.8218 Min. :-2.1159
## 1st Qu.:-0.2627 1st Qu.:-0.7702 1st Qu.:-0.6791 1st Qu.:-0.7436
## Median :-0.2627 Median :-0.1951 Median :-0.1077 Median :-0.1297
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2627 3rd Qu.: 0.5830 3rd Qu.: 0.5970 3rd Qu.: 0.6286
## Max. : 2.1814 Max. : 3.6276 Max. : 6.1775 Max. : 3.4815
## weight (kg) generation is legendary
## Min. :-0.6481 Min. :-1.4240 Min. :-0.1016
## 1st Qu.:-0.5293 1st Qu.:-0.8926 1st Qu.:-0.1016
## Median :-0.3067 Median : 0.1701 Median :-0.1016
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.1216 3rd Qu.: 0.7015 3rd Qu.:-0.1016
## Max. :12.2484 Max. : 1.7643 Max. : 9.8272
From here, it’s a matter of comparing variables with weight. In my opinion, something like height, hp (health points), defense, or special defense should make a Pokemon weigh more. My reasoning for height is simply because the taller a thing is, typically the more it weighs. For the latter three variables, in video games, being able to withstand attacks is called being “tanky.” Therefore, it would reason these “tanky” variables might play a role in a Pokemon’s likelihood to weigh more.
invisible(gc(verbose=FALSE))
cbPalette <- c("Height (m)"="#999999", "HP"="#E69F00", "Defense"="#56B4E9", "Special Defense"="#009E73", "Generation" = "#F0E442", "Speed"="#0072B2", "Attack"="#D55E00", "Special Attack"="#CC79A7")
ggplot(working.pokemon, aes(x=get("weight (kg)"), shape=1, alpha=0.02)) +
geom_point(aes(y=get("height (m)"), color="Height (m)")) +
geom_point(aes(y=get("hp"), color="HP")) +
geom_point(aes(y=get("defense"), color="Defense")) +
geom_point(aes(y=get("special defense"), color="Special Defense")) +
labs(x="Normalized Weight (kg)", y="Z-Score of Various Variables", title="Normalized Pokemon Weight as it Compares to Various Variables") +
scale_colour_manual(values = cbPalette) +
scale_shape_identity()
ggplot(working.pokemon, aes(x=get("weight (kg)"), shape=1, alpha=0.5)) +
geom_line(aes(y=get("height (m)"), color="Height (m)")) +
geom_line(aes(y=get("hp"), color="HP")) +
geom_line(aes(y=get("defense"), color="Defense")) +
geom_line(aes(y=get("special defense"), color="Special Defense")) +
labs(x="Normalized Weight (kg)", y="Z-Score of Various Variables", title="Normalized Pokemon Weight as it Compares to Various Variables") +
scale_colour_manual(values = cbPalette) +
scale_shape_identity()
Based on the above, it can be safely said that these four variables do not have any direct correlation with weight. What of the other three stats, then? Attack, Special Attack, and Speed? Do they have a correlation with weight?
ggplot(working.pokemon, aes(x=get("weight (kg)"), shape=1, alpha=0.02)) +
geom_point(aes(y=get("attack"), color="Attack")) +
geom_point(aes(y=get("special attack"), color="Special Attack")) +
geom_point(aes(y=get("speed"), color="Speed")) +
labs(x="Normalized Weight (kg)", y="Z-Score of Various Variables", title="Normalized Pokemon Weight as it Compares to Various Variables") +
scale_colour_manual(values = cbPalette) +
scale_shape_identity()
ggplot(working.pokemon, aes(x=get("weight (kg)"), shape=1, alpha=0.5)) +
geom_line(aes(y=get("attack"), color="Attack")) +
geom_line(aes(y=get("special attack"), color="Special Attack")) +
geom_line(aes(y=get("speed"), color="Speed")) +
labs(x="Normalized Weight (kg)", y="Z-Score of Various Variables", title="Normalized Pokemon Weight as it Compares to Various Variables") +
scale_colour_manual(values = cbPalette) +
scale_shape_identity()
As we can see, there is no correlation between attack, special attack, or speed and weight.
Given the massive failing of finding a stat that correlates with weight, the next check would be typing.
pokemon_type1_plot <- ggplot(working.pokemon, aes(get("weight (kg)"), group=get("type1"), color=get("type1"))) +
geom_density() +
labs(x="Weight (kg)", y="Density", title="Normalized Pokemon Weights by Pokemon Type", color="Types")
ggplotly(pokemon_type1_plot)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
It doesn’t appear any of the types are normally distributed by weight.
Can we find a generation of Pokemon that correlates with weight, if we cannot find a type that does?
pokemon_gen_plot <- ggplot(working.pokemon, aes(get("weight (kg)"), group=get("generation"), color=get("generation"))) +
geom_density() +
labs(x="Normalized Weight (kg)", y="Density", title="Normalized Pokemon Weights by Generation", color="Normalized Generations")
ggplotly(pokemon_gen_plot)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
At first blush, there seems there is no correlation between weight and any generation. Looking at a scatterplot reveals…
ggplot(pokemon, aes(get("generation"), get("weight (kg)"), group=get("generation"), color=get("generation"))) +
geom_point() +
labs(x="Generation", y="Weight (kg)", title="Pokemon Weights by Generation", color="Generation")
… there is no correlation. Each generation has a sizeable amount of light-weight Pokemon, but no generation seems to have a direct correlation with weight.
There is no direct correlation between a Pokemon’s weight and their height, their stats, their type, or their generation. Further investigation into decisions made regarding Pokemon weight is needed, given that a Wailord - based off of a real-life blue whale - is around half the weight of a Mudsdale - based off of a real-life Clydesdale.
This dataset was provided by Justin Herman on his Discussion 5 post.
Bringing the file into R is the most important of all the steps, as we cannot work with the file unless we have access to it in R in the first place.
justinFile <- "https://raw.githubusercontent.com/gabartomeo/data607-cunysps/master/Project02/justinHermanDataSet.csv"
golf <- read.csv(justinFile, encoding = "UTF-8", stringsAsFactors = F, na.strings = c("", " ", "NA"))
This file needs a lot of work. To start with, the description of the file is inside of it, so let’s take that out and make it its own thing.
golf_desc <- names(golf)[1]
golf_desc <- paste(golf_desc, golf[1,1], golf[2,4], golf[2,6])
golf_desc <- gsub("\\.", " ", golf_desc)
golf_desc <- gsub(" {2,}", " ", golf_desc)
golf_desc <- gsub("^(\\w)", "\\U\\1", golf_desc, perl=T)
golf_desc
## [1] "Association Golf Outing Raffle Prizes as of 3/11/2018"
With that out of the way, we can remove those lines and get into the real set of data. We’ll want to remove empty rows and columns and get the proper column names set up.
golf <- golf[4:125,]
na_reduction <- function(dat) {
dat %>%
filter(Reduce(`+`, lapply(., is.na)) != ncol(.)) %>%
select_if(~sum(!is.na(.)) > 0)
}
golf <- na_reduction(golf)
colnames(golf) <- golf[1,]
golf <- golf[2:length(golf[,1]),]
golf <- na_reduction(golf)
golf <- lapply(golf, trimws)
golf <- lapply(golf, gsub, pattern=" {2,}", replacement=" ", perl=T)
golf <- as_tibble(golf)
# We'll be putting the years in here... I know there's a neater way to do this.
golf_years <- filter(golf, is.na(quantity) & is.na(Cost) & is.na(Status))
golf["Year"] <- rep(NA, times=length(golf[,1]))
golf[["Year"]][(match(golf_years[["Item"]][1], golf[["Item"]])+1):(match(golf_years[["Item"]][2], golf[["Item"]])-1)] <- golf_years[["Item"]][1]
golf[["Year"]][(match(golf_years[["Item"]][2], golf[["Item"]])+1):(match(golf_years[["Item"]][3], golf[["Item"]])-1)] <- golf_years[["Item"]][2]
golf[["Year"]][(match(golf_years[["Item"]][3], golf[["Item"]])+1):(match(golf_years[["Item"]][4], golf[["Item"]])-1)] <- golf_years[["Item"]][3]
golf[["Year"]][(match(golf_years[["Item"]][4], golf[["Item"]])+1):(match(golf_years[["Item"]][5], golf[["Item"]])-1)] <- golf_years[["Item"]][4]
golf[["Year"]][(match(golf_years[["Item"]][5], golf[["Item"]])+1):(match(golf_years[["Item"]][6], golf[["Item"]])-1)] <- golf_years[["Item"]][5]
golf[["Year"]][(match(golf_years[["Item"]][6], golf[["Item"]])+1):(match(golf_years[["Item"]][7], golf[["Item"]])-1)] <- golf_years[["Item"]][6]
golf[["Year"]][(match(golf_years[["Item"]][7], golf[["Item"]])+1):(match(golf_years[["Item"]][8], golf[["Item"]])-1)] <- golf_years[["Item"]][7]
golf[["Year"]][(match(golf_years[["Item"]][8], golf[["Item"]])+1):nrow(golf)] <- golf_years[["Item"]][8]
# Removing the year rows
golf <- filter(golf, !is.na(quantity) & !is.na(Cost) & !is.na(Status))
Next we’re going to want to make the numeric columns numeric. As a note, there is an item in the Cost column that has the word “each”. Looking up the cost of golf sets told me that spending $500 for a golf set isn’t unheard of, and all the other items in the list seem to be based on total price rather than the price of each item, so the best bet for that row is to change it from “20 each” to “80”, which is 20 (each) multiplied by the quantity, 4.
golf[1, "Cost"] <- 80
golf[["quantity"]] <- as.numeric(golf[["quantity"]])
golf[["Cost"]] <- as.numeric(golf[["Cost"]])
We should pretty up the Status row as well. A little bit of googling led me to find there is a company named “Tukuru Technologies”, which seems to be miswritten or written inconsistently. While the TBD is likely important to whomever this sheet belongs, it’s the only one like it, and giving it its own dedicated column seems pointless. Given the consistency of Tukuru Technologies delivering their items (they’re all over this data set!) it’s safe to say it will be delivered. We also need to change the values “who” into “na” and the explicit column into “Joe”.
golf[["Status"]] <- gsub("[Ff]rom ", "", golf[["Status"]])
golf[["Status"]] <- gsub("Tuk.*", "Tukuru Technologies", golf[["Status"]])
golf[["Status"]][64] <- NA
golf[["Status"]][86] <- "Joe"
With that, we have a data set we can work with.
## # A tibble: 6 x 5
## quantity Item Cost Status Year
## <dbl> <chr> <dbl> <chr> <chr>
## 1 4.00 Golf Shirts 80.0 Tukuru Technologies 2016
## 2 1.00 Golf Windvest 40.0 AE 2016
## 3 4.00 Individual Golf Clubs sets 2000 Matt B 2016
## 4 1.00 Surface Tablet 350 Tukuru Technologies 2016
## 5 1.00 Whitney 300 Tukuru Technologies 2016
## 6 1.00 Drone 300 Tukuru Technologies 2016
Justin suggested a couple of analysis that could be performed. They were:
golf_year_cumsum <- c(
"2016" = colSums(filter(golf, Year==2016)["Cost"], na.rm = T),
"2015" = colSums(filter(golf, Year==2015)["Cost"], na.rm = T),
"2014" = colSums(filter(golf, Year==2014)["Cost"], na.rm = T),
"2013" = colSums(filter(golf, Year==2013)["Cost"], na.rm = T),
"2012" = colSums(filter(golf, Year==2012)["Cost"], na.rm = T),
"2011" = colSums(filter(golf, Year==2011)["Cost"], na.rm = T),
"2010" = colSums(filter(golf, Year==2010)["Cost"], na.rm = T),
"2009" = colSums(filter(golf, Year==2009)["Cost"], na.rm = T)
)
golf_year_cumsum <- as.data.frame(golf_year_cumsum)
rownames(golf_year_cumsum) <- gsub("\\..*", "", rownames(golf_year_cumsum))
names(golf_year_cumsum) <- c("Total Cost in USD")
knitr::kable(golf_year_cumsum, caption="Total Cost of Raffled Items in USD by Year")
| Total Cost in USD | |
|---|---|
| 2016 | 5693.00 |
| 2015 | 1600.00 |
| 2014 | 2310.30 |
| 2013 | 2623.96 |
| 2012 | 4032.00 |
| 2011 | 3145.91 |
| 2010 | 2105.00 |
| 2009 | 620.00 |
golf_year_items_cumsum <- c(
"2016" = colSums(filter(golf, Year==2016)["quantity"], na.rm = T),
"2015" = colSums(filter(golf, Year==2015)["quantity"], na.rm = T),
"2014" = colSums(filter(golf, Year==2014)["quantity"], na.rm = T),
"2013" = colSums(filter(golf, Year==2013)["quantity"], na.rm = T),
"2012" = colSums(filter(golf, Year==2012)["quantity"], na.rm = T),
"2011" = colSums(filter(golf, Year==2011)["quantity"], na.rm = T),
"2010" = colSums(filter(golf, Year==2010)["quantity"], na.rm = T),
"2009" = colSums(filter(golf, Year==2009)["quantity"], na.rm = T)
)
golf_year_items_cumsum <- as.data.frame(golf_year_items_cumsum)
rownames(golf_year_items_cumsum) <- gsub("\\..*", "", rownames(golf_year_items_cumsum))
names(golf_year_items_cumsum) <- c("Number of Items Raffled")
knitr::kable(golf_year_items_cumsum, caption="Number of Items Raffled by Year")
| Number of Items Raffled | |
|---|---|
| 2016 | 18 |
| 2015 | 11 |
| 2014 | 16 |
| 2013 | 18 |
| 2012 | 24 |
| 2011 | 22 |
| 2010 | 11 |
| 2009 | 18 |
golf_year_donor_cumsum <- c(
"2016" = count(filter(golf, Year==2016), Status, sort=T)[1,1],
"2015" = count(filter(golf, Year==2015), Status, sort=T)[1,1],
"2014" = count(filter(golf, Year==2014), Status, sort=T)[1,1],
"2013" = count(filter(golf, Year==2013), Status, sort=T)[1,1],
"2012" = count(filter(golf, Year==2012), Status, sort=T)[1,1],
"2011" = count(filter(golf, Year==2011), Status, sort=T)[1,1],
"2010" = count(filter(golf, Year==2010), Status, sort=T)[1,1],
"2009" = count(filter(golf, Year==2009), Status, sort=T)[1,1]
)
names(golf_year_donor_cumsum) <- gsub("\\..*", "", names(golf_year_donor_cumsum))
golf_year_donor_cumsum <- as_tibble(golf_year_donor_cumsum)
golf_year_donor_cumsum <- gather(golf_year_donor_cumsum)
names(golf_year_donor_cumsum) <- c("Year", "Donor")
knitr::kable(golf_year_donor_cumsum, caption="Most Generous Donor by Year")
| Year | Donor |
|---|---|
| 2016 | Tukuru Technologies |
| 2015 | Matt B |
| 2014 | AE |
| 2013 | AE |
| 2012 | Friend of Joe’s |
| 2011 | AE |
| 2010 | Matt B |
| 2009 | DM |
This dataset was provided by Youngkoung Kim on his Discussion 5 post.
Bringing the file into R is the most important of all the steps, as we cannot work with the file unless we have access to it in R in the first place.
youngkoungFile <- "https://raw.githubusercontent.com/gabartomeo/data607-cunysps/master/Project02/youngkoungKimDataSet.csv"
msa <- read.csv(youngkoungFile, encoding = "UTF-8", stringsAsFactors = F, na.strings = c("", " ", "NA"))
While the CSV offers a lot of data, we do not need all the data it offers for the analysis we will be performing.
msa <- as_tibble(msa)
msa <- select(msa, Title, City, State, Average.Shooter.Age, Total.Number.of.Guns, Place.Type, Possible.Motive...General)
There’s the matter of making sure the appropriate format is applied to all the variables in the CSV.
msa[["Average.Shooter.Age"]] <- as.numeric(msa[["Average.Shooter.Age"]])
msa[["Total.Number.of.Guns"]] <- as.numeric(msa[["Total.Number.of.Guns"]])
After checking the table, I noticed some reduncies we can fix.
msa[["Place.Type"]] <- gsub("Secondary School", "Secondary school", msa[["Place.Type"]], perl=T)
msa[["Place.Type"]] <- gsub("[cC]afe.*", "cafe", msa[["Place.Type"]], perl=T)
msa[["Place.Type"]] <- gsub("Home", "home", msa[["Place.Type"]], perl=T)
msa[["Place.Type"]] <- gsub("Wildness", "Wilderness", msa[["Place.Type"]], perl=T)
msa[["Place.Type"]] <- gsub("Venue", "venue", msa[["Place.Type"]], perl=T)
Now we’re ready to work with the data.
ggplot(msa, aes(x=get("Place.Type"), fill=get("Place.Type"))) +
geom_bar() +
labs(x="Places", fill="Places", title="Total Shootings by Place") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
With this, we can see most shootings occur locally; the shooter will perform their shooting in their home or neighborhood and does not venture far from their origin.
ggplot(msa, aes(x=get("Average.Shooter.Age"), y=get("Total.Number.of.Guns"), fill=get("Average.Shooter.Age"))) +
geom_bar(stat="identity") +
labs(x="Average Shooter Age", y="Total Number of Guns", title="Total Number of Guns of the Average Shooter by Age")
This chart shows up until their mid-forties, a shooter is more likely to carry more than a couple of guns.
ggplot(msa, aes(x=get("State"), fill=get("State"))) +
geom_bar() +
labs(x="State", title="Total Number of Shootings by State", fill="State") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
We can see here that the state with the most shootings is actually California… hypothesis, perhaps it’s Los Angeles in particular. Florida appears to come in second, with Texas coming in third. I think it might be Tampa for Florida, and perhaps San Antonio for Texas. Let’s find out.
ggplot(filter(msa, State=="California"), aes(x=get("City"), fill=get("City"))) +
geom_bar() +
labs(x="City", fill="City") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
This plot confirms my hypothesis that the city with the most shootings in California is indeed Los Angeles.
ggplot(filter(msa, State=="Florida"), aes(x=get("City"), fill=get("City"))) +
geom_bar() +
labs(x="City", fill="City") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
This plot confirms my hypothesis that the city with the most shootings in Florida is indeed Tampa.
ggplot(filter(msa, State=="Texas"), aes(x=get("City"), fill=get("City"))) +
geom_bar() +
labs(x="City", fill="City") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
This plot disproves my hypothesis that the city with the most shootings in Texas are in San Antonio - most shootings in Texas happen in Killeen.
Shooters in America are more likely to perform their shooting in their neighborhood or home, and if they’re under about 45 they’re likely to be carrying multiple guns. Shootings are more likely to occur in California, Florida, and Texas; from these states, specifically in Los Angeles, California, Tampa, Florida, and Killeen, Texas.