File One: Pokemon

Bring the file into R

Bringing the file into R is the most important of all the steps, as we cannot work with the file unless we have access to it in R in the first place.

pokemonFile <- "https://raw.githubusercontent.com/gabartomeo/data607-cunysps/master/Project02/pokemon_proj02.csv"
pokemon <- read.csv(pokemonFile, encoding = "UTF-8", stringsAsFactors = F)

Reformat the CSV so it’s easier to read

Next we want to reformat the CSV. The most glaring issue is that the rows and columns have been swapped, with the columns bearing names that are actually values instead of variables. Our best bet is to make the data into a tibble and change it from there.

pokemon <- as.data.frame(t(pokemon), stringsAsFactors = F)
pokemon <- as_tibble(pokemon)
pokemon <- rownames_to_column(pokemon)
head(pokemon)
## # A tibble: 6 x 41
##   rowname   V1     V2     V3     V4    V5    V6    V7    V8    V9    V10  
##   <chr>     <chr>  <chr>  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 X.U.FEFF~ again~ again~ again~ agai~ agai~ agai~ agai~ agai~ agai~ agai~
## 2 X..Overg~ 1      1      1      0.5   0.5   0.5   2     2     1     0.25 
## 3 X..Overg~ 1      1      1      0.5   0.5   0.5   2     2     1     0.25 
## 4 X..Overg~ 1      1      1      0.5   0.5   0.5   2     2     1     0.25 
## 5 X..Blaze~ 0.5    1      1      1     0.5   1     0.5   1     1     0.5  
## 6 X..Blaze~ 0.5    1      1      1     0.5   1     0.5   1     1     0.5  
## # ... with 30 more variables: V11 <chr>, V12 <chr>, V13 <chr>, V14 <chr>,
## #   V15 <chr>, V16 <chr>, V17 <chr>, V18 <chr>, V19 <chr>, V20 <chr>,
## #   V21 <chr>, V22 <chr>, V23 <chr>, V24 <chr>, V25 <chr>, V26 <chr>,
## #   V27 <chr>, V28 <chr>, V29 <chr>, V30 <chr>, V31 <chr>, V32 <chr>,
## #   V33 <chr>, V34 <chr>, V35 <chr>, V36 <chr>, V37 <chr>, V38 <chr>,
## #   V39 <chr>, V40 <chr>

By using the tranpose (here, written as t()) function, we have swapped the rows and columns. The rownames_to_column() function allows us to make the abilities into a column once more instead of being the names of the rows. From there, it’s about changing the first row to be more readable, and then into column names. We’ll be substituting out the underscores into spaces, making sure units of measurement are clear, and rearranging the column order to be clearer.

pokemon[1,] <- gsub("_", " ", pokemon[1,])
pokemon[1,] <- gsub("\\bsp\\b", "special", pokemon[1,])
pokemon[1,1] <- "abilities"
pokemon[1,39] <- "weight (kg)"
pokemon[1,28] <- "height (m)"
pokemon[1,25] <- "classification"
pokemon <- select(pokemon, 31, 33, everything())
head(pokemon)
## # A tibble: 6 x 41
##   V30    V32    rowname   V1     V2    V3    V4    V5    V6    V7    V8   
##   <chr>  <chr>  <chr>     <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 name   poked~ abilities again~ agai~ agai~ agai~ agai~ agai~ agai~ agai~
## 2 Bulba~ 1      X..Overg~ 1      1     1     0.5   0.5   0.5   2     2    
## 3 Ivysa~ 2      X..Overg~ 1      1     1     0.5   0.5   0.5   2     2    
## 4 Venus~ 3      X..Overg~ 1      1     1     0.5   0.5   0.5   2     2    
## 5 Charm~ 4      X..Blaze~ 0.5    1     1     1     0.5   1     0.5   1    
## 6 Charm~ 5      X..Blaze~ 0.5    1     1     1     0.5   1     0.5   1    
## # ... with 30 more variables: V9 <chr>, V10 <chr>, V11 <chr>, V12 <chr>,
## #   V13 <chr>, V14 <chr>, V15 <chr>, V16 <chr>, V17 <chr>, V18 <chr>,
## #   V19 <chr>, V20 <chr>, V21 <chr>, V22 <chr>, V23 <chr>, V24 <chr>,
## #   V25 <chr>, V26 <chr>, V27 <chr>, V28 <chr>, V29 <chr>, V31 <chr>,
## #   V33 <chr>, V34 <chr>, V35 <chr>, V36 <chr>, V37 <chr>, V38 <chr>,
## #   V39 <chr>, V40 <chr>

Then it’s a matter of updating the column names to be the first row, and removing it.

colnames(pokemon) <- pokemon[1,]
pokemon <- pokemon[-1,]
head(pokemon)
## # A tibble: 6 x 41
##   name     `pokedex number` abilities         `against bug` `against dark`
##   <chr>    <chr>            <chr>             <chr>         <chr>         
## 1 Bulbasa~ 1                X..Overgrow....C~ 1             1             
## 2 Ivysaur  2                X..Overgrow....C~ 1             1             
## 3 Venusaur 3                X..Overgrow....C~ 1             1             
## 4 Charman~ 4                X..Blaze....Sola~ 0.5           1             
## 5 Charmel~ 5                X..Blaze....Sola~ 0.5           1             
## 6 Chariza~ 6                X..Blaze....Sola~ 0.25          1             
## # ... with 36 more variables: `against dragon` <chr>, `against
## #   electric` <chr>, `against fairy` <chr>, `against fight` <chr>,
## #   `against fire` <chr>, `against flying` <chr>, `against ghost` <chr>,
## #   `against grass` <chr>, `against ground` <chr>, `against ice` <chr>,
## #   `against normal` <chr>, `against poison` <chr>, `against
## #   psychic` <chr>, `against rock` <chr>, `against steel` <chr>, `against
## #   water` <chr>, attack <chr>, `base egg steps` <chr>, `base
## #   happiness` <chr>, `base total` <chr>, `capture rate` <chr>,
## #   classification <chr>, defense <chr>, `experience growth` <chr>,
## #   `height (m)` <chr>, hp <chr>, `japanese name` <chr>, `percentage
## #   male` <chr>, `special attack` <chr>, `special defense` <chr>,
## #   speed <chr>, type1 <chr>, type2 <chr>, `weight (kg)` <chr>,
## #   generation <chr>, `is legendary` <chr>

Next we have to fix the abilities row. It had originally been a list of strings and we need to bring that back. We’ll need to remove the initial X and any trailing numbers, and replace the periods with either a comma, a space, or an empty string.

pokemon[["abilities"]] <- gsub("(\\bX\\b)|\\d", "", pokemon[["abilities"]])
pokemon[["abilities"]] <- gsub("(^\\.+)|(\\.+$)", "", pokemon[["abilities"]])
pokemon[["abilities"]] <- gsub("\\.{4}", ",", pokemon[["abilities"]])
pokemon[["abilities"]] <- gsub("\\.", " ", pokemon[["abilities"]])

Then comes making the lists, which results in the abilities column having lists as values.

pokemon[["abilities"]] <- lapply(pokemon[["abilities"]], strsplit, ",")
pokemon[["abilities"]] <- sapply(pokemon[["abilities"]], unlist, recursive=F, simplify=F)

And then, there are a number of columns that are given as characters but are actually number variables and we need to modify them as such. The only numeric column we will not be doing this to is the pokedex number.

pokemon[,c(4:26, 28:31, 33:36, 39:41)] <- lapply(pokemon[,c(4:26, 28:31, 33:36, 39:41)], as.numeric)

Now we are able to work with the data.

Analyzing the data

In order to analyze the data, we’ll need to pick out the data that’s numeric and normalize it using the scale() function. Then we’ll remove all rows with NAs.

invisible(gc(verbose=FALSE))
pokemon_num_vars <- sapply(pokemon, is.numeric)
working.pokemon <- na.omit(pokemon)
working.pokemon[pokemon_num_vars] <- lapply(working.pokemon[pokemon_num_vars], scale)
working.pokemon <- na.omit(working.pokemon)
summary(working.pokemon[pokemon_num_vars])
##   against bug        against dark      against dragon    
##  Min.   :-1.25712   Min.   :-2.00772   Min.   :-2.86025  
##  1st Qu.:-0.83249   1st Qu.:-0.05611   1st Qu.: 0.06851  
##  Median : 0.01676   Median :-0.05611   Median : 0.06851  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.01676   3rd Qu.:-0.05611   3rd Qu.: 0.06851  
##  Max.   : 5.11230   Max.   : 2.54603   Max.   : 2.99727  
##  against electric  against fairy     against fight       against fire    
##  Min.   :-1.6486   Min.   :-1.5746   Min.   :-1.45758   Min.   :-1.2728  
##  1st Qu.:-0.8999   1st Qu.:-0.1481   1st Qu.:-0.77190   1st Qu.:-0.9126  
##  Median :-0.1511   Median :-0.1481   Median :-0.08621   Median :-0.1922  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 1.3465   3rd Qu.:-0.1481   3rd Qu.:-0.08621   3rd Qu.: 1.2486  
##  Max.   : 4.3417   Max.   : 5.5580   Max.   : 4.02790   Max.   : 4.1302  
##  against flying    against ghost     against grass      against ground    
##  Min.   :-1.6489   Min.   :-1.8298   Min.   :-0.97745   Min.   :-1.46862  
##  1st Qu.:-0.3878   1st Qu.: 0.1031   1st Qu.:-0.66312   1st Qu.:-0.77488  
##  Median :-0.3878   Median : 0.1031   Median :-0.03447   Median :-0.08114  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 1.2937   3rd Qu.: 0.1031   3rd Qu.:-0.03447   3rd Qu.:-0.08114  
##  Max.   : 4.6567   Max.   : 2.0361   Max.   : 3.73748   Max.   : 4.08131  
##   against ice      against normal    against poison     against psychic   
##  Min.   :-1.2986   Min.   :-3.5404   Min.   :-1.87956   Min.   :-2.05805  
##  1st Qu.:-0.9642   1st Qu.: 0.3813   1st Qu.:-0.01296   1st Qu.:-0.06413  
##  Median :-0.2953   Median : 0.3813   Median :-0.01296   Median :-0.06413  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 1.0424   3rd Qu.: 0.3813   3rd Qu.:-0.01296   3rd Qu.:-0.06413  
##  Max.   : 3.7179   Max.   : 0.3813   Max.   : 5.58682   Max.   : 5.91764  
##   against rock     against steel      against water          attack       
##  Min.   :-1.4813   Min.   :-1.54197   Min.   :-1.32883   Min.   :-2.2633  
##  1st Qu.:-0.4069   1st Qu.:-1.01382   1st Qu.:-0.90568   1st Qu.:-0.7220  
##  Median :-0.4069   Median : 0.04247   Median :-0.05939   Median :-0.1653  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 1.0258   3rd Qu.: 0.04247   3rd Qu.:-0.05939   3rd Qu.: 0.6417  
##  Max.   : 3.8911   Max.   : 6.38021   Max.   : 5.01836   Max.   : 3.5466  
##  base egg steps    base happiness      base total       capture rate    
##  Min.   :-1.4934   Min.   :-5.3932   Min.   :-2.1410   Min.   :-1.3935  
##  1st Qu.:-0.1505   1st Qu.: 0.1333   1st Qu.:-0.8643   1st Qu.:-0.8248  
##  Median :-0.1505   Median : 0.1333   Median : 0.0331   Median :-0.4185  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.1505   3rd Qu.: 0.1333   3rd Qu.: 0.7640   3rd Qu.: 1.1388  
##  Max.   : 8.8021   Max.   : 5.6599   Max.   : 2.6698   Max.   : 2.0190  
##     defense        experience growth   height (m)            hp         
##  Min.   :-2.1692   Min.   :-2.7614   Min.   :-1.0420   Min.   :-1.8862  
##  1st Qu.:-0.6682   1st Qu.:-0.2460   1st Qu.:-0.6094   1st Qu.:-0.6843  
##  Median :-0.1679   Median :-0.2460   Median :-0.1769   Median :-0.0834  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4992   3rd Qu.: 0.1305   3rd Qu.: 0.3638   3rd Qu.: 0.4374  
##  Max.   : 5.3357   Max.   : 3.7787   Max.   :14.5294   Max.   : 7.5284  
##  percentage male   special attack    special defense       speed        
##  Min.   :-2.7067   Min.   :-1.9542   Min.   :-1.8218   Min.   :-2.1159  
##  1st Qu.:-0.2627   1st Qu.:-0.7702   1st Qu.:-0.6791   1st Qu.:-0.7436  
##  Median :-0.2627   Median :-0.1951   Median :-0.1077   Median :-0.1297  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2627   3rd Qu.: 0.5830   3rd Qu.: 0.5970   3rd Qu.: 0.6286  
##  Max.   : 2.1814   Max.   : 3.6276   Max.   : 6.1775   Max.   : 3.4815  
##   weight (kg)        generation       is legendary    
##  Min.   :-0.6481   Min.   :-1.4240   Min.   :-0.1016  
##  1st Qu.:-0.5293   1st Qu.:-0.8926   1st Qu.:-0.1016  
##  Median :-0.3067   Median : 0.1701   Median :-0.1016  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.1216   3rd Qu.: 0.7015   3rd Qu.:-0.1016  
##  Max.   :12.2484   Max.   : 1.7643   Max.   : 9.8272

From here, it’s a matter of comparing variables with weight. In my opinion, something like height, hp (health points), defense, or special defense should make a Pokemon weigh more. My reasoning for height is simply because the taller a thing is, typically the more it weighs. For the latter three variables, in video games, being able to withstand attacks is called being “tanky.” Therefore, it would reason these “tanky” variables might play a role in a Pokemon’s likelihood to weigh more.

invisible(gc(verbose=FALSE))
cbPalette <- c("Height (m)"="#999999", "HP"="#E69F00", "Defense"="#56B4E9", "Special Defense"="#009E73", "Generation" = "#F0E442", "Speed"="#0072B2", "Attack"="#D55E00", "Special Attack"="#CC79A7")

ggplot(working.pokemon, aes(x=get("weight (kg)"), shape=1, alpha=0.02)) +
  geom_point(aes(y=get("height (m)"), color="Height (m)")) +
  geom_point(aes(y=get("hp"), color="HP")) +
  geom_point(aes(y=get("defense"), color="Defense")) +
  geom_point(aes(y=get("special defense"), color="Special Defense")) +
  labs(x="Normalized Weight (kg)", y="Z-Score of Various Variables", title="Normalized Pokemon Weight as it Compares to Various Variables") +
  scale_colour_manual(values = cbPalette) +
  scale_shape_identity()

ggplot(working.pokemon, aes(x=get("weight (kg)"), shape=1, alpha=0.5)) +
  geom_line(aes(y=get("height (m)"), color="Height (m)")) +
  geom_line(aes(y=get("hp"), color="HP")) +
  geom_line(aes(y=get("defense"), color="Defense")) +
  geom_line(aes(y=get("special defense"), color="Special Defense")) +
  labs(x="Normalized Weight (kg)", y="Z-Score of Various Variables", title="Normalized Pokemon Weight as it Compares to Various Variables") +
  scale_colour_manual(values = cbPalette) +
  scale_shape_identity()

Based on the above, it can be safely said that these four variables do not have any direct correlation with weight. What of the other three stats, then? Attack, Special Attack, and Speed? Do they have a correlation with weight?

ggplot(working.pokemon, aes(x=get("weight (kg)"), shape=1, alpha=0.02)) +
  geom_point(aes(y=get("attack"), color="Attack")) +
  geom_point(aes(y=get("special attack"), color="Special Attack")) +
  geom_point(aes(y=get("speed"), color="Speed")) +
  labs(x="Normalized Weight (kg)", y="Z-Score of Various Variables", title="Normalized Pokemon Weight as it Compares to Various Variables") +
  scale_colour_manual(values = cbPalette) +
  scale_shape_identity()

ggplot(working.pokemon, aes(x=get("weight (kg)"), shape=1, alpha=0.5)) +
  geom_line(aes(y=get("attack"), color="Attack")) +
  geom_line(aes(y=get("special attack"), color="Special Attack")) +
  geom_line(aes(y=get("speed"), color="Speed")) +
  labs(x="Normalized Weight (kg)", y="Z-Score of Various Variables", title="Normalized Pokemon Weight as it Compares to Various Variables") +
  scale_colour_manual(values = cbPalette) +
  scale_shape_identity()

As we can see, there is no correlation between attack, special attack, or speed and weight.

Do any of the types correlate with weight?

Given the massive failing of finding a stat that correlates with weight, the next check would be typing.

pokemon_type1_plot <- ggplot(working.pokemon, aes(get("weight (kg)"), group=get("type1"), color=get("type1"))) +
  geom_density() +
  labs(x="Weight (kg)", y="Density", title="Normalized Pokemon Weights by Pokemon Type", color="Types")
ggplotly(pokemon_type1_plot)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

It doesn’t appear any of the types are normally distributed by weight.

Do any of the generations correlate with weight?

Can we find a generation of Pokemon that correlates with weight, if we cannot find a type that does?

pokemon_gen_plot <- ggplot(working.pokemon, aes(get("weight (kg)"), group=get("generation"), color=get("generation"))) +
  geom_density() +
  labs(x="Normalized Weight (kg)", y="Density", title="Normalized Pokemon Weights by Generation", color="Normalized Generations")
ggplotly(pokemon_gen_plot)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

At first blush, there seems there is no correlation between weight and any generation. Looking at a scatterplot reveals…

ggplot(pokemon, aes(get("generation"), get("weight (kg)"), group=get("generation"), color=get("generation"))) +
  geom_point() +
  labs(x="Generation", y="Weight (kg)", title="Pokemon Weights by Generation", color="Generation")

… there is no correlation. Each generation has a sizeable amount of light-weight Pokemon, but no generation seems to have a direct correlation with weight.

Conclusion

There is no direct correlation between a Pokemon’s weight and their height, their stats, their type, or their generation. Further investigation into decisions made regarding Pokemon weight is needed, given that a Wailord - based off of a real-life blue whale - is around half the weight of a Mudsdale - based off of a real-life Clydesdale.

File Two: Association Golf Outing

This dataset was provided by Justin Herman on his Discussion 5 post.

Bring the file into R

Bringing the file into R is the most important of all the steps, as we cannot work with the file unless we have access to it in R in the first place.

justinFile <- "https://raw.githubusercontent.com/gabartomeo/data607-cunysps/master/Project02/justinHermanDataSet.csv"
golf <- read.csv(justinFile, encoding = "UTF-8", stringsAsFactors = F, na.strings = c("", " ", "NA"))

Reformat the CSV so it’s easier to read

This file needs a lot of work. To start with, the description of the file is inside of it, so let’s take that out and make it its own thing.

golf_desc <- names(golf)[1]
golf_desc <- paste(golf_desc, golf[1,1], golf[2,4], golf[2,6])
golf_desc <- gsub("\\.", " ", golf_desc)
golf_desc <- gsub(" {2,}", " ", golf_desc)
golf_desc <- gsub("^(\\w)", "\\U\\1", golf_desc, perl=T)
golf_desc
## [1] "Association Golf Outing Raffle Prizes as of 3/11/2018"

With that out of the way, we can remove those lines and get into the real set of data. We’ll want to remove empty rows and columns and get the proper column names set up.

golf <- golf[4:125,]
na_reduction <- function(dat) {
  dat %>%
    filter(Reduce(`+`, lapply(., is.na)) != ncol(.)) %>%
    select_if(~sum(!is.na(.)) > 0)
}
golf <- na_reduction(golf)
colnames(golf) <- golf[1,]
golf <- golf[2:length(golf[,1]),]
golf <- na_reduction(golf)
golf <- lapply(golf, trimws)
golf <- lapply(golf, gsub, pattern=" {2,}", replacement=" ", perl=T)
golf <- as_tibble(golf)

# We'll be putting the years in here... I know there's a neater way to do this.

golf_years <- filter(golf, is.na(quantity) & is.na(Cost) & is.na(Status))
golf["Year"] <- rep(NA, times=length(golf[,1]))
golf[["Year"]][(match(golf_years[["Item"]][1], golf[["Item"]])+1):(match(golf_years[["Item"]][2], golf[["Item"]])-1)] <- golf_years[["Item"]][1]
golf[["Year"]][(match(golf_years[["Item"]][2], golf[["Item"]])+1):(match(golf_years[["Item"]][3], golf[["Item"]])-1)] <- golf_years[["Item"]][2]
golf[["Year"]][(match(golf_years[["Item"]][3], golf[["Item"]])+1):(match(golf_years[["Item"]][4], golf[["Item"]])-1)] <- golf_years[["Item"]][3]
golf[["Year"]][(match(golf_years[["Item"]][4], golf[["Item"]])+1):(match(golf_years[["Item"]][5], golf[["Item"]])-1)] <- golf_years[["Item"]][4]
golf[["Year"]][(match(golf_years[["Item"]][5], golf[["Item"]])+1):(match(golf_years[["Item"]][6], golf[["Item"]])-1)] <- golf_years[["Item"]][5]
golf[["Year"]][(match(golf_years[["Item"]][6], golf[["Item"]])+1):(match(golf_years[["Item"]][7], golf[["Item"]])-1)] <- golf_years[["Item"]][6]
golf[["Year"]][(match(golf_years[["Item"]][7], golf[["Item"]])+1):(match(golf_years[["Item"]][8], golf[["Item"]])-1)] <- golf_years[["Item"]][7]
golf[["Year"]][(match(golf_years[["Item"]][8], golf[["Item"]])+1):nrow(golf)] <- golf_years[["Item"]][8]

# Removing the year rows
golf <- filter(golf, !is.na(quantity) & !is.na(Cost) & !is.na(Status))

Next we’re going to want to make the numeric columns numeric. As a note, there is an item in the Cost column that has the word “each”. Looking up the cost of golf sets told me that spending $500 for a golf set isn’t unheard of, and all the other items in the list seem to be based on total price rather than the price of each item, so the best bet for that row is to change it from “20 each” to “80”, which is 20 (each) multiplied by the quantity, 4.

golf[1, "Cost"] <- 80
golf[["quantity"]] <- as.numeric(golf[["quantity"]])
golf[["Cost"]] <- as.numeric(golf[["Cost"]])

We should pretty up the Status row as well. A little bit of googling led me to find there is a company named “Tukuru Technologies”, which seems to be miswritten or written inconsistently. While the TBD is likely important to whomever this sheet belongs, it’s the only one like it, and giving it its own dedicated column seems pointless. Given the consistency of Tukuru Technologies delivering their items (they’re all over this data set!) it’s safe to say it will be delivered. We also need to change the values “who” into “na” and the explicit column into “Joe”.

golf[["Status"]] <- gsub("[Ff]rom ", "", golf[["Status"]])
golf[["Status"]] <- gsub("Tuk.*", "Tukuru Technologies", golf[["Status"]])
golf[["Status"]][64] <- NA
golf[["Status"]][86] <- "Joe"

With that, we have a data set we can work with.

## # A tibble: 6 x 5
##   quantity Item                         Cost Status              Year 
##      <dbl> <chr>                       <dbl> <chr>               <chr>
## 1     4.00 Golf Shirts                  80.0 Tukuru Technologies 2016 
## 2     1.00 Golf Windvest                40.0 AE                  2016 
## 3     4.00 Individual Golf Clubs sets 2000   Matt B              2016 
## 4     1.00 Surface Tablet              350   Tukuru Technologies 2016 
## 5     1.00 Whitney                     300   Tukuru Technologies 2016 
## 6     1.00 Drone                       300   Tukuru Technologies 2016

Analysis

Justin suggested a couple of analysis that could be performed. They were:

  • Total cost of items raffled by year
  • Amount of items raffled by year
  • Person who donated the most items by year

Total Cost of Items Raffled by Year

golf_year_cumsum <- c(
  "2016" = colSums(filter(golf, Year==2016)["Cost"], na.rm = T),
  "2015" = colSums(filter(golf, Year==2015)["Cost"], na.rm = T),
  "2014" = colSums(filter(golf, Year==2014)["Cost"], na.rm = T),
  "2013" = colSums(filter(golf, Year==2013)["Cost"], na.rm = T),
  "2012" = colSums(filter(golf, Year==2012)["Cost"], na.rm = T),
  "2011" = colSums(filter(golf, Year==2011)["Cost"], na.rm = T),
  "2010" = colSums(filter(golf, Year==2010)["Cost"], na.rm = T),
  "2009" = colSums(filter(golf, Year==2009)["Cost"], na.rm = T)
)
golf_year_cumsum <- as.data.frame(golf_year_cumsum)
rownames(golf_year_cumsum) <- gsub("\\..*", "", rownames(golf_year_cumsum))
names(golf_year_cumsum) <- c("Total Cost in USD")
knitr::kable(golf_year_cumsum, caption="Total Cost of Raffled Items in USD by Year")
Total Cost of Raffled Items in USD by Year
Total Cost in USD
2016 5693.00
2015 1600.00
2014 2310.30
2013 2623.96
2012 4032.00
2011 3145.91
2010 2105.00
2009 620.00

Total Items Raffled by Year

golf_year_items_cumsum <- c(
  "2016" = colSums(filter(golf, Year==2016)["quantity"], na.rm = T),
  "2015" = colSums(filter(golf, Year==2015)["quantity"], na.rm = T),
  "2014" = colSums(filter(golf, Year==2014)["quantity"], na.rm = T),
  "2013" = colSums(filter(golf, Year==2013)["quantity"], na.rm = T),
  "2012" = colSums(filter(golf, Year==2012)["quantity"], na.rm = T),
  "2011" = colSums(filter(golf, Year==2011)["quantity"], na.rm = T),
  "2010" = colSums(filter(golf, Year==2010)["quantity"], na.rm = T),
  "2009" = colSums(filter(golf, Year==2009)["quantity"], na.rm = T)
)
golf_year_items_cumsum <- as.data.frame(golf_year_items_cumsum)
rownames(golf_year_items_cumsum) <- gsub("\\..*", "", rownames(golf_year_items_cumsum))
names(golf_year_items_cumsum) <- c("Number of Items Raffled")
knitr::kable(golf_year_items_cumsum, caption="Number of Items Raffled by Year")
Number of Items Raffled by Year
Number of Items Raffled
2016 18
2015 11
2014 16
2013 18
2012 24
2011 22
2010 11
2009 18

Most Generous Donor by Year

golf_year_donor_cumsum <- c(
  "2016" = count(filter(golf, Year==2016), Status, sort=T)[1,1],
  "2015" = count(filter(golf, Year==2015), Status, sort=T)[1,1],
  "2014" = count(filter(golf, Year==2014), Status, sort=T)[1,1],
  "2013" = count(filter(golf, Year==2013), Status, sort=T)[1,1],
  "2012" = count(filter(golf, Year==2012), Status, sort=T)[1,1],
  "2011" = count(filter(golf, Year==2011), Status, sort=T)[1,1],
  "2010" = count(filter(golf, Year==2010), Status, sort=T)[1,1],
  "2009" = count(filter(golf, Year==2009), Status, sort=T)[1,1]
)
names(golf_year_donor_cumsum) <- gsub("\\..*", "", names(golf_year_donor_cumsum))
golf_year_donor_cumsum <- as_tibble(golf_year_donor_cumsum)
golf_year_donor_cumsum <- gather(golf_year_donor_cumsum)
names(golf_year_donor_cumsum) <- c("Year", "Donor")
knitr::kable(golf_year_donor_cumsum, caption="Most Generous Donor by Year")
Most Generous Donor by Year
Year Donor
2016 Tukuru Technologies
2015 Matt B
2014 AE
2013 AE
2012 Friend of Joe’s
2011 AE
2010 Matt B
2009 DM

File Three: Stanford Mass Shootings in America (MSA)

This dataset was provided by Youngkoung Kim on his Discussion 5 post.

Bring the file into R

Bringing the file into R is the most important of all the steps, as we cannot work with the file unless we have access to it in R in the first place.

youngkoungFile <- "https://raw.githubusercontent.com/gabartomeo/data607-cunysps/master/Project02/youngkoungKimDataSet.csv"
msa <- read.csv(youngkoungFile, encoding = "UTF-8", stringsAsFactors = F, na.strings = c("", " ", "NA"))

Reformatting the CSV

While the CSV offers a lot of data, we do not need all the data it offers for the analysis we will be performing.

msa <- as_tibble(msa)
msa <- select(msa, Title, City, State, Average.Shooter.Age, Total.Number.of.Guns, Place.Type, Possible.Motive...General)

There’s the matter of making sure the appropriate format is applied to all the variables in the CSV.

msa[["Average.Shooter.Age"]] <- as.numeric(msa[["Average.Shooter.Age"]])
msa[["Total.Number.of.Guns"]] <- as.numeric(msa[["Total.Number.of.Guns"]])

After checking the table, I noticed some reduncies we can fix.

msa[["Place.Type"]] <- gsub("Secondary School", "Secondary school", msa[["Place.Type"]], perl=T)
msa[["Place.Type"]] <- gsub("[cC]afe.*", "cafe", msa[["Place.Type"]], perl=T)
msa[["Place.Type"]] <- gsub("Home", "home", msa[["Place.Type"]], perl=T)
msa[["Place.Type"]] <- gsub("Wildness", "Wilderness", msa[["Place.Type"]], perl=T)
msa[["Place.Type"]] <- gsub("Venue", "venue", msa[["Place.Type"]], perl=T)

Now we’re ready to work with the data.

Analysis

ggplot(msa, aes(x=get("Place.Type"), fill=get("Place.Type"))) +
  geom_bar() +
  labs(x="Places", fill="Places", title="Total Shootings by Place") +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

With this, we can see most shootings occur locally; the shooter will perform their shooting in their home or neighborhood and does not venture far from their origin.

ggplot(msa, aes(x=get("Average.Shooter.Age"), y=get("Total.Number.of.Guns"), fill=get("Average.Shooter.Age"))) +
  geom_bar(stat="identity") +
  labs(x="Average Shooter Age", y="Total Number of Guns", title="Total Number of Guns of the Average Shooter by Age")

This chart shows up until their mid-forties, a shooter is more likely to carry more than a couple of guns.

ggplot(msa, aes(x=get("State"), fill=get("State"))) +
  geom_bar() +
  labs(x="State", title="Total Number of Shootings by State", fill="State") +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

We can see here that the state with the most shootings is actually California… hypothesis, perhaps it’s Los Angeles in particular. Florida appears to come in second, with Texas coming in third. I think it might be Tampa for Florida, and perhaps San Antonio for Texas. Let’s find out.

California’s Shooting Locations

ggplot(filter(msa, State=="California"), aes(x=get("City"), fill=get("City"))) +
  geom_bar() +
  labs(x="City", fill="City") +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

This plot confirms my hypothesis that the city with the most shootings in California is indeed Los Angeles.

Florida’s Shooting Locations

ggplot(filter(msa, State=="Florida"), aes(x=get("City"), fill=get("City"))) +
  geom_bar() +
  labs(x="City", fill="City") +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

This plot confirms my hypothesis that the city with the most shootings in Florida is indeed Tampa.

Texas’s Shooting Locations

ggplot(filter(msa, State=="Texas"), aes(x=get("City"), fill=get("City"))) +
  geom_bar() +
  labs(x="City", fill="City") +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

This plot disproves my hypothesis that the city with the most shootings in Texas are in San Antonio - most shootings in Texas happen in Killeen.

Conclusion

Shooters in America are more likely to perform their shooting in their neighborhood or home, and if they’re under about 45 they’re likely to be carrying multiple guns. Shootings are more likely to occur in California, Florida, and Texas; from these states, specifically in Los Angeles, California, Tampa, Florida, and Killeen, Texas.