library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(scales)
library(stringr)
library(anytime)
library(plotly)
library(gapminder)
library(ggcorrplot)
Diane DeOcampo found a great kickstarter data source that had some data to be cleaned/tidied in order to answer a few questions. Below are a few questions I answered after tidying the dataset:
kickstarter_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Project2/kickstarter.csv')
In order to extract the category for each project, as well as the project ID, I used regular expressions to isolate the values I needed in the long object data types in columns ‘category’ and ‘profile’. I stored these values in temporary variables and then did further extractions before adding them as new columns to the kickstarter dataset. Then, I turned the UNIX date-formatted values in multiple columns to more useful formats. I used the ‘anytime’ package to help with this process.
tmp <- unlist(str_extract_all(kickstarter_data$category, '\\"name":\\"\\w+'))
tmp2 <- unlist(str_extract_all(kickstarter_data$profile, '\\"project_id"\\:\\d+'))
kickstarter_data$category_cleaned <- unlist(str_extract_all(tmp, '([^\"]+$)'))
kickstarter_data$project_id <- unlist(str_extract_all(tmp2, '([^\":]+$)'))
kickstarter_data$created_at <- anydate(kickstarter_data$created_at)
kickstarter_data$launched_at <- anydate(kickstarter_data$launched_at)
kickstarter_data$state_changed_at <- anydate(kickstarter_data$state_changed_at)
kickstarter_data$deadline <- anydate(kickstarter_data$deadline)
kickstarter_data$usd_pledged <- round(as.numeric(as.character(format(kickstarter_data$usd_pledged, scientific = FALSE))), digits = 2)
I wanted to create a cleaner version of the kickstarter dataset, removing unhelpful columns. To do this, I used the ‘select’ function, and added an additional column that calculated the amount of time each campaign was active, by doing calculations on the data variables. The following table is the result.
kickstarter_cleaned <- kickstarter_data %>%
select(project_id,
name,
category_cleaned,
created_at,
launched_at,
goal,
usd_pledged,
backers_count,
state_changed_at,
state,
staff_pick) %>%
mutate(campaign_length = state_changed_at - created_at)
head(kickstarter_cleaned, 10)
## project_id name
## 1 2396425 John Chuck & The Class Debut E.P.
## 2 3004537 Girls of Summer: Big Diamond Dreams
## 3 438930 Task No.1
## 4 2376664 Future Heroes - SXSW IS CALLING
## 5 2548344 Rhode Island Pelagic Shark Diving conservation Video fund
## 6 2877965 Gorilla my Dreams: Mime of my Life
## 7 340272 Living In A Bubble
## 8 3659475 MyMiniReport - Visual battle Report !
## 9 3334409 AIUR-Your Premium Wireless Home Theater with Gesture Control
## 10 3453077 N64 - WiiVC Controller Adapter with Full Stick Correction
## category_cleaned created_at launched_at goal usd_pledged backers_count
## 1 Hip 2016-02-27 2016-03-07 5000 5612.00 103
## 2 Documentary 2017-05-17 2017-06-06 24042 26237.00 318
## 3 Electronic 2013-01-08 2013-01-09 4000 0.00 0
## 4 Hip 2016-02-15 2016-02-23 500 1575.00 22
## 5 Documentary 2016-06-06 2016-06-06 2500 3290.00 17
## 6 Webcomics 2017-02-13 2017-10-01 1500 2962.06 177
## 7 Electronic 2012-09-13 2012-10-14 3000 3428.61 121
## 8 Apps 2019-01-31 2019-03-04 8000 12476.13 969
## 9 Gadgets 2018-03-15 2018-03-21 10000 17021.00 130
## 10 DIY 2018-08-21 2018-08-22 200 882.88 21
## state_changed_at state staff_pick campaign_length
## 1 2016-04-06 successful TRUE 39 days
## 2 2017-07-03 successful TRUE 47 days
## 3 2013-03-10 failed FALSE 61 days
## 4 2016-03-14 successful FALSE 28 days
## 5 2016-07-06 successful FALSE 30 days
## 6 2017-10-29 successful FALSE 258 days
## 7 2012-11-13 successful FALSE 61 days
## 8 2019-04-03 successful FALSE 62 days
## 9 2018-04-22 successful FALSE 38 days
## 10 2018-10-21 successful FALSE 61 days
Question #1: How do staff-picked projects influence the success of a project?
To work through this question, I used tidyr and dplyr functions to group, summarise, mutate, filter, arrange, and select the data I needed in order to find the percent successful campaigns based on whether or not campaigns were staff-picked.
staff_pick <- kickstarter_cleaned %>%
group_by(staff_pick, state) %>%
summarise(count = n()) %>%
mutate(percent_of_total_state = round(count / sum(count), digits = 4)) %>%
filter(state == 'successful') %>%
select(staff_pick, state, percent_of_total_state) %>%
arrange(desc(percent_of_total_state))
kable(staff_pick, align = rep('c', 3)) %>%
kable_styling(bootstrap_options = c("striped"), full_width = F)
| staff_pick | state | percent_of_total_state |
|---|---|---|
| TRUE | successful | 0.8732 |
| FALSE | successful | 0.4739 |
p <- staff_pick %>%
mutate(percent_formatted = percent(percent_of_total_state))
plot <- ggplot(p, aes(x = staff_pick, y = percent_of_total_state, fill=percent_of_total_state))
plot <- plot + scale_y_continuous(labels = scales::percent)
plot <- plot + theme(legend.position = "none")
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd") + ylab("Success rate") + xlab("Picked by staff?")
plot <- plot + geom_text(aes(label=percent_formatted), vjust=2.5, hjust=0.45, position = position_dodge(width = 0.9), color="white", fontface="bold")
plot
Conclusion: From the plot above, as well as the table, we can see that Kickstarter campaigns that were staff-picked had a much higher percent success rate than those that were not picked by staff.
Question #2: Which projects tend to be more successful (looking at category)?
To work through this question, I had to tidy the data in a way to calculate the success rate based on categories that I pulled from the original file. After summarizing the data similar to above, I then was able to group the data based on the categories and filter by successful campaigns and find the success rate out of the total campaigns per category. I then was interested in finding the categories that had high success rates, but also had a large number of campaigns in that category to support the high success rate (a bit of a decision tree analysis). It’s hard to judge whether or not a certain category has good success if there are only a few campaigns to analyze. Therefore, I wanted to isolate those that had a success rate above 75%, as well as more than 30 campaigns in a category, as a safer judgement to being more appealing to potential donors. You can find this data plotted below. If you click the legend, you can add/remove data points and use more of plotly’s features to examine the plot. There is also a tooltip if you hover over each point, which gives you the category information.
category_success <- kickstarter_cleaned %>%
group_by(category_cleaned, state) %>%
summarise(count = n()) %>%
mutate(percent_of_total_state = round(count / sum(count), digits = 2)) %>%
filter(state == 'successful') %>%
select(category_cleaned, state, count, percent_of_total_state) %>%
arrange(desc(percent_of_total_state), desc(count)) %>%
rename("Number of Campaigns In Category" = count, "Success Rate" = percent_of_total_state) %>%
mutate(coloration = ifelse(`Success Rate` > 0.75 & `Number of Campaigns In Category` > 30, 'Tend to Be More Successful', 'Either too few campaigns or low success rate'))
p <- ggplot(category_success, aes(x=`Number of Campaigns In Category`, y=`Success Rate`, color = coloration, text = paste('Category: ', category_cleaned))) +
geom_point(fill = "#ffffff", pch = 21, size = 2, stroke = 0.5) +
labs(x="Number of campaigns per category", y = "Success rate per category") +
scale_color_manual(values = c("#bbbbbb", "#52854C"))
p <- ggplotly(p) %>% layout(legend = list(orientation = "h", x = -0.5, y = 10))
p
All of the green dots in the plot above are also found in this chart below, arranged by success rate.
top_categories <- category_success %>%
filter(coloration == "Tend to Be More Successful") %>%
rename("Category" = category_cleaned) %>%
select(Category, `Number of Campaigns In Category`, `Success Rate`)
kable(top_categories, align = rep('c', 5)) %>%
kable_styling(bootstrap_options = c("striped"), full_width = F)
| Category | Number of Campaigns In Category | Success Rate |
|---|---|---|
| Shorts | 65 | 0.98 |
| Comic | 42 | 0.95 |
| Country | 41 | 0.95 |
| Documentary | 61 | 0.92 |
| Illustration | 60 | 0.92 |
| Crafts | 49 | 0.92 |
| Narrative | 58 | 0.91 |
| Tabletop | 57 | 0.89 |
| Video | 94 | 0.86 |
Conclusion: As you can see, campaigns that fall under the categories of ‘Shorts’, ‘Comic’, ‘Country’ and ‘Documentary’ tend to have higher success rates. My theory is that those with engaging videos, powerful illustrations, or were patriotic (i.e. “Country”), seemed to have more appeal with potential donors. No statistical tests have been done to prove these claims though, so it would be interesting to dive deeper into these analysis to see if there are true differences between categories.
Question #3: What’s the relationship between the state of the campaign and the total number of backers and length of campaign?
To work through this final question, I needed to summarize the total backers column and the campaign length column that I calculated earlier, and gather the two into one column in order to perform the visualization later.
backers_length_df <- kickstarter_cleaned %>%
group_by(state) %>%
summarise("Total Backers (Mean)" = round(as.numeric(mean(backers_count)), digits = 0), "Campaign Length in Days (Mean)" = round(as.numeric(mean(campaign_length)), digits = 0)) %>%
gather('x', 'n', 2:3) %>%
arrange(state)
kable(backers_length_df, align = rep('c', 3)) %>%
kable_styling(bootstrap_options = c("striped"), full_width = F)
| state | x | n |
|---|---|---|
| canceled | Total Backers (Mean) | 19 |
| canceled | Campaign Length in Days (Mean) | 71 |
| failed | Total Backers (Mean) | 12 |
| failed | Campaign Length in Days (Mean) | 81 |
| live | Total Backers (Mean) | 81 |
| live | Campaign Length in Days (Mean) | 36 |
| successful | Total Backers (Mean) | 221 |
| successful | Campaign Length in Days (Mean) | 83 |
| suspended | Total Backers (Mean) | 0 |
| suspended | Campaign Length in Days (Mean) | 10 |
After wrangling the data into a form that can be used for a facet bar plot, I was then able to plot the data below:
plot <- ggplot(backers_length_df, aes(x = x, y = n, fill= x))
plot <- plot + scale_y_continuous()
plot <- plot + theme(legend.position = "right", legend.title = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title = element_blank())
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd") + xlab("Campaign State")
plot <- plot + facet_grid(. ~ state)
plot <- plot + geom_text(aes(label=n), vjust=-0.25, hjust=0.50, position = position_dodge(width = 0), color="black", fontface="bold")
plot
Conclusion: We can see, based on the state of the campaign that there is a wide discrepancy in the mean number of backers as well as the mean length of a campaign (in days). Successful campaigns seem to accumulate many more total backers, compared to all other states/statuses, and the mean total backers for both failed and canceled campaigns is much lower than their mean campaign length.
Avi Adler found a great dataset on the U.S. Census website that looks at personal income data relative to education. This dataset also requires a lot of tidying/wrangling before analysis can be completed. Once that is taken care of, I’ll do my best to answer his primary question below:
Question #1: Do a comparison between income and education, investigating if there is a trend or correlation between them.
pincome_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Project2/personal_income.csv')
pincome_data_cleaned <- pincome_data %>%
slice(17:57) %>%
rename("total" = X,
"less_than_9th" = X.1,
"9th_to_12th_nongrad" = X.2,
"graduate_incl_ged" = X.3,
"some_col_no_degree" = X.4,
"assoc_degree" = X.5,
"bachelors" = X.7,
"masters" = X.8,
"professional" = X.9,
"doctorate" = X.10,
"bach_or_more" = X.6)
colnames(pincome_data_cleaned)[1] <- "characteristic"
pincome_data_cleaned$characteristic <- unlist(str_replace_all(pincome_data_cleaned$characteristic, '\\..', ""))
sapply(pincome_data_cleaned, typeof)
## characteristic total less_than_9th
## "character" "integer" "integer"
## 9th_to_12th_nongrad graduate_incl_ged some_col_no_degree
## "integer" "integer" "integer"
## assoc_degree bach_or_more bachelors
## "integer" "integer" "integer"
## masters professional doctorate
## "integer" "integer" "integer"
converttoNum_removeComma <- function(value){
tmp <- as.numeric(str_replace_all(value, "\\,", ""))
tmp2 <- tmp * 1000
return(tmp2)
}
pincome_nums <- sapply(pincome_data_cleaned[2:12], converttoNum_removeComma)
pincome_nums <- data.frame(pincome_nums)
I wanted to add in a few extra columns to help with data analysis later on. One column that I added in particular, groups the 41 different income levels into 5 different income levels.
pincome_final <- pincome_nums
pincome_final$income_group <- pincome_data_cleaned$characteristic
pincome_final <- pincome_final %>%
mutate(income_group_num = row_number())
pincome_final <- pincome_final %>%
mutate(income_subset = ifelse(income_group_num < 11, 1,
ifelse(income_group_num < 21 & income_group_num >= 11, 2,
ifelse(income_group_num < 31 & income_group_num >= 21, 3,
ifelse(income_group_num < 41 & income_group_num >= 31, 4,
ifelse(income_group_num == 41, 5, NA)))))) %>%
mutate(highschool_nongrad = less_than_9th + X9th_to_12th_nongrad,
highschool_grad = graduate_incl_ged + some_col_no_degree,
bachelors_or_more = bachelors + masters + professional + doctorate) %>%
select(income_group,
income_subset,
highschool_nongrad,
highschool_grad,
bachelors_or_more,
less_than_9th,
X9th_to_12th_nongrad,
graduate_incl_ged,
some_col_no_degree,
assoc_degree,
bachelors,
masters,
professional,
doctorate) %>%
rename("9th_to_12th_nongrad" = X9th_to_12th_nongrad)
head(pincome_final)
## income_group income_subset highschool_nongrad highschool_grad
## 1 $1 to $2,499 or loss 1 305000 1382000
## 2 $2,500 to $4,999 1 298000 1040000
## 3 $5,000 to $7,499 1 354000 1456000
## 4 $7,500 to $9,999 1 326000 1151000
## 5 $10,000 to $12,499 1 581000 2362000
## 6 $12,500 to $14,999 1 336000 1073000
## bachelors_or_more less_than_9th 9th_to_12th_nongrad graduate_incl_ged
## 1 967000 115000 190000 775000
## 2 632000 97000 201000 641000
## 3 933000 109000 245000 900000
## 4 565000 106000 220000 776000
## 5 978000 185000 396000 1536000
## 6 503000 127000 209000 711000
## some_col_no_degree assoc_degree bachelors masters professional doctorate
## 1 607000 287000 613000 289000 21000 44000
## 2 399000 241000 459000 138000 13000 22000
## 3 556000 334000 608000 265000 16000 44000
## 4 375000 224000 371000 158000 11000 25000
## 5 826000 528000 688000 236000 22000 32000
## 6 362000 206000 364000 113000 18000 8000
To make use of this data, I wanted to group the counts of individuals based on more concrete income groups. Therefore, I tidied the data using group_by, gather, spread, and mutate functions in order to do the following:
calc <- pincome_final %>%
mutate(income_group_num = row_number()) %>%
select(income_group_num, 6:14) %>%
group_by(income_group_num) %>%
gather(education, n, 2:10) %>%
arrange(education) %>%
group_by(education) %>%
mutate(prct = n / sum(n)) %>%
select(income_group_num, education, prct) %>%
mutate(income_subset = ifelse(income_group_num < 11, 1,
ifelse(income_group_num < 21 & income_group_num >= 11, 2,
ifelse(income_group_num < 31 & income_group_num >= 21, 3,
ifelse(income_group_num < 41 & income_group_num >= 31, 4,
ifelse(income_group_num == 41, 5, NA)))))) %>%
select(income_subset, education, prct) %>%
group_by(income_subset, education) %>%
summarise(sum = sum(prct)) %>%
spread(income_subset, sum) %>%
gather('income', 'percent', 2:6) %>%
mutate(income = ifelse(income == 1, 'A - Less than 25K',
ifelse(income == 2, 'B - Between 25K and 49K',
ifelse(income == 3, 'C - Between 50K and 74K',
ifelse(income ==4, 'D - Between 75K and 99K',
ifelse(income == 5, 'E - 100K or more', NA))))))
calc <- calc %>%
mutate(education = ifelse(education == 'less_than_9th', 'A - Less than 9th Grade',
ifelse(education == '9th_to_12th_nongrad', 'B - Some HS, No Diploma',
ifelse(education == 'graduate_incl_ged', 'C - HS Grad (or GED)',
ifelse(education == 'some_col_no_degree', 'D - Some Col, No Degree',
ifelse(education == 'assoc_degree', 'E - Associates Degree',
ifelse(education == 'bachelors', 'F - Bachelors Degree',
ifelse(education == 'masters', 'G - Masters Degree',
ifelse(education == 'professional', 'H - Professional Degree',
ifelse(education == 'doctorate', 'I - Doctoral Degree', NA)))))))))) %>%
arrange(education, income)
kable(calc, align = rep('c', 3)) %>%
kable_styling(bootstrap_options = c("striped"), full_width = F)
| education | income | percent |
|---|---|---|
| A - Less than 9th Grade | A - Less than 25K | 0.4896716 |
| A - Less than 9th Grade | B - Between 25K and 49K | 0.3818856 |
| A - Less than 9th Grade | C - Between 50K and 74K | 0.0937500 |
| A - Less than 9th Grade | D - Between 75K and 99K | 0.0172140 |
| A - Less than 9th Grade | E - 100K or more | 0.0174788 |
| B - Some HS, No Diploma | A - Less than 25K | 0.4924351 |
| B - Some HS, No Diploma | B - Between 25K and 49K | 0.3602484 |
| B - Some HS, No Diploma | C - Between 50K and 74K | 0.1011308 |
| B - Some HS, No Diploma | D - Between 75K and 99K | 0.0299411 |
| B - Some HS, No Diploma | E - 100K or more | 0.0162446 |
| C - HS Grad (or GED) | A - Less than 25K | 0.3076881 |
| C - HS Grad (or GED) | B - Between 25K and 49K | 0.4064963 |
| C - HS Grad (or GED) | C - Between 50K and 74K | 0.1804075 |
| C - HS Grad (or GED) | D - Between 75K and 99K | 0.0585813 |
| C - HS Grad (or GED) | E - 100K or more | 0.0468268 |
| D - Some Col, No Degree | A - Less than 25K | 0.2809503 |
| D - Some Col, No Degree | B - Between 25K and 49K | 0.3629663 |
| D - Some Col, No Degree | C - Between 50K and 74K | 0.2070604 |
| D - Some Col, No Degree | D - Between 75K and 99K | 0.0801510 |
| D - Some Col, No Degree | E - 100K or more | 0.0688721 |
| E - Associates Degree | A - Less than 25K | 0.2338505 |
| E - Associates Degree | B - Between 25K and 49K | 0.3564320 |
| E - Associates Degree | C - Between 50K and 74K | 0.2434938 |
| E - Associates Degree | D - Between 75K and 99K | 0.0921679 |
| E - Associates Degree | E - 100K or more | 0.0740558 |
| F - Bachelors Degree | A - Less than 25K | 0.1538626 |
| F - Bachelors Degree | B - Between 25K and 49K | 0.2577105 |
| F - Bachelors Degree | C - Between 50K and 74K | 0.2386980 |
| F - Bachelors Degree | D - Between 75K and 99K | 0.1391759 |
| F - Bachelors Degree | E - 100K or more | 0.2105530 |
| G - Masters Degree | A - Less than 25K | 0.1189611 |
| G - Masters Degree | B - Between 25K and 49K | 0.1726932 |
| G - Masters Degree | C - Between 50K and 74K | 0.2501340 |
| G - Masters Degree | D - Between 75K and 99K | 0.1648895 |
| G - Masters Degree | E - 100K or more | 0.2933222 |
| H - Professional Degree | A - Less than 25K | 0.0737802 |
| H - Professional Degree | B - Between 25K and 49K | 0.0991670 |
| H - Professional Degree | C - Between 50K and 74K | 0.1582705 |
| H - Professional Degree | D - Between 75K and 99K | 0.1233637 |
| H - Professional Degree | E - 100K or more | 0.5454185 |
| I - Doctoral Degree | A - Less than 25K | 0.0901526 |
| I - Doctoral Degree | B - Between 25K and 49K | 0.1131761 |
| I - Doctoral Degree | C - Between 50K and 74K | 0.1597781 |
| I - Doctoral Degree | D - Between 75K and 99K | 0.1675451 |
| I - Doctoral Degree | E - 100K or more | 0.4693481 |
Above is my much tidyer table that summarizes some important data! We can now see the percent of individuals that make up each income group broken out by education level. In order to work more with tidyr and dplyr, I decided to present this data in a transposed table. See below for my syntax on creating a transposed table of this information (back to wide version):
calc_wide <- calc %>%
spread(income, percent) %>%
rename("Education Level" = education,
"Less than $25K" = `A - Less than 25K`,
"Between $25K and $49K" = `B - Between 25K and 49K`,
"Between $50K and $74K" = `C - Between 50K and 74K`,
"Between $75K and $99K" = `D - Between 75K and 99K`,
"$100K or more" = `E - 100K or more`) %>%
mutate(`Less than $25K` = paste0(format((`Less than $25K` * 100), digits = 2), '%'),
`Between $25K and $49K` = paste0(format((`Between $25K and $49K` * 100), digits = 2), '%'),
`Between $50K and $74K` = paste0(format((`Between $50K and $74K` * 100), digits = 2), '%'),
`Between $75K and $99K` = paste0(format((`Between $75K and $99K` * 100), digits = 2), '%'),
`$100K or more` = paste0(format((`$100K or more` * 100), digits = 2), '%')) %>%
mutate(`Education Level` = str_replace_all(`Education Level`, '\\w+\\s+\\-\\s+', ''))
kable(calc_wide, align = rep('c', 3)) %>%
kable_styling(bootstrap_options = c("striped"), full_width = F)
| Education Level | Less than $25K | Between $25K and $49K | Between $50K and $74K | Between $75K and $99K | $100K or more |
|---|---|---|---|---|---|
| Less than 9th Grade | 49.0% | 38.2% | 9.4% | 1.7% | 1.7% |
| Some HS, No Diploma | 49.2% | 36.0% | 10.1% | 3.0% | 1.6% |
| HS Grad (or GED) | 30.8% | 40.6% | 18.0% | 5.9% | 4.7% |
| Some Col, No Degree | 28.1% | 36.3% | 20.7% | 8.0% | 6.9% |
| Associates Degree | 23.4% | 35.6% | 24.3% | 9.2% | 7.4% |
| Bachelors Degree | 15.4% | 25.8% | 23.9% | 13.9% | 21.1% |
| Masters Degree | 11.9% | 17.3% | 25.0% | 16.5% | 29.3% |
| Professional Degree | 7.4% | 9.9% | 15.8% | 12.3% | 54.5% |
| Doctoral Degree | 9.0% | 11.3% | 16.0% | 16.8% | 46.9% |
Although both of these tables, more particularly the second one, is a good way to summarize the data, I thought it would be helpful to visualize this by education level in a bar plot.
plot <- ggplot(calc, aes(x = income, y = percent, fill= income))
plot <- plot + scale_y_continuous()
plot <- plot + theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title = element_blank())
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd")
plot <- plot + facet_wrap(education ~ .)
plot
Conclusion: From looking at the bar chart above, we can see that there is a trend between level of education and income. It appears that those with higher degrees, including either a Master’s degree, Professional degree, or Doctorate, seem to show a higher proportion of individuals have personal incomes greater than $50,000. Interestingly, it also looks like those with a professional degree have a higher proportion that make over 100,000 dollars relative to those that have a doctorate.
Bryan Persaud found a dataset that brought back some nostalgia for the days that I played Pokemon! I decided to utilize this example for my final tidying task.
Question #1: Do a comparison between the strongest and weakest pokemon between two different types (i.e. Grass, Fire, Water, etc.)
pokemon_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Project2/pokemon.csv')
First, since I’m only familiar with the 1st Generation of Pokemon, I decided to use that dataset for this analysis. After pulling this in from GitHub, I noticed there were a few duplicate listings of Pokemon – mainly those that were split into “Mega” versions of themselves. Because I’d like to look at this solely from the perspective of the original 150 Pokemon, I decided to remove these duplicate listings. Below, I filtered these out using another regular expression:
head(pokemon_data, 10)
## Number Name Type1 Type2 Total HP Attack Defense
## 1 1 Bulbasaur Grass Poison 318 45 49 49
## 2 2 Ivysaur Grass Poison 405 60 62 63
## 3 3 Venusaur Grass Poison 525 80 82 83
## 4 3 VenusaurMega Venusaur Grass Poison 625 80 100 123
## 5 4 Charmander Fire 309 39 52 43
## 6 5 Charmeleon Fire 405 58 64 58
## 7 6 Charizard Fire Flying 534 78 84 78
## 8 6 CharizardMega Charizard X Fire Dragon 634 78 130 111
## 9 6 CharizardMega Charizard Y Fire Flying 634 78 104 78
## 10 7 Squirtle Water 314 44 48 65
## SpecialAtk SpecialDef Speed Generation Legendary
## 1 65 65 45 1 False
## 2 80 80 60 1 False
## 3 100 100 80 1 False
## 4 122 120 80 1 False
## 5 60 50 65 1 False
## 6 80 65 80 1 False
## 7 109 85 100 1 False
## 8 130 85 100 1 False
## 9 159 115 100 1 False
## 10 50 64 43 1 False
pokemon_data <- pokemon_data %>%
filter(!str_detect(Name, '\\Mega'))
head(pokemon_data, 10)
## Number Name Type1 Type2 Total HP Attack Defense SpecialAtk
## 1 1 Bulbasaur Grass Poison 318 45 49 49 65
## 2 2 Ivysaur Grass Poison 405 60 62 63 80
## 3 3 Venusaur Grass Poison 525 80 82 83 100
## 4 4 Charmander Fire 309 39 52 43 60
## 5 5 Charmeleon Fire 405 58 64 58 80
## 6 6 Charizard Fire Flying 534 78 84 78 109
## 7 7 Squirtle Water 314 44 48 65 50
## 8 8 Wartortle Water 405 59 63 80 65
## 9 9 Blastoise Water 530 79 83 100 85
## 10 10 Caterpie Bug 195 45 30 35 20
## SpecialDef Speed Generation Legendary
## 1 65 45 1 False
## 2 80 60 1 False
## 3 100 80 1 False
## 4 50 65 1 False
## 5 65 80 1 False
## 6 85 100 1 False
## 7 64 43 1 False
## 8 80 58 1 False
## 9 105 78 1 False
## 10 20 45 1 False
We can see from the two dataframes that the “Mega” versions of pokemon have been removed.
Then, I thought it would be beneficial to select relevant columns for this comparison, and group them based on their main type and total HP. The question then would like us to identify the Pokemon with the highest and lowest HP in each type. Do do this, I used group by and mutate a few times in order calculate the maximum and minimum HP per pokemon type:
best_worst_of_type <- pokemon_data %>%
select(Name, Type1, Total) %>%
arrange(Type1, Total) %>%
group_by(Type1) %>%
mutate(Highest_HP = max(Total),
Lowest_HP = min(Total)) %>%
ungroup() %>%
mutate(Reason = (Total == Highest_HP) | (Total == Lowest_HP)) %>%
filter(Reason == TRUE) %>%
mutate(Reason = ifelse(Total == Highest_HP, 'Highest HP', 'Lowest HP')) %>%
select(Name, Type1, Reason, Total) %>%
rename("Pokemon" = Name,
"Type" = Type1,
"Total HP" = Total)
t1 <- best_worst_of_type %>%
filter(Reason == 'Lowest HP')
t2 <- best_worst_of_type %>%
filter(Reason == 'Highest HP')
kable(t1, align = 'clc', caption = 'Pokemon w/ Lowest HP in Group') %>%
kable_styling(bootstrap_options = c('striped'), full_width = F)
| Pokemon | Type | Reason | Total HP |
|---|---|---|---|
| Caterpie | Bug | Lowest HP | 195 |
| Weedle | Bug | Lowest HP | 195 |
| Dratini | Dragon | Lowest HP | 300 |
| Pikachu | Electric | Lowest HP | 320 |
| Clefairy | Fairy | Lowest HP | 323 |
| Mankey | Fighting | Lowest HP | 305 |
| Machop | Fighting | Lowest HP | 305 |
| Vulpix | Fire | Lowest HP | 299 |
| Gastly | Ghost | Lowest HP | 310 |
| Bellsprout | Grass | Lowest HP | 300 |
| Diglett | Ground | Lowest HP | 265 |
| Jynx | Ice | Lowest HP | 455 |
| Pidgey | Normal | Lowest HP | 251 |
| Zubat | Poison | Lowest HP | 245 |
| Abra | Psychic | Lowest HP | 310 |
| Geodude | Rock | Lowest HP | 300 |
| Magikarp | Water | Lowest HP | 200 |
kable(t2, align = 'clc', caption = 'Pokemon w/ Highest HP in Group') %>%
kable_styling(bootstrap_options = c('striped'), full_width = F)
| Pokemon | Type | Reason | Total HP |
|---|---|---|---|
| Scyther | Bug | Highest HP | 500 |
| Pinsir | Bug | Highest HP | 500 |
| Dragonite | Dragon | Highest HP | 600 |
| Zapdos | Electric | Highest HP | 580 |
| Clefable | Fairy | Highest HP | 483 |
| Machamp | Fighting | Highest HP | 505 |
| Moltres | Fire | Highest HP | 580 |
| Gengar | Ghost | Highest HP | 500 |
| Venusaur | Grass | Highest HP | 525 |
| Rhydon | Ground | Highest HP | 485 |
| Articuno | Ice | Highest HP | 580 |
| Snorlax | Normal | Highest HP | 540 |
| Nidoqueen | Poison | Highest HP | 505 |
| Nidoking | Poison | Highest HP | 505 |
| Mewtwo | Psychic | Highest HP | 680 |
| Aerodactyl | Rock | Highest HP | 515 |
| Gyarados | Water | Highest HP | 540 |
As we can see from above, all of these first-generation Pokemon that have the highest HP for their group are evolved forms of less-evolved Pokemon. Conversely, the Pokemon with the lowest HP for their group are mostly primitive forms of Pokemon (some of the first you’ll encounter in the wild - I bumped into too many Pidgeys when I played the game!!). This makes sense, given that in the game the goal as a trainer is to increase your Pokemon’s hit points (HP) and evolve your pokemon in order to be resilient against other Pokemon in battles. Now, we can do a comparison between the Water-type pokemon and Psychic-type pokemon to complete the question.
pokemon_filter_for_analysis <- best_worst_of_type %>%
filter(Type == 'Water' | Type == 'Psychic') %>%
arrange(`Total HP`)
plot <- ggplot(pokemon_filter_for_analysis, aes(x = Pokemon, y = `Total HP`, fill= Reason))
plot <- plot + scale_y_continuous()
plot <- plot + theme(legend.position = "right", axis.title = element_blank())
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd")
plot <- plot + facet_wrap(Type ~ ., scales = "free_x")
plot
Conclusion: As we can see here when comparing the Pokemon with the highest and lowest hit points from the Psychic and Water types, we find that Abra doesn’t have as low of an HP as Magikarp, which is the pokemon with the lowest HP for a Water-type. Additionally, Mewtwo has a higher HP than Gyarados, who has the highest HP of all Water-type pokemon. Therefore, when comparing the pokemon with the highest and lowest HPs between the water-types and psychic-types, we can see that psychic types have a higher HP for both its lowest- and highest-rated pokemon relative to water-type pokemon.