Libraries Used in Project #2

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(scales)
library(stringr)
library(anytime)
library(plotly)
library(gapminder)
library(ggcorrplot)

Dataset #1 - Kickstarter

Questions to Answer

Diane DeOcampo found a great kickstarter data source that had some data to be cleaned/tidied in order to answer a few questions. Below are a few questions I answered after tidying the dataset:

How do staff-picked projects influence the success of a project?
Which projects tend to be more successful (looking at category)?
What’s the relationship between the state of the campaign and the total number of backers and length of campaign?

Reading the data from csv into R

kickstarter_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Project2/kickstarter.csv')

Using regular expressions and functions to fix data formats and extract important data from messy columns

In order to extract the category for each project, as well as the project ID, I used regular expressions to isolate the values I needed in the long object data types in columns ‘category’ and ‘profile’. I stored these values in temporary variables and then did further extractions before adding them as new columns to the kickstarter dataset. Then, I turned the UNIX date-formatted values in multiple columns to more useful formats. I used the ‘anytime’ package to help with this process.

tmp <- unlist(str_extract_all(kickstarter_data$category, '\\"name":\\"\\w+'))
tmp2 <- unlist(str_extract_all(kickstarter_data$profile, '\\"project_id"\\:\\d+'))
kickstarter_data$category_cleaned <- unlist(str_extract_all(tmp, '([^\"]+$)'))
kickstarter_data$project_id <- unlist(str_extract_all(tmp2, '([^\":]+$)'))
kickstarter_data$created_at <- anydate(kickstarter_data$created_at)
kickstarter_data$launched_at <- anydate(kickstarter_data$launched_at)
kickstarter_data$state_changed_at <- anydate(kickstarter_data$state_changed_at)
kickstarter_data$deadline <- anydate(kickstarter_data$deadline)
kickstarter_data$usd_pledged <- round(as.numeric(as.character(format(kickstarter_data$usd_pledged, scientific = FALSE))), digits = 2)

Creating a clean kickstarter dataset

I wanted to create a cleaner version of the kickstarter dataset, removing unhelpful columns. To do this, I used the ‘select’ function, and added an additional column that calculated the amount of time each campaign was active, by doing calculations on the data variables. The following table is the result.

kickstarter_cleaned <- kickstarter_data %>% 
  select(project_id, 
         name, 
         category_cleaned, 
         created_at, 
         launched_at,
         goal,
         usd_pledged, 
         backers_count,
         state_changed_at,
         state,
         staff_pick) %>% 
  mutate(campaign_length = state_changed_at - created_at)

head(kickstarter_cleaned, 10)

##    project_id                                                         name
## 1     2396425                            John Chuck & The Class Debut E.P.
## 2     3004537                          Girls of Summer: Big Diamond Dreams
## 3      438930                                                    Task No.1
## 4     2376664                              Future Heroes - SXSW IS CALLING
## 5     2548344    Rhode Island Pelagic Shark Diving conservation Video fund
## 6     2877965                           Gorilla my Dreams: Mime of my Life
## 7      340272                                           Living In A Bubble
## 8     3659475                        MyMiniReport - Visual battle Report !
## 9     3334409 AIUR-Your Premium Wireless Home Theater with Gesture Control
## 10    3453077    N64 - WiiVC Controller Adapter with Full Stick Correction
##    category_cleaned created_at launched_at  goal usd_pledged backers_count
## 1               Hip 2016-02-27  2016-03-07  5000     5612.00           103
## 2       Documentary 2017-05-17  2017-06-06 24042    26237.00           318
## 3        Electronic 2013-01-08  2013-01-09  4000        0.00             0
## 4               Hip 2016-02-15  2016-02-23   500     1575.00            22
## 5       Documentary 2016-06-06  2016-06-06  2500     3290.00            17
## 6         Webcomics 2017-02-13  2017-10-01  1500     2962.06           177
## 7        Electronic 2012-09-13  2012-10-14  3000     3428.61           121
## 8              Apps 2019-01-31  2019-03-04  8000    12476.13           969
## 9           Gadgets 2018-03-15  2018-03-21 10000    17021.00           130
## 10              DIY 2018-08-21  2018-08-22   200      882.88            21
##    state_changed_at      state staff_pick campaign_length
## 1        2016-04-06 successful       TRUE         39 days
## 2        2017-07-03 successful       TRUE         47 days
## 3        2013-03-10     failed      FALSE         61 days
## 4        2016-03-14 successful      FALSE         28 days
## 5        2016-07-06 successful      FALSE         30 days
## 6        2017-10-29 successful      FALSE        258 days
## 7        2012-11-13 successful      FALSE         61 days
## 8        2019-04-03 successful      FALSE         62 days
## 9        2018-04-22 successful      FALSE         38 days
## 10       2018-10-21 successful      FALSE         61 days

Question #1: How do staff-picked projects influence the success of a project?

To work through this question, I used tidyr and dplyr functions to group, summarise, mutate, filter, arrange, and select the data I needed in order to find the percent successful campaigns based on whether or not campaigns were staff-picked.

staff_pick <- kickstarter_cleaned %>% 
  group_by(staff_pick, state) %>% 
  summarise(count = n()) %>% 
  mutate(percent_of_total_state = round(count / sum(count), digits = 4)) %>% 
  filter(state == 'successful') %>% 
  select(staff_pick, state, percent_of_total_state) %>% 
  arrange(desc(percent_of_total_state))

kable(staff_pick, align = rep('c', 3)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)

staff_pick	state	percent_of_total_state
TRUE	successful	0.8732
FALSE	successful	0.4739

p <- staff_pick %>% 
  mutate(percent_formatted = percent(percent_of_total_state))

plot <- ggplot(p, aes(x = staff_pick, y = percent_of_total_state, fill=percent_of_total_state))
plot <- plot + scale_y_continuous(labels = scales::percent)
plot <- plot + theme(legend.position = "none")
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd") + ylab("Success rate") + xlab("Picked by staff?")
plot <- plot + geom_text(aes(label=percent_formatted), vjust=2.5, hjust=0.45, position = position_dodge(width = 0.9), color="white", fontface="bold")
plot

Conclusion: From the plot above, as well as the table, we can see that Kickstarter campaigns that were staff-picked had a much higher percent success rate than those that were not picked by staff.

Question #2: Which projects tend to be more successful (looking at category)?

To work through this question, I had to tidy the data in a way to calculate the success rate based on categories that I pulled from the original file. After summarizing the data similar to above, I then was able to group the data based on the categories and filter by successful campaigns and find the success rate out of the total campaigns per category. I then was interested in finding the categories that had high success rates, but also had a large number of campaigns in that category to support the high success rate (a bit of a decision tree analysis). It’s hard to judge whether or not a certain category has good success if there are only a few campaigns to analyze. Therefore, I wanted to isolate those that had a success rate above 75%, as well as more than 30 campaigns in a category, as a safer judgement to being more appealing to potential donors. You can find this data plotted below. If you click the legend, you can add/remove data points and use more of plotly’s features to examine the plot. There is also a tooltip if you hover over each point, which gives you the category information.

category_success <- kickstarter_cleaned %>% 
  group_by(category_cleaned, state) %>% 
  summarise(count = n()) %>% 
  mutate(percent_of_total_state = round(count / sum(count), digits = 2)) %>% 
  filter(state == 'successful') %>% 
  select(category_cleaned, state, count, percent_of_total_state) %>% 
  arrange(desc(percent_of_total_state), desc(count)) %>% 
  rename("Number of Campaigns In Category" = count, "Success Rate" = percent_of_total_state) %>% 
  mutate(coloration = ifelse(`Success Rate` > 0.75 & `Number of Campaigns In Category` > 30, 'Tend to Be More Successful', 'Either too few campaigns or low success rate'))

p <- ggplot(category_success, aes(x=`Number of Campaigns In Category`, y=`Success Rate`, color = coloration, text = paste('Category: ', category_cleaned))) +
  geom_point(fill = "#ffffff", pch = 21, size = 2, stroke = 0.5) +
  labs(x="Number of campaigns per category", y = "Success rate per category") +
  scale_color_manual(values = c("#bbbbbb", "#52854C"))

p <- ggplotly(p) %>% layout(legend = list(orientation = "h", x = -0.5, y = 10))

p

All of the green dots in the plot above are also found in this chart below, arranged by success rate.

top_categories <- category_success %>% 
  filter(coloration == "Tend to Be More Successful") %>% 
  rename("Category" = category_cleaned) %>% 
  select(Category, `Number of Campaigns In Category`, `Success Rate`)

kable(top_categories, align = rep('c', 5)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)

Category	Number of Campaigns In Category	Success Rate
Shorts	65	0.98
Comic	42	0.95
Country	41	0.95
Documentary	61	0.92
Illustration	60	0.92
Crafts	49	0.92
Narrative	58	0.91
Tabletop	57	0.89
Video	94	0.86

Conclusion: As you can see, campaigns that fall under the categories of ‘Shorts’, ‘Comic’, ‘Country’ and ‘Documentary’ tend to have higher success rates. My theory is that those with engaging videos, powerful illustrations, or were patriotic (i.e. “Country”), seemed to have more appeal with potential donors. No statistical tests have been done to prove these claims though, so it would be interesting to dive deeper into these analysis to see if there are true differences between categories.

Question #3: What’s the relationship between the state of the campaign and the total number of backers and length of campaign?

To work through this final question, I needed to summarize the total backers column and the campaign length column that I calculated earlier, and gather the two into one column in order to perform the visualization later.

backers_length_df <- kickstarter_cleaned %>% 
  group_by(state) %>%   
  summarise("Total Backers (Mean)" = round(as.numeric(mean(backers_count)), digits = 0), "Campaign Length in Days (Mean)" =   round(as.numeric(mean(campaign_length)), digits = 0)) %>% 
  gather('x', 'n', 2:3) %>% 
  arrange(state)

kable(backers_length_df, align = rep('c', 3)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)

state	x	n
canceled	Total Backers (Mean)	19
canceled	Campaign Length in Days (Mean)	71
failed	Total Backers (Mean)	12
failed	Campaign Length in Days (Mean)	81
live	Total Backers (Mean)	81
live	Campaign Length in Days (Mean)	36
successful	Total Backers (Mean)	221
successful	Campaign Length in Days (Mean)	83
suspended	Total Backers (Mean)	0
suspended	Campaign Length in Days (Mean)	10

After wrangling the data into a form that can be used for a facet bar plot, I was then able to plot the data below:

plot <- ggplot(backers_length_df, aes(x = x, y = n, fill= x))
plot <- plot + scale_y_continuous()
plot <- plot + theme(legend.position = "right", legend.title = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title = element_blank())
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd") + xlab("Campaign State")
plot <- plot + facet_grid(. ~ state)
plot <- plot + geom_text(aes(label=n), vjust=-0.25, hjust=0.50, position = position_dodge(width = 0), color="black", fontface="bold")
plot

Conclusion: We can see, based on the state of the campaign that there is a wide discrepancy in the mean number of backers as well as the mean length of a campaign (in days). Successful campaigns seem to accumulate many more total backers, compared to all other states/statuses, and the mean total backers for both failed and canceled campaigns is much lower than their mean campaign length.

Dataset #2 - Personal Income

Questions to Answer

Avi Adler found a great dataset on the U.S. Census website that looks at personal income data relative to education. This dataset also requires a lot of tidying/wrangling before analysis can be completed. Once that is taken care of, I’ll do my best to answer his primary question below:

Question #1: Do a comparison between income and education, investigating if there is a trend or correlation between them.

Reading the data from csv into R

pincome_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Project2/personal_income.csv')

Cleaning the dataset, removing unwanted columns and rows

pincome_data_cleaned <- pincome_data %>% 
  slice(17:57) %>% 
  rename("total" = X,
         "less_than_9th" = X.1,
         "9th_to_12th_nongrad" = X.2,
         "graduate_incl_ged" = X.3,
         "some_col_no_degree" = X.4,
         "assoc_degree" = X.5,
         "bachelors" = X.7,
         "masters" = X.8,
         "professional" = X.9,
         "doctorate" = X.10,
         "bach_or_more" = X.6)

colnames(pincome_data_cleaned)[1] <- "characteristic"

pincome_data_cleaned$characteristic <- unlist(str_replace_all(pincome_data_cleaned$characteristic, '\\..', ""))
sapply(pincome_data_cleaned, typeof)

##      characteristic               total       less_than_9th 
##         "character"           "integer"           "integer" 
## 9th_to_12th_nongrad   graduate_incl_ged  some_col_no_degree 
##           "integer"           "integer"           "integer" 
##        assoc_degree        bach_or_more           bachelors 
##           "integer"           "integer"           "integer" 
##             masters        professional           doctorate 
##           "integer"           "integer"           "integer"

Removing the commas and changing data types to numeric

converttoNum_removeComma <- function(value){
  tmp <- as.numeric(str_replace_all(value, "\\,", ""))
  tmp2 <- tmp * 1000
  return(tmp2)
}

pincome_nums <- sapply(pincome_data_cleaned[2:12], converttoNum_removeComma)
pincome_nums <- data.frame(pincome_nums)

Selecting columns, making new columns, and saving to a cleaner data frame

I wanted to add in a few extra columns to help with data analysis later on. One column that I added in particular, groups the 41 different income levels into 5 different income levels.

pincome_final <- pincome_nums
pincome_final$income_group <- pincome_data_cleaned$characteristic

pincome_final <- pincome_final %>% 
  mutate(income_group_num = row_number())

pincome_final <- pincome_final %>%
  mutate(income_subset = ifelse(income_group_num < 11, 1,
       ifelse(income_group_num < 21 & income_group_num >= 11, 2,
       ifelse(income_group_num < 31 & income_group_num >= 21, 3,
       ifelse(income_group_num < 41 & income_group_num >= 31, 4,
       ifelse(income_group_num == 41, 5, NA)))))) %>% 
  mutate(highschool_nongrad = less_than_9th + X9th_to_12th_nongrad,
         highschool_grad = graduate_incl_ged + some_col_no_degree,
         bachelors_or_more = bachelors + masters + professional + doctorate) %>% 
  select(income_group,
         income_subset,
         highschool_nongrad,
         highschool_grad,
         bachelors_or_more,
         less_than_9th,
         X9th_to_12th_nongrad,
         graduate_incl_ged,
         some_col_no_degree,
         assoc_degree,
         bachelors,
         masters,
         professional,
         doctorate) %>%
  rename("9th_to_12th_nongrad" = X9th_to_12th_nongrad)

head(pincome_final)

##           income_group income_subset highschool_nongrad highschool_grad
## 1 $1 to $2,499 or loss             1             305000         1382000
## 2     $2,500 to $4,999             1             298000         1040000
## 3     $5,000 to $7,499             1             354000         1456000
## 4     $7,500 to $9,999             1             326000         1151000
## 5   $10,000 to $12,499             1             581000         2362000
## 6   $12,500 to $14,999             1             336000         1073000
##   bachelors_or_more less_than_9th 9th_to_12th_nongrad graduate_incl_ged
## 1            967000        115000              190000            775000
## 2            632000         97000              201000            641000
## 3            933000        109000              245000            900000
## 4            565000        106000              220000            776000
## 5            978000        185000              396000           1536000
## 6            503000        127000              209000            711000
##   some_col_no_degree assoc_degree bachelors masters professional doctorate
## 1             607000       287000    613000  289000        21000     44000
## 2             399000       241000    459000  138000        13000     22000
## 3             556000       334000    608000  265000        16000     44000
## 4             375000       224000    371000  158000        11000     25000
## 5             826000       528000    688000  236000        22000     32000
## 6             362000       206000    364000  113000        18000      8000

Wrangling data from wide to tall

To make use of this data, I wanted to group the counts of individuals based on more concrete income groups. Therefore, I tidied the data using group_by, gather, spread, and mutate functions in order to do the following:

Added a value column based on the income level rank (1 = lowest income group, 41 = highest income group)
Selected the appropriate columns, and then gathered the data to condense the values into a tall structured data frame
Then, I found the percent for each income level rank out of the total for each education group (i.e. associates degree, doctorate, etc.)
Next, I grouped the income levels into smaller subsets (5 different subsets), in order to simplify the income brackets a bit - this was copied from above
Then, by summarizing the data, I summed the percents and gathered the data to show the sum of the percents by education level and modified income level (5 levels of income at this point)
After this, I fixed up the labels in the modified table, and prepared it for my bar plot below.

calc <- pincome_final %>% 
  mutate(income_group_num = row_number()) %>% 
  select(income_group_num, 6:14) %>% 
  group_by(income_group_num) %>% 
  gather(education, n, 2:10) %>%
  arrange(education) %>% 
  group_by(education) %>% 
  mutate(prct = n / sum(n)) %>% 
  select(income_group_num, education, prct) %>% 
  mutate(income_subset = ifelse(income_group_num < 11, 1,
       ifelse(income_group_num < 21 & income_group_num >= 11, 2,
       ifelse(income_group_num < 31 & income_group_num >= 21, 3,
       ifelse(income_group_num < 41 & income_group_num >= 31, 4,
       ifelse(income_group_num == 41, 5, NA)))))) %>% 
  select(income_subset, education, prct) %>% 
  group_by(income_subset, education) %>% 
  summarise(sum = sum(prct)) %>% 
  spread(income_subset, sum) %>% 
  gather('income', 'percent', 2:6) %>% 
  mutate(income = ifelse(income == 1, 'A - Less than 25K',
         ifelse(income == 2, 'B - Between 25K and 49K',
         ifelse(income == 3, 'C - Between 50K and 74K',
         ifelse(income ==4, 'D - Between 75K and 99K',
         ifelse(income == 5, 'E - 100K or more', NA))))))

calc <- calc %>% 
  mutate(education = ifelse(education == 'less_than_9th', 'A - Less than 9th Grade',
        ifelse(education == '9th_to_12th_nongrad', 'B - Some HS, No Diploma',
        ifelse(education == 'graduate_incl_ged', 'C - HS Grad (or GED)',
        ifelse(education == 'some_col_no_degree', 'D - Some Col, No Degree',
        ifelse(education == 'assoc_degree', 'E - Associates Degree',
        ifelse(education == 'bachelors', 'F - Bachelors Degree',
        ifelse(education == 'masters', 'G - Masters Degree',
        ifelse(education == 'professional', 'H - Professional Degree',
        ifelse(education == 'doctorate', 'I - Doctoral Degree', NA)))))))))) %>% 
  arrange(education, income)


kable(calc, align = rep('c', 3)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)

education	income	percent
A - Less than 9th Grade	A - Less than 25K	0.4896716
A - Less than 9th Grade	B - Between 25K and 49K	0.3818856
A - Less than 9th Grade	C - Between 50K and 74K	0.0937500
A - Less than 9th Grade	D - Between 75K and 99K	0.0172140
A - Less than 9th Grade	E - 100K or more	0.0174788
B - Some HS, No Diploma	A - Less than 25K	0.4924351
B - Some HS, No Diploma	B - Between 25K and 49K	0.3602484
B - Some HS, No Diploma	C - Between 50K and 74K	0.1011308
B - Some HS, No Diploma	D - Between 75K and 99K	0.0299411
B - Some HS, No Diploma	E - 100K or more	0.0162446
C - HS Grad (or GED)	A - Less than 25K	0.3076881
C - HS Grad (or GED)	B - Between 25K and 49K	0.4064963
C - HS Grad (or GED)	C - Between 50K and 74K	0.1804075
C - HS Grad (or GED)	D - Between 75K and 99K	0.0585813
C - HS Grad (or GED)	E - 100K or more	0.0468268
D - Some Col, No Degree	A - Less than 25K	0.2809503
D - Some Col, No Degree	B - Between 25K and 49K	0.3629663
D - Some Col, No Degree	C - Between 50K and 74K	0.2070604
D - Some Col, No Degree	D - Between 75K and 99K	0.0801510
D - Some Col, No Degree	E - 100K or more	0.0688721
E - Associates Degree	A - Less than 25K	0.2338505
E - Associates Degree	B - Between 25K and 49K	0.3564320
E - Associates Degree	C - Between 50K and 74K	0.2434938
E - Associates Degree	D - Between 75K and 99K	0.0921679
E - Associates Degree	E - 100K or more	0.0740558
F - Bachelors Degree	A - Less than 25K	0.1538626
F - Bachelors Degree	B - Between 25K and 49K	0.2577105
F - Bachelors Degree	C - Between 50K and 74K	0.2386980
F - Bachelors Degree	D - Between 75K and 99K	0.1391759
F - Bachelors Degree	E - 100K or more	0.2105530
G - Masters Degree	A - Less than 25K	0.1189611
G - Masters Degree	B - Between 25K and 49K	0.1726932
G - Masters Degree	C - Between 50K and 74K	0.2501340
G - Masters Degree	D - Between 75K and 99K	0.1648895
G - Masters Degree	E - 100K or more	0.2933222
H - Professional Degree	A - Less than 25K	0.0737802
H - Professional Degree	B - Between 25K and 49K	0.0991670
H - Professional Degree	C - Between 50K and 74K	0.1582705
H - Professional Degree	D - Between 75K and 99K	0.1233637
H - Professional Degree	E - 100K or more	0.5454185
I - Doctoral Degree	A - Less than 25K	0.0901526
I - Doctoral Degree	B - Between 25K and 49K	0.1131761
I - Doctoral Degree	C - Between 50K and 74K	0.1597781
I - Doctoral Degree	D - Between 75K and 99K	0.1675451
I - Doctoral Degree	E - 100K or more	0.4693481

Above is my much tidyer table that summarizes some important data! We can now see the percent of individuals that make up each income group broken out by education level. In order to work more with tidyr and dplyr, I decided to present this data in a transposed table. See below for my syntax on creating a transposed table of this information (back to wide version):

calc_wide <- calc %>% 
  spread(income, percent) %>% 
  rename("Education Level" = education,
         "Less than $25K" = `A - Less than 25K`,
         "Between $25K and $49K" = `B - Between 25K and 49K`,
         "Between $50K and $74K" = `C - Between 50K and 74K`,
         "Between $75K and $99K" = `D - Between 75K and 99K`,
         "$100K or more" = `E - 100K or more`) %>% 
  mutate(`Less than $25K` = paste0(format((`Less than $25K` * 100), digits = 2), '%'),
         `Between $25K and $49K` = paste0(format((`Between $25K and $49K` * 100), digits = 2), '%'),
         `Between $50K and $74K` = paste0(format((`Between $50K and $74K` * 100), digits = 2), '%'),
         `Between $75K and $99K` = paste0(format((`Between $75K and $99K` * 100), digits = 2), '%'),
         `$100K or more` = paste0(format((`$100K or more` * 100), digits = 2), '%')) %>% 
  mutate(`Education Level` = str_replace_all(`Education Level`, '\\w+\\s+\\-\\s+', ''))

kable(calc_wide, align = rep('c', 3)) %>% 
  kable_styling(bootstrap_options = c("striped"), full_width = F)

Education Level	Less than $25K	Between $25K and $49K	Between $50K and $74K	Between $75K and $99K	$100K or more
Less than 9th Grade	49.0%	38.2%	9.4%	1.7%	1.7%
Some HS, No Diploma	49.2%	36.0%	10.1%	3.0%	1.6%
HS Grad (or GED)	30.8%	40.6%	18.0%	5.9%	4.7%
Some Col, No Degree	28.1%	36.3%	20.7%	8.0%	6.9%
Associates Degree	23.4%	35.6%	24.3%	9.2%	7.4%
Bachelors Degree	15.4%	25.8%	23.9%	13.9%	21.1%
Masters Degree	11.9%	17.3%	25.0%	16.5%	29.3%
Professional Degree	7.4%	9.9%	15.8%	12.3%	54.5%
Doctoral Degree	9.0%	11.3%	16.0%	16.8%	46.9%

Although both of these tables, more particularly the second one, is a good way to summarize the data, I thought it would be helpful to visualize this by education level in a bar plot.

Looking at trends between income and education

plot <- ggplot(calc, aes(x = income, y = percent, fill= income))
plot <- plot + scale_y_continuous()
plot <- plot + theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title = element_blank())
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd")
plot <- plot + facet_wrap(education ~ .)
plot

Conclusion: From looking at the bar chart above, we can see that there is a trend between level of education and income. It appears that those with higher degrees, including either a Master’s degree, Professional degree, or Doctorate, seem to show a higher proportion of individuals have personal incomes greater than $50,000. Interestingly, it also looks like those with a professional degree have a higher proportion that make over 100,000 dollars relative to those that have a doctorate.

Dataset #3 - Pokemon Dataset

Bryan Persaud found a dataset that brought back some nostalgia for the days that I played Pokemon! I decided to utilize this example for my final tidying task.

Question #1: Do a comparison between the strongest and weakest pokemon between two different types (i.e. Grass, Fire, Water, etc.)

Reading the data from csv into R

pokemon_data <- read.csv('https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Project2/pokemon.csv')

Tidying the data and selecting relevant columns

First, since I’m only familiar with the 1st Generation of Pokemon, I decided to use that dataset for this analysis. After pulling this in from GitHub, I noticed there were a few duplicate listings of Pokemon – mainly those that were split into “Mega” versions of themselves. Because I’d like to look at this solely from the perspective of the original 150 Pokemon, I decided to remove these duplicate listings. Below, I filtered these out using another regular expression:

head(pokemon_data, 10)

##    Number                      Name Type1  Type2 Total HP Attack Defense
## 1       1                 Bulbasaur Grass Poison   318 45     49      49
## 2       2                   Ivysaur Grass Poison   405 60     62      63
## 3       3                  Venusaur Grass Poison   525 80     82      83
## 4       3     VenusaurMega Venusaur Grass Poison   625 80    100     123
## 5       4                Charmander  Fire          309 39     52      43
## 6       5                Charmeleon  Fire          405 58     64      58
## 7       6                 Charizard  Fire Flying   534 78     84      78
## 8       6 CharizardMega Charizard X  Fire Dragon   634 78    130     111
## 9       6 CharizardMega Charizard Y  Fire Flying   634 78    104      78
## 10      7                  Squirtle Water          314 44     48      65
##    SpecialAtk SpecialDef Speed Generation Legendary
## 1          65         65    45          1     False
## 2          80         80    60          1     False
## 3         100        100    80          1     False
## 4         122        120    80          1     False
## 5          60         50    65          1     False
## 6          80         65    80          1     False
## 7         109         85   100          1     False
## 8         130         85   100          1     False
## 9         159        115   100          1     False
## 10         50         64    43          1     False

pokemon_data <- pokemon_data %>% 
  filter(!str_detect(Name, '\\Mega'))

head(pokemon_data, 10)

##    Number       Name Type1  Type2 Total HP Attack Defense SpecialAtk
## 1       1  Bulbasaur Grass Poison   318 45     49      49         65
## 2       2    Ivysaur Grass Poison   405 60     62      63         80
## 3       3   Venusaur Grass Poison   525 80     82      83        100
## 4       4 Charmander  Fire          309 39     52      43         60
## 5       5 Charmeleon  Fire          405 58     64      58         80
## 6       6  Charizard  Fire Flying   534 78     84      78        109
## 7       7   Squirtle Water          314 44     48      65         50
## 8       8  Wartortle Water          405 59     63      80         65
## 9       9  Blastoise Water          530 79     83     100         85
## 10     10   Caterpie   Bug          195 45     30      35         20
##    SpecialDef Speed Generation Legendary
## 1          65    45          1     False
## 2          80    60          1     False
## 3         100    80          1     False
## 4          50    65          1     False
## 5          65    80          1     False
## 6          85   100          1     False
## 7          64    43          1     False
## 8          80    58          1     False
## 9         105    78          1     False
## 10         20    45          1     False

We can see from the two dataframes that the “Mega” versions of pokemon have been removed.

Then, I thought it would be beneficial to select relevant columns for this comparison, and group them based on their main type and total HP. The question then would like us to identify the Pokemon with the highest and lowest HP in each type. Do do this, I used group by and mutate a few times in order calculate the maximum and minimum HP per pokemon type:

best_worst_of_type <- pokemon_data %>% 
  select(Name, Type1, Total) %>% 
  arrange(Type1, Total) %>% 
  group_by(Type1) %>% 
  mutate(Highest_HP = max(Total), 
         Lowest_HP = min(Total)) %>% 
  ungroup() %>% 
  mutate(Reason = (Total == Highest_HP) | (Total == Lowest_HP)) %>% 
  filter(Reason == TRUE) %>% 
  mutate(Reason = ifelse(Total == Highest_HP, 'Highest HP', 'Lowest HP')) %>%
  select(Name, Type1, Reason, Total) %>%
  rename("Pokemon" = Name,
         "Type" = Type1,
         "Total HP" = Total)

t1 <- best_worst_of_type %>% 
  filter(Reason == 'Lowest HP')

t2 <- best_worst_of_type %>% 
  filter(Reason == 'Highest HP')

Printing the tables

kable(t1, align = 'clc', caption = 'Pokemon w/ Lowest HP in Group') %>%
    kable_styling(bootstrap_options = c('striped'), full_width = F)

Pokemon w/ Lowest HP in Group
Pokemon	Type	Reason	Total HP
Caterpie	Bug	Lowest HP	195
Weedle	Bug	Lowest HP	195
Dratini	Dragon	Lowest HP	300
Pikachu	Electric	Lowest HP	320
Clefairy	Fairy	Lowest HP	323
Mankey	Fighting	Lowest HP	305
Machop	Fighting	Lowest HP	305
Vulpix	Fire	Lowest HP	299
Gastly	Ghost	Lowest HP	310
Bellsprout	Grass	Lowest HP	300
Diglett	Ground	Lowest HP	265
Jynx	Ice	Lowest HP	455
Pidgey	Normal	Lowest HP	251
Zubat	Poison	Lowest HP	245
Abra	Psychic	Lowest HP	310
Geodude	Rock	Lowest HP	300
Magikarp	Water	Lowest HP	200

kable(t2, align = 'clc', caption = 'Pokemon w/ Highest HP in Group') %>%
    kable_styling(bootstrap_options = c('striped'), full_width = F)

Pokemon w/ Highest HP in Group
Pokemon	Type	Reason	Total HP
Scyther	Bug	Highest HP	500
Pinsir	Bug	Highest HP	500
Dragonite	Dragon	Highest HP	600
Zapdos	Electric	Highest HP	580
Clefable	Fairy	Highest HP	483
Machamp	Fighting	Highest HP	505
Moltres	Fire	Highest HP	580
Gengar	Ghost	Highest HP	500
Venusaur	Grass	Highest HP	525
Rhydon	Ground	Highest HP	485
Articuno	Ice	Highest HP	580
Snorlax	Normal	Highest HP	540
Nidoqueen	Poison	Highest HP	505
Nidoking	Poison	Highest HP	505
Mewtwo	Psychic	Highest HP	680
Aerodactyl	Rock	Highest HP	515
Gyarados	Water	Highest HP	540

As we can see from above, all of these first-generation Pokemon that have the highest HP for their group are evolved forms of less-evolved Pokemon. Conversely, the Pokemon with the lowest HP for their group are mostly primitive forms of Pokemon (some of the first you’ll encounter in the wild - I bumped into too many Pidgeys when I played the game!!). This makes sense, given that in the game the goal as a trainer is to increase your Pokemon’s hit points (HP) and evolve your pokemon in order to be resilient against other Pokemon in battles. Now, we can do a comparison between the Water-type pokemon and Psychic-type pokemon to complete the question.

pokemon_filter_for_analysis <- best_worst_of_type %>% 
  filter(Type == 'Water' | Type == 'Psychic') %>% 
  arrange(`Total HP`)

plot <- ggplot(pokemon_filter_for_analysis, aes(x = Pokemon, y = `Total HP`, fill= Reason))
plot <- plot + scale_y_continuous()
plot <- plot + theme(legend.position = "right", axis.title = element_blank())
plot <- plot + geom_bar(stat = "identity", width = 0.95, position = "stack", color="#dddddd")
plot <- plot + facet_wrap(Type ~ ., scales = "free_x")
plot

Conclusion: As we can see here when comparing the Pokemon with the highest and lowest hit points from the Psychic and Water types, we find that Abra doesn’t have as low of an HP as Magikarp, which is the pokemon with the lowest HP for a Water-type. Additionally, Mewtwo has a higher HP than Gyarados, who has the highest HP of all Water-type pokemon. Therefore, when comparing the pokemon with the highest and lowest HPs between the water-types and psychic-types, we can see that psychic types have a higher HP for both its lowest- and highest-rated pokemon relative to water-type pokemon.

Project 2 - Data 607

Zach Alexander

October 3, 2019

Libraries Used in Project #2

Dataset #1 - Kickstarter

Questions to Answer

Reading the data from csv into R

Using regular expressions and functions to fix data formats and extract important data from messy columns

Creating a clean kickstarter dataset

Dataset #2 - Personal Income

Questions to Answer

Reading the data from csv into R

Cleaning the dataset, removing unwanted columns and rows

Removing the commas and changing data types to numeric

Selecting columns, making new columns, and saving to a cleaner data frame

Wrangling data from wide to tall

Looking at trends between income and education

Dataset #3 - Pokemon Dataset

Reading the data from csv into R

Tidying the data and selecting relevant columns

Printing the tables