DATA_607_PROJECT

Project 2 is to select 3 untidyed data sets suggested by classmates and perform data tidying and analysis.

The following 3 data sets are selected and each individual data set will be tidyed and analyzed:

Analysis 1: https://github.com/ErindaB/Other
data source found by: Erinda Budo
Analysis 2: https://openei.org/datasets/dataset/historical-renewable-energy-consumption-by-energy-use-sector-and-energy-source-1989-2008
data source found by: Mario Pena
Analysis 3: https://www.scq.ubc.ca/so-much-candy-data-seriously/
data source found by: Ajay Arora

Analysis 1: Unemployment Rate

The data set suggested by Erinda Budo contains the unemployement rates from 1990 to 2018 for numbers of countries around the world. Erinda suggested to analyze the unemployment rate from 2011 to 2015 for each of the country and the whole world. Since we have dataup to 2018, I will perform analysis on the unemployment rate from 2014 to 2018.

I select this data set because it has a lot of NA values and some of the columns are represneting a region or a group of countries that need to be excluded from our analysis.

Load all libraries

library(tidyr)
library(dplyr)
library(plyr)
library(ggplot2)
library(stringr)
library(gdata)

Load the data file in to a table

rawData1 <- read.table("https://raw.githubusercontent.com/ezaccountz/Data_607_Project_2/master/Unemployment%20Rate.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
head(rawData1)

##      X Advanced.Economies Argentina Australia  Austria  Belgium Bulgaria
## 1   NA                 NA        NA        NA       NA       NA       NA
## 2 1990           5.800582        NA  6.943297 5.373002 6.550260       NA
## 3 1991           6.728688        NA  9.614137 5.823096 6.439812       NA
## 4 1992           7.511064        NA 10.750080 5.941711 7.088092 13.23500
## 5 1993           7.936175        NA 10.866170 6.811381 8.619130 15.85583
## 6 1994           7.715897        NA  9.705695 6.545480 9.753554 14.06583
##   Bahrain Belarus Brazil   Canada Switzerland Chile China Colombia Cyprus
## 1      NA      NA     NA       NA          NA    NA    NA       NA     NA
## 2      NA      NA     NA  8.15000    0.501328    NA    NA       NA     NA
## 3      NA      NA     NA 10.31667    1.090451    NA    NA       NA     NA
## 4      NA      NA     NA 11.21667    2.563105    NA    NA       NA     NA
## 5      NA      NA     NA 11.37500    4.516116    NA    NA       NA     NA
## 6      NA      NA     NA 10.39167    4.718465    NA    NA       NA     NA
##   Czech.Republic  Germany Denmark Dominican.Republic Algeria
## 1             NA       NA      NA                 NA      NA
## 2             NA       NA      NA                 NA    25.0
## 3             NA 4.864885      NA                 NA    25.0
## 4             NA 5.764563      NA                 NA    27.0
## 5       4.333333 6.931370      NA                 NA    23.2
## 6       4.283333 7.340639      NA                 NA    24.4
##   EMDE.East.Asia...Pacific EMDE.Europe...Central.Asia Ecuador
## 1                       NA                         NA      NA
## 2                       NA                         NA      NA
## 3                       NA                         NA      NA
## 4                       NA                         NA      NA
## 5                       NA                         NA      NA
## 6                       NA                         NA      NA
##   Egypt..Arab.Rep. Emerging.Market.and.Developing.Economies..EMDEs.
## 1               NA                                               NA
## 2               NA                                               NA
## 3               NA                                               NA
## 4               NA                                               NA
## 5               NA                                               NA
## 6               NA                                               NA
##      Spain Estonia   Finland France United.Kingdom Greece
## 1       NA      NA        NA     NA             NA     NA
## 2 15.48333   0.650  3.103129  7.625       7.091667     NA
## 3 15.51667   1.475  6.666424  7.800       8.825000     NA
## 4 17.06667   3.725 11.796830  8.650       9.966667     NA
## 5 20.83333   6.550 16.384210  9.650      10.400000     NA
## 6 22.05000   7.550 16.534420 10.250       9.500000     NA
##   High.Income.Countries Hong.Kong.SAR..China Croatia Hungary India
## 1                    NA                   NA      NA      NA    NA
## 2              5.619945             1.318868      NA      NA    NA
## 3              6.771918             1.750180      NA      NA    NA
## 4              7.693434             1.946343      NA      NA    NA
## 5              8.192391             1.979785      NA      NA    NA
## 6              8.052012             1.911372      NA      NA    NA
##    Ireland Iceland Israel Italy Jordan    Japan Kazakhstan Korea..Rep.
## 1       NA      NA     NA    NA     NA       NA         NA          NA
## 2 13.41667      NA     NA    NA     NA 2.108117         NA          NA
## 3 14.73333      NA     NA    NA     NA 2.099018         NA          NA
## 4 15.40000      NA     NA    NA     NA 2.151389         NA          NA
## 5 15.63333      NA     NA    NA   19.7 2.503291         NA          NA
## 6 14.35000      NA     NA    NA   15.8 2.890953         NA          NA
##   EMDE.Latin.America...Caribbean Low.Income.Countries..LIC. Sri.Lanka
## 1                             NA                         NA        NA
## 2                             NA                         NA      15.9
## 3                             NA                         NA      14.7
## 4                             NA                         NA      14.6
## 5                             NA                         NA      13.8
## 6                             NA                         NA      13.1
##   Lithuania Luxembourg   Latvia Morocco Moldova..Rep. Mexico
## 1        NA         NA       NA      NA            NA     NA
## 2        NA         NA       NA      NA            NA     NA
## 3        NA         NA       NA      NA            NA     NA
## 4        NA         NA       NA      NA            NA     NA
## 5  4.191667         NA 4.658333      NA            NA     NA
## 6  3.625000         NA 6.358333      NA            NA     NA
##   Middle.Income.Countries..MIC. North.Macedonia Malta
## 1                            NA              NA    NA
## 2                            NA              NA    NA
## 3                            NA              NA    NA
## 4                            NA              NA    NA
## 5                            NA              NA    NA
## 6                            NA              NA    NA
##   EMDE.Middle.East...N..Africa Netherlands   Norway New.Zealand Pakistan
## 1                           NA          NA       NA          NA       NA
## 2                           NA          NA 5.783333    7.984591     3.13
## 3                           NA          NA 6.041667   10.611440     6.28
## 4                           NA          NA 6.550000   10.644730     5.85
## 5                           NA          NA 6.608333    9.800159     4.73
## 6                           NA          NA 6.000000    8.342465     4.84
##   Peru Philippines    Poland Portugal   Romania Russian.Federation
## 1   NA          NA        NA       NA        NA                 NA
## 2   NA          NA  3.441667       NA        NA                 NA
## 3   NA      10.475  9.008333       NA        NA                 NA
## 4   NA       9.850 12.933330       NA  5.450000                 NA
## 5   NA       9.350 15.033330       NA  9.208333                 NA
## 6   NA       9.550 16.508330       NA 10.975000            7.00654
##   EMDE.South.Asia Saudi.Arabia Singapore EMDE.Sub.Saharan.Africa Slovakia
## 1              NA           NA        NA                      NA       NA
## 2              NA           NA        NA                      NA       NA
## 3              NA           NA     1.750                      NA  7.05000
## 4              NA           NA     1.800                      NA 11.31833
## 5              NA           NA     1.675                      NA 12.85500
## 6              NA           NA     1.725                      NA 14.62917
##   Slovenia    Sweden Thailand Tunisia Turkey Taiwan..China Uruguay
## 1       NA        NA       NA      NA     NA            NA      NA
## 2       NA  2.239701       NA      NA     NA      1.658333      NA
## 3       NA  4.005607       NA      NA     NA      1.533333      NA
## 4 11.56667  7.110956       NA      NA     NA      1.500000      NA
## 5 14.57500 11.146890       NA      NA     NA      1.425000      NA
## 6 14.55000 10.766190       NA      NA     NA      1.566667      NA
##   United.States Venezuela..RB Vietnam World..WBG.members. South.Africa
## 1            NA            NA      NA                  NA           NA
## 2      5.616667            NA      NA                  NA           NA
## 3      6.850000            NA      NA                  NA           NA
## 4      7.491667            NA      NA                  NA           NA
## 5      6.908333            NA      NA                  NA           NA
## 6      6.100000            NA      NA                  NA           NA

Column 1 contains the value of year, but the column name is missing. Add the column name.

colnames(rawData1)[1] <- "Year"

This analysis is on counties, organzations and groups of countries need to be excluded from our analysis

analysis1 <- select(rawData1, -matches("EMDE|Countries|Economies|World"))

Filter out the blank rows

analysis1 <- filter(analysis1, !is.na(Year))

Tidy our data table and ready for analysis

analysis1 <- gather(analysis1, "Country", "Unemployment_Rate", -Year)
analysis1$Unemployment_Rate <- round(as.numeric(analysis1$Unemployment_Rate),2)
head(analysis1)

##   Year   Country Unemployment_Rate
## 1 1990 Argentina                NA
## 2 1991 Argentina                NA
## 3 1992 Argentina                NA
## 4 1993 Argentina                NA
## 5 1994 Argentina                NA
## 6 1995 Argentina                NA

Question 1: What are the average umemployment rate from 2014 to 2018 for each country?

question_one <- analysis1 %>% 
  dplyr::filter(Year %in% c(2014:2018)) %>% 
  dplyr::group_by(Country) %>% 
  dplyr::summarise(ave_unemp_rate = mean(Unemployment_Rate, na.rm = TRUE), 
                   min = min(Unemployment_Rate, na.rm = TRUE), 
                   max = max(Unemployment_Rate, na.rm = TRUE),
                   median = median(Unemployment_Rate, na.rm = TRUE),
                   standard_dev = sd(Unemployment_Rate, na.rm = TRUE)) %>% 
  dplyr::arrange(desc(ave_unemp_rate))
question_one

## # A tibble: 72 x 6
##    Country          ave_unemp_rate   min   max median standard_dev
##    <chr>                     <dbl> <dbl> <dbl>  <dbl>        <dbl>
##  1 South.Africa               26.3 25.1   27.4   26.7        1.05 
##  2 North.Macedonia            24.2 20.7   28.0   23.8        2.90 
##  3 Greece                     23.1 19.2   26.5   23.6        2.86 
##  4 Spain                      19.7 15.3   24.4   19.6        3.67 
##  5 Jordan                     15.4 11.9   18.6   15.3        3.01 
##  6 Tunisia                    15.3 15.0   15.5   15.4        0.193
##  7 Croatia                    14.7  9.86  19.3   14.9        3.71 
##  8 Cyprus                     12.7  8.38  16.2   13.0        3.08 
##  9 Egypt..Arab.Rep.           12.0  9.85  13.2   12.5        1.31 
## 10 Italy                      11.6 10.6   12.6   11.7        0.758
## # ... with 62 more rows

Question 2: What’s the overall trend of the world’s annual unemployment rate from 2014 to 2018?

In order to make the unemployment rate for each year comparable, countries with unemployment rate missing in any year from 2014 to 2018 will be excluded. The countries with missing values are

question_two <- analysis1 %>% 
  dplyr::filter(Year %in% c(2014:2018)) %>% 
  dplyr::filter(is.na(Unemployment_Rate)) %>% 
  dplyr::distinct(Country)
question_two

##              Country
## 1            Bahrain
## 2 Dominican.Republic
## 3            Algeria
## 4              India
## 5           Pakistan
## 6      Venezuela..RB

Calculate the overall unemployment rate for each year

question_two <- analysis1 %>% 
  dplyr::filter(Year %in% c(2014:2018)) %>% 
  dplyr::filter(!Country %in% unlist(question_two)) %>%
  dplyr::group_by(Year) %>% 
  dplyr::summarise(ave_unemp_rate = mean(Unemployment_Rate, na.rm = TRUE), 
                   min = min(Unemployment_Rate, na.rm = TRUE), 
                   max = max(Unemployment_Rate, na.rm = TRUE),
                   median = median(Unemployment_Rate, na.rm = TRUE),
                   standard_dev = sd(Unemployment_Rate, na.rm = TRUE)) %>% 
  dplyr::arrange(Year)
question_two

## # A tibble: 5 x 6
##    Year ave_unemp_rate   min   max median standard_dev
##   <int>          <dbl> <dbl> <dbl>  <dbl>        <dbl>
## 1  2014           8.50  0.5   28.0   6.80         5.83
## 2  2015           8.18  0.88  26.0   6.55         5.44
## 3  2016           7.85  0.99  26.7   6.60         5.22
## 4  2017           7.36  0.76  27.4   5.95         5.10
## 5  2018           6.81  0.41  27.1   5.44         4.89

ggplot(data=question_two, aes(x=Year, y=ave_unemp_rate)) +
  geom_line(color="red")+
  geom_point()

As we can see, the overall unemployment rate is dropping every year, which indicates that the world is getting better every year.

Analysis 2: Renewable Energy Consumption

The data set suggested by Mario Pena contains annual renewable energy consumption by source and end use from 1989 to 2008. Mario suggested to analyze the most used renewable energy per year or to find a “bright spot”

I select this data set becasue the structure of the data is very messy and some of the data fields are very confusing. For example, the first coloumn contains inforamtion of both sectors and sources. It will require a lot of work to clean up and tidy the data.

First, Load the data file in to a table

rawData2 <- read.table("https://raw.githubusercontent.com/ezaccountz/Data_607_Project_2/master/Renewable%20Energy.csv", sep = ",", header = FALSE,stringsAsFactors = FALSE)

remove the discription rows which contain no data and fix the column names

analysis2 <- rawData2[-c(1:8),]
colnames(analysis2) <- rawData2[7,]

remove all note reference numbers from column 1

analysis2[,1] <- str_remove_all(analysis2[,1],"\\d")

remove the sections that represent the total of some other sections

temp <- which(analysis2[,1] == "")

analysis2 <- analysis2[-c(which(analysis2[,1] == "Total")[1] :
                          temp[temp > which(analysis2[,1] == "Total")[1]][1]),]

temp <- which(analysis2[,1] == "")
analysis2 <- analysis2[-c(which(analysis2[,1] == "Electric Power")[1] :
                          temp[temp > which(analysis2[,1] == "Electric Power")[1]][1]),]

Filter out the blank rows

analysis2 <- dplyr::filter(analysis2, analysis2$`Sector and Source` != "")

The first column contains multiple information: the sector, the energy source’s main category and also the sub categories for Biomass. We need to sparate these information into 3 different columns.
Add columns for separating the sector, source main category, and source sub category

analysis2 <- analysis2 %>% 
  mutate(Sector = NA) %>%
  mutate("Main category" = NA) %>%
  select("Main category", "Sector and Source", "Sector", everything())

Transform and fill in the correct value for sector, main category and sub category

temp <- analysis2 %>% 
  dplyr::filter(!str_detect(analysis2$`Sector and Source`, "  ")) %>% 
  dplyr::distinct(`Sector and Source`)

analysis2$Sector <- ifelse(analysis2$`Sector and Source` %in% unlist(temp), 
                           analysis2$`Sector and Source`, NA)

analysis2 <- fill(analysis2, Sector)

analysis2 <- analysis2 %>% 
  dplyr::filter(!`Sector and Source` %in% unlist(temp)) 
  
temp <- analysis2 %>% 
  dplyr::filter(str_detect(analysis2$`Sector and Source`, "    ")) %>% 
  dplyr::distinct(`Sector and Source`)

analysis2$`Main category` <- ifelse(analysis2$`Sector and Source` %in% unlist(temp), 
                            "Biomass", analysis2$`Sector and Source`)

analysis2 <- analysis2 %>% 
  dplyr::filter(`Sector and Source` != "  Biomass") %>% 
  dplyr::mutate(`Sector and Source` = str_remove_all(`Sector and Source`, "  ")) %>% 
  dplyr::mutate(`Main category` = str_remove_all(`Main category`, "  ")) %>% 
  dplyr::rename("Sub Category" = "Sector and Source")

head(analysis2)

##      Main category           Sub Category      Sector  1989  1990  1991
## 1          Biomass Wood and Derived Fuels Residential 0.920 0.580 0.610
## 2       Geothermal             Geothermal Residential 0.005 0.006 0.006
## 3 Solar Thermal/PV       Solar Thermal/PV Residential 0.053 0.056 0.058
## 4          Biomass               Biofuels  Commercial 0.001     *     *
## 5          Biomass                  Waste  Commercial 0.022 0.028 0.026
## 6          Biomass Wood and Derived Fuels  Commercial 0.076 0.066 0.068
##    1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003
## 1 0.640 0.550 0.520 0.520 0.540 0.430 0.380 0.390 0.420 0.370 0.380 0.400
## 2 0.006 0.007 0.006 0.007 0.007 0.008 0.008 0.009 0.009 0.009 0.010 0.013
## 3 0.060 0.062 0.064 0.065 0.065 0.065 0.065 0.064 0.061 0.060 0.059 0.058
## 4     *     *     *     *     *     *     *     *     *     *     * 0.001
## 5 0.032 0.033 0.035 0.040 0.053 0.058 0.054 0.054 0.047 0.025 0.026 0.029
## 6 0.072 0.076 0.072 0.072 0.076 0.073 0.064 0.067 0.071 0.067 0.069 0.071
##    2004  2005  2006  2007  2008
## 1 0.410 0.430 0.390 0.430 0.450
## 2 0.014 0.016 0.018 0.022 0.026
## 3 0.059 0.061 0.067 0.075 0.088
## 4 0.001 0.001 0.001 0.002 0.002
## 5 0.034 0.034 0.036 0.031 0.034
## 6 0.070 0.070 0.065 0.069 0.073

Tiny the data for analysis

analysis2 <- analysis2 %>% 
  gather("Year", "Usage", -c(1:3))  
head(analysis2)

##      Main category           Sub Category      Sector Year Usage
## 1          Biomass Wood and Derived Fuels Residential 1989 0.920
## 2       Geothermal             Geothermal Residential 1989 0.005
## 3 Solar Thermal/PV       Solar Thermal/PV Residential 1989 0.053
## 4          Biomass               Biofuels  Commercial 1989 0.001
## 5          Biomass                  Waste  Commercial 1989 0.022
## 6          Biomass Wood and Derived Fuels  Commercial 1989 0.076

Filter out the rows with no value in the Usage column

analysis2$Usage <- ifelse(str_detect(analysis2$Usage, "[[:digit:]]"),as.numeric(analysis2$Usage), NA)
analysis2 <- filter(analysis2,str_detect(analysis2$Usage, "[[:digit:]]"))

Question 1: what is the most used renewable energy each year?

question_one <- analysis2 %>% 
  dplyr::group_by(Year, "Main category" = analysis2$`Main category`) %>% 
  dplyr::summarise(sum = sum(Usage, na.rm = TRUE)) %>% 
  dplyr::arrange(Year, desc(sum)) %>% 
  dplyr::ungroup() %>% 
  dplyr::group_by(Year) %>% 
  dplyr::top_n(1)

## Selecting by sum

question_one

## # A tibble: 20 x 3
## # Groups:   Year [20]
##    Year  `Main category`              sum
##    <chr> <chr>                      <dbl>
##  1 1989  Biomass                     3.16
##  2 1990  Hydroelectric Conventional  3.05
##  3 1991  Hydroelectric Conventional  3.02
##  4 1992  Biomass                     2.93
##  5 1993  Biomass                     2.91
##  6 1994  Biomass                     3.03
##  7 1995  Hydroelectric Conventional  3.20
##  8 1996  Hydroelectric Conventional  3.59
##  9 1997  Hydroelectric Conventional  3.64
## 10 1998  Hydroelectric Conventional  3.30
## 11 1999  Hydroelectric Conventional  3.27
## 12 2000  Biomass                     3.01
## 13 2001  Biomass                     2.62
## 14 2002  Biomass                     2.70
## 15 2003  Hydroelectric Conventional  2.82
## 16 2004  Biomass                     3.01
## 17 2005  Biomass                     3.12
## 18 2006  Biomass                     3.28
## 19 2007  Biomass                     3.50
## 20 2008  Biomass                     3.85

Question 2: the are the amount of renewable energy coumsumed each year by each sector?

question_two <- analysis2 %>% 
  dplyr::group_by(Year, Sector) %>% 
  dplyr::summarise(sum = sum(Usage, na.rm = TRUE)) %>% 
  dplyr::arrange(Year, desc(sum))
question_two

## # A tibble: 120 x 3
## # Groups:   Year [20]
##    Year  Sector                        sum
##    <chr> <chr>                       <dbl>
##  1 1989  Electric Utilities          2.98 
##  2 1989  Industrial                  1.87 
##  3 1989  Residential                 0.978
##  4 1989  Independent Power Producers 0.39 
##  5 1989  Commercial                  0.103
##  6 1989  Transportation              0.068
##  7 1990  Electric Utilities          3.15 
##  8 1990  Industrial                  1.72 
##  9 1990  Residential                 0.642
## 10 1990  Independent Power Producers 0.539
## # ... with 110 more rows

Finding a “bright spot”

question_two_1989 <- question_two %>% 
  filter(Year == 1989)
question_two_2008 <- question_two %>% 
  filter(Year == 2008)

question_two_1989 %>% 
  left_join(question_two_2008, by = "Sector") %>% 
  mutate(Percent_Increase = (sum.y / sum.x - 1)*100) %>% 
  select(Sector, everything())

## # A tibble: 6 x 6
## # Groups:   Year.x [1]
##   Sector                      Year.x sum.x Year.y sum.y Percent_Increase
##   <chr>                       <chr>  <dbl> <chr>  <dbl>            <dbl>
## 1 Electric Utilities          1989   2.98  2008   2.40            -19.4 
## 2 Industrial                  1989   1.87  2008   2.05              9.78
## 3 Residential                 1989   0.978 2008   0.564           -42.3 
## 4 Independent Power Producers 1989   0.39  2008   1.39            257.  
## 5 Commercial                  1989   0.103 2008   0.125            21.4 
## 6 Transportation              1989   0.068 2008   0.827          1116.

Comparing the energy consumption of 2008 to 1989, we can see that the consumption by Transportation is increased by 1116%! As we know, transportation nowadays still relies on gasoline. It would be a good idea to expand the usage of renewable energy in Transportation.

Analysis 3: So much candy!

The data set suggested by Ajay contains ratings of over 100 candies by people around the world (mainly from the United States and Canada). Ajay suggested to perform analysis to find out the favorite candy based the people’s demographics (gender, geopraphy, etc.)

I select this data set becasue the data set is huge and the data values are not standarized. Some data fields have wrong or missing information. It would be a challenge to clean up the data for analysis.

First, Load the data file in to a table

rawData3 <- read.table("https://raw.githubusercontent.com/ezaccountz/Data_607_Project_2/master/candy.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE, quote = "", fill = TRUE)

Also, prepare a table that contains the abbriviations and province names of Canada for later use

canadaProvince <- read.table("https://raw.githubusercontent.com/ezaccountz/Data_607_Project_2/master/Canada.csv", sep = ",", header = TRUE,stringsAsFactors = FALSE)

Simplify the column names. Also, because the candy names are too long, I will create a sparate table to hold the candy names with a unique candy ID. The column names of the candies will be replaced by the candy ID

analysis3 <- select(rawData3, 1: sum(str_detect(colnames(rawData3), "Q[1-6]\\D|Internal")))

columnNames <- colnames(analysis3)
temp <- sum(str_detect(colnames(rawData3), "Q[1-5]\\D|Internal"))
columnNames <- str_remove_all(columnNames, ".*?Q\\d[[:punct:]]*")
columnNames <- str_replace_all(columnNames, "\\.+", " ")
columnNames <- trim(columnNames)

candyTable <- matrix(nrow = length(columnNames) - temp, ncol=2)
colnames(candyTable) <- c("CandyID", "CandyName")
candyTable[,1] = as.character(c(1:length(candyTable[,1])))
candyTable[,2] = columnNames[-c(1:temp)]
columnNames[-c(1:temp)] <- candyTable[,1] 
colnames(analysis3) <- columnNames
colnames(analysis3)[6] <- "STATE"

head(analysis3)

##   Internal ID GOING OUT GENDER AGE COUNTRY    STATE   1       2   3
## 1    90258773                                                      
## 2    90272821        No   Male  44    USA        NM MEH DESPAIR JOY
## 3    90272829             Male  49     USA Virginia                
## 4    90272840        No   Male  40      us       or MEH DESPAIR JOY
## 5    90272841        No   Male  23     usa exton pa JOY DESPAIR JOY
## 6    90272852        No   Male                      JOY DESPAIR JOY
##         4       5       6       7       8       9      10      11      12
## 1                                                                        
## 2     MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR     MEH     MEH
## 3                                                                        
## 4     MEH     MEH DESPAIR     MEH DESPAIR DESPAIR     MEH     MEH DESPAIR
## 5 DESPAIR     MEH DESPAIR     MEH DESPAIR DESPAIR     MEH     MEH DESPAIR
## 6                             MEH     MEH DESPAIR     JOY DESPAIR     MEH
##        13  14  15  16      17      18      19      20      21  22  23
## 1                                                                    
## 2 DESPAIR MEH JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR MEH JOY
## 3                                                                    
## 4 DESPAIR MEH JOY MEH     JOY     MEH DESPAIR DESPAIR DESPAIR JOY JOY
## 5 DESPAIR JOY MEH JOY DESPAIR DESPAIR     MEH DESPAIR DESPAIR MEH JOY
## 6     MEH JOY JOY JOY     MEH     MEH     JOY DESPAIR     MEH MEH JOY
##        24      25      26      27  28      29  30  31      32      33  34
## 1                                                                        
## 2 DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR MEH MEH DESPAIR     MEH JOY
## 3                                                                        
## 4     MEH     MEH     JOY     JOY MEH DESPAIR JOY MEH     MEH DESPAIR MEH
## 5 DESPAIR DESPAIR DESPAIR     MEH MEH DESPAIR JOY MEH DESPAIR DESPAIR JOY
## 6         DESPAIR     MEH     JOY MEH DESPAIR MEH MEH     MEH     JOY JOY
##    35  36      37      38  39      40      41      42      43      44  45
## 1                                                                        
## 2 JOY MEH DESPAIR DESPAIR MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR JOY
## 3                                                                        
## 4 MEH MEH     MEH     MEH MEH DESPAIR DESPAIR DESPAIR DESPAIR     MEH MEH
## 5 MEH MEH DESPAIR     MEH JOY     MEH     JOY DESPAIR DESPAIR     JOY JOY
## 6 MEH MEH     MEH     MEH MEH             JOY         DESPAIR         JOY
##        46  47  48      49  50      51      52      53  54      55  56  57
## 1                                                                        
## 2 DESPAIR MEH MEH     JOY MEH DESPAIR DESPAIR DESPAIR MEH     MEH JOY JOY
## 3                                                                        
## 4     JOY JOY MEH     JOY JOY DESPAIR     JOY     MEH MEH DESPAIR JOY MEH
## 5     JOY JOY MEH DESPAIR JOY     MEH     MEH DESPAIR MEH     MEH JOY JOY
## 6     MEH MEH MEH     MEH JOY     MEH     JOY         JOY     JOY JOY JOY
##    58  59  60  61  62      63      64      65      66      67      68
## 1                                                                    
## 2 MEH JOY JOY JOY JOY DESPAIR DESPAIR     MEH DESPAIR DESPAIR DESPAIR
## 3                                                                    
## 4 JOY MEH MEH MEH MEH     MEH DESPAIR DESPAIR     MEH     MEH DESPAIR
## 5 JOY JOY JOY JOY JOY     MEH DESPAIR     JOY     JOY     MEH DESPAIR
## 6 JOY JOY JOY                     JOY     JOY     JOY     JOY     MEH
##        69  70      71      72      73      74      75  76  77      78  79
## 1                                                                        
## 2 DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR JOY JOY DESPAIR JOY
## 3                                                                        
## 4     MEH MEH     MEH DESPAIR DESPAIR DESPAIR DESPAIR JOY MEH     MEH MEH
## 5     JOY JOY     JOY     MEH DESPAIR     JOY DESPAIR JOY JOY     MEH JOY
## 6     JOY MEH     MEH DESPAIR DESPAIR     MEH DESPAIR JOY JOY DESPAIR JOY
##        80      81      82      83  84      85      86  87      88  89
## 1                                                                    
## 2 DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR DESPAIR MEH DESPAIR MEH
## 3                                                                    
## 4 DESPAIR DESPAIR DESPAIR     MEH JOY     MEH DESPAIR MEH     MEH JOY
## 5             JOY     JOY DESPAIR         MEH DESPAIR JOY     JOY MEH
## 6 DESPAIR     JOY     JOY     JOY JOY DESPAIR DESPAIR MEH     MEH MEH
##        90      91      92      93      94  95      96  97      98      99
## 1                                                                        
## 2 DESPAIR DESPAIR DESPAIR DESPAIR     JOY JOY DESPAIR JOY DESPAIR DESPAIR
## 3                                                                        
## 4     MEH     JOY     MEH DESPAIR DESPAIR JOY     MEH JOY DESPAIR     JOY
## 5 DESPAIR     MEH DESPAIR DESPAIR     JOY JOY DESPAIR JOY     MEH     JOY
## 6                     MEH DESPAIR     JOY JOY     MEH JOY DESPAIR DESPAIR
##       100     101     102     103
## 1                                
## 2 DESPAIR DESPAIR DESPAIR DESPAIR
## 3                                
## 4     JOY DESPAIR DESPAIR DESPAIR
## 5     JOY DESPAIR DESPAIR     JOY
## 6     JOY DESPAIR DESPAIR     JOY

Standardize the country name for “United States” and “Canada”

analysis3$COUNTRY[str_detect(toupper(analysis3$COUNTRY), "U.*S|MERICA")] <- "United States"
analysis3$COUNTRY[str_detect(toupper(analysis3$COUNTRY), "CANADA")] <- "Canada"

Standardize the state/province by using the abbreviations only

for(i in c(1:length(state.abb))){
  analysis3$STATE[str_detect(toupper(analysis3$STATE),toupper(state.name[i]))
                                         & analysis3$COUNTRY == "United States"] <- state.abb[i]
  analysis3$STATE[str_detect(toupper(analysis3$STATE),state.abb[i])
                                         & analysis3$COUNTRY == "United States"] <- state.abb[i]
}


for(i in c(1:length(canadaProvince$Abbreviation))){
  analysis3$STATE[str_detect(toupper(analysis3$STATE),toupper(canadaProvince$Province.or.Territory[i]))
                                         & analysis3$COUNTRY == "Canada"] <- canadaProvince$Abbreviation[i]
  analysis3$STATE[str_detect(toupper(analysis3$STATE),canadaProvince$Abbreviation[i])
                                         & analysis3$COUNTRY == "Canada"] <- canadaProvince$Abbreviation[i]
}

Extract and convert the age to integer

analysis3$AGE <- as.integer(str_extract(analysis3$AGE, "\\d+"))

Exclude people from countries other than the United States and Canada, because the number of people from the other countries is to small to be sufficient for analysis.
Exclude people less than 12 years old and more than 120 years old.
Exclude people with invalid value in the gender column

analysis3 <- analysis3 %>% 
  filter(GENDER == "Male" | GENDER == "Female") %>% 
  filter(AGE >= 12 & AGE <= 120) %>% 
  filter((COUNTRY == "United States" & STATE %in% state.abb) | (COUNTRY == "Canada" & STATE %in% canadaProvince$Abbreviation))
head(analysis3)

##   Internal ID GOING OUT GENDER AGE       COUNTRY STATE   1       2   3
## 1    90272821        No   Male  44 United States    NM MEH DESPAIR JOY
## 2    90272829             Male  49 United States    IN                
## 3    90272840        No   Male  40 United States    OR MEH DESPAIR JOY
## 4    90272841        No   Male  23 United States    PA JOY DESPAIR JOY
## 5    90272853        No   Male  53 United States    CO                
## 6    90272854        No   Male  33        Canada    NT JOY DESPAIR JOY
##         4       5       6       7       8       9      10  11      12
## 1     MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR MEH     MEH
## 2                                                                    
## 3     MEH     MEH DESPAIR     MEH DESPAIR DESPAIR     MEH MEH DESPAIR
## 4 DESPAIR     MEH DESPAIR     MEH DESPAIR DESPAIR     MEH MEH DESPAIR
## 5                                                                    
## 6 DESPAIR DESPAIR     MEH     JOY     MEH     JOY     JOY JOY     JOY
##        13  14  15  16      17      18      19      20      21  22  23
## 1 DESPAIR MEH JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR MEH JOY
## 2                                                                    
## 3 DESPAIR MEH JOY MEH     JOY     MEH DESPAIR DESPAIR DESPAIR JOY JOY
## 4 DESPAIR JOY MEH JOY DESPAIR DESPAIR     MEH DESPAIR DESPAIR MEH JOY
## 5                                                                    
## 6 DESPAIR JOY JOY JOY DESPAIR     MEH     MEH DESPAIR DESPAIR MEH MEH
##        24      25      26      27      28      29  30      31      32
## 1 DESPAIR DESPAIR DESPAIR DESPAIR     MEH DESPAIR MEH     MEH DESPAIR
## 2                                                                    
## 3     MEH     MEH     JOY     JOY     MEH DESPAIR JOY     MEH     MEH
## 4 DESPAIR DESPAIR DESPAIR     MEH     MEH DESPAIR JOY     MEH DESPAIR
## 5                                                                    
## 6     JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR JOY DESPAIR DESPAIR
##        33  34  35  36      37      38      39      40      41      42
## 1     MEH JOY JOY MEH DESPAIR DESPAIR     MEH DESPAIR DESPAIR DESPAIR
## 2                                                                    
## 3 DESPAIR MEH MEH MEH     MEH     MEH     MEH DESPAIR DESPAIR DESPAIR
## 4 DESPAIR JOY MEH MEH DESPAIR     MEH     JOY     MEH     JOY DESPAIR
## 5                                                                    
## 6 DESPAIR JOY MEH MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR
##        43      44  45      46  47  48      49      50      51      52
## 1 DESPAIR DESPAIR JOY DESPAIR MEH MEH     JOY     MEH DESPAIR DESPAIR
## 2                                                                    
## 3 DESPAIR     MEH MEH     JOY JOY MEH     JOY     JOY DESPAIR     JOY
## 4 DESPAIR     JOY JOY     JOY JOY MEH DESPAIR     JOY     MEH     MEH
## 5                                                                    
## 6 DESPAIR DESPAIR MEH     MEH JOY MEH DESPAIR DESPAIR     MEH     JOY
##        53  54      55  56  57  58  59  60  61  62      63      64      65
## 1 DESPAIR MEH     MEH JOY JOY MEH JOY JOY JOY JOY DESPAIR DESPAIR     MEH
## 2                                                                        
## 3     MEH MEH DESPAIR JOY MEH JOY MEH MEH MEH MEH     MEH DESPAIR DESPAIR
## 4 DESPAIR MEH     MEH JOY JOY JOY JOY JOY JOY JOY     MEH DESPAIR     JOY
## 5                                                                        
## 6     JOY JOY     JOY JOY JOY JOY JOY JOY JOY MEH DESPAIR     JOY DESPAIR
##        66      67      68      69  70      71      72      73      74
## 1 DESPAIR DESPAIR DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR
## 2                                                                    
## 3     MEH     MEH DESPAIR     MEH MEH     MEH DESPAIR DESPAIR DESPAIR
## 4     JOY     MEH DESPAIR     JOY JOY     JOY     MEH DESPAIR     JOY
## 5                                                                    
## 6 DESPAIR     MEH     MEH     JOY JOY     MEH DESPAIR DESPAIR DESPAIR
##        75  76  77      78  79      80      81      82      83  84      85
## 1 DESPAIR JOY JOY DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR
## 2                                                                        
## 3 DESPAIR JOY MEH     MEH MEH DESPAIR DESPAIR DESPAIR     MEH JOY     MEH
## 4 DESPAIR JOY JOY     MEH JOY             JOY     JOY DESPAIR         MEH
## 5                                                                        
## 6 DESPAIR JOY JOY         JOY DESPAIR     JOY     MEH     JOY MEH     JOY
##        86  87      88  89      90      91      92      93      94  95
## 1 DESPAIR MEH DESPAIR MEH DESPAIR DESPAIR DESPAIR DESPAIR     JOY JOY
## 2                                                                    
## 3 DESPAIR MEH     MEH JOY     MEH     JOY     MEH DESPAIR DESPAIR JOY
## 4 DESPAIR JOY     JOY MEH DESPAIR     MEH DESPAIR DESPAIR     JOY JOY
## 5                                                                    
## 6 DESPAIR MEH     JOY JOY     MEH     MEH DESPAIR DESPAIR     JOY MEH
##        96  97      98      99     100     101     102     103
## 1 DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR
## 2                                                            
## 3     MEH JOY DESPAIR     JOY     JOY DESPAIR DESPAIR DESPAIR
## 4 DESPAIR JOY     MEH     JOY     JOY DESPAIR DESPAIR     JOY
## 5                                                            
## 6 DESPAIR JOY     JOY     MEH DESPAIR DESPAIR DESPAIR DESPAIR

The rating with “JOY”, “MEH” and “DESPAIR” are vague, replace them with “Like”, “OK” and “Dislike”

analysis3[analysis3 == "JOY"] <- "Like"
analysis3[analysis3 == "MEH"] <- "OK"
analysis3[analysis3 == "DESPAIR"] <- "Dislike"
analysis3[analysis3 == ""] <- "NA"

Tidy the data and filter out the rows with no rating

analysis3 <- analysis3 %>% 
  gather("CandyID", "Rating", -c(1:temp)) %>% 
  filter(Rating != "NA")
head(analysis3)

##   Internal ID GOING OUT GENDER AGE       COUNTRY STATE CandyID Rating
## 1    90272821        No   Male  44 United States    NM       1     OK
## 2    90272840        No   Male  40 United States    OR       1     OK
## 3    90272841        No   Male  23 United States    PA       1   Like
## 4    90272854        No   Male  33        Canada    NT       1   Like
## 5    90272858        No   Male  40        Canada    NT       1   Like
## 6    90272859        No Female  53 United States    WA       1     OK

Question 1: what are the top 10 candies with the highest rate of “Like”

analysis3 %>% 
  dplyr::group_by(CandyID) %>% 
  dplyr::count(Rating) %>% 
  dplyr::mutate(percentage = n/sum(n)) %>% 
  dplyr::filter(Rating == "Like") %>% 
  dplyr::arrange(desc(percentage)) %>% 
  dplyr::left_join(as_tibble(candyTable),by = "CandyID") %>% 
  dplyr::ungroup() %>% 
  dplyr::slice(1:10)

## # A tibble: 10 x 5
##    CandyID Rating     n percentage CandyName                          
##    <chr>   <chr>  <int>      <dbl> <chr>                              
##  1 3       Like    1359      0.873 Any full sized candy bar           
##  2 76      Like    1317      0.849 Reese s Peanut Butter Cups         
##  3 45      Like    1265      0.813 Kit Kat                            
##  4 15      Like    1258      0.811 Cash or other forms of legal tender
##  5 97      Like    1240      0.803 Twix                               
##  6 84      Like    1229      0.797 Snickers                           
##  7 95      Like    1157      0.756 Tolberone something or other       
##  8 50      Like    1120      0.736 Lindt Truffle                      
##  9 58      Like    1136      0.729 Peanut M M s                       
## 10 23      Like    1039      0.676 Dove Bars

Question 2: what are the top 5 candies with the highest rate of “Like” for each country

analysis3 %>% 
  dplyr::group_by(COUNTRY,CandyID) %>% 
  dplyr::count(Rating) %>% 
  dplyr::mutate(percentage = n/sum(n)) %>% 
  dplyr::filter(Rating == "Like") %>% 
  dplyr::arrange(COUNTRY, desc(percentage)) %>% 
  dplyr::left_join(as_tibble(candyTable),by = "CandyID") %>%
  dplyr::ungroup() %>% 
  dplyr::group_by(COUNTRY) %>%
  dplyr::slice(1:5)

## # A tibble: 10 x 6
## # Groups:   COUNTRY [2]
##    COUNTRY      CandyID Rating     n percentage CandyName                  
##    <chr>        <chr>   <chr>  <int>      <dbl> <chr>                      
##  1 Canada       3       Like     132      0.936 Any full sized candy bar   
##  2 Canada       76      Like     124      0.892 Reese s Peanut Butter Cups 
##  3 Canada       15      Like     124      0.879 Cash or other forms of leg~
##  4 Canada       19      Like     122      0.865 Coffee Crisp               
##  5 Canada       45      Like     121      0.864 Kit Kat                    
##  6 United Stat~ 3       Like    1227      0.867 Any full sized candy bar   
##  7 United Stat~ 76      Like    1193      0.845 Reese s Peanut Butter Cups 
##  8 United Stat~ 45      Like    1144      0.808 Kit Kat                    
##  9 United Stat~ 97      Like    1136      0.807 Twix                       
## 10 United Stat~ 15      Like    1134      0.804 Cash or other forms of leg~

Question 3: what are the top 5 candies with the highest rate of “Like” for each gender

analysis3 %>% 
  dplyr::group_by(GENDER,CandyID) %>% 
  dplyr::count(Rating) %>% 
  dplyr::mutate(percentage = n/sum(n)) %>% 
  dplyr::filter(Rating == "Like") %>% 
  dplyr::arrange(GENDER, desc(percentage)) %>% 
  dplyr::left_join(as_tibble(candyTable),by = "CandyID") %>%
  dplyr::ungroup() %>% 
  dplyr::group_by(GENDER) %>%
  dplyr::slice(1:5)

## # A tibble: 10 x 6
## # Groups:   GENDER [2]
##    GENDER CandyID Rating     n percentage CandyName                        
##    <chr>  <chr>   <chr>  <int>      <dbl> <chr>                            
##  1 Female 3       Like     501      0.887 Any full sized candy bar         
##  2 Female 15      Like     477      0.846 Cash or other forms of legal ten~
##  3 Female 76      Like     474      0.842 Reese s Peanut Butter Cups       
##  4 Female 97      Like     464      0.827 Twix                             
##  5 Female 45      Like     456      0.809 Kit Kat                          
##  6 Male   3       Like     858      0.865 Any full sized candy bar         
##  7 Male   76      Like     843      0.853 Reese s Peanut Butter Cups       
##  8 Male   45      Like     809      0.816 Kit Kat                          
##  9 Male   84      Like     789      0.803 Snickers                         
## 10 Male   15      Like     781      0.791 Cash or other forms of legal ten~

Question 4: what are the top 1 candy with the highest rate of “Like” for each state in the United States (some state may be missing in the data)

analysis3 %>% 
  dplyr::filter(COUNTRY == "United States") %>% 
  dplyr::group_by(STATE,CandyID) %>% 
  dplyr::count(Rating) %>% 
  dplyr::mutate(percentage = n/sum(n)) %>% 
  dplyr::filter(Rating == "Like") %>% 
  dplyr::arrange(STATE, desc(percentage)) %>% 
  dplyr::left_join(as_tibble(candyTable),by = "CandyID") %>%
  dplyr::ungroup() %>% 
  dplyr::group_by(STATE) %>%
  dplyr::slice(1:1)

## # A tibble: 47 x 6
## # Groups:   STATE [47]
##    STATE CandyID Rating     n percentage CandyName                         
##    <chr> <chr>   <chr>  <int>      <dbl> <chr>                             
##  1 AK    15      Like       6      1     Cash or other forms of legal tend~
##  2 AL    3       Like     129      0.816 Any full sized candy bar          
##  3 AR    3       Like      42      0.933 Any full sized candy bar          
##  4 AZ    3       Like      21      0.913 Any full sized candy bar          
##  5 CA    3       Like      76      0.864 Any full sized candy bar          
##  6 CO    3       Like      78      0.897 Any full sized candy bar          
##  7 CT    3       Like       6      1     Any full sized candy bar          
##  8 DE    15      Like       6      1     Cash or other forms of legal tend~
##  9 FL    76      Like      32      0.941 Reese s Peanut Butter Cups        
## 10 GA    3       Like      50      0.962 Any full sized candy bar          
## # ... with 37 more rows

DATA_607_PROJECT_2

Euclid Zhang

10/6/2019