Project 2 is to select 3 untidyed data sets suggested by classmates and perform data tidying and analysis.
The following 3 data sets are selected and each individual data set will be tidyed and analyzed:
Analysis 1: https://github.com/ErindaB/Other
data source found by: Erinda Budo
Analysis 2: https://openei.org/datasets/dataset/historical-renewable-energy-consumption-by-energy-use-sector-and-energy-source-1989-2008
data source found by: Mario Pena
Analysis 3: https://www.scq.ubc.ca/so-much-candy-data-seriously/
data source found by: Ajay Arora
Analysis 1: Unemployment Rate
The data set suggested by Erinda Budo contains the unemployement rates from 1990 to 2018 for numbers of countries around the world. Erinda suggested to analyze the unemployment rate from 2011 to 2015 for each of the country and the whole world. Since we have dataup to 2018, I will perform analysis on the unemployment rate from 2014 to 2018.
I select this data set because it has a lot of NA values and some of the columns are represneting a region or a group of countries that need to be excluded from our analysis.
Load all libraries
library(tidyr)
library(dplyr)
library(plyr)
library(ggplot2)
library(stringr)
library(gdata)
Load the data file in to a table
rawData1 <- read.table("https://raw.githubusercontent.com/ezaccountz/Data_607_Project_2/master/Unemployment%20Rate.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
head(rawData1)
## X Advanced.Economies Argentina Australia Austria Belgium Bulgaria
## 1 NA NA NA NA NA NA NA
## 2 1990 5.800582 NA 6.943297 5.373002 6.550260 NA
## 3 1991 6.728688 NA 9.614137 5.823096 6.439812 NA
## 4 1992 7.511064 NA 10.750080 5.941711 7.088092 13.23500
## 5 1993 7.936175 NA 10.866170 6.811381 8.619130 15.85583
## 6 1994 7.715897 NA 9.705695 6.545480 9.753554 14.06583
## Bahrain Belarus Brazil Canada Switzerland Chile China Colombia Cyprus
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA NA NA 8.15000 0.501328 NA NA NA NA
## 3 NA NA NA 10.31667 1.090451 NA NA NA NA
## 4 NA NA NA 11.21667 2.563105 NA NA NA NA
## 5 NA NA NA 11.37500 4.516116 NA NA NA NA
## 6 NA NA NA 10.39167 4.718465 NA NA NA NA
## Czech.Republic Germany Denmark Dominican.Republic Algeria
## 1 NA NA NA NA NA
## 2 NA NA NA NA 25.0
## 3 NA 4.864885 NA NA 25.0
## 4 NA 5.764563 NA NA 27.0
## 5 4.333333 6.931370 NA NA 23.2
## 6 4.283333 7.340639 NA NA 24.4
## EMDE.East.Asia...Pacific EMDE.Europe...Central.Asia Ecuador
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## Egypt..Arab.Rep. Emerging.Market.and.Developing.Economies..EMDEs.
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Spain Estonia Finland France United.Kingdom Greece
## 1 NA NA NA NA NA NA
## 2 15.48333 0.650 3.103129 7.625 7.091667 NA
## 3 15.51667 1.475 6.666424 7.800 8.825000 NA
## 4 17.06667 3.725 11.796830 8.650 9.966667 NA
## 5 20.83333 6.550 16.384210 9.650 10.400000 NA
## 6 22.05000 7.550 16.534420 10.250 9.500000 NA
## High.Income.Countries Hong.Kong.SAR..China Croatia Hungary India
## 1 NA NA NA NA NA
## 2 5.619945 1.318868 NA NA NA
## 3 6.771918 1.750180 NA NA NA
## 4 7.693434 1.946343 NA NA NA
## 5 8.192391 1.979785 NA NA NA
## 6 8.052012 1.911372 NA NA NA
## Ireland Iceland Israel Italy Jordan Japan Kazakhstan Korea..Rep.
## 1 NA NA NA NA NA NA NA NA
## 2 13.41667 NA NA NA NA 2.108117 NA NA
## 3 14.73333 NA NA NA NA 2.099018 NA NA
## 4 15.40000 NA NA NA NA 2.151389 NA NA
## 5 15.63333 NA NA NA 19.7 2.503291 NA NA
## 6 14.35000 NA NA NA 15.8 2.890953 NA NA
## EMDE.Latin.America...Caribbean Low.Income.Countries..LIC. Sri.Lanka
## 1 NA NA NA
## 2 NA NA 15.9
## 3 NA NA 14.7
## 4 NA NA 14.6
## 5 NA NA 13.8
## 6 NA NA 13.1
## Lithuania Luxembourg Latvia Morocco Moldova..Rep. Mexico
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 NA NA NA NA NA NA
## 5 4.191667 NA 4.658333 NA NA NA
## 6 3.625000 NA 6.358333 NA NA NA
## Middle.Income.Countries..MIC. North.Macedonia Malta
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## EMDE.Middle.East...N..Africa Netherlands Norway New.Zealand Pakistan
## 1 NA NA NA NA NA
## 2 NA NA 5.783333 7.984591 3.13
## 3 NA NA 6.041667 10.611440 6.28
## 4 NA NA 6.550000 10.644730 5.85
## 5 NA NA 6.608333 9.800159 4.73
## 6 NA NA 6.000000 8.342465 4.84
## Peru Philippines Poland Portugal Romania Russian.Federation
## 1 NA NA NA NA NA NA
## 2 NA NA 3.441667 NA NA NA
## 3 NA 10.475 9.008333 NA NA NA
## 4 NA 9.850 12.933330 NA 5.450000 NA
## 5 NA 9.350 15.033330 NA 9.208333 NA
## 6 NA 9.550 16.508330 NA 10.975000 7.00654
## EMDE.South.Asia Saudi.Arabia Singapore EMDE.Sub.Saharan.Africa Slovakia
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA 1.750 NA 7.05000
## 4 NA NA 1.800 NA 11.31833
## 5 NA NA 1.675 NA 12.85500
## 6 NA NA 1.725 NA 14.62917
## Slovenia Sweden Thailand Tunisia Turkey Taiwan..China Uruguay
## 1 NA NA NA NA NA NA NA
## 2 NA 2.239701 NA NA NA 1.658333 NA
## 3 NA 4.005607 NA NA NA 1.533333 NA
## 4 11.56667 7.110956 NA NA NA 1.500000 NA
## 5 14.57500 11.146890 NA NA NA 1.425000 NA
## 6 14.55000 10.766190 NA NA NA 1.566667 NA
## United.States Venezuela..RB Vietnam World..WBG.members. South.Africa
## 1 NA NA NA NA NA
## 2 5.616667 NA NA NA NA
## 3 6.850000 NA NA NA NA
## 4 7.491667 NA NA NA NA
## 5 6.908333 NA NA NA NA
## 6 6.100000 NA NA NA NA
Column 1 contains the value of year, but the column name is missing. Add the column name.
colnames(rawData1)[1] <- "Year"
This analysis is on counties, organzations and groups of countries need to be excluded from our analysis
analysis1 <- select(rawData1, -matches("EMDE|Countries|Economies|World"))
Filter out the blank rows
analysis1 <- filter(analysis1, !is.na(Year))
Tidy our data table and ready for analysis
analysis1 <- gather(analysis1, "Country", "Unemployment_Rate", -Year)
analysis1$Unemployment_Rate <- round(as.numeric(analysis1$Unemployment_Rate),2)
head(analysis1)
## Year Country Unemployment_Rate
## 1 1990 Argentina NA
## 2 1991 Argentina NA
## 3 1992 Argentina NA
## 4 1993 Argentina NA
## 5 1994 Argentina NA
## 6 1995 Argentina NA
Question 1: What are the average umemployment rate from 2014 to 2018 for each country?
question_one <- analysis1 %>%
dplyr::filter(Year %in% c(2014:2018)) %>%
dplyr::group_by(Country) %>%
dplyr::summarise(ave_unemp_rate = mean(Unemployment_Rate, na.rm = TRUE),
min = min(Unemployment_Rate, na.rm = TRUE),
max = max(Unemployment_Rate, na.rm = TRUE),
median = median(Unemployment_Rate, na.rm = TRUE),
standard_dev = sd(Unemployment_Rate, na.rm = TRUE)) %>%
dplyr::arrange(desc(ave_unemp_rate))
question_one
## # A tibble: 72 x 6
## Country ave_unemp_rate min max median standard_dev
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South.Africa 26.3 25.1 27.4 26.7 1.05
## 2 North.Macedonia 24.2 20.7 28.0 23.8 2.90
## 3 Greece 23.1 19.2 26.5 23.6 2.86
## 4 Spain 19.7 15.3 24.4 19.6 3.67
## 5 Jordan 15.4 11.9 18.6 15.3 3.01
## 6 Tunisia 15.3 15.0 15.5 15.4 0.193
## 7 Croatia 14.7 9.86 19.3 14.9 3.71
## 8 Cyprus 12.7 8.38 16.2 13.0 3.08
## 9 Egypt..Arab.Rep. 12.0 9.85 13.2 12.5 1.31
## 10 Italy 11.6 10.6 12.6 11.7 0.758
## # ... with 62 more rows
Question 2: What’s the overall trend of the world’s annual unemployment rate from 2014 to 2018?
In order to make the unemployment rate for each year comparable, countries with unemployment rate missing in any year from 2014 to 2018 will be excluded. The countries with missing values are
question_two <- analysis1 %>%
dplyr::filter(Year %in% c(2014:2018)) %>%
dplyr::filter(is.na(Unemployment_Rate)) %>%
dplyr::distinct(Country)
question_two
## Country
## 1 Bahrain
## 2 Dominican.Republic
## 3 Algeria
## 4 India
## 5 Pakistan
## 6 Venezuela..RB
Calculate the overall unemployment rate for each year
question_two <- analysis1 %>%
dplyr::filter(Year %in% c(2014:2018)) %>%
dplyr::filter(!Country %in% unlist(question_two)) %>%
dplyr::group_by(Year) %>%
dplyr::summarise(ave_unemp_rate = mean(Unemployment_Rate, na.rm = TRUE),
min = min(Unemployment_Rate, na.rm = TRUE),
max = max(Unemployment_Rate, na.rm = TRUE),
median = median(Unemployment_Rate, na.rm = TRUE),
standard_dev = sd(Unemployment_Rate, na.rm = TRUE)) %>%
dplyr::arrange(Year)
question_two
## # A tibble: 5 x 6
## Year ave_unemp_rate min max median standard_dev
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014 8.50 0.5 28.0 6.80 5.83
## 2 2015 8.18 0.88 26.0 6.55 5.44
## 3 2016 7.85 0.99 26.7 6.60 5.22
## 4 2017 7.36 0.76 27.4 5.95 5.10
## 5 2018 6.81 0.41 27.1 5.44 4.89
ggplot(data=question_two, aes(x=Year, y=ave_unemp_rate)) +
geom_line(color="red")+
geom_point()
As we can see, the overall unemployment rate is dropping every year, which indicates that the world is getting better every year.
Analysis 2: Renewable Energy Consumption
The data set suggested by Mario Pena contains annual renewable energy consumption by source and end use from 1989 to 2008. Mario suggested to analyze the most used renewable energy per year or to find a “bright spot”
I select this data set becasue the structure of the data is very messy and some of the data fields are very confusing. For example, the first coloumn contains inforamtion of both sectors and sources. It will require a lot of work to clean up and tidy the data.
First, Load the data file in to a table
rawData2 <- read.table("https://raw.githubusercontent.com/ezaccountz/Data_607_Project_2/master/Renewable%20Energy.csv", sep = ",", header = FALSE,stringsAsFactors = FALSE)
remove the discription rows which contain no data and fix the column names
analysis2 <- rawData2[-c(1:8),]
colnames(analysis2) <- rawData2[7,]
remove all note reference numbers from column 1
analysis2[,1] <- str_remove_all(analysis2[,1],"\\d")
remove the sections that represent the total of some other sections
temp <- which(analysis2[,1] == "")
analysis2 <- analysis2[-c(which(analysis2[,1] == "Total")[1] :
temp[temp > which(analysis2[,1] == "Total")[1]][1]),]
temp <- which(analysis2[,1] == "")
analysis2 <- analysis2[-c(which(analysis2[,1] == "Electric Power")[1] :
temp[temp > which(analysis2[,1] == "Electric Power")[1]][1]),]
Filter out the blank rows
analysis2 <- dplyr::filter(analysis2, analysis2$`Sector and Source` != "")
The first column contains multiple information: the sector, the energy source’s main category and also the sub categories for Biomass. We need to sparate these information into 3 different columns.
Add columns for separating the sector, source main category, and source sub category
analysis2 <- analysis2 %>%
mutate(Sector = NA) %>%
mutate("Main category" = NA) %>%
select("Main category", "Sector and Source", "Sector", everything())
Transform and fill in the correct value for sector, main category and sub category
temp <- analysis2 %>%
dplyr::filter(!str_detect(analysis2$`Sector and Source`, " ")) %>%
dplyr::distinct(`Sector and Source`)
analysis2$Sector <- ifelse(analysis2$`Sector and Source` %in% unlist(temp),
analysis2$`Sector and Source`, NA)
analysis2 <- fill(analysis2, Sector)
analysis2 <- analysis2 %>%
dplyr::filter(!`Sector and Source` %in% unlist(temp))
temp <- analysis2 %>%
dplyr::filter(str_detect(analysis2$`Sector and Source`, " ")) %>%
dplyr::distinct(`Sector and Source`)
analysis2$`Main category` <- ifelse(analysis2$`Sector and Source` %in% unlist(temp),
"Biomass", analysis2$`Sector and Source`)
analysis2 <- analysis2 %>%
dplyr::filter(`Sector and Source` != " Biomass") %>%
dplyr::mutate(`Sector and Source` = str_remove_all(`Sector and Source`, " ")) %>%
dplyr::mutate(`Main category` = str_remove_all(`Main category`, " ")) %>%
dplyr::rename("Sub Category" = "Sector and Source")
head(analysis2)
## Main category Sub Category Sector 1989 1990 1991
## 1 Biomass Wood and Derived Fuels Residential 0.920 0.580 0.610
## 2 Geothermal Geothermal Residential 0.005 0.006 0.006
## 3 Solar Thermal/PV Solar Thermal/PV Residential 0.053 0.056 0.058
## 4 Biomass Biofuels Commercial 0.001 * *
## 5 Biomass Waste Commercial 0.022 0.028 0.026
## 6 Biomass Wood and Derived Fuels Commercial 0.076 0.066 0.068
## 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
## 1 0.640 0.550 0.520 0.520 0.540 0.430 0.380 0.390 0.420 0.370 0.380 0.400
## 2 0.006 0.007 0.006 0.007 0.007 0.008 0.008 0.009 0.009 0.009 0.010 0.013
## 3 0.060 0.062 0.064 0.065 0.065 0.065 0.065 0.064 0.061 0.060 0.059 0.058
## 4 * * * * * * * * * * * 0.001
## 5 0.032 0.033 0.035 0.040 0.053 0.058 0.054 0.054 0.047 0.025 0.026 0.029
## 6 0.072 0.076 0.072 0.072 0.076 0.073 0.064 0.067 0.071 0.067 0.069 0.071
## 2004 2005 2006 2007 2008
## 1 0.410 0.430 0.390 0.430 0.450
## 2 0.014 0.016 0.018 0.022 0.026
## 3 0.059 0.061 0.067 0.075 0.088
## 4 0.001 0.001 0.001 0.002 0.002
## 5 0.034 0.034 0.036 0.031 0.034
## 6 0.070 0.070 0.065 0.069 0.073
Tiny the data for analysis
analysis2 <- analysis2 %>%
gather("Year", "Usage", -c(1:3))
head(analysis2)
## Main category Sub Category Sector Year Usage
## 1 Biomass Wood and Derived Fuels Residential 1989 0.920
## 2 Geothermal Geothermal Residential 1989 0.005
## 3 Solar Thermal/PV Solar Thermal/PV Residential 1989 0.053
## 4 Biomass Biofuels Commercial 1989 0.001
## 5 Biomass Waste Commercial 1989 0.022
## 6 Biomass Wood and Derived Fuels Commercial 1989 0.076
Filter out the rows with no value in the Usage column
analysis2$Usage <- ifelse(str_detect(analysis2$Usage, "[[:digit:]]"),as.numeric(analysis2$Usage), NA)
analysis2 <- filter(analysis2,str_detect(analysis2$Usage, "[[:digit:]]"))
Question 1: what is the most used renewable energy each year?
question_one <- analysis2 %>%
dplyr::group_by(Year, "Main category" = analysis2$`Main category`) %>%
dplyr::summarise(sum = sum(Usage, na.rm = TRUE)) %>%
dplyr::arrange(Year, desc(sum)) %>%
dplyr::ungroup() %>%
dplyr::group_by(Year) %>%
dplyr::top_n(1)
## Selecting by sum
question_one
## # A tibble: 20 x 3
## # Groups: Year [20]
## Year `Main category` sum
## <chr> <chr> <dbl>
## 1 1989 Biomass 3.16
## 2 1990 Hydroelectric Conventional 3.05
## 3 1991 Hydroelectric Conventional 3.02
## 4 1992 Biomass 2.93
## 5 1993 Biomass 2.91
## 6 1994 Biomass 3.03
## 7 1995 Hydroelectric Conventional 3.20
## 8 1996 Hydroelectric Conventional 3.59
## 9 1997 Hydroelectric Conventional 3.64
## 10 1998 Hydroelectric Conventional 3.30
## 11 1999 Hydroelectric Conventional 3.27
## 12 2000 Biomass 3.01
## 13 2001 Biomass 2.62
## 14 2002 Biomass 2.70
## 15 2003 Hydroelectric Conventional 2.82
## 16 2004 Biomass 3.01
## 17 2005 Biomass 3.12
## 18 2006 Biomass 3.28
## 19 2007 Biomass 3.50
## 20 2008 Biomass 3.85
Question 2: the are the amount of renewable energy coumsumed each year by each sector?
question_two <- analysis2 %>%
dplyr::group_by(Year, Sector) %>%
dplyr::summarise(sum = sum(Usage, na.rm = TRUE)) %>%
dplyr::arrange(Year, desc(sum))
question_two
## # A tibble: 120 x 3
## # Groups: Year [20]
## Year Sector sum
## <chr> <chr> <dbl>
## 1 1989 Electric Utilities 2.98
## 2 1989 Industrial 1.87
## 3 1989 Residential 0.978
## 4 1989 Independent Power Producers 0.39
## 5 1989 Commercial 0.103
## 6 1989 Transportation 0.068
## 7 1990 Electric Utilities 3.15
## 8 1990 Industrial 1.72
## 9 1990 Residential 0.642
## 10 1990 Independent Power Producers 0.539
## # ... with 110 more rows
Finding a “bright spot”
question_two_1989 <- question_two %>%
filter(Year == 1989)
question_two_2008 <- question_two %>%
filter(Year == 2008)
question_two_1989 %>%
left_join(question_two_2008, by = "Sector") %>%
mutate(Percent_Increase = (sum.y / sum.x - 1)*100) %>%
select(Sector, everything())
## # A tibble: 6 x 6
## # Groups: Year.x [1]
## Sector Year.x sum.x Year.y sum.y Percent_Increase
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Electric Utilities 1989 2.98 2008 2.40 -19.4
## 2 Industrial 1989 1.87 2008 2.05 9.78
## 3 Residential 1989 0.978 2008 0.564 -42.3
## 4 Independent Power Producers 1989 0.39 2008 1.39 257.
## 5 Commercial 1989 0.103 2008 0.125 21.4
## 6 Transportation 1989 0.068 2008 0.827 1116.
Comparing the energy consumption of 2008 to 1989, we can see that the consumption by Transportation is increased by 1116%! As we know, transportation nowadays still relies on gasoline. It would be a good idea to expand the usage of renewable energy in Transportation.
Analysis 3: So much candy!
The data set suggested by Ajay contains ratings of over 100 candies by people around the world (mainly from the United States and Canada). Ajay suggested to perform analysis to find out the favorite candy based the people’s demographics (gender, geopraphy, etc.)
I select this data set becasue the data set is huge and the data values are not standarized. Some data fields have wrong or missing information. It would be a challenge to clean up the data for analysis.
First, Load the data file in to a table
rawData3 <- read.table("https://raw.githubusercontent.com/ezaccountz/Data_607_Project_2/master/candy.txt", sep = "\t", header = TRUE, stringsAsFactors = FALSE, quote = "", fill = TRUE)
Also, prepare a table that contains the abbriviations and province names of Canada for later use
canadaProvince <- read.table("https://raw.githubusercontent.com/ezaccountz/Data_607_Project_2/master/Canada.csv", sep = ",", header = TRUE,stringsAsFactors = FALSE)
Simplify the column names. Also, because the candy names are too long, I will create a sparate table to hold the candy names with a unique candy ID. The column names of the candies will be replaced by the candy ID
analysis3 <- select(rawData3, 1: sum(str_detect(colnames(rawData3), "Q[1-6]\\D|Internal")))
columnNames <- colnames(analysis3)
temp <- sum(str_detect(colnames(rawData3), "Q[1-5]\\D|Internal"))
columnNames <- str_remove_all(columnNames, ".*?Q\\d[[:punct:]]*")
columnNames <- str_replace_all(columnNames, "\\.+", " ")
columnNames <- trim(columnNames)
candyTable <- matrix(nrow = length(columnNames) - temp, ncol=2)
colnames(candyTable) <- c("CandyID", "CandyName")
candyTable[,1] = as.character(c(1:length(candyTable[,1])))
candyTable[,2] = columnNames[-c(1:temp)]
columnNames[-c(1:temp)] <- candyTable[,1]
colnames(analysis3) <- columnNames
colnames(analysis3)[6] <- "STATE"
head(analysis3)
## Internal ID GOING OUT GENDER AGE COUNTRY STATE 1 2 3
## 1 90258773
## 2 90272821 No Male 44 USA NM MEH DESPAIR JOY
## 3 90272829 Male 49 USA Virginia
## 4 90272840 No Male 40 us or MEH DESPAIR JOY
## 5 90272841 No Male 23 usa exton pa JOY DESPAIR JOY
## 6 90272852 No Male JOY DESPAIR JOY
## 4 5 6 7 8 9 10 11 12
## 1
## 2 MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR MEH MEH
## 3
## 4 MEH MEH DESPAIR MEH DESPAIR DESPAIR MEH MEH DESPAIR
## 5 DESPAIR MEH DESPAIR MEH DESPAIR DESPAIR MEH MEH DESPAIR
## 6 MEH MEH DESPAIR JOY DESPAIR MEH
## 13 14 15 16 17 18 19 20 21 22 23
## 1
## 2 DESPAIR MEH JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR MEH JOY
## 3
## 4 DESPAIR MEH JOY MEH JOY MEH DESPAIR DESPAIR DESPAIR JOY JOY
## 5 DESPAIR JOY MEH JOY DESPAIR DESPAIR MEH DESPAIR DESPAIR MEH JOY
## 6 MEH JOY JOY JOY MEH MEH JOY DESPAIR MEH MEH JOY
## 24 25 26 27 28 29 30 31 32 33 34
## 1
## 2 DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR MEH MEH DESPAIR MEH JOY
## 3
## 4 MEH MEH JOY JOY MEH DESPAIR JOY MEH MEH DESPAIR MEH
## 5 DESPAIR DESPAIR DESPAIR MEH MEH DESPAIR JOY MEH DESPAIR DESPAIR JOY
## 6 DESPAIR MEH JOY MEH DESPAIR MEH MEH MEH JOY JOY
## 35 36 37 38 39 40 41 42 43 44 45
## 1
## 2 JOY MEH DESPAIR DESPAIR MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR JOY
## 3
## 4 MEH MEH MEH MEH MEH DESPAIR DESPAIR DESPAIR DESPAIR MEH MEH
## 5 MEH MEH DESPAIR MEH JOY MEH JOY DESPAIR DESPAIR JOY JOY
## 6 MEH MEH MEH MEH MEH JOY DESPAIR JOY
## 46 47 48 49 50 51 52 53 54 55 56 57
## 1
## 2 DESPAIR MEH MEH JOY MEH DESPAIR DESPAIR DESPAIR MEH MEH JOY JOY
## 3
## 4 JOY JOY MEH JOY JOY DESPAIR JOY MEH MEH DESPAIR JOY MEH
## 5 JOY JOY MEH DESPAIR JOY MEH MEH DESPAIR MEH MEH JOY JOY
## 6 MEH MEH MEH MEH JOY MEH JOY JOY JOY JOY JOY
## 58 59 60 61 62 63 64 65 66 67 68
## 1
## 2 MEH JOY JOY JOY JOY DESPAIR DESPAIR MEH DESPAIR DESPAIR DESPAIR
## 3
## 4 JOY MEH MEH MEH MEH MEH DESPAIR DESPAIR MEH MEH DESPAIR
## 5 JOY JOY JOY JOY JOY MEH DESPAIR JOY JOY MEH DESPAIR
## 6 JOY JOY JOY JOY JOY JOY JOY MEH
## 69 70 71 72 73 74 75 76 77 78 79
## 1
## 2 DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR JOY JOY DESPAIR JOY
## 3
## 4 MEH MEH MEH DESPAIR DESPAIR DESPAIR DESPAIR JOY MEH MEH MEH
## 5 JOY JOY JOY MEH DESPAIR JOY DESPAIR JOY JOY MEH JOY
## 6 JOY MEH MEH DESPAIR DESPAIR MEH DESPAIR JOY JOY DESPAIR JOY
## 80 81 82 83 84 85 86 87 88 89
## 1
## 2 DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR DESPAIR MEH DESPAIR MEH
## 3
## 4 DESPAIR DESPAIR DESPAIR MEH JOY MEH DESPAIR MEH MEH JOY
## 5 JOY JOY DESPAIR MEH DESPAIR JOY JOY MEH
## 6 DESPAIR JOY JOY JOY JOY DESPAIR DESPAIR MEH MEH MEH
## 90 91 92 93 94 95 96 97 98 99
## 1
## 2 DESPAIR DESPAIR DESPAIR DESPAIR JOY JOY DESPAIR JOY DESPAIR DESPAIR
## 3
## 4 MEH JOY MEH DESPAIR DESPAIR JOY MEH JOY DESPAIR JOY
## 5 DESPAIR MEH DESPAIR DESPAIR JOY JOY DESPAIR JOY MEH JOY
## 6 MEH DESPAIR JOY JOY MEH JOY DESPAIR DESPAIR
## 100 101 102 103
## 1
## 2 DESPAIR DESPAIR DESPAIR DESPAIR
## 3
## 4 JOY DESPAIR DESPAIR DESPAIR
## 5 JOY DESPAIR DESPAIR JOY
## 6 JOY DESPAIR DESPAIR JOY
Standardize the country name for “United States” and “Canada”
analysis3$COUNTRY[str_detect(toupper(analysis3$COUNTRY), "U.*S|MERICA")] <- "United States"
analysis3$COUNTRY[str_detect(toupper(analysis3$COUNTRY), "CANADA")] <- "Canada"
Standardize the state/province by using the abbreviations only
for(i in c(1:length(state.abb))){
analysis3$STATE[str_detect(toupper(analysis3$STATE),toupper(state.name[i]))
& analysis3$COUNTRY == "United States"] <- state.abb[i]
analysis3$STATE[str_detect(toupper(analysis3$STATE),state.abb[i])
& analysis3$COUNTRY == "United States"] <- state.abb[i]
}
for(i in c(1:length(canadaProvince$Abbreviation))){
analysis3$STATE[str_detect(toupper(analysis3$STATE),toupper(canadaProvince$Province.or.Territory[i]))
& analysis3$COUNTRY == "Canada"] <- canadaProvince$Abbreviation[i]
analysis3$STATE[str_detect(toupper(analysis3$STATE),canadaProvince$Abbreviation[i])
& analysis3$COUNTRY == "Canada"] <- canadaProvince$Abbreviation[i]
}
Extract and convert the age to integer
analysis3$AGE <- as.integer(str_extract(analysis3$AGE, "\\d+"))
Exclude people from countries other than the United States and Canada, because the number of people from the other countries is to small to be sufficient for analysis.
Exclude people less than 12 years old and more than 120 years old.
Exclude people with invalid value in the gender column
analysis3 <- analysis3 %>%
filter(GENDER == "Male" | GENDER == "Female") %>%
filter(AGE >= 12 & AGE <= 120) %>%
filter((COUNTRY == "United States" & STATE %in% state.abb) | (COUNTRY == "Canada" & STATE %in% canadaProvince$Abbreviation))
head(analysis3)
## Internal ID GOING OUT GENDER AGE COUNTRY STATE 1 2 3
## 1 90272821 No Male 44 United States NM MEH DESPAIR JOY
## 2 90272829 Male 49 United States IN
## 3 90272840 No Male 40 United States OR MEH DESPAIR JOY
## 4 90272841 No Male 23 United States PA JOY DESPAIR JOY
## 5 90272853 No Male 53 United States CO
## 6 90272854 No Male 33 Canada NT JOY DESPAIR JOY
## 4 5 6 7 8 9 10 11 12
## 1 MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR MEH MEH
## 2
## 3 MEH MEH DESPAIR MEH DESPAIR DESPAIR MEH MEH DESPAIR
## 4 DESPAIR MEH DESPAIR MEH DESPAIR DESPAIR MEH MEH DESPAIR
## 5
## 6 DESPAIR DESPAIR MEH JOY MEH JOY JOY JOY JOY
## 13 14 15 16 17 18 19 20 21 22 23
## 1 DESPAIR MEH JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR MEH JOY
## 2
## 3 DESPAIR MEH JOY MEH JOY MEH DESPAIR DESPAIR DESPAIR JOY JOY
## 4 DESPAIR JOY MEH JOY DESPAIR DESPAIR MEH DESPAIR DESPAIR MEH JOY
## 5
## 6 DESPAIR JOY JOY JOY DESPAIR MEH MEH DESPAIR DESPAIR MEH MEH
## 24 25 26 27 28 29 30 31 32
## 1 DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR MEH MEH DESPAIR
## 2
## 3 MEH MEH JOY JOY MEH DESPAIR JOY MEH MEH
## 4 DESPAIR DESPAIR DESPAIR MEH MEH DESPAIR JOY MEH DESPAIR
## 5
## 6 JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR JOY DESPAIR DESPAIR
## 33 34 35 36 37 38 39 40 41 42
## 1 MEH JOY JOY MEH DESPAIR DESPAIR MEH DESPAIR DESPAIR DESPAIR
## 2
## 3 DESPAIR MEH MEH MEH MEH MEH MEH DESPAIR DESPAIR DESPAIR
## 4 DESPAIR JOY MEH MEH DESPAIR MEH JOY MEH JOY DESPAIR
## 5
## 6 DESPAIR JOY MEH MEH DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR
## 43 44 45 46 47 48 49 50 51 52
## 1 DESPAIR DESPAIR JOY DESPAIR MEH MEH JOY MEH DESPAIR DESPAIR
## 2
## 3 DESPAIR MEH MEH JOY JOY MEH JOY JOY DESPAIR JOY
## 4 DESPAIR JOY JOY JOY JOY MEH DESPAIR JOY MEH MEH
## 5
## 6 DESPAIR DESPAIR MEH MEH JOY MEH DESPAIR DESPAIR MEH JOY
## 53 54 55 56 57 58 59 60 61 62 63 64 65
## 1 DESPAIR MEH MEH JOY JOY MEH JOY JOY JOY JOY DESPAIR DESPAIR MEH
## 2
## 3 MEH MEH DESPAIR JOY MEH JOY MEH MEH MEH MEH MEH DESPAIR DESPAIR
## 4 DESPAIR MEH MEH JOY JOY JOY JOY JOY JOY JOY MEH DESPAIR JOY
## 5
## 6 JOY JOY JOY JOY JOY JOY JOY JOY JOY MEH DESPAIR JOY DESPAIR
## 66 67 68 69 70 71 72 73 74
## 1 DESPAIR DESPAIR DESPAIR DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR
## 2
## 3 MEH MEH DESPAIR MEH MEH MEH DESPAIR DESPAIR DESPAIR
## 4 JOY MEH DESPAIR JOY JOY JOY MEH DESPAIR JOY
## 5
## 6 DESPAIR MEH MEH JOY JOY MEH DESPAIR DESPAIR DESPAIR
## 75 76 77 78 79 80 81 82 83 84 85
## 1 DESPAIR JOY JOY DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR MEH DESPAIR
## 2
## 3 DESPAIR JOY MEH MEH MEH DESPAIR DESPAIR DESPAIR MEH JOY MEH
## 4 DESPAIR JOY JOY MEH JOY JOY JOY DESPAIR MEH
## 5
## 6 DESPAIR JOY JOY JOY DESPAIR JOY MEH JOY MEH JOY
## 86 87 88 89 90 91 92 93 94 95
## 1 DESPAIR MEH DESPAIR MEH DESPAIR DESPAIR DESPAIR DESPAIR JOY JOY
## 2
## 3 DESPAIR MEH MEH JOY MEH JOY MEH DESPAIR DESPAIR JOY
## 4 DESPAIR JOY JOY MEH DESPAIR MEH DESPAIR DESPAIR JOY JOY
## 5
## 6 DESPAIR MEH JOY JOY MEH MEH DESPAIR DESPAIR JOY MEH
## 96 97 98 99 100 101 102 103
## 1 DESPAIR JOY DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR DESPAIR
## 2
## 3 MEH JOY DESPAIR JOY JOY DESPAIR DESPAIR DESPAIR
## 4 DESPAIR JOY MEH JOY JOY DESPAIR DESPAIR JOY
## 5
## 6 DESPAIR JOY JOY MEH DESPAIR DESPAIR DESPAIR DESPAIR
The rating with “JOY”, “MEH” and “DESPAIR” are vague, replace them with “Like”, “OK” and “Dislike”
analysis3[analysis3 == "JOY"] <- "Like"
analysis3[analysis3 == "MEH"] <- "OK"
analysis3[analysis3 == "DESPAIR"] <- "Dislike"
analysis3[analysis3 == ""] <- "NA"
Tidy the data and filter out the rows with no rating
analysis3 <- analysis3 %>%
gather("CandyID", "Rating", -c(1:temp)) %>%
filter(Rating != "NA")
head(analysis3)
## Internal ID GOING OUT GENDER AGE COUNTRY STATE CandyID Rating
## 1 90272821 No Male 44 United States NM 1 OK
## 2 90272840 No Male 40 United States OR 1 OK
## 3 90272841 No Male 23 United States PA 1 Like
## 4 90272854 No Male 33 Canada NT 1 Like
## 5 90272858 No Male 40 Canada NT 1 Like
## 6 90272859 No Female 53 United States WA 1 OK
Question 1: what are the top 10 candies with the highest rate of “Like”
analysis3 %>%
dplyr::group_by(CandyID) %>%
dplyr::count(Rating) %>%
dplyr::mutate(percentage = n/sum(n)) %>%
dplyr::filter(Rating == "Like") %>%
dplyr::arrange(desc(percentage)) %>%
dplyr::left_join(as_tibble(candyTable),by = "CandyID") %>%
dplyr::ungroup() %>%
dplyr::slice(1:10)
## # A tibble: 10 x 5
## CandyID Rating n percentage CandyName
## <chr> <chr> <int> <dbl> <chr>
## 1 3 Like 1359 0.873 Any full sized candy bar
## 2 76 Like 1317 0.849 Reese s Peanut Butter Cups
## 3 45 Like 1265 0.813 Kit Kat
## 4 15 Like 1258 0.811 Cash or other forms of legal tender
## 5 97 Like 1240 0.803 Twix
## 6 84 Like 1229 0.797 Snickers
## 7 95 Like 1157 0.756 Tolberone something or other
## 8 50 Like 1120 0.736 Lindt Truffle
## 9 58 Like 1136 0.729 Peanut M M s
## 10 23 Like 1039 0.676 Dove Bars
Question 2: what are the top 5 candies with the highest rate of “Like” for each country
analysis3 %>%
dplyr::group_by(COUNTRY,CandyID) %>%
dplyr::count(Rating) %>%
dplyr::mutate(percentage = n/sum(n)) %>%
dplyr::filter(Rating == "Like") %>%
dplyr::arrange(COUNTRY, desc(percentage)) %>%
dplyr::left_join(as_tibble(candyTable),by = "CandyID") %>%
dplyr::ungroup() %>%
dplyr::group_by(COUNTRY) %>%
dplyr::slice(1:5)
## # A tibble: 10 x 6
## # Groups: COUNTRY [2]
## COUNTRY CandyID Rating n percentage CandyName
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 Canada 3 Like 132 0.936 Any full sized candy bar
## 2 Canada 76 Like 124 0.892 Reese s Peanut Butter Cups
## 3 Canada 15 Like 124 0.879 Cash or other forms of leg~
## 4 Canada 19 Like 122 0.865 Coffee Crisp
## 5 Canada 45 Like 121 0.864 Kit Kat
## 6 United Stat~ 3 Like 1227 0.867 Any full sized candy bar
## 7 United Stat~ 76 Like 1193 0.845 Reese s Peanut Butter Cups
## 8 United Stat~ 45 Like 1144 0.808 Kit Kat
## 9 United Stat~ 97 Like 1136 0.807 Twix
## 10 United Stat~ 15 Like 1134 0.804 Cash or other forms of leg~
Question 3: what are the top 5 candies with the highest rate of “Like” for each gender
analysis3 %>%
dplyr::group_by(GENDER,CandyID) %>%
dplyr::count(Rating) %>%
dplyr::mutate(percentage = n/sum(n)) %>%
dplyr::filter(Rating == "Like") %>%
dplyr::arrange(GENDER, desc(percentage)) %>%
dplyr::left_join(as_tibble(candyTable),by = "CandyID") %>%
dplyr::ungroup() %>%
dplyr::group_by(GENDER) %>%
dplyr::slice(1:5)
## # A tibble: 10 x 6
## # Groups: GENDER [2]
## GENDER CandyID Rating n percentage CandyName
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 Female 3 Like 501 0.887 Any full sized candy bar
## 2 Female 15 Like 477 0.846 Cash or other forms of legal ten~
## 3 Female 76 Like 474 0.842 Reese s Peanut Butter Cups
## 4 Female 97 Like 464 0.827 Twix
## 5 Female 45 Like 456 0.809 Kit Kat
## 6 Male 3 Like 858 0.865 Any full sized candy bar
## 7 Male 76 Like 843 0.853 Reese s Peanut Butter Cups
## 8 Male 45 Like 809 0.816 Kit Kat
## 9 Male 84 Like 789 0.803 Snickers
## 10 Male 15 Like 781 0.791 Cash or other forms of legal ten~
Question 4: what are the top 1 candy with the highest rate of “Like” for each state in the United States (some state may be missing in the data)
analysis3 %>%
dplyr::filter(COUNTRY == "United States") %>%
dplyr::group_by(STATE,CandyID) %>%
dplyr::count(Rating) %>%
dplyr::mutate(percentage = n/sum(n)) %>%
dplyr::filter(Rating == "Like") %>%
dplyr::arrange(STATE, desc(percentage)) %>%
dplyr::left_join(as_tibble(candyTable),by = "CandyID") %>%
dplyr::ungroup() %>%
dplyr::group_by(STATE) %>%
dplyr::slice(1:1)
## # A tibble: 47 x 6
## # Groups: STATE [47]
## STATE CandyID Rating n percentage CandyName
## <chr> <chr> <chr> <int> <dbl> <chr>
## 1 AK 15 Like 6 1 Cash or other forms of legal tend~
## 2 AL 3 Like 129 0.816 Any full sized candy bar
## 3 AR 3 Like 42 0.933 Any full sized candy bar
## 4 AZ 3 Like 21 0.913 Any full sized candy bar
## 5 CA 3 Like 76 0.864 Any full sized candy bar
## 6 CO 3 Like 78 0.897 Any full sized candy bar
## 7 CT 3 Like 6 1 Any full sized candy bar
## 8 DE 15 Like 6 1 Cash or other forms of legal tend~
## 9 FL 76 Like 32 0.941 Reese s Peanut Butter Cups
## 10 GA 3 Like 50 0.962 Any full sized candy bar
## # ... with 37 more rows