Assignment

Choose any three of the “wide” datasets identified in the Week 5 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:

Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Libraries

library(stringr)
library(tidyr)
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts -------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)    # already attached by tidyverse above; redundant but harmless
library(ggplot2)  # already attached by tidyverse above; redundant but harmless

Data Set 1 - Farhana Zahir: Untidy stock market data

I mostly work with financial data, and a very interesting dataset I came across is from the stock market of India, available at https://www.kaggle.com/souravroy1/stock-market-data/download. The dataset contains 63 columns and is essentially the trading data of the market on a single day. The columns contain the ticker, sector, last traded price, high price, low price, and a number of other price points and valuation metrics for that particular trading day.

The structure of the data set (shown in the Preliminary Look below; only a few of the 63 variables are of interest) makes clear that this is extremely untidy data, with many of the prices stored as character variables. There are also a lot of NAs in this file. I would first need to do a missing-value analysis to determine which variables I can actually work with, change the data types to appropriate ones, and, since many of the variables can be derived from others, try to reformulate as much of the table as possible.

Preliminary Look

Let’s first load the data into R and take a look at it.

stockData <- read.csv('Stock Market data .csv', header = TRUE)
dim(stockData)
## [1] 1568   63
str(stockData)
## 'data.frame':    1568 obs. of  63 variables:
##  $ Share                   : Factor w/ 1563 levels "","20MICRONS",..: 2 3 4 5 6 7 8 9 10 11 ...
##  $ Category                : Factor w/ 10 levels "","#N/A","Buy",..: 5 4 5 9 7 5 4 3 5 5 ...
##  $ Sector                  : Factor w/ 42 levels "","#N/A","#REF!",..: 12 31 14 31 31 27 38 12 41 15 ...
##  $ RM                      : Factor w/ 63 levels "","#N/A","#REF!",..: 2 2 4 2 4 4 4 20 2 4 ...
##  $ Up                      : Factor w/ 11 levels "","#N/A","#REF!",..: 2 2 4 2 4 4 4 7 2 4 ...
##  $ Last.Traded.Price       : Factor w/ 1402 levels "","#N/A","0.15",..: 882 975 139 1222 1107 943 1027 1310 1192 475 ...
##  $ Percentage.Change       : Factor w/ 487 levels "","-0.01","-0.02",..: 246 246 355 273 264 432 29 281 246 298 ...
##  $ High.Price              : Factor w/ 916 levels "","#N/A","0",..: 2 2 83 803 735 648 684 855 2 331 ...
##  $ Low.Price               : Factor w/ 915 levels "","#N/A","0",..: 2 2 77 784 713 620 669 846 2 313 ...
##  $ X..High.Movt            : Factor w/ 934 levels "","-0.000118329",..: 21 21 490 484 741 889 617 358 21 549 ...
##  $ X..Low.movt             : Factor w/ 928 levels "","-0.000111857",..: 126 126 444 127 4 156 305 32 126 73 ...
##  $ Yearly.Gainner          : Factor w/ 295 levels "","-1%","-10%",..: 195 181 97 1 31 93 195 233 224 195 ...
##  $ X27th.Dec               : Factor w/ 188 levels "","-1%","-10%",..: 119 46 102 1 2 105 34 110 38 46 ...
##  $ X1st.Feb                : Factor w/ 156 levels "","-1%","-10%",..: 55 24 142 1 98 72 5 77 7 47 ...
##  $ X1st.March              : Factor w/ 132 levels "","-1%","-10%",..: 61 95 129 1 117 60 4 51 23 39 ...
##  $ X1st.April              : Factor w/ 108 levels "","-1%","-10%",..: 55 92 92 1 62 37 30 42 24 20 ...
##  $ TB                      : Factor w/ 2 levels "","TB": 1 2 2 1 2 1 2 1 2 2 ...
##  $ RH                      : num  49.4 5.8 12231 83 663.8 ...
##  $ Corr                    : Factor w/ 76 levels "","-1%","-10%",..: 24 72 19 75 66 75 29 31 38 26 ...
##  $ PeRatio                 : num  10.72 6.89 66.29 213.4 23.2 ...
##  $ New.Pe                  : Factor w/ 638 levels "","-0.033587786",..: 395 607 614 581 410 281 410 449 238 75 ...
##  $ W52_High                : num  49.4 7.15 15250 96 1119 ...
##  $ Corre                   : Factor w/ 84 levels "","0%","1%","10%",..: 7 20 15 16 41 82 17 14 32 18 ...
##  $ latest                  : int  42775 NA 42775 42767 NA 42779 42873 42874 42777 42769 ...
##  $ Annual_Pat              : num  11.1 -1059.3 200.9 0 53.2 ...
##  $ Cum.PAT.3.Quarter       : num  11.727 60.76 150.364 0.144 85.423 ...
##  $ Pat.Jump                : num  0.667 1120.03 -50.566 0.144 32.187 ...
##  $ Annual.Growth           : Factor w/ 6 levels "","0.24726477",..: 1 6 1 1 6 1 6 6 1 1 ...
##  $ Year.End                : Factor w/ 5 levels "","December",..: 1 1 1 1 1 2 1 1 1 1 ...
##  $ Mar_17_ReportedPAT      : num  0 33.3 0 0 40.2 ...
##  $ Dec_16_ReportedPAT      : num  2.282 19.01 42.433 0.144 32.6 ...
##  $ Sep_16_ReportedPAT      : num  4.48 21.58 48.21 0 28.67 ...
##  $ Jun_16_ReportedPAT      : num  4.96 20.17 59.72 0 24.15 ...
##  $ Mar_16_ReportedPAT      : num  2.36 -812.28 58.41 0 16.36 ...
##  $ Dec_15_ReportedPAT      : num  3.22 -79.08 56.61 130.83 15.19 ...
##  $ Y.oY.PAT                : Factor w/ 2 levels "","Y": 2 2 2 1 2 1 2 2 1 1 ...
##  $ LQ.0.5.4q.s.            : Factor w/ 2 levels "","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PAT.Growth              : Factor w/ 2 levels "","Y": 2 2 2 1 2 1 2 2 1 1 ...
##  $ PeRatio.1               : num  10.72 6.89 66.29 213.4 23.2 ...
##  $ ResultDate              : int  42775 42855 42775 42767 42865 42779 42873 42874 42777 42769 ...
##  $ Mar_17_Inst_no_of_shares: int  3 46 51 14 6 6 13 99 3 45 ...
##  $ Dec_16_Inst_no_of_shares: int  2 49 48 15 6 6 11 88 4 40 ...
##  $ Int..Incr..Decr         : int  1 -3 3 -1 0 0 2 11 -1 5 ...
##  $ Pledge                  : num  54.38 0 0 0 8.78 ...
##  $ DividendYield           : num  0 0 0 6.61 0 0 1.26 0.98 0 0 ...
##  $ DERatio                 : num  1.7 -0.52 0.01 0.16 0.01 2.63 1.26 1.11 1.32 2.41 ...
##  $ FB                      : Factor w/ 3 levels "","#N/A","FB": 3 3 3 3 1 3 1 1 3 3 ...
##  $ Mar_17_Eps_Before       : Factor w/ 274 levels "","-0.01","-0.04",..: 42 55 42 42 146 42 42 42 42 42 ...
##  $ Dec_16_Eps_Before       : Factor w/ 938 levels "","-0.01","-0.02",..: 330 284 727 270 915 110 875 920 217 223 ...
##  $ Sep_16_Eps_Before       : Factor w/ 942 levels "","-0.01","-0.02",..: 382 280 777 266 620 88 925 926 613 224 ...
##  $ Jun_16_Eps_Before       : Factor w/ 936 levels "","-0.01","-0.02",..: 392 295 825 264 524 90 895 935 604 203 ...
##  $ Mar_16_Eps_Before       : Factor w/ 955 levels "","-0.01","-0.02",..: 338 115 839 277 461 320 901 921 398 166 ...
##  $ Dec_15_Eps_Before       : Factor w/ 893 levels "","-0.01","-0.02",..: 361 82 793 636 457 40 829 837 331 136 ...
##  $ Annual_Eps              : Factor w/ 1266 levels "","-0.01","-0.03",..: 873 132 676 368 953 100 856 914 757 1211 ...
##  $ Volume                  : Factor w/ 1547 levels "","#N/A","0.00001875",..: 594 769 748 788 1350 1220 610 1060 407 1193 ...
##  $ MarketCap               : Factor w/ 1212 levels "","#N/A","#REF!",..: 217 968 168 646 351 1021 128 1035 274 131 ...
##  $ Annual_Pat.1            : Factor w/ 1478 levels "","-0.0140877",..: 534 41 800 405 1225 71 1331 914 1340 1181 ...
##  $ EquityCapital           : Factor w/ 1488 levels "","#N/A","#REF!",..: 458 78 98 1429 345 263 669 1018 655 123 ...
##  $ FaceValue               : Factor w/ 14 levels "","#N/A","#REF!",..: 11 6 6 8 11 6 6 11 6 8 ...
##  $ New.Listing             : Factor w/ 92 levels "","ABMINTLTD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ PE.High.Price.High      : logi  NA NA NA NA NA NA ...
##  $ Future                  : Factor w/ 4 levels "","RISKY","RS2",..: 1 1 1 1 1 4 1 1 1 1 ...
##  $ Cap.Category            : Factor w/ 7 levels "","#N/A","#REF!",..: 6 6 5 6 7 6 7 5 6 7 ...

We have 1568 records with 63 fields. Just by looking at the data types, we can see that the majority of the columns are stored as factors (text), even ones that should be numeric! We can also see that quite a few fields contain values of #N/A or #REF!.

This is quite a bit of data to work with, so let's start by limiting it to a subset of the columns. I'm not very familiar with financial data, so we'll take the subset of columns discussed by Farhana; all of the fields that were discussed are in columns 1 through 9. For fun, let's also keep the Future column, because it's a categorical feature that might be interesting to look at across sectors.

# keep the discussed fields (columns 1-9) plus the Future column (column 62)
finalStocks <- stockData[,c(1:9,62)]

Before we start looking at the data, let’s convert all instances of #REF!, #N/A, and blank to NA. This will allow us to see which of these columns has enough meaningful records to analyze.

# recode the "#N/A", "#REF!", and blank factor levels to NA in every column
finalStocks[] <- lapply(finalStocks, function(x){
  levels(x)[levels(x) == "#N/A" | levels(x) == "#REF!" | levels(x) == ''] <- NA
  x
})
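
As a side note, the same clean-up can be handled at import time by telling read.csv() which strings to treat as missing; a minimal sketch, assuming the same file as above:

# alternative: map the placeholder strings to NA while reading the file,
# so the offending factor levels never appear in the first place
stockDataClean <- read.csv('Stock Market data .csv', header = TRUE,
                           na.strings = c("", "#N/A", "#REF!"))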

Now that we’ve converted our unknown data, let’s take a look at the distribution of nulls in each column. To do this, we’ll calculate the total number of null values in each column and divide that by the total number of records in the data set.

na_count <- sapply(finalStocks, function(y) sum(is.na(y)))
na_count / nrow(finalStocks)
##             Share          Category            Sector                RM 
##       0.003826531       0.015306122       0.008928571       0.778698980 
##                Up Last.Traded.Price Percentage.Change        High.Price 
##       0.787627551       0.012755102       0.382015306       0.382015306 
##         Low.Price            Future 
##       0.382015306       0.970663265
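
Equivalently, base R gives the same proportions in a single expression:

# proportion of NA values per column
colMeans(is.na(finalStocks))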

It looks like three columns (RM, Up, and Future) have more than 50% missing values. Let's subset our data once again to drop those.

finalStocks <- subset(finalStocks, select = -c(RM,Up,Future) )

Now that we have the columns we’d like to use, let’s take one last look at the data types of each to see if they’re formatted properly. We know that Share, Category, and Sector should all be string columns, and the rest should be numeric. We’ll convert the columns that do not have the appropriate data type.

str(finalStocks)
## 'data.frame':    1568 obs. of  7 variables:
##  $ Share            : Factor w/ 1562 levels "20MICRONS","3IINFOTECH",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Category         : Factor w/ 8 levels "Buy","BWLV","Exit",..: 3 2 3 7 5 3 2 1 3 3 ...
##  $ Sector           : Factor w/ 39 levels "0","AIRLINES",..: 9 28 11 28 28 24 35 9 38 12 ...
##  $ Last.Traded.Price: Factor w/ 1400 levels "0.15","0.2","0.25",..: 880 973 137 1220 1105 941 1025 1308 1190 473 ...
##  $ Percentage.Change: Factor w/ 485 levels "-0.01","-0.02",..: NA NA 353 271 262 430 28 279 NA 296 ...
##  $ High.Price       : Factor w/ 914 levels "0","10.1","10.25",..: NA NA 81 801 733 646 682 853 NA 329 ...
##  $ Low.Price        : Factor w/ 913 levels "0","10","10.05",..: NA NA 75 782 711 618 667 844 NA 311 ...
numericCols <- c('Last.Traded.Price','Percentage.Change','High.Price','Low.Price')

finalStocks[numericCols] <- sapply(finalStocks[numericCols],as.character)
finalStocks[numericCols] <- sapply(finalStocks[numericCols],as.numeric)

str(finalStocks)
## 'data.frame':    1568 obs. of  7 variables:
##  $ Share            : Factor w/ 1562 levels "20MICRONS","3IINFOTECH",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Category         : Factor w/ 8 levels "Buy","BWLV","Exit",..: 3 2 3 7 5 3 2 1 3 3 ...
##  $ Sector           : Factor w/ 39 levels "0","AIRLINES",..: 9 28 11 28 28 24 35 9 38 12 ...
##  $ Last.Traded.Price: num  43 5.35 12219.65 75.6 623.9 ...
##  $ Percentage.Change: num  NA NA 1.23 0.26 0.17 2.93 -0.31 0.35 NA 0.55 ...
##  $ High.Price       : num  NA NA 12400 76.7 640 ...
##  $ Low.Price        : num  NA NA 12115 75.6 624 ...
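
One detail worth noting: the conversion above goes factor to character to numeric. Converting a factor straight to numeric returns its internal level codes rather than the values it displays; a tiny illustration with made-up prices:

# hypothetical example: a factor stores text labels plus integer level codes
f <- factor(c("200", "10.5", "3.25"))
as.numeric(f)                 # 2 1 3  -- the level codes, not the prices
as.numeric(as.character(f))   # 200.00 10.50 3.25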

Almost done with the tidying! There's only one thing left: one of the levels of the Sector column is 0. Let's convert that to NA.

levels(finalStocks$Sector)[levels(finalStocks$Sector) == '0'] <- NA

Analysis

Since no specific analysis was proposed in this discussion item, I am going to look at the following:

  • Describe the distribution of mean last trading prices across sectors
  • Identify the mean high and low trading price per sector

Describe the distribution of mean last traded prices across sectors: the summary statistics show that the sector averages range from 130.5 to 10,520.9, with a mean of 715.3 and a median of 332.4. Because the median and mean are so different, we know there is some skew in the data. The histogram confirms a right-skewed distribution with one outlier that has a very high average trading price.

finalStocks <- as_tibble(finalStocks)

# remove records with null Sector
finalStocks <- finalStocks[!is.na(finalStocks$Sector),]

avgTradingPrice <- finalStocks %>%
  group_by(Sector) %>%
  summarize(Avg_Trading_Price = mean(Last.Traded.Price, na.rm = TRUE)) %>%
  arrange(desc(Avg_Trading_Price))

summary(avgTradingPrice$Avg_Trading_Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   130.5   218.3   332.4   715.3   590.4 10520.9
hist(avgTradingPrice$Avg_Trading_Price, breaks=50)
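
Since avgTradingPrice is already sorted in descending order, we can also pull out the outlier sector directly:

# the sector with the highest mean last traded price (the outlier in the histogram)
head(avgTradingPrice, 1)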

Identify the mean high and low trading price per sector: from the following plot, we can see two things:

  • There is not much spread between the mean high and low trading prices within a sector
  • Most sectors have a similar mean high and low trading price, with the exception of AUTO-TYRES AND TUBES, which has a significantly higher value

highAndLow <- finalStocks %>%
  group_by(Sector) %>%
  summarize(Mean_Low_Price = mean(Low.Price, na.rm = TRUE),
            Mean_High_Price = mean(High.Price, na.rm = TRUE)) %>%
  arrange(desc(Mean_High_Price))

highAndLow
## # A tibble: 38 x 3
##    Sector                          Mean_Low_Price Mean_High_Price
##    <fct>                                    <dbl>           <dbl>
##  1 AUTO-TYRES AND TUBES                    12246.          12517.
##  2 AUTO-AUTOMOBILES AND AUTO PARTS          1651.           1732.
##  3 DIVERSIFIED                              1673.           1710.
##  4 FMCG                                     1660.           1695.
##  5 AUTO-AUTO ANCL AND COMPONENTS            1377.           1413.
##  6 CEMENTS                                  1324.           1380.
##  7 AIRLINES                                  800.            838.
##  8 FINANCIAL SERVICES-NBFC                   781.            804.
##  9 PHARMA                                    745.            761.
## 10 ENERGY-OIL & GAS                          668.            679.
## # ... with 28 more rows
# gather the mean high and low prices into key/value columns (Group, Avg_Price)
plotDat <- gather(highAndLow, Group, Avg_Price, -1)
ggplot(plotDat, aes(Sector, Avg_Price, col = Group)) +
  geom_point()+
  theme(axis.text.x = element_text(angle=90))
## Warning: Removed 4 rows containing missing values (geom_point).
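
As an aside, gather() still works but has been superseded by pivot_longer() in tidyr 1.0.0 and later; assuming a recent tidyr, the equivalent reshape (in place of the gather() call above) would be:

# same reshape with the newer tidyr interface
plotDat <- pivot_longer(highAndLow, cols = -Sector,
                        names_to = "Group", values_to = "Avg_Price")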

Data Set 2 - Erinda Budo: Untidy unemployment rates data

The dataset I chose was taken from the World Bank Global Economic Monitor. It contains unemployment rates for 88 countries from 1990 through 2017 and can be found at https://github.com/ErindaB/Other. The data will be transformed from wide to long format; many columns need to be renamed and some blank ones need to be removed. The analysis will investigate annual unemployment rates from 2011 to 2015 for 71 countries and will answer the questions posed below.

Preliminary Look

Let’s first load the data into R and take a look at it.

unemploymentData <- read.csv('Unemployment Data.csv', header = TRUE)
unemploymentData <- as_tibble(unemploymentData)
dim(unemploymentData)
## [1] 31 85
str(unemploymentData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    31 obs. of  85 variables:
##  $ X                                               : int  NA 1990 1991 1992 1993 1994 1995 1996 1997 1998 ...
##  $ Advanced.Economies                              : num  NA 5.8 6.73 7.51 7.94 ...
##  $ Argentina                                       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Australia                                       : num  NA 6.94 9.61 10.75 10.87 ...
##  $ Austria                                         : num  NA 5.37 5.82 5.94 6.81 ...
##  $ Belgium                                         : num  NA 6.55 6.44 7.09 8.62 ...
##  $ Bulgaria                                        : num  NA NA NA 13.2 15.9 ...
##  $ Bahrain                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Belarus                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Brazil                                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Canada                                          : num  NA 8.15 10.32 11.22 11.38 ...
##  $ Switzerland                                     : num  NA 0.501 1.09 2.563 4.516 ...
##  $ Chile                                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ China                                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Colombia                                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Cyprus                                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Czech.Republic                                  : num  NA NA NA NA 4.33 ...
##  $ Germany                                         : num  NA NA 4.86 5.76 6.93 ...
##  $ Denmark                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Dominican.Republic                              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Algeria                                         : num  NA 25 25 27 23.2 24.4 28.1 28 28 28 ...
##  $ EMDE.East.Asia...Pacific                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ EMDE.Europe...Central.Asia                      : num  NA NA NA NA NA ...
##  $ Ecuador                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Egypt..Arab.Rep.                                : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Emerging.Market.and.Developing.Economies..EMDEs.: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Spain                                           : num  NA 15.5 15.5 17.1 20.8 ...
##  $ Estonia                                         : num  NA 0.65 1.48 3.73 6.55 ...
##  $ Finland                                         : num  NA 3.1 6.67 11.8 16.38 ...
##  $ France                                          : num  NA 7.62 7.8 8.65 9.65 ...
##  $ United.Kingdom                                  : num  NA 7.09 8.82 9.97 10.4 ...
##  $ Greece                                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ High.Income.Countries                           : num  NA 5.62 6.77 7.69 8.19 ...
##  $ Hong.Kong.SAR..China                            : num  NA 1.32 1.75 1.95 1.98 ...
##  $ Croatia                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Hungary                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ India                                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Ireland                                         : num  NA 13.4 14.7 15.4 15.6 ...
##  $ Iceland                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Israel                                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Italy                                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Jordan                                          : num  NA NA NA NA 19.7 15.8 15.3 12.8 14.4 14.4 ...
##  $ Japan                                           : num  NA 2.11 2.1 2.15 2.5 ...
##  $ Kazakhstan                                      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Korea..Rep.                                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ EMDE.Latin.America...Caribbean                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Low.Income.Countries..LIC.                      : logi  NA NA NA NA NA NA ...
##  $ Sri.Lanka                                       : num  NA 15.9 14.7 14.6 13.8 13.1 12.3 11.3 10.5 9.2 ...
##  $ Lithuania                                       : num  NA NA NA NA 4.19 ...
##  $ Luxembourg                                      : num  NA NA NA NA NA ...
##  $ Latvia                                          : num  NA NA NA NA 4.66 ...
##  $ Morocco                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Moldova..Rep.                                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Mexico                                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Middle.Income.Countries..MIC.                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ North.Macedonia                                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Malta                                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ EMDE.Middle.East...N..Africa                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Netherlands                                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Norway                                          : num  NA 5.78 6.04 6.55 6.61 ...
##  $ New.Zealand                                     : num  NA 7.98 10.61 10.64 9.8 ...
##  $ Pakistan                                        : num  NA 3.13 6.28 5.85 4.73 4.84 5.37 5.37 6.12 5.89 ...
##  $ Peru                                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Philippines                                     : num  NA NA 10.47 9.85 9.35 ...
##  $ Poland                                          : num  NA 3.44 9.01 12.93 15.03 ...
##  $ Portugal                                        : num  NA NA NA NA NA ...
##  $ Romania                                         : num  NA NA NA 5.45 9.21 ...
##  $ Russian.Federation                              : num  NA NA NA NA NA ...
##  $ EMDE.South.Asia                                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Saudi.Arabia                                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Singapore                                       : num  NA NA 1.75 1.8 1.68 ...
##  $ EMDE.Sub.Saharan.Africa                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Slovakia                                        : num  NA NA 7.05 11.32 12.86 ...
##  $ Slovenia                                        : num  NA NA NA 11.6 14.6 ...
##  $ Sweden                                          : num  NA 2.24 4.01 7.11 11.15 ...
##  $ Thailand                                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Tunisia                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Turkey                                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Taiwan..China                                   : num  NA 1.66 1.53 1.5 1.42 ...
##  $ Uruguay                                         : num  NA NA NA NA NA ...
##  $ United.States                                   : num  NA 5.62 6.85 7.49 6.91 ...
##  $ Venezuela..RB                                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Vietnam                                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ World..WBG.members.                             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ South.Africa                                    : num  NA NA NA NA NA NA NA NA NA NA ...

We have 31 records with 85 variables. Each row represents a year and each column represents a country. There are quite a few NA values in the data set, and a few of the columns represent groups of countries rather than individual countries.

head(unemploymentData)
## # A tibble: 6 x 85
##       X Advanced.Econom~ Argentina Australia Austria Belgium Bulgaria
##   <int>            <dbl>     <dbl>     <dbl>   <dbl>   <dbl>    <dbl>
## 1    NA            NA           NA     NA      NA      NA        NA  
## 2  1990             5.80        NA      6.94    5.37    6.55     NA  
## 3  1991             6.73        NA      9.61    5.82    6.44     NA  
## 4  1992             7.51        NA     10.8     5.94    7.09     13.2
## 5  1993             7.94        NA     10.9     6.81    8.62     15.9
## 6  1994             7.72        NA      9.71    6.55    9.75     14.1
## # ... with 78 more variables: Bahrain <dbl>, Belarus <dbl>, Brazil <dbl>,
## #   Canada <dbl>, Switzerland <dbl>, Chile <dbl>, China <dbl>,
## #   Colombia <dbl>, Cyprus <dbl>, Czech.Republic <dbl>, Germany <dbl>,
## #   Denmark <dbl>, Dominican.Republic <dbl>, Algeria <dbl>,
## #   EMDE.East.Asia...Pacific <dbl>, EMDE.Europe...Central.Asia <dbl>,
## #   Ecuador <dbl>, Egypt..Arab.Rep. <dbl>,
## #   Emerging.Market.and.Developing.Economies..EMDEs. <dbl>, Spain <dbl>,
## #   Estonia <dbl>, Finland <dbl>, France <dbl>, United.Kingdom <dbl>,
## #   Greece <dbl>, High.Income.Countries <dbl>, Hong.Kong.SAR..China <dbl>,
## #   Croatia <dbl>, Hungary <dbl>, India <dbl>, Ireland <dbl>,
## #   Iceland <dbl>, Israel <dbl>, Italy <dbl>, Jordan <dbl>, Japan <dbl>,
## #   Kazakhstan <dbl>, Korea..Rep. <dbl>,
## #   EMDE.Latin.America...Caribbean <dbl>,
## #   Low.Income.Countries..LIC. <lgl>, Sri.Lanka <dbl>, Lithuania <dbl>,
## #   Luxembourg <dbl>, Latvia <dbl>, Morocco <dbl>, Moldova..Rep. <dbl>,
## #   Mexico <dbl>, Middle.Income.Countries..MIC. <dbl>,
## #   North.Macedonia <dbl>, Malta <dbl>,
## #   EMDE.Middle.East...N..Africa <dbl>, Netherlands <dbl>, Norway <dbl>,
## #   New.Zealand <dbl>, Pakistan <dbl>, Peru <dbl>, Philippines <dbl>,
## #   Poland <dbl>, Portugal <dbl>, Romania <dbl>, Russian.Federation <dbl>,
## #   EMDE.South.Asia <dbl>, Saudi.Arabia <dbl>, Singapore <dbl>,
## #   EMDE.Sub.Saharan.Africa <dbl>, Slovakia <dbl>, Slovenia <dbl>,
## #   Sweden <dbl>, Thailand <dbl>, Tunisia <dbl>, Turkey <dbl>,
## #   Taiwan..China <dbl>, Uruguay <dbl>, United.States <dbl>,
## #   Venezuela..RB <dbl>, Vietnam <dbl>, World..WBG.members. <dbl>,
## #   South.Africa <dbl>

It looks like our first column is not named properly, so let's change that. Also, since our analysis covers only the years 2011-2015, let's eliminate the years outside that range (the filter also drops the empty first row, whose YEAR is NA).

names(unemploymentData)[1] <- 'YEAR'

unemploymentData <- unemploymentData %>%
  filter(YEAR >= 2011 & YEAR <= 2015)

It looks like we still have a few countries without complete data for this period. Let's remove any column that has at least one missing value.

unemploymentData <- unemploymentData %>%
    select_if(~ !any(is.na(.)))

We’re almost there, but not quite. We still have a few columns that represent groups of countries. Let’s eliminate those from the dataset.

removeColumns <- c('Advanced.Economies','EMDE.East.Asia...Pacific','EMDE.Europe...Central.Asia','EMDE.Middle.East...N..Africa','EMDE.South.Asia','EMDE.Sub.Saharan.Africa','Emerging.Market.and.Developing.Economies..EMDEs.','High.Income.Countries', 'Middle.Income.Countries..MIC.', 'World..WBG.members.')

unemploymentData <- unemploymentData %>% 
  select(-removeColumns)
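
Aside: select() also accepts the external vector wrapped in one_of() (or all_of() in newer tidyselect), which makes it explicit that removeColumns is not a column name. The equivalent call, run in place of the chunk above, would be:

# drop the aggregate (multi-country) columns, naming the external vector explicitly
unemploymentData <- unemploymentData %>%
  select(-one_of(removeColumns))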

Now that we have the columns we want, let's apply the tidy principle of putting the country names into a single column. This will make it easier to answer the questions from the discussion post.

unemploymentData <- unemploymentData %>%
  gather(COUNTRY, UNEMPLOYMENT_RATE, -YEAR)

The data is finally cleaned up and ready for analysis. Let's tackle the first question: for the five-year period from 2011 to 2015, what is the average annual unemployment rate of each country?

avgUnemplRate <- unemploymentData %>% 
  group_by(COUNTRY) %>%
  summarise(AVERAGE_UNEMPLOYMENT_RATE = mean(UNEMPLOYMENT_RATE)) %>%
  arrange(AVERAGE_UNEMPLOYMENT_RATE)

avgUnemplRate
## # A tibble: 68 x 2
##    COUNTRY              AVERAGE_UNEMPLOYMENT_RATE
##    <chr>                                    <dbl>
##  1 Belarus                                  0.650
##  2 Thailand                                 0.751
##  3 Singapore                                1.94 
##  4 Vietnam                                  2.16 
##  5 Switzerland                              3.03 
##  6 Hong.Kong.SAR..China                     3.34 
##  7 Korea..Rep.                              3.36 
##  8 Norway                                   3.72 
##  9 Bahrain                                  3.86 
## 10 Japan                                    3.98 
## # ... with 58 more rows
ggplot(avgUnemplRate, aes(reorder(COUNTRY,AVERAGE_UNEMPLOYMENT_RATE), AVERAGE_UNEMPLOYMENT_RATE)) +
  geom_point() +
  theme(axis.text.x = element_text(angle=90))

For the five-year period from 2011 to 2015, what is the distribution of the average annual unemployment rate? The distribution is right-skewed: the averages range from roughly 0 to 30 percent, with most countries falling between 5 and 10 percent.

hist(avgUnemplRate$AVERAGE_UNEMPLOYMENT_RATE)

For the five-year period from 2011 to 2015, what is the overall trend of the average annual unemployment rate across these countries? It rose from 2011 to 2013 and then declined from 2013 to 2015.

yrUnemplRate <- unemploymentData %>% 
  group_by(YEAR) %>%
  summarise(AVERAGE_UNEMPLOYMENT_RATE = mean(UNEMPLOYMENT_RATE)) 

ggplot(yrUnemplRate, aes(YEAR, AVERAGE_UNEMPLOYMENT_RATE)) +
  geom_line()
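
Because the country-level averages are right-skewed (see the histogram above), the mean can be pulled upward by a few high-unemployment countries. A quick robustness check, repeating the calculation with the median, might look like this:

# same trend computed with the median, which is less sensitive to the skew
yrUnemplMedian <- unemploymentData %>%
  group_by(YEAR) %>%
  summarise(MEDIAN_UNEMPLOYMENT_RATE = median(UNEMPLOYMENT_RATE))

ggplot(yrUnemplMedian, aes(YEAR, MEDIAN_UNEMPLOYMENT_RATE)) +
  geom_line()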

Data Set 3 - Ajay Arora: Untidy data - candy

The following link has many different datasets for tidying:

https://makingnoiseandhearingthings.com/2018/04/19/

There are many interesting datasets, but one of the more intriguing is the candy dataset. It would be interesting to analyze which kinds of candy people prefer based on age, gender, geography, and other demographics.

Preliminary Look

Let’s first load the data into R and take a look at it.

candyData <- read.csv("candyhierarchy2017.csv", header = TRUE)
candyData <- as_tibble(candyData)
dim(candyData)
## [1] 2460  120
str(candyData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2460 obs. of  120 variables:
##  $ Internal.ID                                                                          : int  90258773 90272821 90272829 90272840 90272841 90272852 90272853 90272854 90272858 90272859 ...
##  $ Q1..GOING.OUT.                                                                       : Factor w/ 3 levels "","No","Yes": 1 2 1 2 2 2 2 2 2 2 ...
##  $ Q2..GENDER                                                                           : Factor w/ 5 levels "","Female","I'd rather not say",..: 1 4 4 4 4 4 4 4 4 2 ...
##  $ Q3..AGE                                                                              : Factor w/ 107 levels "","?","1","10",..: 1 45 52 41 20 1 56 32 41 56 ...
##  $ Q4..COUNTRY                                                                          : Factor w/ 129 levels "","'merica","1",..: 1 122 120 113 117 1 117 18 19 114 ...
##  $ Q5..STATE..PROVINCE..COUNTY..ETC                                                     : Factor w/ 512 levels "","|NC","¯rb¾k",..: 1 295 465 342 111 1 83 337 338 473 ...
##  $ Q6...100.Grand.Bar                                                                   : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 3 3 1 3 3 4 ...
##  $ Q6...Anonymous.brown.globs.that.come.in.black.and.orange.wrappers..a.k.a..Mary.Janes.: Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ Q6...Any.full.sized.candy.bar                                                        : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 3 3 3 1 3 3 3 ...
##  $ Q6...Black.Jacks                                                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 2 1 1 2 4 4 ...
##  $ Q6...Bonkers..the.candy.                                                             : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 4 1 1 2 4 4 ...
##  $ Q6...Bonkers..the.board.game.                                                        : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 1 1 4 4 4 ...
##  $ Q6...Bottle.Caps                                                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 4 4 1 3 4 4 ...
##  $ Q6...Box.o.Raisins                                                                   : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 4 1 4 2 4 ...
##  $ Q6...Broken.glow.stick                                                               : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 2 1 3 2 4 ...
##  $ Q6...Butterfinger                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 4 3 1 3 3 4 ...
##  $ Q6...Cadbury.Creme.Eggs                                                              : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 4 2 1 3 4 2 ...
##  $ Q6...Candy.Corn                                                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 2 2 4 1 3 2 4 ...
##  $ Q6...Candy.that.is.clearly.just.the.stuff.given.out.for.free.at.restaurants          : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 4 1 2 2 4 ...
##  $ Q6...Caramellos                                                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 3 3 1 3 4 4 ...
##  $ Q6...Cash..or.other.forms.of.legal.tender                                            : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 3 4 3 1 3 4 4 ...
##  $ Q6...Chardonnay                                                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 3 3 1 3 3 4 ...
##  $ Q6...Chick.o.Sticks..we.donÕt.know.what.that.is.                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 3 2 4 1 2 4 4 ...
##  $ Q6...Chiclets                                                                        : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 2 4 1 4 2 4 ...
##  $ Q6...Coffee.Crisp                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 4 3 1 4 3 4 ...
##  $ Q6...Creepy.Religious.comics.Chick.Tracts                                            : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ Q6...Dental.paraphenalia                                                             : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 4 1 2 2 2 ...
##  $ Q6...Dots                                                                            : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 3 4 4 1 4 2 4 ...
##  $ Q6...Dove.Bars                                                                       : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 3 3 3 1 4 3 4 ...
##  $ Q6...Fuzzy.Peaches                                                                   : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 2 1 1 3 4 4 ...
##  $ Q6...Generic.Brand.Acetaminophen                                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 2 2 1 2 2 4 ...
##  $ Q6...Glow.sticks                                                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 3 2 4 1 2 4 4 ...
##  $ Q6...Goo.Goo.Clusters                                                                : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 3 4 3 1 2 4 4 ...
##  $ Q6...Good.N..Plenty                                                                  : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 4 4 1 2 2 2 ...
##  $ Q6...Gum.from.baseball.cards                                                         : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ Q6...Gummy.Bears.straight.up                                                         : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 3 3 4 1 3 2 4 ...
##  $ Q6...Hard.Candy                                                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 4 4 1 2 2 2 ...
##  $ Q6...Healthy.Fruit                                                                   : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 2 4 1 2 2 4 ...
##  $ Q6...Heath.Bar                                                                       : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 2 2 3 1 2 2 4 ...
##  $ Q6...Hershey.s.Dark.Chocolate                                                        : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 3 1 3 3 3 ...
##  $ Q6...HersheyÕs.Milk.Chocolate                                                        : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 4 4 1 4 3 4 ...
##  $ Q6...Hershey.s.Kisses                                                                : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 4 4 1 4 3 4 ...
##  $ Q6...Hugs..actual.physical.hugs.                                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 2 4 1 2 2 2 ...
##  $ Q6...Jolly.Rancher..bad.flavor.                                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 4 4 1 2 4 4 ...
##  $ Q6...Jolly.Ranchers..good.flavor.                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 3 4 1 2 3 4 ...
##  $ Q6...JoyJoy..Mit.Iodine..                                                            : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 4 1 1 2 4 4 ...
##  $ Q6...Junior.Mints                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 3 3 1 2 4 3 ...
##  $ Q6...Senior.Mints                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 1 1 2 2 4 ...
##  $ Q6...Kale.smoothie                                                                   : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ Q6...Kinder.Happy.Hippo                                                              : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 3 1 1 2 4 3 ...
##  $ Q6...Kit.Kat                                                                         : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 3 1 4 3 3 ...
##  $ Q6...LaffyTaffy                                                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 3 3 4 1 4 4 2 ...
##  $ Q6...LemonHeads                                                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 3 3 4 1 3 4 4 ...
##  $ Q6...Licorice..not.black.                                                            : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 4 4 1 4 3 4 ...
##  $ Q6...Licorice..yes.black.                                                            : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 3 2 4 1 2 2 4 ...
##  $ Q6...Lindt.Truffle                                                                   : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 3 3 3 1 2 4 3 ...
##  $ Q6...Lollipops                                                                       : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 4 4 1 4 4 4 ...
##  $ Q6...Mars                                                                            : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 3 4 3 1 3 3 4 ...
##  $ Q6...Maynards                                                                        : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 2 1 1 3 2 2 ...
##  $ Q6...Mike.and.Ike                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 4 3 1 3 2 4 ...
##  $ Q6...Milk.Duds                                                                       : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 2 4 3 1 3 3 4 ...
##  $ Q6...Milky.Way                                                                       : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 3 3 3 1 3 3 4 ...
##  $ Q6...Regular.M.Ms                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 3 1 3 3 3 ...
##  $ Q6...Peanut.M.MÕs                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 3 3 3 1 3 3 4 ...
##  $ Q6...Blue.M.M.s                                                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 3 1 3 3 4 ...
##  $ Q6...Red.M.M.s                                                                       : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 3 1 3 3 4 ...
##  $ Q6...Green.Party.M.M.s                                                               : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 1 1 3 3 4 ...
##  $ Q6...Independent.M.M.s                                                               : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 1 1 4 3 4 ...
##  $ Q6...Abstained.from.M.M.ing.                                                         : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 4 1 1 2 2 4 ...
##  $ Q6...Minibags.of.chips                                                               : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 3 1 3 3 3 ...
##  $ Q6...Mint.Kisses                                                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 2 3 3 1 2 2 4 ...
##  $ Q6...Mint.Juleps                                                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 3 3 1 2 3 4 ...
##  $ Q6...Mr..Goodbar                                                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 4 3 1 4 4 4 ...
##  $ Q6...Necco.Wafers                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 4 1 4 2 4 ...
##  $ Q6...Nerds                                                                           : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 3 3 1 3 4 4 ...
##  $ Q6...Nestle.Crunch                                                                   : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 4 1 3 3 4 ...
##  $ Q6...Now.n.Laters                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 3 4 1 4 4 4 ...
##  $ Q6...Peeps                                                                           : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 4 2 1 2 2 2 ...
##  $ Q6...Pencils                                                                         : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ Q6...Pixy.Stix                                                                       : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 3 4 1 2 2 2 ...
##  $ Q6...Real.Housewives.of.Orange.County.Season.9.Blue.Ray                              : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ Q6...ReeseÕs.Peanut.Butter.Cups                                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 3 3 3 1 3 3 3 ...
##  $ Q6...Reese.s.Pieces                                                                  : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 3 1 3 3 3 ...
##  $ Q6...Reggie.Jackson.Bar                                                              : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 4 2 1 1 2 4 ...
##  $ Q6...Rolos                                                                           : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 3 1 4 3 3 1 3 3 4 ...
##  $ Q6...Sandwich.sized.bags.filled.with.BooBerry.Crunch                                 : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 1 2 1 2 2 1 ...
##  $ Q6...Skittles                                                                        : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 3 3 1 3 2 4 ...
##  $ Q6...Smarties..American.                                                             : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 3 3 1 4 4 4 ...
##  $ Q6...Smarties..Commonwealth.                                                         : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 2 3 1 3 3 4 ...
##  $ Q6...Snickers                                                                        : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 3 1 3 1 4 3 4 ...
##  $ Q6...Sourpatch.Kids..i.e..abominations.of.nature.                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 4 2 1 3 3 4 ...
##  $ Q6...Spotted.Dick                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 2 1 2 2 4 ...
##  $ Q6...Starburst                                                                       : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 4 3 4 1 4 2 4 ...
##  $ Q6...Sweet.Tarts                                                                     : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 3 4 1 3 4 4 ...
##  $ Q6...Swedish.Fish                                                                    : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 4 1 3 4 4 1 3 2 4 ...
##  $ Q6...Sweetums..a.friend.to.diabetes.                                                 : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 2 1 1 4 2 4 ...
##  $ Q6...Take.5                                                                          : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 3 4 1 1 4 2 4 ...
##  $ Q6...Tic.Tacs                                                                        : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 4 2 4 1 2 2 4 ...
##  $ Q6...Those.odd.marshmallow.circus.peanut.things                                      : Factor w/ 4 levels "","DESPAIR","JOY",..: 1 2 1 2 2 2 1 2 2 4 ...
##   [list output truncated]

There are 2460 records in this dataset, each representing one response to the candy survey. There are 120 variables, covering demographic information and candy preferences. Many of the responses are standardized (the individual must have been given a fixed set of options to choose from), but some appear to be free-form text.

There are quite a few candy options here, so let's limit ourselves to 11 of them. Let's also eliminate the Internal ID, Going Out, State, and Country columns, keeping only gender and age as demographics. Lastly, let's change the categorical ratings to numeric so that it's easier to compare preferences across demographics.

# column positions to keep: Q2 gender (3), Q3 age (4), and 11 candy columns
candyKeep <- c(3,4,7,13,16,17,18,20,25,28,29,34,39)

candyData <- candyData %>%
  select(candyKeep)

# recode the rating levels to numbers: DESPAIR = 1, MEH = 2, JOY = 3
candyData[] <- lapply(candyData, function(x){
  levels(x)[levels(x) == "DESPAIR"] <- 1
  levels(x)[levels(x) == "MEH"]     <- 2
  levels(x)[levels(x) == "JOY"]     <- 3
  x
})

Since we want to analyze differences across candies, let's eliminate any record where any of the fields is blank. This leaves us with 1436 records and 13 variables.

candyData <- candyData %>% 
  filter(Q2..GENDER != '' & 
           Q3..AGE != '' & 
           Q6...100.Grand.Bar != '' & 
           Q6...Bottle.Caps != '' &
           Q6...Butterfinger != '' &
           Q6...Cadbury.Creme.Eggs != '' &
           Q6...Candy.Corn != '' &
           Q6...Caramellos != '' &
           Q6...Coffee.Crisp != '' &
           Q6...Dots != '' &
           Q6...Dove.Bars != '' &
           Q6...Good.N..Plenty != '' &
           Q6...Heath.Bar != '')

dim(candyData)
## [1] 1436   13
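
The long chain of conditions above can be written more compactly with a scoped verb (filter_all() is available in dplyr 0.7+ and was later superseded by if_all()); run in place of the filter above, it would be:

# keep only rows where every remaining column is non-blank
candyData <- candyData %>%
  filter_all(all_vars(. != ""))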

Let’s rename our columns so that they’re a little easier to work with. While we’re at it, let’s tidy up the data by putting all of the types of candy in one column.

candyColNames <- c('GENDER', 'AGE', '100GRAND', 'BOTTLE_CAPS','BUTTERFINGER','CADBURY_EGGS','CANDY_CORN','CARMELLO','COFFEE_CRISP','DOTS','DOVE_BARS','GOOD_AND_PLENTY', 'HEATH')

colnames(candyData) <- candyColNames

candyData <- candyData %>%
  gather(CANDY_BAR, RATING, -GENDER, -AGE)

candyData$RATING <- as.numeric(candyData$RATING)

Now that our data is tidy, we can start to look at the ratings. Which candy has the highest average rating? Dove bars, followed closely by Butterfinger! Not surprisingly, Good & Plenty was ranked the lowest (definitely not my favorite candy).

candyRatings <- candyData %>%
  group_by(CANDY_BAR) %>%
  summarise(AVERAGE_RATING = mean(RATING)) 


ggplot(candyRatings, aes(reorder(CANDY_BAR, AVERAGE_RATING), AVERAGE_RATING)) +
  geom_point() +
  theme(axis.text.x = element_text(angle=90))

What about by gender? The highest-ranked candy is Dove for female respondents, Butterfinger for male and "Other" respondents, and Dove for those who preferred not to say.

candyGenderRatings <- candyData %>%
  group_by(CANDY_BAR,GENDER) %>%
  summarise(AVERAGE_RATING = mean(RATING)) 


ggplot(candyGenderRatings, aes(CANDY_BAR, AVERAGE_RATING)) +
  geom_point(aes(colour = factor(GENDER)), size = 3) +
  theme(axis.text.x = element_text(angle=90))

How about average rating by age? Because the age field was free-form, there is a pretty large number of distinct values in this column. For future work, scrubbing this field would be helpful; one possible approach is sketched after the plot below.

candyAgeRatings <- candyData %>%
  group_by(CANDY_BAR,AGE) %>%
  summarise(AVERAGE_RATING = mean(RATING)) 

ggplot(candyAgeRatings, aes(CANDY_BAR, AVERAGE_RATING)) +
  geom_point(aes(colour = factor(AGE)), size = 3) +
  theme(axis.text.x = element_text(angle=90))
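
A minimal sketch of the kind of age scrubbing mentioned above (not run in this report; AGE_NUM, AGE_GROUP, and candyAgeClean are names introduced only for the illustration): coerce the free-form ages to numbers, drop values that cannot be parsed or are implausible, and bin the rest into broad groups.

# coerce free-form ages to numeric (non-numeric answers become NA),
# keep plausible values, and bin them into broad age groups
candyAgeClean <- candyData %>%
  mutate(AGE_NUM = suppressWarnings(as.numeric(as.character(AGE)))) %>%
  filter(!is.na(AGE_NUM), AGE_NUM >= 5, AGE_NUM <= 100) %>%
  mutate(AGE_GROUP = cut(AGE_NUM, breaks = c(0, 17, 34, 54, 100),
                         labels = c("under 18", "18-34", "35-54", "55+")))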