CUNY 607 Data Acquisition and Management - Project# 2

This assignment is on practicing the techniques of data manipulations to transform data from wide to long format using tidyr and dplyr packages. To this effect we will work on 3 separate data sets provided by other students in the class.

The following packages are required for the assignment:
* RCurl,
* lubridate,
* tidyr,
* dplyr,
* ggplot2

## Loading required package: bitops
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Mobile App’s downloads

Data Set: Explanation and load

This data set contains the number of downloads per provider (Apple, Android) for year 2010 to 2015. The number of download are in millions. In addition, the coresponding release date for each app provider. The number of downloads are approximations from overall numbers found on Wikipedia. The release date is based on information on Wikipedia.

For reproducible research, the data set will be loaded in a GitHub account as a .csv. The file will be downloaded into tb_tp1. Prior to transforming the table we will view it.

file1_url <- getURL("https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/CUNY%20607_Project2_dataset1.csv")
tb_tp1 <- read.csv(text = file1_url)

# View file using "tbl_df" command from dplyr package
tbl_df(tb_tp1)
## Source: local data frame [14 x 17]
## 
##    Number.of.down.load.for.Top.games.from.2010...2015..in.millions.      X
##                                                              (fctr) (fctr)
## 1                                                                         
## 2                                                          App Name   Type
## 3                                                  Candy Crush Saga   Game
## 4                                                       Fruit Ninja   Game
## 5                                                       Angry Birds   Game
## 6                                                    Subway Surfers   Game
## 7                                        Despicable Me: Minion Rush   Game
## 8                                                    Clash of Clans   Game
## 9                                                        Temple Run   Game
## 10                                                  Angry Birds Rio   Game
## 11                                                     Temple Run 2   Game
## 12                                               Words With Friends   Game
## 13                                        Minecraft: Pocket Edition   Game
## 14                                                         NBA 2K16   Game
## Variables not shown: X.1 (fctr), X.2 (fctr), X.3 (int), X.4 (int), X.5
##   (int), X.6 (int), X.7 (int), X.8 (int), X.9 (fctr), X.10 (int), X.11
##   (int), X.12 (int), X.13 (int), X.14 (int), X.15 (int)
# view structure of data set tb_tp1
str(tb_tp1)
## 'data.frame':    14 obs. of  17 variables:
##  $ Number.of.down.load.for.Top.games.from.2010...2015..in.millions.: Factor w/ 14 levels "","Angry Birds",..: 1 4 5 8 2 11 7 6 12 3 ...
##  $ X                                                               : Factor w/ 3 levels "","Game","Type": 1 3 2 2 2 2 2 2 2 2 ...
##  $ X.1                                                             : Factor w/ 4 levels "","Free","Paid",..: 1 4 2 2 2 2 2 2 2 2 ...
##  $ X.2                                                             : Factor w/ 14 levels "1/16/2013","10/14/2015",..: 13 14 6 7 4 8 9 10 11 5 ...
##  $ X.3                                                             : int  NA 2010 NA 4 10 NA NA NA NA NA ...
##  $ X.4                                                             : int  NA 2011 NA 8 124 NA NA NA 18 104 ...
##  $ X.5                                                             : int  NA 2012 8 58 320 23 NA 24 102 204 ...
##  $ X.6                                                             : int  NA 2013 56 102 547 123 16 123 246 382 ...
##  $ X.7                                                             : int  NA 2014 60 126 648 202 58 234 306 485 ...
##  $ X.8                                                             : int  NA 2015 76 148 627 303 128 345 378 324 ...
##  $ X.9                                                             : Factor w/ 14 levels "1/16/2013","10/14/2015",..: 13 14 5 11 4 9 10 3 8 7 ...
##  $ X.10                                                            : int  NA 2010 NA 1 2 NA NA NA NA NA ...
##  $ X.11                                                            : int  NA 2011 NA 9 108 NA NA NA NA 108 ...
##  $ X.12                                                            : int  NA 2012 2 64 312 26 NA NA 108 205 ...
##  $ X.13                                                            : int  NA 2013 53 108 538 128 18 4 254 398 ...
##  $ X.14                                                            : int  NA 2014 64 132 647 236 64 143 302 476 ...
##  $ X.15                                                            : int  NA 2015 72 165 656 329 294 256 402 389 ...

Formatting of raw data into a long-format table

After the load, we have a few problems that we have to address. A closer look at the data leads to the following conclusion; This data set is comprise of 3 sections;
1. Common fields; app’s name, type, paying status
2. section for Apple; release date, years 2010 - 2015
3. section for Androids; release date, years 2010 - 2015

For transformation, we will first segregate the data set into 2 tables, one for apple and one for android, will then add a column for provider on each table and fill it with the appropriate value and then combine the 2 tables. Each of the provider table will contains the common section.

It is also clear that we can remove the first line of the raw table.

tb_tp1_2 <- tb_tp1[-c(1), ]  # remove row# 1

tbl_df(tb_tp1_2)
## Source: local data frame [13 x 17]
## 
##    Number.of.down.load.for.Top.games.from.2010...2015..in.millions.      X
##                                                              (fctr) (fctr)
## 1                                                          App Name   Type
## 2                                                  Candy Crush Saga   Game
## 3                                                       Fruit Ninja   Game
## 4                                                       Angry Birds   Game
## 5                                                    Subway Surfers   Game
## 6                                        Despicable Me: Minion Rush   Game
## 7                                                    Clash of Clans   Game
## 8                                                        Temple Run   Game
## 9                                                   Angry Birds Rio   Game
## 10                                                     Temple Run 2   Game
## 11                                               Words With Friends   Game
## 12                                        Minecraft: Pocket Edition   Game
## 13                                                         NBA 2K16   Game
## Variables not shown: X.1 (fctr), X.2 (fctr), X.3 (int), X.4 (int), X.5
##   (int), X.6 (int), X.7 (int), X.8 (int), X.9 (fctr), X.10 (int), X.11
##   (int), X.12 (int), X.13 (int), X.14 (int), X.15 (int)
# split the table into 2
tb_tp1_apple <- select(tb_tp1_2, 1:10)
tb_tp1_android <- select(tb_tp1_2, 1:3, 11:17)

From observation of the data, it is clear that the now first row of the data is actually the column header. We will therefore modify the column header in for each table. Once this is done, we will remove the first row.
Finally, we will fix the missing date entry… We will change it to ‘07/01/2009’ and convert this column to a date field.

# names(tb_tp1_apple) <- c(as.character((tb_tp1_apple[1, ]))), this does not do what I would like.  It uses the number of levels for each factor (I think) instead of the actual content
# however, on Stackoverflow, I found that you can use the lapply function... so I tried...

names(tb_tp1_apple) <- lapply(tb_tp1_apple[1,], as.character)
names(tb_tp1_android) <- lapply(tb_tp1_android[1,], as.character)

tb_tp1_apple <- tb_tp1_apple[-c(1), ]      # remove row# 1
tb_tp1_android <- tb_tp1_android[-c(1), ]  # remove row# 1


# Droplevels, this will drop unused level (hence the column header values that were originally mistaken for row of data will be dropped as level)
#tb_tp1_apple <- droplevels(tb_tp1_apple)
#tb_tp1_android <- droplevels(tb_tb1_android)
#  Did not seem to have worked, I am not sure why....

# Instead we will use factor()
tb_tp1_apple$`App Name` <- factor(tb_tp1_apple$`App Name`)
tb_tp1_apple$Type <- factor(tb_tp1_apple$Type)
tb_tp1_apple$`Paid/Free` <- factor(tb_tp1_apple$`Paid/Free`)
tb_tp1_apple$`Release Date` <- factor(tb_tp1_apple$`Release Date`)

tb_tp1_android$`App Name` <- factor(tb_tp1_android$`App Name`)
tb_tp1_android$Type <- factor(tb_tp1_android$Type)
tb_tp1_android$`Paid/Free` <- factor(tb_tp1_android$`Paid/Free`)
tb_tp1_android$`Release Date` <- factor(tb_tp1_android$`Release Date`)

# Finally we replace the incorrect date in "Release date" column and convert "Release Date" column to date

grep("9-Jul", tb_tp1_apple$`Release Date`, value = FALSE)
## [1] 10
tb_tp1_apple$`Release Date` <- gsub("9-Jul","07/01/2009",tb_tp1_apple$`Release Date`)
grep("9-Jul", tb_tp1_apple$`Release Date`, value = FALSE)
## integer(0)
grep("9-Jul", tb_tp1_android$`Release Date`, value = FALSE)
## [1] 10
tb_tp1_android$`Release Date` <- gsub("9-Jul","07/01/2009",tb_tp1_android$`Release Date`)
grep("9-Jul", tb_tp1_android$`Release Date`, value = FALSE)
## integer(0)
# Note, after gsub applied, "Release Date" Column is changed to Character from factor??
# We will now change it to Date
tb_tp1_apple$`Release Date` <- mdy(tb_tp1_apple$`Release Date`)
tb_tp1_android$`Release Date` <- mdy(tb_tp1_android$`Release Date`)

str(tb_tp1_apple)
## 'data.frame':    12 obs. of  10 variables:
##  $ App Name    : Factor w/ 12 levels "Angry Birds",..: 3 6 1 9 5 4 10 2 11 12 ...
##  $ Type        : Factor w/ 1 level "Game": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Paid/Free   : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Release Date: POSIXct, format: "2012-04-12" "2010-04-21" ...
##  $ 2010        : int  NA 4 10 NA NA NA NA NA NA 143 ...
##  $ 2011        : int  NA 8 124 NA NA NA 18 104 NA 156 ...
##  $ 2012        : int  8 58 320 23 NA 24 102 204 NA 100 ...
##  $ 2013        : int  56 102 547 123 16 123 246 382 230 93 ...
##  $ 2014        : int  60 126 648 202 58 234 306 485 403 85 ...
##  $ 2015        : int  76 148 627 303 128 345 378 324 503 86 ...
str(tb_tp1_android)
## 'data.frame':    12 obs. of  10 variables:
##  $ App Name    : Factor w/ 12 levels "Angry Birds",..: 3 6 1 9 5 4 10 2 11 12 ...
##  $ Type        : Factor w/ 1 level "Game": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Paid/Free   : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Release Date: POSIXct, format: "2012-11-04" "2010-07-10" ...
##  $ 2010        : int  NA 1 2 NA NA NA NA NA NA 146 ...
##  $ 2011        : int  NA 9 108 NA NA NA NA 108 NA 173 ...
##  $ 2012        : int  2 64 312 26 NA NA 108 205 NA 112 ...
##  $ 2013        : int  53 108 538 128 18 4 254 398 253 105 ...
##  $ 2014        : int  64 132 647 236 64 143 302 476 493 95 ...
##  $ 2015        : int  72 165 656 329 294 256 402 389 523 92 ...
# Note, after mdy is applied, "Release Date" Column is changed to POSIXct date format

We now have 2 tables with similar structure one for data for App’s downloaded for Apple and the other downloaded for Android. We will now add a column on each table, provider that will have the value “Apple” or “Android”.

tb_tp1_apple$provider <- c("Apple")
tb_tp1_android$provider <- c("Android")

tb_tp1_apple
##                      App Name Type Paid/Free Release Date 2010 2011 2012
## 3            Candy Crush Saga Game      Free   2012-04-12   NA   NA    8
## 4                 Fruit Ninja Game      Free   2010-04-21    4    8   58
## 5                 Angry Birds Game      Free   2009-12-11   10  124  320
## 6              Subway Surfers Game      Free   2012-05-24   NA   NA   23
## 7  Despicable Me: Minion Rush Game      Free   2013-06-10   NA   NA   NA
## 8              Clash of Clans Game      Free   2012-08-02   NA   NA   24
## 9                  Temple Run Game      Free   2011-08-04   NA   18  102
## 10            Angry Birds Rio Game      Free   2011-03-22   NA  104  204
## 11               Temple Run 2 Game      Free   2013-01-16   NA   NA   NA
## 12         Words With Friends Game      Free   2009-07-01  143  156  100
## 13  Minecraft: Pocket Edition Game      Paid   2011-11-07   NA    4  120
## 14                   NBA 2K16 Game      Paid   2015-10-14   NA   NA   NA
##    2013 2014 2015 provider
## 3    56   60   76    Apple
## 4   102  126  148    Apple
## 5   547  648  627    Apple
## 6   123  202  303    Apple
## 7    16   58  128    Apple
## 8   123  234  345    Apple
## 9   246  306  378    Apple
## 10  382  485  324    Apple
## 11  230  403  503    Apple
## 12   93   85   86    Apple
## 13  240  320  340    Apple
## 14   NA   NA  245    Apple
tb_tp1_android
##                      App Name Type Paid/Free Release Date 2010 2011 2012
## 3            Candy Crush Saga Game      Free   2012-11-04   NA   NA    2
## 4                 Fruit Ninja Game      Free   2010-07-10    1    9   64
## 5                 Angry Birds Game      Free   2010-11-19    2  108  312
## 6              Subway Surfers Game      Free   2012-05-24   NA   NA   26
## 7  Despicable Me: Minion Rush Game      Free   2013-06-10   NA   NA   NA
## 8              Clash of Clans Game      Free   2013-10-07   NA   NA   NA
## 9                  Temple Run Game      Free   2012-03-27   NA   NA  108
## 10            Angry Birds Rio Game      Free   2011-03-22   NA  108  205
## 11               Temple Run 2 Game      Free   2013-01-16   NA   NA   NA
## 12         Words With Friends Game      Free   2009-07-01  146  173  112
## 13  Minecraft: Pocket Edition Game      Paid   2011-11-07   NA    8  154
## 14                   NBA 2K16 Game      Paid   2015-10-14   NA   NA   NA
##    2013 2014 2015 provider
## 3    53   64   72  Android
## 4   108  132  165  Android
## 5   538  647  656  Android
## 6   128  236  329  Android
## 7    18   64  294  Android
## 8     4  143  256  Android
## 9   254  302  402  Android
## 10  398  476  389  Android
## 11  253  493  523  Android
## 12  105   95   92  Android
## 13  285  369  352  Android
## 14   NA   NA  345  Android

We are now combining the 2 tables back into one. To do so we will use the bind_rows function from dplyr.

tb_tp1_result <- bind_rows(tb_tp1_apple, tb_tp1_android)

We have now a table in wide-format. We will now transform this table into long format by “gathering” the years into one variable “year” and introducing a variable “downloads”. We will do so using the gather function from the tidyr package.

tb_tp1_result <- gather(tb_tp1_result, "year", "downloads", 5:10)

Analysis - We will perform some analysis

*** free app with most download’s (per year)*** What is the free app with most download for each year

tb <- tb_tp1_result %>% filter(`Paid/Free` =="Free") %>% 
                  group_by(`App Name`, year) %>% 
                  summarise(sum(downloads))

# still working on getting the numbers out... for some reasons I am having some difficulties

College Graduation Rate as predictor of jobs in county (James Topor’s example)

Data Set: Explanation and load

This data set represent the college graduation rate and the number of jobs per county for year 1970, 1980, 1990, and 2000. In addition there are 2 additional variables for “Land Area” and “National Amenity”. We would be interested to see whether graduation rate from college is a predictor for number of jobs. The original data set can be found at: [http://www.theanalysisfactor.com/wide-and-long-data/].

We have created a .csv file with this data and posted in github for reproducibility. We are first going to load the data and then proceed with changing the format from wide to long.

file1_url <- getURL("https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/CUNY%20607_Project2_dataset2.csv")
tb_tp2 <- read.csv(text = file1_url)

# View file using "tbl_df" command from dplyr package
tbl_df(tb_tp2)
## Source: local data frame [5 x 11]
## 
##    County LandArea NatAmenity College1970 College1980 College1990
##    (fctr)    (int)      (int)       (dbl)       (dbl)       (dbl)
## 1 Autauga      599          4       0.064       0.121       0.145
## 2 Baldwin     1578          4       0.065       0.121       0.168
## 3 Barbour      891          4       0.073       0.092       0.118
## 4    Bibb      625          3       0.042       0.049       0.047
## 5  Blount      639          4       0.027       0.053       0.070
## Variables not shown: College2000 (dbl), Jobs1970 (int), Jobs1980 (int),
##   Jobs1990 (int), Jobs2000 (int)
# view structure of data set tb_tp1
str(tb_tp2)
## 'data.frame':    5 obs. of  11 variables:
##  $ County     : Factor w/ 5 levels "Autauga","Baldwin",..: 1 2 3 4 5
##  $ LandArea   : int  599 1578 891 625 639
##  $ NatAmenity : int  4 4 4 3 4
##  $ College1970: num  0.064 0.065 0.073 0.042 0.027
##  $ College1980: num  0.121 0.121 0.092 0.049 0.053
##  $ College1990: num  0.145 0.168 0.118 0.047 0.07
##  $ College2000: num  0.18 0.231 0.109 0.071 0.096
##  $ Jobs1970   : int  6853 19749 9448 3965 7587
##  $ Jobs1980   : int  11278 27861 9755 4276 9490
##  $ Jobs1990   : int  11471 40809 12163 5564 11811
##  $ Jobs2000   : int  16289 70247 15197 6098 16503

Formatting of raw data into a long-format table

We will first split this table into 2 to handle jobs and college separately then we ill join them back. The first table will have County, LandArea, NetAmenity, and collegesyears. The 2nd table will have County and Jobyears. We will join the table table on County. For each table, we will renames the year headers as year to facilitate transformation into long lormat. We will store the final table.

# split the table into 2
tb_tp2_colleges <- select(tb_tp2, 1:7)
tb_tp2_jobs <- select(tb_tp2, 1, 8:11)

# Rename collumn for years as year
names(tb_tp2_colleges) <- c("County", "LandArea", "NatAmenity", "1970", "1980", "1990", "2000")
names(tb_tp2_jobs) <- c("County", "1970", "1980", "1990", "2000")

# Transform each table from wide to long format using the gather function from tidyr and store resulting table so that we can join them
tb_tp2_colleges_long <- gather(tb_tp2_colleges, years, college_graduation, 4:7)
tb_tp2_jobs_long <- gather(tb_tp2_jobs, years, jobs, 2:5)

# We are now going to combine the table again using inner_join from dplyr package joining on county, year
tb_tp2_result <- inner_join(tb_tp2_colleges_long, tb_tp2_jobs_long, by = c("County", "years"))

Analysis - Proportion of resident completing college as predictor for number of jobs

we will try to plot the 2 variables separating by countys

ggplot(tb_tp2_result, aes(x=college_graduation, y=jobs, colour = County)) + geom_line() +geom_point()

# Still working on this... as well.

Oil Consumption

Data Set: Explanation and load

This dataset represent oil consumption for the beining of 2015 from January to June of 3 major brand of Oil per category. In addition, the opening balance is indicated.

The analysis that we will try to perform is as follows: 1. Give the closing balance of component + brand 2. Give the most consumed brand across the 2 category of oil

For reproducible research, the data set will be loaded in a GitHub account as a .csv. The file will be downloaded into tb_tp3. Prior to transforming the table we will view it.

file3_url <- getURL("https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/CUNY%20607_project2_dataset3.csv")
tb_tp3 <- read.csv(text = file3_url)

# View file using "tbl_df" command from dplyr package
tbl_df(tb_tp3)
## Source: local data frame [15 x 8]
## 
##         X         X.1    Caltex      X.2      Gulf      X.3     Mobil
##    (fctr)      (fctr)    (fctr)   (fctr)    (fctr)   (fctr)    (fctr)
## 1   Month    Category Purchased Consumed Purchased Consumed Purchased
## 2    Open  Engine Oil       140        0       199        0       141
## 3         GearBox Oil       198        0       132        0       121
## 4     Jan  Engine Oil       170      103       194      132       109
## 5         GearBox Oil       132      106       125      105       191
## 6     Feb  Engine Oil       112      133       138      113       171
## 7         GearBox Oil       193      148       199      119       134
## 8     Mar  Engine Oil       184      100       141      141       114
## 9         GearBox Oil       138      121       172      133       193
## 10    Apr  Engine Oil       149      150       117      118       117
## 11        GearBox Oil       185      125       191      133       119
## 12    May  Engine Oil       170      139       104      119       200
## 13        GearBox Oil       168      117       138      102       121
## 14    Jun  Engine Oil       159      129       170      138       169
## 15        GearBox Oil       107      129       195      141       141
## Variables not shown: X.4 (fctr)
# view structure of data set tb_tp3
str(tb_tp3)
## 'data.frame':    15 obs. of  8 variables:
##  $ X     : Factor w/ 9 levels "","Apr","Feb",..: 8 9 1 4 1 3 1 6 1 2 ...
##  $ X.1   : Factor w/ 3 levels "Category","Engine Oil",..: 1 2 3 2 3 2 3 2 3 2 ...
##  $ Caltex: Factor w/ 14 levels "107","112","132",..: 14 5 13 9 3 2 12 10 4 6 ...
##  $ X.2   : Factor w/ 13 levels "0","100","103",..: 13 1 1 3 4 9 11 2 6 12 ...
##  $ Gulf  : Factor w/ 13 levels "104","117","125",..: 13 12 4 10 3 5 12 6 8 2 ...
##  $ X.3   : Factor w/ 11 levels "0","102","105",..: 11 1 1 7 3 4 6 10 8 5 ...
##  $ Mobil : Factor w/ 13 levels "109","114","117",..: 13 7 5 1 10 9 6 2 11 3 ...
##  $ X.4   : Factor w/ 13 levels "0","100","101",..: 13 1 1 11 2 3 11 5 7 9 ...

First we will split the table for purchased and consumed value, repeating the common column of Month and Category. Then, we will rename the columns and remove the first line of table. Then we will fill in the “month” column using fill() function from tidyr package.

# split the table into 2
tb_tp3_purchased <- select(tb_tp3, 1:3, 5, 7)
tb_tp3_consumed <- select(tb_tp3, 1:2, 4, 6, 8)

# Rename collumn for years as year
names(tb_tp3_purchased) <- c("Month", "Category", "Caltex", "Gulf", "Mobil")
names(tb_tp3_consumed) <- c("Month", "Category", "Caltex", "Gulf", "Mobil")

# remove first row 
tb_tp3_purchased <- tb_tp3_purchased[-c(1), ]      # remove row# 1
tb_tp3_consumed <- tb_tp3_consumed[-c(1), ]        # remove row# 1