This assignment is on practicing the techniques of data manipulations to transform data from wide to long format using tidyr and dplyr packages. To this effect we will work on 3 separate data sets provided by other students in the class.
The following packages are required for the assignment:
* RCurl,
* lubridate,
* tidyr,
* dplyr,
* ggplot2
## Loading required package: bitops
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
This data set contains the number of downloads per provider (Apple, Android) for year 2010 to 2015. The number of download are in millions. In addition, the coresponding release date for each app provider. The number of downloads are approximations from overall numbers found on Wikipedia. The release date is based on information on Wikipedia.
For reproducible research, the data set will be loaded in a GitHub account as a .csv. The file will be downloaded into tb_tp1. Prior to transforming the table we will view it.
file1_url <- getURL("https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/CUNY%20607_Project2_dataset1.csv")
tb_tp1 <- read.csv(text = file1_url)
# View file using "tbl_df" command from dplyr package
tbl_df(tb_tp1)
## Source: local data frame [14 x 17]
##
## Number.of.down.load.for.Top.games.from.2010...2015..in.millions. X
## (fctr) (fctr)
## 1
## 2 App Name Type
## 3 Candy Crush Saga Game
## 4 Fruit Ninja Game
## 5 Angry Birds Game
## 6 Subway Surfers Game
## 7 Despicable Me: Minion Rush Game
## 8 Clash of Clans Game
## 9 Temple Run Game
## 10 Angry Birds Rio Game
## 11 Temple Run 2 Game
## 12 Words With Friends Game
## 13 Minecraft: Pocket Edition Game
## 14 NBA 2K16 Game
## Variables not shown: X.1 (fctr), X.2 (fctr), X.3 (int), X.4 (int), X.5
## (int), X.6 (int), X.7 (int), X.8 (int), X.9 (fctr), X.10 (int), X.11
## (int), X.12 (int), X.13 (int), X.14 (int), X.15 (int)
# view structure of data set tb_tp1
str(tb_tp1)
## 'data.frame': 14 obs. of 17 variables:
## $ Number.of.down.load.for.Top.games.from.2010...2015..in.millions.: Factor w/ 14 levels "","Angry Birds",..: 1 4 5 8 2 11 7 6 12 3 ...
## $ X : Factor w/ 3 levels "","Game","Type": 1 3 2 2 2 2 2 2 2 2 ...
## $ X.1 : Factor w/ 4 levels "","Free","Paid",..: 1 4 2 2 2 2 2 2 2 2 ...
## $ X.2 : Factor w/ 14 levels "1/16/2013","10/14/2015",..: 13 14 6 7 4 8 9 10 11 5 ...
## $ X.3 : int NA 2010 NA 4 10 NA NA NA NA NA ...
## $ X.4 : int NA 2011 NA 8 124 NA NA NA 18 104 ...
## $ X.5 : int NA 2012 8 58 320 23 NA 24 102 204 ...
## $ X.6 : int NA 2013 56 102 547 123 16 123 246 382 ...
## $ X.7 : int NA 2014 60 126 648 202 58 234 306 485 ...
## $ X.8 : int NA 2015 76 148 627 303 128 345 378 324 ...
## $ X.9 : Factor w/ 14 levels "1/16/2013","10/14/2015",..: 13 14 5 11 4 9 10 3 8 7 ...
## $ X.10 : int NA 2010 NA 1 2 NA NA NA NA NA ...
## $ X.11 : int NA 2011 NA 9 108 NA NA NA NA 108 ...
## $ X.12 : int NA 2012 2 64 312 26 NA NA 108 205 ...
## $ X.13 : int NA 2013 53 108 538 128 18 4 254 398 ...
## $ X.14 : int NA 2014 64 132 647 236 64 143 302 476 ...
## $ X.15 : int NA 2015 72 165 656 329 294 256 402 389 ...
After the load, we have a few problems that we have to address. A closer look at the data leads to the following conclusion; This data set is comprise of 3 sections;
1. Common fields; app’s name, type, paying status
2. section for Apple; release date, years 2010 - 2015
3. section for Androids; release date, years 2010 - 2015
For transformation, we will first segregate the data set into 2 tables, one for apple and one for android, will then add a column for provider on each table and fill it with the appropriate value and then combine the 2 tables. Each of the provider table will contains the common section.
It is also clear that we can remove the first line of the raw table.
tb_tp1_2 <- tb_tp1[-c(1), ] # remove row# 1
tbl_df(tb_tp1_2)
## Source: local data frame [13 x 17]
##
## Number.of.down.load.for.Top.games.from.2010...2015..in.millions. X
## (fctr) (fctr)
## 1 App Name Type
## 2 Candy Crush Saga Game
## 3 Fruit Ninja Game
## 4 Angry Birds Game
## 5 Subway Surfers Game
## 6 Despicable Me: Minion Rush Game
## 7 Clash of Clans Game
## 8 Temple Run Game
## 9 Angry Birds Rio Game
## 10 Temple Run 2 Game
## 11 Words With Friends Game
## 12 Minecraft: Pocket Edition Game
## 13 NBA 2K16 Game
## Variables not shown: X.1 (fctr), X.2 (fctr), X.3 (int), X.4 (int), X.5
## (int), X.6 (int), X.7 (int), X.8 (int), X.9 (fctr), X.10 (int), X.11
## (int), X.12 (int), X.13 (int), X.14 (int), X.15 (int)
# split the table into 2
tb_tp1_apple <- select(tb_tp1_2, 1:10)
tb_tp1_android <- select(tb_tp1_2, 1:3, 11:17)
From observation of the data, it is clear that the now first row of the data is actually the column header. We will therefore modify the column header in for each table. Once this is done, we will remove the first row.
Finally, we will fix the missing date entry… We will change it to ‘07/01/2009’ and convert this column to a date field.
# names(tb_tp1_apple) <- c(as.character((tb_tp1_apple[1, ]))), this does not do what I would like. It uses the number of levels for each factor (I think) instead of the actual content
# however, on Stackoverflow, I found that you can use the lapply function... so I tried...
names(tb_tp1_apple) <- lapply(tb_tp1_apple[1,], as.character)
names(tb_tp1_android) <- lapply(tb_tp1_android[1,], as.character)
tb_tp1_apple <- tb_tp1_apple[-c(1), ] # remove row# 1
tb_tp1_android <- tb_tp1_android[-c(1), ] # remove row# 1
# Droplevels, this will drop unused level (hence the column header values that were originally mistaken for row of data will be dropped as level)
#tb_tp1_apple <- droplevels(tb_tp1_apple)
#tb_tp1_android <- droplevels(tb_tb1_android)
# Did not seem to have worked, I am not sure why....
# Instead we will use factor()
tb_tp1_apple$`App Name` <- factor(tb_tp1_apple$`App Name`)
tb_tp1_apple$Type <- factor(tb_tp1_apple$Type)
tb_tp1_apple$`Paid/Free` <- factor(tb_tp1_apple$`Paid/Free`)
tb_tp1_apple$`Release Date` <- factor(tb_tp1_apple$`Release Date`)
tb_tp1_android$`App Name` <- factor(tb_tp1_android$`App Name`)
tb_tp1_android$Type <- factor(tb_tp1_android$Type)
tb_tp1_android$`Paid/Free` <- factor(tb_tp1_android$`Paid/Free`)
tb_tp1_android$`Release Date` <- factor(tb_tp1_android$`Release Date`)
# Finally we replace the incorrect date in "Release date" column and convert "Release Date" column to date
grep("9-Jul", tb_tp1_apple$`Release Date`, value = FALSE)
## [1] 10
tb_tp1_apple$`Release Date` <- gsub("9-Jul","07/01/2009",tb_tp1_apple$`Release Date`)
grep("9-Jul", tb_tp1_apple$`Release Date`, value = FALSE)
## integer(0)
grep("9-Jul", tb_tp1_android$`Release Date`, value = FALSE)
## [1] 10
tb_tp1_android$`Release Date` <- gsub("9-Jul","07/01/2009",tb_tp1_android$`Release Date`)
grep("9-Jul", tb_tp1_android$`Release Date`, value = FALSE)
## integer(0)
# Note, after gsub applied, "Release Date" Column is changed to Character from factor??
# We will now change it to Date
tb_tp1_apple$`Release Date` <- mdy(tb_tp1_apple$`Release Date`)
tb_tp1_android$`Release Date` <- mdy(tb_tp1_android$`Release Date`)
str(tb_tp1_apple)
## 'data.frame': 12 obs. of 10 variables:
## $ App Name : Factor w/ 12 levels "Angry Birds",..: 3 6 1 9 5 4 10 2 11 12 ...
## $ Type : Factor w/ 1 level "Game": 1 1 1 1 1 1 1 1 1 1 ...
## $ Paid/Free : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
## $ Release Date: POSIXct, format: "2012-04-12" "2010-04-21" ...
## $ 2010 : int NA 4 10 NA NA NA NA NA NA 143 ...
## $ 2011 : int NA 8 124 NA NA NA 18 104 NA 156 ...
## $ 2012 : int 8 58 320 23 NA 24 102 204 NA 100 ...
## $ 2013 : int 56 102 547 123 16 123 246 382 230 93 ...
## $ 2014 : int 60 126 648 202 58 234 306 485 403 85 ...
## $ 2015 : int 76 148 627 303 128 345 378 324 503 86 ...
str(tb_tp1_android)
## 'data.frame': 12 obs. of 10 variables:
## $ App Name : Factor w/ 12 levels "Angry Birds",..: 3 6 1 9 5 4 10 2 11 12 ...
## $ Type : Factor w/ 1 level "Game": 1 1 1 1 1 1 1 1 1 1 ...
## $ Paid/Free : Factor w/ 2 levels "Free","Paid": 1 1 1 1 1 1 1 1 1 1 ...
## $ Release Date: POSIXct, format: "2012-11-04" "2010-07-10" ...
## $ 2010 : int NA 1 2 NA NA NA NA NA NA 146 ...
## $ 2011 : int NA 9 108 NA NA NA NA 108 NA 173 ...
## $ 2012 : int 2 64 312 26 NA NA 108 205 NA 112 ...
## $ 2013 : int 53 108 538 128 18 4 254 398 253 105 ...
## $ 2014 : int 64 132 647 236 64 143 302 476 493 95 ...
## $ 2015 : int 72 165 656 329 294 256 402 389 523 92 ...
# Note, after mdy is applied, "Release Date" Column is changed to POSIXct date format
We now have 2 tables with similar structure one for data for App’s downloaded for Apple and the other downloaded for Android. We will now add a column on each table, provider that will have the value “Apple” or “Android”.
tb_tp1_apple$provider <- c("Apple")
tb_tp1_android$provider <- c("Android")
tb_tp1_apple
## App Name Type Paid/Free Release Date 2010 2011 2012
## 3 Candy Crush Saga Game Free 2012-04-12 NA NA 8
## 4 Fruit Ninja Game Free 2010-04-21 4 8 58
## 5 Angry Birds Game Free 2009-12-11 10 124 320
## 6 Subway Surfers Game Free 2012-05-24 NA NA 23
## 7 Despicable Me: Minion Rush Game Free 2013-06-10 NA NA NA
## 8 Clash of Clans Game Free 2012-08-02 NA NA 24
## 9 Temple Run Game Free 2011-08-04 NA 18 102
## 10 Angry Birds Rio Game Free 2011-03-22 NA 104 204
## 11 Temple Run 2 Game Free 2013-01-16 NA NA NA
## 12 Words With Friends Game Free 2009-07-01 143 156 100
## 13 Minecraft: Pocket Edition Game Paid 2011-11-07 NA 4 120
## 14 NBA 2K16 Game Paid 2015-10-14 NA NA NA
## 2013 2014 2015 provider
## 3 56 60 76 Apple
## 4 102 126 148 Apple
## 5 547 648 627 Apple
## 6 123 202 303 Apple
## 7 16 58 128 Apple
## 8 123 234 345 Apple
## 9 246 306 378 Apple
## 10 382 485 324 Apple
## 11 230 403 503 Apple
## 12 93 85 86 Apple
## 13 240 320 340 Apple
## 14 NA NA 245 Apple
tb_tp1_android
## App Name Type Paid/Free Release Date 2010 2011 2012
## 3 Candy Crush Saga Game Free 2012-11-04 NA NA 2
## 4 Fruit Ninja Game Free 2010-07-10 1 9 64
## 5 Angry Birds Game Free 2010-11-19 2 108 312
## 6 Subway Surfers Game Free 2012-05-24 NA NA 26
## 7 Despicable Me: Minion Rush Game Free 2013-06-10 NA NA NA
## 8 Clash of Clans Game Free 2013-10-07 NA NA NA
## 9 Temple Run Game Free 2012-03-27 NA NA 108
## 10 Angry Birds Rio Game Free 2011-03-22 NA 108 205
## 11 Temple Run 2 Game Free 2013-01-16 NA NA NA
## 12 Words With Friends Game Free 2009-07-01 146 173 112
## 13 Minecraft: Pocket Edition Game Paid 2011-11-07 NA 8 154
## 14 NBA 2K16 Game Paid 2015-10-14 NA NA NA
## 2013 2014 2015 provider
## 3 53 64 72 Android
## 4 108 132 165 Android
## 5 538 647 656 Android
## 6 128 236 329 Android
## 7 18 64 294 Android
## 8 4 143 256 Android
## 9 254 302 402 Android
## 10 398 476 389 Android
## 11 253 493 523 Android
## 12 105 95 92 Android
## 13 285 369 352 Android
## 14 NA NA 345 Android
We are now combining the 2 tables back into one. To do so we will use the bind_rows function from dplyr.
tb_tp1_result <- bind_rows(tb_tp1_apple, tb_tp1_android)
We have now a table in wide-format. We will now transform this table into long format by “gathering” the years into one variable “year” and introducing a variable “downloads”. We will do so using the gather function from the tidyr package.
tb_tp1_result <- gather(tb_tp1_result, "year", "downloads", 5:10)
*** free app with most download’s (per year)*** What is the free app with most download for each year
tb <- tb_tp1_result %>% filter(`Paid/Free` =="Free") %>%
group_by(`App Name`, year) %>%
summarise(sum(downloads))
# still working on getting the numbers out... for some reasons I am having some difficulties
This data set represent the college graduation rate and the number of jobs per county for year 1970, 1980, 1990, and 2000. In addition there are 2 additional variables for “Land Area” and “National Amenity”. We would be interested to see whether graduation rate from college is a predictor for number of jobs. The original data set can be found at: [http://www.theanalysisfactor.com/wide-and-long-data/].
We have created a .csv file with this data and posted in github for reproducibility. We are first going to load the data and then proceed with changing the format from wide to long.
file1_url <- getURL("https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/CUNY%20607_Project2_dataset2.csv")
tb_tp2 <- read.csv(text = file1_url)
# View file using "tbl_df" command from dplyr package
tbl_df(tb_tp2)
## Source: local data frame [5 x 11]
##
## County LandArea NatAmenity College1970 College1980 College1990
## (fctr) (int) (int) (dbl) (dbl) (dbl)
## 1 Autauga 599 4 0.064 0.121 0.145
## 2 Baldwin 1578 4 0.065 0.121 0.168
## 3 Barbour 891 4 0.073 0.092 0.118
## 4 Bibb 625 3 0.042 0.049 0.047
## 5 Blount 639 4 0.027 0.053 0.070
## Variables not shown: College2000 (dbl), Jobs1970 (int), Jobs1980 (int),
## Jobs1990 (int), Jobs2000 (int)
# view structure of data set tb_tp1
str(tb_tp2)
## 'data.frame': 5 obs. of 11 variables:
## $ County : Factor w/ 5 levels "Autauga","Baldwin",..: 1 2 3 4 5
## $ LandArea : int 599 1578 891 625 639
## $ NatAmenity : int 4 4 4 3 4
## $ College1970: num 0.064 0.065 0.073 0.042 0.027
## $ College1980: num 0.121 0.121 0.092 0.049 0.053
## $ College1990: num 0.145 0.168 0.118 0.047 0.07
## $ College2000: num 0.18 0.231 0.109 0.071 0.096
## $ Jobs1970 : int 6853 19749 9448 3965 7587
## $ Jobs1980 : int 11278 27861 9755 4276 9490
## $ Jobs1990 : int 11471 40809 12163 5564 11811
## $ Jobs2000 : int 16289 70247 15197 6098 16503
We will first split this table into 2 to handle jobs and college separately then we ill join them back. The first table will have County, LandArea, NetAmenity, and collegesyears. The 2nd table will have County and Jobyears. We will join the table table on County. For each table, we will renames the year headers as year to facilitate transformation into long lormat. We will store the final table.
# split the table into 2
tb_tp2_colleges <- select(tb_tp2, 1:7)
tb_tp2_jobs <- select(tb_tp2, 1, 8:11)
# Rename collumn for years as year
names(tb_tp2_colleges) <- c("County", "LandArea", "NatAmenity", "1970", "1980", "1990", "2000")
names(tb_tp2_jobs) <- c("County", "1970", "1980", "1990", "2000")
# Transform each table from wide to long format using the gather function from tidyr and store resulting table so that we can join them
tb_tp2_colleges_long <- gather(tb_tp2_colleges, years, college_graduation, 4:7)
tb_tp2_jobs_long <- gather(tb_tp2_jobs, years, jobs, 2:5)
# We are now going to combine the table again using inner_join from dplyr package joining on county, year
tb_tp2_result <- inner_join(tb_tp2_colleges_long, tb_tp2_jobs_long, by = c("County", "years"))
we will try to plot the 2 variables separating by countys
ggplot(tb_tp2_result, aes(x=college_graduation, y=jobs, colour = County)) + geom_line() +geom_point()
# Still working on this... as well.
This dataset represent oil consumption for the beining of 2015 from January to June of 3 major brand of Oil per category. In addition, the opening balance is indicated.
The analysis that we will try to perform is as follows: 1. Give the closing balance of component + brand 2. Give the most consumed brand across the 2 category of oil
For reproducible research, the data set will be loaded in a GitHub account as a .csv. The file will be downloaded into tb_tp3. Prior to transforming the table we will view it.
file3_url <- getURL("https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/CUNY%20607_project2_dataset3.csv")
tb_tp3 <- read.csv(text = file3_url)
# View file using "tbl_df" command from dplyr package
tbl_df(tb_tp3)
## Source: local data frame [15 x 8]
##
## X X.1 Caltex X.2 Gulf X.3 Mobil
## (fctr) (fctr) (fctr) (fctr) (fctr) (fctr) (fctr)
## 1 Month Category Purchased Consumed Purchased Consumed Purchased
## 2 Open Engine Oil 140 0 199 0 141
## 3 GearBox Oil 198 0 132 0 121
## 4 Jan Engine Oil 170 103 194 132 109
## 5 GearBox Oil 132 106 125 105 191
## 6 Feb Engine Oil 112 133 138 113 171
## 7 GearBox Oil 193 148 199 119 134
## 8 Mar Engine Oil 184 100 141 141 114
## 9 GearBox Oil 138 121 172 133 193
## 10 Apr Engine Oil 149 150 117 118 117
## 11 GearBox Oil 185 125 191 133 119
## 12 May Engine Oil 170 139 104 119 200
## 13 GearBox Oil 168 117 138 102 121
## 14 Jun Engine Oil 159 129 170 138 169
## 15 GearBox Oil 107 129 195 141 141
## Variables not shown: X.4 (fctr)
# view structure of data set tb_tp3
str(tb_tp3)
## 'data.frame': 15 obs. of 8 variables:
## $ X : Factor w/ 9 levels "","Apr","Feb",..: 8 9 1 4 1 3 1 6 1 2 ...
## $ X.1 : Factor w/ 3 levels "Category","Engine Oil",..: 1 2 3 2 3 2 3 2 3 2 ...
## $ Caltex: Factor w/ 14 levels "107","112","132",..: 14 5 13 9 3 2 12 10 4 6 ...
## $ X.2 : Factor w/ 13 levels "0","100","103",..: 13 1 1 3 4 9 11 2 6 12 ...
## $ Gulf : Factor w/ 13 levels "104","117","125",..: 13 12 4 10 3 5 12 6 8 2 ...
## $ X.3 : Factor w/ 11 levels "0","102","105",..: 11 1 1 7 3 4 6 10 8 5 ...
## $ Mobil : Factor w/ 13 levels "109","114","117",..: 13 7 5 1 10 9 6 2 11 3 ...
## $ X.4 : Factor w/ 13 levels "0","100","101",..: 13 1 1 11 2 3 11 5 7 9 ...
First we will split the table for purchased and consumed value, repeating the common column of Month and Category. Then, we will rename the columns and remove the first line of table. Then we will fill in the “month” column using fill() function from tidyr package.
# split the table into 2
tb_tp3_purchased <- select(tb_tp3, 1:3, 5, 7)
tb_tp3_consumed <- select(tb_tp3, 1:2, 4, 6, 8)
# Rename collumn for years as year
names(tb_tp3_purchased) <- c("Month", "Category", "Caltex", "Gulf", "Mobil")
names(tb_tp3_consumed) <- c("Month", "Category", "Caltex", "Gulf", "Mobil")
# remove first row
tb_tp3_purchased <- tb_tp3_purchased[-c(1), ] # remove row# 1
tb_tp3_consumed <- tb_tp3_consumed[-c(1), ] # remove row# 1