The Assignment

Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. Over the next few weeks, we’ll work with several packages that help with the tasks of tidying and transforming data.

Solution

I opted to leverage data on online video encoding. The information is available at: https://archive.ics.uci.edu/ml/datasets/Online+Video+Characteristics+and+Transcoding+Time+Dataset

theUrl="https://raw.githubusercontent.com/kennygfm/IS607/master/Week%202/online_video_dataset/transcoding_mesurment.tsv"
#Load all of the data into encoding_data
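#stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0, matching the str() output below
encoding_data <- read.table(file = theUrl, header = TRUE, sep = "\t", stringsAsFactors = TRUE)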

Below are some commands that I used to examine the file. The data dictionary is available at the same URL as above. We will focus on the second data file, which includes input and output video characteristics along with the transcoding time and memory requirements for transcoding videos to different but valid formats.

str(encoding_data)
## 'data.frame':    68784 obs. of  22 variables:
##  $ id         : Factor w/ 1099 levels "_2al-ZI1Wss",..: 26 26 26 26 26 26 26 26 26 26 ...
##  $ duration   : num  130 130 130 130 130 ...
##  $ codec      : Factor w/ 4 levels "flv","h264","mpeg4",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ width      : int  176 176 176 176 176 176 176 176 176 176 ...
##  $ height     : int  144 144 144 144 144 144 144 144 144 144 ...
##  $ bitrate    : int  54590 54590 54590 54590 54590 54590 54590 54590 54590 54590 ...
##  $ framerate  : num  12 12 12 12 12 12 12 12 12 12 ...
##  $ i          : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ p          : int  1537 1537 1537 1537 1537 1537 1537 1537 1537 1537 ...
##  $ b          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ frames     : int  1564 1564 1564 1564 1564 1564 1564 1564 1564 1564 ...
##  $ i_size     : int  64483 64483 64483 64483 64483 64483 64483 64483 64483 64483 ...
##  $ p_size     : int  825054 825054 825054 825054 825054 825054 825054 825054 825054 825054 ...
##  $ b_size     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ size       : int  889537 889537 889537 889537 889537 889537 889537 889537 889537 889537 ...
##  $ o_codec    : Factor w/ 4 levels "flv","h264","mpeg4",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ o_bitrate  : int  56000 56000 56000 56000 56000 56000 56000 56000 56000 56000 ...
##  $ o_framerate: num  12 12 12 12 12 12 15 15 15 15 ...
##  $ o_width    : int  176 320 480 640 1280 1920 176 320 480 640 ...
##  $ o_height   : int  144 240 360 480 720 1080 144 240 360 480 ...
##  $ umem       : int  22508 25164 29228 34316 58528 102072 23132 25164 29236 34312 ...
##  $ utime      : num  0.612 0.98 1.216 1.692 3.456 ...
names(encoding_data)
##  [1] "id"          "duration"    "codec"       "width"       "height"     
##  [6] "bitrate"     "framerate"   "i"           "p"           "b"          
## [11] "frames"      "i_size"      "p_size"      "b_size"      "size"       
## [16] "o_codec"     "o_bitrate"   "o_framerate" "o_width"     "o_height"   
## [21] "umem"        "utime"

After reviewing the data dictionary and the data, we will limit our analysis to whether the original encoding and file size are related to the output encoding and processing time. Note that the exact same machine is used for all the calculations, so issues with processor speed and memory are immaterial. Also note that we have no idea whether these are the driving factors relative to the other variables in the data; ideally we learn how to identify those in a statistics course. We will keep all the rows for now, because it is not a huge number of observations: 68784.

encoding_data_limit <- na.omit(encoding_data[,c("codec","size","o_codec","utime")])
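
To make the effect of that line concrete, here is a minimal sketch on a tiny, made-up data frame (the `toy` values are invented for illustration and are not from the transcoding dataset). It selects the same four columns by name and then drops any row that has a missing value in one of them.

```r
# Toy data frame: values are made up, not taken from the transcoding data
toy <- data.frame(
  codec   = c("flv", "h264", "mpeg4", "vp8"),
  size    = c(1000, NA, 3000, 4000),
  o_codec = c("h264", "flv", "vp8", "mpeg4"),
  utime   = c(1.2, 0.8, NA, 2.5),
  width   = c(176, 320, 480, 640)
)

# Keep four columns by name, then drop rows with an NA in any of them
toy_limit <- na.omit(toy[, c("codec", "size", "o_codec", "utime")])

# Number of rows removed because of missing values
rows_dropped <- nrow(toy) - nrow(toy_limit)

print(toy_limit)
print(rows_dropped)
```

Comparing `nrow()` before and after is a quick way to check how much data `na.omit` removed.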

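One would assume that file size is a major factor in transcoding time, so we will examine that variable first and create a new column that classifies the files into four groups between the 10th and 90th percentiles of size. Files outside that range (roughly the smallest and largest 10%) get NA for the group, which is where the 14144 NA's in the summary below come from.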

summary(encoding_data_limit$size)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    191900   2258000   7881000  25020000  19770000 806700000
hist(encoding_data_limit$size)

breaks <- quantile(encoding_data_limit$size,c(0.1,0.25,0.5,0.75,0.9))
f <- cut(encoding_data_limit$size, breaks, labels=c("Bottom","Mid-Low","Mid-High","Top-most"))
summary(f)
##   Bottom  Mid-Low Mid-High Top-most     NA's 
##    10259    17101    17021    10259    14144
tapply(encoding_data_limit$size, f, median)
##   Bottom  Mid-Low Mid-High Top-most 
##  1724802  5470732 12250321 31065203
#Let's append this information to the data frame so we can apply some interesting pivots
encoding_data_limit$size_group <- f
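
The NA's in the grouping come from `cut` receiving breaks that do not span the whole range of the data. The sketch below reproduces that behaviour on made-up sizes and then shows one possible alternative, assuming a grouping that covers every file is wanted: quintile breaks from 0% to 100% with `include.lowest = TRUE`. The `Q1` to `Q5` labels and the `toy_size` vector are hypothetical, not part of the original analysis.

```r
# Made-up, right-skewed "sizes" for illustration only (not the real data)
set.seed(1)
toy_size <- rlnorm(1000, meanlog = 15, sdlog = 1.5)

# Breaks at the 10th to 90th percentiles, as in the text:
# values outside the outer breaks are assigned NA
inner_breaks <- quantile(toy_size, c(0.1, 0.25, 0.5, 0.75, 0.9))
inner <- cut(toy_size, inner_breaks,
             labels = c("Bottom", "Mid-Low", "Mid-High", "Top-most"))

# Hypothetical alternative: quintile breaks covering 0% to 100%,
# with include.lowest = TRUE so the minimum value is not dropped
full_breaks <- quantile(toy_size, c(0, 0.2, 0.4, 0.6, 0.8, 1))
full <- cut(toy_size, full_breaks, include.lowest = TRUE,
            labels = c("Q1", "Q2", "Q3", "Q4", "Q5"))

print(summary(inner))  # includes an NA's count
print(summary(full))   # every value falls into a group
```

Which variant is preferable depends on whether the extreme files should be excluded from the comparison or studied as groups of their own.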

Now that we have a factor that groups the files by size, we can more readily see the impact of file size on transcoding time.

# Table on encoding time based on file size
tapply(encoding_data_limit$utime, f, median)
##   Bottom  Mid-Low Mid-High Top-most 
##    3.020    4.140    4.224    6.992
#Table on encoding time based on file size and input encoding
tapply(encoding_data_limit$utime, list(f,encoding_data_limit$codec), median)
##            flv  h264 mpeg4   vp8
## Bottom   3.952 3.542 2.522 2.702
## Mid-Low  3.716 4.368 2.878 4.306
## Mid-High 3.324 4.404 3.102 4.272
## Top-most 7.372 7.704 1.422 5.728
#Table on encoding time based on file size and output encoding
tapply(encoding_data_limit$utime, list(f,encoding_data_limit$o_codec), median)
##            flv   h264 mpeg4   vp8
## Bottom   0.864  7.904 2.834 4.090
## Mid-Low  1.316 11.281 3.668 5.848
## Mid-High 1.432 10.659 3.688 6.004
## Top-most 3.756 15.037 5.240 9.587

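What the above tables show is that file size has less impact on encoding time than one would expect: the overall median only a little more than doubles between the Bottom and Top-most groups (3.020 to 6.992). A surprise for yours truly, actually! It seems the more significant variable is simply the output codec: within every size group, the median for h264 output is at least four times the median for flv output.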
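
One rough way to compare the two effects is to take the medians printed in the size-group by output-codec table above and compute max/min ratios along each dimension. This is only a sketch: the matrix `med` is typed in by hand from that printed output, and a ratio of medians is an informal yardstick, not a statistical test.

```r
# Medians copied from the printed size-group x output-codec table above
med <- matrix(
  c(0.864,  7.904, 2.834, 4.090,
    1.316, 11.281, 3.668, 5.848,
    1.432, 10.659, 3.688, 6.004,
    3.756, 15.037, 5.240, 9.587),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Bottom", "Mid-Low", "Mid-High", "Top-most"),
                  c("flv", "h264", "mpeg4", "vp8"))
)

# Within each size group: slowest output codec vs. fastest output codec
codec_ratio <- apply(med, 1, function(r) max(r) / min(r))

# Within each output codec: slowest size group vs. fastest size group
size_ratio <- apply(med, 2, function(col) max(col) / min(col))

print(round(codec_ratio, 2))
print(round(size_ratio, 2))
```

If the output codec matters more than file size, the values in `codec_ratio` should tend to be larger than those in `size_ratio`.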

#Table on encoding time based on input vs. output encoding
tapply(encoding_data_limit$utime, list(encoding_data_limit$codec,encoding_data_limit$o_codec), median)
##         flv   h264 mpeg4   vp8
## flv   0.860 10.877 3.164 5.470
## h264  1.868 12.329 4.240 7.256
## mpeg4 0.596  8.125 2.126 3.464
## vp8   2.248 11.017 4.320 6.892
#Table on encoding time based on all three other variables
tapply(encoding_data_limit$utime, list(f,encoding_data_limit$codec,encoding_data_limit$o_codec), median)
## , , flv
## 
##            flv  h264 mpeg4   vp8
## Bottom   0.912 0.984 0.600 1.000
## Mid-Low  0.892 1.448 0.628 1.322
## Mid-High 0.758 1.556 0.660 1.410
## Top-most 0.808 3.876 0.516 2.320
## 
## , , h264
## 
##             flv   h264  mpeg4    vp8
## Bottom   11.653  8.821  8.197  4.248
## Mid-Low  10.455 11.355  9.565 12.149
## Mid-High 11.151 10.571 10.117 13.069
## Top-most 12.093 15.833     NA 13.097
## 
## , , mpeg4
## 
##            flv  h264 mpeg4   vp8
## Bottom   3.272 3.284 2.124 3.210
## Mid-Low  3.174 3.886 2.186 3.814
## Mid-High 2.960 3.800 2.402 3.786
## Top-most 7.372 5.344    NA 5.112
## 
## , , vp8
## 
##            flv   h264 mpeg4   vp8
## Bottom   6.166  4.734 3.272 3.444
## Mid-Low  5.496  6.376 3.996 5.908
## Mid-High 4.730  6.204 5.230 5.942
## Top-most    NA 10.001 2.328 8.065
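
A three-way `tapply` returns an array that prints as one slice per output codec. If a single long table is easier to read or to pass downstream, `aggregate` with a formula gives one row per combination instead. The sketch below uses a made-up `toy` data frame with the same column names as `encoding_data_limit`; the `utime` values are invented.

```r
# Made-up data: every combination of two size groups, two input codecs
# and two output codecs, repeated three times
toy <- expand.grid(
  size_group = c("Bottom", "Top-most"),
  codec      = c("flv", "h264"),
  o_codec    = c("flv", "h264"),
  rep        = 1:3,
  stringsAsFactors = FALSE
)
toy$utime <- seq_len(nrow(toy)) / 2  # invented timings

# One row per (size_group, codec, o_codec) combination with its median utime
long <- aggregate(utime ~ size_group + codec + o_codec, data = toy, FUN = median)

print(long)
```

Unlike the array form, combinations with no observations simply do not appear as rows, rather than showing up as NA cells.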

The tables are useful, of course, but let’s make sure the median was the right call and review the outliers via boxplot visualizations.

#Boxplot of transcoding time across all rows
boxplot(encoding_data_limit$utime)

boxplot(encoding_data_limit$utime ~ f)

boxplot(encoding_data_limit$utime ~ encoding_data_limit$codec)

boxplot(encoding_data_limit$utime ~ encoding_data_limit$o_codec)

boxplot(encoding_data_limit$utime ~ encoding_data_limit$o_codec+encoding_data_limit$codec, horizontal=TRUE)
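
The outliers that the boxplots draw can also be counted rather than only eyeballed. The sketch below, on made-up timings, calls `boxplot` with `plot = FALSE` to get the statistics behind the picture, counts the outliers per group, and compares mean and median. The `toy` data frame is hypothetical; each group has one deliberately extreme value.

```r
# Made-up timings: each output codec has one deliberately extreme value
toy <- data.frame(
  o_codec = factor(rep(c("flv", "h264"), each = 8)),
  utime   = c(0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 9.0,
              8, 9, 10, 11, 12, 13, 14, 60)
)

# plot = FALSE returns the boxplot statistics without drawing anything
bp <- boxplot(utime ~ o_codec, data = toy, plot = FALSE)

# bp$out holds the outlying values, bp$group the group index of each one
outliers_per_group <- table(factor(bp$names[bp$group], levels = bp$names))

# The mean is pulled towards the extreme values, the median much less so
group_mean   <- tapply(toy$utime, toy$o_codec, mean)
group_median <- tapply(toy$utime, toy$o_codec, median)

print(outliers_per_group)
print(rbind(mean = group_mean, median = group_median))
```

A large gap between mean and median in a group is one sign that the median is the safer summary for it.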

From this, it becomes clearer that the output codec has the most significant impact on processing time. So we can further limit our data to the slowest output codec (h264) and base further analysis on the rows of the original data frame that use it.

encoding_data_limit2 <- na.omit(encoding_data[encoding_data$o_codec=="h264",])
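
Two details of this kind of filter are worth a quick check on toy data: a logical index that contains NA yields an all-NA row (which `na.omit` then removes), and a factor column keeps its unused levels after filtering unless `droplevels` is applied. The `toy` data frame below is made up, and the `droplevels` step is an optional addition, not something the original analysis does.

```r
# Made-up data with a missing codec and a missing timing
toy <- data.frame(
  o_codec = factor(c("flv", "h264", "h264", "vp8", NA)),
  utime   = c(0.9, 12.3, NA, 6.9, 4.0)
)

# Rows where the comparison is NA come back as all-NA rows
filtered <- toy[toy$o_codec == "h264", ]

# na.omit removes both the all-NA row and the h264 row with a missing utime
h264_only <- na.omit(filtered)

# Optional: drop factor levels that no longer occur in the subset
h264_clean <- droplevels(h264_only)

print(nrow(filtered))
print(levels(h264_only$o_codec))
print(levels(h264_clean$o_codec))
```

Dropping unused levels keeps later tables and boxplots from showing empty groups for codecs that are no longer in the data.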

This leaves us with new data to explore (at another time) to identify other variables that potentially drive up processing time.