As a first step, we came up with a strategy for working on the assignment together. First, we picked the three datasets we were interested in analyzing. After discussing some basic strategies, we each picked one dataset to take the lead on. We worked on those individually, collaborating and reaching out to each other with questions. That gave us an opportunity to sharpen our data wrangling and collaboration skills. Then we worked on the third dataset together.
#install.packages("tidyverse")
#install.packages("dplyr")
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
library(tidyr)
The data I am analyzing is available at http://bit.ly/tardistravels. I found the reference to this dataset in the following article: https://www.theguardian.com/news/datablog/2010/aug/20/doctor-who-time-travel-information-is-beautiful.
This dataset includes details of time travel dates on the TV show Doctor Who. The classic era of the show aired from 1963 to 1989, and the new era started in 2005. This data includes episodes through 2011 only, covering 11 incarnations of the Doctor. I am going to compare the time travel span between classic and new era episodes. My theory is that, due to the technology limitations at the time the classic episodes were aired, they will show more limited time travel. I will also determine the average number of years each Doctor travels through during their tenure and identify which Doctor time travels the most and the least.
DWData <- read.csv(file="https://raw.githubusercontent.com/che10vek/Data-607-Assignments/master/Copy%20of%20Doctor%20Who%20Time%20Travel%20Journeys%20-%20Copy%20of%20DM_mastersheet%201.csv", header=TRUE, sep=",")
head(DWData)
## Doctor.Who.season doctor.actor ep..no episode.title from
## 1 10 Tennant xmas 2006 the Runaway Bride 2007
## 2 1 Hartnell 1 An Unearthly Child 1963
## 3 1 Hartnell 21 The Daleks' Master Plan -2500
## 4 3 Pertwee 64 The Time Monster 1972
## 5 1 Hartnell 20 The Myth Makers -1200
## 6 1 Hartnell 12 The Romans 64
## estimated.from to estimated. planet sub.location
## 1 NA -5000000000 Earth England
## 2 NA -100000 n/a
## 3 4000 -2500 Earth
## 4 2900 -2000 Earth/ Atlantis
## 5 3999 -1200 y Earth
## 6 2493 64 Earth Italy
## location
## 1 London
## 2 n/a
## 3 Egypt
## 4
## 5 Asia Minor/Troy
## 6 Rome
## notes
## 1
## 2
## 3
## 4 'ancient' Atlantis. Krasis is brought forward in time from 2000 BC
## 5
## 6
## source X X.1 X.2 X.3 X.4 X.5
## 1 http://who-transcripts.atspace.com/ NA NA NA NA
## 2 http://tardis.wikia.com/wiki/An_Unearthly_Child NA NA NA NA
## 3 http://tardis.wikia.com/wiki/Season_Three NA NA NA NA
## 4 http://tardis.wikia.com/wiki/Krasis NA NA NA NA
## 5 http://tardis.wikia.com/wiki/Season_Three NA NA NA NA
## 6 http://tardis.wikia.com/wiki/Season_Two NA NA NA NA
## X.6
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
Since none of my questions apply to location information, I am going to drop that data and keep only the time travel details.
DWTime <- subset(DWData, select=Doctor.Who.season:to)
head(DWTime)
## Doctor.Who.season doctor.actor ep..no episode.title from
## 1 10 Tennant xmas 2006 the Runaway Bride 2007
## 2 1 Hartnell 1 An Unearthly Child 1963
## 3 1 Hartnell 21 The Daleks' Master Plan -2500
## 4 3 Pertwee 64 The Time Monster 1972
## 5 1 Hartnell 20 The Myth Makers -1200
## 6 1 Hartnell 12 The Romans 64
## estimated.from to
## 1 NA -5000000000
## 2 NA -100000
## 3 4000 -2500
## 4 2900 -2000
## 5 3999 -1200
## 6 2493 64
After visually inspecting the data, I am going to assume that “Estimated From” is the more accurate variable for measuring the beginning of a time travel journey. The original “From” column seems to contain the same data as the “to” column when the beginning time is not explicitly stated. I am therefore going to use “Estimated From” as the start date, filling in its missing values from the “From” column.
DWTime$estimated.from[is.na(DWTime$estimated.from)] <- as.character(DWTime$from[is.na(DWTime$estimated.from)])
#the code below removes the commas that are present in certain fields.
DWTime[,'estimated.from'] <- gsub(",","",DWTime[,'estimated.from'])
head(DWTime)
## Doctor.Who.season doctor.actor ep..no episode.title from
## 1 10 Tennant xmas 2006 the Runaway Bride 2007
## 2 1 Hartnell 1 An Unearthly Child 1963
## 3 1 Hartnell 21 The Daleks' Master Plan -2500
## 4 3 Pertwee 64 The Time Monster 1972
## 5 1 Hartnell 20 The Myth Makers -1200
## 6 1 Hartnell 12 The Romans 64
## estimated.from to
## 1 2007 -5000000000
## 2 1963 -100000
## 3 4000 -2500
## 4 2900 -2000
## 5 3999 -1200
## 6 2493 64
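The same fill-and-clean step can also be written with `coalesce()` from dplyr, which takes the first non-missing value across columns. The sketch below runs on a small toy data frame whose values are made up for illustration (they are not the real dataset), and it assumes both columns are already character vectors rather than factors.

```r
library(dplyr)

# Toy data (hypothetical values) mimicking the two start-year columns
toy <- data.frame(
  from = c("2007", "1963", "-2500"),
  estimated.from = c(NA, NA, "4,000"),
  stringsAsFactors = FALSE
)

toy_filled <- toy %>%
  mutate(
    # use estimated.from where present, otherwise fall back to from
    estimated.from = coalesce(estimated.from, from),
    # strip thousands separators such as "4,000"
    estimated.from = gsub(",", "", estimated.from, fixed = TRUE)
  )

print(toy_filled$estimated.from)
```

If the columns were read in as factors, they would need `as.character()` first, since `coalesce()` expects its arguments to share a type.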
Now we can remove the “from” variable.
DWTime$from <- NULL
head(DWTime)
## Doctor.Who.season doctor.actor ep..no episode.title
## 1 10 Tennant xmas 2006 the Runaway Bride
## 2 1 Hartnell 1 An Unearthly Child
## 3 1 Hartnell 21 The Daleks' Master Plan
## 4 3 Pertwee 64 The Time Monster
## 5 1 Hartnell 20 The Myth Makers
## 6 1 Hartnell 12 The Romans
## estimated.from to
## 1 2007 -5000000000
## 2 1963 -100000
## 3 4000 -2500
## 4 2900 -2000
## 5 3999 -1200
## 6 2493 64
Let’s calculate time travel span.
DWTime$Span = (as.numeric(as.character(DWTime$to))) - as.numeric(as.character(DWTime$estimated.from))
## Warning: NAs introduced by coercion
#since we don't care whether the travel is back in time or forward in time, we are taking absolute values
DWTime$Span<-abs(DWTime$Span)
head(DWTime)
## Doctor.Who.season doctor.actor ep..no episode.title
## 1 10 Tennant xmas 2006 the Runaway Bride
## 2 1 Hartnell 1 An Unearthly Child
## 3 1 Hartnell 21 The Daleks' Master Plan
## 4 3 Pertwee 64 The Time Monster
## 5 1 Hartnell 20 The Myth Makers
## 6 1 Hartnell 12 The Romans
## estimated.from to Span
## 1 2007 -5000000000 5000002007
## 2 1963 -100000 101963
## 3 4000 -2500 6500
## 4 2900 -2000 4900
## 5 3999 -1200 5199
## 6 2493 64 2429
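The coercion warning above means that some values could not be converted to numbers and became NA. A minimal sketch of how one might compute the span and then list the rows that failed to convert is shown below. It uses a toy data frame with invented values, including one deliberately non-numeric entry ("unknown"), so the names and data here are assumptions for illustration only.

```r
# Toy data (hypothetical values); "unknown" stands in for any non-numeric entry
toy <- data.frame(
  estimated.from = c("2007", "1963", "unknown"),
  to = c("-5000000000", "-100000", "1941"),
  stringsAsFactors = FALSE
)

# Convert to numeric; entries that cannot be parsed become NA
start_year <- suppressWarnings(as.numeric(toy$estimated.from))
end_year   <- suppressWarnings(as.numeric(toy$to))

# Direction does not matter, so take the absolute difference
toy$Span <- abs(end_year - start_year)

# Rows whose span could not be computed, for manual inspection
bad_rows <- toy[is.na(toy$Span), ]
print(bad_rows)
```

Inspecting `bad_rows` before aggregating makes it clear how many journeys are silently dropped by `na.rm = TRUE` later on.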
Now let’s make our dataset wide.
DWWide<-spread(DWTime, doctor.actor, Span)
head(DWWide)
## Doctor.Who.season ep..no episode.title estimated.from
## 1 2007
## 2 2007
## 3 164 The Empty Child 2005
## 4 1 1 An Unearthly Child 1963
## 5 1 10 The Dalek Invasion of Earth 1963
## 6 1 12 The Romans 2493
## to V1 Baker Colin Baker Davison Eccleston Hartnell McCoy McGann
## 1 1969 NA NA NA NA NA NA NA NA
## 2 2008 NA NA NA NA NA NA NA NA
## 3 1941 NA NA NA NA 64 NA NA NA
## 4 -100000 NA NA NA NA NA 101963 NA NA
## 5 2164 NA NA NA NA NA 201 NA NA
## 6 64 NA NA NA NA NA 2429 NA NA
## Pertwee Smith Tennant Troughton
## 1 NA NA 38 NA
## 2 NA NA 1 NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
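To make the reshaping step easier to follow, here is a small sketch of what `spread()` does on a toy long-format data frame. The episode names and spans are placeholders, not rows from the real dataset. Each distinct value of the key column becomes its own column, and cells with no matching row are filled with NA.

```r
library(tidyr)

# Toy long-format data (hypothetical values): one row per journey
toy_long <- data.frame(
  episode.title = c("A", "B", "C"),
  doctor.actor  = c("Hartnell", "Tennant", "Hartnell"),
  Span          = c(101963, 38, 201),
  stringsAsFactors = FALSE
)

# One column per doctor; Span values land in the matching doctor's column
toy_wide <- spread(toy_long, key = doctor.actor, value = Span)
print(toy_wide)
```

This is why the wide table above is mostly NA: each journey belongs to exactly one Doctor, so only one of the Doctor columns is filled per row.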
DWWide$V1 <- NULL
# I noticed that the cell below has an empty value, and I am filling in the correct number because I happen to know what it is. If this error had not been visible, I could not have made this correction and the row would simply have been excluded from the calculation.
DWWide[3, 1] = 9
Now I am removing the rows with blank episode titles.
DWWide <- subset(DWWide, DWWide$episode.title != "")
head(DWWide)
## Doctor.Who.season ep..no episode.title estimated.from
## 3 9 164 The Empty Child 2005
## 4 1 1 An Unearthly Child 1963
## 5 1 10 The Dalek Invasion of Earth 1963
## 6 1 12 The Romans 2493
## 7 1 14 The Crusade 1965
## 8 1 16 The Chase 1872
## to Baker Colin Baker Davison Eccleston Hartnell McCoy McGann
## 3 1941 NA NA NA 64 NA NA NA
## 4 -100000 NA NA NA NA 101963 NA NA
## 5 2164 NA NA NA NA 201 NA NA
## 6 64 NA NA NA NA 2429 NA NA
## 7 1200 NA NA NA NA 765 NA NA
## 8 1996 NA NA NA NA 124 NA NA
## Pertwee Smith Tennant Troughton
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## 7 NA NA NA NA
## 8 NA NA NA NA
From the summary below we can get a glimpse of a lot of information: the most frequently visited year (2010), which seasons have the most time travel (10 and 11), and which episodes have the most time jumps.
summary(DWWide)
## Doctor.Who.season ep..no episode.title
## 11 :27 12 : 6 The Big Bang : 4
## 10 :25 1 : 5 the Girl in the Fireplace: 4
## 1 :21 13 : 5 The Pandorica Opens : 4
## 5 :12 204 : 4 The Daleks' Master Plan : 3
## 2 :11 Special: 4 The Eleventh Hour : 3
## 3 : 9 21 : 3 The Chase : 2
## (Other):27 (Other):105 (Other) :112
## estimated.from to Baker Colin Baker
## Length:132 2010 : 8 Min. :6.900e+01 Min. : 0.0
## Class :character 2008 : 5 1st Qu.:1.140e+02 1st Qu.: 50.0
## Mode :character 1983 : 4 Median :4.880e+02 Median :100.0
## 1941 : 3 Mean :5.714e+11 Mean :215.0
## 1966 : 3 3rd Qu.:1.914e+04 3rd Qu.:322.5
## 1970 : 3 Max. :4.000e+12 Max. :545.0
## (Other):106 NA's :125 NA's :129
## Davison Eccleston Hartnell
## Min. :5.000e+00 Min. :1.00e+00 Min. :4.500e+01
## 1st Qu.:2.200e+02 1st Qu.:1.50e+01 1st Qu.:1.690e+02
## Median :8.420e+02 Median :9.30e+01 Median :8.990e+02
## Mean :2.296e+09 Mean :1.25e+09 Mean :6.195e+08
## 3rd Qu.:4.250e+07 3rd Qu.:1.25e+09 3rd Qu.:2.933e+03
## Max. :1.370e+10 Max. :5.00e+09 Max. :1.300e+10
## NA's :120 NA's :124 NA's :111
## McCoy McGann Pertwee Smith
## Min. : 40.0 Min. :12 Min. : 20 Min. : 0
## 1st Qu.: 86.0 1st Qu.:12 1st Qu.: 128 1st Qu.: 11
## Median : 252.5 Median :12 Median : 567 Median : 271
## Mean : 921.5 Mean :12 Mean :1298 Mean :1313
## 3rd Qu.:1581.0 3rd Qu.:12 3rd Qu.: 774 3rd Qu.:2990
## Max. :2759.0 Max. :12 Max. :4900 Max. :5043
## NA's :124 NA's :131 NA's :123 NA's :105
## Tennant Troughton
## Min. :0.000e+00 Min. : 3.0
## 1st Qu.:1.500e+01 1st Qu.: 38.0
## Median :1.110e+02 Median : 69.0
## Mean :4.167e+12 Mean :100.6
## 3rd Qu.:4.800e+02 3rd Qu.:162.0
## Max. :1.000e+14 Max. :240.0
## NA's :108 NA's :121
Let’s compare the average and total travel years for the classic and new eras.
#Classic Era Average (seasons 1 through 8)
mean(DWTime$Span[DWTime$Doctor.Who.season %in% 1:8], na.rm = TRUE)
## [1] 56118891661
#New Era Average (seasons 9 through 11)
mean(DWTime$Span[DWTime$Doctor.Who.season %in% 9:11], na.rm = TRUE)
## [1] 1.724483e+12
#Classic Era Sum
sum(DWTime$Span[DWTime$Doctor.Who.season %in% 1:8], na.rm = TRUE)
## [1] 4.04056e+12
#New Era Sum
sum(DWTime$Span[DWTime$Doctor.Who.season %in% 9:11], na.rm = TRUE)
## [1] 1.0002e+14
## [1] 1.0002e+14
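The results support my initial assumption: the new era has more time travel than the classic episodes, both on average (about 1.72e+12 versus 5.61e+10 years per journey) and in total (about 1.00e+14 versus 4.04e+12 years). Note that these figures are dominated by a few extreme journeys; Tennant's largest span alone is 1e+14 years, while his median span is only 111 years.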
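The era comparison can also be expressed as a single grouped summary. The sketch below is one possible way to do it, assuming a numeric season column and using toy values invented for illustration. In the real data the season column would first need converting with `as.numeric(as.character(...))`. The `era` column and the cut-off at season 8 follow the classic/new split used above.

```r
library(dplyr)

# Toy data (hypothetical values): season number and span in years
toy <- data.frame(
  season = c(1, 3, 8, 9, 10, 11),
  Span   = c(100, 200, NA, 50, 1000, 10)
)

era_summary <- toy %>%
  # seasons 1-8 are treated as classic, 9 and above as new
  mutate(era = ifelse(season <= 8, "classic", "new")) %>%
  group_by(era) %>%
  summarise(
    mean_span  = mean(Span, na.rm = TRUE),
    total_span = sum(Span, na.rm = TRUE),
    journeys   = sum(!is.na(Span))
  )

print(era_summary)
```

Labeling the era once and grouping on it keeps the mean and the sum tied to the same definition of "classic" and "new".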
Now, we can see which incarnations of the Doctor traveled in time the most.
DWDoctors<-DWWide %>%
select ("Baker", "Colin Baker", "Davison", "Hartnell", "McCoy", "McGann", "Pertwee", "Troughton", "Eccleston", "Smith", "Tennant")
#Average
sort(apply(DWDoctors, 2, mean, na.rm=TRUE), decreasing = TRUE)
## Tennant Baker Davison Eccleston Hartnell
## 4.167083e+12 5.714286e+11 2.295834e+09 1.250000e+09 6.195299e+08
## Smith Pertwee McCoy Colin Baker Troughton
## 1.312963e+03 1.297667e+03 9.215000e+02 2.150000e+02 1.006364e+02
## McGann
## 1.200000e+01
#Sum of all time travel years
sort(apply(DWDoctors, 2, sum, na.rm=TRUE), decreasing = TRUE)
## Tennant Baker Davison Hartnell Eccleston
## 1.000100e+14 4.000000e+12 2.755001e+10 1.301013e+10 9.999996e+09
## Smith Pertwee McCoy Troughton Colin Baker
## 3.545000e+04 1.167900e+04 7.372000e+03 1.107000e+03 6.450000e+02
## McGann
## 1.200000e+01
1st place: Tennant (new era); 2nd place: Baker (classic era); 3rd place: Davison (classic era).
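Because a single very long journey can dominate both the mean and the sum, a median-based ranking may be a useful complement. The following is a rough sketch on toy long-format data. The spans are illustrative values chosen to include one extreme outlier, not a re-computation of the real results.

```r
library(dplyr)

# Toy long-format data (hypothetical values), one row per journey
toy_long <- data.frame(
  doctor.actor = c("Tennant", "Tennant", "Tennant", "Smith", "Smith", "Smith"),
  Span         = c(1, 15, 1e14, 0, 271, 5043),
  stringsAsFactors = FALSE
)

doctor_rank <- toy_long %>%
  group_by(doctor.actor) %>%
  summarise(
    mean_span   = mean(Span, na.rm = TRUE),
    median_span = median(Span, na.rm = TRUE)
  ) %>%
  # rank by the typical journey rather than by the outlier-driven mean
  arrange(desc(median_span))

print(doctor_rank)
```

In this toy example the ranking by mean and the ranking by median disagree, which is the kind of difference worth checking before declaring a winner.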
DWClassic<-DWWide %>%
select ("Baker", "Colin Baker", "Davison", "Hartnell", "McCoy", "McGann", "Pertwee", "Troughton")
DWNew<-DWWide %>%
select ("Eccleston", "Smith", "Tennant")