options(width = 100)
#This is a standard setup I include so that my working directory is set correctly whether I work on one of my windows or linux machines.
if (Sys.info()["sysname"]=="Windows"){
setwd("~/Masters/DATA607/Week12/Assignment")
} else {
setwd("~/Documents/Masters/DATA607/Week12/Assignment")
}
csv_file <- read.csv(file = "https://raw.githubusercontent.com/john-grando/Masters/master/DATA607/Week12/Assignment/tb.csv?token=AXxIejuTdx-_SiiG9LH5jwl9HvCeb1paks5aEk1LwA%3D%3D",
header = FALSE,
col.names = c("Country", "Year", "Gender","Val1", "Val2", "Val3"))
library(mongolite)
my_collection <- mongo(collection = "tb_collection", db = "diseases")
my_collection$insert(csv_file)
## List of 5
## $ nInserted : num 3800
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
my_collection$iterate()$one()
## $Country
## [1] "Afghanistan"
##
## $Year
## [1] 1995
##
## $Gender
## [1] "female"
##
## $Val1
## [1] -1
##
## $Val2
## [1] -1
##
## $Val3
## [1] -1
Well, there are plenty of pro vs. con webpages stating different things about why NoSQL is more useful in some situations but I thought I would take the one reason that stuck with me the most and show an example. SQL databases are built upon structured data tables, and it is very difficult to change those structures on a consistent basis. Therefore, if your data structure is constantly changing then you can be in some trouble trying to map it to a SQL database. However, NoSQL (Mongo in this case) stores information as documents and does not mind if there is additional, or less data. For example, let’s just add a new attribute to an observation.
new_entry <- data.frame(Country=c("TestCountry"), Year=c(9999), Gender=("Test Gender"), Val1=999, Val2=999, Val3=999, Val4=999)
my_collection$insert(new_entry)
## List of 5
## $ nInserted : num 1
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
So, mongo said it was inserted but did it work?
my_collection$find(query = '{"Country":"TestCountry"}')
## Country Year Gender Val1 Val2 Val3 Val4
## 1 TestCountry 9999 Test Gender 999 999 999 999
Yes it did, but wait, what did it do to our existing dataset?
my_collection$find(query = '{"Country":"Congo"}')
## Country Year Gender Val1 Val2 Val3
## 1 Congo 1995 female 17 915 11
## 2 Congo 1995 male 16 1012 15
## 3 Congo 1996 female -1 -1 -1
## 4 Congo 1996 male -1 -1 -1
## 5 Congo 1997 female -1 -1 -1
## 6 Congo 1997 male -1 -1 -1
## 7 Congo 1998 female -1 -1 -1
## 8 Congo 1998 male -1 -1 -1
## 9 Congo 1999 female 25 895 22
## 10 Congo 1999 male 17 1046 27
## 11 Congo 2000 female -1 -1 -1
## 12 Congo 2000 male -1 -1 -1
## 13 Congo 2001 female 53 1899 107
## 14 Congo 2001 male 31 2009 65
## 15 Congo 2002 female -1 -1 -1
## 16 Congo 2002 male -1 -1 -1
## 17 Congo 2003 female -1 -1 -1
## 18 Congo 2003 male -1 -1 -1
## 19 Congo 2004 female 38 1727 44
## 20 Congo 2004 male 9 2269 32
## 21 Congo 2005 female -1 -1 -1
## 22 Congo 2005 male -1 -1 -1
## 23 Congo 2006 female 44 1348 54
## 24 Congo 2006 male 32 1662 51
## 25 Congo 2007 female 45 1577 60
## 26 Congo 2007 male 28 1779 63
## 27 Congo 2008 female 56 1429 51
## 28 Congo 2008 male 31 1755 49
## 29 Congo 2009 female 65 1414 47
## 30 Congo 2009 male 50 1801 56
## 31 Congo 2010 female 49 1437 56
## 32 Congo 2010 male 41 1811 55
## 33 Congo 2011 female 72 1491 97
## 34 Congo 2011 male 58 1922 76
## 35 Congo 2012 female 63 1508 108
## 36 Congo 2012 male 46 2187 72
## 37 Congo 2013 female 46 1467 71
## 38 Congo 2013 male 35 2224 60
Nothing! That’s right! If we have new data with more information to insert, we can just do it and not worry about the existing data. However, note that this can cause a lot of workarounds and/or queries to make sure that we pull all the data we want. Additionally, there are many articles that can easily be googled which cite disadvantages of NoSQL, which are mainly security, consistency, and standardization.
Here is a query showing the Congo and our test country in one result. Note, a column is made for the new variable as long as one of the queried documents has a value (it is NA for all other documents)
my_collection$find(query = '{"$or" : [{"Country":"TestCountry"}, {"Country":"Congo"}]}')
## Country Year Gender Val1 Val2 Val3 Val4
## 1 Congo 1995 female 17 915 11 NA
## 2 Congo 1995 male 16 1012 15 NA
## 3 Congo 1996 female -1 -1 -1 NA
## 4 Congo 1996 male -1 -1 -1 NA
## 5 Congo 1997 female -1 -1 -1 NA
## 6 Congo 1997 male -1 -1 -1 NA
## 7 Congo 1998 female -1 -1 -1 NA
## 8 Congo 1998 male -1 -1 -1 NA
## 9 Congo 1999 female 25 895 22 NA
## 10 Congo 1999 male 17 1046 27 NA
## 11 Congo 2000 female -1 -1 -1 NA
## 12 Congo 2000 male -1 -1 -1 NA
## 13 Congo 2001 female 53 1899 107 NA
## 14 Congo 2001 male 31 2009 65 NA
## 15 Congo 2002 female -1 -1 -1 NA
## 16 Congo 2002 male -1 -1 -1 NA
## 17 Congo 2003 female -1 -1 -1 NA
## 18 Congo 2003 male -1 -1 -1 NA
## 19 Congo 2004 female 38 1727 44 NA
## 20 Congo 2004 male 9 2269 32 NA
## 21 Congo 2005 female -1 -1 -1 NA
## 22 Congo 2005 male -1 -1 -1 NA
## 23 Congo 2006 female 44 1348 54 NA
## 24 Congo 2006 male 32 1662 51 NA
## 25 Congo 2007 female 45 1577 60 NA
## 26 Congo 2007 male 28 1779 63 NA
## 27 Congo 2008 female 56 1429 51 NA
## 28 Congo 2008 male 31 1755 49 NA
## 29 Congo 2009 female 65 1414 47 NA
## 30 Congo 2009 male 50 1801 56 NA
## 31 Congo 2010 female 49 1437 56 NA
## 32 Congo 2010 male 41 1811 55 NA
## 33 Congo 2011 female 72 1491 97 NA
## 34 Congo 2011 male 58 1922 76 NA
## 35 Congo 2012 female 63 1508 108 NA
## 36 Congo 2012 male 46 2187 72 NA
## 37 Congo 2013 female 46 1467 71 NA
## 38 Congo 2013 male 35 2224 60 NA
## 39 TestCountry 9999 Test Gender 999 999 999 999
my_collection$drop()