Project Instruction:

For Project 4, you should take information from a relational database and migrate it to a NoSQL database of your own choosing. For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation. For the NoSQL database, you may use MongoDB, Neo4j (which we introduce in Week 12), or another NoSQL database of your choosing. Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

I am picking MongoDB to be my NoSQL database. Because it is my first time running something on MongoDB, I need to install the software into my local machine. There is a step-by-step tutorial that tells me how to download, install, and configure MongoDB. https://docs.mongodb.com/getting-started/shell/tutorial/install-mongodb-on-windows/

There are two ways we can run MongoDB. First is by entering mongod on the command window, second is by simply clicking mongod.exe file that exists in this directory: C:Files.2

I need to install any required libraries. RMySQL and mongolite are two that are heavily used in this project.

#install.packages("RODBC")
#install.packages("RMySQL")
#install.packages("DBI")
#install.packages("mongolite")

library(RODBC)
## Warning: package 'RODBC' was built under R version 3.3.2
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 3.3.2
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.3.2
library(DBI)
library(mongolite)
## Warning: package 'mongolite' was built under R version 3.3.2
library(jsonlite)

From previous assignment, I have already set up the flights database in my local machine. It is also been set to be default schema. The following code will create connection to this existing MySQl database. I am setting both username and password to be NULL, so anybody will gain access.

rmysql.settingsfile<-"C:/ProgramData/MySQL/MySQL Server 5.7/my.ini"
con <- dbConnect(RMySQL::MySQL(), default.file=rmysql.settingsfile, dbname = "flights", user=NULL, password=NULL)

I need to import data from MySQL to R. There are few tables in the database. I just pick the flights table and extract all of its data/

flights <- dbGetQuery(con,"select * from flights")
dbListFields(con, "flights")
##  [1] "year"      "month"     "day"       "dep_time"  "dep_delay"
##  [6] "arr_time"  "arr_delay" "carrier"   "tailnum"   "flight"   
## [11] "origin"    "dest"      "air_time"  "distance"  "hour"     
## [16] "minute"
head(flights)
##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      542         2      923        33      AA  N619AA
## 4 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5 2013     1   1      554        -6      812       -25      DL  N668DN
## 6 2013     1   1      554        -4      740        12      UA  N39463
##   flight origin dest air_time distance hour minute
## 1   1545    EWR  IAH      227     1400    5     17
## 2   1714    LGA  IAH      227     1416    5     33
## 3   1141    JFK  MIA      160     1089    5     42
## 4    725    JFK  BQN      183     1576    5     44
## 5    461    LGA  ATL      116      762    6     54
## 6   1696    EWR  ORD      150      719    6     54

The function mongo from package mongolite build a mongo connection object. Then I insert the data from R into MongoDB object.

mongo_data <- mongo(collection = "flights")
mongo_data$insert(flights)
## 
Processed 1000 rows...
Processed 2000 rows...
Processed 3000 rows...
Processed 4000 rows...
Processed 5000 rows...
Processed 6000 rows...
Processed 7000 rows...
Processed 8000 rows...
Processed 9000 rows...
Processed 10000 rows...
Processed 11000 rows...
Processed 12000 rows...
Processed 13000 rows...
Processed 14000 rows...
Processed 15000 rows...
Processed 16000 rows...
Processed 17000 rows...
Processed 18000 rows...
Processed 19000 rows...
Processed 20000 rows...
Processed 21000 rows...
Processed 22000 rows...
Processed 23000 rows...
Processed 24000 rows...
Processed 25000 rows...
Processed 26000 rows...
Processed 27000 rows...
Processed 28000 rows...
Processed 29000 rows...
Processed 30000 rows...
Processed 31000 rows...
Processed 32000 rows...
Processed 33000 rows...
Processed 34000 rows...
Processed 35000 rows...
Processed 36000 rows...
Processed 37000 rows...
Processed 38000 rows...
Processed 39000 rows...
Processed 40000 rows...
Processed 41000 rows...
Processed 42000 rows...
Processed 43000 rows...
Processed 44000 rows...
Processed 45000 rows...
Processed 46000 rows...
Processed 47000 rows...
Processed 48000 rows...
Processed 49000 rows...
Processed 50000 rows...
Processed 51000 rows...
Processed 52000 rows...
Processed 53000 rows...
Processed 54000 rows...
Processed 55000 rows...
Processed 56000 rows...
Processed 57000 rows...
Processed 58000 rows...
Processed 59000 rows...
Processed 60000 rows...
Processed 61000 rows...
Processed 62000 rows...
Processed 63000 rows...
Processed 64000 rows...
Processed 65000 rows...
Processed 66000 rows...
Processed 67000 rows...
Processed 68000 rows...
Processed 69000 rows...
Processed 70000 rows...
Processed 71000 rows...
Processed 72000 rows...
Processed 73000 rows...
Processed 74000 rows...
Processed 75000 rows...
Processed 76000 rows...
Processed 77000 rows...
Processed 78000 rows...
Processed 79000 rows...
Processed 80000 rows...
Processed 81000 rows...
Processed 82000 rows...
Processed 83000 rows...
Processed 84000 rows...
Processed 85000 rows...
Processed 86000 rows...
Processed 87000 rows...
Processed 88000 rows...
Processed 89000 rows...
Processed 90000 rows...
Processed 91000 rows...
Processed 92000 rows...
Processed 93000 rows...
Processed 94000 rows...
Processed 95000 rows...
Processed 96000 rows...
Processed 97000 rows...
Processed 98000 rows...
Processed 99000 rows...
Processed 1e+05 rows...
Processed 101000 rows...
Processed 102000 rows...
Processed 103000 rows...
Processed 104000 rows...
Processed 105000 rows...
Processed 106000 rows...
Processed 107000 rows...
Processed 108000 rows...
Processed 109000 rows...
Processed 110000 rows...
Processed 111000 rows...
Processed 112000 rows...
Processed 113000 rows...
Processed 114000 rows...
Processed 115000 rows...
Processed 116000 rows...
Processed 117000 rows...
Processed 118000 rows...
Processed 119000 rows...
Processed 120000 rows...
Processed 121000 rows...
Processed 122000 rows...
Processed 123000 rows...
Processed 124000 rows...
Processed 125000 rows...
Processed 126000 rows...
Processed 127000 rows...
Processed 128000 rows...
Processed 129000 rows...
Processed 130000 rows...
Processed 131000 rows...
Processed 132000 rows...
Processed 133000 rows...
Processed 134000 rows...
Processed 135000 rows...
Processed 136000 rows...
Processed 137000 rows...
Processed 138000 rows...
Processed 139000 rows...
Processed 140000 rows...
Processed 141000 rows...
Processed 142000 rows...
Processed 143000 rows...
Processed 144000 rows...
Processed 145000 rows...
Processed 146000 rows...
Processed 147000 rows...
Processed 148000 rows...
Processed 149000 rows...
Processed 150000 rows...
Processed 151000 rows...
Processed 152000 rows...
Processed 153000 rows...
Processed 154000 rows...
Processed 155000 rows...
Processed 156000 rows...
Processed 157000 rows...
Processed 158000 rows...
Processed 159000 rows...
Processed 160000 rows...
Processed 161000 rows...
Processed 162000 rows...
Processed 163000 rows...
Processed 164000 rows...
Processed 165000 rows...
Processed 166000 rows...
Processed 167000 rows...
Processed 168000 rows...
Processed 169000 rows...
Processed 170000 rows...
Processed 171000 rows...
Processed 172000 rows...
Processed 173000 rows...
Processed 174000 rows...
Processed 175000 rows...
Processed 176000 rows...
Processed 177000 rows...
Processed 178000 rows...
Processed 179000 rows...
Processed 180000 rows...
Processed 181000 rows...
Processed 182000 rows...
Processed 183000 rows...
Processed 184000 rows...
Processed 185000 rows...
Processed 186000 rows...
Processed 187000 rows...
Processed 188000 rows...
Processed 189000 rows...
Processed 190000 rows...
Processed 191000 rows...
Processed 192000 rows...
Processed 193000 rows...
Processed 194000 rows...
Processed 195000 rows...
Processed 196000 rows...
Processed 197000 rows...
Processed 198000 rows...
Processed 199000 rows...
Processed 2e+05 rows...
Processed 201000 rows...
Processed 202000 rows...
Processed 203000 rows...
Processed 204000 rows...
Processed 205000 rows...
Processed 206000 rows...
Processed 207000 rows...
Processed 208000 rows...
Processed 209000 rows...
Processed 210000 rows...
Processed 211000 rows...
Processed 212000 rows...
Processed 213000 rows...
Processed 214000 rows...
Processed 215000 rows...
Processed 216000 rows...
Processed 217000 rows...
Processed 218000 rows...
Processed 219000 rows...
Processed 220000 rows...
Processed 221000 rows...
Processed 222000 rows...
Processed 223000 rows...
Processed 224000 rows...
Processed 225000 rows...
Processed 226000 rows...
Processed 227000 rows...
Processed 228000 rows...
Processed 229000 rows...
Processed 230000 rows...
Processed 231000 rows...
Processed 232000 rows...
Processed 233000 rows...
Processed 234000 rows...
Processed 235000 rows...
Processed 236000 rows...
Processed 237000 rows...
Processed 238000 rows...
Processed 239000 rows...
Processed 240000 rows...
Processed 241000 rows...
Processed 242000 rows...
Processed 243000 rows...
Processed 244000 rows...
Processed 245000 rows...
Processed 246000 rows...
Processed 247000 rows...
Processed 248000 rows...
Processed 249000 rows...
Processed 250000 rows...
Processed 251000 rows...
Processed 252000 rows...
Processed 253000 rows...
Processed 254000 rows...
Processed 255000 rows...
Processed 256000 rows...
Processed 257000 rows...
Processed 258000 rows...
Processed 259000 rows...
Processed 260000 rows...
Processed 261000 rows...
Processed 262000 rows...
Processed 263000 rows...
Processed 264000 rows...
Processed 265000 rows...
Processed 266000 rows...
Processed 267000 rows...
Processed 268000 rows...
Processed 269000 rows...
Processed 270000 rows...
Processed 271000 rows...
Processed 272000 rows...
Processed 273000 rows...
Processed 274000 rows...
Processed 275000 rows...
Processed 276000 rows...
Processed 277000 rows...
Processed 278000 rows...
Processed 279000 rows...
Processed 280000 rows...
Processed 281000 rows...
Processed 282000 rows...
Processed 283000 rows...
Processed 284000 rows...
Processed 285000 rows...
Processed 286000 rows...
Processed 287000 rows...
Processed 288000 rows...
Processed 289000 rows...
Processed 290000 rows...
Processed 291000 rows...
Processed 292000 rows...
Processed 293000 rows...
Processed 294000 rows...
Processed 295000 rows...
Processed 296000 rows...
Processed 297000 rows...
Processed 298000 rows...
Processed 299000 rows...
Processed 3e+05 rows...
Processed 301000 rows...
Processed 302000 rows...
Processed 303000 rows...
Processed 304000 rows...
Processed 305000 rows...
Processed 306000 rows...
Processed 307000 rows...
Processed 308000 rows...
Processed 309000 rows...
Processed 310000 rows...
Processed 311000 rows...
Processed 312000 rows...
Processed 313000 rows...
Processed 314000 rows...
Processed 315000 rows...
Processed 316000 rows...
Processed 317000 rows...
Processed 318000 rows...
Processed 319000 rows...
Processed 320000 rows...
Processed 321000 rows...
Processed 322000 rows...
Processed 323000 rows...
Processed 324000 rows...
Processed 325000 rows...
Processed 326000 rows...
Processed 327000 rows...
Processed 328000 rows...
Processed 329000 rows...
Processed 330000 rows...
Processed 331000 rows...
Processed 332000 rows...
Processed 333000 rows...
Processed 334000 rows...
Processed 335000 rows...
Processed 336000 rows...
Complete! Processed total of 336776 rows.
## $nInserted
## [1] 336776
## 
## $nMatched
## [1] 0
## 
## $nRemoved
## [1] 0
## 
## $nUpserted
## [1] 0
## 
## $writeErrors
## list()
mongo_data$count()
## [1] 2020662
nrow(flights)
## [1] 336776

There are functions exist in the mongolite package which we can run to do analysis of MongoDB dataset directly in R.

testing_data <- mongo_data$find('{"carrier": "UA" , "dest": "IAH"}')
## 
 Found 1000 records...
 Found 2000 records...
 Found 3000 records...
 Found 4000 records...
 Found 5000 records...
 Found 6000 records...
 Found 7000 records...
 Found 8000 records...
 Found 9000 records...
 Found 10000 records...
 Found 11000 records...
 Found 12000 records...
 Found 13000 records...
 Found 14000 records...
 Found 15000 records...
 Found 16000 records...
 Found 17000 records...
 Found 18000 records...
 Found 19000 records...
 Found 20000 records...
 Found 21000 records...
 Found 22000 records...
 Found 23000 records...
 Found 24000 records...
 Found 25000 records...
 Found 26000 records...
 Found 27000 records...
 Found 28000 records...
 Found 29000 records...
 Found 30000 records...
 Found 31000 records...
 Found 32000 records...
 Found 33000 records...
 Found 34000 records...
 Found 35000 records...
 Found 36000 records...
 Found 37000 records...
 Found 38000 records...
 Found 39000 records...
 Found 40000 records...
 Found 41000 records...
 Found 41544 records...
 Imported 41544 records. Simplifying into dataframe...
head(testing_data)
##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      623        -4      933         1      UA  N459UA
## 4 2013     1   1      728        -4     1041         3      UA  N488UA
## 5 2013     1   1      739         0     1104        26      UA  N37408
## 6 2013     1   1      908         0     1228         9      UA  N12216
##   flight origin dest air_time distance hour minute
## 1   1545    EWR  IAH      227     1400    5     17
## 2   1714    LGA  IAH      227     1416    5     33
## 3    496    LGA  IAH      229     1416    6     23
## 4    473    LGA  IAH      238     1416    7     28
## 5   1479    EWR  IAH      249     1400    7     39
## 6   1220    EWR  IAH      233     1400    9      8
mongo_data$distinct("carrier")
##  [1] "UA"  "AA"  "B6"  "DL"  "EV"  "MQ"  "US"  "WN"  "VX"  "FL"  "AS" 
## [12] "9E"  "F9"  "HA"  "YV"  "OO"  "XYZ"
mongo_data$insert('{"year": "2016", "mongth": "1", "day": "1", "dep_time": "500", "arr_time": "800", "arr_delay": 10, "carrier": "XYZ", "tailnum": "XXXXXX", "flight": "XXXXXX", "origin": "XXX", "dest": "XXX", "air_time": "300", "distance": "1000", "hour": "5", "minute": "30"}')
## $nInserted
## [1] 1
## 
## $nMatched
## [1] 0
## 
## $nModified
## [1] 0
## 
## $nRemoved
## [1] 0
## 
## $nUpserted
## [1] 0
## 
## $writeErrors
## list()
#After inserting new observation, we are able to find the one entry that is just added, which means we are able to change the MongoDB data from R.
mongo_data$find('{"year": "2016"}')
## 
 Found 7 records...
 Imported 7 records. Simplifying into dataframe...
##   year mongth day dep_time arr_time arr_delay carrier tailnum flight
## 1 2016      1   1      500      800        10     XYZ  XXXXXX XXXXXX
## 2 2016      1   1      500      800        10     XYZ  XXXXXX XXXXXX
## 3 2016      1   1      500      800        10     XYZ  XXXXXX XXXXXX
## 4 2016      1   1      500      800        10     XYZ  XXXXXX XXXXXX
## 5 2016      1   1      500      800        10     XYZ  XXXXXX XXXXXX
## 6 2016      1   1      500      800        10     XYZ  XXXXXX XXXXXX
## 7 2016      1   1      500      800        10     XYZ  XXXXXX XXXXXX
##   origin dest air_time distance hour minute
## 1    XXX  XXX      300     1000    5     30
## 2    XXX  XXX      300     1000    5     30
## 3    XXX  XXX      300     1000    5     30
## 4    XXX  XXX      300     1000    5     30
## 5    XXX  XXX      300     1000    5     30
## 6    XXX  XXX      300     1000    5     30
## 7    XXX  XXX      300     1000    5     30
#The following code made a chart that display the average arrival delay time. 
mongo_data$aggregate('[{"$group":{"_id":"$carrier", "average delay":{"$avg":"$arr_delay"}}}]')
## 
 Found 17 records...
 Imported 17 records. Simplifying into dataframe...
##    _id average delay
## 1   UA     3.5580111
## 2   WN     9.6491199
## 3   AA     0.3642909
## 4   B6     9.4579733
## 5   DL     1.6443409
## 6   OO    11.9310345
## 7   EV    15.7964311
## 8   VX     1.7644644
## 9   US     2.1295951
## 10  9E     7.3796692
## 11  MQ    10.7747334
## 12 XYZ    10.0000000
## 13  AS    -9.9308886
## 14  FL    20.1159055
## 15  F9    21.9207048
## 16  HA    -6.9152047
## 17  YV    15.5569853

To disconnect the object is important too, otherwise if we run the code the second time, the data entry will add up incrementally.

class(mongo_data)
## [1] "mongo"       "jeroen"      "environment"
mongo_data$drop
## function () 
## {
##     check_col()
##     mongo_collection_drop(col)
## }
## <environment: 0x000000000b505e20>

Relational Database VS. NoSQL

Advantage of NoSQL: 1. There is no predefined schema, so that it is easier to update the data 2. NoSQL can handle unstructured data, and are much more flexible. 3. NoSQl database is easier to scale. It is a better choice for big data. On the other hand, RDBMS requires big, expensive, and multiple servers. 4. NoNoSQL server is cheaper and maintain. 5. NoSQL can increase the data output and performance by caching data in system memory, while RDBMS need a separate infrastucture.

Disavantage of NoSQL: 1. NoSQL is still new to many companies. Many key features need to be developed. 2. The vendors are usually small start-up companies. On the other hand, RDBMS are supported by big companies such as: Oracle, Microsoft, or IBM, which have global influences. 3. NoSQl offers few facilities for ad-hoc questy and analysis. For RDBMS, the coding is much easier. 4. NoSQL lacks of standardization, RDBMS is more standardizaed. 5. RDBMS provide ACID properties(Atomicity, Consistency, Isolation, Durability). NoSQL not so much.

Reference: 1. https://www.mongodb.com/scale/nosql-vs-relational-databases 2. https://www.sitepoint.com/sql-vs-nosql-differences/