library(RMySQL)
## Loading required package: DBI
# I originally thought tb was part of the flights dataset, which explains the dbname argument below
mydb = dbConnect(MySQL(), user = 'root', password = 'password1234', dbname = 'flights', host = 'localhost')
dbListFields(mydb, 'tb')
## [1] "country" "year" "sex" "child" "adult" "elderly"
Run a query using the dbSendQuery command to extract the data from MySQL
rs <- dbSendQuery(mydb, "SELECT * FROM tb")
mySQL_data <- fetch(rs, n = -1) # n = -1 fetches all remaining rows
Save the retrieved data as a csv file
write.csv(mySQL_data, "MySQLquery_db.csv",row.names = FALSE)
dbDisconnect(mydb)
## Warning: Closing open result sets
## [1] TRUE
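The warning above appears because the result set rs was never released before disconnecting. Clearing it first avoids the message:

dbClearResult(rs)
dbDisconnect(mydb)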
If you’re skipping the SQL approach, here’s the direct link to the csv in the GitHub repository: https://raw.githubusercontent.com/hvasquez81/DATA607-Week-12-Assignment/master/MySQLquery_db.csv
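If you take that route, the file can be read straight into R:

mySQL_data <- read.csv("https://raw.githubusercontent.com/hvasquez81/DATA607-Week-12-Assignment/master/MySQLquery_db.csv")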
The next thing to do here is open Neo4j Desktop and create a new project:
Step 1: Create a new project [screenshot]
Step 2: Set the graph name and password [screenshot]
Step 3: Start the new graph [screenshot]
Steps 4-7: [screenshots of the remaining setup steps]
In this situation I feel a NoSQL database isn’t as useful as it normally is. The data is fairly simple and not as diverse as a database from, say, a store like Amazon. The connections are short, so network analysis on a graph database wouldn’t show much. In this case, an RDBMS would be easier to work with if we were looking for aggregates such as the sums of the child, adult, and elderly counts by sex, across different years, or across different countries. The queries in MySQL would be much easier to write and much more effective at telling us what the data shows than anything Neo4j could.
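For example, an aggregate like the one below is a one-line query in SQL. This is just a sketch, assuming the MySQL connection mydb from earlier is still open (or has been re-opened):

# total adult cases per country and year, largest totals first
totals <- dbGetQuery(mydb, "SELECT country, year, SUM(adult) AS total_adult FROM tb GROUP BY country, year ORDER BY total_adult DESC")
head(totals)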
However, if we were to join another data set to tb, something with more information about these people and where exactly they’re migrating to and from, I would expect Neo4j to be far more useful for seeing what is going on. Something interesting that could come out of that is migration and immigration patterns between countries and across age groups. But since our data is limited to only 6 variables, with none relating one country to another, I would suggest keeping the dataset in MySQL for any analysis.
We have the MySQL data stored as a data frame in R. The next step is to connect to MongoDB.
Load the necessary packages
library(mongolite)
c = mongo(collection = "mySQL_data", db = "MySQLData") # note: 'c' masks base R's c() for the rest of the session
c$insert(mySQL_data)
## List of 5
## $ nInserted : num 3800
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
All 3800 rows from our MySQL tb table have been added!
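As a quick sanity check, the collection count should match the number of inserted rows (count is a standard mongolite method):

c$count() # should report 3800, matching nInserted above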
head(c$aggregate())
## _id country year sex child adult elderly
## 1 5bf203c196db8a2c2b5096d2 Afghanistan 1995 female NA NA NA
## 2 5bf203c196db8a2c2b5096d3 Afghanistan 1995 male NA NA NA
## 3 5bf203c196db8a2c2b5096d4 Afghanistan 1996 female NA NA NA
## 4 5bf203c196db8a2c2b5096d5 Afghanistan 1996 male NA NA NA
## 5 5bf203c196db8a2c2b5096d6 Afghanistan 1997 female 5 96 1
## 6 5bf203c196db8a2c2b5096d7 Afghanistan 1997 male 0 26 0
Here is the head of our collection in MongoDB. Note that an _id field is added every time you insert data into MongoDB, so each id is unique. Let’s take a look inside the terminal.
[Screenshot: MongoDB terminal]
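You can also query the collection directly from R. A small sketch using mongolite’s find method; the filter values below are just examples taken from the head() output above:

# pull the Afghanistan records for 1997, dropping the _id field from the result
c$find(query = '{"country" : "Afghanistan", "year" : 1997}', fields = '{"_id" : 0}')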