For this assignment, I worked in a group with Maliha Arshad and Soumya Nagesh.

Question 1 — Setup MongoDB

# Saving connection url for later use from our cluster
mongo_url <- "mongodb+srv://alissa:valentine@da5020-cluster.4xqgf.mongodb.net/Connecticut?retryWrites=true&w=majority"

Question 2.1 — Open R studio, connect to your mongoDB instance and create a database called Connecticut and a collection called drug_overdose

# Loading mongolite
library(mongolite)
# Connecting to mongo and creating a database and collection
c <- mongo(collection = "drug_overdose", # naming collection drug_overdose
      db = "Connecticut", # naming database Connecticut
      url = mongo_url, # url is the url saved above
      verbose = TRUE)

Question 2.2 — Use the insert function from the mongolite package to insert the data into the collection directly from the URL

# Saving csv file's url
c_url <- "https://data.ct.gov/api/views/rybz-nyjw/rows.csv"
# Reading csv data from the url above
c_data <- read.csv(url(c_url), na.strings = "") # assigning NA to empty values
# Tidying the date characters and making a new month variable for later data visualization
c_data$"Date" <- substr(c_data$"Date", 0, 10)
c_data$"Month" <- substr(c_data$"Date", 0, 2)
# Inserting csv data into mongo collection and database that was opened earlier
#c$insert(c_data) #<- this is commented out so that when it is rerun, it doesn't add a second set of data to the collection

Question 2.3 — Display the number of distinct records in the collection

# Counting number of records in the collection
c$count() # counting each record in the collection
## [1] 5105
length(c$distinct({"ID"})) # counting each distinct ID value in the collection
## [1] 5105
# There appear to be 5105 distinct or unique records, based off the ID number, and the general amount of records in the data set.

Question 2.4 — Calculate the number of deaths in the deathcounty. Sort the results in descending order based on the total and limit your results to the top 5 counties

# Loading dplyr and tidyverse
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ stringr 1.4.0
## ✓ tidyr   1.0.3     ✓ forcats 0.5.0
## ✓ readr   1.3.1
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Saving query for Question 2.4 as Q2.4, query is to count deaths in each county, sort descending, and only present top 5 counties
Q2.4 <- c$aggregate('[
{"$match": {"DeathCounty": {"$exists": true, "$ne": null}}},
{"$group" : {"_id" : "$DeathCounty", "Count" : {"$sum" : 1 }}},
{"$sort" : {"Count" : -1 }},
{"$limit": 5}]') # Saving query to Q2.4, only saving for values that exist in DeathCounty, and saving death count in new column
## 
 Found 5 records...
 Imported 5 records. Simplifying into dataframe...
# Printing Q2.4 to see the data
print(Q2.4)
##          _id Count
## 1   HARTFORD  1233
## 2  NEW HAVEN  1107
## 3  FAIRFIELD   623
## 4 NEW LONDON   368
## 5 LITCHFIELD   237
# Displaying data in a bar chart
Q2.4 %>%
  ggplot(aes(x = `_id`, y = `Count`)) + # County name is x axis and death count is y axis
  geom_col() +
  labs(title = "Deaths by County", x = "County", y = "Deaths")

# Chart caption: Hartford county has the highest amount of deaths, second comes New Haven. The last three of the top five are less than half of the amount in the top two counties, including Fairfield, New London, and then Litchfield.

Question 2.5.a — Select the county with the highest deaths, show the ratio of male to female deaths

# Saving query for Question 2.5 a as Q2.5a, filtering for only Hartford County and grouping by Sex, counting the number of records for each sex
Q2.5a <- c$aggregate('[
{"$match": {"DeathCounty": "HARTFORD"}},
{"$group" : {"_id" : "$Sex", "Count" : {"$sum" : 1 }}}]')
## 
 Found 4 records...
 Imported 4 records. Simplifying into dataframe...
ratio = 918/313 # calculating ratio of males vs females based off values taken from data frame Q2.5a
print(ratio) # printing the ratio
## [1] 2.932907
# Ratio of male to female is 3:1.

Question 2.6 — Analyze the ages (and you can use age groups), is this issue prevalent within certain age groups

# Saving query for Question 2.6 as Q2.6, where we group by age and count the number of records for each age
Q2.6 <- c$aggregate('[{"$group" : {"_id" : "$Age", "Count" : {"$sum" : 1 }}}]')
## 
 Found 66 records...
 Imported 66 records. Simplifying into dataframe...
head(Q2.6)
Q2.6 %>% # Plotting the data frame from the query
  ggplot(aes(x = `_id`, y = `Count`,na.rm=TRUE)) +  # putting age on x axis and death count on y axis
  geom_line() +
  labs(title = "Age Analysis", x = "Age", y = "Deaths")
## Warning: Removed 1 row(s) containing missing values (geom_path).

# Chart caption: As seen in the table and the line graph above; ages 14 - 23 and 70+ have a much lower death count. The 30 highest death counts are for ages between 29 and 57. Therefore, the issue is prevalent in age group 29-57.

Question 2.7 — Calculate the total monthly cases of heroin, cocaine, and fentanyl in 2018 and display the results using a line chart

# Saving the query for Question 2.7 re heroin use as Q2.7.h, filtering for heroin reported in person's body in the year of 2018, grouping by month and counting the records
Q2.7.h <- c$aggregate('[
{"$match": {"Heroin": "Y", "Date" : { "$regex" : ".2018", "$options" : "i" }}},
{"$group": {"_id" : "$Month", "Count" : {"$sum" : 1 }}}
]')
## 
 Found 12 records...
 Imported 12 records. Simplifying into dataframe...
Q2.7.h$`_id`<-as.integer(Q2.7.h$`_id`) # changing month values to an integer

# Saving the query for Question 2.7 re cocaine use as Q2.7.c, filtering for cocaine reported in person's body in the year of 2018, grouping by month and counting the records
Q2.7.c <- c$aggregate('[
{"$match": {"Cocaine": "Y", "Date" : { "$regex" : ".2018", "$options" : "i" }}},
{"$group": {"_id" : "$Month", "Count" : {"$sum" : 1 }}}
]')
## 
 Found 12 records...
 Imported 12 records. Simplifying into dataframe...
Q2.7.c$`_id`<-as.integer(Q2.7.c$`_id`) # changing month values to an integer

# Saving the query for Question 2.7 re fentanyl use as Q2.7.f, filtering for fentanyl reported in person's body in the year of 2018, grouping by month and counting the records
Q2.7.f <- c$aggregate('[
{"$match": {"Fentanyl": "Y", "Date" : { "$regex" : ".2018", "$options" : "i" }}},
{"$group": {"_id" : "$Month", "Count" : {"$sum" : 1 }}}
]')
## 
 Found 12 records...
 Imported 12 records. Simplifying into dataframe...
Q2.7.f$`_id`<-as.integer(Q2.7.f$`_id`) # changing month values to an integer
 
ggplot()+ # plotting the data from the 3 queries
  geom_line(Q2.7.h,mapping= aes(x=`_id`,y=Count ,colour="red"))+ # plotting heroin data
  geom_line(Q2.7.c,mapping= aes(x=`_id`,y=Count,colour="blue"))+ # plotting cocaine data
  geom_line(Q2.7.f,mapping= aes(x=`_id`,y=Count,colour="green"))+ # plotting fentanyl data
  labs(title = "Monthly Deaths by Heroin, Cocaine and Fentanyl", x = "Months for 2018", y = "Count") +
  scale_color_discrete(name = "Drugs", labels = c("Heroin", "Fentanyl","Cocaine")) + # labeling the scale
  scale_x_discrete(limits = c(1:12)) # editing the x axis values

# Chart caption: As seen in the line graph above, fentanyl related deaths are more than double compared to cocaine and heroin. January has the the least number of deaths related to all three drugs; highest number of fentanyl, heroin and cocaine related deaths occured in June of 2018. Moreover, the highest number of deaths related to cocaine use were observed in the month of March, June and July.

Question 3 — Build one additional query (of your choice) to retrieve data from MongoDB into a data frame. Prepare supporting visualizations and explain your analysis

# Saving query for Question 3 as Q3
# For this question, we decided to investigate the relationship between sex and race in the opioid overdoses in Hartford County during 2018
Q3 <- c$aggregate('[
{"$match": {"DeathCounty": "HARTFORD", "AnyOpioid": "Y", "Date" : { "$regex" : ".2018", "$options" : "i" }}},
{"$group" : {"_id" : {"Race":"$Race", "Sex":"$Sex"}, "Count" : {"$sum" : 1 }}},
{"$sort" : {"Count" : -1 }}
]') # Here we filter data from Hartford, reported that the person took any opioid, and died in 2018, we also group the data by Race AND Sex, counting the number of records within each group, and sort it in descending order
## 
 Found 14 records...
 Imported 14 records. Simplifying into dataframe...
Q3 <- do.call(data.frame, Q3) # removing the data frame inside the data frame
head(Q3)
colnames(Q3)[2] <- "Sex" # renaming a column as "Sex" to make things easier later

Q3 %>% # plotting data from query
  ggplot(mapping = aes(x = X_id.Race, y = Count, fill = Sex)) + # race as x axis value, death count as y axis, and fill color of bars is based on sex
  geom_col(na.rm = TRUE) + # removing NA value
  labs(title = "2018 Deaths by Opioid in Hartford County", x = "Race and/or Ethnicity", y = "Death Count") +
  coord_flip() # Flipping chart to make values more clear to see

# Chart caption: As seen in the graph above, opiod related accidental deaths are most prevalent in men across all races. The top three races with highest number of deaths are White, Hispanic-White and Black. The highest number of deaths among females is found within the race White.