Project 3a

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

drugs <- read.csv("drug poisoning mortality by state NCHS.csv")

Question 1

A metadata listing for each variable. In other words, list the names of the variables, types, and brief descriptions of values if needed.

State - categorical variable
Year - categorical variable
Sex - categorical variable. Each state has a single category; only the United States rows are divided into different categories. The same holds for all the categorical variables of this data set.
Age Group - categorical variable
Race - categorical variable

Deaths - quantitative variable, number of deaths from drug poisoning
Population - quantitative variable, entire population
Crude death rate - quantitative variable, death rate relative to the population
Standard error for crude death rate - quantitative variable, the standard error of the crude death rate
Lower confidence limit for crude death rate - quantitative variable, lower bound of the confidence interval for the crude death rate
Upper confidence limit for crude death rate - quantitative variable, upper bound of the confidence interval for the crude death rate

Age-adjusted rate - quantitative variable, a summary rate adjusted for differences in age distribution, so that populations with quite different age profiles can be compared
Standard error for age-adjusted rate - quantitative variable
Lower confidence limit for age-adjusted rate - quantitative variable, lower bound of the confidence interval for the age-adjusted rate
Upper confidence limit for age-adjusted rate - quantitative variable, upper bound of the confidence interval for the age-adjusted rate
State crude rate - categorical variable, the range that the state's crude rate falls in
US crude rate - quantitative variable, crude rate of the entire US
US age-adjusted rate - quantitative variable
Unit - categorical variable, units per 100,000 population

Question 2

Summary statistics: mean, median, mode, min, max, count and standard deviation for 2-3 key quantitative variables.

2 key quantitative variables - deaths, crude death rate

drugs_q <- drugs[, c("Deaths", "Crude.Death.Rate")]
summary(drugs_q)

##      Deaths        Crude.Death.Rate 
##  Min.   :    1.0   Min.   : 0.0389  
##  1st Qu.:  126.2   1st Qu.: 3.9705  
##  Median :  491.0   Median : 9.0742  
##  Mean   : 1966.6   Mean   :10.6588  
##  3rd Qu.: 1526.8   3rd Qu.:14.8954  
##  Max.   :63632.0   Max.   :68.3122

sd(drugs_q$Deaths)

## [1] 4806.402

sd(drugs_q$Crude.Death.Rate)

## [1] 8.576254
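
The question also asks for the mode and the count, which summary() and sd() do not report. Below is a minimal sketch of one way to get them. Base R has no built-in statistical mode, so `stat_mode` is a hypothetical helper, and the toy vector only stands in for a real column such as drugs_q$Deaths.

```r
# Hypothetical helper, not part of base R: returns the most frequent
# value(s) of a vector, ignoring NAs. Ties return every tied value.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  freq <- tabulate(match(x, ux))
  ux[freq == max(freq)]
}

# Toy data standing in for a real column such as drugs_q$Deaths.
toy <- c(4, 7, 7, 10, 11, 11, 11, NA)

toy_mode  <- stat_mode(toy)      # most frequent value
toy_count <- sum(!is.na(toy))    # number of non-missing observations

print(toy_mode)
print(toy_count)
```

With the data loaded, the same calls would be stat_mode(drugs_q$Deaths) and sum(!is.na(drugs_q$Deaths)), and likewise for Crude.Death.Rate.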

Question 3

A frequency distribution and relative frequency distribution for a key categorical variable

table(drugs$Sex)

## 
## Both Sexes     Female       Male 
##       1566        648        648
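
The table above is the frequency distribution only. The relative frequency distribution can be sketched as follows. The counts are copied from the output above so the example runs without the CSV file; with the data loaded, prop.table(table(drugs$Sex)) should give the same proportions.

```r
# Counts copied from the table(drugs$Sex) output above.
sex_counts <- c("Both Sexes" = 1566, "Female" = 648, "Male" = 648)

# prop.table() divides each count by the total, giving proportions
# that sum to 1.
rel_freq <- prop.table(sex_counts)

print(round(rel_freq, 3))
```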

Question 4

A contingency table for two categorical variables.

table(drugs$Age.Group, drugs$Year)

##           
##            1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
##   0–14       12   12   12   12   12   12   12   12   12   12   12   12   12
##   15–24      12   12   12   12   12   12   12   12   12   12   12   12   12
##   25–34      12   12   12   12   12   12   12   12   12   12   12   12   12
##   35–44      12   12   12   12   12   12   12   12   12   12   12   12   12
##   45–54      12   12   12   12   12   12   12   12   12   12   12   12   12
##   55–64      12   12   12   12   12   12   12   12   12   12   12   12   12
##   65–74      12   12   12   12   12   12   12   12   12   12   12   12   12
##   75+        12   12   12   12   12   12   12   12   12   12   12   12   12
##   All Ages   63   63   63   63   63   63   63   63   63   63   63   63   63
##           
##            2012 2013 2014 2015 2016
##   0–14       12   12   12   12   12
##   15–24      12   12   12   12   12
##   25–34      12   12   12   12   12
##   35–44      12   12   12   12   12
##   45–54      12   12   12   12   12
##   55–64      12   12   12   12   12
##   65–74      12   12   12   12   12
##   75+        12   12   12   12   12
##   All Ages   63   63   63   63   63

Question 5

1 bar graph and 1 pie chart for two of your categorical variables. (A bar graph for one variable and a pie chart for the second variable.) Label your plots!

bar graph

library(ggplot2)
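# geom_bar() counts rows per category, so the bars show the number of
# records for each group, not mortality rates.
ggplot(drugs, aes(x = Race.and.Hispanic.Origin)) +
  geom_bar(col = "purple", fill = "pink") +
  theme_minimal() +
  labs(x = "Race and Hispanic origin",
       y = "Number of records",
       title = "Number of Records by Race and Hispanic Origin")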

pie chart

drugs_pie <- table(drugs$Sex)
lbl <- c("Both Sexes", "Female", "Male")
drugs_pie

## 
## Both Sexes     Female       Male 
##       1566        648        648

pie(drugs_pie, labels = lbl, main = "Drug Mortality Records by Sex", col = c("pink", "purple", "red"))

Question 6

2 histograms and 2 boxplots for two quantitative variables. (Both a histogram and boxplot for each variable.) Label your plots!

drugs %>%
  group_by(Deaths) %>%
  summarise(n = n())

## # A tibble: 1,602 × 2
##    Deaths     n
##     <int> <int>
##  1      1     1
##  2      3     2
##  3      4     7
##  4      5     7
##  5      6     8
##  6      7     9
##  7      8    11
##  8      9    10
##  9     10    10
## 10     11    14
## # … with 1,592 more rows

Histograms

# Note: the breaks cover only 20 to 60 deaths, so rows outside that range are not drawn.
ggplot(drugs, aes(x = Deaths)) +
        geom_histogram(breaks = seq(20, 60, by = 2),
                       col = "gray",
                       aes(fill = after_stat(count))) +
        labs(x = "Deaths", title = "Deaths by Drugs", y = "Frequency of deaths") +
        scale_fill_gradient("Count", low = "light blue", high = "dark blue")

ggplot(drugs, aes(x = Age.adjusted.Rate)) +
        geom_histogram(breaks = seq(10, 50, by = 2),
                       col = "gray",
                       aes(fill = after_stat(count))) +
        labs(x = "Age Adjusted rate", title = "Age Adjusted Rates of Drug mortality", y = "Frequency") +
        scale_fill_gradient("Count", low = "light blue", high = "dark blue")

## Warning: Removed 1728 rows containing non-finite values (`stat_bin()`).

boxplots

# A single horizontal boxplot has no meaningful y axis, so no y label is set.
ggplot(drugs, aes(x = Deaths)) +
        geom_boxplot(fill = "purple") +
        labs(x = "Deaths", title = "Deaths by Drugs", y = NULL)

# A single horizontal boxplot has no meaningful y axis, so no y label is set.
ggplot(drugs, aes(x = Age.adjusted.Rate)) +
        geom_boxplot(fill = "purple") +
        labs(x = "Age Adjusted rate", title = "Age Adjusted Rates of Drug mortality", y = NULL)

## Warning: Removed 1728 rows containing non-finite values (`stat_boxplot()`).

Summary:

This data set provides information on drug poisoning mortality by state (and the District of Columbia), and on death rates by race and ethnicity. Something unusual is that as I coded the contingency tables, I noticed that the counts for the categories of each categorical variable are exactly the same, apart from the aggregate categories (Both Sexes, All Ages). As you can see in the pie chart for sex, Female and Male each have 648 rows. This is probably due to how the data were tabulated: each row appears to be an aggregate for one combination of categories and year, not an individually sampled death, so every combination appears the same number of times.

In the histogram of deaths due to drug poisoning, there is a very high frequency at 20 to 30 deaths. We have to remember that Deaths is a raw count; only the rate variables are in units per 100,000 population. The histogram also covers only 20 to 60 deaths, while the summary statistics show that Deaths ranges from 1 to 63,632, so it shows only part of the distribution. I would say a boxplot may not be a very good visualization for this particular variable, as there are too many outliers to read individually. Next, the histogram of age-adjusted rates is highest between 10 and 15 per 100,000 and then falls off at higher rates, a very clear right skew in the distribution. Similarly, if you look at the boxplot, there are many outliers. These outliers are more likely a result of the skew than of this being a very big data set: for Deaths, the mean (1966.6) is about four times the median (491.0).
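
The number of outliers in the boxplot of Deaths follows from how a boxplot draws its whiskers. As a rough check, the sketch below applies the usual 1.5 × IQR rule (the default in geom_boxplot) to the rounded quartiles of Deaths from the summary() output for Question 2. The variable names are illustrative, and the results are approximate because the quartiles are rounded.

```r
# Rounded quartiles of Deaths, copied from the summary() output.
q1 <- 126.2
q3 <- 1526.8

iqr <- q3 - q1                    # interquartile range
upper_fence <- q3 + 1.5 * iqr     # points above this are drawn as outliers
lower_fence <- q1 - 1.5 * iqr     # points below this are drawn as outliers

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))
```

The upper fence comes out at about 3,628 deaths, far below the maximum of 63,632, so every row between those values is drawn as an individual outlier point. The lower fence is negative, so no row can be a low outlier.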