Introduction

For this project, I am using a data set that compiles data about suicide rates from 1985 to 2016. It pulls data from the United Nations Human Development Report, the World Bank DataBank, the World Health Organization (WHO) Suicide Statistics, and the World Health Organization National suicide prevention strategies report.

I want to look at how suicide rates in the United States affected different generations over time. I am also interested in comparing the populations of the United States and the United Kingdom in this data set.

A description of generations is listed below by the Center for Generational Kinetics.

Generation Z: Born 1996 – TBD
Millennials: Born 1977 – 1995
Generation X: Born 1965 – 1976
Baby Boomers: Born 1946 – 1964
Silent: Born 1927 – 1945
G.I. Generation: Born 1901 – 1926

1. Formulating the Question

I have two questions that I would like to address in this project:

  1. How did the number of suicides in the United States change over time across different generations?

  2. Is the difference between the mean population of the United States and the United Kingdom significant? Null hypothesis (H0): μ_US = μ_UK (the two means are equal). Alternative hypothesis (Ha): μ_US ≠ μ_UK.

2. Reading the Data

First, I imported the data and stored it in the “suicide” variable so that it is easier to call. I set the argument stringsAsFactors = FALSE so that strings are imported as character vectors rather than factors.

suicide <- read.csv("master.csv", stringsAsFactors = FALSE)

I will be using functions from the dplyr, tidyr, ggplot2, and gganimate libraries. They can be installed with install.packages() if they are not already on the device.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(gganimate)

3. Checking the Packaging

I assume the imported data will be a data frame, as it contains data of different data types. I can confirm this by utilizing the class() function. I will also check the number of rows and columns in the data frame. I can do this with one function by viewing the dimensions of the data set.

#checking characteristics of the dataset
class(suicide)
## [1] "data.frame"
#checking the number of rows and columns
nrow(suicide)
## [1] 27820
ncol(suicide)
## [1] 12
#this can also be done all at once with dimension
dim(suicide)
## [1] 27820    12

4. Observing the Structure

I want to check the structure of the data frame to see the names of the columns. I can also see the data type contained in each column.

#checking the structure
str(suicide)
## 'data.frame':    27820 obs. of  12 variables:
##  $ ï..country        : chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ year              : int  1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
##  $ sex               : chr  "male" "male" "female" "male" ...
##  $ age               : chr  "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
##  $ suicides_no       : int  21 16 14 1 9 1 6 4 1 0 ...
##  $ population        : int  312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
##  $ suicides.100k.pop : num  6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
##  $ country.year      : chr  "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
##  $ HDI.for.year      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gdp_for_year....  : chr  "2,156,624,900" "2,156,624,900" "2,156,624,900" "2,156,624,900" ...
##  $ gdp_per_capita....: int  796 796 796 796 796 796 796 796 796 796 ...
##  $ generation        : chr  "Generation X" "Silent" "Generation X" "G.I. Generation" ...
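
The first column name, ï..country, is the usual sign of a UTF-8 byte order mark (BOM) at the start of the CSV file being read as part of the header. The sketch below is only an illustration under that assumption (the encoding of master.csv is not stated here): it writes a small made-up file with a BOM and reads it back with the fileEncoding argument of read.csv().

```r
# Write a tiny CSV that starts with a UTF-8 byte order mark (EF BB BF).
# The file contents are made up for illustration.
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("country,year\nAlbania,1987\n")), tmp)

# "UTF-8-BOM" tells R to skip the mark while decoding the file
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
names(demo)
```

If the real file were read this way, the first column would already be called country, so any later step that refers to ï..country would have to be adjusted accordingly.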

5. Checking the Head and Tail

I am interested in checking the head and tail of the data. I set the second argument in these functions to 10 in order to see the first 10 and the last 10 observations of the data frame.

#head of data
head(suicide, 10)
##    ï..country year    sex         age suicides_no population
## 1     Albania 1987   male 15-24 years          21     312900
## 2     Albania 1987   male 35-54 years          16     308000
## 3     Albania 1987 female 15-24 years          14     289700
## 4     Albania 1987   male   75+ years           1      21800
## 5     Albania 1987   male 25-34 years           9     274300
## 6     Albania 1987 female   75+ years           1      35600
## 7     Albania 1987 female 35-54 years           6     278800
## 8     Albania 1987 female 25-34 years           4     257200
## 9     Albania 1987   male 55-74 years           1     137500
## 10    Albania 1987 female  5-14 years           0     311000
##    suicides.100k.pop country.year HDI.for.year gdp_for_year....
## 1               6.71  Albania1987           NA    2,156,624,900
## 2               5.19  Albania1987           NA    2,156,624,900
## 3               4.83  Albania1987           NA    2,156,624,900
## 4               4.59  Albania1987           NA    2,156,624,900
## 5               3.28  Albania1987           NA    2,156,624,900
## 6               2.81  Albania1987           NA    2,156,624,900
## 7               2.15  Albania1987           NA    2,156,624,900
## 8               1.56  Albania1987           NA    2,156,624,900
## 9               0.73  Albania1987           NA    2,156,624,900
## 10              0.00  Albania1987           NA    2,156,624,900
##    gdp_per_capita....      generation
## 1                 796    Generation X
## 2                 796          Silent
## 3                 796    Generation X
## 4                 796 G.I. Generation
## 5                 796         Boomers
## 6                 796 G.I. Generation
## 7                 796          Silent
## 8                 796         Boomers
## 9                 796 G.I. Generation
## 10                796    Generation X
#tail of data
tail(suicide, 10)
##       ï..country year    sex         age suicides_no population
## 27811 Uzbekistan 2014 female 15-24 years         347    2992817
## 27812 Uzbekistan 2014   male 55-74 years         144    1271111
## 27813 Uzbekistan 2014   male 15-24 years         347    3126905
## 27814 Uzbekistan 2014   male   75+ years          17     224995
## 27815 Uzbekistan 2014 female 25-34 years         162    2735238
## 27816 Uzbekistan 2014 female 35-54 years         107    3620833
## 27817 Uzbekistan 2014 female   75+ years           9     348465
## 27818 Uzbekistan 2014   male  5-14 years          60    2762158
## 27819 Uzbekistan 2014 female  5-14 years          44    2631600
## 27820 Uzbekistan 2014 female 55-74 years          21    1438935
##       suicides.100k.pop   country.year HDI.for.year gdp_for_year....
## 27811             11.59 Uzbekistan2014        0.675   63,067,077,179
## 27812             11.33 Uzbekistan2014        0.675   63,067,077,179
## 27813             11.10 Uzbekistan2014        0.675   63,067,077,179
## 27814              7.56 Uzbekistan2014        0.675   63,067,077,179
## 27815              5.92 Uzbekistan2014        0.675   63,067,077,179
## 27816              2.96 Uzbekistan2014        0.675   63,067,077,179
## 27817              2.58 Uzbekistan2014        0.675   63,067,077,179
## 27818              2.17 Uzbekistan2014        0.675   63,067,077,179
## 27819              1.67 Uzbekistan2014        0.675   63,067,077,179
## 27820              1.46 Uzbekistan2014        0.675   63,067,077,179
##       gdp_per_capita....   generation
## 27811               2309   Millenials
## 27812               2309      Boomers
## 27813               2309   Millenials
## 27814               2309       Silent
## 27815               2309   Millenials
## 27816               2309 Generation X
## 27817               2309       Silent
## 27818               2309 Generation Z
## 27819               2309 Generation Z
## 27820               2309      Boomers

Selecting Specific Columns

I decided to select the columns that I believe will be most useful to my analysis. There are many missing values in HDI.for.year, the variable country.year seems redundant, and the variable gdp_for_year seems less informative than GDP per capita. Thus, I have decided to drop these three columns. I have also renamed the columns ï..country and gdp_per_capita.... to country and gdp_per_cap, respectively, so that they are easier to call. Finally, I filtered by the two countries I am interested in, the United States and the United Kingdom. This extraction improves the usability and readability of the data frame.

#creating the sorted_sui extraction
#drop the sparse or redundant columns and rename the awkward ones once
sui_clean <- suicide %>%
  select(-HDI.for.year, -country.year, -gdp_for_year....) %>%
  rename(country = ï..country, gdp_per_cap = gdp_per_capita....)
#%in% keeps every row for either country; == would recycle the vector
#down the rows and match only every other row
sorted_sui <- sui_clean %>% filter(country %in% c("United States", "United Kingdom"))
sorted_sui_US <- sui_clean %>% filter(country == "United States")
sorted_sui_UK <- sui_clean %>% filter(country == "United Kingdom")

#checking the sorted_sui extraction
str(sorted_sui)
## 'data.frame':    372 obs. of  9 variables:
##  $ country          : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
##  $ year             : int  1985 1985 1985 1985 1985 1985 1986 1986 1986 1986 ...
##  $ sex              : chr  "male" "male" "female" "female" ...
##  $ age              : chr  "55-74 years" "25-34 years" "75+ years" "35-54 years" ...
##  $ suicides_no      : int  915 620 237 505 86 2 892 616 223 424 ...
##  $ population       : int  5170113 3969689 2418692 6879295 4549121 3542919 5159651 4039982 2454593 6933075 ...
##  $ suicides.100k.pop: num  17.7 15.62 9.8 7.34 1.89 ...
##  $ gdp_per_cap      : int  9231 9231 9231 9231 9231 9231 11323 11323 11323 11323 ...
##  $ generation       : chr  "G.I. Generation" "Boomers" "G.I. Generation" "Silent" ...
head(sorted_sui)
##          country year    sex         age suicides_no population
## 1 United Kingdom 1985   male 55-74 years         915    5170113
## 2 United Kingdom 1985   male 25-34 years         620    3969689
## 3 United Kingdom 1985 female   75+ years         237    2418692
## 4 United Kingdom 1985 female 35-54 years         505    6879295
## 5 United Kingdom 1985 female 15-24 years          86    4549121
## 6 United Kingdom 1985 female  5-14 years           2    3542919
##   suicides.100k.pop gdp_per_cap      generation
## 1             17.70        9231 G.I. Generation
## 2             15.62        9231         Boomers
## 3              9.80        9231 G.I. Generation
## 4              7.34        9231          Silent
## 5              1.89        9231    Generation X
## 6              0.06        9231    Generation X
tail(sorted_sui)
##           country year    sex         age suicides_no population
## 367 United States 2015   male   75+ years        3171    8171136
## 368 United States 2015   male 35-54 years       11634   41658010
## 369 United States 2015   male 15-24 years        4359   22615073
## 370 United States 2015 female 55-74 years        2872   35115610
## 371 United States 2015 female 15-24 years        1132   21633813
## 372 United States 2015   male  5-14 years         255   21273987
##     suicides.100k.pop gdp_per_cap   generation
## 367             38.81       60387       Silent
## 368             27.93       60387 Generation X
## 369             19.27       60387   Millenials
## 370              8.18       60387      Boomers
## 371              5.23       60387   Millenials
## 372              1.20       60387 Generation Z
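
One caveat on filtering by two countries at once: comparing a column to a two-element vector with == recycles that vector down the rows, so a row only matches when it happens to line up with the right element. That pattern returns half of the expected rows (six of the twelve sex and age groups per year, as in the listing above), whereas %in% keeps every row for either country. A minimal sketch on a made-up toy data frame:

```r
# Toy data: four rows, two per country (values are made up)
toy <- data.frame(country = c("United Kingdom", "United Kingdom",
                              "United States", "United States"),
                  n = 1:4)
wanted <- c("United States", "United Kingdom")

# == recycles `wanted` down the rows: row 1 vs "United States",
# row 2 vs "United Kingdom", row 3 vs "United States", and so on
toy[toy$country == wanted, ]

# %in% asks, for each row, whether its country appears in `wanted`
toy[toy$country %in% wanted, ]
```

The first subset keeps two rows and the second keeps all four. The single-country extractions are not affected, because a length-one value is compared against every row.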

6. Checking the Count

How many years are represented in this data set?

#number of unique years
select(sorted_sui, year) %>% unique %>% nrow 
## [1] 31

Which years in order?

#all years represented in data set
unique(sorted_sui$year)
##  [1] 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
## [15] 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## [29] 2013 2014 2015

7. Validating the Data

I would like to validate my data with an external source. The Centers for Disease Control and Prevention (CDC) has a tool that presents suicide statistics for the United States. I will compare the years available from that tool (2004 to 2014) with the figures in my data set, which come from the World Health Organization (WHO).

Suicides by year in the US, according to CDC data:

#checking the suicides by year in the US, according to CDC
cdc <- read.csv("results.csv", stringsAsFactors = FALSE)
ranking_0414_CDC <- cdc %>%
  filter(Year %in% 2004:2014) %>%
  select(Year, Deaths, Cause.of.Death) %>%
  arrange(Year)
ranking_0414_CDC
##    Year Deaths Cause.of.Death
## 1  2004  32439 Suicide Injury
## 2  2005  32637 Suicide Injury
## 3  2006  33300 Suicide Injury
## 4  2007  34598 Suicide Injury
## 5  2008  36035 Suicide Injury
## 6  2009  36909 Suicide Injury
## 7  2010  38364 Suicide Injury
## 8  2011  39518 Suicide Injury
## 9  2012  40600 Suicide Injury
## 10 2013  41149 Suicide Injury
## 11 2014  42826 Suicide Injury

Suicides by year in the US, according to WHO data:

#checking the suicides by year in the US, according to WHO
ranking_0414_WHO <- suicide %>%
  rename(country = ï..country) %>%
  filter(year %in% 2004:2014, country == "United States") %>%
  group_by(year) %>%
  summarize(total_suicides = sum(suicides_no)) %>%
  as.data.frame %>%
  arrange(year)
ranking_0414_WHO
##    year total_suicides
## 1  2004          32428
## 2  2005          32629
## 3  2006          33292
## 4  2007          34596
## 5  2008          36030
## 6  2009          36900
## 7  2010          38362
## 8  2011          39508
## 9  2012          40596
## 10 2013          41143
## 11 2014          42769

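The tables look quite similar! The WHO totals are slightly lower than the CDC totals in every year, but the gap is always well under 1%. This is what we wanted to see, and it supports the validity of our data.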
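
The agreement between the two sources can also be quantified rather than judged by eye. The sketch below retypes the yearly totals from the two tables above into plain vectors (the vector and column names are my own) and computes the gap for each year.

```r
# Yearly totals copied from the two tables above
year <- 2004:2014
cdc_deaths <- c(32439, 32637, 33300, 34598, 36035, 36909,
                38364, 39518, 40600, 41149, 42826)
who_deaths <- c(32428, 32629, 33292, 34596, 36030, 36900,
                38362, 39508, 40596, 41143, 42769)

# Absolute and relative gap between the sources, year by year
gap <- cdc_deaths - who_deaths
comparison <- data.frame(year, gap,
                         pct_of_cdc = round(100 * gap / cdc_deaths, 3))
comparison

# Largest gap and the year in which it occurs
max(gap)
year[which.max(gap)]
```

The largest gap is 57 deaths, in 2014, which is roughly 0.13% of the CDC total for that year.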

8. Try the easy solution first

Scatterplot Visualization: Number of Suicides Over Time for the US

To address the first question, we can start by comparing the distribution of the number of suicides for each generation with side-by-side boxplots.

US_boxplot<- ggplot(data = sorted_sui_US) + 
  geom_boxplot(mapping = aes(x = factor(generation), y = suicides_no, 
                             fill = factor(generation))) +
  labs(title = "Number of Suicides by Generation Boxplots", 
       x = "Generation", y = "Number of Suicides")
US_boxplot

Lineplot visualization

For change over time, a scatterplot with a smoothed trend line for each group can be helpful. I will add a color aesthetic to account for the third, categorical variable of generation.

ggplot(data = sorted_sui_US) + 
  geom_point(mapping = aes(x = year, y = suicides_no, color=generation)) +
  geom_smooth(mapping = aes(x = year, y = suicides_no, color=generation)) +
  scale_x_continuous(breaks = seq(1985, 2015, by = 5)) +
  labs(title = "Number of Suicides in the US vs Year by Generation", 
       x = "Year", y = "Number of Suicides")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This graph is interesting! From it, we can see that the G.I. Generation is no longer represented after about the year 2000, and that Generation Z is not represented until about the year 2007. This makes sense, as they are the oldest and youngest generations, respectively.
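
The first and last year in which each generation appears can also be computed instead of being read off the plot. The sketch below uses a small made-up data frame as a stand-in for sorted_sui_US (the years in it are illustrative, not taken from the data set) and base R's tapply().

```r
# Toy stand-in for sorted_sui_US (years are made up for illustration)
toy <- data.frame(
  year = c(1985, 1990, 2000, 1995, 2005, 2015, 2007, 2015),
  generation = c("G.I. Generation", "G.I. Generation", "G.I. Generation",
                 "Millenials", "Millenials", "Millenials",
                 "Generation Z", "Generation Z"))

# First and last year in which each generation appears
first_year <- tapply(toy$year, toy$generation, min)
last_year  <- tapply(toy$year, toy$generation, max)
cbind(first_year, last_year)
```

On the real data, the same two calls would take sorted_sui_US$year and sorted_sui_US$generation as their first two arguments.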

Histogram Visualization: Population

In order to address the second question, we need to examine whether or not the population is approximately normal.

#creating and printing the histogram for the US population
hist_pop_US<- ggplot(sorted_sui_US, aes(population)) + 
  geom_histogram(aes(y=..density..), color="red", fill="white") +
  geom_density(fill="green",  alpha=.5)

hist_pop_US
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#creating and printing the histogram for the UK population
hist_pop_UK <- ggplot(sorted_sui_UK, aes(population)) + 
  geom_histogram(aes(y=..density..), color="red", fill="white") +
  geom_density(fill="green",  alpha=.5)

hist_pop_UK
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The data appear to be roughly symmetric. If they followed a normal distribution, then I would expect the mean and median to be approximately equal.

#US summary stats
summary(sorted_sui_US$population)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  4064000 18185450 20375469 21650611 22616944 43805214
#UK summary stats
summary(sorted_sui_UK$population)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 1202838 3620903 4121212 4674107 5810017 8881944

Mean for US: 21650611
Median for US: 20375469

Mean for UK: 4674107
Median for UK: 4121212

They are reasonably close, although in both countries the mean is somewhat larger than the median, which hints at a mild right skew.
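
To put a number on how close the mean and median are, the gap can be expressed as a percentage of the median. This is only a quick sketch: the values are copied from the summary() output above, and rel_gap is a helper name of my own.

```r
# Summary values copied from the output above
us <- c(mean = 21650611, median = 20375469)
uk <- c(mean = 4674107, median = 4121212)

# Gap between mean and median as a percentage of the median
rel_gap <- function(s) 100 * (s[["mean"]] - s[["median"]]) / s[["median"]]
round(c(US = rel_gap(us), UK = rel_gap(uk)), 1)
```

The mean sits about 6% above the median for the US and about 13% above it for the UK.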

Boxplot Visualization: Population

We can quickly visualize this summary information with a boxplot.

#creating and printing the boxplot for US
boxplot(sorted_sui_US$population, col = "blue")

#creating and printing the boxplot for UK
boxplot(sorted_sui_UK$population, col = "red")

Performing the T-test

Now, to see whether the difference between the mean populations is significant, we can run a t-test.

#performing the t-test on the difference of means
t.test(sorted_sui_US$population, sorted_sui_UK$population)
## 
##  Welch Two Sample t-test
## 
## data:  sorted_sui_US$population and sorted_sui_UK$population
## t = 33.972, df = 401.04, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  15994103 17958904
## sample estimates:
## mean of x mean of y 
##  21650611   4674107

In the data set, the mean population of the United States is 21,650,611 and the mean population of the UK is 4,674,107. Each observation is the population of one sex and age group in one year, so these are means of subgroup populations rather than national totals. The t-statistic is 33.972. The 95% confidence interval for the difference in mean population runs from 15,994,103 people to 17,958,904 people. The p-value is less than 2.2*10^(-16), which is an incredibly small number. This indicates that the result is highly significant. Thus, at the 5% significance level, we can reject the null hypothesis that the two means are equal in favor of the alternative hypothesis.
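
For readers curious where the t-statistic and the fractional degrees of freedom in the output come from, the sketch below recomputes both by hand for the Welch version of the test, which is the default for t.test(). It uses simulated data rather than the suicide data set, so the numbers will not match the output above.

```r
# Simulated samples with different means and spreads (made-up data)
set.seed(1)
x <- rnorm(50, mean = 20, sd = 6)
y <- rnorm(40, mean = 5, sd = 2)

# Estimated variance of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# t.test() uses the Welch version by default (var.equal = FALSE)
res <- t.test(x, y)
c(manual_t = t_manual, builtin_t = unname(res$statistic))
c(manual_df = df_manual, builtin_df = unname(res$parameter))
```

Because the Welch test does not assume equal variances, the degrees of freedom are generally not a whole number, which is why the output above reports df = 401.04.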

9. Challenge your solution

For the first question, we can create line plots for the United States (and the United Kingdom) that show number of suicides per year for each generation.

g1 <- ggplot(sorted_sui_US, aes(x = year, y = suicides_no, colour = generation)) +
  geom_line() +
  stat_summary(aes(group = 1), geom = "line", fun.y = mean, size = 1, col = "black") +
  scale_x_continuous(breaks = seq(1985, 2015, by = 15)) +
  facet_grid(. ~ generation)
g2 <- ggplot(sorted_sui_UK, aes(x = year, y = suicides_no, colour = generation)) +
  geom_line() +
  stat_summary(aes(group = 1), geom = "line", fun.y = mean, size = 1, col = "black") +
  scale_x_continuous(breaks = seq(1985, 2015, by = 15)) +
  facet_grid(. ~ generation)

g1

g2

For the second question, we can create similar line plots for the United States (and the United Kingdom) that show the population per year for each generation.

g3 <- ggplot(sorted_sui_US, aes(x = year, y = population, colour = generation)) +
  geom_line() +
  stat_summary(aes(group = 1), geom = "line", fun.y = mean, size = 1, col = "black") +
  scale_x_continuous(breaks = seq(1985, 2015, by = 15)) +
  facet_grid(. ~ generation)
g4 <- ggplot(sorted_sui_UK, aes(x = year, y = population, colour = generation)) +
  geom_line() +
  stat_summary(aes(group = 1), geom = "line", fun.y = mean, size = 1, col = "black") +
  scale_x_continuous(breaks = seq(1985, 2015, by = 15)) +
  facet_grid(. ~ generation)

g3

g4

10. Follow up

We can use the line graphs produced in item 9 to see that, for each country, the number of suicides in a generation appears to track the population of that generation in the data set. This makes sense, since a larger group will tend to record more suicides. Further, we can see that the overall patterns in the number of suicides rose and fell around the same times for the United States and the United Kingdom. We also see that the overall patterns of population rose and fell around the same times for the United States and the United Kingdom. Further exploration of what may have caused these trends can be made!