For this project, I am using a data set that compiles data about suicide rates from 1985 to 2016. It pulls data from the United Nations Human Development Report, the World Bank DataBank, World Health Organization (WHO) Suicide Statistics, and the World Health Organization National suicide prevention strategies report.
I want to look at how suicide rates in the United States affected different generations over time. I am also interested in comparing the populations of the United States and the United Kingdom in this data set.
The Center for Generational Kinetics describes the generations as follows.
Generation Z: Born 1996 – TBD
Millennials: Born 1977 – 1995
Generation X: Born 1965 – 1976
Baby Boomers: Born 1946 – 1964
Silent: Born 1927 – 1945
G.I. Generation: Born 1901 – 1926
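These birth-year ranges can be expressed as a small lookup. The sketch below is illustrative only: `generation_of()` is a hypothetical helper that is not part of the data set, and the data set itself uses slightly different labels (for example "Boomers" and "Millenials").

```r
# Hypothetical helper: map a birth year to the generation labels listed above.
generation_of <- function(birth_year) {
  as.character(cut(
    birth_year,
    # cut() uses right-closed intervals by default, e.g. (1900, 1926]
    breaks = c(1900, 1926, 1945, 1964, 1976, 1995, Inf),
    labels = c("G.I. Generation", "Silent", "Baby Boomers",
               "Generation X", "Millennials", "Generation Z")
  ))
}

generation_of(c(1920, 1950, 1970, 1990, 2000))
```

Birth years before 1901 fall outside every interval and come back as NA.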
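I have two questions that I would like to address in this project:
1. How did the number of suicides in the United States change over time across different generations?
2. Is there a significant difference between the mean populations of the United States and the United Kingdom in this data set?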
First, I imported data and stored it in the “suicide” variable, so it is easier to call. I set the argument stringsAsFactors = FALSE, so that it would import strings as characters.
suicide <-read.csv("master.csv", stringsAsFactors = FALSE)
I will be using functions from the dplyr, tidyr, ggplot2, and gganimate libraries. They can be installed with install.packages() if they are not already on the device.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(gganimate)
I assume the imported data will be a data frame, as it contains data of different data types. I can confirm this by utilizing the class() function. I will also check the number of rows and columns in the data frame. I can do this with one function by viewing the dimensions of the data set.
#checking characteristics of the dataset
class(suicide)
## [1] "data.frame"
#checking the number of rows and columns
nrow(suicide)
## [1] 27820
ncol(suicide)
## [1] 12
#this can also be done all at once with dimension
dim(suicide)
## [1] 27820 12
I want to check the structure of the data frame to see the names of the columns. I can also see the data type contained in each column.
#checking the structure
str(suicide)
## 'data.frame': 27820 obs. of 12 variables:
## $ ï..country : chr "Albania" "Albania" "Albania" "Albania" ...
## $ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
## $ sex : chr "male" "male" "female" "male" ...
## $ age : chr "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
## $ suicides_no : int 21 16 14 1 9 1 6 4 1 0 ...
## $ population : int 312900 308000 289700 21800 274300 35600 278800 257200 137500 311000 ...
## $ suicides.100k.pop : num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
## $ country.year : chr "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
## $ HDI.for.year : num NA NA NA NA NA NA NA NA NA NA ...
## $ gdp_for_year.... : chr "2,156,624,900" "2,156,624,900" "2,156,624,900" "2,156,624,900" ...
## $ gdp_per_capita....: int 796 796 796 796 796 796 796 796 796 796 ...
## $ generation : chr "Generation X" "Silent" "Generation X" "G.I. Generation" ...
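The odd column name `ï..country` usually means that the CSV file starts with a UTF-8 byte order mark (BOM) that was read as part of the first header. Assuming that is the cause here, one option is to pass fileEncoding = "UTF-8-BOM" to read.csv(). The sketch below demonstrates this on a small temporary file, not on master.csv.

```r
# Write a tiny CSV that begins with a UTF-8 byte order mark (EF BB BF).
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("country,year\nAlbania,1987\n")), tmp)

# Declaring the encoding lets R drop the BOM from the first column name.
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
names(demo)
```

Renaming the column afterwards also works; declaring the encoding just avoids having to type the mangled name.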
I am interested in checking the head and tail of the data. I set the second argument in these functions to 10 in order to see the first 10 and last 10 observations in the data frame.
#head of data
head(suicide, 10)
## ï..country year sex age suicides_no population
## 1 Albania 1987 male 15-24 years 21 312900
## 2 Albania 1987 male 35-54 years 16 308000
## 3 Albania 1987 female 15-24 years 14 289700
## 4 Albania 1987 male 75+ years 1 21800
## 5 Albania 1987 male 25-34 years 9 274300
## 6 Albania 1987 female 75+ years 1 35600
## 7 Albania 1987 female 35-54 years 6 278800
## 8 Albania 1987 female 25-34 years 4 257200
## 9 Albania 1987 male 55-74 years 1 137500
## 10 Albania 1987 female 5-14 years 0 311000
## suicides.100k.pop country.year HDI.for.year gdp_for_year....
## 1 6.71 Albania1987 NA 2,156,624,900
## 2 5.19 Albania1987 NA 2,156,624,900
## 3 4.83 Albania1987 NA 2,156,624,900
## 4 4.59 Albania1987 NA 2,156,624,900
## 5 3.28 Albania1987 NA 2,156,624,900
## 6 2.81 Albania1987 NA 2,156,624,900
## 7 2.15 Albania1987 NA 2,156,624,900
## 8 1.56 Albania1987 NA 2,156,624,900
## 9 0.73 Albania1987 NA 2,156,624,900
## 10 0.00 Albania1987 NA 2,156,624,900
## gdp_per_capita.... generation
## 1 796 Generation X
## 2 796 Silent
## 3 796 Generation X
## 4 796 G.I. Generation
## 5 796 Boomers
## 6 796 G.I. Generation
## 7 796 Silent
## 8 796 Boomers
## 9 796 G.I. Generation
## 10 796 Generation X
#tail of data
tail(suicide, 10)
## ï..country year sex age suicides_no population
## 27811 Uzbekistan 2014 female 15-24 years 347 2992817
## 27812 Uzbekistan 2014 male 55-74 years 144 1271111
## 27813 Uzbekistan 2014 male 15-24 years 347 3126905
## 27814 Uzbekistan 2014 male 75+ years 17 224995
## 27815 Uzbekistan 2014 female 25-34 years 162 2735238
## 27816 Uzbekistan 2014 female 35-54 years 107 3620833
## 27817 Uzbekistan 2014 female 75+ years 9 348465
## 27818 Uzbekistan 2014 male 5-14 years 60 2762158
## 27819 Uzbekistan 2014 female 5-14 years 44 2631600
## 27820 Uzbekistan 2014 female 55-74 years 21 1438935
## suicides.100k.pop country.year HDI.for.year gdp_for_year....
## 27811 11.59 Uzbekistan2014 0.675 63,067,077,179
## 27812 11.33 Uzbekistan2014 0.675 63,067,077,179
## 27813 11.10 Uzbekistan2014 0.675 63,067,077,179
## 27814 7.56 Uzbekistan2014 0.675 63,067,077,179
## 27815 5.92 Uzbekistan2014 0.675 63,067,077,179
## 27816 2.96 Uzbekistan2014 0.675 63,067,077,179
## 27817 2.58 Uzbekistan2014 0.675 63,067,077,179
## 27818 2.17 Uzbekistan2014 0.675 63,067,077,179
## 27819 1.67 Uzbekistan2014 0.675 63,067,077,179
## 27820 1.46 Uzbekistan2014 0.675 63,067,077,179
## gdp_per_capita.... generation
## 27811 2309 Millenials
## 27812 2309 Boomers
## 27813 2309 Millenials
## 27814 2309 Silent
## 27815 2309 Millenials
## 27816 2309 Generation X
## 27817 2309 Silent
## 27818 2309 Generation Z
## 27819 2309 Generation Z
## 27820 2309 Boomers
I decided to select the columns with variables that I believe will be most useful to my analysis. There are many missing values in HDI.for.year, the variable country.year seems redundant, and the variable gdp_for_year seems less informative than GDP per capita. Thus, I have decided to select the data without these three columns. I have also decided to rename the columns ï..country and gdp_per_capita.... to country and gdp_per_cap, respectively, so that they are easier to call. I filtered by the two countries I am interested in: the United States and the United Kingdom. This extraction improves the usability and readability of the data frame.
#creating the sorted_sui extraction
#drop the unused columns and rename once, then filter by country
cleaned_sui <- suicide %>% select(-HDI.for.year, -country.year, -gdp_for_year....) %>% rename(country = ï..country, gdp_per_cap = gdp_per_capita....)
#%in% is needed when matching several values; == recycles the vector and keeps only every other matching row (372 of the 744 US and UK rows, as in the output below)
sorted_sui <- cleaned_sui %>% filter(country %in% c("United States", "United Kingdom"))
sorted_sui_US <- cleaned_sui %>% filter(country == "United States")
sorted_sui_UK <- cleaned_sui %>% filter(country == "United Kingdom")
#checking the sorted_sui extraction
str(sorted_sui)
## 'data.frame': 372 obs. of 9 variables:
## $ country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
## $ year : int 1985 1985 1985 1985 1985 1985 1986 1986 1986 1986 ...
## $ sex : chr "male" "male" "female" "female" ...
## $ age : chr "55-74 years" "25-34 years" "75+ years" "35-54 years" ...
## $ suicides_no : int 915 620 237 505 86 2 892 616 223 424 ...
## $ population : int 5170113 3969689 2418692 6879295 4549121 3542919 5159651 4039982 2454593 6933075 ...
## $ suicides.100k.pop: num 17.7 15.62 9.8 7.34 1.89 ...
## $ gdp_per_cap : int 9231 9231 9231 9231 9231 9231 11323 11323 11323 11323 ...
## $ generation : chr "G.I. Generation" "Boomers" "G.I. Generation" "Silent" ...
head(sorted_sui)
## country year sex age suicides_no population
## 1 United Kingdom 1985 male 55-74 years 915 5170113
## 2 United Kingdom 1985 male 25-34 years 620 3969689
## 3 United Kingdom 1985 female 75+ years 237 2418692
## 4 United Kingdom 1985 female 35-54 years 505 6879295
## 5 United Kingdom 1985 female 15-24 years 86 4549121
## 6 United Kingdom 1985 female 5-14 years 2 3542919
## suicides.100k.pop gdp_per_cap generation
## 1 17.70 9231 G.I. Generation
## 2 15.62 9231 Boomers
## 3 9.80 9231 G.I. Generation
## 4 7.34 9231 Silent
## 5 1.89 9231 Generation X
## 6 0.06 9231 Generation X
tail(sorted_sui)
## country year sex age suicides_no population
## 367 United States 2015 male 75+ years 3171 8171136
## 368 United States 2015 male 35-54 years 11634 41658010
## 369 United States 2015 male 15-24 years 4359 22615073
## 370 United States 2015 female 55-74 years 2872 35115610
## 371 United States 2015 female 15-24 years 1132 21633813
## 372 United States 2015 male 5-14 years 255 21273987
## suicides.100k.pop gdp_per_cap generation
## 367 38.81 60387 Silent
## 368 27.93 60387 Generation X
## 369 19.27 60387 Millenials
## 370 8.18 60387 Boomers
## 371 5.23 60387 Millenials
## 372 1.20 60387 Generation Z
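When filtering on more than one value, `==` and `%in%` behave differently: `==` recycles the comparison vector element by element, while `%in%` tests membership. A minimal sketch on a hypothetical four-row data frame (`toy` is not part of the data set):

```r
toy <- data.frame(
  country = c("United States", "United States",
              "United Kingdom", "United Kingdom"),
  value = 1:4,
  stringsAsFactors = FALSE
)
both <- c("United States", "United Kingdom")

# == compares row 1 to "United States", row 2 to "United Kingdom", and so on,
# so only rows that happen to line up with the recycled vector are kept.
recycled <- toy[toy$country == both, ]

# %in% keeps every row whose country appears anywhere in the vector.
matched <- toy[toy$country %in% both, ]

nrow(recycled)
nrow(matched)
```

The same distinction applies inside dplyr's filter().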
How many years are represented in this data set?
#number of unique years
select(sorted_sui, year) %>% unique %>% nrow
## [1] 31
Which years are these, in order?
#all years represented in data set
unique(sorted_sui$year)
## [1] 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
## [15] 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## [29] 2013 2014 2015
I would like to validate my data with an external source. The Centers for Disease Control and Prevention (CDC) has a tool that presents suicide statistics for the United States. I will compare the CDC totals for 2004 to 2014 with the corresponding World Health Organization (WHO) figures in my data set.
Suicides by year in the US, according to CDC data:
#checking the suicides by year in the US, according to CDC
cdc <- read.csv("results.csv", stringsAsFactors = FALSE)
ranking_0414_CDC <- cdc %>% filter(Year %in% 2004:2014) %>% select(Year, Deaths, Cause.of.Death) %>% as.data.frame %>% arrange(Year)
ranking_0414_CDC
## Year Deaths Cause.of.Death
## 1 2004 32439 Suicide Injury
## 2 2005 32637 Suicide Injury
## 3 2006 33300 Suicide Injury
## 4 2007 34598 Suicide Injury
## 5 2008 36035 Suicide Injury
## 6 2009 36909 Suicide Injury
## 7 2010 38364 Suicide Injury
## 8 2011 39518 Suicide Injury
## 9 2012 40600 Suicide Injury
## 10 2013 41149 Suicide Injury
## 11 2014 42826 Suicide Injury
Suicides by year in the US, according to WHO data:
#checking the suicides by year in the US, according to WHO
ranking_0414_WHO <- suicide %>% rename(country = ï..country) %>% filter(year %in% 2004:2014, country == "United States") %>% group_by(year) %>% summarize(total_suicides = sum(suicides_no)) %>% as.data.frame %>% arrange(year)
ranking_0414_WHO
## year total_suicides
## 1 2004 32428
## 2 2005 32629
## 3 2006 33292
## 4 2007 34596
## 5 2008 36030
## 6 2009 36900
## 7 2010 38362
## 8 2011 39508
## 9 2012 40596
## 10 2013 41143
## 11 2014 42769
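The tables look quite similar! The yearly totals differ by at most 57 deaths (in 2014), which is less than 0.2% of the CDC figure. This is what we wanted, and it suggests that the WHO-based data set is consistent with the CDC data for these years.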
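To put a number on how close the two sources are, the totals can be placed side by side. This is a small sketch that copies the values from the two tables printed above; the data frame name `comparison` and its column names are my own.

```r
# Yearly totals copied from the CDC and WHO tables printed above.
comparison <- data.frame(
  year = 2004:2014,
  cdc = c(32439, 32637, 33300, 34598, 36035, 36909,
          38364, 39518, 40600, 41149, 42826),
  who = c(32428, 32629, 33292, 34596, 36030, 36900,
          38362, 39508, 40596, 41143, 42769)
)

# Absolute and relative gap between the two sources for each year.
comparison$difference <- comparison$cdc - comparison$who
comparison$pct_difference <- 100 * comparison$difference / comparison$cdc

comparison
max(comparison$difference)
```

In every year the CDC total is slightly higher than the WHO total.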
To begin addressing the first question, we can compare the distribution of the number of suicides for each generation with side-by-side boxplots.
US_boxplot<- ggplot(data = sorted_sui_US) +
geom_boxplot(mapping = aes(x = factor(generation), y = suicides_no,
fill = factor(generation))) +
labs(title = "Number of Suicides by Generation Boxplots",
x = "Generation", y = "Number of Suicides")
US_boxplot
For change over time, a scatter plot with smoothed trend lines can be helpful. I will add a color aesthetic to account for the third, categorical variable: generation.
ggplot(data = sorted_sui_US) +
geom_point(mapping = aes(x = year, y = suicides_no, color=generation)) +
geom_smooth(mapping = aes(x = year, y = suicides_no, color=generation)) +
scale_x_continuous(breaks = seq(1985, 2015, by = 5)) +
labs(title = "Number of Suicides in the US vs Year by Generation",
x = "Year", y = "Number of Suicides")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This graph is interesting! From it, we can see that the G.I. Generation is no longer represented after about the year 2000, and that Generation Z is not represented until about the year 2007. This makes sense, as they are the oldest and youngest generations, respectively.
In order to address the second question, we need to examine whether or not the population is approximately normal.
#creating and printing the histogram for the US population
hist_pop_US<- ggplot(sorted_sui_US, aes(population)) +
geom_histogram(aes(y=..density..), color="red", fill="white") +
geom_density(fill="green", alpha=.5)
hist_pop_US
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#creating and printing the histogram for the UK population
hist_pop_UK<- ggplot(sorted_sui_UK, aes(population)) +
geom_histogram(aes(y=..density..), color="red", fill="white") +
geom_density(fill="green", alpha=.5)
hist_pop_UK
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The data appear to be roughly symmetric. If they followed the normal distribution, then I would expect the mean and median to be approximately equal.
#US summary stats
summary(sorted_sui_US$population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4064000 18185450 20375469 21650611 22616944 43805214
#UK summary stats
summary(sorted_sui_UK$population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1202838 3620903 4121212 4674107 5810017 8881944
Mean for US: 21650611 Median for US: 20375469
Mean for UK: 4674107 Median for UK: 4121212
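They are reasonably close! In both countries the mean is somewhat higher than the median (by about 6% for the US and about 13% for the UK), which suggests a mild right skew.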
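The gap between mean and median can be expressed relative to the median. The sketch below uses the values printed by summary() above; `rel_gap()` is a hypothetical helper introduced only for this illustration.

```r
# Values copied from the summary() output above.
us <- c(median = 20375469, mean = 21650611)
uk <- c(median = 4121212, mean = 4674107)

# Hypothetical helper: how far the mean sits above the median, as a fraction.
rel_gap <- function(s) unname((s["mean"] - s["median"]) / s["median"])

# Express the gaps as percentages, rounded to one decimal place.
round(100 * c(US = rel_gap(us), UK = rel_gap(uk)), 1)
```

A positive gap means the mean is pulled upward by larger values, which is what a right-skewed distribution looks like.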
We can quickly visualize this summary information with a boxplot.
#creating and printing the boxplot for US
boxplot(sorted_sui_US$population, col = "blue")
#creating and printing the boxplot for UK
boxplot(sorted_sui_UK$population, col = "red")
Now, to see if the difference between the population means is significant, we can perform a t-test.
#performing the t-test on the difference of the population means
t.test(sorted_sui_US$population, sorted_sui_UK$population)
##
## Welch Two Sample t-test
##
## data: sorted_sui_US$population and sorted_sui_UK$population
## t = 33.972, df = 401.04, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 15994103 17958904
## sample estimates:
## mean of x mean of y
## 21650611 4674107
In the data set, the mean population per row (one sex and age group in one year) is 21,650,611 for the United States and 4,674,107 for the United Kingdom. The t-statistic is 33.972. The 95% confidence interval for the difference in mean population runs from 15,994,103 people to 17,958,904 people. The p-value is less than 2.2*10^(-16), which is an incredibly small number. This indicates that the result is highly significant. Thus, at the 5 percent significance level, we can reject the null hypothesis that the two means are equal (μ_US = μ_UK) in favor of the alternative hypothesis that they differ.
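The reported confidence interval can be reconstructed from the other numbers in the output, which is a useful check on how the pieces fit together. This sketch uses only the printed values; the variable names are my own.

```r
# Values copied from the t.test() output above.
mean_us  <- 21650611
mean_uk  <- 4674107
t_stat   <- 33.972
deg_free <- 401.04

diff_means <- mean_us - mean_uk                # estimated difference in means
std_error  <- diff_means / t_stat              # standard error implied by t
margin     <- qt(0.975, deg_free) * std_error  # half-width of the 95% interval

ci <- c(lower = diff_means - margin, upper = diff_means + margin)
ci
```

Because the t-statistic is rounded in the printed output, the reconstructed bounds agree with the reported interval only approximately.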
For the first question, we can create line plots for the United States (and the United Kingdom) that show number of suicides per year for each generation.
g1 <- ggplot(sorted_sui_US, aes(x = year, y = suicides_no, colour = generation)) +
geom_line() +
stat_summary(aes(group = 1), geom = "line", fun = mean, size = 1, col = "black") +
scale_x_continuous(breaks = seq(1985, 2015, by = 15)) +
facet_grid(. ~ generation)
g2 <- ggplot(sorted_sui_UK, aes(x = year, y = suicides_no, colour = generation)) +
geom_line() +
stat_summary(aes(group = 1), geom = "line", fun = mean, size = 1, col = "black") +
scale_x_continuous(breaks = seq(1985, 2015, by = 15)) +
facet_grid(. ~ generation)
g1
g2
For the second question, we can create similar line plots for the United States (and the United Kingdom) that show the population per year for each generation.
g3 <- ggplot(sorted_sui_US, aes(x = year, y = population, colour = generation)) +
geom_line() +
stat_summary(aes(group = 1), geom = "line", fun = mean, size = 1, col = "black") +
scale_x_continuous(breaks = seq(1985, 2015, by = 15)) +
facet_grid(. ~ generation)
g4 <- ggplot(sorted_sui_UK, aes(x = year, y = population, colour = generation)) +
geom_line() +
stat_summary(aes(group = 1), geom = "line", fun = mean, size = 1, col = "black") +
scale_x_continuous(breaks = seq(1985, 2015, by = 15)) +
facet_grid(. ~ generation)
g3
g4
We can use the line graphs produced in item 9 to conclude that the number of suicides by generation appears to track the population of these countries in the data set. This makes sense, since a larger group will generally record more suicides. Further, we can see that the overall patterns in the number of suicides rose and fell around the same times for the United States and the United Kingdom. We also see that the overall patterns of population rose and fell around the same times for the United States and the United Kingdom. All of the plots generally seem to rise and fall similarly. Further exploration of what may have caused these trends can be made!
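Because the number of suicides depends on group size, a rate per 100,000 people is more comparable across groups. The sketch below recomputes that rate for the six United States rows from 2015 shown in the tail(sorted_sui) output earlier; the data frame name `us_2015` and the column `rate_per_100k` are my own.

```r
# US rows for 2015 copied from the tail(sorted_sui) output.
us_2015 <- data.frame(
  sex = c("male", "male", "male", "female", "female", "male"),
  age = c("75+ years", "35-54 years", "15-24 years",
          "55-74 years", "15-24 years", "5-14 years"),
  suicides_no = c(3171, 11634, 4359, 2872, 1132, 255),
  population = c(8171136, 41658010, 22615073, 35115610, 21633813, 21273987),
  stringsAsFactors = FALSE
)

# Suicides per 100,000 people in each group.
us_2015$rate_per_100k <- round(1e5 * us_2015$suicides_no / us_2015$population, 2)

# Sort by raw count to compare the two orderings.
us_2015[order(-us_2015$suicides_no), ]
```

Among these rows, men aged 35-54 have the highest count, while men aged 75 and over have the highest rate, so counts and rates can tell different stories.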