#Loading
library(tidyverse)
library(moments)
Loading the tidyverse and moments libraries into the R session
suicide_rate<- read.csv(file.choose())
Importing the data set
str(suicide_rate)
'data.frame': 31756 obs. of 14 variables:
$ country : chr "Albania" "Albania" "Albania" "Albania" ...
$ year : int 1987 1987 1987 1987 1987 1987 1987 1987 1987 1987 ...
$ sex : chr "male" "male" "female" "male" ...
$ age : chr "15-24 years" "35-54 years" "15-24 years" "75+ years" ...
$ suicides_no : int 21 16 14 1 9 1 6 4 1 0 ...
$ population : num 312900 308000 289700 21800 274300 ...
$ suicides.100k.pop : num 6.71 5.19 4.83 4.59 3.28 2.81 2.15 1.56 0.73 0 ...
$ country.year : chr "Albania1987" "Albania1987" "Albania1987" "Albania1987" ...
$ HDI.for.year : num NA NA NA NA NA NA NA NA NA NA ...
$ gdp_for_year.... : chr "2,15,66,24,900" "2,15,66,24,900" "2,15,66,24,900" "2,15,66,24,900" ...
$ gdp_per_capita....: num 796 796 796 796 796 796 796 796 796 796 ...
$ generation : chr "Generation X" "Silent" "Generation X" "G.I. Generation" ...
$ log_population : num 12.65 12.64 12.58 9.99 12.52 ...
$ sex_numeric : num 1 1 0 1 1 0 0 0 1 0 ...
Checking the structure of the data frame
##Plot 1 (Numeric Variable - “population”)
##Make sure the “population” column is numeric
suicide_rate$population<- as.numeric(suicide_rate$population)
##Remove any rows with missing or NA values in the “population” column
suicide_rate <- suicide_rate[!is.na(suicide_rate$population), ]
sum(is.na(suicide_rate$population))
[1] 0
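Missing values are usually dropped row by row so that every column of the data frame keeps the same length. A minimal sketch on a hypothetical toy data frame (the names `toy` and `clean` are illustrative and not part of the analysis):

```r
# Toy data frame with two missing population values (hypothetical data)
toy <- data.frame(
  country    = c("A", "B", "C", "D"),
  population = c(1000, NA, 2500, NA)
)

# Keep only the rows where population is not NA
clean <- toy[!is.na(toy$population), ]

nrow(clean)                      # number of rows that remain
sum(is.na(clean$population))     # number of NA values left in the column
```

Filtering rows rather than assigning a shortened vector back into the column avoids a length mismatch when some values really are missing.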
#Plotting a histogram of the Population variable
hist(suicide_rate$population, main = "Population Distribution", xlab = "Population", col = "blue")
##identifying outliers using the IQR method
The IQR method flags a value as an outlier when it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1. A value flagged ‘TRUE’ lies outside these bounds. Because the smallest population is positive and the lower bound is negative, every flagged value sits in the upper tail, which is consistent with a strongly right-skewed distribution rather than a symmetric one. The outliers are not omitted: they are correct values, and removing them could bias the analysis.
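The IQR rule can be sketched as a small helper. This is an illustration on made-up numbers, not the suicide data set; the function name `flag_iqr_outliers` and the vector `toy_population` are hypothetical.

```r
# Flag outliers with the 1.5 * IQR rule (toy data, not the real data set)
flag_iqr_outliers <- function(x, k = 1.5) {
  q1 <- quantile(x, 0.25, names = FALSE)
  q3 <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1            # must be Q3 - Q1 so that the spread is positive
  lower <- q1 - k * iqr
  upper <- q3 + k * iqr
  x < lower | x > upper     # TRUE marks a value outside the bounds
}

toy_population <- c(120, 150, 160, 175, 180, 200, 210, 5000)
flags <- flag_iqr_outliers(toy_population)

toy_population[flags]       # values flagged as outliers
```

If the subtraction is reversed (Q1 - Q3), the spread becomes negative and the two bounds swap, so almost every value ends up flagged.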
boxplot(suicide_rate$population)
##shape and skewness of the distribution
plot(density(suicide_rate$population), main = "Population Distribution")
skewness(suicide_rate$population)
[1] 21.53854
The skewness is strongly positive: the tail of the distribution is longer on the right side than on the left side.
We use a logarithmic transformation to reduce the skewness and make the distribution more symmetric.
suicide_rate$log_population<- log(suicide_rate$population)
plot(density(suicide_rate$log_population), main = "Log-transformed Population Distribution")
skewness(suicide_rate$log_population)
[1] -0.150254
After the log transformation the skewness drops from 21.54 to -0.15, so the transformed distribution is close to symmetric.
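The effect of a log transformation on skewness can be checked on a small deterministic example. The sketch below uses a hand-written moment-based skewness (third central moment divided by the 1.5 power of the second), which to my understanding is the same definition that `moments::skewness` uses; the helper `moment_skewness` and the vector `x` are hypothetical.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Values spread over several orders of magnitude (toy data)
x <- 10^(0:6)

skew_raw <- moment_skewness(x)        # long right tail, clearly positive
skew_log <- moment_skewness(log(x))   # evenly spaced after the log, about zero

skew_raw
skew_log
```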
summary(suicide_rate$population)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.780e+02 1.288e+05 5.468e+05 7.217e+06 2.909e+06 1.411e+09
median(suicide_rate$population)
[1] 546832.5
mean(suicide_rate$population)
[1] 7217454
skewness(suicide_rate$population)
[1] 21.53854
Since the skewness is high, the data contain extreme values that pull the mean upward: the mean (7,217,454) is more than ten times the median (546,832.5). The median is a robust measure of central tendency that is much less influenced by extreme values, so it is the more appropriate measure here.
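A small made-up example shows how a single extreme value moves the mean but barely changes the median. The vectors `typical` and `with_extreme` are hypothetical.

```r
# Five ordinary values, then the same values plus one extreme value
typical      <- c(100, 120, 130, 150, 170)
with_extreme <- c(typical, 100000)

mean(typical)          # mean without the extreme value
median(typical)        # median without the extreme value

mean(with_extreme)     # the mean is dragged far upward
median(with_extreme)   # the median moves only slightly
```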
Q1<- quantile(suicide_rate$population, 0.25)
Q3<- quantile(suicide_rate$population, 0.75)
IQR<- Q3 - Q1
IQR
75%
2779942
Because the median is used as the robust measure of central tendency, a robust measure of spread such as the interquartile range (IQR = Q3 - Q1) is appropriate, since quartiles are less affected by outliers than the standard deviation.
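The same point can be illustrated for spread. In this sketch on made-up numbers, one extreme value inflates the standard deviation far more than the interquartile range; `stats::IQR` is the built-in function, called with its namespace because the script above also stores a number under the name `IQR`.

```r
# Toy values with one extreme observation (hypothetical data)
with_extreme <- c(100, 120, 130, 150, 170, 100000)

spread_sd  <- sd(with_extreme)           # dominated by the extreme value
spread_iqr <- stats::IQR(with_extreme)   # Q3 - Q1, driven by the middle half of the data

spread_sd
spread_iqr
```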
country_counts<- data.frame(table(suicide_rate$country))
ggplot(country_counts, aes(x = Var1, y = Freq)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  xlab("Country") +
  ylab("Count") +
  labs(title = "Count of Observations by Country", tag = "Plot 2.1")
*The graph is cluttered because there are more than 100 countries in the data.*
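One common way to make a crowded bar chart readable is to plot only the most frequent categories. A minimal base R sketch on a made-up country column (the values and the names `country`, `n_top`, and `top_counts` are hypothetical):

```r
# Toy country column (hypothetical values, not the real data)
country <- c(rep("A", 5), rep("B", 3), rep("C", 8), rep("D", 1), rep("E", 2))

# Count rows per country and sort from most to least frequent
counts <- sort(table(country), decreasing = TRUE)

# Keep only the n_top most frequent countries
n_top <- 3
top_counts <- head(counts, n_top)

barplot(top_counts, las = 2, ylab = "Count",
        main = "Most frequent countries (toy data)")
```

With the real data the same idea would apply to `suicide_rate$country`, with `n_top` chosen to suit the plot width.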
# Count rows per country and convert the counts to percentages
country_group <- suicide_rate %>%
  group_by(country) %>%
  summarise(Count = n()) %>%
  mutate(percentage = round(Count / sum(Count) * 100, 2))
# IQR outlier bounds for population (IQR must be Q3 - Q1)
Q1<- quantile(suicide_rate$population, 0.25)
Q3<- quantile(suicide_rate$population, 0.75)
IQR<- Q3 - Q1
upper_outlier_bound <- Q3 + 1.5*IQR
lower_outlier_bound <- Q1 - 1.5*IQR
outliers <- suicide_rate$population > upper_outlier_bound | suicide_rate$population < lower_outlier_bound
view(head(outliers))
# Pie chart of the share of rows per country
ggplot(country_group, aes(x = "", y = Count, fill = country)) +
  geom_bar(stat = "identity", width = 0.8) +
  coord_polar("y") +
  # scale_fill_manual(values = cbPalette) is left out: cbPalette is not defined in this script
  geom_text(aes(label = paste(percentage, "%")),
            position = position_stack(vjust = 0.5), size = 3.5) +
  guides(fill = guide_legend(title = "Country")) +
  theme_void() +
  labs(title = "Distribution of proportions for variable Country (in Percentage)",
       tag = "Plot 2.2")
Unusual observations for the “country” variable would be countries with an unusually high or low count or proportion compared with the rest of the dataset.
unique(country_counts$Var1)
[1] Albania Antigua and Barbuda
[3] Argentina Armenia
[5] Aruba Australia
[7] Austria Azerbaijan
[9] Bahamas Bahrain
[11] Barbados Belarus
[13] Belgium Belize
[15] Bosnia and Herzegovina Brazil
[17] Brunei Darussalam Bulgaria
[19] Cabo Verde Canada
[21] Chile China, Hong Kong SAR
[23] Colombia Costa Rica
[25] Croatia Cuba
[27] Cyprus Czech Republic
[29] Czechia Denmark
[31] Dominica Dominican Republic
[33] Ecuador Egypt
[35] El Salvador Estonia
[37] Fiji Finland
[39] France Georgia
[41] Germany Greece
[43] Grenada Guatemala
[45] Guyana Hungary
[47] Iceland Ireland
[49] Israel Italy
[51] Jamaica Japan
[53] Jordan Kazakhstan
[55] Kiribati Kuwait
[57] Kyrgyzstan Latvia
[59] Lebanon Lithuania
[61] Luxembourg Macau
[63] Maldives Malta
[65] Mauritius Mexico
[67] Mongolia Montenegro
[69] Netherlands New Zealand
[71] Nicaragua North Macedonia
[73] Norway Oman
[75] Panama Paraguay
[77] Peru Philippines
[79] Poland Portugal
[81] Puerto Rico Qatar
[83] Republic of Korea Republic of Moldova
[85] Romania Russian Federation
[87] Saint Kitts and Nevis Saint Lucia
[89] Saint Vincent and Grenadines Saint Vincent and the Grenadines
[91] San Marino Serbia
[93] Seychelles Singapore
[95] Slovakia Slovenia
[97] South Africa Spain
[99] Sri Lanka Suriname
[101] Sweden Switzerland
[103] Tajikistan Thailand
[105] Trinidad and Tobago Turkey
[107] Turkmenistan Ukraine
[109] United Arab Emirates United Kingdom
[111] United States United States of America
[113] Uruguay Uzbekistan
114 Levels: Albania Antigua and Barbuda Argentina Armenia Aruba Australia ... Uzbekistan
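The list above contains three pairs that look like the same country under two spellings: "Czech Republic" and "Czechia", "Saint Vincent and Grenadines" and "Saint Vincent and the Grenadines", and "United States" and "United States of America". Assuming each pair does refer to one country, the names could be harmonized before counting. A minimal sketch on a toy vector; which spelling to keep is an arbitrary choice here, and the lookup `canonical` is hypothetical.

```r
# Toy vector containing the duplicated spellings seen in the list above
country <- c("Czechia", "Czech Republic",
             "United States", "United States of America",
             "Saint Vincent and Grenadines", "Saint Vincent and the Grenadines",
             "Albania")

# Lookup: old spelling = spelling to keep (the choice of target is an assumption)
canonical <- c(
  "Czech Republic"               = "Czechia",
  "United States"                = "United States of America",
  "Saint Vincent and Grenadines" = "Saint Vincent and the Grenadines"
)

# Replace only the entries that appear in the lookup
needs_fix <- country %in% names(canonical)
country[needs_fix] <- unname(canonical[country[needs_fix]])

unique(country)
```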
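# gdp_for_year.... is read as text with comma separators, so convert it to numeric first
# (gdp_for_year_num is a new helper column introduced here)
suicide_rate$gdp_for_year_num <- as.numeric(gsub(",", "", suicide_rate$gdp_for_year....))
ggplot(suicide_rate, aes(x = gdp_for_year_num, y = gdp_per_capita....)) +
  geom_point() +
  labs(title = "Relationship between GDP per year and GDP per capita",
       x = "GDP per year",
       y = "GDP per capita")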
ggplot(suicide_rate, aes(x = sex, y = suicides.100k.pop, color = sex)) +
  geom_jitter(alpha = 0.5, size = 3, width = 0.2) +
  scale_color_manual(values = c("#E69F00", "#56B4E9")) +
  labs(title = "Relationship between Sex and Suicide Rate",
       x = "Sex",
       y = "Suicide Rate") +
  theme_minimal()
From the jitter plot, the points for males appear to extend to higher suicide rates than the points for females. Because sex is a categorical variable with two levels, this is a difference between two groups rather than a linear relationship.
The relationship between sex and suicide rate indicates that there is a difference in the suicide rate between males and females. Specifically, the data suggests that males have a higher suicide rate than females. This is an important finding that may have implications for suicide prevention and intervention efforts, as it highlights the need to focus on understanding and addressing the specific risk factors that may be contributing to higher suicide rates in males. It may also point to the need for targeted interventions that take into account gender differences in the experience and expression of mental health issues.
The plot also shows considerable variability, with points spread across the range of suicide rates for both males and females. Sex alone therefore does not determine the suicide rate, and many other factors could influence the rate in a given country. Even so, there is a general trend of higher suicide rates among males than among females, which is consistent with the positive correlation coefficient we calculated. Sex is a factor that can help predict suicide rates, but it is not the only one. Further analysis would be needed to understand the factors that contribute to suicide rates in different countries.
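The correlation itself is not shown above. One way such a coefficient could be obtained is to code sex as 0/1 and correlate it with the rate, alongside a per-group median as a more direct comparison. The sketch below uses made-up numbers; the data frame `toy` and its values are hypothetical, and the coding male = 1, female = 0 follows the `sex_numeric` column shown in the structure output.

```r
# Toy data: suicide rate per 100k for a few male and female groups (hypothetical)
toy <- data.frame(
  sex  = c("male", "female", "male", "female", "male", "female"),
  rate = c(12, 3, 20, 5, 16, 4)
)

# Code sex numerically: male = 1, female = 0
toy$sex_numeric <- ifelse(toy$sex == "male", 1, 0)

# Correlation between the 0/1 sex indicator and the rate
r <- cor(toy$sex_numeric, toy$rate)

# Median rate within each sex
medians <- tapply(toy$rate, toy$sex, median)

r
medians
```

A positive `r` here only says that the group coded 1 tends to have higher rates; the group medians state the same comparison in the original units.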