In this project, we were assigned to clean and explore a dataset, perform at least one statistical analysis, and examine both quantitative and categorical variables. I chose the dataset “USRegionalMortality” because I wanted to compare mortality rates between rural and urban locations in the United States. In addition, I wanted to know which causes of death have the highest rates in rural areas. The source of my dataset is: http://vincentarelbundock.github.io/Rdatasets/datasets.html.
USRegionalMortality is a data frame with 400 observations on the following 6 variables:
1- Region: A factor specifying HHS Region
2- Status: A factor with levels Rural and Urban
3- Sex: A factor with levels Female and Male
4- Cause: Cause of death. A factor with levels Alzheimers, Cancer, Cerebrovascular diseases, Diabetes, Flu and pneumonia, Heart disease, Lower respiratory, Nephritis, Suicide, and Unintentional injuries
5- Rate: Age-adjusted death rate per 100,000 population
6- SE: Standard error for the rate
In this project, we will work with the following useful packages to clean, explore the data variables, create a ggplot visualization, and incorporate interactivity.
# Loading the R packages
library(readr)
## Warning: package 'readr' was built under R version 3.6.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.1
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.6.1
library(RColorBrewer)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.1
## -- Attaching packages ----------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.3 v purrr 0.3.2
## v tidyr 0.8.3 v stringr 1.4.0
## v tibble 2.1.3 v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.1
## Warning: package 'tidyr' was built under R version 3.6.1
## Warning: package 'purrr' was built under R version 3.6.1
## Warning: package 'stringr' was built under R version 3.6.1
## Warning: package 'forcats' was built under R version 3.6.1
## -- Conflicts -------------------------------------------- tidyverse_conflicts() --
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
First, we read the .csv file of the “USRegionalMortality” dataset.
US_Regional_Mortality <- read.csv("US_Regional_Mortality.csv")
We can confirm that the imported data is a data frame by using the class() function.
class(US_Regional_Mortality)
## [1] "data.frame"
We are going to check the number of rows and columns in the data frame using the dim() function.
dim(US_Regional_Mortality)
## [1] 400 6
We will examine the structure of the data frame with the str() function, which shows the column names and the data type of each column.
str(US_Regional_Mortality)
## 'data.frame': 400 obs. of 6 variables:
## $ Status: Factor w/ 2 levels "Rural","Urban": 2 1 2 1 2 1 2 1 2 1 ...
## $ Cause : Factor w/ 10 levels "Alzheimers","Cancer",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Rate : num 188 199 115 124 227 ...
## $ SE : num 1 2.6 0.6 1.7 0.8 3.3 0.5 2.3 0.8 2 ...
## $ Region: Factor w/ 10 levels "HHS Region 01",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 2 1 1 2 2 ...
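The Factor columns in the str() output above reflect the R 3.6 behaviour, where read.csv() converted strings to factors by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so the same call would return character columns. The following is a minimal sketch of one way to keep the result stable across versions. It assumes a small stand-in file: the temporary CSV and the name mortality_demo are illustrative only, and the two rows are copied from the first rows of the dataset shown in this report.

```r
# Stand-in for US_Regional_Mortality.csv: two rows from this report,
# written to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Status,Cause,Rate,SE,Region,Sex",
  "Urban,Heart disease,188.2,1.0,HHS Region 01,Male",
  "Rural,Heart disease,199.1,2.6,HHS Region 01,Male"
), csv_path)

# Request factors explicitly so the result does not depend on the R version.
mortality_demo <- read.csv(csv_path, stringsAsFactors = TRUE)
str(mortality_demo)
```

With the real file, the same argument can be added to the read.csv() call used in this report.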
We will look at the top and the bottom of the data set with head() and tail(), which by default show the first and last 6 observations.
head(US_Regional_Mortality)
## Status Cause Rate SE Region Sex
## 1 Urban Heart disease 188.2 1.0 HHS Region 01 Male
## 2 Rural Heart disease 199.1 2.6 HHS Region 01 Male
## 3 Urban Heart disease 115.1 0.6 HHS Region 01 Female
## 4 Rural Heart disease 124.5 1.7 HHS Region 01 Female
## 5 Urban Heart disease 226.8 0.8 HHS Region 02 Male
## 6 Rural Heart disease 248.8 3.3 HHS Region 02 Male
tail(US_Regional_Mortality)
## Status Cause Rate SE Region Sex
## 395 Urban Nephritis 6.1 0.1 HHS Region 09 Female
## 396 Rural Nephritis 8.4 0.5 HHS Region 09 Female
## 397 Urban Nephritis 8.6 0.3 HHS Region 10 Male
## 398 Rural Nephritis 8.6 0.5 HHS Region 10 Male
## 399 Urban Nephritis 5.9 0.2 HHS Region 10 Female
## 400 Rural Nephritis 6.7 0.4 HHS Region 10 Female
Next, we clean and manipulate the data.
First, we remove any rows with missing values using the base R complete.cases() function. We will call the new data frame “US_Regional_Mortality1”.
US_Regional_Mortality1 <- US_Regional_Mortality[complete.cases(US_Regional_Mortality),]
dim(US_Regional_Mortality1)
## [1] 400 6
The dimensions are unchanged (400 rows and 6 columns), so the dataset has no missing values.
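Comparing dimensions before and after filtering shows whether any rows were dropped, but not where the gaps are. As a complementary check, missing values can be counted per column. This is a small sketch on a toy data frame with hypothetical values (the name toy and the NA are invented for illustration, since the real dataset has no missing values):

```r
# Toy data frame (hypothetical values) with one missing Rate.
toy <- data.frame(
  Status = c("Urban", "Rural", "Urban"),
  Rate   = c(188.2, NA, 115.1)
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the rows that have no missing value in any column.
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```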
Now, we use the dplyr rename() function on the cleaned data frame to rename the variable Rate to Mortality_Rate.
US_Regional_Mortality <- rename(US_Regional_Mortality1, Mortality_Rate = Rate)
Next, we use select() with a minus sign to drop the columns SE through Sex (that is, SE, Region, and Sex).
US_Regional_Mortality2 <- select(US_Regional_Mortality, -(SE:Sex))
str(US_Regional_Mortality2)
## 'data.frame': 400 obs. of 3 variables:
## $ Status : Factor w/ 2 levels "Rural","Urban": 2 1 2 1 2 1 2 1 2 1 ...
## $ Cause : Factor w/ 10 levels "Alzheimers","Cancer",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Mortality_Rate: num 188 199 115 124 227 ...
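The negative range -(SE:Sex) depends on the column order of the data frame. A possible alternative is to name the columns to keep, which gives the same result here and does not change if the columns are reordered. The sketch below assumes a toy data frame with the same column layout as in this report (the names toy, by_range, and by_name are illustrative; the two rows are taken from the head() output):

```r
library(dplyr)

# Toy data frame with the same column order as the renamed dataset.
toy <- data.frame(
  Status         = c("Urban", "Rural"),
  Cause          = c("Heart disease", "Heart disease"),
  Mortality_Rate = c(188.2, 199.1),
  SE             = c(1.0, 2.6),
  Region         = c("HHS Region 01", "HHS Region 01"),
  Sex            = c("Male", "Male")
)

# Drop a contiguous range of columns (order-dependent).
by_range <- select(toy, -(SE:Sex))

# Keep columns by name (order-independent).
by_name <- select(toy, Status, Cause, Mortality_Rate)

names(by_range)
names(by_name)
```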
We now have 3 variables, which we will explore in the summaries and visualizations that follow.
Now, we will provide at least one statistical component.
# summary() has the five-number summary and the mean
summary(US_Regional_Mortality2$Mortality_Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.40 18.07 29.20 56.91 53.70 276.40
Status_summary <- US_Regional_Mortality2 %>%
  group_by(Status) %>%
  summarize(min = min(Mortality_Rate),
            median = median(Mortality_Rate),
            mean = mean(Mortality_Rate),
            max = max(Mortality_Rate),
            count = n()) %>%
  arrange(desc(min))
Status_summary
## # A tibble: 2 x 6
## Status min median mean max count
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Rural 3.9 32.8 60.7 276. 200
## 2 Urban 3.4 27.8 53.1 227. 200
Cause_summary <- US_Regional_Mortality2 %>%
  group_by(Cause) %>%
  summarize(min = min(Mortality_Rate),
            median = median(Mortality_Rate),
            mean = mean(Mortality_Rate),
            max = max(Mortality_Rate),
            count = n()) %>%
  arrange(desc(min))
Cause_summary
## # A tibble: 10 x 6
## Cause min median mean max count
## <fct> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Cancer 122. 162. 172. 245. 40
## 2 Heart disease 106 176. 177. 276. 40
## 3 Lower respiratory 28 46.0 46.6 70.7 40
## 4 Cerebrovascular diseases 27.2 37.3 37.3 49.1 40
## 5 Unintentional injuries 17 41.1 45.1 79.1 40
## 6 Diabetes 12.1 23.0 22.3 32.8 40
## 7 Alzheimers 10.5 23.8 23.9 40.8 40
## 8 Flu and pneumonia 8.8 15.6 16.2 24.1 40
## 9 Nephritis 5.9 12.9 13.5 22.6 40
## 10 Suicide 3.4 11.1 15.0 35.4 40
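The two summaries above group by Status and by Cause separately. To compare rural and urban rates within each cause, the same pattern can be applied with both grouping variables at once. The following is a sketch under stated assumptions: it uses a toy data frame of eight rows copied from the head() and tail() output in this report, so the resulting means are not the full-dataset values, and the names toy and status_cause_summary are illustrative. The .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Eight rows from the head() and tail() output of this report:
# Heart disease (Region 01) and Nephritis (Region 10).
toy <- data.frame(
  Status         = rep(c("Urban", "Rural"), times = 4),
  Cause          = rep(c("Heart disease", "Nephritis"), each = 4),
  Mortality_Rate = c(188.2, 199.1, 115.1, 124.5, 8.6, 8.6, 5.9, 6.7)
)

# Mean rate and number of rows for each Cause and Status combination.
status_cause_summary <- toy %>%
  group_by(Cause, Status) %>%
  summarize(mean = mean(Mortality_Rate), count = n(), .groups = "drop") %>%
  arrange(desc(mean))
status_cause_summary
```

On the full data, replacing toy with US_Regional_Mortality2 would give 20 rows, one per cause and status.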
Next, we will plot the quantitative variable against the categorical variables, using boxplots and histograms to show the mortality rates in the US by Status (Rural and Urban) and by Cause.
# Boxplot of mortality rate by Status (US_Regional_Mortality2 was created above)
graph_boxplot <- ggplot(US_Regional_Mortality2, aes(x = Status, y = Mortality_Rate, fill = Status)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(size = 6, angle = 90, hjust = 0.5, vjust = 0.5),
        legend.text = element_text(size = 6))
ggplotly(graph_boxplot)
# Boxplot of mortality rate by Status, split by Cause
graph_boxplot <- ggplot(US_Regional_Mortality2, aes(x = Status, y = Mortality_Rate, fill = Cause)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(size = 6, angle = 90, hjust = 0.5, vjust = 0.5),
        legend.text = element_text(size = 6))
ggplotly(graph_boxplot)
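# Histograms of mortality rate, filled by Status and by Cause
# (ggplot() + geom_histogram() replaces qplot(), which is deprecated in recent ggplot2 releases)
ggplot(US_Regional_Mortality2, aes(x = Mortality_Rate, fill = Status)) +
  geom_histogram(bins = 30)
ggplot(US_Regional_Mortality2, aes(x = Mortality_Rate, fill = Cause)) +
  geom_histogram(bins = 30)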
Having settled on US mortality rates in rural and urban areas as our focus through the graphics above, we next build a ggplot visualization step by step, adding semantic components such as scales, labels, a title, and a legend, and then incorporate interactivity.
g<-ggplot(data = US_Regional_Mortality2,mapping = aes(x=Status,y=Mortality_Rate, color=Cause))
class(g)
## [1] "gg" "ggplot"
g + geom_point() +
  labs(title = "Mortality Rates in US vs Status by Cause 2011-2013",
       x = "Status", y = "Mortality_Rate", color = "Cause")
g1<-ggplot(data = US_Regional_Mortality2,mapping = aes(x=Status,y=Mortality_Rate, size=Mortality_Rate, color=Cause))
class(g1)
## [1] "gg" "ggplot"
g1 + geom_point() +
  labs(title = "Mortality Rates in US vs Status 2011-2013",
       x = "Status", y = "Mortality_Rate", color = "Cause")
Finally, we make the plot interactive with plotly.
g3 <- ggplot(data = US_Regional_Mortality2, mapping = aes(x = Status, y = Mortality_Rate, color = Cause)) +
  geom_point() +
  labs(title = "Mortality Rates in US vs Status 2011-2013",
       x = "Status", y = "Mortality_Rate")
ggplotly(g3)
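Because Status has only two levels, geom_point() stacks all points in two vertical lines, so points with similar rates overlap. One possible refinement, offered here only as a sketch, is to spread the points horizontally with geom_jitter(). The example assumes a toy data frame of eight rows copied from the head() and tail() output in this report; the names toy and p and the jitter width of 0.2 are illustrative choices.

```r
library(ggplot2)

# Eight rows from the head() and tail() output of this report.
toy <- data.frame(
  Status         = rep(c("Urban", "Rural"), times = 4),
  Cause          = rep(c("Heart disease", "Nephritis"), each = 4),
  Mortality_Rate = c(188.2, 199.1, 115.1, 124.5, 8.6, 8.6, 5.9, 6.7)
)

# Jitter horizontally only, so the plotted rates stay exact on the y axis.
p <- ggplot(toy, aes(x = Status, y = Mortality_Rate, color = Cause)) +
  geom_jitter(width = 0.2, height = 0) +
  labs(title = "Mortality Rates in US vs Status 2011-2013",
       x = "Status", y = "Mortality_Rate", color = "Cause")
p
```

The resulting object can be passed to ggplotly() in the same way as g3.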
1- We cleaned this dataset in three steps. First, we removed rows with missing values using the complete.cases() function (there were none). Next, we used the dplyr rename() function to rename the variable Rate to Mortality_Rate. Finally, we used select() with a minus sign to drop the columns SE through Sex. This left 3 variables and 400 observations.
2- In my background research, I found the article “Rural–urban disparities in the prevalence of diabetes and coronary heart disease”. The authors, O’Connor & Wellenius (2012), examined rural–urban differences in the prevalence of diabetes and coronary heart disease. The sample comprised more than 214,000 respondents, using 2008 data from the US Centers for Disease Control and Prevention (CDC). They concluded that the crude prevalence rates of diabetes and coronary heart disease were 8.6% and 38.8% higher, respectively, among respondents living in rural areas than among those living in urban areas.
3- Overall, the results show that, in the U.S., people living in rural locations have a higher mortality rate than those living in urban areas (a mean rate of 60.7 versus 53.1 deaths per 100,000 in the Status summary). In addition, mortality from heart disease is higher in rural areas than in urban areas, and cancer mortality also tends to be higher there. I expected to see a higher rate of diabetes among people in rural environments, as the earlier study suggested. However, I was surprised to find that cancer was among the causes with the highest rates in rural areas. Although the data frame provided valuable information, this study could be extended to other risk factors such as annual household income, age, education, city, and lifestyle. That would allow us to analyze the mortality rate alongside other factors, especially poverty.
The higher rates of cancer and heart disease mortality in rural populations in the U.S. are a major public health challenge. The U.S. government should create more exercise facilities and sidewalks so that people can exercise, and lower the cost of fresh fruits and vegetables, in order to reduce the rates of heart disease and cancer in rural locations.
O’Connor, A., & Wellenius, G. (2012). Rural–urban disparities in the prevalence of diabetes and coronary heart disease. Public Health, 126(10), 813–820. https://doi.org/10.1016/j.puhe.2012.05.029