Introduction

In this project, we were assigned to clean a dataset, explore its variables, perform at least one statistical analysis, and examine both quantitative and categorical variables. I chose the dataset “USRegionalMortality” because I wanted to know how mortality rates compare between rural and urban locations in the United States. In addition, I wanted to know which causes of death have the highest mortality rates in rural areas. The source of my dataset is: http://vincentarelbundock.github.io/Rdatasets/datasets.html.

USRegionalMortality is a data frame with 400 observations on the following 6 variables:

1- Region: A factor specifying the HHS Region

2- Status: A factor with levels Rural and Urban

3- Sex: A factor with levels Female and Male

4- Cause: Cause of death. A factor with levels Alzheimers, Cancer, Cerebrovascular diseases, Diabetes, Flu and pneumonia, Heart disease, Lower respiratory, Nephritis, Suicide, and Unintentional injuries

5- Rate: Age-adjusted death rate per 100,000 population

6- SE: Standard error for the rate
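
Because the dataset reports a standard error next to each rate, the two can be combined into an approximate 95% confidence interval. The sketch below assumes a normal approximation (rate ± 1.96 × SE), which is an assumption of this example and not something stated by the dataset itself. The values are those of the first row of the data (urban males, heart disease, HHS Region 01).

```r
# First row of the dataset: urban males, heart disease, HHS Region 01
rate <- 188.2   # age-adjusted deaths per 100,000 population
se   <- 1.0     # standard error of the rate

# Assumed normal approximation for a 95% interval
z  <- qnorm(0.975)
ci <- c(lower = rate - z * se, upper = rate + z * se)
round(ci, 1)
```

A narrow interval like this one suggests the rate is estimated fairly precisely; rows with a larger SE would give wider intervals.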

R Markdown

In this project, we will use the following packages to clean the data, explore its variables, create ggplot visualizations, and add interactivity.

# Loading the R packages (they must already be installed)
library(readr)
## Warning: package 'readr' was built under R version 3.6.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.1
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.6.1
library(RColorBrewer)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.1
## -- Attaching packages ----------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.1
## Warning: package 'tidyr' was built under R version 3.6.1
## Warning: package 'purrr' was built under R version 3.6.1
## Warning: package 'stringr' was built under R version 3.6.1
## Warning: package 'forcats' was built under R version 3.6.1
## -- Conflicts -------------------------------------------- tidyverse_conflicts() --
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x dplyr::lag()     masks stats::lag()

Read the data

First, we read the .csv file for the dataset “U.S. Regional Mortality”.

US_Regional_Mortality <- read.csv("US_Regional_Mortality.csv")
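
The str() output further below shows the text columns as factors. That depends on read.csv() converting strings to factors, which was the default before R 4.0.0 but is no longer the default from R 4.0.0 onward. A minimal sketch of making the conversion explicit, using two rows copied from the head() output and inlined as text so the example does not depend on the file being present:

```r
# Two rows copied from the head() output, inlined so the example is self-contained
csv_text <- "Status,Cause,Rate,SE,Region,Sex
Urban,Heart disease,188.2,1.0,HHS Region 01,Male
Rural,Heart disease,199.1,2.6,HHS Region 01,Male"

# stringsAsFactors = TRUE reproduces the pre-4.0 default of read.csv()
sample_rows <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(sample_rows)
```

With the real file, the same argument can be passed alongside the file name, for example read.csv("US_Regional_Mortality.csv", stringsAsFactors = TRUE).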

Explore the data

We can confirm that the imported data is a data frame by using the class() function.

class(US_Regional_Mortality)
## [1] "data.frame"

We are going to check the number of rows and columns in the data frame using the dim() function.

dim(US_Regional_Mortality)
## [1] 400   6

We will examine the structure of the data frame by using the str() function to see the names of the columns as well as the data type of each column.

str(US_Regional_Mortality)
## 'data.frame':    400 obs. of  6 variables:
##  $ Status: Factor w/ 2 levels "Rural","Urban": 2 1 2 1 2 1 2 1 2 1 ...
##  $ Cause : Factor w/ 10 levels "Alzheimers","Cancer",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Rate  : num  188 199 115 124 227 ...
##  $ SE    : num  1 2.6 0.6 1.7 0.8 3.3 0.5 2.3 0.8 2 ...
##  $ Region: Factor w/ 10 levels "HHS Region 01",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ Sex   : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 2 1 1 2 2 ...

We will look at the top and the bottom of the data set with head() and tail(), which by default show its first and last 6 observations.

head(US_Regional_Mortality)
##   Status         Cause  Rate  SE        Region    Sex
## 1  Urban Heart disease 188.2 1.0 HHS Region 01   Male
## 2  Rural Heart disease 199.1 2.6 HHS Region 01   Male
## 3  Urban Heart disease 115.1 0.6 HHS Region 01 Female
## 4  Rural Heart disease 124.5 1.7 HHS Region 01 Female
## 5  Urban Heart disease 226.8 0.8 HHS Region 02   Male
## 6  Rural Heart disease 248.8 3.3 HHS Region 02   Male
tail(US_Regional_Mortality)
##     Status     Cause Rate  SE        Region    Sex
## 395  Urban Nephritis  6.1 0.1 HHS Region 09 Female
## 396  Rural Nephritis  8.4 0.5 HHS Region 09 Female
## 397  Urban Nephritis  8.6 0.3 HHS Region 10   Male
## 398  Rural Nephritis  8.6 0.5 HHS Region 10   Male
## 399  Urban Nephritis  5.9 0.2 HHS Region 10 Female
## 400  Rural Nephritis  6.7 0.4 HHS Region 10 Female

Cleaning the data

Now we will manipulate the data using base R and dplyr functions.

At this step, we remove any rows with missing values by using the base R complete.cases() function. We will call the new data frame “US_Regional_Mortality1”.

US_Regional_Mortality1 <- US_Regional_Mortality[complete.cases(US_Regional_Mortality),]
dim(US_Regional_Mortality1)
## [1] 400   6

The dimensions are unchanged (400 rows and 6 columns), so we do not have any missing values.
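
Comparing dimensions before and after filtering shows whether any rows were dropped, but not where the missing values were. A small sketch of a more direct check follows; the toy data frame is hypothetical (three rows from the dataset with one Rate deliberately blanked out for illustration).

```r
# Hypothetical toy frame: three rows from the dataset, with one Rate blanked out
toy <- data.frame(
  Status = c("Urban", "Rural", "Urban"),
  Rate   = c(188.2, NA, 115.1),
  SE     = c(1.0, 2.6, 0.6)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Rows that survive complete.cases()
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

Applied to US_Regional_Mortality, colSums(is.na(...)) would be expected to return zero for every column, given that the dimensions did not change.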

Now, we use the dplyr rename() function to rename the variable Rate to Mortality_Rate.

US_Regional_Mortality <- rename(US_Regional_Mortality, Mortality_Rate = Rate)

Now, we use select() with the - operator to drop the columns SE through Sex (that is, SE, Region, and Sex).

US_Regional_Mortality2 <- select(US_Regional_Mortality, -(SE:Sex))
str(US_Regional_Mortality2)
## 'data.frame':    400 obs. of  3 variables:
##  $ Status        : Factor w/ 2 levels "Rural","Urban": 2 1 2 1 2 1 2 1 2 1 ...
##  $ Cause         : Factor w/ 10 levels "Alzheimers","Cancer",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Mortality_Rate: num  188 199 115 124 227 ...

We now have 3 variables (Status, Cause, and Mortality_Rate), which we will explore in the following visualizations.
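
Dropping a range such as SE:Sex depends on the order of the columns in the file. An alternative sketch, assuming the same column names, keeps the wanted columns by name and renames Rate in the same select() call. The toy data frame below is built from three rows of the head() output.

```r
library(dplyr)

# Three rows from the head() output, in the original column order
toy <- data.frame(
  Status = c("Urban", "Rural", "Urban"),
  Cause  = "Heart disease",
  Rate   = c(188.2, 199.1, 115.1),
  SE     = c(1.0, 2.6, 0.6),
  Region = "HHS Region 01",
  Sex    = c("Male", "Male", "Female")
)

# Keep three columns by name and rename Rate in the same call
toy_small <- dplyr::select(toy, Status, Cause, Mortality_Rate = Rate)
names(toy_small)
```

Writing dplyr::select() with the package prefix is a precaution in case another loaded package also defines a function called select().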

Data Visualization

Simple Summaries: One Dimension

Now, we will provide at least one statistical component: summary statistics of the mortality rate overall, by Status, and by Cause.

# summary() has the five-number summary and the mean
summary(US_Regional_Mortality2$Mortality_Rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.40   18.07   29.20   56.91   53.70  276.40
Status_summary <- US_Regional_Mortality2 %>%
  group_by(Status) %>%
  summarize(min = min(Mortality_Rate), median = median(Mortality_Rate), mean = mean(Mortality_Rate), max = max(Mortality_Rate), count = n()) %>%
  arrange(desc(min))
Status_summary
## # A tibble: 2 x 6
##   Status   min median  mean   max count
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <int>
## 1 Rural    3.9   32.8  60.7  276.   200
## 2 Urban    3.4   27.8  53.1  227.   200
Cause_summary <- US_Regional_Mortality2 %>%
  group_by(Cause) %>%
  summarize(min = min(Mortality_Rate), median = median(Mortality_Rate), mean = mean(Mortality_Rate), max = max(Mortality_Rate), count = n()) %>%
  arrange(desc(min))
Cause_summary
## # A tibble: 10 x 6
##    Cause                      min median  mean   max count
##    <fct>                    <dbl>  <dbl> <dbl> <dbl> <int>
##  1 Cancer                   122.   162.  172.  245.     40
##  2 Heart disease            106    176.  177.  276.     40
##  3 Lower respiratory         28     46.0  46.6  70.7    40
##  4 Cerebrovascular diseases  27.2   37.3  37.3  49.1    40
##  5 Unintentional injuries    17     41.1  45.1  79.1    40
##  6 Diabetes                  12.1   23.0  22.3  32.8    40
##  7 Alzheimers                10.5   23.8  23.9  40.8    40
##  8 Flu and pneumonia          8.8   15.6  16.2  24.1    40
##  9 Nephritis                  5.9   12.9  13.5  22.6    40
## 10 Suicide                    3.4   11.1  15.0  35.4    40
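
The summaries by Status pool all causes, regions, and sexes together. One way to make the rural versus urban comparison more like-for-like is to pair each rural rate with the urban rate for the same Region and Sex and look at the difference. The sketch below does this in base R for the six Heart disease rows shown in the head() output; it is only an illustration of the idea, and it needs the Region and Sex columns, so it would have to be run on the data frame before those columns are dropped.

```r
# Six Heart disease rows copied from the head() output
heart <- data.frame(
  Status = c("Urban", "Rural", "Urban", "Rural", "Urban", "Rural"),
  Mortality_Rate = c(188.2, 199.1, 115.1, 124.5, 226.8, 248.8),
  Region = c("HHS Region 01", "HHS Region 01", "HHS Region 01",
             "HHS Region 01", "HHS Region 02", "HHS Region 02"),
  Sex = c("Male", "Male", "Female", "Female", "Male", "Male")
)

# Split by Status, then pair rural and urban rows on Region and Sex
rural <- heart[heart$Status == "Rural", ]
urban <- heart[heart$Status == "Urban", ]
paired <- merge(rural, urban, by = c("Region", "Sex"),
                suffixes = c("_rural", "_urban"))

# Rural minus urban rate for each Region and Sex combination
paired$difference <- paired$Mortality_Rate_rural - paired$Mortality_Rate_urban
paired[, c("Region", "Sex", "difference")]
mean(paired$difference)
```

In these three pairs the rural rate is higher each time. Repeating the pairing over all 400 rows (adding Cause to the by argument) would show whether that pattern holds for every cause.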

Next, we will create boxplots and histograms that combine the quantitative variable with the categorical variables, to show the mortality rates in the US by Status (Rural and Urban) and by Cause.

Boxplot

# Boxplot of the mortality rate by Status
graph_boxplot_status <- ggplot(US_Regional_Mortality2, aes(x = Status, y = Mortality_Rate, fill = Status)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(size = 6, angle = 90, hjust = 0.5, vjust = 0.5),
        legend.text = element_text(size = 6))
ggplotly(graph_boxplot_status)

# Boxplot of the mortality rate by Status, filled by Cause
graph_boxplot_cause <- ggplot(US_Regional_Mortality2, aes(x = Status, y = Mortality_Rate, fill = Cause)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(size = 6, angle = 90, hjust = 0.5, vjust = 0.5),
        legend.text = element_text(size = 6))
ggplotly(graph_boxplot_cause)
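
Because heart disease and cancer have rates far above the other causes, a single shared y-axis compresses the boxes for the low-rate causes. One possible alternative, sketched here on a toy data frame built from the rows in the head() and tail() output, is to draw one panel per Cause with its own y-axis using facet_wrap().

```r
library(ggplot2)

# Rows copied from the head() and tail() output
toy <- data.frame(
  Status = rep(c("Urban", "Rural"), times = 6),
  Cause  = rep(c("Heart disease", "Nephritis"), each = 6),
  Mortality_Rate = c(188.2, 199.1, 115.1, 124.5, 226.8, 248.8,
                     6.1, 8.4, 8.6, 8.6, 5.9, 6.7)
)

# One panel per Cause, each with its own y-axis range
facet_boxplot <- ggplot(toy, aes(x = Status, y = Mortality_Rate, fill = Status)) +
  geom_boxplot() +
  facet_wrap(~ Cause, scales = "free_y") +
  labs(y = "Age-adjusted death rate per 100,000")
facet_boxplot
```

The trade-off is that free y-axes make the rural versus urban gap easy to see within each cause, but the panels can no longer be compared with each other by eye.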

Histogram

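# qplot() is deprecated as of ggplot2 3.4.0; these ggplot() calls draw the same stacked histograms
ggplot(US_Regional_Mortality2, aes(x = Mortality_Rate, fill = Status)) +
  geom_histogram(bins = 30)

ggplot(US_Regional_Mortality2, aes(x = Mortality_Rate, fill = Cause)) +
  geom_histogram(bins = 30)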

Having used the graphics above to settle on US mortality rates in rural and urban areas as our focus, we will next create a ggplot visualization and add interactivity, breaking the graph up into semantic components such as scales, labels, title, and legend.

Create a ggplot visualization

g<-ggplot(data = US_Regional_Mortality2,mapping = aes(x=Status,y=Mortality_Rate, color=Cause))
class(g)
## [1] "gg"     "ggplot"
g + geom_point() +
  labs(title = "Mortality Rates in US vs Status by Cause 2011-2013",
       x = "Status", y = "Mortality_Rate", color = "Cause")

g1<-ggplot(data = US_Regional_Mortality2,mapping = aes(x=Status,y=Mortality_Rate, size=Mortality_Rate, color=Cause))
class(g1)
## [1] "gg"     "ggplot"
g1 + geom_point() +
  labs(title = "Mortality Rates in US vs Status 2011-2013",
       x = "Status", y = "Mortality_Rate", color = "Cause")

Incorporate interactivity: plotly

plotly

g3 <- ggplot(data = US_Regional_Mortality2, mapping = aes(x = Status, y = Mortality_Rate, color = Cause)) +
  geom_point() +
  labs(title = "Mortality Rates in US vs Status 2011-2013") +
  labs(x = "Status", y = "Mortality_Rate")
ggplotly(g3)
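
By default, ggplotly() shows every mapped aesthetic in the hover label. A possible refinement, sketched below on a four-row toy data frame taken from the head() and tail() output, is to build a custom label with the unofficial text aesthetic and pass tooltip = "text" to ggplotly(). The label wording is just an illustration.

```r
library(ggplot2)
library(plotly)

# Four rows copied from the head() and tail() output
toy <- data.frame(
  Status = c("Urban", "Rural", "Urban", "Rural"),
  Cause  = c("Heart disease", "Heart disease", "Nephritis", "Nephritis"),
  Mortality_Rate = c(188.2, 199.1, 6.1, 8.4)
)

# The "text" aesthetic is not drawn by geom_point(); ggplotly() uses it
# for the hover label when tooltip = "text"
p <- ggplot(toy, aes(x = Status, y = Mortality_Rate, color = Cause,
                     text = paste0(Cause, ": ", Mortality_Rate, " per 100,000"))) +
  geom_point()

interactive_plot <- ggplotly(p, tooltip = "text")
```

Printing interactive_plot in an R Markdown chunk or at the console displays the interactive version of the chart.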

Essay

1- We cleaned this dataset using base R and the dplyr package. First, we checked for missing values with the complete.cases() function and found none. Next, we used the rename() function to rename the variable Rate to Mortality_Rate. Finally, we used select() with the - operator to drop the columns SE through Sex. This left 3 variables and 400 observations.

2- In my background research, I found the article “Rural–urban disparities in the prevalence of diabetes and coronary heart disease”. The authors, O’Connor & Wellenius (2012), examined rural–urban differences in the prevalence of diabetes and coronary heart disease in a sample of more than 214,000 respondents, using 2008 data from the US Centers for Disease Control and Prevention (CDC). They concluded that the crude prevalence rates of diabetes and coronary heart disease were 8.6% and 38.8% higher, respectively, among respondents living in rural areas than among those living in urban areas.

3- Overall, the results show that people living in rural locations in the U.S. have a higher mortality rate than those living in urban areas: across all causes, the mean rate was 60.7 per 100,000 in rural areas versus 53.1 in urban areas. In addition, the plots suggest that mortality from heart disease is higher in rural environments than in urban areas, and that the same holds for cancer. I expected to see a higher rate of diabetes among people in rural environments, as the earlier study reported. However, I was surprised to find that cancer was among the causes with the highest rates in rural areas. Although the data frame provided valuable information, this study could be extended to other risk factors such as annual household income, age, education, city, and lifestyle. That would allow us to analyze the mortality rate alongside other factors, especially poverty.

Conclusion

The higher rates of cancer and heart disease mortality in rural populations in the U.S. are a major public health challenge. The U.S. government should create more exercise facilities and sidewalks so that people can exercise, and should lower the cost of fresh fruits and vegetables, in order to reduce the rates of heart disease and cancer in rural locations.

Reference

O’Connor, A., & Wellenius, G. (2012). Rural–urban disparities in the prevalence of diabetes and coronary heart disease. Public Health, 126(10), 813–820. https://doi.org/10.1016/j.puhe.2012.05.029