library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(dplyr)
library(tidyr)
require(reshape)

## Loading required package: reshape

## Warning: package 'reshape' was built under R version 4.2.2

## 
## Attaching package: 'reshape'
## 
## The following object is masked from 'package:dplyr':
## 
##     rename
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths

setwd("~/dataset")
pad <- read.csv("preventabledeath.csv")

Background

Within the data set, my primary questions/goal was to analyze and see the effects of misdiagnosis, due to the worsening of the true condition, based on observed and expected overall death and their death rates. This interest/questions originated from both research/studies, but also awareness of misdiagnosis and its prevalence within medicine, especially with qualities such as provider burnout, physician shortage, socioeconomic status, transfer to more technological uses/treatments in medicine, and many more issues, making misdiagnosis/medical error more prevalent. To clarify this is not to say that a misdiagnosis is entirely due to provider error, as stated prior many qualities actively go into the process of a misdiagnosis, and if you were to take a closer look at each of these qualities independently, you would realize that they are prominent issues within medicine, on their own standing. It’s often easier to think of medicine as a bunch of mangled links, that are connected, if one link is to be broken or weakened, other links would suffer as well, but if attention and care are brought to that link than other links could prosper, but others could also suffer. With this analysis in mind, medicine has an entire set of weak links, in my case, I decided to focus on a link that could turn into a prominent issue, which could and is even starting to be considered a leading cause of death. As such, the dataset that I will be using measured observed and expected deaths due to the 5 leading causes of death being strokes, Chronic Low, Repository, disease, Heart Disease, Cancer, and Unintentional Injury (National Center Of Health Statistics, 2017). The method of obtaining the data was via observation/an observational study, which commonly houses bias due to the researcher actively knowing what they are looking for, in this case, the research was specifically looking at observed vs expected deaths in metropolitan and non-metropolitan areas, a variable that we’ll define later, but within this study/method, observation selection and even response bias are likely to occur. However, this is primarily an assumption, the data set nor the publisher activity goes into detail on how the observation was carried out and whether it accommodated for such a degree of bias and data wavering. So I will just assume the data set and study were done properly, to minimize the presence of bias, and produce proper raw data. The data set itself would later be divided into several subsets based on what type of information was needed, to answer stated questions in its respective section. This subset was also used to remove objects and variables that were either empty/missing, duplicated or weren’t helpful in a sense to the overarching question/goal. Now, before I move on to emphasizing the question, and to the actual data, I want to touch on the issues of an observational data set, and the lack of a cause and effect. To address this issues, we had to make/change an already present variable within the data set to a variable we could correlate a cause and effect with, which would be further emphasized and touched on within the Variable section. As such its important to take these finding with a heavy grain of salt, as they would only be true if all the potentially excess death was truly caused by misdiagnosis/medical error, which based on link analogy used earlier and the bases of medicine, cant not be true unless the data set and publisher measured only misdiagnosis based deaths, which is difficult to accomplish. Throughout this analysis, I will be using both a combination of summary statistics and other statistical techniques such as confidence intervals, and p value, to better answer and explore the effects of misdiagnosis, in observed vs expected deaths, and how this correlates to the worsening of the true Condition.

Varibales

Within the analysis, the variables I used, and later added, and the name changed include years of observed and Expected death from 2005 - 2015, Observed Deaths, Expected Deaths, Potentially Excess death, and locality. Variables that would be added or renamed include, Observed death rate, and expected death rate. While Potentially excess death will be renamed to “Excess death due to misdiagnosis”, again this is done to have a cause and affect in order to answer are question or even use the data set in the way we intend. For instance the cause of misdiagnosis in cancer, leads to an excess in death due to misdiagnosis. To give insight on the raw data set used/gathered some terms that you may want to now is metropolitan vs non-metropolitan. Which from what I have gathered metropolitan refers to a regions ass a set of counties surrounding a city, that each have a population of above 50,000 or more((Definition of Metro Vs Non-metro, n.d.), while non-metropolitan doesn’t meet the criteria of a population of 50,000 of higher. To add on a little further, we can simplify the variables by only looking at the observed, expected, and misdiagnosed/excess death, and the observed and expected rates, for each of the measured causes of death. In my case, I decided to exclude Internal injury from the raw data set, as from a medical point of view it’s unfair to use, as the unintentional injury is too broad to necessarily correlate with misdiagnosis unless specified to a heavy degree. For instance, saying someone tripped on a banana, causing them to hit their head and die instantly, is unfair to say they died from a medical error. But if we were to tie that death due to a result of undiagnosed schizophrenia causing an episode, motor coordination issues, or musculoskeletal issues, then we may be able to correlate the death with misdiagnosis. Again it’s important to realize just how hard it is to measure misdiagnosis, as for the reasons stated earlier, along with its ability to be read down to one definition that can be measured. In general, you can think of the variables being used to compare and contrast t and the reason for it will be in the conclusion.

Note

Throughout the following summaries and histograms you will see a notable pattern in terms of structure for each cause of death. In the end, there would be a slight summary for each finding, but a more informative and well-rounded analysis/summary of each summary, histograms, and plot diagrams will be provided at the end, or for ease of finding would be located after the plot diagram of Stroke. So reading each description/conclusion may help in better understanding the final analysis/summary, but are necessary.

Cancer

For ease will start by making a subset of each of the measured causes of death, in its respective data set. For instance only cancer in one subset (data set), another measured cause of death in another, and so on until each respectively used cause of death will have its own data/subset, this will just make it easier to both compare findings among the observed vs Expected, and to plot. I’ll add a header to indicate when we transition from each cause of death.

# Make a subset of only cancer related observation 
splitcancer<- subset(pad, pad$Cause.of.Death %in% c("Cancer"))
# Removed specific observation ,variable, and any function that is altering or affecting proper analysis
splitcancer <- subset(splitcancer, splitcancer$Locality != "All")

#add variable observed death rate per 10,000 people
splitcancer <- splitcancer%>%   
  mutate(Observed_death_rate=Observed.Deaths/Population*10000)

We used the na. omit command, because we want to remove rows that have or lack data that we are using, for instance, while looking at the data set I noted some observations of observed vs expected death were left blank, while others were filled. As such I wanted to remove any of those, to prevent any swaying or misinterpretation of the results.

#Rename subset, and removed specific observation ,variable, and any function lacking data
c<- subset(splitcancer, splitcancer$State !="United States")
c <- splitcancer %>% distinct()

c <- na.omit(c)

#Add variable Expected death rate via 10,000 people
c <- c%>%
  mutate(Expected_death_rate=Expected.Deaths/Population*10000)
### Provide Summary Statistics of both observed and expected death rate 
summary(c$Observed_death_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.666   3.837   8.040   8.905  13.071  25.144

summary(c$Expected_death_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9227  3.0402  6.2814  6.9883 10.5314 20.0798

#Provide histograms of observed and Expected death rates\
par(mfrow = c(1,2))
hist(c$Observed_death_rate, col = "Lavender",
     ylim = c(0,5000),
     xlab = ("Death Rate"),
     main = ("Observevd Death Rate via Cancer"))
hist(c$Expected_death_rate, col= "Lavender"
     , ylim =c(0,5000),
     xlab=("Death rate"),
     main =("Expected Death Rate via Cancer"))

Looking at these histograms and even the summaries we can tell that the death rate expected vs observed with regards to cancer, is skewed right. From this, we can say the death rates for both were never anticipated to be normal, but the observed death rates in cancer are slightly more skewed as seen by the x-axis.

Ploting the Mean/Average(Cancer)

For the following I wanted to compare the mean/average death rate for both observed and expected, to look for the presence of a trend in terms of death rates, throughout the measured years. Ideally combining all the respective death rates of each observed death would have provided a more appealing diagram, but due to the variety/frequency of each condition being shown within the y-value, it would be easier to compare, if each was done individually. For easier finding, I’ll add a header to indicate the presence of the Mean/average for each respective cause of death.

#Make plot to compare the the two, to do this we will first take the mean/average of the average observed death rate, and expected death rate to make it easier to plot and compare. 

#Make subsets of the cancer group per year, we'll later combined the means into one table, to later plot
c05<- subset(c, c$Year =="2005")
c06 <- subset(c, c$Year == "2006")
c07 <- subset(c, c$Year == "2007")
c08 <- subset(c, c$Year == "2008")
c09 <- subset(c, c$Year == "2009")
c10 <- subset(c, c$Year == "2010")
c11 <- subset(c, c$Year == "2011")
c12 <- subset(c, c$Year == "2012")
c13 <- subset(c, c$Year == "2013")
c14 <- subset(c, c$Year == "2014")
c15 <- subset(c, c$Year == "2015")

# Find the mean for both the observed death rate and the Expected for the given year, a # w

MOBC05<- c05%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC05 <- c05%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))

#
MOBC06<- c06%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC06 <- c06%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBC07<- c07%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC07 <- c07%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBC08<- c08%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(c08$Observed_death_rate))

MEBC08 <- c08%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBC09<- c09%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC09 <- c09%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBC10<- c10%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC10 <- c10%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBC11<- c11%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC11 <- c11%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBC12<- c12%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC12 <- c12%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBC13<- c13%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC13 <- c13%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBC14<- c14%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC14 <- c14%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBC15<- c15%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBC15 <- c15%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))


#Combine all the means of both the observed death rates and expected death rates into one single list

listC <- list(MOBC05, MEBC05, MOBC06, MEBC06,MOBC07, MEBC07,MOBC08, MEBC08, MOBC09, MEBC09, MOBC10, MEBC10, MOBC11, MEBC11, MOBC12, MEBC12, MOBC13, MEBC13, MOBC14, MEBC14, MOBC15, MEBC15)

#Convert the list to a table containing all the respective death rates and their years, we can ignore the N/A in this case as they don't show in the plot.

listC <- Reduce(function(x,y) merge(x,y,all=TRUE), listC)
#plot of observed death rates and expected death rates among cancer.
par(mfrow = c(1,2))
plot(listC$Year, listC$mean_ObservedDR,
     xlab = ("Years, 2005-2015"),
     ylab = ("Death Rate"),
     pch = 19,
     main = ("Mean Observed Death Rate
             2005-2015"),
     col = "cadetblue3")
plot(listC$Year, listC$mean_ExpectedDR,
     xlab = ("Years, 2005 -2015"),
     ylab = ("Death Rate"),
     pch = 19,
     main = ("Mean Expected Death rates
             2005-2015"),
     col="cadetblue3")

Comparing the average of the means among both Expected vs Observed deaths rate among cancer, Expected represented a more linear/predictable outcome. When in reality the observed and actual average death with regards to each respective year was much higher than what was anticipated. In regards to trends, there only seems to be that actual deaths were much more random and higher than what was initially expected.

Chronic Lower Respitory disease

#Follow the same format as cancer, create subset of only Chronic Lower Respiratory disease, along with adding the variable of Observed Death rate, and Removed specific observation ,variable, and any function that is altering or affecting proper analysis.
splitCLD <- subset(pad, pad$Cause.of.Death %in% c("Chronic Lower Respiratory Disease"))
splitCLD <- subset(splitCLD, splitCLD$Locality != "All")
splitCLD <- subset(splitCLD, splitCLD$State !="United States")
splitCLD <- splitCLD %>% distinct()
splitCLD<- na.omit(splitCLD)

# Include Observed death per 10,000 people via Chronic Lower Respiratory Disease
splitCLD <- splitCLD %>%   mutate(Observed_death_rate=Observed.Deaths/Population*10000)

# Rename subset to just CLD(Chronic Lower Respiratory disease)
CLD <- splitCLD

# Include/add Expected death rate via 10,000 people
CLD<- CLD%>%
  mutate(Expected_death_rate=Expected.Deaths/Population*10000)

#Add summary of Both expected and Observed death rate
summary(CLD$Observed_death_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0507  0.6011  1.4684  1.8825  2.8132  8.4766

summary(CLD$Expected_death_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06368 0.28261 0.80891 1.01246 1.60194 3.66320

#Add histogram of expected and Observed death rate 
hist(CLD$Expected_death_rate, 
     col = "Lavender",
     ylim = c(0,5000),
     xlab = ("Death Rate"),
     main = ("Expected Death Rate via CLRD"))

hist(CLD$Observed_death_rate,
     col= "Lavender",
     ylim = c(0,5000),
     xlab = ("Death Rate"),
     main = ("Observed Death Rate via CLRD"))

At first glance, we can see both histograms are actually fairly identical to the ones we did initially, only differing in just how heavy the skew is to one another. So both histograms are skewed to the right with Observed being more strongly skewed compared to expected.

#Plotting the Mean/Average(Chronic Lower Respirtory Disease)

To find the mean of both observed and Expected death rates the same process that was used for cancers average/mean will be done for Chronic Lower respiratory disease, and the remaining causes of death.

#Make plot to compare the the two, to do this we will first take the mean/average of the average observed death rate, and expected death rate to make it easier to plot and compare. 

#Make subsets of the group per year, we will later combined the means into one table, to plot
CLD5<- subset(CLD, CLD$Year =="2005")
CLD06 <- subset(CLD, CLD$Year == "2006")
CLD07 <- subset(CLD, CLD$Year == "2007")
CLD08 <- subset(CLD, CLD$Year == "2008")
CLD09 <- subset(CLD, CLD$Year == "2009")
CLD10 <- subset(CLD, CLD$Year == "2010")
CLD11 <- subset(CLD, CLD$Year == "2011")
CLD12 <- subset(CLD, CLD$Year == "2012")
CLD13 <- subset(CLD, CLD$Year == "2013")
CLD14 <- subset(CLD, CLD$Year == "2014")
CLD15 <- subset(CLD, CLD$Year == "2015")

MOBCLD05<- CLD5%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD05 <- CLD5%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))

#
MOBCLD06<- CLD06%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD06 <- CLD06%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBCLD07<- CLD07%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD07 <- CLD07%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBCLD08<- CLD08%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD08 <- CLD08%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBCLD09<- CLD09%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD09 <- CLD09%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBCLD10<- CLD10%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD10 <- CLD10%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBCLD11<- CLD11%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD11 <- CLD11%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBCLD12<- CLD12%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD12 <- CLD12%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBCLD13<- CLD13%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD13 <- CLD13%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBCLD14<- CLD14%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD14 <- CLD14%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBCLD15<- CLD15%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBCLD15 <- CLD15%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))


#

listCLD <- list(MOBCLD05, MEBCLD05, MOBCLD06, MEBCLD06,MOBCLD07, MEBCLD07,MOBCLD08, MEBCLD08, MOBCLD09, MEBCLD09, MOBCLD10, MEBCLD10, MOBCLD11, MEBCLD11, MOBCLD12, MEBCLD12, MOBCLD13, MEBCLD13, MOBCLD14, MEBCLD14, MOBCLD15, MEBCLD15)

listCLD <- Reduce(function(x,y) merge(x,y,all=TRUE), listCLD)
par(mfrow = c(1,2))
plot(listCLD$Year, listCLD$mean_ObservedDR,
     xlab = ("Years, 2005-2015"),
     ylab = ("Death Rate"),
     main = ("Mean Observed Death Rate(CLD)
     2005-2015"),
     pch = 19,
     col = "cadetblue3")
plot(listCLD$Year, listCLD$mean_ExpectedDR,
     xlab = ("Years, 2005 -2015"),
     ylab = ("Death Rate"),
     main = ("Meab Expected Death Rate(CLD)
             2005-2015"),
     pch = 19,
     col="cadetblue3")

From the comparisons, we can tell that there may be a trend, in terms of the controlled and predictable propositions of the given leading cause of death. Similar to cancer, Chronic Lower respiratory disease has a much higher death rate than expected, but almost seems to resemble a linear line, although not perfect it’s interesting that death rates for CLD, seem to be on a more consistent rise than compared to cancer.

Heart Disease

#Follow the same format as the previous, Make a subset of only Heard Disease related Observations, remove specific observations, variables, and any function lacking data or not used
splitHD <- subset(pad, pad$Cause.of.Death %in% c("Heart Disease"))
splitHD <- subset(splitHD, splitHD$Locality != "All")
splitHD <- subset(splitHD, splitHD$State !="United States")
splitHD <- splitHD %>% distinct()
splitHD<- na.omit(splitHD)

#add observed death rate via HD, per 10,000 people
splitHD <-splitHD %>%   mutate(Observed_death_rate=Observed.Deaths/Population*10000)

#Rename subset to just HD(Heart Disease) 
HD <- splitHD
#add Expected death rate via, HD per 10,000 people
HD<- HD%>%
  mutate(Expected_death_rate=Expected.Deaths/Population*10000)

#Add summary of Observed and expected death rate 
summary(HD$Observed_death_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5573  2.8940  5.4627  6.5076  9.1341 24.4969

summary(HD$Expected_death_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6072  1.8847  3.5001  4.2414  6.0844 14.9920

#Add histogram of Observed and Expected Death rate 
par(mfrow = c(1,2))
hist(HD$Observed_death_rate, 
     col = "Lavender",
     xlab = ("Death Rate"),
     main = ("Observed Death Rate via HD")
     ,ylim = c(0,6000))
hist(HD$Expected_death_rate, 
     col = "Lavender",
     xlab = ("Death Rate"),
     main = ("Expected Death Rate"),
     ylim=c(0,5000))

Looking at the histogram for the death rate of Heart disease, they follow the same trend of being skewed right and follow/have roughly identical structures to chronic lower Respiratory disease, and Cancer.

#Make plot to compare the the two, to do this we will first take the mean/average of the average observed death rate, and expected death rate to make it easier to plot and compare. 

#Make subsets of the group per year, we will later combined the means into one table, to plot
HD05<- subset(HD, Year =="2005")
HD06 <- subset(HD, Year == "2006")
HD07 <- subset(HD, Year == "2007")
HD08 <- subset(HD, Year == "2008")
HD09 <- subset(HD, Year == "2009")
HD10 <- subset(HD, Year == "2010")
HD11 <- subset(HD, Year == "2011")
HD12 <- subset(HD, Year == "2012")
HD13 <- subset(HD, Year == "2013")
HD14 <- subset(HD, Year == "2014")
HD15 <- subset(HD, Year == "2015")

MOBHD05<- HD05%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD05 <- HD05%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))

#
MOBHD06<- HD06%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD06 <- HD06%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBHD07<- HD07%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD07 <- HD07%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBHD08<- HD08%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD08 <- HD08%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBHD09<- HD09%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD09 <- HD09%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBHD10<- HD10%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD10 <- HD10%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBHD11<- HD11%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD11 <- HD11%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBHD12<- HD12%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD12 <- HD12%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBHD13<- HD13%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD13 <- HD13%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBHD14<- HD14%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD14 <- HD14%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBHD15<- HD15%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBHD15 <- HD15%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))


#

listHD <- list(MOBHD05, MEBHD05, MOBHD06, MEBHD06,MOBHD07, MEBHD07,MOBHD08, MEBHD08, MOBHD09, MEBHD09, MOBHD10, MEBHD10, MOBHD11, MEBHD11, MOBHD12, MEBHD12, MOBHD13, MEBHD13, MOBHD14, MEBHD14, MOBHD15, MEBHD15)

listHD <- Reduce(function(x,y) merge(x,y,all=TRUE), listHD)
par(mfrow = c(1,2))
plot(listHD$Year, listHD$mean_ObservedDR,
     xlab = ("Years, 2005-2015"),
     ylab = ("Death Rate"),
     main = ("Mean Observed Death Rate,
             2005-2015"),
     pch = 19,
     col = "cadetblue3")
plot(listHD$Year, listHD$mean_ExpectedDR,
     xlab = ("Years, 2005 -2015"),
     ylab = ("Death Rate"),
     main = ("Mean Expected Death Rate"),
     pch = 19,
     col="cadetblue3")

The scatter plot here is interesting as the expected is seemingly random at first, but proceeds to indicate a consistent increase in the death rate after 2011. Looking at the observed it seems to have symmetrical and different qualities of the predicted outcome of the expected, based solely on the structure. Most notably the constant increase after 2011, and the value of 2010 concerning the y value.

#Create subset of just stroke related observation ,remove specific observation ,variable, and any function lacking data
splitstroke <- subset(pad, pad$Cause.of.Death %in% c("Stroke"))
splitstroke <- subset(splitstroke, splitstroke$Locality != "All")
splitstroke <- subset(splitstroke, splitstroke$State !="United States")
splitstroke <- splitstroke %>% distinct()
splitstroke<- na.omit(splitstroke)

#add observed/variable of death rate via stroke per 10,000 people, and rename to just stroke 
splitstroke <- splitstroke %>%   mutate(Observed_death_rate=Observed.Deaths/Population*10000)
stroke <- splitstroke

#Expected death rate via, Stroke per 10,000 people 
stroke<- stroke%>%
    mutate(Expected_death_rate=Expected.Deaths/Population*10000)

#Summary and Hist for both observe and Expect death rates

summary(stroke$Observed_death_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09068 0.54650 1.03001 1.28810 1.81680 5.44441

summary(stroke$Expected_death_rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1085  0.3161  0.6610  0.8498  1.2527  3.0527

#Add histogram of both observed and Expected death rate.
par(mfrow = c(1,2))
hist(stroke$Observed_death_rate, 
     col = "Lavender",
     main = ("Observed Death Rate via Storke"),
     xlab =("Death Rate")
     ,ylim=c(0,6000))
hist(stroke$Expected_death_rate,
     main = ("Expected Death Rate via Stroke"),
     xlab = ("Death Rate"),
     col = "Lavender",
     ylim=c(0,6000))

Similar to the previous histogram both are skewed right which indicates a independent mean and meadain.

#Make plot to compare the the two, to do this we will first take the mean/average of the average observed death rate, and expected death rate to make it easier to plot and compare. 

#Make subsets of the group per year, we will later combined the means into one table, to plot
S05<- subset(stroke, Year =="2005")
S06 <- subset(stroke, Year == "2006")
S07 <- subset(stroke, Year == "2007")
S08 <- subset(stroke, Year == "2008")
S09 <- subset(stroke, Year == "2009")
S10 <- subset(stroke, Year == "2010")
S11 <- subset(stroke, Year == "2011")
S12 <- subset(stroke, Year == "2012")
S13 <- subset(stroke, Year == "2013")
S14 <- subset(stroke, Year == "2014")
S15 <- subset(stroke, Year == "2015")

MOBS05<- S05%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS05 <- S05%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))

#
MOBS06<- S06%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS06 <- S06%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBS07<- S07%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS07 <- S07%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBS08<- S08%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS08 <- S08%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBS09<- S09%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS09 <- S09%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBS10<- S10%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS10 <- S10%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBS11<- S11%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS11 <- S11%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBS12<- S12%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS12 <- S12%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBS13<- S13%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS13 <- S13%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBS14<- S14%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS14 <- S14%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))
#
MOBS15<- S15%>% group_by(Year) %>% summarise(mean_ObservedDR = mean(Observed_death_rate))

MEBS15 <- S15%>% group_by(Year) %>%
  summarise(mean_ExpectedDR = mean(Expected_death_rate))


#

listS <- list(MOBS05, MEBS05, MOBS06, MEBS06,MOBS07, MEBS07,MOBS08, MEBS08, MOBS09, MEBS09, MOBS10, MEBS10, MOBS11, MEBS11, MOBS12, MEBS12, MOBS13, MEBS13, MOBS14, MEBS14, MOBS15, MEBS15)

listS <- Reduce(function(x,y) merge(x,y,all=TRUE), listS)
par(mfrow = c(1,2))
plot(listS$Year, listS$mean_ObservedDR,
     xlab = ("Years, 2005-2015"),
     ylab = ("Death Rate"),
     main = ("Mean Observed Death Rate,
             2005-2015"),
     pch = 19,
     col = "cadetblue3")
plot(listS$Year, listS$mean_ExpectedDR,
     main =("Mean Expected Death Rate
            2005-2015"),
     xlab = ("Years, 2005 -2015"),
     ylab = ("Death Rate"),
     pch = 19,
     col="cadetblue3")

Looking at both scatter plots Expected DR seems to show a section of consistency among a fairly low death rate, while observed is much more random and we can even say it shows a decrease in deaths rates before seemingly increasing, however Observed death rate is working/has a higher measured death rate compared to Observed. A trend that is seemingly present within all the scatterplots.

Analysis/summary of Scatterplots and Histograms

Looking at all the histograms, they have a genuine basis of being mainly skewed right, with the strength/prominence varying based on condition. For instance, looking at both histograms for each replicate cause of death, the observation for Chronic lower respiratory disease, resembles a stronger right skewed, compared to its contenders such as cancer. But regardless of the varying strength, likely due to individual factors of each condition, they all share the commonality of being skewed to the right, and having a mean and median independent of one another, with the mean likely swaying based on the outliers of the death rate of each respective condition. Now with regard to the scatter plot, this was primarily done, to get a better visualization of a trend within the death rates, as plotting them as they were initially, would prove difficult to read and interpret. As such, taking the mean/average of the death rate per year would allow us to better visualize and look for a sense of a trend in either a constant increase, decrease, dip, or random trend. With comparison, there was a great deal of variety but also similarities, for instance within observed cancer and Chronic lower respiratory disease scatter plot, a more linear expectation was set out, compared to the random dip-like trend, within strokes and Heart disease. Even more interesting were the findings within observed deaths, primarily with Chronic lower respiratory disease, as it seems to resemble linear growth, compared to most random trends found in others observed death rates. Of course, nowhere near perfect, but, interestingly, the average death rate seems to be increasing, possibly indicating the rise of concern and prevalence of this condition. Overall, the main takeaway should be regarding the y value as the only consistent trend, was the difference in death rates among the expected vs observed, with observed having double the death rate of the expected. With this in mind, we can assume that the mean/average death rate is predominantly higher then anticipated, which leads to exceeding deaths, which in this case will be exceeding deaths due to misdiagnosis.

Affects of Observed vs Expectced Deaths Rates

With the general idea of different deaths rate, I wanted to how notable or big the mean/average difference was, for each respective condition with reagrds to the actual daths. In other words on average how how many deaths were caused due to misdiagnsois, as a resutls of the expected death/ ratio. In this case will use observed deaths as are total, and in a way subtract, to find the averageg diffrence for each year, with respectes to each condtion. But first we want to make sure that these finding would be consider statsically signifcant,

Null: There is no difference in the mean misdiagnosed death between the observed death and the expected death

Alternative: There is a difference in the mean misdiagnosed death between observed death and the expected death

We can use a Wilcox test to see if this would be statically significant before we move on, as are distributions based based of the previous histograms.

#Cancer
wilcox.test(c$Expected.Deaths,c$Observed.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  c$Expected.Deaths and c$Observed.Deaths
## W = 275243320, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -136.0000 -106.9999
## sample estimates:
## difference in location 
##              -121.0001

#
wilcox.test(c$Potentially.Excess.Deaths, c$Observed.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  c$Potentially.Excess.Deaths and c$Observed.Deaths
## W = 146121124, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -597 -563
## sample estimates:
## difference in location 
##                   -580

#
wilcox.test(c$Potentially.Excess.Deaths, c$Expected.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  c$Potentially.Excess.Deaths and c$Expected.Deaths
## W = 167745543, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -401 -375
## sample estimates:
## difference in location 
##                   -388

In all test, we have significant evidence to inductee we would reject the null and even a have a confidence interval that indicate with 95% confidence the mean difference between each measured/compared death.

#From here will do something similar as we did in the first half to plot and compare the observed, expected, and misdiagnosed deaths. and just how notable the mean difference among the three.

OB5 <- subset(c, Year =="2005")
OB6 <- subset(c, Year =="2006")
OB7 <- subset(c, Year =="2007")
OB8 <- subset(c, Year =="2008")
OB9 <- subset(c, Year =="2009")
OB10 <-subset(c, Year =="2010")
OB11 <-subset(c, Year =="2011")
OB12 <-subset(c, Year =="2012")
OB13 <-subset(c, Year =="2013")
OB14 <-subset(c, Year =="2014")
OB15 <-subset(c, Year =="2015")


MOBC05<- c05%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))


#
MOBC06<- c06%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBC07<- c07%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBC08<- c08%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths), mean(Expected.Deaths))

#
MOBC09<- c09%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBC10<- c10%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))
#
MOBC11<- c11%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBC12<- c12%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))
#
MOBC13<- c13%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBC14<- c14%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBC15<- c15%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#

listC <- list(MOBC05,  MOBC06, MOBC07, MOBC08, MOBC09, MOBC10,  MOBC11, MOBC12, MOBC13, MOBC14, MOBC15)

listC <- Reduce(function(x,y) merge(x,y,all=TRUE), listC)

plot(listC$`mean(Potentially.Excess.Deaths)`~listC$Year, ylim=c(0,8000), 
     col = "mistyrose2", xlab = "Year",ylab="Mean/average deaths",main = "Cancer Deaths, 2005-2015", pch =20, cex = 1.8)
par(new=TRUE)
plot(listC$`mean(Expected.Deaths)`~listC$Year,col = "navajowhite1",
     ylim = c(0,8000),xlab="",ylab = "",main = "",pch = 20, cex = 1.8) 
par(new=TRUE)
plot(listC$mean_ObservedD ~ listC$Year,
     col="peachpuff3",ylim=c(0,8000),xlab ="", ylab = "", main = "", pch = 20, cex = 1.8)
legend(x=2009,y=8000, legend =c("Deaths from misdiagnosis", "Expected Deaths", "Observed Deaths"), col= c("mistyrose2","navajowhite1","peachpuff3"),cex=.95,pch = c(20))
abline( h=4651.615, lty=2,col ="darkseagreen3" )
mean(c$Observed.Deaths)

## [1] 4651.615

abline(h=908.2237, lty = 2, col="darkseagreen3")
mean(c$Potentially.Excess.Deaths)

## [1] 908.2237

abline(h=3735.117, lty=2, col ="darkseagreen3")

mean(c$Expected.Deaths)

## [1] 3763.117

Looking at the dot plot, it’s interesting to see the general trend when it comes to the average/mean death in each year, and how they compare to the depicted trend of the overall mean/average for all the years. For instance, going based only on the dot plots, we would imagine the difference in overall deaths due to misdiagnosis to decrease as the expected death seems to progressively catch up to the observed. Looking at the line of best fit, we would see that this seems to support a close difference. Now whether this trend continues or a rise is to come is unknown. But we can conclude that the average mean death difference between expected and misdiagnosed deaths seems to be closing.

From here Will just Repeat the steps in the previous section to see if the same analysis/trends/ideas could be found.

#Chronic Low Respiratory disease
wilcox.test(CLD$Expected.Deaths,CLD$Observed.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  CLD$Expected.Deaths and CLD$Observed.Deaths
## W = 151401303, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -61.99998 -55.00000
## sample estimates:
## difference in location 
##              -58.00001

#
wilcox.test(CLD$Potentially.Excess.Deaths, CLD$Observed.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  CLD$Potentially.Excess.Deaths and CLD$Observed.Deaths
## W = 131258211, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -85.00006 -77.99998
## sample estimates:
## difference in location 
##              -81.00005

#
wilcox.test(CLD$Potentially.Excess.Deaths, CLD$Expected.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  CLD$Potentially.Excess.Deaths and CLD$Expected.Deaths
## W = 176636773, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -16.99997 -13.00000
## sample estimates:
## difference in location 
##              -15.00002

Similar to cancer we would reject the null for chronic lower respiratory disease, and even be provided with a confidence interval for the true difference in mean for each death would be.

OBCLD5 <- subset(CLD, Year =="2005")
OBCLD6 <- subset(CLD, Year =="2006")
OBCLD7 <- subset(CLD, Year =="2007")
OBCLD8 <- subset(CLD, Year =="2008")
OBCLD9 <- subset(CLD, Year =="2009")
OBCLD10 <-subset(CLD, Year =="2010")
OBCLD11 <-subset(CLD, Year =="2011")
OBCLD12 <-subset(CLD, Year =="2012")
OBCLD13 <-subset(CLD, Year =="2013")
OBCLD14 <-subset(CLD, Year =="2014")
OBCLD15 <-subset(CLD, Year =="2015")


MOBCLD05<- CLD5%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))


#
MOBCLD06<- CLD06%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBCLD07<- CLD07%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBCLD08<- CLD08%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths), mean(Expected.Deaths))

#
MOBCLD09<- CLD09%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBCLD10<- CLD10%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))
#
MOBCLD11<- CLD11%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBCLD12<- CLD12%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))
#
MOBCLD13<- CLD13%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBCLD14<- CLD14%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBCLD15<- CLD15%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#

listCLD <- list(MOBCLD05,  MOBCLD06, MOBCLD07, MOBCLD08, MOBCLD09, MOBCLD10,  MOBCLD11, MOBCLD12, MOBCLD13, MOBCLD14, MOBCLD15)

listCLD <- Reduce(function(x,y) merge(x,y,all=TRUE), listCLD)

plot(listCLD$`mean(Potentially.Excess.Deaths)`~listCLD$Year, ylim=c(0,800), 
     col = "mistyrose2", xlab = "Year",ylab="Mean/average deaths",main = "CLRD, 2005-2015", pch =20, cex = 1.8)
par(new=TRUE)
plot(listCLD$`mean(Expected.Deaths)`~listCLD$Year,col = "navajowhite1",
     ylim = c(0,800),xlab="",ylab = "",main = "",pch = 20, cex = 1.8) 
par(new=TRUE)
plot(listCLD$mean_ObservedD ~ listCLD$Year,
     col="peachpuff3",ylim=c(0,800),xlab ="", ylab = "", main = "", pch = 20, cex = 1.8)
legend(x=2006,y=800, legend =c("Deaths from misdiagnosis", "Expected Deaths", "Observed Deaths"), col= c("mistyrose2","navajowhite1","peachpuff3"),cex=.95,pch = c(20))
abline( h=469.0697, lty=2,col ="darkseagreen3" )
mean(CLD$Observed.Deaths)

## [1] 469.0697

abline(h=180.8106, lty = 2, col="darkseagreen3")
mean(CLD$Potentially.Excess.Deaths)

## [1] 180.8106

abline(h=290.3618, lty=2, col ="darkseagreen3")

mean(CLD$Expected.Deaths)

## [1] 290.3618

With this plot, the deaths seem to have a stronger relationship with year compared to cancer for instance, but then again, it’s important to consider the drastic difference in mean deaths. Looking at comparisons it’s good to take note of the overall minimal change in terms of average deaths, and how this can be interpreted. Chances are we could argue with how minimal the cases are, and based on the trends we would say the chronic lower respiratory disease is more likely to get misdiagnosed compared to cancer, for instance.

#Heart Disease
wilcox.test(HD$Expected.Deaths,HD$Observed.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  HD$Expected.Deaths and HD$Observed.Deaths
## W = 238168284, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -159 -138
## sample estimates:
## difference in location 
##              -148.0001

#
wilcox.test(HD$Potentially.Excess.Deaths, HD$Observed.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  HD$Potentially.Excess.Deaths and HD$Observed.Deaths
## W = 166367526, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -345 -321
## sample estimates:
## difference in location 
##                   -333

#
wilcox.test(HD$Potentially.Excess.Deaths, HD$Expected.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  HD$Potentially.Excess.Deaths and HD$Expected.Deaths
## W = 205404610, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -151 -138
## sample estimates:
## difference in location 
##                   -144

We would reject the null, as we are using the same null used in cancer just applying it to Heart Disease. Again were given a conf interval,depicting true mean of difference per death.

OBHD5 <- subset(HD, Year =="2005")
OBHD6 <- subset(HD, Year =="2006")
OBHD7 <- subset(HD, Year =="2007")
OBHD8 <- subset(HD, Year =="2008")
OBHD9 <- subset(HD, Year =="2009")
OBHD10 <-subset(HD, Year =="2010")
OBHD11 <-subset(HD, Year =="2011")
OBHD12 <-subset(HD, Year =="2012")
OBHD13 <-subset(HD, Year =="2013")
OBHD14 <-subset(HD, Year =="2014")
OBHD15 <-subset(HD, Year =="2015")


MOBHD05<- HD05%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))


#
MOBHD06<- HD06%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBHD07<- HD07%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBHD08<- HD08%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths), mean(Expected.Deaths))

#
MOBHD09<- HD09%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBHD10<- HD10%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))
#
MOBHD11<- HD11%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBHD12<- HD12%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))
#
MOBHD13<- HD13%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBHD14<- HD14%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBHD15<- HD15%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#

listHD <- list(MOBHD05,  MOBHD06, MOBHD07, MOBHD08, MOBHD09, MOBHD10,  MOBHD11, MOBHD12, MOBHD13, MOBHD14, MOBHD15)

listHD <- Reduce(function(x,y) merge(x,y,all=TRUE), listHD)

plot(listHD$`mean(Potentially.Excess.Deaths)`~listHD$Year, ylim=c(0,25000), 
     col = "mistyrose2", xlab = "Year",ylab="Mean/average deaths",main = "Heart Disease, 2005-2015", pch =20, cex = 1.8)
par(new=TRUE)
plot(listHD$`mean(Expected.Deaths)`~listHD$Year,col = "navajowhite1",
     ylim = c(0,2500),xlab="",ylab = "",main = "",pch = 20, cex = 1.8) 
par(new=TRUE)
plot(listHD$mean_ObservedD ~ listHD$Year,
     col="peachpuff3",ylim=c(0,2500),xlab ="", ylab = "", main = "", pch = 20, cex = 1.8)
legend(x=2008,y=2500, legend =c("Deaths from misdiagnosis", "Expected Deaths", "Observed Deaths"), col= c("mistyrose2","navajowhite1","peachpuff3"),cex=.95,pch = c(20))
abline( h=1719.259, lty=2,col ="darkseagreen3" )
mean(HD$Observed.Deaths)

## [1] 1719.259

abline(h=560.1418, lty = 2, col="darkseagreen3")
mean(HD$Potentially.Excess.Deaths)

## [1] 560.1418

abline(h=1170.2242, lty=2, col ="darkseagreen3")

mean(HD$Expected.Deaths)

## [1] 1170.242

(note: I’m unsure why the y-value looks like that, focus on the legible values). At first glance, this plot doesn’t seem to show anything primary due to the contradiction located between the plotted points and the mean line. As a reminder, this is for Heart disease and if we look back at the death rate for heart disease the plot seemed entirely random, with that trend leaking/showing in the deaths as well. The reason for plotting it is to primarily compare the deaths to one another. With that in mind, we can conclude based on the mean line or just the mean, there is a notable jump in deaths due to misdiagnosis or referred to as in the dataset “Potentially. Excess. Deaths’ ’. This can likely tell us the qualities of heart disease, primarily either how it is on the rise, or how progressively getting difficult to diagnose and treat.

#Stroke
wilcox.test(stroke$Expected.Deaths,stroke$Observed.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  stroke$Expected.Deaths and stroke$Observed.Deaths
## W = 170732172, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -37.00005 -32.00001
## sample estimates:
## difference in location 
##              -34.99997

#
wilcox.test(stroke$Potentially.Excess.Deaths, stroke$Observed.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  stroke$Potentially.Excess.Deaths and stroke$Observed.Deaths
## W = 119397583, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -76.99997 -71.00003
## sample estimates:
## difference in location 
##              -73.99996

#
wilcox.test(stroke$Potentially.Excess.Deaths, stroke$Expected.Deaths, conf.int = TRUE, conf.level = .95)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  stroke$Potentially.Excess.Deaths and stroke$Expected.Deaths
## W = 151413856, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -32.99996 -29.00005
## sample estimates:
## difference in location 
##              -30.99995

Similar trend to the other, low p value, we reject the null, and have confidence interval of the true mean difference among each death.

OBS5 <- subset(stroke, Year =="2005")
OBS6 <- subset(stroke, Year =="2006")
OBS7 <- subset(stroke, Year =="2007")
OBS8 <- subset(stroke, Year =="2008")
OBS9 <- subset(stroke, Year =="2009")
OBS10 <-subset(stroke, Year =="2010")
OBS11 <-subset(stroke, Year =="2011")
OBS12 <-subset(stroke, Year =="2012")
OBS13 <-subset(stroke, Year =="2013")
OBS14 <-subset(stroke, Year =="2014")
OBS15 <-subset(stroke, Year =="2015")


MOBS05<- S05%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))


#
MOBS06<- S06%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBS07<- S07%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBS08<- S08%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths), mean(Expected.Deaths))

#
MOBS09<- S09%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBS10<- S10%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))
#
MOBS11<- S11%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBS12<- S12%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))
#
MOBS13<- S13%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBS14<- S14%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#
MOBS15<- S15%>% group_by(Year) %>% summarise(mean_ObservedD = mean(Observed.Deaths),mean(Potentially.Excess.Deaths),mean(Expected.Deaths))

#

listS <- list(MOBS05,  MOBS06, MOBS07, MOBS08, MOBS09, MOBS10,  MOBS11, MOBS12, MOBS13, MOBS14, MOBS15)

listS <- Reduce(function(x,y) merge(x,y,all=TRUE), listS)

plot(listS$`mean(Potentially.Excess.Deaths)`~listS$Year, ylim=c(0,550), 
     col = "mistyrose2", xlab = "Year",ylab="Mean/average deaths",main = "Stroke, 2005-2015", pch =20, cex = 1.8)
par(new=TRUE)
plot(listS$`mean(Expected.Deaths)`~listS$Year,col = "navajowhite1",
     ylim = c(0,550),xlab="",ylab = "",main = "",pch = 20, cex = 1.8) 
par(new=TRUE)
plot(listS$mean_ObservedD ~ listS$Year,
     col="peachpuff3",ylim=c(0,550),xlab ="", ylab = "", main = "", pch = 20, cex = 1.8)
legend(x=2010,y=550, legend =c("Deaths from misdiagnosis", "Expected Deaths", "Observed Deaths"), col= c("mistyrose2","navajowhite1","peachpuff3"),cex=.95,pch = c(20))
abline( h=365.0347, lty=2,col ="darkseagreen3" )
mean(stroke$Observed.Deaths)

## [1] 365.0347

abline(h=123.2936, lty = 2, col="darkseagreen3")
mean(stroke$Potentially.Excess.Deaths)

## [1] 123.2936

abline(h=245.4083, lty=2, col ="darkseagreen3")

mean(stroke$Expected.Deaths)

## [1] 245.4083

Compared to the others, we can look at the mean and get a general sense of how the number of deaths due to misdiagnosis, which again is present as excess death within the dataset. What we can take away is that there seems to be a smaller difference compared to the other conditions, either implying a good change of misdiagnosis but a high chance of survival, for instance.

Wrapping It Up

Incorporating both the applied changes and reasoning for using each statistical analysis, our finding could be considered significant, only if applied changes were to be true. In other words, if the excess death was truly due only to misdiagnosis, then yes the findings are significant, but given the actual/outside setting we can’t say this. Again keep in mind that this is an observational study, and the intention I wanted, was to have a cause and effect, an intention observational studies do not contain. So as stated prior, I had to make a given cause and effect with what I had to work with. Now, it’s not to say that if the context were to be true, my analysis would be right, as this is due to several underlying factors that I will touch on momentarily. For starters, personal bias, obviously I didn’t make the data, but I am manipulating it to better fit my argument, along with my lack of proficiency in coding and to degree statistics, they are likely an easier way to do what I did and could lead to different findings. Ideally, I wanted to include so much more, but the method I used initially messed up my ideal plans and made it increasingly difficult to achieve the desired statistical analysis and graphing. With that in mind, I also wanted to touch on the dataset, itself, as it did seem to have some issues, predominantly the missing gaps in the death, where one wouldn’t be filled, but the other recorded deaths were. Along with the weird age ranges, and included summaries within the dataset, it was much more of a hassle to actively use this data set, which was as close as I could have gotten to a dataset representing excess death from misdiagnosis.

So what was the point of doing this process, what was the point if my finding can’t be considered significant outside a special and made context and if I’m not even confident in my process? Well, referring back to the overview, I wanted to explore the effect of misdiagnosis in the context of observed and expected deaths. In other words, how many extra deaths were due to misdiagnosis? This would be the first part of my question that could be answered with the data set. This would also be shown through primary comparisons of the mean/average, of the deaths and the death rates, as I mainly wanted to focus on empathizing that there is indeed a notable and potentially rising concern for misdiagnosed deaths. My primary intention was not to be 100% statistically accurate but to be accurate enough to show that there is some notable and worthwhile difference, an intention that I do believe I reach one way or another. But what about the second part of my question? Keep in mind, the second part of my question was more or so an interpretation, or a general analysis, answered through the excess/misdiagnosed deaths. For instance, let’s say you were experiencing respiratory issues, due to pneumonia, but you got misdiagnosed with asthma. That initial misdiagnosis will lead to the neglect of treating and keeping an eye on the true condition of pneumonia, which will eventually take your life. In general, as I hopefully made clear, the part of my question is answered from the first part and even the initial dataset.

Before I end, I want to explain what the findings/implication could mean, how they affect us, and why we should even care. Let’s start with misdiagnosis, surely there could have been better datasets, that were more specific and were tied down misdiagnosis to a root that could be measured. But that isn’t the case, as some researchers and doctors have touched on, medical errors/misdiagnoses, were not considered to actively play a role in one passing, and were left out of data health analysis (Study Suggests Medical Errors Now Third Leading Cause of Death in the U.S. - 05/03/2016, n.d.). Even now, as methods have changed, they don’t consider or have a consistent way to differentiate and recognize deaths from medical error/misdiagnosis to other more notable conditions. This lack of recognition has caused this to slowly become a more prevalent issue, as other issues within medicine become more and more prominent. Looking at how it affects the general population, this refers more to the link analogy mentioned earlier. The more we neglect misdiagnosis as prevalent in medicine, the less attention it receives and the less of a public health priority it will become, which again will only add to the links causing an increase in excess deaths. For instance, the less attention it would receive, and the more common misdiagnosis becomes, the more a provider shortage is likely to become, the more expensive treating and finding the misdiagnosis becomes, and with this increase in demand and price, the more likely those misdiagnosed in lower socioeconomic status are like to pass due to their missions/medical error.

In short, the prevalence of excess death from medical error/misdiagnosis is very prevalent and seemingly increases based on the measured condition, in general, it is safe to assume that as we move on through the years and new medical problems arise while others get neglected, a misdiagnosis could easily become a leading cause of death. Even now, in some studies where medical error/misdiagnosis was given a measurable quality and compared to other leading causes of death, it only fell behind cancer (Study Suggests Medical Errors Now Third Leading Cause of Death in the U.S. - 05/03/2016, n.d.). I do, however, want to clarify that this is not the fault of a single quality, again a lot of factors go into the qualities of a medical error/misdiagnosed, and not one root cause of it to blame or to fear. This is not to say to be fearful of medicine, but it’s also not to say to trust everything blindly, if possible it never hurts to get a second option and listen to your body.

#14. Refrence

Definition of Metro vs Non-metro. (n.d.). https://gillingscovid19.unc.edu/definitions/metro-vs-non-metro

NCHS - Potentially Excess Deaths from the Five Leading Causes of Death | Data | Centers for Disease Control and Prevention. (2017, August 15). https://data.cdc.gov/NCHS/NCHS-Potentially-Excess-Deaths-from-the-Five-Leadi/vdpk-qzpr

Study Suggests Medical Errors Now Third Leading Cause of Death in the U.S. - 05/03/2016. (n.d.). https://www.hopkinsmedicine.org/news/media/releases/study_suggests_medical_errors_now_third_leading_cause_of_death_in_the_us

Excess Deaths from Misdiagnosis

Angel Ochoa

2022-12-15