The following code helped complete #2 of HW 6: exploring the data set through data inspection

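library(dslabs)     # provides us_contagious_diseases
library(tidyverse)  # provides glimpse(), %>%, the dplyr verbs, and ggplot2

glimpse(us_contagious_diseases)
dim(us_contagious_diseases)
tail(us_contagious_diseases)
head(us_contagious_diseases)

us_contagious_diseases %>%
  arrange(desc(year))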

# The following checks whether any variables have NA
us_contagious_diseases %>%
  filter(is.na(disease))

us_contagious_diseases %>%
  filter(is.na(year))

us_contagious_diseases %>%
  filter(is.na(weeks_reporting))

us_contagious_diseases %>%
  filter(is.na(count))

us_contagious_diseases %>%
  filter(is.na(population))
# population is the only variable checked above that has NA
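
The same check can be written more compactly. The following is a minimal sketch, assuming the dslabs package is installed; it counts the NA values in every column at once, including state, which is not checked individually above. The name na_counts is only illustrative.

```r
library(dslabs)  # provides us_contagious_diseases

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
na_counts <- colSums(is.na(us_contagious_diseases))
na_counts

# Show only the columns that contain at least one NA
na_counts[na_counts > 0]
```

Any column that does not appear in the second result has no missing values.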

max(us_contagious_diseases$weeks_reporting)
min(us_contagious_diseases$weeks_reporting)
# weeks_reporting is the number of weeks in which counts were reported that year; 52 is the highest and 0 is the lowest
min(us_contagious_diseases$year)
max(us_contagious_diseases$year)
# min and max of year recorded

# Check whether leap years change the number of weeks_reporting
us_contagious_diseases %>%
  filter(year %% 4 != 0) %>%
  arrange(desc(weeks_reporting))
us_contagious_diseases %>%
  filter(year %% 4 == 0) %>%
  arrange(desc(weeks_reporting))

# Therefore, leap years do not change weeks_reporting
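
The leap-year comparison can also be made in one summary table instead of reading two sorted outputs. This is a sketch under the assumption that the years run from 1928 to 2011, as the plot titles in this report state; in that range the only century year is 2000, which is a leap year, so year %% 4 == 0 is a valid shortcut. The names leap_summary, leap_year, max_weeks, and mean_weeks are illustrative.

```r
library(dslabs)
library(dplyr)

leap_summary <- us_contagious_diseases %>%
  # TRUE for leap years; the shortcut would misclassify 1900 or 2100,
  # which are assumed not to be in the data
  mutate(leap_year = year %% 4 == 0) %>%
  group_by(leap_year) %>%
  summarise(
    max_weeks  = max(weeks_reporting),
    mean_weeks = mean(weeks_reporting)
  )

leap_summary
```

If max_weeks is the same in both rows, leap years do not raise the upper limit of weeks_reporting.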

Data description

Brief Overview

This data set is us_contagious_diseases from the dslabs package. It has 6 variables: disease, state, year, weeks_reporting, count, and population. weeks_reporting is the number of weeks in which counts were reported that year and ranges from 0 to 52. count is the total number of reported cases. population is the total population of the state identified. In short, this data set gives information about contagious diseases in the US, separated by state.

Description of Data Source

The us_contagious_diseases data set is based on the Tycho Project in 2013. The first version of this project was created to address global health challenges by partnering with health institutes and researchers. Together, they worked to improve data standards, machine readability, and availability. In its second version, the project expanded its resources and gathered data on 28 notifiable conditions, along with dengue-related conditions for 100 countries between 1955 and 2010. This information was obtained from the World Health Organization and national health agencies. In addition, one of the Tycho Project’s featured works, “Unraveling The Social Ecology Of Polio,” draws on polio and typhoid fever data and works to advance twentieth-century theories. The Tycho Project’s current Team Science for Data and Health consists of Wilbert van Panhuis, Donald Burke, and Ann Cross. The Tycho Project website offers three types of access to this data: pre-compiled, compile your own, and application programming interface.

Variable Summary Table

Variable Name     Description
disease           A factor containing disease names.
state             A factor containing state names.
year              The year reported.
weeks_reporting   Number of weeks counts were reported that year.
count             Total number of reported cases.
population        State population, interpolated for non-census years.

Data visualizations

glimpse(us_contagious_diseases)
## Observations: 18,870
## Variables: 6
## $ disease         <fct> Hepatitis A, Hepatitis A, Hepatitis A, Hepatitis…
## $ state           <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala…
## $ year            <dbl> 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, …
## $ weeks_reporting <int> 50, 49, 52, 49, 51, 51, 45, 45, 45, 46, 50, 43, …
## $ count           <dbl> 321, 291, 314, 380, 413, 378, 342, 467, 244, 286…
## $ population      <dbl> 3345787, 3364130, 3386068, 3412450, 3444165, 348…

Scatterplot

The following code provides the scatterplot, Count of Contagious Diseases From 1928 to 2011:
ggplot(data = us_contagious_diseases, aes(x = year, y = count)) +
  geom_point(aes(color = disease)) +
  # theme_bw() comes before theme() so that it does not overwrite the custom settings
  theme_bw() +
  theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1)) +
  labs(title = "Count of Contagious Diseases From 1928 to 2011",
       x = "Year", y = "Count", colour = "Disease") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Analysis of Count of Contagious Diseases From 1928 to 2011

Count of Contagious Diseases From 1928 to 2011 is a scatterplot. The variables represented in the scatterplot are year and count. Year is the year reported and count is the total number of reported cases. The type of disease is further identified by color.

I chose these variables because I thought it would be interesting to see how the number of reported cases of contagious diseases in the United States has changed over time, especially in relation to the development of vaccines. The graph shows that the number of reported cases has decreased strongly over time. Although the decline is not linear, it is evident that both the number of reported cases and the types of contagious diseases reported have been greatly reduced compared to 1928.

It is important to note how abruptly measles cases dropped during the mid-1960s. The measles vaccine became available in 1963, so this abrupt change is likely due to this medical advancement.
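
The drop in measles cases can be checked numerically by summing the reported cases across states for each year around 1963. This is a rough sketch: the 1955 to 1975 window and the names measles_by_year and total_count are my own choices for illustration, and the totals are raw counts that are not adjusted for population or for the number of weeks reporting.

```r
library(dslabs)
library(dplyr)

measles_by_year <- us_contagious_diseases %>%
  # Keep measles records in an illustrative window around the 1963 vaccine
  filter(disease == "Measles", year >= 1955, year <= 1975) %>%
  group_by(year) %>%
  # Total reported cases across all states for each year
  summarise(total_count = sum(count))

measles_by_year
```

Comparing the rows before and after 1963 shows whether the national total falls as sharply as the scatterplot suggests.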

Histogram

The following code provides the Number of Diseases Reported per Week histogram:
ggplot(data = us_contagious_diseases, aes(x = weeks_reporting)) +
  geom_bar() +
  theme_bw() +
  theme(axis.ticks = element_line(colour = "gray10")) +
  labs(title = "Number of Diseases Reported per Week",
       x = "Weeks Reporting", y = "Number of Diseases")

Analysis of Number of Diseases Reported per Week

Number of Diseases Reported per Week is a histogram of weeks_reporting across all records from 1928 to 2011. I chose weeks_reporting to see if there was a seasonal relationship in when individuals had a disease reported. However, as the Variable Summary Table notes, weeks_reporting is the number of weeks in which counts were reported that year, not the week of the year, so each bar counts the disease, state, and year records that have that many reporting weeks.

Read this way, the graph shows that records with many reporting weeks are more common than records with few, while the tallest bar sits at the very low end of the scale. The graph therefore describes how complete the reporting was rather than when in the year cases occurred, and it cannot support a seasonal explanation such as more cases during the colder months. A low weeks_reporting value may simply mean that few weekly counts were submitted for that disease, state, and year, so the case counts in those records should be interpreted with caution.
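
Since the Variable Summary Table defines weeks_reporting as the number of weeks counts were reported that year, a tabulation of its values shows exactly what each bar in the graph counts. This is a minimal sketch; the names weeks_table and n_records are illustrative.

```r
library(dslabs)
library(dplyr)

weeks_table <- us_contagious_diseases %>%
  # One row per distinct weeks_reporting value, with the number of
  # disease-state-year records that have that value
  count(weeks_reporting, name = "n_records") %>%
  arrange(desc(n_records))

# The most common weeks_reporting values
head(weeks_table)
```

Each n_records value corresponds to the height of one bar, so the first rows identify the tallest bars.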