The following code helped complete #2 of HW 6: exploring the data set through data inspection
library(tidyverse)
library(dslabs)
glimpse(us_contagious_diseases)
dim(us_contagious_diseases)
tail(us_contagious_diseases)
head(us_contagious_diseases)
us_contagious_diseases %>%
arrange(desc(year))
#the following checks to see if any variables have NA
us_contagious_diseases %>%
filter(is.na(disease))
us_contagious_diseases %>%
filter(is.na(year))
us_contagious_diseases %>%
filter(is.na(weeks_reporting))
us_contagious_diseases %>%
filter(is.na(count))
us_contagious_diseases %>%
filter(is.na(population))
#population is the only variable that has NA
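The five separate filters above can be condensed into a single pass that also covers `state`, which was not checked individually. This is a minimal sketch, assuming `dslabs` and `dplyr` (version 1.0 or later, for `across()`) are installed; `na_counts` is a name introduced here for illustration only.

```r
library(dslabs)
library(dplyr)

# Count missing values in every column at once.
# The result is a one-row data frame with one NA count per variable.
na_counts <- us_contagious_diseases %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Any column with a nonzero entry contains missing values, so the result can be read against the conclusion above without scanning filtered rows by eye.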
max(us_contagious_diseases$weeks_reporting)
min(us_contagious_diseases$weeks_reporting)
#weeks_reporting is the number of weeks in the year for which counts were reported; 52 is the highest and 0 is the lowest
min(us_contagious_diseases$year)
max(us_contagious_diseases$year)
#min and max of year recorded
##Checks to see if leap year changes number of weeks_reporting
us_contagious_diseases %>%
filter(year %% 4 != 0) %>%
arrange(desc(weeks_reporting))
us_contagious_diseases %>%
filter(year %% 4 == 0) %>%
arrange(desc(weeks_reporting))
#therefore, leap year does not change weeks_reporting
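The leap-year check above relies on reading two sorted tables side by side. A grouped summary puts the same comparison in one small table. This is a sketch under the assumption that `dslabs` and `dplyr` are installed; `leap_summary` and its column names are illustrative, not part of the data set.

```r
library(dslabs)
library(dplyr)

# Flag each record as leap year or not, then summarise weeks_reporting.
# year %% 4 == 0 is enough for 1928 to 2011, because the only century
# year in that range (2000) is itself a leap year.
leap_summary <- us_contagious_diseases %>%
  mutate(leap_year = year %% 4 == 0) %>%
  group_by(leap_year) %>%
  summarise(
    max_weeks  = max(weeks_reporting),
    mean_weeks = mean(weeks_reporting),
    n_records  = n()
  )

leap_summary
```

If the two rows show the same maximum and similar means, that supports the conclusion that leap years do not change `weeks_reporting`.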
This data set is `us_contagious_diseases`, from the `dslabs` package. It has 6 variables: `disease`, `state`, `year`, `weeks_reporting`, `count`, and `population`. `weeks_reporting` is the number of weeks in the year for which counts were reported, and it ranges from 0 to 52. `count` is the total number of reported cases of the disease. `population` is the entire human population of the state identified. In conclusion, this data set gives information about contagious diseases in the US, separated by state.
The `us_contagious_diseases` data set is based on data from the Tycho Project in 2013. The first version of this project was created to improve global health by partnering with health institutes and researchers. Together, they worked to improve data standards, machine readability, and availability. By the second version, they had expanded the resources and gathered data on 28 notifiable conditions, as well as dengue-related conditions for 100 countries between 1955 and 2010. This information was obtained from the World Health Organization and national health agencies. In addition, one of the Tycho Project’s featured works, “Unraveling The Social Ecology Of Polio,” uses polio and typhoid fever data and works to advance twentieth-century theories. The Tycho Project’s current Team Science for Data and Health consists of Wilbert van Panhuis, Donald Burke, and Ann Cross. The data are available from the Tycho Project website in three forms: pre-compiled, compile your own, and application programming interface.
Variable Name | Description |
---|---|
disease | A factor containing disease names. |
state | A factor containing state names. |
year | The year reported. |
weeks_reporting | Number of weeks counts were reported that year. |
count | Total number of reported cases. |
population | State population, interpolated for non-census years. |
glimpse(us_contagious_diseases)
## Observations: 18,870
## Variables: 6
## $ disease <fct> Hepatitis A, Hepatitis A, Hepatitis A, Hepatitis…
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Ala…
## $ year <dbl> 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, …
## $ weeks_reporting <int> 50, 49, 52, 49, 51, 51, 45, 45, 45, 46, 50, 43, …
## $ count <dbl> 321, 291, 314, 380, 413, 378, 342, 467, 244, 286…
## $ population <dbl> 3345787, 3364130, 3386068, 3412450, 3444165, 348…
ggplot(data = us_contagious_diseases, aes(x = year, y = count)) +
  geom_point(aes(color = disease)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  labs(title = "Count of Contagious Diseases From 1928 to 2011",
       x = "Year", y = "Count", colour = "Disease") +
  # theme_bw() is a complete theme, so it must come before theme() tweaks
  theme_bw() +
  theme(plot.subtitle = element_text(vjust = 1), plot.caption = element_text(vjust = 1))
Count of Contagious Diseases From 1928 to 2011 is a scatterplot. The variables represented in the scatterplot are `year` and `count`. `year` is the year reported and `count` is the total number of reported cases. The type of disease is further identified by color.
I used these variables because I thought it would be interesting to see how the number of reported cases of contagious diseases in the United States has changed over time, especially in relation to the development of vaccines. The reader should see from this graph that the number of reported cases has decreased strongly over time. Although the decline is not linear, it is evident that both the number of cases and the number of disease types being reported have been greatly reduced compared to 1928.
It is important to recognize how abruptly measles cases dropped during the mid-1960s. The measles vaccine became available in 1963, so this abrupt change is likely due to this medical advancement.
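The measles observation can be checked numerically rather than only by eye. The sketch below sums reported measles cases over all states for each year, then compares the average yearly total before 1963 with the average from 1963 onward. It assumes `dslabs` and `dplyr` are installed; `measles_by_year`, `measles_periods`, and the period labels are names introduced here for illustration. It does not adjust for population or for `weeks_reporting`, so it is only a rough check.

```r
library(dslabs)
library(dplyr)

# Total reported measles cases per year, summed over all states
measles_by_year <- us_contagious_diseases %>%
  filter(disease == "Measles") %>%
  group_by(year) %>%
  summarise(total_cases = sum(count))

# Average yearly total before 1963 vs. from 1963 onward
# (1963 is the vaccine year cited in the text)
measles_periods <- measles_by_year %>%
  mutate(period = if_else(year < 1963, "before 1963", "1963 onward")) %>%
  group_by(period) %>%
  summarise(mean_yearly_cases = mean(total_cases))

measles_periods
```

A much smaller mean for the later period would be consistent with the drop seen in the scatterplot, although it cannot by itself show that the vaccine was the cause.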
ggplot(data = us_contagious_diseases, aes(x = weeks_reporting)) +
  geom_bar() +
  theme_bw() +
  theme(axis.ticks = element_line(colour = "gray10")) +
  labs(title = "Number of Diseases Reported per Week",
       x = "Weeks Reporting", y = "Number of Diseases")
Number of Diseases Reported per Week is a bar chart of `weeks_reporting`: each bar shows how many records from 1928 to 2011 have that number of reporting weeks. I chose `weeks_reporting` to see if there was a seasonal relationship in when individuals had a disease reported.
From this graph, it is evident that more records fall at the high end of `weeks_reporting` than at the low end. However, the single tallest bar is at the very start of the axis. Since this is data collected in the United States, it is tempting to read this as an effect of the colder months, or as organizations catching up on their official paperwork right before their yearly report. However, because `weeks_reporting` is the number of weeks in a year with reported counts, and not the calendar week in which a disease was diagnosed, this graph cannot support either explanation. What it does show is how complete reporting was for each disease, state, and year.
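The variable table describes `weeks_reporting` as the number of weeks counts were reported that year, so each bar in the chart is a count of records (one per disease, state, and year), not a count of cases. The sketch below produces the same tally in table form so the bar heights can be read exactly. It assumes `dslabs` and `dplyr` are installed; `weeks_tally` and `n_records` are illustrative names.

```r
library(dslabs)
library(dplyr)

# Number of records at each value of weeks_reporting,
# sorted so the most common values come first
weeks_tally <- us_contagious_diseases %>%
  count(weeks_reporting, name = "n_records") %>%
  arrange(desc(n_records))

head(weeks_tally)
```

The `name` argument avoids any confusion between the tally column and the data set's own `count` variable.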