Throughout the exam, you will be analyzing the data set named us_contagious_diseases in the R package dslabs. Please submit both the .Rmd and knitted .html files.
(a). How many observations and variables are there? What are the types of each variable?
From the result, we can know that there are 16065 observations and 6 variables in the data set. The variables of "year","weeks_reporting", "count" and "population" are numeric variables, while the "disease" and "state" variables are factor.(b). Compute the frequency of each type of disease, and the 0.15 quantile of the variable population.
#install.packages("dslabs")
library(dslabs)
d <- us_contagious_diseases
str(d)
## 'data.frame': 16065 obs. of 6 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 1966 1967 1968 1969 1970 ...
## $ weeks_reporting: num 50 49 52 49 51 51 45 45 45 46 ...
## $ count : num 321 291 314 380 413 378 342 467 244 286 ...
## $ population : num 3345787 3364130 3386068 3412450 3444165 ...
table(d$disease)
##
## Hepatitis A Measles Mumps Pertussis Polio Rubella
## 2346 3825 1785 2856 2091 1887
## Smallpox
## 1275
quantile(d$population,0.15,na.rm = TRUE)
## 15%
## 670515
Find the top 5 states with the most “Measles” cases over the 10 years from 1991 to 2000 (both years inclusive).
d %>%
filter(year >= 1991, year <= 2000, disease == "Measles") %>%
arrange(desc(count)) %>%
head(5)
For the state of Texas,
ave_count, representing the average count per weeks_reportingd %>%
filter(state == "Texas") %>%
select(count, weeks_reporting) %>%
mutate(ave_count = count/weeks_reporting)
year (x-axis) and ave_count (y-axis), using different colors for different diseases.d %>%
filter(state == "Texas") %>%
mutate(ave_count = count/weeks_reporting) %>%
ggplot(mapping = aes(x = year, y = ave_count, color = disease)) + geom_point() +geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).
(c) Remove all the observations for disease “Measles” and redo the plot in (b).
c <-which(d$disease=="Measles")
without_m <- d[-c,]
without_m %>%
filter(state == "Texas") %>%
mutate(ave_count = count/weeks_reporting) %>%
ggplot(mapping = aes(x = year, y = ave_count, color = disease)) + geom_point() +geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
### 4.
d %>%
group_by(year,state) %>%
summarize(total_count= sum(count))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
count_density, which is defined by the total count of all diseases divided by the population. d %>%
group_by(state, year) %>%
summarize(count_density = sum(count) / sum(population))
## `summarise()` has grouped output by 'state'. You can override using the `.groups` argument.
count_density. d %>%
group_by(state, year) %>%
summarize(count_density = sum(count) / sum(population))%>%
arrange(count_density)
## `summarise()` has grouped output by 'state'. You can override using the `.groups` argument.