Homework 2

Author

Jingyi Yang

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(readr)
library(readxl)
library(stringr)
library(dplyr)
library(lubridate)
library("khroma")
library(here)

here() starts at C:/8-601

library("ggthemes")
library(gt)

Overview

My final project mainly analyzes the health conditions among children under 18 based on some particular demographics from 1997 to 2018.

Data

The data needed for analysis is downloaded from data.gov. The original data has been published from Centers for Disease Control and Prevention in Jun 2021.

This data set consists of information about the health conditions among children under 18 years old based on particular population characteristics. It can help get information regarding the illness that is prevalent among minors, the percentage of children for each illness based on the different demographic characteristics, and more.

health <- read_csv("Final Project/Health_conditions_among_children_under_age_18__by_selected_characteristics__United_States.csv")

health %>% print(n= 10, width = Inf)

# A tibble: 2,744 × 16
   INDICATOR                                    
   <chr>                                        
 1 Health conditions among children under age 18
 2 Health conditions among children under age 18
 3 Health conditions among children under age 18
 4 Health conditions among children under age 18
 5 Health conditions among children under age 18
 6 Health conditions among children under age 18
 7 Health conditions among children under age 18
 8 Health conditions among children under age 18
 9 Health conditions among children under age 18
10 Health conditions among children under age 18
   PANEL                                       PANEL_NUM
   <chr>                                           <dbl>
 1 Current asthma among persons under 18 years         1
 2 Current asthma among persons under 18 years         1
 3 ADHD among persons under 18 years                   3
 4 ADHD among persons under 18 years                   3
 5 ADHD among persons under 18 years                   3
 6 Current asthma among persons under 18 years         1
 7 Current asthma among persons under 18 years         1
 8 Current asthma among persons under 18 years         1
 9 Current asthma among persons under 18 years         1
10 Current asthma among persons under 18 years         1
   UNIT                       UNIT_NUM STUB_NAME STUB_NAME_NUM STUB_LABEL    
   <chr>                         <dbl> <chr>             <dbl> <chr>         
 1 Percent of children, crude        1 Total                 0 Under 18 years
 2 Percent of children, crude        1 Total                 0 Under 18 years
 3 Percent of children, crude        1 Age                   1 10-17 years   
 4 Percent of children, crude        1 Age                   1 10-17 years   
 5 Percent of children, crude        1 Age                   1 10-17 years   
 6 Percent of children, crude        1 Total                 0 Under 18 years
 7 Percent of children, crude        1 Total                 0 Under 18 years
 8 Percent of children, crude        1 Total                 0 Under 18 years
 9 Percent of children, crude        1 Total                 0 Under 18 years
10 Percent of children, crude        1 Total                 0 Under 18 years
   STUB_LABEL_NUM YEAR      YEAR_NUM AGE            AGE_NUM ESTIMATE    SE FLAG 
            <dbl> <chr>        <dbl> <chr>            <dbl>    <dbl> <dbl> <chr>
 1           0    1997-1999        1 Under 18 years     0       NA    NA   ...  
 2           0    2000-2002        2 Under 18 years     0       NA    NA   ...  
 3           1.22 1997-1999        1 10-17 years        2.2      7.6   0.2 <NA> 
 4           1.22 2000-2002        2 10-17 years        2.2      9     0.3 <NA> 
 5           1.22 2003-2005        3 10-17 years        2.2      8.9   0.3 <NA> 
 6           0    2003-2005        3 Under 18 years     0        8.7   0.2 <NA> 
 7           0    2006-2008        4 Under 18 years     0        9.3   0.2 <NA> 
 8           0    2007-2009        5 Under 18 years     0        9.4   0.2 <NA> 
 9           0    2008-2010        6 Under 18 years     0        9.5   0.2 <NA> 
10           0    2009-2011        7 Under 18 years     0        9.5   0.2 <NA> 
# ℹ 2,734 more rows

Clean the data set

Overview

The data set includes 16 columns. The column “INDICATOR” consists of the information about the whole data set, which is “Health conditions among children under age 18” and is a characteristic variable. The column “PANEL” is a characteristic variable and includes information about various health issues. The column “PANEL_NUM” represents the code for different health issues. The column “UNIT” indicates the number in the data set focused on percentage. “UNIT_NUM” refers to the code representing the characteristic variables in the “UNIT” column. “STUB_NAME” includes different characteristics for the respondents, like age, race, sex, etc., and the “STUB_NAME_NUM” shows the codes related to them. “STUB_LABEL” means the sub-variables related to the characteristics, like different age groups, female and male for the gender, and column “STUB_LABEL_NUM” involves the number representing them. “YEAR” includes information about different year stages, “YEAR_NUM” represents them, “AGE” and “AGE_NUM” include different age stages and numbers represent them, “ESTIMATE” is the number percent of children, “SE” means the standard error. In the “FLAG” column, “—” means the data is not available, and “*” means the estimate might not be reliable. A document provided by the Centers for Disease Control and Prevention (CDC) can provide more information.

To clean the data, delete the columns that include some repeat information, like the number representing the variables, the columns that only include one piece of information, like the “INDICATOR” column, and the rows that contain the not available cells, are necessary. Besides, arranging the table through “PANEL” and “YEAR,” renaming the column, and using the information in the “SE” column (standard error) to calculate the upper and lower percentage information will make the table more in order and easier to understand.

health_clean <- health %>%
  select(- ("PANEL_NUM"),
         - ("UNIT_NUM"),
         -("STUB_NAME_NUM"),
         -("STUB_LABEL_NUM"),
         -("AGE_NUM"),
         - ("INDICATOR"),
         - ("UNIT"),
         - ("AGE"),
         - ("FLAG"))%>%
  arrange(PANEL, YEAR)%>%
  na.omit()%>%
  rename("illness"=`PANEL`, "characteristics"=`STUB_NAME`, "sub_name"=`STUB_LABEL`, "year"= `YEAR`, "year_number"= `YEAR_NUM`, "percentage"=`ESTIMATE`, "se"=`SE`) %>%
   mutate(illness= str_remove(illness, "among persons under 18 years"))%>%
  mutate("percentage_high"= percentage+se, "percentage_low"= percentage-se )
  

health_clean %>% print(n= 10, width = Inf)

# A tibble: 2,516 × 9
   illness characteristics          sub_name                             
   <chr>   <chr>                    <chr>                                
 1 "ADHD " Age                      10-17 years                          
 2 "ADHD " Age                      5-17 years                           
 3 "ADHD " Age                      5-9 years                            
 4 "ADHD " Sex                      Male                                 
 5 "ADHD " Sex                      Female                               
 6 "ADHD " Race                     White only                           
 7 "ADHD " Race                     Black or African American only       
 8 "ADHD " Race                     American Indian or Alaska Native only
 9 "ADHD " Race                     Asian only                           
10 "ADHD " Hispanic origin and race Hispanic or Latino: All races        
   year      year_number percentage    se percentage_high percentage_low
   <chr>           <dbl>      <dbl> <dbl>           <dbl>          <dbl>
 1 1997-1999           1        7.6   0.2             7.8            7.4
 2 1997-1999           1        6.5   0.2             6.7            6.3
 3 1997-1999           1        4.8   0.2             5              4.6
 4 1997-1999           1        9.6   0.3             9.9            9.3
 5 1997-1999           1        3.2   0.2             3.4            3  
 6 1997-1999           1        7.1   0.2             7.3            6.9
 7 1997-1999           1        5     0.3             5.3            4.7
 8 1997-1999           1        8.5   1.9            10.4            6.6
 9 1997-1999           1        1.7   0.5             2.2            1.2
10 1997-1999           1        3.6   0.3             3.9            3.3
# ℹ 2,506 more rows

Insight (Describe the Statistics)

freq_sumary_table_1 <-health_clean %>%
   count(illness)%>%
   mutate(prop= sum(n)/n)%>%
   mutate(prop= prop/100)%>%
   mutate(prop=scales::percent(prop))%>%
  rename(Names=illness)

freq_sumary_table_2 <- health_clean %>%
  count(characteristics)%>%
   mutate(prop= sum(n)/n)%>%
   mutate(prop= prop/100)%>%
   mutate(prop=scales::percent(prop))%>%
  rename(Names=characteristics)

freq_sumary_table_3 <- health_clean %>%
  count(sub_name)%>%
   mutate(prop= sum(n)/n)%>%
   mutate(prop= prop/100)%>%
   mutate(prop=scales::percent(prop)) %>%
  rename(Names=sub_name)

freq_sumary_table_4 <- health_clean %>%
 count(year)%>%
   mutate(prop= sum(n)/n)%>%
   mutate(prop= prop/100)%>%
   mutate(prop=scales::percent(prop))%>%
  rename(Names=year)

freq_sumary_table <-bind_rows(freq_sumary_table_1,freq_sumary_table_2,freq_sumary_table_3, freq_sumary_table_4)

 freq_sumary_table%>%
   gt()

Names	n	prop
ADHD	303	8.304%
Asthma attack in last 12 months	332	7.578%
Current asthma	288	8.736%
Ear infections	332	7.578%
Food allergy	333	7.556%
Hay fever or respiratory allergy	335	7.510%
Serious emotional or behavioral difficulties	258	9.752%
Skin allergy	335	7.510%
Age	406	6.20%
Health insurance status at the time of interview	432	5.82%
Hispanic origin and race	432	5.82%
Percent of poverty level	432	5.82%
Race	516	4.88%
Sex	216	11.65%
Total	82	30.68%
0-4 years	82	30.68%
10-17 years	108	23.30%
100%-199%	108	23.30%
2 or more races	102	24.67%
200%-399%	108	23.30%
400% or more	108	23.30%
5-17 years	108	23.30%
5-9 years	108	23.30%
American Indian or Alaska Native only	92	27.35%
Asian only	105	23.96%
Below 100%	108	23.30%
Black or African American only	108	23.30%
Female	108	23.30%
Hispanic or Latino: All races	108	23.30%
Insured	108	23.30%
Insured: Medicaid	108	23.30%
Insured: Private	108	23.30%
Male	108	23.30%
Native Hawaiian or Other Pacific Islander only	1	2 516.00%
Not Hispanic or Latino: All races	108	23.30%
Not Hispanic or Latino: Black or African American only	108	23.30%
Not Hispanic or Latino: White only	108	23.30%
Under 18 years	82	30.68%
Uninsured	108	23.30%
White only	108	23.30%
1997-1999	135	18.637%
2000-2002	140	17.971%
2003-2005	186	13.527%
2006-2008	184	13.674%
2007-2009	185	13.600%
2008-2010	186	13.527%
2009-2011	187	13.455%
2010-2012	187	13.455%
2011-2013	188	13.383%
2012-2014	187	13.455%
2013-2015	188	13.383%
2014-2016	188	13.383%
2015-2017	188	13.383%
2016-2018	187	13.455%

As the summary for the variable “percentage” cannot clearly reflect the data variation, which does not help to define using mean or median for future analysis, further calculating the mean, median, standard deviation, and range based on various illnesses and demographic characteristics is necessary.

  numerical_summary_table <- health_clean%>%
  group_by (illness, characteristics) %>%
  summarise(mean_percentage= mean(percentage), median_percentage=median(percentage), sd_percentage= sd(percentage),range_percentage= max(percentage)-min(percentage), max_percentage= max(percentage), min_percentage= min(percentage))%>%
mutate(extrem_value_range_percentage_max= mean_percentage+3*sd_percentage, extrem_value_range_percentage_min= mean_percentage-3*sd_percentage)
 
numerical_summary_table %>%
  gt()

characteristics	mean_percentage	median_percentage	sd_percentage	range_percentage	max_percentage	min_percentage	extrem_value_range_percentage_max	extrem_value_range_percentage_min
ADHD
Age	9.004762	9.00	2.2895257	8.1	12.9	4.8	15.873339	2.13618477
Health insurance status at the time of interview	9.042857	9.05	2.6883491	9.0	13.8	4.8	17.107904	0.97780990
Hispanic origin and race	9.235714	9.80	2.7325265	9.8	13.4	3.6	17.433294	1.03813466
Percent of poverty level	9.535714	9.35	1.9163819	7.4	13.5	6.1	15.284860	3.78656867
Race	8.480000	9.40	3.8424520	14.0	15.4	1.4	20.007356	-3.04735605
Sex	9.275000	8.25	3.9694785	11.6	14.8	3.2	21.183435	-2.63343539
Asthma attack in last 12 months
Age	5.226786	5.55	1.0204261	4.1	6.8	2.7	8.288064	2.16550733
Health insurance status at the time of interview	5.051786	5.00	1.1683697	4.4	7.7	3.3	8.556895	1.54667673
Hispanic origin and race	5.592857	5.00	1.4868951	5.2	9.0	3.8	10.053543	1.13217174
Percent of poverty level	5.258929	5.00	1.0405902	3.8	7.5	3.7	8.380699	2.13715803
Race	5.918182	6.00	1.9100470	7.0	9.3	2.3	11.648323	0.18804068
Sex	5.139286	4.90	0.9117908	2.9	6.6	3.7	7.874658	2.40391336
Total	5.157143	5.40	0.4847113	1.3	5.7	4.4	6.611277	3.70300892
Current asthma
Age	8.987500	10.00	2.2236280	7.1	11.1	4.0	15.658384	2.31661614
Health insurance status at the time of interview	8.697917	8.40	1.8200409	6.8	12.4	5.6	14.158039	3.23779406
Hispanic origin and race	9.937500	8.60	2.9670262	9.6	16.4	6.8	18.838579	1.03642129
Percent of poverty level	9.089583	8.55	1.7353937	6.4	12.7	6.3	14.295764	3.88340222
Race	10.203333	10.35	3.4826430	12.5	16.4	3.9	20.651262	-0.24459566
Sex	8.875000	8.75	1.3587878	4.0	11.1	7.1	12.951363	4.79863653
Total	8.916667	8.85	0.4932883	1.4	9.5	8.1	10.396532	7.43680181
Ear infections
Age	5.505357	4.35	3.1997966	11.6	13.7	2.1	15.104747	-4.09403255
Health insurance status at the time of interview	5.439286	5.20	1.3939712	7.3	10.2	2.9	9.621199	1.25737197
Hispanic origin and race	5.278571	5.45	1.0312771	4.3	7.7	3.4	8.372403	2.18474011
Percent of poverty level	5.587500	5.60	1.1395474	4.4	8.3	3.9	9.006142	2.16885793
Race	4.992424	5.00	1.6692984	8.8	10.8	2.0	10.000319	-0.01547092
Sex	5.478571	5.60	0.7804828	3.0	7.3	4.3	7.820020	3.13712296
Total	5.478571	5.55	0.7707924	2.7	7.1	4.4	7.790949	3.16619432
Food allergy
Age	4.991071	5.15	0.9654536	3.5	6.6	3.1	7.887432	2.09471064
Health insurance status at the time of interview	4.769643	4.80	0.9999854	4.6	7.0	2.4	7.769599	1.76968669
Hispanic origin and race	4.891071	4.90	1.1784854	4.6	6.7	2.1	8.426528	1.35561512
Percent of poverty level	4.944643	5.10	1.0350089	4.6	7.6	3.0	8.049670	1.83961604
Race	5.605970	5.70	1.3992293	7.2	10.2	3.0	9.803658	1.40828228
Sex	4.989286	5.25	0.9338646	3.1	6.5	3.4	7.790880	2.18769178
Total	4.992857	5.15	0.9441840	3.0	6.4	3.4	7.825409	2.16030500
Hay fever or respiratory allergy
Age	16.433929	17.70	4.0868871	13.7	21.7	8.0	28.694590	4.17326730
Health insurance status at the time of interview	15.671429	15.30	1.9936913	8.2	19.2	11.0	21.652503	9.69035452
Hispanic origin and race	15.832143	15.85	2.4471956	8.3	19.9	11.6	23.173730	8.49055610
Percent of poverty level	16.158929	15.55	2.3631856	8.5	21.1	12.6	23.248485	9.06937182
Race	15.756522	15.30	2.7640272	11.0	20.9	9.9	24.048603	7.46444004
Sex	16.375000	16.40	1.8128400	5.9	18.9	13.0	21.813520	10.93647998
Total	16.392857	16.55	0.9392831	3.0	17.7	14.7	19.210706	13.57500783
Serious emotional or behavioral difficulties
Age	5.505556	5.60	0.4478166	2.0	6.3	4.3	6.849005	4.16210584
Health insurance status at the time of interview	5.568750	4.90	1.9676270	6.7	9.6	2.9	11.471631	-0.33413088
Hispanic origin and race	5.554167	5.95	0.9213677	3.2	6.6	3.4	8.318270	2.79006356
Percent of poverty level	5.822917	5.20	1.7383337	6.3	9.6	3.3	11.037918	0.60791554
Race	5.927778	5.70	2.4777056	9.5	10.6	1.1	13.360895	-1.50533909
Sex	5.545833	5.25	1.4637591	3.4	7.2	3.8	9.937111	1.15455616
Skin allergy
Age	11.194643	11.40	2.0078410	7.4	14.5	7.1	17.218166	5.17111997
Health insurance status at the time of interview	10.680357	11.00	2.0108270	8.1	13.4	5.3	16.712838	4.64787607
Hispanic origin and race	11.705357	11.30	3.1149738	12.5	18.0	5.5	21.050279	2.36043578
Percent of poverty level	11.132143	11.80	1.7604862	6.4	13.5	7.1	16.413601	5.85068423
Race	12.175362	11.40	3.3056727	14.0	18.1	4.1	22.092381	2.25834409
Sex	11.135714	11.85	1.6742526	5.8	13.1	7.3	16.158472	6.11295655
Total	11.128571	11.85	1.7134888	5.3	12.7	7.4	16.269038	5.98810495

As all the numbers fit in the range mean+/-3(standard deviation), there is no extreme value in the data set, which means using the mean to analyze the percentage of children provided in the data set is appropriate.

Research questions

Which illness is most common? In this particular illness, how does the percentage for demographic characteristics change over time, and What do the demographic characteristics look like?
Which percentage of illness increased most over time? How does the percentage of demographic characteristics change over time in this particular illness?
Based on the “insurance” demographic characteristics, what is the trend of various illnesses over time? Can we define which illness is most common among those in “insured” status and those not?
Can we define the trend of the percentage of illness based on the poverty rate over time? Which illness is most common in various poverty rates?

Question 1

Question 1.1: Which illness is most common?

health_clean %>%
  group_by(illness)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(illness, mean, fill=illness))+
  geom_col(position = "dodge")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
  scale_x_discrete(name= "Illness")+
  scale_y_continuous(limits= range(0,0.3), label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Illness")+
   ggthemes::theme_few()+
  theme(axis.text.x = element_text(angle=90))+
  theme(legend.position = "bottom", legend.key.size = unit(0.1, 'cm'), legend.text = element_text(size=7), legend.title = element_text(size=10))+
  geom_text(aes(label = labels), size=3, vjust=-.5)+
  labs(title = "Mean For Various Illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic above, “Hay fever or respiratory allergy among persons under 18 years” is the most common illness.

Question 1.2 In this particular illness, how does the percentage for demographic characteristics change over time?

health_clean %>%
  filter(str_detect(illness,"respiratory"))%>%
  mutate(percentage= percentage/100)%>%
   ggplot(aes(year_number, percentage, col=sub_name))+
 geom_line()+
  facet_wrap(vars(characteristics), labeller = label_wrap_gen(width=25))+
  ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Year Number")+
  scale_y_continuous(label = scales::percent, name = "Percentage of characteristics")+
  scale_color_discrete(name= "Sub-name of characteristics")+
   guides(col=guide_legend(nrow=6))+
  theme(legend.position = "bottom", legend.key.size = unit(0.1, 'cm'), legend.text = element_text(size=7), legend.title = element_text(size=6))+
  labs(title = "Percentages for various characteristics-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))+
  theme(strip.text = element_text(size=7))

According to the graphic, the trends for all characteristics overtime are going down, except the “Two or more races” in the “Race” category.

Questions 1.3 What do the demographic characteristics look like?

Insurance Level

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "insurance"))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage),mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
scale_x_discrete(name= "Insurance Status")+
  scale_y_continuous(limits= range(0,0.25),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Health insurance status at the time of interview")+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for various insuranced categories and uninsured category-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, the “Insured: Private” have the highest percentage of mean.

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "insurance"))%>%
  mutate(sub_name= str_remove(sub_name, ": Medicaid$|: Private$"))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage),mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
scale_x_discrete(name= "Insurance Status- Insured and Uninsured")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
  scale_y_continuous(limits= range(0,0.25),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Health insruance status")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for insured and uninsured-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, mean percentage for children who are in the Insured status is higher than those who are not.

Poverty Level

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics , "poverty"))%>%
   mutate(sub_name=factor(sub_name,levels=c("Below 100%", "100%-199%", "200%-399%","400% or more"))) %>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
scale_x_discrete(name= "Percent of poverty level")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Percent of poverty level")+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for various poverty level-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, “400% or more” have the highest percentage rate.

Race

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "Race"))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage), median=median(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, median= median/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
scale_x_discrete(name= "Race")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Race")+
   ggthemes::theme_few()+
  guides(fill=guide_legend(nrow=3))+
  theme(axis.text.x = element_text(angle=90))+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for various race-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, “400% or more” have the highest percentage rate.

Sex

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "Sex"))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
scale_x_discrete(name= "Sex")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Sex")+
   ggthemes::theme_few()+
  theme(axis.text.x = element_text(angle=90))+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for Sex-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, the “Male” have higher mean percentage.

Age

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "Age"))%>%
   mutate(sub_name=factor(sub_name,levels=c("0-4 years", "5-17 years","5-9 years", "10-17 years")))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage), median=median(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, median= median/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
scale_x_discrete(name= "Age")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Age")+
   ggthemes::theme_few()+
  theme(axis.text.x = element_text(angle=90))+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for Age-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, the “10-17 years old” have higher mean percentage.

##Question 2

Question 2.1 Which percentage of illness increased most over time?

health_clean %>%
  group_by(illness, year, `year_number`)%>%
  summarise(mean= mean(percentage), median=median(percentage))%>%
  ungroup()%>%
  mutate(mean=mean/100, median= median/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(year_number, mean, col= illness))+
  geom_path()+
  ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Year Number")+
  scale_y_continuous(label = scales::percent, name = "Percentage of illness")+
  scale_color_discrete(name= "Illness")+
   guides(col=guide_legend(nrow=4))+
  theme(legend.position = "bottom")+
  labs(title = "Trends For Various Illness overtime")+
  theme(plot.title = element_text(hjust=0.5))+
  theme(strip.text = element_text(size=7))

According to the graphic,Skin Allergy is the illness that increase most overtime.

Question 2.2 How does the percentage of demographic characteristics change over time in this particular illness?

health_clean %>%
  filter(str_detect(illness,"Skin"))%>%
  mutate(percentage= percentage/100)%>%
   ggplot(aes(year_number, percentage, col=sub_name))+
 geom_line()+
  facet_wrap(vars(characteristics), labeller = label_wrap_gen(width=25), scale= "free")+
  ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Year Number")+
  scale_y_continuous(label = scales::percent, name = "Percentage of characteristics")+
  scale_color_discrete(name= "Sub-name of characteristics")+
   guides(col=guide_legend(nrow=6))+
  theme(legend.position = "bottom", legend.key.size = unit(0.1, 'cm'), legend.text = element_text(size=7), legend.title = element_text(size=6))+
  labs(title = "Percentages for various characteristics")+
  theme(plot.title = element_text(hjust=0.5))+
  theme(strip.text = element_text(size=7))

According to the graphic, almost all the demographic characteristics are in the increased trend overtime.

Question 3

Question 3.1 Based on the “insurance” demographic characteristics, what is the trend of various illnesses over time?

health_clean %>%
  filter(str_detect(characteristics,"insurance")) %>%
  mutate(percentage= percentage/100)%>%
ggplot(aes(year_number, percentage,col=sub_name))+
  geom_line()+
  facet_wrap(vars(illness), labeller = label_wrap_gen(width=25), scales = "free")+
  ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Year Number")+
  scale_y_continuous(label = scales::percent, name = "Percentage of characteristics")+
  scale_color_discrete(name= "Sub-name of characteristics")+
   guides(col=guide_legend(nrow=2))+
  theme(legend.position = "bottom")+
  labs(title = " Percentages for various health insurance status overtime-various illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, besides “Food allergy”, “Skin allergy”, and “ADHD”, other illness are seems in the decline trends overtime.

Question 3.2 Can we define which illness is most common among those in “insured” status and those not?

health_clean %>%
  filter(str_detect(characteristics,"insurance")) %>%
   mutate(sub_name= str_remove(sub_name, ": Medicaid$|: Private$"))%>%
  group_by(illness, sub_name)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100,mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean)) %>%
ggplot(aes(sub_name, mean,fill=sub_name))+
  geom_col(position = "dodge")+
  facet_wrap(vars(illness), labeller = label_wrap_gen(width=25))+
scale_x_discrete(name= "Insurance Status")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = " Mean of percentage")+
  scale_fill_discrete(name= "Insurance Status")+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for various insuance status-various illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, “Hay fever or respiratory allergy” is the most common illness among those in “insured” status and those not.

Question 4

Question 4.1 Can we define the trend of the percentage of illness based on the poverty rate over time?

health_clean %>%
  filter(str_detect(characteristics,"poverty")) %>%
  mutate(percentage= percentage/100)%>%
ggplot(aes(year_number, percentage,col=sub_name))+
  geom_line()+
  facet_wrap(vars(illness), labeller = label_wrap_gen(width=25),scales = "free")+
   ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Percent of poverty level")+
  scale_y_continuous(label = scales::percent, name = "Percentage of characteristics")+
  scale_color_discrete(name= "Sub-name of characteristics")+
   guides(col=guide_legend(nrow=2))+
  theme(legend.position = "bottom")+
  labs(title = "Percentages for various poverty level overtime- various illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, besides “Food allergy” and “Skin allergy”, other illness are seems in the decline trends overtime.

Question 4.2 Which illness is most common in various poverty rates?

health_clean %>%
  filter(str_detect(characteristics,"poverty")) %>%
   mutate(sub_name=factor(sub_name,levels=c("Below 100%", "100%-199%", "200%-399%","400% or more")))%>%
  group_by(illness, sub_name)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
ggplot(aes(sub_name, mean,fill=sub_name))+
  geom_col(position = "dodge")+
  facet_wrap(vars(illness), labeller = label_wrap_gen(width=25))+
scale_x_discrete(name= "Percent of poverty level")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = " Mean of percentage")+
  scale_fill_discrete(name= "Percent of poverty level")+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=3, vjust=-1)+
  theme(axis.text.x = element_text(angle=90))+
  labs(title = "Mean percentage for various poverty level-various illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, “Hay fever or respiratory allergy” is the most common illness among those in various poverty level.

Reflection for visualization

I think my visualization answers all the research questions I have so far. However, it might improved by making the insight part visualized and putting them into a same or multiple graphic.

Another thing that can improved for visualization is 1) adding a label for the liner graphic, especially the graphic “Trends For Various Illness Over time,” and 2) making the color differences more obvious when the legend contains a large amount of the variables. It is challenging, and more research is needed to achieve it, but it will make the graphic clearer to the viewer.

For ways to improve, I do search some articles online, like adding lable and expanding the color palette. As it is not working well, I might seek more materials or even ask for tutoring.