Homework 2

Author

Jingyi Yang

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(readr)
library(readxl)
library(stringr)
library(dplyr)
library(lubridate)
library("khroma")
library(here)
here() starts at C:/8-601
library("ggthemes")
library(gt)

Overview

My final project mainly analyzes the health conditions among children under 18 based on some particular demographics from 1997 to 2018.

Data

The data needed for analysis is downloaded from data.gov. The original data has been published from Centers for Disease Control and Prevention in Jun 2021.

This data set consists of information about the health conditions among children under 18 years old based on particular population characteristics. It can help get information regarding the illness that is prevalent among minors, the percentage of children for each illness based on the different demographic characteristics, and more.

health <- read_csv("Final Project/Health_conditions_among_children_under_age_18__by_selected_characteristics__United_States.csv")

health %>% print(n= 10, width = Inf)
# A tibble: 2,744 × 16
   INDICATOR                                    
   <chr>                                        
 1 Health conditions among children under age 18
 2 Health conditions among children under age 18
 3 Health conditions among children under age 18
 4 Health conditions among children under age 18
 5 Health conditions among children under age 18
 6 Health conditions among children under age 18
 7 Health conditions among children under age 18
 8 Health conditions among children under age 18
 9 Health conditions among children under age 18
10 Health conditions among children under age 18
   PANEL                                       PANEL_NUM
   <chr>                                           <dbl>
 1 Current asthma among persons under 18 years         1
 2 Current asthma among persons under 18 years         1
 3 ADHD among persons under 18 years                   3
 4 ADHD among persons under 18 years                   3
 5 ADHD among persons under 18 years                   3
 6 Current asthma among persons under 18 years         1
 7 Current asthma among persons under 18 years         1
 8 Current asthma among persons under 18 years         1
 9 Current asthma among persons under 18 years         1
10 Current asthma among persons under 18 years         1
   UNIT                       UNIT_NUM STUB_NAME STUB_NAME_NUM STUB_LABEL    
   <chr>                         <dbl> <chr>             <dbl> <chr>         
 1 Percent of children, crude        1 Total                 0 Under 18 years
 2 Percent of children, crude        1 Total                 0 Under 18 years
 3 Percent of children, crude        1 Age                   1 10-17 years   
 4 Percent of children, crude        1 Age                   1 10-17 years   
 5 Percent of children, crude        1 Age                   1 10-17 years   
 6 Percent of children, crude        1 Total                 0 Under 18 years
 7 Percent of children, crude        1 Total                 0 Under 18 years
 8 Percent of children, crude        1 Total                 0 Under 18 years
 9 Percent of children, crude        1 Total                 0 Under 18 years
10 Percent of children, crude        1 Total                 0 Under 18 years
   STUB_LABEL_NUM YEAR      YEAR_NUM AGE            AGE_NUM ESTIMATE    SE FLAG 
            <dbl> <chr>        <dbl> <chr>            <dbl>    <dbl> <dbl> <chr>
 1           0    1997-1999        1 Under 18 years     0       NA    NA   ...  
 2           0    2000-2002        2 Under 18 years     0       NA    NA   ...  
 3           1.22 1997-1999        1 10-17 years        2.2      7.6   0.2 <NA> 
 4           1.22 2000-2002        2 10-17 years        2.2      9     0.3 <NA> 
 5           1.22 2003-2005        3 10-17 years        2.2      8.9   0.3 <NA> 
 6           0    2003-2005        3 Under 18 years     0        8.7   0.2 <NA> 
 7           0    2006-2008        4 Under 18 years     0        9.3   0.2 <NA> 
 8           0    2007-2009        5 Under 18 years     0        9.4   0.2 <NA> 
 9           0    2008-2010        6 Under 18 years     0        9.5   0.2 <NA> 
10           0    2009-2011        7 Under 18 years     0        9.5   0.2 <NA> 
# ℹ 2,734 more rows

Clean the data set

Overview

The data set includes 16 columns. The column “INDICATOR” consists of the information about the whole data set, which is “Health conditions among children under age 18” and is a characteristic variable. The column “PANEL” is a characteristic variable and includes information about various health issues. The column “PANEL_NUM” represents the code for different health issues. The column “UNIT” indicates the number in the data set focused on percentage. “UNIT_NUM” refers to the code representing the characteristic variables in the “UNIT” column. “STUB_NAME” includes different characteristics for the respondents, like age, race, sex, etc., and the “STUB_NAME_NUM” shows the codes related to them. “STUB_LABEL” means the sub-variables related to the characteristics, like different age groups, female and male for the gender, and column “STUB_LABEL_NUM” involves the number representing them. “YEAR” includes information about different year stages, “YEAR_NUM” represents them, “AGE” and “AGE_NUM” include different age stages and numbers represent them, “ESTIMATE” is the number percent of children, “SE” means the standard error. In the “FLAG” column, “—” means the data is not available, and “*” means the estimate might not be reliable. A document provided by the Centers for Disease Control and Prevention (CDC) can provide more information.

To clean the data, delete the columns that include some repeat information, like the number representing the variables, the columns that only include one piece of information, like the “INDICATOR” column, and the rows that contain the not available cells, are necessary. Besides, arranging the table through “PANEL” and “YEAR,” renaming the column, and using the information in the “SE” column (standard error) to calculate the upper and lower percentage information will make the table more in order and easier to understand.

health_clean <- health %>%
  select(- ("PANEL_NUM"),
         - ("UNIT_NUM"),
         -("STUB_NAME_NUM"),
         -("STUB_LABEL_NUM"),
         -("AGE_NUM"),
         - ("INDICATOR"),
         - ("UNIT"),
         - ("AGE"),
         - ("FLAG"))%>%
  arrange(PANEL, YEAR)%>%
  na.omit()%>%
  rename("illness"=`PANEL`, "characteristics"=`STUB_NAME`, "sub_name"=`STUB_LABEL`, "year"= `YEAR`, "year_number"= `YEAR_NUM`, "percentage"=`ESTIMATE`, "se"=`SE`) %>%
   mutate(illness= str_remove(illness, "among persons under 18 years"))%>%
  mutate("percentage_high"= percentage+se, "percentage_low"= percentage-se )
  

health_clean %>% print(n= 10, width = Inf)
# A tibble: 2,516 × 9
   illness characteristics          sub_name                             
   <chr>   <chr>                    <chr>                                
 1 "ADHD " Age                      10-17 years                          
 2 "ADHD " Age                      5-17 years                           
 3 "ADHD " Age                      5-9 years                            
 4 "ADHD " Sex                      Male                                 
 5 "ADHD " Sex                      Female                               
 6 "ADHD " Race                     White only                           
 7 "ADHD " Race                     Black or African American only       
 8 "ADHD " Race                     American Indian or Alaska Native only
 9 "ADHD " Race                     Asian only                           
10 "ADHD " Hispanic origin and race Hispanic or Latino: All races        
   year      year_number percentage    se percentage_high percentage_low
   <chr>           <dbl>      <dbl> <dbl>           <dbl>          <dbl>
 1 1997-1999           1        7.6   0.2             7.8            7.4
 2 1997-1999           1        6.5   0.2             6.7            6.3
 3 1997-1999           1        4.8   0.2             5              4.6
 4 1997-1999           1        9.6   0.3             9.9            9.3
 5 1997-1999           1        3.2   0.2             3.4            3  
 6 1997-1999           1        7.1   0.2             7.3            6.9
 7 1997-1999           1        5     0.3             5.3            4.7
 8 1997-1999           1        8.5   1.9            10.4            6.6
 9 1997-1999           1        1.7   0.5             2.2            1.2
10 1997-1999           1        3.6   0.3             3.9            3.3
# ℹ 2,506 more rows

Insight (Describe the Statistics)

freq_sumary_table_1 <-health_clean %>%
   count(illness)%>%
   mutate(prop= sum(n)/n)%>%
   mutate(prop= prop/100)%>%
   mutate(prop=scales::percent(prop))%>%
  rename(Names=illness)

freq_sumary_table_2 <- health_clean %>%
  count(characteristics)%>%
   mutate(prop= sum(n)/n)%>%
   mutate(prop= prop/100)%>%
   mutate(prop=scales::percent(prop))%>%
  rename(Names=characteristics)

freq_sumary_table_3 <- health_clean %>%
  count(sub_name)%>%
   mutate(prop= sum(n)/n)%>%
   mutate(prop= prop/100)%>%
   mutate(prop=scales::percent(prop)) %>%
  rename(Names=sub_name)

freq_sumary_table_4 <- health_clean %>%
 count(year)%>%
   mutate(prop= sum(n)/n)%>%
   mutate(prop= prop/100)%>%
   mutate(prop=scales::percent(prop))%>%
  rename(Names=year)

freq_sumary_table <-bind_rows(freq_sumary_table_1,freq_sumary_table_2,freq_sumary_table_3, freq_sumary_table_4)

 freq_sumary_table%>%
   gt()
Names n prop
ADHD 303 8.304%
Asthma attack in last 12 months 332 7.578%
Current asthma 288 8.736%
Ear infections 332 7.578%
Food allergy 333 7.556%
Hay fever or respiratory allergy 335 7.510%
Serious emotional or behavioral difficulties 258 9.752%
Skin allergy 335 7.510%
Age 406 6.20%
Health insurance status at the time of interview 432 5.82%
Hispanic origin and race 432 5.82%
Percent of poverty level 432 5.82%
Race 516 4.88%
Sex 216 11.65%
Total 82 30.68%
0-4 years 82 30.68%
10-17 years 108 23.30%
100%-199% 108 23.30%
2 or more races 102 24.67%
200%-399% 108 23.30%
400% or more 108 23.30%
5-17 years 108 23.30%
5-9 years 108 23.30%
American Indian or Alaska Native only 92 27.35%
Asian only 105 23.96%
Below 100% 108 23.30%
Black or African American only 108 23.30%
Female 108 23.30%
Hispanic or Latino: All races 108 23.30%
Insured 108 23.30%
Insured: Medicaid 108 23.30%
Insured: Private 108 23.30%
Male 108 23.30%
Native Hawaiian or Other Pacific Islander only 1 2 516.00%
Not Hispanic or Latino: All races 108 23.30%
Not Hispanic or Latino: Black or African American only 108 23.30%
Not Hispanic or Latino: White only 108 23.30%
Under 18 years 82 30.68%
Uninsured 108 23.30%
White only 108 23.30%
1997-1999 135 18.637%
2000-2002 140 17.971%
2003-2005 186 13.527%
2006-2008 184 13.674%
2007-2009 185 13.600%
2008-2010 186 13.527%
2009-2011 187 13.455%
2010-2012 187 13.455%
2011-2013 188 13.383%
2012-2014 187 13.455%
2013-2015 188 13.383%
2014-2016 188 13.383%
2015-2017 188 13.383%
2016-2018 187 13.455%

As the summary for the variable “percentage” cannot clearly reflect the data variation, which does not help to define using mean or median for future analysis, further calculating the mean, median, standard deviation, and range based on various illnesses and demographic characteristics is necessary.

  numerical_summary_table <- health_clean%>%
  group_by (illness, characteristics) %>%
  summarise(mean_percentage= mean(percentage), median_percentage=median(percentage), sd_percentage= sd(percentage),range_percentage= max(percentage)-min(percentage), max_percentage= max(percentage), min_percentage= min(percentage))%>%
mutate(extrem_value_range_percentage_max= mean_percentage+3*sd_percentage, extrem_value_range_percentage_min= mean_percentage-3*sd_percentage)
 
numerical_summary_table %>%
  gt()
characteristics mean_percentage median_percentage sd_percentage range_percentage max_percentage min_percentage extrem_value_range_percentage_max extrem_value_range_percentage_min
ADHD
Age 9.004762 9.00 2.2895257 8.1 12.9 4.8 15.873339 2.13618477
Health insurance status at the time of interview 9.042857 9.05 2.6883491 9.0 13.8 4.8 17.107904 0.97780990
Hispanic origin and race 9.235714 9.80 2.7325265 9.8 13.4 3.6 17.433294 1.03813466
Percent of poverty level 9.535714 9.35 1.9163819 7.4 13.5 6.1 15.284860 3.78656867
Race 8.480000 9.40 3.8424520 14.0 15.4 1.4 20.007356 -3.04735605
Sex 9.275000 8.25 3.9694785 11.6 14.8 3.2 21.183435 -2.63343539
Asthma attack in last 12 months
Age 5.226786 5.55 1.0204261 4.1 6.8 2.7 8.288064 2.16550733
Health insurance status at the time of interview 5.051786 5.00 1.1683697 4.4 7.7 3.3 8.556895 1.54667673
Hispanic origin and race 5.592857 5.00 1.4868951 5.2 9.0 3.8 10.053543 1.13217174
Percent of poverty level 5.258929 5.00 1.0405902 3.8 7.5 3.7 8.380699 2.13715803
Race 5.918182 6.00 1.9100470 7.0 9.3 2.3 11.648323 0.18804068
Sex 5.139286 4.90 0.9117908 2.9 6.6 3.7 7.874658 2.40391336
Total 5.157143 5.40 0.4847113 1.3 5.7 4.4 6.611277 3.70300892
Current asthma
Age 8.987500 10.00 2.2236280 7.1 11.1 4.0 15.658384 2.31661614
Health insurance status at the time of interview 8.697917 8.40 1.8200409 6.8 12.4 5.6 14.158039 3.23779406
Hispanic origin and race 9.937500 8.60 2.9670262 9.6 16.4 6.8 18.838579 1.03642129
Percent of poverty level 9.089583 8.55 1.7353937 6.4 12.7 6.3 14.295764 3.88340222
Race 10.203333 10.35 3.4826430 12.5 16.4 3.9 20.651262 -0.24459566
Sex 8.875000 8.75 1.3587878 4.0 11.1 7.1 12.951363 4.79863653
Total 8.916667 8.85 0.4932883 1.4 9.5 8.1 10.396532 7.43680181
Ear infections
Age 5.505357 4.35 3.1997966 11.6 13.7 2.1 15.104747 -4.09403255
Health insurance status at the time of interview 5.439286 5.20 1.3939712 7.3 10.2 2.9 9.621199 1.25737197
Hispanic origin and race 5.278571 5.45 1.0312771 4.3 7.7 3.4 8.372403 2.18474011
Percent of poverty level 5.587500 5.60 1.1395474 4.4 8.3 3.9 9.006142 2.16885793
Race 4.992424 5.00 1.6692984 8.8 10.8 2.0 10.000319 -0.01547092
Sex 5.478571 5.60 0.7804828 3.0 7.3 4.3 7.820020 3.13712296
Total 5.478571 5.55 0.7707924 2.7 7.1 4.4 7.790949 3.16619432
Food allergy
Age 4.991071 5.15 0.9654536 3.5 6.6 3.1 7.887432 2.09471064
Health insurance status at the time of interview 4.769643 4.80 0.9999854 4.6 7.0 2.4 7.769599 1.76968669
Hispanic origin and race 4.891071 4.90 1.1784854 4.6 6.7 2.1 8.426528 1.35561512
Percent of poverty level 4.944643 5.10 1.0350089 4.6 7.6 3.0 8.049670 1.83961604
Race 5.605970 5.70 1.3992293 7.2 10.2 3.0 9.803658 1.40828228
Sex 4.989286 5.25 0.9338646 3.1 6.5 3.4 7.790880 2.18769178
Total 4.992857 5.15 0.9441840 3.0 6.4 3.4 7.825409 2.16030500
Hay fever or respiratory allergy
Age 16.433929 17.70 4.0868871 13.7 21.7 8.0 28.694590 4.17326730
Health insurance status at the time of interview 15.671429 15.30 1.9936913 8.2 19.2 11.0 21.652503 9.69035452
Hispanic origin and race 15.832143 15.85 2.4471956 8.3 19.9 11.6 23.173730 8.49055610
Percent of poverty level 16.158929 15.55 2.3631856 8.5 21.1 12.6 23.248485 9.06937182
Race 15.756522 15.30 2.7640272 11.0 20.9 9.9 24.048603 7.46444004
Sex 16.375000 16.40 1.8128400 5.9 18.9 13.0 21.813520 10.93647998
Total 16.392857 16.55 0.9392831 3.0 17.7 14.7 19.210706 13.57500783
Serious emotional or behavioral difficulties
Age 5.505556 5.60 0.4478166 2.0 6.3 4.3 6.849005 4.16210584
Health insurance status at the time of interview 5.568750 4.90 1.9676270 6.7 9.6 2.9 11.471631 -0.33413088
Hispanic origin and race 5.554167 5.95 0.9213677 3.2 6.6 3.4 8.318270 2.79006356
Percent of poverty level 5.822917 5.20 1.7383337 6.3 9.6 3.3 11.037918 0.60791554
Race 5.927778 5.70 2.4777056 9.5 10.6 1.1 13.360895 -1.50533909
Sex 5.545833 5.25 1.4637591 3.4 7.2 3.8 9.937111 1.15455616
Skin allergy
Age 11.194643 11.40 2.0078410 7.4 14.5 7.1 17.218166 5.17111997
Health insurance status at the time of interview 10.680357 11.00 2.0108270 8.1 13.4 5.3 16.712838 4.64787607
Hispanic origin and race 11.705357 11.30 3.1149738 12.5 18.0 5.5 21.050279 2.36043578
Percent of poverty level 11.132143 11.80 1.7604862 6.4 13.5 7.1 16.413601 5.85068423
Race 12.175362 11.40 3.3056727 14.0 18.1 4.1 22.092381 2.25834409
Sex 11.135714 11.85 1.6742526 5.8 13.1 7.3 16.158472 6.11295655
Total 11.128571 11.85 1.7134888 5.3 12.7 7.4 16.269038 5.98810495

As all the numbers fit in the range mean+/-3(standard deviation), there is no extreme value in the data set, which means using the mean to analyze the percentage of children provided in the data set is appropriate.

Research questions

  1. Which illness is most common? In this particular illness, how does the percentage for demographic characteristics change over time, and What do the demographic characteristics look like?

  2. Which percentage of illness increased most over time? How does the percentage of demographic characteristics change over time in this particular illness?

  3. Based on the “insurance” demographic characteristics, what is the trend of various illnesses over time? Can we define which illness is most common among those in “insured” status and those not?

  4. Can we define the trend of the percentage of illness based on the poverty rate over time? Which illness is most common in various poverty rates?

Question 1

Question 1.1: Which illness is most common?

health_clean %>%
  group_by(illness)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(illness, mean, fill=illness))+
  geom_col(position = "dodge")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
  scale_x_discrete(name= "Illness")+
  scale_y_continuous(limits= range(0,0.3), label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Illness")+
   ggthemes::theme_few()+
  theme(axis.text.x = element_text(angle=90))+
  theme(legend.position = "bottom", legend.key.size = unit(0.1, 'cm'), legend.text = element_text(size=7), legend.title = element_text(size=10))+
  geom_text(aes(label = labels), size=3, vjust=-.5)+
  labs(title = "Mean For Various Illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic above, “Hay fever or respiratory allergy among persons under 18 years” is the most common illness.

Question 1.2 In this particular illness, how does the percentage for demographic characteristics change over time?

health_clean %>%
  filter(str_detect(illness,"respiratory"))%>%
  mutate(percentage= percentage/100)%>%
   ggplot(aes(year_number, percentage, col=sub_name))+
 geom_line()+
  facet_wrap(vars(characteristics), labeller = label_wrap_gen(width=25))+
  ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Year Number")+
  scale_y_continuous(label = scales::percent, name = "Percentage of characteristics")+
  scale_color_discrete(name= "Sub-name of characteristics")+
   guides(col=guide_legend(nrow=6))+
  theme(legend.position = "bottom", legend.key.size = unit(0.1, 'cm'), legend.text = element_text(size=7), legend.title = element_text(size=6))+
  labs(title = "Percentages for various characteristics-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))+
  theme(strip.text = element_text(size=7))

According to the graphic, the trends for all characteristics overtime are going down, except the “Two or more races” in the “Race” category.

Questions 1.3 What do the demographic characteristics look like?

Insurance Level

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "insurance"))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage),mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
scale_x_discrete(name= "Insurance Status")+
  scale_y_continuous(limits= range(0,0.25),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Health insurance status at the time of interview")+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for various insuranced categories and uninsured category-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, the “Insured: Private” have the highest percentage of mean.

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "insurance"))%>%
  mutate(sub_name= str_remove(sub_name, ": Medicaid$|: Private$"))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage),mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
scale_x_discrete(name= "Insurance Status- Insured and Uninsured")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
  scale_y_continuous(limits= range(0,0.25),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Health insruance status")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for insured and uninsured-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, mean percentage for children who are in the Insured status is higher than those who are not.

Poverty Level

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics , "poverty"))%>%
   mutate(sub_name=factor(sub_name,levels=c("Below 100%", "100%-199%", "200%-399%","400% or more"))) %>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
scale_x_discrete(name= "Percent of poverty level")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Percent of poverty level")+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for various poverty level-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, “400% or more” have the highest percentage rate.

Race

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "Race"))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage), median=median(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, median= median/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
scale_x_discrete(name= "Race")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Race")+
   ggthemes::theme_few()+
  guides(fill=guide_legend(nrow=3))+
  theme(axis.text.x = element_text(angle=90))+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for various race-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, “400% or more” have the highest percentage rate.

Sex

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "Sex"))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
scale_x_discrete(name= "Sex")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Sex")+
   ggthemes::theme_few()+
  theme(axis.text.x = element_text(angle=90))+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for Sex-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, the “Male” have higher mean percentage.

Age

health_clean %>%
  filter(str_detect(illness,"respiratory")& str_detect(characteristics, "Age"))%>%
   mutate(sub_name=factor(sub_name,levels=c("0-4 years", "5-17 years","5-9 years", "10-17 years")))%>%
  group_by(sub_name)%>%
  summarise(mean= mean(percentage), median=median(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, median= median/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(sub_name, mean, fill=sub_name))+
  geom_col(position = "dodge")+
  geom_errorbar(aes(ymin=mean_low,ymax=mean_high))+
scale_x_discrete(name= "Age")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = "Percentage of mean")+
  scale_fill_discrete(name= "Age")+
   ggthemes::theme_few()+
  theme(axis.text.x = element_text(angle=90))+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for Age-Hay fever or respiratory allergy")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, the “10-17 years old” have higher mean percentage.

##Question 2

Question 2.1 Which percentage of illness increased most over time?

health_clean %>%
  group_by(illness, year, `year_number`)%>%
  summarise(mean= mean(percentage), median=median(percentage))%>%
  ungroup()%>%
  mutate(mean=mean/100, median= median/100)%>%
  mutate(labels=scales::percent(mean))%>%
  ggplot(aes(year_number, mean, col= illness))+
  geom_path()+
  ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Year Number")+
  scale_y_continuous(label = scales::percent, name = "Percentage of illness")+
  scale_color_discrete(name= "Illness")+
   guides(col=guide_legend(nrow=4))+
  theme(legend.position = "bottom")+
  labs(title = "Trends For Various Illness overtime")+
  theme(plot.title = element_text(hjust=0.5))+
  theme(strip.text = element_text(size=7))

According to the graphic,Skin Allergy is the illness that increase most overtime.

Question 2.2 How does the percentage of demographic characteristics change over time in this particular illness?

health_clean %>%
  filter(str_detect(illness,"Skin"))%>%
  mutate(percentage= percentage/100)%>%
   ggplot(aes(year_number, percentage, col=sub_name))+
 geom_line()+
  facet_wrap(vars(characteristics), labeller = label_wrap_gen(width=25), scale= "free")+
  ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Year Number")+
  scale_y_continuous(label = scales::percent, name = "Percentage of characteristics")+
  scale_color_discrete(name= "Sub-name of characteristics")+
   guides(col=guide_legend(nrow=6))+
  theme(legend.position = "bottom", legend.key.size = unit(0.1, 'cm'), legend.text = element_text(size=7), legend.title = element_text(size=6))+
  labs(title = "Percentages for various characteristics")+
  theme(plot.title = element_text(hjust=0.5))+
  theme(strip.text = element_text(size=7))

According to the graphic, almost all the demographic characteristics are in the increased trend overtime.

Question 3

Question 3.1 Based on the “insurance” demographic characteristics, what is the trend of various illnesses over time?

health_clean %>%
  filter(str_detect(characteristics,"insurance")) %>%
  mutate(percentage= percentage/100)%>%
ggplot(aes(year_number, percentage,col=sub_name))+
  geom_line()+
  facet_wrap(vars(illness), labeller = label_wrap_gen(width=25), scales = "free")+
  ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Year Number")+
  scale_y_continuous(label = scales::percent, name = "Percentage of characteristics")+
  scale_color_discrete(name= "Sub-name of characteristics")+
   guides(col=guide_legend(nrow=2))+
  theme(legend.position = "bottom")+
  labs(title = " Percentages for various health insurance status overtime-various illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, besides “Food allergy”, “Skin allergy”, and “ADHD”, other illness are seems in the decline trends overtime.

Question 3.2 Can we define which illness is most common among those in “insured” status and those not?

health_clean %>%
  filter(str_detect(characteristics,"insurance")) %>%
   mutate(sub_name= str_remove(sub_name, ": Medicaid$|: Private$"))%>%
  group_by(illness, sub_name)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100,mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean)) %>%
ggplot(aes(sub_name, mean,fill=sub_name))+
  geom_col(position = "dodge")+
  facet_wrap(vars(illness), labeller = label_wrap_gen(width=25))+
scale_x_discrete(name= "Insurance Status")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = " Mean of percentage")+
  scale_fill_discrete(name= "Insurance Status")+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=5, vjust=-1)+
  labs(title = "Mean percentage for various insuance status-various illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, “Hay fever or respiratory allergy” is the most common illness among those in “insured” status and those not.

Question 4

Question 4.1 Can we define the trend of the percentage of illness based on the poverty rate over time?

health_clean %>%
  filter(str_detect(characteristics,"poverty")) %>%
  mutate(percentage= percentage/100)%>%
ggplot(aes(year_number, percentage,col=sub_name))+
  geom_line()+
  facet_wrap(vars(illness), labeller = label_wrap_gen(width=25),scales = "free")+
   ggthemes::theme_few()+
  scale_x_continuous(n.break=8, name= "Percent of poverty level")+
  scale_y_continuous(label = scales::percent, name = "Percentage of characteristics")+
  scale_color_discrete(name= "Sub-name of characteristics")+
   guides(col=guide_legend(nrow=2))+
  theme(legend.position = "bottom")+
  labs(title = "Percentages for various poverty level overtime- various illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, besides “Food allergy” and “Skin allergy”, other illness are seems in the decline trends overtime.

Question 4.2 Which illness is most common in various poverty rates?

health_clean %>%
  filter(str_detect(characteristics,"poverty")) %>%
   mutate(sub_name=factor(sub_name,levels=c("Below 100%", "100%-199%", "200%-399%","400% or more")))%>%
  group_by(illness, sub_name)%>%
  summarise(mean= mean(percentage), mean_high= mean(percentage_high), mean_low= mean(percentage_low))%>%
  ungroup()%>%
  mutate(mean=mean/100, mean_high= mean_high/100, mean_low= mean_low/100)%>%
  mutate(labels=scales::percent(mean))%>%
ggplot(aes(sub_name, mean,fill=sub_name))+
  geom_col(position = "dodge")+
  facet_wrap(vars(illness), labeller = label_wrap_gen(width=25))+
scale_x_discrete(name= "Percent of poverty level")+
  scale_y_continuous(limits= range(0,0.3),label = scales::percent, name = " Mean of percentage")+
  scale_fill_discrete(name= "Percent of poverty level")+
   ggthemes::theme_few()+
  theme(legend.position = "bottom")+
  geom_text(aes(label = labels), size=3, vjust=-1)+
  theme(axis.text.x = element_text(angle=90))+
  labs(title = "Mean percentage for various poverty level-various illness")+
  theme(plot.title = element_text(hjust=0.5))

According to the graphic, “Hay fever or respiratory allergy” is the most common illness among those in various poverty level.

Reflection for visualization

I think my visualization answers all the research questions I have so far. However, it might improved by making the insight part visualized and putting them into a same or multiple graphic.

Another thing that can improved for visualization is 1) adding a label for the liner graphic, especially the graphic “Trends For Various Illness Over time,” and 2) making the color differences more obvious when the legend contains a large amount of the variables. It is challenging, and more research is needed to achieve it, but it will make the graphic clearer to the viewer.

For ways to improve, I do search some articles online, like adding lable and expanding the color palette. As it is not working well, I might seek more materials or even ask for tutoring.