Project1Data110

The data set I am using for this project comes from the U.S. Department of Health & Human Services. It shows the percentage of children (under age 18) in the United States who have a given illness or disability, broken down by group (race, living conditions, parent situation, location, etc.). It contains data collected over six consecutive years.

#First I loaded the tidyverse library and read my data set into RStudio. I then made sure there were no spaces or uppercase letters in my headers so I could clean my data set easily.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
getwd()
[1] "/Users/Sayla/Desktop/DATA 110"
health_data <- read_csv("NHIS_Child_Summary_Health_Statistics.csv")
Rows: 7644 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Outcome (or Indicator), Group, Confidence Interval, Title, Description
dbl (2): Percentage, Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(health_data) <- tolower(names(health_data))
names(health_data) <- gsub(" ","_",names(health_data))
head(health_data)
# A tibble: 6 × 7
  outcome_(or_indicator…¹ group percentage confidence_interval title description
  <chr>                   <chr>      <dbl> <chr>               <chr> <chr>      
1 Ever having asthma      Total       10.5 9.8 ,11.3           Perc… "Based on …
2 Ever having asthma      Male        12.5 11.3,13.7           Perc… "Based on …
3 Ever having asthma      Fema…        8.5 7.6 ,9.5            Perc… "Based on …
4 Ever having asthma      0-4 …        3.2 2.4 ,4.2            Perc… "Based on …
5 Ever having asthma      5-11…       11.9 10.6,13.3           Perc… "Based on …
6 Ever having asthma      12-1…       14.8 13.3,16.3           Perc… "Based on …
# ℹ abbreviated name: ¹​`outcome_(or_indicator)`
# ℹ 1 more variable: year <dbl>
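The header cleanup above lowercases the names and replaces spaces, but the parentheses in `outcome_(or_indicator)` remain, which is why that column has to be wrapped in backticks in the filter step. As a possible alternative, here is a minimal sketch of a stricter cleaner. The helper name `clean_names_basic` is hypothetical and was not part of my original analysis; if it were used, the later filter would need to refer to `outcome_or_indicator` instead.

```r
# Hypothetical helper (an illustration, not the code used in this project).
# Lowercases names, turns every run of non-alphanumeric characters into a
# single underscore, and trims underscores from both ends.
clean_names_basic <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

clean_names_basic(c("Outcome (or Indicator)", "Confidence Interval", "Year"))
# [1] "outcome_or_indicator" "confidence_interval"  "year"
```

It could be applied with `names(health_data) <- clean_names_basic(names(health_data))`.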
#I had a lot of data, so to narrow it down and make a simpler visualization, I filtered the outcome column to keep only "Ever having asthma" and then filtered my group variable to keep only the type of area where the children lived (large central metro, large fringe metro, medium and small metro, or nonmetropolitan). I also kept the 'Total' rows, which show what percent of all kids had asthma each year. I then created a new data set with just my desired variables called 'newhealth_df'. Up to this point the professor helped me get started. When filtering my groups, I searched how to filter for multiple values at once. I used Google and a screenshot of my search is attached. I used the information given, which was to use '%in%' and then list my values separated by commas inside c().
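To show why `%in%` is the right tool for matching several values, here is a small sketch on a made-up vector (the values in `group` and `keep` are only for illustration, not my real data). `filter()` keeps the rows where this kind of logical vector is TRUE.

```r
# Made-up example values, not the real data set.
group <- c("Total", "Male", "Nonmetropolitan", "Large central metro")
keep  <- c("Total", "Nonmetropolitan")

# %in% asks, for each element of group, "is it anywhere in keep?"
in_keep <- group %in% keep
in_keep
# [1]  TRUE FALSE  TRUE FALSE

group[in_keep]
# [1] "Total"           "Nonmetropolitan"

# Pitfall: == compares position by position and recycles the shorter
# vector, so it silently misses "Nonmetropolitan" here.
eq_keep <- group == keep
eq_keep
# [1]  TRUE FALSE FALSE FALSE
```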

newhealth_df <- health_data |>
  select(-c(confidence_interval,title,description)) |>
  filter(`outcome_(or_indicator)` == "Ever having asthma",
         group %in% c("Total",
                      "Large central metro", 
                      "Large fringe metro", 
                      "Medium and small metro", 
                      "Nonmetropolitan")) 

#Here I am setting up my linear regression. I made a new data set that has only the 'Total' percentage of kids who had asthma each year, to see whether year is a significant predictor of the percentage of kids who have asthma. In other words, can we see whether the rate of asthma is noticeably increasing or decreasing as time goes on?

total_in_year <- newhealth_df |>
  filter(group == "Total")
fit1 <- lm(percentage ~ year, data = total_in_year)
summary(fit1)

Call:
lm(formula = percentage ~ year, data = total_in_year)

Residuals:
       1        2        3        4        5        6 
 0.38571 -0.66857  0.17714 -0.07714  0.36857 -0.18571 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.41143  216.57011   0.473    0.661
year         -0.04571    0.10713  -0.427    0.692

Residual standard error: 0.4482 on 4 degrees of freedom
Multiple R-squared:  0.04354,   Adjusted R-squared:  -0.1956 
F-statistic: 0.1821 on 1 and 4 DF,  p-value: 0.6916
plot(fit1)

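After fitting a linear regression model using the year and the total percentage of children who had asthma that year, we get the equation

predicted percentage = 102.411 + (-0.04571*year)

where year is the calendar year the data is from.

This shows that the intercept is 102.411 and the slope is -0.04571. To get the predicted percentage of children who have asthma in a given year, we multiply the year number (2019, 2020, etc.) by -0.04571 and add the result to the intercept. The slope means the predicted percentage drops by about 0.05 percentage points each year. The intercept is the predicted value at year 0, so it has no practical meaning on its own.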
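As a quick check of how the equation works, here is a small sketch that plugs a year into it. The coefficients are copied from the summary output above; the function name `predict_pct` and the choice of 2021 as the example year are my own assumptions for illustration.

```r
# Coefficients copied from summary(fit1) above.
intercept <- 102.41143
slope     <- -0.04571

# Hypothetical helper: predicted percentage of children with asthma.
predict_pct <- function(year) intercept + slope * year

predict_pct(2021)   # roughly 10, in line with the totals in the data

# One extra year changes the prediction by exactly the slope.
predict_pct(2022) - predict_pct(2021)

# With the fitted model in memory, the equivalent call would be:
# predict(fit1, newdata = data.frame(year = 2021))
```

Because the printed coefficients are rounded, the result can differ slightly from what `predict()` returns on the fitted model.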

The p-value for year in this model is 0.692, far above the usual 0.05 cutoff, so year is not a statistically significant predictor of the percentage of children with asthma (we do not have enough evidence to reject the null hypothesis that the slope is zero). The R-squared value is 0.04, meaning only about 4% of the variation in the percentage is explained by year. The adjusted R-squared is about -0.2; a negative value means that, once the number of predictors is accounted for, the model does no better than simply using the average percentage. The model is also based on only six data points, which gives it very little power to detect a trend. From this model we can conclude that year is not a good predictor of the percentage of children who have asthma.
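To see where the adjusted R-squared, F-statistic, and p-value come from, here is a sketch that recomputes them by hand from numbers printed in the summary output. The variable names are my own, and the inputs are the rounded values from the output, so the results match only to a few decimal places.

```r
# Values copied from summary(fit1) above.
r2     <- 0.04354   # Multiple R-squared
n      <- 6         # number of observations (six years)
p      <- 1         # number of predictors (year)
t_year <- -0.427    # t value for year

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2   # about -0.196, matching the output

# F-statistic on p and n - p - 1 degrees of freedom.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_stat   # about 0.182, matching the output

# Two-sided p-value for the slope from the t distribution with 4 df.
p_year <- 2 * pt(-abs(t_year), df = n - p - 1)
p_year   # close to the 0.692 reported in the output
```

With one predictor, the F-test and the t-test for the slope are the same test, which is why both report a p-value of about 0.69.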

#Here I am setting up my visualization. I do not want the total percentages in my visualization, so I am making a new data set from newhealth_df that does not include the 'Total' rows. I made this visualization using the week 6 material to help me.
vis_data <- newhealth_df |>
  filter(group != "Total")
ggplot(vis_data, aes(x=year, y=percentage, color=group, group=group))+
  geom_point() +
  geom_line()+
  scale_color_brewer(palette="Set1")+
  labs(
    title = "Percent of Children with Asthma by Area Type",
    x = "Year",
    y = "Percentage with Asthma",
    color = "Area Type",
    caption = "Source: catalog.data.gov/dataset/nhis-child-summary-health-statistics-9185f"
  )+
  theme_minimal()+
  theme(plot.caption = element_text(hjust = 0))

I cleaned the data set by first narrowing down which elements I wanted to use in my regression and visualization. I got rid of the confidence intervals, titles, and descriptions. I also chose which illness or disability I wanted to focus this project on and picked asthma, so I got rid of the data that had to do with anything else. I then picked which groups I wanted to analyze and chose the type of area the children lived in, so I filtered until I was left with only the total percent of children who had asthma each year and the percent of children who had it in each type of area (large central metro, large fringe metro, medium and small metro, and nonmetropolitan).

This visualization shows, for each type of area, the percentage of kids living there who have ever had asthma. I was surprised to see the spike in the percentage of children who had asthma in nonmetropolitan areas in 2023. I also thought that children living in large metro areas would consistently have higher rates of asthma than kids who lived in small metro or nonmetropolitan areas, but the visualization shows otherwise.