The US Chronic Disease Indicators

The US Chronic Disease Indicators data set is a comprehensive resource that gives an overview of chronic diseases and related risk factors across the US. It includes indicators for chronic diseases such as diabetes, cancer, heart disease, and respiratory disorders. For this project, I explore chronic liver disease mortality in Maryland and alcohol use among youth in California.

Source

https://www.cdc.gov/

Loading the data set file

# Load Dataset
USChronicDiseaseIndicators <- read.csv("~/DATA 110/USChronicDiseaseIndicators.csv")
View(USChronicDiseaseIndicators)
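
After loading, it can help to check how each column was read. The head() output later in this report prints blank DataValue cells next to NA in DataValueAlt, which suggests DataValue was read as text while DataValueAlt is numeric. The sketch below reproduces that behavior on a hypothetical three-row extract; the CSV text and the "Yes" entry are assumptions made up for illustration, not rows from the real file.

```r
# Hypothetical miniature CSV; the real file has many more rows and columns.
csv_text <- "Question,DataValue,DataValueAlt
Alcohol use among youth,17.1,17.1
Alcohol use among youth,,
Some other indicator,Yes,
"

toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# A column that mixes numbers with text is read as character (blanks stay ""),
# while a column holding only numbers and blanks is read as numeric (blanks become NA).
sapply(toy, class)
```

Running sapply(USChronicDiseaseIndicators, class) on the full data set would give the same kind of summary for every column.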

Filtering the data set for chronic liver disease mortality in Maryland

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# filter for Chronic liver disease
Chronic_liver_data <- USChronicDiseaseIndicators %>%
  filter(Question == "Chronic liver disease mortality" & LocationDesc == "Maryland")
# View the resulting dataset
View(Chronic_liver_data)
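
A quick way to confirm that a filter kept only the intended rows is to count what is left. The following is a minimal sketch on a hypothetical three-row table (the column names come from the data set, the rows are made up); it also shows that comma-separated conditions inside filter() behave like conditions joined with &.

```r
library(dplyr)

# Hypothetical miniature of the indicator table.
toy <- data.frame(
  Question = c("Chronic liver disease mortality",
               "Chronic liver disease mortality",
               "Alcohol use among youth"),
  LocationDesc = c("Maryland", "California", "Maryland"),
  stringsAsFactors = FALSE
)

# Conditions separated by commas are combined with AND.
md_liver <- toy %>%
  filter(Question == "Chronic liver disease mortality",
         LocationDesc == "Maryland")

# Tabulate the remaining combinations to verify the filter.
count(md_liver, Question, LocationDesc)
```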

Fitting a linear regression model of the low confidence limit on the high confidence limit. I used na.omit to drop the rows with NAs and chose the color blue for the regression line.

library(ggplot2)
# linear regression removing NAs
lm_data <- na.omit(Chronic_liver_data[, c("LowConfidenceLimit", "HighConfidenceLimit")])
lm_model <- lm(LowConfidenceLimit ~ HighConfidenceLimit , data = lm_data)

# Print summary
summary(lm_model)
## 
## Call:
## lm(formula = LowConfidenceLimit ~ HighConfidenceLimit, data = lm_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6908 -0.2175  0.3266  0.4849  0.8793 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.65842    0.38909  -1.692   0.0935 .  
## HighConfidenceLimit  0.82321    0.04407  18.678   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.034 on 108 degrees of freedom
## Multiple R-squared:  0.7636, Adjusted R-squared:  0.7614 
## F-statistic: 348.9 on 1 and 108 DF,  p-value: < 2.2e-16
# Plot the data and regression line
# The axes match the model formula: response (LowConfidenceLimit) on y, predictor on x
ggplot(lm_data, aes(x = HighConfidenceLimit, y = LowConfidenceLimit)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Linear Regression Analysis",
       x = "HighConfidenceLimit",
       y = "LowConfidenceLimit")
## `geom_smooth()` using formula = 'y ~ x'

Extracting the equation of the regression line from the fitted model

# coefficients from the linear model
coefficients <- coef(lm_model)

# Print the equation of the regression line
cat("Linear Regression Equation: Y =", coefficients[1], "+", coefficients[2], "* X\n")
## Linear Regression Equation: Y = -0.6584168 + 0.8232125 * X
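
Because the model was fit as LowConfidenceLimit ~ HighConfidenceLimit, Y in the printed equation is the low confidence limit and X is the high confidence limit. As a worked example, the sketch below plugs a hypothetical high limit of 10 (an assumed value chosen only for illustration, not taken from the data) into the equation.

```r
# Coefficients copied from the printed equation.
intercept <- -0.6584168
slope     <-  0.8232125

# Hypothetical high confidence limit, for illustration only.
high_limit <- 10

# Predicted low confidence limit: intercept + slope * X
predicted_low <- intercept + slope * high_limit
predicted_low  # about 7.57
```

With the fitted model available, predict(lm_model, newdata = data.frame(HighConfidenceLimit = 10)) should give the same value up to rounding.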

Model Analysis

p-value: < 2.2e-16, Adjusted R-squared: 0.7614

Based on the extremely small p-value, I concluded that the relationship in my linear regression model is statistically significant. The adjusted R-squared of 0.7614 means the model explains about 76% of the variability in the response variable (LowConfidenceLimit).
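
The two numbers quoted above can also be pulled out of the model summary in code instead of being read off the printout. This is a minimal sketch on simulated data; the toy data frame, the seed, and the variable names x and y are assumptions for illustration and are unrelated to the project data.

```r
# Simulated data with a known linear trend plus noise.
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 2 + 0.8 * toy$x + rnorm(20)

fit <- lm(y ~ x, data = toy)
s <- summary(fit)

# Adjusted R-squared is stored directly in the summary object.
adj_r2 <- s$adj.r.squared

# The overall p-value is not stored, so it is computed from the F-statistic.
f <- s$fstatistic
model_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
cat("Model p-value:", format.pval(unname(model_p)), "\n")
```

The same lines should work with summary(lm_model) in place of summary(fit).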

Making a bar chart of 2015 alcohol use among youth across three racial/ethnic groups in California

Filtering for youth alcohol use

library(dplyr)
Youth_alcohol <- USChronicDiseaseIndicators %>%
  filter(Question == "Alcohol use among youth")
View(Youth_alcohol)

Filtering for the year 2015 and the state of California

library(dplyr)
State_data <- Youth_alcohol %>%
  filter(LocationDesc == "California" & YearStart == 2015)  # YearStart is numeric
View(State_data)

Filtering for race

library(dplyr)
races_data <- State_data %>%
  filter(StratificationCategory1 == "Race/Ethnicity")
head(races_data)
##   YearStart YearEnd LocationAbbr LocationDesc DataSource   Topic
## 1      2015    2015           CA   California      YRBSS Alcohol
## 2      2015    2015           CA   California      YRBSS Alcohol
## 3      2015    2015           CA   California      YRBSS Alcohol
## 4      2015    2015           CA   California      YRBSS Alcohol
## 5      2015    2015           CA   California      YRBSS Alcohol
##                  Question Response DataValueUnit    DataValueType DataValue
## 1 Alcohol use among youth       NA             % Crude Prevalence          
## 2 Alcohol use among youth       NA             % Crude Prevalence      17.1
## 3 Alcohol use among youth       NA             % Crude Prevalence      35.1
## 4 Alcohol use among youth       NA             % Crude Prevalence      29.1
## 5 Alcohol use among youth       NA             % Crude Prevalence          
##   DataValueAlt DataValueFootnoteSymbol
## 1           NA                       -
## 2         17.1                        
## 3         35.1                        
## 4         29.1                        
## 5           NA                       ~
##                                        DatavalueFootnote LowConfidenceLimit
## 1                                      No data available                 NA
## 2                                                                      11.9
## 3                                                                      27.4
## 4                                                                      21.1
## 5 Data not shown because of too few respondents or cases                 NA
##   HighConfidenceLimit StratificationCategory1                  Stratification1
## 1                  NA          Race/Ethnicity              Black, non-Hispanic
## 2                23.9          Race/Ethnicity              Asian, non-Hispanic
## 3                43.7          Race/Ethnicity              White, non-Hispanic
## 4                38.6          Race/Ethnicity                         Hispanic
## 5                  NA          Race/Ethnicity American Indian or Alaska Native
##   StratificationCategory2 Stratification2 StratificationCategory3
## 1                      NA              NA                      NA
## 2                      NA              NA                      NA
## 3                      NA              NA                      NA
## 4                      NA              NA                      NA
## 5                      NA              NA                      NA
##   Stratification3                                   GeoLocation ResponseID
## 1              NA POINT (-120.99999953799971 37.63864012300047)         NA
## 2              NA POINT (-120.99999953799971 37.63864012300047)         NA
## 3              NA POINT (-120.99999953799971 37.63864012300047)         NA
## 4              NA POINT (-120.99999953799971 37.63864012300047)         NA
## 5              NA POINT (-120.99999953799971 37.63864012300047)         NA
##   LocationID TopicID QuestionID DataValueTypeID StratificationCategoryID1
## 1          6     ALC     ALC1_1         CRDPREV                      RACE
## 2          6     ALC     ALC1_1         CRDPREV                      RACE
## 3          6     ALC     ALC1_1         CRDPREV                      RACE
## 4          6     ALC     ALC1_1         CRDPREV                      RACE
## 5          6     ALC     ALC1_1         CRDPREV                      RACE
##   StratificationID1 StratificationCategoryID2 StratificationID2
## 1               BLK                        NA                NA
## 2               ASN                        NA                NA
## 3               WHT                        NA                NA
## 4               HIS                        NA                NA
## 5              AIAN                        NA                NA
##   StratificationCategoryID3 StratificationID3
## 1                        NA                NA
## 2                        NA                NA
## 3                        NA                NA
## 4                        NA                NA
## 5                        NA                NA

Filtering for specific races

library(dplyr)
SpecificR <- races_data %>%
    filter(Stratification1 %in% c("Asian, non-Hispanic", "White, non-Hispanic", "Hispanic"))
head(SpecificR)
##   YearStart YearEnd LocationAbbr LocationDesc DataSource   Topic
## 1      2015    2015           CA   California      YRBSS Alcohol
## 2      2015    2015           CA   California      YRBSS Alcohol
## 3      2015    2015           CA   California      YRBSS Alcohol
##                  Question Response DataValueUnit    DataValueType DataValue
## 1 Alcohol use among youth       NA             % Crude Prevalence      17.1
## 2 Alcohol use among youth       NA             % Crude Prevalence      35.1
## 3 Alcohol use among youth       NA             % Crude Prevalence      29.1
##   DataValueAlt DataValueFootnoteSymbol DatavalueFootnote LowConfidenceLimit
## 1         17.1                                                         11.9
## 2         35.1                                                         27.4
## 3         29.1                                                         21.1
##   HighConfidenceLimit StratificationCategory1     Stratification1
## 1                23.9          Race/Ethnicity Asian, non-Hispanic
## 2                43.7          Race/Ethnicity White, non-Hispanic
## 3                38.6          Race/Ethnicity            Hispanic
##   StratificationCategory2 Stratification2 StratificationCategory3
## 1                      NA              NA                      NA
## 2                      NA              NA                      NA
## 3                      NA              NA                      NA
##   Stratification3                                   GeoLocation ResponseID
## 1              NA POINT (-120.99999953799971 37.63864012300047)         NA
## 2              NA POINT (-120.99999953799971 37.63864012300047)         NA
## 3              NA POINT (-120.99999953799971 37.63864012300047)         NA
##   LocationID TopicID QuestionID DataValueTypeID StratificationCategoryID1
## 1          6     ALC     ALC1_1         CRDPREV                      RACE
## 2          6     ALC     ALC1_1         CRDPREV                      RACE
## 3          6     ALC     ALC1_1         CRDPREV                      RACE
##   StratificationID1 StratificationCategoryID2 StratificationID2
## 1               ASN                        NA                NA
## 2               WHT                        NA                NA
## 3               HIS                        NA                NA
##   StratificationCategoryID3 StratificationID3
## 1                        NA                NA
## 2                        NA                NA
## 3                        NA                NA
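
The filtering steps above (question, state and year, stratification category, specific groups) can also be written as one pipeline. The following is a minimal sketch on a hypothetical six-row table; the column names come from the data set, while the rows and the names toy and specific_toy are made up for illustration.

```r
library(dplyr)

# Hypothetical miniature of the indicator table.
toy <- data.frame(
  Question = "Alcohol use among youth",
  LocationDesc = c("California", "California", "California",
                   "California", "California", "Maryland"),
  YearStart = c(2015, 2015, 2015, 2015, 2013, 2015),
  StratificationCategory1 = c("Race/Ethnicity", "Race/Ethnicity", "Race/Ethnicity",
                              "Race/Ethnicity", "Race/Ethnicity", "Overall"),
  Stratification1 = c("Asian, non-Hispanic", "White, non-Hispanic", "Hispanic",
                      "Black, non-Hispanic", "Hispanic", "Overall"),
  stringsAsFactors = FALSE
)

races_of_interest <- c("Asian, non-Hispanic", "White, non-Hispanic", "Hispanic")

# All conditions in a single filter() call; commas act as AND.
specific_toy <- toy %>%
  filter(Question == "Alcohol use among youth",
         LocationDesc == "California",
         YearStart == 2015,
         StratificationCategory1 == "Race/Ethnicity",
         Stratification1 %in% races_of_interest)

specific_toy$Stratification1
```

Keeping the steps separate, as in this report, makes each intermediate table easy to inspect; the single pipeline is shorter once the conditions are known to be right.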

Plotting a bar chart of the state's crude prevalence for each race

library(ggplot2)
# Specify colors for each race
race_colors <- c("Asian, non-Hispanic" = "red", "White, non-Hispanic" = "purple", "Hispanic" = "blue")

# Create Plot
# DataValueAlt is the numeric copy of DataValue, so the y axis is continuous
ggplot(SpecificR, aes(x = Stratification1, y = DataValueAlt, fill = Stratification1)) +
  geom_col() +
  scale_fill_manual(values = race_colors) +  # Assign colors to races
  labs(title = "Crude prevalence of alcohol use among youth by race (California, 2015)",
       x = "Race",
       y = "Crude prevalence (%)",
       fill = "Race") +
  theme_dark()
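
Since the data set also reports a low and high confidence limit for each estimate, the bars could carry error bars. This is a minimal sketch, not the chart from this project: the prevalence data frame is a hypothetical stand-in for SpecificR, with the three rows of values copied from the output above, and DataValueAlt is used because it prints as a numeric column.

```r
library(ggplot2)

# Values copied from the SpecificR output (California, 2015).
prevalence <- data.frame(
  Stratification1 = c("Asian, non-Hispanic", "White, non-Hispanic", "Hispanic"),
  DataValueAlt = c(17.1, 35.1, 29.1),
  LowConfidenceLimit = c(11.9, 27.4, 21.1),
  HighConfidenceLimit = c(23.9, 43.7, 38.6)
)

# Bars show the crude prevalence; error bars span the reported confidence limits.
p <- ggplot(prevalence, aes(x = Stratification1, y = DataValueAlt, fill = Stratification1)) +
  geom_col() +
  geom_errorbar(aes(ymin = LowConfidenceLimit, ymax = HighConfidenceLimit), width = 0.2) +
  labs(x = "Race", y = "Crude prevalence (%)", fill = "Race")

p
```

The error bars make it easier to judge whether the differences between groups are larger than the uncertainty in each estimate.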

Project Conclusion

Cleaning the data is an important step to make sure the data is accurate. One thing I did to clean the data was removing duplicate columns. One thing I found interesting was the positive correlation between certain variables. I wish I could've shown the crude rate for more races, but there were a lot of subgroups skewing the values.