U.S. Chronic Disease Indicators Analysis - Project 1

Author

Danny Le

For this project, I will be working with the U.S. Chronic Disease Indicators data set, which was created by the CDC. The indicator topic I chose to look at is alcohol. DataValue is one of my quantitative variables, and it represents the rate of alcohol-related behaviors. My other quantitative variable is LowConfidenceLimit, which gives the lower bound of the confidence interval around each DataValue estimate. I plan to explore how LowConfidenceLimit relates to DataValue.

CDI <- readr::read_csv("C:/Users/panca/Downloads/U.S._Chronic_Disease_Indicators.csv")
Rows: 309215 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): LocationAbbr, LocationDesc, DataSource, Topic, Question, DataValue...
dbl  (6): YearStart, YearEnd, DataValue, DataValueAlt, LowConfidenceLimit, H...
lgl (10): Response, StratificationCategory2, Stratification2, Stratification...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.1     ✔ readr     2.1.6
✔ ggplot2   4.0.2     ✔ stringr   1.6.0
✔ lubridate 1.9.5     ✔ tibble    3.3.1
✔ purrr     1.2.1     ✔ tidyr     1.3.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
#Removes rows with missing values in either quantitative variable
CDI_cleaned <- CDI |> filter(!is.na(DataValue), !is.na(LowConfidenceLimit)) 


#Filters the data to be about alcohol, in percentage, specifically in Maryland, Texas, and Florida
CDI_filtered <- CDI_cleaned |> filter(Topic=="Alcohol", DataValueUnit=="%", LocationAbbr %in% c("MD","TX","FL")) 


#Opens the filtered data in the viewer for a visual check (interactive sessions only)
if (interactive()) View(CDI_filtered)
#Fit a linear regression model with LowConfidenceLimit as the independent variable and DataValue as the dependent variable
CDI_analysis <- lm(DataValue ~ LowConfidenceLimit, data=CDI_filtered) 

#Displays a summary of the model that includes coefficients, p-values, and R-squared
summary(CDI_analysis)

Call:
lm(formula = DataValue ~ LowConfidenceLimit, data = CDI_filtered)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4861 -1.2912 -0.6849  0.4517 14.7943 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.4861     0.2682   9.268   <2e-16 ***
LowConfidenceLimit   1.0177     0.0189  53.844   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.115 on 341 degrees of freedom
Multiple R-squared:  0.8948,    Adjusted R-squared:  0.8945 
F-statistic:  2899 on 1 and 341 DF,  p-value: < 2.2e-16
#Arranges the four standard diagnostic plots for the model in a 2x2 grid
par(mfrow=c(2,2))
plot(CDI_analysis)
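
The summary printed above reports the intercept, slope, and R-squared as text. As a minimal sketch, the same quantities can be pulled out of a fitted model as objects, which makes them easier to reuse in a write-up. The `toy` data frame and the `toy_*` names below are made up for illustration and are not part of the CDI data; only the column names and the model formula mirror the analysis.

```r
# Toy data standing in for CDI_filtered (values are made up for illustration)
toy <- data.frame(
  LowConfidenceLimit = c(4.1, 9.8, 12.5, 15.2, 20.3, 25.7),
  DataValue          = c(6.0, 12.4, 15.1, 17.9, 23.5, 29.6)
)

# Same model formula as in the analysis
toy_model <- lm(DataValue ~ LowConfidenceLimit, data = toy)

# Named vector holding the intercept and the slope
toy_coefs <- coef(toy_model)

# Proportion of the variance in DataValue explained by the model
toy_r2 <- summary(toy_model)$r.squared

# 95% confidence intervals for the intercept and the slope
toy_ci <- confint(toy_model, level = 0.95)

print(toy_coefs)
print(toy_r2)
print(toy_ci)
```

Swapping `toy_model` for `CDI_analysis` should give the values shown in the summary, along with interval estimates for the slope that the summary does not print.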

#Creates a scatter plot with a regression line
ggplot(CDI_filtered, aes(x=LowConfidenceLimit, y=DataValue, color=LocationAbbr))+
  geom_point(alpha=0.5) +
  #Adds a regression line
  geom_smooth(method="lm", color="black", linetype="dotdash", se=FALSE) + 
  #Manually chose the color for each state
  scale_color_manual(values=c("MD"="#D61A1F", "TX"="#ADFFAD","FL"="#6EB0FF"))+
  #Adds labels for title, x value, y value, and the caption
  labs(
    title="Alcohol Behavior Data in Maryland, Texas, and Florida",
    x="Low Confidence Limit (%)",
    y="Data Value (%)",
    caption = "Source: U.S. Chronic Disease Indicators"
  ) +
  #Customization of the legend
  guides(color = guide_legend(title = "State (Color)", title.position="top"))+
  #Minimal theme for a cleaner look
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Before beginning any analysis and graphing, I cleaned the data set. First, I removed every row with a missing value in either DataValue or LowConfidenceLimit. Next, I filtered the data. The data set covers a wide range of topics and locations, but I was only interested in alcohol-related behaviors in Maryland, Texas, and Florida. Therefore, I kept only the rows for those three states whose Topic was Alcohol and whose DataValueUnit was a percentage.

The scatter plot with its regression line represents the relationship between LowConfidenceLimit, the lower bound of a confidence interval, and DataValue, the percentage of people with alcohol-related behaviors. Each x-value is the lower bound of the confidence interval around the corresponding estimate, so a higher x-value means the interval starts at a higher percentage; it does not by itself mean more confidence in the estimate. The y-values represent the percentage of alcohol-related behaviors in each state. The dot-dash black line is the regression line, which summarizes the linear relationship between the two variables. As LowConfidenceLimit increased, DataValue tended to increase as well: the fitted slope is about 1.02 and the R-squared is about 0.89, which indicates a strong positive relationship. This is expected, because each estimate must lie at or above its own lower confidence limit.

I wish I could have included the DataValueAlt variable. I suspect this variable holds values that were adjusted for bias, or values that came from a different source or statistical method. I would have liked to see how much it differed from the regular DataValue variable. I would have also liked to work with the topic of sleep, but I did not see many cases with that topic, so I ended up choosing alcohol.
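
The comparison between DataValueAlt and DataValue mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `toy` data frame and its values are made up, and the summary column names (`mean_diff`, `max_abs_diff`) are hypothetical choices, not anything defined by the CDI data set. Only the column names DataValue, DataValueAlt, and LocationAbbr come from the data.

```r
library(dplyr)

# Toy rows standing in for CDI_filtered (values are made up for illustration)
toy <- data.frame(
  LocationAbbr = c("MD", "MD", "TX", "TX", "FL", "FL"),
  DataValue    = c(15.2, 17.0, 16.8, 18.1, 14.9, NA),
  DataValueAlt = c(15.2, 17.4, 16.8, 17.6, 14.9, 15.3)
)

alt_comparison <- toy |>
  # Keep only rows where both values are present
  filter(!is.na(DataValue), !is.na(DataValueAlt)) |>
  # Row-level difference between the two columns
  mutate(diff = DataValueAlt - DataValue) |>
  # Per-state summary of how far the two columns drift apart
  group_by(LocationAbbr) |>
  summarise(
    n            = n(),
    mean_diff    = mean(diff),
    max_abs_diff = max(abs(diff)),
    .groups = "drop"
  )

print(alt_comparison)
```

If every `max_abs_diff` came out as zero on the real data, that would suggest DataValueAlt simply duplicates DataValue rather than holding adjusted values.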