U.S. Chronic Disease Indicators Analysis - Project 1
Author
Danny Le
For this project, I will be working with the U.S. Chronic Disease Indicators data set, which was created by the CDC. The indicator topic I chose to look at was alcohol. DataValue is one of my quantitative variables, and it represents the rate of alcohol-related behaviors. My other quantitative variable is LowConfidenceLimit, which is the lower bound of the confidence interval for each estimate. I plan to explore how LowConfidenceLimit affects DataValue.
Rows: 309215 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): LocationAbbr, LocationDesc, DataSource, Topic, Question, DataValue...
dbl (6): YearStart, YearEnd, DataValue, DataValueAlt, LowConfidenceLimit, H...
lgl (10): Response, StratificationCategory2, Stratification2, Stratification...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.1 ✔ readr 2.1.6
✔ ggplot2 4.0.2 ✔ stringr 1.6.0
✔ lubridate 1.9.5 ✔ tibble 3.3.1
✔ purrr 1.2.1 ✔ tidyr 1.3.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
# Cleans the 2 quantitative variables by dropping rows with missing values
CDI_cleaned <- CDI |>
  filter(!is.na(DataValue), !is.na(LowConfidenceLimit))

# Filters the data to be about alcohol, in percentage, specifically in Maryland, Texas, and Florida
CDI_filtered <- CDI_cleaned |>
  filter(Topic == "Alcohol", DataValueUnit == "%", LocationAbbr %in% c("MD", "TX", "FL"))

# Allows me to double check visually to see if everything worked
View(CDI_filtered)
# Fits a linear regression model with LowConfidenceLimit as the independent variable and DataValue as the dependent variable
CDI_analysis <- lm(DataValue ~ LowConfidenceLimit, data = CDI_filtered)

# Displays a summary of the model that includes coefficients, p-values, and R-squared
summary(CDI_analysis)
Call:
lm(formula = DataValue ~ LowConfidenceLimit, data = CDI_filtered)
Residuals:
Min 1Q Median 3Q Max
-2.4861 -1.2912 -0.6849 0.4517 14.7943
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.4861 0.2682 9.268 <2e-16 ***
LowConfidenceLimit 1.0177 0.0189 53.844 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.115 on 341 degrees of freedom
Multiple R-squared: 0.8948, Adjusted R-squared: 0.8945
F-statistic: 2899 on 1 and 341 DF, p-value: < 2.2e-16
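To make the coefficients easier to read, here is a minimal sketch of how the fitted equation turns a lower confidence limit into a predicted DataValue. The intercept and slope are copied from the summary output; the variable names and the input value of 10 are hypothetical and chosen only for illustration.

```r
# Coefficients copied (rounded) from the summary() output above
intercept <- 2.4861
slope     <- 1.0177

# Hypothetical lower confidence limit (in percent), used only as an example input
example_lcl <- 10

# Regression equation: DataValue = intercept + slope * LowConfidenceLimit
predicted_value <- intercept + slope * example_lcl
print(predicted_value)
```

With the model object available, `predict(CDI_analysis, newdata = data.frame(LowConfidenceLimit = 10))` should give nearly the same number, since it uses the unrounded coefficients.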
par(mfrow = c(2, 2))
plot(CDI_analysis)
# Creates a scatter plot with a regression line
ggplot(CDI_filtered, aes(x = LowConfidenceLimit, y = DataValue, color = LocationAbbr)) +
  geom_point(alpha = 0.5) +
  # Adds a regression line
  geom_smooth(method = "lm", color = "black", linetype = "dotdash", se = FALSE) +
  # Manually chose the color for each state
  scale_color_manual(values = c("MD" = "#D61A1F", "TX" = "#ADFFAD", "FL" = "#6EB0FF")) +
  # Adds labels for title, x value, y value, and the caption
  labs(
    title = "Alcohol Behavior Data in Maryland, Texas, and Florida",
    x = "Low Confidence Limit",
    y = "Data Value (%)",
    caption = "Source: U.S. Chronic Disease Indicators"
  ) +
  # Customization of the legend
  guides(color = guide_legend(title = "State (Color)", title.position = "top")) +
  # Minimal theme for a cleaner look
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Before beginning any analysis or graphing, I cleaned the data set. First, I removed any rows with missing values in the DataValue and LowConfidenceLimit variables. Next, I filtered the data. The data set covers a wide selection of topics, but I was only interested in alcohol-related behaviors in Maryland, Texas, and Florida. Therefore, I kept only the rows from those states that had their Topic set to Alcohol and their DataValueUnit set to a percentage.
The scatter plot with its regression line represents the relationship between LowConfidenceLimit, the lower bound of a confidence interval, and DataValue, the percentage of people with alcohol-related behaviors. Each x-value is the lower bound of the confidence interval around the corresponding estimate, so it shows the lowest plausible value of the rate rather than how certain the estimate is. The y-values represent the percentage of alcohol-related behaviors in each state. The dashed black line represents the regression line, which is the fitted linear relationship between the two variables. As LowConfidenceLimit increased, DataValue tended to increase as well, which suggests a positive relationship between the variables. The model supports this, with a slope of about 1.02 and an R-squared of about 0.89. A strong relationship is expected here, because the lower confidence limit is calculated from the estimate itself.
I wish I could have included the DataValueAlt variable. My guess is that this variable consists of values that were adjusted for bias, or values that came from a different source or statistical method. I would have liked to see how much it differed from the regular DataValue variable. I would have also liked to work with the topic of sleep. However, I did not see many cases with that topic, and as a result I ended up choosing alcohol.
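The comparison between DataValueAlt and DataValue mentioned above could be sketched roughly as follows. This is only an illustration: the `toy` data frame and its numbers are made up, not taken from the CDC data, and the code uses base R so it runs on its own. In the real project, the same steps would be applied to CDI_filtered.

```r
# Toy stand-in for CDI_filtered; these values are invented for illustration only
toy <- data.frame(
  LocationAbbr = c("MD", "TX", "FL"),
  DataValue    = c(15.2, 17.8, 16.1),
  DataValueAlt = c(15.2, 17.5, 16.4)
)

# Row-by-row difference between the alternative and regular values
toy_diff <- transform(toy, Difference = DataValueAlt - DataValue)

# Summarizes how far apart the two columns are on average and at most
mean_abs_diff <- mean(abs(toy_diff$Difference))
max_abs_diff  <- max(abs(toy_diff$Difference))

print(toy_diff)
print(c(mean_abs_diff = mean_abs_diff, max_abs_diff = max_abs_diff))
```

If every difference came out as zero on the real data, that would suggest the two columns hold the same values.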