The U.S. Chronic Disease Indicators dataset is a comprehensive resource that gives an overview of chronic diseases and their related risk factors across the United States. It includes indicators for conditions such as diabetes, cancer, heart disease, and respiratory disorders. For this project, I explore chronic liver disease mortality in Maryland and alcohol use among youth in California.
# Load Dataset
USChronicDiseaseIndicators <- read.csv("~/DATA 110/USChronicDiseaseIndicators.csv")
View(USChronicDiseaseIndicators)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Filter for chronic liver disease mortality in Maryland
Chronic_liver_data <- USChronicDiseaseIndicators %>%
  filter(Question == "Chronic liver disease mortality", LocationDesc == "Maryland")
# View the resulting dataset
View(Chronic_liver_data)
library(ggplot2)
# Linear regression of the lower confidence limit on the upper one, dropping rows with NAs
lm_data <- na.omit(Chronic_liver_data[, c("LowConfidenceLimit", "HighConfidenceLimit")])
lm_model <- lm(LowConfidenceLimit ~ HighConfidenceLimit, data = lm_data)
# Print summary
summary(lm_model)
##
## Call:
## lm(formula = LowConfidenceLimit ~ HighConfidenceLimit, data = lm_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6908 -0.2175 0.3266 0.4849 0.8793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.65842 0.38909 -1.692 0.0935 .
## HighConfidenceLimit 0.82321 0.04407 18.678 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.034 on 108 degrees of freedom
## Multiple R-squared: 0.7636, Adjusted R-squared: 0.7614
## F-statistic: 348.9 on 1 and 108 DF, p-value: < 2.2e-16
# Plot the data and regression line (axes match the model: LowConfidenceLimit ~ HighConfidenceLimit)
ggplot(lm_data, aes(x = HighConfidenceLimit, y = LowConfidenceLimit)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Linear Regression Analysis",
       x = "HighConfidenceLimit",
       y = "LowConfidenceLimit")
## `geom_smooth()` using formula = 'y ~ x'
# coefficients from the linear model
coefficients <- coef(lm_model)
# Print the equation of the regression line
cat("Linear Regression Equation: Y =", coefficients[1], "+", coefficients[2], "* X\n")
## Linear Regression Equation: Y = -0.6584168 + 0.8232125 * X
The model summary reports a p-value below 2.2e-16 and an adjusted R-squared of 0.7614.
Based on the extremely small p-value and the relatively high adjusted R-squared, I conclude that the linear regression model is statistically significant and that it explains about 76% of the variability in the response variable, LowConfidenceLimit.
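The two figures quoted above can also be pulled out of a fitted model programmatically instead of being copied from the printed summary. The sketch below is a minimal illustration on simulated data: `sim_data`, its values, and the seed are assumptions made up for the example and are not taken from the chronic disease dataset.

```r
# Simulated stand-in for lm_data (hypothetical values, not the real data)
set.seed(110)
high <- runif(110, min = 2, max = 15)
sim_data <- data.frame(
  HighConfidenceLimit = high,
  LowConfidenceLimit  = -0.6 + 0.8 * high + rnorm(110, sd = 1)
)

sim_model   <- lm(LowConfidenceLimit ~ HighConfidenceLimit, data = sim_data)
sim_summary <- summary(sim_model)

# Adjusted R-squared: share of variance explained, penalised for model size
adj_r2 <- sim_summary$adj.r.squared

# Overall F-test p-value, computed from the stored F statistic
f_stat  <- sim_summary$fstatistic
p_value <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                     lower.tail = FALSE))

# With a single predictor, R-squared equals the squared Pearson correlation
r <- cor(sim_data$LowConfidenceLimit, sim_data$HighConfidenceLimit)

cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
cat("Model p-value:", format.pval(p_value), "\n")
cat("Squared correlation:", round(r^2, 4), "\n")
```

Running the same extraction on `lm_model` should reproduce the values shown in its summary, and the squared correlation gives a quick cross-check of the unadjusted R-squared.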
# Keep only the youth alcohol use indicator
Youth_alcohol <- USChronicDiseaseIndicators %>%
  filter(Question == "Alcohol use among youth")
View(Youth_alcohol)
# Narrow to California in 2015 (YearStart is numeric, so compare with a number)
State_data <- Youth_alcohol %>%
  filter(LocationDesc == "California", YearStart == 2015)
View(State_data)
# Keep the rows stratified by race/ethnicity
races_data <- State_data %>%
  filter(StratificationCategory1 == "Race/Ethnicity")
head(races_data)
## YearStart YearEnd LocationAbbr LocationDesc DataSource Topic
## 1 2015 2015 CA California YRBSS Alcohol
## 2 2015 2015 CA California YRBSS Alcohol
## 3 2015 2015 CA California YRBSS Alcohol
## 4 2015 2015 CA California YRBSS Alcohol
## 5 2015 2015 CA California YRBSS Alcohol
## Question Response DataValueUnit DataValueType DataValue
## 1 Alcohol use among youth NA % Crude Prevalence
## 2 Alcohol use among youth NA % Crude Prevalence 17.1
## 3 Alcohol use among youth NA % Crude Prevalence 35.1
## 4 Alcohol use among youth NA % Crude Prevalence 29.1
## 5 Alcohol use among youth NA % Crude Prevalence
## DataValueAlt DataValueFootnoteSymbol
## 1 NA -
## 2 17.1
## 3 35.1
## 4 29.1
## 5 NA ~
## DatavalueFootnote LowConfidenceLimit
## 1 No data available NA
## 2 11.9
## 3 27.4
## 4 21.1
## 5 Data not shown because of too few respondents or cases NA
## HighConfidenceLimit StratificationCategory1 Stratification1
## 1 NA Race/Ethnicity Black, non-Hispanic
## 2 23.9 Race/Ethnicity Asian, non-Hispanic
## 3 43.7 Race/Ethnicity White, non-Hispanic
## 4 38.6 Race/Ethnicity Hispanic
## 5 NA Race/Ethnicity American Indian or Alaska Native
## StratificationCategory2 Stratification2 StratificationCategory3
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## Stratification3 GeoLocation ResponseID
## 1 NA POINT (-120.99999953799971 37.63864012300047) NA
## 2 NA POINT (-120.99999953799971 37.63864012300047) NA
## 3 NA POINT (-120.99999953799971 37.63864012300047) NA
## 4 NA POINT (-120.99999953799971 37.63864012300047) NA
## 5 NA POINT (-120.99999953799971 37.63864012300047) NA
## LocationID TopicID QuestionID DataValueTypeID StratificationCategoryID1
## 1 6 ALC ALC1_1 CRDPREV RACE
## 2 6 ALC ALC1_1 CRDPREV RACE
## 3 6 ALC ALC1_1 CRDPREV RACE
## 4 6 ALC ALC1_1 CRDPREV RACE
## 5 6 ALC ALC1_1 CRDPREV RACE
## StratificationID1 StratificationCategoryID2 StratificationID2
## 1 BLK NA NA
## 2 ASN NA NA
## 3 WHT NA NA
## 4 HIS NA NA
## 5 AIAN NA NA
## StratificationCategoryID3 StratificationID3
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
# Keep the three groups that have a reported value
SpecificR <- races_data %>%
  filter(Stratification1 %in% c("Asian, non-Hispanic", "White, non-Hispanic", "Hispanic"))
head(SpecificR)
## YearStart YearEnd LocationAbbr LocationDesc DataSource Topic
## 1 2015 2015 CA California YRBSS Alcohol
## 2 2015 2015 CA California YRBSS Alcohol
## 3 2015 2015 CA California YRBSS Alcohol
## Question Response DataValueUnit DataValueType DataValue
## 1 Alcohol use among youth NA % Crude Prevalence 17.1
## 2 Alcohol use among youth NA % Crude Prevalence 35.1
## 3 Alcohol use among youth NA % Crude Prevalence 29.1
## DataValueAlt DataValueFootnoteSymbol DatavalueFootnote LowConfidenceLimit
## 1 17.1 11.9
## 2 35.1 27.4
## 3 29.1 21.1
## HighConfidenceLimit StratificationCategory1 Stratification1
## 1 23.9 Race/Ethnicity Asian, non-Hispanic
## 2 43.7 Race/Ethnicity White, non-Hispanic
## 3 38.6 Race/Ethnicity Hispanic
## StratificationCategory2 Stratification2 StratificationCategory3
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## Stratification3 GeoLocation ResponseID
## 1 NA POINT (-120.99999953799971 37.63864012300047) NA
## 2 NA POINT (-120.99999953799971 37.63864012300047) NA
## 3 NA POINT (-120.99999953799971 37.63864012300047) NA
## LocationID TopicID QuestionID DataValueTypeID StratificationCategoryID1
## 1 6 ALC ALC1_1 CRDPREV RACE
## 2 6 ALC ALC1_1 CRDPREV RACE
## 3 6 ALC ALC1_1 CRDPREV RACE
## StratificationID1 StratificationCategoryID2 StratificationID2
## 1 ASN NA NA
## 2 WHT NA NA
## 3 HIS NA NA
## StratificationCategoryID3 StratificationID3
## 1 NA NA
## 2 NA NA
## 3 NA NA
library(ggplot2)
# Specify colors for each race/ethnicity group
race_colors <- c("Asian, non-Hispanic" = "red", "White, non-Hispanic" = "purple", "Hispanic" = "blue")
# Create the bar plot; DataValueAlt is used because it holds the values as numbers
# (DataValue prints blanks for missing entries, which suggests it was read as text)
ggplot(SpecificR, aes(x = Stratification1, y = DataValueAlt, fill = Stratification1)) +
  geom_col() +
  scale_fill_manual(values = race_colors) + # Assign colors to groups
  labs(title = "Crude prevalence of alcohol use among youth in California (2015)",
       x = "Race/Ethnicity",
       y = "Crude prevalence (%)",
       fill = "Race/Ethnicity") +
  theme_dark()
Cleaning the data is an important step in making sure the results are accurate. One cleaning step I took was removing duplicate columns. One thing I found interesting was the strong positive relationship between the lower and upper confidence limits for chronic liver disease mortality in Maryland. I wish I could have shown the crude prevalence for more racial and ethnic groups, but many subgroups had missing values or too few respondents, which would have skewed the comparison.
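As an illustration of the cleaning step described above, here is a minimal sketch of one way to drop duplicate columns and rows without a reported value in base R. The `toy` data frame and its `DataValueCopy` column are hypothetical and exist only for this example; the prevalence values are the three shown in the output earlier.

```r
# Hypothetical miniature of the survey table; DataValueCopy is an invented duplicate
toy <- data.frame(
  Stratification1 = c("Asian, non-Hispanic", "White, non-Hispanic",
                      "Hispanic", "Black, non-Hispanic"),
  DataValue       = c(17.1, 35.1, 29.1, NA),
  DataValueCopy   = c(17.1, 35.1, 29.1, NA),
  stringsAsFactors = FALSE
)

# duplicated() on the list of columns flags any column whose contents
# repeat an earlier column
dup_cols <- duplicated(as.list(toy))
cleaned  <- toy[, !dup_cols, drop = FALSE]

# Drop rows with no reported value before plotting or modelling
cleaned <- cleaned[!is.na(cleaned$DataValue), ]

print(names(cleaned))
print(nrow(cleaned))
```

Comparing column contents, and not column names, catches duplicates that were stored under different names, which is the situation this sketch assumes.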