My data set is of the cirrhosis mortality rate estimates based on counties in the United States in the year 2009 and 2019. The data set has 404460 observations and 19 variables. The variables are Measure ID, Measure Name, Location ID, Location Name, fips, Race ID, Race Name, Sex ID, Sex Name, Age Group ID, Age Name, Cause ID, Cause Name, Year, Metric ID, Metric Name, Value, Upper, and Lower. We will filter and refine the data so it would be easier to understand. We will be exploring the highest morality rate estimates of counties from 2009 and 2019. The source of the data set is from the Institute for Health Metrics and Evaluation (IHME).
Load CSV files and library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 404460 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): measure_name, location_name, race_name, sex_name, age_name, cause_...
dbl (12): measure_id, location_id, fips, race_id, sex_id, age_group_id, caus...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 404460 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): measure_name, location_name, race_name, sex_name, age_name, cause_...
dbl (12): measure_id, location_id, fips, race_id, sex_id, age_group_id, caus...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Get Rid of Unnecessary Columns and NAs
new_male_2009 <- cirrhosis_male_2009[-c(1,2,3,5,6,8,10,12,13,14,15)] # Extracts the following unwanted columnsrefined_male_2009 <-na.omit(new_male_2009) |># Filters all NAs and unwanted row content that are not necessaryfilter(race_name !="Total") |>filter(location_name !="United States of America") |>filter(age_name !="Age-standardized")|>filter(age_name !="All Ages") |>filter(age_name !="<1 year") |>filter(age_name !="1 to 4") |>filter(age_name !="5 to 9")new_male_2019 <- cirrhosis_male_2019[-c(1,2,3,5,6,8,10,12,13,14,15)]new_male_2019 <-na.omit(new_male_2019)refined_male_2019 <- new_male_2019 |>filter(race_name !="Total") |>filter(location_name !="United States of America")|>filter(age_name !="Age-standardized")|>filter(age_name !="All Ages")|>filter(age_name !="<1 year") |>filter(age_name !="1 to 4") |>filter(age_name !="5 to 9")
Visual of cirrhosis occuring based on age group and race in 2009
p1 <-ggplot(refined_male_2009, aes(age_name, race_name)) +# x-axis is the age and y-axis is race geom_point(aes(size = val), alpha = .3) +geom_smooth() +scale_size_area() +theme_bw() +#A theme with white background and black grid lines.theme(axis.text.x =element_text(angle =45)) +# Adjusts the placement of the text on x-axislabs(x ="Age",y ="Race",caption ="Source: Institute for Health Metrics and Evaluation (IHME)",title ="Cirrhosis Rate Estimates According to Race and Age in 2009")p1
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# p1 creates a plot of points that vary in size and opacity based on race and age group in 2009
Analysis: In 2009, the mortality rates range from 0.005- 0.001. The mortality rate appears to be higher in the race AIAN (non-Latino and non-Hispanic American Indian or Alaska Native) and between the ages 40-59. Asians tend to have a lower mortality rate.
Visual of cirrhosis occuring based on age group and race in 2019
p2 <-ggplot(refined_male_2019, aes(age_name, race_name)) +geom_point(aes(size = val), alpha = .3) +geom_smooth() +scale_size_area() +theme_bw() +theme(axis.text.x =element_text(angle =45)) +labs(x ="Age",y ="Race",caption ="Source: Institute for Health Metrics and Evaluation (IHME)",title ="Cirrhosis Rate Estimates According to Race and Age in 2019") p2
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
# p2 creates a plot of points that vary in size and opacity based on race and age group in 2019
Analysis: In 2019, the range is no longer 0.005- 0.001, but between 0.006- 0.002. An increase in all groups. The mortality rate is still higher in the race AIAN (non-Latino and non-Hispanic American Indian or Alaska Native) and between the ages 40-59.
County with the Highest Morality Rate Estimate in 2009
refined_count_2009 <- refined_male_2009 |>group_by(location_name)|>summarise(count=n(),ave_mortality_rate =mean(val),ave_upper =mean(upper), ave_lower =mean(lower) )# This data frame is created by creating new columns based on location in 2009. I included the count for each location, average mortality rate based on the location, and the average upper and lower rate in each locationtop_2009 <- refined_count_2009|>arrange(desc(ave_mortality_rate)) |>head(20)# Next I wanted only the top 20 locations with the highest rate.top_2009
# A tibble: 20 × 5
location_name count ave_mortality_rate ave_upper ave_lower
<chr> <int> <dbl> <dbl> <dbl>
1 Oglala Lakota County (South Dak… 16 0.00204 0.00255 0.00161
2 Shannon County (South Dakota) 16 0.00204 0.00255 0.00161
3 Union County (Florida) 32 0.00167 0.00218 0.00126
4 Mellette County (South Dakota) 16 0.00148 0.00219 0.000936
5 Todd County (South Dakota) 16 0.00140 0.00188 0.000998
6 Anderson County (Texas) 48 0.00126 0.00168 0.000921
7 Sioux County (North Dakota) 16 0.00111 0.00169 0.000698
8 Buffalo County (South Dakota) 16 0.00109 0.00181 0.000616
9 Fremont County (Wyoming) 48 0.00105 0.00145 0.000733
10 Rio Arriba County (New Mexico) 48 0.00102 0.00137 0.000740
11 Gila County (Arizona) 48 0.00101 0.00135 0.000729
12 Zavala County (Texas) 16 0.000982 0.00127 0.000749
13 Brooks County (Texas) 16 0.000969 0.00125 0.000729
14 Roosevelt County (Montana) 32 0.000931 0.00131 0.000645
15 Uintah County (Utah) 48 0.000901 0.00132 0.000584
16 Ramsey County (North Dakota) 32 0.000900 0.00152 0.000485
17 Bennett County (South Dakota) 32 0.000891 0.00135 0.000545
18 Glacier County (Montana) 32 0.000889 0.00129 0.000591
19 Thurston County (Nebraska) 32 0.000888 0.00130 0.000586
20 Lake County (California) 64 0.000877 0.00125 0.000596
County with the Highest Morality Rate Estimate in 2019
refined_count_2019 <- refined_male_2019 |>group_by(location_name)|># Based the content on location namesummarise(count=n(),ave_mortality_rate =mean(val),ave_upper =mean(upper), ave_lower =mean(lower) )# This data frame is created by creating new columns based on location in 2019. I included the count for each location, average mortality rate based on the location, and the average upper and lower rate in each locationtop_2019 <- refined_count_2019|>arrange(desc(ave_mortality_rate)) |>head(20)top_2019
# A tibble: 20 × 5
location_name count ave_mortality_rate ave_upper ave_lower
<chr> <int> <dbl> <dbl> <dbl>
1 Oglala Lakota County (South Dak… 16 0.00241 0.00312 0.00183
2 Shannon County (South Dakota) 16 0.00241 0.00312 0.00183
3 Union County (Florida) 32 0.00198 0.00274 0.00139
4 Mellette County (South Dakota) 16 0.00176 0.00263 0.00111
5 Todd County (South Dakota) 16 0.00165 0.00231 0.00115
6 Anderson County (Texas) 48 0.00140 0.00197 0.000952
7 Rio Arriba County (New Mexico) 48 0.00129 0.00180 0.000885
8 Buffalo County (South Dakota) 16 0.00124 0.00202 0.000702
9 Martinsville City (Virginia) 32 0.00123 0.00231 0.000571
10 Gila County (Arizona) 48 0.00123 0.00171 0.000846
11 Sioux County (North Dakota) 16 0.00117 0.00179 0.000740
12 Galax City (Virginia) 16 0.00116 0.00184 0.000675
13 Fremont County (Wyoming) 48 0.00114 0.00165 0.000754
14 McKenzie County (North Dakota) 32 0.00114 0.00180 0.000659
15 Zavala County (Texas) 16 0.00112 0.00151 0.000814
16 Brooks County (Texas) 16 0.00111 0.00149 0.000790
17 Uintah County (Utah) 48 0.00109 0.00167 0.000670
18 Roosevelt County (Montana) 32 0.00105 0.00152 0.000684
19 Thurston County (Nebraska) 32 0.00104 0.00156 0.000669
20 Montezuma County (Colorado) 48 0.00104 0.00157 0.000645
Visual as a Geom Tile in 2009
p3 <-ggplot(data = top_2009, aes(x=location_name, y=count, fill = ave_mortality_rate)) +geom_tile()+# What chart we are usingscale_fill_distiller(palette="Spectral") +# Color palettetheme_bw()+theme(axis.text.x =element_text(angle =90))+# Angles the text on the x-axislabs(x= ("County and State"),y= ("Count"),title= ("Cirrhosis Rate Estimates based on Location in 2009"),caption= ("Source: Institute for Health Metrics and Evaluation (IHME)"),fill =("Average Morality Rate"))p3
# The geom tile chart visualizes the top 20 locations with the highest average mortality rate. Red is highest rate and blue is lowest rate.
Analysis: Oglala Lakota County in South Dakota and Shannon County in South Dakota have the highest mortality rate, but they are not recorded (counted) as often in the data. Lake County in California has the least mortality rate in cirrhosis, but it is recorded the most in the data. This makes it disproportionate and might need further investigation. Maybe consider comparing locations with the same number of counties? Values range from 0.0009- 0.0018.
Visual as a Geom Tile in 2009
p4 <-ggplot(data = top_2019, aes(x=location_name, y=count, fill = ave_mortality_rate)) +geom_tile()+scale_fill_distiller(palette="Spectral") +theme_bw()+theme(axis.text.x =element_text(angle =90))+labs(x= ("County and State"),y= ("Count"),title= ("Cirrhosis Rate Estimates based on Location in 2019 "),caption= ("Source: Institute for Health Metrics and Evaluation (IHME)"),fill =("Average Morality Rate"))p4
Analysis: Oglala Lakota County in South Dakota and Shannon County in South Dakota maintain to have the highest mortality rate. There are a few locations that decreased in mortality rate, for example, Rio Arriba County (New Mexico) used to be in 7th place in 2009. In 2019, it is in 10th place. In this geom tile chart, the count range appear to end a little over 50, while in 2009 the count range appear to go over 60. Lake County in California might just be an outlier.
treemap(top_2009, index="location_name", vSize="count", vColor="ave_mortality_rate", type="manual", palette="Reds",title ="Cirrhosis Rate Estimates based on Location in 2009 ", title.legend ="Average Morality Rate")
treemap(top_2019, index="location_name", vSize="count", vColor="ave_mortality_rate", type="manual", palette="Reds",title ="Cirrhosis Rate Estimates based on Location in 2019 ", title.legend ="Average Morality Rate")
Linear Regression
# Does the average mortality rate have a relationship with the average upper bounds? p5 <-ggplot(top_2009, aes(x = ave_upper, y = ave_mortality_rate)) +labs(title ="Relationship Between Average Upper Bound and Average Mortality",caption= ("Source: Institute for Health Metrics and Evaluation (IHME)"),x ="Average Upper Bound Rate",y ="Average Mortality Rate") +theme_grey(base_size =12)p5+geom_point()
# Using ggplot, we create a scatter plot of the average upper bound rate and the average mortality rate cor(top_2009$ave_upper, top_2009$ave_mortality_rate)
[1] 0.9626797
# The correlation is 0.963. The relationship between average mortality rate and average upper bounds in a strong positive correlation.fit1 <-lm(ave_mortality_rate ~ ave_upper , top_2009)summary(fit1)
Call:
lm(formula = ave_mortality_rate ~ ave_upper, data = top_2009)
Residuals:
Min 1Q Median 3Q Max
-2.347e-04 -2.409e-05 2.126e-05 6.807e-05 1.107e-04
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.714e-04 9.174e-05 -1.868 0.0781 .
ave_upper 8.231e-01 5.454e-02 15.091 1.16e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0001025 on 18 degrees of freedom
Multiple R-squared: 0.9268, Adjusted R-squared: 0.9227
F-statistic: 227.7 on 1 and 18 DF, p-value: 1.165e-11
Analysis: P-value suggests it is a meaningful variable to explain the linear increase of the average mortality rate in cirrhosis.
The model equation is (average mortality rate) = 0.963(average upper bound rate) + -0.0001714. In other words, for any additional average upper bound mortality rate in cirrhosis, it is predicted to increase by 0.963. The adjusted R squared indicates that 92% of the variation in the observations may be explained by the model, and that 8% of the variation may not be explained.
p6 <-ggplot(top_2019, aes(x = ave_upper, y = ave_mortality_rate)) +labs(title ="Relationship Between Average Upper Bound and Average Mortality",caption= ("Source: Institute for Health Metrics and Evaluation (IHME)"),x ="Average Upper Bound Rate",y ="Average Mortality Rate") +theme_grey(base_size =12)p6+geom_point()
Call:
lm(formula = ave_mortality_rate ~ ave_upper, data = top_2019)
Residuals:
Min 1Q Median 3Q Max
-3.949e-04 -4.758e-05 2.573e-05 6.926e-05 1.511e-04
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0002031 0.0001185 -1.715 0.104
ave_upper 0.7908990 0.0572342 13.819 5.04e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0001306 on 18 degrees of freedom
Multiple R-squared: 0.9139, Adjusted R-squared: 0.9091
F-statistic: 191 on 1 and 18 DF, p-value: 5.044e-11
Analysis: P-value suggests it is a meaningful variable to explain the linear increase of the average mortality rate in cirrhosis.
The model equation is (average mortality rate) = 0.956(average upper bound rate) + -0.0002031. In other words, for any additional average upper bound mortality rate in cirrhosis, it is predicted to increase by 0.956. The adjusted R squared indicates that 91% of the variation in the observations may be explained by the model, and that 9% of the variation may not be explained.
ESSAY
My data set originally contained 404460 observations and 19 variables. We filter the data of all unnecessary columns, rows, and data that contain NAs. First, I create a new data set that extracts columns 1,2,3,5,6,8,10,12,13,14,15 and then remove all the NAs so that it would easier to read. Next I remove rows Total, United States of America, Age-standardized, All Ages, <1 year, 1 to 4, and 5 to 9. Although it may seem important to keep all age groups, cirrhosis normally affects individuals in there 50s- 70s. Keeping age groups <1 year, 1 to 4, and 5 to 9 did not seem crucial to this investigation. After filtering, the data set contained 121088 observations and 8 variables. In terms of visuals, I utilized a variety of visuals such as geom point, geom tile, tree map, and scatter plot to explore and understand the relationship between variables. Interesting pattern that I discovered was that, although there are locations that had originally been on a higher ranking in 2009 became a lower rank in 2019, the overall rate of cirrhosis is increasing with no indication of it decreasing. Overall, I enjoyed the project. I got a better understanding on how to wrangle data and utilizing it so that it would be easier to analyze. I did wish that I had a different data set that had more quantitative variables so that I could explore more about linear regression.