The topic that I chose is Alzheimer’s Disease and Healthy Aging Data. I found the file “Alzheimer_s_Disease_and_Healthy_Aging_Data.csv” on the website data.gov. The data is maintained by DPH (Department of Public Health) Public Inquiries and was published by the Centers for Disease Control and Prevention. There is no ReadMe file with that information. The data set has 39 variables. For example, RowId is a value that uniquely identifies a row in the table. The other variables include: YearStart, YearEnd, LocationAbbr, LocationDesc, Datasource, Class, Topic, Question, Response, Data_Value_Unit, DataValueTypeID, Data_Value_Type, Data_Value, Data_Value_Alt, Data_Value_Footnote_Symbol, Data_Value_Footnote, Low_Confidence_Limit, High_Confidence_Limit, Sample_Size, the StratificationCategory and Stratification columns, Geolocation, ClassID, TopicID, QuestionID, ResponseID, and LocationID.
The variables that I will be using are YearEnd, the year in which the survey was completed; Data_Value_Type, which I will restrict to percentages; Stratification2, to split the data by gender (female and male); LocationDesc, which holds each state where the survey was done; and Data_Value, the percentage for each state. My goal is to analyze separately the percentage of females and of males who had Alzheimer’s in each state from 2015 until 2021, to see which gender may have had more diagnoses and which state had the most cases.
I chose this topic because my father is getting old and we have been noticing some signs that his mind may not be doing well. Alzheimer’s is one of the things that we want to get him tested for.
Alzheimer’s Disease
The tidyverse package and other packages were used to create the plots.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.2.3
library(leaflet.extras)
## Warning: package 'leaflet.extras' was built under R version 4.2.3
library(dplyr)
library(gganimate)
## Warning: package 'gganimate' was built under R version 4.2.3
library(gifski)
## Warning: package 'gifski' was built under R version 4.2.3
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
library(stringr)
library(DataExplorer)
library(ggplot2)
library(ggfortify)
This chunk sets the working directory so that R can find the file. It then reads “Alzheimer_s_Disease_and_Healthy_Aging_Data.csv” into an object named “alzheimer”. The file has 250,937 observations and 39 variables, and it covers most states of the United States.
setwd("C:/Users/aline/Downloads/Data Final Project")
alzheimer <- read_csv("Alzheimer_s_Disease_and_Healthy_Aging_Data.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 250937 Columns: 39
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (25): RowId, LocationAbbr, LocationDesc, Datasource, Class, Topic, Quest...
## dbl (6): YearStart, YearEnd, Data_Value, Data_Value_Alt, Low_Confidence_Lim...
## lgl (8): Response, Sample_Size, StratificationCategory3, Stratification3, R...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
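The parsing warning above can be inspected with `problems(alzheimer)`, and the message itself suggests specifying the column types. A minimal sketch of that idea follows, assuming the affected column should be numeric. The toy file and the choice of `Sample_Size` as the example column are hypothetical; the output above does not say which columns actually had issues.

```r
library(readr)

# Hypothetical toy file standing in for the real CSV:
# Sample_Size is empty in the first rows and filled in later.
toy_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  YearEnd = rep(2021, 6),
  Sample_Size = c(NA, NA, NA, NA, 120, 85)
)
write_csv(toy, toy_path, na = "")

# Declaring the column types up front means readr does not have to guess them.
typed <- read_csv(
  toy_path,
  col_types = cols(YearEnd = col_double(), Sample_Size = col_double())
)

# Lists any values that could not be parsed with the declared types.
issues <- problems(typed)
```

For the real file, the same `col_types = cols(...)` argument could be added to the `read_csv()` call once `problems(alzheimer)` shows which columns are affected.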
Second chunk: a summary of the file. Among other things, it shows that the median survey year is 2018.
summary(alzheimer)
## RowId YearStart YearEnd LocationAbbr
## Length:250937 Min. :2015 Min. :2015 Length:250937
## Class :character 1st Qu.:2016 1st Qu.:2016 Class :character
## Mode :character Median :2018 Median :2018 Mode :character
## Mean :2018 Mean :2018
## 3rd Qu.:2020 3rd Qu.:2020
## Max. :2021 Max. :2021
##
## LocationDesc Datasource Class Topic
## Length:250937 Length:250937 Length:250937 Length:250937
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Question Response Data_Value_Unit DataValueTypeID
## Length:250937 Mode:logical Length:250937 Length:250937
## Class :character NA's:250937 Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Data_Value_Type Data_Value Data_Value_Alt
## Length:250937 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.: 15.70 1st Qu.: 15.70
## Mode :character Median : 32.30 Median : 32.30
## Mean : 37.33 Mean : 37.33
## 3rd Qu.: 56.00 3rd Qu.: 56.00
## Max. :100.00 Max. :100.00
## NA's :81635 NA's :81635
## Data_Value_Footnote_Symbol Data_Value_Footnote Low_Confidence_Limit
## Length:250937 Length:250937 Min. :-0.7
## Class :character Class :character 1st Qu.:12.4
## Mode :character Mode :character Median :26.6
## Mean :32.7
## 3rd Qu.:48.4
## Max. :99.6
## NA's :81811
## High_Confidence_Limit Sample_Size StratificationCategory1
## Min. : 1.40 Mode:logical Length:250937
## 1st Qu.: 19.40 NA's:250937 Class :character
## Median : 38.30 Mode :character
## Mean : 42.24
## 3rd Qu.: 64.00
## Max. :100.00
## NA's :81811
## Stratification1 StratificationCategory2 Stratification2
## Length:250937 Length:250937 Length:250937
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## StratificationCategory3 Stratification3 Geolocation ClassID
## Mode:logical Mode:logical Length:250937 Length:250937
## NA's:250937 NA's:250937 Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## TopicID QuestionID ResponseID LocationID
## Length:250937 Length:250937 Mode:logical Length:250937
## Class :character Class :character NA's:250937 Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## StratificationCategoryID1 StratificationID1 StratificationCategoryID2
## Length:250937 Length:250937 Length:250937
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## StratificationID2 StratificationCategoryID3 StratificationID3 Report
## Length:250937 Mode:logical Mode:logical Mode:logical
## Class :character NA's:250937 NA's:250937 NA's:250937
## Mode :character
##
##
##
##
Third chunk: checking how many columns and rows the dataset has, and which variables are quantitative and which are qualitative.
glimpse(alzheimer)
## Rows: 250,937
## Columns: 39
## $ RowId <chr> "BRFSS~2015~2015~9003~Q43~TOC11~AGE~OVERALL…
## $ YearStart <dbl> 2015, 2021, 2021, 2021, 2021, 2021, 2021, 2…
## $ YearEnd <dbl> 2015, 2021, 2021, 2021, 2021, 2021, 2021, 2…
## $ LocationAbbr <chr> "SOU", "AL", "OR", "NE", "IN", "AZ", "OH", …
## $ LocationDesc <chr> "South", "Alabama", "Oregon", "Nebraska", "…
## $ Datasource <chr> "BRFSS", "BRFSS", "BRFSS", "BRFSS", "BRFSS"…
## $ Class <chr> "Overall Health", "Mental Health", "Mental …
## $ Topic <chr> "Arthritis among older adults", "Frequent m…
## $ Question <chr> "Percentage of older adults ever told they …
## $ Response <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Data_Value_Unit <chr> "%", "%", "%", "%", "%", "%", "%", "%", "%"…
## $ DataValueTypeID <chr> "PRCTG", "PRCTG", "PRCTG", "PRCTG", "PRCTG"…
## $ Data_Value_Type <chr> "Percentage", "Percentage", "Percentage", "…
## $ Data_Value <dbl> 36.8, 15.5, 23.5, 13.6, 25.5, 9.1, 22.2, 16…
## $ Data_Value_Alt <dbl> 36.8, 15.5, 23.5, 13.6, 25.5, 9.1, 22.2, 16…
## $ Data_Value_Footnote_Symbol <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Data_Value_Footnote <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Low_Confidence_Limit <dbl> 35.9, 13.4, 16.0, 12.6, 23.9, 7.3, 20.4, 14…
## $ High_Confidence_Limit <dbl> 37.7, 17.9, 33.2, 14.6, 27.3, 11.1, 24.0, 1…
## $ Sample_Size <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ StratificationCategory1 <chr> "Age Group", "Age Group", "Age Group", "Age…
## $ Stratification1 <chr> "50-64 years", "Overall", "Overall", "Overa…
## $ StratificationCategory2 <chr> NA, "Gender", "Race/Ethnicity", NA, "Gender…
## $ Stratification2 <chr> NA, "Female", "Hispanic", NA, "Female", "Ma…
## $ StratificationCategory3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Stratification3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Geolocation <chr> NA, "POINT (-86.63186076199969 32.840571122…
## $ ClassID <chr> "C01", "C05", "C05", "C05", "C05", "C05", "…
## $ TopicID <chr> "TOC11", "TMC01", "TMC03", "TMC03", "TMC03"…
## $ QuestionID <chr> "Q43", "Q03", "Q27", "Q27", "Q27", "Q27", "…
## $ ResponseID <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LocationID <chr> "9003", "01", "41", "31", "18", "04", "39",…
## $ StratificationCategoryID1 <chr> "AGE", "AGE", "AGE", "AGE", "AGE", "AGE", "…
## $ StratificationID1 <chr> "5064", "AGE_OVERALL", "AGE_OVERALL", "AGE_…
## $ StratificationCategoryID2 <chr> "OVERALL", "GENDER", "RACE", "OVERALL", "GE…
## $ StratificationID2 <chr> "OVERALL", "FEMALE", "HIS", "OVERALL", "FEM…
## $ StratificationCategoryID3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ StratificationID3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ Report <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Loading extra packages in order to create different visualizations.
library(devtools)
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 4.2.3
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.2.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
This chunk creates “alzheimerpercetage” from the “alzheimer” dataset. It filters on “Data_Value_Type” to keep only the rows reported as a percentage, which are used below to analyze Alzheimer’s in females and males.
alzheimerpercetage <- alzheimer %>%
filter(Data_Value_Type == "Percentage")
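The `glimpse()` output earlier shows that rows with the type "Percentage" cover several topics (for example "Arthritis among older adults"), so it may help to check which topics remain after this filter. A minimal sketch on toy data follows. The second topic label is hypothetical, and the topic to keep is an assumption to be replaced after reading the counts on the real data.

```r
library(dplyr)

# Toy stand-in for alzheimerpercetage; "Hypothetical topic B" is not from the dataset.
toy <- tibble(
  Topic = c("Arthritis among older adults", "Arthritis among older adults",
            "Hypothetical topic B"),
  Data_Value_Type = "Percentage",
  Data_Value = c(36.8, 30.1, 15.5)
)

# How many rows does each topic contribute?
topic_counts <- toy %>% count(Topic, sort = TRUE)

# Keep a single topic before plotting (label chosen after looking at topic_counts).
one_topic <- toy %>% filter(Topic == "Arthritis among older adults")
```

On the real data, `alzheimerpercetage %>% count(Topic, sort = TRUE)` would list every topic that the plots below mix together.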
This chunk creates “alzheimerfemale” from “alzheimerpercetage”. It filters on “Stratification2” to keep only the rows for females.
alzheimerfemale <- alzheimerpercetage %>%
filter(Stratification2 == "Female")
This chunk creates “fem_plot” from “alzheimerfemale”. It puts together YearEnd, Data_Value, and LocationDesc. The goal is to analyze the percentage of females with Alzheimer’s in each state for each year.
fem_plot <- ggplot(alzheimerfemale, aes(x = YearEnd, y = Data_Value, color = LocationDesc)) +
  # Hide the legend; "none" replaces FALSE as of ggplot2 3.3.4
  guides(color = "none", fill = "none") +
  geom_point() +
  # Linear trend line drawn in red
  geom_smooth(method = "lm", formula = y ~ x, color = "red") +
  labs(title = "Female Alzheimer Percentage in Each Year") +
  xlab("Year") +
  ylab("Percentage") +
  theme_minimal()
fem_plot <- ggplotly(fem_plot)
## Warning: Removed 344 rows containing non-finite values (`stat_smooth()`).
fem_plot
This second plot has the same goal as the first plot, but I was trying to visualize the data in a clearer way.
ggplot(alzheimerfemale, aes(x = LocationDesc, y = Data_Value, color = LocationDesc)) +
  geom_boxplot() +
  # Hide the legend; "none" replaces FALSE as of ggplot2 3.3.4
  guides(color = "none") +
  labs(title = "Female Alzheimer Percentage", y = "Percentage", x = "States") +
  # Flip the axes so the state names are readable
  coord_flip() +
  theme(axis.text.y = element_text(size = 5))
## Warning: Removed 344 rows containing non-finite values (`stat_boxplot()`).
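The warning above says that rows with a missing `Data_Value` are dropped silently at plot time. A small sketch of counting and removing them beforehand, so the number of excluded rows is known explicitly. The toy tibble is a hypothetical stand-in for `alzheimerfemale`.

```r
library(dplyr)

# Toy stand-in for alzheimerfemale with two missing percentages.
toy_female <- tibble(
  LocationDesc = c("Alabama", "Alabama", "Oregon", "Oregon"),
  Data_Value = c(15.5, NA, 23.5, NA)
)

# Count the rows that ggplot2 would otherwise drop with a warning.
n_missing <- sum(is.na(toy_female$Data_Value))

# Keep only the rows that have a percentage to plot.
female_complete <- toy_female %>% filter(!is.na(Data_Value))
```

Plotting `female_complete` instead of the unfiltered data should avoid the "Removed ... rows" warning.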
This chunk creates “alzheimermale” from “alzheimerpercetage”. It filters on “Stratification2” to keep only the rows for males.
alzheimermale <- alzheimerpercetage %>%
filter(Stratification2 == "Male")
This chunk creates “m_plot” from “alzheimermale”. It puts together YearEnd, Data_Value, and LocationDesc. The goal is to analyze the percentage of males with Alzheimer’s in each state for each year.
m_plot <- ggplot(alzheimermale, aes(x = YearEnd, y = Data_Value, color = LocationDesc)) +
  # Hide the legend; "none" replaces FALSE as of ggplot2 3.3.4
  guides(color = "none", fill = "none") +
  geom_point() +
  # Linear trend line drawn in blue
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") +
  labs(title = "Male Alzheimer Percentage in Each Year") +
  xlab("Year") +
  ylab("Percentage") +
  theme_minimal()
m_plot <- ggplotly(m_plot)
## Warning: Removed 624 rows containing non-finite values (`stat_smooth()`).
m_plot
This boxplot has the same goal as the previous plot, but I was trying to visualize the data in a clearer way.
ggplot(alzheimermale, aes(x = LocationDesc, y = Data_Value, color = LocationDesc)) +
  geom_boxplot() +
  # Hide the legend; "none" replaces FALSE as of ggplot2 3.3.4
  guides(color = "none") +
  labs(title = "Male Alzheimer Percentage", y = "Percentage", x = "States") +
  # Flip the axes so the state names are readable
  coord_flip() +
  theme(axis.text.y = element_text(size = 5))
## Warning: Removed 624 rows containing non-finite values (`stat_boxplot()`).
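To compare the two groups with numbers as well as plots, one option is to average `Data_Value` by gender and location and then pick the top location for each gender. This is only a rough sketch on hypothetical toy values. A plain mean mixes rows from different questions and years, so it is a coarse summary and not a diagnosis rate.

```r
library(dplyr)

# Toy stand-in for alzheimerpercetage; the values are made up for illustration.
toy <- tibble(
  LocationDesc    = c("Alabama", "Alabama", "Alabama", "Oregon", "Oregon", "Oregon"),
  Stratification2 = c("Female", "Female", "Male", "Female", "Male", "Male"),
  Data_Value      = c(10, 20, 12, 30, NA, 18)
)

# Mean percentage per gender and location, ignoring missing values.
by_state_gender <- toy %>%
  filter(Stratification2 %in% c("Female", "Male"), !is.na(Data_Value)) %>%
  group_by(Stratification2, LocationDesc) %>%
  summarise(mean_pct = mean(Data_Value), .groups = "drop")

# Location with the highest mean percentage for each gender.
top_by_gender <- by_state_gender %>%
  group_by(Stratification2) %>%
  slice_max(mean_pct, n = 1) %>%
  ungroup()
```

Replacing `toy` with `alzheimerpercetage` would give one table that covers both genders, which could be read alongside the boxplots.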
The visualizations represent the dataset filtered by percentage and by gender. My goal was to analyze, through the years, the percentage of females who had Alzheimer’s in each state, and to do the same for males. I was trying to see whether one gender is more inclined to develop Alzheimer’s than the other.
An interesting surprise that I noticed in the Female Alzheimer Percentage and Male Alzheimer Percentage visualizations is that, for females, the state with the highest percentage was South Dakota, while for males it was Virginia. In the other two visualizations, Male Alzheimer Percentage in Each Year and Female Alzheimer Percentage in Each Year, I noticed that Puerto Rico was at the top for males in 2018, 2020, and 2021, and at the top for females in 2016, 2018, 2019, and 2020. I was not able to tell whether gender plays a role in developing Alzheimer’s, but seeing Puerto Rico appear at the top more than once for each gender and in multiple years makes me wonder whether lifestyle and environment could contribute to the development of Alzheimer’s.
What I would have liked to show, but could not get to work, were maps. I need to work on processing maps in R so that I am able to create them with any dataset. I tried to follow someone’s notes, but I got stuck.
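One possible starting point for the maps is the Geolocation column, which the `glimpse()` output shows as text of the form "POINT (longitude latitude)". The sketch below splits that text into numeric columns and passes them to leaflet. The toy rows and rounded coordinates are hypothetical, and the marker size rule (`Data_Value / 5`) is an arbitrary choice for illustration.

```r
library(dplyr)

# Toy stand-in for the real data; regional rows such as "South" have no Geolocation.
toy <- tibble(
  LocationDesc = c("Alabama", "South"),
  Geolocation  = c("POINT (-86.63 32.84)", NA),
  Data_Value   = c(15.5, 36.8)
)

# Pattern with two capture groups: longitude first, then latitude.
point_pattern <- "^POINT \\(([^ ]+) ([^)]+)\\)$"

coords <- toy %>%
  filter(!is.na(Geolocation)) %>%
  mutate(
    lon = as.numeric(sub(point_pattern, "\\1", Geolocation)),
    lat = as.numeric(sub(point_pattern, "\\2", Geolocation))
  )

# Draw the map only if leaflet is installed.
if (requireNamespace("leaflet", quietly = TRUE)) {
  state_map <- leaflet::leaflet(coords) %>%
    leaflet::addTiles() %>%
    leaflet::addCircleMarkers(
      lng = ~lon, lat = ~lat,
      radius = ~Data_Value / 5,
      label = ~LocationDesc
    )
}
```

Printing `state_map` in an R Markdown chunk should render the interactive map. With the real data it would probably be easier to summarise to one row per location first, so that each state gets a single marker.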