This report asks which single age had the largest population in each state in 2019. The data set, Population Age 2019 Data, comes from the Centers for Disease Control and Prevention and has five columns: state, state_name, age, population (the number of people of that age in the state), and state_total_population. Only state, age, and population are needed to answer the question. https://www.openintro.org/data/index.php?data=pop_age_2019
The first step was loading the data set. Next, the columns needed to answer the question were selected. The data were then grouped by state, the 85+ category was removed because it covers more than one age, and the rows were filtered to keep only the maximum population in each state, since that row gives the age with the largest population. Finally, the results were visualized and the number of times each age appeared was counted.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
pa19 <- read_csv("pop_age_2019.csv") #loading
## Rows: 4386 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): state, state_name, age
## dbl (2): population, state_total_population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pa19s <- pa19 |>
  select(state, age, population) # select the columns needed to answer the question
pa19sf <- pa19s |>
  group_by(state) |> # group the data by state
  filter(age != "85+") |> # remove the 85+ category because it covers more than one age
  filter(population == max(population)) |> # keep only the row with the maximum population in each state, which is the age with the largest population
  arrange(desc(age)) # sort by age to see the oldest and youngest (age is text, so this is a text sort)
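The grouping and filtering steps above can also be sketched on a small made-up table. This is only an illustration, not the code used for the results in this report: the `toy` data frame, its state codes, and its numbers are hypothetical. It assumes every age other than "85+" is a plain number stored as text, so it can be converted with `as.integer()` and sorted by value instead of as text. `slice_max()` is used here in place of `filter(population == max(population))`; with `with_ties = TRUE` it keeps tied rows in the same way.

```r
library(dplyr)

# Hypothetical toy data: two made-up states, ages stored as text like in the real file
toy <- tibble::tibble(
  state      = c("AA", "AA", "AA", "BB", "BB", "BB"),
  age        = c("9", "28", "85+", "19", "60", "85+"),
  population = c(100, 150, 900, 300, 250, 800)
)

toy_max <- toy |>
  filter(age != "85+") |>            # drop the open-ended category first
  mutate(age = as.integer(age)) |>   # assumption: remaining ages are whole numbers
  group_by(state) |>
  slice_max(population, n = 1, with_ties = TRUE) |> # largest population per state
  ungroup() |>
  arrange(desc(age))                 # numeric sort, oldest first

toy_max
```

Converting age to a number matters mostly for sorting: as text, "9" sorts after "28", while as integers the order follows the actual ages.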
#from data110
pa19sfg2 <- pa19sf
ggplot(pa19sfg2, aes(x = state, y = age, fill = population)) + # one bar per state, colored by population
  geom_col(position = position_dodge(width = 5), width = .4) +
  scale_fill_gradient(low = "yellow4", high = "red", name = "Number of People That Are That Age in the State") +
  labs(title = "Age With Largest Population by State",
       x = "State",
       y = "Age",
       caption = "Source: Centers for Disease Control and Prevention") +
  theme_minimal(base_size = 5)
pa19sfc <- pa19sf |>
group_by(age) |> #grouping by age
  count(age) # count how often each age appears
head(pa19sfc, n = 12) # show all 12 ages and their counts
## # A tibble: 12 × 2
## # Groups: age [12]
## age n
## <chr> <int>
## 1 12 2
## 2 19 7
## 3 20 1
## 4 22 1
## 5 23 1
## 6 28 21
## 7 29 4
## 8 55 4
## 9 58 2
## 10 59 4
## 11 61 3
## 12 62 1
The visualization shows which age had the largest population in each state in 2019. The oldest such age was 62, in Montana (MT). The youngest was 12, in Idaho (ID) and Mississippi (MS). Age 28 was the most common, with the largest population in 21 states, more than any other age. In the future, a clearer visualization would help, perhaps one arranged in descending order, and a different question could be explored.
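One possible way to get the descending order mentioned above is sketched here. It is a rough idea, not part of the analysis that was run: the `age_counts` table simply retypes the counts printed in the frequency table, and the object names `age_counts`, `age_sorted`, and `p` are made up for this example. `reorder(age, -n)` orders the bars from the most common age to the least common.

```r
library(dplyr)
library(ggplot2)

# Counts retyped from the frequency table printed earlier in this report
age_counts <- tibble::tibble(
  age = c("12", "19", "20", "22", "23", "28", "29", "55", "58", "59", "61", "62"),
  n   = c(2L, 7L, 1L, 1L, 1L, 21L, 4L, 4L, 2L, 4L, 3L, 1L)
)

# Table view: most frequent age first
age_sorted <- age_counts |>
  arrange(desc(n))

# Plot view: bars ordered by count, largest on the left
p <- ggplot(age_sorted, aes(x = reorder(age, -n), y = n)) +
  geom_col() +
  labs(title = "How Often Each Age Had the Largest Population",
       x = "Age",
       y = "Number of States",
       caption = "Source: Centers for Disease Control and Prevention") +
  theme_minimal()

p
```

Ages with the same count stay in their original order, so the ties among the less common ages are not ranked in any meaningful way.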
Source: Population Age 2019 Data (“pop_age_2019.csv”), Centers for Disease Control and Prevention. https://www.openintro.org/data/index.php?data=pop_age_2019