Project 1: Autism Prevalence Studies

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

getwd()

## [1] "C:/Users/leyla/Documents/DATA 101"

setwd("C:/Users/leyla/Documents/DATA 101")
df <- read_csv("C:/Users/leyla/Documents/DATA 101/autism_prevalence_studies_20251015.csv")

## Rows: 207 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (15): Author, Title, Country, Area(s), Age Range, Study Years, Case Iden...
## dbl  (9): Year Published, ASD Prevalence Estimate per 1,000, Male:Female Sex...
## num  (2): Sample Size, Number of Cases
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Do Newer Studies Tend to Report Higher Autism Prevalence Rates?

### Introduction

Autism Spectrum Disorder (ASD) is a neurodevelopmental condition that has received increasing attention in recent years. It remains unclear to researchers whether the growing number of autism diagnoses is due to improved detection, changes in diagnostic criteria, or a true rise in prevalence. The question guiding this analysis is: “Do newer studies tend to report higher autism prevalence rates?” To answer it, I am using the CDC’s Autism Prevalence Studies dataset, which compiles information from peer-reviewed ASD studies. The dataset includes key variables such as Year Published, ASD Prevalence Estimate per 1,000, and Case Identification Method, which are important for understanding trends over time. Year Published allows us to examine how prevalence estimates have changed over the years. ASD Prevalence Estimate per 1,000 provides the actual rates we aim to analyze. Although Case Identification Method is an important factor that might explain differences in reported prevalence, this analysis focuses on trends in prevalence over time.

### Data Analysis

In this section, I will clean the data and perform exploratory data analysis (EDA) to view trends in autism prevalence. Specifically, I will filter the data to remove missing values, select the columns needed for the analysis, and summarize the data.

Filter and select columns

First, I will filter the dataset to remove rows with missing values in the relevant variables and select the columns to analyze.

# Filter dataset for Year Published and ASD Prevalence Estimate
# (tidyverse is already loaded above, so it is not installed or loaded again here)

clean_data <- df |>
  filter(!is.na(`Year Published`) & !is.na(`ASD Prevalence Estimate per 1,000`)) |>
  select(`Year Published`, `ASD Prevalence Estimate per 1,000`, `Case Identification Method`)

clean_data

## # A tibble: 207 × 3
##    `Year Published` `ASD Prevalence Estimate per 1,000` Case Identification Me…¹
##               <dbl>                               <dbl> <chr>                   
##  1             1966                               0.45  survey (mail); health r…
##  2             1970                               0.077 health records          
##  3             1972                               0.43  health records; service…
##  4             1976                               4.48  registry; survey (unspe…
##  5             1979                               0.49  registry; service provi…
##  6             1982                               0.525 survey (mail)           
##  7             1983                               0.56  survey (mail)           
##  8             1984                               0.43  survey (mail)           
##  9             1984                               0.2   survey (unspecified)    
## 10             1986                               0.19  health records          
## # ℹ 197 more rows
## # ℹ abbreviated name: ¹`Case Identification Method`
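
The filtered tibble above still has 207 rows, the same number as the raw file, which suggests the filter did not drop any rows. One way to confirm this kind of result is to count missing values per column before filtering. The sketch below is only an illustration: it uses base R and a small made-up data frame (`toy`, hypothetical values that are not taken from the CDC data) with the same column names.

```r
# Toy data with the same column names as the real dataset.
# The values and the NA positions are made up for illustration.
toy <- data.frame(
  `Year Published` = c(1966, 1970, NA, 1976),
  `ASD Prevalence Estimate per 1,000` = c(0.45, NA, 0.43, 4.48),
  check.names = FALSE  # keep the spaces and commas in the column names
)

# Count missing values in each column of interest.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only rows where both columns are present, then compare row counts.
complete <- toy[complete.cases(toy), ]
rows_dropped <- nrow(toy) - nrow(complete)
print(rows_dropped)
```

Running the same two checks on `df` would show whether the filter step changes anything for the real data.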

Summarizing Prevalence by Year

Next, I will group the data by Year Published so we can look at each year separately. Then, I calculate the average autism prevalence reported in studies from each year, using the mean function.

#Group data by Year Published and calculate the average ASD prevalence per year

avg_prevalence <- clean_data |>
  group_by(`Year Published`) |>
  summarize(avg_prevalence = mean(`ASD Prevalence Estimate per 1,000`, na.rm = TRUE))

avg_prevalence

## # A tibble: 41 × 2
##    `Year Published` avg_prevalence
##               <dbl>          <dbl>
##  1             1966          0.45 
##  2             1970          0.077
##  3             1972          0.43 
##  4             1976          4.48 
##  5             1979          0.49 
##  6             1982          0.525
##  7             1983          0.56 
##  8             1984          0.315
##  9             1986          0.425
## 10             1987          0.94 
## # ℹ 31 more rows
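
A yearly mean based on a single study is less stable than one based on several, so it can help to report the number of studies next to each average. The sketch below is one possible way to do that, not part of the original analysis. The two 1984 values come from the printed rows above; the 1990 row and the column name `n_studies` are hypothetical and used only for illustration.

```r
library(dplyr)

# Small stand-in for clean_data.
# 1984 values are from the output above; the 1990 row is made up.
toy <- tibble(
  `Year Published` = c(1984, 1984, 1990),
  `ASD Prevalence Estimate per 1,000` = c(0.43, 0.20, 1.00)
)

# Report the number of studies alongside each yearly mean.
yearly <- toy |>
  group_by(`Year Published`) |>
  summarize(
    n_studies = n(),
    avg_prevalence = mean(`ASD Prevalence Estimate per 1,000`, na.rm = TRUE)
  )

print(yearly)

# Look up the averages for specific years of interest.
yearly |>
  filter(`Year Published` %in% c(1984)) |>
  print()
```

The same `filter()` call with other years could be used to look up the averages quoted for individual years.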

Visualization

Finally, I will create a scatter plot of each study's ASD prevalence estimate against the year it was published. This will help show whether reported prevalence tends to increase over time.

# Scatter Plot

ggplot(clean_data, aes(x = `Year Published`,
                       y = `ASD Prevalence Estimate per 1,000`)) +
  geom_point(size = 4, color = "purple") +
  labs(title = "Autism Prevalence Over Time",
       x = "Year Published",
       y = "ASD Prevalence Estimate per 1,000") +
  theme_minimal()
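
A scatter plot shows the trend visually, and a simple numeric summary can complement it. The sketch below is an assumption about how one might quantify the trend, not a step from the original analysis. It uses base R on a made-up data frame (`toy`, with hypothetical columns `year` and `prevalence`) and computes a Spearman rank correlation and the slope of a linear fit.

```r
# Toy data (made-up values) standing in for clean_data.
toy <- data.frame(
  year = c(1970, 1980, 1990, 2000, 2010, 2020),
  prevalence = c(0.1, 0.5, 1.0, 3.0, 9.0, 20.0)
)

# Spearman rank correlation: close to +1 when prevalence rises with year,
# without assuming the relationship is a straight line.
rho <- cor(toy$year, toy$prevalence, method = "spearman")
print(rho)

# Simple linear fit; the slope is the average change in prevalence
# (per 1,000) for each additional publication year.
fit <- lm(prevalence ~ year, data = toy)
slope <- unname(coef(fit)["year"])
print(slope)
```

With the real data, the same calls would take `Year Published` and `ASD Prevalence Estimate per 1,000` from `clean_data`. A rank correlation is used here because the increase may not be linear.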

### Conclusion

In this project, we looked at how autism prevalence estimates have changed over time using data on autism prevalence studies compiled by the CDC. The results showed a clear increasing trend in reported ASD prevalence rates. For example, the average prevalence in 2002 was around 3.4 per 1,000 people, while by 2022 it had risen to 21.6 per 1,000 people. This increase over the years suggests that autism is being identified more frequently in recent studies. This could be due to multiple reasons, such as better awareness, improved screening tools, or broader diagnostic criteria. In future research, it would be relevant to compare how different case identification methods affect autism prevalence estimates. It would also be helpful to look at geographic regions or age groups to see if certain populations show stronger trends. Overall, this analysis supports the idea that autism is being reported more often in recent years.

### References

CDC. (2025, May 27). Autism Prevalence Studies Data Table. https://www.cdc.gov/autism/data-research/data-table.html

Leyla C

2025-10-15