setwd("C:/Users/Lenovo/Downloads/SummerData110")
Final Project
Introduction
My project explores consumer complaints about telecommunication services in the United States between 2014 and 2024. The dataset I chose was collected by the Federal Communications Commission (FCC), the U.S. government agency responsible for regulating communications by radio, television, wire, satellite, and cable across the nation. The FCC provides this data to the public to ensure transparency and accountability regarding consumer experiences and complaints.
Data Source and Methodology
The data was directly collected by the FCC from consumer submissions, which include complaints about unwanted calls, billing issues, service availability, telemarketing, equipment issues, privacy concerns, robocalls, and other categories related to telecommunication services. Consumers submit these complaints via online forms, emails, phone calls, and written correspondence to the FCC. Unfortunately, no explicit ReadMe file describing the detailed methodology was provided, but the general process involves voluntary public reporting of issues.
Sources:
- https://www.fcc.gov/general/statistical-reports-fcc
- https://catalog.data.gov/dataset/cgb-consumer-complaints-data
Variables Used in the Project
Date Created: Date when the complaint was officially recorded by the FCC.
Issue: The specific type of complaint submitted by consumers.
State: U.S. state from which the complaint originated.
Location (Center point of the Zip Code): Geographical coordinates indicating the location of the complainant.
Questions for Exploration
The key questions explored in this project include:
Has there been a statistically significant increase or decrease in the number of complaints in California from 2014 to 2024?
How have volumes of the ten most frequent complaint types changed over time (2014–2024) in the five states with the highest total complaints?
Can mapping “Privacy” complaints in Pennsylvania provide insights into geographical patterns?
Why This Topic?
I chose this topic because unwanted calls and telecommunication issues have become a widespread nuisance in everyday life, significantly impacting consumer privacy and quality of life. I have personally experienced frustration due to these intrusive calls and misleading telemarketing practices. This analysis can provide useful insights into how prevalent these problems are across different regions, potentially guiding policymakers and consumer advocacy groups to better address these concerns.
# Load tidyverse for data manipulation and visualization
library(tidyverse)
# Loading the data
data <- read_csv("CGB_-_Consumer_Complaints_Data.csv")
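Before filtering, it helps to confirm that the variables listed above exist under the expected names. This is just a quick inspection sketch; the column names are the ones used in the code further below.
# Peek at the four variables used in this project
data |>
  select(`Date Created`, Issue, State,
         `Location (Center point of the Zip Code)`) |>
  glimpse()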
Identify the Most Common Complaints
Here I want to find the top 10 most frequent types of complaints to analyze further.
top_complains <- data |>
  group_by(Issue) |>
  summarize(n = n()) |>
  arrange(desc(n)) |>
  slice(1:10)
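For reference, the same top-10 ranking can be written more compactly with count(), which combines group_by() and summarize(n = n()). This is only an equivalent sketch (top_issues is an illustrative name), not a change to the pipeline above.
# Equivalent one-liner: count rows per Issue, sort descending, keep the top 10
top_issues <- data |>
  count(Issue, sort = TRUE) |>
  slice_head(n = 10)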
# Filtering the dataset to keep only records from these top 10 complaints
top_complains <- data |>
  filter(Issue %in% top_complains$Issue)
Prepare Data for Visualization (Complaints by State and Year)
# Change 'Date Created' to Date format for analysis
Viz2_top_complains <- top_complains |>
  mutate(date_created = as.Date(`Date Created`, format = "%m/%d/%Y")) # Asked ChatGPT how to convert the column from character to Date
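As a quick sanity check (a minimal sketch, assuming the parsed column is named date_created as above), I can count how many non-missing date strings failed to convert; a result of 0 means the %m/%d/%Y format matched every record.
# Count rows where the original string exists but parsing produced NA (expect 0)
sum(is.na(Viz2_top_complains$date_created) & !is.na(Viz2_top_complains$`Date Created`))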
# Filtering years 2014-2024, creating a year column, and counting complaints per state, year, and issue
Viz2_top_complains <- Viz2_top_complains |>
  mutate(year = year(date_created)) |>
  filter(year >= 2014, year <= 2024) |>
  group_by(State, year, Issue) |>
  summarise(n = n(), .groups = "drop")
Find Top 5 States with Most Complaints
The next step is to find the states with the most complaints.
top5_states <- Viz2_top_complains |>
  group_by(State) |>
  summarise(total_complaints = sum(n, na.rm = TRUE)) |>
  arrange(desc(total_complaints)) |>
  slice_head(n = 5) |>
  pull(State) # Returns a character vector with the 5 states
library(lubridate)
# Convert 'Date Created' to a real Date format
data_clean <- data |>
  filter(State %in% top5_states) |>
  mutate(`Date Created` = mdy(`Date Created`), # parse the month/day/year format
         Year = year(`Date Created`)) |>
  filter(!is.na(Year))
# Summarize complaints by year and state
complaints_by_year <- data_clean |>
  group_by(State, Year) |>
  summarize(total_complaints = n(), .groups = "drop")
Regression Analysis (California)
I will run a linear regression to see how the number of complaints in California changed from 2014 to 2024.
# Filter data specifically for California
ca_data <- complaints_by_year |> filter(State == "CA")
# Linear regression model
model_ca <- lm(total_complaints ~ Year, data = ca_data)
summary(model_ca)
Call:
lm(formula = total_complaints ~ Year, data = ca_data)
Residuals:
Min 1Q Median 3Q Max
-29353.2 -389.9 2327.9 6582.5 11999.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 438403.5 1990456.9 0.220 0.830
Year -201.1 985.6 -0.204 0.842
Residual standard error: 11790 on 10 degrees of freedom
Multiple R-squared: 0.004146, Adjusted R-squared: -0.09544
F-statistic: 0.04163 on 1 and 10 DF, p-value: 0.8424
# Visualize regression
viz1 <- ca_data |>
  ggplot(aes(x = Year, y = total_complaints)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "Complaints Over Time in California",
       x = "Year",
       y = "Total Complaints") +
  theme_minimal()
viz1
`geom_smooth()` using formula = 'y ~ x'
For my regression analysis, I looked at how the number of complaints in California changed over the years, using data from 2014 to 2024 and a linear regression. The model showed a very small negative trend (a slope of about -201 complaints per year), meaning complaints slightly decreased over time. However, the p-value is 0.84, so the result is not statistically significant, and the R-squared value is 0.004, meaning the model explains less than 1% of the variation in complaints. In short, the number of complaints did not change in a meaningful way over the years.
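A confidence interval for the slope makes the same point as the p-value: since the Year coefficient is not significant, the interval should span zero. This is a small follow-up sketch on the fitted model_ca.
# 95% confidence interval for the coefficients; an interval for Year that
# includes 0 is consistent with "no clear trend"
confint(model_ca, level = 0.95)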
Visualization 2: Complaints Trends Over Time (Top 5 States)
I will show how the top 10 complaint types changed over the years in the top 5 states.
Viz2_top_complains <- Viz2_top_complains |>
  filter(State %in% top5_states) # Keep only records from the top 5 states
# Asked ChatGPT how to organize the states in descending order
library(forcats)
Viz2_top_complains <- Viz2_top_complains |>
  mutate(State = fct_reorder(State, n, .fun = sum, .desc = TRUE))
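To confirm the reordering worked, I can print the factor levels; the first level should be the state with the most complaints (a quick check, not part of the pipeline).
# Levels should now be sorted by total complaints, largest first
levels(Viz2_top_complains$State)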
library(ggplot2)
viz2 <- Viz2_top_complains |>
  ggplot(aes(x = year, y = n, color = Issue)) +
  geom_line(linewidth = 1.2, alpha = 0.9) + # linewidth replaces the deprecated size aesthetic
  facet_wrap(~State, scales = "fixed") +
  scale_color_brewer(palette = "Paired") +
  labs(
    title = "Complaint Trends Over Time (2014–2024)",
    subtitle = "Top 10 Complaint Types in Top 5 States",
    x = "Year",
    y = "Number of Complaints",
    color = "Complaint Type"
  ) +
  theme_minimal(base_size = 4) +
  theme(
    plot.title = element_text(size = 16, face = "bold"), # bold title
    plot.subtitle = element_text(size = 13),
    axis.title.x = element_text(size = 7),
    axis.title.y = element_text(size = 7),
    axis.text.x = element_text(size = 6, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 6),
    legend.title = element_text(size = 8),
    legend.text = element_text(size = 8),
    strip.text = element_text(size = 8, face = "bold")
  )
viz2
Visualization 3: Mapping “Privacy” Complaints in Pennsylvania
Now, I want to map privacy complaints in Pennsylvania to see if there’s a geographical pattern.
viz3_map_data <- data |>
  filter(Issue == "Privacy", State == "PA") |> # filter on both issue and state
  mutate(coords = str_extract(`Location (Center point of the Zip Code)`, "\\(.*\\)")) |>
  mutate(coords = str_remove_all(coords, "[()]")) |>
  separate(coords, into = c("lat", "long"), sep = ",", convert = TRUE) |>
  mutate(
    lat = as.numeric(str_trim(lat)),
    long = as.numeric(str_trim(long))
  ) |>
  filter(
    !is.na(lat), !is.na(long),   # remove bad coords
    lat > 24, lat < 50,          # only U.S. latitude
    long > -125, long < -66      # only U.S. longitude
  )
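To make the extraction steps concrete, here is what they do to a single made-up Location value (the exact layout of real values may differ; this string is only an illustration).
# Hypothetical raw value: state and ZIP, then "(lat, long)" in parentheses
loc <- "PENNSYLVANIA, 19103 (39.95, -75.17)"
str_extract(loc, "\\(.*\\)")                          # "(39.95, -75.17)"
str_remove_all(str_extract(loc, "\\(.*\\)"), "[()]")  # "39.95, -75.17"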
library(leaflet)
viz3 <- leaflet(viz3_map_data) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~long,
    lat = ~lat,
    radius = 3,
    color = "orange",
    stroke = FALSE,
    fillOpacity = 0.5,
    popup = ~paste("City:", City, "<br>Issue:", Issue)
  )
viz3
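Because leaflet maps are interactive HTML widgets, they only display in HTML output. If a standalone copy is needed, one option (the file name here is just an example) is htmlwidgets::saveWidget():
library(htmlwidgets)
# Save the interactive map as a self-contained HTML file (example file name)
saveWidget(viz3, "pa_privacy_complaints_map.html", selfcontained = TRUE)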
Conclusion
In this project, I explored consumer complaints data collected by the FCC about telecommunication services from 2014 to 2024. I filtered the data to focus on the most common types of complaints and states with the highest number of issues, making it easier to analyze and visualize. Surprisingly, the overall number of complaints did not significantly increase over the years. This aligns with external findings, such as FTC reports indicating a significant decrease in unwanted call complaints since 2021.
The biggest challenges for me were not just handling a large dataset with over 3 million rows and 18 variables but rather figuring out exactly how to approach it. Finding the right questions to ask, deciding how to filter and analyze the data effectively, and especially extracting longitude and latitude coordinates (since these were mixed together with ZIP codes and state names in a single column) were difficult tasks. Despite these challenges, I learned valuable skills for managing and interpreting complex data.
In the future, I’d like to explore the real reasons behind the stability or even slight decline in complaint numbers. For example, maybe fewer people are reporting complaints now—I personally didn’t know I could submit complaints about unwanted calls or scams to the FCC, so perhaps others also aren’t aware of this reporting process.