Final project

Author

V. Lyon

Introduction

My project explores consumer complaints about telecommunication services in the United States between 2014 and 2024. The dataset I chose was collected by the Federal Communications Commission (FCC), the U.S. government agency responsible for regulating communications by radio, television, wire, satellite, and cable across the nation. The FCC provides this data to the public to ensure transparency and accountability regarding consumer experiences and complaints.

Data Source and Methodology

The data was collected directly by the FCC from consumer submissions, which include complaints about unwanted calls, billing issues, service availability, telemarketing, equipment issues, privacy concerns, robocalls, and other categories related to telecommunication services. Consumers submit these complaints to the FCC via online forms, email, phone calls, and written correspondence. No README file describing the detailed methodology is provided, but the general process involves voluntary public reporting of issues.

Sources:

  • https://www.fcc.gov/general/statistical-reports-fcc

  • https://www.theverge.com/2024/11/16/24298278/ftc-illegal-scam-call-robocall-telemarketer-complaints-down-50-percent-since-2021

  • https://catalog.data.gov/dataset/cgb-consumer-complaints-data

Variables Used in the Project

  • Date Created: Date when the complaint was officially recorded by the FCC.

  • Issue: The specific type of complaint submitted by consumers.

  • State: U.S. state from which the complaint originated.

  • Location (Center point of the Zip Code): Geographical coordinates indicating the location of the complainant.

Questions for Exploration

The key questions explored in this project include:

  1. Has there been a statistically significant increase or decrease in the number of complaints in California from 2014 to 2024?

  2. How have volumes of the ten most frequent complaint types changed over time (2014–2024) in the five states with the highest total complaints?

  3. Can mapping “Privacy” complaints in Pennsylvania provide insights into geographical patterns?

Why This Topic?

I chose this topic because unwanted calls and telecommunication issues have become a widespread nuisance in everyday life, significantly impacting consumer privacy and quality of life. I have personally experienced frustration due to these intrusive calls and misleading telemarketing practices. This analysis can provide useful insights into how prevalent these problems are across different regions, potentially guiding policymakers and consumer advocacy groups to better address these concerns.

setwd("C:/Users/Lenovo/Downloads/SummerData110")
# Load tidyverse for data manipulation and visualization
library(tidyverse)
# Load the complaints data
data <- read_csv("CGB_-_Consumer_Complaints_Data.csv")
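
Before filtering, a quick structural check helps confirm the file loaded as expected; glimpse() (attached with the tidyverse) prints each column's name, type, and first few values. This is an optional step, not part of the original workflow:

# Optional sanity check: inspect column names, types, and sample values
glimpse(data)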

Identify the Most Common Complaints

Here I want to find the top 10 most frequent types of complaints to analyze further.

top_complaints <- data |>
  group_by(Issue) |>
  summarize(n = n()) |>
  arrange(desc(n)) |>
  slice(1:10)
# Filter the dataset to keep only records matching these top 10 complaint types
top_complaints_data <- data |>
  filter(Issue %in% top_complaints$Issue)

Prepare Data for Visualization (Complaints by State and Year)

# Convert 'Date Created' to Date format for analysis
Viz2_top_complaints <- top_complaints_data |>
  mutate(date_created = as.Date(`Date Created`, format = "%m/%d/%Y")) # Asked ChatGPT how to convert the column from character to Date
# Keep years 2014-2024, add a year column, then count complaints per state, year, and issue
Viz2_top_complaints <- Viz2_top_complaints |>
  mutate(year = year(date_created)) |>   # year() comes from lubridate (attached with tidyverse 2.0+)
  filter(year >= 2014, year <= 2024) |>
  group_by(State, year, Issue) |>
  summarise(n = n(), .groups = "drop")

Find Top 5 States with Most Complaints

The next step is to find the states with the most complaints.

top5_states <- Viz2_top_complaints |> 
  group_by(State) |> 
  summarise(total_complaints = sum(n, na.rm = TRUE)) |>  
  arrange(desc(total_complaints)) |> 
  slice_head(n = 5) |> 
  pull(State)  # Returns a character vector with the 5 state abbreviations
library(lubridate)

# Convert 'Date Created' to a real Date format
data_clean <- data |>
  filter(State %in% top5_states) |>
  mutate(`Date Created` = mdy(`Date Created`),   # parse the month/day/year format
         Year = year(`Date Created`)) |>
  filter(!is.na(Year)) |>
  filter(Year >= 2014, Year <= 2024)   # keep the 2014-2024 window stated in the questions
# Summarize complaints by year and state
complaints_by_year <- data_clean |>
  group_by(State, Year) |>
  summarize(total_complaints = n(), .groups = "drop")
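
Question 2 asks how the ten most frequent complaint types trended in the top five states, but no chart for it appears in the code above. As a minimal sketch (assuming the Viz2_top_complaints and top5_states objects built earlier), one way to draw it is a line chart faceted by state:

# Sketch: yearly counts of the top 10 complaint types, one panel per top-5 state
viz2 <- Viz2_top_complaints |>
  filter(State %in% top5_states) |>
  ggplot(aes(x = year, y = n, color = Issue)) +
  geom_line() +
  facet_wrap(~ State) +
  labs(title = "Top 10 Complaint Types Over Time (Top 5 States)",
       x = "Year",
       y = "Number of Complaints") +
  theme_minimal()
viz2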

Regression Analysis (California)

I will fit a linear regression to see how the number of complaints in California changed from 2014 to 2024.

# Filter data specifically for California
ca_data <- complaints_by_year |> filter(State == "CA")
# Linear regression model
model_ca <- lm(total_complaints ~ Year, data = ca_data)
summary(model_ca)

Call:
lm(formula = total_complaints ~ Year, data = ca_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-29353.2   -389.9   2327.9   6582.5  11999.0 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  438403.5  1990456.9   0.220    0.830
Year           -201.1      985.6  -0.204    0.842

Residual standard error: 11790 on 10 degrees of freedom
Multiple R-squared:  0.004146,  Adjusted R-squared:  -0.09544 
F-statistic: 0.04163 on 1 and 10 DF,  p-value: 0.8424
# Visualize regression
viz1 <- ca_data |>
  ggplot(aes(x = Year, y = total_complaints)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "Complaints Over Time in California",
       x = "Year",
       y = "Total Complaints"
       )+
  theme_minimal()
viz1
`geom_smooth()` using formula = 'y ~ x'

For my regression analysis, I looked at how the number of complaints in California changed over the years 2014 to 2024. The fitted model shows a very small negative trend (a slope of about -201 complaints per year), meaning complaints slightly decreased over time. However, the p-value is 0.84, so the result is not statistically significant, and the R-squared value is 0.004, meaning the model explains less than 1% of the variation in complaints. In other words, the number of complaints did not change in a meaningful way over this period.
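
The same numbers can be pulled out programmatically instead of read off summary(). One option is broom, which installs with the tidyverse but is not attached by library(tidyverse):

# Coefficient table (estimates, standard errors, p-values) as a tidy data frame
broom::tidy(model_ca)
# One-row model summary including R-squared and the overall p-value
broom::glance(model_ca)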

Visualization 3: Mapping “Privacy” Complaints in Pennsylvania

Now, I want to map privacy complaints in Pennsylvania to see if there’s a geographical pattern.

viz3_map_data <- data |>
  filter(Issue == "Privacy", State == "PA") |>  # keep only Pennsylvania privacy complaints
  mutate(coords = str_extract(`Location (Center point of the Zip Code)`, "\\(.*\\)")) |>
  mutate(coords = str_remove_all(coords, "[()]")) |>
  separate(coords, into = c("lat", "long"), sep = ",", convert = TRUE) |>
  mutate(
    lat = as.numeric(str_trim(lat)),
    long = as.numeric(str_trim(long))
  ) |>
  filter(
    !is.na(lat), !is.na(long),       # drop rows with unparseable coordinates
    lat > 24, lat < 50,              # keep only plausible U.S. latitudes
    long > -125, long < -66          # keep only plausible U.S. longitudes
  )
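
For reference, this extraction assumes the location column ends with the coordinates in parentheses. On a hypothetical value (the exact raw format is my assumption, since the dataset's documentation does not show it), the first step behaves like this:

# Hypothetical raw value; str_extract() pulls out the "(lat, long)" suffix
loc <- "19104 Philadelphia, PA (39.9526, -75.1652)"
str_extract(loc, "\\(.*\\)")  # returns "(39.9526, -75.1652)"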
library(leaflet)
viz3 <- leaflet(viz3_map_data) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~long,
    lat = ~lat,
    radius = 3,
    color = "orange",
    stroke = FALSE,
    fillOpacity = 0.5,
    popup = ~paste("City:", City, "<br>Issue:", Issue)
  )
viz3

Conclusion

In this project, I explored consumer complaints data collected by the FCC about telecommunication services from 2014 to 2024. I filtered the data to focus on the most common types of complaints and states with the highest number of issues, making it easier to analyze and visualize. Surprisingly, the overall number of complaints did not significantly increase over the years. This aligns with external findings, such as FTC reports indicating a significant decrease in unwanted call complaints since 2021.

The biggest challenge was not just handling a large dataset (over 3 million rows and 18 variables) but figuring out exactly how to approach it: finding the right questions to ask, deciding how to filter and analyze the data effectively, and especially extracting longitude and latitude coordinates, since these were mixed together with ZIP codes and state names in a single column. Despite these challenges, I learned valuable skills for managing and interpreting complex data.

In the future, I’d like to explore the real reasons behind the stability or even slight decline in complaint numbers. For example, maybe fewer people are reporting complaints now—I personally didn’t know I could submit complaints about unwanted calls or scams to the FCC, so perhaps others also aren’t aware of this reporting process.