project2_jpark

Does the cars reliability actually going bad?

Image: gep.com

The automotive industry has faced a significant paradigm shift over the past two decades due to global environmental mandates. While these regulations aim for a cleaner environment, a growing number of consumers claim that modern vehicles are less reliable than older models. This project aims to validate these claims by analyzing the correlation between production years, manufacturing regions, and vehicle reliability.

Project Overview

The primary objective of this study is to identify whether vehicle durability trends vary significantly across different manufacturing regions, including Korea, Germany, Japan, and the USA. By establishing these regional patterns, the project then analyzes the specific influence of European emission standards (Euro 4, 5, and 6). The investigation focuses on whether the transition to complex emission-control hardware (such as EGR valves and DPF systems) has created a systematic trade-off with the mechanical integrity of engine and fuel systems.

Questions

I plan to explore these main questions:

  • How do defect trends differ across major automobile manufacturing regions? (Linear regression)

  • Is there a correlation between the implementation timeline of stricter Euro emission standards and the increase in engine and fuel system complaints? (Visual analysis)

Data Overview

The data used in this study is sourced from the National Highway Traffic Safety Administration (NHTSA) Consumer Complaints database (https://www.nhtsa.gov/nhtsa-datasets-and-apis). The dataset encompasses consumer complaint records from the years 2000 to 2026, allowing for a comprehensive longitudinal analysis of vehicle defect trends.

Each dataset had 49 variables, but I planned to delete duplicate observations and sensitive/useless information (unique identification numbers, regions, etc.). This was conducted in Excel.

Also, I improved the reliability of the data through the processes below.

Data Integrity & Methodology

To ensure the highest level of data accuracy and focus purely on “Initial Reliability,” I implemented a rigorous multi-stage filtering process:

  • Wear and Tear vs. Reliability: I limited the dataset to vehicles that experienced failures within 6 years of their production date and before reaching 100,000 miles. This threshold was established because failures beyond these limits are often attributable to natural aging or excessive use rather than manufacturing defects.

  • AI-Driven Cleaning: To eliminate noise caused by poor owner maintenance, I utilized an AI-driven filtering model to scan the raw text of the complaints. This process successfully removed cases where the failure was clearly a result of maintenance negligence rather than an engineering flaw.

  • Temporal Scope: While the dataset has data from 2000 to 2026, the analysis period was restricted to 2000–2024 models. While the raw data included 2025 and 2026 models, they were excluded from this study as they have not yet been in service long enough to provide statistically meaningful or consistent failure trends.

#1 Combining Data

In order to analyze the data comprehensively across multiple years, several CSV files containing yearly complaint records must be combined into a single dataset. During this process, missing values and mismatched column names across different years are standardized.

library(purrr)
library(dplyr)
Warning: 패키지 'dplyr'는 R 버전 4.5.3에서 작성되었습니다

다음의 패키지를 부착합니다: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(readr)
Warning: 패키지 'readr'는 R 버전 4.5.3에서 작성되었습니다
setwd("C:/Users/gun26/OneDrive/Desktop/project2")

# 1. Get file list (Fixed pattern and quotes)
files <- list.files(pattern = "\\.csv$")

# 2. Combine data (Fixed pipe and map_df syntax)
# col_types = cols(.default = "c") is added to prevent data type mismatch errors
combined_data <- files %>%
  map_df(~read_csv(.x, col_types = cols(.default = "c")))

# 3. Save to CSV
write_csv(combined_data, "2000-2026_combined.csv")

#2 Running Code

Vehicle Defect Trends by Euro Regulations

https://www.shinyapps.io/admin/#/application/17185419

library(shiny)
Warning: 패키지 'shiny'는 R 버전 4.5.3에서 작성되었습니다
library(dplyr)
library(readr)
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
# 1. Load Data and Map Regions
# Mapping brands to their respective countries for comparison

df_final <- read_csv("2000-2026_combined.csv") %>%
  mutate(MAKETXT = toupper(trimws(MAKETXT))) %>%
  mutate(Region = case_when(
    MAKETXT %in% c("HYUNDAI", "KIA", "GENESIS", "DAEWOO") ~ "Korea",
    MAKETXT %in% c("BMW", "MERCEDES-BENZ", "MERCEDES BENZ", "VOLKSWAGEN", "AUDI", "PORSCHE", "SMART") ~ "Germany",
    MAKETXT %in% c("TOYOTA", "HONDA", "NISSAN", "LEXUS", "INFINITI", "MAZDA", "SUBARU", "MITSUBISHI", "SUZUKI", "ISUZU", "SCION") ~ "Japan",
    MAKETXT %in% c("FORD", "CHEVROLET", "CHRYSLER", "DODGE", "JEEP", "GMC", "CADILLAC", "BUICK", "LINCOLN", "SATURN", "PONTIAC", "OLDSMOBILE", "HUMMER", "RAM") ~ "USA",
    MAKETXT %in% c("JAGUAR", "LANDROVER", "ASTON MARTIN", "MINI", "ROLLS-ROYCE", "MCLAREN", "BENTLEY", "LOTUS") ~ "UK",
    MAKETXT %in% c("VOLVO", "FIAT", "ALFA ROMEO", "MASERATI", "FERRARI", "LAMBORGHINI") ~ "Europe/Others",
    TRUE ~ "Others"
  ))
Rows: 1511223 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): MFR_NAME, MAKETXT, model, CRASH, FIRE, detail, CDESCR, ACTIONNUM, ...
dbl (13): MODELYEAR, FAILDATE, YEARGAP, INJURED, DEATHS, DATEA, LDATE, MILES...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 2. User Interface (UI)
# Designing the dashboard layout and input controls
ui <- fluidPage(
  titlePanel("Vehicle Defect Trends by Euro Regulations"),
  sidebarLayout(
    sidebarPanel(
      # Switch between brand-level and region-level analysis
      radioButtons("analysis_level", "Analysis Level:", 
                   choices = c("By Brand" = "MAKETXT", "By Region" = "Region"),
                   selected = "MAKETXT",
                   inline = TRUE),
      hr(),
      
# Filter by manufacturing region
      selectizeInput("region", "1. Select Region (Country):", 
                     choices = c("All", "Korea", "Germany", "Japan", "USA", "UK", "Europe/Others"),
                     multiple = TRUE,
                     options = list(placeholder = 'Select Region')),
      
# Dynamic multi-brand selection
      selectizeInput("brand", "2. Select Brands to Compare:", 
                     choices = NULL, 
                     multiple = TRUE,
                     options = list(placeholder = 'Select brands')),
                     
# Filter by specific vehicle component
      selectInput("part", "3. Select Component Category:", 
                  choices = c("All Components", "ENGINE", "FUEL SYSTEM", "POWER TRAIN"),
                  selected = "All Components"),
      
# Production year range selector
      sliderInput("year_range", "4. Select Production Year Range:", 
                  min = 2000, max = 2026, 
                  value = c(2000, 2026),
                  sep = ""),
      
      hr(),
      helpText("Data source: NHTSA Consumer Complaints Dataset")
    ),
    
    mainPanel(
# Interactive chart output
      highchartOutput("trendPlot", height = "500px"),
      hr(),
# Statistical analysis output for the essay
      h4("Statistical Analysis Results (Linear Regression)"),
      verbatimTextOutput("statResults")
    )
  )
)

# 3. Server Logic
# Handling data processing, filtering, and rendering
server <- function(input, output, session) {
  
# Update brand dropdown list based on the selected region
  observe({
    if (is.null(input$region)) {
      brand_choices <- character(0)
    } else if ("All" %in% input$region) {
      brand_choices <- sort(unique(df_final$MAKETXT))
    } else {
      brand_choices <- df_final %>% 
        filter(Region %in% input$region) %>% 
        pull(MAKETXT) %>% 
        unique() %>% 
        sort()
    }
    updateSelectizeInput(session, "brand", choices = brand_choices, server = TRUE)
  })
  
# Reactive function to filter data based on user inputs
  filtered_data_react <- reactive({
    data <- df_final
    
# Filter by Region
    if (!("All" %in% input$region) && !is.null(input$region)) {
      data <- data %>% filter(Region %in% input$region)
    }
# Filter by Brand (only if Analysis Level is 'By Brand')
    if (input$analysis_level == "MAKETXT" && !is.null(input$brand)) {
      data <- data %>% filter(MAKETXT %in% input$brand)
    }
# Filter by Component Keyword
    if (input$part != "All Components") {
      data <- data %>% filter(grepl(input$part, detail, ignore.case = TRUE))
    }
    
# Filter by Year Range selected in the slider
    data <- data %>% 
      filter(!is.na(MODELYEAR) & MODELYEAR != 9999) %>%
      filter(MODELYEAR >= input$year_range[1] & MODELYEAR <= input$year_range[2])
    
    data
  })
  
# Render the Interactive Trend Plot
  output$trendPlot <- renderHighchart({
    if (input$analysis_level == "MAKETXT") req(input$brand)
    if (input$analysis_level == "Region") req(input$region)
    
# Group and summarize defect counts
    summary_data <- filtered_data_react() %>%
      group_by(MODELYEAR, group_var = .data[[input$analysis_level]]) %>%
      summarise(complaints = n(), .groups = "drop")
    
    hchart(summary_data, "line", hcaes(x = MODELYEAR, y = complaints, group = group_var)) %>%
      hc_title(text = paste("Yearly Complaints Trend Comparison", 
                            ifelse(input$analysis_level == "MAKETXT", "(By Brand)", "(By Region)"))) %>%
      hc_xAxis(
        title = list(text = "Production Year"),
# Euro Regulation periods with background colors
        plotBands = list(
          list(from = 2005, to = 2010, color = "#FFFFF0", label = list(text = "EURO 4")),
          list(from = 2011, to = 2014, color = "#E0FFFF", label = list(text = "EURO 5")),
          list(from = 2015, to = 2024, color = "#FFE4E1", label = list(text = "EURO 6"))
        )
      ) %>%
      hc_yAxis(title = list(text = "Number of Complaints Reported")) %>%
      hc_tooltip(shared = TRUE, crosshairs = TRUE)
  })

# Perform Linear Regression and display results
  output$statResults <- renderPrint({
    if (input$analysis_level == "MAKETXT") req(input$brand)
    
# Aggregate data by year for regression
    stat_data <- filtered_data_react() %>%
      group_by(MODELYEAR) %>%
      summarise(total_complaints = n(), .groups = "drop")
    
# Calculate Linear Regression Model
    fit <- lm(total_complaints ~ MODELYEAR, data = stat_data)
    
    cat("Linear Regression Analysis:\n")
    print(summary(fit))
    cat("\nRegression Equation:\n")
    cat(paste0("Complaints = ", round(coef(fit)[1], 2), " + ", round(coef(fit)[2], 2), " * (Production Year)\n"))
  })
}

shinyApp(ui = ui, server = server)

Shiny applications not supported in static R Markdown documents

1. Analysis by Region and Euro

The regression analysis reveals distinct trajectories for each manufacturing region, reflecting how different engineering philosophies adapted to the Euro regulation timeline.

  • Korea (Positive Correlation): Korean manufacturers show the most significant upward trend in complaints. The positive slope of 94.72 with a statistically significant p-value of 0.00437 indicates that for every production year, defects increased by approximately 95 units. This suggests a direct trade-off between the rapid integration of complex emission-control hardware (EGR, DPF) and long-term mechanical durability.

    • Equation: -188,171.90 + 94.72 X (Production Year)

  • Japan & Germany (Stability): Both regions maintained remarkably stable reliability throughout the regulatory transitions. With non-significant p-values (0.487 for Japan and 0.852 for Germany), these manufacturers successfully neutralized the impact of increased hardware complexity through robust engineering and quality control.

    • Equation (Japan): Complaints = 97,390.80 - 44.53 X (Production Year)

    • Equation (German): Complaints = 7,838.73 - 2.80 X (Production Year)

  • USA (Significant Improvement): Interestingly, the USA shows a strong negative correlation (Slope -406.8, p-value 0.00231). This indicates that while older models had high complaint volumes, the quality of US brands has statistically improved over time, effectively overcoming initial reliability hurdles even under stricter regulations.

    • Equation: Complaints = 834,950.65 - 406.79 X (Production Year)

  • UK & Others (Excluded): These regions were excluded from the final comparative analysis due to insufficient sample sizes and low absolute complaint volumes, which lacked the statistical power required for meaningful interpretation.

2. Conclusion

The results prove that vehicle reliability is intrinsically linked to the regulatory landscape of the production era. While stricter Euro emission standards have successfully reduced the environmental footprint of automobiles, the statistical evidence (particularly from regions like Korea) confirms that the resulting hardware complexity has led to systematic durability trade-offs. The divergent trends between regions further indicate that a manufacturer’s technological maturity and adaptation strategy are critical factors in mitigating the reliability risks associated with global environmental compliance.

3. Limitations

The primary limitation of this study is the lack of normalization by total sales volume. As raw complaint counts can be influenced by market share and the number of units sold, the observed spikes may partially reflect a brand’s sales success rather than purely mechanical failure rates.

To improve?: To achieve a more precise “Defect-to-Sales” ratio, future research should integrate market exposure data to calculate defect rates per 10,000 units sold.