mydata <- read.table("C:/Users/danie/OneDrive/Desktop/HW1/traffic_accidents.csv", header=TRUE, sep=",", dec = ".")

head(mydata)
##   accidents traffic_fine_amount traffic_density traffic_lights pavement_quality
## 1        20              4.3709          2.3049         0.0753           0.7700
## 2        11              9.5564          3.2757         0.5452           4.0540
## 3        19              7.5879          2.0989         0.6697           0.3450
## 4        23              6.3879          4.9188         0.9412           0.4729
## 5        23              2.4042          0.1961         0.7393           1.7111
## 6        31              2.4040          6.7137         0.5411           0.5905
##   urban_area average_speed rain_intensity vehicle_count time_of_day
## 1          1       32.1592         1.1944       29.0857     16.0432
## 2          1       47.8623         0.6296       93.1812      8.9108
## 3          0       36.4476         2.8584       83.0086      5.5727
## 4          0       20.9200         2.1065       81.3159     13.1452
## 5          1       37.3780         1.7028        1.4663      0.6961
## 6          1       40.4621         1.8936       68.9041      8.1801
mydata <- cbind(ID = 1:nrow(mydata), mydata)
head(mydata)
##   ID accidents traffic_fine_amount traffic_density traffic_lights
## 1  1        20              4.3709          2.3049         0.0753
## 2  2        11              9.5564          3.2757         0.5452
## 3  3        19              7.5879          2.0989         0.6697
## 4  4        23              6.3879          4.9188         0.9412
## 5  5        23              2.4042          0.1961         0.7393
## 6  6        31              2.4040          6.7137         0.5411
##   pavement_quality urban_area average_speed rain_intensity vehicle_count
## 1           0.7700          1       32.1592         1.1944       29.0857
## 2           4.0540          1       47.8623         0.6296       93.1812
## 3           0.3450          0       36.4476         2.8584       83.0086
## 4           0.4729          0       20.9200         2.1065       81.3159
## 5           1.7111          1       37.3780         1.7028        1.4663
## 6           0.5905          1       40.4621         1.8936       68.9041
##   time_of_day
## 1     16.0432
## 2      8.9108
## 3      5.5727
## 4     13.1452
## 5      0.6961
## 6      8.1801
#Change names a bit
colnames(mydata)[1] <- "ID"
colnames(mydata)[2] <- "AccidentCount"
colnames(mydata)[3] <- "FineAmount"
colnames(mydata)[4] <- "TrafficIndex"
colnames(mydata)[5] <- "LightProportion"
colnames(mydata)[6] <- "RoadQuality"
colnames(mydata)[7] <- "urban_area"
colnames(mydata)[8] <- "AvgSpeed"
colnames(mydata)[9] <- "RainLevel"
colnames(mydata)[10] <- "VehicleCount"
colnames(mydata)[11] <- "HourOfDay"
head(mydata)
##   ID AccidentCount FineAmount TrafficIndex LightProportion RoadQuality
## 1  1            20     4.3709       2.3049          0.0753      0.7700
## 2  2            11     9.5564       3.2757          0.5452      4.0540
## 3  3            19     7.5879       2.0989          0.6697      0.3450
## 4  4            23     6.3879       4.9188          0.9412      0.4729
## 5  5            23     2.4042       0.1961          0.7393      1.7111
## 6  6            31     2.4040       6.7137          0.5411      0.5905
##   urban_area AvgSpeed RainLevel VehicleCount HourOfDay
## 1          1  32.1592    1.1944      29.0857   16.0432
## 2          1  47.8623    0.6296      93.1812    8.9108
## 3          0  36.4476    2.8584      83.0086    5.5727
## 4          0  20.9200    2.1065      81.3159   13.1452
## 5          1  37.3780    1.7028       1.4663    0.6961
## 6          1  40.4621    1.8936      68.9041    8.1801

Description

Unit of observation: Individual traffic observation Sample size: 8,756 observations

Definition of variables:

Source: Torres, D. Key Factors Influencing Traffic Accidents. Kaggle. https://www.kaggle.com/datasets/torresdanilo/key-factors-influencing-traffic-accidents

(Purpose: This dataset was designed for analyzing trends, modeling risks, improving infrastructure, and supporting policy-making to enhance road safety. It is intended for educational purposes and demonstrates analytical methods in traffic safety.)

summary(mydata[ , -1])
##  AccidentCount     FineAmount     TrafficIndex    LightProportion 
##  Min.   : 5.00   Min.   :1.000   Min.   :0.0005   Min.   :0.0000  
##  1st Qu.:17.00   1st Qu.:3.191   1st Qu.:2.5216   1st Qu.:0.2608  
##  Median :21.00   Median :5.428   Median :5.0235   Median :0.5092  
##  Mean   :20.63   Mean   :5.446   Mean   :5.0136   Mean   :0.5076  
##  3rd Qu.:24.00   3rd Qu.:7.676   3rd Qu.:7.5431   3rd Qu.:0.7521  
##  Max.   :35.00   Max.   :9.998   Max.   :9.9992   Max.   :0.9999  
##   RoadQuality      urban_area        AvgSpeed         RainLevel     
##  Min.   :0.000   Min.   :0.0000   Min.   : 0.0047   Min.   :0.0000  
##  1st Qu.:1.230   1st Qu.:0.0000   1st Qu.:12.3077   1st Qu.:0.7742  
##  Median :2.484   Median :1.0000   Median :25.0091   Median :1.4970  
##  Mean   :2.480   Mean   :0.6944   Mean   :24.9520   Mean   :1.5019  
##  3rd Qu.:3.695   3rd Qu.:1.0000   3rd Qu.:37.4483   3rd Qu.:2.2486  
##  Max.   :4.999   Max.   :1.0000   Max.   :49.9978   Max.   :2.9998  
##   VehicleCount      HourOfDay      
##  Min.   : 1.027   Min.   : 0.0031  
##  1st Qu.:25.957   1st Qu.: 6.0507  
##  Median :51.280   Median :12.0137  
##  Mean   :50.698   Mean   :12.0014  
##  3rd Qu.:75.714   3rd Qu.:17.9518  
##  Max.   :99.962   Max.   :23.9970

Mean Accident Count: The mean number of accidents recorded is approximately 20.63.

3rd Quartile Accident Count: About 75% of accident counts are 24 or fewer in this dataset.

Median Fine Amount: The median fine amount, when sorted, is around 5.428.

Mean Fine Amount: The average fine amount imposed is about 5.446.

3rd Quartile Traffic Index: Around 75% of traffic indices are below or equal to 7.543.

Mean Light Proportion: On average, the light proportion value is 0.5076.

Median Road Quality: Half of the data shows road quality ratings up to 2.484.

Research question 1

Research Question 1: Is there a significant difference in the average number of traffic accidents between urban areas (urban classification = 1) and rural areas (urban classification = 0)?

Hypotheses: H₀ (Null Hypothesis): μ₁ - μ₂ = 0 (μ₁ is the average number of accidents in urban areas, μ₂ is the average number of accidents in rural areas) There is no significant difference in the average number of traffic accidents between urban and rural areas.

H₁ (Alternative Hypothesis): μ₁ - μ₂ ≠ 0 (μ₁ is the average number of accidents in urban areas, μ₂ is the average number of accidents in rural areas) There is a significant difference in the average number of traffic accidents between urban and rural areas.

library(ggplot2)

ggplot(mydata, aes(x = AccidentCount, fill = as.factor(urban_area))) +
  geom_histogram(position = "dodge", binwidth = 3, colour = "black", alpha = 0.6) +
  ylab("Frequency") +
  xlab("Number of Traffic Accidents") +
  labs(fill = "Area Type (0 = Rural, 1 = Urban)") +
  ggtitle("Distribution of Traffic Accidents by Area Type") +
  theme_minimal()

mydata_reduced <- sample(mydata$AccidentCount, size = 5000) #the sample data was too big, for shapiro test maximum is 5000, so I reduced the 



# Take random samples for urban and rural areas
urban_sample <- sample(mydata$AccidentCount[mydata$urban_area == 1], size = 3000)

min_size <- min(table(mydata$urban_area))

# Take samples based on the minimum size
urban_sample <- sample(mydata$AccidentCount[mydata$urban_area == 1], size = min_size, replace = FALSE)
rural_sample <- sample(mydata$AccidentCount[mydata$urban_area == 0], size = min_size, replace = FALSE)

# Perform Shapiro-Wilk test on the samples
shapiro.test(urban_sample)  # Test for urban areas
## 
##  Shapiro-Wilk normality test
## 
## data:  urban_sample
## W = 0.99446, p-value = 1.721e-08
shapiro.test(rural_sample)  # Test for rural areas
## 
##  Shapiro-Wilk normality test
## 
## data:  rural_sample
## W = 0.99287, p-value = 3.5e-10
# install.packages("ggpubr")
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
ggqqplot(mydata, x = "AccidentCount", facet.by = "urban_area", ggtheme = theme_minimal())

QQ plot also shows deviations from normality for both rural (0) and urban (1) areas. While the points follow the diagonal line in the middle range, there are clear deviations at the tails, especially at the lower and upper ends. This indicates that the distribution of accident counts is not normally distributed for either group.

Now I will perform both parametric and alternative non-parametric test First Independet t-test with Welch correction

Parametric test

t_test_result <- t.test(AccidentCount ~ urban_area, 
                        data = mydata, 
                        var.equal = FALSE, 
                        alternative = "two.sided")
t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  AccidentCount by urban_area
## t = -13.8, df = 5182, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1.880466 -1.412645
## sample estimates:
## mean in group 0 mean in group 1 
##        19.48206        21.12862
# install.packages("effectsize")

library(effectsize)
## Warning: package 'effectsize' was built under R version 4.4.2
# Calculate Cohen's d for AccidentCount between urban_area groups
cohens_d_result <- effectsize::cohens_d(AccidentCount ~ urban_area, 
                                        data = mydata, 
                                        pooled_sd = FALSE)
cohens_d_result
## Cohen's d |         95% CI
## --------------------------
## -0.32     | [-0.36, -0.27]
## 
## - Estimated using un-pooled SD.
# Interpret Cohen's d using Sawilowsky's 2009 rules
interpret_cohens_d(cohens_d_result$Cohens_d, rules = "sawilowsky2009")
## [1] "small"
## (Rules: sawilowsky2009)

Hypotheses

  • H₀ (Null Hypothesis): The arithmetic means of the number of traffic accidents are the same for urban and rural areas (μ₁ = μ₂).
  • H₁ (Alternative Hypothesis): The arithmetic means of the number of traffic accidents are different for urban and rural areas (μ₁ ≠ μ₂).
  • Based on the results of the statistical test (p < 0.001), we reject the null hypothesis (H₀) We can’t say that the arithmetic means of the number of traffic accidents are the same for urban and rural areas

Cohen’s d = -0.32 indicates a small effect size

Now I wil perfrom Alternative non-parametical test Wilcoxon Rank sum test

Non-parametric test

  wilcox_test_result <- wilcox.test(AccidentCount ~ urban_area, 
                                   data = mydata, 
                                   alternative = "two.sided")
wilcox_test_result
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  AccidentCount by urban_area
## W = 6698053, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
# Calculate rank-biserial correlation (effect size)
rank_biserial_result <- rank_biserial(AccidentCount ~ urban_area, data = mydata)
rank_biserial_result
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.18             | [-0.20, -0.15]
# Interpret rank-biserial correlation
interpret_rank_biserial(rank_biserial_result$r, rules = "funder2019")
## [1] "small"
## (Rules: funder2019)

Hypotheses

  • H₀: The distribution location of traffic accidents is the same for urban and rural areas.
  • H₁: The distribution location of traffic accidents is not the same for urban and rural areas.
  • Conclusion WE reject H₀ (p < 0.001). We can’t say that distribution location of traffic accidents is the same for urban and rural areas Effect Size Rank-biserial correlation (r): -0.18 Interpretation: The effect size is small

Conclusion of Task 1

The goal of this task was to determine whether there is a significant difference in the average number of traffic accidents between urban and rural areas. Although a t-test was conducted as required by the instructions, the assumption of normality was not met based on both the Shapiro-Wilk test and QQ plots. Consequently, the results of the parametric test may not be reliable. Therefore, a non-parametric Wilcoxon rank-sum test was performed, which is more appropriate for the data I chose, which does not follow a normal distribution.

it is important to note that parametric tests, when assumptions are met, hold higher statistical power and are preferred for more precise estimates:)

The Wilcoxon test showed a statistically significant difference in the distribution of traffic accidents between urban and rural areas (p < 0.001). The effect size, as measured by rank-biserial correlation, indicated a small difference in practical terms.

Research question 2

Research Question: RQ2: Is there a linear correlation between AccidentCount and RoadQuality?

colnames(mydata)
##  [1] "ID"              "AccidentCount"   "FineAmount"      "TrafficIndex"   
##  [5] "LightProportion" "RoadQuality"     "urban_area"      "AvgSpeed"       
##  [9] "RainLevel"       "VehicleCount"    "HourOfDay"
head(mydata[, c("AccidentCount", "RoadQuality")])
##   AccidentCount RoadQuality
## 1            20      0.7700
## 2            11      4.0540
## 3            19      0.3450
## 4            23      0.4729
## 5            23      1.7111
## 6            31      0.5905
summary(mydata[, c("AccidentCount", "RoadQuality")])
##  AccidentCount    RoadQuality   
##  Min.   : 5.00   Min.   :0.000  
##  1st Qu.:17.00   1st Qu.:1.230  
##  Median :21.00   Median :2.484  
##  Mean   :20.63   Mean   :2.480  
##  3rd Qu.:24.00   3rd Qu.:3.695  
##  Max.   :35.00   Max.   :4.999
#install.packages("carData")
library(car)
## Warning: package 'car' was built under R version 4.4.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.2
# Reduce the dataset to a random sample of 1000 observations
set.seed(123)  # For reproducibility
mydata_sampled <- mydata[sample(nrow(mydata), 1000), ]

scatterplotMatrix(mydata_sampled[, c(2, 6)], smooth = FALSE)

From the scatterplot in the lower-left panel of the image, it seems that there is a weak negative linear correlation between “AccidentCount” and “RoadQuality.” The fitted line slopes slightly downward, suggesting that as “RoadQuality” increases, “AccidentCount” tends to decrease slightly. However, the spread of the data points is quite high, indicating that the relationship is not strong. Statistical analysis, such as calculating the correlation coefficient, would provide a more precise measure of the strength and direction of this relationship.

cor(mydata$AccidentCount, mydata$RoadQuality,
    method = "pearson",
    use = "complete.obs")
## [1] -0.2066249
cor.test(mydata$AccidentCount, mydata$RoadQuality,
         method = "pearson",
         use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  mydata$AccidentCount and mydata$RoadQuality
## t = -19.759, df = 8754, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2265905 -0.1864858
## sample estimates:
##        cor 
## -0.2066249
str(mydata$RoadQuality)
##  num [1:8756] 0.77 4.054 0.345 0.473 1.711 ...
str(mydata$AccidentCount)
##  int [1:8756] 20 11 19 23 23 31 29 18 15 22 ...

Here I checked if both variables that I used are numerical (I was not sure for RoadQuality)

Research question 3

Pearson Chi2 test (parametric test) should be used because we are analyzing the association between 2 categorical variables

Assumptions:

Observations must be independent

All expected frequencies are greater than 1

Maximum 20% of the frequencies can be between 1 and 5, however, this will reduce the power of the test

# Convert RoadQuality to categorical
mydata$CategoricalRoadQuality <- cut(
  mydata$RoadQuality,
  breaks = c(-Inf, 1, 3, 5),
  labels = c("Poor", "Moderate", "Excellent"),
  right = TRUE
)

# Create contingency table
table_data <- table(mydata$urban_area, mydata$CategoricalRoadQuality)
print(table_data)
##    
##     Poor Moderate Excellent
##   0  521     1067      1088
##   1 1249     2434      2397
results <- chisq.test(mydata$urban_area, mydata$CategoricalRoadQuality, correct = FALSE)
results
## 
##  Pearson's Chi-squared test
## 
## data:  mydata$urban_area and mydata$CategoricalRoadQuality
## X-squared = 1.7812, df = 2, p-value = 0.4104

Pearson’s Chi-squared test - H₀: There is no association between Urban or Rural Classification and Categorical Road Quality. - H₁: There is an association between Urban or Rural Classification and Categorical Road Quality.

# Observed frequencies
addmargins(results$observed)
##                  mydata$CategoricalRoadQuality
## mydata$urban_area Poor Moderate Excellent  Sum
##               0    521     1067      1088 2676
##               1   1249     2434      2397 6080
##               Sum 1770     3501      3485 8756
# Expected frequencies
round(results$expected, 2)
##                  mydata$CategoricalRoadQuality
## mydata$urban_area    Poor Moderate Excellent
##                 0  540.95  1069.97   1065.08
##                 1 1229.05  2431.03   2419.92

The observed frequencies show the actual counts of observations in each combination of Urban or Rural Classification and Categorical Road Quality. The second assumption is met because all expected frequencies are greater than 1. The third assumption is also met because no expected frequencies are below 5.

round(results$residuals, 2)
##                  mydata$CategoricalRoadQuality
## mydata$urban_area  Poor Moderate Excellent
##                 0 -0.86    -0.09      0.70
##                 1  0.57     0.06     -0.47

All residuals are below |1.96|, so there are no statistically significant differences between observed and expected frequencies in any category.

# Proportion table 1
addmargins(round(prop.table(results$observed), 3))
##                  mydata$CategoricalRoadQuality
## mydata$urban_area  Poor Moderate Excellent   Sum
##               0   0.060    0.122     0.124 0.306
##               1   0.143    0.278     0.274 0.695
##               Sum 0.203    0.400     0.398 1.001
# Proportion table 2 (row-wise)
addmargins(round(prop.table(results$observed, 1), 3), 2)
##                  mydata$CategoricalRoadQuality
## mydata$urban_area  Poor Moderate Excellent   Sum
##                 0 0.195    0.399     0.407 1.001
##                 1 0.205    0.400     0.394 0.999
# Proportion table 3 (column-wise)
addmargins(round(prop.table(results$observed, 2), 3), 1)
##                  mydata$CategoricalRoadQuality
## mydata$urban_area  Poor Moderate Excellent
##               0   0.294    0.305     0.312
##               1   0.706    0.695     0.688
##               Sum 1.000    1.000     1.000

Proportion Table 1:

6.0% of all observations were Rural and had Poor Road Quality. 14.3% of all observations were Urban and had Poor Road Quality. Proportion Table 2:

Proportion table 2: In Rural areas, 19.5% of roads were Poor, 39.9% Moderate, and 40.7% Excellent. In Urban areas, 20.5% of roads were Poor, 40.0% Moderate, and 39.4% Excellent. Proportion Table 3:

Proportion table 3: Of all Poor-rated roads, 29.4% were Rural and 70.6% were Urban. Of all Excellent-rated roads, 31.2% were Rural and 68.8% were Urban.

library(effectsize)
effectsize::cramers_v(mydata$urban_area, mydata$CategoricalRoadQuality)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

Cramer’s V = 0.00 indicates no association between Urban or Rural Classification and Categorical Road Quality. The effect size is negligible, supporting the conclusion that these variables are independent.