data loading

# Load necessary libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
library(broom)

# Load your data
data <- read.csv("C:/Users/Dell/Downloads/Taxi.csv")

boxplot income and overtreatment, data adjustment and t-test

# boxplot to visualize relationship
ggplot(data, aes(x = income, y = overtreatment)) +
  geom_boxplot() +
  labs(title = "Relationship between Overtreatment and Alleged Income",
       x = "Alleged Income",
       y = "Overtreatment Index")

# check for missing values
summary(data)
##      triple          origin             income            rushhour        
##  Min.   :  1.00   Length:348         Length:348         Length:348        
##  1st Qu.: 29.75   Class :character   Class :character   Class :character  
##  Median : 58.50   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 58.50                                                           
##  3rd Qu.: 87.25                                                           
##  Max.   :116.00                                                           
##    dgender               dage       overtreatment    overcharge       
##  Length:348         Min.   :23.00   Min.   :1.000   Length:348        
##  Class :character   1st Qu.:40.00   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :48.50   Median :1.012   Mode  :character  
##                     Mean   :46.94   Mean   :1.064                     
##                     3rd Qu.:55.00   3rd Qu.:1.048                     
##                     Max.   :75.00   Max.   :1.991
# check data types
str(data)
## 'data.frame':    348 obs. of  8 variables:
##  $ triple       : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ origin       : chr  "resident" "nonresident" "foreign" "resident" ...
##  $ income       : chr  "high" "low" "high" "high" ...
##  $ rushhour     : chr  "yes" "yes" "yes" "no" ...
##  $ dgender      : chr  "male" "male" "male" "male" ...
##  $ dage         : int  45 45 50 50 50 60 50 55 65 45 ...
##  $ overtreatment: num  1.11 1 1.09 1.16 1.13 ...
##  $ overcharge   : chr  "no" "no" "no" "no" ...
# data cleansing
data <- na.omit(data)  # Remove rows with missing values
data$income <- as.factor(data$income)  # Convert to factor if not already

# t-test
t_test_income <- t.test(overtreatment ~ income, data = data)
tidy(t_test_income)
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   0.0182      1.07      1.06      1.37   0.172      346. -0.00797    0.0444
## # ℹ 2 more variables: method <chr>, alternative <chr>

Overtreatment and Alleged Income: The boxplot gives a first impression of how the overtreatment index is distributed within each alleged income group. Most values lie close to 1: in the summary output the overall median is 1.012 and the third quartile is 1.048. Because income has only two levels, a two-sample t-test (Welch's version, which t.test runs by default) is used to check whether the group means differ. A p-value below 0.05 would indicate a difference in mean overtreatment between the two income groups.

The estimated difference in means between the two groups (0.0182, from group means of about 1.07 for high and 1.06 for low alleged income) is not statistically significant: the p-value is 0.172, so the null hypothesis of equal means cannot be rejected. The 95% confidence interval for the difference in means (-0.00797 to 0.0444) includes zero, which is consistent with that result. The data therefore provide no substantial evidence that mean overtreatment differs between the two income groups.
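The link between the estimate, the confidence interval, and the p-value can be checked directly on the object that t.test returns. The following is a minimal sketch on a small hypothetical data frame (the `toy` values are invented for illustration and are not the taxi data); it only assumes base R.

```r
# Hypothetical toy data, NOT the taxi data set: two income groups, index values near 1
toy <- data.frame(
  income = factor(rep(c("high", "low"), each = 6)),
  overtreatment = c(1.00, 1.02, 1.05, 1.10, 1.01, 1.03,
                    1.00, 1.01, 1.04, 1.08, 1.02, 1.00)
)

# t.test() with a formula runs Welch's test by default (var.equal = FALSE)
res <- t.test(overtreatment ~ income, data = toy)

# The reported difference is mean(first level) - mean(second level),
# with groups taken in factor-level order ("high" before "low")
group_means <- unname(res$estimate)
diff_means <- group_means[1] - group_means[2]

# For a two-sided test, the 95% interval covers zero when p > 0.05
ci <- res$conf.int
covers_zero <- ci[1] <= 0 && ci[2] >= 0
not_significant <- res$p.value > 0.05

cat("method:", res$method, "\n")
cat("difference in means:", round(diff_means, 4), "\n")
cat("95% CI:", round(ci[1], 4), "to", round(ci[2], 4), "\n")
cat("CI covers zero:", covers_zero, "| p > 0.05:", not_significant, "\n")
```

Reading the pieces off the test object this way makes it easier to confirm which group is subtracted from which before interpreting the sign of the estimate.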

boxplot origin and overtreatment and ANOVA test

# boxplot to visualize relationship
ggplot(data, aes(x = origin, y = overtreatment)) +
  geom_boxplot() +
  labs(title = "Relationship between Overtreatment and Alleged Origin",
       x = "Alleged Origin",
       y = "Overtreatment Index")

# ANOVA test to test for a statistically significant link
anova_origin <- aov(overtreatment ~ origin, data = data)
summary(anova_origin)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## origin        2  0.226 0.11284   7.584 0.000598 ***
## Residuals   345  5.133 0.01488                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Overtreatment and Alleged Origin: The boxplot suggests an association between the overtreatment index and the alleged origin. The index is highest for passengers presented as foreign, followed by non-residents. Because origin has three levels, a one-way ANOVA is used to test whether the group means differ. A p-value below 0.05 indicates that mean overtreatment is not the same for all origins.

The ANOVA shows a statistically significant effect of “origin” on the overtreatment index. The F value of 7.584 corresponds to a p-value of 0.000598, well below the conventional threshold of 0.05, so we reject the null hypothesis that the group means are equal. In other words, the differences in mean overtreatment across origins are unlikely to be due to chance alone. The ANOVA does not say which groups differ; post-hoc tests or pairwise comparisons are needed to identify specific group differences.

The Sum Sq and Mean Sq columns show how the variability splits between and within groups, which gives context for the magnitude of the effect. The Sum Sq for “origin” (0.226) is small relative to the residual Sum Sq (5.133): origin accounts for only about 4% of the total variability in the overtreatment index (0.226 / 5.359), so the effect is statistically significant but modest in size.
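As a rough check on effect size, eta squared (the share of total variability attributed to the grouping factor) can be computed from the two Sum Sq entries in the ANOVA table. This is a minimal sketch using the rounded values printed in the output; the variable names are illustrative.

```r
# Sums of squares copied from the printed ANOVA table (rounded values)
ss_origin <- 0.226
ss_residual <- 5.133

# Eta squared: between-group sum of squares divided by the total sum of squares
eta_sq <- ss_origin / (ss_origin + ss_residual)

cat("eta squared:", round(eta_sq, 3), "\n")  # prints 0.042
```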

To sum up, the “origin” variable has a significant influence on the response variable, prompting the consideration of further investigation or post-hoc tests to delve into specific group distinctions.
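One common way to run the post-hoc comparisons mentioned above is Tukey's honest significant difference test, available in base R as TukeyHSD. The sketch below uses a small hypothetical data frame (the `toy` values are invented and are not the taxi data) and converts the grouping variable to a factor explicitly, which TukeyHSD expects.

```r
# Hypothetical toy data, NOT the taxi data set
toy <- data.frame(
  origin = factor(rep(c("foreign", "nonresident", "resident"), each = 5)),
  overtreatment = c(1.20, 1.25, 1.15, 1.30, 1.22,
                    1.08, 1.05, 1.10, 1.12, 1.06,
                    1.00, 1.02, 1.01, 1.03, 1.00)
)

# One-way ANOVA, same model form as in the analysis above
fit <- aov(overtreatment ~ origin, data = toy)

# Tukey's HSD: one row per pair of groups, with adjusted p-values
posthoc <- TukeyHSD(fit, "origin")
print(posthoc)

# The comparisons are stored as a matrix with columns diff, lwr, upr, "p adj"
pairs <- posthoc$origin
significant_pairs <- rownames(pairs)[pairs[, "p adj"] < 0.05]
cat("pairs with adjusted p < 0.05:", paste(significant_pairs, collapse = ", "), "\n")
```

Applied to the real data, the same call on the fitted `aov` object would show which pairs of origins drive the overall F test, provided `origin` is stored as a factor.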

barplot income and overcharge and chi-square test

# Barplot to visualize relationship
ggplot(data, aes(x = income, fill = overcharge)) +
  geom_bar(position = "fill") +
  labs(title = "Relationship between Overcharge and Alleged Income",
       x = "Alleged Income",
       y = "Proportion of Overcharge")

# chi-square test to test for a statistically significant link
chi_square_income <- chisq.test(table(data$income, data$overcharge))
chi_square_income
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(data$income, data$overcharge)
## X-squared = 0, df = 1, p-value = 1

Overcharge and Alleged Income: The bar chart shows no visible difference in overcharging between the income groups. For both high and low alleged income, the proportion of overcharged rides is about 0.125. A chi-square test was used to check this formally.

The chi-square test of independence between “income” and “overcharge” (with Yates' continuity correction, which R applies by default to 2 x 2 tables) yielded X-squared = 0 with 1 degree of freedom and a p-value of 1. A corrected statistic of 0 means the observed counts are within half a count of those expected under independence. We therefore fail to reject the null hypothesis: this analysis provides no evidence of an association between the alleged income level and the incidence of overcharging.
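To see why the corrected statistic can be exactly 0, it helps to compare observed and expected counts and to run the test with and without the continuity correction. The sketch below uses hypothetical counts (the `tab` values are invented and are not the taxi data) in which both groups have nearly the same overcharge proportion.

```r
# Hypothetical toy counts, NOT the taxi data: rows = income, columns = overcharge
tab <- matrix(c(70, 10,
                71, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(income = c("high", "low"),
                              overcharge = c("no", "yes")))

# Row-wise proportions: the quantity a filled bar chart displays
props <- prop.table(tab, margin = 1)
print(round(props, 3))

# Yates' correction is applied by default for 2 x 2 tables
corrected <- chisq.test(tab)
uncorrected <- chisq.test(tab, correct = FALSE)

# Expected counts under independence, for comparison with the observed counts
print(round(corrected$expected, 2))

cat("corrected X-squared:  ", format(unname(corrected$statistic)), "\n")
cat("uncorrected X-squared:", format(unname(uncorrected$statistic)), "\n")
```

When every observed count is within 0.5 of its expected count, the correction shrinks each deviation to zero, so the corrected statistic is 0 (up to floating-point error) even though the uncorrected one is slightly positive.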

barplot origin and overcharge and chi-square test

# barplot to visualize relationship
ggplot(data, aes(x = origin, fill = overcharge)) +
  geom_bar(position = "fill") +
  labs(title = "Relationship between Overcharge and Alleged Origin",
       x = "Alleged Origin",
       y = "Proportion of Overcharge")

# chi-square test to test for a statistically significant link
chi_square_origin <- chisq.test(table(data$origin, data$overcharge))
chi_square_origin
## 
##  Pearson's Chi-squared test
## 
## data:  table(data$origin, data$overcharge)
## X-squared = 23.044, df = 2, p-value = 9.909e-06

Overcharge and Alleged Origin: The bar chart suggests an association between overcharging and the alleged origin. Foreign passengers show the highest overcharge rate, approaching 0.25, followed by non-residents. A chi-square test is used to check whether these differences are statistically significant; a p-value below 0.05 indicates an association between overcharging and the alleged origin.

The chi-square test of independence between “origin” and “overcharge” yielded X-squared = 23.044 with 2 degrees of freedom and a p-value of 9.909e-06. With such a small p-value, we reject the null hypothesis of no association between “origin” and “overcharge.” Based on this analysis, the alleged origin of the passenger is significantly associated with the incidence of overcharging by taxi drivers.
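A significant chi-square statistic does not by itself indicate how strong the association is. One common summary is Cramér's V, which can be computed from the reported statistic and the sample size. The sketch below is an illustration added here, not part of the original analysis; it assumes all 348 rows entered the test and that the table is 3 x 2, which matches the 2 degrees of freedom in the output.

```r
# Values taken from the output above
x_squared <- 23.044   # chi-square statistic
n <- 348              # number of observations (assumes no rows were dropped)
n_rows <- 3           # origin levels
n_cols <- 2           # overcharge levels (implied by df = 2)

# Cramer's V: sqrt(X^2 / (n * (min(rows, cols) - 1))), ranges from 0 to 1
cramers_v <- sqrt(x_squared / (n * (min(n_rows, n_cols) - 1)))

cat("Cramer's V:", round(cramers_v, 2), "\n")  # prints 0.26
```

A value of about 0.26 would usually be read as a small to moderate association, so the relationship is clearly detectable but far from deterministic.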