# Load necessary libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
library(broom)
# Load your data
data <- read.csv("C:/Users/Dell/Downloads/Taxi.csv")
# boxplot to visualize relationship
ggplot(data, aes(x = income, y = overtreatment)) +
  geom_boxplot() +
  labs(title = "Relationship between Overtreatment and Alleged Income",
       x = "Alleged Income",
       y = "Overtreatment Index")
# check for missing values
summary(data)
## triple origin income rushhour
## Min. : 1.00 Length:348 Length:348 Length:348
## 1st Qu.: 29.75 Class :character Class :character Class :character
## Median : 58.50 Mode :character Mode :character Mode :character
## Mean : 58.50
## 3rd Qu.: 87.25
## Max. :116.00
## dgender dage overtreatment overcharge
## Length:348 Min. :23.00 Min. :1.000 Length:348
## Class :character 1st Qu.:40.00 1st Qu.:1.000 Class :character
## Mode :character Median :48.50 Median :1.012 Mode :character
## Mean :46.94 Mean :1.064
## 3rd Qu.:55.00 3rd Qu.:1.048
## Max. :75.00 Max. :1.991
# check data types
str(data)
## 'data.frame': 348 obs. of 8 variables:
## $ triple : int 1 1 1 2 2 2 3 3 3 4 ...
## $ origin : chr "resident" "nonresident" "foreign" "resident" ...
## $ income : chr "high" "low" "high" "high" ...
## $ rushhour : chr "yes" "yes" "yes" "no" ...
## $ dgender : chr "male" "male" "male" "male" ...
## $ dage : int 45 45 50 50 50 60 50 55 65 45 ...
## $ overtreatment: num 1.11 1 1.09 1.16 1.13 ...
## $ overcharge : chr "no" "no" "no" "no" ...
# data cleansing
data <- na.omit(data) # Remove rows with missing values
data$income <- as.factor(data$income) # Convert to factor if not already
# Welch two-sample t-test (the t.test() default): overtreatment by income group
t_test_income <- t.test(overtreatment ~ income, data = data)
tidy(t_test_income)
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0182 1.07 1.06 1.37 0.172 346. -0.00797 0.0444
## # ℹ 2 more variables: method <chr>, alternative <chr>
Overtreatment and Alleged Income: The boxplot gives a first visual impression of how the overtreatment index relates to alleged income. Most values in both the high and low income groups lie close to 1: in the full sample the median is 1.012 and the third quartile is 1.048, so values above roughly 1.05 belong to the upper tail. Because income has only two levels, a two-sample t-test is used to check whether the group means differ. A significant p-value would indicate an association between overtreatment and alleged income.
The estimated difference in means between the two groups (estimate = 0.0182, with group means of about 1.07 and 1.06) is not statistically significant (t = 1.37, p = 0.172). The 95% confidence interval for the difference (-0.00797 to 0.0444) includes zero, which is consistent with that result. We therefore fail to reject the null hypothesis of equal means: the data provide no substantial evidence that the overtreatment index differs between passengers with high and low alleged income.
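As a rough cross-check, the p-value and confidence interval can be reproduced from the rounded numbers printed in the tidy() output. This is only a sketch: the standard error is backed out from estimate / statistic, so small rounding differences from the reported values are expected.

```r
# Values copied from the tidy(t_test_income) output (rounded as printed)
estimate  <- 0.0182   # difference in group means
statistic <- 1.37     # t statistic
df        <- 346      # degrees of freedom (printed as "346.")

# Back out the standard error of the difference: t = estimate / se
se <- estimate / statistic

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(statistic), df)

# 95% confidence interval: estimate +/- critical value * se
t_crit  <- qt(0.975, df)
ci_low  <- estimate - t_crit * se
ci_high <- estimate + t_crit * se

round(c(p_value = p_value, ci_low = ci_low, ci_high = ci_high), 4)
```

The interval straddles zero exactly when the absolute t statistic is smaller than the critical value, which is why the confidence interval and the p-value lead to the same conclusion.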
# boxplot to visualize relationship
ggplot(data, aes(x = origin, y = overtreatment)) +
  geom_boxplot() +
  labs(title = "Relationship between Overtreatment and Alleged Origin",
       x = "Alleged Origin",
       y = "Overtreatment Index")
# ANOVA test to test for a statistically significant link
anova_origin <- aov(overtreatment ~ origin, data = data)
summary(anova_origin)
## Df Sum Sq Mean Sq F value Pr(>F)
## origin 2 0.226 0.11284 7.584 0.000598 ***
## Residuals 345 5.133 0.01488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Overtreatment and Alleged Origin: The boxplot suggests an association between the overtreatment index and alleged origin. The overtreatment index is highest for passengers with a foreign origin, followed by non-residents. Because origin has three levels, a one-way ANOVA is used to test whether the group means differ. A significant p-value indicates an association between overtreatment and alleged origin.
The ANOVA shows a statistically significant effect of “origin” on the overtreatment index. The F value of 7.584 on 2 and 345 degrees of freedom has a p-value of 0.000598, well below the conventional significance threshold of 0.05, so we reject the null hypothesis that all three group means are equal. The differences in means across groups are unlikely to be due to random chance alone. The ANOVA does not say which groups differ; post-hoc tests or pairwise comparisons are needed for that.
The Sum Sq column splits the total variability into a between-group part (“origin”, 0.226) and a within-group part (residuals, 5.133), and the Mean Sq column divides each by its degrees of freedom. “Origin” accounts for only a small share of the total variability: 0.226 / (0.226 + 5.133) is about 0.04, or roughly 4%. The effect is statistically significant but modest in size, and most of the variation in the overtreatment index lies within the origin groups.
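The size of the effect can be worked out from the printed ANOVA table with a little arithmetic. The sketch below computes eta squared (an effect-size measure that is not part of the original analysis and is added here only for illustration) and re-derives the p-value from the reported F statistic.

```r
# Values copied from the summary(anova_origin) table
ss_origin   <- 0.226   # between-group sum of squares (origin)
ss_residual <- 5.133   # within-group sum of squares (residuals)
df_origin   <- 2
df_residual <- 345
f_value     <- 7.584

# Eta squared: share of total variability attributed to origin
eta_sq <- ss_origin / (ss_origin + ss_residual)

# Upper-tail probability of the reported F statistic
p_value <- pf(f_value, df_origin, df_residual, lower.tail = FALSE)

round(c(eta_sq = eta_sq, p_value = p_value), 4)
```

With these inputs eta squared comes out at about 0.04, so origin accounts for roughly 4% of the variation in the overtreatment index.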
To sum up, “origin” is significantly associated with the overtreatment index, and post-hoc tests are the natural next step to identify which groups differ.
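One common post-hoc procedure is Tukey's honest significant difference test, available in base R as TukeyHSD(). The sketch below runs it on simulated data, because the real file is not available here: the data frame sim, its group sizes (116 per origin) and its values are hypothetical stand-ins, not the taxi data.

```r
# Simulated stand-in for the taxi data (hypothetical values, NOT the real data)
set.seed(1)
sim <- data.frame(
  origin = factor(rep(c("foreign", "nonresident", "resident"), each = 116)),
  overtreatment = 1 + abs(rnorm(348, mean = 0.05, sd = 0.1))
)

# One-way ANOVA with the same formula as the analysis
sim_aov <- aov(overtreatment ~ origin, data = sim)

# Tukey's HSD: all pairwise group comparisons, with p-values
# adjusted for multiple testing
sim_tukey <- TukeyHSD(sim_aov)
sim_tukey$origin   # columns: diff, lwr, upr, p adj
```

On the real data the equivalent call would be TukeyHSD(anova_origin), which gives one row per pair of origin groups.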
# Barplot to visualize relationship
ggplot(data, aes(x = income, fill = overcharge)) +
  geom_bar(position = "fill") +
  labs(title = "Relationship between Overcharge and Alleged Income",
       x = "Alleged Income",
       y = "Proportion of Overcharge")
# chi-square test to test for a statistically significant link
chi_square_income <- chisq.test(table(data$income, data$overcharge))
chi_square_income
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(data$income, data$overcharge)
## X-squared = 0, df = 1, p-value = 1
Overcharge and Alleged Income: The bar plot shows no visible association between overcharging and alleged income. The proportion of overcharged rides is about 0.125 in both the high and the low income group. A chi-square test of independence was used to check this formally.
The chi-square test examines the association between the “income” and “overcharge” variables. Because the table is 2 × 2, R applies Yates' continuity correction by default. The test yielded X-squared = 0 with 1 degree of freedom and a p-value of 1, which means the observed counts are essentially identical to those expected under independence. We therefore fail to reject the null hypothesis. This does not prove that no association exists, but the data provide no evidence of one between the alleged income level and the incidence of overcharging.
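A statistic of exactly 0 is typical of the continuity correction: R subtracts up to 0.5 from each |observed - expected| difference, but never more than the difference itself, so a table whose counts are all within 0.5 of their expected values yields X-squared = 0. The sketch below shows this with made-up counts; the matrix counts is hypothetical and is not the real income by overcharge table.

```r
# Hypothetical 2 x 2 counts (NOT the real table): two income groups of
# 174 rides each, with 22 and 21 overcharged rides respectively
counts <- matrix(c(152, 153, 22, 21), nrow = 2,
                 dimnames = list(income = c("high", "low"),
                                 overcharge = c("no", "yes")))

# Default for 2 x 2 tables: Yates' continuity correction
with_yates <- chisq.test(counts)

# Same test without the correction
without_yates <- chisq.test(counts, correct = FALSE)

c(with_yates    = unname(with_yates$statistic),
  without_yates = unname(without_yates$statistic))
```

Here every observed count differs from its expected count by exactly 0.5, so the corrected statistic is 0 while the uncorrected one is small but positive.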
# barplot to visualize relationship
ggplot(data, aes(x = origin, fill = overcharge)) +
  geom_bar(position = "fill") +
  labs(title = "Relationship between Overcharge and Alleged Origin",
       x = "Alleged Origin",
       y = "Proportion of Overcharge")
# chi-square test to test for a statistically significant link
chi_square_origin <- chisq.test(table(data$origin, data$overcharge))
chi_square_origin
##
## Pearson's Chi-squared test
##
## data: table(data$origin, data$overcharge)
## X-squared = 23.044, df = 2, p-value = 9.909e-06
Overcharge and Alleged Origin: The bar plot suggests an association between overcharging and alleged origin. Foreign passengers have the highest overcharge rate, approaching 0.25, followed by non-residents. A chi-square test of independence is used to assess the statistical significance of this relationship. A significant p-value indicates an association between overcharging and alleged origin.
The chi-squared test of the association between “origin” and “overcharge” gave a statistic (X-squared) of 23.044 with 2 degrees of freedom and a p-value of 9.909e-06. With such a small p-value, we reject the null hypothesis of no association between “origin” and “overcharge.” Based on this analysis, the alleged origin of the passenger is significantly associated with the incidence of overcharging by taxi drivers.
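The p-value says the association is unlikely to be chance, but not how strong it is. As an illustration, the sketch below re-derives the p-value from the printed statistic and computes Cramér's V, an effect-size measure that is not part of the original analysis. It assumes n = 348, which matches str(data) and the ANOVA degrees of freedom (2 + 345 + 1).

```r
# Values copied from the chi_square_origin output
x_squared <- 23.044
df        <- 2
n         <- 348   # observations; assumes na.omit() dropped no rows
n_rows    <- 3     # origin: resident, nonresident, foreign
n_cols    <- 2     # overcharge: yes, no

# Upper-tail probability of the chi-squared statistic
p_value <- pchisq(x_squared, df, lower.tail = FALSE)

# Cramér's V: 0 means no association, 1 means perfect association
cramers_v <- sqrt(x_squared / (n * min(n_rows - 1, n_cols - 1)))

c(p_value = p_value, cramers_v = cramers_v)
```

Cramér's V comes out at about 0.26, which is commonly read as a weak to moderate association.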