Statement #1: CitiBike management thinks that men incur more overtime fees. Test this hypothesis by comparing overtime variances across genders.
What should your null hypothesis be? Test your null hypothesis at 95% confidence. What type I error test are you conducting? Provide a paragraph discussing the findings and statistical significance of the test.
First, let's load the data using read.csv() with header=TRUE, and then explore the file using names() and dim().
data <- read.csv("~/Google Drive/Business Analytics/Data/CitiBike Data 1hr+.csv", header=TRUE, stringsAsFactors=FALSE)
dim(data)
## [1] 1158 15
names(data)
## [1] "tripduration" "starttime"
## [3] "stoptime" "start.station.id"
## [5] "start.station.name" "start.station.latitude"
## [7] "start.station.longitude" "end.station.id"
## [9] "end.station.name" "end.station.latitude"
## [11] "end.station.longitude" "bikeid"
## [13] "usertype" "birth.year"
## [15] "gender"
For subsets of MEN only, we can do the following:
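# Subset the rides taken by men (gender == 1 in this file)
mendata <- subset(data, gender == 1)
# Trip durations for all riders in the file
sample <- data$tripduration
# Trip durations for men only
men <- mendata$tripduration
# Side-by-side boxplots (all riders, then men), with outliers hidden
boxplot(sample, men, outline = FALSE, col = "blue")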
summary(sample)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.00 75.07 101.10 160.10 167.70 717.00
sd(sample)
## [1] 144.47
summary(men)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.00 77.66 110.80 182.30 202.00 717.00
sd(men)
## [1] 164.6802
Now we can set up the null hypothesis as H0: μ = 160.10, with the men's sample mean x̄ = 182.30, N = 1158, and S = 144.47.
t.s.1<-(182.30-160.10)/(sd(sample)/sqrt(1158))
t.s.1
## [1] 5.229131
At 95% confidence, we reject the null hypothesis that men have the same mean trip duration as the rest of the users of CitiBike: the test statistic of 5.23 is well above the two-sided critical value of 1.96.
Another approach, and the one that compares variances directly, is an F-test. In R, the F-test is performed with the function var.test(). Then we can say that:
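The decision above comes from comparing the statistic with a critical value. A minimal sketch of that comparison, assuming a two-sided test at the 5% level and a normal approximation (plausible for N = 1158); the variable names are illustrative, and the statistic is copied from the output above:

```r
# Test statistic reported above
t_stat <- 5.229131
alpha <- 0.05

# Two-sided critical value under the normal approximation
crit <- qnorm(1 - alpha / 2)

# Two-sided p-value for the observed statistic
p_value <- 2 * pnorm(-abs(t_stat))

# Reject H0 when the statistic falls outside +/- crit
reject <- abs(t_stat) > crit

crit
p_value
reject
```

If management's claim is read as strictly directional (men ride longer), a one-sided critical value from qnorm(1 - alpha) would be the corresponding choice.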
var.test(men, sample)
##
## F test to compare two variances
##
## data: men and sample
## F = 1.2994, num df = 614, denom df = 1157, p-value = 0.0001699
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.132986 1.494599
## sample estimates:
## ratio of variances
## 1.299354
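The F value of 1.2994 is the ratio of the two sample variances (1.299354, rounded). Because the p-value (0.0001699) is below 0.05 and the 95% confidence interval (1.133 to 1.495) does not include 1, we reject the null hypothesis that the two variances are equal.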
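To see where the F value comes from, it can be rebuilt from the summary statistics alone. This is a sketch under two assumptions: the standard deviations are the rounded values printed above, and the group sizes are inferred from the reported degrees of freedom (614 + 1 and 1157 + 1). Because of the rounding, the results only approximate the var.test() output.

```r
# Standard deviations copied from the output above (rounded)
sd_men <- 164.6802
sd_all <- 144.47

# Degrees of freedom as reported by var.test()
df1 <- 614    # men: n - 1
df2 <- 1157   # full sample: n - 1

# F statistic: ratio of the two sample variances
f_stat <- sd_men^2 / sd_all^2

# Two-sided p-value: twice the smaller tail probability
p_value <- 2 * min(pf(f_stat, df1, df2),
                   pf(f_stat, df1, df2, lower.tail = FALSE))

# 95% confidence interval for the true ratio of variances
ci <- c(f_stat / qf(0.975, df1, df2),
        f_stat / qf(0.025, df1, df2))

f_stat
p_value
ci
```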
In terms of statistical errors, in this problem we are dealing with a test for a “false positive”: a type I error is the incorrect rejection of a true null hypothesis. Testing at 95% confidence means we accept a 5% probability of making this error.
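The type I error rate can be illustrated by simulation. The sketch below uses synthetic normal data rather than the CitiBike file, and the seed, sample sizes, and number of repetitions are arbitrary choices. Both samples come from the same distribution, so every rejection is a false positive, and the rejection rate should land near 5%.

```r
set.seed(42)     # arbitrary seed, for reproducibility only
alpha <- 0.05
n_sims <- 2000

# Both samples share the same variance, so the null hypothesis is true
rejections <- replicate(n_sims, {
  x <- rnorm(100)
  y <- rnorm(100)
  var.test(x, y)$p.value < alpha
})

# Share of simulations that wrongly rejected the true null
type1_rate <- mean(rejections)
type1_rate
```

The F-test assumes normally distributed data. Trip durations here are right-skewed (the mean is well above the median), so the actual error rate on this data may differ from the nominal 5%.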
Statement #2: CitiBike management thinks that subscribers incur more overtime fees. Test this hypothesis by comparing overtime variances across users.
As in statement #1, we subset the data, compare the summaries, compute the test statistic, and run the F-test:
subsdata<-subset(data, data$usertype=="Subscriber")
subscriber<-subsdata$tripduration
boxplot(sample, subscriber, outline=FALSE, col="yellow")
summary(sample)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.00 75.07 101.10 160.10 167.70 717.00
sd(sample)
## [1] 144.47
summary(subscriber)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.00 77.27 109.70 180.10 201.10 717.00
sd(subscriber)
## [1] 161.0684
t.s.2<-(mean(subscriber)-mean(sample))/(sd(sample)/sqrt(1158))
t.s.2
## [1] 4.697379
var.test(subscriber, sample)
##
## F test to compare two variances
##
## data: subscriber and sample
## F = 1.243, num df = 783, denom df = 1157, p-value = 0.0008164
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.094142 1.414345
## sample estimates:
## ratio of variances
## 1.242983
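Two choices in the test above may deserve a second look: management's claim is directional ("more"), and the full sample includes the subscribers, so the two groups overlap. A hedged sketch of an alternative compares subscribers against all remaining users with a one-sided F-test. It uses synthetic data in place of the CitiBike file: the data frame `rides`, the "Customer" label, and all values are made up for illustration.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# Synthetic stand-in for the CitiBike data (hypothetical values)
rides <- data.frame(
  usertype = rep(c("Subscriber", "Customer"), times = c(300, 200)),
  tripduration = c(rnorm(300, mean = 180, sd = 160),
                   rnorm(200, mean = 150, sd = 110))
)

# Non-overlapping groups: subscribers versus everyone else
subscriber <- rides$tripduration[rides$usertype == "Subscriber"]
others <- rides$tripduration[rides$usertype != "Subscriber"]

# H0: variances are equal; H1: subscriber variance is greater
res <- var.test(subscriber, others, alternative = "greater")
res$statistic
res$p.value
```

With the real data, the same call would use the tripduration values from `data` split by usertype.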
Statement #3: CitiBike management is worried because the majority of rides in the 1st week are logging excessive durations. Out of 200 rides, the average tripduration has been 200 minutes with a standard deviation of 185. Should management worry about changes in variance?
t.s.3<-(200-mean(sample))/(185/sqrt(1158))
t.s.3
## [1] 7.332436
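Two points may be worth checking here: the statistic above tests the mean rather than the variance, and it divides by sqrt(1158) although the new sample has 200 rides. The sketch below shows both alternatives. It assumes the earlier sample's mean (160.10) and standard deviation (144.47) can serve as reference values; these are the rounded figures from the summaries above, so the results are approximate.

```r
# New sample, from the statement
n <- 200
xbar <- 200
s <- 185

# Assumed reference values, taken from the full-sample summaries (rounded)
mu0 <- 160.10
sigma0 <- 144.47

# Test of the mean, using the new sample's own size
z_stat <- (xbar - mu0) / (s / sqrt(n))

# Chi-square test for a single variance, H0: sigma^2 = sigma0^2
chi_stat <- (n - 1) * s^2 / sigma0^2
chi_crit <- qchisq(0.95, df = n - 1)   # one-sided test at the 5% level
p_value <- pchisq(chi_stat, df = n - 1, lower.tail = FALSE)

z_stat
chi_stat
chi_crit
p_value
```

A chi_stat above chi_crit would indicate that the variance of trip durations has increased relative to the reference value. Like the F-test, this test assumes normally distributed data.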