The dataset being analysed is the number of train tickets sold per hour at Grand Central Station. Your aim is to estimate the average number of tickets sold using historical data.
1. What type of distribution (discrete or continuous) is the number of tickets sold? Explain your answer.
Discrete. The variable is discrete because the number of tickets are countable and finite vs a continuous variable which would be infinite. Also you cannot have half a ticket.
2. Given the type of data, which distribution would be most appropriate to model the data and why?
Poisson Distribution. This distribution would be appropriate to model the data as it shows the amount of time an event is likely to occur in a given time period and a given space, which in the case of tickets being sold, we want to find out the amount of train tickets being sold per hour over a two-year period at the Grand Central Station.
rm(list = ls()) # Clear work environment
# Set Working Directory
setwd("~/ECON3032/Assignment")
set.seed(367)
train.import <- read.csv("Train_Tickets.csv")
train.data <- train.import[sample(nrow(train.import), 3800), ]
3. Plot the data (histogram along with superimposed density curve)
hist(train.data[,2], prob=TRUE, col="grey", main = "Number of Tickets Sold", xlab = "Number of Tickets")# prob=TRUE for probabilities not counts
lines(density(train.data[,2]), col="blue", lwd=2) # add a density estimate with defaults
4. Which distribution(s) does your plot resemble?
The plot resembles a right-skewed or positive-shew distribution as the tail is in the positive direction of the number line.
5. Calculate the mean and variance of the distribution.
library(psych) # for complete summary table
library(kableExtra) # for nice html table
kable(round(describe(train.data[,2]), 3)) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 3800 | 141.171 | 156.091 | 88 | 113.557 | 109.712 | 2 | 1244 | 1242 | 1.806 | 4.187 | 2.532 |
6. Based on this, do the data seem to be from the distribution you assumed in Question 4?
Yes. This is because a right-skewed distribution’s mean is higher than its median which can be seen in this output where the mean is equal to 141.171 and the median is equal to 88.
7. Irrespective of your answer to the previous questions, plot the likelihood function using a Poisson distribution.
# The function used to calculate the log likelihood
# ===========================================================
loglik.Poisson <- function(lambda, dataset){
# Store the number of observations in the dataset
n <- length(dataset)
# Calculate the likelihood
- (n* lambda) + (sum(dataset) *log(lambda))
}
# Get the value of Log the likelihood for many different values of lambda
# ====================================================================
train.count <- train.data$Count # Only use the column with counts
# Change 150 to the number of lambda values assigned to your ID#
k <- 312 # Number of lambda values to be substituted
Lambda <- seq(100, 200, length = k) # a series of values to be substituted
W <- rep(0, k) # Create an object to hold the results
# Calculate the value for each
for(i in 1:k){
W[i] <- loglik.Poisson(lambda = Lambda[i], dataset = train.count)
}
8. Using what you know about Maximum Likelihood Estimators, where do you expect the maximum to occur? Explain your answer. A specific value is required.
the MLE occurs at the summation of xi/n which is equal to the mean which, in this case, is equal to 141.171.
9. Plot the Log Likelihood function.
# Plot the Log Likelihood
plot(Lambda, W, type = "l", ylab = "Log Likelihood", xlab = "Lambda", col = "blue", lwd = 2)
# Plot the graph with the maximum value
# ==========================================
# Find the maximum and its related value
# Identify the Maximum
# =========================
max.log <- which.max(W)
max.value.log <- max(W)
lambda.max <- Lambda[max.log]
# Plot the Log Likelihood
plot(Lambda, W, type = "l", ylab = "Log Likelihood", xlab = "Lambda", col = "blue", lwd = 2)
abline(h = max.value.log, v = lambda.max, col = "red", lty = 4, lwd = 2)
text(185, 2110000, bquote(lambda-hat == .(lambda.max)), col = "red")
# If your lambda-hat value does not show on your graph, change the second vaule in Line 221
11. Is the maximised result exactly what you expected, close to what you expected or different from what you expected? A one sentence answer is required (not just yes or no).
The maximised result is very close to what i expected given that after subtracting the mean from the lambda value i got 0.0134.furthermore my expected value is 141.171 and the lambda value is 141.1576
12. Plot the data against time when the ticket was bought.
library(ggplot2)
train.data$dt <- as.POSIXct(train.data$Datetime, format="%d-%m-%Y %H:%M")
train.order <- train.data[order(as.Date(train.data$dt, format="%d/%m/%Y")),]
#Plot of Sale over time
ggplot(data=train.order,aes(x=dt, y=Count)) +
geom_path(colour="blue") +
ylab("Count of Tickets Sold") +
xlab("Date") +
ggtitle("Ticket Sales Over Time") +
theme(plot.title = element_text(hjust = 0.5))
13. Given the plot in Question 12, do you believe that your estimate is appropriate? Explain your answer.
Yes the estimate was appropriate.