The proposed template for the functionality can be viewed in this sketch. This is a draft of what the functionality the dashboard might include if there is sufficient support for implementation from center staff. All graphical demonstrations are created using mock data.
The functionality is threefold at present:
Graph 2 will provide a statistical overview of various metrics per student across a given time period. Default time bins will be seasons, with adjustments of quarters, months, weeks, days. The graph will have y-axis as the # of students admitted, and x-axis as the date. Stacked bar graphs will show the number of os v ns per retreat across the duration of the retreat. These can be regressed against electricity cost, food cost, donations etc to establish an amount per student of a given resource per given time period.
A ride suggestion algorithm can be created with KNN (k nearest neighbors) and the Google Maps API. The application process can be modified to include inputs for:
With this information, the OpenMaps API can then be queried to provide an inline user selection (with PHP) of the latitude and longitude of that location. Because kNN is directionally indifferent, we need to create a custom algorithm to find the individuals that are closest to the driver’s path to the center. To do so we will need to compute the distance of parallel lines between the rider’s origin point \(r_{la},r_{lo}\) and the vector between the driver’s origin and the center \(\vec{d_{oc}}\) for a rough estimate of whom will be closest to the path of travel. We can determine which riders will be encompassed in a query for a drive by finding the bisecting half way point \(distance(\vec{d_{oc}})/2\) on the direction vector from the driver origin to center, and creating a circular radius equivalent to half the distance of the journey with an overlapping sphere of 10 mile radius from their point of origin (a driver would likely be willing to pick up someone with 10 miles even if in the opposite direction). The distance of each of rider’s origin falling within these radii to the will be calculated using an equation for a perpendicular line, and k closest riders will be determined where k is equal to the number of open seats.
The k closest riders will then be submitted to the Google Maps API to determine the amount of time added to the driver’s trip in order to pick up the rider - the total added time will be computed for each and for all (combinations can be manually calculated from this info). The Google Maps API provides 2500 free queries per 24 hour period, thus the queries will be as follows:
With the pre-computation described above, the queries per day will be unlikely to exceed the 24 hour quota.
ExceptionsExceptions might occur where a person resides just outside of a 10 mile radius but also in the same large city, and would be willing to take a Lyft or public transportation to reach the driver. To account for such exceptions, a regex matching system based on city name would be helpful to identify and match these exceptions.
Potentially relevant links:dRange <- seq.Date(lubridate::ymd("2015-01-01"), lubridate::ymd("2018-05-25"), by = "day")
applications <- data.frame(Date = dRange, NS = round(runif(length(dRange), 1, 20)),
OS = round(runif(length(dRange), 1, 20)/3), stringsAsFactors = F)
applications %<>% gather(NS, OS, key = Status, value = Applied) %>% arrange(Date) # Gather the New Student and Old student column as a factor by date
applications %<>% mutate(Wk = lubridate::week(Date), WkBgn = lubridate::floor_date(Date,
"week")) # Bin these values by week for a clearer graph
appWk <- applications %>% group_by(WkBgn, Status) %>% summarize(Applied = sum(Applied))ggplot(data = appWk %>% filter(WkBgn >= lubridate::ymd("2018-01-01")), mapping = aes(x = WkBgn,
y = Applied)) + geom_bar(stat = "identity", position = "stack", aes(y = Applied,
fill = Status)) + geom_text(data = appT <- appWk %>% group_by(WkBgn) %>% summarize(Total = sum(Applied)) %>%
right_join(appWk, by = "WkBgn") %>% filter(WkBgn >= lubridate::ymd("2018-01-01")),
aes(x = WkBgn, y = Applied, label = paste0(round(Applied/Total * 100), "%")),
size = 4, vjust = 1) + geom_smooth(data = appT, aes(y = Total)) + scale_fill_manual(values = c(NS = "#FF7878",
OS = "#5CFF66")) + scale_x_date(date_breaks = "week", date_labels = "%m-%e") +
theme_minimal(base_size = 11) + xlab("# of Applications") + ylab("Time YTD 2018 (by wk)") +
ggtitle("Demo of Graph 1", subtitle = "Graph created using Demo Data") + labs(caption = "New Student/Old Student ratio stacked bar graph with Loess average trendline of total applications") +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 60, size = 8))googlesheets::gs_auth()
coursesheet <- googlesheets::gs_url("https://docs.google.com/spreadsheets/d/1Qx4a2SbT136pRcYputkOIWj8qdX_F565owJn3cnL7Wo/edit?ts=5b0b40d9#gid=0")
coursedf <- googlesheets::gs_read(coursesheet)
cdf <- coursedf[1:15, ] %>% separate(col = `Course Dates`, into = c("Begin", "End"),
sep = "\\s\\-\\s?") %>% mutate_at(.vars = vars(Begin), .funs = funs(lubridate::mdy(paste(.,
"2018", sep = "/"))))
enddate <- function(col1, col2) {
if (is.na(col1)) {
col1 <- col2
}
if (str_count(col1, "\\/") < 1) {
col1 <- lubridate::mdy(paste(lubridate::month(col2), col1, "2018", sep = "/"))
} else if (str_count(col1, "\\/") < 2) {
col1 <- lubridate::mdy(paste(col1, "2018", sep = "/"))
} else {
col1 <- lubridate::mdy(col1)
}
return(col1)
}
cdf %>% mutate(EndDate = enddate(End, Start))
# ----------------------- Sun May 27 20:17:49 2018 ------------------------# It
# will be easier just to use constructed data for demo purposes as the data is
# ultimately going to be different for the final implementation.# ----------------------- Sun May 27 13:00:06 2018 ------------------------#
# Aggregate statistics by day in the previous created dataset do not allow for
# the linear regression model predicting course closure date to be built. A new
# dataset will be constructed to accommodate this task.
Courses <- seq.Date(lubridate::ymd("2018-01-04"), lubridate::now() %>% lubridate::as_date(),
by = "13 days") # Create mock course dates
addCourse <- function(Dt, Courses) {
# p <- rev(exp(1:length(Courses))) %>% {. / sum(.)}
out <- sample(Courses[Courses > Dt], 1)
return(out)
}
applications <- data.frame(Date = sample(dRange, 5000, replace = T), Student.No = seq(1,
5000, 1), Type = sample(c("NS", "OS"), 5000, replace = T, prob = c(0.66, 0.34)),
Status = sample(c("Accepted", "In Process", "Rejected"), 5000, replace = T, prob = c(0.5,
0.4, 0.1)))
applications %<>% filter(Date > lubridate::mdy("01/01/2018")) %>% rowwise %>% mutate(Course = addCourse(Date,
Courses))Due to the number and diversity of variables involved in this analysis, I believe that trying to develop a demo with dummy data will be a wasted effort. The code for this particular calculation will be heavily dependent on where, how, and in what units the data is stored and retrieved and thus the actual code necessary to complete the analysis will likely entirely invalidate any mock data demo. This section will be skipped until further information is available.
Elec <- data.frame(Date = lubridate::mdy(paste(seq(1, 5, 1), "25", "2018", sep = "-")),
Usage = runif(5, 20000, 50000), stringsAsFactors = F)
App_By_Mo <- applications %>% group_by(Course) %>% count(Status) %>% filter(Status ==
"Accepted") %>% mutate(Mo = lubridate::month(Course)) %>% ungroup %>% group_by(Mo) %>%
summarise(TotalS = sum(n))
Elec %>% mutate(Mo = lubridate::month(Date)) %>% left_join(App_By_Mo, by = "Mo") %>%
ggplot(mapping = aes(x = factor(Date))) + geom_bar(stat = "identity", aes(y = Usage/TotalS),
fill = "green") + labs(title = "Electricity Usage Per Student", subtitle = "kwH / Student",
caption = "Created using mock data", x = "Month", y = "kwH") + theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))The level of complexity of the algorithm documented above may not be feasible for the purposes of this project, but a simpler mechanism that employs kNN and a directional filtering mechanism may be implementable in a reasonable amount of time and provide sufficient matching to help individuals without vehicular transporation get to the center with greater ease.
RS <- data.frame(Student = 1:6, Type = c("D", rep("R", 5)), Lon = c(40.695331, 41.322251,
41.264751, 41.829653, 42.08225, 42.3781439), Lat = c(-73.9256047, -72.970297,
-75.8360507, -72.632024, -72.566943, -72.5487467), City = c("Brooklyn", "New Haven",
"Wilkes-Barre", "South Windsor", "Springfield", "Hadley"), State = c("NY", "CT",
"PA", "CT", "MA", "MA"))## $nn
## [,1] [,2] [,3]
## [1,] 2 4 5
## [2,] 4 5 6
## [3,] 1 2 4
## [4,] 5 6 2
## [5,] 4 6 2
## [6,] 5 4 2
##
## $np
## [1] 6
##
## $k
## [1] 3
##
## $dimension
## [1] 2
##
## $x
## Lon Lat
## [1,] 40.69533 -73.92560
## [2,] 41.32225 -72.97030
## [3,] 41.26475 -75.83605
## [4,] 41.82965 -72.63202
## [5,] 42.08225 -72.56694
## [6,] 42.37814 -72.54875
##
## attr(,"class")
## [1] "knn"
## attr(,"call")
## spdep::knearneigh(x = as.matrix(RS[, c("Lon", "Lat")]), k = 3)
The algorithm will have to be built such that the driver’s shortest path is used as the point from which the nearest locations are calculated.