Project Proposal

TEAM MEMBERS

Junsuk Oh, Devyani Gupta, Henry Schwartz

DATA

Our data will consist of statistics for all NHL players from 2007 to the present. The source is the MySportsFeeds API, as described in its documentation. The data has 53 columns, which cover each player’s general information, such as weight, height, and playing position, as well as in-game statistics such as goals scored, blocked shots, and minutes played. Because the data is grouped by season, we added an attribute, “season”, to indicate which NHL season each record refers to. Some attributes, such as “blocked shots”, have many missing entries and will be excluded from the analysis. We expect the most important features of our data to be ‘Games Won’, ‘Games Lost’, ‘Goals Scored’, and ‘Minutes Played’.
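The two preparation steps described above (tagging each record with its season and excluding sparsely populated attributes) could be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the field names, the toy records, the season labels, and the 50% missing-value threshold are all assumptions chosen for the example, and the real MySportsFeeds columns may be named differently.

```python
from typing import Dict, List

Row = Dict[str, object]


def tag_season(rows: List[Row], season: str) -> List[Row]:
    """Return copies of the rows with a 'season' attribute added."""
    return [{**row, "season": season} for row in rows]


def drop_sparse_columns(rows: List[Row], max_missing_fraction: float = 0.5) -> List[Row]:
    """Drop every column whose share of missing (None) values exceeds the threshold."""
    if not rows:
        return []
    kept = [
        col
        for col in rows[0]
        if sum(1 for r in rows if r.get(col) is None) / len(rows) <= max_missing_fraction
    ]
    return [{col: r.get(col) for col in kept} for r in rows]


# Hypothetical per-season records (illustrative values, not real statistics).
season_a = [
    {"player": "Player A", "goals_scored": 30, "blocked_shots": None},
    {"player": "Player B", "goals_scored": 12, "blocked_shots": None},
]
season_b = [
    {"player": "Player A", "goals_scored": 25, "blocked_shots": 40},
    {"player": "Player C", "goals_scored": 8, "blocked_shots": None},
]

# Combine the seasons, then remove attributes that are mostly missing.
combined = tag_season(season_a, "2007-2008") + tag_season(season_b, "2008-2009")
cleaned = drop_sparse_columns(combined, max_missing_fraction=0.5)
```

In this toy data, "blocked_shots" is missing in three of four records, so it is removed, while the added "season" attribute is kept for every record.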

PROBLEM DESCRIPTION

The NHL is the largest ice hockey organization in the world; each of its 31 teams plays an 82-game regular season every year. Our focus will be on prediction: our goal is to build two models, one that predicts the winner of each season and one that predicts the MVP (most valuable player) of each season. Sports prediction is a popular activity with many useful applications. NHL teams themselves have analytics divisions that look for insights that can give the team a competitive advantage, especially in player acquisition and contract negotiations. The gaming industry also focuses heavily on data analytics and strives to create models with the highest performance and efficiency. This research is also timely, as the winner of the current NHL season will be revealed within the next two months.

ANALYTICS PLAN

  1. Explore and analyze the API data to determine which parameters contribute most to accurate predictions.
  2. Predict the MVP and the season winner (also referred to as the Stanley Cup champion) of the current NHL season using the data.
  3. Use multiple regression to determine which variables are the most significant for our research.
  4. Use a machine learning technique, probably neural networks, to add depth to our predictions and see how the two models compare and contrast.
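As a rough illustration of the multiple regression step, the sketch below fits an ordinary least squares model by solving the normal equations with Gaussian elimination. It is only an assumption about how the step might look: the two feature columns and the target values are made-up numbers, and in practice a statistics library would also be needed to obtain significance tests for each coefficient, which this sketch does not compute.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple_regression(
    features: Sequence[Sequence[float]], target: Sequence[float]
) -> List[float]:
    """Return [intercept, coef_1, ..., coef_k] minimizing squared error."""
    design = [[1.0] + [float(v) for v in row] for row in features]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated from target = 2 + 3 * x1 - 1 * x2 (no noise).
toy_features = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
toy_target = [3, 7, 7, 11, 12]

coefficients = fit_multiple_regression(toy_features, toy_target)
```

Because the toy target is an exact linear function of the two features, the fitted intercept and coefficients recover the generating values up to floating-point error. Real player statistics would be noisy, and the features would likely need to be put on a common scale before their coefficients are compared.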

EVALUATION PLAN

Our evaluation will report accuracy, recall, and precision for our models and will visualize our predictions against the actual results. Not every result may need a graph, but visualizing the factors that chiefly led to our results is very important. For example, if we find that the team with the most shots each year tends to win the season, we will show that trend with the historical data.
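The three metrics named above can be computed from predicted and actual labels as in the following sketch. It assumes a binary framing that we have not yet fixed, for example each team-season labeled 1 if the team won the season and 0 otherwise; the label lists are invented for illustration.

```python
from typing import Dict, Sequence


def classification_metrics(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, and recall for binary (0/1) labels."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label sequences must be non-empty and equal in length")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Define precision and recall as 0.0 when their denominator is zero.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented labels: 1 = positive class (e.g. "won the season"), 0 = otherwise.
actual_labels = [1, 0, 0, 1, 0, 1]
predicted_labels = [1, 0, 1, 1, 0, 0]

metrics = classification_metrics(actual_labels, predicted_labels)
```

With only one champion per season the positive class is rare, so accuracy alone could look high even for a model that never predicts a winner; this is one reason to report precision and recall alongside it.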