DATA 607 - Data In Context Presentation

Vladimir Nimchenko

Introduction:

In the code below, I demonstrate the data mining technique of regression. Regression models the relationship between a response variable and one or more predictor variables, and the fitted model can then be used to predict values of the response. In my example, I would like to find out the relationship between an NHL team’s away goals and home goals. It is important to remember that the correlation coefficient, which measures the strength and direction of the linear relationship between two variables, ranges from -1 to 1. A negative value indicates an inverse relationship (one variable tends to decrease as the other increases), a positive value indicates a direct relationship (both variables tend to increase together), and a value near 0 indicates little or no linear relationship.
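As a quick illustration of the two extremes of that range, the sketch below uses made-up vectors (x, y_direct, and y_inverse are invented for this example and are not part of the NHL data) to show a perfectly direct and a perfectly inverse linear relationship:

```r
# Made-up data for illustration only (not from the NHL data set)
x <- 1:10
y_direct  <- 2 * x + 3    # rises as x rises
y_inverse <- -2 * x + 3   # falls as x rises

r_direct  <- cor(x, y_direct)    # perfectly direct relationship, r = 1
r_inverse <- cor(x, y_inverse)   # perfectly inverse relationship, r = -1

print(c(r_direct, r_inverse))
```

Real data almost never falls exactly on a line, so observed correlations land somewhere between these two extremes.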

Importing the needed libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6      v purrr   0.3.4 
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.1      v stringr 1.4.1 
## v readr   2.1.2      v forcats 0.5.2 
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Importing a CSV file from GitHub

nhl_win_data <- read.csv("https://raw.githubusercontent.com/GitHub-Vlad/Data-Science/main/game.csv", header = TRUE)
View(nhl_win_data)

Using the “pearson” method to calculate the correlation coefficient r between home goals and away goals

cor(nhl_win_data$home_goals, nhl_win_data$away_goals, method="pearson")
## [1] -0.04594149

The result of about -0.046 is negative, which indicates an inverse relationship between home_goals and away_goals: as home_goals increases, away_goals tends to decrease. The value is very close to 0, however, so the linear relationship is very weak.
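One way to gauge the strength of this relationship: in a simple linear regression with one predictor, the squared correlation equals R-squared, the share of the variance in one variable that is explained by the other. A minimal sketch, with the value of r copied from the cor() output above:

```r
# Correlation copied from the cor() output above
r <- -0.04594149

# In a simple linear regression, R-squared equals the squared correlation
r_squared <- r^2

print(round(r_squared, 6))   # 0.002111
```

In other words, only about 0.2% of the variation in away_goals is associated with home_goals.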

Performing a correlation test to obtain more information about the relationship between home_goals and away_goals

cor.test(nhl_win_data$home_goals, nhl_win_data$away_goals)
## 
##  Pearson's product-moment correlation
## 
## data:  nhl_win_data$home_goals and nhl_win_data$away_goals
## t = -7.4588, df = 26303, p-value = 9.004e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05799390 -0.03387569
## sample estimates:
##         cor 
## -0.04594149

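The test reports a 95% confidence interval of -0.058 to -0.034 for the correlation between home_goals and away_goals. In other words, we are 95% confident that the true correlation lies in that range. Because the interval excludes 0 and the p-value (9.004e-14) is far below 0.05, the negative correlation is statistically significant, even though it is small.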
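For reference, the t statistic and the interval can be reproduced by hand. The sketch below assumes the Fisher z-transformation approach that the cor.test documentation describes for Pearson confidence intervals. The value of r and the degrees of freedom are copied from the output above, and n is taken to be the degrees of freedom plus 2.

```r
r  <- -0.04594149   # sample correlation from the output above
df <- 26303         # degrees of freedom from the output above
n  <- df + 2        # number of paired observations (assumed: df = n - 2)

# t statistic for testing whether the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via Fisher's z-transformation
z  <- atanh(r)           # transform r to the z scale
se <- 1 / sqrt(n - 3)    # approximate standard error on the z scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)   # transform back to the r scale

print(round(t_stat, 4))
print(round(ci, 4))
```

With more than 26,000 observations the standard error is tiny, which is why such a small correlation can still be statistically significant.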

Displaying the summary of the linear regression model and plotting the relationship between the away goals and home goals.

#displaying summary
summary(lm(nhl_win_data$away_goals~nhl_win_data$home_goals))
## 
## Call:
## lm(formula = nhl_win_data$away_goals ~ nhl_win_data$home_goals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8183 -1.5100  0.1817  1.2698  8.4019 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2.818277   0.020116 140.099   <2e-16 ***
## nhl_win_data$home_goals -0.044039   0.005904  -7.459    9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.617 on 26303 degrees of freedom
## Multiple R-squared:  0.002111,   Adjusted R-squared:  0.002073 
## F-statistic: 55.63 on 1 and 26303 DF,  p-value: 9.004e-14
#plotting the relationship between the away goals and home goals as well as drawing the regression line
plot(nhl_win_data$home_goals,nhl_win_data$away_goals, col='red')
abline(lm(nhl_win_data$away_goals~nhl_win_data$home_goals))

The regression line in the graph above slopes slightly downward (a slope of about -0.044), which is consistent with the weak inverse relationship between home goals and away goals.
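To make the slope concrete, the sketch below plugs a few home goal totals into the fitted line. The intercept and slope are copied from the summary output above; predict_away_goals is a hypothetical helper written only for this illustration.

```r
# Coefficients copied from the summary() output above
intercept <- 2.818277
slope     <- -0.044039

# Hypothetical helper: predicted away goals for a given number of home goals
predict_away_goals <- function(home_goals) {
  intercept + slope * home_goals
}

predicted <- predict_away_goals(c(0, 3, 6))
print(round(predicted, 2))   # 2.82 2.69 2.55
```

Going from 0 to 6 home goals lowers the predicted away goals by only about a quarter of a goal, which shows how shallow the regression line is.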

Conclusion:

In conclusion, the negative correlation obtained with the Pearson method also appears in the linear model, whose regression line has a negative slope. This suggests that as a team's goal scoring at home increases, its goal scoring on the road tends to decrease, and vice versa. The effect is small, however: the R-squared of 0.002 means that home goals explain only about 0.2% of the variation in away goals. With that caveat, the result could help a coach decide on the starting lineup for away games versus home games. At home games, he would probably put forth his most offensive players (to score more goals), and for away games he would probably match up more defensive players (to allow fewer goals, since the team will score fewer of them).