Introduction:
In the code below, I would like to demonstrate the data mining technique of regression. Regression is used to model the relationship between variables and to predict the value of one numeric variable from another. In my example, I would like to find out the relationship between away goals and home goals in NHL games (the away_goals and home_goals columns of the data set). It is important to remember that the correlation coefficient, which measures the strength of the linear relationship between two variables, ranges from -1 to 1. A negative value indicates an inverse relationship (one variable increases as the other decreases), a positive value indicates a direct relationship (as one variable increases, the other also increases), and a value near 0 indicates little or no linear relationship.
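To make the -1 to 1 scale concrete, here is a minimal sketch on made-up numbers (the vectors `x`, `y_direct`, and `y_inverse` are illustrative assumptions, not part of the NHL data). A perfectly direct linear relationship gives a correlation of 1 and a perfectly inverse one gives -1.

```r
# Toy vectors, invented purely for illustration (not NHL data)
x <- 1:10
y_direct  <- 2 * x + 1    # rises as x rises
y_inverse <- 20 - 3 * x   # falls as x rises

r_direct  <- cor(x, y_direct,  method = "pearson")
r_inverse <- cor(x, y_inverse, method = "pearson")

print(r_direct)   # 1
print(r_inverse)  # -1
```

Real data almost never sits exactly on a line, so values strictly between -1 and 1 are the norm, and the closer the value is to 0 the weaker the linear relationship.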
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1
## v readr 2.1.2 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
nhl_win_data <- read.csv("https://raw.githubusercontent.com/GitHub-Vlad/Data-Science/main/game.csv", header = TRUE)
View(nhl_win_data)
cor(nhl_win_data$home_goals, nhl_win_data$away_goals, method="pearson")
## [1] -0.04594149
The correlation of about -0.046 above indicates an inverse relationship between home_goals and away_goals, but a very weak one, since the value is close to 0. In other words, as the number of home goals increases, the number of away goals tends to decrease only slightly.
cor.test(nhl_win_data$home_goals, nhl_win_data$away_goals)
##
## Pearson's product-moment correlation
##
## data: nhl_win_data$home_goals and nhl_win_data$away_goals
## t = -7.4588, df = 26303, p-value = 9.004e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05799390 -0.03387569
## sample estimates:
## cor
## -0.04594149
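The test above gives a 95% confidence interval of about -0.058 to -0.034 for the correlation between home_goals and away_goals, so we can be 95% confident that the true correlation lies in that range. Because the interval excludes zero and the p-value (9.004e-14) is tiny, the negative correlation is statistically significant. With 26,305 observations, however, even a very small correlation is detectable, so significance here says little about the strength of the relationship.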
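The figures reported by cor.test() can be reproduced by hand from just the sample correlation and the degrees of freedom. The sketch below assumes the standard approach for a Pearson correlation: a t statistic for the test and a Fisher z-transformation for the confidence interval. The variable names are my own.

```r
r  <- -0.04594149   # sample correlation reported above
df <- 26303         # degrees of freedom reported above
n  <- df + 2        # number of observations used

# t statistic and two-sided p-value for H0: true correlation = 0
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 4))  # -7.4588
print(round(ci, 4))      # -0.0580 -0.0339
```

The standard error shrinks with the square root of the sample size, which is why such a narrow interval is possible around a correlation this close to zero.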
#displaying summary
summary(lm(nhl_win_data$away_goals~nhl_win_data$home_goals))
##
## Call:
## lm(formula = nhl_win_data$away_goals ~ nhl_win_data$home_goals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8183 -1.5100 0.1817 1.2698 8.4019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.818277 0.020116 140.099 <2e-16 ***
## nhl_win_data$home_goals -0.044039 0.005904 -7.459 9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.617 on 26303 degrees of freedom
## Multiple R-squared: 0.002111, Adjusted R-squared: 0.002073
## F-statistic: 55.63 on 1 and 26303 DF, p-value: 9.004e-14
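To see what these coefficients mean in practice, the sketch below plugs the reported intercept and slope into the fitted line and checks that, for a regression with one predictor, R-squared is simply the squared correlation. The helper `predict_away()` is a hypothetical name introduced for illustration.

```r
intercept <- 2.818277    # (Intercept) estimate from the summary above
slope     <- -0.044039   # home_goals estimate from the summary above
r         <- -0.04594149 # Pearson correlation computed earlier

# Fitted line: expected away goals for a given number of home goals
predict_away <- function(home_goals) intercept + slope * home_goals

predictions <- predict_away(c(0, 3, 6))
print(predictions)            # about 2.818, 2.686, 2.554

# With a single predictor, R-squared equals the squared correlation
r_squared <- r^2
print(round(r_squared, 6))    # 0.002111
```

Going from 0 to 6 home goals lowers the expected away goals by only about a quarter of a goal, and home goals account for roughly 0.2% of the variation in away goals.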
# plotting the relationship between the away goals and home goals, and drawing the regression line
plot(nhl_win_data$home_goals, nhl_win_data$away_goals, col = 'red')
abline(lm(nhl_win_data$away_goals ~ nhl_win_data$home_goals))
The graph above illustrates the same result: the regression line slopes slightly downward, consistent with a weak inverse relationship between home goals and away goals.
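Because goal counts are whole numbers, many games fall on exactly the same point of a scatter plot, which can hide how the observations are actually distributed. One possible remedy, sketched below on simulated stand-in data (the `demo` data frame is an assumption for illustration, not the NHL file), is to jitter the points and make them semi-transparent before drawing the regression line.

```r
# Simulated integer goal counts, used only as a stand-in for the real data
set.seed(1)
demo <- data.frame(
  home_goals = rpois(500, lambda = 3),
  away_goals = rpois(500, lambda = 2.7)
)

fit <- lm(away_goals ~ home_goals, data = demo)

# jitter() adds a little random noise so overlapping points become visible;
# adjustcolor() makes the points semi-transparent
plot(jitter(demo$home_goals), jitter(demo$away_goals),
     col = adjustcolor("red", alpha.f = 0.3), pch = 16,
     xlab = "Home goals", ylab = "Away goals")
abline(fit)
```

The same two plotting calls should work on nhl_win_data by swapping in its columns, although with tens of thousands of games a lower alpha value may be needed.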
Conclusion:
In conclusion, the negative correlation obtained with the Pearson method was also visible when plotting the linear model, whose regression line has a negative slope. It follows that as home goals increase, away goals tend to decrease, and vice versa. The effect is small, however: the slope is about -0.044 and R-squared is about 0.002, so home goals explain only around 0.2% of the variation in away goals. This could still give a coach a modest hint when deciding the starting lineup for away games versus home games, although with a relationship this weak it should be one factor among many. At home games, he would probably put forth his most offensive players (to score more goals), and for away games he would probably match up more defensive players (to allow fewer goals, since the team will score fewer of them).