This serves as a non-exhaustive review for the course. These are elements that I assume you have knowledge of prior to starting the course.
| variable | class | description |
|---|---|---|
| track | character | Track name |
| type | factor | Single or three lap record |
| shortcut | factor | Shortcut or non-shortcut record |
| player | character | Player’s name |
| system_played | character | Used system (NTSC or PAL) |
| date | date | World record date |
| time_period | period | Time as hms period |
| time | double | Time in seconds |
| record_duration | double | Record duration in days |
library(tidyverse)
library(ggformula)
library(lubridate)
library(mosaic)
library(e1071)
theme_set(theme_bw(base_size = 18))
mariokart <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-25/records.csv') %>%
  mutate(year = year(date),
         month = month(date),
         day = day(date))
head(mariokart)
## # A tibble: 6 x 12
## track type shortcut player system_played date time_period time
## <chr> <chr> <chr> <chr> <chr> <date> <chr> <dbl>
## 1 Luigi Raceway Thre~ No Salam NTSC 1997-02-15 2M 12.99S 133.
## 2 Luigi Raceway Thre~ No Booth NTSC 1997-02-16 2M 9.99S 130.
## 3 Luigi Raceway Thre~ No Salam NTSC 1997-02-16 2M 8.99S 129.
## 4 Luigi Raceway Thre~ No Salam NTSC 1997-02-28 2M 6.99S 127.
## 5 Luigi Raceway Thre~ No Gregg~ NTSC 1997-03-07 2M 4.51S 125.
## 6 Luigi Raceway Thre~ No Rocky~ NTSC 1997-04-30 2M 2.89S 123.
## # ... with 4 more variables: record_duration <dbl>, year <dbl>, month <dbl>,
## # day <dbl>
gf_histogram(~ time, data = mariokart, bins = 30) %>%
gf_labs(x = "Time (in seconds)")
gf_density(~ time, data = mariokart) %>%
gf_labs(x = "Time (in seconds)")
df_stats(~ time, data = mariokart, mean, median, sd, skewness, kurtosis, quantile(probs = c(0.1, 0.5, 0.9)))
## response mean median sd skewness kurtosis 10% 50% 90%
## 1 time 90.62383 86.19 66.6721 1.771732 3.844745 31.31 86.19 171.961
cor(time ~ record_duration, data = mariokart)
## [1] -0.06736739
gf_point(time ~ record_duration, data = mariokart) %>%
gf_labs(x = "How long the record was held",
y = "Time (in seconds)")
What is problematic about the analyses above? Why?
What could be done to improve the analyses above?
knitr::include_url("https://uiowa.instructure.com/courses/170115/assignments/1523376/submissions/275326")
This week we will dive into linear regression, the foundation of this course. The exploration of linear regression will start with the case where we have two continuous attributes: one outcome and one predictor. We may write this general model as:
\[ Y = \beta_{0} + \beta_{1} X + \epsilon \]
Where \(Y\) is the outcome attribute, also known as the dependent variable. The \(X\) term is the predictor/covariate attribute, also known as the independent variable. The \(\epsilon\) is a random error term; more on this later. Finally, \(\beta_{0}\) and \(\beta_{1}\) are unknown population coefficients that we are interested in estimating. More on these later too.
The data used for this section of the course is from the 2019 WNBA season. These data are part of the bayesrules package/book. The data contain 146 rows, one for each WNBA player sampled, and 32 attributes for that player. The R packages are loaded and the first few rows of the data are shown below.
library(tidyverse)
library(mosaic)
library(ggformula)
basketball <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/basketball.csv")
theme_set(theme_bw(base_size = 18))
head(basketball)
## # A tibble: 6 x 32
## player_name height weight year team age games_played games_started
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Natalie Achonwa 75 190 2019 IND 26 30 18
## 2 Kayla Alexander 76 195 2019 CHI 28 3 0
## 3 Rebecca Allen 74 162 2019 NYL 26 24 2
## 4 Jillian Alleyne 74 193 2019 MIN 24 5 0
## 5 Kristine Anigwe 76 200 2019 TOT 22 27 0
## 6 Kristine Anigwe 76 200 2019 CON 22 17 0
## # ... with 24 more variables: avg_minutes_played <dbl>, avg_field_goals <dbl>,
## # avg_field_goal_attempts <dbl>, field_goal_pct <dbl>,
## # avg_three_pointers <dbl>, avg_three_pointer_attempts <dbl>,
## # three_pointer_pct <dbl>, avg_two_pointers <dbl>,
## # avg_two_pointer_attempts <dbl>, two_pointer_pct <dbl>,
## # avg_free_throws <dbl>, avg_free_throw_attempts <dbl>, free_throw_pct <dbl>,
## # avg_offensive_rb <dbl>, avg_defensive_rb <dbl>, avg_rb <dbl>, ...
Suppose we are interested in exploring whether players tend to score more points by playing more minutes in the season. That is, those who play more may have more opportunities to score points. More generally, we are interested in the relationship between the average points scored in each game and the total minutes played across the season.
A first step in an analysis would be to explore each distribution independently. I’m going to leave most of that as an exercise for you to do on your own, though a minimal starting point is sketched below.
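This sketch explores avg_points, reusing the gf_histogram() and df_stats() functions from the review section; the same idea applies to total_minutes.
gf_histogram(~ avg_points, data = basketball, bins = 25) %>%
  gf_labs(x = "Average Points Scored")
df_stats(~ avg_points, data = basketball, mean, median, sd)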
The next step would be to explore the bivariate figure of these two attributes. As both of these attributes are continuous ratio type attributes, a scatterplot would be one way to visualize this. A scatterplot takes each X,Y pair of data and plots those coordinates. This can be done in R with the following code.
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
What can be noticed about the relationship between these two attributes?
Does there appear to be a relationship between the two?
Is this relationship perfect?
Adding a smoother line to the figure can help show how strong the relationship may be. In general, there are two types of smoothers that we will consider in this course. One is flexible and data dependent; this means the functional form of the relationship is flexible, allowing the data to indicate whether there are any non-linear aspects. The second is a linear, or straight-line, approach.
I’m going to add both to the figure below. The flexible curve (in this case a LOESS curve) is darker blue, and the linear line is lighter blue.
Does there appear to be much difference in the relationship across the two lines?
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_smooth() %>%
gf_smooth(method = 'lm', linetype = 2, color = 'lightblue') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
## `geom_smooth()` using method = 'loess'
The linear regression coefficients can be estimated within any statistical software (or by hand, even if tedious). Within R, the primary function is lm() to estimate a linear regression. The primary argument is a formula similar to the regression formula shown above at the top of the notes.
This equation could be written more directly for our specific problem.
\[ Avg\_points = \beta_{0} + \beta_{1} Minutes\_Played + \epsilon \]
For the R formula, instead of an \(=\), you insert a \(\sim\) (tilde).
wnba_reg <- lm(avg_points ~ total_minutes, data = basketball)
coef(wnba_reg)
## (Intercept) total_minutes
## 1.13562456 0.01014207
Now that we have estimates for the linear regression terms, how are these interpreted? The linear regression equation with these estimates plugged in would look like the following:
\[ \hat{avg\_points} = 1.1356 + .0101 min\_played \]
Where instead of \(\beta_{0}\) and \(\beta_{1}\), the estimated values from this single season have been inserted. Note the \(\hat{avg\_points}\) term: the caret symbol is read as "hat" (that is, "average points hat"), and it is a small but very important distinction. This now represents the predicted values from the linear regression. That means the predicted average number of points is assumed to be a function solely of the minutes a player played. We could put in any value for the minutes played and get a predicted average number of points out.
1.1356 + .0101 * 0
## [1] 1.1356
1.1356 + .0101 * 1
## [1] 1.1457
1.1356 + .0101 * 100
## [1] 2.1456
1.1356 + .0101 * mean(basketball$total_minutes)
## [1] 5.342042
1.1356 + .0101 * 5000
## [1] 51.6356
1.1356 + .0101 * -50
## [1] 0.6306
mean(basketball$total_minutes)
## [1] 416.4795
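Rather than plugging in values by hand with rounded coefficients, the predict() function returns the same kind of predicted values using the full-precision estimates. A minimal sketch, using the wnba_reg object fit above:
predict(wnba_reg, newdata = data.frame(total_minutes = c(0, 1, 100, 5000)))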
Also notice, from the equation above with the estimated coefficients, that there is no longer any error term. More on this later, but I wanted to point that out now. Back to model interpretations: these become a bit more obvious with the values computed above, where specific values for the total minutes played were input.
First, for the intercept (\(\beta_{0}\)), notice that in the first computation above, where 0 total minutes was input into the equation, the value returned is the same as the intercept estimate. This highlights what the intercept is: the predicted average number of points scored when the X attribute (minutes played) equals 0.
The slope (\(\beta_{1}\)) term is the average change in the outcome (average points here) for a one unit change in the predictor attribute (minutes played). The slope here is 0.0101, which means that the average points scored increases by about 0.01 points for every additional minute played. This effect is additive, meaning the 0.01 change for a one unit increase, say from 100 to 101 minutes, is the same as the change when increasing from 101 to 102 minutes.
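A quick way to see this additive property is to compare predicted values for consecutive one-minute increases; both differences below equal the estimated slope.
b <- coef(wnba_reg)
# change in predicted points going from 100 to 101 minutes
(b[1] + b[2] * 101) - (b[1] + b[2] * 100)
# change in predicted points going from 101 to 102 minutes
(b[1] + b[2] * 102) - (b[1] + b[2] * 101)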
The predictions coming from the linear regression are the same as the light blue dashed line shown in the figure above and recreated here without the dark blue line.
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_smooth(method = 'lm', linetype = 2, color = 'blue') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
So far the error has been disregarded, but where did it go? The error didn’t disappear, it is actually in the figure just created above. Where can you see the error? Why was it disregarded when creating the predicted values?
The short answer is that the error in a linear regression is commonly assumed to follow a Normal distribution with a mean of 0 and some variance, \(\sigma^2\). Sometimes this is written in math notation as:
\[ \epsilon \sim N(0, \sigma^2) \]
From this notation, can you see why the error was disregarded earlier when generating predictions?
In short, on average, the error is assumed to be 0 across all the sample data. The error will be smaller when the data are tightly clustered around the linear regression line and larger when they are not. In the simple case with a single predictor, the error is smallest when the correlation is closest to 1 in absolute value and largest when the correlation is close to or equals 0.
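One informal way to see this with the WNBA model is to look at the residuals directly: their mean should be essentially 0, and a histogram gives a rough sense of their spread. A minimal sketch:
mean(resid(wnba_reg))
gf_histogram(~ resid(wnba_reg), bins = 25) %>%
  gf_labs(x = "Residual (observed - predicted average points)")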
The connection between the size of the error and the correlation comes from the partitioning of variance that you may have heard of in a design of experiments or analysis of variance course. More specifically, the variance in the outcome can be partitioned, or split, into two components: the part that the independent attribute helps to explain and the part that it cannot explain. The part that can be explained is sometimes referred to as the sum of squares regression (SSR); the portion that is unexplained is referred to as the sum of squares error (SSE). This could be written in math notation as:
\[ \sum (Y - \bar{Y})^2 = \sum (Y - \hat{Y})^2 + \sum (\hat{Y} - \bar{Y})^2 \]
Let’s try to visualize what this means. In the first figure below, the red segments show the total deviations, \(Y - \bar{Y}\); in the second, the green segments show the error (residual) deviations, \(Y - \hat{Y}\); in the third, the orange segments show the regression deviations, \(\hat{Y} - \bar{Y}\).
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_hline(yintercept = ~mean(avg_points), data = basketball) %>%
gf_smooth(method = 'lm', linetype = 2, color = 'lightblue') %>%
gf_segment(avg_points + mean(avg_points) ~ total_minutes + total_minutes, data = basketball, color = '#FF7F7F') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_hline(yintercept = ~mean(avg_points), data = basketball) %>%
gf_smooth(method = 'lm', linetype = 2, color = 'lightblue') %>%
gf_segment(avg_points + fitted(wnba_reg) ~ total_minutes + total_minutes, data = basketball, color = '#65a765') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_hline(yintercept = ~mean(avg_points), data = basketball) %>%
gf_smooth(method = 'lm', linetype = 2, color = 'lightblue') %>%
gf_segment(mean(avg_points) + fitted(wnba_reg) ~ total_minutes + total_minutes, data = basketball, color = '#FFD580') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
## Another related measure of error
Another way to get a measure of how well the model is performing is a statistic called R-squared. This statistic is a function of the sums of squares described above.
\[ R^{2} = 1 - \frac{SS_{res}}{SS_{total}} \]
or
\[ R^{2} = \frac{SS_{reg}}{SS_{total}} \]
Let’s compute the sums of squares and get a value for \(R^2\).
basketball %>%
summarise(ss_total = sum((avg_points - mean(avg_points))^2),
ss_error = sum((avg_points - fitted(wnba_reg))^2),
ss_reg = sum((fitted(wnba_reg) - mean(avg_points))^2)) %>%
mutate(r_square = 1 - ss_error / ss_total,
r_square2 = ss_reg / ss_total)
## # A tibble: 1 x 5
## ss_total ss_error ss_reg r_square r_square2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2004. 564. 1440. 0.719 0.719
summary(wnba_reg)$r.square
## [1] 0.7185315
summary(wnba_reg)$sigma
## [1] 1.979045
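# By hand: the residual variance is SS_error / (n - 2); its square root matches summary(wnba_reg)$sigma above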
sigma_hat_square <- 563.9929 / (nrow(basketball) - 2)
sigma_hat <- sqrt(sigma_hat_square)
sigma_hat_square
## [1] 3.916617
sigma_hat
## [1] 1.979045
knitr::include_url("https://uiowa.instructure.com/courses/170115/assignments/1528020/submissions/275326")
This section of notes aims to dig a bit more into what the simple linear regression (i.e., regression with a single continuous covariate) parameters mean. We will also consider the estimation formulas to gain a sense of how these parameters can be computed.
This section of notes will explore new data from the Environmental Protection Agency on air quality collected for the state of Iowa in 2021. The data are daily values for PM 2.5 particulates. The attributes included in the data are shown below with a short description.
| Variable | Description |
|---|---|
| date | Date of observation |
| id | Site ID |
| poc | Parameter Occurrence Code (POC) |
| pm2.5 | Average daily PM 2.5 particulate value (ug/m3; micrograms per cubic meter) |
| daily_aqi | Average air quality index |
| site_name | Site Name |
| aqs_parameter_desc | Text Description of Observation |
| cbsa_code | Core Based Statistical Area (CBSA) ID |
| cbsa_name | CBSA Name |
| county | County in Iowa |
| avg_wind | Average daily wind speed (in knots) |
| max_wind | Maximum daily wind speed (in knots) |
| max_wind_hours | Time of maximum daily wind speed |
How is average daily wind speed related to the daily air quality index?
Note, below I do a bit of post-processing to combine data from different POC values within a single CBSA.
library(tidyverse)
library(ggformula)
library(mosaic)
theme_set(theme_bw(base_size = 18))
airquality <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/iowa_air_quality_2021.csv")
wind <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/daily_WIND_2021-iowa.csv")
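# Join the daily wind data to the air quality data by CBSA name and date, then drop rows with missing values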
airquality <- airquality %>%
left_join(wind, by = c('cbsa_name', 'date')) %>%
drop_na()
head(airquality)
## # A tibble: 6 x 13
## date id poc pm2.5 daily_aqi site_name aqs_parameter_d~ cbsa_code
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1/1/21 190130009 1 15.1 57 Water Tower PM2.5 - Local C~ 47940
## 2 1/4/21 190130009 1 13.3 54 Water Tower PM2.5 - Local C~ 47940
## 3 1/7/21 190130009 1 20.5 69 Water Tower PM2.5 - Local C~ 47940
## 4 1/10/21 190130009 1 14.3 56 Water Tower PM2.5 - Local C~ 47940
## 5 1/13/21 190130009 1 13.7 54 Water Tower PM2.5 - Local C~ 47940
## 6 1/16/21 190130009 1 5.3 22 Water Tower PM2.5 - Local C~ 47940
## # ... with 5 more variables: cbsa_name <chr>, county <chr>, avg_wind <dbl>,
## # max_wind <dbl>, max_wind_hours <dbl>
dim(airquality)
## [1] 4821 13
gf_point(daily_aqi ~ avg_wind, data = airquality, size = 4, alpha = .15) %>%
gf_labs(x = "Average daily wind speed (in knots)",
y = "Daily Air Quality") %>%
gf_smooth() %>%
gf_smooth(method = 'lm', color = 'lightblue', linetype = 2)
cor(daily_aqi ~ avg_wind, data = airquality)
## [1] -0.2920277
air_lm <- lm(daily_aqi ~ avg_wind, data = airquality)
coef(air_lm)
## (Intercept) avg_wind
## 48.222946 -2.211798
summary(air_lm)$r.square
## [1] 0.08528019
summary(air_lm)$sigma
## [1] 18.05479
There are times when centering predictors can be helpful for interpreting the model parameters. This is especially true when 0 is not a practically meaningful value of the attribute, or when more specific tests at certain values of the X attribute are of interest.
Mean centering is where the mean of the attribute is subtracted from each value. This is a linear transformation: each data point has a constant, the mean, subtracted from it, so the distances between points do not change.
airquality <- airquality %>%
mutate(avg_wind_mc = avg_wind - mean(avg_wind),
avg_wind_maxc = avg_wind - max(avg_wind),
avg_wind_10 = avg_wind - 10)
gf_point(daily_aqi ~ avg_wind_mc, data = airquality, size = 4, alpha = .15) %>%
gf_labs(x = "Average daily wind speed (in knots)",
y = "Daily Air Quality") %>%
gf_smooth() %>%
gf_smooth(method = 'lm', color = 'lightblue', linetype = 2)
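The model using the mean-centered predictor was not fit above; here is a minimal sketch, assuming the avg_wind_mc attribute created earlier. The slope, \(R^2\), and sigma will match the uncentered model, while the intercept becomes the predicted daily AQI at the average wind speed. The same comparison holds for the other centered versions fit below.
air_lm_mc <- lm(daily_aqi ~ avg_wind_mc, data = airquality)
coef(air_lm_mc)
summary(air_lm_mc)$r.square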
air_lm_maxc <- lm(daily_aqi ~ avg_wind_maxc, data = airquality)
coef(air_lm_maxc)
## (Intercept) avg_wind_maxc
## 5.968391 -2.211798
summary(air_lm_maxc)$r.square
## [1] 0.08528019
summary(air_lm_maxc)$sigma
## [1] 18.05479
air_lm_10 <- lm(daily_aqi ~ avg_wind_10, data = airquality)
coef(air_lm_10)
## (Intercept) avg_wind_10
## 26.104968 -2.211798
summary(air_lm_10)$r.square
## [1] 0.08528019
summary(air_lm_10)$sigma
## [1] 18.05479
In conclusion, notice that the slope, \(R^2\), and sigma have the same values in all of the centered and non-centered models. However, the intercept is different in each model and has a different interpretation based on where the predictor was centered.
Another type of regression that can be done is one in which the attributes are standardized prior to estimating the linear regression. What is meant by standardizing? This is converting the attributes into z-scores:
\[ Z_{aqi} = \frac{(aqi - \bar{aqi})}{s_{aqi}} \]
airquality <- airquality %>%
mutate(z_aqi = scale(daily_aqi),
z_aqi2 = (daily_aqi - mean(daily_aqi)) / sd(daily_aqi),
z_wind = scale(avg_wind))
head(airquality)
## # A tibble: 6 x 19
## date id poc pm2.5 daily_aqi site_name aqs_parameter_d~ cbsa_code
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1/1/21 190130009 1 15.1 57 Water Tower PM2.5 - Local C~ 47940
## 2 1/4/21 190130009 1 13.3 54 Water Tower PM2.5 - Local C~ 47940
## 3 1/7/21 190130009 1 20.5 69 Water Tower PM2.5 - Local C~ 47940
## 4 1/10/21 190130009 1 14.3 56 Water Tower PM2.5 - Local C~ 47940
## 5 1/13/21 190130009 1 13.7 54 Water Tower PM2.5 - Local C~ 47940
## 6 1/16/21 190130009 1 5.3 22 Water Tower PM2.5 - Local C~ 47940
## # ... with 11 more variables: cbsa_name <chr>, county <chr>, avg_wind <dbl>,
## # max_wind <dbl>, max_wind_hours <dbl>, avg_wind_mc <dbl>,
## # avg_wind_maxc <dbl>, avg_wind_10 <dbl>, z_aqi <dbl[,1]>, z_aqi2 <dbl>,
## # z_wind <dbl[,1]>
air_lm_s <- lm(z_aqi ~ z_wind, data = airquality)
coef(air_lm_s)
## (Intercept) z_wind
## -6.712023e-16 -2.920277e-01
summary(air_lm_s)$r.square
## [1] 0.08528019
summary(air_lm_s)$sigma
## [1] 0.9565091
We can also use the following formula to convert any unstandardized regression coefficient into a standardized metric. With a single predictor, this standardized slope equals the correlation between the two attributes, as the computation below confirms.
\[ b^{'}_{k} = b_{k} * \frac{s_{x_{k}}}{s_{y}} \]
-2.211 * sd(airquality$avg_wind) / sd(airquality$daily_aqi)
## [1] -0.2919224
cor(daily_aqi ~ avg_wind, data = airquality)
## [1] -0.2920277
Now that we have looked at how the parameters are impacted by changes in the model specification, how are these parameters actually estimated? I will show two ways: one is general, the other is specific to this simple case with a single predictor/covariate attribute. In general, linear regression (or, more generally, the general linear model) uses least squares estimation. This means that the parameters in the model minimize the squared distance between the observed and predicted values. That is, the least squares estimates minimize this criterion:
\[ \sum (Y - \hat{Y})^2 \]
Calculus can be used to derive two equations (the normal equations) that can be solved simultaneously to get estimates for \(\beta_{0}\) and \(\beta_{1}\) that minimize the criterion above. These formulas are:
\[ b_{1} = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2} \] \[ b_{0} = \bar{Y} - b_{1}\bar{X} \]
Let’s use R to get these quantities.
b1 <- with(airquality,
sum((avg_wind - mean(avg_wind)) * (daily_aqi - mean(daily_aqi))) / sum((avg_wind - mean(avg_wind))^2)
)
b0 <- with(airquality,
mean(daily_aqi) - b1 * mean(avg_wind)
)
b0
## [1] 48.22295
b1
## [1] -2.211798
coef(air_lm)
## (Intercept) avg_wind
## 48.222946 -2.211798
When there is more than one predictor, the number of equations gets a bit unwieldy; therefore, there is a general analytic approach that works for any set of predictor attributes. The general approach uses matrix algebra (anyone take linear algebra?) to obtain the estimates. This general form is:
\[ \mathbf{b} = \left( \mathbf{X}^{\prime}\mathbf{X} \right)^{-1} \left( \mathbf{X}^{\prime} \mathbf{Y} \right) \] Where \(\mathbf{b}\) is a vector of estimated regression coefficients, \(\mathbf{X}\) is a matrix of covariate/predictor attributes (called the design matrix), and \(\mathbf{Y}\) is a vector of the outcome attribute.
Below, I show what these would look like for the air quality example that has been used and solve for the regression coefficients.
X <- model.matrix(air_lm)
head(X)
## (Intercept) avg_wind
## 1 1 2.941667
## 2 1 2.445833
## 3 1 1.995833
## 4 1 3.445833
## 5 1 1.116667
## 6 1 6.091667
Y <- as.matrix(airquality$daily_aqi)
head(Y)
## [,1]
## [1,] 57
## [2,] 54
## [3,] 69
## [4,] 56
## [5,] 54
## [6,] 22
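# solve() below returns the inverse of the cross-product matrix, i.e., (X'X)^-1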
X_X <- solve(t(X) %*% X)
X_X
## (Intercept) avg_wind
## (Intercept) 0.0008152474 -1.424894e-04
## avg_wind -0.0001424894 3.340328e-05
X_Y <- t(X) %*% Y
X_Y
## [,1]
## (Intercept) 186997
## avg_wind 731464
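# b = (X'X)^-1 X'Y; these values match coef(air_lm) below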
X_X %*% X_Y
## [,1]
## (Intercept) 48.222946
## avg_wind -2.211798
coef(air_lm)
## (Intercept) avg_wind
## 48.222946 -2.211798
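As a side note, the same estimates can be obtained in a single step without explicitly forming the inverse; a minimal sketch using crossprod():
solve(crossprod(X), crossprod(X, Y))
Solving the linear system directly this way avoids computing and storing the inverse.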
knitr::include_url("https://uiowa.instructure.com/courses/170115/assignments/1533818/submissions/275326")