This serves as a non-exhaustive review for the course. These are elements that I assume you have knowledge of prior to starting the course.
| variable | class | description |
|---|---|---|
| track | character | Track name |
| type | factor | Single or three lap record |
| shortcut | factor | Shortcut or non-shortcut record |
| player | character | Player’s name |
| system_played | character | Used system (NTSC or PAL) |
| date | date | World record date |
| time_period | period | Time as hms period |
| time | double | Time in seconds |
| record_duration | double | Record duration in days |
library(tidyverse)
library(ggformula)
library(lubridate)
library(mosaic)
library(e1071)
theme_set(theme_bw(base_size = 18))
mariokart <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-25/records.csv') %>%
  mutate(year = year(date),
         month = month(date),
         day = day(date))
head(mariokart)
## # A tibble: 6 x 12
## track type shortcut player system_played date time_period time
## <chr> <chr> <chr> <chr> <chr> <date> <chr> <dbl>
## 1 Luigi Raceway Thre~ No Salam NTSC 1997-02-15 2M 12.99S 133.
## 2 Luigi Raceway Thre~ No Booth NTSC 1997-02-16 2M 9.99S 130.
## 3 Luigi Raceway Thre~ No Salam NTSC 1997-02-16 2M 8.99S 129.
## 4 Luigi Raceway Thre~ No Salam NTSC 1997-02-28 2M 6.99S 127.
## 5 Luigi Raceway Thre~ No Gregg~ NTSC 1997-03-07 2M 4.51S 125.
## 6 Luigi Raceway Thre~ No Rocky~ NTSC 1997-04-30 2M 2.89S 123.
## # ... with 4 more variables: record_duration <dbl>, year <dbl>, month <dbl>,
## # day <dbl>
gf_histogram(~ time, data = mariokart, bins = 30) %>%
gf_labs(x = "Time (in seconds)")
gf_density(~ time, data = mariokart) %>%
gf_labs(x = "Time (in seconds)")
df_stats(~ time, data = mariokart, mean, median, sd, skewness, kurtosis, quantile(probs = c(0.1, 0.5, 0.9)))
## response mean median sd skewness kurtosis 10% 50% 90%
## 1 time 90.62383 86.19 66.6721 1.771732 3.844745 31.31 86.19 171.961
cor(time ~ record_duration, data = mariokart)
## [1] -0.06736739
gf_point(time ~ record_duration, data = mariokart) %>%
gf_labs(x = "How long the record was held",
y = "Time (in seconds)")
What is problematic about the analyses above? Why?
What could be done to improve the analyses above?
knitr::include_url("https://uiowa.instructure.com/courses/170115/assignments/1523376/submissions/275326")
This week we will dive into linear regression, the foundation of this course. The exploration of linear regression will start with the case where we have two continuous attributes: one outcome and one predictor. We may write this general model as:
\[ Y = \beta_{0} + \beta_{1} X + \epsilon \]
Where \(Y\) is the outcome attribute, also known as the dependent variable. The \(X\) term is the predictor/covariate attribute, also known as the independent variable. The \(\epsilon\) is a random error term; more on this later. Finally, \(\beta_{0}\) and \(\beta_{1}\) are unknown population coefficients that we are interested in estimating. More on these later too.
The data used for this section of the course is from the 2019 WNBA season. These data are part of the bayesrules package/book. The data contain 146 rows, one for each WNBA player sampled, and 32 attributes for that player. The R packages are loaded and the first few rows of the data are shown below.
library(tidyverse)
library(mosaic)
library(ggformula)
basketball <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/basketball.csv")
theme_set(theme_bw(base_size = 18))
head(basketball)
## # A tibble: 6 x 32
## player_name height weight year team age games_played games_started
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 Natalie Achonwa 75 190 2019 IND 26 30 18
## 2 Kayla Alexander 76 195 2019 CHI 28 3 0
## 3 Rebecca Allen 74 162 2019 NYL 26 24 2
## 4 Jillian Alleyne 74 193 2019 MIN 24 5 0
## 5 Kristine Anigwe 76 200 2019 TOT 22 27 0
## 6 Kristine Anigwe 76 200 2019 CON 22 17 0
## # ... with 24 more variables: avg_minutes_played <dbl>, avg_field_goals <dbl>,
## # avg_field_goal_attempts <dbl>, field_goal_pct <dbl>,
## # avg_three_pointers <dbl>, avg_three_pointer_attempts <dbl>,
## # three_pointer_pct <dbl>, avg_two_pointers <dbl>,
## # avg_two_pointer_attempts <dbl>, two_pointer_pct <dbl>,
## # avg_free_throws <dbl>, avg_free_throw_attempts <dbl>, free_throw_pct <dbl>,
## # avg_offensive_rb <dbl>, avg_defensive_rb <dbl>, avg_rb <dbl>, ...
Suppose we are interested in exploring whether players tend to score more points by playing more minutes in the season. That is, those who play more may have more opportunities to score points. More generally, we are interested in the relationship between the average points scored in each game and the total minutes played across the season.
A first step in an analysis would be to explore each distribution independently. I’m going to leave most of that as an exercise for you to do on your own, though a minimal starting point is sketched below.
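This sketch explores avg_points, reusing the gf_histogram() and df_stats() functions from the review section; the same idea applies to total_minutes.
gf_histogram(~ avg_points, data = basketball, bins = 25) %>%
  gf_labs(x = "Average Points Scored")
df_stats(~ avg_points, data = basketball, mean, median, sd)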
The next step would be to explore the bivariate figure of these two attributes. As both of these attributes are continuous ratio type attributes, a scatterplot would be one way to visualize this. A scatterplot takes each X,Y pair of data and plots those coordinates. This can be done in R with the following code.
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
What can be noticed about the relationship between these two attributes?
Does there appear to be a relationship between the two?
Is this relationship perfect?
Adding a smoother line to the figure can help show how strong the relationship may be. In general, there are two types of smoothers that we will consider in this course. One is flexible and data dependent; this means the functional form of the relationship is flexible, allowing the data to indicate whether there are any non-linear aspects. The second is a linear, or straight-line, approach.
I’m going to add both to the figure below. The flexible curve (in this case a LOESS curve) is darker blue, and the linear line is lighter blue.
Does there appear to be much difference in the relationship across the two lines?
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_smooth() %>%
gf_smooth(method = 'lm', linetype = 2, color = 'lightblue') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
## `geom_smooth()` using method = 'loess'
The linear regression coefficients can be estimated within any statistical software (or by hand, even if tedious). Within R, the primary function is lm() to estimate a linear regression. The primary argument is a formula similar to the regression formula shown above at the top of the notes.
This equation could be written more directly for our specific problem.
\[ Avg\_points = \beta_{0} + \beta_{1} Minutes\_Played + \epsilon \]
For the R formula, instead of an \(=\), you insert a \(\sim\) (tilde).
wnba_reg <- lm(avg_points ~ total_minutes, data = basketball)
coef(wnba_reg)
## (Intercept) total_minutes
## 1.13562456 0.01014207
Now that we have estimates for the linear regression terms, how are these interpreted? The linear regression equation with these estimates plugged in would look like the following:
\[ \hat{avg\_points} = 1.1356 + .0101 min\_played \]
Where instead of \(\beta_{0}\) and \(\beta_{1}\), the estimated values from this single season have been inserted. Note the \(\hat{avg\_points}\) term: the caret symbol is read as "hat" (that is, "average points hat"), and it is a small but very important distinction. This now represents the predicted values from the linear regression. That means the predicted average number of points is assumed to be a function solely of the minutes a player played. We could put in any value for the minutes played and get a predicted average number of points out.
1.1356 + .0101 * 0
## [1] 1.1356
1.1356 + .0101 * 1
## [1] 1.1457
1.1356 + .0101 * 100
## [1] 2.1456
1.1356 + .0101 * mean(basketball$total_minutes)
## [1] 5.342042
1.1356 + .0101 * 5000
## [1] 51.6356
1.1356 + .0101 * -50
## [1] 0.6306
mean(basketball$total_minutes)
## [1] 416.4795
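Rather than plugging in values by hand with rounded coefficients, the predict() function returns the same kind of predicted values using the full-precision estimates. A minimal sketch, using the wnba_reg object fit above:
predict(wnba_reg, newdata = data.frame(total_minutes = c(0, 1, 100, 5000)))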
Also notice, from the equation above with the estimated coefficients, that there is no longer any error term. More on this later, but I wanted to point that out now. Back to model interpretations: these become a bit more obvious with the values computed above, where specific values for the total minutes played were input.
First, for the intercept (\(\beta_{0}\)), notice that in the first computation above, where 0 total minutes was input into the equation, the value returned is the same as the intercept estimate. This highlights what the intercept is: the predicted average number of points scored when the X attribute (minutes played) equals 0.
The slope (\(\beta_{1}\)) term is the average change in the outcome (average points here) for a one unit change in the predictor attribute (minutes played). The slope here is 0.0101, which means that the average points scored increases by about 0.01 points for every additional minute played. This effect is additive, meaning the 0.01 change for a one unit increase, say from 100 to 101 minutes, is the same as the change when increasing from 101 to 102 minutes.
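A quick way to see this additive property is to compare predicted values for consecutive one-minute increases; both differences below equal the estimated slope.
b <- coef(wnba_reg)
# change in predicted points going from 100 to 101 minutes
(b[1] + b[2] * 101) - (b[1] + b[2] * 100)
# change in predicted points going from 101 to 102 minutes
(b[1] + b[2] * 102) - (b[1] + b[2] * 101)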
The predictions coming from the linear regression are the same as the light blue dashed line shown in the figure above and recreated here without the dark blue line.
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_smooth(method = 'lm', linetype = 2, color = 'blue') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
So far the error has been disregarded, but where did it go? The error didn’t disappear, it is actually in the figure just created above. Where can you see the error? Why was it disregarded when creating the predicted values?
The short answer is that the error in a linear regression is commonly assumed to follow a Normal distribution with a mean of 0 and some variance, \(\sigma^2\). Sometimes this is written in math notation as:
\[ \epsilon \sim N(0, \sigma^2) \]
From this notation, can you see why the error was disregarded earlier when generating predictions?
In short, on average, the error is assumed to be 0 across all the sample data. The error will be smaller when the data are tightly clustered around the linear regression line and larger when they are not. In the simple case with a single predictor, the error is smallest when the correlation is closest to 1 in absolute value and largest when the correlation is close to or equals 0.
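One informal way to see this with the WNBA model is to look at the residuals directly: their mean should be essentially 0, and a histogram gives a rough sense of their spread. A minimal sketch:
mean(resid(wnba_reg))
gf_histogram(~ resid(wnba_reg), bins = 25) %>%
  gf_labs(x = "Residual (observed - predicted average points)")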
The connection between the size of the error and the correlation comes from the partitioning of variance that you may have heard of in a design of experiments or analysis of variance course. More specifically, the variance in the outcome can be partitioned, or split, into two components: the part that the independent attribute helps to explain and the part that it cannot explain. The part that can be explained is sometimes referred to as the sum of squares regression (SSR); the portion that is unexplained is referred to as the sum of squares error (SSE). This could be written in math notation as:
\[ \sum (Y - \bar{Y})^2 = \sum (Y - \hat{Y})^2 + \sum (\hat{Y} - \bar{Y})^2 \]
Let’s try to visualize what this means. In the first figure below, the red segments show the total deviations, \(Y - \bar{Y}\); in the second, the green segments show the error (residual) deviations, \(Y - \hat{Y}\); in the third, the orange segments show the regression deviations, \(\hat{Y} - \bar{Y}\).
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_hline(yintercept = ~mean(avg_points), data = basketball) %>%
gf_smooth(method = 'lm', linetype = 2, color = 'lightblue') %>%
gf_segment(avg_points + mean(avg_points) ~ total_minutes + total_minutes, data = basketball, color = '#FF7F7F') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_hline(yintercept = ~mean(avg_points), data = basketball) %>%
gf_smooth(method = 'lm', linetype = 2, color = 'lightblue') %>%
gf_segment(avg_points + fitted(wnba_reg) ~ total_minutes + total_minutes, data = basketball, color = '#65a765') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
gf_point(avg_points ~ total_minutes, data = basketball, size = 4, alpha = .5) %>%
gf_hline(yintercept = ~mean(avg_points), data = basketball) %>%
gf_smooth(method = 'lm', linetype = 2, color = 'lightblue') %>%
gf_segment(mean(avg_points) + fitted(wnba_reg) ~ total_minutes + total_minutes, data = basketball, color = '#FFD580') %>%
gf_labs(x = "Total Minutes Played",
y = "Average Points Scored")
## Another related measure of error
Another way to get a measure of how well the model is performing is a statistic called R-squared. This statistic is a function of the sums of squares described above.
\[ R^{2} = 1 - \frac{SS_{res}}{SS_{total}} \]
or
\[ R^{2} = \frac{SS_{reg}}{SS_{total}} \]
Let’s compute the sums of squares and get a value for \(R^2\).
basketball %>%
summarise(ss_total = sum((avg_points - mean(avg_points))^2),
ss_error = sum((avg_points - fitted(wnba_reg))^2),
ss_reg = sum((fitted(wnba_reg) - mean(avg_points))^2)) %>%
mutate(r_square = 1 - ss_error / ss_total,
r_square2 = ss_reg / ss_total)
## # A tibble: 1 x 5
## ss_total ss_error ss_reg r_square r_square2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2004. 564. 1440. 0.719 0.719
summary(wnba_reg)$r.square
## [1] 0.7185315
summary(wnba_reg)$sigma
## [1] 1.979045
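# By hand: the residual variance is SS_error / (n - 2); its square root matches summary(wnba_reg)$sigma above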
sigma_hat_square <- 563.9929 / (nrow(basketball) - 2)
sigma_hat <- sqrt(sigma_hat_square)
sigma_hat_square
## [1] 3.916617
sigma_hat
## [1] 1.979045
knitr::include_url("https://uiowa.instructure.com/courses/170115/assignments/1528020/submissions/275326")
This section of notes aims to dig a bit more into what the simple linear regression (i.e., regression with a single continuous covariate) parameters mean. We will also consider the estimation formulas to gain a sense of how these parameters can be computed.
This section of notes will explore new data from the Environmental Protection Agency on air quality collected for the state of Iowa in 2021. The data are daily values for PM 2.5 particulates. The attributes included in the data are shown below with a short description.
| Variable | Description |
|---|---|
| date | Date of observation |
| id | Site ID |
| poc | Parameter Occurrence Code (POC) |
| pm2.5 | Average daily PM 2.5 particulate value (ug/m3; micrograms per cubic meter) |
| daily_aqi | Average air quality index |
| site_name | Site Name |
| aqs_parameter_desc | Text Description of Observation |
| cbsa_code | Core Based Statistical Area (CBSA) ID |
| cbsa_name | CBSA Name |
| county | County in Iowa |
| avg_wind | Average daily wind speed (in knots) |
| max_wind | Maximum daily wind speed (in knots) |
| max_wind_hours | Time of maximum daily wind speed |
How is average daily wind speed related to the daily air quality index?
Note, below I do a bit of post-processing to combine data from different POC values within a single CBSA.
library(tidyverse)
library(ggformula)
library(mosaic)
theme_set(theme_bw(base_size = 18))
airquality <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/iowa_air_quality_2021.csv")
wind <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/daily_WIND_2021-iowa.csv")
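# Join the daily wind data to the air quality data by CBSA name and date, then drop rows with missing values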
airquality <- airquality %>%
left_join(wind, by = c('cbsa_name', 'date')) %>%
drop_na()
head(airquality)
## # A tibble: 6 x 13
## date id poc pm2.5 daily_aqi site_name aqs_parameter_d~ cbsa_code
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1/1/21 190130009 1 15.1 57 Water Tower PM2.5 - Local C~ 47940
## 2 1/4/21 190130009 1 13.3 54 Water Tower PM2.5 - Local C~ 47940
## 3 1/7/21 190130009 1 20.5 69 Water Tower PM2.5 - Local C~ 47940
## 4 1/10/21 190130009 1 14.3 56 Water Tower PM2.5 - Local C~ 47940
## 5 1/13/21 190130009 1 13.7 54 Water Tower PM2.5 - Local C~ 47940
## 6 1/16/21 190130009 1 5.3 22 Water Tower PM2.5 - Local C~ 47940
## # ... with 5 more variables: cbsa_name <chr>, county <chr>, avg_wind <dbl>,
## # max_wind <dbl>, max_wind_hours <dbl>
dim(airquality)
## [1] 4821 13
gf_point(daily_aqi ~ avg_wind, data = airquality, size = 4, alpha = .15) %>%
gf_labs(x = "Average daily wind speed (in knots)",
y = "Daily Air Quality") %>%
gf_smooth() %>%
gf_smooth(method = 'lm', color = 'lightblue', linetype = 2)
cor(daily_aqi ~ avg_wind, data = airquality)
## [1] -0.2920277
air_lm <- lm(daily_aqi ~ avg_wind, data = airquality)
coef(air_lm)
## (Intercept) avg_wind
## 48.222946 -2.211798
summary(air_lm)$r.square
## [1] 0.08528019
summary(air_lm)$sigma
## [1] 18.05479
There are times when centering predictors can be helpful for interpreting the model parameters. This is especially true when 0 is not a practically meaningful value of the attribute, or when more specific tests at certain values of the X attribute are of interest.
Mean centering is where the mean of the attribute is subtracted from each value. This is a linear transformation: each data point has a constant, the mean, subtracted from it, so the distances between points do not change.
airquality <- airquality %>%
mutate(avg_wind_mc = avg_wind - mean(avg_wind),
avg_wind_maxc = avg_wind - max(avg_wind),
avg_wind_10 = avg_wind - 10)
gf_point(daily_aqi ~ avg_wind_mc, data = airquality, size = 4, alpha = .15) %>%
gf_labs(x = "Average daily wind speed (in knots)",
y = "Daily Air Quality") %>%
gf_smooth() %>%
gf_smooth(method = 'lm', color = 'lightblue', linetype = 2)
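The model using the mean-centered predictor was not fit above; here is a minimal sketch, assuming the avg_wind_mc attribute created earlier. The slope, \(R^2\), and sigma will match the uncentered model, while the intercept becomes the predicted daily AQI at the average wind speed. The same comparison holds for the other centered versions fit below.
air_lm_mc <- lm(daily_aqi ~ avg_wind_mc, data = airquality)
coef(air_lm_mc)
summary(air_lm_mc)$r.square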
air_lm_maxc <- lm(daily_aqi ~ avg_wind_maxc, data = airquality)
coef(air_lm_maxc)
## (Intercept) avg_wind_maxc
## 5.968391 -2.211798
summary(air_lm_maxc)$r.square
## [1] 0.08528019
summary(air_lm_maxc)$sigma
## [1] 18.05479
air_lm_10 <- lm(daily_aqi ~ avg_wind_10, data = airquality)
coef(air_lm_10)
## (Intercept) avg_wind_10
## 26.104968 -2.211798
summary(air_lm_10)$r.square
## [1] 0.08528019
summary(air_lm_10)$sigma
## [1] 18.05479
In conclusion, notice that the slope, \(R^2\), and sigma have the same values in all of the centered and non-centered models. However, the intercept is different in each model and has a different interpretation based on where the predictor was centered.
Another type of regression that can be done is one in which the attributes are standardized prior to estimating the linear regression. What is meant by standardizing? This is converting the attributes into z-scores:
\[ Z_{aqi} = \frac{(aqi - \bar{aqi})}{s_{aqi}} \]
airquality <- airquality %>%
mutate(z_aqi = scale(daily_aqi),
z_aqi2 = (daily_aqi - mean(daily_aqi)) / sd(daily_aqi),
z_wind = scale(avg_wind))
head(airquality)
## # A tibble: 6 x 19
## date id poc pm2.5 daily_aqi site_name aqs_parameter_d~ cbsa_code
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1/1/21 190130009 1 15.1 57 Water Tower PM2.5 - Local C~ 47940
## 2 1/4/21 190130009 1 13.3 54 Water Tower PM2.5 - Local C~ 47940
## 3 1/7/21 190130009 1 20.5 69 Water Tower PM2.5 - Local C~ 47940
## 4 1/10/21 190130009 1 14.3 56 Water Tower PM2.5 - Local C~ 47940
## 5 1/13/21 190130009 1 13.7 54 Water Tower PM2.5 - Local C~ 47940
## 6 1/16/21 190130009 1 5.3 22 Water Tower PM2.5 - Local C~ 47940
## # ... with 11 more variables: cbsa_name <chr>, county <chr>, avg_wind <dbl>,
## # max_wind <dbl>, max_wind_hours <dbl>, avg_wind_mc <dbl>,
## # avg_wind_maxc <dbl>, avg_wind_10 <dbl>, z_aqi <dbl[,1]>, z_aqi2 <dbl>,
## # z_wind <dbl[,1]>
air_lm_s <- lm(z_aqi ~ z_wind, data = airquality)
coef(air_lm_s)
## (Intercept) z_wind
## -6.712023e-16 -2.920277e-01
summary(air_lm_s)$r.square
## [1] 0.08528019
summary(air_lm_s)$sigma
## [1] 0.9565091
We can also use the following formula to convert any unstandardized regression coefficient into a standardized metric. With a single predictor, this standardized slope equals the correlation between the two attributes, as the computation below confirms.
\[ b^{'}_{k} = b_{k} * \frac{s_{x_{k}}}{s_{y}} \]
-2.211 * sd(airquality$avg_wind) / sd(airquality$daily_aqi)
## [1] -0.2919224
cor(daily_aqi ~ avg_wind, data = airquality)
## [1] -0.2920277
Now that we have looked at how the parameters are impacted by changes in the model specification, how are these parameters actually estimated? I will show two ways: one is general, the other is specific to this simple case with a single predictor/covariate attribute. In general, linear regression (or, more generally, the general linear model) uses least squares estimation. This means that the parameters in the model minimize the squared distance between the observed and predicted values. That is, the least squares estimates minimize this criterion:
\[ \sum (Y - \hat{Y})^2 \]
Calculus can be used to derive two equations (the normal equations) that can be solved simultaneously to get estimates for \(\beta_{0}\) and \(\beta_{1}\) that minimize the criterion above. These formulas are:
\[ b_{1} = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2} \] \[ b_{0} = \bar{Y} - b_{1}\bar{X} \]
Let’s use R to get these quantities.
b1 <- with(airquality,
sum((avg_wind - mean(avg_wind)) * (daily_aqi - mean(daily_aqi))) / sum((avg_wind - mean(avg_wind))^2)
)
b0 <- with(airquality,
mean(daily_aqi) - b1 * mean(avg_wind)
)
b0
## [1] 48.22295
b1
## [1] -2.211798
coef(air_lm)
## (Intercept) avg_wind
## 48.222946 -2.211798
When there is more than one predictor, the number of equations gets a bit unwieldy; therefore, there is a general analytic approach that works for any set of predictor attributes. The general approach uses matrix algebra (anyone take linear algebra?) to obtain the estimates. This general form is:
\[ \mathbf{b} = \left( \mathbf{X}^{\prime}\mathbf{X} \right)^{-1} \left( \mathbf{X}^{\prime} \mathbf{Y} \right) \] Where \(\mathbf{b}\) is a vector of estimated regression coefficients, \(\mathbf{X}\) is a matrix of covariate/predictor attributes (called the design matrix), and \(\mathbf{Y}\) is a vector of the outcome attribute.
Below, I show what these would look like for the air quality example that has been used and solve for the regression coefficients.
X <- model.matrix(air_lm)
head(X)
## (Intercept) avg_wind
## 1 1 2.941667
## 2 1 2.445833
## 3 1 1.995833
## 4 1 3.445833
## 5 1 1.116667
## 6 1 6.091667
Y <- as.matrix(airquality$daily_aqi)
head(Y)
## [,1]
## [1,] 57
## [2,] 54
## [3,] 69
## [4,] 56
## [5,] 54
## [6,] 22
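# solve() below returns the inverse of the cross-product matrix, i.e., (X'X)^-1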
X_X <- solve(t(X) %*% X)
X_X
## (Intercept) avg_wind
## (Intercept) 0.0008152474 -1.424894e-04
## avg_wind -0.0001424894 3.340328e-05
X_Y <- t(X) %*% Y
X_Y
## [,1]
## (Intercept) 186997
## avg_wind 731464
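# b = (X'X)^-1 X'Y; these values match coef(air_lm) below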
X_X %*% X_Y
## [,1]
## (Intercept) 48.222946
## avg_wind -2.211798
coef(air_lm)
## (Intercept) avg_wind
## 48.222946 -2.211798
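As a side note, the same estimates can be obtained in a single step without explicitly forming the inverse; a minimal sketch using crossprod():
solve(crossprod(X), crossprod(X, Y))
Solving the linear system directly this way avoids computing and storing the inverse.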
knitr::include_url("https://uiowa.instructure.com/courses/170115/assignments/1533818/submissions/275326")