DATA606 Learning Outcomes

Chapter 1 - Introduction to Data

Identify the type of variables (e.g. numerical or categorical; discrete or continuous; ordered or not ordered).
Identify the relationship between multiple variables (i.e. independent vs. dependent).
Define variables that are not associated as independent.
Be able to describe and identify the difference between observational and experimental studies.
Distinguish between simple random, stratified, and cluster sampling, and recognize the benefits and drawbacks of choosing one sampling scheme over another.
Identify the four principles of experimental design and recognize their purposes: control any possible con- founders, randomize into treatment and control groups, replicate by using a sufficiently large sample or repeating the experiment, and block any variables that might influence the response.

Chapter 2 - Summarizing Data

Use appropriate visualizations for different types of data (e.g. histogram, barplot, scatterplot, boxplot, etc.).
Use different measures of center and spread and be able to describe the robustness of different statistics.
Describe the shape of distributions vis-a-vis histograms and boxplots.
Create and interpret contingency and frequency tables (one- and two-way tables).

Chapter 3 - Probability

Define trial, outcome, and sample space.
Define and describe the law of large numbers.
Distinguish disjoint (also called mutually exclusive) and independent events.
Use Venn diagrams to represent events and their probabilities.
Describe probability distributions.
Distinguish between marginal and conditional probabilities.
Use tree diagrams and/or Bayes Theorem to calculate conditional probabilities and probabilities of intersection of non-independent events.
The expected value of a discrete random variable is computed by adding each outcome weighted by its probability.
\[ E(X)=\mu=\sum_{i=1}^{k}{{x}_{i}P\left(X={x}_{i}\right)} \]
The variance of a discrete random variable is computed by adding each squared deviation of an outcome from the expected value weighted by its probability. The standard deviation is the square root of the variance.
\[ Var(X)={\sigma}^{2}=\sum_{i=1}^{k}{{\left({x}_{i}-\mu\right)}^{2}P\left(X={x}_{i}\right) } \]
The average of a linear combination of discrete random variables is computed as the sum of their averages, weighted by the constant multipliers.
The variance of a linear combination of independent discrete random variables is computed as the sum of their variances, weighted by the square of the constant multipliers.
The distribution of a continuous random variable is described by the probability density function.
The total area under the density curve is 1.
Probabilities under the density curve can be calculated as the area under the curve.
The probability of a continuous random variable being exactly equal to a value is 0, since there is no area under the curve at a given location.

Chapter 4 - Distributions of Random Variables

Define the standardized (Z) score of a data point as the number of standard deviations it is away from the mean: \(Z = \frac{x - \mu}{\sigma}\).
Use the Z score
- if the distribution is normal: to determine the percentile score of a data point (using technology or normal probability tables)
- regardless of the shape of the distribution: to assess whether or not the particular observation is considered to be unusual (more than 2 standard deviations away from the mean)
Depending on the shape of the distribution determine whether the median would have a negative, positive, or 0 Z score.
Assess whether or not a distribution is nearly normal using the 68-95-99.7% rule or graphical methods such as a normal probability plot.
- Reading: Section 4.1 of OpenIntro Statistics
- Test yourself: True/False: In a right skewed distribution the Z score of the median is positive.
If X is a random variable that takes the value 1 with probability of success \(p\) and 0 with probability of success \(1-p\), then \(X\) is a Bernoulli random variable.
The geometric distribution is used to describe how many trials it takes to observe a success.
Define the probability of finding the first success in the \(n^{th}\) trial as \((1-p)^{n-1}p\).
- \(\mu = \frac{1}{p}\)
- \(\sigma^2 = \frac{1-p}{p^2}\)
- \(\sigma = \sqrt{\frac{1-p}{p^2}}\)
Determine if a random variable is binomial using the four conditions:
- The trials are independent.
- The number of trials, n, is fixed.
- Each trial outcome can be classified as a success or failure.
- The probability of a success, p, is the same for each trial.
Calculate the number of possible scenarios for obtaining \(k\) successes in \(n\) trials using the choose function: \({n \choose k} = \frac{n!}{k!~(n - k)!}\).
Calculate probability of a given number of successes in a given number of trials using the binomial distribution: \(P(k = K) = \frac{n!}{k!~(n - k)!}~p^k~(1-p)^{(n - k)}\).
Calculate the expected number of successes in a given number of binomial trials \((\mu = np)\) and its standard deviation \((\sigma = \sqrt{np(1-p)})\).
When number of trials is sufficiently large (\(np \ge 10\) and \(n(1-p) \ge 10\)), use normal approximation to calculate binomial probabilities, and explain why this approach works.

Chapter 5 - Foundations for Inference

Define sample statistic as a point estimate for a population parameter, for example, the sample proportion is used to estimate the population proportion, and note that point estimate and sample statistic are synonymous.
Recognize that point estimates (such as the sample proportion) will vary from one sample to another, and define this variability as sampling variation.
Calculate the sampling variability of the proportion, the standard error, as \(SE = \sqrt{\frac{p(1-p)}{n}}\), where \(p\) is the population proportion.
Note that when the population proportion \(p\) is not known (almost always), this can be estimated using the sample proportion, \(SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\).
Standard error measures the variability in point estimates from different samples of the same size and from the same population, i.e. measures the sampling variability.
Recognize that when the sample size increases we would expect the sampling variability to decrease.
- Conceptually: Imagine taking many samples from the population. When sample sizes are large the sample proportion will be much more consistent across samples than when the sample sizes are small.
- Mathematically: \(SE = ???\), when \(n\) increases, \(SE\) will decrease since \(n\) is in the denominator.
Notice that sampling distributions of point estimates coming from samples that don’t meet the required conditions for the CLT (about sample size and independence) will not be normal.
Define a confidence interval as the plausible range of values for a population parameter.
Define the confidence level as the percentage of random samples which yield confidence intervals that capture the true population parameter.
Calculate an approximate 95% confidence interval by adding and subtracting 2 standard errors to the point estimate: \(point~estimate \pm 2 \times SE\).
Recognize that the Central Limit Theorem (CLT) is about the distribution of point estimates, and that given certain conditions, this distribution will be nearly normal.
In the case of the proportion the CLT tells us that if

the observations in the sample are independent, and
there are at least 10 successes and 10 failures, \end{itemize} then the distribution of the sample proportion will be nearly normal, centered at the true population proportion and with a standard error of \(\sqrt{\frac{p(1-p)}{n}}\). [ N ( mean = p, SE = ) ]

When the population proportion is unknown, condition (2) can be checked using the sample proportion.
Recall that independence of observations in a sample is provided by random sampling (in the case of observational studies) or random assignment (in the case of experiments).
In addition, the sample should not be large compared to the population, or more precisely, should be smaller than 10% of the population, since samples that are too large will likely contain observations that are not independent. \end{itemize}
Recognize that the nearly normal distribution of the point estimate (as suggested by the CLT) implies that a more precise confidence interval can be calculated as [ point~estimate z^{} SE, ] where \(z^{\star}\) corresponds to the cutoff points in the standard normal distribution to capture the middle XX% of the data, where XX% is the desired confidence level.
- For proportions this is \(\bar{x} \pm Z^\star \sqrt{\frac{p(1-p)}{n}}\).
- Note that \(z^{\star}\) is always positive.
Define margin of error as the distance required to travel in either direction away from the point estimate when constructing a confidence interval, i.e. \(z^{\star} \times SE\).
- Notice that this corresponds to half the width of the confidence interval.
Interpret a confidence interval as “We are XX% confident that the true population parameter is in this interval”, where XX% is the desired confidence level.
- Note that your interpretation must always be in context of the data – mention what the population is and what the parameter is (mean or proportion).
Explain how the hypothesis testing framework resembles a court trial.
Recognize that in hypothesis testing we evaluate two competing claims:
- the null hypothesis, which represents a skeptical perspective or the status quo, and
- the alternative hypothesis, which represents an alternative under consideration and is often represented by a range of possible parameter values.
Construction of hypotheses:
- Always construct hypotheses about population parameters (e.g. population proportion, \(p\)) and not the sample statistics (e.g. sample proportion, \(\hat{p}\)). Note that the population parameter is unknown while the sample statistic is measured using the observed data and hence there is no point in hypothesizing about it.
- Define the null value as the value the parameter is set to equal in the null hypothesis.
- Note that the alternative hypothesis might be one-sided (\(\mu\) \(<\) or \(>\) the null value) or two-sided (\(\mu\) \(\ne\) the null value), and the choice depends on the research question.
Define a p-value as the conditional probability of obtaining a sample statistic at least as extreme as the one observed given that the null hypothesis is true. \(\text{p-value} = \text{P(observed or more extreme sample statistic}~|~H_0 \text{ true)}\)
Calculate a p-value as the area under the normal curve beyond the observed sample proportion (either in one tail or both, depending on the alternative hypothesis). Note that in doing so you can use a Z score, where \(Z = \frac{sample~statistic - null~value}{SE} = \frac{\bar{x} - \mu_0}{SE}\)
- Always sketch the normal curve when calculating the p-value, and shade the appropriate area(s) depending on whether the alternative hypothesis is one- or two-sided.
Infer that if a confidence interval does not contain the null value the null hypothesis should be rejected in favor of the alternative.
Compare the p-value to the significance level to make a decision between the hypotheses:
- If the p-value \(<\) the significance level, reject the null hypothesis since this means that obtaining a sample statistics at least as extreme as the observed data is extremely unlikely to happen just by chance, and conclude that the data provides evidence for the alternative hypothesis.
- If the p-value \(>\) the significance level, fail to reject the null hypothesis since this means that obtaining a sample statistics at least as extreme as the observed data is quite likely to happen by chance, and conclude that the data does not provide evidence for the alternative hypothesis.
- Note that we can never “accept” the null hypothesis since the hypothesis testing framework does not allow us to confirm it.
Note that the conclusion of a hypothesis test might be erroneous regardless of the decision we make.
- Define a Type 1 error as rejecting the null hypothesis when the null hypothesis is actually true.
- Define a Type 2 error as failing to reject the null hypothesis when the alternative hypothesis is actually true.
Choose a significance level depending on the risks associated with Type 1 and Type 2 errors.
- Use a smaller \(\alpha\) is Type 1 error is relatively riskier.
- Use a larger \(\alpha\) is Type 2 error is relatively riskier.
Formulate the framework for statistical inference using hypothesis testing and nearly normal point estimates:

Set up the hypotheses first in plain language and then using appropriate notation.
Identify the appropriate sample statistic that can be used as a point estimate for the parameter of interest.
Verify that the conditions for the CLT holds.
Compute the SE, sketch the sampling distribution, and shade area(s) representing the p-value.
Using the sketch and the normal model, calculate the p-value and determine if the null hypothesis should be rejected or not, and state your conclusion in context of the data and the research question.

If the conditions necessary for the CLT to hold are not met, note this and do not go forward with the analysis. (We will later learn about methods to use in these situations.)
Distinguish statistical significance vs. practical significance.

Chapter 6 - Inference for Categorical Data

Define population proportion \(p\) (parameter) and sample proportion \(\hat{p}\) (point estimate).
Calculate the sampling variability of the proportion, the standard error, as [ SE = , ] where \(p\) is the population proportion.
Note that when the population proportion \(p\) is not known (almost always), this can be estimated using the sample proportion, \(SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\).
Recognize that the Central Limit Theorem (CLT) is about the distribution of point estimates, and that given certain conditions, this distribution will be nearly normal.
In the case of the proportion the CLT tells us that if the observations in the sample are independent, the sample size is sufficiently large (checked using the success/failure condition: \(np \ge 10\) and \(n(1-p) \ge 10\)), then the distribution of the sample proportion will be nearly normal, centered at the true population proportion and with a standard error of \(\sqrt{\frac{p(1-p)}{n}}\). [ N ( mean = p, SE = ) ]
Note that if the CLT doesn’t apply and the sample proportion is low (close to 0) the sampling distribution will likely be right skewed, if the sample proportion is high (close to 1) the sampling distribution will likely be left skewed.
Remember that confidence intervals are calculated as [ ] and test statistics are calculated as [ ]
Note that the standard error calculation for the confidence interval and the hypothesis test are different when dealing with proportions, since in the hypothesis test we need to assume that the null hypothesis is true – remember: p-value = P(observed or more extreme test statistic \(|\) \(H_0\) true).
For confidence intervals use \(\hat{p}\) (observed sample proportion) when calculating the standard error and when checking the success/failure condition: \(SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
For hypothesis tests use \(p_0\) (null value) when calculating the standard error and checking the success/failure condition: \(SE_{\hat{p}} = \sqrt{\frac{p_0 (1-p_0)}{n}}\)
Such a discrepancy doesn’t exist when conducting inference for means, since the mean doesn’t factor into the calculation of the standard error, while the proportion does.
Calculate the required minimum sample size for a given margin of error at a given confidence level, and explain why we use \(\hat{p} = 0.5\) if there are no previous studies suggesting a more accurate estimate.
Conceptually: When there is no additional information, 50% chance of success is a good guess for events with only two outcomes (success or failure).
Mathematically: Using \(\hat{p} = 0.5\) yields the most conservative (highest) estimate for the required sample size.
Note that the calculation of the standard error of the distribution of the difference in two independent sample proportions is different for a confidence interval and a hypothesis test. confidence interval (and hypothesis test when \(H_0: p_1 -p_2 =\) some value other than 0): \(SE_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{ \hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{ \hat{p}_2 (1 - \hat{p}_2)}{n_2} }\) hypothesis test when \(H_0: p_1 -p_2 = 0\): \(SE_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{ \hat{p}_{pool} (1 - \hat{p}_{pool})}{n_1} + \frac{ \hat{p}_{pool} (1 - \hat{p}_{pool})}{n_2} }\), where \(\hat{p}_{pool}\) is the overall rate of success: \(\hat{p}_{pool} = \frac{\text{number of successes in group 1 + number of successes in group 2}}{n_1 + n_2}\)
Note that the reason for the difference in calculations of standard error is the same as in the case of the single proportion: when the null hypothesis claims that the two population proportions are equal, we need to take that into consideration when calculating the standard error for the hypothesis test, and use a common proportion for both samples.
Use a chi-square test of goodness of fit to evaluate if the distribution of levels of a single categorical variable follows a hypothesized distribution. \(H_0:\) The distribution of the variable follows the hypothesized distribution, and any observed differences are due to chance. \(H_A:\) The distribution of the variable does not follow the hypothesized distribution.
Calculate the expected counts for a given level (cell) in a one-way table as the sample size times the hypothesized proportion for that level.
Calculate the chi-square test statistic as \(\chi = \sum_{i = 1}^{k} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}\), where \(k\) is the number of cells.
Note that the chi-square distribution is right skewed with one parameter: degrees of freedom. In the case of a goodness of fit test, \(df = # \text{of categories} - 1\).
List the conditions necessary for performing a chi-square test (goodness of fit or independence)
- the observations should be independent
- expected counts for each cell should be at least 5
- degrees of freedom should be at least 2 (if not, use methods for evaluating proportions)
Describe how to use the chi-square table to obtain a p-value. When evaluating the independence of two categorical variables where at least one has more than two levels, use a chi-square test of independence.
\(H_0:\) The two variables are independent.
\(H_A:\) The two variables are dependent.
Calculate expected counts in two-way tables as [ E = ]
Calculate the degrees of freedom for chi-square test of independence as \(df = (R - 1) \times (C - 1)\), where \(R\) is the number of rows in a two-way table, and \(C\) is the number of columns.
Note that there is no such thing as a chi-square confidence interval for proportions, since in the case of a categorical variables with many levels, there isn’t one parameter to estimate.
Use simulation methods when sample size conditions aren’t met for inference for categorical variables.
Note that the \(t\)-distribution is only appropriate to use for means, when sample size isn’t sufficiently large, and the parameter of interest is a proportion or a difference between two proportions, we need to use simulation.
In hypothesis testing for one categorical variable, generate simulated samples based on the null hypothesis, and then calculate the number of samples that are at least as extreme as the observed data. for two categorical variables, use a randomization test.
Use bootstrap methods for confidence intervals for categorical variables with at most two levels.

Chapter 7 - Inference for Numerical Data

Use the \(t\)-distribution for inference on a single mean, difference of paired (dependent) means, and difference of independent means.
Explain why the \(t\)-distribution helps make up for the additional variability introduced by using \(s\) (sample standard deviation) in calculation of the standard error, in place of \(\sigma\) (population standard deviation).
Describe how the \(t\)-distribution is different from the normal distribution, and what ?heavy tail? means in this context.
Note that the \(t\)-distribution has a single parameter, degrees of freedom, and as the degrees of freedom increases this distribution approaches the normal distribution.
Use a \(t\)-statistic, with degrees of freedom \(df = n - 1\) for inference for a population mean:
Standard error: \(SE = \frac{s}{\sqrt{n}}\)
Confidence interval: \(\bar{x} \pm t_{df}^\star SE\)
Hypothesis test: \(T_{df} = \frac{\bar{x} - \mu}{SE}\) \end{itemize}
Describe how to obtain a p-value for a \(t\)-test and a critical \(t\)-score (\(t^\star_{df}\)) for a confidence interval.
Define observations as paired if each observation in one dataset has a special correspondence or connection with exactly one observation in the other data set.
Carry out inference for paired data by first subtracting the paired observations from each other, and then treating the set of differences as a new numerical variable on which to do inference (such as a confidence interval or hypothesis test for the average difference).
Calculate the standard error of the difference between means of two paired (dependent) samples as \(SE = \frac{s_{diff}}{\sqrt{n_{diff}}}\) and use this standard error in hypothesis testing and confidence intervals comparing means of paired (dependent) groups.
Use a \(t\)-statistic, with degrees of freedom \(df = n_{diff} - 1\) for inference for a population mean: \begin{itemize}
Standard error: \(SE = \frac{s}{\sqrt{n}}\)
Confidence interval: \(\bar{x}_{diff} \pm t_{df}^\star SE\)
Hypothesis test: \(T_{df} = \frac{\bar{x}_{diff} - \mu_{diff}}{SE}\). Note that \(\mu_{diff}\) is often 0, since often \(H_0: \mu_{diff} = 0\).
Recognize that a good interpretation of a confidence interval for the difference between two parameters includes a comparative statement (mentioning which group has the larger parameter).
Recognize that a confidence interval for the difference between two parameters that doesn?t include 0 is in agreement with a hypothesis test where the null hypothesis that sets the two parameters equal to each other is rejected.
Calculate the standard error of the difference between means of two independent samples as \(SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\), and use this standard error in hypothesis testing and confidence intervals comparing means of independent groups.
Use a \(t\)-statistic, with degrees of freedom \(df = min(n_1 - 1, n_2 - 1)\) for inference for a population mean:
Standard error: \(\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)
Confidence interval: \((\bar{x}_1 - \bar{x}_2) \pm t_{df}^\star SE\)
Hypothesis test: \(T_{df} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE}\). Note that \(\mu_{diff}\) is often 0, since often \(H_0: \mu_1 - \mu_2 = 0\).
Calculate the power of a test for a given effect size and significance level in two steps: (1) Find the cutoff for the sample statistic that will allow the null hypothesis to be rejected at the given significance level, (2) Calculate the probability of obtaining that sample statistic given the effect size.
Explain how power changes for changes in effect size, sample size, significance level, and standard error.
Define analysis of variance (ANOVA) as a statistical inference method that is used to determine if the variability in the sample means is so large that it seems unlikely to be from chance alone by simultaneously considering many groups at once.
Recognize that the null hypothesis in ANOVA sets all means equal to each other, and the alternative hypothesis suggest that at least one mean is different. \begin{itemize}
\(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\)
\(H_A:\) At least one mean is different
List the conditions necessary for performing ANOVA the observations should be independent within and across groups the data within each group are nearly normal the variability across the groups is about equal and check if they are met using graphical diagnostics.
Recognize that the test statistic for ANOVA, the F statistic, is calculated as the ratio of the mean square between groups (MSG, variability between groups) and mean square error (MSE, variability within errors), and has two degrees of freedom, one for the numerator (\(df_{G} = k - 1\), where \(k\) is the number of groups) and one for the denominator (\(df_{E} = n - k\), where \(n\) is the total sample size).
Note that you won’t be expected to calculate MSG or MSE from the raw data, but you should have a conceptual understanding of how they’re calculated and what they measure.
Describe why calculation of the p-value for ANOVA is always “one sided”.
Describe why conducting many \(t\)-tests for differences between each pair of means leads to an increased Type 1 Error rate, and we use a corrected significance level (Bonferroni corection, \(\alpha^\star = \alpha / K\), where \(K\) is the e number of comparisons being considered) to combat inflating this error rate.
Describe why it is possible to reject the null hypothesis in ANOVA but not find significant differences between groups as a result of pairwise comparisons.

Chapter 8 - Linear Regression

Define the explanatory variable as the independent variable (predictor), and the response variable as the dependent variable (predicted).
Plot the explanatory variable (\(x\)) on the x-axis and the response variable (\(y\)) on the y-axis, and fit a linear regression model \(y = \beta_0 + \beta_1 x\) where \(\beta_0\) is the intercept, and \(\beta_1\) is the slope.
Note that the point estimates (estimated from observed data) for \(\beta_0\) and \(\beta_1\) are \(b_0\) and \(b_1\), respectively.
When describing the association between two numerical variables, evaluate
- direction: positive (\(x \uparrow, y \uparrow\)), negative (\(x \downarrow, y \uparrow\))
- form: linear or not
- strength: determined by the scatter around the underlying relationship
Define correlation as the association between two numerical variables.
Note that a relationship that is nonlinear is simply called an association.
Note that correlation coefficient (\(r\), also called Pearson’s \(r\)) the following properties:
- the magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables
- the sign of the correlation coefficient indicates the direction of association
- the correlation coefficient is always between -1 and 1, inclusive, with -1 indicating perfect negative linear association, +1 indicating perfect positive linear association, and 0 indicating no relationship
- the correlation coefficient is unitless, since the correlation coefficient is unitless, it is not affected by changes in the center or scale of either variable (such as unit conversions)
- the correlation of X with Y is the same as of Y with X
- the correlation coefficient is sensitive to outliers
Recall that correlation does not imply causation.
Define residual (\(e\)) as the difference between the observed (\(y\)) and predicted (\(\hat{y}\)) values of the response variable. \(e_i = y_i - \hat{y}_i\)
Define the least squares line as the line that minimizes the sum of the squared residuals, and list conditions necessary for fitting such line:
- linearity
- nearly normal residuals
- constant variability
Define an indicator variable as a binary explanatory variable (with two levels).
Calculate the estimate for the slope (\(b_1\)) as \(b_1 = R\frac{s_y}{s_x}\), where \(r\) is the correlation coefficient, \(s_y\) is the standard deviation of the response variable, and \(s_x\) is the standard deviation of the explanatory variable.
Interpret the slope as
- “For each unit increase in \(x\), we would expect \(y\) to increase/decrease on average by \(|b_1|\) units” when \(x\) is numerical.
- “The average increase/decrease in the response variable when between the baseline level and the other level of the explanatory variable is \(|b_1|\).” when \(x\) is categorical.
Note that whether the response variable increases or decreases is determined by the sign of \(b_1\).
Note that the least squares line always passes through the average of the response and explanatory variables (\(\bar{x},\bar{y}\)).
Use the above property to calculate the estimate for the slope (\(b_0\)) as \(b_0 = \bar{y} - b_1 \bar{x}\), where \(b_1\) is the slope, \(\bar{y}\) is the average of the response variable, and \(\bar{x}\) is the average of explanatory variable.
Interpret the intercept as
- “When \(x = 0\), we would expect \(y\) to equal, on average, \(b_0\).” when \(x\) is numerical.
- “The expected average value of the response variable for the reference level of the explanatory variable is \(b_0\).” when \(x\) is categorical.
Predict the value of the response variable for a given value of the explanatory variable, \(x^\star\), by plugging in \(x^\star\) in the in the linear model: \(\hat{y} = b_0 + b_1 x^\star\)
Only predict for values of \(x^\star\) that are in the range of the observed data.
Do not extrapolate beyond the range of the data, unless you are confident that the linear pattern continues.
Define \(R^2\) as the percentage of the variability in the response variable explained by the the explanatory variable.
For a good model, we would like this number to be as close to 100% as possible.
This value is calculated as the square of the correlation coefficient, and is between 0 and 1, inclusive.
Define a leverage point as a point that lies away from the center of the data in the horizontal direction.
Define an influential point as a point that influences (changes) the slope of the regression line.
This is usually a leverage point that is away from the trajectory of the rest of the data.
Do not remove outliers from an analysis without good reason.
Be cautious about using a categorical explanatory variable when one of the levels has very few observations, as these may act as influential points.
Determine whether an explanatory variable is a significant predictor for the response variable using the \(t\)-test and the associated p-value in the regression output.
Set the null hypothesis testing for the significance of the predictor as \(H_0: \beta_1 = 0\), and recognize that the standard software output yields the p-value for the two-sided alternative hypothesis.
Note that \(\beta_1 = 0\) means the regression line is horizontal, hence suggesting that there is no relationship between the explanatory and the response variables.
Calculate the T score for the hypothesis test as \(T_{df}=\frac { b_{ 1 }-{ null\quad value } }{ SE_{ b_{ 1 } } }\) with \(df = n - 2\).
Note that the T score has \(n - 2\) degrees of freedom since we lose one degree of freedom for each parameter we estimate, and in this case we estimate the intercept and the slope.
Note that a hypothesis test for the intercept is often irrelevant since it’s usually out of the range of the data, and hence it is usually an extrapolation.
Calculate a confidence interval for the slope as \(b_1 \pm t^\star_{df} SE_{b_1}\) where \(df = n - 2\) and \(t^\star_{df}\) is the critical score associated with the given confidence level at the desired degrees of freedom.
Note that the standard error of the slope estimate \(SE_{b_1}\) can be found on the regression output.

Chapter 9 - Multiple and Logistic Regression

Define the multiple linear regression model as \[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k\] where there are \(k\) predictors (explanatory variables).
Interpret the estimate for the intercept (\(b_0\)) as the expected value of \(y\) when all predictors are equal to 0, on average.
Interpret the estimate for a slope (say \(b_1\)) as “All else held constant, for each unit increase in \(x_1\), we would expect \(y\) to increase/decrease on average by \(b_1\).”
Define collinearity as a high correlation between two independent variables such that the two variables contribute redundant information to the model – which is something we want to avoid in multiple linear regression.
Note that \(R^2\) will increase with each explanatory variable added to the model, regardless of whether or not the added variables is a meaningful predictor of the response variable. Therefore we use adjusted \(R^2\), which applies a penalty for the number of predictors included in the model, to better assess the strength of a multiple linear regression model: \[R^2 = 1 - \frac{Var(e_i) / (n - k - 1)}{Var(y_i) / (n - 1)}\] where \(Var(e_i)\) measures the variability of residuals (\(SS_{Err}\)), \(Var(y_i)\) measures the total variability in observed \(y\) (\(SS_{Tot}\)), \(n\) is the number of cases and \(k\) is the number of predictors.
Note that adjusted \(R^2\) will only increase if the added variable has a meaningful contribution to the amount of explained variability in \(y\), i.e. if the gains from adding the variable exceeds the penalty.
Define model selection as identifying the best model for predicting a given response variable.
Note that we usually prefer simpler (parsimonious) models over more complicated ones.
Define the full model as the model with all explanatory variables included as predictors.
Note that the p-values associated with each predictor are conditional on other variables being included in the model, so they can be used to assess if a given predictor is significant, given that all others are in the model.
These p-values are calculated based on a \(t\) distribution with \(n - k - 1\) degrees of freedom.
The same degrees of freedom can be used to construct a confidence interval for the slope parameter of each predictor: \[b_i \pm t^\star_{n - k - 1} SE_{b_i}\]
Stepwise model selection (backward or forward) can be done based based on adjusted \(R^2\) (choose the model with higher adjusted \(R^2\)).
The general idea behind backward-selection is to start with the full model and eliminate one variable at a time until the ideal model is reached. i. Start with the full model. ii. Refit all possible models omitting one variable at a time, and choose the model with the highest adjusted \(R^2\). iii. Repeat until maximum possible adjusted \(R^2\) is reached.
The general idea behind forward-selection is to start with only one variable and adding one variable at a time until the ideal model is reached. i. Try all possible simple linear regression models predicting \(y\) using one explanatory variable at a time. Choose the model with the highest adjusted \(R^2\). ii. Try all possible models adding one more explanatory variable at a time, and choose the model with the highest adjusted \(R^2\). iii. Repeat until maximum possible adjusted \(R^2\) is reached.
Adjusted \(R^2\) method is more computationally intensive, but it is more reliable, since it doesn’t depend on an arbitrary significant level.
List the conditions for multiple linear regression as linear relationship between each (numerical) explanatory variable and the response - checked using scatterplots of \(y\) vs. each \(x\), and residuals plots of \(residuals\) vs. each \(x\) nearly normal residuals with mean 0 - checked using a normal probability plot and histogram of residuals constant variability of residuals - checked using residuals plots of \(residuals\) vs. \(\hat{y}\), and \(residuals\) vs. each \(x\) independence of residuals (and hence observations) - checked using a scatterplot of \(residuals\) vs. order of data collection (will reveal non-independence if data have time series structure) Note that no model is perfect, but even imperfect models can be useful.

Learning Objectives

Mael Illien

11/25/2019

DATA606 Learning Outcomes

Chapter 1 - Introduction to Data

Chapter 2 - Summarizing Data

Chapter 3 - Probability

Chapter 4 - Distributions of Random Variables

Chapter 5 - Foundations for Inference

Chapter 6 - Inference for Categorical Data

Chapter 7 - Inference for Numerical Data

Chapter 8 - Linear Regression

Chapter 9 - Multiple and Logistic Regression