Introduction

Since you’ve already demonstrated a solid aptitude for the material in this course, I’m going to focus only on the concepts that are just tricky enough to cost you a point or two. I’m really gunning for you to ace the shit out of this test, because when I look at the topics that follow in the syllabus, it’s obvious that the material will become dramatically more difficult, and an A on this test will ease the rest of the semester.

Identify key parameters of the problem

  • Pin down the dependent variable, the independent variables, and the time frame for the data collection.
  • Read the research question for each problem carefully. Everything you need for this part should be clearly articulated there.
  • Checkboxes: for both the response variable and the independent variables. First, what is the dependent (response) variable, Y? Honestly, this is the only really interesting variable in the problem at this point. You should only be facing continuous response variables for now. This will change in the next few weeks.

Variable Name               Description
Exam Score                  response variable, numeric
Age                         independent, numeric
Gender                      independent, categorical
Zip code                    independent, categorical
Rank                        independent, ordinal
Survey (low/medium/high)    independent, ordinal

Data Cleaning.

I doubt you would be expected to do this in a shortened exam period. But if you are presented with a dataset for data cleaning, some examples of what to look for might be:

  • Empty cells in an Excel column
  • A column filled with numbers, except for a few instances where there is a label for some inexplicable reason
  • Date fields that aren’t in the American MM/DD/YYYY format
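
If it helps to see those three checks spelled out, here is a rough Python sketch. Everything in it is made up for illustration: the rows, the column names (age, signup_date), and the helper functions are assumptions, not anything from your course materials.

```python
from datetime import datetime

# Hypothetical rows, as they might look after exporting a spreadsheet to text.
raw_rows = [
    {"age": "34", "signup_date": "03/15/2021"},
    {"age": "", "signup_date": "2021-04-02"},          # empty cell, non-US date
    {"age": "unknown", "signup_date": "12/01/2020"},   # label in a numeric column
    {"age": "27", "signup_date": "15/03/2021"},        # day-first date
]

def is_number(text):
    """Return True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def is_us_date(text):
    """Return True if the text is a valid MM/DD/YYYY date."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False

def find_problems(rows):
    """Collect (row index, column, issue) for every suspicious cell."""
    found = []
    for i, row in enumerate(rows):
        if row["age"].strip() == "":
            found.append((i, "age", "empty cell"))
        elif not is_number(row["age"]):
            found.append((i, "age", "label in numeric column"))
        if not is_us_date(row["signup_date"]):
            found.append((i, "signup_date", "not MM/DD/YYYY"))
    return found

problems = find_problems(raw_rows)
for problem in problems:
    print(problem)
```

The sketch only flags the problems. What to do with them (drop the row, fill in a value, ask where the label came from) is a judgment call that depends on the dataset.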

Perform univariate analysis to examine and describe the data.

  • histogram
  • mean
  • standard deviation
  • median
  • possibly quartiles and deciles
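
Here is a small sketch of that whole list using only Python's built-in statistics module. The exam scores are invented, and the histogram is just a crude text version with bins 10 points wide, so treat it as an illustration rather than the way your class software will present it.

```python
import statistics

# Hypothetical exam scores, invented for illustration.
exam_scores = [55, 62, 68, 70, 71, 74, 75, 78, 80, 83, 85, 88, 91, 95]

mean = statistics.mean(exam_scores)
median = statistics.median(exam_scores)
stdev = statistics.stdev(exam_scores)               # sample standard deviation
quartiles = statistics.quantiles(exam_scores, n=4)  # 3 cut points
deciles = statistics.quantiles(exam_scores, n=10)   # 9 cut points

print(f"mean={mean:.2f} median={median} sd={stdev:.2f}")
print("quartiles:", quartiles)
print("deciles:", deciles)

# Crude text histogram: count scores in bins 10 points wide.
bins = {}
for score in exam_scores:
    low = (score // 10) * 10
    bins[low] = bins.get(low, 0) + 1

for low in sorted(bins):
    print(f"{low}-{low + 9}: {'#' * bins[low]}")
```

Note that the middle quartile is the median, so those two numbers should always agree.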

Time Series Analysis

There are two features of time series data that make it different from ordinary analysis.

  • Seasonality: why do sales behave differently around Christmas time? Halloween? Weekends? Summertime? (I’m not sure what he wants you to know about seasonality, so I’m bypassing that.)
  • Autocorrelation, which I don’t think he covered.

I don’t think autocorrelation is necessary exam material, but just in case, here is an illustrative example. Let’s say we are looking at IBM’s stock price over time. Its price today may very well depend on its behavior yesterday and/or on days prior. It is therefore considered to be auto-sexual… err… autocorrelated.

Again, I want to emphasize that this will not come up in any of the other subject matter I see in the syllabus, so if he didn’t mention it, don’t spend any time on it.
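
If you are curious anyway, the idea of "today depends on yesterday" can be put into a number: the correlation between a series and a copy of itself shifted back by one step. This is a minimal sketch of one common way to compute it. The prices are invented (not real IBM data), and the function name is my own.

```python
def autocorrelation(series, lag=1):
    """Correlation between a series and itself shifted back by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum(
        (series[t] - mean) * (series[t - lag] - mean)
        for t in range(lag, n)
    )
    return numer / denom

# Hypothetical prices that drift upward: each day looks a lot like the day before.
prices = [100, 101, 103, 104, 106, 107, 109, 110, 112, 113]

# A series that flips sign every step: each value is the opposite of the last.
flips = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

trend_r1 = autocorrelation(prices)
flip_r1 = autocorrelation(flips)
print(f"trending series: {trend_r1:.2f}")
print(f"flipping series: {flip_r1:.2f}")
```

The trending series comes out clearly positive and the flipping series clearly negative. A value near zero would mean yesterday tells you little about today.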

Perform cross-tabulations, graphs/charts to analyze simple bivariate relationships.

Understand the value of these tools, but also their limits when it comes to drawing conclusions. They’re just here to help.

  • Correlation plots. These can be X vs. Y, X1 vs. X2, X2 vs. Y, etc.
  • Correlation does NOT imply causation. So: Y may cause X, X may cause Y, or X and Y may have no causal connection whatsoever. Correlation plots can’t answer these questions.
  • Correlation ranges from -1 to 1, which just makes it easier to interpret.
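
To make the -1 to 1 range concrete, here is a small sketch of the usual (Pearson) correlation coefficient, which I am assuming is the one your class uses. The hours-studied and score numbers are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores_up = [52, 60, 71, 78, 90]       # scores rise with hours
scores_down = [90, 78, 71, 60, 52]     # scores fall with hours
scores_line = [2 * h + 1 for h in hours]  # points exactly on a rising line

r_up = pearson_r(hours, scores_up)
r_down = pearson_r(hours, scores_down)
r_line = pearson_r(hours, scores_line)

print(f"rising:  {r_up:.3f}")
print(f"falling: {r_down:.3f}")
print(f"exact line: {r_line:.3f}")
```

The sign tells you the direction and the distance from zero tells you how tightly the points hug a line. None of it tells you which variable, if either, is causing the other.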

Because the next step is Multiple Regression, I assume that he will present Simple Regression here.

  • Focus hard on understanding this. Things are about to become more challenging.
  • Recall from algebra, where you looked at slope = rise/run = change in y over change in x.
  • The graph will be a line, so don’t you dare be intimidated by this.
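
Here is a minimal sketch of how the line gets fitted, assuming the standard least-squares method (which is what regression software normally uses). The data and the function name fit_line are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares line through the points: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (X) and exam score (Y).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 71, 78, 90]

slope, intercept = fit_line(hours, scores)

# Rise over run between two points on the fitted line gives the same slope.
y_at_2 = intercept + slope * 2
y_at_4 = intercept + slope * 4
rise_over_run = (y_at_4 - y_at_2) / (4 - 2)

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"rise/run={rise_over_run:.2f}")
```

With these made-up numbers the slope comes out to about 9.4, which reads as "each extra hour of study goes with roughly 9.4 more points on the exam".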

Important: You are expected to interpret the output table that is generated. This is lost on most students. Interpretation is the trickiest part, and of course the most critical.

  • The graph is meant to assist in guiding you, but it is the p-value that you are interpreting.
  • For each variable (each row) with a p-value associated with it in the output, the null hypothesis says “there is no relationship between independent variable X and outcome variable Y”.
  • Always keep the null hypothesis in mind when interpreting these. The first p-value (usually the intercept row) only tests whether the intercept, meaning the predicted Y when X is 0, equals 0. This hypothesis is almost always rejected and is of little interest anyway.
  • You want to get into the practice of evaluating the p-values from bottom to top.
  • A p-value small enough to reject the null hypothesis indicates a significant relationship, but it will not tell you whether the relationship is positive or inverse. The graph will help with direction, as will the sign of the t-statistic, which sits right next to the p-value in the table.
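
To see where the t-statistic in that table comes from, here is a rough sketch that computes it by hand for the slope of a simple regression. The three datasets are invented. The cutoff 3.182 is the two-sided 5% critical value for 3 degrees of freedom from a standard t table, so it only applies to these five-point examples. Real software converts t into a p-value for you instead of comparing against a cutoff.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (slope, t) where t = slope / standard error of the slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / std_err

# Two-sided 5% cutoff for n - 2 = 3 degrees of freedom (standard t table).
T_CRITICAL = 3.182

hours = [1, 2, 3, 4, 5]
datasets = {
    "rising": [52, 60, 71, 78, 90],
    "falling": [90, 78, 71, 60, 52],
    "flat": [70, 72, 69, 71, 70],
}

results = {}
for name, scores in datasets.items():
    slope, t = slope_t_statistic(hours, scores)
    significant = abs(t) > T_CRITICAL
    results[name] = (slope, t, significant)
    print(f"{name}: slope={slope:.2f} t={t:.2f} reject null: {significant}")
```

The rising and falling datasets both reject the null hypothesis, and the only thing that distinguishes them is the sign of t (and of the slope). The flat dataset has a slope close to zero and a small t, so the null hypothesis of no relationship is not rejected.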

Some examples of linear regression null hypotheses that you will be assessing

  • Null hypothesis: no relationship between X and Y. (If this is likely true, the graph will show it by being close to a flat horizontal line.)

The following examples are a bit more complicated, and this is a very likely point at which your prof might try to trip you up. I’ll do some examples tomorrow.