🎶 Quiet Assumptions - Elskavon


agenda for today

1. check for understanding & introduce Practice Exercise 1.6 [-11:25]

2. visualization and regression diagnostics [-Noon]


Recap questions

This week we learned six assumptions of linear regression: validity, generalizability, additivity and linearity, homogeneity of variance, independence, and normality. Among these assumptions…

Q1. which reflects the math of regression (aka. model formula)?

Q2. which are conceptually super important but often hard to test/address in analysis?

Q3. which are about errors?

Q4. which is often violated if the sample is drawn from clusters/groups or there are repeated measures?

Q5. RStudent is a measure of …?
a. Leverage b. Homoscedasticity c. Case Influence d. Discrepancy

Q6. Cook’s D is a measure of …?
a. Leverage b. Normality of errors c. Case Influence d. Discrepancy

Bonus Q. VIGHAN is a city in which country?
a. Canada b. Malawi c. India d. Brazil


Answers


visualization and regression diagnostics

data file: kidiq.sas7bdat
syntax file: HB761_Recitation_Week6.sas
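
The plots below all read from a work data set called temp. A minimal sketch of how temp could be created from kidiq.sas7bdat, assuming the file sits in a folder you point a library at. The folder path and the libref rec are hypothetical placeholders, not part of the course files.

```sas
/* Sketch only: the folder path and the libref "rec" are hypothetical */
libname rec "/path/to/recitation/folder";

/* copy the permanent data set into the work data set used by the plots */
data temp;
    set rec.kidiq;
run;
```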


plotting the data

1. descriptive plots for each variable (optional)

/*continuous variable: histogram or boxplot*/
proc sgplot data=temp;
    title "histogram of children's test score";
    histogram kid_score; 
    xaxis label = "child's test score";
run;
proc sgplot data=temp;
    title "boxplot of mom's IQ test score";
    vbox mom_iq; 
run;

/*categorical variable: bar chart*/
proc sgplot data=temp;
    title "bar chart of maternal education level";
    vbar mom_hs; 
    xaxis label = "maternal education: HS vs. non HS";
run;



2. bivariate plots (outcome X predictors): scatter plots, boxplots

scatterplot matrix: exploratory, patterns of relationships, multicollinearity
proc sgscatter data=temp;
    title "scatter plot matrix";
    matrix kid_score mom_iq mom_age mom_income/
        group = mom_hs
        diagonal=(histogram kernel);
run;
note: mom_income is not an observed variable; it was simulated as mom_iq * 500 + N(0, 1000), so it is correlated with mom_iq by construction
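
The simulation formula can be sketched as a data step. This is a sketch under assumptions, not the syntax used to build the recitation data: it reads 1000 in N(0, 1000) as the standard deviation (the notes do not say whether it is the SD or the variance), and the seed and the output data set name temp_sim are hypothetical.

```sas
/* Sketch only: the seed 761 and the data set name temp_sim are hypothetical */
data temp_sim;
    set temp;
    /* set the random-number seed once so the result is reproducible */
    if _n_ = 1 then call streaminit(761);
    /* assumption: 1000 is the standard deviation of the normal noise;
       this overwrites any mom_income already in temp */
    mom_income = mom_iq * 500 + rand("normal", 0, 1000);
run;
```

Because mom_income is mostly a rescaled mom_iq plus noise, the two should line up closely in the scatterplot matrix, which is what makes it a convenient multicollinearity example.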


individual bivariate scatterplot and boxplot

scatter plot: continuous var X continuous var
proc sgplot data=temp;
    title "scatter plot of children's test score vs. mom's iq score";
    scatter y = kid_score x = mom_iq /
        markerattrs = (color = pink symbol = circlefilled); 
    yaxis label = "child's test score";
    xaxis label = "mom's IQ score";
run;


side-by-side boxplot: continuous var X categorical var
proc sgplot data=temp;
    title "boxplot of kid test score by maternal education level";
    vbox kid_score / group = mom_hs; 
    yaxis label = "kid's test score";
run;


panel plots: more than 2 variables
proc sgpanel data=temp;
    title "scatter plot of children's test score vs. mom's iq score, by maternal education level";
    panelby mom_hs;
    scatter y = kid_score x = mom_iq /
        markerattrs = (symbol = circlefilled); 
    rowaxis label = "child's test score";
    colaxis label = "mom's IQ score";
run;


residual plots

1. Q-Q plot and histogram: normality
2. residuals X predicted values: linearity, homoscedasticity
3. residuals X predictors: linearity
4. outliers: RStudent (discrepancy), Leverage, Cook’s D
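proc reg data=temp;
    title "outcome: child's test score; predictors: mom's iq, age, and education";
    /* vif and collin: multicollinearity diagnostics; clb: confidence limits for the coefficients */
    model kid_score = mom_iq mom_age mom_hs / vif clb collin;
run; quit;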
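
The four kinds of residual plots listed above can be requested explicitly from proc reg. A minimal sketch, assuming ODS Graphics is available in your SAS session. The output data set name diag and the variable names on the right-hand side of the output statement are hypothetical.

```sas
ods graphics on;

proc reg data=temp
    plots(only)=(qqplot residualhistogram    /* 1. normality */
                 residualbypredicted         /* 2. linearity, homoscedasticity */
                 residuals                   /* 3. residuals by each predictor */
                 rstudentbyleverage cooksd); /* 4. outliers and influence */
    model kid_score = mom_iq mom_age mom_hs;
    /* save case-level diagnostics; diag and the names after each = are hypothetical */
    output out=diag p=pred r=resid rstudent=rstud h=lev cookd=cooksd;
run; quit;
```

The saved data set can then be sorted or filtered (for example, by rstud or cooksd) to look up individual cases that stand out in the plots.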


Next week: Logistic regression