1 Regression

1.1 Before you begin

  1. Check for assumptions
  • Linearity assumption
  • Correlation between the predictors and the target (in R: use cor()), and collinearity among the predictors
  • The value you’re predicting is already within the range of existing data
  • Normality of residuals: hist(model$residuals); for patterns in the residuals: plot(model$fitted.values, model$residuals)
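
The checks above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the built-in mtcars dataset stands in for your own data, the columns mpg, wt, hp and disp are arbitrary picks, and the 0.8 collinearity cutoff is an illustrative choice rather than a rule from the coursebook.

```r
# Stand-in data: built-in mtcars (an assumption; swap in your own data frame).
data(mtcars)

# Correlation between the target (mpg) and candidate predictors.
cors <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])
print(round(cors, 2))

# Collinearity: list predictor pairs whose absolute correlation exceeds 0.8.
# The 0.8 cutoff is an illustrative assumption.
pred_cors <- cors[-1, -1]  # drop the target's row and column
high <- which(abs(pred_cors) > 0.8 & upper.tri(pred_cors), arr.ind = TRUE)
print(high)

# Fit a simple model so that there are residuals to inspect.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Normality of residuals: the histogram should look roughly bell-shaped.
hist(model$residuals, main = "Histogram of residuals")

# Patterns in residuals: points should scatter randomly around zero.
plot(model$fitted.values, model$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Range check: is the value we want to predict for inside the observed data?
new_wt <- 3.0  # hypothetical new value
in_range <- new_wt >= min(mtcars$wt) && new_wt <= max(mtcars$wt)
print(in_range)
```

If a predictor pair shows up in high, consider keeping only one of the two in your formula.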

1.2 Constructing your model

  1. Choose your formula
  • formula() with your dependent variable and independent variables
  • Linear models work with both numeric and categorical variables
  2. Decide what to do with outliers (if any)
  • Remember the “leverage vs influence” section of your coursebook
  • A single outlier can greatly bias your model
  • Deciding whether to keep or remove an outlier often requires domain expertise
  3. Interpret your model
  • Understand your coefficients: ols$coefficients
    • Intuitively get the “direction” of the relationships between your x and y
    • P-value: check it, because what you observed could be down to random sampling
  • Understand your residuals: your model is never perfect, and is always estimated with error
    • Your goal is to find a line that minimizes the sum of squared distances between each point and its predicted value
    • You improve your model by looking at “what you did not do well”
    • Plot your residuals: create residual plots, look at the histogram or other distribution of residuals, etc.
  4. Use your model
  • Generic functions like summary(), predict()
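
The four steps above might look like this in practice. This is a minimal sketch, not the coursebook's method: mtcars is again a stand-in dataset, the model name ols follows the notes, the 4/n cutoff for Cook's distance is a common rule of thumb used here as an assumption, and new_car is a hypothetical observation.

```r
# Stand-in data: built-in mtcars (an assumption; swap in your own data frame).
data(mtcars)

# 1. Formula with numeric (wt, hp) and categorical (am) predictors.
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))
f <- formula(mpg ~ wt + hp + am)
ols <- lm(f, data = mtcars)

# 2. Outliers: Cook's distance measures how much each point influences the fit.
# The 4/n cutoff is a rule of thumb, used here as an assumption.
cooks <- cooks.distance(ols)
influential <- names(cooks)[cooks > 4 / nrow(mtcars)]
print(influential)

# 3. Interpret: the sign of each coefficient gives the direction of the
# relationship; the Pr(>|t|) column of the summary table holds the p-values.
print(ols$coefficients)
print(summary(ols)$coefficients)

# The sum of squared residuals is the quantity least squares minimizes.
rss <- sum(ols$residuals^2)
print(rss)

# 4. Use the model: predict for a hypothetical new observation.
new_car <- data.frame(wt = 3.0, hp = 120,
                      am = factor("manual", levels = levels(mtcars$am)))
pred <- predict(ols, newdata = new_car)
print(pred)
```

Whether to drop any of the points listed in influential is still a judgment call that depends on what you know about the data.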

2 Exercise: Predict crime_rate based on socioeconomic variables

Write a regression analysis report applying what you’ve learned in the workshop. Using the dataset provided to you, write up your findings on which socioeconomic variables are most highly correlated with crime rates (y).

Data : bit.ly/crime_dataset

The dataset was collected in 1960 and a full description of the dataset wasn’t conveniently available. The variables are:
- M: percentage of males aged 14-24
- So: whether it is in a Southern state. 1 for Yes, 0 for No.
- Ed: mean years of schooling
- Po1: police expenditure in 1960
- Po2: police expenditure in 1959
- LF: labour force participation rate
- M.F: number of males per 1000 females
- Pop: state population
- NW: number of non-whites resident per 1000 people
- U1: unemployment rate of urban males aged 14-24
- U2: unemployment rate of urban males aged 35-39
- GDP: gross domestic product per head
- Ineq: income inequality
- Prob: probability of imprisonment
- Time: average time served in prisons
- y: crime rate in an unspecified category

Explain your recommendations where appropriate. To help you through the exercise, you should ask the following questions of your candidate model:

  • Can your model be any more simplified without losing substantial information?
  • Have you tried predicting the crime rate given a reasonable set of values for the predictor variables?
  • Have you identified any non-random pattern in your residual plot?

Challenge:

  1. The model achieves an adjusted R-squared value above the grading threshold of 0.701

  2. The residual plot resembles a random scatterplot
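
Both challenge criteria can be checked directly from a fitted model. The sketch below rests on several assumptions: the UScrime data frame from the MASS package is used as a stand-in because it has the same variable names as listed above (your crime.csv may differ), and the candidate formula y ~ Ed + Po1 + Ineq is an arbitrary illustration, not a suggested answer.

```r
# Stand-in data: UScrime from the MASS package (an assumption; use your own
# crime data frame read from crime.csv instead).
data("UScrime", package = "MASS")
crime <- UScrime

# An arbitrary candidate model, for illustration only.
candidate <- lm(y ~ Ed + Po1 + Ineq, data = crime)

# Criterion 1: compare adjusted R-squared against the grading threshold.
adj_r2 <- summary(candidate)$adj.r.squared
threshold <- 0.701
cat("Adjusted R-squared:", round(adj_r2, 3), "\n")
cat("Above threshold:", adj_r2 > threshold, "\n")

# Criterion 2: residuals against fitted values should look like random scatter.
plot(candidate$fitted.values, candidate$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# A smoothed trend line helps reveal curvature; a roughly flat line is good.
lines(lowess(candidate$fitted.values, candidate$residuals), col = "red")
```

A funnel shape or a curved trend line in the plot suggests that a transformation or a different set of predictors is worth trying.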

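# Load the dataset (crime.csv must be in the working directory)
crime <- read.csv("crime.csv")
# Drop column X (presumably the row index written into the CSV)
crime <- subset(crime, select = -c(X))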

Steps:

  • Predict crime_rate (y) based on the other variables
  • Check correlation
  • Create a model with lm()
  • Use step() or regsubsets() (from the leaps package) to choose the most significant variables
  • Try predicting with a dataset you create yourself
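
One possible way to walk through these steps is sketched below. It is a starting point under stated assumptions, not a model answer: MASS's UScrime data stands in for crime.csv, backward selection by AIC via step() is just one of several selection strategies, and the new observation new_obs (built from column medians) is a hypothetical example.

```r
# Stand-in data: UScrime from the MASS package (an assumption; use your own
# crime data frame read from crime.csv instead).
data("UScrime", package = "MASS")
crime <- UScrime

# Check correlation: each predictor against y, strongest first.
cor_y <- cor(crime)[, "y"]
cor_y <- cor_y[names(cor_y) != "y"]
print(round(sort(abs(cor_y), decreasing = TRUE), 2))

# Create a model: start with every predictor.
full <- lm(y ~ ., data = crime)

# Choose variables: backward stepwise selection by AIC.
reduced <- step(full, direction = "backward", trace = 0)
print(formula(reduced))
print(summary(reduced)$adj.r.squared)

# Predict with a dataset you create: here, the median of every predictor.
new_obs <- as.data.frame(lapply(crime[, names(crime) != "y"], median))
pred <- predict(reduced, newdata = new_obs)
print(pred)
```

Keeping every predictor within the range of the original data, as the medians do here, avoids extrapolating beyond what the model has seen.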