1 Regression

1.1 Before you begin

Check for assumptions

Linearity assumption
Correlation (in R: use cor())
Collinearity
The value you’re predicting should fall within the range of the existing data
Normality of residuals
hist(model$residuals)
plot(model$fitted.values, model$residuals)
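
The checks above can be sketched in a few lines of R. The workshop data is not loaded at this point, so this sketch assumes the built-in mtcars dataset as a stand-in; the variables mpg, wt and hp are illustrative only.

```r
# Stand-in data: built-in mtcars (an assumption, not the workshop dataset)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Linearity / correlation: pairwise correlations of response and predictors
cor(mtcars[, c("mpg", "wt", "hp")])

# Collinearity: a high correlation between two predictors is a warning sign
predictor_cor <- cor(mtcars$wt, mtcars$hp)

# Extrapolation: know the observed range of a predictor before predicting
wt_range <- range(mtcars$wt)

# Normality of residuals: the histogram should look roughly bell-shaped
hist(model$residuals)

# Residuals vs fitted values: look for a random scatter around zero
plot(model$fitted.values, model$residuals)
abline(h = 0, lty = 2)
```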

1.2 Constructing your model

Choose your formula

Use formula() with your dependent variable and independent variables
Linear models work with both numeric and categorical variables
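
As a minimal sketch of a formula that mixes a numeric and a categorical predictor, the example below again assumes the built-in mtcars dataset as a stand-in. The data frame name cars and the factor labels are illustrative choices.

```r
# Stand-in data: built-in mtcars (an assumption, not the workshop dataset)
cars <- mtcars

# Turn the 0/1 transmission flag into a categorical variable (factor)
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Dependent variable on the left of ~, independent variables on the right
f <- formula(mpg ~ wt + am)
model <- lm(f, data = cars)

# lm() creates a dummy variable for the factor: one coefficient per
# non-reference level (here "ammanual", relative to "automatic")
model$coefficients
```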

Decide what to do with outliers (if any)

Remember the “leverage vs influence” section of your coursebook
A single outlier can greatly bias your model
Deciding whether to keep or remove an outlier often requires domain expertise
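
One way to look at leverage and influence in R is sketched below, assuming the built-in mtcars dataset as a stand-in. The cut-offs 2p/n and 4/n are common rules of thumb, not values taken from the coursebook, and flagged points still need a judgment call.

```r
# Stand-in data: built-in mtcars (an assumption, not the workshop dataset)
model <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(model))  # number of coefficients, including the intercept

# Leverage: how unusual an observation's predictor values are
lev <- hatvalues(model)

# Influence: how much the fit changes when the observation is left out
infl <- cooks.distance(model)

# Rule-of-thumb cut-offs (assumptions; adjust to your own context)
high_leverage  <- names(lev)[lev > 2 * p / n]
high_influence <- names(infl)[infl > 4 / n]

high_leverage
high_influence
```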

Interpret your model

Understand your coefficients: ols$coefficients
Intuitively get the “direction” of the relationships between your x and y
P-value: what you observed could be down to random sampling, so check how likely that is
Understand your residuals: your model is never perfect, and is always estimated with error.
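
A small sketch of reading coefficients and p-values, assuming the built-in mtcars dataset as a stand-in (the model here is not the ols object from the workshop):

```r
# Stand-in data: built-in mtcars (an assumption, not the workshop dataset)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table: estimate, standard error, t value and p-value
coef_table <- summary(model)$coefficients
coef_table

# Direction of each relationship: the sign of the estimate
sign(coef_table[, "Estimate"])

# P-values: small values suggest the relationship is unlikely to be
# down to random sampling alone
coef_table[, "Pr(>|t|)"]
```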

Your goal is to find a line that minimizes the sum of squared distances between each point and its predicted value
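
To make the "sum of squared distances" idea concrete, the sketch below compares the fitted line with a slightly shifted one. It assumes the built-in mtcars dataset as a stand-in, and rss() is a helper written only for this illustration.

```r
# Stand-in data: built-in mtcars (an assumption, not the workshop dataset)
model <- lm(mpg ~ wt, data = mtcars)

# Hypothetical helper: sum of squared distances for any candidate line
rss <- function(intercept, slope) {
  predicted <- intercept + slope * mtcars$wt
  sum((mtcars$mpg - predicted)^2)
}

b <- unname(coef(model))

rss_fit   <- rss(b[1], b[2])      # the line chosen by lm()
rss_other <- rss(b[1] + 1, b[2])  # same slope, intercept shifted by 1

# The fitted line has the smaller sum of squares
rss_fit < rss_other

# rss_fit is the same quantity as the sum of squared residuals
all.equal(rss_fit, sum(residuals(model)^2))
```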

You improve your model from looking at “what you did not do well”

Plot your residuals: create residual plots, look at the histogram or other distribution of the residuals, etc.

Use your model: generic functions like summary() and predict()
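
A minimal sketch of using a fitted model with these generic functions, assuming the built-in mtcars dataset as a stand-in. The data frame new_cars holds made-up values chosen to sit inside the observed range of each predictor.

```r
# Stand-in data: built-in mtcars (an assumption, not the workshop dataset)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Overview of coefficients, residuals and R-squared
summary(model)

# Hypothetical new observations, kept within the range of the training data
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))

# Point predictions
preds <- predict(model, newdata = new_cars)
preds

# Predictions with a 95% prediction interval
predict(model, newdata = new_cars, interval = "prediction", level = 0.95)
```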

2 Exercise: Predict `crime_rate` based on socioeconomic variables

Write a regression analysis report applying what you’ve learned in the workshop. Using the dataset provided to you, write up your findings on which socioeconomic variables are most highly correlated with crime rates (y).

Data : bit.ly/crime_dataset

The dataset was collected in 1960, and a full description of the dataset wasn’t conveniently available. The variables are:
- M: percentage of males aged 14-24
- So: whether it is in a Southern state. 1 for Yes, 0 for No.
- Ed: mean years of schooling
- Po1: police expenditure in 1960
- Po2: police expenditure in 1959
- LF: labour force participation rate
- M.F: number of males per 1000 females
- Pop: state population
- NW: number of non-whites resident per 1000 people
- U1: unemployment rate of urban males aged 14-24
- U2: unemployment rate of urban males aged 35-39
- GDP: gross domestic product per head
- Ineq: income inequality
- Prob: probability of imprisonment
- Time: avg time served in prisons
- y: crime rate in an unspecified category

Explain your recommendations where appropriate. To help you through the exercise, you should ask the following questions of your candidate model:

Can your model be simplified any further without losing substantial information?
Have you tried predicting the crime rate given a reasonable set of values for the predictor variables?
Have you identified any non-random pattern in your residual plot?

Challenge:

The model achieves an adjusted R-squared value above the grading threshold of 0.701
The residual plot resembles a random scatterplot

# Read the exercise data
crime <- read.csv("crime.csv")
# Drop column X (assumed to be a leftover row index from the CSV export)
crime <- subset(crime, select = -c(X))

Steps:

Predict crime_rate based on other variables
Check correlation
Create model (lm())
Use step() or regsubsets() (from the leaps package) to choose the most significant variables
Try predicting with a dataset you create yourself
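
The workflow above can be sketched as follows. To avoid giving away the exercise, and because crime.csv may not be at hand, this sketch assumes the built-in mtcars dataset as a stand-in; swap in your own data frame and response variable.

```r
# Stand-in data: built-in mtcars (an assumption, not the crime dataset)

# Check correlation of every variable with the response
cor(mtcars)[, "mpg"]

# Create a full model using all other variables as predictors
full <- lm(mpg ~ ., data = mtcars)

# Backward stepwise selection by AIC; trace = 0 silences the step log
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)

# Compare adjusted R-squared before and after simplification
summary(full)$adj.r.squared
summary(reduced)$adj.r.squared

# The residual plot should resemble a random scatter around zero
plot(reduced$fitted.values, reduced$residuals)
abline(h = 0, lty = 2)

# Predict on a small dataset you create (here: the first rows, reused)
predict(reduced, newdata = head(mtcars, 3))
```

step() keeps or drops variables based on AIC rather than on p-values directly, so it is worth checking the summary() of the reduced model afterwards.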

Step Regression

Algoritma

November 14, 2018
