- cor(): collinearity
- hist(model$residuals)
- plot(model$fitted.values, model$residuals)
- formula() with your dependent variable and independent variables
- ols$coefficients: intuitively get the “direction” of the relationships between your x and y
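As a rough illustration of the collinearity check and of reading coefficient signs, here is a minimal sketch on simulated data. The data frame `toy` and its columns `x1`, `x2`, `x3` are made up for this example and are not part of the workshop dataset.

```r
# Simulated data (an assumption for illustration, not the crime dataset):
# x2 is built to be nearly collinear with x1, while x3 is independent.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
y  <- 2 * x1 - 1.5 * x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# Pairwise correlations among the predictors; values near +1 or -1
# suggest that two predictors carry almost the same information.
pred_cor <- cor(toy[, c("x1", "x2", "x3")])
print(round(pred_cor, 2))

# Keep only one of the collinear pair, then read the coefficient signs
# to get the "direction" of each relationship with y.
ols <- lm(y ~ x1 + x3, data = toy)
print(sign(ols$coefficients))
```

A positive sign means y tends to rise with that predictor, holding the others fixed; a negative sign means it tends to fall.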
Understand your Residuals
Your model is never perfect; it is always estimated with error.
Your goal is to find a line that minimizes the sum of squared distances between each point and its predicted value.
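To make the "minimizes the sum of squared distances" idea concrete, the sketch below compares the line found by lm() with a slightly shifted line. The data are simulated and the helper `ssr()` is a name introduced here purely for illustration.

```r
# Simulated data (an assumption for illustration)
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)

fit <- lm(y ~ x)

# Sum of squared residuals for any intercept/slope pair
ssr <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

b <- unname(coef(fit))                     # fitted intercept and slope
ssr_fit   <- ssr(b[1], b[2])               # the least-squares line
ssr_other <- ssr(b[1] + 0.5, b[2] - 0.1)   # a nearby, slightly different line

print(c(fitted_line = ssr_fit, other_line = ssr_other))
```

Any line other than the fitted one gives a larger sum of squared residuals on the same data.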
You improve your model by looking at “what you did not do well”.
Plot your residuals: create residual plots, and look at the histogram or another view of the residual distribution.
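A minimal sketch of these residual checks, assuming a fitted lm object called `model`. The data are simulated here so the example runs on its own.

```r
# Simulated data (an assumption for illustration)
set.seed(7)
x <- runif(80, 0, 10)
y <- 1 + 0.8 * x + rnorm(80)
model <- lm(y ~ x)

# Distribution of the residuals: ideally roughly symmetric around zero
hist(model$residuals, main = "Histogram of residuals", xlab = "Residual")

# Residuals against fitted values: ideally a patternless cloud around zero
plot(model$fitted.values, model$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```

A funnel shape or a curve in the second plot would hint that the model is missing something, such as a transformation or an omitted variable.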
- summary(), predict()
crime_rate based on socioeconomic variables
Write a regression analysis report applying what you’ve learned in the workshop. Using the dataset provided, write up your findings on which socioeconomic variables are most highly correlated with crime rates (y).
Data: bit.ly/crime_dataset
The dataset was collected in 1960, and a full description of it wasn’t conveniently available. The variables are:
- M: percentage of males aged 14-24
- So: whether it is in a Southern state. 1 for Yes, 0 for No.
- Ed: mean years of schooling
- Po1: police expenditure in 1960
- Po2: police expenditure in 1959
- LF: labour force participation rate
- M.F: number of males per 1000 females
- Pop: state population
- NW: number of non-whites resident per 1000 people
- U1: unemployment rate of urban males aged 14-24
- U2: unemployment rate of urban males aged 35-39
- GDP: gross domestic product per head
- Ineq: income inequality
- Prob: probability of imprisonment
- Time: avg time served in prisons
- y: crime rate in an unspecified category
Explain your recommendations where appropriate. To help you through the exercise, you should ask the following questions of your candidate model:
Challenge:
- The model achieves an adjusted R-squared value above the grading threshold of 0.701
- The residual plot resembles a random scatterplot
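One way to check the adjusted R-squared criterion is sketched below. The model `candidate` and the simulated predictors are placeholders for this example; with the real data you would fit on the crime dataset instead.

```r
# Simulated data (an assumption for illustration, not the crime dataset)
set.seed(3)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5)

candidate <- lm(y ~ x1 + x2)

# summary() on an lm object stores the adjusted R-squared
adj_r2 <- summary(candidate)$adj.r.squared
meets_threshold <- adj_r2 > 0.701   # threshold taken from the challenge

print(round(adj_r2, 3))
print(meets_threshold)
```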
crime <- read.csv("crime.csv")
crime <- subset(crime, select = -c(X))  # drop the column named X
Step:
- crime_rate based on other variables (lm())
- step() or regsubsets() to choose the most significant variables
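The two steps can be sketched as follows on simulated data. The data frame `toy`, its predictors `x1` to `x4`, and the response name `crime_rate` are stand-ins chosen for this example; in the dataset described above the response column is y. Only step() from base R is shown, since regsubsets() lives in the separate leaps package.

```r
# Simulated data (an assumption for illustration): only x1 and x2
# actually drive the response; x3 and x4 are pure noise.
set.seed(10)
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$crime_rate <- 5 + 3 * toy$x1 - 2 * toy$x2 + rnorm(n)

# Step 1: regress the response on all other variables
full <- lm(crime_rate ~ ., data = toy)

# Step 2: backward elimination by AIC; trace = 0 silences the step log
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(summary(reduced)$adj.r.squared)
```

step() selects by AIC rather than by p-values, so the variables it keeps are not guaranteed to all be statistically significant; it is worth checking summary(reduced) afterwards.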