General instructions
In this project, you will be guided through the analysis of a dataset containing health data from 157 countries. Your main question is: What affects life expectancy throughout the world? You must submit a report on Turnitin.
As well as the answer to each question you must include the code snippet that generated this answer. This snippet must be fully commented. You will lose marks for not including these commented snippets. This is both to ensure you understand what you are doing and as a plagiarism check.
Embedded in each question is the mark available for it.
Each question will be marked independently, that is, if you get something wrong early on you will lose marks at that point, but later analysis based on this answer will be marked as correct if the correct analytical technique was applied, even though the answer will be wrong.
You will lose marks for poor presentation and generally not explaining yourself. Expecting us to just read the R output and understand how you got from a to b is a recipe for losing marks. Creating figures with no figure captions etc. will lose you marks.
You will submit this through Turnitin. The deadline is 10.00 a.m. on the 29th of April 2024.
Getting help
This is the end-of-module assessment for BS2004. You should not need extra help. The questions should be completely clear and use techniques which we taught you during the course. If you think there is a mistake in a question (there almost certainly is not), email ebm3@le.ac.uk. No other help is offered (as it is the end-of-module assessment).
The data
The data comes from the World Bank (life_expectancy_data.csv). As well as life expectancy (the response variable life_exp), I scanned the database for various predictors that I thought would be interesting. As well as country as a label, I chose 12 predictor variables.
- pollution: Levels of pollution measured as PM2.5
- gdp_capita: Per capita gross domestic product (how rich a country is)
- alcohol: Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
- cause_of_death_communicable: How many deaths were caused by communicable diseases (percentage of total)
- dpt_im: Percentage of babies vaccinated against diphtheria, pertussis (whooping cough), and tetanus
- meas_im: Percentage of babies vaccinated against measles
- pol_im: Percentage of babies vaccinated against polio
- hospital_beds: Number of hospital beds per 1000 people
- diabetes: Diabetes prevalence (percentage of population)
- overweight: Prevalence of overweight adults (percentage of population)
- fertility_rate: Total births per woman
- pop: Total population
The questions
Question 1: Plot (scatterplot) life expectancy (y-axis) against each of the raw predictor variables (6 marks total). Note I said raw, not the logged values that I use in Figure 1. Hint: Make your life easy by combining all the graphs in a single figure using R. There are several ways to do this. I use grid.arrange below.
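As a rough illustration of the grid.arrange hint, here is a minimal sketch of building one scatterplot per predictor and combining them into a single figure. It assumes the ggplot2 and gridExtra packages are installed, and it uses a small simulated data frame (called demo here, a made-up name, with three made-up columns) purely so that it runs on its own. It is not the code that produced my figure, and it is only one of several ways to do this.

```r
# Minimal sketch: one scatterplot per predictor, combined into one figure.
# ASSUMPTION: `demo` is SIMULATED so the example runs on its own; in your
# report you would read the real data instead, e.g.
# read.csv("life_expectancy_data.csv").
library(ggplot2)
library(gridExtra)

set.seed(1)
demo <- data.frame(
  life_exp  = rnorm(50, mean = 70, sd = 8),
  pollution = runif(50, 5, 90),
  alcohol   = runif(50, 0, 15),
  diabetes  = runif(50, 2, 20)
)

# Treat every column except the response as a predictor
predictors <- setdiff(names(demo), "life_exp")

# Build one ggplot object per predictor (life_exp always on the y-axis)
plots <- lapply(predictors, function(v) {
  ggplot(demo, aes(x = .data[[v]], y = life_exp)) +
    geom_point() +
    labs(x = v, y = "life_exp")
})

# Draw all panels in a single figure, two panels per row
grid.arrange(grobs = plots, ncol = 2)
```

Building the plots in a list means adding another predictor only requires another column in the data frame, not another block of plotting code.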
When I produced this figure, it was clear to me that gdp_capita, hospital_beds and pop did not show a linear relationship with life_exp. This makes sense. For example, a lot of countries have low GDP and some have much higher GDP. An easy solution is to take the natural log of these predictors. That’s what I did in the figure below. Much better. So from now on, use the logged values of gdp_capita, hospital_beds and pop in your analysis.
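To illustrate the logging step, here is a small sketch on simulated data. The data frame demo and its values are invented for the example. In R, log() is the natural log by default, and you can either store the logged predictor as a new column or apply log() directly inside the model formula. Both give the same fit.

```r
# Minimal sketch of logging a right-skewed predictor.
# ASSUMPTION: `demo` is SIMULATED; the numbers mean nothing.
set.seed(2)
demo <- data.frame(gdp_capita = rlnorm(100, meanlog = 9, sdlog = 1.2))
demo$life_exp <- 40 + 3 * log(demo$gdp_capita) + rnorm(100, sd = 3)

# Option 1: store the logged predictor as a new column
demo$log_gdp_capita <- log(demo$gdp_capita)  # log() is the natural log
fit_column <- lm(life_exp ~ log_gdp_capita, data = demo)

# Option 2: apply log() directly inside the model formula
fit_inline <- lm(life_exp ~ log(gdp_capita), data = demo)

# Both routes give the same slope estimate
slope_column <- unname(coef(fit_column)[2])
slope_inline <- unname(coef(fit_inline)[2])
print(c(column = slope_column, inline = slope_inline))
```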
Question 2: From my figure above, pick three relationships (except log(gdp); see below). Describe these relationships in words and give an explanation/hypothesis for why each relationship exists (18 marks). For example: “There is an increasing linear relationship between life expectancy and log(gdp). It could be expected that the higher living standards (e.g. modern sanitation) and modern medicine available in rich countries would bring an increase in life expectancy.”
Question 3: Checking for collinearity. Create the full linear model (life expectancy against all twelve predictors with no interactions). Remember to log the predictors you need to log. It makes sense that some of these variables are collinear. Use the cut-offs you have been taught (vif > 5, cor > 0.8) to identify collinear predictors.
A simple pipeline:
- Identify all variables with vif > 5
- For each collinear variable, look at the correlations to find out what it is collinear with (cor > 0.8; it will be one or more of the other collinear variables)
- In each matched set of collinear variables, choose one variable to stay in the model. Which one should you keep? Be sensible here. For example, suppose you find gdp and hospital beds are collinear. I would keep gdp, as it seems clearly to be more of a root cause. It won’t always be this simple.
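The pipeline above can be sketched roughly as follows. This is an illustration on simulated data with made-up variable names (x1, x2, x3, y), where x2 is deliberately built to be collinear with x1. The VIF is computed by hand in base R from its definition, VIF_j = 1 / (1 - R_j^2), so that the example needs no extra packages. In practice you would use whichever vif function you were taught on the course (for example, the car package provides vif()).

```r
# Minimal sketch of the collinearity pipeline on SIMULATED data.
# ASSUMPTION: x1, x2, x3 and y are made-up names, not the assessment variables.
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)  # deliberately collinear with x1
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
demo <- data.frame(y, x1, x2, x3)

# Full model: response against all predictors, no interactions
full  <- lm(y ~ x1 + x2 + x3, data = demo)
preds <- c("x1", "x2", "x3")

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vifs <- sapply(preds, function(p) {
  others <- setdiff(preds, p)
  r2 <- summary(lm(reformulate(others, response = p), data = demo))$r.squared
  1 / (1 - r2)
})

# Step 1: flag predictors with vif > 5
flagged <- names(vifs)[vifs > 5]

# Step 2: look at the correlations to see which flagged predictors pair up
cors <- cor(demo[, preds])
print(round(vifs, 2))
print(round(cors, 2))

# Step 3: keep one variable from each collinear set (here x1 is kept
# and x2 dropped, purely as an example of the mechanics)
reduced <- update(full, . ~ . - x2)
```

Which variable you keep in step 3 is a judgment about the science, not something the code can decide for you.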
Marks
- 2 marks for a correctly written full model
- 2 marks for identifying collinear predictors (report vif and cor)
- 2 marks for a correctly written reduced model. The increased marks are for correctly using the collinear information to make this new model.
Question 4: Model selection. You are going to do a simple model selection here on your new model (with the variables removed in Question 3). Using stepwise selection based on p-values (threshold 0.05), identify which predictor variables to remove. 3 marks for presenting a table of the step selection summary. 2 marks for identifying which predictors to remove. 5 marks for correctly coding this final model.
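To show the idea behind p-value-based selection, here is a hand-rolled backward elimination loop in base R on simulated data. This is an assumption about what the procedure looks like under the hood, not necessarily the function you were taught: it repeatedly drops the term with the largest p-value (from drop1 with an F test) until every remaining term has p <= 0.05. The variable names are invented.

```r
# Minimal sketch of backward elimination by p-value on SIMULATED data.
# ASSUMPTION: hand-rolled loop for illustration; a dedicated stepwise
# function from a package may report its steps differently.
set.seed(4)
n <- 120
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo$y <- 2 + 1.5 * demo$x1 + rnorm(n)  # only x1 truly affects y

model <- lm(y ~ x1 + x2 + x3, data = demo)

repeat {
  # Stop if there are no predictors left to test
  if (length(attr(terms(model), "term.labels")) == 0) break

  # F test for dropping each term in turn; row 1 is "<none>", so skip it
  tab      <- drop1(model, test = "F")
  p_values <- tab[["Pr(>F)"]][-1]
  term_ids <- rownames(tab)[-1]

  # Stop once every remaining term is significant at 0.05
  if (max(p_values) <= 0.05) break

  # Otherwise drop the least significant term and refit
  worst <- term_ids[which.max(p_values)]
  message("Dropping ", worst, " (p = ", signif(max(p_values), 3), ")")
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

kept <- attr(terms(model), "term.labels")
print(kept)
```

The messages printed at each step give the raw material for a step summary table: which term was removed and its p-value at the time of removal.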
Question 5: Assumption checking for your final model. You now have a final model (after model selection). Here is a nice recap of how to use linear model diagnostic plots to check your assumptions. Show the plots and briefly explain why your data fits or doesn’t fit the assumptions.
Marks
- The four diagnostic plots (4 marks)
- Linearity? (1 mark)
- Normality? (1 mark)
- Homoscedasticity? (1 mark)
- Any outliers to be concerned about? (1 mark). Remember, if you cannot see a Cook’s distance line, that is because it is not anywhere near your data.
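As a reminder of the mechanics, the four standard diagnostic plots for a linear model can be produced in base R roughly as follows. The model here is fitted to simulated data with made-up names (demo, fit), so the plots themselves carry no meaning. By default, plot() on an lm object gives Residuals vs Fitted (linearity), Normal Q-Q (normality), Scale-Location (homoscedasticity) and Residuals vs Leverage (influential points, with Cook's distance contours).

```r
# Minimal sketch of the four lm diagnostic plots on SIMULATED data.
# ASSUMPTION: `demo` and `fit` are made-up; use your own final model instead.
set.seed(5)
demo <- data.frame(x = rnorm(80))
demo$y <- 3 + 2 * demo$x + rnorm(80)
fit <- lm(y ~ x, data = demo)

# Put the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)     # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(old_par)  # restore the previous plotting layout

# Cook's distance per observation, if you want the numbers behind the last plot
cooks <- cooks.distance(fit)
print(head(sort(cooks, decreasing = TRUE), 3))
```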