General instructions
In this project, you will be guided through the analysis of a dataset containing health data from 157 countries. Your main question is: what affects life expectancy throughout the world? You must submit a report on Turnitin.
You need to present your analysis in a research report. This means that you need to write a ‘body text’ in coherent prose and well-structured paragraphs that guides the reader through your statistical analyses and their interpretation. You need to refer to figures and tables in the body text, using figure / table references (Figure 1, Table 1 etc.). The formatting of the body text paragraphs needs to be clearly distinct from the code snippets. Figures and tables need clear captions, again distinct from the body text.
As well as the answer to each question, you must include the code snippet that generated this answer. This snippet must be fully commented. You will lose marks for not including these commented snippets. This is both to ensure you understand what you are doing and as a plagiarism check. With regard to code snippets, do not use screenshots. Rather, format them differently from the body text. How you do this is up to you, as long as it is clear (there are some ways to do it in Word, though they are perhaps overly complicated).
Embedded in each question is the mark available for it.
Each question will be marked independently. That is, if you get something wrong early on, you will lose marks at that point, but later analysis based on this answer will be marked as correct if the correct analytical technique was applied, even though the answer will be wrong.
You will lose marks for poor presentation and for generally not explaining yourself. Expecting us to just read the R output and understand how you got from A to B is a recipe for losing marks. Creating figures with no figure captions etc. will lose you marks. Raw R console output dumps will lose you marks; R output needs to be presented in proper tables etc.
You will submit this through Turnitin. The deadline is the 27th of April 2026 at 10.00 a.m.
Getting help
This is the end-of-module assessment for BS2004. You should not need extra help. The questions should be completely clear and use techniques which we taught you during the course. If you think there is a mistake in a question (there almost certainly is not), email ebm3@le.ac.uk. No other help is offered, as this is the end-of-module assessment.
The data
The data comes from the World Bank (life_expectancy_data.csv). As well as life expectancy (the response variable life_exp), I scanned the database for various predictors that I thought would be interesting. As well as country as a label, I chose 12 predictor variables.
- pollution: Levels of pollution measured as PM2.5
- gdp_capita: Per capita gross domestic product in US dollars (how rich a country is)
- alcohol: Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
- cause_of_death_communicable: How many deaths were caused by communicable diseases (percentage of total)
- dpt_im: Percentage of babies vaccinated against diphtheria, pertussis (whooping cough), and tetanus
- meas_im: Percentage of babies vaccinated against measles
- pol_im: Percentage of babies vaccinated against polio
- hospital_beds: Number of hospital beds per 1000 people
- diabetes: Diabetes prevalence (percentage of population)
- overweight: Prevalence of overweight adults (percentage of population)
- fertility_rate: Total births per woman
- pop: Total population
The questions
Question 1: Plot (scatterplot) life expectancy (y-axis) against each of the raw predictor variables (6 marks total). Note that I said raw, not the logged values that I use in Figure 1. Hint: Make your life easy by combining all the graphs in a single figure using R.
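As a rough illustration of the hint about combining graphs, the sketch below draws several scatterplots in one figure using base R. It is only a sketch under stated assumptions: the data frame dat is filled with invented random numbers (not the World Bank values), only four of the predictors are shown, and par(mfrow = ...) is just one of several ways to build a multi-panel figure.

```r
# Synthetic stand-in data: all values are invented purely for illustration.
# Only the column names are taken from the brief.
set.seed(1)
n <- 50
dat <- data.frame(
  life_exp   = rnorm(n, mean = 70, sd = 8),
  pollution  = runif(n, 5, 90),
  alcohol    = runif(n, 0, 15),
  diabetes   = runif(n, 2, 20),
  overweight = runif(n, 10, 70)
)

# Every column except the response is treated as a predictor
predictors <- setdiff(names(dat), "life_exp")

# Split the plotting device into a 2 x 2 grid and remember the old settings
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 1, 1))

# One scatterplot per predictor, with life expectancy always on the y-axis
for (p in predictors) {
  plot(dat[[p]], dat$life_exp, xlab = p, ylab = "life_exp", pch = 16)
}

# Restore the original graphics settings
par(old_par)
```

With twelve predictors, a grid such as mfrow = c(3, 4) would hold all the panels in one figure.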
When I produced the figure requested in Question 1, it was clear to me that gdp_capita, hospital_beds and pop did not show a linear relationship with life_exp. This makes sense. For example, a lot of countries have low GDP and some have much higher GDP. An easy solution is to take the natural log of these predictors. That’s what I did in the figure below. Much better. So from now on, use the logged values of gdp_capita, hospital_beds and pop in your analysis.
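To see why logging helps with a predictor like this, here is a small sketch on made-up, right-skewed numbers (the values are an assumption for illustration, not the real gdp_capita column). In R, log() gives the natural log by default.

```r
# Invented right-skewed values standing in for a predictor such as gdp_capita:
# many small values and a few very large ones
set.seed(2)
gdp_capita <- rlnorm(100, meanlog = 8.5, sdlog = 1.2)

# log() with no base argument is the natural log
log_gdp <- log(gdp_capita)

# Compare the spread before and after: the raw range covers several orders
# of magnitude, the logged range is far more compact
range(gdp_capita)
range(log_gdp)

# Side-by-side histograms make the change in shape visible
old_par <- par(mfrow = c(1, 2))
hist(gdp_capita, main = "Raw", xlab = "gdp_capita")
hist(log_gdp, main = "Logged", xlab = "log(gdp_capita)")
par(old_par)
```

The transformation can also be written directly inside a model formula, for example log(gdp_capita), so that no new column is needed.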
Question 2: From my Figure 1 above, pick three relationships (except log(gdp), which I use as an example below). Describe these relationships in words and give an explanation/hypothesis for why each relationship exists (30 marks). For example: “There is an increasing linear relationship between life expectancy and log(gdp). It could be expected that the greater access to higher living standards (e.g. modern sanitation) and modern medicine seen in rich countries would bring an increase in life expectancy.”
Question 3: Checking for collinearity. Create the full linear model (life expectancy against all twelve predictors with no interactions). Remember to log the predictors you need to log. It makes sense that some of these variables are collinear. Use the cut-offs you have been taught (vif > 5, cor > 0.8) to identify collinear predictors.
- 2 marks for a correctly written full model
- 2 marks for identifying collinear predictors (report vif and cor)
- 2 marks for a correctly written reduced model.
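The two cut-offs can be illustrated on a toy example. The sketch below is not the assignment data: the three predictors are simulated, with pol_im deliberately built as a near copy of dpt_im so that the collinearity is obvious. The helper vif_manual is a hypothetical function written for this sketch; it computes VIF from its definition, 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. A package function such as vif() from car would normally do this step.

```r
# Simulated data: pol_im is deliberately almost identical to dpt_im
set.seed(3)
n <- 100
dpt_im   <- runif(n, 40, 99)
pol_im   <- dpt_im + rnorm(n, mean = 0, sd = 2)
alcohol  <- runif(n, 0, 15)
life_exp <- 50 + 0.2 * dpt_im + 0.3 * alcohol + rnorm(n, mean = 0, sd = 3)
dat <- data.frame(life_exp, dpt_im, pol_im, alcohol)

vars <- c("dpt_im", "pol_im", "alcohol")

# Full model: all predictors, no interactions
full <- lm(life_exp ~ dpt_im + pol_im + alcohol, data = dat)

# Pairwise correlations between predictors (cut-off: cor > 0.8)
cors <- cor(dat[, vars])

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j is obtained by
# regressing predictor j on all the other predictors (cut-off: VIF > 5)
vif_manual <- function(data, vars) {
  sapply(vars, function(v) {
    others <- setdiff(vars, v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}
vifs <- vif_manual(dat, vars)

round(cors, 2)
round(vifs, 2)

# Reduced model: drop one member of the collinear pair
reduced <- update(full, . ~ . - pol_im)
```

Which member of a collinear pair to drop is a judgment call; here pol_im is removed only because it was constructed as the duplicate.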
Question 4: Model selection. You are going to do a simple model selection on your model. Using stepwise selection based on p-values (threshold 0.05), identify which predictor variables to remove.
- 4 marks for presenting a table of step selection summary.
- 2 marks for identifying which predictors to remove.
- 1 mark for correctly coding this final model.
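One way to picture p-value-based selection is backward elimination: fit the model, remove the least significant predictor if its p-value is above 0.05, refit, and repeat. The sketch below does this by hand with drop1() on simulated data in which only x1 truly affects y. It is an assumption that this matches the stepwise procedure taught on the course, which may use a dedicated package function instead; the variable names x1, x2, x3 and y are invented for the example.

```r
# Simulated data: only x1 has a real effect on y
set.seed(4)
n <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 + 1.5 * dat$x1 + rnorm(n)

# Start from the model containing every predictor
model <- lm(y ~ x1 + x2 + x3, data = dat)

repeat {
  # Stop if there is nothing left to remove
  if (length(attr(terms(model), "term.labels")) == 0) break

  # F-test for dropping each term in turn; the first row is "<none>"
  tab <- drop1(model, test = "F")
  pvals <- tab[["Pr(>F)"]][-1]
  names(pvals) <- rownames(tab)[-1]

  # Stop once every remaining predictor is significant at 0.05
  if (max(pvals) <= 0.05) break

  # Otherwise remove the least significant predictor and refit
  worst <- names(which.max(pvals))
  message("Removing ", worst, " (p = ", signif(max(pvals), 3), ")")
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

# Coefficient table of the final model
summary(model)$coefficients
```

Recording the removed predictor and its p-value at each step gives the raw material for a step selection summary table.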