BS2004 Guided mini project

General instructions

In this project, you will be guided through the analysis of a dataset containing health data from 157 countries. Your main question is: “What affects life expectancy throughout the world?” You must submit a report on Turnitin.

You need to present your analysis in a research report. This means that you need to write a ‘body text’ in coherent prose and well-structured paragraphs that guides the reader through your statistical analyses and their interpretation. You need to refer to figures and tables in the body text, using figure / table references (Figure 1, Table 1 etc.). The formatting of the body text paragraphs needs to be clearly distinct from the code snippets. Figures and tables need clear captions, again distinct from the body text.

As well as the answer to each question, you must include the code snippet that generated this answer. This snippet must be fully commented. You will lose marks for not including these commented snippets. This is both to ensure you understand what you are doing and as a plagiarism check. With regard to code snippets, do not use screenshots. Rather, format them differently from the body text. The choice of how to do this is up to you as long as it’s clear (some ways to do it in Word, perhaps overly complicated).

Embedded in each question is the mark available for it.

Each question will be marked independently, that is, if you get something wrong early on you will lose marks at that point, but later analysis based on this answer will be marked as correct if the correct analytical technique was applied, even though the answer will be wrong.

You will lose marks for poor presentation and for generally not explaining yourself. Expecting us to just read the R output and understand how you got from A to B is a recipe for losing marks. Creating figures with no figure captions etc. will lose you marks. Raw R console output dumps will lose you marks; R output needs to be presented in proper tables etc.

You will submit this through Turnitin. The deadline is the 27th of April 2026 at 10.00 a.m.

Getting help

This is the end-of-module assessment for BS2004. You should not need extra help. The questions should be completely clear and use techniques which we taught you during the course. If you think there is a mistake in a question (there almost certainly is not), email ebm3@le.ac.uk. No other help is offered (as it is the end-of-module assessment).

The data

The data come from the World Bank (life_expectancy_data.csv). As well as life expectancy (the response variable life_exp), I scanned the database for various predictors that I thought would be interesting. As well as country as a label, I chose 12 predictor variables.

pollution Levels of pollution measured as PM2.5
gdp_capita Per capita gross domestic product in US dollars (a measure of how rich a country is)
alcohol Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
cause_of_death_communicable How many deaths were caused by communicable diseases (percentage of total)
dpt_im Percentage of babies vaccinated against diphtheria, pertussis (whooping cough), and tetanus
meas_im Percentage of babies vaccinated against measles
pol_im Percentage of babies vaccinated against polio
hospital_beds Number of hospital beds per 1000 people
diabetes Diabetes prevalence (percentage of population)
overweight Prevalence of overweight adults (percentage of population)
fertility_rate Total births per woman
pop Total population

The questions

Question 1: Plot (scatterplot) life expectancy (y-axis) against each of the raw predictor variables (6 marks total). Note that I said raw, not the logged values that I use in Figure 1. Hint: Make your life easy by combining all the graphs in a single figure using R.
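As an illustration of the hint only, here is a minimal sketch of one way to combine several scatterplots in a single figure using base R. It runs on made-up data: the data frame toy and its columns y and x1 to x4 are hypothetical stand-ins, not the project data, and other approaches to multi-panel figures would do equally well.

```r
# Made-up data standing in for a real dataset; all names here are hypothetical.
set.seed(1)
toy <- data.frame(
  y  = rnorm(50, mean = 70, sd = 5),  # hypothetical response
  x1 = runif(50),                     # hypothetical predictors
  x2 = rexp(50),
  x3 = rnorm(50),
  x4 = rpois(50, lambda = 3)
)

# Every column except the response is treated as a predictor.
predictors <- setdiff(names(toy), "y")

# Split the plotting device into a 2 x 2 grid and remember the old settings.
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 1, 1))

# Draw one scatterplot per predictor, with the response on the y-axis.
for (p in predictors) {
  plot(toy[[p]], toy$y, xlab = p, ylab = "y")
}

# Restore the previous graphical settings.
par(old_par)
```

The grid dimensions in mfrow would need to be enlarged for more panels; twelve predictors would fit in a 3 x 4 grid, for instance.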

When I produced the figure requested in Question 1, it was clear to me that gdp_capita, hospital_beds and pop did not show a linear relationship with life_exp. This makes sense. For example, a lot of countries have low GDP and some have much higher GDP. An easy solution for that is to take the natural log of these predictors. That’s what I did in the figure below. Much better. So from now on, use the logged values of gdp_capita, hospital_beds and pop in your analysis.

Figure 1. Life expectancy replotted against each predictor variable. Three predictor variables (gdp_capita, hospital_beds and pop) have been natural-log transformed.

Question 2: From my Figure 1 above, pick three relationships (except log(gdp), which I use as an example below). Describe these relationships in words and give an explanation/hypothesis for why each relationship exists (30 marks). For example: “There is an increasing linear relationship between life expectancy and log (gdp). It could be expected that the higher living standards (e.g. modern sanitation) and modern medicine available in rich countries would bring an increase in life expectancy.”

Question 3: Checking for collinearity. Create the full linear model (life expectancy against all twelve predictors with no interactions). Remember to log the predictors you need to log. It makes sense that some of these variables are collinear. Use the cut-offs you have been taught (vif > 5, cor > 0.8) to identify collinear predictors.

2 marks for a correctly written full model.

2 marks for identifying collinear predictors (report vif and cor).

2 marks for a correctly written reduced model.
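To make the two cut-offs concrete, here is a minimal sketch on made-up data. The variables x1, x2, x3 and y are hypothetical, with x2 deliberately built to be collinear with x1. The VIF is computed by hand from its definition so that the example needs no add-on package; this is an assumption of the sketch and not necessarily the function you used in the course.

```r
# Made-up data: x2 is constructed to be strongly correlated with x1.
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)  # nearly a copy of x1
x3 <- rnorm(n)                 # unrelated to the other predictors
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

pred_names <- c("x1", "x2", "x3")

# Pairwise correlations between the predictors (response excluded).
cor_mat <- cor(toy[, pred_names])
print(round(cor_mat, 2))

# The VIF of predictor j is 1 / (1 - R^2_j), where R^2_j is the R-squared
# from regressing predictor j on all of the other predictors.
vif_manual <- sapply(pred_names, function(v) {
  others <- setdiff(pred_names, v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

In this toy example x1 and x2 exceed both cut-offs while x3 does not, so a reduced model would keep only one of x1 and x2.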

Question 4: Model selection. You are going to do a simple model selection here on your reduced model. Using stepwise selection based on p-values (threshold 0.05), identify which predictor variables to remove.

4 marks for presenting a table of step selection summary.

2 marks for identifying which predictors to remove.

1 mark for correctly coding this final model.
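As a rough illustration of selection on p-values, here is a minimal sketch of backward elimination on made-up data. It assumes that "stepwise" means removing the least significant term one at a time until every remaining term has p at or below 0.05; the exact procedure taught in the course may differ. All names (toy, x1 to x3, steps) are hypothetical.

```r
# Made-up data: only x1 truly affects y.
set.seed(3)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

alpha <- 0.05
model <- lm(y ~ x1 + x2 + x3, data = toy)

# Record of each removal, usable later as a step-selection summary table.
steps <- data.frame(removed = character(0), p_value = numeric(0))

repeat {
  # Stop if no predictors are left to test.
  if (length(attr(terms(model), "term.labels")) == 0) break

  # F-test p-value for dropping each remaining term in turn.
  tests <- drop1(model, test = "F")
  pvals <- tests[["Pr(>F)"]][-1]       # the first row is "<none>"
  names(pvals) <- rownames(tests)[-1]

  # Stop once every remaining term is significant at alpha.
  if (max(pvals) <= alpha) break

  # Otherwise remove the least significant term and refit.
  worst <- names(pvals)[which.max(pvals)]
  steps <- rbind(steps, data.frame(removed = worst, p_value = max(pvals)))
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

print(steps)
print(formula(model))
```

The steps data frame records the order of removal and the p-value at each step, which is the kind of information a step-selection summary table would present.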

Question 5: Assumption checking for your final model. You now have a final model (after model selection). Show the diagnostic plots and briefly explain why your data fits/doesn’t fit the assumptions of the proposed linear model (4 marks).
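For reference, here is a minimal sketch of producing the standard diagnostic plots for a fitted linear model in base R. It uses made-up data; fit, toy, x and y are hypothetical names, not the project model.

```r
# Made-up data with a genuinely linear relationship.
set.seed(4)
toy <- data.frame(x = runif(80, 1, 10))
toy$y <- 3 + 1.5 * toy$x + rnorm(80)
fit <- lm(y ~ x, data = toy)

# Show the four default diagnostic plots in a 2 x 2 grid:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```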

Question 6: Describing your model. Report the results of your model. This includes significance statistics (t and p values), estimate sizes and errors (8 marks) and, most importantly, a narrative explanation for each significant predictor (35 marks). This is more involved than that for Question 2. You can’t just say “There is an increasing linear relationship between life expectancy and log (gdp)”. You have now carried out a statistical analysis, so by looking at the estimates you can say, as a made-up example, “a 1000 dollar increase in per capita gdp increases average life expectancy in a country by 4 months”. Be very careful with predictors that you log transformed; their estimates are not trivial to interpret.
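To show why logged predictors need care, here is a minimal sketch on made-up data (toy, x, y and fit are hypothetical, and the numbers mean nothing beyond this example). When the model is y = a + b * log(x), the estimate b is the change in y per one-unit change in log(x), not per one-unit change in x. Multiplying x by a factor k changes the predicted y by b * log(k).

```r
# Made-up data: y depends linearly on the natural log of x.
set.seed(5)
toy <- data.frame(x = exp(runif(100, 0, 5)))  # strictly positive predictor
toy$y <- 10 + 3 * log(toy$x) + rnorm(100, sd = 0.5)

fit <- lm(y ~ log(x), data = toy)
b <- unname(coef(fit)["log(x)"])

# Predicted change in y when x doubles, from the formula b * log(k) with k = 2.
effect_of_doubling <- b * log(2)

# The same quantity obtained directly from the model's predictions
# at x = 5 and x = 10 (any pair with a ratio of 2 gives the same difference).
pred <- predict(fit, newdata = data.frame(x = c(5, 10)))
print(c(by_formula = effect_of_doubling,
        by_prediction = unname(pred[2] - pred[1])))
```

The two printed values agree, which confirms that an estimate on a logged predictor is naturally described in terms of proportional changes in x (for example, a doubling) rather than fixed increments.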

Question 7: From Figure 1, it is pretty clear there is a positive relationship between life expectancy and log (hospital beds). But imagine that in my final model there is a significant negative relationship. Can you think of a statistical explanation for why the simple graph and your analysis would have different results (10 marks)? Don’t worry if you didn’t find that; it’s a hypothetical.