BS2004 Guided mini project

General instructions

In this project, you will be guided through the analysis of a dataset containing health data from 157 countries. Your main question is: what affects life expectancy throughout the world? You must submit a report on Turnitin.

As well as the answer to each question, you must include the code snippet that generated that answer. This snippet must be fully commented. You will lose marks for not including these commented snippets. This is both to ensure you understand what you are doing and to act as a plagiarism check.

Embedded in each question is the mark available for it.

Each question will be marked independently. That is, if you get something wrong early on, you will lose marks at that point, but later analysis based on that answer will be marked as correct if the correct analytical technique was applied, even though the answer itself will be wrong.

You will lose marks for poor presentation and generally not explaining yourself. Expecting us to just read the R output and understand how you got from a to b is a recipe for losing marks. Creating figures with no figure captions etc. will lose you marks.

You will submit this through Turnitin. The deadline is the 29th of April 2024 at 10.00 a.m.

Getting help

This is the end-of-module assessment for BS2004. You should not need extra help. The questions should be completely clear and use techniques which we taught you during the course. If you think there is a mistake in a question (there almost certainly is not), email ebm3@le.ac.uk. No other help is offered (as it is the end-of-module assessment).

The data

The data come from the World Bank (life_expectancy_data.csv). Alongside life expectancy (the response variable, life_exp), I scanned the database for predictors that I thought would be interesting. With country as a label, I chose 12 predictor variables.

pollution: Levels of pollution, measured as PM2.5
gdp_capita: Per capita gross domestic product (how rich a country is)
alcohol: Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
cause_of_death_communicable: Deaths caused by communicable diseases (percentage of total)
dpt_im: Percentage of babies vaccinated against diphtheria, pertussis (whooping cough), and tetanus
meas_im: Percentage of babies vaccinated against measles
pol_im: Percentage of babies vaccinated against polio
hospital_beds: Number of hospital beds per 1000 people
diabetes: Diabetes prevalence (percentage of population)
overweight: Prevalence of overweight adults (percentage of population)
fertility_rate: Total births per woman
pop: Total population

The questions

Question 1: Plot (scatterplot) life expectancy (y-axis) against each of the raw predictor variables (6 marks total). Note that I said raw, not the logged values that I use in Figure 1. Hint: Make your life easy by combining all the graphs in a single figure using R. There are several ways to do this. I use `grid.arrange` below.
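As an illustration of one of the "several ways" to combine panels, here is a minimal base R sketch that uses `par(mfrow = ...)` rather than `grid.arrange`. The data frame `toy` and its columns `x1` to `x4` are made up for the example and are not the real dataset; only the layout idea carries over.

```r
# Toy data standing in for a real dataset; names and values are hypothetical.
set.seed(1)
toy <- data.frame(
  life_exp = rnorm(30, mean = 70, sd = 5),
  x1 = runif(30),
  x2 = runif(30),
  x3 = runif(30),
  x4 = runif(30)
)

# Every column except the response is treated as a predictor.
predictors <- setdiff(names(toy), "life_exp")

# Work out a grid large enough to hold one panel per predictor.
n_cols <- 2
n_rows <- ceiling(length(predictors) / n_cols)

# Split the plotting device into a grid, draw one scatterplot per panel,
# then restore the previous graphical settings.
old_par <- par(mfrow = c(n_rows, n_cols), mar = c(4, 4, 1, 1))
for (p in predictors) {
  plot(toy[[p]], toy$life_exp, xlab = p, ylab = "life_exp")
}
par(old_par)
```

With more predictors, only `n_cols` needs adjusting; the number of rows follows from it.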

When I produced this figure, it was clear to me that gdp_capita, hospital_beds and pop did not show a linear relationship with life_exp. This makes sense. For example, a lot of countries have low gdp and a few have much higher gdp, so the values are strongly skewed. An easy solution for that is to log (natural log) these predictors. That’s what I did in the figure below. Much better. So from now on, use the logged values of gdp_capita, hospital_beds and pop in your analysis.
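To see why logging helps with a skewed predictor, here is a small sketch on simulated data. The variables `x` and `y` are invented for the illustration: `x` is drawn from a skewed (log-normal) distribution and `y` is constructed to depend linearly on `log(x)`, so this demonstrates the idea rather than saying anything about the real data.

```r
# Simulated data only: x is right-skewed, y depends linearly on log(x).
set.seed(10)
x <- rlnorm(100, meanlog = 8, sdlog = 1.5)
y <- 50 + 5 * log(x) + rnorm(100, sd = 1)

# log() in R is the natural log by default.
log_x <- log(x)

# Linear correlation with the raw predictor versus the logged predictor.
cor_raw <- cor(x, y)
cor_log <- cor(log_x, y)

# Side-by-side scatterplots make the difference in shape visible.
old_par <- par(mfrow = c(1, 2))
plot(x, y, xlab = "x (raw)", ylab = "y")
plot(log_x, y, xlab = "log(x)", ylab = "y")
par(old_par)
```

In a model formula the same transformation can be written inline, for example `lm(y ~ log(x))`, which avoids creating a separate column.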

Figure 1. Life expectancy replotted against each predictor variable. Some predictor variables have been logged.

Question 2: From my figure above, pick three relationships (except log(gdp); see below). Describe these relationships in words and give an explanation/hypothesis for why each relationship exists (18 marks). For example: “There is an increasing linear relationship between life expectancy and log(gdp). It could be expected that the higher living standards (e.g. modern sanitation) and modern medicine available in rich countries would bring an increase in life expectancy.”

Question 3: Checking for collinearity. Create the full linear model (life expectancy against all twelve predictors with no interactions). Remember to log the predictors you need to log. It makes sense that some of these variables are collinear. Use the cut-offs you have been taught (vif > 5, cor > 0.8) to identify collinear predictors.

A simple pipeline:

Identify all variables with vif > 5.
For each collinear variable, look at the correlations to find out what it is collinear with (cor > 0.8; it will be one or more of the other collinear variables).
In each matched set of collinear variables, choose one variable to stay in the model. Which one should you keep? Be sensible here. For example, suppose you find that gdp and hospital beds are collinear. I would keep gdp, as it clearly seems to be the more fundamental cause. It won’t always be this simple.
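The pipeline above can be sketched on simulated data as follows. This is an assumption-laden illustration, not the taught method: the helper `vif_manual` is a hypothetical function written for this example that computes the variance inflation factor from its definition, 1 / (1 - R²), where R² comes from regressing each predictor on the others. The toy predictors `x_a`, `x_b` and `x_c` are invented, with `x_b` deliberately built to be collinear with `x_a`.

```r
# Simulated data: x_b is constructed to be strongly collinear with x_a.
set.seed(2)
n <- 100
x_a <- rnorm(n)
x_b <- x_a + rnorm(n, sd = 0.2)
x_c <- rnorm(n)
y <- 1 + 2 * x_a - x_c + rnorm(n)
toy <- data.frame(y, x_a, x_b, x_c)

# Hypothetical helper: VIF for each predictor, computed from its definition.
# Each predictor is regressed on all the others; VIF = 1 / (1 - R^2).
vif_manual <- function(data, response) {
  preds <- setdiff(names(data), response)
  sapply(preds, function(p) {
    others <- setdiff(preds, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Step 1: flag predictors with VIF above the cut-off of 5.
vifs <- vif_manual(toy, "y")
high_vif <- names(vifs)[vifs > 5]

# Step 2: inspect pairwise correlations among the predictors (cut-off 0.8).
cors <- cor(toy[, c("x_a", "x_b", "x_c")])

# Step 3: refit, keeping only one member of the collinear pair
# (which one to keep is a judgement call; x_a is kept here arbitrarily).
reduced <- lm(y ~ x_a + x_c, data = toy)
```

Printing `vifs` and `cors` gives the numbers that would be reported alongside the decision about which variable to keep.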

Marks

2 marks for a correctly written full model

2 marks for identifying collinear predictors (report vif and cor)

2 marks for a correctly written reduced model. The increased marks are for correctly using the collinear information to make this new model.

Question 4: Model selection. You are going to do a simple model selection here on your new model (with the variables identified in Question 3 removed). Using stepwise selection based on p-values (threshold 0.05), identify which predictor variables to remove. 3 marks for presenting a table of the step selection summary. 2 marks for identifying which predictors to remove. 5 marks for correctly coding this final model.
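One possible reading of p-based stepwise selection is backward elimination: repeatedly drop the least significant predictor until every remaining one has p at or below 0.05. The sketch below does this by hand with base R's `drop1()` on simulated data. It is an assumption that this matches the procedure taught on the course, and the variables `x1` to `x3` are invented, with only `x1` built to affect `y`.

```r
# Simulated data: only x1 genuinely influences y.
set.seed(3)
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 + 3 * toy$x1 + rnorm(n)

model <- lm(y ~ x1 + x2 + x3, data = toy)
alpha <- 0.05

repeat {
  # F-test for dropping each term in turn; the first row is "<none>".
  tab <- drop1(model, test = "F")
  pvals <- tab[["Pr(>F)"]][-1]
  terms_now <- rownames(tab)[-1]

  # Stop when nothing is left to drop or every term is significant.
  if (length(pvals) == 0 || max(pvals) <= alpha) break

  # Remove the term with the largest p-value and refit.
  worst <- terms_now[which.max(pvals)]
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

# Terms retained in the final model.
kept <- attr(terms(model), "term.labels")
```

Recording `worst` and its p-value at each pass of the loop would give the raw material for a step summary table.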

Question 5: Assumption checking for your final model. You now have a final model (after model selection). Here is a nice recap of how to use linear model diagnostic plots to check your assumptions. Show the plots and briefly explain why your data fits or doesn’t fit the assumptions.

Marks

The four diagnostic plots (4 marks)

Linearity? (1 mark)

Normality? (1 mark)

Homoscedasticity? (1 mark)

Any outliers to be concerned about? (1 mark) Remember, if you cannot see a Cook’s distance line, that is because it is not anywhere near your data.
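For reference, the four standard diagnostic plots for a linear model can be drawn in base R by calling `plot()` on the fitted object. The sketch below uses a simulated model (`fit`, on invented variables `x` and `y`), so the plots themselves say nothing about the real data.

```r
# Simulated data and model, purely to show how the plots are produced.
set.seed(4)
toy <- data.frame(x = rnorm(50))
toy$y <- 1 + 2 * toy$x + rnorm(50)
fit <- lm(y ~ x, data = toy)

# plot() on an lm object draws, by default: Residuals vs Fitted,
# Normal Q-Q, Scale-Location, and Residuals vs Leverage.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Cook's distances can also be inspected numerically.
cooks <- cooks.distance(fit)
largest_cook <- max(cooks)
```

Looking at `largest_cook` directly is a useful cross-check when no Cook’s distance contour is visible on the leverage plot.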

Question 6: Describing your model. Report the results of your model. This includes significance statistics (5 marks), estimate sizes and errors (5 marks) and, most importantly, a narrative explanation for each significant predictor, similar to what you did in Question 2 (35 marks).
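As a reminder of where these numbers live in R, the sketch below pulls the estimates, standard errors, test statistics and p-values out of a fitted model. The model `fit` and its variables are simulated for the example; the confidence intervals from `confint()` are an optional extra, not something the question asks for.

```r
# Simulated model, used only to show where the reported quantities come from.
set.seed(5)
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
toy$y <- 3 + 1.5 * toy$x1 + rnorm(60)
fit <- lm(y ~ x1 + x2, data = toy)

# Coefficient table: Estimate, Std. Error, t value and Pr(>|t|) per term.
coef_table <- summary(fit)$coefficients

# Overall fit statistics.
r_squared <- summary(fit)$r.squared
f_stat <- summary(fit)$fstatistic

# 95% confidence intervals for each estimate (optional extra).
ci <- confint(fit, level = 0.95)
```

`coef_table` is a plain matrix, so it can be rounded and passed to a table function for presentation in a report.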

Question 7: From Figure 1, it is pretty clear there is a positive relationship between life expectancy and log(hospital beds). But imagine that in my final model there is a significant negative relationship. Can you think of an explanation for this (10 marks)? Don’t worry if you didn’t find that; it’s a hypothetical.