General instructions
In this project, you will be guided through the analysis of a dataset containing health data from 157 countries. Your main question is: what affects life expectancy throughout the world? You must submit a report on Turnitin.
You need to present your analysis in a research report. This means that you need to write a ‘body text’ in coherent prose and well-structured paragraphs that guides the reader through your statistical analyses and their interpretation. You need to refer to figures and tables in the body text, using figure / table references (Figure 1, Table 1 etc.). The formatting of the body text paragraphs needs to be clearly distinct from the code snippets. Figures and tables need clear captions, again distinct from the body text.
As well as the answer to each question, you must include the code snippet that generated this answer. This snippet must be fully commented. You will lose marks for not including these commented snippets. This is both to ensure you understand what you are doing and as a plagiarism check. With regard to code snippets, do not use screenshots. Rather, format them differently from the body text. How you do this is up to you, as long as it is clear (there are some ways to do it in Word, though perhaps overly complicated).
Embedded in each question is the mark available for it.
Each question will be marked independently, that is, if you get something wrong early on you will lose marks at that point, but later analysis based on this answer will be marked as correct if the correct analytical technique was applied, even though the answer will be wrong.
You will lose marks for poor presentation and for generally not explaining yourself. Expecting us to just read the R output and understand how you got from a to b is a recipe for losing marks. Creating figures with no figure captions etc. will lose you marks. Raw R console output dumps will lose you marks; R output needs to be presented in proper tables etc.
You will submit this through Turnitin. The deadline is the 28th of April 2025 at 10.00 a.m.
Getting help
This is the end of module assessment for BS2004. You should not need extra help. The questions should be completely clear and use techniques which we taught you during the course. If you think there is a mistake in a question (there almost certainly is not), email ebm3@le.ac.uk. Otherwise no other help is offered (as it is the end of module assessment).
The data
The data comes from the World Bank (life_expectancy_data.csv). As well as life expectancy (the response variable life_exp), I scanned the database for various predictors that I thought would be interesting. As well as country as a label, I chose 12 predictor variables.
- pollution: Levels of pollution, measured as PM2.5
- gdp_capita: Per capita gross domestic product in US dollars (how rich a country is)
- alcohol: Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
- cause_of_death_communicable: How many deaths were caused by communicable diseases (percentage of total)
- dpt_im: Percentage of babies vaccinated against diphtheria, pertussis (whooping cough), and tetanus
- meas_im: Percentage of babies vaccinated against measles
- pol_im: Percentage of babies vaccinated against polio
- hospital_beds: Number of hospital beds per 1000 people
- diabetes: Diabetes prevalence (percentage of population)
- overweight: Prevalence of overweight adults (percentage of population)
- fertility_rate: Total births per woman
- pop: Total population
The questions
Question 1: Plot (scatterplot) life expectancy (y-axis) against each of the raw predictor variables (6 Marks total). Note that I said raw, not the logged values that I use in Figure 1. Hint: Make your life easy by combining all the graphs in a single figure using R.
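As a rough illustration of the hint, one way to place several scatterplots in a single figure in base R is par(mfrow = ...). The sketch below is not a model answer: the data frame toy is simulated and hypothetical, it only borrows a few column names from the list above, and your own code must use the real data and all 12 predictors.

```r
# Illustrative sketch only: 'toy' is simulated data, NOT the World Bank dataset.
set.seed(1)
n <- 157
toy <- data.frame(
  life_exp  = rnorm(n, mean = 70, sd = 8),  # made-up response values
  pollution = runif(n, 5, 90),              # made-up predictor values
  alcohol   = runif(n, 0, 15),
  diabetes  = runif(n, 2, 20)
)

# Treat every column except the response as a predictor
predictors <- setdiff(names(toy), "life_exp")

# Split the plotting area into a 1 x 3 grid (one panel per predictor),
# keeping the previous settings so they can be restored afterwards
old_par <- par(mfrow = c(1, length(predictors)))

# Draw one scatterplot per predictor, with life_exp on the y-axis
for (p in predictors) {
  plot(toy[[p]], toy$life_exp, xlab = p, ylab = "life_exp", main = p)
}

# Restore the original plotting settings
par(old_par)
```

With 12 predictors, a grid such as c(3, 4) would give one panel each; the exact layout is a presentation choice.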
When I produced the figure requested in Question 1, it was clear to me that gdp_capita, hospital_beds and pop did not show a linear relationship with life_exp. This makes sense. For example, a lot of countries have a low GDP and some have a much higher GDP. An easy solution is to take the natural log of these predictors. That’s what I did in the figure below. Much better. So from now on, use the logged values of gdp_capita, hospital_beds and pop in your analysis.
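A minimal sketch of this log transformation is given here, on simulated data. The data frame toy and the log_ column prefix are assumptions made for illustration; they are not taken from the dataset or from the code behind the figure. In R, log() returns the natural log by default.

```r
# Illustrative sketch only: 'toy' holds simulated, right-skewed values,
# NOT the real gdp_capita, hospital_beds and pop columns.
set.seed(2)
n <- 157
toy <- data.frame(
  gdp_capita    = rlnorm(n, meanlog = 8.5, sdlog = 1.4),
  hospital_beds = rlnorm(n, meanlog = 0.8, sdlog = 0.8),
  pop           = rlnorm(n, meanlog = 16,  sdlog = 1.8)
)

# The three predictors to be logged
to_log <- c("gdp_capita", "hospital_beds", "pop")

# Add a natural-log version of each predictor as a new column;
# the 'log_' prefix is a hypothetical naming choice
for (v in to_log) {
  toy[[paste0("log_", v)]] <- log(toy[[v]])
}

# Compare the spread of the raw and logged values
summary(toy$gdp_capita)
summary(toy$log_gdp_capita)
```

Keeping the raw columns alongside the logged ones makes it easy to check the transformation. Note that log() returns -Inf for a value of 0, so it is worth checking the real columns for zeros first.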