Medical Insurance Charges (Questions)
1 Overview
This notebook uses a real healthcare-related dataset on medical insurance charges.
Your task is to analyze these charges using regression models.
1.1 Your Task
- Import and load a dataset in R.
- Identify outcome and predictor variables.
- Fit simple and multiple linear regression models & carry out forward selection based on p-values.
- Interpret coefficients, p-values, and model fit statistics.
- Make predictions from a fitted regression model.
2 Load the data
2.1 Question 1
Go to https://www.kaggle.com/datasets/mirichoi0218/insurance/data and download the data as a CSV file.
Load the file into R and print the first few rows of the data.
Click on this link for answers to question 1: https://rpubs.com/Samuelllim/Q1
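A minimal sketch of loading and previewing a CSV file in R. The file name insurance.csv is an assumption about how the Kaggle download was saved. So that the snippet runs without the download, it first writes a three-row placeholder file with made-up values; the column names are the ones this notebook refers to.

```r
# Placeholder file so the sketch runs without the Kaggle download.
# The three rows are made-up values, not rows from the real dataset.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age,sex,bmi,children,smoker,region,charges",
  "25,female,22.5,0,no,northwest,3000.50",
  "47,male,31.2,2,yes,southeast,24000.75",
  "36,female,28.0,1,no,southwest,6100.00"
), csv_path)

# With the real download, pass the path to that file instead,
# e.g. read.csv("insurance.csv") (assumed file name).
insurance <- read.csv(csv_path)

head(insurance)  # first rows (at most six)
str(insurance)   # column names and types
```

With the real file, only the path passed to read.csv() changes.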
3 Data preparation
3.1 Question 2a
Which variables are categorical?
3.2 Question 2b
What should you do with the data type of these variables before fitting a regression model?
3.3 Question 2c
Which is the response variable?
Click on this link for answers to question 2: https://rpubs.com/Samuelllim/Q2
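One possible way to prepare text columns before fitting a regression is sketched below. The small data frame is a made-up stand-in with a few of the dataset's column names; with the real data the same two lines would be applied to the loaded data frame.

```r
# Made-up stand-in rows; read.csv() in R >= 4.0 also returns text columns
# as character, which is the situation this sketch assumes.
insurance <- data.frame(
  age    = c(19, 45, 33, 52),
  sex    = c("female", "male", "male", "female"),
  smoker = c("yes", "no", "no", "yes"),
  region = c("southwest", "southeast", "northwest", "northeast"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor.
cat_vars <- names(insurance)[vapply(insurance, is.character, logical(1))]
insurance[cat_vars] <- lapply(insurance[cat_vars], factor)

str(insurance)
levels(insurance$smoker)  # the first level is the reference category in lm()
```

Factor levels are sorted alphabetically by default, so "no" becomes the reference level of smoker, which is why a fitted model reports a coefficient named smokeryes.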
4 Simple linear regression
4.1 Question 3a
Fit a simple linear regression model with charges as the response and bmi as the predictor.
4.2 Question 3b
Interpret the coefficient of bmi.
Click on this link for answers to question 3: https://rpubs.com/Samuelllim/Q3
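The fit-and-inspect workflow for a simple linear regression can be sketched as follows. The data here are simulated (an assumption made so the code runs on its own), so the printed numbers say nothing about the real dataset.

```r
# Simulated stand-in data: bmi and charges only, with an arbitrary true slope.
set.seed(1)
n <- 200
insurance <- data.frame(bmi = round(rnorm(n, mean = 30, sd = 6), 1))
insurance$charges <- 1000 + 400 * insurance$bmi + rnorm(n, sd = 5000)

# Simple linear regression: charges as response, bmi as predictor.
fit_bmi <- lm(charges ~ bmi, data = insurance)

summary(fit_bmi)         # coefficients, p-values, R-squared
coef(fit_bmi)["bmi"]     # estimated change in mean charges per one-unit bmi increase
confint(fit_bmi, "bmi")  # 95% confidence interval for that slope
```

The slope is read as the estimated difference in expected charges between two people whose bmi differs by one unit.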
5 Multiple linear regression
5.1 Question 4a
Fit a multiple linear regression model with charges as the response and age, sex, bmi, children, smoker and region as predictors.
5.2 Question 4b
Using the fitted model, interpret the coefficient of smokeryes.
5.3 Question 4c
Using the fitted model, interpret the coefficient of age.
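A sketch of the multiple regression fit, again on simulated stand-in data (the effect sizes in the simulation are arbitrary assumptions, not estimates from the real dataset). It shows where coefficient names such as smokeryes come from.

```r
# Simulated stand-in data with the dataset's column names.
set.seed(1)
n <- 200
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = sample(c("female", "male"), n, replace = TRUE),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)),
  region   = sample(c("northeast", "northwest", "southeast", "southwest"),
                    n, replace = TRUE)
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

# Treat the categorical predictors as factors.
cat_vars <- c("sex", "smoker", "region")
insurance[cat_vars] <- lapply(insurance[cat_vars], factor)

fit_full <- lm(charges ~ age + sex + bmi + children + smoker + region,
               data = insurance)
summary(fit_full)

# smokeryes: expected difference in charges, smoker vs. non-smoker,
#            holding the other predictors fixed.
# age:       expected change in charges per additional year of age,
#            holding the other predictors fixed.
coef(fit_full)[c("smokeryes", "age")]
```

Each factor contributes one coefficient per non-reference level, which is why region appears as three separate terms.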
Click on this link for answers to question 4: https://rpubs.com/Samuelllim/Q4
6 Model Selection - Forward selection based on p-values
In this section, use forward selection based on p-values.
6.1 Question 5a
Fit one-predictor models for each candidate variable and compare their p-values.
Which variable should enter first?
6.2 Question 5b
Write down your final selected model.
Click on this link for answers to question 5: https://rpubs.com/Samuelllim/Q5
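One way to automate forward selection by p-value is sketched below, using add1() with an F-test at each step. The entry threshold of 0.05 and the simulated data are assumptions; the procedure taught in class may use a different rule or be carried out by hand.

```r
# Simulated stand-in data (arbitrary effect sizes, not the real dataset).
set.seed(1)
n <- 200
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

alpha     <- 0.05  # assumed entry threshold
remaining <- c("age", "sex", "bmi", "children", "smoker", "region")
selected  <- character(0)
current   <- lm(charges ~ 1, data = insurance)  # start from the intercept-only model

while (length(remaining) > 0) {
  # F-test for adding each remaining term to the current model.
  tab   <- add1(current, scope = remaining, test = "F")
  tab   <- tab[-1, , drop = FALSE]  # drop the "<none>" row
  pvals <- tab[["Pr(>F)"]]
  best  <- which.min(pvals)
  if (pvals[best] >= alpha) break   # no remaining term is significant: stop
  best_term <- rownames(tab)[best]
  current   <- update(current, as.formula(paste(". ~ . +", best_term)))
  selected  <- c(selected, best_term)
  remaining <- setdiff(remaining, best_term)
}

selected          # terms in the order they entered
formula(current)  # final selected model
```

In the first pass the F-test for each term is the overall test of the corresponding one-predictor model, so that pass matches the comparison asked for in Question 5a.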
7 Prediction
7.1 Question 6
Predict the insurance charges for the following person:
- age = 40
- sex = female
- bmi = 30
- children = 2
- smoker = no
- region = southeast
Click on this link for answers to question 6: https://rpubs.com/Samuelllim/Q6
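Prediction for a new observation can be sketched with predict() and a one-row data frame. The model below is fitted to simulated stand-in data (an assumption), so the predicted value is not the answer for the real dataset; the full model is used here for illustration, whereas the intended model may be the one chosen by forward selection.

```r
# Simulated stand-in data (arbitrary effect sizes, not the real dataset).
set.seed(1)
n <- 200
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  20000 * (insurance$smoker == "yes") + rnorm(n, sd = 3000)

fit_full <- lm(charges ~ age + sex + bmi + children + smoker + region,
               data = insurance)

# The person described in Question 6; column names must match the model's predictors.
new_person <- data.frame(
  age = 40, sex = "female", bmi = 30,
  children = 2, smoker = "no", region = "southeast"
)

pred <- predict(fit_full, newdata = new_person)
pred

# Point prediction together with a 95% prediction interval.
predict(fit_full, newdata = new_person, interval = "prediction")
```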
8 Binary Regression
Now that we have gone through linear regression, we will redefine the objective of our study.
You are tasked with analyzing whether a person is a smoker based on their attributes.
8.1 Question 7a
- Find the model that best predicts whether a person is a smoker. Feel free to use any of the binary regression models that we have covered in class (logistic regression, probit regression).
8.2 Question 7b
- Interpret the coefficients of the fitted model. Hint: Look at the sign of each coefficient and comment on how the predictor changes the probability of being a smoker.
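A sketch of fitting and comparing a logit and a probit model with glm(). The data are simulated under assumed effect sizes, and comparing by AIC is one possible criterion rather than necessarily the one used in class.

```r
# Simulated stand-in data; smokers and non-smokers overlap in charges
# so that the binary models are well behaved.
set.seed(1)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.75, 0.25)),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  8000 * (insurance$smoker == "yes") + rnorm(n, sd = 6000)

# With levels c("no", "yes"), glm() models the probability of "yes".
insurance$smoker <- factor(insurance$smoker, levels = c("no", "yes"))

rhs <- smoker ~ age + sex + bmi + children + region + charges
logit_fit  <- glm(rhs, family = binomial(link = "logit"),  data = insurance)
probit_fit <- glm(rhs, family = binomial(link = "probit"), data = insurance)

AIC(logit_fit, probit_fit)  # lower AIC suggests the better-fitting model
summary(logit_fit)

# A positive coefficient raises the probability of being a smoker,
# a negative one lowers it; the size of the change needs marginal effects.
sign(coef(logit_fit))
```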
8.3 Question 7c
- Compute the marginal effects of the fitted model at the mean of the data and interpret the results.
8.4 Question 7d
- Compute the average partial effect of each predictor and interpret the results.
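Marginal effects at the mean and average partial effects for a logit model can be computed by hand, as sketched below on simulated stand-in data. The sketch uses the derivative formula for every coefficient, which treats dummy variables as if they were continuous; that is an approximation, and a discrete-difference calculation or a dedicated package may be what the course intends.

```r
# Simulated stand-in data (assumed effect sizes, not the real dataset).
set.seed(1)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.75, 0.25)),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  8000 * (insurance$smoker == "yes") + rnorm(n, sd = 6000)
insurance$smoker <- factor(insurance$smoker, levels = c("no", "yes"))

logit_fit <- glm(smoker ~ age + sex + bmi + children + region + charges,
                 family = binomial(link = "logit"), data = insurance)

X    <- model.matrix(logit_fit)  # design matrix, including the intercept column
beta <- coef(logit_fit)

# Marginal effect of x_j in a logit model: f(x'beta) * beta_j,
# where f is the logistic density (dlogis).
# MEM: evaluate f at the sample mean of the regressors.
x_bar <- colMeans(X)
mem   <- dlogis(sum(x_bar * beta)) * beta[-1]

# APE: evaluate f at every observation, then average.
ape <- mean(dlogis(drop(X %*% beta))) * beta[-1]

round(cbind(MEM = mem, APE = ape), 6)
```

For a probit model the same code applies with dnorm() in place of dlogis().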
8.5 Question 7e
- Using your fitted model, predict whether the following person is a smoker:
- age = 40
- sex = female
- bmi = 30
- children = 2
- region = southeast
- charges = 8427
Click on this link for answers to question 7: https://rpubs.com/Samuelllim/Q7
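A predicted probability from a binary regression model can be obtained with predict(..., type = "response"), as sketched below. The model is fitted to simulated stand-in data, so the resulting probability is illustrative only, and the 0.5 classification cutoff is an assumption.

```r
# Simulated stand-in data (assumed effect sizes, not the real dataset).
set.seed(1)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.75, 0.25)),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
insurance$charges <- 250 * insurance$age + 300 * insurance$bmi +
  8000 * (insurance$smoker == "yes") + rnorm(n, sd = 6000)
insurance$smoker <- factor(insurance$smoker, levels = c("no", "yes"))

logit_fit <- glm(smoker ~ age + sex + bmi + children + region + charges,
                 family = binomial(link = "logit"), data = insurance)

# The person described in Question 7e.
new_person <- data.frame(
  age = 40, sex = "female", bmi = 30,
  children = 2, region = "southeast", charges = 8427
)

# type = "response" returns a probability; the default returns the log-odds.
prob_smoker <- predict(logit_fit, newdata = new_person, type = "response")
prob_smoker

# Classify with an assumed cutoff of 0.5.
ifelse(prob_smoker > 0.5, "smoker", "non-smoker")
```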