The work of a data scientist goes way beyond performing basic tasks such as analyzing and visualizing data. Now, I will explore another exciting tool of data analysis: predictions and estimates
Get to know that data
This data corresponds to the percentage of the population with access to improved sanitation facilities and life expectancy (in years) in 2014 for 183 countries in the world. It was extracted from The World Bank website and pre-processed to eliminate NAs and other issues.
Here, we examine two variables: Access to Sanitation and Life Expectancy.
Import the Dataset
my_data <- read.csv("https://ibm.box.com/shared/static/q0gt7rsj6z5p3fld163n70i65id3awz3.csv")
Explore that Data
head(my_data)
## Country Access_to_Sanitation Life_Expectancy
## 1 Afghanistan 31.8 60.37446
## 2 Albania 93.2 77.83046
## 3 Algeria 87.4 74.80810
## 4 Angola 51.1 52.26688
## 5 Argentina 96.1 76.15861
## 6 Armenia 89.5 74.67571
str(my_data)
## 'data.frame': 183 obs. of 3 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ Access_to_Sanitation: num 31.8 93.2 87.4 51.1 96.1 89.5 97.7 100 100 87.9 ...
## $ Life_Expectancy : num 60.4 77.8 74.8 52.3 76.2 ...
Just like I mentioned before, it looks like the dataframe contains 183 observations of 3 variables: Country Name, Access to Sanitation Facilities (in % of population) and Life Expectancy (in years).
Visualizing the Data**
plot(my_data$Access_to_Sanitation, my_data$Life_Expectancy, xlab = "Access to Sanitation (% of population)",
ylab = "Life Expectancy (years)", col = "aquamarine4", lwd = 2)
What if we compare the values of each country for these two variables?
Here, we'll just make a simple bar plot to see what are the bottom 20 countries for Sanitation and Life Expectancy, respectively.
You can use the function order() to sort the data.
my_data <- my_data[order(my_data["Access_to_Sanitation"]),]
## Warning in xtfrm.data.frame(x): cannot xtfrm data frames
barplot(my_data[c(1:20),"Access_to_Sanitation"],
names.arg = as.vector(my_data[c(1:20),"Country"]),
col = "rosybrown", las = 2,
ylab = "Access to Sanintation (% of Population)")
# Order rows increasingly by Life Expectancy
my_data <- my_data[order(my_data["Life_Expectancy"]),]
## Warning in xtfrm.data.frame(x): cannot xtfrm data frames
barplot(my_data[c(1:20),"Life_Expectancy"],
names.arg = as.vector(my_data[c(1:20),"Country"]),
col = "cadetblue", las = 2,
ylab = "Life Expectancy (years)")
Building a Linear Model
Looking at the plots, were you able to identify any trends? Is there a relationship between Access to Sanitation and Life Expectancy? If so, which is the independent (explanatory) variable and which is the dependent (outcome) variable?
Since access to sanitation is likely to affect one's life expectancy, in our linear model, Access to Sanitation is our independent variable and Life Expectancy is our dependent variable.
Creating the Model
# Create a vector containing the sanitation data
sanitation <- as.vector(my_data$Access_to_Sanitation)
# Build a linear model
model <- lm(Life_Expectancy ~ sanitation, data=my_data)
# Get some information about the model
summary(model)
##
## Call:
## lm(formula = Life_Expectancy ~ sanitation, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4453 -2.4513 0.3395 3.5113 8.7049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.51021 0.88544 60.43 <2e-16 ***
## sanitation 0.24122 0.01129 21.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.488 on 181 degrees of freedom
## Multiple R-squared: 0.7162, Adjusted R-squared: 0.7147
## F-statistic: 456.9 on 1 and 181 DF, p-value: < 2.2e-16
model
##
## Call:
## lm(formula = Life_Expectancy ~ sanitation, data = my_data)
##
## Coefficients:
## (Intercept) sanitation
## 53.5102 0.2412
{
# Plot the previous scatter plot
plot(my_data$Access_to_Sanitation, my_data$Life_Expectancy, xlab = "Access to Sanitation (% of Population)",
ylab = "Life Expectancy (years)", col = "blue", lwd = 2)
# Fit a line in the plot
abline(model, col = "red")
}
Using the model for Estimations/Predictions
Now that we have a model, we can estimate the Life Expectancy for a given value of Access to Sanitation Facilities that isn’t in the dataset. For this, we can use the function predict() and our linear model.
# Tips: using point that fall out of the term interval in the training dataset may lead to errors of prediction that are much larger than expected.
summary(my_data$Access_to_Sanitation)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.70 47.70 87.40 72.74 97.35 100.00
# Create points to predict, use the point that fall between 6.70 and 100.00
pointsToPredict <- data.frame(sanitation = c(10, 42))
#Use predict() to compute our predictions!
predictionWithInterval <- predict(model, pointsToPredict, interval = 'prediction')
predictionWithInterval
## fit lwr upr
## 1 55.92236 46.93336 64.91136
## 2 63.64124 54.73514 72.54735
Let’s add our estimations to the previous plot!
{
# Plot the previous scatter plot
plot(my_data$Access_to_Sanitation, my_data$Life_Expectancy, xlab = "Access to Sanitation (% of Population)",
ylab = "Life Expectancy (years)", col = "blue", ylim=c(45, 85), lwd = 2)
# Add the new predicted points!
points(pointsToPredict$sanitation, predictionWithInterval[,"fit"], col = "red", lwd = 4)
points(pointsToPredict$sanitation, predictionWithInterval[,"lwr"], col = "firebrick4", lwd = 4)
points(pointsToPredict$sanitation, predictionWithInterval[,"upr"], col = "firebrick4", lwd = 4)
legend("topleft",legend = c("Dataset Points", "Prediction Points"), fill = c("blue","red"), bty = "n")
# Fit a line in the plot
abline(model, col = "red")
}
END