The work of a data scientist goes way beyond performing basic tasks such as analyzing and visualizing data. Now, I will explore another exciting tool of data analysis: predictions and estimates

Get to know that data

This data corresponds to the percentage of the population with access to improved sanitation facilities and life expectancy (in years) in 2014 for 183 countries in the world. It was extracted from The World Bank website and pre-processed to eliminate NAs and other issues.

Here, we examine two variables: Access to Sanitation and Life Expectancy.

Import the Dataset

my_data <- read.csv("https://ibm.box.com/shared/static/q0gt7rsj6z5p3fld163n70i65id3awz3.csv")

Explore that Data

head(my_data)
##       Country Access_to_Sanitation Life_Expectancy
## 1 Afghanistan                 31.8        60.37446
## 2     Albania                 93.2        77.83046
## 3     Algeria                 87.4        74.80810
## 4      Angola                 51.1        52.26688
## 5   Argentina                 96.1        76.15861
## 6     Armenia                 89.5        74.67571
str(my_data)
## 'data.frame':    183 obs. of  3 variables:
##  $ Country             : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ Access_to_Sanitation: num  31.8 93.2 87.4 51.1 96.1 89.5 97.7 100 100 87.9 ...
##  $ Life_Expectancy     : num  60.4 77.8 74.8 52.3 76.2 ...
Just like I mentioned before, it looks like the dataframe contains 183 observations of 3 variables: Country Name, Access to Sanitation Facilities (in % of population) and Life Expectancy (in years).

Visualizing the Data**

plot(my_data$Access_to_Sanitation, my_data$Life_Expectancy, xlab = "Access to Sanitation (% of population)",
     ylab = "Life Expectancy (years)", col = "aquamarine4", lwd = 2)

What if we compare the values of each country for these two variables?

Here, we'll just make a simple bar plot to see what are the bottom 20 countries for Sanitation and Life Expectancy, respectively.

You can use the function order() to sort the data.
my_data <- my_data[order(my_data["Access_to_Sanitation"]),]
## Warning in xtfrm.data.frame(x): cannot xtfrm data frames
barplot(my_data[c(1:20),"Access_to_Sanitation"],
        names.arg = as.vector(my_data[c(1:20),"Country"]),
        col = "rosybrown", las = 2,
       ylab = "Access to Sanintation (% of Population)")

# Order rows increasingly by Life Expectancy
my_data <- my_data[order(my_data["Life_Expectancy"]),]
## Warning in xtfrm.data.frame(x): cannot xtfrm data frames
barplot(my_data[c(1:20),"Life_Expectancy"],
        names.arg = as.vector(my_data[c(1:20),"Country"]),
        col = "cadetblue", las = 2,
       ylab = "Life Expectancy (years)")

Building a Linear Model

Looking at the plots, were you able to identify any trends? Is there a relationship between Access to Sanitation and Life Expectancy? If so, which is the independent (explanatory) variable and which is the dependent (outcome) variable?

Since access to sanitation is likely to affect one's life expectancy, in our linear model, Access to Sanitation is our independent variable and Life Expectancy is our dependent variable.

Creating the Model

# Create a vector containing the sanitation data
sanitation <- as.vector(my_data$Access_to_Sanitation)

# Build a linear model
model <- lm(Life_Expectancy ~ sanitation, data=my_data)

# Get some information about the model
summary(model)
## 
## Call:
## lm(formula = Life_Expectancy ~ sanitation, data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.4453  -2.4513   0.3395   3.5113   8.7049 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.51021    0.88544   60.43   <2e-16 ***
## sanitation   0.24122    0.01129   21.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.488 on 181 degrees of freedom
## Multiple R-squared:  0.7162, Adjusted R-squared:  0.7147 
## F-statistic: 456.9 on 1 and 181 DF,  p-value: < 2.2e-16
model
## 
## Call:
## lm(formula = Life_Expectancy ~ sanitation, data = my_data)
## 
## Coefficients:
## (Intercept)   sanitation  
##     53.5102       0.2412
{
# Plot the previous scatter plot
plot(my_data$Access_to_Sanitation, my_data$Life_Expectancy, xlab = "Access to Sanitation (% of Population)",
     ylab = "Life Expectancy (years)", col = "blue", lwd = 2)

# Fit a line in the plot
abline(model, col = "red")
}

Using the model for Estimations/Predictions

Now that we have a model, we can estimate the Life Expectancy for a given value of Access to Sanitation Facilities that isn’t in the dataset. For this, we can use the function predict() and our linear model.

# Tips: using point that fall out of the term interval in the training dataset may lead to errors of prediction that are much larger than expected. 
summary(my_data$Access_to_Sanitation)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.70   47.70   87.40   72.74   97.35  100.00
# Create points to predict, use the point that fall between 6.70 and 100.00
pointsToPredict <- data.frame(sanitation = c(10, 42))

#Use predict() to compute our predictions!
predictionWithInterval <- predict(model, pointsToPredict, interval = 'prediction')
predictionWithInterval
##        fit      lwr      upr
## 1 55.92236 46.93336 64.91136
## 2 63.64124 54.73514 72.54735

Let’s add our estimations to the previous plot!

{
# Plot the previous scatter plot
plot(my_data$Access_to_Sanitation, my_data$Life_Expectancy, xlab = "Access to Sanitation (% of Population)",
     ylab = "Life Expectancy (years)", col = "blue", ylim=c(45, 85), lwd = 2)
    
# Add the new predicted points!
points(pointsToPredict$sanitation, predictionWithInterval[,"fit"], col = "red", lwd = 4)
points(pointsToPredict$sanitation, predictionWithInterval[,"lwr"], col = "firebrick4", lwd = 4)
points(pointsToPredict$sanitation, predictionWithInterval[,"upr"], col = "firebrick4", lwd = 4)
    
legend("topleft",legend = c("Dataset Points", "Prediction Points"), fill = c("blue","red"), bty = "n")
    
# Fit a line in the plot
abline(model, col = "red")
}

END