AI and Machine Learning for Public health

EXERCISE 1

Author
Affiliation

Bongani Ncube(3002164)

University Of the Witwatersrand (School of Public Health)

Published

September 1, 2025

Keywords

Survival analysis, Statistical modeling, Proportional hazards

Installing and loading Packages

Introduction

Goal for Today

Introduce you to R and Rstudio.

What Are R and Rstudio?

What Is R?

R is an open source programming language with origins in C and FORTRAN. Advantages:

  • Flexibility
  • It’s free (and open source)!
  • Ease of handling advanced computational models
  • Ease of handling multiple data sets in one session
  • Higher demand in industries.

Getting Started in R and Rstudio

Let’s get started in Rstudio first. Select “Tools” in the menu.

  • Scroll to “Global Options” (should be at the bottom)
  • On the pop-up, select “pane layout.”
  • Rearrange so that “Source” is top left, “Console” is top right”, and the files/plots/packages/etc. is the bottom right.
  • Save

Getting Started in R and Rstudio

Hit Ctrl-Shift-N (Cmd-Shift-N if you’re on a Mac) to open up a new script.

  • Minimize the “Environment/History/Connections/Git” pane in the bottom left.
  • Adjust the console output to your liking.

This should maximize your Rstudio experience, esp. as you’ll eventually start writing documents in Rstudio.

  • That should maximize your Rstudio experience, esp. as you begin to write documents in Rstudio as well.

A Few Commands to Get Started

getwd() will spit out your current working directory.

getwd()

By default, assuming your username is “Bongani”:

  • Windows: "D:/" (notice the forward slashes!)

Tidyverse

The tidyverse is a suite of functions/packages that totally rethink base R. Some functions we’ll discuss:

  • |> (the pipe)
  • glimpse() and summary()
  • select()
  • group_by()
  • summarize()
  • mutate()
  • filter()

I cannot fully discuss everything from the tidyverse. That’s why there’s Google/Stackexchange. :P

|>

The pipe (|>) allows you to chain together a series of tidyverse functions.

  • This is especially useful when you’re recoding data and you want to make sure you got everything right before saving the data.

You can chain together a host of tidyverse commands within it.

glimpse() and summary()

glimpse() and summary() will get you some basic descriptions of your data. For example:

gapminder |>  glimpse() # notice the pipe
#> Rows: 1,704
#> Columns: 6
#> $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
#> $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
#> $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

glimpse() and summary()

summary() is technically not a tidyverse function, but it works within the pipe.

gapminder |> summary()
#>         country        continent        year         lifeExp     
#>  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
#>  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
#>  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
#>  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
#>  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
#>  Australia  :  12                  Max.   :2007   Max.   :82.60  
#>  (Other)    :1632                                                
#>       pop              gdpPercap       
#>  Min.   :6.001e+04   Min.   :   241.2  
#>  1st Qu.:2.794e+06   1st Qu.:  1202.1  
#>  Median :7.024e+06   Median :  3531.8  
#>  Mean   :2.960e+07   Mean   :  7215.3  
#>  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
#>  Max.   :1.319e+09   Max.   :113523.1  
#> 

select()

  • select() will grab (or omit) columns from the data.
  • select the variables country, lifeExp
# grab just these columns.
gapminder |> 
  select(country, lifeExp) |> 
  head()
country lifeExp
Afghanistan 28.801
Afghanistan 30.332
Afghanistan 31.997
Afghanistan 34.020
Afghanistan 36.088
Afghanistan 38.438

filter()

filter() is a great diagnostic tool for subsetting your data to look at specific observations.

  • Notice the use of double-equal signs (==) for the filter() functions.
gapminder |> 
  select(country, lifeExp) |> 
  filter(country== "South Africa" | country == "Ireland")
country lifeExp
Ireland 66.910
Ireland 68.900
Ireland 70.290
Ireland 71.080
Ireland 71.280
Ireland 72.030
Ireland 73.100
Ireland 74.360
Ireland 75.467
Ireland 76.122
Ireland 77.783
Ireland 78.885
South Africa 45.009
South Africa 47.985
South Africa 49.951
South Africa 51.927
South Africa 53.696
South Africa 55.527
South Africa 58.161
South Africa 60.834
South Africa 61.888
South Africa 60.236
South Africa 53.365
South Africa 49.339

group_by() and summarize()

  • group_by() might be the most powerful function in tidyverse.

  • tl;dr: it allows you to perform functions within specific subsets (groups) of the data.

  • summarize() creates condensed summaries of the data, for whatever it is you want.

gapminder |> 
  select(country, lifeExp) |> 
  filter(country== "South Africa" | country == "Ireland") |> 
  group_by(country) |> 
  summarise(Average = mean(lifeExp))
country Average
Ireland 73.01725
South Africa 53.99317

mutate()

mutate() creates new columns while retaining original dimensions of the data (unlike summarize()).

gapminder |>
    select(country, lifeExp) |> 
    filter(country== "South Africa" | country == "Ireland") |> 
    mutate(log_lifeExp = log(lifeExp))
country lifeExp log_lifeExp
Ireland 66.910 4.203348
Ireland 68.900 4.232656
Ireland 70.290 4.252630
Ireland 71.080 4.263806
Ireland 71.280 4.266616
Ireland 72.030 4.277083
Ireland 73.100 4.291828
Ireland 74.360 4.308918
Ireland 75.467 4.323696
Ireland 76.122 4.332337
Ireland 77.783 4.353923
Ireland 78.885 4.367991
South Africa 45.009 3.806862
South Africa 47.985 3.870888
South Africa 49.951 3.911043
South Africa 51.927 3.949839
South Africa 53.696 3.983338
South Africa 55.527 4.016869
South Africa 58.161 4.063215
South Africa 60.834 4.108149
South Africa 61.888 4.125326
South Africa 60.236 4.098270
South Africa 53.365 3.977155
South Africa 49.339 3.898715

Don’t Forget to Assign

When you’re done, don’t forget to assign what you’ve done to an object.

gapminder_new<-gapminder |>
    select(country, lifeExp) |> 
    filter(country== "South Africa" | country == "Ireland") |> 
    mutate(log_lifeExp = log(lifeExp))

T.test


options(scipen=999)
t.test(lifeExp~country,data=gapminder_new)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  lifeExp by country
#> t = 10.067, df = 19.109, p-value = 0.000000004466
#> alternative hypothesis: true difference in means between group Ireland and group South Africa is not equal to 0
#> 95 percent confidence interval:
#>  15.07022 22.97794
#> sample estimates:
#>      mean in group Ireland mean in group South Africa 
#>                   73.01725                   53.99317
Note
  • There statistically significant difference in the mean life expectance of these two countries \((p=0.000000004466)\)

About ggplot2

ggplot2 is a widely used R package that extends R’s visualization capabilities. It takes the hassle out of things like creating legends, mapping other variables to scales like color, or faceting plots into small multiples. We’ll learn about what all these things mean shortly.

Specifically, ggplot2 allows you to build a plot layer-by-layer by specifying:

  • a geom, which specifies how the data are represented on the plot (points, lines, bars, etc.),
  • aesthetics that map variables in the data to axes on the plot or to plotting size, shape, color, etc.,
  • a stat, a statistical transformation or summary of the data applied prior to plotting,
  • facets, which we’ve already seen above, that allow the data to be divided into chunks on the basis of other categorical or continuous variables and the same plot drawn for each chunk. :::
gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=gdpPercap , y= lifeExp))+
  geom_point()

Adding more layers

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=gdpPercap , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)

Transform gdp per Capita using log

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)

Use geom_smooth to add regression lines

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)+
  geom_smooth()

Use geom_smooth with linear model

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)+
  geom_smooth(method =lm)

Divide continents into facets

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)+
  geom_smooth(method =lm)+
  facet_wrap(~continent)

Use years for color

Divide continents into facets

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=year , size = pop))+
  geom_point(alpha=0.3)+
  geom_smooth(method =lm)+
  facet_wrap(~continent)

Linear Regression Model 1

model1<- lm(lifeExp~gdpPercap, data = gapminder)
summary(model1)
#> 
#> Call:
#> lm(formula = lifeExp ~ gdpPercap, data = gapminder)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -82.754  -7.758   2.176   8.225  18.426 
#> 
#> Coefficients:
#>                Estimate  Std. Error t value            Pr(>|t|)    
#> (Intercept) 53.95556088  0.31499494  171.29 <0.0000000000000002 ***
#> gdpPercap    0.00076488  0.00002579   29.66 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 10.49 on 1702 degrees of freedom
#> Multiple R-squared:  0.3407, Adjusted R-squared:  0.3403 
#> F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 0.00000000000000022
Note
  • gdpPercap is statistically significant in predicting life expectancy. (\(p<0.001\))

Linear Regression Model 2

model2<- lm(lifeExp~gdpPercap+ pop, data = gapminder)
summary(model2)
#> 
#> Call:
#> lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -82.754  -7.745   2.055   8.212  18.534 
#> 
#> Coefficients:
#>                    Estimate      Std. Error t value             Pr(>|t|)    
#> (Intercept) 53.648242352613  0.322479720682  166.36 < 0.0000000000000002 ***
#> gdpPercap    0.000767564618  0.000025681107   29.89 < 0.0000000000000002 ***
#> pop          0.000000009728  0.000000002385    4.08            0.0000472 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 10.44 on 1701 degrees of freedom
#> Multiple R-squared:  0.3471, Adjusted R-squared:  0.3463 
#> F-statistic: 452.2 on 2 and 1701 DF,  p-value: < 0.00000000000000022
Note
  • when adjusting for pop gdpPercap is still statistically significant in predicting life expectancy and so is population (\(p<0.001\))