Installing and loading Packages

Introduction

Goal for Today

Introduce you to R and Rstudio.

What Are R and Rstudio?

What Is R?

R is an open source programming language with origins in C and FORTRAN. Advantages:

Flexibility
It’s free (and open source)!
Ease of handling advanced computational models
Ease of handling multiple data sets in one session
Higher demand in industries.

Getting Started in R and Rstudio

Let’s get started in Rstudio first. Select “Tools” in the menu.

Scroll to “Global Options” (should be at the bottom)
On the pop-up, select “pane layout.”
Rearrange so that “Source” is top left, “Console” is top right”, and the files/plots/packages/etc. is the bottom right.
Save

Getting Started in R and Rstudio

Hit Ctrl-Shift-N (Cmd-Shift-N if you’re on a Mac) to open up a new script.

Minimize the “Environment/History/Connections/Git” pane in the bottom left.
Adjust the console output to your liking.

This should maximize your Rstudio experience, esp. as you’ll eventually start writing documents in Rstudio.

That should maximize your Rstudio experience, esp. as you begin to write documents in Rstudio as well.

A Few Commands to Get Started

getwd() will spit out your current working directory.

getwd()

By default, assuming your username is “Bongani”:

Windows: "D:/" (notice the forward slashes!)

Tidyverse

The tidyverse is a suite of functions/packages that totally rethink base R. Some functions we’ll discuss:

|> (the pipe)
glimpse() and summary()
select()
group_by()
summarize()
mutate()
filter()

I cannot fully discuss everything from the tidyverse. That’s why there’s Google/Stackexchange. :P

`|>`

The pipe (|>) allows you to chain together a series of tidyverse functions.

This is especially useful when you’re recoding data and you want to make sure you got everything right before saving the data.

You can chain together a host of tidyverse commands within it.

`glimpse()` and `summary()`

glimpse() and summary() will get you some basic descriptions of your data. For example:

gapminder |>  glimpse() # notice the pipe
#> Rows: 1,704
#> Columns: 6
#> $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
#> $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
#> $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

`glimpse()` and `summary()`

summary() is technically not a tidyverse function, but it works within the pipe.

gapminder |> summary()
#>         country        continent        year         lifeExp     
#>  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
#>  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
#>  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
#>  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
#>  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
#>  Australia  :  12                  Max.   :2007   Max.   :82.60  
#>  (Other)    :1632                                                
#>       pop              gdpPercap       
#>  Min.   :6.001e+04   Min.   :   241.2  
#>  1st Qu.:2.794e+06   1st Qu.:  1202.1  
#>  Median :7.024e+06   Median :  3531.8  
#>  Mean   :2.960e+07   Mean   :  7215.3  
#>  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
#>  Max.   :1.319e+09   Max.   :113523.1  
#>

`select()`

select() will grab (or omit) columns from the data.
select the variables country, lifeExp

# grab just these columns.
gapminder |> 
  select(country, lifeExp) |> 
  head()

country	lifeExp
Afghanistan	28.801
Afghanistan	30.332
Afghanistan	31.997
Afghanistan	34.020
Afghanistan	36.088
Afghanistan	38.438

`filter()`

filter() is a great diagnostic tool for subsetting your data to look at specific observations.

Notice the use of double-equal signs (==) for the filter() functions.

gapminder |> 
  select(country, lifeExp) |> 
  filter(country== "South Africa" | country == "Ireland")

country	lifeExp
Ireland	66.910
Ireland	68.900
Ireland	70.290
Ireland	71.080
Ireland	71.280
Ireland	72.030
Ireland	73.100
Ireland	74.360
Ireland	75.467
Ireland	76.122
Ireland	77.783
Ireland	78.885
South Africa	45.009
South Africa	47.985
South Africa	49.951
South Africa	51.927
South Africa	53.696
South Africa	55.527
South Africa	58.161
South Africa	60.834
South Africa	61.888
South Africa	60.236
South Africa	53.365
South Africa	49.339

`group_by()` and `summarize()`

group_by() might be the most powerful function in tidyverse.
tl;dr: it allows you to perform functions within specific subsets (groups) of the data.
summarize() creates condensed summaries of the data, for whatever it is you want.

gapminder |> 
  select(country, lifeExp) |> 
  filter(country== "South Africa" | country == "Ireland") |> 
  group_by(country) |> 
  summarise(Average = mean(lifeExp))

country	Average
Ireland	73.01725
South Africa	53.99317

`mutate()`

mutate() creates new columns while retaining original dimensions of the data (unlike summarize()).

gapminder |>
    select(country, lifeExp) |> 
    filter(country== "South Africa" | country == "Ireland") |> 
    mutate(log_lifeExp = log(lifeExp))

country	lifeExp	log_lifeExp
Ireland	66.910	4.203348
Ireland	68.900	4.232656
Ireland	70.290	4.252630
Ireland	71.080	4.263806
Ireland	71.280	4.266616
Ireland	72.030	4.277083
Ireland	73.100	4.291828
Ireland	74.360	4.308918
Ireland	75.467	4.323696
Ireland	76.122	4.332337
Ireland	77.783	4.353923
Ireland	78.885	4.367991
South Africa	45.009	3.806862
South Africa	47.985	3.870888
South Africa	49.951	3.911043
South Africa	51.927	3.949839
South Africa	53.696	3.983338
South Africa	55.527	4.016869
South Africa	58.161	4.063215
South Africa	60.834	4.108149
South Africa	61.888	4.125326
South Africa	60.236	4.098270
South Africa	53.365	3.977155
South Africa	49.339	3.898715

Don’t Forget to Assign

When you’re done, don’t forget to assign what you’ve done to an object.

gapminder_new<-gapminder |>
    select(country, lifeExp) |> 
    filter(country== "South Africa" | country == "Ireland") |> 
    mutate(log_lifeExp = log(lifeExp))

T.test


options(scipen=999)
t.test(lifeExp~country,data=gapminder_new)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  lifeExp by country
#> t = 10.067, df = 19.109, p-value = 0.000000004466
#> alternative hypothesis: true difference in means between group Ireland and group South Africa is not equal to 0
#> 95 percent confidence interval:
#>  15.07022 22.97794
#> sample estimates:
#>      mean in group Ireland mean in group South Africa 
#>                   73.01725                   53.99317

Note

There statistically significant difference in the mean life expectance of these two countries \((p=0.000000004466)\)

About ggplot2

ggplot2 is a widely used R package that extends R’s visualization capabilities. It takes the hassle out of things like creating legends, mapping other variables to scales like color, or faceting plots into small multiples. We’ll learn about what all these things mean shortly.

Specifically, ggplot2 allows you to build a plot layer-by-layer by specifying:

a geom, which specifies how the data are represented on the plot (points, lines, bars, etc.),
aesthetics that map variables in the data to axes on the plot or to plotting size, shape, color, etc.,
a stat, a statistical transformation or summary of the data applied prior to plotting,
facets, which we’ve already seen above, that allow the data to be divided into chunks on the basis of other categorical or continuous variables and the same plot drawn for each chunk. :::

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=gdpPercap , y= lifeExp))+
  geom_point()

Adding more layers

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=gdpPercap , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)

Transform gdp per Capita using log

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)

Use geom_smooth to add regression lines

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)+
  geom_smooth()

Use geom_smooth with linear model

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)+
  geom_smooth(method =lm)

Divide continents into facets

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
  geom_point(alpha=0.3)+
  geom_smooth(method =lm)+
  facet_wrap(~continent)

Use years for color

Divide continents into facets

gapminder |> 
  filter(gdpPercap < 50000) |> 
  ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=year , size = pop))+
  geom_point(alpha=0.3)+
  geom_smooth(method =lm)+
  facet_wrap(~continent)

Linear Regression Model 1

model1<- lm(lifeExp~gdpPercap, data = gapminder)
summary(model1)
#> 
#> Call:
#> lm(formula = lifeExp ~ gdpPercap, data = gapminder)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -82.754  -7.758   2.176   8.225  18.426 
#> 
#> Coefficients:
#>                Estimate  Std. Error t value            Pr(>|t|)    
#> (Intercept) 53.95556088  0.31499494  171.29 <0.0000000000000002 ***
#> gdpPercap    0.00076488  0.00002579   29.66 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 10.49 on 1702 degrees of freedom
#> Multiple R-squared:  0.3407, Adjusted R-squared:  0.3403 
#> F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 0.00000000000000022

Note

gdpPercap is statistically significant in predicting life expectancy. (\(p<0.001\))

Linear Regression Model 2

model2<- lm(lifeExp~gdpPercap+ pop, data = gapminder)
summary(model2)
#> 
#> Call:
#> lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -82.754  -7.745   2.055   8.212  18.534 
#> 
#> Coefficients:
#>                    Estimate      Std. Error t value             Pr(>|t|)    
#> (Intercept) 53.648242352613  0.322479720682  166.36 < 0.0000000000000002 ***
#> gdpPercap    0.000767564618  0.000025681107   29.89 < 0.0000000000000002 ***
#> pop          0.000000009728  0.000000002385    4.08            0.0000472 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 10.44 on 1701 degrees of freedom
#> Multiple R-squared:  0.3471, Adjusted R-squared:  0.3463 
#> F-statistic: 452.2 on 2 and 1701 DF,  p-value: < 0.00000000000000022

Note

when adjusting for pop gdpPercap is still statistically significant in predicting life expectancy and so is population (\(p<0.001\))

Installing and loading Packages

Introduction

Goal for Today

What Are R and Rstudio?

What Is R?

Getting Started in R and Rstudio

Getting Started in R and Rstudio

A Few Commands to Get Started

Tidyverse

|>

glimpse() and summary()

glimpse() and summary()

select()

filter()

group_by() and summarize()

mutate()

Don’t Forget to Assign

T.test

About ggplot2

Adding more layers

Transform gdp per Capita using log

Use geom_smooth to add regression lines

Use geom_smooth with linear model

Divide continents into facets

Use years for color

Divide continents into facets

Linear Regression Model 1

Linear Regression Model 2

`|>`

`glimpse()` and `summary()`

`glimpse()` and `summary()`

`select()`

`filter()`

`group_by()` and `summarize()`

`mutate()`