Installing and loading Packages
Introduction
Goal for Today
Introduce you to R and Rstudio.
What Are R and Rstudio?
What Is R?
R is an open source programming language with origins in C and FORTRAN. Advantages:
- Flexibility
- It’s free (and open source)!
- Ease of handling advanced computational models
- Ease of handling multiple data sets in one session
- Higher demand in industries.
Getting Started in R and Rstudio
Let’s get started in Rstudio first. Select “Tools” in the menu.
- Scroll to “Global Options” (should be at the bottom)
- On the pop-up, select “pane layout.”
- Rearrange so that “Source” is top left, “Console” is top right”, and the files/plots/packages/etc. is the bottom right.
- Save
Getting Started in R and Rstudio
Hit Ctrl-Shift-N (Cmd-Shift-N if you’re on a Mac) to open up a new script.
- Minimize the “Environment/History/Connections/Git” pane in the bottom left.
- Adjust the console output to your liking.
This should maximize your Rstudio experience, esp. as you’ll eventually start writing documents in Rstudio.
- That should maximize your Rstudio experience, esp. as you begin to write documents in Rstudio as well.
A Few Commands to Get Started
getwd()
will spit out your current working directory.
getwd()
By default, assuming your username is “Bongani”:
- Windows:
"D:/"
(notice the forward slashes!)
Tidyverse
The tidyverse is a suite of functions/packages that totally rethink base R. Some functions we’ll discuss:
|>
(the pipe)glimpse()
andsummary()
select()
group_by()
summarize()
mutate()
filter()
I cannot fully discuss everything from the tidyverse. That’s why there’s Google/Stackexchange. :P
|>
The pipe (|>
) allows you to chain together a series of tidyverse functions.
- This is especially useful when you’re recoding data and you want to make sure you got everything right before saving the data.
You can chain together a host of tidyverse commands within it.
glimpse()
and summary()
glimpse()
and summary()
will get you some basic descriptions of your data. For example:
|> glimpse() # notice the pipe
gapminder #> Rows: 1,704
#> Columns: 6
#> $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
#> $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
#> $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
#> $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
glimpse()
and summary()
summary()
is technically not a tidyverse function, but it works within the pipe.
|> summary()
gapminder #> country continent year lifeExp
#> Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
#> Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
#> Algeria : 12 Asia :396 Median :1980 Median :60.71
#> Angola : 12 Europe :360 Mean :1980 Mean :59.47
#> Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
#> Australia : 12 Max. :2007 Max. :82.60
#> (Other) :1632
#> pop gdpPercap
#> Min. :6.001e+04 Min. : 241.2
#> 1st Qu.:2.794e+06 1st Qu.: 1202.1
#> Median :7.024e+06 Median : 3531.8
#> Mean :2.960e+07 Mean : 7215.3
#> 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
#> Max. :1.319e+09 Max. :113523.1
#>
select()
select()
will grab (or omit) columns from the data.- select the variables country, lifeExp
# grab just these columns.
|>
gapminder select(country, lifeExp) |>
head()
country | lifeExp |
---|---|
Afghanistan | 28.801 |
Afghanistan | 30.332 |
Afghanistan | 31.997 |
Afghanistan | 34.020 |
Afghanistan | 36.088 |
Afghanistan | 38.438 |
filter()
filter()
is a great diagnostic tool for subsetting your data to look at specific observations.
- Notice the use of double-equal signs (
==
) for thefilter()
functions.
|>
gapminder select(country, lifeExp) |>
filter(country== "South Africa" | country == "Ireland")
country | lifeExp |
---|---|
Ireland | 66.910 |
Ireland | 68.900 |
Ireland | 70.290 |
Ireland | 71.080 |
Ireland | 71.280 |
Ireland | 72.030 |
Ireland | 73.100 |
Ireland | 74.360 |
Ireland | 75.467 |
Ireland | 76.122 |
Ireland | 77.783 |
Ireland | 78.885 |
South Africa | 45.009 |
South Africa | 47.985 |
South Africa | 49.951 |
South Africa | 51.927 |
South Africa | 53.696 |
South Africa | 55.527 |
South Africa | 58.161 |
South Africa | 60.834 |
South Africa | 61.888 |
South Africa | 60.236 |
South Africa | 53.365 |
South Africa | 49.339 |
group_by()
and summarize()
group_by()
might be the most powerful function in tidyverse.tl;dr: it allows you to perform functions within specific subsets (groups) of the data.
summarize()
creates condensed summaries of the data, for whatever it is you want.
|>
gapminder select(country, lifeExp) |>
filter(country== "South Africa" | country == "Ireland") |>
group_by(country) |>
summarise(Average = mean(lifeExp))
country | Average |
---|---|
Ireland | 73.01725 |
South Africa | 53.99317 |
mutate()
mutate()
creates new columns while retaining original dimensions of the data (unlike summarize()
).
|>
gapminder select(country, lifeExp) |>
filter(country== "South Africa" | country == "Ireland") |>
mutate(log_lifeExp = log(lifeExp))
country | lifeExp | log_lifeExp |
---|---|---|
Ireland | 66.910 | 4.203348 |
Ireland | 68.900 | 4.232656 |
Ireland | 70.290 | 4.252630 |
Ireland | 71.080 | 4.263806 |
Ireland | 71.280 | 4.266616 |
Ireland | 72.030 | 4.277083 |
Ireland | 73.100 | 4.291828 |
Ireland | 74.360 | 4.308918 |
Ireland | 75.467 | 4.323696 |
Ireland | 76.122 | 4.332337 |
Ireland | 77.783 | 4.353923 |
Ireland | 78.885 | 4.367991 |
South Africa | 45.009 | 3.806862 |
South Africa | 47.985 | 3.870888 |
South Africa | 49.951 | 3.911043 |
South Africa | 51.927 | 3.949839 |
South Africa | 53.696 | 3.983338 |
South Africa | 55.527 | 4.016869 |
South Africa | 58.161 | 4.063215 |
South Africa | 60.834 | 4.108149 |
South Africa | 61.888 | 4.125326 |
South Africa | 60.236 | 4.098270 |
South Africa | 53.365 | 3.977155 |
South Africa | 49.339 | 3.898715 |
Don’t Forget to Assign
When you’re done, don’t forget to assign what you’ve done to an object.
<-gapminder |>
gapminder_newselect(country, lifeExp) |>
filter(country== "South Africa" | country == "Ireland") |>
mutate(log_lifeExp = log(lifeExp))
T.test
options(scipen=999)
t.test(lifeExp~country,data=gapminder_new)
#>
#> Welch Two Sample t-test
#>
#> data: lifeExp by country
#> t = 10.067, df = 19.109, p-value = 0.000000004466
#> alternative hypothesis: true difference in means between group Ireland and group South Africa is not equal to 0
#> 95 percent confidence interval:
#> 15.07022 22.97794
#> sample estimates:
#> mean in group Ireland mean in group South Africa
#> 73.01725 53.99317
- There statistically significant difference in the mean life expectance of these two countries \((p=0.000000004466)\)
About ggplot2
ggplot2 is a widely used R package that extends R’s visualization capabilities. It takes the hassle out of things like creating legends, mapping other variables to scales like color, or faceting plots into small multiples. We’ll learn about what all these things mean shortly.
Specifically, ggplot2 allows you to build a plot layer-by-layer by specifying:
- a geom, which specifies how the data are represented on the plot (points, lines, bars, etc.),
- aesthetics that map variables in the data to axes on the plot or to plotting size, shape, color, etc.,
- a stat, a statistical transformation or summary of the data applied prior to plotting,
- facets, which we’ve already seen above, that allow the data to be divided into chunks on the basis of other categorical or continuous variables and the same plot drawn for each chunk. :::
|>
gapminder filter(gdpPercap < 50000) |>
ggplot(aes(x=gdpPercap , y= lifeExp))+
geom_point()
Adding more layers
|>
gapminder filter(gdpPercap < 50000) |>
ggplot(aes(x=gdpPercap , y= lifeExp , col=continent , size = pop))+
geom_point(alpha=0.3)
Transform gdp per Capita using log
|>
gapminder filter(gdpPercap < 50000) |>
ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
geom_point(alpha=0.3)
Use geom_smooth to add regression lines
|>
gapminder filter(gdpPercap < 50000) |>
ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
geom_point(alpha=0.3)+
geom_smooth()
Use geom_smooth with linear model
|>
gapminder filter(gdpPercap < 50000) |>
ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
geom_point(alpha=0.3)+
geom_smooth(method =lm)
Divide continents into facets
|>
gapminder filter(gdpPercap < 50000) |>
ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=continent , size = pop))+
geom_point(alpha=0.3)+
geom_smooth(method =lm)+
facet_wrap(~continent)
Use years for color
Divide continents into facets
|>
gapminder filter(gdpPercap < 50000) |>
ggplot(aes(x=log(gdpPercap) , y= lifeExp , col=year , size = pop))+
geom_point(alpha=0.3)+
geom_smooth(method =lm)+
facet_wrap(~continent)
Linear Regression Model 1
<- lm(lifeExp~gdpPercap, data = gapminder)
model1summary(model1)
#>
#> Call:
#> lm(formula = lifeExp ~ gdpPercap, data = gapminder)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -82.754 -7.758 2.176 8.225 18.426
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 53.95556088 0.31499494 171.29 <0.0000000000000002 ***
#> gdpPercap 0.00076488 0.00002579 29.66 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 10.49 on 1702 degrees of freedom
#> Multiple R-squared: 0.3407, Adjusted R-squared: 0.3403
#> F-statistic: 879.6 on 1 and 1702 DF, p-value: < 0.00000000000000022
- gdpPercap is statistically significant in predicting life expectancy. (\(p<0.001\))
Linear Regression Model 2
<- lm(lifeExp~gdpPercap+ pop, data = gapminder)
model2summary(model2)
#>
#> Call:
#> lm(formula = lifeExp ~ gdpPercap + pop, data = gapminder)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -82.754 -7.745 2.055 8.212 18.534
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 53.648242352613 0.322479720682 166.36 < 0.0000000000000002 ***
#> gdpPercap 0.000767564618 0.000025681107 29.89 < 0.0000000000000002 ***
#> pop 0.000000009728 0.000000002385 4.08 0.0000472 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 10.44 on 1701 degrees of freedom
#> Multiple R-squared: 0.3471, Adjusted R-squared: 0.3463
#> F-statistic: 452.2 on 2 and 1701 DF, p-value: < 0.00000000000000022
- when adjusting for pop gdpPercap is still statistically significant in predicting life expectancy and so is population (\(p<0.001\))