pacman::p_load(tidyverse, gapminder)
pacman::p_load(tidyverse, 
               gapminder)

Dataset

gapminder
# Look at the gapminder dataset
gapminder

How many observations (rows) are in the dataset?

glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> Afghanistan, Afghanistan, Afghani…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asi…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 197…
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 3…
$ pop       <int> 8425333, 9240934, 10267083, 11537…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836…

Filter

Add a filter() line after the pipe (%>%) to extract only the observations from the year 1957. Remember that you use == to compare two values.

gapminder %>% 
  filter(year == 1957)
# Filter the gapminder dataset for the year 1957
gapminder %>%
  filter(year == 1957)
gapminder %>% 
  filter(year == 1957 & continent == "Europe")

Arrange

gapminder %>% 
  filter(year == 1957 & continent == "Europe") %>% 
  arrange(gdpPercap) 

Select

gapminder %>% 
  filter(year == 1957 & continent == "Europe") %>% 
  arrange(gdpPercap)  %>% 
  select(country, gdpPercap)
gapminder %>% 
  filter(year == 2002) %>% 
  arrange(desc(lifeExp)) %>% 
  select(country, lifeExp, continent) 

Mutate

gapminder

1 United States Dollar equals 0.85 Euro

gapminder %>% 
  mutate( gdpEuros = gdpPercap *.85  ) %>% 
  select(country, gdpEuros)

Use mutate to create a new column called lifeExpMonths

gapminder %>% 
  mutate(lifeExpMonths = lifeExp * 12)

In this exercise, you’ll combine all three of the verbs you’ve learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.

In one sequence of pipes on the gapminder dataset: - filter() for observations from the year 2007, - mutate() to create a column lifeExpMonths, calculated as 12 * lifeExp, and - arrange() in descending order of that new column - select and show only country and lifeExpMonths

gapminder %>%                        # my df
  filter(year == 2007) %>%           # here I filter because this year is ....
  mutate (lifeExpMonth = lifeExp * 12 ) %>% # Create a new var
  arrange(desc(lifeExp)) %>% 
  select (country, lifeExpMonth)

Summarize

Filter for the year 1957, then use the mean() function within a summarize() to calculate the median life expectancy into a column called meanLifeExp.

Which was the mean gpdPerCap in 1957

group_by

What is the gdpPercat by continent

gapminder %>% 
  group_by(continent) %>% 
  summarize(gdpPerContMedian = median(gdpPercap))
`summarise()` ungrouping output (override with `.groups` argument)

Calculate the mean life expectancy by year

gapminder %>% 
  group_by(year) %>% 
  summarise(lifeExpMean = mean(lifeExp))
`summarise()` ungrouping output (override with `.groups` argument)
gapminder %>% 
  group_by(year, continent) %>% 
  summarise(lifeExpMean = mean(lifeExp)) %>% 
  ggplot(aes(x = year, 
             y = lifeExpMean)) + 
  geom_line() + 
  aes(color = continent) +
  expand_limits(y = 0)
`summarise()` regrouping output by 'year' (override with `.groups` argument)

Change of the gdp by year and continent

gapminder %>% 
  group_by(year, continent) %>% 
  summarise(meanGdpPercap = mean(gdpPercap)) %>% 
  ggplot(aes(x = year,
             y = meanGdpPercap, color = continent)) +
  geom_line() +
  expand_limits(y = 0) + 
  scale_y_log10()
`summarise()` regrouping output by 'year' (override with `.groups` argument)

Arrange

# Sort in ascending order of lifeExp
gapminder %>% 
  arrange(lifeExp)
NA
NA
# Sort in descending order of lifeExp
gapminder %>% 
  arrange(desc(lifeExp))

Use filter() to extract observations from just the year 1957, then use arrange() to sort in descending order of population (pop).

# Filter for the year 1957, then arrange in descending order of population
gapminder %>% 
  filter(year == 1957) %>% 
  arrange(desc(pop))

Mutate

Suppose we want life expectancy to be measured in months instead of years: you’d have to multiply the existing value by 12. You can use the mutate() verb to change this column, or to create a new column that’s calculated this way.

# Use mutate to change lifeExp to be in months
gapminder %>% 
        mutate(lifeExp = lifeExp * 12)

In this exercise, you’ll combine all three of the verbs you’ve learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.

In one sequence of pipes on the gapminder dataset: filter() for observations from the year 2007, mutate() to create a column lifeExpMonths, calculated as 12 * lifeExp, and arrange() in descending order of that new column

# Filter, mutate, and arrange the gapminder dataset
gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = 12 * lifeExp) %>%
  arrange(desc(lifeExpMonths))

summarize

Use the median() function within a summarize() to find the median life expectancy. Save it into a column called medianLifeExp.

Filter for the year 1957, then use the median() function within a summarize() to calculate the median life expectancy into a column called medianLifeExp.

Find both the median life expectancy (lifeExp) and the maximum GDP per capita (gdpPercap) in the year 1957, calling them medianLifeExp and maxGdpPercap respectively. You can use the max() function to find the maximum.

The group_by verb

gapminder %>% 
  group_by(year) %>% 
  summarise(meanLifeExp = mean(lifeExp), 
            totalPop = sum(pop))
`summarise()` ungrouping output (override with `.groups` argument)
gapminder %>% 
  group_by(year, continent) %>% 
  summarise(meanLifeExp = mean(lifeExp), 
            totalPop = sum(pop))
`summarise()` regrouping output by 'year' (override with `.groups` argument)
gapminder %>% 
  group_by(year, continent) %>% 
  summarise(meanLifeExp = mean(lifeExp), 
            totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = meanLifeExp, 
             size = totalPop, 
             color = continent)) + 
  geom_line()
`summarise()` regrouping output by 'year' (override with `.groups` argument)

Exercise

Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each year, saving them into medianLifeExp and maxGdpPercap, respectively.

gapminder %>% 
  group_by(year) %>% 
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

Filter the gapminder data for the year 1957. Then find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each continent, saving them into medianLifeExp and maxGdpPercap, respectively.

Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each combination of continent and year, saving them into medianLifeExp and maxGdpPercap, respectively.

gapminder %>% 
        group_by(continent, year) %>% 
        summarize(medianLifeExp = median(lifeExp), 
                  maxGdpPercap = max(gdpPercap))
`summarise()` regrouping output by 'continent' (override with `.groups` argument)

Visualizing

Visualize population over time

gapminder %>%
  group_by(year) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop )) + 
  geom_point()
`summarise()` ungrouping output (override with `.groups` argument)

gapminder %>%
  group_by(year) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop )) + 
  geom_point() + 
  expand_limits(y = 0)
`summarise()` ungrouping output (override with `.groups` argument)

gapminder %>%
  group_by(year, continent) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop )) + 
  geom_point() + 
  expand_limits(y = 0)
`summarise()` regrouping output by 'year' (override with `.groups` argument)

gapminder %>%
  group_by(year, continent) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop, 
             color = continent)) + 
  geom_point() + 
  expand_limits(y = 0)
`summarise()` regrouping output by 'year' (override with `.groups` argument)

gapminder %>%
  group_by(year, continent) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop, 
             color = continent)) + 
  geom_point() + 
  expand_limits(y = 0) + 
  scale_y_log10()
`summarise()` regrouping output by 'year' (override with `.groups` argument)

Use the by_year dataset to create a scatter plot showing the change of median life expectancy over time, with year on the x-axis and medianLifeExp on the y-axis. Be sure to add expand_limits(y = 0) to make sure the plot’s y-axis includes zero.

Summarize the gapminder dataset by continent and year, finding the median GDP per capita (gdpPercap) within each and putting it into a column called medianGdpPercap. Use the assignment operator <- to save this summarized data as by_year_continent. Create a scatter plot showing the change in medianGdpPercap by continent over time. Use color to distinguish between continents, and be sure to add expand_limits(y = 0) so that the y-axis starts at zero.

---
title: "3rd Session: Data wrangling"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---





```{r}
pacman::p_load(tidyverse, gapminder)
```

```{r}
pacman::p_load(tidyverse, 
               gapminder)
```



# Dataset


```{r}
gapminder
```



```{r}
# Look at the gapminder dataset
gapminder
```

How many observations (rows) are in the dataset?

```{r}
glimpse(gapminder)
```


## Filter

Add a filter() line after the pipe (%>%) to extract only the observations from the year 1957. Remember that you use == to compare two values.


```{r}
gapminder %>% 
  filter(year == 1957)
```





```{r}
# Filter the gapminder dataset for the year 1957
gapminder %>%
  filter(year == 1957)
```

```{r}
gapminder %>% 
  filter(year == 1957 & continent == "Europe")
```


## Arrange

```{r}
gapminder %>% 
  filter(year == 1957 & continent == "Europe") %>% 
  arrange(gdpPercap) 
```


# Select
```{r}
gapminder %>% 
  filter(year == 1957 & continent == "Europe") %>% 
  arrange(gdpPercap)  %>% 
  select(country, gdpPercap)
```


```{r}
gapminder %>% 
  filter(year == 2002) %>% 
  arrange(desc(lifeExp)) %>% 
  select(country, lifeExp, continent) 
```


# Mutate

```{r}
gapminder
```




1 United States Dollar equals
0.85 Euro

```{r}
gapminder %>% 
  mutate( gdpEuros = gdpPercap *.85  ) %>% 
  select(country, gdpEuros)
```

 Use mutate to create a new column called lifeExpMonths
 
 
```{r}
gapminder %>% 
  mutate(lifeExpMonths = lifeExp * 12)
```
 
In this exercise, you'll combine all three of the verbs you've learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.

In one sequence of pipes on the gapminder dataset:
- filter() for observations from the year 2007,
- mutate() to create a column lifeExpMonths, calculated as 12 * lifeExp, and
- arrange() in descending order of that new column
- select and show only country and lifeExpMonths

```{r}
gapminder %>%                        # my df
  filter(year == 2007) %>%           # here I filter because this year is ....
  mutate (lifeExpMonth = lifeExp * 12 ) %>% # Create a new var
  arrange(desc(lifeExp)) %>% 
  select (country, lifeExpMonth)
```

# Summarize

```{r}
gapminder %>% 
  filter(year == 1957) %>% 
  summarise(medianLifeExp = median(lifeExp))
```

Filter for the year 1957, then use the mean() function within a summarize() to calculate the median life expectancy into a column called meanLifeExp.
```{r}
gapminder %>%
  filter(year == 1957) %>%
  summarise(meanLifeExp = mean(lifeExp),
            sd = sd(lifeExp))
```

```{r}
gapminder %>% 
  filter(year == 1957) %>% 
  summarise(meanLifeExp = mean(lifeExp),
            sd = sd(lifeExp),
            max = max(lifeExp),
            min = min(lifeExp))
```

Which was the mean gpdPerCap in 1957

```{r}
gapminder %>% 
  filter(year == 1957) %>% 
  mutate(gpdPercapEuros = gdpPercap * .85) %>% 
  summarise(meangdpEuros = mean(gpdPercapEuros))
```

# group_by

What is the gdpPercat by continent

```{r}
gapminder %>%
  group_by(continent) %>%
  summarize(gdpPerContMedian = median(gdpPercap))
```
Calculate the mean life expectancy by year

```{r}
gapminder %>% 
  group_by(year) %>% 
  summarise(lifeExpMean = mean(lifeExp))
```
```{r}
gapminder %>% 
  group_by(year, continent) %>% 
  summarise(lifeExpMean = mean(lifeExp)) %>% 
  ggplot(aes(x = year, 
             y = lifeExpMean)) + 
  geom_line() + 
  aes(color = continent) +
  expand_limits(y = 0)
```

Change of the gdp by year and continent

```{r}
gapminder %>% 
  group_by(year, continent) %>% 
  summarise(meanGdpPercap = mean(gdpPercap)) %>% 
  ggplot(aes(x = year,
             y = meanGdpPercap, color = continent)) +
  geom_line() +
  expand_limits(y = 0) + 
  scale_y_log10()
```








## Arrange

```{r}
# Sort in ascending order of lifeExp
gapminder %>% 
  arrange(lifeExp)
  

```
```{r}
# Sort in descending order of lifeExp
gapminder %>% 
  arrange(desc(lifeExp))
```

Use filter() to extract observations from just the year 1957, then use arrange() to sort in descending order of population (pop).


```{r}
# Filter for the year 1957, then arrange in descending order of population
gapminder %>% 
  filter(year == 1957) %>% 
  arrange(desc(pop))
```

## Mutate

Suppose we want life expectancy to be measured in months instead of years: you'd have to multiply the existing value by 12. You can use the mutate() verb to change this column, or to create a new column that's calculated this way.

```{r}
# Use mutate to change lifeExp to be in months
gapminder %>% 
        mutate(lifeExp = lifeExp * 12)
```


```{r}
# Use mutate to create a new column called lifeExpMonths
gapminder %>% 
        mutate(lifeExpMonths = lifeExp * 12)
```

In this exercise, you'll combine all three of the verbs you've learned in this chapter, to find the countries with the highest life expectancy, in months, in the year 2007.

In one sequence of pipes on the gapminder dataset:
filter() for observations from the year 2007,
mutate() to create a column lifeExpMonths, calculated as 12 * lifeExp, and
arrange() in descending order of that new column

```{r}
# Filter, mutate, and arrange the gapminder dataset
gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = 12 * lifeExp) %>%
  arrange(desc(lifeExpMonths))
```


## summarize

Use the median() function within a summarize() to find the median life expectancy. Save it into a column called medianLifeExp.

```{r}
gapminder %>% 
  summarise(medianLifeExp = median(lifeExp))
```

Filter for the year 1957, then use the median() function within a summarize() to calculate the median life expectancy into a column called medianLifeExp.

```{r}
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp))
```

Find both the median life expectancy (lifeExp) and the maximum GDP per capita (gdpPercap) in the year 1957, calling them medianLifeExp and maxGdpPercap respectively. You can use the max() function to find the maximum.

```{r}
gapminder %>%
  filter(year == 1957) %>% summarize(medianLifeExp = median(lifeExp),
                                     maxGdpPercap = max(gdpPercap))
```
## The group_by verb

```{r}
gapminder %>% 
  group_by(year) %>% 
  summarise(meanLifeExp = mean(lifeExp), 
            totalPop = sum(pop))
```
```{r}
gapminder %>% 
  group_by(year, continent) %>% 
  summarise(meanLifeExp = mean(lifeExp), 
            totalPop = sum(pop))
```
```{r}
gapminder %>% 
  group_by(year, continent) %>% 
  summarise(meanLifeExp = mean(lifeExp), 
            totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = meanLifeExp, 
             size = totalPop, 
             color = continent)) + 
  geom_line()
```


Exercise

Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each year, saving them into medianLifeExp and maxGdpPercap, respectively.

```{r}
gapminder %>% 
  group_by(year) %>% 
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
```

Filter the gapminder data for the year 1957. Then find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each continent, saving them into medianLifeExp and maxGdpPercap, respectively.

```{r}
gapminder %>% 
        filter(year == 1957) %>% 
        group_by(continent) %>% 
        summarize(medianLifeExp = median(lifeExp), 
                  maxGdpPerCap = max(gdpPercap))
```

Find the median life expectancy (lifeExp) and maximum GDP per capita (gdpPercap) within each combination of continent and year, saving them into medianLifeExp and maxGdpPercap, respectively.

```{r}
gapminder %>% 
        group_by(continent, year) %>% 
        summarize(medianLifeExp = median(lifeExp), 
                  maxGdpPercap = max(gdpPercap))
```

# Visualizing

Visualize population over time

```{r}
gapminder %>%
  group_by(year) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop )) + 
  geom_point()
```
```{r}
gapminder %>%
  group_by(year) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop )) + 
  geom_point() + 
  expand_limits(y = 0)
```

```{r}
gapminder %>%
  group_by(year, continent) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop )) + 
  geom_point() + 
  expand_limits(y = 0)
```
```{r}
gapminder %>%
  group_by(year, continent) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop, 
             color = continent)) + 
  geom_point() + 
  expand_limits(y = 0)
```
```{r}
gapminder %>%
  group_by(year, continent) %>% 
  summarise(totalPop = sum(pop)) %>% 
  ggplot(aes(x = year, 
             y = totalPop, 
             color = continent)) + 
  geom_point() + 
  expand_limits(y = 0) + 
  scale_y_log10()
```
Use the by_year dataset to create a scatter plot showing the change of median life expectancy over time, with year on the x-axis and medianLifeExp on the y-axis. Be sure to add expand_limits(y = 0) to make sure the plot's y-axis includes zero.



```{r}
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# Create a scatter plot showing the change in medianLifeExp over time
by_year %>% 
        ggplot(aes(x = year, y = medianLifeExp)) + 
        geom_point() +
        expand_limits(y = 0)
```

Summarize the gapminder dataset by continent and year, finding the median GDP per capita (gdpPercap) within each and putting it into a column called medianGdpPercap. Use the assignment operator <- to save this summarized data as by_year_continent.
Create a scatter plot showing the change in medianGdpPercap by continent over time. Use color to distinguish between continents, and be sure to add expand_limits(y = 0) so that the y-axis starts at zero.

```{r}
# Summarize medianGdpPercap within each continent within each year: by_year_continent
by_year_continent <- gapminder %>% 
        group_by(year, continent) %>% 
        summarize(medianGdpPercap = median(gdpPercap))

# Plot the change in medianGdpPercap in each continent over time
by_year_continent %>% 
        ggplot(aes(x = year, y = medianGdpPercap, color= continent)) +
        geom_point() + 
        expand_limits(y = 0)
```
