Learning Objectives:

Get started with "dplyr" and its basic verbs:

slice(), filter(), select()

mutate()

arrange()

summarise()

group_by()

Get started with "ggplot2"

Produce basic plots with ggplot()

General Instructions

Write your descriptions, explanations, and code in an Rmd (R markdown) file.
Name this file as lab02-first-last.Rmd, where first and last are your first and last names (e.g. lab02-gaston-sanchez.Rmd).
Knit your Rmd file as an html document (default option).
Submit your Rmd and html files to bCourses, in the corresponding lab assignment.

1) Manipulating and Visualizing Data Frames

In this lab, you will start learning a couple of approaches to manipulate tables and create statistical graphics. We are going to use the package "dplyr" to work with tabular data through a small, consistent set of functions. Although the package itself is fairly recent, it builds on more than a decade of research and work led by Hadley Wickham.

Likewise, to create graphics in a fairly consistent and visually pleasing way, we are going to use the package "ggplot2", also originally authored by Hadley Wickham, and developed as part of his PhD more than a decade ago.

While you follow this lab, you may want to open these cheat sheets:

1.1) Installing packages

I’m assuming that you already installed the packages "dplyr" and "ggplot2". If that’s not the case then run on the console the command below (do NOT include this command in your Rmd):

# don't include this command in your Rmd file
# don't worry too much if you get a warning message
install.packages(c("dplyr", "ggplot2"))

Remember that you only need to install a package once! After a package has been installed on your machine, there is no need to call install.packages() again for that package. What you should always do, in order to use the functions in a package, is load it with the library() function:

# (include these commands in your Rmd file)
# load the packages
library(dplyr)
library(ggplot2)

About loading packages: Another rule to keep in mind is to always load any required packages at the very top of your script files (.R or .Rmd or .Rnw files). Avoid calling the library() function in the middle of a script. Instead, load all the packages before anything else.

1.2) Data `starwars`

The data file for this lab has to do with Star Wars characters. The dataset is part of the dplyr package: starwars. So, assuming that you loaded the package "dplyr", then simply type the name of the object: starwars

# assuming you loaded dplyr ...
starwars

Part I) Basic `"dplyr"` verbs

To make the learning process of "dplyr" gentler, Hadley Wickham proposes beginning with a set of five basic verbs or operations for data frames (each verb corresponds to a function in "dplyr"):

filter: keep rows matching criteria
select: pick columns by name
mutate: add new variables
arrange: reorder rows
summarise: reduce variables to values

I’ve slightly modified Hadley’s list of verbs:

filter(), slice(), and select(): subsetting and selecting rows and columns
mutate(): add new variables
arrange(): reorder rows
summarise(): reduce variables to values
group_by(): grouped (aggregated) operations

2) Filtering, slicing, and selecting

slice() allows you to select rows by position:

# first three rows
three_rows <- slice(starwars, 1:3)
three_rows

## # A tibble: 3 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke…    172    77 blond      fair       blue              19 male  mascu…
## 2 C-3PO    167    75 <NA>       gold       yellow           112 none  mascu…
## 3 R2-D2     96    32 <NA>       white, bl… red               33 none  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
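The positions passed to slice() do not have to be hard-coded. The following is a minimal sketch, assuming a small made-up tibble called toy (a hypothetical object, not part of the lab data). Inside slice(), the helper n() returns the number of rows, and negative positions drop rows instead of keeping them.

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(id = 1:10)

# last three rows, without knowing the number of rows in advance
last_three <- slice(toy, (n() - 2):n())

# same result: drop the first seven rows with negative positions
drop_first <- slice(toy, -(1:7))

last_three
drop_first
```

The same idea applies to any data frame, so the last rows can be selected without looking up how many rows the table has.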

filter() allows you to select rows by defining a condition (which could be simple or compound):

# subset rows given a simple condition
# (height greater than 200 cm)
gt_200 <- filter(starwars, height > 200)
gt_200

## # A tibble: 10 x 14
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Dart…    202   136 none       white      yellow          41.9 male  mascu…
##  2 Chew…    228   112 brown      unknown    blue           200   male  mascu…
##  3 Roos…    224    82 none       grey       orange          NA   male  mascu…
##  4 Rugo…    206    NA none       green      orange          NA   male  mascu…
##  5 Yara…    264    NA none       white      yellow          NA   male  mascu…
##  6 Lama…    229    88 none       grey       black           NA   male  mascu…
##  7 Taun…    213    NA none       grey       black           NA   fema… femin…
##  8 Grie…    216   159 none       brown, wh… green, y…       NA   male  mascu…
##  9 Tarf…    234   136 brown      brown      blue            NA   male  mascu…
## 10 Tion…    206    80 none       grey       black           NA   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

# subset rows given a compound condition
filter(starwars, height > 200 & mass < 100)

## # A tibble: 3 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Roos…    224    82 none       grey       orange            NA male  mascu…
## 2 Lama…    229    88 none       grey       black             NA male  mascu…
## 3 Tion…    206    80 none       grey       black             NA male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
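Two details of compound conditions are easy to miss: `&` is evaluated before `|`, and filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped. The sketch below illustrates both points with a small made-up tibble (the object toy and its values are hypothetical, not taken from starwars).

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(
  name = c("a", "b", "c", "d"),
  hair = c("black", "brown", "brown", "black"),
  sex  = c("female", "male", "female", NA)
)

# without parentheses this reads: black OR (brown AND male)
no_parens <- filter(toy, hair == "black" | hair == "brown" & sex == "male")

# with parentheses: (black OR brown) AND male
# row "d" is dropped because TRUE & NA evaluates to NA
with_parens <- filter(toy, (hair == "black" | hair == "brown") & sex == "male")

no_parens$name
with_parens$name
```

When mixing `&` and `|`, adding explicit parentheses is usually the safer choice, even where they are not strictly required.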

select() allows you to select one or more columns by name:

# columns by name
name_height <- select(starwars, name, height)

2.1) Your turn:

use slice() to subset the data by selecting the first 5 rows.

slice(starwars, 1:5)

## # A tibble: 5 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia…    150    49 brown      light      brown           19   fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

use slice() to subset the data by selecting rows 10, 15, 20, …, 50. Optional hint: seq() is your friend.

slice(starwars, seq(10,50,5))

## # A tibble: 9 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Obi-…    182    77 auburn, w… fair       blue-gray         57 male  mascu…
## 2 Gree…    173    74 <NA>       green      black             44 male  mascu…
## 3 Palp…    170    75 grey       pale       yellow            82 male  mascu…
## 4 Lobot    175    79 none       light      blue              37 male  mascu…
## 5 Nien…    160    68 none       grey       black             NA male  mascu…
## 6 Roos…    224    82 none       grey       orange            NA male  mascu…
## 7 Quar…    183    NA black      dark       brown             62 <NA>  <NA>  
## 8 Dud …     94    45 none       blue, grey yellow            NA male  mascu…
## 9 Kit …    196    87 none       green      black             NA male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

use slice() to subset the data by selecting the last 5 rows.

slice(starwars, 83:87)

## # A tibble: 5 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Rey       NA    NA brown      light      hazel             NA fema… femin…
## 2 Poe …     NA    NA brown      light      brown             NA male  mascu…
## 3 BB8       NA    NA none       none       black             NA none  mascu…
## 4 Capt…     NA    NA unknown    unknown    unknown           NA <NA>  <NA>  
## 5 Padm…    165    45 brown      light      brown             46 fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

use filter() to subset those individuals with height less than 100 cm tall.

filter(starwars, height < 100)

## # A tibble: 7 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 R2-D2     96    32 <NA>       white, bl… red               33 none  mascu…
## 2 R5-D4     97    32 <NA>       white, red red               NA none  mascu…
## 3 Yoda      66    17 white      green      brown            896 male  mascu…
## 4 Wick…     88    20 brown      brown      brown              8 male  mascu…
## 5 Dud …     94    45 none       blue, grey yellow            NA male  mascu…
## 6 Ratt…     79    15 none       grey, blue unknown           NA male  mascu…
## 7 R4-P…     96    NA none       silver, r… red, blue         NA none  femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

use filter() to subset rows of female individuals (gender).

filter(starwars, gender == "feminine")

## # A tibble: 17 x 14
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Leia…    150  49   brown      light      brown             19 fema… femin…
##  2 Beru…    165  75   brown      light      blue              47 fema… femin…
##  3 Mon …    150  NA   auburn     fair       blue              48 fema… femin…
##  4 Shmi…    163  NA   black      fair       brown             72 fema… femin…
##  5 Ayla…    178  55   none       blue       hazel             48 fema… femin…
##  6 Adi …    184  50   none       dark       blue              NA fema… femin…
##  7 Cordé    157  NA   brown      light      brown             NA fema… femin…
##  8 Lumi…    170  56.2 black      yellow     blue              58 fema… femin…
##  9 Barr…    166  50   black      yellow     blue              40 fema… femin…
## 10 Dormé    165  NA   brown      light      brown             NA fema… femin…
## 11 Zam …    168  55   blonde     fair, gre… yellow            NA fema… femin…
## 12 Taun…    213  NA   none       grey       black             NA fema… femin…
## 13 Joca…    167  NA   white      fair       blue              NA fema… femin…
## 14 R4-P…     96  NA   none       silver, r… red, blue         NA none  femin…
## 15 Shaa…    178  57   none       red, blue… black             NA fema… femin…
## 16 Rey       NA  NA   brown      light      hazel             NA fema… femin…
## 17 Padm…    165  45   brown      light      brown             46 fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

use filter() to subset rows of individuals with brown hair color.

filter(starwars, hair_color == "brown")

## # A tibble: 18 x 14
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Leia…    150    49 brown      light      brown             19 fema… femin…
##  2 Beru…    165    75 brown      light      blue              47 fema… femin…
##  3 Chew…    228   112 brown      unknown    blue             200 male  mascu…
##  4 Han …    180    80 brown      fair       brown             29 male  mascu…
##  5 Wedg…    170    77 brown      fair       hazel             21 male  mascu…
##  6 Jek …    180   110 brown      fair       blue              NA male  mascu…
##  7 Arve…     NA    NA brown      fair       brown             NA male  mascu…
##  8 Wick…     88    20 brown      brown      brown              8 male  mascu…
##  9 Qui-…    193    89 brown      fair       blue              92 male  mascu…
## 10 Ric …    183    NA brown      fair       blue              NA <NA>  <NA>  
## 11 Cordé    157    NA brown      light      brown             NA fema… femin…
## 12 Clie…    183    NA brown      fair       blue              82 male  mascu…
## 13 Dormé    165    NA brown      light      brown             NA fema… femin…
## 14 Tarf…    234   136 brown      brown      blue              NA male  mascu…
## 15 Raym…    188    79 brown      light      brown             NA male  mascu…
## 16 Rey       NA    NA brown      light      hazel             NA fema… femin…
## 17 Poe …     NA    NA brown      light      brown             NA male  mascu…
## 18 Padm…    165    45 brown      light      brown             46 fema… femin…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

use filter() and then select(), to subset rows of individuals from Naboo, and then display their names.

naboo_home_world <- filter(starwars, homeworld == "Naboo")
select(naboo_home_world, name)

## # A tibble: 11 x 1
##    name         
##    <chr>        
##  1 R2-D2        
##  2 Palpatine    
##  3 Jar Jar Binks
##  4 Roos Tarpals 
##  5 Rugor Nass   
##  6 Ric Olié     
##  7 Quarsh Panaka
##  8 Gregar Typho 
##  9 Cordé        
## 10 Dormé        
## 11 Padmé Amidala

use "dplyr" functions to display the names of individuals with green skin color.

green_skin <- filter(starwars, skin_color == "green")
select(green_skin, name)

## # A tibble: 6 x 1
##   name             
##   <chr>            
## 1 Greedo           
## 2 Yoda             
## 3 Bossk            
## 4 Rugor Nass       
## 5 Kit Fisto        
## 6 Poggle the Lesser

find how to select() the name, height, and mass, of male individuals, with brown or black hair color.

# parentheses are needed because & is evaluated before |
male_black_brown_hair <- filter(starwars, (hair_color == "black" | hair_color == "brown") & sex == "male")
select(male_black_brown_hair, name, height, mass)

## # A tibble: 20 x 3
##    name                  height  mass
##    <chr>                  <int> <dbl>
##  1 Biggs Darklighter        183  84  
##  2 Chewbacca                228 112  
##  3 Han Solo                 180  80  
##  4 Wedge Antilles           170  77  
##  5 Jek Tono Porkins         180 110  
##  6 Boba Fett                183  78.2
##  7 Lando Calrissian         177  79  
##  8 Arvel Crynyd              NA  NA  
##  9 Wicket Systri Warrick     88  20  
## 10 Qui-Gon Jinn             193  89  
## # … with 10 more rows

find how to select the name and homeworld, of human female individuals.

human_female <- filter(starwars,  species == "Human" & sex == "female")
select(human_female, name, homeworld)

## # A tibble: 9 x 2
##   name               homeworld
##   <chr>              <chr>    
## 1 Leia Organa        Alderaan 
## 2 Beru Whitesun lars Tatooine 
## 3 Mon Mothma         Chandrila
## 4 Shmi Skywalker     Tatooine 
## 5 Cordé              Naboo    
## 6 Dormé              Naboo    
## 7 Jocasta Nu         Coruscant
## 8 Rey                <NA>     
## 9 Padmé Amidala      Naboo

3) Adding new variables: `mutate()`

Another basic verb is mutate() which allows you to add new variables. Let’s create a small data frame for the female individuals with three columns: name, height, and mass:

# creating a small data frame step by step
fem <- filter(starwars, sex == "female")
fem <- select(fem, name, height, mass)
fem <- slice(fem, c(1, 2, 5, 6, 8))
fem

## # A tibble: 5 x 3
##   name               height  mass
##   <chr>               <int> <dbl>
## 1 Leia Organa           150  49  
## 2 Beru Whitesun lars    165  75  
## 3 Ayla Secura           178  55  
## 4 Adi Gallia            184  50  
## 5 Luminara Unduli       170  56.2

Now, let’s use mutate() to (temporarily) add a column with the ratio height / mass:

mutate(fem, height / mass)

You can also give a new name, like: ht_wt = height / mass:

mutate(fem, ht_wt = height / mass)

In order to permanently change the data, you need to assign the changes to an object:

# height is recorded in cm and mass in kg
fem2 <- mutate(fem, ht_m = height / 100, wt_lb = mass / 0.4536)
fem2
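To see what "temporarily" means here, the sketch below uses a small made-up tibble (toy is a hypothetical object, not part of the lab data). It shows that mutate() returns a new table and leaves its input untouched unless the result is assigned.

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(name = c("a", "b"), height = c(150, 180))

# the result is printed but not stored, so toy keeps its two columns
mutate(toy, ht_m = height / 100)
names(toy)

# assigning the result keeps the new column
toy2 <- mutate(toy, ht_m = height / 100)
names(toy2)
```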

4) Reordering rows: `arrange()`

The next basic verb of "dplyr" is arrange() which allows you to reorder rows. For example, here’s how to arrange the rows of fem by height:

# order rows by height (increasingly)
arrange(fem, height)

By default arrange() sorts rows in increasing order. To arrange rows in descending order you need to use the auxiliary function desc().

# order rows by height (decreasingly)
arrange(fem, desc(height))

# order rows by height, and then mass
arrange(fem, height, mass)
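Two behaviors of arrange() are worth checking on a small example: additional columns are only used to break ties, and missing values are placed at the end, even when the column is wrapped in desc(). The sketch below assumes a made-up tibble (toy is hypothetical, not part of the lab data).

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(
  name   = c("a", "b", "c", "d"),
  height = c(170, NA, 150, 170),
  mass   = c(80, 60, 55, 70)
)

asc  <- arrange(toy, height)        # increasing height, NA last
dsc  <- arrange(toy, desc(height))  # decreasing height, NA still last
both <- arrange(toy, height, mass)  # mass breaks the tie at height 170

asc$name
dsc$name
both$name
```

This matters when looking for the tallest or heaviest rows: rows with missing values do not end up at the top after sorting with desc().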

4.1) Your Turn:

using the data frame fem, add a new variable product with the product of height and mass.

mutate(fem, product = height*mass)

## # A tibble: 5 x 4
##   name               height  mass product
##   <chr>               <int> <dbl>   <dbl>
## 1 Leia Organa           150  49      7350
## 2 Beru Whitesun lars    165  75     12375
## 3 Ayla Secura           178  55      9790
## 4 Adi Gallia            184  50      9200
## 5 Luminara Unduli       170  56.2    9554

create a new data frame fem3, by adding columns log_height and log_mass with the log transformations of height and mass.

fem3 <- mutate(fem, log_height = log10(height), log_mass = log10(mass))
fem3

## # A tibble: 5 x 5
##   name               height  mass log_height log_mass
##   <chr>               <int> <dbl>      <dbl>    <dbl>
## 1 Leia Organa           150  49         2.18     1.69
## 2 Beru Whitesun lars    165  75         2.22     1.88
## 3 Ayla Secura           178  55         2.25     1.74
## 4 Adi Gallia            184  50         2.26     1.70
## 5 Luminara Unduli       170  56.2       2.23     1.75

use the original data frame to filter() and arrange() those individuals with height less than 150 cm tall, in increasing order by height.

height_order_150 <- filter(starwars, height < 150)
arrange(height_order_150, height)

## # A tibble: 10 x 14
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Yoda      66    17 white      green      brown            896 male  mascu…
##  2 Ratt…     79    15 none       grey, blue unknown           NA male  mascu…
##  3 Wick…     88    20 brown      brown      brown              8 male  mascu…
##  4 Dud …     94    45 none       blue, grey yellow            NA male  mascu…
##  5 R2-D2     96    32 <NA>       white, bl… red               33 none  mascu…
##  6 R4-P…     96    NA none       silver, r… red, blue         NA none  femin…
##  7 R5-D4     97    32 <NA>       white, red red               NA none  mascu…
##  8 Sebu…    112    40 none       grey, red  orange            NA male  mascu…
##  9 Gasg…    122    NA none       white, bl… black             NA male  mascu…
## 10 Watto    137    NA black      blue, grey yellow            NA male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

display the name, homeworld, and species, of the top-5 tallest individuals.

top_height <- arrange(starwars, desc(height))
top5_height <- slice(top_height, 1:5)
select(top5_height, name, homeworld, species)

## # A tibble: 5 x 3
##   name         homeworld species 
##   <chr>        <chr>     <chr>   
## 1 Yarael Poof  Quermia   Quermian
## 2 Tarfful      Kashyyyk  Wookiee 
## 3 Lama Su      Kamino    Kaminoan
## 4 Chewbacca    Kashyyyk  Wookiee 
## 5 Roos Tarpals Naboo     Gungan

display the name, homeworld, and species, for the top-5 heaviest individuals.

top_weight <- arrange(starwars, desc(mass))
top5_weight <- slice(top_weight, 1:5)
select(top5_weight, name, homeworld, species)

## # A tibble: 5 x 3
##   name                  homeworld species
##   <chr>                 <chr>     <chr>  
## 1 Jabba Desilijic Tiure Nal Hutta Hutt   
## 2 Grievous              Kalee     Kaleesh
## 3 IG-88                 <NA>      Droid  
## 4 Darth Vader           Tatooine  Human  
## 5 Tarfful               Kashyyyk  Wookiee

5) Summarizing values with `summarise()`

The next verb is summarise(). Conceptually, this involves applying a function on one or more columns, in order to summarize values. This is probably easier to understand with one example.

Say you are interested in calculating the average height of all individuals. To do this “a la dplyr” you use summarise(), or its synonym function summarize():

# average height (removing missing values)
summarise(starwars, avg_height = mean(height, na.rm = TRUE))

## # A tibble: 1 x 1
##   avg_height
##        <dbl>
## 1       174.

What if you want to calculate some summary statistics for height: min, median, mean, and max?

# some stats for height (dplyr)
summarise(
  starwars, 
  min = min(height, na.rm = TRUE),
  median = median(height, na.rm = TRUE),
  avg = mean(height, na.rm = TRUE),
  max = max(height, na.rm = TRUE)
)

## # A tibble: 1 x 4
##     min median   avg   max
##   <int>  <int> <dbl> <int>
## 1    66    180  174.   264
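The argument na.rm = TRUE appears in every call above because most summary functions return NA as soon as one value is missing. A minimal sketch with a made-up tibble (toy is hypothetical, not part of the lab data):

```r
library(dplyr)

# hypothetical toy data with one missing height
toy <- tibble(height = c(150, 170, NA, 190))

stats <- summarise(
  toy,
  avg_with_na = mean(height),                # NA: the missing value propagates
  avg_no_na   = mean(height, na.rm = TRUE),  # mean of the three observed values
  n_missing   = sum(is.na(height))           # how many values were missing
)
stats
```

Reporting the number of missing values next to the summary makes it clear how many observations the statistic is actually based on.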

6) Grouped operations

To actually appreciate the power of summarise(), we need to introduce the other major basic verb in "dplyr": group_by(). This is the function that allows you to perform data aggregations, or grouped operations.

Let’s see the combination of summarise() and group_by() to calculate the average height by homeworld:

# average height, grouped by homeworld
summarise(
  group_by(starwars, homeworld),
  avg_height = mean(height, na.rm = TRUE)
)

## `summarise()` ungrouping output (override with `.groups` argument)

Here’s a fancier example: average mass and height, by homeworld, displayed in descending order by average height:

arrange(
  summarise(
    group_by(starwars, homeworld),
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE)),
  desc(avg_height)
)

## `summarise()` ungrouping output (override with `.groups` argument)
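The message about ungrouping refers to the `.groups` argument of summarise(). The sketch below assumes a dplyr version that supports this argument (1.0.0 or later) and uses a made-up tibble (toy is hypothetical, not part of the lab data). It sets `.groups = "drop"` explicitly and also counts the rows in each group with n().

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(
  homeworld = c("Naboo", "Naboo", "Tatooine", "Tatooine", "Tatooine"),
  height    = c(96, 224, 172, 167, NA)
)

by_world <- summarise(
  group_by(toy, homeworld),
  count      = n(),                          # rows per group, NA included
  avg_height = mean(height, na.rm = TRUE),   # average of the observed heights
  .groups    = "drop"                        # return an ungrouped tibble
)
by_world
```

Stating the grouping of the result explicitly is one way to make the intent clear to the reader of the code.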

6.1) Your turn:

use summarise() to get the largest height value.

summarise(starwars, max_height = max(height, na.rm = TRUE))

## # A tibble: 1 x 1
##   max_height
##        <int>
## 1        264

use summarise() to get the standard deviation of mass.

summarise(starwars, sd_mass = sd(mass, na.rm = TRUE))

## # A tibble: 1 x 1
##   sd_mass
##     <dbl>
## 1    169.

use summarise() and group_by() to display the median of mass, by homeworld

summarise(group_by(starwars, homeworld), median_mass = median(mass, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 49 x 2
##    homeworld      median_mass
##    <chr>                <dbl>
##  1 Alderaan              64  
##  2 Aleen Minor           15  
##  3 Bespin                79  
##  4 Bestine IV           110  
##  5 Cato Neimoidia        90  
##  6 Cerea                 82  
##  7 Champala              NA  
##  8 Chandrila             NA  
##  9 Concord Dawn          79  
## 10 Corellia              78.5
## # … with 39 more rows

display the average mass by sex, in ascending order.

avg_mass_gender <- summarise(group_by(starwars, sex), average_mass = mean(mass, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

arrange(avg_mass_gender, average_mass)

## # A tibble: 5 x 2
##   sex            average_mass
##   <chr>                 <dbl>
## 1 <NA>                   48  
## 2 female                 54.7
## 3 none                   69.8
## 4 male                   81.0
## 5 hermaphroditic       1358

obtain the mean and standard deviation of height, for female characters.

female <- filter(starwars, sex == "female")
summarise(
  female,
  average_height = mean(height, na.rm = TRUE),
  sd_height = sd(height, na.rm = TRUE)
)

## # A tibble: 1 x 2
##   average_height sd_height
##            <dbl>     <dbl>
## 1           169.      15.3

Part II) First contact with `ggplot()`

The package "ggplot2" is probably the most popular package in R to create beautiful static graphics. Compared to the functions in the base package "graphics", the package "ggplot2" follows a somewhat different philosophy, and it tries to be as consistent and modular as possible.

The main function in "ggplot2" is ggplot()
The main input to ggplot() is a data frame object.
You can use the internal function aes() to specify what columns of the data frame will be used for the graphical elements of the plot.
You must specify what kind of geometric objects or geoms will be displayed: e.g. geom_point(), geom_bar(), geom_boxplot().
Pretty much anything else that you want to add to your plot is controlled by auxiliary functions, especially those things that have to do with the format, rather than the underlying data.
The construction of a ggplot is done by adding layers with the + operator.
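The points above can be sketched by building a plot one piece at a time. This is only an illustration with a made-up data frame (toy and the axis labels are hypothetical, not part of the lab data). Each call to `+` returns a new plot object, and nothing is drawn until the object is printed.

```r
library(ggplot2)

# hypothetical toy data, only for illustration
toy <- data.frame(height = c(150, 165, 178), mass = c(49, 75, 55))

# data and aesthetics only: no geoms yet
p <- ggplot(data = toy, aes(x = height, y = mass))

# add a layer of points
p <- p + geom_point()

# formatting is handled by auxiliary functions such as labs()
p <- p + labs(x = "height (cm)", y = "mass (kg)")

# typing p at the console (or calling print(p)) draws the plot
inherits(p, "ggplot")
```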

7) Scatterplots

Let’s start with a scatterplot of height and mass

# scatterplot (option 1)
ggplot(data = starwars) +
  geom_point(aes(x = height, y = mass))

## Warning: Removed 28 rows containing missing values (geom_point).

ggplot() creates an object of class "ggplot"
the main input for ggplot() is data which must be a data frame
then we use the "+" operator to add a layer
the geometric objects (geoms) are points: geom_point()
aes() is used to specify the x and y coordinates, by taking columns height and mass from the data frame

The same scatterplot can also be created with this alternative, and more common, use of ggplot():

# scatterplot (option 2)
ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point()

7.1) Adding color

Say you want to color code the points in terms of gender

# colored scatterplot 
ggplot(data = starwars, aes(x = height, y = mass)) +
  geom_point(aes(color = gender))

## Warning: Removed 28 rows containing missing values (geom_point).
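Color is placed inside aes() above because it depends on a column of the data. To give every point the same fixed color, the argument goes outside aes() instead. A minimal sketch with a made-up data frame (toy and its group column are hypothetical, not part of the lab data):

```r
library(ggplot2)

# hypothetical toy data, only for illustration
toy <- data.frame(
  height = c(150, 165, 178, 184),
  mass   = c(49, 75, 55, 50),
  group  = c("a", "b", "a", "b")
)

# mapping: color varies with a column, so it goes inside aes()
p_mapped <- ggplot(toy, aes(x = height, y = mass)) +
  geom_point(aes(color = group))

# setting: one fixed color for all points, outside aes()
p_fixed <- ggplot(toy, aes(x = height, y = mass)) +
  geom_point(color = "blue")
```

Printing p_mapped shows a legend for group, while p_fixed has no legend because its color does not encode any variable.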

7.2) Your turn:

Open the ggplot2 cheatsheet.

# your code

Use the data frame fem to make a scatterplot of height and mass.

ggplot(data = fem, aes(x = height, y = mass)) + geom_point()

Find out how to make another scatterplot of height and mass, using geom_text() to display the names of the individuals

ggplot(data = fem, aes(x = height, y = mass)) + geom_text(aes(label = name))

Get a scatter plot of height and mass, for ALL the females, displaying their names with geom_label().

ggplot(data = filter(starwars, sex == "female"), aes(x = height, y = mass)) + geom_label(aes(label = name))

## Warning: Removed 7 rows containing missing values (geom_label).

Get a histogram of mass (for all individuals).

ggplot(data = starwars, aes(x = mass)) + geom_histogram(binwidth = 1)

## Warning: Removed 28 rows containing non-finite values (stat_bin).

Get a density plot of height (for all individuals).

ggplot(data = starwars, aes(x = height)) + geom_density(kernel = "gaussian")

## Warning: Removed 6 rows containing non-finite values (stat_density).

Get a barchart of the gender frequencies (for all individuals).

ggplot(starwars, aes(x = gender)) + geom_bar()

Lab 2: First contact with dplyr and ggplot2

Stat 133, Fall 2020
