library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following object is masked from 'package:datasets':
## 
##     cars

I chose the satGPA dataset to demonstrate some of the tools in dplyr. First, we look at the data to see whether there are any anomalies or outliers that need to be cleaned up.

glimpse(satGPA)
## Observations: 1,000
## Variables: 6
## $ sex    <int> 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2...
## $ SATV   <int> 65, 58, 56, 42, 55, 55, 57, 53, 67, 41, 58, 45, 43, 50,...
## $ SATM   <int> 62, 64, 60, 53, 52, 56, 65, 62, 77, 44, 70, 57, 45, 58,...
## $ SATSum <int> 127, 122, 116, 95, 107, 111, 122, 115, 144, 85, 128, 10...
## $ HSGPA  <dbl> 3.40, 4.00, 3.75, 3.75, 4.00, 4.00, 2.80, 3.80, 4.00, 2...
## $ FYGPA  <dbl> 3.18, 3.33, 3.25, 2.42, 2.63, 2.91, 2.83, 2.51, 3.82, 2...
summary(satGPA)
##       sex             SATV            SATM          SATSum     
##  Min.   :1.000   Min.   :24.00   Min.   :29.0   Min.   : 53.0  
##  1st Qu.:1.000   1st Qu.:43.00   1st Qu.:49.0   1st Qu.: 93.0  
##  Median :1.000   Median :49.00   Median :55.0   Median :103.0  
##  Mean   :1.484   Mean   :48.93   Mean   :54.4   Mean   :103.3  
##  3rd Qu.:2.000   3rd Qu.:54.00   3rd Qu.:60.0   3rd Qu.:113.0  
##  Max.   :2.000   Max.   :76.00   Max.   :77.0   Max.   :144.0  
##      HSGPA           FYGPA      
##  Min.   :1.800   Min.   :0.000  
##  1st Qu.:2.800   1st Qu.:1.980  
##  Median :3.200   Median :2.465  
##  Mean   :3.198   Mean   :2.468  
##  3rd Qu.:3.700   3rd Qu.:3.020  
##  Max.   :4.500   Max.   :4.000
str(satGPA)
## 'data.frame':    1000 obs. of  6 variables:
##  $ sex   : int  1 2 2 1 1 2 1 1 2 1 ...
##  $ SATV  : int  65 58 56 42 55 55 57 53 67 41 ...
##  $ SATM  : int  62 64 60 53 52 56 65 62 77 44 ...
##  $ SATSum: int  127 122 116 95 107 111 122 115 144 85 ...
##  $ HSGPA : num  3.4 4 3.75 3.75 4 4 2.8 3.8 4 2.6 ...
##  $ FYGPA : num  3.18 3.33 3.25 2.42 2.63 2.91 2.83 2.51 3.82 2.54 ...
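
Beyond reading the summaries by eye, a check like this can be written down in dplyr. The following is a minimal sketch, not part of the original analysis: `toy` is a made-up stand-in for satGPA, and the cutoffs (a first-year GPA of 0, a high school GPA above 4) are assumptions prompted by the minimum and maximum values that summary() reports above.

```r
library(dplyr)

# Toy data standing in for satGPA; the values are made up for illustration.
toy <- tibble(
  SATSum = c(127L, 122L, 95L, 144L),
  HSGPA  = c(3.40, 4.00, 4.50, 3.75),
  FYGPA  = c(3.18, 0.00, 2.42, NA)
)

# Count the missing values in every column.
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Pull out rows that look suspicious under the assumed cutoffs.
# Rows where the condition evaluates to NA are dropped by filter().
suspect <- toy %>%
  filter(FYGPA <= 0 | HSGPA > 4)

na_counts
suspect
```

On the real data, `toy` would be replaced by `satGPA`. Whether a flagged row is an error or a legitimate value (for example, a weighted GPA above 4) is a judgment call the code cannot make.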

This dataset contains 1,000 observations and 6 variables detailing SAT and GPA data for students. Now that we have looked at the data, we want to see how a student's SAT score relates to their GPA in high school and in their first year of college. To do this, we will create a new, smaller data frame containing only the total SAT score (SATSum), high school GPA (HSGPA), first-year GPA (FYGPA), and sex so that we can represent the data graphically later. We can do this with select() and the pipe operator %>%, which keep only the columns we name and drop everything else. I will name the result weekly1. In addition, so that our graphs have readable labels later, we will replace the numeric codes 1 and 2 in sex with the labels "Female" and "Male". This is the mapping the code below uses; it is worth confirming against the openintro documentation for satGPA.

weekly1 <- satGPA %>%
  select(SATSum, HSGPA, FYGPA, sex)

weekly1$sex[weekly1$sex == 1] <- "Female"
weekly1$sex[weekly1$sex == 2] <- "Male"

glimpse(weekly1)
## Observations: 1,000
## Variables: 4
## $ SATSum <int> 127, 122, 116, 95, 107, 111, 122, 115, 144, 85, 128, 10...
## $ HSGPA  <dbl> 3.40, 4.00, 3.75, 3.75, 4.00, 4.00, 2.80, 3.80, 4.00, 2...
## $ FYGPA  <dbl> 3.18, 3.33, 3.25, 2.42, 2.63, 2.91, 2.83, 2.51, 3.82, 2...
## $ sex    <chr> "Female", "Male", "Male", "Female", "Female", "Male", "...
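
The selection and the recoding can also be done in a single pipeline. The following is a sketch of one alternative, not the method used above: it builds `toy` from the first four rows that glimpse(satGPA) printed earlier so that it runs on its own, and it keeps the same mapping as the base R lines (1 becomes "Female", 2 becomes "Male"), which remains an assumption about how the dataset codes sex.

```r
library(dplyr)

# First four rows of satGPA as printed by glimpse() earlier.
toy <- tibble(
  sex    = c(1L, 2L, 2L, 1L),
  SATV   = c(65L, 58L, 56L, 42L),
  SATM   = c(62L, 64L, 60L, 53L),
  SATSum = c(127L, 122L, 116L, 95L),
  HSGPA  = c(3.40, 4.00, 3.75, 3.75),
  FYGPA  = c(3.18, 3.33, 3.25, 2.42)
)

toy_small <- toy %>%
  # Keep only the four columns needed for plotting.
  select(SATSum, HSGPA, FYGPA, sex) %>%
  # Same mapping as the base R version: 1 -> "Female", 2 -> "Male".
  mutate(sex = if_else(sex == 1, "Female", "Male"))

glimpse(toy_small)
```

Keeping the recoding inside mutate() means the original data frame is never modified in place, which makes the step easier to rerun.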

Now we will create scatterplots to show the connection between SAT scores and GPA, using ggplot2. In the ggplot() call, we decide which variables are shown along the x and y axes; the layers added to it control the kind of plot and cosmetic details such as transparency. In this example, geom_point() draws the scatterplot and geom_smooth() adds a smoothed trend curve showing how y changes with x. With 1,000 observations, geom_smooth() defaults to method = 'gam' rather than loess, as the message printed under each plot reports. By mapping color inside aes(), we can color the points by any available column; here I chose sex. The alpha argument in geom_point() sets the transparency of the points, which is useful when many observations overlap in the same area.

g1 <- ggplot(weekly1, aes(x = SATSum, y = HSGPA, color = sex)) +
  geom_point(alpha = 0.3) +
  geom_smooth(color = "red")

g2 <- ggplot(weekly1, aes(x = SATSum, y = FYGPA, color = sex)) +
  geom_point(alpha = 0.3) +
  geom_smooth(color = "red")

g1
## `geom_smooth()` using method = 'gam'

g2
## `geom_smooth()` using method = 'gam'
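
The messages above show that geom_smooth() chose method = 'gam' on its own. If a loess curve is specifically wanted, the method can be requested explicitly. The following is a minimal sketch on simulated data: `toy`, its columns `x` and `y`, and the numbers used to generate them are all made up for illustration.

```r
library(ggplot2)

# Simulated data: a noisy, roughly linear relationship.
set.seed(1)
toy <- data.frame(x = runif(200, 50, 150))
toy$y <- 1 + 0.02 * toy$x + rnorm(200, sd = 0.4)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(alpha = 0.3) +
  # Ask for loess explicitly instead of relying on the automatic choice.
  geom_smooth(method = "loess", formula = y ~ x, color = "red")

p
```

Applied to the plots above, the same change would mean passing method = "loess" to geom_smooth() in g1 and g2. Loess can be slow on large datasets, which is why ggplot2 switches away from it automatically.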

Based on the graphs, we can see a positive association between SAT scores and high school GPA, as well as between SAT scores and a student’s first-year college GPA.
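
A visual impression like this can be backed up with a number. The following is a sketch of how the correlations could be computed with dplyr. To stay self-contained, it uses only the first ten rows printed by str(satGPA) earlier, so its output says nothing reliable about the full dataset; the names `toy`, `r_hs`, and `r_fy` are illustrative.

```r
library(dplyr)

# First ten rows of satGPA, copied from the str() output earlier.
toy <- tibble(
  SATSum = c(127, 122, 116, 95, 107, 111, 122, 115, 144, 85),
  HSGPA  = c(3.4, 4, 3.75, 3.75, 4, 4, 2.8, 3.8, 4, 2.6),
  FYGPA  = c(3.18, 3.33, 3.25, 2.42, 2.63, 2.91, 2.83, 2.51, 3.82, 2.54)
)

# Pearson correlation of total SAT score with each GPA measure.
cors <- toy %>%
  summarise(
    r_hs = cor(SATSum, HSGPA),
    r_fy = cor(SATSum, FYGPA)
  )

cors
```

Running the same summarise() call on weekly1 would give the correlations for all 1,000 students. Adding group_by(sex) before summarise() would report them separately for each group.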