How to load data:

  1. Getting Started with Homework Videos
  2. Modern Statistics with R, Chapter 2
  3. R for Data Science, Chapter 1

Creating new data frames:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cars_df <- mtcars

To make a data frame with only one variable, add %>% select(mpg) after the name of the data frame you are selecting from, and assign the result to the desired data frame name.

(Modern Statistics with R, Chapter 2)
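
The single-variable selection described above can be sketched as follows. This is a minimal example that assumes the dplyr package (part of the tidyverse) and the built-in mtcars data; the name mpg_df is only illustrative and does not come from the course materials.

```r
library(dplyr)

# Copy the built-in mtcars data into a new data frame
cars_df <- mtcars

# Keep only the mpg column; the result is still a data frame,
# not a bare vector
mpg_df <- cars_df %>% select(mpg)

names(mpg_df)
nrow(mpg_df)
```

Assigning to a new name (mpg_df) leaves cars_df unchanged, which is handy when the full data frame is still needed later.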

Modifying existing data frames:

cars_df <- cars_df %>% select(mpg, hp)
  1. Modern Statistics with R, Chapter 2
  2. Weekly lectures 2 - 5

Code chunks must open and close with three backticks (“```”). Blank lines can be added inside a chunk with the Enter key.

(Any other delimiter will not work.)

  1. Weekly lectures 2 - 5

Removing NA values:

air <- airquality

air <- air %>% select(Ozone, Month, Day) %>% na.omit()

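Do not call na.omit() on the entire data set without first selecting the variables you need. It drops every row that has an NA in any column, so rows get removed because of columns you are not even using.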

  1. Weekly lectures 2 - 5
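
The effect of selecting variables before removing NA values can be checked by counting rows. The sketch below assumes dplyr and the built-in airquality data; the variable names n_all, n_omit_all, and n_omit_selected are illustrative only.

```r
library(dplyr)

# airquality ships with R; Ozone and Solar.R contain NA values
n_all <- nrow(airquality)

# na.omit() on the whole data set drops a row if ANY column has an NA
n_omit_all <- nrow(na.omit(airquality))

# Selecting the needed variables first keeps rows whose only NA
# was in a column we are not using (here, Solar.R)
n_omit_selected <- airquality %>%
  select(Ozone, Month, Day) %>%
  na.omit() %>%
  nrow()

c(all = n_all, omit_all = n_omit_all, omit_selected = n_omit_selected)
```

More rows survive when the selection comes first, because missing Solar.R values no longer cause a row to be dropped.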

The dollar sign specifies columns/variables:

summary(air$Ozone)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   18.00   31.50   42.13   63.25  168.00
summary(air$Month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   6.000   7.000   7.198   8.250   9.000
# with $, summary() takes one column at a time; pass a data frame (such as the whole dataset) to summarize several columns at once

summary(air)
##      Ozone            Month            Day       
##  Min.   :  1.00   Min.   :5.000   Min.   : 1.00  
##  1st Qu.: 18.00   1st Qu.:6.000   1st Qu.: 8.00  
##  Median : 31.50   Median :7.000   Median :16.00  
##  Mean   : 42.13   Mean   :7.198   Mean   :15.53  
##  3rd Qu.: 63.25   3rd Qu.:8.250   3rd Qu.:22.00  
##  Max.   :168.00   Max.   :9.000   Max.   :31.00
  1. Modern Statistics with R, Chapter 2
  2. R for Data Science, Chapter 10
  3. Weekly lectures 2 - 5
  4. Data Selection assignment description (note: the two variables do not have to be in the same data frame to run a correlation)
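
As a small illustration of column access, the sketch below uses base R only and the built-in airquality data. It shows the $ form next to one possible way of summarizing a chosen pair of columns without summarizing the whole data set; the bracket-subsetting step is an addition here, not something taken from the lectures.

```r
# Base R only; airquality ships with R
air <- na.omit(airquality[, c("Ozone", "Month", "Day")])

# $ pulls out a single column as a plain vector
ozone <- air$Ozone
summary(ozone)

# To summarize two columns at once, subset the data frame
# by column name and pass the result to summary()
two_cols <- air[, c("Ozone", "Month")]
summary(two_cols)
```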

Correlation: both variables need to be free of NA and both need to be specified with a “$”:

cor(cars_df$mpg, cars_df$hp)
## [1] -0.7761684

mpg and horsepower are negatively correlated: the coefficient is about -0.78, so cars with more horsepower tend to get fewer miles per gallon.
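
To see why both variables need to be free of NA, the sketch below runs cor() on the built-in airquality data, where Ozone has missing values. It uses base R only. The second option, the use = "complete.obs" argument of cor(), is an alternative that these notes do not cover, and the names cor_with_na, cor_clean, and cor_pairs are illustrative.

```r
# cor() returns NA if either vector contains an NA
cor_with_na <- cor(airquality$Ozone, airquality$Temp)

# Option 1: drop incomplete rows first, as in the notes
air_ot <- na.omit(airquality[, c("Ozone", "Temp")])
cor_clean <- cor(air_ot$Ozone, air_ot$Temp)

# Option 2: ask cor() to skip incomplete pairs itself
cor_pairs <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")

cor_with_na
cor_clean
cor_pairs
```

Both options give the same coefficient, because each keeps exactly the rows where Ozone and Temp are both present.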

Plotting is more difficult, but cheatsheets can be found online (the professor is going to provide one).

The first argument to ggplot() is the data frame, aes() says which variables go on which axis, and + geom_point() takes everything before the + and draws it as a scatter plot.

ggplot(cars_df, aes(x = mpg, y = hp)) + geom_point()

  1. Modern Statistics with R, Chapter 2
  2. Weekly lectures 2 - 5
  3. Data Selection assignment description
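
The plotting call can be built up one layer at a time, which makes the role of each piece easier to see. This is a minimal sketch assuming the ggplot2 package; the axis labels added with labs() and the object name p are illustrative and not part of the original notes.

```r
library(ggplot2)

# Same two columns as cars_df in the notes
cars_df <- mtcars[, c("mpg", "hp")]

# Data frame first, then aes() maps variables to axes,
# then each + adds a layer on top of what came before
p <- ggplot(cars_df, aes(x = mpg, y = hp)) +
  geom_point() +
  labs(x = "Miles per gallon", y = "Horsepower")

# Printing the object draws the plot
p
```

Storing the plot in an object means more layers can be added later with p + ..., without retyping the whole call.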