Setting the working directory and loading libraries.
getwd()
## [1] "C:/Users/tomas/Downloads/eda-course-materials"
setwd("C:/Users/tomas/Downloads/eda-course-materials")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(dslabs)
Accessing the data sets.
states_info <- read.csv('C:/Users/tomas/Downloads/eda-course-materials/lesson2/stateData.csv')
data("mtcars")
Use the subset syntax to access variable information.
DATASET[ROW, COLUMN]
Leave COLUMN blank to show all columns.
states_info[states_info$state.region==1, ]
## X state.abb state.area state.region population income illiteracy
## 7 Connecticut CT 5009 1 3100 5348 1.1
## 19 Maine ME 33215 1 1058 3694 0.7
## 21 Massachusetts MA 8257 1 5814 4755 1.1
## 29 New Hampshire NH 9304 1 812 4281 0.7
## 30 New Jersey NJ 7836 1 7333 5237 1.1
## 32 New York NY 49576 1 18076 4903 1.4
## 38 Pennsylvania PA 45333 1 11860 4449 1.0
## 39 Rhode Island RI 1214 1 931 4558 1.3
## 45 Vermont VT 9609 1 472 3907 0.6
## life.exp murder highSchoolGrad frost area
## 7 72.48 3.1 56.0 139 4862
## 19 70.39 2.7 54.7 161 30920
## 21 71.83 3.3 58.5 103 7826
## 29 71.23 3.3 57.6 174 9027
## 30 70.93 5.2 52.5 115 7521
## 32 70.55 10.9 52.7 82 47831
## 38 70.43 6.1 50.2 126 44966
## 39 71.90 2.4 46.4 127 1049
## 45 71.64 5.5 57.1 168 9267
The str() function gives us the variable names and their types.
The summary() function gives us an idea of the values a variable can take on.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
R uses a single & for the element-wise logical operator AND.
It uses a single | for the element-wise logical operator OR.
mtcars[mtcars$mpg < 14 | mtcars$disp > 395, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
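To see the difference between the two operators, the same pair of conditions can be combined with & instead of |. This is a small sketch on a copy of the built-in data; cars_demo, either and both are illustrative names.

```r
cars_demo <- datasets::mtcars

# OR: a row is kept if at least one condition holds.
either <- cars_demo[cars_demo$mpg < 14 | cars_demo$disp > 395, ]

# AND: a row is kept only if both conditions hold.
both <- cars_demo[cars_demo$mpg < 14 & cars_demo$disp > 395, ]

nrow(either)
rownames(both)
```

With & only the Cadillac Fleetwood and the Lincoln Continental remain, since the other three rows meet just one of the two conditions.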
You can also create new variables in a data frame.
Let’s say you wanted to have the year of each car’s model.
We can create the variable mtcars$year. Here we’ll assume that all of the models were from 1974.
mtcars$year <- 1974
To drop a variable, subset the data frame and select the variable you want to drop with a negative sign in front of it.
mtcars <- subset(mtcars, select = -year)
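Another common way to drop a variable is to assign NULL to it. The sketch below shows both steps on a copy of the data; cars_demo is an illustrative name.

```r
cars_demo <- datasets::mtcars

# A single value is recycled to every row of the data frame.
cars_demo$year <- 1974
"year" %in% names(cars_demo)

# Assigning NULL removes the column again.
cars_demo$year <- NULL
"year" %in% names(cars_demo)
```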
Conditional if-else with ifelse().
ifelse(CONDITION, VALUE IF TRUE, VALUE IF FALSE)
mtcars$wt
## [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
## [13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
## [25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
cond <- mtcars$wt < 3
cond
## [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [25] FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
mtcars$weight_class <- ifelse(cond, 'light', 'average')
mtcars$weight_class
## [1] "light" "light" "light" "average" "average" "average" "average"
## [8] "average" "average" "average" "average" "average" "average" "average"
## [15] "average" "average" "average" "light" "light" "light" "light"
## [22] "average" "average" "average" "average" "light" "light" "light"
## [29] "average" "light" "average" "light"
cond <- mtcars$wt > 3.5
mtcars$weight_class <- ifelse(cond, 'heavy', mtcars$weight_class)
mtcars$weight_class
## [1] "light" "light" "light" "average" "average" "average" "heavy"
## [8] "average" "average" "average" "average" "heavy" "heavy" "heavy"
## [15] "heavy" "heavy" "heavy" "light" "light" "light" "light"
## [22] "heavy" "average" "heavy" "heavy" "light" "light" "light"
## [29] "average" "light" "heavy" "light"
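The two ifelse() steps above can also be written as one nested call, which avoids the intermediate cond variable. This is a sketch with the same cut-offs (3 and 3.5); wt_demo and class_demo are illustrative names.

```r
wt_demo <- datasets::mtcars$wt

# The inner ifelse() is only consulted where wt_demo < 3 is FALSE.
class_demo <- ifelse(wt_demo < 3, "light",
                     ifelse(wt_demo > 3.5, "heavy", "average"))

# Count the cars in each class.
table(class_demo)
```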
Use rm() to delete objects from the environment.
rm(cond)
table() counts how many times each value of a factor variable occurs.
reddit <- read.csv("C:/Users/tomas/Downloads/eda-course-materials/lesson2/reddit.csv")
table(reddit$employment.status)
##
## Employed full time Freelance
## 14814 1948
## Not employed and not looking for work Not employed, but looking for work
## 682 2087
## Retired Student
## 85 12987
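A table can be indexed, sorted and turned into proportions. The sketch below uses a small made-up vector instead of the reddit file; status and counts are illustrative names and the values are not taken from the survey.

```r
# Made-up values for illustration only.
status <- c("Student", "Retired", "Student", "Freelance", "Student")

counts <- table(status)

# Look up a single count by name.
counts[["Student"]]

# Most frequent value first.
sort(counts, decreasing = TRUE)

# Share of each value instead of raw counts.
prop.table(counts)
```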
To convert a character vector into a factor, use as.factor(x) or factor(x).
%>% is the pipe operator: it passes the result on its left as the first argument of the function on its right.
levels(x) lists the levels of a factor.
reddit <- reddit %>%
  select(id, employment.status, gender, marital.status, military.service,
         children, education, country, state, income.range, dog.cat,
         cheese, age.range, fav.reddit) %>%
  mutate(across(!c(id, fav.reddit), factor))
levels(reddit$education)
## [1] "Associate degree" "Bachelor's degree"
## [3] "Graduate or professional degree" "High school graduate or equivalent"
## [5] "Some college" "Some high school"
## [7] "Trade or Vocational degree"
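The select/mutate pipeline can be tried on a small data frame. This sketch assumes dplyr 1.0 or later (for across()); survey and its values are made up for illustration.

```r
library(dplyr)

# Made-up data frame standing in for the reddit survey.
survey <- data.frame(
  id = 1:4,
  gender = c("F", "M", "F", "M"),
  education = c("Some college", "Bachelor's degree",
                "Some college", "Some high school"),
  stringsAsFactors = FALSE
)

# Keep three columns, then convert every column except id to a factor.
survey <- survey %>%
  select(id, gender, education) %>%
  mutate(across(!id, factor))

str(survey)
levels(survey$education)
```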
To order the levels of a factor, use ordered(x) or factor(x) with ordered = TRUE.
VARIABLE <- ordered(VARIABLE, levels = c("LEVEL 1", "LEVEL 2", "…"))
VARIABLE <- factor(VARIABLE, levels = c("LEVEL 1", "LEVEL 2", "…"), ordered = TRUE)
reddit$age.range <- ordered(reddit$age.range, levels=c( "Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"))
reddit$age.range <- factor(reddit$age.range, levels=c( "Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"), ordered = TRUE)
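Once a factor is ordered, its values can be compared with < and >. The sketch below uses a short made-up vector; ages, age_levels and ages_ord are illustrative names. Note that any value missing from levels would become NA.

```r
# Made-up values for illustration only.
ages <- c("25-34", "Under 18", "18-24", "25-34")
age_levels <- c("Under 18", "18-24", "25-34")

ages_ord <- factor(ages, levels = age_levels, ordered = TRUE)

is.ordered(ages_ord)

# Comparison follows the order given in levels, not alphabetical order.
younger <- ages_ord < "25-34"
younger

# Counts are reported in level order.
table(ages_ord)
```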
To plot a variable with ggplot2, use qplot(data = DATASET, x = VARIABLE). For a factor such as age.range this draws a bar chart of counts rather than a histogram.
qplot(data=reddit, x=age.range)
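The same chart can be built with ggplot() and geom_bar(), which may be preferable because recent ggplot2 releases mark qplot() as deprecated. This is a sketch on made-up data; demo and p are illustrative names.

```r
library(ggplot2)

# Made-up ordered factor standing in for reddit$age.range.
age_levels <- c("Under 18", "18-24", "25-34")
demo <- data.frame(
  age.range = factor(c("25-34", "Under 18", "18-24", "25-34"),
                     levels = age_levels, ordered = TRUE)
)

# geom_bar() counts the rows in each level and draws one bar per level.
p <- ggplot(demo, aes(x = age.range)) +
  geom_bar()

print(p)
```

The bars appear in the order given by levels, which is why ordering the factor first matters.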