Data Analysis with R

Starting up

Setting the working directory and loading libraries.

getwd()

## [1] "C:/Users/tomas/Downloads/eda-course-materials"

setwd("C:/Users/tomas/Downloads/eda-course-materials")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(dslabs)

Accessing the data sets.

states_info <- read.csv('C:/Users/tomas/Downloads/eda-course-materials/lesson2/stateData.csv')
data("mtcars")

Use bracket subsetting to access rows and columns of a data frame.
The general form is Dataset[ROW, COLUMN].
Leave COLUMN blank to show all columns for the selected rows.

states_info[states_info$state.region==1, ]

##                X state.abb state.area state.region population income illiteracy
## 7    Connecticut        CT       5009            1       3100   5348        1.1
## 19         Maine        ME      33215            1       1058   3694        0.7
## 21 Massachusetts        MA       8257            1       5814   4755        1.1
## 29 New Hampshire        NH       9304            1        812   4281        0.7
## 30    New Jersey        NJ       7836            1       7333   5237        1.1
## 32      New York        NY      49576            1      18076   4903        1.4
## 38  Pennsylvania        PA      45333            1      11860   4449        1.0
## 39  Rhode Island        RI       1214            1        931   4558        1.3
## 45       Vermont        VT       9609            1        472   3907        0.6
##    life.exp murder highSchoolGrad frost  area
## 7     72.48    3.1           56.0   139  4862
## 19    70.39    2.7           54.7   161 30920
## 21    71.83    3.3           58.5   103  7826
## 29    71.23    3.3           57.6   174  9027
## 30    70.93    5.2           52.5   115  7521
## 32    70.55   10.9           52.7    82 47831
## 38    70.43    6.1           50.2   126 44966
## 39    71.90    2.4           46.4   127  1049
## 45    71.64    5.5           57.1   168  9267
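The same bracket syntax can also restrict the columns. The following is a minimal sketch using the built-in mtcars data; the object names four_cyl and mpg_wt and the choice of columns are only for illustration.

```r
# Rows: cars with 4 cylinders; columns: only mpg and wt.
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "wt")]
head(four_cyl, 3)

# Leaving ROW blank keeps every row and returns just the chosen columns.
mpg_wt <- mtcars[, c("mpg", "wt")]
dim(mpg_wt)
```

Column names are usually safer than column positions, because they keep working if the column order changes.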

The str() function shows the variable names, their types, and the first few values of each.
The summary() function reports summary statistics for each variable, which gives an idea of the range of values it takes.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
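Both ideas also work on a single column. As a rough sketch (the names mpg_summary and col_types are hypothetical, and the built-in mtcars is assumed to be unmodified):

```r
# summary() on one numeric column returns min, quartiles, median, mean and max.
mpg_summary <- summary(mtcars$mpg)
mpg_summary

# sapply() with class() lists the type of every variable in compact form.
col_types <- sapply(mtcars, class)
col_types
```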

Manipulating the dataset

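R uses a single & for the element-wise logical operator AND.
It uses a single | for the element-wise logical operator OR.
The doubled forms && and || compare single values only, so they are not suitable for subsetting.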

mtcars[mtcars$mpg < 14 | mtcars$disp > 395, ]

##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8  400 175 3.08 3.845 17.05  0  0    3    2
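To see the difference between the two operators, the same conditions can be combined with & instead of |. This is a small sketch on the built-in mtcars data; the names either and both are hypothetical.

```r
# OR keeps a row when either condition holds; AND requires both to hold.
either <- mtcars[mtcars$mpg < 14 | mtcars$disp > 395, ]
both   <- mtcars[mtcars$mpg < 14 & mtcars$disp > 395, ]
rownames(both)

# && and || expect single TRUE/FALSE values, so they suit if() statements.
if (nrow(both) > 0 && nrow(either) >= nrow(both)) {
  message("AND never selects more rows than OR")
}
```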

You can also create new variables in a data frame.
Let’s say you wanted to have the year of each car’s model.
We can create the variable mtcars$year. Here we’ll assume that all of the models were from 1974.

mtcars$year <- 1974

To drop a variable, subset the data frame and select the variable you want to drop with a negative sign in front of it.

mtcars <- subset(mtcars, select = -year)
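A new variable can also be computed from existing columns rather than set to a constant. The sketch below assumes a hypothetical column kpl (kilometres per litre) and works on a copy named cars_copy so that mtcars itself is left untouched.

```r
cars_copy <- mtcars                         # hypothetical working copy
# 1 mile per US gallon is roughly 0.425144 km per litre.
cars_copy$kpl <- cars_copy$mpg * 0.425144
"kpl" %in% names(cars_copy)

# Assigning NULL is another way to drop a column.
cars_copy$kpl <- NULL
"kpl" %in% names(cars_copy)
```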

ifelse() is a vectorized conditional: it tests every element of a vector and returns one value per element.
ifelse(CONDITION, VALUE IF TRUE, VALUE IF FALSE)

mtcars$wt

##  [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070
## [13] 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840
## [25] 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780

cond <- mtcars$wt < 3
cond

##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [25] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

mtcars$weight_class <- ifelse(cond, 'light', 'average')
mtcars$weight_class

##  [1] "light"   "light"   "light"   "average" "average" "average" "average"
##  [8] "average" "average" "average" "average" "average" "average" "average"
## [15] "average" "average" "average" "light"   "light"   "light"   "light"  
## [22] "average" "average" "average" "average" "light"   "light"   "light"  
## [29] "average" "light"   "average" "light"

cond <- mtcars$wt > 3.5
mtcars$weight_class <- ifelse(cond, 'heavy', mtcars$weight_class)
mtcars$weight_class

##  [1] "light"   "light"   "light"   "average" "average" "average" "heavy"  
##  [8] "average" "average" "average" "average" "heavy"   "heavy"   "heavy"  
## [15] "heavy"   "heavy"   "heavy"   "light"   "light"   "light"   "light"  
## [22] "heavy"   "average" "heavy"   "heavy"   "light"   "light"   "light"  
## [29] "average" "light"   "heavy"   "light"
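The two steps above can also be written as one nested ifelse() call. This is only a sketch of an equivalent formulation, using hypothetical standalone vectors wt and weight_class instead of modifying mtcars.

```r
wt <- mtcars$wt

# The inner ifelse() is evaluated only for cars that are not "light".
weight_class <- ifelse(wt < 3, "light",
                       ifelse(wt > 3.5, "heavy", "average"))
table(weight_class)
```

Nesting avoids reusing cond, at the cost of a slightly denser expression.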

Use rm() to delete objects from the environment.

rm(cond)
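One way to confirm that an object is gone is exists(), which takes the object name as a string. A short sketch with a hypothetical temporary object scratch_value:

```r
scratch_value <- 42                      # hypothetical temporary object
exists("scratch_value")                  # TRUE while the object is defined

rm(scratch_value)
still_there <- exists("scratch_value")   # FALSE once it has been removed
still_there
```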

table() counts how many times each value of a categorical variable occurs.

reddit <- read.csv("C:/Users/tomas/Downloads/eda-course-materials/lesson2/reddit.csv")
table(reddit$employment.status)

## 
##                    Employed full time                             Freelance 
##                                 14814                                  1948 
## Not employed and not looking for work    Not employed, but looking for work 
##                                   682                                  2087 
##                               Retired                               Student 
##                                    85                                 12987
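table() also accepts two variables, and prop.table() turns counts into proportions. Since the reddit file may not be at hand, this sketch uses the built-in mtcars data instead; the object names are hypothetical.

```r
# One-way table: number of cars per cylinder count.
cyl_counts <- table(mtcars$cyl)
cyl_counts

# Proportions instead of raw counts.
cyl_share <- prop.table(cyl_counts)
round(cyl_share, 2)

# Two-way table: cylinders (rows) against gears (columns).
cyl_by_gear <- table(cyl = mtcars$cyl, gear = mtcars$gear)
cyl_by_gear
```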

To convert a character vector into a factor, use as.factor(x) or factor(x).
%>% is the pipe operator: it passes the result on its left as the first argument of the function on its right.
levels(x) lists the distinct values (levels) of a factor.

reddit <- reddit %>%
  select(id, employment.status, gender, marital.status, military.service,
         children, education, country, state, income.range, dog.cat,
         cheese, age.range, fav.reddit) %>%
  mutate(across(!c(id, fav.reddit), factor))
levels(reddit$education)

## [1] "Associate degree"                   "Bachelor's degree"                 
## [3] "Graduate or professional degree"    "High school graduate or equivalent"
## [5] "Some college"                       "Some high school"                  
## [7] "Trade or Vocational degree"
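The conversion can be seen on a small scale with a hand-made vector. In this sketch, answers and answers_f are hypothetical names and the values are invented for illustration.

```r
answers <- c("low", "high", "medium", "low")   # hypothetical character vector
answers_f <- factor(answers)

# Without an explicit levels argument, the levels are sorted alphabetically.
answer_levels <- levels(answers_f)
answer_levels

# Internally each value is stored as an integer index into levels().
answer_codes <- as.integer(answers_f)
answer_codes
```

The alphabetical default explains why "Associate degree" comes first in the education levels above, even though it is not the lowest level of schooling.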

To give the levels of a factor a meaningful order, use ordered(x) or factor(x, ordered = TRUE) and list the levels from lowest to highest.
VARIABLE <- ordered(VARIABLE, levels = c("LEVEL 1", "LEVEL 2", "…"))
VARIABLE <- factor(VARIABLE, levels = c("LEVEL 1", "LEVEL 2", "…"), ordered = TRUE)

# Either call alone is sufficient; both produce the same ordered factor.
reddit$age.range <- ordered(reddit$age.range, levels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"))
reddit$age.range <- factor(reddit$age.range, levels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"), ordered = TRUE)
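Once a factor is ordered, its values can be compared and ranked. A minimal sketch with a hypothetical sizes vector:

```r
sizes <- factor(c("medium", "small", "large", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
is.ordered(sizes)

# Comparisons follow the level order, not alphabetical order.
below_large <- sizes < "large"
below_large

# max() and min() are defined for ordered factors.
largest <- as.character(max(sizes))
largest
```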

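To plot the distribution of a variable with ggplot2, use qplot(data = DATASET, x = VARIABLE).
For a factor such as age.range this draws a bar chart of the counts per level; for a numeric variable it draws a histogram.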

qplot(data=reddit, x=age.range)
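qplot() is deprecated in recent ggplot2 releases (3.4.0 and later), so the same chart can be built with ggplot() and geom_bar(). The sketch below assumes ggplot2 is installed and uses a small hypothetical data frame named survey as a stand-in for the reddit data.

```r
library(ggplot2)

# Hypothetical stand-in for the survey data: one ordered factor column.
survey <- data.frame(
  age.range = factor(c("18-24", "25-34", "18-24", "Under 18", "25-34", "18-24"),
                     levels = c("Under 18", "18-24", "25-34"),
                     ordered = TRUE)
)

# geom_bar() counts the rows in each level and draws one bar per level.
p <- ggplot(survey, aes(x = age.range)) +
  geom_bar()
p
```

The bars appear in level order, which is why ordering age.range beforehand matters for the plot.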