Master_data_science_in

MASTER DATA SCIENCE IN R

library(tidyverse)

## -- Attaching packages ----------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

EXERCISE 3.2.4

1. Run ggplot(data = mpg). What do you see?

- You see a blank plot: ggplot(data = mpg) only creates an empty canvas, and no geom layer or x/y mappings have been added yet.

ggplot(data = mpg)

2. How many rows are in mpg?

- 234 rows and 11 variables.
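The row and column counts can be checked directly rather than read off the printed tibble. A minimal sketch, assuming only that ggplot2 (which ships the mpg dataset) is installed; the names n_rows and n_cols are illustrative:

```r
library(ggplot2)  # provides the mpg dataset

# Dimensions of mpg: rows first, then columns
n_rows <- nrow(mpg)
n_cols <- ncol(mpg)
dim(mpg)
```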

3. What does the drv variable describe? Read the help for ?mpg to find out.

- drv describes the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.

4. Make a scatterplot of hwy vs cyl.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = hwy))

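5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

The plot is not useful because class and drv are both categorical, so the points can only fall on the few class/drv combinations and many observations are drawn on top of each other. The plot shows which combinations exist but not how many cars each contains.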

ggplot(data = mpg)+
  geom_point(mapping = aes(x=class,y=drv))
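One possible way to recover the information the scatterplot hides is to count the observations at each combination. The sketch below is an illustration rather than part of the exercise: it tabulates class against drv with base R and uses geom_count(), which sizes each point by the number of rows it represents. The names combo_counts and p_count are made up for the example.

```r
library(ggplot2)

# Number of cars at each class/drv combination; a plain scatterplot
# draws all of them on top of one another as a single point.
combo_counts <- table(mpg$class, mpg$drv)
combo_counts

# geom_count() scales each point by the number of observations it represents.
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
p_count
```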

EXERCISE 3.3.1

1. What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

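The points are not blue because color = "blue" sits inside aes(), where it is treated as a variable with the single value "blue" and mapped to the default colour scale. To set the colour manually, place the argument outside the aesthetics brackets.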

Correction:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy),color = "blue")
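The difference between mapping and setting a colour can be inspected without looking at the plot. A rough sketch, assuming layer_data() from ggplot2 is used to pull out the values each layer is drawn with; the object names are illustrative:

```r
library(ggplot2)

# "blue" inside aes() is mapped: ggplot2 treats it as a one-level variable
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# "blue" outside aes() is set: every point gets that literal colour
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the values actually used to draw each layer
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours <- unique(layer_data(p_set)$colour)
mapped_colours
set_colours
```

In the mapped version the single colour comes from the default discrete palette, not from the string "blue".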

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

mpg

## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4      2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4      2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows

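Continuous variables: displ, cty, hwy. The integer columns cyl and year are also numeric, but they take only a few distinct values, so they can reasonably be treated as categorical as well.

Categorical variables: manufacturer, model, trans, drv, fl, class.

When you run mpg, the type is printed under each column name: <chr> marks categorical (character) columns, while <dbl> and <int> mark numeric ones.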
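The column types can also be listed programmatically. This is a small sketch using base R only (apart from loading mpg); col_types and n_distinct_values are names chosen for the example:

```r
library(ggplot2)

# First class of every column: "character" suggests categorical,
# "integer"/"numeric" suggests a numeric (possibly continuous) variable.
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# The number of distinct values helps judge whether an integer column
# such as cyl or year is better treated as categorical.
n_distinct_values <- vapply(mpg, function(col) length(unique(col)), integer(1))
n_distinct_values[c("cyl", "year")]
```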

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cyl, color = hwy, shape = class, size = cyl))

## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.

## Warning: Removed 62 rows containing missing values (geom_point).

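For continuous variables, color is shown as a gradient and size as a graded scale, and the legend displays numeric values; a continuous variable cannot be mapped to shape at all, which is why class is used for shape here. For categorical variables, each category gets its own distinct colour or shape, and the legend lists the category names.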
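The plot above maps the categorical class to shape. To see what happens with a numeric variable instead, the sketch below maps cty (an arbitrary illustrative choice) to shape and captures the result of building the plot. It assumes, as in the ggplot2 versions I am aware of, that the failure surfaces as an R error at build time; the exact wording of the message varies between versions, so it is not reproduced here.

```r
library(ggplot2)

p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))

# The error is raised when the plot is built, not when it is defined.
shape_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)
shape_result
```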

4. What happens if you map the same variable to multiple aesthetics?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cyl, color = hwy, shape = class))

## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.

## Warning: Removed 62 rows containing missing values (geom_point).

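Mapping hwy to both x and color works, but it is redundant: the colour gradient repeats the information already given by the x position. The warnings come from shape = class: the default shape palette has only 6 shapes because more are hard to discriminate, so the seventh class (suv, 62 rows) is not drawn.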
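The warning itself suggests specifying shapes manually. A minimal sketch of that, assuming shapes 0 to 6 as an arbitrary set of seven distinct symbols (any seven valid shape codes would do); n_classes, n_suv and p_shapes are illustrative names:

```r
library(ggplot2)

# class has seven levels, one more than the default shape palette provides
n_classes <- length(unique(mpg$class))
n_classes

# Assumption: shapes 0 to 6 are an arbitrary choice of seven distinct symbols
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cyl, shape = class)) +
  scale_shape_manual(values = 0:6)
p_shapes

# Rows in the suv class, the level the default palette leaves without a shape
n_suv <- sum(mpg$class == "suv")
n_suv
```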

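5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

The stroke aesthetic controls the width of the border. It works with shapes that have a border, such as the filled shapes 21 to 25, whose border colour (colour) and inside colour (fill) can be set separately.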

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 5)

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = displ, color = displ < 5))
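In this plot the expression displ < 5 is evaluated for every row, and the resulting TRUE/FALSE values are treated as a two-level discrete variable, so the points are coloured by whether displ is below 5. The sketch below makes that intermediate step explicit; is_small_engine is a name invented for the example.

```r
library(ggplot2)

# The comparison is evaluated row by row and yields a logical vector
is_small_engine <- mpg$displ < 5
class(is_small_engine)
table(is_small_engine)

# Equivalent plot with the derived variable named explicitly
ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = displ, color = is_small_engine))
```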

EXERCISE 3.5.1

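1. What happens if you facet on a continuous variable?

The continuous variable is treated as categorical, so each distinct value gets its own facet. With displ this produces many narrow panels.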

ggplot(data = mpg)+
  geom_point(mapping = aes(x=hwy,y=cyl))+
  facet_grid(.~displ)
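The number of panels equals the number of distinct displ values. If fewer panels are wanted, one option (not part of the exercise) is to bin the variable first; the sketch below uses ggplot2's cut_width() with a bin width of 1, which is an arbitrary illustrative choice.

```r
library(ggplot2)

# One panel is drawn per distinct value of the faceting variable
n_panels <- length(unique(mpg$displ))
n_panels

# Assumption: a bin width of 1 litre is an arbitrary illustrative choice
p_binned <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cyl)) +
  facet_grid(. ~ cut_width(displ, 1))
p_binned
```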

2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

The empty cells in facet_grid(drv ~ cyl) mean that there are no cars with that combination of drv and cyl. They correspond to the positions in the plot above where no point is drawn.
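The missing combinations can be confirmed with a cross-tabulation. A small sketch with base R's table(); drv_cyl and n_empty are names chosen for the example:

```r
library(ggplot2)

# Cross-tabulate drv against cyl; a zero marks an empty facet
drv_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_cyl

# Number of drv/cyl combinations with no cars at all
n_empty <- sum(drv_cyl == 0)
n_empty
```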

3. What plots does the following code make? What does . do?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

The . is a placeholder meaning that no variable is used for that dimension. facet_grid(drv ~ .) facets by drv in rows only, so the panels are stacked vertically and share the x axis.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~drv)

facet_grid(. ~ drv) facets by drv in columns only, so the panels sit side by side and share the y axis.

4. Take the first faceted plot in this section: What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

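With faceting, each class gets its own panel, so individual classes are clearly visible and do not overlap; the drawback is that comparing values across panels is harder.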

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy,color=class))

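With coloring, all classes share one panel, so you can see how they cluster relative to each other; the drawback is that overlapping points and similar colours make individual classes hard to tell apart. With larger datasets this overplotting gets worse, so faceting becomes more useful.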

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

#nrow - specifies the number of rows of panels.
#ncol - specifies the number of columns of panels.

Other arguments that control the panel layout include as.table and dir.

facet_grid() has no nrow and ncol arguments because the number of rows and columns is determined by the number of unique values of the faceting variables.
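A short sketch of the contrast, reusing the displ/hwy plot from earlier in this section: facet_wrap() lets nrow or ncol choose the layout, whereas for facet_grid() the grid size follows from the data. The names base_plot, n_grid_rows and n_grid_cols are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Seven classes wrapped into two rows
base_plot + facet_wrap(~ class, nrow = 2)

# The same panels arranged in two columns instead
base_plot + facet_wrap(~ class, ncol = 2)

# facet_grid(): the grid size is fixed by the unique values of drv and cyl
n_grid_rows <- length(unique(mpg$drv))
n_grid_cols <- length(unique(mpg$cyl))
base_plot + facet_grid(drv ~ cyl)
```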

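6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

If the variable with more levels were placed in the rows, each panel would become very short and the y-axis would shrink, making it harder to see the points. Plots are usually wider than they are tall, so the columns have more room for many levels.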

Shamim Rashid

13th November, 2019