data_science

MASTER DATA SCIENCE IN R WEEK 2

library(tidyverse)

## -- Attaching packages --------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

3.6 GEOMETRIC OBJECTS.

1.What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

linechart:geom_line();

boxplot:geom_boxplot();

histogram:geom_histogram();

areachart:geom_area();

# color must be set inside a geom (ggplot() ignores it); a boxplot over a continuous x needs a group
ggplot(data=mpg,mapping=aes(x=displ,y=hwy))+
  geom_smooth(color="blue")+
  geom_boxplot(mapping=aes(group=cut_width(displ,0.5)))+
  geom_area();

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

2.Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

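geom_point() draws the scatterplot and geom_smooth() adds a second layer on top of it. Because color = drv is mapped in ggplot(), both layers are drawn separately for each value of drv. The se argument controls the confidence interval around each smooth: se = TRUE (the default) displays it, se = FALSE removes it.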

#with confidence interval
ggplot(data=mpg, mapping = aes(x = displ, y = hwy, color = drv))+
  geom_point()+
  geom_smooth(se=TRUE);

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#without confidence interval
ggplot(data=mpg, mapping = aes(x = displ, y = hwy, color = drv))+
  geom_point()+
  geom_smooth(se=FALSE);

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

3.What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

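show.legend = FALSE suppresses the legend that the layer would otherwise generate. If you remove it, the default behaviour returns and the legend for the mapped variable is drawn. It was probably used earlier in the chapter because that plot was shown side by side with two plots that had no legend, and dropping the legend keeps the three plots the same size.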
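A small sketch of the effect on the same mpg data: the two plots differ only in show.legend. The object names p_with and p_without are illustrative, and method and formula are spelled out only to silence the loess message.

```r
library(ggplot2)

# Default: the legend for drv is drawn
p_with <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(method = "loess", formula = y ~ x)

# Same layer, but its legend is suppressed
p_without <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(method = "loess", formula = y ~ x, show.legend = FALSE)

p_with
p_without
```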

4.Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

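No, the two graphs will look identical. In the first block the data and mapping are set once in ggplot() and inherited by both layers; in the second block the same data and mapping are repeated inside each geom. Both layers therefore receive exactly the same inputs either way.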
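One way to check this without comparing the pictures by eye is to compare the data ggplot2 computes for each layer. This is a sketch using layer_data(); the names p1, p2, same_points and same_smooth are illustrative, and method and formula are given explicitly only to avoid the loess message.

```r
library(ggplot2)

# Data and mapping inherited from ggplot()
p1 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Data and mapping repeated in every layer
p2 <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy),
              method = "loess", formula = y ~ x)

# layer_data() returns the data frame that is actually drawn for a layer
same_points <- isTRUE(all.equal(layer_data(p1, 1), layer_data(p2, 1)))
same_smooth <- isTRUE(all.equal(layer_data(p1, 2), layer_data(p2, 2)))
c(points = same_points, smooth = same_smooth)
```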

5.Recreate the R code necessary to generate the following graphs.

#1
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(position="jitter") +
  geom_smooth(se = FALSE);

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#2
 ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(position="jitter") +
  geom_smooth(mapping = aes(group = drv), se = FALSE);

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#3
 ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color=drv)) + 
  geom_point(position="jitter") +
  geom_smooth(se = FALSE);

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#4 
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color=drv),position="jitter") +
  geom_smooth(se = FALSE);

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#5
 ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color=drv),position="jitter") +
  geom_smooth(se = FALSE, mapping = aes(linetype = drv));

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#6
# draw larger white points first, then the colored points on top, so each point gets a white border
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(size = 4, color = "white") + 
  geom_point(mapping = aes(color=drv));

3.7 STATISTICAL TRANSFORMATION.

3.7.1 Exercises.

1.What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

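The default geom is geom_pointrange(). The previous plot can be rewritten with that geom by setting stat="summary":

# fun.ymin/fun.ymax/fun.y match ggplot2 3.2.1; from 3.3.0 they are named fun.min/fun.max/fun
ggplot(data=diamonds)+
  geom_pointrange(mapping=aes(x=cut,y=depth),stat="summary",
                  fun.ymin=min,fun.ymax=max,fun.y=median);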

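2.What does geom_col() do? How is it different to geom_bar()?

geom_col() draws bars whose heights are the values of the y aesthetic, so it requires both x and y; its default stat is stat_identity(). geom_bar() uses stat_count() by default: it takes only x and sets each bar's height to the number of rows in that group, so it needs stat="identity" to plot precomputed values such as fre below.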

demo <- tribble(
  ~cut,         ~fre,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)
ggplot(data=demo)+
  geom_col(mapping=aes(x=cut,y=fre));

3.Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

Each geom has a default stat, and each stat has a default geom.

-For geom_point the default stat is stat_identity.

-For geom_bar the default stat is stat_count.

-For geom_histogram the default is stat_bin.
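These pairings can also be read off the layer objects instead of the documentation. The sketch below assumes that a layer stores its stat and geom as ggproto objects whose first class name identifies them; default_stat() and default_geom() are hypothetical helper names, not ggplot2 functions.

```r
library(ggplot2)

# Hypothetical helpers: report the class name of a layer's stat or geom
default_stat <- function(layer) class(layer$stat)[1]
default_geom <- function(layer) class(layer$geom)[1]

default_stat(geom_point())
default_stat(geom_bar())
default_stat(geom_histogram())

# The reverse direction: the default geom of a stat
default_geom(stat_count())
default_geom(stat_bin())
```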

4.What variables does stat_smooth() compute? What parameters control its behaviour?

Variables that are computed are:

y: predicted value

ymin: lower value of the confidence interval

ymax: upper value of the confidence interval

se: standard error

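Parameters that control its behaviour:

-method: the smoothing function to use (for example lm or loess).

-formula: the formula passed to the smoothing function.

-se: whether to display the confidence interval.

-level: the confidence level of the interval.

-span: the amount of smoothing for the loess smoother.

-na.rm: whether missing values are removed silently.

-method.args: extra arguments passed on to the smoothing function.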
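The computed variables can be inspected directly with layer_data(). This is a sketch, not part of the exercise: method = "lm" and level = 0.95 are chosen only to keep the example simple, and smooth_data is an illustrative name.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95)

# One row per point at which the smooth is evaluated
smooth_data <- layer_data(p)
head(smooth_data[, c("x", "y", "ymin", "ymax", "se")])
```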

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  );

5.In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

#with group=1 attribute
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..,group=1))

# without group=1 attribute
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

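Without group = 1, ggplot2 computes ..prop.. within each group, and by default every level of cut is its own group, so each bar has a proportion of 1 and all bars are the same height. Setting group = 1 puts all rows in a single group, so each bar shows that cut's share of the whole data set. In the second plot, fill = color splits each cut into further groups, each again with a proportion of 1, so the stacked bars are just as uninformative.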
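The proportions themselves can be printed to see the difference. This sketch assumes ggplot2 3.3.0 or later, where after_stat(prop) replaces the ..prop.. notation used above; p_nogroup and p_group are illustrative names.

```r
library(ggplot2)

# No group: each cut is its own group, so every bar has prop = 1
p_nogroup <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# group = 1: proportions are taken over the whole data set
p_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

layer_data(p_nogroup)$prop
layer_data(p_group)$prop
```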

3.8 POSITION ADJUSTMENTS.

3.8.1 Exercises.

1.What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

The plot suffers from overplotting: cty and hwy are rounded to whole numbers, so many points share the same coordinates and hide one another. Setting position = "jitter" adds a small amount of random noise to each point, which spreads the points out because no two points are likely to receive the same amount of noise.

#after jittering
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point( position="jitter");

2.What parameters to geom_jitter() control the amount of jittering?

width: controls the amount of horizontal jitter.

height: controls the amount of vertical jitter.
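As a sketch of how the two parameters act independently, the plot below jitters only horizontally. The values 0.3 and 0 are arbitrary choices for illustration, and p_jit and jit are illustrative names.

```r
library(ggplot2)

# Spread points left and right by up to 0.3 units; leave the y values untouched
p_jit <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0)

# Compare the drawn positions with the original data
jit <- layer_data(p_jit)
range(jit$x - mpg$cty)
range(jit$y - mpg$hwy)
```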

3.Compare and contrast geom_jitter() with geom_count().

geom_jitter();

ggplot(data= mpg,mapping= aes(x=displ,y=hwy,color=class))+
  geom_jitter();

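This adds a small random displacement to every point, so observations that share the same (x, y) values no longer overlap exactly. Every observation stays visible, but the positions shown are slightly off from the true values.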

geom_count();

ggplot(data= mpg,mapping= aes(x=displ,y=hwy,color=class))+
  geom_count();

This keeps every point at its true (x, y) position and maps the number of observations at that position to point size, so combinations with more observations are drawn larger than those with fewer.
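The counts behind the point sizes are available through layer_data(). In this sketch the color aesthetic is dropped so that each (x, y) position is counted once; p_count and counts are illustrative names.

```r
library(ggplot2)

p_count <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_count()

# n is the number of observations at each (x, y) position
counts <- layer_data(p_count)
head(counts[order(-counts$n), c("x", "y", "n")])
```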

4.What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

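The default position adjustment for geom_boxplot() is "dodge2" (position_dodge2()), which places boxplots that share an x value side by side instead of drawing them on top of each other. Mapping color to class within each value of drv shows this:

ggplot(data=mpg,mapping=aes(x=drv,y=hwy,color=class))+
  geom_boxplot();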

3.9 COORDINATE SYSTEM.

3.9.1 Exercises.

1.Turn a stacked bar chart into a pie chart using coord_polar().

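# a single stacked bar: every row shares the same x value
ggplot(data= diamonds)+
  geom_bar(mapping=aes(x=factor(1),fill=clarity),width=1);

# change to pie chart: theta="y" maps the bar heights to angles
ggplot(data= diamonds)+
  geom_bar(mapping=aes(x=factor(1),fill=clarity),width=1)+
  coord_polar(theta="y");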

2.What does labs() do? Read the documentation.
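labs() sets the labels of a plot: the axis titles, the legend titles, and the plot title, subtitle and caption. A minimal sketch follows; the label texts and the name p_labs are made up for illustration.

```r
library(ggplot2)

p_labs <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(
    title = "Highway mileage by engine size",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    color = "Drive train",          # legend title for the color aesthetic
    caption = "Data: ggplot2::mpg"
  )

p_labs
```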

3.What’s the difference between coord_quickmap() and coord_map()?

nz <- map_data("nz")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap();

nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_map();

coord_map projects a portion of the earth, which is approximately spherical, onto a flat 2D plane using any projection defined by the mapproj package. Map projections do not, in general, preserve straight lines, so this requires considerable computation. coord_quickmap is a quick approximation that does preserve straight lines. It works best for smaller areas closer to the equator.

4.What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed();

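The points lie above the line, so the cars in this data set get better mileage on the highway than in the city, and the two measures rise together roughly linearly. geom_abline() with no arguments draws a reference line with intercept 0 and slope 1, that is, the line where hwy equals cty. coord_fixed() fixes the aspect ratio so that one unit on the x-axis is the same length as one unit on the y-axis, which puts that line at a 45-degree angle and makes it easy to compare highway and city mileage against the point where they are at par.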
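The same comparison can be made numerically. This is only a sketch to accompany the plot; share_above and gap are illustrative names.

```r
library(ggplot2)

# Share of cars whose highway mileage exceeds their city mileage
share_above <- mean(mpg$hwy > mpg$cty)
share_above

# Distribution of the difference between highway and city mileage
gap <- summary(mpg$hwy - mpg$cty)
gap
```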

data_science_r

Shamim Rashid

13th November, 2019