03. Exploring Bivariate Relations Using Plots"

author: “Mehrad” date: “17 juli 2016” ——————————————————-

Introduction:

In this document, we introduce basic techniques/tools to explore bivariate relations between two variables. As usual, we learn how to R using several examples.Examples 1 and 2 are mostly concenred with plotting and exploring the rleation between two continuous variables. Example 3 should be seen as an introduction to use plots in order to explore the relation between a categorical preditor (X) and a continuous response (Y) variable.

This document mostly focuses on plotting the relations and examines related functions that assist us attain this goal.

The relation between two categorical variables will be covered in another document.


Example 1. Some Preliminary Analyses

The michelson dataset in HistDatapackage has the results of 5 experiments that Michelson conducted to estimates the light speed. Using that dataset, do the following:

  1. load the dataset and investigate the type of its different variables.

  2. Filter data related of fifth experiment and generate a simple histogram of data (Univariate analysis)

  3. Change the binwidth of histogram to: 5, 25, 50, and 100. See which one looks nicer/better (Univariate analysis)

  4. Add the density plot of the fifth experiment to its historgram. What is the problem with this approach? (Univariate analysis)

  5. Filter data related to both the 3rd and the 5th experiments and plot them against one another using a scatter plot. Do these data points seem correlated? Use two different methods to do so.

  6. Compare the distribution of data in experiments 2 and 3, using boxplots (a case for continuous-categorical data).

  7. Compare the distribution of all experiments using boxplots (a case for continuous-categorical data).


Solutions to Example 1.

class(michelson)    
## [1] "data.frame"
str(michelson) 
## 'data.frame':    100 obs. of  3 variables:
##  $ Speed: int  850 740 900 1070 930 850 950 980 980 880 ...
##  $ Run  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Expt : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...

Results of our initial analysis reveals that the dataset is a dataframe object, consisting of 100 observations and 3 variables in total which are labled Speed, Run and Exp. The first variable is an integer vector, the latter two are factor vectors.

# Using $ Symbol to filter a variable. 

Speed5 = michelson$Speed[michelson$Expt==5]
Speed5
##  [1] 890 840 780 810 760 810 790 810 820 850 870 870 810 740 810 940 950
## [18] 800 810 870
hist(Speed5, col="blue")

minSp5 = min(Speed5)
maxSp5 = max(Speed5)

minSp5
## [1] 740
maxSp5
## [1] 950
binwidth_length5 = seq((minSp5-5), (maxSp5+5), by=5) #creating the bins. 
binwidth_length5
##  [1] 735 740 745 750 755 760 765 770 775 780 785 790 795 800 805 810 815
## [18] 820 825 830 835 840 845 850 855 860 865 870 875 880 885 890 895 900
## [35] 905 910 915 920 925 930 935 940 945 950 955
hist(Speed5, breaks = binwidth_length5, col="red")

Now, we create bins of width 25. Notice that I subtracted and added 25 to the min and max respectively to make sure all numbers will be included in the created bins.

binwidth_length25 = seq((minSp5-25), (maxSp5+25), by= 25) 
hist(Speed5, breaks = binwidth_length25, col="green")

Now we create bins of width 50. I subtracted and added 50 to the min and max respectively to make sure all numbers will be included in the created bins.

binwidth_length50 = seq((minSp5-50), (maxSp5+50), by= 50)
hist(Speed5, breaks = binwidth_length50, col="blue")

Now we create bins of width 100. I subtracted and added 100 to the min and max respectively to make sure all numbers will be included in the created bins.

binwidth_length100 = seq((minSp5-100), (maxSp5+100), by= 100)
hist(Speed5, breaks = binwidth_length100, col="purple")

dens5 = density(Speed5)
dens5 
## 
## Call:
##  density.default(x = Speed5)
## 
## Data: Speed5 (20 obs.);  Bandwidth 'bw' = 23.06
## 
##        x                y            
##  Min.   : 670.8   Min.   :1.024e-05  
##  1st Qu.: 757.9   1st Qu.:6.614e-04  
##  Median : 845.0   Median :1.763e-03  
##  Mean   : 845.0   Mean   :2.867e-03  
##  3rd Qu.: 932.1   3rd Qu.:4.766e-03  
##  Max.   :1019.2   Max.   :8.479e-03

As you can see, the output of density() functon is a density object containing two columns showing range of x and y used for drawing the density curve. Now the following creates a density curve.

plot(dens5) # Creating a direct density curve. 

To create a density curve on a histogram of data, we do the following:

hist(Speed5, col = "blue", freq = T)
lines(dens5, type = "l", col="red") 

Note: we can barely see the desnity function on the histogram simply because the y-scale of the density function ranges from 0 to 0.008 and when they are drawn on the same y axis of histogram of frequencies which is ranged from 0 to 8, they are barely seen.

This problem can be solved if we draw probabilty curve instead of frequency curve and then add the density curve on the same graph.

hist(Speed5, col = "blue", freq = F) 
lines(dens5, type = "l", col="red") 

by turining off the frequency argument in ´hist()´, we switch to probability curve automatically.

Note: we will discuss important arguments in the plot(), lines() or points() functions in the next example.

Speed3 = michelson$Speed[michelson$Expt==3]

# Method 1. 

plot(Speed3, Speed5, xlab = "Speed of light in Experiment 3", 
     ylab = "Speed of light in Experiment 5", 
     ylim= range(600:1100), xlim=range(600: 1100))

legend(590, 1080, 
       legend= "Michelson Experiments to Measure the Light Speed")

I intentionally defined a broader range for x and y axis so that I will have enouh space to insert a legend on the graph.

I also intentionally wrote the code of plot in several lines to ensure that the output Word or PDF file will look nice and nothing would be out of gray borders of the codes.

# Method 2. 

plot(Speed5~Speed3, type="p", col="purple", pch=4) 

Note: In the second method, we used arguments such as col, type, and pch which we will disucss in detail in Example 2. The important point about our second method is the use of ~ sign which works nicely by specifying y~x. It leads to scatter or line graphs (if the type is = “l”) when the variables are both continuous.If the x variable is categorical (i.e., factor), then the result of a plot function would be a boxplot.

Speed2 = michelson$Speed[michelson$Expt==2]
boxplot(Speed2, Speed3)   

Boxplot argument receives more than one x, so we can easily draw several variables next to each other.

  1. A more efficient way of drawing boxplots for allexperiments results next to each other:
plot(Speed ~ Expt, data=michelson) 

We will discuss the use of arguments in plot() in the next examples.


Keypoint 1: Formula Interface

In the previous example, parts E and G, we saw that it is possible to define a formula interface for a ´plot()´ function. The general structure of formula interface is:

action(y ~ x, data = dataset name, subset = conditions)

the action is indeed a function (e.g., plot(), lines(), etc.) that accepts a formula argument. Most of generic R functions do that. Moreover, the add-on mosaic package extends the formula interface to many of the functions providing numeric summaries, such as mean() and sd(). The Hmisc package also extends the summary function to produce many of these values when called on a model formula.

Now regarding our purpose, the plot() function is a generic function, with methods that provide a good graphical summary of an object. When the object is a formula of the type y ~ f side-by-side boxplots of the y values are drawn with groups decided by the factor f, as in this case the y ~ f model formula represents the question of asking how y varies across the groups formed by f. Notice that this by deafault creates boxplots for each level of the categorical variable. Howwever, it is possible to change this default to see only data points in each level. This can be done by changing the argument type ="p", which we will use in the following examples. No worries for now!

However, when using the object of the type y ~ x where both y and x are both continuous variables, the outcome will be a scatter plot showing how both x and y vary together.


Keypoint 2: Useful arguments to use for plotting.

The plot() function has several arguments/methods that one can utilize to work on the looks of the outcome.

here are some of them that are commonly used: main =: Title to put on the graphic. xlab =: Label for the x-axis. Similarly for ylab=. xlim =: Specify the x-limits, as in xlim=c(0,10), or xlim=range(1,10) for the interval [0,10]. Similarly for ylim =.

type =: determines the type of plot to make. Use “p” for points (the default), “l” (ell not one) for lines, and “h” for vertical lines.Using qoutation marks important.

bty =: Type of box to draw around the plot. Use “l” for “L”-shaped, default is “o”, which is “O”-shaped. Using qoutation marks is important. More details in ?par

pch=Determines the charater used to draw a scatter plot. This can be a number or a single character. Numbers 0 - 18 are similar to those of S programming language, though there are more available beyond that range. Note that, if the plot is going todraw lines, then specifying pch does not do anything. It works only when the outcomes are dicrete points as in scatter plot. lty =: When lines are plotted, this command specifies the type of line to be drawn (e.g., dashed lines, dotlines, continuous lines, etc). Details in ?par.

lwd =: determines the thickness of lines. Numbers bigger than 1 increase the default. Again, this is applicalbe to lines.

cex=Magnification factor. Default is 1. Numbers bigger than one makes the points bigger on the graph.

col =Specifies the color to use for the points or lines. it can get the numerical code of colors or just receives the color names as in col=“purple”.


Keypoint 3. Useful plotting functions that can accompany plot() function and add elements to an existing graphic.

There are a few functions that can add further infromation to a grpahic that has already been created, for example by a plot() function. Note that these are not arguments of plot() functon, but they are functions of their own.

  • points(): Add points or lines to a graphic. This depends on how its type= argument is set (i.e., “p” or “l”).
  • lines(): Add line or points to a graphic.This depends on how its type= argument is set (i.e., “p” or “l”).
  • abline(): Add a line to a graphic.
  • text(): Locate text in body of a graphic.
  • mtext(): Locate text in margins of a graphic.
  • title(): Add main title, x label or y label to a graphic.
  • legend(): Locate/insert a legend in body of a graphic, based on the coordinates defined in a graphic axes.

Example 2. Working with more arguments and accompanying functions in plot().

The dataset fat in UsingR package consists of 252 observations/cases and 19 variables. Call this dataset and do the following:

  1. examine the type of this dataset and its variables.

  2. make a plot to see if there is an association between bodyfat and hip variables.

  3. Plot the bodyfat vs. hip but only for those whose forearm is smaller than 27. How many observations are in this graph?

  4. Plot the bodyfat vs. hip but only for those whose forearm is smaller than 27 or bigger than 32. Use the efficient way of coding with subset()

  5. Make sure points that are related to forearms smaller than 27 have a different shape and color than those bigger than 32.

  6. add a proper title and legend to this plot.


Solutions to Example 2.

str(fat)
## 'data.frame':    252 obs. of  19 variables:
##  $ case         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ body.fat     : num  12.6 6.9 24.6 10.9 27.8 20.6 19 12.8 5.1 12 ...
##  $ body.fat.siri: num  12.3 6.1 25.3 10.4 28.7 20.9 19.2 12.4 4.1 11.7 ...
##  $ density      : num  1.07 1.09 1.04 1.08 1.03 ...
##  $ age          : int  23 22 22 26 24 24 26 25 25 23 ...
##  $ weight       : num  154 173 154 185 184 ...
##  $ height       : num  67.8 72.2 66.2 72.2 71.2 ...
##  $ BMI          : num  23.7 23.4 24.7 24.9 25.6 26.5 26.2 23.6 24.6 25.8 ...
##  $ ffweight     : num  135 161 116 165 133 ...
##  $ neck         : num  36.2 38.5 34 37.4 34.4 39 36.4 37.8 38.1 42.1 ...
##  $ chest        : num  93.1 93.6 95.8 101.8 97.3 ...
##  $ abdomen      : num  85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
##  $ hip          : num  94.5 98.7 99.2 101.2 101.9 ...
##  $ thigh        : num  59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
##  $ knee         : num  37.3 37.3 38.9 37.3 42.2 42 38.3 39.4 38.3 41.7 ...
##  $ ankle        : num  21.9 23.4 24 22.8 24 25.6 22.9 23.2 23.8 25 ...
##  $ bicep        : num  32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...
##  $ forearm      : num  27.4 28.9 25.2 29.4 27.7 30.6 27.8 29 31.1 30 ...
##  $ wrist        : num  17.1 18.2 16.6 18.2 17.7 18.8 17.7 18.8 18.2 19.2 ...
class(fat) # it is a dataframe. 
## [1] "data.frame"
# Method 1. 
x = fat$hip
y= fat$body.fat

plot(x,y)

# Method 2. 

plot(y~x)

#Method 3. 

plot(body.fat ~ hip, data = fat)

  1. We use two ways to plot this. The prefered method is the first one, using the subset() function because it needs the least amount of coding.
# Method 1. 

plot(body.fat~hip, data= fat, subset = forearm<27)

# Method 2. 

x1 = fat$hip[fat$forearm<27]
y1 = fat$body.fat[fat$forearm<27]

x1; y1; length(x1)   # shows that we have 47 observations. 
##  [1]  99.2  99.2  94.6  93.4  94.7  98.7 100.6  88.2  88.5  92.7  90.4
## [12]  87.2  98.3  91.0  89.1  88.6  98.3  99.3  96.9  96.2  98.3 100.6
## [23]  92.6  95.8  92.1  91.1  91.5  89.6  89.2 114.4  89.2  96.4  85.0
## [34]  96.1  96.9 102.0  94.0  96.1  96.3  94.4  94.5  91.9  87.5 100.2
## [45]  93.7  87.6  88.8
##  [1] 24.6 20.5 15.7 17.6 22.4  9.4  6.5  8.4 11.2  6.4 13.4  5.0 10.7  7.4
## [15]  8.7 13.7 21.7 21.1 26.0 26.1 20.1 21.0 16.7 18.0  6.1 10.6  9.9 15.2
## [29]  1.9 24.6 10.4 13.4  0.0 12.4 21.7 16.6 11.3 19.3 13.8  8.2 15.1  6.1
## [43] 12.8 25.9 18.4 17.0 11.5
## [1] 47
plot(y1~x1)

Method 3. Making a new Dataframe out of the previous one and then continue our analysis.

new_dataframe = fat[fat$forearm<27, ] 

This code simply says: select all the variables (i.e., all columns) for cases (i.e., rows) whose forearm variable is smaller than 27.

str(new_dataframe)  # shows that only 47 obs are selected. 
## 'data.frame':    47 obs. of  19 variables:
##  $ case         : int  3 16 23 24 28 30 32 45 47 48 ...
##  $ body.fat     : num  24.6 20.5 15.7 17.6 22.4 9.4 6.5 8.4 11.2 6.4 ...
##  $ body.fat.siri: num  25.3 20.9 15.6 17.7 22.9 8.8 5.7 7.7 10.8 5.6 ...
##  $ density      : num  1.04 1.05 1.06 1.06 1.05 ...
##  $ age          : int  22 35 31 32 31 29 29 39 40 39 ...
##  $ weight       : num  154 163 140 149 148 ...
##  $ height       : num  66.2 66 68.2 70 67.5 ...
##  $ BMI          : num  24.7 26.3 21.2 21.4 22.9 23.8 22.2 19.1 20.6 20.6 ...
##  $ ffweight     : num  116 129 118 123 115 ...
##  $ neck         : num  34 36.4 33.9 35.5 38.8 36.7 37.3 31.5 33.6 34.6 ...
##  $ chest        : num  95.8 99.1 86 86.7 97.4 97.4 93.5 85.1 88.2 89.8 ...
##  $ abdomen      : num  87.9 92.8 76.4 80 88.7 83.5 84.5 76 73.7 79.5 ...
##  $ hip          : num  99.2 99.2 94.6 93.4 94.7 ...
##  $ thigh        : num  59.6 63.1 57.4 54.9 57.5 58.9 58.5 50 53.3 52.7 ...
##  $ knee         : num  38.9 38.7 35.3 36.2 36 35.3 38.8 34.7 34.5 37.5 ...
##  $ ankle        : num  24 21.7 22.2 22.1 21 22.6 21.5 21 22.5 21.9 ...
##  $ bicep        : num  28.8 31.1 27.9 29.8 29.2 30.1 30.1 26.1 27.9 28.8 ...
##  $ forearm      : num  25.2 26.4 25.9 26.7 26.6 26.7 26.4 23.1 26.2 26.8 ...
##  $ wrist        : num  16.6 16.9 16.7 17.1 17 17.6 17.9 16.1 17.3 17.9 ...
plot(body.fat~hip, data= new_dataframe)

So we made a temporarily dataframe and then worked with its variables.

Note:When using the subset() function, we do not care about the = sign. We simply continue inserting our conditions in front of this sign, independent of how many signs our own condition might have. See solutions to part D too.

plot(body.fat~hip, data= fat, subset = forearm<27 | forearm>32)

  1. here we use the accompanying functions of plot(), such as lines() and points()
plot(body.fat~hip, data= fat, subset = forearm<27)
lines(body.fat~hip, data= fat, subset = forearm>32, type="p", pch=4, col = "red")  

Note: Notice that lines() generates points because we have set its type= argument to "p" (i.e., points). In this regard, it seems the output of llines() here are very much like plot() function. Could we simply use another plot() function instead of lines()? The answer is NO! Using another plot() function creates a SEPARATE plot for cases whose forearm is bigger than 32. It does not insert those points on the same graph which is created by the first plot() command.

Could we also use points() in a similar fashion, as lines()? YES! See below:

plot(body.fat~hip, data= fat, subset = forearm<27)
points(body.fat~hip, data= fat, subset = forearm>32, type="l", pch=4, col = "purple")  

plot(body.fat~hip, data= fat, subset = forearm<27 | forearm>32)
legend(90, 33, legend = "Hello World")
title(main = "My First Graph")

The two first arguments of legend are related to coordination where the legend is began to be inserted. It might be hard to set a legend really in the center of things. Note, however, that the title() function inserts yhe texts outside of the graph frame.


Example 3 Learning to Use Split() : A Case for a Continuous Y and a Categorical X.

The aim of this exercise mainly is to learn how to work with split() function and also to enhance our plotting abilities.

In this example we will use the ToothGrowth data set from the UsingR package.

  1. call the dataset and explore its variables.

  2. In this dataset, the variable len is the growth of participants’ teeth that has been measured for two types of supplements sup and three doses dose. Split the len variable based on different types of supplements.

  3. Split the len variable based on different types of doses.

  4. split the len variable based on all 6 possible combinations of dose x supplement.

  5. Create three boxplots for len as y, corresponding to each amount of dose variable. Should you first coerce the dose variable into a factor?

  6. plot the len variable based on the three dose categories. Specifically, draw a scatter plot with len as Y and dose as X axis. Subsequently, identify the mean of each level of dosein the graph with a red color, magnify them and then connect them using a line.

  7. Repeat what you did in E but this time for all 2x3 combntaions of variables supp and dose.


Solutions to Example 3.

str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
class(ToothGrowth)
## [1] "data.frame"

The ToothGrowthdataset is a dataframe consisting of three variables and 60 observations. len is a numerical variable, supp is a factor with “OJ” or “VC” levels, and dose is a neumerical variables that only has three qunatities, 0.5, 1 and 2.

  1. There are two ways of doing this. Again, one way is the traditional way and the other one is using the split() function.
# Method 1. Tedious way

Supp_OJ = ToothGrowth$len[ToothGrowth$supp =="OJ"]
Supp_VC = ToothGrowth$len[ToothGrowth$supp =="VC"]

Supp_OJ
##  [1] 15.2 21.5 17.6  9.7 14.5 10.0  8.2  9.4 16.5  9.7 19.7 23.3 23.6 26.4
## [15] 20.0 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3
## [29] 29.4 23.0
Supp_VC
##  [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0 16.5 16.5 15.2 17.3
## [15] 22.5 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5
## [29] 23.3 29.5

Note: The created outputs here are two simple vectors. What if, we wanted to keep all the cases and their respective variables but only for those cases whose supplement leves is OJ or VC? We could do the following:

NewData_OJ = ToothGrowth[ToothGrowth$supp=="OJ",]
class(NewData_OJ)
## [1] "data.frame"
str(NewData_OJ)
## 'data.frame':    30 obs. of  3 variables:
##  $ len : num  15.2 21.5 17.6 9.7 14.5 10 8.2 9.4 16.5 9.7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 1 1 1 1 1 1 1 1 1 1 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

As it can be seen, the newdata set is also a data frame and only has 30 observations. We can do the same with the VC level of the factor supp.

NewData_VC = ToothGrowth[ToothGrowth$supp=="VC",]
class(NewData_VC)
## [1] "data.frame"
str(NewData_VC)
## 'data.frame':    30 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Note: we have in fact splitted the original data frame into two data frames now. One should however notice that doing so does not change the nature of variables. As it is evident, the variable supp in the new data frames still is a vactor of two levels, although the new data frames NewData_VC or NewData_OJ each has has only one level of this categorical variable, respectively.

Alternatively, one can use the split() function as following:

# Method 2. Using Split() function.

My_list = split(ToothGrowth, ToothGrowth$supp)

class(My_list)
## [1] "list"
str(My_list)
## List of 2
##  $ OJ:'data.frame':  30 obs. of  3 variables:
##   ..$ len : num [1:30] 15.2 21.5 17.6 9.7 14.5 10 8.2 9.4 16.5 9.7 ...
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ dose: num [1:30] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ VC:'data.frame':  30 obs. of  3 variables:
##   ..$ len : num [1:30] 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ dose: num [1:30] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Some key points about split() function.

Note 1: The split(x,f) splits x based on levels of f. X can be a vector of numbers of a dataframe (as in above). In the latter case, the outcome is a list containing individual dataframes. In the former, the outcome is list containing several vectors.

Note 2: the f in the argument of split(x,f) could be a cateogrical variable as in the above example, of a numerical variable with limited amounts of quantities (see section C). If f is a numerical variable, the function splic() coerces it to a cateogrical/ factor variable.

Note 3: As it is clear from the output of the code above, the new object that is created is a list object. List is an object that can contain various types of objcts inside it. The current list has now two data frames. One of them is called OJ and the other one has been labeled automatically as VC. Here is how to call them now:

My_list[["OJ"]]
##     len supp dose
## 31 15.2   OJ  0.5
## 32 21.5   OJ  0.5
## 33 17.6   OJ  0.5
## 34  9.7   OJ  0.5
## 35 14.5   OJ  0.5
## 36 10.0   OJ  0.5
## 37  8.2   OJ  0.5
## 38  9.4   OJ  0.5
## 39 16.5   OJ  0.5
## 40  9.7   OJ  0.5
## 41 19.7   OJ  1.0
## 42 23.3   OJ  1.0
## 43 23.6   OJ  1.0
## 44 26.4   OJ  1.0
## 45 20.0   OJ  1.0
## 46 25.2   OJ  1.0
## 47 25.8   OJ  1.0
## 48 21.2   OJ  1.0
## 49 14.5   OJ  1.0
## 50 27.3   OJ  1.0
## 51 25.5   OJ  2.0
## 52 26.4   OJ  2.0
## 53 22.4   OJ  2.0
## 54 24.5   OJ  2.0
## 55 24.8   OJ  2.0
## 56 30.9   OJ  2.0
## 57 26.4   OJ  2.0
## 58 27.3   OJ  2.0
## 59 29.4   OJ  2.0
## 60 23.0   OJ  2.0
My_list[["VC"]]
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 11 16.5   VC  1.0
## 12 16.5   VC  1.0
## 13 15.2   VC  1.0
## 14 17.3   VC  1.0
## 15 22.5   VC  1.0
## 16 17.3   VC  1.0
## 17 13.6   VC  1.0
## 18 14.5   VC  1.0
## 19 18.8   VC  1.0
## 20 15.5   VC  1.0
## 21 23.6   VC  2.0
## 22 18.5   VC  2.0
## 23 33.9   VC  2.0
## 24 25.5   VC  2.0
## 25 26.4   VC  2.0
## 26 32.5   VC  2.0
## 27 26.7   VC  2.0
## 28 21.5   VC  2.0
## 29 23.3   VC  2.0
## 30 29.5   VC  2.0

Note: I have introduced working with basic elements such as list in a seprate file before. Here we do not cover that, however,let’s remind ourselves how to access elements within an object (e.g., a data frame) which itself is inside a list. Let’s print the len variable for all cases in the dataframe OJ inside the list My_list:

My_list[["OJ"]][,"len"]
##  [1] 15.2 21.5 17.6  9.7 14.5 10.0  8.2  9.4 16.5  9.7 19.7 23.3 23.6 26.4
## [15] 20.0 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3
## [29] 29.4 23.0
  1. In this part, we will split the original dataset based on a numerical varible, dose. As we will see, the split() coerce the dose to a categorical variable and then does the splitting. The outcomes is a new list containig three data frames.
My_list2= split(ToothGrowth, ToothGrowth$dose)
str(My_list2)
## List of 3
##  $ 0.5:'data.frame': 20 obs. of  3 variables:
##   ..$ len : num [1:20] 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ dose: num [1:20] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ 1  :'data.frame': 20 obs. of  3 variables:
##   ..$ len : num [1:20] 16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5 ...
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ dose: num [1:20] 1 1 1 1 1 1 1 1 1 1 ...
##  $ 2  :'data.frame': 20 obs. of  3 variables:
##   ..$ len : num [1:20] 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5 ...
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ dose: num [1:20] 2 2 2 2 2 2 2 2 2 2 ...

The new dataframes inside this list are labled after their former numerical numbers, that are now levels of a categorical vairable based on which the splitting has taken place. So we can simply call them as before, either using the “”:

My_list2[["0.5"]]
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 31 15.2   OJ  0.5
## 32 21.5   OJ  0.5
## 33 17.6   OJ  0.5
## 34  9.7   OJ  0.5
## 35 14.5   OJ  0.5
## 36 10.0   OJ  0.5
## 37  8.2   OJ  0.5
## 38  9.4   OJ  0.5
## 39 16.5   OJ  0.5
## 40  9.7   OJ  0.5

Note: if we do not use the “” marks and simple put a number such as 1 the outcome is the first object in the list, that is the dataframe related to does = 0.5, and not the dataframe with dose=1.

So the following codes have the same outputs:

My_list2[["0.5"]]
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 31 15.2   OJ  0.5
## 32 21.5   OJ  0.5
## 33 17.6   OJ  0.5
## 34  9.7   OJ  0.5
## 35 14.5   OJ  0.5
## 36 10.0   OJ  0.5
## 37  8.2   OJ  0.5
## 38  9.4   OJ  0.5
## 39 16.5   OJ  0.5
## 40  9.7   OJ  0.5
My_list2[[1]]   # spit out the first object in the list. 
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 31 15.2   OJ  0.5
## 32 21.5   OJ  0.5
## 33 17.6   OJ  0.5
## 34  9.7   OJ  0.5
## 35 14.5   OJ  0.5
## 36 10.0   OJ  0.5
## 37  8.2   OJ  0.5
## 38  9.4   OJ  0.5
## 39 16.5   OJ  0.5
## 40  9.7   OJ  0.5

And therefore the following codes led to different outputs:

My_list2[[1]]  # spits out the first object in the list. 
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 31 15.2   OJ  0.5
## 32 21.5   OJ  0.5
## 33 17.6   OJ  0.5
## 34  9.7   OJ  0.5
## 35 14.5   OJ  0.5
## 36 10.0   OJ  0.5
## 37  8.2   OJ  0.5
## 38  9.4   OJ  0.5
## 39 16.5   OJ  0.5
## 40  9.7   OJ  0.5
My_list2[["1"]] 
##     len supp dose
## 11 16.5   VC    1
## 12 16.5   VC    1
## 13 15.2   VC    1
## 14 17.3   VC    1
## 15 22.5   VC    1
## 16 17.3   VC    1
## 17 13.6   VC    1
## 18 14.5   VC    1
## 19 18.8   VC    1
## 20 15.5   VC    1
## 41 19.7   OJ    1
## 42 23.3   OJ    1
## 43 23.6   OJ    1
## 44 26.4   OJ    1
## 45 20.0   OJ    1
## 46 25.2   OJ    1
## 47 25.8   OJ    1
## 48 21.2   OJ    1
## 49 14.5   OJ    1
## 50 27.3   OJ    1
  1. Split() function can accept a list of factors, to do the splitting of the original dataset based on them. In such cases, the original data is splitted based on the possible combination of factor levels. For example, if the two factors have two and three levels each, the dataset is splitted into six levels (2 x 3). Let’s see an example:
My_Factor_List=list(ToothGrowth$supp, ToothGrowth$dose) 
My_List3= split(ToothGrowth, My_Factor_List)

str(My_List3)
## List of 6
##  $ OJ.0.5:'data.frame':  10 obs. of  3 variables:
##   ..$ len : num [1:10] 15.2 21.5 17.6 9.7 14.5 10 8.2 9.4 16.5 9.7
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 1 1 1 1 1 1 1 1 1 1
##   ..$ dose: num [1:10] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
##  $ VC.0.5:'data.frame':  10 obs. of  3 variables:
##   ..$ len : num [1:10] 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2
##   ..$ dose: num [1:10] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
##  $ OJ.1  :'data.frame':  10 obs. of  3 variables:
##   ..$ len : num [1:10] 19.7 23.3 23.6 26.4 20 25.2 25.8 21.2 14.5 27.3
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 1 1 1 1 1 1 1 1 1 1
##   ..$ dose: num [1:10] 1 1 1 1 1 1 1 1 1 1
##  $ VC.1  :'data.frame':  10 obs. of  3 variables:
##   ..$ len : num [1:10] 16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2
##   ..$ dose: num [1:10] 1 1 1 1 1 1 1 1 1 1
##  $ OJ.2  :'data.frame':  10 obs. of  3 variables:
##   ..$ len : num [1:10] 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3 29.4 23
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 1 1 1 1 1 1 1 1 1 1
##   ..$ dose: num [1:10] 2 2 2 2 2 2 2 2 2 2
##  $ VC.2  :'data.frame':  10 obs. of  3 variables:
##   ..$ len : num [1:10] 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
##   ..$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2
##   ..$ dose: num [1:10] 2 2 2 2 2 2 2 2 2 2
  1. To create boxplots, we can use the function formula in the plot() as discussed in example 2 of this document. However, notice that the output of plot() will only be boxplots, if the x variable is a categorical/factor one. Otherwise it will generate scatterplot. Let’s see how the two will be different below. First we simply plot len against dose. Notice that dose is numeric.
plot(len~dose, data=ToothGrowth)

To make boxplots, this is one way to do it:

my_factor = factor(ToothGrowth$dose)

plot(len~my_factor, data=ToothGrowth)  

#Or 

plot(ToothGrowth$len~my_factor) 

A very tedious way would be doing this:

a = ToothGrowth$len[ToothGrowth$dose==0.5]
b = ToothGrowth$len[ToothGrowth$dose==1]
c = ToothGrowth$len[ToothGrowth$dose==2]

boxplot(a,b,c)

Note: Using the tedious method creates the boxplots next to each other but on the X-axis we do not see the dose variable anymore with proper labels. we just simply see 1, 2 and 3. To fix that, we can add lables to the graph using the name() argument in boxplot() function. For example:

boxplot(a,b,c, xlab= "dose", names = c("0.5", "1", "2"))

  1. First, lets measure the mean of observations for each level of dose.
G1 = ToothGrowth$len[ToothGrowth$dose==0.5]
G2 = ToothGrowth$len[ToothGrowth$dose==1]
G3 = ToothGrowth$len[ToothGrowth$dose==2]

G1_average= mean(G1)
G2_average=mean(G2)
G3_average=mean(G3)

Now lets do the plotting.

plot(len~dose, data= ToothGrowth, col="orange")

points(c(0.5, 1, 2), c(G1_average, G2_average, G3_average), 
       col = "red", cex= 2, type = "p")

lines(c(0.5, 1, 2), c(G1_average, G2_average, G3_average), 
      col = "red", lwd=2, type = "l")

Note: functions points() and lines() in their non-formula format get x and y coordinates to draw points in those coordinates or connect them with a line, respectively.

  1. Here, we first need to create a new variable that is the combined version of both supp and dose. In doing so, first I single out dose and make a factor out of it (to make a cateogrical variable out of it) and store it in a new variable new_factor. Then I make a whole new varialbe by combining the two categorical variables that we have now: supp and new_factor. It is now sufficient to add this new combined variable to our original dataset and then do the plotting based on that.
new_factor = factor(ToothGrowth$dose)
new_factor
##  [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1   1   1   1   1   1   1  
## [18] 1   1   1   2   2   2   2   2   2   2   2   2   2   0.5 0.5 0.5 0.5
## [35] 0.5 0.5 0.5 0.5 0.5 0.5 1   1   1   1   1   1   1   1   1   1   2  
## [52] 2   2   2   2   2   2   2   2   2  
## Levels: 0.5 1 2
new_var = paste(ToothGrowth$supp, new_factor, sep = "")
new_var
##  [1] "VC0.5" "VC0.5" "VC0.5" "VC0.5" "VC0.5" "VC0.5" "VC0.5" "VC0.5"
##  [9] "VC0.5" "VC0.5" "VC1"   "VC1"   "VC1"   "VC1"   "VC1"   "VC1"  
## [17] "VC1"   "VC1"   "VC1"   "VC1"   "VC2"   "VC2"   "VC2"   "VC2"  
## [25] "VC2"   "VC2"   "VC2"   "VC2"   "VC2"   "VC2"   "OJ0.5" "OJ0.5"
## [33] "OJ0.5" "OJ0.5" "OJ0.5" "OJ0.5" "OJ0.5" "OJ0.5" "OJ0.5" "OJ0.5"
## [41] "OJ1"   "OJ1"   "OJ1"   "OJ1"   "OJ1"   "OJ1"   "OJ1"   "OJ1"  
## [49] "OJ1"   "OJ1"   "OJ2"   "OJ2"   "OJ2"   "OJ2"   "OJ2"   "OJ2"  
## [57] "OJ2"   "OJ2"   "OJ2"   "OJ2"
class(new_var) # it is character variable, not a factor anymore. 
## [1] "character"
new_data = data.frame(ToothGrowth, new_var)

Note: the paste() function pastes its first argument to its second argument. These could be just two simple characters or vectors of characters or numbers. Another important entry to its arguments is the sep = element which determines the distance between two elements when they are pasted. If one uses sep = " " then 1 space will be the distance between attached items whereas if one uses sep = "", then no distance will be between the items. I used no distance for our case.

Now, we can first create six boxplots next to each other (which is not waht the question asked for, but let’s do it).

plot(len~new_var, data= new_data)

To have a scatter plot instead of boxplots, we should make the variable on the x- axis numerical. for example, numbers from 1 to 6. One way to do so is to first make a factor of the current variable (which is new_var that we created in the previous step). Then turn its levels into nunmbers categories from 1 to 6 (which still be categorical). We can then turn those new levels into numric values using as.numeric() function.

Lets begin with making a factor out of new_var

factor_var = factor(new_var)

Now, we can change its levels into number format. Note that those numbers are still going to be treated as categorical variables by R.

levels(factor_var) = c(1,2,3,4,5,6)  
factor_var
##  [1] 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 1 1 1 1 1
## [36] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
## Levels: 1 2 3 4 5 6
class(factor_var)
## [1] "factor"

Now, we can turn the factor_var we have into a numeric vector.

new_x = as.numeric(factor_var)
plot(len~new_x, data=new_data, col = "purple")

Notice that, we did not add this newly made variable (new_x) to our dataset (new_data). We can do it and then use it as a way to filter groups, calculate their means and then identify those points on the current graph using points() and lines(). This we do to create something similar to what we did in section F.

updated_new_data= data.frame(new_data, new_x)
a1 = updated_new_data$len[updated_new_data$new_x == 1]
a2 = updated_new_data$len[updated_new_data$new_x == 2]
a3 = updated_new_data$len[updated_new_data$new_x == 3]
a4 = updated_new_data$len[updated_new_data$new_x == 4]
a5 = updated_new_data$len[updated_new_data$new_x == 5]
a6 = updated_new_data$len[updated_new_data$new_x == 6]

aveG1 = mean(a1); aveG2 = mean(a2); aveG3 = mean(a3)
aveG4 = mean(a4); aveG5 = mean(a5); aveG6 = mean(a6)

Now that we have all the coordination we need, we do the plotting. Notice that we should repeate the original plotting also again. Because additive functions such as points() or lines() only work if they come after the plot() command.

plot(len~new_x, data=new_data, col = "purple")
points(c(1, 2, 3, 4, 5, 6), 
       c(aveG1, aveG2, aveG3, aveG4, aveG5, aveG6), 
       col = "green", cex= 2, type = "p")

lines(c(1, 2, 3, 4, 5, 6), 
      c(aveG1, aveG2, aveG3, aveG4, aveG5, aveG6), 
      col = "green", lwd=2, type = "l")