Activity 1

In this activity we will see an example of Simpson’s Paradox in correlation. The data used here are for two types of books (paperback and hardcover). The price and number of pages were collected for each book.

The scatter plot below shows price on the y-axis against the number of pages on the x-axis.

Now we run the function cor() on the price and number of pages for each of the two types of books separately. Then we combine the two groups and run the function again.

cor(y1,x1)
## [1] 0.8481439
cor(y2,x2)
## [1] 0.9559518
cor(y,x)
## [1] -0.5949366

Note that each type has a positive correlation; however, when the two groups are combined the correlation is negative. This is called Simpson’s Paradox.
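The same sign flip can be reproduced with a tiny made-up data set. The sketch below does not use the book data: the vectors pages_hard, price_hard, pages_paper and price_paper are hypothetical values, chosen so that hardcovers are short and expensive while paperbacks are long and cheap.

```r
# Hypothetical data (NOT the book data used above)
pages_hard  <- c(100, 150, 200, 250, 300)
price_hard  <- c(30, 33, 35, 38, 40)
pages_paper <- c(500, 550, 600, 650, 700)
price_paper <- c(10, 12, 13, 15, 16)

# Combine the two groups, keeping a label for the book type
pages_all <- c(pages_hard, pages_paper)
price_all <- c(price_hard, price_paper)
type_all  <- factor(rep(c("hardcover", "paperback"), each = 5))

cor_hard  <- cor(price_hard, pages_hard)    # within hardcovers
cor_paper <- cor(price_paper, pages_paper)  # within paperbacks
cor_all   <- cor(price_all, pages_all)      # both groups pooled

print(c(hardcover = cor_hard, paperback = cor_paper, combined = cor_all))
```

In this made-up example the pooled correlation changes sign because the gap between the two group means outweighs the trend inside each group.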

We can do the same with the regression slope.

lm(y ~ x)$coefficients[2]
##           x 
## -0.02664736
lm(y1 ~ x1)$coefficients[2]
##        x1 
## 0.1177437
lm(y2 ~ x2)$coefficients[2]
##         x2 
## 0.01818674

Simpson’s Paradox also appears here: each of the two groups has a positive slope, whereas the slope is negative when they are combined.
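When the data sit in one data frame with a grouping column, the per-group and pooled slopes can be computed in one pass. This is a minimal sketch on hypothetical values, not the book data; the data frame books, its columns, and the helper slope_of() are names invented for illustration.

```r
# Hypothetical data (NOT the book data used above)
books <- data.frame(
  pages = c(100, 150, 200, 250, 300, 500, 550, 600, 650, 700),
  price = c(30, 33, 35, 38, 40, 10, 12, 13, 15, 16),
  type  = rep(c("hardcover", "paperback"), each = 5)
)

# Slope of price on pages for one data frame (illustrative helper)
slope_of <- function(d) unname(coef(lm(price ~ pages, data = d))["pages"])

slope_by_type <- sapply(split(books, books$type), slope_of)  # one slope per type
slope_pooled  <- slope_of(books)                             # type ignored

print(slope_by_type)
print(slope_pooled)
```

Here coef() is used as the accessor for the fitted coefficients; it returns the same values as the $coefficients element used above.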

To visualize this on plots:

plot(x,y)
model0 <- lm(y ~ x)
abline(model0)

plot(x,y)

text(200,40,"hardcover")
model1 <- lm(y1 ~ x1)
abline(model1)

text(600,15,"paperback")
model2 <- lm(y2 ~ x2)
abline(model2)

We can use the ggplot2 library to repeat the plots.

library(ggplot2)
# put the data into a data.frame
#   cond = type of book
#   xvar = x
#   yvar = y
dat <- data.frame(cond = z, xvar = x, yvar = y)

ggplot(dat, aes(x=xvar, y=yvar)) + geom_point(shape=1)      # Use hollow circles

ggplot(dat, aes(x=xvar, y=yvar)) +
    geom_point(shape=1) +    # Use hollow circles
    geom_smooth(method=lm)   # Add linear regression line 

ggplot(dat, aes(x=xvar, y=yvar)) +
    geom_point(shape=1) +    # Use hollow circles
    geom_smooth(method=lm,   # Add linear regression line
                se=FALSE)    # Don't add shaded confidence region

ggplot(dat, aes(x=xvar, y=yvar, color=cond)) + geom_point(shape=1)

# Same, but with different colors and add regression lines
ggplot(dat, aes(x=xvar, y=yvar, color=cond)) +
    geom_point(shape=1) +
    scale_colour_hue(l=50) + # Use a slightly darker palette than normal
    geom_smooth(method=lm,   # Add linear regression lines
                se=FALSE)    # Don't add shaded confidence region

# Set shape by cond
ggplot(dat, aes(x=xvar, y=yvar, shape=cond)) + geom_point()

# Same, but with different shapes
ggplot(dat, aes(x=xvar, y=yvar, shape=cond)) + geom_point() +
    scale_shape_manual(values=c(1,2))  # Use a hollow circle and triangle

Activity 2

Using the crime book data, the following scatter plot matrix is plotted.
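The code that drew the matrix is not shown here. As a minimal sketch, base R's pairs() draws a scatter plot matrix from the numeric columns of a data frame. The built-in USArrests data set is used below purely as a stand-in, on the assumption that the crime data has a similar all-numeric layout; crime_demo is a hypothetical name.

```r
# Stand-in data: USArrests ships with R (datasets package).
# It is NOT the crime data used in this activity.
crime_demo <- USArrests

# One scatter plot for every pair of columns
pairs(crime_demo, main = "Scatter plot matrix (USArrests stand-in)")
```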

Activity 3

A correlogram for the crime data was also produced using the corrgram() function from the corrgram library.
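A correlogram is a picture of the correlation matrix, so it can help to look at that matrix directly. The sketch below again uses the built-in USArrests data as a stand-in (an assumption, not the crime data from this activity). The panel choices passed to corrgram() are illustrative, and the plot is drawn only if the corrgram package is installed.

```r
# Stand-in data: USArrests ships with R; it is NOT the crime data used here.
crime_demo <- USArrests

# The correlogram visualizes this correlation matrix
cor_mat <- cor(crime_demo)
print(round(cor_mat, 2))

# Draw the correlogram only if the corrgram package is available
if (requireNamespace("corrgram", quietly = TRUE)) {
  corrgram::corrgram(crime_demo, order = TRUE,
                     lower.panel = corrgram::panel.shade,
                     upper.panel = corrgram::panel.pie)
}
```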

Activity 4

Using Minitab, we fit linear regression models with price as the response and number of pages as the predictor. The results match the R results, and Simpson’s Paradox can be seen.

For Hardcover

[Figure: Minitab regression output for hardcover books]

For Paperback

[Figure: Minitab regression output for paperback books]

Combined

[Figure: Minitab regression output for the combined data]

The regression model is also fitted in Tableau.

For Hardcover and Paperback

[Figure: Tableau regression fit for hardcover and paperback books]

Combined

[Figure: Tableau regression fit for the combined data]

Activity 5

Histograms were produced for the birth-rate data with the breaks option set to 5 and to 20.
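The effect of the breaks argument can be sketched with simulated values. The vector rate_demo below is hypothetical and is not the birth-rate data. When breaks is a single number, hist() treats it as a suggestion and picks "pretty" cut points, so the number of bars may differ from the value requested.

```r
# Simulated stand-in values (NOT the birth-rate data)
set.seed(42)
rate_demo <- rnorm(200, mean = 25, sd = 8)

old_par <- par(mfrow = c(1, 2))   # two plots side by side
h5  <- hist(rate_demo, breaks = 5,  main = "breaks = 5")
h20 <- hist(rate_demo, breaks = 20, main = "breaks = 20")
par(old_par)                      # restore the previous layout

# Number of bars actually drawn in each histogram
print(c(length(h5$counts), length(h20$counts)))
```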

Activity 6

In this activity we will run some ggplot2 examples on the diamonds dataset.

First we run a summary on the dataset.

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

In ggplot2, the qplot() function is used to draw quick plots. Here are some examples.

Note that price and carat do not have a linear relation; the relation appears to be exponential. On the other hand, the log of price has an approximately linear relation with the log of carat.
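One way to check the log-log claim numerically is to fit a straight line on both scales and compare the fits. This is a sketch rather than the analysis behind the plots; fit_log, r2_raw and r2_log are names chosen here. A straight line on log-log axes corresponds to a power-law relation of the form price ≈ a × carat^b, where b is the fitted slope.

```r
library(ggplot2)  # provides the diamonds data set

# Fit a straight line on the log-log scale
fit_log <- lm(log(price) ~ log(carat), data = diamonds)
print(coef(fit_log))

# Share of variance explained by a straight line on each scale
r2_raw <- summary(lm(price ~ carat, data = diamonds))$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(raw = r2_raw, log_log = r2_log))
```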

We can also change the plot colors or opacity using the colour=I("red") and alpha=I(1/10) options, respectively.

Moreover, we can add geometric objects such as a smooth LOESS curve using the geom=c("point","smooth") option.
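The calls behind these plots are not shown, so the following is a sketch of what they might look like. The sample dsmall and the p_* object names are assumptions made for illustration. The sample is kept under 1000 rows because geom_smooth() only defaults to LOESS for smaller data sets. Recent ggplot2 releases mark qplot() as deprecated, although it is still available.

```r
library(ggplot2)

# Small random sample so the LOESS smoother is quick to fit (hypothetical helper)
set.seed(1410)
dsmall <- diamonds[sample(nrow(diamonds), 500), ]

p_raw    <- qplot(carat, price, data = dsmall)                    # raw scale
p_log    <- qplot(log(carat), log(price), data = dsmall)          # log-log scale
p_red    <- qplot(carat, price, data = dsmall, colour = I("red")) # fixed colour
p_alpha  <- qplot(carat, price, data = diamonds, alpha = I(1/10)) # transparency
p_smooth <- qplot(carat, price, data = dsmall,
                  geom = c("point", "smooth"))                    # points + smoother

print(p_smooth)
```

Wrapping a value in I() tells qplot() to use it literally instead of mapping it to a scale.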

A colored histogram can be plotted with qplot() as follows.

qplot(carat, data=diamonds, geom="histogram", fill=color)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Similarly, a density plot can be produced.

qplot(carat, data=diamonds, geom="density", fill=color)

Bar Chart

qplot(color, data=diamonds, geom="bar", weight=carat) + scale_y_continuous("carat")

Faceting allows you to draw multiple plots in one figure. Here we plot histograms of carat for the different colors.

suppressWarnings(qplot(carat, ..density.., data=diamonds, facets = color ~ .,
      geom="histogram", binwidth=0.1, xlim=c(0,3))) 
## Warning: Removed 32 rows containing non-finite values (stat_bin).
## Warning: Removed 14 rows containing missing values (geom_bar).

Now we will use the mpg dataset to show an example of a scatter plot.

qplot(displ, hwy, data=mpg) + geom_smooth(method="lm")

We can group it by year or by cylinders.
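The grouping code is not shown, so here is one possible sketch. The first plot groups by cylinders using colour, and the second groups by year using facets; both choices, and the names p_cyl and p_year, are assumptions made for illustration.

```r
library(ggplot2)

# One colour and one fitted line per cylinder count
p_cyl <- qplot(displ, hwy, data = mpg, colour = factor(cyl)) +
  geom_smooth(method = "lm", se = FALSE)

# One panel per model year
p_year <- qplot(displ, hwy, data = mpg, facets = . ~ year) +
  geom_smooth(method = "lm")

print(p_cyl)
print(p_year)
```

Converting cyl with factor() gives a discrete colour scale, so each cylinder count gets its own regression line.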

Activity 7

We now plot 8 histograms from the TV data for TVs with size greater than 10 and less than 80, one for each year from 2001 to 2008. Note that par(mfrow=c(4,2)) changes the plotting parameters: mfrow=c(4,2) puts 8 plots in one figure, arranged in 4 rows and 2 columns.
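The layout can be sketched with simulated values. The data frame tv and its columns year and size are hypothetical stand-ins, since the TV data itself is not shown here.

```r
# Simulated stand-in (NOT the TV data): 50 sizes per year, 2001 to 2008
set.seed(1)
tv <- data.frame(year = rep(2001:2008, each = 50),
                 size = runif(400, min = 5, max = 90))

# Keep sizes greater than 10 and less than 80
tv_sub <- subset(tv, size > 10 & size < 80)

# 4 rows x 2 columns, filled row by row; smaller margins so 8 panels fit
old_par <- par(mfrow = c(4, 2), mar = c(4, 4, 2, 1))
for (yr in 2001:2008) {
  hist(tv_sub$size[tv_sub$year == yr],
       main = paste("Year", yr), xlab = "Size")
}
par(old_par)  # restore the previous plotting parameters
```

Saving the value returned by par() and passing it back afterwards keeps the 4 by 2 layout from affecting later plots.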