In this activity we will look at an example of Simpson’s Paradox in correlation. The data used here cover two types of books (paperback and hardcover). The price and number of pages were collected for each book.
The scatter plot below shows price on the y-axis against the number of pages on the x-axis.
Now we run the function cor() on the price and number of pages for each of the two types of books, then combine both groups and run the function again.
cor(y1,x1)
## [1] 0.8481439
cor(y2,x2)
## [1] 0.9559518
cor(y,x)
## [1] -0.5949366
Note that each type has a positive correlation; however, when the two groups are combined the correlation is negative. This is called Simpson’s Paradox.
We can do the same with the regression slope.
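The book data themselves are not listed in this write-up, so the following is only a minimal sketch of the same effect on synthetic data. The group locations (around 200 pages and a price near 40, around 600 pages and a price near 11) are assumptions chosen for illustration, not the values from the activity.

```r
# Synthetic stand-in data (NOT the book data used in this activity).
set.seed(1)
x1 <- runif(30, 100, 300)                    # group 1: few pages ...
y1 <- 30 + 0.05 * x1 + rnorm(30, sd = 1)     # ... but high prices
x2 <- runif(30, 400, 800)                    # group 2: many pages ...
y2 <- 5 + 0.01 * x2 + rnorm(30, sd = 0.5)    # ... but low prices

# Pool the two groups
x <- c(x1, x2)
y <- c(y1, y2)

r1   <- cor(y1, x1)  # within group 1: positive
r2   <- cor(y2, x2)  # within group 2: positive
rall <- cor(y, x)    # pooled: negative, because group 2 sits lower and to the right
c(group1 = r1, group2 = r2, pooled = rall)
```

The sign flip comes from the difference between the group means, which dominates the pooled correlation.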
lm(y ~ x)$coefficients[2]
## x
## -0.02664736
lm(y1 ~ x1)$coefficients[2]
## x1
## 0.1177437
lm(y2 ~ x2)$coefficients[2]
## x2
## 0.01818674
Simpson’s Paradox also appears here: each of the two groups has a positive slope, whereas the slope is negative when they are combined.
To visualize this with plots:
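One way to see why the slope flips is to include the book type in the model. The sketch below uses synthetic data again (the variable names pages, price and type are hypothetical, not from the activity) and compares the pooled slope with the slope obtained once the type is adjusted for.

```r
# Synthetic stand-in data (NOT the book data used in this activity).
set.seed(1)
pages <- c(runif(30, 100, 300), runif(30, 400, 800))
type  <- factor(rep(c("hardcover", "paperback"), each = 30))
price <- ifelse(type == "hardcover",
                30 + 0.05 * pages,   # assumed hardcover trend
                5 + 0.01 * pages) +  # assumed paperback trend
         rnorm(60, sd = 0.5)

# Slope ignoring the type of book
slope_pooled <- coef(lm(price ~ pages))["pages"]

# Common within-type slope, adjusting for the type of book
slope_adjusted <- coef(lm(price ~ pages + type))["pages"]

# Separate slope for each type
slope_by_type <- sapply(split(data.frame(pages, price), type),
                        function(d) coef(lm(price ~ pages, data = d))["pages"])

slope_pooled
slope_adjusted
slope_by_type
```

In this sketch the pooled slope is negative while the adjusted and per-type slopes are positive, which is the same pattern reported above.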
plot(x,y)
model0 <- lm(y ~ x)
abline(model0)
plot(x,y)
text(200,40,"hardcover")
model1 <- lm(y1 ~ x1)
abline(model1)
text(600,15,"paperback")
model2 <- lm(y2 ~ x2)
abline(model2)
We can use the ggplot2 library to repeat the plots.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
# put the data into a data.frame
# cond = type of book
# xvar = x
# yvar = y
dat <- data.frame(cond = z, xvar = x, yvar = y)
ggplot(dat, aes(x=xvar, y=yvar)) + geom_point(shape=1) # Use hollow circles
ggplot(dat, aes(x=xvar, y=yvar)) +
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm) # Add linear regression line
ggplot(dat, aes(x=xvar, y=yvar)) +
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm, # Add linear regression line
se=FALSE) # Don't add shaded confidence region
ggplot(dat, aes(x=xvar, y=yvar, color=cond)) + geom_point(shape=1)
# Same, but with different colors and add regression lines
ggplot(dat, aes(x=xvar, y=yvar, color=cond)) +
geom_point(shape=1) +
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth(method=lm, # Add linear regression lines
se=FALSE) # Don't add shaded confidence region
# Set shape by cond
ggplot(dat, aes(x=xvar, y=yvar, shape=cond)) + geom_point()
# Same, but with different shapes
ggplot(dat, aes(x=xvar, y=yvar, shape=cond)) + geom_point() +
scale_shape_manual(values=c(1,2)) # Use a hollow circle and triangle
Using the crime book data, the following scatter plot matrix is plotted.
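The code for the scatter plot matrix is not shown. A minimal sketch with base R's pairs() follows; it uses the built-in USArrests dataset as a stand-in, which is an assumption and may not be the crime data used in this activity.

```r
# USArrests (built into R) is used as a stand-in crime dataset.
data(USArrests)

# Scatter plot matrix of all numeric columns
pairs(USArrests, main = "Scatter plot matrix (USArrests stand-in)")
```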
A correlogram for the crime data was also produced using the corrgram() function from the corrgram library.
## Warning: package 'corrgram' was built under R version 3.1.3
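The call that produced the correlogram is not shown either. The following sketch assumes the corrgram package is installed and again uses the built-in USArrests dataset as a stand-in; the choice of panels is illustrative, not the author's.

```r
# Assumes: install.packages("corrgram") has been run.
library(corrgram)

# USArrests is a stand-in dataset, not necessarily the crime data above.
corrgram(USArrests,
         order = TRUE,               # reorder variables by similarity
         lower.panel = panel.shade,  # shaded cells below the diagonal
         upper.panel = panel.pie,    # pie glyphs above the diagonal
         text.panel = panel.txt,     # variable names on the diagonal
         main = "Correlogram (USArrests stand-in)")
```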
Using Minitab, we fit linear regression models with price as the response and number of pages as the predictor. The results match the R results, and Simpson’s Paradox can be seen.
For Hardcover
For Paperback
Combined
The regression model is also fitted in Tableau.
For Hardcover and Paperback
Combined
Histograms were produced for the birth-rate data with the breaks option set to 5 and to 20.
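The effect of the breaks argument of hist() can be sketched as follows. The birth-rate data are not included here, so the vector birth_rate below is hypothetical. Note that a single number passed to breaks is only a suggestion: R rounds the cut points to "pretty" values, so the actual number of bins may differ.

```r
# Hypothetical values standing in for the birth-rate data.
set.seed(42)
birth_rate <- rnorm(200, mean = 25, sd = 8)

op <- par(mfrow = c(1, 2))  # two plots side by side
h5  <- hist(birth_rate, breaks = 5,  main = "breaks = 5")
h20 <- hist(birth_rate, breaks = 20, main = "breaks = 20")
par(op)                     # restore the previous layout

# Number of bins actually used in each histogram
c(bins_5 = length(h5$counts), bins_20 = length(h20$counts))
```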
In this activity we will run some ggplot2 examples on the diamonds dataset.
First we run a summary on the dataset.
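The call itself is not shown; the output below is presumably what a call like the following prints (diamonds ships with ggplot2).

```r
library(ggplot2)

# Per-column summary of the diamonds dataset
diamonds_summary <- summary(diamonds)
diamonds_summary
```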
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
In ggplot2, the qplot() function is used to draw quick plots. Here are some examples.
Note that price and carat do not have a linear relationship; the relationship appears to be exponential. On the other hand, the log of price has a linear relationship with the log of carat.
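The plots referred to here were not preserved. A sketch of how they could be reproduced is given below; note that qplot() is deprecated as of ggplot2 3.4.0, so recent versions print a deprecation warning. The two correlations at the end are an added rough check, not part of the original activity.

```r
library(ggplot2)

# Raw scale: the point cloud curves upward
p_raw <- qplot(carat, price, data = diamonds)

# Log-log scale: the point cloud is close to a straight line
p_log <- qplot(log(carat), log(price), data = diamonds)

p_raw
p_log

# Rough numeric check of linearity on each scale
r_raw <- cor(diamonds$carat, diamonds$price)
r_log <- cor(log(diamonds$carat), log(diamonds$price))
c(raw = r_raw, log_log = r_log)
```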
We can also change the plot colors or opacity using the colour=I("red") and alpha=I(1/10) options, respectively.
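A short sketch of both options follows. Wrapping the value in I() tells qplot() to use it as a fixed setting rather than to map it to a legend.

```r
library(ggplot2)

# All points drawn in red
p_red <- qplot(carat, price, data = diamonds, colour = I("red"))

# Semi-transparent points: 10 overlapping points give a fully opaque spot
p_alpha <- qplot(carat, price, data = diamonds, alpha = I(1/10))

p_red
p_alpha
```

Transparency is useful with a dataset of this size because it reveals where the points are densest.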
Moreover, we can add geometric objects such as a smooth LOESS curve using the geom=c("point","smooth") option.
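A sketch of this option is shown below on a random subsample. The subsample (named dsmall here, a hypothetical name) is an assumption made for illustration: with the default smoothing method, ggplot2 uses LOESS only for fewer than 1,000 observations and switches to a GAM for larger data.

```r
library(ggplot2)

# Random subsample of 100 diamonds so the default smoother is LOESS
set.seed(1410)
dsmall <- diamonds[sample(nrow(diamonds), 100), ]

# Points plus a smooth curve with its confidence band
p_smooth <- qplot(carat, price, data = dsmall, geom = c("point", "smooth"))
p_smooth
```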
A colored histogram can be plotted with qplot() as follows.
qplot(carat, data=diamonds, geom="histogram", fill=color)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Similarly, a density plot can be produced.
qplot(carat, data=diamonds, geom="density", fill=color)
Bar Chart
qplot(color, data=diamonds, geom="bar", weight=carat) + scale_y_continuous("carat")
Faceting allows you to plot multiple plots in one figure. Here we plot histograms of carat for different colors.
suppressWarnings(qplot(carat, ..density.., data=diamonds, facets = color ~ .,
geom="histogram", binwidth=0.1, xlim=c(0,3)))
## Warning: Removed 32 rows containing non-finite values (stat_bin).
## Warning: Removed 14 rows containing missing values (geom_bar).
Now we will use the mpg dataset to show an example of a scatter plot.
qplot(displ, hwy, data=mpg) + geom_smooth(method="lm")
We can group it by year or by cylinders.
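The grouped plots are not shown. Two possible ways to group are sketched below: by colour for the number of cylinders and by facets for the year. These particular choices are assumptions, and the original may have grouped differently.

```r
library(ggplot2)

# One colour and one regression line per number of cylinders
p_cyl <- qplot(displ, hwy, data = mpg, colour = factor(cyl)) +
  geom_smooth(method = "lm")

# One panel and one regression line per model year
p_year <- qplot(displ, hwy, data = mpg, facets = . ~ year) +
  geom_smooth(method = "lm")

p_cyl
p_year
```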
We now plot 8 histograms from the TV data for TVs of size less than 80 and greater than 10, one for each year from 2001 to 2008. Note that the function par(mfrow=c(4,2)) changes the plot parameters: mfrow=c(4,2) puts 8 plots in one figure with 4 rows and 2 columns.
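The layout can be sketched as follows. The TV data are not included in this write-up, so the data frame tv and its columns year and size are hypothetical stand-ins.

```r
# Hypothetical stand-in for the TV data: 50 TVs per year, 2001 to 2008.
set.seed(7)
tv <- data.frame(year = rep(2001:2008, each = 50),
                 size = runif(400, min = 5, max = 100))

# Keep only sizes greater than 10 and less than 80
tv_sub <- subset(tv, size > 10 & size < 80)

# 4 rows x 2 columns; smaller margins so all 8 panels fit
op <- par(mfrow = c(4, 2), mar = c(4, 4, 2, 1))
for (yr in 2001:2008) {
  hist(tv_sub$size[tv_sub$year == yr],
       main = paste("Year", yr), xlab = "Size")
}
par(op)  # restore the previous plot parameters
```

With mfrow the panels are filled row by row; mfcol would fill them column by column instead.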