We will use the Palmer Penguins dataset for the following items:
In a new R Markdown file:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(corrplot)
## corrplot 0.92 loaded
penguins <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv")
head(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## sex year
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 <NA> 2007
## 5 female 2007
## 6 male 2007
penguins <-
penguins %>%
drop_na()
head(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## sex year
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 female 2007
## 5 male 2007
## 6 female 2007
plot(penguins$bill_length_mm, penguins$flipper_length_mm, xlab="Bill Length (mm)", ylab="Flipper Length (mm)")
cor(penguins$bill_length_mm,penguins$flipper_length_mm)
## [1] 0.6530956
r2 <- cor(penguins$bill_length_mm,penguins$flipper_length_mm)^2
r2 <- r2*100
round(r2,2)
## [1] 42.65
penguins$species <- as.factor(penguins$species)
my_cols <- c("#00AFBB", "#E7B800", "#FC4E07")
plot(penguins$bill_length_mm, penguins$flipper_length_mm,
col = my_cols[penguins$species])
Q: Why three color codes?
Answer: The dataset contains three species, so we give my_cols
three elements, one color per species. Because species is a
factor, my_cols[penguins$species] indexes the color vector by the
factor's integer codes (1, 2, 3) and returns one color for every row.
If my_cols had fewer elements than there are species, R would not raise
an error; the points of the unmatched species would get NA as their
color and would not be drawn.
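The indexing behind the color assignment can be checked on a small example. This is a minimal sketch that uses a toy factor in place of penguins$species; the names species, point_cols, short_cols and short_result are made up for illustration.

```r
# Toy factor standing in for penguins$species (illustrative values only)
species <- factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie"))
my_cols <- c("#00AFBB", "#E7B800", "#FC4E07")

# Levels are sorted alphabetically, so Adelie = 1, Chinstrap = 2, Gentoo = 3
levels(species)
as.integer(species)

# Indexing by a factor uses its integer codes: one color per observation
point_cols <- my_cols[species]
point_cols

# With only two colors there is no error; the third level maps to NA
short_cols <- c("#00AFBB", "#E7B800")
short_result <- short_cols[species]
short_result
```

Passing point_cols to the col argument of plot() would then color each point by its species.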
df <- subset(penguins, species == "Chinstrap")
plot(df$bill_length_mm, df$flipper_length_mm, xlab="Bill Length (mm)", ylab="Flipper Length (mm)", main="Chinstrap")
df <- subset(penguins, species == "Gentoo")
plot(df$bill_length_mm, df$flipper_length_mm, xlab="Bill Length (mm)", ylab="Flipper Length (mm)", main="Gentoo")
df <- subset(penguins, species == "Adelie")
plot(df$bill_length_mm, df$flipper_length_mm, xlab="Bill Length (mm)", ylab="Flipper Length (mm)", main="Adelie")
plot(penguins$bill_length_mm, penguins$flipper_length_mm,
col = penguins$species)
my_cols <- c("#00AFBB", "#E7B800", "#FC4E07")
plot(penguins$bill_length_mm, penguins$flipper_length_mm,
col = my_cols[penguins$species],main="Relationship between Bill length and Flipper length for 3 species of Penguins")
numeric_columns <- penguins[,c(3,4,5,6,8)]
cor_matrix <- cor(numeric_columns)
round(cor_matrix,3)
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.000 -0.229 0.653 0.589
## bill_depth_mm -0.229 1.000 -0.578 -0.472
## flipper_length_mm 0.653 -0.578 1.000 0.873
## body_mass_g 0.589 -0.472 0.873 1.000
## year 0.033 -0.048 0.151 0.022
## year
## bill_length_mm 0.033
## bill_depth_mm -0.048
## flipper_length_mm 0.151
## body_mass_g 0.022
## year 1.000
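The strongest pairwise correlation in a matrix like the one above can also be located programmatically rather than by eye. The sketch below rebuilds the matrix from the rounded values printed above so that it runs on its own; off_diag, idx and strongest are illustrative names, not part of the original analysis.

```r
# Correlation matrix re-entered from the rounded output above
vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm",
          "body_mass_g", "year")
cor_matrix <- matrix(
  c( 1.000, -0.229,  0.653,  0.589,  0.033,
    -0.229,  1.000, -0.578, -0.472, -0.048,
     0.653, -0.578,  1.000,  0.873,  0.151,
     0.589, -0.472,  0.873,  1.000,  0.022,
     0.033, -0.048,  0.151,  0.022,  1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Blank out the diagonal so each variable's correlation with itself is ignored
off_diag <- cor_matrix
diag(off_diag) <- NA

# Position(s) of the largest absolute correlation; the matrix is symmetric,
# so the same pair appears twice and we keep the first match
idx <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(off_diag)[idx[1, "row"]],
               colnames(off_diag)[idx[1, "col"]])
strongest
off_diag[idx[1, "row"], idx[1, "col"]]
```

Using abs() means a strong negative correlation would also be found; drop it to search for the largest positive value only.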
The strongest correlation is between body_mass_g and
flipper_length_mm for all the penguins: the
correlation coefficient is 0.873, the largest off-diagonal value in
the correlation matrix given above.
We computed the correlation between bill_length_mm and flipper_length_mm earlier (0.653).
That value is lower than the 0.873 found for
body_mass_g and flipper_length_mm.
t.test(penguins$bill_length_mm, penguins$flipper_length_mm)
##
## Welch Two Sample t-test
##
## data: penguins$bill_length_mm and penguins$flipper_length_mm
## t = -190.4, df = 430.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -158.5946 -155.3537
## sample estimates:
## mean of x mean of y
## 43.99279 200.96697
To check whether bill_length_mm and flipper_length_mm differ on
average, we conduct a Welch two-sample t-test. A t-test requires two
hypotheses:
Null hypothesis \(H_0\): The mean bill length and the mean flipper length of the penguins are equal.
Alternative hypothesis \(H_A\): The mean bill length and the mean flipper length of the penguins are not equal.
The decision depends on comparing the p-value with the significance level \(\alpha\). At the 5% significance level, the p-value from the t-test (< 2.2e-16) is below 0.05, so we reject the null hypothesis in favour of the alternative. Had the p-value been above 0.05, we would have failed to reject the null hypothesis.
In other words, the means of bill length (43.99 mm) and flipper length (200.97 mm) differ significantly, and the 95% confidence interval for the difference, (-158.59, -155.35), excludes 0. Note that this test compares means only; the strength of the relationship between the two variables is measured by the correlation coefficient (0.653), not by the t-test.
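If the question is whether the two measurements are related, not whether their means differ, a correlation test is the more direct tool. The following is a minimal sketch using cor.test() from base R on a handful of made-up measurements; bill_mm and flipper_mm are hypothetical toy vectors, not values computed from the full dataset.

```r
# Toy measurements (hypothetical values chosen for illustration)
bill_mm    <- c(39.1, 39.5, 40.3, 36.7, 46.5, 50.0, 45.2, 49.9)
flipper_mm <- c(181, 186, 195, 193, 210, 218, 215, 220)

# H0: the true Pearson correlation is 0; HA: it is not 0
result <- cor.test(bill_mm, flipper_mm, method = "pearson")

result$estimate   # sample correlation coefficient
result$p.value    # p-value for H0: correlation = 0
result$conf.int   # 95% confidence interval for the correlation
```

With the cleaned data frame from this report, the equivalent call would be cor.test(penguins$bill_length_mm, penguins$flipper_length_mm).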
pairs(numeric_columns,
labels = colnames(numeric_columns),
pch = 21,
bg = rainbow(3)[penguins$species],
col = rainbow(3)[penguins$species],
main = "Penguins dataset",
row1attop = TRUE,
gap = 1,
cex.labels = NULL,
font.labels = 1)
Yes, a corrgram (drawn here as a scatterplot matrix with pairs()) is an
effective way of finding relationships between the numerical variables
in a dataset. It helps us visualize the correlation matrix from
question 11 in graphical form. From the above graph we can see that
several pairs of variables are positively correlated, since their
scatterplot points show an increasing trend. For example,
body_mass_g and flipper_length_mm are positively correlated
for all species. Similarly,
bill_length_mm and body_mass_g show a moderate positive
correlation (0.589). The graph also helps us determine which
numeric variables are negatively correlated or uncorrelated. For instance,
bill_depth_mm and flipper_length_mm are negatively correlated (-0.578),
while year is not linked to any of the other numeric variables: its
panels show bands of points with no upward or downward trend.
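The corrplot package loaded at the top of this report can draw the same correlation matrix as a compact corrgram. The sketch below re-enters the rounded matrix from question 11 so that it runs on its own, and it assumes the corrplot package is installed; the styling arguments are one reasonable choice, not the only one.

```r
# Correlation matrix re-entered from the rounded output of question 11
vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm",
          "body_mass_g", "year")
cor_matrix <- matrix(
  c( 1.000, -0.229,  0.653,  0.589,  0.033,
    -0.229,  1.000, -0.578, -0.472, -0.048,
     0.653, -0.578,  1.000,  0.873,  0.151,
     0.589, -0.472,  0.873,  1.000,  0.022,
     0.033, -0.048,  0.151,  0.022,  1.000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Draw the corrgram only if corrplot is available (assumption: it is installed)
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix,
                     method = "circle",       # circle size and color encode r
                     type = "upper",          # show the upper triangle only
                     addCoef.col = "black",   # print the coefficients
                     tl.col = "black")        # label color
}
```

In the report itself, the cor_matrix object computed earlier could be passed directly instead of the re-entered values.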
library(ggplot2)
ggplot(penguins) +
aes(x = bill_length_mm, y = flipper_length_mm, colour = species) +
geom_point(shape = "square cross",
size = 2L) +
scale_color_hue(direction = 1) +
labs(x = "Bill Length (mm)", y = "Flipper Length (mm)") +
ggthemes::theme_base() +
theme(legend.position = "bottom")
ggplot(penguins) +
aes(x = bill_length_mm, y = flipper_length_mm) +
geom_point(shape = "square cross",
size = 1.5, colour = "#112446") +
labs(x = "Bill Length (mm)", y = "Flipper Length (mm)") +
ggthemes::theme_base() +
facet_wrap(vars(species))