tidyverse and
ggplot2. I think these are the most useful for data
cleaning/wrangling and visualization.Installing Packages: once you install a package, you shouldn’t have to install it again. But for your first time, here’s the protocol:
# install.packages("tidyverse")
# install.packages("ggplot2")
Loading Packages: after you install the package, you have to specify that you want it to be used in your R session. You must do this everytime you restart R if you want to use that package.
library(tidyverse) ; library(ggplot2)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Last note for now–if two packages have identically named functions,
you can specify which package to call the function using the following
notation: [package]::[function]
tidyverse and ggplot2). Let’s begin by
calling in our dataset to use for explanation purposes.Importing datasets: if you aren’t using premade R datasets or
creating your own from scratch (recall, data.frame) this is
something you will have to do all the time. See the code chunk below for
how to import CSV and Excel files; you’ll need to install a package
readxl for the excel file. Should you ever need to, you can
import a other different file types via various pre-made packages.
# CSV file
# path <- "your path here/your_dataset.csv"
# your_dataset <- read_csv(path)
# EXCEL (xls or xlsx) file... need to do install.packages("readxl") if you haven't already
library(readxl)
# path <- "your path here/your_dataset.xlsx"
# your_dataset <- read_excel(path)
To streamline this lesson, I will use a pre-made dataset in R called
mtcars. Let’s take a look and learn some new functions on
the way!
# ?mtcars ... important things to know about the dataset, like variable meanings
# View(mtcars) ... opens the full dataset in a new tab
head(mtcars) # displays the first 6 observations (6 is the default) in the dataset
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars) # displays the last 6 observations (6 is the default) in the dataset
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
head(mtcars, 2) ; tail(mtcars, 2) # you can specify the number of observations for these functions like so
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
## mpg cyl disp hp drat wt qsec vs am gear carb
## Maserati Bora 15.0 8 301 335 3.54 3.57 14.6 0 1 5 8
## Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
colnames(mtcars) #see the variable names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
dim(mtcars) #number of rows and number of columns
## [1] 32 11
nrow(mtcars) ; ncol(mtcars) # any clue what these might do?
## [1] 32
## [1] 11
As always, let’s run some descriptive statistics.
summary(mtcars) # gives some important details about your variables
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
cor(mtcars) #because all of the variables are numeric, I can create a correlation matrix between all of the variables to see how they (linearly) relate
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
plot(mtcars) #likewise, I can plot two-way scatterplots between all of the variables to see how they relate, visually
Checkpoint: write down some key observations–do any of the
correlations and/or plots stand out to you?
mpg and
wt of the car are related. We can inspect more closely like
so:#To look at more specific variables, I can attach the dataset so I don't have to keep referring to it with mtcars$[variable] in my functions
attach(mtcars)
## The following object is masked from package:ggplot2:
##
## mpg
summary(mpg) ; summary(wt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.513 2.581 3.325 3.217 3.610 5.424
hist(mpg) ; hist(wt) #this creates histograms of mpg and wt
shapiro.test(mpg) ; shapiro.test(wt) #shapiro wilk tests for normality
##
## Shapiro-Wilk normality test
##
## data: mpg
## W = 0.94756, p-value = 0.1229
##
## Shapiro-Wilk normality test
##
## data: wt
## W = 0.94326, p-value = 0.09265
cor(mpg, wt)
## [1] -0.8676594
plot(mpg, wt) #this creates a scatterplot of mpg vs wt
Seems like they might be linearly related (r = -0.87).
Let’s run a simple linear regression (t-test) using the
lm() function to see if their association is
significant:
model.crude <- lm(mpg ~ wt, mtcars) # our independent variable is wt, and our dependent variable is mpg here
summary(model.crude)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Perhaps I think that some of the other variables are confounders of
the relationship between mpg and wt. For
example, I think that we need to adjust for disp,
hp, and gear to get a less biased
estimate of the association between mpg and wt. I can hold them constant
in our linear regression model like so:
model.adj<- lm(mpg ~ wt + disp + hp + gear, mtcars)
summary(model.adj)
##
## Call:
## lm(formula = mpg ~ wt + disp + hp + gear, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5667 -1.8313 -0.3168 1.0760 6.0710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.506996 4.780618 6.591 4.53e-07 ***
## wt -3.550608 1.070639 -3.316 0.00261 **
## disp 0.006428 0.011686 0.550 0.58678
## hp -0.042305 0.014178 -2.984 0.00598 **
## gear 1.282530 0.985501 1.301 0.20412
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.607 on 27 degrees of freedom
## Multiple R-squared: 0.8371, Adjusted R-squared: 0.8129
## F-statistic: 34.68 on 4 and 27 DF, p-value: 2.834e-10
Checkpoint What is the p-value of the test for association
between mpg and wt in the crude model? In the
adjusted model? What differences do you note between the crude and
adjusted model?
am (vehicle
transmission type, 0 = “automatic”, 1 = “manual”) can be predicted by
vs (vehicle engine shape; 0 = “V-shaped”, 1 = “straight”).
As always, we will start with some descriptive statistics.table(vs, am) #table of frequencies
## am
## vs 0 1
## 0 12 6
## 1 7 7
prop.table(table(vs,am)) #table of percent frequencies
## am
## vs 0 1
## 0 0.37500 0.18750
## 1 0.21875 0.21875
plot(vs, am) #notice that a scatterplot here is quite uninformative
For categorical outcomes, we can run a logistic regression using
glm() to test the association between vs and
am.
logreg.crude <- glm(am ~ as.factor(vs), family = "binomial")
summary(logreg.crude)
##
## Call:
## glm(formula = am ~ as.factor(vs), family = "binomial")
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6931 0.5000 -1.386 0.166
## as.factor(vs)1 0.6931 0.7319 0.947 0.344
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 42.323 on 30 degrees of freedom
## AIC: 46.323
##
## Number of Fisher Scoring iterations: 4
Checkpoint: Is our test significant according to this chi-sq test of association? What is the p-value?
tidyverse is
very useful for data wrangling and streamlining this process.ggplot2 is quite useful for making much prettier
graphs.