Data Analysis without fun packages

Introducing… Packages!

One of the best parts of R being a free, open resource is that nerds are always making packages for your data needs. Different packages contain special functions that can make your life a lot easier. Of course, you can do very similar things in base R, but why not respect and benefit from other people’s hard work? Thanks, nerds!

Let’s install your first packages: `tidyverse` and `ggplot2`. I think these are the most useful for data cleaning/wrangling and visualization.

Installing Packages: once you install a package, you shouldn’t have to install it again. But for your first time, here’s the protocol:

# install.packages("tidyverse")
# install.packages("ggplot2")

Loading Packages: after you install the package, you have to specify that you want it to be used in your R session. You must do this everytime you restart R if you want to use that package.

library(tidyverse) ; library(ggplot2)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Last note for now–if two packages have identically named functions, you can specify which package to call the function using the following notation: [package]::[function]

Great! We are ready to move on. The next section will be all about using base R for different data analysis and visualization tasks. We will first use base R to build appreciation the power of the packages (`tidyverse` and `ggplot2`). Let’s begin by calling in our dataset to use for explanation purposes.

Importing datasets: if you aren’t using premade R datasets or creating your own from scratch (recall, data.frame) this is something you will have to do all the time. See the code chunk below for how to import CSV and Excel files; you’ll need to install a package readxl for the excel file. Should you ever need to, you can import a other different file types via various pre-made packages.

# CSV file
# path <- "your path here/your_dataset.csv"
# your_dataset <- read_csv(path)

# EXCEL (xls or xlsx) file... need to do install.packages("readxl") if you haven't already 
library(readxl)
# path <- "your path here/your_dataset.xlsx"
# your_dataset <- read_excel(path)

To streamline this lesson, I will use a pre-made dataset in R called mtcars. Let’s take a look and learn some new functions on the way!

# ?mtcars ... important things to know about the dataset, like variable meanings 

# View(mtcars) ... opens the full dataset in a new tab
head(mtcars) # displays the first 6 observations (6 is the default) in the dataset

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

tail(mtcars) # displays the last 6 observations (6 is the default) in the dataset

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

head(mtcars, 2) ; tail(mtcars, 2) # you can specify the number of observations for these functions like so

##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

##                mpg cyl disp  hp drat   wt qsec vs am gear carb
## Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
## Volvo 142E    21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

colnames(mtcars) #see the variable names

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

dim(mtcars) #number of rows and number of columns

## [1] 32 11

nrow(mtcars) ; ncol(mtcars) # any clue what these might do?

## [1] 32

## [1] 11

As always, let’s run some descriptive statistics.

summary(mtcars) # gives some important details about your variables

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

cor(mtcars) #because all of the variables are numeric, I can create a correlation matrix between all of the variables to see how they (linearly) relate

##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

plot(mtcars) #likewise, I can plot two-way scatterplots between all of the variables to see how they relate, visually

Checkpoint: write down some key observations–do any of the correlations and/or plots stand out to you?

Let’s say we are interested primarily in how mpg and wt of the car are related. We can inspect more closely like so:

#To look at more specific variables, I can attach the dataset so I don't have to keep referring to it with mtcars$[variable] in my functions 
attach(mtcars)

## The following object is masked from package:ggplot2:
## 
##     mpg

summary(mpg) ; summary(wt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.513   2.581   3.325   3.217   3.610   5.424

hist(mpg) ; hist(wt) #this creates histograms of mpg and wt

shapiro.test(mpg) ; shapiro.test(wt) #shapiro wilk tests for normality

## 
##  Shapiro-Wilk normality test
## 
## data:  mpg
## W = 0.94756, p-value = 0.1229

## 
##  Shapiro-Wilk normality test
## 
## data:  wt
## W = 0.94326, p-value = 0.09265

cor(mpg, wt)

## [1] -0.8676594

plot(mpg, wt) #this creates a scatterplot of mpg vs wt

Seems like they might be linearly related (r = -0.87).

Let’s run a simple linear regression (t-test) using the lm() function to see if their association is significant:

model.crude <- lm(mpg ~ wt, mtcars) # our independent variable is wt, and our dependent variable is mpg here 
summary(model.crude)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Perhaps I think that some of the other variables are confounders of the relationship between mpg and wt. For example, I think that we need to adjust for disp, hp, and gear to get a less biased estimate of the association between mpg and wt. I can hold them constant in our linear regression model like so:

model.adj<- lm(mpg ~ wt + disp + hp + gear,  mtcars)
summary(model.adj)

## 
## Call:
## lm(formula = mpg ~ wt + disp + hp + gear, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5667 -1.8313 -0.3168  1.0760  6.0710 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.506996   4.780618   6.591 4.53e-07 ***
## wt          -3.550608   1.070639  -3.316  0.00261 ** 
## disp         0.006428   0.011686   0.550  0.58678    
## hp          -0.042305   0.014178  -2.984  0.00598 ** 
## gear         1.282530   0.985501   1.301  0.20412    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.607 on 27 degrees of freedom
## Multiple R-squared:  0.8371, Adjusted R-squared:  0.8129 
## F-statistic: 34.68 on 4 and 27 DF,  p-value: 2.834e-10

Checkpoint What is the p-value of the test for association between mpg and wt in the crude model? In the adjusted model? What differences do you note between the crude and adjusted model?

Now, let’s see an example for when our outcome variable is categorical (binary). Say we want to study if am (vehicle transmission type, 0 = “automatic”, 1 = “manual”) can be predicted by vs (vehicle engine shape; 0 = “V-shaped”, 1 = “straight”). As always, we will start with some descriptive statistics.

table(vs, am) #table of frequencies

##    am
## vs   0  1
##   0 12  6
##   1  7  7

prop.table(table(vs,am)) #table of percent frequencies

##    am
## vs        0       1
##   0 0.37500 0.18750
##   1 0.21875 0.21875

plot(vs, am) #notice that a scatterplot here is quite uninformative

For categorical outcomes, we can run a logistic regression using glm() to test the association between vs and am.

logreg.crude <- glm(am ~ as.factor(vs), family = "binomial")
summary(logreg.crude)

## 
## Call:
## glm(formula = am ~ as.factor(vs), family = "binomial")
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -0.6931     0.5000  -1.386    0.166
## as.factor(vs)1   0.6931     0.7319   0.947    0.344
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 42.323  on 30  degrees of freedom
## AIC: 46.323
## 
## Number of Fisher Scoring iterations: 4

Checkpoint: Is our test significant according to this chi-sq test of association? What is the p-value?

2 things to note:

Constantly subsetting and doing step-by-step manipulations on our dataset is annoying and takes a lot of time. tidyverse is very useful for data wrangling and streamlining this process.
The visualizations here are quite… bland. You’ll find later that ggplot2 is quite useful for making much prettier graphs.

Data Analysis without fun packages

Miss. Yunge

2024-07-28

Introducing… Packages!

Let’s install your first packages: `tidyverse` and `ggplot2`. I think these are the most useful for data cleaning/wrangling and visualization.

2 things to note:

Data Analysis without fun packages

Miss. Yunge

2024-07-28

Introducing… Packages!

Let’s install your first packages: tidyverse and ggplot2. I think these are the most useful for data cleaning/wrangling and visualization.

2 things to note:

Let’s install your first packages: `tidyverse` and `ggplot2`. I think these are the most useful for data cleaning/wrangling and visualization.