R - Loading data from Google drive

Mark Bounthavong

28 April 2024

Background

Recently, my colleague contacted me to assist another colleague who was having trouble loading data stored on a Google drive account into R. I have never thought about using Google drive as a place to store data and then load it into the R environment. Normally, I store and load data from GitHub, but there are some limitations, particularly when the dataset is very large. Google drive might be an easy workaround to this limitation, so I decided to figure out how to make this work. (Note: There are ways to store large datasets on GitHub, and I’ll explore these methods in a future tutorial.)

There are a lot of security issues that have to be addressed before loading data from a Google drive account. You should make sure to have two-factor authentication turned on and generate a strong password for your Google account. Once you’ve done these steps, it’s time to begin loading data from Google drive into the R environment.

Note: I created a new Google account for this tutorial given that I may want to use this for future R projects.

Note: I used RStudio version 2022.07.2 and R version 4.2.2 (Innocent and Trusting) for this tutorial. If you run into errors, I recommend updating all packages and downloading the newest version of R and RStudio.

Install and load libraries.

There are several libraries that you will need to install and load: googledrive, googlesheets4, and tidyverse. I learned a neat code chunk that will check to see if you have installed the package, and if not, R will install it for you. It is a set of if and !require statements, which I learned from FelixAnalytix. I highly recommend you visit their website.

### Install and load R packages
if (!require(googlesheets4)) install.packages("googlesheets4")
if (!require(googledrive)) install.packages("googledrive")
if (!require(tidyverse)) install.packages("tidyverse")

library("googlesheets4")
library("googledrive")
library("tidyverse")

Authenticate Google drive

Once the packages have been installed and the libraries have been loaded, the first thing we’ll do is to authenticate with our Google drive account.

### Authenticate Google drive
googledrive::drive_auth()

You may be asked to cache OAuth access credentials in a folder. The purpose of this prompt is to allow the cache to contain a token that will be used to access the Google drive account. This is necessary to ensure that you are securing the gateway between R and the Google drive account. The token is only valid for the current R session.

Figure 1. OAuth access message.

Figure 1. OAuth access message.

Select Yes; this will open the browser from which you will need to select the Google account.

Figure 2. Select the Google account.

Figure 2. Select the Google account.

After selecting the Google account, you will be asked to sign into the Tidyverse API packages.

Figure 3. Sign into the Tidyverse API packages.

Figure 3. Sign into the Tidyverse API packages.

Click on Continue.

You’ll be asked to authorize the Tidyverse API package access on the Google drive account. This includes the ability to make edits to the file in the Google drive account. Make sure check the box to allow the Tidyverse API package to See, edit, create, and delete all of your Google Drive files.

Figure 4. Allow the Tidyverse API package to access the Google drive account.

Figure 4. Allow the Tidyverse API package to access the Google drive account.

Click on Continue.

You should receive a message on your browser that the OAuth is complete.

Figure 5. OAuth complete.

Figure 5. OAuth complete.

The webpage can be closed at this time.

You may get a warning email on the Google account indicating that the Tidyverse API package has been granted access to the Google drive. This is normal and can be safely deleted if you want.

Figure 6. Warning email.

Figure 6. Warning email.

Now, go into the Google drive account and locate the file where the data is stored. In this example, I will use the Social Deprivation Index (SDI) for United States (US) counties. You can download this data from the Robert Graham Center’s website, which is publicly available.

For the purpose of this exercise, I will use the URL where this data is stored on my Google drive account. But you will need to make sure that you are using the correct URL where you will store your data on your Google drive account. Note: The data that I am using in this tutorial is a *.CSV file. I recommend using this file format to make the data loading process easier.

Figure 7. Copy the URL path where the data is tored.

Figure 7. Copy the URL path where the data is tored.

Once you have the URL copied, you can use the code chuck to load the data into a dataframe.

### Load data from Google drive
df <- read_sheet("https://docs.google.com/spreadsheets/d/1QgqlFnBVngBBpfsgGncAJnC_J0lX-NYGPV1zZ3jOC-4/edit#gid=851423220")

head(df)
## # A tibble: 6 × 18
##   COUNTY_FIPS County_population SDI_score PovertyLT100_FPL_score
##         <dbl>             <dbl>     <dbl>                  <dbl>
## 1        1001             55380        43                     69
## 2        1003            212830        16                     28
## 3        1005             25361        99                    100
## 4        1007             22493        64                     86
## 5        1009             57681        43                     51
## 6        1011             10248        84                    100
## # ℹ 14 more variables: Single_Parent_Fam_score <dbl>,
## #   Education_LT12years_score <dbl>, HHNo_Vehicle_score <dbl>,
## #   HHRenter_Occupied_score <dbl>, HHCrowding_score <dbl>,
## #   Nonemployed_score <dbl>, sdi <dbl>, pct_Poverty_LT100 <dbl>,
## #   pct_Single_Parent_Fam <dbl>, pct_Education_LT12years <dbl>,
## #   pct_NonEmployed <dbl>, pctHH_No_Vehicle <dbl>, pctHH_Renter_Occupied <dbl>,
## #   pctHH_Crowding <dbl>

Once you have the data loaded into the R environment as a dataframe, you can begin to manipulate, wrangle, or analyze the data.

Conclusions

This was a fun exercise to figure out. As I mentioned early on, I have never considered using Google drive to store / share data for R projects. Now that I’ve learned how to make this process work, I may consider using Google drive to store large datasets for future R projects.

Acknowledgements

When researching how to load data from Google drive into the R environment, I had to go to a couple of websites. I’m not familiar with Google drive and its nuances, nor am I familiar with the googledrive and googlesheets4 packages. Thankfully, there are some great resources on the internet by some amazing people.

One of these is FelixAnalytix.

The other is by Appsilon Team. I wasn’t able to figure out who the individual author was, but I wanted to acknowledge that this article was incredibly helpful.

If you run into issues with the OAuth process, you may want to read the gargle package vignette to troubleshoot the problem (link).

I also want to acknowledge my colleagues who brought this challenge to my attention. I would never have considered this possibility without their inquiry.

Disclaimer

This is a work in progress and may be updated in future iterations.

Additionally, this is for educational purposes only.