Load Data from Three Different Sources

In the following notes, you will load data directly from a URL, directly from pre-build datasets in R, and finally from a file you save in your own folder.

Load Data Method 1: Load Data from a URL

You can load data from a folder or you can load data directly from a URL. The next example loads the dataset, “Test Scores”, directly from the URL where it resides.

library(tidyverse)  # you will use the readr package in tidyverse to read in this data
allscores <- read_csv("https://goo.gl/MJyzNs") 
head(allscores)

## # A tibble: 6 × 4
##   group   pre  post  diff
##   <dbl> <dbl> <dbl> <dbl>
## 1     3  45    47.5   2.5
## 2     1  65    65     0  
## 3     1  52.5  40   -12.5
## 4     1  65    52.5 -12.5
## 5     1  60    52.5  -7.5
## 6     2  55    35   -20

Notice R interprets the variable “group” as continuous values (col_double). We will fix this later. The command “dim” provides the dimensions of the data, which are 22 observations (rows) by 4 variables (columns).

Use Side-by-Side Boxplots – Done Incorrectly (at first!)

Here is some easy code to create 3 groups of boxplots with some easy-to-access data, filled by group. Since the groups are discrete, you can get rid of the shading.

boxpl <- allscores |> 
  ggplot(aes(y=diff, group = group, fill = group)) + 
  geom_boxplot()
boxpl

Side-by-side boxplots of the 3 different groups's difference in pre and post test scores

Side-by-Side boxplots of pre- and post-test scores

Notice that the legend give a continuous range of values for the scores, even though the scores are only 1, 2, or 3. The code guides(fill = FALSE) will get rid of the legend. Also, the x-axis labels make no sense. We will deal with that later.

Try to correct for the misrepresenting legend

boxpl2 <- boxpl + guides(fill = FALSE)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

boxpl2

Add your own color choices for the 3 different boxes

Ensure that the groups are considered as factors, rather than numbers.

Use as.factor as another way to ensure numerical values are read as categorical

allscores$group <- as.factor(allscores$group)
head(allscores)

## # A tibble: 6 × 4
##   group   pre  post  diff
##   <fct> <dbl> <dbl> <dbl>
## 1 3      45    47.5   2.5
## 2 1      65    65     0  
## 3 1      52.5  40   -12.5
## 4 1      65    52.5 -12.5
## 5 1      60    52.5  -7.5
## 6 2      55    35   -20

Then manually fill with the 3 colors: white, light gray, and dark gray. Make the boxplots orient horizontally.

boxpl3 <- allscores |>
  ggplot() + geom_boxplot(aes(y=diff, group=group, fill=group)) +
  scale_fill_manual(values=c("white","lightgray","darkgray")) +
  theme(axis.text.y=element_blank()) + # Remove the useless y-axis tick values.
  labs(title = "Score Improvements Across Three Groups",
          y = "Difference in Pre and Post Test Scores") +
  coord_flip()
boxpl3

Another way to conver to factor levels

Use the factor function:

boxpl3 <- allscores |>
  mutate(group=factor(group, levels=c("1","2","3"), ordered=TRUE)) |>
  ggplot() + geom_boxplot(aes(y=diff, group=group, fill=group)) +
  scale_fill_manual(values=c("white","lightgreen","lightpink")) +
  theme(axis.text.y=element_blank()) +
  labs(title = "Score Improvements Across Three Groups",
          y = "Difference in Pre and Post Test Scores") +
  coord_flip()
boxpl3

Load Data Method 2: Use prebuilt dataset

We will use the penguins dataset that is pre-build in the “palmerpenguins” package to create scatterplots.

Load the package and feed data into global environment

library(palmerpenguins)

## 
## Attaching package: 'palmerpenguins'

## The following objects are masked from 'package:datasets':
## 
##     penguins, penguins_raw

data("penguins")  # loads the penguins dataset into your global environment

It is essential to recognize that variables may be: int (integer), num (numeric), or double vs char (character) and factor (for categories)

Typically, chr or factor are used for discrete variables and int, dbl, or num for continuous variables.

Use head() function to view the tibble and variable types

head(penguins)

## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

Combine fig.cap for the Figure label and fig.alt for the alt text

fig.cap and fig.alt are YAML code embedded in chunks - these are tags for screen readers to improve accessibility in your document. The colors darkorange, purple, and cyan4 improve visibility of colors for colorblind access.

Example of using fig.cap and fig.alt in the next chunk:

{r fig.cap=“Bigger flippers, bigger bills”, fig.alt = “Scatterplot of flipper length by bill length of 3 penguin species, where we show penguins with bigger flippers have bigger bills.”}

scatter1 <- ggplot(data = penguins, 
                   aes(x = flipper_length_mm,
                       y = bill_length_mm,
                       color = species)) +
  geom_point(aes(shape = species), alpha = 0.8) +
  scale_color_manual(values = c("#eb6b34","purple","cyan4"))  +
  labs(title = "Palmer Penguins Flipper Length to Bill Length by Species",
       x = "Flipper Length (mm)",
       y = "Bill Length (mm)")
scatter1

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Scatterplot of flipper length by bill length of 3 penguin species, where we show penguins with bigger flippers have bigger bills.

Bigger flippers, bigger bills

Notice both SHAPE and COLOR are used to help the viewer see the differences in the 3 penguin species.

Load Data Method 3: read in a saved csv from your computer

This is the most important skill, to be able to read in a csv file that is saved on your computer in order to load it into R Studio to work with.

Find a dataset, save it in a folder on your computer. This folder is called your working directory. You will need to set your working directory.

Clean data headings and variable names

Data often is messy and not ready to use right away!

The data might require some cleaning. Here are some important points to check:

Be sure the format is .csv
Be sure there are no spaces between variable names (headers).
Set all variable names to lowercase so you do not have to keep track of capitalizing.

Loading Data from a Working Directory

The easiest way to find out what your current working directory is, use the command getwd().

getwd()

## [1] "C:/Users/crewe/OneDrive/Documents"

This command shows you (in your console below) the path to your directory. My current path is: \[1\] “C:/Users/rsaidi/Dropbox/Rachel/MontColl/DATA110/Notes”

If you want to change the path, there are several ways to do so. I find the easiest way to change it is to click the “Session” tab at the top of R Studio. Select “Set Working Directory”, and then arrow over to “Choose Directory”. At this point, it will take you to your computer folders, and you need to select where your data is held. I suggest you create a folder called “Datasets” and keep all the data you load for this class in that folder.

Notice that down in the console below, it will show the new path you have chosen: setwd(“C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets”). At this point, I copy that command and put it directly into a new chunk.

Load the libraries

Zoo is an abbreviation used in time series analysis, which stands for “Z’s ordered observations”

# install.packages("zoo")
library(zoo)  # this package will help us re-format the period to be a useable date.

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Load the data

The following data comes from New York Fed (https://www.newyorkfed.org/microeconomics/hhdc.html) regarding household debt for housing and non-housing expenses. Debt amounts are in $ trillions for all US households.

Download this dataset, Household_debt, from http://bit.ly/2P3084E and save it in your dataset folder. Change your working directory to load the dataset from YOUR folder. Then run this code.

# be sure to change this to your own directory
getwd()

## [1] "C:/Users/crewe/OneDrive/Documents"

("C:/Users/Crewe Mellott/ Data_110/ household_debt.csv")

## [1] "C:/Users/Crewe Mellott/ Data_110/ household_debt.csv"

household <- read_csv("household_debt.csv")

## Rows: 64 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Period
## dbl (7): Mortgage, HE Revolving, Auto Loan, Credit Card, Student Loan, Other...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(household)

## # A tibble: 6 × 8
##   Period Mortgage `HE Revolving` `Auto Loan` `Credit Card` `Student Loan` Other
##   <chr>     <dbl>          <dbl>       <dbl>         <dbl>          <dbl> <dbl>
## 1 03:Q1      4.94           0.24        0.64          0.69           0.24  0.48
## 2 03:Q2      5.08           0.26        0.62          0.69           0.24  0.49
## 3 03:Q3      5.18           0.27        0.68          0.69           0.25  0.48
## 4 03:Q4      5.66           0.3         0.7           0.7            0.25  0.45
## 5 04:Q1      5.84           0.33        0.72          0.7            0.26  0.45
## 6 04:Q2      5.97           0.37        0.74          0.7            0.26  0.42
## # ℹ 1 more variable: Total <dbl>

Clean data headings and variable names

Very soon, you will find data from other sources. The data will require some cleaning. Here are some important points to check: 1. Be sure the format is .csv 2. Be sure there are no spaces between variable names (headers). 3. Set all variable names to lowercase so you do not have to keep track of capitalizing.

Here are some useful cleaning commands:

Make all headings (column names) lowercase. Remove all spaces between words in headings and replace them with underscores with the gsub command. Then look at it with “head”.

names(household) <- tolower(names(household))
names(household) <- gsub(" ","_",names(household))
# gsub will remove spaces in between words in the headers and replace them with underscore
head(household)

## # A tibble: 6 × 8
##   period mortgage he_revolving auto_loan credit_card student_loan other total
##   <chr>     <dbl>        <dbl>     <dbl>       <dbl>        <dbl> <dbl> <dbl>
## 1 03:Q1      4.94         0.24      0.64        0.69         0.24  0.48  7.23
## 2 03:Q2      5.08         0.26      0.62        0.69         0.24  0.49  7.38
## 3 03:Q3      5.18         0.27      0.68        0.69         0.25  0.48  7.56
## 4 03:Q4      5.66         0.3       0.7         0.7          0.25  0.45  8.07
## 5 04:Q1      5.84         0.33      0.72        0.7          0.26  0.45  8.29
## 6 04:Q2      5.97         0.37      0.74        0.7          0.26  0.42  8.46

Mutate

Mutate is a powerful command in tidyverse. It creates a new variable (column) in your dataset. In our dataset, “period” is not anything useful if we want to plot chronological data. So we will use mutate from “tidyverse” with the package “zoo” to create a useable date format.

summary(household$mortgage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.940   7.923   8.265   8.041   8.762   9.290

create two new variables

Use mutate to create a variable “mortgage_ratio” that computes each mortgage in proportion to the total.
Use mutate to create a variable “mortgage_highlow” that makes a dichotomous variable that is “low” if lower than the median and “high” otherwise.

house <- household |>
  mutate(mortgage_ratio  = mortgage/total) |>
  mutate(mortgage_highlow = ifelse(mortgage < 8.265, "low", "high"))

Create a new DATE variable from “period”

You should see that there are 64 observations and 8 variables. All variables are “col_double” (continuous values) except “period”, which is interpreted as characters.We need to use the library “zoo” package to fix the unusual format of the “period”. We will mutate it to create a new variable, date.

household_debt <- household |>
   mutate(date = as.Date(as.yearqtr(period, format = "%y:Q%q")))
head(household_debt)

## # A tibble: 6 × 9
##   period mortgage he_revolving auto_loan credit_card student_loan other total
##   <chr>     <dbl>        <dbl>     <dbl>       <dbl>        <dbl> <dbl> <dbl>
## 1 03:Q1      4.94         0.24      0.64        0.69         0.24  0.48  7.23
## 2 03:Q2      5.08         0.26      0.62        0.69         0.24  0.49  7.38
## 3 03:Q3      5.18         0.27      0.68        0.69         0.25  0.48  7.56
## 4 03:Q4      5.66         0.3       0.7         0.7          0.25  0.45  8.07
## 5 04:Q1      5.84         0.33      0.72        0.7          0.26  0.45  8.29
## 6 04:Q2      5.97         0.37      0.74        0.7          0.26  0.42  8.46
## # ℹ 1 more variable: date <date>

Plot various loan types

plot1 <- household_debt |> 
  ggplot(aes(date, mortgage)) +
  geom_point() +
  labs(title = "Mortgage Debt(in $ trillions) Between 2003 and 2018",
       x = "Years (2003-2018)",
       caption = "Source: New York Fed")
plot1

Mortgage Debt Between 2003 and 2018

plot2 <- household_debt |> 
  ggplot(aes(date, credit_card)) +
  geom_point() + 
  labs(title = "Mortgage Debt(in $ trillions) Between 2003 and 2018",
       x = "Years (2003-2018)",
       caption = "Source: New York Fed")
plot2

Trend of credit card debt over time shows peaks in 2008 and 2018 and lowest values in 2014

Credit Card Debt Between 2003 and 2018

Pivot the table from “wide” to “long” format

This will enable use to use the “facet” function in the next step.

house_long <- household_debt |> 
  pivot_longer(
        cols = 2:7,
    names_to = "debt_type",
    values_to = "debt_amnt")
head(house_long)

## # A tibble: 6 × 5
##   period total date       debt_type    debt_amnt
##   <chr>  <dbl> <date>     <chr>            <dbl>
## 1 03:Q1   7.23 2003-01-01 mortgage          4.94
## 2 03:Q1   7.23 2003-01-01 he_revolving      0.24
## 3 03:Q1   7.23 2003-01-01 auto_loan         0.64
## 4 03:Q1   7.23 2003-01-01 credit_card       0.69
## 5 03:Q1   7.23 2003-01-01 student_loan      0.24
## 6 03:Q1   7.23 2003-01-01 other             0.48

Use “facet_wrap” to show all types of debt together

Facet_wrap allows you to plot all variables together for comparison.

In order to do this, you have to “reshape the”data from a wide format to a long format. Use gather from tidyr package to do this.

plot3 <- house_long %>%
  ggplot(aes(x=date, y= debt_amnt))+
  geom_point(aes(color = debt_type))+
  facet_wrap(~debt_type) + 
  labs(title = "Mortgage Debt(in $ trillions) Between 2003 and 2018",
       x = "Years (2003-2018)",
       y = "Debt Amount in $ Trillions",
       caption = "Source: New York Fed",
       color = "Debt Type")
plot3

The highest household debt in $ trillions is mortgage, which surpasses auto loans, credit cards, he-revolving, and student loans

Facet Wrap of All Types of Household Debt 2003-2018

Essay part: This hate crimes dataset has some useful features and some problems. On the positive side, it includes detailed information like the type of crime, when and where it happened, and sometimes details about the people involved. This helps us look at patterns over time, compare different locations, and study changes before and after major events or policies. The data can also show if certain groups are being targeted more than others.

However, there are also some downsides. Many hate crimes go unreported, so the data might not show the full picture. Some reports are missing important information, like where the crime happened or details about the people involved. Also, how crimes are recorded may vary between places or change over time, making comparisons harder.

If I were to study this dataset more and had to choose two specific parts. First, I would look at how hate crimes change over time in different states or cities. Second, I would study which groups are most affected by hate crimes and whether certain types of crimes are increasing for specific communities.

Reading_data_in_three_ways_and_Accessibility

Crewe Mellott

2025-09-22

Load Data from Three Different Sources

Load Data Method 1: Load Data from a URL

Use Side-by-Side Boxplots – Done Incorrectly (at first!)

Try to correct for the misrepresenting legend

Add your own color choices for the 3 different boxes

Use as.factor as another way to ensure numerical values are read as categorical

Another way to conver to factor levels

Load Data Method 2: Use prebuilt dataset

Load the package and feed data into global environment

Use head() function to view the tibble and variable types

Combine fig.cap for the Figure label and fig.alt for the alt text

Load Data Method 3: read in a saved csv from your computer

Clean data headings and variable names

Loading Data from a Working Directory

Load the libraries

Load the data

Clean data headings and variable names

Here are some useful cleaning commands:

Mutate

create two new variables

Create a new DATE variable from “period”

Plot various loan types

Pivot the table from “wide” to “long” format

Use “facet_wrap” to show all types of debt together