In the following notes, you will load data directly from a URL, directly from pre-built datasets in R, and finally from a file you save in your own folder.
Load Data Method 1: Load Data from a URL
You can load data from a folder or you can load data directly from a URL. The next example loads the dataset, “Test Scores”, directly from the URL where it resides.
library(tidyverse) # you will use the readr package in tidyverse to read in this data
allscores <- read_csv("https://goo.gl/MJyzNs")
head(allscores)
Notice R interprets the variable “group” as continuous values (col_double). We will fix this later. The command “dim” provides the dimensions of the data, which are 22 observations (rows) by 4 variables (columns).
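One way to avoid the problem up front is to tell read_csv the column type when you load the data. The sketch below is only an illustration: it reads a tiny made-up CSV (the values and the pre and post column names are assumptions, not the real Test Scores data) and asks readr to treat group as a factor.

```r
library(readr)

# A tiny made-up CSV standing in for the real file (values are hypothetical)
csv_text <- "group,pre,post,diff
1,10,14,4
2,12,13,1
3,9,15,6"

# I() tells read_csv that the string is the data itself, not a file path
scores <- read_csv(I(csv_text),
                   col_types = cols(group = col_factor()),
                   show_col_types = FALSE)

dim(scores)          # number of rows, then number of columns
class(scores$group)  # "factor" instead of "numeric"
```

The same col_types argument should work when the first argument is a URL or a file name.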
Use Side-by-Side Boxplots
Here is some easy code to create 3 groups of boxplots with some easy-to-access data, filled by group. Since the groups are discrete, you can get rid of the shading.
boxpl <- allscores |>
  ggplot(aes(y = diff, group = group, fill = group)) +
  geom_boxplot()
boxpl
Side-by-Side boxplots of pre- and post-test scores
Notice that the legend gives a continuous range of values for the groups, even though the groups are only 1, 2, or 3. Adding guides(fill = "none") will get rid of the legend. (The older form, guides(fill = FALSE), is deprecated as of ggplot2 3.3.4 and produces a warning.) Also, the x-axis labels make no sense. We will deal with that later.
Try to correct for the misleading legend
boxpl2 <- boxpl +
  guides(fill = "none")
boxpl2
Add your own color choices for the 3 different boxes
Ensure that the groups are treated as factors rather than numbers. Then manually fill with the 3 colors: white, light gray, and light pink. Orient the boxplots horizontally.
boxpl3 <- allscores |>
  mutate(group = factor(group, levels = c("1", "2", "3"), ordered = TRUE)) |>
  ggplot() +
  geom_boxplot(aes(y = diff, group = group, fill = group)) +
  scale_fill_manual(values = c("white", "lightgray", "lightpink")) +
  theme(axis.text.y = element_blank()) +
  labs(title = "Score Improvements Across Three Groups",
       y = "Difference in Pre and Post Test Scores") +
  coord_flip()
boxpl3
Use as.factor as another way to ensure numerical values are read as categorical
Another way to ensure that “group” is understood by R to be categorical is to use the command as.factor inside the aesthetic mapping. Without it, scale_fill_manual fails because a continuous value is supplied to a discrete scale.
boxpl4 <- allscores |>
  ggplot() +
  geom_boxplot(aes(y = diff, group = group, fill = as.factor(group))) +
  scale_fill_manual(values = c("white", "lightgray", "darkgray")) +
  theme(axis.text.y = element_blank()) + # Remove the useless y-axis tick values.
  labs(title = "Score Improvements Across Three Groups",
       y = "Difference in Pre and Post Test Scores") +
  coord_flip()
boxpl4
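To see what the two conversions actually produce, here is a small sketch on a made-up vector of group numbers (the vector is hypothetical and is not taken from the Test Scores data). Both approaches give the same levels; the only difference is that factor() with ordered = TRUE also records an order.

```r
# Hypothetical group labels stored as numbers
group <- c(1, 2, 3, 2, 1)

# Quick conversion: levels are the sorted unique values
g_simple <- as.factor(group)

# Explicit conversion: you choose the levels and mark them as ordered
g_ordered <- factor(group, levels = c("1", "2", "3"), ordered = TRUE)

levels(g_simple)       # "1" "2" "3"
levels(g_ordered)      # "1" "2" "3"
is.ordered(g_simple)   # FALSE
is.ordered(g_ordered)  # TRUE
```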
Load Data Method 2: Use prebuilt dataset
We will use the penguins dataset that is pre-built in the “palmerpenguins” package to create scatterplots.
Load the package and feed data into global environment
library(palmerpenguins)
data("penguins") # loads the penguins dataset into your global environment
It is essential to recognize the type of each variable: int (integer), num (numeric), and dbl (double) hold numbers, while chr (character) and fct (factor) hold text or categories.
Typically, chr or factor are used for discrete variables and int, dbl, or num for continuous variables.
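As a quick illustration of these types, the sketch below builds a small made-up data frame (the pets data is hypothetical) with one column of each kind and asks R to report the type of every column.

```r
# A made-up data frame with one column of each common type
pets <- data.frame(
  name   = c("Rex", "Tom", "Ivy"),          # character (chr)
  kind   = factor(c("dog", "cat", "cat")),  # factor (fct)
  legs   = c(4L, 4L, 4L),                   # integer (int)
  weight = c(30.5, 4.2, 3.9),               # double (dbl / num)
  stringsAsFactors = FALSE
)

# str() prints a compact summary that includes each column's type
str(pets)

# Collect the first class of every column into a named character vector
types <- vapply(pets, function(col) class(col)[1], character(1))
types
```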
Use head() function to view the tibble and variable types
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
Combine fig.cap for the Figure label and fig.alt for the alt text
fig.cap and fig.alt are chunk options that you set in the chunk header. fig.cap supplies the figure caption, and fig.alt supplies the alt text that screen readers use, which improves accessibility in your document. The colors darkorange, purple, and cyan4 are chosen so the groups remain easier to tell apart for colorblind readers.
Example of using fig.cap and fig.alt in the next chunk:
{r fig.cap="Bigger flippers, bigger bills", fig.alt="Scatterplot of flipper length by bill length of 3 penguin species, where we show penguins with bigger flippers have bigger bills."}
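A chunk with that header might contain a scatterplot like the following sketch. The plot code here is an assumption, since the body of the original chunk is not shown; it uses the penguins columns listed in the head() output and the three colors named above.

```r
library(ggplot2)
library(palmerpenguins)

# Flipper length against bill length, one color per species
penguin_plot <- ggplot(penguins,
                       aes(x = flipper_length_mm,
                           y = bill_length_mm,
                           color = species)) +
  geom_point(na.rm = TRUE) +  # silently skip the rows with missing values
  scale_color_manual(values = c("darkorange", "purple", "cyan4")) +
  labs(x = "Flipper length (mm)",
       y = "Bill length (mm)",
       color = "Species")

penguin_plot
```

The colors are matched to the species in the order of the factor levels, so the first color goes to the first level.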
Load Data Method 3: Load Data from a File in Your Own Folder
The command getwd() shows you (in your console below) the path to your working directory. My current path is: [1] "C:/Users/rsaidi/Dropbox/Rachel/MontColl/DATA110/Notes"
If you want to change the path, there are several ways to do so. I find the easiest way to change it is to click the “Session” tab at the top of R Studio. Select “Set Working Directory”, and then arrow over to “Choose Directory”. At this point, it will take you to your computer folders, and you need to select where your data is held. I suggest you create a folder called “Datasets” and keep all the data you load for this class in that folder.
Notice that down in the console below, it will show the new path you have chosen: setwd("C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets"). At this point, I copy that command and put it directly into a new chunk.
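If you prefer to do the same thing in code, the sketch below shows the basic pattern. It uses a temporary folder as a stand-in for your own “Datasets” folder (that location is an assumption so the example can run anywhere), and it restores the original working directory at the end.

```r
# Remember where you started
old_dir <- getwd()

# Stand-in for your own folder; replace with your real path
data_dir <- file.path(tempdir(), "Datasets")
dir.create(data_dir, showWarnings = FALSE)

# Change the working directory and confirm the change
setwd(data_dir)
getwd()

# Go back to the original working directory
setwd(old_dir)
```

file.path() builds the path with the correct separators for your operating system, which avoids typing slashes by hand.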
Load the data
The following data comes from New York Fed (https://www.newyorkfed.org/microeconomics/hhdc.html) regarding household debt for housing and non-housing expenses. Debt amounts are in $ trillions for all US households.
Download this dataset, Household_debt, from http://bit.ly/2P3084E and save it in your dataset folder. Change your working directory to load the dataset from YOUR folder. Then run this code.
# be sure to change this to your own directory
setwd("/Users/thejitharajapakshe/Desktop/MyFolder/College/DATA 110")
household <- read_csv("household_debt.csv")
Rows: 64 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Period
dbl (7): Mortgage, HE Revolving, Auto Loan, Credit Card, Student Loan, Other...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Very soon, you will find data from other sources. The data will require some cleaning. Here are some important points to check:
1. Be sure the format is .csv.
2. Be sure there are no spaces within variable names (headers).
3. Set all variable names to lowercase so you do not have to keep track of capitalization.
Here are some useful cleaning commands:
Make all headings (column names) lowercase. Remove all spaces between words in headings and replace them with underscores with the gsub command. Then look at it with “head”.
names(household) <- tolower(names(household))
names(household) <- gsub(" ", "_", names(household)) # gsub replaces the spaces between words in the headers with underscores
head(household)
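To see the effect of these two commands without loading the file, here is a small sketch that applies them to a plain character vector. The header names are the ones shown in the column specification above; applying the commands to a standalone vector is just for illustration.

```r
# Some of the original headers, as reported when the file was read
headers <- c("Period", "Mortgage", "HE Revolving",
             "Auto Loan", "Credit Card", "Student Loan")

# Lowercase first, then swap every space for an underscore
clean_headers <- gsub(" ", "_", tolower(headers))

clean_headers
# "period" "mortgage" "he_revolving" "auto_loan" "credit_card" "student_loan"
```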
Look at the dimensions and the structure of the data. Note that it will be listed as a tibble.
dim(household)
[1] 64 8
Mutate
Mutate is a powerful command in tidyverse. It creates a new variable (column) in your dataset. In our dataset, “period” is stored as text, which is not useful if we want to plot the data chronologically. So we will use mutate from “tidyverse” together with the package “zoo” to create a usable date format.
Create a new DATE variable from “period”
You should see that there are 64 observations and 8 variables. All variables are “col_double” (continuous values) except “period”, which is interpreted as characters. We need the “zoo” package to fix the unusual format of “period”. We will mutate it to create a new variable, date.
library(zoo) # provides as.yearqtr
household_debt <- household |>
  mutate(date = as.Date(as.yearqtr(period, format = "%y:Q%q")))
head(household_debt)
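The conversion is easier to follow on a few values. The sketch below assumes that “period” looks like "03:Q1" (a two-digit year, a colon, then the quarter), which is what the format string "%y:Q%q" implies; the three sample values are made up for illustration.

```r
library(zoo)

# Hypothetical period values in the two-digit-year:quarter layout
period <- c("03:Q1", "03:Q2", "18:Q4")

# as.yearqtr() parses the text into year-quarter values,
# and as.Date() turns each one into the first day of that quarter
quarter_dates <- as.Date(as.yearqtr(period, format = "%y:Q%q"))

quarter_dates
```

Because the result is a real Date, ggplot can place it on a continuous time axis instead of treating each quarter as a separate label.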
Use “facet_wrap” to show all types of debt together
Facet_wrap allows you to plot all variables together for comparison.
In order to do this, you have to reshape the data from a wide format to a long format. Use gather from the tidyr package to do this.
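The reshaping step might look like the following sketch. It runs on a tiny made-up version of the data (the numbers are hypothetical), and it uses the names house_long, debt_type, and debt_amnt that the plotting code expects; the exact call used in the original notes is not shown, so treat this as one possible version.

```r
library(tidyr)

# Made-up wide data: one row per quarter, one column per debt type
household_debt <- data.frame(
  period    = c("03:Q1", "03:Q2"),
  date      = as.Date(c("2003-01-01", "2003-04-01")),
  mortgage  = c(4.9, 5.1),
  auto_loan = c(0.6, 0.7)
)

# gather() stacks every column except period and date:
# the old column names go into debt_type, the values into debt_amnt
house_long <- gather(household_debt,
                     key = "debt_type", value = "debt_amnt",
                     -period, -date)

house_long
```

tidyr now recommends pivot_longer() for new code; pivot_longer(household_debt, cols = -c(period, date), names_to = "debt_type", values_to = "debt_amnt") should give the same columns, with the rows in a different order.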
plot3 <- house_long %>%
  ggplot(aes(x = date, y = debt_amnt)) +
  geom_point(aes(color = debt_type)) +
  facet_wrap(~debt_type) +
  labs(title = "Mortgage Debt (in $ trillions) Between 2003 and 2018",
       x = "Years (2003-2018)",
       y = "Debt Amount in $ Trillions",
       caption = "Source: New York Fed",
       color = "Debt Type")
plot3
Facet Wrap of All Types of Household Debt 2003-2018
Write about the positive and negative aspects of this hatecrimes dataset.
Positive:
We can make charts and maps that show how hate crimes change over time and where they happen. This helps people understand which areas or groups are most affected.
The dataset lets us compare different types of hate crimes or different groups of people affected. For example, we can see if hate crimes against one group are more common than others.
We can add buttons or menus to let people choose which groups they want to compare, making it easier to see the differences.
By putting hate crime data on a map, we can see where hate crimes happen most often.
Negative:
The charts could be improved by allowing the viewer to click on them to get specific details such as location and time.
Each place might have its own way of defining and reporting hate crimes, making it hard to compare data from different places accurately.
The dataset might not tell us everything we need to know about each hate crime, like why it happened or who was affected.
We can improve maps by showing more details about specific locations and why hate crimes might be happening there.
List 2 different paths you would like to (hypothetically) study about this dataset.
First would be the relationship between hate crimes and particular demographics like race, ethnicity, and gender. This can show us which groups are most targeted. Are there any patterns or trends depending on the area?
Describe 2 things you would do to follow up after seeing the output from the hatecrimes tutorial.
Make sure the data is accurate and organized. Remove any duplicates, deal with missing information, and make sure everything is formatted consistently, like dates and places.
Search for any more trends or patterns that could support further analysis.
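The first follow-up step could start with a check like the sketch below. The data frame and its column names (state, year, incidents) are entirely hypothetical and are not the real hatecrimes columns; the point is only to show how duplicates and missing values can be found and set aside.

```r
# Made-up records: row 2 repeats row 1, and row 4 has a missing year
crimes <- data.frame(
  state     = c("NY", "NY", "CA", "TX"),
  year      = c(2016, 2016, 2016, NA),
  incidents = c(10, 10, 7, 3)
)

# Count the problems before changing anything
sum(duplicated(crimes))       # number of exact duplicate rows
sum(!complete.cases(crimes))  # number of rows with at least one NA

# Drop exact duplicates, then drop rows with missing values
deduped <- crimes[!duplicated(crimes), ]
cleaned <- deduped[complete.cases(deduped), ]

cleaned
```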