Assignment: Working with Data in R and RStudio

Instructions

This first assignment reviews the Introduction to Data content. To complete it, use the data_introduction.Rmd file I reviewed in this week's lectures: copy the relevant code from that file and update it to answer the questions here. In each section, run the relevant code first and then write your response to each question. Submit two files to this assignment's Submissions folder on D2L:

this completed R Markdown script, and
an HTML-rendered version of it.

To start:

First, if you have not already done so, create a folder on your computer named GSB 519 to hold all of the materials for this course.

Second, inside of GSB 519, you will create a folder to host assignments. You can name that folder assignments.

Third, inside of assignments, you will create folders for each assignment. You can name the folder for this first assignment: data_introduction.

Fourth, create two additional folders in data_introduction named scripts and data. Store this script in the scripts folder and the data for this assignment in the data folder.

Fifth, go to the File menu in RStudio, select New Project…, choose Existing Directory, and then browse to your ~/GSB 519/assignments/data_introduction folder and select it as the top-level directory for this R Project.
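The folder layout described in the steps above can also be created from the R console. This is only a sketch: it builds the folders under a temporary directory (tempdir() is a stand-in, not part of the assignment), and the object names course_root and assignment_dir are hypothetical. Swap in the location you actually use.

```r
## hypothetical root; replace tempdir() with the place you keep course files
course_root <- file.path(tempdir(), "GSB 519")

## path to the folder for this first assignment
assignment_dir <- file.path(course_root, "assignments", "data_introduction")

## create scripts and data; recursive = TRUE also creates the parent folders
for (sub in c("scripts", "data")) {
  dir.create(
    file.path(assignment_dir, sub),
    recursive = TRUE,
    showWarnings = FALSE
  )
}

## list the folders that now exist inside the assignment directory
list.dirs(assignment_dir, full.names = FALSE, recursive = FALSE)
```

Creating the folders by hand in your file browser works just as well; the R Project still has to be created from the RStudio menu as described in the fifth step.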

Global Settings

The first code chunk sets the global settings for the remaining code chunks in the document. Do not change anything in this code chunk.

Load Packages

In this code chunk, we load two packages we need for this assignment:

here and
tidyverse.

Make sure you installed these two packages when you reviewed the analytical lecture.

We will use functions from these packages to examine the data. Do not change anything in this code chunk.

### load libraries for use in current working session
## library "here" for workflow
library(here)

## tidyverse for data manipulation and plotting
# loads the core tidyverse packages together (the exact count depends on the tidyverse version)
library(tidyverse)
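If either library() call fails, the usual cause is that the package is not installed. A minimal check, assuming nothing beyond base R (the object names needed, installed, and missing_pkgs are hypothetical), might look like this. It only reports what is missing and does not install anything.

```r
## packages this assignment relies on
needed <- c("here", "tidyverse")

## requireNamespace() returns TRUE when a package can be loaded
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

## keep the names of any packages that are not available
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Install before continuing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("Both packages are installed.")
}
```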

Task 1: Load Data

We will use the same data as in the analytical lecture: credit_raw.csv. After you load the data, then you will execute other commands on the data.

Use the read_csv() and here() functions to load the data for this working session. Save the data as the object credit_raw.

Question 1.1: After you load the data, look at your Global Environment window. How many observations and variables are there in the data?

Response 1.1: Observations: 400, Variables: 12

Question 1.2: Use the glimpse() function to view a preview of values for each variable in the data.

Which variable is listed first?

Which variable is listed last?

Response 1.2: First: ID, Last: Balance.

#### task 1.1
### import data file
## save as object
## use read_csv() to import the csv data file
credit_raw <- read_csv(
  ## use here() to locate file in our project directory;
  here("data", "credit_raw.csv")
)

## Rows: 400 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Gender, Student, Married, Ethnicity
## dbl (8): ID, Income, Limit, Rating, Cards, Age, Education, Balance
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
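The last line of that message mentions specifying the column types. The sketch below shows that option on a small made-up file (toy_path and toy are hypothetical stand-ins, not the assignment data). The assignment itself does not require it.

```r
library(readr)

## write a tiny made-up csv to a temporary file
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("ID,Income,Student", "1,14.9,No", "2,106.0,Yes"),
  toy_path
)

## supplying col_types states the expected type of each column,
## so read_csv() no longer needs to print its guesses
toy <- read_csv(
  toy_path,
  col_types = cols(
    ID = col_double(),
    Income = col_double(),
    Student = col_character()
  )
)

toy
```

Stating the types up front also makes the import fail loudly if a column ever arrives in an unexpected format.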

#### task 1.2
### print a preview of table
glimpse(credit_raw)

## Rows: 400
## Columns: 12
## $ ID        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
## $ Limit     <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
## $ Rating    <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
## $ Cards     <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
## $ Age       <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
## $ Education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
## $ Gender    <chr> "Male", "Female", "Male", "Female", "Male", "Male", "Female"…
## $ Student   <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes"…
## $ Married   <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "Ye…
## $ Ethnicity <chr> "Caucasian", "Asian", "Asian", "Asian", "Caucasian", "Caucas…
## $ Balance   <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…

Task 2: Clean Data

For your second task, you will clean the data. Apply the mutate() and across() functions to convert the character variables to factor variables in one piped command. Save the result as a new data object named: credit_work.

Question 2.1: Apply glimpse() to credit_work to preview the working data. How many factor variables (indicated by fct) are there now in the data?

Response 2.1: There are 4 factor variables.

Next, create a sample of the working data. Set the random seed to 547 and randomly sample 300 individuals from credit_work. Name the new data object: credit_work_samp.

Question 2.2: Print a preview of credit_work_samp.

What three individuals (i.e., ID values) are listed at the top?

Response 2.2: The three individuals listed at the top are IDs 261, 344, and 250.

#### task 2.1
#### clean data
### change character variables to factor variables
## save as new data
credit_work <- credit_raw %>%
  ## mutate variables
  mutate(
    ## across variables
    across(
      ## choose variables
      .cols = Gender:Ethnicity,
      ## functions
      .fns = as_factor
    )
  )

#### inspect clean data
### glimpse the data
glimpse(credit_work)

## Rows: 400
## Columns: 12
## $ ID        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 7…
## $ Limit     <dbl> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, …
## $ Rating    <dbl> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, …
## $ Cards     <dbl> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2, …
## $ Age       <dbl> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, …
## $ Education <dbl> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15,…
## $ Gender    <fct> Male, Female, Male, Female, Male, Male, Female, Male, Female…
## $ Student   <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, No…
## $ Married   <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, Ye…
## $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian, Africa…
## $ Balance   <dbl> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0,…
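The range Gender:Ethnicity works because the four character columns sit next to each other. As a sketch of an alternative, assuming a made-up tibble named toy, the columns can be picked by type with where(is.character) so the command does not depend on column order.

```r
library(dplyr)
library(forcats)

## made-up data with one numeric and two character columns
toy <- tibble(
  ID = c(1, 2, 3),
  Student = c("No", "Yes", "No"),
  Married = c("Yes", "Yes", "No")
)

## convert every character column, whatever its position, to a factor
toy_work <- toy %>%
  mutate(
    across(
      .cols = where(is.character),
      .fns = as_factor
    )
  )

glimpse(toy_work)
```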

##### task 2.2
#### create a reproducible random sample of the working data
### set the random seed of computer
set.seed(547)

### save as new data
credit_work_samp <- credit_work %>%
  ## randomly sample
  sample_n(size = 300)

### print preview of sampled data to Console
credit_work_samp
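The role of set.seed() can be checked on a small made-up table: resetting the same seed before each draw gives the same rows. This sketch uses slice_sample(), which dplyr now recommends in place of the superseded sample_n(). The objects toy, first_draw, and second_draw are hypothetical.

```r
library(dplyr)

## made-up table of 20 individuals
toy <- tibble(ID = 1:20)

## first draw with the seed set
set.seed(547)
first_draw <- toy %>%
  slice_sample(n = 5)

## second draw after resetting the same seed
set.seed(547)
second_draw <- toy %>%
  slice_sample(n = 5)

## TRUE when both draws contain the same rows in the same order
identical(first_draw, second_draw)
```

Without the second set.seed() call, the two draws would generally differ, which is why the seed is set immediately before sampling.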

Task 3: Inspect Data

For your third task, you will inspect credit_work_samp in more detail.

Call credit_work_samp and arrange the rows by ascending Age and descending Rating.

Question 3.1: What is the age and rating of the first listed individual?

Response 3.1: Age: 98, Rating: 982

Call credit_work_samp and slice it to the seven highest Rating scores.

Question 3.2: What is the rating of the third listed individual?

Response 3.2: The rating of the third listed individual is 112.

Use a piped command to:

call credit_work_samp,
select ID, Rating, Income, and Student, and
filter the rows by a Yes response to Student, Income greater than 60, and Rating greater than 700.

Question 3.3: Which individual (i.e., ID value) is listed? What is this individual’s Rating?

Response 3.3: The individual listed has ID 192. This individual’s Rating is 701.

#### task 3.1
### arrange rows by variables
## call data
credit_work_samp %>%
  ## arrange by ascending Age and descending Rating
  arrange(Age, desc(Rating))

#### task 3.2
### select particular rows by condition
## call data
credit_work_samp %>%
  ## slice for the seven largest values
  slice_max(Rating, n = 7)

#### task 3.3
### select particular variables and rows by condition
## call data
credit_work_samp %>%
  ## select variables
  select(ID, Rating, Income, Student) %>%
  ## filter rows
  filter(Student == "Yes", Income > 60, Rating > 700)
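The order of the arguments in arrange() sets the sort priority, and slice_max() and slice_min() pick opposite ends of a variable. The sketch below illustrates both points on a made-up four-row tibble (toy, by_age, and top_two are hypothetical names), so the results are easy to verify by eye.

```r
library(dplyr)

## made-up data: two individuals share the youngest age
toy <- tibble(
  ID = 1:4,
  Age = c(40, 25, 25, 60),
  Rating = c(500, 300, 700, 900)
)

## ascending Age first; ties on Age are broken by descending Rating
by_age <- toy %>%
  arrange(Age, desc(Rating))

by_age$ID
## 3 2 1 4

## keep the two rows with the largest Rating, largest first
top_two <- toy %>%
  slice_max(Rating, n = 2)

top_two$Rating
## 900 700
```

Swapping the arguments to arrange(desc(Rating), Age) would sort by Rating first and use Age only to break ties, which gives a different first row.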

Task 4: Save Data

For your fourth task, you will save credit_work to your data folder in your project directory for this assignment. Save the data file as credit_work.csv.

##### task 4.0
### save working data
## use write_csv() to export as a csv data file
write_csv(
  ## name of object
  credit_work,
  ## use here() to export data to project directory;
  here("data", "credit_work.csv")
)
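One thing to keep in mind when the saved file is loaded again: a csv file stores values as text and does not record column types, so factor variables come back as character variables. The sketch below shows this on a made-up two-row table (toy_work, toy_path, and toy_back are hypothetical names).

```r
library(readr)

## made-up working data with one factor variable
toy_work <- data.frame(
  ID = c(1, 2),
  Student = factor(c("No", "Yes"))
)

## export to a temporary csv file and import it again
toy_path <- tempfile(fileext = ".csv")
write_csv(toy_work, toy_path)
toy_back <- read_csv(toy_path, show_col_types = FALSE)

## the factor was written as text, so it is read back as character
class(toy_back$Student)
## "character"
```

The conversion to factors therefore has to be repeated after importing credit_work.csv in a later session.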

Task 5: Conceptual Questions

For your last task, you will respond to conceptual questions based on the conceptual lectures for this week.

Question 5.1: What are the differences between nominal and ordinal variables? What are the differences between interval and ratio variables?

Response 5.1: Nominal variables represent unordered categories (e.g., gender or training interventions), whereas ordinal variables represent categories with a meaningful order but no guarantee of equal spacing between them (e.g., finishing rank). Interval variables are numeric with equal spacing between values but no true zero (e.g., temperature in Celsius), whereas ratio variables are numeric with equal spacing and a true zero, so ratios between values are meaningful (e.g., income).
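In R, the nominal versus ordinal distinction corresponds to unordered versus ordered factors. The sketch below uses base R only. The ethnicity values come from the data above, while the satisfaction variable and its levels are made up for illustration.

```r
## nominal: categories with no inherent order
ethnicity <- factor(c("Asian", "Caucasian", "Asian"))
is.ordered(ethnicity)
## FALSE

## ordinal: categories with a stated order (hypothetical variable)
satisfaction <- factor(
  c("low", "high", "medium"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

## comparisons are defined only for ordered factors
satisfaction[1] < satisfaction[2]
## TRUE
```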

Question 5.2: What are the differences between experimental and correlational studies?

Response 5.2: Experimental studies manipulate an independent variable, typically with random assignment, and measure its effect on a dependent variable, which supports cause-and-effect conclusions. Correlational studies, on the other hand, measure variables as they naturally occur and examine the relationships between them without directly controlling them, so they cannot establish causation on their own.

Question 5.3: What is the difference between inferential and descriptive statistics?

Response 5.3: Inferential statistics use results calculated from a sample to draw conclusions about the population the sample came from (e.g., hypothesis tests and confidence intervals). Descriptive statistics, on the other hand, describe the data at hand through summaries and visualizations.
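The distinction can be seen in a short base R sketch. The data are simulated (balance_sample is a hypothetical object, and the mean of 500, standard deviation of 100, and comparison value of 450 are arbitrary choices), so this illustrates the idea rather than analyzing the credit data.

```r
## simulate a sample of 50 values; the seed makes the draw reproducible
set.seed(1)
balance_sample <- rnorm(50, mean = 500, sd = 100)

## descriptive statistics: summarize the sample itself
sample_mean <- mean(balance_sample)
sample_sd <- sd(balance_sample)

## inferential statistics: use the sample to reason about the population mean
test_result <- t.test(balance_sample, mu = 450)

## 95% confidence interval for the population mean
test_result$conf.int
```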