About
The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in class and to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself, and RStudio is a convenient interface. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
As the labs progress, you are encouraged to explore beyond what the labs cover; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R.
Setup
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work in Sakai as detailed in previous notes.
Note
For your assignment you may be using different data sets than those included here. Always read the instructions provided carefully before executing any included code chunks and/or adding your own code. For clarity, tasks and questions to be completed are highlighted in red and numbered according to their placement in the task section. The red color is only visible in Preview mode. Quite often you will need to add your own code chunk.
Execute all code chunks (both those already included and those you added), preview, check integrity, and submit your final work (HTML file) in Sakai.

Basic Operations
First we will begin with a few basic operations.
Variable assignment
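We assign values to variables using the assignment operator ‘=’. Another form of assignment, more general and more common in R code, is the ‘<-’ operator: it assigns wherever it appears, whereas ‘=’ inside a function call names an argument instead of assigning. A variable allows you to store a value or an object (e.g. a function).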
x = 128
y = 16
z <- 5
vars = c(2,4,8,16,32) # Creates a vector using the generic combine function 'c'
x # display value of variable x
[1] 128
z # displays value of variable z
[1] 5
vars[1] #This calls the first value in the vector vars
[1] 2
vars[2] #This calls the second value in the vector vars
[1] 4
vars[1:3] #This calls the first through third values in the vector vars
[1] 2 4 8
vars #This displays the whole vector
[1] 2 4 8 16 32
Common Arithmetic Operations
Below shows some simple arithmetic operations.
12*6
128/16
9^2
Basic Data Types
R works with numerous data types. Some of the most basic are numeric, integer, logical (Boolean: TRUE/FALSE), and character (string: "TEXT").
#Type: Character
#Example:"TRUE",'23.4'
v = "TRUE"
class(v)
[1] "character"
#Type: Numeric
#Example: 12.3,5
v = 23.5
class(v)
[1] "numeric"
#Type: Logical
#Example: TRUE,FALSE
v = TRUE
class(v)
[1] "logical"
#Type: Factor (nominal, categorical)
#Example: m f m f m
v = as.factor(c("m", "f", "m"))
class(v)
[1] "factor"
Functions
An R function is invoked by its name, followed by parentheses containing zero or more arguments.
# The following applies the function 'c' (seen earlier) to combine three numeric values into a vector
c(1,2,3)
[1] 1 2 3
# Example of function mean() to calculate the mean of three values
mean(c(5,6,7))
[1] 6
# Square root of a number
sqrt(99)
[1] 9.949874
Importing Data and Variable Assignment
# Here we are reading a file of type csv (comma-separated values), a format commonly exported from Excel
il_income = read.csv(file = "il_income.csv")
top_il_income = read.csv(file = "top_il_income.csv")
Arithmetic Operations with Data
We can extract values from the dataset to perform calculations.
DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
DuPage-Lake
[1] 472
DuPage+Lake
[1] 77390
(DuPage+Lake)/2
[1] 38695
McHenry = top_il_income$per_capita_income[3]
Sangamon = top_il_income$per_capita_income[10]
McHenry-Sangamon
[1] 2524
McHenry+Sangamon
[1] 63712
(McHenry+Sangamon)/2
[1] 31856
##### 1) Repeat here the above arithmetic operations code chunk using instead the columns for McHenry and Sangamon counties (4pts)
Basic Statistics
mean(il_income$per_capita_income)
median(il_income$per_capita_income)
quantile(il_income$per_capita_income)
# Summary
summary(il_income)
mean(top_il_income$per_capita_income)
[1] 32918.5
median(top_il_income$per_capita_income)
[1] 31430
quantile(top_il_income$per_capita_income)
0% 25% 50% 75% 100%
30594.00 30743.75 31430.00 33103.25 38931.00
# Summary
summary(top_il_income)
rank county per_capita_income
Min. : 2.00 DuPage :1 Min. :30594
1st Qu.: 4.25 Kane :1 1st Qu.:30744
Median :12.00 Kendall:1 Median :31430
Mean :27.10 Lake :1 Mean :32919
3rd Qu.:41.00 McHenry:1 3rd Qu.:33103
Max. :90.00 McLean :1 Max. :38931
(Other):4
population region
Min. : 7032 Min. :2.0
1st Qu.: 36921 1st Qu.:2.0
Median :194782 Median :3.0
Mean :334866 Mean :3.2
3rd Qu.:648159 3rd Qu.:4.0
Max. :933736 Max. :5.0
##### 2) Repeat here the above basic statistics code chunk using instead the data from the file top_il_income (4pts)
Vectors
Defining a Vector
A sequence of data elements of the same basic type is defined as a vector.
# vector of numeric values
c(2, 3, 5, 8)
[1] 2 3 5 8
# vector of logical values.
c(TRUE, FALSE, TRUE)
[1] TRUE FALSE TRUE
# vector of character strings.
c("A", "B", "B-", "C", "D")
[1] "A" "B" "B-" "C" "D"
Lists
Defining a List
Lists, as opposed to vectors, can hold components of different types.
scores = c(80, 75, 55) # vector of numeric values
grades = c("B", "C", "D-") # vector of character strings.
office_hours = c(TRUE, FALSE, FALSE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student
[[1]]
[1] 80 75 55
[[2]]
[1] "B" "C" "D-"
[[3]]
[1] TRUE FALSE FALSE
List Slicing
We can retrieve components of the list with the single square bracket [] operator.
student[1]
$myscores
[1] 80 75 55
student[2]
$mygrades
[1] "B" "C" "D-"
student[3]
$myoffice_hours
[1] TRUE FALSE FALSE
# first two components of the list
student[1:2]
$myscores
[1] 80 75 55
$mygrades
[1] "B" "C" "D-"
Member Reference
Using the double square bracket [[]] operator we can reference a member of the list directly. A single bracket [] also selects from the list, but it returns a smaller list rather than the member itself, so the member's contents cannot be used directly.
student[[1]] # Components of the Scores Vector
[1] 80 75 55
First element of the Scores vector
student[[1]][1]
[1] 80
First three elements of the Scores vector
student[[1]][1:3]
[1] 80 75 55
student[[2]][2]
[1] "C"
grades[2]
[1] "C"
##### 3) Repeat here the above code chunk to extract instead the second element of the grades vector (3pts)
Named List Members
It’s possible to assign names to list members and reference them by names instead of by numeric indexes.
student = list(myscores = scores, mygrades = grades , myoffice_hours = office_hours)
student
$myscores
[1] 80 75 55
$mygrades
[1] "B" "C" "D-"
$myoffice_hours
[1] TRUE FALSE FALSE
student$myscores
[1] 80 75 55
student$mygrades
[1] "B" "C" "D-"
student$myoffice_hours
[1] TRUE FALSE FALSE
Matrices
All columns in a matrix must have the same data type and the same length.
Create a numeric matrix of 5 rows and 4 columns made of sequential numbers 1:20
x_mat = matrix(1:20, nrow=5, ncol=4)
x_mat
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
Retrieve the 4th column of matrix
x_mat[,4]
[1] 16 17 18 19 20
Retrieve the 3rd row of matrix
x_mat[3,]
[1] 3 8 13 18
Retrieve rows 2,3,4 of columns 1,2,3
x_mat[2:4,1:3]
[,1] [,2] [,3]
[1,] 2 7 12
[2,] 3 8 13
[3,] 4 9 14
x_mat[3,3]
[1] 13
x_mat[3,]
[1] 3 8 13 18
x_mat[,3]
[1] 11 12 13 14 15
##### 4) Repeat here the above code chunk to extract instead the third row and third column of the matrix (3pts)
Data Frames
A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.
Defining a Data Frame
When we need to store data in table form, we use data frames, which are created by combining vectors of equal length. The variables of a data set are the columns and the observations are the rows.
The str() function helps us to display the internal structure of any R data structure or object to make sure that it’s correct.
str(il_income)
'data.frame': 102 obs. of 5 variables:
$ rank : int 1 2 3 4 5 6 7 8 9 10 ...
$ county : Factor w/ 102 levels "Adams","Alexander",..: 16 22 49 99 45 60 101 64 86 10 ...
$ per_capita_income: int 30468 38931 38459 30791 30645 23937 24802 30728 23279 26087 ...
$ population : int 5238216 933736 703910 687263 530847 307343 287078 266209 264052 208861 ...
$ region : int 1 2 2 2 2 2 2 5 5 3 ...
Creating a Data Frame
Snapshot of the solar system.
name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)
Now, by combining the vectors of equal size, we can create a data frame object.
planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df
Sentiment Analysis
This is an added section for the honors class section only. The purpose is to conduct a sample text analysis investigation and assess the results.
And they wishin’ and wishin’ and wishin’ and wishin’ They wishin’ on me, yuh
[Verse 1] I been movin’ calm, don’t start no trouble with me Tryna keep it peaceful is a struggle for me Don’t pull up at 6 AM to cuddle with me You know how I like it when you lovin’ on me I don’t wanna die for them to miss me Yes, I see the things that they wishin’ on me Hope I got some brothers that outlive me They gon’ tell the story, shit was different with me
[Chorus] God’s plan, God’s plan I hold back, sometimes I won’t, yuh I feel good, sometimes I don’t, ayy, don’t I finessed down Weston Road, ayy, ’nessed Might go down a G.O.D., yeah, wait I go hard on Southside G, yuh, wait I make sure that north-side eat
[Post-Chorus] And still, bad things It’s a lot of bad things That they wishin’ and wishin’ and wishin’ and wishin’ They wishin’ on me Bad things It’s a lot of bad things That they wishin’ and wishin’ and wishin’ and wishin’ They wishin’ on me Yuh, ayy, ayy
I analyzed a rap song by Drake; the song is about overcoming struggles because it’s “God’s Plan.” The sentiment analysis assigned negative values to words like struggle and trouble and to the profane words. Overall, the verses I input had a score of 0, meaning the result was neither positive nor negative. The magnitudes of the words analyzed were not very strong, with the highest being 2.2. Most of the words were around 0.5, meaning the strength of the sentiment was low.
Select a small text article (10 to 20 lines) from an online source of your choice and a topic of your interest. You can also make up your own text. Copy and paste the paragraph here. Use the Google Cloud NLP API demo at https://cloud.google.com/natural-language/ to paste your text in the appropriate space, and run a sentiment analysis. Without diving much into the analytics and the coding details behind the API, write a half paragraph summarizing your assessment of the sentiment analysis results as given by Google NLP. You probably want to investigate the meaning of score and magnitude in the results displayed by Google (6pts)
Suggested Exercises & Resources
Data Sources
Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site.
---
title: "Working with R/RStudio (lab02)"
author: Lauren Kroll
date: January 30, 2020
output:
  html_notebook: default
  html_document: default
subtitle: BSAD343H, Business Analytics, Spring 2020
---

### About

The goal of this lab is to introduce you to R and RStudio, which you'll be using throughout the course both to learn the statistical concepts discussed in class and to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself, and RStudio is a convenient interface. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

As the labs progress, you are encouraged to explore beyond what the labs cover; a willingness to experiment will make you a much better programmer.  Before we  get to that stage, however, you need to build some basic fluency in R. 

### Setup

Remember to always set your working directory to the source file location. Go to 'Session', scroll down to 'Set Working Directory', and click 'To Source File Location'. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work in Sakai as detailed in previous notes.

### Note

For your assignment you may be using different data sets than those included here. Always read the instructions provided carefully before executing any included code chunks and/or adding your own code. For clarity, tasks and questions to be completed are highlighted in red and numbered according to their placement in the task section. The red color is only visible in Preview mode. Quite often you will need to add your own code chunk.

Execute all code chunks (both those already included and those you added), preview, check integrity, and submit your final work (HTML file) in Sakai.

----------

![](img01.png)


# Basic Operations
First we will begin with a few basic operations. 

## Variable assignment
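We assign values to variables using the assignment operator '='. Another form of assignment, more general and more common in R code, is the '<-' operator: it assigns wherever it appears, whereas '=' inside a function call names an argument instead of assigning.
A variable allows you to store a value or an object (e.g. a function).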

```{r}
x = 128
y = 16
z <- 5
vars = c(2,4,8,16,32) # Creates a vector using the generic combine function 'c'
```

```{r}
x # display value of variable x
z # displays value of variable z
```

```{r}
vars[1] #This calls the first value in the vector vars
```

```{r}
vars[2] #This calls the second value in the vector vars
```

```{r}
vars[1:3] #This calls the first through third values in the vector vars
```

```{r}
vars #This displays the whole vector
```
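
As a rough illustration of how the two assignment operators described above differ, here is a minimal sketch. The names `a`, `b`, `w`, `avg1`, and `avg2` are made up for this example and are not part of the lab.

```r
# At the top level, '=' and '<-' do the same thing
a = 10
b <- 10
identical(a, b)

# Inside a function call, '=' only names an argument of mean() ...
avg1 <- mean(x = c(1, 2, 3))
avg1

# ... whereas '<-' creates the variable w and then passes its value along
avg2 <- mean(w <- c(4, 5, 6))
avg2
w
```

This is one reason many R users reserve '=' for naming arguments and use '<-' for assignment.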

## Common Arithmetic Operations
Below shows some simple arithmetic operations.
```{r,eval=TRUE}
12*6
128/16
9^2
```

## Basic Data Types
R works with numerous data types. Some of the most basic are numeric, integer, logical (Boolean: `TRUE`/`FALSE`), and character (string: `"TEXT"`).

```{r,eval=TRUE}
#Type: Character                   
#Example:"TRUE",'23.4'

v = "TRUE"                       
class(v)                           

#Type: Numeric                
#Example: 12.3,5

v = 23.5                  
class(v)                   
              
#Type: Logical    
#Example: TRUE,FALSE

v = TRUE
class(v)

#Type: Factor (nominal, categorical)
#Example: m f m f m

v = as.factor(c("m", "f", "m"))
class(v)
```
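
The chunk above does not show the integer type mentioned in the text, so here is a small sketch that contrasts it with numeric and shows a character value being converted. The variable names `txt` and `num` are assumptions for this example only.

```r
# A plain number is stored as "numeric" (double precision)
class(5)

# The L suffix asks R to store the value as an integer
class(5L)

# Quoted digits are character data until they are converted
txt <- "23.4"
class(txt)
num <- as.numeric(txt)
class(num)
num + 1
```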

##  Functions

An R function is invoked by its name, followed by parentheses containing zero or more arguments.
```{r}
# The following applies the function 'c' (seen earlier) to combine three numeric values into a vector 
c(1,2,3)

# Example of function mean() to calculate the mean of three values
mean(c(5,6,7))

# Square root of a number
sqrt(99)
```
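
To make the "zero or more arguments" rule concrete, the sketch below calls one base R function with no arguments and another with a named argument. The variable names `today` and `rounded` are chosen for this example only.

```r
# Zero arguments: the parentheses are still required
today <- Sys.Date()
class(today)

# One positional argument plus one named argument
rounded <- round(3.14159, digits = 2)
rounded

# The same call with both arguments passed by position
round(3.14159, 2)
```

Naming arguments, as in `digits = 2`, tends to make a call easier to read when a function accepts several options.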

## Importing Data and Variable Assignment

```{r,eval=TRUE}
# Here we are reading a file of type csv (comma-separated values), a format commonly exported from Excel
il_income = read.csv(file = "il_income.csv")
top_il_income = read.csv(file = "top_il_income.csv")
```
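
The import step can be tried without the lab files by writing a small CSV to a temporary location and reading it back. This is only a sketch: the data frame `demo`, its values, and the temporary file are invented for illustration and are unrelated to `il_income.csv`.

```r
# Build a tiny made-up table and save it as a CSV file
demo <- data.frame(
  county = c("Alpha", "Beta"),
  per_capita_income = c(30000, 25000)
)
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)

# Read it back the same way the lab reads its data files
demo_in <- read.csv(file = tmp, stringsAsFactors = FALSE)

# Quick checks that the import worked as expected
head(demo_in)
names(demo_in)
nrow(demo_in)
```

Setting `stringsAsFactors` explicitly keeps the result the same across R versions, since the default for this argument changed in R 4.0.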


## Arithmetic Operations with Data

We can extract values from the dataset to perform calculations.
```{r,eval=TRUE}
DuPage = top_il_income$per_capita_income[1]
Lake = top_il_income$per_capita_income[2]
DuPage-Lake
DuPage+Lake
(DuPage+Lake)/2
```
```{r,eval=TRUE}
McHenry = top_il_income$per_capita_income[3]
Sangamon = top_il_income$per_capita_income[10]
McHenry-Sangamon
McHenry+Sangamon
(McHenry+Sangamon)/2
```

<span style="color:red">
##### 1) Repeat here the above arithmetic operations code chunk using instead the columns for McHenry and Sangamon counties (4pts)
</span>
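
The chunks above pick counties by row position. As an alternative sketch, a value can also be looked up by county name, which does not depend on the row order. The data frame `toy_income` and its values are invented for this example; only the column names mirror the lab data.

```r
# Toy data frame standing in for top_il_income (values are invented)
toy_income <- data.frame(
  county = c("Alpha", "Beta", "Gamma"),
  per_capita_income = c(40000, 35000, 30000),
  stringsAsFactors = FALSE
)

# Select the income whose county column matches a given name
alpha_income <- toy_income$per_capita_income[toy_income$county == "Alpha"]
gamma_income <- toy_income$per_capita_income[toy_income$county == "Gamma"]

# The same arithmetic as in the lab, using the looked-up values
alpha_income - gamma_income
alpha_income + gamma_income
(alpha_income + gamma_income) / 2
```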


## Basic Statistics

```{r,eval=TRUE}
mean(il_income$per_capita_income)
median(il_income$per_capita_income)
quantile(il_income$per_capita_income)
# Summary 
summary(il_income)
```
```{r,eval=TRUE}
mean(top_il_income$per_capita_income)
median(top_il_income$per_capita_income)
quantile(top_il_income$per_capita_income)
# Summary 
summary(top_il_income)
```

<span style="color:red">
##### 2) Repeat here the above basic statistics code chunk using instead the data from the file top_il_income (4pts)
</span>
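
To see what these summary functions compute, it can help to run them on a vector small enough to check by hand. The values in `incomes` below are invented for illustration.

```r
incomes <- c(10, 20, 30, 40, 100)  # invented values

# mean() is the sum divided by the number of observations
sum(incomes) / length(incomes)
mean(incomes)

# median() is the middle value of the sorted data,
# so it is not pulled upward by the single large value
median(incomes)

# By default quantile() returns the minimum, the three quartiles, and the maximum
quantile(incomes)
```

Here the mean (40) is larger than the median (30) because one value is much larger than the rest.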

# Vectors

## Defining a Vector

A sequence of data elements of the same basic type is defined as a vector.
```{r,eval=TRUE}
# vector of numeric values
c(2, 3, 5, 8)

# vector of logical values.
c(TRUE, FALSE, TRUE)

# vector of character strings.
c("A", "B", "B-", "C", "D")
```
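
Because a vector holds only one basic type, R converts the elements to a common type when different types are combined. The short sketch below shows this; the names `mixed` and `nums` are made up for the example.

```r
# Mixing numbers, text, and logicals: everything becomes character
mixed <- c(1, "a", TRUE)
mixed
class(mixed)

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
nums <- c(2, TRUE, FALSE)
nums
class(nums)
```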

# Lists

## Defining a List

Lists, as opposed to vectors, can hold components of different types.

```{r,eval=TRUE}
scores = c(80, 75, 55)  # vector of numeric values                   
grades = c("B", "C", "D-")  # vector of character strings.          

office_hours = c(TRUE, FALSE, FALSE) # vector of logical values.
student = list(scores,grades,office_hours) # list of vectors
student
```

## List Slicing 

We can retrieve components of the list with the single square bracket `[]` operator.


```{r,eval=TRUE}
student[1]     
student[2]
student[3]

# first two components of the list
student[1:2]
```

## Member Reference

Using the double square bracket `[[]]` operator we can reference a member of the list directly. A single bracket `[]` also selects from the list, but it returns a smaller list rather than the member itself, so the member's contents cannot be used directly.

```{r,eval=TRUE}
student[[1]] # Components of the Scores Vector
```

*First element of the Scores vector*

```{r,eval=TRUE}
student[[1]][1]
```


*First three elements of the Scores vector*

```{r,eval=TRUE}
student[[1]][1:3]
```
```{r,eval=TRUE}
student[[2]][2]
```
```{r}
grades[2]
```

<span style="color:red">
##### 3) Repeat here the above code chunk to extract instead the second element of the grades vector (3pts)
</span>
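
The difference between the single and double bracket is easiest to see by checking the class of each result. This sketch rebuilds a small list so that it runs on its own; the `_demo` names are assumptions for the example.

```r
scores_demo <- c(80, 75, 55)
grades_demo <- c("B", "C", "D-")
student_demo <- list(scores_demo, grades_demo)

# Single bracket: the result is still a list (of length 1)
class(student_demo[1])
length(student_demo[1])

# Double bracket: the result is the numeric vector itself
class(student_demo[[1]])

# Functions such as mean() need the vector itself, so use [[ ]]
mean(student_demo[[1]])
```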

## Named List Members

It's possible to assign names to list members and reference them by names instead of by numeric indexes.

```{r,eval=TRUE}
student = list(myscores = scores, mygrades = grades , myoffice_hours = office_hours) 

student
student$myscores
student$mygrades
student$myoffice_hours
```
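
Named members can be reached in more than one way. The sketch below, using a self-contained copy of the list called `student_demo` (a name assumed for this example), shows `names()` and two equivalent forms of access.

```r
student_demo <- list(
  myscores = c(80, 75, 55),
  mygrades = c("B", "C", "D-")
)

# List the member names
names(student_demo)

# These two forms return the same member
student_demo$mygrades
student_demo[["mygrades"]]

# A name and a position can be combined to reach a single element
student_demo$mygrades[2]
```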

# Matrices

All columns in a matrix must have the same data type and the same length.

*Create a numeric matrix of 5 rows and 4 columns made of sequential numbers 1:20*
```{r}
x_mat = matrix(1:20, nrow=5, ncol=4)
x_mat
```

*Retrieve the 4th column of matrix*
```{r}
x_mat[,4]
```


*Retrieve the 3rd row of matrix*
```{r}
x_mat[3,]
```

*Retrieve rows 2,3,4 of columns 1,2,3*
```{r}
x_mat[2:4,1:3]
```
```{r}
x_mat[3,3]
x_mat[3,]
x_mat[,3]
```

<span style="color:red">
##### 4) Repeat here the above code chunk to extract instead the third row and third column of the matrix (3pts)
</span>
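
By default `matrix()` fills its values column by column, which is why the first column of the matrix above holds 1 through 5. The sketch below compares that with a row-wise fill using the `byrow` argument; the names `by_col` and `by_row` are made up for this example.

```r
# Default: values fill down each column in turn
by_col <- matrix(1:20, nrow = 5, ncol = 4)

# byrow = TRUE: values fill across each row in turn
by_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)

# Both matrices have the same shape (rows, columns)
dim(by_col)
dim(by_row)

# The same position holds a different value in each matrix
by_col[3, 3]
by_row[3, 3]

# First row of the row-wise matrix
by_row[1, ]
```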


# Data Frames

A data frame is more general than a matrix, in that different columns can have different data types (numeric, character, logical, factor). It is a powerful way to work with mixed data structures.

## Defining a Data Frame

When we need to store data in table form, we use data frames, which are created by combining vectors of equal length. The variables of a data set are the columns and the observations are the rows.


The `str()` function helps us to display the internal structure of any R data structure or object to make sure that it's correct.


```{r,eval=TRUE}
str(il_income)
```

## Creating a Data Frame

Snapshot of the solar system.

```{r, eval=TRUE}
name = c("Earth", "Mars", "Jupiter")
type = c("Terrestrial","Terrestrial", "Gas giant")
diameter = c(1, 0.532, 11.209)
rotation = c(1, 1.03, 0.41)
rings = c(FALSE, FALSE, TRUE)
```

Now, by combining the vectors of equal size, we can create a data frame object.

```{r, eval=TRUE}
planets_df = data.frame(name,type,diameter,rotation,rings)
planets_df

```
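
Once a data frame exists, its columns and rows can be selected much like those of a list or a matrix. The following sketch rebuilds a reduced copy of the planets table under the assumed name `planets_demo` so that it runs on its own.

```r
planets_demo <- data.frame(
  name = c("Earth", "Mars", "Jupiter"),
  diameter = c(1, 0.532, 11.209),
  rings = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Number of observations (rows) and variables (columns)
nrow(planets_demo)
ncol(planets_demo)

# A single column, returned as a vector
planets_demo$diameter

# Rows that meet a condition: planets with rings
planets_demo[planets_demo$rings, ]

# Names of planets with a diameter larger than Earth's
planets_demo$name[planets_demo$diameter > 1]
```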

## Sentiment Analysis

This is an added section for the honors class section only. The purpose is to conduct a sample text analysis investigation and assess the results.

<span style="color:red">

And they wishin' and wishin' and wishin' and wishin'
They wishin' on me, yuh

[Verse 1]
I been movin' calm, don't start no trouble with me
Tryna keep it peaceful is a struggle for me
Don't pull up at 6 AM to cuddle with me
You know how I like it when you lovin' on me
I don't wanna die for them to miss me
Yes, I see the things that they wishin' on me
Hope I got some brothers that outlive me
They gon' tell the story, shit was different with me

[Chorus]
God's plan, God's plan
I hold back, sometimes I won't, yuh
I feel good, sometimes I don't, ayy, don't
I finessed down Weston Road, ayy, 'nessed
Might go down a G.O.D., yeah, wait
I go hard on Southside G, yuh, wait
I make sure that north-side eat

[Post-Chorus]
And still, bad things
It's a lot of bad things
That they wishin' and wishin' and wishin' and wishin'
They wishin' on me
Bad things
It's a lot of bad things
That they wishin' and wishin' and wishin' and wishin'
They wishin' on me
Yuh, ayy, ayy


I analyzed a rap song by Drake; the song is about overcoming struggles because it's "God's Plan." The sentiment analysis assigned negative values to words like struggle and trouble and to the profane words. Overall, the verses I input had a score of 0, meaning the result was neither positive nor negative. The magnitudes of the words analyzed were not very strong, with the highest being 2.2. Most of the words were around 0.5, meaning the strength of the sentiment was low.

Select a small text article (10 to 20 lines) from an online source of your choice and a topic of your interest.  You can also make up your own text. Copy and paste the paragraph here.  Use the Google Cloud NLP API demo at https://cloud.google.com/natural-language/ to paste your text in the appropriate space, and run a sentiment analysis.  Without diving much into the analytics and the coding details behind the API, write a half paragraph summarizing your assessment of the sentiment analysis results as given by Google NLP. You probably want to investigate the meaning of $score$ and $magnitude$ in the results displayed by Google (6pts)
</span>

# Suggested Exercises & Resources

## Exercises

* Datacamp - Learn Data Science from your browser: 
https://www.datacamp.com/courses/free-introduction-to-r

* R-tutor - An R intro to stats that explains basic R concepts:
http://www.r-tutor.com/r-introduction

## Data Sources
Data samples used in this worksheet were downloaded from the U.S. Census Bureau American FactFinder site.

* "SELECTED ECONOMIC CHARACTERISTICS 2006-2010 American Community Survey 5-Year Estimates" - U.S. Census Bureau. Retrieved 2016-09-09:
https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
