Week 1

R review using the hot hand example

Reading a csv

You can read a csv file using the function read.csv(filename)

data = read.csv("shot_logs.csv")

Accessing values

You can type the dataname$column_name[row_number] or dataname[row_number, “column name”]

To access the 1729th player name

data$player_name[1729]

## [1] "gary neal"

#you can also do

data[1729, "player_name"]

## [1] "gary neal"

Arranging the data

Use the order function to arrange the row order of the data. Similar to the arrange function in tidyverse

data = data[order(data$player_name, data$GAME_ID, data$PERIOD, data$SHOT_NUMBER), ]

#the comma at the end specifies that all columns will be returned. If you want a specific column returned, add the column index after the comma

For loops

The syntax is for(i in start:end){} Lets add up the number of made and missed shots that are hot or not hot.

made_h = 0; made_n = 0; missed_h = 0; missed_n = 0

for(i in 2:dim(data)[1]) {
#iterate from the second row to the last row
  if (data$SHOT_RESULT[i-1] == "made") {
    if(data$SHOT_RESULT[i] == "made") {
      made_h = made_h + 1
    }
    else {
      missed_h = missed_h + 1
    }
  }
  else {
    if(data$SHOT_RESULT[i] == "made") {
      made_n = made_n + 1
    }
    else {
      missed_n = missed_n + 1
    }
  }
}

Creating a data frame

Now that we have the values, we can make them into vectors then create a data frame.

Vectors are created using c(x,y), where x and y are the values
data frames are created using data.frame(C1 = c(x,y), C2 = x(w,y))

result = data.frame(MADE = c(made_n, made_h), MISSED = c(missed_n, missed_h))

result

##    MADE MISSED
## 1 31942  38222
## 2 25962  31942

We can name the rows using rownames(dataframe)

rownames(result) = c("NOT HOT", "HOT")

result

##          MADE MISSED
## NOT HOT 31942  38222
## HOT     25962  31942

Printing results

There are 3 ways to print in R:

print(): printing single objects
cat() similar to how you’d print in Python
sprintf() similar to how you’d print in C.

We will use cat() to print out results.

made = round(100 * (made_h/(made_h+missed_h)), 1)
missed = round(100 * (made_n/(made_n + missed_n)), 1)

#round the decimals to 2
                    
cat("Scoring percentage after making a shot:", made, "\n")

## Scoring percentage after making a shot: 44.8

cat("Scoring percentage after missing a shot:", missed)

## Scoring percentage after missing a shot: 45.5

Week 2

Reading Flat Files

We know that read.csv reads csv files but there are many other “read” functions. To use read.csv for other files such as a tsv (tab separated), we can add sep = "\t" in the parameter

data = read.csv('AS-N100.tsv', sep = "\t")

You can also read a csv file and other fi;es using read.table. Just remember to specify how you want the table to be read

Plotting

First lets filter our data

subset = data[data$ticker == "GOOGL", ]
xs = as.POSIXct(strptime(paste("2015-11-15",
subset$timestamp), "%Y-%m-%d %H:%M:%S"))
ys = subset$open
plot(xs, ys, xlab = "Time", ylab = "Price (USD)",
main = "GOOGL open price / minute", type = "l") #type = "l" creates a line graph

Lets go through what the xs variable is doing

paste() is taking all the timestamps giving from 2015-11-15 and returns a vector of it.
strptime() then parses through each of the values and converts it into the format which is the second argument.
as.POSIXct() then takes the returned formatted vector and converts the string into a date-time type object

Week 3

Lets first read in the data and column names

data = read.table("AS-N100.data", header = FALSE)
columns = readLines("AS-N100.names")
print(data[1, "ticker"])

## NULL

It returns null because the data does not have column names. We can add column names using colnames(data) = columns where columns is a vector

colnames(data) = columns
print(data[1, "ticker"])

## [1] "LLTC"

Summary Statistics

Min and max

We will create a subset where the ticker is GOOGL then find the min for opening prices.

subset = data[data[,"ticker"] == "GOOGL", ]
print(min(subset[,"open"]))

## [1] 1

There seems to be something wrong with the data as it’s unlikely that GOOGL’s minimum price was $1.

If we check closely, the order of the column names in the .names file is different from the .tsv file from the previous week!

The open and tradecount columns are swapped. We can use colnames() again but specify which columns we want to rename

colnames(data)[c(4,10)] = 
  c("open", "tradecount")

subset = data[data[, "ticker"] == "GOOGL", ]

print(min(subset[,"open"]))

## [1] 744.42

Mean

The mean, as we know, is the average

print(mean(subset[, "open"]))

## [1] 750.4808

Median

The median is the middle value of a dataset when ordered from least to greatest. Unlike the mean, it is not affected by outliers The mean and median will be close if the data is symmetric

Plotting the statistic summaries

We will create a histogram

xs = log(subset$tradecount)
hist(xs, main = "GOOGL on 2015-11-17", ylab = "Count",
xlab = "Log number of trades")

abline(v = c(median(xs), mean(xs)), col = c("blue", "red"),
lty = c(1,2), lwd = c(2,2))

legend("topleft", legend = c("median", "mean"), col = c("blue", "red"), lty = c(1,2), lwd = c(2,2))

lty() specifies the line type, 1 being a solid line and 2 being a dashed line. lwd() specifies the width. The greater the value, the thicker the line.

Creating Functions

The syntax for creating a function is functioname = function(v) {}

Here is a function that finds the max value of a vector

my.max = function(v) {
  result = -Inf
  for (i in 1: length(v)) {
    if (v[i] > result) {
      result = v[i]
    }
  }
  return(result)
}

print(my.max(c(1,5,4,7,9,12)))

## [1] 12

Preview of JSON

JSON stands for JavaScript object notation To read a flat file of this type, we need to load the package rjson.

library(rjson)
libraries = fromJSON(file = "libraries.json")
print(class(libraries))

## [1] "list"

The data is stored as a list. We can use the names() functino to view all the names of each header

names(libraries)

## [1] "exceededTransferLimit" "features"              "type"

To access elements in a list, we nested the brackets like so:

print(libraries[["features"]][[2]][["properties"]]$maptip)

## [1] "Library:West Point Grey\\n4480 W 10th Av"

Images

To load images, we can use the load.image() function from the package imager

library(imager)

## Loading required package: magrittr

## 
## Attaching package: 'imager'

## The following object is masked from 'package:magrittr':
## 
##     add

## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum

## The following object is masked from 'package:graphics':
## 
##     frame

## The following object is masked from 'package:base':
## 
##     save.image

image = load.image("Figure02.png")
plot(image)

To remove the axes, we can use axes = FALSE

plot(image, axes = FALSE)

The plot function lets us interact with the display further by using the plot functions.

W = dim(image)[1]; H = dim(image)[2]
plot(image, axes = FALSE)
points(W/2, H/2, col = "red", pch = 4, lwd = 5)
text(W/2 + 250, H/2, "Center", col = "red")

Week 4

Relational databases

It is an efficient way of storing and manipulating data. Unlike a data frame, it can be synchronized in many locations and accessible to many users.

For example, a company with multiple locations can use a relational database to track their inventory by having their POS update the database instantly when they make a sale. This ensures that the information is accurate and easily accessible for all locations.

Databases have many different tables each contain its own type of data

SQLite

It is easier to use than SQL

Lets load one of their databases using the package RSQLite

library(RSQLite)

db = dbConnect(SQLite(), dbname = "shapes.sqlite")
data = dbReadTable(db, "Platonic")
print(data)

##          Name Sides
## 1 Tetrahedron     4

Insert a row

request = dbExecute(db, "INSERT INTO Platonic VALUES ('Octahedron', 8)")

print(request)

## [1] 1

This returns the number of rows affected by the call. Here, we inserted a new row with values Octahderon and 8 into the table called Platonic

print(dbReadTable(db, "Platonic"))

##          Name Sides
## 1 Tetrahedron     4
## 2  Octahedron     8

Query functions

SELECT * FROM Name returns all columns and rows from the table called Name
SELECT *FROM Name WHERE ... can be used to only return rows that meet the condition given after “WHERE”
SELECT V1, V2 FROM Name returns columns V1 and V2 of all rows in the table called Name
INSERT INTO Name VALUES (v1, v2, ...) adds the row given by the vector v1, v2, … into the table called Name

db functions

result = dbSendQuery() Is used to SELECT data. To fetch the data, you must use -

dbFetch(result, n = 10). This fetches the data and returns the number of rows. If you want all rows printed, you may omit the second argument

dbExecute() is used when you want to edit the database

dbGetQuery() combines the process of dbSendQuery() and dbFetch()

dbClearResult(result) clears the memory to clear space. For every dbFetch, there is a corresponding dbClearResult()

`dbReadTable(db, tablename) reads the entire database

Insert a row without hard coding

Sometimes, we don’t know what we need to add or there is too much we need to add that hard coding will become too time consuming

In this case, we can use the function paste0().

name = "Pyramid"
sides = 4
query = paste0("INSERT INTO Platonic VALUES ('", name, "',", sides, ")")
print(query)

## [1] "INSERT INTO Platonic VALUES ('Pyramid',4)"

request = dbExecute(db, query)
print(dbReadTable(db, "Platonic"))

##          Name Sides
## 1 Tetrahedron     4
## 2  Octahedron     8
## 3     Pyramid     4

Week 5

Exploring a new database

When you’re given a new database, you might not know the table names. You can list the table names like so

library(RSQLite)
db = dbConnect(SQLite(), dbname = "ycoretail.sqlite") #load database
dbListTables(db)

## [1] "prices" "sales"

We can see that there are two tables: sales and prices

Lets print the first element from each

request = dbSendQuery(db, "SELECT * FROM sales")
dbFetch(request, n = 1)

##   StockCode Quantity InvoiceDate
## 1    85123A        6  2010-12-01

dbClearResult(request) #clear memory

request = dbSendQuery(db, "SELECT * FROM prices")
dbFetch(request, n = 1)

##   StockCode                        Description UnitPrice
## 1    85123A WHITE HANGING HEART T-LIGHT HOLDER      2.55

dbClearResult(request)

We can also examine the column names and types of a table

request = dbSendQuery(db, "pragma table_info('sales')")
dbFetch(request)

##   cid        name    type notnull dflt_value pk
## 1   0   StockCode    TEXT       0         NA  0
## 2   1    Quantity INTEGER       0         NA  0
## 3   2 InvoiceDate    TEXT       0         NA  0

Conditioning using LIKE

We can match string values using %. For example, if we want all the data from 2011-05, then we will do 2011-05%.

The position of the % matters. In the previous example, anything before % must be the same, if it was like %2011-05, anything after % must be the same. 2011-%-01 means that anything before and after must be the same, so like 2011-05-01 will match.

Lets use the LIKE operator to retrieve data for StockCode 71053 from 2011-05

request = dbSendQuery(db, "SELECT * FROM sales WHERE StockCode = '71053' AND InvoiceDate LIKE '2011-05%'")

## Warning: Closing open result set, pending rows

data = dbFetch(request)
dbClearResult(request)
dim(data) #returns the number of rows and columns respectively

## [1] 40  3

We can quickly check if we properly filtered out the data by using the function unique()

unique(data$StockCode)

## [1] "71053"

unique(data$InvoiceDate)

##  [1] "2011-05-04" "2011-05-05" "2011-05-06" "2011-05-08" "2011-05-09"
##  [6] "2011-05-10" "2011-05-11" "2011-05-12" "2011-05-16" "2011-05-17"
## [11] "2011-05-18" "2011-05-20" "2011-05-22" "2011-05-23" "2011-05-24"
## [16] "2011-05-25"

More data exploration

min(data$Quantity)

## [1] -1

max(data$Quantity)

## [1] 120

sum(data$Quantity)

## [1] 300

Making the printout

months = c("2011-02", "2011-03", "2011-04", "2011-05")
counts = rep(NA, 4)
for (i in 1:4) {
query = paste0("SELECT SUM(Quantity) FROM sales WHERE
StockCode = '71053' AND InvoiceDate LIKE '", months[i], "%'")
request = dbSendQuery(db, query)
counts[i] = dbFetch(request)
dbClearResult(request)
} #Examine counts to determine xlim and ylim below

plot(c(1,2,3), counts[-4], main = "Product 71053: 2011-05
order", ylab = "Units", xlab = "Month", pch = 19,
xlim=c(0.5,4.5), ylim = c(85, 305)) #Plot all but 4th count

points(4, counts[4], pch = 3) #Plot 4th count

Using sqlInterpolate()

You can use sqlInterpolate() from the DBI package to construct query strings without hardcoding as well

library(DBI)
i = 1
query = sqlInterpolate(db, "SELECT SUM(Quantity) FROM sales WHERE StockCode = '71053' AND InvoiceDate LIKE ?month", month = paste0(months[i], '%'))

print(query)

## <SQL> SELECT SUM(Quantity) FROM sales WHERE StockCode = '71053' AND InvoiceDate LIKE '2011-02%'

Week 6

Linear regression: theory

Assume that we have the input X = (X₁, …, X_n) and the output Y = (Y₁, …, Y_n)

The “true” demand is f(X_i) = $\beta$X_i + $\alpha$

However, the “observed” demand is a noisy version of the “true” demands with errors involved: $\beta$X_i + $\alpha$ + $\epsilon$_i

$\epsilon$ = Y_i - f(X_i)

Linear regression: Example

set.seed(22); x = 1:5; alpha = 2; beta = 1
epsilon = rnorm(5)

y = beta*x + alpha + epsilon
par(mar = c(5,5,1,1))

plot(x,y, xlab = 'x', ylab = 'y', ylim = c(0,10), pch = 19)

The OLS (ordinary least squares) solution

model = lm(y ~ x, data = data.frame(x = x, y = y))
model

## 
## Call:
## lm(formula = y ~ x, data = data.frame(x = x, y = y))
## 
## Coefficients:
## (Intercept)            x  
##      3.0887       0.8414

Notice how the first row first column is the intercept ($\alpha$) coefficient and the second row first column is the slope coefficient ($\beta$)

alpha0 = summary(model)$coefficients[1,1]
beta0 = summary(model)$coefficients[2,1]
par(mar = c(5,5,1,1))
plot(x,y, xlab = 'x', ylab = 'y', ylim = c(0,10), pch = 19)
abline(alpha0, beta0)

Retrieving URLS in R

We will use the NASDAQ API

library(httr)
url = 'https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json?start_date=1985-05-01&end_date=1997-07-01&order=asc&column_index=4&collapse=quarterly&transformation=rdiff'

data = GET(url)
print(data)

## Response [https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json?start_date=1985-05-01&end_date=1997-07-01&order=asc&column_index=4&collapse=quarterly&transformation=rdiff]
##   Date: 2024-02-23 17:24
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 2.55 kB

Midterm Review

Rio M

2024-02-21