You can read a csv file using the function
read.csv(filename)
data = read.csv("shot_logs.csv")
You can access a single value by typing dataname$column_name[row_number] or dataname[row_number, "column_name"]
To access the 1729th player name
data$player_name[1729]
## [1] "gary neal"
#you can also do
data[1729, "player_name"]
## [1] "gary neal"
Use the order() function to arrange the row order of the data, similar to the arrange() function in the tidyverse.
data = data[order(data$player_name, data$GAME_ID, data$PERIOD, data$SHOT_NUMBER), ]
#the comma at the end specifies that all columns will be returned. If you want a specific column returned, add the column index after the comma
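For example (purely illustrative), adding a vector of column names after the comma returns only those columns:
head(data[order(data$player_name), c("player_name", "SHOT_RESULT")]) #first few rows, two columns only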
The syntax is for(i in start:end){}. Let's add up the number of made and missed shots, split by whether the previous shot was made ("hot") or missed ("not hot").
made_h = 0; made_n = 0; missed_h = 0; missed_n = 0
for (i in 2:dim(data)[1]) { #iterate from the second row to the last row
  if (data$SHOT_RESULT[i-1] == "made") { #previous shot made: the shooter is "hot"
    if (data$SHOT_RESULT[i] == "made") {
      made_h = made_h + 1
    } else {
      missed_h = missed_h + 1
    }
  } else { #previous shot missed: "not hot"
    if (data$SHOT_RESULT[i] == "made") {
      made_n = made_n + 1
    } else {
      missed_n = missed_n + 1
    }
  }
}
Now that we have the values, we can make them into vectors and then create a data frame.
Vectors are created using c(x, y), where x and y are the values.
Data frames are created using data.frame(C1 = c(x, y), C2 = c(w, z)).
result = data.frame(MADE = c(made_n, made_h), MISSED = c(missed_n, missed_h))
result
## MADE MISSED
## 1 31942 38222
## 2 25962 31942
We can name the rows using rownames(dataframe)
rownames(result) = c("NOT HOT", "HOT")
result
## MADE MISSED
## NOT HOT 31942 38222
## HOT 25962 31942
There are 3 ways to print in R:
print() prints single objects.
cat() is similar to how you'd print in Python.
sprintf() is similar to how you'd print in C.
We will use cat() to print out the results.
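A quick comparison of the three (toy value, not from the data set):
x = 3.14159
print(x)                       #prints the object as R would at the console
cat("pi is about", x, "\n")    #concatenates its arguments and prints them
sprintf("pi is about %.2f", x) #C-style formatted string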
made = round(100 * (made_h/(made_h+missed_h)), 1)
missed = round(100 * (made_n/(made_n + missed_n)), 1)
#round the percentages to 1 decimal place
cat("Scoring percentage after making a shot:", made, "\n")
## Scoring percentage after making a shot: 44.8
cat("Scoring percentage after missing a shot:", missed)
## Scoring percentage after missing a shot: 45.5
We know that read.csv reads csv files, but there are many other "read" functions. To use read.csv for other file types such as a tsv (tab-separated values) file, we can add sep = "\t" as a parameter.
data = read.csv('AS-N100.tsv', sep = "\t")
You can also read a csv file, and other file types, using read.table. Just remember to specify how you want the table to be read.
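For example, a minimal sketch of the equivalent read.table call (assuming the file has a header row, as read.csv assumes by default):
data = read.table("AS-N100.tsv", sep = "\t", header = TRUE) #should load the same table as the read.csv call above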
First, let's filter the data to the GOOGL rows
subset = data[data$ticker == "GOOGL", ]
xs = as.POSIXct(strptime(paste("2015-11-15",
subset$timestamp), "%Y-%m-%d %H:%M:%S"))
ys = subset$open
plot(xs, ys, xlab = "Time", ylab = "Price (USD)",
main = "GOOGL open price / minute", type = "l") #type = "l" creates a line graph
Let's go through what the xs variable is doing.
paste() prepends the date 2015-11-15 to each timestamp and returns a vector of strings.
strptime() then parses each string according to the format given as the second argument.
as.POSIXct() then converts the parsed values into date-time objects that plot() can place on a time axis.
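A minimal sketch of the same pipeline applied to a single hypothetical timestamp:
ts = paste("2015-11-15", "09:30:00")       #"2015-11-15 09:30:00"
parsed = strptime(ts, "%Y-%m-%d %H:%M:%S") #parsed into a POSIXlt date-time
as.POSIXct(parsed)                         #converted to POSIXct for plotting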
Let's first read in the data and the column names
data = read.table("AS-N100.data", header = FALSE)
columns = readLines("AS-N100.names")
print(data[1, "ticker"])
## NULL
It returns NULL because the data does not have column names. We can add column names using colnames(data) = columns, where columns is a vector of names
colnames(data) = columns
print(data[1, "ticker"])
## [1] "LLTC"
We will create a subset where the ticker is GOOGL, then find the minimum opening price.
subset = data[data[,"ticker"] == "GOOGL", ]
print(min(subset[,"open"]))
## [1] 1
There seems to be something wrong with the data as it’s unlikely that GOOGL’s minimum price was $1.
If we check closely, the order of the column names in the .names file is different from the .tsv file from the previous week!
The open and tradecount columns are swapped. We can
use colnames() again but specify which columns we want to
rename
colnames(data)[c(4,10)] = c("open", "tradecount")
subset = data[data[, "ticker"] == "GOOGL", ]
print(min(subset[,"open"]))
## [1] 744.42
The mean, as we know, is the average.
print(mean(subset[, "open"]))
## [1] 750.4808
The median is the middle value of a dataset when ordered from least to greatest. Unlike the mean, it is not affected by outliers. The mean and median will be close if the data is symmetric.
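A toy illustration of the outlier effect (made-up values, not from the data):
v = c(1, 2, 3, 4, 100)
mean(v)   #22: pulled up by the single outlier
median(v) #3: unaffected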
We will create a histogram
xs = log(subset$tradecount)
hist(xs, main = "GOOGL on 2015-11-17", ylab = "Count",
xlab = "Log number of trades")
abline(v = c(median(xs), mean(xs)), col = c("blue", "red"),
lty = c(1,2), lwd = c(2,2))
legend("topleft", legend = c("median", "mean"), col = c("blue", "red"), lty = c(1,2), lwd = c(2,2))
The lty argument specifies the line type, 1 being a solid line and 2 being a dashed line. The lwd argument specifies the line width; the greater the value, the thicker the line.
The syntax for creating a function is
function_name = function(v) {}
Here is a function that finds the max value of a vector
my.max = function(v) {
  result = -Inf
  for (i in 1:length(v)) {
    if (v[i] > result) {
      result = v[i] #keep the largest value seen so far
    }
  }
  return(result)
}
print(my.max(c(1,5,4,7,9,12)))
## [1] 12
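R's built-in max() does the same job and is what you would normally use:
print(max(c(1,5,4,7,9,12))) #also 12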
JSON stands for JavaScript Object Notation. To read a flat file of this type, we need to load the package rjson.
library(rjson)
libraries = fromJSON(file = "libraries.json")
print(class(libraries))
## [1] "list"
The data is stored as a list. We can use the names() function to view the names of the top-level elements
names(libraries)
## [1] "exceededTransferLimit" "features" "type"
To access elements in a list, we nest the brackets like so:
print(libraries[["features"]][[2]][["properties"]]$maptip)
## [1] "Library:West Point Grey\\n4480 W 10th Av"
To load images, we can use the load.image() function from the package imager
library(imager)
## Loading required package: magrittr
##
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
##
## add
## The following objects are masked from 'package:stats':
##
## convolve, spectrum
## The following object is masked from 'package:graphics':
##
## frame
## The following object is masked from 'package:base':
##
## save.image
image = load.image("Figure02.png")
plot(image)
To remove the axes, we can use axes = FALSE
plot(image, axes = FALSE)
After calling plot(), we can build on the display further with low-level plotting functions such as points() and text().
W = dim(image)[1]; H = dim(image)[2]
plot(image, axes = FALSE)
points(W/2, H/2, col = "red", pch = 4, lwd = 5)
text(W/2 + 250, H/2, "Center", col = "red")
A relational database is an efficient way of storing and manipulating data. Unlike a data frame, it can be synchronized across many locations and made accessible to many users.
For example, a company with multiple locations can use a relational database to track its inventory by having each point of sale update the database instantly when a sale is made. This ensures that the information is accurate and easily accessible to all locations.
Databases contain many different tables, each holding its own type of data.
SQLite is a lightweight database format that is easier to work with than a full SQL server. Let's load a SQLite database using the package RSQLite
library(RSQLite)
db = dbConnect(SQLite(), dbname = "shapes.sqlite")
data = dbReadTable(db, "Platonic")
print(data)
## Name Sides
## 1 Tetrahedron 4
request = dbExecute(db, "INSERT INTO Platonic VALUES ('Octahedron', 8)")
print(request)
## [1] 1
This returns the number of rows affected by the call. Here, we inserted a new row with the values Octahedron and 8 into the table called Platonic
print(dbReadTable(db, "Platonic"))
## Name Sides
## 1 Tetrahedron 4
## 2 Octahedron 8
SELECT * FROM Name returns all columns and rows from
the table called Name
SELECT * FROM Name WHERE ... can be used to only return rows that meet the condition given after WHERE
SELECT V1, V2 FROM Name returns columns V1 and V2 of
all rows in the table called Name
INSERT INTO Name VALUES (v1, v2, ...) adds the row
given by the vector v1, v2, … into the table called Name
result = dbSendQuery() is used to SELECT data.
To fetch the data, you then call dbFetch(result, n = 10). This fetches the first n rows of the result; if you want all rows, omit the second argument.
dbExecute() is used when you want to modify the database.
dbGetQuery() combines the steps of dbSendQuery() and dbFetch().
dbClearResult(result) frees the memory held by a result set. Every dbSendQuery() should be followed by a corresponding dbClearResult().
dbReadTable(db, tablename) reads an entire table into a data frame.
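For example, these two approaches retrieve the same rows from the Platonic table above (a sketch):
request = dbSendQuery(db, "SELECT * FROM Platonic WHERE Sides = 4")
four_sided = dbFetch(request) #fetch all matching rows
dbClearResult(request)        #free the result set
four_sided = dbGetQuery(db, "SELECT * FROM Platonic WHERE Sides = 4") #one-step equivalent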
Sometimes we don't know in advance what values we need to insert, or there are too many for hard-coding each query to be practical.
In this case, we can build the query string with the function paste0().
name = "Pyramid"
sides = 4
query = paste0("INSERT INTO Platonic VALUES ('", name, "',", sides, ")")
print(query)
## [1] "INSERT INTO Platonic VALUES ('Pyramid',4)"
request = dbExecute(db, query)
print(dbReadTable(db, "Platonic"))
## Name Sides
## 1 Tetrahedron 4
## 2 Octahedron 8
## 3 Pyramid 4
When you’re given a new database, you might not know the table names. You can list the table names like so
library(RSQLite)
db = dbConnect(SQLite(), dbname = "ycoretail.sqlite") #load database
dbListTables(db)
## [1] "prices" "sales"
We can see that there are two tables: sales and prices
Let's print the first row from each
request = dbSendQuery(db, "SELECT * FROM sales")
dbFetch(request, n = 1)
## StockCode Quantity InvoiceDate
## 1 85123A 6 2010-12-01
dbClearResult(request) #clear memory
request = dbSendQuery(db, "SELECT * FROM prices")
dbFetch(request, n = 1)
## StockCode Description UnitPrice
## 1 85123A WHITE HANGING HEART T-LIGHT HOLDER 2.55
dbClearResult(request)
We can also examine the column names and types of a table
request = dbSendQuery(db, "pragma table_info('sales')")
dbFetch(request)
## cid name type notnull dflt_value pk
## 1 0 StockCode TEXT 0 NA 0
## 2 1 Quantity INTEGER 0 NA 0
## 3 2 InvoiceDate TEXT 0 NA 0
We can match string patterns using the % wildcard. For example, if we want all the data from 2011-05, we can use the pattern '2011-05%'.
The position of the % matters. In the previous example, the string must begin with 2011-05 and anything may follow the %. If the pattern were '%2011-05', the string would have to end with 2011-05, with anything allowed before the %.
'2011-%-01' means the string must begin with '2011-' and end with '-01', with anything in between, so '2011-05-01' would match.
Let's use the LIKE operator to retrieve the data for StockCode 71053 from 2011-05
request = dbSendQuery(db, "SELECT * FROM sales WHERE StockCode = '71053' AND InvoiceDate LIKE '2011-05%'")
## Warning: Closing open result set, pending rows
data = dbFetch(request)
dbClearResult(request)
dim(data) #returns the number of rows and columns respectively
## [1] 40 3
We can quickly check that we filtered the data properly by using the function unique()
unique(data$StockCode)
## [1] "71053"
unique(data$InvoiceDate)
## [1] "2011-05-04" "2011-05-05" "2011-05-06" "2011-05-08" "2011-05-09"
## [6] "2011-05-10" "2011-05-11" "2011-05-12" "2011-05-16" "2011-05-17"
## [11] "2011-05-18" "2011-05-20" "2011-05-22" "2011-05-23" "2011-05-24"
## [16] "2011-05-25"
min(data$Quantity)
## [1] -1
max(data$Quantity)
## [1] 120
sum(data$Quantity)
## [1] 300
We can repeat the query for several months and plot the monthly totals.
months = c("2011-02", "2011-03", "2011-04", "2011-05")
counts = rep(NA, 4)
for (i in 1:4) {
  query = paste0("SELECT SUM(Quantity) FROM sales WHERE StockCode = '71053' AND InvoiceDate LIKE '", months[i], "%'")
  request = dbSendQuery(db, query)
  counts[i] = dbFetch(request)[1, 1] #extract the single numeric value from the returned data frame
  dbClearResult(request)
} #Examine counts to determine xlim and ylim below
plot(c(1,2,3), counts[-4], main = "Product 71053: 2011-05 order",
     ylab = "Units", xlab = "Month", pch = 19,
     xlim = c(0.5,4.5), ylim = c(85, 305)) #Plot all but the 4th count
points(4, counts[4], pch = 3) #Plot the 4th count
You can also use sqlInterpolate() from the DBI package to construct query strings without hard-coding the values
library(DBI)
i = 1
query = sqlInterpolate(db, "SELECT SUM(Quantity) FROM sales WHERE StockCode = '71053' AND InvoiceDate LIKE ?month", month = paste0(months[i], '%'))
print(query)
## <SQL> SELECT SUM(Quantity) FROM sales WHERE StockCode = '71053' AND InvoiceDate LIKE '2011-02%'
Assume that we have inputs \(X = (X_1, \dots, X_n)\) and outputs \(Y = (Y_1, \dots, Y_n)\).
The "true" demand is \(f(X_i) = \beta X_i + \alpha\).
However, the "observed" demand is a noisy version of the true demand: \(Y_i = \beta X_i + \alpha + \epsilon_i\), where \(\epsilon_i = Y_i - f(X_i)\).
set.seed(22); x = 1:5; alpha = 2; beta = 1
epsilon = rnorm(5)
y = beta*x + alpha + epsilon
par(mar = c(5,5,1,1))
plot(x,y, xlab = 'x', ylab = 'y', ylim = c(0,10), pch = 19)
The OLS (ordinary least squares) solution:
model = lm(y ~ x, data = data.frame(x = x, y = y))
model
##
## Call:
## lm(formula = y ~ x, data = data.frame(x = x, y = y))
##
## Coefficients:
## (Intercept) x
## 3.0887 0.8414
Notice how, in summary(model)$coefficients, the entry in the first row, first column is the intercept (\(\alpha\)) estimate and the entry in the second row, first column is the slope (\(\beta\)) estimate
alpha0 = summary(model)$coefficients[1,1]
beta0 = summary(model)$coefficients[2,1]
par(mar = c(5,5,1,1))
plot(x,y, xlab = 'x', ylab = 'y', ylim = c(0,10), pch = 19)
abline(alpha0, beta0)
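Once fitted, the model can also be used for prediction at new x values; a minimal sketch:
predict(model, newdata = data.frame(x = 6)) #predicted y at x = 6, i.e. alpha0 + beta0 * 6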
We will use the NASDAQ API
library(httr)
url = 'https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json?start_date=1985-05-01&end_date=1997-07-01&order=asc&column_index=4&collapse=quarterly&transformation=rdiff'
data = GET(url)
print(data)
## Response [https://data.nasdaq.com/api/v3/datasets/WIKI/AAPL.json?start_date=1985-05-01&end_date=1997-07-01&order=asc&column_index=4&collapse=quarterly&transformation=rdiff]
## Date: 2024-02-23 17:24
## Status: 200
## Content-Type: application/json; charset=utf-8
## Size: 2.55 kB
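To work with the body of the response, one option (a sketch) is to extract it as text with content() and parse the JSON with rjson, as we did earlier:
library(rjson)
body = content(data, as = "text", encoding = "UTF-8") #raw JSON string
parsed = fromJSON(body)                               #convert to a nested R list
names(parsed)                                         #inspect the top-level names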