Spring 2026

Course Introduction

The Data Pipeline

  • Most data analytics tasks involve a specific set of common steps

  • Some tools are better at some parts of the pipeline than others

  • So this course is organized around the data pipeline

  • We’ll intermingle high-level concepts about the pipeline with specific language features

The Data Pipeline

  1. Data Collection – Obtain structured or unstructured data from various sources, then organize and represent them for analysis

  2. Data Processing – Perform any needed cleaning, summarizing, or aggregation steps on the data, decide what to do about errant / missing data, etc.

  3. Data Storage – Determine how collected and result data will be stored / warehoused. Consider any IP or data restrictions or anonymization that must happen

  4. Analysis – Perform modeling / analysis tasks to describe, summarize, predict, etc.

  5. Visualize – Provide a visual representation of data and/or results
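
As a rough illustration of how the five steps fit together, the following Python sketch pushes a tiny in-memory dataset through each one. The data, the function names, and the text "bar chart" are all made up for illustration; a real project would use real sources, real storage, and a plotting library.

```python
import csv
import io
import statistics

# Hypothetical raw data standing in for a collected CSV file.
RAW_CSV = """site,temperature
A,20.5
A,
B,18.0
B,19.0
"""

def collect(text):
    """Step 1 (collection): read raw records into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def process(rows):
    """Step 2 (processing): drop rows with missing values and convert types."""
    return [
        {"site": r["site"], "temperature": float(r["temperature"])}
        for r in rows
        if r["temperature"]
    ]

def store(rows):
    """Step 3 (storage): kept in memory here; could be a file or database."""
    return {"clean_rows": rows}

def analyze(storage):
    """Step 4 (analysis): mean temperature per site."""
    by_site = {}
    for r in storage["clean_rows"]:
        by_site.setdefault(r["site"], []).append(r["temperature"])
    return {site: statistics.mean(vals) for site, vals in by_site.items()}

def visualize(summary):
    """Step 5 (visualization): a crude text bar chart, one '#' per degree."""
    return [
        f"{site}: {'#' * round(mean)} {mean:.1f}"
        for site, mean in sorted(summary.items())
    ]

summary = analyze(store(process(collect(RAW_CSV))))
chart = visualize(summary)
print("\n".join(chart))
```

Each step is a separate function so that one stage can be swapped out (for example, a different storage choice) without touching the others.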

How Course Content Will Be Organized

  • One or two weeks have been set aside for each step of the data pipeline

  • Each week, we’ll discuss that step at a high level

  • But we’ll focus mainly on coding concepts related to it

Languages

We will explore the following languages this semester:

  1. R – Open source language for statistical analysis and visualization

  2. Python – Open source language for numerical calculations and modeling

  3. Julia – A highly performant, parallelized language that supports integrated processing of data analytics

We’ll overlap all three throughout the semester

Homework

  • For most weeks, there will be a small homework assignment related to the programming topics

  • The purpose of these assignments is for you to learn a little bit about the language

  • They aren’t intended to be difficult

  • They aren’t worth much

  • You can resubmit as often as you like until they are correct

  • So there’s no advantage at all to using external resources to do them for you

Exams

  • There will be a midterm exam covering the first half of the semester

  • There will be a final exam covering the course as a whole

  • The homework is intended as formative practice for these

Project

  • The culminating assessment in this class is a project

  • That project will be some data analytics task you decide that involves the data pipeline

  • You will propose the project mid-semester via document to get feedback about feasibility / appropriateness

  • At end of term, you will submit all source code for it

  • You will also give a small presentation discussing how you handled each step of the data pipeline

Introduction to R

Why Use R?

  • R is open source and freely available
  • Works on all standard platforms (Mac, Windows, Linux)
  • Extremely useful for statistical modeling and testing
  • Standard tool for many sciences (e.g., Computer Science, Physics)
  • Together with the ggplot2 package, it can produce very nice data visualizations
  • R is better for some parts of the data analysis pipeline, Python is better for others
  • It is interpreted, so we can run interactively and use it like a workbook

Install the Latest R

Install the Latest R Studio

The R-Studio Console

  • One can use the Console in R Studio to enter commands, functions, and data to perform statistical processes
  • The Console is located in the bottom left of the R Studio screen
  • When you see a grey box with code in it, you can copy and paste this into the Console and match the results to what the slides say under the grey box
  • For example:
print("Statistics Rocks (roughly speaking)!")
## [1] "Statistics Rocks (roughly speaking)!"
  • Later we’ll talk about using scripts to do more complicated things

Variables & Assignment

  • R stores values in variables
  • Like Python, R is dynamically bound, meaning the type of a variable is determined at runtime when values are assigned (i.e., not declared)
  • There are three different assignment operators in R:
    • Right-to-left: Using the <- operator (this is the traditional operator)
    • Left-to-right: Using the -> operator
    • Compatibility: Using the = operator
x <- 3 * 2
6+7*2 -> y
boring_var = 32

Math Operations

  • The standard arithmetic operators more or less work as expected:
  • +, -, *, /, (, )
  • R also has an exponentiation operator: ^
x <- 2^10 + (3*5 + 1)/2
print(x)
## [1] 1032

What is an R Source File?

  • You can type commands into the console

  • But for anything even moderately sophisticated, you’ll probably want to create a source file

  • An R source file is a plain-text file with R commands in it that can be run any time

  • There are many advantages, not the least of which is being able to give the file to someone else to run

  • Also, you’ll need to turn in a source file for homework

Some Basic Data Types

  • numeric (numbers)
  • character (strings)
  • vectors (lists of one type)
  • factors (categorical variables)
  • lists (lists of arbitrary types)
  • matrices (numeric 1D and 2D arrays)

Numeric Data

  • scalar numbers: integers and real values
  • arithmetic operations on these
x = 3
y = 4.7
x/y + 2*x - 3
## [1] 3.638298
y > x
## [1] TRUE

Character Data

  • strings
  • operations on strings
x = "hello"
nchar(x)
## [1] 5
gsub("he","HE-",x)
## [1] "HE-llo"

Concatenating Strings with paste()

x = "hello"
y = "world"
paste(x,y)
## [1] "hello world"
paste(x,y,"turtle",sep=':')
## [1] "hello:world:turtle"

Vectors

  • R vectors are basically flexible arrays
  • R vectors aren’t just numeric
  • However, all elements of a vector must be the same type
  • Creating a vector with a mixture of numeric and character values will force R to coerce the vector to character
c(1,2,9)
## [1] 1 2 9
c(1,2,"9")
## [1] "1" "2" "9"

Vector Ops: Scaling, Indexing, and Length

x = c(-4,2,31)
3*x
## [1] -12   6  93
x[1]
## [1] -4
length(x)
## [1] 3

Vector Ops: Element-by-Element Ops

x = c(-4,2,3)
y = c(1,3,9)
z = 4*(1:3)
x*y
## [1] -4  6 27
x*y - z
## [1] -8 -2 15

Vector Ops: Named Elements

x = c(-4,2,3)
names(x) <- c("Bob","Frank","Mindy")
x["Frank"]
## Frank 
##     2
x["Mindy"] = 0
print(x)
##   Bob Frank Mindy 
##    -4     2     0

Vector Ops: Pasting with Vectors

x = 1:5
paste("String",x,sep='-')
## [1] "String-1" "String-2" "String-3" "String-4" "String-5"

Vector Ops: Appending to a Vector

x = seq(from=2,to=10,by=2)
print(x)
## [1]  2  4  6  8 10
x = c(x,-99)
x
## [1]   2   4   6   8  10 -99
x[7] = -98
x
## [1]   2   4   6   8  10 -99 -98

Factors

  • A factor is like a vector, except for categorical values
  • They can be created from vectors
x = c("good","good","bad","mediocre","good")
factor(x)
## [1] good     good     bad      mediocre good    
## Levels: bad good mediocre
factor(c(1,2,1,1,2,2,3,3,2,2))
##  [1] 1 2 1 1 2 2 3 3 2 2
## Levels: 1 2 3

Lists Are Like Mixed-Type Vectors

x = list()
x[[1]] = 1
x[[2]] = "hello"
x[[3]] = c(-1,-2,-9)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] -1 -2 -9

List Operations: Named Elements

x = list(3,1,4,5)
names(x) <- c("A","B","C","D")
x[["A"]]
## [1] 3
x$A
## [1] 3

Conditionals

x <- 4
if (x > 2) {
  print("x is bigger than 2")
} else if (x == 2) {
  print("x is exactly 2")
} else {
  print("x is less than 2")
}
## [1] "x is bigger than 2"

Simple for Loops

for (x in 0:10) {
  print(x)
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Defining a Function

myfunction <- function(arg) {
  argsq <- arg^2
  return (argsq)
}

myfunction(3)
## [1] 9

Loading External Libraries

  • There are many external libraries available to install for R
  • Common ones that we’ll use include: dplyr, reshape2, ggplot2, RColorBrewer
  • We load these using the library function:
library(reshape2)
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.7.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.5
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.39     R6_2.6.1          fastmap_1.2.0     xfun_0.55        
##  [5] magrittr_2.0.4    glue_1.8.0        cachem_1.1.0      stringr_1.6.0    
##  [9] knitr_1.51        htmltools_0.5.9   rmarkdown_2.30    lifecycle_1.0.4  
## [13] cli_3.6.5         sass_0.4.10       jquerylib_0.1.4   compiler_4.5.2   
## [17] plyr_1.8.9        rstudioapi_0.17.1 tools_4.5.2       evaluate_1.0.5   
## [21] bslib_0.9.0       Rcpp_1.1.0        yaml_2.3.12       rlang_1.1.6      
## [25] jsonlite_2.0.0    stringi_1.8.7

Creating a Source File

  • You can use any text editor, but we’ll use R Studio

  • Go to File \(\rightarrow\) New File \(\rightarrow\) R Script

  • Type commands into the file in the order that you would like them to be executed

    • Don’t forget to include any library() loads
    • The user running this file next may not have that library loaded
  • Save the file when you are done

    • File \(\rightarrow\) Save As, then give the file a location and name that you like
    • Or just File \(\rightarrow\) Save if it already has a location and name
    • It is customary to give the file the extension .R or .r

Sourcing the File

To run the file in R Studio, either

  • Press the Source button in the top right of the editor panel

  • Or, in the console, type:

source('path/to/my/script.R')

Make Sure Your Script Runs in a Clean Environment

It is always a good idea to make sure your script will run with your environment completely clean

  1. From the top-right panel, select the Environment tab

  2. Select the little broom icon to clean the environment, then confirm with Yes on the dialog

  3. From the Session menu option, select Restart R and Clean Output

  4. Source your file

Learn More!

We’ll be learning a lot more about R as the semester progresses, including a number of lab assignments. But there are also many online materials, including:

Introduction to Python

Why Use Python?

  • Python is open source and freely available
  • Works on all standard platforms (Mac, Windows, Linux)
  • Extremely useful for data manipulation and machine learning
  • Standard tool for many sciences (e.g., Computer Science, Physics)
  • Python is better for some parts of the data analysis pipeline, R is better for others
  • It is interpreted, so we can run interactively and use it like a workbook

Install A Stable Version of Python

  • The latest version of Python as of this presentation is 3.14.2
  • However, I encourage you to install a version between 3.10 and 3.12
  • There are many ways to install Python, so consult the Documentation
  • Avoid using Anaconda unless you already know about it
  • The most common cause for library/package problems later is having multiple distributions of Python installed via different methods, so pick a method and use that
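
If you are unsure which Python installation you are actually running, a small check like the one below can help. It is only a sketch: the variable names are illustrative, and the "recommended range" simply encodes the 3.10 to 3.12 suggestion above.

```python
import sys

# Path of the interpreter that is actually running this code.
interpreter_path = sys.executable

# Version as a (major, minor) pair, e.g. (3, 12).
version = (sys.version_info.major, sys.version_info.minor)

# The suggestion above is a version between 3.10 and 3.12.
in_recommended_range = (3, 10) <= version <= (3, 12)

print("Interpreter:", interpreter_path)
print("Version:", ".".join(str(part) for part in version))
print("In recommended range:", in_recommended_range)
```

If the interpreter path printed here differs between your terminal and your IDE, you likely have more than one distribution installed.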

Python IDEs

  • One common IDE for Python is PyCharm
  • Also, a lot of people just use a general IDE like VSCode
  • Choose what works best for you
  • Though it is helpful if the IDE you choose has git integration

The Python Console

  • You can also run Python interactively
  • If it is installed and in path, just type python or python3 at the command line
  • You’ll be given a prompt that looks like this: >>>
  • You can run commands interactively from that prompt
print("Python is easy!")
## Python is easy!
  • Later we’ll talk about using scripts to do more complicated things

Variables & Assignment

  • Python also stores values in variables
  • Like R, Python is dynamically bound, meaning the type of a variable is determined at runtime when values are assigned (i.e., not declared)
  • Assignment in Python uses the = operator
my_var = 32
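
To see what "determined at runtime" means in practice, here is a minimal sketch that rebinds the same name to values of different types (the names first_type and second_type are just for illustration):

```python
my_var = 32
first_type = type(my_var).__name__    # "int"

my_var = "thirty-two"                 # rebinding to a different type is allowed
second_type = type(my_var).__name__   # "str"

print(first_type, second_type)
```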

Math Operations

  • The standard arithmetic operators more or less work as expected:
  • +, -, *, /, (, )
  • Python also has an exponentiation operator: **
x = 2**10 + (3*5 + 1)/2
print(x)
## 1032.0
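
The result above is 1032.0 rather than 1032 because / produces a float in Python 3. A short sketch of the related operators (variable names are illustrative):

```python
true_div = (3*5 + 1) / 2     # 8.0  (/ always gives a float)
floor_div = (3*5 + 1) // 2   # 8    (// keeps ints as ints)
remainder = 17 % 5           # 2    (modulo)

x_float = 2**10 + true_div   # 1032.0
x_int = 2**10 + floor_div    # 1032

print(x_float, x_int, remainder)
```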

What is a Python Source File?

  • Like R, you can type commands into the console

  • But for anything even moderately sophisticated, you’ll probably want to create a source file

  • A Python source file is a plain-text file with Python commands in it that can be run any time

  • There are many advantages, not the least of which is being able to give the file to someone else to run

  • Also, you’ll need to turn in a source file for homework
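
A minimal sketch of what such a source file might look like. The file name my_script.py and the function main are assumptions for illustration, not a required layout for this course.

```python
# my_script.py  (hypothetical file name)

def main():
    """Put the steps of the script here, in the order they should run."""
    total = sum(range(1, 11))
    print("Sum of 1..10:", total)
    return total

# This guard runs main() only when the file is executed directly
# (e.g. `python my_script.py`), not when it is imported by another file.
if __name__ == "__main__":
    main()
```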

Some Basic Data Types

  • floating point numbers
  • integer numbers
  • strings
  • Booleans
  • lists
  • dictionaries
  • tuples

Numeric Data

  • int are integers
  • float are floating point numbers
  • In mixed arithmetic, Python 3 coerces int to float, and / always returns a float
x = 3
y = 4.7
x/y + 2*x - 3
## 3.6382978723404253
y > x
## True
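
A few more cases that may help make the int / float rules concrete (a sketch; the variable names are arbitrary):

```python
a = 3 + 4        # int + int stays int
b = 3 + 4.0      # int + float becomes float
d = int(4.7)     # explicit conversion truncates toward zero
e = round(4.7)   # rounds to the nearest integer

print(type(a).__name__, type(b).__name__, d, e)
```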

String Data

  • You may use double quotes or single quotes, but they must match
  • Python has many convenient string operations
x = "   Hello There "
len(x)
## 15
print( x.lower(), x.find('lo') )
##    hello there  6
x.split()
## ['Hello', 'There']
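
A few other string methods that tend to come up when cleaning data, shown on the same string (a sketch, not an exhaustive list):

```python
x = "   Hello There "

stripped = x.strip()                            # remove leading/trailing whitespace
upper = stripped.upper()                        # "HELLO THERE"
replaced = stripped.replace("There", "World")   # "Hello World"
starts = stripped.startswith("Hello")           # True
words = stripped.split(" ")                     # split on a specific separator

print(stripped, upper, replaced, starts, words)
```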

Concatenating Strings with Addition

x = "hello"
y = "world"
print(x + " " + y)
## hello world
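
Besides +, two other common ways to build strings are f-strings and join. This sketch mirrors the R paste() examples from earlier; note that + only works between strings, so numbers need str() first.

```python
x = "hello"
y = "world"

with_fstring = f"{x} {y}"                 # values are substituted inside {}
with_join = ":".join([x, y, "turtle"])    # similar to paste(x, y, "turtle", sep=":") in R

count = 3
labelled = "count=" + str(count)          # convert non-strings before using +

print(with_fstring, with_join, labelled)
```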

Lists

  • Python lists can be dynamically expanded or reduced
  • They can contain different types of data
  • Are zero-indexed
x = [12, -3, "no", True, ['a', 'b', 'c']]
x[0]
## 12
x[-1]
## ['a', 'b', 'c']
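
To illustrate "dynamically expanded or reduced", here is a small sketch using some standard list methods:

```python
x = [12, -3, "no"]

x.append(True)        # add to the end:      [12, -3, "no", True]
x.insert(1, "new")    # insert at index 1:   [12, "new", -3, "no", True]
last = x.pop()        # remove and return the last item (True)
x.remove(-3)          # remove the first occurrence of -3
first_two = x[:2]     # slicing returns a new list

print(x, last, first_two)
```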

Tuples

  • Tuples are like lists, but they are immutable
  • So you can build them, but you can’t change them once they are built
x = ('foo', 'bar', 'baz')
x[1]
## 'bar'
len(x)
## 3
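
A quick sketch of what "immutable" means here: assigning to an element fails, but you can still build a new tuple or unpack an existing one.

```python
x = ('foo', 'bar', 'baz')

try:
    x[1] = 'qux'          # tuples do not support item assignment
    changed = True
except TypeError:
    changed = False

y = x + ('qux',)          # builds a NEW tuple; x itself is untouched
a, b, c = x               # unpacking into separate variables

print(changed, y, b)
```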

Dictionaries

  • Dictionaries allow you to store things using a key (a hash table)
  • Dictionary keys must be immutable (numbers, strings, tuples, etc.)
x = {"one":100, "two":"oranges", "three":True, (1,2):"turtle"}
print(x[ (1,2) ])
## turtle
x["Paul"] = "Cool"
print( x["Paul"] )
## Cool
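
The sketch below shows what happens with a mutable key (a list), plus a few everyday dictionary operations:

```python
x = {"one": 100, (1, 2): "turtle"}

try:
    x[[1, 2]] = "list key"     # lists are mutable, so they cannot be keys
    list_key_ok = True
except TypeError:
    list_key_ok = False

missing = x.get("absent", "default value")   # avoids a KeyError for missing keys
has_one = "one" in x                          # membership test on keys
del x["one"]                                  # remove an entry

print(list_key_ok, missing, has_one, x)
```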

Conditionals:

x = 4
if x>2:
  print("x is bigger than 2")
elif x==2:
  print("x is exactly 2")
else:
  print("x is less than 2")
## x is bigger than 2

Simple for Loops

for idx in range(10):
  print("Index: ", idx)
## Index:  0
## Index:  1
## Index:  2
## Index:  3
## Index:  4
## Index:  5
## Index:  6
## Index:  7
## Index:  8
## Index:  9
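
range also accepts a start, stop, and step, and enumerate gives you an index alongside each item. A brief sketch (the names are illustrative):

```python
evens = list(range(2, 11, 2))        # start at 2, stop before 11, step by 2
countdown = list(range(3, 0, -1))    # a negative step counts down

pairs = []
for idx, name in enumerate(["Bob", "Frank", "Mindy"]):
    pairs.append((idx, name))        # idx starts at 0

print(evens, countdown, pairs)
```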

Looping Through Lists

x = [12, -3, "no", True, ['a', 'b', 'c']]
for item in x:
  print(item)
## 12
## -3
## no
## True
## ['a', 'b', 'c']

Looping Through Dictionaries

x = {"one":100, "two":"oranges", "three":True, (1,2):"turtle"}
for k in x:
  print(k, "::: ", x[k])
## one :::  100
## two :::  oranges
## three :::  True
## (1, 2) :::  turtle
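
An alternative to looking up x[k] inside the loop is to iterate over key/value pairs directly with items(). A small sketch:

```python
x = {"one": 100, "two": "oranges", "three": True}

lines = []
for key, value in x.items():          # each item is a (key, value) pair
    lines.append(f"{key} ::: {value}")

keys = list(x.keys())
values = list(x.values())

print(lines)
```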

List Comprehensions

  • Python provides short-hand ways to build lists
cubes = [z**3  for z in range(10)]
print(cubes)
## [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
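
A comprehension is shorthand for a loop that appends to a list. It can also include a filter, and the same syntax works for dictionaries. A sketch:

```python
# Equivalent explicit loop for the cubes example
cubes_loop = []
for z in range(10):
    cubes_loop.append(z**3)

# Comprehension with a condition: cubes of even numbers only
even_cubes = [z**3 for z in range(10) if z % 2 == 0]

# Dictionary comprehension: map each number to its cube
cube_of = {z: z**3 for z in range(4)}

print(even_cubes, cube_of)
```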

Defining a Function

def myfunction(arg):
  argsq = arg**2
  return argsq
  
myfunction(3)
## 9
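
Functions can also take default arguments and return more than one value. The functions power and min_and_max below are hypothetical examples, not part of any library.

```python
def power(base, exponent=2):
    """Raise base to exponent; exponent defaults to 2 (squaring)."""
    return base ** exponent

def min_and_max(values):
    """Return two values at once as a tuple."""
    return min(values), max(values)

squared = power(3)                  # uses the default exponent
big = power(2, exponent=10)         # keyword argument overrides the default
low, high = min_and_max([4, -1, 7])

print(squared, big, low, high)
```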

Loading External Libraries

  • There are many external libraries available to install for Python
  • Common ones that we’ll use include: numpy, scipy, sklearn, pandas, tensorflow
  • We load these using the import statement:
import numpy as np
print(np.__version__)
## 1.24.4
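
The import statement has a few common forms. This sketch uses only modules from Python's standard library so it runs without installing anything; the same forms apply to numpy, pandas, and the others.

```python
import math                       # import a whole module
import statistics as stats        # import with a shorter alias
from collections import Counter   # import one name from a module

root = math.sqrt(16)
avg = stats.mean([1, 2, 3, 4])
counts = Counter(["good", "good", "bad", "mediocre", "good"])

print(root, avg, counts["good"])
```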

Learn More

Introduction to Julia

What is Julia?

  • A fourth generation language designed from the ground up with scientists in mind
    • Highly performant, capable of parallelism by default
    • Supports both dynamic and static (explicitly declared) type binding
    • Easy to read and code in, like R and Python
  • Advantages:
    • Very fast
    • Can call C natively, and R and Python through packages such as RCall and PyCall
    • Can create compact, efficient user-defined types like C
    • Still get the interactive, console experience from R and Python

Install the Latest Julia

The latest version of Julia as of this presentation is 1.11.4

  1. Go to https://julialang.org/
  2. Select the Download tab
  3. Then download the installer that matches your platform
  4. Follow the download and install prompts

The Julia Console

  • You can also run Julia interactively
  • If it is installed and in path, just type julia at the command line
  • You’ll be given a prompt that looks like this: julia>
  • You can run commands interactively from that prompt
print("Julia is easy!")
  • Later we’ll talk about using scripts to do more complicated things

Variables & Assignment

  • Julia also stores values in variables
  • Like R and Python, Julia is dynamically bound, meaning the type of a variable is determined at runtime when values are assigned (i.e., not declared)
  • Assignment in Julia uses the = operator
my_var = 32

Math Operations

What is a Julia Source File?

  • Like R and Python, you can type commands into the console

  • But for anything even moderately sophisticated, you’ll probably want to create a source file

  • A Julia source file is a plain-text file with Julia commands in it that can be run any time

  • There are many advantages, not the least of which is being able to give the file to someone else to run

  • Also, you’ll need to turn in a source file for homework

  • Julia source files typically end in .jl

Some Basic Data Types

  • floating point numbers
  • integer numbers
  • strings
  • Booleans
  • lists

Static vs. Dynamic Type Binding

  • Julia can dynamically bind type, like R and Python
  • But you can also explicitly specify the type
  • And you can define your own types
myboundvar::Int = 11
myboundvar = 3.5   # Will produce error

struct myt
  a::Float64
  b::String
end

list = [myt(2,"toad"), "Foo", myt(-3.2, "Purple")]
list[1].b   # Julia is 1-indexed, not 0-indexed
list[3].a

Type Assumptions

  • Python, R, and Julia can dynamically bind a variable to a type on use
  • In Python, a variable you assign is assumed to belong to the most specific (local) scope
  • But Julia doesn’t make that assumption
  • So if you have two variables with the same names at different scope levels, you have to tell Julia
local x = 2
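
For comparison, here is a sketch of the Python behavior described above: assigning to a name inside a function creates a new local variable and leaves the outer one alone. The function name assign_inside is hypothetical.

```python
x = 10                      # variable at the outer scope

def assign_inside():
    x = 2                   # Python assumes this is a NEW local x
    return x

local_result = assign_inside()
after_call = x              # the outer x was never touched

print(local_result, after_call)
```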

Numeric Data

  • Int are integers
  • Float are floating point numbers
x = 3
y = 4.7
x/y + 2*x - 3
  • But there are different versions of these:
    • Int8, UInt8, …, Int128, UInt128
    • Float16, Float32, Float64

Julia Int & Float Types

String Data

  • Julia strings use double quotes; single quotes are for single characters (Char)
  • Julia has many convenient string operations
z = "   Hello There "
length(z)
print("My string was: '$z'")
print( lowercase(z) )
split("This; is; a test of; split")
println("This puts a newline at the end")

Concatenating Strings

x = "hello"
z = "world"
print("$x  $z")

Lists

  • Julia lists can be dynamically expanded or reduced
  • They can contain different types of data
  • Are one-indexed
x = [12, -3, "no", true, ['a', 'b', 'c']]
x[1]
x[end-1]

Tuples

  • Like Python, Julia has tuples
myVar = (1, 2, "Hello")
println(myVar)

Immutable Structures

  • Julia allows you to create types with named fields (i.e., structures)
  • By default, these structures are immutable, so they behave a little like named tuples
struct FooType1
  a::Int64
  b::Float64
end
foo = FooType1(2, -6.3)
foo.a
foo.a = 3  # This will give an error

Mutable Structures

  • But you can tell Julia the structure is mutable
mutable struct FooType2
  a::Int64
  b::Float64
end
foo = FooType2(2, -6.3)
foo.a
foo.a = 3  # This will not give an error

Dictionaries

  • Julia has dictionaries, just like Python
  • But Julia allows them to be typed, if you like
Dict2 = Dict("a" => 1, "b" => 2, "c" => 3) 
println("\nUntyped Dictionary = ", Dict2) 
  
Dict3 = Dict{String, Integer}("a" => 64, "c" => 20) 
println("\nTyped Dictionary = ", Dict3) 

Conditionals:

x = 3
y = 4.7
if x < y
    println("x is less than y")
elseif x > y
    println("x is greater than y")
else
    println("x is equal to y")
end

Simple for Loops

for i in 1:10
    println("Index: $i")
end

Looping Through Dictionaries

mydictionary = Dict("a" => 1, "b" => 2, "c" => 3) 
for item in mydictionary
    println("Item: $item")
end

List Comprehensions

  • Julia provides short-hand ways to build lists
cubes = [z^3  for z in 1:10]
println(cubes)

Defining a Function

# Typical Way
function myfunction(arg)
  argsq = arg^2
  return argsq
end

# Compact Way
otherfunction(x, y) = 2*x + y

# Calling functions
myfunction(3)
otherfunction(-2, 3)

Loading External Packages

  • There are many external packages available to install for Julia
  • Common ones that we’ll use include: DataFrames, Plots, SciML, JuliaStats
  • We load packages with using and install with Pkg.add
using Pkg
Pkg.add("Plots")
using Plots
x = 0:.1:10
z = sin.(x)
plot(x,z)

Learn More

Assignments in This Class

Source Repositories

  • We’ll be using source repositories for our assignments in this class
  • So you’ll need an account on GitHub: https://github.com/
  • I’ll provide a separate PDF and video about these basics this week

How Homeworks Will Be Assigned

  • You’ll be given a link to GitHub Classroom in an assignment
  • When you click that link and accept the invitation, a new Git Repository is created for you on GitHub
  • You may clone that repo locally to your computer or to Hopper
  • Then you can work on it as you usually would
  • Make sure to commit your work regularly to your Repo

How To Submit Your Assignments

  • Make sure your repo is pushed to GitHub
  • If you can’t see it on the GitHub web page, I won’t be able to see it – so check there before submitting
  • On BlackBoard, just tell me that you are done and ready for me to grade
  • I can access your GitHub repo for the assignment myself
  • You don’t need to send me code via email or attach anything in BlackBoard