Spring 2025

Course Introduction

The Data Pipeline

  • Most data analytics tasks involve a specific set of common steps

  • Some tools are better at some parts of the pipeline than others

  • So this course is organized around the data pipeline

  • We’ll intermingle high-level concepts about each pipeline step with specific language features

The Data Pipeline

  1. Data Collection – Obtain structured or unstructured data from various sources, then organize and represent them for analysis

  2. Data Processing – Perform any cleaning, summarizing, or aggregation steps that are needed on the data; decide what to do about errant / missing data, etc.

  3. Data Storage – Determine how collected and result data will be stored / warehoused. Consider any IP or data restrictions or anonymization that must happen

  4. Analysis – Perform modeling / analysis tasks to describe, summarize, predict, etc.

  5. Visualize – Provide a visual representation of data and/or results
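
A minimal sketch of the five steps above, using only Python's standard library. The records, field names, and the text bar chart are made up for illustration; a real project would use real data sources and proper plotting tools.

```python
import csv
import io
import statistics

# 1. Data Collection: hypothetical raw records (in practice: files, APIs, databases)
raw_records = [
    {"city": "Austin", "temp_f": "91"},
    {"city": "Boston", "temp_f": ""},      # missing value
    {"city": "Chicago", "temp_f": "78"},
    {"city": "Denver", "temp_f": "85"},
]

# 2. Data Processing: drop records with missing values and convert types
clean_records = [
    {"city": r["city"], "temp_f": float(r["temp_f"])}
    for r in raw_records
    if r["temp_f"] != ""
]

# 3. Data Storage: write the cleaned data as CSV
#    (an in-memory buffer stands in for a file on disk)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "temp_f"])
writer.writeheader()
writer.writerows(clean_records)
stored_csv = buffer.getvalue()

# 4. Analysis: compute a simple summary statistic
mean_temp = statistics.mean(r["temp_f"] for r in clean_records)

# 5. Visualize: a crude text bar chart, one '#' per 10 degrees
for r in clean_records:
    print(f'{r["city"]:<8} {"#" * int(r["temp_f"] // 10)}')
print("Mean temperature:", round(mean_temp, 1))
```

Each step here is only a few lines, but the order is the point: later weeks replace each stage with more capable tools.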

How Course Content Will Be Organized

  • One or two weeks have been set aside for each step of the data pipeline

  • Each week, we’ll discuss that step at a high level

  • But we’ll focus mainly on coding concepts related to it

Languages

We will explore the following languages this semester:

  1. R – Open source language for statistical analysis and visualization

  2. Python – Open source language for numerical calculations and modeling

  3. Julia – A highly performant, parallelized language that supports integrated processing of data analytics

We’ll start with R and Python, and we’ll touch on Julia near the end.

Homework

  • For most weeks, there will be a small homework assignment related to the programming topics

  • The purpose of these assignments is for you to learn a little bit about the language

  • They aren’t intended to be difficult, so do them yourself

Project

  • The culminating assessment in this class is a project

  • That project will be a data analytics task of your choosing that involves the data pipeline

  • You will propose the project mid-semester via a presentation to get feedback about feasibility / appropriateness

  • You will submit all source code for it

  • You will also give a small presentation discussing how you handled each step of the data pipeline

Introduction to R

Why Use R?

  • R is open source and freely available
  • Works on all standard platforms (Mac, Windows, Linux)
  • Extremely useful for statistical modeling and testing
  • Standard tool for many sciences (e.g., Computer Science, Physics)
  • Together with the ggplot2 package, it can produce very nice data visualizations
  • R is better for some parts of the data analysis pipeline; Python is better for others
  • It is interpreted, so we can run interactively and use it like a workbook

Install the Latest R

Install the Latest R Studio

The R Studio Console

  • One can use the Console in R Studio to enter commands, functions, and data to perform statistical processes
  • The Console is located in the bottom left of the R Studio screen
  • When you see a grey box with code in it, you can copy and paste this into the Console and match the results to what the slides say under the grey box
  • For example:
print("Statistics Rocks (roughly speaking)!")
## [1] "Statistics Rocks (roughly speaking)!"
  • Later we’ll talk about using scripts to do more complicated things

Variables & Assignment

  • R stores values in variables
  • Like Python, R is dynamically typed, meaning the type of a variable is determined at runtime by the value assigned to it (i.e., types are not declared)
  • There are three different assignment operators in R:
    • Right-to-left: Using the <- operator (this is the traditional operator)
    • Left-to-right: Using the -> operator
    • Compatibility: Using the = operator
x <- 3 * 2
6+7*2 -> y
boring_var = 32

Math Operations

  • The standard arithmetic operators more or less work as expected:
  • +, -, *, /, (, )
  • R also has an exponentiation operator: ^
x <- 2^10 + (3*5 + 1)/2
print(x)
## [1] 1032

What is an R Source File?

  • You can type commands into the console

  • But for anything even moderately sophisticated, you’ll probably want to create a source file

  • An R source file is a plain-text file containing R commands that can be run at any time

  • There are many advantages, not the least of which is being able to give the file to someone else to run

  • Also, you’ll need to turn in a source file for homework

Some Basic Data Types

  • numeric (numbers)
  • character (strings)
  • vectors (lists of one type)
  • factors (categorical variables)
  • lists (lists of arbitrary types)
  • matrices (2D arrays of one type)

Numeric Data

  • scalar numbers: integers and real values
  • arithmetic operations on these
x = 3
y = 4.7
x/y + 2*x - 3
## [1] 3.638298
y > x
## [1] TRUE

Character Data

  • strings
  • operations on strings
x = "hello"
nchar(x)
## [1] 5
gsub("he","HE-",x)
## [1] "HE-llo"

Concatenating Strings with paste()

x = "hello"
y = "world"
paste(x,y)
## [1] "hello world"
paste(x,y,"turtle",sep=':')
## [1] "hello:world:turtle"

Vectors

  • R vectors are basically flexible arrays
  • R vectors aren’t just numeric
  • However, all elements of a vector must be the same type
  • Creating a vector with a mixture of numeric and character values will force R to coerce the vector to character
c(1,2,9)
## [1] 1 2 9
c(1,2,"9")
## [1] "1" "2" "9"

Vector Ops: Scaling, Indexing, and Length

x = c(-4,2,31)
3*x
## [1] -12   6  93
x[1]
## [1] -4
length(x)
## [1] 3

Vector Ops: Element-by-Element Ops

x = c(-4,2,3)
y = c(1,3,9)
z = 4*(1:3)
x*y
## [1] -4  6 27
x*y - z
## [1] -8 -2 15

Vector Ops: Named Elements

x = c(-4,2,3)
names(x) <- c("Bob","Frank","Mindy")
x["Frank"]
## Frank 
##     2
x["Mindy"] = 0
print(x)
##   Bob Frank Mindy 
##    -4     2     0

Vector Ops: Pasting with Vectors

x = 1:5
paste("String",x,sep='-')
## [1] "String-1" "String-2" "String-3" "String-4" "String-5"

Vector Ops: Appending to a Vector

x = seq(from=2,to=10,by=2)
print(x)
## [1]  2  4  6  8 10
x = c(x,-99)
x
## [1]   2   4   6   8  10 -99
x[7] = -98
x
## [1]   2   4   6   8  10 -99 -98

Factors

  • A factor is like a vector, except for categorical values
  • They can be created from vectors
x = c("good","good","bad","mediocre","good")
factor(x)
## [1] good     good     bad      mediocre good    
## Levels: bad good mediocre
factor(c(1,2,1,1,2,2,3,3,2,2))
##  [1] 1 2 1 1 2 2 3 3 2 2
## Levels: 1 2 3

Lists Are Like Mixed-Type Vectors

x = list()
x[[1]] = 1
x[[2]] = "hello"
x[[3]] = c(-1,-2,-9)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] -1 -2 -9

List Operations: Named Elements

x = list(3,1,4,5)
names(x) <- c("A","B","C","D")
x[["A"]]
## [1] 3
x$A
## [1] 3

Conditionals

x <- 4
if (x > 2) {
  print("x is bigger than 2")
} else if (x == 2) {
  print("x is exactly 2")
} else {
  print("x is less than 2")
}
## [1] "x is bigger than 2"

Simple for Loops

for (x in 0:10) {
  print(x)
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Defining a Function

myfunction <- function(arg) {
  argsq <- arg^2
  return(argsq)
}

myfunction(3)
## [1] 9

Loading External Libraries

  • There are many external libraries available to install for R
  • Common ones that we’ll use include: dplyr, reshape2, ggplot2, RColorBrewer
  • We load these using the library function:
library(reshape2)
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.4
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.49        
##  [5] magrittr_2.0.3    glue_1.7.0        cachem_1.1.0      stringr_1.5.1    
##  [9] knitr_1.49        htmltools_0.5.8.1 rmarkdown_2.29    lifecycle_1.0.4  
## [13] cli_3.6.2         sass_0.4.9        jquerylib_0.1.4   compiler_4.4.2   
## [17] plyr_1.8.9        rstudioapi_0.17.1 tools_4.4.2       evaluate_1.0.1   
## [21] bslib_0.8.0       Rcpp_1.0.12       yaml_2.3.10       rlang_1.1.4      
## [25] jsonlite_1.8.9    stringi_1.8.4

Creating a Source File

  • You can use any text editor, but we’ll use R Studio

  • Go to File → New File → R Script

  • Type commands into the file in the order that you would like them to be executed

    • Don’t forget to include any library() loads
    • The user running this file next may not have that library loaded
  • Save the file when you are done

    • File → Save As, then give the file a location and name that you like
    • Or just File → Save if it already has a location and name
    • It is customary to give the file the extension .R or .r

Sourcing the File

To run the file in R Studio, either

  • Press the Source button in the top right of the editor panel

  • Or, in the console, type:

source('path/to/my/script.R')

Make Sure Your Script Runs in a Clean Environment

It is always a good idea to make sure your script will run with your environment completely clean

  1. From the top-right panel, select the Environment tab

  2. Select the little broom icon to clean the environment, then confirm with Yes on the dialog

  3. From the Session menu option, select Restart R and Clean Output

  4. Source your file

Learn More!

We’ll be learning a lot more about R as the semester progresses, including through a number of lab assignments. There are also many online materials available.

Introduction to Python

Why Use Python?

  • Python is open source and freely available
  • Works on all standard platforms (Mac, Windows, Linux)
  • Extremely useful for data manipulation and machine learning
  • Standard tool for many sciences (e.g., Computer Science, Physics)
  • Python is better for some parts of the data analysis pipeline; R is better for others
  • It is interpreted, so we can run interactively and use it like a workbook

Install A Stable Version of Python

  • The latest version of Python as of this presentation is 3.13.1
  • However, I encourage you to install a version between 3.10 and 3.12
  • There are many ways to install Python, so consult the official documentation
  • Avoid using Anaconda unless you already know about it
  • The most common cause for library/package problems later is having multiple distributions of Python installed via different methods, so pick a method and use that
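
When several Python installations coexist, it helps to ask the interpreter itself which one is running. The sketch below uses the standard `sys` module; the 3.10 to 3.12 check simply mirrors the suggestion above and is not a hard requirement.

```python
import sys

# Full path of the interpreter that is running this code
print(sys.executable)

# Version of that interpreter as a (major, minor) pair
version = sys.version_info[:2]
print(version)

# The slides suggest 3.10 through 3.12; note if we are outside that range
if not ((3, 10) <= version <= (3, 12)):
    print("Note: this interpreter is outside the suggested 3.10-3.12 range")
```

If the path printed is not the installation you expected, your IDE or shell is probably picking up a different distribution.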

Python IDEs

  • One common IDE for Python is PyCharm
  • Also, a lot of people just use a general IDE like VSCode
  • Choose what works best for you
  • Though it is helpful if the IDE you choose has git integration

The Python Console

  • You can also run Python interactively
  • If it is installed and on your PATH, just type python or python3 at the command line
  • You’ll be given a prompt that looks like this: >>>
  • You can run commands interactively from that prompt
print("Python is easy!")
## Python is easy!
  • Later we’ll talk about using scripts to do more complicated things

Variables & Assignment

  • Python also stores values in variables
  • Like R, Python is dynamically typed, meaning the type of a variable is determined at runtime by the value assigned to it (i.e., types are not declared)
  • Assignment in Python uses the = operator
my_var = 32

Math Operations

  • The standard arithmetic operators more or less work as expected:
  • +, -, *, /, (, )
  • Python also has an exponentiation operator: **
x = 2**10 + (3*5 + 1)/2
print(x)
## 1032.0

What is a Python Source File?

  • As in R, you can type commands into the console

  • But for anything even moderately sophisticated, you’ll probably want to create a source file

  • A Python source file is a plain-text file containing Python commands that can be run at any time

  • There are many advantages, not the least of which is being able to give the file to someone else to run

  • Also, you’ll need to turn in a source file for homework
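
As a rough illustration, a small source file might look like the following. The file name `hello.py` and the function `greet` are made up for this example; assuming Python is on your PATH, you would run it from the command line with `python hello.py`.

```python
# hello.py (hypothetical file name)

def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name + "!"

# This block runs only when the file is executed as a script,
# not when it is imported by another file
if __name__ == "__main__":
    print(greet("world"))
```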

Some Basic Data Types

  • floating point numbers
  • integer numbers
  • strings
  • Booleans
  • lists
  • dictionaries
  • tuples

Numeric Data

  • int are integers
  • float are floating point numbers
  • In Python 3, the / operator always returns a float, and arithmetic that mixes int and float produces a float
x = 3
y = 4.7
x/y + 2*x - 3
## 3.6382978723404253
y > x
## True
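
A short sketch of how int and float interact in Python 3; the variable names are arbitrary.

```python
a = 7
b = 2

true_div = a / b     # true division always gives a float
floor_div = a // b   # floor division of two ints stays an int
mixed = a + 1.0      # mixing int and float gives a float

print(true_div, floor_div, mixed)
print(type(true_div), type(floor_div), type(mixed))
```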

String Data

  • You may use double quotes or single quotes, but they must match
  • Python has many convenient string operations
x = "   Hello There "
len(x)
## 15
print( x.lower(), x.find('lo') )
##    hello there  6
x.split()
## ['Hello', 'There']

Concatenating Strings with Addition

x = "hello"
y = "world"
print(x + " " + y)
## hello world
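
Addition is not the only way to combine strings. The sketch below shows two common alternatives; `join` with a separator plays roughly the role of `sep` in R's paste().

```python
x = "hello"
y = "world"

joined = " ".join([x, y])            # join a list of strings with a separator
formatted = f"{x} {y}"               # f-string: embed variables in a template
colon = ":".join([x, y, "turtle"])   # any separator string works

print(joined)
print(formatted)
print(colon)
```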

Lists

  • Python lists can be dynamically expanded or reduced
  • They can contain different types of data
  • They are zero-indexed
x = [12, -3, "no", True, ['a', 'b', 'c']]
x[0]
## 12
x[-1]
## ['a', 'b', 'c']
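
The example above only reads from a list. A minimal sketch of growing and shrinking one with built-in list methods (the values are arbitrary):

```python
x = [12, -3, "no"]

x.append(True)         # add one item to the end
x.insert(0, "first")   # insert an item at the front
removed = x.pop()      # remove and return the last item
x.remove(-3)           # remove the first occurrence of a value

print(x, removed, len(x))
```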

Tuples

  • Tuples are like lists, but they are immutable
  • So you can build them, but you can’t change them once they are built
x = ('foo', 'bar', 'baz')
x[1]
## 'bar'
len(x)
## 3
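
To see immutability in action, the sketch below tries to change an element and catches the resulting error, then builds a new tuple instead. The replacement value 'qux' is arbitrary.

```python
x = ('foo', 'bar', 'baz')

try:
    x[1] = 'qux'   # tuples do not support item assignment
except TypeError as err:
    print("TypeError:", err)

# To "change" a tuple, build a new one from pieces of the old one
y = x[:1] + ('qux',) + x[2:]
print(y)
```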

Dictionaries

  • Dictionaries allow you to store things using a key (a hash table)
  • Dictionary keys must be immutable (numbers, strings, tuples, etc.)
x = {"one":100, "two":"oranges", "three":True, (1,2):"turtle"}
print(x[ (1,2) ])
## turtle
x["Paul"] = "Cool"
print( x["Paul"] )
## Cool
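
A small sketch of what happens with a mutable key, plus a safe way to look up keys that may be absent. The keys and values are made up for illustration.

```python
x = {"one": 100, (1, 2): "turtle"}

try:
    x[[1, 2]] = "list as key"   # a list is mutable, so it cannot be a key
except TypeError as err:
    print("TypeError:", err)

# x["missing"] would raise KeyError; get() returns a default instead
fallback = x.get("missing", "no such key")
print(fallback)
print("one" in x, "missing" in x)
```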

Conditionals

x = 4
if x>2:
  print("x is bigger than 2")
elif x==2:
  print("x is exactly 2")
else:
  print("x is less than 2")
## x is bigger than 2

Simple for Loops

for idx in range(10):
  print("Index: ", idx)
## Index:  0
## Index:  1
## Index:  2
## Index:  3
## Index:  4
## Index:  5
## Index:  6
## Index:  7
## Index:  8
## Index:  9

Looping Through Lists

x = [12, -3, "no", True, ['a', 'b', 'c']]
for item in x:
  print(item)
## 12
## -3
## no
## True
## ['a', 'b', 'c']

Looping Through Dictionaries

x = {"one":100, "two":"oranges", "three":True, (1,2):"turtle"}
for k in x:
  print(k, "::: ", x[k])
## one :::  100
## two :::  oranges
## three :::  True
## (1, 2) :::  turtle
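
An alternative sketch of the same loop using `items()`, which hands back each key together with its value:

```python
x = {"one": 100, "two": "oranges", "three": True}

# items() yields (key, value) pairs, so no second lookup is needed
for key, value in x.items():
    print(key, ":::", value)

# The pairs can also be collected into a list
pairs = list(x.items())
```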

List Comprehensions

  • Python provides short-hand ways to build lists
cubes = [z**3  for z in range(10)]
print(cubes)
## [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
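
A comprehension can also filter as it builds. The sketch below keeps only the cubes of even numbers and shows the equivalent explicit loop for comparison.

```python
even_cubes = [z**3 for z in range(10) if z % 2 == 0]
print(even_cubes)

# The same list built with an explicit loop
loop_version = []
for z in range(10):
    if z % 2 == 0:
        loop_version.append(z**3)
print(loop_version == even_cubes)
```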

Defining a Function

def myfunction(arg):
  argsq = arg**2
  return argsq
  
myfunction(3)
## 9
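
Functions can also give arguments default values. A minimal sketch, where `power` and `exponent` are names invented for this example:

```python
def power(arg, exponent=2):
    """Raise arg to the given exponent (squares by default)."""
    return arg**exponent

squared = power(3)             # uses the default exponent
cubed = power(3, exponent=3)   # overrides the default by keyword
print(squared, cubed)
```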

Loading External Libraries

  • There are many external libraries available to install for Python
  • Common ones that we’ll use include: numpy, scipy, sklearn, pandas, tensorflow
  • We load these using the import statement:
import numpy as np
print(np.__version__)
## 1.24.4
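
The import statement comes in a few forms. The sketch below uses standard-library modules so that it runs without installing anything; the alias `stats` is an arbitrary choice.

```python
import math                  # import the whole module
import statistics as stats   # import under a shorter alias
from math import sqrt        # import a single name

print(math.pi)
print(stats.mean([1, 2, 3, 4]))
print(sqrt(16))
```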

Learn More

Assignments in This Class

Source Repositories

  • We’ll be using source repositories for our assignments in this class
  • So you’ll need an account on GitHub: https://github.com/
  • I’ll provide a separate PDF and video about these basics in Week 2

How Homeworks Will Be Assigned

  • You’ll be given a link to GitHub Classroom in an assignment
  • When you click that link and accept the invitation, a new Git Repository is created for you on GitHub
  • You may clone that repo locally to your computer or to Hopper
  • Then you can work on it as you usually would
  • Make sure to commit your work regularly to your repo

How To Submit Your Assignments

  • Make sure your repo is pushed to GitHub
  • If you can’t see it on the GitHub web page, I won’t be able to see it – so check there before submitting
  • On BlackBoard, just tell me that you are done and ready for me to grade
  • I can access your GitHub repo for the assignment myself
  • You don’t need to send me code via email or attach anything in BlackBoard