Introduction to Scripting for Data Analysis

Spring 2025

Course Introduction

The Data Pipeline

Most data analytics tasks involve a specific set of common steps
Some tools are better at some parts of the pipeline than others
So this course is organized around the data pipeline
We’ll intermingle high-level concepts about the pipeline with specific language features

The Data Pipeline

Data Collection – Obtain structured or unstructured data from various sources, then organize and represent them for analysis
Data Processing – Perform any cleaning, summarizing, or aggregation steps that the data needs, decide what to do about errant / missing data, etc.
Data Storage – Determine how collected and result data will be stored / warehoused. Consider any IP or data restrictions or anonymization that must happen
Analysis – Perform modeling / analysis tasks to describe, summarize, predict, etc.
Visualize – Provide a visual representation of data and/or results
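
As a rough illustration only, the five steps can be mapped onto a few lines of Python. This is a minimal sketch using the standard library; the inline data, the column names, and the choice to "store" to a JSON string are all made up for the example and are not course materials.

```python
import csv
import io
import json
import statistics

# Hypothetical raw data (made up for illustration); a blank score is "missing".
raw = "name,score\nAda,90\nBob,\nCy,70\n"

# 1. Collection: read structured records from a source.
rows = list(csv.DictReader(io.StringIO(raw)))

# 2. Processing: drop records with missing values and convert types.
clean = [{"name": r["name"], "score": float(r["score"])} for r in rows if r["score"]]

# 3. Storage: serialize the cleaned data (here to a JSON string, not a file).
stored = json.dumps(clean)

# 4. Analysis: compute a simple descriptive statistic.
mean_score = statistics.mean(r["score"] for r in json.loads(stored))

# 5. Visualization: a crude text bar chart, one '#' per 10 points.
for r in clean:
    print(f"{r['name']:<4}{'#' * int(r['score'] // 10)}")
print("mean:", mean_score)
```

In a real project each step would usually be larger and might use a different tool, which is why the course looks at the steps one at a time.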

How Course Content Will Be Organized

One or two weeks have been set aside for each step of the data pipeline
Each week, we’ll discuss that step at a high level
But we’ll focus mainly on coding concepts related to it

Languages

We will explore the following languages this semester:

R – Open source language for statistical analysis and visualization
Python – Open source language for numerical calculations and modeling
Julia – A highly performant, parallelized language that supports integrated processing of data analytics

We’ll start with R and Python, and we’ll touch on Julia near the end.

Homework

For most weeks, there will be a small homework assignment related to the programming topics
The purpose of these assignments is for you to learn a little bit about the language
They aren’t intended to be difficult, so do them yourself

Project

The culminating assessment in this class is a project
That project will be some data analytics task you decide that involves the data pipeline
You will propose the project mid-semester via a presentation to get feedback about feasibility / appropriateness
You will submit all source code for it
You will also give a small presentation discussing how you handled each step of the data pipeline

Introduction to R

Why Use R?

R is open source and freely available
Works on all standard platforms (Mac, Windows, Linux)
Extremely useful for statistical modeling and testing
Standard tool for many sciences (e.g., Computer Science, Physics)
Together with the ggplot2 package, it can produce very nice data visualizations
R is better for some parts of the data analysis pipeline, Python is better for others
It is interpreted, so we can run interactively and use it like a workbook

Install the Latest R

The latest version of R as of this presentation is 4.4.2

Go to http://cran.r-project.org/mirrors.html
Pick a mirror site (e.g., http://watson.nci.nih.gov/cran_mirror/)
Select the Download for … link that matches your platform
Follow the download and install prompts

Install the Latest R Studio

The latest version of R Studio as of this presentation is 0.99

Go to http://www.rstudio.com/products/RStudio/#Desk
Select Download RStudio Desktop, then pick your platform
Follow the download and install prompts
Watch their R Studio Overview video

The R Studio Console

One can use the Console in R Studio to enter commands, functions, and data to perform statistical processes
The Console is located in the bottom left of the R Studio screen
When you see a grey box with code in it, you can copy and paste this into the Console and match the results to what the slides say under the grey box
For example:

print("Statistics Rocks (roughly speaking)!")

## [1] "Statistics Rocks (roughly speaking)!"

Later we’ll talk about using scripts to do more complicated things

Variables & Assignment

R stores values in variables
Like Python, R is dynamically typed, meaning the type of a variable is determined at runtime by the value assigned to it (i.e., types are not declared)
There are three different assignment operators in R:
- Right-to-left: Using the <- operator (this is the traditional operator)
- Left-to-right: Using the -> operator
- Compatibility: Using the = operator

x <- 3 * 2
6+7*2 -> y
boring_var = 32

Math Operations

The standard arithmetic operators more or less work as expected:
+, -, *, /, (, )
R also has an exponentiation operator: ^

x <- 2^10 + (3*5 + 1)/2
print(x)

## [1] 1032

What is an R Source File?

You can type commands into the console
But for anything even moderately sophisticated, you’ll probably want to create a source file
An R source file is a text-readable file with R commands in it that can be run any time
There are many advantages, not the least of which is being able to give the file to someone else to run
Also, you’ll need to turn in a source file for homework

Some Basic Data Types

numeric (numbers)
character (strings)
vectors (lists of one type)
factors (categorical variables)
lists (lists of arbitrary types)
matrices (2D arrays of one type, most often numeric)

Numeric Data

scalar numbers: integers and real values
arithmetic operations on these

x = 3
y = 4.7
x/y + 2*x - 3

## [1] 3.638298

y > x

## [1] TRUE

Character Data

strings
operations on strings

x = "hello"
nchar(x)

## [1] 5

gsub("he","HE-",x)

## [1] "HE-llo"

Concatenating Strings with paste()

x = "hello"
y = "world"
paste(x,y)

## [1] "hello world"

paste(x,y,"turtle",sep=':')

## [1] "hello:world:turtle"

Vectors

R vectors are basically flexible arrays
R vectors aren’t just numeric
However, all elements of a vector must be the same type
Creating a vector with a mixture of numeric and character values will force R to coerce the vector to character

c(1,2,9)

## [1] 1 2 9

c(1,2,"9")

## [1] "1" "2" "9"

Vector Ops: Scaling, Indexing, and Length

x = c(-4,2,31)
3*x

## [1] -12   6  93

x[1]

## [1] -4

length(x)

## [1] 3

Vector Ops: Element-by-Element Ops

x = c(-4,2,3)
y = c(1,3,9)
z = 4*(1:3)
x*y

## [1] -4  6 27

x*y - z

## [1] -8 -2 15

Vector Ops: Named Elements

x = c(-4,2,3)
names(x) <- c("Bob","Frank","Mindy")
x["Frank"]

## Frank 
##     2

x["Mindy"] = 0
print(x)

##   Bob Frank Mindy 
##    -4     2     0

Vector Ops: Pasting with Vectors

x = 1:5
paste("String",x,sep='-')

## [1] "String-1" "String-2" "String-3" "String-4" "String-5"

Vector Ops: Appending to a Vector

x = seq(from=2,to=10,by=2)
print(x)

## [1]  2  4  6  8 10

x = c(x,-99)
x

## [1]   2   4   6   8  10 -99

x[7] = -98
x

## [1]   2   4   6   8  10 -99 -98

Factors

A factor is like a vector, except for categorical values
They can be created from vectors

x = c("good","good","bad","mediocre","good")
factor(x)

## [1] good     good     bad      mediocre good    
## Levels: bad good mediocre

factor(c(1,2,1,1,2,2,3,3,2,2))

##  [1] 1 2 1 1 2 2 3 3 2 2
## Levels: 1 2 3

Lists Are Like Mixed-Type Vectors

x = list()
x[[1]] = 1
x[[2]] = "hello"
x[[3]] = c(-1,-2,-9)
x

## [[1]]
## [1] 1
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] -1 -2 -9

List Operations: Named Elements

x = list(3,1,4,5)
names(x) <- c("A","B","C","D")
x[["A"]]

## [1] 3

x$A

## [1] 3

Conditionals

x <- 4
if (x > 2) {
  print("x is bigger than 2")
} else if (x == 2) {
  print("x is exactly 2")
} else {
  print("x is less than 2")
}

## [1] "x is bigger than 2"

Simple for Loops

for (x in 0:10) {
  print(x)
}

## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Defining a Function

myfunction <- function(arg) {
  argsq <- arg^2
  return (argsq)
}

myfunction(3)

## [1] 9

Loading External Libraries

There are many external libraries available to install for R
Common ones that we’ll use include: dplyr, reshape2, ggplot2, RColorBrewer
We load these using the library function:

library(reshape2)
sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.4
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.49        
##  [5] magrittr_2.0.3    glue_1.7.0        cachem_1.1.0      stringr_1.5.1    
##  [9] knitr_1.49        htmltools_0.5.8.1 rmarkdown_2.29    lifecycle_1.0.4  
## [13] cli_3.6.2         sass_0.4.9        jquerylib_0.1.4   compiler_4.4.2   
## [17] plyr_1.8.9        rstudioapi_0.17.1 tools_4.4.2       evaluate_1.0.1   
## [21] bslib_0.8.0       Rcpp_1.0.12       yaml_2.3.10       rlang_1.1.4      
## [25] jsonlite_1.8.9    stringi_1.8.4

Creating a Source File

You can use any text editor, but we’ll use R Studio
Go to File → New File → R Script
Type commands into the file in the order that you would like them to be executed
- Don’t forget to include any library() loads
- The user running this file next may not have that library loaded
Save the file when you are done
- File → Save As, then give the file a location and name that you like
- Or just File → Save if it already has a location and name
- It is customary to give the file the extension .R or .r

Sourcing the File

To run the file in R Studio, either

Press the Source button in the top right of the editor panel
Or, in the console, type:

source('path/to/my/script.R')

Make Sure Your Script Runs in a Clean Environment

It is always a good idea to make sure your script will run with your environment completely clean

From the top-right panel, select the Environment tab
Select the little broom icon to clean the environment, then confirm with Yes on the dialog
From the Session menu option, select Restart R and Clean Output
Source your file

Learn More!

We’ll be learning a lot more about R as the semester progresses, including through a number of lab assignments. But there are also many online materials, including:

The online Cookbook for R, http://www.cookbook-r.com
A simple article in Computerworld introducing R, http://www.computerworld.com/article/2497143/business-intelligence-beginner-s-guide-to-r-introduction.html
Another introductory article in Computerworld about getting your data into R, http://www.computerworld.com/article/2497164/business-intelligence/beginner-s-guide-to-r-get-your-data-into-r.html
The R-Bloggers site, http://www.r-bloggers.com
- This includes a two-part introduction to R series, http://rtutorialseries.blogspot.com/2009/10/r-tutorial-series-introduction-to-r_11.html

Introduction to Python

Why Use Python?

Python is open source and freely available
Works on all standard platforms (Mac, Windows, Linux)
Extremely useful for data manipulation and machine learning
Standard tool for many sciences (e.g., Computer Science, Physics)
Python is better for some parts of the data analysis pipeline, R is better for others
It is interpreted, so we can run interactively and use it like a workbook

Install a Stable Version of Python

The latest version of Python as of this presentation is 3.13.1
However, I encourage you to install a version between 3.10 and 3.12
There are many ways to install Python, so consult the Documentation
Avoid using Anaconda unless you already know about it
The most common cause for library/package problems later is having multiple distributions of Python installed via different methods, so pick a method and use that
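
One way to check which installation you are actually running is to ask the interpreter itself. This is a small sketch using only the standard `sys` module; the path it prints will differ from machine to machine.

```python
import sys

# Path of the interpreter that is actually running this code.
print(sys.executable)

# Version of that interpreter as a (major, minor) pair.
major, minor = sys.version_info[:2]
print(f"Python {major}.{minor}")
```

If the path or version is not the one you expected, a second installation is probably earlier on your PATH.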

Python IDEs

One common IDE for Python is PyCharm
Also, a lot of people just use a general IDE like VSCode
Choose what works best for you
Though it is helpful if the IDE you choose has git integration

The Python Console

You can also run Python interactively
If it is installed and on your PATH, just type python or python3 at the command line
You’ll be given a prompt that looks like this: >>>
You can run commands interactively from that prompt

print("Python is easy!")

## Python is easy!

Later we’ll talk about using scripts to do more complicated things

Variables & Assignment

Python also stores values in variables
Like R, Python is dynamically typed, meaning the type of a variable is determined at runtime by the value assigned to it (i.e., types are not declared)
Assignment in Python uses the = operator

my_var = 32
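
To make the runtime typing concrete, the short sketch below rebinds one name to values of two different types. The variable names are illustrative only.

```python
value = 32                          # value currently refers to an int
first_type = type(value).__name__
value = "thirty-two"                # the same name can later refer to a str
second_type = type(value).__name__
print(first_type, second_type)

## int str
```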

Math Operations

The standard arithmetic operators more or less work as expected:
+, -, *, /, (, )
Python also has an exponentiation operator: **

x = 2**10 + (3*5 + 1)/2
print(x)

## 1032.0

What is a Python Source File?

Like R, you can type commands into the console
But for anything even moderately sophisticated, you’ll probably want to create a source file
A Python source file is a text-readable file with Python commands in it that can be run any time
There are many advantages, not the least of which is being able to give the file to someone else to run
Also, you’ll need to turn in a source file for homework
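
As a sketch of what such a file might look like, assume a file saved under the hypothetical name hello.py. The function name and message are made up for illustration.

```python
# hello.py (hypothetical filename)
def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name + "!"

# This guard runs the code below only when the file is executed directly,
# not when it is imported by another file.
if __name__ == "__main__":
    print(greet("class"))
```

From the command line, a file like this is typically run with python hello.py (or python3 hello.py, depending on your installation).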

Some Basic Data Types

floating point numbers
integer numbers
strings
Booleans
lists
dictionaries
tuples

Numeric Data

int are integers
float are floating point numbers
In Python 3, arithmetic that mixes int and float produces a float, and / always returns a float

x = 3
y = 4.7
x/y + 2*x - 3

## 3.6382978723404253

y > x

## True
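
The difference between int and float results can be seen in a few operations. This is a small sketch; the variable names are illustrative.

```python
a = 7 / 2      # true division always produces a float
b = 7 // 2     # floor division of two ints stays an int
c = 3 + 4.0    # mixing int and float produces a float
print(a, b, c)

## 3.5 3 7.0

print(type(a).__name__, type(b).__name__, type(c).__name__)

## float int float
```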

String Data

You may use double quotes or single quotes, but they must match
Python has many convenient string operations

x = "   Hello There "
len(x)

## 15

print( x.lower(), x.find('lo') )

##    hello there  6

x.split()

## ['Hello', 'There']
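
A brief sketch of the quoting rule: either quote style produces the same string, and using one style outside lets you use the other inside. The example strings are made up.

```python
single = 'hello'
double = "hello"
same = single == double              # quote style does not change the value
nested = "it's easy"                 # a single quote inside double quotes
stripped = "   Hello There ".strip() # remove leading and trailing whitespace
print(same, nested, stripped)

## True it's easy Hello There
```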

Concatenating Strings with Addition

x = "hello"
y = "world"
print(x + " " + y)

## hello world

Lists

Python lists can be dynamically expanded or reduced
They can contain different types of data
They are zero-indexed

x = [12, -3, "no", True, ['a', 'b', 'c']]
x[0]

## 12

x[-1]

## ['a', 'b', 'c']
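
Expanding and reducing a list can be sketched with a few built-in list methods. The values here are illustrative.

```python
x = [12, -3, "no"]
x.append(True)        # grow by one element at the end
x.insert(0, "first")  # grow by inserting at index 0
last = x.pop()        # shrink: remove and return the last element
x.remove(-3)          # shrink: remove the first element equal to -3
print(x, last)

## ['first', 12, 'no'] True
```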

Tuples

Tuples are like lists, but they are immutable
So you can build them, but you can’t change them once they are built

x = ('foo', 'bar', 'baz')
x[1]

## 'bar'

len(x)

## 3
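
Immutability can be checked directly. In this sketch, trying to assign to a tuple element raises a TypeError, while building a new tuple from an old one works.

```python
x = ('foo', 'bar', 'baz')
try:
    x[1] = 'qux'          # item assignment is not allowed on a tuple
    changed = True
except TypeError:
    changed = False
y = x + ('qux',)          # building a new, longer tuple is fine
print(changed, y)

## False ('foo', 'bar', 'baz', 'qux')
```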

Dictionaries

Dictionaries allow you to store things using a key (a hash table)
Dictionary keys must be hashable, which in practice means immutable values (numbers, strings, tuples of immutable items, etc.)

x = {"one":100, "two":"oranges", "three":True, (1,2):"turtle"}
print(x[ (1,2) ])

## turtle

x["Paul"] = "Cool"
print( x["Paul"] )

## Cool
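
The restriction on keys can be seen by trying a list as a key. This sketch also shows get(), which returns a default value instead of raising an error for a missing key. The keys and values are made up.

```python
x = {"one": 100, (1, 2): "turtle"}
try:
    x[[1, 2]] = "list key"             # a list is mutable, so it cannot be a key
    accepted = True
except TypeError:
    accepted = False
missing = x.get("nine", "not found")   # default returned for an absent key
print(accepted, missing, "one" in x)

## False not found True
```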

Conditionals

x = 4
if x>2:
  print("x is bigger than 2")
elif x==2:
  print("x is exactly 2")
else:
  print("x is less than 2")

## x is bigger than 2

Simple for Loops

for idx in range(10):
  print("Index: ", idx)

## Index:  0
## Index:  1
## Index:  2
## Index:  3
## Index:  4
## Index:  5
## Index:  6
## Index:  7
## Index:  8
## Index:  9

Looping Through Lists

x = [12, -3, "no", True, ['a', 'b', 'c']]
for item in x:
  print(item)

## 12
## -3
## no
## True
## ['a', 'b', 'c']

Looping Through Dictionaries

x = {"one":100, "two":"oranges", "three":True, (1,2):"turtle"}
for k in x:
  print(k, "::: ", x[k])

## one :::  100
## two :::  oranges
## three :::  True
## (1, 2) :::  turtle
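
An alternative sketch of the same loop uses the dictionary's items() method, which yields each key together with its value so no second lookup is needed. The dictionary contents are illustrative.

```python
x = {"one": 100, "two": "oranges"}
pairs = []
for key, value in x.items():   # items() yields (key, value) pairs
    pairs.append((key, value))
print(pairs)

## [('one', 100), ('two', 'oranges')]
```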

List Comprehensions

Python provides short-hand ways to build lists

cubes = [z**3  for z in range(10)]
print(cubes)

## [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
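
A comprehension can also filter with an if clause. The sketch below builds the cubes of the even numbers only and shows the explicit loop it stands in for; the variable names are illustrative.

```python
even_cubes = [z**3 for z in range(10) if z % 2 == 0]

# The equivalent explicit loop:
loop_cubes = []
for z in range(10):
    if z % 2 == 0:
        loop_cubes.append(z**3)

print(even_cubes)

## [0, 8, 64, 216, 512]
```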

Defining a Function

def myfunction(arg):
  argsq = arg**2
  return argsq
  
myfunction(3)

## 9

Loading External Libraries

There are many external libraries available to install for Python
Common ones that we’ll use include: numpy, scipy, sklearn, pandas, tensorflow
We load these using the import statement:

import numpy as np
print(np.__version__)

## 1.24.4
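
The import statement has a few common forms. This sketch uses modules from the standard library so it runs without installing anything; the alias st is just an illustrative choice.

```python
import math                      # import a whole module
import statistics as st          # import a module under a shorter alias
from math import sqrt            # import one name directly

root = sqrt(16)
average = st.mean([1, 2, 3])
print(math.pi > 3, root)

## True 4.0
```

The same forms apply to installed packages such as numpy, provided they are installed for the interpreter you are running.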

Learn More

Python documentation is extremely robust and includes tutorials: https://docs.python.org/3/tutorial/index.html
For an interactive tutorial: https://www.learnpython.org/

Assignments in This Class

Source Repositories

We’ll be using source repositories for our assignments in this class
So you’ll need an account on GitHub: https://github.com/
I’ll provide a separate PDF and video about these basics in Week 2

How Homeworks Will Be Assigned

You’ll be given a link to GitHub Classroom in an assignment
When you click that link and accept the invitation, a new Git Repository is created for you on GitHub
You may clone that repo locally to your computer or to Hopper
Then you can work on it as you usually would
Make sure to commit your work regularly to your Repo

How To Submit Your Assignments

Make sure your repo is pushed to GitHub
If you can’t see it on the GitHub web page, I won’t be able to see it – so check there before submitting
On BlackBoard, just tell me that you are done and ready for me to grade
I can access your GitHub repo for the assignment myself
You don’t need to send me code via email or attach anything in BlackBoard