Spring 2025
Most data analytics tasks involve a specific set of common steps
Some tools are better at some parts of the pipeline than others
So this course is organized around the data pipeline
We’ll intermingle high level concepts related to that with specific language features
Data Collection – Obtain structured or unstructured data from various sources, then organize and represent them for analysis
Data Processing – Perform any cleaning, summarizing, or aggregation steps that are needed on the data, decide what to do about errant / missing data, etc.
Data Storage – Determine how collected and result data will be stored / warehoused. Consider any IP or data restrictions or anonymization that must happen
Analysis – Perform modeling / analysis tasks to describe, summarize, predict, etc.
Visualize – Provide a visual representation of data and/or results
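As a rough illustration, the five steps above can be sketched end to end in a few lines of Python using only the standard library. The inline CSV text, the column names, and the function names are all made up for this sketch; a real project would read from actual sources and use richer tools at each step.

```python
import csv
import io
import json
import statistics

# Hypothetical raw data; in practice this would come from a file, API, or database.
RAW_CSV = """name,score
Bob,82
Frank,
Mindy,91
Paul,77
"""

def collect(text):
    """Data collection: parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def process(records):
    """Data processing: drop rows with a missing score, convert the rest to int."""
    return [{"name": r["name"], "score": int(r["score"])}
            for r in records if r["score"].strip()]

def store(records):
    """Data storage: serialize to JSON (a string here; a file or database in practice)."""
    return json.dumps(records)

def analyze(records):
    """Analysis: summarize the scores with a mean."""
    return statistics.mean(r["score"] for r in records)

def visualize(records):
    """Visualization: a crude text bar chart, one '#' per 10 points."""
    return "\n".join(f"{r['name']:<6}{'#' * (r['score'] // 10)}" for r in records)

records = process(collect(RAW_CSV))
stored = store(records)
mean_score = analyze(records)
chart = visualize(records)
print(chart)
print("mean score:", mean_score)
```

Keeping each step in its own function makes it easy to swap in a different tool for one step without touching the others.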
One or two weeks have been set aside for each step of the data pipeline
Each week, we’ll discuss that step at a high level
But we’ll focus mainly on coding concepts related to it
We will explore the following languages this semester:
R – Open source language for statistical analysis and visualization
Python – Open source language for numerical calculations and modeling
Julia – A highly performant, parallelized language that supports integrated processing of data analytics
We’ll start with R and Python, and we’ll touch on Julia near the end.
For most weeks, there will be a small homework assignment related to the programming topics
The purpose of these assignments is for you to learn a little bit about the language
They aren’t intended to be difficult, so do them yourself
The culminating assessment in this class is a project
That project will be a data analytics task of your choosing that involves the data pipeline
You will propose the project mid-semester via a presentation to get feedback about feasibility / appropriateness
You will submit all source code for it
You will also give a small presentation discussing how you handled each step of the data pipeline
The ggplot2 package can produce very nice data visualizations
The latest version of R as of this presentation is 4.4.2
The latest version of R Studio as of this presentation is 0.99
print("Statistics Rocks (roughly speaking)!")
## [1] "Statistics Rocks (roughly speaking)!"
The <- operator (this is the traditional operator)
The -> operator
The = operator
x <- 3 * 2
6+7*2 -> y
boring_var = 32
+, -, *, /, (, )
^
x <- 2^10 + (3*5 + 1)/2
print(x)
## [1] 1032
You can type commands into the console
But for anything even moderately sophisticated, you’ll probably want to create a source file
An R source file is a text-readable file with R commands in it that can be run any time
There are many advantages, not the least of which is being able to give the file to someone else to run
Also, you’ll need to turn in a source file for homework
x = 3
y = 4.7
x/y + 2*x - 3
## [1] 3.638298
y > x
## [1] TRUE
x = "hello"
nchar(x)
## [1] 5
gsub("he","HE-",x)
## [1] "HE-llo"
x = "hello"
y = "world"
paste(x,y)
## [1] "hello world"
paste(x,y,"turtle",sep=':')
## [1] "hello:world:turtle"
c(1,2,9)
## [1] 1 2 9
c(1,2,"9")
## [1] "1" "2" "9"
x = c(-4,2,31)
3*x
## [1] -12 6 93
x[1]
## [1] -4
length(x)
## [1] 3
x = c(-4,2,3)
y = c(1,3,9)
z = 4*(1:3)
x*y
## [1] -4 6 27
x*y - z
## [1] -8 -2 15
x = c(-4,2,3)
names(x) <- c("Bob","Frank","Mindy")
x["Frank"]
## Frank
##     2
x["Mindy"] = 0
print(x)
##   Bob Frank Mindy
##    -4     2     0
x = 1:5
paste("String",x,sep='-')
## [1] "String-1" "String-2" "String-3" "String-4" "String-5"
x = seq(from=2,to=10,by=2)
print(x)
## [1] 2 4 6 8 10
x = c(x,-99)
x
## [1] 2 4 6 8 10 -99
x[7] = -98
x
## [1] 2 4 6 8 10 -99 -98
x = c("good","good","bad","mediocre","good")
factor(x)
## [1] good good bad mediocre good
## Levels: bad good mediocre
factor(c(1,2,1,1,2,2,3,3,2,2))
## [1] 1 2 1 1 2 2 3 3 2 2
## Levels: 1 2 3
x = list()
x[[1]] = 1
x[[2]] = "hello"
x[[3]] = c(-1,-2,-9)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] "hello"
##
## [[3]]
## [1] -1 -2 -9
x = list(3,1,4,5)
names(x) <- c("A","B","C","D")
x[["A"]]
## [1] 3
x$A
## [1] 3
x <- 4
if (x > 2) {
  print("x is bigger than 2")
} else if (x == 2) {
  print("x is exactly 2")
} else {
  print("x is less than 2")
}
## [1] "x is bigger than 2"
for (x in 0:10) {
  print(x)
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
myfunction <- function(arg) {
  argsq <- arg^2
  return (argsq)
}
myfunction(3)
## [1] 9
Load a package with the library function:
library(reshape2)
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.4
##
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.49
##  [5] magrittr_2.0.3    glue_1.7.0        cachem_1.1.0      stringr_1.5.1
##  [9] knitr_1.49        htmltools_0.5.8.1 rmarkdown_2.29    lifecycle_1.0.4
## [13] cli_3.6.2         sass_0.4.9        jquerylib_0.1.4   compiler_4.4.2
## [17] plyr_1.8.9        rstudioapi_0.17.1 tools_4.4.2       evaluate_1.0.1
## [21] bslib_0.8.0       Rcpp_1.0.12       yaml_2.3.10       rlang_1.1.4
## [25] jsonlite_1.8.9    stringi_1.8.4
You can use any text editor, but we’ll use R Studio
Go to File \(\rightarrow\) New File \(\rightarrow\) R Script
Type commands into the file in the order that you would like them to be executed
Include any library() loads the script needs
Save the file when you are done, with a .R or .r extension
To run the file in R Studio, either
Press the Source button in the top right of the editor panel
Or, in the console, type:
source('path/to/my/script.R')
It is always a good idea to make sure your script will run with your environment completely clean
From the top-right panel, select the Environment tab
Select the little broom icon to clean the environment, then confirm with Yes on the dialog
From the Session menu option, select Restart R and Clear Output
Source your file
We’ll be learning a lot more about R as the semester progresses, including through a number of lab assignments. But there are also a lot of online materials, including:
Start the interpreter by typing python or python3 at the command line
Commands are entered at the >>> prompt
print("Python is easy!")
## Python is easy!
The = operator
my_var = 32
+, -, *, /, (, )
**
x = 2**10 + (3*5 + 1)/2
print(x)
## 1032.0
Like R, you can type commands into the console
But for anything even moderately sophisticated, you’ll probably want to create a source file
A Python source file is a text-readable file with Python commands in it that can be run any time
There are many advantages, not the least of which is being able to give the file to someone else to run
Also, you’ll need to turn in a source file for homework
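A minimal sketch of what such a source file might look like. The file name my_script.py and the function name main are placeholders chosen for illustration.

```python
# my_script.py (hypothetical file name)
# Run from the command line with:  python3 my_script.py

def main():
    """Do the work of the script and return the result."""
    x = 2**10 + (3*5 + 1)/2
    return x

# This guard runs main() when the file is executed directly,
# but not when the file is imported by another file.
if __name__ == "__main__":
    print(main())
```

Putting the work inside a function keeps the file easy to reuse from other scripts later.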
int values are integers
float values are floating point numbers
x = 3
y = 4.7
x/y + 2*x - 3
## 3.6382978723404253
y > x
## True
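To make the int / float distinction concrete, here is a small sketch showing how Python reports types and how the two division operators behave. The variable names are arbitrary.

```python
x = 3       # int
y = 4.7     # float

print(type(x).__name__, type(y).__name__)

true_div = 7 / 2     # / always produces a float
floor_div = 7 // 2   # // rounds down; the result is an int when both operands are ints
mixed = x + y        # mixing int and float gives a float

print(true_div, floor_div, type(mixed).__name__)
```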
x = "   Hello There "
len(x)
## 15
print( x.lower(), x.find('lo') )
## hello there 6
x.split()
## ['Hello', 'There']
x = "hello"
y = "world"
print(x + " " + y)
## hello world
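A few other string operations that often come up when cleaning text, sketched here with a made-up string:

```python
raw = "  Hello There  "

cleaned = raw.strip()               # remove leading and trailing whitespace
words = cleaned.split()             # split on whitespace into a list
joined = "-".join(words)            # glue list items together with a separator
shout = cleaned.upper()             # upper-case copy; the original is unchanged
message = f"{words[0]} has {len(words[0])} letters"   # f-string formatting

print(cleaned, words, joined, shout, message, sep="\n")
```

Strings are immutable, so each of these methods returns a new string rather than changing raw.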
x = [12, -3, "no", True, ['a', 'b', 'c']]
x[0]
## 12
x[-1]
## ['a', 'b', 'c']
x = ('foo', 'bar', 'baz')
x[1]
## 'bar'
len(x)
## 3
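The main practical difference between the list and the tuple shown above is that a list can be changed in place and a tuple cannot. A small sketch, with arbitrary variable names:

```python
items = [12, -3, "no"]
items.append(True)      # lists can grow
items[0] = 99           # and be changed in place

point = ("foo", "bar", "baz")
try:
    point[1] = "changed"    # tuples cannot be modified
except TypeError as err:
    tuple_error = str(err)

first_two = items[:2]   # slicing works the same way on both
print(items, first_two, tuple_error)
```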
x = {"one":100, "two":"oranges", "three":True, (1,2):"turtle"}
print(x[ (1,2) ])
## turtle
x["Paul"] = "Cool"
print( x["Paul"] )
## Cool
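Some other common dictionary operations, sketched with a smaller made-up dictionary:

```python
x = {"one": 100, "two": "oranges"}

has_one = "one" in x                  # membership tests look at keys
missing = x.get("four", "not found")  # .get avoids a KeyError for absent keys
x["three"] = True                     # add a new key
removed = x.pop("two")                # remove a key and return its value

pairs = list(x.items())               # (key, value) tuples, in insertion order
print(has_one, missing, removed, pairs)
```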
x = 4
if x>2:
    print("x is bigger than 2")
elif x==2:
    print("x is exactly 2")
else:
    print("x is less than 2")
## x is bigger than 2
for idx in range(10):
    print("Index: ", idx)
## Index:  0
## Index:  1
## Index:  2
## Index:  3
## Index:  4
## Index:  5
## Index:  6
## Index:  7
## Index:  8
## Index:  9
x = [12, -3, "no", True, ['a', 'b', 'c']]
for item in x:
    print(item)
## 12
## -3
## no
## True
## ['a', 'b', 'c']
x = {"one":100, "two":"oranges", "three":True, (1,2):"turtle"}
for k in x:
    print(k, "::: ", x[k])
## one :::  100
## two :::  oranges
## three :::  True
## (1, 2) :::  turtle
cubes = [z**3 for z in range(10)]
print(cubes)
## [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
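A list comprehension can also filter, and the same syntax works for dictionaries. The sketch below compares a filtered comprehension with the equivalent explicit loop; the variable names are made up.

```python
even_cubes = [z**3 for z in range(10) if z % 2 == 0]   # keep only even z
cube_lookup = {z: z**3 for z in range(4)}              # dict comprehension

# The equivalent explicit loop for even_cubes
loop_version = []
for z in range(10):
    if z % 2 == 0:
        loop_version.append(z**3)

print(even_cubes, cube_lookup)
```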
def myfunction(arg):
    argsq = arg**2
    return argsq

myfunction(3)
## 9
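Functions can also take default argument values and return more than one result. A brief sketch, using hypothetical function names:

```python
def power(arg, exponent=2):
    """Raise arg to exponent; squares by default."""
    return arg**exponent

def min_and_max(values):
    """Return two results at once as a tuple."""
    return min(values), max(values)

squared = power(3)
cubed = power(3, exponent=3)
low, high = min_and_max([4, -1, 9])
print(squared, cubed, low, high)
```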
Load a module with the import statement:
import numpy as np
print(np.__version__)
## 1.24.4
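The import statement comes in a few forms. The sketch below uses only standard library modules, so it should run without installing anything; the variable names are arbitrary.

```python
import math                      # import a whole module
import statistics as st          # import under a shorter alias
from collections import Counter  # import a single name from a module

root = math.sqrt(16)
middle = st.median([3, 1, 4, 1, 5])
counts = Counter(["good", "good", "bad", "mediocre", "good"])

print(root, middle, counts["good"])
```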