Spring 2026
Data collected on students in a statistics class on a variety of variables:
| Student | Major | CourseNum | … | Dread |
|---|---|---|---|---|
| 1 | CSCI | 208 | \(\cdots\) | 3 |
| 2 | CSCI | 208 | \(\cdots\) | 2 |
| 3 | Cyber | 311 | \(\cdots\) | 4 |
| 4 | CSCI | 271 | \(\cdots\) | 2 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| 86 | CIFS | 210 | \(\cdots\) | 1 |
To be able to conduct useful analysis, data must:
To produce the most meaningful results, data should also:
Often, we run our own experiments and collect the empirical results ourselves
When we are organizing our own data, we should choose to lay it out and store it in a way that makes later analysis and visualization as easy as possible
Sometimes you must collect data that is not directly part of your own project
Common data storage formats include:
Student, Gender, Major, CourseNum, Dread
1, male, CSCI, 208, 3
2, female, CSCI, 208, 2
3, female, Cyber, 311, 4
4, female, CSCI, 271, 2
5, male, CIFS, 210, 1
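A comma-separated layout like the one above can be read with Python's standard csv module alone. The following is a minimal sketch, not part of the original slides: the sample text copies the rows shown, `Gender` is an assumed name for the second field, and `load_students` is a hypothetical helper.

```python
import csv
import io

# Sample rows from the slide; "Gender" is an assumed column name.
CSV_TEXT = """Student, Gender, Major, CourseNum, Dread
1, male, CSCI, 208, 3
2, female, CSCI, 208, 2
3, female, Cyber, 311, 4
4, female, CSCI, 271, 2
5, male, CIFS, 210, 1
"""

def load_students(text):
    """Parse the CSV text into a list of dicts with integer numeric fields."""
    # skipinitialspace drops the blank that follows each comma
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for row in reader:
        for key in ("Student", "CourseNum", "Dread"):
            row[key] = int(row[key])
        rows.append(row)
    return rows

students = load_students(CSV_TEXT)
print(students[0])
```

Every field arrives as text, so the numeric columns are converted explicitly; libraries such as pandas infer these types for you.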
Dealing with data is usually the most challenging part.
As it is produced, data often
Student = c("Bob", "Sue", "Cat", "Lin")
NumberGrade = c(96, 82, 97, 74)
LetterGrade = factor(c("A","B","A","C"))
RosterData = data.frame(Student,NumberGrade,LetterGrade)
RosterData
##   Student NumberGrade LetterGrade
## 1     Bob          96           A
## 2     Sue          82           B
## 3     Cat          97           A
## 4     Lin          74           C
rd = data.frame( Student = c("Bob", "Sue", "Cat", "Lin"),
NumberGrade = c(96, 82, 97, 74),
LetterGrade = factor(c("A","B","A","C")) )
rd$NumberGrade
## [1] 96 82 97 74
rd$LetterGrade
## [1] A B A C
## Levels: A B C
# rd from last slide
dim(rd); levels(rd$LetterGrade)
## [1] 4 3
## [1] "A" "B" "C"
summary(rd)
##    Student           NumberGrade    LetterGrade
##  Length:4           Min.   :74.00   A:2
##  Class :character   1st Qu.:80.00   B:1
##  Mode  :character   Median :89.00   C:1
##                     Mean   :87.25
##                     3rd Qu.:96.25
##                     Max.   :97.00
summary(rd$NumberGrade)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   74.00   80.00   89.00   87.25   96.25   97.00
x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/color-cookbook-eg.txt',
header=TRUE)  # Spell out TRUE; T is an ordinary variable that can be reassigned
head(x)
##   cond1 cond2 yval
## 1     A     I  2.0
## 2     A     J  2.5
## 3     A     K  1.6
## 4     A     L  0.8
## 5     B     I  2.2
## 6     B     J  2.4
Note: If header is not given, read.table() guesses it from the file (TRUE only when the first row has one fewer field than the rest), while read.csv() defaults to TRUE, so it is best to specify what you want
R’s read.table() function can handle many file layouts
https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv
x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
sep=',', # Use a comma as a field separator
header=FALSE, # Don't use the first line as the header
skip=10, # But skip the first 10 lines
col.names=c('ProblemInstance','Trial','BestSolution'),
colClasses =c('character', 'integer', 'numeric'),
blank.lines.skip=TRUE, # Skip blank lines
comment.char=";")
head(x)
##   ProblemInstance Trial BestSolution
## 1              P1     0     13.55185
## 2              P1     1     13.79003
## 3              P1     3     14.68695
## 4              P1     4     13.52434
## 5              P1     5     13.55290
## 6              P1     6     14.03507
The read.csv function is simpler for standard, clean CSV files
The readxl library in R has the function read_excel that will read Excel spreadsheets
The haven library in R has the functions read_sas, read_spss, and read_stata
The scan function will just read numeric data into a vector out of a file
df <- data.frame(Instructor=factor(c("Wiegand", "Besmer", "Doman", "Scibelli")),
PresentationStyle=c(2.0, 1.6, 1.7, 1.8),
PreparedMaterials=c(1.5, 1.8, 1.6, 1.7),
Availability=c(2.0, 2.0, 2.0, 2.0))
df
##   Instructor PresentationStyle PreparedMaterials Availability
## 1    Wiegand               2.0               1.5            2
## 2     Besmer               1.6               1.8            2
## 3      Doman               1.7               1.6            2
## 4   Scibelli               1.8               1.7            2
library(reshape2)
melt(df, variable.name="Measure", value.name="Score")
## Using Instructor as id variables
##    Instructor           Measure Score
## 1     Wiegand PresentationStyle   2.0
## 2      Besmer PresentationStyle   1.6
## 3       Doman PresentationStyle   1.7
## 4    Scibelli PresentationStyle   1.8
## 5     Wiegand PreparedMaterials   1.5
## 6      Besmer PreparedMaterials   1.8
## 7       Doman PreparedMaterials   1.6
## 8    Scibelli PreparedMaterials   1.7
## 9     Wiegand      Availability   2.0
## 10     Besmer      Availability   2.0
## 11      Doman      Availability   2.0
## 12   Scibelli      Availability   2.0
Python's pandas library provides a structure similar to R's data.frame; it is conventionally imported as pd:
import pandas as pd
my_series = pd.Series([3.14, 2.78, 1.0, 5])
print(my_series)
## 0    3.14
## 1    2.78
## 2    1.00
## 3    5.00
## dtype: float64
The DataFrame structure:
df = pd.DataFrame({
"Student":["Bob", "Sue", "Cat", "Lin"],
"NumberGrade":pd.Series([96, 82, 97, 74]),
"LetterGrade":pd.Categorical(["A","B","A","C"])
})
df
##   Student  NumberGrade LetterGrade
## 0     Bob           96           A
## 1     Sue           82           B
## 2     Cat           97           A
## 3     Lin           74           C
df.columns
## Index(['Student', 'NumberGrade', 'LetterGrade'], dtype='object')
df.dtypes
## Student          object
## NumberGrade       int64
## LetterGrade    category
## dtype: object
df.shape
## (4, 3)
df["Student"]
## 0    Bob
## 1    Sue
## 2    Cat
## 3    Lin
## Name: Student, dtype: object
df["NumberGrade"].describe()
## count     4.000000
## mean     87.250000
## std      11.176612
## min      74.000000
## 25%      80.000000
## 50%      89.000000
## 75%      96.250000
## max      97.000000
## Name: NumberGrade, dtype: float64
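The numbers reported by describe() can be checked by hand. The sketch below recomputes them for the four NumberGrade values with Python's standard statistics module. It is an illustration only and does not show how pandas computes them internally; the variable names are chosen for the example.

```python
import statistics

number_grades = [96, 82, 97, 74]  # NumberGrade column from the slides

mean = statistics.mean(number_grades)
std = statistics.stdev(number_grades)  # sample standard deviation (divides by n - 1)
# method="inclusive" interpolates linearly between the sorted data points
q1, median, q3 = statistics.quantiles(number_grades, n=4, method="inclusive")

print(mean, round(std, 6), q1, median, q3)
```

The std row is the sample standard deviation, and the quartiles use linear interpolation, which is why the 75% value falls between 96 and 97.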
df["LetterGrade"].cat.categories # Levels of the factor
## Index(['A', 'B', 'C'], dtype='object')
pd.read_csv("foo.csv") to read from a CSV
pd.read_excel("foo.xlsx") to read from an Excel file
pd.read_sas("foo.sas7bdat") to read from a SAS file
pd.read_spss("foo.sav") to read from an SPSS file
import numpy as np
df = pd.read_csv('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
sep=',', # Use a comma as a field separator
header=0, # Treats the first unskipped line as a header and replaces it with names, so the Trial 0 row is dropped; use header=None to keep it
skiprows=list(range(10)), # But skip the first 10 lines
names=['ProblemInstance','Trial','BestSolution'],
skip_blank_lines=True, # Skip blank lines
comment=";")
df.head()
##   ProblemInstance  Trial  BestSolution
## 0              P1      1     13.790031
## 1              P1      3     14.686953
## 2              P1      4     13.524335
## 3              P1      5     13.552902
## 4              P1      6     14.035066
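The options passed to pd.read_csv above (skipping leading lines, dropping text after a comment character, ignoring blank lines, naming the columns) can be sketched in plain Python to show what each one does. This is a rough illustration rather than how pandas works internally: the file contents in `RAW` are made up for the example, and `read_trials` is a hypothetical helper.

```python
import csv

# Hypothetical file contents: two preamble lines, comments, and a blank line.
RAW = """Knapsack results
generated by a hypothetical script
P1,0,13.5 ; first trial
; a line that is only a comment

P1,1,13.8
"""

def read_trials(text, skip=2, comment=";"):
    """Skip leading lines, strip comments, drop blank lines, then parse."""
    lines = text.splitlines()[skip:]                      # like skiprows / skip
    cleaned = [line.split(comment, 1)[0].strip() for line in lines]
    cleaned = [line for line in cleaned if line]          # like skip_blank_lines
    names = ["ProblemInstance", "Trial", "BestSolution"]  # like names / col.names
    rows = []
    for fields in csv.reader(cleaned):
        record = dict(zip(names, fields))
        record["Trial"] = int(record["Trial"])
        record["BestSolution"] = float(record["BestSolution"])
        rows.append(record)
    return rows

trials = read_trials(RAW)
print(trials)
```

No line is treated as a header here, so the first data row (Trial 0) is kept.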
df = pd.DataFrame({
"Instructor":pd.Categorical(["Wiegand", "Besmer", "Doman", "Scibelli"]),
"PresentationStyle":pd.Series([2.0, 1.6, 1.7, 1.8]),
"PreparedMaterials":pd.Series([1.5, 1.8, 1.6, 1.7]),
"Availability":pd.Series([2.0, 2.0, 2.0, 2.0])
})
df
##   Instructor  PresentationStyle  PreparedMaterials  Availability
## 0    Wiegand                2.0                1.5           2.0
## 1     Besmer                1.6                1.8           2.0
## 2      Doman                1.7                1.6           2.0
## 3   Scibelli                1.8                1.7           2.0
df.melt(id_vars=['Instructor'], var_name="Measure", value_name="Score")
##    Instructor            Measure  Score
## 0     Wiegand  PresentationStyle    2.0
## 1      Besmer  PresentationStyle    1.6
## 2       Doman  PresentationStyle    1.7
## 3    Scibelli  PresentationStyle    1.8
## 4     Wiegand  PreparedMaterials    1.5
## 5      Besmer  PreparedMaterials    1.8
## 6       Doman  PreparedMaterials    1.6
## 7    Scibelli  PreparedMaterials    1.7
## 8     Wiegand       Availability    2.0
## 9      Besmer       Availability    2.0
## 10      Doman       Availability    2.0
## 11   Scibelli       Availability    2.0
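To see what melting does, the wide-to-long reshaping can be written out by hand. The sketch below uses plain Python dictionaries and the instructor data from the slides. The `melt` function here is a hypothetical stand-in written for illustration, not the pandas implementation.

```python
def melt(rows, id_var, var_name="Measure", value_name="Score"):
    """Turn wide records into long records: one output row per (id, column)."""
    measures = [key for key in rows[0] if key != id_var]
    long_rows = []
    for measure in measures:        # one block of rows per measured column
        for row in rows:
            long_rows.append({id_var: row[id_var],
                              var_name: measure,
                              value_name: row[measure]})
    return long_rows

wide = [
    {"Instructor": "Wiegand", "PresentationStyle": 2.0, "PreparedMaterials": 1.5, "Availability": 2.0},
    {"Instructor": "Besmer", "PresentationStyle": 1.6, "PreparedMaterials": 1.8, "Availability": 2.0},
    {"Instructor": "Doman", "PresentationStyle": 1.7, "PreparedMaterials": 1.6, "Availability": 2.0},
    {"Instructor": "Scibelli", "PresentationStyle": 1.8, "PreparedMaterials": 1.7, "Availability": 2.0},
]

long_rows = melt(wide, "Instructor")
for row in long_rows[:3]:
    print(row)
```

Each of the 4 wide rows contributes one long row per measure, giving 4 × 3 = 12 rows, in the same order as the melted table above.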
using Pkg; Pkg.add(["DataFrames", "CSV", "DelimitedFiles", "HTTP"]); using DataFrames, CSV, DelimitedFiles, HTTP;
Reading a local file, foo.csv:
foo = DataFrame(CSV.File("foo.csv"))
# Make a function to simplify this:
read_remote_csv(url) = DataFrame(CSV.File(HTTP.get(url).body));
irisdf = read_remote_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
fetched_file = HTTP.get("https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv").body;
csv_file=CSV.File(fetched_file,
header=0, # Do not use the header
skipto=10, # Start reading data at line 10 (the lines before it are skipped)
comment=";", # Ignore things after a ;
delim=",", # Comma delimited
missingstring="") # Treat empty fields as missing values
df = DataFrame(csv_file)
# Now rename the columns to what you like
rename!(df, "Column1"=>"ProblemInstance", "Column2"=>"Trial", "Column3"=>"BestSolution")
show(first(irisdf, 5)) # Show first five rows
names(irisdf) # Show names of the columns
irisdf.sepal_length # Get just the sepal_length column data
first(irisdf.sepal_length, 7) # Get just first 7 values of that
The describe() function gives the equivalent of R’s summary():
describe(irisdf)
The XLSX package allows you to read Excel files:
using XLSX
mydf = DataFrame(XLSX.readtable("myfile.xlsx", "Sheet1", infer_eltypes=true))
The stack function is similar to R’s melt
The reverse operation is unstack
irisdf.id = 1:size(irisdf, 1) # Create new column of unique identifiers
longdf = stack(irisdf, [:sepal_length, :sepal_width, :petal_length, :petal_width])
unstack(longdf, :id, :variable, :value)
unstack(longdf, :variable, :value, combine=sum) # Or summarize variable/value
More Info: https://dataframes.juliadata.org/stable/man/reshaping_and_pivoting/