Spring 2025
Data collected on students in a statistics class on a variety of variables:
Student | Gender | Intro/Extra | … | Dread |
---|---|---|---|---|
1 | male | extravert | \(\cdots\) | 3 |
2 | female | extravert | \(\cdots\) | 2 |
3 | female | introvert | \(\cdots\) | 4 |
4 | female | extravert | \(\cdots\) | 2 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
86 | male | extravert | \(\cdots\) | 3 |
To be able to conduct useful analysis, data must:
To produce the most meaningful results, data should also:
Often, we coordinate an experiment ourselves and collect its empirical results
When we are organizing our own data, we should choose to lay it out and store it in a way that makes later analysis and visualization as easy as possible
Sometimes you must collect data that is not directly part of your own project
Common data storage formats include:
Dealing with data is usually the most challenging part.
As it is produced, data often
Student = c("Bob", "Sue", "Cat", "Lin")
NumberGrade = c(96, 82, 97, 74)
LetterGrade = factor(c("A","B","A","C"))
RosterData = data.frame(Student,NumberGrade,LetterGrade)
RosterData
##   Student NumberGrade LetterGrade
## 1     Bob          96           A
## 2     Sue          82           B
## 3     Cat          97           A
## 4     Lin          74           C
rd = data.frame(
  Student = c("Bob", "Sue", "Cat", "Lin"),
  NumberGrade = c(96, 82, 97, 74),
  LetterGrade = factor(c("A","B","A","C"))
)
rd$NumberGrade
## [1] 96 82 97 74
rd$LetterGrade
## [1] A B A C
## Levels: A B C
# rd from last slide
dim(rd); levels(rd$LetterGrade)
## [1] 4 3
## [1] "A" "B" "C"
summary(rd)
##    Student           NumberGrade    LetterGrade
##  Length:4           Min.   :74.00   A:2
##  Class :character   1st Qu.:80.00   B:1
##  Mode  :character   Median :89.00   C:1
##                     Mean   :87.25
##                     3rd Qu.:96.25
##                     Max.   :97.00
summary(rd$NumberGrade)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   74.00   80.00   89.00   87.25   96.25   97.00
x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/color-cookbook-eg.txt',
               header=TRUE)   # The first line of the file holds the column names
head(x)
##   cond1 cond2 yval
## 1     A     I  2.0
## 2     A     J  2.5
## 3     A     K  1.6
## 4     A     L  0.8
## 5     B     I  2.2
## 6     B     J  2.4
Note: The default for the header argument depends on the function and on the file (read.table() infers it from the first row, while read.csv() assumes TRUE), so it is best to specify what you want
R’s read.table() function can handle a lot of things
https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv
x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
               sep=',',                 # Use a comma as a field separator
               header=FALSE,            # Don't use the first line as the header
               skip=10,                 # But skip the first 10 lines
               col.names=c('ProblemInstance','Trial','BestSolution'),
               colClasses =c('character', 'integer', 'numeric'),
               blank.lines.skip=TRUE,   # Skip blank lines
               comment.char=";")        # Treat text after ';' as a comment
head(x)
##   ProblemInstance Trial BestSolution
## 1              P1     0     13.55185
## 2              P1     1     13.79003
## 3              P1     3     14.68695
## 4              P1     4     13.52434
## 5              P1     5     13.55290
## 6              P1     6     14.03507
The read.csv function is simpler for standard, clean CSV files
The readxl library in R has the function read_excel, which reads Excel spreadsheets
The haven library in R has the functions read_sas, read_spss, and read_stata
The scan function just reads numeric data from a file into a vector

df <- data.frame(Instructor=factor(c("Wiegand", "Besmer", "Doman", "Scibelli")),
                 PresentationStyle=c(2.0, 1.6, 1.7, 1.8),
                 PreparedMaterials=c(1.5, 1.8, 1.6, 1.7),
                 Availability=c(2.0, 2.0, 2.0, 2.0))
df
##   Instructor PresentationStyle PreparedMaterials Availability
## 1    Wiegand               2.0               1.5            2
## 2     Besmer               1.6               1.8            2
## 3      Doman               1.7               1.6            2
## 4   Scibelli               1.8               1.7            2
library(reshape2)
melt(df, variable.name="Measure", value.name="Score")
## Using Instructor as id variables
##    Instructor           Measure Score
## 1     Wiegand PresentationStyle   2.0
## 2      Besmer PresentationStyle   1.6
## 3       Doman PresentationStyle   1.7
## 4    Scibelli PresentationStyle   1.8
## 5     Wiegand PreparedMaterials   1.5
## 6      Besmer PreparedMaterials   1.8
## 7       Doman PreparedMaterials   1.6
## 8    Scibelli PreparedMaterials   1.7
## 9     Wiegand      Availability   2.0
## 10     Besmer      Availability   2.0
## 11      Doman      Availability   2.0
## 12   Scibelli      Availability   2.0
Python's pandas library provides structures similar to R's data.frame, and is conventionally imported as pd:
import pandas as pd
my_series = pd.Series([3.14, 2.78, 1.0, 5])
print(my_series)
## 0    3.14
## 1    2.78
## 2    1.00
## 3    5.00
## dtype: float64
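The output above prints the integer 5 as 5.00 because a Series holds a single dtype. The following is a minimal sketch of that behavior; the variable name mixed is illustrative and not from the slides.

```python
import pandas as pd

# A list mixing floats and one int: pandas stores everything as float64.
mixed = pd.Series([3.14, 2.78, 1.0, 5])

print(mixed.dtype)    # the common dtype chosen for all elements
print(mixed.iloc[3])  # the former int, now stored as a float
```

If integer values need to stay integers, they are usually kept in their own column rather than mixed into a floating-point one.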
The DataFrame structure:
df = pd.DataFrame({
    "Student":["Bob", "Sue", "Cat", "Lin"],
    "NumberGrade":pd.Series([96, 82, 97, 74]),
    "LetterGrade":pd.Categorical(["A","B","A","C"])
})
df
##   Student  NumberGrade LetterGrade
## 0     Bob           96           A
## 1     Sue           82           B
## 2     Cat           97           A
## 3     Lin           74           C
df.columns
## Index(['Student', 'NumberGrade', 'LetterGrade'], dtype='object')
df.dtypes
## Student          object
## NumberGrade       int64
## LetterGrade    category
## dtype: object
df.shape
## (4, 3)
df["Student"]
## 0    Bob
## 1    Sue
## 2    Cat
## 3    Lin
## Name: Student, dtype: object
df["NumberGrade"].describe()
## count     4.000000
## mean     87.250000
## std      11.176612
## min      74.000000
## 25%      80.000000
## 50%      89.000000
## 75%      96.250000
## max      97.000000
## Name: NumberGrade, dtype: float64
df["LetterGrade"].cat.categories # Levels of the factor
## Index(['A', 'B', 'C'], dtype='object')
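As a rough counterpart to the per-level counts that R's summary() prints for a factor, the sketch below counts how often each category occurs. It rebuilds the same roster so that it runs on its own; the names roster and counts are illustrative and not from the slides.

```python
import pandas as pd

# Rebuild the roster from the slides so this block is self-contained.
roster = pd.DataFrame({
    "Student": ["Bob", "Sue", "Cat", "Lin"],
    "NumberGrade": [96, 82, 97, 74],
    "LetterGrade": pd.Categorical(["A", "B", "A", "C"]),
})

# Count occurrences of each category, then order the result by category.
counts = roster["LetterGrade"].value_counts().sort_index()
print(counts)
```

The counts should agree with the A:2, B:1, C:1 column in the R summary shown earlier.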
pd.read_csv("foo.csv") to read from a CSV file
pd.read_excel("foo.xlsx") to read from an Excel file
pd.read_sas("foo.sas7bdat") to read from a SAS file
pd.read_spss("foo.sav") to read from an SPSS file

import numpy as np
df = pd.read_csv('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
                 sep=',',                   # Use a comma as a field separator
                 header=0,                  # First row read is treated as a header and replaced by names
                                            # (header=None would keep that row as data)
                 skiprows=list(range(10)),  # But skip the first 10 lines
                 names=['ProblemInstance','Trial','BestSolution'],
                 skip_blank_lines=True,     # Skip blank lines
                 comment=";")               # Treat text after ';' as a comment
df.head()
##   ProblemInstance  Trial  BestSolution
## 0              P1      1     13.790031
## 1              P1      3     14.686953
## 2              P1      4     13.524335
## 3              P1      5     13.552902
## 4              P1      6     14.035066
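The pandas output above starts at Trial 1, while the R output for the same file starts at Trial 0. A likely explanation is how header=0 interacts with names. The sketch below illustrates this on a small made-up CSV held in memory (the values in raw are hypothetical, not taken from the course file).

```python
import io
import pandas as pd

# Hypothetical three-line CSV with no header row (not the course file).
raw = "P1,0,13.5\nP1,1,13.8\nP1,3,14.7\n"
cols = ["ProblemInstance", "Trial", "BestSolution"]

# header=0: the first row is consumed as a header, then replaced by `names`.
dropped = pd.read_csv(io.StringIO(raw), header=0, names=cols)

# header=None: every row is data, and `names` supplies the column labels.
kept = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(len(dropped), len(kept))
```

Here dropped has one row fewer than kept, so header=None is the closer match to header=FALSE in R when the file has no header line.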
df = pd.DataFrame({
    "Instructor":pd.Categorical(["Wiegand", "Besmer", "Doman", "Scibelli"]),
    "PresentationStyle":pd.Series([2.0, 1.6, 1.7, 1.8]),
    "PreparedMaterials":pd.Series([1.5, 1.8, 1.6, 1.7]),
    "Availability":pd.Series([2.0, 2.0, 2.0, 2.0])
})
df
##   Instructor  PresentationStyle  PreparedMaterials  Availability
## 0    Wiegand                2.0                1.5           2.0
## 1     Besmer                1.6                1.8           2.0
## 2      Doman                1.7                1.6           2.0
## 3   Scibelli                1.8                1.7           2.0
df.melt(id_vars=['Instructor'], var_name="Measure", value_name="Score")
##    Instructor            Measure  Score
## 0     Wiegand  PresentationStyle    2.0
## 1      Besmer  PresentationStyle    1.6
## 2       Doman  PresentationStyle    1.7
## 3    Scibelli  PresentationStyle    1.8
## 4     Wiegand  PreparedMaterials    1.5
## 5      Besmer  PreparedMaterials    1.8
## 6       Doman  PreparedMaterials    1.6
## 7    Scibelli  PreparedMaterials    1.7
## 8     Wiegand       Availability    2.0
## 9      Besmer       Availability    2.0
## 10      Doman       Availability    2.0
## 11   Scibelli       Availability    2.0
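One reason the long layout can be convenient is that a grouped summary needs only one expression, however many measures there are. The sketch below rebuilds the instructor table and averages the scores per measure; the names wide, long_df, and mean_by_measure are illustrative and not from the slides.

```python
import pandas as pd

# Rebuild the wide table from the slides so this block is self-contained.
wide = pd.DataFrame({
    "Instructor": ["Wiegand", "Besmer", "Doman", "Scibelli"],
    "PresentationStyle": [2.0, 1.6, 1.7, 1.8],
    "PreparedMaterials": [1.5, 1.8, 1.6, 1.7],
    "Availability": [2.0, 2.0, 2.0, 2.0],
})

# Wide to long: one row per (Instructor, Measure) pair.
long_df = wide.melt(id_vars=["Instructor"], var_name="Measure", value_name="Score")

# One group per measure; the same line works for any number of measures.
mean_by_measure = long_df.groupby("Measure")["Score"].mean()
print(mean_by_measure)
```

In the wide layout the same summary would require naming each measure column explicitly.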