Spring 2025
Data collected on students in a statistics class on a variety of variables:
| Student | Gender | Intro/Extra | … | Dread |
|---|---|---|---|---|
| 1 | male | extravert | \(\cdots\) | 3 |
| 2 | female | extravert | \(\cdots\) | 2 |
| 3 | female | introvert | \(\cdots\) | 4 |
| 4 | female | extravert | \(\cdots\) | 2 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| 86 | male | extravert | \(\cdots\) | 3 |
To be able to conduct useful analysis, data must:
To produce the most meaningful results, data should also:
Often, we run an experiment ourselves and collect the empirical results directly
When we are organizing our own data, we should choose to lay it out and store it in a way that makes later analysis and visualization as easy as possible
Sometimes you must gather data that was not produced by your own project
Common data storage formats include:
Dealing with data is usually the most challenging part.
As it is produced, data often
Student = c("Bob", "Sue", "Cat", "Lin")
NumberGrade = c(96, 82, 97, 74)
LetterGrade = factor(c("A","B","A","C"))
RosterData = data.frame(Student,NumberGrade,LetterGrade)
RosterData
##   Student NumberGrade LetterGrade
## 1     Bob          96           A
## 2     Sue          82           B
## 3     Cat          97           A
## 4     Lin          74           C
rd = data.frame( Student = c("Bob", "Sue", "Cat", "Lin"),
NumberGrade = c(96, 82, 97, 74),
LetterGrade = factor(c("A","B","A","C")) )
rd$NumberGrade
## [1] 96 82 97 74
rd$LetterGrade
## [1] A B A C
## Levels: A B C
# rd from last slide
dim(rd); levels(rd$LetterGrade)
## [1] 4 3
## [1] "A" "B" "C"
summary(rd)
##    Student           NumberGrade    LetterGrade
##  Length:4           Min.   :74.00   A:2
##  Class :character   1st Qu.:80.00   B:1
##  Mode  :character   Median :89.00   C:1
##                     Mean   :87.25
##                     3rd Qu.:96.25
##                     Max.   :97.00
summary(rd$NumberGrade)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   74.00   80.00   89.00   87.25   96.25   97.00
x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/color-cookbook-eg.txt',
header=TRUE)
head(x)
##   cond1 cond2 yval
## 1     A     I  2.0
## 2     A     J  2.5
## 3     A     K  1.6
## 4     A     L  0.8
## 5     B     I  2.2
## 6     B     J  2.4
Note: If header is omitted, read.table() guesses it from the file (it is set to TRUE only when the first row has one fewer field than the rest), so it is best to specify what you want
R’s read.table() function can handle many variations in input format
https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv
x = read.table('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
sep=',', # Use a comma as a field separator
header=FALSE, # Don't use the first line as the header
skip=10, # But skip the first 10 lines
col.names=c('ProblemInstance','Trial','BestSolution'),
colClasses =c('character', 'integer', 'numeric'),
blank.lines.skip=TRUE, # Skip blank lines
comment.char=";")
head(x)
##   ProblemInstance Trial BestSolution
## 1              P1     0     13.55185
## 2              P1     1     13.79003
## 3              P1     3     14.68695
## 4              P1     4     13.52434
## 5              P1     5     13.55290
## 6              P1     6     14.03507
- The read.csv function, which is simpler for standard, clean CSV files
- The readxl library in R has the function read_excel that will read Excel spreadsheets
- The haven library in R has the functions read_sas, read_spss, and read_stata
- The scan function will, by default, just read numeric data into a vector out of a file
df <- data.frame(Instructor=factor(c("Wiegand", "Besmer", "Doman", "Scibelli")),
PresentationStyle=c(2.0, 1.6, 1.7, 1.8),
PreparedMaterials=c(1.5, 1.8, 1.6, 1.7),
Availability=c(2.0, 2.0, 2.0, 2.0))
df
##   Instructor PresentationStyle PreparedMaterials Availability
## 1    Wiegand               2.0               1.5            2
## 2     Besmer               1.6               1.8            2
## 3      Doman               1.7               1.6            2
## 4   Scibelli               1.8               1.7            2
library(reshape2)
melt(df, variable.name="Measure", value.name="Score")
## Using Instructor as id variables
##    Instructor           Measure Score
## 1     Wiegand PresentationStyle   2.0
## 2      Besmer PresentationStyle   1.6
## 3       Doman PresentationStyle   1.7
## 4    Scibelli PresentationStyle   1.8
## 5     Wiegand PreparedMaterials   1.5
## 6      Besmer PreparedMaterials   1.8
## 7       Doman PreparedMaterials   1.6
## 8    Scibelli PreparedMaterials   1.7
## 9     Wiegand      Availability   2.0
## 10     Besmer      Availability   2.0
## 11      Doman      Availability   2.0
## 12   Scibelli      Availability   2.0
The pandas library gives Python a structure comparable to R's data.frame; it is conventionally imported as pd:
import pandas as pd
my_series = pd.Series([3.14, 2.78, 1.0, 5])
print(my_series)
## 0    3.14
## 1    2.78
## 2    1.00
## 3    5.00
## dtype: float64
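The float64 dtype in the output above arises because a Series holds a single dtype, so the integer 5 is upcast along with the floats. A minimal sketch of that behavior follows; the names `ints`, `mixed`, and `messy` are illustrative and do not come from the slides.

```python
import pandas as pd

# All-integer input keeps an integer dtype
ints = pd.Series([1, 2, 3])

# Mixing ints and floats upcasts every element to float64
mixed = pd.Series([3.14, 2.78, 1.0, 5])

# Mixing numbers and strings falls back to the generic object dtype
messy = pd.Series([3.14, "n/a", 5])

print(ints.dtype, mixed.dtype, messy.dtype)
```

An object-dtype column is usually a sign that a numeric field contains stray text and needs cleaning before analysis.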
DataFrame structure:
df = pd.DataFrame({
"Student":["Bob", "Sue", "Cat", "Lin"],
"NumberGrade":pd.Series([96, 82, 97, 74]),
"LetterGrade":pd.Categorical(["A","B","A","C"])
})
df
##   Student  NumberGrade LetterGrade
## 0     Bob           96           A
## 1     Sue           82           B
## 2     Cat           97           A
## 3     Lin           74           C
df.columns
## Index(['Student', 'NumberGrade', 'LetterGrade'], dtype='object')
df.dtypes
## Student          object
## NumberGrade       int64
## LetterGrade    category
## dtype: object
df.shape
## (4, 3)
df["Student"]
## 0    Bob
## 1    Sue
## 2    Cat
## 3    Lin
## Name: Student, dtype: object
df["NumberGrade"].describe()
## count     4.000000
## mean     87.250000
## std      11.176612
## min      74.000000
## 25%      80.000000
## 50%      89.000000
## 75%      96.250000
## max      97.000000
## Name: NumberGrade, dtype: float64
df["LetterGrade"].cat.categories # Levels of the factor
## Index(['A', 'B', 'C'], dtype='object')
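As a rough counterpart to the per-level counts that R's summary() prints for a factor, a categorical column can be tallied and inspected through its integer codes. The sketch below rebuilds the LetterGrade column from this slide on its own; the names `grades`, `counts`, and `codes` are illustrative.

```python
import pandas as pd

grades = pd.Series(pd.Categorical(["A", "B", "A", "C"]))

# Number of observations per category, ordered by category
counts = grades.value_counts().sort_index()
print(counts)

# Each value is stored as an integer code that indexes into the categories
codes = grades.cat.codes.tolist()
print(codes)  # [0, 1, 0, 2]
```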
- pd.read_csv("foo.csv") to read from a CSV file
- pd.read_excel("foo.xlsx") to read from an Excel file
- pd.read_sas("foo.sas7bdat") to read from a SAS file
- pd.read_spss("foo.sav") to read from an SPSS file
import numpy as np
df = pd.read_csv('https://faculty.winthrop.edu/wiegandrp/teaching/csci296/L8-knapsack-data.csv',
sep=',', # Use a comma as a field separator
header=0, # With names= given, header=0 discards the first remaining line (here the Trial 0 row); header=None would keep it
skiprows=list(range(10)), # But skip the first 10 lines
names=['ProblemInstance','Trial','BestSolution'],
skip_blank_lines=True, # Skip blank lines
comment=";")
df.head()
##   ProblemInstance  Trial  BestSolution
## 0              P1      1     13.790031
## 1              P1      3     14.686953
## 2              P1      4     13.524335
## 3              P1      5     13.552902
## 4              P1      6     14.035066
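To see how header, names, skiprows, and comment interact without depending on the remote file, the sketch below parses a small in-memory stand-in. The text in `raw` is invented for illustration and is not the contents of L8-knapsack-data.csv; `common`, `keep_all`, and `drop_first` are likewise illustrative names.

```python
import io
import pandas as pd

# Hypothetical stand-in file: two preamble lines, a comment line,
# then data rows with one blank line in between
raw = """knapsack results
generated for illustration
; this line is a comment
P1,0,13.5

P1,1,13.8
P1,2,14.7
"""

common = dict(
    skiprows=2,              # skip the two preamble lines
    names=["ProblemInstance", "Trial", "BestSolution"],
    comment=";",             # ignore anything after a semicolon
    skip_blank_lines=True,
)

# header=None: every remaining line is treated as data
keep_all = pd.read_csv(io.StringIO(raw), header=None, **common)

# header=0: the first remaining line is treated as a header row
# and replaced by `names`, so it never appears as data
drop_first = pd.read_csv(io.StringIO(raw), header=0, **common)

print(keep_all)
print(drop_first)
```

Comparing the two frames shows that header=0 combined with names silently costs one row, which is worth checking whenever a file has no header line of its own.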
df = pd.DataFrame({
"Instructor":pd.Categorical(["Wiegand", "Besmer", "Doman", "Scibelli"]),
"PresentationStyle":pd.Series([2.0, 1.6, 1.7, 1.8]),
"PreparedMaterials":pd.Series([1.5, 1.8, 1.6, 1.7]),
"Availability":pd.Series([2.0, 2.0, 2.0, 2.0])
})
df
##   Instructor  PresentationStyle  PreparedMaterials  Availability
## 0    Wiegand                2.0                1.5           2.0
## 1     Besmer                1.6                1.8           2.0
## 2      Doman                1.7                1.6           2.0
## 3   Scibelli                1.8                1.7           2.0
df.melt(id_vars=['Instructor'], var_name="Measure", value_name="Score")
##    Instructor            Measure  Score
## 0     Wiegand  PresentationStyle    2.0
## 1      Besmer  PresentationStyle    1.6
## 2       Doman  PresentationStyle    1.7
## 3    Scibelli  PresentationStyle    1.8
## 4     Wiegand  PreparedMaterials    1.5
## 5      Besmer  PreparedMaterials    1.8
## 6       Doman  PreparedMaterials    1.6
## 7    Scibelli  PreparedMaterials    1.7
## 8     Wiegand       Availability    2.0
## 9      Besmer       Availability    2.0
## 10      Doman       Availability    2.0
## 11   Scibelli       Availability    2.0
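Once the data is in long form, per-measure summaries and the reverse reshape are both short. The sketch below rebuilds the instructor table from this slide so it runs on its own; `long_df`, `means`, and `wide_again` are illustrative names, and pivot is offered as one possible inverse of melt rather than as the slides' own method.

```python
import pandas as pd

df = pd.DataFrame({
    "Instructor": ["Wiegand", "Besmer", "Doman", "Scibelli"],
    "PresentationStyle": [2.0, 1.6, 1.7, 1.8],
    "PreparedMaterials": [1.5, 1.8, 1.6, 1.7],
    "Availability": [2.0, 2.0, 2.0, 2.0],
})

# Wide -> long: one row per (Instructor, Measure) pair
long_df = df.melt(id_vars=["Instructor"], var_name="Measure", value_name="Score")

# Long form makes a grouped summary a single expression
means = long_df.groupby("Measure")["Score"].mean()
print(means)

# Long -> wide: one row per instructor, one column per measure
wide_again = long_df.pivot(index="Instructor", columns="Measure", values="Score")
print(wide_again)
```

Plotting libraries that facet or color by a grouping column generally expect the long layout, which is one reason to melt before visualizing.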