Spring 2025

Data Analysis

Analysis

  • Data Analysis is the act of trying to find utility or meaning in data
  • During analysis, we are typically attempting to summarize data, draw conclusions from it, make predictions, or discover patterns in it
  • Data analysis is a broad idea that encompasses many tasks, techniques, and objectives
  • But a key aspect of it typically entails some form of modeling

What is a Model?

  • A model is an abstraction or generalization of something that captures salient properties to study
  • All models are wrong, but good models are useful
  • We model because it is difficult to understand reality in toto
  • With data, models typically involve summarization at some level
  • E.g., There is no such thing as “an average height of humans”, but knowing the average height can be useful and fully comprehending the entire population of heights of humans is not possible
  • The process of finding a model that represents the data as well as possible is called fitting
  • A model can also (sometimes) be seen as a kind of explanation for the data

Explaining vs. Observing

  • In science, we observe phenomena – We record or note what we see, hear, smell, taste, feel, or detect through some apparatus
  • We can do a lot of things with that data, including just summarize it
  • An explanation is something that provides a (typically causal) reason that we observe those things
  • A hypothesis is a (testable) possible explanation for what we are seeing
  • A theory is the explanation we have for what we are observing that is best supported by evidentiary data
  • Theories can be descriptive, predictive, or prescriptive – and they also often involve modeling to accomplish those
  • This is why statistics is so important in science

Statistics, Machine Learning, Data Analytics! Oh My!

  • Statistics – Practice or science of collecting and analyzing data, especially for the purpose of making inferences about a population based on a sample
  • Machine Learning – The use of existing data to build a generalized model to apply to new, unseen data
  • Data Analytics – The process of analyzing data to make informed decisions and solve problems
  • In practice, though, there’s very little difference
  • All of them involve building models to try to generalize from information we have
  • They often use exactly the same techniques and are applied to the same kinds of tasks

Some Typical Modeling Tasks

  • Hypothesis Testing – Determining whether a hypothesis is a likely explanation for observed data
  • Classification – Build a model to predict categories of data
  • Regression – Build a model to show the functional relationship between data variables
  • Clustering – Build a model to group “similar” data
  • Dimension Reduction – Build a model to transform data so that fewer dimensions are used, but as much information as possible still remains
  • Reinforcement Learning – Build a behavior model that performs well on some task

Statistical Modeling in R

Hypothesis Testing in R

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test

  • HA: Cars get less than 22 miles per gallon
  • H0: Cars tend to get 22 miles per gallon or more
  • The \(p\)-value is less than 5%, so we reject the null hypothesis and conclude cars get less than 22 mpg
t.test(mtcars$mpg, mu=22, alternative="less", conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = -1.7921, df = 31, p-value = 0.04144
## alternative hypothesis: true mean is less than 22
## 95 percent confidence interval:
##      -Inf 21.89707
## sample estimates:
## mean of x 
##  20.09062

Models in R

  • We can use R to construct statistical models

  • Typically, we have to tell R what the response and explanatory variables of the model are using something called a formula

  • response variable ~ explanatory variables

  • For example:

    • \(y \sim x\)
    • \(y \sim x + z + w\)
    • \(y \sim \;.\)
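A formula is an ordinary R object, and most modeling functions accept one. A minimal sketch (using the built-in mtcars data, chosen purely for illustration):

f <- mpg ~ wt + hp        # response mpg explained by weight and horsepower
class(f)                  # "formula"
fit <- lm(f, data=mtcars) # lm() accepts the formula plus the data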

Two-Way Analysis of Variance (ANOVA)

What matters for predicting MPG, number of gears or manual vs. automatic?

mtcars_fit <- aov(mtcars$mpg ~  factor(mtcars$gear) * factor(mtcars$am)) 
summary(mtcars_fit) 
##                     Df Sum Sq Mean Sq F value   Pr(>F)    
## factor(mtcars$gear)  2  483.2  241.62  11.869 0.000185 ***
## factor(mtcars$am)    1   72.8   72.80   3.576 0.069001 .  
## Residuals           28  570.0   20.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Gears have a statistically significant impact at the 5% level, but manual vs. automatic doesn’t. (No interaction row appears even though the model used *: in mtcars all 3-gear cars are automatic and all 5-gear cars are manual, so the interaction term is aliased and aov() drops it.)
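To see the direction of the gear effect, a quick follow-up (not on the original slide) is to compare the group means:

# Mean mpg within each gear group
aggregate(mpg ~ gear, data=mtcars, FUN=mean)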

How Good Does a Linear Model Fit?

  • One way to estimate how well a model fits is the coefficient of determination
  • It is computed by examining the inter- and intra-variation of the variables
  • Uni-variate model (explanatory variable \(x\) vs. response variable \(y\)):

\[ r = \frac{n\left(\sum_{i=1}^{n} x_i y_i\right) - \left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)} {\sqrt{\left(n\sum_{i=1}^{n}x_i^2 -\left(\sum_{i=1}^{n}x_i\right)^2\right) \left(n\sum_{i=1}^{n}y_i^2 -\left(\sum_{i=1}^{n}y_i\right)^2\right)}}\]

  • Typically this is squared to give \(r^2\), a number between 0 and 1, where larger values suggest a stronger correlation
  • \(r^2\) is basically a measure of explained variation over total variation
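As a sanity check, here is a sketch that computes \(r\) directly from the formula above (using the built-in Orange data, which reappears on the next slides) and compares it against R’s cor():

x <- Orange$circumference; y <- Orange$age; n <- length(x)
r <- (n*sum(x*y) - sum(x)*sum(y)) /
     sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
c(r=r, r.squared=r^2, builtin=cor(x, y))  # the two values of r agree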

Back to Linear Models in R, p.1

  • For linear models, we use the function lm()

  • lm() takes the data set and the formula, then returns a model

fit = lm(data=Orange,formula= age ~ circumference)
summary(fit)

Back to Linear Models in R, p.2

## 
## Call:
## lm(formula = age ~ circumference, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.88 -140.90  -17.20   96.54  471.16 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    16.6036    78.1406   0.212    0.833    
## circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

Issues with R^2

  • \(r^2\) is an ad-hoc measure of the variance between the actual and predicted values from the model

  • It can’t tell you whether the estimates are biased, meaning whether there are systematic errors in the predictions

    • For this, you have to look at the residuals
  • The more variables you add, the higher it gets, regardless of whether those variables actually improve the model

    • For multi-variate models, you should use the adjusted coefficient of determination
  • You can have a high \(r^2\) for a bad model or a low \(r^2\) for a good model

    • For one, the data can be non-linearly related
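A classic illustration of that last point is Anscombe’s quartet, which ships with R: four datasets with nearly identical \(r^2\) values but very different shapes, one of them clearly non-linear:

# r^2 for each of the four x/y pairs in the built-in anscombe data
sapply(1:4, function(i)
  summary(lm(anscombe[[paste0("y",i)]] ~ anscombe[[paste0("x",i)]]))$r.squared)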

Residuals Plot

## `geom_smooth()` using formula = 'y ~ x'

[Figure: residuals vs. fitted values, with a smoothed trend line]
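A sketch of how such a plot can be produced (the original plotting code is not shown, so reusing the Orange model from the earlier slides is an assumption):

library(ggplot2)
fit <- lm(age ~ circumference, data=Orange)
resids <- data.frame(fitted=fitted(fit), residual=resid(fit))
ggplot(resids, aes(x=fitted, y=residual)) +
  geom_point() +
  geom_smooth()   # this call emits the message shown above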

Multivariate Adjusted R^2

# using the built-in mtcars dataset
summary(lm(mpg ~ hp, data=mtcars))$adj.r.squared
## [1] 0.5891853
summary(lm(mpg ~ hp + wt, data=mtcars))$adj.r.squared
## [1] 0.8148396

Some Data is Obviously Not Linear

Notice Something about Polynomial Fits:

  • Suppose I have explanatory variable \(x\), as well as a variable \(y\) that depends on \(x\) by a quadratic function

  • This means that there’s some function: \[y = \alpha_0 + \alpha_1 x^2 + \alpha_2 x\]

  • But this is just like a linear function if \(x^2\) were its own variable, say \(z = x^2\): \[y = \alpha_0 + \alpha_1 z + \alpha_2 x\]

Notice Something about Polynomial Fits (2):

  • So we could rework our formula to “trick” R into fitting a polynomial model (note: ^ has a special meaning inside R formulas, so powers must be wrapped in I()):
fit = lm(data=hatcolor, formula= y ~ I(x^2) + x)
  • Actually, we could use this trick for all sorts of non-linear functions: \[y = \alpha_0 + \alpha_1 \sin(x) + \alpha_2 x + \alpha_3 x^3\]
fit = lm(data=hatcolor, formula= y ~ sin(x) + x + I(x^3))
  • For polynomials, R gives us a function, poly() so we can be more general:
fit = lm(data=hatcolor,formula= y ~ poly(x,2)); summary(fit)

Use Non-Linear Models for Non-Linear Data

hatcolorURL = "http://cs.ucf.edu/~wiegand/ids6938/datasets/hatcolor.csv"
hatcolor = read.table(hatcolorURL,header=TRUE)

summary(lm(data=hatcolor,formula=Coolitude ~ HatLightness))$r.squared
## [1] 0.005302867
summary(lm(data=hatcolor,formula=Coolitude ~ HatLightness))$adj.r.squared
## [1] -0.01541999
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$r.squared
## [1] 0.8888319
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$adj.r.squared
## [1] 0.8841013

Isn’t That Better?

Model Prediction

# Build the Model
carModel = lm(data=mtcars, formula=mpg ~ wt + factor(cyl))

# Create the dataset over which to make predictions
newCarData = data.frame(wt=c(2.5,1.9),
                        cyl=factor(c(4,6), levels= levels(factor(mtcars$cyl))))

# Make predictions
newCarData$mpg = predict(carModel, newdata=newCarData)

print(newCarData)
##    wt cyl      mpg
## 1 2.5   4 25.97676
## 2 1.9   6 23.64455

Generalized Linear Models

  • The lm() function assumes the response variable deviates from the linear function of the explanatory variables by normally distributed error

  • I.e., \(y = \alpha_0 + \left(\sum_{i=1}^n \alpha_i x_i\right) + \epsilon\), where \(\epsilon \sim N(0,\sigma)\)

  • That is, that points more or less follow the model except for an additive error that is both Normal and i.i.d.

  • But what if the distribution is different?

  • Generalized linear models extend traditional methods to allow the mean to depend on the explanatory variable through a link function, and use different kinds of distributions

  • This is particularly significant when the response variable is categorical, rather than numeric

  • In R, you can use glm() to perform such modeling
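As a sketch of a non-Gaussian case (distinct from the logistic example on the next slide), here is a Poisson regression with a log link for count data, using the built-in warpbreaks data:

# breaks is a count, so model it with a Poisson GLM
poisFit <- glm(breaks ~ wool + tension, data=warpbreaks, family=poisson)
summary(poisFit)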

Logistic Regression, p.1

  • Suppose you have a dataset with different student applicants to a graduate program, each with verbal, quantitative, and analytical scores on the GRE, as well as a GPA value and several other numeric measures

  • You want a predictive model as to whether a given applicant will be accepted or not

  • The response variable is categorical (1 or 0, accepted or not) and the explanatory variables are numeric

  • You can’t just use regular linear regression; instead, you use logistic regression (“logit”), which models the response variable using a binomial distribution

acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)

Logistic Regression, p.2

We can do a number of things with such a model:

  • Use it as a classifier (e.g., predict acceptance when the modeled probability exceeds 0.5)

  • Get the coefficients for the model

  • Get confidence intervals for the different explanatory variables for acceptance

  • Predict the probability that a new point would be accepted

acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)
coef(acceptModel)
exp(confint.default(acceptModel))
predict.glm(acceptModel,
            data.frame(verbal=155,quant=160,analytic=4,gpa=3.6),
            type="response",
            se.fit=TRUE)

Other Generalizations

  • You can weight different points in the data set differently (i.e., weighted linear regression) by giving lm() or glm() a weights vector

  • You can use other non-linear models (e.g., an exponential function of the explanatory variables)

  • You can use local regression, loess(), which uses a \(k\)-nearest neighbor type approach to produce a piecewise-continuous curve that fits points as well as possible in local regions of the space
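For the first of those, a minimal weighted-regression sketch (the weights here are arbitrary, purely to illustrate the argument):

set.seed(1)
w <- runif(nrow(mtcars))                      # made-up observation weights
wfit <- lm(mpg ~ wt, data=mtcars, weights=w)  # weighted least squares
coef(wfit)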

LOESS Modeling

library(dplyr,quietly=TRUE, warn.conflicts=FALSE)

# Get population by state and build a DF of year and population
popbystate <- read.csv("population-by-year.csv", header=F, col.names=c("State", "Year", "Population"))
population <- summarise(group_by(filter(popbystate, Year>=1986, Year<=2023), Year),      
                                 Total=sum(Population))

# Get crime by year and build a DF of year and violent crime rate
crimebyyear <- read.csv("fbi-violentcrime-1986-2025.csv", header=T)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## 'fbi-violentcrime-1986-2025.csv'
crime <- data.frame(Year=as.numeric(gsub("X", "", colnames(crimebyyear)[2:39])), 
                    Count=as.numeric(crimebyyear[1,2:39]))
 
# Merge these, then compute the rate per 100K
df <- mutate(merge(crime, population), Rate=100000*Count/Total)

crimeModel = loess(data=df, formula=Rate ~ Year)

LOESS Visualization
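A sketch of the visualization step (the original plotting code is not shown): scatter the yearly rates and overlay the fitted LOESS curve:

plot(df$Year, df$Rate, pch=19,
     xlab="Year", ylab="Violent crime rate per 100K")
lines(df$Year, predict(crimeModel), col="blue", lwd=2)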