Spring 2025

Data Analysis

Analysis

  • Data Analysis is the act of trying to find utility or meaning in data
  • During analysis, we are typically attempting to summarize data, draw conclusions from it, make predictions, or discover patterns in it
  • Data analysis is a broad idea that encompasses many tasks, techniques, and objectives
  • But a key aspect of it typically entails some form of modeling

What is a Model?

  • A model is an abstraction or generalization of something that captures salient properties to study
  • All models are wrong, but good models are useful
  • We model because it is difficult to understand reality in toto
  • With data, models typically involve summarization at some level
  • E.g., There is no such thing as “an average height of humans”, but knowing the average height can be useful and fully comprehending the entire population of heights of humans is not possible
  • The process of finding a model that represents the data as well as possible is called fitting
  • A model can also (sometimes) be seen as a kind of explanation for the data

Explaining vs. Observing

  • In science, we observe phenomena – We record or note what we see, hear, smell, taste, feel, or detect through some apparatus
  • We can do a lot of things with that data, including just summarize it
  • An explanation is something that provides a (typically causal) reason that we observe those things
  • A hypothesis is a (testable) possible explanation for what we are seeing
  • A theory is the explanation we have for what we are observing that is best supported by evidentiary data
  • Theories can be descriptive, predictive, or prescriptive – and they also often involve modeling to accomplish those
  • This is why statistics is so important in science

Statistics, Machine Learning, Data Analytics! Oh My!

  • Statistics – Practice or science of collecting and analyzing data, especially for the purpose of making inferences about a population based on a sample
  • Machine Learning – The use of existing data to build a generalized model to apply to new, unseen data
  • Data Analytics – The process of analyzing data to make informed decisions and solve problems
  • In practice, though, there’s very little difference
  • All of them involve building models to try to generalize from information we have
  • They often use exactly the same techniques and are applied to the same kinds of tasks

Some Typical Modeling Tasks

  • Hypothesis Testing – Determining whether a hypothesis is a likely explanation for observed data
  • Classification – Build a model to predict categories of data
  • Regression – Build a model to show the functional relationship between data variables
  • Clustering – Build a model to group “similar” data
  • Dimension Reduction – Build a model to transform data so that fewer dimensions are used, but as much information as possible still remains
  • Reinforcement Learning – Build a behavior model that performs well on some task

Statistical Modeling in R

Hypothesis Testing in R

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test

  • HA: Cars get less than 22 miles per gallon
  • H0: Cars tend to get 22 miles per gallon or more
  • The \(p\)-value is less than 5%, so we reject the null hypothesis and conclude cars get less than 22 mpg
t.test(mtcars$mpg, mu=22, alternative="less", conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = -1.7921, df = 31, p-value = 0.04144
## alternative hypothesis: true mean is less than 22
## 95 percent confidence interval:
##      -Inf 21.89707
## sample estimates:
## mean of x 
##  20.09062

Models in R

  • We can use R to construct statistical models

  • Typically, we have to tell R what the response and explanatory variables of the model are using something called a formula

  • response variable ~ explanatory variables

  • For example:

    • \(y \sim x\)
    • \(y \sim x + z + w\)
    • \(y \sim \;.\)
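A formula is an ordinary R object, and most modeling functions accept one. A minimal sketch (using the built-in mtcars data, chosen purely for illustration):

f <- mpg ~ wt + hp        # response mpg explained by weight and horsepower
class(f)                  # "formula"
fit <- lm(f, data=mtcars) # lm() accepts the formula plus the data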

Two-Way Analysis of Variance (ANOVA)

What matters for predicting MPG, number of gears or manual vs. automatic?

mtcars_fit <- aov(mtcars$mpg ~  factor(mtcars$gear) * factor(mtcars$am)) 
summary(mtcars_fit) 
##                     Df Sum Sq Mean Sq F value   Pr(>F)    
## factor(mtcars$gear)  2  483.2  241.62  11.869 0.000185 ***
## factor(mtcars$am)    1   72.8   72.80   3.576 0.069001 .  
## Residuals           28  570.0   20.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Gears have a statistically significant impact at the 5% level, but manual vs. automatic doesn’t. (No interaction row appears even though the model used *: in mtcars all 3-gear cars are automatic and all 5-gear cars are manual, so the interaction term is aliased and aov() drops it.)
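To see the direction of the gear effect, a quick follow-up (not on the original slide) is to compare the group means:

# Mean mpg within each gear group
aggregate(mpg ~ gear, data=mtcars, FUN=mean)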

How Good Does a Linear Model Fit?

  • One way to estimate how well a model fits is the coefficient of determination
  • It is computed by examining the inter- and intra-variation of the variables
  • Uni-variate model (explanatory variable \(x\) vs. response variable \(y\)):

\[ r = \frac{n\left(\sum_{i=1}^{n} x_i y_i\right) - \left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)} {\sqrt{\left(n\sum_{i=1}^{n}x_i^2 -\left(\sum_{i=1}^{n}x_i\right)^2\right) \left(n\sum_{i=1}^{n}y_i^2 -\left(\sum_{i=1}^{n}y_i\right)^2\right)}}\]

  • Typically this is squared to give \(r^2\), a number between 0 and 1, where larger values suggest a stronger correlation
  • \(r^2\) is basically a measure of explained variation over total variation
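As a sanity check, here is a sketch that computes \(r\) directly from the formula above (using the built-in Orange data, which reappears on the next slides) and compares it against R’s cor():

x <- Orange$circumference; y <- Orange$age; n <- length(x)
r <- (n*sum(x*y) - sum(x)*sum(y)) /
     sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
c(r=r, r.squared=r^2, builtin=cor(x, y))  # the two values of r agree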

Back to Linear Models in R, p.1

  • For linear models, we use the function lm()

  • lm() takes the data set and the formula, then returns a model

fit = lm(data=Orange,formula= age ~ circumference)
summary(fit)

Back to Linear Models in R, p.2

## 
## Call:
## lm(formula = age ~ circumference, data = Orange)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.88 -140.90  -17.20   96.54  471.16 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    16.6036    78.1406   0.212    0.833    
## circumference   7.8160     0.6059  12.900 1.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.1 on 33 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.8295 
## F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

Issues with R^2

  • \(r^2\) is an ad-hoc measure of the variance between the actual and predicted values from the model

  • It can’t tell you whether the estimates are biased, meaning whether there are systematic errors in the predictions

    • For this, you have to look at the residuals
  • The more variables you add, the higher it gets, regardless of whether those variables actually improve the model

    • For multi-variate models, you should use the adjusted coefficient of determination
  • You can have a high \(r^2\) for a bad model or a low \(r^2\) for a good model

    • For one, the data can be non-linearly related
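A classic illustration of that last point is Anscombe’s quartet, which ships with R: four datasets with nearly identical \(r^2\) values but very different shapes, one of them clearly non-linear:

# r^2 for each of the four x/y pairs in the built-in anscombe data
sapply(1:4, function(i)
  summary(lm(anscombe[[paste0("y",i)]] ~ anscombe[[paste0("x",i)]]))$r.squared)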

Residuals Plot

## `geom_smooth()` using formula = 'y ~ x'

[Figure: residuals vs. fitted values, with a smoothed trend line]
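A sketch of how such a plot can be produced (the original plotting code is not shown, so reusing the Orange model from the earlier slides is an assumption):

library(ggplot2)
fit <- lm(age ~ circumference, data=Orange)
resids <- data.frame(fitted=fitted(fit), residual=resid(fit))
ggplot(resids, aes(x=fitted, y=residual)) +
  geom_point() +
  geom_smooth()   # this call emits the message shown above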

Multivariate Adjusted R^2

# using the built-in mtcars dataset
summary(lm(mpg ~ hp, data=mtcars))$adj.r.squared
## [1] 0.5891853
summary(lm(mpg ~ hp + wt, data=mtcars))$adj.r.squared
## [1] 0.8148396

Some Data is Obviously Not Linear

Notice Something about Polynomial Fits:

  • Suppose I have explanatory variable \(x\), as well as a variable \(y\) that depends on \(x\) by a quadratic function

  • This means that there’s some function: \[y = \alpha_0 + \alpha_1 x^2 + \alpha_2 x\]

  • But this is just like a linear function if \(x^2\) were its own variable, say \(z = x^2\): \[y = \alpha_0 + \alpha_1 z + \alpha_2 x\]

Notice Something about Polynomial Fits (2):

  • So we could rework our formula to “trick” R into fitting a polynomial model (note: ^ has a special meaning inside R formulas, so powers must be wrapped in I()):
fit = lm(data=hatcolor, formula= y ~ I(x^2) + x)
  • Actually, we could use this trick for all sorts of non-linear functions: \[y = \alpha_0 + \alpha_1 \sin(x) + \alpha_2 x + \alpha_3 x^3\]
fit = lm(data=hatcolor, formula= y ~ sin(x) + x + I(x^3))
  • For polynomials, R gives us a function, poly() so we can be more general:
fit = lm(data=hatcolor,formula= y ~ poly(x,2)); summary(fit)

Use Non-Linear Models for Non-Linear Data

hatcolorURL = "http://cs.ucf.edu/~wiegand/ids6938/datasets/hatcolor.csv"
hatcolor = read.table(hatcolorURL,header=TRUE)

summary(lm(data=hatcolor,formula=Coolitude ~ HatLightness))$r.squared
## [1] 0.005302867
summary(lm(data=hatcolor,formula=Coolitude ~ HatLightness))$adj.r.squared
## [1] -0.01541999
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$r.squared
## [1] 0.8888319
summary(lm(data=hatcolor,formula=Coolitude ~ poly(HatLightness,2)))$adj.r.squared
## [1] 0.8841013

Isn’t That Better?

Model Prediction

# Build the Model
carModel = lm(data=mtcars, formula=mpg ~ wt + factor(cyl))

# Create the dataset over which to make predictions
newCarData = data.frame(wt=c(2.5,1.9),
                        cyl=factor(c(4,6), levels= levels(factor(mtcars$cyl))))

# Make predictions
newCarData$mpg = predict(carModel, newdata=newCarData)

print(newCarData)
##    wt cyl      mpg
## 1 2.5   4 25.97676
## 2 1.9   6 23.64455

Generalized Linear Models

  • The lm() function assumes the response variable deviates from the linear function of the explanatory variables by normally distributed error

  • I.e., \(y = \alpha_0 + \left(\sum_{i=1}^n \alpha_i x_i\right) + \epsilon\), where \(\epsilon \sim N(0,\sigma)\)

  • That is, that points more or less follow the model except for an additive error that is both Normal and i.i.d.

  • But what if the distribution is different?

  • Generalized linear models extend traditional methods to allow the mean to depend on the explanatory variable through a link function, and use different kinds of distributions

  • This is particularly significant when the response variable is categorical, rather than numeric

  • In R, you can use glm() to perform such modeling
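As a sketch of a non-Gaussian case (distinct from the logistic example on the next slide), here is a Poisson regression with a log link for count data, using the built-in warpbreaks data:

# breaks is a count, so model it with a Poisson GLM
poisFit <- glm(breaks ~ wool + tension, data=warpbreaks, family=poisson)
summary(poisFit)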

Logistic Regression, p.1

  • Suppose you have a dataset with different student applicants to a graduate program, each with verbal, quantitative, and analytical scores on the GRE, as well as a GPA value and several other numeric measures

  • You want a predictive model as to whether a given applicant will be accepted or not

  • The response variable is categorical (1 or 0, accepted or not) and the explanatory variables are numeric

  • You can’t just use regular linear regression; instead, you use logistic regression (“logit”), which models the response variable using a binomial distribution

acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)

Logistic Regression, p.2

We can do a number of things with such a model:

  • Use it as a classifier (e.g., predict acceptance when the modeled probability exceeds 0.5)

  • Get the coefficients for the model

  • Get confidence intervals for the different explanatory variables for acceptance

  • Predict the probability that a new point would be accepted

acceptModel = glm(data=somedata, formula = accept ~ ., family=binomial)
coef(acceptModel)
exp(confint.default(acceptModel))
predict.glm(acceptModel,
            data.frame(verbal=155,quant=160,analytic=4,gpa=3.6),
            type="response",
            se.fit=TRUE)

Other Generalizations

  • You can weight different points in the data set differently (i.e., weighted linear regression) by giving lm() or glm() a weights vector

  • You can use other non-linear models (e.g., an exponential function of the explanatory variables)

  • You can use local regression, loess(), which uses a \(k\)-nearest neighbor type approach to produce a piecewise-continuous curve that fits points as well as possible in local regions of the space
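For the first of those, a minimal weighted-regression sketch (the weights here are arbitrary, purely to illustrate the argument):

set.seed(1)
w <- runif(nrow(mtcars))                      # made-up observation weights
wfit <- lm(mpg ~ wt, data=mtcars, weights=w)  # weighted least squares
coef(wfit)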

LOESS Modeling

library(dplyr,quietly=TRUE, warn.conflicts=FALSE)

# Get population by state and build a DF of year and population
popbystate <- read.csv("population-by-year.csv", header=F, col.names=c("State", "Year", "Population"))
population <- summarise(group_by(filter(popbystate, Year>=1986, Year<=2023), Year),      
                                 Total=sum(Population))

# Get crime by year and build a DF of year and violent crime rate
crimebyyear <- read.csv("fbi-violentcrime-1986-2025.csv", header=T)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## 'fbi-violentcrime-1986-2025.csv'
crime <- data.frame(Year=as.numeric(gsub("X", "", colnames(crimebyyear)[2:39])), 
                    Count=as.numeric(crimebyyear[1,2:39]))
 
# Merge these, then compute the rate per 100K
df <- mutate(merge(crime, population), Rate=100000*Count/Total)

crimeModel = loess(data=df, formula=Rate ~ Year)

LOESS Visualization
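A sketch of the visualization step (the original plotting code is not shown): scatter the yearly rates and overlay the fitted LOESS curve:

plot(df$Year, df$Rate, pch=19,
     xlab="Year", ylab="Violent crime rate per 100K")
lines(df$Year, predict(crimeModel), col="blue", lwd=2)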