CS 424 Big Data Analytics

Session 3: Nature of Data

Instructor: Dr. Bob Batzinger
Academic year: 2021/2022
Semester: 1

Begins June 2021

Analysis

Problem Type: Outliers

Mean and central limit theorem

The sum of all deviations from the mean is zero.

\[\begin{matrix} \hbox{Lower} & & \hbox{Upper}\\ \hbox{extreme}&\hbox{Mid range} & \hbox{extreme}\\ 1/6 & 4/6 & 1/6 \\ & & \\ & & 236_\rlap{(-46)}\\ &218_\rlap{(-28)}& \\ &185_\rlap{(5)}& \\ &178_\rlap{(12)}& \\ & 172_\rlap{(18)}& \\ 151_\rlap{(39)}& & \\ \end{matrix}\]
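The property can be checked directly on the six values in the table above. This is a minimal sketch; the variable names are illustrative, and deviations are taken as mean minus value to match the signs shown in parentheses.

```r
# The six observations from the table above
x <- c(236, 218, 185, 178, 172, 151)

m <- mean(x)      # arithmetic mean of the observations
dev <- m - x      # deviations, signed as in the table (mean minus value)

print(m)
print(dev)
print(sum(dev))   # the deviations cancel out
```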

Sampling

Normal distribution

Formula for normal curve

\[ f(x,\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\ e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

\[\begin{eqnarray} x &=& \hbox{observed value}\\ \mu &=& \hbox{mean}\\ \sigma &=& \hbox{standard deviation}\\ \end{eqnarray}\]
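As a sketch, the formula can be written out term by term and compared with R's built-in `dnorm()`, taking σ as the standard deviation so that σ² is the variance. The function name `normal_density` is illustrative only.

```r
# Normal density written out from the formula above
normal_density <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-3, 3, by = 0.5)

# Compare the hand-written version with the built-in density
print(all.equal(normal_density(x, 0, 1), dnorm(x, mean = 0, sd = 1)))
```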

Sampling from a normal population
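One way to explore this is by simulation. The sketch below draws repeated samples from a normal population and looks at the spread of the sample means. The population parameters (`mu`, `sigma`), the sample size `n`, and the seed are hypothetical values chosen for illustration, not taken from the course data.

```r
set.seed(424)   # arbitrary seed, only for reproducibility

mu <- 190       # hypothetical population mean
sigma <- 30     # hypothetical population standard deviation
n <- 25         # size of each sample

# Draw 2000 samples of size n and keep the mean of each one
sample_means <- replicate(2000, mean(rnorm(n, mean = mu, sd = sigma)))

print(mean(sample_means))   # should lie close to mu
print(sd(sample_means))     # should lie close to sigma / sqrt(n)
print(sigma / sqrt(n))
```

The sample means cluster around the population mean, with a spread that shrinks as `n` grows.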

Integrating the Normal curve
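The area under the normal curve between two points can be obtained either by numerical integration of the density or from the cumulative distribution function. A minimal sketch for the standard normal curve between -1 and +1 standard deviations, with illustrative variable names:

```r
# Numerical integration of the standard normal density from -1 to 1
area_integrate <- integrate(dnorm, lower = -1, upper = 1)$value

# Same area from the cumulative distribution function
area_pnorm <- pnorm(1) - pnorm(-1)

print(area_integrate)
print(area_pnorm)   # roughly 68% of the population lies within one sd
```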

Variance

Increasing distance between 2 similar populations

Other distributions

Problem Type: Regression

More models

Regression Analysis

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5693.1  -959.2  -186.0   822.4  7517.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2054.3723   101.0300  -20.33   <2e-16 ***
## x              14.0297     0.1749   80.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1596 on 998 degrees of freedom
## Multiple R-squared:  0.8658, Adjusted R-squared:  0.8656 
## F-statistic:  6438 on 1 and 998 DF,  p-value: < 2.2e-16
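The dataset behind the summary above is not shown, so the sketch below produces a summary of the same shape from simulated data. The intercept, slope, and noise level are hypothetical values chosen only to resemble the output loosely.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

# Simulated data: 1000 points on a noisy straight line (hypothetical values)
x <- runif(1000, min = 0, max = 1000)
y <- -2000 + 14 * x + rnorm(1000, sd = 1600)

fit <- lm(y ~ x)    # simple linear regression, as in the call above

print(summary(fit)) # residual quartiles, coefficients, R-squared, F-statistic
print(coef(fit))    # estimated intercept and slope
```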

Residuals

Model      Min      1Q  Median     3Q     Max       R
1      -5120.3  -965.0  -163.4  745.2  7824.5  0.8672
2      -6526.2  -322.2   -84.3  467.4  6384.9  0.9116
3      -7293.2  -931.6   -94.0  912.3  5737.7  0.8800
4      -6394.5  -341.6   -18.8  394.9  6504.1  0.9122
5      -6546.0  -434.1    -5.5  391.3  6385.6  0.9118
6      -6456.5  -364.8    11.9  395.3  6454.3  0.9123

Coefficients y = ax^3 + bx^2 + cx + d

Model a b c d
1 * 1.384e-02 * 1.384e-02
2 * 1.384e-02 * 3.361e+02
3 * 1.430e-05 * 1.376e+03
4 * 1.232e-02 * 1.629e+00 3.027e+01
5 * 8.054e-06 * 6.698e+00 * -4.124e+02
6 2.811e-06 8.096e-03 3.320e+00 -1.111e+02
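Models of this polynomial form can be fitted with `lm()` by adding power terms through `I()`. The sketch below fits a linear, a quadratic, and a full cubic model to simulated data and compares their R-squared values. The data and the three model choices are assumptions for illustration and do not reproduce the six models in the tables above.

```r
set.seed(2)   # arbitrary seed, only for reproducibility

# Simulated data following a noisy cubic (hypothetical coefficients)
x <- runif(500, min = 0, max = 1000)
y <- 3e-06 * x^3 + 8e-03 * x^2 + 3 * x - 100 + rnorm(500, sd = 500)

# Nested polynomial models of increasing degree
fits <- list(
  linear    = lm(y ~ x),
  quadratic = lm(y ~ x + I(x^2)),
  cubic     = lm(y ~ x + I(x^2) + I(x^3))
)

# R-squared for each model; adding terms to a nested model cannot lower it
r2 <- sapply(fits, function(f) summary(f)$r.squared)
print(r2)
```

Because R-squared never decreases when terms are added, a higher value alone does not show that the extra terms are worthwhile.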

Classification Problem

Problem Type: Classification by k-means clustering

Visualizing classification errors

Classification errors
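As a sketch of how classification errors from k-means can be counted, the example below clusters R's built-in `iris` data and cross-tabulates the clusters against the known species. The choice of `iris`, three centers, and the scaling step are assumptions for illustration and are not the dataset used in the slides.

```r
set.seed(3)   # k-means starts from random centers

# Standardize the four numeric measurements
features <- scale(iris[, 1:4])

# Partition the observations into 3 clusters, keeping the best of 20 starts
km <- kmeans(features, centers = 3, nstart = 20)

# Cross-tabulate cluster membership against the known species
confusion <- table(cluster = km$cluster, species = iris$Species)
print(confusion)

# Treat the majority species of each cluster as its label;
# every other member of that cluster counts as an error
errors <- sum(confusion) - sum(apply(confusion, 1, max))
print(errors)
```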

Case Study

Background:

Research questions:

Dataset:

birthdat = read.csv("../datasets/WPP2015_FERT_SEX_RATIO_AT_BIRTH.csv")
t(birthdat[birthdat$Region=="Thailand",])
##             112        
## Index       "112"      
## Variant     "Estimates"
## Region      "Thailand" 
## Notes       ""         
## CountryCode "764"      
## X1950       "1.054"    
## X1955       "1.055"    
## X1960       "1.056"    
## X1965       "1.056"    
## X1970       "1.057"    
## X1975       "1.053"    
## X1980       "1.052"    
## X1985       "1.05"     
## X1990       "1.055"    
## X1995       "1.061"    
## X2000       "1.062"    
## X2005       "1.064"    
## X2010       "1.062"

Dataset structure

## 'data.frame':    241 obs. of  18 variables:
##  $ Index      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Variant    : Factor w/ 1 level "Estimates": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Region     : Factor w/ 241 levels "Afghanistan",..: 238 143 116 113 118 117 91 140 228 123 ...
##  $ Notes      : Factor w/ 37 levels "","1.0","10.0",..: 1 32 33 34 35 1 36 36 36 36 ...
##  $ CountryCode: num  900 901 902 941 934 ...
##  $ X1950      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.06 1.06 1.06 1.06 ...
##  $ X1955      : num  1.06 1.05 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1960      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1965      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1970      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1975      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1980      : num  1.06 1.05 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1985      : num  1.06 1.05 1.06 1.04 1.07 1.05 1.06 1.07 1.07 1.06 ...
##  $ X1990      : num  1.07 1.05 1.07 1.04 1.08 1.06 1.06 1.08 1.09 1.07 ...
##  $ X1995      : num  1.07 1.05 1.07 1.04 1.08 1.06 1.05 1.08 1.09 1.07 ...
##  $ X2000      : num  1.07 1.05 1.08 1.04 1.09 1.07 1.05 1.09 1.1 1.08 ...
##  $ X2005      : num  1.08 1.05 1.08 1.04 1.09 1.06 1.05 1.09 1.11 1.08 ...
##  $ X2010      : num  1.07 1.05 1.08 1.04 1.09 1.06 1.05 1.09 1.1 1.08 ...

Visualize the nature of the data

Simplify the Analysis to Pre vs Post 1985

Determine the countries of significant change

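# premean, postmean, colors and cntrybdat are computed in earlier steps not shown here
plot(0,0,xlim=c(1.01,1.08),ylim=c(1.01,1.14), xlab="Pre1980", ylab="Post1980", 
     main="Comparing birth male/female ratio pre and post 1980")
# One point per country, labelled below with its country code
points(premean,postmean,pch=19,col=colors)
text(premean,postmean,cntrybdat$CountryCode,pos=1, cex=0.75)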

Retrieve the summary data for these countries

## [1] "31 : Azerbaijan 1.062 -> 1.1268 p= 0.02044"
## [1] "51 : Armenia 1.0598 -> 1.1247 p= 0.01806"
## [1] "156 : China 1.07 -> 1.14 p= 0.00241"
## [1] "158 : Other non-specified areas 1.056 -> 1.095 p= 9e-05"
## [1] "356 : India 1.06 -> 1.0973 p= 0.00187"
## [1] "410 : Republic of Korea 1.07 -> 1.1033 p= 0.04829"
## [1] "268 : Georgia 1.076 -> 1.0982 p= 0.02548"
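The slides do not show how these p-values were computed. One plausible approach is a two-sample t-test on each country's pre and post values. The sketch below applies it to the Thailand row printed earlier; the use of Welch's t-test and the 1985 cutoff are assumptions.

```r
# Sex ratio at birth for Thailand, copied from the row printed earlier
thailand <- c(X1950 = 1.054, X1955 = 1.055, X1960 = 1.056, X1965 = 1.056,
              X1970 = 1.057, X1975 = 1.053, X1980 = 1.052, X1985 = 1.050,
              X1990 = 1.055, X1995 = 1.061, X2000 = 1.062, X2005 = 1.064,
              X2010 = 1.062)

# Recover the year from each column name
years <- as.integer(sub("X", "", names(thailand)))

cutoff <- 1985                      # assumed split point
pre  <- thailand[years <  cutoff]   # 1950 to 1980
post <- thailand[years >= cutoff]   # 1985 to 2010

# Welch two-sample t-test (assumed; the original test is not stated)
test <- t.test(pre, post)

# Summary line in the same layout as the output above
print(paste("764 : Thailand", round(mean(pre), 4), "->", round(mean(post), 4),
            "p=", round(test$p.value, 5)))
```

Applying the same steps to every row of `birthdat` and keeping the rows with p below 0.05 would give a list of countries like the one above.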