CS 424 Big Data Analytics

Session 3: Nature of Data

Instructor: Dr. Bob Batzinger
Academic year: 2021/2022
Semester: 1

Begins June 2021

Analysis

Understand the problem
Gather the information
Reduce the complexity of the problem
Model the relationship between the independent and the dependent variables
Test your theory and verify the results
Communicate the results

Problem Type: Outliers

Mean and central limit theorem

The sum of all deviations from the mean is zero.

\[\begin{matrix} \hbox{Lower} & & \hbox{Upper}\\ \hbox{extreme}&\hbox{Mid range} & \hbox{extreme}\\ 1/6 & 4/6 & 1/6 \\ & & \\ & & 236_\rlap{(-46)}\\ &218_\rlap{(-28)}& \\ &185_\rlap{(5)}& \\ &178_\rlap{(12)}& \\ & 172_\rlap{(18)}& \\ 151_\rlap{(39)}& & \\ \end{matrix}\]
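
The zero-sum property can be checked directly with the six values in the table above. This is a minimal sketch in R; the variable names are illustrative, and the deviations follow the table's sign convention (mean minus value).

```r
# Six observations from the table above
x <- c(236, 218, 185, 178, 172, 151)

m <- mean(x)            # 190
deviations <- m - x     # sign convention of the table: mean minus value

print(deviations)       # -46 -28 5 12 18 39
print(sum(deviations))  # 0
```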

Sampling

Normal distribution

Formula for normal curve

\[ f(x,\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\ e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

\[\begin{eqnarray} x &=& \hbox{observed value}\\ \mu &=& \hbox{mean}\\ \sigma &=& \hbox{standard deviation}\\ \end{eqnarray}\]
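
As a quick check of the formula, the density can be written out by hand and compared with R's built-in dnorm(). This is only a sketch: normal_density is a hypothetical helper name, and the mean of 190 and standard deviation of 30 are illustrative values, not figures from the slides. Note that sigma in the formula is the standard deviation; the variance is sigma squared.

```r
# Normal density written out term by term from the formula.
# sigma is the standard deviation (the variance is sigma^2).
normal_density <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(150, 190, 230)                       # illustrative observed values
manual  <- normal_density(x, mu = 190, sigma = 30)
builtin <- dnorm(x, mean = 190, sd = 30)    # R's own implementation

print(cbind(x, manual, builtin))
print(all.equal(manual, builtin))           # TRUE
```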

Sampling from a normal population
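
A small simulation can illustrate what happens when samples are drawn repeatedly from a normal population. The sketch below assumes a population with mean 100 and standard deviation 15 (arbitrary values chosen for illustration) and shows that the sample means cluster around the population mean with a spread close to sigma / sqrt(n).

```r
set.seed(424)            # arbitrary seed, for reproducibility only

mu <- 100                # assumed population mean
sigma <- 15              # assumed population standard deviation
n <- 25                  # size of each sample
reps <- 2000             # number of repeated samples

# Mean of each repeated sample
sample_means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

print(mean(sample_means))   # close to mu
print(sd(sample_means))     # close to sigma / sqrt(n) = 3
```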

Integrating the Normal curve
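
The area under the normal curve between two points gives the probability of an observation falling in that range. The sketch below uses the standard normal (mean 0, standard deviation 1) and compares numerical integration with the cumulative distribution function. The last line finds cut points that leave 1/6 of the area in each tail, as an illustration of a lower extreme / mid range / upper extreme split.

```r
# Area within one standard deviation of the mean, two ways
area_numeric <- integrate(dnorm, lower = -1, upper = 1)$value
area_cdf     <- pnorm(1) - pnorm(-1)

print(area_numeric)   # about 0.6827
print(area_cdf)

# Cut points that leave 1/6 of the area in each tail
cuts <- qnorm(c(1/6, 5/6))
print(cuts)           # about -0.97 and 0.97
```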

Variance

Increasing distance between 2 similar populations

Other distributions

Problem Type: Regression

More models

Regression Analysis

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5693.1  -959.2  -186.0   822.4  7517.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2054.3723   101.0300  -20.33   <2e-16 ***
## x              14.0297     0.1749   80.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1596 on 998 degrees of freedom
## Multiple R-squared:  0.8658, Adjusted R-squared:  0.8656 
## F-statistic:  6438 on 1 and 998 DF,  p-value: < 2.2e-16
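
The x and y behind the summary above are not shown, so the following sketch uses synthetic data (an assumption) to show how such a summary is produced and how its main pieces can be pulled out individually.

```r
set.seed(424)   # arbitrary seed

# Synthetic data: a straight line plus noise (not the slide's data)
x <- runif(200, min = 0, max = 100)
y <- 5 + 2 * x + rnorm(200, sd = 10)

fit <- lm(y ~ x)
s <- summary(fit)

print(coef(fit))                  # intercept and slope estimates
print(s$r.squared)                # Multiple R-squared
print(s$sigma)                    # residual standard error
print(quantile(residuals(fit)))   # Min, 1Q, Median, 3Q, Max of the residuals
```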

Residuals

Model	Min	1Q	Median	3Q	Max	R
1	-5120.3	-965.0	-163.4	745.2	7824.5	0.8672
2	-6526.2	-322.2	-84.3	467.4	6384.9	0.9116
3	-7293.2	-931.6	-94.0	912.3	5737.7	0.8800
4	-6394.5	-341.6	-18.8	394.9	6504.1	0.9122
5	-6546.0	-434.1	-5.5	391.3	6385.6	0.9118
6	-6456.5	-364.8	11.9	395.3	6454.3	0.9123

Coefficients y = ax^3 + bx^2 + cx + d

Model	a	b	c	d
1			* 1.384e-02	* 1.384e-02
2		* 1.384e-02		* 3.361e+02
3	* 1.430e-05			* 1.376e+03
4		* 1.232e-02	* 1.629e+00	3.027e+01
5	* 8.054e-06		* 6.698e+00	* -4.124e+02
6	2.811e-06	8.096e-03	3.320e+00	-1.111e+02
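
Reading the coefficient table, the six models appear to use these term combinations: (1) x, (2) x^2, (3) x^3, (4) x^2 and x, (5) x^3 and x, (6) x^3, x^2 and x, each with an intercept d. A sketch of how such a comparison could be set up is shown below. The data are synthetic (an assumption), and the names dat, models and fits are illustrative.

```r
set.seed(424)   # arbitrary seed

# Synthetic data from a noisy cubic (not the slide's data)
x <- runif(300, min = 0, max = 100)
y <- 3e-3 * x^3 + 0.01 * x^2 + 3 * x - 100 + rnorm(300, sd = 150)
dat <- data.frame(x = x, y = y)

# One formula per model; I() protects the powers inside a formula
models <- list(
  m1 = y ~ x,
  m2 = y ~ I(x^2),
  m3 = y ~ I(x^3),
  m4 = y ~ I(x^2) + x,
  m5 = y ~ I(x^3) + x,
  m6 = y ~ I(x^3) + I(x^2) + x
)

fits <- lapply(models, function(f) lm(f, data = dat))

# R-squared of each model
r2 <- sapply(fits, function(f) summary(f)$r.squared)
print(round(r2, 4))
```

Because model 6 contains every term used by the other five, its R-squared can never be lower than theirs; a higher R-squared alone is therefore not evidence that the extra terms are needed.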

Classification Problem

Problem Type: Classification by k-means clustering

Visualizing classification errors

Classification errors
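
A minimal sketch of k-means clustering and of counting its classification errors is shown below. The data are two synthetic, well-separated groups of points (an assumption made for illustration), so the true group of every point is known and can be cross-tabulated against the cluster that kmeans() assigns.

```r
set.seed(424)   # arbitrary seed

n <- 50
true_group <- rep(c(1, 2), each = n)

# Two synthetic 2-D groups centred at (0, 0) and (10, 10)
pts <- rbind(
  matrix(rnorm(2 * n, mean = 0,  sd = 1), ncol = 2),
  matrix(rnorm(2 * n, mean = 10, sd = 1), ncol = 2)
)

km <- kmeans(pts, centers = 2, nstart = 10)

# Cross-tabulate known groups against assigned clusters
confusion <- table(true = true_group, cluster = km$cluster)
print(confusion)

# Cluster labels are arbitrary, so count errors under the better labelling
on_diag <- sum(diag(confusion))
errors <- min(on_diag, sum(confusion) - on_diag)
print(errors)
```

With groups this far apart no point should be misclassified; moving the two centres closer together makes the off-diagonal counts grow.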

Case Study

Background:

1950s: modern drugs and methods introduced to control population growth
1980s: modern methods of abortion introduced and legalized in many countries

Research questions:

Have these technologies contributed to an imbalance in the ratio of men to women?
What is the size of the impact?

Dataset:

World Bank statistics of male/female ratios at birth in 200 countries spanning the period 1950-2010

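# Load the sex-ratio-at-birth table (one row per country or region)
birthdat <- read.csv("../datasets/WPP2015_FERT_SEX_RATIO_AT_BIRTH.csv")
# Show the Thailand record transposed, one field per line
t(birthdat[birthdat$Region == "Thailand", ])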

##             112        
## Index       "112"      
## Variant     "Estimates"
## Region      "Thailand" 
## Notes       ""         
## CountryCode "764"      
## X1950       "1.054"    
## X1955       "1.055"    
## X1960       "1.056"    
## X1965       "1.056"    
## X1970       "1.057"    
## X1975       "1.053"    
## X1980       "1.052"    
## X1985       "1.05"     
## X1990       "1.055"    
## X1995       "1.061"    
## X2000       "1.062"    
## X2005       "1.064"    
## X2010       "1.062"

Dataset structure

## 'data.frame':    241 obs. of  18 variables:
##  $ Index      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Variant    : Factor w/ 1 level "Estimates": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Region     : Factor w/ 241 levels "Afghanistan",..: 238 143 116 113 118 117 91 140 228 123 ...
##  $ Notes      : Factor w/ 37 levels "","1.0","10.0",..: 1 32 33 34 35 1 36 36 36 36 ...
##  $ CountryCode: num  900 901 902 941 934 ...
##  $ X1950      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.06 1.06 1.06 1.06 ...
##  $ X1955      : num  1.06 1.05 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1960      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1965      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1970      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1975      : num  1.06 1.06 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1980      : num  1.06 1.05 1.06 1.04 1.06 1.05 1.05 1.06 1.06 1.06 ...
##  $ X1985      : num  1.06 1.05 1.06 1.04 1.07 1.05 1.06 1.07 1.07 1.06 ...
##  $ X1990      : num  1.07 1.05 1.07 1.04 1.08 1.06 1.06 1.08 1.09 1.07 ...
##  $ X1995      : num  1.07 1.05 1.07 1.04 1.08 1.06 1.05 1.08 1.09 1.07 ...
##  $ X2000      : num  1.07 1.05 1.08 1.04 1.09 1.07 1.05 1.09 1.1 1.08 ...
##  $ X2005      : num  1.08 1.05 1.08 1.04 1.09 1.06 1.05 1.09 1.11 1.08 ...
##  $ X2010      : num  1.07 1.05 1.08 1.04 1.09 1.06 1.05 1.09 1.1 1.08 ...

Visualize the nature of the data

Simplify the Analysis to Pre vs Post 1985

Determine the countries of significant change

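# premean, postmean, colors and cntrybdat are assumed to come from earlier steps
plot(0, 0, xlim=c(1.01,1.08), ylim=c(1.01,1.14), xlab="Pre1980", ylab="Post1980",
     main="Comparing birth male/female ratio pre and post 1980")
# One point per country, labelled with its country code
points(premean, postmean, pch=19, col=colors)
text(premean, postmean, cntrybdat$CountryCode, pos=1, cex=0.75)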

Retrieve the summary data for these countries

## [1] "31 : Azerbaijan 1.062 -> 1.1268 p= 0.02044"
## [1] "51 : Armenia 1.0598 -> 1.1247 p= 0.01806"
## [1] "156 : China 1.07 -> 1.14 p= 0.00241"
## [1] "158 : Other non-specified areas 1.056 -> 1.095 p= 9e-05"
## [1] "356 : India 1.06 -> 1.0973 p= 0.00187"
## [1] "410 : Republic of Korea 1.07 -> 1.1033 p= 0.04829"
## [1] "268 : Georgia 1.076 -> 1.0982 p= 0.02548"
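
The code that produced the lines above is not shown. One plausible way to obtain output of this shape is to average each country's ratios before and after a cutoff year and compare the two periods with a t-test. The sketch below makes several assumptions: the cutoff of 1985, the use of Welch's t.test(), and the small demo data frame, whose first row holds the Thailand values listed earlier and whose second row is entirely hypothetical.

```r
years <- seq(1950, 2010, by = 5)

# Row 1: Thailand values from the record shown earlier; row 2: hypothetical
ratios <- rbind(
  c(1.054, 1.055, 1.056, 1.056, 1.057, 1.053, 1.052,
    1.050, 1.055, 1.061, 1.062, 1.064, 1.062),
  c(1.060, 1.061, 1.059, 1.060, 1.062, 1.060, 1.061,
    1.080, 1.100, 1.110, 1.120, 1.130, 1.120)
)
colnames(ratios) <- paste0("X", years)
demo <- data.frame(CountryCode = c(764, 999),
                   Region = c("Thailand", "Hypothetical"),
                   ratios, stringsAsFactors = FALSE)

cutoff <- 1985   # assumed split between the "pre" and "post" periods
pre_cols  <- paste0("X", years[years <  cutoff])
post_cols <- paste0("X", years[years >= cutoff])

premean  <- rowMeans(demo[, pre_cols])
postmean <- rowMeans(demo[, post_cols])

# Welch two-sample t-test per country: pre-period ratios vs post-period ratios
pvals <- sapply(seq_len(nrow(demo)), function(i) {
  t.test(unlist(demo[i, pre_cols]), unlist(demo[i, post_cols]))$p.value
})

for (i in seq_len(nrow(demo))) {
  print(paste(demo$CountryCode[i], ":", demo$Region[i],
              round(premean[i], 4), "->", round(postmean[i], 4),
              "p=", round(pvals[i], 5)))
}
```

With only six or seven five-year estimates per period, the p-values are rough indicators at best, and testing many countries at once raises the chance of false positives.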