CS 424 Big Data Analytics

Session 4: Nature of Data / Installing R

Instructor: Dr. Bob Batzinger
Academic year: 2021/2022
Semester: 1

Begins June 2021

Basic description of distributions

Normal Distribution

\[ f(x,\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\ e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
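As a quick check of the formula, the density can be written out term by term and compared with base R's `dnorm`. This is a minimal sketch: the function name `normal_pdf` and the evaluation points are illustrative choices, not part of the slides.

```r
# Normal density written out from the formula on this slide
# (normal_pdf is a hypothetical helper name)
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(-2, -1, 0, 1, 2)                  # illustrative evaluation points
manual  <- normal_pdf(x, mu = 0, sigma = 1)
builtin <- dnorm(x, mean = 0, sd = 1)    # base R equivalent

print(round(manual, 4))
print(all.equal(manual, builtin))        # TRUE when the two agree
```

The density is symmetric about the mean, so the values at -1 and 1 (and at -2 and 2) match.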

Uniform Distribution

\[f(x)=\frac{1}{n}, \quad x \in \{x_1, \dots, x_n\}\]
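For a discrete uniform distribution, each of the n outcomes has probability 1/n. The sketch below assumes a fair six-sided die (n = 6) as the example and compares the theoretical probabilities with simulated frequencies. The seed and sample size are arbitrary.

```r
n <- 6                                   # assumed example: a fair six-sided die
pmf <- rep(1 / n, n)                     # probability 1/n for every outcome
names(pmf) <- 1:n
print(pmf)
print(sum(pmf))                          # probabilities sum to 1

set.seed(424)                            # arbitrary seed, for reproducibility
rolls <- sample(1:n, size = 6000, replace = TRUE)
freq <- as.numeric(table(rolls)) / length(rolls)
print(round(freq, 3))                    # each value should be close to 1/6
```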

Chi Square Distribution

\[f(x) = \frac{x^{k/2-1}\ e^{-x/2}}{2^{k/2}\ \Gamma(k/2)}\]

\[\huge\chi^2 = \large\sum_{i=1}^{n} \frac{\left(x_{obs,i} - x_{exp,i}\right)^2}{x_{exp,i}}\]

1. Requires degrees of freedom (df), written k in the density above
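The sketch below works through both formulas in R. It checks the chi-square density, with exponent k/2 - 1, against `dchisq`, then computes the test statistic by hand and compares it with `chisq.test`. The observed counts are hypothetical die-roll counts invented for illustration, and `chisq_pdf` is a hypothetical helper name.

```r
# Chi-square density with k degrees of freedom, written out from the formula
chisq_pdf <- function(x, k) {
  x^(k / 2 - 1) * exp(-x / 2) / (2^(k / 2) * gamma(k / 2))
}
print(all.equal(chisq_pdf(2.5, 5), dchisq(2.5, df = 5)))

# Goodness-of-fit statistic: sum of (observed - expected)^2 / expected
observed <- c(18, 22, 16, 25, 19, 20)    # hypothetical counts from 120 die rolls
expected <- rep(sum(observed) / length(observed), length(observed))

chi_sq  <- sum((observed - expected)^2 / expected)
df      <- length(observed) - 1          # categories minus one
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(chi_sq)
print(p_value)

# Base R performs the same equal-probability test on a vector of counts
print(unname(chisq.test(observed)$statistic))
```

With these counts every expected value is 20, so the statistic is (4 + 4 + 16 + 25 + 1 + 0) / 20 = 2.5 on 5 degrees of freedom.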

Pareto Distribution

\[P(X > x)=\left(\frac{x_m}{x}\right)^\alpha, \quad f(x)=\frac{\alpha\, x_m^\alpha}{x^{\alpha+1}}, \quad x \ge x_m\]

* 80/20 Rule
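The expression (x_m/x)^α is the Pareto survival function, the probability that X exceeds x. The sketch below evaluates it and shows how the 80/20 rule follows from one particular shape value, α = log(5)/log(4), about 1.16. The helper names `pareto_surv` and `top_share` and the choice x_m = 1 are illustrative assumptions.

```r
x_m   <- 1                               # minimum value (scale), chosen for illustration
alpha <- log(5) / log(4)                 # shape of about 1.16, which reproduces 80/20

# Survival function: P(X > x) = (x_m / x)^alpha for x >= x_m, and 1 below x_m
pareto_surv <- function(x, x_m, alpha) {
  ifelse(x < x_m, 1, (x_m / x)^alpha)
}

# Share of the total held by the top fraction p (valid only for alpha > 1)
top_share <- function(p, alpha) {
  p^(1 - 1 / alpha)
}

print(pareto_surv(c(1, 2, 10), x_m, alpha))
print(top_share(0.2, alpha))             # share held by the top 20%
```

Other shape values give other splits: a larger α spreads the total more evenly across the population.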

Comparing the different distributions
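One way to compare the four distributions without a plot is to draw a sample from each and tabulate summary statistics. This is a sketch under assumed parameters (standard normal, a six-outcome uniform, chi-square with 3 df, Pareto with x_m = 1 and α = 3). None of these values come from the slides, and the Pareto sample uses inverse-transform sampling because base R has no built-in Pareto generator.

```r
set.seed(424)                            # arbitrary seed
n <- 10000

samples <- list(
  normal     = rnorm(n, mean = 0, sd = 1),
  uniform    = sample(1:6, size = n, replace = TRUE),
  chi_square = rchisq(n, df = 3),
  pareto     = 1 / runif(n)^(1 / 3)      # inverse transform: x_m = 1, alpha = 3
)

# One row per distribution, one column per statistic
comparison <- t(sapply(samples, function(s) {
  c(mean = mean(s), median = median(s), sd = sd(s))
}))
print(round(comparison, 3))
```

For the symmetric normal and uniform samples the mean and median nearly coincide, while for the right-skewed chi-square and Pareto samples the mean sits above the median.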

R in IDEone

IDEone run status: Success (0.23s, 42812KB)

Call:
glm(formula = y ~ x)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-8.4184  -0.5135   0.0514   0.3308   5.0336  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.04620    0.22230   0.208    0.835    
x            1.00223    0.00222 451.541   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.5019413)

    Null deviance: 102841.49  on 999  degrees of freedom
Residual deviance:    500.94  on 998  degrees of freedom
AIC: 2152.6

Number of Fisher Scoring iterations: 2
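The slides show only the output, not the script that produced it. A script along the following lines would produce a summary of the same shape: 1000 observations, a slope near 1, and the default gaussian family. The data-generating values (mean 100, standard deviations 10 and 0.7) and the seed are assumptions, so the numbers will not match the output above.

```r
set.seed(424)                            # arbitrary seed

# Assumed data: 1000 points with y roughly equal to x plus noise
x <- rnorm(1000, mean = 100, sd = 10)
y <- x + rnorm(1000, mean = 0, sd = 0.7)

fit <- glm(y ~ x)                        # family defaults to gaussian
print(summary(fit))
```

With 1000 observations and two estimated coefficients, the residual deviance is reported on 998 degrees of freedom, as in the output above.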

Plotting is not available for R in IDEone

Installation of R and RStudio
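Once R is installed locally, a short script can confirm that the interpreter runs and that a graphics device is available. The sketch below writes a histogram to a temporary PDF file. The file location and plot contents are arbitrary choices for illustration.

```r
# Report the installed R version
print(R.version.string)

# Draw a histogram to a temporary PDF to confirm that plotting works
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
hist(rnorm(1000), main = "1000 standard normal draws", xlab = "value")
invisible(dev.off())                     # close the device so the file is written

print(file.exists(plot_file))
```

In an interactive session, calling `hist()` without opening `pdf()` first sends the plot to the default on-screen device.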

RStudio Interface