Doing data science

Data science and statistics

Statistics

Types of statistics
Descriptive -> data visualization
- can surprise you, but does not scale
Inferential -> modeling / learning
- scales, but cannot surprise you
Data visualization
age
education level
sex
mean, standard deviation, median, …
- Visualizations / summaries
charts and tables
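As a minimal sketch of the summaries listed above, the following computes a mean, median, and standard deviation for one variable and prints them as a small table. The `ages` values are made up for illustration and do not come from any dataset in these notes.

```python
import statistics

# Hypothetical ages of eight survey respondents (made-up numbers for illustration).
ages = [23, 35, 41, 29, 35, 52, 47, 35]

summary = {
    "n": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation (n - 1 denominator)
}

# Print a small text table, one statistic per row.
for name, value in summary.items():
    print(f"{name:<8}{value:>8.2f}")
```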
Statistical learning
- Supervised: predict or estimate outputs from inputs
- Interest in prediction (what is \(y\) for a new \(X\)?)
- Interest in inference (how does \(X\) affect \(y\)?)
\[
y \approx f(X)
\]
or
\[
y = f(X) + \epsilon
\]
- Unsupervised: study the inputs; there is no output
\[
X
\]
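The two settings above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: \(f\) is assumed to be linear, the data are simulated, and the unsupervised part uses a simple one-dimensional two-cluster k-means, which is one technique among many. The names `fit_line`, `TRUE_INTERCEPT`, and `TRUE_SLOPE` are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# --- Supervised: y = f(X) + epsilon, with f assumed linear for illustration ---
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 0.5  # hypothetical "true" f
xs = [float(i) for i in range(50)]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0, 1) for x in xs]

def fit_line(xs, ys):
    """Ordinary least squares for a single input; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(xs, ys)
print("inference: estimated slope =", round(slope, 3))  # how X affects y
print("prediction: y at x = 60 ->", round(intercept + slope * 60, 2))

# --- Unsupervised: only X, no y. Two-cluster k-means in one dimension ---
points = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(10, 1) for _ in range(30)]
centers = [min(points), max(points)]  # simple deterministic starting centers
for _ in range(10):
    groups = [[], []]
    for p in points:
        nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
        groups[nearest].append(p)
    centers = [sum(g) / len(g) for g in groups]
print("cluster centers:", [round(c, 2) for c in centers])
```

In the supervised part the same fitted line serves both interests: the slope answers the inference question, and evaluating the line at a new input answers the prediction question.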
Big Data
(by Hadley Wickham)
How do I know whether my problem is big data?
Big data is extremely overhyped and not terribly well defined. Many people think they have big data, when they actually don’t.
Transition points
- From in-memory to disk. If your data fits in memory, it’s small data.
- And these days you can get 1 TB of RAM, so even small data is big!
- Moving from in-memory to on-disk is an important transition because access speeds are so different.
- You can do quite naive computations on in-memory data and it’ll be fast enough.
- You need to plan (and index) much more with on-disk data
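To illustrate why on-disk data calls for planning and indexing, here is a small sketch using SQLite from Python's standard library. The table, column names, and data are hypothetical. It asks SQLite how it would run the same query before and after an index is created.

```python
import os
import sqlite3
import tempfile

# On-disk database in a temporary directory (hypothetical schema and data).
tmp_dir = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(tmp_dir, "events.db"))
conn.execute("CREATE TABLE events (user_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    ((i % 1000, float(i)) for i in range(100_000)),
)
conn.commit()

QUERY = "SELECT SUM(value) FROM events WHERE user_id = ?"

def query_plan(connection):
    """Return SQLite's description of how it would run QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan text

plan_before = query_plan(conn)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = query_plan(conn)   # the index is used to find matching rows

print("before:", plan_before)
print("after: ", plan_after)
total = conn.execute(QUERY, (42,)).fetchone()[0]
conn.close()
```

The exact wording of the plan text varies between SQLite versions, so the sketch only prints it rather than relying on a particular format.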
Transition points
- From one computer to many computers.
- The next important threshold occurs when your data no longer fits on one disk on one computer.
- Moving to a distributed environment makes computation much more challenging because you don’t have all the data needed for a computation in one place.
- Designing distributed algorithms is much harder, and you’re fundamentally limited by the way the data is split up between computers.
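A toy sketch of that limitation, with Python lists standing in for data held on separate machines (the partitions and values are made up): a mean can be rebuilt from small per-partition summaries, whereas a median cannot be rebuilt from per-partition medians.

```python
import statistics

# Hypothetical data split across three machines (each inner list is one partition).
partitions = [
    [1, 2, 3, 4],
    [10, 20],
    [100, 200, 300],
]

def partial_sum_count(partition):
    """Runs locally on each machine; only two numbers leave the machine."""
    return sum(partition), len(partition)

# Combine step: the mean is recoverable from per-partition (sum, count) pairs.
partials = [partial_sum_count(p) for p in partitions]
total, count = map(sum, zip(*partials))
distributed_mean = total / count

# The median is not: the median of per-partition medians is generally wrong.
all_values = [v for p in partitions for v in p]
true_median = statistics.median(all_values)
median_of_medians = statistics.median(statistics.median(p) for p in partitions)

print(distributed_mean, statistics.mean(all_values))
print(true_median, median_of_medians)
```

Statistics such as the mean split cleanly into a local step and a combine step; statistics such as an exact median need either all the data in one place or a purpose-built distributed algorithm.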
Classes of problems
- Big data problems that are actually small data problems, once you have the right subset/sample/summary.
- Inventing numbers on the spot, I’d say 90% of big data problems fall into this category.
- To solve this problem you need a distributed database (like Hive, Impala, Teradata, etc.), and a tool like dplyr to let you rapidly iterate to the right small dataset (which still might be gigabytes in size).
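The idea of iterating towards a small summary can be sketched by letting the database do the aggregation and pulling back only the result. This sketch swaps in Python and an in-memory SQLite database as stand-ins for dplyr and a distributed database; the `purchases` table and its contents are hypothetical.

```python
import sqlite3

# In-memory SQLite as a stand-in for a large remote database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    ((("BR", "US", "DE")[i % 3], float(i % 50)) for i in range(30_000)),
)

# The aggregation runs inside the database; only one row per country comes back.
summary = conn.execute(
    """
    SELECT country, COUNT(*) AS n, AVG(amount) AS mean_amount
    FROM purchases
    GROUP BY country
    ORDER BY country
    """
).fetchall()
conn.close()

for country, n, mean_amount in summary:
    print(country, n, round(mean_amount, 2))
```

Thirty thousand rows go in and three rows come out, so everything after the query is a small data problem.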
Classes of problems
- Big data problems that are actually lots and lots of small data problems,
- e.g. you need to fit one model per individual for thousands of individuals.
- I’d say ~9% of big data problems fall into this category.
- This sort of problem is known as a trivially parallelisable problem and you need some way to distribute computation over multiple machines.
- The foreach package is a nice solution to this problem because it abstracts away the backend, allowing you to focus on the computation, not the details of distributing it.
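The same separation between the computation and the backend can be sketched in Python. This is not the foreach package; it is an analogous pattern in which the per-individual fit is written once and the mapping function is swapped. The names `fit_mean_model` and `fit_all`, the toy "model" (a plain mean), and the data are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_mean_model(item):
    """Toy 'model' for one individual: the mean of that individual's observations."""
    person_id, observations = item
    return person_id, sum(observations) / len(observations)

def fit_all(data, map_fn=map):
    """Fit one model per individual; map_fn decides where the work runs."""
    return dict(map_fn(fit_mean_model, data.items()))

# Hypothetical data: 1,000 individuals with five observations each.
data = {pid: [pid + k for k in range(5)] for pid in range(1000)}

sequential = fit_all(data)  # built-in map, one worker
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = fit_all(data, map_fn=pool.map)  # same code, different backend

print(sequential == parallel)
```

Threads are used here only because they run anywhere; for CPU-bound fits one would likely pass the `map` method of a `ProcessPoolExecutor` or of a cluster library instead, without changing `fit_mean_model`.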
Classes of problems
- Finally, there are irretrievably big problems where you do need all the data, perhaps because you’re fitting a complex model.
- An example of this type of problem is recommender systems which really do benefit from lots of data because they need to recognise interactions that occur only rarely.
- These problems tend to be solved by dedicated systems specifically designed to solve a particular problem.