Statistical learning

Julio Trecenti

40% statistician, 30% programmer, 20% hacker, 10% mathematician

	PhD candidate in Statistics at IME-USP
	Technical Director of the Associação Brasileira de Jurimetria (Brazilian Jurimetrics Association)
	Vice President of CONRE-3
	Partner at Platipus Consultoria
	Partner at Curso-R

Descriptive -> data visualization

Inferential -> modeling / learning

age
education level
sex

mean, standard deviation, median, …

charts and tables
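
As a rough illustration of the descriptive side, the sketch below computes the summaries named above with Python's standard library. The variables `ages` and `sexes` and all of their values are made up for this example; they are not data from the talk.

```python
import statistics
from collections import Counter

# Hypothetical sample of eight respondents (illustrative values only).
ages = [23, 35, 41, 29, 35, 52, 47, 38]
sexes = ["F", "M", "F", "F", "M", "M", "F", "M"]

mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)        # sample standard deviation (n - 1 denominator)
median_age = statistics.median(ages)

# A one-way frequency table for a categorical variable.
sex_table = Counter(sexes)

print(f"mean={mean_age:.2f} sd={sd_age:.2f} median={median_age}")
print(dict(sex_table))
```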

Supervised: predict or estimate outputs from inputs
- Interest in prediction (what is \(y\) for a new \(X\)?)
- Interest in inference (how does \(X\) affect \(y\)?)

\[ y \approx f(X) \]

\[ y = f(X) + \epsilon \]
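
A minimal sketch of the supervised setting, assuming a single input and a linear \(f\). The "true" function `f_true`, the noise level, and the helper `fit_line` are all assumptions made for this illustration. Data are simulated as \(y = f(X) + \epsilon\), and the fit is ordinary least squares.

```python
import random

random.seed(0)

# Assumed "true" relationship, chosen only for this sketch: f(x) = 2 + 3x.
def f_true(x):
    return 2.0 + 3.0 * x

xs = [i / 10 for i in range(50)]
ys = [f_true(x) + random.gauss(0, 0.5) for x in xs]  # y = f(X) + epsilon

def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_line(xs, ys)

# Inference: how does X affect y? Inspect the estimated slope.
print(f"estimated slope: {slope:.3f}")

# Prediction: what is y for a new X?
x_new = 6.0
y_hat = intercept + slope * x_new
print(f"predicted y at x={x_new}: {y_hat:.3f}")
```

The same fitted object serves both interests: the slope answers the inference question, and `y_hat` answers the prediction question.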

Unsupervised: study the inputs; there is no output
- split into groups

\[ X \]
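
One possible way to "split into groups" is k-means clustering. The sketch below runs it in one dimension on made-up ages. The function `kmeans_1d`, the data, and the starting centers are assumptions for illustration; the slides do not name a specific method.

```python
def kmeans_1d(values, centers, n_iter=20):
    """Lloyd's algorithm in one dimension; `centers` are the starting guesses."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(n_iter):
        # Assignment step: each value joins the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical ages with two visible clusters; only X is used, there is no y.
ages = [21, 23, 24, 26, 58, 61, 63, 65]
centers, groups = kmeans_1d(ages, centers=[20, 30])
print(centers)
print(groups)
```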

Big data is extremely overhyped and not terribly well defined. Many people think they have big data, when they actually don’t.

From in-memory to disk. If your data fits in memory, it’s small data.
- And these days you can get 1 TB of RAM, so even small data is big!
- Moving from in-memory to on-disk is an important transition because access speeds are so different.
- You can do quite naive computations on in-memory data and it’ll be fast enough.
- You need to plan (and index) much more with on-disk data

From one computer to many computers.
- The next important threshold occurs when your data no longer fits on one disk on one computer.
- Moving to a distributed environment makes computation much more challenging because you don’t have all the data needed for a computation in one place.
- Designing distributed algorithms is much harder, and you’re fundamentally limited by the way the data is split up between computers.

I personally believe it’s impossible for one system to span from in-memory to on-disk to distributed.
- R is a fantastic environment for the rapid exploration of in-memory data, but there’s no elegant way to scale it to much larger datasets.
- Hadoop/Spark works well when you have thousands of computers, but is incredibly slow on just one machine.
- Fortunately, I don’t think one system needs to solve all big data problems.

Big data problems that are actually small data problems, once you have the right subset/sample/summary.
- Inventing numbers on the spot, I’d say 90% of big data problems fall into this category.
- To solve this problem you need a distributed database (like Hive, Impala, Teradata, etc.), and a tool like dplyr to let you rapidly iterate to the right small dataset (which still might be gigabytes in size).
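
The idea of reducing a big table to the right subset and summary can be sketched as follows. An in-memory SQLite database stands in for the distributed database, and a hand-written SQL query stands in for what a tool like dplyr would generate; both are assumptions for this example, as are the `events` table and its contents. The point is that filtering and aggregation run inside the database and only the small result is pulled back.

```python
import sqlite3

# In-memory SQLite is only a stand-in for a distributed database here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, region TEXT, amount REAL)")

# Hypothetical rows; in the scenario above this table would be too big to load.
rows = [(i, "north" if i % 2 == 0 else "south", float(i % 10)) for i in range(10_000)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Subset (WHERE) and summary (GROUP BY) are computed by the database;
# only one row per region comes back to the analysis session.
summary = conn.execute(
    """
    SELECT region, COUNT(*) AS n, AVG(amount) AS mean_amount
    FROM events
    WHERE amount > 0
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

for region, n, mean_amount in summary:
    print(region, n, mean_amount)
```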

Big data problems that are actually lots and lots of small data problems.
- e.g. you need to fit one model per individual for thousands of individuals.
- I’d say ~9% of big data problems fall into this category.
- This sort of problem is known as a trivially parallelisable problem and you need some way to distribute computation over multiple machines.
- The foreach package is a nice solution to this problem because it abstracts away the backend, allowing you to focus on the computation, not the details of distributing it.
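
A rough Python analogue of this pattern, not the foreach package itself: the per-individual fit is written once, and the executor passed in decides where it runs. The names `fit_one` and `fit_all`, the toy "model" (a mean), and the data are all hypothetical.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from statistics import mean

def fit_one(item):
    """Fit a deliberately tiny 'model' (the mean) to one individual's data."""
    person_id, values = item
    return person_id, mean(values)

def fit_all(data, executor: Executor):
    """Fit one model per individual; the executor decides where the work runs."""
    return dict(executor.map(fit_one, data.items()))

# Hypothetical data: three individuals, a few observations each.
data = {"ana": [1.0, 2.0, 3.0], "bia": [10.0, 20.0], "caio": [5.0]}

with ThreadPoolExecutor(max_workers=2) as pool:
    models = fit_all(data, pool)

print(models)
```

Because `fit_all` depends only on the `Executor` interface, a different backend (for example `ProcessPoolExecutor`, run under an `if __name__ == "__main__":` guard) could be swapped in without changing `fit_one`.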

Finally, there are irretrievably big problems where you do need all the data, perhaps because you are fitting a complex model.
- An example of this type of problem is recommender systems which really do benefit from lots of data because they need to recognise interactions that occur only rarely.
- These problems tend to be solved by dedicated systems specifically designed to solve a particular problem.