https://www.r-bloggers.com/essential-list-of-useful-r-packages-for-data-scientists/.
Loading and reading data into the R environment is most likely one of the first steps, if not the most important one. Data is the fuel.
This step breaks down into further sections: reading data from binary files, from ODBC drivers, and from SQL databases.
# Reading from SAS and SPSS
#install.packages("Hmisc", dependencies = TRUE)
# Reading from Stata, Systat and Weka
#install.packages("foreign", dependencies = TRUE)
# Reading from KNIME
#install.packages(c("protr","foreign"), dependencies = TRUE)
# Reading from EXCEL
#install.packages(c("readxl","xlsx"), dependencies = TRUE)
# Reading from TXT, CSV
#install.packages(c("csv","readr","tidyverse"), dependencies = TRUE)
# Reading from JSON
#install.packages(c("jsonlite","rjson","RJSONIO","jsonvalidate"), dependencies = TRUE)
# Reading from AVRO
#install.packages("sparkavro", dependencies = TRUE)
# Reading from Parquet file
#install.packages("arrow", dependencies = TRUE)
#devtools::install_github("apache/arrow/r")
# Reading from XML
#install.packages("XML", dependencies = TRUE)
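Before installing any of the packages above, it can help to see the simplest case. The following is a minimal sketch that uses only `read.csv()` from the utils package that ships with R; the file contents and the names `csv_path` and `scores` are made up for illustration.

```r
# Write a small CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ana,9.5", "2,Bor,7.0", "3,Cene,NA"), csv_path)

# read.csv() is part of utils; the literal "NA" is parsed as a missing value.
scores <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect column types and the first values.
str(scores)
```

Packages such as readr offer faster parsing and more control over column types, but the overall shape of the call (path in, data frame out) stays the same.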
This will cover most of the common work with ODBC drivers:
Accessing an SQL database with a dedicated package can also have great benefits when pulling data from the database into an R data frame. In addition, I have added some useful R packages that help you query data in R much more easily (RSQL) or even write SQL statements directly (sqldf), along with other great features.
#Microsoft MSSQL Server
#install.packages(c("mssqlR", "RODBC"), dependencies = TRUE)
#MySQL
#install.packages(c("RMySQL","dbConnect"), dependencies = TRUE)
#PostgreSQL
#install.packages(c("postGIStools","RPostgreSQL"), dependencies = TRUE)
#Oracle
#install.packages(c("odbc"), dependencies = TRUE)
#Amazon
#install.packages(c("RRedshiftSQL"), dependencies = TRUE)
#SQLite
#install.packages(c("RSQLite","sqliter","dbflobr"), dependencies = TRUE)
#General SQL packages
#install.packages(c("RSQL","sqldf","poplite","queryparser"), dependencies = TRUE)
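As an illustration of pulling query results into a data frame, here is a small sketch against an in-memory SQLite database. It assumes the RSQLite package and its DBI dependency are installed and is skipped otherwise; the table name `cars` and the result name `cyl_counts` are chosen only for this example.

```r
# Assumption: DBI and RSQLite are installed; the block does nothing otherwise.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # Open a throwaway in-memory database.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Copy a built-in data frame into a table named "cars".
  DBI::dbWriteTable(con, "cars", mtcars)

  # Run plain SQL and get the result back as a data frame.
  cyl_counts <- DBI::dbGetQuery(
    con,
    "SELECT cyl, COUNT(*) AS n FROM cars GROUP BY cyl ORDER BY cyl"
  )

  DBI::dbDisconnect(con)
  print(cyl_counts)
}
```

The same `dbConnect()` / `dbGetQuery()` / `dbDisconnect()` pattern applies to other DBI back ends; only the driver object and connection arguments change.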
Data engineering, data copying, data wrangling, and data manipulation are the very next tasks in the journey.
Data cleaning is essential for removing outliers, NULL and N/A values, and wrong values, for imputing or replacing them, for checking frequencies and descriptive statistics, and for applying different uni-, bi-, and multivariate statistical analyses to tackle these issues. The list is by no means complete, but it can be a good starting point:
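To make the cleaning steps concrete, here is a base-R sketch with no extra packages. The vector `x` is invented data, and both the 1.5 * IQR outlier rule and median imputation are just one common choice, not the only reasonable one.

```r
# Toy measurements with two missing values and one obvious outlier (120).
x <- c(12, 15, NA, 14, 13, 120, 16, NA, 15)

# Count missing values.
n_missing <- sum(is.na(x))

# Flag outliers with the 1.5 * IQR rule (an assumption; other rules exist).
q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(x) & (x < q[1] - fence | x > q[2] + fence)

# Treat outliers as missing, then impute every gap with the median.
x_clean <- x
x_clean[is_outlier] <- NA
x_clean[is.na(x_clean)] <- median(x_clean, na.rm = TRUE)

x_clean
```

Dedicated packages add more robust outlier detection and model-based imputation, but the flow (detect, mark, replace) is the same.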
Working with correct data types and knowing your way around the formatting of your dataset is often overlooked, yet important. A list of must-have packages:
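As a quick illustration of why types matter, the sketch below converts text columns to dates, numbers, and factors with base R only. The data frame `raw` and its column names are hypothetical.

```r
# Everything arrives as character, as is typical after a raw import.
raw <- data.frame(
  joined = c("2021-03-01", "2021-11-15", "2022-01-20"),
  amount = c("1,200.50", "310", "45.25"),
  tier   = c("gold", "silver", "gold"),
  stringsAsFactors = FALSE
)

typed <- data.frame(
  # Parse ISO dates into the Date class.
  joined = as.Date(raw$joined, format = "%Y-%m-%d"),
  # Strip thousands separators before converting to numeric.
  amount = as.numeric(gsub(",", "", raw$amount, fixed = TRUE)),
  # Store categories as a factor with an explicit level order.
  tier   = factor(raw$tier, levels = c("silver", "gold"))
)

# Show the resulting class of each column.
sapply(typed, class)
```

Once the columns carry the right types, date arithmetic, numeric summaries, and factor-aware modelling all work without further conversion.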
Many packages are available for wrangling, engineering, and aggregating data, and the {base} R package in particular should not be overlooked, since it offers a lot of great and powerful features. The following is a list of those most widely used in the R community that make data easy to maneuver:
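To show what {base} R already offers, here is a short sketch of filtering, selecting, ordering, and aggregating on the built-in `mtcars` data set. The result names `fast` and `cyl_means` are illustrative.

```r
# Filter rows and select columns in one step.
fast <- subset(mtcars, hp > 200, select = c(mpg, cyl, hp))

# Order by horsepower, highest first.
fast <- fast[order(-fast$hp), ]

# Aggregate: mean mpg and hp for each number of cylinders.
cyl_means <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)

head(fast)
cyl_means
```

Wrangling packages express the same operations with chained verbs, which many people find easier to read on longer pipelines.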
Many statistical tests (Shapiro-Wilk, t-test, Wilcoxon, equality tests, …) are available in the base and stats packages that ship with the R engine. This is great, because R is primarily a statistical language and many of the tests are already included. Here are some additional packages that I have used:
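For reference, the built-in tests mentioned above can be called as in the following sketch. The two samples are simulated, so the p-values only illustrate the calls and carry no meaning of their own.

```r
# Two simulated samples with different means (illustrative data only).
set.seed(42)
a <- rnorm(30, mean = 10, sd = 2)
b <- rnorm(30, mean = 12, sd = 2)

# Shapiro-Wilk test of normality on one sample.
normality <- shapiro.test(a)

# Welch two-sample t-test (the default in t.test()).
t_res <- t.test(a, b)

# Wilcoxon rank-sum test as a non-parametric alternative.
w_res <- wilcox.test(a, b)

# Collect the p-values side by side.
c(shapiro = normality$p.value, t_test = t_res$p.value, wilcoxon = w_res$p.value)
```

Each call returns an object of class `htest`, so statistics, p-values, and confidence intervals can be extracted in the same way across tests.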
Data sampling, working with samples and populations, inference, weights, and different types of statistical sampling can be found in these brilliant packages, including those that are great for survey data.
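As a baseline, simple random and stratified sampling can be sketched in base R as follows. The sample sizes and the use of the built-in `iris` data are arbitrary choices for the example.

```r
set.seed(1)

# Simple random sample of 30 rows, without replacement.
idx <- sample(nrow(iris), size = 30)
srs <- iris[idx, ]

# Stratified sample: 10 rows from each species.
strata <- split(seq_len(nrow(iris)), iris$Species)
strat_idx <- unlist(lapply(strata, function(rows) sample(rows, size = 10)))
stratified <- iris[strat_idx, ]

# Each stratum contributes the same number of rows.
table(stratified$Species)
```

Survey-oriented packages go further by carrying design weights through to estimates and standard errors, which this sketch does not attempt.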
Depending on the type of variable, the type of analysis, and the results a statistician wants to get, there is a list of packages that should be part of the daily R environment when it comes to statistical analysis.
Distribution and data dispersion are core to understanding the data. Many of the tests for variance are already built into the R engine (package stats), but here are some others that might be useful for analyzing variance.
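The built-in variance tools from the stats package can be used as in this sketch, which runs an F test for two groups, Bartlett's test for three groups, and a one-way ANOVA on the `iris` data. The choice of variable and groups is only for illustration.

```r
# F test comparing the variances of two groups.
two_groups <- subset(iris, Species != "setosa")
two_groups$Species <- droplevels(two_groups$Species)
f_res <- var.test(Sepal.Length ~ Species, data = two_groups)

# Bartlett's test of equal variances across all three species.
b_res <- bartlett.test(Sepal.Length ~ Species, data = iris)

# One-way analysis of variance on the group means.
fit_aov <- aov(Sepal.Length ~ Species, data = iris)

f_res$p.value
b_res$p.value
summary(fit_aov)
```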
Using more than two variables is considered multivariate analysis. Regression analysis and analysis of variance (between 2+ variables) are excluded here, since they are introduced in section 4.1. This section covers statistical analysis of many variables, such as factor analysis, principal axis component, canonical analysis, discrete analysis, and others:
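As one example of a multivariate technique, here is a sketch of principal component analysis with `prcomp()` from the stats package. It stands in for the broader family named above; the `iris` measurements are used only because they ship with R.

```r
# PCA on the four numeric iris measurements, centered and scaled.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of total variance explained by each component.
explained <- pca$sdev^2 / sum(pca$sdev^2)
round(explained, 3)

# Component scores for the first few observations.
head(pca$x)
```

Scaling matters here because the variables are on different ranges; without it, the component with the largest raw variance would dominate.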
There are many packages covering the different types of clustering and classification. Some of the essential packages for clustering:
# install.packages(c("fpc","cluster","treeClust","e1071","NbClust","skmeans", "kml","compHclust","protoclust","pvclust","genie", "tclust", "ClusterR","dbscan","CEC","GMCM","EMCluster","randomLCA", "MOCCA","factoextra","poLCA"), dependencies = TRUE)
# and for classification:
# install.packages(c("tree", "e1071"), dependencies = TRUE)
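For comparison with the packages above, the stats package already covers k-means and hierarchical clustering. The sketch below runs both on scaled `iris` measurements; the choice of three clusters and Ward linkage is an assumption made for the example.

```r
set.seed(123)

# Scale the features so that no single variable dominates the distances.
features <- scale(iris[, 1:4])

# k-means with several random starts to avoid a poor local optimum.
km <- kmeans(features, centers = 3, nstart = 25)

# Agglomerative clustering with Ward linkage, cut into three groups.
hc <- hclust(dist(features), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate the two cluster assignments.
table(kmeans = km$cluster, hclust = hc_groups)
```

Cluster labels are arbitrary, so the cross-table shows agreement between the two methods rather than matching label numbers.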
Analysing time series and time-series-type data is easier with the following packages:
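As a starting point before adding packages, here is a sketch of decomposition and a simple forecast with the stats package on the built-in `AirPassengers` series. Holt-Winters with multiplicative seasonality is just one possible model for this data.

```r
# Monthly airline passenger counts, 1949 to 1960, from the datasets package.
ap <- AirPassengers

# Split the series into trend, seasonal, and random components.
parts <- decompose(ap, type = "multiplicative")

# Fit Holt-Winters exponential smoothing with multiplicative seasonality.
hw <- HoltWinters(ap, seasonal = "multiplicative")

# Forecast the next 12 months.
fc <- predict(hw, n.ahead = 12)
fc
```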
Analyzing networks is also part of statistical analysis. Some of the relevant packages:
Besides analyzing open text, one can analyse any kind of text, including the word corpus, the semantics, and much more. A couple of starting packages:
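To give a feel for the first step of most text analysis, the sketch below tokenizes a few sentences and counts word frequencies with base R only. The sentences in `docs` are invented, and the tokenizer is deliberately naive (lowercase, split on anything that is not a letter).

```r
# A tiny made-up corpus.
docs <- c(
  "R is great for statistics",
  "Statistics is great fun",
  "Text mining in R"
)

# Lowercase and split on any run of non-letter characters.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Word frequencies across the whole corpus, most frequent first.
freq <- sort(table(unlist(tokens)), decreasing = TRUE)
freq
```

Text mining packages add stop-word removal, stemming, and document-term matrices on top of this basic counting.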
R has a variety of good machine learning packages that are powerful and give you the full machine learning cycle. The sections are broken down in their natural order:
Once you build one or more models and compare their results, it is also important to validate the models against the test or any other datasets. Here are powerful packages for model validation.
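The idea of validating against held-out data can be sketched with a manual k-fold cross-validation in base R. The model `mpg ~ wt + hp`, the five folds, and RMSE as the metric are all assumptions chosen for the example.

```r
set.seed(7)
k <- 5

# Assign each row of mtcars to one of k folds at random.
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

rmse <- numeric(k)
for (i in seq_len(k)) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]

  # Fit on the training folds only.
  fit <- lm(mpg ~ wt + hp, data = train)

  # Score on the held-out fold.
  pred <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

# Average out-of-sample error across folds.
mean(rmse)
```

Validation packages automate the fold bookkeeping and support repeated or stratified resampling, but they follow the same fit-on-train, score-on-test loop.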
There are many regression-type machine learning algorithms, with additional boosting or gradient variants. Some very usable packages:
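As a baseline to compare the packages against, linear and logistic regression are available in the stats package. The sketch below fits both on `mtcars`; the chosen predictors are illustrative only.

```r
# Linear regression: fuel economy explained by weight and horsepower.
lin <- lm(mpg ~ wt + hp, data = mtcars)
coef(lin)

# Logistic regression: probability of a manual transmission given weight.
logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Fitted probabilities on the training data.
probs <- predict(logit, type = "response")
head(round(probs, 3))
```

Boosted and regularized models from other packages use different fitting functions but usually keep the same formula-and-predict workflow.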
Classification problems are covered by many packages, and many of them are also great for machine learning cases. A handful:
There are many types of neural networks, and many different packages will give you all types of NN. Here are only a couple of very useful R packages to tackle neural networks.
R has embraced deep learning, and many of the powerful SDKs and packages have been ported to R, making them very usable for R developers and the R machine learning community.
Reinforcement learning is gaining popularity, and more and more packages are being developed in R as well. Some of the very useful packages:
#devtools::install_github("nproellochs/ReinforcementLearning")
#install.packages(c("RLT","ReinforcementLearning","MDPtoolbox"), dependencies = TRUE)
Results of machine learning models can be a black box. Many packages aim to make the black box more like a “glass box”, making the models more understandable, interpretable, and explainable. Here are very powerful packages that do just that for many different machine learning algorithms.
Visualisation of the data is not only the final step in understanding the data, but can also bring clarity to interpretation and to building the mental model around the data. A couple of packages that will help boost the visualization:
Many R packages are specifically designed to scrape (harvest) data from a particular website, API, or archive. Here are only a couple of very generic ones:
To organize your documents (files, code, packages, diagrams, pictures) in a readable document and present it as a dashboard or book view, there are a couple of packages for this purpose: