Data science is an interdisciplinary field used across domains to extract meaningful insights from structured and unstructured data in order to make important decisions. R is a powerful tool that has excellent statistical and visualisation capabilities, making it very attractive to data scientists.
Data science is the process of understanding data and deriving meaningful and essential insights from abundant structured and unstructured data. This field is most relevant in the modern context as every organisation possesses multitudes of data of every kind, but often struggles to get meaningful insights out of it. What data science processes need is a proper algorithm to process the information and the right tool to execute the algorithm with ease.
R is a powerful tool for executing data science algorithms and is capable of working with large volumes of data. It provides a wide variety of linear and non-linear models, classical statistical tests, time series analysis, machine learning capabilities (e.g., classification, clustering, regression and reinforcement learning) and excellent visualisation techniques. It is an integrated suite of software tools for data science processes.
The main characteristics of R are:
- An effective data handling and storage facility
- A suite of operators for calculations on vectors, arrays and matrices
- Many integrated tools and packages for the analysis of structured and unstructured data
- Excellent visualisation capabilities to represent the data in pictorial form
- A simple and effective programming interface to manipulate data and to devise self-learning algorithms
- A mature environment for statistical computing
- Great documentation to provide in-depth explanations of every function and package
Features of R applicable to data science
Powerful data wrangling: Data wrangling is essential to any data science project as it cleans, restructures and enriches the raw data, converting it into a more usable format. R provides several standard functions to deal with special values in data. For instance, missing values are represented by NA in R, and the functions anyNA(), is.na(), na.omit(), na.exclude(), na.fail(), na.pass() and complete.cases() help to detect or handle them, while is.finite() identifies values that are neither missing nor infinite.
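As a minimal sketch of how some of these functions behave, the snippet below uses a small made-up data frame (the column names and values are illustrative assumptions, not data from this article) and relies only on base R.

```r
# Toy data frame with missing values (illustrative data only)
df <- data.frame(
  id    = 1:5,
  score = c(10, NA, 30, NA, 50),
  grade = c("A", "B", NA, "D", "E")
)

any_missing   <- anyNA(df)            # TRUE if any value in df is NA
missing_count <- colSums(is.na(df))   # number of NAs in each column
complete_rows <- complete.cases(df)   # TRUE for rows that contain no NA
cleaned       <- na.omit(df)          # drops every row that contains an NA

# is.finite() is FALSE for NA, NaN, Inf and -Inf
finite_flags <- is.finite(c(1, NA, Inf, NaN))

print(missing_count)
print(cleaned)
```

Here na.omit() keeps only the rows for which complete.cases() is TRUE, so the two calls are different views of the same check.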
Extensive support for statistical modelling: Statistical modelling is essential to determine how one variable is related to others. R provides powerful capabilities for statistical modelling. It has excellent functions for measures of central tendency and variability, probability, hypothesis testing, ANOVA and regression analysis.
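The following sketch touches each of these areas using base R and the built-in mtcars data set. The choice of data set and of variables (mpg, am, cyl, wt) is an assumption made purely for illustration.

```r
data(mtcars)

# Central tendency and variability
mean_mpg <- mean(mtcars$mpg)
sd_mpg   <- sd(mtcars$mpg)

# Hypothesis test: does mpg differ between automatic (am = 0)
# and manual (am = 1) cars? (Welch two-sample t-test)
tt <- t.test(mpg ~ am, data = mtcars)

# One-way ANOVA: mpg across the number of cylinders
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)

# Simple linear regression: mpg explained by weight
fit_lm <- lm(mpg ~ wt, data = mtcars)

print(tt$p.value)
print(summary(fit_aov))
print(coef(fit_lm))
```

The formula interface (response ~ predictors) is shared by all three modelling calls, which is a large part of what makes statistical work in R consistent.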
Great ETL facilities: R provides powerful functionalities for ETL (extract, transform and load) in data science applications. It provides excellent interfaces for many databases, and even for Excel-type spreadsheet programs.
Connection with NoSQL databases: Many data science projects deal with unstructured data. R provides interfaces to NoSQL databases and can analyse unstructured data effectively.
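To make the three ETL stages concrete, here is a small sketch in base R. It uses temporary CSV files as a stand-in for a real database or spreadsheet source, and the city and temperature values are invented for the example.

```r
# Extract: create a small CSV in a temporary file, then read it
src <- tempfile(fileext = ".csv")
writeLines(c("city,temp_f", "Pune,86", "Delhi,95", "Kochi,82"), src)
raw <- read.csv(src, stringsAsFactors = FALSE)

# Transform: convert Fahrenheit to Celsius and keep the warmer cities
raw$temp_c <- round((raw$temp_f - 32) * 5 / 9, 1)
warm <- raw[raw$temp_c >= 30, c("city", "temp_c")]

# Load: write the result to a destination file and read it back to check
dest <- tempfile(fileext = ".csv")
write.csv(warm, dest, row.names = FALSE)
loaded <- read.csv(dest, stringsAsFactors = FALSE)

print(loaded)
```

With a real database, the extract and load steps would typically go through a database interface package instead of read.csv() and write.csv(), while the transform step would stay much the same.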
Support for machine learning algorithms: Machine learning algorithms fall into four main categories: supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning. R handles all of these well. Both supervised learning techniques, classification and regression, are handled effectively with R, which has standard functions for linear regression, logistic regression, linear discriminant analysis, K-nearest neighbours, decision trees, neural networks and support vector machines. R also deals effectively with unsupervised learning problems related to clustering and association. In semi-supervised learning problems, some data is labelled and most of it is not; R has functions to deal with these kinds of problems as well.
In reinforcement learning, an agent learns the best behaviour through trial-and-error interactions with a live environment. R also has packages to deal with such problems.
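The supervised and unsupervised cases can be sketched with base R alone. The example below fits a logistic regression on the built-in mtcars data and runs k-means clustering on the built-in iris data; the data sets, the 0.5 classification threshold and the choice of three clusters are all assumptions made for illustration.

```r
# Supervised learning: logistic regression predicting the
# transmission type (am: 0 = automatic, 1 = manual) from weight
data(mtcars)
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
pred_prob <- predict(logit_fit, type = "response")  # fitted probabilities
pred_cls  <- as.integer(pred_prob > 0.5)            # threshold at 0.5
accuracy  <- mean(pred_cls == mtcars$am)            # training accuracy

# Unsupervised learning: k-means clustering on the four
# numeric measurements of the iris data
data(iris)
set.seed(42)  # k-means starts from random centres
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

print(accuracy)
print(table(cluster = km$cluster, species = iris$Species))
```

The accuracy here is measured on the training data, so it is an optimistic estimate; a real project would hold out a test set or use cross-validation.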
Packages of R applicable to data science
Data wrangling and analysis packages: Data wrangling is an essential step within data science. It refers to the process of scrubbing, reorganising and enriching the raw data into more usable formats. Some popular data wrangling and analysis packages are given below.
dplyr: Hadley Wickham wrote this package for data wrangling tasks. It makes data manipulation easy, consistent and fast. With it, you can select, filter and aggregate data. It is best suited to R data frames.
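A short sketch of selecting, filtering and aggregating with dplyr follows. It assumes the dplyr package is installed and uses the built-in mtcars data; the mpg > 20 cut-off is an arbitrary choice for the example.

```r
library(dplyr)

summary_tbl <- mtcars %>%
  filter(mpg > 20) %>%              # keep the more fuel-efficient cars
  select(mpg, cyl, wt) %>%          # keep only the columns of interest
  group_by(cyl) %>%                 # one group per cylinder count
  summarise(
    mean_mpg = mean(mpg),           # average mpg within each group
    n        = n()                  # number of cars in each group
  ) %>%
  arrange(cyl)

print(summary_tbl)
```

Each verb takes a data frame and returns a data frame, which is why the steps chain together with the pipe operator.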
purrr: This package is also written and maintained by Hadley Wickham. Its functions take a vector (or list) and a function as input, and apply the function to each element. map() is the main function of this package. Variants such as map_dbl() and map_chr() let you specify the type of the output.
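As a hedged illustration, the sketch below uses map() and two of its typed variants, map_dbl() and map_chr(), from the purrr package (assumed to be installed). The input list is made up for the example.

```r
library(purrr)

values <- list(a = 1:3, b = 4:6, c = 7:9)

# map() always returns a list, one element per input element
means_list <- map(values, mean)

# map_dbl() returns a double vector and fails if a result is not a single number
means_dbl <- map_dbl(values, mean)

# map_chr() returns a character vector; ~ and .x define an inline function
labels <- map_chr(values, ~ paste(.x, collapse = "-"))

print(means_dbl)
print(labels)
```

Choosing a typed variant is how the shape of the output is fixed in advance, which makes errors surface early instead of propagating as unexpected list structures.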
tidyxl: Duncan Garmonsway is the author of this package. It imports non-tabular data from Excel files into R. It supports XML-based file formats such as .xlsx, and is a great package for manipulating Excel data.
Hmisc: This is a powerful package for data analysis in R. It was developed by Frank E. Harrell Jr. It contains many functions useful for data analysis, as well as functions for importing and annotating data sets, imputing missing values and manipulating character strings.
sqldf: G. Grothendieck designed this powerful package for data wrangling and analysis based on SQL statements. It loads R data frames into a database behind the scenes so that you can run SQL statements on them from within R.
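A minimal sketch of the idea, assuming the sqldf package (and its default SQLite backend) is installed: the query below treats the built-in mtcars data frame as if it were a database table.

```r
library(sqldf)

# The data frame name is used directly as the table name in the query
result <- sqldf("
  SELECT cyl, COUNT(*) AS n, AVG(mpg) AS mean_mpg
  FROM mtcars
  GROUP BY cyl
  ORDER BY cyl
")

print(result)
```

The result comes back as an ordinary data frame, so it can be passed straight to other R functions.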
Data import and display packages: Importing data and displaying it in appropriate ways are among the most important tasks in data science. R has several popular packages in the data import and display category.
readxl: This is one of the most popular packages for importing data from Excel files. Developed by Hadley Wickham and Jennifer Bryan, it reads Excel files into R quickly and has no external dependencies.
readr: This package was also written by Hadley Wickham. It reads CSV and other delimited files quickly, which makes it well suited to large files. A similar package in this category is vroom, developed by Jim Hester.
rio: This was designed by Thomas J. Leeper. It supports web-based import, including from SSL (HTTPS) URLs. Compressed files can also be read directly without explicit decompression.
datapasta: Miles McBain is one of the authors of this package. If you have copied any data from the Web or a spreadsheet, and want to paste it into an R object so that it can be reproduced, this package will work for you.
httr: This package is useful for pulling data from Web APIs. It provides functions for the most important HTTP verbs: GET(), HEAD(), PATCH(), PUT(), DELETE() and POST(). It was designed by Hadley Wickham.
Data visualisation packages: In data science, visualisation is crucial as it displays the outcome in pictorial form so that anybody can understand it. It is also very useful for exploratory data analysis. Some important data visualisation packages in R are listed below.
ggplot2: This is one of the most powerful visualisation packages in R. It is based on the grammar of graphics. With its help, we can build custom plots more easily. It has two main entry points, qplot() and ggplot(), although qplot() is deprecated in recent releases. It was designed by Hadley Wickham.
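The layered approach can be sketched as follows, assuming ggplot2 is installed. The plot uses the built-in mtcars data, and the aesthetic mappings, labels and output file are choices made only for this example.

```r
library(ggplot2)

# Build the plot layer by layer: data and mappings, then geoms, then labels
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                               # scatter layer
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # one trend line per group
  labs(
    x      = "Weight (1000 lbs)",
    y      = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()

# Save the plot to a temporary PNG file
out <- tempfile(fileext = ".png")
ggsave(out, plot = p, width = 6, height = 4)
```

Because the plot is an ordinary R object, layers can be added or swapped later without rebuilding it from scratch.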
lattice: This package is good for multivariate data. It is inspired by Trellis Graphics. Built on the grid package, it was written by Deepayan Sarkar.
highcharter: This is an interactive visualisation package in R. It was designed by Joshua Kunst and is very useful for dynamic charting. This package has easy-to-customise themes for interactive visualisation.
Leaflet: Joe Cheng, Bhaskar Karambelkar and Yihui Xie wrote this package. It is lightweight but powerful enough to build interactive maps.
RColorBrewer: This is very handy to manipulate colours in plots, graphs and maps. Erich Neuwirth designed this package, using which you can design nice colour palettes.
plotly: This is also an interactive visualisation package consisting of different categories of charts. The main attractions of this package are contour plots, candlestick charts and 3D charts.
We are not able to accommodate all the packages related to data science in this article but have included all the main ones essential to address the basic functionalities required in this field.