Hadoop integration with R

Tom Cruser 06-May-2016

Developers and programmers continue to explore ways to leverage the distributed computation of MapReduce and the almost limitless storage capacity of HDFS in an intuitive manner that R can exploit. Integration of Hadoop with R is ongoing, with offerings from IBM (Big R, as part of BigInsights) and Revolution Analytics (Revolution R Enterprise). Bridging solutions that integrate high-level programming and query languages with Hadoop, such as RHive and RHadoop, are also available. Each of these systems aims to bring the deep analytical capabilities of the R language to massive data sets. Here, we briefly examine some of these efforts to marry Hadoop’s scalability with R’s statistical capabilities.

RHive

The RHive framework serves as a bridge between the R language and Hive. RHive brings the rich statistical libraries and algorithms of R to data stored in Hadoop by extending Hive’s SQL-like query language (HiveQL) with R-specific functions. Through the RHive functions, we can use HiveQL to apply R statistical models to data in our Hadoop cluster that we have catalogued using Hive.
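The idea of extending a SQL-like language with functions from a host language can be illustrated with a small analogy. The sketch below is not RHive and does not use Hive; it assumes Python's built-in sqlite3 module as a stand-in query engine, and the table `readings`, the class `Median`, and the SQL function name `sample_median` are hypothetical names chosen for illustration.

```python
import sqlite3
import statistics

# In-memory database standing in for a catalogued table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
)


class Median:
    """Aggregate written in the host language and exposed to SQL."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the current group.
        self.values.append(value)

    def finalize(self):
        # Called once per group to produce the aggregate result.
        return statistics.median(self.values)


# Register the host-language statistic so that queries can call it by name.
conn.create_aggregate("sample_median", 1, Median)

rows = conn.execute(
    "SELECT sensor, sample_median(value) FROM readings "
    "GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)
```

The query text stays declarative, while the statistic itself is computed by code from the host language. That is roughly the division of labour the paragraph above describes for HiveQL and R.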

RHadoop

Another open source framework available to R programmers is RHadoop, a set of packages intended to help manage the distribution and analysis of data with Hadoop. Three packages provide the main functionality of RHadoop: rmr2, rhdfs, and rhbase.

rmr2: This package supports translation of the R language into Hadoop-compliant MapReduce tasks (producing effective, low-level MapReduce code from higher-level R code).

rhdfs: This package offers an R language API for file management on top of HDFS stores. Using rhdfs, users can read from HDFS stores into an R data frame (matrix) and, in the same way, write data from these R matrices back into HDFS storage.

rhbase: This package also offers an R language API, but its purpose is database management for HBase stores rather than HDFS files.
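To make the MapReduce model that rmr2 targets more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It does not use rmr2 or Hadoop; the function names (`run_job`, `word_mapper`, `count_reducer`) and the sample lines are hypothetical and serve only to show the shape of a job.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Chain the three phases in one process (a real cluster distributes them)."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


# Hypothetical job: count how often each word occurs.
def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    return sum(values)


lines = ["R on Hadoop", "Hadoop stores data", "R analyses data"]
counts = run_job(lines, word_mapper, count_reducer)
print(counts)
```

The analyst only writes the mapper and the reducer. Distribution, grouping by key, and storage are left to the framework, which is the convenience a higher-level wrapper aims to provide.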

Revolution R

Revolution R (from Revolution Analytics) is a commercial R offering that supports R integration on Hadoop distributed systems. Revolution R promises to deliver improved performance, functionality, and usability for R on Hadoop. To provide deep analytics akin to R, Revolution R uses the company’s ScaleR library, a set of statistical analysis algorithms developed specifically for enterprise-scale big data collections. ScaleR is designed to deliver fast execution of R program code on Hadoop clusters, allowing the R developer to concentrate exclusively on their statistical algorithms and not on MapReduce. It also handles numerous analytics tasks, such as data preparation, visualization, and statistical tests.

IBM BigInsights Big R

Big R provides end-to-end integration between R and IBM’s Hadoop offering, BigInsights, enabling R developers to analyse Hadoop data. The aim is to exploit R’s programming syntax and coding paradigms while ensuring that the data operated upon stays in HDFS. R datatypes serve as proxies to these data stores, which means R developers don’t need to think about low-level MapReduce constructs or any Hadoop-specific scripting languages (for example, Pig). BigInsights Big R technology supports multiple data sources, including flat files, HBase, and Hive storage formats, and provides concurrent and partitioned execution of R code within the Hadoop cluster. It also hides many of the complexities of the underlying HDFS and MapReduce frameworks.
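The proxy idea, in which a local object stands in for data that stays in the cluster and computations run partition by partition, can be sketched as follows. This is not Big R's API; the class `PartitionedColumn` is hypothetical, and plain in-memory lists stand in for data blocks that would really live in HDFS.

```python
class PartitionedColumn:
    """Local stand-in for a column whose values live in several partitions."""

    def __init__(self, partitions):
        # Assumption: lists in memory play the role of remote data blocks.
        self._partitions = partitions

    def _map_partitions(self, func):
        # Run func on each partition independently; on a cluster these calls
        # would execute next to the data rather than in the client process.
        return [func(part) for part in self._partitions]

    def count(self):
        return sum(self._map_partitions(len))

    def mean(self):
        # Each partition returns only a small summary (sum, count), so the
        # raw values never need to be gathered in one place.
        partials = self._map_partitions(lambda part: (sum(part), len(part)))
        total = sum(s for s, _ in partials)
        n = sum(c for _, c in partials)
        return total / n


delays = PartitionedColumn([[5, 15], [10], [0, 20, 10]])
print(delays.count(), delays.mean())
```

The caller works with `delays` as if it were an ordinary column, while only per-partition summaries travel back to the client.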
