We all are already familiar with log data, relational data, text data, and binary data, but we will soon hear about another form of information: graph data. In its simplest form, a graph is simply a collection of nodes (an entity, for example — a person, a department, or a company), and the lines connecting them are edges (this represents a relationship between two entities, for example two people who know each other). What makes graphs interesting is that they can be used to represent concepts such as relationships in a much more efficient way than, say, a relational database. Social media is an application that immediately comes to mind — indeed, today’s leading social networks (Facebook, Twitter, LinkedIn, and Pinterest) are all making heavy use of graph stores and processing engines to map the connections and relationships between their subscribers.
With the introduction of NoSQL movement, the graph database is one major category of alternative data-storage technologies. Initially, the predominant graph store was Neo4j, an open source graph database. But now the use of Apache Giraph, a graph processing engine designed to work in Hadoop, is increasing rapidly. Using YARN, we expect Giraph adoption to increase even more because graph processing is no longer tied to the traditional MapReduce model, which was inefficient for this purpose. Facebook is reportedly the world’s largest Giraph shop, with a massive trillion-edge graph.
Graphs can represent any kind of relationship — not just people. One of the most common applications for graph processing now is mapping the Internet. When you think about it, a graph is the perfect way to store this kind of data, because the web itself is essentially a graph, where its websites are nodes and the hyperlinks between them are edges. Most PageRank algorithms use a form of graph processing to calculate the weightings of each page, which is a function of how many other pages point to it.
Hadoop is no doubt a “game-changer” tool, and most of the software development organizations start taking advantage of the potential value of Hadoop. When we use more data, we can make better decisions and predictions and guide better outcomes. In cases where we need to retain data for regulatory purposes and provide a level of query access, Hadoop is a cost-effective solution. The more a business depends on new and valuable analytics that are discovered in Hadoop, the more it wants. When we initiate successful Hadoop projects, our clusters will find new purposes and grow!