Oh, my dear friend “Big Data”. Oh, “Big Data” you are spawning a whole new industry. “Big Data” you are re-engineering the required skills of the stereotypical Information Technology (IT) specialist like never before. “Big Data” and your close relatives “Cloud Computing”, “Social Media” and “Mobile”, you are the new frontier of innovation.
From a personal standpoint, I generally avoid overused or faddish terms such as “Social Media”, “Cloud Computing” or “Big Data”, but these terms do serve a purpose. Therefore, for this blog post I will set aside my distaste for them and share some thoughts on “Big Data” and on how we, as an industry, can take a very complicated topic and break it down into logical methods that help solve “Big Data” problems.
Understanding Big Data
The best description I’ve ever heard of “Big Data” was quite simple, yet it powerfully illustrates the essence of the problems we are trying to solve. “Big Data” is defined as Volume, Variety and Velocity (Updated note: I would like to acknowledge Gartner for the original reference of the 3V’s: http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/). Let’s break down each of these items and see how their confluence is creating a great opportunity for savvy IT individuals to reinvent themselves as Big Data content management professionals.
Think about the Volume of content creation for a second. Volume is unquestionably increasing at incredible rates, and I don’t think most reasonable people would dispute this. More people than ever are using high-speed internet connections, those people are becoming more proficient at creating content, and more people in general are contributing information. These combined forces are causing the tremendous increase in Volume. The sheer number of content items that are created and stored is increasing, and it is reasonable to assume that individual file sizes are increasing as well.
Next in breaking down Big Data into easily digestible, bite-size chunks is the concept of Variety. Take your personal experience and think about how much information you create and contribute in your daily routine: your voicemails, your e-mails, your file shares, your TV viewing habits, your Facebook updates, your LinkedIn activity, your credit card transactions, and so on. The point is, whether you consciously think about it or not, the Variety of information you personally create on a daily basis, and which is being collected and analyzed, is simply overwhelming. In this basic example, a data analysis specialist would have to gather data from audio and video, receive data from third-party systems and, of course, have a computer understand the content, as well as the context, of images.
The last factor is Velocity. The speed at which data enters organizations these days is absolutely amazing. High-bandwidth internet access is now nearly commonplace and, combined with the proliferation of mobile devices, it gives people more opportunity than ever to contribute content to storage systems. Additionally, the ease of contributing information only encourages more creation and more storage of content.
Big Data. Making sense of electronic junkyards
Understanding some of the factors such as Volume, Variety and Velocity that are causing this perfect storm of Big Data growth can also help us solve problems. Now let’s ask ourselves the following questions and start to create solutions with one simple, yet highly effective concept (which I will explain below).
- Question: What is the root problem with solving Big Data issues?
- Answer: Too much information is introduced into systems without a proper “Index”, so computers cannot understand the information or the context of the data.
As individuals, businesses and organizations we are creating ‘electronic junkyards’. We are creating electronic junkyards because the images we upload have no context, much less an index. Things go in but they rarely come out with much value and, therefore, we have created a junkyard of nearly useless electronic information. There are many reasons, including the following:
- “Non-compliant” users who insist on using more useful tools than their corporate policy offers (i.e. consumerization). You might personally have used, or at least know of someone who used, an IT-unauthorized cloud storage service to share a large PowerPoint file, for example. In this case that information is simply non-discoverable and not available in the set of data for analysis.
Geoffrey Moore – “The Big Disconnect”
- IT reluctance to enforce business rules upon submission, for fear of a poor user experience and poor adoption of technology due to user frustration. In other words, IT continues to allow users to upload content without proper tagging, or metadata, associated with the content.
- Non-existent or inadequate back-end systems to transform electronic files into computer-readable indexes. For example, image-only PDF or TIFF files that are non-searchable offer only limited value for the purpose of data analytics. If IT departments choose not to enforce indexing by the users, for whatever reason, then a back-end process must be in place to help achieve indexing of this content.
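The last two reasons can be addressed together. As a rough sketch only, the Python below checks for required tags when a file is submitted and, rather than rejecting the upload and frustrating the user, routes untagged content to a back-end indexing queue. The required field names, the function names and the in-memory queue are all hypothetical, chosen purely for illustration; a real system would use its own metadata schema and job queue.

```python
from collections import deque

# Hypothetical set of tags an organization might require on every upload.
REQUIRED_FIELDS = ("title", "document_type", "author")

# Stand-in for a back-end indexing queue (for example OCR or data capture jobs).
backend_index_queue = deque()


def missing_fields(metadata):
    """Return the required fields that are absent or blank, in order."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def submit(filename, metadata):
    """Accept an upload; queue it for back-end indexing if tags are missing."""
    missing = missing_fields(metadata)
    if missing:
        # Keep the user experience smooth: accept the file and index it later.
        backend_index_queue.append((filename, missing))
        return "queued_for_indexing"
    return "indexed_on_submission"
```

The design choice here is that nothing enters the system without either an index or a pending job to create one, so content never lands in the junkyard unnoticed.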
Exploiting Big Data with Indexes (Simple and Obvious)
Now that we’ve defined some of the general technological factors causing this explosion of “Big Data”, and we’ve explained some of the human nature factors contributing to the challenges of managing “Big Data”, let’s offer a simple, yet extremely important, way to start gaining control of sets of data. It might be obvious to some and oversimplified to others, but Exploiting Big Data starts by capturing Indexes.
- Exploit: To ‘take advantage of’ to its full potential
- Big Data: The volume, variety and velocity of data
- Indexes: Computer understanding of content
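To make the idea of an Index a little more concrete, here is a minimal sketch of one common form, an inverted index, which maps each word to the documents that contain it. This is only one possible interpretation of “computer understanding of content”, and the function names and sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, word):
    """Return the sorted names of documents that contain the word."""
    return sorted(index.get(word.lower(), set()))


# Made-up sample content standing in for text extracted from real files.
documents = {
    "contract.txt": "Service agreement between Acme and Globex",
    "invoice.txt": "Invoice from Acme for consulting services",
}
index = build_index(documents)
print(search(index, "acme"))
```

Once such an index exists, finding every document that mentions a term is a lookup rather than a scan through every file.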
There are many effective ways to capture Indexes. A common automated method to index the content itself is to make a Searchable PDF file, a format most of us are familiar with. Another way to index automatically is via Data Capture technology, where only selected fields, such as an invoice number and vendor name, are extracted from an invoice instead of the full-page text. Data Capture is particularly useful for providing the ‘relevant index’ values rather than all index values. An example is an organization that processes contracts. In this case there is no need to index all the terms and conditions of the contract agreement, only the ‘relevant index’ values such as the parties involved in the agreement, the date and maybe a few other pertinent pieces of information. Newer methods can also simplify indexing, such as using touch-screen devices to index fields from images. Touch indexing makes the user experience much more enjoyable and therefore encourages indexing by the persons most familiar with the content.
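As a hedged illustration of the Data Capture idea, the sketch below pulls only an invoice number and a vendor name out of text that is assumed to have already been produced by OCR. The regular expressions, field names and sample text are assumptions made for this example; real invoices vary widely in layout, and commercial Data Capture products use far more robust techniques.

```python
import re

# Hypothetical patterns; real invoices vary widely in layout and wording.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "vendor_name": re.compile(r"Vendor\s*:\s*(.+)", re.IGNORECASE),
}


def capture_fields(ocr_text):
    """Return only the 'relevant index' fields found in the OCR text."""
    index = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            index[name] = match.group(1).strip()
    return index


# Made-up text standing in for the OCR output of a scanned invoice.
sample_text = """Vendor: ACME Supplies Ltd.
Invoice No: INV-2041
Terms: payment due within 30 days of receipt.
"""
print(capture_fields(sample_text))
```

Note that the payment terms line is ignored entirely: only the fields that matter for retrieval end up in the index.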
A well-designed solution can perform indexing at the time content is introduced to the system, after the fact on the server, or through a combination of both.
The proper architecture of an effective solution to Exploit Big Data with Indexes will depend on each organization’s requirements and needs. One of the most important things to know is that now, more than ever, there are many ways to achieve a highly efficient system while also delivering these capabilities at an affordable cost. The benefits of Exploiting Big Data are tremendous for organizations in many different industries.
However, to truly realize these benefits a solution must consist of techniques and methods for machines to make sense of electronic content. And making sense of electronic content starts with creating Indexes to Exploit Big Data.