Wednesday, February 18, 2015

Big Unstructured Data vs. Structured Relational Data

If left unmanaged, your data can become overwhelming, making it difficult to find the information you need when you need it. While software is designed to address archiving, discovery, compliance and so on, the overarching goal is almost always the same: to make managing and maintaining data a feasible task. This post looks at the two types of data you’re accustomed to working with, paying close attention to the differences between structured and unstructured data.

Unstructured Data

“Unstructured data refers to information that either does not have a pre-defined data model and/or is not organized in a predefined manner.”
In short, unstructured data is not useful when forced into a schema or table. Take email as an example. Certain values from an email fit neatly into a table: sender, recipient, email body and so on. But although you can have a column for the email body, the information stored in that column is of little use when analyzed as a plain column value. What questions could analysts ask of all the entries in the “email body” column, and could an ordinary table query answer them? In general, no.
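To make the email example concrete, here is a minimal Python sketch. The `Email` record, its field names and the sample messages are all invented for illustration. It shows that a question about the sender column has a direct answer, while a question about the body needs the text to be interpreted first (here with a naive keyword match, which is only a stand-in for real text analysis).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Email:
    # Hypothetical record: the field names are assumptions for illustration.
    sender: str
    recipient: str
    body: str  # free text with no predefined data model


emails = [
    Email("ann@example.com", "bob@example.com", "Invoice attached, please pay by Friday."),
    Email("ann@example.com", "cat@example.com", "Lunch on Friday?"),
    Email("bob@example.com", "ann@example.com", "The invoice total looks wrong to me."),
]

# Structured column: a simple aggregate answers "who sends the most email?"
emails_per_sender = Counter(e.sender for e in emails)

# Unstructured column: grouping or counting raw bodies tells us nothing,
# so the text has to be interpreted first. A keyword match is the crudest option.
mentions_invoice = [e for e in emails if "invoice" in e.body.lower()]

print(emails_per_sender.most_common(1))
print(len(mentions_invoice), "emails mention an invoice")
```

The sender count works because every value in that column means the same kind of thing. The body column only becomes comparable after some processing step turns each message into a value that does.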
Looking at the illustration, it is clear that social media plays a heavy role in unstructured data. In addition to social media, there are many other common forms of unstructured data:
  • Word Docs, PDFs and Other Text Files - Books, letters, other written documents, audio and video transcripts
  • Audio Files - Customer service recordings, voicemails, 911 phone calls
  • Presentations - PowerPoints, SlideShares
  • Videos - Police dash cam, personal video, YouTube uploads
  • Images - Pictures, illustrations, memes
  • Messaging - Instant messages, text messages
Unstructured data is a valuable slice of any business's data pie. Tools that are widely accessible today can help businesses use this data to its full potential.

Structured Data
In contrast to unstructured data, structured data is data that can be easily organized. Despite its simplicity, most experts in today’s data industry estimate that structured data accounts for only 20% of the data available. It is clean, analytical and usually stored in databases.
Today, big data tools and apps have allowed for the exploration of structured data that was once too expensive to gather and store. Some examples of structured data:
  • Machine Generated - Sensor data, point-of-sale data, call detail records, web server logs (page requests and other server activity)
  • Human Generated - Input data, meaning any data entered into a computer: age, zip code, gender, etc.
Although it is outnumbered by its unstructured brother, structured data has always played, and will always play, a critical role in data analytics. It functions as a backbone for critical business insights. Without structured data, it is difficult to know where to find the insights hiding in your unstructured data sets.
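As a small illustration of why structured data is described as clean and analytical, the sketch below loads a few human-generated input records into an in-memory SQLite table and summarizes them with one query. The `customers` table, its columns and the sample values are assumptions made up for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table holding the kind of input data mentioned above.
conn.execute("CREATE TABLE customers (age INTEGER, zip_code TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(34, "10001", "F"), (51, "10001", "M"), (27, "94105", "F")],
)

# Because every row follows the same schema, one query can summarize all of it.
rows = conn.execute(
    "SELECT zip_code, COUNT(*), AVG(age) "
    "FROM customers GROUP BY zip_code ORDER BY zip_code"
).fetchall()

for zip_code, customer_count, average_age in rows:
    print(zip_code, customer_count, average_age)

conn.close()
```

The query needs no knowledge of what any individual row says. The predefined columns carry the meaning, which is exactly what free text lacks.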



Limitations of Data Warehousing
Data warehousing is evolving, but it does have certain limitations.


1. Extra Reporting Work:
Traditionally, data warehousing involves a periodic 'scheduled push' of data (for example, once a day) from operational data sources into the data warehouse architecture. However, there is a growing demand for real-time analysis and reporting, and data warehousing must adapt to this demand.
2. Analyzing unstructured data:
The Data Warehouse model described in the previous section does have a module (such as Hadoop) to store unstructured data. However, data warehouses frequently need to analyze unstructured data together with structured data to produce meaningful results at a particular grain. How best to design for this is still an open question.
3. Data Ownership Concerns:
As businesses become cross-functional and data is distributed, security is an ongoing concern. Enterprises have to define clear business processes with regard to the security of disparate data.
4. Cost/Benefit Ratio:
The cost of building, integrating and maintaining data warehouses is still high. While in the past the main cost was storage, nowadays it is integration and maintenance.
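The second limitation, analyzing unstructured and structured data together at a particular grain, can be sketched in a few lines of Python. Everything here is hypothetical: the daily sales figures, the customer comments and the keyword rule are invented for illustration, and the keyword match stands in for whatever text analytics a real system would use.

```python
from collections import defaultdict
from datetime import date

# Structured side: daily sales totals, as a warehouse fact table might hold them.
daily_sales = {date(2015, 2, 16): 1200.0, date(2015, 2, 17): 950.0}

# Unstructured side: free-text customer comments, each tagged with a date.
comments = [
    (date(2015, 2, 16), "Great service, thanks!"),
    (date(2015, 2, 17), "My order arrived late again."),
    (date(2015, 2, 17), "Late delivery and no apology."),
]

# Step 1: reduce each comment to a structured signal.
# Keyword matching is an assumption standing in for real text analytics.
late_mentions = defaultdict(int)
for day, text in comments:
    if "late" in text.lower():
        late_mentions[day] += 1

# Step 2: join both sides at the shared grain (one row per day).
combined = [
    {"day": day, "sales": sales, "late_mentions": late_mentions.get(day, 0)}
    for day, sales in sorted(daily_sales.items())
]

for row in combined:
    print(row["day"].isoformat(), row["sales"], row["late_mentions"])
```

The design difficulty is mostly in step 1: the free text has to be summarized at the same grain as the structured facts (here, per day) before the two can sit in one result.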

Future of Data Warehousing
Data analytics can move beyond the limitations imposed by the lack of structure in unstructured data and can now use all forms of data together in a single context for analysis. Such a capability holds tremendous promise for the future of analytics.
The video below, from the CEO of Xurmo Technologies, gives us better insight.


“In the past, companies couldn’t integrate these disparate technologies with the data warehouse because each technology required different file formats and data schemas,” says Stackowiak. “Today, you can integrate these technologies, and the result is that companies can access more of their data—not just the 20 percent from enterprise systems—and convert it into valuable, profitable information.”

Companies interested in building out their traditional data warehouse infrastructures may consider starting with reporting, if they don’t already have reporting capabilities in place, suggests Solari. Then, they can begin integrating analytics technologies into their reporting framework.

“When companies start bringing this data together and federating it inside a data warehouse, the total cost of ownership for the data warehouse may begin to go down while the ROI goes up,” says Solari. “The ability to integrate big data technologies, analytics technologies, back office systems, and traditional data warehouses has the potential to fundamentally change the economics of data warehousing for the better.”

References
http://smartdatacollective.com/michelenemschoff/206391/quick-guide-structured-and-unstructured-data

http://www.sherpasoftware.com/blog/structured-and-unstructured-data-what-is-it

http://www.computerweekly.com/feature/How-to-manage-unstructured-data-for-business-benefit


http://deloitte.wsj.com/cio/2013/07/17/the-future-of-data-warehouses-in-the-age-of-big-data/
