Wednesday, April 1, 2015

Moore's Law, Cloud Computing and DW/BI

In this blog I examine the concepts behind Cloud Computing and Moore's Law, and how they have affected the world of Data Warehousing and Business Intelligence (DW/BI), via a few examples.


What is Moore's Law?

Moore's Law is the observation that the number of transistors on an integrated circuit doubles roughly every two years, and that this progression will continue until some physical wall is reached. Moore also observed that computing power and cost move in opposite directions. The simplified version of the law states that processor speed, or overall processing power, doubles about every two years while the cost per unit of computing falls correspondingly.
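To make the doubling concrete, here is a small back-of-the-envelope sketch (my own illustration, not from any vendor figures) of how capability compounds under Moore's Law:

```python
# Illustrative only: project Moore's-Law-style doubling of processing
# power, and the corresponding fall in cost per unit of compute.

def moores_law_growth(years, doubling_period=2.0):
    """Growth multiplier after `years`, doubling every `doubling_period` years."""
    return 2 ** (years / doubling_period)

# Over a decade, capability grows 32x -- and the cost of a fixed amount
# of computing shrinks by the same factor.
multiplier = moores_law_growth(10)
print(multiplier)        # 32.0
print(1 / multiplier)    # 0.03125 -> about 3% of the original cost
```

This compounding is exactly why the economics of data warehousing, discussed next, keep shifting.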


The economics of data warehousing are significantly different today, and they are changing the picture entirely. Managing from scarcity has given way to a rush to extract insight from large volumes of data. Most companies are constructing data warehouses to store and monitor real-time as well as historical data, which can be extracted for quick and accurate decision making. Data warehouse and business intelligence (BI) platforms are admittedly complex and create multiple challenges, but the costs of compute and storage for DW and BI platforms have been decreasing exponentially, in line with Moore's Law.


What is Cloud Computing?


Cloud computing is anything that involves delivering hosted services over the internet. In other words, we no longer have the headache of maintaining the hardware or the software. Services are sold on demand: you pay only for what you use. Clouds can be classified as public, private, or hybrid. Fundamentally, the cloud treats computing as a utility rather than as a specific product or technology.

With so many big players in IT providing cloud services, the time to provision even moderately complex environments can be reduced to under an hour, with entry-level costs at less than one dollar per hour. However, cloud-based environments for big data analytics, or more specifically, data warehousing analytics for structured data, are not appropriate for all use cases.
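The pay-for-what-you-use model is easy to reason about with a sketch. The node count, workload length, and the $0.25/node-hour rate below are hypothetical, chosen only to echo the "less than one dollar per hour" entry point mentioned above:

```python
# Hypothetical pricing for illustration -- real cloud rates vary by
# vendor, region, and instance type.

def on_demand_cost(nodes, hours, rate_per_node_hour):
    """Pay-as-you-go: cost accrues only for the hours the cluster exists."""
    return nodes * hours * rate_per_node_hour

# A 4-node analytics cluster at $0.25/node-hour, spun up for a
# 10-hour workload and then torn down:
print(on_demand_cost(4, 10, 0.25))   # 10.0 dollars total
```

Contrast this with an on-premises warehouse, where the hardware cost is paid up front whether or not the cluster is busy.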


Let's have a look at a few vendors of cloud computing in the DW/BI sphere.

Amazon Redshift

Amazon Redshift is a great example of the move to the cloud. Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using existing business intelligence tools. It is optimized for data warehousing, using a variety of innovations to achieve very high performance on data sets ranging from a hundred gigabytes to a petabyte or more, and it scales as needed.

Snowflake 

The Snowflake Elastic Data Warehouse is a new data warehouse built from the ground up for the cloud. Its patent-pending architecture decouples data storage from compute, making it uniquely able to take advantage of the elasticity, scalability and flexibility of the cloud. As a native relational database with full support for standard SQL, Snowflake empowers any analyst with self-service access to data, which enables organizations to take advantage of the tools and skills that they already have.

Teradata
Teradata's Aster Cloud Edition offers big data analytics on demand, bringing the flexibility and agility of cloud computing. Its massively parallel analytics engine stores and processes big data for performance and scalability, taking full advantage of the elasticity of the cloud.

The Data Warehouse as a Service provides access to Teradata Database, ETL and BI ecosystems, managed by Teradata, in a secure and reliable production class environment in the cloud. Teradata says it assumes responsibility for the hardware/software and daily end-to-end operations via Teradata Managed Services.

Real World Implementations

FourSquare

The location-based social app has 40 million users and streams hundreds of millions of application log entries each day. Its database system required a lot of staff time to keep running and carried high annual licensing costs. By migrating to Redshift and Tableau, Foursquare could expose its analytics platform to a larger number of employees and save tens of thousands of dollars in licensing costs.

Comcast

Comcast is one of the world's leading media, entertainment, and communications companies. It serves 24 million cable subscribers, runs 300 TV channels including 150 in HD, and provides 150,000 online entertainment choices. Comcast says Teradata saves it time in the design phase, where requirement input is often lacking. Under Teradata's supervision, Comcast decided to re-architect its data warehouse from scratch; the resulting holistic view of the data produced and consumed gives a bigger picture of the future.

Verizon
The largest wireless carrier in the United States, with the lowest churn rate, has employed a Unified Data Architecture to 'listen' to its 100 million customers. This Teradata warehouse strategy, together with the Aster Discovery Platform and Hadoop, has helped Verizon Wireless gain valuable insights for better customer service.


References:

http://www.teradata.com

http://bits.blogs.nytimes.com/2014/06/11/the-era-of-cloud-computing/?_r=1

http://aws.amazon.com/redshift/

http://www.snowflake.net/news/snowflake-reinvents-the-data-warehouse-for-the-cloud/

http://www.teradata.com/business-needs/business-intelligence/?ICID=Sbi&LangType=1033&LangSelect=true

http://www.vertica.com/wp-content/uploads/2013/05/GigaOM-New-economies-of-enterprise-data-warehousing.pdf

http://simple.wikipedia.org/wiki/Moore%27s_law

Thursday, March 5, 2015

Presentation and Visualization Methods

Data visualization is the method of consolidating data into one collective, illustrative graphic. Traditionally, data visualization has been used for quantitative work (infographics are a popular example), but ways to represent qualitative work have proven equally powerful.

Why Visualize??
It's the best way to get a message across to an audience amid tons of competing data streams. In the current world of big data, visualization plays a key role by depicting the underlying patterns, trends, and relationships contained within data. Visualizations quickly draw attention to key business metrics and observations that wouldn't be apparent from statistics alone.

In order to demonstrate a few presentation and data visualization methods, I will make use of the following three vignettes:

HEALTHCARE

Healthcare data is among the most difficult to measure and present: definitions of variables can be inconsistent, and new research comes out every day. Driven by industry trends, the analysis of large data sets, such as medication usage or hospital readmissions, has enabled health care providers and policymakers to make smarter decisions and predict future trends. Electronic medical records, and decisions by governments and companies to share data, have made for smarter decision-making that can save money and provide better care.

Source: ASPE computations from Current Population Survey Annual Social and Economic Survey (CPS-ASEC) microdata for Calendar Year 2012.

The graph above is a pie chart depicting the spread of insurance enrollment among non-elderly employed and unemployed adults in the United States. Here we deal with several dimensions, such as person, date, employment status, and insurance type. From the graph we can easily conclude that most employed adults are privately insured, whereas most unemployed adults are uninsured.


TRANSPORTATION

Traffic updates are always helpful in daily life. They inform us of traffic conditions on the main routes and of any delays, and thereby let us choose the best possible route. Transportation stats can also be shown with a bar chart or pie chart comparing routes by distance and delay.

The map approach, famously implemented by Google, is one of the best ways to visualize this: users can easily see which routes to avoid and which to take. A map also lets the user focus on only those routes they are travelling on or wish to travel. Color coding the routes makes them easy to distinguish, and estimates of travel time or delay are embedded into the map based on user selection.


Source: Google Maps


INSURANCE

The primary value chain of an insurance company is seemingly short and simple. The core processes are to issue policies, collect premium payments, and process claims. The organization is interested in better understanding the metrics spawned by each of these events. Users want to analyze detailed transactions relating to the formulation of policies, as well as transactions generated by claims processing. They want to measure performance over time by coverage, covered item, policyholder, and sales distribution channel characteristics.

Having said that, let's look at an important parameter in the insurance sector: the trend in insurance premiums over the years, and the contribution by employees to health insurance.


We could also show these visualizations with a pie chart or bar graph, but a line graph represents trends in insurance payments better. A line graph clearly depicts the increase or decrease in insurance premiums over the years, as shown in the following graphs:
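A trend line also makes the growth rate easy to quantify. As a sketch, here is the compound annual growth rate (CAGR) behind such a line, using the $15,980 average-premium figure from the Washington Post source cited below and a hypothetical starting value of $9,000 a decade earlier:

```python
# The $9,000 starting figure is hypothetical; $15,980 is the average
# employer health plan cost reported by the Washington Post source.

def cagr(start, end, years):
    """Compound annual growth rate implied by two points on a trend line."""
    return (end / start) ** (1 / years) - 1

rate = cagr(9000, 15980, 10)
print(f"{rate * 100:.1f}% per year")   # about 5.9% per year
```

The same function works for any two points you read off a premium trend line.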



Source: http://www.washingtonpost.com/blogs/wonkblog/wp/2012/09/11/the-average-employer-health-plan-now-costs-15980-and-thats-kind-of-good-news/


In conclusion, what we need to learn from this is that it's important to know your audience and present information in the way they can best understand it. Visualization should be Intuitive, Simple, Appealing and Interactive.

Wednesday, February 18, 2015

Big Unstructured Data v/s Structured Relational Data

If left unmanaged, your data can become overwhelming, making it difficult to procure information when you need it. While software is designed to address archiving, discovery, compliance, and so on, the overarching goal is almost always the same: to make managing and maintaining data a feasible task. In this post, you'll see two types of data you're accustomed to working with, paying close attention to the differences between structured and unstructured data.

Unstructured Data

“Unstructured data refers to information that either does not have a pre-defined data model and/or is not organized in a predefined manner.”
In short, unstructured data does not fit usefully into a schema or table. Take email as an example. Certain values from an email fit into a table: sender, recipient, subject, and so on. Although you can have a column for the email body, the information stored in that column is of little use when analyzed that way. What questions could analysts ask of all the entries in an "email body" column, and could they be answered? Generally, no.
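The email example can be sketched with nothing but Python's standard library: the headers are structured (they would fit neatly into table columns), while the body is unstructured free text. The addresses and message below are made up for illustration.

```python
# Headers = structured fields; body = unstructured text.
from email import message_from_string

raw = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers

Hi Bob, the Q3 figures look good -- let's discuss on Friday.
"""

msg = message_from_string(raw)

# Structured fields: easy to store, index, and query in a relational table.
print(msg["From"])     # alice@example.com
print(msg["Subject"])  # Quarterly numbers

# Unstructured field: free text with no schema; making sense of it
# requires text mining, not a simple SQL column.
print(msg.get_payload())
```

The headers answer precise questions ("who mailed whom, and when?"); the body does not, which is exactly the structured/unstructured divide.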
Looking at the illustration, it's obvious that social media plays a heavy role in unstructured data. In addition to social media, there are many other common forms of unstructured data:
  • Word Docs, PDFs and Other Text Files - Books, letters, other written documents, audio and video transcripts
  • Audio Files - Customer service recordings, voicemails, 911 phone calls
  • Presentations - PowerPoints, SlideShares
  • Videos - Police dash cam, personal video, YouTube uploads
  • Images - Pictures, illustrations, memes
  • Messaging - Instant messages, text messages
Unstructured data is a valuable piece to the data pie of any business. Tools that are widely accessible today can help businesses use this data to its greatest potential.

Structured Data
In contrast to unstructured data, structured data is data that can be easily organized. Despite its simplicity, most experts in today's data industry estimate that structured data accounts for only about 20% of the data available. It is clean, analytical, and usually stored in databases.
Today, big data tools and apps have allowed for the exploration of structured data that was once too expensive to gather and store. Some examples of structured data:
  • Machine Generated - Sensor data, point-of-sale data, call detail records, web server logs (page requests and other server activity)
  • Human Generated - Input data: anything entered into a computer, such as age, zip code, or gender
Although it's outnumbered by its unstructured sibling, structured data has always played, and will always play, a critical role in data analytics. It functions as a backbone for critical business insights; without structured data, it is difficult to know where to look for the insights hiding in your unstructured data sets.
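A minimal sketch of the "human generated input data" case, using Python's built-in sqlite3 module (the table and values are invented for illustration): because the data conforms to a schema, analysts can query it directly.

```python
# Structured data in a relational store: schema first, then queries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, zip_code TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(34, "15213", "F"), (52, "02139", "M"), (28, "94105", "F")],
)

# Aggregate questions have immediate answers -- no text mining needed.
avg_age = conn.execute("SELECT AVG(age) FROM customers").fetchone()[0]
print(avg_age)   # 38.0
```

Try asking the same "average age" question of a folder of scanned letters and the contrast with unstructured data is obvious.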



Limitations of Data Warehousing
Data Warehousing is evolving but does have certain limitations. 


1. Extra Reporting Work:
Traditionally, data warehousing involves a 'scheduled push' of data periodically (say, once a day) from operational data sources into the data warehouse. However, there is a growing demand for real-time analysis and reporting, and data warehousing must adapt to this demand.
2. Analyzing unstructured data:
The data warehouse model described in the previous section can include a module (such as Hadoop) to store unstructured data. However, data warehouses frequently need to analyze unstructured data simultaneously with structured data to produce meaningful results at a particular grain. This remains an open design problem.
3. Data Ownership Concerns:
As businesses become cross-functional and data is distributed, security is an ongoing concern. Clear business processes have to be defined within enterprises with regard to the security of disparate data.
4. Cost/Benefit Ratio:
The costs of building, integrating, and maintaining data warehouses are still high. While in the past the main cost was storage, nowadays the cost lies in integration and maintenance.

Future of Data Warehousing
Data analytics can move beyond the limitations imposed by the lack of structure in unstructured data, and can now seamlessly use all forms of data together in a single context. Such a capability holds tremendous promise for the future of analytics.
The video below, from the CEO of Xurmo Technologies, gives us better insight.


“In the past, companies couldn’t integrate these disparate technologies with the data warehouse because each technology required different file formats and data schemas,” says Stackowiak. “Today, you can integrate these technologies, and the result is that companies can access more of their data—not just the 20 percent from enterprise systems—and convert it into valuable, profitable information.”

Companies interested in building out their traditional data warehouse infrastructures may consider starting with reporting, if they don’t already have reporting capabilities in place, suggests Solari. Then, they can begin integrating analytics technologies to their reporting framework.

“When companies start bringing this data together and federating it inside a data warehouse, the total cost of ownership for the data warehouse may begin to go down while the ROI goes up,” says Solari. “The ability to integrate big data technologies, analytics technologies, back office systems, and traditional data warehouses has the potential to fundamentally change the economics of data warehousing for the better.”

References
http://smartdatacollective.com/michelenemschoff/206391/quick-guide-structured-and-unstructured-data

http://www.sherpasoftware.com/blog/structured-and-unstructured-data-what-is-it

http://www.computerweekly.com/feature/How-to-manage-unstructured-data-for-business-benefit


http://deloitte.wsj.com/cio/2013/07/17/the-future-of-data-warehouses-in-the-age-of-big-data/

Tuesday, February 3, 2015

Evaluation of BI Vendors (Blogging Assignment)

This blog compares some of the most popular BI tools in use today across organizations, giving an overview of the 'Leaders' in the BI & Analytics platform space as identified in Gartner's 2014 report. The tools are evaluated against a set of criteria, with a weight assigned to each category, to determine the best tool.

The tools considered for comparison in this blog are:

1.) Tableau
2.) Qlikview
3.) TIBCO Spotfire
4.) SAS
5.) Microstrategy

The above tools were critiqued on 5 factors determined to be crucial to their relevance in the industry.

1.) Data Visualization
2.) Analytical Insight
3.) Integration
4.) Customer Support
5.) Cost & Miscellaneous

Criteria Analysis

1.) Data Visualization
This module refers to the mode of delivery of insight, i.e., the extent to which data can be represented in the tool with minimal effort from the user. It includes factors such as reporting, dashboards, and the mobile interface.

Reporting is one of the primary abilities of a tool: creating complex reports from multiple data sources.

Dashboarding is the feature that allows a visual representation of the analysis using various graphics such as charts and plots.

Since an increasing number of users are going mobile, a mobile interface for a BI tool lets customers carry their tasks with them, depending on the features provided.

This category was highly competitive among all the tools under consideration for this review. Visualization is an important factor in today’s world and it acts as a good scale of measurement for the ease of use of a tool for a customer.  

Qlikview is strong on dashboards but clunky when importing and exporting documents to aggregate reports. SAS provides industry-standard reports and works faster in collaboration with database applications and Microsoft Office applications, hence its higher rating in comparison. Tableau is rated the best BI tool in the weighted-score analysis, as it provides best-in-class reporting and dashboard facilities. Microstrategy's data visualization ranks lowest due to its inconsistent dashboards and less interactive visualizations.


2.) Analytical Insight

This section was further divided across multiple categories:

   a) Predictive analytics: the extent to which the tool supports statistical modelling to forecast and predict trends.
   b) Scorecarding: the ability of the tool to create and depict different scorecards, such as Six Sigma, Balanced Scorecards, and key performance indicators for measuring company performance.
   c) OLAP (Online Analytical Processing): the performance of the tool with respect to querying, pivoting, and sorting.

Qlikview scored lowest in this section since it does not offer a predictive analytics module and has very primitive OLAP capabilities. Microstrategy and Tableau come out at about the same level, better than Qlikview. SAS has been the dominant player here: its analytics includes unique features integrated directly into the BI tool, and SAS holds a 36% share of the advanced analytics market, more than the share of the next 10 vendors combined.

Hence SAS comes out as the top vendor in this category thanks to its offering in the advanced analytics domain. Spotfire also deserves an honorable mention, as it offers a lot of flexibility in applying analytics functions.



3.) Integration

The integration feature takes the following aspects into consideration.

   a) Workflow Engines: the ability of the tool to model workflows for applications as per the user's needs.
   b) Developer Tools: the sophistication of the SDK provided to developers to customize the application, add new features, or fix bugs.
   c) Big Data Support: the ability of the tool to handle large, unstructured (NoSQL) data.

Microstrategy ranks highest in this category thanks to its integrated Intelligence Server, which supports a wide range of BI applications. Qlikview ranks lowest, as it does not handle big data at all and relies on a third-party tool to introduce workflows in BI applications. SAS and Spotfire also provide extensive big data support by ensuring high compatibility with big data sources.



4.) Customer Support

Customer interaction & support forms a crucial part of any organization/vendor. It is imperative to obtain customer feedback to learn about the user experiences & address their queries. It completes the loop by acting as an input to the vendors to consider adding new features into the tool and further their performance.

Tableau offers support services for all of its products, including an extensive online database of resources that users can search for answers to their issues. Qlikview offers non-technical support in the period immediately following a sale through Qoncierge, an international online support staff that addresses license-related questions, portal access issues, download issues, and other general questions. MicroStrategy provides training not just on its solutions but on business intelligence as a whole, and SAS operates along similar lines. Tibco Spotfire offers a variety of training for the software, with discounts on training packages in the form of Educational Passports.



5.) Cost & Miscellaneous
Here we consider the cost factor associated with each tool and, in some cases, the offline support provided by vendors such as Qlikview and Tableau, which is a nice feature to have.

Tableau turns out to be the cheapest option, whereas SAS and Microstrategy are on the expensive side when purchasing a license, whether for an individual or an enterprise. Spotfire and Qlikview are reasonably priced and provide good value for the investment.


Overall Score and Assessment:
The table below summarizes my analysis. Considering all aspects, Tableau scores the highest, which is also supported by Gartner's report. There is competition between Tableau and Microstrategy in the enterprise informatics space, but Tableau is currently more flexible with its pricing and offers more value-added options, which gives it a significant advantage over Microstrategy in the market.

Qlikview's customer base is significantly different: it focuses on small-to-medium enterprises and is doing extremely well in that market. Spotfire and SAS continue to invest heavily in the BI space and are improving their product ranges continuously.


Criteria               Weight   Tableau   Spotfire   Qlikview   SAS     Microstrategy
Data Visualization     25%      9         8          8.5        8       7
Analytical Insight     20%      8         8.5        7          9       8
Integration            15%      8         8.5        7          8       9
Customer Support       15%      8.5       7          8          8.5     8
Cost & Miscellaneous   25%      9         8          8.5        6       7.5
Points                 100%     8.575     8.025      7.9        7.775   7.775
Rank                            1         2          3          4       4
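The weighted totals in the table can be reproduced with a short script; the weights and scores below are taken directly from the table above:

```python
# Reproduce the "Points" row: total = sum(weight * score) per tool.
weights = {"Data Visualization": 0.25, "Analytical Insight": 0.20,
           "Integration": 0.15, "Customer Support": 0.15,
           "Cost & Miscellaneous": 0.25}

scores = {
    "Tableau":       [9,   8,   8,   8.5, 9],
    "Spotfire":      [8,   8.5, 8.5, 7,   8],
    "Qlikview":      [8.5, 7,   7,   8,   8.5],
    "SAS":           [8,   9,   8,   8.5, 6],
    "Microstrategy": [7,   8,   9,   8,   7.5],
}

def weighted_total(tool):
    return sum(w * s for w, s in zip(weights.values(), scores[tool]))

for tool in scores:
    print(f"{tool}: {weighted_total(tool):.3f}")
```

Running this confirms the table's totals (Tableau 8.575, Spotfire 8.025, Qlikview 7.900, and SAS and Microstrategy tied at 7.775), and makes it easy to experiment with different weightings.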

Final recommended BI tool: Tableau