Big Data


The Three Approaches to Model Big Data

2.2.1) Internet Search Engine Approach

Context: Text is everybody’s data, yet within text the data has almost no explicit structure or meaning. The beauty of text is that anyone can generate it, and it often contains useful information. The World Wide Web is one example of a fairly large and useful text repository.

The downside of text, at least at the moment, is that it has no computer-understandable semantic content. A human can read it, and a thought, an impression or even a set of facts may be conveyed quite accurately, but a computer cannot readily analyze the text to answer meaningful questions.
For this reason, query systems against text tend to be analytically very primitive. The most familiar interface is the one presented by Internet search engines, which rank their results with algorithms such as PageRank, used by Google. The details of this algorithm are as follows (according to Wikipedia):

PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is referred to as the PageRank of E and denoted by PR(E).

A PageRank results from a mathematical algorithm based on the webgraph, created by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs such as cnn.com or usa.gov. The rank value indicates an importance of a particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page there is no support for that page.
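
The recursive definition above can be made concrete with a small power-iteration sketch. This is only an illustration of the idea, not Google's production algorithm: the damping factor of 0.85, the fixed number of iterations, the function name `pagerank` and the four-page toy web are all assumptions introduced here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: A, B and C all link to D; D links back to A.
toy_web = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["A"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy web, D collects votes from three pages and ends up with the highest rank, while A benefits from being linked to by the highly ranked D. B and C have no incoming links, so they keep only the small baseline share.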

2.2.2) Key-Value Pair Approach

Context: Semi-structured data can be defined as data that should have been presented in a structured way but, because of modeling challenges, was not.

For example, consider the offer for an Amex Gold card:
Bonus Points on using the Card once a week, Use the card once a week or 4 times every month. Get a whopping 1,000 points*, every month. Added up, that’s 12,000 Points every year. Earn Points on groceries, fuel and telephone bills, shopping and travel. Earn 1 Membership Reward Point for every Rs 50 charged to your Gold Card and 1 Membership Rewards Point for every Rs 100 spent on fuel and utility bills. The Supplementary Cards earn Points too. Renewal of the American Express Gold Card at the end of the first year, gets 5,000 Bonus Points Membership Rewards Points. Enrol for Standing Instructions Facility. Get a Bonus 1,000 Points the first time your bill gets paid through Bill Desk. Points don’t expire, The Points you earn on the American Express Membership Rewards Program are yours forever. *Only transactions above Rs.250 will be eligible for Bonus Points.

The offer above is a perfect example of free-flowing text. None of the usual structured-query storage mechanisms can be applied to it; a rather simplistic approach would be to store all the details as a string in the following format:

Card Name | Type | Year | Description
American Express | Gold | 2012 | Bonus Points on using the Card once a week, Use the card once a week or 4 times every month. Get a whopping 1,000 points*, every month. Added up, that’s 12,000 Points every year. Earn Points on groceries, fuel and telephone bills, shopping and travel. Earn 1 Membership Reward Point for every Rs 50 charged to your Gold Card and 1 Membership Rewards Point for every Rs 100 spent on fuel and utility bills. The Supplementary Cards earn Points too. Renewal of the American Express Gold Card at the end of the first year, gets 5,000 Bonus Points Membership Rewards Points. Enrol for Standing Instructions Facility. Get a Bonus 1,000 Points the first time your bill gets paid through Bill Desk. Points don’t expire, The Points you earn on the American Express Membership Rewards Programme are yours forever. Only transactions above Rs.250 will be eligible for Bonus Points.


Key-value pairs: A key-value pair can be created once some of the columns have been identified as a key and a value. In the data warehouse (DW) world, a key is roughly equivalent to a dimension or a combination of dimensions, and a value is a measure or a combination of measures.
A file of key-value pairs has exactly two columns. One, the key, is structured. The other, the value, is unstructured, at least as far as the system is concerned. The mapping algorithm then lets you move (or split) the data between the structured and unstructured sections at will. Key-value pairs can also be used to optimize retrieval; for example, a key-value combination for the example above could be designed as follows:

Key | Value
Amex-Corporate | Gold-2012-50000
Amex-Personal | Gold-2012-35000

Retrieval and aggregation of data by the components of the key will be fast. For example, querying by card company and card type (corporate or personal) will be very fast, but querying by card nature (Gold, Platinum, etc.) will be a daunting task; in fact, a new key-value combination might be required to support that query. In other words, if you have one or two access paths there is no problem, but you cannot access your data using a wide range of queries.
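
The difference between the two access paths can be sketched with an ordinary Python dictionary standing in for the key-value store. The function names and the assumed "nature-year-limit" layout of the value string are hypothetical; a real key-value store would expose its own API.

```python
# Key-value pairs taken from the table above. The store treats each value
# as an opaque string; only the key is structured.
store = {
    "Amex-Corporate": "Gold-2012-50000",
    "Amex-Personal": "Gold-2012-35000",
}


def get_by_key(company, card_type):
    """Fast path: one direct lookup on the structured key."""
    return store.get(f"{company}-{card_type}")


def find_by_nature(nature):
    """Slow path: scan every pair and parse the value in application code.

    The "nature-year-limit" layout of the value is an assumption the
    application has to know about; the store itself does not.
    """
    matches = []
    for key, value in store.items():
        value_nature, _year, _limit = value.split("-")
        if value_nature == nature:
            matches.append(key)
    return matches


def build_nature_index():
    """A second key-value set, keyed by card nature, to make that query fast."""
    index = {}
    for key, value in store.items():
        index.setdefault(value.split("-")[0], []).append(key)
    return index


print(get_by_key("Amex", "Corporate"))
print(find_by_nature("Gold"))
print(build_nature_index())
```

With only two pairs the scan is trivial, but its cost grows with the size of the store, whereas the keyed lookup does not. The extra index is the "new key-value combination" mentioned above, and it has to be maintained alongside the original data.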

2.2.3) Resource Description Framework
Context: XML formats have traditionally been assumed to be hierarchical data models, in which each child has a single parent and is independent of the child records in a parallel stream. This assumption may hold only in the simplest cases. Simple fetches of one or more columns are therefore easy, while queries that rely on the relationship between the columns of a record, or between columns of different records, are harder to express and execute. In short, although XML supports almost any data model, its features, search syntax and performance footprint encourage, if not mandate, a hierarchical one. Free-format data, however, is not hierarchical by nature. For example, for our example above a typical XML file could contain the following streams:

Parent | Child (Type) | Child (Value)
Card1 | CardName | Amex
Card1 | CardType | Gold
Card1 | Year | 2012
Card1 | CreditLimitUsed | 32000
Card2 | CardName | Amex
Card2 | CardType | Platinum
Card2 | Year | 2012
Card2 | CreditLimitUsed | 23000

As the example shows, the classes that are assumed to be independent and to have a single parent are not actually so; they could just as well be defined as individual classes in their own right.
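
A short sketch using Python's standard `xml.etree.ElementTree` module illustrates the contrast between a simple fetch and a cross-record question. The XML layout below (a `Cards` root with `Card` elements and an `id` attribute) is a hypothetical rendering of the two cards, not a format given in the original example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of the two cards from the table above.
document = """
<Cards>
  <Card id="Card1">
    <CardName>Amex</CardName>
    <CardType>Gold</CardType>
    <Year>2012</Year>
    <CreditLimitUsed>32000</CreditLimitUsed>
  </Card>
  <Card id="Card2">
    <CardName>Amex</CardName>
    <CardType>Platinum</CardType>
    <Year>2012</Year>
    <CreditLimitUsed>23000</CreditLimitUsed>
  </Card>
</Cards>
"""
root = ET.fromstring(document)

# Easy: fetch one column by walking down the hierarchy.
card_types = [node.text for node in root.findall("./Card/CardType")]

# Harder: a question that relates columns across records, such as
# "which pairs of cards share the same CardName and Year?", is written
# here as nested loops in application code rather than as a single path.
cards = root.findall("./Card")
related = []
for i, first in enumerate(cards):
    for second in cards[i + 1:]:
        same_name = first.findtext("CardName") == second.findtext("CardName")
        same_year = first.findtext("Year") == second.findtext("Year")
        if same_name and same_year:
            related.append((first.get("id"), second.get("id")))

print(card_types)
print(related)
```

The first query follows the parent-child structure directly. The second treats CardName and Year as things that connect cards to each other, which the hierarchy does not represent.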

Resource Description Framework
Challenges such as the one described above laid the groundwork for the Resource Description Framework (RDF).
As defined on Wikipedia:
The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications [1] originally designed as a metadata data model. It has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats.

Using RDF, our example would be modeled as follows:

Object1 | Relationship | Object2
Card1 | Cardname | Amex
Card1 | Cardtype | Gold
Card1 | CardYear | 2012
Card1 | CardLimit | 50000
Card2 | Cardname | Amex
Card2 | Cardtype | Platinum
Card2 | CardYear | 2012
Card2 | CardLimit | 30000

The above solution is simple and serves a myriad of scenarios, but the problem is that these queries represent some of the most computationally intensive algorithms that we know. To return all ‘Amex Gold cards from 2012 with a limit of more than 50000’ we need to retrieve four lists (all Amex cards, all Gold cards, all 2012 cards and all cards with a qualifying limit) and then look for the objects that appear on all four of these huge lists. Compared with a multi-component fetch on structured data, we have turned one key fetch into four, each of which would be at least 3 orders of magnitude slower.
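
The four-list retrieval can be sketched in plain Python with tuples standing in for triples. This is not a real RDF store or query language: the function names are hypothetical, the 50000 row of Card1 is read as its CardLimit by analogy with Card2, and the threshold of 40000 in the usage line is chosen only so that the toy data returns a match.

```python
# Triples from the table above: (object1, relationship, object2).
# Assumption: the 50000 value of Card1 is its CardLimit, as for Card2.
triples = [
    ("Card1", "Cardname", "Amex"),
    ("Card1", "Cardtype", "Gold"),
    ("Card1", "CardYear", "2012"),
    ("Card1", "CardLimit", "50000"),
    ("Card2", "Cardname", "Amex"),
    ("Card2", "Cardtype", "Platinum"),
    ("Card2", "CardYear", "2012"),
    ("Card2", "CardLimit", "30000"),
]


def subjects_where(relationship, predicate):
    """Return every object1 that has `relationship` with a value passing `predicate`.

    Each call is a full pass over the triple list, which is the costly part.
    """
    return {s for s, r, o in triples if r == relationship and predicate(o)}


def amex_gold_2012_over(limit):
    """Build the four candidate sets, then keep only the cards in all of them."""
    amex = subjects_where("Cardname", lambda v: v == "Amex")
    gold = subjects_where("Cardtype", lambda v: v == "Gold")
    year_2012 = subjects_where("CardYear", lambda v: v == "2012")
    big_limit = subjects_where("CardLimit", lambda v: int(v) > limit)
    return amex & gold & year_2012 & big_limit


print(amex_gold_2012_over(40000))
```

Every condition in the query costs one pass over all triples plus a set intersection, whereas a structured table with a composite key could answer the same question with a single keyed fetch.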
