The Three Approaches to Modeling Big Data
2.2.1) Internet Search Engine Approach
Context: Text data is everybody’s data; within text, the data has almost no
explicit structure or meaning. The beauty of text is that anyone can generate
it, and it often contains useful information. The World Wide Web is one example
of a very large and useful text repository.
The downside of text, at least at the moment, is that it has no
computer-understandable semantic content. A human can read it, and a thought,
an impression, or even a set of facts may be conveyed quite accurately, but a
computer cannot readily analyze the text to answer meaningful questions.
For this reason, query systems against text tend to be analytically very
primitive. The most famous interface is the one presented by Internet search
engines; a well-known example of the machinery behind it is the PageRank
algorithm used by Google. The details of this algorithm are as follows
(according to Wikipedia):
PageRank is a link analysis algorithm, named after Larry Page and used by the
Google Internet search engine, that assigns a numerical weighting to each
element of a hyperlinked set of documents, such as the World Wide Web, with the
purpose of "measuring" its relative importance within the set. The algorithm
may be applied to any collection of entities with reciprocal quotations and
references. The numerical weight that it assigns to any given element E is
referred to as the PageRank of E and denoted by PR(E).
A PageRank results from a mathematical algorithm based on the webgraph, created
by all World Wide Web pages as nodes and hyperlinks as edges, taking into
consideration authority hubs such as cnn.com or usa.gov.
The rank value indicates an importance of a particular page. A hyperlink to a
page counts as a vote of support. The PageRank of a page is defined recursively
and depends on the number and PageRank metric of all pages that link to it
("incoming links"). A page that is linked to
by many pages with high PageRank receives a high rank itself. If there are no
links to a web page there is no support for that page.
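The recursive definition can be sketched as a simple power iteration over a link graph. This is only a minimal illustration, not Google's implementation: the damping factor of 0.85, the fixed iteration count, the function name `pagerank`, and the four-page `toy_web` graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a small link graph.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes its rank on, split equally among its outgoing links.
        for page, targets in links.items():
            for t in targets:
                new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical four-page web: A, B and C all link to D; D links back to A.
toy_web = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph, D collects the most incoming links and ends up with the highest rank, while B and C, which nobody links to, receive only the baseline share.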
2.2.2) Key – Value pair Approach
Context: Semi-structured data can be defined as data that should have been
presented in a structured way but, owing to modeling challenges, was not.
For example, consider the offer for an Amex Gold card:
Bonus Points on using the Card once a week, Use the card
once a week or 4 times every month. Get a whopping 1,000 points*, every month.
Added up, that’s 12,000 Points every year. Earn Points on groceries, fuel and
telephone bills, shopping and travel. Earn 1 Membership Reward Point for every
Rs 50 charged to your Gold Card and 1 Membership Rewards Point for every Rs 100
spent on fuel and utility bills. The Supplementary Cards earn Points too.
Renewal of the American Express Gold Card at the end of the first year, gets
5,000 Bonus Points Membership Rewards Points. Enrol for Standing Instructions
Facility. Get a Bonus 1,000 Points the first time your bill gets paid through
Bill Desk. Points don’t expire, The Points you earn on the American Express
Membership Rewards Program are yours forever. *Only transactions above Rs.250
will be eligible for Bonus Points.
The above advertisement is a perfect example of a free-flowing statement. No
structured, query-oriented storage mechanism can be applied to it directly; a
rather simplistic approach would be to store all the details as a string in the
following format:
Card Name | Type | Year | Description
American Express | Gold | 2012 | Bonus Points on using the Card once a week, Use the card once a week or 4 times every month. Get a whopping 1,000 points*, every month. Added up, that’s 12,000 Points every year. Earn Points on groceries, fuel and telephone bills, shopping and travel. Earn 1 Membership Reward Point for every Rs 50 charged to your Gold Card and 1 Membership Rewards Point for every Rs 100 spent on fuel and utility bills. The Supplementary Cards earn Points too. Renewal of the American Express Gold Card at the end of the first year, gets 5,000 Bonus Points Membership Rewards Points. Enrol for Standing Instructions Facility. Get a Bonus 1,000 Points the first time your bill gets paid through Bill Desk. Points don’t expire, The Points you earn on the American Express Membership Rewards Programme are yours forever. Only transactions above Rs.250 will be eligible for Bonus Points.
Key Value Pairs: A key-value pair can be created once some of the columns have
been identified as a key-value combination. In the DW (data warehouse) world, a
key is roughly equivalent to a dimension or a combination of dimensions, and a
value is a straightforward measure or a combination of measures.
A file of key-value pairs has exactly two columns. One, the key, is structured.
The other, the value, is unstructured, at least as far as the system is
concerned. The mapping algorithm then allows you to move (or split) the data
between the structured and unstructured sections at will. Using key-value
pairs, retrieval can be optimized further; for example, a key-value combination
for the above example can be designed as follows:
Key | Value
Amex-Corporate | Gold-2012-50000
Amex-Personal | Gold-2012-35000
Now the retrieval and aggregation of data by the key components will be fast.
For example, querying the data by card company and card type (corporate or
personal) will be very fast, but querying it by card nature (gold, platinum,
etc.) will be a daunting task; in fact, a new key-value combination might be
required to support that query. In other words, if you have one or two access
paths there is no problem, but you cannot access your data using a wide range
of queries.
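The access-path limitation can be illustrated with a plain Python dictionary standing in for the key-value store. This is only a sketch under stated assumptions: the entries mirror the Key/Value table above, the value string is assumed to read as nature-year-limit, and the helper name `cards_of_nature` is hypothetical.

```python
# A dictionary stands in for the key-value store; keys are structured,
# values are opaque strings as far as the store is concerned.
store = {
    "Amex-Corporate": "Gold-2012-50000",
    "Amex-Personal": "Gold-2012-35000",
}

# Query on the key: a single direct lookup.
corporate = store["Amex-Corporate"]

def cards_of_nature(store, nature):
    """Query on part of the value: every entry must be scanned and parsed.

    Assumes each value has the form nature-year-limit.
    """
    matches = []
    for key, value in store.items():
        card_nature, year, limit = value.split("-")
        if card_nature == nature:
            matches.append(key)
    return matches

gold_cards = cards_of_nature(store, "Gold")
```

The key lookup touches one entry, whereas the query by card nature has to visit and split every value, which is why a second key design would be needed to make it fast.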
2.2.3) Resource Description Framework
Context: XML formats have traditionally been assumed to follow a hierarchical
data model, in which each child has a single parent and is independent of the
child records in a parallel stream. This assumption holds only in the simplest
cases. Simple fetches of one or more columns are therefore easy, while queries
that rely on the relationship between the columns of a record, or between
columns of different records, are harder to express and execute. In short,
whilst XML supports almost any data model, its features, search syntax, and
performance footprint encourage, if not mandate, a hierarchical one. Free-format
data, however, is not hierarchical by nature. For example, the card data above
could appear in a typical XML file as the following streams:
Parent | Child: Type | Child: Value
Card1 | CardName | Amex
Card1 | CardType | Gold
Card1 | Year | 2012
Card1 | CreditLimitUsed | 32000
Card2 | CardName | Amex
Card2 | CardType | Platinum
Card2 | Year | 2012
Card2 | CreditLimitUsed | 23000
As the example shows, the classes that are assumed to be independent and to
have a single parent are not actually so, and could well be defined as
individual classes in their own right.
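The difference between a simple fetch and a cross-record query can be sketched with Python's standard `xml.etree.ElementTree` module. The XML layout below (a `Cards` root with one `Card` element per parent) is an assumed encoding of the table above, not a format prescribed by the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for the two cards in the table above.
doc = """
<Cards>
  <Card id="Card1">
    <CardName>Amex</CardName>
    <CardType>Gold</CardType>
    <Year>2012</Year>
    <CreditLimitUsed>32000</CreditLimitUsed>
  </Card>
  <Card id="Card2">
    <CardName>Amex</CardName>
    <CardType>Platinum</CardType>
    <Year>2012</Year>
    <CreditLimitUsed>23000</CreditLimitUsed>
  </Card>
</Cards>
"""
root = ET.fromstring(doc)

# Simple fetch: walking down the hierarchy for one column is easy.
card_types = [card.findtext("CardType") for card in root.findall("Card")]

# Relationship between columns of different records: the hierarchy does
# not help, so the values are pulled out and compared in application code.
by_name = {}
for card in root.findall("Card"):
    by_name.setdefault(card.findtext("CardName"), []).append(card.get("id"))
shared_name = {name: ids for name, ids in by_name.items() if len(ids) > 1}
```

Here `CardName` behaves like an entity shared by both cards rather than a private child of each, which is the point the example makes about the single-parent assumption.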
Resource Description Framework
Challenges such as the one stated above established the grounds for the
Resource Description Framework (RDF).
As defined on Wikipedia:
The Resource Description Framework (RDF) is a family of World Wide Web
Consortium (W3C) specifications [1] originally designed as a metadata data
model. It has come to be used as a general method for conceptual description or
modeling of information that is implemented in web resources, using a variety
of syntax formats.
Using RDF, our example would be modeled as follows:
Object1 | Relationship | Object2
Card1 | Cardname | Amex
Card1 | Cardtype | Gold
Card1 | CardYear | 2012
Card1 | CardLimit | 50000
Card2 | Cardname | Amex
Card2 | Cardtype | Platinum
Card2 | CardYear | 2012
Card2 | CardLimit | 30000
The above solution is simple and serves a myriad of scenarios, but the problem
is that such queries represent some of the most computationally intensive
algorithms that we know. To return all ‘Amex Gold cards with a limit of more
than 50000’, we need to retrieve four lists (all Amex cards, all Gold cards,
all 2012 cards, and all cards with a qualifying limit) and then look for the
objects that appear on all four huge lists. Compared to a structured,
multi-component key fetch, we have turned one key fetch into four, each of
which would be at least 3 orders of magnitude slower.
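The four-list retrieval can be sketched with the triples held as plain Python tuples. This is an illustration only, not a real RDF store or query language: the helper name `subjects` is hypothetical, the 50000 value is read as Card1's CardLimit, and the threshold is taken as "at least 50000" so that the sample data returns a match.

```python
# Triples from the table above as (object1, relationship, object2).
# The 50000 value is read as Card1's CardLimit (assumption).
triples = [
    ("Card1", "Cardname", "Amex"),
    ("Card1", "Cardtype", "Gold"),
    ("Card1", "CardYear", "2012"),
    ("Card1", "CardLimit", "50000"),
    ("Card2", "Cardname", "Amex"),
    ("Card2", "Cardtype", "Platinum"),
    ("Card2", "CardYear", "2012"),
    ("Card2", "CardLimit", "30000"),
]

def subjects(triples, relationship, predicate):
    """Scan the whole triple list and collect matching subjects."""
    return {s for s, r, o in triples if r == relationship and predicate(o)}

# One list per condition, each needing its own pass over the triples.
amex = subjects(triples, "Cardname", lambda o: o == "Amex")
gold = subjects(triples, "Cardtype", lambda o: o == "Gold")
year_2012 = subjects(triples, "CardYear", lambda o: o == "2012")
high_limit = subjects(triples, "CardLimit", lambda o: int(o) >= 50000)

# Only subjects present in all four lists satisfy the query.
result = amex & gold & year_2012 & high_limit
```

Each condition costs a full pass over the triples before the intersection can start, whereas a structured table would answer the same question with one fetch on a composite key.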