Tuesday, 19 June 2012

Designing DW for Scalibility

Context: DW design and Implementation for Scalibility
Scalibilty is one of the prime components and most important design criteria for any DW/BI system design and Implementation; you can expand or reduce the capacity of the system by adjusting the number of processes available to the cluster. Increasing or decreasing the capacity of a system by making effective use of resources is known as scalability. A scalable system can handle increasing numbers of requests without adversely affecting response time and throughput.

Overview:
Scaling is the process of increasing or decreasing the capacity of the system by changing the number of processes available to service requests from sources. Scaling out a system provides additional capacity, while scaling in a system reduces capacity. Vertical scaling involves adding more Tiered Component to the same computer, to make increased use of the hardware resources on that computer. Horizontal scaling involves adding more computers to the environment. The three tiers where scalibilty can be addressed are as follows, the technology components used in the below article are Oracle database, Informatica on the ETL side and OBIEE on the Reporting tier.
2.1.3.1) Scalibility at the database Tier:
When designing for Scalibility, Oracle RAC is a smart option which allows you to start small and grow incrementally as your business demands grow.  RAC is based on Oracle's CacheFusion architecture which gives the ability to share your database on a common set of disks and memory across multiple servers, and delivers near-linear scalability as each additional server is added to the cluster. 
Oracle Real Application Clusters runs real applications – and hence it’s name.  It runs ALL Oracle database applications: transaction processing, data warehousing, 3rd party, and homegrown applications are all supported with NO code modification. 
And Oracle has dramatically simplified the management of clusters.  Therefore, if your business needs to support more data you add more disks. Similarly if your business needs to support more users or a greater transaction throughput you add more servers to the cluster.  What I am saying it that you can “scale out” -- effectively providing ‘capacity on demand’ for your business - without having to take your users offline!  And the entire time, Oracle’s Clustered Database looks and works exactly like a database on a single machine
2.1.3.2) Scalibility at the ETL Tier (Using Informatica)
In the informatica world a grid is a name assigned to a group of nodes that can run sessions and workflows. Running a workflow in a grid means distributing Session and Command tasks to service processes running on the nodes in the grid, workload is divided by distributing session threads to multiple DTM processes running on nodes in the grid. To run a workflow or session on a grid, you assign resources to nodes, create and configure the grid, and configure the Integration Service to run on a grid.
Some of the basic resources for PowerCenter are the database connections, files, directories, node names, and operating system types required by a task. You can configure the Integration Service to check resources. When you do this, the Load Balancer matches the resources available to nodes in the grid with the resources required by the workflow. It dispatches tasks in the workflow to nodes where the required resources are available. If the Integration Service is not configured to run on a grid, the Load Balancer ignores resource requirements.
Load Balancing in Informatica can be supported by three types of algorithms popularly called dispatch modes in Informatica, The informatica Load balancer uses dispatch mode to pick up a node for load balancing and all Integration Services in a domain use the same dispatch mode.
Round-robin: Each Node in Informatica can be marked for the maximum process threshold, when a request arrives the Load Balancer dispatches tasks to available nodes in a round-robin fashion, in the process it checks the Maximum Processes threshold on each available node and excludes a node if dispatching a task causes the threshold to be exceeded.
Metric-based: The Load Balancer evaluates nodes in a round-robin fashion. It checks all resource provision thresholds on each available node and excludes a node if dispatching a task causes the thresholds to be exceeded. The Load Balancer continues to evaluate nodes until it finds a node that can accept the task.
Adaptive: The most advanced approach for load balancing, in this approach the Load Balancer proactivly ranks nodes according to current CPU availability. It checks all resource provision thresholds on each available node and excludes a node if dispatching a task causes the thresholds to be exceeded.

2.1.3.3) Scalibility at the Reporting Tier (Using OBIEE)
Scaling is the process of increasing or decreasing the capacity of the system by changing the number of processes available to service requests from Oracle Business Intelligence clients. Scaling out a system provides additional capacity, while scaling in a system reduces capacity. Scaling the Oracle Business Intelligence environment applies principally to resource-intensive system processes and Java components. When you deploy more processes, Oracle Business Intelligence can handle more requests while staying responsive to requests.
Vertical scaling involves adding more Oracle Business Intelligence components to the same computer, to make increased use of the hardware resources on that computer. Horizontal scaling involves adding more computers to the environment For example: Oracle Business Intelligence can be vertically scaled by increasing the number of system components servicing requests on a given computer and results in increased use of the hardware resources on a given computer. For example, Oracle Business Intelligence is horizontally scaled by distributing the processing of requests across multiple computers.
The three system components that support both horizontal and vertical scale-out are Oracle BI Presentation Services, the Oracle BI Server, and the JavaHost. Oracle BI Scheduler uses Presentation Services and Oracle BI Server processes to perform computationally intense work on its behalf, while the Cluster Controller only manages other components and does not itself do any computationally intense work. Because of this, there is no need to scale out either Oracle BI Scheduler or the Cluster Controller. You can distribute these two processes as needed for high availability deployments, but they do not need to be scaled for capacity.

Tuesday, 12 June 2012

Full Load Vs Incremental Load OBIA


7.3 Full Load vs Incremental Loads 

One of the key concepts in OBIA is the mechanism thru which incremental and full loads have been of implemented. The details are as follows:
Ø  In Informatica almost for every mapping there will be 2 types of workflows i.e. there will one workflow for the full load and another for incremental load.
Ø   In most of the cases these workflows will be calling the same set of routines (ETL mappings) except that the Incremental workflows will have Incremental sessions that have a SQL override with a condition such as EFFECTIVE_DATE >=$$LAST_REFRESH_DATE. Due to this condition these workflows will only extract the changed data set (Delta).
Ø   Another key thing to note here is the there is a separate set of mappings for identifying records that have been physically deleted from the source system, these mappings are The primary extract (*_Primary) and delete mappings (*_IdentifyDelete and *_Softdelete) mappings. These mappings use $$LAST_ARCHIVE_DATE mechanism to update the warehouse with source related deletions.
Ø   When you run an execution plan, the type of task or workflow which will be executed will be decided based on the value of the refresh date for the primary source or primary target table, If the refresh date is null DAC will automatically choose the task with full load workflow and if there is a value available for the refresh date DAC will execute the respective incremental workflow.
Note: The tasks with _full denote that it is used while running the full load and the other task with no _full will indicate that it is used for incremental load.
If you want to view this In DAC, you can see these both workflows in Design > Tasks and select any one task and in the lower pane go to Edit tab. In Edit tab, there will be a Command for Incremental load and Command for full load and you will find the both the workflow names within these fields. The tasks with _full will be in Command for full load and the other will be in Command for incremental load.

Monday, 11 June 2012

Multi Source Loads With OBIA


 Multi Source Loads With OBIA
Multi Source loads refer to the situation where we load data from two of more source instances of similar or dissimilar types. Some common examples include:
Ø  Loading data from two similar type instances separated geographically for example EBS 11i in US and EBS 12i in Japan
Ø   Multi source load from dissimilar instance such as an instance of Siebel 7.8 in one location and of Oracle EBS 12i in another location.
Anyways, there are two key things to be noted in the case of a multi source load and this has primarily to do with the locking mechanism if they are seeking same targets and the truncate table, drop index and analyze table sequence. These details are as follows:

Multi-Source Order of Execution
The order in which DAC loads data from the different sources that are participating in
the ETL process is determined by the priority of the physical data source connection (set in the Physical Data Sources tab of the Setup view). The Priority property ensures that tasks attempting to write to the same target table will not be in conflict.
Note this property is primarily responsible if the sources are of the same types for example EBS11i and EBS 12i then one of them can be given a priority 1 and the next physical data source a different priority
 
Delay
As shown in the following Figure , the Delay property is located in the Parameters subtab in the Execution Plans tab of the Execute view. You set this property to specify how many minutes an extract of a data source will be delayed after the first extract of a multiple source extract process started.
 
Folder Level Priorities
As shown in the following Figure, you need to set the folder level priority in the Task Physical Folders dialog box, which you by selecting Tools, then Seed Data, and then Task Physical Folders.
 
I know that the first question which must be immediately bothering you is if we can set the priority at the data source connection level itself then why do we need the priority at the physical folder level, well indeed there is a very specific business case for that. Imagine that you have a Seibel 7.8 and an EBS 12 implementation in US plus an EBS 12i implementation in Japan. Then what should be the best mechanism to set the priorities that these loads run in a mixed mode (combination of parallel and serial mode) and the performance is optimum. What you might have to do is first set the priorities at the data source level for the EBS sources for Japan and US, as it’s the same physical folder there is no way that we can set different priorities at the physical folder level, next you need to set the priority for Seibel at the physical folder level.