Article written by Claudia Imhoff, Ph.D., Intelligent Solutions.
Summary
In today’s dynamic environment, organizations are faced with new and
increasingly critical decisions which can affect their very survival.
Decision makers, in turn, have placed more and more demands on their
Business Intelligence (BI) environments. Their demands are increasing
the pressure on IT to deliver the right data at the right time, and faster
than ever before.
“The right data” means well-documented, reliable, and consistent data.
“The right time” means delivering the appropriate data in a timeframe
appropriate for the particular decision. That is, delivery with enough lead
time to allow the effect(s) of the decision(s) to make a difference.
“The appropriate data” means the combination of historical data found
in data warehouses and data marts, low latency data found in an
operational data store, and real time data from operational systems.
Bringing the appropriate data together requires sophisticated data
integration infrastructures that go way beyond simple data
combinations and displays. So, data integration requires a mixture of
techniques and technologies – each specific to the type of decisionmaking
being performed.
In this paper, the two most popular techniques for data integration
(virtual data federation and physical data consolidation) are described,
and the data integration technologies that support each technique (EII
and ETL) are reviewed. Through eight BI examples, advice is offered to
help implementers determine when to use each technique and
technology.
A Strategic Architecture for BI
Today’s enterprises must be fueled by good decisions – about their
customers, products, employees, partners, and more. Business
Intelligence (BI) has long helped the company’s decision makers by
supplying them with reliable versions of historical data – snapshots used
to determine trends, patterns, comparisons over time, etc.
Strategic decisions are certainly critical to the ongoing well being of the
organization, but are not the only type of decisions being made.
Today’s BI architecture must also support a myriad of operational
decisions, driven by today’s dynamic business environment.
The public domain, logical architecture that I advocate for BI is the
Corporate Information Factory1 (CIF). Over the years, the CIF has
earned its stripes as the most reliable and most implemented
architecture for sustainable BI.

Figure 1: The Corporate Information Factory
Several CIF components – the data warehouse, data marts, and
operational data store – have remained fairly stable over the years. The
data warehouse is still the main source of generic, integrated, historical
snapshots of data. The operational data store is used to store low latency or near-real time integrated data. The specialized analytical
data marts are dependent on the data warehouse or operational data
store for their supply of data and contain the analytical applications.
Each serves a specific purpose in handling planned and unplanned
queries and reports.
Although these components have endured, the way we create them
has changed dramatically. While there is no question of the value of
historical snapshots, the time latency involved in the preparation of the
data makes them unsuitable for intraday or operational decision
making. So, the CIF evolved to support not only traditional, strategic BI,
but also tactical and operational BI.
Data Integration Techniques & Technologies
Today, BI implementers are fortunate to have two data integration
techniques available to help them create world class environments – CIF
or otherwise. These are Physical Data Consolidation and Virtual Data
Federation:
- Physical Data Consolidation – This technique uses processes that
capture, cleanse, integrate, transform, and load data into a target
data store. Typically data is consolidated using extraction,
transformation, and load (ETL) technologies, which obtain data
from operational data sources, transform it to the corporate
standard, and load it into physical data stores. Informatica’s
PowerCenter is an example of an ETL product.
- Virtual Data Federation – This technique uses processes that provide
a real-time integrated view of disparate data types from multiple
sources, providing a universal data access layer. Data federation
uses enterprise information integration (EII) technologies to create
virtual stores of data from data warehouses, marts, operational
data stores, and operational systems. Composite Software’s
Information Server is an example of an EII product.
The biggest question for implementers is not whether to use these
techniques, but when to use them. Each has its place within a data
integration infrastructure but it is not always clear when to use one or the
other. To help implementers make the “when” decision, eight examples
are presented below, including the rationale, the benefits, and the
drawbacks of choosing one over the other.
Data Warehouse Integration Example
Example 1: Implementing a Data Warehouse
A data warehouse is the cornerstone component of any BI
environment. It is the repository of non-volatile, historical snapshots
of enterprise data used for all sorts of analytical and reporting
purposes. The data warehouse does not store all data – only the
data needed for analytical purposes. Therefore it should not be
thought of as an operational data archiving mechanism.
Due to performance and other reasons, the operational systems
generally do not hold more than a year’s worth of data in their
databases. The rest is generally archived. Unlike operational
systems, the data warehouse has no time limit on the data it stores.
Many organizations store 2, 5, even 10 or more years of snapshot
data in their data warehouses.
For all data warehouse implementations, data must be physically
extracted from the operational environment. Once extracted, the
heavy lifting of transforming the data into consistent, standardized,
and cleansed snapshots is performed by physical data
consolidation techniques using the ETL technologies. These
snapshots are then loaded into the data warehouse schema and
made available for all forms of analytics and times-series reporting.
Data Mart Integration Examples
Data marts can be implemented virtually or physically, with each
approach having distinct advantages and disadvantages. It is
recommended that BI implementers first determine if a virtual mart
can satisfy their users’ BI needs before building a physical mart.
Example 2: When to Use Virtual Data Marts
As mentioned earlier, the CIF is a logical or conceptual
architecture. However, many people go about implementing it as
if it were a physical architecture. That is, they create only physical
marts by extracting data from the data warehouse source,
reformatting it, and loading it into its own data mart environment.
While this can satisfy BI needs, it is often unnecessary and inefficient.
A virtual mart may be created in one of two ways. The first method
is to use a BI tool to create required views of the data. There are
many BI tools for which a view of data is a suitable substitute for a physical data mart. For example, Business Objects is long known for
its utilization of “universes” of data.
But this form of virtual data mart can also be problematic – first, it
can be cumbersome when multiple tiers of views are created. Most
BI tools do not have sufficient view documentation and business
users can make erroneous assumptions about what data is actually
being analyzed. Second, these views generally work only with the
BI tool creating them. That is, they are proprietary to the BI tool and
are not available for use by other BI tools.
The second and often better way of creating needed views is via
virtual data federation techniques using EII technology. EII
technology allows enterprises to aggregate and present data
warehouse data to BI applications via real-time access.
There are several advantages to virtual data mart creation:
- No movement of data – The problems with bulk movement of
data into physical data marts simply do not exist with virtual
marts. Data can’t be lost or its delivery delayed.
- Real-time access to the data – Virtual data marts allow realtime
access to the historical data as long as the source data
warehouse is accessible. This is particularly important for data
warehouses that are updated frequently during the day.
- Reduced costs – Other than the EII technology, there is no
need for additional hardware or software which, in turn,
means a reduction in overall costs.
- Easy to maintain – Tearing down a view and rebuilding it is far
easier to do than unloading a physical data mart,
reconstructing its schema, and then reloading the data.
Therefore maintenance is simpler.
- Security reasons – For some organizations, there may be
privacy issues or compliance regulations that prevent
proliferation of sensitive data.
In addition to these advantages for virtual data marts, EII
technology brings its own specific advantages:
- Rapid and iterative development – For situations in which the
business users are unsure of their requirements or their
requirements are volatile, a virtual data mart using EII
technology makes sense. The ability to quickly implement the
virtual mart, get the users’ input, and then swiftly recreate the view is an inexpensive and beneficial way of gaining insight
into the necessary requirements. It ensures that the
application will satisfy the user community.
- Reduced maintenance – Just as the time for development is
reduced, so is the time for maintenance. In many cases, IT
may find that the EII technology is easy enough to use that
they no longer need to perform the maintenance; it can be
performed by the users themselves.
- Masked data warehouse complexity – The data schema for
the data warehouse is usually quite complex and rather
mystifying to the average business user. EII technology
eliminates the need for users to understand this data schema;
instead, a data view is customized into the way users want to
consume data.
However, there are several concerns of significance when
considering virtual data marts.
- Performance must be monitored – Virtual data marts may
appear to be separate entities to the business user but they
are indeed all using the same source of data – the data
warehouse. With multiple BI applications running and
multiple users accessing the data, the data warehouse must
be constantly monitored for performance. If a group of users
launches long-running, very complex queries that require
massive amounts of data, the performance of the entire
environment could be compromised. A physical data mart
may be needed for these users.
- Support for proprietary BI technologies – If the EII technology
cannot support a proprietary BI technology, then the IT
implementers have no choice but to create a separate,
physical data mart for that technology. Standards such as
SOAP, ODBC, and JDBC mitigate this problem however.
Example 3: When to use Physical Data Marts
There are some matters to consider when determining whether a
dependent physical data mart is appropriate.
- A change in technology – The data warehouse is best stored
in technology that can handle massive amounts of data with
fast load and unload times. These are usually relational
databases or some form of data warehouse appliance.
Some BI technologies may use their own form of proprietary technology or require a technology different from that used
for the data warehouse. If this is the case, then you have no
choice but to extract a subset of data warehouse data for
the application and load it into a separate data mart for
usage by the BI technology.
- Performance reasons – Performance of the BI application
can be better if a physically distinct data mart is set up for it.
For example, heavy data mining analytics can impact the
lighter reporting requests if both are performed within the
data warehouse environment. It makes sense then to
replicate the data mining data into its own environment
where performance can be tailored to the data miners’
specific needs.
- Security reasons – For some organizations, there may be a
need to create a separate data mart or warehouse that
contains sensitive data or algorithms. Privacy issues, security
worries, or in some cases, even compliance regulations
mandate a separate and protected environment.
There are disadvantages to building separate physical data marts
as well:
- Redundancy of data – Building a separate physical data
mart means more storage, more ETL processing, redundancy
of efforts, redundancy of the data, and higher costs. Also,
anytime you have to re-extract data from the data
warehouse, there is always the chance that the data mart
data will get out of sync with the data warehouse data.
There should be an established reason or ROI before this
process is performed.
- More IT maintenance and costs – Obviously, creation of a
physical data mart means extra work for the BI team. Any
time that the data mart design changes (which it certainly
will), the data has to be unloaded, the schema changed,
and the data reloaded – not a trivial task. In addition, IT must
maintain backup/recovery schemes and archiving media for
both the data warehouse and physical data marts. Most
physical marts require additional hardware and software
licenses, which add to the overall costs of the BI environment.
- Potential inconsistency – Many times, the dependent data
mart is created by a line-of-business IT group or by
technologically savvy business users themselves. These groups are generally not under the same constraints as the
data warehouse implementers are, and therefore, they may
not use the same nomenclature, algorithms, or even
formatting as was used in the data warehouse. This often
causes inconsistency between their analyses and those
performed by data warehouse analysts.
- Bulk movement of data – Because the data mart is physically
distinct from the data warehouse, the data must be
transferred via the network to the mart. Any time data must
be physically moved, there exists the possibility of problems,
e.g., the delivery can be delayed or data may be only
partially delivered. Secondly, if the data warehouse is
updated more than once a day, it may not be possible to
reload the physical data marts with each update.
Data Warehouse / Data Mart Archiving and
Extension Integration Examples
Example 4: Combine Data Warehouse (or Data Mart)
Data with Archived Data Warehouse Data
Many enterprises now archive older, infrequently used snapshots of
data warehouse data to other media. Because the data is not
contiguous with the data warehouse, it may be difficult to satisfy
one-off queries that span the two data forms. The easiest and
fastest way to combine these sets of data is to use virtual data
federation technology. Rather than physically (re)loading archived
data into the data warehouse, the BI team can quickly create a
virtual environment with EII technology. When the query completes,
the virtual combination can be just as quickly dismantled.
The advantages and drawbacks to this approach are the same as
for a virtual data mart (Example 2). Fortunately there is one other
advantage for this virtual scenario: usually very few users require the
ability to span long timeframes of data warehouse data so their
queries will usually not affect the performance of other users.
Example 5: Combine Data from Multiple Data
Warehouses
Another scenario all too common in enterprises today is the
existence of multiple data warehouses. The reasons for this are varied – mergers and acquisitions, political or cultural constraints,
and security or privacy concerns to name just a few. Inevitably, a
request will come up that requires data from these warehouses.
Without having to overhaul the entire BI environment, such requests
can often be satisfied by using virtual data federation techniques to
create a virtual connection between the physical stores.
The advantages are similar to those for a virtual data mart – speed,
low cost, no movement of data, and ease of maintenance.
However, there is one significant concern to resolve – it is doubtful
that the various data warehouses were built using an overarching
standard data model. Therefore, they most likely will not have used
consistent formats, names, data quality activities, calculations, etc.,
during their data consolidation processes. If a virtual connection is
to be made between these repositories, then these disparities must
be abstracted or easily transformed.
Operational Data Store Integration Examples
The operational data store (ODS) is a sister to the data warehouse,
similar in some aspects, but also similar to operational systems in others.
Like the data warehouse, the ODS contains integrated, cleansed data
for usage in reporting and analytics. But there the similarities end. The
ODS contains current – or as current as the technology will allow – data
that is updated rather than stored as historical snapshots. Little history is
stored in the ODS which, along with the update status of its data, makes
it similar to operational systems.
The ODS is often used to create management reports on operations or
to perform analyses on current events like campaigns or other sales
events, channel activities, inventory statuses, daily sales, etc.
Example 6: When to use a Virtual ODS
A real-time, virtual ODS is possible through the use of virtual data
federation. For this, EII technology is used to directly access data in
the operational systems. There are certain conditions that should
be met before a virtual ODS makes sense. These are:
- Operational need for real-time, not near real-time, data – The
business users have a mandatory and demonstrable
requirement for integrated, current data. Compliance,
customer satisfaction, and just-in-time inventory are just a few
of the drivers behind this need for real-time data.
- Fairly well-integrated operational data – The amount of
integration and quality processing that can be done “on the
fly” is limited so source data for the virtual ODS should be in
good shape to begin with. An example of this characteristic
would be data from multiple instances of an ERP application
such as SAP’s R3. The data exists in the same format,
meaning, and usage – it just resides in multiple places.
- Narrow queries – EII’s high performance query processing
works best when the federated queries are specific in nature.
They should focus on a small or limited amount of data
returning an equally limited amount of data. Generally the
ODS is used to examine a few records – e.g., a customer’s
current transactions or a product’s worldwide inventory level.
- Unpredictable queries – Operational BI needs and data are
quite varied – e.g. a customer’s last five shipments are
needed, next their last ten invoices, and then, their last three
returns. A virtual ODS can easily meet these diverse and
unpredictable operational requirements.
The advantages of a virtual ODS can be significant:
- Short time to solution - A virtual ODS is quite useful when there
is a short fuse on the need to access of operational data.
Access can be implemented in a matter of hours or, at most
a day or two.
- Frequent change – If changes in requirements are constantly
requested, the ability to quickly rework the data presentation
is a great way to ensure user satisfaction. The agility of the
enterprise places a high premium upon this form of ODS.
- Security or privacy issues prevent replication of data – The
need to access integrated data may be mandatory but the
ability to create a physical ODS may be prohibited due to
security, regulatory, or other issues.
Example 7: When to use a Physical ODS
A virtual ODS not make sense when the data volumes requested
are too high, the integration and quality issues cannot be resolved
at run time, and the performance of the underlying operational
systems becomes compromised. A physical ODS is mandatory
under these circumstances:
- Complex data integration processing – A physical ODS may
be the only place where a current “360 degree view” of
important enterprise entities such as customer or product can
exist. Many CRM projects used the ODS as the cornerstone
for understanding an enterprise’s customers. If this data
requires considerable integration processing, data
consolidation using traditional ETL technology is required.
- Extensive data quality processing – Similar to the data
integration situation, many corporations must perform
considerable name and address matching and hygiene
along with house-holding processes before the customer
data is usable. If this is the case, then a permanent, physical
ODS is recommended.
- Access by high numbers of users – If the ODS is to be made
available to every frontline employee in an organization, its
performance must be the principal driver and this access
must not impact the operational systems performing their
daily tasks. In this case, IT should construct a physical ODS.
Of course, there are some drawbacks to a physical ODS as well:
- Data latency – A physical ODS usually does not contain realtime
data; the data has some latency due to the data
consolidation processes performed. This latency can be from
a few seconds in duration up to 24 hours depending on the
state of the operational environment, business users’
requirements, and the maturity of the data integration and
quality infrastructure.
- Potential performance problems – If the ODS is used for
management reporting, operational look-ups, and BI
analytics, it must be closely monitored to detect potential
and real performance issues. Such a mixed workload can
impact response times unless the ODS is implemented with
appropriate indexing schemes, partitioning, etc.
Operational BI Integration Examples
Operational BI is the newest form of BI adopted by enterprises. It is
meant to support operational decision-making by combining the results
from strategic analyses with real-time and low-latency sources of data.
A formal definition is:
Operational BI is a set of services, applications and
technologies for monitoring, reporting on, analyzing and
managing the business performance of an organization’s daily
business operations.
Operational BI is used by line of business managers, front line employees
and even operational processes to manage and optimize daily business
operations. Its goal is to compress the time latency between knowing
when an important business event has occurred and taking action in
response to it. To be optimal, operational BI relies on the continuous
availability of operational data and analytical results.
Example 8: The Need for Multiple Data Sources for
Operational BI
The need to combine multiple, disparate sources of data in support
of operational BI makes a great case for virtual data federation.
The sources of data for operational BI include the operational
systems, operational data store, and traditional BI analytics from the
data warehouse and/or data marts. From the point of view of the
applications and business users, the data should appear as if it were
in a single, seamless store of data. Of course, in reality, the data
hasn’t been moved at all – it still resides in its original locations.
Wrap Up
This white paper has covered the most common data integration
scenarios, differentiating when to use physical data consolidation
techniques versus virtual data federation ones in support of today’s
diverse BI strategies and architectures. As a final piece of advice, here
are some overarching considerations:
- Understand the business problem – The best way of discerning
which technology and technique will work best in your IT
environment comes from a careful study of the business
requirements. Do the business users need real-time, low latency
data or historical data or a combination of all three? For the
business processes, what types of data and interconnectivity is
needed, who will be using the environment (analysts, executives or
front-line operations personnel), what is the timeframe for decision
making – immediate, tactical, or strategic in nature?
- Understand the pros and cons of each technique and technology
– Throughout this white paper, the benefits and drawbacks of each
form of implementation – physical versus virtual – have been given. Flexibility is important so use it to your advantage. Study these to
help you determine what will work best for your specific situation.
- Keep a focus on future needs – No matter where you are today,
the business will move rapidly into a new and different direction.
Keep your options open regarding physical data consolidation and
virtual data federation. If you are heavily using data consolidation
today, you may find that data federation makes more sense in the
near future. Reduction of costs, time to implementation, ease of
maintenance, etc., are strong forces that may change the
technological directions of your future BI implementations.
1 For more on the CIF, see CORPORATE INFORMATION FACTORY, 2nd Edition ISBN 0-471-39961-2
Copyright © 2008 – Intelligent Solutions, Inc. – All Rights Reserved
Republished with permission from:
Composite Software Inc.
2655 Campus Drive, Suite 200 | San Mateo, CA 94403
Phone: 650.227.8240 | Mobile: 510.304.3793 | Fax: 650.227.8199
reve@compositesw.com | www.compositesw.com