The data warehouse is essential to enterprise business intelligence, and it accounts for a large share of the total enterprise IT cost. With the global data explosion in recent years, business data volumes have grown significantly, posing a serious challenge for enterprise data warehouses to meet diverse and complex business demands. More data, more data warehouse applications, more concurrent accesses, higher performance, and faster I/O - all these demands put more pressure on the data warehouse. Every IT manager nowadays is concerned with expanding data warehouse capacity at a lower cost.
Here is an example. A data warehouse is originally provisioned as shown below:
Server: One cluster with two high performance database servers.
Storage space: 5TB high performance disk array.
CPU: 8 high performance CPUs.
User licenses: 100
To meet the capacity expansion needs of the coming 12 months, the requirements are:
Computational performance: Double
Storage space: Quadruple
Concurrency: Double
How can an IT manager achieve this expansion goal? The common practice is to upgrade the database hardware and software: replace the servers with more advanced data warehouse servers or add two more data warehouse servers of the same class, add a 15TB data-warehouse-specific disk array or switch to a 20TB disk cabinet, and add 8 CPUs. In addition, the additional user licenses, CPUs, and disk storage space all come with expensive software licensing fees.
Whichever upgrade path you choose, the data warehouse vendor will ultimately lock you into their products and charge you for these expensive upgrades.
Computation outside the database is an alternative way to expand capacity. As we all know, of the 20TB of data warehouse data (roughly 30% real data and 70% buffer), the core data usually accounts for less than one tenth, taking up about 1TB; the remaining 19TB is occupied by redundant data. For example, after a new application is deployed, the data warehouse usually requires a copy of the data it uses in order to protect the core data, rather than allowing the application to access the core data directly. Quite often, a new application needs access to records summarized and processed from the core data, so a core-data-based intermediate table is built to speed up access. Such redundant data keeps growing as existing and new business develops, while the total amount of core data always stays low.
This redundant data is not core data and does not require the same high level of security protection. By moving it to average PCs and using tools other than the database for reading, writing, and computing, the cost of database capacity expansion can be reduced dramatically. So computation outside the database, in combination with in-database computing, is the best choice for expanding database capacity. The benefits include:
Computational performance: Implement parallel computation across multiple nodes using inexpensive PCs and desktop CPUs. The same or even greater computational performance than a high-performance database can be achieved at a relatively lower cost.
Storage space: With cost-effective desktop-level disks, users get storage space far greater than that of a data-warehouse-specific disk array at an extremely low cost. HDFS also facilitates data security protection, access consistency, and non-stop disk capacity expansion (see the sketch after this list).
Concurrency: With concurrent access spread across multiple nodes, centralized concurrent access can be distributed to multiple node machines, supporting more accesses than the centralized access of the data warehouse alone. In addition, users do not have to pay for additional access licenses, CPUs, or disk storage space.
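As a concrete illustration of the HDFS point above, here is a minimal Java sketch using the standard Hadoop FileSystem API to read a file of redundant data that has been moved out of the warehouse; the name node address and file path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the (hypothetical) HDFS name node.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // Open a redundant-data file that was moved off the warehouse.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/warehouse/redundant/orders.csv")),
                     StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Process each record outside the database here.
            }
        }
    }
}
```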
Computation outside the database thus looks quite good, and Hadoop and similar software are available on the market to meet all the above demands. But why do so few people take Hadoop as an option to relieve the pressure of expanding data warehouse capacity? Because such tools are not as powerful as the database in computing, in particular for computations involving complex logic.
What if there were software that met the above-mentioned demands on computational performance, storage space, and concurrency, while being equal or even superior to the database in computing power? With such software, the capacity expansion pressure on the database would clearly be relieved greatly, and so would the cost of expanding it.
esProc is built to meet these demands. It is middleware specially designed to undertake the computation jobs between the database and the application. Toward the application layer, esProc offers an easy-to-use JDBC interface; toward the database layer, esProc is powerful in parallel computation. By carrying out computation outside the database, on external storage, esProc relieves the pressure on database computation, storage, and concurrency. Owing to this, organizations can effectively cut the cost of database software and hardware while still optimizing database administration.
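To show how an application reaches esProc through that JDBC interface, here is a minimal sketch; the driver class and connection URL follow esProc's published JDBC documentation, while the script name "demo" and its argument are hypothetical placeholders for a script deployed on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class EsprocJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Driver class and URL as given in esProc's JDBC documentation.
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
             // "demo" is a hypothetical esProc script name.
             PreparedStatement st = conn.prepareCall("call demo(?)")) {
            st.setObject(1, 2023); // hypothetical script argument
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getObject(1));
                }
            }
        }
    }
}
```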
esProc is built on a comprehensive and well-defined computing architecture, fully capable of sharing the workload of databases and undertaking computations of whatever complexity for applications. In addition, esProc supports parallel computation across multiple nodes, so a massive or intensive data computation workload can be shared evenly by multiple average servers or inexpensive PCs.
With this support for parallel computation, esProc can evenly decompose a job that used to be solved centrally and allocate the pieces to multiple average PCs, so that each node only has to undertake a small share of the computation.
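That decompose-and-aggregate pattern can be pictured with a single-machine analogue in plain Java; the sketch below splits a summing job into evenly sized chunks and runs them in parallel, with a thread pool standing in for the node machines (the data and node count are made up for illustration).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelDecomposeSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in data set; in reality, the redundant data moved off the warehouse.
        long[] data = new long[10_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i % 100;

        int nodes = 4; // each "node" plays the role of one inexpensive PC
        ExecutorService pool = Executors.newFixedThreadPool(nodes);
        List<Future<Long>> parts = new ArrayList<>();

        // Decompose the job into evenly sized chunks, one per node.
        int chunk = (data.length + nodes - 1) / nodes;
        for (int n = 0; n < nodes; n++) {
            int from = n * chunk;
            int to = Math.min(from + chunk, data.length);
            parts.add(pool.submit((Callable<Long>) () -> {
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }));
        }

        // Aggregate the partial results, as they would be collected from the nodes.
        long total = 0;
        for (Future<Long> part : parts) total += part.get();
        pool.shutdown();
        System.out.println("total = " + total);
    }
}
```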