Thursday, April 24, 2014

Several Methods for Structured Big Data Computation

Data can create value only by getting involved in computation, and big data is no exception. The capability to compute on structured big data determines the range of practical applications of big data. In this article, I'd like to introduce several of the most common computation methods: API, script, SQL, and SQL-like languages.

API: The "API" here refers to a self-contained API access method that does not rely on JDBC or ODBC. Let's take MapReduce as an example. MapReduce is designed to handle parallel computation cost-effectively from the very bottom layer, so it offers superior scale-out, hot-swapping, and cost efficiency. MapReduce is a Hadoop component with open source code and abundant resources.
Sample code:
public void reduce(Text key, Iterator<Text> value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    double sumX = 0;
    double sumY = 0;
    int count = 0;
    String[] strValue;
    while (value.hasNext()) {
        count++;
        // Each value holds tab-separated fields; field 1 is X, field 2 is Y.
        strValue = value.next().toString().split("\t");
        sumX += Integer.parseInt(strValue[1]);
        sumY += Integer.parseInt(strValue[2]);
    }
    double avgX = sumX / count;
    double avgY = sumY / count;
    // Emit the two averages keyed by a shortened group key.
    Text tKey = new Text();
    Text tValue = new Text();
    tKey.set("K" + key.toString().substring(1, 2));
    tValue.set(avgX + "\t" + avgY);
    output.collect(tKey, tValue);
}
Since MapReduce programs are written in a general-purpose programming language unsuited to specialized data computing, MapReduce is less capable in computation than SQL and other dedicated computation languages, and development with it is inefficient; no wonder programmers generally complain that it is "painful". In addition, the rigid framework of MapReduce results in relatively poor performance.

Several products take the API approach, and MapReduce is the most typical among them.

Script: The "script" here refers to a specialized script for computing. Take esProc as an example. esProc is designed to improve the computational capability of Hadoop, so in addition to inexpensive scale-out it also offers high performance, great computational capability, and convenient computation across heterogeneous data sources; it is especially suited to complex computational goals. Moreover, it is a grid-style script characterized by high development efficiency and complete debugging functions.
Sample code:

A1: =file("hdfs://192.168.1.200/data/sales.txt").size()    // file size
A2: =10    // number of tasks
A3: =to(A2)    // 1~10, i.e. 10 tasks
A4: =A3.(~*int(A1/A2))    // parameter list of segment end positions
A5: =A3.((~-1)*int(A1/A2)+1)    // parameter list of segment start positions
A6: =callx("groupSub.dfx",A5,A4;["192.168.1.107:8281","192.168.1.108:8281"])    // call the sub-program: 10 tasks on 2 parallel nodes
A7: =A6.merge(empID)    // merge the task results
A8: =A7.group@i(empID;~.sum(totalAmount):orderAmount,~.max(max):maxAmount,~.min(min):minAmount,~.sum(totalAmount)/~.sum(quantity):avgAmount)    // final summarization
Java users can invoke esProc results via JDBC, but only in the form of a stored procedure call, not as arbitrary SQL statements. In addition, esProc is not open source. These are its two disadvantages.
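
For illustration, here is a minimal Java sketch of such a JDBC invocation. The driver class name, connection URL, and call syntax are assumptions based on common esProc usage, and groupSub refers to the hypothetical script from the sample above.

import java.sql.*;

public class EsProcCall {
    public static void main(String[] args) throws Exception {
        // Driver class and URL are assumptions; check the esProc documentation.
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             // Scripts are invoked like stored procedures, not as free-form SQL.
             CallableStatement st = con.prepareCall("call groupSub(?)")) {
            st.setObject(1, 10);                       // hypothetical script parameter
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getObject(1));
                }
            }
        }
    }
}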

Scripting is widely used in MongoDB, Redis, and many other big data solutions, but those scripts are not specialized enough for computing. For example, a multi-table join in MongoDB is not only inefficient, but also requires code an order of magnitude more complex than the equivalent in SQL or esProc.
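
To make that complexity gap concrete, below is a rough plain-Java sketch of the hand-written hash join an application must implement itself when the data store offers no join; the collections and field names are invented for illustration.

import java.util.*;

public class ClientSideJoin {
    public static void main(String[] args) {
        // Invented sample data standing in for two collections without join support.
        List<Map<String, Object>> employees = List.of(
                Map.of("empName", "Smith", "empDept", "Sales"),
                Map.of("empName", "Jones", "empDept", "HR"));
        List<Map<String, Object>> departments = List.of(
                Map.of("deptName", "Sales", "manager", "Brown"),
                Map.of("deptName", "HR", "manager", "Lee"));

        // Build a hash index on the join key of the smaller side.
        Map<Object, Map<String, Object>> byDept = new HashMap<>();
        for (Map<String, Object> d : departments) {
            byDept.put(d.get("deptName"), d);
        }

        // Probe the index for every employee and merge matching rows by hand.
        for (Map<String, Object> e : employees) {
            Map<String, Object> d = byDept.get(e.get("empDept"));
            if (d != null) {
                Map<String, Object> row = new HashMap<>(e);
                row.putAll(d);
                System.out.println(row);
            }
        }
    }
}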

SQL: The "SQL" here refers to complete SQL with stored procedures (SQL/SP), i.e. the ANSI standard and its supersets. Take Greenplum as an example: the major advantages of Greenplum SQL are powerful computation, highly efficient development, and great performance. Other advantages include the widespread use of the language, low learning cost, simple maintenance, and migration possibilities, not to mention its trump card of supporting stored procedures for complex computation. In this way, business value can be conveniently extracted from big data.
Sample code:
CREATE OR REPLACE FUNCTION view.merge_emp()
RETURNS void AS $$
BEGIN
    TRUNCATE view.updated_record;
    INSERT INTO view.updated_record
        SELECT y.* FROM view.emp_edw x RIGHT OUTER JOIN emp_src y
        ON x.empid = y.empid WHERE x.empid IS NOT NULL;
    UPDATE view.emp_edw SET deptno = y.deptno, sal = y.sal
        FROM view.updated_record y WHERE view.emp_edw.empid = y.empid;
    INSERT INTO emp_edw
        SELECT y.* FROM emp_edw x RIGHT OUTER JOIN emp_src y
        ON x.empid = y.empid WHERE x.empid IS NULL;
END;
$$ LANGUAGE 'plpgsql';

Other databases with a similar MPP structure include Teradata, Vertica, and the offerings of Oracle and IBM. Their syntax characteristics are mostly alike, and so are their disadvantages: the acquisition cost and ongoing maintenance expenses are extremely high. Charging its users by data scale, the supposedly inexpensive Greenplum is actually no bargain at all; it looks more like making big money under the cover of big data. Other disadvantages include awkward debugging, incompatible syntax, lengthy downtime during expansion, and awkward multi-data-source computation.

SQL-like languages: These are scripting languages exposed through output interfaces like JDBC/ODBC and limited to a subset of standard SQL. Take Hive QL as an example. The greatest advantage of Hive QL is its ability to scale out cost-effectively while remaining a convenient development tool. Since the SQL syntax style is kept, the learning cost is low, development is efficient, and maintenance is simple. In addition, Hive is a Hadoop component, and being open source is another advantage.
Sample code:

SELECT e.* FROM (
    SELECT name, salary, deductions["Federal Taxes"] AS ded,
           salary * (1 - deductions["Federal Taxes"]) AS salary_minus_fed_taxes
    FROM employees
) e
WHERE round(e.salary_minus_fed_taxes) > 70000;
The weak point of Hive QL is its lack of support for stored procedures. Because of this, it is difficult for Hive QL to undertake complex computation and thus to provide truly valuable results. Anything slightly more complex has to fall back on MapReduce, with its low development efficiency. Poor performance and long startup delays are a further bane, especially in task allocation, multi-table joins, inter-row computation, multi-level queries, ordered grouping, and similar algorithms. So it is quite difficult for Hive QL to implement real-time Hadoop applications for big data.

There are also other products with SQL-like languages, MongoDB for example, and they are generally even less capable than Hive.

The big data computation methods currently available are essentially these four types: API, script, SQL, and SQL-like languages. May they develop steadily, so that more and more cost-effective, powerful, and practical products for data computing appear.

About esProc: http://www.raqsoft.com/product-esproc

Sunday, April 20, 2014

How to Facilitate Relational Reference: Generic, Sequence, and Table Sequence

Based on the generic data type, esProc provides the sequence and the table sequence to implement complete "set-lizing" and much more convenient relational queries.

The relation between department and employee is one-to-many, and that between employee and SSN (Social Security Number) is one-to-one. Everything is related to everything else in the world, and a relational query is the access to related datasets in mathematical language. Thanks to the join query, the relational database (RDBMS) has been extensively adopted.
I Case and Comparison
Case
A telecommunications enterprise needs to perform this analysis: find the annual outstanding employees whose line managers have been awarded the president's honor. The data come from two tables: a department table mainly consisting of the deptName and manager fields, and an employee table mainly consisting of the empName, empHonor, and empDept fields.

empHonor can take three kinds of values: a null value; "president's award", PA for short; and "employee of the year", EOY for short. The tables are related in two ways: empDept corresponds to deptName, and manager corresponds to empName.

SQL Solution
SELECT A.*
FROM employee A, department B, employee C
WHERE A.empDept = B.deptName AND B.manager = C.empName
  AND A.empHonor = 'EOY' AND C.empHonor = 'PA'

A complex SQL JOIN query can solve such problems. Here the join is written in its compact comma-separated form for brevity and clarity. The association conditions after "where" establish the one-to-many relation between deptName and empDept, and the one-to-one relation between manager and empName.

esProc Solution
employee.select(empHonor:"EOY",empDept.manager.empHonor:"PA")

The esProc solution is quite intuitive: select the employees with the EOY whose respective line managers have won the PA.

Comparison
Regarding the SQL solution, the statement is lengthy and not intuitive. Actually, the complete associated query statement is "inner join…on…"; we have put it in a rather simplified way, or the statement would be even harder to comprehend.

Regarding the esProc solution, esProc fields are of generic type and can point to any data or dataset, including the records of an associated table. Therefore, you can simply use the "." symbol to access the associated table directly. Represented in such an intuitive and easy-to-understand way, esProc converts the complicated and lengthy SQL statements for multi-table association into simple object access, which is unachievable in SQL.

II Function Description
Generic Data Type

The data in esProc are all of generic type; that is, data types are not strictly distinguished. A piece of data can therefore be a simple value like 1 or "PA", a set like [1,"PA"], or a set composed of sets, like database records.
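
As a rough Java analogy (esProc's own type system differs in its details), generic data behaves as if every value were typed Object; the values below are invented.

import java.util.List;

public class GenericData {
    public static void main(String[] args) {
        Object simple = 1;                       // a simple value
        Object text = "PA";                      // another simple value
        Object set = List.of(1, "PA");           // a set with members of mixed types
        Object setOfSets = List.of(              // a set of sets, much like records
                List.of("Smith", "PA"),
                List.of("Jones", "EOY"));
        System.out.println(simple + " " + text + " " + set + " " + setOfSets);
    }
}
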
Sequence

A sequence is a data structure specially designed for massive data analysis. It is similar to the combination of "array + set" in high-level languages; that is, esProc users can access members of any type by serial number, and perform intersection, union, and complement operations on these members. The sequence has two outstanding characteristics: it is generic and it is ordered.

For example, suppose sequence A is the set of line managers and sequence B is the set of award-winning employees. Then the award-winning managers can be computed as the intersection A^B, and the top three of them can be taken by serial numbers, e.g. with the serial-number set [1,2,3]. (Please refer to other documents for the characteristics of being ordered.)
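
For readers more familiar with Java, here is a rough analogy that treats a sequence as an ordered list which also supports set operations; the names and data are invented.

import java.util.*;

public class SequenceOps {
    public static void main(String[] args) {
        // Sequence A: line managers; sequence B: award-winning employees (made up).
        List<String> managers = new ArrayList<>(List.of("Brown", "Lee", "Smith"));
        List<String> awarded  = new ArrayList<>(List.of("Smith", "Jones", "Brown"));

        // Intersection, the counterpart of A ^ B.
        List<String> awardedManagers = new ArrayList<>(managers);
        awardedManagers.retainAll(awarded);
        System.out.println(awardedManagers);        // [Brown, Smith]

        // Ordered access by serial number, the counterpart of taking [1,2,3].
        System.out.println(awardedManagers.get(0)); // first member
    }
}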

esProc provides a great many easy-to-use functions for sequences. Analysis will be greatly simplified once you grasp the use of sequences.
Table Sequence
The table sequence is a sequence with a database table structure. As a sequence, it is both generic and ordered. In addition, the table sequence inherits the concept of the database table, allowing access to data by field and by record.

Being generic allows associated queries to be performed quite conveniently: accessing a record of an associated table is just like accessing an object. For example, to reach the line manager of a certain employee, you can simply write "empDept.manager". By comparison, the counterpart SQL syntax requires quite a lot of complex association statements: "from…where…" or "left outer/right outer/inner join…on…".
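
The plain-Java analogy below shows why "empDept.manager" reads like object access; the classes and data are invented for illustration.

public class ObjectStyleReference {
    // Invented classes mirroring the department and employee tables.
    static class Department { String deptName; Employee manager; }
    static class Employee { String empName; String empHonor; Department empDept; }

    // The condition empDept.manager.empHonor == "PA" becomes a chain
    // of field accesses instead of a join clause.
    static boolean managerWonPA(Employee e) {
        return e.empDept != null
                && e.empDept.manager != null
                && "PA".equals(e.empDept.manager.empHonor);
    }

    public static void main(String[] args) {
        Employee boss = new Employee();
        boss.empHonor = "PA";
        Department sales = new Department();
        sales.manager = boss;
        Employee worker = new Employee();
        worker.empDept = sales;
        System.out.println(managerWonPA(worker));   // prints: true
    }
}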

Moreover, being ordered is quite useful and convenient for solving tough computational problems that involve table sequences and serial numbers, such as computing top N, year-on-year statistics, and link relative ratio analysis.
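
For instance, a link relative ratio is just a comparison of each period with the previous one by serial number; here is a minimal Java sketch with made-up monthly figures.

public class LinkRelativeRatio {
    public static void main(String[] args) {
        double[] monthlySales = {120, 132, 125, 150};   // made-up figures
        for (int i = 1; i < monthlySales.length; i++) {
            // Each month compared with the month right before it.
            double ratio = monthlySales[i] / monthlySales[i - 1];
            System.out.printf("month %d vs %d: %.2f%n", i + 1, i, ratio);
        }
    }
}
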
III Advantages
The Access Syntax to Convert Complexity to Simplicity
esProc users can use "." to access records of the associated table. Compared with the lengthy and complicated association syntax of SQL, this access style is much easier.

Intuitive Analysis Is Ideal for Business Specialists
Analyzing from the business perspective, a business specialist can reach the result more correctly and rapidly. esProc users can access the associated data intuitively, following the business description, which makes esProc ideal for business specialists.

Easy to Analyze and Solve Problems
The sequence and table sequence of esProc are fit for processing massive data. Even for complicated multi-table associations, esProc users can solve problems conveniently in the course of data analysis.

About esProc: http://www.raqsoft.com/product-esproc

Wednesday, April 16, 2014

esProc Helps Databases Realize Real-time Big Data Computing

A real-time big data application returns computation and analysis results in real time even when the amount of data is huge. This is an emerging demand on database applications in recent years.

In the past, data volumes were small, computation was simple, and there was little concurrency, so the pressure on the database was not great. A high-end or mid-range database server or cluster could allocate enough resources to meet the demand. Moreover, in order to access the current business data and the historical data rapidly and in parallel, users also tended to run the query/analysis system and the production system on the same database server. In this way, database cost was lowered, data management streamlined, and concurrency ensured to some extent. It was the prime time of real-time database application development.

In recent years, owing to the data explosion and increasingly diversified and complex applications, the database landscape has changed. The obvious change is that data is growing at an accelerating pace to ever higher volumes; applications are increasingly complex, and the number of concurrent accesses is no exception. In this time of big data, the database is under increasing pressure, posing a serious challenge to real-time applications.

The first challenge is real-time responsiveness. Under a heavy workload, database performance drops dramatically, responses become sluggish, and the user experience quickly goes from bad to worse. The normal operation of critical business systems is seriously affected, and the real-time application in effect becomes only half real-time.

The second challenge is cost. In order to alleviate the performance pressure, users have to upgrade the database. Database servers are expensive, and so are the storage media and user licenses. Most databases charge extra by the number of CPUs, cluster nodes, or size of storage space. Because data volume and database pressure increase constantly, such upgrades have to be repeated at intervals.

The third challenge is the database application itself. The increasing pressure on the database can seriously affect the core business application, so users have to off-load the historical data from the database. Two groups of database servers thus come into being: one storing the historical data, the other storing the core business data. As we know, the native cross-database query capability of databases is quite weak and its performance low, yet to deliver the latest analysis results on time, applications must perform cross-database queries over both groups. Application programming gets ever more complex.
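
For a sense of what that cross-database programming looks like from the application side, here is a minimal Java sketch; the connection URLs, table names, and columns are all hypothetical, and every aggregation across the two sources has to be re-implemented by hand like this.

import java.sql.*;
import java.util.*;

public class CrossDbQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection strings for the historical and current databases.
        try (Connection hist = DriverManager.getConnection("jdbc:postgresql://hist/db");
             Connection curr = DriverManager.getConnection("jdbc:postgresql://curr/db")) {
            Map<String, Double> totals = new HashMap<>();
            accumulate(hist, "SELECT empid, amount FROM sales_hist", totals);
            accumulate(curr, "SELECT empid, amount FROM sales_curr", totals);
            System.out.println(totals);  // merged result neither database produces alone
        }
    }

    // The merge logic lives in the application, not in either database.
    static void accumulate(Connection con, String sql, Map<String, Double> totals)
            throws SQLException {
        try (Statement st = con.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                totals.merge(rs.getString(1), rs.getDouble(2), Double::sum);
            }
        }
    }
}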

The fourth challenge is database management. In order to deliver the latest analysis results on time while avoiding complex and inefficient cross-database programming, most users accept the increase in management cost and difficulty: updating the historical library with the latest data from the business library in a timely manner. Advanced database editions usually provide subscription/distribution or data replication functions for this purpose.

Beset by these four challenges, the real-time big data application can hardly make progress.

How to guarantee the concurrency of the big data application? How to reduce the database cost while ensuring real-time performance? How to implement cross-database queries easily? How to reduce the management cost and difficulty? These are among the hottest topics discussed by CIOs and CTOs.

esProc is a good remedy for this stubborn headache. It is a database middleware with complete computational capability, supporting computation over external storage, across databases, and in parallel. The combination of the database and esProc delivers enough capability to solve the four challenges of big data applications.


esProc supports computation over files in external storage and in HDFS. That is to say, you can store a great volume of historical data on several cheap hard disks of average PCs and leave it to esProc to handle, while the database alone stores and manages only the current core business data. The goal of cutting cost and diverting computational load is thus achieved.
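
For orientation, here is a minimal Java sketch of reading such an HDFS file with the standard Hadoop client API; the namenode address and file path are hypothetical, and esProc itself would access the file through its script, as in earlier examples.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.1.200:9000");  // hypothetical namenode
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sales.txt"))))) {
            System.out.println(in.readLine());  // first line of the historical data file
        }
    }
}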

esProc supports parallel cluster computing, so the computational pressure can be diverted to several cheap node machines under heavy workloads and great numbers of parallel, sudden access requests. Its real-time performance is equal or even superior to that of a high-end database.

esProc offers complete computational capability, especially for complex data computing; even alone it can handle applications involving complex business logic, and it does an even better job when working with the database. It supports computation over data from multiple sources, including structured data of various kinds, unstructured data, database data, local files, big data files in HDFS, and distributed databases. esProc provides a unified JDBC interface to the application at the upper level. Thus the coupling difficulty between big data and traditional databases is reduced, the limitation of single-source reporting is removed, and building big data applications becomes easier.

With seamless support for combined computation over files in external storage and data in the database, users no longer need complex and expensive data synchronization technology. The database focuses only on the current data and core business applications, while esProc enables users to access both the historical data in external storage and the current business data in the database. In this way, the latest analysis results can be delivered on time.

The cross-database and external-storage computation capabilities of esProc ensure real-time queries while alleviating the pressure on the database. With the assistance of esProc, real-time big data applications can be implemented efficiently at a relatively low cost.

About esProc: http://www.raqsoft.com/product-esproc

Monday, April 14, 2014

esProc Assists Hadoop in Replacing IOE

What is IOE? I = IBM, O = Oracle, and E = EMC. Together they stand for the typical high-end database and data warehouse architecture: high-end servers such as those from HP, IBM, and Fujitsu; high-end database software such as Teradata, Oracle, and Greenplum; and high-end storage such as EMC, Violin, and Fusion-io.

In the past, this typical high-performance database architecture was the preference of large and mid-sized organizations. It ran stably with superior performance, and it became popular when the degree of informatization was not so high and enterprise applications were simple. With the explosive data growth and today's diversified and complex enterprise applications, most enterprises have gradually realized that they should replace IOE, and quite a few of them, including Intel, Alibaba, Amazon, eBay, Yahoo, and Facebook, have successfully implemented road maps to drop the high-end database entirely.

The data explosion has brought a sharp increase in the demand for storage capacity, while diversified and complex applications pose the challenge of fast-growing computation pressure and parallel access requests. The only solution is to upgrade ever more frequently. More and more enterprise managements feel the pressure of the great cost of upgrading IOE; more often than not, enterprises still suffer from slow responses and heavy workloads even after investing heavily. That is why these enterprises are determined to replace IOE.

Hadoop is one of the IOE-replacement solutions on which enterprise managements have pinned great hope.

It supports cheap desktop hard disks as a replacement for the high-end storage media of IOE.

Its HDFS file system can replace the disk cabinets of IOE, ensuring secure data redundancy.

It supports cheap PCs as a replacement for the high-end database servers.

It is open-source software, incurring no extra cost for additional CPUs, storage capacity, or user licenses.

With its support for parallel computing, inexpensive scale-out can be implemented and the storage pressure diverted to multiple inexpensive PCs at a lower acquisition and management cost, yielding greater storage capacity, higher computing performance, and far more parallel processes than IOE. That is why Hadoop is so highly anticipated.

However, Hadoop's structured data computation still cannot reach the level of IOE, especially that of the relational databases led by Oracle. Data computing is the most important software function of the modern enterprise data center, and nowadays it is normal to find data computing that involves complex business logic, in particular in enterprise decision-making, procedure optimization, performance benchmarking, time control, and cost management. Hadoop alone cannot replace IOE; as a matter of fact, even the high-profile champions of replacing IOE have had to keep part of it. With the drawback of insufficient computing capability, Hadoop can only be used for simple ETL, data storage, and retrieval, and is awkward at truly massive business data computation.

To replace IOE, we need a computational capability no weaker than that of an enterprise-level database, seamlessly incorporated into Hadoop so as to give full play to Hadoop's advantages. esProc can meet this demand.

esProc is a parallel computing middleware built in pure Java and focused on empowering Hadoop. It can access Hive via JDBC or directly read and write HDFS. With its complete data computing system, esProc can replace most of the data computing capability of IOE in a simpler way. It is especially good at computation requiring complex business logic and stored procedures.

esProc provides a professional data scripting language with a true set data type, easy algorithm design from the business perspective, and effortless implementation of complex business logic. In addition, esProc supports ordered sets for arbitrary access to set members and serial-number-related computation. Sets of sets can represent complex grouping styles easily, for example equal grouping, align grouping, and enum grouping, as the sketch below suggests. esProc also provides complete code editing and debugging functions. It can be regarded as a dynamic set-lized language with something in common with R, while offering native support for distributed parallel computation from its core, so programmers benefit from efficient parallel computation while keeping a simple R-like syntax. Built for data computing and optimized for data processing, it surpasses the existing Hadoop solutions for structured data computing in both development efficiency and computing performance.
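
As a small analogy in Java streams (not esProc syntax), a grouping result really is a set of sets that can be computed on again; the rows are made up.

import java.util.*;
import java.util.stream.*;

public class GroupingAsSets {
    record Employee(String name, String dept, double amount) {}

    public static void main(String[] args) {
        List<Employee> emps = List.of(                      // made-up rows
                new Employee("Smith", "Sales", 100),
                new Employee("Jones", "Sales", 250),
                new Employee("Lee", "HR", 80));

        // Equal grouping: each group is itself a set that can be aggregated again.
        Map<String, List<Employee>> groups =
                emps.stream().collect(Collectors.groupingBy(Employee::dept));
        groups.forEach((dept, members) -> System.out.printf("%s: %.1f%n",
                dept, members.stream().mapToDouble(Employee::amount).sum()));
    }
}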

The combined use of Hadoop and esProc can fully remedy this drawback of Hadoop, empowering it to replace the vast majority of IOE features and improving its computing capability dramatically.


Thursday, April 10, 2014

esProc Empowers Computation Outside Database to Alleviate Data Warehouse Expansion Pressure

The data warehouse is essential to enterprise business intelligence, and it accounts for a great part of the total enterprise cost. With the global data explosion of recent years, business data volumes have grown significantly, posing a serious challenge for the enterprise data warehouse to meet diverse and complex business demands. More data, more data warehouse applications, more concurrent accesses, higher performance, and faster I/O: all these demands put more pressure on the data warehouse. Every IT manager nowadays is concerned with expanding data warehouse capacity at a lower cost.

Here is an example. A data warehouse is originally provisioned as shown below:
Server: one cluster with two high-performance database servers.
Storage space: 5TB high-performance disk array.
CPU: 8 high-performance CPUs.
User licenses: 100.
To meet the expansion needs of the next 12 months:
Computational performance: double.
Storage space: quadruple.
Concurrency: double.

How can an IT manager achieve this expansion goal? The common practice is to upgrade the database hardware and software: replace the servers with more advanced data warehouse servers or add two more of the same class, add a 15TB data-warehouse-grade disk array or switch to a 20TB disk cabinet, and add 8 CPUs. On top of that, they have to pay expensive software licensing fees for the additional user licenses, CPUs, and disk storage space.

Whichever way you choose to upgrade, the data warehouse vendor will ultimately bind you to their products and charge you for expensive upgrades.

Computation outside the database is an alternative way to expand capacity. As we all know, of the 20TB of data warehouse data (roughly 30% real data and 70% buffer), the core data usually takes up less than one tenth, i.e. about 1TB; the remaining space is occupied by redundant data. For example, after a new application is deployed, for the sake of core data security, the data warehouse usually requires a copy of the data being used, not allowing the application direct access to the core data. Quite often the new application needs access to summarized and processed core data, for which a core-data-based intermediate table is fabricated to speed up access. Such redundant data grows with the development of existing and emerging business, while the total amount of core data always stays low.

This redundant data is not core data and does not require a high level of security protection. If it is moved to average PCs, with tools other than the database used for reading, writing, and computing, the cost of database capacity expansion drops dramatically. So computation outside the database, combined with computation inside it, is the best choice for database capacity expansion. The benefits include:
Computational performance: implement parallel computation across multiple nodes using inexpensive PCs and desktop CPUs. Compared with a high-performance database, the same or even greater computational performance can be achieved at a relatively lower cost.
Storage space: with cost-effective desktop-class disks, users get a storage space far greater than data-warehouse-grade disks at an extremely low cost. HDFS also facilitates data security, access consistency, and non-stop disk capacity expansion.
Concurrency: with concurrent access spread over multiple nodes, centralized concurrent accesses can be allocated to multiple node machines, supporting more accesses than the centralized data warehouse alone. In addition, users do not have to pay for access licenses, additional CPUs, or disk storage space.

Computation outside the database looks pretty good, then, and Hadoop and similar software are available on the market to meet all the above demands. But why do few people take Hadoop as an option to alleviate the data warehouse expansion pressure? Because such software is not as powerful as the database in computing, in particular for computation involving complex logic.

What if there were software that met the above demands on computational performance, storage space, and concurrency, while remaining equal or even superior to the database in computing? Evidently, the capacity expansion pressure on the database would be relieved greatly, and so would the expansion cost.

esProc is built to meet these demands. It is middleware specially designed to undertake the computation jobs between the database and the application. Toward the application layer, esProc offers an easy-to-use JDBC interface; toward the database layer, esProc provides powerful parallel computation. By implementing computation outside the database or over external storage, esProc alleviates the pressure on the database in computation, storage, and concurrency. Owing to this, organizations can cut the cost of database software and hardware effectively while still streamlining database administration.

esProc is built with a comprehensive and well-defined computing architecture, fully capable of sharing the workload of databases and undertaking computations of whatever complexity for applications. In addition, esProc supports parallel computation across multiple nodes, so massive or intensive data computation workloads can be shared evenly among multiple average servers or inexpensive PCs.

With its support for parallel computation, esProc can evenly decompose computation jobs that used to be solved centrally and allocate them to multiple average PCs, so each node undertakes only a small share of the data computation.
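
A local-thread sketch of this decompose-and-merge pattern is shown below; real esProc nodes would be separate machines, and sumOfPartition is a made-up stand-in for one node's share of the work.

import java.util.*;
import java.util.concurrent.*;

public class DecomposeAndMerge {
    public static void main(String[] args) throws Exception {
        int tasks = 4;
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            final int part = i;
            parts.add(pool.submit(() -> sumOfPartition(part)));  // one task per node
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get();           // merge partial results
        pool.shutdown();
        System.out.println(total);
    }

    // Stand-in for the work one node would do on its slice of the data.
    static long sumOfPartition(int part) {
        long sum = 0;
        for (long v = part * 1000L; v < (part + 1) * 1000L; v++) sum += v;
        return sum;
    }
}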

With esProc, the core data can be stored in the database, while the intermediate tables and scripts derived from the core data are stored outside it. By leveraging resources reasonably, the workload pressure on the database is alleviated effectively, the database cost is kept under control, management problems are solved, and various data warehouse applications are handled with ease, including real-time high-performance applications, non-real-time big data applications, desktop BI, reporting, and ETL.

About esProc: http://www.raqsoft.com/product-esproc