All data can only have the existence value
by getting involved in the computation to create value. The big data makes no
exception. The computational capability on structural big data determines the range of practical applications of big data. In this article, I'd like to
introduce several commonest computation methods: API, Script, SQL, and SQL-like
languages.
API: The "API" here refers to
a self-contained API access method without using JDBC or ODBC. Let's take
MapReduce as an example. MapReduce is designed to handle the parallel
computation cost-effectively from the very bottom layer. So, MapReduce offers superior
scale-out, hot-swap,
and cost efficiency. MapReduce is one of the Hadoop
components with open-source code and abundant resources.
Sample code:
public void
reduce(Text key, Iterator<Text> value,
OutputCollector<Text, Text> output, Reporter arg3)
throws IOException {
double
avgX=0;
double
avgY=0;
double
sumX=0;
double
sumY=0;
int
count=0;
String []
strValue = null;
while(value.hasNext()){
count++;
strValue = value.next().toString().split("\t");
sumX
= sumX + Integer.parseInt(strValue[1]);
sumY
= sumY + Integer.parseInt(strValue[1]);
}
avgX =
sumX/count;
avgY =
sumY/count;
tKey.set("K"+key.toString().substring(1,2));
tValue.set(avgX + "\t" + avgY);
output.collect(tKey, tValue);
}
Since
the universal programing language adopted is unsuitable for the specialized
data computing, MapReduce is less capable than SQL and other specialized
computation languages in computing. Plus, it is inefficient in developing. No
wonder that the programmers generally complain it is "painful". In
addition, the rigid framework of MapReduce results in the relatively poorer
performance.
There
are several products using API, and MapReduce is the most typical one among
them.
Script: The "Script" here
refers to the specialized script for computing. Take esProc as an example.
esProc is designed to improve the computational capability of Hadoop. So, in
addition to the inexpensive scale-out, it also offers the high performance,
great computational capability, and convenient computation between
heterogeneous data sources, especially ideal for achieving
the complex computational goal. In addition, it is the grid-style script
characterized with the high development efficiency and complete debug
functions.
Sample code:
|
A
|
B
|
1
|
=file(“hdfs://192.168.1.200/data/sales.txt”).size()
|
//file size
|
2
|
=10
|
//number of tasks
|
3
|
=to(A2)
|
//1 ~ 10, 10 tasks
|
4
|
=A3.(~*int(A1/A2))
|
//parameter
list for start pos
|
5
|
=A3.((~-1)*int(A1/A2)+1)
|
//parameter list
for end pos
|
6
|
=callx(“groupSub.dfx”,A5,A4;[“192.168.1.107:8281”,
“192.168.1.108:8281”])
|
//sub-program
calling, 10 tasks to 2 parallel nodes
|
7
|
=A6.merge(empID)
|
//merging task result
|
8
|
=A7.group@i(empID;~.sum(totalAmount):orderAmount,~.max(max):
maxAmount,~.min(min):
minAmount,~.max(max)/~.sum(quantity):
avgAmount)
|
//summarizing is completed
|
Java
users can invoke the result from esProc via JDBC, but they are only allowed to invoke
the result in the form of stored procedure instead of
any SQL statement. Plus, esProc is not open source. These are two disadvantages of esProc.
The
Script is widespread used in MongoDB, Redis, and many other big data solutions,
but they are not specialized enough in computing. For another example, the multi-table
joining operation for MongoDB
is not only inefficient, but also involves the coding of one order of magnitude
more complex than that of SQL or esProc.
SQL: The "SQL" here refers to
the complete and whole SQL/SP, i.e. ANSI 2000 and its superset. Take Greenplum
as an example, the major advantages of Greenplum SQL are the powerful
computing, highly efficient developing, and great performance. Other advantages
include the widespread use of its language, low learning cost, simple
maintenance, and migration possibility - not to mention its trump-card of offering
support for stored procedure to handle the complex computation. By this way,
business value can be exploited from the big data conveniently.
Sample code:
CREATE OR REPLACE function view.merge_emp()
returns void as $$
BEGIN
truncate
view.updated_record;
insert into
view.updated_record select y.* from view.emp_edw x right outer join emp_src y on
x.empid=y.empid where x.empid is not
null;
update
view.emp_edw set deptno=y.deptno,sal=y.sal from view.updated_record y where view.emp_edw.empid=y.empid;
insert into
emp_edw select y.* from emp_edw x right outer join emp_src y on x.empid=y.empid where x.empid is null;
end;
$$ language 'plpgsql';
The
other databases with
the similar structure to MPP include
Teradata, Vertical, Oracle, and IBM. Their syntax characteristics are mostly
alike. The disadvantages are similar. The acquisition cost and the
ongoing maintenance expenses are extremely high. Charging its users by data
scale, the so-called inexpensive Greenplum is actually not a bargain at all -
it is way more like making big money under cover of big data. Other
disadvantages include awkward debugging, incompatible syntax, lengthy down-time
if expansion, and awkward multi-data-source computation.
SQL-like language: It refers to the
output interfaces like JDBC/ODBC and only limited to those scripting languages
that are the subset of standard SQL. Take Hive QL as an example. The greatest
advantage of Hive QL is its ability to scale out cost-effectively while still a
convenient tool for users to develop. The SQL syntax feature is kept in Hive
QL, so that the learning cost is low, development efficient, and maintenance
simple. In addition, Hive is a component of Hadoop. The open-source is another
advantage.
Sample code:
SELECT e.* FROM (
SELECT name,
salary, deductions["Federal Taxes"] as ded,
salary * (1 –
deductions["Federal Taxes"]) as salary_minus_fed_taxes
FROM employees
) e
WHERE round(e.salary_minus_fed_taxes) > 70000;
The
weak point of Hive QL is its non-support for stored procedure. Due to this, it
is difficult for Hive QL to undertake the complex computation, and thus difficult to provide
the truly valuable result. The slightly more complex
computation will rely on MapReduce. Needless to say, the development efficiency
is low. The poor performance and the threshold time can be regarded as a bane,
especially in task allocation, multi-table joining, inter-row computation,
multi-level query, and ordered grouping, as well as
implementing other algorithm alike. So, it is quite difficult for Hive QL to implement the real-time Hadoop application for big data.
There
are also some other products with SQL-like languages - MongoDB as an example -
they are still worse than Hive yet.
The
big data computation methods currently available are no more than these 4 types
of API, Script, SQL, and SQL-like languages. Wish they would develop
steadfastly and
there would be more and more cost-effective, powerful,
practical and usable products for data computing.