Data has value only when it takes part in computation, and big data is no
exception. The computational capability available for structured big data determines the range of its practical applications. In this article, I'd like to
introduce several of the most common computation methods: API, Script, SQL, and SQL-like
languages.
API: The "API" here refers to a self-contained programming interface that is
accessed directly rather than through JDBC or ODBC. Let's take MapReduce as an example.
MapReduce is designed from the bottom layer up to handle parallel computation
cost-effectively, so it offers superior scale-out, hot-swap capability, and cost
efficiency. MapReduce is one of the Hadoop components; it is open source, and
resources for it are abundant.
Sample code:
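// tKey and tValue are assumed to be Text fields of the enclosing reducer class.
public void reduce(Text key, Iterator<Text> value,
        OutputCollector<Text, Text> output, Reporter arg3)
        throws IOException {
    double avgX = 0;
    double avgY = 0;
    double sumX = 0;
    double sumY = 0;
    int count = 0;
    String[] strValue = null;
    // Accumulate the X and Y columns of every record that shares this key.
    while (value.hasNext()) {
        count++;
        strValue = value.next().toString().split("\t");
        sumX = sumX + Integer.parseInt(strValue[1]);
        // Y is assumed to be in the column after X.
        sumY = sumY + Integer.parseInt(strValue[2]);
    }
    avgX = sumX / count;
    avgY = sumY / count;
    // Emit one line per key: the two averages, tab-separated.
    tKey.set("K" + key.toString().substring(1, 2));
    tValue.set(avgX + "\t" + avgY);
    output.collect(tKey, tValue);
}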
Because MapReduce relies on a general-purpose programming language that is not
suited to specialized data computing, it is less capable at computation than SQL
and other specialized computation languages. It is also inefficient to develop
in; no wonder programmers generally complain that it is "painful". In addition,
the rigid framework of MapReduce results in relatively poor performance.
Several products use the API approach, and
MapReduce is the most typical of them.
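To try the reduce logic from the sample above without a cluster, the following is a minimal plain-Java sketch that simulates the grouping and averaging locally. It does not use the Hadoop API at all; the class name `LocalAverageReducer`, the `key<TAB>x<TAB>y` input layout, and the sample records are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local stand-in for the shuffle and reduce phases; not Hadoop code.
final class LocalAverageReducer {

    /** Groups "key<TAB>x<TAB>y" lines by key, roughly what the shuffle phase does. */
    static Map<String, List<double[]>> groupByKey(List<String> lines) {
        Map<String, List<double[]>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split("\t");
            double[] xy = { Double.parseDouble(f[1]), Double.parseDouble(f[2]) };
            groups.computeIfAbsent(f[0], k -> new ArrayList<>()).add(xy);
        }
        return groups;
    }

    /** Averages x and y over all values of one key, as the reduce step does. */
    static double[] reduce(List<double[]> values) {
        double sumX = 0;
        double sumY = 0;
        for (double[] v : values) {
            sumX += v[0];
            sumY += v[1];
        }
        return new double[] { sumX / values.size(), sumY / values.size() };
    }

    public static void main(String[] args) {
        // Made-up sample records.
        List<String> lines = Arrays.asList("a\t1\t10", "b\t4\t40", "a\t3\t30");
        groupByKey(lines).forEach((key, values) -> {
            double[] avg = reduce(values);
            System.out.println(key + "\t" + avg[0] + "\t" + avg[1]);
        });
    }
}
```

Even in this reduced form, the grouping, parsing, and accumulation all have to be written by hand, which is the development cost the paragraph above describes.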
Script: The "Script" here refers to a scripting language specialized for
computing. Take esProc as an example. esProc is designed to improve the
computational capability of Hadoop. So, in addition to inexpensive
scale-out, it offers high performance, great computational capability,
and convenient computation across heterogeneous data sources, which makes it
especially suitable for complex computational goals. In addition, it is
a grid-style script characterized by high development efficiency and
complete debugging functions.
Sample code:
|   | A | B |
| 1 | =file("hdfs://192.168.1.200/data/sales.txt").size() | //file size |
| 2 | =10 | //number of tasks |
| 3 | =to(A2) | //1 ~ 10, 10 tasks |
| 4 | =A3.(~*int(A1/A2)) | //parameter list for end pos |
| 5 | =A3.((~-1)*int(A1/A2)+1) | //parameter list for start pos |
| 6 | =callx("groupSub.dfx",A5,A4;["192.168.1.107:8281","192.168.1.108:8281"]) | //sub-program calling, 10 tasks to 2 parallel nodes |
| 7 | =A6.merge(empID) | //merging task results |
| 8 | =A7.group@i(empID;~.sum(totalAmount):orderAmount,~.max(max):maxAmount,~.min(min):minAmount,~.max(max)/~.sum(quantity):avgAmount) | //summarizing is completed |
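The position arithmetic in cells A3 to A5 of the script can be sketched in Java as follows. This is only an illustration of how a file is cut into one byte range per task; the class name `TaskRanges` is hypothetical, and letting the last task absorb the leftover bytes is an assumption that the script above does not make.

```java
// Hypothetical helper mirroring cells A3 to A5 of the esProc script.
final class TaskRanges {

    /** Returns one {start, end} pair (1-based, inclusive) per task. */
    static long[][] split(long fileSize, int tasks) {
        if (tasks <= 0) {
            throw new IllegalArgumentException("tasks must be positive");
        }
        long chunk = fileSize / tasks;               // int(A1/A2)
        long[][] ranges = new long[tasks][2];
        for (int i = 1; i <= tasks; i++) {
            ranges[i - 1][0] = (i - 1) * chunk + 1;  // start pos, cell A5
            ranges[i - 1][1] = i * chunk;            // end pos, cell A4
        }
        // Assumption, not in the script: the last task also takes the
        // remainder so that no trailing bytes are skipped.
        ranges[tasks - 1][1] = fileSize;
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(1005, 10)) {
            System.out.println(r[0] + " - " + r[1]);
        }
    }
}
```

Each pair would then be handed to one sub-task, in the same way that the script passes the start and end lists to `callx`.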
Java users can invoke esProc results via JDBC, but only in the form of a stored procedure call rather than an arbitrary SQL statement. In addition, esProc is not open source. These are the two disadvantages of esProc.
Scripts are widely used in MongoDB,
Redis, and many other big data solutions, but they are not specialized enough
for computing. For example, a multi-table join in MongoDB is not only inefficient, but also
requires code an order of magnitude more complex than the SQL or
esProc equivalent.
SQL: The "SQL" here refers to complete SQL with stored procedures (SQL/SP), i.e. ANSI
2000 and its supersets. Take Greenplum as an example. The major advantages of Greenplum
SQL are powerful computing, highly efficient development, and great
performance. Other advantages include the widespread use of the language, low
learning cost, simple maintenance, and the possibility of migration, not to
mention its trump card: support for stored procedures to handle
complex computation. In this way, business value can be extracted from big
data conveniently.
Sample code:
CREATE OR REPLACE function view.merge_emp()
returns void as $$
BEGIN
    -- collect the source rows whose empid already exists in the target
    truncate view.updated_record;
    insert into view.updated_record
        select y.* from view.emp_edw x right outer join emp_src y on x.empid=y.empid
        where x.empid is not null;
    -- update the existing target rows from the collected records
    update view.emp_edw set deptno=y.deptno, sal=y.sal
        from view.updated_record y where view.emp_edw.empid=y.empid;
    -- insert the source rows that are not yet in the target
    insert into emp_edw
        select y.* from emp_edw x right outer join emp_src y on x.empid=y.empid
        where x.empid is null;
end;
$$ language 'plpgsql';
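For readers less familiar with stored procedures, the three steps of `merge_emp` (collect the matching rows, update them, insert the new ones) can be sketched over in-memory collections in Java. This is only an illustration of the logic, not how Greenplum executes it; the `MergeEmp` and `Emp` names are hypothetical, rows are reduced to the three columns the SQL mentions, and `empid` is assumed to be unique in the source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of the merge_emp logic.
final class MergeEmp {

    /** Minimal employee row with the columns used in the SQL above. */
    static final class Emp {
        final int empid;
        final int deptno;
        final double sal;

        Emp(int empid, int deptno, double sal) {
            this.empid = empid;
            this.deptno = deptno;
            this.sal = sal;
        }
    }

    /**
     * Merges src into edw in place and returns the rows that matched an
     * existing empid (the counterpart of view.updated_record).
     */
    static List<Emp> merge(Map<Integer, Emp> edw, List<Emp> src) {
        // Step 1: separate source rows that already exist in the target.
        List<Emp> updated = new ArrayList<>();
        List<Emp> added = new ArrayList<>();
        for (Emp e : src) {
            (edw.containsKey(e.empid) ? updated : added).add(e);
        }
        // Step 2: overwrite deptno and sal of the existing rows.
        for (Emp e : updated) {
            edw.put(e.empid, e);
        }
        // Step 3: insert the rows that were missing from the target.
        for (Emp e : added) {
            edw.put(e.empid, e);
        }
        return updated;
    }
}
```

In the database, each of these steps is a single set-based statement, which is where the development efficiency claimed for SQL comes from.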
Other MPP databases with a similar structure include Teradata, Vertica, Oracle, and IBM. Their
syntax characteristics are mostly alike, and so are their disadvantages: the acquisition
cost and the ongoing maintenance expenses are extremely high. Because it charges
users by data scale, the so-called inexpensive Greenplum is actually no
bargain at all; it looks more like making big money under cover of big data. Other
disadvantages include awkward debugging, incompatible syntax, lengthy downtime
during expansion, and awkward multi-data-source computation.
SQL-like language: This refers to languages that are exposed through output interfaces such as
JDBC/ODBC and that implement only a subset of
standard SQL. Take Hive QL as an example. The greatest advantage of Hive QL is
its ability to scale out cost-effectively while remaining a convenient tool for
development. Hive QL keeps the syntax features of SQL, so the
learning cost is low, development is efficient, and maintenance is
simple. In addition, Hive is a component of Hadoop, and being open source is another
advantage.
Sample code:
SELECT e.* FROM (
    SELECT name,
           salary,
           deductions["Federal Taxes"] AS ded,
           salary * (1 - deductions["Federal Taxes"]) AS salary_minus_fed_taxes
    FROM employees
) e
WHERE round(e.salary_minus_fed_taxes) > 70000;
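As a rough point of comparison, the same filter can be written by hand in Java. The sketch below only mirrors the query logic over an in-memory list; the `SalaryFilter`, `Employee`, and `Row` names are hypothetical. A missing "Federal Taxes" entry is treated like a SQL NULL, so the row is dropped, and `Math.round` is assumed to be close enough to Hive's `round` for positive salaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory counterpart of the Hive QL query above.
final class SalaryFilter {

    static final class Employee {
        final String name;
        final double salary;
        final Map<String, Double> deductions;  // like Hive's MAP column

        Employee(String name, double salary, Map<String, Double> deductions) {
            this.name = name;
            this.salary = salary;
            this.deductions = deductions;
        }
    }

    /** One output row of the inner SELECT. */
    static final class Row {
        final String name;
        final double salary;
        final double ded;
        final double salaryMinusFedTaxes;

        Row(String name, double salary, double ded, double salaryMinusFedTaxes) {
            this.name = name;
            this.salary = salary;
            this.ded = ded;
            this.salaryMinusFedTaxes = salaryMinusFedTaxes;
        }
    }

    static List<Row> highEarners(List<Employee> employees, double threshold) {
        List<Row> result = new ArrayList<>();
        for (Employee e : employees) {
            Double ded = e.deductions.get("Federal Taxes");
            if (ded == null) {
                continue;  // a NULL comparison is never true, so skip the row
            }
            double net = e.salary * (1 - ded);
            if (Math.round(net) > threshold) {
                result.add(new Row(e.name, e.salary, ded, net));
            }
        }
        return result;
    }
}
```

Calling `SalaryFilter.highEarners(employees, 70000)` plays the role of the outer `WHERE` clause; the nested query and the map lookup that Hive QL expresses in a few lines need explicit classes and a loop here.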
The weak point of Hive QL is its
lack of support for stored procedures. Because of this, it is difficult for Hive QL to
undertake complex computation, and thus difficult to provide truly
valuable results. Any slightly more complex computation has to fall back on MapReduce,
and needless to say, the development efficiency is then low. Poor performance and
high latency are a bane, especially in task allocation,
multi-table joins, inter-row computation, multi-level queries, ordered
grouping, and similar algorithms. So it is quite difficult for Hive QL to support real-time Hadoop applications for big data.
There are also some other products with
SQL-like languages, MongoDB for example, but they are still worse than Hive
in this respect.
About esProc: http://www.raqsoft.com/product-esproc