These days, I see more and more organizations evaluating the use of Hadoop as a part of their data warehouse. Using Hadoop in addition to a "normal" SQL-based RDBMS has a number of advantages, just to list some of them:
1. Today, for various reasons, many organizations have to store many years of historical data. Such data usually doesn't have to be available online, unlike the data from recent months. Therefore, it makes sense to offload the old data from the database to avoid indefinite database growth, which implies high maintenance and licensing costs. Storing the data in Hadoop keeps it both relatively easily available and safe.
2. Hadoop makes it possible to perform complex and heavy batch processes, possibly involving external resources and complicated logic, outside the database. Offloading such activity from the database may help to distribute the hardware resources better: heavy batch processes no longer slow down the fast online queries that are expected to return within a few seconds. Also, Hadoop may scale complex processing much better than an RDBMS, provided the Hadoop job is designed correctly.
3. Offloading some processes to Hadoop makes it possible to run the database on much less powerful and cheaper hardware. This may also decrease the RDBMS license cost dramatically when licensing is priced per core.
Recently, one of my customers asked me for advice about a possible integration of IQ and Hadoop in a very big future project (they are talking about petabytes). They plan to keep a number of months of data in IQ and then offload it to Hadoop. When a user submits a query (or stored procedure) and specifies that it also needs the old data from Hadoop, a result set that combines the recent data from IQ and the older data from Hadoop should be returned.
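Just to make the requirement concrete, the result the users expect would look roughly like the sketch below. Both table names are hypothetical placeholders: bigtable_recent stands for the recent data kept in IQ, and bigtable_from_hadoop stands for whatever mechanism ends up exposing the older data from Hadoop to IQ.
-- Hypothetical sketch of the desired combined result set
select * from bigtable_recent -- recent months, stored in IQ
union all
select * from bigtable_from_hadoop; -- older history, exposed from Hadoop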
I have found a number of possibilities to accomplish this task. It is possible to write an IQ UDF in Java or C++ that accesses a file in HDFS. Then, the data in Hadoop can be accessed in a very simple way:
-- OpenHDFSFile is a UDF, implemented in Java or C++
select * from OpenHDFSFile('output/part-r-00001');
While this approach is definitely the most convenient one in terms of SQL coding, I have found it not very scalable. The Java UDF, for example, works quite fast when the number of rows to return is small (thousands), but with hundreds of thousands of rows it becomes quite slow and resource-demanding. Specifically, on an idle server, the Java UDF returning 100,000 rows took 17 seconds with an IQ temporary cache of 1 GB, while performing a fair amount of physical I/O. Increasing the temporary cache (-iqtc) to 10000 allows OpenHDFSFile to finish in about 5 seconds consistently. Increasing -iqtc further doesn't provide any improvement.
Then, I decided to try an alternative approach. The rationale is based on the fact that LOAD TABLE is the fastest and most scalable way to load data into IQ. Since IQ fully supports loading data from a FIFO pipe, we don't even have to unload the data from HDFS to a local filesystem in order to expose it to the LOAD TABLE command. See the steps below:
-- Create the global temporary table to collect the data from HDFS
-- This is a pure IQ table, we can index it and use it in subsequent queries as we like
create global temporary table glbtmp_bigtable_from_hdfs(
fname char(100),
fval unsigned int)
on commit preserve rows;
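-- Note: the FIFO pipe itself has to exist before LOAD TABLE reads from it.
-- A minimal sketch of creating it from within IQ, assuming xp_cmdshell is enabled
-- and the IQ server account can write to /iq_db/bigdata (the pipe can just as well
-- be created once from the shell with the mkfifo command):
call xp_cmdshell('mkfifo /iq_db/bigdata/bigtable_pipe');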
-- Manually triggered event that calls "hadoop fs -cat", which pushes the data
-- to a FIFO pipe created in advance with the mkfifo command.
create event ev_cat_bigtable
handler
begin
declare l_return_code int;
l_return_code = call xp_cmdshell('hadoop fs -cat /user/sybasebigtable_stage_hadoop_100000.dat > /iq_db/bigdata/bigtable_pipe');
message 'Return code: ' || l_return_code to console;
end;
-- The main procedure
-- It triggers the event, which runs asynchronously, and then executes LOAD TABLE statement that reads from the FIFO pipe.
-- We have to put the call to "hadoop fs -cat" into an event and execute it asynchronously,
-- because the LOAD TABLE statement blocks until EOF from the pipe is received.
create or replace procedure load_bigtable_from_hdfs()
begin
trigger event ev_cat_bigtable;
load table glbtmp_bigtable_from_hdfs(fname, fval)
from '/iq_db/bigdata/bigtable_pipe'
quotes off
escapes off
format bcp
delimited by '\x09'
row delimited by '\x0a';
select count(*) from glbtmp_bigtable_from_hdfs;
end;
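To close the loop on the original requirement, here is a usage sketch. The bigtable_recent table, standing for the recent data kept in IQ, is hypothetical; the rest uses the objects created above.
-- Usage sketch: load the old data from HDFS, then combine it with the recent data in IQ
call load_bigtable_from_hdfs();
select fname, fval from bigtable_recent -- recent data kept in IQ (hypothetical)
union all
select fname, fval from glbtmp_bigtable_from_hdfs; -- older data just loaded from HDFS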
I have found this approach much faster than the one that utilizes an IQ UDF. In my tests, it consistently takes about 2 seconds to load 100,000 rows, even with a minimalistic IQ cache configuration. Also, this approach requires zero Java/C++ programming, which in turn means easier maintenance. Of course, better exception handling should be implemented, but that is out of scope for now. It would be much better if IQ allowed loading from HDFS natively, as part of the LOAD TABLE syntax; I hope we will see it in IQ 16 at some point.
The next step of my research will probably be the integration between IQ and Hive. In my opinion, this combination could be very relevant to many customers.