
News

Posted over 11 years ago by Esen Sagynov
Today we are releasing a new version of the CUBRID driver for Node.js. node-cubrid 2.1.0 has a few API improvements which aim to provide more convenient APIs with full backward compatibility. However, the main objective of this release is to refactor the driver to achieve a significant performance improvement by leveraging the characteristics of the V8 and Node.js platforms. Continue reading to learn more about the changes.

Improvements

I will explain the changes chronologically.

We have added Travis Continuous Integration support to the node-cubrid project. Now all 268K assertions are automatically performed on Travis servers upon every commit. Check out the latest node-cubrid build status at https://travis-ci.org/CUBRID/node-cubrid/builds. Through Travis we can now guarantee that node-cubrid works on all major Node.js releases as well as CUBRID Server releases. node-cubrid 2.1.0 is even compatible with the latest CUBRID 8.4.4 version we released last week. To install CUBRID on the Travis servers we use the CUBRID Chef Cookbook.

Besides the Travis CI integration, we have added Coveralls code coverage support. Now whenever someone pushes a commit, Coveralls reports the percentage change in code coverage, i.e. whether the change increased or decreased coverage. It is a very convenient and encouraging platform for writing more tests. To see the current code coverage status of the node-cubrid project, visit https://coveralls.io/r/CUBRID/node-cubrid. At this moment we have 89% code coverage, so we know exactly how many more tests we need to add and which lines of code still need to be tested.

For users coming from MySQL, we have added alias functions: end(), as in node-mysql, is an alias for node-cubrid's close(), and createConnection() is an alias for createCUBRIDConnection(). If you migrate to node-cubrid, there is now less code to change in your application.

createConnection() now accepts an object of connection parameters; after all, JavaScript is all about objects. A connection timeout parameter can now be supplied to createConnection(). The query() function now accepts a params argument which you can use to pass an array of values to bind to ? placeholders in the provided SQL query. The _sqlFormat() function in the Helpers object, which formats an SQL query by replacing ? placeholders with the respective bind values, is now smarter: numbers are passed to the query as they are, without being wrapped in single quotes as if they were strings (though users can wrap them if necessary), and if you pass a Date object, it is correctly converted into a CUBRID-compatible DATETIME string which can be stored in DATE, DATETIME, and TIMESTAMP columns.
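A rough sketch of how the new API might be used, based on the descriptions above; the connection property names, table name, and callback signatures here are assumptions for illustration, not taken from the driver's documentation:

// Illustrative sketch only; property names and callback signatures are assumed.
var CUBRID = require('node-cubrid');

var client = CUBRID.createConnection({   // new node-mysql-style alias
  host: 'localhost',
  port: 33000,
  user: 'public',
  password: '',
  database: 'demodb',
  connectionTimeout: 2000                // new in 2.1.0: connection timeout
});

client.connect(function (err) {
  if (err) throw err;

  // query() now accepts bind values for ? placeholders: numbers stay
  // unquoted and Date objects become CUBRID-compatible DATETIME strings.
  client.query(
    'SELECT * FROM some_table WHERE id = ? AND created_at < ?',  // hypothetical table
    [42, new Date()],
    function (err, result) {
      if (err) throw err;
      console.log(result);
      client.close(function () {});      // end() is now an alias for close()
    }
  );
});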
Performance Improvement through Major Code Refactoring

As I mentioned at the beginning of this post, the core objective of this release is to refactor the driver to improve its performance. We performed several pieces of refactoring work in this release:

- Major refactoring of the buffer parsers which handle server responses.
- Major refactoring of the protocol packet writers to optimize the work with the Buffer.
- Complete replacement of the query queueing logic.

Buffer parsers refactoring

Prior to this 2.1.0 release, node-cubrid had a lot of duplicate code. All functions which initiate network communication with the CUBRID Server used to implement almost the same functionality to read data from the socket and prepare it for parsing. Though each function does need separate logic to parse the buffer data, some functionality can be abstracted, such as reading bytes from the socket and performing the basic preparation of the data before it is passed to the parser.

There is one more thing we have improved in the buffer parsers: the work with instances of the Node.js Buffer class. The idea is that memory allocation through Buffer is not as fast as through the local heap of V8 (the JavaScript engine Node.js runs on top of). Moreover, resizing an existing buffer is quite expensive. There is a great, inspirational video by Trevor Norris from Mozilla, Working with Node.js Buffers and Streams [Youtube], which covers this topic. I highly recommend watching it.

Prior to 2.1.0, whenever node-cubrid received a chunk of data from the server, it concatenated this chunk to the main buffer object of that particular request. Since we do not know the potential size of the incoming stream of data, we cannot preallocate enough buffer memory to avoid buffer resizing. This resulted in constant creation, resizing and copying of buffers upon every arrival of data from the server.

In node-cubrid 2.1.0 we have resolved both of these issues: we refactored the buffer parsers completely to remove any code duplication, and improved the work with the Buffer. Now all functions which initiate network communication use the same API to read data from the socket and prepare it for parsing. To resolve the second issue, we started leveraging a feature of the CUBRID communication protocol: when the server sends a data packet to a client, the first four bytes (the length of an integer type) represent the length of the incoming data packet. Thus, after receiving the first piece of the data packet, we know how many bytes will be necessary to keep the entire packet in a single Buffer instance. Knowing this value, we can create a Buffer instance with enough memory and read the remaining data from the pipe into this Buffer instance. This way we avoid buffer resizing completely. A minimal sketch of this idea is shown below.
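The following is a minimal sketch of the length-prefix idea described above; it is illustrative only, not the actual node-cubrid reader, and it ignores edge cases such as the first chunk being shorter than four bytes or several packets arriving in one chunk:

// Illustrative sketch of preallocating the response buffer from the
// 4-byte length prefix; not the actual node-cubrid implementation.
function readPacket(socket, onPacket) {
  var expectedLength = null;  // total packet size from the 4-byte prefix
  var packet = null;          // allocated once, never resized
  var offset = 0;

  socket.on('data', function (chunk) {
    if (expectedLength === null) {
      expectedLength = chunk.readInt32BE(0);    // first 4 bytes = packet length
      packet = Buffer.alloc(expectedLength);    // preallocate the whole packet
      chunk.copy(packet, 0, 4);                 // copy the rest of this chunk
      offset = chunk.length - 4;
    } else {
      chunk.copy(packet, offset);
      offset += chunk.length;
    }

    if (offset >= expectedLength) {
      onPacket(packet);                         // hand the full packet to the parser
    }
  });
}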
Refactoring the protocol packet writers

Just like the Buffer Readers (or Packet Readers), node-cubrid has Buffer Writers, which we call Packet Writers. Since the Packet Writers also write into a buffer and send it over the wire, the same rule applies: we needed to optimize how the writers work with the Buffer. Unlike when reading data sent by the server, when writing data into a buffer to send to the server we know the exact length of this data. So why not create a buffer with enough memory in advance? That is what we did: in 2.1.0 we create a Buffer instance only once per request and write the entire payload into it, thus avoiding buffer resizing.

Refactoring the query queueing logic

The third major refactoring affected how query queueing works in node-cubrid. We introduced query queueing in version 2.0.0. At the time it was implemented with a JavaScript timer: every X period the internal logic would check whether there was a query the user wanted to run. Obviously this created three problems for us. First, it forced us to manage the queue and check whether there was anything in it. Second, there was a potential loss of time between queries: imagine the case when query A starts at time X and completes at X + 60, while the timer will check the queue only at X + 100. Why lose these precious 40 units of time? The third issue was that a user could get confused and call the original query()/execute() functions instead of the dedicated addQuery()/addNonQuery() queue-related functions, which would result in an error telling them that another query is already running.

To address these issues, we have completely replaced the old queueing logic with a new, more efficient and lightweight one which does not incur any delay. To leverage the new queueing mechanism we had to process all user requests which initiate network communication with the server through a queue. Prior to 2.1.0, only queries added with the addQuery()/addNonQuery() functions were processed by the queue. Now everything is processed by the queue, even requests to close a query, commit the transaction or fetch more data. Processing all requests through one common queueing mechanism allows us to avoid "another query is already running" errors altogether: since everything goes through the queue, there is no way two requests can run at the same time within one connection. The new queue processes a pending request the moment the previous one completes, so there is no delay in executing queries. Moreover, this removes the headache of managing and checking a timer. The query queueing logic is now very efficient: if you check the source of the Queue.js module, you will notice that it is only 48 lines of code, including comments and empty lines.

Other minor refactoring

There is some other minor refactoring work we have done as well. For example, JavaScript provides two very convenient functions, call() and apply(), which you can use to alter the reference to the current context of a function. However, if you use them too often, performance suffers. Previously node-cubrid had quite a few usages of these functions; now we do not call them at all, for the sake of better performance. We have also refactored most functions by removing unnecessary and repetitive logic, and gathered all test suites into a single test directory.

Soon we will publish performance test results comparing the previous 2.0.2 release with this new 2.1.0 release. To see the full list of changes, refer to the CHANGELOG.md file. If you have any questions or feature requests, feel free to leave your comments below or create an issue on GitHub.
Posted over 11 years ago by Esen Sagynov
On behalf of the CUBRID team I am happy to announce the new 8.4.4 version of the CUBRID open source relational database. This release is mostly dedicated to stability improvements of the 8.4 branch, and has some minor behavioral changes which I will explain below. Thus, CUBRID 8.4.4 is fully compatible with the previous 8.4.3 version.

If you have been following our latest releases, you may have noticed that we maintain two branches, CUBRID 8.4.x and 9.x. As the 8.4.0 release introduced major changes and extended support for MySQL-compatible SQL, many users have started using it or migrated to it. In fact, there are more 8.4.x users than 9.x users at this moment. For this reason, we bring common bug fixes applied to the successor 9.x branch down to the 8.4.x branch. This way we make sure users of the 8.4.x branch have the latest improvements.

New Features

Since this is a stability improvement release, there are no new features other than SQL profiling, which is brought down from the upcoming 9.2 release.

SQL profiling

Prior to 8.4.4, users could obtain SQL profiling information to analyze query performance only in desktop GUI applications like CUBRID Manager or CUBRID Query Browser. Now CUBRID allows you to retrieve SQL profiling information directly from the drivers. To enable SQL profiling, execute SET TRACE ON before executing the query, then retrieve the query plan and statistics by executing the SHOW TRACE query. See the following example:

csql> SET TRACE ON;
csql> SELECT /*+ RECOMPILE */ o.host_year, o.host_nation, o.host_city, n.name, SUM(p.gold), SUM(p.silver), SUM(p.bronze)
        FROM OLYMPIC o, PARTICIPANT p, NATION n
        WHERE o.host_year = p.host_year AND p.nation_code = n.code AND p.gold > 10
        GROUP BY o.host_nation;
csql> SHOW TRACE;

  trace
======================
  Query Plan:
    SORT (group by)
      NESTED LOOPS (inner join)
        NESTED LOOPS (inner join)
          TABLE SCAN (o)
          INDEX SCAN (p.fk_participant_host_year) (key range: (o.host_year=p.host_year))
        INDEX SCAN (n.pk_nation_code) (key range: p.nation_code=n.code)

    rewritten query: select o.host_year, o.host_nation, o.host_city, n.[name], sum(p.gold), sum(p.silver), sum(p.bronze) from OLYMPIC o, PARTICIPANT p, NATION n where (o.host_year=p.host_year and p.nation_code=n.code and (p.gold> ?:0 )) group by o.host_nation

  Trace Statistics:
    SELECT (time: 1, fetch: 1059, ioread: 2)
      SCAN (table: olympic), (heap time: 0, fetch: 26, ioread: 0, readrows: 25, rows: 25)
        SCAN (index: participant.fk_participant_host_year), (btree time: 1, fetch: 945, ioread: 2, readkeys: 5, filteredkeys: 5, rows: 916) (lookup time: 0, rows: 38)
          SCAN (index: nation.pk_nation_code), (btree time: 0, fetch: 76, ioread: 0, readkeys: 38, filteredkeys: 38, rows: 38) (lookup time: 0, rows: 38)
      GROUPBY (time: 0, sort: true, page: 0, ioread: 0, rows: 5)
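Since the trace can now be retrieved directly from the drivers, here is a rough sketch of what this could look like from Node.js with node-cubrid; the function signatures used here are assumptions for illustration, not taken from the driver documentation:

// Illustrative sketch; the node-cubrid API usage below is assumed.
var CUBRID = require('node-cubrid');
var client = CUBRID.createCUBRIDConnection('localhost', 33000, 'public', '', 'demodb');

client.connect(function (err) {
  if (err) throw err;
  client.addNonQuery('SET TRACE ON', function (err) { if (err) throw err; });
  client.addQuery('SELECT COUNT(*) FROM participant WHERE gold > 10', function (err, result) {
    if (err) throw err;
  });
  client.addQuery('SHOW TRACE', function (err, result) {
    if (err) throw err;
    console.log(result);              // query plan and trace statistics
    client.close(function () {});
  });
});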
Improvements

CUBRID 8.4.4 contains many stability and usability improvements, including the following improvements to administrative functions.

Administrative Convenience
- One of the major improvements is the SQL profiling which I have already explained above.
- Added a command line utility to output the parameter information of running brokers or shards.
- Added the tranlist utility to display a list of active transactions across all CAS processes, with the ability to sort the output by a specified column.
- The killtran utility can now also remove transactions by SQL ID, and its -i option can be used to remove multiple transactions at once.
- Added an additional event log file to record query status information which affects query performance, such as SLOW_QUERY, MANY_IOREADS, LOCK_TIMEOUT, DEADLOCK, and TEMP_VOLUME_EXPAND.
- The cubrid_master log file now includes each node's information in the output of the HA status.

HA
- Added an option to change the failover method in an HA environment.

Sharding
- It is now possible to print the shard ID for a shard key on the command line by running the cubrid shard getid command.
- The cci_set_db_parameter function can be used in a SHARD environment.
- The password of a shard DB can also be specified with an environment variable.

Configuration
- Shard-related processes are started or terminated by using the "cubrid service" command.
- Added a parameter to adjust the maximum number of replication archive logs.
- Added a system parameter to specify transaction rollback when lock escalation occurs.

Performance
- The query plan is rewritten when the proper data volume is exceeded during re-execution of prepare for the same query statement.
- Statistical information of only the added index is updated.
- Reduced the time to fork new CAS processes.

Behavioral Changes

Some of the functionality in CUBRID 8.4.4 has been changed.

SQL
- When a column without a default value was added by using the ALTER ... ADD COLUMN statement, the PRIMARY KEY or NOT NULL constraint was violated because all values of the added column became NULL. This problem has been fixed.
- Fixed a failure to repeat INSERT in the array/batch execution function after executing PREPARE on a newly created table without committing it.

Driver
- In JDBC, the zero date of TIMESTAMP was changed from '0001-01-01 00:00:00' to '1970-01-01 00:00:00' (GMT) when the value of zeroDateTimeBehavior in the connection URL is "round".
- In the JDBC and CCI drivers, the query timeout is now applied to the whole batch processing function, not per query, when executing multiple queries at once.

Utility
- Made changes to the "cubrid broker status" command.

Configuration
- Changed the time- and capacity-related parameter names to allow a time or capacity unit to be set next to the parameter value.
- Changed the upper limit of the sort_buffer_size system parameter to 2G.
- Removed the OFF option from the KEEP_CONNECTION broker parameter.
- Changed the APPL_SERVER_MAX_SIZE_HARD_LIMIT broker parameter.
- Changed the default value of the SQL_LOG_MAX_SIZE broker parameter from 100MB to 10MB.

Other
- Asynchronous query mode is no longer supported.

Besides these improvements and behavioral changes, there are many bug fixes and minor improvements. For the full list, refer to the CUBRID 8.4.4 Release Notes.

What's Next

In a week or so we will release a new version of the 9.x branch, which will carry the 9.2 label. We will explain the new features of 9.2 in a separate post. If you have any questions, feel free to leave your comment below.
Posted over 11 years ago by Esen Sagynov
We are happy to announce that we have updated our CUBRID PHP and PDO drivers to roll out a load of improvements and bug fixes. The latest releases can be installed from the CUBRID and PDO_CUBRID PECL packages. Below is the list of improvements which landed in the cubrid-9.1.0.0004 and pdo_cubrid-9.1.0.0003 versions:

- Connection parameters support.
- Improved CUBRID database sharding support.
- Collection data types support.
- Removed the dynamic CUBRID CCI API dependency.
- Improved LOB data support.
- Improved documentation.
- Other minor improvements.

Connection parameters support

[APIS-580] Both the PHP and PDO drivers now support all connection parameters supported by the underlying CCI API. When connecting to a CUBRID Broker, you can pass various options, such as a list of alternative hosts, which can be used either in an HA environment to provide seamless failover or in a non-HA environment to provide driver-level load balancing. These are especially useful in a CUBRID SHARD environment. Here is an example in PHP:

$conn = cubrid_connect($host, $port, $db, $user, $password, $createNewLink, "altHosts=192.168.0.2:33000,192.168.0.3:33000&rcTime=600");

Alternatively, connect with a URL:

$conn = cubrid_connect_with_url("CUBRID:localhost:33000:demodb:user:password:?login_timeout=100&altHosts=192.168.0.2:33000,192.168.0.3:33000&rcTime=600");

Or in PDO:

$db = new PDO("cubrid:dbname=demodb;host=localhost;port=33000", $user, $password, array("altHosts"=>"192.168.0.2:33000","rcTime"=>"600"));

Improved CUBRID database sharding support

We fixed a bug in the PHP driver [APIS-562] which prevented DELETE queries from being executed by the cubrid_execute() function in a CUBRID SHARD environment.

Collection data types support

[APIS-555] The PHP and PDO drivers now provide native support for CUBRID collection data types such as SET, MULTISET, LIST and SEQUENCE. There are multiple ways to insert an array of values into a collection data type column.

Inserting an array of integers:

$sql_stmt_insert = "INSERT INTO set_tbl_int VALUES (?);";
$data = array(123,456,789);
$stmt = $dbh->prepare($sql_stmt_insert);
// Provide the data type of elements as a third parameter.
$ret = $stmt->bindParam(1, $data, PDO::PARAM_INT);
$ret = $stmt->execute();

Inserting an array of strings: to insert an array of strings, set the PDO data type to NULL. This will tell PDO not to perform any conversion.

$sql_stmt_insert = "INSERT INTO set_tbl_str VALUES (?);";
$data = array("abc","def","ghi");
$stmt = $cubrid_pdo->prepare($sql_stmt_insert);
// When inserting an array of strings, set the data type to NULL.
// This will tell PDO not to perform any conversion.
$ret = $stmt->bindParam(1, $data, PDO::PARAM_NULL);
$ret = $stmt->execute();

Alternatively, setting integer as the PDO data type will work as well; PDO will still keep the values as strings.

$sql_stmt_insert = "INSERT INTO set_tbl_str VALUES (?);";
$data = array("abc","def","ghi");
$stmt = $cubrid_pdo->prepare($sql_stmt_insert);
$ret = $stmt->bindParam(1, $data, PDO::PARAM_INT);
$ret = $stmt->execute();

The third way to insert an array of strings is to specify the CUBRID data type in the fifth parameter, as shown below.

$sql_stmt_insert = "INSERT INTO set_tbl_str VALUES (?);";
$data = array("abc","def","ghi");
$stmt = $cubrid_pdo->prepare($sql_stmt_insert);
$ret = $stmt->bindParam(1, $data, 0, 0, "varchar");
$ret = $stmt->execute();

Notice that PDO::PARAM_ARRAY cannot be used because PDO does not support it.
PDO::PARAM_STR should also not be used, because PDO will throw an error saying that an array to string conversion is not allowed.

Removed the dynamic CUBRID CCI API dependency

[APIS-559] Since this release, the CUBRID PHP and PDO drivers no longer have a dynamic dependency on the CUBRID CCI driver. Instead, we now compile the PHP and PDO drivers together with the CCI source code. This makes it possible to install the PHP and PDO drivers in environments where the CUBRID Server is not installed, and is one more step toward providing platform-independent PHP and PDO drivers.

Improved LOB data support

- [APIS-556] We fixed a LOB-related bug in the cubrid_fetch() function which tried to allocate 4 billion bytes.
- [APIS-364] Also fixed a bug which affected Windows users when trying to export LOB data from a database into which it had been successfully imported from a file.

Documentation improvements

- [APIS-553] Manual page for cubrid_close_request() should be edited.
- [APIS-554] cubrid_connect_with_url() example does not have an updated connection URL.
- [APIS-557] cubrid_lob2_seek64 example does not work properly.
- [APIS-558] Incorrect documentation.
- [APIS-584] Some 'Example' codes about the driver on php.net are somewhat confusing.
- [APIS-596] The 'test_lob' table which is not created initially may lead to some misunderstandings.

Other minor improvements

- [APIS-120] Return value is not correct when passing a date parameter to the cubrid_bind method.
- [APIS-121] Return value is not false when passing an invalid time to the cubrid_bind method.
- [APIS-358] Create table failed when using the default timestamp '2038-01-19 12:14:07'.
- [APIS-513] CAS error code is not as expected on Windows.
- [APIS-560] Cannot load PHP API 9.1.0.0003 in CentOS 5.6: undefined symbol: pthread_create in Unknown on line 0.
- [APIS-583] The cubrid_get_class_name() function throws a warning when selecting data from a system table.
- [APIS-586] cubrid_bind() doesn't work properly when inserting bit data.
- [APIS-587] cubrid_bind() doesn't support selecting bit data from a prepared statement.
- [APIS-549] Warnings during PHP/PDO API build via PECL.

What's next

In the next version we will focus on improving Mac OS X support; the current release was a preparation for this. After removing the dynamic CCI dependency, we can easily compile our PHP and PDO drivers on Mac OS X as well. Stay tuned for further updates!
Posted over 11 years ago by Esen Sagynov
On behalf of the CUBRID open source database project I would like to make a public announcement that we are looking for a Fedora Package Sponsor who would mentor us on submitting the CUBRID package to the Fedora repository. It has been three years since we created a Review Request at https://bugzilla.redhat.com/show_bug.cgi?id=658754. Recently we have received quite a few reviews, and we are now at the point where we need a sponsor to move on to the next stage.

So, if you know someone, or know someone who knows someone, who is a sponsor at Fedora and would be willing to sponsor us, we would appreciate it very much if you let us know by commenting on this post. For reference, the Packager sponsor responsibilities page states:

"Packager Sponsors are maintainers that have a good record of maintaining packages, doing reviews and assisting others with the processes and procedures of Fedora. Sponsors act as mentors for new contributors to help point them to areas they would like to contribute, assist them with processes and procedures and assist them when they need general guidance."
Posted over 11 years ago by Choo Yun-cheol
Today I would like to talk about Spanner, a NewSQL distributed relational database by Google. It can distribute and store data in data centers across the world, providing consistency as strong as an RDBMS while storing an amount of data that exceeds the capacity of a single data center. In this article I will briefly explain when the NewSQL trend began, then introduce Spanner, its features and architecture, how it performs data distribution and rebalancing, how it actually stores data, and finally how it provides data consistency. You will also learn about Google's TrueTime API, which is at the core of the Spanner distributed relational database.

NoSQL and NewSQL

NewSQL is on the rise. A wholly different database architecture, differentiated from NoSQL, is beginning to emerge. There are many reasons why NoSQL products have been popular. As there is a variety of NoSQL products, developed to serve different purposes, it is not easy to list their common features. However, as you can see in Hadoop or Cassandra, one of the main advantages of NoSQL is its horizontal scalability. Since these NoSQL products don't provide strong consistency, they cannot be used where high-level data consistency is required.

NewSQL has scalability as excellent as NoSQL, and at the same time it guarantees ACID as an RDBMS does on a single node. The term NewSQL was first used in 2011 by Matthew Aslett at the 451 Group, a corporate business IT research group. Figure 1 below shows the classification of NewSQL made by the 451 Group (it does not include information on Spanner, as it was drawn up in 2011).

Figure 1: Classification of RDBMS, NoSQL and NewSQL Made by the 451 Group. (Source: http://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/)

Of course, HBase also provides transactions in a limited form (transactions for a single row). However, not all business requirements can be met with such a limited transaction rule. nBase, which is developed by the Storage System Dev. Team here at NHN, is also a NewSQL database. Currently NAVER Mail, probably the most popular email service in Korea, uses nBase. nBase supports strong consistency from version 1.3.

What is Spanner?

Debuted [video] at the OSDI conference in 2012, Spanner is a NewSQL database created by Google. It is a distributed relational database that can distribute and store data in Google's BigTable storage system across multiple data centers. Spanner meets ACID (of course, it supports transactions) and supports SQL. Currently F1, Google's advertisement platform, uses Spanner; Gmail and Google Search will also use it soon. F1 has 5 data replicas in the U.S. to prevent the service from stopping even in the event of a natural disaster such as an earthquake, or when one or two data centers have a failure.

Spanner provides scalability that enables you to store a few trillion database rows in millions of nodes distributed across hundreds of data centers. When you read data, Spanner connects you to the data center that is geographically closest to you, and when you write data, it distributes and stores the data in multiple data centers. If the data center you try to access has a failure, of course, you can read the data from another data center that has a replica of it. The Spanner client automatically performs failover between replicas.
When the number of servers storing data changes or a piece of equipment fails, Spanner automatically re-distributes data by transferring it between machines. The following information on Spanner was obtained from Google's Spanner paper, and the figures below are also excerpted from the paper.

The Data Model of Spanner as a Semi-relational Data Model

Spanner's data model is semi-relational. It is called semi-relational because it has some features that differentiate it from normal relational databases. In Spanner, all tables must have at least one primary key. Spanner is a multi-version database that stores a version alongside the data in each column. It has evolved from the key-value store that maintains a version, like BigTable (and like HBase, too, which was also influenced by BigTable). Data is stored in semi-relational tables that have a schema, and each piece of data has a version, which is given in the form of a timestamp at commit time. In applications that use Spanner, you can also read data of past versions. Spanner supports general-purpose transactions and supports SQL as well.

Spanner provides strong consistency. It can read and write data consistently and provides globally consistent read operations at a specific timestamp. This functionality enables you to carry out consistent backups, MapReduce operations and atomic schema changes, even with ongoing transactions. This is possible because Spanner issues serialized commit timestamps to distributed transactions by using the TrueTime API.

Spanner Server Configuration

The universal set of Spanner is called a universe. A universe consists of multiple zones. A zone is a unit that can be run with physical independence. A data center may have one or more zones: if you want to store data separately in different server groups, you make two or more zones in a single data center. You can also create or remove a zone while the system is running.

Figure 2: Spanner Server Configuration.

Figure 2 above shows the configuration of servers in a universe. One zone consists of one zonemaster and hundreds or thousands of spanservers. The zonemaster allocates data to spanservers, while the spanservers actually store and process data. The location proxy is called by the client and shows in which spanserver the target data is stored. The universe master provides the status of all zones and debugging information, and the placement driver automatically transfers data between zones; by communicating with spanservers periodically, it checks whether there is any data to be moved due to a change in replication settings or for load balancing.

Spanserver Software Stack

Figure 3: Spanserver Software Stack.

Each spanserver manages 10 to 1000 data structures called tablets. A tablet is similar in concept to a BigTable tablet: it stores multiple mappings of the form (key:string, timestamp:int64) → string. The difference between a Spanner tablet and a BigTable tablet is that Spanner stores a timestamp together with the data. This means Spanner is not a simple key-value store but has the characteristics of a multi-version database. The status of a tablet is stored in the Colossus distributed file system (the successor of the Google File System) in the form of B-tree files and a write-ahead log (WAL).
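To make the (key, timestamp) → value mapping concrete, here is a small illustrative sketch of a multi-version store; this is my own simplification, not Spanner code:

// Illustrative sketch of a multi-version key-value mapping, where each
// key holds several timestamped versions and a read picks the newest
// version not exceeding the requested read timestamp.
class MultiVersionStore {
  constructor() {
    this.versions = new Map(); // key -> array of { timestamp, value }
  }

  write(key, timestamp, value) {
    if (!this.versions.has(key)) this.versions.set(key, []);
    this.versions.get(key).push({ timestamp, value });
    this.versions.get(key).sort((a, b) => a.timestamp - b.timestamp);
  }

  // Read the value as of readTimestamp (a snapshot read).
  readAt(key, readTimestamp) {
    const list = this.versions.get(key) || [];
    let result;
    for (const v of list) {
      if (v.timestamp <= readTimestamp) result = v.value;
    }
    return result;
  }
}

const store = new MultiVersionStore();
store.write('user:1:name', 10, 'Alice');
store.write('user:1:name', 20, 'Alicia');
console.log(store.readAt('user:1:name', 15)); // 'Alice'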
Spanner uses the Paxos state machine to support data replication among spanservers. Paxos is a protocol created for reliable operation in distributed environments. In a distributed environment you can have a failure at any time, and you are also not guaranteed to receive events from distributed nodes in the order of their occurrence; Paxos is used to resolve this type of reliability issue. One of the main issues handled by Paxos is deciding which node is the leader, which is needed for consistency in the process of data replication.

A spanserver has a transaction manager to support distributed transactions. The transaction manager is not involved in a transaction performed in a single Paxos group, but when a transaction spans multiple Paxos groups, one of the participant leaders is selected as the coordinator leader and coordinates the two-phase commit.

Directory

A directory is a set of contiguous keys that share the same prefix (you can think of it as a bucket). The Spanner paper says that "bucket" would be a more appropriate term, and that the term "directory" is used because they intended to keep using the old term. A directory is the unit of data allocation. All the data in a directory have identical replication settings, and the transfer of data between Paxos groups is also conducted in units of a directory, as shown in Figure 4 below.

Figure 4: Transfer of a Directory between Paxos Groups.

Spanner moves a directory to reduce the load of a certain Paxos group, to group together directories which are frequently accessed together, or to move a directory into a group that is geographically close to the client that accesses it. A directory can be transferred even while a client operation is in progress. Transfer of directories between Paxos groups is conducted in the background, and through this transfer work you can add or delete replicas in a Paxos group. To prevent blocking of in-progress read/write tasks during the transfer of a large amount of data, a directory transfer is not performed as a single transaction. Instead, only the start of the data transfer is registered in the background, and a transaction is used only at the end, when the remaining data is moved.

A directory is also the minimum unit of geographical replica allocation. The administrator can specify the number and types of replicas and their geographical allocation for each application. For example, you can configure settings to store 3 replicas of the data of User A in Europe, and 5 replicas of the data of User B in North America. If a directory grows too big, it can be split into multiple fragments; in this case, the unit of directory transfer and allocation among groups also becomes the fragment. A rough illustration of the key-prefix idea behind directories is shown below.
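The following sketch illustrates the idea that a directory is a set of keys sharing a common prefix; it is my own simplification with made-up table and key names, not Spanner's actual key encoding:

// Illustrative sketch: rows keyed hierarchically, grouped into a
// "directory" by their top-level (directory table) key prefix.
const rows = [
  { key: ['Users', 1],               value: { email: 'a@example.com' } },
  { key: ['Users', 1, 'Albums', 10], value: { name: 'Holiday' } },
  { key: ['Users', 1, 'Albums', 11], value: { name: 'Work' } },
  { key: ['Users', 2],               value: { email: 'b@example.com' } },
  { key: ['Users', 2, 'Albums', 20], value: { name: 'Pets' } },
];

// Group rows into directories by the key of the top-level table row.
function groupIntoDirectories(rows) {
  const directories = new Map(); // e.g. 'Users/1' -> rows in that directory
  for (const row of rows) {
    const prefix = row.key.slice(0, 2).join('/');
    if (!directories.has(prefix)) directories.set(prefix, []);
    directories.get(prefix).push(row);
  }
  return directories;
}

console.log(groupIntoDirectories(rows).get('Users/1').length); // 3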
Data Model

Spanner's data model features semi-relational tables with schemas, a query language expanded from SQL, and general-purpose transactions. An application can create one or more databases in a universe, and one database can have an unlimited number of tables. A table has rows and columns like an RDBMS table, but unlike an RDBMS table, each piece of data carries version information.

Figure 5: An Example of Spanner Schema.

Figure 5 above shows an example of a Spanner schema. In Spanner's schema definition language, you can express the hierarchical relationship among tables using the INTERLEAVE IN declaration. The top-level table in the hierarchical relationship is a directory table. According to the key defined in the directory table, the rows of its sub-tables are arranged in dictionary order and make up a directory. The ON DELETE CASCADE statement is used so that, when a row of a directory table is deleted, the data of the related sub-tables is deleted together with it. In the example, the Users table is specified as a directory table, and according to the value of the uid column, which is the primary key, data is divided and stored into different directories. Because the client specifies the hierarchical relationship among multiple tables, the database can achieve better performance when data is divided and distributed.

TrueTime API

TrueTime is an API that provides time-related information, and consists of the following methods:

TT.now() is used to get the current time. This method returns the current time as an interval, TTinterval: [earliest, latest], which takes the inaccuracy of the clock into account. The TTinterval returned by TT.now() guarantees that the absolute time at which TT.now() was called lies within that interval.

TT.after(t) returns true if the time is definitely after the specified time t, and false otherwise.

TT.before(t) returns true if the time is definitely before the specified time t, and false otherwise.

TrueTime gets time information from GPS and atomic clocks. It uses two different sources because either one of them could fail and become unable to provide time information. When you get time information through GPS, you may fail to receive it due to problems such as an antenna or reception problem or radio wave interference, while the time you get from an atomic clock may drift due to frequency errors.

TrueTime consists of multiple time master machines in each data center. Most masters have a GPS receiver equipped with an antenna; the masters without GPS, called Armageddon masters, are equipped with an atomic clock. All masters compare their time information with each other periodically, and in this process each master synchronizes its time by checking for error in its own clock. To reduce the risk of errors by masters, the timeslave daemon running on each machine polls multiple time masters, identifies any master that provides incorrect information, and adjusts its own clock. For this reason, the inaccuracy of the clock tends to increase from the moment the time is synchronized until the next synchronization. The time inaccuracy of Spanner therefore draws a sawtooth line going up and down between 1 ms and 7 ms per synchronization period: as the synchronization period is 30 seconds and the time error increases by 200 μs per second, the teeth account for 0-6 ms, and the other 1 ms comes from communication latency with the time masters.
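Below is a tiny illustrative model of the TrueTime interface described above. TrueTime is internal to Google, so this is only a toy: the uncertainty bound here is a made-up constant instead of one derived from GPS and atomic-clock masters.

// Illustrative model of the TrueTime API; not Google's implementation.
const EPSILON_MS = 7; // assumed worst-case clock uncertainty (up to ~7 ms)

const TT = {
  // Returns an interval [earliest, latest] guaranteed to contain "now".
  now() {
    const t = Date.now();
    return { earliest: t - EPSILON_MS, latest: t + EPSILON_MS };
  },
  // True only if the absolute time is definitely past t.
  after(t) {
    return TT.now().earliest > t;
  },
  // True only if the absolute time is definitely before t.
  before(t) {
    return TT.now().latest < t;
  },
};

console.log(TT.now());                   // e.g. { earliest: ..., latest: ... }
console.log(TT.after(Date.now() - 100)); // true: 100 ms ago has definitely passed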
Concurrency Control

Spanner provides three types of operations: read/write transactions, read transactions, and snapshot reads. A single write operation is performed through a read/write transaction, while a single read operation that is not a snapshot read is performed through a read transaction.

The Process of a Read/Write Transaction

A write operation executed in a read/write transaction is buffered on the client until commit; a read operation in the transaction is therefore not affected by the results of the write operations. Reads in a read/write transaction use the wound-wait method to avoid deadlock. The client acquires a read lock from the leader replica of the appropriate group and reads the latest data. To prevent a timeout while the transaction is open, the client sends keepalive messages to the participant leaders.

When the client has completed all reads and finished buffering the writes, the two-phase commit starts. The client selects a coordinator group and sends a commit message to all participant leaders. The commit message contains the identity of the coordinator and the buffered writes. Leaders other than the coordinator acquire write locks, prepare a timestamp larger than any timestamps they have allocated to previous transactions, log a prepare record through Paxos, and then send their prepare timestamp to the coordinator.

The coordinator leader also acquires write locks, but skips the prepare phase. After receiving timestamps from all the participant leaders, it selects a commit timestamp for the transaction. The commit timestamp must be equal to or greater than all the prepare timestamps received from the participant leaders, greater than TT.now().latest at the time the coordinator received the commit message, and greater than any timestamp the leader has allocated to previous transactions. The coordinator leader then logs the commit record through Paxos.

Before applying the commit record to the coordinator replica, the coordinator leader waits until TT.after(commit timestamp) becomes true, to ensure that the time specified by the commit timestamp has definitely passed. After that, the coordinator sends the commit timestamp to the client and all the participant leaders. The participant leaders that received the commit timestamp log the outcome of the transaction through Paxos, apply the same timestamp, and then release their locks. A sketch of the coordinator's timestamp selection and commit wait is shown after the figure below.

Figure 6: The Process of the Two-Phase Commit of a Read/Write Transaction.
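Here is an illustrative sketch of the commit-timestamp selection and commit-wait rule described above; it is a simplification, not Spanner code, and it embeds a compact copy of the toy TrueTime model from the previous sketch:

// Illustrative sketch of coordinator-side commit-timestamp selection and
// commit wait. EPSILON_MS is a made-up uncertainty bound, as above.
const EPSILON_MS = 7;
const TT = {
  now: () => ({ earliest: Date.now() - EPSILON_MS, latest: Date.now() + EPSILON_MS }),
  after: (t) => Date.now() - EPSILON_MS > t,
};

async function chooseCommitTimestampAndWait(prepareTimestamps, commitMsgArrival, lastAssignedTs) {
  // The commit timestamp s must be:
  //   >= every prepare timestamp from the participant leaders,
  //   >  TT.now().latest at the moment the commit message arrived,
  //   >  any timestamp this leader assigned to earlier transactions.
  const s = Math.max(...prepareTimestamps, commitMsgArrival.latest + 1, lastAssignedTs + 1);

  // Commit wait: do not apply the commit record until the absolute time
  // has definitely passed s, i.e. until TT.after(s) becomes true.
  while (!TT.after(s)) {
    await new Promise((resolve) => setTimeout(resolve, 1));
  }
  return s; // safe to apply the commit and release locks
}

// Example: prepare timestamps from two participant groups.
chooseCommitTimestampAndWait([Date.now(), Date.now() + 2], TT.now(), Date.now() - 5)
  .then((s) => console.log('commit timestamp', s));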
The Process of a Read Transaction

Since a read transaction is executed without locking once a read timestamp is determined, a write arriving in the middle of a read is not blocked. A read transaction is executed in two phases: first a read timestamp is determined, and then the read is executed against the snapshot at that read timestamp. To execute a read transaction, you need a scope expression that summarizes the range of keys to be read by the transaction. If the scope can be served within a single Paxos group, the client requests the read transaction from the group leader, and the Paxos leader determines a timestamp for the read transaction and executes the read; for a read within a single group, the value of LastTS(), the timestamp of the last committed write in the Paxos group, is used as the read timestamp instead of TT.now().latest. If the scope spans multiple Paxos groups, the client uses TT.now().latest as the read timestamp: it waits until TT.after() of that value becomes true, confirming the timestamp has passed, and then sends the read to all replicas.

Schema Change Transaction

Spanner also supports atomic schema changes by using TrueTime. As there may be millions of groups requiring a schema change, it is almost impossible to change schemas using normal transactions. BigTable supports atomic schema changes within a data center, but during the change all operations are blocked. Spanner, however, can execute a schema change without blocking by using a special transaction. First, a point of time in the future is explicitly chosen and registered as the schema change timestamp. This way the schema change can be performed without affecting ongoing tasks. Read/write tasks related to the schema change synchronize with the registered schema change timestamp: if a task's timestamp is earlier than the schema change timestamp, the task proceeds first; if it is later than the schema change timestamp, it is blocked.

Conclusion

Spanner has blended and developed the ideas of two different research communities. From the database community, Spanner took the familiar, easy-to-use semi-relational interface, transactions and an SQL-based query language. From the systems community, it took scalability, automatic segmentation, failure resistance, consistent data replication and wide-area distribution. Thanks to five years of development effort, Spanner has gained the critical functionality of a database under globally distributed environments, which had been impossible in BigTable. Another key piece of Spanner is TrueTime: by expressing the inaccuracy of time explicitly in the time API, TrueTime makes it possible to build distributed systems on accurately synchronized time.

By Choo Yun-cheol, Senior Software Engineer at Storage System Dev. Team, NHN Corporation.
Posted over 11 years ago by Choo Yun-cheol
Today I would like to talk about Spanner, a NewSQL distributed relational database by Google. It can distribute and store data in data centers across the world, provide consistency that is as excellent as in RDBMS while enabling to store an amount of ... [More] data that exceeds the capacity of a single data center. In this article I will briefly explain when the NewSQL trend has begun, then will introduce Spanner, its features and architecture, how it performs data distribution and rebalancing, how it actually stores the data, and finally how it provides data consistency. You will also learn about Google's TrueTime API which is at the core of Spanner distributed relational database. NoSQL and NewSQL NewSQL is on the rise. A wholly different database architecture, differentiated from NoSQL, is beginning to emerge. There are many reasons why NoSQL products have been popular. As there are a variety of NoSQL products, and they have been developed to serve different purposes, it is not easy to list their common features. However, as you can see in Hadoop or Cassandra, one of the main advantages of NoSQL is its horizontal scalability. As these NoSQL products don't provide Strong Consistency, they cannot be used where high-level data consistency is required.  NewSQL has as excellent scalability as NoSQL, and at the same time it guarantees ACID like RDBMS which is performed in a single node. The term NewSQL was first used in 2011 by Matthew Aslett at the 451 Group, a corporate business IT research group. Figure 1 below shows the classification of NewSQL that was made by the 451 Group (it does not include information on Spanner, as it was drawn up in 2011). Figure 1: Classification of RDBMS, NoSQL and NewSQL Made by the 451 Group. (Source http://blogs.the451group.com/information_management/2011/04/15/nosql-newsql-and-beyond/) Of course, HBase also provides transactions in a limited form (transaction for a single row). However, not all business requirements can be met with such a limited transaction rule. nBase, which is developed by Storage System Dev. Team here at NHN, is also a NewSQL. Currently, NAVER Mail, probably the most popular email service provider in Korea, uses nBase. nBase supports Strong Consistency from version 1.3. What is Spanner? Debuted [video] at the OSDI Conference in 2012, Spanner is a NewSQL created by Google. It is a distributed relational database that can distribute and store data in Google's BigTable storage system in multiple data centers. Spanner meets ACID (of course, it supports transaction) and supports SQL. Currently, F1, Google's advertisement platform, uses Spanner. Gmail and Google Search will also use it soon. F1 has 5 data replicas in the U.S. to prevent the service from stopping even in the event of a natural disasters such as an earthquake, or when one or two data centers have a failure. Spanner provides the scalability that enables you to store a few trillion database rows in millions of nodes distributed to hundreds of data centers. When you read data, Spanner connects you to the data center that is geographically closest to you, and when you write data, it distributes and stores it to multiple data centers. If the data center you try to access has a failure, of course, you can read the data from another data center that has a replica of the data.  The Spanner client automatically performs a failover between replicas. 
When the number of servers storing data changes or a machine fails, Spanner automatically redistributes data by transferring it between machines. The following information on Spanner comes from Google's Spanner paper, and the figures below are also excerpted from that paper.

Spanner's Semi-relational Data Model

Spanner's data model is semi-relational. It is called semi-relational because it has some features that differentiate it from normal relational databases. In Spanner, every table must have at least one primary key. Spanner is a multi-version database that attaches a version to the data stored in each column. It has evolved from versioned key-value stores like BigTable (and HBase, which was also influenced by BigTable). Data is stored in semi-relational tables that have schemas, and each piece of data has a version, given in the form of a time stamp at commit time. Applications that use Spanner can also read data at past versions. Spanner supports general-purpose transactions and supports SQL as well. Spanner provides Strong Consistency: it can read and write data consistently and provides globally consistent reads at a specific time stamp. This enables consistent backups, MapReduce operations and atomic schema changes, even with ongoing transactions. This is possible because Spanner issues serialized commit time stamps to distributed transactions by using the TrueTime API.

Spanner Server Configuration

A Spanner deployment as a whole is called a universe. A universe consists of multiple zones. A zone is a unit that can be run with physical independence. A data center may have one or more zones; if you want to store data separately in different server groups, you create two or more zones in a single data center. Zones can also be added to or removed from a running system.

Figure 2: Spanner Server Configuration.

Figure 2 above shows the configuration of servers in a universe. One zone consists of one zonemaster and hundreds or thousands of spanservers. The zonemaster allocates data to spanservers, while the spanservers actually store and process data. The location proxy is called by clients to find out which spanserver stores the target data. The universe master provides the status of all zones and debugging information, and the placement driver automatically transfers data between zones, periodically communicating with spanservers to determine whether any data needs to be moved because of changed replication settings or for load balancing.

Spanserver Software Stack

Figure 3: Spanserver Software Stack.

Each spanserver manages between 10 and 1000 data structures called tablets. A tablet is similar in concept to a BigTable tablet: it stores multiple mappings of the form (key:string, timestamp:int64) -> string. The difference between a Spanner tablet and a BigTable tablet is that Spanner stores a time stamp together with the data, which is what makes Spanner not a simple key-value store but a database with multi-version characteristics. The state of a tablet is stored in the Colossus distributed file system (the successor of Google File System) in the form of B-tree-like files and a write-ahead log (WAL). Spanner uses a Paxos state machine to support data replication among spanservers. Paxos is a protocol created for reliable operation in distributed environments.
In a distributed environment a failure can occur at any time, and you are not guaranteed to receive events from distributed nodes in the order in which they occurred. Paxos is used to resolve this type of reliability issue. One of the main issues handled by Paxos is deciding which node is the leader; this is needed for consistency in the process of data replication. A spanserver also has a transaction manager to support distributed transactions. The transaction manager is not involved in a transaction performed within a single Paxos group, but when a transaction spans multiple Paxos groups, one of the participant leaders is selected as the coordinator leader and coordinates the two-phase commit.

Directory

A directory is a set of contiguous keys that share a common prefix (you can think of it as a bucket). The Spanner paper notes that "bucket" would be a more appropriate term and that "directory" is kept only for historical reasons. A directory is the unit of data allocation. All the data in a directory has identical replication settings, and data is transferred between Paxos groups directory by directory, as shown in Figure 4 below.

Figure 4: Transfer of a Directory between Paxos Groups.

Spanner moves a directory to reduce the load on a particular Paxos group, to co-locate directories that are frequently accessed together, or to place a directory in a group that is geographically close to the clients accessing it. A directory can be transferred even while client operations are in progress. Transfer of directories between Paxos groups is conducted in the background, and this transfer mechanism is also how replicas are added to or removed from a Paxos group. To avoid blocking in-progress reads and writes while a large amount of data is transferred, a directory transfer is not performed as a single transaction. Instead, the start of the transfer is registered and the bulk of the data is moved in the background; a transaction is used only at the end, when the small remaining amount of data is moved. A directory is also the minimum unit of geographical replica placement. The administrator can specify the number and types of replicas and their geographical placement for each application. For example, you can configure settings to store 3 replicas of the data of User A in Europe, and 5 replicas of the data of User B in North America. If a directory grows too big, it can be split into multiple fragments; in that case, the unit of transfer and allocation among groups is the fragment.

Data Model

Spanner's data model features semi-relational tables with schemas, a query language extended from SQL, and general-purpose transactions. An application can create one or more databases in a universe, and a database can contain an unlimited number of tables. A table has rows and columns like an RDBMS table, but unlike an RDBMS table, each piece of data carries version information.

Figure 5: An Example of Spanner Schema.

Figure 5 above shows an example of a Spanner schema. In Spanner's schema definition language, you can express the hierarchical relationship among tables using the INTERLEAVE IN declaration. The top-level table in the hierarchy is a directory table. The rows of its descendant tables that share the key defined in the directory table are arranged beneath it in lexicographic order, and together they make up a directory.
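To make this concrete, here is a minimal sketch adapted from the schema example published in the Spanner paper; the Albums sub-table and the exact syntax are illustrative assumptions, not a reproduction of Figure 5.

-- Users is the directory table; rows of Albums that share a uid are
-- interleaved (stored) under the corresponding Users row.
CREATE TABLE Users {
  uid   INT64 NOT NULL,
  email STRING
} PRIMARY KEY (uid), DIRECTORY;

CREATE TABLE Albums {
  uid  INT64 NOT NULL,
  aid  INT64 NOT NULL,
  name STRING
} PRIMARY KEY (uid, aid),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

In this sketch, all Users and Albums rows with a given uid form one directory, which is the unit Spanner moves between Paxos groups.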
The ON DELETE CASCADE clause specifies that when a row of a directory table is deleted, the related rows of its sub-tables are deleted together. In the example, the Users table is specified as a directory table, and data is split into different directories according to the value of the uid column, its primary key. Because the client declares the hierarchical relationship among tables, the database can achieve better performance when data is partitioned and distributed.

TrueTime API

TrueTime is an API that provides time-related information. It consists of the following methods:

TT.now() returns the current time in the form of an interval, TTinterval: [earliest, latest], which takes the inaccuracy of time into account. The TTinterval returned by TT.now() is guaranteed to contain the absolute time at which TT.now() was called.

TT.after(t) returns true if the time is definitely after the specified time t, and false otherwise.

TT.before(t) returns true if the time is definitely before the specified time t, and false otherwise.

TrueTime gets time information from GPS and from atomic clocks. It uses two different sources because either one could fail and become unable to provide time information. Time obtained through GPS may be unavailable because of antenna or receiver problems or radio interference, and the time from an atomic clock can drift because of frequency error. TrueTime is implemented with multiple time master machines in each data center. Most masters have a GPS receiver with an antenna; the masters without GPS, called Armageddon masters, are equipped with an atomic clock. All masters periodically compare their time information against each other, and in this process each master also checks its own clock for error. To reduce the risk of faulty masters, a timeslave daemon on each machine polls multiple time masters, rejects any master that provides incorrect information, and adjusts its own clock accordingly. Because of this design, clock uncertainty grows from one synchronization until the next. Spanner's time uncertainty therefore traces a sawtooth that rises and falls between about 1 ms and 7 ms over each synchronization period: the period is 30 seconds and the assumed drift is 200 μs per second, so the sawtooth contributes 0 to 6 ms, and the remaining 1 ms comes from communication latency to the time masters.

Concurrency Control

Spanner provides three types of operations: read/write transactions, read transactions, and snapshot reads. A single write operation is performed through a read/write transaction, while a single read operation that is not a snapshot read is performed through a read transaction.

The Process of a Read/Write Transaction

A write operation executed in a read/write transaction is buffered on the client until commit; a read operation in the transaction is therefore not affected by the results of its own writes. Reads in a read/write transaction use wound-wait to avoid deadlock. The client acquires a read lock from the leader replica of the appropriate group and reads the latest data.
To prevent a timeout while the transaction remains open, the client sends keepalive messages to the participant leaders. When the client has completed all reads and buffered all writes, it starts the two-phase commit. The client chooses a coordinator group and sends a commit message to all participant leaders; the commit message identifies the coordinator and carries the buffered writes. Leaders other than the coordinator acquire write locks, choose a prepare time stamp larger than any time stamp they have assigned to previous transactions, log a prepare record through Paxos, and then send their prepare time stamp to the coordinator. The coordinator leader also acquires write locks but skips the prepare phase. After receiving the time stamps from all participant leaders, it chooses a time stamp for the transaction. The commit time stamp must be greater than or equal to all of the prepare time stamps received from the participant leaders, greater than TT.now().latest at the moment the coordinator received the commit message, and greater than any time stamp the leader has assigned to previous transactions. The coordinator leader then logs the commit record through Paxos. Before applying the commit record to the coordinator replica, the coordinator leader waits until TT.after(commit time stamp) becomes true, ensuring that the time denoted by the time stamp has passed. After that, the coordinator sends the commit time stamp to the client and all the participant leaders. The participant leaders that receive the commit time stamp log the outcome of the transaction through Paxos, apply the same time stamp, and then release their locks.

Figure 6: The Two-Phase Commit of a Read/Write Transaction.

The Process of a Read Transaction

Because a read transaction executes without locks once a read time stamp has been determined, writes that arrive during the read are not blocked. A read transaction is executed in two phases: first a read time stamp is determined, and then the read is executed against the snapshot at that time stamp. To execute a read transaction, you need a scope expression that summarizes the keys the transaction will read. If the scope can be served by a single Paxos group, the client sends the read transaction to that group's leader; the Paxos leader assigns a time stamp to the read transaction and executes the read. For a read within a single group, the value of LastTS(), the time stamp of the last committed write in that Paxos group, is used as the read time stamp instead of TT.now().latest. If the scope spans multiple Paxos groups, the client uses TT.now().latest as the read time stamp, waits until TT.after() of that time stamp becomes true, confirming that the time has passed, and then sends the reads of the transaction to all replicas.

Schema Change Transaction

Spanner also supports atomic schema change by using TrueTime. As there may be millions of groups requiring a schema change, it is practically impossible to change schemas using normal transactions. BigTable supports atomic schema change within a data center, but it blocks all operations while the change is in progress. Spanner, however, can execute a schema change without blocking by using a special transaction.
First, a time stamp at an explicit point in the future is chosen and registered for the schema change. This allows the schema change to proceed without affecting ongoing work. Reads and writes that depend on the schema synchronize with the registered schema-change time stamp: a task whose time stamp is earlier than the schema-change time stamp may proceed ahead of it, while a task whose time stamp is later is blocked until the change completes.

Conclusion

Spanner has blended and developed the ideas of two different research communities. From the database community it took the familiar, easy-to-use semi-relational interface, transactions, and an SQL-based query language. From the systems community it took scalability, automatic sharding, fault tolerance, consistent data replication, and wide-area distribution. Thanks to five years of development, Spanner has gained critical database capabilities that had been impossible to provide on BigTable in a globally distributed environment. Another key piece of Spanner is TrueTime, which expresses the uncertainty of time explicitly in its time API and thereby provides a foundation of accurately synchronized time for the distributed system.

By Choo Yun-cheol, Senior Software Engineer at Storage System Dev. Team, NHN Corporation.
Posted over 11 years ago by Kwon Donghun
In this article I would like to introduce you to Cloudera Impala, an open-source system that provides real-time SQL querying on top of Hadoop. I will quickly go over when and how Impala was created, then explain Impala's architecture, its advantages and drawbacks in more detail, and compare it to Pig and Hive. You will also learn how to install, configure and run Impala. At the end of this article I will show the performance test results I obtained when comparing Impala with Hive, Infobright, infiniDB, and Vertica.

Analyzing results in real time through Hadoop

With the advent of Hadoop in 2006, big data analytics was no longer a task that only a few companies or groups could perform. Because Hadoop was an open-source framework, many companies and groups that needed big data analytics could use it easily and at low cost. In other words, big data analytics became a universal technology. The core of Hadoop is the Hadoop Distributed File System (HDFS) and MapReduce. Hadoop stores data in HDFS, a distributed file system whose capacity can be expanded by adding nodes, runs MapReduce operations over the stored data, and so produces the required results. There is no limit to users' needs, however. The Hadoop user community kept trying to overcome Hadoop's limits in functionality and performance and to develop it further. Complaints focused on the use of MapReduce, which has two main disadvantages: it is very inconvenient to use, and its processing is slow. To address the inconvenience of using MapReduce, platforms such as Pig and Hive appeared in 2008. Pig and Hive are sub-projects of Hadoop (Hadoop is also an ecosystem of multiple platforms; a variety of products based on Hadoop have been created). Both Pig and Hive provide a high-level language, but Pig's is procedural while Hive's is declarative and similar to SQL. With the advent of Pig and Hive, Hadoop users could conduct big data analytics more easily. However, as Hive and Pig only improve the data retrieval interface, they cannot accelerate the underlying analytics work: internally, both still use MapReduce. This is why HBase, a column-oriented NoSQL database, appeared. HBase, which enables faster input/output of key/value data, finally provided Hadoop-based systems with an environment in which data could be processed in real time. This progress of the Hadoop ecosystem was greatly influenced by Google: HDFS was implemented on the basis of Google's GFS papers, and HBase appears to have been based on Google's BigTable papers. Table 1 below shows this relationship of influence.

Table 1: Google Gives Us A Map (source: Strata + Hadoop World 2012 Keynote: Beyond Batch - Doug Cutting)

Google Publication        Hadoop                     Characteristics
GFS & MapReduce (2004)    HDFS & MapReduce (2006)    Batch Programs
Sawzall (2005)            Pig & Hive (2008)          Batch Queries
BigTable (2006)           HBase (2008)               Online key/value
Dremel (2010)             Impala (2012)              Online Queries
Spanner (2012)            ????                       Transactions, Etc.

Cloudera's Impala, introduced in this article, was also created under the influence of Google; it is based on Google's Dremel paper, published back in 2010. Impala is an open-source, Apache-licensed system that provides interactive, real-time SQL queries on HDFS. SQL is very familiar to many developers and can express data manipulation and retrieval concisely.
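As a small illustration of that conciseness, the hypothetical Hive-SQL statement below computes page views per URL for a single day; the same job written directly against MapReduce would need mapper, reducer and job-configuration code. The table and column names (access_log, url, pv_date) are invented for this sketch.

-- Hypothetical example: page views per URL for one day.
-- Table and column names are illustrative only.
SELECT url, COUNT(*) AS pv
FROM access_log
WHERE pv_date = '2013-07-01'
GROUP BY url
ORDER BY pv DESC
LIMIT 10;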
As Impala supports SQL and provides real-time big data processing, it has the potential to be used as a business intelligence (BI) system; some BI vendors are said to have already launched BI development projects using Impala. The ability to get real-time analytics results by using SQL makes the prospects of big data brighter and extends the application scope of Hadoop.

Cloudera Impala

Cloudera, which created Impala, said it had been technically inspired by Google's Dremel paper, which made them think it would be possible to perform real-time, ad-hoc queries in Apache Hadoop. In October 2012, when announcing Impala, Cloudera introduced it as follows: "Real-Time Queries in Apache Hadoop, For Real". Impala adopted Hive-SQL as its interface. As mentioned above, Hive-SQL is similar in syntax to SQL, a widely used query language, so users can access data stored in HDFS in a very familiar way. Because Impala shares Hive-SQL with Hive, the same data can be accessed in the same way from either system. However, not all of Hive-SQL is supported by Impala, so it is more accurate to say that the Hive-SQL statements usable in Impala can also be used in Hive. The difference between Impala and Hive is whether queries run in real time: while Hive uses MapReduce for data access, Impala uses its own distributed query engine, installed on every data node in the cluster, to minimize response time. This is why Impala and Hive show distinctly different response times on the same data. Cloudera gives the following three reasons for Impala's good performance:

Impala reduces CPU load compared to Hive and can increase IO bandwidth by the same margin; this is why Impala shows 3-4 times better performance than Hive on purely IO-bound queries.

When a query becomes complex, Hive must run multi-stage MapReduce jobs or reduce-side joins. For queries that cannot be processed efficiently through a MapReduce framework (for example, a query that contains at least one join), Impala shows 7 to 45 times better performance than Hive.

If the data blocks being analyzed are in the file cache, Impala is faster still, performing 20 to 90 times better than Hive.

Real-time in Data Analytics

The term "real time" is emphasized in introductions of Impala. But one may naturally ask, "How much time is real time?" On this question, Doug Cutting (the creator of Hadoop) and a senior architect at Cloudera have given their views of what real time means: "When you sit and wait for it to finish, that is real time. When you go for a cup of coffee or even let it run overnight, that's not real time." and "'real-time' in data analytics is better framed as 'waiting less.'" Although it would be difficult to define the criteria for real time numerically, if you can wait for the result while watching your monitor, that may be called real time.

Impala Architecture

Impala is composed largely of impalad and the impala state store. Impalad is a process that functions as a distributed query engine: it plans queries and processes them on the data nodes in the Hadoop cluster. The impala state store process maintains metadata about the impalads running on each data node; when an impalad process is added to or removed from the cluster, the metadata is updated through the impala state store process.

Figure 1: Impala High-level Architectural View.
Data Locality Tracking and Direct Reads

In Impala, the impalad processes execute queries on all data nodes in the cluster instead of MapReduce, Hadoop's traditional analytic framework. The advantages of this configuration include data locality and direct reads: each impalad processes only the data blocks on the data node to which it belongs and reads them directly from the local directory. Through this, Impala minimizes network load and can also benefit from the file cache.

Scale Out

Impala scales out horizontally like a Hadoop cluster. In general, Impala is expanded when the cluster is expanded: when a data node is added, all you have to do is run the impalad process on the new server (metadata for the added impalad is updated through the impala state store). This is very similar to databases based on massively parallel processing (MPP).

Failover

Impala analyzes data stored in Hive and HBase, and HDFS, which Hive and HBase use, provides a certain level of failover through replication. For this reason, Impala can execute queries as long as a replica of the data block and at least one impalad process exist.

Single Point of Failure (SPOF)

One of the big concerns about all systems that use HDFS as a storage medium is that the namenode is a SPOF (we discussed this in detail when comparing HDFS with other distributed file systems in our previous article, Overview and Recommendations for Distributed File Systems). Some solutions have recently been released, but fundamentally resolving the problem is still a distant goal. In Impala, too, the namenode is a SPOF, because queries cannot be performed without knowing the locations of the data blocks.

Query Execution Procedure

The following is a brief account of the query execution procedure in Impala:

1. The user selects an impalad in the cluster and submits a query through the impala shell or ODBC.
2. The impalad that received the query carries out the following pre-tasks: it fetches the table schema from the Hive metastore and judges the appropriateness of the query statement, and it collects the data blocks and location information required to execute the query from the HDFS namenode.
3. Based on the latest update of the Impala metadata, the impalad sends the information required to execute the query to all impalads in the cluster.
4. All the impalads that received the query and metadata read the data blocks they should process from the local directory and execute the query.
5. When all the impalads complete their work, the impalad that received the query from the user collects the results and delivers them to the user.

Hive-SQL Supported by Impala

Not all of Hive-SQL is supported by Impala. As Impala supports only part of Hive-SQL, you need to know which statements are supported.

SELECT QUERY

Impala supports most of the SELECT-related statements of Hive-SQL.

Data Definition Language

In Impala, you cannot create or modify a table. As shown below, you can only retrieve databases and table schemas; tables can only be created and modified through Hive.

SHOW TABLES
SHOW DATABASES
SHOW SCHEMAS
DESCRIBE TABLE
USE DATABASE

Data Manipulation

Impala only provides the ability to add data to an already created table and partition.

INSERT INTO
INSERT OVERWRITE

Unsupported Language Elements

There are still many Hive-SQL features not supported by Impala. Therefore, you are advised to consult the Impala Language Reference before using Impala.
Data Definition Language (DDL) such as CREATE, ALTER, DROP
Non-scalar data types: maps, arrays, structs
LOAD DATA to load raw files
Extensibility mechanisms such as TRANSFORM, custom User Defined Functions (UDFs), custom file formats, custom SerDes
XML and JSON functions
User Defined Aggregate Functions (UDAFs)
User Defined Table Generating Functions (UDTFs)
Lateral Views
etc.

Data Model

Since Impala 1.0 dropped the beta label, Impala supports a variety of file formats: Hadoop native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed); text (uncompressed or LZO-compressed); and Parquet (Snappy or uncompressed), the new state-of-the-art columnar storage format. The highest interest, however, lies in whether Impala will support Trevni, a project currently led by Doug Cutting. Trevni is a file format that stores a table record comprising rows and columns in column-major order instead of the usual row-major order. Why is support for Trevni a matter of keen interest? Because Impala could provide better performance with Trevni. As Trevni is still being developed, here I will only give a brief account of the column file format described in the Dremel paper. Dremel highlights the column file format as one of the factors affecting performance, and what benefits most from it is disk IO. Because the column file format stores a single record split into its columns, it is very effective when you retrieve only some of the columns of a record. With the traditional row-oriented storage method, the same disk IO occurs whether you read a single column or all columns, but with the column file format disk IO occurs only for the specific columns you access, so disk IO is used more efficiently.

Figure 2: Comparison of Row-unit Storage and Column-unit Storage (source: Dremel: Interactive Analysis of Web-Scale Datasets).

Looking at the results of the two tests conducted on the column file format in Dremel, you can estimate how much the column file format contributes to performance.

Figure 3: Comparison of the Performance of Column-unit Storage and Row-unit Storage (source: Dremel: Interactive Analysis of Web-Scale Datasets).

Figure 4: Comparison of the Performance of MapReduce and Dremel on Column-unit Storage and Row-unit Storage (300 nodes, 85 billion records) (source: Dremel: Interactive Analysis of Web-Scale Datasets).

In Figure 3, (a), (b) and (c) show the execution time as a function of the number of columns randomly selected in the column file format, while (d) and (e) show the time to read whole records in the existing way. According to the test results, the column file format executes faster, and the gap grows as fewer columns are accessed. With the record-oriented method, reading even a single column took nearly as long as reading all columns; with the column file format (a, b and c), the fewer the selected columns, the better the performance. Figure 4 compares the execution time of MapReduce work over data in the column file format and over data that is not (while this is a comparison of MapReduce and Dremel, here we only look at the effect of applying or not applying the column file format to MR). In this way, the column file format improves performance significantly by reducing disk IO.
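As a rough illustration (not part of the benchmark above), consider a hypothetical query against a wide table: a row-oriented format must read every column of every row from disk, while a columnar format such as Trevni or Parquet only needs to scan the referenced columns. The table and column names below are invented for this sketch.

-- Only the url and pv columns are referenced, so a column-oriented
-- format reads just those two columns from disk; a row-oriented format
-- would read all columns of every row.
SELECT url, SUM(pv) AS total_pv
FROM wide_access_log
GROUP BY url;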
Of course, as the above test results concern the column file format implemented in Dremel, the results for Trevni, which will be applied to Impala, may differ. Nevertheless, Trevni is being developed with the same goal as the column file format of Dremel, so a similar result can be expected.

Installing, Configuring and Running Impala

Installing

To install Impala, you must install the following software (for detailed installation information, visit the Impala website):

Red Hat Enterprise Linux (RHEL) / CentOS 6.2 (64 bit) or above
Hadoop 2.0
Hive
MySQL

MySQL is required for the Hive metastore. Hive supports a variety of databases as its metastore, but currently you should use MySQL for Impala and Hive to work together. If integration with HBase is required, you should install HBase as well. This article will not discuss the installation procedure for each software component in detail. Once you have installed all the required software, install the Impala package on all the data nodes in the cluster, and then install the impala shell on the client host.

Configuring

For impalad to access HDFS file blocks directly from the local directory and to track data locality, you should configure some settings in Hadoop's core-site.xml and hdfs-site.xml. As this configuration is critical to the performance of Impala, you must add these settings, and after changing them HDFS must be restarted. Refer to Configuring Impala for Performance without Cloudera Manager for more information on configuration.

Running

Before running Impala, you should carry out some pre-tasks, such as table creation and data loading, through the Hive client, as Impala does not currently support these. If you need to analyze an existing table in HBase or a new one, you can redefine it as an external table through Hive; refer to https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration. You can submit a query to Impala through the impala shell, Cloudera Beeswax (an application that helps users work with Hive easily), or ODBC. Here is an example of querying with the impala shell:

$ sudo -u impala impala-shell
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Build version: Impala v0.1 (1fafe67) built on Mon Oct 22 13:06:45 PDT 2012)
[Not connected] > connect impalad-host:21000
[impalad-host:21000] > show tables
custom_url
[impalad-host:21000] > select sum(pv) from custom_url group by url limit 50
20390
34001
3049203
[impalad-host:21000] >

Before entering a query, you should select one impalad among the impalads in the cluster to act as the main server. Once the connection is established, enter a query and see the result.

Impala Functionality/Performance Test

In this article I want to share with you the test results I obtained when I first tried the Impala beta version. Though many things have changed since the beta release, they should give you a sense of the trend in Impala's performance. I conducted a simple test to see how well it performs within the scope available to me. Table 2 below shows the equipment and the software versions used. The Hadoop cluster was composed of one namenode/JobTracker, 3 to 5 data/task nodes, and one commander node. The HDFS replication factor was set to 2, and the map/reduce capacity of each node was also 2. The test data was approximately 1.3 billion records with 14 columns, totaling approximately 65 GB.
Table 2: Hardware and Software Versions.

Equipment   CPU      Intel Xeon 2.00GHz
            Memory   16GB
Software    OS       CentOS release 6.3 (Final)
            Hadoop   2.0.0-cdh4.1.2
            HBase    0.92.1-cdh4.1.2
            Hive     0.9.0-cdh4.1.2
            Impala   0.1.0-beta

The table schema and the query used for the test are as follows.

Code 1: Hive Table Schema.

hive> describe custom_url;
OK
sd_uid    bigint
type      int
freq      int
locale    int
os        int
sr        int
browser   int
gender    int
age       int
url       int
visitor   int
visit     int
pv        int
dt        int

Code 2: Test Query.

select sr, url, sum(pv) as sumvalue
from custom_url
where sr != -1 and url != -1
  and (sd_uid=690291 or sd_uid=692758)
group by sr, url
order by sumvalue desc
limit 50

The query is composed of frequently used, simple clauses, including select, from, group by and order by.

Test Result

With cluster sizes of 3 and 5 nodes, the same query was run against the same table through Impala and Hive, and the average response time was measured.

Table 3: Comparison of the Time to Execute the Same Query in Impala and Hive, by Cluster Size.

          Impala    Hive      Hive/Impala
3 nodes   265s      3,688s    13.96
5 nodes   187s      2,377s    13.71

As mentioned above, Impala performs much better than analysis using MapReduce. In addition, the larger the cluster, the more Impala's response time improves, roughly linearly. (After expanding the cluster, the test was conducted after evenly redistributing the data blocks through rebalancing.)

Figure 5: Impala's Response Time According to Cluster Size.

In addition, to compare the performance of Impala with that of other column-based commercial databases that, like Impala, use MPP, namely Infobright, infiniDB and Vertica, a test was also conducted with the same data and query (the open-source versions of Infobright and infiniDB and the Community Edition of Vertica were used). The test of these three databases was conducted on a single server.

Table 4: Execution Time of Commercial Column-based Databases.

Database     Execution time
Infobright   200s
infiniDB     32s
Vertica      15s

As the results in Table 4 show, Infobright, infiniDB and Vertica, which are commercial enterprise products, perform much better than Impala. This is because Impala is still at an early stage of development and thus may have structural and technical shortcomings. Vertica, which showed the best performance of the four, performed about 30% better in a three-server environment than the result given in Table 4. However, as Vertica is a commercial product, configuring a cluster of two or more servers incurs additional cost.

Conclusion

Less than a year has passed since the Impala beta was released, and two months since the production 1.0 version was released. Since Impala 1.0, a variety of file formats are supported, as is a subset of ANSI-92 SQL including CREATE, ALTER, SELECT, INSERT, JOIN, and subqueries. In addition, it supports partitioned joins, fully distributed aggregations, and fully distributed top-n queries. Apart from this, MapR, which, like Cloudera, builds its business on Hadoop, has proposed Drill, a system based on the Dremel paper like Impala, to the Apache Incubator. It is still just a proposal, but Drill is expected to make fast progress once it is launched, as it is being promoted mainly by MapR developers.

By Kwon Donghun, Software Engineer at Data Infra Lab, NHN Corporation.