
News

Posted over 11 years ago by Kwon Donghun
UPDATED on July 16, 2013: the article has been updated inline based on feedback received on Twitter and in the comments below.

In this article I would like to introduce Cloudera Impala, an open source system that provides real-time SQL querying on top of Hadoop. I will briefly cover when and why Impala was created, then explain its architecture, its advantages and drawbacks, and compare it with Pig and Hive. You will also learn how to install, configure, and run Impala. At the end of the article I will present the performance test results I obtained when comparing Impala with Hive, Infobright, InfiniDB, and Vertica.

Analyzing results in real time through Hadoop

With the advent of Hadoop in 2006, big data analytics was no longer a task that only a few companies or groups could perform. Because Hadoop is an open-source framework, many companies and groups that needed big data analytics could adopt it easily and at low cost; in other words, big data analytics became a universal technology.

The core of Hadoop is the Hadoop Distributed File System (HDFS) and MapReduce. Hadoop stores data in HDFS, a distributed file system whose capacity can be expanded by adding nodes, runs MapReduce jobs over the stored data, and thus produces the required results.

Needs never stop growing, however. The Hadoop user community tried to overcome Hadoop's limits in functionality and performance and to develop it further. Complaints centered on MapReduce, which has two main disadvantages: it is inconvenient to use, and its processing is slow.

To address the inconvenience of writing MapReduce jobs directly, platforms such as Pig and Hive appeared in 2008. Pig and Hive are sub-projects of Hadoop (Hadoop is also an ecosystem; a variety of products based on it have been created). Both provide a high-level language: Pig's is procedural, while Hive's is a declarative language similar to SQL. With Pig and Hive, Hadoop users could perform big data analytics more easily. However, since Hive and Pig only provide a data retrieval interface and internally still use MapReduce, they could not accelerate the analytics work itself. This is why HBase, a column-oriented NoSQL store, appeared. HBase, which enables fast input/output of key/value data, finally gave Hadoop-based systems an environment in which data could be processed in real time.

This progress of the Hadoop ecosystem was greatly influenced by Google. HDFS was implemented on the basis of Google's GFS paper, and HBase appears to have been modeled on Google's BigTable paper. Table 1 below shows this relationship.

Table 1: Google Gives Us A Map (source: Strata + Hadoop World 2012 Keynote: Beyond Batch - Doug Cutting).

Google publication       Hadoop counterpart        Characteristics
GFS & MapReduce (2004)   HDFS & MapReduce (2006)   Batch programs
Sawzall (2005)           Pig & Hive (2008)         Batch queries
BigTable (2006)          HBase (2008)              Online key/value
Dremel (2010)            Impala (2012)             Online queries
Spanner (2012)           ????                      Transactions, etc.

Cloudera's Impala, introduced in this article, was also created under Google's influence: it is based on Google's Dremel paper, published in 2010. Impala is an open-source, Apache-licensed system that provides interactive/real-time SQL queries over data stored in HDFS.
SQL is familiar to many developers and expresses data manipulation and retrieval concisely. Because Impala supports SQL and provides real-time big data processing, it has the potential to serve as the backend of a business intelligence (BI) system; some BI vendors are said to have already started development projects based on Impala. The ability to get analytics results in real time through SQL makes the prospects of big data brighter and extends the application scope of Hadoop.

Cloudera Impala

Cloudera, which created Impala, has said it was technically inspired by Google's Dremel paper, which convinced them that real-time, ad-hoc queries on Apache Hadoop were possible. In October 2012, when announcing Impala, Cloudera introduced it as "Real-Time Queries in Apache Hadoop, For Real."

Impala adopted Hive-SQL as its interface. As mentioned above, Hive-SQL is syntactically similar to SQL, a widely used query language, so users can access data stored in HDFS in a very familiar way. Because Hive-SQL is also what Hive uses, the same data can be accessed in the same way from either system. However, Impala does not support all of Hive-SQL; it is best to think of the Hive-SQL accepted by Impala as a subset, so a statement that runs in Impala will also run in Hive, but not necessarily the other way around.

The key difference between Impala and Hive is response time. While Hive accesses data through MapReduce, Impala uses its own distributed query engine, installed on every data node in the cluster, to minimize response time. This is why Impala and Hive show distinctly different response times over the same data. Cloudera cites three reasons for Impala's performance:

Impala puts less load on the CPU than Hive and can use the saved cycles to drive more IO bandwidth, so it shows 3-4 times better performance than Hive on purely IO-bound queries.
When a query becomes complex, Hive must run multi-stage MapReduce jobs or reduce-side joins. For queries that the MapReduce framework cannot process efficiently (for example, a query containing at least one join), Impala is 7 to 45 times faster than Hive.
If the data blocks being analyzed are already in the file cache, Impala is faster still: 20 to 90 times faster than Hive.
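Because Impala and Hive share the Hive metastore and the same Hive-SQL syntax, comparisons like these are typically made by running one statement unchanged against both engines. Below is a minimal sketch of such a statement; it assumes the custom_url table whose schema appears later in this article (Code 1) and makes no claim about the exact speedup you would observe:

-- Runs unchanged in both Hive and the Impala shell, assuming the table
-- custom_url (see Code 1 later in this article) was created through Hive.
SELECT url, SUM(pv) AS total_pv
FROM custom_url
WHERE sr != -1
GROUP BY url
ORDER BY total_pv DESC
LIMIT 10;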
Real-time in Data Analytics

The term "real time" is emphasized in the introduction of Impala, but one may naturally ask, "How much time is real time?" Doug Cutting (the creator of Hadoop) and a senior architect at Cloudera have given their views on what real time means:

"When you sit and wait for it to finish, that is real time. When you go for a cup of coffee or even let it run overnight, that's not real time."

"'Real-time' in data analytics is better framed as 'waiting less.'"

Although it would be difficult to define real time with a precise number, if you can wait for the result while watching your monitor, that may be called real time.

Impala Architecture

Impala consists largely of impalad and the Impala state store. impalad is the process that acts as the distributed query engine; it plans queries and processes them on the data nodes of the Hadoop cluster. The Impala state store process maintains metadata about the impalads running on each data node: when an impalad process is added to or removed from the cluster, the metadata is updated through the state store.

Figure 1: Impala High-level Architectural View.

Data Locality Tracking and Direct Reads

In Impala, the impalad processes execute queries on all data nodes in the cluster, instead of MapReduce, Hadoop's traditional analytic framework. Two advantages of this configuration are data locality and direct reads: each impalad processes only the data blocks stored on the data node it runs on, and it reads the data directly from the local disk. This minimizes network load, and Impala can also benefit from the file cache.

Scale Out

Impala scales out horizontally, like a Hadoop cluster; in general you expand Impala as the cluster is expanded. When a data node is added, all you have to do is run the impalad process on the new server (the metadata for the added impalad is updated through the state store). This is very similar to databases based on massively parallel processing (MPP).

Failover

Impala analyzes data stored in Hive and HBase, and the HDFS they use provides a certain level of failover through replication. Impala can therefore execute queries as long as a replica of the data block and at least one impalad process exist. Update: in addition, if Impala is used with CDH4 (Cloudera's Distribution including Apache Hadoop), HDFS can be configured for high availability.

Single Point of Failure (SPOF)

One of the big concerns about any system that uses HDFS as its storage medium is that the namenode is a SPOF (we discussed this in detail when comparing HDFS with other distributed file systems in our previous article, Overview and Recommendations for Distributed File Systems). Some solutions have recently been released, but a fundamental fix is still a distant goal. In Impala, the namenode is a SPOF as well, because queries cannot run without knowing the locations of the data blocks. According to Impala Performance and Availability, however, there is no SPOF in Impala itself: "All Impala daemons are fully able to handle incoming queries. If a machine fails however, all queries with fragments running on that machine will fail. Because queries are expected to return quickly, you can just rerun the query if there is a failure."

Query Execution Procedure

The following is a brief account of how a query is executed in Impala:

1. The user selects an impalad in the cluster and submits a query through the Impala shell or ODBC.
2. The impalad that received the query carries out pre-tasks: it fetches the table schema from the Hive metastore and checks the validity of the query statement, and it collects the data blocks and location information required to execute the query from the HDFS namenode.
3. Based on the latest Impala metadata, it sends the information required to execute the query to all impalads in the cluster.
4. Each impalad that received the query and metadata reads the data blocks it is responsible for from its local disk and executes the query.
5. When all impalads have completed their work, the impalad that received the query from the user collects the results and delivers them to the user.

Hive-SQL Supported by Impala

Not all of Hive-SQL is supported by Impala. Since Impala supports only a subset, you need to know which statements are available.
SELECT queries

Impala supports most of the SELECT-related statements of Hive-SQL.

Data Definition Language

In the beta, you cannot create or modify a table from Impala; as shown below, you can only browse databases and table schemas. Tables must be created and modified through Hive.

SHOW TABLES
SHOW DATABASES
SHOW SCHEMAS
DESCRIBE TABLE
USE DATABASE

Data Manipulation

Impala only provides the ability to append data to an already created table or partition.

INSERT INTO
INSERT OVERWRITE

Unsupported Language Elements

There are still many Hive-SQL features not supported by Impala, so you are advised to consult the Impala Language Reference before using Impala. Unsupported elements include:

Data Definition Language (DDL) such as CREATE, ALTER, DROP (Update: the Impala stable version now supports DDL)
Non-scalar data types: maps, arrays, structs
LOAD DATA for loading raw files
Extensibility mechanisms such as TRANSFORM, custom User Defined Functions (UDFs), custom file formats, custom SerDes
XML and JSON functions
User Defined Aggregate Functions (UDAFs)
User Defined Table Generating Functions (UDTFs)
Lateral views, etc.
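The following is a minimal sketch of a session that stays within the beta limitations described above. The summary table name is hypothetical; both tables are assumed to have been created beforehand through Hive:

-- Browsing metadata from Impala (tables created through Hive):
SHOW DATABASES;
SHOW TABLES;
DESCRIBE custom_url;
-- Appending data to an existing table; custom_url_summary is a
-- hypothetical table assumed to have been created through Hive.
INSERT INTO TABLE custom_url_summary
SELECT url, SUM(pv) FROM custom_url GROUP BY url;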
Data Model

Now that Impala 1.0 has dropped its beta label, Impala supports a variety of file formats: Hadoop-native formats (Apache Avro, SequenceFile, and RCFile with Snappy, GZIP, BZIP, or no compression), text (uncompressed or LZO-compressed), and Parquet (Snappy-compressed or uncompressed), the new state-of-the-art columnar storage format.

When the original article was written, the greatest interest lay in whether Impala would support Trevni, a project led by Doug Cutting. Trevni is a file format that stores a table record comprising rows and columns in column-major order instead of the traditional row-major order. Why was Trevni support a matter of such interest? Because Impala could provide better performance with it. As Trevni was still being developed, here is only a brief account of the column file format described in the Dremel paper.

Dremel highlights the columnar file format as one of the factors affecting performance. What benefits most from a columnar format is disk IO. Because a columnar format splits a single record into its columns, it is very effective when you retrieve only some of the columns of a record. With traditional row-oriented storage, the same disk IO occurs whether you read one column or all of them; with a columnar format, disk IO occurs only for the specific columns you access, so disk IO is used far more efficiently.

Figure 2: Comparison of Row-unit Storage and Column-unit Storage (source: Dremel: Interactive Analysis of Web-Scale Datasets).

The two experiments on the columnar format reported in the Dremel paper give an idea of how much the format contributes to performance.

Figure 3: Comparison of the Performance of Column-unit Storage and Row-unit Storage (source: Dremel: Interactive Analysis of Web-Scale Datasets).

Figure 4: Comparison of the Performance of MapReduce and Dremel on Column-unit and Row-unit Storage (300 nodes, 85 billion records) (source: Dremel: Interactive Analysis of Web-Scale Datasets).

In Figure 3, (a), (b), and (c) show execution time as a function of the number of columns randomly selected from columnar data, while (d) and (e) show the time to read full records in the traditional way. Reading full records took a long time even when only a single column was needed, as if all columns were being read; with the columnar format (a, b, and c), the fewer columns selected, the better the performance, and the gap widens as the number of accessed columns shrinks. Figure 4 compares the execution time of MapReduce jobs over columnar data and over record-oriented data (although this is a comparison of MapReduce and Dremel, here we only look at the effect of applying the columnar format to MapReduce). In short, a columnar file format improves performance significantly by reducing disk IO.

Of course, these results are for the column format implemented in Dremel, so the results for Trevni, which was expected to be applied to Impala, might differ. Nevertheless, Trevni was being developed with the same goals as Dremel's columnar format, so similar results could be expected.

Update: instead of Trevni, Impala will support Parquet, an efficient general-purpose columnar file format for Apache Hadoop. In the current GA release of Impala, only a preview of Parquet is available.

Installing, Configuring and Running Impala

Installing

To install Impala, you must first install the following software (for detailed installation instructions, visit the Impala website):

Red Hat Enterprise Linux (RHEL) / CentOS 6.2 (64-bit) or above
Hadoop 2.0
Hive
MySQL

MySQL is required for the Hive metastore. Hive supports a variety of databases as its metastore, but currently you must use MySQL for Impala and Hive to work together. If integration with HBase is required, you should install HBase as well. This article will not cover the installation procedure for each piece of software in detail.

Once all required software is installed, install the Impala package on every data node in the cluster, and install the Impala shell on the client host.

Configuring

For impalad to read HDFS file blocks directly from the local disk and to track data locality, you must add some settings to Hadoop's core-site.xml and hdfs-site.xml. As this configuration is critical to Impala's performance, do not skip it. After changing the settings, restart HDFS. Refer to Configuring Impala for Performance without Cloudera Manager for more information on configuration.

Running

Before running Impala, you must carry out some pre-tasks through the Hive client, such as creating tables and loading data, since Impala does not currently support these. If you need to analyze an existing HBase table, or a new one, you can expose it by defining it as an external table through Hive; refer to https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.
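As a rough sketch of that last step, the following Hive statement (run from the Hive client, not Impala) maps a hypothetical HBase table named web_stats to an external Hive table; the table name, column family, and columns are illustrative only:

-- Run in Hive: expose the HBase table "web_stats" (hypothetical) to Hive/Impala.
CREATE EXTERNAL TABLE hbase_web_stats (rowkey STRING, pv BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,stats:pv")
TBLPROPERTIES ("hbase.table.name" = "web_stats");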
To submit a query to Impala, you can use the Impala shell, Cloudera Beeswax (an application that makes it easier to work with Hive), or ODBC. Here is an example session using the Impala shell:

$ sudo -u impala impala-shell
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Build version: Impala v0.1 (1fafe67) built on Mon Oct 22 13:06:45 PDT 2012)
[Not connected] > connect impalad-host:21000
[impalad-host:21000] > show tables
custom_url
[impalad-host:21000] > select sum(pv) from custom_url group by url limit 50
20390
34001
3049203
[impalad-host:21000] >

Before entering a query, you select one of the impalads in the cluster to act as the main server. Once the connection is established, enter a query and view the result.

Impala Functionality/Performance Test

In this section I share the test results I obtained when I first tried the Impala beta. Although many things have changed since the beta release, the results still show the trend in Impala's performance. I ran a simple test to see how well Impala performs within the scope available to me. Table 2 below shows the hardware and software versions used. The Hadoop cluster consisted of one Namenode/JobTracker, 3 to 5 Data/Task nodes, and one commander node. HDFS replication was set to 2, and the Map/Reduce task capacity per node was also 2. The test data was approximately 1.3 billion records with 14 columns, totaling approximately 65 GB.

Table 2: Hardware and Software Versions.

Classification  Item      Description
Equipment       CPU       Intel Xeon 2.00GHz
                Memory    16GB
Software        OS        CentOS release 6.3 (Final)
                Hadoop    2.0.0-cdh4.1.2
                HBase     0.92.1-cdh4.1.2
                Hive      0.9.0-cdh4.1.2
                Impala    0.1.0-beta

The table schema and query used for the test are as follows.

Code 1: Hive Table Schema.

hive> describe custom_url;
OK
sd_uid    bigint
type      int
freq      int
locale    int
os        int
sr        int
browser   int
gender    int
age       int
url       int
visitor   int
visit     int
pv        int
dt        int

Code 2: Test Query.

select sr, url, sum(pv) as sumvalue
from custom_url
WHERE sr != -1 and url != -1 and (sd_uid=690291 or sd_uid=692758)
group by sr, url
order by sumvalue desc
limit 50

The query is composed of frequently used statements: SELECT, FROM, GROUP BY, and ORDER BY.

Test Result

With cluster sizes of 3 and 5 nodes, the same query was executed against the same table through Impala and through Hive, and the average response time was measured.

Table 3: Time to Execute the Same Query in Impala and Hive, by Cluster Size.

          Impala   Hive     Hive/Impala
3 nodes   265s     3,688s   13.96
5 nodes   187s     2,377s   13.71

As mentioned above, Impala performs much better than analytic work done through MapReduce. In addition, Impala's response time improves nearly linearly as the cluster grows. (After expanding the cluster, the test was run after redistributing data blocks evenly through rebalancing.)

Figure 5: Impala's Response Time According to Cluster Size.

To compare Impala with other column-oriented commercial databases that share an MPP architecture with Impala, namely Infobright, InfiniDB, and Vertica, the same data and query were also tested on those systems (the open source editions of Infobright and InfiniDB and the Community Edition of Vertica were used). The tests of these three databases were run on a single server.

Table 4: Execution Time of Commercial Column-oriented Databases.

Database     Execution time
Infobright   200s
InfiniDB     32s
Vertica      15s

As Table 4 shows, Infobright, InfiniDB, and Vertica, which are commercial enterprise products, produced much better results than Impala. This is because Impala was still at an early stage of development and thus had structural and technical shortcomings.
For Vertica, which showed the best performance of the four, a three-server configuration performed about 30% better than the result in Table 4. However, as Vertica is a commercial product, configuring a cluster of two or more servers incurs additional licensing cost.

Conclusion

Less than a year has passed since the Impala beta was released, and two months since the production 1.0 version was released. Since Impala 1.0, a variety of file formats are supported, and a subset of ANSI-92 SQL is supported, including CREATE, ALTER, SELECT, INSERT, JOIN, and subqueries. In addition, Impala supports partitioned joins, fully distributed aggregations, and fully distributed top-n queries.

Apart from this, MapR, which, like Cloudera, builds its business on Hadoop, has proposed Drill to the Apache Incubator. Drill is an open source system based on the Dremel paper, like Impala, and is being developed mainly by MapR's own engineers. It was still just a proposal when this article was first written (not any more: it is currently undergoing incubation at The Apache Software Foundation), but Drill is expected to make fast progress once it is launched, as it is being promoted mainly by MapR developers.

By Kwon Donghun, Software Engineer at Data Infra Lab, NHN Corporation.
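To give a feel for the 1.0-era SQL support described above, here is a minimal, hedged sketch; the table names other than custom_url are hypothetical and the exact options accepted may vary by release:

-- DDL and a join, of the kind the 1.0 release is described as supporting.
CREATE TABLE url_dim (url INT, category STRING);
CREATE TABLE url_top (url INT, category STRING, total_pv BIGINT);
INSERT INTO TABLE url_top
SELECT c.url, d.category, SUM(c.pv)
FROM custom_url c JOIN url_dim d ON (c.url = d.url)
GROUP BY c.url, d.category;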
Posted over 11 years ago by Esen Sagynov
Along with the latest CUBRID Migration Toolkit release, we are announcing the new 2013.05 versions (build 0120) of the CUBRID Manager (CM) and CUBRID Query Browser (CQB) tools, which allow users to administer their CUBRID database servers and execute queries in a friendly desktop GUI application.

You can download CM and CQB from the following links:

CM: http://www.cubrid.org/?mid=downloads&item=cubrid_manager&os=detect&cubrid=any
CQB: http://www.cubrid.org/?mid=downloads&item=cubrid_query_browser&os=detect&cubrid=any

As CQB is a subset of the CM tool, the key features of this release are the same for both:

All versions of CUBRID Server since 8.2.2 are supported.
Improved Query Tuning Mode: lets users see the results of query tuning and compare the before/after query statistics and execution plans.
Improved results output when executing queries concurrently: the output results (data/columns/records) can be compared using different colors.
Search within query results: users can now search within the result set in the Results tab.
Improved Import Wizard: failed data can be exported, and the error message for each failed query is displayed.

A number of bugs have also been fixed in this release. For the full list of changes, refer to the CUBRID Manager Release Notes.

How to upgrade

Autoupgrade: users of CM/CQB version 1.2 and above will be automatically notified about the new update via the built-in update feature. Users of older versions need to manually uninstall the existing version of CM or CQB, then download and install the latest version from the links above.

CUBRID Server compatibility: CUBRID Server versions 8.2.2+ are supported. When establishing a connection with a CUBRID Server, users can indicate the version of the server; CM and CQB will usually detect the server version automatically.

Requirements: CM and CQB require JRE 1.6 or later.

Feature Walkthrough with Screenshots

Query Tuning Mode

Tuning Mode is designed to let users compare the statistics and execution plans of two queries side by side. Sometimes you want to tweak a query and see how it compares to the original version: the differences in the number of fetched results, the number of dirty pages created, whether I/O reads and writes occur, and finally the total cost of executing the queries. These are the comparisons Tuning Mode provides.

To enable Tuning Mode, toggle on the Tuning Mode button as shown in Figure 1 below. When Tuning Mode is enabled, CM/CQB will notify you that it has been started; nothing else happens at this point.

Figure 1: Enable Tuning Mode.

Once you execute a query in Tuning Mode, a new window pops up (Figure 2 below) displaying the statistics and execution plan for this query on the left panel.

Figure 2: Tuning Mode window (click to enlarge).

To run the second query and compare its results on the right panel, check the "Display on the right panel (R)" checkbox as shown in Figure 3.

Figure 3: Change the panel to run the second query.

Once the right side is checked, run the modified version of the original query. The right side of the Tuning Mode window will auto-update and display its statistics and execution plan (Figure 4).

Figure 4: Compare two queries in Tuning Mode.

This is one of the features we are most excited about in this new release.
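For instance, the two statements below are the kind of before/after pair you might run in the left and right panels; the orders table and its columns are purely hypothetical:

-- Original query (left panel): fetches every column.
SELECT * FROM orders WHERE customer_id = 1234 ORDER BY order_date DESC;
-- Tweaked query (right panel): fetches only the needed columns and rows.
SELECT order_id, order_date, total_amount
FROM orders
WHERE customer_id = 1234
ORDER BY order_date DESC
LIMIT 100;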
Multiple SQL execution on multiple databases in parallel

CM and CQB allow you to execute multiple queries at once, not only on a single connected database but also on multiple databases. Imagine you want to query a master and one or more of its slave nodes to check whether the data on the selected nodes is identical, or you want to query the production server and compare it against the staging server, or during a migration you want to query the destination database and compare the results with those of the source database. Concurrent SQL execution on multiple databases is a really useful feature for all of these tasks.

In this release we have improved how results are displayed when executing multiple queries on multiple databases. You can now compare the results of executed queries side by side in a convenient popup window (Figure 5).

Figure 5: Comparison of the results of multiple queries.

Search within the Results Set

As of this release, CM and CQB provide search functionality in the Results tab (Figure 6). This allows users to filter the fetched result set in real time while the matched column values are highlighted.

Figure 6: Search within Results Set.

Improved Import Wizard

In CM and CQB, users can import data into a database from various sources (SQL/TXT/CSV/XLS). All import operations are logged and kept in the import history so that users can rerun them if necessary (Figure 7).

Figure 7: Import Wizard.

In this release we have added a new feature which exports a list of erroneous queries into an external SQL file (Figure 8). If an error occurs while importing data, you can click the "Browse error log..." button, which takes you to a file where all problematic queries are logged.

Figure 8: Browse error logs.

Alternatively, you can double-click the import row (highlighted in red upon failure), and a popup window will appear where you can see the list of failed queries along with an error message for each query (Figure 9).

Figure 9: View failed queries.

This is the list of prominent features we wanted to share with you today. There are many more improvements (36 improvements, 66 bug fixes) in this release; refer to the CUBRID Manager Release Notes for the full list. If you have any comments or questions, there is a comment form below. We always look forward to receiving feedback.
Posted over 11 years ago by Esen Sagynov
We are glad to announce the release of version 2013.05 of the CUBRID Migration Toolkit (CMT), which allows users to migrate or extract existing data and/or schema from MySQL, Oracle, or previous versions of CUBRID to the latest version of CUBRID Database Server, and to perform scheduled execution of batch operations.

You can download CMT binaries for Windows, Linux, and Mac OS X from http://www.cubrid.org/?mid=downloads&item=cubrid_migration_toolkit&os=detect&cubrid=any.

This release brings many UI improvements to the Migration Wizard. We have added migration type selection (Figure 1) as the first step of the migration process, which improves the user experience when starting a new migration.

Figure 1: Select Migration Type.

Previously, only one migration process could run at a time, so a user could create another migration script only after the previous one had completed. Now CMT allows migration tasks to be scheduled even while other processes are running (Figure 2).

Figure 2: Schedule migration instead of starting.

When scheduling a migration task, users can select a particular date and time to start it, a time of day for recurring tasks, or supply a UNIX-compatible cron scheduling pattern (Figure 3).

Figure 3: Migration Schedule Settings.

In the new version of CMT, users can rerun a migration script if any failure occurred during migration (Figure 4). This saves users from having to recreate the migration script from scratch.

Figure 4: Rerun the migration script upon failure.

In addition, users can go back to the Migration History tab, choose one of the previously executed tasks, and rerun it (Figure 5).

Figure 5: Migration History.

We hope these improvements provide a better user experience when performing migrations. To see the full list of changes, refer to the CMT Release Notes.
Posted almost 12 years ago by Esen Sagynov
Today I am thrilled to announce that we have released new versions of our PHP and PDO drivers which now support CUBRID SHARD Broker, a middleware in CUBRID Database that provides the sharding feature.

How it all began...

For those of you who have not heard much about the CUBRID open source database, it provides numerous features for stability and horizontal scalability, including native support for High Availability, Database Sharding, Load Balancing, and many more. Since we announced Database Sharding in CUBRID 8.4.3, we have been working hard on adding sharding support to all our APIs. Along with that announcement we released new versions of our JDBC and C APIs which provided support for both Database Sharding and API-level load balancing. Then we added sharding support to our node-cubrid Node.js API.

Now sharding comes to the PHP and PDO drivers thanks to Kirill Shvakov [Twitter link] from Russia, who spent his time testing the drivers and reporting issues to us. Last month he told us that he could not use our PHP and PDO drivers to correctly insert records into multiple shards when connecting to CUBRID SHARD Broker. This was due to the fact that we had not added SHARD Broker support to these drivers. Now PHP API version 9.1.0.0003 and PDO API version 9.1.0.0002 include patches to support CUBRID SHARD Broker. Next month we will release new versions of these drivers with a few more sharding-related fixes and a dozen new features.

Notice: if you are going to use the latest versions of the PHP and PDO drivers available today, use cubrid_query() instead of the cubrid_execute() function. Refer to this comment for more details. In the next release we will provide a fix for this.

If you have questions or feature requests, feel free to post on CUBRID Forum, Facebook, Twitter, Google+, or the Freenode #cubrid IRC channel.
Posted almost 12 years ago by Jun Sung Won
OwFS, HDFS, NFS... What kind of distributed file system do we need to provide a more competitive Internet service? In this article I will explain the features of the distributed file systems I have worked with at NHN and suggest which one is suitable for which case. I will also discuss recent open source distributed file systems, as well as Google's distributed file system, which are not yet used by NHN, and cover the development trends of distributed file systems in general.

What Kind of Distributed File Systems do We Need to Use?

Currently NHN uses various distributed file systems such as NFS, CIFS, HDFS, and OwFS. A distributed file system is a file system that allows access to files via a network; its advantage is that you can share data among multiple hosts. Before 2008, NHN used only NFS and CIFS on NAS, but since 2008 it has developed and begun to use its own OwFS, and the usage of OwFS has since increased. As is widely known, HDFS is an open source file system that was influenced by Google's GFS.

For NFS or CIFS, Network Attached Storage (NAS), which is an expensive storage system, is typically used, so every capacity increase comes with a high infrastructure cost. On the other hand, as OwFS and HDFS run on relatively economical hardware (commodity servers), you can build high-capacity storage at much lower cost. However, OwFS and HDFS are not better than NAS-based NFS or CIFS in all cases; for some purposes you should use NAS. The same goes for choosing between OwFS and HDFS: as they are built for different purposes, you need to select a file system according to the purpose of the Internet service you are implementing.

NFS and CIFS

Generally, Network File System (NFS) is used by Linux/Unix, while CIFS (Common Internet File System) is used by MS Windows. NFS is a distributed file system developed in 1984 by Sun Microsystems. It allows you to share and use files from other hosts via a network. NFS has gone through many versions; it began to be widely used with NFSv2, and the most-used version today is NFSv3. In 2003, NFSv4 was released with better performance and security, and in 2010 NFSv4.1, which supports cluster-based expansion, was released. NFS is a distributed file system with a long history, and it continues to evolve. CIFS is the distributed file system Microsoft applied to MS Windows; it was developed on the basis of Server Message Block (SMB) with enhanced functionality, including security.

NFS and CIFS comply with the POSIX standard, so applications using NFS or CIFS can use the distributed file system like a local file system. This means that when you implement or run an application, you do not need to handle a local file system and a distributed file system separately. When NFS or CIFS is used, Network Attached Storage (NAS) is used for better performance and availability. NAS has its own file system and processes remote file access requests through a NAS gateway that supports the NFS or CIFS protocols. Generally, NAS has the structure shown in Figure 1:

Figure 1: A General Structure of NAS.

The features of NFS and CIFS are as follows:

- They provide the same functionality as a local file system.
- As they mostly rely on NAS, the purchase cost is high.
- The scalability of NAS is very low compared to OwFS and HDFS.
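To illustrate how an application simply sees NFS or CIFS as an ordinary directory, these are typical mount commands on a Linux client; the server name, export path, share name, mount points and user are placeholders.

    # NFS: mount an export from a NAS or file server
    mount -t nfs nas1.example.com:/export/data /mnt/data

    # CIFS: mount an SMB/Windows share (requires the CIFS utilities)
    mount -t cifs //nas1.example.com/share /mnt/share -o username=svc_user

After mounting, the application reads and writes under /mnt/data or /mnt/share with ordinary file APIs, exactly as it would on a local disk.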
OwFS (Owner-based File System)

OwFS is a distributed file system developed by NHN, and it is currently the distributed file system that NHN uses the most. OwFS is built around the concept of a container called an owner. An owner is the basic unit of metadata managed in OwFS, and OwFS manages data replicas on the basis of owners. Each owner has an independent file storage space in which files and directories can be stored, and OwFS forms a single huge file system by gathering these owners together. To access a file, a client first needs to obtain the owner's information. The overall structure of OwFS is as follows:

Figure 2: The OwFS Structure.

The Meta-Data Server (MDS) of OwFS manages the status of the data servers (DS). More specifically, the MDS manages the capacity of each DS, and when a failure occurs, it performs restoration work using the replicas of owners. Compared to HDFS, the advantage of OwFS is that it can process a large number of files efficiently (mainly when file sizes are within dozens of MB). This is because OwFS is designed so that the burden on the MDS does not grow as the number of files increases: each DS manages the information on the files and directories stored in its owners (i.e., the file and directory metadata), while the MDS holds only the owner information and the locations of the owner's replicas. For this reason, even when the number of files increases, the metadata the MDS must handle does not increase, and consequently the burden on the MDS does not grow much.

The features of OwFS are as follows:

- An owner, a kind of container, is a small file system of its own, and the owners together make up the entire file system.
- Owner contents (file data and metadata) are stored on data servers (DS).
- Multiple owners can be stored on a single DS, and owners are distributed and replicated across different DSs.
- The location of each owner, including the locations of its replicas, is stored in the Meta-Data Server (MDS).
- It is suitable for processing a large number of files whose sizes are within dozens of MB.
- As all of its components are duplicated or triplicated, it works stably even when a failure occurs.
- A MapReduce framework for OwFS (MFO) also exists.
- An Apache module is available to serve files from OwFS.
- An OwFS_FUSE module is available to mount OwFS like NAS.

OwFS provides its own interface (API). However, as it also provides the OwFS_FUSE module, you can mount OwFS like NAS and use it as a local disk. In addition, the Apache module allows the Apache web server to access files stored in OwFS.
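OwFS's actual client API is internal to NHN and is not shown in this article. Purely to illustrate the two-step access path described above (resolve the owner through the MDS, then read or write files through the DS that holds that owner), here is a hypothetical Java sketch; every type and method name in it is invented for illustration and does not correspond to the real OwFS interface.

    // All names below are hypothetical; they only mirror the roles described in the text.
    import java.util.List;
    import java.util.Map;

    interface MetaDataServer {
        // The MDS knows only which data servers hold an owner and its replicas.
        OwnerLocation lookupOwner(String ownerId);
    }

    interface DataServer {
        // Per-file metadata and data live on the DS, not on the MDS.
        byte[] readFile(String ownerId, String path);
        void writeFile(String ownerId, String path, byte[] data);
    }

    class OwnerLocation {
        final String primaryDs;
        final List<String> replicaDs;
        OwnerLocation(String primaryDs, List<String> replicaDs) {
            this.primaryDs = primaryDs;
            this.replicaDs = replicaDs;
        }
    }

    class OwfsClientSketch {
        private final MetaDataServer mds;
        private final Map<String, DataServer> dataServers;

        OwfsClientSketch(MetaDataServer mds, Map<String, DataServer> dataServers) {
            this.mds = mds;
            this.dataServers = dataServers;
        }

        byte[] read(String ownerId, String path) {
            // Step 1: a cheap owner lookup on the MDS, independent of how many files the owner holds.
            OwnerLocation loc = mds.lookupOwner(ownerId);
            // Step 2: the DS that stores the owner serves the file itself.
            return dataServers.get(loc.primaryDs).readFile(ownerId, path);
        }
    }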
Hadoop Distributed File System (HDFS)

Google developed Google File System (GFS), its own distributed file system, to store information about the web pages crawled by Google, and published a paper on GFS in 2003. HDFS is an open source system developed with GFS as its model, and for this reason HDFS shares GFS's characteristics. HDFS splits a large file into chunks and stores each chunk on the data nodes with three replicas; in other words, one file is stored across multiple distributed data nodes, and each file has three replicas. The typical chunk size is 64 MB. The metadata describing which data node stores which chunk is kept in the namenode. This allows you to read data from the distributed file and perform operations on it using MapReduce. Figure 3 below shows the configuration of HDFS:

Figure 3: HDFS Structure (http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html).

The namenode of HDFS manages the namespace and metadata of all files and the information on file chunks. Chunks are stored on data nodes, and these data nodes process file operation requests from the clients. As explained above, HDFS can distribute and store large files effectively. Moreover, you can perform distributed processing of operations using the MapReduce framework based on the chunk location information. Compared to OwFS, the weakness of HDFS is that it is not suitable for processing a large number of files, because a bottleneck can occur at the namenode: if the number of files grows too large, the namenode's service daemon runs out of memory (OOM) and the daemon process is terminated.

The features of HDFS are as follows:

- A large file is divided into chunks that are distributed and stored across multiple data nodes.
- The chunk size is usually 64 MB, each chunk has three replicas, and the replicas are stored on different data nodes.
- The information on these chunks is stored in the namenode.
- It is advantageous for storing large files, but if the number of files is large, the burden on the namenode increases.
- The namenode is a single point of failure (SPOF): if it fails, HDFS stops and must be restored manually.

As HDFS is written in Java, its interface (API) is a Java API, although you can also use a C API through JNI. The Hadoop community does not officially provide a FUSE mount, but third parties provide FUSE mount functionality for HDFS. What you should plan for when using HDFS is namenode failure: when the namenode fails, HDFS as a whole becomes unavailable, so you need to take the recovery time into account. In addition, as NHN does not have a dedicated HDFS team, each service department has to run and operate its own HDFS.
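To make the division of labor concrete (the namenode serves only metadata and chunk locations, while the data nodes serve the bytes), here is a minimal write-and-read sketch using the Hadoop FileSystem Java API. The namenode address and paths are placeholders, and the configuration property names shown are the Hadoop 2.x spellings; older 1.x releases used slightly different keys (e.g. fs.default.name, dfs.block.size).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder namenode address; the client contacts the namenode only for metadata.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:9000");
            // Three replicas per chunk and a 64 MB chunk (block) size, as described in the article.
            conf.set("dfs.replication", "3");
            conf.set("dfs.blocksize", String.valueOf(64L * 1024 * 1024));

            FileSystem fs = FileSystem.get(conf);

            // Write: the file is split into blocks that are placed on data nodes.
            Path path = new Path("/logs/webserver/access-2013-05-01.log");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeBytes("example log line\n");
            }

            // Read: block locations come from the namenode, bytes come from the data nodes.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[4096];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, "UTF-8"));
            }
            fs.close();
        }
    }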
Table 1: OwFS vs. HDFS.

- Redundant metadata server: OwFS: supported; HDFS: not supported yet.
- Metadata to be managed: OwFS: owner allocation information; HDFS: file namespace, file information (stat info), chunk information, and chunk allocation information.
- File storage method: OwFS: files are stored without any change; HDFS: files are divided into chunks.
- Advantages: OwFS: suitable for storing many files; HDFS: suitable for storing large files.

Choosing the Right Distributed Storage Type For Each Case

Case 1. The files are not big (usually within dozens of MB) and the number of files is likely to increase significantly. A space of approximately 10-100 TB is expected to be required. Once files are stored, they will hardly be modified.
⇒ OwFS is suitable. As the files are small and their number is likely to increase, OwFS is more suitable than HDFS. As the content of the files does not change and the total size required is only dozens of TB, OwFS is also more cost-effective than NAS.

Case 2. You need to store log files created by the web server and analyze their content periodically. The size of a log file is 1 TB on average, approximately 10 such log files are created, and they are kept for a month.
⇒ HDFS is suitable, because the files are large but few in number, and HDFS is the better choice for analyzing such large files. Note, however, that once a file is stored in HDFS it cannot be modified, so a log file should be stored in HDFS only after it has been completely written.

Case 3. You need to store log files created by the web server and analyze their content periodically. One log file per day is created for each server, and the file size is 100 KB on average. Currently around 10,000 servers need to have their logs collected, and the retention period is 100 days.
⇒ You had better use OwFS. The number of files will be large because of the 100-day retention period, while the files themselves are small (roughly 10,000 servers x 100 days gives about 1,000,000 files, yet only around 100 GB in total). The analytics work can still be done through MapReduce: if you use MFO (Map-Reduce Framework for OwFS), you can run MapReduce on OwFS as well.

Case 4. Each user has multiple files, you want to build an index of file information so that users can search their files quickly, and you need storage for the index file. Whenever a file is added, deleted, or changed, this index file must be updated. If the number of files is small the index is small, but if the number of files is large the index file can grow to hundreds of MB. The total number of users is estimated at 100,000.
⇒ NAS is suitable. An index file must generally be updated frequently, but in OwFS or HDFS stored files cannot be modified in place. With OwFS you can change files if you use OwFS_FUSE in full mode, but this method reads the entire file, changes it, and rewrites it, even if only 1 byte needs to change. In addition, since the total size is only dozens of TB, using NAS will not result in high cost either.

Case 5. There are over 1,000,000 files of approximately 1 KB each. The total size required is approximately 3 GB. It is difficult to change the existing directory structure.
⇒ NAS is suitable. With OwFS_FUSE you can use a directory structure as flexibly as on a local file system, but with many small files its performance is often lower than that of NAS. HDFS can also be mounted through FUSE, but it is not suitable for an environment with a large number of files. In addition, as the total size required is very small, using NAS will not cause high costs.

Case 6. File sizes range from a few MB to a few GB, and the total size required is approximately 500 TB. The directory structure should not be changed, and the content of files may change from time to time.
⇒ OwFS is suitable. With around 500 TB required, NAS is not appropriate because of cost, and 500 TB also implies a large number of files, so OwFS is more suitable than HDFS. Although the existing directory structure can be used without change through OwFS_FUSE, it is better to reorganize it into a structure that maps well onto owners. When the content of a file changes, you can keep the overall system load down by reading and updating the file on the application server and then overwriting it from the application server to OwFS.

Table 2: Criteria for Selection of OwFS, HDFS, NFS/CIFS.

- File size: OwFS: small (dozens of MB) to mid-sized (under dozens of GB) files, suitable when there are many small files; HDFS: large files (over 10 GB), suitable when there are many large files; NFS/CIFS: small and mid-sized files.
- Number of files: OwFS: large; HDFS: small; NFS/CIFS: large.
- TCO: OwFS: low; HDFS: low; NFS/CIFS: high.
- Analytic functionality: OwFS: yes; HDFS: yes; NFS/CIFS: no.
- Selection criteria: OwFS: when files are small, the number of files is large, and analysis is required; HDFS: when files are large, the number of files is small, the capacity is large, and analysis is required; NFS/CIFS: when the capacity is small, or when NAS is already in use and compatibility is required.

Noteworthy Distributed File Systems

There are some other noteworthy distributed file systems that are not currently used by NHN.
These are GFS2, Swift, Ceph, and pNFS. Below I will briefly introduce them and explain their functionality.

GFS2

Google's GFS is to distributed file systems what the Beatles were to the music industry, in that many distributed file systems, including HDFS, were inspired by it. However, GFS also has a significant structural weakness: it is vulnerable to master (namenode) failure. Unlike HDFS, GFS has a slave namenode, which makes it less susceptible to failures than HDFS; even so, when a failure occurs at the master namenode, the failover time is not short. If the number of files increases, the amount of metadata also increases, so processing slows down, and the total number of files is limited by the memory size of the master server. The chunk size is usually 64 MB, and GFS is inefficient at storing data smaller than that. You can of course reduce the chunk size, but doing so increases the amount of metadata, so even when there are many files smaller than 64 MB it is still difficult to reduce the chunk size.

GFS2 overcomes these weaknesses of GFS. GFS2 uses a much more advanced metadata management method: its namenode has a distributed structure rather than a single master, and it stores metadata in an updatable database such as BigTable. Through this, GFS2 addresses both the limit on the number of files and the vulnerability to namenode failure, and because the amount of metadata that can be processed scales easily, the chunk size can be reduced to 1 MB. The structure of GFS2 is expected to have a huge influence on how most other distributed file systems improve their own structures.

Swift

Swift is the object storage system used in OpenStack, which is used by Rackspace Cloud and others. Swift uses a structure without a separate master server, as Amazon S3 does. It uses a 3-level object structure (Account, Container, and Object) to manage files. The Account object is a kind of account used to manage containers; the Container object groups Objects, much like a directory, and corresponds to a bucket in Amazon S3; the Object corresponds to a file. To access an Object, you go through its Account object and Container object, in that order. Swift provides a REST API, served by a proxy server. Object placement is determined by a static table of predefined locations called the Ring; all servers in Swift share the Ring information and use it to find the location of a desired object. As the use of OpenStack has been growing rapidly with the participation of more and more large companies, Swift has recently been getting more attention. In Korea, KT participates in OpenStack and provides its KT uCloud service using Swift.
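The Account/Container/Object hierarchy shows up directly in the paths of Swift's REST API. As an illustration, the following are typical requests against a Swift proxy server; the endpoint, account name, container, object and authentication token are placeholders, and the exact authentication flow depends on the deployment.

    # List the containers in an account
    curl -H "X-Auth-Token: $TOKEN" https://swift.example.com/v1/AUTH_myaccount

    # Upload an object into an existing container
    curl -X PUT -H "X-Auth-Token: $TOKEN" --data-binary @photo.jpg \
         https://swift.example.com/v1/AUTH_myaccount/photos/photo.jpg

    # Download the same object
    curl -H "X-Auth-Token: $TOKEN" \
         https://swift.example.com/v1/AUTH_myaccount/photos/photo.jpg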
Ceph

Ceph is a distributed file system with a unique metadata management method. Like other distributed file systems, it manages the namespace and metadata of the entire file system through metadata servers, but it is distinctive in that the metadata servers operate as a cluster and the portion of the namespace each metadata server is responsible for is adjusted dynamically according to load. This makes it easy to respond when load concentrates on certain areas and to scale out the metadata servers. Moreover, unlike many other distributed file systems, Ceph is POSIX-compatible, which means that you can access a file stored in the distributed file system just as you would in a local file system. It also supports a REST API that is compatible with the REST APIs of Swift and Amazon S3. The most noticeable thing about Ceph is that it is included in the Linux kernel source. Its released version number is not that high yet, but as Linux develops, Ceph may eventually become the main file system of Linux. It is also attractive because it is POSIX-compatible and supports kernel mount.

Parallel Network File System (pNFS)

As mentioned above, NFS has multiple versions. To resolve the scalability issues of the versions up to NFSv4, NFSv4.1 introduced pNFS. pNFS processes the content of a file and its metadata separately and can store a single file in multiple distributed places. Once a client retrieves the metadata of a file and learns where the file is stored, later accesses to the same file connect directly to the servers that hold its content, and the client can read or write that content on multiple servers in parallel. The metadata servers can also be scaled out easily to prevent bottlenecks. pNFS is an advanced version of NFS that reflects the recent trends in distributed file systems, so it combines the advantages of NFS with those of the latest distributed file systems. Some products that support pNFS, though not many, are already being released. One more reason to pay attention to pNFS is that NFS is now managed not by Oracle (Sun) but by the Internet Engineering Task Force (IETF). NFS is used as a standard in many Linux/Unix environments, so pNFS may become widespread if many vendors release products that support it.

Conclusion

So far, you have seen that distributed file systems each have their own characteristics and that you need to apply the one that suits your business needs. This article also introduced some distributed file systems that have begun to attract attention, along with new trends and points worth keeping in mind.

By Jun Sung Won, General Manager at Storage System Development Team, NHN Corporation.