Posted
about 12 years
ago
On November 29, 2012, a Cassandra conference was held in Tokyo. This was the second annual conference run by the Cassandra community in Tokyo. This year, our CTO, Jonathan Ellis, and I joined the conference as speakers. The conference started with a keynote speech from Jonathan, followed by 3 sessions each on a use case track and a technical track. More than 150 people signed up, and the conference was full of interesting topics for developers, current users, and those who are considering using Cassandra. Today, I’d like to share what the conference was like in this blog post.
Keynote speech – What’s new in Cassandra 1.2?
Keynote from Jonathan
Jonathan gave his keynote speech about the new features in the Cassandra 1.2 release. His talk covered a wide variety of features that Cassandra 1.2 introduces, ranging from virtual nodes, “fat node” support, and atomic batches to request tracing. The talk covered many new features not covered in his Cassandra Summit speech, so if you are interested, you can watch his keynote here.
CQL3, Backup and monitoring, and Data modeling with wide row…
Technical session
The following technical sessions covered interesting topics for Cassandra developers.
For the first session of the technical track, I gave my presentation about CQL3 (slides can be found here). The talk was based on blog posts we have written on this dev blog, including the new collection support. I also talked about the new native protocol, and I could see many developers were interested in this new feature.
In the following session, Kazutaka Tomita gave his talk about Cassandra monitoring and backup. He showed how Cassandra stores its data, and the steps for data backup and recovery when a node goes down.
The last technical session was from Yongkun Wang of Rakuten Inc., Japan’s largest e-commerce company, and he talked about how their log management platform stores application logs and access logs generated from their own PaaS platform in a Cassandra cluster. They chose Cassandra because of its linear scalability and fast write performance. It was interesting to see how they transform application logs in JSON format into Cassandra rows, using CompositeType to construct wide rows.
Use cases – e-commerce, cloud storage, web mail and more
There were also sessions on a use case track. Three companies showed their Cassandra use cases, including a cloud e-commerce platform, cloud storage, and web mail. Besides the use cases covered in sessions, I heard about a lot of other use cases at the after party, like collecting data from tons of sensors, running a Cassandra cluster on Amazon EC2, and so on.
I was at last year’s conference also, but at the time, most of the people were just evaluating what Cassandra can do. This year, I could hear a lot of actual use cases that were in production. I can say that Cassandra is getting more and more attention here in Japan.
|
Posted
about 12 years
ago
Cassandra 1.2 exposes almost everything that each server knows about the cluster in tables in the system keyspace. We started this process with the introduction of CQL3 in Cassandra 1.1, but introducing the native protocol motivated us to finish it, so native protocol drivers can introspect everything they need without falling back to the old Thrift calls.
Here’s what the system keyspace contains.
Schema
CREATE TABLE schema_keyspaces (
    keyspace_name text PRIMARY KEY,
    durable_writes boolean,
    strategy_class text,
    strategy_options text
);
CREATE TABLE schema_columnfamilies (
    keyspace_name text,
    columnfamily_name text,
    bloom_filter_fp_chance double,
    caching text,
    column_aliases text,
    comment text,
    compaction_strategy_class text,
    compaction_strategy_options text,
    comparator text,
    compression_parameters text,
    default_read_consistency text,
    default_validator text,
    default_write_consistency text,
    gc_grace_seconds int,
    id int,
    key_alias text,
    key_aliases text,
    key_validator text,
    local_read_repair_chance double,
    max_compaction_threshold int,
    min_compaction_threshold int,
    read_repair_chance double,
    replicate_on_write boolean,
    subcomparator text,
    type text,
    value_alias text,
    PRIMARY KEY (keyspace_name, columnfamily_name)
);
CREATE TABLE schema_columns (
    keyspace_name text,
    columnfamily_name text,
    column_name text,
    component_index int,
    index_name text,
    index_options text,
    index_type text,
    validator text,
    PRIMARY KEY (keyspace_name, columnfamily_name, column_name)
);
This all corresponds exactly with what you see in CREATE TABLE, so this is pretty straightforward. A couple of things that might bear additional explanation:
durable_writes: allows disabling the commitlog for tables in this keyspace. Generally not recommended, but occasionally useful for temporary data or when you’re confident that replication will be adequate to keep your data safe.
subcomparator: used by obsolete SuperColumns.
component_index: used by Cassandra internally with compound primary keys.
Cluster information
Each node records what other nodes tell it about themselves over gossip:
CREATE TABLE peers (
    peer inet PRIMARY KEY,
    data_center text,
    host_id uuid,
    rack text,
    release_version text,
    rpc_address inet,
    schema_version uuid,
    tokens set<text>
);
And what it knows about itself, which is a superset of what it gossips:
CREATE TABLE local (
    key text PRIMARY KEY,
    bootstrapped text,
    cluster_name text,
    cql_version text,
    data_center text,
    gossip_generation int,
    host_id uuid,
    partitioner text,
    rack text,
    release_version text,
    schema_version uuid,
    thrift_version text,
    tokens set<text>,
    truncated_at map<uuid, blob>
);
Remember that starting with 1.2 each node can be assigned multiple tokens through virtual nodes.
There is only a single row in the local table (its key is also “local”).
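To illustrate how a driver might use these two tables, here is a minimal Python sketch that merges the single local row and the peers rows into a token-to-host map. The rows are plain dictionaries standing in for query results, and the function name, the sample addresses, and the assumption that tokens arrive as strings of integers are all hypothetical; a real driver would fetch the rows over the native protocol.

```python
def build_token_map(connected_host, local_row, peer_rows):
    """Map every known token to the address of the node that owns it.

    local_row stands in for the single row of the local table, and
    peer_rows for the rows of the peers table (both hypothetical dicts).
    """
    token_map = {}
    # The local table has no address column, so the caller supplies the
    # address of the node it is connected to.
    for token in local_row["tokens"]:
        token_map[token] = connected_host
    for row in peer_rows:
        for token in row["tokens"]:
            token_map[token] = row["rpc_address"]
    return token_map


# Made-up sample data: two nodes with two tokens each (as with vnodes).
local_row = {"key": "local", "data_center": "dc1", "tokens": {"-100", "50"}}
peer_rows = [
    {"peer": "10.0.0.2", "rpc_address": "10.0.0.2",
     "data_center": "dc1", "tokens": {"-50", "100"}},
]

token_map = build_token_map("10.0.0.1", local_row, peer_rows)
# Sort by numeric token value to get a view of the ring.
ring = sorted(token_map.items(), key=lambda item: int(item[0]))
for token, host in ring:
    print(token, host)
```

Because each node can hold several tokens, the map is keyed by token rather than by host.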
Other
The batchlog table contains data for atomic batches.
Hinted handoff records mutations to replay in the hints table.
IndexInfo stores information about index creation status and will probably be moved into schema_columnfamilies in the future.
NodeIdInfo stores counter “node ids”.
range_xfers is used to store range transfer status when upgrading a non-vnode cluster to use vnodes.
Request traces are stored not in the system keyspace, which is unreplicated, but in system_traces.
Drive responsibly
Cassandra does allow you to update data in the system keyspace, but it goes without saying that you should only do so if you know what you are doing.
Schema changes to system are not allowed.
|
Posted
about 12 years
ago
Hinted Handoff is an optional part of writes in Cassandra, enabled by default, with two purposes:
Hinted handoff allows Cassandra to offer full write availability when consistency is not required.
Hinted handoff dramatically improves response consistency after temporary outages such as network failures.
This post applies to Cassandra 1.0 and later versions.
How it works
When a write is performed and a replica node for the row is either known to be down ahead of time, or does not respond to the write request, the coordinator will store a hint locally, in the system.hints table. This hint is basically a wrapper around the mutation indicating that it needs to be replayed to the unavailable node(s).
Once a node discovers via gossip that a node for which it holds hints has recovered, it will send the data row corresponding to each hint to the target. Additionally, it will check every ten minutes to see if there are any hints for writes that timed out during an outage too brief for the failure detector to notice via gossip.
Hinted Handoff and ConsistencyLevel
A hinted write does not count towards ConsistencyLevel requirements of ONE, QUORUM, or ALL. If insufficient replica targets are alive to satisfy a requested ConsistencyLevel, UnavailableException will be thrown with or without Hinted Handoff. (This is an important difference from Dynamo’s replication model; Cassandra does not default to sloppy quorum. But, see “Extreme write availability” below.)
To see why, let’s look at a simple cluster of two nodes, A and B, and a replication factor (RF) of 1: each row is stored on one node.
Suppose node A is down while we write row K to it with ConsistencyLevel.ONE. We must fail the write: recall that the ConsistencyLevel contract is that “reads always reflect the most recent write when W + R > RF, where W is the number of nodes to block for on write, and R the number to block for on reads.”
If we wrote a hint to B and called the write good because it is written “somewhere,” the contract would be violated, because there is no way to read the data at any ConsistencyLevel until A comes back up and B forwards the data to it.
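The arithmetic behind this contract is small enough to sketch in Python. The function names here are made up for illustration and are not part of Cassandra.

```python
def reads_see_latest_write(w, r, rf):
    """True when every read set must overlap every write set (W + R > RF)."""
    return w + r > rf


def enough_live_replicas(live_replicas, w):
    """A write can only be acknowledged if at least W real replicas are up."""
    return live_replicas >= w


# The two-node example: RF = 1, and the only replica for row K (node A) is down.
rf, w, r = 1, 1, 1
contract_holds = reads_see_latest_write(w, r, rf)       # 1 + 1 > 1
write_allowed = enough_live_replicas(live_replicas=0, w=w)

# For comparison: QUORUM writes and reads at RF = 3 also satisfy the contract,
# while ONE/ONE at RF = 3 does not.
quorum_ok = reads_see_latest_write(2, 2, 3)
one_one_ok = reads_see_latest_write(1, 1, 3)
print(contract_holds, write_allowed, quorum_ok, one_one_ok)
```

A hint stored on B does not add to the live replica count, which is why the write in the example has to fail.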
Extreme write availability
For applications that want Cassandra to accept writes even when all the normal replicas are down (so even ConsistencyLevel.ONE cannot be satisfied), Cassandra provides ConsistencyLevel.ANY. ConsistencyLevel.ANY guarantees that the write is durable and will be readable once an appropriate replica target becomes available and receives the hint replay.
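As a rough illustration of the difference between ConsistencyLevel.ONE and ConsistencyLevel.ANY, here is a toy Python model of a coordinator. All class and method names are hypothetical, nodes are in-memory objects, and the hints list merely stands in for the hints table; it is a sketch of the behavior described here, not Cassandra’s implementation.

```python
class UnavailableError(Exception):
    """Stand-in for Cassandra's UnavailableException."""


class ToyNode:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}


class ToyCoordinator:
    def __init__(self):
        self.hints = []  # (target node, key, value) waiting to be replayed

    def write(self, key, value, replicas, consistency_level):
        live = [node for node in replicas if node.alive]
        down = [node for node in replicas if not node.alive]
        required = {"ANY": 0, "ONE": 1,
                    "QUORUM": len(replicas) // 2 + 1,
                    "ALL": len(replicas)}[consistency_level]
        # Hints never count towards ONE, QUORUM, or ALL.
        if len(live) < required:
            raise UnavailableError(consistency_level)
        for node in live:
            node.data[key] = value
        for node in down:
            self.hints.append((node, key, value))

    def replay_hints(self):
        """Deliver hints whose target has recovered; keep the rest."""
        remaining = []
        for node, key, value in self.hints:
            if node.alive:
                node.data[key] = value
            else:
                remaining.append((node, key, value))
        self.hints = remaining


# RF = 1 and the only replica, node A, is down.
node_a = ToyNode("A")
node_a.alive = False
coordinator = ToyCoordinator()

try:
    coordinator.write("K", "v1", [node_a], "ONE")
    one_failed = False
except UnavailableError:
    one_failed = True

coordinator.write("K", "v1", [node_a], "ANY")   # accepted, stored as a hint
hints_pending = len(coordinator.hints)

node_a.alive = True                              # A comes back up
coordinator.replay_hints()
print(one_failed, hints_pending, node_a.data)
```

Note that between the ANY write and the hint replay, the value cannot be read from any replica, which matches the “will be readable once” wording above.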
Performance
By design, hinted handoff inherently forces Cassandra to continue performing the same number of writes even when the cluster is operating at reduced capacity. So pushing your cluster to maximum capacity with no allowance for failures is a bad idea. That said, Cassandra’s hinted handoff is designed to minimize the extra load on the cluster.
All hints for a given replica are stored under a single partition key, so replaying hints is a simple sequential read with minimal performance impact.
But if a replica node is overloaded or unavailable, and the failure detector has not yet marked it down, then we can expect most or all writes to that node to fail after write_request_timeout_in_ms, which defaults to 10s. During that time, we have to keep a hint callback alive on the coordinator, waiting to write the hint when the timeout is reached.
If this happens on many nodes at once this could become substantial memory pressure on the coordinator. So the coordinator tracks how many hints it is currently writing, and if this number gets too high it will temporarily refuse writes (with UnavailableException) whose replicas include the misbehaving nodes.
Operations
When removing a node from the cluster (with decommission or removetoken), Cassandra automatically removes hints targeting the node that no longer exists.
Cassandra will also remove hints for dropped tables.
Repair and the fine print
At first glance, it may appear that Hinted Handoff lets you safely get away without needing repair. This is only true if you never have hardware failure. Hardware failure means that
We lose “historical” data for which the write has already finished, so there is nothing to tell the rest of the cluster exactly what data has gone missing
We can also lose hints-not-yet-replayed from requests the failed node coordinated
With sufficient dedication, you can get by with “only run repair after hardware failure and rely on hinted handoff the rest of the time,” but as your clusters grow (and hardware failure becomes more common) performing repair as a one-off special case will become increasingly difficult to do perfectly. Thus, we continue to recommend running a full repair weekly.
|
Posted
about 12 years
ago
Cassandra 1.2 adds a number of performance optimizations, particularly for clusters with a large amount of data per node.
Moving internals off-heap
Disk capacities have been increasing. RAM capacities have been increasing roughly in step. But the JVM’s ability to manage a large heap has not kept pace. So as Cassandra clusters deploy more and more data per node, we’ve been moving storage engine internal structures off-heap, managing them manually in native memory instead.
1.2 moves the two biggest remaining culprits off-heap: compression metadata and per-row bloom filters.
Compression metadata takes about 20GB of memory per TB of compressed data. Moving this into native memory is especially important now that compression is enabled by default.
Bloom filters help Cassandra avoid scanning data files that can’t possibly include the rows being queried. They weigh in at 1-2GB per billion rows, depending on how aggressively they are tuned.
Both of these use the existing sstable reference counting with minor tweaking to free native resources when the sstable they are associated with is compacted away.
Column index performance
Cassandra has supported indexes on columns for over two years, but our implementation has been simplistic: when an indexed column was updated, we’d read the old version of that column, mark the old index entry invalid, and add a new index entry.
There are two problems with this approach:
This needed to be done with a (sharded) row lock, so for heavy insert loads lock contention could be a problem.
If the rows being updated aren’t cached in memory, doing an update will cause a disk seek (to read the old value). This violates our design principle of avoiding random i/o on writes.
I’ve long been a proponent of having a tightly integrated storage engine in Cassandra, and this is another time we see the benefits of that approach. Starting in 1.2, index updates work as follows:
Add an index entry for the new column value
If the old column value was still in the memtable (common for updating a small set of rows repeatedly), remove the old column value
Otherwise, let the old value get purged by compaction
If a read sees a stale index entry before compaction purges it, the reader thread will invalidate it
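The four steps above can be sketched with a toy in-memory model in Python. Everything here is hypothetical: the class, the dictionaries standing in for the memtable and for flushed data, and the reading of “remove the old column value” as removing that value’s index entry. Compaction is not modeled, so stale entries are only cleaned up by reads.

```python
class ToyIndexedTable:
    def __init__(self):
        self.memtable = {}  # row key -> value, recent writes still in memory
        self.flushed = {}   # row key -> value, stand-in for data on disk
        self.index = {}     # value -> set of row keys

    def write(self, key, value):
        # Step 1: add the index entry for the new value, without reading disk.
        self.index.setdefault(value, set()).add(key)
        # Step 2: if the old value is still in the memtable, clean up now.
        old = self.memtable.get(key)
        if old is not None and old != value:
            self.index[old].discard(key)
        # Step 3: otherwise the stale entry is left for later cleanup.
        self.memtable[key] = value

    def flush(self):
        self.flushed.update(self.memtable)
        self.memtable.clear()

    def current_value(self, key):
        return self.memtable.get(key, self.flushed.get(key))

    def lookup(self, value):
        hits = set()
        for key in list(self.index.get(value, ())):
            if self.current_value(key) == value:
                hits.add(key)
            else:
                # Step 4: the reader invalidates the stale entry it found.
                self.index[value].discard(key)
        return hits


table = ToyIndexedTable()
table.write("user1", "red")
table.flush()                       # "red" is no longer in the memtable
table.write("user1", "blue")        # no read of old data, so "red" goes stale
stale_before_read = set(table.index["red"])
red_hits = table.lookup("red")      # the read notices and invalidates it
stale_after_read = set(table.index["red"])
table.write("user1", "green")       # "blue" is still in the memtable
blue_entries = set(table.index["blue"])
print(stale_before_read, red_hits, stale_after_read, blue_entries)
```

The write path never touches `flushed`, which is the point of the change: no read-before-write and no row lock around it.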
Parallel leveled compaction
Leveled compaction is a big win for update-intensive workloads, but has had one big disadvantage vs the default size-tiered compaction: only one leveled compaction could run at a time per table, no matter how many hard disks or SSDs you had your data spread across. SSD users in particular have been vocal in demanding this feature.
Cassandra 1.2 fixes this, allowing the LCS to run up to concurrent_compactors compactions across different sstable ranges (including multiple compactions within the same level).
Murmur3Partitioner
Cassandra 1.2 ships with a new default partitioner, the Murmur3Partitioner, based on the Murmur3 hash. Cassandra’s use of consistent hashing does not require cryptographic hash properties, so RandomPartitioner’s use of MD5 was just a matter of having a convenient function with good distribution built into the JDK. Murmur3 is 3x-5x faster, which translates into overall performance gains of over 10% for index-heavy workloads.
Murmur3Partitioner is NOT compatible with RandomPartitioner, so if you’re upgrading and using the new cassandra.yaml file, be sure to change the partitioner back to RandomPartitioner. (If you don’t, Cassandra will notice that you’ve picked an incompatible partitioner and refuse to start, so no permanent harm done.)
We’ve also switched bloom filters from Murmur2 to Murmur3.
NIO Streaming
Streaming is when one Cassandra node transfers an entire range of data to another, either for bootstrapping new nodes into the cluster or for repair.
When we added compression to Cassandra 1.0 we had to switch back temporarily to a manual data read-uncompress-stream process, which is much less efficient than letting the kernel handle the transfer.
1.2 adds that optimization back in as much as possible: we let the kernel do the transfer whenever we have entire compressed blocks to transfer, which is the common case.
Asynchronous hints delivery
Hinted handoff is where a request coordinator saves updates that it couldn’t deliver to a replica, to retry later.
Cassandra 1.2 allows many hints to be delivered to the target replica concurrently, subject to hinted_handoff_throttle_in_kb. This allows recovering replicas to become consistent with the rest of the cluster much faster.
Others
We’ve blogged previously about optimizing tombstone removal and making Cassandra start up faster.
|
Posted
about 12 years
ago
Cassandra 1.2 brings a number of new and improved configuration options that are worth knowing about.
Request timeouts
We’ve split the old rpc_timeout_in_ms setting into separate timeouts for [single-row] reads, range scans, writes, truncation, and miscellanea. This allows you more fine-grained control over timeouts; in particular, range queries tend to take longer than others, and truncate requires flushing so it will also be slower.
We’ve left the defaults alone for all of these but truncate, which was extended to 60s. (Incidentally, in 1.2 truncate only needs to flush the table being emptied, not every table in the cluster.)
Improved recovery from request overload
Cassandra deals with request overload by dropping requests that are so behind that they’ve timed out before being processed. Prior to Cassandra 1.2, each replica tracked request timeout locally — that is, it assumed that setting up the request on the coordinator was instantaneous. But if the coordinator is also overloaded, which is often the case, then this is not a good assumption.
For 1.2 we’ve added the ability to take the time a request spent on the coordinator into account, using the cross_node_timeout option. This is off by default, since it requires your Cassandra cluster’s clocks to be synchronized. If you have ntp enabled or otherwise synchronize your clocks, go ahead and turn cross node timeouts on.
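To make the difference concrete, here is a small Python sketch of the drop decision on a replica. The function and parameter names are invented for illustration, and the timestamps are made-up millisecond values; only the idea of measuring from the coordinator’s clock instead of the replica’s arrival time comes from the text.

```python
def should_drop(now_ms, created_on_coordinator_ms, received_on_replica_ms,
                timeout_ms, cross_node_timeout):
    """Decide whether a queued request has already timed out.

    Without cross-node timeouts the replica only counts the time since the
    request arrived locally; with them it counts from the moment the
    coordinator created the request (which requires synchronized clocks).
    """
    if cross_node_timeout:
        start_ms = created_on_coordinator_ms
    else:
        start_ms = received_on_replica_ms
    return now_ms - start_ms > timeout_ms


# A request that sat on an overloaded coordinator for 9 seconds, then waited
# 1.5 seconds in the replica's queue, with a 10 second timeout.
request = dict(now_ms=10_500, created_on_coordinator_ms=0,
               received_on_replica_ms=9_000, timeout_ms=10_000)

dropped_locally = should_drop(cross_node_timeout=False, **request)
dropped_cross_node = should_drop(cross_node_timeout=True, **request)
print(dropped_locally, dropped_cross_node)
```

With local tracking the replica would still do the work, even though the client has already given up on the request.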
End-to-end encryption
Cassandra has supported SSL between cluster nodes since 0.8. Now we’re extending that to client connections as well. Look for client_encryption_options in cassandra.yaml.
Bloom filters
Cassandra uses bloom filters in its log-structured storage engine to avoid scanning data files that can’t possibly include the partitions being queried.
Bloom filters are configured on a per-table basis, not globally like the above options. Compaction is also configured per-table.
Since leveled compaction does such a good job at minimizing the number of sstables that a given data partition can be spread across, we don’t need to be quite so aggressive with the bloom filters we create. By default, Cassandra 1.2 will use a bloom filter false positive chance of 0.1 for tables using leveled compaction, and 0.01 for tables using size-tiered compaction. This results in memory savings of about 50% for those bloom filters.
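The 50% figure can be checked with the textbook sizing formula for an optimally configured bloom filter, which needs about -ln(p) / (ln 2)^2 bits per key for a false positive chance p. The Python sketch below uses that general formula as an assumption; Cassandra’s actual sizing logic may round or bucket differently.

```python
import math


def bloom_bits_per_key(fp_chance):
    """Bits per key for an optimally configured bloom filter (textbook formula)."""
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_size_gb(num_keys, fp_chance):
    """Approximate filter size in gigabytes (10^9 bytes) for num_keys keys."""
    return num_keys * bloom_bits_per_key(fp_chance) / 8 / 1e9


size_tiered_bits = bloom_bits_per_key(0.01)   # roughly 9.6 bits per key
leveled_bits = bloom_bits_per_key(0.1)        # roughly 4.8 bits per key
savings = 1 - leveled_bits / size_tiered_bits

print(round(size_tiered_bits, 2), round(leveled_bits, 2), round(savings, 2))
print(round(bloom_size_gb(1_000_000_000, 0.01), 2), "GB per billion keys at 0.01")
```

Since ln(0.1) is exactly half of ln(0.01), moving from 0.01 to 0.1 halves the filter size under this formula.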
Others
We’ve blogged about some other configuration changes in longer articles:
The CQL binary protocol
Disk failure policy
Virtual nodes
|
Posted
about 12 years
ago
One of the new features slated for Cassandra 1.2’s release later this year is virtual nodes (vnodes). What are vnodes? If you recall how token selection works currently, there’s one token per node, and thus a node owns exactly one contiguous range in the ringspace. Vnodes change this paradigm from one token or range per node, to many per node. Within a cluster these can be randomly selected and be non-contiguous, giving us many smaller ranges that belong to each node.
What advantages does this bring to the table? Let’s consider the following scenario: we have 30 nodes and a replication factor of 3. A node dies completely, and we need to bring up a replacement. At this point the replacement node needs to get a replica for 3 different ranges to reconstitute not only the data it is the first natural replica for, but also data that it is a secondary/tertiary natural replica for (though do recall no replica has ‘priority’ over another in Cassandra; this terminology is strictly to illustrate placement on the ring). Since our RF is 3 and we lost a node, we logically only have 2 replicas left, which for 3 ranges means there are up to 6 nodes we can stream from. In current practice though, Cassandra will only use one replica from each range, so we’ll stream from 3 other nodes total.
We want to minimize how long this operation is going to take, because if we lose another node while this is happening there’s a chance we’ll be down to 1 replica for some ranges, and then all operations for that range with a consistency level greater than ONE would fail. Even if we used all 6 possible replica nodes, we’d only be using 20% of our cluster, however.
If instead we have randomized vnodes spread throughout the entire cluster, we still need to transfer the same amount of data, but now it’s in a greater number of much smaller ranges distributed on all machines in the cluster. This allows us to rebuild the node faster than our single token per node scheme.
Cassandra has worked toward increasing the amount of data that can be reasonably stored per node in many releases, and of course 1.2 will be no different with its new disk failure handling. One last wrinkle though is that if you lose one disk, you’ll have to wait on repair before anything will begin to be restored to the new disk. Repair has two phases: first a validation compaction that iterates over all the data and generates a Merkle tree, and then streaming, when the actual data that is needed is sent. The validation phase might take an hour, while the streaming only takes a few minutes, meaning your replaced disk sits empty for at least an hour. Much like the node replacement scenario I began with, with vnodes you’ll gain two distinct advantages in this situation. The first is that since the ranges are smaller, data will be sent to the damaged node in a more incremental fashion instead of waiting until the end of a large validation phase. The second is that the validation phase will be parallelized across more machines, causing it to complete faster.
Another nice advantage vnodes bring is easing the use of heterogeneous machines in a cluster. As time goes on, everyone is going to come to a point where it’s time to replace older, weaker machines with newer, more powerful ones. While in transition however, it would be nice if the newer nodes could bear more load immediately. You might be able to do this today with very careful planning and range calculation, but it would be cumbersome and error prone. If you have vnodes it becomes much simpler: you just assign a proportional number of vnodes to the larger machines. If you started your older machines with 64 vnodes per node and the new machines are twice as powerful, simply give them 128 vnodes each and the cluster remains balanced even during transition.
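Here is a small Python sketch of why proportional vnode counts balance a mixed cluster. The node names, the ring size, and the seed are arbitrary assumptions, and the random token placement is a simplified stand-in for what Cassandra does; the point is only that expected ownership is proportional to the number of tokens.

```python
import random


def expected_share(tokens_per_node):
    """Expected fraction of the ring per node with randomly placed tokens."""
    total = sum(tokens_per_node.values())
    return {node: count / total for node, count in tokens_per_node.items()}


def simulate_share(tokens_per_node, ring_size=2**32, seed=42):
    """Place random tokens and measure the fraction of the ring each node owns."""
    rng = random.Random(seed)
    placed = sorted((rng.randrange(ring_size), node)
                    for node, count in tokens_per_node.items()
                    for _ in range(count))
    width = {node: 0 for node in tokens_per_node}
    # Each token owns the range from the previous token up to itself;
    # the first token wraps around from the last one.
    previous = placed[-1][0] - ring_size
    for token, node in placed:
        width[node] += token - previous
        previous = token
    return {node: w / ring_size for node, w in width.items()}


# Four older machines with 64 vnodes each, two newer ones with 128 each.
cluster = {"old1": 64, "old2": 64, "old3": 64, "old4": 64,
           "new1": 128, "new2": 128}

expected = expected_share(cluster)
simulated = simulate_share(cluster)
for node in cluster:
    print(node, expected[node], round(simulated[node], 3))
```

The simulated shares will wobble around the expected values, with less wobble as the number of tokens per node grows.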
As you can see, virtual nodes are a large feature addition for 1.2, but don’t worry if you have an existing cluster, they won’t be forced on you and everything will work the way it did before. If you’d like to upgrade an installation to virtual nodes, that’s possible too, but I’ll save that for a later post. If you want to get started with vnodes on a fresh cluster, however, that is fairly straightforward. Just don’t set the initial_token parameter in your conf/cassandra.yaml and instead enable the num_tokens parameter. A good default value for this is 256.
|
Posted
about 12 years
ago
We’re pleased to announce the availability of the DataStax ODBC Driver for Hadoop / Hive. Our new driver conforms to the ODBC 3.52 standard, which in practice means you can use it to connect to a DataStax Enterprise cluster from many open source and proprietary BI, query, and ETL tools (e.g. Microsoft Excel, Tableau, MicroStrategy, etc.), and work with data in Hive.
To see our new ODBC driver in action, we have a new tutorial posted on our articles page that walks through the driver’s installation and setup, and shows how to use it with Microsoft Excel. The driver is also explained and detailed in our online documentation.
Installation of the DataStax ODBC Driver for Hive is quick and painless, so visit our drivers download page now and give it a try.
|
Posted
about 12 years
ago
This article is one in a series of quick-hit interviews with companies using Apache Cassandra and/or DataStax Enterprise for key parts of their business. For this interview, we spoke with Greg Greenstreet who is VP of engineering at Gnip.
DataStax: Greg, thanks for the time today. Please give us an overview of what Gnip is all about.
Greg: Primarily, we serve as the most reliable source of social data to the world. That may sound ambitious, but from a practical perspective we front publishers such as Twitter, Tumblr, Facebook, WordPress, and many more. We take the firehoses from those publishers and provide that data to our customers who want to leverage it for their business.
We’ve been in business since 2008 and currently, our customers are serving social data to 90% of the Fortune 500.
DataStax: What’s your infrastructure look like right now?
Greg: We use a combination of on-premise hardware and systems running in cloud providers. From a development perspective, we use Java for a lot of our data processing and Ruby for front end work.
DataStax: Am I right in saying you guys have a classic big data use case?
Greg: We have both the big data and big bandwidth problems to solve. From a big data perspective, we currently serve out more than 100 billion activities per month to our customers, plus we save all of that data historically. It’s not uncommon for us to digest 20,000 tweets per second, so data comes in very, very fast. And that’s just from one publisher.
We offer both real-time and historical capabilities for all our premium social data publishers. Our core business was built on real time data delivery but increasingly there is demand for a historical perspective across our publishers.
The real-time business has high bandwidth requirements, so we like to control the network and compute resources, whereas batch processing can be pushed off efficiently to cloud platforms.
DataStax: What brought you to Cassandra?
Greg: We’re not an analytics company; instead, we serve all the best companies in social media analytics, business intelligence, finance and ad tech. We are more concerned with realtime processing than batch analytics, so Cassandra was a more natural platform for us than others.
As you can imagine, the write load for us is massive. We need our systems to scale horizontally because the data for Twitter alone can triple in size in just one year. As an example, one project that keeps a week’s worth of data online in a rolling window fashion has tens of TBs just for that one week.
So we need a real-time, massively scalable architecture, where no one node is a point of failure, that can easily span multiple data centers and cloud availability zones, and that’s Cassandra.
DataStax: Did you start out using Cassandra or something else?
Greg: We began by using a Lucene-based system, but that quickly fell down in the face of the write and read loads we have.
DataStax: What are some example use cases that Cassandra covers for you?
Greg: One big area for us is compliance. For example, if someone deletes a tweet they made a year ago, we’re not allowed to serve that up historically to our customers. Cassandra was the only database that could handle that type of activity on that much data for us.
We started with Cassandra for the compliance system, but since then, we also use Cassandra for many other projects internally. Now we’re using Cassandra to serve the payload of our data; it’s the source of record for us. Cassandra’s also exceptionally good at time series data so we use it wherever time series use cases are involved.
DataStax: Do you use other databases besides Cassandra?
Greg: We still use some legacy relational databases for small application support and Redis in some areas, but it’s primarily Cassandra for big data storage and retrieval.
DataStax: What advice would you give to help people get started with Cassandra?
Greg: I’d say the primary thing to know up front is how to size the data on the nodes and determine the cluster configuration you need to support your expected I/O traffic and data volumes. Knowing how to grow the cluster efficiently and handle the various maintenance tasks is very important, especially if you’re dealing with many TBs of data like we are. If you’re going to go too far in one direction, oversizing your cluster is better than undersizing it.
DataStax: Greg, thanks for sharing what you guys are doing with Cassandra.
Greg: Sure thing.
For more information on Gnip, visit: http://gnip.com/
|
Posted
about 12 years
ago
One of the most asked questions we get at DataStax is, “How can I move data from other sources to DataStax Enterprise and Cassandra, and vice versa?” I thought I’d quickly outline the top three options that you have available.
COPY command
Cassandra 1.1 and higher supplies the COPY command, which mirrors what the PostgreSQL RDBMS uses for file export/import. The utility is used in Cassandra’s CQL shell, and allows for flat file data to be loaded into Cassandra (nearly all RDBMSs have unload utilities that allow table data to be written to OS files) as well as data to be written out to OS files. A variety of file formats and delimiters are supported including comma-separated value (CSV), tabs, and more, with CSV being the default.
The syntax for the COPY command is the following:
COPY <column family name> [ ( column [, ...] ) ] FROM ( '<filename>' | STDIN ) [ WITH <option>='value' [AND ...] ];
COPY <column family name> [ ( column [, ...] ) ] TO ( '<filename>' | STDOUT ) [ WITH <option>='value' [AND ...] ];
Below are simple examples of the COPY command in action:
cqlsh> SELECT * FROM airplanes;
 name          | mach | manufacturer | year
---------------+------+--------------+------
 P38-Lightning |  0.7 |     Lockheed | 1937
cqlsh> COPY airplanes (name, mach, year, manufacturer) TO 'temp.csv';
1 rows exported in 0.004 seconds.
cqlsh> TRUNCATE airplanes;
cqlsh> COPY airplanes (name, mach, year, manufacturer) FROM 'temp.csv';
1 rows imported in 0.087 seconds.
See our online documentation for more information about the COPY command.
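If your source system has no convenient unload utility, a few lines of Python can produce a file suitable for COPY ... FROM. This is only a sketch: the helper name and the sample row are made up, and it assumes the default comma-separated format with fields in the same order as the column list given to COPY.

```python
import csv
import io

# Column order used both for the file and for the COPY column list.
COLUMNS = ["name", "mach", "year", "manufacturer"]


def rows_to_csv(rows, columns):
    """Serialize dict rows to CSV text, one line per row, in column order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[column] for column in columns])
    return buffer.getvalue()


rows = [{"name": "P38-Lightning", "manufacturer": "Lockheed",
         "year": 1937, "mach": 0.7}]

csv_text = rows_to_csv(rows, COLUMNS)
copy_statement = "COPY airplanes ({}) FROM 'temp.csv';".format(", ".join(COLUMNS))

print(csv_text, end="")
print(copy_statement)
```

Writing `csv_text` to temp.csv and running the printed statement in cqlsh would then load the row, provided the column list matches the field order in the file.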
Sqoop
DataStax Enterprise Edition 2.0 and higher includes support for Sqoop, which is a tool designed to transfer data between an RDBMS and Hadoop. We modified Sqoop so you can not only transfer data from an RDBMS to a Hadoop node in a DataStax Enterprise cluster, but also move data directly into Cassandra as well.
I wrote a short tutorial you can reference on how to use Sqoop to move data from MySQL into Cassandra that can be mimicked for any other RDBMS. There’s also a demo of Sqoop that you’ll find in the /demos directory of a DataStax Enterprise download/installation.
ETL Tools
If you need more sophistication applied to a data movement situation (i.e. more than just extract-load), then you can use any number of extract-transform-load (ETL) solutions that now support Cassandra. These tools provide excellent transformation routines that allow you to manipulate source data in literally any way you need and then load it into a Cassandra target. They also supply many other features such as visual, point-and-click interfaces, scheduling engines, and more.
Happily, many ETL vendors who support Cassandra supply community editions of their products that are free and able to solve many different use cases. Enterprise editions are also available that supply many other compelling features that serious enterprise data users need.
You can freely download and try ETL tools from Jaspersoft, Pentaho, and Talend that all work with DataStax Enterprise and community Cassandra.
Conclusion
The good news is you have multiple methods to use for moving data from most any database source system into DataStax Enterprise and Cassandra. If there’s another ETL tool or similar product you’re using that doesn’t support DataStax Enterprise and/or Cassandra, please contact us and we’ll see what we can do to change that.
If you haven’t done so already, download a copy of DataStax Enterprise Edition, which contains a production-ready version of Cassandra, along with Hadoop for analytics and Solr for enterprise search. It’s completely free to use for as long as you like in your development environments.
|