Posted about 13 years ago by manuel
Today we've released DataCleaner 2.5.1. This is a maintenance release with only minor bugfixes and improvements. But nevertheless we encourage users to upgrade!
Here's what's new in DataCleaner 2.5.1:
A bug in the Table lookup transformation, which prevented it from producing multiple output columns, was fixed.
CSV file escape characters have been made configurable.
A minor bug pertaining to empty strings in the Concatenator was fixed.
Support for the Cubrid database was added.
The converter transformations were adapted to work on multiple fields, not just single fields.
For more information, please refer to the 2.5.1 milestone in the trac system.
We hope you enjoy the new version of DataCleaner!
Posted over 13 years ago by kasper
Today we announce the general availability of DataCleaner 2.5! This release is the result of months of hard work by the core DataCleaner crew, the EasyDQ group and the community at large.
Let’s get straight to the “What’s new” question. There are plenty of major improvements in this release:
Saving results to disk
With DataCleaner 2.5 you can save, archive and share your analysis results. This is not only a time-saver for those who used to export analysis results manually; it is also a way to improve your methodology for handling profiling results, sharing them with colleagues and archiving historical profiles of your data.
Saving is implemented so that future versions and/or custom solutions can take advantage of the results, potentially using them for scheduled profiling, data quality monitoring and more.
Data structure transformers
With the rise of Big Data and NoSQL databases comes more advanced data structures. In next generation databases we see key/value pairs and list structures that are cumbersome to deal with in tools built for traditional relational data. To solve these issues DataCleaner 2.5 ships with a new set of “data structure” transformers, which allow you to easily wrap and unwrap structures, to be able to get to the parts that you want to analyze or process.
The data structure transformers also include parsers and writers for JSON data, which is one of the more common representations of NoSQL data structures.
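The "unwrap" idea can be sketched in a few lines of plain Java. The class and method names below are illustrative, not DataCleaner's actual API; the point is what unwrapping a key/value structure means in practice: flattening nested maps into dot-separated column names that downstream components can analyze.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch (hypothetical class, not DataCleaner's API):
// "unwrapping" a key/value structure flattens nested maps into flat,
// dot-separated column names.
public class UnwrapSketch {

    // Recursively flattens nested maps: {"name": {"first": "John"}}
    // becomes {"name.first": "John"}.
    public static Map<String, Object> flatten(Map<String, Object> struct) {
        Map<String, Object> flat = new LinkedHashMap<>();
        flattenInto("", struct, flat);
        return flat;
    }

    private static void flattenInto(String prefix, Map<String, Object> struct,
            Map<String, Object> flat) {
        for (Map.Entry<String, Object> entry : struct.entrySet()) {
            String key = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            Object value = entry.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) value;
                flattenInto(key, nested, flat); // recurse into nested structures
            } else {
                flat.put(key, value); // leaf values become regular columns
            }
        }
    }
}
```

Once flattened this way, each leaf of the structure behaves like an ordinary column that any analyzer or transformer can process.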
Filters and transformers are now all "Transformations"
Since DataCleaner 2.0 we’ve been pushing the idea of transformers and filters. The strength of these two component types was evident from a technical perspective, but for the end user the distinction proved to distract from the main use case: processing data in a flow of actions. DataCleaner 2.5 therefore consolidates the two terms under a common metaphor: Transformations. This means users no longer have to look in multiple menus to find the component they are looking for.
New EasyDQ transformations: Merge duplicates and Due diligence check
The EasyDQ on-demand data quality platform team has also been busy. We present to you three new functions and an optional extension for the advanced users.
First is the Merge duplicates transformation. With this transformation you can turn your results from Duplicate detection into merged, golden records! The merge component is designed to handle a hierarchy of criteria when merging, making sure that criteria such as well-formedness, update date and manual overriding are taken into account.
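As a rough illustration of the hierarchy-of-criteria idea, here is a minimal Java sketch. The field names and the exact criteria ordering are hypothetical, not EasyDQ's actual implementation; the point is that each later criterion only breaks ties left by the earlier ones.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical survivorship sketch: pick the "golden record" among a
// group of detected duplicates by an ordered hierarchy of criteria.
public class MergeSketch {

    public static class Record {
        final boolean manuallyApproved; // manual overriding wins first
        final long updatedAt;           // then the most recent update date
        final int wellFormednessScore;  // then the best-formed values

        public Record(boolean manuallyApproved, long updatedAt, int wellFormednessScore) {
            this.manuallyApproved = manuallyApproved;
            this.updatedAt = updatedAt;
            this.wellFormednessScore = wellFormednessScore;
        }
    }

    // Criteria chained in priority order: later comparators only apply on ties.
    private static final Comparator<Record> CRITERIA = Comparator
            .comparing((Record r) -> r.manuallyApproved)
            .thenComparingLong(r -> r.updatedAt)
            .thenComparingInt(r -> r.wellFormednessScore);

    public static Record selectSurvivor(List<Record> duplicates) {
        return duplicates.stream().max(CRITERIA)
                .orElseThrow(IllegalArgumentException::new);
    }
}
```

A real merge would also combine field values from the losing records; the sketch only shows how a criteria hierarchy ranks the candidates.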
Secondly, we’ve introduced two services for Due diligence checks. These transformations help you validate that the people you are doing business with are not connected to sanction lists covering terrorism, narcotics trafficking and other security threats.
These new features, as well as the other EasyDQ functions, are described in detail in the EasyDQ reference documentation.
Lastly, there's a new extension available, the EasyDQ essentials, which we recommend as a handy extra toolkit for those who want to dive deep into the features of EasyDQ.
Defining datastore properties on the command line
One of the areas that has been heavily reinforced in recent releases of DataCleaner is the command line interface. Using this interface you can set up DataCleaner to execute in any environment, in a scheduled or managed fashion. In DataCleaner 2.5 we’ve also made it possible to override datastore properties from the command line. Why? Because it allows you to reuse the same job on different datastore definitions. If, for example, you are scanning a directory for CSV files and want to run a DataCleaner job on each file, this is the solution for you. Refer to the documentation for further explanation and examples.
Drill to detail information in value distribution results
The Value distribution analyzer now contains a drill-to-detail option, making it possible to see the source records behind each value in the distribution. This greatly helps usability when doing exploratory data profiling.
Database-specific connection panels
The dialogs for setting up database connections have been enhanced with database-specific connection properties. This makes it a lot easier for the end-user to connect to a database without having to know the details of constructing a connection URL.
Database-specific configuration panels have been created for MySQL, PostgreSQL, Microsoft SQL Server and Oracle. Other database types are supported using the traditional way of connecting, as in previous versions of DataCleaner.
Execution and scheduling of DataCleaner jobs using Pentaho Data Integration
Pentaho Data Integration (PDI, aka. Kettle) is an open source ETL product that the EasyDQ and DataCleaner teams have had a lot of interaction with. Alongside the DataCleaner 2.5 release, we can announce that the next version of Pentaho Data Integration will be able to execute and schedule DataCleaner jobs using Pentaho’s infrastructure.
While this is not yet released software as of today, we look forward to telling you more about it in the near future!
For those still reading, we also did some minor improvements in DataCleaner 2.5:
We’ve added some number transformations for generating IDs, incrementing numbers and more.
Implemented a Date range filter, similar to the Number range and String range filters.
Support for matching against Synonym catalogs in the Reference data matcher (previously known as the Matching analyzer).
Now all components have flow visualizations in their configuration panel. This feature helps retain the overview when working with large analysis jobs.
The sample data (the ‘orderdb’ database) has been reworked to contain better examples of data quality issues.
User experience improvements: more elegant dialog designs and a trimmed window layout.
We hope you all enjoy the new release of DataCleaner 2.5. Please let us know what you think on the forums, or on our LinkedIn group, or on Google Plus, or on Blogger, or tweet it, or...
Posted over 13 years ago by kasper
MetaModel version 2.2.2 has just been released. This is a maintenance release to the 2.2 branch of MetaModel, containing primarily bugfixes and a few small but useful feature enhancements:
The MongoDB module now supports having Maps and Lists as column types. This means the table model applied to MongoDB is now structurally compatible with MongoDB's native model (which is key/value based).
Query filters now support logical AND operators. Previously AND was implied between all filter items and therefore not offered as a choice, but when nested filters with AND + OR combinations are needed, the new AND operator is useful.
The DataContext.getColumnByQualifiedLabel(...) method is now fault-tolerant towards case differences.
DataSets are now automatically closed when garbage collected. Although relying on this is not desirable, it provides a late safeguard against unclosed resources.
A bug in the DataContextFactory.createExcelDataContext(...) method, which caused it to run into a stack overflow, was fixed.
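To illustrate why an explicit AND operator matters once filters can be nested, here is a small conceptual sketch using plain java.util.function.Predicate rather than MetaModel's FilterItem API: an expression such as (positive AND even) OR zero cannot be written if AND is only ever implied between top-level filter items.

```java
import java.util.function.Predicate;

// Conceptual illustration of nested AND + OR filter combinations,
// using plain JDK predicates instead of MetaModel's actual filter API.
public class FilterSketch {

    public static Predicate<Integer> positiveAndEvenOrZero() {
        Predicate<Integer> positive = n -> n > 0;
        Predicate<Integer> even = n -> n % 2 == 0;
        Predicate<Integer> zero = n -> n == 0;
        // Nested combination: (positive AND even) OR zero
        return positive.and(even).or(zero);
    }
}
```

The explicit AND lets the inner conjunction be grouped as one operand of the outer OR, which is exactly the shape an implied top-level AND cannot express.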
A detailed view of the work done can be seen in the milestone view.
We hope you enjoy the new release of MetaModel. Please provide your feedback on the MetaModel online forums.
Posted over 13 years ago by kasper
The EasyDQ on-demand data quality platform, which DataCleaner is integrated with, has released a patch for DataCleaner version 2.4.2. The patch includes a critical bugfix for the Inter-Dataset matching analyzer.
If you're using this functionality, please download the patch and place it in the lib/ folder of DataCleaner. This will automatically apply the fix, and matching multiple datasets will work again.
The patch has also been applied to the Java WebStart version of DataCleaner, so WebStart users will not need to do anything.
Posted over 13 years ago by kasper
We've just released DataCleaner version 2.4.2, which is a bugfix and minor enhancements release. Please update to this latest version, which has a whole bunch of items fixed:
Database connections can now specify whether multiple connections may be made. This solves an issue with databases that did not allow multiple connections, and a potential application halt when no more connections were available.
There's now a separate distribution of DataCleaner specifically for Mac OS. Using this version of DataCleaner you'll see much nicer OS integration than before.
Performance of the engine has been improved by providing some job-level metrics as lazy loaded values. For instance, the estimated row count is now lazy loaded, so in situations where this metric is not needed (eg. the command line interface and embedded use of DataCleaner), it will not be calculated.
The command line interface now has additional options to save the results of an analysis to a file, given a variety of output formats. Saved files can later be opened in the User Interface, allowing for a DIY data quality monitoring solution (see Kasper Sørensen's blog for more details).
An issue with correct prefixing of table names in INSERT statements was fixed in the downstream dependencies for the "Insert into table" component.
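The lazy-metric idea mentioned above can be sketched generically. This is not DataCleaner's actual class, just a minimal illustration of computing an expensive value, such as an estimated row count, at most once and only on demand:

```java
import java.util.function.Supplier;

// Hypothetical sketch of a lazily loaded metric: the computation runs
// only when something asks for the value, and at most once.
public class LazyMetric<T> {

    private final Supplier<T> computation;
    private T value;
    private boolean computed;

    public LazyMetric(Supplier<T> computation) {
        this.computation = computation;
    }

    public synchronized T get() {
        if (!computed) {
            value = computation.get(); // the expensive part runs here, once
            computed = true;
        }
        return value;
    }
}
```

If no caller (such as a command line run that doesn't display a progress bar) ever invokes get(), the expensive computation is skipped entirely.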
For full details about all changes, check out the trac roadmap for DataCleaner 2.4.2, AnalyzerBeans 0.10 and MetaModel 2.2.1.
Posted over 13 years ago by kasper
As our New Year's present to all of you, we have a new release of DataCleaner. DataCleaner 2.4.1 is largely a release of bugfixes and minor feature enhancements.
Here's an overview of the improvements we've made:
Feature enhancements:
Batch loading features were greatly improved when writing data to database tables. Expect to see many orders of magnitude improvement here.
Writing data has been made more conveniently available by adding the options to the window menu.
You can now easily rename components of a job by double clicking their tabs.
The JavaScript transformer now has syntax coloring, so that your scripts are easier to inspect and modify.
Bugfixes:
When reading from and writing to the same datastore (eg. the DataCleaner staging area) we've made sure that the table cache of that datastore is refreshed. Previously some scenarios allowed you to see an out-of-date view of the tables.
A potential deadlock when starting up the application was solved. This deadlock was a consequence of an issue in the JVM, but we worked around it by synchronizing all calls to the particular API in Java.
The full list is also available on the DataCleaner 2.4.1 milestone in the roadmap.
The 2.4.1 release should work as a drop-in replacement of DataCleaner 2.4, so we encourage everyone to upgrade. Get it on the downloads page. Happy new year.
Posted over 13 years ago by kasper
Today we're announcing the release of MetaModel version 2.2! This new release represents an effort to sanitize, streamline and make the API of MetaModel more flexible. The two major areas of improvement are:
Introduction of an interceptor layer, which can be used for many purposes, for instance automatic conversion of data types. The interceptor layer allows you to enrich MetaModel's functionality and to monitor queries and updates on your data.
Improvement of the JDBC write speed by carefully adapting it to use batch updates, prepared statements and controlled commits. With these improvements MetaModel is becoming much more appropriate as a data writing API, even for large batches of data.
There are also a few other smaller additions in this release. You can read about them all on the what's new in version 2.2 page, and you can download the library from Google Code or use it as a Maven artifact, as always.
Posted over 13 years ago by kasper
Merry Christmas! Today we announce the release of DataCleaner 2.4, which marks a huge joint effort by the community and the team at Human Inference to bring together the best ideas of open source and cloud-based Data Quality.
Here's what's new in DataCleaner 2.4:
EasyDataQuality integration
With DataCleaner 2.4 we've made an alliance with the newly launched EasyDQ.com service, which offers cloud-based Data Quality services. The services provided are:
Duplicate detection (aka. Deduplication or Fuzzy matching of records), which is free to use for up to 500,000 values.
Address data validation and cleansing. This allows you to check if addresses exist, if they are correctly formatted and even to suggest corrections in case you have mistakes.
Name data validation and cleansing. With the Name service, EasyDQ not only formats your names consistently, but also checks for misspellings and interprets the name parts.
Email and phone validation and cleansing. These services provide checking of email and phone data, making sure that email domains exist, that country codes are correct and much more.
No, these are not open source services, but they are offered at a reasonable price, along with a free starter package, and we truly believe the integration makes DataCleaner a much better tool for those who want it.
New analysis job components
Many of DataCleaner's users have reported that they use DataCleaner as a lightweight ETL tool, because we already support basic reading, transformation and writing capabilities. With 2.4 we've added a few crucial components for this use case, where you want to do ad-hoc transformations and data quality checks, and then actually write the data back to your database:
Table lookup, which allows you to look up any number of values based on any number of conditions. The lookup component has an intelligent caching mechanism and is highly performant. (Docs).
Insert into table is a new option when writing data. With this option, DataCleaner can not only produce new files, but also insert records into existing databases, which makes for a much more flexible writing option.
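The caching idea behind the Table lookup component can be sketched as follows. The class is hypothetical, not DataCleaner's own code; the loader function stands in for the actual database query:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a caching lookup: results for keys that were
// already looked up are served from memory, so the backing datastore is
// only queried once per distinct key.
public class LookupCache<K, V> {

    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> backingLookup; // stands in for the real query
    private int misses;

    public LookupCache(Function<K, V> backingLookup) {
        this.backingLookup = backingLookup;
    }

    public V lookup(K key) {
        return cache.computeIfAbsent(key, k -> {
            misses++; // only distinct keys reach the backing store
            return backingLookup.apply(k);
        });
    }

    public int getMisses() {
        return misses;
    }
}
```

When the incoming rows contain many repeated keys, which is typical for lookups against reference tables, most lookups become in-memory map hits instead of round trips to the database.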
MongoDB support! And a few more...
Another theme in DataCleaner 2.4 is support for the popular NoSQL database MongoDB. The support covers both profiling (i.e. reading and analyzing data) and writing data to MongoDB collections using the Insert into table component, which makes DataCleaner the first open source tool to offer data flow modelling and ETL functionality for MongoDB! We also improved a few other datastores:
Support for MongoDB datastores, which are both readable and writable with DataCleaner. MongoDB uses a schemaless design principle, so you can either let DataCleaner auto-detect a virtual schema or define it yourself. (Docs).
Added more configuration options for fixed-width value files. Specifically, there is now an option to specify the header line number.
Added support for custom table mapping of XML structures. For large XML files this is the recommended approach, since with a fixed table model DataCleaner can do SAX-based XML parsing, which is much less memory intensive and a lot faster. (Docs).
The Command Line Interface (Docs) has been further improved by allowing you to inject job variables from the command line, which makes it possible to parameterize jobs and thereby reuse them for different purposes.
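To see why SAX parsing keeps memory use flat for large XML files, here is a self-contained example using the plain JDK SAX parser (not DataCleaner's own XML mapping code): the document is consumed as a stream of events rather than loaded as a full in-memory tree.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Streaming (SAX) parsing demo: each element arrives as a callback
// event, so memory use does not grow with document size.
public class SaxSketch {

    public static int countElements(String xml, String elementName) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                            String qName, Attributes attributes) {
                        if (qName.equals(elementName)) {
                            count[0]++; // one event per element, never a full tree
                        }
                    }
                });
        return count[0];
    }
}
```

With a fixed table model, a handler like this can map each repeating element directly to a table row as it streams past, which is the essence of the approach described above.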
Besides these points, a few bugs were fixed and some minor features added. For a full list of changes, check out the DataCleaner 2.4 milestone description in trac.
We hope you enjoy DataCleaner 2.4. We built it to be used, so go grab it right away on the downloads page!
Posted over 13 years ago by kasper
There's a new and nice extension ready for you at the ExtensionSwap: The network tools extension.
Network tools can be used to work with IP addresses in data, resolve hostnames and more. Give it a look if you're dealing with network addresses (or eg. email addresses, website visitors etc.) in your data.
Posted over 13 years ago by kasper
MetaModel, the universal data access library that provides SQL-like querying capabilities to databases and other data formats alike, has just been released in version 2.1.
The 2.1 version of MetaModel is an exciting one. The primary achievement of this release is a mapping model for non-tabular datastores such as the NoSQL database MongoDB and XML files. These two data formats, which previously required custom conversion and custom query implementations, can now be queried (and, in MongoDB's case, also modified) in a standard fashion. For both MongoDB and XML files you can either let MetaModel auto-detect a table model (which may not be perfect, but is good to begin with) or specify your own table definitions and let MetaModel figure out the rest.
The 2.1 release also includes a few bugfixes on top of the previous 2.0.2 release.
You can read all about what is new in 2.1 on the MetaModel website.
MetaModel can be downloaded as an independent distributable or used as a Maven-style artifact in projects that use Maven.
We hope you like the new release. Please let us know of your experiences, either on the MetaModel forum or Google group.