Posted almost 11 years ago by kasper
We've cut another release of DataCleaner - version 3.5.10! And although this is "just" a minor release version bump, the changes are pretty encouraging and the version numbering scheme does not really do it justice.
So... What's new then?
You can now compose jobs so that a DataCleaner job actually calls/invokes another "child" job as a single transformation. This is an important feature because it allows users to organize and compose complex data processing flows into smaller chunks of work. The new "Invoke child Analysis Job" transformation inlines the transformation section of the child job at execution time, which means that there is practically no overhead to this approach.
As a convenience for the above scenario, it is now possible to save jobs without any analysis section in them. These jobs will thus be "incomplete", but that might actually be the point when composing and putting jobs together.
Another new transformation was added: Coalesce multiple fields. This transformation is useful for scenarios where multiple sets of fields are interchangeable, or when multiple interchangeable transformations produce the same set of fields. The "coalesce" transformation can roughly be translated into "pick the first non-empty values". When there are multiple sets of fields in your data processing stream, for instance multiple address definitions, and you need to select just one, then this is very convenient.
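To make the "pick the first non-empty values" idea concrete, here is a minimal Python sketch. It is not DataCleaner's implementation: the function names, the field names and the rule that a set of fields qualifies as soon as any of its values is non-empty are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def is_empty(value: Optional[str]) -> bool:
    # Assumption: None and blank strings both count as "empty".
    return value is None or value.strip() == ""


def coalesce_units(record: dict, units: Sequence[Sequence[str]]) -> Optional[tuple]:
    """Return the values of the first set of fields that holds any non-empty value."""
    for unit in units:
        values = tuple(record.get(field) for field in unit)
        if any(not is_empty(v) for v in values):
            return values
    return None  # every set of fields was empty


# Hypothetical record with two interchangeable address definitions.
record = {
    "billing_street": "",
    "billing_city": None,
    "shipping_street": "Main St 1",
    "shipping_city": "Copenhagen",
}
units = [
    ("billing_street", "billing_city"),
    ("shipping_street", "shipping_city"),
]

print(coalesce_units(record, units))  # ('Main St 1', 'Copenhagen')
```

The order of the field sets expresses priority: the billing address would win if it were filled in.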
The handling of source columns has been simplified. Previously we tried to limit the source queries to only the source columns that were strictly needed to perform the analysis. But many users gave us the feedback that this caused trouble because the drill-to-detail information available in the analysis results would then be missing important fields for further exploration. So the power is now in the hands of the users: The fields added in the "Source" section of the job are the fields that will be queried.
A change was made to the execution engine in dealing with complex filtering and requirement configurations. Previously, if a component (transformation or analysis) consumed inputs from other components, ALL requirements had to be satisfied, which mostly just caused the requirement never to become true. Now the logic has been changed to be inclusive so that if any of the direct input sources' requirements are satisfied, then the component's inferred requirement is also satisfied. Most users will not notice this change, but it does mean that it is now possible to merge separate filtered data streams back into a single stream.
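The difference between the old and the new requirement logic can be sketched in a few lines of Python. This is only an illustration under assumed names (the two functions and the VALID/INVALID outcomes are hypothetical), not the execution engine's actual code.

```python
from typing import Sequence


def inferred_requirement_all(input_requirements: Sequence[bool]) -> bool:
    """Previous behaviour: every input's requirement must hold for the record."""
    return all(input_requirements)


def inferred_requirement_any(input_requirements: Sequence[bool]) -> bool:
    """New behaviour: it is enough that one direct input's requirement holds."""
    return any(input_requirements)


# A filter has classified this record as VALID. Two branches exist, one
# requiring VALID and one requiring INVALID, and a downstream component
# consumes columns from both branches in order to merge them.
record_outcome = "VALID"
branch_requirements = [record_outcome == "VALID", record_outcome == "INVALID"]

print(inferred_requirement_all(branch_requirements))  # False
print(inferred_requirement_any(branch_requirements))  # True
```

Since a record can only have one outcome per filter, the "all" variant can never be true for the merging component, which is why the inclusive variant is needed to bring the streams back together.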
An issue was fixed in the access to repository files. Read/write locking is now in place which avoids access conflicts by different processes.
The 'requirement' button in DataCleaner has also been reworked. It did not always properly respond to changes in other panels, but now it is consistent.
Finally, the 'About' dialog was improved slightly and now contains more licensing information :-)
We hope you will enjoy this release of DataCleaner. Head over to the downloads page and get your copy now.
|
Posted almost 11 years ago by kasper
We want to stay on top of technology that enhances collaboration and involvement. Therefore we have made a significant move of the DataCleaner source code from our Subversion system towards the social coding platform GitHub. This move was made to give the community further tools for collaboration and to also benefit from the improved source control system features of Git itself.
With GitHub we now have a central and social platform where anyone can pitch in on the development effort. One particularly useful tool for contributors is that they can submit pull requests, which are basically suggested changes made in their own maintained copies of the source code - without necessarily impacting the main code tree.
We will be embracing GitHub for the technical development of DataCleaner only. This means that end users should not be much concerned about this move, but developers should be using GitHub for source code and issue management.
Visit our GitHub organization 'datacleaner' and check out the projects there. That includes both the projects you know, and maybe also some new ideas that you didn't know about.
Or go directly to the main projects: AnalyzerBeans (the processing engine of DataCleaner) and the DataCleaner project itself.
|
Posted about 11 years ago by kasper
Hi everyone!
We've just released DataCleaner version 3.5.7!
For this release we've made 4 important improvements to performance and stability. So although it doesn't seem like a big release in numbers or functionality, it's a good one since we spent the time on making an already good product better at what it does best.
The issues resolved in this release are:
A flag has been added to the CSV datastore options, making it possible to disable support for values in CSV files that span multiple lines. Disabling this feature in our CSV parser enabled us to increase parsing speed significantly and at the same time handle poorly/inconsistently formatted CSV files much better. Since many CSV files don't contain values that span multiple lines anyway, we think this is a great way to gain the extra performance and stability.
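The effect of such a flag can be illustrated with Python's standard csv module. This is a sketch of the general idea only and says nothing about how DataCleaner's own CSV parser is built; the sample data is made up.

```python
import csv
import io

# Hypothetical file content: the quoted value in row 1 spans two physical lines.
data = 'id,comment\n1,"first line\nsecond line"\n2,ok\n'

# Multi-line aware parsing: the parser has to track quote state across line
# breaks, so the quoted value stays together in a single record.
rows_multiline = list(csv.reader(io.StringIO(data)))

# Single-line parsing: every physical line is parsed on its own, so no state
# is carried from one line to the next.
rows_single_line = [next(csv.reader([line])) for line in data.splitlines()]

print(len(rows_multiline))    # 3 (header plus two records)
print(len(rows_single_line))  # 4 (the multi-line value is split in two)
```

Because each line is independent in the second mode, a stray quote character can only damage the line it sits on rather than swallowing the rest of the file, which matches the robustness argument above.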
A change was made to the way we monitor progress log information. This means that we now have a much more effective and performant way to monitor progress of DataCleaner jobs, which especially speeds up performance on the server side.
A minor modification to the progress logs has been implemented: The progress information statements now always show the time of the statement.
A minor bug was fixed: The CSV datastore dialog of the monitor web application would sometimes show an unexpected error if you did not fill out escape characters, quote characters and so on.
You can grab the new version of DataCleaner at the downloads page - enjoy!
|
Posted about 11 years ago by kasper
We've just cut another release of DataCleaner with some minor cosmetic/specialized bugfixes and improvements. We're happy to be able to make users happier with these little additions to our favourite open source data quality tool:
The monitoring webapp's CSV datastore dialog now supports TXT files as well as CSV and TSV files.
A bug was fixed in the "Max rows" filter's tab in the UI, which sometimes made the tabs of other components uncloseable as well.
A bug was fixed that sometimes caused the order of selected input columns of a component not to be retained when saving and loading the job.
Various improvements to API and stability of internal utilities.
For the extra curious reader: check the milestone report. And go download DataCleaner 3.5.6 right now!
|
Posted over 11 years ago by kasper
We've just released DataCleaner 3.5.5, which is primarily a minor bugfix release. Here's a summary of the improvements made:
The 'Synonym lookup' transformation now has an option to look up every token of the input. This is useful if you're doing replacement of synonyms within the values of a long text field.
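The difference between looking up the whole value and looking up every token can be sketched as follows. The synonym table, the whitespace tokenization and the case-insensitive matching are assumptions made for this example, not a description of how the transformation is implemented.

```python
# Hypothetical synonym catalog: maps a synonym to its master term.
SYNONYMS = {"st": "street", "rd": "road", "ave": "avenue"}


def lookup_whole_value(value: str) -> str:
    """Replace the value only if the complete value is a known synonym."""
    return SYNONYMS.get(value.lower(), value)


def lookup_every_token(value: str) -> str:
    """Replace each whitespace-separated token that is a known synonym."""
    return " ".join(SYNONYMS.get(token.lower(), token) for token in value.split())


print(lookup_whole_value("12 Main St"))  # 12 Main St
print(lookup_every_token("12 Main St"))  # 12 Main street
```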
Blocking execution of DataCleaner jobs through the monitor's web service could sometimes fail because of a bug caused by the blocking thread. This issue has been fixed.
An improvement was made in the way jobs and the sequence of components are closed / cleaned up after execution.
The JNLP / Java WebStart version of DataCleaner was affected by a bug in the Java runtime causing certain JAR files not to be recognized by the WebStart launcher, under certain circumstances. This issue has been fixed by making slight modifications to those JAR files.
A few dead links in the documentation were fixed.
You can download the new DataCleaner now at the downloads page! Do let us know what you think of it on the discussion forum.
|
Posted over 11 years ago by kasper
DataCleaner version 3.5.4 has just been released and is available for download as of now.
This is primarily a bugfix release, but a few minor improvements have also made the cut for the release. Here's a summary.
It is now possible to hide output columns of transformations. Hiding will not affect the processing flow at all, but simply hides them from the user interface, potentially making the experience cleaner when interacting with other components.
A new web service has been added to the monitoring web application, which provides a way to poll the status of the execution of a particular job.
A bug was fixed, causing the HTML report to fail for certain analysis types when no records had been processed.
And 6 other minor bugs have been addressed.
For more details, consult the milestone summary in our issue tracking system. We hope you enjoy this release, and encourage you to provide feedback in any way possible.
|
Posted over 11 years ago by kasper
Hello everyone,
A little summer holiday treat for everyone: Last Friday we released DataCleaner 3.5.2 ... And then today, a few days later, we have just released DataCleaner 3.5.3. The reason being that these are bugfix releases and unfortunately one bug escaped the first release. Sorry about that, but rest assured that both releases contributed to an overall better product.
The improvements made are:
A bug was fixed which caused the DataCleaner monitor to show a result link for all jobs, even if they didn't produce a result. This only happened rarely though, for instance when building a custom Java job that returns null.
An advanced JavaScript transformer was added to the portfolio of built-in transformations. Using this transformer the user can build a stateful JavaScript object which is capable of transforming, aggregating and filtering records.
Job and Datastore wizards now have 'Back' buttons.
A new dedicated 'extensions' folder is available in the DataCleaner desktop application. Use this folder to dump extension JAR files in, if you want them to be automatically loaded during application startup.
A new service was added to DataCleaner monitor, which enables administrators to download and upload (backup and restore) a complete monitoring repository in one go.
A bug was fixed which caused the desktop application's "DataCleaner monitor" dialog to crash when using default user preferences.
Head on over to the downloads page to get this latest release!
|
Posted over 11 years ago by kasper
For the past months we've been working on a proposal to donate the MetaModel project to the Apache Software Foundation, where it will live initially as an Incubator project. And today the vote for accepting the project has ended - with 18 votes for, and none against - so we are extremely proud and happy to announce that MetaModel will be getting a new home at Apache.
The impact of this project change will be profound on the one hand; on the other, we will do our best to please the existing user and developer base. We hope that you will also help us in this process by having your voices heard on the new Apache dev. infrastructure, mailing lists etc. once that is in place.
More details will follow as soon as the new environment for Apache MetaModel is available.
|
Posted over 11 years ago by kasper
It's always a bit difficult to write a really enthusiastic release announcement about a release that is essentially a bugfix release. And then again ... We've just released DataCleaner 3.5.1 and it is definitely mostly a "minor improvements" release, but some of these minor improvements are actually pretty cool! Let's have a look at a few highlights:
Capture changed records
A new filter was added to enable incremental processing of records that have not been processed before, e.g. for profiling or copying only modified records. The new filter's name is Capture changed records, referring to the concept of Change data capture.
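The general idea of change data capture can be sketched in Python as below. The approach shown (comparing a "last modified" timestamp against a stored high-water mark) is one common way to do it and is an assumption here; the function and field names are hypothetical and do not describe the filter's actual configuration.

```python
from datetime import datetime


def capture_changed(records, last_processed, field="last_modified"):
    """Return the records changed since 'last_processed' plus the new high-water mark."""
    changed = [r for r in records if last_processed is None or r[field] > last_processed]
    # Keep the old mark when nothing changed in this run.
    new_mark = max((r[field] for r in changed), default=last_processed)
    return changed, new_mark


records = [
    {"id": 1, "last_modified": datetime(2013, 6, 1)},
    {"id": 2, "last_modified": datetime(2013, 6, 3)},
]

# First run: nothing has been processed yet, so every record is captured.
first_run, mark = capture_changed(records, None)

# Record 1 is modified before the second run; only that record is captured.
records[0]["last_modified"] = datetime(2013, 6, 5)
second_run, mark = capture_changed(records, mark)

print([r["id"] for r in first_run])   # [1, 2]
print([r["id"] for r in second_run])  # [1]
```

The high-water mark has to be persisted between runs for the incremental behaviour to work.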
Queued execution of jobs
The DataCleaner monitor will now queue the execution of the same job, if it is triggered multiple times. This ensures that you don't accidentally run the same job concurrently which may lead to all sorts of issues, depending on what the job does.
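A minimal Python sketch of this behaviour, assuming one lock per job name: executions of the same job wait for each other, while different jobs can still overlap. The class and job names are hypothetical, and a plain lock does not guarantee first-come-first-served ordering the way a real queue would.

```python
import threading
import time
from collections import defaultdict


class PerJobExecutor:
    """Serializes executions that share a job name."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the lock table itself
        self._job_locks = defaultdict(threading.Lock)

    def trigger(self, job_name, job):
        with self._guard:
            job_lock = self._job_locks[job_name]
        with job_lock:  # a second trigger of the same job waits here
            return job()


stats = {"active": 0, "max_active": 0}
stats_lock = threading.Lock()


def sample_job():
    # Records how many executions of this job overlap in time.
    with stats_lock:
        stats["active"] += 1
        stats["max_active"] = max(stats["max_active"], stats["active"])
    time.sleep(0.01)  # simulated work
    with stats_lock:
        stats["active"] -= 1


executor = PerJobExecutor()
threads = [
    threading.Thread(target=executor.trigger, args=("customer_profiling", sample_job))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stats["max_active"])  # 1
```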
Minor bugfixes
Several bugfixes were implemented; see the full list on the 3.5.1 milestone page on our bugtracker.
The release is available at the downloads page and via the WebStart client. We hope you enjoy!
|
Posted over 11 years ago by kasper
We are very proud and happy to present DataCleaner 3.5, which has just been released!
With the 3.x branch of DataCleaner we set forth on a mission to deliver monitoring, scheduling and management of your data quality directly in your browser. And now with the new release, we are building upon this platform to deliver an even richer feature set, a comfortable user experience and massive scalability through clustering and cloud computing.
To be more precise, these are the major stories that we've worked on for the DataCleaner 3.5 release:
Connectivity to Salesforce and SugarCRM
One of the most important sources of data is usually a company's CRM system. But it is also one of the more troublesome data sources if you look at the quality. For this reason we've made it easier to get the data out of these CRM systems and into DataCleaner! You can now use your Salesforce.com or your local SugarCRM system as if it were a regular database. Start by profiling the customer data to get an overview. But don't stop there - you can also use DataCleaner to update your CRM data, once it is cleansed. More details are available in the brand new focus article about CRM data quality.
Wizards and other user experience improvements
The DataCleaner monitor is our main user interface going forward. So we want the experience to be at least as pleasant, flexible and rich as the desktop application. To meet this goal, we've made many user interface and user experience improvements, amongst others:
Several wizards are now available for registering datastores; including file-upload to the server for CSV files, database connection entry, guided registration of Salesforce.com credentials and more.
The job building wizards have also been extended with several enhanced features: selection of value distribution and pattern finding fields in the Quick analysis wizard, a completely new wizard for creating EasyDQ based customer cleansing jobs and a new job wizard for firing Pentaho Data Integration jobs (read more below).
You can now ad-hoc query any datastore directly in the web user interface. This makes it easy to get quick or sporadic insights into the data without setting up jobs or other managed approaches of processing the data.
Once jobs or datastores are created, the user is guided to take action with the newly built object. For instance, you can very quickly run a job right after it's built, or query a datastore after it is registered.
Administrators can now directly upload jobs to the repository, which is especially handy if you want to hand-edit the XML content of the job files.
A lot of the technical cruft is now hidden away in favor of showing simple dialogs. For instance, when a job is triggered a large loading indicator is shown, and when finished the result will be shown. The advanced logging screen that was previously there can still be displayed upon clicking a link for additional details.
Distributed execution of jobs
To keep up with the massive amounts of data that many organizations are juggling today, we had to take a critical look at how we process data in DataCleaner. Although DataCleaner is among the fastest data processing tools, it was previously limited to running on a single machine. For a long time we've been working on a major architecture change that enables distribution of a DataCleaner job's workload over a cluster of machines. With this new approach to data processing, DataCleaner is truly fit for data quality on big data. More details are available in the documentation section.
Data visualization extension
Data profiling and data visualization do share some common interests - both are disciplines that help you understand the story that your data is telling. There are obviously also some differences, the main one being that data profiling is more targeted at identifying issues and exceptions rather than deriving or measuring business objectives. But confronted with visualization tools we've realized that sometimes there's a lot of profiling value in progressively visualizing data. For instance, a scatter plot can easily help you identify the numerical outliers of your datasets. This gave fuel to the idea of a visualization extension to DataCleaner. Therefore DataCleaner now also lets you do basic visualization tasks to aid you in your data quality analysis.
National identifiers extension
A very common issue in data quality projects is to validate national identifiers, such as social security numbers, EAN codes and more. In our commercial editions of DataCleaner, we now offer a wide range of validation components to check such identifiers.
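As an example of what validating such an identifier involves, the following Python sketch checks the check digit of an EAN-13 code using the standard EAN-13 weighting scheme. It is not taken from the commercial components mentioned above, and the function name is hypothetical.

```python
def is_valid_ean13(code: str) -> bool:
    """Check the length, the characters and the check digit of an EAN-13 code."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    # The first 12 digits are weighted 1, 3, 1, 3, ... from left to right.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    check_digit = (10 - total % 10) % 10
    return check_digit == digits[12]


print(is_valid_ean13("4006381333931"))  # True
print(is_valid_ean13("4006381333932"))  # False (wrong check digit)
```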
Custom job engines
We've made the ultimate modularization of the DataCleaner monitoring system: The engine itself is a pluggable module. While we do encourage using DataCleaner's engine as the primary vehicle for execution in DataCleaner monitor, it is not obligatory anymore. You can now schedule and monitor (both in terms of metric monitoring and history management) other types of jobs. For instance, you can provide your own piece of Java code and have it scheduled to run in DataCleaner monitor using the regular web user interface.
Pentaho job scheduling and execution
One major example of a pluggable job engine was introduced that we think deserves special attention: You can now invoke and monitor execution metrics of Pentaho Data Integration transformations. DataCleaner monitor by default ships with this job engine extension, which connects to the Pentaho DI server ("Carte") and supervises its execution and result gathering. After execution you can track your Pentaho transformations in the timeline views of the monitoring dashboard, just like other metrics. For larger deployments of DataCleaner it may be convenient to have dedicated ETL-style jobs in your data quality solution, and with this extension we provide an integration with a leading open source solution for just that. More details are available in the documentation section.
... And a whole lot more
There's even a lot more to the 3.5 release than what is posted in these highlights. Take a look at the milestone page on the bugtracker for a more thorough listing of improvements made.
A non-functional aspect of DataCleaner is the reference documentation, which we've also done a lot to update. Additionally all the documentation pages now have a commenting feature, so that you can ask questions or provide feedback to the help that is in there. We'll be continuously providing more and more content in the documentation and on the website for you to get the best resources at your hands.
... Stay tuned for more
On the front page of the DataCleaner website we'll be posting "feature focus" articles in the weeks to come. Please help us spread the word by promoting the release and the articles to your friends, colleagues and whoever else might be interested.
|