What experiences have people had with RapidMiner?

Data scientist - analytics consultant

jq is a tool for processing JSON documents. It offers filters, transformations, restructuring and other possibilities to bring the documents into the desired form.

The JSON documents that a data scientist has to deal with are becoming more and more complex. For example, a document returned by a web API may contain hierarchical and optional elements.

For data mining, however, you always need a tabular structure, without hierarchical elements and ideally also without missing data. jq helps to bring the relevant parts of the incoming documents into such a form.

Let's take the following simple example document:

{"count": 3, "category": "example", "elements": [{"id": 1, "description": "first element", "tags": ["tag1"]}, {"id": 2, "description": "second element", "optional": "optional element", "tags": []}, {"id": 3, "description": "third element", "tags": ["tag1", "tag2"]}]}

Here we see the common pitfalls of complex JSON documents:

  • Elements on different hierarchy levels: category, elements / id etc.
  • Optional elements: elements[1] / optional (only the second element has it)
  • Variable number of elements: elements / tags

jq offers a relatively simple syntax to handle such constructs. The easiest way is to develop the expressions online at jqplay.org.

Perhaps the goal is to create a table with the category, the element id and the tags. The jq expression for this is:

{count, category, elements: .elements[]} | {category, id: .elements.id, tag: .elements.tags[]}

A bit scary at first glance, but ultimately made up of simple elements. If you go through the expression step by step at jqplay, it becomes clearer.

In the first step (the steps are separated by the pipe symbol |) we declare which elements we want to process. With {} we build one object per entry of the elements array: .elements[] iterates over the array, and each resulting object holds a single element together with count and category from the top level of the document. count and category are repeated in every object so that the "table" is complete.

In the second step we select the category (originally on the top level) and the id of each element. The tags array is iterated with .elements.tags[], which yields one object per tag. With name: .parent.child we can select elements and name them. The result of this step is a list of objects with category, id, and tag in a tabular structure that we could write to a database or process in a data mining tool.

jq in RapidMiner

To process such complex documents in RapidMiner, it would be practical to integrate jq directly. That's exactly what I did, using jackson-jq, a Java implementation of jq.

In preparation we have to copy the jar file from jackson-jq and two dependencies into the lib directory of RapidMiner Studio. The functionality is then available in the built-in Groovy Scripting Operator (Execute Script).

To simplify the application, I created two RapidMiner processes that can be integrated into your own processes. One variant works on tables (example sets). When calling it, you specify which attribute contains the input data and what the target attribute with the result of the transformation should be called. The other variant works on document objects such as those provided by Get Page.

In both cases you also specify the jq expression, choose whether the output should be indented, and choose whether the result should be converted to CSV. RapidMiner can very easily convert the CSV-formatted result into a table with Read CSV - that is often my goal.

In jqplay we would add the following for the CSV output and select "Raw Output":

| [.category, .id, .tag] | @csv

With this we create an array (with the [] syntax) and name the elements to be output. The result is then reformatted with @csv. (The RapidMiner process does this last step automatically if the CSV output is selected.)

With a little practice and the help of jqplay, you can build processes that turn a nested JSON document into a manageable table with relatively little effort.

In order to process different hierarchies within the document, one could also use different jq expressions and obtain different tables from them.

Processing JSON with jq

jq is a command line tool for processing JSON documents. It can filter, transform and restructure documents to format them in the way we want.

The JSON documents data scientists have to work with are becoming more and more complex. Web APIs often generate documents with a hierarchical structure and optional elements.

Data mining, however, needs a tabular structure, without hierarchical elements, and if possible without missing data. jq helps us transform the relevant parts of the input documents into this shape.

Take the following example document:

{"count": 3, "category": "example", "elements": [{"id": 1, "description": "first element", "tags": ["tag1"]}, {"id": 2, "description": "second element", "optional": "optional element", "tags": []}, {"id": 3, "description": "third element", "tags": ["tag1", "tag2"]}]}

This shows the usual pitfalls of complex JSON documents:

  • Elements on different hierarchy levels: category, elements / id etc.
  • Optional elements: elements[1] / optional (only the second element has it)
  • Variable number of elements: elements / tags

The easiest way to try jq is online at jqplay.org.

We might want to create a table with the category, the element id and the tags. The jq expression for this is:

{count, category, elements: .elements[]} | {category, id: .elements.id, tag: .elements.tags[]}

It looks scary at first glance, but it is built from simple elements. You can execute it step by step at jqplay to see the effect of each step.

In the first step (the steps being delimited by the pipe symbol "|") we declare the elements we want to process. With {} we build one object per entry of the elements array: .elements[] iterates over the array, and each object holds a single element plus count and category from the top level. count and category are repeated in every object to create a proper table.
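To make the first step concrete, here is a minimal Python sketch of what it produces. This is plain Python mimicking the behaviour, not jq itself, and the names DOC and step_one are made up for illustration.

```python
import json

# The example document from above.
DOC = json.loads("""
{"count": 3, "category": "example", "elements": [
  {"id": 1, "description": "first element", "tags": ["tag1"]},
  {"id": 2, "description": "second element", "optional": "optional element", "tags": []},
  {"id": 3, "description": "third element", "tags": ["tag1", "tag2"]}]}
""")


def step_one(doc):
    """Mimic {count, category, elements: .elements[]}.

    Returns one object per entry of doc["elements"]; count and category
    are copied from the top level into every object.
    """
    return [
        {"count": doc["count"], "category": doc["category"], "elements": element}
        for element in doc["elements"]
    ]


step_one_result = step_one(DOC)
for obj in step_one_result:
    print(json.dumps(obj))
```

The sketch prints three objects, one per element. Note that the key is still called elements, although each object now holds only a single element under it.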

In the second step we select category and the ids of the elements, which were on different levels previously. The tags array is iterated with .elements.tags[], which yields one object per tag. Using the syntax name: .element.element we can select elements and name them. The result of this step is a list of objects having category, id and tag in a table, suitable for writing into a relational database or processing in a data mining tool.
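The effect of the whole two-step expression can be sketched in plain Python as two nested loops. Again, this only mimics what jq does; DOC and flatten are hypothetical names used for illustration.

```python
import json

# The example document from above.
DOC = json.loads("""
{"count": 3, "category": "example", "elements": [
  {"id": 1, "description": "first element", "tags": ["tag1"]},
  {"id": 2, "description": "second element", "optional": "optional element", "tags": []},
  {"id": 3, "description": "third element", "tags": ["tag1", "tag2"]}]}
""")


def flatten(doc):
    """Mimic the full expression: one row per (element, tag) pair."""
    rows = []
    for element in doc["elements"]:      # corresponds to .elements[]
        for tag in element["tags"]:      # corresponds to .elements.tags[]
            rows.append(
                {"category": doc["category"], "id": element["id"], "tag": tag}
            )
    return rows


rows = flatten(DOC)
for row in rows:
    print(json.dumps(row))
# {"category": "example", "id": 1, "tag": "tag1"}
# {"category": "example", "id": 3, "tag": "tag1"}
# {"category": "example", "id": 3, "tag": "tag2"}
```

Note that the second element, whose tags array is empty, produces no row at all. jq behaves the same way, because .elements.tags[] yields nothing for an empty array. If such elements should still appear, one option in standard jq is the alternative operator, for example tag: (.elements.tags[] // null); I have not checked this against every jq implementation.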

jq in RapidMiner

It would be useful to process these kinds of documents with jq in RapidMiner. This is what I did, using jackson-jq, a Java implementation of jq.

To prepare, we need to copy the jackson-jq jar file and two dependencies into the RapidMiner Studio lib directory. Then we're able to use the functionality in the built-in Groovy scripting operator (Execute Script).

I created two RapidMiner processes to make the application easier. These can be used in other processes. One variant works on tables (example sets): here you specify the name of the input attribute containing your documents and the target attribute for the transformation result. The other variant works on Document objects, like those coming from Get Page.

In both cases you specify the jq expression and set up the output options. You can indent the output and convert the result to CSV. The CSV-formatted result can be easily transformed to an example set - this is a frequent use case.

If you want to see CSV output in jqplay, check "Raw Output" and append the following:

| [.category, .id, .tag] | @csv

This creates an array (with the [] syntax) listing the fields to output. The @csv step then formats each array as one CSV line. (The RapidMiner process does this automatically if the CSV output is selected.)
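As a rough illustration of the CSV step, the following Python sketch formats the flattened rows the way @csv does, with strings quoted and numbers left bare. It uses only the standard csv module; the rows list and the to_csv helper are assumptions made for this example.

```python
import csv
import io

# Rows as produced by the two-step expression on the example document.
rows = [
    {"category": "example", "id": 1, "tag": "tag1"},
    {"category": "example", "id": 3, "tag": "tag1"},
    {"category": "example", "id": 3, "tag": "tag2"},
]


def to_csv(rows):
    """Mimic [.category, .id, .tag] | @csv: one CSV line per row."""
    buffer = io.StringIO()
    # QUOTE_NONNUMERIC quotes strings and leaves numbers unquoted.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for row in rows:
        writer.writerow([row["category"], row["id"], row["tag"]])
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text, end="")
# "example",1,"tag1"
# "example",3,"tag1"
# "example",3,"tag2"
```

There is no header line in this output, so the column names have to be assigned when the CSV is read back into a table.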

This, together with some practice in jqplay, enables processes that can transform complex JSON documents into flat tables.

To process different parts and structures of the document, just multiply the document and apply a different jq expression to each copy.
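As a sketch of this idea, the same document can yield an elements table and a separate tags table. In jq this might be done with expressions such as .elements[] | {id, description, optional} and .elements[] | {id, tag: .tags[]}; both expressions, and the Python names below, are my own illustrations and not part of the RapidMiner processes.

```python
import json

# The example document from above.
DOC = json.loads("""
{"count": 3, "category": "example", "elements": [
  {"id": 1, "description": "first element", "tags": ["tag1"]},
  {"id": 2, "description": "second element", "optional": "optional element", "tags": []},
  {"id": 3, "description": "third element", "tags": ["tag1", "tag2"]}]}
""")


def elements_table(doc):
    """One row per element; a missing optional field becomes None."""
    return [
        {
            "id": element["id"],
            "description": element["description"],
            "optional": element.get("optional"),
        }
        for element in doc["elements"]
    ]


def tags_table(doc):
    """One row per (element id, tag) pair."""
    return [
        {"id": element["id"], "tag": tag}
        for element in doc["elements"]
        for tag in element["tags"]
    ]


elements = elements_table(DOC)
tags = tags_table(DOC)
print(len(elements), "element rows,", len(tags), "tag rows")
# 3 element rows, 3 tag rows
```

The two tables share the id column, so they can be joined later if needed. The elements table keeps the second element even though it has no tags, and it makes the missing optional values explicit.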
