
Extracting Values From Element Attributes Using Jsoup and a JavaScript Stage

by Kevin Cowan
January 24, 2017

While Fusion comes with built-in Jsoup selector functionality, its extraction capability is limited. If you want to extract attribute values, in particular values that contain special characters or whitespace, you’ll need to create a custom JavaScript stage and implement the extraction there.

To accomplish this:

1) Create a custom JavaScript stage and place it directly after the Apache Tika Parser stage. In the Apache Tika Parser stage, make sure that both “Return parsed content as XML or HTML” and “Return original XML and HTML instead of Tika XML Output” are checked.

2) Add your code. For the purposes of this article, I’ve created the following example. Depending on what you’re trying to accomplish, your code may vary:

function(doc) {

    var Jsoup = org.jsoup.Jsoup;
    var content = doc.getFirstFieldValue("body");
    // The Java classes assigned below only document the intended types;
    // each variable is overwritten with a real object inside the try block.
    var jdoc = org.jsoup.nodes.Document;
    var div = org.jsoup.nodes.Element;
    var img = org.jsoup.nodes.Element;
    var iter = java.util.Iterator;
    var divs = org.jsoup.select.Elements;

    try {
        jdoc = Jsoup.parse(content);
        divs = jdoc.select("div");
        iter = divs.iterator();
        div = null; // stays null unless a matching div is found
        while (iter.hasNext()) {
            var candidate = iter.next();
            if (candidate.attr("id").equals("featured-img")) {
                div = candidate; // keep only the div that actually matched
                break;
            }
        }
        if (div != null) {
            img = div.child(0); // the first child is expected to be the image element
            logger.info("SRC: " + img.attr("src"));
            logger.info("ORIG FILE: " + img.attr("data-orig-file"));
            doc.addField("post_image", img.attr("src") + " | " + img.attr("data-orig-file"));
        } else {
            logger.warn("Div was null");
        }
    } catch (e) {
        logger.error("Attribute extraction failed: " + e);
    }
    return doc;
}

So let’s go ahead and break down what is happening here:

1) Declare the Java classes and JavaScript variables to be used. Note that the content variable is assigned the content pulled by the Apache Tika Parser, read here from the document’s “body” field.

var Jsoup = org.jsoup.Jsoup;
var content = doc.getFirstFieldValue("body");
var jdoc = org.jsoup.nodes.Document;
var div = org.jsoup.nodes.Element;
var img = org.jsoup.nodes.Element;
var iter = java.util.Iterator;
var divs = org.jsoup.select.Elements;

2) Next, we pull the “div” elements out and look for one with an ID of “featured-img.” Once we find it, we break out of the loop and move on. Note: I’m using this type of example to illustrate how to work with element attribute values that contain special characters or whitespace. Jsoup’s selector syntax doesn’t really play well with these types of key names.

jdoc = Jsoup.parse(content); // parse the document
divs = jdoc.select("div"); // select all the 'div' elements
iter = divs.iterator(); // get an iterator for the list
div = null; // stays null unless a match is found
while (iter.hasNext()) { // iterate over the elements
    var candidate = iter.next();
    if (candidate.attr("id").equals("featured-img")) { // if we find a match, assign and move on
        div = candidate;
        break;
    }
}
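
As an alternative to iterating by hand, Jsoup's Element.getElementById looks up an element by its id and returns null when nothing matches. The following is only a minimal sketch and not part of the stage above: the helper name findFeaturedDiv is hypothetical, and it assumes jdoc is the object returned by Jsoup.parse(content).

```javascript
// Hypothetical helper: look up the target div by id instead of looping.
// Assumes `jdoc` is an org.jsoup.nodes.Document returned by Jsoup.parse().
function findFeaturedDiv(jdoc, id) {
    var el = jdoc.getElementById(id); // null when no element has this id
    if (el === null || el === undefined) {
        return null;
    }
    // Only accept the match if it really is a <div>.
    // String() turns a Java string into a JavaScript string for comparison.
    return String(el.tagName()) === "div" ? el : null;
}

// Inside the stage this might be used as:
//   div = findFeaturedDiv(jdoc, "featured-img");
```

Because the helper returns null when nothing matches, the later null check on div keeps working unchanged.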

3) Finally, we set the values in the document. I’ve added some extra logging here, which can ultimately be removed.

if (div != null) {
    img = div.child(0); // get the image element
    logger.info("SRC: " + img.attr("src"));
    logger.info("ORIG FILE: " + img.attr("data-orig-file"));
    doc.addField("post_image", img.attr("src") + " | " + img.attr("data-orig-file")); // set the values in the PipelineDocument
} else {
    logger.warn("Div was null");
}
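
Jsoup's attr returns an empty string when an attribute is absent, so the concatenation above can leave a dangling " | " separator in the field. Here is a small sketch of one way to guard against that. The joinNonEmpty helper is hypothetical and is not part of Fusion or Jsoup.

```javascript
// Hypothetical helper: join attribute values, skipping missing or blank ones.
// Jsoup's Element.attr() returns "" for an absent attribute.
function joinNonEmpty(values, separator) {
    var kept = [];
    for (var i = 0; i < values.length; i++) {
        // String() also converts Java strings to JavaScript strings.
        var v = String(values[i] == null ? "" : values[i]).trim();
        if (v.length > 0) {
            kept.push(v);
        }
    }
    return kept.join(separator);
}

// Inside the stage this might be used as:
//   var value = joinNonEmpty([img.attr("src"), img.attr("data-orig-file")], " | ");
//   if (value) { doc.addField("post_image", value); }
```

Skipping the addField call when the result is empty avoids indexing a blank post_image value.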

And that’s all there is to it! Happy Extracting!
