Indexing Existing Data with SolrJ in Apache Solr
How to use the SolrJ client with Apache Solr for indexing data.
Two popular methods of indexing existing data are the Data Import Handler (DIH) and Tika (Solr Cell)/ExtractingRequestHandler. These can be used to index data from a database or from structured documents (say Word documents, PDFs, and so on). These are great tools for getting things up and running quickly, and I have seen production sites that work well with one or both of them.
OK, Then Why Talk About SolrJ?
Well, somewhere in the architectural document are two boxes that have labels like this, connected by an arrow:
Oh, all right. Rarely is the connector between the Solr Server/Indexer and the data it’s going to index labeled “miraculous connection”, but I sometimes wish people would be more honest about it. All sorts of things can get in the way here; I’ll mention 0.01% of them:
- The security people WILL NOT “just open the database for the IP address of the Solr indexer, please”.
- Actually, it’s not one datasource. It’s three. At least. And, by the way, you really need to cache data from database 1 or performance will die. Don’t forget that each of them requires different credentials.
- Did I mention that the “miraculous connection” is where our business rules that have to be encoded into the Solr documents live?
- Hey! I thought we could run the documents through categorization during this step!
- Actually, the meta-data lives in the DB, and you take that information then query the file server which will deliver the document to you.
- We’re indexing movies. Right, you only want the metadata. What? Solr is going to throw 99.99999% of the data away after you’ve transmitted how much data over my network?
- <insert your favorite problem here>
And this doesn’t even mention that DIH and Tika are run on the server. That is, the Solr indexer is doing all the work and the poor thing can only go so fast (alright, it blazes, but parsing a bazillion PDF/Word/Excel documents is quite a load for a single machine).
My point is that there are often sound reasons why using DIH and/or Tika is not optimal, from security, to taking more control over how “bad” documents are handled, to throughput, to what tools your organization is most comfortable with. For those situations SolrJ may be the most appropriate choice. What follows is a skeletal program that:
- Connects from a Java program to a Solr server, the indexer in this case.
- Queries a database via JDBC and selects information from a table, putting it into a suitable form for indexing.
- Traverses a file system and indexes all of the documents Tika can parse given a directory.
- Adds these documents to the Solr server.
Please note that the example is very simple; nothing done here couldn’t be done easily with DIH and Solr Cell. The intent is to provide a starting point for you to adapt to your particular situation where DIH and Solr Cell won’t work right out of the box.
Lots of Buildup, Not Much Code
For all the above, the program itself is pretty short. I’ll outline some highlights, and the complete listing is at the end of this article.
Jars and where to get them.
There are three sets of jar files you’ll need to run this example.
- The Solr jar files. There are two places to look for these: <solr_home>/dist and <solr_home>/dist/solrj-lib. The classes you need to make a SolrJ program do its tricks will be in these two directories.
- The Tika jar files. I’d recommend downloading Tika from the Apache project (see Apache Tika) and putting those jars in the classpath for your client.
- The appropriate JDBC driver. This will vary depending upon the database you’re connecting to. Often it is available somewhere in your database installation, but just search “jdbc <your database here>” and you should be able to find it.
Note that the only change on the server (and then, only if you’re running the SolrJ program on the server) is the JDBC driver (if necessary)! If you happen to be running the server and client on the same machine, you can simply add the appropriate paths to your CLASSPATH environment variable. Otherwise, you’ll have to copy any jars you need from the server to your client.
Here’s what that code looks like. The full source at the end handles traversing the filesystem etc.
Set up the Solr connection
This is just the code to set up the connection to the Solr server. The connection string points to ZooKeeper since this example is for a SolrCloud installation. This should be the same string you use for starting Solr. In this example I use a single ZooKeeper, but in reality it’ll be the “usual” ensemble string. The only extra bit is at the end, where we set up the Tika parser.
    private SqlTikaExample(String url) throws IOException, SolrServerException {
      // Create a SolrCloud-aware client to send docs to Solr.
      // Use something like HttpSolrClient for stand-alone.
      client = new CloudSolrClient.Builder().withZkHost(zkEnsemble).build();

      // Solr 8.x uses a builder pattern for creating a client:
      // client = new CloudSolrClient.Builder(Collections.singletonList(zkEnsemble), Optional.empty())
      //     .withConnectionTimeout(5000)
      //     .withSocketTimeout(10000)
      //     .build();

      // The binary parser is used by default for responses.
      client.setParser(new XMLResponseParser());

      // One of the ways Tika can be used to attempt to parse arbitrary files.
      autoParser = new AutoDetectParser();
    }
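The “usual” ensemble string mentioned above is a comma-separated list of host:port pairs, optionally followed by a chroot path, and it carries no URL scheme. As a minimal sketch (the class name ZkEnsembleString and the host names zk1, zk2 and zk3 are made up for illustration and are not part of SolrJ), a small helper can assemble and sanity-check that string before it is handed to the client builder:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of SolrJ: builds a ZooKeeper connect string. */
final class ZkEnsembleString {

    private ZkEnsembleString() {}

    /**
     * Joins host:port pairs with commas and appends an optional chroot,
     * e.g. ["zk1:2181", "zk2:2181"] + "/solr" becomes "zk1:2181,zk2:2181/solr".
     */
    static String build(List<String> hostPorts, String chroot) {
        if (hostPorts == null || hostPorts.isEmpty()) {
            throw new IllegalArgumentException("At least one host:port is required");
        }
        for (String hp : hostPorts) {
            // A ZooKeeper address is host:port, not a URL such as http://host:port.
            // (This simple check does not handle IPv6 literals.)
            if (hp.contains("://") || !hp.matches("[^:,/\\s]+:\\d+")) {
                throw new IllegalArgumentException("Expected host:port but got: " + hp);
            }
        }
        String joined = String.join(",", hostPorts);
        if (chroot == null || chroot.isEmpty()) {
            return joined;
        }
        if (!chroot.startsWith("/")) {
            throw new IllegalArgumentException("chroot must start with '/': " + chroot);
        }
        return joined + chroot;
    }

    public static void main(String[] args) {
        // Made-up three-node ensemble with a /solr chroot.
        System.out.println(build(Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), "/solr"));
    }
}
```

Failing fast on a malformed string tends to be easier to diagnose than a client that sits waiting for a ZooKeeper connection that will never come up.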
Index a Structured Document With Tika From a SolrJ Program
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    // Tim Allison noted the following, thanks Tim!
    // If you want Tika to parse embedded files (attachments within your .doc or any
    // other embedded files), you need to send in the autodetectparser in the parsecontext:
    // context.set(Parser.class, autoParser);

    // Try parsing the file. Note we haven't checked at all to
    // see whether this file is a good candidate.
    // try-with-resources closes the stream whether or not parsing succeeds.
    try (InputStream input = new FileInputStream(file)) {
      autoParser.parse(input, textHandler, metadata, context);
    } catch (Exception e) {
      // Needs better logging of what went wrong in order to
      // track down "bad" documents.
      log(String.format("File %s failed", file.getCanonicalPath()));
      e.printStackTrace();
      continue;
    }

    // Just to show how much meta-data and what form it's in.
    dumpMetadata(file.getCanonicalPath(), metadata);

    // Index just a couple of the meta-data fields.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", file.getCanonicalPath());

    // Crude way to get known meta-data fields.
    String author = metadata.get("Author");
    if (author != null) {
      doc.addField("author", author);
    }
    doc.addField("text", textHandler.toString());
    docList.add(doc);
There’s an outer loop that’s in charge of traversing the filesystem that I haven’t shown. All that’s happening here is that Tika is allowed to do its thing. If Tika fails to parse the document (notice I haven’t taken any care to determine that the files are reasonable, for instance a jar file or an exe file or whatever could be parsed), we log an error and continue.
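The missing check on whether a file is a reasonable candidate can be approximated before Tika ever sees the file. The following is only a sketch under stated assumptions: the class CandidateFiles and its allow-list of extensions are hypothetical, and an extension test is a crude stand-in for real content-type detection.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical pre-filter: picks the files worth handing to a parser. */
final class CandidateFiles {

    // Assumed allow-list; adjust it to the formats you actually want indexed.
    static final Set<String> EXTENSIONS =
            new HashSet<>(Arrays.asList("pdf", "doc", "docx", "xls", "xlsx", "txt"));

    private CandidateFiles() {}

    /** True if the file name ends in one of the allowed extensions (case-insensitive). */
    static boolean isCandidate(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return false; // no extension at all
        }
        return EXTENSIONS.contains(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /** Recursively lists regular files under root that pass isCandidate, sorted by path. */
    static List<Path> find(Path root) throws IOException {
        // Files.walk holds directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(CandidateFiles::isCandidate)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : find(root)) {
            System.out.println(p);
        }
    }
}
```

Filtering up front keeps jar and exe files out of the parse step entirely, at the cost of trusting file names; whether that trade-off is acceptable depends on where your documents come from.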
If the document does parse, we extract a couple of fields and throw them into the Solr document. That document is then added to a Java List, and when the list reaches 1,000 documents, the whole thing is passed to Solr for indexing, as you can see in the full listing. By the way, the example index that comes with the Solr distribution will already have these fields defined.
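The collect-then-send pattern described here can be isolated from Solr altogether. As a rough sketch, the hypothetical Batcher class below queues items and hands them to a callback once the batch size is reached; in the indexer that callback would be where a call such as client.add(docList, 300000) goes, but here it only prints so the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical generic batcher; the sink stands in for the call that sends docs to Solr. */
final class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    Batcher(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queues one item and sends a batch as soon as batchSize items are waiting. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is queued; call once at the end to pick up the leftovers. */
    void flush() {
        if (!pending.isEmpty()) {
            sink.accept(new ArrayList<>(pending)); // copy, so the sink may keep the list
            pending.clear();
        }
    }

    public static void main(String[] args) {
        Batcher<Integer> batcher =
                new Batcher<>(1000, batch -> System.out.println("sending " + batch.size()));
        for (int i = 0; i < 2500; i++) {
            batcher.add(i);
        }
        batcher.flush(); // without this, the last partial batch would be lost
    }
}
```

With 2,500 items and a batch size of 1,000, the sink is called with two full batches and then, on the final flush, with the remaining 500. Forgetting that final flush is the classic way to lose the tail of an indexing run.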
But note a subtlety here, even in the trivial case. We assume that the meta-data field for author is “Author”. There are no cross-format standards for this; it might be called “document_author”, or maybe you want “last_editor”. You can control all of this here, either by judicious configuration of Tika or programmatically.
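One programmatic way to cope with varying field names is to try several candidates in order of preference. This is only a sketch: the class MetadataAliases is hypothetical, a plain Map stands in for Tika's Metadata object, and the candidate names are the ones mentioned above rather than any standard.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: resolves one logical field from several possible metadata names. */
final class MetadataAliases {

    private MetadataAliases() {}

    /**
     * Returns the first non-blank value found under any of the candidate names,
     * or null when none of them is present. The Map stands in for Tika's Metadata.
     */
    static String firstPresent(Map<String, String> metadata, List<String> candidateNames) {
        for (String name : candidateNames) {
            String value = metadata.get(name);
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("last_editor", "A. Writer");
        // Order expresses preference: "Author" wins when several names are present.
        List<String> authorNames = Arrays.asList("Author", "document_author", "last_editor");
        System.out.println(firstPresent(metadata, authorNames)); // prints "A. Writer"
    }
}
```

Keeping the alias list in one place means a newly discovered field name is a one-line change rather than another if statement in the indexing loop.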
Onwards to the SQL Bit
Next, we’ll look at the code that connects via JDBC to a MySQL database. Again, it’s the simplest of database tables and the simplest of extractions. You probably won’t be using the SolrJ solution unless your situation is more complex than this, but this gets you started.
    Class.forName("com.mysql.jdbc.Driver").newInstance();
    log("Driver Loaded......");

    con = DriverManager.getConnection("jdbc:mysql://192.168.1.103:3306/test?"
        + "user=testuser&password=test123");

    Statement st = con.createStatement();
    ResultSet rs = st.executeQuery("select id,title,text from test");

    while (rs.next()) {
      // DO NOT move this outside the while loop
      SolrInputDocument doc = new SolrInputDocument();
      String id = rs.getString("id");
      String title = rs.getString("title");
      String text = rs.getString("text");

      doc.addField("id", id);
      doc.addField("title", title);
      doc.addField("text", text);

      docList.add(doc);
    }
Again, this is the same sort of process as the last, but now instead of parsing the structured document, we fetch rows from a table and add selected values from those rows to each Solr document. And again we collect those documents in a list to be sent to Solr eventually.
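For anything longer-lived than a demo, the JDBC resources need to be closed reliably and the credentials are better kept out of the URL. The sketch below shows one way to do that with try-with-resources; the class JdbcRows is hypothetical, and returning every row as a list of maps is a simplification that only suits small tables.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper: reads rows over JDBC and closes everything it opens. */
final class JdbcRows {

    private JdbcRows() {}

    /** Keeps credentials out of the JDBC URL by passing them as connection properties. */
    static Properties credentials(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        return props;
    }

    /** Runs the query and returns each row as a column-label to value map. */
    static List<Map<String, Object>> fetch(String jdbcUrl, Properties creds, String sql)
            throws SQLException {
        List<Map<String, Object>> rows = new ArrayList<>();
        // try-with-resources closes the ResultSet, Statement and Connection
        // in reverse order, even when the query throws.
        try (Connection con = DriverManager.getConnection(jdbcUrl, creds);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                // One fresh map per row, for the same reason the Solr document
                // is created inside the loop.
                Map<String, Object> row = new LinkedHashMap<>();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.put(meta.getColumnLabel(i), rs.getObject(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }
}
```

A call might look like JdbcRows.fetch("jdbc:mysql://192.168.1.103:3306/test", JdbcRows.credentials("testuser", "test123"), "select id,title,text from test"), assuming the MySQL driver is on the classpath. For a large table you would turn each row into a Solr document inside the loop instead of collecting the rows in memory first.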
Conclusion
As you can see, using Tika and/or SQL/JDBC from a SolrJ client is not very complicated. This post was prompted by the number of requests on the Solr users’ list for samples of how to use SolrJ to index documents. It is rather daunting to be confronted with the whole of the Solr API documentation and not have a clue where to start; I hope this example de-mystifies the process a bit.
Environment
I compiled this code against Solr 6.x in March 2017, because the code was getting quite old and some of the classes were no longer available. However, I didn’t set up a full test environment again, since the changes consisted entirely of switching to CloudSolrClient.
Full Source Code and Disclaimer
One of the delights about writing examples is that one can leave out all the ugly error-handling, logging, etc. This code needs to be beefed up considerably for production purposes, and your situation will almost certainly be much more complex or you wouldn’t have a need to worry about SolrJ in the first place! So feel free to use this as a basis for going forward with SolrJ but it’s only an example after all.
Also note that I’ve used SolrJ as an example, but there are other implementations, C#, PHP, etc. However, to the best of my knowledge clients other than SolrJ do not have all the nifty routing built in to CloudSolrClient so they will be considerably less efficient than SolrJ in a SolrCloud environment.
    package solrjexample;

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.impl.XMLResponseParser;
    import org.apache.solr.client.solrj.response.UpdateResponse;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.sql.*;
    import java.util.ArrayList;
    import java.util.Collection;

    /* Example class showing the skeleton of using Tika and
       Sql on the client to index documents from
       both structured documents and a SQL database.

       NOTE: The SQL example and the Tika example are entirely orthogonal.
       Both are included here to make a
       more interesting example, but you can omit either of them.
     */
    public class SqlTikaExample {
      private CloudSolrClient client;
      private long start = System.currentTimeMillis();
      private AutoDetectParser autoParser;
      private int totalTika = 0;
      private int totalSql = 0;

      // ZooKeeper connect string: host:port, with no URL scheme.
      private final String zkEnsemble = "localhost:2181";

      private Collection<SolrInputDocument> docList = new ArrayList<>();

      public static void main(String[] args) {
        try {
          SqlTikaExample idxer = new SqlTikaExample("http://localhost:8983/solr");

          idxer.doTikaDocuments(new File("/Users/Erick/testdocs"));
          idxer.doSqlDocuments();

          idxer.endIndexing();
        } catch (Exception e) {
          e.printStackTrace();
        }
      }

      private SqlTikaExample(String url) throws IOException, SolrServerException {
        // Create a SolrCloud-aware client to send docs to Solr.
        // Use something like HttpSolrClient for stand-alone.
        client = new CloudSolrClient.Builder().withZkHost(zkEnsemble).build();

        // NOTE: CloudSolrClient also needs to know the target collection,
        // for instance via client.setDefaultCollection(...) with the name
        // of your collection.

        // Solr 8 uses a builder pattern here.
        // client = new CloudSolrClient.Builder(Collections.singletonList(zkEnsemble), Optional.empty())
        //     .withConnectionTimeout(5000)
        //     .withSocketTimeout(10000)
        //     .build();

        // The binary parser is used by default for responses.
        client.setParser(new XMLResponseParser());

        // One of the ways Tika can be used to attempt to parse arbitrary files.
        autoParser = new AutoDetectParser();
      }

      // Just a convenient place to wrap things up.
      private void endIndexing() throws IOException, SolrServerException {
        if (docList.size() > 0) { // Are there any documents left over?
          client.add(docList, 300000); // Commit within 5 minutes
        }
        // Only needs to be done at the end, commitWithin should do the rest.
        // Could even be omitted assuming commitWithin was specified.
        client.commit();
        long endTime = System.currentTimeMillis();
        log("Total Time Taken: " + (endTime - start) +
            " milliseconds to index " + totalSql +
            " SQL rows and " + totalTika + " documents");
      }

      // I hate writing System.out.println() everyplace,
      // besides this gives a central place to convert to true logging
      // in a production system.
      private static void log(String msg) {
        System.out.println(msg);
      }

      /**
       * ***************************Tika processing here
       */
      // Recursively traverse the filesystem, parsing everything found.
      private void doTikaDocuments(File root) throws IOException, SolrServerException {

        // listFiles() returns null if root is not a readable directory.
        File[] files = root.listFiles();
        if (files == null) {
          log("Cannot list " + root.getCanonicalPath());
          return;
        }

        // Simple loop for recursively indexing all the files
        // in the root directory passed in.
        for (File file : files) {
          if (file.isDirectory()) {
            doTikaDocuments(file);
            continue;
          }
          // Get ready to parse the file.
          ContentHandler textHandler = new BodyContentHandler();
          Metadata metadata = new Metadata();
          ParseContext context = new ParseContext();
          // Tim Allison noted the following, thanks Tim!
          // If you want Tika to parse embedded files (attachments within your .doc or any
          // other embedded files), you need to send in the autodetectparser in the parsecontext:
          // context.set(Parser.class, autoParser);

          // Try parsing the file. Note we haven't checked at all to
          // see whether this file is a good candidate.
          // try-with-resources closes the stream whether or not parsing succeeds.
          try (InputStream input = new FileInputStream(file)) {
            autoParser.parse(input, textHandler, metadata, context);
          } catch (Exception e) {
            // Needs better logging of what went wrong in order to
            // track down "bad" documents.
            log(String.format("File %s failed", file.getCanonicalPath()));
            e.printStackTrace();
            continue;
          }
          // Just to show how much meta-data and what form it's in.
          dumpMetadata(file.getCanonicalPath(), metadata);

          // Index just a couple of the meta-data fields.
          SolrInputDocument doc = new SolrInputDocument();

          doc.addField("id", file.getCanonicalPath());

          // Crude way to get known meta-data fields.
          // Also possible to write a simple loop to examine all the
          // metadata returned and selectively index it and/or
          // just get a list of them.
          // One can also use the Lucidworks field mapping to
          // accomplish much the same thing.
          String author = metadata.get("Author");

          if (author != null) {
            doc.addField("author", author);
          }

          doc.addField("text", textHandler.toString());

          docList.add(doc);
          ++totalTika;

          // Completely arbitrary, just batch up more than one document
          // for throughput!
          if (docList.size() >= 1000) {
            // Commit within 5 minutes.
            UpdateResponse resp = client.add(docList, 300000);
            if (resp.getStatus() != 0) {
              log("Some horrible error has occurred, status is: " +
                  resp.getStatus());
            }
            docList.clear();
          }
        }
      }

      // Just to show all the metadata that's available.
      private void dumpMetadata(String fileName, Metadata metadata) {
        log("Dumping metadata for file: " + fileName);
        for (String name : metadata.names()) {
          log(name + ":" + metadata.get(name));
        }
        log("\n\n");
      }

      /**
       * ***************************SQL processing here
       */
      private void doSqlDocuments() throws SQLException {
        Connection con = null;
        try {
          Class.forName("com.mysql.jdbc.Driver").newInstance();
          log("Driver Loaded......");

          con = DriverManager.getConnection("jdbc:mysql://192.168.1.103:3306/test?"
              + "user=testuser&password=test123");

          Statement st = con.createStatement();
          ResultSet rs = st.executeQuery("select id,title,text from test");

          while (rs.next()) {
            // DO NOT move this outside the while loop
            SolrInputDocument doc = new SolrInputDocument();
            String id = rs.getString("id");
            String title = rs.getString("title");
            String text = rs.getString("text");

            doc.addField("id", id);
            doc.addField("title", title);
            doc.addField("text", text);

            docList.add(doc);
            ++totalSql;

            // Completely arbitrary, just batch up more than one
            // document for throughput!
            if (docList.size() >= 1000) {
              // Commit within 5 minutes.
              UpdateResponse resp = client.add(docList, 300000);
              if (resp.getStatus() != 0) {
                log("Some horrible error has occurred, status is: " +
                    resp.getStatus());
              }
              docList.clear();
            }
          }
        } catch (Exception ex) {
          ex.printStackTrace();
        } finally {
          if (con != null) {
            con.close();
          }
        }
      }
    }
This post was originally published on February 14, 2012.