What is a good way to index a Solr record in which the source data comes from multiple sources?


I have multiple sources of data from which I want to produce Solr documents. One source is a filesystem, so I plan to iterate through a set of (potentially many) files to collect one portion of the data for each resulting Solr doc. The second source is another Solr index, from which I'd like to pull just a few fields. This second source could also have many (~millions of) records. If it matters, source 1 provides the bulk of the content (each record there is several orders of magnitude larger than the corresponding record from source 2).

Source 1:

  • /file/band1 -> id="xyz1" name="beatles" era="60s"
  • /file/band2 -> id="xyz2" name="u2" era="80s"
  • ...
  • /file/band4000 -> id="xyz4000" name="clash" era="70s"

Source 2:

  • solr record 1 -> id="xyz2" guitar="edge"
  • solr record 2 -> id="xyz4000" guitar="jones"
  • solr record 3 -> id="xyz1" guitar="george"

My issue is how best to design this workflow. A few high-level choices include:

  1. Fully index the data from source 1 (the filesystem). Next, index the data from source 2 and update the already-indexed records. With Solr, I believe you still can't just add a single field to a record; you replace the entire old record with the new one.
  2. Do the reverse of (1), indexing first the data from the Solr source, followed by the data from the filesystem.
  3. Somehow integrate the data before indexing into Solr. In general, we don't know much about the order of traversal in each source--which is to say, I don't see an easy way to iterate the two sources together, in which xyz1 gets processed from both sources, then xyz2, etc.
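
For option 3, one possibility that avoids iterating the two sources in step is to load the few fields from source 2 into an in-memory lookup keyed by id, then make a single pass over source 1. The following is only a minimal sketch under that assumption (that the source 2 fields fit in memory); the function names `build_lookup` and `merge_docs` are hypothetical, and the records are plain dictionaries standing in for parsed files and Solr results.

```python
from typing import Dict, Iterable, Iterator


def build_lookup(source2_records: Iterable[dict]) -> Dict[str, dict]:
    """Map each id to the small set of extra fields from source 2."""
    lookup: Dict[str, dict] = {}
    for rec in source2_records:
        # Keep everything except the key itself.
        lookup[rec["id"]] = {k: v for k, v in rec.items() if k != "id"}
    return lookup


def merge_docs(source1_docs: Iterable[dict],
               lookup: Dict[str, dict]) -> Iterator[dict]:
    """Yield one merged document per source 1 record, in source 1 order."""
    for doc in source1_docs:
        merged = dict(doc)  # copy so the input record is not mutated
        # Records with no match in source 2 pass through unchanged.
        merged.update(lookup.get(doc["id"], {}))
        yield merged


# Sample data taken from the listings above.
source1 = [
    {"id": "xyz1", "name": "beatles", "era": "60s"},
    {"id": "xyz2", "name": "u2", "era": "80s"},
    {"id": "xyz4000", "name": "clash", "era": "70s"},
]
source2 = [
    {"id": "xyz2", "guitar": "edge"},
    {"id": "xyz4000", "guitar": "jones"},
    {"id": "xyz1", "guitar": "george"},
]

merged = list(merge_docs(source1, build_lookup(source2)))
```

Because `merge_docs` is a generator, the large source 1 records are handled one at a time; only the small source 2 fields are held in memory. If even those do not fit, an on-disk key-value store could stand in for the dictionary.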

So some of the factors affecting the decision include the size of the data (can't afford to be too inefficient in terms of computational time or memory) and the performance of Solr when replacing records (does the original size matter much?).

Any ideas would be greatly appreciated.


If you don't need the data from the two sources to be merged before indexing, then option 1 or 2 would work fine. I would probably index the larger source first, then "update" those records with the fields from the second.
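
As a rough illustration of the "update" step, here is a sketch that builds atomic-update documents from the source 2 records. It assumes Solr 4.0 or later, where the JSON update format accepts a `{"set": value}` modifier per field, and it assumes the existing fields are stored (or have docValues), since Solr rebuilds the whole document internally from them. The function name `atomic_updates` is hypothetical, and the request itself is not shown because the update URL depends on your deployment.

```python
import json
from typing import Iterable, List


def atomic_updates(source2_records: Iterable[dict]) -> List[dict]:
    """Turn source 2 records into Solr atomic-update documents."""
    updates = []
    for rec in source2_records:
        update = {"id": rec["id"]}  # the unique key is sent as a plain value
        for field, value in rec.items():
            if field != "id":
                # "set" replaces (or adds) this one field on the existing doc.
                update[field] = {"set": value}
        updates.append(update)
    return updates


# Sample data taken from the question.
source2 = [
    {"id": "xyz2", "guitar": "edge"},
    {"id": "xyz4000", "guitar": "jones"},
    {"id": "xyz1", "guitar": "george"},
]

updates = atomic_updates(source2)
# JSON body to POST to the collection's update handler, in batches.
payload = json.dumps(updates)
```

Sending the updates in batches rather than one request per record, and committing once at the end, should keep the second pass manageable for millions of records, though the cost of each update still grows with the size of the stored document.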