By: Vidhatanand V
September 20 2019

Architecture Capsules by Vid: Building a Highly Scalable Cross-Site Federated Search

We are excited to announce Architecture Capsules, an original series by our Chief Engagement Officer, Vidhatanand. Drawn from hands-on experience, opportunities, and lessons learned, the series uses a concise format to explain intricate technology and its architecture in an engaging way. Let’s go!

Federated search is an application that indexes content from multiple sites (including Drupal sites) into a single search index and returns consistent results across all of them.

According to IDC, 90% of all digital information is unstructured and locked in multiple repositories, and digital businesses have either underinvested in the technology needed to access it or invested in substandard technology.

Traditional search often fails, mainly because of missing optimization practices and the lack of a unified framework. In a data-driven world, unlocking the insights hidden in structured and unstructured data across multiple repositories is more critical than ever.

In this episode of Architecture Capsules, we will learn how federated search improves the website search experience and returns a blend of useful, accurate results compared with traditional search.

Business Use Case & Benefits  

  • Allow users to search for content across multiple sites managed by the enterprise
  • Improve content discovery
  • Improve user retention and engagement 

Requirement Criteria

  • It should work with both Drupal and non-Drupal sites.
  • The search should be platform agnostic and easy to deploy on new sites (possibly as a code snippet).
  • The search should be fast.
  • The search experience should feel instant (read: decoupled).
  • It should give granular control over featuring specific content in search results.
  • It should support multiple structured content types such as events, articles, and blogs.

Stack Used

Scrapy, Redis, React, Drupal, Solr/Elasticsearch, PHP/Python, microservices

[Figure: Traditional search architecture]

[Figure: Federated search architecture]

Architecture Notes 

  • Use Scrapy to crawl the sites. If the total number of pages is high, use Scrapy Cluster. Ideally, keep Scrapy's AutoThrottle enabled.
  • Use queues in Redis to manage the page crawl queue and crawl status.
  • Write parsers (in Python or PHP) for content that must be extracted in a structured way, for example the date, content, and title of an event. Make sure to have a default parser that targets the page body.
  • Send the dump of crawled pages to a pipeline that picks the specific parser for each content type and falls back to the default parser if none applies.
  • Set up a service to restart Scrapy at scheduled intervals. Use the signature of the dump from the previous crawl and the updated date in the header to decide whether to run or skip the parser pipeline. This ensures the complete pipeline runs only when content has been updated or deleted.
  • The next step in the pipeline sends the data to Drupal via JSON:API or GraphQL.
  • Set up Solr with Drupal in the standard way, which is widely documented online.
  • Drupal is responsible for adding, updating, and deleting documents in the Solr index.
  • Build a JS app that talks to Solr's REST API to execute searches.
  • Build a JS code snippet that loads the app from the previous step into an empty container, along with the required markup and styles.
  • Add site-specific styling on the end site if needed.
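
As a rough illustration of the first crawl note, here is what a throttled Scrapy configuration might look like. The setting names are standard Scrapy settings; the values are illustrative assumptions and should be tuned for the sites being crawled.

```python
# settings.py (sketch): polite crawling with Scrapy's AutoThrottle extension.
# All numeric values below are assumptions chosen for illustration.

# Let Scrapy adapt the download delay to each site's observed latency.
AUTOTHROTTLE_ENABLED = True

# Initial delay (seconds) before AutoThrottle has latency measurements.
AUTOTHROTTLE_START_DELAY = 2.0

# Upper bound (seconds) on the delay when a site responds slowly.
AUTOTHROTTLE_MAX_DELAY = 30.0

# Average number of parallel requests to send to each remote site.
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Respect each site's robots.txt rules.
ROBOTSTXT_OBEY = True
```

With AutoThrottle enabled, the crawler slows down on its own when a site is under load, which matters when many sites share one scheduled crawl.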
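
The parser and change-detection notes above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the production pipeline: the function names, the `class="event"` marker, and the in-memory `seen` dictionary are all hypothetical. In a real deployment the signatures would more likely live in Redis, and the parsing would use Scrapy selectors rather than regular expressions.

```python
import hashlib
import re
from typing import Dict, MutableMapping, Optional


def _first(pattern: str, html: str) -> Optional[str]:
    """Return the first captured group of pattern in html, or None."""
    match = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


def parse_event(html: str) -> Optional[Dict[str, str]]:
    """Structured parser for event pages.

    Assumes (hypothetically) that event pages carry class="event".
    Returns None when the page is not an event.
    """
    if 'class="event"' not in html:
        return None
    return {
        "type": "event",
        "title": _first(r"<h1[^>]*>(.*?)</h1>", html) or "",
        "date": _first(r"<time[^>]*>(.*?)</time>", html) or "",
    }


def parse_default(html: str) -> Dict[str, str]:
    """Default parser: keep the page body as plain text."""
    body = _first(r"<body[^>]*>(.*?)</body>", html) or html
    text = re.sub(r"<[^>]+>", " ", body)  # drop tags
    return {"type": "page", "content": " ".join(text.split())}


# One parser per structured content type; order defines priority.
PARSERS = [parse_event]


def signature(html: str) -> str:
    """Content signature used to detect changes between crawls."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def process_page(
    url: str, html: str, seen: MutableMapping[str, str]
) -> Optional[Dict[str, str]]:
    """Return a parsed record, or None if the page is unchanged."""
    sig = signature(html)
    if seen.get(url) == sig:
        return None  # same signature as the previous crawl: skip
    seen[url] = sig
    for parser in PARSERS:
        record = parser(html)
        if record is not None:
            return record
    return parse_default(html)  # no specific parser applied
```

Each record returned by `process_page` is what the next pipeline step would send on to Drupal; a `None` result means the page can be skipped for this crawl.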

Bingo! 

I will be happy to answer your queries in the comments, by DM, or by email. Hit me up for a complete demo or to discuss opportunities at [email protected]