How to set up a Sitecore Search feed crawler + Product feedback

The Sitecore Search feed crawler features are described in the documentation, but when I tried to set it up on my own, I ran into a few gaps. This blog note is about how to set up a feed crawler. To make it interesting, I first asked ChatGPT to see whether it gives any useful results; in my case the answer seemed irrelevant, since it doesn't go into the details of setting up SFTP for a CSV or JSON file, and that is the crux of the feed crawler:

ChatGPT - How to setup feed crawler?

Sitecore Search Feed Crawler Process:

Purpose:

The feed crawler is useful in scenarios where you want to decouple the content and then feed that content to the crawler for indexing. In the current world of DXPs and third-party integrations, the feed crawler seems to me the most useful utility. Note that the Sitecore Search feed crawler works only with files uploaded to the Search domain upload directory (called the base path in feed crawler parlance) and doesn't work with external stand-alone URLs. The feed crawler can extract from JSON and CSV files, and it can also accommodate a zip file containing one of these file types.

Pre-requisites:

1. FileZilla client

2. You need the SFTP server details of where to upload the files for your search domain. You can get these from the Sitecore Support team. The SFTP server details look like this:

  • Host: sftp.rfksrv.com
  • Username: 18246680
  • Password: g288ec196bd909g03d1

Upload file: 

Since this is a test drive, I decided to keep this CSV file in alignment with the content entity.

content.csv:

id,type,description
1,article,An in-depth guide to getting started with Sitecore Search.
2,product,A powerful search service for headless architectures.
3,blog,Tips and tricks for optimizing search results in Sitecore.
4,video,A walkthrough of the Sitecore Search configuration process.
5,documentation,Official API documentation for Sitecore Search integration.
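Since the feed crawler also accepts JSON, the same records can be serialized as a JSON feed. Below is a minimal sketch; the output file name content.json and the flat list-of-objects structure are my assumptions, and the structure your document extractor actually expects depends on your entity and attribute setup.

# csv_to_json.py - a minimal sketch converting the content.csv above into a
# JSON feed file. The file name content.json and the flat list-of-objects
# layout are assumptions, not a documented Sitecore Search requirement.
import csv
import json

with open("content.csv", newline="", encoding="utf-8") as f:
    records = list(csv.DictReader(f))   # each row -> {"id": ..., "type": ..., "description": ...}

with open("content.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

print(f"Wrote {len(records)} records to content.json")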

SFTP setup:

Although you can connect to the FTP server from Windows via Add Network Location, this wouldn't work since it is an SFTP connection, and you will end up with an FTP folder error that reads "A connection with the server could not be established":



Unless you have a way to pass both the username and password as part of connecting over SFTP, the Windows network connection is not useful. The time-tested approach is to use the FileZilla client: provide the host name, username and password, and make sure to set the port to 22 since this is an SFTP connection:
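If you would rather script the upload instead of (or alongside) FileZilla, a rough Python sketch using the paramiko library does the same thing. The host and credentials below are the sample values from above, and the remote "upload" folder name is my assumption; use whatever Sitecore Support gives you for your domain.

# sftp_upload.py - a rough sketch of uploading the feed file over SFTP with
# paramiko (pip install paramiko). Host, credentials and the remote upload
# path are placeholders/assumptions; use the values for your search domain.
import paramiko

HOST = "sftp.rfksrv.com"
PORT = 22                      # SFTP, not plain FTP
USERNAME = "18246680"
PASSWORD = "g288ec196bd909g03d1"

transport = paramiko.Transport((HOST, PORT))
transport.connect(username=USERNAME, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)
try:
    sftp.put("content.csv", "upload/content.csv")  # local file -> remote path
finally:
    sftp.close()
    transport.close()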


Crawler setup:

The Sitecore Search feed crawler needs files to be uploaded to the upload folder in the Sitecore Search FTP server specific to the domain. Once the file is FTP-ed to the upload folder or one of its sub-folders, the crawler can access the upload folder and the sub-paths specified in the Document Extractors base path.
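For example (the exact path conventions here are my assumption from the UI rather than documented behaviour), a layout like this on the SFTP server:

upload/
    content/
        content.csv

would pair with a base path pointing at the content sub-folder, while a file dropped straight into upload/ pairs with the root base path.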

Source Information:



General:



Available Locales:


Tags Definition:



Document Extractors: 


Taggers:

The base functionality offered is to specify the content tag as well as the separator for a CSV file, but you can also write a JS function for attribute extraction:
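To make the separator idea concrete, here is a rough Python illustration of what separator-based extraction boils down to. This is only my mental model, not Sitecore's implementation; the separator and attribute names come from the content.csv above.

# A conceptual sketch (not Sitecore code) of separator-based extraction:
# each line of the feed file is split on the configured separator and the
# pieces are mapped onto the entity attributes in order.
SEPARATOR = ","                              # the separator configured on the extractor
ATTRIBUTES = ["id", "type", "description"]   # attributes of the content entity

def extract_line(line: str) -> dict:
    values = line.rstrip("\n").split(SEPARATOR, len(ATTRIBUTES) - 1)
    return dict(zip(ATTRIBUTES, values))

with open("content.csv", encoding="utf-8") as feed:
    next(feed)                               # skip the header row
    documents = [extract_line(line) for line in feed]

print(documents[0])   # {'id': '1', 'type': 'article', 'description': '...'}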


However, when I uploaded a JSON file, I received the following error on job publish. I wasn't sure if this error is relevant for a JSON file, since separators apply to CSV files, so I raised a support ticket for the same -

Missing 'separator' for feed file extractor with index: 0



Leaving transformations and incremental updates as-is:


In case of transformation, HTML escape and unescape are mainly used for HTML-encoded strings (for example, & versus &amp;) in fields like description that are part of the JSON feed. Based on the type (escape/unescape), the parser converts the value accordingly.
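For instance, Python's html module shows the same escape/unescape behaviour the transformation describes; the description value here is just made-up sample data.

# Escape/unescape of HTML entities, mirroring what the transformation does
# for HTML-encoded fields such as description in a JSON feed.
import html

raw = "Search & personalisation for headless sites"
escaped = html.escape(raw)       # 'Search &amp; personalisation for headless sites'
print(escaped)
print(html.unescape(escaped))    # back to 'Search & personalisation for headless sites'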

Once all values are selected, publish the job. After you add files to the SFTP upload folder or any of its sub-folders, recrawl the content; the job will queue and run thereafter. Note that the feed crawler doesn't have a trigger similar to the web crawler's. The other feature the feed crawler lacks is the ability to validate extraction.

Errors:

One of the generic errors the crawler throws is "Error while reading document or extractor function". This can occur, for instance, when one of the following is true (a quick pre-upload check for the last two is sketched after this list):

1. Base path is incorrect or not reachable
2. File name is incorrect or file not present
3. CSV file name doesn't match entity name
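The last two causes are easy to rule out before uploading. A small sanity check along these lines (the entity name and file name are the ones used in this post) saves a round trip to the crawler logs:

# pre_upload_check.py - a quick sanity check before FTP-ing the feed file,
# covering causes 2 and 3 above: the file exists, its name matches the
# entity ("content" in this post), and the CSV actually parses.
import csv
import os
import sys

ENTITY = "content"
FEED_FILE = "content.csv"

if not os.path.isfile(FEED_FILE):
    sys.exit(f"{FEED_FILE} not found - check the file name and path")

if os.path.splitext(os.path.basename(FEED_FILE))[0] != ENTITY:
    sys.exit(f"{FEED_FILE} does not match the entity name '{ENTITY}'")

with open(FEED_FILE, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(f"OK: {len(rows)} rows, columns: {', '.join(rows[0].keys())}")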

Result:

Once the job has run, you can check whether it was successful by navigating to Analytics > Sources > Overview:



Clicking the bar in the graph gives more detailed information:



You can also check the individual indexed content from the content collection section:


Conclusion:

In my opinion, the feed crawler is a good fit when you want to test an external API integration with some dummy data. You could also get a head start along these lines while the external API is still being developed. Out of all the crawlers, my vote goes to the feed crawler since it aids rapid application development and integration. In big enterprises, the feed crawler would be a very good fit since it can be used to decentralise the integration process.

