We are looking for a system administrator to update an Apache Solr and Nutch search system to limit what is captured while crawling a site.
We are using Nutch to spider the URLs and would like to update the system so that the header, footer, and sidebar are excluded from indexing. Alternatively, we would like to tell Nutch specifically which section(s) of the page to index.
The issue we are running into is that a search for 'Announcements' returns every page, because links to 'Announcements' appear in the header and footer of every page.
We would also like to update the indexing of each page to capture and return a specific meta tag containing a page summary.
Apache Nutch v 1.12 and Solr 7.2.1 are installed on the server.
When applying, please:
1) Include the following line at the top: 'Priority should be given to those who read the full job description'
2) Let us know how many hours you expect this to take and when you can start/finish.
About the recruiter: Devendra Mishra, from Sergipe, Brazil. Member since May 20, 2018.