The goal of this project is to scrape, sanitize, and organize the California State Bar Association data set. Specifically, the scraper must crawl the website for all attorney records. The data is classified by a variety of status parameters, and we are only interested in a specific subset.
From the individual attorney results, each unique data point will need to be stored against a key (potentially the state bar number). We do not have a specified schema, so we expect the schema to be developed based on the data. However, we will provide a list of specific fields that we want to reconcile.
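As a rough illustration of storing data points against a key, the sketch below merges scraped records into a dictionary keyed by bar number. The field names (`bar_number`, `name`, `email`) and the "first non-empty value wins" merge policy are assumptions made for the example, not requirements from this brief.

```python
from typing import Dict


def upsert(store: Dict[str, dict], record: dict, key_field: str = "bar_number") -> None:
    """Merge a scraped record into the store, keyed by bar number (assumed key)."""
    key = str(record[key_field]).strip()
    existing = store.setdefault(key, {})
    for field, value in record.items():
        # Keep the first non-empty value seen for each field (assumed policy).
        if value not in (None, "") and not existing.get(field):
            existing[field] = value


# Hypothetical sample data: two partial results for the same attorney.
store: Dict[str, dict] = {}
upsert(store, {"bar_number": "123456", "name": "Jane Doe", "email": ""})
upsert(store, {"bar_number": "123456", "email": "jane@example.com"})
```

Keying on a single stable identifier keeps repeated crawls idempotent: re-scraping the same attorney updates one entry rather than creating a duplicate.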
For example:
Matching and filling out a "company" field based on physical address and/or e-mail address when the company name is not present in the results.
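One possible way to approach this kind of reconciliation is sketched below: records that already carry a company name are used to build lookups by e-mail domain and by normalized address, and those lookups then fill the gaps. The field names, the list of free-mail domains, and the address normalization are all assumptions for illustration; real matching rules would come from the provided field list.

```python
import re
from typing import Dict, List, Optional

# Assumed list of public mail providers whose domain says nothing about the firm.
FREE_MAIL = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com", "aol.com"}


def email_domain(email: Optional[str]) -> Optional[str]:
    """Return the lowercased domain of an e-mail address, or None."""
    if not email or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].strip().lower() or None


def normalize_address(address: Optional[str]) -> str:
    """Lowercase and collapse punctuation/whitespace so similar addresses compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", (address or "").lower()).strip()


def fill_company(records: List[dict]) -> None:
    """Fill empty 'company' fields in place from records sharing a domain or address."""
    by_domain: Dict[str, str] = {}
    by_address: Dict[str, str] = {}
    for r in records:
        company = (r.get("company") or "").strip()
        if not company:
            continue
        domain = email_domain(r.get("email"))
        if domain and domain not in FREE_MAIL:
            by_domain.setdefault(domain, company)
        addr = normalize_address(r.get("address"))
        if addr:
            by_address.setdefault(addr, company)
    for r in records:
        if (r.get("company") or "").strip():
            continue
        domain = email_domain(r.get("email"))
        addr = normalize_address(r.get("address"))
        # Prefer the e-mail domain match, then fall back to the address match.
        r["company"] = by_domain.get(domain) or by_address.get(addr) or ""


# Hypothetical sample records.
records = [
    {"company": "Acme Law LLP", "email": "a@acmelaw.com", "address": "1 Main St., Suite 200"},
    {"company": "", "email": "b@acmelaw.com", "address": ""},
    {"company": "", "email": "c@gmail.com", "address": "1 main st suite 200"},
    {"company": "", "email": "d@gmail.com", "address": "9 Other Rd"},
]
fill_company(records)
```

Exact matching on a normalized address is deliberately conservative; fuzzy matching could recover more companies at the cost of false positives.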
Keywords will also be provided that must be used to map records to specific fixed values.
Specific rules will be provided for how the presence of certain keywords, drawn from data we possess, should add values to a record. There are also rules about which data we want to exclude.
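The keyword and exclusion rules could be expressed as plain data, as in the minimal sketch below. The keywords, target fields, fixed values, and excluded statuses shown here are hypothetical placeholders; the actual rules are the ones to be provided.

```python
import re
from typing import Optional

# Hypothetical rules: keyword -> (field to set, fixed value to assign).
KEYWORD_RULES = {
    "immigration": ("practice_area", "Immigration"),
    "patent": ("practice_area", "Intellectual Property"),
}

# Hypothetical statuses to drop from the output.
EXCLUDED_STATUSES = {"disbarred", "deceased"}


def apply_rules(record: dict) -> Optional[dict]:
    """Return an enriched copy of the record, or None if it should be excluded."""
    status = (record.get("status") or "").strip().lower()
    if status in EXCLUDED_STATUSES:
        return None
    # Search all non-empty field values for whole-word keyword matches.
    text = " ".join(str(v) for v in record.values() if v).lower()
    enriched = dict(record)
    for keyword, (field, value) in KEYWORD_RULES.items():
        if re.search(rf"\b{re.escape(keyword)}\b", text):
            # Do not overwrite a value that is already present.
            enriched.setdefault(field, value)
    return enriched


kept = apply_rules({"status": "Active", "company": "Bay Area Patent Group"})
dropped = apply_rules({"status": "Disbarred", "company": "Example Firm"})
```

Keeping the rules in a dictionary rather than in code means they can be loaded from a provided file and changed without touching the parser.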
As an output, we will need all of the data in a single file in a machine-readable format (CSV, XML, etc.). Additionally, providing the data via a documented API that delivers the output in XML/JSON could be part of the initial scope or another phase.
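For the single-file output, a sketch using only the Python standard library might look like the following. The column names are assumptions; since no schema is specified, the example derives the header from whatever fields appear in the records.

```python
import csv
import io
import json
from typing import List, Optional


def to_csv(records: List[dict], fieldnames: Optional[List[str]] = None) -> str:
    """Serialize records to CSV text; missing fields become empty cells."""
    if fieldnames is None:
        # Derive a stable header from every field seen in any record.
        fieldnames = sorted({key for r in records for key in r})
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fieldnames, restval="", extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Hypothetical sample records with differing fields.
sample = [{"bar_number": "1", "name": "A"}, {"bar_number": "2"}]
csv_text = to_csv(sample)
json_text = json.dumps(sample)  # the same records in the form an API could return
```

To write the file itself, the returned text can be saved with `open(path, "w", newline="", encoding="utf-8")`.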
While this custom scraper and parser project is initially focused on the California State Bar Association's members, there may be opportunities to work on subsequent projects around other state bar data depending on the success of this initial engagement.
About the recruiter: Abhishak Banara, from North Carolina, United States. Member since Mar 14, 2020.