Remote Data Mining And Management Job In Data Science And Analytics

Implement >512 max_seq_len for Google BERT (pytorch-pretrained-bert) for long articles


Problem description: I want to use this multi-label classifier for Google BERT: https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d

However, by default, when Google BERT converts a document to features, it has a max sequence length of up to 512 WordPiece tokens. It will truncate text from articles which are longer than that.

The SQuAD example for BERT already implements a sliding-window solution (doc_stride) for longer articles.

I tried to splice it into the multi-label classifier but didn't get it to work.


Deliverable: I want a solution to this problem of ingesting long articles (>512 WordPiece tokens) into Google BERT, with code in a Jupyter notebook. For example, if the article is 1024 words long, then using the doc_stride solution it would perhaps be ingested as 2x512 sequences; the classification would then be done across both sequences and the arg_max of the predictions would be returned.
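One way to picture the doc_stride idea described above is the minimal sketch below. It is not the requested deliverable and not the code from the SQuAD example; `sliding_windows` and `pool_max` are hypothetical helper names, the input is assumed to be an already-tokenized list of WordPiece tokens, and pooling per-label scores with an element-wise max is an assumed aggregation rule (averaging would be another option).

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[str], max_seq_len: int = 512,
                    doc_stride: int = 256) -> List[List[str]]:
    """Split WordPiece tokens into overlapping windows of at most max_seq_len.

    Hypothetical helper: two slots per window are reserved for [CLS] and
    [SEP], and each new window starts doc_stride tokens after the last one.
    """
    body_len = max_seq_len - 2  # room left after [CLS] and [SEP]
    if body_len <= 0 or doc_stride <= 0:
        raise ValueError("max_seq_len must exceed 2 and doc_stride must be positive")
    windows = []
    start = 0
    while True:
        chunk = list(tokens[start:start + body_len])
        windows.append(["[CLS]"] + chunk + ["[SEP]"])
        if start + body_len >= len(tokens):  # this window reached the end
            break
        # Never step further than one window, so no token is skipped.
        start += min(doc_stride, body_len)
    return windows


def pool_max(window_probs: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over windows: one pooled score per label.

    Assumed rule: a label counts for the article if any window scores it highly.
    """
    return [max(scores) for scores in zip(*window_probs)]
```

Under these assumptions, a 1024-token article with doc_stride=256 yields four overlapping windows. Even with a non-overlapping stride of 510 it yields three windows rather than two, because [CLS] and [SEP] take two of the 512 slots. Each window would be passed through the classifier separately, and `pool_max` would combine the per-window label probabilities into one prediction for the article.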

Comments and documentation of how you created the solution would also be appreciated.
About the recruiter
Member since Mar 14, 2020
Sinyalir.com
from Chhattisgarh, India

Skills & Expertise Required

Data Science & Analytics, Data Mining & Management

Open for hiring
Apply before - May 16, 2024

Work from Anywhere

40 hrs / week

Fixed Type

Remote Job

$191.66

Cost


Similar Projects

Data Engineer

- Design, construct, install, test and maintain data management systems.
- Build high-performance algorithms, predictive models, and prototypes.
- Ensure that all systems meet the business/company requirements as well as industry practices.

Google Analytics set-up with Shopify and ReCharge

We have a customized Shopify store in which you can either make a one-time purchase or enter into a subscription. For subscription management we use ReCharge.
We want to set up Google Analytics so that we can track all marketing channels, includ...

Deep Learning based Number plate recognition using OCR

For the dataset of images that will be provided, you have to create your own deep learning based OCR model.
The OCR model should have the following properties:
A. Should have an accuracy of more than 95% on the dataset that I will provide
B. S...