I am running a large-scale computer vision/machine learning platform that works by processing the following:
1) Live streams (RTSP) [https://en.wikipedia.org/wiki/Real_Time_Streaming_Protocol]
2) Videos stored in Amazon S3
3) Images extracted from videos stored in Amazon S3
I want to re-architect my solution for scalability and flexibility as well as fault tolerance.
Currently,
For Videos/Images
1) A video is uploaded to Amazon S3, which triggers an AWS Lambda function that publishes a task to Amazon SQS.
2) I have machine learning code (Python 2.7/TensorFlow/OpenCV) that pulls tasks from Amazon SQS, downloads the videos from Amazon S3, processes them and publishes JSON results to another Amazon SQS queue.
3) For RTSP streams (live), I manually deploy to servers to continuously process each RTSP stream using similar code (Python 2.7/TensorFlow/OpenCV).
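The S3-to-SQS step in 1) can be sketched as follows. This is only an illustration under assumptions, not the code currently deployed: the task fields ("module", "bucket", "key"), the environment variable names TASK_QUEUE_URL and TASK_MODULE, and the injectable `send` argument are all hypothetical. The boto3 import is deferred so the message-building part runs without AWS credentials.

```python
import json
import os

try:
    from urllib.parse import unquote_plus  # Python 3
except ImportError:
    from urllib import unquote_plus  # Python 2.7


def build_tasks(s3_event, module_name):
    """Turn an S3 event notification into a list of task dicts."""
    tasks = []
    for record in s3_event.get("Records", []):
        s3 = record["s3"]
        tasks.append({
            "module": module_name,  # hypothetical field naming the use case
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return tasks


def handler(event, context, send=None):
    """Lambda entry point; `send` is injectable so the sketch runs offline."""
    if send is None:
        import boto3  # only needed when actually running inside AWS
        sqs = boto3.client("sqs")
        queue_url = os.environ["TASK_QUEUE_URL"]  # hypothetical variable name

        def send(body):
            sqs.send_message(QueueUrl=queue_url, MessageBody=body)

    module_name = os.environ.get("TASK_MODULE", "unknown")  # hypothetical
    tasks = build_tasks(event, module_name)
    for task in tasks:
        send(json.dumps(task))
    return len(tasks)
```

Carrying the module name inside the message, rather than encoding it in the queue name, is one possible way to keep the number of queues from growing with every use case.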
Challenges:
-- I have no way to autoscale my worker nodes (which process videos) based on incoming load (the number of messages in Amazon SQS).
-- Each new use case requires additional Amazon SQS queues (for development, staging and production). This becomes very difficult to maintain.
-- Each task requires a different neural network depending on the type of task, the image quality and so on. Sometimes I have to use deep neural networks that require GPU support, and sometimes I can run the task on a lightweight CPU instance. So each task needs a 'weight' associated with it and must be distributed to a matching instance (for example, Task A requires a GPU, so it should run on a GPU-enabled instance, but Task B can run on CPU or GPU because it is less compute intensive).
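Two of the challenges above, routing by 'weight' and scaling on queue backlog, can be sketched in a few lines. This is a minimal sketch under assumptions: the queue names, the "weight" field with values "light" and "heavy", and the per-worker target are hypothetical. In practice the backlog figure could come from the SQS queue attribute ApproximateNumberOfMessages.

```python
import math

GPU_QUEUE = "tasks-gpu"  # hypothetical queue names
CPU_QUEUE = "tasks-cpu"


def route_task(task):
    """Pick a queue from the task's declared weight."""
    # A missing weight is treated as heavy, so the task lands on a GPU instance.
    weight = task.get("weight", "heavy")
    return CPU_QUEUE if weight == "light" else GPU_QUEUE


def desired_workers(backlog, per_worker, min_workers=0, max_workers=10):
    """Workers needed so each one handles at most `per_worker` queued tasks."""
    needed = int(math.ceil(backlog / float(per_worker)))
    # Clamp to the allowed pool size.
    return max(min_workers, min(max_workers, needed))
```

Because light tasks can run on either kind of instance, GPU workers could also poll the CPU queue when their own queue is empty; that policy is left out of the sketch.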
For live feeds:
-- Live RTSP is very difficult because frames are stored in memory. If I don't process each frame immediately, the program either crashes or skips to the next live frame, so I lose frames.
-- Each RTSP task also has a 'weight', so I need to be able to run some RTSP tasks on GPU and some on CPU.
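One common way to keep a slow consumer from crashing a live reader is to decouple capture from processing with a bounded buffer. The sketch below is an assumption about how that might look, not the existing code: `LatestFrameBuffer` and `capture_loop` are hypothetical names, and `read_frame` stands in for any callable returning an `(ok, frame)` pair, which is the shape `cv2.VideoCapture.read` returns.

```python
import threading
from collections import deque


class LatestFrameBuffer(object):
    """Keeps only the newest `maxlen` frames; evicted frames are counted."""

    def __init__(self, maxlen=1):
        self._frames = deque(maxlen=maxlen)
        self._lock = threading.Lock()
        self.dropped = 0

    def put(self, frame):
        with self._lock:
            if len(self._frames) == self._frames.maxlen:
                self.dropped += 1  # the oldest frame is about to be evicted
            self._frames.append(frame)

    def get(self):
        """Return the oldest buffered frame, or None if the buffer is empty."""
        with self._lock:
            return self._frames.popleft() if self._frames else None


def capture_loop(read_frame, buffer, stop_event):
    """Reader thread body: pull frames as fast as the source delivers them."""
    while not stop_event.is_set():
        ok, frame = read_frame()
        if not ok:
            break  # source ended or failed; the caller decides what to do next
        buffer.put(frame)
```

This does not remove frame loss. It bounds memory and makes the loss explicit through the `dropped` counter, which can indicate when a stream needs a faster (for example GPU) instance.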
General issues:
-- I need to maintain a 'status page' so that I can see if all processes are running correctly for better DevOps/Debugging.
-- I have no centralised logging for debugging
-- If the source RTSP stream is not working (network issues or the live stream is down), the instance should be freed up to perform another task until that stream is back.
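The last point, releasing an instance while a stream is down, could be handled by parking failed streams with a growing retry delay. The following is a hedged sketch only: `StreamScheduler`, its method names and the delay values are hypothetical, and the current time is passed in explicitly so the behaviour is easy to test.

```python
import heapq


class StreamScheduler(object):
    """Hands out stream URLs to workers and parks the ones that fail."""

    def __init__(self, base_delay=5.0, max_delay=300.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._ready = []     # stream URLs waiting for a worker
        self._parked = []    # heap of (retry_at, url)
        self._failures = {}  # url -> consecutive failure count

    def add(self, url):
        self._ready.append(url)

    def report_failure(self, url, now):
        """Park `url` with exponential backoff and return the delay used."""
        count = self._failures.get(url, 0) + 1
        self._failures[url] = count
        delay = min(self.max_delay, self.base_delay * 2 ** (count - 1))
        heapq.heappush(self._parked, (now + delay, url))
        return delay

    def report_success(self, url):
        """Reset the backoff once a stream has been read successfully."""
        self._failures.pop(url, None)

    def next_stream(self, now):
        """Return the next stream to process, or None if none are due."""
        while self._parked and self._parked[0][0] <= now:
            self._ready.append(heapq.heappop(self._parked)[1])
        return self._ready.pop(0) if self._ready else None
```

A worker whose stream fails would call `report_failure` and then `next_stream` to pick up other work. The same object could also feed a status page, since it knows which streams are parked and for how long.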
All my use cases are structured as modules: each module is a separate package in my Python project, with its own instructions for running. Each module runs on either RTSP or video, never both. However, I can run Module A on a live stream (RTSP) and Module B on videos independently.
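The module layout described above could be made explicit with a small registry that records each module's single input type. This is an illustrative sketch only: the `register` decorator, `run_module` and the example `module_a` are hypothetical and do not reflect the actual package structure.

```python
VALID_INPUTS = ("rtsp", "video")
_REGISTRY = {}  # module name -> (input type, entry point)


def register(name, input_type):
    """Decorator recording a module's entry point and its one input type."""
    if input_type not in VALID_INPUTS:
        raise ValueError("input_type must be one of %r" % (VALID_INPUTS,))

    def decorator(func):
        if name in _REGISTRY:
            raise ValueError("module %r is already registered" % name)
        _REGISTRY[name] = (input_type, func)
        return func
    return decorator


def run_module(name, input_type, source):
    """Run a registered module, refusing an input type it was not built for."""
    expected, func = _REGISTRY[name]
    if input_type != expected:
        raise ValueError("module %r runs on %s, not %s"
                         % (name, expected, input_type))
    return func(source)


@register("module_a", "rtsp")
def module_a(source):
    # Placeholder body; a real module would open the stream and run a model.
    return {"module": "module_a", "source": source}
```

A worker could then dispatch on the "module" field of an incoming task without knowing anything about the module's internals.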
About the recruiter: Marko N., from Utah, United States. Member since Jul 18, 2017.