How I Power My Business Automation with Celery

Celery is awesome.

Celery is an asynchronous task queue/job queue based on distributed message passing.

The broker can be RabbitMQ, Redis, Amazon SQS, or any other messaging system Celery ships a transport for, such as the AMQP-speaking Apache Qpid.

Celery is focused on real-time operation, but supports scheduling as well.

From the Celery website:

"The execution units, called tasks, are executed concurrently on a single or more worker servers using multiprocessing, Eventlet, or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready)."

Celery is used in production systems to process millions of tasks a day.

It may be the best kept secret in the Python community. I use it for everything. Most of all, I use it for automation.

You see, I don't like having to sit in front of a computer to execute a task. I want to design and build systems that are fully automated. These tasks can schedule and execute themselves, store the result, update the database, send email, call remote APIs, check their own health status, and so much more.

I recently completed a project for a client that does an extensive bit of web scraping, parsing, storage and retrieval of data. Lots and lots of data. 24/7/365.

A Flask front-end web UI allows admin users to create Category Strategic Requirements, which, depending on the topic, build a dynamic list of remote endpoints to scrape. These sites (endpoints) are generated from dozens of content category searches using a user-defined selection of search engines. The site content must maintain certain content relevancy ratings per proprietary algorithms designed by the client.

When the project's manager described the requirements to me, I knew right away that Celery was going to be a major player in this project. The initial requirement was to create a system that could process up to 1 million requests per day, with a proposed ceiling of approximately 3 million per day in the first year.

This project, named reparcs-web-api, has been online and running for more than 15 months now. reparcs has processed more than 435 million requests so far, with over 1 billion broker messages created, queued, and executed by roughly 100 concurrent workers.

On the current growth curve, I fully expect it to reach 1 billion scraper requests sometime around the middle of next year.

Underneath it all is a ridiculously simple and inexpensive project architecture. At current prices, these two instances run for about $45.00/month.

The foundation: Ubuntu 16.04 LTS

1 x Amazon EC2 t2.medium (2 vCPU and 4 GB RAM) with a 512 GB SSD mounted at /dev/sdb1 (the default data store)

1 x Amazon EC2 t2.small (1 vCPU and 2 GB RAM)

The stack:

MySQL 5.7
SQLAlchemy
Python 3.5
RabbitMQ Server
Celery 4.2

For a complete example of this Celery project, please check out my GitHub repository.

I will highlight several of the key modules for this Celery project.

project_root/  
    -- app/
       -- __init__.py
       -- models.py
       -- views.py
       -- tasks.py

    -- celery_worker.py
    -- requirements.txt
    -- config.py
    -- manage.py
    -- tests/

__init__.py

This is where the application is created.
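A simplified sketch of how that might look, assuming a Flask application factory paired with a module-level Celery instance; the config names are illustrative, not the client's:

# app/__init__.py -- simplified sketch; config keys are illustrative
from celery import Celery
from flask import Flask

from config import config   # hypothetical dict of config classes (see config.py below)

celery = Celery(__name__)


def create_app(config_name='default'):
    """Application factory: build the Flask app and bind Celery to its settings."""
    app = Flask(__name__)
    app.config.from_object(config[config_name])

    # Point Celery at the same broker and result backend the Flask app uses.
    celery.conf.update(
        broker_url=app.config['CELERY_BROKER_URL'],
        result_backend=app.config['CELERY_RESULT_BACKEND'],
    )
    return app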

celery_worker.py

The Celery worker entry point, which drives the periodic tasks: the system's heartbeat.
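Roughly, it looks like the sketch below. The heartbeat task name and its interval are assumptions; the real project has many more schedule entries:

# celery_worker.py -- sketch; the heartbeat task and interval are assumptions
from celery.schedules import crontab

from app import celery, create_app

# Create the Flask app so tasks can reach its configuration and database.
flask_app = create_app('production')

# The beat schedule is the heartbeat: it enqueues work with no one at a keyboard.
celery.conf.beat_schedule = {
    'health-check-every-five-minutes': {
        'task': 'app.tasks.health_check',
        'schedule': crontab(minute='*/5'),
    },
}

Keep in mind that beat_schedule entries only fire when a beat process is running, either as a separate celery beat process or via the worker's -B flag.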

tasks.py

The tasks module contains all of the registered tasks in the Celery project. When the worker executes a task, it runs the corresponding function in this module. The worker process must be restarted if any changes are made to a task function.
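A representative (and heavily simplified) task might look like this; the scraping logic here is a stand-in, not the client's parser:

# app/tasks.py -- sketch of one registered task; the scrape logic is a stand-in
import requests

from app import celery


@celery.task(bind=True, max_retries=3, default_retry_delay=60)
def scrape_endpoint(self, url):
    """Fetch one remote endpoint and retry on transient network errors."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Put the task back on the queue instead of failing outright.
        raise self.retry(exc=exc)
    return response.text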

config.py

The config.py module contains all of the configuration data for the Celery application, including the database credentials and which environment is in use.
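Something along these lines, assuming per-environment config classes; the variable names and the MySQL URI are examples, not real credentials:

# config.py -- sketch; names and connection strings are examples only
import os


class Config:
    SECRET_KEY = os.environ.get('SECRET_KEY', 'change-me')
    CELERY_BROKER_URL = os.environ.get(
        'CELERY_BROKER_URL', 'amqp://guest:guest@localhost:5672//')
    CELERY_RESULT_BACKEND = os.environ.get('CELERY_RESULT_BACKEND', 'rpc://')
    SQLALCHEMY_TRACK_MODIFICATIONS = False


class DevelopmentConfig(Config):
    DEBUG = True
    SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://dev:dev@localhost/scraper_dev'


class ProductionConfig(Config):
    SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL')


config = {
    'development': DevelopmentConfig,
    'production': ProductionConfig,
    'default': DevelopmentConfig,
}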

views.py

The embedded Flask module can display the status of any completed or running task.
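A status endpoint like the sketch below is enough for that; the blueprint and route names are my own, not the project's:

# app/views.py -- sketch of a task status endpoint; route names are assumptions
from celery.result import AsyncResult
from flask import Blueprint, jsonify

from app import celery

main = Blueprint('main', __name__)


@main.route('/status/<task_id>')
def task_status(task_id):
    """Report whether a task is pending, started, finished, or failed."""
    result = AsyncResult(task_id, app=celery)
    return jsonify({
        'task_id': task_id,
        'state': result.state,    # PENDING, STARTED, SUCCESS, FAILURE, ...
        'result': result.result if result.successful() else None,
    })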

models.py

Our project's database schema.
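For example, a scrape-result table might be modeled like this, assuming Flask-SQLAlchemy; the table and columns are illustrative only:

# app/models.py -- sketch; table and column names are illustrative only
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()


class ScrapeResult(db.Model):
    __tablename__ = 'scrape_results'

    id = db.Column(db.Integer, primary_key=True)
    url = db.Column(db.String(2048), nullable=False)
    status_code = db.Column(db.Integer)
    relevancy_score = db.Column(db.Float)
    created_at = db.Column(db.DateTime, server_default=db.func.now())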

manage.py

This module contains management functions for Flask-Mail, the IPython shell, and the application's routes.
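A sketch of the shape of it, assuming Flask-Script (common for Flask apps of this vintage); the shell context contents are my own guesses:

# manage.py -- sketch using Flask-Script; the shell context is an assumption
from flask_script import Manager, Shell

from app import celery, create_app
from app.models import db

app = create_app('development')
manager = Manager(app)


def make_shell_context():
    # Objects pre-loaded into the interactive shell (IPython, if installed).
    return dict(app=app, db=db, celery=celery)


manager.add_command('shell', Shell(make_context=make_shell_context))

if __name__ == '__main__':
    manager.run()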

Now, when you are ready to deploy your Celery project, activate your Python virtual environment and run the following command to start up the task queue.

ubuntu@server:~$ cd /path_to_project  
ubuntu@server:~/path_to_project$ celery worker -A app.celery --loglevel=info  

The output shows the running task queue:

-------------- celery@precision-5810 v4.0.2 (latentcall)
---- **** ----- 
--- * ***  * -- Linux-4.15.0-36-generic-x86_64-with-Ubuntu-16.04-xenial 2018-10-09 16:05:00
-- * - **** --- 
- ** ---------- [config]
- ** ---------- .> app:         __main__:0x7f42212a8240
- ** ---------- .> transport:   amqp://guest:**@localhost:5672//
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 8 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery


[tasks]


[2018-10-09 16:05:00,810: INFO/MainProcess] Connected to amqp://guest:**@127.0.0.1:5672//
[2018-10-09 16:05:00,830: INFO/MainProcess] mingle: searching for neighbors
[2018-10-09 16:05:01,863: INFO/MainProcess] mingle: all alone
[2018-10-09 16:05:01,880: INFO/MainProcess] celery@precision-5810 ready.

Celery is a capable and trusted Python library that is at the heart of many of my Python projects. Celery can process millions of tasks every day, running with very little CPU (< 2%) and memory (< 5%) utilization.

Let me know how you are using Celery in your day-to-day work.

Craig Derington

Full Stack Developer. Linux, Docker, Python, Celery, Flask, Django, Go, MySQL, MongoDB and Git. Modern, secure, high-performance applications capable of processing millions of transactions a day.
