How We Fixed Gunicorn Worker Errors in Our Flask App: A Real Troubleshooting Journey

By hientd, at: 22:20 Ngày 03 tháng 11 năm 2024

Thời gian đọc ước tính: 9 min read

How We Fixed Gunicorn Worker Errors in Our Flask App: A Real Troubleshooting Journey
How We Fixed Gunicorn Worker Errors in Our Flask App: A Real Troubleshooting Journey

How We Fixed Gunicorn Worker Errors in Our Flask App: A Real Troubleshooting Journey

 

When you manage production servers, there’s always a moment when something goes wrong just as you think everything is running smoothly. Recently, we faced exactly that—a real issue in our Python/Flask app running on Gunicorn. Our clients reported intermittent downtime with the app, and we quickly realized it was time to dig into the issue. This is the story of how we tackled it, what we learned, and the simple steps we’ll take in the future to avoid the same problem.

 

The Initial Setup

Our setup has a load balancer with two servers handling client requests. Under normal circumstances, the load balancer keeps traffic moving smoothly, but our clients started facing this on/off availability issue.

[2024-11-03 13:25:42 +0000] [26747] [INFO] Booting worker with pid: 26747
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/usr/lib/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
Call stack:
  File "/home/ubuntu/sample-app/venv/bin/gunicorn", line 8, in <module>
    sys.exit(run())
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/app/base.py", line 231, in run
    super().run()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 211, in run
    self.manage_workers()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 551, in manage_workers
    self.spawn_workers()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 623, in spawn_workers
    time.sleep(0.1 * random.random())
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
    self.reap_workers()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 533, in reap_workers
    os.WTERMSIG(status)
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/glogging.py", line 261, in warning
    self.error_log.warning(msg, *args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1320, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
    self.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
    self.callHandlers(record)
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/newrelic/hooks/logger_logging.py", line 87, in wrap_callHandlers
    return wrapped(*args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
    self.emit(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/usr/lib/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
    self.reap_workers()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 533, in reap_workers
    os.WTERMSIG(status)
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/glogging.py", line 261, in warning
    self.error_log.warning(msg, *args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1320, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
    self.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
    self.callHandlers(record)
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/newrelic/hooks/logger_logging.py", line 87, in wrap_callHandlers
    return wrapped(*args, **kwargs)
Message: 'Worker with pid %s was terminated due to signal %s'
Arguments: (26469, 9)
[2024-11-03 13:25:42 +0000] [26748] [INFO] Booting worker with pid: 26748
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/usr/lib/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
Call stack:
  File "/home/ubuntu/sample-app/venv/bin/gunicorn", line 8, in <module>
    sys.exit(run())
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/app/base.py", line 231, in run
    super().run()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 211, in run
    self.manage_workers()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 551, in manage_workers
    self.spawn_workers()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 623, in spawn_workers
    time.sleep(0.1 * random.random())
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
    self.reap_workers()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 533, in reap_workers
    os.WTERMSIG(status)
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/glogging.py", line 261, in warning
    self.error_log.warning(msg, *args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1320, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
    self.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
    self.callHandlers(record)
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/newrelic/hooks/logger_logging.py", line 87, in wrap_callHandlers
    return wrapped(*args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
    self.emit(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 998, in emit
    self.flush()
  File "/usr/lib/python3.6/logging/__init__.py", line 978, in flush
    self.stream.flush()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
    self.reap_workers()
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 533, in reap_workers
    os.WTERMSIG(status)
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/glogging.py", line 261, in warning
    self.error_log.warning(msg, *args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1320, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
    self.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
    self.callHandlers(record)
  File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/newrelic/hooks/logger_logging.py", line 87, in wrap_callHandlers
    return wrapped(*args, **kwargs)
Message: 'Worker with pid %s was terminated due to signal %s'
Arguments: (26469, 9)</module></stderr></module></stderr>


Here’s how we began troubleshooting:

  1. Verifying Processes: First, we checked all server processes to confirm they were running. Everything seemed fine at this level—no obvious crashes or hanging processes.
     

  2. Checking the Logs: Since all processes seemed to be running smoothly, we moved on to the logs managed by Supervisord. That’s where we saw a recurring pattern: the app was being terminated randomly. And in our error logs, we found these specific entries:

    RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'> </stderr>


    Alongside that, we kept seeing:

    Worker with pid %s was terminated due to signal %s


    With this combination of “reentrant” errors and worker terminations, it looked like we had a mix of memory issues, Gunicorn conflicts, and logging errors. But we needed more information.

 

Digging Deeper: Researching the Error

With these clues, we started digging online, searching terms like "reentrant call inside <_io.BufferedWriter>" and "Gunicorn RuntimeError.” Thankfully, we found discussions on GitHub from others who had encountered the same problem. Here are a few key discussions we referenced:

Reading through these, one issue became clear: we were running an outdated version of Gunicorn20.0.1. The latest release was 23.0.0, and it included fixes for logging and process management that directly addressed these issues. So we decided to try an upgrade.

Gunicorn Memory Issues

 

Testing a Solution

  1. The First Upgrade: Instead of jumping to the latest release right away, we upgraded to 21.2.0. We wanted to be cautious, so we monitored the app closely. After a few hours, we saw stability—no more random terminations, and everything was working as expected.
     

  2. Going for the Latest Version: Once we felt confident, we upgraded to 23.0.0, the latest Gunicorn version (which requires a newer Python version upgrade - more difficult but worth it). And since then, the app has been stable without a single reentrant error. Our clients’ access to the app returned to normal, and it was clear the upgrade resolved the issue.

 

What We Learned

This experience highlighted a few lessons:

  1. Logs are Your First Clue: Always start by confirming processes are running and thoroughly check the logs. Logs often contain critical hints.
     

  2. Search for Similar Issues: When you see a strange error, others have likely faced it too. Searching GitHub for specific terms can point you to helpful discussions and sometimes even exact solutions.
     

  3. Take It Slow with Upgrades: Rather than jumping to the latest version, it’s wise to incrementally upgrade and test. We saw immediate improvement with 21.2.0, and that gradual approach let us ensure stability before moving to 23.0.0.
     

  4. Stay Updated on Releases: Subscribing to GitHub repositories for libraries you depend on—like Gunicorn—keeps you in the loop on critical updates. It’s a simple way to avoid these sudden issues.

 

Conclusion

Troubleshooting production issues can be stressful, but taking a step-by-step approach really paid off here. This experience reminded us of the importance of monitoring dependencies and keeping them up to date. Hopefully, our story gives you a few insights to tackle similar challenges—or avoid them altogether.


Theo dõi

Theo dõi bản tin của chúng tôi và không bao giờ bỏ lỡ những tin tức mới nhất.