How We Fixed Gunicorn Worker Errors in Our Flask App: A Real Troubleshooting Journey
By hientd, at 22:20 on November 3, 2024
When you manage production servers, there's always a moment when something goes wrong just as you think everything is running smoothly. Recently, we faced exactly that: a real issue in our Python/Flask app running on Gunicorn. Our clients reported intermittent downtime, and we quickly realized it was time to dig into the issue. This is the story of how we tackled it, what we learned, and the simple steps we'll take in the future to avoid the same problem.
The Initial Setup
Our setup has a load balancer in front of two servers handling client requests. Under normal circumstances the load balancer keeps traffic moving smoothly, but our clients started seeing intermittent availability: the app would respond one minute and fail the next. The Gunicorn logs showed entries like this:
[2024-11-03 13:25:42 +0000] [26747] [INFO] Booting worker with pid: 26747
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.6/logging/__init__.py", line 998, in emit
self.flush()
File "/usr/lib/python3.6/logging/__init__.py", line 978, in flush
self.stream.flush()
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
Call stack:
File "/home/ubuntu/sample-app/venv/bin/gunicorn", line 8, in <module>
sys.exit(run())
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/app/base.py", line 231, in run
super().run()
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/app/base.py", line 72, in run
Arbiter(self).run()
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 211, in run
self.manage_workers()
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 551, in manage_workers
self.spawn_workers()
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 623, in spawn_workers
time.sleep(0.1 * random.random())
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
self.reap_workers()
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 533, in reap_workers
os.WTERMSIG(status)
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/glogging.py", line 261, in warning
self.error_log.warning(msg, *args, **kwargs)
File "/usr/lib/python3.6/logging/__init__.py", line 1320, in warning
self._log(WARNING, msg, args, **kwargs)
File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
self.handle(record)
File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
self.callHandlers(record)
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/newrelic/hooks/logger_logging.py", line 87, in wrap_callHandlers
return wrapped(*args, **kwargs)
File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
hdlr.handle(record)
File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
self.emit(record)
File "/usr/lib/python3.6/logging/__init__.py", line 998, in emit
self.flush()
File "/usr/lib/python3.6/logging/__init__.py", line 978, in flush
self.stream.flush()
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 242, in handle_chld
self.reap_workers()
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/arbiter.py", line 533, in reap_workers
os.WTERMSIG(status)
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/gunicorn/glogging.py", line 261, in warning
self.error_log.warning(msg, *args, **kwargs)
File "/usr/lib/python3.6/logging/__init__.py", line 1320, in warning
self._log(WARNING, msg, args, **kwargs)
File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
self.handle(record)
File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
self.callHandlers(record)
File "/home/ubuntu/sample-app/venv/lib/python3.6/site-packages/newrelic/hooks/logger_logging.py", line 87, in wrap_callHandlers
return wrapped(*args, **kwargs)
Message: 'Worker with pid %s was terminated due to signal %s'
Arguments: (26469, 9)
[2024-11-03 13:25:42 +0000] [26748] [INFO] Booting worker with pid: 26748
--- Logging error ---
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
(identical call stack to the one above)
Message: 'Worker with pid %s was terminated due to signal %s'
Arguments: (26469, 9)
Here’s how we began troubleshooting:
- Verifying Processes: First, we checked all server processes to confirm they were running. Everything seemed fine at this level: no obvious crashes or hanging processes.
- Checking the Logs: Since all processes seemed to be running smoothly, we moved on to the logs managed by Supervisord. That's where we saw a recurring pattern: the app was being terminated randomly. And in our error logs, we found these specific entries:
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
Alongside that, we kept seeing: Worker with pid %s was terminated due to signal %s
With this combination of reentrant-call errors and workers killed by signal 9, it looked like we were dealing with some mix of memory pressure, Gunicorn process management, and logging behavior. But we needed more information.
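As a quick sanity check on the "signal %s" part of that log message, Python's standard signal module can translate the raw number (9, in our case) into a name. This is a small illustrative snippet, not part of our app:

```python
import signal

# The log said: Arguments: (26469, 9) -> worker pid 26469, signal 9.
sig = signal.Signals(9)
print(sig.name)  # SIGKILL
```

Signal 9 is SIGKILL, which a process cannot catch or handle; on Linux that is often a sign the out-of-memory (OOM) killer stepped in, so the kernel log is worth checking alongside the application logs.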
Digging Deeper: Researching the Error
With these clues, we started digging online, searching for terms like "reentrant call inside <_io.BufferedWriter>" and "Gunicorn RuntimeError".
Thankfully, we found GitHub issue discussions from others who had encountered the same problem.
Reading through these, one issue became clear: we were running an outdated version of Gunicorn (20.0.1). The latest release was 23.0.0, and it included fixes to logging and process management that directly addressed these symptoms. So we decided to try an upgrade.
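The "reentrant call" error itself is worth a moment: CPython's buffered writers refuse to be re-entered on the same thread, which is exactly what happens when Gunicorn's SIGCHLD handler tries to log (and flush stderr) while the main thread is already mid-flush. Here is a simplified, hypothetical model of that guard, not Gunicorn's or CPython's actual code:

```python
class GuardedWriter:
    """Mimics the reentrancy check that _io.BufferedWriter enforces."""

    def __init__(self):
        self._in_flush = False

    def flush(self, interrupt=None):
        if self._in_flush:
            # This is the situation our logs captured: a second flush
            # attempted while the first one is still in progress.
            raise RuntimeError("reentrant call inside GuardedWriter")
        self._in_flush = True
        try:
            # 'interrupt' stands in for a signal handler firing mid-flush.
            if interrupt is not None:
                interrupt(self)
        finally:
            self._in_flush = False


w = GuardedWriter()
w.flush()  # a normal flush succeeds
try:
    # Simulate SIGCHLD arriving while the flush is in progress.
    w.flush(interrupt=lambda writer: writer.flush())
except RuntimeError as exc:
    print(exc)  # reentrant call inside GuardedWriter
```

The real fix was not to catch this error but to upgrade to a Gunicorn release that avoids doing blocking log writes from inside the signal path.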
Testing a Solution
- The First Upgrade: Instead of jumping to the latest release right away, we upgraded to 21.2.0. We wanted to be cautious, so we monitored the app closely. After a few hours, we saw stability: no more random terminations, and everything was working as expected.
- Going for the Latest Version: Once we felt confident, we upgraded to 23.0.0, the latest Gunicorn version (which also required moving to a newer Python version: more work, but worth it). Since then, the app has been stable without a single reentrant error. Our clients' access to the app returned to normal, and it was clear the upgrade resolved the issue.
What We Learned
This experience highlighted a few lessons:
- Logs Are Your First Clue: Always start by confirming processes are running and thoroughly check the logs. Logs often contain critical hints.
- Search for Similar Issues: When you see a strange error, others have likely faced it too. Searching GitHub for specific terms can point you to helpful discussions and sometimes even exact solutions.
- Take It Slow with Upgrades: Rather than jumping to the latest version, it's wise to upgrade incrementally and test. We saw immediate improvement with 21.2.0, and that gradual approach let us confirm stability before moving to 23.0.0.
- Stay Updated on Releases: Subscribing to GitHub repositories for libraries you depend on, like Gunicorn, keeps you in the loop on critical updates. It's a simple way to avoid these sudden issues.
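On that last point, even a naive version comparison makes the gap we had been carrying obvious. This throwaway helper (plain integer tuples, no handling of pre-releases or epochs) is only an illustration, not a substitute for `pip list --outdated` or proper dependency tooling:

```python
def is_outdated(installed: str, latest: str) -> bool:
    """Naive check: compare dotted version strings as integer tuples."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) < as_tuple(latest)


# The versions from our incident:
print(is_outdated("20.0.1", "23.0.0"))  # True -- three major versions behind
print(is_outdated("23.0.0", "23.0.0"))  # False -- up to date
```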
Conclusion
Troubleshooting production issues can be stressful, but taking a step-by-step approach really paid off here. This experience reminded us of the importance of monitoring dependencies and keeping them up to date. Hopefully, our story gives you a few insights to tackle similar challenges—or avoid them altogether.