Just like that, it was gone
I have a few Raspberry Pi Compute Module 3s running on a TuringPi v1 board, housed in a Cooler Master case with an external power supply.
They’re flashed using the Compute Module development board and configured with my Ansible playbooks. I’ve typically got Consul and Nomad agents deployed on them.
So far, so good!
However, these machines stay up for a few days and then tend to suddenly refuse ssh connections and eventually become unresponsive. The only clue I had to go on was from the ssh client:

```
kex_exchange_identification: read: Connection reset by peer
```

Investigating
My first thought was:
Is this a murder or a suicide?
Did something kill my poor compute module, or did I put too much pressure on it in some way, making it pull the trigger on itself?
There wasn’t much running on this machine to begin with: apart from a few system services (Prometheus exporter, Nomad, and Consul) there was only one Nomad job, Fabio LB, running on it. The load was far from excessive… but unfortunately I have no proof of that at this point, since monitoring and logging weren’t set up yet.
Luckily, I could put the time of death between 08:40 and 09:05 thanks to my beautiful zsh configuration which puts timestamps and execution times on commands in my prompt.
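Roughly speaking, the relevant bits of that prompt setup look something like this (a simplified sketch rather than my exact config; the hook and variable names are just illustrative):

```zsh
# Simplified sketch: wall-clock timestamp in the right prompt plus the
# elapsed time of the last command (not the exact config used here).
zmodload zsh/datetime
setopt prompt_subst

preexec() { _cmd_start=$EPOCHREALTIME }     # runs just before a command starts
precmd() {                                  # runs just before the prompt is drawn
  if [[ -n $_cmd_start ]]; then
    _cmd_elapsed=$(printf '%.2fs' $(( EPOCHREALTIME - _cmd_start )))
    unset _cmd_start
  else
    _cmd_elapsed=''
  fi
}

RPROMPT='${_cmd_elapsed} %D{%H:%M:%S}'      # e.g. "0.42s 08:47:07"
```

With something like this in place, every prompt records when a command finished and how long it took, which is exactly what let me bracket the time of death.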
I took the module out, mounted it with the flashing kit and had a look at the logs.
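In practice that means using the usbboot/rpiboot tool to expose the module's eMMC as a USB mass-storage device and mounting its root partition; something like the following sketch (device and partition names will vary depending on your machine):

```sh
# Expose the Compute Module's eMMC over USB with rpiboot, then mount the
# root partition to read its logs offline. /dev/sda2 is just an example.
sudo rpiboot
lsblk                          # find which block device the eMMC showed up as
sudo mount /dev/sda2 /mnt      # on Raspbian the rootfs is usually partition 2
ls /mnt/var/log/
```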
Notes from beyond the grave
I started looking at the logs for ssh messages.
These contained a few messages from the last Ansible run: the application of the base configuration layer, which includes things like setting the hostname and timezone and installing packages.
Indeed, checking back on the log of the Ansible run, I can see that the node stopped responding during the package installation task, right after the timezone had been set.
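For the curious, this is roughly how I was grepping around (paths assume the module's rootfs is still mounted under /mnt and that rsyslog's default log files are in use):

```sh
# Pull out sshd- and Ansible-related entries around the time of death.
grep -i ssh /mnt/var/log/auth.log | tail -n 50
grep ansible /mnt/var/log/syslog | tail -n 50
```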
The last successful task was:
```yaml
- name: Set timezone
  community.general.timezone:
    name: Europe/Rome
    hwclock: local
```

We have a bit of evidence, but no clues yet.
Since the machine stopped responding during application of the package task, let’s try to correlate a few facts.
The task was:
```yaml
- name: Ensure base packages are installed
  ansible.builtin.package:
    state: present
    name: "{{ item }}"
    use: auto
  loop: "{{ base_packages }}"
  tags:
    - packages
```

The variable base_packages is a list:
```yaml
base_packages:
  - bash
  - bash-completion
  - git
  - id-utils
  - inetutils-traceroute
  - iptraf
  - lsb-base
  - lsb-release
  - net-tools
  - procps
  - python3
  - sshpass
  - zlib1g
  - unzip
  - ntp
```

and I can see from the message log that the last successful action was the installation of `iptraf`:
```
...
May 28 08:47:01 turing1 ansible-ansible.legacy.apt: Invoked with state=present name=inetutils-traceroute package=['inetutils-traceroute'] update_cache_retries=5 update_cache_retry_max_delay=12 cache_valid_time=0 purge=False force=False upgrade=no dpkg_options=force-confdef,force-confold autoremove=False autoclean=False fail_on_autoremove=False only_upgrade=False force_apt_get=False allow_unauthenticated=False allow_downgrade=False lock_timeout=60 update_cache=None deb=None default_release=None install_recommends=None policy_rc_d=None
May 28 08:47:07 turing1 ansible-ansible.legacy.apt: Invoked with state=present name=iptraf package=['iptraf'] update_cache_retries=5 update_cache_retry_max_delay=12 cache_valid_time=0 purge=False force=False upgrade=no dpkg_options=force-confdef,force-confold autoremove=False autoclean=False fail_on_autoremove=False only_upgrade=False force_apt_get=False allow_unauthenticated=False allow_downgrade=False lock_timeout=60 update_cache=None deb=None default_release=None install_recommends=None policy_rc_d=None
```

Analysis
Here are the facts:
- The machine stopped responding during application of the package task, specifically after the installation of `iptraf` and before or during the installation of `lsb-base`.
- A reboot of the machine solved the problem.
It seems that there is not enough information yet to draw a final conclusion.