When SSH gives up on you

Investigating a strange Compute Module failure


Just like that, it was gone

I have a few Raspberry Pi Compute Module 3s running on a TuringPi v1 board, housed in a Cooler Master case with an external power supply.

They’re flashed using the Compute Module development board and configured with my Ansible playbooks; I typically have Consul and Nomad agents deployed on them.

So far, so good!

However, these machines stay up for a few days, then suddenly start refusing SSH connections and eventually become unresponsive. The only clue I had to go on was from the SSH client:

kex_exchange_identification: read: Connection reset by peer
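That error means the TCP connection was reset while the SSH identification banner was still being exchanged, i.e. before key exchange or authentication even started, which points at the server side. Had the machine still been reachable, these would have been my first checks (the hostname here is illustrative):

ssh -vvv pi@turing1   # verbose client output: how far does the handshake get?
nc -vz turing1 22     # does the port still accept TCP connections at all?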

Investigating

My first thought was:

Is this a murder or a suicide?

Did something kill my poor compute module, or did I place too much pressure on it in some way, making it pull the trigger on itself?

There wasn’t much running on this machine to start with: aside from a few system services (a Prometheus exporter, Nomad, and Consul), there was only one Nomad job, Fabio LB, running on it. The load was far from excessive… but unfortunately I have no proof of that, since monitoring and logging weren’t set up yet.

Luckily, I could put the time of death between 08:40 and 09:05 thanks to my beautiful zsh configuration, which puts timestamps and execution times for each command in my prompt.
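The relevant bit is tiny; a minimal sketch of such a prompt (not my exact configuration) looks like this:

# Minimal zsh sketch: show the current time on the right-hand side.
# %* is zsh's prompt escape for the current time as HH:MM:SS.
RPROMPT='%*'

Per-command execution times take a pair of preexec/precmd hooks on top of this, but even a bare timestamp is enough to bracket a failure like this one.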

I took the module out, mounted it with the flashing kit and had a look at the logs.
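For reference, this is roughly what that looks like on the host, assuming usbboot/rpiboot has exposed the module's eMMC as /dev/sda (device name and mount point are illustrative):

# The CM3's eMMC appears as a USB mass-storage device through the flashing kit.
sudo mkdir -p /mnt/cm3
sudo mount /dev/sda2 /mnt/cm3               # second partition holds the rootfs
sudo grep ssh /mnt/cm3/var/log/auth.log     # SSH-related messages
sudo less /mnt/cm3/var/log/messages         # general system log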

Notes from beyond the grave

I started by looking through the logs for SSH messages. These contained a few entries from the last Ansible run: the application of the base configuration layer, which includes things like setting the hostname and the timezone, installing packages, and so on. Indeed, checking back on the log of the Ansible run, I could see that the node stopped responding during the package task, right after the timezone task. The last successful task was:

- name: Set timezone
  community.general.timezone:
    name: Europe/Rome
    hwclock: local

We have a bit of evidence, but no clues yet.

Since the machine stopped responding during the package task, let’s try to correlate a few facts.

The task was:

- name: Ensure base packages are installed
  ansible.builtin.package:
    state: present
    name: "{{ item }}"
    use: auto
  loop: "{{ base_packages }}"
  tags:
  - packages

The variable base_packages is a list:

base_packages:
- bash
- bash-completion
- git
- id-utils
- inetutils-traceroute
- iptraf
- lsb-base
- lsb-release
- net-tools
- procps
- python3
- sshpass
- zlib1g
- unzip
- ntp

and I can see from the message log that the last successful action was the installation of iptraf:

...
May 28 08:47:01 turing1 ansible-ansible.legacy.apt: Invoked with state=present name=inetutils-traceroute package=['inetutils-traceroute'] update_cache_retries=5 update_cache_retry_max_delay=12 cache_valid_time=0 purge=False force=False upgrade=no dpkg_options=force-confdef,force-confold autoremove=False autoclean=False fail_on_autoremove=False only_upgrade=False force_apt_get=False allow_unauthenticated=False allow_downgrade=False lock_timeout=60 update_cache=None deb=None default_release=None install_recommends=None policy_rc_d=None

May 28 08:47:07 turing1 ansible-ansible.legacy.apt: Invoked with state=present name=iptraf package=['iptraf'] update_cache_retries=5 update_cache_retry_max_delay=12 cache_valid_time=0 purge=False force=False upgrade=no dpkg_options=force-confdef,force-confold autoremove=False autoclean=False fail_on_autoremove=False only_upgrade=False force_apt_get=False allow_unauthenticated=False allow_downgrade=False lock_timeout=60 update_cache=None deb=None default_release=None install_recommends=None policy_rc_d=None
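To pull just these invocations and their timestamps out of the mounted log, a one-liner along these lines does the job (same assumed paths as above):

sudo grep 'ansible-ansible.legacy.apt' /mnt/cm3/var/log/messages \
  | awk '{print $1, $2, $3, $9}'   # date, time, and the name= field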

Analysis

Here are the facts:

  1. The machine stopped responding during application of the package task, specifically after installation of iptraf and before or during installation of lsb-base.
  2. A reboot of the machine solved the problem.

It seems that there is not enough information yet to draw a final conclusion.