Generic scripts running more than 1 sec fail with "Monitor script returned critical: 85, rebooting system ..." #48

andrejbinder · 2024-05-06T21:43:50Z

The value seems to come from the script_exit_status(g->pid); check, which runs while the script is still running. The check is called from static void generic_cb(uev_t *w, void *arg, int events).

Setting the critical threshold to more than 85 does make the check pass, but that is not a viable workaround: the real return code is then no longer checked, so scripts running longer than 1 sec seemingly always pass.

Maybe we just need to initialise the exit code variable to -1 when creating the data structure?
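To illustrate the idea of a "not finished yet" sentinel, here is a minimal sketch. It is not watchdogd's code: poll_exit_status(), spawn_child() and EXIT_PENDING are hypothetical names, and it assumes the exit status is polled with waitpid() and WNOHANG, which may or may not match what script_exit_status() does. With WNOHANG, waitpid() returns 0 while the child is still running, and the status value is not meaningful in that case, so it should not be compared against the warning or critical level.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sentinels, chosen outside the valid exit code range 0-255. */
#define EXIT_PENDING (-1)   /* child has not exited yet */
#define EXIT_FAILED  (-2)   /* waitpid() error or child killed by a signal */

/*
 * Hypothetical helper (not watchdogd's script_exit_status()).
 * Returns the child's exit code (0-255), EXIT_PENDING while the child
 * is still running, or EXIT_FAILED on error or abnormal termination.
 */
static int poll_exit_status(pid_t pid)
{
    int status = 0;
    pid_t rc;

    /* Retry if the call is interrupted by a signal. */
    do {
        rc = waitpid(pid, &status, WNOHANG);
    } while (rc == -1 && errno == EINTR);

    if (rc == 0)
        return EXIT_PENDING;    /* still running, status is not valid */
    if (rc == -1)
        return EXIT_FAILED;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);

    return EXIT_FAILED;         /* terminated by a signal */
}

/* Demo helper: fork a child that sleeps, then exits with the given code. */
static pid_t spawn_child(unsigned int seconds, int code)
{
    pid_t pid = fork();

    if (pid == 0) {
        sleep(seconds);
        _exit(code);
    }

    return pid;                 /* child PID in the parent, -1 on error */
}
```

With a helper like this, the caller would skip the warning/critical comparison for as long as the result is EXIT_PENDING and only compare real exit codes once the script has finished or its timeout has expired.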

troglobit · 2024-05-07T16:57:23Z

What version of watchdogd are you using, and how does your config file look?

andrejbinder · 2024-05-15T20:56:43Z

Hello,

We are running watchdogd 4.0.

This is the config we are running:

# /etc/watchdogd.conf sample
# Commented out values are program defaults.
#
# The checker/monitor `warning` and `critical` levels are 0.00-1.00,
# i.e. 0-100%, except for load average which can vary a lot between
# systems and use-cases, not just because of the number of CPU cores.
# Use the `script = ...` setting to call script when `warning` and
# `critical` are reached for a monitor.  In `critical` the monitor
# otherwise triggers an unconditional reboot.
#
# NOTE: `critical` is optional, omitting it disables the reboot action.
#

###
# Do not set WDT timeout and kick interval too low, the daemon runs at
# SCHED_OTHER level with all other tasks, unless the process supervisor
# is enabled.  The monitor plugins (below) need CPU time as well.
#timeout   = 20
#interval  = 10

###
# With safe-exit enabled (true) the daemon will ask the driver to disable
# the WDT before exiting (SIGINT).  However, some WDT drivers (or HW)
# may not support this.
#safe-exit = false

### Supervisor
# Instrumented processes can have their main loop supervised.  Processes
# subscribe to this service using the libwdog API, see the docs for more
# on this.  When the supervisor is enabled and the priority is set to a
# value > 0, watchdogd runs as a SCHED_RR process with elevated realtime
# priority.  When disabled, or the priority is set to zero (0), it runs
# as a regular SCHED_OTHER process, this is the default.
#
# When a supervised process fails to meet its deadline, the daemon will
# perform an unconditional reset having saved the reset reason.  If a
# script is provided in this section it will be called instead.  The
# script is called as:
#
#    script.sh supervisor CODE PID LABEL
#
# Available CODEs for the reset reason are listed in wdog.h
#
supervisor {
#    !!!REMEMBER TO ENABLE reset-reason (below) AS WELL!!!
#    enabled  = true
#    priority = 98
    script = "/path/to/supervisor-script.sh"
}

### Reset reason
# The following section controls if/how the reset reason & reset counter
# is tracked.  By default this is disabled, since not all systems allow
# writing to disk, e.g. embedded systems using MTD devices with limited
# number of write cycles.
#
# The default file setting is a non-volatile path, according to the FHS.
# It can be changed to another location, but make sure that location is
# writable first.
reset-reason {
#    enabled = true
    file    = "/var/lib/watchdogd.state"
}

### Checkers/Monitors ##################################################
#
# Script or command to run instead of reboot when a monitor plugin
# reaches any of its critical or warning level.  Setting this will
# disable the built-in reboot on critical, it is therefore up to the
# script to perform reboot, if needed.  The script is called as:
#
#    script.sh {filenr, loadavg, meminfo} {crit, warn} VALUE
#
#script = "/path/to/reboot-action.sh"

# Monitors file descriptor leaks based on /proc/sys/fs/file-nr
filenr {
#    enabled = true
    interval = 300
    logmark  = false
    warning  = 0.9
    critical = 1.0
#    script = "/path/to/alt-reboot-action.sh"
}

# Monitors load average based on sysinfo() from /proc/loadavg
# The level is composed from the average of the 1 and 5 min marks.
loadavg {
#    enabled = true
    interval = 300
    logmark  = false
    warning  = 1.0
    critical = 2.0
#    script = "/path/to/alt-reboot-action.sh"
}

# Monitors free RAM based on data from /proc/meminfo
meminfo {
    enabled = true
    interval = 300
    logmark  = false
    warning  = 0.9
    critical = 0.95
#    script = "/path/to/alt-reboot-action.sh"
}

# Monitor a generic script, executes 'monitor-script' every 'interval'
# seconds, with a max runtime of 'timeout' seconds.  When the exit code
# of the monitor script is above the critical level watchdogd either
# starts the reboot, or calls the alternate 'script' to determine the
# next course of action.
generic /usr/bin/watchdogscript {
    enabled = true
    interval = 300
    timeout = 60
    warning  = 1
    critical = 10
}

The watchdogscript is basically a shell script that calls curl to check the state of a specific endpoint. Whenever curl takes more than a second, watchdogd reboots the device with the above output. We can of course work around this by setting a timeout within curl, but that may not be an option for other use cases where the timeout really needs to be longer than a second.

troglobit · 2024-05-21T18:40:04Z

OK. I'll see when I have some spare time to debug this. But it's not high on my priority list right now, since neither I nor any of my customers use this feature. If you want to speed things along, I suggest getting outside help.
