Florisjan
Posts: 19
Joined: Mon Apr 11, 2016 2:49 pm

Watchdog not rebooting after transmit buffer overflow

Wed May 17, 2017 8:14 pm

Hi,

I am using a Pi Zero with an r8152-based USB Ethernet adapter. The watchdog runs fine and reboots the system when there is a genuine ping timeout. However, after a transmit buffer overflow on the r8152, the system fails to restart even though the watchdog tries. The only way to recover is to restart it manually.

The running watchdog:

 sudo systemctl status watchdog
● watchdog.service - watchdog daemon
   Loaded: loaded (/lib/systemd/system/watchdog.service; static)
   Active: active (running) since Wed 2017-05-17 19:17:33 CEST; 2h 4min ago
  Process: 855 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin/watchdog $watchdog_options (code=exited, status=0/SUCCESS)
  Process: 850 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${watchdog_module}" = "none" ] || /sbin/modprobe $watchdog_module (code=exited, status=0/SUCCESS)
 Main PID: 857 (watchdog)
   CGroup: /system.slice/watchdog.service
           └─857 /usr/sbin/watchdog

May 17 19:17:33 wubulubudubdub watchdog[857]: starting daemon (5.14):
May 17 19:17:33 wubulubudubdub watchdog[857]: int=1s realtime=yes sync=no soft=no mla=0 mem=0
May 17 19:17:33 wubulubudubdub watchdog[857]: ping: 192.168.1.254
May 17 19:17:33 wubulubudubdub watchdog[857]: file: no file to check
May 17 19:17:33 wubulubudubdub watchdog[857]: pidfile: no server process to check
May 17 19:17:33 wubulubudubdub watchdog[857]: interface: eth0
May 17 19:17:33 wubulubudubdub watchdog[857]: temperature: no sensors to check
May 17 19:17:33 wubulubudubdub watchdog[857]: test=none(0) repair=none(0) alive=none heartbeat=none to=root no_act=no force=no
May 17 19:17:33 wubulubudubdub systemd[1]: Started watchdog daemon.
The interesting part from /var/log/syslog:

May 17 20:15:08 wubulubudubdub kernel: [10869.610635] r8152 1-1.2:1.0 eth0: Tx status -71
May 17 20:15:09 wubulubudubdub kernel: [10871.010385] r8152 1-1.2:1.0 eth0: Tx status -71
May 17 20:15:18 wubulubudubdub watchdog[854]: no response from ping (target: 192.168.69.254)
May 17 20:15:18 wubulubudubdub watchdog[854]: /usr/lib/sendmail does not exist or is not executable (errno = 2)
May 17 20:15:18 wubulubudubdub watchdog[854]: shutting down the system because of error 101
May 17 20:15:21 wubulubudubdub kernel: [10883.107478] ------------[ cut here ]------------
May 17 20:15:21 wubulubudubdub kernel: [10883.107536] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x268/0x280
May 17 20:15:21 wubulubudubdub kernel: [10883.107546] NETDEV WATCHDOG: eth0 (r8152): transmit queue 0 timed out
May 17 20:15:21 wubulubudubdub kernel: [10883.107552] Modules linked in: md4 md5 hmac nls_utf8 cifs ppp_async crc_ccitt ppp_generic slhc cfg80211 rfkill cdc_ether r8152 snd_bcm2835 snd_pcm snd_timer snd bcm2835_gpiomem uio_pdrv_genirq uio fixed i2c_dev fuse ipv6
May 17 20:15:21 wubulubudubdub kernel: [10883.107656] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.24+ #993
May 17 20:15:21 wubulubudubdub kernel: [10883.107662] Hardware name: BCM2835
May 17 20:15:21 wubulubudubdub kernel: [10883.107731] [<c0016148>] (unwind_backtrace) from [<c0013c90>] (show_stack+0x20/0x24)
May 17 20:15:21 wubulubudubdub kernel: [10883.107758] [<c0013c90>] (show_stack) from [<c03192b8>] (dump_stack+0x20/0x28)
May 17 20:15:21 wubulubudubdub kernel: [10883.107786] [<c03192b8>] (dump_stack) from [<c0021c98>] (__warn+0xe4/0x10c)
May 17 20:15:21 wubulubudubdub kernel: [10883.107805] [<c0021c98>] (__warn) from [<c0021d08>] (warn_slowpath_fmt+0x48/0x50)
May 17 20:15:21 wubulubudubdub kernel: [10883.107824] [<c0021d08>] (warn_slowpath_fmt) from [<c0512ca0>] (dev_watchdog+0x268/0x280)
May 17 20:15:21 wubulubudubdub kernel: [10883.107863] [<c0512ca0>] (dev_watchdog) from [<c00692d8>] (call_timer_fn+0x40/0x124)
May 17 20:15:21 wubulubudubdub kernel: [10883.107886] [<c00692d8>] (call_timer_fn) from [<c0069474>] (expire_timers+0xb8/0x11c)
May 17 20:15:21 wubulubudubdub kernel: [10883.107906] [<c0069474>] (expire_timers) from [<c00695b8>] (run_timer_softirq+0x90/0x184)
May 17 20:15:21 wubulubudubdub kernel: [10883.107925] [<c00695b8>] (run_timer_softirq) from [<c00095cc>] (__do_softirq+0x124/0x31c)
May 17 20:15:21 wubulubudubdub kernel: [10883.107951] [<c00095cc>] (__do_softirq) from [<c00262c4>] (irq_exit+0xf0/0x140)
May 17 20:15:21 wubulubudubdub kernel: [10883.107985] [<c00262c4>] (irq_exit) from [<c005e14c>] (__handle_domain_irq+0x60/0xb8)
May 17 20:15:21 wubulubudubdub kernel: [10883.108005] [<c005e14c>] (__handle_domain_irq) from [<c0009420>] (bcm2835_handle_irq+0x28/0x48)
May 17 20:15:21 wubulubudubdub kernel: [10883.108028] [<c0009420>] (bcm2835_handle_irq) from [<c05d3a9c>] (__irq_svc+0x5c/0x7c)
May 17 20:15:21 wubulubudubdub kernel: [10883.108036] Exception stack(0xc08a5f18 to 0xc08a5f60)
May 17 20:15:21 wubulubudubdub kernel: [10883.108047] 5f00:                                                       00000000 00000000
May 17 20:15:21 wubulubudubdub kernel: [10883.108061] 5f20: 00000000 c08a7760 c08a4000 00000000 c08a68ac c08bbcc2 00000001 c08bbcc2
May 17 20:15:21 wubulubudubdub kernel: [10883.108075] 5f40: d7fffb80 c08a5f74 c08a5f68 c08a5f68 c00107c4 c00107c8 60000013 ffffffff
May 17 20:15:21 wubulubudubdub kernel: [10883.108096] [<c05d3a9c>] (__irq_svc) from [<c00107c8>] (arch_cpu_idle+0x30/0x40)
May 17 20:15:21 wubulubudubdub kernel: [10883.108115] [<c00107c8>] (arch_cpu_idle) from [<c05d3944>] (default_idle_call+0x34/0x48)
May 17 20:15:21 wubulubudubdub kernel: [10883.108150] [<c05d3944>] (default_idle_call) from [<c00506ac>] (cpu_startup_entry+0x8c/0xe8)
May 17 20:15:21 wubulubudubdub kernel: [10883.108187] [<c00506ac>] (cpu_startup_entry) from [<c05ced5c>] (rest_init+0x6c/0x84)
May 17 20:15:21 wubulubudubdub kernel: [10883.108214] [<c05ced5c>] (rest_init) from [<c083fc84>] (start_kernel+0x33c/0x3b0)
May 17 20:15:21 wubulubudubdub kernel: [10883.108226] ---[ end trace 7d29fde287cd4e05 ]---
May 17 20:15:21 wubulubudubdub kernel: [10883.108248] r8152 1-1.2:1.0 eth0: Tx timeout
May 17 20:15:23 wubulubudubdub kernel: [10884.677701] r8152 1-1.2:1.0 eth0: Tx status -2
May 17 20:15:23 wubulubudubdub kernel: [10884.677825] r8152 1-1.2:1.0 eth0: Tx status -2
May 17 20:15:23 wubulubudubdub kernel: [10884.677890] r8152 1-1.2:1.0 eth0: Tx status -2
May 17 20:15:23 wubulubudubdub kernel: [10884.677917] r8152 1-1.2:1.0 eth0: Tx status -2
May 17 20:15:26 wubulubudubdub kernel: [10888.147558] r8152 1-1.2:1.0 eth0: Tx timeout
May 17 20:15:28 wubulubudubdub kernel: [10889.747590] CIFS VFS: sends on sock d4018b00 stuck for 15 seconds
May 17 20:15:28 wubulubudubdub kernel: [10889.747630] CIFS VFS: Error -11 sending data on socket to server
May 17 20:15:28 wubulubudubdub rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="408" x-info="http://www.rsyslog.com"] exiting on signal 15.
May 17 19:17:13 wubulubudubdub dhcpcd[437]: DUID 00:01:00:01:20:4c:58:fb:7c:dd:90:85:7b:72


From the watchdog lines (the 3rd to 5th) I can tell that the watchdog wants to reboot. Then the kernel logs the NETDEV WATCHDOG warning (is this actually part of the watchdog system, or just part of the networking system?), and after that "Tx status -2" is logged a few times.
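
The numbers in these lines are Linux errno values, which the kernel prints negated: on Linux, 71 is EPROTO, 2 is ENOENT, 101 (the watchdog's "error 101") is ENETUNREACH and 11 is EAGAIN. Below is a small Python sketch that decodes them and annotates the r8152 "Tx status" lines with a rough reading paraphrased from the kernel's USB error-codes documentation. The helper names are hypothetical, and the decoding is only correct on a Linux host because errno numbering is platform-specific.

```python
import errno
import os
import re

# Rough meaning of URB completion statuses, paraphrased from the kernel's
# USB error-codes documentation; only meaningful for the USB "Tx status" lines.
USB_NOTES = {
    errno.EPROTO: "USB protocol error (bitstuff error or no response from the device)",
    errno.ENOENT: "URB was unlinked/killed by the driver",
}

# Matches the r8152 driver's "Tx status -NN" messages.
TX_STATUS = re.compile(r"Tx status (-?\d+)")


def decode(code: int) -> str:
    """Turn a (possibly negated) errno value into 'NAME (message)'."""
    n = abs(code)
    name = errno.errorcode.get(n, f"errno {n}")
    return f"{name} ({os.strerror(n)})"


def tx_statuses(lines):
    """Yield a decoded description for every 'Tx status' line in a log."""
    for line in lines:
        match = TX_STATUS.search(line)
        if match:
            n = abs(int(match.group(1)))
            note = USB_NOTES.get(n)
            yield decode(n) + (f": {note}" if note else "")


if __name__ == "__main__":
    # Values taken from the syslog excerpt; errno numbers are Linux-specific.
    for code in (-71, -2, 101, -11):
        print(code, "->", decode(code))
```

For example, `list(tx_statuses(open("/var/log/syslog")))` would list every decoded Tx status in the current log.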

There is not much information to work with. This happens at completely random moments, about every 2 days, regardless of variables like temperature or uptime. There are no spikes in data transfer and no big files being moved. The power supply is rated at 2 A and the network connection is stable. In fact, I have an identical Zero running with an ASIX AX88772A adapter that never gives me this problem.

Any suggestions on solving this?

Regards,

Fj

The adapter has no MAC address of its own, so the IP is static. Some more info:

ethtool -i eth0
driver: r8152
version: v1.08.7
firmware-version:
bus-info: usb-20980000.usb-1.2
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

 ethtool eth0
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                             100baseT/Half 100baseT/Full
        Link partner advertised pause frame use: No
        Link partner advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 32
        Transceiver: internal
        Auto-negotiation: on
Cannot get wake-on-lan settings: Operation not permitted
        Current message level: 0x00007fff (32767)
                               drv probe link timer ifdown ifup rx_err tx_err tx_queued intr tx_done rx_status pktdata hw wol
        Link detected: yes

From dmesg:
[    1.961119] usb 1-1.2: new full-speed USB device number 3 using dwc_otg
[    2.092217] usb 1-1.2: not running at top speed; connect to a high speed hub
[    2.097629] usb 1-1.2: New USB device found, idVendor=0bda, idProduct=8152
[    2.100826] usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    2.103802] usb 1-1.2: Product: USB 10/100 LAN
[    2.106611] usb 1-1.2: Manufacturer: Realtek
[    2.109330] usb 1-1.2: SerialNumber: 000000000000
[   12.834438] r8152 1-1.2:1.0 (unnamed net_device) (uninitialized): Invalid ether addr 00:00:00:00:00:00
[   12.835273] r8152 1-1.2:1.0 (unnamed net_device) (uninitialized): Random ether addr d2:83:e7:71:3f:4e
[   12.837850] r8152 1-1.2:1.0 eth0: v1.08.7

[   20.888915] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   20.917043] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
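
The dmesg lines show why the MAC is not fixed. The driver found an all-zero address and fell back to a random one, so the address changes on every boot. That random address is a locally administered unicast address (bit 1 of the first octet set, bit 0 clear), the same kind the kernel's random-address helper produces. Below is a Python sketch that checks this property and generates such an address once. One could then pin it at boot, for example with `ip link set dev eth0 address <mac>`, typically with the interface down first. The helper names are hypothetical, and this pinning approach is an assumption, not something tried in this thread.

```python
import secrets


def parse_mac(mac: str) -> bytes:
    """Parse 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC: {mac!r}")
    return octets


def is_local_unicast(mac: str) -> bool:
    """True if the locally administered bit is set and the multicast bit is clear."""
    first = parse_mac(mac)[0]
    return bool(first & 0x02) and not (first & 0x01)


def random_local_mac() -> str:
    """Random locally administered unicast MAC, like the one the driver fell back to."""
    octets = bytearray(secrets.token_bytes(6))
    octets[0] = (octets[0] | 0x02) & 0xFE  # set local bit, clear multicast bit
    return ":".join(f"{b:02x}" for b in octets)


if __name__ == "__main__":
    print(is_local_unicast("d2:83:e7:71:3f:4e"))  # address from dmesg: True
    print(random_local_mac())
```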

epoch1970
Posts: 4163
Joined: Thu May 05, 2016 9:33 am
Location: Paris, France

Re: Watchdog not rebooting after transmit buffer overflow

Thu May 18, 2017 2:20 pm

I understand you're using the standalone watchdog package, not systemd's own watchdog (which I don't use).
I think watchdog has various ways of rebooting a machine, but I don't know where, or whether, this is configured. It should be able to reboot the machine (I don't have a Zero, unfortunately).

I know two things:
- I always set watchdog to realtime priority, especially on the Pi, where the hardware watchdog device needs very frequent keep-alive writes.
- I would never, ever, define a reboot condition that trips on a single failure with a one-second timeout, not even if the test were just stat-ing files. Do you really need to ping a remote host every second?

Personally, I set the ping interval to 30 seconds or more on a LAN, and I wait for multiple failures before rebooting. In other words, I don't use watchdog's "ping" facility directly on the Pi; instead, I usually have a repair script in charge of defusing watchdog's impulse to reboot.
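
A repair script of this kind can be sketched in Python as follows. It assumes that watchdog's `repair-binary` option in /etc/watchdog.conf points at it, that watchdog passes the error number as the first argument, and that watchdog treats exit status 0 as "repaired" (check watchdog(8) and watchdog.conf(5) for your version). The state-file path, `WINDOW` and `THRESHOLD` are hypothetical. The script records the time of every failure and only returns non-zero, letting watchdog reboot, once `THRESHOLD` failures have piled up within `WINDOW` seconds.

```python
#!/usr/bin/env python3
"""Hypothetical watchdog repair-binary: only allow a reboot after repeated failures."""
import json
import sys
import time
from pathlib import Path

# Hypothetical values: tune them to your network and watchdog interval.
STATE_FILE = Path("/run/watchdog-repair.json")  # /run is tmpfs, so a reboot clears it
WINDOW = 300.0   # seconds: only failures this recent are counted
THRESHOLD = 5    # failures within WINDOW that mean the link is really down


def record_failure(history, now, window=WINDOW):
    """Drop failures older than `window` seconds and append the current one."""
    return [t for t in history if now - t < window] + [now]


def should_reboot(history, threshold=THRESHOLD):
    """True once enough recent failures have accumulated."""
    return len(history) >= threshold


def load_history(path):
    """Read saved failure timestamps; a missing or corrupt file means none."""
    try:
        return [float(t) for t in json.loads(path.read_text())]
    except (OSError, ValueError, TypeError):
        return []


def main(argv):
    err = argv[1]  # error number passed by watchdog, e.g. 101
    history = record_failure(load_history(STATE_FILE), time.time())
    STATE_FILE.write_text(json.dumps(history))
    if should_reboot(history):
        print(f"watchdog error {err}: {len(history)} failures in {WINDOW:.0f}s, "
              "letting watchdog reboot", file=sys.stderr)
        return 1  # non-zero: watchdog goes ahead with the reboot
    return 0      # zero: watchdog treats the problem as repaired


# watchdog always passes the error number; without arguments the file only
# defines the helpers (handy for testing them).
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Keeping the state in /run means a completed reboot starts with a clean failure history.
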
"If there is no solution, it's because there is no problem." (Les Shadoks, J. Rouxel)
