USB CDC sometimes hangs on F4 platforms

Description

The USB CDC driver for F4 boards sometimes starts dropping characters and eventually locks up completely. This bug was first discovered by and I have later found that using UsbComBridge in combination with MWOSD and the MWOSD configurator can reliably trigger a complete lockup when using the font upload function.
The problem seems to show up more often with bidirectional transfers and on a busy system. For example, Discovery F4 on 15.09 has a system load of about 8% and takes longer to trigger it, but Revo, which has a load of around 30-40% (depending on firmware version), locks up faster. RevoNano is especially sensitive, as CDC locks up even without exercising the font upload function.

After the lockup, a USB unplug is needed to properly terminate the application on the PC, but the firmware side of CDC stays in a dysfunctional state. Other parts of the firmware keep working as usual; there is no excessive CPU load or other sign of malfunction.

Interestingly enough, F1 targets are not affected, which leads me to the conclusion that the problem is in the F4 USB implementation. Further research/debugging is needed.

Also, there seem to be no problems with the HID part of the driver. I have also tried updating STM's USB driver to v2.2.0 (the latest; we currently use v2.1.0, dated 2012), but it didn't help.

Environment

None

Activity

Vladimir Zidar
March 15, 2017, 10:37 PM
Edited

The F4 USB (CDC) implementation uses several layers of abstraction, one of them being pios_usbhook, which in its current state lacks an API to query the actual endpoint status. Therefore, the pios_usb_cdc implementation cannot check the endpoint state directly, and instead tries to track it internally in the rx_active flag. This state cannot be tracked from application actions alone, however, as the USB core (hardware) can place the endpoint into NAK state too.
This is apparently what happens: the endpoint goes into NAK state (the core sets the NAKSTS control bit), but the firmware is not aware of it and never resets it.
The F1 implementation does not have this problem, as it checks the hardware for a stalled endpoint directly.
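
To make the failure mode concrete, here is a minimal host-side model of it. This is not the firmware code: only the rx_active flag and the NAK state come from the description above, while struct mock_ep, struct cdc_dev and all function names are hypothetical stand-ins invented for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one OUT (RX) endpoint of the USB core. */
struct mock_ep {
    bool nak;       /* models NAKSTS: true = endpoint NAKs host packets */
    unsigned armed; /* how many times the firmware armed the endpoint */
};

/* Simplified driver state; only the rx_active name is from the ticket. */
struct cdc_dev {
    struct mock_ep *ep;
    bool rx_active; /* firmware's belief that a receive is pending */
};

/* Arm the endpoint for a receive: clear NAK and remember that we did. */
static void cdc_arm_rx(struct cdc_dev *dev)
{
    dev->ep->nak = false;
    dev->ep->armed++;
    dev->rx_active = true;
}

/* Flag-based re-arm: only acts when the flag says the endpoint is idle. */
static void cdc_rx_poll_flag(struct cdc_dev *dev)
{
    if (!dev->rx_active) {
        cdc_arm_rx(dev);
    }
}

/* The core sets NAK on its own; the firmware gets no notification. */
static void core_sets_nak(struct mock_ep *ep)
{
    ep->nak = true;
}

/* Returns true if the endpoint ends up stuck in NAK after polling. */
static bool flag_tracking_gets_stuck(void)
{
    struct mock_ep ep = { .nak = true, .armed = 0 };
    struct cdc_dev dev = { .ep = &ep, .rx_active = false };

    cdc_rx_poll_flag(&dev); /* arms: NAK cleared, rx_active = true */
    core_sets_nak(&ep);     /* hardware goes to NAK behind our back */
    cdc_rx_poll_flag(&dev); /* flag is still true, so nothing happens */

    return ep.nak && dev.rx_active;
}
```

In this model the second poll never re-arms the endpoint, because the only source of truth it consults is the flag, and the flag was not updated when the (mock) core changed state.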

The suggested PR adds a new trivial API to usbhook, PIOS_USBHOOK_EndpointGetStatus(), as a direct wrapper around DCD_GetEPStatus(), and then replaces the internal rx_active handling in pios_usb_cdc.c with calls to this new API.
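
A rough sketch of the shape of that change, written so it compiles on a PC. Everything here apart from the two function names mentioned above is an assumption: the signatures of PIOS_USBHOOK_EndpointGetStatus() and DCD_GetEPStatus(), the stubbed USB_OTG_CORE_HANDLE, the EP_RX_* values, and the helpers arm_rx() and cdc_rx_poll() are illustrative only and are not taken from the PR or from the ST library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative status values, NOT the ST library's constants. */
#define EP_RX_NAK   1u
#define EP_RX_VALID 2u

/* Stub of the ST core handle: one status word per endpoint. */
typedef struct {
    uint32_t ep_status[4];
} USB_OTG_CORE_HANDLE;

static USB_OTG_CORE_HANDLE usb_core; /* stand-in for the device handle */

/* Stub with an assumed signature; the real function reads the hardware.
 * The mask only keeps the index inside the stub's small array. */
static uint32_t DCD_GetEPStatus(USB_OTG_CORE_HANDLE *pdev, uint8_t epnum)
{
    return pdev->ep_status[epnum & 0x03];
}

/* The new hook: a direct wrapper (signature assumed for this sketch). */
uint32_t PIOS_USBHOOK_EndpointGetStatus(uint8_t epnum)
{
    return DCD_GetEPStatus(&usb_core, epnum);
}

/* Hypothetical re-arm helper: puts the endpoint back into VALID. */
static void arm_rx(uint8_t epnum)
{
    usb_core.ep_status[epnum & 0x03] = EP_RX_VALID;
}

/* Re-arm whenever the hardware reports NAK, instead of trusting a
 * software flag. Returns true if the endpoint was re-armed. */
bool cdc_rx_poll(uint8_t epnum)
{
    if (PIOS_USBHOOK_EndpointGetStatus(epnum) == EP_RX_NAK) {
        arm_rx(epnum);
        return true;
    }
    return false;
}
```

The point of the design is that the decision to re-arm is taken from the endpoint status itself, so a NAK set by the core is noticed on the next poll regardless of what the application last did.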

So far, this fixes the CDC hang on F4.

However, the same rx_active construct is used in the HID code. It seems that the HID data transfer pattern does not trigger NAK state on the RX endpoint, so application-side tracking seems to work. It would be trivial to change the HID code to match CDC by calling PIOS_USBHOOK_EndpointGetStatus(). Thoughts?

Assignee

Vladimir Zidar

Reporter

Vladimir Zidar

Labels

None

Components

Fix versions

Priority

High