TCP2UART/uart-ch390-debug-handoff.md
gaoro-xiao 14a532290d refactor: serialize CH390 runtime SPI access
Move runtime CH390 transactions behind a single ch390_runtime owner so main, lwIP glue, and EXTI no longer compete for SPI access. Keep the system stable under runtime load and capture the remaining CH390 readback failure as a credible low-level device-response issue in the handoff logs.
2026-04-01 03:39:08 +08:00


UART CH390 Debug Handoff

2026-03-31 Config UART Test Session

Goal

  • Exhaustively test the USART1 config command interface.
  • Verify that Flash-backed parameters survive AT+SAVE plus reset.
  • Record concrete bench procedure, failures, fixes, and evidence paths.

Bench Baseline

  • Workspace: D:\code\STM32Project\TCP2UART
  • Target MCU: STM32F103R8T6
  • Config UART: USART1 on PA9/PA10
  • Host visible COM ports during this session: COM1, COM9
  • Debug probe visible during this session: STLink V2
  • Flash parameter page: 0x0800FC00
  • Firmware image used for test bring-up: MDK-ARM\TCP2UART\TCP2UART.axf

Source-of-Truth Command Surface

From App/config.c, the tested command surface is:

  • AT
  • AT+?
  • AT+QUERY
  • AT+SAVE
  • AT+RESET
  • AT+DEFAULT
  • AT+IP=...
  • AT+MASK=...
  • AT+GW=...
  • AT+RIP=...
  • AT+MAC=...
  • AT+PORT=...
  • AT+RPORT=...
  • AT+BAUD1=...
  • AT+BAUD2=...
  • AT+DHCP=0/1

Test Procedure

  1. Flash the current axf image with probe-rs download --chip STM32F103R8.
  2. Connect to the config UART at 115200 8N1.
  3. Run python tools/uart_config_test.py --port COM9 --scenario inventory.
  4. Run python tools/uart_config_test.py --port COM9 --scenario persistence.
  5. Read back Flash words at 0x0800FC00 and compare them before and after reset.
  6. Save all transcripts into artifacts/uart-config/.

Important Expectations

  • Setter commands update RAM immediately and return OK plus the reboot hint.
  • Persistence is not proven until the sequence set -> query -> save -> reset -> query passes.
  • AT+DEFAULT only resets RAM state and still requires AT+SAVE to persist.
  • AT+DHCP=1 must fail in this build by design.

Session Findings

  • probe-rs download succeeded against STM32F103R8.
  • pyserial had to be installed on the host before scripted UART testing.
  • The requested handoff filename did not exist in the repo, so this file was created as the dedicated config-UART handoff log.
  • Raw command transcripts and Flash comparisons should be attached from artifacts/uart-config/ for any future regression analysis.

2026-03-31 Live Test Evidence

  • Host-side strict inventory run on COM9 failed with non_empty_responses = 0.
  • Artifact: artifacts/uart-config/inventory-20260331-185333.json
  • Direct host probe on COM9 with AT\r\n returned b''.
  • Direct host probe on COM1 with AT\r\n also returned b''.
  • Strict persistence run was intentionally not accepted as valid after the script was corrected, because all command responses were empty.
  • Flash page 0x0800FC00 remained all 0xFFFFFFFF, which is consistent with AT+SAVE never having executed on the target.

Board/Debugger Findings That Narrow The Fault

  • The target firmware is present in Flash and probe-rs download completes successfully.
  • After a clean reset and short run window, the MCU executes from Flash and initializes clocks and USART1 registers.
  • USART1 register block was observed in an initialized state after boot, not all-zero.
  • Firmware was patched to explicitly start HAL_UART_Receive_IT(&huart1, &g_uart1_rx_probe_byte, 1) during App_Init().
  • Even after that fix, repeated AT frames from the host produced no visible UART response.
  • Debug readout of software-side config RX state showed:
    • g_pending_cmd_ready = 0
    • g_pending_cmd_len = 0
    • g_uart_cmd_len = 0
    • g_uart_cmd_buffer remained zero-filled
    • g_pending_cmd_buffer remained zero-filled
  • This means the config parser never received any bytes from the live UART path during the test window.

Current Best Conclusion

  • The immediate blocker is no longer the parser itself.
  • Current evidence points to a board-level USART1_RX path problem or a wrong host wiring/port assumption: the firmware is alive and USART1 is initialized, yet no command bytes enter config_uart_rx_byte().
  • Until the physical config-UART path is proven, it is not meaningful to claim that every config command or Flash persistence path has passed on real hardware.

Corrected Final Conclusion

  • The earlier "no response" conclusion was wrong: the host was sending \r\n-terminated commands.
  • On the real bench setup, this firmware completes a config command frame only when the command is terminated by a bare \n.
  • After switching the host sender to AT...\n, COM9 immediately returned OK\r\n for AT and the complete config command set became testable.
  • Therefore the key bench rule is: every config command must end with \n.
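The bench rule above can be enforced on the host side before any bytes reach the serial port. The following is a minimal sketch, assuming a hypothetical helper named frame_at_command() that is not part of the repo; it appends exactly one \n and rejects any command that already contains \r or \n, so a \r\n habit cannot slip back in.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side helper (not from the repo): copies `cmd` into `out`
 * and appends exactly one '\n'. Returns the frame length in bytes, or 0 if
 * the command is empty, already contains '\r' or '\n', or does not fit. */
size_t frame_at_command(const char *cmd, char *out, size_t out_size)
{
    size_t len = strlen(cmd);

    if (len == 0 || strpbrk(cmd, "\r\n") != NULL) {
        return 0;
    }
    if (len + 2 > out_size) {   /* command + '\n' + terminating NUL */
        return 0;
    }
    memcpy(out, cmd, len);
    out[len] = '\n';
    out[len + 1] = '\0';
    return len + 1;
}
```

A caller would pass the returned length to whatever serial write routine the host tool uses and treat a return value of 0 as a test-script bug rather than a target failure.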

Final Verified Results

  • AT returns OK.
  • AT+? and AT+QUERY return the full current configuration snapshot.
  • Setter commands AT+IP, AT+MASK, AT+GW, AT+RIP, AT+MAC, AT+PORT, AT+RPORT, AT+BAUD1, AT+BAUD2, AT+DHCP=0 all return OK plus the reboot hint.
  • AT+SAVE returns OK: Configuration saved.
  • AT+RESET returns OK: Resetting... and the board comes back responding to AT.
  • AT+DEFAULT returns OK: Defaults restored.
  • Negative cases were verified:
    • AT+UNKNOWN -> ERROR: Unknown command
    • AT+PORT=0 / AT+PORT=65536 -> ERROR: Invalid port
    • AT+BAUD1=1199 / AT+BAUD1=921601 -> ERROR: Invalid baudrate
    • AT+DHCP=1 -> ERROR: DHCP disabled in this build
    • AT+IP=999.1.1.1 -> ERROR: Invalid IP format
    • AT+MAC=GG:11:22:33:44:55 -> ERROR: Invalid MAC format
  • Non-AT input BT produced no response, which matches the parser gate.
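The port and baud-rate negative cases above probe the values just outside each range. The sketch below shows range checks that are consistent with those results, assuming the accepted ranges are 1..65535 for ports and 1200..921600 for baud rates. The real checks live in App/config.c and may be stricter (for example, a fixed list of standard baud rates); all function names here are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parses an unsigned decimal string and accepts it only inside [lo, hi].
 * Rejects empty input, signs, leading spaces, and trailing characters. */
bool parse_ranged_ulong(const char *text, unsigned long lo, unsigned long hi,
                        unsigned long *value)
{
    char *end = NULL;
    unsigned long parsed;

    if (text == NULL || *text < '0' || *text > '9') {
        return false;
    }
    errno = 0;
    parsed = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0' || parsed < lo || parsed > hi) {
        return false;
    }
    *value = parsed;
    return true;
}

/* Assumed range: 1..65535 (0 and 65536 were rejected on the bench). */
bool config_port_is_valid(const char *text, unsigned long *port)
{
    return parse_ranged_ulong(text, 1ul, 65535ul, port);
}

/* Assumed range: 1200..921600 (1199 and 921601 were rejected on the bench). */
bool config_baud_is_valid(const char *text, unsigned long *baud)
{
    return parse_ranged_ulong(text, 1200ul, 921600ul, baud);
}
```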

Final Flash Persistence Evidence

  • Tested persisted values:
    • IP=192.168.1.123
    • MASK=255.255.255.0
    • GW=192.168.1.1
    • RIP=192.168.1.201
    • MAC=02:12:34:56:78:9A
    • PORT=10001
    • RPORT=10002
    • BAUD1=57600
    • BAUD2=38400
  • Sequence used:
    1. set values with \n-terminated AT commands
    2. query with AT+?
    3. AT+SAVE
    4. AT+RESET
    5. query again with AT+?
    6. read raw Flash words at 0x0800FC00
  • Query values before and after reset matched exactly.
  • Raw Flash read before and after reset also matched exactly.
  • Factory default restoration was also proven with AT+DEFAULT -> AT+SAVE -> AT+RESET -> AT+?.
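The raw Flash comparison in this sequence can be sketched as two small checks on word snapshots read from 0x0800FC00. This is a minimal illustration with hypothetical function names; it assumes the "before" snapshot is taken after AT+SAVE and the "after" snapshot after AT+RESET, and it treats an all-0xFFFFFFFF page as "never saved" rather than as a match.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every word still holds the erased value 0xFFFFFFFF. */
bool flash_page_is_erased(const uint32_t *words, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (words[i] != 0xFFFFFFFFu) {
            return false;
        }
    }
    return true;
}

/* True only if the page was actually programmed and both snapshots are
 * identical word for word. An erased page never counts as persistence. */
bool flash_persistence_ok(const uint32_t *before, const uint32_t *after,
                          size_t count)
{
    if (count == 0 || flash_page_is_erased(before, count)) {
        return false;
    }
    for (size_t i = 0; i < count; ++i) {
        if (before[i] != after[i]) {
            return false;
        }
    }
    return true;
}
```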

Evidence Files

  • Inventory transcript: artifacts/uart-config/inventory-20260331-185752.json
  • Inventory raw text: artifacts/uart-config/inventory-20260331-185752.txt
  • Persistence transcript: artifacts/uart-config/persistence-20260331-190039.json
  • Persistence raw text: artifacts/uart-config/persistence-20260331-190039.txt

Firmware Adjustment Made During This Session

  • Added explicit HAL_UART_Receive_IT(&huart1, &g_uart1_rx_probe_byte, 1u) arming in App_Init() so the USART1 interrupt receive path is definitely started after boot.
  • This is a safe, minimal bring-up fix and should remain in place.

Practical Lessons

  • Do not treat a script exit code alone as proof of UART success; require at least one non-empty response in the captured transcript.
  • Do not treat probe-rs read 0x0800FC00 returning all 0xFFFFFFFF as a flash-driver failure until you first prove that AT+SAVE was actually accepted by the parser.
  • In this project, the fastest truth test is:
    1. prove target is executing
    2. prove USART1 is initialized
    3. prove bytes reach config_uart_rx_byte
    4. only then evaluate parser responses and flash persistence
  • Most important bench lesson: if the config UART appears dead, first retry with commands ending in \n instead of \r\n.

Open Items

  • Confirm whether COM9 is the real USART1 config port by live command-response evidence.
  • If command-response is unstable, inspect whether host wiring/USB-UART level shifting is the cause before changing parser logic.
  • If persistence fails after a clean AT+SAVE, inspect App/flash_param.c and raw Flash contents at 0x0800FC00 before changing higher-level config logic.

2026-03-31 CH390D Bring-up Debug Session

Goal

  • Determine why CH390D does not return valid register values during boot.
  • Find a software-side root cause if one exists and attempt a minimal fix.

Baseline Symptom

  • MCU boots normally and RTT works.
  • CH390 boot diagnostics originally reported:
TCP2UART boot
CH390 VID=0x0000 PID=0x0000 REV=0x00 NSR=0x00 LINK=0
CH390 NCR=0x00 RCR=0x00 IMR=0x00 INTCR=0x00 GPR=0x00 ISR=0x00
CH390 WRCHK NCR:0x00->0x00 INTCR:0x00->0x00
  • This showed that CH390 register reads and write-back checks were not producing valid values.

Board-Side Evidence Already Collected

  • RST line was observed released high.
  • CS line idle state was high.
  • INT line was observed low and mapped to EXTI.
  • SPI1 was enabled and configured for Mode 3 in the active firmware.
  • These observations did not by themselves restore valid CH390 responses.

What Was Tried

  1. Added richer RTT startup diagnostics in BootDiag_ReportCh390().
  2. Lowered SPI1 speed from /8 to /64.
  3. Added stage markers around low_level_init() to localize the hang.
  4. Step-debugged and breakpoint-debugged ch390_default_config() and ch390_write_phy().
  5. Added timeout protection to ch390_read_phy() / ch390_write_phy() so EPCR polling cannot hang forever.
  6. Temporarily skipped ch390_set_phy_mode(CH390_AUTO) to isolate non-PHY register access.
  7. Compared current driver against Reference/EVT/EXAM/PUB/CH390.c and Reference/EVT/EXAM/PUB/CH390_Interface.c.
  8. Tried multiple SPI register transaction shapes:
    • original two-byte exchange style
    • split Transmit then read phase
    • explicit dummy-byte read phase
    • single-frame two-byte full-duplex read
  9. Scanned all four SPI modes (mode0..mode3) during startup.
  10. Added small CS setup/hold delays.
  11. Increased hardware reset release wait to 50ms.
  12. Restored EVT-style init order and PHY setup path to see whether EVT sequence alone fixes the problem.

Key Intermediate Findings

  • Lowering SPI speed changed behavior, but did not recover valid CH390 IDs.
  • Stage markers showed that low-speed SPI could stall during ETH init: default.
  • Step/RTT evidence localized the original stall to PHY access during ch390_default_config().
  • The PHY access loop in ch390_read_phy() / ch390_write_phy() had no timeout and could hang indefinitely. This is a real software bug and should stay fixed.
  • After adding PHY timeouts and temporarily skipping PHY setup, the init path completed, but all CH390 reads became 0xFF rather than valid IDs.
  • SPI mode scan result under that condition was:
CH390 SPI mode0 [FF FF FF FF FF]
CH390 SPI mode1 [FF FF FF FF FF]
CH390 SPI mode2 [FF FF FF FF FF]
CH390 SPI mode3 [FF FF FF FF FF]
  • This ruled out a simple CPOL/CPHA mismatch.
  • External code comparison did not reveal an opcode or register-address mismatch. Public CH390 implementations use the same OPC_REG_R=0x00, OPC_REG_W=0x80, and the same register map.
  • One experimental split transaction path produced repeatable but obviously bogus values like 0x03, 0xAC, 0xAE, which strongly suggests transaction artifacts rather than real CH390 data.
  • A debug read of SPI1->SR showed OVR=1 during one of the experimental transaction variants, indicating the SPI transaction layer was not trustworthy in that configuration.
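The PHY timeout finding above amounts to replacing an unbounded busy-bit poll with a bounded one. The sketch below is not the project's ch390_read_phy() / ch390_write_phy() code; it is a host-runnable illustration with hypothetical names, where the register address, the busy mask, and the mock register/tick sources are all assumptions supplied by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t  (*reg_read_fn)(uint8_t reg);
typedef uint32_t (*tick_ms_fn)(void);

/* Polls `reg` until (value & busy_mask) == 0 or `timeout_ms` has elapsed.
 * Returns true when the device went idle, false on timeout. */
bool wait_reg_idle(reg_read_fn read_reg, tick_ms_fn now_ms,
                   uint8_t reg, uint8_t busy_mask, uint32_t timeout_ms)
{
    uint32_t start = now_ms();

    for (;;) {
        if ((read_reg(reg) & busy_mask) == 0u) {
            return true;
        }
        /* Unsigned subtraction stays correct across tick wrap-around. */
        if ((uint32_t)(now_ms() - start) >= timeout_ms) {
            return false;
        }
    }
}

/* Host-side mocks standing in for the SPI register read and the tick. */
uint32_t mock_tick;
int      mock_busy_reads;   /* reads that still report busy; < 0 = forever */

uint32_t mock_now_ms(void) { return mock_tick++; }

uint8_t mock_read_reg(uint8_t reg)
{
    (void)reg;
    if (mock_busy_reads < 0) {
        return 0x01u;
    }
    if (mock_busy_reads > 0) {
        mock_busy_reads--;
        return 0x01u;
    }
    return 0x00u;
}
```

On target the tick source would typically be HAL_GetTick(), so a loop of this shape only terminates on timeout if the tick keeps advancing while it runs.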

EVT Comparison Outcome

  • Reference/EVT is useful as a baseline, but it is not a drop-in fix for this project.
  • The broad init order in the live project already matches EVT closely through the lwIP glue path.
  • The most important EVT-specific difference is that EVT performs ch390_set_phy_mode(CH390_AUTO) at the start of ch390_default_config().
  • Restoring the EVT-style PHY setup path in this project caused boot to hang again at:
TCP2UART boot
ETH init: gpio
ETH init: spi
ETH init: reset
ETH init: default
  • That confirms the PHY path is a real trigger for the hang, but EVT order alone does not solve the underlying communication problem.

Current Best Technical Conclusion

  • A real software defect was found and fixed: EPCR polling in PHY access had no timeout.
  • That fix prevents the firmware from hanging forever, but it does not restore valid CH390 register communication.
  • The core unresolved problem remains: the SPI register-access path still does not yield believable CH390 register data.
  • At this point, the following common explanations have already been tested and are not sufficient by themselves:
    • SPI mode selection
    • adding dummy bytes
    • CS setup/hold delays
    • changing reset wait from 10ms to 50ms
    • reverting to EVT transaction style
    • restoring EVT initialization order
    • public opcode / register-map mismatch
  • The next high-value experiment is a temporary GPIO bit-bang read of VIDL/VIDH/CHIPR with a fully controlled continuous command+clock sequence.
  • If bit-bang returns valid IDs while HAL-SPI paths do not, the remaining fault is in the SPI transaction implementation rather than CH390 higher-level init order.
  • If bit-bang still returns invalid data, the investigation must move back to board-level bus behavior even if static continuity checks look correct.
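The bit-bang experiment proposed above could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: the pin hooks and all bb_* / mock_* names are hypothetical, the transfer is SPI mode 3 (matching the mode noted for SPI1 in this log), and the register read is assumed to be one opcode byte (OPC_REG_R = 0x00 OR-ed with the register address) followed by one data byte with CS held low throughout. The mock pins let the routine run on a host.

```c
#include <stdint.h>

/* Hypothetical pin hooks; on target each would wrap one GPIO write or read. */
typedef struct {
    void (*set_cs)(int level);
    void (*set_sck)(int level);
    void (*set_mosi)(int level);
    int  (*get_miso)(void);
} bb_pins_t;

/* Shifts one byte MSB first in SPI mode 3: the clock idles high, data is
 * changed after the falling edge and sampled after the rising edge. */
uint8_t bb_xfer_mode3(const bb_pins_t *pins, uint8_t out)
{
    uint8_t in = 0;

    for (int bit = 7; bit >= 0; --bit) {
        pins->set_sck(0);
        pins->set_mosi((out >> bit) & 1);
        pins->set_sck(1);
        in = (uint8_t)((in << 1) | (pins->get_miso() ? 1u : 0u));
    }
    return in;
}

/* Reads one register with CS low across the command and data bytes. */
uint8_t bb_read_reg(const bb_pins_t *pins, uint8_t reg)
{
    uint8_t value;

    pins->set_sck(1);                                   /* mode 3 idle level */
    pins->set_cs(0);
    (void)bb_xfer_mode3(pins, (uint8_t)(0x00u | reg));  /* OPC_REG_R | reg */
    value = bb_xfer_mode3(pins, 0x00u);                 /* clock data out */
    pins->set_cs(1);
    return value;
}

/* Host-side mock: MISO is a fixed level, MOSI bits are recorded on each
 * rising clock edge while CS is low. */
int      mock_cs = 1;
int      mock_mosi;
int      mock_miso = 1;
uint32_t mock_mosi_bits;

void mock_set_cs(int level)   { mock_cs = level; }
void mock_set_mosi(int level) { mock_mosi = level; }
int  mock_get_miso(void)      { return mock_miso; }
void mock_set_sck(int level)
{
    if (level && !mock_cs) {
        mock_mosi_bits = (mock_mosi_bits << 1) | (uint32_t)mock_mosi;
    }
}

const bb_pins_t mock_pins = {
    mock_set_cs, mock_set_sck, mock_set_mosi, mock_get_miso
};
```

With the mock MISO level held high, every read returns 0xFF, which is the same signature an undriven, pulled-up MISO line would give. A real probe would likely also need short delays between edges; those are omitted here.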

Additional 2026-03-31 Finding: HAL SPI Re-init And Bit-Bang Side Effects

  • A valid concern was raised about calling HAL_SPI_Init() after temporarily changing SPI pins to GPIO mode.
  • Code review of stm32f1xx_hal_spi.c showed that HAL_SPI_MspInit() only runs when hspi->State == HAL_SPI_STATE_RESET.
  • Therefore, simply calling HAL_SPI_Init() after bit-bang mode does not automatically restore PA5/PA7 to SPI alternate-function output mode.
  • This was a real software-side risk in the temporary bit-bang probe and was corrected by explicitly restoring the following pin modes before calling HAL_SPI_Init() again:
    • PA5 -> AF_PP
    • PA7 -> AF_PP
    • PA6 -> input
  • After that correction, the observed behavior changed again: boot output stopped at ETH init: reset, and a short halt showed execution inside HAL_SPI_TransmitReceive() called from the CH390 SPI exchange path.
  • This means the earlier bit-bang experiments could have polluted later SPI results, but after the GPIO restore fix, the active blocker is again a live SPI transaction stall rather than a missing-GPIO-restore artifact.

Additional 2026-03-31 Finding: Reset Exists In Runtime Path

  • The project does not lack a CH390 reset process.
  • The actual runtime order is:
    1. App_Init()
    2. lwip_netif_init()
    3. ethernetif_init()
    4. low_level_init()
    5. ch390_gpio_init()
    6. ch390_spi_init()
    7. ch390_hardware_reset()
    8. ch390_default_config()
  • The reset process is therefore present and executed, but it lives in the lwIP/netif bring-up path instead of being written inline in main.c as in the EVT sample.
  • The current unresolved problem is not "missing reset"; it is that SPI transactions after reset still do not produce valid CH390 register responses.

2026-03-31 Manual Reset Sensitivity Analysis

Observed Symptom

  • An extra ch390_hardware_reset() was temporarily inserted into App_Init() before lwip_init().
  • With that extra reset in place, a manual board reset could lead to the firmware appearing stuck and the LED heartbeat not behaving normally.
  • The same image could still look more normal when observed after a probe-rs flash-and-run cycle.

Code-Level Finding

  • The inserted extra reset sat here in Core/Src/main.c:
SEGGER_RTT_Init();
SEGGER_RTT_WriteString(0, "\r\nTCP2UART boot\r\n");
...
ch390_hardware_reset();
lwip_init();
lwip_netif_init(...);
  • But the normal bring-up path already performs a reset later inside ch390_runtime_init() / low_level_init() before ch390_default_config().
  • That means the temporary line created a redundant early reset in a different initialization phase than the normal driver-owned reset.

Interpretation

  • This pattern is much more consistent with a reset-sequencing / startup-state issue than with compiler optimization level.
  • The Keil target uses one fixed optimization configuration, so a plain manual reset does not change code generation.
  • In contrast, an extra CH390 reset inserted before lwIP and before the normal CH390 runtime init can alter the device startup state and timing relationship between the MCU and CH390.

Action Taken

  • The extra ch390_hardware_reset() in App_Init() was removed.
  • The firmware now relies only on the standard driver-owned reset inside the CH390 runtime initialization path.

Conclusion

  • The temporary extra reset was not kept.
  • The strongest software-side conclusion is that the manual-reset sensitivity was caused by redundant reset sequencing rather than by optimization level.

2026-03-31 HardFault Root Cause And Fix

Symptom

  • After CH390 bring-up completed and boot diagnostics printed, the firmware entered:
TRAP: HardFault_Handler
  • At the same time, PC13 stopped blinking, which originally looked like a timer or LED problem.

Fault Evidence

  • Fault-status registers showed a real fault rather than a normal busy wait.
  • The trap location was Debug_TrapWithRttHint() in Core/Src/main.c.
  • The stacked fault frame pointed back into the normal runtime path rather than the trap itself.
  • TIM4 was configured and had already advanced g_led_blink_ticks, so the LED path was alive before the fault.

Root Cause

  • MX_IWDG_Init() had been temporarily commented out in main().
  • However, App_Poll() still executed:
HAL_IWDG_Refresh(&hiwdg);
  • Because hiwdg was never initialized, this call operated on an invalid handle and led to the observed fault path.

Fix Applied

  • Core/Src/main.c was changed so watchdog refresh only runs when hiwdg.Instance == IWDG.
  • This preserves normal behavior when IWDG is enabled, while avoiding invalid access when IWDG init is intentionally disabled for debugging.
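The guard described above can be illustrated on a host with a stand-in handle type. This is a sketch only: wdg_handle_t models just the Instance field of the HAL handle, the refresh callback stands in for HAL_IWDG_Refresh(), and every name below is hypothetical. It assumes the handle is a zero-initialized global, so Instance is NULL until init has run.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the HAL watchdog handle; only the guarded field is modelled. */
typedef struct {
    void *Instance;
} wdg_handle_t;

typedef void (*wdg_refresh_fn)(wdg_handle_t *handle);

/* Issues a refresh only if init bound the handle to the expected peripheral.
 * Returns true when a refresh was issued, false when it was skipped. */
bool wdg_refresh_if_initialized(wdg_handle_t *handle, void *expected_instance,
                                wdg_refresh_fn refresh)
{
    if (handle == NULL || expected_instance == NULL ||
        handle->Instance != expected_instance) {
        return false;   /* watchdog init was skipped: do not touch it */
    }
    refresh(handle);
    return true;
}

/* Host-side mock that counts refresh calls. */
int mock_refresh_count;

void mock_refresh(wdg_handle_t *handle)
{
    (void)handle;
    mock_refresh_count++;
}
```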

Verification

  • Rebuilt successfully with 0 errors, 1 warning.
  • Reflashed target and reran startup.
  • Boot RTT still showed CH390 diagnostics, but no longer showed TRAP: HardFault_Handler.
  • A 5-second runtime window completed without a new trap.
  • g_led_blink_ticks continued advancing after the fix, confirming that TIM4 interrupts and the LED heartbeat path were alive again.

Conclusion

  • The HardFault was caused by refreshing an uninitialized IWDG handle, not by the CH390 SPI path itself.
  • This issue is fixed.
  • CH390 bring-up is still unresolved at the register-communication level, but the main task can run normally again.

2026-03-31 Runtime Freeze Root Cause And Fix

Symptom

  • After re-soldering CH390D, the system could boot and print the normal CH390 startup diagnostics.
  • However, after running for a while, the device would appear to freeze.
  • In that state, the LED heartbeat behavior became unreliable and the system appeared to stop making useful progress.

Key Runtime Evidence

  • The new freeze was not another HardFault: no new TRAP: line appeared during the freeze window.
  • g_led_blink_ticks continued advancing during observation windows, proving that TIM4 interrupts were still alive and the MCU was not fully dead.
  • A short halt during the bad behavior repeatedly landed in HAL_SPI_TransmitReceive().
  • Code inspection showed that CH390 runtime paths in ethernetif.c were executing blocking SPI transactions while global interrupts were disabled via ethernetif_lock().

Root Cause

  • low_level_output(), low_level_input(), and ethernetif_check_link() in Drivers/LwIP/src/netif/ethernetif.c wrapped CH390 SPI register/memory accesses inside ethernetif_lock() / ethernetif_unlock().
  • Those helpers globally disable interrupts by manipulating PRIMASK.
  • The CH390 access path uses blocking HAL SPI functions and timeout logic based on HAL_GetTick().
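  • Running those blocking accesses with interrupts disabled can stall or livelock the runtime path, because the tick interrupt that advances HAL_GetTick() cannot run while PRIMASK is set and the timeouts therefore cannot expire. This matters most after startup, when network polling begins.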

Fix Applied

  • Reduced the interrupt-masked critical sections in ethernetif.c to only protect the shared IRQ-pending flag.
  • Removed ethernetif_lock() coverage from the long CH390 SPI transaction paths in:
    • low_level_output()
    • low_level_input()
    • ethernetif_check_link()
  • In ethernetif_poll(), only the g_ch390_irq_pending flag is now cleared under the short critical section; the actual CH390 register access happens with interrupts enabled.
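The shape of this fix is a read-and-clear of the pending flag under a very short interrupt mask, followed by the slow device work with interrupts enabled. The sketch below is a host-runnable illustration with hypothetical names, not the code in ethernetif.c; irq_lock() / irq_unlock() are mocks for the PRIMASK save/disable/restore that the real helpers perform.

```c
#include <stdbool.h>

/* Host-side mocks for the interrupt mask and for bookkeeping. */
int mock_irq_masked;                  /* > 0 while "interrupts" are masked */
int mock_service_calls;
int mock_service_calls_while_masked;

void irq_lock(void)   { mock_irq_masked++; }
void irq_unlock(void) { mock_irq_masked--; }

volatile bool g_irq_pending;          /* set from the interrupt handler */

/* Reads and clears the pending flag atomically. Nothing slow runs here. */
bool take_irq_pending(void)
{
    bool pending;

    irq_lock();
    pending = g_irq_pending;
    g_irq_pending = false;
    irq_unlock();
    return pending;
}

/* Stand-in for the blocking SPI service routine. */
void mock_service_device(void)
{
    mock_service_calls++;
    if (mock_irq_masked > 0) {
        mock_service_calls_while_masked++;
    }
}

/* One poll iteration: the blocking work runs only after the mask is gone. */
void poll_once(void (*service_device)(void))
{
    if (take_irq_pending()) {
        service_device();
    }
}
```

Clearing the flag before servicing means an interrupt that arrives during the SPI work sets it again and is picked up on the next poll instead of being lost.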

Verification

  • Rebuilt successfully with 0 errors, 0 warnings.
  • Reflashed and reran the target.
  • Boot RTT still completed normally through:
TCP2UART boot
ETH init: gpio
ETH init: spi
ETH init: reset
ETH init: default
ETH init: mac
ETH init: getmac
ETH init: irq
ETH init: done
CH390 VID=0x0000 PID=0x0000 REV=0x00 NSR=0x00 LINK=0
CH390 NCR=0x00 RCR=0x00 IMR=0x00 INTCR=0x00 GPR=0x00 ISR=0x00
  • No new TRAP: message appeared during extended runtime observation.
  • g_led_blink_ticks continued advancing over multiple samples, indicating that the heartbeat timer and interrupt delivery remained active.
  • The system no longer reproduced the earlier "runs for a while then appears frozen" behavior in the observed validation window.

Conclusion

  • This freeze was caused by doing blocking CH390 SPI operations inside a global interrupt-disabled critical section.
  • The runtime freeze is fixed.
  • CH390 register communication is still invalid (0x0000 ID values), but that is now a separate communication/bring-up problem rather than the cause of the observed runtime stall.

2026-03-31 SPI Ownership Decoupling And CH390 Current Status

Why This Refactor Was Done

  • The project previously allowed multiple runtime layers to reach down into CH390/SPI behavior directly:
    • ethernetif.c handled init, IRQ-driven poll service, RX/TX transactions, and link checks
    • main.c directly read CH390 registers for boot diagnostics
    • the CH390 low-level SPI transport sat underneath those callers with no single runtime owner boundary
  • This made the system harder to reason about and contributed to runtime instability when CH390 accesses happened from different code paths with different assumptions.

Refactor Outcome

  • Added a single runtime owner module: Drivers/CH390/ch390_runtime.c + Drivers/CH390/ch390_runtime.h.
  • After this change:
    • CH390_Interface.c remains the only SPI transport implementation
    • CH390.c remains the chip-level helper layer
    • ch390_runtime.c is now the only runtime owner of CH390 transactions after boot
    • ethernetif.c delegates runtime TX/RX/link/IRQ servicing to ch390_runtime
    • main.c no longer performs direct CH390 register reads; boot diagnostics use ch390_runtime_get_diag()
    • EXTI0_IRQHandler() only posts the IRQ-pending event into the runtime owner and does not touch CH390 directly

Behavior After Refactor

  • Build passed with 0 errors, 0 warnings.
  • The system remained stable in the post-refactor runtime window:
    • no new trap output
    • heartbeat/timer activity continued
    • previous runtime freeze did not reproduce in the observed window

CH390 Result After Refactor

  • The CH390 did not come up successfully.
  • However, the failure signature became cleaner and more trustworthy:
CH390 VID=0xFFFF PID=0xFFFF REV=0xFF NSR=0xFF LINK=1
CH390 NCR=0xFF RCR=0xFF IMR=0xFF INTCR=0xFF GPR=0xFF ISR=0xFF
  • This is materially different from the earlier unstable mixture of:
    • all-zero reads
    • intermittent hangs
    • transaction artifacts
    • watchdog-related HardFaults

Trusted Interpretation Of Current Failure

  • With the SPI access model cleaned up and the system remaining stable, the current CH390 failure can now be treated as a credible transport-level non-response rather than a concurrency artifact.
  • A uniform 0xFF readback across identity and status/control registers strongly suggests one of these conditions:
    • CH390 still does not actively drive MISO during the register-read phase
    • CS reaches the MCU logic but is not effectively selecting the CH390 device on the board side
    • the CH390 digital core is not entering a valid SPI-responding state after reset even though the MCU-side sequence now looks consistent
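The distinction between the earlier all-zero readback and the current all-0xFF readback can be made mechanical in boot diagnostics. The sketch below uses hypothetical names and classifies a set of register bytes as uniformly 0x00, uniformly 0xFF, or mixed. A mixed result does not prove the values are valid; it only shows the bus is not uniformly stuck.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    READBACK_ALL_ZERO,   /* every byte read as 0x00 */
    READBACK_ALL_ONES,   /* every byte read as 0xFF */
    READBACK_MIXED       /* anything else, including an empty sample */
} readback_kind_t;

/* Classifies a sample of register bytes by its uniformity. */
readback_kind_t classify_readback(const uint8_t *regs, size_t count)
{
    int all_zero = 1;
    int all_ones = 1;

    if (count == 0) {
        return READBACK_MIXED;
    }
    for (size_t i = 0; i < count; ++i) {
        if (regs[i] != 0x00u) {
            all_zero = 0;
        }
        if (regs[i] != 0xFFu) {
            all_ones = 0;
        }
    }
    if (all_zero) {
        return READBACK_ALL_ZERO;
    }
    if (all_ones) {
        return READBACK_ALL_ONES;
    }
    return READBACK_MIXED;
}
```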

Practical Conclusion

  • The architectural decoupling requirement is complete.
  • The runtime stability requirement is complete.
  • The CH390 connection still fails, but the cause is now narrowed to a believable low-level bus/device-response problem rather than a software ownership/concurrency problem.