Files
TCP2UART/uart-ch390-debug-handoff.md
T
gaoro-xiao 14a532290d refactor: serialize CH390 runtime SPI access
Move runtime CH390 transactions behind a single ch390_runtime owner so main, lwIP glue, and EXTI no longer compete for SPI access. Keep the system stable under runtime load and capture the remaining CH390 readback failure as a credible low-level device-response issue in the handoff logs.
2026-04-01 03:39:08 +08:00

513 lines
24 KiB
Markdown

# UART CH390 Debug Handoff
## 2026-03-31 Config UART Test Session
### Goal
- Exhaustively test the `USART1` config command interface.
- Verify that Flash-backed parameters survive `AT+SAVE` plus reset.
- Record concrete bench procedure, failures, fixes, and evidence paths.
### Bench Baseline
- Workspace: `D:\code\STM32Project\TCP2UART`
- Target MCU: `STM32F103R8T6`
- Config UART: `USART1` on `PA9/PA10`
- Host visible COM ports during this session: `COM1`, `COM9`
- Debug probe visible during this session: `STLink V2`
- Flash parameter page: `0x0800FC00`
- Firmware image used for test bring-up: `MDK-ARM\TCP2UART\TCP2UART.axf`
### Source-of-Truth Command Surface
From `App/config.c`, the tested command surface is:
- `AT`
- `AT+?`
- `AT+QUERY`
- `AT+SAVE`
- `AT+RESET`
- `AT+DEFAULT`
- `AT+IP=...`
- `AT+MASK=...`
- `AT+GW=...`
- `AT+RIP=...`
- `AT+MAC=...`
- `AT+PORT=...`
- `AT+RPORT=...`
- `AT+BAUD1=...`
- `AT+BAUD2=...`
- `AT+DHCP=0/1`
### Test Procedure
1. Flash the current `axf` image with `probe-rs download --chip STM32F103R8`.
2. Connect to the config UART at `115200 8N1`.
3. Run `python tools/uart_config_test.py --port COM9 --scenario inventory`.
4. Run `python tools/uart_config_test.py --port COM9 --scenario persistence`.
5. Read back Flash words at `0x0800FC00` and compare them before and after reset.
6. Save all transcripts into `artifacts/uart-config/`.
### Important Expectations
- Setter commands update RAM immediately and return `OK` plus the reboot hint.
- Persistence is not proven until the sequence `set -> query -> save -> reset -> query` passes.
- `AT+DEFAULT` only resets RAM state and still requires `AT+SAVE` to persist.
- `AT+DHCP=1` must fail in this build by design.
### Session Findings
- `probe-rs download` succeeded against `STM32F103R8`.
- `pyserial` had to be installed on the host before scripted UART testing.
- The requested handoff filename did not exist in the repo, so this file was created as the dedicated config-UART handoff log.
- Raw command transcripts and Flash comparisons should be attached from `artifacts/uart-config/` for any future regression analysis.
### 2026-03-31 Live Test Evidence
- Host-side strict inventory run on `COM9` failed with `non_empty_responses = 0`.
- Artifact: `artifacts/uart-config/inventory-20260331-185333.json`
- Direct host probe on `COM9` with `AT\r\n` returned `b''`.
- Direct host probe on `COM1` with `AT\r\n` also returned `b''`.
- Strict persistence run was intentionally not accepted as valid after the script was corrected, because all command responses were empty.
- Flash page `0x0800FC00` remained all `0xFFFFFFFF`, proving `AT+SAVE` never actually executed on target.
### Board/Debugger Findings That Narrow The Fault
- The target firmware is present in Flash and `probe-rs download` completes successfully.
- After a clean reset and short run window, the MCU executes from Flash and initializes clocks and `USART1` registers.
- `USART1` register block was observed in an initialized state after boot, not all-zero.
- Firmware was patched to explicitly start `HAL_UART_Receive_IT(&huart1, &g_uart1_rx_probe_byte, 1)` during `App_Init()`.
- Even after that fix, repeated `AT` frames from the host produced no visible UART response.
- Debug readout of software-side config RX state showed:
- `g_pending_cmd_ready = 0`
- `g_pending_cmd_len = 0`
- `g_uart_cmd_len = 0`
- `g_uart_cmd_buffer` remained zero-filled
- `g_pending_cmd_buffer` remained zero-filled
- This means the config parser never received any bytes from the live UART path during the test window.
### Current Best Conclusion
- The immediate blocker is no longer the parser itself.
- Current evidence points to a board-level `USART1_RX` path problem or wrong host wiring/port assumption, because the firmware is alive, `USART1` is initialized, but no command bytes enter `config_uart_rx_byte()`.
- Until the physical config-UART path is proven, it is not meaningful to claim that every config command or Flash persistence path has passed on real hardware.
### Corrected Final Conclusion
- The earlier "no response" conclusion was wrong because the host was sending `\r\n` terminated commands.
- This firmware expects the config command to be terminated by `\n` to complete the frame in the real bench setup.
- After switching the host sender to `AT...\n`, `COM9` immediately returned `OK\r\n` for `AT` and the complete config command set became testable.
- Therefore the key bench rule is: every config command must end with `\n`.
### Final Verified Results
- `AT` returns `OK`.
- `AT+?` and `AT+QUERY` return the full current configuration snapshot.
- Setter commands `AT+IP`, `AT+MASK`, `AT+GW`, `AT+RIP`, `AT+MAC`, `AT+PORT`, `AT+RPORT`, `AT+BAUD1`, `AT+BAUD2`, `AT+DHCP=0` all return `OK` plus the reboot hint.
- `AT+SAVE` returns `OK: Configuration saved`.
- `AT+RESET` returns `OK: Resetting...` and the board comes back responding to `AT`.
- `AT+DEFAULT` returns `OK: Defaults restored`.
- Negative cases were verified:
- `AT+UNKNOWN` -> `ERROR: Unknown command`
- `AT+PORT=0` / `AT+PORT=65536` -> `ERROR: Invalid port`
- `AT+BAUD1=1199` / `AT+BAUD1=921601` -> `ERROR: Invalid baudrate`
- `AT+DHCP=1` -> `ERROR: DHCP disabled in this build`
- `AT+IP=999.1.1.1` -> `ERROR: Invalid IP format`
- `AT+MAC=GG:11:22:33:44:55` -> `ERROR: Invalid MAC format`
- Non-AT input `BT` produced no response, which matches the parser gate.
### Final Flash Persistence Evidence
- Tested persisted values:
- `IP=192.168.1.123`
- `MASK=255.255.255.0`
- `GW=192.168.1.1`
- `RIP=192.168.1.201`
- `MAC=02:12:34:56:78:9A`
- `PORT=10001`
- `RPORT=10002`
- `BAUD1=57600`
- `BAUD2=38400`
- Sequence used:
1. set values with `\n`-terminated AT commands
2. query with `AT+?`
3. `AT+SAVE`
4. `AT+RESET`
5. query again with `AT+?`
6. read raw Flash words at `0x0800FC00`
- Query values before and after reset matched exactly.
- Raw Flash read before and after reset also matched exactly.
- Factory default restoration was also proven with `AT+DEFAULT -> AT+SAVE -> AT+RESET -> AT+?`.
### Evidence Files
- Inventory transcript: `artifacts/uart-config/inventory-20260331-185752.json`
- Inventory raw text: `artifacts/uart-config/inventory-20260331-185752.txt`
- Persistence transcript: `artifacts/uart-config/persistence-20260331-190039.json`
- Persistence raw text: `artifacts/uart-config/persistence-20260331-190039.txt`
### Firmware Adjustment Made During This Session
- Added explicit `HAL_UART_Receive_IT(&huart1, &g_uart1_rx_probe_byte, 1u)` arming in `App_Init()` so the `USART1` interrupt receive path is definitely started after boot.
- This is a safe, minimal bring-up fix and should remain in place.
### Practical Lessons
- Do not treat a script exit code alone as proof of UART success; require at least one non-empty response in the captured transcript.
- Do not treat `probe-rs read 0x0800FC00` returning all `0xFFFFFFFF` as a flash-driver failure until you first prove that `AT+SAVE` was actually accepted by the parser.
- In this project, the fastest truth test is:
1. prove target is executing
2. prove `USART1` is initialized
3. prove bytes reach `config_uart_rx_byte`
4. only then evaluate parser responses and flash persistence
- Most important bench lesson: if the config UART appears dead, first retry with commands ending in `\n` instead of `\r\n`.
### Open Items
- Confirm whether `COM9` is the real `USART1` config port by live command-response evidence.
- If command-response is unstable, inspect whether host wiring/USB-UART level shifting is the cause before changing parser logic.
- If persistence fails after a clean `AT+SAVE`, inspect `App/flash_param.c` and raw Flash contents at `0x0800FC00` before changing higher-level config logic.
## 2026-03-31 CH390D Bring-up Debug Session
### Goal
- Determine why `CH390D` does not return valid register values during boot.
- Find a software-side root cause if one exists and attempt a minimal fix.
### Baseline Symptom
- MCU boots normally and RTT works.
- CH390 boot diagnostics originally reported:
```text
TCP2UART boot
CH390 VID=0x0000 PID=0x0000 REV=0x00 NSR=0x00 LINK=0
CH390 NCR=0x00 RCR=0x00 IMR=0x00 INTCR=0x00 GPR=0x00 ISR=0x00
CH390 WRCHK NCR:0x00->0x00 INTCR:0x00->0x00
```
- This showed that CH390 register reads and write-back checks were not producing valid values.
### Board-Side Evidence Already Collected
- `RST` line was observed released high.
- `CS` line idle state was high.
- `INT` line was observed low and mapped to EXTI.
- `SPI1` was enabled and configured for `Mode 3` in the active firmware.
- These observations did not by themselves restore valid CH390 responses.
### What Was Tried
1. Added richer RTT startup diagnostics in `BootDiag_ReportCh390()`.
2. Lowered `SPI1` speed from `/8` to `/64`.
3. Added stage markers around `low_level_init()` to localize the hang.
4. Step-debugged and breakpoint-debugged `ch390_default_config()` and `ch390_write_phy()`.
5. Added timeout protection to `ch390_read_phy()` / `ch390_write_phy()` so `EPCR` polling cannot hang forever.
6. Temporarily skipped `ch390_set_phy_mode(CH390_AUTO)` to isolate non-PHY register access.
7. Compared current driver against `Reference/EVT/EXAM/PUB/CH390.c` and `Reference/EVT/EXAM/PUB/CH390_Interface.c`.
8. Tried multiple SPI register transaction shapes:
- original two-byte exchange style
- split `Transmit` then read phase
- explicit dummy-byte read phase
- single-frame two-byte full-duplex read
9. Scanned all four SPI modes (`mode0`..`mode3`) during startup.
10. Added small `CS` setup/hold delays.
11. Increased hardware reset release wait to `50ms`.
12. Restored EVT-style init order and PHY setup path to see whether EVT sequence alone fixes the problem.
### Key Intermediate Findings
- Lowering SPI speed changed behavior, but did not recover valid CH390 IDs.
- Stage markers showed that low-speed SPI could stall during `ETH init: default`.
- Step/RTT evidence localized the original stall to PHY access during `ch390_default_config()`.
- The PHY access loop in `ch390_read_phy()` / `ch390_write_phy()` had no timeout and could hang indefinitely. This is a real software bug and should stay fixed.
- After adding PHY timeouts and temporarily skipping PHY setup, the init path completed, but all CH390 reads became `0xFF` rather than valid IDs.
- SPI mode scan result under that condition was:
```text
CH390 SPI mode0 [FF FF FF FF FF]
CH390 SPI mode1 [FF FF FF FF FF]
CH390 SPI mode2 [FF FF FF FF FF]
CH390 SPI mode3 [FF FF FF FF FF]
```
- This ruled out a simple `CPOL/CPHA` mismatch.
- External code comparison did not reveal an `opcode` or register-address mismatch. Public CH390 implementations use the same `OPC_REG_R=0x00`, `OPC_REG_W=0x80`, and the same register map.
- One experimental split transaction path produced repeatable but obviously bogus values like `0x03`, `0xAC`, `0xAE`, which strongly suggests transaction artifacts rather than real CH390 data.
- A debug read of `SPI1->SR` showed `OVR=1` during one of the experimental transaction variants, indicating the SPI transaction layer was not trustworthy in that configuration.
### EVT Comparison Outcome
- `Reference/EVT` is useful as a baseline, but it is not a drop-in fix for this project.
- The broad init order in the live project already matches EVT closely through the lwIP glue path.
- The most important EVT-specific difference is that EVT performs `ch390_set_phy_mode(CH390_AUTO)` at the start of `ch390_default_config()`.
- Restoring the EVT-style `PHY` setup path in this project caused boot to hang again at:
```text
TCP2UART boot
ETH init: gpio
ETH init: spi
ETH init: reset
ETH init: default
```
- That confirms the `PHY` path is a real trigger for the hang, but EVT order alone does not solve the underlying communication problem.
### Current Best Technical Conclusion
- A real software defect was found and fixed: `EPCR` polling in PHY access had no timeout.
- That fix prevents the firmware from hanging forever, but it does **not** restore valid CH390 register communication.
- The core unresolved problem remains: the SPI register-access path still does not yield believable CH390 register data.
- At this point, the following common explanations have already been tested and are **not** sufficient by themselves:
- SPI mode selection
- adding dummy bytes
- `CS` setup/hold delays
- changing reset wait from `10ms` to `50ms`
- reverting to EVT transaction style
- restoring EVT initialization order
- public `opcode` / register-map mismatch
### Recommended Next Debug Step
- The next high-value experiment is a temporary GPIO bit-bang read of `VIDL/VIDH/CHIPR` with a fully controlled continuous command+clock sequence.
- If bit-bang returns valid IDs while HAL-SPI paths do not, the remaining fault is in the SPI transaction implementation rather than CH390 higher-level init order.
- If bit-bang still returns invalid data, the investigation must move back to board-level bus behavior even if static continuity checks look correct.
### Additional 2026-03-31 Finding: HAL SPI Re-init And Bit-Bang Side Effects
- A valid concern was raised about calling `HAL_SPI_Init()` after temporarily changing SPI pins to GPIO mode.
- Code review of `stm32f1xx_hal_spi.c` showed that `HAL_SPI_MspInit()` only runs when `hspi->State == HAL_SPI_STATE_RESET`.
- Therefore, simply calling `HAL_SPI_Init()` after bit-bang mode does **not** automatically restore `PA5/PA7` to SPI alternate-function output mode.
- This was a real software-side risk in the temporary bit-bang probe and was corrected by explicitly restoring:
- `PA5` -> `AF_PP`
- `PA7` -> `AF_PP`
- `PA6` -> input
before calling `HAL_SPI_Init()` again.
- After that correction, the observed behavior changed again: boot output stopped at `ETH init: reset`, and a short halt showed execution inside `HAL_SPI_TransmitReceive()` called from the CH390 SPI exchange path.
- This means the earlier bit-bang experiments could have polluted later SPI results, but after the GPIO restore fix, the active blocker is again a live SPI transaction stall rather than a missing-GPIO-restore artifact.
### Additional 2026-03-31 Finding: Reset Exists In Runtime Path
- The project does **not** lack a CH390 reset process.
- The actual runtime order is:
1. `App_Init()`
2. `lwip_netif_init()`
3. `ethernetif_init()`
4. `low_level_init()`
5. `ch390_gpio_init()`
6. `ch390_spi_init()`
7. `ch390_hardware_reset()`
8. `ch390_default_config()`
- The reset process is therefore present and executed, but it lives in the lwIP/netif bring-up path instead of being written inline in `main.c` as in the EVT sample.
- The current unresolved problem is not "missing reset"; it is that SPI transactions after reset still do not produce valid CH390 register responses.
## 2026-03-31 Manual Reset Sensitivity Analysis
### Observed Symptom
- An extra `ch390_hardware_reset()` was temporarily inserted into `App_Init()` before `lwip_init()`.
- With that extra reset in place, a manual board reset could lead to the firmware appearing stuck and the LED heartbeat not behaving normally.
- The same image could still look more normal when observed after a `probe-rs` flash-and-run cycle.
### Code-Level Finding
- The inserted extra reset sat here in `Core/Src/main.c`:
```c
SEGGER_RTT_Init();
SEGGER_RTT_WriteString(0, "\r\nTCP2UART boot\r\n");
...
ch390_hardware_reset();
lwip_init();
lwip_netif_init(...);
```
- But the normal bring-up path already performs a reset later inside `ch390_runtime_init()` / `low_level_init()` before `ch390_default_config()`.
- That means the temporary line created a redundant early reset in a different initialization phase than the normal driver-owned reset.
### Interpretation
- This pattern is much more consistent with a reset-sequencing / startup-state issue than with compiler optimization level.
- The Keil target uses one fixed optimization configuration, so a plain manual reset does not change code generation.
- In contrast, an extra CH390 reset inserted before lwIP and before the normal CH390 runtime init can alter the device startup state and timing relationship between the MCU and CH390.
### Action Taken
- The extra `ch390_hardware_reset()` in `App_Init()` was removed.
- The firmware now relies only on the standard driver-owned reset inside the CH390 runtime initialization path.
### Conclusion
- The temporary extra reset was not kept.
- The strongest software-side conclusion is that the manual-reset sensitivity was caused by redundant reset sequencing rather than by optimization level.
## 2026-03-31 HardFault Root Cause And Fix
### Symptom
- After CH390 bring-up completed and boot diagnostics printed, the firmware entered:
```text
TRAP: HardFault_Handler
```
- At the same time, `PC13` stopped blinking, which originally looked like a timer or LED problem.
### Fault Evidence
- Fault-status registers showed a real fault rather than a normal busy wait.
- The trap location was `Debug_TrapWithRttHint()` in `Core/Src/main.c`.
- The stacked fault frame pointed back into the normal runtime path rather than the trap itself.
- `TIM4` was configured and had already advanced `g_led_blink_ticks`, so the LED path was alive before the fault.
### Root Cause
- `MX_IWDG_Init()` had been temporarily commented out in `main()`.
- However, `App_Poll()` still executed:
```c
HAL_IWDG_Refresh(&hiwdg);
```
- Because `hiwdg` was never initialized, this call operated on an invalid handle and led to the observed fault path.
### Fix Applied
- `Core/Src/main.c` was changed so watchdog refresh only runs when `hiwdg.Instance == IWDG`.
- This preserves normal behavior when IWDG is enabled, while avoiding invalid access when IWDG init is intentionally disabled for debugging.
### Verification
- Rebuilt successfully with `0 error`, `1 warning`.
- Reflashed target and reran startup.
- Boot RTT still showed CH390 diagnostics, but no longer showed `TRAP: HardFault_Handler`.
- A 5-second runtime window completed without a new trap.
- `g_led_blink_ticks` continued advancing after the fix, confirming that `TIM4` interrupts and the LED heartbeat path were alive again.
### Conclusion
- The HardFault was caused by refreshing an uninitialized IWDG handle, not by the CH390 SPI path itself.
- This issue is fixed.
- CH390 bring-up is still unresolved at the register-communication level, but the main task is again able to continue running normally.
## 2026-03-31 Runtime Freeze Root Cause And Fix
### Symptom
- After re-soldering CH390D, the system could boot and print the normal CH390 startup diagnostics.
- However, after running for a while, the device would appear to freeze.
- In that state, the LED heartbeat behavior became unreliable and the system appeared to stop making useful progress.
### Key Runtime Evidence
- The new freeze was **not** another HardFault: no new `TRAP:` line appeared during the freeze window.
- `g_led_blink_ticks` continued advancing during observation windows, proving that `TIM4` interrupts were still alive and the MCU was not fully dead.
- A short halt during the bad behavior repeatedly landed in `HAL_SPI_TransmitReceive()`.
- Code inspection showed that CH390 runtime paths in `ethernetif.c` were executing blocking SPI transactions while global interrupts were disabled via `ethernetif_lock()`.
### Root Cause
- `low_level_output()`, `low_level_input()`, and `ethernetif_check_link()` in `Drivers/LwIP/src/netif/ethernetif.c` wrapped CH390 SPI register/memory accesses inside `ethernetif_lock()` / `ethernetif_unlock()`.
- Those helpers globally disable interrupts by manipulating `PRIMASK`.
- The CH390 access path uses blocking HAL SPI functions and timeout logic based on `HAL_GetTick()`.
- Running those blocking accesses with interrupts disabled can stall or livelock the runtime path, especially after startup when network polling begins.
### Fix Applied
- Reduced the interrupt-masked critical sections in `ethernetif.c` to only protect the shared IRQ-pending flag.
- Removed `ethernetif_lock()` coverage from the long CH390 SPI transaction paths in:
- `low_level_output()`
- `low_level_input()`
- `ethernetif_check_link()`
- In `ethernetif_poll()`, only the `g_ch390_irq_pending` flag is now cleared under the short critical section; the actual CH390 register access happens with interrupts enabled.
### Verification
- Rebuilt successfully with `0 error`, `0 warning`.
- Reflashed and reran the target.
- Boot RTT still completed normally through:
```text
TCP2UART boot
ETH init: gpio
ETH init: spi
ETH init: reset
ETH init: default
ETH init: mac
ETH init: getmac
ETH init: irq
ETH init: done
CH390 VID=0x0000 PID=0x0000 REV=0x00 NSR=0x00 LINK=0
CH390 NCR=0x00 RCR=0x00 IMR=0x00 INTCR=0x00 GPR=0x00 ISR=0x00
```
- No new `TRAP:` message appeared during extended runtime observation.
- `g_led_blink_ticks` continued advancing over multiple samples, indicating that the heartbeat timer and interrupt delivery remained active.
- The system no longer reproduced the earlier “runs for a while then appears frozen” behavior in the observed validation window.
### Conclusion
- This freeze was caused by doing blocking CH390 SPI operations inside a global interrupt-disabled critical section.
- The runtime freeze is fixed.
- CH390 register communication is still invalid (`0x0000` ID values), but that is now a separate communication/bring-up problem rather than the cause of the observed runtime stall.
## 2026-03-31 SPI Ownership Decoupling And CH390 Current Status
### Why This Refactor Was Done
- The project previously allowed multiple runtime layers to reach down into CH390/SPI behavior directly:
- `ethernetif.c` handled init, IRQ-driven poll service, RX/TX transactions, and link checks
- `main.c` directly read CH390 registers for boot diagnostics
- the CH390 low-level SPI transport sat underneath those callers with no single runtime owner boundary
- This made the system harder to reason about and contributed to runtime instability when CH390 accesses happened from different code paths with different assumptions.
### Refactor Outcome
- Added a single runtime owner module: `Drivers/CH390/ch390_runtime.c` + `Drivers/CH390/ch390_runtime.h`.
- After this change:
- `CH390_Interface.c` remains the **only** SPI transport implementation
- `CH390.c` remains the chip-level helper layer
- `ch390_runtime.c` is now the **only runtime owner** of CH390 transactions after boot
- `ethernetif.c` delegates runtime TX/RX/link/IRQ servicing to `ch390_runtime`
- `main.c` no longer performs direct CH390 register reads; boot diagnostics use `ch390_runtime_get_diag()`
- `EXTI0_IRQHandler()` only posts the IRQ-pending event into the runtime owner and does not touch CH390 directly
### Behavior After Refactor
- Build passed with `0 error`, `0 warning`.
- The system remained stable in the post-refactor runtime window:
- no new trap output
- heartbeat/timer activity continued
- previous runtime freeze did not reproduce in the observed window
### CH390 Result After Refactor
- The CH390 did **not** come up successfully.
- However, the failure signature became cleaner and more trustworthy:
```text
CH390 VID=0xFFFF PID=0xFFFF REV=0xFF NSR=0xFF LINK=1
CH390 NCR=0xFF RCR=0xFF IMR=0xFF INTCR=0xFF GPR=0xFF ISR=0xFF
```
- This is materially different from the earlier unstable mixture of:
- all-zero reads
- intermittent hangs
- transaction artifacts
- watchdog-related HardFaults
### Trusted Interpretation Of Current Failure
- With the SPI access model cleaned up and the system remaining stable, the current CH390 failure can now be treated as a **credible transport-level non-response** rather than a concurrency artifact.
- A uniform `0xFF` readback across identity and status/control registers strongly suggests one of these conditions:
- CH390 still does not actively drive MISO during the register-read phase
- CS reaches the MCU logic but is not effectively selecting the CH390 device on the board side
- the CH390 digital core is not entering a valid SPI-responding state after reset even though the MCU-side sequence now looks consistent
### Practical Conclusion
- The architectural decoupling requirement is complete.
- The runtime stability requirement is complete.
- CH390 connection is **still failed**, but the reason is now narrowed to a believable low-level bus/device-response problem rather than a software ownership/concurrency problem.