TCP2UART/uart-ch390-debug-handoff.md

# UART CH390 Debug Handoff

## 2026-03-31 Config UART Test Session

### Goal

- Exhaustively test the `USART1` config command interface.
- Verify that Flash-backed parameters survive `AT+SAVE` plus reset.
- Record concrete bench procedure, failures, fixes, and evidence paths.

### Bench Baseline

- Workspace: `D:\code\STM32Project\TCP2UART`
- Target MCU: `STM32F103R8T6`
- Config UART: `USART1` on `PA9/PA10`
- Host-visible COM ports during this session: `COM1`, `COM9`
- Debug probe visible during this session: `STLink V2`
- Flash parameter page: `0x0800FC00`
- Firmware image used for test bring-up: `MDK-ARM\TCP2UART\TCP2UART.axf`

### Source-of-Truth Command Surface

From `App/config.c`, the tested command surface is:

- `AT`
- `AT+?`
- `AT+QUERY`
- `AT+SAVE`
- `AT+RESET`
- `AT+DEFAULT`
- `AT+IP=...`
- `AT+MASK=...`
- `AT+GW=...`
- `AT+RIP=...`
- `AT+MAC=...`
- `AT+PORT=...`
- `AT+RPORT=...`
- `AT+BAUD1=...`
- `AT+BAUD2=...`
- `AT+DHCP=0/1`

### Test Procedure

1. Flash the current `axf` image with `probe-rs download --chip STM32F103R8`.
2. Connect to the config UART at `115200 8N1`.
3. Run `python tools/uart_config_test.py --port COM9 --scenario inventory`.
4. Run `python tools/uart_config_test.py --port COM9 --scenario persistence`.
5. Read back Flash words at `0x0800FC00` and compare them before and after reset.
6. Save all transcripts into `artifacts/uart-config/`.

### Important Expectations

- Setter commands update RAM immediately and return `OK` plus the reboot hint.
- Persistence is not proven until the sequence `set -> query -> save -> reset -> query` passes.
- `AT+DEFAULT` only resets RAM state and still requires `AT+SAVE` to persist.
- `AT+DHCP=1` must fail in this build by design.

### Session Findings

- `probe-rs download` succeeded against `STM32F103R8`.
- `pyserial` had to be installed on the host before scripted UART testing.
- The requested handoff filename did not exist in the repo, so this file was created as the dedicated config-UART handoff log.
- Raw command transcripts and Flash comparisons should be attached from `artifacts/uart-config/` for any future regression analysis.

### 2026-03-31 Live Test Evidence

- Host-side strict inventory run on `COM9` failed with `non_empty_responses = 0`.
- Artifact: `artifacts/uart-config/inventory-20260331-185333.json`
- Direct host probe on `COM9` with `AT\r\n` returned `b''`.
- Direct host probe on `COM1` with `AT\r\n` also returned `b''`.
- After the script was corrected, the strict persistence run was deliberately rejected as invalid because every command response was empty.
- Flash page `0x0800FC00` remained all `0xFFFFFFFF`, proving `AT+SAVE` never actually executed on target.
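
The erased-page observation can be checked mechanically once the page has been read back. The following is a minimal Python sketch, assuming the page contents are already available as a list of 32-bit words; how they are read from the target is outside the sketch, and `page_is_erased` and `diff_words` are hypothetical helper names, not part of the existing tooling.

```python
ERASED_WORD = 0xFFFFFFFF
PAGE_BASE = 0x0800FC00  # Flash parameter page from the bench baseline


def page_is_erased(words):
    """Return True when every 32-bit word still holds the erased pattern."""
    return all(w == ERASED_WORD for w in words)


def diff_words(before, after, base=PAGE_BASE):
    """Return (address, before, after) for every word that changed."""
    if len(before) != len(after):
        raise ValueError("snapshots must cover the same number of words")
    return [
        (base + 4 * i, b, a)
        for i, (b, a) in enumerate(zip(before, after))
        if b != a
    ]
```

An erased page only shows that nothing was written. It does not say whether the write was lost in the UART path, the parser, or the Flash driver.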

### Board/Debugger Findings That Narrow The Fault

- The target firmware is present in Flash and `probe-rs download` completes successfully.
- After a clean reset and short run window, the MCU executes from Flash and initializes clocks and `USART1` registers.
- `USART1` register block was observed in an initialized state after boot, not all-zero.
- Firmware was patched to explicitly start `HAL_UART_Receive_IT(&huart1, &g_uart1_rx_probe_byte, 1)` during `App_Init()`.
- Even after that fix, repeated `AT` frames from the host produced no visible UART response.
- Debug readout of software-side config RX state showed:
  - `g_pending_cmd_ready = 0`
  - `g_pending_cmd_len = 0`
  - `g_uart_cmd_len = 0`
  - `g_uart_cmd_buffer` remained zero-filled
  - `g_pending_cmd_buffer` remained zero-filled
- This means the config parser never received any bytes from the live UART path during the test window.

### Current Best Conclusion

- The immediate blocker is no longer the parser itself.
- Current evidence points to a board-level `USART1_RX` path problem or a wrong host wiring/port assumption: the firmware is alive and `USART1` is initialized, yet no command bytes enter `config_uart_rx_byte()`.
- Until the physical config-UART path is proven, it is not meaningful to claim that every config command or Flash persistence path has passed on real hardware.

### Corrected Final Conclusion

- The earlier "no response" conclusion was wrong: the real cause was that the host was sending `\r\n`-terminated commands.
- On the real bench setup, this firmware completes a config command frame only when the command is terminated by `\n` alone.
- After switching the host sender to `AT...\n`, `COM9` immediately returned `OK\r\n` for `AT` and the complete config command set became testable.
- Therefore the key bench rule is: every config command must end with `\n`.
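
The bench rule is easy to enforce on the host side by normalizing the terminator before every send. This is a minimal sketch; `frame_command` is a hypothetical helper name, and ASCII-only commands are an assumption based on the command surface listed above.

```python
def frame_command(command: str) -> bytes:
    """Strip any CR/LF the caller added and terminate with a single LF."""
    body = command.rstrip("\r\n")
    if not body:
        raise ValueError("empty command")
    # Encode as ASCII so a stray non-ASCII character fails loudly on the host.
    return body.encode("ascii") + b"\n"
```

With pyserial, the returned bytes would be passed straight to `Serial.write()`, so a caller that habitually appends `\r\n` can no longer produce the dead-UART symptom.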

### Final Verified Results

- `AT` returns `OK`.
- `AT+?` and `AT+QUERY` return the full current configuration snapshot.
- Setter commands `AT+IP`, `AT+MASK`, `AT+GW`, `AT+RIP`, `AT+MAC`, `AT+PORT`, `AT+RPORT`, `AT+BAUD1`, `AT+BAUD2`, `AT+DHCP=0` all return `OK` plus the reboot hint.
- `AT+SAVE` returns `OK: Configuration saved`.
- `AT+RESET` returns `OK: Resetting...` and the board comes back responding to `AT`.
- `AT+DEFAULT` returns `OK: Defaults restored`.
- Negative cases were verified:
  - `AT+UNKNOWN` -> `ERROR: Unknown command`
  - `AT+PORT=0` / `AT+PORT=65536` -> `ERROR: Invalid port`
  - `AT+BAUD1=1199` / `AT+BAUD1=921601` -> `ERROR: Invalid baudrate`
  - `AT+DHCP=1` -> `ERROR: DHCP disabled in this build`
  - `AT+IP=999.1.1.1` -> `ERROR: Invalid IP format`
  - `AT+MAC=GG:11:22:33:44:55` -> `ERROR: Invalid MAC format`
- Non-AT input `BT` produced no response, which matches the parser gate.
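
The side-effect-free part of these results lends itself to a table-driven regression check. The sketch below assumes a caller-supplied `send(command)` callable that returns the decoded response text and an empty string when the target stays silent. `EXPECTED` and `check_responses` are hypothetical names. The table deliberately omits `AT+SAVE`, `AT+RESET`, `AT+DEFAULT`, and the valid setters, because those change target state.

```python
# Command -> expected response prefix, copied from the verified results.
EXPECTED = {
    "AT": "OK",
    "AT+UNKNOWN": "ERROR: Unknown command",
    "AT+PORT=0": "ERROR: Invalid port",
    "AT+PORT=65536": "ERROR: Invalid port",
    "AT+BAUD1=1199": "ERROR: Invalid baudrate",
    "AT+BAUD1=921601": "ERROR: Invalid baudrate",
    "AT+DHCP=1": "ERROR: DHCP disabled in this build",
    "AT+IP=999.1.1.1": "ERROR: Invalid IP format",
    "AT+MAC=GG:11:22:33:44:55": "ERROR: Invalid MAC format",
    "BT": "",  # non-AT input must stay silent
}


def check_responses(send, expected=EXPECTED):
    """Send each command and return (command, wanted, got) for every mismatch."""
    failures = []
    for command, want in expected.items():
        got = send(command).strip()
        # An empty expectation means the target must not answer at all.
        ok = (got == "") if want == "" else got.startswith(want)
        if not ok:
            failures.append((command, want, got))
    return failures
```

An empty return list means every listed command matched. A dead link fails on the very first entry, because `AT` must answer `OK`.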

### Final Flash Persistence Evidence

- Tested persisted values:
  - `IP=192.168.1.123`
  - `MASK=255.255.255.0`
  - `GW=192.168.1.1`
  - `RIP=192.168.1.201`
  - `MAC=02:12:34:56:78:9A`
  - `PORT=10001`
  - `RPORT=10002`
  - `BAUD1=57600`
  - `BAUD2=38400`
- Sequence used:
  1. set values with `\n`-terminated AT commands
  2. query with `AT+?`
  3. `AT+SAVE`
  4. `AT+RESET`
  5. query again with `AT+?`
  6. read raw Flash words at `0x0800FC00`
- Query values before and after reset matched exactly.
- Raw Flash read before and after reset also matched exactly.
- Factory default restoration was also proven with `AT+DEFAULT -> AT+SAVE -> AT+RESET -> AT+?`.
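
Steps 1 to 5 of the sequence can be sketched as one host-side routine. This is a minimal sketch rather than the contents of `tools/uart_config_test.py`. `check_persistence`, `send`, and `settle` are hypothetical names. `send(command)` is assumed to return the decoded response text. The snapshots are compared as raw text so that no assumption about the `AT+?` output format is needed.

```python
def check_persistence(send, setters, settle=None):
    """Run set -> query -> save -> reset -> query and compare snapshots.

    `settle` is an optional callable invoked after AT+RESET to wait for
    the reboot. Returns (passed, before, after).
    """
    for command in setters:
        reply = send(command)
        if not reply.startswith("OK"):
            raise RuntimeError(f"{command!r} rejected: {reply!r}")
    before = send("AT+?")
    if not send("AT+SAVE").startswith("OK"):
        raise RuntimeError("AT+SAVE was not accepted")
    send("AT+RESET")
    if settle is not None:
        settle()
    after = send("AT+?")
    # An empty snapshot proves nothing, so it never counts as a pass.
    passed = bool(before.strip()) and before == after
    return passed, before, after
```

On the bench, `settle` could be a short sleep or a loop that waits for `AT` to answer again; the required delay is not recorded here. The raw Flash read in step 6 remains a separate check.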

### Evidence Files

- Inventory transcript: `artifacts/uart-config/inventory-20260331-185752.json`
- Inventory raw text: `artifacts/uart-config/inventory-20260331-185752.txt`
- Persistence transcript: `artifacts/uart-config/persistence-20260331-190039.json`
- Persistence raw text: `artifacts/uart-config/persistence-20260331-190039.txt`

### Firmware Adjustment Made During This Session

- Added explicit `HAL_UART_Receive_IT(&huart1, &g_uart1_rx_probe_byte, 1u)` arming in `App_Init()` so the `USART1` interrupt receive path is definitely started after boot.
- This is a safe, minimal bring-up fix and should remain in place.

### Practical Lessons

- Do not treat a script exit code alone as proof of UART success; require at least one non-empty response in the captured transcript.
- Do not treat `probe-rs read 0x0800FC00` returning all `0xFFFFFFFF` as a flash-driver failure until you first prove that `AT+SAVE` was actually accepted by the parser.
- In this project, the fastest truth test is:
  1. prove target is executing
  2. prove `USART1` is initialized
  3. prove bytes reach `config_uart_rx_byte`
  4. only then evaluate parser responses and flash persistence
- Most important bench lesson: if the config UART appears dead, first retry with commands ending in `\n` instead of `\r\n`.
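
The first lesson can be turned into a hard gate in the test script. This is a minimal sketch that assumes the transcript is available as `(command, response_bytes)` pairs. `summarize_transcript` is a hypothetical name, and the `non_empty_responses` key mirrors the field reported by the failed inventory run.

```python
def summarize_transcript(exchanges):
    """Summarize (command, response_bytes) pairs captured from the UART."""
    total = len(exchanges)
    # Whitespace-only replies (for example a lone CR/LF) count as empty.
    non_empty = sum(1 for _, response in exchanges if response.strip())
    return {
        "commands": total,
        "non_empty_responses": non_empty,
        # A run with zero replies is a dead link, not a passed test.
        "valid": total > 0 and non_empty > 0,
    }
```

A script would exit non-zero whenever `valid` is false, so an all-empty run can no longer be mistaken for a pass.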

### Open Items

- `COM9` answered `AT` with `OK\r\n` once commands ended in `\n`, so it is the working config port on this bench; re-confirm the port by live command-response evidence on any other setup.
- If command-response is unstable, inspect whether host wiring/USB-UART level shifting is the cause before changing parser logic.
- If persistence fails after a clean `AT+SAVE`, inspect `App/flash_param.c` and raw Flash contents at `0x0800FC00` before changing higher-level config logic.

## 2026-03-31 CH390D Bring-up Debug Session

### Goal

- Determine why `CH390D` does not return valid register values during boot.
- Find a software-side root cause if one exists and attempt a minimal fix.

### Baseline Symptom

- MCU boots normally and RTT works.
- CH390 boot diagnostics originally reported:

```text
TCP2UART boot
CH390 VID=0x0000 PID=0x0000 REV=0x00 NSR=0x00 LINK=0
CH390 NCR=0x00 RCR=0x00 IMR=0x00 INTCR=0x00 GPR=0x00 ISR=0x00
CH390 WRCHK NCR:0x00->0x00 INTCR:0x00->0x00
```

- This showed that CH390 register reads and write-back checks were not producing valid values.
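
When comparing boot logs across firmware variants, it helps to turn these RTT lines into numbers. The following is a minimal Python sketch; `parse_diag_line` is a hypothetical helper, and it assumes the `NAME=value` layout shown in the dump above. The `WRCHK` line uses a different layout and is intentionally not handled.

```python
import re

# Matches NAME=0x... (hex) or NAME=<decimal>, as in "VID=0x0000" or "LINK=0".
_FIELD = re.compile(r"(\w+)=(0x[0-9A-Fa-f]+|\d+)")


def parse_diag_line(line):
    """Turn 'CH390 VID=0x0000 ... LINK=0' into {'VID': 0, ..., 'LINK': 0}."""
    fields = {}
    for name, value in _FIELD.findall(line):
        fields[name] = int(value, 16) if value.startswith("0x") else int(value)
    return fields
```

Two captured boots can then be compared field by field instead of by eye.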

### Board-Side Evidence Already Collected

- `RST` line was observed released high.
- `CS` line idle state was high.
- `INT` line was observed low and mapped to EXTI.
- `SPI1` was enabled and configured for `Mode 3` in the active firmware.
- These observations did not by themselves restore valid CH390 responses.

### What Was Tried

1. Added richer RTT startup diagnostics in `BootDiag_ReportCh390()`.
2. Lowered `SPI1` speed from `/8` to `/64`.
3. Added stage markers around `low_level_init()` to localize the hang.
4. Step-debugged and breakpoint-debugged `ch390_default_config()` and `ch390_write_phy()`.
5. Added timeout protection to `ch390_read_phy()` / `ch390_write_phy()` so `EPCR` polling cannot hang forever.
6. Temporarily skipped `ch390_set_phy_mode(CH390_AUTO)` to isolate non-PHY register access.
7. Compared current driver against `Reference/EVT/EXAM/PUB/CH390.c` and `Reference/EVT/EXAM/PUB/CH390_Interface.c`.
8. Tried multiple SPI register transaction shapes:
   - original two-byte exchange style
   - split `Transmit` then read phase
   - explicit dummy-byte read phase
   - single-frame two-byte full-duplex read
9. Scanned all four SPI modes (`mode0`..`mode3`) during startup.
10. Added small `CS` setup/hold delays.
11. Increased hardware reset release wait to `50ms`.
12. Restored EVT-style init order and PHY setup path to see whether EVT sequence alone fixes the problem.

### Key Intermediate Findings

- Lowering SPI speed changed behavior, but did not recover valid CH390 IDs.
- Stage markers showed that low-speed SPI could stall during `ETH init: default`.
- Step/RTT evidence localized the original stall to PHY access during `ch390_default_config()`.
- The PHY access loop in `ch390_read_phy()` / `ch390_write_phy()` had no timeout and could hang indefinitely. This is a real software bug and should stay fixed.
- After adding PHY timeouts and temporarily skipping PHY setup, the init path completed, but all CH390 reads became `0xFF` rather than valid IDs.
- SPI mode scan result under that condition was:

```text
CH390 SPI mode0 [FF FF FF FF FF]
CH390 SPI mode1 [FF FF FF FF FF]
CH390 SPI mode2 [FF FF FF FF FF]
CH390 SPI mode3 [FF FF FF FF FF]
```

- This ruled out a simple `CPOL/CPHA` mismatch.
- External code comparison did not reveal an `opcode` or register-address mismatch. Public CH390 implementations use the same `OPC_REG_R=0x00`, `OPC_REG_W=0x80`, and the same register map.
- One experimental split transaction path produced repeatable but obviously bogus values like `0x03`, `0xAC`, `0xAE`, which strongly suggests transaction artifacts rather than real CH390 data.
- A debug read of `SPI1->SR` showed `OVR=1` during one of the experimental transaction variants, indicating the SPI transaction layer was not trustworthy in that configuration.

### EVT Comparison Outcome

- `Reference/EVT` is useful as a baseline, but it is not a drop-in fix for this project.
- The broad init order in the live project already matches EVT closely through the lwIP glue path.
- The most important EVT-specific difference is that EVT performs `ch390_set_phy_mode(CH390_AUTO)` at the start of `ch390_default_config()`.
- Restoring the EVT-style `PHY` setup path in this project caused boot to hang again at:

```text
TCP2UART boot
ETH init: gpio
ETH init: spi
ETH init: reset
ETH init: default
```

- That confirms the `PHY` path is a real trigger for the hang, but EVT order alone does not solve the underlying communication problem.

### Current Best Technical Conclusion

- A real software defect was found and fixed: `EPCR` polling in PHY access had no timeout.
- That fix prevents the firmware from hanging forever, but it does **not** restore valid CH390 register communication.
- The core unresolved problem remains: the SPI register-access path still does not yield believable CH390 register data.
- At this point, the following common explanations and remedies have already been tested, and none of them resolves the problem on its own:
  - SPI mode selection
  - adding dummy bytes
  - `CS` setup/hold delays
  - changing reset wait from `10ms` to `50ms`
  - reverting to EVT transaction style
  - restoring EVT initialization order
  - public `opcode` / register-map mismatch
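
The `EPCR` timeout fix follows the usual bounded-poll pattern. The firmware itself is C; the sketch below only models the control flow in Python to match the host tooling. Every name and value in it is an assumption (`poll_until_idle`, `read_epcr`, the busy mask `0x01`, and the poll budget), not taken from the driver.

```python
def poll_until_idle(read_epcr, busy_mask=0x01, max_polls=1000):
    """Poll a status register until the busy bit clears or the budget runs out.

    Returns the number of polls used on success, or None on timeout, so the
    caller can report a PHY access failure instead of spinning forever.
    """
    for attempt in range(1, max_polls + 1):
        if (read_epcr() & busy_mask) == 0:
            return attempt
    return None
```

If the register never reports idle, the function returns `None` after `max_polls` reads, and the failure can be reported instead of hanging in the loop.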

### Recommended Next Debug Step

- The next high-value experiment is a temporary GPIO bit-bang read of `VIDL/VIDH/CHIPR` with a fully controlled continuous command+clock sequence.
- If bit-bang returns valid IDs while HAL-SPI paths do not, the remaining fault is in the SPI transaction implementation rather than CH390 higher-level init order.
- If bit-bang still returns invalid data, the investigation must move back to board-level bus behavior even if static continuity checks look correct.

### Additional 2026-03-31 Finding: HAL SPI Re-init And Bit-Bang Side Effects

- A valid concern was raised about calling `HAL_SPI_Init()` after temporarily changing SPI pins to GPIO mode.
- Code review of `stm32f1xx_hal_spi.c` showed that `HAL_SPI_MspInit()` only runs when `hspi->State == HAL_SPI_STATE_RESET`.
- Therefore, simply calling `HAL_SPI_Init()` after bit-bang mode does **not** automatically restore `PA5/PA7` to SPI alternate-function output mode.
- This was a real software-side risk in the temporary bit-bang probe and was corrected by explicitly restoring:
  - `PA5` -> `AF_PP`
  - `PA7` -> `AF_PP`
  - `PA6` -> input
  before calling `HAL_SPI_Init()` again.
- After that correction, the observed behavior changed again: boot output stopped at `ETH init: reset`, and a short halt showed execution inside `HAL_SPI_TransmitReceive()` called from the CH390 SPI exchange path.
- This means the earlier bit-bang experiments could have polluted later SPI results, but after the GPIO restore fix, the active blocker is again a live SPI transaction stall rather than a missing-GPIO-restore artifact.

### Additional 2026-03-31 Finding: Reset Exists In Runtime Path

- The project does **not** lack a CH390 reset process.
- The actual runtime order is:
  1. `App_Init()`
  2. `lwip_netif_init()`
  3. `ethernetif_init()`
  4. `low_level_init()`
  5. `ch390_gpio_init()`
  6. `ch390_spi_init()`
  7. `ch390_hardware_reset()`
  8. `ch390_default_config()`
- The reset process is therefore present and executed, but it lives in the lwIP/netif bring-up path instead of being written inline in `main.c` as in the EVT sample.
- The current unresolved problem is not "missing reset"; it is that SPI transactions after reset still do not produce valid CH390 register responses.

## 2026-03-31 Manual Reset Sensitivity Analysis

### Observed Symptom

- An extra `ch390_hardware_reset()` was temporarily inserted into `App_Init()` before `lwip_init()`.
- With that extra reset in place, a manual board reset could lead to the firmware appearing stuck and the LED heartbeat not behaving normally.
- The same image could still look more normal when observed after a `probe-rs` flash-and-run cycle.

### Code-Level Finding

- The inserted extra reset sat here in `Core/Src/main.c`:

```c
SEGGER_RTT_Init();
SEGGER_RTT_WriteString(0, "\r\nTCP2UART boot\r\n");
/* ... */
ch390_hardware_reset();   /* temporary extra reset, redundant with the driver-owned reset */
lwip_init();
lwip_netif_init(...);
```

- But the normal bring-up path already performs a reset later inside `ch390_runtime_init()` / `low_level_init()` before `ch390_default_config()`.
- That means the temporary line created a redundant early reset in a different initialization phase than the normal driver-owned reset.

### Interpretation

- This pattern is much more consistent with a reset-sequencing / startup-state issue than with compiler optimization level.
- The Keil target uses one fixed optimization configuration, so a plain manual reset does not change code generation.
- In contrast, an extra CH390 reset inserted before lwIP and before the normal CH390 runtime init can alter the device startup state and timing relationship between the MCU and CH390.

### Action Taken

- The extra `ch390_hardware_reset()` in `App_Init()` was removed.
- The firmware now relies only on the standard driver-owned reset inside the CH390 runtime initialization path.

### Conclusion

- The temporary extra reset was not kept.
- The strongest software-side conclusion is that the manual-reset sensitivity was most likely caused by redundant reset sequencing rather than by optimization level.

## 2026-03-31 HardFault Root Cause And Fix

### Symptom

- After CH390 bring-up completed and boot diagnostics printed, the firmware entered:

```text
TRAP: HardFault_Handler
```

- At the same time, `PC13` stopped blinking, which originally looked like a timer or LED problem.

### Fault Evidence

- Fault-status registers showed a real fault rather than a normal busy wait.
- The trap location was `Debug_TrapWithRttHint()` in `Core/Src/main.c`.
- The stacked fault frame pointed back into the normal runtime path rather than the trap itself.
- `TIM4` was configured and had already advanced `g_led_blink_ticks`, so the LED path was alive before the fault.

### Root Cause

- `MX_IWDG_Init()` had been temporarily commented out in `main()`.
- However, `App_Poll()` still executed:

```c
HAL_IWDG_Refresh(&hiwdg);
```

- Because `hiwdg` was never initialized, its `Instance` field did not point at the `IWDG` peripheral, so this call wrote through an invalid handle and led to the observed fault path.

### Fix Applied

- `Core/Src/main.c` was changed so watchdog refresh only runs when `hiwdg.Instance == IWDG`.
- This preserves normal behavior when IWDG is enabled, while avoiding invalid access when IWDG init is intentionally disabled for debugging.

### Verification

- Rebuilt successfully with `0 error`, `1 warning`.
- Reflashed target and reran startup.
- Boot RTT still showed CH390 diagnostics, but no longer showed `TRAP: HardFault_Handler`.
- A 5-second runtime window completed without a new trap.
- `g_led_blink_ticks` continued advancing after the fix, confirming that `TIM4` interrupts and the LED heartbeat path were alive again.

### Conclusion

- The HardFault was caused by refreshing an uninitialized IWDG handle, not by the CH390 SPI path itself.
- This issue is fixed.
- CH390 bring-up is still unresolved at the register-communication level, but the main task is again able to continue running normally.

## 2026-03-31 Runtime Freeze Root Cause And Fix

### Symptom

- After re-soldering CH390D, the system could boot and print the normal CH390 startup diagnostics.
- However, after running for a while, the device would appear to freeze.
- In that state, the LED heartbeat behavior became unreliable and the system appeared to stop making useful progress.

### Key Runtime Evidence

- The new freeze was **not** another HardFault: no new `TRAP:` line appeared during the freeze window.
- `g_led_blink_ticks` continued advancing during observation windows, proving that `TIM4` interrupts were still alive and the MCU was not fully dead.
- A short halt during the bad behavior repeatedly landed in `HAL_SPI_TransmitReceive()`.
- Code inspection showed that CH390 runtime paths in `ethernetif.c` were executing blocking SPI transactions while global interrupts were disabled via `ethernetif_lock()`.

### Root Cause

- `low_level_output()`, `low_level_input()`, and `ethernetif_check_link()` in `Drivers/LwIP/src/netif/ethernetif.c` wrapped CH390 SPI register/memory accesses inside `ethernetif_lock()` / `ethernetif_unlock()`.
- Those helpers globally disable interrupts by manipulating `PRIMASK`.
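- The CH390 access path uses blocking HAL SPI functions and timeout logic based on `HAL_GetTick()`.
- With `PRIMASK` set, the tick interrupt cannot run, so `HAL_GetTick()` stops advancing and those timeouts can never expire. Running the blocking accesses with interrupts disabled can therefore stall or livelock the runtime path, especially after startup when network polling begins.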

### Fix Applied

- Reduced the interrupt-masked critical sections in `ethernetif.c` to only protect the shared IRQ-pending flag.
- Removed `ethernetif_lock()` coverage from the long CH390 SPI transaction paths in:
  - `low_level_output()`
  - `low_level_input()`
  - `ethernetif_check_link()`
- In `ethernetif_poll()`, only the `g_ch390_irq_pending` flag is now cleared under the short critical section; the actual CH390 register access happens with interrupts enabled.

### Verification

- Rebuilt successfully with `0 error`, `0 warning`.
- Reflashed and reran the target.
- Boot RTT still completed normally through:

```text
TCP2UART boot
ETH init: gpio
ETH init: spi
ETH init: reset
ETH init: default
ETH init: mac
ETH init: getmac
ETH init: irq
ETH init: done
CH390 VID=0x0000 PID=0x0000 REV=0x00 NSR=0x00 LINK=0
CH390 NCR=0x00 RCR=0x00 IMR=0x00 INTCR=0x00 GPR=0x00 ISR=0x00
```

- No new `TRAP:` message appeared during extended runtime observation.
- `g_led_blink_ticks` continued advancing over multiple samples, indicating that the heartbeat timer and interrupt delivery remained active.
- The system no longer reproduced the earlier "runs for a while then appears frozen" behavior in the observed validation window.

### Conclusion

- This freeze was caused by doing blocking CH390 SPI operations inside a global interrupt-disabled critical section.
- The runtime freeze is fixed.
- CH390 register communication is still invalid (`0x0000` ID values), but that is now a separate communication/bring-up problem rather than the cause of the observed runtime stall.

## 2026-03-31 SPI Ownership Decoupling And CH390 Current Status

### Why This Refactor Was Done

- The project previously allowed multiple runtime layers to reach down into CH390/SPI behavior directly:
  - `ethernetif.c` handled init, IRQ-driven poll service, RX/TX transactions, and link checks
  - `main.c` directly read CH390 registers for boot diagnostics
  - the CH390 low-level SPI transport sat underneath those callers with no single runtime owner boundary
- This made the system harder to reason about and contributed to runtime instability when CH390 accesses happened from different code paths with different assumptions.

### Refactor Outcome

- Added a single runtime owner module: `Drivers/CH390/ch390_runtime.c` + `Drivers/CH390/ch390_runtime.h`.
- After this change:
  - `CH390_Interface.c` remains the **only** SPI transport implementation
  - `CH390.c` remains the chip-level helper layer
  - `ch390_runtime.c` is now the **only runtime owner** of CH390 transactions after boot
  - `ethernetif.c` delegates runtime TX/RX/link/IRQ servicing to `ch390_runtime`
  - `main.c` no longer performs direct CH390 register reads; boot diagnostics use `ch390_runtime_get_diag()`
  - `EXTI0_IRQHandler()` only posts the IRQ-pending event into the runtime owner and does not touch CH390 directly

### Behavior After Refactor

- Build passed with `0 error`, `0 warning`.
- The system remained stable in the post-refactor runtime window:
  - no new trap output
  - heartbeat/timer activity continued
  - previous runtime freeze did not reproduce in the observed window

### CH390 Result After Refactor

- The CH390 did **not** come up successfully.
- However, the failure signature became cleaner and more trustworthy:

```text
CH390 VID=0xFFFF PID=0xFFFF REV=0xFF NSR=0xFF LINK=1
CH390 NCR=0xFF RCR=0xFF IMR=0xFF INTCR=0xFF GPR=0xFF ISR=0xFF
```

- This is materially different from the earlier unstable mixture of:
  - all-zero reads
  - intermittent hangs
  - transaction artifacts
  - watchdog-related HardFaults

### Trusted Interpretation Of Current Failure

- With the SPI access model cleaned up and the system remaining stable, the current CH390 failure can now be treated as a **credible transport-level non-response** rather than a concurrency artifact.
- A uniform `0xFF` readback across identity and status/control registers strongly suggests one of these conditions:
  - CH390 still does not actively drive MISO during the register-read phase
  - CS reaches the MCU logic but is not effectively selecting the CH390 device on the board side
  - the CH390 digital core is not entering a valid SPI-responding state after reset even though the MCU-side sequence now looks consistent
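
A small host-side helper makes this signature easy to flag when scanning captured register dumps. This is a minimal sketch; `classify_readback` is a hypothetical name, and it deliberately labels the pattern only, leaving the choice among the conditions above to bench measurements.

```python
def classify_readback(values):
    """Label a list of register bytes read back over SPI."""
    if not values:
        raise ValueError("no register values supplied")
    if all(v == 0xFF for v in values):
        return "uniform 0xFF"  # consistent with nothing driving MISO
    if all(v == 0x00 for v in values):
        return "uniform 0x00"
    return "mixed"
```

Applied to the earlier SPI mode scan, all four modes give `"uniform 0xFF"`, while the experimental split-transaction values `0x03`, `0xAC`, `0xAE` come back as `"mixed"`.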

### Practical Conclusion

- The architectural decoupling requirement is complete.
- The runtime stability requirement is complete.
- The CH390 connection is **still failing**, but the reason is now narrowed to a believable low-level bus/device-response problem rather than a software ownership/concurrency problem.