gaoro-xiao 14a532290d refactor: serialize CH390 runtime SPI access

Move runtime CH390 transactions behind a single ch390_runtime owner so main, lwIP glue, and EXTI no longer compete for SPI access. Keep the system stable under runtime load and capture the remaining CH390 readback failure as a credible low-level device-response issue in the handoff logs.

2026-04-01 03:39:08 +08:00


UART CH390 Debug Handoff

2026-03-31 Config UART Test Session

Goal

Exhaustively test the USART1 config command interface.
Verify that Flash-backed parameters survive AT+SAVE plus reset.
Record concrete bench procedure, failures, fixes, and evidence paths.

Bench Baseline

Workspace: D:\code\STM32Project\TCP2UART
Target MCU: STM32F103R8T6
Config UART: USART1 on PA9/PA10
Host visible COM ports during this session: COM1, COM9
Debug probe visible during this session: STLink V2
Flash parameter page: 0x0800FC00
Firmware image used for test bring-up: MDK-ARM\TCP2UART\TCP2UART.axf

Source-of-Truth Command Surface

From App/config.c, the tested command surface is:

AT
AT+?
AT+QUERY
AT+SAVE
AT+RESET
AT+DEFAULT
AT+IP=...
AT+MASK=...
AT+GW=...
AT+RIP=...
AT+MAC=...
AT+PORT=...
AT+RPORT=...
AT+BAUD1=...
AT+BAUD2=...
AT+DHCP=0/1

Test Procedure

Flash the current axf image with probe-rs download --chip STM32F103R8.
Connect to the config UART at 115200 8N1.
Run python tools/uart_config_test.py --port COM9 --scenario inventory.
Run python tools/uart_config_test.py --port COM9 --scenario persistence.
Read back Flash words at 0x0800FC00 and compare them before and after reset.
Save all transcripts into artifacts/uart-config/.

Important Expectations

Setter commands update RAM immediately and return OK plus the reboot hint.
Persistence is not proven until the sequence set -> query -> save -> reset -> query passes.
AT+DEFAULT only resets RAM state and still requires AT+SAVE to persist.
AT+DHCP=1 must fail in this build by design.

Session Findings

probe-rs download succeeded against STM32F103R8.
pyserial had to be installed on the host before scripted UART testing.
The requested handoff filename did not exist in the repo, so this file was created as the dedicated config-UART handoff log.
Raw command transcripts and Flash comparisons should be attached from artifacts/uart-config/ for any future regression analysis.

2026-03-31 Live Test Evidence

Host-side strict inventory run on COM9 failed with non_empty_responses = 0.
Artifact: artifacts/uart-config/inventory-20260331-185333.json
Direct host probe on COM9 with AT\r\n returned b''.
Direct host probe on COM1 with AT\r\n also returned b''.
Strict persistence run was intentionally not accepted as valid after the script was corrected, because all command responses were empty.
Flash page 0x0800FC00 remained all 0xFFFFFFFF, which is consistent with AT+SAVE never having executed on the target.

Board/Debugger Findings That Narrow The Fault

The target firmware is present in Flash and probe-rs download completes successfully.
After a clean reset and short run window, the MCU executes from Flash and initializes clocks and USART1 registers.
USART1 register block was observed in an initialized state after boot, not all-zero.
Firmware was patched to explicitly start HAL_UART_Receive_IT(&huart1, &g_uart1_rx_probe_byte, 1) during App_Init().
Even after that fix, repeated AT frames from the host produced no visible UART response.
Debug readout of software-side config RX state showed:
- g_pending_cmd_ready = 0
- g_pending_cmd_len = 0
- g_uart_cmd_len = 0
- g_uart_cmd_buffer remained zero-filled
- g_pending_cmd_buffer remained zero-filled
This means the config parser never received any bytes from the live UART path during the test window.

Current Best Conclusion

The immediate blocker is no longer the parser itself.
Current evidence points to a board-level USART1_RX path problem or a wrong host wiring/port assumption: the firmware is alive and USART1 is initialized, yet no command bytes reach config_uart_rx_byte().
Until the physical config-UART path is proven, it is not meaningful to claim that every config command or Flash persistence path has passed on real hardware.

Corrected Final Conclusion

The earlier "no response" conclusion was wrong: the host was sending \r\n-terminated commands.
On the real bench setup, this firmware completes a config command frame only when the command is terminated by a bare \n.
After switching the host sender to AT...\n, COM9 immediately returned OK\r\n for AT and the complete config command set became testable.
Therefore the key bench rule is: every config command must end with \n.
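The bench rule above can be enforced on the host side before any bytes are sent. The following is a minimal sketch, not code from tools/uart_config_test.py; build_config_frame is a hypothetical helper name, and it assumes the only required change is replacing whatever line ending the caller supplied with a single \n.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side helper (not part of the project sources).
 * Copies `cmd` into `out`, drops any trailing CR/LF the caller supplied,
 * and appends exactly one '\n'.
 * Returns the frame length, or 0 if `out` is too small. */
size_t build_config_frame(const char *cmd, char *out, size_t cap)
{
    size_t len = strlen(cmd);

    /* Strip trailing '\r' / '\n' so "AT\r\n" and "AT" both become "AT\n". */
    while (len > 0 && (cmd[len - 1] == '\r' || cmd[len - 1] == '\n')) {
        len--;
    }

    /* Need room for the payload, the '\n' terminator, and the NUL byte. */
    if (len + 2 > cap) {
        return 0;
    }

    memcpy(out, cmd, len);
    out[len] = '\n';
    out[len + 1] = '\0';
    return len + 1;
}
```

Routing every command through one helper like this keeps a stray \r\n from reappearing in individual test scenarios.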

Final Verified Results

AT returns OK.
AT+? and AT+QUERY return the full current configuration snapshot.
Setter commands AT+IP, AT+MASK, AT+GW, AT+RIP, AT+MAC, AT+PORT, AT+RPORT, AT+BAUD1, AT+BAUD2, AT+DHCP=0 all return OK plus the reboot hint.
AT+SAVE returns OK: Configuration saved.
AT+RESET returns OK: Resetting... and the board comes back responding to AT.
AT+DEFAULT returns OK: Defaults restored.
Negative cases were verified:
- AT+UNKNOWN -> ERROR: Unknown command
- AT+PORT=0 / AT+PORT=65536 -> ERROR: Invalid port
- AT+BAUD1=1199 / AT+BAUD1=921601 -> ERROR: Invalid baudrate
- AT+DHCP=1 -> ERROR: DHCP disabled in this build
- AT+IP=999.1.1.1 -> ERROR: Invalid IP format
- AT+MAC=GG:11:22:33:44:55 -> ERROR: Invalid MAC format
Non-AT input BT produced no response, which matches the parser gate.
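The boundary cases above imply inclusive limits for ports and baud rates. The sketch below encodes those limits as inferred from the rejected values only; it is not copied from App/config.c, the function names are hypothetical, and the real parser may apply further rules (for example a fixed list of baud rates or a different prefix match).

```c
#include <stdbool.h>
#include <string.h>

/* Limits inferred from the rejected boundary values in the log
 * (0 / 65536 for ports, 1199 / 921601 for baud rates). Assumption only. */
#define SKETCH_PORT_MIN 1L
#define SKETCH_PORT_MAX 65535L
#define SKETCH_BAUD_MIN 1200L
#define SKETCH_BAUD_MAX 921600L

/* True when `port` lies inside the inferred valid range. */
bool port_in_range(long port)
{
    return port >= SKETCH_PORT_MIN && port <= SKETCH_PORT_MAX;
}

/* True when `baud` lies inside the inferred valid range. */
bool baud_in_range(long baud)
{
    return baud >= SKETCH_BAUD_MIN && baud <= SKETCH_BAUD_MAX;
}

/* Parser gate as observed on the bench: input that does not start
 * with "AT" is ignored and produces no response. */
bool has_at_prefix(const char *line)
{
    return strncmp(line, "AT", 2) == 0;
}
```

A regression script could use the same limits to generate the just-inside and just-outside test values instead of hard-coding them.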

Final Flash Persistence Evidence

Tested persisted values:
- IP=192.168.1.123
- MASK=255.255.255.0
- GW=192.168.1.1
- RIP=192.168.1.201
- MAC=02:12:34:56:78:9A
- PORT=10001
- RPORT=10002
- BAUD1=57600
- BAUD2=38400
Sequence used:
1. set values with \n-terminated AT commands
2. query with AT+?
3. AT+SAVE
4. AT+RESET
5. query again with AT+?
6. read raw Flash words at 0x0800FC00
Query values before and after reset matched exactly.
Raw Flash read before and after reset also matched exactly.
Factory default restoration was also proven with AT+DEFAULT -> AT+SAVE -> AT+RESET -> AT+?.
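The raw-Flash part of the persistence check above can be expressed as a small comparison. This is a sketch under the assumption that the host has already captured the words at 0x0800FC00 before and after reset into two arrays; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when every word still reads as erased Flash (0xFFFFFFFF). */
bool flash_words_erased(const uint32_t *words, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (words[i] != 0xFFFFFFFFu) {
            return false;
        }
    }
    return true;
}

/* Persistence evidence holds only if the page was actually written
 * (not all erased) and reads back identically after the reset. */
bool persistence_evidence_ok(const uint32_t *before,
                             const uint32_t *after,
                             size_t count)
{
    if (flash_words_erased(before, count)) {
        return false;   /* nothing was saved, so a match proves nothing */
    }
    for (size_t i = 0; i < count; i++) {
        if (before[i] != after[i]) {
            return false;
        }
    }
    return true;
}
```

Rejecting an all-erased page first matters because two identical all-0xFFFFFFFF reads would otherwise look like a passing comparison.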

Evidence Files

Inventory transcript: artifacts/uart-config/inventory-20260331-185752.json
Inventory raw text: artifacts/uart-config/inventory-20260331-185752.txt
Persistence transcript: artifacts/uart-config/persistence-20260331-190039.json
Persistence raw text: artifacts/uart-config/persistence-20260331-190039.txt

Firmware Adjustment Made During This Session

Added explicit HAL_UART_Receive_IT(&huart1, &g_uart1_rx_probe_byte, 1u) arming in App_Init() so the USART1 interrupt receive path is definitely started after boot.
This is a safe, minimal bring-up fix and should remain in place.

Practical Lessons

Do not treat a script exit code alone as proof of UART success; require at least one non-empty response in the captured transcript.
Do not treat probe-rs read 0x0800FC00 returning all 0xFFFFFFFF as a flash-driver failure until you first prove that AT+SAVE was actually accepted by the parser.
In this project, the fastest truth test is:
1. prove target is executing
2. prove USART1 is initialized
3. prove bytes reach config_uart_rx_byte
4. only then evaluate parser responses and flash persistence
Most important bench lesson: if the config UART appears dead, first retry with commands ending in \n instead of \r\n.
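The first lesson above (an exit code alone is not proof) can be made mechanical. The sketch below is hypothetical and independent of the actual test script; it assumes the captured responses are available as an array of C strings, one per command.

```c
#include <stdbool.h>
#include <stddef.h>

/* Count responses that contain at least one byte. */
size_t count_non_empty(const char *const *responses, size_t count)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        if (responses[i] != NULL && responses[i][0] != '\0') {
            n++;
        }
    }
    return n;
}

/* A run counts as UART evidence only if the script exited cleanly
 * AND the target answered at least once. */
bool run_is_valid(int exit_code, const char *const *responses, size_t count)
{
    return exit_code == 0 && count_non_empty(responses, count) > 0;
}
```

With this gate, the earlier inventory run with non_empty_responses = 0 would have been rejected immediately rather than interpreted.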

Open Items

Confirm whether COM9 is the real USART1 config port by live command-response evidence.
If command-response is unstable, inspect whether host wiring/USB-UART level shifting is the cause before changing parser logic.
If persistence fails after a clean AT+SAVE, inspect App/flash_param.c and raw Flash contents at 0x0800FC00 before changing higher-level config logic.

2026-03-31 CH390D Bring-up Debug Session

Goal

Determine why CH390D does not return valid register values during boot.
Find a software-side root cause if one exists and attempt a minimal fix.

Baseline Symptom

MCU boots normally and RTT works.
CH390 boot diagnostics originally reported:

TCP2UART boot
CH390 VID=0x0000 PID=0x0000 REV=0x00 NSR=0x00 LINK=0
CH390 NCR=0x00 RCR=0x00 IMR=0x00 INTCR=0x00 GPR=0x00 ISR=0x00
CH390 WRCHK NCR:0x00->0x00 INTCR:0x00->0x00

This showed that CH390 register reads and write-back checks were not producing valid values.

Board-Side Evidence Already Collected

RST line was observed released high.
CS line idle state was high.
INT line was observed low and mapped to EXTI.
SPI1 was enabled and configured for Mode 3 in the active firmware.
These observations did not by themselves restore valid CH390 responses.

What Was Tried

Added richer RTT startup diagnostics in BootDiag_ReportCh390().
Lowered SPI1 speed from /8 to /64.
Added stage markers around low_level_init() to localize the hang.
Step-debugged and breakpoint-debugged ch390_default_config() and ch390_write_phy().
Added timeout protection to ch390_read_phy() / ch390_write_phy() so EPCR polling cannot hang forever.
Temporarily skipped ch390_set_phy_mode(CH390_AUTO) to isolate non-PHY register access.
Compared current driver against Reference/EVT/EXAM/PUB/CH390.c and Reference/EVT/EXAM/PUB/CH390_Interface.c.
Tried multiple SPI register transaction shapes:
- original two-byte exchange style
- split Transmit then read phase
- explicit dummy-byte read phase
- single-frame two-byte full-duplex read
Scanned all four SPI modes (mode0..mode3) during startup.
Added small CS setup/hold delays.
Increased hardware reset release wait to 50ms.
Restored EVT-style init order and PHY setup path to see whether EVT sequence alone fixes the problem.

Key Intermediate Findings

Lowering SPI speed changed behavior, but did not recover valid CH390 IDs.
Stage markers showed that low-speed SPI could stall during ETH init: default.
Step/RTT evidence localized the original stall to PHY access during ch390_default_config().
The PHY access loop in ch390_read_phy() / ch390_write_phy() had no timeout and could hang indefinitely. This is a real software bug and should stay fixed.
After adding PHY timeouts and temporarily skipping PHY setup, the init path completed, but all CH390 reads became 0xFF rather than valid IDs.
SPI mode scan result under that condition was:

CH390 SPI mode0 [FF FF FF FF FF]
CH390 SPI mode1 [FF FF FF FF FF]
CH390 SPI mode2 [FF FF FF FF FF]
CH390 SPI mode3 [FF FF FF FF FF]

This ruled out a simple CPOL/CPHA mismatch.
External code comparison did not reveal an opcode or register-address mismatch. Public CH390 implementations use the same OPC_REG_R=0x00, OPC_REG_W=0x80, and the same register map.
One experimental split transaction path produced repeatable but obviously bogus values like 0x03, 0xAC, 0xAE, which strongly suggests transaction artifacts rather than real CH390 data.
A debug read of SPI1->SR showed OVR=1 during one of the experimental transaction variants, indicating the SPI transaction layer was not trustworthy in that configuration.
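The PHY timeout fix noted above follows a common bounded-polling pattern. The sketch below is not the project's ch390_read_phy() / ch390_write_phy() code; wait_reg_clear and the reg_read_fn callback are hypothetical, the register address and busy mask are left as parameters, and the limit is a poll count rather than a tick-based timeout so that it does not depend on interrupts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register-read callback; on target this would wrap
 * the CH390 SPI register read. */
typedef uint8_t (*reg_read_fn)(uint8_t reg);

/* Poll `reg` until all `busy_mask` bits are clear.
 * Gives up after `max_polls` reads and returns false on timeout. */
bool wait_reg_clear(reg_read_fn read_reg, uint8_t reg,
                    uint8_t busy_mask, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if ((read_reg(reg) & busy_mask) == 0u) {
            return true;
        }
    }
    return false;
}

/* Host-side stubs for exercising the loop without hardware. */
uint32_t stub_busy_reads_left;

/* Reports busy for a configurable number of reads, then idle. */
uint8_t stub_read_busy_then_idle(uint8_t reg)
{
    (void)reg;
    if (stub_busy_reads_left > 0u) {
        stub_busy_reads_left--;
        return 0x01u;
    }
    return 0x00u;
}

/* Models the all-0xFF readback: the busy bit never clears. */
uint8_t stub_read_stuck_ff(uint8_t reg)
{
    (void)reg;
    return 0xFFu;
}
```

The stuck-0xFF stub shows why the timeout is needed in this bring-up: with a device that reads back all ones, an unbounded busy-bit loop never exits.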

EVT Comparison Outcome

Reference/EVT is useful as a baseline, but it is not a drop-in fix for this project.
The broad init order in the live project already matches EVT closely through the lwIP glue path.
The most important EVT-specific difference is that EVT performs ch390_set_phy_mode(CH390_AUTO) at the start of ch390_default_config().
Restoring the EVT-style PHY setup path in this project caused boot to hang again at:

TCP2UART boot
ETH init: gpio
ETH init: spi
ETH init: reset
ETH init: default

That confirms the PHY path is a real trigger for the hang, but EVT order alone does not solve the underlying communication problem.

Current Best Technical Conclusion

A real software defect was found and fixed: EPCR polling in PHY access had no timeout.
That fix prevents the firmware from hanging forever, but it does not restore valid CH390 register communication.
The core unresolved problem remains: the SPI register-access path still does not yield believable CH390 register data.
At this point, the following common explanations have already been tested and are not sufficient by themselves:
- SPI mode selection
- adding dummy bytes
- CS setup/hold delays
- changing reset wait from 10ms to 50ms
- reverting to EVT transaction style
- restoring EVT initialization order
- public opcode / register-map mismatch

Recommended Next Debug Step

The next high-value experiment is a temporary GPIO bit-bang read of VIDL/VIDH/CHIPR with a fully controlled continuous command+clock sequence.
If bit-bang returns valid IDs while HAL-SPI paths do not, the remaining fault is in the SPI transaction implementation rather than CH390 higher-level init order.
If bit-bang still returns invalid data, the investigation must move back to board-level bus behavior even if static continuity checks look correct.
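A minimal sketch of the proposed bit-bang read follows. It assumes SPI mode 3 (clock idles high, data shifted on the falling edge and sampled on the rising edge), MSB-first transfer, and a read frame consisting of the opcode OR'd with the register address followed by one data byte. The bitbang_bus callbacks are hypothetical; on target they would drive the CS, SCK, and MOSI pins and read MISO. The two stub buses exist only so the logic can be checked on a host.

```c
#include <stdint.h>

/* Hypothetical pin-access callbacks (assumption, not project code). */
typedef struct {
    void (*set_cs)(int level);
    void (*set_sck)(int level);
    void (*set_mosi)(int level);
    int  (*get_miso)(void);
} bitbang_bus;

/* Clock one byte in mode 3, MSB first, and return the byte read. */
uint8_t bitbang_xfer_mode3(const bitbang_bus *bus, uint8_t out)
{
    uint8_t in = 0;
    for (int bit = 7; bit >= 0; bit--) {
        bus->set_sck(0);                       /* falling edge: shift out */
        bus->set_mosi((out >> bit) & 1);
        bus->set_sck(1);                       /* rising edge: sample in */
        in = (uint8_t)((in << 1) | (bus->get_miso() ? 1u : 0u));
    }
    return in;
}

/* Read one register inside a single continuous CS-low window. */
uint8_t bitbang_read_reg(const bitbang_bus *bus, uint8_t opcode_read,
                         uint8_t reg)
{
    uint8_t value;
    bus->set_cs(0);
    (void)bitbang_xfer_mode3(bus, (uint8_t)(opcode_read | reg));
    value = bitbang_xfer_mode3(bus, 0x00u);    /* dummy byte clocks data out */
    bus->set_cs(1);
    return value;
}

/* Host-side stubs: an undriven MISO that reads high, and a loopback. */
static void stub_pin(int level) { (void)level; }
static int stub_miso_high(void) { return 1; }
static int loop_level;
static void loop_set_mosi(int level) { loop_level = level; }
static int loop_get_miso(void) { return loop_level; }

const bitbang_bus stub_bus_miso_high = { stub_pin, stub_pin, stub_pin, stub_miso_high };
const bitbang_bus stub_bus_loopback  = { stub_pin, stub_pin, loop_set_mosi, loop_get_miso };
```

On the undriven-MISO stub every register reads 0xFF, which is the signature to compare against when the same routine runs on the real pins.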

Additional 2026-03-31 Finding: HAL SPI Re-init And Bit-Bang Side Effects

A valid concern was raised about calling HAL_SPI_Init() after temporarily changing SPI pins to GPIO mode.
Code review of stm32f1xx_hal_spi.c showed that HAL_SPI_MspInit() only runs when hspi->State == HAL_SPI_STATE_RESET.
Therefore, simply calling HAL_SPI_Init() after bit-bang mode does not automatically restore PA5/PA7 to SPI alternate-function output mode.
This was a real software-side risk in the temporary bit-bang probe and was corrected by explicitly restoring the pin modes before calling HAL_SPI_Init() again:
- PA5 -> AF_PP
- PA7 -> AF_PP
- PA6 -> input
After that correction, the observed behavior changed again: boot output stopped at ETH init: reset, and a short halt showed execution inside HAL_SPI_TransmitReceive() called from the CH390 SPI exchange path.
This means the earlier bit-bang experiments could have polluted later SPI results, but after the GPIO restore fix, the active blocker is again a live SPI transaction stall rather than a missing-GPIO-restore artifact.
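The re-init pitfall described in this finding can be reproduced with a small host-side model. This is deliberately not HAL code: model_spi_init stands in for HAL_SPI_Init(), the pin-mode assignment inside it stands in for HAL_SPI_MspInit(), and all type and function names are hypothetical.

```c
/* Simplified model of the init state machine described above. */
typedef enum { SPI_STATE_RESET = 0, SPI_STATE_READY } model_spi_state;
typedef enum { PIN_GPIO = 0, PIN_SPI_AF } model_pin_mode;

typedef struct {
    model_spi_state state;          /* stands in for hspi->State */
    model_pin_mode  sck_mosi_mode;  /* stands in for the PA5/PA7 mode */
} model_spi;

/* Pin setup runs only when the handle is still in the RESET state. */
void model_spi_init(model_spi *spi)
{
    if (spi->state == SPI_STATE_RESET) {
        spi->sck_mosi_mode = PIN_SPI_AF;   /* MspInit-equivalent step */
    }
    spi->state = SPI_STATE_READY;
}

/* The bit-bang probe takes the pins over as plain GPIO. */
void model_enter_bitbang(model_spi *spi)
{
    spi->sck_mosi_mode = PIN_GPIO;
}

/* The explicit restore that the fix adds before re-init. */
void model_restore_spi_pins(model_spi *spi)
{
    spi->sck_mosi_mode = PIN_SPI_AF;
}
```

In this model, calling model_spi_init() a second time after model_enter_bitbang() leaves the pins in GPIO mode; only the explicit restore brings them back to the alternate function.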

Additional 2026-03-31 Finding: Reset Exists In Runtime Path

The project does not lack a CH390 reset process.
The actual runtime order is:
1. App_Init()
2. lwip_netif_init()
3. ethernetif_init()
4. low_level_init()
5. ch390_gpio_init()
6. ch390_spi_init()
7. ch390_hardware_reset()
8. ch390_default_config()
The reset process is therefore present and executed, but it lives in the lwIP/netif bring-up path instead of being written inline in main.c as in the EVT sample.
The current unresolved problem is not "missing reset"; it is that SPI transactions after reset still do not produce valid CH390 register responses.

2026-03-31 Manual Reset Sensitivity Analysis

Observed Symptom

An extra ch390_hardware_reset() was temporarily inserted into App_Init() before lwip_init().
With that extra reset in place, a manual board reset could lead to the firmware appearing stuck and the LED heartbeat not behaving normally.
The same image could still look more normal when observed after a probe-rs flash-and-run cycle.

Code-Level Finding

The inserted extra reset sat here in Core/Src/main.c:

SEGGER_RTT_Init();
SEGGER_RTT_WriteString(0, "\r\nTCP2UART boot\r\n");
...
ch390_hardware_reset();
lwip_init();
lwip_netif_init(...);

But the normal bring-up path already performs a reset later inside ch390_runtime_init() / low_level_init() before ch390_default_config().
That means the temporary line created a redundant early reset in a different initialization phase than the normal driver-owned reset.

Interpretation

This pattern is much more consistent with a reset-sequencing / startup-state issue than with compiler optimization level.
The Keil target uses one fixed optimization configuration, so a plain manual reset does not change code generation.
In contrast, an extra CH390 reset inserted before lwIP and before the normal CH390 runtime init can alter the device startup state and timing relationship between the MCU and CH390.

Action Taken

The extra ch390_hardware_reset() in App_Init() was removed.
The firmware now relies only on the standard driver-owned reset inside the CH390 runtime initialization path.

Conclusion

The temporary extra reset was not kept.
The strongest software-side conclusion is that the manual-reset sensitivity was caused by redundant reset sequencing rather than by optimization level.

2026-03-31 HardFault Root Cause And Fix

Symptom

After CH390 bring-up completed and boot diagnostics printed, the firmware entered:

TRAP: HardFault_Handler

At the same time, PC13 stopped blinking, which originally looked like a timer or LED problem.

Fault Evidence

Fault-status registers showed a real fault rather than a normal busy wait.
The trap location was Debug_TrapWithRttHint() in Core/Src/main.c.
The stacked fault frame pointed back into the normal runtime path rather than the trap itself.
TIM4 was configured and had already advanced g_led_blink_ticks, so the LED path was alive before the fault.

Root Cause

MX_IWDG_Init() had been temporarily commented out in main().
However, App_Poll() still executed:

HAL_IWDG_Refresh(&hiwdg);

Because hiwdg was never initialized, this call operated on an invalid handle and led to the observed fault path.

Fix Applied

Core/Src/main.c was changed so watchdog refresh only runs when hiwdg.Instance == IWDG.
This preserves normal behavior when IWDG is enabled, while avoiding invalid access when IWDG init is intentionally disabled for debugging.
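The guard can be sketched on a host with stand-in types. The fake_iwdg_* names below are hypothetical replacements for the HAL handle and the IWDG peripheral base, and the counter increment stands in for HAL_IWDG_Refresh(); the actual change in Core/Src/main.c may differ in detail.

```c
#include <stddef.h>

/* Stand-ins so the sketch compiles without the HAL headers. */
typedef struct { int reload_count; } fake_iwdg_regs;
typedef struct { fake_iwdg_regs *Instance; } fake_iwdg_handle;

/* Plays the role of the IWDG peripheral base address. */
static fake_iwdg_regs fake_iwdg_peripheral;

/* Stands in for MX_IWDG_Init(): binds the handle to the peripheral. */
void fake_iwdg_init(fake_iwdg_handle *h)
{
    h->Instance = &fake_iwdg_peripheral;
}

/* Refresh only when init has bound the handle to the watchdog.
 * Returns 1 if a refresh was issued, 0 if it was skipped. */
int watchdog_refresh_if_ready(fake_iwdg_handle *h)
{
    if (h->Instance != &fake_iwdg_peripheral) {
        return 0;                      /* init was skipped: do nothing */
    }
    h->Instance->reload_count++;       /* stands in for HAL_IWDG_Refresh() */
    return 1;
}
```

Checking the handle rather than a separate debug flag means the refresh path cannot drift out of sync with whether the init call is currently commented out.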

Verification

Rebuilt successfully with 0 errors, 1 warning.
Reflashed target and reran startup.
Boot RTT still showed CH390 diagnostics, but no longer showed TRAP: HardFault_Handler.
A 5-second runtime window completed without a new trap.
g_led_blink_ticks continued advancing after the fix, confirming that TIM4 interrupts and the LED heartbeat path were alive again.

Conclusion

The HardFault was caused by refreshing an uninitialized IWDG handle, not by the CH390 SPI path itself.
This issue is fixed.
CH390 bring-up is still unresolved at the register-communication level, but the main task is again able to continue running normally.

2026-03-31 Runtime Freeze Root Cause And Fix

Symptom

After re-soldering CH390D, the system could boot and print the normal CH390 startup diagnostics.
However, after running for a while, the device would appear to freeze.
In that state, the LED heartbeat behavior became unreliable and the system appeared to stop making useful progress.

Key Runtime Evidence

The new freeze was not another HardFault: no new TRAP: line appeared during the freeze window.
g_led_blink_ticks continued advancing during observation windows, proving that TIM4 interrupts were still alive and the MCU was not fully dead.
A short halt during the bad behavior repeatedly landed in HAL_SPI_TransmitReceive().
Code inspection showed that CH390 runtime paths in ethernetif.c were executing blocking SPI transactions while global interrupts were disabled via ethernetif_lock().

Root Cause

low_level_output(), low_level_input(), and ethernetif_check_link() in Drivers/LwIP/src/netif/ethernetif.c wrapped CH390 SPI register/memory accesses inside ethernetif_lock() / ethernetif_unlock().
Those helpers globally disable interrupts by manipulating PRIMASK.
The CH390 access path uses blocking HAL SPI functions and timeout logic based on HAL_GetTick().
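Running those blocking accesses with interrupts disabled can stall or livelock the runtime path, especially after startup when network polling begins. In the usual HAL configuration, HAL_GetTick() only advances from the tick interrupt, so a tick-based timeout cannot expire while PRIMASK is set.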

Fix Applied

Reduced the interrupt-masked critical sections in ethernetif.c to only protect the shared IRQ-pending flag.
Removed ethernetif_lock() coverage from the long CH390 SPI transaction paths in:
- low_level_output()
- low_level_input()
- ethernetif_check_link()
In ethernetif_poll(), only the g_ch390_irq_pending flag is now cleared under the short critical section; the actual CH390 register access happens with interrupts enabled.
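The narrowed critical section follows a take-and-clear pattern. The sketch below is a host-side model, not the code in ethernetif.c: irq_disable/irq_enable stand in for the PRIMASK-based lock helpers, service_device stands in for the blocking CH390 SPI access, and all names are hypothetical.

```c
#include <stdbool.h>

static volatile bool irq_pending;    /* set from the interrupt handler */
static int irq_mask_depth;           /* stub: > 0 means interrupts masked */
static int bus_calls;                /* how often the device was serviced */
static int bus_calls_while_masked;   /* must stay 0 with this pattern */

static void irq_disable(void) { irq_mask_depth++; }  /* stub lock */
static void irq_enable(void)  { irq_mask_depth--; }  /* stub unlock */

/* Stub for the long, blocking device access. */
static void service_device(void)
{
    bus_calls++;
    if (irq_mask_depth > 0) {
        bus_calls_while_masked++;
    }
}

/* Called from the interrupt handler: record the event, nothing else. */
void post_irq_event(void)
{
    irq_pending = true;
}

/* Main-loop poll: only the flag is read and cleared under the lock;
 * the device access runs afterwards with interrupts enabled. */
void poll_once(void)
{
    bool pending;

    irq_disable();
    pending = irq_pending;
    irq_pending = false;
    irq_enable();

    if (pending) {
        service_device();
    }
}
```

Because the flag is copied and cleared in one short masked window, an interrupt that arrives during service_device() sets the flag again and is picked up on the next poll.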

Verification

Rebuilt successfully with 0 errors, 0 warnings.
Reflashed and reran the target.
Boot RTT still completed normally through:

TCP2UART boot
ETH init: gpio
ETH init: spi
ETH init: reset
ETH init: default
ETH init: mac
ETH init: getmac
ETH init: irq
ETH init: done
CH390 VID=0x0000 PID=0x0000 REV=0x00 NSR=0x00 LINK=0
CH390 NCR=0x00 RCR=0x00 IMR=0x00 INTCR=0x00 GPR=0x00 ISR=0x00

No new TRAP: message appeared during extended runtime observation.
g_led_blink_ticks continued advancing over multiple samples, indicating that the heartbeat timer and interrupt delivery remained active.
The system no longer reproduced the earlier "runs for a while then appears frozen" behavior in the observed validation window.

Conclusion

This freeze was caused by doing blocking CH390 SPI operations inside a global interrupt-disabled critical section.
The runtime freeze is fixed.
CH390 register communication is still invalid (0x0000 ID values), but that is now a separate communication/bring-up problem rather than the cause of the observed runtime stall.

2026-03-31 SPI Ownership Decoupling And CH390 Current Status

Why This Refactor Was Done

The project previously allowed multiple runtime layers to reach down into CH390/SPI behavior directly:
- ethernetif.c handled init, IRQ-driven poll service, RX/TX transactions, and link checks
- main.c directly read CH390 registers for boot diagnostics
- the CH390 low-level SPI transport sat underneath those callers with no single runtime owner boundary
This made the system harder to reason about and contributed to runtime instability when CH390 accesses happened from different code paths with different assumptions.

Refactor Outcome

Added a single runtime owner module: Drivers/CH390/ch390_runtime.c + Drivers/CH390/ch390_runtime.h.
After this change:
- CH390_Interface.c remains the only SPI transport implementation
- CH390.c remains the chip-level helper layer
- ch390_runtime.c is now the only runtime owner of CH390 transactions after boot
- ethernetif.c delegates runtime TX/RX/link/IRQ servicing to ch390_runtime
- main.c no longer performs direct CH390 register reads; boot diagnostics use ch390_runtime_get_diag()
- EXTI0_IRQHandler() only posts the IRQ-pending event into the runtime owner and does not touch CH390 directly
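One way to picture the ownership boundary is a module whose transport is private and whose callers only receive a snapshot. The sketch below is an illustration under assumptions, not the contents of ch390_runtime.c: the sketch_* names are hypothetical, the register indices are placeholders rather than CH390 addresses, and whether ch390_runtime_get_diag() actually returns a cached copy is not stated in this log.

```c
#include <stdint.h>

/* Placeholder register indices for the sketch; NOT CH390 addresses. */
enum { SK_REG_VID_LO, SK_REG_VID_HI, SK_REG_PID_LO, SK_REG_PID_HI, SK_REG_REV };

typedef struct {
    uint16_t vid;
    uint16_t pid;
    uint8_t  rev;
} sketch_diag;

static sketch_diag cached_diag;   /* written only by the owner */
static int bus_reads;             /* counts stubbed bus transactions */

/* Stub transport, private to the owner: models a device that never
 * drives MISO, so every read returns 0xFF. */
static uint8_t owner_bus_read(uint8_t reg)
{
    (void)reg;
    bus_reads++;
    return 0xFFu;
}

/* The only function that touches the bus; refreshes the snapshot. */
void sketch_runtime_refresh_diag(void)
{
    uint8_t vid_lo = owner_bus_read(SK_REG_VID_LO);
    uint8_t vid_hi = owner_bus_read(SK_REG_VID_HI);
    uint8_t pid_lo = owner_bus_read(SK_REG_PID_LO);
    uint8_t pid_hi = owner_bus_read(SK_REG_PID_HI);

    cached_diag.vid = (uint16_t)((vid_hi << 8) | vid_lo);
    cached_diag.pid = (uint16_t)((pid_hi << 8) | pid_lo);
    cached_diag.rev = owner_bus_read(SK_REG_REV);
}

/* Callers such as boot diagnostics get a copy and never start a
 * bus transaction themselves. */
sketch_diag sketch_runtime_get_diag(void)
{
    return cached_diag;
}
```

Keeping the transport function static is what enforces the boundary at compile time: code outside the owner module has no symbol to call.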

Behavior After Refactor

Build passed with 0 errors, 0 warnings.
The system remained stable in the post-refactor runtime window:
- no new trap output
- heartbeat/timer activity continued
- previous runtime freeze did not reproduce in the observed window

CH390 Result After Refactor

The CH390 did not come up successfully.
However, the failure signature became cleaner and more trustworthy:

CH390 VID=0xFFFF PID=0xFFFF REV=0xFF NSR=0xFF LINK=1
CH390 NCR=0xFF RCR=0xFF IMR=0xFF INTCR=0xFF GPR=0xFF ISR=0xFF

This is materially different from the earlier unstable mixture of:
- all-zero reads
- intermittent hangs
- transaction artifacts
- watchdog-related HardFaults

Trusted Interpretation Of Current Failure

With the SPI access model cleaned up and the system remaining stable, the current CH390 failure can now be treated as a credible transport-level non-response rather than a concurrency artifact.
A uniform 0xFF readback across identity and status/control registers strongly suggests one of these conditions:
- CH390 still does not actively drive MISO during the register-read phase
- CS reaches the MCU logic but is not effectively selecting the CH390 device on the board side
- the CH390 digital core is not entering a valid SPI-responding state after reset even though the MCU-side sequence now looks consistent
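The failure signatures seen across this log (all-zero reads earlier, uniform 0xFF now) can be separated automatically in boot diagnostics. The classifier below is a hypothetical sketch; a READBACK_MIXED result only means the bytes are not uniform and does not by itself prove that the data is valid.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    READBACK_ALL_ZERO,   /* every byte 0x00 */
    READBACK_ALL_FF,     /* every byte 0xFF, e.g. MISO never driven low */
    READBACK_MIXED       /* not uniform; needs further checking */
} readback_class;

/* Classify a block of register readbacks by uniformity. */
readback_class classify_readback(const uint8_t *bytes, size_t count)
{
    int all_zero = 1;
    int all_ff = 1;

    if (count == 0) {
        return READBACK_MIXED;   /* nothing to judge */
    }
    for (size_t i = 0; i < count; i++) {
        if (bytes[i] != 0x00u) {
            all_zero = 0;
        }
        if (bytes[i] != 0xFFu) {
            all_ff = 0;
        }
    }
    if (all_zero) {
        return READBACK_ALL_ZERO;
    }
    if (all_ff) {
        return READBACK_ALL_FF;
    }
    return READBACK_MIXED;
}
```

Logging the class next to the raw values makes it easier to spot when a hardware change moves the board from one failure signature to another.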

Practical Conclusion

The architectural decoupling requirement is complete.
The runtime stability requirement is complete.
The CH390 connection still fails, but the cause is now narrowed to a believable low-level bus/device-response problem rather than a software ownership/concurrency problem.
