[lwip-users] Low Iperf performance of lwip 1.4.1 on STM32 and FreeRTOS
Claudius Zingerli
2013-06-21 07:59:01 UTC
Hello all,

I'm working on a project using lwIP 1.4.1, FreeRTOS 7.4.2 on an
STM32F407 MCU.
I have several UDP/TCP/Multicast services running well, but when I try
to measure TCP bandwidth with Iperf as well as with dd|nc, I get very
low results.
Iperf basically just sends a lot of data and lwIP discards it (using
netconn_recv();netbuf_delete() or netconn_recv_tcp_pbuf();pbuf_free();).

An analysis with Wireshark shows the following:
(TCP_MSS=TCP_WND=1460)
- SYN,SYNACK,ACK,PSH,PSH (as usual)
- ZeroWindow (client stuck), WindowUpdate (some ms later)
- PSH, ZeroWindow, WindowUpdate,...

As I understand it, this is how TCP works. Quite low bandwidth (a few
hundred kBps) with these settings, but it works.
When I try to increase TCP_WND to e.g. 5kB, the following problems arise:
- Dup ACKs (from lwIP)
- lots of Retransmissions (from Linux)
The bandwidth is in the Bps to kBps range (at most). I spent hours, but
have no clue where to look next. Any ideas what could be the
reason? (Iperf Linux to Linux results in the full line speed)
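
As a rough sanity check of the "few hundred kBps" figure: with TCP_WND equal to one MSS, the sender can have only one segment in flight and then has to wait for the WindowUpdate. A minimal sketch of that bound follows; the function name and the 5 ms cycle time used in the note below are assumptions for illustration, not values measured in this thread.

```c
#include <assert.h>

/* Upper bound on receive throughput, in bytes per second, when the sender
 * may have at most wnd_bytes in flight and must then wait cycle_s seconds
 * for the window to reopen (ZeroWindow -> WindowUpdate). */
static double window_limited_rate(double wnd_bytes, double cycle_s)
{
    assert(cycle_s > 0.0);
    return wnd_bytes / cycle_s;
}
```

With TCP_WND = 1460 and an assumed 5 ms per ZeroWindow/WindowUpdate cycle, window_limited_rate(1460.0, 0.005) comes out at about 292000 bytes/s, which is in the range reported above.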

One interesting thing is: I get about 0.5% packet drop if I do a ping -f
(100 Pings per second, packets seem to never arrive at the Eth
interrupt). MCU load is always quite low (I have a low prio blink task
that still gets its CPU time as well as )
Things I already fixed: (my design is based on ST's ethernet code)
- Check any stacks/NULL/malloc fails
- Check if pbuf fits into Tx buffer
- Check if there is enough pbuf_mem to fit the Rx packet
- In packet reception I try to drain the input queue (by checking
DMARxDescToGet->Status & ETH_DMARxDesc_OWN )
- ETH_DMASR_RBUS cleared in low_level_input()
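
The "drain the input queue" item in the list above can be sketched roughly as follows. This is only an illustration on a simplified, hypothetical descriptor type: struct rx_desc, RX_DESC_OWN and drain_rx_ring() are made-up names standing in for ST's descriptor structure and ETH_DMARxDesc_OWN, and the deliver callback stands in for whatever copies the frame into a pbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_DESC_OWN 0x80000000u   /* hypothetical stand-in for ETH_DMARxDesc_OWN */

/* Hypothetical, simplified Rx descriptor; the real ST structure has more fields. */
struct rx_desc {
    volatile uint32_t status;     /* OWN bit set: descriptor belongs to the DMA */
    struct rx_desc *next;         /* chained-descriptor ring */
};

/* Hand every CPU-owned descriptor to `deliver`, give it back to the DMA and
 * advance. Stops at the first DMA-owned descriptor or after ring_len
 * descriptors, so a completely full ring cannot loop forever.
 * Returns the number of frames handled; *cursor is left pointing at the
 * next descriptor to examine. */
static size_t drain_rx_ring(struct rx_desc **cursor, size_t ring_len,
                            void (*deliver)(struct rx_desc *))
{
    size_t handled = 0;
    struct rx_desc *d;

    assert(cursor != NULL && *cursor != NULL);
    d = *cursor;

    while (handled < ring_len && (d->status & RX_DESC_OWN) == 0) {
        if (deliver != NULL) {
            deliver(d);               /* e.g. copy the frame into a pbuf */
        }
        d->status = RX_DESC_OWN;      /* return the descriptor to the DMA */
        d = d->next;
        handled++;
    }
    *cursor = d;
    return handled;
}
```

As the last item in the list says, a real driver additionally has to clear ETH_DMASR_RBUS after draining, otherwise the DMA may stay suspended.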

I just ran out of ideas how to fix the problem. Is this about tuning
lwipopts.h? Attached, my current version.
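
For reference, the lwipopts.h options that usually matter for TCP receive throughput look roughly like the fragment below. The values are illustrative assumptions, not the contents of the attached file, and a larger window only helps if the driver actually delivers every segment.

```c
/* lwipopts.h fragment: ILLUSTRATIVE values only, not taken from the attached file. */
#define TCP_MSS            1460
#define TCP_WND            (4 * TCP_MSS)   /* allow several segments in flight */
#define TCP_SND_BUF        (4 * TCP_MSS)
#define PBUF_POOL_SIZE     16              /* assumption: number of Rx pool buffers */
#define PBUF_POOL_BUFSIZE  1536            /* assumption: one full Ethernet frame per buffer */

/* Rough check: the Rx pool should be able to hold at least one full window. */
#if TCP_WND > (PBUF_POOL_SIZE * PBUF_POOL_BUFSIZE)
#error "PBUF_POOL too small for TCP_WND"
#endif
```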

Best regards

Claudius
FreeRTOS Info
2013-06-21 11:11:27 UTC
Ever so slightly off topic -

It sounds like there are lots of people doing good work with FreeRTOS
and lwIP here, and I'm sorry I don't get the time to contribute to these
threads more often. In the past I have attempted to maintain an
"example integration" running in the FreeRTOS Win32 simulator, but
projects discussed here go far beyond that.

I would be very grateful if people could occasionally post frameworks of
their code in the FreeRTOS Interactive site for others to reference.

http://interactive.freertos.org

Regards,
Richard.

+ http://www.FreeRTOS.org
Designed for microcontrollers. More than 103000 downloads in 2012.

+ http://www.FreeRTOS.org/plus
Trace, safety certification, FAT FS, TCP/IP, training, and more...
ella
2013-06-23 05:16:32 UTC
The problem is not FreeRTOS but the buggy and ugly STM32 netif driver. I
studied the original driver provided by ST and had no choice but to rewrite it.

Just one example of the wrong architecture of this driver. This is from
low_level_output():

buffer = (u8 *)(DMATxDescToSet->Buffer1Addr);
for(q = p; q != NULL; q = q->next)
{
  memcpy((u8_t*)&buffer[l], q->payload, q->len);
  l = l + q->len;
}

Consider that the buffers are allocated as
extern uint8_t Tx_Buff[ETH_TXBUFNB][ETH_TX_BUF_SIZE];
and are linked to chained DMA descriptors.

If the packet size is bigger than ETH_TX_BUF_SIZE, you are in potential danger
of a wrap-around that the code does not handle. The same happens in the RX
flow. So it is no surprise that you have problems with big packets.
And this is only one place; there are a number of others. There are also a
few races.
In short: DO NOT USE THIS DRIVER.
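
To make the overflow concrete, here is a minimal sketch of the same copy loop with a bounds check. It is not ST's code and not a complete low_level_output(): struct seg is a hypothetical stand-in for the three pbuf fields the loop uses, and copy_chain_checked() is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the three lwIP pbuf fields the loop uses. */
struct seg {
    struct seg *next;
    const void *payload;
    uint16_t len;
};

/* Copy a segment chain into one Tx buffer of buf_size bytes.
 * Returns the frame length, or -1 (copying nothing further) if the chain
 * does not fit, instead of silently running past the end of the buffer. */
static long copy_chain_checked(uint8_t *buffer, size_t buf_size,
                               const struct seg *p)
{
    size_t l = 0;

    assert(buffer != NULL);

    for (const struct seg *q = p; q != NULL; q = q->next) {
        if (q->len > buf_size - l) {   /* l <= buf_size always holds here */
            return -1;                 /* frame is larger than the buffer */
        }
        memcpy(&buffer[l], q->payload, q->len);
        l += q->len;
    }
    return (long)l;
}
```

Rejecting the frame is only the simplest option; a driver could instead continue the copy in the buffer of the next chained descriptor, which then needs its own handling at the end of the ring.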





--
View this message in context: http://lwip.100.n7.nabble.com/Low-Iperf-performance-of-lwip-1-4-1-on-STM32-and-FreeRTOS-tp21579p21581.html
Sent from the lwip-users mailing list archive at Nabble.com.
Claudius Zingerli
2013-06-23 13:44:39 UTC
Dear Ella,

Well well... That's what I'm currently considering as well: A complete
rewrite of the MAC driver. ST's code is definitely ugly, inconsistent
and often seems to be copy-pasted from older code, but not really
adapted to the new devices/functions.
I already fixed the thing you mentioned by ASSERTing l+q->len to be
smaller than the buffer (ST's driver checks that somehow later by
splitting one buffer into multiple buffers in
ETH_Prepare_Transmit_Descriptors(...), but I'm not sure if that still
works with the last chained DMA descriptor).

Before I write my own MAC driver, I wanted to get a benchmark running
with the original code to compare it to LPC's Iperf benchmarks. They
implemented a zero-copy MAC driver for lwIP that achieves almost line
speed (at much slower clock rates than ST).
For my device: Ping round trip time is between 130us and 250us (1 switch
between Linux and the STM32F407 running at 150MHz), but with >0% packet
loss. TCP is unreliable; UDP seems to work, but is not benchmarked yet.

So: Are there any open (BSD/GPL), stable and ideally zero-copy MAC drivers
for STM32F4x7+FreeRTOS available? Maybe I can get some inspiration from
ChibiOS's driver, as they seem to have mostly avoided ST code.

Regards

Claudius
Claudius Zingerli
2013-07-01 07:56:08 UTC
Hi ella and all,

Some progress here: I receive a lot of CRC & alignment errors (MMC counters
of the STM32). At least there is /some/ correlation between these errors and
lwIP behaving strangely (if there is an increase in the MMC counters,
lwIP gets into trouble; if there is none, lwIP mostly works fine). This
may further be related to using RMII between the MAC and PHY and
generating the RMII clock with the STM32 PLL. According to the datasheet,
jitter and precision should be OK for the PHY, but this might not be the
cleanest solution (there is a hint in the datasheet that one should
source the RMII clock by bypassing the PLL). So as a next step, I'm
going to use a dedicated 50MHz oscillator to clock the PHY and MCU.
On the software side: my own implementation of the MAC driver is on the
way. It could probably be open sourced if there is some interest.
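
One way to make the correlation with the MMC counters less anecdotal is to snapshot them before and after each test run and compare the deltas. The sketch below shows only that bookkeeping. The struct and function names are made up, and it assumes the counters are free-running (not configured to reset on read) and that they would be read from ETH->MMCRFCECR and ETH->MMCRFAECR on the STM32F4.

```c
#include <assert.h>
#include <stdint.h>

/* Snapshot of the two MMC receive error counters mentioned above.
 * ASSUMPTION: on an STM32F4 these would be filled from ETH->MMCRFCECR
 * (CRC errors) and ETH->MMCRFAECR (alignment errors). */
struct mmc_errors {
    uint32_t crc;
    uint32_t align;
};

/* Errors accumulated between two snapshots. Unsigned subtraction still
 * gives the right result if a 32-bit counter wrapped once in between. */
static struct mmc_errors mmc_delta(struct mmc_errors before,
                                   struct mmc_errors after)
{
    struct mmc_errors d;

    d.crc = after.crc - before.crc;
    d.align = after.align - before.align;
    return d;
}

/* Errors per million received frames, so runs of different length can be
 * compared. Returns 0 if no frames were received. */
static uint32_t errors_ppm(uint32_t errors, uint32_t frames)
{
    if (frames == 0) {
        return 0;
    }
    return (uint32_t)(((uint64_t)errors * 1000000u) / frames);
}
```

Logging the delta next to each Iperf or ping result shows directly whether bad runs coincide with CRC/alignment errors on the wire or with something in the stack.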

Claudius
Jeff Barlow
2013-07-01 19:00:15 UTC
...This may further be related to the usage of RMII between the MAC
and PHY and creating the RMII-Clock with the STM32-PLL. Datasheet
jitter and precision should be OK for the PHY, but this might not be
the cleanest solution (There is some hint in the datasheet that good
guys should source the RMII clock by bypassing the PLL). So in a
next step, I'm going to use a dedicated 50MHz oscillator to clock the
PHY and MCU.
This is a known issue. Deriving the RMII-Clock with the STM32-PLL is
just not a robust design.

There are several PHY chips (Micrel, etc) that have built in RMII clock
generators that can use a low cost 25MHz crystal and provide a nice low
jitter clock back to the MCU.
--
Later,
Jeff
Claudius Zingerli
2013-07-02 11:50:20 UTC
[RMII-Clock from PLL is bad]
Post by Jeff Barlow
This is a known issue. Deriving the RMII-Clock with the STM32-PLL is
just not a robust design.
There are several PHY chips (Micrel, etc) that have built in RMII clock
generators that can use a low cost 25MHz crystal and provide a nice low
jitter clock back to the MCU.
In the final design, we plan to use a 3-port switch from Micrel. It can be
clocked from 25MHz or 50MHz, but the board I'm using to develop the
software has a DP83848 that does need a 50MHz clock source for RMII.
The STM32F4 /might/ be able to handle 50MHz as a main clock source. (The
datasheet is ambiguous about that: the drawing says 4-26MHz HSE, but the
table says 1-50MHz HSE. One could interpret that as: above 26MHz one has
to use an oscillator, while below that a crystal would work as well.)

Regards

Claudius
Jeff Barlow
2013-07-02 18:50:05 UTC
Post by Claudius Zingerli
In the final design, we plan use a 3-port switch from Micrel. It can be
clocked from 25MHz or 50MHz, but the board I'm using to develop the
software has a DP83848 that does need a 50MHz clock source for RMII.
STM32F4 /might/ be able to handle 50MHz as a main clock source.
I see. For a one-off dev board I think it's always less frustrating to
just use a separate 50MHz oscillator. The older PHY chips can be really
fussy about clock jitter. I think once you get away from that DP83848
you'll find things less painful.

I was just suggesting using the Micrel RMII clock output to feed the
RMII clock input on the MCU. I've never tried feeding a 50MHz clock into
the HSE on a STM32F407 but it strikes me as risky. I do seem to recall
that some of the PHYs also have a direct 25MHz output that would work
for that. Don't know if that includes your switch chip, however.
--
Later,
Jeff
Krzysztof Wesołowski
2013-07-02 19:29:10 UTC
We feed a 25MHz HSE into the STM32F4, forward the same clock from MCO to the
Micrel PHY, and then use the Micrel's 50MHz output for RMII.
ella
2013-07-03 03:51:54 UTC
Hi,
Can you please tell me the exact part number of the Micrel PHY that can work
in RMII with an external 25MHz crystal? I'd like to try it as well.
Thanks.




Claudius Zingerli
2013-07-03 09:10:47 UTC
Hi ella,
Post by ella
Can you tell me please exact part number of the Micrel PHY that can
work in RMII with external 25MHz crystal. I'd like to try it as well.
We plan to use KSZ8863RLL. It can be clocked from 25MHz (xtal,osc) or
50MHz (osc). No practical experience yet. Anyone using that IC as well?

Regards
Claudius

PS: Using a 50MHz oscillator for the DP83848 results in: ping with 0ppm
packet loss, 92us/117us/258us min/avg/max rtt, and 92Mbps TCP receive
bandwidth using Iperf on an STM32F407 clocked at 150MHz, connected to a
fast Linux PC via a D-Link USB Fast Ethernet adapter.
Pomeroy, Marty
2013-07-03 14:01:57 UTC
Post by Claudius Zingerli
Post by ella
Can you tell me please exact part number of the Micrel PHY
We plan to use KSZ8863RLL.
We're using the KSZ8031. 100MHz has been working for about a year with the
LPC1788 over RMII.

Marty
Jeff Barlow
2013-07-03 17:06:24 UTC
Post by ella
Can you tell me please exact part number of the Micrel PHY that can work in
RMII with external 25MHz crystal.
I think most of the newer Micrel parts work that way. Have a look at
<http://www.micrel.com/index.php/en/products/lan-solutions/phys.html>

I'd guess other vendors' recent designs would be similar.
--
Later,
Jeff
ella
2013-07-01 19:10:15 UTC
Hi,
The clock jitter is documented in the ST errata. (This file is not easy to
find. Instead of hiding it, ST should have put it into the data sheet in big
bold letters.)
As far as I understand it, you cannot use the ST PLL for either MII or
RMII, as in both cases it does not meet the long-term jitter requirement of
the PHY. Without going deep into understanding this bug and its possible
consequences, we have used an external 25MHz crystal with MII. (For RMII you
will need a 50MHz one.)

As for open source, I also thought about it, but ST's ignorance stopped me
from doing it. I think a respectable company like ST should take care of the
ugly source code they provide on their web site. It is not only the
Ethernet driver; their Peripheral Library is in exactly the same state.
(In my projects I do not use it at all and write my own library, adding
support for different HW modules as the need comes up.)
So I decided not to help them and not to disclose any code. But if you are
looking for some cooperation, I'm in. The final goal is to get a stable
zero-copy driver. To my understanding, the MAC and DMA peripheral of the
STM32F2xx is sufficient for that.




Claudius Zingerli
2013-06-23 13:57:27 UTC
Dear Richard,

I completely agree with your request to put the code online. But
currently I'm working on some quite fundamental problems, so a
svn/git-like repo (or links to such a repo) would be much more practical
for now than uploading a zip of some alpha-level code. I could come back to
that at a later stage.

Claudius
Post by FreeRTOS Info
I would be very grateful if people could occasionally post frameworks of
their code in the FreeRTOS Interactive site for others to reference.
http://interactive.freertos.org