Discussion:
[lwip-users] Lwip UDP performance
Lists
2006-03-23 17:06:24 UTC
Hi,
I'm using the raw api (single threaded) on a Nios2 processor running at 50 MHz.
I'm having trouble getting lwip to process udp packets at a high enough rate.
I'm sending udp packets to the nios2 in bursts of 16+ packets (~600 bytes) at
~60Mbps, then leaving a break of ~40ms. The 60 Mbps is probably a bit too fast,
but I would have thought that the 40ms break would give it enough time to catch
up if the buffer was of a sufficient size. I've got the following in lwipopts.h
and I've tried increasing them with no effect:

#define MEM_SIZE 64*1024
#define MEMP_NUM_PBUF 16

#define PBUF_POOL_SIZE 32
#define PBUF_POOL_BUFSIZE 1536
#define PBUF_LINK_HLEN 16

(I've included my lwipopts.h file in case it helps)
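A side note on the MEM_SIZE definition above, as a small sketch: a macro body without parentheses expands textually, so it can evaluate unexpectedly when it ends up inside a larger expression. Whether that matters depends on how the stack uses the macro; the two macro names below are made up for the illustration and are not lwIP options.

```c
/* Hypothetical names, for illustration only (not lwIP options). */

/* As written in the post: expands textually, without parentheses. */
#define MEM_SIZE_UNSAFE 64*1024

/* Parenthesized variant: always evaluates as a single value. */
#define MEM_SIZE_SAFE   (64*1024)

/* 1048576 / MEM_SIZE_UNSAFE expands to 1048576 / 64 * 1024   = 16777216 */
/* 1048576 / MEM_SIZE_SAFE   expands to 1048576 / (64 * 1024) = 16       */
```

Writing the value as (64*1024) costs nothing and avoids the problem wherever the macro is used.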

The behaviour I've observed is that the first 4 packets of each burst get
processed just fine, but only a few of the remaining packets get through.
Currently to test performance, my application increments a counter when it gets a
udp packet and that's pretty much all it does, so it shouldn't be slowing it down
from that end.

So a few questions:
* What kind of performance can I expect? What kind of peak performance?

* Should lwip be able to handle 60Mbps bursts of data?

* Is lwip polled or interrupt driven? I've had a look at the code and it looks like
it's a bit of both... (I call lan91c111if_service(&netif); from main; the
function is part of the driver provided for nios2.)

* Does lwip do buffer copying or is the data passed up through the stack using
pointers?

* Is there an option to disable the udp checksum check? Or is it a case of
modifying the code?

* Any suggestions on how to get maximum udp performance?


Any feedback welcome.

Thanks,


Aidan
Christiaan Simons
2006-03-24 08:36:49 UTC
Post by Lists
I'm using the raw api (single threaded) on a Nios2 processor running at 50 MHz.
I'm having trouble getting lwip to process udp packets at a high enough rate.

Sounds familiar, I've got similar problems ;-)
Post by Lists
I'm sending udp packets to the nios2 in bursts of 16+ packets (~600 bytes) at
~60Mbps, then leaving a break of ~40ms. The 60 Mbps is probably a bit too fast,
but I would have thought that the 40ms break would give it enough time to catch
up if the buffer was of a sufficient size. I've got the following in lwipopts.h

Increasing buffers won't improve the achievable UDP throughput for small
packets.
Post by Lists
The behaviour I've observed is that the first 4 packets of each burst get
processed just fine, but only a few of the remaining packets get through.

There might be a problem in your Ethernet driver.

A typical driver copies data from your chip into a pbuf-chain.

This copying might slow things down.
This is the place to look for improved performance.
Note that there may be a delay between a receive interrupt from the driver,
and the driver actually servicing the hardware.

You should try to minimize both delays. (This can be hard,
depending on the nature of the driver).
Post by Lists
Currently to test performance, my application increments a counter when it gets a
udp packet and that's pretty much all it does, so it shouldn't be slowing it down
from that end.
* What kind of performance can I expect? What kind of peak performance?

This really depends on your HW and SW architecture.

Post by Lists
* Should lwip be able to handle 60Mbps bursts of data?

Some archs might reach full wire-speed.
Post by Lists
* Is lwip polled or interrupt driven. I've had a look at the code and looks like
it's a bit of both... (I call lan91c111if_service(&netif); from main, the
function is part of the driver provided for nios2.)

This is like the cs8900 driver design.
The difficulty is in reading the data within the interrupt:
this might lock up your single-threaded system (at full wire-speed).
Therefore the interrupt merely indicates that the chip needs to be serviced,
and the reading is done from the infinite (main) loop.
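
A minimal sketch of that split, assuming a single-threaded system: the interrupt handler only raises a flag, and the main loop does the actual read-out. All names here (eth_isr, eth_service, hw_queue, delivered) are hypothetical stand-ins, not the real lan91c111 or cs8900 driver code, and the hardware is simulated by a plain counter.

```c
#include <stdbool.h>

/* Set by the interrupt handler, cleared by the main loop. */
volatile bool rx_pending = false;

/* Stand-in for the packets waiting inside the Ethernet chip. */
unsigned hw_queue = 0;

/* Packets handed to the stack so far. */
unsigned delivered = 0;

/* Interrupt handler: does no packet I/O, it only records that the
 * chip needs servicing. */
void eth_isr(void)
{
    rx_pending = true;
}

/* Called from the main loop: drains the chip if the ISR asked for it. */
void eth_service(void)
{
    if (!rx_pending)
        return;

    /* Clear the flag before draining, so an interrupt that fires
     * during the drain is not lost. */
    rx_pending = false;

    while (hw_queue > 0) {
        hw_queue--;      /* a real driver would copy the frame into a pbuf here */
        delivered++;     /* and pass it to the stack's input function */
    }
}
```

In a real system the main loop would call the service routine on every iteration, so the time between the interrupt and the read-out is bounded by the longest loop iteration.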
Post by Lists
* Does lwip do buffer copying or is the data passed up through the stack using
pointers?

Some copying when needed, pointer referencing whenever possible.
This driver places the data in the same pbuf-chain as your udp receive
callback handler gets it. This is fairly efficient.
Post by Lists
* Is there an option to disable the udp checksum check? Or is it a case of
modifying the code?

I wouldn't try this.
Post by Lists
* Any suggestions on how to get maximum udp performance?

Try to improve your Ethernet driver.
Try to use DMA transfers, and run your lwIP stack from a thread/task
(you'll need a small RTOS) that becomes active immediately after
the Ethernet RX interrupt.

I cannot recommend the lwip+uC/OS-II solution that Altera offers.
It is built around the lwip sequential API: low performance,
too much thread synchronisation slows it down.

Bye,

Christiaan Simons

Hardware Designer
Axon Digital Design

http://www.axon.tv
Jeffery Du
2006-03-27 01:00:17 UTC
Aidan:

In my opinion, the bottleneck is in the HW and the Ethernet driver. You use the lan91c111, right? As far as I know, it can't sustain a throughput as high as 60Mbps; in other words, you have exceeded its limit. But maybe you can modify the lan91c111 Ethernet driver to improve the performance. In that case, you should enlarge the rx fifo size to its maximum value and service the lan91c111 ISR as soon as possible.
In addition, you can turn off UDP checksums in lwipopts.h: defining CHECKSUM_GEN_UDP to 0 disables checksum generation for outgoing packets, and defining CHECKSUM_CHECK_UDP to 0 disables the check on received packets.
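
As a sketch, the corresponding lwipopts.h fragment might look like the following, assuming an lwIP version whose opt.h provides both options. CHECKSUM_GEN_UDP covers outgoing datagrams and CHECKSUM_CHECK_UDP covers incoming ones, so for a receive-side bottleneck the second one is the relevant setting.

```c
/* lwipopts.h fragment (sketch): turn off software UDP checksums.
 * Assumes the lwIP version in use defines these options in opt.h. */
#define CHECKSUM_GEN_UDP    0   /* do not generate checksums for outgoing UDP */
#define CHECKSUM_CHECK_UDP  0   /* do not verify checksums on incoming UDP */
```

With the check disabled, a corrupted datagram that gets past the Ethernet CRC is delivered to the application unchanged, so this is a trade-off rather than a free speed-up.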
= = = = = = = = = = = = = = = = = = = =
Best regards

Jeffery Du
2006-03-27
Timmy Brolin
2006-03-24 23:16:56 UTC
Post by Lists
* Is lwip polled or interrupt driven. I've had a look at the code and looks like
it's a bit of both... (I call lan91c111if_service(&netif); from main, the
function is part of the driver provided for nios2.)
If you have an Ethernet MAC with DMA support (common for CPUs with
integrated 100Mbit/s MAC) then you can set it up so that the MAC will
automatically transfer the packets to pbuf chains using DMA. Then when
there are one or more pbuf chains filled with packets, you get an
interrupt. In this case it is of course interrupt driven, and quite
efficient.
Post by Lists
* Is there an option to disable the udp checksum check? Or is it a case of
modifying the code?
Don't know.
In either case, have you assembly-optimized the checksum routine?
Checksum optimization has a huge impact on both UDP and TCP performance.
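
For reference, here is a plain, portable C sketch of the routine in question: the one's-complement Internet checksum as described in RFC 1071. This is not lwIP's own implementation and not an assembly-optimized version; the function name inet_checksum is hypothetical. It shows the per-byte work that an optimized routine would speed up, typically by summing wider words per iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum (RFC 1071) over len bytes.
 * Bytes are combined as big-endian 16-bit words, so the result does
 * not depend on the host byte order. Intended for packet-sized inputs
 * (up to 64 KiB), for which the 32-bit accumulator cannot overflow. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Add up all complete 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }

    /* An odd trailing byte is padded with a zero low byte. */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For the example bytes used in RFC 1071 (00 01 f2 03 f4 f5 f6 f7) the folded sum is 0xddf2, so this function returns its complement, 0x220d.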

/Timmy
Goldschmidt Simon
2006-03-28 06:52:26 UTC
Hi,

Couldn't you write a driver for the lan91c111 that uses the Altera DMA
to transfer the packets between RAM and the MAC? That should speed
things up!

Simon.

Reither Robert
2006-03-28 10:12:27 UTC
I'm using this Ethernet hardware too, and it's sad but true: the device has only 4 fixed-size (receive) buffers.
So if you receive 4 packets very fast, the (hardware) buffers are full and the chip drops packets until you have read them out. I'm using DMA data transfers and got down to 85µs/packet; that's my maximum interface throughput running a NIOS CPU at 50 MHz. (It's not trivial to use DMA with the 91c111: you have to serve the ARDY line or use the maximum interface speed of 100ns cycle time, and don't forget to flush the CPU data caches after a DMA read.)

Routing packets through the stack will take much more time than that, but if you have send delays, you can process the packets from a buffer after the bursts.
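
To put numbers on this, here is a rough back-of-the-envelope model in C. It assumes packets arrive evenly spaced, the chip holds a fixed number of unread packets, and the CPU reads them out one at a time at a constant cost. The function names and the 300µs figure used below are hypothetical; only the 4 buffers, the 85µs/packet and the 600-byte, 60Mbps burst come from this thread. Wire time counts payload bits only and ignores preamble, headers and inter-frame gap.

```c
#define MAX_BURST 256  /* upper bound on burst length for this sketch */

/* Time on the wire for one packet, in microseconds.
 * 1 Mbps is 1 bit per microsecond, so this is bits / Mbps. */
double wire_time_us(unsigned bytes, double mbps)
{
    return (double)bytes * 8.0 / mbps;
}

/* Count packets lost from a burst of n packets arriving every
 * arrival_us, when the chip can hold `slots` unread packets and the
 * CPU needs service_us to read each one out (FIFO, one at a time). */
unsigned burst_drops(unsigned n, double arrival_us, double service_us,
                     unsigned slots)
{
    double finish[MAX_BURST];   /* read-out completion time per accepted packet */
    unsigned accepted = 0, head = 0, drops = 0;

    if (n > MAX_BURST)
        n = MAX_BURST;

    for (unsigned i = 0; i < n; i++) {
        double t = i * arrival_us;

        /* Free every buffer whose read-out has finished by time t. */
        while (head < accepted && finish[head] <= t)
            head++;

        if (accepted - head >= slots) {   /* all buffers still occupied */
            drops++;
            continue;
        }

        /* Read-out starts when the previous one ends, or on arrival. */
        double start = (accepted > 0 && finish[accepted - 1] > t)
                           ? finish[accepted - 1] : t;
        finish[accepted++] = start + service_us;
    }
    return drops;
}
```

Under this model, wire_time_us(600, 60.0) is 80µs, and burst_drops(16, 80.0, 85.0, 4) is 0: at 85µs per packet the four buffers absorb a 16-packet burst. With a hypothetical slower read-out of 300µs per packet, burst_drops(16, 80.0, 300.0, 4) is 8, so half the burst is lost regardless of how much pbuf memory the stack has.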

Greetings
Robert

Hard/Software engineer
AV-Digital
Austria
Christiaan Simons
2006-03-28 10:37:11 UTC
Post by Reither Robert
I'm using this Ethernet hardware too, and it's sad but true: the
device has only 4 fixed-size (receive) buffers.
Ahum, good to learn this.
The shiny Altera sales talks won't tell these nasty problems.
Post by Reither Robert
Routing packets through the stack will take much more time than that, but
if you have send delays, you can process the packets from a buffer after
the bursts.
Do you have an idea where lwIP spends a lot of time in your system?
If you do know about weak points in lwIP I can have a look at those.

I think the pbuf_alloc() can be a bit expensive, and when having a
lot of netifs and PCBs it can slow things down. The memcpy's
in ip_reass won't help much either. For unfragmented traffic I don't
expect any big delays.

Currently I can't pin-point truly encumbered code with regards
to unfragmented IP/UDP speed.

Speeding up the lwIP core for reading unwanted broadcast UDP
(Microsoft SMB blabla) is a bit of a priority for us.

(I don't want to disable the UDP checksumming,
it's only done for traffic that is accepted anyway)

Bye,

Christiaan Simons

Hardware Designer
Axon Digital Design

http://www.axon.tv
Atte Kojo
2006-03-28 10:52:37 UTC
Post by Christiaan Simons
Ahum, good to learn this.
The shiny Altera sales talks won't tell these nasty problems.
Come on, now. When have sales talks ever said that the product being sold
would be anything other than glisteningly excellent and devoid of any problems
whatsoever ;). Oh, and better than the competition by at least a factor of three.
Christiaan Simons
2006-03-28 11:17:02 UTC
Post by Atte Kojo
Come on, now. When have sales talks ever told that the product being sold
would be anything else than glisteningly excellent and devoid of any problems
whatsoever ;). Oh, and better than competition with at least factor of three.

We're big Altera fans, without any cynicism.
Their Quartus design software is great :-) compared to the problematic
Xilinx tools :-(.

The only problem I have with their current NIOS offering is that it can
be good for a few DSP-like apps, but I don't see it as a replacement for,
let's say, a Coldfire or ARM9 with embedded MACs (a 'network processor').

When hearing the MAC isn't so good, or very constrained,
I'm glad someone else confirms this idea.

I guess the limited buffering has something to do
with using up many M4Ks (FPGA memory cells) otherwise.

Christiaan Simons

Hardware Designer
Axon Digital Design

http://www.axon.tv
Timmy Brolin
2006-03-30 19:44:20 UTC
Perhaps Actel FPGAs might be something for you then?
An ARM7 IP core is included in the price of the Actel ProASIC3 FPGAs.
Never used it yet, but it looks nice.

/Timmy
_______________________________________________
lwip-users mailing list
http://lists.nongnu.org/mailman/listinfo/lwip-users