Discussion:
[PATCH] can: ti_hecc: fix rx wrong sequence issue
AnilKumar Ch
2012-10-08 06:40:36 UTC
Permalink
Fix "received wrong sequence count. expected: x, got: y" errors
reported by cansequence in a stress test (--verbose).

While processing the receive packets in mailboxes, upper mailboxes
need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
This requires a status check to identify whether the receiving of
packets in progress or not. If the mailboxes are enabled before the
receive packet status is cleared then there is a possibility of
receiving out of order packets.

With this patch mailboxes are enabled once the receive status is
cleared.

Signed-off-by: AnilKumar Ch <***@ti.com>
---
drivers/net/can/ti_hecc.c | 19 ++++++++++++++++++-
1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/drivers/net/can/ti_hecc.c b/drivers/net/can/ti_hecc.c
index 39293e9..2417a2d 100644
--- a/drivers/net/can/ti_hecc.c
+++ b/drivers/net/can/ti_hecc.c
@@ -169,6 +169,7 @@ MODULE_VERSION(HECC_MODULE_VERSION);
#define HECC_CANES_SMA BIT(5) /* suspend mode ack */
#define HECC_CANES_CCE BIT(4) /* Change config enabled */
#define HECC_CANES_PDA BIT(3) /* Power down mode ack */
+#define HECC_CANES_RM BIT(1) /* Receive Mode bit */

#define HECC_CANBTC_SAM BIT(7) /* sample points */

@@ -194,6 +195,13 @@ MODULE_VERSION(HECC_MODULE_VERSION);
#define HECC_CANGIM_DEF_MASK 0x700 /* only busoff/warning/passive */
#define HECC_CANGIM_SIL BIT(2) /* system interrupts to int line 1 */

+/*
+ * Receive Mode bit reflects what the CAN protocol kernel (CPK) is
+ * actually doing regardless of mailbox configuration. CPK receive
+ * mode timeout. Tried from 1 - 5us and kept 10 as a safety value.
+ */
+#define RM_TIMEOUT_US 10
+
/* CAN Bittiming constants as per HECC specs */
static struct can_bittiming_const ti_hecc_bittiming_const = {
.name = DRV_NAME,
@@ -615,7 +623,7 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
struct ti_hecc_priv *priv = netdev_priv(ndev);
u32 num_pkts = 0;
u32 mbx_mask;
- unsigned long pending_pkts, flags;
+ unsigned long pending_pkts, flags, timeout;

if (!netif_running(ndev))
return 0;
@@ -633,6 +641,15 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
--priv->rx_next;
if (priv->rx_next == HECC_RX_BUFFER_MBOX) {
/* enable high bank mailboxes */
+ timeout = jiffies + usecs_to_jiffies(RM_TIMEOUT_US);
+ while (hecc_get_bit(priv, HECC_CANES, HECC_CANES_RM)) {
+ cpu_relax();
+ if (time_after(jiffies, timeout)) {
+ netdev_warn(ndev, "receiving pkt\n");
+ break;
+ }
+ }
+
spin_lock_irqsave(&priv->mbx_lock, flags);
mbx_mask = hecc_read(priv, HECC_CANME);
mbx_mask |= HECC_RX_HIGH_MBOX_MASK;
--
1.7.0.4

--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jan Lübbe
2012-10-08 09:12:27 UTC
Permalink
Post by AnilKumar Ch
Fix "received wrong sequence count. expected: x, got: y" errors
reported by cansequence in a stress test (--verbose).
I tested this on a AM3505 with cansequence 4.0.6. For 500kbit, I see no
reordering anymore.

When testing at 1Mbit, it seems to be worse than before:
~ # ifconfig can0 down; canconfig can0 bitrate 1000000; ifconfig can0 up
can0 bitrate: 1000000, sample-point: 0.692
[ 1627.995574] ti_hecc ti_hecc: can0: setting CANBTC=0x3b
~ # cansequence -r -vv can0
interface = can0, family = 29, type = 3, proto = 1
received frame. sequence number: 0
received frame. sequence number: 1
received frame. sequence number: 2
received frame. sequence number: 3
received frame. sequence number: 4
received frame. sequence number: 5
received frame. sequence number: 6
received frame. sequence number: 7
received frame. sequence number: 8
received frame. sequence number: 9
received frame. sequence number: 10
received frame. sequence number: 11
received frame. sequence number: 12
received frame. sequence number: 13
received frame. sequence number: 14
received frame. sequence number: 15
received frame. sequence number: 16
received frame. sequence number: 17
received frame. sequence number: 18
received frame. sequence number: 19
received frame. sequence number: 20
received frame. sequence number: 21
received frame. sequence number: 21
received wrong sequence count. expected: 22, got: 21
received frame. sequence number: 21
received wrong sequence count. expected: 22, got: 21
received frame. sequence number: 22
received frame. sequence number: 22
received wrong sequence count. expected: 23, got: 22
received frame. sequence number: 23
received frame. sequence number: 23
received wrong sequence count. expected: 24, got: 23
received frame. sequence number: 24
received frame. sequence number: 24
received wrong sequence count. expected: 25, got: 24
received frame. sequence number: 24
received wrong sequence count. expected: 25, got: 24
received frame. sequence number: 25
received frame. sequence number: 25
received wrong sequence count. expected: 26, got: 25
received frame. sequence number: 26
received frame. sequence number: 27
received frame. sequence number: 27
received wrong sequence count. expected: 28, got: 27
received frame. sequence number: 28
received frame. sequence number: 28
received wrong sequence count. expected: 29, got: 28
received frame. sequence number: 28
received wrong sequence count. expected: 29, got: 28
received frame. sequence number: 29
...

Here we can see that after the initial frames, most frames are received
twice (some even thrice).

Regards,
Jan
--
Pengutronix e.K. | |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |

--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jan Lübbe
2012-10-11 12:01:53 UTC
Permalink
Hi,
Post by AnilKumar Ch
Fix "received wrong sequence count. expected: x, got: y" errors
reported by cansequence in a stress test (--verbose).
=20
I tested this on a AM3505 with cansequence 4.0.6. For 500kbit, I see =
no
reordering anymore.
=20
*snip*
Here we can see that after the initial frames, most frames are receiv=
ed
twice (some even thrice).
Do you have some idea on the cause of this problem, so we could do some
more focused debugging?

Regards,
Jan
--=20
Pengutronix e.K. | =
|
Industrial Linux Solutions | http://www.pengutronix.de/=
|
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 =
|
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-555=
5 |

--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Wolfgang Grandegger
2012-10-08 09:28:26 UTC
Permalink
Post by AnilKumar Ch
Fix "received wrong sequence count. expected: x, got: y" errors
reported by cansequence in a stress test (--verbose).
While processing the receive packets in mailboxes, upper mailboxes
need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
This requires a status check to identify whether the receiving of
packets in progress or not. If the mailboxes are enabled before the
receive packet status is cleared then there is a possibility of
receiving out of order packets.
With this patch mailboxes are enabled once the receive status is
cleared.
---
drivers/net/can/ti_hecc.c | 19 ++++++++++++++++++-
1 files changed, 18 insertions(+), 1 deletions(-)
diff --git a/drivers/net/can/ti_hecc.c b/drivers/net/can/ti_hecc.c
index 39293e9..2417a2d 100644
--- a/drivers/net/can/ti_hecc.c
+++ b/drivers/net/can/ti_hecc.c
@@ -169,6 +169,7 @@ MODULE_VERSION(HECC_MODULE_VERSION);
#define HECC_CANES_SMA BIT(5) /* suspend mode ack */
#define HECC_CANES_CCE BIT(4) /* Change config enabled */
#define HECC_CANES_PDA BIT(3) /* Power down mode ack */
+#define HECC_CANES_RM BIT(1) /* Receive Mode bit */
#define HECC_CANBTC_SAM BIT(7) /* sample points */
@@ -194,6 +195,13 @@ MODULE_VERSION(HECC_MODULE_VERSION);
#define HECC_CANGIM_DEF_MASK 0x700 /* only busoff/warning/passive */
#define HECC_CANGIM_SIL BIT(2) /* system interrupts to int line 1 */
+/*
+ * Receive Mode bit reflects what the CAN protocol kernel (CPK) is
+ * actually doing regardless of mailbox configuration. CPK receive
+ * mode timeout. Tried from 1 - 5us and kept 10 as a safety value.
+ */
+#define RM_TIMEOUT_US 10
+
/* CAN Bittiming constants as per HECC specs */
static struct can_bittiming_const ti_hecc_bittiming_const = {
.name = DRV_NAME,
@@ -615,7 +623,7 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
struct ti_hecc_priv *priv = netdev_priv(ndev);
u32 num_pkts = 0;
u32 mbx_mask;
- unsigned long pending_pkts, flags;
+ unsigned long pending_pkts, flags, timeout;
if (!netif_running(ndev))
return 0;
@@ -633,6 +641,15 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
--priv->rx_next;
if (priv->rx_next == HECC_RX_BUFFER_MBOX) {
/* enable high bank mailboxes */
+ timeout = jiffies + usecs_to_jiffies(RM_TIMEOUT_US);
+ while (hecc_get_bit(priv, HECC_CANES, HECC_CANES_RM)) {
+ cpu_relax();
+ if (time_after(jiffies, timeout)) {
+ netdev_warn(ndev, "receiving pkt\n");
+ break;
+ }
+ }
+
spin_lock_irqsave(&priv->mbx_lock, flags);
mbx_mask = hecc_read(priv, HECC_CANME);
mbx_mask |= HECC_RX_HIGH_MBOX_MASK;
Hm, jiffies do have a 1/HZ resolution, normally 1/100 or 1/250 s. Then
time_after() would not trigger before the next jiffy. Or have I missed
something.

FYI: the program canfdtest [1] from the offical can-utils has been added
some time ago to allow detection of out-of-order messages.

Wolfgang.

[1] https://gitorious.org/linux-can/can-utils/blobs/master/canfdtest.c

--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Marc Kleine-Budde
2012-11-05 11:34:55 UTC
Permalink
Hello,
Post by AnilKumar Ch
Fix "received wrong sequence count. expected: x, got: y" errors
reported by cansequence in a stress test (--verbose).
While processing the receive packets in mailboxes, upper mailboxes
need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
This requires a status check to identify whether the receiving of
packets in progress or not. If the mailboxes are enabled before the
receive packet status is cleared then there is a possibility of
receiving out of order packets.
With this patch mailboxes are enabled once the receive status is
cleared.
AnilKumar Ch, can you add some TI people with access to the vhdl CAN
core design to Cc?

I'm debugging this issue for a week, fixed several other problems with
the driver and now and it seems there is a race condition in the hardware.

The algorithm in the driver relies on the ability to re-enable a bunch
of mailboxes with a single write command. In the ti_hecc hardware the
driver uses the CANME (mailbox enable) register to do so. Under certain
circumstances I've noticed that the CAN core receives the next message
into an undefined mailbox. According to the data sheet CAN messages are
always received into the highest free mailbox (with a matching filter).
But after enabling the mailboxes the next CAN frame doesn't go into the
highest mailbox.

This is a log from my test application. The ti_hecc receives 1 byte long
frames at 500 kbit/s. In the first data byte a number is increased with
every CAN frame (the "d=" column). The enabled mailboxes are in the "me"
column and the pending messages in "rmp".
Post by AnilKumar Ch
[ 57.390747] ti_hecc ti_hecc: can0: i=39502 m=14 n=14 rmp=00004000 me=00007ff0 d= 78
[ 57.390777] ti_hecc ti_hecc: can0: i=39503 m=13 n=13 rmp=00002000 me=00003ff0 d= 79
[ 57.390777] ti_hecc ti_hecc: can0: i=39504 m=12 n=12 rmp=00001000 me=00001ff0 d= 80
[ 57.390808] ti_hecc ti_hecc: can0: i=39505 m=11 n=11 rmp=00000800 me=00000ff0 d= 81
[ 57.390838] ti_hecc ti_hecc: can0: i=39506 m=10 n=10 rmp=00000400 me=000007f0 d= 82
[ 57.390838] ti_hecc ti_hecc: can0: i=39507 m= 9 n= 9 rmp=00000200 me=000003f0 d= 83
[ 57.390869] ti_hecc ti_hecc: can0: i=39508 m= 8 n= 8 rmp=00000100 me=000001f0 d= 84
Here the driver enables the high priority mailboxes (0xfffff000) with a
single write. Actually the driver writes 0xfffff0f0, as there a still
four free low priority mailboxes.
Post by AnilKumar Ch
[ 57.390899] ti_hecc ti_hecc: can0: i=39509 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
^
The next CAN frame should be received, either into 0x00000800 (if the
reception happens before the driver enables the high prio mailboxes) or
into 0x80000000 (if the reception takes place after the driver has
written the canme register).
Post by AnilKumar Ch
[ 57.390930] ti_hecc ti_hecc: can0: i=39510 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.390930] ti_hecc ti_hecc: can0: i=39511 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.390960] ti_hecc ti_hecc: can0: i=39512 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.390991] ti_hecc ti_hecc: can0: i=39513 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.391021] ti_hecc ti_hecc: can0: i=39514 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.391021] ti_hecc ti_hecc: can0: i=39515 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.391052] ti_hecc ti_hecc: can0: i=39516 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.391082] ti_hecc ti_hecc: can0: i=39517 m=31 n=31 rmp=80010000 me=fffff0f0 d= 86
^
We see the next frame goes into 0x80000000 as expected. (There are so
many d=0 lines, because the driver is confused, as the next frame is not
expected in the 0x00010000 mailbox.)

There isn't any tx going on at the moment, the rx-path is the only
modifier of the canme register.

I thought maybe there is a problem in the hardware with the canme
register, so I've reworked the algorithm not to change the canme
register at all. canme stays 0xfffffff0 all the time (all rx mailboxes
active), but received frames are not acknowledged (I don't write 1s into
canrmp after message reception). The pending rx interrupts are masked
with canmim. To "enable" the mailboxes again, the driver clears bits in
the canrmp register with a single write and then unmasks the interrupt
again. This however results in _the_same_ buggy behaviour.

Then I had a closer look at AnilKumar Ch's patch, the canme register is
not changed if a the CAN core currently receives a CAN frame. I added
AnilKumar Ch's busy loop before modifying the canme register, and it
works now. So I conclude that the "pick next free mailbox" algorithm in
the CAN core is racy. You are not allowed to do something that changes a
mailbox's status from disabled/full to enabled/free when a CAN frame is
received. As you cannot control CAN frame reception, this is a nice
hardware race condition. Please add this to the manual and errata sheets.

I'll post a series which fixes the problems I've found soon.

regards,
Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
AnilKumar, Chimata
2012-11-05 13:47:28 UTC
Permalink
Post by Marc Kleine-Budde
Hello,
Hi Marc,

Thanks for looking into this issue, I had spent some considerable
time for debugging this issue, I have seen better sequence with
below changes at 500KBPS but issue still there. Because of some
other higher priority task I left there.

Some of my observations in the driver:-
1. One is buffer mailbox (12th mailbox we are missing)

@@ -627,18 +617,34 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
if (ti_hecc_rx_pkt(priv, priv->rx_next) < 0)
return num_pkts;
++num_pkts;
- } else if (priv->rx_next > HECC_RX_BUFFER_MBOX) {
+ } else if (priv->rx_next >= HECC_RX_BUFFER_MBOX) {
break; /* pkt not received yet */
}
--priv->rx_next;
- if (priv->rx_next == HECC_RX_BUFFER_MBOX) {
- /* enable high bank mailboxes */
- spin_lock_irqsave(&priv->mbx_lock, flags);
+ if (priv->rx_next == (HECC_RX_BUFFER_MBOX - 1)) {

2. Lower mail box configuration is not happening properly. I have
changed this.
3. Removed spinlocks (Just for testing)
Post by Marc Kleine-Budde
Post by AnilKumar Ch
Fix "received wrong sequence count. expected: x, got: y" errors
reported by cansequence in a stress test (--verbose).
While processing the receive packets in mailboxes, upper mailboxes
need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
This requires a status check to identify whether the receiving of
packets in progress or not. If the mailboxes are enabled before the
receive packet status is cleared then there is a possibility of
receiving out of order packets.
With this patch mailboxes are enabled once the receive status is
cleared.
AnilKumar Ch, can you add some TI people with access to the vhdl CAN
core design to Cc?
I am contacting the person internally, I will add immediately once I
got the right person.
Post by Marc Kleine-Budde
I'm debugging this issue for a week, fixed several other problems with
the driver and now and it seems there is a race condition in the hardware.
The algorithm in the driver relies on the ability to re-enable a bunch
of mailboxes with a single write command. In the ti_hecc hardware the
driver uses the CANME (mailbox enable) register to do so. Under certain
circumstances I've noticed that the CAN core receives the next message
into an undefined mailbox. According to the data sheet CAN messages are
always received into the highest free mailbox (with a matching filter).
But after enabling the mailboxes the next CAN frame doesn't go into the
highest mailbox.
This is a log from my test application. The ti_hecc receives 1 byte long
frames at 500 kbit/s. In the first data byte a number is increased with
every CAN frame (the "d=" column). The enabled mailboxes are in the "me"
column and the pending messages in "rmp".
Post by AnilKumar Ch
[ 57.390747] ti_hecc ti_hecc: can0: i=39502 m=14 n=14 rmp=00004000 me=00007ff0 d= 78
[ 57.390777] ti_hecc ti_hecc: can0: i=39503 m=13 n=13 rmp=00002000 me=00003ff0 d= 79
[ 57.390777] ti_hecc ti_hecc: can0: i=39504 m=12 n=12 rmp=00001000 me=00001ff0 d= 80
[ 57.390808] ti_hecc ti_hecc: can0: i=39505 m=11 n=11 rmp=00000800 me=00000ff0 d= 81
[ 57.390838] ti_hecc ti_hecc: can0: i=39506 m=10 n=10 rmp=00000400 me=000007f0 d= 82
[ 57.390838] ti_hecc ti_hecc: can0: i=39507 m= 9 n= 9 rmp=00000200 me=000003f0 d= 83
[ 57.390869] ti_hecc ti_hecc: can0: i=39508 m= 8 n= 8 rmp=00000100 me=000001f0 d= 84
Here the driver enables the high priority mailboxes (0xfffff000) with a
single write. Actually the driver writes 0xfffff0f0, as there a still
four free low priority mailboxes.
Post by AnilKumar Ch
[ 57.390899] ti_hecc ti_hecc: can0: i=39509 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
^
The next CAN frame should be received, either into 0x00000800 (if the
reception happens before the driver enables the high prio mailboxes) or
into 0x80000000 (if the reception takes place after the driver has
written the canme register).
Post by AnilKumar Ch
[ 57.390930] ti_hecc ti_hecc: can0: i=39510 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.390930] ti_hecc ti_hecc: can0: i=39511 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.390960] ti_hecc ti_hecc: can0: i=39512 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.390991] ti_hecc ti_hecc: can0: i=39513 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.391021] ti_hecc ti_hecc: can0: i=39514 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.391021] ti_hecc ti_hecc: can0: i=39515 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.391052] ti_hecc ti_hecc: can0: i=39516 m= 0 n= 7 rmp=00010000 me=fffff0f0 d= 0
[ 57.391082] ti_hecc ti_hecc: can0: i=39517 m=31 n=31 rmp=80010000 me=fffff0f0 d= 86
^
We see the next frame goes into 0x80000000 as expected. (There are so
many d=0 lines, because the driver is confused, as the next frame is not
expected in the 0x00010000 mailbox.)
There isn't any tx going on at the moment, the rx-path is the only
modifier of the canme register.
I thought maybe there is a problem in the hardware with the canme
register, so I've reworked the algorithm not to change the canme
register at all. canme stays 0xfffffff0 all the time (all rx mailboxes
active), but received frames are not acknowledged (I don't write 1s into
canrmp after message reception). The pending rx interrupts are masked
with canmim. To "enable" the mailboxes again, the driver clears bits in
the canrmp register with a single write and then unmasks the interrupt
again. This however results in _the_same_ buggy behaviour.
Then I had a closer look at AnilKumar Ch's patch, the canme register is
not changed if a the CAN core currently receives a CAN frame. I added
AnilKumar Ch's busy loop before modifying the canme register, and it
works now. So I conclude that the "pick next free mailbox" algorithm in
the CAN core is racy. You are not allowed to do something that changes a
mailbox's status from disabled/full to enabled/free when a CAN frame is
received. As you cannot control CAN frame reception, this is a nice
hardware race condition. Please add this to the manual and errata sheets.
I'll post a series which fixes the problems I've found soon.
I can test the series if your patches are ready.

Thanks
AnilKumar
--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Marc Kleine-Budde
2012-11-05 13:56:55 UTC
Permalink
Post by AnilKumar, Chimata
Post by Marc Kleine-Budde
Hello,
Hi Marc,
Thanks for looking into this issue, I had spent some considerable
time for debugging this issue, I have seen better sequence with
below changes at 500KBPS but issue still there. Because of some
other higher priority task I left there.
I'm testing with cansequence at 500 kbit/s at the moment. With my
current patch, I'm not seeing any out-of-order problem. With "ping -f"
and serial load, the driver goes into a rx-overflow. But this is
probably due to too high load.

The hardware I'm using has an issue with 1Mbit/s, thus I cannot test this.
Post by AnilKumar, Chimata
Some of my observations in the driver:-
1. One is buffer mailbox (12th mailbox we are missing)
@@ -627,18 +617,34 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
if (ti_hecc_rx_pkt(priv, priv->rx_next) < 0)
return num_pkts;
++num_pkts;
- } else if (priv->rx_next > HECC_RX_BUFFER_MBOX) {
+ } else if (priv->rx_next >= HECC_RX_BUFFER_MBOX) {
break; /* pkt not received yet */
}
--priv->rx_next;
- if (priv->rx_next == HECC_RX_BUFFER_MBOX) {
- /* enable high bank mailboxes */
- spin_lock_irqsave(&priv->mbx_lock, flags);
+ if (priv->rx_next == (HECC_RX_BUFFER_MBOX - 1)) {
2. Lower mail box configuration is not happening properly. I have
changed this.
3. Removed spinlocks (Just for testing)
I've reworked the poll-loop completely.

[...]
Post by AnilKumar, Chimata
I can test the series if your patches are ready.
I'll send you a preliminary version to test soon, it wouldn't be finial,
but you can test, also include 1Mbit/s.

Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
Marc Kleine-Budde
2012-11-05 18:15:27 UTC
Permalink
Post by Marc Kleine-Budde
Hello,
Post by AnilKumar Ch
Fix "received wrong sequence count. expected: x, got: y" errors
reported by cansequence in a stress test (--verbose).
While processing the receive packets in mailboxes, upper mailboxes
need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
This requires a status check to identify whether the receiving of
packets in progress or not. If the mailboxes are enabled before the
receive packet status is cleared then there is a possibility of
receiving out of order packets.
With this patch mailboxes are enabled once the receive status is
cleared.
[..]
Post by Marc Kleine-Budde
Then I had a closer look at AnilKumar Ch's patch, the canme register is
not changed if a the CAN core currently receives a CAN frame. I added
AnilKumar Ch's busy loop before modifying the canme register, and it
works now. So I conclude that the "pick next free mailbox" algorithm in
the CAN core is racy. You are not allowed to do something that changes a
mailbox's status from disabled/full to enabled/free when a CAN frame is
received. As you cannot control CAN frame reception, this is a nice
hardware race condition. Please add this to the manual and errata sheets.
I really want to know where the race window starts and ends, in order to
specify proper timeouts.
Post by Marc Kleine-Budde
/* enable high bank mailboxes */
+ timeout = jiffies + usecs_to_jiffies(RM_TIMEOUT_US);
+ while (hecc_get_bit(priv, HECC_CANES, HECC_CANES_RM)) {
+ cpu_relax();
+ if (time_after(jiffies, timeout)) {
+ netdev_warn(ndev, "receiving pkt\n");
+ break;
+ }
+ }
+
You first wait for reception to finish, but then might block on this
spin_lock. This introduces a software race condition.
Post by Marc Kleine-Budde
spin_lock_irqsave(&priv->mbx_lock, flags);
mbx_mask = hecc_read(priv, HECC_CANME);
mbx_mask |= HECC_RX_HIGH_MBOX_MASK;
We have to use the spin_lock because the CANME register is shared
between the rx and tx path. I think it's better to make use of
CANRPM+CANMIM, because I think the critical path is the modification of
the CANRPM register, which is not shared, so it can be written without
the need for the spin_lock.

Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
Marc Kleine-Budde
2012-11-05 18:40:37 UTC
Permalink
Post by Marc Kleine-Budde
Post by Marc Kleine-Budde
Hello,
Post by AnilKumar Ch
Fix "received wrong sequence count. expected: x, got: y" errors
reported by cansequence in a stress test (--verbose).
While processing the receive packets in mailboxes, upper mailboxes
need to enable while processing 12th (RX_BUFFER mailbox) mailbox.
This requires a status check to identify whether the receiving of
packets in progress or not. If the mailboxes are enabled before the
receive packet status is cleared then there is a possibility of
receiving out of order packets.
With this patch mailboxes are enabled once the receive status is
cleared.
[..]
Post by Marc Kleine-Budde
Then I had a closer look at AnilKumar Ch's patch, the canme register is
not changed if a the CAN core currently receives a CAN frame. I added
AnilKumar Ch's busy loop before modifying the canme register, and it
works now. So I conclude that the "pick next free mailbox" algorithm in
the CAN core is racy. You are not allowed to do something that changes a
mailbox's status from disabled/full to enabled/free when a CAN frame is
received. As you cannot control CAN frame reception, this is a nice
hardware race condition. Please add this to the manual and errata sheets.
I really want to know where the race window starts and ends, in order to
specify proper timeouts.
Post by Marc Kleine-Budde
/* enable high bank mailboxes */
+ timeout = jiffies + usecs_to_jiffies(RM_TIMEOUT_US);
+ while (hecc_get_bit(priv, HECC_CANES, HECC_CANES_RM)) {
+ cpu_relax();
+ if (time_after(jiffies, timeout)) {
+ netdev_warn(ndev, "receiving pkt\n");
+ break;
+ }
+ }
+
You first wait for reception to finish, but then might block on this
spin_lock. This introduces a software race condition.
Post by Marc Kleine-Budde
spin_lock_irqsave(&priv->mbx_lock, flags);
mbx_mask = hecc_read(priv, HECC_CANME);
mbx_mask |= HECC_RX_HIGH_MBOX_MASK;
We have to use the spin_lock because the CANME register is shared
between the rx and tx path. I think it's better to make use of
CANRPM+CANMIM, because I think the critical path is the modification of
the CANRPM register, which is not shared, so it can be written without
the need for the spin_lock.
I've implemented both solutions and both are working. I've instrumented
how long the CAN driver does stay in the while loop. I've loaded the CAN
bus with:

# 1 byte frame
cansequence -p

# 8 byte frame, all 1s
cansend can0 -i 0xffffffff 0xff 0xff 0xff 0xff 0xff 0xaff 0xff 0xff
--loop=-1 -e

Both solution show about the same number of loops, depending on the bit
rate of the bus:

500 kbit/s
hecc_set_bit_canme: looped 867 times

50 kbit/s
hecc_set_bit_canme: looped 4807 times

AnilKumar, can you get in contact with the hardware people to figure out
a better solution than to cycle 4k times in a loop?

Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
Jan Sondhauss
2014-09-08 10:15:08 UTC
Permalink
Post by Marc Kleine-Budde
Post by Marc Kleine-Budde
[..]
Post by Marc Kleine-Budde
Then I had a closer look at AnilKumar Ch's patch, the canme register is
not changed if a the CAN core currently receives a CAN frame. I added
AnilKumar Ch's busy loop before modifying the canme register, and it
works now. So I conclude that the "pick next free mailbox" algorithm in
the CAN core is racy. You are not allowed to do something that changes a
mailbox's status from disabled/full to enabled/free when a CAN frame is
received. As you cannot control CAN frame reception, this is a nice
hardware race condition. Please add this to the manual and errata sheets.
I really want to know where the race window starts and ends, in order to
specify proper timeouts.
[..]
We have to use the spin_lock because the CANME register is shared
between the rx and tx path. I think it's better to make use of
CANRPM+CANMIM, because I think the critical path is the modification of
the CANRPM register, which is not shared, so it can be written without
the need for the spin_lock.
I've implemented both solutions and both are working. I've instrumented
how long the CAN driver does stay in the while loop. I've loaded the CAN
[..]
AnilKumar, can you get in contact with the hardware people to figure out
a better solution than to cycle 4k times in a loop?
Marc
Hello Marc, hello AnilKumar,

is there a proper solution to this problem available yet? I haven't found one
on linux-can repository, but maybe I've been looking at the wrong place.
We are running a "custom" 3.6.11 kernel with PREEMPT_RT_FULL on a Ti AM3517.
Also we applied AnilKumars busy-wait-for-rx-idle-loop.

We have a problem in one test-case with two can-devices (including our device
under test) where we are exchanging messages in a request-response matter:
The am3517 sends 16 requests in a row and then expects 16 response messages.

Under some circumstances the driver stays in the napi's-poll list forever,
because there is a message available in a lower "main"-mailbox but
rx_next points to the first mailbox:

[24817.008123] ti_hecc: priv->rx_next 0000001f
[24817.012769] ti_hecc: HECC_CANME fffffff0
[24817.017015] ti_hecc: HECC_CANMD fffffff0
[24817.021261] ti_hecc: HECC_CANRMP 40000000

It looks to me that the access to the CANME register in the ti_hecc_rx_pkt()
function is also racy:

spin_lock_irqsave(&priv->mbx_lock, flags);
hecc_clear_bit(priv, HECC_CANME, mbx_mask);
hecc_write(priv, HECC_CANRMP, mbx_mask);
/* enable mailbox only if it is part of rx buffer mailboxes */
if (mbxno < HECC_RX_BUFFER_MBOX)
hecc_set_bit(priv, HECC_CANME, mbx_mask);
spin_unlock_irqrestore(&priv->mbx_lock, flags);

That makes me wonder what was the purpose of the else-branch in the
ti_hecc_rx_poll() function (as it prevents us from receiving messages)?:

} else if (priv->rx_next >= HECC_RX_BUFFER_MBOX) {
break; /* pkt not received yet */
}

Our problem is somewhat off-topic, but relates to the racy CANME access so I
hope I posted at the right place. :)

Thanks and best regards,
Jan













--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Loading...