[PATCH] mcp251x: mcp2515 stops receiving

Discussion:

Rost, Martin

2014-05-08 06:33:52 UTC

The mcp2515 sometimes seems to trigger an interrupt with the corresponding register not being set yet.
This makes the driver exit the interrupt because there is obviously nothing to do, but the interrupt line is kept low.
Therefore the driver does not see any more interrupts until the chip is reset (via interface down/up).

This patch changes the IST to first check the IRQ registers, and wait up to 10 ms if an event really occurrs.
If the IRQ register is still empty after 10ms, a kernel message gets issued.
The IST loop is rearranged to evaluate the IRQ register at the last moment before exiting, to not miss a late irq event.

---
diff --git "a/mcp251x.c" "b/mcp251x.c"
index ad58ac6..668ce63 100644
--- "a/mcp251x.c"
+++ "b/mcp251x.c"
@@ -806,15 +806,29 @@ static irqreturn_t mcp251x_can_ist(int irq, void *dev_id)
struct mcp251x_priv *priv = dev_id;
struct spi_device *spi = priv->spi;
struct net_device *net = priv->net;
+ u8 intf, eflag;
+ u8 retrycount = 10;

mutex_lock(&priv->mcp_lock);
- while (!priv->force_quit) {
+
+ do {
+ mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);
+ if (!intf) {
+// printk(KERN_CRIT "MCP251x: IRQ delaying.\r\n");
+ mdelay(1);
+ }
+ } while (!intf && (retrycount--));
+
+ if (!intf)
+ printk(KERN_CRIT "MCP251x: IRQ without a cause.\r\n");
+
+ while ((!priv->force_quit) && (intf)) {
enum can_state new_state;
- u8 intf, eflag;
+// u8 intf, eflag;
u8 clear_intf = 0;
int can_id = 0, data1 = 0;

- mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);
+// mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);

/* mask out flags we don't care about */
intf &= CANINTF_RX | CANINTF_TX | CANINTF_ERR;
@@ -913,8 +927,8 @@ static irqreturn_t mcp251x_can_ist(int irq, void *dev_id)
}
}

- if (intf == 0)
- break;
+// if (intf == 0)
+// break;

if (intf & CANINTF_TX) {
net->stats.tx_packets++;
@@ -926,6 +940,7 @@ static irqreturn_t mcp251x_can_ist(int irq, void *dev_id)
netif_wake_queue(net);
}

+ mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);
}
mutex_unlock(&priv->mcp_lock);
return IRQ_HANDLED;

---

Marc Kleine-Budde

2014-05-12 17:00:38 UTC

Permalink

Hey Martin,

Post by Rost, Martin
The mcp2515 sometimes seems to trigger an interrupt with the
corresponding register not being set yet. This makes the driver exit
the interrupt because there is obviously nothing to do, but the
interrupt line is kept low. Therefore the driver does not see any
more interrupts until the chip is reset (via interface down/up).

Hmmmm, I think level triggered interrupt would help here. However not
all interrupt controller support level interrupts and
arch/arm/mach-pxa/icontrol.c sets the trigger to rising edge.

Post by Rost, Martin
This patch changes the IST to first check the IRQ registers, and wait
up to 10 ms if an event really occurrs. If the IRQ register is still
empty after 10ms, a kernel message gets issued.

What happens if the register is still empty after 10ms? Can we do
something better than the print, which will probably not fix the probem...

Post by Rost, Martin
The IST loop is rearranged to evaluate the IRQ register at the last
moment before exiting, to not miss a late irq event.

Alexander, have you seen or heard of this problem before? What do you
think about this workaround?

Post by Rost, Martin
---
diff --git "a/mcp251x.c" "b/mcp251x.c"
index ad58ac6..668ce63 100644
--- "a/mcp251x.c"
+++ "b/mcp251x.c"
@@ -806,15 +806,29 @@ static irqreturn_t mcp251x_can_ist(int irq, void *dev_id)
struct mcp251x_priv *priv = dev_id;
struct spi_device *spi = priv->spi;
struct net_device *net = priv->net;
+ u8 intf, eflag;
+ u8 retrycount = 10;
mutex_lock(&priv->mcp_lock);
- while (!priv->force_quit) {
+
+ do {
+ mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);
+ if (!intf) {
+// printk(KERN_CRIT "MCP251x: IRQ delaying.\r\n");

please remove that line

Post by Rost, Martin
+ mdelay(1);
+ }
+ } while (!intf && (retrycount--));
+
+ if (!intf)
+ printk(KERN_CRIT "MCP251x: IRQ without a cause.\r\n");

In Linux, we use \n only, also dev_LEVEL() is preferred over plain printk.

Post by Rost, Martin
+
+ while ((!priv->force_quit) && (intf)) {
enum can_state new_state;
- u8 intf, eflag;
+// u8 intf, eflag;

please remove, not comment out.

Post by Rost, Martin
u8 clear_intf = 0;
int can_id = 0, data1 = 0;
- mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);
+// mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);

same here

Post by Rost, Martin
/* mask out flags we don't care about */
intf &= CANINTF_RX | CANINTF_TX | CANINTF_ERR;
@@ -913,8 +927,8 @@ static irqreturn_t mcp251x_can_ist(int irq, void *dev_id)
}
}
- if (intf == 0)
- break;
+// if (intf == 0)
+// break;

same here

Post by Rost, Martin
if (intf & CANINTF_TX) {
net->stats.tx_packets++;
@@ -926,6 +940,7 @@ static irqreturn_t mcp251x_can_ist(int irq, void *dev_id)
netif_wake_queue(net);
}
+ mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);
}
mutex_unlock(&priv->mcp_lock);
return IRQ_HANDLED;

I'll send a cleaned version of you patch.

Marc

Alexander Shiyan

2014-05-12 17:14:55 UTC

Permalink

Post by Marc Kleine-Budde
Hey Martin,

Hmmmm, I think level triggered interrupt would help here. However not
all interrupt controller support level interrupts and
arch/arm/mach-pxa/icontrol.c sets the trigger to rising edge.

What happens if the register is still empty after 10ms? Can we do
something better than the print, which will probably not fix the probem...

Post by Rost, Martin
The IST loop is rearranged to evaluate the IRQ register at the last
moment before exiting, to not miss a late irq event.

Alexander, have you seen or heard of this problem before? What do you
think about this workaround?

I encountered a similar problem only once, but it was resolved by using
can_set_restart_ms() :), because the problem was that the controller
transitions to the BUS_OFF state.

I absolutely don't like the patch. This is not a solution to the problem but
ways to circumvent it.

I would recommend to make a temporary parameter for sysfs, then when a
problem occurs, you could take a snapshot from all registers (including an
interrupt status register) and understand what the problem is.
Realizing what was really going on, we can make the right decision.

---

��칻�&�~�&��+-��ݶ��w��˛��m�b��\jx��ܨ}��Ơz�&j:+v��zZ+

Rost, Martin

2014-05-12 17:34:14 UTC

Permalink

Hello Alexander,

Post by Alexander Shiyan
I absolutely don't like the patch. This is not a solution to the problem but
ways to circumvent it.

True that. I whish I had a better solution.

Post by Alexander Shiyan
I would recommend to make a temporary parameter for sysfs, then when a
problem occurs, you could take a snapshot from all registers (including an
interrupt status register) and understand what the problem is.

I've never done that. Any hints on how to start on this?

Regards,
Martin

Alexander Shiyan

2014-05-12 17:47:24 UTC

Permalink

Post by Rost, Martin
Hello Alexander,

Post by Alexander Shiyan
I absolutely don't like the patch. This is not a solution to the problem but
ways to circumvent it.

True that. I whish I had a better solution.

I've never done that. Any hints on how to start on this?

I can not guarantee the accuracy, but something like this:

static ssize_t dbg_show(struct device *dev, struct device_attribute *attr, char *buf)
{
unsigned i;
ssize_t ret = 0;

for (i = 0; i < NUM_SPI_REGS; i++)
ret += sprintf(buf, "[0x%02x] = 0x%02x\n", i, SPI_REG[i]);

return ret;
}

static DEVICE_ATTR_RO(dbg);

static struct attribute *dbg_sysfs_attrs[] = {
&dev_attr_dbg.attr,
};

static const struct attribute_group dbg_sysfs_group = {
.attrs = dbg_sysfs_attrs,
};

...
sysfs_create_group(&client->dev.kobj, &dbg_sysfs_group);

---

N��r��y��b�X��ǧv�^�)޺{.n�+��{�q��^n�r��z��h��&��G��h�

John Whitmore

2014-05-21 23:53:47 UTC

Permalink

Post by Marc Kleine-Budde
Hey Martin,

Hmmmm, I think level triggered interrupt would help here. However not
all interrupt controller support level interrupts and
arch/arm/mach-pxa/icontrol.c sets the trigger to rising edge.

What happens if the register is still empty after 10ms? Can we do
something better than the print, which will probably not fix the probem...

Post by Rost, Martin
The IST loop is rearranged to evaluate the IRQ register at the last
moment before exiting, to not miss a late irq event.

Alexander, have you seen or heard of this problem before? What do you
think about this workaround?

I've had this problem before but with the MCP2515 connected to a Microchip
Microcontroller. I've been working on other stuff and just got back to looking
into this problem. Whilst it's not Linux based I was getting this exact
problem and was forced to poll the MCP2515 instead of managing it via
interrupts. Earlier today I got the best of both worlds by doing nothing in
the ISR other then setting a FLAG and clearing the interrupt. The Flag is then
picked up in the main processing loop of the code. It's a race condition and a
flaw in the MCP2515, in my opinion.

I'm not an expert in the Linux Driver code nor the implemented fix but I can
confirm that the Interrupt is generated with no flags set. I'm not sure of the
10mS timing but I'd imagine that should work.

Post by Marc Kleine-Budde

please remove that line

Post by Rost, Martin
+ mdelay(1);
+ }
+ } while (!intf && (retrycount--));
+
+ if (!intf)
+ printk(KERN_CRIT "MCP251x: IRQ without a cause.\r\n");

In Linux, we use \n only, also dev_LEVEL() is preferred over plain printk.

Post by Rost, Martin
+
+ while ((!priv->force_quit) && (intf)) {
enum can_state new_state;
- u8 intf, eflag;
+// u8 intf, eflag;

please remove, not comment out.

Post by Rost, Martin
u8 clear_intf = 0;
int can_id = 0, data1 = 0;
- mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);
+// mcp251x_read_2regs(spi, CANINTF, &intf, &eflag);

same here

--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Rost, Martin

2014-05-22 06:38:18 UTC

Permalink

Hello John,

-----Original Message-----
Sent: Thursday, May 22, 2014 1:54 AM
To: Marc Kleine-Budde
Subject: Re: [PATCH] mcp251x: mcp2515 stops receiving
I'm not an expert in the Linux Driver code nor the implemented fix but I can
confirm that the Interrupt is generated with no flags set. I'm not sure of the
10mS timing but I'd imagine that should work.

Thanks for confirming it.
I fiddled a bit with the timing, and it seemed that 5ms would catch some but not all occurrences of this problem.
So I went for 10 to make sure.

I'm not too deep into kernel internals, so is there some best practice to this?

Regards,
Martin

John Whitmore

2014-05-22 10:28:44 UTC

Permalink

Post by Rost, Martin
Hello John,

Thanks for confirming it.
I fiddled a bit with the timing, and it seemed that 5ms would catch some but not all occurrences of this problem.
So I went for 10 to make sure.
I'm not too deep into kernel internals, so is there some best practice to this?

I honestly can't answer about best practice on working around apparent flaws in
Hardware ;-) I'll leave somebody else to answer that one.

Post by Rost, Martin
Regards,
Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Gerhard Bertelsmann

2014-05-22 16:08:47 UTC

Permalink

Hi,

interesting conversation here regarding the mcp251x module ...

Post by Rost, Martin
Hello John,

Thanks for confirming it.
I fiddled a bit with the timing, and it seemed that 5ms would catch some but not all occurrences
of this problem.
So I went for 10 to make sure.
I'm not too deep into kernel internals, so is there some best practice to this?
Regards,
Martin

IMHO a delay isn't a good idea. I've spent lots of hours in fron of my osci
http://www.raspberrypi.org/forums/viewtopic.php?f=44&t=7027&start=103
to get the MCP2515 working with RPi. After fiddling the GPIOs for the MCP2515
interrupt I got a working setup with the RPI and 3.1.x+ kernel. The setup was
stable even after upgrading to 3.2.x+ kernel. There were packet losts if the
CAN frame rate was high but the interface was "stable" - no hangup.

Things got worse after upgrading to 3.6.x+ kernel. The driver wasn't stable
anymore - the interface hang due to missed interrupts. From my point of view
this has to do with the IRQF_ONESHOT. The drivers have to set this flag in newer
kernels if you don't have a primary IRQ handler - the MCP2515 doesn't use it,
because the interrupt will be cleared if you read the buffer for example. This
is a feature of the CAN-Controller and the kernel module use it to reduce
SPI requests.

AFAIK the IRQF_ONESHOT feature was added to avoid endless interrupt retriggering.
But a retrigger is needed by the mc251x module: If you read the first buffer,
there must be a second interrupt if the second buffer is also full.

If you are curious you could do the following test: Write some (useless) code into the
primary IRQ routine:

static irqreturn_t mcp251x_can_hardirq(int irq, void *dev_id)
{
....
return IRQ_WAKE_THREAD;
}
...
static int mcp251x_open(struct net_device *net)
{
...
ret = request_threaded_irq(spi->irq, mcp251x_can_hardirq, mcp251x_can_ist,
pdata->irq_flags ? pdata->irq_flags : IRQF_TRIGGER_FALLING,
DEVICE_NAME, priv);

to avoid the IRQF_ONESHOT flag to be set. There is no guarantee that this will work,
of course. But give it a try ;-)

Regards

Gerd

--
To unsubscribe from this list: send the line "unsubscribe linux-can" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Rost, Martin

2014-05-12 17:27:51 UTC

Permalink

Hello Marc,

Post by Marc Kleine-Budde
Hmmmm, I think level triggered interrupt would help here. However not
all interrupt controller support level interrupts and
arch/arm/mach-pxa/icontrol.c sets the trigger to rising edge.

That was my first attempt. That explains why it didn't help, too.

Post by Marc Kleine-Budde
What happens if the register is still empty after 10ms? Can we do
something better than the print, which will probably not fix the probem...

I thought about re-checking the line if the IRQ line was released, but I have no clue how to do that...
I haven't seen a problem when returning regardless (in my setup), but I'd like to give a hint at least in case someone runs into the same problem.

Post by Marc Kleine-Budde
I'll send a cleaned version of you patch.

much appreciated.

Regards,
Martin.