[cross-post] nand flash bad blocks management

alb
Hi everyone,

We have ~128Mbit of configuration to be stored in a Flash device and for
reasons related to qualification (HiRel application) we are more
inclined to the use of NAND technology instead of NOR. Unfortunately
NAND flash suffers from bad blocks, which may also develop during the
lifetime of the component and have to be handled.

I've read something about bad block management and it looks like there
are two essential strategies to cope with the issue of bad blocks:

1. skip block
2. reserved block

The first strategy skips a block whenever it is bad and writes to the
next free one, updating the logical block addressing (LBA) accordingly.
The second strategy reserves a dedicated area to which bad blocks are
remapped; in this case too, the LBA must be kept updated.

I do not see much of a difference between the two strategies except
that in case 1 I need to 'search' for the first available free block,
while in case 2 I have reserved a special area for it. Am I missing any
other major difference?

The second question I have is about 'management'. I do not have a
software stack to perform the management of these bad blocks and I'm
obliged to do it with my FPGA. Does anyone here see any potential risk
in doing so? Would I be better off dedicating a small footprint
controller in the FPGA to handle the Flash Translation Layer with wear
leveling and bad block management? Can anyone here point me to some
IP cores readily available for doing this?

There's a high chance I will need to implement some sort of 'scrubbing'
to avoid accumulation of errors. All these 'functions' to handle the
Flash seem to me very suited for software but not for hardware. Does
anyone here have a different opinion?

Any comment/suggestion/pointer/rant is appreciated.

Cheers,

Al

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
 
On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
> Hi everyone,
>
> We have ~128Mbit of configuration to be stored in a Flash device and for
> reasons related to qualification (HiRel application) we are more
> inclined to the use of NAND technology instead of NOR. Unfortunately
> NAND flash suffers from bad blocks, which may also develop during the
> lifetime of the component and have to be handled.
>
> I've read something about bad block management and it looks like there
> are two essential strategies to cope with the issue of bad blocks:
>
> 1. skip block
> 2. reserved block
>
> The first one will skip a block whenever it is bad and write on the first
> free one, updating also the logical block addressing (LBA), while the
> second strategy reserves a dedicated area to remap the bad blocks. In this
> second case the LBA shall be kept updated as well.
>
> I do not see much of a difference between the two strategies except the
> fact that in case 1 I need to 'search' for the first available free
> block, while in case 2 I reserved a special area for it. Am I
> missing any other major difference?

The second strategy is required when the total logical storage capacity
must be constant. I can imagine the existence of 'bad sectors' degrading
performance on some filesystems.

> The second question I have is about 'management'. I do not have a
> software stack to perform the management of these bad blocks and I'm
> obliged to do it with my FPGA. Does anyone here see any potential risk
> in doing so? Would I be better off dedicating a small footprint
> controller in the FPGA to handle the Flash Translation Layer with wear
> leveling and bad block management? Can anyone here point me to some
> IP cores readily available for doing this?

Sounds like you're re-inventing eMMC.

> There's a high chance I will need to implement some sort of 'scrubbing'
> to avoid accumulation of errors.

Indeed regular reading (and IIRC also writing) can increase the longevity
of the device. But it is up to you whether that is needed at all.

> All these 'functions' to handle the Flash seem to me very suited for
> software but not for hardware. Does anyone here have a different
> opinion?

AFAIK, (e)MMC devices all have a small microcontroller inside.


--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
 
Hi Boudewijn,

In comp.arch.embedded Boudewijn Dijkstra <sp4mtr4p.boudewijn@indes.com> wrote:
[]
>> I've read something about bad block management and it looks like there
>> are two essential strategies to cope with the issue of bad blocks:
>>
>> 1. skip block
>> 2. reserved block
>>
>> The first one will skip a block whenever it is bad and write on the first
>> free one, updating also the logical block addressing (LBA), while the
>> second strategy reserves a dedicated area to remap the bad blocks. In this
>> second case the LBA shall be kept updated as well.
>>
>> I do not see much of a difference between the two strategies except the
>> fact that in case 1 I need to 'search' for the first available free
>> block, while in case 2 I reserved a special area for it. Am I
>> missing any other major difference?

> The second strategy is required when the total logical storage capacity
> must be constant. I can imagine the existence of 'bad sectors' degrading
> performance on some filesystems.

Ok, that's a valid point: since I declare as user space only the total
minus the reserved area, the user may rely on that information.

But the total number of bad blocks for the quoted endurance will be
exactly the same in both cases; neither strategy wears the device less.

>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so? Would I be better off dedicating a small footprint
>> controller in the FPGA to handle the Flash Translation Layer with wear
>> leveling and bad block management? Can anyone here point me to some
>> IP cores readily available for doing this?

> Sounds like you're re-inventing eMMC.

I didn't know there was a name for that. Well, if that's so, yes, but
it's not for storing your birthday pictures; it's for a space
application.

Even if there are several 'experiments' running in low orbit with NAND
flash components, I do not know of any operational satellite (for meteo
or similar) that has anything like this.

>> There's a high chance I will need to implement some sort of 'scrubbing'
>> to avoid accumulation of errors.

> Indeed regular reading (and IIRC also writing) can increase the longevity
> of the device. But it is up to you whether that is needed at all.

I'm not aiming to increase longevity. I'm aiming to guarantee that the
system will cope with the expected bit flip and still guarantee mission
objectives throughout the intended lifecycle (7.5 years on orbit).

Scrubbing is not so complicated: you read, correct and write back. But
hitting a bad block during the rewrite, while you have tons of other
things to do in the meanwhile, may have side effects... to be evaluated
and handled.

>> All these 'functions' to handle the Flash seem to me very suited for
>> software but not for hardware. Does anyone here have a different
>> opinion?
>
> AFAIK, (e)MMC devices all have a small microcontroller inside.

It does not surprise me, but I have the requirement not to include
*any* software onboard! I may let an embedded microcontroller with a
hardcoded list of instructions slip through, but I'm not so sure.

Al
 
Hi Boudewijn,

On 1/12/2015 3:38 AM, Boudewijn Dijkstra wrote:
> On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:

>> The second question I have is about 'management'. I do not have a
>> software stack to perform the management of these bad blocks and I'm
>> obliged to do it with my FPGA. Does anyone here see any potential risk
>> in doing so? Would I be better off dedicating a small footprint
>> controller in the FPGA to handle the Flash Translation Layer with wear
>> leveling and bad block management? Can anyone here point me to some
>> IP cores readily available for doing this?

> Sounds like you're re-inventing eMMC.

>> There's a high chance I will need to implement some sort of 'scrubbing'
>> to avoid accumulation of errors.

> Indeed regular reading (and IIRC also writing) can increase the longevity
> of the device. But it is up to you whether that is needed at all.

Um, *reading* also causes fatigue in the array -- just not as quickly as
*writing*/erase. In most implementations, this isn't a problem because
you're reading the block *into* RAM and then accessing it from RAM.
But, if you just keep reading blocks repeatedly, you'll discover your
ECC becoming increasingly more active/aggressive in "fixing" the degrading
NAND cells.

So, either KNOW that your access patterns (read and write) *won't*
disturb the array. *Or*, actively manage it by "refreshing" content
after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.

>> All these 'functions' to handle the Flash seem to me very suited for
>> software but not for hardware. Does anyone here have a different
>> opinion?
>
> AFAIK, (e)MMC devices all have a small microcontroller inside.

I can't see an *economical* way of doing this (in anything less than
huge volumes) with dedicated hardware (e.g., FPGA).
 
On Tue, 13 Jan 2015 01:03:45 +0100, Don Y <this@is.not.me.com> wrote:
> On 1/12/2015 3:38 AM, Boudewijn Dijkstra wrote:
>> On Mon, 12 Jan 2015 10:23:24 +0100, alb <al.basili@gmail.com> wrote:
>>
>>> There's a high chance I will need to implement some sort of 'scrubbing'
>>> to avoid accumulation of errors.
>>
>> Indeed regular reading (and IIRC also writing) can increase the longevity
>> of the device. But it is up to you whether that is needed at all.
>
> Um, *reading* also causes fatigue in the array -- just not as quickly as
> *writing*/erase.

Indeed; my apologies. Performing many reads before an erase will indeed
cause bit errors that can be repaired by reprogramming. What I wanted to
say, but misremembered, is that *not* reading over extended periods may
also cause bit errors, due to charge leakage. This can also be repaired
by reprogramming.
(ref: Micron TN2917)

>>> All these 'functions' to handle the Flash seem to me very suited for
>>> software but not for hardware. Does anyone here have a different
>>> opinion?
>>
>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>
> I can't see an *economical* way of doing this (in anything less than
> huge volumes) with dedicated hardware (e.g., FPGA).

Space exploration is not economical (yet). ;)



--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/
 
Hi Don,

In comp.arch.embedded Don Y <this@is.not.me.com> wrote:
[]
>> Indeed regular reading (and IIRC also writing) can increase the
>> longevity of the device. But it is up to you whether that is needed
>> at all.
>
> Um, *reading* also causes fatigue in the array -- just not as quickly
> as *writing*/erase. In most implementations, this isn't a problem
> because you're reading the block *into* RAM and then accessing it from
> RAM. But, if you just keep reading blocks repeatedly, you'll discover
> your ECC becoming increasingly more active/aggressive in "fixing" the
> degrading NAND cells.

Reading does not cause *fatigue* in the sense that it does not wear the
device. The effect is referred to as 'read disturb', and it may cause
errors in pages other than the one read. With multiple readings of the
same page you may end up inducing so many errors that your ECC would
not be able to cope when you try to access the *other* pages.

These sorts of problems, though, show up only after a number of read
cycles in the hundreds of thousands, if not millions (google: The
Inconvenient Truths of NAND Flash Memory).

> So, either KNOW that your access patterns (read and write) *won't*
> disturb the array. *Or*, actively manage it by "refreshing" content
> after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.

We have to cope with bit flips anyway (low earth orbit), so we are
obliged to scrub the memory. In order to avoid error accumulation we
move the entire block, update the LBA and erase the affected one, so it
becomes available again.

>>> All these 'functions' to handle the Flash seem to me very suited for
>>> software but not for hardware. Does anyone here have a different
>>> opinion?
>>
>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>
> I can't see an *economical* way of doing this (in anything less than
> huge volumes) with dedicated hardware (e.g., FPGA).

Well, according to our latest estimates we are at about 30% of cell
usage on an AX2000 (2 Mgates), without including any scrubbing (yet),
but including the bad block management.

Al
 
Hi Boudewijn,

On 1/13/2015 2:17 AM, Boudewijn Dijkstra wrote:

>>>> There's a high chance I will need to implement some sort of 'scrubbing'
>>>> to avoid accumulation of errors.
>>>
>>> Indeed regular reading (and IIRC also writing) can increase the longevity
>>> of the device. But it is up to you whether that is needed at all.
>>
>> Um, *reading* also causes fatigue in the array -- just not as quickly as
>> *writing*/erase.
>
> Indeed; my apologies. Performing many reads before an erase will indeed
> cause bit errors that can be repaired by reprogramming. What I wanted to
> say, but misremembered, is that *not* reading over extended periods may
> also cause bit errors, due to charge leakage. This can also be repaired
> by reprogramming.
> (ref: Micron TN2917)

Yes, it's amazing how many of the issues that were troublesome in OLD
technologies have modern-day equivalents! E.g., "print through" for
tape; write-restore-after-read for core; etc.

>>>> All these 'functions' to handle the Flash seem to me very suited for
>>>> software but not for hardware. Does anyone here have a different
>>>> opinion?
>>>
>>> AFAIK, (e)MMC devices all have a small microcontroller inside.
>>
>> I can't see an *economical* way of doing this (in anything less than
>> huge volumes) with dedicated hardware (e.g., FPGA).
>
> Space exploration is not economical (yet). ;)

<frown> Wise ass! :>

Yes, I meant "economical" in terms of device complexity. The more
complex the device required for a given functionality, the less
reliable it will be (in an environment where you don't get second
chances).
 
