row hammer...

On 9/1/2020 12:43 AM, Martin Brown wrote:
On 31/08/2020 19:50, Don Y wrote:
On 8/31/2020 9:29 AM, jlarkin@highlandsniptechnology.com wrote:
On Mon, 31 Aug 2020 11:46:48 -0400, Joe Gwinn <joegwinn@comcast.net
wrote:

On Mon, 31 Aug 2020 08:30:30 -0700, jlarkin@highlandsniptechnology.com
wrote:


https://en.wikipedia.org/wiki/Row_hammer


That's amazing. If x86 can possibly do something wrong, it does.

It is not clear that this is an x86 problem, as DRAM is used
universally, and the method depends on DRAM implementation.

Joe Gwinn

Allowing a RAM data error to become a privilege escalation exploit is
a classic x86 blunder.

No. The exploit can be applied to other processors, as well.

For example, ARM cores -- although they are somewhat more resistant.

https://par.nsf.gov/servlets/purl/10110647

\"The second exploit revealed by Project Zero runs as an unprivileged
Linux process on the x86-64 architecture, exploiting the row hammer
effect to gain unrestricted access to all physical memory installed in
a computer.\"

We live in the Dark Ages of computing.

Makes you wonder how many of those EDA tools you use are lying to you! :)

Yes. He should go back to pencil and paper immediately.
Taped board layouts too; you can't trust these new-fangled computers.

There's something to be said for technologies that expose every action
to review!

The more interesting issue is in deciding when you're experiencing
more errors than "expected" -- even in the absence of a deliberate
"exploit"! When do you decide that your ECC is doing too *much*
work... lowering your confidence that it is protecting you against
ALL errors?

OTOH, REALLY makes you wonder how often the machine is NOT executing the
code (and machine state) "as prescribed" -- yet APPEARING to produce
good results!

These are interesting attacks against dynamic ram hardware. The vulnerability
after that depends on how it gets exploited and luck!

There have been some studies on how resilient different "applications"
(using that in the generic sense -- i.e., an OS is an application, too!)
are to uncorrected errors.

Consider that an error (in this case, a disturbance) has to be *seen*
before it can actually have an effect on the application (i.e., if
the datum is never read from memory, then it's the tree that fell
in the forest that no one heard!).

Then, the program logic has to NOT mask the effects of the error in
order for it to be a hazard. I.e., if the error makes an even value
odd and the code detects the odd number and "makes it even", then
no harm, no foul. Likewise, if the error converts an opcode into
one that is essentially equivalent (in THIS condition), then there
is no real change in execution.
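A trivial C sketch of that masking effect (the flipped bit and the "make it even" fix-up below are purely illustrative, not from any particular program):

#include <stdio.h>

int main(void)
{
    unsigned int intended  = 1000;            /* value the program meant to use */
    unsigned int corrupted = intended ^ 1u;   /* single-bit upset makes it odd  */

    /* Defensive logic written for other reasons ("counts must be even")
       happens to repair the damage before anything ever observes it. */
    if (corrupted & 1u)
        corrupted &= ~1u;

    printf("intended=%u, after upset and fix-up=%u\n", intended, corrupted);
    return 0;                                 /* the upset is invisible here */
}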

Finally, any \"results\" of the application\'s execution have to be
\"visible\" to the consumer, in some NOTICEABLE way. E.g., if
the application is sorting your music collection by artist name,
the error(s) have to \"stand out\" in a way that you are likely to
notice in order to be consequential (\"Hmmm... why is \'Jefferson Airplane\'
BEFORE \'Janice Joplin\'?\")

Ideally, errors should cause crashes -- if your goal is to know
that errors are occurring! If, OTOH, your goal is to "get a
result" -- even if not 100% correct -- then crashes should be avoided.

E.g., I've 96G in each of my six primary workstations and nothing
(OS?) has ever complained to me about ECC errors... which I'm SURE
must be occurring!
 
Don Y wrote:
<snip>
E.g., I've 96G in each of my six primary workstations and nothing
(OS?) has ever complained to me about ECC errors... which I'm SURE
must be occurring!

I'm not sure about that at all, but a former coworker's husband
ran a DRAM testing firm and we talked a bit about it.

--
Les Cargill
 
Dave Platt <dplatt@coop.radagast.org> wrote:
In article <slrnrkqli8.pap.nomail@xs9.xs4all.nl>,
Rob <nomail@example.com> wrote:

Allowing a RAM data error to become a privilege escalation exploit is
a classic x86 blunder.

DRAM without ECC essentially is only suitable for toys.
All serious machines using DRAM have ECC on top of it.

By that criterion, probably 90+ percent of the PCs used in homes and business
offices in the United States are toys.

That is correct.

It is (unfortunately) relatively uncommon for PC motherboards to
support ECC on the DRAM. Very few consumer-grade boards do, and (I
believe) only a minority of business-grade boards by the Big Makers.
One tends to find ECC only in servers, or in specially-ordered
workstations.

For purposes where one requires better than a toy.

Although you may be able to put an ECC DIMM/SODIMM in a standard
motherboard, and have the computer boot OK, and maybe even identify
the memory as being ECC-equipped, it won't actually do you any good...
the common chipsets usually don't perform ECC, and even if they do the
motherboard is sometimes built without the additional memory-bus
traces required to connect the additional IC to the chipset.

Of course you need a chipset and board that supports ECC...

As I understand it, this is mostly a question of cost and performance.
Supporting ECC requires more chips on the DIMM, it requires more
traces on the board, and it slows down memory access slightly. As a
result, there's a tradeoff between reliability and performance-per-
dollar.

It was only fairly recently that I found a home-PC motherboard that I
liked, which had ECC support (one of the TaiChi family). It cost more to
equip the board with ECC-enabled DIMMs but I felt it was worth it in the long
run.

Yes, you need to decide whether you want reliability or a toy.
 
On 02 Sep 2020 17:00:12 GMT, Rob <nomail@example.com> wrote:


Yes, you need to decide whether you want reliability or a toy.

Lots of people do serious, productive stuff on their toy PCs. Email.
Word processing. Accounting. Looking up stuff on the web.

Remember phone books and road maps and encyclopedias and filling out
bingo cards and waiting a month to get a data sheet?

Remember life before Amazon?

Most people use laptops, and they seem to be quite reliable. Mine have
all got obsolete before they broke.

My giant Dell boxes seem fine too. Windows is stupid but it works.





--

John Larkin Highland Technology, Inc

Science teaches us to doubt.

Claude Bernard
 
On 9/1/2020 4:57 PM, Les Cargill wrote:
Don Y wrote:
snip

E.g., I've 96G in each of my six primary workstations and nothing
(OS?) has ever complained to me about ECC errors... which I'm SURE
must be occurring!

I'm not sure about that at all, but a former coworker's husband
ran a DRAM testing firm and we talked a bit about it.

It would be interesting to hear his observations on the subject
(i.e., if there is a "DRAM testing firm" it suggests there are
issues that need to be addressed! :> )

When I designed my OS, I did a fair bit of research on memory
reliability (various technologies), at that time. The takeaways
were:
- as memory sizes increase, so does the likelihood of encountering
real failures (data retention errors -- transient or not)
- memory errors may be masked by the "application" (app == any
piece of software, incl the OS)
- there is a dearth of "best practices" guidance as to how to
*interpret* memory faults

[I recall encountering a document that said something to the
effect of \"24 errors in 24 hours suggests the DIMM should be
replaced\"... REALLY? ONE PER HOUR??! So, what do they think
is a *common* failure rate?!]

When designing appliances for 24/7/365 unattended operation, this
all suggested adding capabilities to the system to continually
monitor and advise as to the health of the system as the user
would otherwise be clueless to understand faults that may be
occurring without rising to the level of "being noticeable".

(And, you don't want a user complaining about a defective product!)
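As a rough sketch of what that monitoring can look like, in C: a counter with a time window. The window, threshold, and structure names below are made up for illustration, not taken from any particular product.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical window and threshold -- tune per product requirements. */
#define WINDOW_SECONDS  (24u * 60u * 60u)   /* look at the last 24 hours       */
#define CE_THRESHOLD    10u                 /* flag the unit above this count  */

struct ce_monitor {
    uint32_t window_start;   /* seconds, from whatever clock the system has    */
    uint32_t count;          /* correctable errors seen in the current window  */
};

/* Call this from the ECC/EDAC interrupt or polling handler.
   Returns true when the error rate crosses the "advise somebody" line. */
static bool ce_monitor_record(struct ce_monitor *m, uint32_t now_seconds)
{
    if (now_seconds - m->window_start >= WINDOW_SECONDS) {
        m->window_start = now_seconds;   /* start a fresh window */
        m->count = 0;
    }
    return ++m->count > CE_THRESHOLD;
}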
 
On 9/2/2020 10:13 AM, jlarkin@highlandsniptechnology.com wrote:
On 02 Sep 2020 17:00:12 GMT, Rob <nomail@example.com> wrote:
Yes, you need to decide whether you want reliability or a toy.

Lots of people do serious, productive stuff on their toy PCs. Email.
Word processing. Accounting. Looking up stuff on the web.

And more than that! Nothing stops you from running simulations,
compiling VHDL, laying out PCBs, etc. on a "toy" PC!

And, chances are, if the machine hiccups, you won't be able to
DEFINITIVELY tell if it was operator error, a bug in the code, or a
glitch in the hardware!

Of course, there are also many cases where you won\'t even NOTICE an
error! E.g., if your DRC runs "OK", are you 100.00% sure that the code
(and data on which it relied) weren't "disturbed" while it was
running? Do you hand check all of your accounting figures to make
sure a bit didn't flip somewhere as the report was being printed?

OTOH, if you were marketing a "DRC service" to thousands of customers,
you'd probably want to have a bit more confidence in your "product"!

 
In article <riorh6$9iq$1@dont-email.me>,
Don Y <blockedofcourse@foo.invalid> wrote:

When I designed my OS, I did a fair bit of research on memory
reliability (various technologies), at that time. The takeaways
were:
- as memory sizes increase, so does the likelihood of encountering
real failures (data retention errors -- transient or not)
- memory errors may be masked by the "application" (app == any
piece of software, incl the OS)
- there is a dearth of "best practices" guidance as to how to
*interpret* memory faults

[I recall encountering a document that said something to the
effect of \"24 errors in 24 hours suggests the DIMM should be
replaced\"... REALLY? ONE PER HOUR??! So, what do they think
is a *common* failure rate?!]

https://arxiv.org/pdf/1901.03401.pdf is a fairly interesting summary
of this and related issues - a recent thesis and report.

Among other things, they report on an experiment which automatically
sends servers to repair if the server suffers more than 100
correctable memory errors per month... that's about one every 8 hours.

They note: \"We observe that a small number of servers have a large
number of errors. For example, the top 1% of servers with the most
errors have over 97.8% of all the correctable errors we observe.\"

\"If we compute the mean error rate as in prior work, we observe 497
correctable errors per server per month. However, if we examine the
error rate for the majority of servers (by taking the median errors
per server per month), we find that most servers have at most 9
correctable errors per server per month. In this case, using the mean
value to estimate the value for the majority overestimates by over 55×."

When designing appliances for 24/7/365 unattended operation, this
all suggested adding capabilities to the system to continually
monitor and advise as to the health of the system as the user
would otherwise be clueless to understand faults that may be
occurring without rising to the level of "being noticeable".

Yup. For DRAM, using ECC and reporting correctable errors if above a
fairly conservative threshold seems like a good strategy. For hard
drives and SSDs, having software which monitors the S.M.A.R.T.
drive-health parameters can help. It's not perfect - I've seen drives
with substantial failures that the S.M.A.R.T. statistics didn't flag,
because the vendor never considered the specific sort of failure that
occurred. On the other hand, there have been two or three occasions
in which a weekly S.M.A.R.T. off-line scan has triggered an
\"uncorrectable sector read error\" report, and issuing a \"scan and sync
RAID images\" sweep restored the weakened sector from the equivalent
data on the other RAID platter.
 
On 9/2/2020 2:57 PM, Dave Platt wrote:
In article <riorh6$9iq$1@dont-email.me>,
Don Y <blockedofcourse@foo.invalid> wrote:

When I designed my OS, I did a fair bit of research on memory
reliability (various technologies), at that time. The takeaways
were:
- as memory sizes increase, so does the likelihood of encountering
real failures (data retention errors -- transient or not)
- memory errors may be masked by the \"application\" (app == any
piece of software, incl the OS)
- there is a dearth of \"best practices\" guidance as to how to
*interpret* memory faults

[I recall encountering a document that said something to the
effect of \"24 errors in 24 hours suggests the DIMM should be
replaced\"... REALLY? ONE PER HOUR??! So, what do they think
is a *common* failure rate?!]

https://arxiv.org/pdf/1901.03401.pdf is a fairly interesting summary
of this and related issues - a recent thesis and report.

Thanks, I'll have a look.

The article I recollected, above:
<https://docs.oracle.com/cd/E19150-01/820-4213-11/dimms.html>

Among other things, they report on an experiment which automatically
sends servers to repair if the server suffers more than 100
correctable memory error per month... that\'s about one every 8 hours.

So, at *90* errors per month, the server is considered "OK"? :>

They note: \"We observe that a small number of servers have a large
number of errors. For example, the top 1% of servers with the most
errors have over 97.8% of all the correctable errors we observe.\"

ISTR that there is a correlation between CORRECTABLE errors and
UNcorrectable. I.e., if machine X suffered an uncorrectable error
(often the hardware would signal a restart), then it is likely that
correctable errors preceded this (i.e., could be used as an early
warning indication -- makes sense)

\"If we compute the mean error rate as in prior work, we observe 497
correctable errors per server per month. However, if we examine the
error rate for the majority of servers (by taking the median errors
per server per month), we find that most servers have at most 9
correctable errors per server per month. In this case, using the mean
value to estimate the value for the majority overestimates by over 55×.\"

But, you don't know if you've got one of the "nominal" servers on your
hands... or a "lemon"! :-/

When designing appliances for 24/7/365 unattended operation, this
all suggested adding capabilities to the system to continually
monitor and advise as to the health of the system as the user
would otherwise be clueless to understand faults that may be
occurring without rising to the level of \"being noticeable\".

Yup. For DRAM, using ECC and reporting correctable errors if above a
fairly conservative threshold seems like a good strategy. For hard

But support for ECC in the types of processors often used in embedded
systems (i.e., appliances) is not common. Pricing pressure often pushes
designers to using cheaper, less featureful devices.

And, support for ECC alone doesn't help much if your only remedy is
to reboot or panic()!

drives and SSDs, having software which monitors the S.M.A.R.T.
drive-health parameters can help. It\'s not perfect - I\'ve seen drives
with substantial failures that the S.M.A.R.T. statistics didn\'t flag,
because the vendor never considered the specific sort of failure that
occurred. On the other hand, there have been two or three occasions
in which a weekly S.M.A.R.T. off-line scan has triggered an
\"uncorrectable sector read error\" report, and issuing a \"scan and sync
RAID images\" sweep restored the weakened sector from the equivalent
data on the other RAID platter.

You can't guard against every error. OTOH, pretending they don't exist
(or that they are inherently benign) is foolish.

I\'ve decided that the most prudent approach is to collect data as
to OBVIOUS errors encountered while running (and 24/7/365 gives you
LOTS of opportunities to watch for f*ckups!) and try to infer from
those observations when a device is likely to be "headed the wrong way".

In my case, I can test (scrub) memory while the code is running and
mark \"suspect\" pages, swapping them out of service (much like
bad sector replacement on a disk). And, when the situation permits,
flush the running application and run a comprehensive diagnostic
to try to further quantify the problem. If fault is found, move
the application to another device and flag the failing one for repair
or replacement.
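A minimal sketch of that scrub-and-retire idea in C; page_count(), read_page_with_ecc(), and retire_page() are hypothetical stand-ins for whatever the platform's memory-management and EDAC layers actually provide:

#include <stdbool.h>
#include <stdint.h>

#define MAX_PAGES       (1u << 20)
#define CE_RETIRE_LIMIT 3    /* corrected errors tolerated before retiring a page */

/* Hypothetical platform hooks. */
extern unsigned page_count(void);               /* physical pages under management   */
extern bool read_page_with_ecc(unsigned page);  /* true if a corrected error occurred */
extern void retire_page(unsigned page);         /* map the page out of service       */

static uint8_t ce_count[MAX_PAGES];             /* per-page corrected-error tally    */

/* Background scrubber: touch every page so latent single-bit errors are
   corrected (and counted) before they can accumulate into uncorrectable ones. */
void scrub_pass(void)
{
    unsigned n = page_count();
    if (n > MAX_PAGES)
        n = MAX_PAGES;

    for (unsigned page = 0; page < n; page++) {
        if (!read_page_with_ecc(page))
            continue;                           /* page read back clean */
        if (++ce_count[page] >= CE_RETIRE_LIMIT)
            retire_page(page);                  /* treat like a bad disk sector */
    }
}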

One advantage machines have is that they can be made to remember
\"stuff\". So, you can remember the history of these perceived
faults and take action over the long term.
 
In article <rip7a6$qo9$1@dont-email.me>,
Don Y <blockedofcourse@foo.invalid> wrote:

Among other things, they report on an experiment which automatically
sends servers to repair if the server suffers more than 100
correctable memory error per month... that\'s about one every 8 hours.

So, at *90* errors per month, the server is considered "OK"? :>

It's going to depend on the server size, and on your "threshold of
pain" - that is, how expensive is it to service a server? The more
memory you have, and (usually) the smaller the process size, the more
correctable errors per time you can expect to see. Some of the errors
are sporadic - due to alpha-particle-emitter isotopes in the IC
packaging, and due to cosmic rays. The more DRAM cross-section area
you have in the server, the higher the odds of a \"hit\" which flips a
bit. It won't do you any real good to change out a DIMM affected by
one of these.

I believe that some of the big servers are being provisioned with a
terabyte or more of DRAM. That's a lot of cell area sitting there
just waiting for a muon shower (sorta like a Star Trek red-shirt
security officer with a bullseye on the back), so it may make sense to
tolerate a larger number of errors per month before deciding that
there's something systematic wrong with the server.

They note: \"We observe that a small number of servers have a large
number of errors. For example, the top 1% of servers with the most
errors have over 97.8% of all the correctable errors we observe.\"

ISTR that there is a correlation between CORRECTABLE errors and
UNcorrectable. I.e., if machine X suffered an uncorrectable error
(often the hardware would signal a restart), then it is likely that
correctable errors preceded this (i.e., could be used as an early
warning indication -- makes sense)

Right. One of the things that the paper was talking about, was a
decision to change from "send machines to repair if they have
uncorrectable errors" to "send machines to repair if the correctable
error count gets up to a certain level". This was based on the
correlation you noted... machines with a higher correctable-error rate
are more likely to start suffering uncorrectable errors.

\"If we compute the mean error rate as in prior work, we observe 497
correctable errors per server per month. However, if we examine the
error rate for the majority of servers (by taking the median errors
per server per month), we find that most servers have at most 9
correctable errors per server per month. In this case, using the mean
value to estimate the value for the majority overestimates by over 55×.\"

But, you don\'t know if you\'ve got one of the \"nominal\" servers on your
hands... or a \"lemon\"! :-/

What this points out (there's more detail in the paper) is that a very
small percentage of the servers are affected by a large number of
correctable errors. The median number is far below the mean.

So, if 50% of the servers have 9 CECCs per month or less, and a few
percent of the servers have hundreds or thousands per month, then it
makes sense to spend your repair dollars on the few, not the many. For this
experiment they set the boundary to 100/month. You could certainly
play with that number... look at their raw figures and say "OK, if we
set the threshold to 50, then we'll end up repairing 3% of the fleet
per month, and by doing so we'll get rid of 95% of the correctable
DRAM errors and 99% of the uncorrectable ones."
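A toy C calculation of that kind of tradeoff, with invented per-server counts skewed the way the paper describes (a couple of "lemons" dominate the total):

#include <stdio.h>

#define NSERVERS 10

int main(void)
{
    /* Made-up correctable-error counts per server per month. */
    int ce_per_month[NSERVERS] = { 2, 0, 5, 9, 1, 3, 7, 4, 850, 3200 };
    int threshold = 100;

    int total = 0, flagged = 0, flagged_errors = 0;
    for (int i = 0; i < NSERVERS; i++) {
        total += ce_per_month[i];
        if (ce_per_month[i] > threshold) {
            flagged++;                          /* send this one to repair */
            flagged_errors += ce_per_month[i];
        }
    }

    /* With these numbers: repair 2 of 10 servers and remove ~99% of the CEs. */
    printf("repair %d of %d servers, removing %.1f%% of correctable errors\n",
           flagged, NSERVERS, 100.0 * flagged_errors / total);
    return 0;
}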

In the data center business, I'm sure it makes good sense to plan out
your repair strategy properly. "Laying hands" on a server is a
relatively expensive operation (it takes at least one person,
sometimes more, and you have to take all the services on that machine
down and move them elsewhere).

But support for ECC in the types of processors often used in embedded
systems (i.e., appliances) is not common. Pricing pressure often pushes
designers to using cheaper, less featureful devices.

Yup. Not a good thing, but probably inevitable. The constant "race
for the bottom" in terms of price has bad effects on reliability.

And, support for ECC alone doesn\'t help much if your only remedy is
to reboot or panic()!

Well, handling the occasional (sporadic) correctable error isn't a big
deal, I think. The memory controller deals with it, it's logged (we
hope), some human eventually sees the report (we hope) and keeps an
eye on the situation.

Uncorrectable errors are definitely much more of a problem... either
double-bit errors that ECC can't handle, or parity errors in a
parity-only system. Having to go BOOM and reboot suddenly is, well,
_maybe_ better than silently writing corrupted data to disk, or maybe
not. Hurts either way.

You can\'t guard against every error. OTOH, pretending they don\'t exist
(or that they are inherently benign) is foolish.

Agreed.

I\'ve decided that the most prudent approach is to collect data as
to OBVIOUS errors encountered while running (and 24/7/365 gives you
LOTS of opportunities to watch for f*ckups!) and try to infer from
those observations when a device is likely to be \"headed the wrong way\".

In my case, I can test (scrub) memory while the code is running and
mark \"suspect\" pages, swapping them out of service (much like
bad sector replacement on a disk). And, when the situation permits,
flush the running application and run a comprehensive diagnostic
to try to further quantify the problem. If fault is found, move
the application to another device and flag the failing one for repair
or replacement.

The \"page retirement\" approach is discussed at some length in that
paper, and yes, it can definitely help reliability. It's not a
freebie by any means but it can mitigate the cost of memory
degradation errors and extend the lifetime of the system.

One advantage machines have is that they can be made to remember
\"stuff\". So, you can remember the history of these perceived
faults and take action over the long term.

Yup. Logging is (usually) our friend.
 
On 9/2/2020 7:38 PM, Dave Platt wrote:
In article <rip7a6$qo9$1@dont-email.me>,
Don Y <blockedofcourse@foo.invalid> wrote:

Among other things, they report on an experiment which automatically
sends servers to repair if the server suffers more than 100
correctable memory error per month... that\'s about one every 8 hours.

So, at *90* errors per month, the server is considered \"OK\"? :

It\'s going to depend on the server size, and on your \"threshold of
pain\" - that is, how expensive is it to service a server? The more
memory you have, and (usually) the smaller the process size, the more
correctable errors per time you can expect to see. Some of the errors
are sporadic - due to alpha-particle-emitter isotopes in the IC
packaging, and due to cosmic rays. The more DRAM cross-section area
you have in the server, the higher the odds of a \"hit\" which flips a
bit. It won\'t do you any real good to change out a DIMM affected by
one of these.

My point was that there's no real rule-of-thumb advice to be had (for
DESIGNERS) from all these studies. They point to observations in
data centers -- where there are FULL TIME staff available to compensate
for \"problems\" (that may, ultimately, be traced back to *design* decisions)

How do you reduce all of this research to something that lets YOU decide
the acceptable error rate for a product of your own design? Isn\'t
it comparable to vendors selling PCs without ECC? (i.e., they have
decided that the folks who are more interested in price can live with
some UNKNOWN reliability...)

I believe that some of the big servers are being provisioned with a
terabyte or more of DRAM. That\'s a lot of cell area sitting there
just waiting for a muon shower (sorta like a Star Trek red-shirt
security officer with a bullseye on the back), so it may make sense to
tolerate a larger number of errors per month before deciding that
there\'s something systematic wrong with the server.

Yes. But the problem doesn't go away just because you have less capacity.
Note that servers (in addition to being "attended") tend to have better budgets
for design NRE, fabrication/component costs, operate in well-controlled
settings (no "cold aisle" in your automobile or industrial factory!),
etc.

Despite all that, they report errors "often".

What do you do if the memory in your car's ECU throws a fault? Before the
engine is started? *While* driving???

They note: \"We observe that a small number of servers have a large
number of errors. For example, the top 1% of servers with the most
errors have over 97.8% of all the correctable errors we observe.\"

ISTR that there is a correlation between CORRECTABLE errors and
UNcorrectable. I.e., if machine X suffered an uncorrectable error
(often the hardware would signal a restart), then it is likely that
correctable errors preceded this (i.e., could be used as an early
warning indication -- makes sense)

Right. One of the things that the paper was talking about, was a
decision to change from \"send machines to repair if they have
uncorrectable errors\" to \"send machines to repair if the correctable
error count gets up to a certain level\". This was based on the
correlation you noted... machines with a higher correctable-error rate
are more likely to start suffering uncorrectable errors.

Yes. But, again, they have dedicated staff to respond to these alarms
IN SHORT ORDER. And, aren't dependent on any ONE machine (server).
Again, think of the automobile throwing a fault... do you take it
offline and bring up a \"spare\" while the faulted one is being serviced?

\"If we compute the mean error rate as in prior work, we observe 497
correctable errors per server per month. However, if we examine the
error rate for the majority of servers (by taking the median errors
per server per month), we find that most servers have at most 9
correctable errors per server per month. In this case, using the mean
value to estimate the value for the majority overestimates by over 55×.\"

But, you don\'t know if you\'ve got one of the \"nominal\" servers on your
hands... or a \"lemon\"! :-/

What this points out (there\'s more detail in the paper) is that a very
small percentage of the servers are affected by a large number of
correctable errors. The median number is far below the mean.

Yes. But, again, what do you do if YOUR car is one of those that
is exhibiting an unusually high number of faults? Will you even KNOW
if that\'s the case? Will the dealer? What about the auto manufacturer?

So, if 50% of the servers have 9 CECCs per month or less, and a few
percent of the servers have hundreds or thousands per month, then it
makes sense to spend your repair dollars on the few, not the many. For this
experiment they set the boundary to 100/month. You could certainly
play with that number... look at their raw figures and say \"OK, if we
set the threshold to 50, then we\'ll end up repairing 3% of the fleet
per month, and by doing so we\'ll get rid of 95% of the correctable
DRAM errors and 99% of the uncorrectable ones.\"

In the data center business, I\'m sure it makes good sense to plan out
your repair strategy properly. \"Laying hands\" on a server is a
relatively expensive operation (it takes at least one person,
sometimes more, and you have to take all the services on that machine
down and move them elsewhere).

In my case, laying hands on a device is impractical. It boils down to
the user returning the device for replacement (assuming he isn't pissed
off that it failed and demanding his money back!).

And, at low costs (prices), it's not worth repairing most consumer
devices; the memory would be soldered down (save the cost of the connector
and the reliability issues that it brings with it) so it's an expensive
repair -- even if you can easily diagnose the problem to be in the memory
components.

But support for ECC in the types of processors often used in embedded
systems (i.e., appliances) is not common. Pricing pressure often pushes
designers to using cheaper, less featureful devices.

Yup. Not a good thing, but probably inevitable. The constant \"race
for the bottom\" in terms of price, has bad effects on reliability.

One advantage of newer memory implementations is that it sets the
bar at a slightly higher (than rock bottom!) level to add SDRAM to
a system. You can't just hack together your own memory controller
and tie it onto a parallel memory bus.

And, there's always a market for smaller, "less capable" processors
so burdening ALL with that capability would needlessly increase costs.

OTOH, it's too easy for folks to naively think that they can put a few GB
on a $5 CPU and "do wonderful things" with it... not realizing the
hazards that they've unwittingly embraced.

And, support for ECC alone doesn\'t help much if your only remedy is
to reboot or panic()!

Well, handling the occasional (sporadic) correctable error isn't a big
deal, I think. The memory controller deals with it, it\'s logged (we
hope), some human eventually sees the report (we hope) and keeps an
eye on the situation.

Again, you're assuming there's someone who is responsible for overseeing
and maintaining those devices. Who does that for your "neighbor"? :>

In my case, I have AIs that look at the error logs and "advise" the
software as to how it should address the perceived problems. The last
recourse is to complain to the user/owner ("last" because now you've
put a problem in his lap and made your device a bit of an annoyance)

Uncorrectable errors are definitely much more of a problem... either
double-bit errors that ECC can\'t handle, or parity errors in a

Or, errors that aren't even detectable! As CEs increase, the likelihood
of UCEs does, as well. As UCEs increase, the likelihood of undetectable
errors similarly rises.

So, you want to break the chain before problems get "established" and
out of control.

parity-only system. Having to go BOOM and reboot suddenly is, well,
_maybe_ better than silently writing corrupted data to disk, or maybe
not. Hurts either way.

You can\'t guard against every error. OTOH, pretending they don\'t exist
(or that they are inherently benign) is foolish.

Agreed.

I\'ve decided that the most prudent approach is to collect data as
to OBVIOUS errors encountered while running (and 24/7/365 gives you
LOTS of opportunities to watch for f*ckups!) and try to infer from
those observations when a device is likely to be \"headed the wrong way\".

In my case, I can test (scrub) memory while the code is running and
mark \"suspect\" pages, swapping them out of service (much like
bad sector replacement on a disk). And, when the situation permits,
flush the running application and run a comprehensive diagnostic
to try to further quantify the problem. If fault is found, move
the application to another device and flag the failing one for repair
or replacement.

The \"page retirement\" approach is discussed at some length in that
paper, and yes, it can definitely help reliability. It\'s not a
freebie by any means but it can mitigate the cost of memory
degradation errors and extend the lifetime of the system.

But, you can only retire pages if you have a VMM. Any system that
exposes the physical addresses to the "application" is vulnerable
to those physical addresses becoming unreliable.

The alternative is to add a level of indirection to everything
(cost!) -- and hope that those pointers are never victims!

One advantage machines have is that they can be made to remember
\"stuff\". So, you can remember the history of these perceived
faults and take action over the long term.

Yup. Logging is (usually) our friend.

This will be an interesting time for design. Technology is getting
a bit ahead of our capacity to cope with the problems it presents!
And, sadly, I suspect many folks still think of "ideal devices"
(much like not understanding metastability). So, expect surprises! :>
 
jlarkin@highlandsniptechnology.com <jlarkin@highlandsniptechnology.com> wrote:

Lots of people do serious, productive stuff on their toy PCs. Email.
Word processing. Accounting. Looking up stuff on the web.

Remember phone books and road maps and encyclopedias and filling out
bingo cards and waiting a month to get a data sheet?

Remember life before Amazon?

Most people use laptops, and they seem to be quite reliable. Mine have
all got obsolete before they broke.

My giant Dell boxes seem fine too. Windows is stupid but it works.

That is correct, but they are just praying that the computer does what
they want, and does not suddenly mess up their data because a bit
flipped in the memory. It does not happen often, and it does not always
have disastrous undetected results. Probably more often it just makes
your program crash with one of those "Microsoft Outlook has stopped
working, click here for details" message boxes that people click away
without even reading them. But in some cases it may result in a wrong
letter somewhere in an email, that you later attribute to a typing mistake.

You can be sure that Amazon and Dell themselves do not trust their
company data to systems without ECC RAM. It is your own decision if
you want to do the same.
 
On 02/09/2020 22:57, Dave Platt wrote:

Yup. For DRAM, using ECC and reporting correctable errors if above a
fairly conservative threshold seems like a good strategy. For hard
drives and SSDs, having software which monitors the S.M.A.R.T.
drive-health parameters can help. It\'s not perfect - I\'ve seen drives
with substantial failures that the S.M.A.R.T. statistics didn\'t flag,
because the vendor never considered the specific sort of failure that
occurred. On the other hand, there have been two or three occasions
in which a weekly S.M.A.R.T. off-line scan has triggered an
\"uncorrectable sector read error\" report, and issuing a \"scan and sync
RAID images\" sweep restored the weakened sector from the equivalent
data on the other RAID platter.

SMART works fine on spinning rust but not so well on SSDs, where it is
pretty much dominated by an all-or-nothing failure mode. I have never
seen one develop actually bad sectors as the wear levelling hides them.

I have seen more than one SSD controller go bad at a level where the
thing no longer responds at all to anything. Certain makers had ones
with early release firmware that trashed your data for good measure if
the wind was blowing in the wrong direction - write only storage.

--
Regards,
Martin Brown
 
On 03 Sep 2020 07:49:13 GMT, Rob <nomail@example.com> wrote:


That is correct, but they are just praying that the computer does what
they want, and does not suddenly mess up their data because a bit
flipped in the memory. It does not happen often, and it does not always
have disastrous undetected results. Probably more often it just makes
your program crash with one of those "Microsoft Outlook has stopped
working, click here for details" message boxes that people click away
without even reading them. But in some cases it may result in a wrong
letter somewhere in an email, that you later attribute to a typing mistake.

You can be sure that Amazon and Dell themselves do not trust their
company data to systems without ECC RAM. It is your own decision if
you want to do the same.

We use ECC ram and RAID hard drives for business stuff, and back
things up brutally.

The worst thing that happens is that Windows bogs itself once in a
great while and has to be rebooted.

I'm not going back to typewriters and carbon paper. Or wet film
photography. Or tape and mylar PCB layouts.

Spice is OK too.





--

John Larkin Highland Technology, Inc

Science teaches us to doubt.

Claude Bernard
 
Lasse Langwadt Christensen wrote:
mandag den 31. august 2020 kl. 17.30.41 UTC+2 skrev
jla...@highlandsniptechnology.com:
https://en.wikipedia.org/wiki/Row_hammer


That\'s amazing. If x86 can possibly do something wrong, it does.


afaiu it is not an x86 problem it is a RAM problem and has been
demonstrated on ARM devices too

Assuming the exploit would need to hammer more than one consecutive row,
could the board maker do something by swapping some of the address bits?

With different bits swapped in each board model, the exploit would have
a tough job.

I know you can't swap just any address lines because that would defeat
several kinds of fast addressing modes, but maybe there are some
possibilities.
 
On 9/3/2020 8:56 PM, Tom Del Rosso wrote:
Lasse Langwadt Christensen wrote:

afaiu it is not an x86 problem it is a RAM problem and has been
demonstrated on ARM devices too

Assuming the exploit would need to hammer more than one consecutive row,
could the board maker do something by swapping some of the address bits?

You repeatedly access the (physically!) adjacent row. Or, the rows on
either side of the row being attacked.
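For reference, the core access pattern is just a tight read-and-flush loop over two "aggressor" addresses. A rough x86-64 C sketch using the SSE2 clflush intrinsic follows; picking addresses that actually land on rows adjacent to a victim requires knowledge of the DRAM mapping, which this sketch does not attempt (the buffer offsets are placeholders):

#include <emmintrin.h>
#include <stdint.h>
#include <stdlib.h>

/* Repeatedly read two "aggressor" locations, flushing them from the cache
   each time so every access really reaches DRAM and re-activates the row. */
static void hammer(volatile uint8_t *aggressor1, volatile uint8_t *aggressor2,
                   long iterations)
{
    for (long i = 0; i < iterations; i++) {
        (void)*aggressor1;                       /* forces a row activation */
        (void)*aggressor2;
        _mm_clflush((const void *)aggressor1);   /* evict from cache */
        _mm_clflush((const void *)aggressor2);
    }
}

int main(void)
{
    /* Placeholder buffer: a real disturbance test must locate addresses that
       map to rows on either side of the victim, which this sketch does not do. */
    uint8_t *buf = malloc(1 << 20);
    if (!buf)
        return 1;
    hammer(buf, buf + (1 << 19), 1000000L);
    free(buf);
    return 0;
}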

With different bits swapped in each board model, the exploit would have
a tough job.

You'd likely also have to tweak the memory controller's initialization of
the devices.

I know you can\'t swap just any address lines because that would defeat
several kinds of fast addressing modes, but maybe there are some
possibilities.

Easier to link the modules in different orders so stuff ends up in different
places. Of course, that assumes there is no way that this information leaks
in other ways!

Note that you can notice who (task ID) is "current" when the *likely*
ECC action takes place. If you\'re always encountering EDAC problems
while \"foo()\" is executing, then perhaps foo() needs to be 86-ed!
 
Don Y wrote:
On 9/3/2020 8:56 PM, Tom Del Rosso wrote:
Lasse Langwadt Christensen wrote:

afaiu it is not an x86 problem it is a RAM problem and has been
demonstrated on ARM devices too

Assuming the exploit would need to hammer more than one consecutive
row, could the board maker do something by swapping some of the
address bits?

You repeatedly access the (physically!) adjacent row. Or, the rows on
either side of the row being attacked.

But there's no reason the physical and logical order of rows has to
match.



With different bits swapped in each board model, the exploit would
have a tough job.

You'd likely also have to tweak the memory controller's
initialization of the devices.

I know you can\'t swap just any address lines because that would
defeat several kinds of fast addressing modes, but maybe there are
some possibilities.

Easier to link the modules in different orders so stuff ends up in
different places. Of course, that assumes there is no way that this
information leaks in other ways!

Note that you can notice who (task ID) is \"current\" when the *likely*
ECC action takes place. If you\'re always encountering EDAC problems
while \"foo()\" is executing, then perhaps foo() needs to be 86-ed!

A few minutes after posting I realized that the RAM makers could do it
more effectively than the board makers, by mixing up the order of rows.
If every model of chip did that differently the exploit would never
work. There is no way to tell what RAM chips are in the system is there?
The type of RAM can be seen but not the brand.
 
On 9/4/2020 4:24 PM, Tom Del Rosso wrote:
Don Y wrote:
On 9/3/2020 8:56 PM, Tom Del Rosso wrote:

Assuming the exploit would need to hammer more than one consecutive
row, could the board maker do something by swapping some of the
address bits?

You repeatedly access the (physically!) adjacent row. Or, the rows on
either side of the row being attacked.

But there\'s no reason the physical and logical order of rows has to
match.

Yes, but there are limits on how \"mixed up\" you can make the rows.
I.e., you're swapping address lines so every row associated with
a particular value of *an* address line "moves" when you "move"
that address line (relative to the others). I.e., swap the "A1"
and "A2" and the order goes from:
0, 1, 2, 3, 4, 5, 6, 7 to
0, 1, 4, 5, 2, 3, 6, 7. It's not possible to get
0, 1, 7, 3, 2, 4, 5, 6!
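A quick C check of that permutation (swapping bits 1 and 2 of a three-bit row number, purely for illustration):

#include <stdio.h>

/* Exchange address bits 1 and 2 ("A1" and "A2") of a row number. */
static unsigned swap_a1_a2(unsigned row)
{
    unsigned a1 = (row >> 1) & 1u;
    unsigned a2 = (row >> 2) & 1u;
    row &= ~0x6u;                    /* clear bits 1 and 2 */
    return row | (a1 << 2) | (a2 << 1);
}

int main(void)
{
    for (unsigned row = 0; row < 8; row++)
        printf("%u -> %u\n", row, swap_a1_a2(row));
    /* Prints the reordering 0,1,4,5,2,3,6,7 -- constrained, not arbitrary. */
    return 0;
}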

With different bits swapped in each board model, the exploit would
have a tough job.

You'd likely also have to tweak the memory controller's
initialization of the devices.

I know you can\'t swap just any address lines because that would
defeat several kinds of fast addressing modes, but maybe there are
some possibilities.

Easier to link the modules in different orders so stuff ends up in
different places. Of course, that assumes there is no way that this
information leaks in other ways!

Note that you can notice who (task ID) is \"current\" when the *likely*
ECC action takes place. If you\'re always encountering EDAC problems
while \"foo()\" is executing, then perhaps foo() needs to be 86-ed!

A few minutes after posting I realized that the RAM makers could do it
more effectively than the board makers, by mixing up the order of rows.
If every model of chip did that differently the exploit would never
work. There is no way to tell what RAM chips are in the system is there?
The type of RAM can be seen but not the brand.

The type of RAM must be *known* (in order to program the memory controller).
You might be able to infer model numbers, vendors, etc. based on other
\"performance parameters\" that you must make known to the memory controller.

If your process has a large enough (logical) address space, you might
be able to probe your *own* address space to deduce (at least part of) the
ordering. Note that you can infer the occurrence of EDAC actions
taken by the OS by noting how much time is spent "in reality" vs. how
much is spent conceptually performing those actions.

The approach I\'ve taken is:
- make extensive use of virtual containers (thousands of processes)
- enforce constraints on resource usage (incl *specific* IPCs)
- \"randomize\" the link order of modules within an \"active entity\"
- prefer lots of small entities instead of huge, monolithic ones
- continuously scrub physical memory
- actively monitor EDAC activity
- monitor identity of entity on who\'s behalf any errors are incurred
- twiddle the logical<->physical page mappings on-the-fly
- limit who can create loadable binaries
- restrict \"user\" code to run in a virtual environment (e.g., there\'s
no way you can flush the cache, from userland, without OS involvement)
- when you suspect an attack, blacklist the entity (so it never
runs again)
I\'m probably forgetting some things... bedtime! :<

[E.g., if you see a piece of code doing something that it
\"shouldn\'t\", there are two likely reasons: a bug in the code
OR an attempt at an exploit. In either case, the code shouldn\'t
be allowed to run (it\'s not \"healthy\"). I.e., give the code
every opportunity to screw up and bitch-slap it when it does!]

There are costs associated with each of these actions -- esp
performance hits. But, I've long believed over-optimization is
a wasted design effort; make the code more *robust* (including
hardware shortcomings) with the advances in technology. The
adage: "make no improvement unless it results in a 2X performance
increase as technology will give you that 'for free' in the
time it takes you to make those smaller changes"

(OTOH, "technology" won't give you fewer bugs 'for free'!)
 
Don Y wrote:
On 9/2/2020 10:13 AM, jlarkin@highlandsniptechnology.com wrote:
On 02 Sep 2020 17:00:12 GMT, Rob <nomail@example.com> wrote:
Yes, you need to decide whether you want reliability or a toy.

Lots of people do serious, productive stuff on their toy PCs. Email.
Word processing. Accounting. Looking up stuff on the web.

And more than that!  Nothing stops you from running simulations,
compiling VHDL, laying out PCBs, etc. on a \"toy\" PC!

And, chances are, if the machine hiccups, you won\'t be able to
DEFINITIVELY tell if it was operator error, a bug in the code, or a
glitch in the hardware!

Even better - you don't really care.

Of course, there are also many cases where you won\'t even NOTICE an
error!  E.g., if your DRC runs \"OK\", are you 100.00% sure that the code
(and data on which it relied) weren\'t \"disturbed\" while it was
running?  Do you hand check all of your accounting figures to make
sure a bit didn\'t flip somewhere as the report was being printed?

I have, and do actually run a full regression on certain builds of
software. I 100% bit by bit compare the output of the previous to
the latest.

I just. Don't. See. Errors. Er, rather, the only errors I
see can be traced back to the most recent layer of changes.

An interesting take might be: If I did get an error, chances
are, I'd rerun it, and there would be no error.


I realize determinism is boring, but I seem to be awash in it. No
neutrino events for me, boyo. :)

OTOH, if you were marketing a \"DRC service\" to thousands of customers,
you\'d probably want to have a bit more confidence in your \"product\"!

Remember phone books and road maps and encyclopedias and filling out
bingo cards and waiting a month to get a data sheet?

Remember life before Amazon?

Most people use laptops, and they seem to be quite reliable. Mine have
all got obsolete before they broke.

My giant Dell boxes seem fine too. Windows is stupid but it works.

--
Les Cargill
 
On 9/4/2020 7:34 PM, Les Cargill wrote:
Don Y wrote:
So we're up to terabyte DIMMs these days.

SFAIK, there's not any error correction in most of them[1],
and we pound 'em pretty hard. I expect it all pretty much works.

Newer SDRAM technologies have consistently been addressing sources of
error both in the physical memory and the protocols used to access it.

I run multi-100-GB telemetry analyses every day and get the same answer every
time. And yes, I check.

Have you taken pot shots at random bits of raw data (manually changing a bit
here or there) and rerunning the analysis to see if the "answer" changes?
I.e., if not, then you have no way of knowing if those sorts of changes are
occurring all the time in your dataset!

[1] one wonders what happens at the DRAM controller level.

[I recall encountering a document that said something to the
effect of \"24 errors in 24 hours suggests the DIMM should be
replaced\"... REALLY? ONE PER HOUR??! So, what do they think
is a *common* failure rate?!]

You're exceeding my knowledge on the subject. :)

When designing appliances for 24/7/365 unattended operation, this
all suggested adding capabilities to the system to continually
monitor and advise as to the health of the system as the user
would otherwise be clueless to understand faults that may be
occurring without rising to the level of \"being noticeable\".

(And, you don\'t want a user complaining about a defective product!)

I'd think it would have been obvious if there was a problem.

No. The application (\"program\") can mask errors (in the code AND
data). The user\'s expectations of the program (product) can
dismiss errors as \"to be expected\" (aka bugs). And, the user
may not have access to enough empirical data to be able to
independantly verify (or dispute!) the accuracy of the operation.

E.g., my 30 year old furnace has an MCU inside that handles
sequencing of blowers, igniters, etc. Has it ever screwed up?
How would I know? Maybe the thermostat is calling for cooling right
this very second and the furnace is ignoring the request because
a flag got corrupted. As long as the temperature in the house
doesn't rise to a level that makes me wonder "why the A/C
hasn't kicked in", I'm not even going to begin to think about
whether it (or the thermostat!) is broken!

(and, how would I come to that definitive conclusion?)
 
On 9/4/2020 7:47 PM, Les Cargill wrote:
Of course, there are also many cases where you won\'t even NOTICE an
error! E.g., if your DRC runs \"OK\", are you 100.00% sure that the code
(and data on which it relied) weren\'t \"disturbed\" while it was
running? Do you hand check all of your accounting figures to make
sure a bit didn\'t flip somewhere as the report was being printed?

I have, and do actually run a full regression on certain builds of software. I
100% bit by bit compare the output of the previous to
the latest.

I just. Don\'t. See. Errors. Er, rather, the only errors I
see can be traced back to the most recent layer of changes.

An interesting take might be: If I did get an error, chances
are, I\'d rerun it, and there would be no error.

To know if you're getting errors, you'd rerun it CONTINUOUSLY and *never*
see any discrepancies. :>

I realize determinism is boring, but I seem to be awash in it. No neutrino
events for me, boyo. :)

You don't know that. As I said elsewhere in this thread, an (uncorrected)
memory fault can still be masked by the application. If an error in an
opcode fetch converts the "real" opcode into one that doesn't meaningfully
alter the progress of the algorithm FROM THE CURRENT STATE, then the
result is indistinguishable from the original intent. Similarly,
if an error in data is masked by the instruction sequence...

There have been a few studies (experiments) where they've tried to
quantify how resilient (some select, high volume) applications are to
memory faults by deliberately injecting errors and monitoring
outcomes.

The problem with that is that you can't generalize it to a "best practices"
formulation. Folks can do all sorts of data center studies (the only
place where you can run big experiments "for relatively low cost") yet
they never come up with concrete recommendations ("When your failure
rate is X FIT/Mb, replace the device")

And, you can't always rerun an application to verify the results are
unchanged from a previous run. ("Let's take the aircraft back to
Boston and run the flight over to verify the autopilot makes all
the same decisions that it did last time!")

How many times have you seen/heard someone encounter a "can't happen"...
and, because they can't reproduce the problem, they shrug it off as
if it had never happened -- EVEN THOUGH THEY SAW IT HAPPEN! Was
it a random disturbance? bizarre combination of events that triggered
a latent bug? hallucination??

I keep a smartphone in the car as a cheap camera. I've a collection
of "screen snapshots" documenting things that can't happen -- but did!
E.g., the navigation system showing me (graphically and numerically)
that my destination is 400 ft ahead on the left (which it was) -- as
well as 8.0 miles and 25 minutes away! ON THE SAME SCREEN!

(Likely a bug but not repeatable as I make the same drive and don't
see the same screwup. Do I expect the auto manufacturer to DO
anything about it?? Ha!)
 
