[Tfug] Small-ish (capacity + size) disk alternatives

Bexley Hall bexley401 at yahoo.com
Thu Jan 31 02:34:43 MST 2013


Hi John,

On 1/30/2013 9:57 PM, John Hubbard wrote:

[attr's elided]

>> It doesn't matter. Would you expect your traditional (magnetic)
>> disk to wear out once (15,000 * capacity) of data was written to it?
>> Imagine if *RAM* had a similar *inherent* limitation!
>
> I thought that we were looking at using SSDs to replace HDDs. Last time
> I checked HDDs DID wear out. Your problem is HDDs wear out.

No!  My problem is that *laptop* HDDs wear out -- not HDDs in general!
In the 30 years I've owned computers, I've had exactly three HDDs
wear out -- all laptop drives, despite the fact that I rarely *use*
a laptop (one drive in an actual laptop, two more laptop drives in
this 24/7/365 situation).

> The
> suggested solution is SSD. You are correct that SSDs wear out, just like
> the HDD that it is being suggested as a replacement for. It is a non
> sequitur to start comparing RAM (which admittedly doesn't wear out)

[Actually, RAM *does* wear out.  The AFR for RAM is comparable to that
of "regular" HDD's.]

The reason I raise the issue of "RAM reliability" is because the
disk is just another sort of read/write memory -- slower, higher
density, less power, etc.

> with an HDD replacement which (just like the HDD its replacing) does
> wear out.
>
>> And, the actual amount of "work" that you are calling on the drive
>> to do (measured in terms of bytes "written") can be considerably less
>> than that "15000 C" number! E.g., I might want to update a single
>> byte per sector/cluster/FLASH block yet this "costs" as much (in
>> terms of endurance) as if I had rewritten the entire sector/block!
>
> You are talking like every bit written to memory will lead to a new 2k
> write on the SSD.

No!  But, neither does every "block's worth" (which might be as much as
500KB!) contain 100% "new data".  That's the point here -- that the
work you are asking the disk to do can be much less than the work
it actually ends up *doing*.

E.g., if I update the "hours worked this week" for each employee
in a dataset and each employee's "record" resides in a different
FLASH block, then an entire block is erased for each employee in
the organization -- even if that's only 4 *actual* bytes per employee!
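
(A back-of-the-envelope sketch of that amplification -- the 512KB
erase block and the 1000-employee headcount below are made-up numbers,
purely for illustration:

    /* Write amplification when each employee's 4-byte update lands
     * in its own FLASH erase block.  Block size and headcount are
     * hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        const long employees   = 1000;          /* assumed headcount      */
        const long bytes_dirty = 4;             /* "hours worked" field   */
        const long erase_block = 512L * 1024;   /* assumed block size     */

        long useful = employees * bytes_dirty;  /* data we meant to write */
        long cycled = employees * erase_block;  /* media actually erased  */

        printf("useful bytes written:   %ld\n", useful);
        printf("bytes erased/rewritten: %ld\n", cycled);
        printf("write amplification:    %ldx\n", cycled / useful);
        return 0;
    }

I.e., ~4KB of requested "work" turns into ~500MB of erase/program
activity charged against the media's endurance budget.)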

> The system keeps the most used data in ram and only
> flushes to swap when there isn't enough room.

Or, if the application is actively writing to a file on that medium!
(e.g., /tmp)

> If you are really as space
> constrained as you claim you are, your system would be unusable. When a
> system is frantically writing to swap, it brings it to its knees and
> makes it unusable.

No, you are confusing "thrashing" with general "swap usage".

A process can grow in an orderly fashion and consume large amounts
of swap without thrashing.  I can create a 500GB file on a disk,
rewind the file and then *read* that file in an orderly fashion.
Never stressing the system nor interfering with the operation of
other processes running on that machine.  I.e., no thrashing.
Give me 500GB of swap and the same holds true.

E.g., I've "make world" on machines with 12MB (twelve) of RAM and
still been able to use the machine to do other things at the same time
without "feeling" the cost of those heavy processes.  The "build"
just took a very long time because gcc wants to deal with large
temp files (so *it* bears that cost).  Try it sometime.

What matters with thrashing is the pattern of (virtual) memory
accesses -- where pages have to frequently get swapped out and
faulted back in.

[Write a little program to create a 500GB file.  Run that program
while you are reading your email, browsing the web, compiling
some other program, etc.  Chances are, you'll never notice the
fact that it is running!]
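
A minimal sketch of such a program, in C (the /tmp/bigfile path and
the 500GB target below are just placeholders -- scale both to whatever
your filesystem and free space will tolerate):

    /* Stream a large file to disk sequentially, 1MB at a time. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        static char buf[1 << 20];                            /* 1MB chunk */
        const long long target = 500LL * 1024 * 1024 * 1024; /* ~500GB    */

        memset(buf, 0xA5, sizeof buf);

        FILE *fp = fopen("/tmp/bigfile", "wb");
        if (fp == NULL) { perror("fopen"); return 1; }

        for (long long done = 0; done < target; done += sizeof buf)
            if (fwrite(buf, 1, sizeof buf, fp) != sizeof buf) {
                perror("fwrite");                      /* disk full, etc. */
                break;
            }

        fclose(fp);
        return 0;
    }

Watch top(1)/vmstat(1) while it runs next to your mail reader and
browser -- the sequential writer stays out of everyone else's way.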

> You'd know if you were doing it that often. You
> aren't hitting it /that/ hard. This means that at least some of these
> bits that you are writing are being written to the ram, and the changes
> will only go to swap when there are enough in ram.

No.  All it means is those changes aren't being immediately reread
back into memory.  I.e., they can sit on the disk for a while before
they need to be accessed (read or write) again.  The application
can be systematically walking through file(s), doing its processing,
writing results (to VM, temp files or the original files, themselves)
and continuing.

Or, returning to them a few seconds later.  It doesn't matter.
That's the appeal of VM -- it's just "slow memory" for those
applications that *need* lots of memory (an application that
doesn't need much memory remains more *in* memory than "out").

>> As newer, multilevel processes become more commonplace, you'll see
>> densities go up -- and endurances go *down*!
>
> But as densities go up cost/GB goes down. If the PE cycle is cut in
> half, but you get twice the space, it will be nearly a wash.

But it isn't.  Error frequencies go up, which means more (hidden)
update cycles are incurred, etc.  Notice how "enterprise" SSDs
tend to stick to SLC technology -- trading capacity/speed for
endurance/reliability.

>> Static wear leveling also poses problems for caching and asynchronous
>> access as the drive has to be able to remap data that is already
>> "safe and secure" to blocks that may *not* be! I.e., attempting
>> to move "existing data" can fail.
>
> So what. This is what they do. If you have stuff that is never being
> written (only read) the controller will move it to a frequently written
> cell. It will do this before it thinks that there is only one write left
> on that target cell. Even if the cell fails, so what. Cells fail. Those
> cells are marked as bad, and the drive uses other cells. ALL modern SSDs
> have spare area. They can (and do) handle cell failures.

This is reflected at the interface (i.e., to the user) as indeterminism.
The application never *knows* whether the data that the drive has
previously claimed to have "written" has actually been written.
It's an exaggeration of the write caching problem.  But, it is
brought about by the inherent "endurance" limitations of the
media.  As if you had an HDD that was inherently failure prone.

>> So, the drive can't erase the
>> "good" instance of the data until/unless it has successfully
>> moved it.
>
> Again so what. This is what they do.

Imagine an HDD that was designed to *randomly* pick a block of
otherwise stable data, copy it to a new location, verify that
the copy succeeded and then erase (not just overwrite) the original --
all so that some NEW piece of data could be written in its place
(and, at the same time, updating the behind-the-scenes bookkeeping
that keeps track of all that shuffling around -- using media
with the same "characteristics" that it is trying to work around).

>> As a result, freeing up blocks that one would *think*
>> still have some wear left in them can be problematic.
>
> I fail to see the problem. SSD controllers have a complicated job to do,
> and they do it.

They *try* to do it.  I saw a recent survey claiming 17% of respondents
had an SSD fail in the first *6* months!  (of course, a survey in which
respondents self-select will tend to skew the results -- people are
more likely to bitch about their experiences than praise them!)

> That is a problem for the poor engineer who has to come
> up with the algorithms, and test them, but not for you and I. If you are
> really worried about it buy a SSD from a bigger manufacturer with more
> testing/QA. Intel comes to mind, but I'm sure that there are others.

So, you are suggesting I simply say, "Buy this particular SSD, otherwise
the system won't work"?  Would you build a MythTV box if you were told
you had to use this disk (endurance), this motherboard (performance),
this fan (sound level), etc.?  Or, would you cut some corners and then
complain later to anyone who will listen?

Imagine if your favorite Linux distro only ran on one particular PC.
How likely would you be to adopt it?  Tinker with it to get it
running on *another* "PC"?  (Or, would you just wait and hope
someone else undertook that task for you?)

>> But, the drive
>> can't report this problem
>
> This isn't a problem. This is like ECC memory. It gets fixed, you get
> the correct data, and everyone is happy.

No, it only gets fixed *sometimes*.  And, when it doesn't, you lose.
If the portion of the media involved has bookkeeping information,
you could lose *big*!

>> until long after the *need* to make that
>> space available (e.g., if data is residing in R/W cache waiting to
>> be committed to FLASH).
>
> Drives have spare area. They have space to hold a few extra bits while
> they are shifting other things around. If you 'fill' up a drive it won't
> be full.

The spare area has to be nonvolatile.  Not just RAM.  I.e., it is
either BBRAM (with some limited lifetime) or made of the same
material as the medium itself.

>> I.e., it makes the drive behave as if it
>> was "slow with a large cache"
>
> Define slow. It's a heck of a lot faster than the HDD that I'm suggesting
> you get rid of.
>
>> -- the application isn't notified of
>> writeback failures until long after the write operation that was
>> responsible has "completed".
>>
>> Again, this is a consequence of the decoupling of the storage
>> media from the application (because of all the *bloat* that
>> exists in the region between the two)
>>
>>>> (similarly, assuming you could write to the *entire* media "at will",
>>>> you're looking at 80 weeks).
>>>
>>> With the price of SSDs nowadays (provided that they do support static
>>> wear
>>> leveling), that might not be too bad, and possibly not too much more
>>> expensive (and if trends continue, might even be cheaper soon).
>>
>> You're missing the point. Would you want to have <whatever>
>> require replacement/servicing in that short an interval?
>>
>> E.g., would you want to replace/repair (labor cost$) your DVR
>> because "the disk wore out"? (in that time frame) Or, your
>> PC? In my case, should the entire multimedia/automation system
>> grind to a halt because the disk "wore out"?
>
> Would you be sure to send us the link to those HDDs you're using that
> never fail?

I have, at this moment, at least 40 drives installed in machines,
here.  The machines see varying degrees of use -- some on for
weeks at a time, others a few hours at a time (which is probably
harder on the drives than sustained use), still others with uptimes
of hundreds of days.  I repeat, I've had exactly three failures
and all three were laptop drives.

[And, as folks who know me will attest, much of my hardware is
*old* so there are lots of hours on it already!]

>> Do I design something with a built-in/inherent replacement date?
>
> Is the whole thing passively cooled or are you using fans?

Passive cooling.  The inside of the enclosure has never been above 35C.
No one wants to listen to fans 24/7/365!

> Will those fans last forever?

No fans so I guess the answer is "yes".

> Just about anything will have some part that will
> eventually stop working properly. Rather than design it so that it works
> forever ask yourself how long people will want to continue using it, and
> make sure it lasts that long. The average smart phone has to last what 2
> years?

When was the last time you replaced your thermostat?  Irrigation
controller?  Garage door opener?  Washer/dryer?  Doorbell?  TV?
Security camera(s)?  DVR?  "HiFi"?  Hot water heater?  Weatherstation?

Then, ask yourself *why* you replaced it:  because you were tired
of "last year's model"?  Because it wasn't performing as well as
it should?  Because it *broke*?

Chances are, most of these things did their job until they broke (or
were outpaced by other technological issues) and *then* were replaced.

[Your smart phone gets replaced because you want to keep up with
the Joneses; because your contract has expired and the new contract
will justify your upgrading to a new model; you want to play with
the newest toys; your provider won't support the older hardware;
you dropped it; it fell into the toilet; etc.]

>> Instead, you (I) look for technologies that let you avoid these
>> limitations. This is a lot easier to do in hindsight; considerably
>> harder in foresight! :-/
>
> You are avoiding one limitation (SSD finite erase/program cycle) but
> with HDDs you still suffer mechanical wear and tear. As you noted in
> your original email the HDDs you've been using "die pretty easily".

But those have all been laptop HDDs!  E.g., I suspect moving to
a "real" disk drive will give me the same sort of reliability
that I've seen in my other machines (though with a higher power
budget and cooling requirements).

A colleague has suggested the reason is that laptop drives are intended
for use in "disposable" laptops (how many 5-year-old laptops do you
see?).  He's dropped a pair of "good" 2.5" drives in the mail for
me to play with (though I won't know how much "better" they are
for many months...).

> You
> are the one looking for a better solution. I haven't seen many
> suggestions beyond SSDs. If your plan is to dismiss SSDs you might need
> to consider changing usage pattern, instead of just switching around
> some hardware.

I'm using a "nonvolatile solid state memory subsystem" in a similarly
designed (commercial) product.  But, there, I can control what ends
up in each portion of the memory subsystem based on the types of
usage the data are expected to see (frequency of references/updates;
required reliability; etc.).

However, that's a "closed" system.  I can "get it right" and never
worry about it thereafter!  That approach won't work with a FOSS
design.  First, someone would have to manufacture the memory
subsystem for folks who wanted it (I don't want to be involved in
that sort of thing).  Then, the people modifying the software
would have to understand the consequences of placing data in each
of these different portions.  And, ad hoc "enhancements" to the
codebase would be tedious for folks to stress test (since the
memory system has many of the same characteristics as an SSD, except
with more predictable/characterizable performance).

IMO, that's a recipe for failure.

>> I.e., you can't quantify the data access/update patterns until
>> you can actually *measure* them. And, until you've identified them,
>> you can't *alter* them to place less stress on the media.
>
> If you don't fully understand your data access/update patterns it
> doesn't seem like you can say whether or not they will overly burden an

I can look at data write *rates* (sector counters) and make conclusions
based solely on that!  The SSD won't give me any better guarantees than
total number of rewrites.  I.e., it doesn't care if I am writing
"AAAAAAAAX" or "AAAXAAAAAA" in place of "BBBBBBBBB" -- as long as either
write is a "sector".

Knowing the access patterns (at the application level) IN DETAIL lets
me restructure tables so that the data that are often updated "as a
group" tend to be grouped in the same memory allocation units.

E.g., if you're running a payroll application, then wages and taxes
are the hot items that see lots of use.  OTOH, if you are running
an application that tracks attendance (timeclock), then wage
information probably sees *less* activity than "hours worked"
(which would have to be updated daily).  In either case, employee
*name* is probably RARELY updated!
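
In C terms, that restructuring is just splitting the record along the
hot/cold line.  The field names below are invented for the timeclock
example; the point is only that the daily-updated fields get their own
allocation units, away from the data that never changes:

    #include <time.h>

    /* Cold: written when the employee is hired, rarely touched after. */
    struct employee_cold {
        unsigned int id;
        char         name[64];
        double       hourly_wage;  /* hot for payroll, cold for timeclock */
    };

    /* Hot: rewritten every shift -- kept together so the daily update
     * doesn't drag the cold data through an erase cycle with it. */
    struct employee_hot {
        unsigned int id;           /* joins back to the cold record */
        double       hours_this_week;
        time_t       last_punch;
    };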

> SSD. Are these systems entering production tomorrow? If not, roll the
> dice?

It won't be *me* that's rolling the dice!  :>  Rather, it will be
someone who tries to build and configure a similar system and
wonders why *his* choice of storage media proved "less than ideal".
Or, why the *identical* system ("as seen on TV") performs so
differently after he's made some "trivial" changes to the code.

"Gee, I just changed the payroll program to update the wages on
a daily basis -- each time the timeclock recorded additional
hours for the employee.  Now, I'm seeing problems with the wage
data's reliability..."

> It's called development for a reason. Try an SSD, monitor the
> Media Wear-Out Indicator (MWI) and you will get some idea of just how
> abusive your usage model really is.




