[Tfug] more stuff on ssds

Bexley Hall bexley401 at yahoo.com
Thu Jan 31 13:46:30 MST 2013


Hi Shanna,

On 1/31/2013 9:33 AM, shanna leonard wrote:
> Bexley Hall wrote: Hi Shanna,
>>> I like the price point for reliability of the Intel 320s - I'm planning
>>> to use them in a ZFS-based storage server soon, and I fully expect that
>>> if I over-provision the ZIL (ZFS Intent Log - caches writes) by 100% I
>>> will have it last a couple of years. Which is all I would count on from
>>> a hard drive anyway.

>> What are the consequences when your drive fails?
> In my use case, the SSDs are used for read/write caching to speed up
> access to a pool of drives. So the consequence is that access times are
> slower. Not good, not catastrophic.

And the "system" is able to detect the cache as failed and work around
it?  (Or, does that require manual intervention -- both to detect and
bypass)?

>> Who "notices"
>> the failure and acts to repair/replace it?
> Good question. I believe that the management software will give
> notification in a gui which is monitored daily in the case of complete
> disk failure.

I'm guessing you also run the entire farm off a UPS or equivalent
(e.g., many SSDs don't have battery/supercap-backed caches, so
an outage means anything not committed to flash media is lost).

My application doesn't need the nonvolatility of the SSD (*or*
the HDD), for the most part.  E.g., aside from the executables
(which almost NEVER change) and the persistent portion of the
database (which changes slowly and in very small increments),
the biggest use for the "disk" is as virtual memory and
"temporary tables" (e.g., the *dynamic* portions of the database
that are built -- and rebuilt -- on the fly by the applications).

Using a HDD is the cheapest way to get that combination of
memory requirements (nonvolatility for some, big size for the
rest) at that particular performance operating point (i.e.,
you don't need RAM access speeds -- nor the power requirements
that accompany it!)

[Note to self:  see if anyone makes a VOLATILE SSD -- "physical RAM
disk"]

>> are there "staff" actively responsible for maintaining this?
> yes

So, it's important enough to merit recognizing it and assigning it
to someone.  :>  Your home heating plant is undoubtedly just as
important (to you).  Yet, I suspect most folks are NOT proactive
in maintaining it (they just complain when it doesn't work and
call someone in for the repair -- for which they are impatient).

>> How much of your *personal* life would you rely on it?
> I would say that, interestingly for ssds, failure is more predictable
> than for hdds, so I would prefer them. So let's imagine I were using
> this to control my own pacemaker :0

Silly rabbit!  Never trust your life to a bit of man-made kit
(speaking from firsthand knowledge of how that sort of stuff is
engineered!)

> I might mirror the drives, and if it were Linux, I would install
> smartmontools, and use smartctl in a script something like this:
> http://blog.samat.org/2011/05/09/Monitoring-Intel-SSD-Lifetime-with-S.M.A.R.T.
> and have it trigger a daily report on the readout from the Media Wearout
> Indicator.

And when do you *know* it's time to proactively replace the drive(s)?
(after all, this is your HEART we're talking about!  :> )
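
For reference, a minimal sketch of the sort of daily check described
above, assuming smartmontools is installed and the drive reports an
attribute named Media_Wearout_Indicator in the "smartctl -A" table (as
Intel SSDs do).  The function names and the replacement threshold are
made up for illustration; this is not the script from the linked blog
post, and picking the threshold is exactly the open question.

```python
import subprocess

WEAROUT_ATTR = "Media_Wearout_Indicator"  # attribute name as smartctl prints it
REPLACE_AT_OR_BELOW = 10  # hypothetical threshold, not a vendor recommendation


def parse_wearout(smartctl_output):
    """Return the normalized VALUE column for the wearout attribute, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows look like: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH ...
        if len(fields) >= 4 and fields[1] == WEAROUT_ATTR:
            return int(fields[3])
    return None


def needs_replacement(value, threshold=REPLACE_AT_OR_BELOW):
    """Treat a missing reading as a reason to look at the drive."""
    return value is None or value <= threshold


def daily_report(device):
    """Run 'smartctl -A <device>' and summarize the wearout reading."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    value = parse_wearout(result.stdout)
    status = "REPLACE/INVESTIGATE" if needs_replacement(value) else "ok"
    return f"{device}: {WEAROUT_ATTR}={value} [{status}]"
```

Something like daily_report("/dev/sda") run from cron (as root) and
mailed to a human would cover the "who notices" question, though not
the "when is it too late" one.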

The (few) large disk farm studies that I've read all seem to be
frustrated at how hard it is to get predictive failure data from
the (magnetic) drives themselves.  I.e., the wrong data seem to be
monitored for a reliable indicator.

> I'd probably also use SLC nand in that application :)
>
> If it were just my house, I'd probably be comfortable with an Intel MLC
> drive like the 320, smartctl reporting, and a replacement strategy,
> (have a cold spare available)
>
> OTOH, I'm comfortable using candles for an hour. A little hardship every
> now and then breeds character!

I've been careful in the design to ensure that "core services" can
remain available -- even in the face of catastrophic failure.  E.g.,
your furnace will continue to keep the house comfortable -- but
it will no longer "sense" when you are awake/asleep and, instead,
revert to repeating the most recent sleep/wake pattern until its
confidence in the current time of day (*or* the predictability
of your sleep/wake schedule) degrades to a point where it can only 
(safely) assume "you are awake and want the house warm".

The same sort of thing happens with the irrigation controller
since it can't know how the weather is changing at the present
moment, etc.
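
Roughly, the furnace fallback looks like the sketch below.  This is an
illustration only, not the actual controller code: the hourly
sleep/wake pattern, the linear confidence decay, and every name and
number in it are assumptions chosen to keep the example short.

```python
def remaining_confidence(days_since_failure, decay_per_day=0.1):
    """Confidence in the stored pattern, decaying linearly from 1.0 to 0.0.

    The linear decay and its rate are hypothetical.
    """
    return max(0.0, 1.0 - decay_per_day * days_since_failure)


def assume_awake(last_pattern, hour, days_since_failure, min_confidence=0.5):
    """Decide whether to heat as if the occupant is awake.

    last_pattern: 24 booleans, True where the occupant was awake during
    that hour in the most recently observed day (hypothetical format).
    """
    if remaining_confidence(days_since_failure) < min_confidence:
        # Too stale to trust: fall back to the safe assumption,
        # "you are awake and want the house warm".
        return True
    # Otherwise keep replaying the most recent sleep/wake pattern.
    return last_pattern[hour % 24]
```

Efficiency suffers once the safe default kicks in (the house is heated
around the clock), but the basic function is preserved.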

I.e., the system's efficiency goes to hell but it tries to preserve
basic *functionality*.  OTOH, you won't be able to watch TV,
listen to the radio or stored music, etc.  "You are inconvenienced."

"Character" galore!  :>



