Discussion:
rant -- followup questions
hymie!
2019-05-29 13:43:00 UTC
So ... Over the last two days, I got what I think are incredibly stupid
follow-up questions.

===

(Technical one)

I have a machine. It's running RAID 5 or 6. A disk failed. My team of
users has monitoring software that notices this, so they know a disk
failed. A replacement disk has been ordered. Today I was asked (by one
of my users) if I know which disk failed.

Why the hell does it matter?

===

(Non-technical one)

I was at a convention, and I was doing data entry for an art show.
The options for the various items in the show are "unsold", "sold",
"purchased/released", or "went to auction." I needed to print the
list of "went to auction", and I only mention that because the printer
was not cooperating. But since there were only eight of them, the
person who needed the print-out was content to hand-write them rather
than fight with the printer.

As I related this story to my wife, she asked me "Which items went to
auction?"

Why the hell does it matter? Why would you think I remembered this
incredibly trivial detail?

===

Is it just part of living in The Information Age, where everybody wants
to know every detail as soon as it happens? Or am I missing some
fundamental reason why the user needs to know which RAID disk is being
replaced in a machine two time-zones away?

--hymie! http://lactose.homelinux.net/~hymie ***@lactose.homelinux.net
Roger Bell_West
2019-05-29 14:25:55 UTC
Post by hymie!
Why the hell does it matter?
They might be curious to know what brand to avoid in their own machines.

(We assume they're mixed brands.)
--
Nice people, with a religious aversion to backups; they were running
the whole company on the 286, and they had no backups whatsoever. I
sometimes wonder what happened to them.
-- Will Rose
Peter Corlett
2019-05-30 08:34:16 UTC
Post by Roger Bell_West
Post by hymie!
Why the hell does it matter?
They might be curious to know what brand to avoid in their own machines.
This doesn't necessarily help that much, since said manufacturer uses a number
of different trading names. See also Calder Hall, Windscale and Sellafield.
Grant Taylor
2019-05-29 15:48:00 UTC
Post by hymie!
I have a machine. It's running RAID 5 or 6. A disk failed. My team
of users has monitoring software that notices this, so they know a
disk failed. A replacement disk has been ordered. Today I was asked
(by one of my users) if I know which disk failed.
Why the hell does it matter?
They may have been asking to make sure that you knew which drive to replace.

Sadly, I've worked behind people that were just going to pull a random
drive and check the raid status. That's how they identify which drive
is bad.
--
Grant. . . .
unix || die
The Horny Goat
2019-05-29 16:28:02 UTC
On Wed, 29 May 2019 09:48:00 -0600, Grant Taylor
Post by Grant Taylor
They may have been asking to make sure that you knew which drive to replace.
Sadly, I've worked behind people that were just going to pull a random
drive and check the raid status. That's how they identify which drive
is bad.
I worked with a guy years ago who headed for the server room saying he
was going to do that but quickly made it clear that he was pulling my
leg.
Steve VanDevender
2019-05-30 07:59:10 UTC
Post by Grant Taylor
Post by hymie!
I have a machine. It's running RAID 5 or 6. A disk failed. My team
of users has monitoring software that notices this, so they know a
disk failed. A replacement disk has been ordered. Today I was
asked (by one of my users) if I know which disk failed.
Why the hell does it matter?
They may have been asking to make sure that you knew which drive to replace.
Sadly, I've worked behind people that were just going to pull a random
drive and check the raid status. That's how they identify which drive
is bad.
There was also the time I was called in to help with a disk replacement.
Somehow these people had obtained a server that had no drive activity
lights or any visible numbering on the drive slots. The RAID controller
management utility told us which drive of the four had failed by number.
So we made a reasonable guess about how the numbering corresponded to
slots -- left-to-right as seen from the front. This was, unfortunately,
the wrong guess.
Grant Taylor
2019-05-30 18:02:46 UTC
Post by Steve VanDevender
There was also the time I was called in to help with a disk
replacement. Somehow these people had obtained a server that had no
drive activity lights or any visible numbering on the drive slots.
The RAID controller management utility told us which drive of the
four had failed by number. So we made a reasonable guess about how
the numbering corresponded to slots -- left-to-right as seen from
the front. This was, unfortunately, the wrong guess.
Ew.

Every time I ran into that, I always went back to the identifiers on the
controller (for channel) and jumpers on the drive for ID.
--
Grant. . . .
unix || die
Peter Corlett
2019-05-30 19:43:23 UTC
Grant Taylor <***@tnetconsulting.net> wrote:
[...]
Post by Grant Taylor
Every time I ran into that, I always went back to the identifiers on the
controller (for channel) and jumpers on the drive for ID.
"Jumpers on the drive" dates you somewhat. SATA and SAS are all point-to-point
links rather than the multidrop busses of yore, and drives no longer have any
interesting jumpers.

My preferred approach is to use a competent RAID system which indicates the
serial number of the bad disk (and not just those of the working disks), which
can then be compared with the serial number printed on the tiny label on the
edge of the drive, readable by any common-or-garden electron microscope.
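
For anyone wanting to script the serial-number comparison, here is a minimal sketch. It assumes a Linux host where udev populates /dev/disk/by-id with symlinks whose names embed the model and serial number. The function name find_disk_by_serial and the example serial are made up for illustration, and a hardware RAID controller that hides the member disks from the OS would need its own vendor tool instead.

```python
import os


def find_disk_by_serial(serial, by_id_dir="/dev/disk/by-id"):
    """Return the device nodes whose by-id link name contains `serial`.

    Assumes udev-style entries such as ata-MODEL_SERIAL -> ../../sdb.
    Partition links (names containing "-part") are skipped.
    """
    matches = set()
    for name in os.listdir(by_id_dir):
        if serial in name and "-part" not in name:
            # Resolve the symlink to the real device node, e.g. /dev/sdb.
            matches.add(os.path.realpath(os.path.join(by_id_dir, name)))
    return sorted(matches)
```

Calling find_disk_by_serial("SN12345") with the serial reported for the bad disk would, under those assumptions, return something like a one-element list naming the device node. That still leaves the electron-microscope step of matching it to the label on the physical drive.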

I do occasionally use the "bugger this for a lark" disk-identification system
of momentarily yanking each disk in turn to see what turns up in the logs.
Again, this involves having selected a competent RAID system in the first place
which isn't stuck in the 1970s and doesn't fail-deadly.
Garrett Wollman
2019-05-30 21:05:04 UTC
Post by Peter Corlett
My preferred approach is to use a competent RAID system which indicates the
serial number of the bad disk (and not just those of the working disks), which
can then be compared with the serial number printed on the tiny label on the
edge of the drive, readable by any common-or-garden electron microscope.
I prefer to use a competently integrated chassis or drive shelf that
has locator lights for each disk, and use frfhgvy to flash the light
for the bad disk. (But I also try to assign software labels that
reflect the physical location.)

-GAWollman
--
Garrett A. Wollman | "Act to avoid constraining the future; if you can,
***@bimajority.org| act to remove constraint from the future. This is
Opinions not shared by| a thing you can do, are able to do, to do together."
my employers. | - Graydon Saunders, _A Succession of Bad Days_ (2015)
Peter Corlett
2019-05-31 11:18:21 UTC
Post by Peter Corlett
My preferred approach is to use a competent RAID system which indicates the
serial number of the bad disk (and not just those of the working disks),
which can then be compared with the serial number printed on the tiny label
on the edge of the drive, readable by any common-or-garden electron
microscope.
I prefer to use a competently integrated chassis or drive shelf that has
locator lights for each disk, and use frfhgvy to flash the light for the bad
disk. (But I also try to assign software labels that reflect the physical
location.)
Oh to have a hardware budget expansive enough to cover such fripperies. Why, I
bet you even have a safe electricity supply and easy physical access to the
equipment. Next you're going to tell me that you haven't ended up using Frntngr
because even though they're not fit for purpose, they're 20% cheaper than the
stuff that actually works and are already in stock and need using up.
Garrett Wollman
2019-05-31 17:04:46 UTC
Post by Peter Corlett
Oh to have a hardware budget expansive enough to cover such fripperies. Why, I
bet you even have a safe electricity supply and easy physical access to the
equipment.
Theoretically. Except for about half of the servers are in a remote
DC 90 miles away, and I'm one of only two people from our group who
are authorized. Luckily we have "remote hands" there but they're
mostly good for pushing buttons and taking pictures of consoles. Oh,
and stuff in the remote DC has 208 power and IEC connectors on the
PDU, which is fine for normal servers but not so great for things that
require wall warts. (Do your PDUs have CEE 7/[357], BS1363, or IEC
60320?)
Post by Peter Corlett
Next you're going to tell me that you haven't ended up using Frntngr
because even though they're not fit for purpose, they're 20% cheaper
than the stuff that actually works and are already in stock and need
using up.
Actually, Frntngr is our preferred vendor, but the integrators we work
with seem to prefer JQ these days. And now we're building more
SSD-only fileservers so there's a completely different set of vendors
whose modes of suckitude we haven't yet identified.

-GAWollman
--
Garrett A. Wollman | "Act to avoid constraining the future; if you can,
***@bimajority.org| act to remove constraint from the future. This is
Opinions not shared by| a thing you can do, are able to do, to do together."
my employers. | - Graydon Saunders, _A Succession of Bad Days_ (2015)
Grant Taylor
2019-05-31 17:47:16 UTC
Post by Garrett Wollman
Theoretically. Except for about half of the servers are in a remote
DC 90 miles away, and I'm one of only two people from our group who
are authorized. Luckily we have "remote hands" there but they're
mostly good for pushing buttons and taking pictures of consoles.
That's when a good OoB console / remotely managed PDUs /
iDRAC/iLOM/iLO/etc. are nice things to have.
Post by Garrett Wollman
Oh, and stuff in the remote DC has 208 power and IEC connectors on
the PDU, which is fine for normal servers but not so great for things
that require wall warts. (Do your PDUs have CEE 7/[357], BS1363,
or IEC 60320?)
That means that there is extremely likely 3ɸ power to the DC, feeding
PDUs with 1ɸ wired across two legs. I'm betting that each ɸ is 120 VAC
to ground. This means that you can use a C14 to NEMA 5-15 adapter like
the following to connect wall warts.

https://www.amazon.com/ACA1017-Adapter-Official-Certification-Standard/dp/B07DCWXTYM

Obviously, confirm with the facility electrician.

I've got a handful of these in my DC.
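
The arithmetic behind the "208" figure, as a small sketch. It assumes a balanced three-phase wye supply with 120 V from each leg to neutral, which is the bet made above and not something confirmed for that facility. The function name is made up for illustration.

```python
import math


def line_to_line_voltage(line_to_neutral):
    """Voltage between two legs of a balanced three-phase wye supply."""
    return math.sqrt(3) * line_to_neutral


# 120 V per leg gives about 207.8 V across two legs, nominally "208".
print(round(line_to_line_voltage(120.0), 1))
```

Under that assumption, a PDU outlet wired across two legs presents roughly 208 V to whatever is plugged into it.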
--
Grant. . . .
unix || die
Michel
2019-06-03 07:54:57 UTC
Post by Garrett Wollman
Actually, Frntngr is our preferred vendor, but the integrators we work
with seem to prefer JQ these days. And now we're building more
SSD-only fileservers so there's a completely different set of vendors
whose modes of suckitude we haven't yet identified.
Yes.

Fnzfhat, for one, who told us to order their stuff from $supplier.
Ok fine. Then, when one SSD inevitably went titsup, said supplier
took 4 months to handle a simple warranty case.

We also had to explain to them that a 4 TB 860 EVO consumer SSD is
not equivalent to, nor an acceptable replacement for a 4 TB PM863a.
And that we were a bit worried about when any of the remaining 62
SSDs from that order would follow suit.

$supplier eventually replaced it, after much back and forth, with a
vagry of acceptable spec.

Having only ever used JQ at previous @ork, this was a bit of a shock.
Peter Corlett
2019-07-08 07:29:05 UTC
Post by Peter Corlett
Oh to have a hardware budget expansive enough to cover such fripperies. Why,
I bet you even have a safe electricity supply and easy physical access to
the equipment.
(To clarify, I was referring to my domestic kit. The stuff in datacentres is
rented and therefore dealing with the hardware is Somebody Else's Problem.)

[...]
(Do your PDUs have CEE 7/[357], BS1363, or IEC 60320?)
"PDU" is a fancy name for an extension lead. Those are a mix of BS1363 and CEE
7/7. Which are plugged into Ol' Sparky CEE 7/1 sockets because earth
connections or indeed building wiring newer than 1964 is for wusses. No wonder
that one of the flats in the block goes up in flames every few years. I intend
to move out before this one joins them.

[...]
Actually, Frntngr is our preferred vendor, but the integrators we work with
seem to prefer JQ these days. And now we're building more SSD-only
fileservers so there's a completely different set of vendors whose modes of
suckitude we haven't yet identified.
My admittedly relatively limited experience with SSDs is that data which is not
also backed up to hard disk might as well not exist. The phrase "RAID is not a
backup" applies in spades with SSDs.
Sir Chewbury Gubbins
2019-07-11 12:35:21 UTC
Post by Peter Corlett
My admittedly relatively limited experience with SSDs is that data which is not
also backed up to hard disk might as well not exist. The phrase "RAID is not a
backup" applies in spades with SSDs.
</lurk> I did once enjoy a long, confused, blinking session at a $coworker
who thought it would be a great idea to run SSDs in a mirrorset. <lurk>

J
--
John Dow <***@nelefa.org.invalid>
... Blog & Game Diary : http://www.nelefa.org
/|\ Constructed using Mutt, Tin and Vi.
/ | \ Zomoniac is Wrong. Fact.
Chris Adams
2019-07-11 15:06:58 UTC
Post by Sir Chewbury Gubbins
</lurk> I did once enjoy a long, confused, blinking session at a $coworker
who thought it would be a great idea to run SSDs in a mirrorset. <lurk>
Why wouldn't you run SSDs in a mirror, assuming a proper RAID setup that
supports SSDs (for example, can pass down TRIM)?

RAID is about high availability... most things don't handle a filesystem
going away very well, so RAID allows the system to continue to operate
while you replace failed drives. You can (and should) have HA above the
single system layer as well, but usually failure at that level is at
least somewhat disruptive.
--
Chris Adams <***@cmadams.net>
Alexander Schreiber
2019-07-13 15:59:43 UTC
Post by Sir Chewbury Gubbins
Post by Peter Corlett
My admittedly relatively limited experience with SSDs is that data which is not
also backed up to hard disk might as well not exist. The phrase "RAID is not a
backup" applies in spades with SSDs.
</lurk> I did once enjoy a long, confused, blinking session at a $coworker
who thought it would be a great idea to run SSDs in a mirrorset. <lurk>
Why not? Knowing that SSDs tend to fail quietly and totally (whereas
spinning rust usually warns you with bad blocks before entirely dying),
putting them in a mirror at least gives you a chance to survive the failure
of one of them and continue to run (and then quickly replace the failed
one). If both fail, well, that's what your backups are for. Sure, RAID
is not backup, but it can do wonders for service availability. Having
to cold restore from backup tends to be somewhat disruptive, usually.
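
A back-of-the-envelope sketch of the availability argument. It assumes each drive fails independently with the same probability over some window, and the 2% figure is purely hypothetical. Drives from the same batch under the same write load may well fail together, so treat the result as an optimistic bound.

```python
def mirror_loss_probability(p_single, copies=2):
    """Chance that every member of a mirror fails in the same window.

    Assumes failures are independent, which is optimistic for
    identical drives bought together.
    """
    return p_single ** copies


p = 0.02  # hypothetical per-drive failure probability for the window
print("single drive:", p)
print("two-way mirror:", mirror_loss_probability(p))
```

Even with that caveat, the mirror only covers drive failure. Deleted or corrupted data is mirrored just as faithfully, which is why the backups still matter.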

Kind regards,
Alex.
--
"Opportunity is missed by most people because it is dressed in overalls and
looks like work." -- Thomas A. Edison
Peter Corlett
2019-07-16 23:01:43 UTC
Post by Sir Chewbury Gubbins
Post by Peter Corlett
My admittedly relatively limited experience with SSDs is that data which is
not also backed up to hard disk might as well not exist. The phrase "RAID is
not a backup" applies in spades with SSDs.
</lurk> I did once enjoy a long, confused, blinking session at a $coworker
who thought it would be a great idea to run SSDs in a mirrorset. <lurk>
Check out uggcf://jjj.nznmba.qr/qc/O07Q998212. €88 per terabyte. Prime Day has
already ended on this side of the North Sea, so that's the regular deal.
Welcome to the future.

At that sort of price, and with the commensurate reliability of all
slightly-too-cheap consumer-grade storage, you're a fool to not buy a second
and mirror them.
The Horny Goat
2019-05-30 21:18:56 UTC
Post by Peter Corlett
Post by Grant Taylor
Every time I ran into that, I always went back to the identifiers on the
controller (for channel) and jumpers on the drive for ID.
"Jumpers on the drive" dates you somewhat. SATA and SAS are all point-to-point
links rather than the multidrop busses of yore, and drives no longer have any
interesting jumpers.
Indeed - the most recent time I dealt with anything like "interesting
jumpers" was when I was working on my (personal) Apple II.

How times change! (wink)
Chris Adams
2019-05-31 13:52:19 UTC
Post by Peter Corlett
I do occasionally use the "bugger this for a lark" disk-identification system
of momentarily yanking each disk in turn to see what turns up in the logs.
Again, this involves having selected a competent RAID system in the first place
which isn't stuck in the 1970s and doesn't fail-deadly.
My "no idea which drive is which" method is to run something on the
system continuously reading the drive (like dd if=/dev/sda
of=/dev/null), and watch the drive activity LEDs. Pull the drive(s)
with no LED lit!
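
For a box without dd to hand, the same sustained read can be sketched in Python. This is only an illustration of the idea. The function name and parameters are made up, reading a raw block device normally needs root, and repeated reads of a small region may be served from the page cache without touching the disk.

```python
def read_continuously(path, chunk_size=1024 * 1024, max_bytes=None):
    """Read `path` sequentially and discard the data, like dd to /dev/null.

    Returns the number of bytes read. With max_bytes=None it reads
    until end of device, keeping the activity LED busy meanwhile.
    """
    total = 0
    with open(path, "rb", buffering=0) as dev:
        while max_bytes is None or total < max_bytes:
            want = chunk_size
            if max_bytes is not None:
                want = min(chunk_size, max_bytes - total)
            chunk = dev.read(want)
            if not chunk:  # end of device or file
                break
            total += len(chunk)
    return total
```

Something like read_continuously("/dev/sda") would stand in for the dd invocation above. The device path is just the one from the example and would differ per system.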
--
Chris Adams <***@cmadams.net>
Grant Taylor
2019-05-31 14:47:55 UTC
Post by Chris Adams
My "no idea which drive is which" method is to run something on
the system continuously reading the drive (like dd if=/dev/sda
of=/dev/null), and watch the drive activity LEDs. Pull the drive(s)
with no LED lit!
That's my preferred method.

But it does require drive activity LEDs. I've had more than one
occasion where I didn't have that luxury.
--
Grant. . . .
unix || die
Scott
2019-05-29 16:48:29 UTC
Post by hymie!
I have a machine. It's running RAID 5 or 6. A disk failed. My team of
users has monitoring software that notices this, so they know a disk
failed. A replacement disk has been ordered. Today I was asked (by one
of my users) if I know which disk failed.
Why the hell does it matter?
Just say yes. Do you know which disk failed? Yes. Yes I do.

Same answer you give a traffic cop when he asks, do you know how fast
you were going? Yes. Yes I do.

And leave it at that.
Wojciech Derechowski
2019-05-29 18:26:13 UTC
Post by hymie!
So ... Over the last two days, I got what I think are incredibly stupid
follow-up questions.
One of the worst follow-up questions I can think of is what if... he put
forth his hand, and take also of the tree of life, and eat, and live
for ever...? or something to that effect, not to mention an extremely bad
case of induction that followed it.

WD
--
Who is Entscheidungs and what is his problem?
Satya
2019-05-30 07:59:03 UTC
Post by hymie!
As I related this story to my wife, she asked me "Which items went to
auction?"
Why the hell does it matter? Why would you think I remembered this
incredibly trivial detail?
Lrnu zl jvsr nfxf zr gevivny qrgnvyf yvxr gung (nqzvggrqyl fbzr ner
aba-gevivny) naq V'z bire urer guvaxvat V unir orggre guvatf gb erzrzore, yvxr
gur rknpg bcgvbaf V arrq sbe eflap gb qb gur evtug guvat.
--
A feature is a bug with seniority.
Mans Nilsson
2019-07-23 13:33:38 UTC
Post by Satya
Lrnu zl jvsr nfxf zr gevivny qrgnvyf yvxr gung (nqzvggrqyl fbzr ner
aba-gevivny) naq V'z bire urer guvaxvat V unir orggre guvatf gb erzrzore, yvxr
gur rknpg bcgvbaf V arrq sbe eflap gb qb gur evtug guvat.
FJZOB vf n ybg orggre jvgu eflap guna lbhef gehyl. V raq hc nfxvat ure,
be erfbegvat gb gne va cvcrf.
--
Måns Nilsson primary/secondary/besserwisser/machina
MN-1334-RIPE SA0XLR +46 705 989668
Content: 80% POLYESTER, 20% DACRON ... The waitress's UNIFORM sheds
TARTAR SAUCE like an 8" by 10" GLOSSY ...