To save a lot of money when building out servers and services in my home lab, I buy “cheap” ex-datacenter SAS drives from eBay. I already know the deal: I’m trading money for my time and, sometimes, frustration.

The value is fantastic: enterprise-grade disks for a fraction of the cost (around AU$10-15/TB), but they don’t come plug-and-play. They come with history, quirks, and sometimes wildly inconsistent behaviour. It’s usually a good way to learn more about storage hardware than you bargained for.

Info

The first head bang most people hit is the notorious “pin 3 power problem”.

My strategy is to build reliable storage not by paying a premium for new disks, but by implementing resiliency directly at the file system level (with benefits).

Here I will go through the issues I hit during a burn-in process: what I assumed was happening, what was actually happening, and how I confirmed and fixed it.

The Issue

I picked up a batch of HGST 4TB SAS drives from multiple sellers and started my standard commissioning burn-in before adding them to a ZFS pool.

Goal

The goal is simple: don’t trust the disks; verify them.

I kicked things off with:

badblocks -wsv -b 4096 /dev/sdX
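
A full write-and-verify pass over a 4TB drive takes a long time, so when burning in a whole batch it’s worth running the drives in parallel. A rough sketch (the device names and log path are placeholders; -o just records any bad blocks each drive reports):

# Run the same destructive pass on several drives at once,
# logging any bad blocks each drive reports to its own file.
for dev in sdb sdc sdd sde; do
    badblocks -wsv -b 4096 -o /root/badblocks-$dev.txt /dev/$dev &
done
wait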

Pretty quickly, something stood out.

  • Some drives were progressing normally
  • Others were painfully slow

Not slightly slower; orders of magnitude slower, days s-l-o-w. The kind of slow where you start wondering if the drive is about to fall off the perch and whether I’d just lost the eBay disk roulette.

My Initial Assumption

My first thought was around error recovery.

With SATA drives, TLER (Time-Limited Error Recovery) is a known factor. Drives without it can hang for ages trying to recover a bad sector, which is bad news for RAID/ZFS.
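
(As an aside: on SATA drives that support it, that recovery limit is usually checked and capped with smartctl’s SCT ERC commands; the values are in units of 100 ms, so 70 means 7 seconds.)

# Read the current SCT Error Recovery Control timeouts
smartctl -l scterc /dev/sdX
# Cap read and write recovery at 7 seconds each
smartctl -l scterc,70,70 /dev/sdX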

So the assumption was:

“Maybe this is a SAS equivalent of TLER behaviour.”

That instinct wasn’t wrong—but it wasn’t the full story either.

SAS vs SATA: The Important Difference

SATA drives:

  • Use TLER/ERC to limit recovery time
  • Often need tuning for RAID use

SAS drives:

  • Use SCSI error recovery (built-in)
  • Controlled via mode pages (not a simple toggle)

So yes, SAS drives already behave like TLER-enabled drives. Very enterprisey.

But here’s the catch: The behaviour is firmware-defined, and not all SAS drives behave the same.

Digging Deeper with sdparm

To understand what was happening, I pulled the error recovery settings:

sdparm -p rw /dev/sdX
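
To compare a whole batch, the same page can be dumped for every drive in one pass (the /dev/sd? glob is an assumption; adjust it to match where your SAS drives sit):

# Print the read-write error recovery mode page for each disk
for dev in /dev/sd?; do
    echo "== $dev =="
    sdparm -p rw $dev
done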

That’s where things got interesting. Two key parameters stood out:

  • RTL (Recovery Time Limit)
  • PER (Post Error Reporting)

And suddenly, the drives split into two clear groups.

The Defining Difference

Some drives had:

  • RTL = 8000
  • PER = 0

Others had:

  • RTL = 0
  • PER = 1

At a glance, they’re just numbers. In practice, they completely change how the disk behaves.

👉 What is RTL 👈

RTL is the main issue here.

  • RTL = 8000 → recovery is time-limited (~8 seconds)
  • RTL = 0 → unlimited retries (∞ seconds)

Now think about what badblocks is doing:

  • Write data
  • Read it back
  • Wait for the disk to respond

If the disk hits a weak sector:

  • With RTL=8000 → it gives up quickly → test continues
  • With RTL=0 → it retries indefinitely → test appears to hang

That “slow disk” wasn’t slow; it was busy trying very hard not to fail.
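
If you only care about this one field, sdparm can pull it directly, using the same acronym the --set command uses later on (the value is in milliseconds, so 8000 ≈ 8 seconds):

# Read just the Recovery Time Limit field
sdparm --get=RTL /dev/sdX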

What PER Is (and Isn’t)

PER controls whether recovered errors are reported.

  • PER = 0 → silent recovery
  • PER = 1 → report recovered errors

Important detail: PER does not control retry time.
It just controls visibility.

That said, drives with PER=1 are often tuned for more aggressive recovery behaviour, which adds to the effect.
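
With PER=1, those recovered errors land in the drive’s SCSI error counter logs, which smartctl can print for SAS drives (exact field names vary by drive and smartctl version):

# Show the error counter log (reads/writes corrected by retries and ECC)
smartctl -l error /dev/sdX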

Confirming the Diagnosis

To make sure this wasn’t just theory, I checked a few things.

  1. Error recovery settings
    sdparm -p rw /dev/sdX
  2. SMART data
    smartctl -a /dev/sdX
  3. Behaviour under load
    Watching how consistently the slowdown occurred
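
In practice, a small loop makes it easy to line the first two checks up side by side across every drive (the /dev/sd? glob and the grep pattern are assumptions based on typical SAS smartctl output):

# Per-drive summary: recovery settings plus the SMART grown defect count
for dev in /dev/sd?; do
    echo "== $dev =="
    sdparm --get=RTL $dev
    sdparm --get=PER $dev
    smartctl -a $dev | grep -i "grown defect"
done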

The pattern was becoming clear:

  • Drives with RTL=0 were consistently slow
  • Drives with RTL=8000 behaved normally

SMART data helped distinguish between:

  • Firmware behaviour (clean stats, just slow)
  • Actual degradation (growing defect list, errors)

What I Was Seeing

I could roughly interpret outcomes like this:

  • Slow + clean SMART = aggressive firmware, not necessarily bad
  • Slow + errors = drive is struggling, higher risk
  • Fast + clean = ideal candidate for ZFS

This distinction matters, because not every slow disk is a failing disk. But it is a problem disk in a ZFS pool.

Why This Matters for ZFS

Important

ZFS expects disks to behave predictably.

If you mix drives with:

  • Different recovery time limits
  • Different retry strategies

You can end up with:

  • I/O stalls
  • Latency spikes
  • Pool performance issues

Even if every disk is technically healthy.
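
Once a pool is built, this kind of mismatch usually shows up as one disk with much higher wait times than its siblings. zpool iostat’s latency view makes that easy to spot (the pool name here is a placeholder):

# Per-device latency stats every 5 seconds; watch for one disk with outlier disk_wait
zpool iostat -v -l tank 5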

Fixing the Problem

To keep all the disks in a pool behaving consistently, I made sure they had the same recovery settings:

sdparm --set=RTL=8000 --save /dev/sdX  
sdparm --set=PER=0 --save /dev/sdX

After that:

  • badblocks runtimes became consistent
  • No more “mystery slow disks”
  • Behaviour matched expectations for ZFS

Note

Not all firmware will respect changes, so always verify after applying.
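
The simplest verification is just to re-read the fields:

# Confirm the new values actually took effect
sdparm --get=RTL /dev/sdX
sdparm --get=PER /dev/sdX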

A Better Burn-In Process

After going through this, my process now looks like:

  1. Destructive test (faster, still effective)
    badblocks -wsv -b 65536 -t 0x00 /dev/sdX

  2. SMART long test
    smartctl -t long /dev/sdX

  3. Review SMART data
    smartctl -a /dev/sdX

  4. Validate recovery settings
    sdparm -p rw /dev/sdX
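
Stitched together, the sequence for a single drive looks roughly like the sketch below (a hypothetical burnin.sh wrapper; it is destructive, and the SMART long self-test runs inside the drive, so wait for it to finish before reading the results):

#!/bin/bash
# Usage: ./burnin.sh /dev/sdX  -- destructive, double-check the device!
dev=$1
badblocks -wsv -b 65536 -t 0x00 $dev   # 1. destructive pattern test
smartctl -t long $dev                  # 2. start the SMART long self-test
# ...wait for the self-test to complete (it can take hours)...
smartctl -a $dev                       # 3. review SMART data
sdparm -p rw $dev                      # 4. validate RTL/PER recovery settings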
    

This gives me both:

  • Host-level validation (badblocks)
  • Firmware-level validation (SMART)

Process Logic

Visually, this is my logic for the disk burn-in and for resolving this issue:

  %%{init: {'flowchart': {'curve': 'linear'}}}%%
flowchart TD 
A[Buy used SAS drives] --> B[Run burn-in] 
B --> C{badblocks slow on some disks?} 
C -->|No| H[Run SMART long test]
C -->|Yes| D[Check sdparm rw page]
D --> E[Compare RTL and PER] 
E --> F{RTL = 0?}
F -->|Yes| G[Set RTL to ~8 sec]
F -->|No| I[Decide: keep, tune, or reject]
G --> H
H --> I

(RTL) Recovery Time Limit Flow

The logic for identifying and resolving RTL issues:

  %%{init: {'flowchart': {'curve': 'linear'}}}%%
flowchart TD 
A[Slow badblocks] --> B[Check RTL] 
B --> C{RTL = 0?} 
C -->|Yes| D[Drive may retry longer] 
C -->|No| E[Recovery is time limited] 
D --> F[Expect long pauses] 
E --> G[More predictable runtime] 
A --> H[Check SMART] 
H --> I{Errors growing?} 
I -->|Yes| J[Possible media degradation] 
I -->|No| K[Likely firmware behaviour]

My Takeaway on This

Cheap ex-datacenter SAS drives are absolutely worth it, but only if you put the time in.

What you’re really buying is:

  • Enterprise hardware
  • With unknown history
  • And inconsistent firmware policies

The effort invested in testing and standardising turns them into reliable storage.

What looked like a batch of “slow drives” turned out to be something much more subtle:

  • A firmware-level mismatch in how disks handle errors.
  • Once I understood and validated that, I stopped guessing, applied the fix, and moved on to the next task.

This is the difference between throwing disks into a pool and actually building something reliable.