Replacing Failed MD-RAID Drive

Replacing A Failed Hard Drive In A Software RAID1 Array

Reference: https://www.howtoforge.com/replacing_hard_disks_in_a_raid1_array

This guide shows how to replace a failed drive in a Linux RAID1 (software RAID) array without losing data. In this example we have two drives, /dev/sda with partitions /dev/sda1 and /dev/sda2, and /dev/sdb with partitions /dev/sdb1 and /dev/sdb2.

Partitions /dev/sda1 and /dev/sdb1 make up the RAID1 set /dev/md0.
Partitions /dev/sda2 and /dev/sdb2 make up the RAID1 set /dev/md1.
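
If you are ever unsure how the partitions map onto the MD arrays on a given system, you can confirm the layout before doing anything else. A minimal check, assuming the standard lsblk and mdadm tools are installed (device names will differ between systems):

# lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# mdadm --detail --scan

lsblk shows each disk with its partitions and the md devices built on top of them, and mdadm --detail --scan prints one ARRAY line per MD device with its UUID.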

The healthy configuration can be examined by viewing /proc/mdstat:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      24418688 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
      24418688 blocks [2/2] [UU]
unused devices: <none>

How Do I Tell If A Hard Disk Has Failed?

If a disk fails, you will find alert error messages in the /var/log/messages log file, for example:

Apr 29 19:21:36 simplstor7 kernel: [kern.alert]md/raid1:md0: Disk failure on sda1, disabling device.
Apr 29 19:21:36 simplstor7 kernel: [kern.alert]md/raid1:md1: Disk failure on sda2, disabling device.
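
If you want to search the logs for these messages explicitly, something along these lines should find them; the exact log file and message wording vary by distribution, and on systemd-based systems the kernel log can be read with journalctl instead:

# grep "Disk failure" /var/log/messages
# journalctl -k | grep -i "md/raid"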

When you examine the RAID status by viewing /proc/mdstat, it now shows:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0](F) sdb2[1]
      1951503360 blocks super 1.2 [2/1] [_U]
      bitmap: 9/15 pages [36KB], 65536KB chunk
md0 : active raid1 sda1[0](F) sdb1[1]
      511936 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>

Instead of the string [UU], you will see [_U] or [U_] if you have a degraded RAID array. Use the mdadm command to look at the status of each MD array in detail:

# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.0
  Creation Time : Wed Apr 29 16:24:27 2015
     Raid Level : raid1
     Array Size : 511936 (500.02 MiB 524.22 MB)
  Used Dev Size : 511936 (500.02 MiB 524.22 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Apr 29 19:21:36 2015
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : localhost:0
           UUID : 746cafb2:567a1ca3:7b8bc047:9c3512da
         Events : 42

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

       0       8        1        -      faulty   /dev/sda1
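
The same check can be run against the second array; since /dev/sda2 has also failed, the detail output for /dev/md1 will report its /dev/sda2 member as faulty in the same way:

# mdadm --detail /dev/md1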

Removing the Failed Disk

To remove /dev/sda, we first need to mark each of its partitions as failed and then remove them from their respective RAID arrays.

First we mark /dev/sda1 and /dev/sda2 as failed:

# mdadm --manage /dev/md0 --fail /dev/sda1
# mdadm --manage /dev/md1 --fail /dev/sda2

Then we remove /dev/sda1 from /dev/md0 and /dev/sda2 from /dev/md1:

# mdadm --manage /dev/md0 --remove /dev/sda1
# mdadm --manage /dev/md1 --remove /dev/sda2
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1]
      1951503360 blocks super 1.2 [2/1] [_U]
      bitmap: 9/15 pages [36KB], 65536KB chunk
md0 : active raid1 sdb1[1]
      511936 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>

If the system has hot-swappable drives, you can remove the bad drive now. If the drive is not hot-swappable, power down the system first. Replace the old /dev/sda hard drive with a new one (it must be the same size as the old one or larger, otherwise the rebuild of the arrays will fail). After you have changed the hard disk, boot the system.
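
When pulling a drive, it helps to double-check which physical disk /dev/sda actually is, for example by matching its serial number against the label on the drive. Two ways to look the serial number up, assuming smartmontools is installed for the second command:

# ls -l /dev/disk/by-id/ | grep sda
# smartctl -i /dev/sda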

Adding the New Drive

The first thing we must do now is create the exact same partition layout on the new /dev/sda as exists on /dev/sdb. We can do this with one simple command:

# sfdisk -d /dev/sdb | sfdisk /dev/sda

You can run 'sfdisk -l' to check that both hard drives now have the same partitioning:

# sfdisk -l /dev/sda /dev/sdb
Disk /dev/sda: 243031 cylinders, 255 heads, 63 sectors/track
Units: cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sda1   *      0+     63-     64-    512000   fd  Linux raid autodetect
/dev/sda2         63+ 243031- 242968- 1951634432   fd  Linux raid autodetect
/dev/sda3          0       -       0          0    0  Empty
/dev/sda4          0       -       0          0    0  Empty

Disk /dev/sdb: 243031 cylinders, 255 heads, 63 sectors/track
Units: cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sdb1   *      0+     63-     64-    512000   fd  Linux raid autodetect
/dev/sdb2         63+ 243031- 242968- 1951634432   fd  Linux raid autodetect
/dev/sdb3          0       -       0          0    0  Empty
/dev/sdb4          0       -       0          0    0  Empty
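
Note that the disks in this example use MBR partition tables (partition type fd, Linux raid autodetect). If your drives use GPT instead, older versions of sfdisk cannot copy the layout this way; in that case sgdisk from the gdisk package can be used, for example as below (the second command gives the copied partitions new random GUIDs so they do not clash with /dev/sdb):

# sgdisk -R /dev/sda /dev/sdb
# sgdisk -G /dev/sda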

Now add /dev/sda1 to /dev/md0 and /dev/sda2 to /dev/md1:

# mdadm --manage /dev/md0 --add /dev/sda1
mdadm: re-added /dev/sda1
# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: re-added /dev/sda2

Both arrays (/dev/md0 and /dev/md1) will proceed to synchronize. View /proc/mdstat to see progress and determine when it has finished. During the synchronization the output will look like this:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      1951503360 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  0.0% (640576/1951503360) finish=304.5min speed=106762K/sec
      bitmap: 9/15 pages [36KB], 65536KB chunk
md0 : active raid1 sda1[0] sdb1[1]
      511936 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>
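
Rather than re-running cat by hand, you can have the progress line refresh automatically, for example every few seconds (press Ctrl+C to stop watching; the rebuild keeps running in the background regardless):

# watch -n 5 cat /proc/mdstat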

That's it. You have successfully replaced /dev/sda!