Upgrading a ZFS Raid-1 Root Drive in Proxmox

When I first built my Proxmox server I just used drives left over from upgrading previous builds: a pair of 120GB SanDisk SSDs as the boot mirror, two 250GB Crucial SSDs in a ZFS RAID-1 holding my VMs, and a 250GB Corsair Force MP510 set up as a Directory for backups, because I thought it would help the speeds when backing up my VMs (it really didn't).  Using an NVMe drive in the onboard M.2 slot on the Gigabyte B450M Aorus motherboard limits the number of SATA ports to 4, so I found myself running low on disks.  The plan is to migrate the boot mirror (which Proxmox calls rpool) to the two Crucial SSDs and run most of the VMs directly off that.

The official Proxmox instructions for replacing a failed disk in the root pool cover the majority of the process, but I had to take a few extra steps since I was also upgrading the size of the pool.  Once I made sure all the VMs were backed up on the NVMe, I pulled one of the SanDisks and opened up the shell in Proxmox.  Since both of my boot drives were currently fine, the pulled SanDisk served as my failsafe in case I messed anything up in the process, as it wouldn't be touched until the process was complete.

  root@einherjar:~# zpool status -v
  pool: rpool
  state: DEGRADED
  scan: resilvered 63.5G in 0 days 00:27:32 with 0 errors on Sat Apr  4 09:24:53 2020
  config:
  NAME        STATE     READ WRITE CKSUM
    rpool       DEGRADED     0     0     0
      mirror-0  DEGRADED     0     0     0
        sdd3    UNAVAIL      0     0     0
        sdc3    ONLINE       0     0     0

Whatever disk you pull is going to show up as unavailable.  Mine show up here as the partitions /dev/sdd3 and /dev/sdc3, but disk IDs can also show up; just go off whatever your pool status shows.  After identifying the pulled drive, take it offline in the pool, then run status again to confirm.
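If your pool lists long disk IDs instead of the short /dev/sdX names, the symlinks under /dev/disk/by-id map one to the other.  For example, to find which ID belongs to sdc (my remaining good drive), something like:

  root@einherjar:~# ls -l /dev/disk/by-id/ | grep sdc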

  root@einherjar:~# zpool offline rpool /dev/sdd3
  root@einherjar:~# zpool status -v
  pool: rpool
  state: DEGRADED
  status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
  action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 63.5G in 0 days 00:27:32 with 0 errors on Sat Apr  4 09:24:53 2020
  config:
  NAME        STATE     READ WRITE CKSUM
    rpool       DEGRADED     0     0     0
      mirror-0  DEGRADED     0     0     0
        sdd3    OFFLINE      0     0     0
        sdc3    ONLINE       0     0     0

From here, plug the replacement drive in (my Crucial 250GB in this case) and get the disk location either in the Proxmox GUI or by running fdisk -l in the shell (we'll go with /dev/sdb for the guide).
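If you're not sure which letter the new drive picked up, lsblk can list every disk with its size and model, which makes the new one easy to spot (these are standard lsblk output columns):

  root@einherjar:~# lsblk -o NAME,SIZE,MODEL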

The next step is to copy the partition table over from the current SanDisk 120GB SSD to the Crucial.  As the official wiki stresses, the order the drives are placed in the sgdisk command is important: getting it wrong will wipe your current boot disk and that pool, taking all the Proxmox configs with it.

sgdisk --replicate=/dev/target /dev/source

So the target disk goes first and the source second.  In my case the command in the shell was

  root@einherjar:~# sgdisk --replicate=/dev/sdb /dev/sdc

This should copy the partitions over.  At this step, if you are like me and upgrading to bigger disks, you will need to resize the ZFS partition on the new drive (in my case /dev/sdb3), since sgdisk copied over a 120GB layout and that leaves roughly 130GB of free space on this drive that won't be used otherwise.  I took the wiki's advice and used parted to resize it, which I first had to install.
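Since Proxmox is Debian underneath, installing it is just:

  root@einherjar:~# apt install parted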

Run parted on your current root disk first; you should see something similar to the following

root@einherjar:~# parted /dev/sdc
GNU Parted 3.2
Using /dev/sdc
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print                                                            
Model: ATA CT250MX500SSD1 (scsi)
Disk /dev/sdc: 250GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name  Flags
 1      17.4kB  1049kB  1031kB                     bios_grub
 2      1049kB  538MB   537MB   fat32              boot, esp
 3      538MB   249GB   248GB   zfs

Of note are the disk size (the "Disk /dev/sdc: 250GB" line) and the End value for partition 3, as that is the partition we are going to resize.  Run parted or fdisk on your new disk to verify its size, then run the following commands

root@einherjar:~# parted /dev/sdb
GNU Parted 3.2
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) resizepart                                                            
Partition number? 3

Parted then asks for the new end position.  I input 248GB, when in actuality it should have been 249GB; looking at the free space available on the disk in fdisk afterwards, I had left 1GB out.  Either way, the ZFS pool partition on the disk is now (roughly) the correct size.
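In hindsight, rather than typing a size by hand, parted can grow the partition to the end of the disk for you, which avoids my off-by-1GB mistake; resizepart accepts a percentage as the end position, and -s skips the interactive prompts, so something like this should do it in one shot:

  root@einherjar:~# parted -s /dev/sdb resizepart 3 100%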

root@einherjar:~# sgdisk --randomize-guids /dev/sdb
The operation has completed successfully.

The wiki suggests doing this to avoid the two disks having identical GUIDs and confusing the server, and I didn't want to deal with a kernel panic, so there it goes.  Next is the step that didn't work for me:

grub-install /dev/sdb

If this completes for you, great.  If it doesn't, you need to fix it, otherwise the drive won't be able to boot the OS, as I came to find out.  The easiest way I got this going was to just copy the working bios_grub and EFI partitions (partitions 1 and 2) off the currently working boot drive.  Thankfully, after running the sgdisk command the new drive has the same partitions at the same start sectors.

# dd if=/dev/sdc1  of=/dev/sdb1 

The above command tells dd to use /dev/sdc1 as the input file and write it to the output file /dev/sdb1.  Do the same for partition 2, and the drive should be able to properly boot now.
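For partition 2 it's the same one-liner; adding status=progress (a standard GNU dd option) lets you watch the copy:

  # dd if=/dev/sdc2  of=/dev/sdb2 status=progress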

We can now replace the pulled SanDisk with the Crucial drive in the rpool.  The syntax for the command is

zpool replace <pool> <device> [new-device]

Here <device> is the name of the drive you pulled, and [new-device] is the one we've been formatting all this time.  Again, you need to write the devices exactly as they show up in zpool status -v, so if the pool lists disk IDs or UUIDs for either drive we're swapping, go with that or it won't work.

root@einherjar:~# zpool replace rpool /dev/sdd3 /dev/sdb3
Make sure to wait until resilver is done before rebooting.

This will start the resilvering process, where the data from the current root drive gets mirrored over to the new root drive.  Do not power down the server until this is done.  You might also need a -f flag on the replace line; if the shell prompts for it, add it in, as I needed to for this first swap.
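For reference, the forced version is just the same command with the flag added (same device names as above):

  root@einherjar:~# zpool replace -f rpool /dev/sdd3 /dev/sdb3

Check on the status during the resilver and you should see output similar to the below, depending on your drives and pool size: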

  root@einherjar:~# zpool status -v
  pool: rpool
  state: DEGRADED
  status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
  action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Apr  4 09:27:53 2020
        27G scanned out of 64G at 4.46M/s, 10m to go
        26G resilvered, 40.60% done
  config:
  NAME             STATE     READ WRITE CKSUM
    rpool            DEGRADED     0     0     0
      mirror-0       DEGRADED     0     0     0
        sdc3         ONLINE       0     0     0
        replacing-1  DEGRADED     0     0     0
          sdd3       OFFLINE      0     0     0
          sdb3       ONLINE       0     0     0  (resilvering)

When it's finally done, you should see something similar to

root@einherjar:~# zpool status -v
  pool: rpool
 state: ONLINE
  scan: resilvered 63.5G in 0 days 00:27:32 with 0 errors on Sat Apr  4 09:37:53 2020
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb3    ONLINE       0     0     0
            sdc3    ONLINE       0     0     0

Power the system off, pull the old root drive, and power it back on to make sure the new drive boots and all your server settings are intact.  If it does, great; the next step is to repeat the whole process with the 2nd new drive (condensed below) to copy everything over and resilver the pool again.  This will give you a new boot pool with no errors in the zpool.
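For reference, the second pass is the same sequence of commands.  Assuming the second Crucial shows up as /dev/sdd (a hypothetical letter, so check zpool status and fdisk -l again first) and the first one is still /dev/sdb, and noting that /dev/sdb already has full-size partitions so there is no resize step this time:

  root@einherjar:~# zpool offline rpool /dev/sdc3
  root@einherjar:~# sgdisk --replicate=/dev/sdd /dev/sdb
  root@einherjar:~# sgdisk --randomize-guids /dev/sdd
  root@einherjar:~# grub-install /dev/sdd
  root@einherjar:~# zpool replace rpool /dev/sdc3 /dev/sdd3

And if grub-install fails again, the same dd trick for partitions 1 and 2 applies.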

The last step is to turn on autoexpand for the pool; otherwise, in my case, I'd be stuck with 120GB available for use with VMs when the disks are obviously much bigger than that now.  Running zpool list will confirm what's actually available for use.  A very simple

root@einherjar:~# zpool set autoexpand=on rpool

As rpool is the name of the boot pool in Proxmox, this turns autoexpand on (it was off by default when I checked beforehand), and running zpool list should now show a pool size almost equal to the size of your drives (barring the space taken by the bios_grub and EFI partitions)

root@einherjar:~# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
rpool   231G  87.4G   144G        -         -     2%    37%  1.00x    ONLINE  -
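If the extra space doesn't show up on its own, two standard ZFS commands are worth trying: checking that the property actually took, and manually telling ZFS to expand the resized partitions (use the device names as they appear in your zpool status; sdb3 and sdd3 here follow the hypothetical letters from the recap above):

  root@einherjar:~# zpool get autoexpand rpool
  root@einherjar:~# zpool online -e rpool sdb3 sdd3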

With that, my boot pool has been upgraded and I have some more flexibility in my setup.  After migrating the VMs from the NVMe drive over to the local-zfs pool on the boot drives, I used fdisk to nuke the NVMe, rebooted so the changes would take effect, and then created an LVM storage on it to run VMs that can take advantage of the drive's higher speeds, or ones that would be negatively affected by I/O delay caused by other VMs on the local-zfs storage.  Anything like a Minecraft server or a music/video server that handles things in real time can suffer delays if even SSDs are busy with scheduled backups or heavy writes from one or more of the VMs running on them.  Moving those to a separate drive reduces that, and most game servers in particular benefit from running off an NVMe drive when it's available.

I replaced one of the SanDisks with a 1TB hard drive I pulled from a PS4 Pro when I upgraded it to a 2TB drive, and made a new Directory called Backup with that.  Even though it's a 5400RPM drive, I was still getting a constant 160MB/s in reads/writes when backing up from the SSDs, and a little faster when backing up the VMs on the NVMe.  Rebuilding from it is a bit slower, but since this is for home use it's a good trade-off in exchange for more space to hold backups.  The 4th slot in my drive cage is just occupied by a SanDisk I formatted empty, for airflow purposes, but I already plan to replace it with another SSD later on when I take on my next hardware upgrade for the Proxmox server: setting up a separate Windows VM with GPU passthrough to play certain games I won't have access to when I migrate my desktop OS back to Linux.

I still had an original boot disk and backups of all my VMs on a separate Directory, so this was a good learning experience in managing zpools without much fear of losing everything.  Since Proxmox is essentially Debian at its core, the process carries over to other Linux distros that support ZFS, so it's at least something I can call back on going forward.