
SES5 Install not adding all OSDs



shashi_microfocus
25-Sep-2018, 13:38
Hi,

I am doing a new install of a 6-node SES5 cluster with 9 disks per node. After running all the Ceph deployment stages from 0 to 4, I expected all 54 disks to be in the Ceph cluster, but I found only 31. As per the documentation, I had completely wiped each of the 54 disks beforehand. I have re-run stages 1 to 3 multiple times, but the remaining disks are still not being added. Can anyone suggest what the issue could be and how to solve it?

For example, for node6 all 9 disks are listed in the proposal, but after the SES5 install only 4 of them ended up as OSDs.

# salt '*' pillar.items

sosesn6.swlab.net:
----------
available_roles:


benchmark:
----------

ceph:
----------
storage:
----------
osds:
----------
/dev/sdb:
----------
format:
bluestore
/dev/sdc:
----------
format:
bluestore
/dev/sdd:
----------
format:
bluestore
/dev/sde:
----------
format:
bluestore
/dev/sdf:
----------
format:
bluestore
/dev/sdg:
----------
format:
bluestore
/dev/sdh:
----------
format:
bluestore
/dev/sdi:
----------
format:
bluestore
/dev/sdj:
----------
format:
bluestore



# ceph-disk list
/dev/sda :
/dev/sda1 swap, swap
/dev/sda2 other, ext4, mounted on /
/dev/sdb other, unknown
/dev/sdc other, unknown
/dev/sdd other, unknown
/dev/sde :
/dev/sde1 ceph data, active, cluster ceph, osd.22, block /dev/sde2
/dev/sde2 ceph block, for /dev/sde1
/dev/sdf other, unknown
/dev/sdg :
/dev/sdg1 ceph data, active, cluster ceph, osd.16, block /dev/sdg2
/dev/sdg2 ceph block, for /dev/sdg1
/dev/sdh other, unknown
/dev/sdi :
/dev/sdi1 ceph data, active, cluster ceph, osd.11, block /dev/sdi2
/dev/sdi2 ceph block, for /dev/sdi1
/dev/sdj :
/dev/sdj1 ceph data, active, cluster ceph, osd.3, block /dev/sdj2
/dev/sdj2 ceph block, for /dev/sdj1
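
A rough way to cross-check what the proposal expects against what actually got deployed on a node (a sketch, assuming the standard Salt/DeepSea commands on the admin node; sosesn6* is simply the minion shown above):

salt 'sosesn6*' pillar.get ceph:storage:osds   # disks the proposal wants as OSDs
salt 'sosesn6*' cmd.run 'ceph-disk list'       # disks that actually carry ceph data
ceph osd tree                                  # OSDs the cluster currently knows about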



Thanks & Regards,
Shashi Kanth.

eblock
25-Sep-2018, 15:32
Hi,

First of all, check the ceph-osd.log files (/var/log/ceph/) on one of the servers that doesn't deploy all the OSDs it is supposed to. There should be something in there revealing the cause.

The deepsea monitor also usually shows what's going on (run it in a second session on the admin node), and even running stage.X with salt should print error messages.
Sometimes after wiping the disks only a reboot helps to get rid of certain leftover state; after that you can retry the stages.
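
Concretely, those checks might look like this (a sketch; the log path and the deepsea CLI are the SES5 defaults):

grep -iE 'error|fail' /var/log/ceph/ceph-osd.*.log   # on the affected storage node
deepsea monitor                                      # on the admin node, in a second session
salt-run state.orch ceph.stage.3                     # re-run the deployment stage and watch for errors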

If the orchestrated deployment takes too long or its output isn't clear enough, another option is to deploy a single OSD manually to see exactly what fails. The creation can be split into two steps:

ceph-disk prepare --bluestore /dev/sdX
ceph-disk activate /dev/sdX

If "prepare" already fails, analyze the logs and find out what's wrong.

shashi_microfocus
27-Sep-2018, 16:17
Thank you for the pointer.

"ceph-disk prepare" is going through, but "ceph-disk activate" giving the bellow error. Not sure what can be done now.

# ceph-disk activate /dev/sdj
ceph-disk: Cannot discover filesystem type: device /dev/sdj: Line is truncated:
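
For reference, what "prepare" left behind on the disk can be inspected like this (a sketch, using the same /dev/sdj):

lsblk /dev/sdj              # did "prepare" create /dev/sdj1 and /dev/sdj2?
blkid -p -o udev /dev/sdj1  # is there a filesystem signature on the data partition?
ceph-disk list /dev/sdj     # ceph-disk's own view of this disk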

eblock
27-Sep-2018, 17:56
In the "activate" command you're supposed to provide the respective partition of the prepared disk, so probably
ceph-disk activate /dev/sdj1.
From the previous output I read that /dev/sdj already had been partitioned in a previous run. If the "activate" command doesn't work I would wipe that disk and start over without partitions on /dev/sdj, running the manual steps for this disk again to see if that changes anything.
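
Put together, a clean retry for this particular disk might look roughly like this (a sketch assuming bluestore, as in the pillar output above, and the device names from this thread):

ceph-disk activate /dev/sdj1             # first try activating the existing data partition
wipefs --all /dev/sdj                    # if that still fails: wipe the filesystem signatures ...
sgdisk --zap-all /dev/sdj                # ... and clear the GPT structures
partprobe /dev/sdj                       # or reboot, as mentioned earlier
ceph-disk prepare --bluestore /dev/sdj   # recreate the OSD from scratch
ceph-disk activate /dev/sdj1             # activate the data partition, not the whole disk
ceph osd tree                            # the new OSD should show up under sosesn6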