"Lab 1.1 - Start the Lab Environment" - "ceph -s" shows no output (only cursor blinking)

ricmarq Established Member
edited July 27 in Technical Questions

Hi everyone,

I'm beginning to follow this new SUSE Academy SES201v6 ("SUSE Enterprise Storage 6 Basic Operations") course.
Today (27th July 2020), I've watched the first few videos for Module 1 ("01 - Week One - SES201v6 Course Introduction and Overview").

I'm now doing the first Lab ("Lab 1.1 - Start the Lab Environment (10 mins)"). In the Lab Environment, I successfully started all 3 "Monitor" Nodes ("mon1", "mon2" and "mon3"), the first 3 "Data" Nodes ("data1", "data2" and "data3") and also the "admin" Node.

I'm also following the Lab Guide for this first Lab ("lab_exercises_1.1v2.pdf"), including its last section, "Resolve SES Cluster Startup Issues". I'm now on "Task 1: Check the cluster’s health" of that section. That task says to run the command "ceph -s" (as "root") in a Terminal session on the "admin" Node and to evaluate the output. However, when I enter that command (and press ENTER, of course), I just get a blinking (block) cursor, and it has now been in that state for more than 10 minutes. Is that to be expected?
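
(Side note: while waiting, I noticed in the upstream Ceph documentation that the "ceph" CLI appears to accept a "--connect-timeout" option, so something like the command below should error out after a few seconds instead of blocking forever. I haven't verified this inside the lab VM, so treat it as a sketch.)

# Untested sketch: give up after 10 seconds instead of hanging indefinitely
ceph -s --connect-timeout 10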

Comments

  • Academy_Instructor

    Hello, that is a symptom of a communication problem with one or more of the monitor nodes. Please shut down and restart all 3 monitor nodes and try again.
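
    If you want to confirm the monitors are the problem before rebooting, a quick check from the "admin" node would be something along these lines (I'm assuming the standard SES 6 daemon naming, ceph-mon@<hostname>, here):

    # Sketch: check the mon daemon on each monitor node (daemon names assumed)
    ssh mon1 systemctl status ceph-mon@mon1.service
    ssh mon2 systemctl status ceph-mon@mon2.service
    ssh mon3 systemctl status ceph-mon@mon3.service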

  • ricmarq Established Member
    edited July 27

    Hi @Academy_Instructor,

    Thank you very much for your quick reply. Your suggestion worked for me! :smile:
    Per your suggestion, I shut down the 3 monitor nodes and powered them back on one after the other, in each case waiting for the boot process to finish and for CPU usage to stabilize before starting the next "monitor" node.
    Now, the "ceph -s" command on the "admin" Node in my Lab is no longer "hanging" and returns the following output:

    admin:~ # ceph -s
      cluster:
        id:     f2b0bde4-8ecc-4900-ab34-7d0234101292
        health: HEALTH_WARN
                2 osds down
                Degraded data redundancy: 142/645 objects degraded (22.016%), 27 pgs degraded, 317 pgs undersized
                317 pgs not deep-scrubbed in time
                317 pgs not scrubbed in time
    
      services:
        mon: 3 daemons, quorum mon1,mon2,mon3 (age 14m)
        mgr: mon3(active, since 14m), standbys: mon2, mon1
        mds: cephfs:1 {0=mon1=up:active}
        osd: 9 osds: 7 up (since 10m), 9 in (since 10M)
    
      data:
        pools:   7 pools, 480 pgs
        objects: 215 objects, 4.2 KiB
        usage:   7.1 GiB used, 126 GiB / 133 GiB avail
        pgs:     142/645 objects degraded (22.016%)
                 290 active+undersized
                 163 active+clean
                 27  active+undersized+degraded
    
      io:
        client:   851 B/s rd, 0 op/s rd, 0 op/s wr
    
    admin:~ # 
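
    (For future reference, a quick way to confirm the three mons had actually re-formed quorum after the restarts would probably be the command below; I'm taking the syntax from the upstream Ceph docs rather than from the lab guide, so consider it a sketch.)

    # Sketch: show which monitors are in quorum and which one is the leader
    ceph quorum_status --format json-pretty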
    

    EDIT: Because I noticed "2 osds down" in the output above, I then followed the instructions in the "Task 5: Are the OSDs “up” and running properly?" section of the PDF, namely:

    1 - I ran the "ceph osd tree" command to find the culprits:

    admin:~ # ceph osd tree
    ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
    -1       0.16727 root default                           
    -3       0.05576     host data1                         
     0   hdd 0.01859         osd.0    down  1.00000 1.00000 
     3   hdd 0.01859         osd.3      up  1.00000 1.00000 
     7   hdd 0.01859         osd.7    down  1.00000 1.00000 
    -5       0.05576     host data2                         
     2   hdd 0.01859         osd.2      up  1.00000 1.00000 
     5   hdd 0.01859         osd.5      up  1.00000 1.00000 
     8   hdd 0.01859         osd.8      up  1.00000 1.00000 
    -7       0.05576     host data3                         
     1   hdd 0.01859         osd.1      up  1.00000 1.00000 
     4   hdd 0.01859         osd.4      up  1.00000 1.00000 
     6   hdd 0.01859         osd.6      up  1.00000 1.00000 
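
    (A shortcut I only found afterwards in the Ceph docs, and haven't tried in this particular lab, is that "ceph osd tree" appears to accept a state filter, which would list just the down OSDs directly:)

    # Sketch: show only the OSDs that are currently down (state filter, untested here)
    ceph osd tree down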
    

    2 - Based on the output above, the problem seemed to lie with "osd.0" ("down") and "osd.7" (also "down") on "data1". So, I restarted those two services:

    admin:~ # ssh data1 systemctl restart ceph-osd@0.service
    admin:~ # ssh data1 systemctl restart ceph-osd@7.service
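
    (In hindsight, before restarting them it would probably have been worth checking why the two OSDs had gone down in the first place; something along these lines should work, although I didn't run it at the time:)

    # Sketch: look at the last log lines of the two failed OSD daemons (not run during the lab)
    ssh data1 journalctl -u ceph-osd@0.service -n 50 --no-pager
    ssh data1 journalctl -u ceph-osd@7.service -n 50 --no-pager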
    

    3 - After doing this, the output of the "ceph osd tree" command looked good (everything was "up"):

    admin:~ # ceph osd tree
    ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
    -1       0.16727 root default                           
    -3       0.05576     host data1                         
     0   hdd 0.01859         osd.0      up  1.00000 1.00000 
     3   hdd 0.01859         osd.3      up  1.00000 1.00000 
     7   hdd 0.01859         osd.7      up  1.00000 1.00000 
    -5       0.05576     host data2                         
     2   hdd 0.01859         osd.2      up  1.00000 1.00000 
     5   hdd 0.01859         osd.5      up  1.00000 1.00000 
     8   hdd 0.01859         osd.8      up  1.00000 1.00000 
    -7       0.05576     host data3                         
     1   hdd 0.01859         osd.1      up  1.00000 1.00000 
     4   hdd 0.01859         osd.4      up  1.00000 1.00000 
     6   hdd 0.01859         osd.6      up  1.00000 1.00000 
    

    4 - By this time, "ceph -s" still reported "HEALTH_WARN", but no longer showed any "osds down":

    admin:~ # ceph -s
      cluster:
        id:     f2b0bde4-8ecc-4900-ab34-7d0234101292
        health: HEALTH_WARN
                Degraded data redundancy: 59/645 objects degraded (9.147%), 7 pgs degraded
                301 pgs not deep-scrubbed in time
                301 pgs not scrubbed in time
    
      services:
        mon: 3 daemons, quorum mon1,mon2,mon3 (age 53m)
        mgr: mon3(active, since 53m), standbys: mon2, mon1
        mds: cephfs:1 {0=mon1=up:active}
        osd: 9 osds: 9 up (since 9s), 9 in (since 10M); 3 remapped pgs
    
      data:
        pools:   7 pools, 480 pgs
        objects: 215 objects, 4.2 KiB
        usage:   9.1 GiB used, 162 GiB / 171 GiB avail
        pgs:     59/645 objects degraded (9.147%)
                 473 active+clean
                 6   active+recovery_wait+undersized+degraded+remapped
                 1   active+recovering+undersized+degraded+remapped
    
      io:
        recovery: 0 B/s, 2 objects/s
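
    (The "not scrubbed in time" warnings cleared on their own a little later; had they lingered, my understanding from the Ceph docs is that a deep scrub can be triggered manually per placement group, e.g.:)

    # Sketch: manually trigger a deep scrub of one placement group (the PG id is just an example)
    ceph pg deep-scrub 1.0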
    

    5 - And, after a few minutes, the health changed from "HEALTH_WARN" to "HEALTH_OK":

    admin:~ # ceph status
      cluster:
        id:     f2b0bde4-8ecc-4900-ab34-7d0234101292
        health: HEALTH_OK
    
      services:
        mon: 3 daemons, quorum mon1,mon2,mon3 (age 63m)
        mgr: mon3(active, since 63m), standbys: mon2, mon1
        mds: cephfs:1 {0=mon1=up:active}
        osd: 9 osds: 9 up (since 10m), 9 in (since 10M)
    
      data:
        pools:   7 pools, 480 pgs
        objects: 215 objects, 4.2 KiB
        usage:   9.1 GiB used, 162 GiB / 171 GiB avail
        pgs:     480 active+clean
    
      io:
        client:   852 B/s rd, 0 op/s rd, 0 op/s wr
    
    admin:~ # ceph -s
      cluster:
        id:     f2b0bde4-8ecc-4900-ab34-7d0234101292
        health: HEALTH_OK
    
      services:
        mon: 3 daemons, quorum mon1,mon2,mon3 (age 64m)
        mgr: mon3(active, since 63m), standbys: mon2, mon1
        mds: cephfs:1 {0=mon1=up:active}
        osd: 9 osds: 9 up (since 10m), 9 in (since 10M)
    
      data:
        pools:   7 pools, 480 pgs
        objects: 215 objects, 4.2 KiB
        usage:   9.1 GiB used, 162 GiB / 171 GiB avail
        pgs:     480 active+clean
    
      io:
        client:   851 B/s rd, 0 op/s rd, 0 op/s wr
    

    So, everything seems to be looking good! :smiley:

  • voleg4u

    The same happens when the LAB environment is resumed. No action required, just wait. "ceph -s -w" will tell you when the whole environment has resumed.
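
    An alternative that should behave similarly, assuming the standard "watch" utility is available in the admin VM, is to simply poll the status every few seconds:

    # Sketch: re-run "ceph -s" every 5 seconds until the cluster reports HEALTH_OK
    watch -n 5 ceph -s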

  • I am glad this resolved your issue.

  • ricmarq Established Member

    Hi,
    @voleg4u : Thanks for the information ("The same happens when the LAB environment is resumed. No action required, just wait. "ceph -s -w" will tell you when the whole environment has resumed.").
    @Academy_Instructor : Thanks again!

  • Hi @Academy_Instructor

    "Please shut down and restart all 3 monitor nodes and try again."
    May I know, when I restart all 3 monitor nodes, do the data and admin nodes need to be shut down as well? Or can they remain on while I restart just the 3 monitor nodes?

    Also, is it sufficient to wait until a monitor node reaches the login prompt before I begin powering on the other monitor nodes?

  • No, just the monitor nodes.
