SSD - failure

I have been using SLED for some years. On my laptop, which runs for a few hours every day, I use an SSD. Some years ago we had to replace one disk. OK, these things happen. Now I am getting failures again.
The laptop freezes completely and the disk activity LED stays on continuously. GSmartControl finds no disk failures.
Could there be an issue with SSDs and Linux, or with SUSE specifically?

Comments

  • Here is the drive data as well:
    SLED15:/home/hans-christoph # hdparm -I /dev/sda

    /dev/sda:

    ATA device, with non-removable media
    Model Number: KINGSTON SA400S37240G
    Serial Number: 50026B767B01BD02
    Firmware Revision: SBFK71E0
    Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
    Standards:
    Supported: 11 10 9 8 7 6 5
    Likely used: 11
    Configuration:
    Logical max current
    cylinders 16383 16383
    heads 16 16
    sectors/track 63 63
    --
    CHS current addressable sectors: 16514064
    LBA user addressable sectors: 268435455
    LBA48 user addressable sectors: 468862128
    Logical Sector size: 512 bytes
    Physical Sector size: 512 bytes
    Logical Sector-0 offset: 0 bytes
    device size with M = 1024*1024: 228936 MBytes
    device size with M = 1000*1000: 240057 MBytes (240 GB)
    cache/buffer size = unknown
    Form Factor: 2.5 inch
    Nominal Media Rotation Rate: Solid State Device
    Capabilities:
    LBA, IORDY(can be disabled)
    Queue depth: 32
    Standby timer values: spec'd by Standard, no device specific minimum
    R/W multiple sector transfer: Max = 16 Current = 16
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
    Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
    Cycle time: no flow control=120ns IORDY flow control=120ns
    Commands/features:
    Enabled Supported:
    * SMART feature set
    Security Mode feature set
    * Power Management feature set
    * Write cache
    * Look-ahead
    * Host Protected Area feature set
    * WRITE_BUFFER command
    * READ_BUFFER command
    * NOP cmd
    * DOWNLOAD_MICROCODE
    SET_MAX security extension
    * 48-bit Address feature set
    * Device Configuration Overlay feature set
    * Mandatory FLUSH_CACHE
    * FLUSH_CACHE_EXT
    * SMART error logging
    * SMART self-test
    * General Purpose Logging feature set
    * WRITE_{DMA|MULTIPLE}_FUA_EXT
    * 64-bit World wide name
    * WRITE_UNCORRECTABLE_EXT command
    * {READ,WRITE}_DMA_EXT_GPL commands
    * Segmented DOWNLOAD_MICROCODE
    * Gen1 signaling speed (1.5Gb/s)
    * Gen2 signaling speed (3.0Gb/s)
    * Gen3 signaling speed (6.0Gb/s)
    * Native Command Queueing (NCQ)
    * Phy event counters
    * READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
    * DMA Setup Auto-Activate optimization
    Device-initiated interface power management
    * Software settings preservation
    * DOWNLOAD MICROCODE DMA command
    * SET MAX SETPASSWORD/UNLOCK DMA commands
    * WRITE BUFFER DMA command
    * READ BUFFER DMA command
    * DEVICE CONFIGURATION SET/IDENTIFY DMA commands
    * Data Set Management TRIM supported (limit 8 blocks)
    Security:
    Master password revision code = 65534
    supported
    not enabled
    not locked
    not frozen
    not expired: security count
    supported: enhanced erase
    20min for SECURITY ERASE UNIT. 60min for ENHANCED SECURITY ERASE UNIT.
    Logical Unit WWN Device Identifier: 50026b767b01bd02
    NAA : 5
    IEEE OUI : 0026b7
    Unique ID : 67b01bd02
    Checksum: correct

  • malcolmlewis Knowledge Partner
    edited February 7

    @HANS-CHRISTOPH Hi, what does smartctl -a /dev/sda show? Are you running Btrfs? If so, it may need a defrag, balance, etc.

  • edited February 9

    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0
    9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 3910
    12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 1581
    148 Unknown_Attribute 0x0000 255 255 000 Old_age Offline - 8
    149 Unknown_Attribute 0x0000 255 255 000 Old_age Offline - 2
    167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
    168 SATA_Phy_Error_Count 0x0012 100 100 000 Old_age Always - 0
    169 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 6
    170 Bad_Blk_Ct_Erl/Lat 0x0013 100 100 010 Pre-fail Always - 0/7
    172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
    173 MaxAvgErase_Ct 0x0000 100 100 000 Old_age Offline - 21 (Average 13)
    181 Program_Fail_Cnt_Total 0x0012 100 100 000 Old_age Always - 0
    182 Erase_Fail_Count_Total 0x0000 255 255 000 Old_age Offline - 1
    187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 1
    192 Unsafe_Shutdown_Count 0x0012 100 100 000 Old_age Always - 60
    194 Temperature_Celsius 0x0023 066 048 000 Pre-fail Always - 34 (Min/Max 11/52)
    196 Not_In_Use 0x0000 100 100 000 Old_age Offline - 2
    199 CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
    218 CRC_Error_Count 0x0000 100 100 000 Old_age Offline - 0
    231 SSD_Life_Left 0x0013 100 100 000 Pre-fail Always - 98
    233 Flash_Writes_GiB 0x0013 100 100 000 Pre-fail Always - 3219
    241 Lifetime_Writes_GiB 0x0012 100 100 000 Old_age Always - 3142
    242 Lifetime_Reads_GiB 0x0012 100 100 000 Old_age Always - 1505
    244 Average_Erase_Count 0x0000 100 100 000 Old_age Offline - 13
    245 Max_Erase_Count 0x0000 100 100 000 Old_age Offline - 21
    246 Total_Erase_Count 0x0000 100 100 000 Old_age Offline - 160584

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
    1 Short offline Completed without error 00% 3907 -
    2 Short offline Completed without error 00% 3905 -
    3 Short offline Completed without error 00% 3890 -
    4 Extended offline Completed without error 00% 3889 -
    5 Short offline Completed without error 00% 3877 -
    6 Short offline Completed without error 00% 3871 -
    7 Short offline Completed without error 00% 3867 -
    8 Short offline Completed without error 00% 3861 -
    9 Short offline Completed without error 00% 3858 -
    10 Short offline Completed without error 00% 3854 -
    11 Short offline Completed without error 00% 3850 -
    12 Short offline Completed without error 00% 3842 -
    13 Short offline Completed without error 00% 3834 -
    14 Short offline Completed without error 00% 3819 -
    15 Short offline Completed without error 00% 3815 -
    16 Short offline Completed without error 00% 3814 -
    17 Short offline Completed without error 00% 3808 -
    18 Short offline Completed without error 00% 3795 -
    19 Short offline Completed without error 00% 3781 -
    20 Short offline Completed without error 00% 3762 -
    21 Short offline Completed without error 00% 3752 -

    SMART Selective self-test log data structure revision number 0
    Note: revision number not 1 implies that no selective self-test has ever been run
    SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1 0 0 Not_testing
    2 0 0 Not_testing
    3 0 0 Not_testing
    4 0 0 Not_testing
    5 0 0 Not_testing
    Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
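
A quick way to scan an attribute table like the one above is to print only the counters of interest that have a non-zero raw value. This is only a sketch: smart_nonzero is a hypothetical helper name (not part of smartmontools), and the default attribute IDs are an assumption, since vendors number and name attributes differently.

```shell
# smart_nonzero [IDS]
# Reads "smartctl -A" style output on stdin and prints "ID NAME RAW"
# for the listed attribute IDs whose raw value is not zero.
# The default ID list is an assumption; adjust it for your drive.
smart_nonzero() {
  awk -v ids="${1:-5 187 192 197 198 199}" '
    BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
    # Attribute rows have the ID in column 1, a 0x.... flag in column 3
    # and the raw value starting in column 10.
    ($1 in want) && $3 ~ /^0x[0-9a-f]+$/ && $10 + 0 != 0 { print $1, $2, $10 }
  '
}

# Usage (needs root, device name as used in this thread):
#   smartctl -A /dev/sda | smart_nonzero
```

Fed the table above, it prints 187 Reported_Uncorrect 1 and 192 Unsafe_Shutdown_Count 60. The unsafe shutdown count fits the hard reboots after each freeze.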

  • I use Btrfs as the native file system, as SUSE suggests.

  • malcolmlewis Knowledge Partner

    @HANS-CHRISTOPH Hi, that looks OK. What scheduler is in use?

    cat /sys/block/sda/queue/scheduler
    
  • hans-christoph@SLED15:~> cat /sys/block/sda/queue/scheduler
    [mq-deadline] kyber bfq none
    This is the result. The disks are not full; there is free space. It is strange, and the freezes happen randomly. Sometimes I think they happen when I am using several programs and the Internet at the same time, but I couldn't find a cause there.
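
In the scheduler file, the active scheduler is the name in square brackets. To compare machines (for example a laptop with an SSD and a desktop with an HDD), a small loop over sysfs can print the active scheduler and the rotational flag for every disk. A minimal sketch, assuming the usual /sys/block layout; list_schedulers is a hypothetical helper name.

```shell
# list_schedulers [SYSFS_BLOCK_DIR]
# Prints "<device> rotational=<0|1> scheduler=<active>" for each block device.
list_schedulers() {
  root="${1:-/sys/block}"
  for dev in "$root"/*; do
    [ -r "$dev/queue/scheduler" ] || continue
    # Extract the bracketed name, e.g. "[mq-deadline] kyber bfq none" -> mq-deadline.
    active=$(sed -n 's/.*\[\([^]]*\)].*/\1/p' "$dev/queue/scheduler")
    # Some devices list a single entry without brackets; show it as-is.
    [ -n "$active" ] || active=$(cat "$dev/queue/scheduler")
    rot=$(cat "$dev/queue/rotational" 2>/dev/null)
    printf '%s rotational=%s scheduler=%s\n' "${dev##*/}" "${rot:-?}" "$active"
  done
}

# Usage:
#   list_schedulers
```

rotational=0 marks a non-rotating device such as an SSD, rotational=1 a spinning disk.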

  • malcolmlewis Knowledge Partner
    edited February 10

    @HANS-CHRISTOPH Hi, that's the correct one for an SSD. Can you confirm that the likes of fstrim, btrfs defrag, balance, etc. have been run?

  • malcolmlewis Knowledge Partner

    @HANS-CHRISTOPH Hi, I forgot to add the command to check ;) systemctl list-timers will show when each job last ran, etc.

  • NEXT LEFT LAST PASSED UNIT ACTIVATES
    Thu 2021-02-11 08:00:00 CET 49min left Thu 2021-02-11 07:00:07 CET 10min ago snapper-timeline.timer snapper-timeline.service
    Thu 2021-02-11 23:05:47 CET 15h left Tue 2021-02-09 10:08:38 CET 1 day 21h ago snapper-cleanup.timer snapper-cleanup.service
    Thu 2021-02-11 23:11:04 CET 16h left Tue 2021-02-09 10:13:56 CET 1 day 20h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
    Fri 2021-02-12 00:00:00 CET 16h left Thu 2021-02-11 06:50:48 CET 20min ago logrotate.timer logrotate.service
    Fri 2021-02-12 00:00:00 CET 16h left Thu 2021-02-11 06:50:48 CET 20min ago mandb.timer mandb.service
    Fri 2021-02-12 00:15:52 CET 17h left Thu 2021-02-11 06:50:48 CET 20min ago check-battery.timer check-battery.service
    Fri 2021-02-12 01:39:26 CET 18h left Thu 2021-02-11 06:50:48 CET 20min ago backup-sysconfig.timer backup-sysconfig.service
    Fri 2021-02-12 01:58:12 CET 18h left Thu 2021-02-11 06:50:48 CET 20min ago backup-rpmdb.timer backup-rpmdb.service
    Mon 2021-02-15 00:00:00 CET 3 days left Mon 2021-02-08 08:06:54 CET 2 days ago btrfs-balance.timer btrfs-balance.service
    Mon 2021-02-15 00:00:00 CET 3 days left Mon 2021-02-08 08:06:54 CET 2 days ago fstrim.timer fstrim.service
    Mon 2021-03-01 00:00:00 CET 2 weeks 3 days left Mon 2021-02-01 07:39:12 CET 1 weeks 2 days ago btrfs-scrub.timer btrfs-scrub.service
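
To narrow a listing like the one above to the storage maintenance jobs asked about (fstrim and the btrfs-* timers), the output can be filtered by unit name. A minimal sketch; maintenance_timers is a hypothetical helper name, and the unit names are assumed to match the ones shown in this thread.

```shell
# maintenance_timers
# Reads "systemctl list-timers" output on stdin and keeps only the
# fstrim and btrfs-* timer lines.
maintenance_timers() {
  grep -E '(fstrim|btrfs-[a-z]+)\.timer'
}

# Usage:
#   systemctl list-timers --all | maintenance_timers
```

The LAST column of the remaining lines is the time to compare against the moment of a freeze.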

  • It looks like everything runs fine.
    I have two SLED 15.2 systems running. One is on my old desktop with an HDD; there are no disk problems there.
    The other is on my laptop with an SSD. I replaced one disk in the past, and the problem seems to be the same as last time: the laptop freezes at random. When that happens, a hard reboot is the only option. What I can see in that case is that the disk activity LED stays on.
    Searching for "Linux SSD freeze", I can see that others have had the same problem.

  • malcolmlewis Knowledge Partner

    Hi
    So if you look at the times when the above services ran, do they correspond with when you get a freeze?

    I also suggest enabling the magic SysRq key; see https://en.wikipedia.org/wiki/Magic_SysRq_key for details. Use cat /proc/sys/kernel/sysrq to see the current value; you can also set it up in an /etc/sysctl.d/10-magic-sysrq.conf file.

  • Hi
    I'm not sure. The services run often, and at night, as the timestamps show.
    The sysrq value is 184. What does that mean?
    I can see I will have to work with it a bit more; it seems to have a lot of functions. When the system freezes, Alt+F2 doesn't work.
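
The sysrq value is a bitmask: 0 disables SysRq, 1 enables every function, and larger values are a sum of individual permission bits, as described in the kernel's sysrq admin guide. A minimal sketch for decoding it; decode_sysrq is a hypothetical helper name.

```shell
# decode_sysrq MASK
# Prints the SysRq function groups enabled by a kernel.sysrq value.
decode_sysrq() {
  mask="$1"
  case "$mask" in
    0) echo "0: SysRq disabled"; return 0 ;;
    1) echo "1: all SysRq functions enabled"; return 0 ;;
  esac
  for entry in \
    "2:control of console logging level" \
    "4:control of keyboard (SAK, unraw)" \
    "8:debugging dumps of processes etc." \
    "16:sync command" \
    "32:remount read-only" \
    "64:signalling of processes (term, kill, oom-kill)" \
    "128:reboot/poweroff" \
    "256:nicing of all RT tasks"
  do
    bit=${entry%%:*}
    if [ $(( mask & bit )) -ne 0 ]; then
      echo "$bit: ${entry#*:}"
    fi
  done
  return 0
}

# Usage:
#   decode_sysrq "$(cat /proc/sys/kernel/sysrq)"
```

For 184 this lists 8, 16, 32 and 128 (184 = 128 + 32 + 16 + 8): process dumps, sync, remount read-only and reboot/poweroff are allowed, while keyboard control (4) and process signalling (64) are not.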

  • malcolmlewis Knowledge Partner
    edited February 15

    @HANS-CHRISTOPH Hi, have a read here: https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html

    Try the key combo, pressing the keys in the required order: press and hold Alt, press and release the SysRq key, then press the following keys one at a time: R E I S U B. Then release the Alt key. The system should reboot.
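
Whether each key of the R E I S U B sequence is accepted depends on the kernel.sysrq value. The sketch below checks a given value against the permission bit each key needs (R: keyboard control, E and I: process signalling, S: sync, U: remount read-only, B: reboot), following the kernel's sysrq documentation. reisub_allowed is a hypothetical helper name.

```shell
# reisub_allowed MASK
# Reports, for each key of the REISUB sequence, whether the given
# kernel.sysrq value permits it.
reisub_allowed() {
  mask="$1"
  for pair in R:4 E:64 I:64 S:16 U:32 B:128; do
    key=${pair%%:*}
    bit=${pair##*:}
    # A value of exactly 1 enables everything; otherwise test the bit.
    if [ "$mask" -eq 1 ] || [ $(( mask & bit )) -ne 0 ]; then
      echo "$key allowed"
    else
      echo "$key blocked (needs bit $bit)"
    fi
  done
}

# Usage:
#   reisub_allowed "$(cat /proc/sys/kernel/sysrq)"
```

With the value 184 reported earlier in the thread, S, U and B are allowed and R, E and I are blocked, so the sequence still syncs, remounts read-only and reboots. Setting kernel.sysrq = 1 enables all six.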

  • Now I have found the same failure on my desktop as well, which has a regular HDD. I think I can relate it to Firefox: it happens when many windows are open and Firefox has been running for a long time.
    It always happens when I am working with Firefox.

  • malcolmlewis Knowledge Partner

    @HANS-CHRISTOPH Hi, sounds like you might need to look at Firefox tweaks, e.g. the disk cache, to reduce that.
    How much system RAM do you have?

    Maybe reduce swappiness?

    cat /etc/sysctl.d/98-swap.conf
    
    #Reduce swappiness
    vm.swappiness=1
    vm.vfs_cache_pressure=50
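
To answer the RAM question and to see how close the system is to swapping while Firefox is busy, the relevant totals can be read from /proc/meminfo. A minimal sketch; mem_summary is a hypothetical helper name, and the choice of fields is an assumption about what is worth watching here.

```shell
# mem_summary [MEMINFO_FILE]
# Prints total/available RAM and total/free swap in GiB.
# /proc/meminfo reports these values in kB.
mem_summary() {
  awk '
    /^(MemTotal|MemAvailable|SwapTotal|SwapFree):/ {
      name = $1
      sub(":", "", name)
      printf "%-13s %6.1f GiB\n", name, $2 / 1048576
    }
  ' "${1:-/proc/meminfo}"
}

# Usage:
#   mem_summary
#   cat /proc/sys/vm/swappiness    # current swappiness value
```

If MemAvailable is close to zero and SwapFree keeps shrinking shortly before a freeze, memory pressure is a more likely cause than the disk itself.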
    