We have a couple dozen SLES servers configured to authenticate against our Active Directory domain via the "Windows Domain Membership" module in YaST. We primarily use it for SSH login, and we also have a line in our sudoers file granting sudo rights to an AD group.

Lately, some servers have suddenly lost the ability to authenticate via AD. So far I have not found a common factor that explains why these servers in particular were affected.

  • The first affected server is SLES12 SP1, and it broke some weeks ago. I was busy at the time and can no longer remember what was done on the server around then.
  • The second server broke during the first step of a SLES12 SP1->SP2->SP3 upgrade (so somewhere between SP1 and SP2). It is now SP3 and still broken.
  • The last server (so far) broke while I was preparing to upgrade SP1->SP2->SP3. The root partition was nearly full, so I extended the drive in VMware, booted the VM from a SLES12 SP2 ISO, went into rescue mode, and extended the partition and filesystem using fdisk and resize2fs. After doing just that, without making any changes at the OS level, AD auth broke. I stopped there, and the server is still SLES12 SP1.

The rest of my servers (mostly SLES11, but some SLES12 SP1 and SP3) are working just fine. And everything *else* on the affected servers appears to be working fine. In fact, applications on the affected servers can themselves successfully use AD authentication (via their own means).

I've turned on debugging in pam_winbind.conf and compared the logs. Both successful and unsuccessful attempts start the same way:

2018-05-23T08:37:19.621649-04:00 server01 sshd[20783]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=172.16.xx.xx  user=domain\username
2018-05-23T08:37:19.622328-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): [pamh: 0x55c99b087a10] ENTER: pam_sm_authenticate (flags: 0x0001)
2018-05-23T08:37:19.622694-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): getting password (0x000000d1)
2018-05-23T08:37:19.623049-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): pam_get_item returned a password
2018-05-23T08:37:19.623381-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): Verify user 'domain\username'
2018-05-23T08:37:19.648066-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): CONFIG file: require_membership_of 'LinuxServerUsers'
2018-05-23T08:37:19.648505-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): enabling krb5 login flag
2018-05-23T08:37:19.648876-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): no sid given, looking up: LinuxServerUsers

Then they diverge. A successful login looks like:

2018-05-23T09:02:27.658635-04:00 server02 sshd[12842]: pam_winbind(sshd:auth): request wbcLogonUser succeeded
2018-05-23T09:02:27.658967-04:00 server02 sshd[12842]: pam_winbind(sshd:auth): user 'domain\username' granted access
2018-05-23T09:02:27.659244-04:00 server02 sshd[12842]: pam_winbind(sshd:auth): Returned user was 'DOMAIN\username'
2018-05-23T09:02:27.659586-04:00 server02 sshd[12842]: pam_winbind(sshd:auth): [pamh: 0x55895157bc20] LEAVE: pam_sm_authenticate returning 0 (PAM_SUCCESS)
2018-05-23T09:02:27.659912-04:00 server02 sshd[12842]: pam_winbind(sshd:account): [pamh: 0x55895157bc20] ENTER: pam_sm_acct_mgmt (flags: 0x0000)
2018-05-23T09:02:27.667852-04:00 server02 sshd[12842]: pam_winbind(sshd:account): user 'DOMAIN\username' granted access
2018-05-23T09:02:27.668264-04:00 server02 sshd[12842]: pam_winbind(sshd:account): [pamh: 0x55895157bc20] LEAVE: pam_sm_acct_mgmt returning 0 (PAM_SUCCESS)

On one of the affected servers, it instead continues:

2018-05-23T08:37:19.726146-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): request wbcLogonUser failed: WBC_ERR_AUTH_ERROR, PAM error: PAM_AUTH_ERR (7), NTSTATUS: NT_STATUS_LOGON_FAILURE, Error message was: Logon failure
2018-05-23T08:37:19.726621-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): user 'domain\username' denied access (incorrect password or invalid membership)
2018-05-23T08:37:19.726993-04:00 server01 sshd[20783]: pam_winbind(sshd:auth): [pamh: 0x55c99b087a10] LEAVE: pam_sm_authenticate returning 7 (PAM_AUTH_ERR)
2018-05-23T08:37:21.738222-04:00 server01 sshd[20781]: error: PAM: Authentication failure for domain\\username from 172.16.xx.xx
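
To make the comparison above repeatable, here is a minimal sketch of how the two logs can be reduced to just the pam_winbind message text before diffing. It assumes the syslog line format shown above; the function name and the log file names in the comments are placeholders of mine, not anything that ships with Samba.

```shell
#!/bin/sh
# Reduce pam_winbind debug lines to their message text so that the logs
# from a working and a broken server can be diffed line by line.
# Drops the timestamp, hostname and sshd PID, and masks the per-session
# pamh pointer, which differs on every login.
normalize_pam_log() {
    grep 'pam_winbind(' "$1" |
        sed -e 's/^.*pam_winbind(/pam_winbind(/' \
            -e 's/pamh: 0x[0-9a-f]*/pamh: PTR/'
}

# Hypothetical usage, with one login attempt captured per file:
#   normalize_pam_log good.log > good.norm
#   normalize_pam_log bad.log  > bad.norm
#   diff good.norm bad.norm
```

With the noise stripped, the diff should start at the first message that actually differs, which for the logs above is the wbcLogonUser result.
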

I'm using the same user, which has the same domain group memberships in both cases, so I'm at a loss for where to go next with troubleshooting. As far as I can tell, these attempts are not reaching a domain controller: my account has not been locked out in AD, as it should be after 5 failed login attempts. Is there a way to tell which domain controller a server is trying to authenticate against?
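
The candidates I know of for that last question are the stock Samba `wbinfo` and `net ads` subcommands. Below is a rough sketch of running them in one go, assuming both tools are installed on the host. `run_check` and `check_dc` are hypothetical helper names, `DOMAIN` is a placeholder for the real workgroup, and I have not verified the exact output format on every SLES release.

```shell
#!/bin/sh
# Ask winbind and the net tool which domain controller they are using.
# DOMAIN is a placeholder; override it in the environment if needed.
DOMAIN="${DOMAIN:-DOMAIN}"

# Run one command, or print a note and skip it if the tool is missing.
run_check() {
    tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $*"
        "$@" || echo "   (exit status $?)"
    else
        echo "== $tool not found, skipping: $*"
    fi
}

check_dc() {
    run_check wbinfo --ping-dc               # is the secure channel to the current DC alive?
    run_check wbinfo -t                      # is the machine account trust secret still valid?
    run_check wbinfo --getdcname="$DOMAIN"   # DC name winbind resolves for the domain
    run_check net ads info                   # LDAP server and KDC found for the realm
    run_check net ads testjoin               # is the domain join itself still valid?
}
```

Calling `check_dc` as root on a working and a broken server and comparing the two outputs should at least show whether both are talking to the same DC.
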

Any other suggestions? This one has me really scratching my head. Here are a couple of config files for reference (same on working and non-working servers):

/etc/security/pam_winbind.conf (commented sections removed)
[global]
        cached_login = no
        krb5_auth = yes
        krb5_ccache_type =
        require_membership_of = LinuxServerUsers
        debug = yes

/etc/samba/smb.conf
[global]
        workgroup = DOMAIN
        passdb backend = tdbsam
        printing = cups
        printcap name = cups
        printcap cache time = 750
        cups options = raw
        map to guest = Bad User
        include = /etc/samba/dhcp.conf
        logon path = \\%L\profiles\.msprofile
        logon home = \\%L\%U\.9xprofile
        logon drive = P:
        usershare allow guests = No
        idmap gid = 10000-20000
        idmap uid = 10000-20000
        kerberos method = secrets and keytab
        realm = DOMAIN.COM
        security = ADS
        template homedir = /home/%D/%U
        template shell = /bin/bash
        #winbind offline logon = yes
        #winbind refresh tickets = yes
[homes]
        comment = Home Directories
        valid users = %S, %D%w%S
        browseable = No
        read only = No
        inherit acls = Yes
[profiles]
        comment = Network Profiles Service
        path = %H
        read only = No
        store dos attributes = Yes
        create mask = 0600
        directory mask = 0700
[users]
        comment = All users
        path = /home
        read only = No
        inherit acls = Yes
        veto files = /aquota.user/groups/shares/
[groups]
        comment = All groups
        path = /home/groups
        read only = No
        inherit acls = Yes
[printers]
        comment = All Printers
        path = /var/tmp
        printable = Yes
        create mask = 0600
        browseable = No
[print$]
        comment = Printer Drivers
        path = /var/lib/samba/drivers
        write list = @ntadmin root
        force group = ntadmin
        create mask = 0664
        directory mask = 0775
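
Since the claim that these files are the same on working and non-working servers carries a lot of weight, here is a minimal sketch of how it can be checked mechanically rather than by eye: strip comments, blank lines and indentation from each copy, then diff what is left. The function names are hypothetical, and the files are assumed to have been copied from the two servers to one place first.

```shell
#!/bin/sh
# Strip indentation, trailing whitespace, comment lines (# or ;) and blank
# lines from a config file, leaving only the active settings.
strip_config() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        -e '/^[#;]/d' -e '/^$/d' "$1"
}

# Print the settings that differ between two copies of a config file.
# Exit status 0 means the active settings are identical.
compare_configs() {
    a=$(mktemp) && b=$(mktemp) || return 2
    strip_config "$1" > "$a"
    strip_config "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}

# Hypothetical usage:
#   compare_configs smb.conf.server01 smb.conf.server02
```

This only compares the files on disk; it says nothing about state that lives outside them, such as the machine account secret.
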