   April 2, 2020  
[00:00:36] <LeftWing> Often begin with a truss of all forks/execs/exits, and try and see if it ran some command that failed before it threw up its hands
[00:17:09] <neirac> LeftWing thanks, I'll try that. make.bash works and produces the binary, but I need to make all.bash work
[01:04:26] <liv3010m> hi @rmustacc, sorry for being this late. That's nice! please let me know if I can download and give it a test
[01:08:53] <jbk> is the EFI booter (in the EFI partition) bootx86.efi or bootx64.efi ?
[01:09:54] <jbk> (I thought the latter, but installboot(1M) shows the former)
[01:09:58] <jbk> or says
[01:21:31] <andyf> the latter
[01:23:06] <andyf> the 32-bit one is bootia32.efi, 64-bit bootx64.efi
[01:25:27] <rmustacc> liv3010m: Yeah, let me toss it up and get the shasum.
[01:25:51] <liv3010m> :) great!
[01:27:21] <rmustacc> https://fingolfin.org/tmp/bge - sha256: ebb278fe7641388c24619df87fb5557491060768e11157548d8b5ca9c2b1a04b
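A download like this can be checked against the posted checksum before it is installed. The following is only a minimal sketch: the helper names `sha256_of_file` and `matches_published` and the local file name are hypothetical, and only the digest itself comes from the message above.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large file never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """Compare a file's digest with a published hex string (case-insensitive)."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Hypothetical usage, assuming the driver was saved locally as "bge":
# matches_published(
#     "bge",
#     "ebb278fe7641388c24619df87fb5557491060768e11157548d8b5ca9c2b1a04b",
# )
```

A mismatch would most likely mean a truncated download or a file that was replaced on the server after the checksum was posted.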
[01:28:47] <liv3010m> thx, downloading
[01:31:59] <liv3010m> OK, I copied it to a flash drive. I'll have dinner, then give it a try and report back
[01:34:05] <jbk> andyf: ok.. i filed illumos#12463 for that
[02:58:32] <liv3010m> @rmustacc: this time we got a kernel panic at an earlier stage: As soon as update_drv was run it crashed
[02:59:34] <rmustacc> liv3010m: Haha, oops. Well, I guess I shouldn't be surprised since I did a lot of untested changes.
[02:59:41] <rmustacc> Mind getting a snapshot of where it crashes?
[03:00:35] <liv3010m> it's OK, this time I created another boot environment
[03:00:36] <liv3010m> sure
[03:00:42] <liv3010m> I'll upload it
[03:01:32] <rmustacc> OK, great.
[03:02:20] <liv3010m> https://i.ibb.co/xM63Tsd/IMG-1980.jpg
[03:03:18] <rmustacc> Oh, you have a crash dump this time.
[03:05:11] <liv3010m> can we do some kmdb stuff with it?
[03:06:27] <rmustacc> If you boot in your previous BE, you should see a /var/crash/ directory that'll have a file like vmdump.<num>
[03:11:08] <liv3010m> let me check
[03:14:29] <liv3010m> yes, it's there. 160MB
[03:19:08] <rmustacc> So you should be able to run savecore on that file.
[03:29:35] <liv3010m> OK, done
[03:42:49] <liv3010m> should I run "echo '::panicinfo\n::cpuinfo -v\n::threadlist -v 10\n::msgbuf\n*panic_thread::findstack -v\n::stacks' | mdb 0 > ~/crash.0" as instructed at https://illumos.org/docs/user-guide/debug-systems/#gathering-information-from-a-crash-dump ?
[06:40:55] <rmustacc> liv3010m: The most useful thing to me would probably be to actually run something like 'bge_attach+1e9::dis'
[07:39:57] *** psarria <psarria!~phyre___@> has joined #illumos
[07:43:24] <tsoome> the mdb build issue, I did update my build host too, that is likely contributing factor:)
[08:54:09] <gitomat> [illumos-gate] 12432 vntsd: NULL pointer errors -- Toomas Soome <tsoome at me dot com>
[09:17:36] <gitomat> [illumos-gate] 7119 boot should handle change in physical path to ZFS root devices -- Joshua M. Clulow <jmc at joyent dot com>
[09:19:01] <tsoome> whee
[10:43:43] <toasterson1> @freenode_neirac:matrix.wegmueller.it: unfortunately I haven't gotten the chance to port delve yet so you cant run the debugger. I do not see the test that failed. It might be a bashfile issue try running bash with -x to see what the interpreter does.
[10:44:58] <toasterson1> also are you running this under some git directory? Go has a bug that its test suite fails when run inside any directory tree which is inside git.
[11:34:59] <sjorge> Any of the ldapusers have an idletimeout configured?
[11:35:17] <sjorge> If I set mine to 120 (2 minutes) ldapclient/nscd refuse to do queries after a few hours
[11:35:37] <sjorge> Without an idletimeout set it looks stable for about 12h so far
[11:36:05] <sjorge> Is that a bug in ldapclient where it doesn't reconnect after the server closes the connection due to the timeout?
[11:37:54] <wilbury> don't remember if i had so low timeout
[12:15:49] <jlevon> tsoome: so I'm still not seeing your build failure
[12:16:06] <tsoome> do you have updated build host?
[12:16:08] <jlevon> yes
[12:16:28] <jlevon> one mo
[12:16:35] <jlevon> maybe I have an old proto?
[12:16:45] <jlevon> ah yeah that's it
[12:16:59] <tsoome> now you get?
[12:17:41] <jlevon> yes
[12:18:03] <tsoome> phew.
[12:52:48] <liv3010m> @rmustacc: thanks. Here is the output from running 'bge_attach+1e9::dis' inside mdb (with -k 0 as parameters)
[12:52:55] <liv3010m> https://i.ibb.co/stnt18z/IMG-1982.jpg
[13:11:11] <andyf> Oh, just got WARNING: kmem_alloc(): sleeping allocation with size of 0 from sdev_ncache_write()
[13:13:24] <jlevon> what's the stack
[13:13:42] <andyf> https://paste.ec/paste/ST+x37+q#9ogSCwuHWloP3q3fJLN6mhhweJWhL90kmvh0YtKZFDx
[13:30:42] <Woodstock> why is that a warning and not a panic?
[13:31:01] <andyf> That's the way it got integrated. There's a tunable for making it panic
[13:31:54] <andyf> it is unfortunately not reproducible.. I changed it to panic and did 10 reboots and it did not occur
[13:32:45] <jlevon> you don't want it to panic since there are various scenarios where it happens
[13:32:50] <jlevon> during normal boot
[13:33:02] <jlevon> it doesn't always cause problems, although it's always a bug
[13:42:58] <am11> jbk: to enable debug on libunwind project build, we need to pass --enable-debug to `./configure` (i think), or `CPPFLAGS=-DDEBUG ./configure`
[13:43:22] <am11> (based on https://github.com/libunwind/libunwind/blob/0c1d83a661d1daed7bb7e8389e9bdef7a4bb614f/configure.ac#L224)
[13:44:04] <am11> i'm not autoconf expert :)
[13:50:25] <tsoome> ptribble https://paste.ec/paste/lXqWXmD-#iFDvdoI-bComQA0Htx77toY4JIkI0L7OrkSXGVULd6x
[13:52:39] <ptribble> Is that with current bits? Is that a new pool or one that's had zpool upgrade run on it?
[13:53:37] <tsoome> ou, right its pool with m20, but feature flags enabled
[13:54:21] <tsoome> I haven't managed to create packages and create new be:D
[13:55:23] <ptribble> I'm working on a script for that - just have so many other things to get done
[14:01:15] *** psarria <psarria!~phyre___@> has joined #illumos
[14:24:42] *** nde <nde!uid414739@gateway/web/irccloud.com/x-jrnksnnprdqpngyh> has joined #illumos
[16:16:12] <rmustacc> andyf: When it occurs, it actually records a stack in the dedicated kmem log, so you don't need to go back and reproduce, fwiw.
[16:17:32] <jlevon> he has that stack
[16:17:38] <jlevon> but it's pretty generic
[16:19:07] <rmustacc> If the negative cache was updated to zero entries, that'd do it, fwiw.
[16:28:11] *** neirac <neirac!~neirac@pc-215-101-46-190.cm.vtr.net> has joined #illumos
[16:29:10] <toasterson1> does our bhyve have rbd support by any chance?
[16:31:00] <jbk> it does not
[16:32:00] <toasterson1> I assumed so :)
[16:36:22] <jbk> i thought there was talk of making the storage backend in bhyve more modular (which would enable things like that easier), but not sure what the status of that is
[16:37:48] <neirac> toasterson1 thanks!, I'll check those pointers, yes tests are passing. My interest is to build nomad out of the box in illumos
[16:41:04] <toasterson1> There was talk about a ggate backend although I have yet to find out what that is
[16:41:21] <rmustacc> liv3010m: Thanks. I just did something pretty stupid while refactoring. Should have a fix shortly.
[16:44:15] <rmustacc> liv3010m: I updated https://fingolfin.org/tmp/bge. It'll be differently broken with sha b4e336f834e8eba5560c7efde3576ae3cc7bc812c1dc5ed2a9efa9b0510aba94.
[18:34:14] <liv3010m> @rmustacc: don't worry! I'll try the new version and let you know
[18:34:32] <rmustacc> Thanks, appreciate it!
[19:01:44] * liv3010m devfsadm: driver failed to attach: bge
[19:02:17] <liv3010m> Warning: Driver (bge) successfully added to system but failed to attach
[19:02:29] <rmustacc> Is there anything in /var/adm/messages about why?
[19:03:11] <liv3010m> something about pcplusmp but nothing strange
[19:03:17] <liv3010m> I took a Photo from it
[19:04:16] <liv3010m> and also did the mdb ::msgbuf
[19:04:25] <liv3010m> and took a photo from it too
[19:05:10] <liv3010m> but booth show pcplusmp and nothing more
[19:05:22] <rmustacc> Did you by chance lose the PCI ID?
[19:05:22] <liv3010m> both*
[19:06:01] <liv3010m> i used pci14e4,16b1, let me double check if it was correct
[19:07:38] <rmustacc> It shouldn't have changed there. I'm just not immediately sure what would have changed. Hmm.
[19:08:33] <liv3010m> you are saying that the device ID changed in some way?
[19:08:49] <rmustacc> No, I'm saying that wouldn't have changed.
[19:08:56] <liv3010m> ahhh, OK
[19:09:10] <rmustacc> I'm just not sure what things in the bge driver would have changed. Sometimes that can happen because an unknown symbol or other thing happens.
[19:09:22] <rmustacc> Or because there's a missing PCI ID in /etc/driver_aliases, which your add_drv/update_drv should account for.
[19:09:28] <rmustacc> So, I'm just not quite sure.
[19:10:42] <liv3010m> I think update_drv was fine
[19:10:43] <liv3010m> https://i.ibb.co/P6JtWV3/IMG-1983.jpg
[19:11:17] <rmustacc> Oh, so it attached after the update.
[19:11:31] <rmustacc> Huh, weird. It didn't.
[19:11:32] <rmustacc> OK.
[19:11:42] <rmustacc> Let me put together a DTrace one liner to explore what has changed here.
[19:11:45] <liv3010m> emmm, it seems so, but the OS reports that it failed to attach
[19:11:53] <rmustacc> We know that bge attach made it, but something else failed.
[19:11:53] <liv3010m> okey :)
[19:12:01] <rmustacc> Are you able to get text files off of the system?
[19:12:23] <liv3010m> yeah, USB flash drive mount/umount ;)
[19:12:28] <rmustacc> Because what I'm going to do will generate a lot of text output.
[19:12:36] <rmustacc> And I don't think taking pictures will do well for either of us, haha.
[19:12:47] <liv3010m> okey!
[20:30:00] <rmustacc> liv3010m: dtrace -Z -F -o /var/tmp/bge.out -n 'fbt::bge_attach:entry{ self->t = 1; }' -n 'fbt::bge_detach:return{ self->t = 0; }' -n 'fbt:::/self->t/{ trace(arg0); trace(arg1); }'
[20:30:08] <rmustacc> If I've got that right that'll be what you want to run around the same time.
[20:54:01] *** neuroserve <neuroserve!~toens@ip-88-152-243-162.hsi03.unitymediagroup.de> has quit IRC (Ping timeout: 265 seconds)
[21:30:50] <liv3010m> @rmustacc: I'll copy and run it
[21:31:33] <rmustacc> liv3010m: Thanks. You want to do that while you then try to attach it.
[21:32:07] <liv3010m> Ah okey. perfect, will do then.
[21:36:16] <KungFuJesus> rmustacc: doing an image-update and then I should be ready
[21:37:18] <KungFuJesus> I can test this with LACP too, if you'd like, hah
[21:38:00] <rmustacc> KungFuJesus: Ironically right now it's a bit broken, especially with LACP.
[21:38:23] <rmustacc> We found the updates for the new chipset worked and then we tried things with LACP and... it didn't end well.
[21:38:44] <rmustacc> So right now I have it so it no longer panics (we were blowing a debug assertion), but I haven't quite gotten back to working.
[21:39:43] <clapont> hi. any idea how to kill a "zpool export" process hanging, because the Zpool is suspended? no signal helped; truss says nothing; the process is not dead/zombie: root 14759 2297 0 18:22:36 pts/12 0:00 zpool export test-2
[21:40:51] <KungFuJesus> funny I had a panic mid image-update, not sure why yet, but it seemed to be in the network/physical service
[21:41:07] <KungFuJesus> I did plug in the cable, but I wouldn't think that would cause panics without a driver for it
[21:41:17] <clapont> solaris10 sparc, not illumos. to open a SR request I need the explorer working and it hangs because of the same reason :-)
[21:41:22] *** BOKALDO <BOKALDO!~BOKALDO@> has quit IRC (Quit: Leaving)
[21:42:22] <KungFuJesus> oh no wait, that's not where the panic was
[21:42:40] <KungFuJesus> panicstack = unix:real_mode_stop_cpu_stage2_end+ba2d () | unix:trap+11b1 () | unix:cmntrap+e9 () | fffffffdf78ff930 () | zfs:zfs_zaccess_aces_check+94 () | zfs:zfs_zaccess_common+c8 () | zfs:zfs_zaccess+f8 () | zfs:zfs_zaccess_rwx+2e () | zfs:zfs_access+86 () | genunix:fop_access+5d () | genunix:vn_openat+458 () | genunix:copen+4a9 () | genunix:openat+29 () | genunix:open+1c () |
[21:42:46] <KungFuJesus> unix:brand_sys_syscall+1fe () |
[21:43:47] <rmustacc> That is unfortunate.
[21:43:58] <rmustacc> I'm not sure how much I'll be able to really help dig into that, tbh.
[21:44:58] <liv3010m> OK, this is the result of running dtrace (it's really sparse): dtrace: <some-number> drops on CPU 0
[21:45:02] <liv3010m> dtrace: <some-number> drops on CPU 1
[21:45:05] <liv3010m> and so on..
[21:46:10] <liv3010m> and before it it showed: dtrace: description 'fbt::bge_attach:entry' matched 1 probe
[21:46:31] <rmustacc> liv3010m: That's ok.
[21:46:35] <liv3010m> dtrace: description 'fbt::bge_detach:return' matched 1 probe
[21:47:06] <liv3010m> dtrace: description 'fbt:::' matched 74075 probes
[21:47:24] <rmustacc> So a bunch of the data should have ended up in the file we specified with -o.
[21:47:45] <liv3010m> so we have these 3 lines and after them the other 4 (CPU[0-3])
[21:48:14] <liv3010m> ok I'll do ctrl-c then and copy it over
[21:48:35] <liv3010m> that's correct_
[21:48:36] <liv3010m> ?
[21:49:26] <rmustacc> So, once you have that running, you'll want to try and force the driver to attach.
[21:49:32] <rmustacc> You should see the message about it failing to.
[21:49:38] <rmustacc> Then ctrl+c and copy the file over.
[21:51:40] <liv3010m> maybe I didn't express myself correctly: the lines I told you about showed up on screen while dtrace was running and update_drv was run at the same time
[21:52:20] <liv3010m> I can run update_drv again to see if something was not right
[21:53:58] <rmustacc> You had the '-o' running right?
[21:54:05] <rmustacc> Erm, that argument.
[21:54:11] <rmustacc> Was there anything in that file?
[21:55:34] <liv3010m> emm if it wasn't on the original dtrace "snippet" then no haha
[21:55:50] <liv3010m> I'll do it
[21:55:53] <liv3010m> sorry
[21:56:22] <liv3010m> ah yes, it's there, didn't see it
[21:56:41] <liv3010m> I'll check the bge.out file
[21:58:04] <liv3010m> yeah we have a 40MB file now
[22:01:52] <rmustacc> OK, great. That's what I hoped for.
[22:02:04] <liv3010m> :)
[22:02:28] <rmustacc> I basically told it to instrument every function we run during bge_attach with dtrace so we can figure out where it's failing.
[22:02:56] <sjorge> wilbury I removed the idletimeout and been stable since yesterday
[22:03:06] <sjorge> So there is probably a bug somewhere where it doesn't reconnect
[22:03:14] <sjorge> I had a look at the code but got lost :(
[22:03:51] <sjorge> Setting a low idletimeout like 30 or so can trigger the behavior after a few hours (but not immediately after 30 * server_num)
[22:04:45] <liv3010m> @rmustacc: do you want me to send/upload it?
[22:05:25] <KungFuJesus> ugh, one of these days I'm going to tear apart IPS/pkg and actually speed it up
[22:05:57] <KungFuJesus> ok, finally booting the new be
[22:07:09] <tsoome> daamit, I forgot the debug build can reveal more issues...
[22:08:11] <KungFuJesus> rmustacc: what do you need me to do?
[22:14:03] <rmustacc> liv3010m: That'd be great, yeah.
[22:14:45] <rmustacc> KungFuJesus: Unfortunately, it looks like I regressed the bge driver at the moment when trying to fix the lacp issue. So I need to go through the bits liv3010m grabbed and fix things. Sorry.
[22:15:00] <KungFuJesus> ok no problem, can test this another day, then
[22:15:56] <KungFuJesus> aggr was recently modified to leverage pseudo rx rings - not sure if that has any effect on the LACP support problems you're having
[22:17:08] <KungFuJesus> I guess that really only matters if the NIC provides more ring buffers - not sure if it does
[22:19:36] <rmustacc> KungFuJesus: No, this was just a long standing issue in bge.
[22:19:54] <rmustacc> The way it handles rings has been broken and unfortunately can lead to memory corruption I believe.
[22:20:08] <rmustacc> We stumbled over it as an assert.
[22:23:02] <KungFuJesus> always good to find bugs you didn't know were there :)
[22:24:11] <rmustacc> Sometimes, yeah. Though you don't want the stack to get too deep.
[22:26:27] <KungFuJesus> Yeah hopefully the bug is isolated to just bge
[22:26:47] <rmustacc> Yes, this thing is very specific to bge.
[22:27:02] <rmustacc> It's an issue with how it assigns and tracks its own mac address registers.
[22:27:14] <rmustacc> So it'd be hard for others to be there.
[22:28:41] <KungFuJesus> heh, seems like an ARP table would be something that's entirely offloaded to the PHY rather than something the kernel has to deal with
[22:29:08] <KungFuJesus> I didn't get very far through that datasheet, so I definitely don't know what I'm talking about
[22:29:48] <rmustacc> It's not quite that.
[22:29:54] <rmustacc> This isn't the ARP table.
[22:30:08] <rmustacc> Basically you can program a hardware device to receive packets from certain mac addresses.
[22:30:31] <rmustacc> This is the difference between normal mode (get broadcast, multicast, and my own mac address), and promiscuous, get everything.
[22:31:03] <rmustacc> So when you add another vnic on top of the hardware device, if we can program hardware to filter for that mac address then we do. If we can't, we put it in promisc mode.
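The fallback described in the last three messages can be modelled in a few lines. This is a toy sketch only, not how bge or the illumos MAC layer is written: the class `NicModel`, the slot count, and every method name are hypothetical, and the filtering is modelled on the destination address of each received frame.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


def is_group_address(mac: str) -> bool:
    """True for multicast and broadcast: the low bit of the first octet is set."""
    return int(mac.split(":")[0], 16) & 1 == 1


class NicModel:
    """Toy NIC with a fixed number of spare hardware unicast filter slots."""

    def __init__(self, primary_mac: str, spare_slots: int) -> None:
        self.filters = {primary_mac.lower()}  # addresses programmed in "hardware"
        self.spare_slots = spare_slots
        self.promiscuous = False

    def add_unicast(self, mac: str) -> bool:
        """Add a MAC (e.g. for a new vnic). Returns True if a hardware filter
        was used, False if the device had to fall back to promiscuous mode."""
        mac = mac.lower()
        if mac in self.filters:
            return True
        if self.spare_slots > 0:
            self.filters.add(mac)
            self.spare_slots -= 1
            return True
        self.promiscuous = True
        return False

    def accepts(self, dst_mac: str) -> bool:
        """Normal mode: broadcast, multicast, and programmed unicast addresses.
        Promiscuous mode: everything."""
        dst = dst_mac.lower()
        return self.promiscuous or is_group_address(dst) or dst in self.filters
```

With one spare slot, the first extra address is filtered in hardware and the second one flips the model into promiscuous mode, after which software has to sort out which frames belong to which vnic.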
[22:34:09] <KungFuJesus> ah yeah, I imagine promiscuous mode is much more inefficient then letting the MAC handle that
[22:34:16] <KungFuJesus> than*
[22:37:53] <KungFuJesus> might even be necessary for large receive offload
[22:38:36] <rmustacc> It's not really. But I think doing that in sw is the approach most folks prefer these days, tbh.
[22:38:52] <rmustacc> At least, that's how my prototype of that worked.
[22:40:03] <rmustacc> sjorge: I have a working prototype of the dladm bits there.
[22:40:26] <rmustacc> Not sure if you have an interest in playing around with that.
[22:41:07] <liv3010m> @rmustacc: compressed it came down to just 3MB :)
[22:41:15] <liv3010m> https://mega.nz/file/rw4XBIQS#xPf0EHBCNTONaLPBb-zwB_p8j5neKwhf1uaCD01X32o
[22:42:04] <rmustacc> thanks.
[22:43:07] <liv3010m> you are welcome
[22:50:58] <rmustacc> Hmm. It looks like that hammer was a little too big and the flowindent didn't quite work out.
[22:51:32] <rmustacc> liv3010m: Would you mind if I asked you to do a variant of what we last did?
[22:51:47] <liv3010m> of course not, please tell me
[23:00:32] <rmustacc> I think this will be a bit easier to work with:
[23:00:37] <rmustacc> dtrace -n 'fbt:vioif::{ printf("%d", timestamp); trace(arg0); trace(arg1); }'
[23:00:45] <rmustacc> Oh, that's missing the -o, so add that to point to a file.
[23:06:09] <gitomat> [illumos-gate] 12446 3KSTAT manual pages could be clearer about data life time -- Robert Mustacchi <rm at fingolfin dot org>
[23:07:03] <liv3010m> no prob! I'll report ASAP
[23:07:13] <rmustacc> Thanks!
[23:07:41] <liv3010m> ;)
[23:09:07] <liv3010m> I mean should we have those 2 too?
[23:12:04] <rmustacc> Oh, no. It doesn't.
[23:12:11] <rmustacc> Ugh, that's the wrong thing entirely.
[23:12:20] <rmustacc> Sorry, I pasted a variant that I was using to match probes.
[23:12:43] <rmustacc> dtrace -Z -o /path/to/output -n 'fbt:vioif::{ printf("%d", timestamp); trace(arg0); trace(arg1); }'
[23:12:51] <rmustacc> dtrace -Z -o /path/to/output -n 'fbt:bge::{ printf("%d", timestamp); trace(arg0); trace(arg1); }'
[23:12:54] <rmustacc> OK, that last one.
[23:17:39] <liv3010m> OKey!
[23:27:18] <liv3010m> here we go: https://mega.nz/file/6o4hFSIA#xPf0EHBCNTONaLPBb-zwB_p8j5neKwhf1uaCD01X32o
[23:27:50] <liv3010m> it's a lot smaller than the first one
[23:32:00] <rmustacc> liv3010m: That got me the same file.
[23:32:13] <liv3010m> mmm let me check
[23:32:33] <liv3010m> you were right haha
[23:32:35] <liv3010m> https://mega.nz/file/f0xFWIIB#Pifu2CcS8-71CB3xMAykZ1_pXRhpVwMLWhH6iWM6gD4
[23:37:45] <rmustacc> OK, that's given me a pretty good lead.
[23:40:15] <rmustacc> I understand how I got us here. I have to figure out what the best way to fix this is.
[23:40:19] <rmustacc> It'll take me a bit of time.
[23:41:13] <liv3010m> No problem, please let me know when I can help
[23:41:45] <rmustacc> There's just a lot of code that's not executing, so I have to figure out what the right answer is.
[23:45:35] <liv3010m> If it is easier for you to have an "intermediate" binary to run and debug it please don't hesitate and let me know, I can try as many builds as necessary
[23:47:28] <rmustacc> Thanks. At this point I know the current bug. More of a question about risk of how to fix it.
[23:47:42] <rmustacc> The history here is a bit more complicated than I would have liked.
[23:48:37] <liv3010m> I understand. Is this doable?
[23:49:26] <rmustacc> Oh, yeah. It's all doable, don't worry.
[23:49:44] <rmustacc> I just have to argue with myself a bit, first.
[23:50:22] <liv3010m> thanks @rmustacc
