Thread affinity and transparent huge pages benefits

wujj123456

Joined: 14 Sep 08
Posts: 103
Credit: 37,305,836
RAC: 54,975
Message 71000 - Posted: 16 Jun 2024, 21:37:43 UTC
Last modified: 16 Jun 2024, 21:42:25 UTC

Finally got enough WUs to have some nice plots while messing with optimizations. The benefits of affining threads and enabling transparent huge pages are expected for memory-intensive workloads, so this is just putting some quantitative numbers behind empirical expectations for this specific OIFS batch.

The vertical axis is task runtime in seconds; the horizontal axis is one point per sample, ordered by return time from oldest to latest.



One host is a 7950X, and you can easily see when I started the optimizations around sample 60; they reduced runtime by ~7-8%. (Samples ~44-58 are when I gambled with 13 tasks on the 64GB host. While nothing errored out, it was not a bright idea for performance either. Those points were excluded from the percentage calculation.)



This one is more complicated. It's a 7950X3D, but running a Linux VM on Windows. I run 6 OIFS tasks in the VM and 16 tasks from other projects on Windows. The dots around samples 30, 42 and 58 are peak hours when I paused the Windows BOINC client but didn't pause the VM.

The setup is a 6C/12T VM bound to the X3D cluster, cores [2,14) on Windows. The first drop around sample 40 is from enabling huge pages and affining threads inside the VM. That's about a 10% improvement, whether I compare non-peak or peak samples. The second drop around sample 70 is when I started affining Windows BOINC tasks away from the VM cores. Now it's getting pretty close to the peak samples where I only run the 6 OIFS tasks inside the VM without the 16 Windows tasks.

Appendix - Sharing the simple commands and code

Enabling huge pages at run time. This is only effective for the current boot. You can set `transparent_hugepage=always` on the kernel cmdline to make it persist across boots, but how you do that is distro dependent, so I'm leaving that out.
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
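
To double-check which mode is active (the one shown in brackets), just read the same file back:
cat /sys/kernel/mm/transparent_hugepage/enabled
# prints e.g. "[always] madvise never" once the change took effect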

Verify huge page usage. Some huge pages could exist even before this due to the default `madvise` mode, but the number will increase significantly once you set it to `always`, which allows the kernel to combine pages as it sees fit. AFAICT, most OIFS usage is covered by huge pages just going by the rough numbers.
grep Huge /proc/meminfo

Affinitizing threads on Linux is done through `taskset`.
# Set pid 1234 to core 0-1
sudo taskset -apc 0,1 1234
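
You can read a process's current affinity back with the same tool, which is handy for checking that the script below actually took effect:
# Show the current affinity list of pid 1234
taskset -cp 1234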

To find out your CPU topology, such as which CPU number belongs to which L3 or SMT sibling, use `lstopo` and check the `P#` values. This is important because we don't want to bind two tasks onto SMT siblings; I bind each task to the two sibling threads of one core.
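
If you don't have `lstopo` installed, sysfs exposes the same sibling information on most Linux kernels:
# Each line lists the hardware threads that share one physical core
grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list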

Putting it all together, I have a script invoked from crontab every 10 minutes (see the entry after the script). Make sure you tune the `i=2`, `$i,$(($i+8))` and `i=$(($i+1))` to match your topology; they control how each task gets assigned to cores.

#!/bin/bash

# Bind each running OIFS task to a pair of hardware threads.
# On my topology those are CPUs $i and $i+8; tune the starting core (i=2)
# and the +8 offset to match your own lstopo output.
i=2
for pid in $(pgrep oifs_43r3_model | sort); do
        taskset -apc $i,$(($i+8)) $pid
        i=$(($i+1))
done
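
The crontab entry is the usual every-10-minutes pattern; the path is just wherever you saved the script:
# crontab -e, then add:
*/10 * * * * /path/to/affine_oifs.sh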

If your host is native Linux, you can stop here. For Windows, the "Details" tab in Task Manager lets you set affinity. I use that to set affinity for the VMware process, while also affining threads inside the Linux guest as above. It seems to be a 1:1 mapping, at least for VMware Workstation 17. (I can't figure out how to verify this other than looking at per-core usage on the Windows host, which roughly matches the Linux guest for the affected cores, so this is a bit handwavy.)
Meanwhile, to bind everything else away, I use a PowerShell script loop. `$names` holds the other BOINC process names I want to bind away from the cores used by the VM, and `$cpumask` is the decimal value of the CPU affinity mask. Make sure you change both for your needs.
$names = @('milkyway_nbody_orbit_fitting_1.87_windows_x86_64__mt','wcgrid_mcm1_map_7.61_windows_x86_64','einstein_O3AS_1.07_windows_x86_64__GW-opencl-nvidia-2')
$cpumask = 4294950915  # 0xFFFFC003

While ($true) {
        @(Get-Process $names) | ForEach-Object { $_.ProcessorAffinity = $cpumask }
        Start-Sleep -Seconds 300
}
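
If you'd rather not work out the decimal mask by hand, a few lines of PowerShell can build it from a core list. The cores below are mine (0-1 plus 14-31, i.e. everything outside the VM's cores [2,14)); adjust them for your topology:
# Build an affinity mask from a list of logical core numbers
$cores = @(0..1) + @(14..31)
$mask = [int64]0
foreach ($core in $cores) { $mask = $mask -bor ([int64]1 -shl $core) }
"{0} (0x{1:X})" -f $mask, $mask   # 4294950915 (0xFFFFC003)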

PS: I don't really do any programming on Windows. Can someone please tell me how to get PowerShell to accept hex? The `0x` prefix is supposed to work according to the documentation, but I get a `SetValueInvocationException` if I use hex.
ID: 71000
Glenn Carver

Joined: 29 Oct 17
Posts: 944
Credit: 15,193,000
RAC: 23,306
Message 71002 - Posted: 17 Jun 2024, 11:23:07 UTC - in response to Message 71000.  
Last modified: 17 Jun 2024, 11:24:08 UTC

Worth adding that hugepages are beneficial because it can reduce TLB misses (translation lookaside buffer); essentially a TLB miss means accessing data from next level down storage (whatever that might be).

Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.

What size hugepages are you using? We would normally test enabling hugepages on HPC jobs. However, just on the batch jobs, not on the entire machine. Also, setting it too high could slow the code down. It has to be tested as you've done. I'd want to be sure it's not adversely affecting the rest of the machine though.

I have played with task affinity using task manager on Windows 11 but it made no difference (unless Task Manager was lying to me). When I get more time I'll have another go.
---
CPDN Visiting Scientist
ID: 71002
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 33,830,902
RAC: 23,947
Message 71004 - Posted: 17 Jun 2024, 14:21:01 UTC - in response to Message 71002.  

Worth adding that hugepages are beneficial because it can reduce TLB misses (translation lookaside buffer); essentially a TLB miss means accessing data from next level down storage (whatever that might be).


Not really - it's not having to access data from the next level down of storage, it's having to do a (probably partial) page table walk, which means memory accesses that aren't really accomplishing anything (other than working out the virtual-to-physical mappings... which kind of have to happen before any other accesses can happen). Depending on how the processor's TLB is arranged, and whether it has a fixed number of mappings for large pages, it may or may not help a lot, but it's certainly worth trying, and I'd expect some performance improvements, as have been shown. But one can build a processor TLB design where enabling large pages hurts. I'm just not sure how modern x86 chips are doing things these days...


Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.


Are any of the CPDN tasks actually disk bound with even a slow SSD? I don't see much in the way of disk accesses that strike me as "something improved by a ramdisk," though chewing up the RAM I have would certainly reduce the number of tasks I can run. I need to upgrade the RAM in a few of my boxes...
ID: 71004
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1106
Credit: 17,025,459
RAC: 5,652
Message 71006 - Posted: 17 Jun 2024, 14:42:32 UTC - in response to Message 71004.  

Are any of the CPDN tasks actually disk bound with even a slow SSD? I don't see much in the way of disk accesses that strike me as "something improved by a ramdisk


This is what my main Linux machine is doing. No CPDN tasks are available at the moment. All my BOINC data are on a 7200 rpm spinning hard drive, on a partition all its own. The other partitions on that drive are seldom used (mainly videos).

Notice there are 14 tasks running in total: 13 BOINC tasks plus the boinc client. The machine is also running Firefox, where I am typing this, but I do not type fast enough to put a noticeable load on the 16-core machine.

top - 10:29:07 up 11 days, 22:57,  2 users,  load average: 13.15, 13.43, 14.00
Tasks: 473 total,  14 running, 459 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.1 sy, 80.9 ni, 18.3 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem : 128086.0 total,   1816.4 free,   6809.1 used, 119460.6 buff/cache
MiB Swap:  15992.0 total,  15848.5 free,    143.5 used. 117967.0 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
2250093    5542 boinc     39  19 R 377464   0.3  99.3  2 383:13.99 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.05_x86_64-pc-li+ 
2279983    5542 boinc     39  19 R 376732   0.3  99.6 15 130:59.06 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.05_x86_64-pc-li+ 
2286174    5542 boinc     39  19 R 376492   0.3  99.5  6  87:23.78 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.05_x86_64-pc-li+ 
2288140    5542 boinc     39  19 R 213000   0.2  99.5 13  72:19.73 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+ 
2288142    5542 boinc     39  19 R 212912   0.2  99.6  2  72:19.53 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+ 
   5542       1 boinc     30  10 S  46240   0.0   0.1  5 233487:01 /usr/bin/boinc                                                            
2295746    5542 boinc     39  19 R  40780   0.0  99.5  7  16:07.92 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2286177    5542 boinc     39  19 R  40200   0.0  99.5  0  87:34.88 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2295175    5542 boinc     39  19 R  39228   0.0  99.5  1  20:11.83 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2287722    5542 boinc     39  19 R  38920   0.0  99.5  3  74:56.89 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2289384    5542 boinc     39  19 R  38900   0.0  99.5  4  67:00.71 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc+ 
2287207    5542 boinc     39  19 R   2528   0.0  99.7  9  67:00.94 ../../projects/denis.usj.es_denisathome/HuVeMOp_0.02_x86_64-pc-linux-gnu  
2294141    5542 boinc     39  19 R   2508   0.0  99.5 12  28:41.28 ../../projects/denis.usj.es_denisathome/HuVeMOp_0.02_x86_64-pc-linux-gnu  
2294649    5542 boinc     39  19 R   2508   0.0  99.7  5  25:35.86 ../../projects/denis.usj.es_denisathome/HuVeMOp_0.02_x86_64-pc-linux-gnu  


ID: 71006
wujj123456

Joined: 14 Sep 08
Posts: 103
Credit: 37,305,836
RAC: 54,975
Message 71007 - Posted: 17 Jun 2024, 17:05:33 UTC - in response to Message 71002.  
Last modified: 17 Jun 2024, 17:05:50 UTC

Worth adding that hugepages are beneficial because it can reduce TLB misses (translation lookaside buffer); essentially a TLB miss means accessing data from next level down storage (whatever that might be).

SolarSyonyk had it right. The benefit is not necessarily from reducing next-level accesses. The entire page walk can hit in cache and still hurt performance a lot: a TLB miss effectively means that specific memory access is blocked because it needs the physical address first, so whatever latency the page walk incurs is in addition to the normal hit or miss for the data once the address is available. If the page walk actually misses in cache, that compounds and destroys performance quickly. Modern micro-architectures have hardware page walkers that try to get ahead and hide the latency too. Still, TLB misses are to be avoided as much as possible for memory-intensive applications, which is how huge pages help: each TLB entry covers a much larger area of memory. The kernel doc page explains it succinctly if anyone is interested: https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html

Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.

This is the opposite of what one wants to do here. The allocations this workload makes are anonymous memory, and a ramdisk can only potentially help with file-backed memory. Besides, in any scenario where memory capacity or bandwidth is already the bottleneck, the last thing you want is to force files, or even pages not in the hot path, into memory.

What size hugepages are you using? We would normally test enabling hugepages on HPC jobs. However, just on the batch jobs, not on the entire machine. Also, setting it too high could slow the code down. It has to be tested as you've done. I'd want to be sure it's not adversely affecting the rest of the machine though.

I'm enabling the transparent huge page (THP) feature in the kernel, and AFAIK it only uses 2MB huge pages. For applications we control, we use a combination of 2MB and 1GB pages in production because we can ensure the application only requests the sizes it needs. Here, however, I have no control over the application's memory allocation calls, so THP is the only thing I can do. Another concern with THP is wasted memory within huge pages causing additional OOMs, which I didn't observe even when I only had about 1GB of headroom going by 5GB per job. Empirically that makes sense, given the memory swings of OIFS are hundreds of MB at a time, so 2MB pages shouldn't result in many partially used pages.

FWIW, this is the current state on my 7950X3D VM with 32GB of memory. More than half of the memory is covered by 2MB pages, and the low split stats should mean the huge pages were actively put to good use during their lifetime.
$ egrep 'trans|thp' /proc/vmstat
nr_anon_transparent_hugepages 9170
thp_migration_success 0
thp_migration_fail 0
thp_migration_split 0
thp_fault_alloc 190586370
thp_fault_fallback 12973323
thp_fault_fallback_charge 0
thp_collapse_alloc 8711
thp_collapse_alloc_failed 1
thp_file_alloc 0
thp_file_fallback 0
thp_file_fallback_charge 0
thp_file_mapped 0
thp_split_page 13881
thp_split_page_failed 0
thp_deferred_split_page 12984
thp_split_pmd 27158
thp_scan_exceed_none_pte 18
thp_scan_exceed_swap_pte 23689
thp_scan_exceed_share_pte 0
thp_split_pud 0
thp_zero_page_alloc 2
thp_zero_page_alloc_failed 0
thp_swpout 0
thp_swpout_fallback 13872

$ grep Huge /proc/meminfo 
AnonHugePages:  18757632 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
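
Working out "more than half" from the numbers above: AnonHugePages 18757632 kB ÷ 2048 kB per huge page ≈ 9159 pages, which lines up with nr_anon_transparent_hugepages (9170), and 18757632 kB out of 32GB (33554432 kB) is roughly 56% of guest memory.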
ID: 71007
wujj123456

Joined: 14 Sep 08
Posts: 103
Credit: 37,305,836
RAC: 54,975
Message 71008 - Posted: 17 Jun 2024, 17:23:19 UTC - in response to Message 71002.  
Last modified: 17 Jun 2024, 17:29:02 UTC

Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.

Oh, I think I get what you were getting at now. Are you referring to the data/checkpoints written to disk? I assumed they are not application blocking, as the application's writes just get buffered by the page cache and flushed to disk asynchronously by the kernel.

If an OIFS job actually waits for the flush like database applications do, then it could matter in some cases. AFAICT, each OIFS task writes ~50GB to disk. Assuming a 5-hour runtime on a fast machine, that's about 3MB/s on average, but it all happens as periodic spikes of large sequential writes. If it's spinning rust with 100MB/s of write bandwidth, I guess that could be ~3% of the time spent on disk writes, and worse with multiple tasks if they are not splayed. It's likely not worth considering for SSDs (especially NVMe ones) even if the writes are synchronous, and all my hosts use SSDs for applications...
ID: 71008
Glenn Carver

Joined: 29 Oct 17
Posts: 944
Credit: 15,193,000
RAC: 23,306
Message 71009 - Posted: 17 Jun 2024, 19:20:01 UTC - in response to Message 71008.  
Last modified: 17 Jun 2024, 19:25:50 UTC

Have you tried RAMdisks? They can make a huge improvement at the increased risk you lose everything if the machine goes down unexpectedly.
Oh, I think I get what you were getting at now. Are you referring to the data/checkpoints written to disk? I assumed they are not application blocking, as the application's writes just get buffered by the page cache and flushed to disk asynchronously by the kernel. ...

OIFS will wait programmatically until the write completes in the configuration we use for CPDN. That includes the model output and the restart/checkpoint files. In tests I've found the model can slow down by 5-10% depending on exactly how much model output is written, compared to a test that doesn't write anything. I've not tested using a RAMdisk on the desktop, only when I was working in HPC.

p.s. forgot to add that we usually used 4MB for hugepages when I was employed!

pp.s. thx for correcting me on tlb misses!
---
CPDN Visiting Scientist
ID: 71009
wujj123456

Joined: 14 Sep 08
Posts: 103
Credit: 37,305,836
RAC: 54,975
Message 71010 - Posted: 17 Jun 2024, 19:50:25 UTC - in response to Message 71009.  

OIFS will wait programmatically until the write completes in the configuration we use for CPDN. That includes the model output and the restart/checkpoint files. In tests I've found the model can slow down by 5-10% depending on exactly how much model output is written, compared to a test that doesn't write anything. I've not tested using a RAMdisk on the desktop, only when I was working in HPC.

Thanks for the details. Suddenly, splaying tasks at their initial start seems worth the hassle, especially if I play with those cloud instances again next time. I guess this could be one of the reasons why running larger VMs off the same disk slowed down OIFS, since the network disk had pretty low fixed bandwidth. :-(

p.s. forgot to add that we usually used 4MB for hugepages when I was employed!

Must be one of those interesting non-x86 architectures back then. AFAIK, x86 only supports 4K, 2M and 1G pages. Is that SPARC? :-P

I more or less feel x86 is held back by the 4K pages a bit. Apple M* is using 16K pages, and a lot of aarch64 benchmarks are published with a 64K page size. Some vendor we work with for data center workloads refused to support 4KB pages for their aarch64 implementation at all, due to performance reasons. ¯\_(ツ)_/¯
ID: 71010
SolarSyonyk

Joined: 7 Sep 16
Posts: 262
Credit: 33,830,902
RAC: 23,947
Message 71011 - Posted: 17 Jun 2024, 21:58:55 UTC - in response to Message 71010.  

Must be one of those interesting non-x86 architectures back then. AFAIK, x86 only supports 4K, 2M and 1G pages. Is that SPARC? :-P


No, 32-bit x86 had 4MB large pages. A single page table maps 4MB on 32-bit x86 (1024 page table entries per 4K page table), but only 2MB on 64-bit x86 (512 entries per 4K page table, because they're 64-bit entries instead of 32-bit entries).


I more or less feel x86 is held back by the 4K pages a bit. Apple M* is using 16K pages, and a lot of aarch64 benchmarks are published with a 64K page size. Some vendor we work with for data center workloads refused to support 4KB pages for their aarch64 implementation at all, due to performance reasons. ¯\_(ツ)_/¯


It just depends on what you're doing. I doubt x86 will ever change away from 4KB pages; there's too much that assumes that implicitly. Large pages get you a lot, but I'm not sure how much it matters with some of the (probably leaky...) TLB optimizations on x86 chips these days.

ARMv8/AArch64 is a whole heck of a lot more flexible, though.
ID: 71011
