KCRASH PART 2 : VIRTUAL MEMORY SYSTEM
The loadmacs file for this section will require that you obtain the following
macros which you can find in Appendix C2.
anon.k
anon_map.k
pageall.k
procvm.k
segall.k
seg_ops.k
seglst.k
segvn_data.k
swapinfo.k
vm.k
Invoke kcrash on a crash dump :
# kcrash crash.mmdd sym.mmdd
S> rg panicregs
S> < /crash/macros/loadmacs
Now that we have a dump to analyze, we need a process to dissect. In our
example, we use the current active process, denoted by "*practive". You may
choose any process; run the 'ps' macro if you want to use a process other
than "*practive". Once you have the address of the process, give it as an
argument to the "procvm" macro, which shows the portions of the process
structure that are related to VM.
S> procvm *practive
*p_seguslo D1282228 /* process's segu slot address */
*p_segu F2206000 /* pointer to seguser structure */
*p_as D1741880 /* pointer to as structure */
*p_trace 00000000 /* pointer to /proc vnode */
*p_exec D14377F8 /* pointer to a.out vnode */
p_pri 00000041 /* scheduling priority */
p_usize 0002 /* size of ublocks (*4096 bytes) */
u.u_psargs = "sleep 60 "
The process structure points to the address space structure, the seguser
structure and the vnode structure as well as many others not related to VM.
We will discuss the elements above in more detail in the subsection labeled
User Virtual Address Space below. Right now, we just want to preview the
structure of the VM system.
Since the memory address space of each process is defined by the elements of the as structure, we can review its relevant members by passing the *p_as value given by procvm to the as macro.
Figure 2 (struct as)
S> as D1741880
as [D1741880]: keepcnt 000000 segs D172B560 seglast D172CBE0 sz C000 rss B
hat: pts D1726960 ptlast D1726960 pdtp 00000000 cr3 00000000 ref -780881472
The segs entry is a pointer to a sorted, doubly linked list of segment
structures. This circular list is keyed on the base virtual address of each
segment (s_base) and is sorted in ascending order. The pointer seglast is a
pointer to the last address in this list. The total number of bytes used by
the process is given by sz and the amount of memory claimed by the process is
given in the rss field. Both sz and rss are reported in hex. The 'hat' members
pts, ptlast and pdtp point to the HAT layer of the virtual memory system.
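The hex values in the as output can be converted by hand or with a few lines of Python. This is a minimal sketch and not part of kcrash: it assumes a 4096-byte page (the size the procvm comment gives for ublocks) and assumes that rss is a page count, which the text does not state explicitly. The helper name parse_as_line is made up for illustration.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KB pages, as in the p_usize comment


def parse_as_line(line):
    """Pull the hex sz and rss fields out of one line of 'as' macro output."""
    m = re.search(r"\bsz\s+([0-9A-Fa-f]+)\s+rss\s+([0-9A-Fa-f]+)", line)
    if m is None:
        raise ValueError("no sz/rss fields found")
    return int(m.group(1), 16), int(m.group(2), 16)


line = "as [D1741880]: keepcnt 000000 segs D172B560 seglast D172CBE0 sz C000 rss B"
sz_bytes, rss = parse_as_line(line)
sz_pages = sz_bytes // PAGE_SIZE

print("sz  = %d bytes (%d pages)" % (sz_bytes, sz_pages))
print("rss = %d (assumed to be a page count)" % rss)
```

Under those assumptions, the "sleep 60" process in Figure 2 has a 12-page address space.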
The hardware address translation (HAT) layer of the VM system treats the address translation hardware as a cache that is driven by system calls and exception handlers at a much higher level in the VM system. The job of HAT is to manage the hardware. We use the as structure's "hat: pts" value as an argument to the hatpt macro.
Figure 3 (struct hatpt)
S> hatpt D1163640
D1163640: forw D153E460 back D153E460 next D153E460 prev D14A4320
pde 7D5007 pdtep E0200080 as D153AF00 aec 1A2 locks 2 pgtp D153E280
mcp[00000000 00000000 C2648500 C31B4400 C31B4480 C31B4500 C399BF00 C2648F80]
mcp[C31B4380 C2648F00 C2648080 C2648C00 C1FEB080 C1FEB300 C1FEB780 C1FEB800]
...
002 mapping hat_mcpp +offset hat_epmc +offset pte
C2648500 C2648000 C2648028 C07D511C C07D5100 00000000
003 mapping hat_mcpp +offset hat_epmc +offset pte
C31B4400 C31B4000 C31B4020 C07D519F C07D5180 00639025
...
The forw, back, next and prev entries are all pointers to related hatpt
structures. The 'pde' entry is the page directory entry (PDE) for the page
table, pdtep is the page directory table (PDT) entry pointer, as is the pointer
back to the containing address space structure, aec is the active entry count,
and locks represents the number of locked PTEs. The mcp information is the
mapping chunk pointer array. What follows are the hat_mcpp (HAT_MCPP) values,
which are pointers to the page table chunks for the 31 mapping chunks in each
page, and the hat_epmc (HAT_EPMC) values, which are the pointers to the entries
per mapping chunk. Finally, the pte value is the page table entry.
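The pde and pte values in Figure 3 can be split into a page frame address and flag bits. The sketch below assumes the dump comes from an i386-class machine (suggested by the cr3 field and the %ebp register in the disassembly elsewhere in this section) and that hatpt prints the entries in the raw hardware format. Neither point is confirmed by the text, and decode_entry is a hypothetical helper.

```python
# Low-order flag bits of an i386 page directory / page table entry.
# Assumption: hatpt prints pde and pte in the raw hardware layout.
FLAGS = [
    (0x001, "present"),
    (0x002, "writable"),
    (0x004, "user"),
    (0x020, "accessed"),
    (0x040, "dirty"),  # meaningful for PTEs only
]


def decode_entry(entry):
    """Split a 32-bit entry into its page frame address and flag names."""
    frame = entry & 0xFFFFF000
    names = [name for bit, name in FLAGS if entry & bit]
    return frame, names


# Values taken from the hatpt output in Figure 3
for label, value in (("pde", 0x7D5007), ("pte", 0x00639025)):
    frame, names = decode_entry(value)
    print("%s %08X -> frame %08X, flags: %s" % (label, value, frame, ", ".join(names)))
```

A pte of 00000000, as in mapping 002, decodes to no flags at all, that is, the entry is not present.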
Not only does the address space of each process point us to the HAT layer, it
also gives us pointers to the segments of the process. We pass to the segall
macro the address given in the 'segs' element of as.
Figure 4 (struct seg)
S> segall D172B560
s_lock @D172B560 /* lock to prevent races */
*s_base 08046000 /* base virtual address of segment */
s_size 00002000 /* size in bytes of this segment */
*s_as D1741880 /* pointer back to the containing address space */
*s_next D15C7F20 /* pointer to next seg in this address space */
*s_prev D172CBE0 /* pointer to prev seg in this address space */
*s_ops D01ADDD4 /* pointer to segment operations structure */
*s_data D17780B4 /* pointer to segvn_data */
The segall macro only shows one segment of a process at a time. In the User
Virtual Address Space discussion, we will use a different macro to display
all the segments of a process and give a detailed account of process address
space and segments. For now, we want to just use some of the pointers in the
seg structure to preview other VM related segment structures.
Some of the segment operations are segvn_fault, segvn_dup, segvn_checkprot, and segvn_getvp. Actually, there are 17 segment operations which can be seen in the seg_ops structure. To see these, we use a generated macro called seg_ops and give it the s_ops address above as an argument:
Figure 5 (struct seg_ops)
S> seg_ops D01ADDD4
*dup D0068AC0 /* duplicate the segment */
*unmap D0068CD0 /* used to unmap the segment */
*free D0069250 /*unmaps and deletes all resources used*/
*fault D0069CB0 /* used by page fault routines */
*faulta D006A180 /* used by pre-fetch pages */
*unload D006A260 /* free hats associated with pages */
*setprot D006A350 /* set page protections */
*checkprot D006A670 /* display page protections */
*kluster D006A7F0 /* used by vm for pre-fetch */
*swapout D006A930 /* used to swap out pages */
*sync D006AB00 /*write changed pages to map/swap file*/
*incore D006AD80 /* are pages in physical mem? */
*lockop D006AF70 /* lock used for segment pages */
*getprot D006A6F0 /* give page protections */
*getoffset D006A780 /* s_base */
*gettype D006A7A0 /* give page type */
*getvp D006A7C0 /* return vnode pointer */
All of the entries in seg_ops are pointers to segment operations. For example, *fault is used to handle a page fault which can be done by the segvn_fault routine:
S> di D0069CB0
segvn_fault: 55 pushl %ebp
So, using the kcrash command 'di' with any of the seg_ops addresses would display the routine that was called, so viewing the seg_ops structure can be valuable.
Within the segment structure, the other element relevant to our discussion on VM is the segvn_data structure. It is important at this time because it shows whether this segment of the process has any anonymous pages associated with it. We pass the *s_data value from segall as an argument.
Figure 6 (struct segvn_data)
S> segvn_data D17780B4
lock @D17780B4 /* lock on segment pages */
pageprot 00 /* true if per page protections present */
prot 0F /* current segment prot if pageprot==0 */
maxprot 0F /* max segment protections*/
type 02 /* type of sharing done */
*vp 00000000 /* vnode that segment is mapped to */
offset 00000000 /* starting offset of vnode for mapping */
anon_index 00000000 /* starting index into anon_map anon array */
*amp D171D2D8 /* pointer to anon_map */
*vpage 00000000 /* per-page information, if needed */
*cred D1256C00 /* pointer to credential structure */
swresv 00002000 /* amount of swap reserved for this segment */
For now, we are only interested in the *amp address because it will show us the
next VM structure, which has to do with anonymous pages. However, we will
return to this structure when we dissect segments, at which time we will also
discuss how vnodes relate to segments.
Unlike shared memory pages and stack pages, anonymous pages have no named file storage. Anonymous pages are associated with the swap device. There is an anon structure for each swap page on the system.
First we view the anon_map structure which uses the *amp pointer found in the segvn_data structure.
Figure 7 (struct anon_map)
S> anon_map D171D2D8
refcnt 00000001 /*reference count on this structure */
size 00002000 /* size in bytes mapped by the anon array */
**anon D16F7A10 /* pointer to an array of anon * pointers */
swresv 00000000 /* swap space reserved for this anon_map */
mutex @D171D2E8 /* Multiprocessing lock for segment manipulation */
As explained in the comments, the size field is the size in bytes of the
anonymous array and the swresv shows the amount of swap space reserved for this
particular anon_map. For now, we want the pointer to the anon to use as an
argument to the anon macro. Although one would think that we would use *D16F7A10
we do not.
Figure 8 (struct anon)
S> anon D16F7A10
an_refcnt 00000000 /* reference count */
un_*an_page D14CF820 /* union of page and anon */
*an_bap 00000000 /* pointer to real anon */
an_flag 0001 /* an_flag values */
an_use 0000 /* used for debuggin */
So, the "un_*an_page" field is a union of two pointers. The header file
(/usr/include/vm/anon.h) defines the union in this manner:
union {
    struct page *an_page; /* ``hint'' to the real page */
    struct anon *an_next; /* free list pointer */
} un;
The an_flag values are defined in /usr/include/vm/anon.h as are the an_use values. Remember that the original pointer was **anon. So, we must run anon twice in order to get the pointer to the page structure:
S> anon D14CF820
an_refcnt 00000001
un_*an_page D10560E8
*an_bap 00000000
an_flag 0000
an_use 0000
Now, we can use the "un_*an_page" value as an argument to page. This is the last structure in the VM system that we will discuss.
Figure 9 (struct page)
S> page D10560E8
page [D10560E8]: MOD REF
nio 0000, keepcnt 000000, vnode D14B4A04, offset 00582000
next D10560E8, prev D10560E8, vpnext D11B8064, vpprev D102C4A4
mapping C4F7051C, lckcnt 00000000, cowcnt 00000000
In order of appearance, the page macro shows:
page [D10560E8]: MOD REF /* the bits that are on */
nio 0000 /* number of outstanding io reqs needed */
keepcnt 000000 /* number of page `keeps' */
vnode D14B4A04 /* logical vnode this page is from */
offset 00582000 /* offset into vnode for this page */
next D10560E8 /* next page in free/intrans lists */
prev D10560E8 /* prev page in free/intrans lists */
vpnext D11B8064 /* next page in vnode list */
vpprev D102C4A4 /* prev page in vnode list */
mapping C4F7051C /* page mappings from phat struct */
lckcnt 00000000 /* number of locks on page data */
cowcnt 00000000 /* number of copy on write locks */
This page has the modify and reference bits set. See /usr/include/vm/page.h 'struct page' for all the bits that can be set as well as for information on the phat structure which defines the page mappings. There is more information on page mappings in /usr/include/vm/vm_hat.h. In the subsection titled _Paging and Swapping_, we will discuss the anon, page and vnode structures at length.
Now we are ready to get down to the nitty gritty details of the virtual memory system. We will start with the user address space.
Again, starting with 'truss -o dyn popd':
execve("popd", 0x08047CA8, 0x08047CB0) argc = 1
open("/dev/zero", O_RDONLY, 01001076274) = 3
mmap(0x00000000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x8003C000
...
open("/usr/lib/libXm.so.1.2", O_RDONLY, 01001073564) = 5
read(5, "7F E L F010101\0\0\0\0\0".., 308) = 308
mmap(0x00000000, 1473456, PROT_READ, MAP_PRIVATE, 3, 0) = 0x8003E000
mmap(0x8003E000, 1347808, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 5, 0) = 0x8003E000
mmap(0x80188000, 117724, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 5, 1347584) = 0x80188000
mprotect(0x801A5000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
...
And get all the necessary kcrash information:
S> ps
ADDRESS  PID   PPID  UID   FLAGS    K U R WCHAN    ST     COMMAND
D14BEE00 08881 08880 00000 00102010 - - 0          ONPROC kcrash -k /dev/mem
D146A800 08880 08838 00000 00102010 - - - D146A800 SLEEP  ksh
D15C6800 08879 08808 00103 00502010 - - - D02F34CC SLEEP  popd
S> procvm D15C6800
*p_seguslo D10EDB54
*p_segu F0A54000
*p_as D15CDDE0
*p_trace 00000000
*p_exec D12161D8
p_pri 0000004E
p_usize 0002
u.u_psargs = "popd "
S> as D15CDDE0
as [D15CDDE0]: keepcnt 000000 segs D1551F60 seglast D15406E0 sz 2D3000 rss 24E
hat: pts D159C320 ptlast D159C320 pdtp 00000000 cr3 D15CD5C0 ref -782443648
S> seglst D1551F60
ADDRESS  DATA     BASE     NPGS MAP PROT VNODE    OFFSET   ANON_MAP SWPRESV
D1551F60 D1550A48 08042000 0006 02  0F   00000000 FFFFC000 D15529C0 00006000
D15B1BC0 D15C0C24 08048000 0005 02  0D   D12161D8 00000000 00000000 00000000
D15407C0 D1552224 0804D000 0001 02  0F   D12161D8 00004000 D1552B80 00001000
D15406E0 D1552520 0804E000 0049 02  0F   00000000 00000000 D1551268 00031000
D15406A0 D15524D8 80000000 0056 02  0D   D12363E8 00000000 00000000 00000000
D15407E0 D1552248 80038000 0002 02  0F   D12363E8 00038000 D15C7348 00002000
D137F2C0 D15602B4 8003A000 0001 02  0F   00000000 00000000 D1551000 00001000
D1551C80 D15522FC 8003C000 0001 02  0B   00000000 00000000 D1551230 00001000
D1540700 D1552544 8003E000 0330 02  0D   D121F8C8 00000000 00000000 00000000
D137F220 D1560200 80188000 0029 02  0F   D121F8C8 00149000 D1579838 0001D000
D15C0AE0 D15C7BB0 801A5000 0001 02  0F   00000000 00000000 D15B12A0 00001000
D15C0920 D15C0DB0 801A7000 0062 02  0D   D1226628 00000000 00000000 00000000
D155F0C0 D153DAFC 801E5000 0004 02  0F   D1226628 0003D000 D15513B8 00004000
D1542CC0 D1542824 801E9000 0001 02  0F   00000000 00000000 D15C71C0 00001000
D15B19A0 D15B1FB0 801EB000 0095 02  0D   D1251738 00000000 00000000 00000000
D1551D60 D1551800 8024A000 0003 02  0F   D1251738 0005E000 D1551118 00003000
D15C58A0 D15C6BB0 8024E000 0016 02  0D   D122BDA8 00000000 00000000 00000000
D15C5880 D15C6B8C 8025E000 0001 02  0F   D122BDA8 0000F000 D1552AA0 00001000
D15C0900 D15C0D8C 8025F000 0002 02  0F   00000000 00000000 D159DB48 00002000
D1551C00 D155226C 80262000 0049 02  0D   D1247328 00000000 00000000 00000000
D1551D80 D1551824 80293000 0005 02  0F   D1247328 00030000 D155B0A8 00005000
D15C0B80 D15C6A6C 80298000 0004 02  0F   00000000 00000000 D15513F0 00004000
We can first see a difference by comparing the 'sz' value given in the output from the as macro. We see that 723 pages are used by the second executable. The text segment of the dynamic executable (base address 08048000) only uses 5 pages versus 55, which is a savings of over 200000 bytes per execution.
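The page counts quoted for popd can be cross-checked against the seglst and truss output. The sketch below is only an illustration: it assumes 4096-byte pages and takes the NPGS column to be decimal, which is consistent with the SWPRESV values (for example 0049 pages against 00031000 bytes). The helper pages_for is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages

# NPGS column of the seglst output, in listing order (read as decimal)
npgs = [6, 5, 1, 49, 56, 2, 1, 1, 330, 29, 1,
        62, 4, 1, 95, 3, 16, 1, 2, 49, 5, 4]


def pages_for(nbytes):
    """Round a byte length up to whole pages."""
    return (nbytes + PAGE_SIZE - 1) // PAGE_SIZE


total_pages = sum(npgs)
sz_pages = 0x2D3000 // PAGE_SIZE   # 'sz' from the as macro output
text_saving = (55 - 5) * PAGE_SIZE  # text pages saved, in bytes

print("segments total %d pages, as sz is %d pages" % (total_pages, sz_pages))
print("libXm mmap lengths: %d and %d pages" % (pages_for(1347808), pages_for(117724)))
print("text segment saving: %d bytes" % text_saving)
```

The two libXm mmap lengths from the truss output round up to the NPGS values shown for the segments at 8003E000 and 80188000, and the segment page counts add up to the sz value reported by as.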
File and Memory Mapping
Access Privileges
Access privileges or protections to the mapped pages are chosen by "or-ing" together the following bits. Note that a write will not succeed unless PROT_WRITE has been set. If PROT_NONE has been set, then no access will be allowed.
PROT_READ 0x1 /* pages can be read */
PROT_WRITE 0x2 /* pages can be written */
PROT_EXEC 0x4 /* pages can be executed */
PROT_USER 0x8 /* pages are user accessable */
PROT_ALL (PROT_READ | PROT_WRITE | PROT_EXEC | PROT_USER)
PROT_NONE 0x0 /* pages cannot be accessed */
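These bit values also explain the PROT column printed by seglst and the prot field of segvn_data. A minimal sketch, using only the constants listed above (decode_prot is a hypothetical helper, not a kcrash macro):

```python
PROT_READ = 0x1
PROT_WRITE = 0x2
PROT_EXEC = 0x4
PROT_USER = 0x8

NAMES = [
    (PROT_READ, "PROT_READ"),
    (PROT_WRITE, "PROT_WRITE"),
    (PROT_EXEC, "PROT_EXEC"),
    (PROT_USER, "PROT_USER"),
]


def decode_prot(prot):
    """Return the symbolic form of a protection byte."""
    names = [name for bit, name in NAMES if prot & bit]
    return "|".join(names) if names else "PROT_NONE"


# PROT values that appear in the seglst output for popd
for value in (0x0F, 0x0D, 0x0B):
    print("%02X = %s" % (value, decode_prot(value)))
```

So a segment shown with PROT 0D is readable and executable but not writable, which is what one would expect for the text mappings backed by a vnode.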
There are only two mapping types, and only one may be specified. MAP_SHARED allows changes to be made to the virtual memory object, while MAP_PRIVATE creates a private copy of the memory object (copy-on-write) and does not change the underlying virtual memory object.
A file with read access permissions may be specified as MAP_PRIVATE with PROT_WRITE, but write access permissions are necessary to declare an object as MAP_SHARED with PROT_WRITE.
Mappings are retained across a fork. (see fork(2))
Kernel Virtual Address Space
Note: This section will describe the kernel address space for the 1.3 and
1.4 operating system versions. 1.3.1 and 1.3.2 were implementation/
hardware dependent versions and have been superseded by version 1.4.
kpioseg - This segment is used for physical, usually disk, i/o and
allows a physical i/o buffer to be passed to a driver
strategy routine to perform direct i/o to an address
space. Currently, this is only used by the vxfs driver.
kpseg - This segment is used by the kernel to map physical
addresses to virtual addresses.
kpseg2 - This was used to support up to 764MB of memory when
we mapped virtual memory to physical memory 1-to-1. It
is not used in 1.4.
ktextseg - This segment maps kernel text, data and bss.
kvseg - or sptmap. This is the segment used for dynamic kernel
memory allocation, i.e. kmem_alloc() or sptalloc().
For more information on sptmap, please reference the following
documents:
Kernel Tunable Parameters
segkmap - or kvsegmap. The segment used to implement the I/O page
cache. The I/O page cache is used by the file system code.
segu - This is the user-block segment.
Figure 11.1.3 Kernel Virtual Address Space for 1.3
To see the kernel address space, first use:
S> as kas
as [D034B87C]: keepcnt 000000 segs D0349CB8 seglast D10DD800 sz 0 rss 627
hat: pts D111F1E0 ptlast D111F1E0 pdtp 00000000 cr3 00000000 ref -787642496
Then, you will use the segn macro and give it the address of segs above:
S> segn D0349CB8
addr     base     end      size     as       data     physical
D0349CB8 C0000000 CFFFFFFF 10000000 D034B87C C0000000 0000000
D0312F0C D0010000 D03C24E7 3B24E8   D034B87C D0010000 0010000
D0393EF4 D1000000 D2FFFFFF 2000000  D034B87C D1000000 0439000
D02A9114 D5000000 D53FFFFF 400000   D034B87C D5000000
D10DD800 D5400000 D5BFFFFF 800000   D034B87C D10D8FC0 274D000
D039393C E0400000 F03FFFFF 10000000 D034B87C E0400000
D10DD820 F0400000 FF7FFFFF F400000  D034B87C D10D8F80
To find the names of the segments, we can use 'dl addr' like this:
S> dl D0349CB8
kpseg: 00000000 C0000000 10000000 D034B87C ............|.4.
If we do this for each kas address segment, we get a table like:
          addr     base     end
kpseg:    D0349CB8 C0000000 CFFFFFFF
ktextseg: D0312F0C D0010000 D03991D7
kvseg:    D0393EF4 D1000000 D2FFFFFF
kpioseg:  D02A9114 D5000000 D53FFFFF
segkmap:  D10DD800 D5400000 D5BFFFFF
kpseg2:   D039393C E0400000 F03FFFFF
segu:     D10DD820 F0400000 FF7FFFFF
Table 2 - Kernel Address Space 1.3
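A table like this makes it easy to classify a kernel virtual address seen elsewhere in a dump. The following sketch copies the base and end values from Table 2; find_segment is a hypothetical helper. The sample addresses are taken from earlier figures in this section on the assumption that they came from a system using the 1.3 layout.

```python
# (name, base, end) rows copied from Table 2 (1.3 layout)
KAS_1_3 = [
    ("kpseg",    0xC0000000, 0xCFFFFFFF),
    ("ktextseg", 0xD0010000, 0xD03991D7),
    ("kvseg",    0xD1000000, 0xD2FFFFFF),
    ("kpioseg",  0xD5000000, 0xD53FFFFF),
    ("segkmap",  0xD5400000, 0xD5BFFFFF),
    ("kpseg2",   0xE0400000, 0xF03FFFFF),
    ("segu",     0xF0400000, 0xFF7FFFFF),
]


def find_segment(addr, table=KAS_1_3):
    """Return the name of the kernel segment containing addr, or None."""
    for name, base, end in table:
        if base <= addr <= end:
            return name
    return None


samples = [
    (0xD0069CB0, "segvn_fault, from the di example"),
    (0xD1741880, "*p_as, from procvm"),
    (0xF2206000, "*p_segu, from procvm"),
]
for addr, what in samples:
    print("%08X (%s) -> %s" % (addr, what, find_segment(addr)))
```

Under that assumption the results line up with the segment descriptions: the segvn_fault routine falls in ktextseg, the dynamically allocated as structure falls in kvseg, and the seguser pointer falls in segu.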
          addr     base     end
kpseg:    C13C64E0 C0000000 C0FFFFFF
ktextseg: C137F670 C1010000 C14A0E37
kpioseg:  C132F5A4 CA000000 CAFFFFFF
segkmap:  D52BBC00 CB000000 CBFFFFFF
segu:     D52BBC20 CC000000 D4FFFFFF
kvseg:    C1433000 D5000000 FEBFFFFF
Table 3 - Kernel Address Space 1.4
The first is widely used - kmeminfo:
Figure 12 (struct kmeminfo)
S> kmeminfo
km_mem[0] 000C4000 /*small KMEM request index*/
km_mem[1] 003F4000 /*large KMEM request index*/
km_mem[2] 00000000 /*outsize KMEM request index*/
km_alloc[0] 00098E20 /*amount of small KMEM allocated*/
km_alloc[1] 00305400 /*amount of large KMEM allocated*/
km_alloc[2] 0029D000 /*amount of outsize KMEM allocated*/
km_fail[0] 00000000 /*number of small KMEM failures*/
km_fail[1] 00000000 /*number of large KMEM failures*/
km_fail[2] 00000000 /*number of outsize KMEM failures*/
The kmeminfo struct has three groups of three fields. The meanings are:
km_mem contains the memory (in bytes) which is currently in an allocation pool,
km_alloc is how much of that pool is currently allocated and km_fail indicates
how many allocation requests from that pool have been rejected.
The fields in slot 0 (zero) apply to the KMEM_SMALL pool (allocations under 256 bytes), slot 1 applies to KMEM_LARGE (up to 16384 bytes) and slot 2 is for KMEM_OSIZE (anything larger than 16384 bytes). For oversize allocations, there is no pool: requests are rounded up to a page boundary and the page allocator is called directly. This means km_mem[2] must always be zero; if it is not, the kmeminfo struct must be corrupt.
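The rules above can be turned into a quick consistency check on the values printed in Figure 12. This is only a sketch: the km_mem[2] test comes straight from the text, while the check that km_alloc does not exceed km_mem for the two pooled classes is an inference from the field descriptions, not a documented invariant. check_kmeminfo is a hypothetical helper.

```python
POOLS = ["KMEM_SMALL", "KMEM_LARGE", "KMEM_OSIZE"]

# Values from the kmeminfo output in Figure 12
km_mem = [0x000C4000, 0x003F4000, 0x00000000]
km_alloc = [0x00098E20, 0x00305400, 0x0029D000]
km_fail = [0, 0, 0]


def check_kmeminfo(mem, alloc, fail):
    """Return a list of human-readable problems; empty if none found."""
    problems = []
    if mem[2] != 0:
        problems.append("km_mem[2] is nonzero: kmeminfo may be corrupt")
    for i in (0, 1):  # only the small and large classes have a pool
        if alloc[i] > mem[i]:
            # inferred rule: allocated bytes should fit within the pool
            problems.append("%s: km_alloc exceeds km_mem" % POOLS[i])
    for i, count in enumerate(fail):
        if count:
            problems.append("%s: %d failed requests" % (POOLS[i], count))
    return problems


problems = check_kmeminfo(km_mem, km_alloc, km_fail)
print(problems if problems else "kmeminfo looks consistent")
for i in (0, 1):
    print("%s: %d of %d bytes allocated" % (POOLS[i], km_alloc[i], km_mem[i]))
```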
The swapinfo structure is useful for viewing how much swap has been configured
on the system, finding the actual names of the swap device(s), and determining
how much swap is being used.
Figure 13 (struct swapinfo)
S> swapinfo
vnode *si_vp D12BE604 /* vnode for this swap device */
vnode *si_svp D12BE704 /* svnode for this swap device */
uint si_soff 00000000 /* starting offset (bytes) of file */
uint si_eoff 02FFD000 /* ending offset (bytes) of file */
anon *si_anon D12D0000 /* pointer to anon array */
anon *si_eanon D12FFFC0 /* pointer to end of anon array */
anon *si_free D12D65D0 /* anon free list for this vp */
int si_allocs 0000000030 /* # of conseq. allocs from this area */
swapinfo *si_next D1331340 /* next swap area */
short si_flags 0000 /* deletion flags */
ulong si_npgs 0000012285 /* number of pages of swap space */
ulong si_nfpgs 0000010747 /* number of free pages of swap space */
char *si_pname /dev/swap
The si_vp pointer allows one to use the vnode macro to obtain more information
about the swap device, and the anon pointers allow one to view the anon
information for swap. In this case, there is a second swap device, as the
non-null si_next pointer shows. See the /usr/include/sys/swap.h file for the
flag definitions listed under ste_flags. The number of pages is given in
decimal, in units of 4k pages. The macro prints out all swap devices defined;
the second swap device's information is not included here.
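The swap usage for this device follows directly from the page counts in Figure 13. A minimal sketch, assuming the 4k page size stated above and using only values printed by the macro:

```python
PAGE_SIZE = 4096  # the text states that swap pages are 4k

# Values from the swapinfo output in Figure 13
si_soff = 0x00000000
si_eoff = 0x02FFD000
si_npgs = 12285    # decimal
si_nfpgs = 10747   # decimal

used_pages = si_npgs - si_nfpgs
swap_bytes = si_npgs * PAGE_SIZE
used_bytes = used_pages * PAGE_SIZE

print("configured: %d pages (%d bytes)" % (si_npgs, swap_bytes))
print("in use:     %d pages (%d bytes)" % (used_pages, used_bytes))
print("offset range covers %d bytes" % (si_eoff - si_soff))
```

The configured size computed from si_npgs matches the si_soff to si_eoff range, which is a useful cross-check that the structure is intact.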
Related macros minfo, mpinfo, sysinfo and vminfo are documented in the Alphabetic Index of Macros.