Interpret differences in prstat vs. 'prstat -m' on Solaris -


i've been using prstat , prstat -m lot investigate performance issues lately, , think i've understood differences of sampling vs. microstate accounting in solaris 10. don't expect both show same number.

today came across occasion 2 showed vastly different outputs, have problems interpreting them , making sense of output. machine heavily loaded 8-cpu solaris 10, several large websphere processes , oracle database. system practically came halt today 15 minutes (load averages of >700). had difficulties prstat information, able outputs "prtstat 1 1" , "prtstat -m 1 1", issued shortly 1 after another.

the top lines of outputs:

prstat 1 1:

     pid username  size   rss state  pri nice      time  cpu process/nlwp   8379      3208m 2773m cpu5    60    0   5:29:13  19% java/145   7123      3159m 2756m run     59    0   5:26:45 7.7% java/109   5855 app1     1132m   26m cpu2    60    0   0:01:01 7.7% java/18  16503 monitor   494m  286m run     59   19   1:01:08 7.1% java/106   7112 oracle     15g   15g run     59    0   0:00:10 4.5% oracle/1   7124 oracle     15g   15g cpu3    60    0   0:00:10 4.5% oracle/1   7087 app1       15g   15g run     58    0   0:00:09 4.0% oracle/1   7155 oracle     96m 6336k cpu1    60    0   0:00:07 3.6% oracle/1 ... total: 495 processes, 4581 lwps, load averages: 74.79, 35.35, 23.8 

prstat -m 1 1 (some seconds later)

    pid username usr sys trp tfl dfl lck slp lat vcx icx scl sig process/nlwp     7087 app1     0.1  56 0.0 0.2 0.4 0.0  13  30  96   2  33   0 oracle/1   7153 oracle   0.1  53 0.0 3.2 1.1 0.0 1.0  42  82   0  14   0 oracle/1   7124 oracle   0.1  47 0.0 0.2 0.2 0.0 0.0  52  77   2  16   0 oracle/1   7112 oracle   0.1  47 0.0 0.4 0.1 0.0 0.0  52  79   1  16   0 oracle/1   7259 oracle   0.1  45 9.4 0.0 0.3 0.0 0.1  45  71   2  32   0 oracle/1   7155 oracle   0.0  42  11 0.0 0.5 0.0 0.1  46  90   1   9   0 oracle/1   7261 oracle   0.0  37 9.5 0.0 0.3 0.0 0.0  53  61   1  17   0 oracle/1   7284 oracle   0.0  32 5.9 0.0 0.2 0.0 0.1  62  53   1  21   0 oracle/1 ... total: 497 processes, 4576 lwps, load averages: 88.86, 39.93, 25.51 

i have hard time interpreting output. prstat seems tell me fair amount of java processing going on, oracle stuff, expect in normal situation. prtstat -m shows machine totally dominated oracle, consuming huge amounts of system time, , overall cpu being heavily overloaded (large numbers in lat).

i'm inclined believe output of prstat -m, because matches mor closely system felt during time. totally sluggish, no more user request processing going on websphere, etc. why prstat show heavily differing numbers?

any explanation of welcome!

cu, joe

there's known problem prstat -m on solaris in way cpu usage figures calculated - value see has been averaged on threads (lwps) in process, , hence far far low heavily multithreaded processes - such java app servers, can have hundreds of threads each (see nlwp). less dozen of them cpu hogs hence cpu usage java looks "low". you'd need call prstat -lm per-lwp (thread) breakdown see effect. reference:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6780169

without further performance monitoring data it's hard give non-speculative explanations of you've seen there. assume lock contention within java. 1 particular workload can cause heavily multithreaded memory mapped i/o, they'll pile on process address space lock. purely java userside lock of course. plockstat on 1 of java processes, and/or simple dtrace profiling helpful.


Comments

Popular posts from this blog

asp.net - repeatedly call AddImageUrl(url) to assemble pdf document -

java - Android recognize cell phone with keyboard or not? -

iphone - How would you achieve a LED Scrolling effect? -