Interpret differences in prstat vs. 'prstat -m' on Solaris -
i've been using prstat , prstat -m lot investigate performance issues lately, , think i've understood differences of sampling vs. microstate accounting in solaris 10. don't expect both show same number.
today came across occasion 2 showed vastly different outputs, have problems interpreting them , making sense of output. machine heavily loaded 8-cpu solaris 10, several large websphere processes , oracle database. system practically came halt today 15 minutes (load averages of >700). had difficulties prstat information, able outputs "prtstat 1 1" , "prtstat -m 1 1", issued shortly 1 after another.
the top lines of outputs:
prstat 1 1:
pid username size rss state pri nice time cpu process/nlwp 8379 3208m 2773m cpu5 60 0 5:29:13 19% java/145 7123 3159m 2756m run 59 0 5:26:45 7.7% java/109 5855 app1 1132m 26m cpu2 60 0 0:01:01 7.7% java/18 16503 monitor 494m 286m run 59 19 1:01:08 7.1% java/106 7112 oracle 15g 15g run 59 0 0:00:10 4.5% oracle/1 7124 oracle 15g 15g cpu3 60 0 0:00:10 4.5% oracle/1 7087 app1 15g 15g run 58 0 0:00:09 4.0% oracle/1 7155 oracle 96m 6336k cpu1 60 0 0:00:07 3.6% oracle/1 ... total: 495 processes, 4581 lwps, load averages: 74.79, 35.35, 23.8
prstat -m 1 1 (some seconds later)
pid username usr sys trp tfl dfl lck slp lat vcx icx scl sig process/nlwp 7087 app1 0.1 56 0.0 0.2 0.4 0.0 13 30 96 2 33 0 oracle/1 7153 oracle 0.1 53 0.0 3.2 1.1 0.0 1.0 42 82 0 14 0 oracle/1 7124 oracle 0.1 47 0.0 0.2 0.2 0.0 0.0 52 77 2 16 0 oracle/1 7112 oracle 0.1 47 0.0 0.4 0.1 0.0 0.0 52 79 1 16 0 oracle/1 7259 oracle 0.1 45 9.4 0.0 0.3 0.0 0.1 45 71 2 32 0 oracle/1 7155 oracle 0.0 42 11 0.0 0.5 0.0 0.1 46 90 1 9 0 oracle/1 7261 oracle 0.0 37 9.5 0.0 0.3 0.0 0.0 53 61 1 17 0 oracle/1 7284 oracle 0.0 32 5.9 0.0 0.2 0.0 0.1 62 53 1 21 0 oracle/1 ... total: 497 processes, 4576 lwps, load averages: 88.86, 39.93, 25.51
i have hard time interpreting output. prstat seems tell me fair amount of java processing going on, oracle stuff, expect in normal situation. prtstat -m shows machine totally dominated oracle, consuming huge amounts of system time, , overall cpu being heavily overloaded (large numbers in lat).
i'm inclined believe output of prstat -m, because matches mor closely system felt during time. totally sluggish, no more user request processing going on websphere, etc. why prstat show heavily differing numbers?
any explanation of welcome!
cu, joe
there's known problem prstat -m
on solaris in way cpu usage figures calculated - value see has been averaged on threads (lwps) in process, , hence far far low heavily multithreaded processes - such java app servers, can have hundreds of threads each (see nlwp). less dozen of them cpu hogs hence cpu usage java looks "low". you'd need call prstat -lm
per-lwp (thread) breakdown see effect. reference:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6780169
without further performance monitoring data it's hard give non-speculative explanations of you've seen there. assume lock contention within java. 1 particular workload can cause heavily multithreaded memory mapped i/o, they'll pile on process address space lock. purely java userside lock of course. plockstat
on 1 of java processes, and/or simple dtrace profiling helpful.
Comments
Post a Comment