Does the C++ standard mandate poor performance for iostreams, or am I just dealing with a poor implementation?


Every time I mention the slow performance of the C++ standard library iostreams, I am met with a wave of disbelief. Yet I have profiler results showing large amounts of time spent in iostream library code (with full compiler optimizations), and switching from iostreams to OS-specific I/O APIs and custom buffer management does give an order of magnitude improvement.

What extra work is the C++ standard library doing, is it required by the standard, and is it useful in practice? Or do some compilers provide implementations of iostreams that are competitive with manual buffer management?

Benchmarks

To get matters moving, I've written a couple of short programs to exercise the iostreams internal buffering.

Note that the ostringstream and stringbuf versions run fewer iterations because they are so much slower.

On ideone, the ostringstream is about 3 times slower than std::copy + back_inserter + std::vector, and about 15 times slower than memcpy into a raw buffer. This feels consistent with the before-and-after profiling when I switched my real application to custom buffering.

These are all in-memory buffers, so the slowness of iostreams can't be blamed on slow disk I/O, too much flushing, synchronization with stdio, or any of the other things people use to excuse the observed slowness of the C++ standard library iostreams.
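The benchmark programs themselves are not reproduced in this text. As a rough illustration only, the comparison described above might look like the following sketch. It assumes each iteration appends one 4-byte int to an in-memory buffer; the helper names (fill_ostringstream, fill_vector_back_inserter, fill_raw_buffer, time_ms) are hypothetical and not taken from the original programs.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: append `count` ints through std::ostringstream::write.
// Returns the number of bytes produced so the work cannot be optimized away.
std::size_t fill_ostringstream(std::size_t count) {
    std::ostringstream oss;
    for (std::size_t i = 0; i < count; ++i) {
        const int value = static_cast<int>(i);
        oss.write(reinterpret_cast<const char*>(&value), sizeof value);
    }
    return oss.str().size();
}

// Hypothetical helper: append the same bytes with std::copy + back_inserter.
std::size_t fill_vector_back_inserter(std::size_t count) {
    std::vector<char> buf;
    for (std::size_t i = 0; i < count; ++i) {
        const int value = static_cast<int>(i);
        const char* p = reinterpret_cast<const char*>(&value);
        std::copy(p, p + sizeof value, std::back_inserter(buf));
    }
    return buf.size();
}

// Hypothetical helper: memcpy into a preallocated raw buffer.
std::size_t fill_raw_buffer(std::size_t count) {
    std::vector<char> storage(count * sizeof(int));  // owns the raw bytes
    char* out = storage.data();
    for (std::size_t i = 0; i < count; ++i) {
        const int value = static_cast<int>(i);
        std::memcpy(out, &value, sizeof value);
        out += sizeof value;
    }
    const std::size_t written = static_cast<std::size_t>(out - storage.data());
    assert(written == storage.size());  // sanity check: buffer exactly filled
    return written;
}

// Hypothetical timing wrapper: wall-clock milliseconds for one call of f.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each helper can be timed with something like time_ms([] { fill_ostringstream(1000000); }) and the results compared. The absolute numbers will depend on the compiler, library, and optimization flags.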

It would be nice to see benchmarks on other systems and commentary on what common implementations do (such as gcc's libstdc++, Visual C++, Intel C++) and how much of the overhead is mandated by the standard.

Rationale for this test

A number of people have correctly pointed out that iostreams are more commonly used for formatted output. However, they are also the only modern API provided by the C++ standard for binary file access. But the real reason for doing performance tests on the internal buffering applies to typical formatted I/O: if iostreams can't keep the disk controller supplied with raw data, how can they possibly keep up when they are responsible for formatting as well?

Benchmark timing

All these are per iteration of the outer (k) loop.

On ideone (gcc-4.3.4, unknown OS and hardware):

  • ostringstream: 53 ms
  • stringbuf: 27 ms
  • vector<char> and back_inserter: 17.6 ms
  • vector<char> with ordinary iterator: 10.6 ms
  • vector<char> iterator and bounds check: 11.4 ms
  • char[]: 3.7 ms

On my laptop (Visual C++ 2010 x86, cl /Ox /EHsc, Windows 7 Ultimate 64-bit, Intel Core i7, 8 GB RAM):

  • ostringstream: 73.4 ms, 71.6 ms
  • stringbuf: 21.7 ms, 21.3 ms
  • vector<char> and back_inserter: 34.6 ms, 34.4 ms
  • vector<char> with ordinary iterator: 1.10 ms, 1.04 ms
  • vector<char> iterator and bounds check: 1.11 ms, 0.87 ms, 1.12 ms, 0.89 ms, 1.02 ms, 1.14 ms
  • char[]: 1.48 ms, 1.57 ms

Visual C++ 2010 x86, with profile-guided optimization (cl /Ox /EHsc /GL /c, link /ltcg:pgi, run, link /ltcg:pgo, measure):

  • ostringstream: 61.2 ms, 60.5 ms
  • vector<char> with ordinary iterator: 1.04 ms, 1.03 ms

Same laptop, same OS, using cygwin gcc 4.3.4, g++ -O3:

  • ostringstream: 62.7 ms, 60.5 ms
  • stringbuf: 44.4 ms, 44.5 ms
  • vector<char> and back_inserter: 13.5 ms, 13.6 ms
  • vector<char> with ordinary iterator: 4.1 ms, 3.9 ms
  • vector<char> iterator and bounds check: 4.0 ms, 4.0 ms
  • char[]: 3.57 ms, 3.75 ms

Same laptop, Visual C++ 2008 SP1, cl /Ox /EHsc:

  • ostringstream: 88.7 ms, 87.6 ms
  • stringbuf: 23.3 ms, 23.4 ms
  • vector<char> and back_inserter: 26.1 ms, 24.5 ms
  • vector<char> with ordinary iterator: 3.13 ms, 2.48 ms
  • vector<char> iterator and bounds check: 2.97 ms, 2.53 ms
  • char[]: 1.52 ms, 1.25 ms

Same laptop, Visual C++ 2010 64-bit compiler:

  • ostringstream: 48.6 ms, 45.0 ms
  • stringbuf: 16.2 ms, 16.0 ms
  • vector<char> and back_inserter: 26.3 ms, 26.5 ms
  • vector<char> with ordinary iterator: 0.87 ms, 0.89 ms
  • vector<char> iterator and bounds check: 0.99 ms, 0.99 ms
  • char[]: 1.25 ms, 1.24 ms

Edit: Ran everything twice to see how consistent the results were. Pretty consistent, IMO.

Note: On my laptop, since I can spare more CPU time than ideone allows, I set the number of iterations to 1000 for all methods. This means that ostringstream and vector reallocation, which takes place only on the first pass, should have little impact on the final results.

Edit: Oops, found a bug in the vector-with-ordinary-iterator version: the iterator wasn't being advanced, and therefore there were too many cache hits. I was wondering how vector<char> was outperforming char[]. It didn't make much difference though; vector<char> is still faster than char[] under VC++ 2010.

Conclusions

Buffering of output streams requires three steps each time data is appended:

  • Check that the incoming block fits the available buffer space.
  • Copy the incoming block.
  • Update the end-of-data pointer.
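As a minimal sketch of those three steps, assuming a fixed-capacity buffer described by three raw pointers (the FixedBuffer struct and append function are hypothetical names used only for illustration, not code from the benchmarks):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical fixed-capacity output buffer described by three pointers.
struct FixedBuffer {
    char* begin;           // start of storage
    char* end_of_data;     // one past the last byte written so far
    char* end_of_storage;  // one past the last usable byte
};

// Append n bytes from src. Returns false, leaving the buffer unchanged,
// when the block does not fit; a real stream would flush or grow here.
bool append(FixedBuffer& buf, const void* src, std::size_t n) {
    assert(buf.begin <= buf.end_of_data && buf.end_of_data <= buf.end_of_storage);

    // Step 1: check that the incoming block fits the available space.
    const std::size_t available =
        static_cast<std::size_t>(buf.end_of_storage - buf.end_of_data);
    if (n > available) {
        return false;
    }

    // Step 2: copy the incoming block.
    std::memcpy(buf.end_of_data, src, n);

    // Step 3: update the end-of-data pointer.
    buf.end_of_data += n;
    return true;
}
```

The fast path is one comparison, one memcpy, and one pointer increment, which is the baseline cost any buffering layer has to pay.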

The latest code snippet I posted, "vector<char> simple iterator plus bounds check", not only does this, it also allocates additional space and moves the existing data when the incoming block doesn't fit. As Clifford pointed out, buffering in a file I/O class wouldn't have to do that; it would just flush the current buffer and reuse it. So this should be an upper bound on the cost of buffering output. And it's exactly what is needed to make a working in-memory buffer.

So why is stringbuf 2.5x slower on ideone, and at least 10 times slower when I test it? It isn't being used polymorphically in this simple micro-benchmark, so that doesn't explain it.

Not answering the specifics of your question so much as the title: the 2006 Technical Report on C++ Performance has an interesting section on IOStreams (p.68). Most relevant to your question is Section 6.1.2 ("Execution Speed"):

"Since certain aspects of IOStreams processing are distributed over multiple facets, it appears that the Standard mandates an inefficient implementation. But this is not the case: by using some form of preprocessing, much of the work can be avoided. With a slightly smarter linker than is typically used, it is possible to remove some of these inefficiencies. This is discussed in §6.2.3 and §6.2.5."

Since the report was written in 2006, one would hope that many of the recommendations would have been incorporated into current compilers, but perhaps this is not the case.

As you mention, facets may not feature in write() (but I wouldn't assume that blindly). So what does feature? Running gprof on your ostringstream code compiled with GCC gives the following breakdown:

  • 44.23% in std::basic_streambuf<char>::xsputn(char const*, int)
  • 34.62% in std::ostream::write(char const*, int)
  • 12.50% in main
  • 6.73% in std::ostream::sentry::sentry(std::ostream&)
  • 0.96% in std::string::_M_replace_safe(unsigned int, unsigned int, char const*, unsigned int)
  • 0.96% in std::basic_ostringstream<char>::basic_ostringstream(std::_Ios_Openmode)
  • 0.00% in std::fpos<int>::fpos(long long)

So the bulk of the time is spent in xsputn, which eventually calls std::copy() after lots of checking and updating of cursor positions and buffers (have a look in c++\bits\streambuf.tcc for the details).

My take on this is that you've focused on the worst-case situation. All the checking that is performed would be a small fraction of the total work done if you were dealing with reasonably large chunks of data. But your code is shifting data in four bytes at a time, and incurring all the extra costs each time. Clearly one would avoid doing so in a real-life situation: consider how negligible the penalty would have been if write were called once on an array of 1m ints instead of 1m times on one int. And in a real-life situation one would really appreciate the important features of iostreams, namely their memory-safe and type-safe design. Such benefits come at a price, and you've written a test which makes these costs dominate the execution time.
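To make the per-call overhead concrete, here is a small sketch that counts how often the stream buffer is entered under each calling pattern. CountingBuf, write_each, write_batched, and same_output are hypothetical names for illustration. The sketch assumes that ostream::write forwards to the buffer's xsputn, which the gprof output above suggests for libstdc++ but which is an implementation detail rather than a documented guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <vector>

// Hypothetical stream buffer that discards data and records how it was used.
class CountingBuf : public std::streambuf {
public:
    std::size_t calls = 0;  // number of times the buffer was entered
    std::size_t bytes = 0;  // total bytes offered to the buffer

protected:
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        (void)s;  // contents are discarded
        assert(n >= 0);
        ++calls;
        bytes += static_cast<std::size_t>(n);
        return n;
    }

    // Reached only if the library inserts characters one at a time.
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof())) {
            ++calls;
            ++bytes;
        }
        return traits_type::not_eof(ch);
    }
};

// One write() call per int: pays the sentry and buffer checks every time.
void write_each(std::ostream& out, const std::vector<int>& values) {
    for (const int& v : values) {
        out.write(reinterpret_cast<const char*>(&v), sizeof v);
    }
}

// A single write() call for the whole array: pays those checks once.
void write_batched(std::ostream& out, const std::vector<int>& values) {
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(int)));
}

// Sanity check: both strategies must produce byte-identical output.
bool same_output(const std::vector<int>& values) {
    std::ostringstream a, b;
    write_each(a, values);
    write_batched(b, values);
    return a.str() == b.str();
}
```

Wrapping a CountingBuf in a std::ostream and passing it to each function shows the same byte count for both, while the batched version enters the buffer far fewer times. Under the forwarding assumption above, one would expect one entry per write() call.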

