Yo Mama is C++0x!

This is a side product of an evil plan to emit hundreds of fixed code paths at compile time, for the sake of better performance in a simple IFS-fractal renderer I am hacking on at the moment (itself an exercise in OpenMP, C++0x (as far as g++ 4.4 goes), optimization, and of course patience).

So, what follows is a function reminiscent of .NET’s WriteLine() family, which some describe as a type-safe variant of printf() (yes, the one dreaded for its ellipsis parameter, which can open the door to all kinds of vulnerabilities), and hence better. Of course there are iostreams in C++, but they have the disadvantage that you cannot reorder arguments at runtime, making them a non-option for internationalization purposes. You could also overload operators to produce type-safe variants of printf(). Anyway, I was curious about variadic templates in C++0x, and as it turned out, they are a blessing!

#include <sstream>
#include <string>
#include <iostream>
#include <stdexcept>
 
// no lambdas in g++ 4.4 yet, hence this little helper
template <typename T> std::string to_string (T val) {
        std::stringstream ss;
        ss << val;
        return ss.str();
}
 
template <typename ...ARGS>
void write (std::string const & fmt, ARGS... args) {
        const std::string argss[] = {to_string (args)...}; // <- indeed
        enum {argss_len = sizeof (argss) / sizeof(argss[0])};
 
        // no range based for loops yet ("for (auto it : fmt)")
        for (auto it = fmt.begin(); it != fmt.end(); ++it) {
                if (*it == '{') {
                        auto const left = ++it;
                        for (; it != fmt.end(); ++it) {
                                // closing brace: fine
                                if (*it == '}')
                                        break;
                                // check if numeric. if not, throw.
                                switch (*it) {
                                default:
                                        throw std::invalid_argument (
                                        "syntax error in format string, "
                                        "only numeric digits allowed between "
                                        "braces"
                                        );
                                case '0':case '1':case '2':case '3':case '4':
                                case '5':case '6':case '7':case '8':case '9':;
                                }
                        }
                        // guard against "{123" at the very end of the string
                        if (it == fmt.end() || *it != '}') {
                                throw std::invalid_argument (
                                        "syntax error in format string, "
                                        "missing closing brace"
                                );
                        }
                        auto const right = it;
 
                        if (left == right) {
                                throw std::invalid_argument (
                                        "syntax error in format string, "
                                        "no index given inside braces"
                                );
                        }
 
                        std::stringstream ss;
                        ss << std::string(left,right);
                        size_t index;
                        ss >> index;
                        if (index >= argss_len) {
                                throw std::invalid_argument (
                                        "syntax error in format string, "
                                        "index too big"
                                );
                        }
                        std::cout << argss[index];
                } else {
                        std::cout << *it;
                }
        }
}
 
void write (std::string const & str) {
        std::cout << str;
}
 
template <typename ...ARGS> void writeln (std::string const & fmt, ARGS... args) {
        write (fmt, args...);
        std::cout << '\n';
}
 
void writeln (std::string const & str) {
        std::cout << str << '\n';
}

You can invoke this function like this (no obfuscation through operator overloading, full type safety):

int main() {
        writeln ("Test: [{0},{1}]", 42, 3.14159);
        writeln ("Test: {1}/{0}!{0}?{0}!!", 43, "hello wurldz");
        writeln ("Test: ");
        return 0;
}

C++0x will also allow user-defined literals, which let you process strings (as in “strings”) at compile time, but they had not been scheduled for implementation in GCC as of the time of writing.

I initially did some yakking about emitting hundreds of fixed code paths for performance reasons. Have a look at the following example for a glimpse of how I will do it:

enum class Xtype {
        add,
        sub,
        yo,
        mama,
        end_
};
template <Xtype ...args> struct test;
template <> struct test <> {
        static void exec () {
                std::cout << "-term-\n";
        }
};
template <Xtype ...others> struct test<Xtype::add, others...> {
        static void exec () {
                std::cout << "add\n";
                test<others...>::exec();
        }
};
template <Xtype ...others> struct test<Xtype::sub, others...> {
        static void exec () {
                std::cout << "sub\n";
                test<others...>::exec();
        }
};
template <Xtype ...others> struct test<Xtype::yo, others...> {
        static void exec () {
                std::cout << "yo\n";
                test<others...>::exec();
        }
};
template <Xtype ...others> struct test<Xtype::mama, others...> {
        static void exec () {
                std::cout << "mama\n";
                test<others...>::exec();
        }
};
int main() {
        test<
                Xtype::add,
                Xtype::add,
                Xtype::sub,
                Xtype::sub,
                Xtype::add,
                Xtype::add,
                Xtype::sub,
                Xtype::yo,
                Xtype::mama,
                Xtype::yo,
                Xtype::mama,
                Xtype::yo,
                Xtype::yo
        >::exec();
        return 0;
}

Output:

add
add
sub
sub
add
add
sub
yo
mama
yo
mama
yo
yo
-term-

No more need for clumsy nesting or for std::tuple (though std::tuple itself profits greatly from variadic templates; for the curious: the boost implementation of tuple is essentially a template with a plethora of parameters, see http://www.boost.org/doc/libs/1_40_0/libs/tuple/doc/tuple_users_guide.html).

Btw, the status of the C++0x implementation in GCC is at http://gcc.gnu.org/projects/cxx0x.html; the implementation status of the updated standard library is at http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.200x.

If you are curious yourself, invoke g++ with -std=c++0x (all lowercase!).

Posted in C++ | Tagged , , | Comments Off

This (->) is not only a matter of style.

Quote:

Original post by Saruman

Quote:

Original post by caldiar
What I’m doing wouldn’t happen to be using that hidden this pointer I keep reading about would it?


That is exactly what you are doing by using this->function2() although you don’t need it at all. In C++ this-> is implied when calling member functions.

There is really zero reason you would ever do that in front of a function, but some people use this-> before member variables in order to avoid code warts. (i.e. instead of naming member variables something like: m_myVariable they will use this->myVariable)

Using this-> is actually a good practice if you write templates.

Consider the following example:

#include <iostream>
 
void vfun() {
        std::cout << "::vfun()\n";
}
 
template <typename T> struct X {
        virtual void vfun() const { std::cout << "X::vfun()\n"; }
};
 
template <typename T> struct Y : public X<T> {
        void fun () { vfun(); }
};
 
template <typename T> struct Z : Y<T> {
        virtual void vfun() const { std::cout << "Z::vfun()\n"; }
};
 
int main () {
        Z<int> z;
        z.fun();
        std::cout << std::flush;
}

Remember that templates are parsed in two phases: first, the template definition itself is parsed, so that errors in the template itself can be found; second, it is parsed again upon instantiation, so that errors resulting from specialisations can be found (early C++ compilers implemented templates more like ordinary #define macros, and we know how hard finding bugs can be that way). Also, only during the second phase can dependent base classes be looked up.

First, a global function vfun() is defined. Nothing special. Then, a class template X<T> is defined, which declares a virtual function vfun().

Another class template Y<T> is then defined, which derives from X<T> but does not override the virtual function X::vfun(). No problem yet. Y<T> also defines a member function named fun(), which itself calls vfun(). (There is a third class template, struct Z<T>, but we ignore it for now.)

Here’s struct Y again:

template <typename T> struct Y : public X<T> {
        void fun () { vfun(); }
};

The problem now is that the call to vfun() is unqualified (no ::vfun() or X<T>::vfun(), just the bare function name). Furthermore, the arguments you pass to it do not depend on a template parameter (in this example, we pass no argument at all). Thus, the call does not qualify for Argument-Dependent Lookup (ADL).

You might now ask, “why does the compiler not just look into the base class?”. The answer lies in possible side effects of explicit or partial specialisations (which may legally occur anywhere in the code): a specialisation X<bool> could be defined in a completely different way than X<float>. For example, X<bool> could define “enum { pipapo = 0 };”, whereas X<float> could define pipapo as “typedef X pipapo;”.

So, what remains during the first phase is ordinary lookup of non-dependent names only, which is approximately the kind of lookup you would have in a C program; during the second phase, only ADL and lookup of dependent names occur.

Explicitly qualifying the call to vfun()

If there were no global vfun() (just try it yourself), a standards-compliant compiler would report an error about a call to an undeclared function. But with the global vfun() present, the compiler happily and unambiguously finds it using ordinary lookup. Thus, instead of calling the base class’s virtual member function vfun(), Y<T>::fun() will call the global one.

A solution to this is to qualify the name:

template <typename T> struct Y : public X<T> {
        void fun () { X<T>::vfun(); }
};

Now the correct function gets called, because we made the call dependent on a template parameter, so that name lookup is delayed until the second phase.

We just inhibited the virtual function call mechanism

But still, we have a problem if another class derives from Y<> and overrides vfun(). Here, our previously ignored struct Z<T> comes into play:

template <typename T> struct Z : Y<T> {
        virtual void vfun() const { std::cout << "Z::vfun()n"; }
};

Note that Z<T> inherits fun() from Y<T>, our previous “problem” function. But in Y<T>::fun(), we qualified the call as X<T>::vfun(), which unfortunately inhibits the virtual call mechanism and falls back to a non-virtual call. So even if we call fun() on a Z<T> object, expecting the override Z<T>::vfun() to run, we really call X<T>::vfun() instead.

While this might have been intentional, what if you really want to call the most derived vfun() inside fun(), instead of the least derived version? How can we make the virtual call dependent on a template parameter, so that lookup is delayed to the second phase of template parsing, without interfering with the usual virtual dispatch rules?

this-> is the solution

Consider that this itself is a dependent name inside a class template, as its type is only completely determined once the template is instantiated (so lookup is delayed to the second phase). The simple solution to our not-so-obvious problem is, when writing a template, to use this-> whenever possible (*).

Here is our modified example:

template <typename T> struct Y : public X<T> {
        void fun () { this->vfun(); }
};

So, actually, using “this->” whenever possible is good style when writing templates. It is not needed in an explicit specialisation (**), because an explicit specialisation is basically the same as an ordinary non-template class or function, where each entity can be looked up during the first parse. (In an explicit specialisation, this is non-dependent, so we could not even defer the lookup to a second phase: all non-dependent names are looked up in the first phase, and no second phase exists for explicit specialisations (***).)

Footnotes

(*) Except, of course, in the case where we are really interested in a very specific entity, like the global vfun() in our example.

(**) If you write the following specialisation, this-> can be omitted, as the compiler can fully instantiate X<int> and Y<int>, and thus find the virtual member function vfun().

template <> struct Y<int> : X<int> {
        void fun () { vfun(); } // Will call the virtual
                                // function, not the global one.
};

(***) A partial specialisation, however, still contains dependent names, and all of the above still applies.

Appendix

In an answer to Hummm below, I mashed up the following snippet for you to play with. It has some explanations; just don’t forget about two-phase lookup:

#include <iostream>
void vfun() {
        std::cout << "::vfun()\n";
}
 
template <typename T> struct X {
        virtual void vfun() const { std::cout << "X::vfun()\n"; }
};
 
template <typename T> struct Y : public X<T> {
        // During compilation, only the global vfun()
        // is visible, so this will call ::vfun().
        void fun0 () { vfun(); }
 
        // This explicitly calls X<T>::vfun()
        // and inhibits virtual function dispatch.
        void fun1 () { X<T>::vfun(); }
 
        // Qualifying the call with this makes the call
        // to vfun() "dependent", as the type of 'this'
        // depends on X<T>, which is not yet known.
        // Thus, the compiler will postpone
        // name-lookup until Phase 2 of template parsing.
        void fun2 () { this->vfun(); }
};
 
template <typename T> struct Z : Y<T> {
        virtual void vfun() const { std::cout << "Z::vfun()\n"; }
};
 
int main () {
        Z<int> z;
        z.fun0();
        z.fun1();
        z.fun2();
        std::cout << std::flush;
}


Virtual Functions Considered Not Harmful, Part II

For lack of a profiler (and for lack of time), I committed the cardinal sin of making unproven statements in Virtual Functions Considered Not Harmful, Part I:


Original post by dascandy
call *%eax

<snip>

call __ZNK3Bar4fun2Eii

The first causes (often) a pipeline stall. The second never does. Please try to *profile* your speed tests before making any statements. You’re not working on 386′s, 486′s or Pentiums, your CPU is not trivial anymore.


The key concepts here are the Branch History Table (as in “your CPU is not trivial anymore” [sic!]) on modern CPUs, and the fact that the advantage of non-virtual functions decreases as the complexity of the function itself increases.

You are right, I should have used a profiler, but unfortunately I don’t have one handy on this box. Hence I wrote the following benchmark (mostly standard C++, except for the asm()s and __attribute__(())s, which are there for the sake of benchmarking).

Benchmark

I tried to be as careful as possible, if I have done something bogus, please let me know. So, here it is:

#ifdef __GNUC__
#define MARK(x) do{asm volatile ("# "#x);}while(0)
#define NO_OPTIMIZE_AWAY() do{asm("");}while(0)
#else
#error "dies"
#endif
 
#include <ctime>
#include <cstdint>   // uint32_t
#include <iostream>
#include <typeinfo>  // typeid
#include <cmath>
 
struct StaticNoInline {
    __attribute__((noinline))
    float dist (float const volatile *lhs, float const volatile *rhs) {
        float const diff [] = {
            rhs [0] - lhs [0],
            rhs [1] - lhs [1],
            rhs [2] - lhs [2],
        };
        float const lenSq = diff [0]*diff [0] + diff [1]*diff [1]
                                                + diff [2]*diff [2];
        return sqrt (lenSq);
    }
};
 
struct StaticForceInline {
    // note: GCC spells this attribute always_inline, not forceinline
    __attribute__((always_inline))
    float dist (float const volatile *lhs, float const volatile *rhs) {
        float const diff [] = {
            rhs [0] - lhs [0],
            rhs [1] - lhs [1],
            rhs [2] - lhs [2],
        };
        float const lenSq = diff [0]*diff [0] + diff [1]*diff [1]
                                                + diff [2]*diff [2];
        return sqrt (lenSq);
    }
};
 
struct IVirtual {
    virtual ~IVirtual () {}  // virtual dtor: we delete through IVirtual&
    virtual float dist (float const volatile *lhs, float const volatile *rhs) = 0;
};
 
struct Virtual : public IVirtual {
    __attribute__((noinline))
    float dist (float const volatile *lhs, float const volatile *rhs) {
        float const diff [] = {
            rhs [0] - lhs [0],
            rhs [1] - lhs [1],
            rhs [2] - lhs [2],
        };
        float const lenSq = diff [0]*diff [0] + diff [1]*diff [1]
                                                + diff [2]*diff [2];
        return sqrt (lenSq);
    }
};
 
template <typename T, typename I, uint32_t count>
void test () {
    static volatile float dist;
    static volatile float lhs [3];
    static volatile float rhs [3];
 
    ::std::cout << "Beginning test for " << typeid (T).name()
                << ", count is " << count
                << '\n';
 
    I &subject = *new T ();
    clock_t const beginT = clock();
    MARK("entering loop");
    for (uint32_t i=0; i<count; ++i) {
        dist = subject.dist (lhs, rhs);
    }
    MARK("left loop");
    clock_t const endT = clock();
    delete &subject;
    ::std::cout << "Test ended, total time " << endT - beginT << "msec.\n";
}
 
int main () {
    const uint32_t count = 1<<27;
    test<Virtual, IVirtual, count> ();
    test<StaticNoInline, StaticNoInline, count> ();
    test<StaticForceInline, StaticForceInline, count> ();
    // Re-execute to take care of CPU heat and priority switches
    test<Virtual, IVirtual, count> ();
    test<StaticNoInline, StaticNoInline, count> ();
    test<StaticForceInline, StaticForceInline, count> ();
    ::std::cout << ::std::flush;
}

Basically, this benchmark calculates many million vector3-lengths.

Results

On a Pentium(R) M, 1.8 GHz, the program gives me numbers that debunk the supposed virtual-call penalty.

Stdout of benchmark

  Beginning test for 7Virtual, count is 134217728
  Test ended, total time 5828msec.
  Beginning test for 14StaticNoInline, count is 134217728
  Test ended, total time 5839msec.
  Beginning test for 17StaticForceInline, count is 134217728
  Test ended, total time 5118msec.
  Beginning test for 7Virtual, count is 134217728
  Test ended, total time 5798msec.
  Beginning test for 14StaticNoInline, count is 134217728
  Test ended, total time 5809msec.
  Beginning test for 17StaticForceInline, count is 134217728
  Test ended, total time 5097msec.

Interpretation

clock() is not very exact (especially on Windows(R)), so I let the test run for over 5 seconds (134,217,728 iterations). In more readable form, the results are:

  • virtual: 5.828 sec, 5.798 sec = 5.813 sec on average
  • static, not inlined: 5.839 sec, 5.809 sec = 5.824 sec on average
  • static, inlined: 5.118 sec, 5.097 sec = 5.1075 sec on average

The static and the virtual variant run at nearly the same speed (roughly 5.8185 sec on average); the inlined version runs in only about 87.78 % of that time.

Upshot

Certainly, inlining enables a bunch of other optimizations, like auto-vectorisation or re-use of data, plus it increases locality; but it has the disadvantage of potential code bloat.

Also, once the compiler decides not to inline a function, the performance difference between virtual and non-virtual calls shrinks (depending on several parameters and configurations), even in tight loops. So if virtual functions would be a great benefit for your application, you might well decide to use them.

Example for a “great benefit” of virtual functions

Just to give an example (blatant self-promotion, sorry): my ray tracer picogen uses virtual functions for one of its shader systems (basically an executable abstract syntax tree; the performance drop is not too big, as the shader is only called for definitively correct intersections, by which time many other operations have already happened). It also uses virtual functions for its intersectable objects, which include a quadtree, a BIH, spheres of course, a cube, a simple DDA-based heightmap, and some hacks where I tried things out (a kd-tree too; oops, that was the ray tracer I wrote before picogen, sorry). The general BIH is an aggregate that can house any other type of Intersectable, including specific and general BIHs. I also implemented a Whitted-style ray tracer and a path tracer (called “Surface Integrators”, as pbrt calls them), which are definitely called for each pixel, even if there is no intersection at all.

Without virtual functions, all this would have been far clumsier, and not necessarily faster. Had I organized those objects in multiple lists, the code could even have run slower than with virtual functions.


Virtual Functions Considered Not Harmful, Part I

Every now and then, the use of virtual functions gets questioned when it comes to runtime performance. I did a bit of work to find the (halfway) truth …

Recently on gamedev.net
Original post by NotAYakk
Note that implementing a faster/leaner virtual function object system in C++ is possible, but (A) it isn’t worth it in most every case, and (B) it isn’t easy to get right.


Did he mean a virtual-function object system or a virtual function-object system? Anyway, I can guess what he means.

I don’t believe it is possible to code something that is faster and leaner and at the same time as mighty as virtual functions, at least not in valid C++, because there I already found something fast and lean, as the next paragraphs show. Anyway, feel free to prove me wrong with actual material, if you find some!

Original post by NotAYakk
Virtual function overhead requires something done on a per-frame and per-pixel basis. Ie, if on every frame, you call the virtual function once per pixel, then you are starting to look at the the point where you should worry about virtual function overhead.

Before that (or other similar cases), it isn’t a concern.

Calling virtual functions on a per-pixel, per-frame basis doesn’t mean anything by itself. Actually, virtual functions are not that bad compared to ordinary non-inlined calls once they actually contain code. For example, if you do some dot products, square roots, Reinhard operators, and the usual stuff one does at the per-pixel level, the call cost becomes negligible.

Let’s ask Dr. GCC about the cost of virtual function calls

Preparation

#include <cstdio>
 
#ifdef __GNUC__
#define MARK(x) do{asm volatile ("# "#x);}while(0)
#define NO_OPTIMIZE_AWAY() do{asm("");}while(0)
#else
#error "dies"
#endif
 
struct Interface {
    virtual ~Interface () {}  // virtual dtor: we delete through Interface&
    virtual int fun() const = 0;
    virtual int fun2 (int, int) const = 0;
    virtual float dot3 (const float *, const float *) const = 0;
};
 
struct Foo : public Interface {
    int const ff;
    Foo (int ff) : ff (ff) {}
 
    int fun() const {
        return ff;
    }
 
    int fun2 (int, int) const {
        return ff;
    }
 
    float dot3 (const float *lhs, const float *rhs) const {
        return lhs[0]*rhs[0] + lhs[1]*rhs[1] + lhs[2]*rhs[2];
    }
};
 
struct Oof : public Foo {
    int const ff;
    Oof (int ff) : Foo (0), ff (ff) {}
 
    int fun() const {
        return ff;
    }
};
 
struct Bar {
    int const ff;
    Bar (int ff) : ff (ff) {}
    int fun() const { return ff; }
    int fun2 (int, int) const {
        return ff;
    }
    float dot3 (const float *lhs, const float *rhs) const {
        return lhs[0]*rhs[0] + lhs[1]*rhs[1] + lhs[2]*rhs[2];
    }
};


Tests

int main () {
    int entropy;
    scanf ("%d", &entropy);
    Interface &foo = *new Foo(entropy);
    MARK("Foo starts here ... ");
    entropy = foo.fun();
    MARK("... and here it ends.");
    printf ("%d\n", entropy);
    delete &foo;

    scanf ("%d", &entropy);
    Interface &oof = *new Oof(entropy);
    MARK("Oof starts here ... ");
    entropy = oof.fun();
    MARK("... and here it ends.");
    printf ("%d\n", entropy);
    delete &oof;

    scanf ("%d", &entropy);
    Bar &bar = *new Bar(entropy);
    MARK("Bar starts here ... ");
    entropy = bar.fun();
    MARK("... and here it ends.");
    printf ("%d\n", entropy);
    delete &bar;
}

Results

# call to Foo::fun()
	movl	-12(%ebp), %eax
	movl	(%eax), %edx
	movl	-12(%ebp), %eax
	movl	%eax, (%esp)
	movl	(%edx), %eax
	call	*%eax
	movl	%eax, -8(%ebp)
 
# call to Oof::fun()
	movl	-16(%ebp), %eax
	movl	(%eax), %edx
	movl	-16(%ebp), %eax
	movl	%eax, (%esp)
	movl	(%edx), %eax
	call	*%eax
	movl	%eax, -8(%ebp)
 
# call to Bar::fun()
	movl	-20(%ebp), %eax
	movl	%eax, (%esp)
	call	__ZNK3Bar3funEv
	movl	%eax, -8(%ebp)


We see that the op-count grew from 4 to 7 (a factor of 1.75). The cost of the bare call will partially vanish with an increasing number of operands.

Adding Parameters

Let’s add another test case (you can copy this into main() if you are trying it yourself):

*snip*
int main () {
    *snip*
    int entropy2, entropy3;
    scanf ("%d %d %d", &entropy, &entropy2, &entropy3);
    Interface &foo2 = *new Foo(entropy);
    MARK("Foo2 starts here ... ");
    entropy = foo2.fun2 (entropy2,entropy3);
    MARK("... and here it ends.");
    printf ("%d %d %d\n", entropy, entropy2, entropy3);
    delete &foo2;

    scanf ("%d %d %d", &entropy, &entropy2, &entropy3);
    Bar &bar2 = *new Bar(entropy);
    MARK("Bar2 starts here ... ");
    entropy = bar2.fun2 (entropy2,entropy3);
    MARK("... and here it ends.");
    printf ("%d %d %d\n", entropy, entropy2, entropy3);
    delete &bar2;
}


The results for this:

# Foo::fun2(int,int)
	movl	-32(%ebp), %eax
	movl	(%eax), %edx
	addl	$4, %edx
	movl	-28(%ebp), %eax
	movl	%eax, 8(%esp)
	movl	-24(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	-32(%ebp), %eax
	movl	%eax, (%esp)
	movl	(%edx), %eax
	call	*%eax
	movl	%eax, -8(%ebp)
# Bar::fun2(int,int)
	movl	-28(%ebp), %eax
	movl	%eax, 8(%esp)
	movl	-24(%ebp), %eax
	movl	%eax, 4(%esp)
	movl	-36(%ebp), %eax
	movl	%eax, (%esp)
	call	__ZNK3Bar4fun2Eii
	movl	%eax, -8(%ebp)


This time, the op-count increased by a factor of 1.5 (12 ops virtual, 8 ops non-virtual).

Adding code at the target site

This was just the call! Now we’ll add in the code of the functions themselves. The function definitions all look the same, so I just show one of them:

__ZNK3Bar4fun2Eii:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	movl	(%eax), %eax
	popl	%ebp
	ret


We see that to the call ops we have to add 6 more ops, so our previous growth factor of 1.75 decreases to (7+6) / (4+6) = 1.3, and 1.5 decreases to (12+6) / (8+6) ≈ 1.29.

Adding actual code to the target site

Now let’s have a look at the assembly produced for our dot-product functions. Remember, they looked as simple as:

float dot3 (const float *lhs, const float *rhs) const {
        return lhs[0]*rhs[0] + lhs[1]*rhs[1] + lhs[2]*rhs[2];
}

The assembly for all versions again looks the same, so here is just one of them:

pushl	%ebp
	movl	%esp, %ebp
	movl	12(%ebp), %eax
	movl	16(%ebp), %edx
	flds	(%eax)
	fmuls	(%edx)
	movl	12(%ebp), %eax
	addl	$4, %eax
	movl	16(%ebp), %edx
	addl	$4, %edx
	flds	(%eax)
	fmuls	(%edx)
	faddp	%st, %st(1)
	movl	12(%ebp), %eax
	addl	$8, %eax
	movl	16(%ebp), %edx
	addl	$8, %edx
	flds	(%eax)
	fmuls	(%edx)
	faddp	%st, %st(1)
	popl	%ebp
	ret

And we find 22 instructions for a simple dot product. Our additional cost is now:

  • (12+22) / (8+22) = approx. 1.1333

So instead of 40 frames per second you get “only” about 35, but you gain many benefits. Add some more operations, and the relative overhead decreases further, and the blanket statement “virtual functions at the per-pixel, per-frame level considered harmful” vanishes.

We learn today (how pathetic :D)…

that the cost (*) of calling a virtual function, compared to a non-inlined non-virtual function, vanishes with the number of arguments you pass, with the complexity of what is inside the function, and with the number of return values (in C/C++, either 0 or 1).

(*): Actually only with respect to size/op-count, as we haven’t done any benchmark yet; but see Part II.

sidenote:

Quote:

No, don’t do epic switch statements.


But sometimes, only sometimes, epic switch statements are the right thing. For example, if you write a performant software-based virtual machine with many trivial micro-operations (e.g. add or mul), then you want jump tables. And jump tables you generally only get with epic switch statements, or by relying on pointers to labels (GCC supports them, but they are not standard).
