Sunday, March 4, 2012

Dr. StrangeCoder or: How I Learned to Stop Worrying and Love C++

posted by Kurtis at
Warning! This blog entry is about programming. Really. It's about C++. To all my non-programming friends: You won't care. Stop reading.

A good friend of mine at work (who is probably one of the best coders I've ever met, and a really nice guy to boot) posted the following over at his blog:

http://blog.grumpycoder.net/index.php/post/2012/03/02/Why-C-is-a-good-language-or-why-you-re-wrong-thinking-C-is-better

We've exchanged email about it a bit already, but I wanted to find out if the things I was saying actually made sense, so I took his challenge (he got it from http://usuckatcoding.com/):
Given a string of random characters, return the index of the first nonrepeated character. If all characters repeat, return -1. For example, given the string "abcab", 'c' is the first nonrepeated character and the result would be 2.

BONUS: There are time/space tradeoffs to be considered. The best solution will address any optimal time solution as well as any optimal space solution.
He did a solution in C to prove his point. He describes it thusly:
It uses ~1800 bytes on the stack. It's in O(n), and starts with a memset of 258 bytes. It supports strings up to 2^31 bytes.
The thing is, that's a very C way of looking at coding.

So I sat down to do this in C++. This took about two minutes to write (I'm omitting testing code) and it uses the identical algorithm to my friend's C solution:

int challenge1(const char *str)  {
    std::list<int>  results;
    std::map<char,std::list<int>::iterator>  seen_map;
    
    char c;
    int ret;
    for(ret = 0; (c = str[ret]); ++ret)  {
        std::map<char,std::list<int>::iterator>::iterator it = seen_map.find(c);
        if(seen_map.end() == it)  {
            // if we haven't seen it before, record it and it's position
            results.push_front(ret);
            seen_map.insert(std::pair<char,std::list<int>::iterator>(c,results.begin()));
        } else  {
            if(results.end() != it->second)  {
                // this is the duplicated character, remove it from the results
                results.erase(it->second);
                it->second = results.end();
            }
        }
    }
    results.push_front(-1);

    return results.back();
}

It's a very C++-ish solution. It does suffer in performance from its extensive heap usage (as my friend asserts) and it's even not O(n) because std::map is an ordered map (and therefore, likely, a tree). However, it is shorter, and the use of the insert/erase function names gives the sense of exactly what it's doing. (Also many of you who already hate C++ syntax are cringing at the iterator declaration inside the for-loop, and I won't disagree, though C++11's auto keyword and a better insert() prototype would fix that.)

Now, we can do better, but I don't think an average C++ programmer would have a lot of difficulty coming to this average solution fairly quickly. Possibly even quicker than the C guy.

So why do I think C++ is a better language? Because it lets you write suboptimal code faster? People who make that argument are (rightly) missing my friend's point. If your desire is to trade performance for ease of coding then use something even easier than C++ for goodness sakes!

I think C++ is a better language than C because it also doesn't hold back the performance. Let's look at iterating on this from a strictly C++ methodology. What do we see? Well, it's very specific to this problem. Let's genericize it (Sharon's gonna kill me for making up words.) We're going to explicitly overdo it to make a point:

template <class Traits>
int challenge2(const typename Traits::Container &container)  {
    typedef typename Traits::List                               ListType;
    typedef typename ListType::iterator                         ListIterator;
    typedef typename Traits::Map                                MapType;
    typedef typename Traits::MapNode                            MapNodeType;
    typedef typename MapType::iterator                          MapIterator;
    typedef typename Traits::Container                          ContainerType;
    typedef typename ContainerType::const_iterator              ContainerConstIterator;
    typedef typename Traits::Value                              ValueType;

    ListType    results;
    MapType     seen_map;

    int ret = 0;
    for(ContainerConstIterator cit = container.begin();
        cit != container.end(); ++cit, ++ret)  {
        const ValueType &v = *cit;
        MapIterator mit = seen_map.find(v);
        if(seen_map.end() == mit)  {
            results.push_front(ret);
            seen_map.insert(MapNodeType(v,results.begin()));
        } else  {
            if(results.end() != mit->second)  {
                results.erase(mit->second);
                mit->second = results.end();
            }
        }
    }
    results.push_front(-1);

    return results.back();
}

struct ChallengeTraits 
{
    typedef std::list<int>                  List;
    typedef char                            Value;
    typedef std::vector<Value>              Container;
    typedef std::map<Value,List::iterator>  Map;
    typedef std::pair<Value,List::iterator> MapNode;
};

Now here's where the real beauty of C++ kicks in. This version is generic enough that the space/time tradeoff can be done ENTIRELY in the Traits class (and fix that whole ordered_map thing at the same time). I could, for instance, define my ChallengeTraits class to use stack-based fixed size implementations like the C solution and use exactly the same space. I leave the code as an exercise to the reader (though it's not hard) and postulate that the C++ STL SHOULD PROVIDE those classes. Since it doesn't (and in general the STL doesn't provide enough flexibility with resource allocation with it's broken allocator model) I would expect that any C++ coder who cares about performance would write them exactly once in their career and then just use them over and over. The code (outside of Traits) hasn't grown appreciably longer.

But that's not all! Act now, and get a generic search in your generic random access container for the first non-duplicated value. In fact, if you change the function to return an iterator you don't even need the random access part.

Also, if you know things about your value set, you can get even better solutions. For instance, maybe the string only contains printable characters and you could code up your Map class to map your range of (32-126) to (0-94) and use less space than the current C solution. All the complexity is hidden away in your Traits, which you customize to your problem.

Let's review. C++ lets us use it's type and template system to specify an object with the exact minimum behavior we need for our algorithm to run correctly, which then lets you deal with the complexity to run optimally elsewhere and in a reusable way. It's the win-win scenario (though not the win-win-win your Michael Scotts might want.)

Class dismissed.

3 Comments:

Blogger Nicolas said...

Ha ha, nice :)

I think I'll admit this: you don't like C++ for the same reason the (vast ?) majority of the other C++ coders do, and you even mention this. If you use C++ for its simplicity, go away. By the way, I challenge you to take a programming course here, at UCIrvine, and stay calm and placid at the way the teacher introduces his students to programming by explaining how to use C++ streams. True story, btw.

I may have forgot to emphasis more on my post that it was a reaction to discussions I had with other people, based on the famous rants by Linus about C++ (that you can see here). Linus sure isn't much of a person person, and does a bad job at making his point, so I felt I needed to explain his view a bit more. Which of course derived in me trying to troll you.


I tip my hat to your resilience to my trollattitude, and bow to your kindness, my friend :)

March 4, 2012 at 2:59 AM  
Blogger Kurtis said...

Yeah. I'll go a step further and suggest that if you're starting a new project from scratch and can pick any programming language in the world (or even make up new ones) C++ is probably never the right choice.

I just like it a lot personally, but that might be because I know it the best and because I've taken the time to get to know it pretty well. It's worth mentioning that I bet lots of C++ programmers of the level you talk about would be pretty scared of that template and all those typedefs. Which (as I've mentioned before) is like a C programmer being scared of pointers. It makes the language complex (some might say unnecessarily complex) but I find it useful.

My problem with lots of Linus's rants is that he is a C guy who's seen a lot of bad C++. I have sympathy for him (my first job was working on a Windows kernel driver that was written entirely in C++, some of it pretty badly) but here's the thing: as bad as C++ is, I can write pretty cool code with it. I know a fair bit of Java, have written a lot in straight C, learned enough Perl to be afraid, and can recognize enough Python and Ruby to be dangerous. But ultimately, I come back to C++, because as I get better at it I can use what I learn to write everything from very low level code to very high level code, so my entire skill set improves. Maybe that came at too high a cost (I could've just learned the others) but now that it's done, I like it.

March 4, 2012 at 10:27 AM  
Blogger Ken Matheis said...

I know this isn't relevant, but one can abandon C++ entirely without sacrificing performance using Haskell. To wit:

import Array

a = array ('A','z') [(x,[]) | x <- ['A'..'z']]

populate [] arr i len = arr
populate (s:ss) arr i len = populate ss (arr // [(s,(i:(arr ! s)))]) (i+1) len

singleton [_] = True
singleton _ = False

getsingles arr = map head (filter singleton (map snd (assocs arr)))

getindex str
| (singles == []) = (-1)
| otherwise = minimum singles
where singles = getsingles (populate str a 0 (length str))

Everything is O(n) and reasonably memory-tight. I could write a two-liner that's O(n log n), but that's cheating. ;) Also, I'm sure a Haskell head could shorten this, but I think this is simple enough to illustrate stuff.

Also, if you didn't get my earlier message, feel free to call. I miss you! :)

March 4, 2012 at 1:16 PM  

Post a Comment

Subscribe to Post Comments [Atom]

<< Home