PyCon 2010: The Mighty Dictionary

By: Eugene Yarmash

264   0   13312

Uploaded on 02/09/2014

PyCon 2010: The Mighty Dictionary

Presented by Brandon Craig Rhodes

Comments (8):

By anonymous    2017-09-20

Let me address some issues you've raised in your question.

Here's the thing, right now the search operation is linear time and I'd like it to become constant time and I'm wondering about a simple/lightweight way to achieve this.

A simple lightweight way to achieve this, i.e., to have an associative array (a.k.a. key-value-store), is to use one provided by the standard library.

You are coding in a recent version of C++, you are in the lucky position that the standard library actually provides one that satisfies your constant-time requirements:

The implementation of the data structures shipped as part of a standard library of any decent compiler these days, are probably better than anything you could come up with. (Or why did you ask for give me the code?).

I've seen hashtables are a possible option but I'd prefer not having to implement a hash function per object. Maybe something similar to an unordered_map... dunno.

A std::unordered_map actually is a hash table, and as you can see in the docs, it takes a hash function. As you can see written in the docs there are lots of specializations for lots of types already available, that can help you derive a custom hash function for your custom object types:

Could anyone give some ideas or maybe providing a simple lightweight implementation of what I'm trying to achieve here?

Just have a look at the example code to std::unordered_map to see how it's used. If you worry about performance, don't forget to measure. If you really want to consume some input on implementation of hash tables, I liked these talks on the Python dictionary:

Also have a look at the wikipedia page (if you haven't already):

In this fictional example I'm using std::vector to avoid making the question bigger and more complex than what it is but my the real use-case won't be using the STL at all (ie: i'll be coding my own custom implementation of std::vector)

Unless you are doing this for educational/recreational purposes, don't do it. Don't be ashamed to base your endeavours on the shoulders of giants. That the standard library wasn't invented in your project is not a problem.

Original Thread

By anonymous    2017-09-20

Actually a dictionary can shrink upon resize, but the resize only happens upon a key insert not removal. Here's a comment from the CPython source for dictresize:

Restructure the table by allocating a new table and reinserting all items again. When entries have been deleted, the new table may actually be smaller than the old one.

By the way, since the other answer quotes Brandon Rhodes talk on the dictionary at PyCon 2010, and the quote seems to be at odds with the above (which has been there for years), I thought I would include the full quote, with the missing part in bold.

Resizes are only calculated upon insertion. As a dictionary shrinks, it just gains a lot of dummy entries and as you refill it, it will just start re-using those to store keys. It will not resize until you manage to make it two-thirds full again at its larger size. So it does not resize as you delete keys. You have to do an insert to get it to figure out it needs to shrink.

So he does say the resizing operation can "figure out [the dictionary] needs to shrink". But that only happens on insert. Apparently when copying over all the keys during resize, the dummy keys can get removed, reducing the size of the backing array.

It isn't clear, however, how to get this to happen, which is why Rhodes says to just copy over everything to a new dictionary.

Original Thread

By anonymous    2017-09-20

How are Python's Built In Dictionaries Implemented?

Python's Dictionaries are Hash Tables

For a long time, it worked like this. Python would preallocate 8 empty rows and use the hash to determine where to stick the key-value pair. For example, if the hash for the key ended in 001, it would stick it in the 1 index (like the example below.)

     hash         key    value
     null        null    null
...010001    ffeb678c    633241c4 # addresses of the keys and values
     null        null    null
      ...         ...    ...

Each row takes up 24 bytes on a 64 bit architecture, 12 on a 32 bit. (Note that the column headers are just labels - they don't actually exist in memory.)

If the hash ended the same as a preexisting key's hash, this is a collision, and then it would stick the key-value pair in a different location.

After 5 key-values are stored, when adding another key-value pair, the probability of hash collisions is too large, so the dictionary is doubled in size. In a 64 bit process, before the resize, we have 72 bytes empty, and after, we are wasting 240 bytes due to the 10 empty rows.

This takes a lot of space, but the lookup time is fairly constant. The key comparison algorithm is to compute the hash, go to the expected location, compare the key's id - if they're the same object, they're equal. If not then compare the hash values, if they are not the same, they're not equal. Else, then we finally compare keys for equality, and if they are equal, return the value. The final comparison for equality can be quite slow, but the earlier checks usually shortcut the final comparison, making the lookups very quick.

(Collisions slow things down, and an attacker could theoretically use hash collisions to perform a denial of service attack, so we randomized the hash function such that it computes a different hash for each new Python process.)

The wasted space described above has led us to modify the implementation of dictionaries, with an exciting new (if unofficial) feature that dictionaries are now ordered (by insertion).

The New Compact Hash Tables

We start, instead, by preallocating an array for the index of the insertion.

Since our first key-value pair goes in the second slot, we index like this:

[null, 0, null, null, null, null, null, null]

And our table just gets populated by insertion order:

     hash         key    value
...010001    ffeb678c    633241c4 
      ...         ...    ...

So when we do a lookup for a key, we use the hash to check the position we expect (in this case, we go straight to index 1 of the array), then go to that index in the hash-table (e.g. index 0), check that the keys are equal (using the same algorithm described earlier), and if so, return the value.

We retain constant lookup time, with minor speed losses in some cases and gains in others, with the upside that we save quite a lot of space over the pre-existing implementation. The only space wasted are the null bytes in the index array.

Raymond Hettinger introduced this to python-dev in December of 2012. It finally got into CPython in Python 3.6. Ordering by insertion is still considered an implementation detail to allow other implementations of Python a chance to catch up.

Shared Keys

Another optimization to save space is an implementation that shares keys. Thus, instead of having redundant dictionaries that take up all of that space, we have dictionaries that reuse the shared keys and keys' hashes. You can think of it like this:

     hash         key    dict_0    dict_1    dict_2...
...010001    ffeb678c    633241c4  fffad420  ...
      ...         ...    ...       ...       ...

For a 64 bit machine, this could save up to 16 bytes per key per extra dictionary.

Shared Keys for Custom Objects & Alternatives

These shared-key dicts are intended to be used for custom objects' __dict__. To get this behavior, I believe you need to finish populating your __dict__ before you instantiate your next object (see PEP 412). This means you should assign all your attributes in the __init__ or __new__, else you might not get your space savings.

However, if you know all of your attributes at the time your __init__ is executed, you could also provide __slots__ for your object, and guarantee that __dict__ is not created at all (if not available in parents), or even allow __dict__ but guarantee that your foreseen attributes are stored in slots anyways. For more on __slots__, see my answer here.

See also:

Original Thread

By anonymous    2017-12-24

As others have pointed out python dictionaries do not guarantee that the items can be accessed in the same order as the order of insertion. This is because python list and dictionary are different data structures each has its own way of retrieving values. If you're interested in this check out this amazing video to understand how dictionary items are stored and retrieved and why order is not preserved.

That being said if you simply want to print or access the items of the dictionary based on items in things there are several options as below apart from Ordereddict already mentioned.

List comprehension

>>> print [(key, d[key]) for key in things]                                                                                                          
[(4, 1), (5, 1), (2.0, 0.5), (0, 0)]
>>> print "{ %s }" % ", ".join(["%s : %s" % (key, d[key]) for key in d])
{ 0 : 0, 2.0 : 0.5, 4 : 1, 5 : 1 }

You can also sort the dictionary based on the things list

>>> print sorted(d.items(), key = lambda x: things.index(x[0]))
[(4, 1), (5, 1), (2.0, 0.5), (0, 0)]

Original Thread

By anonymous    2018-01-01

@user3285099 I think this is the video I watched a while back that explained it well. It's about dicts, but I think sets are done the same way: https://www.youtube.com/watch?v=C4Kc8xzcA68

Original Thread

Submit Your Video

If you have some great dev videos to share, please fill out this form.