Tech Talk: Linus Torvalds on git

By: Google


Uploaded on 05/14/2007

Linus Torvalds visits Google to share his thoughts on git, the source control management system he created two years ago.

Comments (8):

By wyclif    2017-09-20

Linus Torvalds on git https://youtu.be/4XpnKHJAok8

Original Thread

By nialv7    2017-09-20

> but git doesn't actually just hash the data, it does prepend a type/length field to it.

To me it feels like this would just be a small hurdle? But I don't really know this stuff that well. Can someone with more knowledge share their thoughts?

I think Linus also argued that SHA-1 is not a security feature for git (https://youtu.be/4XpnKHJAok8?t=57m44s). Has that been changed?
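For context, the "type/length field" mentioned above is the object header that git hashes together with the content. A minimal Python sketch of how a blob's object ID is computed (the header is the documented "blob <size>\0" prefix; the sample content is arbitrary):

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    # git hashes "blob <size>\0" followed by the raw content,
    # not the bare content itself
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Should agree with `git hash-object --stdin` fed the same bytes.
print(git_blob_sha1(b"hello world\n"))
```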

Original Thread

By snowwrestler    2017-09-20

Linus: "You can have people who try to be malicious... they won't succeed."

Linus talked about why git's use of a "strong hash" made it better than other source control options during his talk at Google in 2007.

https://youtu.be/4XpnKHJAok8

Edit: the whole talk is good but the discussion of using hashes starts at about 55 min.

Original Thread

By anonymous    2017-09-20

You can hear that from Linus Torvalds himself, when he presented Git at Google back in 2007:
(emphasis mine)

We check checksums that are considered cryptographically secure. Nobody has been able to break SHA-1, but the point is that SHA-1, as far as git is concerned, isn't even a security feature. It's purely a consistency check.
The security parts are elsewhere. A lot of people assume that since git uses SHA-1, and SHA-1 is used for cryptographically secure stuff, it must be a huge security feature. It has nothing at all to do with security, it's just the best hash you can get.

Having a good hash is good for being able to trust your data, it happens to have some other good features, too, it means when we hash objects, we know the hash is well distributed and we do not have to worry about certain distribution issues.

Internally it means from the implementation standpoint, we can trust that the hash is so good that we can use hashing algorithms and know there are no bad cases.

So there are some reasons to like the cryptographic side too, but it's really about the ability to trust your data.
I guarantee you, if you put your data in git, you can trust the fact that five years later, after it has been converted from your hard disk to DVD to whatever new technology and you copied it along, you can verify that the data you get back out is the exact same data you put in. And that is something you really should look for in a source code management system.


I mentioned in "How would Git handle a SHA-1 collision on a blob?" that you could engineer a commit with a particular SHA-1 prefix (still an extremely costly endeavor).
But the point remains, as Eric Sink mentions in "Git: Cryptographic Hashes" (from his 2011 book Version Control by Example):

It is rather important that the DVCS never encounter two different pieces of data which have the same digest. Fortunately, good cryptographic hash functions are designed to make such collisions extremely unlikely.

It is harder to find a good non-cryptographic hash with a low collision rate, unless you consider research like "Finding State-of-the-Art Non-cryptographic Hashes with Genetic Programming".

You can also read "Consider use of non-cryptographic hash algorithm for hashing speed-up", which mentions for instance "xxhash", an extremely fast non-cryptographic hash algorithm working at speeds close to RAM limits.
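As a rough illustration of the speed gap that discussion is about, the snippet below times Python's built-in SHA-1 against CRC-32, used here only as a stand-in for a non-cryptographic hash (xxhash itself is a third-party package); the absolute numbers depend entirely on the machine.

```python
import hashlib
import time
import zlib

data = b"x" * (64 * 1024 * 1024)  # 64 MiB of dummy data

def throughput(label, fn):
    # Time a single pass over the buffer and report MB/s.
    start = time.perf_counter()
    fn(data)
    elapsed = time.perf_counter() - start
    print(f"{label}: {len(data) / elapsed / 1e6:.0f} MB/s")

throughput("SHA-1 (cryptographic)", lambda d: hashlib.sha1(d).digest())
throughput("CRC-32 (non-cryptographic)", lambda d: zlib.crc32(d))
```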


Discussions around changing the hash in Git are not new:

(Linus Torvalds)

There's not really anything remaining of the mozilla code, but hey, I started from it. In retrospect I probably should have started from the PPC asm code that already did the blocking sanely - but that's a "20/20 hindsight" kind of thing.

Plus hey, the mozilla code being a horrid pile of crud was why I was so convinced that I could improve on things. So that's a kind of source for it, even if it's more about the motivational side than any actual remaining code ;)

And you need to be careful about how to measure the actual optimization gain:

(Linus Torvalds)

I pretty much can guarantee you that it improves things only because it makes gcc generate crap code, which then hides some of the P4 issues.

(John Tapsell - johnflux)

The engineering cost for upgrading git from SHA-1 to a new algorithm is much higher. I'm not sure how it can be done well.

First of all we probably need to deploy a version of git (let's call it version 2 for this conversation) which allows there to be a slot for a new hash value even though it doesn't read or use that space -- it just uses the SHA-1 hash value which is in the other slot.

That way once we eventually deploy yet a newer version of git, let's call it version 3, which produces SHA-3 hashes in addition to SHA-1 hashes, people using git version 2 will be able to continue to inter-operate.
(Although, per this discussion, they may be vulnerable and people who rely on their SHA-1-only patches may be vulnerable.)

In short, switching to any hash is not easy.
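To make the "slot for a new hash value" idea concrete, here is a toy sketch (not git's actual transition code) that names the same blob content under two hash functions; as far as I understand the transition design, the hashed "type size\0content" layout stays the same and only the hash function changes:

```python
import hashlib

def object_id(data: bytes, algo: str) -> str:
    # Same "blob <size>\0" + content layout, different hash function.
    header = b"blob %d\0" % len(data)
    return hashlib.new(algo, header + data).hexdigest()

content = b"hello world\n"
ids = {algo: object_id(content, algo) for algo in ("sha1", "sha256")}
print(ids)  # one object, two names kept side by side
```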


Update February 2017: yes, a SHA-1 collision has now actually been computed, as shown at shattered.io.

How is GIT affected?

GIT strongly relies on SHA-1 for the identification and integrity checking of all file objects and commits.
It is essentially possible to create two GIT repositories with the same head commit hash and different contents, say a benign source code and a backdoored one.
An attacker could potentially selectively serve either repository to targeted users. This will require attackers to compute their own collision.

But:

This attack required over 9,223,372,036,854,775,808 SHA1 computations. This took the equivalent processing power as 6,500 years of single-CPU computations and 110 years of single-GPU computations.

So let's not panic just yet.
See more at "How would Git handle a SHA-1 collision on a blob?".
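One detail worth spelling out: a collision on raw file contents does not automatically carry over to git's blob hash, because git hashes a header plus the content, which changes the hashed input. A small sketch for comparing the two (the filenames are placeholders, e.g. the two colliding PDFs from shattered.io):

```python
import hashlib

def raw_sha1(data: bytes) -> str:
    # SHA-1 of the bare file contents
    return hashlib.sha1(data).hexdigest()

def blob_sha1(data: bytes) -> str:
    # git hashes "blob <size>\0" + content, not the bare content
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

# Placeholder filenames for any two files you want to compare
for path in ("shattered-1.pdf", "shattered-2.pdf"):
    data = open(path, "rb").read()
    print(path, raw_sha1(data), blob_sha1(data))
```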

Original Thread

By anonymous    2017-09-20

  • One tree object per directory: the tree referenced by a commit is the root directory, and it contains pointers to blobs and to the trees of subdirectories.
  • git reuses blobs/trees if nothing changed. At some point it will also offer to gc, which (among other things) packs objects and stores deltas instead of whole blobs.
  • A "blob" object is nothing but a chunk of binary data. The filename lives in the tree, not the blob, so many different identical files may refer to the same blob.
  • As mentioned, git reuses blobs for identical files and will eventually pack loose objects into packfiles (loose objects are zlib-compressed to begin with); git is very efficient and was built with space and time efficiency in mind (a sketch of the loose-object layout follows below).

See also Git for Computer Scientists and the chapter 10 referenced in the comments.
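To see the blob and zlib points above in practice, here is a small sketch that reads a loose object directly from .git/objects and splits the "type size\0content" layout; it only works for loose objects, not packed ones, and the repo path and object ID are placeholders:

```python
import hashlib
import zlib
from pathlib import Path

def read_loose_object(repo: str, oid: str):
    # Loose objects live at .git/objects/<first 2 hex chars>/<rest>
    path = Path(repo, ".git", "objects", oid[:2], oid[2:])
    raw = zlib.decompress(path.read_bytes())
    header, _, body = raw.partition(b"\0")
    obj_type, size = header.split(b" ")
    # The object ID is the SHA-1 of the decompressed bytes, which is
    # how git can verify on the way out that the data is unchanged.
    assert hashlib.sha1(raw).hexdigest() == oid
    return obj_type.decode(), int(size), body

# Placeholder repo path and object ID:
# print(read_loose_object("/path/to/repo", "0123abc..."))
```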

Original Thread

By anonymous    2017-10-30

Git is inherently de-centralized. That was a core tenet of its design, as described by Linus Torvalds (the creator of git):

I think that many others had been frustrated by all the same issues that made me hate SCM’s, and while there have been many projects that tried to fix one or two small corner cases that drove people wild, there really hadn’t been anything like git that really ended up taking on the big problems head on. Even when people don’t realize how important that “distributed” part was (and a lot of people were fighting it), once they figure out that it allows those easy and reliable backups, and allows people to make their own private test repositories without having to worry about the politics of having write access to some central repository, they’ll never go back.

(Emphasis mine)

You can also watch a talk Linus gave on Git, specifically about the importance of the distributed nature of git, at Google in 2007.

Basically, your local repo is a backup of your server. If your server goes down, you can spin up a new one based on your local repo.

As far as having multiple servers... sure, it's possible, and it might make you feel better. You could set up the second server to have an "origin" pointing to the main server, the main server with an "origin" pointing to the second server, and just run git fetch at regular intervals on each. The logistics of merging aren't terribly complicated... each of your servers is acting kind of like another developer. You might have some confusion about the source of truth should they ever diverge or have conflicts, but you could resolve those as long as you are diligent about it. But you wouldn't have anything that you don't already have on every developer's machine, and you'd probably be better off just backing up the server at regular intervals. There's no problem with that approach... it's just a bit unnecessary and redundant.
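A minimal sketch of the "fetch at regular intervals" part, assuming each server keeps a clone whose remote points at the other; the paths, remote layout, and interval are placeholders, and a cron job or a `git clone --mirror` would be the more usual way to do this:

```python
import subprocess
import time

# Placeholder paths: each clone's remote(s) point at the other server.
MIRROR_REPOS = ["/srv/git/mirror-of-main.git", "/srv/git/mirror-of-second.git"]
INTERVAL_SECONDS = 300

while True:
    for repo in MIRROR_REPOS:
        # Fetch everything from the configured remotes, pruning deleted refs.
        subprocess.run(["git", "-C", repo, "fetch", "--all", "--prune"],
                       check=False)
    time.sleep(INTERVAL_SECONDS)
```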

Original Thread
