> Without the proxy it would just be millions of users hammering their servers.
Doing shallow clones, which are significantly cheaper.
Google is effectively DDoSing them by the design of its service. Why a full git clone, and why not a shallow one? Why does the proxy need to do a full git clone of a repository up to a hundred times an hour? Nothing needs that refresh frequency.
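For reference, a version-pinned shallow fetch is well within git's capabilities. Here's a minimal Go sketch of what one looks like — the repository URL and tag are placeholders, and this is only an illustration of the cheaper fetch shape, not a claim about how the proxy is built:

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	dir := "/tmp/shallow-fetch-example"

	// A bare init plus a depth-1 fetch of one tag transfers only the
	// objects reachable from that tag's commit, not the full history
	// that a plain `git clone` pulls down. Assumes git is on PATH and
	// the server supports shallow fetches (smart HTTP generally does).
	steps := [][]string{
		{"git", "init", "--bare", dir},
		{"git", "-C", dir, "fetch", "--depth=1",
			"https://git.sr.ht/~example/somerepo", // placeholder URL
			"refs/tags/v1.2.3:refs/tags/v1.2.3"},  // placeholder tag
	}
	for _, step := range steps {
		out, err := exec.Command(step[0], step[1:]...).CombinedOutput()
		if err != nil {
			log.Fatalf("%v failed: %v\n%s", step, err, out)
		}
	}
	log.Println("fetched one tag at depth 1 into", dir)
}
```

The transferred pack contains only the objects reachable from that one tag, which for most repositories is a small fraction of the full history.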
The likely answer is that the shared state needed to handle this isn't a trivial addition; it's a lot simpler to build nodes that only maintain their own state. Instead of fetching once and sharing that state across the service, every node or small cluster of nodes just does its own thing. You don't need shared state to run the service, so why bother? It's just needless complexity after all, and all it costs is bandwidth, right?
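To make the missing coordination concrete: even without cross-service shared state, redundant concurrent fetches can at least be coalesced within a node. A minimal sketch using golang.org/x/sync/singleflight — fetchRepo and the repo key are hypothetical stand-ins, not the proxy's actual internals:

```go
package main

import (
	"log"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchRepo is a hypothetical stand-in for the expensive full clone.
func fetchRepo(url string) (interface{}, error) {
	log.Println("cloning", url) // the expensive part
	return "repo contents", nil
}

func main() {
	var wg sync.WaitGroup
	// A dozen concurrent requests for the same module collapse into
	// (typically) one clone; the rest block and share the result.
	for i := 0; i < 12; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, err, shared := group.Do("git.sr.ht/~example/somerepo",
				func() (interface{}, error) {
					return fetchRepo("https://git.sr.ht/~example/somerepo")
				})
			if err == nil {
				log.Println("got", v, "shared:", shared)
			}
		}()
	}
	wg.Wait()
}
```

This only dedupes within one process, though; deduping across nodes is exactly the shared state (and the hard engineering) the design apparently avoided.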
That kind of laziness is barely acceptable when you're interacting with your own infrastructure and bear the scaling consequences yourself. Google notoriously doesn't let engineers see the cost of what they run, because engineers will over-optimise for the wrong things, but that also teaches them not to pay attention to the costs they inflict on other people.
It's unacceptable to behave this way when you're accessing third parties. As a consumer you have a responsibility to consume in a sensible, considered fashion. Shirk it, and your laziness isn't just costing you money; it's costing other people, who don't have stupidly deep pockets like Google's.
This is just another way in which operating at big-tech-money scales blinds you to basic good practice (I say this as someone who has spent over a decade now working for big tech companies...)
> Google notoriously doesn't let engineers know the cost of what they run
Huh? I left a few months ago, but there was a widely used and well-known page for converting between various costs (compute, memory, engineer time, etc.).
Per TFA:
> More importantly for SourceHut, the proxy will regularly fetch Go packages from their source repository to check for updates – independent of any user requests, such as running go get. These requests take the form of a complete git clone of the source repository, which is the most expensive kind of request for git.sr.ht to service. Additionally, these requests originate from many servers which do not coordinate with each other to reduce their workload. The frequency of these requests can be as high as ~2,500 per hour, often batched with up to a dozen clones at once, and are generally highly redundant: a single git repository can be fetched over 100 times per hour.
The issue isn't user-initiated requests. It's the hundreds of automatic refreshes the proxy then performs over the course of the day and beyond. One person running a git server hosting a Go repo that only they use was hit with 4 GB of traffic over a few hours.
That's not how the proxy works. The proxy automatically refreshes its cache extremely aggressively and independently of user interactions. The actual traffic volume generated by users running go get is a minute fraction of the total traffic.
Edit: Uh, okay, if it's not user traffic, then why isn't "don't background refresh" an option?
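For what it's worth, from the hosting side the only lever available right now seems to be throttling the mirror's traffic yourself. A minimal sketch of middleware a small git host could put in front of its git HTTP backend — it assumes the mirror's requests can be identified by a User-Agent containing "GoModuleMirror" (verify against your own access logs), and gitBackend is a placeholder:

```go
package main

import (
	"log"
	"net/http"
	"strings"

	"golang.org/x/time/rate"
)

// Roughly one mirror-driven request a minute with a small burst: plenty
// for genuine cache fills, nowhere near 100 clones of one repo per hour.
var mirrorLimiter = rate.NewLimiter(rate.Limit(1.0/60.0), 5)

func throttleModuleMirror(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.Contains(r.UserAgent(), "GoModuleMirror") && !mirrorLimiter.Allow() {
			http.Error(w, "slow down", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// gitBackend is a placeholder for the real git smart-HTTP handler.
	gitBackend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("git upload-pack response would go here\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", throttleModuleMirror(gitBackend)))
}
```

A single shared limiter is deliberate here: the point is to cap the mirror's aggregate clone traffic, not to rate-limit per client.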