From the GitHub issue, by a Googler:

> For boring technical reasons, it would be a fair bit of extra work for us to read robots.txt […]

This is coming from one of the biggest, richest, most well-staffed companies on the planet. It’s too much work for them to read a robots.txt file, as the rest of the world (and plenty of one-man teams) does before hammering a server with terabytes of requests.

If this is too much for them then no wonder they won’t implement smarter logic like differential data downloads or traffic synchronization among peer nodes.



Why did sourcehut not take the offer to be added to the refresh exclusion list like the other two small hosting providers did? It seems like that would have resolved this issue last year.


For a number of reasons. For a start, what does disabling the cache refresh imply? Does it come with a degradation of service for Go users? If not, then why is it there at all? And if so, why should we accept a service degradation when the problem is clearly in the proxy's poor engineering and design?

Furthermore, we try to look past the tip of our own nose when it comes to these kinds of problems. We often reject solutions which are offered to SourceHut and SourceHut alone. This isn't the first time this principle has run into problems with the Go team; to this day pkg.go.dev does not work properly with SourceHut instances hosted elsewhere than git.sr.ht, or even GitLab instances like salsa.debian.org, because they hard-code the list of domains rather than looking for better solutions -- even though they were advised of several.

The proxy has caused problems for many service providers, and agreeing to have SourceHut removed from the refresh would not solve the problem for anyone else, and thus would not solve the problem. Some of these providers have been able to get in touch with the Go team and received this offer, but the process is not easily discovered and is poorly defined, and, again, comes with these implied service considerations. In the spirit of the Debian free software guidelines, we don't accept these kinds of solutions:

> The rights attached to the program must not depend on the program's being part of a Debian system. If the program is extracted from Debian and used or distributed without Debian but otherwise within the terms of the program's license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the Debian system.

Yes, being excluded from the refresh would reduce the traffic to our servers, likely with less impact for users. But it is clearly the wrong solution and we don't like wrong solutions. You would not be wrong to characterize this as somewhat ideologically motivated, but I think we've been very reasonable and the Go team has not -- at this point our move is, in my view, justified.


Supposedly the operator of sourcehut has been banned from posting to the Go issue tracker: https://github.com/golang/go/issues/44577#issuecomment-11378...

So, obviously he's supposed to know the even more obscure and annoying method of opting out.


Why should you have to opt in to a no-DDoS list? Why is it not the default?


You are basically arguing that sr.ht is taking a "principled stand" against Google. If that is what they are doing, they should just say so and not pretend there were no other options.

I'm OK with saying "Google should do better!" But the compromise solution from the Go team seems like a reasonable way to solve the immediate issue without harming end users. The author should at least address why they chose the more extreme option.


Or we could not assume that Google's stance is correct and that sr.ht owes an explanation. We should instead ask why Google keeps DoSing upstream servers by default, won't use the standards for sharing network resources properly, and expects all of those servers to do all the work.

EDIT: Moreover, sr.ht doing a workaround only for sr.ht, and lubar doing a workaround only for lubar, etc., is not what Free Software is about. The point is that we're supposed to act as a community, for the betterment of the collective. Individualism is not a solution.


The author addressed this by way of their byline.


IIRC, because that would only help the official SourceHut instance, not other instances.


I wondered that too, but then I wondered whether that's what sourcehut has actually done. I didn't notice any details about how the Go module mirror will be blocked.

Wouldn't the effect on sourcehut users be identical?


No. The Go team offered to add sourcehut to a list that would stop the background refreshes; it would still allow fetches initiated by end users. The change sourcehut is making breaks end users unless they set some environment variables.
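For context, "some environment variables" presumably means something along these lines. This is a guess at the shape, not a quote from the post; GOPRIVATE tells the go tool to skip both the proxy and the checksum database for matching module paths:

    # skip the proxy and checksum db for git.sr.ht modules only (assumed paths)
    go env -w GOPRIVATE=git.sr.ht
    # or bypass the module proxy entirely
    go env -w GOPROXY=direct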

I've not seen any explanation of why the solution offered by the Go team was unacceptable. It's weird that it's completely left out of the blog post here.


They could also just add the site to that list, or better yet, make it opt-in for sites instead of slamming them with shitty workers and shitty defaults.

You know, like be good neighbors and respectful of other people's resources, maybe read robots.txt and not make excuses for why you are writing shitty stateless workers that spam the rest of the community.


I think Google should DDoS no one, not everyone who hasn't opted out.


But it’s not Google DDoSing them; it’s every user downloading packages. Without the proxy it would just be millions of users hammering their servers.

Edit: OK, if it's not user traffic, then why wasn't the "don't background refresh" list an option?


> Without the proxy it would just be millions of users hammering their servers.

Doing shallow clones, which are significantly cheaper.

Google is DDoSing them by their service design. Why a full git clone, why not a shallow one? Why do they need to do a full git clone of a repository up to a hundred times an hour? Nothing needs that refresh frequency.
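For illustration, the difference in plain git terms; these are standard git flags, not anything quoted from the proxy's code, and the repository URL is made up:

    # what the proxy reportedly does: full history, all refs
    git clone https://git.sr.ht/~example/repo
    # shallow clone: only the most recent commit
    git clone --depth=1 https://git.sr.ht/~example/repo
    # blobless clone: full history, file contents fetched on demand
    git clone --filter=blob:none https://git.sr.ht/~example/repo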

The likely answer is that the shared state needed to handle this isn't a trivial addition; it's a lot simpler to build nodes that only maintain their own state. Instead of fetching on one node and sharing the result across the service, just have every node (or small cluster of nodes) do its own thing. You don't need shared state to run the service, so why bother? That's just needless complexity, after all, and all you're costing is bandwidth, right?
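To make the "shared state" point concrete: even within a single node, Go's own golang.org/x/sync/singleflight package shows the shape of the deduplication that's missing. A minimal sketch (the fetch body is hypothetical), with the caveat that coordinating across nodes would need real shared state on top of this:

    package main

    import (
        "fmt"
        "sync"

        "golang.org/x/sync/singleflight"
    )

    var group singleflight.Group

    // fetchRepo stands in for an expensive upstream clone (hypothetical).
    // Concurrent callers asking for the same URL share one upstream fetch.
    func fetchRepo(url string) (string, error) {
        v, err, _ := group.Do(url, func() (interface{}, error) {
            // ... the actual git fetch would happen here, exactly once ...
            return "repo-data-for " + url, nil
        })
        if err != nil {
            return "", err
        }
        return v.(string), nil
    }

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 10; i++ { // ten "workers" ask for the same repo
            wg.Add(1)
            go func() {
                defer wg.Done()
                data, _ := fetchRepo("https://git.sr.ht/~example/repo")
                fmt.Println(len(data))
            }()
        }
        wg.Wait() // one upstream fetch happened, not ten
    }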

That's barely okay laziness when you're interacting with your own stuff and have your own responsibility for scaling and consequences. Google notoriously doesn't let engineers know the cost of what they run, because engineers will over-optimise on the wrong things, but that also teaches them not to pay attention to things like the costs they inflict on other people.

It's unacceptable to act this way when you're accessing third parties. You have a responsibility as a consumer to consume in a sensible, considered manner. Shirking it means your laziness isn't costing you money; it's costing other people who don't have stupidly deep pockets like Google.

This is just another way in which operating at big-tech-money scales blinds you to basic good practice (I say this as someone who has spent over a decade now working for big tech companies...)


> Google notoriously doesn't let engineers know the cost of what they run

Huh? I left a few months ago, but there was a widely used and well-known page for converting between various costs (compute, memory, engineer time, etc.).


Per TFA:

> More importantly for SourceHut, the proxy will regularly fetch Go packages from their source repository to check for updates – independent of any user requests, such as running go get. These requests take the form of a complete git clone of the source repository, which is the most expensive kind of request for git.sr.ht to service. Additionally, these requests originate from many servers which do not coordinate with each other to reduce their workload. The frequency of these requests can be as high as ~2,500 per hour, often batched with up to a dozen clones at once, and are generally highly redundant: a single git repository can be fetched over 100 times per hour.


The issue isn't user-initiated requests. It's the hundreds of automatic refreshes the proxy then performs over the course of the day and beyond. One person running a git server that hosts a Go repo only they use was hit with 4 GB of traffic over the course of a few hours.

That's a DDoS.


That's not how the proxy works. The proxy automatically refreshes its cache extremely aggressively and independently of user interactions. The actual traffic volume generated by users running go get is a minute fraction of the total traffic.


They wouldn't recommend that users get the data directly from them if user traffic were the problem.


sourcehut's recommendations seem absolutely reasonable: (1) obey the robots.txt, (2) do bare clones instead of full clones, (3) maintain a local cache.

I could build a system that did this in a week, without any support from Google, using existing open source tech. It's mind-boggling that Google isn't honoring robots.txt, is requesting full clones, and isn't maintaining a local cache.
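On the robots.txt point specifically, a naive check is a few dozen lines of Go. This is a sketch that only handles the "User-agent: *" group and bare Disallow prefixes, nowhere near Google's production parser:

    package main

    import (
        "bufio"
        "fmt"
        "net/http"
        "strings"
    )

    // disallowedPrefixes fetches robots.txt and returns the Disallow
    // prefixes in the "User-agent: *" group. Deliberately naive: no
    // wildcards, no Allow lines, no per-agent groups.
    func disallowedPrefixes(base string) ([]string, error) {
        resp, err := http.Get(base + "/robots.txt")
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        var prefixes []string
        inStarGroup := false
        scanner := bufio.NewScanner(resp.Body)
        for scanner.Scan() {
            line := strings.TrimSpace(scanner.Text())
            lower := strings.ToLower(line)
            switch {
            case strings.HasPrefix(lower, "user-agent:"):
                inStarGroup = strings.TrimSpace(line[len("user-agent:"):]) == "*"
            case inStarGroup && strings.HasPrefix(lower, "disallow:"):
                if p := strings.TrimSpace(line[len("disallow:"):]); p != "" {
                    prefixes = append(prefixes, p)
                }
            }
        }
        return prefixes, scanner.Err()
    }

    func main() {
        prefixes, err := disallowedPrefixes("https://git.sr.ht")
        if err != nil {
            panic(err)
        }
        for _, p := range prefixes {
            fmt.Println("disallowed prefix:", p)
        }
    }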


Despite the issue, I'm not convinced that Go isn't doing shallow fetches rather than deep clones. Other issues (like Gentoo's issue with the proxy; I don't have a link handy, sadly) point to fetches being done normally, not as full clones.


It's not about what Go allows; it's about what Google's proxy does on its own schedule. If there were a knob sr.ht could use to change this, it would've come up in the two years since this issue was raised with the Go team.


What does a local cache even mean at Google's scale, though? Some of the cache nodes are likely closer to SourceHut's servers than to Google HQ. I guess "local" here would mean that Google pays for the traffic. But then it's not a technical problem, it's a "financial" one.

If you disregard the question of who pays for a moment and only look at what makes sense for the bits, the stateless architecture seems not so bad. Just a pity that in reality somebody else has to foot the bill.


Are you serious? Google Cloud Storage is a service that Google sells to folks using its cloud. If they can't use it for their own project, that would be shocking, no?


They are probably already using something like GCS to store the data at the cache nodes.

I was not talking about how the nodes store data, but about a central cache. Purely architecture-wise, it doesn't make sense to introduce central storage that just mirrors SourceHut (and every other Git host). SourceHut is already that central storage. You would just create a detour.

It's also not an easy problem. If the cache nodes try to synchronize writes to the central cache, you are effectively linearizing the problem. Then you might as well just have the one central cache access Sourcehut etc. directly. But then of course you lose total throughput.

I guess the technically "correct" solution would be to put the cache right in front of the Sourcehut server.


Go's proxy service is already a detour, for the reasons of trust mentioned in the article. They are in a position to return non-authentic modules if necessary (e.g., under a court order). That settles all the architecture arguments about sources-of-record vs. sources-of-truth: the proxy service is a source of truth.

If Google is going to blindly hammer something because they must have their Google scale shared nothing architecture pointed at some unfortunate website, then they should deploy a gitea instance that mirrors sr.ht to Google Cloud Storage, and hammer that.

It's unethical to foist the egress costs onto sr.ht when the solution is so simple.

Some intern could get this going on GCP in their 20% time and then some manager could hook the billing up to the internal account.



Drew has some... strong opinions on some things, but a straight reading of the issue suggests he's being perfectly reasonable here, and it's Google who can't be arsed to implement a caching service correctly - instead, they're subjecting other servers to excessive traffic.

It's about the clearest example of bad engineering justified by "developer velocity": developer time is indeed expensive relative to inefficiency you don't pay for, because you externalize it onto your users. Clearest, because there are fewer parties, each affected in a larger way, so the costs are actually measurable.

I do have a dog in this, in a way, because as one of the paying users of sr.ht, I'm unhappy that Google's indifference is wasting sr.ht budget through bandwidth costs.


> I didn't notice any details about how go module mirror will be blocked.

It says in the post that they'll check the User-Agent for the Go proxy string and return a 429 status code.


That's especially silly, because 429 is a retriable error.


Sure, but with it, the biggest impact will likely be... spam in the logs on Google's side. Short-circuiting a request from a specific user agent to a 429 error is cheap compared to performing a full git clone.
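For scale, the short-circuit described is a handful of lines of Go. A sketch, with the User-Agent substring assumed from the mirror's advertised string ("GoModuleMirror/1.0 (+https://proxy.golang.org/)"):

    package main

    import (
        "net/http"
        "strings"
    )

    // blockGoProxy answers the module mirror with 429 before any
    // expensive git work happens; everyone else passes through.
    func blockGoProxy(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if strings.Contains(r.Header.Get("User-Agent"), "GoModuleMirror") {
                http.Error(w, "too many requests", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("git smart HTTP would be served here\n"))
        })
        http.ListenAndServe(":8080", blockGoProxy(mux))
    }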


I don't have any particular affinity for Google, but they're still a business, and they're already developing the Go language (and the relevant infrastructure) at their own expense. It's not like the Go team at Google has access to the entire Alphabet war chest, as your "biggest, richest, most well-staffed companies on the planet" suggests.


Go has been well funded since its inception. It is authored by some of the biggest names in programming, and they are on staff at Google. This is not a side hobby. I'm not sure why you're suggesting that Go is lacking in resources.


No. It is a much smaller team as far as resources go. Compared to Swift for Apple or Java for Oracle, Go is not a strategic bet for Google. Developing services for Google's platform has absolutely no dependency on Go. Hell, a large number of Google employees spend time disparaging Go. That does not happen with other company-sponsored languages.


Someone on the Go team (rsc, IIRC) commented on how a Google executive came up to him in the cafeteria to congratulate him on a launch. It turned out the executive had confused him with someone on the Dart or Flutter team.


Thanks for this anecdote! This is hilarious but seems very true to me.


I just hope it wasn't Rob Pike.


Found it: Ian Lance Taylor:

https://groups.google.com/g/golang-nuts/c/6dKNSN0M_kg/m/EUzc...

> Now a bit of personal history. The Go project was started, by Rob, Robert, and Ken, as a bottom-up project. I joined the project some 9 months later, on my own initiative, against my manager's preference. There was no mandate or suggestion from Google management or executives that Google should develop a programming language. For many years, including well after the open source release, I doubt any Google executives had more than a vague awareness of the existence of Go (I recall a time when Google's SVP of Engineering saw some of us in the cafeteria and congratulated us on a release; this was surprising since we hadn't released anything recently, and it soon came up that he thought we were working on the Dart language, not the Go language.)


Yes, Google staffs its Go team, but the original comment invokes Google's vast wealth as though its entire market cap were available for the development of Go, which is of course absurd. Google probably spends single-digit millions of dollars on Go annually, and it seems they've determined that supporting Drew's use case would require a nontrivial share of that budget, which they feel could be spent to greater effect elsewhere.

Go is not only a "side project" at Google, but one of its most trivial side projects.


Knowing that "we only have a few million in funding per year" was a valid excuse for generating abusive traffic and refusing to do anything about it, would definitely have changed a few conversations I've had working at startups. Interesting.


Of course, Google doesn't materially benefit from optimizing the module proxy for Drew's use case, and I doubt your startups would have made traffic optimization its top priority either under similar circumstances (which is to say "no ROI from traffic optimization").


Drew's use case?!


"scenario"? Pick your synonym.


They wouldn't write significant parts of their backend in a side project.


This is obviously untrue because we know that Google does write significant portions of its backend in Go and that Google derives ~0% of its revenue from Go (the very definition of a side project). My guess is that you're assuming that a side project for Google is the same as a side project for a lone developer or a small team, which is (pretty obviously, IMHO) untrue.


> that Google derives ~0% of its revenue from Go

AdWords is mainly written in Go. YouTube is mainly written in Go. Just because they have strategic reasons for not directly monetizing Go doesn't make it a side project, any more than any other internal tooling is.

It's core to their ability to pull in revenue now. If they were somehow immediately deprived of access to Go, the company would go under. That's how you know it's not a side project.


> AdWords is mainly written in Go. YouTube is mainly written in Go

Can you source these claims? Last I checked, YouTube was primarily written in Python, and I doubt that's changed dramatically in the intervening years given the size of YouTube. I assume there's some similar thing going on for AdWords.

> Just because they have strategic reasons for not directly monetizing Go doesn't make it a side project more than any other internal tooling.

Agreed, but all internal tooling is a side project pretty much by definition.

> It's core to their ability to pull in revenue now.

No, it's just the thing they implemented some of their systems in. I'm a big Go fan, but they could absolutely ship software in other languages for a marginal increase in operational overhead.

> If they were somehow immediately deprived access to Go, the company would go under. That's how you know it's not a side project.

I don't know what it means to be "deprived access to Go", but this is a pretty absurd definition of "side project", since it applies to just about everything Google does and to a good chunk of the software Google depends on, whether first-party or third-party (Google depends much more strongly on the Linux kernel; that doesn't mean contributing to the kernel is Google's primary business activity). It seems your definition of "side project" hinges on whether a business could back out of a given technology on a literal moment's notice, irrespective of how likely it is that the technology actually becomes unavailable on that timeline, and these unusual semantics are at the root of our disagreement.


Not to mention, it's likely quite an impactful form of marketing / developer relations for them. I think so because, when I talk to people who are starting to learn Go, I usually see a transfer of positive feelings and excitement from Go itself to Google as its creator/backer - one of the clearest examples of the "halo effect" I've seen first-hand.


Do you really imagine some significant number of Google's search, cloud, etc. customers were driven to Google over a competitor because of "good vibes" derived from Go? Google only develops Go because it's a useful internal tool, and I'm pretty sure neither the marketing team nor the executives spend any meeting minutes discussing Go.


Marketing works in mysterious ways.

Yes, I do imagine that people who are really into Go are more likely than average to join or start Go shops, and then pick GCP over competitors because they have to start with something, and being Go people, Google stuff comes first to mind.

Lots of companies across lots of industries spend a lot of money to achieve more or less this fuzzy, delayed-action effect.


> Yes, I do imagine that people who are really into Go are more likely than average to join or start Go shops, and then pick GCP over competitors because they have to start with something, and being Go people, Google stuff comes first to mind.

How many such people do you imagine there are? I'm active in the Go community, and I've been a cloud developer for the better part of a decade. It's never occurred to me to pick GCP over AWS because Google develops Go, nor have I ever heard anyone else espouse this temptation. I certainly can't imagine there are so many people out there for whom this is true that it recoups the cost that Google incurs developing Go.

Rather, I'm nearly certain that Google's value proposition RE Go is that developing and operating Go applications is marginally lower cost than for other languages, but that at Google's scale that "marginally lower cost" still dwarfs the cost of Google's sponsorship of the Go language.


This problem isn't really specific to Google. If some hobby project were DoSing sites, it would get banned. "We don't have the resources to not DoS" is not a valid excuse. The Go team needs to scope their ambitions properly; if they can't make their proxy work safely, they should not have bothered to build it.


But this isn't one of those "we developed fast and did dumb stuff" situations. They put significant effort into doing something dumb.


Surely Google, of all places, has the most tested, battle-hardened robots.txt library in existence, and they have a company-wide public monorepo to boot. There's no excuse for this.


I'm pretty sure parsing robots.txt isn't the challenge. The Go team asserts that there are technical difficulties to this traffic optimization, and I don't have any reason to disbelieve them (they're clearly not dumb people, and I certainly trust them more than Internet randoms when it comes to maintaining the Go module proxy). It's a bummer for Drew, but he isn't Google's top priority right now (it seems wild to me that you think there is "no excuse" for Google not to prioritize niche use cases like Drew's--how do you imagine large organizations choose what to work on?).


It seems like they're getting tons of bandwidth paid for by the war chest, if they don't care about this waste at all.


Well, for boring technical reasons, I guess they will remain blocked.


Is the source code of the service behind proxy.golang.org actually open-source?



They're referring to the actual service domain, not the public static domain.

https://git.sr.ht/robots.txt



Presumably the robots.txt entry they are talking about is https://git.sr.ht/robots.txt



