From a systems design perspective, $3,000 per book makes this approach completely unscalable compared to web scraping. It's like choosing between an O(n) and O(n²) algorithm: legally compliant data acquisition has fundamentally different scaling characteristics than the 'move fast and break things' approach most labs took initially.


I don't know if anyone has actually read the article or the ruling, but this is about pirating books.

Anthropic went back and bought->scanned->destroyed physical copies of them afterward... but they pirated them first, and that's what this settlement is about.

The judge also said:

> “The training use was a fair use,” he wrote. “The technology at issue was among the most transformative many of us will see in our lifetimes.”

So you don't need to pay $3,000 per book you train on unless you pirate them.


I agree. This is very gray, IMO. E.g., books in India have cheap EEE editions compared to the ones in the US/Europe, so a lab could pre-process the data in India & then compile it in the US. Does that save them from piracy rules & reduce cost as well?


I mean, relative to the cost of pre-training, books are going to be cheap even if you buy them in the US (as demonstrated by the fact that Anthropic bought them afterward)

For post-training, other data sources (like human feedback and/or examples) are way more expensive than books


Isn't a flat price per book quite plainly O(n)? If not, what's n?


It's more a large difference in constant factor, like a galactic algorithm for data trawling.
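
To make that concrete with made-up numbers: if scraping costs ~$0.01/book in bandwidth and buying costs ~$20/book, then cost_scrape(n) = 0.01n and cost_buy(n) = 20n. Both are O(n); the entire ~2000x gap lives in the constant factor.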


Agreed. Waiting for Qwen here as well.


I love these 'wait, what!' moments in biology. Thanks for sharing this - definitely going to be my fun fact for the week!


This is genuinely exciting.


Please don’t post chatgpt output


Apparently the document was briefly public.


When a Google Doc is briefly made public, services like LinkedIn can access its metadata — such as title and thumbnail — through standard Open Graph tags. LinkedIn caches this metadata for about a week. Even if the doc is later made private or restricted to a specific Google Workspace, LinkedIn will still show the cached preview. It doesn’t mean LinkedIn has special access or that Google is leaking anything; it’s just a consequence of how caching and metadata scraping work. This behavior is expected and not considered a security issue.
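
To illustrate, this is roughly all a link-preview crawler does (a minimal sketch; the doc ID and the bot User-Agent below are made up, and LinkedIn's real crawler surely differs):

    import requests
    from bs4 import BeautifulSoup

    # Fetch a (momentarily) public page the way a preview bot would.
    url = "https://docs.google.com/document/d/EXAMPLE_DOC_ID"
    resp = requests.get(url, headers={"User-Agent": "ExampleBot/1.0"}, timeout=10)

    # Keep only the Open Graph <meta> tags (og:title, og:image, ...).
    soup = BeautifulSoup(resp.text, "html.parser")
    og = {t["property"]: t.get("content", "")
          for t in soup.find_all("meta", property=True)
          if t["property"].startswith("og:")}
    print(og)  # this title/thumbnail pair is all a preview cache keeps

Once that dict is cached, flipping the doc back to private doesn't reach into anyone else's cache.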


The power efficiency is fascinating - modern phones are basically ARM servers optimized for battery life. A Pixel 5 probably draws <5W under load vs 50-100W for a typical x86 server. For a personal blog, that's 400-800 kWh/year in savings. The environmental impact of reusing vs recycling electronics is underappreciated.
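
Back-of-envelope check on those numbers (assuming both machines run 24/7 at the stated draws, which is generous to the server):

    # kWh saved per year = (server watts - phone watts) * hours / 1000
    hours = 24 * 365                          # 8760 h/year
    low  = (50 - 5)  * hours / 1000           # ~394 kWh
    high = (100 - 5) * hours / 1000           # ~832 kWh
    print(f"{low:.0f}-{high:.0f} kWh/year")   # 394-832 kWh/year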


For a static site you can do a lot better by dumping it on S3 or GitHub Pages. Your site uses 0W while not being used, since the server was already running anyway and consumes no extra resources while not being requested. But yeah, an x86 server at home for a static site is awfully inefficient.


Depends. If you reuse otherwise-wasted electronics, it's efficient in that you avoid the resources and energy of building new hardware, plus the energy cost and pollution of recycling. A big-picture analysis of reusing old hardware would be very interesting.


>Your site uses 0W while not being used since the server was already running

You are paying for it to be available (or, in GitHub's case, Microsoft is, as an incentive to use their platform).


If you put it on AWS S3 (not subsidised), and your website is 1GB, which would be huge for a static blog, it'll cost you $0.27 per year to store / have available. The price is so incredibly small that numerous companies offer it as a completely free service.
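
(For reference, that's the standard S3 storage rate of roughly $0.023/GB-month: 1 GB × $0.023 × 12 ≈ $0.28/year, before request and egress charges.)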


I use Nearly Free Speech to host and pay $20 every few years. Is there a freer host?



Aren't CF sunsetting Pages in favour of Workers? I would be hesitant to rely on something that may be shut down at any time.


I've just asked the Workers & Pages team (I work at Cloudflare), and that's not true.

"If you’re wondering “What about Pages?” — rest assured, Pages will remain fully supported" https://blog.cloudflare.com/sv-se/builder-day-2024-announcem...


That's the corporate promise for now, but there is no Ulysses pact guaranteeing they'd support it forever if the financials stop lining up. The second-best guarantee is a strong self-interest incentive, but would that still hold in 20 or 30 years? They've mentioned that they need scale and so offer generous free tiers as a proving ground, but would that incentive stay the same after a few decades?


You do realize you can say that about literally every service in the world, right?


No no no, Pages will not be "shut down".

Pages is gradually being unified into the Workers platform. For new projects we suggest just starting with Workers as it is strictly more powerful. But eventually existing Pages projects will be migrated to Workers automatically -- either that or we will just keep supporting Pages forever.

There are an enormous number of web sites hosted on Pages, it would be insane for us to turn them off.


What would happen to the X.pages.dev subdomains when they get auto-migrated? Do they get switched to X.workers.dev silently? My main concern with this sunset is link rot for those who didn't use their own domain.

Another concern is whether you'd still be able to get unique .pages.dev subdomains per project; it seems that Workers restricts each account to one subdomain across all projects. When Pages gets sunset, does that mean you'll no longer be able to make new unique pages.dev subdomains?

Also, the killer feature for many is the ability to just upload a zip hassle-free, both for production and for preview branches, with the preview branches potentially serving as an extra subdomain-level namespace. Would Workers still support that no-fuss workflow?


Sorry but I think there's still a basic misunderstanding here.

Pages is not going away. It is not "sunsetting".

What is happening is, the implementation is changing to be more closely integrated with Workers.

At present most Pages features are available directly on Workers, though not quite all; we're working on it. Hence, we're suggesting people use Workers for new projects, but we're not auto-migrating people yet. Once we're feature-complete we'll auto-migrate people to the new implementation.

But the "Pages" brand will continue to exist, as a more-integrated part of the Workers platform. pages.dev will not go away. We will not break anyone's sites. Everything you can do with Pages today should be just as easy if not easier on the new implementation.


Perhaps it would be worth a blog post to elaborate exactly what the migration entails and which parts of the UX will and will not change.

Otherwise the lack of information only results in apprehension when it could’ve been an opportunity to engender enthusiasm.

The tech demographic is by-and-large allergic to anything that reeks of “we’re gonna be performing a MANDATORY UPGRADE to our service that is gonna be SUPERIOR ACROSS THE BOARD, stay tuned!” while lacking concrete details on what will change.


I mean, they are sunsetting Pages in favour of Workers, but it seems that static pages even on Workers get unlimited bandwidth and unlimited pages, so there should be no real difference. And I trust Cloudflare enough to believe they won't really remove these CF Pages sites.


Do you have some link at hand?


Self-commenting. Aha, I entirely missed this: https://news.ycombinator.com/item?id=44853934


I've had a NextCloud server and an IRC bouncer running on Oracle Cloud with static IP for two years now, free of charge.

More than sufficient to host a site, even a dynamic one.


GitHub and GitLab offer free static site hosting.


50-100W for work equivalent to what a 2020 phone can do would have been the case with CPUs from at least a decade ago. I should hope one doesn't burn ~75W to host a few static files when they can also be served from a Pi, a phone, or a laptop that draws <20W idle.

That's not to say it's not a good idea to make use of the super-efficient "Pi" you already have at home in the form of (probably several) old smartphones! Just that you'd not use a gaming desktop that can't idle below 50W for the same purpose.


Of course it depends on what you consider typical, but x86 can do pretty low-power stuff too; N100 systems can idle at <10W and draw 20-30W at full load.


> 400-800 kWh/year savings

The average all-sector U.S. price per kWh is 13.20 cents (source: https://www.eia.gov/electricity/monthly/epm_table_grapher.ph...). Even at the high end that’s a savings of $105.60/year, or $8.80/month.

The U.S. poverty line for 2025 is $15,650 for a single person (source: https://aspe.hhs.gov/sites/default/files/documents/dd73d4f00...). $105.60 is less than one percent of that.

Sure, energy efficiency is great and I would rather have $105.60 than not have it, but it doesn’t really matter in the grand scheme of things.


It's not just the money but also the CO2 footprint, https://news.ycombinator.com/item?id=45111385


Generating a kWh emits 0.8 pounds of CO2 on average (source: https://www.eia.gov/tools/faqs/faq.php?id=74&t=11); burning a gallon of gasoline in a car emits about 20 pounds of CO2. That means that 800 kWh (= 640 pounds of CO2 emitted) is as bad as using 32 gallons in a year.

It just doesn’t matter much.


Did you account for the manufacturing of the server?


Well, recycling is worth chasing as well in this scheme of things.


Congrats on shipping! For adoption, have you considered integrating with existing workflows? CLI tools, IDE extensions, Slack integrations? The friction to start a pair session often matters more than the session quality itself.


One of the maintainers here. Very good point. Yes, we want (and need) to start doing this; we should open GitHub issues so people can help if they want.


The 'dilution effect' is real - even with plenty of context space left, agents start losing track of their original goals around the 50k-token mark. It's not just about fitting information in; it's about maintaining coherent reasoning chains. Single agents with good prompt engineering often outperform elaborate multi-agent orchestrations.


The timing mismatch is crucial - data centers can be built in 12-18 months, but new power generation takes 5-10 years minimum. We're essentially trying to scale AI demand faster than energy infrastructure can physically respond. This creates interesting arbitrage opportunities in power-rich but compute-poor regions.


GPT-4 at $24.7 per million tokens vs Mixtral at $0.24 - that's a 100x cost difference! Even if routing gets it wrong 20% of the time, the economics still work. But the real question is how you measure 'performance' - user satisfaction doesn't always correlate with technical metrics.
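
To put numbers on the "wrong 20% of the time" intuition (assuming misroutes get retried on GPT-4, and ignoring differing input/output rates): expected cost ≈ 0.8 × $0.24 + 0.2 × ($0.24 + $24.7) ≈ $5.18 per million tokens, still roughly 5x cheaper than sending everything to GPT-4.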


It's trivial to get a better score than GPT-4 at 1% of the cost by using my proprietary routing algorithm that routes all requests to Gemini 2.5 Flash. It's called GASP (Gemini Always, Save Pennies)


Does anyone working in an individual capacity actually end up paying for Gemini (Flash or Pro)? Or does Google boil you like a frog and you end up subscribing?


If I actually had time to work on my hobby projects, Gemini Pro would be the first thing I'd spend money on. As is, it's amazing how much progress you can squeeze out of those 5 chats every 24h; I can get a couple hours of before-times hacking done in 15 minutes, which is incidentally when the free usage gets throttled and my free time runs out.


I've used Gemini in a lot of personal projects. At this point I've probably made tens of thousands of requests, sometimes exceeding 1k per week. So far, I haven't had to pay a dime!


How come you don't need to pay? Do you get it for free somehow?


There's a free tier for the API.


"When you use Unpaid Services, including, for example, Google AI Studio and the unpaid quota on Gemini API, Google uses the content you submit to the Services and any generated responses to provide, improve, and develop Google products and services and machine learning technologies, including Google's enterprise features, products, and services, consistent with our Privacy Policy.

To help with quality and improve our products, human reviewers may read, annotate, and process your API input and output. Google takes steps to protect your privacy as part of this process. This includes disconnecting this data from your Google Account, API key, and Cloud project before reviewers see or annotate it. Do not submit sensitive, confidential, or personal information to the Unpaid Services."

Reference: https://ai.google.dev/gemini-api/terms


You get 1500 prompts on AI Studio across a few Gemini Flash models. I think I saw 250 or 500 for 2.5. It's basically free and beats the consumer rate limits of the big apps (Claude, ChatGPT, Gemini, Meta). I wonder when they'll cut this off.


I've paid a few dollars a month for my API usage for about 6 months.


PPT (price-per-token) is insufficient to compute cost. You will also need to know an average tokens-per-interaction (TPI). They multiply to give you a cost estimate. A .01x PPT is wiped out by 100x TPI.
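
A minimal sketch of that multiplication (all numbers invented for illustration):

    # cost per interaction = price-per-token (PPT) * tokens-per-interaction (TPI)
    terse  = {"ppt": 24.7e-6,  "tpi": 1_000}    # pricey model, short answers
    chatty = {"ppt": 0.247e-6, "tpi": 100_000}  # 0.01x PPT, 100x TPI ("thinking" tokens)

    for name, m in (("terse", terse), ("chatty", chatty)):
        print(f"{name}: ${m['ppt'] * m['tpi']:.4f}")
    # terse:  $0.0247
    # chatty: $0.0247  -> the 0.01x price advantage is exactly wiped out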


Are you saying that some models will take 100x more tokens than others (models in the same ballpark) for the same task? Is the 100x a real measured metric or just a random number to illustrate a point?


With thinking models, yes, 100x is not just possible but probable. You get charged for the intermediate thinking tokens even if you don't see them (which is the case for Grok, for example). And even if you do see them, they won't necessarily add value.


> With thinking models, yes 100x is not just possible, but probable

So the answer is no then, because I don't put reasoning and non-reasoning models in the same ballpark when it comes to token usage. You can just turn off reasoning.


The GPT-5 models use ~10x more tokens depending on the reasoning settings.


number of complaints / million tokens?


> How you measure 'performance'

I heard the best way is through valuations


> GPT-4 at $24.7 per million tokens

While technically true, why would you want to use it when OpenAI itself provides a bunch of models that are many times cheaper and better?


RouterBench is from March 2024.

