I tried 80legs a couple of weeks ago when it was still in beta. It works well if you're willing to write Java code to process HTTP responses (see http://80legs.pbworks.com/80Apps). You can issue new requests in your processing code and write processed data to files which you can download once the crawl finishes. Even without Java, you can perform basic keyword/regex matching on crawled data literally by filling out a web form. And they have an API of course. Very impressive.
However, if you want to run crawling on your own infrastructure, I recommend Scrapy (http://scrapy.org/), a Python crawling framework introduced on HN last year. Scrapy solves some of the more time-consuming problems involved in writing a crawler from scratch (multiple simultaneous requests, pipelined processing, raw caching, duplicate URL filtering) and comes with nifty development and administration tools. More importantly, it has an active and helpful set of core developers and good documentation. I am comfortable with both Python and Java, but I chose Scrapy over 80legs because I can crawl for free on the machines I already have, and I can afford to spend more time crawling from a single IP compared to 80legs, which will let me crawl much faster but isn't free. Also, with Scrapy my bot can be 'naughty' - 80legs jobs obey robots.txt and limit the crawling rate per domain.
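For a sense of how little code a basic crawl takes, here is roughly what a minimal Scrapy spider looks like. This is only a sketch: it uses current Scrapy conventions (which may differ from the version available when this was written), and the spider name, start URL, and settings values are placeholders.

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["http://example.com/"]

        # Politeness and caching knobs mentioned above; values are illustrative.
        custom_settings = {
            "DOWNLOAD_DELAY": 1.0,       # rate-limit requests per domain
            "HTTPCACHE_ENABLED": True,   # raw caching of responses
            "ROBOTSTXT_OBEY": True,      # set False if you want to be 'naughty'
        }

        def parse(self, response):
            # Extract something simple, then follow links; duplicate URLs
            # are filtered out by Scrapy's scheduler automatically.
            yield {"url": response.url, "title": response.css("title::text").get()}
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Run it with "scrapy runspider spider.py -o out.json" and you get the simultaneous requests, caching, and duplicate filtering described above without writing any of that plumbing yourself.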
When my crawling needs outgrow my infrastructure, I am going to look at 80legs again.
The crawl runs on a super-heterogeneous network of computers from around the world. The JVM sandbox is the only thing we can count on. Java is the only language that has something like this.
Yes, Java isn't as sexy anymore. But running your code on 50,000 computers sounds sexier to me than a REST API.
Wow - parselets.com is amazing. Thanks for sharing.
I think 80legs can do some level of parsing now, but Parsely seems like a great way to describe the analysis for targeted crawls on predictable data. If it supported the combination you've described, it would really open up some cool possibilities.
We are looking at storing data, doing pre-done crawls, offering data feeds, etc. It's something on the horizon for us.
Hopefully guys that write parselets will be interested in becoming developers for the upcoming 80legs App Store. They'll be able to sell their parselets and earn 100% recurring revenue from 80legs users using their parselets.
That reminds me of my April Fools' joke about MassiveClouding. I imagined a company buying CPU hours from regular desktop computer users and selling them to anyone interested in that kind of CPU power. Of course, any computer software could run on MassiveClouding without any changes. Just my imagination :) I don't know about the technology behind 80legs, but these 50,000 computers might be a botnet.
"Any individual can become involved with SETI research by downloading the Berkeley Open Infrastructure for Network Computing (BOINC) software program, attaching to the SETI@home project, and allowing the program to run as a background process that uses idle computer power."
Yes, it's something like that. But there is another basic difference: SETI@home and other folding projects have a centralized infrastructure where the data is collected and combined, and all the software that runs on the client side is written for a specific mission, so every program has to be rewritten for those platforms. My MassiveClouding dream has some way of dissecting any software into parts and running them on the clients. That's why it's a dream.
There may be a market for this. I know several people who run various types of folding just to be able to brag about the performance they get, and love to show off their stats. Getting paid for it would just be icing on the cake.
Of course, the pricing structure would be difficult, as would setting up the infrastructure for it. My (uninformed, by-the-seat-of-my-pants) best guess would be that repurposing the current folding model of jobs would be the best way to do it: accept jobs, send them out to client computers, and pay clients based on how many they complete. The problems arise from sandboxing the jobs to make sure they can't steal client info, and from making it relatively easy to program jobs while still allowing users to do various things to increase their performance (like using GPUs and the like).
I am relatively sure it couldn't be done as a startup. The profit margins just aren't there. My back-of-the-envelope math says that each Work Unit, which based on my experience takes ~8 hours to finish, would go for ~$0.80 (assuming that EC2 runs at a comparable speed, which isn't guaranteed). In order to make a decent profit, you'd have to be paying the users <$0.20 per unit, or roughly $0.60 per day, per core. That doesn't seem like a good enough proposition; it will barely cover the increased power consumption.
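Spelling out that back-of-the-envelope math (all inputs are the rough estimates above, not measured values):

    # Back-of-the-envelope check using the figures quoted above.
    hours_per_unit = 8                            # estimated time per Work Unit
    payout_per_unit = 0.20                        # what the client would be paid
    units_per_core_per_day = 24 / hours_per_unit  # = 3 units per day
    daily_payout_per_core = units_per_core_per_day * payout_per_unit
    print(daily_payout_per_core)                  # 0.6 -> roughly $0.60/day/core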
On the other hand, if an existing company branched out into this, the way Amazon did with MTurk, I could see it working. A company that already has a user base and software installed on clients could leverage this as an "Oh yeah, here, we think this is cool, try it."
Sure, computing becomes a commodity and people / businesses will compete on "extras". Compute cycles and storage will be auctioned off in real-time depending on the buyers' need for and sellers' ability to provide data integrity, computational correctness guarantees, cycles/dollar, and a jurisdiction where the computation and data are legal.
You know, I'd try this, but the hard part is that the processing is still on me (and I can't code, so I'm SOL). That, frankly, is the hard part, is it not?
Hi, I work for 80legs and wanted to address this--we will be building an "App Store" over the next couple of months that will allow you to use Apps written by other developers to do some pretty cool stuff. We already have several of these 80Apps (as we call them) available, many of which have been written by the semantic search engine company swingly.com. You can view descriptions for them on the "Create Job" page if you sign up to use our Portal.
You'd be surprised how difficult it is to create an extremely fast, scalable architecture for crawling web sites. Took me a few months, anyway (and would probably take a few more to make it "fill in a web form and go" usable). After that, the processing is pretty straightforward.
You've got a URL, headers, and body content. Just extract what you want and crunch.
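As a toy illustration of that extract-and-crunch step (the function name below just echoes 80legs' processDocument() hook for familiarity; it is not the actual Java signature, and the regex is only an example):

    import re

    def process_document(url, headers, body):
        # body is assumed to be the already-decoded response text
        emails = set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", body))
        return {
            "url": url,
            "content_type": headers.get("Content-Type", ""),
            "emails": sorted(emails),
        }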
While complex processing can be difficult, having a scalable infrastructure is also unbelievably hard. Sure, you can easily crawl from a single IP address, but if you want to do millions of pages in minutes, you would typically need several servers, IP addresses, and much more.
I remember they had an explanation about this, but I don't remember what it was, and it seems to have disappeared: what are these 50,000 computers they're using?
Plura has a Java applet which you can embed into a webpage that gets viewed for a long duration, such as a game or a streaming video.
Affiliates embed an iframe which loads their applet as the user plays their game. The user's CPU goes up a bit, and they can help generate revenue for the game makers.
They're targeting desktop apps. The Java app downloads the pages, so it needs high permissions, which means you get Java's unfriendly default popup asking you for confirmation.
Plura supports desktop and web-based games.
If the game is hosted on a website, like a Flash game, the developer only needs to include 1 line of iframe code. We will soon be releasing a Javascript API for dynamically controlling how this iframe is loaded, giving the developer control over starting, stopping, and controlling CPU usage in Plura. The iframe loads a Java web applet, which runs completely in memory. This applet is forcibly restricted from accessing the user's computer by the sandbox model provided by Sun.
So far as I understand, they have multiple models.
One affiliate model is "Plura for Java Applets", whereas another is "Signed Java Applets".
I imagine there are fewer options for unsigned applets, which are left with the providers that don't require signed code, so the developer has less potential revenue every month; desktop applications and signed Java applets have more providers to choose from.
That said, I agree the Java dialog is ugly and scary ;(
If I were an Affiliate, I'd want to avoid it. It breaks the user experience of your site with something gaudy and jarring, not to mention unbranded and unrelated to the information the user is after.
They're probably using EC2 instances, and then it's a function of how many pages can be scraped per EC2 compute hour, plus a premium on top of that.
The reason they didn't want it to be too expensive (i.e. 5 times as expensive) is that (1) if the idea works, competitors could easily undercut an inefficient pricing model and steal market share, and (2) the game plan is more a game of dependence than of up-front profits, so it makes sense to take very little profit up front to get user traction.
One limitation is the throttling for each domain. If I had a smaller set of pages that I wanted to read frequently, this solution would not work. I understand the need to avoid DOS attacks, but in some cases it would be nice to be able to read a million pages from 3 large domains instead of 1 page each from a million domains.
"Your parseLinks() and processDocument() methods must complete within a total 10 seconds per document processed"
... as a limit on processing leaves room for competition. One advantage of the BYO-Cloud solution is that you can pay for more intensive processing of the crawl.
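On your own infrastructure you could impose (or relax) a similar per-document budget with something as simple as the sketch below. This is purely illustrative and Unix-only; 80legs enforces its 10-second limit on their side, not like this.

    import signal

    class Timeout(Exception):
        pass

    def process_with_budget(func, doc, seconds=10):
        # Abort processing of a single document if it exceeds the time budget.
        def _raise(signum, frame):
            raise Timeout()
        old_handler = signal.signal(signal.SIGALRM, _raise)
        signal.alarm(seconds)
        try:
            return func(doc)
        except Timeout:
            return None          # skip documents that run over budget
        finally:
            signal.alarm(0)
            signal.signal(signal.SIGALRM, old_handler)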
They're inflating the cost difference.
Cloud: $0.10/CPU-hour for "large scale" crawling? There's no reason to use a small, unreserved EC2 instance for large-scale anything. I have a small reserved instance I use for git and bugzilla and so forth, and it's slow as hell. That's why there are bigger ones.
The cost savings are still there with reserved instances if you do the math.
Actually, the cost is not the biggest issue with the cloud. If you're talking about large-scale crawling, AWS will not adequately scale. You can't get enough nodes or enough bandwidth.
My point was just questioning a $4/million figure calculated from a $0.10/hr instance. Why don't you run benchmarks against a reserved high-CPU instance for 100M pages crawled? I was never contending that you cannot offer cost savings over custom crawling on EC2, just questioning the way the numbers on the page were calculated. If you have enough traffic to keep your instances saturated, of course it is more economical to buy dedicated servers, but nobody starts out that way. Why in the world do you think you can't get enough nodes or bandwidth to do web crawling on EC2? I remember seeing Bezos talk about Animoto scaling up to 3500 instances almost instantly for video transcoding, and I find it extremely hard to believe that a task as parallelizable as web crawling could not be done on EC2. It's one thing to say that you can do it cheaper; it's another thing entirely to say that it can't be done on EC2 at all. How did you come to this conclusion?
The $4/million actually is not based on compute time, it's based on data transfer in/out of AWS. If I included compute time, it would be higher.
And it's the bandwidth aspect that makes web crawling not feasible on AWS. Yes, you have a few thousand nodes, but they're all going through a handful of external IPs, which will cause serious performance issues.. the worst case is that you'll get blocked entirely from the sites you're trying to crawl.
In other words, the bandwidth is not parallelizable on the cloud.
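To make the $4/million figure concrete, here is one way a number in that range can fall out of transfer costs alone. The page size, result volume, and per-GB rates below are illustrative assumptions, not the actual inputs used by 80legs.

    # Illustrative transfer-cost math for crawling 1M pages through AWS.
    pages = 1_000_000
    avg_page_kb = 30                           # assumed average page size
    gb_in = pages * avg_page_kb / 1_000_000    # ~30 GB fetched into AWS
    gb_out = 5                                 # assumed requests + exported results
    rate_in_per_gb = 0.10                      # assumed inbound $/GB
    rate_out_per_gb = 0.17                     # assumed outbound $/GB
    cost = gb_in * rate_in_per_gb + gb_out * rate_out_per_gb
    print(round(cost, 2))                      # ~3.85, in the ballpark quoted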
I'm really having trouble with this. My understanding is EC2 provides an internal and external IP for each instance : http://docs.amazonwebservices.com/AmazonEC2/dg/2007-01-19/in... as well as a semi-friendly DNS name. Each of these machines can certainly make its own requests to arbitrary URLs? I don't see how this is any different than a bunch of machines sitting in a data center with a shared, dedicated internet connection? Also, from my rather limited experience of crawling sites the only time I drew negative attention was when I did not throttle my crawler. If you are properly caching to avoid repetitious requests and throttling your requests, how are you going to get blocked and why would it be different in EC2 versus a dedicated hosting center?
So a few points (I assume you're talking about the Elastic IPs):
1. Yes, each instance can have its own IP, but by default, each account is limited to 5 IP addresses.
2. You can increase your limit, but my guess is that it's difficult to do so. You have to put forward a special request and have it approved.
You're right that blocking may not be a big issue, but crawling several different domains quickly will be hard.
Just so you know, we haven't encountered anyone doing large-scale crawling that considers AWS or the cloud in general to be a realistic option. The biggest reason is still the cost.. the outbound transfer rates just don't make sense at scale.
Elastic IPs are about having the same functionality as static IPs; every instance already has an IP, per the previous link I posted. Every time you connect a new network device to any typical network, it gets an IP. I'm not sure how that relates to the scalability of the bandwidth.
You are limited to some number of instances (20, 50?), and yes, you have to fill out a form to get more. The previous example with Animoto shows how far you can go. I would wager that finding the funding for a large # of instances is more problematic than getting the approval.
I don't see why crawling several different domains quickly will be hard? There shouldn't be any difference between a bunch of instances on EC2 and a bunch of machines in a data center, from a technical point of view.
As far as the cost argument goes, of course I agree with you. If you can project a high level of CPU/bandwidth usage for an extended period of time then of course you should buy dedicated servers.
The only argument I was trying to make was that it is completely possible to do crawling on EC2 or any other cloud provider from a technical point of view; the only limitation is cost. The advantage of utility computing, as I see it, is that it offers a cheap way to handle bursty traffic, which you may well run into if your server utilization projections are off. I don't think you should use it as your primary set of servers if you can project some large volume of traffic.
You're right that the data center and the cloud will be very similar, but our assertion at 80legs is that both are very poor choices.
I'm not arguing that it's impossible to do crawling on the cloud. I'm saying it's near-impossible to do it on a large-scale on the cloud. 3500 instances is pretty good, but will still be an order of magnitude slower than what 80legs is capable of.
Now, if you show me someone that has 10,000+ instances on the cloud, I may agree with you!
This seems to be a very neat idea. How about processing of the crawled data, if someone wants to process it and show it to the end user in a comprehensible way? We are looking for a solution which is affordable and provides both.
Are you supposed to implement your own cycle detection?
If not, how deep is the cycle detection that 80legs offers?
--
There are plenty of HoneyTrap[1] OSS projects which will quickly rack up lots of $$$ if the 80legs spider is blacklisted.
For those who don't know, these projects create deeply linked pages, and sometimes create infinite cycles. They are trying to hinder spammers, but may hinder 80legs too.
Because of the way 80legs handles crawls, users don't need to worry about loop detection. There are really two issues here:
1. For each user crawl, we only allow the same URL once, so any simple loops that involve repeated URLs are eliminated by this process, no matter how large the loop.
2. More sophisticated "spider traps" that work with different URLs and domains can have only a limited effect on your crawls. Because of our per-domain rate throttling, the worst these traps can do is add a few cents per day to someone's crawl. A rough sketch of both mechanisms follows below.
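Here is a minimal sketch of those two mechanisms, per-crawl URL deduplication plus a per-domain politeness delay. It is not our actual code, just a Python illustration; the class and parameter names are made up.

    import time
    from urllib.parse import urlparse

    class CrawlScheduler:
        def __init__(self, per_domain_delay=1.0):
            self.seen = set()           # each URL is fetched at most once
            self.next_allowed = {}      # domain -> earliest next fetch time
            self.delay = per_domain_delay

        def should_fetch(self, url):
            if url in self.seen:
                return False            # simple loops end here (issue 1)
            self.seen.add(url)
            return True

        def wait_for_domain(self, url):
            # Per-domain throttling limits spider traps to a trickle (issue 2).
            domain = urlparse(url).netloc
            now = time.time()
            ready = self.next_allowed.get(domain, now)
            if ready > now:
                time.sleep(ready - now)
            self.next_allowed[domain] = time.time() + self.delay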