

Saturday, September 15, 2012

Multiharvester - Free, general purpose content scraper
Sep 15th 2012, 21:25

Hey there all, this is my second BHW release. Multiharvester is a general purpose scraping framework (similar to Scrapebox, but centered more around pure content scraping than around posting itself, more on this later). I'm releasing it for free here and would love to keep it free while actively working on the project for as long as possible. A fair warning before we go on: I've only been at it for about 2 weeks, so the project is very, very (very) young. Crashes, freezes and unoptimized performance are to be expected... So this is by no means a Scrapebox replacement (at least not yet), but if you'd like to try it out, be my guest.

Requirements:
- Multiharvester should run on any OS, since it's written in Clojure, which compiles to JVM bytecode.
- Because the JVM is the host platform, a Java install is required (the web installer I'll provide shortly should prompt you for one, but it's best to just download the Java runtime yourself; googling "Get Java" will get you started).
- The application is resource hungry for now (I didn't really have time to optimize it yet, I'll fix that in the next release), so a somewhat okayish machine is required. My 6 year old MacBook seemed to do fine; your mileage may vary.

Current problems:
- The main problem right now is me not having a proper internet connection lol. I'm currently hunting for a flat here in Brussels (if some BHWers from Brussels want to meet up and go for a pint, I'm game btw), so working out of coffee places and hotel lobbies is really not optimal. This should be fixed by next week at the latest, assuming I manage to find a flat lol.
- Potential freezes. The Swing thread (the thread in charge of the GUI) is prone to some freezes right now. This is easily fixable; I just need to go through my code a little and fix those little lockups. Should be fixed by the next release.
- Potential memory leaks. A bit like issue nr. 2 really... resource management is a little messed up right now. Will be fixed.
- Poor proxy support. Public and private proxies "should" be working, but I didn't really get a chance to test that much (once again, poor internet connection here...) and I don't have access to any private proxies (if someone could help me out with this I'd really appreciate it).

Installing (the program has been tested on Mac only, but it should run anywhere; let me know how it goes for you Win and *nix users. I'm assuming you already have the Java runtime installed, google it if you haven't):
- Download the launcher from my server: 50.112.250.78 / clojure / core [dot] jnlp (sorry about this, stupid spam filter...)
- Double click to launch.
- It will download the app and ask you to accept the certificate. Do so.
- An optional shortcut can be created (you'll be prompted for that).
- Done. Happy scraping :)

Features (a few rough sketches of how some of these work under the hood follow below):
- Google scraping
- Time based scraping (past day, past week, past month, etc.)
- Language based scraping (all supported Google language options are included)
- Domain specific scraping (all Google domains are included)
- PR scraping
- Dofollow/nofollow filtering
- Other filtering options ("Is alive?" checker, duplicate removal, domains only, string presence, string presence on site, etc.)
- Exports to .txt/.csv
- 20 scraping threads are supported (I've only been using two at most, since I don't have any proxies available, but that was working fine)
- 150 misc operations threads (for checking PR, dofollow and so on). My old Mac seemed to have no trouble pushing 150 threads.
- Other little features
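To give an idea of what the time/language/domain options boil down to, here's a minimal Clojure sketch of building a Google query URL: tbs=qdr:d/w/m selects past day/week/month, hl sets the language, and the country domain picks the local index. This is only an illustration of the parameters involved, not Multiharvester's actual code, and the helper name is made up.

(require '[clojure.string :as str])

;; Hypothetical helper, not taken from Multiharvester itself: builds a
;; Google search URL from a keyword plus time/language/domain options.
(defn google-url [{:keys [domain query lang period start]
                   :or   {domain "google.com" lang "en" start 0}}]
  (let [tbs    ({:day "qdr:d" :week "qdr:w" :month "qdr:m"} period)
        params (merge {"q" query "hl" lang "num" 100 "start" start}
                      (when tbs {"tbs" tbs}))]
    (str "http://www." domain "/search?"
         (str/join "&" (for [[k v] params]
                         (str k "=" (java.net.URLEncoder/encode (str v) "UTF-8")))))))

;; Example:
;; (google-url {:query "clojure scraping" :period :week :lang "de" :domain "google.de"})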
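The filtering options are all fairly simple operations in themselves. Below is a rough sketch, using plain JDK interop and hypothetical helper names (again, not the app's real code), of what duplicate removal, domains only, the "is alive?" check, string presence and nofollow detection can look like, plus the one-line .txt export.

(require '[clojure.string :as str])

;; Hypothetical filter helpers in the spirit of the options listed above.
(defn domain-of [url]
  (.getHost (java.net.URL. url)))

(defn dedupe-by-domain [urls]
  (->> urls (group-by domain-of) vals (map first)))

(defn alive? [url]
  ;; crude "is alive?" check: HEAD request, anything below 400 counts as alive
  (try
    (let [conn (doto (.openConnection (java.net.URL. url))
                 (.setRequestMethod "HEAD")
                 (.setConnectTimeout 5000)
                 (.setReadTimeout 5000))]
      (< (.getResponseCode conn) 400))
    (catch Exception _ false)))

(defn contains-string? [url needle]
  ;; "string presence on site": fetch the page and look for the needle
  (try (.contains ^String (slurp url) needle)
       (catch Exception _ false)))

(defn nofollow-link? [anchor-html]
  (boolean (re-find #"rel=[\"']?nofollow" anchor-html)))

;; Export to .txt, one URL per line:
;; (spit "results.txt" (str/join "\n" (filter alive? (dedupe-by-domain urls))))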
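As for the thread counts, here's a minimal sketch of what bounded worker pools look like on the JVM using plain java.util.concurrent. This is just to illustrate the idea of the 20/150 split, it's not necessarily how Multiharvester is wired internally.

(import '(java.util.concurrent Executors ExecutorService Callable))

;; Two fixed-size pools roughly matching the counts above:
;; 20 threads for scraping, 150 for the misc checks.
(def ^ExecutorService scrape-pool (Executors/newFixedThreadPool 20))
(def ^ExecutorService misc-pool   (Executors/newFixedThreadPool 150))

(defn submit-fetch [^ExecutorService pool url handler]
  ;; returns a java.util.concurrent.Future
  (.submit pool ^Callable (fn [] (handler (slurp url)))))

;; Example: fetch a batch of URLs on the scraping pool and wait for the results.
;; (def results (doall (map #(submit-fetch scrape-pool % count) urls)))
;; (map #(.get %) results)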
Current goals (rough sketches of a few of these ideas are at the very bottom of this post):
- Short term, I'd like to fix all the current issues first (which shouldn't take all that long) and then start working on a powerful, inbuilt proxy scraper/checker, as I feel that this is an essential feature for any type of scraping.
- Search engine wise, I'll add support for Bing, Yahoo, Yandex, Rambler, Ask, DuckDuckGo as well as Baidu shortly. I've got parts of the code ready, I just need to run further tests.
- Improving performance! Firstly, shrinking and optimizing the bytecode should help (the current filesize is 20 MB, which is a bit too big for my liking; also keep in mind that the JVM is fairly slow when it comes to startup, although once it's running it should be fairly fast, even unoptimized). I'd like to push to around 200-250 scraping threads and around 400 misc threads if possible. That should be doable on the JVM by rewriting the HTTP requests as asynchronous operations (so they become non-blocking). However, this will take me a few releases, and I'd like to stabilize the app with its current threading capabilities first.
- Inclusion of a scripting engine. This is one of the big features. My plan is to design a simple scripting language that lets you add your own scraping resources. For example, if you'd like to scrape Wikipedia pages, you should be able to do so without my involvement. This should be fairly trivial since I'm using Clojure (a Lisp-1 dialect), which is known for its extensive "domain specific language" support. Work on this part of the application will begin as soon as everything else is stable enough.

These are the goals for now. I'll probably set up a website next and include a little tutorial as well as a FAQ, to keep it all a little organized; I'll be adding a LOT of features in the future, so it would be good to have an overview. I'll also need to find some type of monetization method that does not involve the actual user (that means you). I'd just like to keep this free and accessible.

Anyhow, I hope this proves useful to some, and happy scraping of course! I'm always open to suggestions and questions, so don't hesitate to post here; I'll try to reply as soon as possible (as long as my connection is working).

P.S. VS of the main jar
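On the proxy side, routing a request through an HTTP proxy only takes plain JDK classes, and a proxy "checker" is then just a question of whether a known page comes back through it. A minimal sketch below; the helper names are invented, and authenticated private proxies would additionally need a java.net.Authenticator.

(import '(java.net URL Proxy Proxy$Type InetSocketAddress))

;; Sketch only: fetch a URL through an HTTP proxy using plain JDK classes.
(defn fetch-via-proxy [url-str proxy-host proxy-port]
  (let [proxy (Proxy. Proxy$Type/HTTP (InetSocketAddress. proxy-host (int proxy-port)))
        conn  (.openConnection (URL. url-str) proxy)]
    (.setConnectTimeout conn 8000)
    (.setReadTimeout conn 8000)
    (slurp (.getInputStream conn))))

;; A crude proxy checker: does a known page load through the proxy at all?
(defn proxy-alive? [host port]
  (try (boolean (seq (fetch-via-proxy "http://www.google.com" host port)))
       (catch Exception _ false)))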
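To illustrate the "non-blocking requests" point from the performance goal: one way to get asynchronous HTTP on the JVM from Clojure is an async client such as http-kit. The sketch below assumes that library is on the classpath and is only meant to show the shape of the idea, it is not what Multiharvester currently does.

(require '[org.httpkit.client :as http]) ; assumes the http-kit library is available

;; The client fires the request off an event loop and calls the callback when
;; the response arrives, so no worker thread sits blocked waiting on I/O.
(defn fetch-async [url handler]
  (http/get url {:timeout 10000}
            (fn [{:keys [status body error]}]
              (if error
                (handler nil)
                (handler body)))))

;; Example: dispatch many URLs at once without tying up a thread per request.
;; (doseq [u urls] (fetch-async u #(when % (println (count %) "bytes from" u))))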
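And purely for flavour, this is the kind of thing I mean by a scripting layer for custom scraping resources. Everything here (defsource, :url-template, :extract) is invented on the spot to show the shape of the idea; nothing like it exists in the current release.

;; Purely hypothetical sketch of a user-defined scraping resource.
(defmacro defsource
  "Registers a named scraping source: a URL template plus a regex whose first
  group is pulled out of the fetched page."
  [name & {:keys [url-template extract]}]
  `(def ~name {:url-template ~url-template :extract ~extract}))

(defsource wikipedia
  :url-template "https://en.wikipedia.org/wiki/Special:Search?search=%s"
  :extract #"href=\"(/wiki/[^\"]+)\"")

(defn run-source [{:keys [url-template extract]} query]
  (->> (slurp (format url-template (java.net.URLEncoder/encode query "UTF-8")))
       (re-seq extract)
       (map second)))

;; (run-source wikipedia "clojure")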

