04/25/08(Fri)18:55 No.460935
I wrote a script last night that intelligently downloads every wall on /w/. It maintains a database of metadata (which threads it's read and which posts it's seen), so it only downloads thread pages or images it hasn't seen before.
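For anyone wondering what that kind of "have I seen this before" bookkeeping could look like, here is a minimal sketch. It assumes SQLite as the database, and the `posts` table and the `open_db` / `mark_seen` helpers are hypothetical names made up for illustration, not necessarily what the script uses.

```python
import sqlite3

def open_db(path=":memory:"):
    """Open (or create) the metadata database. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " post_id INTEGER PRIMARY KEY,"
        " thread_id INTEGER NOT NULL)"
    )
    return db

def mark_seen(db, thread_id, post_id):
    """Record a post. Return True only the first time this post_id is seen."""
    cur = db.execute(
        "INSERT OR IGNORE INTO posts (post_id, thread_id) VALUES (?, ?)",
        (post_id, thread_id),
    )
    db.commit()
    # rowcount is 1 when a row was inserted, 0 when the insert was ignored.
    return cur.rowcount == 1
```

The caller would only fetch a post's image when `mark_seen` returns True, which is what keeps a second run short.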
When it downloads an image, it hashes it and checks whether that hash already exists in the database. If it does, it links the post's image entry to the pre-existing image and discards the new copy.
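A rough sketch of hash-based deduplication along those lines. The hash function (SHA-256 here), the `images` / `post_images` tables, and the `store_image` helper are all assumptions for the sake of the example; the post doesn't say which hash or schema is actually used.

```python
import hashlib
import sqlite3

def open_db(path=":memory:"):
    """Create the (hypothetical) image tables if they don't exist yet."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS images (
            image_id INTEGER PRIMARY KEY,
            sha256   TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS post_images (
            post_id  INTEGER PRIMARY KEY,
            image_id INTEGER NOT NULL REFERENCES images(image_id)
        );
    """)
    return db

def store_image(db, post_id, data):
    """Link post_id to an image row. Return (image_id, is_new)."""
    digest = hashlib.sha256(data).hexdigest()
    row = db.execute(
        "SELECT image_id FROM images WHERE sha256 = ?", (digest,)
    ).fetchone()
    if row is None:
        # First time this content is seen: register a new image.
        cur = db.execute("INSERT INTO images (sha256) VALUES (?)", (digest,))
        image_id, is_new = cur.lastrowid, True
    else:
        # Duplicate content: reuse the existing image row.
        image_id, is_new = row[0], False
    db.execute(
        "INSERT OR REPLACE INTO post_images (post_id, image_id) VALUES (?, ?)",
        (post_id, image_id),
    )
    db.commit()
    return image_id, is_new
```

When `is_new` comes back False, the freshly downloaded bytes can be thrown away, since the post already points at the stored copy. Note this only catches byte-identical files, not resized or re-encoded versions.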
The initial dump of /w/ last night took about half an hour, downloading about 1200 images (taking up 1GB of space). I ran the script again this morning and it took about 5 minutes to fetch all the new stuff. It's kind of slow because it fetches everything sequentially, one at a time. Not the fastest for downloading, but it's nice on the server and easier to deal with code-wise.
The next logical step is to add useful image metadata to the database -- image resolutions, most frequent color in the image, average color in the image, etc. Eventually I'll probably write a tag system too, but it's a bit of a nightmare to even attempt to tag (what will be) thousands of images by myself.
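As a sketch of what computing that metadata might involve: the functions below work on a list of already-decoded `(r, g, b)` pixel tuples, so decoding the image file (e.g. with an imaging library such as Pillow) is left out. The function names and the bucket size `step=32` are assumptions picked for illustration.

```python
from collections import Counter

def average_color(pixels):
    """Per-channel mean of (r, g, b) pixels, rounded down to integers."""
    n = len(pixels)
    if n == 0:
        raise ValueError("no pixels")
    return tuple(sum(p[i] for p in pixels) // n for i in range(3))

def most_frequent_color(pixels, step=32):
    """Most common color after snapping each channel to buckets of `step`.

    Bucketing makes near-identical shades count as one color.
    The returned value is the center of the winning bucket.
    """
    if not pixels:
        raise ValueError("no pixels")
    counts = Counter(
        tuple((c // step) * step + step // 2 for c in p) for p in pixels
    )
    return counts.most_common(1)[0][0]

def aspect_ratio(width, height):
    """Width divided by height, rounded to two decimals."""
    return round(width / height, 2)
```

Storing these values per image is what would let a search work on resolution and color without anyone tagging images by hand.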
Ultimately, I might hook this archiver up to some fat pipes (if I find a set of pipes fast enough) and write a web interface that's actually intuitive and easy to use. So you could search for red wallpapers with a 1.6 aspect ratio of minimum size 1280x800 and it would be able to pull those up, without any human help.
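A search like that could plausibly be a single query once the metadata is in the database. This sketch assumes a hypothetical `images` table with width, height, and average color columns, and it assumes "red" means the average red channel is more than twice the green and blue channels; both the schema and that threshold are invented for the example.

```python
import sqlite3

def open_db():
    """In-memory database with a hypothetical image metadata table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE images ("
        " name TEXT, width INTEGER, height INTEGER,"
        " avg_r INTEGER, avg_g INTEGER, avg_b INTEGER)"
    )
    return db

def find_red_wallpapers(db, aspect=1.6, min_w=1280, min_h=800, tol=0.01):
    """Names of reddish images with the given aspect ratio and minimum size."""
    sql = """
        SELECT name FROM images
        WHERE width >= ? AND height >= ?
          AND ABS(CAST(width AS REAL) / height - ?) < ?
          AND avg_r > 2 * avg_g AND avg_r > 2 * avg_b
        ORDER BY name
    """
    return [row[0] for row in db.execute(sql, (min_w, min_h, aspect, tol))]
```

The aspect ratio is compared with a small tolerance rather than for exact equality, since width divided by height is a floating-point value.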
If I ever find some fat pipes, you guys will be the first to know :(