I didn't realize how many offline Wikipedia readers are out there in the wild until I was almost done with a working version of pyoffwiki. Since I was so close, I decided to take the (usually bad) ostrich strategy of not looking at anything else until I was done. This is good because I actually got this thing released, and people are using it... with all its issues. But bad, because I knew that people had already thought about (and solved) the same problems I was having, and come up with good solutions. Overall, since it was only a day or two, I think it was justified, and now I can go back and revisit the problems.
The main problem is indexing. It can't be a huge file like the one generated with Xapian in Thanassis Tsiodras' solution. It also shouldn't be like my solution, which allows only exact search (using cdb) or browsing (a sorted title list) starting from a given string. It should be something resembling Patrick's solution. Or even better... a compressed suffix tree, that is, if it's small and fast enough. Time for some more research. Any other ideas?
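To make the limitation concrete, here is a minimal sketch of browsing a sorted title list from a given string. It is not the actual pyoffwiki code: the function names and the sample titles are made up for illustration, and an in-memory Python list stands in for the on-disk title list.

```python
import bisect

def browse_from(sorted_titles, start_string, count=10):
    """Return up to `count` titles, starting at the first title >= start_string."""
    start = bisect.bisect_left(sorted_titles, start_string)
    return sorted_titles[start:start + count]

def prefix_matches(sorted_titles, prefix):
    """Return only the titles that begin with `prefix`."""
    start = bisect.bisect_left(sorted_titles, prefix)
    # "\uffff" sorts after any character expected in a title, so this
    # finds the end of the block of titles sharing the prefix.
    end = bisect.bisect_left(sorted_titles, prefix + "\uffff")
    return sorted_titles[start:end]

# Hypothetical sample data, not taken from a real dump.
titles = sorted(["Ostrich", "Ostrava", "Oslo", "Python", "Xapian"])

print(browse_from(titles, "Ost", 3))
print(prefix_matches(titles, "Ost"))
```

Both lookups are binary searches, so they stay fast on a slow device, but they only match from the beginning of a title. A query for a word in the middle of a title finds nothing, which is the gap a full-text index or a suffix tree would fill.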
There is one thing, though, that sets this project apart from the others... and it has been a design goal for me from the beginning. I wanted to minimize what people have to do on their own machines and to minimize handling of the Wiki database dump data. The first gets rid of the bad experience of spending 8 hours building an index and finding you did it wrong. The second makes Wikipedia the unique source of the data... I don't want to distribute 4GB files, torrent or otherwise... let Wikipedia cover the bandwidth costs. It also lets the user feel safer... that the data is actually coming from Wikipedia, and not some random person. (Some people are neurotic like that.) I know... the content of Wikipedia is written by... blah blah blah. I still trust it more than many other sources. Any comments on this?
And lastly, to clarify some questions people have had. The full-size English Wikipedia is working on the iRex iLiad, but you have to put everything on an ext2 Linux partition (for now). The German Wikipedia and English Wiktionary are also working. Pyoffwiki does NOT support any images right now... they are just way too large (about 400GB as of October, 2007). Maybe we should download and resize (down) the images for the top (10,000) articles? And it does work on Linux, but I don't see an audience for it... there are other... better offline viewers that don't make the compromises that need to be made on slow and memory-limited (RAM and disk) devices.
BTW: Ostriches hiding their head in the ground is a myth. Don't believe me? Try to find a good photo... not this one


3 comments:
Hi, you may also be interested in the Python-Qt program
https://launchpad.net/wikipediadumpreader
Hi,
> Pyoffwiki does NOT support any images right now... they are just way too large (about 400GB as of October, 2007). Maybe we should download and resize (down) the images for the top (10,000) articles?
We did exactly that for a wikipedia-iphone port to the OLPC -- Wikibrowse. We chose 3000 images using a ranking system, where the weights are popularity of the page that they're on, and whether they're in the first section of the page. (So, images from the first section of many pages are preferred to images from all sections of fewer pages.)
After making the image list, we worked out what image size to use based on the largest size an image is used at on a page, then resized the image to that size and used "convert -quality 20" -- this gets e.g. a 50KB JPEG down to 2KB without an unacceptable loss of quality.
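A ranking along these lines could be sketched roughly as follows. This is not the Wikibrowse code: the bonus value, the tuple layout, and the sample data are all hypothetical, and page views stand in for whatever popularity measure was actually used.

```python
from collections import defaultdict

# Hypothetical weight: how much more a first-section image counts.
FIRST_SECTION_BONUS = 5.0

def rank_images(usages, limit):
    """Rank images by the summed, weighted popularity of the pages using them.

    usages: iterable of (image_name, page_views, in_first_section) tuples,
    one tuple per appearance of an image on a page.
    """
    scores = defaultdict(float)
    for image, page_views, in_first_section in usages:
        weight = FIRST_SECTION_BONUS if in_first_section else 1.0
        scores[image] += page_views * weight
    # Highest score first; ties are broken by name so the order is stable.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:limit]

# Made-up sample data for illustration only.
usages = [
    ("ostrich.jpg", 1000, True),
    ("map.png", 3000, False),
    ("egg.jpg", 1000, False),
    ("map.png", 500, False),
]

print(rank_images(usages, 2))
```

With a bonus like this, an image in the first section of a moderately popular page can outrank one buried further down a more popular page.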
Anyway, good luck. :-)
- Chris.
Hi Amir
Thanks for your efforts. Putting Wikipedia on the iLiad epitomizes the digital/open revolution for me!
As I write, the database that corresponds to your 20080316 index is offline for maintenance, and the articles have grown from 3.5GB to 5GB since your original release. If you have enough time on your hands, releasing the code for index generation, or even updating the index on the project site, would be widely appreciated, I'm sure. Releasing the index generation code may enable non-English speakers to generate indexes as required for their native tongues too...
My trivial Perl scripting skills fall short on this one I'm afraid :-(
Thanks again,
Dave