Eccentric Flower:200907/A Little Meta
A Little Meta
People who get the RSS notifications saw a notice about a page which wasn't a journal entry - a page talking about why I left LiveJournal and explaining where the various LJ icons came from. This is the "about" page for the former-LJ section of the archives, and I flagged it for RSS by mistake.
The LJ archives are now in the wiki, but they are not yet quite suitable for viewing, which is why I have deliberately not yet put up any monthly TOC pages for them or any links to them. But getting them in is a big job off the to-do list for this site, and I'm happy to be done with it.
How big?
There were nearly ten megabytes of LJ entries when exported into individual XML files. I also had to download the same 10 MB of entries again in an HTML format, because LJArchive is stupid and omits certain vital information from the XML format (yet the HTML format is too horrible to parse for anything but that one bit of indispensable info).
I then had to write a script which:
1. Read all the XML files to generate a master table of contents so the script could later make "previous" and "next" links between entries automatically;
2. Read all the HTML files to get the one piece of key information that was only to be had in those files;
3. Parsed out all the information from the XML files for each entry, and comments on each entry;
4. Reformatted it as much as possible into the MediaWiki format (the more one does on an automated basis, the less the entries will each have to be tuned by hand later);
5. Generated individual monthly TOC files for use later (making/adapting TOCs has been one of the biggest time drains so far);
6. Assembled the data into a different XML format, the one the wiki wants for imports.
Oh, yes, and
7. Then painstakingly split that file by hand, because the import chokes on a 10 MB file, and import it in chunks.
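For the curious, here is a minimal sketch of the idea behind step 1: once there is a master list of every entry, the "previous" and "next" links fall out of a sort. This is an illustration in Python, not the actual script; the function name `link_neighbors` and the (date, title) tuple layout are assumptions made up for the example.

```python
def link_neighbors(entries):
    """Map each entry title to its (previous, next) neighbors.

    `entries` is a list of (date_string, title) tuples in any order.
    Dates are assumed to be ISO-style strings (YYYY-MM-DD), so that
    plain string sorting is also chronological sorting.
    """
    ordered = sorted(entries)
    links = {}
    for i, (_, title) in enumerate(ordered):
        # The first entry has no "previous"; the last has no "next".
        prev_title = ordered[i - 1][1] if i > 0 else None
        next_title = ordered[i + 1][1] if i < len(ordered) - 1 else None
        links[title] = (prev_title, next_title)
    return links


if __name__ == "__main__":
    toc = [
        ("2009-07-24", "A Little Meta"),
        ("2009-07-01", "Hypothetical Earlier Entry"),
        ("2009-07-30", "Hypothetical Later Entry"),
    ]
    for title, (prev_title, next_title) in link_neighbors(toc).items():
        print(title, "| prev:", prev_title, "| next:", next_title)
```

The point of building the whole table first is that no entry can know its neighbors until every file has been read, which is why this has to be a separate pass before any output is written.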
(If you think you will ever be using MediaWiki's XML import feature in the future, take note: The upper threshold it can handle in one gulp is about 2 MB.)
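The hand-splitting could in principle be automated. What follows is a rough sketch in Python, not what I actually did, and it rests on assumptions: that the import file is one `<mediawiki ...>` root element containing a run of `<page>...</page>` blocks, and that a limit of about 2 MB per chunk (the figure above) is the right target. The name `split_export` and its `max_bytes` parameter are made up for the example.

```python
import re

# Each wiki page in the import file is assumed to be one <page>...</page> block.
PAGE_RE = re.compile(r"<page>.*?</page>", re.DOTALL)


def split_export(xml_text, max_bytes=2_000_000):
    """Split one big import file into several smaller, complete ones.

    Everything before the first <page> (the root tag and any site info)
    is copied to the top of every chunk, and every chunk is closed with
    </mediawiki>. A single page bigger than max_bytes still gets a chunk
    of its own, which will be over the limit.
    """
    first = PAGE_RE.search(xml_text)
    if first is None:
        return [xml_text]  # nothing to split

    header = xml_text[:first.start()]
    footer = "</mediawiki>\n"
    overhead = len((header + footer).encode("utf-8"))

    chunks, current, size = [], [], overhead
    for page in PAGE_RE.findall(xml_text):
        page_size = len(page.encode("utf-8")) + 1  # +1 for the joining newline
        if current and size + page_size > max_bytes:
            chunks.append(header + "\n".join(current) + "\n" + footer)
            current, size = [], overhead
        current.append(page)
        size += page_size
    if current:
        chunks.append(header + "\n".join(current) + "\n" + footer)
    return chunks
```

Each string returned would then be written to its own file and imported separately. Splitting only on page boundaries is the important part: cutting a file in the middle of a page is exactly the kind of thing the import would choke on.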
Plus there was the matter of downloading all the icons from LJ and re-uploading them into the wiki, and other finicky details like that.
This was not my only big project of the last few days - it was my spare-time project, in fact - but it's worth noting that it took me three days to get the script right, and that it's four hundred and thirty lines long. (Admittedly, some of that is literal text to write into the XML output, but not much.)
I think the next project will be deciding which of the entries I've already put in for Alewife and Scherzi to unlock, under the new rules. Then finish the Scherzi stuff (it's uploaded but not checked/cleaned). Then I'll decide whether I want to go ahead and check the LJ imports or work on uploading the intervening material first.
I realize no one cares about this but me. That's not news.
One point that might interest you, lest you think this wiki is overkill: I backed it up today - the raw database I mean. The backup is 100 MB. That's without the three full-length manuscripts, mouth organ, the Shrunken Cinema pieces, Utopia, and the missing journal years. That's just the short-form fiction, the bulk of the short nonfiction, and the less than three years' worth of journal that have been put in so far.
There are already more than six thousand documents in this wiki, and I have barely begun.
All your words, in one place. I love it. Keep up the hard work.
-- 05:48, 24 July 2009 (BST)

Danima:
I believe I speak for everyone when I say: !!
-- 02:42, 24 July 2009 (BST)