Update Thu Jan 10 09:38:42 AKST 2008: Unless you really need a complete mirror like this, a much faster way to achieve something similar is to use Thanassis Tsiodras's Wikipedia Offline method. Templates and other niceties don't work quite as well with his method, but the setup is much, much faster and easier.
I've come to depend on the Wikipedia. Despite potential problems with vandalism, pages without citations, and uneven writing, it's so much better than anything else I have available. And it's a click away.
Except when flooding on the Richardson Highway and a mistake by an Alaska railroad crew cut off Fairbanks from the world. So I've been exploring mirroring the Wikipedia on a laptop. Without images and fulltext searching of article text, it weighs in at 7.5 GiB (20061130 dump). If you add the fulltext article search, it's 23 GiB on your hard drive. That's a bit much for a laptop (at least mine), but a desktop could handle it easily. The image dumps aren't being made anymore since many of the images aren't free from copyright, but even the last dump in November 2005 was 79 GiB. It took about two weeks to download, and I haven't been able to figure out how to integrate it into my existing mirror.
In any case, here's the procedure I used:
Install apache, PHP5, and MySQL. I'm not going to go into detail here, as there are plenty of good tutorials and documentation pages for installing these three things on virtually any platform. I've successfully installed Wikipedia mirrors on OS X and Linux, but there's no reason why this wouldn't work on Windows, since apache, PHP and MySQL are all available for that platform. The only potential problem is that the text table is 6.5 GiB, and some Windows file systems may not be able to handle files larger than 4 GiB (NTFS should be able to handle it, but earlier filesystems like FAT32 probably can't).
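The file size concern can be checked with a little arithmetic. This is only a sketch: fits_on_fat32 is a made-up helper name, and it assumes the FAT32 per-file limit of 4 GiB minus one byte.

```shell
# Hypothetical helper: succeeds if a file of the given size (in bytes)
# fits under the FAT32 per-file limit of 4 GiB minus one byte.
fits_on_fat32() {
    [ "$1" -le 4294967295 ]
}

# 6.5 GiB expressed in bytes (13 * 512 MiB), the approximate text table size.
text_table_bytes=$((13 * 512 * 1024 * 1024))

if fits_on_fat32 "$text_table_bytes"; then
    echo "text table would fit on FAT32"
else
    echo "text table is too large for FAT32"
fi
```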
Download the latest version of the mediawiki software from http://www.mediawiki.org/wiki/Download (the software links are on the right side of the page).
Create the mediawiki database:
$ mysql -p
mysql> create database wikidb;
mysql> grant create,select,insert,update,delete,lock tables on wikidb.* to user@localhost identified by 'userpasswd';
mysql> grant all on wikidb.* to admin@localhost identified by 'adminpasswd';
mysql> flush privileges;
Untar the mediawiki software to your web server directory:
$ cd /var/www
$ tar xzf ~/mediawiki-1.9.2.tar.gz
Point a web browser to the configuration page, probably something like http://localhost/config/index.php, and fill in the database section with the database name (wikidb), users, and passwords from the SQL you typed in earlier. Click the 'install' button. Once that finishes:
$ cd /var/www/
$ mv config/LocalSettings.php .
$ rm -rf config/
More detailed instructions for getting MediaWiki running are at: http://meta.wikimedia.org/wiki/Help:Installation
Now, get the Wikipedia XML dump from http://download.wikimedia.org/enwiki/. Find the most recent directory that contains a valid pages-articles.xml.bz2 file.
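One way to confirm that a downloaded dump is at least a well-formed bzip2 file is bzip2 -t, which tests the archive without extracting it. A minimal sketch follows; the helper name dump_is_valid is made up for illustration, and passing this check says nothing about whether the XML inside is complete.

```shell
# Hypothetical helper: succeeds if the file exists and passes bzip2's
# integrity test (-t decompresses in memory and writes nothing to disk).
dump_is_valid() {
    [ -f "$1" ] && bzip2 -t "$1" 2>/dev/null
}

# Example, using the dump filename from the import command in this post:
# dump_is_valid enwiki-20060925-pages-articles.xml.bz2 && echo "dump looks intact"
```

Expect the test to take a while on a full dump, since the whole file has to be read and decompressed.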
Also download the mwdumper.jar program from http://download.wikimedia.org/tools/. You'll need Java installed to run this program.
Configure your MySQL server to handle the load by editing /etc/mysql/my.cnf, changing the following settings:
[mysqld]
max_allowed_packet = 128M
innodb_log_file_size = 100M

[mysql]
max_allowed_packet = 128M
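Once the server is running with the new settings, it reports sizes like these in bytes (for example from show variables like 'max_allowed_packet'; at the mysql prompt), so it helps to know what 128M and 100M come to. A small sketch of the conversion; to_bytes is a hypothetical helper and only understands the K, M and G suffixes.

```shell
# Hypothetical helper: convert a my.cnf-style size such as 128M to bytes.
# Only plain numbers and the K, M and G suffixes are handled.
to_bytes() {
    case "$1" in
        *[Kk]) echo $(( ${1%?} * 1024 )) ;;
        *[Mm]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

to_bytes 128M   # max_allowed_packet, prints 134217728
to_bytes 100M   # innodb_log_file_size, prints 104857600
```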
Restart the server, empty some tables and disable binary logging:
$ sudo /etc/init.d/mysql restart
$ mysql -p wikidb
mysql> set sql_log_bin=0;
mysql> delete from page;
mysql> delete from revision;
mysql> delete from text;
Now you're ready to load in the Wikipedia dump file. This will take several hours to more than a day, depending on how fast your computer is (a dual 1.8 GHz Opteron system with 4 GiB of RAM took a little under 17 hours with an average load around 3.0 on the 20061103 dump file). The command is (all on one line):
$ java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 enwiki-20060925-pages-articles.xml.bz2 | mysql -u admin -p wikidb
You'll use the administrator password you chose earlier. You can also use your own MySQL account: since you created the database, you have all the needed rights.
After this finishes, it's a good idea to make sure there are no errors in the MySQL tables. I normally get a few errors in the pagelinks, templatelinks and page tables. To check the tables for errors:
$ mysqlcheck -p wikidb
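With this many tables the output scrolls by quickly, so it can help to filter it. The sketch below assumes that mysqlcheck reports each healthy table on a single line ending in OK; only_problems is a made-up helper name, and whatever it prints should still be read by eye.

```shell
# Hypothetical helper: read mysqlcheck output on stdin and print only the
# lines that do not end in "OK" (assumed to be the tables needing attention
# and their messages).
only_problems() {
    grep -v 'OK[[:space:]]*$'
}

# Example:
# mysqlcheck -p wikidb | only_problems
```

Because this is just grep -v, it exits with a non-zero status when every line ended in OK, which is the good case here.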
If there are tables with errors, you can repair them in two different ways. The first is done inside MySQL and doesn't require shutting down the MySQL server. It's slower, though:
$ mysql -p wikidb
mysql> repair table pagelinks extended;
The faster way requires shutting down the MySQL server:
$ sudo /etc/init.d/mysql stop    (or however you stop it)
$ sudo myisamchk -r -q /var/lib/mysql/wikidb/pagelinks.MYI
$ sudo /etc/init.d/mysql start
There are several important extensions to mediawiki that Wikipedia depends on. You can view all of them by going to http://en.wikipedia.org/wiki/Special:Version, which shows everything Wikipedia is currently using. You can get the latest versions of all the extensions with:
$ svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions extensions
svn is the client command for Subversion (http://subversion.tigris.org/). It's a revision control system that eliminates most of the issues people had with CVS (and rcs before that). The command above will check out all the extensions code into a new directory on your system named extensions.
The important extensions are the parser functions, citation functions, CategoryTree and wikihiero. Here's how you install these from the extensions directory that svn created.
$ cd extensions/ParserFunctions
$ mkdir /var/www/extensions/ParserFunctions
$ cp Expr.php ParserFunctions.php SprintfDateCompat.php /var/www/extensions/ParserFunctions
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/ParserFunctions/ParserFunctions.php");
$wgUseTidy = true;
^d
(The last four lines just add those PHP commands to the LocalSettings.php file. It's probably easier to just use a text editor.)
$ cd ../Cite
$ mkdir /var/www/extensions/Cite
$ cp Cite.php Cite.i18n.php /var/www/extensions/Cite/
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/Cite/Cite.php");
^d

$ cd ..
$ tar cf - CategoryTree/ | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
$wgUseAjax = true;
require_once("$IP/extensions/CategoryTree/CategoryTree.php");
^d

$ tar cf - wikihiero | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/wikihiero/wikihiero.php");
^d
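Appending with cat is easy to run twice by accident, which leaves duplicate lines in LocalSettings.php. As an alternative, here is a minimal sketch of an append that skips lines already present; append_once is a hypothetical helper, not something that ships with MediaWiki.

```shell
# Hypothetical helper: append a line to a file unless that exact line is
# already present (-x matches whole lines, -F treats the text literally).
append_once() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example, using single quotes so the shell leaves $IP alone:
# append_once /var/www/LocalSettings.php 'require_once("$IP/extensions/Cite/Cite.php");'
```

The single quotes matter: inside double quotes the shell would try to expand $IP and $wgUseAjax before the text ever reached the file.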
If you want the math to show up properly, you'll need to have LaTeX, dvips, convert (from the ImageMagick suite), GhostScript, and an OCaml setup to build the code. Here's how to do it:
$ cd /var/www/math
$ make
$ mkdir ../images/tmp
$ mkdir ../images/math
$ sudo chown -R www-data ../images/
My web server runs as user www-data. If yours uses a different account, that's what you'd change the images directories to be owned by. Alternatively, you could use chmod -R 777 ../images to make them writeable by anyone.
Change the $wgUseTeX variable in LocalSettings.php to true. If your Wikimirror is at the root of your web server (as it is in the examples above), you need to make sure that your apache configuration doesn't have an Alias section for images. If any of the programs mentioned aren't in the system PATH (like if you installed them in /usr/local/bin or /sw/bin on a Mac), you'll need to put them in /usr/bin or someplace the script can find them.
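A quick way to see which of those programs the shell can find is command -v. This is a sketch only: missing_tools is a made-up helper, the program names passed to it are the usual command names for the tools listed above and may differ on your system, and the web server's PATH is not necessarily the same as your login shell's.

```shell
# Hypothetical helper: print each named program that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed command names for LaTeX, dvips, ImageMagick's convert,
# GhostScript and OCaml; anything printed here was not found.
missing_tools latex dvips convert gs ocaml
```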
MediaWiki comes with a variety of maintenance scripts in the maintenance directory. To allow these to function, you need to put the admin user's username and password into AdminSettings.php:
$ mv /var/www/AdminSettings.sample /var/www/AdminSettings.php
and change the values of $wgDBadminuser to admin (or what you really set it to when you created the database and initialized your mediawiki) and $wgDBadminpassword to adminpasswd.
Now, if you want the Search box to search anything besides the titles of articles, you'll need to rebuild the search tables. As I mentioned earlier, these tables make the database grow from 7 GiB to 23 GiB (as of the September 25, 2006 dump), so make sure you've got plenty of space before starting this process. I've found a Wikimirror is pretty useful even without full searching so don't abandon the effort if you don't have 20+ GiB to devote to a mirror.
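Checking the free space first is cheap compared to a rebuild that dies partway through. A minimal sketch, assuming the database lives under /var/lib/mysql and that growing from 7 GiB to 23 GiB means roughly 16 GiB of additional room; has_free_gib is a hypothetical helper.

```shell
# Hypothetical helper: succeeds if the file system holding directory $1
# has at least $2 GiB available. df -Pk prints POSIX-format output in
# 1024-byte blocks, with the available space in the fourth column.
has_free_gib() {
    avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge $(( $2 * 1024 * 1024 )) ]
}

# Example (the path is an assumption; use your own MySQL data directory):
# has_free_gib /var/lib/mysql 16 || echo "not enough room for the search tables"
```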
To rebuild everything:
$ php /var/www/maintenance/rebuildall.php
This script builds the search tables first (which takes several hours), and then moves on to rebuilding the link tables. Rebuilding the link tables takes a very, very long time, but there's no problem breaking out of this process once it starts. I've found that this has a tendency to damage some of the link tables, requiring a repair before you can continue. If that does happen, note the table that was damaged and the index number where the rebuildall.php script failed. Then:
$ mysql -p wikidb
mysql> repair table pagelinks extended;
(Replace pagelinks with whatever table was damaged.) I've had repairs take anywhere from a few minutes to 12 hours, so keep this in mind.
After the table is repaired, edit the /var/www/maintenance/rebuildall.php script and comment out these lines:
# dropTextIndex( $database );
# rebuildTextIndex( $database );
# createTextIndex( $database );
# rebuildRecentChangesTablePass1();
# rebuildRecentChangesTablePass2();
and insert the index number where the previous run crashed into this line:
refreshLinks( 1 );
Then run it again.
One final note: Doing all of these processes on a laptop can be very taxing on a computer that might not be well equipped to handle a full load for days at a time. If you have a desktop computer, you can do the dumping and rebuilding on that computer, and after everything is finished, simply copy the database files from the desktop to your laptop. I just tried this with the 20061130 dump, copying all the MySQL files from /var/lib/mysql/wikidb on a Linux machine to /sw/lib/mysql/wikidb on my MacBook Pro. After the copying was finished, I restarted the MySQL daemon, and the Wikipedia mirror is now live on my laptop. The desktop had MySQL version 5.0.24 and the laptop has 5.0.16. I'm not sure how different these can be for a direct copy to work, but it does work between different platforms (Linux and OS X) and architectures (AMD64 and Intel Core Duo).
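With this many gigabytes moving between machines, it may be worth confirming that the copy arrived intact. The sketch below is one way to do that with the standard cksum utility; dir_checksums is a hypothetical helper, and it assumes the MySQL server is stopped on both machines while the checksums run and that you have permission to read the database files (which may mean sudo).

```shell
# Hypothetical helper: print checksum, size and name for every file in a
# directory, sorted the same way regardless of the machine's locale.
dir_checksums() {
    (
        cd "$1" || exit 1
        LC_ALL=C
        export LC_ALL
        cksum ./*
    )
}

# Run on each machine, save the output, and compare the two listings:
# dir_checksums /var/lib/mysql/wikidb > desktop.sums    (on the desktop)
# dir_checksums /sw/lib/mysql/wikidb  > laptop.sums     (on the laptop)
# diff desktop.sums laptop.sums && echo "copies match"
```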