metachronistic

Thu, 10 May 2007

Exciting Flesh Lose Product!

spam

foreign spam
image by f_mafra

I maintain our departmental mail servers, and spend a fair amount of effort trying to reduce unsolicited email. One of the ways I do that is by collecting my spam and training our spam filter with it so our users won’t see it more than once.

Today I got a hilarious email that must have been written in another language and then translated into English. Its first paragraph mentions a “most exciting lose flesh product available.” Sounds good! Better yet is this supposed testimonial from a guy in New York: “And you see me, the bed became cool also!” To quote Temperance Brennan: “I don’t know what that means.”

I also don’t know how I manage to “decline the preposition” and resist this exciting new product that “attacks unnecessary kilos.”

cswingle @ 4:59:31 -0800

Wed, 17 Jan 2007

Open source nutrition

Every so often I get curious about nutrition and whether my diet is actually a healthy one. Over the years I’ve used a program called NUT, which is a really great console program that uses all the data from the USDA National Nutrient Database for Standard Reference. A couple days ago I downloaded the latest version and compiled it on my MacBook Pro. Thanks to the genius of writing simple, portable C code that builds with gcc, it compiled perfectly (not even a single warning) and I was off and running.

Unfortunately I was having a little trouble deleting the 26,642 gram (58+ pound) apple I accidentally entered for lunch today, and because I had the source code available, I discovered a buffer overflow error in the menu entry code. (A buffer overflow is sort of like when a form asks for your first name but only has room for six letters, and instead of stopping at C-h-r-i-s-t you continue to write the rest of your name into the following boxes not designed for your first name.) So I wrote to the author. An hour later, he wrote me back to thank me for finding the bug. Along the way he found a couple more, fixed them, and released a new version.

Timeline: Find a bug before dinner. Contact author. By the time I’m having my first beer, the program has already been fixed.

Try getting that kind of support from your commercial vendor.

cswingle @ 20:36:34 -0800

Thu, 09 Nov 2006

Don’t feed the evil

<rant>

all your base

image from psd

In my job as a systems administrator, spam is one of those things I accept as fact, but have to deal with as best I can so my users can actually get work done. I came across this article on Slashdot today, and even though there’s absolutely nothing revelatory in it, I think people fail to appreciate where spam comes from. It’s not evil spammers sending you junk mail; spam comes from computers running Microsoft Windows that have been infected with something. If you don’t like spam, stop sending Microsoft money for their software. Every time you buy a Microsoft product, you’re supporting all the network effects of their software. The same network effects that make sharing a Word document with other Microsoft Office users easy also result in more infections, more spam, more wasted time and money.

<rant />

cswingle @ 18:11:29 -0800

Sat, 04 Nov 2006

Mirror the Wikipedia

wikipedia

Update Thu Jan 10 09:38:42 AKST 2008: Unless you really need a complete mirror like this, a much faster way to achieve something similar is to use Thanassis Tsiodras’s Wikipedia Offline method. Templates and other niceties don’t work quite as well with his method, but the setup is much, much faster and easier.


I’ve come to depend on the Wikipedia. Despite potential problems with vandalism, pages without citations, and uneven writing, it’s so much better than anything else I have available. And it’s a click away.

Except when flooding on the Richardson Highway and a mistake by an Alaska Railroad crew cut off Fairbanks from the world. So I’ve been exploring mirroring the Wikipedia on a laptop. Without images and fulltext searching of article text, it weighs in at 7.5 GiB (20061130 dump). If you add the fulltext article search, it’s 23 GiB on your hard drive. That’s a bit much for a laptop (at least mine), but a desktop could handle it easily. The image dumps aren’t being made anymore since many of the images aren’t free from copyright, but even the last dump in November 2005 was 79 GiB. It took about two weeks to download, and I haven’t been able to figure out how to integrate it into my existing mirror.

In any case, here’s the procedure I used:

Install Apache, PHP5, and MySQL. I’m not going to go into detail here, as there are plenty of good tutorials and documentation pages for installing these three things on virtually any platform. I’ve successfully installed Wikipedia mirrors on OS X and Linux, but there’s no reason why this wouldn’t work on Windows, since Apache, PHP and MySQL are all available for that platform. The only potential problem is that the text table is 6.5 GiB, and some Windows file systems may not be able to handle files larger than 4 GiB (NTFS should be able to handle it, but earlier filesystems like FAT32 can’t).

Download the latest version of the mediawiki software from http://www.mediawiki.org/wiki/Download (the software links are on the right side of the page).

Create the mediawiki database:

$ mysql -p
mysql> create database wikidb;
mysql> grant create,select,insert,update,delete,lock tables on wikidb.* to user@localhost identified by 'userpasswd';
mysql> grant all on wikidb.* to admin@localhost identified by 'adminpasswd';
mysql> flush privileges;

Untar the mediawiki software to your web server directory:

$ cd /var/www
$ tar xzf ~/mediawiki-1.9.2.tar.gz

Point a web browser to the configuration page, probably something like http://localhost/config/index.php, and fill in the database section with the database name (wikidb), users, and passwords from the SQL you typed in earlier. Click the ‘install’ button. Once that finishes:


$ cd /var/www/
$ mv config/LocalSettings.php .
$ rm -rf config/

More detailed instructions for getting MediaWiki running are at: http://meta.wikimedia.org/wiki/Help:Installation

Now, get the Wikipedia XML dump from http://download.wikimedia.org/enwiki/. Find the most recent directory that contains a valid pages_articles.xml.bz2 file.

Also download the mwdumper.jar program from http://download.wikimedia.org/tools/. You’ll need Java installed to run this program.

Configure your MySQL server to handle the load by editing /etc/mysql/my.cnf, changing the following settings:

[mysqld]
max_allowed_packet = 128M
innodb_log_file_size = 100M

[mysql]
max_allowed_packet = 128M

Restart the server, empty some tables and disable binary logging:

$ sudo /etc/init.d/mysql restart
$ mysql -p wikidb
mysql> set sql_log_bin=0;
mysql> delete from page;
mysql> delete from revision;
mysql> delete from text;

Now you’re ready to load in the Wikipedia dump file. This will take several hours to more than a day, depending on how fast your computer is (a dual 1.8 GHz Opteron system with 4 GiB of RAM took a little under 17 hours with an average load around 3.0 on the 20061103 dump file). The command is (all on one line):

$ java -Xmx600M -server -jar mwdumper.jar --format=sql:1.5 enwiki-20060925-pages-articles.xml.bz2 | mysql -u admin -p wikidb

You’ll use the administrator password you chose earlier. You can also use your own MySQL account; since you created the database, you have all the needed rights.

After this finishes, it’s a good idea to make sure there are no errors in the MySQL tables. I normally get a few errors in the pagelinks, templatelinks and page tables. To check the tables for errors:

$ mysqlcheck -p wikidb

If there are tables with errors, you can repair them in two different ways. The first is done inside MySQL and doesn’t require shutting down the MySQL server. It’s slower, though:

$ mysql -p wikidb
mysql> repair table pagelinks extended;

The faster way requires shutting down the MySQL server:

$ sudo /etc/init.d/mysql stop (or however you stop it)
$ sudo myisamchk -r -q /var/lib/mysql/wikidb/pagelinks.MYI
$ sudo /etc/init.d/mysql start

There are several important extensions to mediawiki that Wikipedia depends on. You can view all of them by going to http://en.wikipedia.org/wiki/Special:Version, which shows everything Wikipedia is currently using. You can get the latest versions of all the extensions with:

$ svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions extensions

svn is the client command for Subversion (http://subversion.tigris.org/), a revision control system that eliminates most of the issues people had with CVS (and RCS before that). The command above will check out all the extensions code into a new directory on your system named extensions.

The important extensions are the parser functions, citation functions, CategoryTree and WikiHiero. Here’s how you install these from the extensions directory that svn created.

Parser functions:

$ cd extensions/ParserFunctions
$ mkdir /var/www/extensions/ParserFunctions
$ cp Expr.php ParserFunctions.php SprintfDateCompat.php /var/www/extensions/ParserFunctions
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/ParserFunctions/ParserFunctions.php");
$wgUseTidy = true;
^d

(The last four lines just add those PHP commands to the LocalSettings.php file. It’s probably easier to just use a text editor.)

Citation functions:

$ cd ../Cite
$ mkdir /var/www/extensions/Cite
$ cp Cite.php Cite.i18n.php /var/www/extensions/Cite/
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/Cite/Cite.php");
^d

CategoryTree:

$ cd ..
$ tar cf - CategoryTree/ | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
$wgUseAjax = true;
require_once("$IP/extensions/CategoryTree/CategoryTree.php");
^d

WikiHiero:

$ tar cf - wikihiero | (cd /var/www/extensions/; tar xvf -)
$ cat >> /var/www/LocalSettings.php
require_once("$IP/extensions/wikihiero/wikihiero.php");
^d
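If you'd rather script these LocalSettings.php additions than type them into cat, here is a minimal Python sketch. The function name append_settings is made up for illustration and is not part of MediaWiki. It assumes the settings are one per line, skips any line that is already present, and, if the file happens to end with a closing ?> tag, keeps that tag last so the new lines stay inside the PHP block.

```python
def append_settings(path, new_lines):
    """Append each setting line to a PHP settings file unless already present.

    Hypothetical helper for illustration only. Returns the lines it added.
    """
    with open(path) as f:
        lines = f.read().splitlines()

    present = {line.strip() for line in lines}
    missing = [line for line in new_lines if line.strip() not in present]
    if not missing:
        return []

    # Drop trailing blank lines so we can look at the last real line.
    while lines and not lines[-1].strip():
        lines.pop()

    # If the file ends with a closing PHP tag, keep it as the last line.
    closing = []
    if lines and lines[-1].strip() == "?>":
        closing = [lines.pop()]

    with open(path, "w") as f:
        f.write("\n".join(lines + missing + closing) + "\n")
    return missing


if __name__ == "__main__":
    # Paths and settings taken from the steps above; adjust to your install.
    added = append_settings("/var/www/LocalSettings.php", [
        'require_once("$IP/extensions/Cite/Cite.php");',
        'require_once("$IP/extensions/wikihiero/wikihiero.php");',
    ])
    print("added %d line(s)" % len(added))
```

Running it a second time adds nothing, which makes it safe to re-run after an upgrade. Back up LocalSettings.php first, since the sketch rewrites the whole file.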

If you want the math to show up properly, you’ll need to have LaTeX, dvips, convert (from the ImageMagick suite), GhostScript, and an OCaml setup to build the code. Here’s how to do it:

$ cd /var/www/math
$ make
$ mkdir ../images/tmp
$ mkdir ../images/math
$ sudo chown -R www-data ../images/

My web server runs as user www-data. If yours uses a different account, that’s what you’d change the images directories to be owned by. Alternatively, you could use chmod -R 777 ../images to make them writeable by anyone.

Change the $wgUseTeX variable in LocalSettings.php to true. If your Wikimirror is at the root of your web server (as it is in the examples above), you need to make sure that your Apache configuration doesn’t have an Alias section for images. If any of the programs mentioned aren’t in the system PATH (for example, if you installed them in /usr/local/bin or /sw/bin on a Mac), you’ll need to put them in /usr/bin or someplace the script can find them.

MediaWiki comes with a variety of maintenance scripts in the maintenance directory. To allow these to function, you need to put the admin user’s username and password into AdminSettings.php:

$ mv /var/www/AdminSettings.sample /var/www/AdminSettings.php

and change the values of $wgDBadminuser to admin (or what you really set it to when you created the database and initialized your mediawiki) and $wgDBadminpassword to adminpasswd.

Now, if you want the Search box to search anything besides the titles of articles, you’ll need to rebuild the search tables. As I mentioned earlier, these tables make the database grow from 7 GiB to 23 GiB (as of the September 25, 2006 dump), so make sure you’ve got plenty of space before starting this process. I’ve found a Wikimirror is pretty useful even without full searching so don’t abandon the effort if you don’t have 20+ GiB to devote to a mirror.

To rebuild everything:

$ php /var/www/maintenance/rebuildall.php

This script builds the search tables first (which takes several hours), and then moves on to rebuilding the link tables. Rebuilding the link tables takes a very, very long time, but there’s no problem breaking out of this process once it starts. I’ve found that this has a tendency to damage some of the link tables, requiring a repair before you can continue. If that does happen, note the table that was damaged and the index number where the rebuildall.php script failed. Then:

$ mysql -p wikidb
mysql> repair table pagelinks extended;

(Replace pagelinks with whatever table was damaged.) I’ve had repairs take anywhere from a few minutes to 12 hours, so keep this in mind.

After the table is repaired, edit the /var/www/maintenance/rebuildall.php script and comment out these lines:

# dropTextIndex( $database );
# rebuildTextIndex( $database );
# createTextIndex( $database );
# rebuildRecentChangesTablePass1();
# rebuildRecentChangesTablePass2();

and insert the index number where the previous run crashed into this line:

refreshLinks( 1 );

Then run it again.

One final note: running all of these processes can be very taxing on a laptop that might not be well equipped to handle a full load for days at a time. If you have a desktop computer, you can do the dumping and rebuilding on that computer, and after everything is finished, simply copy the database files from the desktop to your laptop. I just tried this with the 20061130 dump, copying all the MySQL files from /var/lib/mysql/wikidb on a Linux machine to /sw/lib/mysql/wikidb on my MacBook Pro. After the copying was finished, I restarted the MySQL daemon, and the Wikipedia mirror is now live on my laptop. The desktop had MySQL version 5.0.24 and the laptop has 5.0.16. I’m not sure how different these can be for a direct copy to work, but it does work between different platforms (Linux and OS X) and architectures (AMD64 and Intel Core Duo).

cswingle @ 9:34:49 -0800

Mon, 01 May 2006

Another iCal script

Last week I wrote a Python script to import my Unix calendar event files into Google calendar. Today I wanted to put the 2006 Alaska Goldpanners schedule into my Google calendar. I suppose I could have entered all the games in manually, but instead I came up with an event file format, and a script to translate these files into iCal files that can be imported into Google calendar.

The format looks like this:

2006-Jun-14 1900 2200 Goldpanners vs. Fairbanks Adult All-Stars

with one event per line. The start and end times are in military time, and events have to start and finish on the same day.
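To make the format concrete, here is a minimal sketch of how one of these lines could be turned into an iCal event. It is not the mycal_to_ics.py script itself. The function names, the example.invalid UID domain, and the PRODID string are placeholders. Times are written as "floating" local times with no time zone, and month names are parsed with %b, which assumes an English locale.

```python
import uuid
from datetime import datetime, timezone


def escape_text(text):
    """Escape the characters that are special in iCal TEXT values."""
    return text.replace("\\", "\\\\").replace(";", "\\;").replace(",", "\\,")


def event_line_to_vevent(line, stamp=None):
    """Convert 'YYYY-Mon-DD HHMM HHMM summary' into a list of VEVENT lines."""
    date_str, start, end, summary = line.strip().split(None, 3)
    day = datetime.strptime(date_str, "%Y-%b-%d").strftime("%Y%m%d")
    stamp = stamp or datetime.now(timezone.utc)
    return [
        "BEGIN:VEVENT",
        "UID:%s@example.invalid" % uuid.uuid4(),       # placeholder domain
        "DTSTAMP:%s" % stamp.strftime("%Y%m%dT%H%M%SZ"),
        "DTSTART:%sT%s00" % (day, start),              # floating local time
        "DTEND:%sT%s00" % (day, end),
        "SUMMARY:%s" % escape_text(summary),
        "END:VEVENT",
    ]


def events_to_ics(lines):
    """Wrap every non-blank event line in a single VCALENDAR."""
    out = ["BEGIN:VCALENDAR", "VERSION:2.0",
           "PRODID:-//example//event file sketch//EN"]  # placeholder PRODID
    for line in lines:
        if line.strip():
            out.extend(event_line_to_vevent(line))
    out.append("END:VCALENDAR")
    return "\r\n".join(out) + "\r\n"


if __name__ == "__main__":
    import sys
    sys.stdout.write(events_to_ics(sys.stdin))
```

Because start and end share one date field, the same-day restriction falls straight out of the format.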

To convert a file of these events to iCal format, download mycal_to_ics.py, and run it like this:

cat mycal | ./mycal_to_ics.py > mycal.ics

Then you can import it into your Google calendar using the Manage Calendars | Import Calendar tab. I’d recommend creating a new temporary calendar and importing into that so that if there are any errors, you won’t have disturbed your existing calendars.

cswingle @ 16:08:21 -0800

Sun, 30 Apr 2006

Watch iTunes

A couple months ago I got my first Apple since the Mac Classic I had in college. It’s a MacBook Pro and so far I really like it. I’ve managed to get it to do almost everything my Linux laptop could do, but now I’ve got access to iTunes and Adobe’s Creative Suite (although it’s slow under Rosetta). If Apple would allow me to change the focus behavior, and implement the X11 cut and paste, it’d be the perfect system for a laptop.

On campus I have access to the iTunes playlists of all the people on the wireless network that are sharing their music library. And I have mine shared so other people can check out the artists I enjoy. Unfortunately, iTunes doesn’t tell you what songs connected users are listening to or who is actually connected.

Since OS X is Unix, it’s easy enough to examine the process tree and discover what network and filesystem connections iTunes is making. Running:

ps -axo 'pid command' | grep -v grep | grep 'iTunes ' | awk '{print $1}'

will show the process ID for iTunes. Once you have this number, you can use lsof -p [pid] to show all the files (and network connections, which are treated like files in Unix) that iTunes is using. Filtering the results by your iTunes library (grep "/Users/$USER/Music/iTunes/iTunes Music/") yields the songs that are being played, both locally and over the network. And searching for ESTABLISHED shows the network connections. The last part of each of these lines shows the IP address of a computer connected to you, and if there are two lines with the same destination IP address, that means they are actually playing from your music library.
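The "two connections means they're playing" rule is easy to express in code. The sketch below is not watch_itunes.py; it only illustrates the idea. It assumes you captured the text with lsof -n -p [pid] (the -n flag keeps addresses numeric), handles IPv4 only, and the function name is made up. It also doesn't distinguish incoming connections from ones iTunes itself opened to someone else's library.

```python
import re
from collections import Counter

# Remote end of an established TCP connection as lsof prints it, e.g.
#   TCP 192.168.1.100:daap->192.168.1.101:49213 (ESTABLISHED)
REMOTE = re.compile(r"->(\d+\.\d+\.\d+\.\d+):\S+ \(ESTABLISHED\)")


def connected_clients(lsof_text):
    """Map each remote IP to 'listening' (2+ connections) or 'connected'."""
    counts = Counter()
    for line in lsof_text.splitlines():
        match = REMOTE.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: ("listening" if n >= 2 else "connected")
            for ip, n in counts.items()}


if __name__ == "__main__":
    import sys
    # Example: lsof -n -p 1234 | python this_sketch.py
    for ip, state in sorted(connected_clients(sys.stdin.read()).items()):
        print("%s is %s" % (ip, state))
```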

To automate this, I wrote a Python script, watch_itunes.py. Note that this is a command-line tool, run from a terminal window. There are Dashboard widgets that are supposed to do this, but the one I tried didn’t work, perhaps because I have an Intel Mac.

To use the script: ./watch_itunes.py

By default, it will examine the process tree every 15 seconds, showing what’s playing and who is connected or playing from your music library. Run it with -h to see a list of command line options.

Here’s what it shows right now:

192.168.1.101 is connected but not listening to music
Portastatic                Bright Ideas               05 Little Fern.m4a

192.168.1.101 is listening to music
Arcade Fire                Funeral                    09 Rebellion (Lies).m4a
Portastatic                Bright Ideas               05 Little Fern.m4a

In the first two lines, I’m listening to Little Fern, and another computer is connected to my library but isn’t playing anything. In the second set of lines, they started listening to Rebellion (Lies). The program will keep printing lines like these until you exit it with Control-C.

cswingle @ 19:53:23 -0800

Tue, 25 Apr 2006

Unix Calendar -> iCal

For many years I’ve used the Unix calendar program to send me an email reminder of upcoming events and holidays. Unix calendar files are very simple text files with one event per line like this:

Apr 22  We bring Koidern home, 2006

Google recently added a calendar to their set of web programs, and like most things Google does, it offers a clean and elegant implementation. Best of all, it’s on the web, so you can access the same calendar information from anywhere there’s an Internet connection.

These days, calendar files are typically in iCal format. I wanted to convert my Unix calendar file over to iCal so I could import the data into Google calendar. Python to the rescue!

Download the script: calendar_to_ics.py

To use it: cat ~/.calendar/calendar | ./calendar_to_ics.py > /tmp/calendar.ics

Import the file you created into your Google calendar by clicking on the Manage Calendars link, and going to the Import Calendar tab. The script is only designed to handle simple events that take place once a year, on the same day, and it only accepts dates in MMM DD format. But Python is easy to read and hack, so if you have improvements, please email them to me and I’ll incorporate them into the script.
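For anyone who wants to hack on the idea, here is a rough sketch of the core conversion: one MMM DD line becomes an all-day event that repeats every year. This is not calendar_to_ics.py; the function name and the first_year default are assumptions, and month abbreviations are parsed with %b, which assumes an English locale. A complete file would still need the BEGIN:VCALENDAR/END:VCALENDAR wrapper, and the iCalendar spec also asks for a UID and DTSTAMP on each event.

```python
import re
from datetime import datetime

# "Apr 22<whitespace>Some text": month abbreviation, day, then the description.
LINE = re.compile(r"^([A-Za-z]{3})\s+(\d{1,2})\s+(.+)$")


def calendar_line_to_vevent(line, first_year=2006):
    """Convert one 'MMM DD text' calendar line into yearly VEVENT lines."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError("expected 'MMM DD <text>': %r" % line)
    month, day, text = match.groups()
    start = datetime.strptime("%d %s %s" % (first_year, month, day), "%Y %b %d")
    # Commas, semicolons and backslashes must be escaped in iCal text.
    text = text.replace("\\", "\\\\").replace(";", "\\;").replace(",", "\\,")
    return [
        "BEGIN:VEVENT",
        "DTSTART;VALUE=DATE:%s" % start.strftime("%Y%m%d"),  # all-day event
        "RRULE:FREQ=YEARLY",                                  # repeat each year
        "SUMMARY:%s" % text,
        "END:VEVENT",
    ]


if __name__ == "__main__":
    print("\r\n".join(calendar_line_to_vevent("Apr 22\tWe bring Koidern home, 2006")))
```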

cswingle @ 5:30:49 -0800

Fri, 25 Nov 2005

Fortune Passwords

Following up on yesterday’s discussion of making passwords that look random to the computer, but contain some pattern that’s easily remembered, I wrote a little password generator in Python. It requires the ‘fortune’ program (the fortune-mod and fortunes packages in Debian), as well as Python. The script takes two optional arguments: the number of passwords to generate, and whether the script should create “difficult” passwords.

The output looks like this:

    $ ./fortune_password.py 1
    16422 : 4Dcfpnsfe#
    Don't compare floating point numbers solely for equality.

or if you’ve chosen the “difficult” version:

    $ ./fortune_password.py 1 d
    55424 : ya8=Ithotmk
    You are in the hall of the mountain king.

The difficult version puts the number, symbol and upper case letter in the middle of the string of letters, rather than at the beginning and end as the simpler version does. I suppose the difficult version is slightly more “random” and is better as a result, but there’s probably not much difference when it comes to how long it would take to crack it.
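The two layouts can be sketched in a few lines of Python. This is a reconstruction from the sample output above, not the actual fortune_password.py: the function name, the symbol set, and the choice to put the digit and symbol side by side are all assumptions. It takes the phrase as an argument instead of calling fortune, so it runs anywhere.

```python
import random
import string

SYMBOLS = "!@#$%^&*+=;:?"  # assumed symbol set


def phrase_password(phrase, difficult=False, rng=None):
    """Build a password from the first letter of each word in a phrase."""
    rng = rng or random.SystemRandom()  # OS randomness unless a seed is given
    letters = [word[0].lower() for word in phrase.split() if word[0].isalpha()]
    if len(letters) < 2:
        raise ValueError("phrase is too short to make a password from")
    digit = rng.choice(string.digits)
    symbol = rng.choice(SYMBOLS)

    if not difficult:
        # Simple layout: digit, capitalized first letter, rest, symbol.
        letters[0] = letters[0].upper()
        return digit + "".join(letters) + symbol

    # Difficult layout: digit and symbol go somewhere in the middle, and
    # the letter right after them is upper-cased.
    cut = rng.randrange(1, len(letters))
    letters[cut] = letters[cut].upper()
    return "".join(letters[:cut]) + digit + symbol + "".join(letters[cut:])


if __name__ == "__main__":
    print(phrase_password("You are in the hall of the mountain king."))
    print(phrase_password("You are in the hall of the mountain king.", difficult=True))
```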

Of course, despite the way the passwords look, they’re not actually random. So if the cracker knows that you’ve used a password generator based on the fortune command, they can generate a wordlist based on fortunes and use that in a dictionary attack instead of having to use a brute force attack.

cswingle @ 11:06:21 -0800

Thu, 24 Nov 2005

Good Passwords

The University has been requiring certain departments to sit through a 15 minute presentation on using good passwords. One of the handouts had a chart showing how long it takes to crack passwords by how long they are and how many types of characters they’ve got in them. I’m interested in the subject because I typically assign passwords to my users when they start work. I wrote a simple program that takes words from the dictionary that are between 9 and 15 letters long, and which don’t end in ‘ing’, ‘s’, or ‘ed’. The program then splits the word in the middle somewhere, inserts a random number, a random symbol, and capitalizes one of the following letters in the word.

For example, the script gets the word ‘misdirection’, inserts a ‘1’ and a ‘%’, and then capitalizes one of the letters in the word. The resulting password is ‘misdi1%recTion’.
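A minimal sketch of that recipe, for illustration only; this is not the program I actually use. The function names, the symbol set, and the exact bounds for "the middle somewhere" are assumptions. The word is passed in as an argument so that no particular dictionary file is required.

```python
import random
import string

SYMBOLS = "!@#$%^&*+=;:?"  # assumed symbol set


def usable(word):
    """Apply the word filter: 9-15 lowercase letters, no common suffix."""
    return (9 <= len(word) <= 15
            and word.isalpha() and word.islower()
            and not word.endswith(("ing", "s", "ed")))


def word_password(word, rng=None):
    """Split a word, insert a digit and a symbol, capitalize a later letter."""
    if not usable(word):
        raise ValueError("word does not pass the filter: %r" % word)
    rng = rng or random.SystemRandom()
    # Keep at least three letters on each side of the split (an assumption).
    cut = rng.randrange(3, len(word) - 2)
    head, tail = word[:cut], list(word[cut:])
    # Capitalize one of the letters that follow the inserted characters.
    i = rng.randrange(len(tail))
    tail[i] = tail[i].upper()
    return head + rng.choice(string.digits) + rng.choice(SYMBOLS) + "".join(tail)


if __name__ == "__main__":
    print(word_password("misdirection"))
```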

That password is composed of the letters [a-zA-Z], symbols [!@#$%^&*+=;:?], and numbers [0-9], so the set of characters to search is 26 + 26 + 13 + 10 = 75. The password is 14 characters long, so the space a brute force attack has to search is 75^14 = 1.8 x 10^26, which is a huge number.

I did a few experiments with my workstation, which has an AMD Opteron 246 processor inside. Performing a brute force attack requires encrypting all these possible combinations until a match is found. So the type of encryption used is important. My computer can perform about 450,000 encryptions per second if the encryption is the old style DES encryption used on most proprietary Unix platforms. But all of my servers are running Linux, which uses MD5-style passwords, and my computer can only do about 3,500 encryptions per second. So 1.8 x 10^26 possible passwords / 3,500 encryptions / second means it’ll take about 1.6 quadrillion years on my computer to crack it (or half that time on average).
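The arithmetic is easy to check. This short sketch just redoes the division with the figures quoted above (75 characters, 14 positions, 3,500 or 450,000 guesses per second); the function names are my own, and a year is taken as 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def keyspace(alphabet_size, length):
    """Number of candidate passwords a brute force attack must cover."""
    return alphabet_size ** length


def years_to_exhaust(alphabet_size, length, guesses_per_second):
    """Years needed to try every candidate at a fixed guessing rate."""
    seconds = keyspace(alphabet_size, length) / guesses_per_second
    return seconds / SECONDS_PER_YEAR


alphabet = 26 + 26 + 13 + 10  # lower case, upper case, symbols, digits

print("%.1e candidates" % keyspace(alphabet, 14))
print("%.1e years at 3,500 per second (MD5)" % years_to_exhaust(alphabet, 14, 3500))
print("%.1e years at 450,000 per second (DES)" % years_to_exhaust(alphabet, 14, 450000))
```

The MD5 line comes out at roughly 1.6e+15 years, the 1.6 quadrillion quoted above. Halve it for the average case, since a match will on average turn up halfway through the search.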

Unfortunately, most passwords aren’t cracked using brute force; they’re cracked by using a dictionary attack, and since my passwords are generated using a dictionary, that means they’re considerably more vulnerable. The question is, does my method of randomly inserting a number and symbol in the middle of a dictionary word (as well as randomly upper casing a letter) defeat a dictionary attack?

I don’t know the answer. But I’ve done some experiments with pathologically bad passwords to see what might happen. On my computer a simple dictionary word is cracked within seconds. And a simple dictionary word with numbers appended (I tried ‘barf51’) is cracked in two and a half hours. So the jury is still out on my method. But I’ll bet that my method isn’t as safe as I thought it was at first. It’s certainly better than the user who uses her husband’s name, the name of the dog, or their license plate number for a password. Most cracking software has information about the typical behavior of users built into it, so it will start by searching the space defined by their username, their domain name, and common names. ‘cswingle11’ would be a pretty poor choice for me. ‘misdi1%recTion’ would undoubtedly be better.

The only really safe way to generate passwords is to do it in such a way that there isn’t a pattern (like a dictionary word) that the computer can identify and use to reduce the number of combinations the cracking program needs to test. So a better approach to passwords is probably to use a database of common phrases, and pull the first letters from the phrase, insert some random cases, symbols and numbers, and use that. Perhaps the ‘fortune’ command offers some possibilities here:

    $ fortune -n 80 | head -1
    There is no distinctly native American criminal class except Congress.

So: ‘tindnaccec’ -> TindnAcceC -> TindnA7#cceC

That’s 75^12 = 3 x 10^22, and because it’s effectively random (unless cracking tools learn about the ‘fortune’ database and how it might be manipulated...), it’ll take 286 billion years for a computer equivalent to mine to crack this.

Sounds like a Python script in the making.

cswingle @ 10:49:08 -0800
