Fight spam and read books

Well, I fixed my spam problems, it seems. I am now using CAPTCHAs on blog comments. A CAPTCHA is a way of checking whether whoever is accessing a web page is a "real" person by asking them to do something that computers find hard to do. Traditionally, this has involved asking them to type out a word shown in a picture, because computers have always had trouble with image processing. However, software has become better at reading images, and this approach has started to fail. Some other ways to determine whether the user is a real person have been suggested:

In order to prove your authenticity, please provide the answer to the following formula: [image: formula]

And then there's:

[image: a new captcha approach]

I am using neither of these methods, unfortunately. Brad pointed out ReCAPTCHA to me, which is now the recommended implementation of the CAPTCHA system. As described on their page, people perform word recognition all the time when they answer CAPTCHAs, and ReCAPTCHA uses this to assist in scanning the world's library archives into digital format. When some pages of some books are scanned in, the software can't always work out what the words are supposed to be, so these words get used in CAPTCHAs, and we let the people of the world work out what they are. If you're wondering how unknown words can be used in a CAPTCHA, go and read the link above.
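The idea, roughly, is that each challenge pairs a word the system already knows with one the scanning software could not read. If the known word is typed correctly, the answer given for the unknown word is counted as a vote for its reading. Below is a minimal sketch of that bookkeeping. The function names and the threshold of three matching votes are made up for illustration; this is not ReCAPTCHA's actual code or API.

```php
<?php
// Sketch only: every name here is hypothetical, and so is the vote threshold.

// True if the visitor typed the known ("control") word correctly.
function control_word_matches(string $known, string $typed): bool {
    return strcasecmp(trim($typed), $known) === 0;
}

// Count one vote for the visitor's reading of the unknown word.
function record_vote(array $votes, string $typed): array {
    $reading = strtolower(trim($typed));
    $votes[$reading] = ($votes[$reading] ?? 0) + 1;
    return $votes;
}

// Return the first reading that has enough votes, or null if none has yet.
function accepted_reading(array $votes, int $threshold = 3): ?string {
    foreach ($votes as $reading => $count) {
        if ($count >= $threshold) {
            return (string) $reading;
        }
    }
    return null;
}

// Each submission is (typed control word, typed unknown word).
$submissions = array(
    array('overlooks', 'inquiry'),
    array('Overlooks', 'inquiry'),
    array('overlocks', 'enquiry'),   // control word wrong, so this one is ignored
    array('overlooks', 'Inquiry '),
);

$votes = array();
foreach ($submissions as $s) {
    if (control_word_matches('overlooks', $s[0])) {
        $votes = record_vote($votes, $s[1]);
    }
}

echo accepted_reading($votes), "\n";   // inquiry
```

Only visitors who pass the control word get a say, which is what lets an unread word double as a spam check.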

Anyway, the point is, we're helping to digitize humanity's knowledge, and fighting spam at the same time. It's like killing two birds with one stone. I notice that Facebook also uses ReCAPTCHA in its sign-up form. I think it's awesome.

Please let me know if there are any issues using the new CAPTCHAs when submitting comments.

Update: More captcha amusements and yet more.

Drupal anti-spam

Lazyweb, O, lazyweb, I call out to thee in my hour of need. I installed the spam and trackback modules for drupal, and to the outside observer, my blog is nicely spam-free. However, I get about fifty spam comments and spam trackbacks a day, which get trapped in the approval queue, and I have to manually wade through cialis and porn adverts/links to see if there are any real comments/trackbacks for any of my posts.

Depressingly, there generally aren't.

What's the best way to keep one's comments and trackbacks spam-free, without having to manually delete every single dodgy one, and without getting any false positives?

A side note is that the trackback module isn't great - if I want to send a trackback, I have to manually find the trackback URL and put it in the little textbox - isn't there a nice drupal module that checks all outgoing URLs, and autodiscovers the trackbacks, and pings them? The trackback module that I have installed seems to think that this is what it does, but it has delusions of grandeur, in my opinion.

Converting from Serendipity to Drupal

My old blog, on vhata.rucus.net, was running Serendipity, but I've switched to using Drupal for everything on this site, vhata.net. Apart from being written in PHP, which is unfortunate, Drupal seems to be a fairly decent piece of software - pretty, easy to use, and well written. If I needed persuading of this fact, I would have been convinced by how easy it turned out to be to migrate my blog entries from Serendipity to Drupal.

Since I started setting up my omnia.za.net hosting (maybe I should blog about that a bit later?), I've been using Postgres as much as possible, where there's a choice. I am using it behind my shorl generator, my quote database, and behind the various drupal sites on this machine, including this blog. The database that stored my serendipity entries on rucus, however, was a MySQL database.

It was easy enough to extract the information that I wanted to keep from the serendipity database:

   select id, title, timestamp, concat(body, extended) from serendipity_entries;

I wasn't that interested in trying to keep the comments, categories, and so on - I have retained the database, and I will go through the comments later and update various entries to include the comments, I think.

After that, the question was how to insert this data into Drupal. At first I thought of doing it manually: I created a test blog entry, with pg_dumps before and after, so I could compare the states of the database, and how it changed when a blog entry was created. It seemed simple enough, but the whole idea didn't sit right with me. So I had a look at the PHP code behind Drupal, and as I've said, it's incredibly simple and elegantly written.

It turns out, there's a node_save() function that you can call, passing it a node object (which needs properties such as 'title', 'body', etc), and it will update everything for you. It was that simple. All I needed was to write some PHP code that did the MySQL selection above from serendipity, created a node with the right properties, and saved it. This code would, of course, need to run within the Drupal environment so that it had access to the node_save() function, and was connected to the right database. This was also trivial to achieve: there is a nice tutorial on creating Drupal modules that made it easy.

I pre-created a table called 'blogdata' to contain the data I wanted:

CREATE TABLE blogdata (
  id int(11),
  title varchar(200),
  timestamp int(10),
  body text,
  done int(3) default '0'
);

And then populated it:

insert into blogdata select id, title, timestamp, concat(body, extended), 0 from serendipity_entries;

The relevant part of my Drupal module (which I could have actually stuck into any existing Drupal module in order for it to be run) was as follows:

// Fetch the entries that have not been imported yet, oldest first.
$q = mysql_query("select * from blogdata where done=0 order by timestamp asc");
while ($f = mysql_fetch_assoc($q)) {
   // Build the node as an array; utf8_encode() converts the latin1 text to UTF-8.
   $newent = array('created' => $f['timestamp'], 'title' => utf8_encode($f['title']), 'body' => utf8_encode($f['body']),
      'teaser' => utf8_encode($f['body']), 'format' => 3, 'uid' => 1, 'type' => 'blog', 'status' => 1, 'comment' => 2,
      'promote' => 0, 'sticky' => 0);
   // node_save() wants an object, so cast the array before saving.
   $newento = (object)$newent;
   node_save($newento);
}
// Mark every row as imported so a second run does not create duplicates.
$q = mysql_query("update blogdata set done=1");

I did actually have some issues at first because my data was encoded in ISO-8859-1/latin1, and Postgres was expecting UTF-8 data, but as you can see, I call the PHP utf8_encode() function to get around this. Many thanks to bje (whose domain is ironically called "serendipity" ;-) for getting my mind straight when I was being kak about this.
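
For anyone hitting the same encoding problem, here is a minimal sketch of what that conversion does to the bytes. The helper name latin1_to_utf8() is made up for illustration and is not part of the module above. It uses mb_convert_encoding(), which needs the mbstring extension, as one alternative to utf8_encode(), which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper, not from the original module:
// convert an ISO-8859-1 (latin1) string to UTF-8.
function latin1_to_utf8(string $s): string {
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// "café" as latin1 bytes: the é is the single byte 0xE9.
$latin1 = "caf\xE9";
$utf8   = latin1_to_utf8($latin1);

echo bin2hex($latin1), "\n";   // 636166e9
echo bin2hex($utf8), "\n";     // 636166c3a9
```

The single latin1 byte 0xE9 becomes the two-byte UTF-8 sequence 0xC3 0xA9, while plain ASCII text comes through unchanged.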

And that was it. My blog entries were imported perfectly. I still need to go through a few of them and fix entries that still hard-link to rucus, but that shouldn't take too long.

The only gripe with Drupal at first was the horrible URLs it created: "/node/124" sort of thing. However, with the nifty Pathauto module, those are a thing of the past.
