rsync and bzip2 or gzip compressed data
A few days ago, I learned that gzip has a custom option --rsyncable on Debian (and thus also Ubuntu). This old write-up covers it well, or you can just man gzip on a Debian-based system and see the --rsyncable option note.
I hadn’t heard of this before and think it’s pretty neat. It resets the compression algorithm on block boundaries so that rsync won’t view every block subsequent to a change as completely different.
Because bzip2 has such large block sizes, it forces rsync to resend even more data for each plaintext change than plain gzip does, as noted here.
Enter pbzip2. Based on how it works, I suspect that pbzip2 will be friendlier to rsync, because each thread’s compressed chunk has to be independent of the others. (However, pbzip2 can only operate on real input files, not stdin streams, so you can’t use it with e.g. tar cj directly.)
In the case of gzip --rsyncable and pbzip2, you trade slightly lower compression efficiency (< 1% or so worse) for reduced network usage by rsync. This is probably a good tradeoff in many cases.
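As a rough way to see the size cost on your own data, the sketch below compresses the same file with and without --rsyncable and lists both results. The sample input generated with seq and the temporary directory are placeholders of mine, not from the write-up, and the option is only used if the installed gzip advertises it in its help text.

```shell
#!/bin/sh
# Scratch directory and sample input (both are illustrative placeholders).
work=$(mktemp -d)
seq 1 200000 > "$work/data.txt"

# Baseline: ordinary gzip output.
gzip -c "$work/data.txt" > "$work/plain.gz"

# Only try --rsyncable if this gzip build lists the option in its help.
if gzip --help 2>&1 | grep -q -- --rsyncable; then
    gzip -c --rsyncable "$work/data.txt" > "$work/rsyncable.gz"
    # Compare the two sizes by eye.
    ls -l "$work"/*.gz
else
    echo "this gzip build has no --rsyncable option"
fi
```

The size difference will vary with the input, so treat the output as a spot check rather than a benchmark.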
But even more interesting for me, a couple of days ago Avery Pennarun posted an article about his experimental code to …
hosting git compression
Using ln -sf to replace a symlink to a directory
When you want to forcibly replace a symbolic link on some kind of Unix (here I’m using the version of ln from GNU coreutils), you can do it the manual way:
rm -f /path/to/symlink
ln -s /new/target /path/to/symlink
Or you can provide the -f argument to ln to have it replace the existing symlink automatically:
ln -sf /new/target /path/to/symlink
(I was hoping this would be an atomic action such that there’s no brief period when /path/to/symlink doesn’t exist, as when mv moves a file over top of an existing file. But it’s not. Behind the scenes it tries to create the symlink, fails because a file already exists, then unlinks the existing file and finally creates the symlink.)
Anyway, that’s convenient, but I ran into a gotcha which was confusing. If the existing symlink you’re trying to replace points to a directory, the above actually creates a symlink inside the dereferenced directory the old symlink points to. (Or fails if the referent is invalid.)
To replace an existing directory symlink, use the -n argument to ln:
ln -sfn /new/target /path/to/symlink
That’s always what I have wanted it to do, so I need to remember the -n.
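Here is a small sketch that reproduces the gotcha in a scratch directory, assuming ln from GNU coreutils. The directory and link names (old, new, current) are made up for illustration. The last step shows one commonly used way to get an atomic swap, by creating the new symlink under a temporary name and renaming it over the old one with mv -T, which is also a GNU coreutils option.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing else is touched.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir old new
ln -s old current              # "current" is a symlink to the directory "old"

# Without -n, ln follows "current" and creates the new link inside old/.
ln -sf new current
before=$(readlink current)     # still "old"
stray=$(ls old)                # a stray link named "new" now sits in old/

# With -n, the existing symlink itself is replaced.
rm old/new
ln -sfn new current
after=$(readlink current)      # now "new"

echo "without -n: current -> $before (stray entry in old/: $stray)"
echo "with -n:    current -> $after"

# Atomic variant (GNU mv): build the link under a temporary name,
# then rename it over the existing symlink in a single step.
ln -s old current.tmp
mv -T current.tmp current
```

The rename approach avoids the brief window in which the symlink does not exist, at the cost of needing a temporary name in the same directory.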
hosting tips
GNU Screen: follow the leader
First of all, if you’re not using GNU Screen, start now :).
Years ago, Jon and I spoke of submitting patches to implement some form of “follow the leader” (like the children’s game, but with a work-specific purpose) in GNU Screen. This was around the time he was patching screen to raise the hard-coded limit of 40 windows allowed within a given session, which might give an idea of how much screen gets used around here (a lot).
The basic idea was that sometimes we just want to “watch” a co-worker’s process as they’re working on something within a shared screen session. Of course, they’re going to be switching between screen windows and if they forget to announce “I’ve switched to screen 13!” on the phone, then one might quickly become lost. What if the cooperative work session doesn’t include a phone call at all?
To the rescue, Screen within Screen.
Accidentally arriving at one screen session within another screen session is a pretty common “problem” for new screen users. However, creative use of two (or more) levels of nested screen during a shared session allows for a “poor man’s” follow the leader.
If the escape sequence of the outermost screen is changed to something other than …
environment open-source tips
Permission denied for postgresql.conf
I recently saw a problem in which Postgres would not start up when called via the standard ‘service’ script, /etc/init.d/postgresql. This was on a normal Linux box, Postgres was installed via yum, and the startup script had not been altered at all. However, running this as root:
service postgresql start
…simply gave a “FAILED”.
Looking into the script showed that output from the startup attempt should be going to /var/lib/pgsql/pgstartup.log. Tailing that file showed this message:
postmaster cannot access the server configuration file
"/var/lib/pgsql/data/postgresql.conf": Permission denied
However, the postgres user can see this file, as evidenced by an su to the account and viewing the file. What’s going on? Well, anytime you see something odd when using Linux, especially if permissions are involved, you should suspect SELinux. The first thing to check is if SELinux is running, and in what mode:
# sestatus
SELinux status: enabled
SELinuxfs mount: /selinux
Current mode: enforcing
Mode from config file: enforcing
Policy version: 21
Policy from config file: …
database postgres security
SEO: External Links and PageRank
I had a flash of inspiration to write an article about external links in the world of search engine optimization. I’ve created many SEO reports for End Point’s clients with an emphasis on technical aspects of search engine optimization. However, at the end of the SEO report, I always like to point out that search engine performance depends on having high-quality, fresh, and relevant content as well as popularity (for example, PageRank). The number of external links to a site is a large factor in its popularity, so it can positively influence search engine performance.
After wrapping up a report yesterday, I wondered if the external link data that I provide to our clients is meaningful to them. What is the average response when I report, “You should get high quality external links from many diverse domains”?
So, I investigated some data of well known and less well known sites to display a spectrum of external link and PageRank data. Here is the origin of some of the less well known domains referenced in the data below:
- www.petfinder.com: This is where my dogs came from.
- www.endpoint.com: That’s Us!
- www.sonypictures.com/movies/district9/: …
seo
Migrating Postgres with Bucardo 4
Bucardo just released a major version (4). The latest version, 4.0.3, can be found at the Bucardo website. The complete list of changes is available on the new Bucardo wiki.
One of the neat tricks you can do with Bucardo is an in-place upgrade of Postgres. While it still requires application downtime, you can minimize your downtime to a very, very small window by using Bucardo. We’ll work through an example below, but for the impatient, the basic process is this:
1. Install Bucardo and add large tables to a pushdelta sync
2. Copy the tables to the new server (e.g. with pg_dump)
3. Start up Bucardo and catch things up (e.g. copy all row changes since step 2)
4. Stop your application from writing to the original database
5. Do a final Bucardo sync, and copy over non-replicated tables
6. Point the application to the new server
With this, you can migrate very large databases from one server to another (or from Postgres 8.2 to 8.4, for example) with a downtime measured in minutes, not hours or days. This is possible because Bucardo supports replicating a “pre-warmed” database—one in which most of the data is already there.
Let’s test out this process, using the handy pgbench utility to create a …
open-source perl postgres bucardo
Client Side Twitter Integration
I was recently assigned a project, Crisis Consultation Services, that required an interesting solution. The site is essentially composed of five static pages and two dynamic components.
The first integration point required PayPal payment processing. Crisis Consultation Services links to PayPal, where payment processing is completed. Upon payment completion, the user is bounced back to a static receipt page. This integration was quite simple, as PayPal provides the exact form that must be included in the static HTML.
The second integration point required a unique solution. The service offered by the static brochure site is dependent on the availability and schedule of the company employees, so the service availability remains entirely dynamic. The obvious solution was to include dynamic functionality where the employees would update their availability. Some thoughts that crossed our minds of how to update the availability were:
- Could we build an app for the employees to update the availability given the budget constraints?
- Could the employees use ftp or ssh to upload a single file containing details on their availability?
- Are there other dynamic tools that we could use …
ecommerce javascript
Tests are contracts, not blank checks
Recently, I wrote up a new class and some tests to go along with it, and I was lazy and sloppy. My class had a fairly simple implementation (mostly a set of accessors, plus a to_s method). It looked something like this:
class Soldier
  attr_accessor :name, :rank, :serial_number

  def initialize(name, rank, serial_number)
    @name = name
    @rank = rank
    @serial_number = serial_number
  end

  def to_s
    "#{name}, #{rank}, #{serial_number}"
  end
end
I had been trying to determine the essential attributes of the class (e.g., what are the minimal elements of this class? Should I have a base class and then subclass it for the various differences, or should I have only a single class that contains everything I need?).
As a result of the speculative nature of the development, my tests only included a few of the attributes.
What’s wrong with that?
On the surface, there is nothing technically wrong with skipping accessor tests: after all, testing each accessor individually is really testing Ruby, not the code I wrote. Another excuse I made is that testing each individually is very non-DRY—the testing code itself has lots of duplication.
The problem is that the set of tests …
rails