<h2><a href="https://www.endpointdev.com/blog/2022/05/backing-up-your-saas-data-with-google-takeout/">Backing up your SaaS data with Google Takeout</a></h2>
<p>2022-05-31 by Seth Jensen</p>
<p><img src="/blog/2022/05/backing-up-your-saas-data-with-google-takeout/banner.webp" alt="A concrete building with rectangular windows at sunset"></p>
<!-- Photo by Seth Jensen -->
<p>Keeping backups is extremely important.</p>
<p>Losing important files can feel like a far-off problem, but the chance of misplacing a drive, theft, drive failure, accidental deletion, house fire, flood, etc., is much greater than we may think. For any files that matter, the benefits of backups outweigh their cost, so everyone should make regular backups of data they care about and can’t replace.</p>
<p>Even among people who regularly make backups, there is one area many of us neglect: all of that data on various online services, also called software as a service or SaaS: Google Drive, Apple iCloud, Microsoft OneDrive, Dropbox, Box, etc.</p>
<p>It’s true, the most volatile files are the ones sitting in a single location on your laptop or thumb drive, not those on Google, WordPress, or iCloud servers. The danger of losing files is not nearly as present with SaaS. You can’t drop Google on the floor and lose a couple terabytes of data, like you can a hard drive, but you can be locked out of your account, accidentally delete files, and lose data by missing a notice about a service shutting down. Not to mention the possibility that your SaaS provider is hacked and loses your data.</p>
<p>I recently realized I had about five years of photos, nine years of Google Drive “stuff,” and who knows what else, not backed up from my Google account. I decided to download it all with Google Takeout, the service Google provides for making account backups.</p>
<h3 id="google-takeout">Google Takeout</h3>
<p>Google introduced Takeout in 2011 as a service to export and download your data stored in Google products. It seems like the perfect option to easily back up all your files from Google’s servers. But how is the process, and how useful is the actual downloaded data?</p>
<p><a href="https://support.google.com/accounts/answer/3024190?hl=en">Downloading from Takeout</a> is quite painless. You select the services you want to back up, the formats you want them in (when there are multiple options), and start an export. When it’s ready, you get an email linking to a compressed file containing your data. You can export either to a <code>.zip</code> archive or a <code>.tgz</code> (<code>.tar.gz</code>) archive. Zip is more universally accessible, so if you don’t have or want extra software (such as 7-Zip), it is probably the better option.</p>
<p>One of the hardest things about keeping SaaS backups, even when they are easy to manage, is just remembering to do it. Backups become less useful if they’re six months or even years out of date.</p>
<p>Fortunately, Google Takeout has an option to automatically create a new full export 6 times over the course of a year and email you a download link. I’m not great at remembering to renew backups, so I’m letting them email me every two months with a new export. This is a more important feature than I originally thought, as researching for this post dragged on over nearly two months, despite feeling like I had very recently backed my files up.</p>
<p>One concern I’ve had with downloading SaaS data is whether it could be used in other apps, or imported again. I would prefer that my backups aren’t buried among thousands of lines of cruft. So I’d like to dive into the data and get a feel for how useful it would be if I actually needed it.</p>
<h3 id="how-useful-is-the-exported-data">How useful is the exported data?</h3>
<p>Most of Google Takeout’s data is reasonably well organized and usable — YouTube videos in MP4, calendars in ICS, photos in JPEG sorted by year as well as albums you’ve created. But there is plenty of inefficiency, and a few gotchas you need to watch for.</p>
<p>For some services, like Google Drive, you can select between a common editable format (DOCX for Docs, XLSX for Sheets, etc.) and a PDF render. There’s not a ton of variety in the choices, but the formats are generally common enough that you could easily open them in your program of choice.</p>
<p>My download came in three <code>.tgz</code> files, two around 50GB and one around 5GB. That’s not awful, despite some odd choices on Google’s part for space management. For example, Google Keep exports in a nice, usable JSON format, but also in HTML with a huge <code>style</code> tag. My JSON takes 500KB, while the HTML takes 1500KB. Luckily they don’t seem to do this with YouTube, Google Photos, or Drive, or else I would be more concerned about bloat.</p>
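<p>If the duplicate HTML copies bother you, it’s easy to see how much space they take in your local copy and drop them. A rough sketch, assuming the export was extracted into a <code>Takeout</code> directory and that GNU coreutils are available:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"># Compare the space used by the Keep JSON notes vs. their redundant HTML copies.
find Takeout/Keep -name '*.json' -print0 | du -ch --files0-from=- | tail -1
find Takeout/Keep -name '*.html' -print0 | du -ch --files0-from=- | tail -1
# If the JSON is all you want to keep locally:
# find Takeout/Keep -name '*.html' -delete
</code></pre></div>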
<p>Some exports are somewhere in the middle, like YouTube playlists, which only include the ID of the YouTube video. Helpful in the case of losing your account, but not so much if any of those videos are deleted from YouTube.</p>
<h3 id="what-does-takeout-exclude">What does Takeout exclude?</h3>
<p>Takeout includes most of the data you would care about: Gmail (in MBOX format, including attachments), Google Photos, Blogger, saved Maps places, your uploaded YouTube videos, etc. You can see a full list of services and formats in <a href="https://takeout.google.com/settings/takeout">Takeout itself</a>.</p>
<p>A major flaw with Takeout is that it only backs up data you are the owner of. That means that, for example, if a co-worker creates a document with nothing but a title and invites you to help work on it, you may have added dozens of pages of painfully earned writing, but Takeout doesn’t consider it yours, so it won’t get exported. You have to manually save your own copy, separately, for every shared file.</p>
<p>Sharing is one of the most useful things about Google’s SaaS options, so not having shared files backed up could largely defeat the purpose of the backup. You can download shared files separately, but any added effort to make a complete backup quickly becomes a hassle, and defeats part of the purpose of using Takeout.</p>
<p>Be on the lookout for shared files that might not be backed up. For me, that’s mostly Drive, Photos, and Calendar, but pay attention to other shared files you may want backed up.</p>
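<p>For Drive in particular, one workaround is to pull shared files with a separate tool. For example, rclone can copy the files other people have shared with you, which Takeout skips. This sketch assumes you have already configured a Google Drive remote, named <code>gdrive</code> here purely as an example:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"># List, then copy, files that are shared with you but owned by others.
rclone lsd gdrive: --drive-shared-with-me
rclone copy gdrive: ~/backup/drive-shared-with-me --drive-shared-with-me --progress
</code></pre></div>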
<p>While the data exports aren’t perfect, the potential loss is so great that finding some way to back up SaaS data is a no-brainer.</p>
<p>Signing off with an image I found in my Blogger backup: my initials, created using the GIMP, circa 2009.</p>
<p><img src="/blog/2022/05/backing-up-your-saas-data-with-google-takeout/srj.jpg" alt="S. R. J. initials with artificial sun background"></p>
<h2><a href="https://www.endpointdev.com/blog/2020/09/introduction-to-borg-backup/">Introduction to BorgBackup</a></h2>
<p>2020-09-10 by Kannan Ponnusamy</p>
<p><img src="/blog/2020/09/introduction-to-borg-backup/image-1.jpg" alt="Black and silver hard drive"></p>
<p><a href="https://unsplash.com/photos/ShHkXuZdpTw">Photo</a> by <a href="https://unsplash.com/@frank041985">Frank R</a></p>
<h3 id="what-is-borg">What is Borg?</h3>
<p><a href="https://www.borgbackup.org/">BorgBackup</a> (Borg for short) is a ‘deduplicating’ backup program that eliminates duplicate or redundant information. It optionally supports compression and authenticated encryption.</p>
<p>The main objective of Borg is to provide an efficient and secure way to back up data. Its deduplication makes the backup process very quick and effective.</p>
<h4 id="step-1-install-the-borg-backups">Step 1: Install the Borg backups</h4>
<p>On Ubuntu/Debian:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">apt install borgbackup
</code></pre></div><p>On RHEL/CentOS/Fedora:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">dnf install borgbackup
</code></pre></div><h4 id="step-2-initialize-local-borg-repository">Step 2: Initialize Local Borg repository</h4>
<p>First, the system that is going to be backed up needs a designated backup directory. Create a parent directory named ‘backup’, then a child directory called ‘borgdemo’, which will serve as the repository.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">mkdir -p /mnt/backup
borg init --encryption=repokey /mnt/backup/borgdemo
</code></pre></div><h4 id="step-3-lets-create-the-first-backup-archive">Step 3: Let’s create the first backup (archive)</h4>
<p>In Borg terms, each backup instance is called an archive. The following demonstrates how to back up the ‘photos’ directory, naming the archive ‘archive_1’.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">borg create --stats --progress /mnt/backup/borgdemo::archive_1 /home/kannan/photos
</code></pre></div><p>Note: the archive label for each backup run needs to be specified.</p>
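<p>If your Borg version supports placeholders, a date-based archive name saves you from inventing a new label by hand on every run. A small sketch:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"># Let Borg build the archive name from the hostname and the current date.
borg create --stats --progress /mnt/backup/borgdemo::'{hostname}-{now:%Y-%m-%d}' /home/kannan/photos
</code></pre></div>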
<h4 id="step-4-next-backup-incremental">Step 4: Next backup (Incremental)</h4>
<p>To create the next backup, run the same command again, but with a different, unique archive label.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">borg create --stats --progress /mnt/backup/borgdemo::archive_2 /home/kannan/photos
</code></pre></div><p>This backup is mostly identical to the previous one. Because of deduplication, the process not only runs faster this time, it is effectively incremental as well. The <code>--stats</code> flag reports how much space deduplication saved.</p>
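<p>To see those numbers again later for any existing archive, <code>borg info</code> reports the original, compressed, and deduplicated sizes:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">borg info /mnt/backup/borgdemo::archive_2
</code></pre></div>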
<h4 id="step-5-list-all-the-archives">Step 5: List all the archives</h4>
<p>The ‘borg list’ command lists all of the archives stored within the Borg repository.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">borg list /mnt/backup/borgdemo
</code></pre></div><h4 id="step-6-remote-borg-repository">Step 6: Remote Borg Repository</h4>
<p>Take the scenario where the backups of many servers need to be maintained on a separate server. In this case, a directory needs to be created for each of the systems that will be backed up. For this backup repository, create a folder named ‘backup’, and then within ‘backup’ a folder called ‘linode_01’. This folder will be initialized as a Borg repository.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">mkdir -p /mnt/backup/linode_01
borg init --encryption=repokey user@backup_server:/mnt/backup/linode_01
</code></pre></div><p>The username, backup_server hostname, and repository path can of course all be customized at the user’s discretion.</p>
<p>While initialising the repo, a passphrase for each backup repository can be set for authentication.</p>
<h4 id="step-7-create-an-initial-backup-to-the-remote-borg-repository">Step 7: Create an initial backup to the remote Borg repository</h4>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">borg create --stats ssh://user@backup_server/mnt/backup/linode_01::archive_1 /home/kannan/photos
</code></pre></div><p>To enable the remote backups, the following three environment variables can be used to simplify the automation process:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"><span style="color:#038">export</span> <span style="color:#369">BORG_REPO</span>=<span style="color:#d20;background-color:#fff0f0">'ssh://user@backup_server/mnt/backup/linode_01'</span>
<span style="color:#038">export</span> <span style="color:#369">BORG_PASSPHRASE</span>=<span style="color:#d20;background-color:#fff0f0">'set_your_passpharase'</span>
<span style="color:#038">export</span> <span style="color:#369">BORG_RSH</span>=<span style="color:#d20;background-color:#fff0f0">'ssh -i /home/kannan/.ssh/id_rsa_backups'</span>
</code></pre></div><p>With those environment variables set, the ‘borg create’ command can be shortened to the following:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">borg create --stats ::archive_1 /home/kannan/photos
</code></pre></div><h4 id="step-8-excluding-certain-directories-or-files">Step 8: Excluding certain directories or files</h4>
<p>To exclude certain directories or files, pass the create command one <code>--exclude</code> option per pattern (or point <code>--exclude-from</code> at a file of patterns). For example, the following command excludes <code>/dev</code> and <code>/opt</code>:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">borg create --stats ::archive_1 / --exclude /dev /opt
</code></pre></div><h4 id="step-9-restoring-an-archive-through-extraction">Step 9: Restoring an archive through extraction</h4>
<p>The ‘borg extract’ command extracts the contents of an archive. By default, the entire archive is extracted, but extraction can be limited by passing directory or file paths as arguments to the command. For example, this is how a single photo can be extracted from the photos archive:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">borg extract ::archive_1 /home/kannan/photos/sunrise.jpg
</code></pre></div><h4 id="step-10-pruning-older-backups">Step 10: Pruning older backups</h4>
<p>Every backup solution should have a way to maintain the older backups. Borg offers us <code>borg prune</code> for this. It prunes a repository by deleting all archives not matching any of the specified retention options.</p>
<p>For example, the following keeps the 10 most recent daily archives, the 6 most recent end-of-week archives, and the 3 most recent end-of-month archives:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">borg prune -v --list --keep-daily=<span style="color:#00d;font-weight:bold">10</span> --keep-weekly=<span style="color:#00d;font-weight:bold">6</span> --keep-monthly=<span style="color:#00d;font-weight:bold">3</span> ::
</code></pre></div><p>Note that the double colons <code>::</code> are required in order to automatically use the environment variables that were set prior.</p>
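<p>Putting the pieces above together, here is a rough sketch of a script that could run nightly from cron. The paths, repository URL, and passphrase handling are examples only; storing a passphrase in a script has its own security trade-offs, so adjust to your environment:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">#!/bin/bash
# Nightly Borg backup: create a dated archive, then prune old ones.
export BORG_REPO='ssh://user@backup_server/mnt/backup/linode_01'
export BORG_PASSPHRASE='set_your_passphrase'
export BORG_RSH='ssh -i /home/kannan/.ssh/id_rsa_backups'

borg create --stats ::"$(hostname)-$(date +%Y-%m-%d)" /home/kannan/photos
borg prune -v --list --keep-daily=10 --keep-weekly=6 --keep-monthly=3 ::
</code></pre></div>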
<p>For more in-depth documentation on Borg backup, <a href="https://borgbackup.readthedocs.io/en/stable/quickstart.html">read the docs</a>.</p>
<h2><a href="https://www.endpointdev.com/blog/2020/02/end-point-security-tips/">End Point Security Tips: Securing your Infrastructure</a></h2>
<p>2020-02-05 by Charles Chang</p>
<img src="/blog/2020/02/end-point-security-tips/image-4.jpg" alt="phishing key on keyboard">
<p><a href="https://flic.kr/p/24YXTiY">Photo</a> from <a href="https://www.comparitech.com/blog/information-security/common-phishing-scams-how-to-avoid/">comparitech.com</a></p>
<h3 id="implement-security-measures-to-protect-your-organization--employees">Implement Security Measures to Protect Your Organization & Employees</h3>
<p>In this post, I’ll address what I believe are the three most important initiatives every organization should implement to protect itself and its employees:</p>
<ol>
<li>Train employees on security culture.</li>
<li>Implement the best technical tools to aid with organizational security.</li>
<li>Implement recovery tools in case you need to recover from a security breach.</li>
</ol>
<h3 id="habits-of-a-security-culture">Habits of a Security Culture</h3>
<p>Train everyone in your organization on these fundamentals:</p>
<ol>
<li>The only time you should be requested to reset your password by email is when you initiate it. There are rare exceptions to this rule, such as when accounts are compromised and providers request all users reset their passwords, but those events should be publicly announced. Staff can confirm with security personnel before acting on such requests.</li>
<li>If you are asked to reset your password, it will typically be after you successfully logged into a website and the old one has expired.</li>
<li>If you receive an email and do not know the sender, do not trust the contents or open attachments. Get advice from security personnel if needed.</li>
<li>If you think the email is from your bank, keep in mind that banks do not ask their clients for private information via email.</li>
<li>If you think the social security office emailed you to obtain your personal information, keep in mind that they do not initiate or solicit private information via email.</li>
<li>Companies should not solicit private information unless you initiate first.</li>
<li>Online retailers should not ask for your private information unless you initiate first.</li>
</ol>
<h3 id="a-security-concern-going-phishing">A Security Concern: Going Phishing!</h3>
<div style="float: right; padding: 20px;"><img src="/blog/2020/02/end-point-security-tips/image-1.jpg" alt="phishing fraud" align="right" hspace="10"><p><a href="https://flic.kr/p/2gLaNqk">Photo</a> by <a href="https://www.epictop10.com/">Epic Top 10 Site</a></div>
<p>One of the more common ways to steal someone’s private information is through phishing. Phishing is like fishing: the attacker is trying to catch something, and in this case the ‘fish’ is your data. Someone with malicious intent sends you an email trying to get you to click the link, picture, or other content within it. Once you click, it might take you to a website that asks you to enter or reset your password, or even asks for your social security number or other personal information. The attacker can then use the information collected to open accounts, purchase items online, or resell your personal data. The links within a phishing email might also redirect you to a fake website that mimics a real one to collect your personal data. The goal is to convince the recipient that the message is legitimate so that they hand over personal information.</p>
<h4 id="phishing-exercises">Phishing Exercises</h4>
<div style="float: left; padding: 20px;"><img src="/blog/2020/02/end-point-security-tips/image-2.jpg" alt="phishing attack" hspace="10"><p><a href="https://flic.kr/p/x9zZ4A">Photo</a> by <a href="https://www.flickr.com/photos/christiaancolen/">Christiaan Colen</a></p></div>
<p>One way to help staff better understand phishing email is to run phishing exercises: set up an experiment to see which users click on a test phishing email. If a user clicks on it, the email administrator notifies the compliance officer, who re-trains the staff on how to tell real emails from phishing emails. Phishing exercises are a great way for employees to learn to avoid email scams and exercise more caution in the future.</p>
<h3 id="essential-security-tools">Essential Security Tools</h3>
<h4 id="firewall">Firewall</h4>
<p>A firewall should be mandatory technology for every consumer and business. It is basically your main door: the firewall protects the office or organization’s technology. When you sit in your office working on a computer, imagine that you are in a fort surrounded by walls. If someone needs to come in, a gatekeeper must give them permission to enter the fort. A firewall is very similar: it allows network traffic in and out of the office based on the configuration set by a network engineer, who determines what traffic may come into the office and what may go out.</p>
<p>Companies typically test their firewall with penetration tests and network scans to determine whether any security concerns remain after implementing it. This testing verifies that good security practices are in place and that the firewall is set up properly and securely.</p>
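<p>As a simple, hedged example of what one small part of such a check might look like, a port scan run from outside the network (with permission) can confirm that only the ports you intend to expose are open. The hostname here is a placeholder:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"># Scan the firewall's public address for unexpectedly open ports.
nmap -Pn -p 1-65535 -oN firewall-scan.txt firewall.example.com
</code></pre></div>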
<p>Some hardware firewall devices to consider include:</p>
<ul>
<li><a href="https://www.watchguard.com/wgrd-products/rack-mount/firebox-m270-m370" target="_blank">WatchGuard</a></li>
<li><a href="https://www.cisco.com/c/en/us/products/security/firewalls/index.html" target="_blank">Cisco Firepower</a></li>
<li><a href="https://www.ui.com/unifi-routing/unifi-security-gateway-pro-4/" target="_blank">UniFi Security Gateway Pro</a></li>
</ul>
<h4 id="network-assessment--internal-threat-protection">Network Assessment & Internal Threat Protection</h4>
<p>Why is security vulnerability testing necessary? Many organizations have legacy servers, or desktops/laptops with operating systems that are no longer supported. For example, outdated Microsoft Windows XP and 7 can be compromised by malware while browsing the Internet due to <a href="https://www.pcworld.com/article/3400698/nsa-warns-that-bluekeep-vulnerability-in-windows-xp-and-windows-7-is-especially-dangerous.html">the BlueKeep vulnerability</a>. Systems that are not patched could leave your organization exposed to malware such as ransomware, which holds your data hostage until you pay to unlock it.</p>
<p>Security vulnerability testing typically results in a report outlining problem areas, such as outdated operating systems or private data that does not belong on a file server, like social security or credit card numbers (personally identifiable information, or PII). If PII is needed for the organization to operate, then higher security standards and an encryption system for storing the private data are needed. Vulnerability testing can also check whether your environment is HIPAA or PCI DSS compliant, if those security standards apply to you.</p>
<p>Some security vulnerability testing technology includes:</p>
<ul>
<li><a href="https://www.rapidfiretools.com/products/network-detective/" target="_blank">Rapid Fire Tools Network Detective</a></li>
<li><a href="https://www.rapidfiretools.com/products/cyber-hawk/" target="_blank">Rapid Fire Tools Cyberhawk</a></li>
</ul>
<h4 id="enterprise-antivirus-systems">Enterprise Antivirus Systems</h4>
<p>Antivirus software is the last line of defense if malware enters your computer. Antivirus alone will not protect you 100% from being infected by malware or a virus, but if you have multiple layers of security in place, the chances that your organization’s systems will be compromised are much lower. A company called <a href="https://www.av-comparatives.org/tests/business-security-test-2019-march-june/">AV-Comparatives assessed some of the popular antivirus software</a> on the market.</p>
<p>The battle with malware is endless. Case in point: The <a href="https://techcrunch.com/2019/05/12/wannacry-two-years-on/">WannaCry</a> malware affected over 200,000 machines across the world and spread quickly. Security researchers quickly realized the malware was spreading like a computer worm, across computers and over the network, using the Windows SMB protocol.</p>
<p>New variants of viruses and malware are developed every day, so antivirus companies are hard at work daily developing ways to block and remove them.</p>
<p>Some antivirus software includes:</p>
<ul>
<li><a href="https://www.webroot.com/us/en/business/smb/endpoint-protection" target="_blank">Webroot Antivirus</a></li>
<li><a href="https://www.broadcom.com/products/cyber-security/endpoint/end-user" target="_blank">Symantec Endpoint Security</a></li>
</ul>
<h4 id="web-filtering-technology">Web Filtering Technology</h4>
<p>Web filtering technology blocks websites that are malicious or deemed not appropriate to visit from an organization’s network. For example, websites for gambling could be blocked to reduce employee distractions, but also to reduce access to sites popularly infected with malware, reducing the possibility of malware coming into your network.</p>
<p>There are many competitors out there in this competitive market, and some vendors offer free proof-of-concept testing with their product before you make a big investment. Take a look at:</p>
<ul>
<li><a href="https://www.forcepoint.com/product/url-filtering" target="_blank">Forcepoint Web Filtering</a></li>
<li><a href="https://www.webtitan.com/webtitan-gateway/" target="_blank">Web Titan Gateway</a></li>
</ul>
<h4 id="data-loss-prevention">Data Loss Prevention</h4>
<p>Employees’ and businesses’ private information is sensitive and should be protected. Businesses, whether audited or not, should always protect their employees’ private information. Twenty years ago private information was kept on paper and locked in a file cabinet, but in 2020 most private information is stored digitally. How do companies keep private information from leaving their office?</p>
<p>Physically you probably can’t stop someone from walking out with private information, but digitally there is technology called data loss prevention (DLP) that can help keep confidential information from leaving the office. For example, if someone in the office copies private information onto USB storage, or tries to send a social security number or credit card information via email, DLP software can often prevent it. Prepare your business for regulations involving personal information with a DLP solution.</p>
<p>Some DLP solutions to consider include:</p>
<ul>
<li><a href="https://www.forcepoint.com/product/dlp-data-loss-prevention" target="_blank">Forcepoint DLP</a></li>
<li><a href="https://www.broadcom.com/products/cyber-security/information-protection/data-loss-prevention" target="_blank">Symantec Data Loss Prevention</a></li>
</ul>
<h4 id="email-filtering">Email Filtering</h4>
<p>Probably one of the easiest and oldest methods to infect or phish someone is via email. There are multiple mechanisms email filtering uses to stop malicious emails: SPF and DKIM, blacklists, etc. On top of these configurable items, email filtering software vendors often release updates throughout the day to block the latest known malware or spam based on heuristics.</p>
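<p>You can check a domain’s email authentication records yourself with <code>dig</code>. The domain and DKIM selector below are placeholders; your email provider documents the selector it uses:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">dig +short TXT example.com                         # look for a "v=spf1 ..." record
dig +short TXT selector1._domainkey.example.com    # DKIM public key record
dig +short TXT _dmarc.example.com                  # DMARC policy record
</code></pre></div>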
<p>Cloud email services such as Microsoft Office 365 are for many businesses superior to on-premises email servers because they include all the bells and whistles to proactively protect your email environment, such as spam & virus filtering and email archiving for retention purposes. If you need email encryption to protect sensitive emails, that feature is also available.</p>
<p>The distinct advantage of using hosted email service is that the cost is predictable and includes maintenance, so your email administrators do not have to worry about updating the email server software or hardware.</p>
<p>Email filtering technology available includes:</p>
<ul>
<li><a href="https://www.microsoft.com/en-us/security/business/threat-protection/office-365-defender" target="_blank">Microsoft Defender for Office 365</a></li>
<li><a href="https://www.forcepoint.com/product/email-security" target="_blank">Forcepoint Email Filtering</a></li>
<li><a href="https://www.spamtitan.com/email-filtering-service/" target="_blank">SpamTitan</a></li>
</ul>
<h4 id="two-factor-authentication-2fa">Two Factor Authentication (2FA)</h4>
<p>Two-factor authentication technology from companies like Duo (owned by Cisco) and Google is a great tool to implement to improve your overall security. Beyond the usual username and password, your smartphone or a hardware token is used to authorize access to a website, an application, or a network. This technology is now well established, available in most online services, and can be added to your own custom business applications.</p>
<p>Some starting points:</p>
<ul>
<li><a href="https://duo.com/product/multi-factor-authentication-mfa" target="_blank">Duo Multi-Factor Authentication</a></li>
<li><a href="https://www.google.com/landing/2step/" target="_blank">Google Authenticator</a></li>
</ul>
<h4 id="vpn-virtual-private-network">VPN (Virtual Private Network)</h4>
<p>Virtual private network technology allows businesses to provide secure access to office systems or applications for employees who travel or work from a remote location. This matters because the data traveling to and from the source, and many details of the traffic patterns, are encrypted, so captured traffic cannot be read. VPN solutions have been around for years and are an important tool to securely and safely protect data in transit.</p>
<h4 id="password-reset-systems">Password Reset Systems</h4>
<p>Why are self-service password reset systems necessary, and are they secure? They eliminate many manual password assignment mistakes, keep passwords private to the user alone, and allow an organization to integrate two-factor authentication to securely reset user passwords for systems such as Active Directory for Windows or other single sign-on accounts. Each user needs to be onboarded, which is done by sending them an email with a registration link.</p>
<p>The password reset system could be an internal system only available to your organization. Another possibility is to access the system via VPN or even through a proxy server in the DMZ to allow password reset from anywhere.</p>
<p>Some password reset systems include:</p>
<ul>
<li><a href="https://www.manageengine.com/products/self-service-password/" target="_blank">ManageEngine ADSelfService Plus</a></li>
<li><a href="https://docs.microsoft.com/en-us/azure/active-directory/authentication/concept-sspr-howitworks" target="_blank">Azure AD Self-Service Password Reset</a></li>
</ul>
<h3 id="recovery-tools">Recovery Tools</h3>
<h4 id="data-and-system-backupsdisaster-recovery">Data and System Backups/Disaster Recovery</h4>
<p>Data backup and disaster recovery planning are critical for a business to speed up recovery if malware attacks your systems or a system becomes inoperable.</p>
<p>If you are attacked by ransomware and all the critical systems and desktops are compromised and rendered useless, then the only way to recover is from backups. At that point, the disaster recovery preparation you made will be well worth what you invested in it. If you did not have an adequate backup or disaster recovery plan, your organization would have to start from scratch and rebuild all systems and desktops, which could take weeks if not months. The lost wages, and possibly lost clients, from having no working systems or desktops would probably cost you far more than the preparation does. The more options you have, the better the chances that you will get out of your bind without prolonging the situation.</p>
<p>Some backup solutions I have worked with have the flexibility to mix and match to fit your configuration needs:</p>
<ul>
<li><a href="https://www.acronis.com/en-us/products/cloud/cyber-protect/backup/" target="_blank">Acronis Backup Solution</a></li>
<li><a href="https://www.rubrik.com/solutions/backup-recovery" target="_blank">Rubrik</a></li>
<li><a href="https://azure.microsoft.com/en-us/services/backup/" target="_blank">Azure Backup</a></li>
</ul>
<p>Acronis Cloud Backup allows you to back up to local storage, the cloud, or off-premises storage, recover a system using a USB drive, back up Office 365 email, and back up VMware or Hyper-V environments. It also allows you to recover a single file if needed.</p>
<h3 id="additional-security-recommendations">Additional Security Recommendations</h3>
<h4 id="ssltls-certificates">SSL/TLS Certificates</h4>
<p>What are SSL and TLS? SSL is outdated and has been replaced by TLS, but people often still use the familiar SSL name. Both are encryption protocols that keep sensitive information sent across the Internet encrypted so that only the intended recipient can access it.</p>
<p>When an SSL certificate is validated, the transmitted data is not only unreadable by anyone except for the intended server, but you are relatively well assured you are communicating with the organization you expected and not with a malicious intermediary. Read more details in <a href="https://www.sslshopper.com/why-ssl-the-purpose-of-using-ssl-certificates.html">SSL Shopper’s Why SSL?</a> article.</p>
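<p>A quick way to see what certificate a server is actually presenting, and whether it has expired, is to ask it with <code>openssl</code>; replace the hostname with the site you want to check:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"># Print the subject, issuer, and validity dates of the served certificate.
echo | openssl s_client -connect www.example.com:443 -servername www.example.com 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
</code></pre></div>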
<p>SSL certificates were historically used primarily for higher-security web servers, but now are commonly used on almost all web servers. They are also used for other remote server access, B2B server-to-server communication, VPN access, email systems, etc.</p>
<h3 id="recent-destructive-incidents">Recent Destructive Incidents</h3>
<p>These incidents from just a few weeks in fall 2019 show that we have a long way to go in protecting our technology usage:</p>
<ol>
<li><a href="https://www.nbcnewyork.com/news/local/long-island-schools-hacked-district-forced-to-pay-88000-in-ransom/1491924/" target="_blank">Long Island Schools Hacked; District Forced to Pay $88,000 in Ransom</a></li>
<li><a href="https://www.nbcnewyork.com/news/local/ny-school-delays-start-of-year-after-ransomware-attack/1990459/" target="_blank">NY School Delays Start of Year After Ransomware Attack</a></li>
<li><a href="https://www.newsweek.com/tortoiseshell-hacker-hire-military-heroes-fake-website-1461320" target="_blank">Veterans Targeted by Hackers Through Fake Military Heroes Hiring Website</a></li>
<li><a href="https://www.zdnet.com/article/hackers-target-transportation-and-shipping-industries-in-new-trojan-malware-campaign/" target="_blank">Hackers Target Transportation and Shipping Companies in New Trojan Malware Campaign</a></li>
<li><a href="https://inc42.com/buzz/hacking-routers-webcams-printers-are-the-most-searched-keywords-on-the-dark-web/" target="_blank">Hacking of IoT – Internet of Things (webcams, security cams, printers, home routers)</a></li>
</ol>
<h3 id="speak-with-us">Speak With Us!</h3>
<p>Keeping up with security can be a full-time job. If you need professional consulting on security tools and implementing them, <a href="/contact/">contact</a> our team today.</p>
<h2><a href="https://www.endpointdev.com/blog/2012/09/defense-in-depth/">Defense in Depth</a></h2>
<p>2012-09-28 by Zed Jensen</p>
<p>“Defense in depth” is a way to build security systems so that when one layer of defense fails, there is another to take its place. If breaking in is hard enough for an attacker, it’s likely that they’ll abandon their assault, deciding it’s not worth the effort. Making the various layers different types of defense also makes it harder to get in, so that an attacker is less likely to get through all the layers. It can also keep one mistake from causing total security failure.</p>
<p>For example, if you have an office with one door and leave it unlocked by accident, then anyone can just walk in. However, if you have an office building with a main door, then a lobby and hallways, and then additional inner locked doors, then the chance of accidentally leaving both unlocked is small. If you accidentally leave the inner door open or unlocked, someone can only get into the office if they first get through the outer door.</p>
<p>Another example of defense in depth is making and maintaining offsite computer backups. The chance of an office being destroyed and all your data with it is low, but not zero. If you maintain offsite backups, then your losses in a catastrophe are reduced.</p>
<p>Another way to decrease risk of data loss is to keep multiple backups in widely separated locations, held by different people or companies. This ensures you still have access to data even when one account is locked out, a storage facility goes out of business, or an account is hijacked by an attacker.</p>
<p>What about the rogue insider threat? You can protect each backup with different credentials, and ensure few or no employees have access to more than one backup.</p>
<p>A security scheme is only as strong as its weakest link, so your various defenses need to be on par with each other. If your office is locked up nice and tight, but anyone can get into your computer over the network, then all your office security isn’t going to do much good and vice versa.</p>
<p>When considering security, it’s important to think about the cost compared to the potential loss. Picture a house. With a normal house, it’s not too hard to get in; all you have to do is break a window. Considering that, is it really worth spending lots of money on fancy locks for the doors? You can add bars to the windows and install multiple fancy door locks, but is it worth the inconvenience and ugliness? Perhaps moving to a new neighborhood is a better solution.</p>
<p>Recognizing that no defense is foolproof, and that a combination of cheap defenses may be more effective than one expensive defense, consider the concept of defense in depth whenever looking at protection, whether it be of an office building, website, or castle.</p>
<p>These articles go into more detail on the topic:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Defense_in_depth_(computing)">http://en.wikipedia.org/wiki/Defense_in_depth_(computing)</a></li>
<li><a href="https://www.owasp.org/index.php/Defense_in_depth">https://www.owasp.org/index.php/Defense_in_depth</a></li>
<li><a href="http://www.nsa.gov/ia/_files/support/defenseindepth.pdf">http://www.nsa.gov/ia/_files/support/defenseindepth.pdf</a></li>
</ul>
<h2><a href="https://www.endpointdev.com/blog/2010/04/restoring-individual-table-data-from/">Restoring individual table data from a Postgres dump</a></h2>
<p>2010-04-20 by Greg Sabino Mullane</p>
<p>Recently, one of our clients needed to restore the data in a specific table from the previous night’s PostgreSQL dump file. Basically, an UPDATE query did not do what it was supposed to, and some of the columns in the table were irreversibly changed. So the challenge was to quickly restore the contents of that table.</p>
<p>The SQL dump file was generated by the <a href="https://www.postgresql.org/docs/current/static/app-pg-dumpall.html">pg_dumpall</a> command, and thus there was no easy way to extract individual tables. If you are using the <a href="https://www.postgresql.org/docs/current/static/app-pgdump.html">pg_dump</a> command, you can specify a “custom” dump format by adding the <strong>-Fc</strong> option. Then, pulling out the data from a single table becomes as simple as adding a few flags to the <a href="https://www.postgresql.org/docs/current/static/app-pgrestore.html">pg_restore</a> command like so:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ pg_restore --data-only --table=alpha large.custom.dumpfile.pg > alpha.data.pg
</code></pre></div><p>One of the drawbacks of using the custom format is that it is only available on a per-database basis; you cannot use it with pg_dumpall. That was the case here, so we needed to extract the data of that one table from within the large dump file. If you know me well, you might suspect at this point that I’ve written yet another handy perl script to tackle the problem. As tempting as that may have been, time was of the essence, and the wonderful array of Unix command line tools already provided me with everything I needed.</p>
<p>Our goal at this point was to pull the data from a single table (“alpha”) from a very large dump file (“large.dumpfile.pg”) into a separate and smaller file that we could use to import directly into the database.</p>
<p>The first step was to find exactly where in the file the data was. We knew the name of the table, and we also know that a dump file inserts data by using the <a href="https://www.postgresql.org/docs/current/static/sql-copy.html">COPY</a> command, so there should be a line like this in the dump file:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql"><span style="color:#080;font-weight:bold">COPY</span><span style="color:#bbb"> </span>alpha<span style="color:#bbb"> </span>(a,b,<span style="color:#080;font-weight:bold">c</span>,d)<span style="color:#bbb"> </span><span style="color:#080;font-weight:bold">FROM</span><span style="color:#bbb"> </span><span style="color:#080;font-weight:bold">stdin</span>;<span style="color:#bbb">
</span></code></pre></div><p>Because all the COPYs are done together, we can be pretty sure that the command after “COPY alpha” is another COPY. So the first thing to try is:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ grep -n COPY large.dumpfile.pg | grep -A1 <span style="color:#d20;background-color:#fff0f0">'COPY alpha '</span>
</code></pre></div><p>This uses grep’s handy <strong>-n</strong> option (aka <strong>--line-number</strong>) to output the line number that each match appears on. Then we pipe that back to grep, search for our table name, and print the line after it with the <strong>-A</strong> option (aka <strong>--after-context</strong>). The output looked like this:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ grep -n COPY large.dumpfile.pg | grep -A1 <span style="color:#d20;background-color:#fff0f0">'COPY alpha '</span>
1233889:COPY alpha (cdate, who, state, add, remove) FROM stdin;
12182851:COPY alpha_sequence (sname, value) FROM stdin;
</code></pre></div><p>Note that many of the options here are GNU specific. If you are using an operating system that doesn’t support the common GNU tools, you are going to have a much harder time doing this (and many other shell tasks)!</p>
<p>We now have a pretty good guess at the starting and ending lines for our data: 1233889 to lines 12182850 (we subtract 1 as we don’t want the next COPY). We can now use <strong>head</strong> and <strong>tail</strong> to extract the lines we want, once we figure out how many lines our data spans:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ <span style="color:#038">echo</span> <span style="color:#00d;font-weight:bold">12182851</span> - <span style="color:#00d;font-weight:bold">1233889</span> | bc
<span style="color:#00d;font-weight:bold">10948962</span>
$ head -12182850 large.dumpfile.pg | tail -10948962 > alpha.data.pg
</code></pre></div><p>However, what if the next command was not a COPY? We’ll have to scan forward for the end of the COPY section, which is always a backslash and a single dot at the start of a new line. The new command becomes (all one line, but broken down for readability):</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ grep -n COPY large.dumpfile.pg <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | grep -m1 <span style="color:#d20;background-color:#fff0f0">'COPY alpha'</span> <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | cut -d: -f1 <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | xargs -Ix tail --lines=+x large.dumpfile.pg <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | grep -n -m1 <span style="color:#d20;background-color:#fff0f0">'^\\\.'</span>
</code></pre></div><p>That’s a lot, but in the spirit of Unix tools doing one thing and one thing well, it’s easy to break down. First, we grab the line numbers where COPY occurs in our file, then we find the first occurrence of our table (using the <strong>-m</strong> aka <strong>--max-count</strong> option). We cut out the first field from that output, using a colon as the delimiter. This gives us the line number where the COPY begins. We pass this to xargs, and tail the file with a <strong>--lines=+x</strong> argument, which outputs all lines from that file <em>starting</em> at the given line number. Finally, we pipe that output to grep and look for the end-of-copy indicator, stopping at the first one, and also outputting the line number. Here’s what we get:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ grep -n COPY large.dumpfile.pg <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | grep -m1 <span style="color:#d20;background-color:#fff0f0">'COPY alpha'</span> <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | cut -d: -f1 <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | xargs -Ix tail --lines=+x large.dumpfile.pg <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | grep -n -m1 <span style="color:#d20;background-color:#fff0f0">'^\\\.'</span>
148956:<span style="color:#04d;background-color:#fff0f0">\.</span>
xargs: tail: terminated by signal <span style="color:#00d;font-weight:bold">13</span>
</code></pre></div><p>This tells us that 148956 lines into the COPY section (counting from the COPY line itself), we encounter the end-of-copy marker “\.”. (The complaint from xargs can be ignored.) Now we can create our data file:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ grep -n COPY large.dumpfile.pg <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | grep -m1 <span style="color:#d20;background-color:#fff0f0">'COPY alpha'</span> <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | cut -d: -f1 <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | xargs -Ix tail --lines=+x large.dumpfile.pg <span style="color:#04d;background-color:#fff0f0">\
</span><span style="color:#04d;background-color:#fff0f0"></span> | head -148956 > alpha.data.pg
</code></pre></div><p>Now that the file is there, we should do a quick sanity check on it. If the file is small enough, we could simply open it in our favorite editor or run it through <strong>less</strong> or <strong>more</strong>. You can also check things out by knowing that a Postgres dump file separates the columns of data with a tab character when using the COPY command. So we can view all lines that don’t have a tab, and make sure there is nothing except comments and the <strong>COPY</strong> and <strong>\.</strong> lines:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ grep -v -P <span style="color:#d20;background-color:#fff0f0">'\t'</span> alpha.data.pg
</code></pre></div><p>The grep option <strong>-P</strong> (aka <strong>--perl-regexp</strong>) instructs grep to interpret the argument (“backslash t” in this case) as a Perl regular expression. You could also simply input a literal tab there: on most systems this can be done with the <strong>&lt;ctrl-v&gt;&lt;TAB&gt;</strong> key combination.</p>
<p>It’s time to replace that bad data. We’ll need to truncate the existing table, then COPY our data back in. To do this, we’ll create a file that we’ll feed to <strong>psql -X -f</strong>. Here’s the top of the file:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ cat > alpha.restore.pg
<span style="color:#04d;background-color:#fff0f0">\s</span>et ON_ERROR_STOP on
<span style="color:#04d;background-color:#fff0f0">\t</span>iming
<span style="color:#04d;background-color:#fff0f0">\c</span> mydatabase someuser
BEGIN;
CREATE SCHEMA backup;
CREATE TABLE backup.alpha AS SELECT * FROM public.alpha;
TRUNCATE TABLE alpha;
</code></pre></div><p>From the top: we tell psql to stop right away if it encounters any problems, and then turn on the timing of all queries. We explicitly connect to the correct database as the correct user. Putting it here in the script is a safety feature. Then we start a new transaction, create a backup schema, and make a copy of the existing data into a backup table before truncating the original table. The next step is to add in the data, then wrap things up:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ cat alpha.data.pg >> alpha.restore.pg
</code></pre></div><p>Now we run it and check for any errors. We use the <strong>-X</strong> argument to ensure control of exactly which psql options are in effect, bypassing any psqlrc files that may be in use.</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ psql -X -f alpha.restore.pg
</code></pre></div><p>If everything looks good, the final step is to add a COMMIT and run the file again:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-shell" data-lang="shell">$ <span style="color:#038">echo</span> <span style="color:#d20;background-color:#fff0f0">"COMMIT;"</span> >> alpha.restore.pg
$ psql -X -f alpha.restore.pg
</code></pre></div><p>And we are done! All of this is a little simplified; in real life there was actually more than one table to be restored, and each had some foreign key dependencies that had to be worked around, but the basic idea remains the same. (And yes, I know you can do the extraction in a Perl one-liner.)</p>
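<p>For the record, here is one non-Perl equivalent of the extraction step: a hedged sketch using sed’s range addressing, which assumes the table name never appears at the start of a data line. It prints everything from the COPY line through the end-of-copy marker in a single pass:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">$ sed -n '/^COPY alpha /,/^\\\.$/p' large.dumpfile.pg > alpha.data.pg
</code></pre></div>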
<h2><a href="https://www.endpointdev.com/blog/2010/02/postgresql-ec2-ebs-raid0-snapshot/">PostgreSQL EC2/EBS/RAID 0 snapshot backup</a></h2>
<p>2010-02-23 by Jon Jensen</p>
<p>One of our clients uses <a href="https://aws.amazon.com/">Amazon Web Services</a> to host their production application and database servers on EC2 with EBS (Elastic Block Store) storage volumes. Their main database is <a href="/expertise/postgresql/">PostgreSQL</a>.</p>
<p>A big benefit of Amazon’s cloud services is that you can easily add and remove virtual server instances, storage space, etc. and pay as you go. One known problem with Amazon’s EBS storage is that it is much more I/O limited than, say, a nice SAN.</p>
<p>To partially mitigate the I/O limitations, they’re using 4 EBS volumes to back a Linux software RAID 0 block device. On top of that is the xfs filesystem. This gives roughly 4x the I/O throughput and has been effective so far.</p>
<p>They ship WAL files to a secondary server that serves as warm standby in case the primary server fails. That’s working fine.</p>
<p>They also do nightly backups using pg_dumpall on the master so that there’s a separate portable (SQL) backup not dependent on the server architecture. The problem that led to this article is that extra I/O caused by pg_dumpall pushes the system beyond its I/O limits. It adds both reads (from the PostgreSQL database) and writes (to the SQL output file).</p>
<p>There are several solutions we are considering so that we can keep both binary backups of the database and SQL backups, since both types are valuable. In this article I’m not discussing all the options or trying to decide which is best in this case. Instead, I want to consider just one of the tried and true methods of backing up the binary database files on another host to offload the I/O:</p>
<ol>
<li>Create an atomic snapshot of the block devices</li>
<li>Spin up another virtual server</li>
<li>Mount the backup volume</li>
<li>Start Postgres and allow it to recover from the apparent “crash” the server had (since there wasn’t a clean shutdown of the database before the snapshot)</li>
<li>Do whatever pg_dump or other backups are desired</li>
<li>Make throwaway copies of the snapshot for QA or other testing</li>
</ol>
<p>The benefit of such snapshots is that you get an exact backup of the database, with whatever table bloat, indexes, statistics, etc. exactly as they are in production. That’s a big difference from a freshly created database and import from pg_dump.</p>
<p>The difference here is that we’re using 4 EBS volumes with RAID 0 striped across them, and there isn’t currently a way to do an atomic snapshot of all 4 volumes at the same time. So it’s no longer “atomic” and who knows what state the filesystem metadata and the file data itself would be in?</p>
<p>Well, why not try it anyway? Filesystem metadata doesn’t change that often, especially in the controlled environment of a Postgres data volume. Snapshotting within a relatively short timeframe would be pretty close to atomic, and probably look to the software (operating system and database) like some kind of strange crash since some EBS volumes would have slightly newer writes than others. But aren’t all crashes a little unpredictable? Why shouldn’t the software be able to deal with that? Especially if we have Postgres make a checkpoint right before we snapshot.</p>
<p>I wanted to know if it was crazy or not, so I tried it on a new set of services in a separate AWS account. Here are the notes and some details of what I did:</p>
<p>1. Created one EC2 image:</p>
<p>Amazon EC2 Debian 5.0 lenny AMI built by Eric Hammond<br>
Debian AMI ID ami-4ffe1926 (x86_64)<br>
Instance Type: High-CPU Extra Large (c1.xlarge) — 7 GB RAM, 8 CPU cores</p>
<p>2. Created 4 x 10 GB EBS volumes</p>
<p>3. Attached volumes to the image</p>
<p>4. Created software RAID 0 device:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">mdadm -C /dev/md0 -n <span style="color:#00d;font-weight:bold">4</span> -l <span style="color:#00d;font-weight:bold">0</span> -z max /dev/sdf /dev/sdg /dev/sdh /dev/sdi
</code></pre></div><p>5. Created XFS filesystem on top of RAID 0 device:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">mkfs -t xfs -L /pgdata /dev/md0
</code></pre></div><p>6. Set up in /etc/fstab and mounted:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash">mkdir /pgdata
<span style="color:#888"># edit /etc/fstab, with noatime</span>
mount /pgdata
</code></pre></div><p>7. Installed PostgreSQL 8.3</p>
<p>8. Configured postgresql.conf to be similar to primary production database server</p>
<p>9. Created empty new database cluster with data directory in /pgdata</p>
<p>10. Started Postgres and imported a play database (from public domain census name data and Project Gutenberg texts), resulting in about 820 MB in data directory</p>
<p>11. Ran some bulk inserts to grow database to around 5 GB</p>
<p>12. Rebooted EC2 instance to confirm everything came back up correctly on its own</p>
<p>13. Set up two concurrent data-insertion processes:</p>
<ul>
<li>
<p>50 million row insert based on another local table (INSERT INTO … SELECT …), in a single transaction (hits disk hard, but nothing should be visible in the snapshot because the transaction won’t have committed before the snapshot is taken)</p>
</li>
<li>
<p>Repeated single inserts in autocommit mode (Python script writing INSERT statements using random data from /usr/share/dict/words piped into psql), to verify that new inserts made it into the snapshot, and no partial row garbage leaked through</p>
</li>
</ul>
<p>14. Started those “beater” jobs, which mostly consumed 2-3 CPU cores</p>
<p>15. Manually inserted a known test row and created a known view that should appear in the snapshot</p>
<p>16. Started Postgres’s backup mode that allows for copying binary data files in a non-atomic manner, which also does a CHECKPOINT and thus also a filesystem sync:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql"><span style="color:#080;font-weight:bold">SELECT</span><span style="color:#bbb"> </span>pg_start_backup(<span style="color:#d20;background-color:#fff0f0">'raid_backup'</span>);<span style="color:#bbb">
</span></code></pre></div><p>17. Manually inserted a 2nd known test row & 2nd known test view that I don’t want to appear in the snapshot after recovery</p>
<p>18. Ran a snapshot script that calls ec2-create-snapshot on each of the 4 EBS volumes. During the first run the snapshots were taken serially and quite slowly, taking about 1 minute total; during the second run they were taken in parallel so that the snapshot points for all 4 volumes fell within 1 second of each other (a sketch of the parallel approach follows).</p>
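<p>For illustration, a parallel snapshot script can be as simple as the following sketch, using the EC2 API tools of the day; the volume IDs are placeholders for the 4 EBS volumes backing the RAID 0 array:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-bash" data-lang="bash"># Kick off snapshots of all 4 EBS volumes at nearly the same moment.
for vol in vol-11111111 vol-22222222 vol-33333333 vol-44444444; do
    ec2-create-snapshot "$vol" &
done
wait    # all 4 snapshot requests issued within about a second of each other
</code></pre></div>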
<p>19. Tell Postgres the backup’s over:</p>
<div class="highlight"><pre tabindex="0" style="background-color:#fff;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-sql" data-lang="sql"><span style="color:#080;font-weight:bold">SELECT</span><span style="color:#bbb"> </span>pg_stop_backup();<span style="color:#bbb">
</span></code></pre></div><p>20. Ran script to create new EBS volumes derived from the 4 snapshots (which aren’t directly usable and always go into S3), using <code>ec2-create-volume --snapshot</code></p>
<p>21. Ran a script to attach the new EBS volumes to devices on the new EC2 instance using <code>ec2-attach-volume</code></p>
<p>22. Then, on the new EC2 instance for doing backups:</p>
<ul>
<li><code>mdadm --assemble --scan</code></li>
<li><code>mount /pgdata</code></li>
<li>Start Postgres</li>
<li>Count rows on the 2 volatile tables; confirm that the table with the in-process transaction doesn’t show any new rows, and that the table getting individual rows committed to reads correctly</li>
<li><code>VACUUM VERBOSE</code> — and confirm no errors or inconsistencies detected</li>
<li><code>pg_dumpall</code> # confirmed no errors and data looks sound</li>
</ul>
<p>It worked! No errors or problems, and pretty straightforward to do.</p>
<p>Actually before doing all the above I first did a simpler trial run with no active database writes happening, and didn’t make any attempt for the 4 EBS snapshots to happen simultaneously. They were actually spread out over almost a minute, and it worked fine. With the confidence that the whole thing wasn’t a fool’s errand, I then put together the scripts to do lots of writes during the snapshot and made the snapshots run in parallel so they’d be close to atomic.</p>
<p>There are lots of caveats to note here:</p>
<ul>
<li>This is an experiment in progress, not a how-to for the general public.</li>
<li>The data set that was snapshotted was fairly small.</li>
<li>Two successful runs, even with no failures, is not a very big sample set. :)</li>
<li>I didn’t use Postgres’s point-in-time recovery (PITR) here at all—I just started up the database and let Postgres recover from an apparent crash. Shipping over the few WAL logs from the master collected during the pg_backup run <em>after</em> the snapshot copying is complete would allow a theoretically fully reliable recovery to be made, not just a practically non-failing recovery as I did above.</li>
</ul>
<p>So there’s more work to be done to prove this technique viable in production for a mission-critical database, but it’s a promising start worth further investigation. It shows that there <em>is</em> a way to back up a database across multiple EBS volumes without adding noticeably to its I/O load by utilizing the Amazon EBS data store’s snapshotting and letting a separate EC2 server offload the I/O of backups or anything else we want to do with the data.</p>