iA


Importing Data From Google Docs Into Django, Step Four: Writing and running your importer

by Andy Boyle.

Give yourself some whiskey because you're about to have a damn good time.

Your spreadsheet is set up and your models are synced with your database. Soon it will be time to write the code that imports the Google spreadsheet data. Soon all of the base will belong to you.

But first we need to add some preliminary code that will allow us to communicate with the Google doc. We should put this in a lib directory, just for our sanity’s sake. So let’s do this from the base gdocs_importer directory (the one with the manage.py command):

mkdir gdocs_importer/gdocs_importer/lib
touch gdocs_importer/gdocs_importer/__init__.py

Open up that __init__.py we just made. We’ll need to throw in this code in there below to use one of the libraries we installed. My coworker Ryan Nagle wrote this for a project we did at work, and it’s pretty nifty. I’ve edited it a tiny bit to make it work for our project. Here’s the code (and a gist for it, too):

import gdata.docs.service
import gdata.spreadsheet.service
import os
import tempfile
import csv
 
 
def get_spreadsheet(account, password, key, gid):
    gd_client = gdata.docs.service.DocsService()
    gd_client.email = account
    gd_client.password = password
    gd_client.source = "Private Spreadsheet Downloader"
    gd_client.ProgrammaticLogin()
 
    spreadsheets_client = gdata.spreadsheet.service.SpreadsheetsService()
    spreadsheets_client.email = gd_client.email
    spreadsheets_client.password = gd_client.password
    spreadsheets_client.source = "Private Spreadsheet Downloader"
    spreadsheets_client.ProgrammaticLogin()
 
    file_path = tempfile.mktemp(suffix='.csv')
    uri = 'https://docs.google.com/feeds/documents/private/full/%s' % key
    try:
        entry = gd_client.GetDocumentListEntry(uri)
        docs_auth_token = gd_client.GetClientLoginToken()
        gd_client.SetClientLoginToken(
            spreadsheets_client.GetClientLoginToken())
        gd_client.Export(entry, file_path)
        gd_client.SetClientLoginToken(docs_auth_token)
        return csv.reader(file(file_path).readlines())
    finally:
        try:
            os.remove(file_path)
        except OSError:
            pass

And now we need to make what’s called a custom Django management command. These are pretty nifty, and this is going to be the script that will actually hit our Google spreadsheet, download the data and then ingest it into our database. So let’s first make a home for our management command:

mkdir gdocs_importer/management
touch gdocs_importer/management/__init__.py
mkdir gdocs_importer/management/commands/
touch gdocs_importer/management/commands/__init__.py
touch gdocs_importer/management/commands/load_homicides.py

So we made a management directory and a commands directory inside of it, initializing both with a __init__.py. And then inside the commands directory we made a load_homicides.py. If you can guess what we shall be putting into that file, you’re a smart cookie.

Open up load_homicides.py. We’re going to now write our script that imports the homicides from the Google spreadsheeT. Just some basics up front — we’re making this so we don’t have to publicly share the Google spreadsheet. This way you can create a spreadsheet and still keep it hidden, just in case someone stumbles upon the live spreadsheet and there is, perchance, data on there you don’t intend for public consumption. That’s not the case with our test data, but I thought I would show you how to do this just for safety’s sake. That’s what everything in the __init__.py we edited earlier was for.

Here’s the code for your load_homicides.py (gist is here):

from dateutil import parser
from django.core.management.base import BaseCommand
from optparse import make_option
from datetime import datetime
 
from gdocs_importer import logger
from gdocs_importer.models import Homicide
from gdocs_importer.lib import get_spreadsheet
 
GOOGLE_KEY = '' # Enter the Google key here
GOOGLE_SHEET = 0 # This is the Google spreadsheet sheet number you're looking at, and 0 is the default first one
GOOGLE_ACCOUNT = '' # Enter your Google account name here, blahwhatever@gmail.com
GOOGLE_PASS = '' # Enter your Google account password here
 
 
class Command(BaseCommand):
    """
    Downloads and ingests all homicides
    """
    option_list = BaseCommand.option_list + (
        make_option(
            '-c',
            '--clear',
            action='store_true',
            default=False,
            dest='clear',
            help='Clear all homicides in the DB.'),
        make_option(
            '-f',
            '--first',
            action='store_true',
            default=False,
            dest='first',
            help='Run the script for the first time.'),
    )
 
    def get_version(self):
        return "0.1"
 
    def handle(self, *args, **options):
        first_time = False
        if options['clear']:
            logger.info("Clearing all homicides from the DB.")
            Homicide.objects.all().delete()
        if options['first']:
            first_time = True
 
        self.init_reader(first_time)
 
    def dict_for_row(self, row):
        """
        Get a row of data, return a dict whose keys are Homicide properties
        and values are instance values for that row.
        """
        kwargs = {
            'address': row[0],
            'date': row[1],
            'time': row[2],
            'location': row[3],
            'neighborhood': row[4],
            'age': row[5],
            'gender': row[6],
            'race': row[7],
            'name': row[8],
            'cause': row[9],
            'story_url': row[10],
            'rd_number': row[11],
            'charges_url': row[12],
        }
 
        return kwargs
 
    def get_google_csv(self, key, sheet):
        """
        Connect to the Google doc and return a dict of the data to be used,
        including finding the header and the rest of the data
        """
        data = list(get_spreadsheet(
            GOOGLE_ACCOUNT, GOOGLE_PASS, key, sheet))
        return [self.dict_for_row(item) for item in data[1:]]
 
    def init_reader(self, first_time):
        """
        Loop through the spreadsheet and load the data
        """
        homicide_csv = self.get_google_csv(GOOGLE_KEY, GOOGLE_SHEET)
        already_exists = 0
        start_time = datetime.now()
 
        # Run through the spreadsheet, assigning values to certain fields
        for i, row in enumerate(homicide_csv):
            address = row['address']
            # Send some fields to functions that clean them
            cleaned_date_time = self.clean_date_time(row['date'], row['time'])
            location = row['location']
            neighborhood = row['neighborhood']
            age = self.clean_age(row['age'])
            gender = self.clean_gender(row['gender'])
            race = row['race']
            name = row['name']
            cause = row['cause']
            story_url = self.clean_link(row['story_url'])
 
            rd_number = row['rd_number']
            charges_url = self.clean_link(row['charges_url'])
 
            # If the command doesn't use --first, run it this way.
            if first_time is not True:
                # Try and see if this homicide exists. If so, skip it.
                try:
                    homicide = Homicide.objects.get(
                        address=address,
                        name=name)
                    already_exists = already_exists + 1
 
                    logger.info('Already exists, skipping it.')
 
                # If the homicide doesn't exist, create it.
                except Homicide.DoesNotExist:
                    homicide = Homicide(
                        address=address,
                        date_time=cleaned_date_time,
                        location=location,
                        neighborhood=neighborhood,
                        age=age,
                        gender=gender,
                        race=race,
                        name=name,
                        cause=cause,
                        story_url=story_url,
                        rd_number=rd_number,
                        charges_url=charges_url)
                    homicide.save()
            # If you're running it with --first, save everything.
            else:
                homicide = Homicide(
                    address=address,
                    date_time=cleaned_date_time,
                    location=location,
                    neighborhood=neighborhood,
                    age=age,
                    gender=gender,
                    race=race,
                    name=name,
                    cause=cause,
                    story_url=story_url,
                    rd_number=rd_number,
                    charges_url=charges_url)
                homicide.save()
                logger.info("Saved homicide at %s" % address)
                logger.info("Skipped %s so far" % already_exists)
 
        finish_time = datetime.now()
        total_time = finish_time - start_time
        logger.info(
            "All done, took %s seconds to complete." % total_time.seconds)
        logger.info("Skipped %s homicides." % already_exists)
 
    def clean_date_time(self, date, time):
        """
        Turn the date and time into one field that Django will understand
        properly instead of being weird, just in case it was inputted wrong.
        """
        try:
            cleaned_date_time = parser.parse(
                "%s %s" % (date, time), ignoretz=True)
        except ValueError:
            cleaned_date_time = None
 
        return cleaned_date_time
 
    def clean_age(self, age):
        """
        Make sure it's an integer that's been entered, otherwise it'll break
        everything. If not, ignore it.
        """
        try:
            isinstance(age, int)
            age = int(age)
            if age is None:
                age = None
                return age
            else:
                return age
        except ValueError:
            age = None
            return age
 
    def clean_gender(self, gender):
        """
        Make sure it's a string that's been entered, otherwise it'll break
        everything. If not, save it as an unknown gender
        """
 
        try:
            isinstance(gender, basestring)
        except ValueError:
            new_gender = "Unkown"
            return new_gender
 
    def clean_link(self, link):
        """
        Django can only store URLs up to 200 characters, so if it's too long,
        this will allow you to ignore the URL and not break your importer.
        """
 
        try:
            isinstance(link, basestring)
            if len(link) > 200:
                new_link = None
                return new_link
            elif link == "":
                new_link = None
                return new_link
            else:
                return link
        except ValueError:
            new_link = None
            return new_link

I’ve commented the code a bit, so hopefully some of it makes sense, but I’ll explain the larger parts. You’ll need to put in your Google credentials at the top (GOOGLE_ACCOUNT and GOOGLE_PASS should be self-explanatory), as well as the spreadsheet key from your Google spreadsheet. That one is easy to find by getting the URL of your spreadsheet. It should look something like this:

https://docs.google.com/spreadsheet/ccc?key=0Ark-PJD-Ze_DdHBfaUtjZz123m51azc5dVIyYk5JT2c&hl=en#gid=0

The part between ?key= and &hl=end#gid0 — 0Ark-PJD-Ze_DdHBfaUtjZz123m51azc5dVIyYk5JT2c — is what we are looking for. Copy that from your spreadsheet and paste it the GOOGLE_KEY location.

Now, we’ve got a few options when you run this management command. The first one is “clear,” which allows you to clear the entire database before running the command. You use this by adding “–clear” or “-c” to the end of your management command. The second one is “first,” which allows you to tell the script “This is the first time I’m running!” This means it will skip trying to compare the homicides to anything in the database and load everything fast. You use this by adding “–first” or “-f” to the end of the management command.

And what is this management command? Well, it’s quite simple. Now that you have everything squared away, your database is running (Remember pgup?), your Google doc is formed correctly and your scripts are in place. Now type the magic command from the root of your project (where the manage.py file is):

python manage.py load_homicides --first

You should see a bunch of junk spit out onto your screen. That’s because I’ve added debug statements (which I often add to my work projects, much to my coworkers chagrin) that show what you saved, how long it took to run, and whether or not it skipped any because the same homicide already existed in your database.

I added a count so it’ll figure out how many homicides you added to the database. For the test data, it should be 515, and the first time you run it it should skip 0 homicides.

Now, let’s run it again, this time without –first:

python manage.py load_homicides

It should run through and skip everything, so you’ll have created zero homicides and skipped 515. Pretty nifty, huh? Let’s say you wanna blow everything away and start anew:

python manage.py load_homicides --clear --first

This will delete everything in your database and start from scratch. Don’t worry — you won’t be harming anything. This is a normal command I like to have in my import scripts, because I often screw up writing the script initially, so it makes it much easier to re-run it and test everything.

Now let’s prove that this data has showed up in your database. Let’s look inside:

psql gdocs_importer

This SQL command will count how many items are in the gdocs_importer_homicides table:

select count(*) from gdocs_importer_homicide;

It should return 515. If you’d like a more detailed look, try this:

select * from gdocs_importer_homicide;

Type q to quit, or use your up and down arrow keys to look through it.

Well, that’s the basic importer. I plan on adding some things to check and see if any data has changed between your models and what’s in the Google doc and then override it, but I’ll probably be adding that in a few days once I get it working.

In the next post I’ll make a basic view and template to display all of your data.

2 comments on ‘Importing Data From Google Docs Into Django, Step Four: Writing and running your importer’

  1. Ian says:

    I really like these instructions. They’re helping me get more comfortable with using Django and GData/spreadsheets. Thanks for writing it up! I think there are some errors(?) on step 4 with the touch commands and the paths shown. It looks like you want to eliminate the project folder from the command, right? (mkdir gdocs_importer/lib from within the gdocs_importer project folder)

  2. Andy Boyle says:

    You may be correct! I wrote these up quite awhile ago, so I have kind of forgotten what’s going on here. I’ll take a look once I get some time. Thanks for writing!

Leave a Reply