Tuesday 11 December 2012

Prototyping with Google's Prediction API, Python, SQLite and some JSON APIs

:: Introduction

I've been playing around lately with the car2go API v2.1, as I was curious how easy it would be to ingest its vehicle location data over time and then try to accurately predict certain outcomes using machine learning algorithms.

I've had quite a bit of success in the past using Google's Prediction API for spam comment detection, but that spam problem used classification values (a discrete label such as spam/not-spam) rather than regression values (a continuous numeric output). Regression values would have to be used in this type of geo-coordinate-based problem.

The last time I used the Python programming language was when I worked at Toshiba in Edinburgh, UK a number of years ago. I was keen to get back to it, as I remember it being a really elegant language, and it also seemed like a great fit for prototyping. I was using a MacBook Air, which had Python pre-installed, so it was pretty easy to get up and running. I knew I'd need to query the API, parse out the necessary values and store them in a datastore of some kind. I opted for Mac OS X's built-in SQLite database, as it integrates well with Python and provides a great command-line interface for testing out various SQL queries.

:: Inserting Records into SQLite with Python

The following Python program illustrates how to connect to a SQLite database called car2go.db and insert a single data record into the Vehicles table (note that the Vehicles table schema needs to be defined and the table created before the code will execute successfully; a sample schema is sketched below the listing):

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sqlite3 as sqlite
import sys

con = None

try:
    con = sqlite.connect('car2go.db')

    cur = con.cursor()

    # insert a single record (replace the placeholder with values matching your schema)
    cur.execute("INSERT INTO Vehicles VALUES(<INSERT_VALUES_BASED_ON_SCHEMA>)")

    # commit the transaction so the insert is actually persisted
    con.commit()

except sqlite.Error, e:
    
    if con:
        con.rollback()
        
    print "Error %s:" % e.args[0]
    sys.exit(1)
    
finally:
    
    if con:
        con.close()

As you can see, this code is really lightweight and easy to get up and running on any machine with Python installed.
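
Since the Vehicles table has to exist before that insert will succeed, here's a rough sketch of what creating it could look like. The column names and types below are purely illustrative guesses on my part (based on the kinds of values the vehicles API returns); your actual schema will depend on which values you decide to keep:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sqlite3 as sqlite

con = sqlite.connect('car2go.db')
cur = con.cursor()

# hypothetical schema -- adjust the columns to match the vehicle values you want to keep
cur.execute("""CREATE TABLE IF NOT EXISTS Vehicles(
                   Name TEXT,
                   Address TEXT,
                   Lat REAL,
                   Lon REAL,
                   Fuel INTEGER,
                   Retrieved TEXT)""")

con.commit()
con.close()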

:: Requesting API Data and Parsing Values with Python

The next step was to query the JSON based API and parse out any necessary vehicle values:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import urllib2
import sys

# Note that you'll need to replace <SECRET_KEY> & <LOCATION> with the appropriate values
car2goVehiclesApi = 'https://www.car2go.com/api/v2.1/vehicles?oauth_consumer_key=<SECRET_KEY>&loc=<LOCATION>&format=json'

# request vehicle data in json format
webReq = urllib2.urlopen(car2goVehiclesApi)
vehiclesJson = json.load(webReq)

The vehiclesJson object now holds the JSON data from the web request and can be queried directly given the appropriate keys/indexes. An example that outputs the address of the first vehicle is:

print vehiclesJson['placemarks'][0]['address']
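
To tie this back to the database step, here's a minimal sketch of looping over every returned vehicle and inserting it into the Vehicles table. I'm assuming each placemark carries 'name', 'coordinates' and 'fuel' fields alongside the 'address' shown above (check the actual API response for the fields you care about), as well as the hypothetical schema sketched earlier:

import sqlite3 as sqlite
from datetime import datetime

con = sqlite.connect('car2go.db')
cur = con.cursor()

# timestamp each snapshot so vehicle positions can be tracked over time
retrievedAt = datetime.utcnow().isoformat()

for placemark in vehiclesJson['placemarks']:
    # coordinates are assumed to be [longitude, latitude, altitude]
    lon, lat = placemark['coordinates'][0], placemark['coordinates'][1]

    # parameterised insert matching the hypothetical schema above
    cur.execute("INSERT INTO Vehicles VALUES(?, ?, ?, ?, ?, ?)",
                (placemark['name'], placemark['address'], lat, lon,
                 placemark['fuel'], retrievedAt))

con.commit()
con.close()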

:: Outputting SQLite Query Data to CSV

Once these basics were done I had the foundation to expand upon them further. I eventually extracted all the necessary JSON vehicle data from the web request and inserted it into the Vehicles table in my car2go.db SQLite database. From there I crafted the specific SQL query I wanted and wrote its results out to a CSV file using the following sqlite3 command syntax (simply run the SQL query after setting these values):

.headers off
.output vehicleData.txt
.mode csv
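
For example, a complete session at the sqlite3 prompt might look something like this (the SELECT is only a placeholder; the real query depends on your schema and on which columns you want as training data):

sqlite3 car2go.db
.headers off
.output vehicleData.txt
.mode csv
SELECT <YOUR_FEATURE_COLUMNS> FROM Vehicles;
.output stdout

Remember that for the Prediction API's training data format, the value you want to predict goes in the first column of each row.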

:: Using the Google Prediction API

Setting up the Google Prediction API for the first time can be slightly tricky, so be sure to follow Google's directions carefully. I won't go into these prerequisite details here as Google has already done a great job explaining them here (and will most likely keep them up-to-date in the future).

After setting everything up, the vehicleData.txt file was uploaded to one of my Google Cloud Storage buckets, where it could be queried directly by Google's Prediction API. Note that depending on the data you upload you may need to wrap the strings in double quotes, according to Google's specified training data format. After uploading, the first task was to train my model, which can take anywhere from a few seconds to a few minutes. Once this is complete (querying for the status will tell you when it's done) you can begin to ask for regression predictions via the predict HTTP request. Note that depending on what you are predicting, either the outputValue or the outputMulti[].score values can be retrieved from the JSON response and used to interpret your intended outcome.

(Note that I was using Google's Prediction API v1.5 for this prototype).
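
To make those last steps a bit more concrete, here's a rough sketch of what the v1.5 REST calls could look like with urllib2 (this isn't the exact code I used). The <ACCESS_TOKEN>, <BUCKET> and <FEATURE_*> values are placeholders, you'll need a valid OAuth 2.0 access token for the Prediction API, and the model id 'vehiclePredictions' is just a name I've made up for illustration:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import urllib2

trainedModelsUrl = 'https://www.googleapis.com/prediction/v1.5/trainedmodels'
headers = {'Content-Type': 'application/json',
           'Authorization': 'Bearer <ACCESS_TOKEN>'}

# kick off training against the CSV file sitting in the Cloud Storage bucket
trainBody = json.dumps({'id': 'vehiclePredictions',
                        'storageDataLocation': '<BUCKET>/vehicleData.txt'})
urllib2.urlopen(urllib2.Request(trainedModelsUrl, trainBody, headers))

# check the model's trainingStatus (keep polling until it reports DONE)
statusReq = urllib2.Request(trainedModelsUrl + '/vehiclePredictions', None, headers)
print json.load(urllib2.urlopen(statusReq))['trainingStatus']

# once trained, ask for a prediction -- the csvInstance values must be in the
# same order as the feature columns of the training CSV
predictBody = json.dumps({'input': {'csvInstance': ['<FEATURE_1>', '<FEATURE_2>']}})
predictReq = urllib2.Request(trainedModelsUrl + '/vehiclePredictions/predict',
                             predictBody, headers)
prediction = json.load(urllib2.urlopen(predictReq))

# for a regression model the predicted value comes back in outputValue
print prediction['outputValue']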

:: Conclusion

I'd highly recommend giving the Python language a try (if you haven't already) for your next prototype project. Integrating it with a lightweight database, JSON web requests and Google's Prediction API was pretty easy, and my overall impression is that the language and its available libraries are still a pleasure to work with.