Tuesday 11 December 2012

Prototyping with Google's Prediction API, Python, SQLite and some JSON APIs

:: Introduction

I've been playing around lately with the car2go API v2.1 as I was curious about how easy it would be to ingest its vehicle location data over time and then try to accurately predict certain outcomes using machine learning algorithms.

I've had quite a bit of success in the past using Google's Prediction API for spam comment detection, but that "spam" problem used classification values rather than regression values. Regression values would have to be used in this type of geo-coordinate based problem.

The last time I used the Python programming language was when I worked at Toshiba in Edinburgh, UK a number of years ago. I was interested to get back to using the language as I remember it being really elegant, and it also seemed like a great fit for prototyping things out. I was using a MacBook Air, which had Python pre-installed, so this made it pretty easy to get up and running. I knew I'd need to query the API, parse out the necessary values and store these in a datastore of some kind. I opted to use Mac OS X's built-in SQLite database as it integrated well with Python and it provided a great interface from the command line for testing out various SQL queries.

:: Inserting Records into SQLite with Python

The following Python program illustrates how to connect to a SQLite database called car2go.db and insert a single data record into the Vehicles table (Note that the Vehicles table schema needs to be pre-defined and created before the code will execute successfully):

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sqlite3 as sqlite
import sys

con = None

try:
    # connect to (or create) the local SQLite database file
    con = sqlite.connect('car2go.db')

    cur = con.cursor()

    cur.execute("INSERT INTO Vehicles VALUES(<INSERT_VALUES_BASED_ON_SCHEMA>)")

    # commit the transaction so the insert is actually persisted
    con.commit()

except sqlite.Error, e:
    
    if con:
        con.rollback()
        
    print "Error %s:" % e.args[0]
    sys.exit(1)
    
finally:
    
    if con:
        con.close()

As you can see, this code is really lightweight and easy to get up and running on a machine with Python installed.
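
Since the snippet above assumes the Vehicles table already exists, here's a minimal sketch of what creating it might look like. The column names are my own assumptions based on the fields the car2go API returns, not the exact schema from this prototype, so adjust them to match the data you actually want to store:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sqlite3 as sqlite

con = sqlite.connect('car2go.db')
cur = con.cursor()

# Hypothetical Vehicles schema -- the column names are my assumptions
# based on the fields the car2go API returns, not the exact schema
# used in this prototype
cur.execute("""CREATE TABLE IF NOT EXISTS Vehicles (
                   Name TEXT,        -- vehicle identifier (e.g. license plate)
                   Address TEXT,     -- human-readable street address
                   Latitude REAL,
                   Longitude REAL,
                   Fuel INTEGER,     -- fuel level percentage
                   ObservedAt TEXT   -- when the record was captured
               )""")

con.commit()
con.close()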

:: Requesting API Data and Parsing Values with Python

The next step was to query the JSON based API and parse out any necessary vehicle values:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import urllib2
import sys

# Note that you'll need to replace <SECRET_KEY> & <LOCATION> with the appropriate values
car2goVehiclesApi = 'https://www.car2go.com/api/v2.1/vehicles&oauth_consumer_key=<SECRET_KEY>&loc=<LOCATION>&format=json'

# request vehicle data in json format
webReq = urllib2.urlopen(car2goVehiclesApi)
vehiclesJson = json.load(webReq)

The vehiclesJson object now holds the JSON data from the web request and can be queried directly given the appropriate keys/indexes. An example to output the address of the first vehicle is:

print vehiclesJson['placemarks'][0]['address']
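
Tying the request and the database together, a rough sketch of looping over all placemarks and inserting each vehicle into SQLite might look like the following. It assumes the hypothetical schema sketched earlier, and the key names ('name', 'fuel') plus the longitude-first coordinate ordering are my assumptions about the API's response, so verify them against the actual JSON:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sqlite3 as sqlite
from datetime import datetime

# assumes vehiclesJson was loaded as shown above and that the Vehicles
# table matches the hypothetical schema sketched earlier
con = sqlite.connect('car2go.db')
cur = con.cursor()

observedAt = datetime.utcnow().isoformat()

for placemark in vehiclesJson['placemarks']:
    # assumption: coordinates are ordered [longitude, latitude, altitude]
    lng, lat = placemark['coordinates'][0], placemark['coordinates'][1]
    # parameterized queries avoid quoting/escaping issues
    cur.execute("INSERT INTO Vehicles VALUES(?, ?, ?, ?, ?, ?)",
                (placemark['name'], placemark['address'],
                 lat, lng, placemark['fuel'], observedAt))

con.commit()
con.close()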

:: Outputting SQLite Query Data to CSV

Once these basics were done I had the foundation to expand upon further. I eventually extracted all the necessary JSON vehicle data from the web request and inserted it into the Vehicles table in my car2go.db SQLite database. From there I crafted the specific SQL query I wanted and output that query's data to a CSV file using the following sqlite3 command syntax (simply run the SQL query after setting these values):

.headers off
.output vehicleData.txt
.mode csv
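
If you'd rather keep everything in Python instead of dropping into the sqlite3 shell, the standard csv module can produce the same file; here's a minimal sketch, again assuming the hypothetical Vehicles table from earlier:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import csv
import sqlite3 as sqlite

con = sqlite.connect('car2go.db')
cur = con.cursor()

# run whatever SQL query you've crafted for your training data
cur.execute("SELECT * FROM Vehicles")

# 'wb' mode is correct for the csv module under Python 2
with open('vehicleData.txt', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(cur.fetchall())   # no header row, matching .headers off

con.close()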

:: Using the Google Prediction API

Setting up the Google Prediction API for the first time can be slightly tricky, so be sure to follow Google's directions closely. I won't go into these prerequisite details here as Google has already done a great job explaining them (and will most likely keep them up-to-date in the future).

After setting everything up, the vehicleData.txt file was then uploaded to one of my Google Cloud Storage buckets where it could be queried directly by Google's Prediction API. Note that depending on the data you upload you may need to wrap the strings in double quotes according to Google's specified training data format. After uploading, the first task was to train my model, which can take anywhere from a few seconds to a few minutes. Once this is complete (querying for the status will inform you when it's complete) you can begin to ask for regression predictions via the predict HTTP request. Note that depending on what you are predicting, either the outputValue or the outputMulti[].score values can be retrieved from the JSON response and used to interpret your intended outcome.
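
To give a feel for the shape of those HTTP requests, below is a rough sketch of the train/status/predict cycle against Prediction API v1.5 using urllib2, with the field names taken from Google's v1.5 reference as best I recall them. Obtaining the OAuth 2.0 access token is left out entirely, and <ACCESS_TOKEN>, <BUCKET>, the model id and the feature values are all placeholders of mine rather than values from this prototype:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import urllib2

BASE = 'https://www.googleapis.com/prediction/v1.5/trainedmodels'
HEADERS = {'Authorization': 'Bearer <ACCESS_TOKEN>',
           'Content-Type': 'application/json'}

def call(url, body=None):
    # urllib2 issues a POST when a body is given, otherwise a GET
    req = urllib2.Request(url, body, HEADERS)
    return json.load(urllib2.urlopen(req))

# 1. train a model from the CSV file previously uploaded to Cloud Storage
call(BASE, json.dumps({'id': 'vehicle-model',
                       'storageDataLocation': '<BUCKET>/vehicleData.txt'}))

# 2. poll the model until training has finished
status = call(BASE + '/vehicle-model')
print status['trainingStatus']   # RUNNING while training, DONE when finished

# 3. request a regression prediction for a new instance
result = call(BASE + '/vehicle-model/predict',
              json.dumps({'input': {'csvInstance': [49.28, -123.12]}}))
print result['outputValue']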

(Note that I was using Google's Prediction API v1.5 for this prototype).

:: Conclusion

I'd highly recommend giving the Python language a try (if you haven't already) for your next prototype project. Integrating it with a lightweight database, JSON web requests and Google's Prediction API was pretty easy, and my overall impression of the language and available libraries is that it's still a pleasure to work with.

4 comments:

Anonymous said...

Hi Andrew, great post. Curious to know if your analysis delivered any useful models/results.

Are you familiar with BigML (bigml.com)? A service very similar to Prediction API, but using white box models and fitted with both an API and a web interface. It comes with python bindings as well (http://blog.bigml.com/2012/12/07/bigmler-in-da-cloud-machine-learning-made-even-easier/). Even the model that you train can be downloaded as python code to directly integrate in your local code. (Yes, we have some python fans in the team.)
Maybe worthwhile to explore?

Andrew said...

Hi Jos,

I wish I could say it was a resounding success but unfortunately it wasn't. I came across two issues that need further thought:

1. Geolocation coordinates are two dimensional (i.e. latitude and longitude), which caused a slight problem during my initial hacking around. Google's prediction algorithm has both a categorical model and a regression model (the latter returns only a single numeric value). Because I have to use numerical values for geolocation coordinates, I'd need the predictive algorithm to return more than a single value (i.e. two numeric values) OR I would have to have two separate learning models where one would return the latitude and the other the longitude. I'm not even sure how accurate the latter option would be without some further analysis.

2. The above problem put the brakes on my initial plan, but because I still wanted to play around with the predictive algorithm, I then looked for some other predictive capability. I decided to make the actual car's identifier (like the license plate number) the answer to a predictive question. In order to do this I used the 'outputValue' JSON from the predict request (https://developers.google.com/prediction/docs/reference/v1.5/hostedmodels/predict). This allowed me to get a numeric ranking of all the cars. With this I could then define some threshold and decide how many vehicles were "within range". Not at all the best possible approach in terms of accuracy, but something I investigated briefly.
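
To make that concrete, the thresholding boiled down to something like this (a rough illustrative sketch with made-up scores, not real prediction results):

#!/usr/bin/python
# -*- coding: utf-8 -*-

# Illustrative only: made-up prediction scores keyed by car identifier
predictions = {'CAR-101': 0.91, 'CAR-102': 0.34, 'CAR-103': 0.78}

THRESHOLD = 0.5

# keep the vehicles whose score clears the threshold
withinRange = [car for car, score in predictions.items()
               if score >= THRESHOLD]

print withinRange   # vehicles considered "within range"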

I hadn't come across BigML but it looks really promising... thanks for sending the link. I will definitely look into that in the future when I next dig deeper into machine learning again.

Take care.

Unknown said...

Great job but I don't understand one part of it.
I thought that to use the Car2Go API you needed an OAuth key. I have received my consumer key from Car2Go but I am struggling to implement it; any feedback on the way you solved this would be welcomed.
Best,

Andrew said...

Hi Florian,

It sounds like you've already registered with Car2go at openapi@car2go.com and have received your API key.

Now use that API key by replacing 'API_KEY' in the URL below. Once you've done that you can paste that URL into your browser and it should return some JSON vehicle data.

https://www.car2go.com/api/v2.1/vehicles&oauth_consumer_key=API_KEY&loc=vancouver&format=json

If that doesn't work, what is the error message you're receiving back from the Car2go API?