Installing HP Vertica with ODBC and Python plugins on Ubuntu 14.04

Recently, I became familiar with the HP Vertica database: I applied for a “big-data” job in the Helsinki metropolitan area, and the guy on the other side of the desk gave me a homework. He wanted me to install and fiddle with HP Vertica.  In this particular database software, the data elements following each other physically on the disk, are organized in the same column (instead of being in the same row), making the storage of numerical data faster and more efficient.

Didn’t get the job though, what a waste of time. Or maybe not – I thought someone might find this tutorial useful.

1. Installing & creating a database

We start by downloading the correct debian .deb package from https://my.vertica.com /download-community-edition/

Install the package with sudo dpkg -i package.deb.  Vertica installs into /opt/vertica.

At this stage we try to run the install script as root, and bump into the first problem:

root@sampsa-xps13:/home/sampsa/Downloads# /opt/vertica/sbin/install_vertica –hosts localhost
Vertica Analytic Database 7.0.2-1 Installation Tool
>> Validating options…
Mapping hostnames in –hosts (-s) to addresses…
localhost => 127.0.0.1
>> Starting installation tasks.
>> Getting system information for cluster (this may take a while)…
Error: failed to get system information for all hosts
(127.0.0.1) This version of Debian or Ubuntu is unsupported.
Hint: additional failures may be hidden.
Installation FAILED with errors.
Installation stopped before any changes were made.

A fix is found from here: http://www.davewentzel.com/content/workaround-vertica-error-version-debian-or-ubuntu-unsupported, i.e. we must edit the file: /opt/vertica/oss/python/lib/python2.7/site-packages/vertica/network/SystemProfileFactory.py

We also create the user “dbadmin” with sudo adduser dbadmin.  Don’t forget the password (**).  Next, we set up the ssh keys as usual.

su dbadmin
ssh-keygen
cd .ssh
cat id_rsa.pub > authorized_keys

Still.. errors:

root@sampsa-xps13:/home/sampsa/Downloads# /opt/vertica/sbin/install_vertica –hosts localhost
Vertica Analytic Database 7.0.2-1 Installation Tool
>> Validating node and cluster prerequisites…Failures during local (OS) configuration for verify-127.0.0.1.xml:
HINT (S0305): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0305
TZ is unset for dbadmin. Consider updating .profile or .bashrc
HINT (S0041): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0041
Could not find the following tools normally provided by the mcelog
package: mcelog
HINT (S0040): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0040
Could not find the following tools normally provided by the pstack or
gstack package: pstack/gstack
HINT (S0045): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0045
Could not find the following tools normally provided by the sysstat
package: mpstat, iostat
WARN (S0141): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0141
CPUs have discouraged cpufreq scaling policies: cpu0, cpu1, cpu2, cpu3
FAIL (S0020): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0020
Readahead size of sda (/dev/sda1) is too low for typical systems: 256 <
2048
FAIL (S0030): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0030
ntpd process is not running: [‘ntpd’, ‘ntp’]
HALT (S0322): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0322
Debian-based OS version ‘jessie/sid’ is not supported.

After that, I edited /home/dbadmin/.bashrc to include a line with export TZ=”America/New_York”, also installed some missing software with.

sudo apt-get install mcelog gstack sysstat ntp

NTP client is needed, and we can confirm that it starts at runlevels > 1 with, say sysv-rc-conf

Hard-disk read buffer size should be at higher than the default, so we must edit /etc/rc.local to include the line /sbin/blockdev –setra 2048 /dev/sda.

Still the following errors persist:

Failures during local (OS) configuration for verify-127.0.0.1.xml:
HINT (S0305): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0305
TZ is unset for dbadmin. Consider updating .profile or .bashrc
WARN (S0141): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0141
CPUs have discouraged cpufreq scaling policies: cpu0, cpu1, cpu2, cpu3
HALT (S0322): https://my.vertica.com/docs/7.0.x/HTML/index.htm#cshid=S0322
Debian-based OS version ‘jessie/sid’ is not supported.

Well, we can never-mind all that and finish the installation with:

/opt/vertica/sbin/install_vertica –hosts localhost –failure-threshold NONE

Now, log-in to your unix box as “dbadmin” and start the vertica administration tools:

su dbadmin
/opt/vertica/bin/admintools

At this stage, it works for me..!

While still logged in as “dbadmin”, we create a database:

/opt/vertica/bin/admintools -t create_db -h
/opt/vertica/bin/admintools -t create_db -d testbase –compat21 –hosts localhost
Info: no password specified, using none
Database with 1 or 2 nodes cannot be k-safe and it may lose data if it crashes
The following node(s) are invalid:
localhost
The database cannot be created.

Had no luck with “–hosts 127.0.0.1” either, so we must create the database “manually” with the “GUI” program /opt/vertica/bin/admintools

Create a database “testi1” and at the command line, check its status with:
/opt/vertica/bin/admintools -t list_db -d testi1

2. Accessing Vertica with Python

We start by install the python plugins with apt-get install python-pyodbc

Open database connectivity (ODBC) provides a way to send SQL queries to Vertica.  In order to make it work, we need to place the following files:

/etc./odbc.ini

  [ODBC Data Sources]
  VerticaDB1 = db1 database on HP Vertica

  [VerticaDB1]
  Description = test 1 database on HP Vertica
  Driver = HPVertica
  Database = testi1
  Servername = 127.0.0.1
  UID = dbadmin
  PWD = 
  Port = 5433
  Locale = en_GB

  [ODBC]
  Threading = 1

  [Driver]
  ErrorMessagesPath = /opt/vertica/lib64/
  ODBCInstLib = /usr/lib/x86_64-linux-gnu/libodbcinst.so.1
  # ODBCInstLib = /usr/lib/x86_64-linux-gnu/libodbcinst.so
  DriverManagerEncoding=UTF-16

/etc/odbcinst.ini

[ODBC]
Trace = Yes
TraceFile = /tmp/odbc.log

[HPVertica]
Description = HP Vertica ODBC Driver
Driver = /opt/vertica/lib64/libverticaodbc.so

More on the topic here:
http://stackoverflow.com/questions/24049173/how-connect-to-vertica-using-pyodbc
http://thisdwhguy.com/2014/04/29/vertica-odbc-error-messages-and-solutions/

Now we start ipython and test:

import os
import pyodbc

os.environ["VERTICAINI"]="/etc/odbc.ini" # vertica needs also this environment variable..

conn = pyodbc.connect("DSN=VerticaDB1;UID=dbadmin;PWD=PASSWD")
cursor=conn.cursor()

.. where PASSWD is the password we set earlier in this post. (**)

And Houston, we have a connection..!

3. Uploading hoards of data

Now we are ready to upload some useless data into Vertica, with Python.
We play around with “national oceanic and atmospheric administrations” data
http://www.noaa.gov/
Daily data from individual weather stations is available from here
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/
However, do not copy-paste that link.. there are too many files and your browser will stall..
Station ID’s can be found here:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt

Then, data of individual stations can be downloaded by:

wget “ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/STATIONID.dly”

We take data of this particular weather station to start with:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/USW00023183.dly
The data format is described here:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt

Well.. actually, this starts to be slightly off-topic. Let’s skip the details.

Well, in any case, finally, I found myself writing elaborate python classes, that issue those horrible SQL commands, i.e. the classes had methods like ..

def insert_from_file(self,colnames,filename):
    """
    |  bulk load data from a formatted file
    |  the most important method in the context of this homework
    """
    print "bulk uploading data to vertica"
    st="copy "+self.name+makevals(colnames)+"from '"+filename+"' delimiter ','; "
    self.execute(st)

In the end I did it like this..

1) Python class that reads that atmospheric data and saves it into a numpy array
2) That array is then written on the disk, and ..
3) An SQL command is issued via pyobdc to to tell Vertica to upload that file
4) To download data from Vertica, the same thing, but in reverse.

So, all you database experts.. was I supposed to do it this way? The job inverviewer never gave me his opinion – not even a phone call ..! (I don’t think it was that bad). I think what I am trying to do here, is called “object relational mapping”, right? You hide the ugly SQL syntax and just play with objects.

In a perfect world we could issue commands, only, and directly from python, for example, it would be cool to just upload a numpy array to Vertica, not having to do additional disk writes and fiddle with the SQL syntax that reminds me of .. say, the cobolt programming language.