Pages

Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

Friday, February 1, 2019

Fast import of billion lines to postgresql

This is the fastest way I've found about importing huge (3+ billion lines) text files in csv-like format in PostgreSQL. It requires python3 and psycopg2

#!/usr/bin/python3
#
# initial stuff by Cac Ko <email@gdpr_protected_stuff> 
# additional by <geroy@horizon9.org>
#
# import very large cvs-like file in postgre database
# 01.02.2019
#
# requires psycopg2 postgresql lib for python
# file format is:
# ----------------------------
# something12:otherthing2345
# something7134:otherthing4243
# .....
# License: Public Domain

import os
import sys
import psycopg2
from psycopg2.extras import execute_values

# use file in tmpfs for faster read/write
POS_FILE = '/run/.position'

# get current position from POS_FILE location
def getPosition():
    result = 0
    if os.path.exists(POS_FILE):
        with open(POS_FILE, 'rb') as f:
            try:
                result = int(f.read())
                print("result = %d" % result)
            except Exception as error: 
                print("Error read():", error)
                result = 0
    return result

# write to this file after each commit()
def storePosition(pos):
    with open(POS_FILE, 'w') as f:
        f.write(str(pos))

# uncomment the following 3 lines if you want perentage / position on stdout
#    y = pos/whole*100
#    print('{0:.2f}%'.format(y), end = '')
#    print(" pos=%s" % (str(pos)))


# store stuff in database
def storeInDb(line, bulk_data):
#    print(line)
    x = []
    x = line.split(':')
    sql = b'INSERT into secrets (somestuff1, somestuff2) VALUES %s'
    try:
        execute_values(cur, sql, bulk_data)
        connection.commit()
    except Exception as error:
        print("Error pri INSERT", error)
        pass

if __name__ == '__main__':
    try:
        if len(sys.argv) < 2:
            print('usage: ./large-file-import.py filename-to-import.csv')
            exit
        f = open(sys.argv[1], 'r', buffering=2000000, errors='replace')
        whole = os.path.getsize(sys.argv[1])
        pos = getPosition()
        print("Start reading from pos=%d" % pos)
        f.seek(pos)
        line = f.readline()
        try:
            bulk_count = 0;
            connection = psycopg2.connect(
                user="postgres", password="somestuff", host="127.0.0.1", port="5433",
database="postgres") cur = connection.cursor() bulk_data = [] split_line = [] while line: split_line = line.split(':') try: l = tuple(split_line) bulk_data.append(l) except (Exception) as error: print(error) pass storePosition(f.tell()) line = f.readline() bulk_count = bulk_count + 1 # store and commit in db after X lines if bulk_count == 9000: storeInDb(line,bulk_data) bulk_count = 0; bulk_data = [] #connection.commit() except (Exception, psycopg2.Error) as error: print("Error while connecting to PostgreSQL", error) finally: if(connection): cur.close() connection.close() print("PostgreSQL connection is closed") except KeyboardInterrupt: storePosition(f.tell()) if(connection): cur.close() connection.close() print("PostgreSQL connection is closed")
I am too lazy to explain how this script works and probably there are some errors but it works for me.


Tuesday, March 31, 2015

Compiling and using PyNaCl on Windows 7

Step-by-step how to compile PyNaCl Python package on win32.

Requirements:
Python 2.7.x for Windows - download it from here.
libsodium-1.0.2-msvc - download precompiled binaries from here.
Microsoft Visual C++ Compiler for Python 2.7 - you can download MS C++ compiler for Pyhton from here.

This guide is based on https://github.com/pyca/pynacl/issues/100 and and is a focused reiteration of it.

Python for Windows is compiled with MSVC and because of that it is not possible to compile extensions with MinGW/MSYS. The main problem is with CFFI (Common Foreign Function Interface) and most probably if you succeed with compiling, the lib will not work (will hang if trying to use its functions)

1. Download and install Python for Windows

2. Download and install Microsoft Visual C++ Compiler for Python 2.7 

3. Install python setuptools (needed for MSVC++ Python) with pip:
C:> pip install setuptools
4. Download libsodium latest release with -msvc at the end. Unzip it in C:\work\libsodium-1.0.2-msvc
5. Go to C:\work\libsodium-1.0.2-msvc\Win32\Release\v120\dynamic and rename libsodium.lib to sodium.lib. On 64bit Windows use this dir: C:\work\libsodium-1.0.2-msvc\x64\Release\v120\dynamic

6. Download PyNaCl source and unzip it to C:\work\PyNaCl-0.3.0

7. Start the MSVC++ Python shell: Start -> All Programs -> Microsoft Visual C++ Compiler Package for Python 2.7 -> Visual C++ 2008 32-bit Command Prompt

8. Set these 3 variables in cmd prompt:
set INCLUDE=%INCLUDE%C:\work\libsodium-1.0.2-msvc\include
set LIB=%LIB%C:\work\libsodium-1.0.2-msvc\Win32\Release\v120\dynamic
set SODIUM_INSTALL=system


on 64bit Windows change:
set LIB=%LIB%C:\work\libsodium-1.0.2-msvc\Win32\Release\v120\dynamic
to:
set LIB=%LIB%C:\work\libsodium-1.0.2-msvc\x64\Release\v120\dynamic
9. Probably you will need these two files when compiling:
Download and copy them to: C:\work\libsodium-1.0.2-msvc\include

10. Start the building process:
C:\>cd C:\work\PyNaCl-0.3.0
C:\work\PyNaCl-0.3.0>python setup.py build
11. If everything is ok, then install it:
C:\work\PyNaCl-0.3.0>python setup.py install
12. Finally, copy the original libsodium.dll in PyNaCl install dir:
C:>copy C:\work\libsodium-1.0.2-msvc\Win32\Release\v120\dynamic\libsodium.dll C:\Python27\Lib\site-packages\PyNaCl-0.3.0-py2.7-win32.egg\nacl\_lib
Here is a test program (from doc examples https://pynacl.readthedocs.org/en/latest/public/) slightly modified:
import nacl.utils
from
nacl.public import PrivateKey, Box


skbob = PrivateKey.generate()
pkbob = skbob.public_key  
skalice = PrivateKey.generate()
pkalice = skalice.public_key 
bob_box = Box(skbob, pkalice)
message = b"Kill all humans"
nonce = nacl.utils.random(Box.NONCE_SIZE)

encrypted = bob_box.encrypt(message, nonce)
print "Encrypted Message:", encrypted
alice_box = Box(skalice, pkbob)

plaintext = alice_box.decrypt(encrypted)
print "Plaintext Message:", plaintext

If it works, you should see something like:
C:\work>python nacltest.py
Encrypted Message: ₧╫fαIé├l(α▀W¬½♥↔≈‼╟  üRδD≈é☻'^∞v√oòΣls╣8,ƒ   ↓ü↓╓+ô╓è╣=§╣
Plaintext Message: Kill all humans
C:\work>

Friday, October 18, 2013

Importing CSV file into django model.

Lets assume that we have the following django model:
class Person(models.Model):
    PersonID = models.CharField(max_length=10)
    FirstName = models.CharField(max_length=30)
    LastName = models.CharField(max_length=30)
    Address = models.CharField(max_length=30)

and we have a csv file with 4 elements on each row and we want to import them in our django database. Here is example (delimeter is !):
1!Nikolay!Hristov!Bulgaria, Gabrovo, Test street 18 
What we need to do is to make a view and attach it to certain url (for example http://localhost/import_db/). Here is the view:
def import_db(request):
   
    f = open('/path/to/filename-with-data.csv', 'r'
    for line in f:
        line =  line.split('!')
        tmp = Person.objects.create()
        tmp.PersonID = line[0]
        tmp.FirstName = line[1]
        tmp.LastName = line[2]
        tmp.Address = line[3]
        tmp.save()

    f.close()
Now all we have to do is to point our browser to http://localhost/import_db/ and wait for data to be imported.