Decrypting file in read_csv #44097

@linogaliana

Basic idea of the feature request

I was trying to read an encrypted file in pandas. As far as I know, there is no way to pass something to read_csv (or any other read_* function) to decrypt a file while reading it, as opposed to decrypting after the fact with applymap as in this Stack Overflow thread.

The solution proposed in that thread seems quite slow for data larger than a few MB.

My solution has been to decrypt the file using the cryptography package and write the result to a temporary location (there's room for improvement in the functions I propose below, I am aware of that). This works, but I was hoping pandas could offer an option to decrypt the input stream while reading. This would probably lead to:

  • speed improvements, since you reduce the I/O
  • improved security, since you don't write decrypted (and thus potentially sensitive) data to disk, even temporarily

Here is an example that reproduces the use case:

  1. encrypt_data is only here to reproduce the situation of having an encrypted file
  2. It would be great to skip the decrypt_data step and call read_csv directly with an extra argument.
import pandas as pd
from cryptography.fernet import Fernet

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})
df.to_csv("toto.csv")

def encrypt_data(path, key, outpath = None):

    if outpath is None:
        outpath = '{}_encrypted'.format(path)
        
    f = Fernet(key)
    # opening the original file to encrypt
    with open(path, 'rb') as file:
        original = file.read()
    # encrypting the file
    encrypted = f.encrypt(original)  
    # opening the file in write mode and 
    # writing the encrypted data
    with open(outpath, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)
    
    print("file {} encrypted ; written at {} location".format(path, outpath))
    


def decrypt_data(path, key, outpath = None):
    if outpath is None:
        outpath = '{}_decrypted'.format(path)
    f = Fernet(key)
    # opening the encrypted file to decrypt
    with open(path, 'rb') as file:
        encrypted = file.read()
    decrypted = f.decrypt(encrypted)
    # opening the output file in write mode and
    # writing the decrypted data
    with open(outpath, 'wb') as dfile:
        dfile.write(decrypted)
    print("file {} decrypted ; written at {} location".format(path, outpath))


dummykey = Fernet.generate_key()
encrypt_data("toto.csv", dummykey, outpath = "toto_crypt.csv")
decrypt_data("toto_crypt.csv", dummykey, outpath = "toto_decrypt.csv")


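# reading the encrypted file directly does not give back the original data
pd.read_csv("toto_crypt.csv")
# reading the decrypted copy works, but requires the intermediate file
pd.read_csv("toto_decrypt.csv")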
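
For comparison, here is a minimal sketch of a workaround that avoids the temporary file: decrypt to bytes and give read_csv an in-memory buffer instead of a path. It relies only on read_csv accepting file-like objects. The whole decrypted payload still has to fit in memory, so this is not the streaming option requested here, just a way to keep plaintext off the disk. The variable names are illustrative.

```python
import io

import pandas as pd
from cryptography.fernet import Fernet

# Build a small encrypted CSV payload in memory (stand-in for toto_crypt.csv).
key = Fernet.generate_key()
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
token = Fernet(key).encrypt(df.to_csv(index=False).encode("utf-8"))

# Decrypt to bytes, then hand pandas a file-like object instead of a path.
plaintext = Fernet(key).decrypt(token)
df_roundtrip = pd.read_csv(io.BytesIO(plaintext))
```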

A possible approach

Let's say we call this argument encryption. We could pass an object from cryptography to decrypt the data stream directly in the pd.read_csv call. For instance:

pd.read_csv("toto_crypt.csv", encryption = Fernet(dummykey))

The same approach could be used for to_csv (or other writing functions) to write encrypted data directly to disk.
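
As a rough illustration of what the writing side can look like today, the sketch below serializes the DataFrame to a CSV string in memory, encrypts it, and writes only the ciphertext. The helper name to_csv_encrypted is hypothetical and not part of pandas; it only relies on to_csv returning a string when no path is given.

```python
import os
import tempfile

import pandas as pd
from cryptography.fernet import Fernet


def to_csv_encrypted(df, path, fernet, **kwargs):
    """Hypothetical helper (not a pandas API): write df as an encrypted CSV."""
    # With no path argument, to_csv returns the CSV content as a str.
    csv_text = df.to_csv(**kwargs)
    with open(path, "wb") as fh:
        fh.write(fernet.encrypt(csv_text.encode("utf-8")))


key = Fernet.generate_key()
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
out_path = os.path.join(tempfile.mkdtemp(), "toto_crypt.csv")
to_csv_encrypted(df, out_path, Fernet(key), index=False)

# Check the round trip: decrypt the file content back to CSV text.
with open(out_path, "rb") as fh:
    recovered = Fernet(key).decrypt(fh.read()).decode("utf-8")
```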

However, this solution might require the python engine. Passing the key and the encryption method (e.g. Fernet) directly may work better with the C engine (I am not familiar with C, but there is probably an equivalent to the method I applied in Python).

API breaking implications

As far as I understand how I/O works, this extra argument would not break any existing code if it defaults to None.
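
To make the default-to-None behaviour concrete, here is a sketch written as a user-level wrapper rather than a change to pandas. The function name read_csv_maybe_encrypted and its encryption parameter are assumptions for illustration; encryption is not an existing read_csv argument.

```python
import io
import os
import tempfile

import pandas as pd
from cryptography.fernet import Fernet


def read_csv_maybe_encrypted(path, encryption=None, **kwargs):
    """Hypothetical wrapper; `encryption` is not a real pandas argument."""
    if encryption is None:
        # Default path: behaves exactly like a plain read_csv call.
        return pd.read_csv(path, **kwargs)
    with open(path, "rb") as fh:
        plaintext = encryption.decrypt(fh.read())
    return pd.read_csv(io.BytesIO(plaintext), **kwargs)


# Prepare one plain and one encrypted copy of the same data.
key = Fernet.generate_key()
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "toto.csv")
crypt_path = os.path.join(tmpdir, "toto_crypt.csv")
df.to_csv(plain_path, index=False)
with open(plain_path, "rb") as src, open(crypt_path, "wb") as dst:
    dst.write(Fernet(key).encrypt(src.read()))

df_plain = read_csv_maybe_encrypted(plain_path)
df_crypt = read_csv_maybe_encrypted(crypt_path, encryption=Fernet(key), engine="c")
```

In this sketch the decryption happens in Python before parsing starts, so the engine="c" call still goes through the C parser. A real implementation inside pandas could be designed differently.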
