Description
Basic idea of the feature request
I was trying to read an encrypted file in pandas. As far as I know, there is no way to pass something to read_csv (or any other read_* function) so that the file is decrypted while it is being read, as opposed to decrypting afterwards with applymap functions as in this Stack Overflow thread.
The solution proposed in that thread seems quite slow for data larger than a few MB.
My solution has been to decrypt the file using the cryptography package and write the result to a temporary location (I am aware that there is room for improvement in the functions I propose below). This works, but it would be better to have an option in pandas to decrypt the input stream while reading. This would probably lead to:
- speed improvements, since I/O is reduced
- improved security, since decrypted (and thus potentially sensitive) data is never written to disk, even temporarily
Here is an example that reproduces the situation:
- The encrypt_data function is only here to reproduce the setting of having an encrypted file.
- It would be great to avoid the decrypt_data step and use read_csv directly with an extra argument.
import pandas as pd
from cryptography.fernet import Fernet
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'],
'income': [40000, 50000, 42000]})
df.to_csv("toto.csv")
def encrypt_data(path, key, outpath = None):
    if outpath is None:
        outpath = '{}_encrypted'.format(path)
    f = Fernet(key)
    # opening the original file to encrypt
    with open(path, 'rb') as file:
        original = file.read()
    # encrypting the file
    encrypted = f.encrypt(original)
    # opening the file in write mode and
    # writing the encrypted data
    with open(outpath, 'wb') as encrypted_file:
        encrypted_file.write(encrypted)
    print("file {} encrypted ; written at {} location".format(path, outpath))
def decrypt_data(path, key, outpath = None):
    if outpath is None:
        outpath = '{}_decrypted'.format(path)
    f = Fernet(key)
    # opening the encrypted file to decrypt
    with open(path, 'rb') as file:
        original = file.read()
    # decrypting the file
    decrypted = f.decrypt(original)
    # opening the file in write mode and
    # writing the decrypted data
    with open(outpath, 'wb') as dfile:
        dfile.write(decrypted)
    print("file {} decrypted ; written at {} location".format(path, outpath))
dummykey = Fernet.generate_key()
encrypt_data("toto.csv", dummykey, outpath = "toto_crypt.csv")
decrypt_data("toto_crypt.csv", dummykey, outpath = "toto_decrypt.csv")
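# reading the encrypted file directly does not give usable data
pd.read_csv("toto_crypt.csv")
# reading the decrypted copy recovers the data
pd.read_csv("toto_decrypt.csv")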
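As a point of comparison, the temporary file can already be avoided by decrypting in memory and handing read_csv a file-like object. The following is only a minimal sketch of that workaround, not the feature requested here: read_encrypted_csv is a hypothetical helper name, and it assumes the whole file is a single Fernet token that fits in memory.

```python
import io
import os
import tempfile

import pandas as pd
from cryptography.fernet import Fernet


def read_encrypted_csv(path, key, **kwargs):
    """Hypothetical helper: decrypt a Fernet-encrypted CSV in memory and parse it."""
    with open(path, "rb") as encrypted_file:
        token = encrypted_file.read()
    # Fernet decrypts a whole token at once, so the full plaintext sits in memory.
    decrypted = Fernet(key).decrypt(token)
    # read_csv accepts file-like objects, so nothing decrypted is written to disk.
    return pd.read_csv(io.BytesIO(decrypted), **kwargs)


# Round trip on the same toy data as above, inside a throwaway directory.
key = Fernet.generate_key()
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toto_crypt.csv")
    with open(path, "wb") as f:
        f.write(Fernet(key).encrypt(df.to_csv(index=False).encode("utf-8")))
    result = read_encrypted_csv(path, key)
```

This still reads and decrypts the whole file before parsing starts, so it addresses the security point more than the speed point.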
A possible approach
Let's say we call this argument encryption. We could provide an object from cryptography to decode the data stream directly in the pd.read_csv call. For instance:
pd.read_csv("toto_crypt.csv", encryption = Fernet(dummykey))
The same approach could be used for to_csv (or other writing functions) to write encrypted data directly to disk.
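For the writing side, a rough sketch of what such an option might do internally, using only existing calls: serialize to CSV in memory, encrypt, and write only the ciphertext. The helper name to_encrypted_csv is hypothetical, and the sketch assumes Fernet and a frame small enough to serialize in memory.

```python
import os
import tempfile

import pandas as pd
from cryptography.fernet import Fernet


def to_encrypted_csv(df, path, key, **kwargs):
    """Hypothetical helper: write a DataFrame as a Fernet-encrypted CSV."""
    # Called without a path, to_csv returns the CSV content as a string.
    csv_text = df.to_csv(**kwargs)
    token = Fernet(key).encrypt(csv_text.encode("utf-8"))
    # Only the encrypted token ever reaches the disk.
    with open(path, "wb") as encrypted_file:
        encrypted_file.write(token)


key = Fernet.generate_key()
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toto_crypt.csv")
    to_encrypted_csv(df, path, key, index=False)
    with open(path, "rb") as f:
        token = f.read()

# Decrypting the stored token gives back the plain CSV bytes.
plaintext = Fernet(key).decrypt(token)
```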
However, this solution might require the python engine. Directly providing the key and the encryption method (e.g. Fernet) may work better with the C engine (I am not familiar with C, but there is probably an equivalent to the method I applied in Python).
API breaking implications
As far as I understand how I/O works, this extra argument would not break any existing code as long as it defaults to None.
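To illustrate the backward-compatibility point, here is a small sketch of a wrapper that emulates the proposed argument outside pandas. The function read_csv_with_encryption and its encryption parameter are assumptions for illustration only; pandas has no such argument. Any object exposing decrypt(bytes) -> bytes, such as a Fernet instance, would work here.

```python
import io
import os
import tempfile

import pandas as pd
from cryptography.fernet import Fernet


def read_csv_with_encryption(path, encryption=None, **kwargs):
    """Hypothetical wrapper emulating the proposed `encryption` argument."""
    if encryption is None:
        # Default path: identical to a plain pd.read_csv call.
        return pd.read_csv(path, **kwargs)
    with open(path, "rb") as encrypted_file:
        decrypted = encryption.decrypt(encrypted_file.read())
    return pd.read_csv(io.BytesIO(decrypted), **kwargs)


key = Fernet.generate_key()
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
with tempfile.TemporaryDirectory() as tmpdir:
    plain_path = os.path.join(tmpdir, "toto.csv")
    crypt_path = os.path.join(tmpdir, "toto_crypt.csv")
    df.to_csv(plain_path, index=False)
    with open(plain_path, "rb") as src, open(crypt_path, "wb") as dst:
        dst.write(Fernet(key).encrypt(src.read()))
    # Existing calls that omit the argument behave as before.
    plain = read_csv_with_encryption(plain_path)
    # New calls opt in explicitly.
    secret = read_csv_with_encryption(crypt_path, encryption=Fernet(key))
```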