Python文本数据脱敏

  在我们日常技术交流过程中,往往需要对原始数据进行脱敏处理,例如公司名称、个人身份证号等敏感信息。由此本文介绍2种字符串替换的方法,帮助大家更好的解决数据脱敏的相关问题。具体步骤如下:在D盘新建文件夹test,并在test文件夹中放入原始语料a1.txt,a2.txt,a3.txt
  Python文本数据脱敏程序如下:

# 字符串替换 replace()
infile = open("d:\\test\\a1.txt","r",encoding="utf-8").read()
new_infile = infile.replace("坐席","AAA") # replace函数替换
# 新建同名的空白文档覆盖原始文档
outfile = open("d:\\test\\a1.txt","w",encoding="utf-8")
# 将替换后的数据写入空白文档,注意str()
outfile.write(str(new_infile)) 
outfile.close()

# 正则表达式替换 import re 
import re
infile = open("d:\\test\\a1.txt","r",encoding="utf-8").read()
words = re.compile("坐席") # 关键词定位
new_infile = words.sub("AAA",infile) # 关键词替换
outfile = open("d:\\test\\a2.txt","w",encoding="utf-8")
outfile.write(str(new_infile))
outfile.close()
# 注意:正则表达式适用于复杂的业务替换场景,一般利用replace函数就能达到数据脱敏的需求

# 批量字符串替换 replace(),后续可自行尝试批量正则替换
import os
rootdir = "d:\\test\\"
for root, dirs, files in os.walk(rootdir):
    for i in files:
        i = root + i
        infile = open(i,"r",encoding="utf-8").read()
        new_infile = infile.replace("座席","AAA").replace("客户","BBB") # 同时替换“坐席”与“客户”
        outfile = open(i,"w",encoding="utf-8")
        outfile.write(str(new_infile))
        outfile.close()

分类: 自然语言处理