Formatting Speech Transcript Text
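27 Aug 2019
I keep running into the problem of formatting text transcribed from audio recordings, so I am writing this post for future reference.
Source corpus download: the dataformat folder
The Python formatting program is as follows: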
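# ==== Format a single file ====
infile_path = "d:\\811901011508591.txt"
outfile = "d:\\test1.txt"
# Read all lines of the GBK-encoded transcript
with open(infile_path, "r", encoding="gbk") as f:
    infile = f.readlines()
print(infile)
datas = []
# Skip the first line of the file
for i in infile[1:]:
    # Split each line into space-separated fields
    row = i.strip().split(" ")
    # Drop the first two fields and keep the rest
    datas.append(" ".join(row[2:]))
# Write the result as UTF-8, one record per line
with open(outfile, "w", encoding="utf-8") as f:
    print("\n".join(datas), file=f)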
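To make the effect of the single-file step concrete, here is a minimal sketch that applies the same transformation to lines held in memory. The function name `format_lines`, its parameters, and the sample lines are made up for illustration. The post does not show what the transcript files actually look like, so the "timestamp speaker text" layout below is only an assumption.

```python
def format_lines(lines, header_rows=1, drop_fields=2):
    """Drop the header rows, then drop the leading fields of each line."""
    datas = []
    for line in lines[header_rows:]:
        # Same logic as the script: split on single spaces, keep fields 2 onward
        row = line.strip().split(" ")
        datas.append(" ".join(row[drop_fields:]))
    return datas


# Hypothetical sample; real transcript files may look different
sample = [
    "811901011508591\n",
    "00:01 A hello there\n",
    "00:05 B good morning\n",
]
print(format_lines(sample))  # ['hello there', 'good morning']
```

A line with two or fewer fields becomes an empty string, which is also how the script behaves.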
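# ==== Format files in batch ====
import os
top = "d:\\test\\"
# Walk the directory tree and rewrite every file in place
for root, dirs, files in os.walk(top):
    for n in files:
        # os.path.join keeps the path valid inside subdirectories too
        n = os.path.join(root, n)
        with open(n, "r", encoding="gbk") as f:
            infile = f.readlines()
        datas = []
        for i in infile[1:]:
            row = i.strip().split(" ")
            datas.append(" ".join(row[2:]))
        # Overwrite the original file with the UTF-8 result
        with open(n, "w", encoding="utf-8") as f:
            print("\n".join(datas), file=f)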
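The batch loop above overwrites each file in place and re-encodes it from GBK to UTF-8, so a second run over the same folder would try to decode UTF-8 bytes as GBK and would strip another line and two more fields. One possible alternative, sketched below, writes the results into a separate output folder instead. The function name `format_tree` and its parameters are hypothetical and not part of the original script, and the sketch assumes the output folder lies outside the source folder.

```python
import os


def format_tree(src_dir, dst_dir, src_encoding="gbk"):
    """Format every file under src_dir and write the result under dst_dir."""
    written = []
    for root, dirs, files in os.walk(src_dir):
        # Mirror the subdirectory layout of src_dir under dst_dir
        rel = os.path.relpath(root, src_dir)
        out_root = os.path.normpath(os.path.join(dst_dir, rel))
        os.makedirs(out_root, exist_ok=True)
        for name in files:
            with open(os.path.join(root, name), "r", encoding=src_encoding) as f:
                lines = f.readlines()
            # Skip the first line, then drop the first two fields of each line
            datas = [" ".join(l.strip().split(" ")[2:]) for l in lines[1:]]
            out_path = os.path.join(out_root, name)
            with open(out_path, "w", encoding="utf-8") as f:
                print("\n".join(datas), file=f)
            written.append(out_path)
    return written
```

A call such as `format_tree("d:\\test", "d:\\test_out")` (the output path is only an example) leaves the source files untouched, so the conversion can be repeated safely.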
Category: Natural Language Processing