Diff between backup No.2 of chin2017-20200313 and the current version



#author("2020-03-06T00:21:06+00:00","default:f-lab","f-lab")
#author("2020-03-13T06:39:18+00:00","default:f-lab","f-lab")
[[seminar-personal/chin2017]]
|~Contents|
|#contents|
&br;
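*Recap [#p71808b9]
-[[Previous seminar>http://f-lab.mydns.jp/index.php?seminar-N-20200228]]
-Validity of CSV files
--0:OK
--1:NG (completely unusable)
--2:Very close to OK, but partly bad
*Exclusion process (manual review of 30,000 CSV files) [#xfd8f5a4]
-Load the CSV
-Display the first 10 rows
-Example: &ref(s.JPG);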
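--0:OK
--1:NG (completely unusable) -> no header row, header names consisting only of digits, or a header spanning the first two rows
--2:Partly bad (missing data) => contains NaN values (hard to judge, so treated as 1)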

--Source code
 # -*- coding: utf-8 -*-
 from gensim.models import Word2Vec
 import pandas as pd
 import pandas.io.common
 import numpy as np
 import codecs
 import random
 import os
 import getenc
 import time
 import re
 import sys
 import shutil
 pd.set_option('display.max_rows',None)
 pd.set_option('display.max_columns',None)  # show all columns instead of truncating with an ellipsis
 from progressbar import *
 start = time.time()
 print("CSV list reading...")
 csv_list_files="prep2_csv_list.csv"
 with codecs.open(csv_list_files, 'r','utf8') as f:
     csv_list =f.readlines()
 len_csv =29995#len(csv_list)
 print('The number of csv list:',len_csv)
 drop_list =[]
 for n in range(0,20):
     fname=csv_list[n].strip()  # strip surrounding whitespace
     if (os.path.exists(fname)):  # optional size limit: and os.path.getsize(fname) <5242880):  # 5MB
         moji_code=getenc.getEncode(fname)
         if moji_code is not None :
             with codecs.open(fname,'rb',moji_code) as f:
                 try:
                     df = pd.read_csv(f,delimiter =",",nrows =10)
                     #data = pd.read_csv(fname,nrows =10)
                     print(n)
                     print(fname)
                     print(df)
                     # manual verdict: 0: OK, 1: NG (completely unusable), 2: partly bad (missing/NaN data)
                     key_input = int(input())
                     drop_list.append([n,fname,key_input])  # record the verdict for every file
                     if key_input != 0:
                         # copy each excluded file to ./tmp under a flattened path name
                         dst = fname.replace("./", "@").replace("/", "--")
                         shutil.copy(fname, './tmp/'+dst)
                 # pandas exposes its exception and warning classes under pd.errors
                 except pd.errors.EmptyDataError:
                     drop_list.append([n,fname,1])
                     print("ERROR: {} is empty".format(fname))
                 except pd.errors.AbstractMethodError:
                     print("ERROR: {} is AbstractMethodError".format(fname))
                     drop_list.append([n,fname,1])
                 except pd.errors.DtypeWarning:
                     print("ERROR: {} is DtypeWarning".format(fname))
                     drop_list.append([n,fname,1])
                 except pd.errors.ParserError:
                     print("ERROR: {} is ParserError".format(fname))
                     drop_list.append([n,fname,1])
                 except pd.errors.ParserWarning:
                     print("ERROR: {} is ParserWarning".format(fname))
                     drop_list.append([n,fname,1])
                 except Exception as e:
                     print("%s 's error is %s"%(format(fname),format(e)))
                     drop_list.append([n,fname,1])
         else:
             drop_list.append([n,fname,1])
     else:
         drop_list.append([n,fname,1])
 df = pd.DataFrame(drop_list,columns=['index','path','flag'])
 df.to_csv("drop_list_all.txt",index=False,encoding="utf-8-sig")
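
Some of the criteria for flag 1 (digit-only header names, missing header cells, NaN values) could probably be pre-checked automatically before the manual pass. The following is only a minimal sketch of that idea, not part of the script above: `suggest_flag` is a hypothetical helper, the regular expression for "digit-only" is an assumption, and the "header spanning the first two rows" case is not detected at all. Flag 2 is folded into 1, following the note that NaN cases are hard to judge.

```python
import io
import re

import pandas as pd

# Assumed definition of a "digit-only" header name, e.g. "3", "-1", "2.5"
NUMERIC_NAME = re.compile(r"-?\d+(\.\d+)?")


def suggest_flag(df: pd.DataFrame) -> int:
    """Suggest a validity flag for a CSV preview (hypothetical helper).

    Returns 0 (looks OK) or 1 (NG). Partly bad files (flag 2) are
    reported as 1 because NaN cases are treated as 1 in the notes.
    """
    if df.empty:
        return 1  # nothing to work with
    names = [str(c).strip() for c in df.columns]
    # All header names numeric: a data row was probably read as the header.
    if all(NUMERIC_NAME.fullmatch(n) for n in names):
        return 1
    # pandas names blank header cells "Unnamed: <position>".
    if any(n.startswith("Unnamed:") for n in names):
        return 1
    # Any missing value in the preview counts as NG.
    if df.isna().any().any():
        return 1
    return 0


def preview(text: str) -> pd.DataFrame:
    """Read the first 10 rows of CSV text, like the manual review does."""
    return pd.read_csv(io.StringIO(text), nrows=10)


print(suggest_flag(preview("name,age\nalice,30\nbob,25\n")))
print(suggest_flag(preview("1,2,3\n4,5,6\n")))
print(suggest_flag(preview("name,age\nalice,\nbob,25\n")))
```

A check like this would only narrow down the files that need a human look; the final verdict would still come from the manual review.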
-\\10.200.11.9\home\N\chin\20190313\tmp
-Excluded 15,911 CSV files (out of 29,995 in total) -> more than half of the CSVs appear to be unusable
-Partial list: &ref(drop_list_all.txt);
-All excluded files (flag 1): &ref(drop_list.txt);
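
As a rough illustration of how the review log could be tallied, the sketch below counts files per flag and computes the excluded share. The sample rows are made up and only mimic the `index,path,flag` layout written by the script; `summarize` is a hypothetical helper, and counting every non-zero flag as excluded is an assumption. The last lines redo the arithmetic for the reported figures (15,911 of 29,995).

```python
import io

import pandas as pd

# Made-up rows in the same layout as drop_list_all.txt (index,path,flag)
sample = io.StringIO(
    "index,path,flag\n"
    "0,./a/x.csv,0\n"
    "1,./a/y.csv,1\n"
    "2,./b/z.csv,2\n"
    "3,./b/w.csv,1\n"
)
log = pd.read_csv(sample)


def summarize(log: pd.DataFrame):
    """Return (files per flag, number excluded, excluded share)."""
    counts = log["flag"].value_counts().sort_index().to_dict()
    excluded = int((log["flag"] != 0).sum())  # assumption: flags 1 and 2 are both excluded
    return counts, excluded, excluded / len(log)


counts, excluded, ratio = summarize(log)
print(counts, excluded, ratio)

# Share of excluded files for the figures reported in the notes
reported_ratio = 15911 / 29995
print(f"{reported_ratio:.1%}")  # 53.0%
```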