2017-10-30

[pandas] Reading and writing CSV with pandas.DataFrame

Machine Learning Python

Notes on moving data between a pandas DataFrame and CSV files.

import pandas as pd

# CSV -> dataframe
df = pd.read_csv(input_csv_path)

# dataframe -> CSV
df.to_csv(output_csv_path)
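As a rough sketch of the round trip: by default to_csv() also writes the row index as an extra column, so index=False is often what you want. The column names and values below are made up for illustration, and an in-memory buffer stands in for the CSV paths.

```python
import io

import pandas as pd

# Illustrative data (not from the original post)
df = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})

# dataframe -> CSV; index=False omits the row-index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# CSV -> dataframe
restored = pd.read_csv(io.StringIO(csv_text))
print(list(restored.columns))
print(restored["score"].tolist())
```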

2017-10-28

[Git] Renaming a remote repository

Git

To rename a remote repository,
first change the name in the browser on GitHub: repository -> Settings.

Then, in the local repository, update the URL in the .git/config file.

2017-10-28

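[numpy] Notes on joining arrays with np.hstack(tup)

Python Machine Learning

np.hstack() joins arrays horizontally (along axis=1 for 2-D arrays).
Arrays a and b can be joined if they differ only along axis=1.
That is, only a.shape[1] and b.shape[1] may differ; if all other elements of a.shape and b.shape are equal, the arrays can be joined.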
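The shape rule can be checked with a minimal sketch; the array shapes are arbitrary illustration values.

```python
import numpy as np

a = np.zeros((2, 3))
b = np.ones((2, 5))

# Only axis=1 differs (3 vs 5), so the arrays can be joined
joined = np.hstack((a, b))
print(joined.shape)

# Here the number of rows differs, so np.hstack raises ValueError
c = np.ones((3, 5))
try:
    np.hstack((a, c))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```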

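Other joining functions apparently include column_stack() for horizontal joining, dstack() for depth-wise joining, and vstack() and row_stack() for vertical joining.

np.c_[], which I used when building grid points with np.meshgrid(), also joins arrays.
In this case a and b must have the same size.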

> a = np.arange(8).reshape((2,4))
> b = np.arange(100,900,100).reshape(2,4)

> print(a)
[[0 1 2 3]
 [4 5 6 7]]

> print(b)
[[100 200 300 400]
 [500 600 700 800]]

> print(np.r_[a, b])
[[  0   1   2   3]
 [  4   5   6   7]
 [100 200 300 400]
 [500 600 700 800]]

> print(np.c_[a, b])
[[  0   1   2   3 100 200 300 400]
 [  4   5   6   7 500 600 700 800]]

Where there is joining there is also splitting: apparently there is np.split(), but I will not study it this time _(:3」∠)_

[References]
・http://python-remrin.hatenadiary.jp/entry/concatenate
・https://deepage.net/features/numpy-stack.html

2017-10-28

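Style transfer algorithm roundup

Deep Learning Machine Learning

Using the currently popular deep learning style transfer algorithm, which applies the style of image A to image B, I experimented with a variety of styles and summarize the results here.

Van Gogh style (somewhat unsuccessful)
f:id:umashika5555:20171028031241j:plain
Mucha style (somewhat unsuccessful)
f:id:umashika5555:20171028031314j:plain
Monet style (fairly good)

Picasso style (quite unsuccessful)
f:id:umashika5555:20171028031421j:plain

For Picasso I used "The Weeping Woman", an image with many straight lines.
A style with many straight lines may be hard to carry over.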

2017-10-28

[Python] About help()

Python Machine Learning

When looking up a class, the help() function is handy because the documentation can be read inside Jupyter Notebook.

import sklearn
import sklearn.linear_model
help(sklearn.linear_model.Perceptron)

An explanation like the following is printed.

Help on class Perceptron in module sklearn.linear_model.perceptron:

class Perceptron(sklearn.linear_model.stochastic_gradient.BaseSGDClassifier, sklearn.feature_selection.from_model._LearntSelectorMixin)
 |  Perceptron
 |  
 |  Read more in the :ref:`User Guide <perceptron>`.
 |  
 |  Parameters
 |  ----------
 |  
 |  penalty : None, 'l2' or 'l1' or 'elasticnet'
 |      The penalty (aka regularization term) to be used. Defaults to None.
 |  
~~~~~~~~~~~~~~~~~~~~~~~~~~~
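help() is built on the standard pydoc module, so the same text can also be captured as a string, for example to search it. A minimal sketch, using the built-in dict class instead of Perceptron so that it runs without scikit-learn:

```python
import pydoc

# Render the documentation that help(dict) would display,
# as plain text without terminal formatting
text = pydoc.render_doc(dict, renderer=pydoc.plaintext)

# Show only the first few lines
print("\n".join(text.splitlines()[:5]))
```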

2017-10-28

Evaluating classification results

Python Machine Learning

Given training data X, y on hand,
split it into training data and test data: X_train, X_test, y_train, y_test.

from sklearn.model_selection import train_test_split
# Split into training data and test data
# Use 30% of the whole as test data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)

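Next, to evaluate the results, print the number of misclassified samples and the accuracy.

# Number of samples the classifier misclassified (y_pred holds the predictions for X_test)
print("Misclassified samples: %d"%((y_test != y_pred).sum()))
# Accuracy of the classifier
from sklearn.metrics import accuracy_score
print("Accuracy: %.2f"%accuracy_score(y_test,y_pred))

The operations related to misclassification look like this:
f:id:umashika5555:20171028023522p:plain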
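The misclassification count and the accuracy can be checked by hand on tiny arrays. This sketch uses only numpy and made-up labels; the accuracy is computed directly as the fraction of matching labels rather than through scikit-learn.

```python
import numpy as np

# Made-up labels for illustration
y_test = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Element-wise comparison gives a boolean array; True counts as 1 in sum()
misclassified = int((y_test != y_pred).sum())

# Accuracy is the fraction of labels that match
accuracy = float((y_test == y_pred).mean())

print("Misclassified samples: %d" % misclassified)
print("Accuracy: %.2f" % accuracy)
```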

2017-10-26

[numpy] meshgrid notes

Python Machine Learning

Given a function f([a,b]), it seems to be a common convention in numpy and matplotlib that f([[a,b], [c,d]]) returns the same result as [f([a,b]), f([c,d])].

>>> import numpy as np
>>> x = np.array([1,2,3])# l * m
>>> y = np.array([10,11])# n * k
>>> xx, yy = np.meshgrid(x,y)
>>> xx
array([[1, 2, 3],
       [1, 2, 3]])
>>> yy
array([[10, 10, 10],
       [11, 11, 11]])

xx, yy = np.meshgrid(x,y) creates two arrays of shape (len(y), len(x)).
Extending this to 2-D input arrays gives

>>> x = np.array([[10,11],[3,4],[5,6]])#3*2
>>> y = np.array([[1,2,3,4],[5,6,7,8]])#2*4
>>> xx, yy = np.meshgrid(x,y)
>>> xx
array([[10, 11,  3,  4,  5,  6],
       [10, 11,  3,  4,  5,  6],
       [10, 11,  3,  4,  5,  6],
       [10, 11,  3,  4,  5,  6],
       [10, 11,  3,  4,  5,  6],
       [10, 11,  3,  4,  5,  6],
       [10, 11,  3,  4,  5,  6],
       [10, 11,  3,  4,  5,  6]])
>>> yy
array([[1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5, 5],
       [6, 6, 6, 6, 6, 6],
       [7, 7, 7, 7, 7, 7],
       [8, 8, 8, 8, 8, 8]])

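The result has shape (y.size, x.size) = (8, 6): np.meshgrid flattens each input to 1-D first, so the dimensions are the total element counts of y and x.
The corresponding elements of xx and yy cover every combination exactly once.
This can be thought of as generating grid points.
So the usage above does not really fit the purpose; passing arrays generated with np.arange() or similar is the proper way to build grid points.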
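A short check of the shapes: np.meshgrid flattens its inputs, so the outputs have shape (y.size, x.size), and with np.arange inputs the result is a regular grid. The ranges and spacing below are arbitrary illustration values.

```python
import numpy as np

# 2-D inputs are flattened before the grid is built
x = np.array([[10, 11], [3, 4], [5, 6]])      # 6 elements
y = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])    # 8 elements
xx, yy = np.meshgrid(x, y)
print(xx.shape, yy.shape)

# Typical use: 1-D coordinate vectors from np.arange
gx, gy = np.meshgrid(np.arange(0, 1, 0.25), np.arange(0, 0.5, 0.25))
print(gx.shape)
print(gx[0])
print(gy[:, 0])
```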

What I saw in a sample program was

h = 0.02  # spacing between grid points
xx, yy = np.meshgrid(np.arange(x.min()-1, x.max()+1, h), np.arange(y.min()-1, y.max()+1, h))  # x, y are np.array()
Z = clf.predict(np.c_[xx.ravel(),yy.ravel()]).reshape(xx.shape)
out = ax.contourf(xx,yy,Z,**params)

This fills in the decision regions of a classification.

Z = clf.predict(np.c_[xx.ravel(),yy.ravel()]).reshape(xx.shape)

Let's follow this processing one step at a time.

>>> x = np.array([1,2,3,4])
>>> y = np.array([5,6,7,8])
>>> xx, yy = np.meshgrid(x,y)
>>> xx
array([[1, 2, 3, 4],[1, 2, 3, 4],[1, 2, 3, 4],[1, 2, 3, 4]])
>>> yy
array([[5, 5, 5, 5],[6, 6, 6, 6],[7, 7, 7, 7],[8, 8, 8, 8]])
# xx.ravel() flattens the array into a single 1-D vector.
>>> xx.ravel()
array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])
>>> yy.ravel()
array([5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8])
# np.c_[] concatenates along the second axis (np.meshgrid() guarantees that xx and yy have the same shape).
# This produces an array whose elements are the grid points.
>>> np.c_[xx.ravel(),yy.ravel()]
array([[1, 5],
       [2, 5],
       [3, 5],
       [4, 5],
       [1, 6],
       [2, 6],
       [3, 6],
       [4, 6],
       [1, 7],
       [2, 7],
       [3, 7],
       [4, 7],
       [1, 8],
       [2, 8],
       [3, 8],
       [4, 8]])
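# In 2-D space these can be regarded as a set of test points, so pass them to predict() of the classifier clf defined earlier
# The class label for each grid point is returned
>>> clf.predict(np.c_[xx.ravel(),yy.ravel()])
array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1])
# Finally, reshape to the shape of xx
>>> clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)  # 4*4 array
array([[0, 1, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 1]])
# Now the elements of xx, yy, and Z correspond, so the class of each point on the 2-D plot is known.
# Draw the regions with plt.pcolormesh(xx,yy,Z,...)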
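The whole chain can be run without a trained model by swapping in a stand-in for clf.predict. In this sketch the predict function and its rule (label 1 where x + y > 9) are hypothetical and chosen only so that the shapes can be inspected.

```python
import numpy as np

def predict(points):
    # Hypothetical stand-in for clf.predict: label 1 where x + y > 9, else 0
    return (points[:, 0] + points[:, 1] > 9).astype(int)

x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
xx, yy = np.meshgrid(x, y)

# One row per grid point: column 0 is x, column 1 is y
grid = np.c_[xx.ravel(), yy.ravel()]

# One label per grid point, reshaped back onto the grid
Z = predict(grid).reshape(xx.shape)

print(grid.shape)
print(Z)
```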

deepage.net
The figures in this blog are easy to understand.

[References]
https://deepage.net/features/numpy-meshgrid.html
http://kaisk.hatenadiary.com/entry/2014/11/05/041011
https://qiita.com/sotetsuk/items/d0e73afdcffdc8ac3e6b
https://qiita.com/ynakayama/items/3250452949102840e624

2017-10-25

Colormaps available in matplotlib

Python Machine Learning

https://matplotlib.org/examples/color/colormaps_reference.html
I tried applying the colormaps listed here to a k-NN classification plot.
With three colors, I personally find jet and prism easy to read.
The colormap definitions appear to be structured as [("A",["color1","color2"]),...].

cmaps = [('Perceptually Uniform Sequential', [
            'viridis', 'plasma', 'inferno', 'magma']),
         ('Sequential', [
            'Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds',
            'YlOrBr', 'YlOrRd', 'OrRd', 'PuRd', 'RdPu', 'BuPu',
            'GnBu', 'PuBu', 'YlGnBu', 'PuBuGn', 'BuGn', 'YlGn']),
         ('Sequential (2)', [
            'binary', 'gist_yarg', 'gist_gray', 'gray', 'bone', 'pink',
            'spring', 'summer', 'autumn', 'winter', 'cool', 'Wistia',
            'hot', 'afmhot', 'gist_heat', 'copper']),
         ('Diverging', [
            'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu',
            'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic']),
         ('Qualitative', [
            'Pastel1', 'Pastel2', 'Paired', 'Accent',
            'Dark2', 'Set1', 'Set2', 'Set3',
            'tab10', 'tab20', 'tab20b', 'tab20c']),
         ('Miscellaneous', [
            'flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern',
            'gnuplot', 'gnuplot2', 'CMRmap', 'cubehelix', 'brg', 'hsv',
            'gist_rainbow', 'rainbow', 'jet', 'nipy_spectral', 'gist_ncar'])]
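Because the definition is a list of (category, names) pairs, it can be unpacked directly in a loop. A minimal sketch on a shortened copy of the list; each name is a string that can be passed as the cmap argument of matplotlib plotting functions.

```python
# Shortened copy of the cmaps list above, for illustration
cmaps = [
    ("Perceptually Uniform Sequential", ["viridis", "plasma"]),
    ("Miscellaneous", ["prism", "jet"]),
]

# Each entry unpacks into a category label and a list of colormap names
for category, names in cmaps:
    print(category, "->", ", ".join(names))

# Flatten to a single list of names, e.g. to loop over when plotting
all_names = [name for _, names in cmaps for name in names]
print(all_names)
```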

f:id:umashika5555:20171025224750p:plain f:id:umashika5555:20171025224802p:plain f:id:umashika5555:20171025224900p:plain f:id:umashika5555:20171025225004p:plain f:id:umashika5555:20171025225102p:plain f:id:umashika5555:20171025225200p:plain

[References]
https://qiita.com/mommonta3/items/cea310b2c36a01b970a6
https://matplotlib.org/examples/color/colormaps_reference.html

2017-10-25

[numpy] padding notes

Python

#--- 1-D array ---
> a = [2,3]
> np.pad(a,[1,0],"constant")
array([0, 2, 3])  # zero-pad 1 element at the front, 0 at the end
> np.pad(a,[1,2],"constant")
array([0, 2, 3, 0, 0])  # zero-pad 1 element at the front, 2 at the end

#--- 2-D array ---
> a = [[1,2],[3,4]]
> np.pad(a,[(1,2),(3,4)],"constant")
array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 2, 0, 0, 0, 0],
       [0, 0, 0, 3, 4, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])  # zero-pad 1 row before, 2 rows after, 3 columns before, 4 columns after
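As a runnable variant of the 2-D case, this sketch also shows the constant_values argument, which changes the fill value from the default 0. The fill value -1 is an arbitrary choice for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# (before, after) per axis: rows get (1, 2), columns get (3, 4)
padded = np.pad(a, [(1, 2), (3, 4)], "constant")
print(padded.shape)

# Pad one element on every side with -1 instead of 0
filled = np.pad(a, 1, "constant", constant_values=-1)
print(filled)
```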

2017-10-25

[numpy] Flattening a multidimensional array to 1-D

Python

> x = np.arange(16).reshape(4, 4)
> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

# Method 1
> x.reshape(-1,)
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])


# Method 2
> np.ravel(x)
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

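If the multidimensional array is an np.ndarray, ravel() can also be called as a method:

>a = np.array([[1,2],[3,4]])
>a.ravel()
array([1, 2, 3, 4])
>np.ravel(a)
array([1, 2, 3, 4])

>b = [[1,2],[3,4]]
>b.ravel()
# Error: a list has no ravel() method
>np.ravel(b)
array([1, 2, 3, 4])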
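One related point worth knowing when choosing a method: ravel() returns a view of the original data when the memory layout allows it, while flatten() (not covered in the notes above) always returns a copy. A minimal sketch:

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

flat_view = x.ravel()      # view of x when the memory layout allows it
flat_copy = x.flatten()    # always an independent copy

flat_view[0] = 100         # also changes x[0, 0]
flat_copy[1] = 200         # x[0, 1] is unaffected

print(x[0, 0], x[0, 1])
print(np.shares_memory(x, flat_view), np.shares_memory(x, flat_copy))
```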

2017-10-23

git setup notes

Notes on the sites I referred to when setting up git

Getting started

https://qiita.com/ay3/items/8d758ebde41d256a32dc
https://qiita.com/KosukeQiita/items/cf39d2922b77ac93f51d

SSH setup

https://qiita.com/shizuma/items/2b2f873a0034839e47ce
https://qiita.com/drapon/items/441e18452b25060d61f1
http://monsat.hatenablog.com/entry/generating-ssh-keys-for-github

Config file settings

# show settings
git config --list
# add a setting
git config key value
# remove a setting
git config --unset key

Editing the config file

emacs .git/config

2017-10-22

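[scikit-learn] Japanese-language references on clustering methods

Machine Learning

Notes on Japanese-language references that help when reading the clustering section of the scikit-learn documentation

[shell] Notes on frequently used commands

Linux

Batch-rename the .apk files in a directory to .zip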

$for filename in *.apk;do
>mv "$filename" "${filename%.apk}.zip";
>done

Note that if $filename is not enclosed in "", the command may fail for file names containing whitespace.

Rename the images in a directory to sequential numbers

ls *.jpg | awk '{ printf "mv %s %03d.jpg\n", $0, NR }' | sh

Move only the zip files in the current directory to another directory

for file in *.zip;do
    mv "$file" ~/other_directory/;
done;

Disk usage of each subdirectory in the current directory

du -sh ./*/

Scrape the CVPR2017 papers

wget -r -l 1 -A pdf -w 5 -nd http://openaccess.thecvf.com/CVPR2017.py

Check the number of files in each directory

for x in * ; do echo "$x" ; ls -1UR "$x" | wc -l ; done

Convert CSV to TSV

cat hoge.csv | tr "," "\\t" > fuga.tsv
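The tr one-liner replaces every comma, including commas inside quoted fields. Where that matters, a sketch with Python's csv module parses the fields first. The sample data is made up, and in-memory buffers stand in for hoge.csv and fuga.tsv.

```python
import csv
import io

# Made-up CSV with a comma inside a quoted field
csv_text = 'id,comment\n1,"hello, world"\n2,plain\n'

src = io.StringIO(csv_text)
dst = io.StringIO()

# Parse as CSV, then write the same rows with a tab delimiter
reader = csv.reader(src)
writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
for row in reader:
    writer.writerow(row)

tsv_text = dst.getvalue()
print(tsv_text)
```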

2017-09-28

Fetching a timeline with the Twitter API [Part 2]

Python

Fetch 200 tweets from an account and save them to CSV

from requests_oauthlib import OAuth1Session
import json
from urllib import request
import subprocess
import csv

keys = {
            "CK":'xxx',
            "CS":'xxx',
            "AT":'xxx',
            "AS":'xxx',
        }
sess = OAuth1Session(keys["CK"], keys["CS"], keys["AT"], keys["AS"])

# Fetch up to 200 tweets from the specified user's timeline
url = "https://api.twitter.com/1.1/statuses/user_timeline.json"
usr_id = input("Enter the screen name to fetch tweets from: ")
params = {"screen_name":usr_id,  # screen name
          "count":200,  # number of most recent tweets to fetch (max 200)
          "exclude_replies":"true",
          "include_rts":"true"
          }
dir_path = "./"  # output directory (not defined in the original snippet; assumed here)
req = sess.get(url,params=params)
if req.status_code == 200:
    # The response is JSON, so parse it
    timeline = json.loads(req.text)
    file_name = dir_path + params["screen_name"] + ".csv"
    with open(file_name,"w") as f:
        writer = csv.writer(f,lineterminator="\n")
        for i,tweet in enumerate(timeline):
            tw_id = tweet["id_str"]#1: id
            tw_created_at = tweet["created_at"].split(" ")
            tw_created_at_year = tw_created_at[-1]#2: year
            tw_created_at_month = tw_created_at[1]#3: month
            tw_created_at_date = tw_created_at[2]#4: date
            tw_created_at_time = tw_created_at[3]#5: time
            tw_place = tweet["place"]
            if tw_place is not None:
                tw_place_id = tw_place["id"]#6: place_id
                tw_place_full_name = tw_place["full_name"]#7: place_name
            else:
                tw_place_id = ""
                tw_place_full_name = ""
            tw_txt = tweet["text"]
            a_tweet_info_list = [tw_id,tw_created_at_year,tw_created_at_month,tw_created_at_date,tw_created_at_time,tw_place_id,tw_place_full_name,tw_txt]
            writer.writerow(a_tweet_info_list)
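Splitting created_at on spaces works, but the timestamp can also be parsed into a datetime in one step. A minimal sketch, using a created_at string in the format the API returns and assuming the default C locale for the English day and month names:

```python
from datetime import datetime

created_at = "Sat May 13 18:58:06 +0000 2017"

# %a / %b: abbreviated weekday / month names, %z: UTC offset such as +0000
dt = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

print(dt.year, dt.month, dt.day, dt.strftime("%H:%M:%S"))
```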

Even when I set the GET request parameter include_rts to false, the fetched tweets still included RTs, and I could not find a solution.
If you really need tweets without RTs, check each tweet with

if not tweet["text"].startswith("RT @"):
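The per-tweet check can be wrapped in a small helper. In this sketch the sample tweets are made up and is_retweet is a hypothetical helper name. It also looks for the retweeted_status object that retweets carry in API v1.1 responses, with the "RT @" text prefix as a fallback.

```python
# Made-up tweets shaped like the API's JSON (only the fields used here)
timeline = [
    {"id_str": "1", "text": "RT @someone: hello"},
    {"id_str": "2", "text": "my own tweet"},
    {"id_str": "3", "text": "another one", "retweeted_status": {"id_str": "9"}},
]

def is_retweet(tweet):
    # Hypothetical helper: a retweet has a retweeted_status object
    # or a text that starts with "RT @"
    return "retweeted_status" in tweet or tweet["text"].startswith("RT @")

originals = [t for t in timeline if not is_retweet(t)]
print([t["id_str"] for t in originals])
```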

2017-09-28

Fetching a timeline with the Twitter API

Python

from requests_oauthlib import OAuth1Session
import json
from urllib import request

keys = {
            "CK":'xxxxx',
            "CS":'xxxxx',
            "AT":'xxxxx',
            "AS":'xxxxx',
        }
sess = OAuth1Session(keys["CK"], keys["CS"], keys["AT"], keys["AS"])

url = "https://api.twitter.com/1.1/statuses/home_timeline.json"
params = {"count":200,  # number of most recent tweets to fetch (max 200)
          "include_entities" : 1,  # whether to include entities (image URLs, etc.) in tweets
          "exclude_replies" : 1,  # whether to exclude replies
          }

req = sess.get(url, params=params)
timeline = json.loads(req.text)

print(timeline[0]["user"])

Fetching my own tweet gave the following.

{'id': 863468411120005120, 'id_str': '863468411120005120', 'name': '🐈', 'screen_name': 'tristana_chan', 'location': '',\
 'description': '有益なことは呟かないため鍵垢にしていますがフォローお気軽に', 'url': None,'entities': {'description': {'urls': []}}, \
'protected': True, 'followers_count': 40, 'friends_count': 395, 'listed_count': 0, 'created_at': 'Sat May 13 18:58:06 +0000 2017',\
'favourites_count': 2884, 'utc_offset': -25200, 'time_zone': 'Pacific Time (US & Canada)', 'geo_enabled': False, 'verified': False,\
 'statuses_count': 3281, 'lang': 'ja', 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False,\
 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None,\
 'profile_background_tile': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/899736954631094272/JtKhe4RD_normal.jpg',\
 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/899736954631094272/JtKhe4RD_normal.jpg', 'profile_link_color': '1DA1F2',\
 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True,\
 'has_extended_profile': False, 'default_profile': True, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False,\
 'translator_type': 'none'}

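Here are just the keys that look useful going forward.

key	value
name	User's display name
screen_name	User ID (the part after @)
location	Where the user lives ("" if not set)
description	User's bio
url	URL set by the user
protected	True if the account is protected
followers_count	Number of followers
friends_count	Number of accounts followed
listed_count	Number of lists the user is on
created_at	Date the account was created
lang	Language in use ("ja" for Japanese)
profile_image_url	Profile image URL (http)
profile_image_url_https	Profile image URL (https)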

Download the profile images from the timeline

from requests_oauthlib import OAuth1Session
import json
from urllib import request
import subprocess
keys = {
            "CK":'xxxxx',
            "CS":'xxxxx',
            "AT":'xxxxx',
            "AS":'xxxxx',
        }
sess = OAuth1Session(keys["CK"], keys["CS"], keys["AT"], keys["AS"])

url = "https://api.twitter.com/1.1/statuses/home_timeline.json"
params = {"count":200,  # number of most recent tweets to fetch (max 200)
          "include_entities" : 1,  # whether to include entities (image URLs, etc.) in tweets
          "exclude_replies" : 1,  # whether to exclude replies
          }

req = sess.get(url, params=params)
timeline = json.loads(req.text)

# Download the profile icons of the users on the timeline
dir_path = "./profile_images/"
url_set = set()
for tweet in timeline:
    # Get the profile image URL
    image_url = tweet["user"]["profile_image_url"]
    # Get the image file extension
    image_ex = image_url.split(".")[-1].replace("jpeg","jpg")#jpg, png, gif
    # Get the user's name
    user_id = tweet["user"]["name"].replace(" ","_")
    if image_url not in url_set:
        url_set.add(image_url)
        file_name = user_id + "." + image_ex
        args = ["wget",image_url,"-O",dir_path+file_name]
        subprocess.call(args)
    else:
        pass

Doing this fetches the profile images of the accounts on the timeline.
f:id:umashika5555:20170928164928p:plain
(personal accounts have been blurred)
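The extension lookup in the script splits the whole URL on ".", which would misfire if a URL ever carried a query string. A slightly more defensive sketch uses the standard library; image_extension is a hypothetical helper name and the second URL is made up.

```python
import posixpath
from urllib.parse import urlparse

def image_extension(image_url):
    # Take the extension from the URL path only, ignoring any query string
    path = urlparse(image_url).path
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    # Normalize "jpeg" to "jpg" as the original script does
    return "jpg" if ext == "jpeg" else ext

url = "http://pbs.twimg.com/profile_images/899736954631094272/JtKhe4RD_normal.jpg"
print(image_extension(url))
print(image_extension("http://example.com/icon.JPEG?size=normal"))
```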

2.3.3 Affinity Propagation

2.3.4 Mean Shift

2.3.5 Spectral Clustering

2.3.6 Hierarchical Clustering

2.3.7 DBSCAN

2.3.8 Birch
