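Data Frame Operations in JavaScript - dataframe-js -

Introduction

Hello, this is Sasaki from Brains Technology. I joined the company as a new graduate in April. As a student, I spent my days crawling around tropical forests and playing with plants.

We often want to read data from a CSV and run a quick analysis. For this kind of lightweight data manipulation, Python's pandas and R's dplyr are popular because they are feature-rich and easy to use. When large volumes of data are handled in production, distributed processing platforms such as Hadoop and Spark are also common choices. But precisely because these options are so convenient, many people probably find it frustrating when they have to manipulate data on the front end.

So in this post I try out dataframe-js, a JavaScript library that makes data manipulation convenient in the browser as well.

Installation and Setup

That said, actually running things in a browser is a hassle, so this time I run everything on Node. It basically works the same way in the browser (or so I believe). If you want an even easier way to try it, installing the IJavaScript kernel in Jupyter notebook may be a good option (I skipped that, because at that point someone would surely say "just use pandas!").

You can install the library with npm install dataframe-js. I am working with the following directory layout.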

.
├── iris.js
├── main.js
├── package.json

When you load data from a CSV, you get back a Promise rather than the data itself. Below, I add methods to the following Iris class and call them from main.js.

iris.js

const DataFrame = require('dataframe-js').DataFrame;

module.exports = class Iris {
  constructor(params) {
    this.url = params.url;
    this.dfPromise = DataFrame.fromCSV(this.url);
  }
  
  showDF() {
    this.dfPromise.then(df => {
      df.show();
    });
  }
}

main.js

const Iris = require('./iris');

const iris = new Iris({
    url: 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'
});

iris.showDF();

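Run it with node main.js.

A Little About Irises

Now for the main topic. The goal of this article is to use the iris dataset to examine, species by species, the relationship between petal and sepal size (and to discuss the morphological traits of I. versicolor, I. virginica, and I. setosa and their adaptive significance).

According to the USDA Forest Service, all three species grow in wetlands such as lakeshores, and their ranges run from colder to warmer in the order setosa (Alaska), versicolor (around the Great Lakes), virginica (New York to Florida). In general, colder air has a smaller vapor pressure deficit (higher relative humidity), so plants lose less water through transpiration. In addition, broader tissue transpires more per unit dry weight, so narrow organs should be favored where water loss is greater. I therefore hypothesized that petals and sepals are most elongated in virginica, followed by versicolor, then setosa.

To test this hypothesis, I compute the mean and standard deviation of the length-to-width ratio of the petals and sepals for each of the three species and compare them.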
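Because the data arrives as a Promise, a method that needs to hand a value back to its caller has to return that Promise (or be declared async). The following is a minimal sketch of that pattern in plain JavaScript. fakeFromCSV and IrisSketch are hypothetical names introduced here so the snippet runs without the library or network access; they are not part of dataframe-js.

```javascript
// Hypothetical stand-in for DataFrame.fromCSV: resolves to an array of rows.
// (Assumption for illustration only; the real call resolves to a DataFrame.)
function fakeFromCSV(url) {
  return Promise.resolve([
    { sepal_length: 5.1, species: 'setosa' },
    { sepal_length: 7.0, species: 'versicolor' },
  ]);
}

class IrisSketch {
  constructor(params) {
    this.url = params.url;
    // Start loading once and keep the Promise, as the Iris class does.
    this.rowsPromise = fakeFromCSV(this.url);
  }

  // An async method returns a Promise, so the caller can wait for the value.
  async countRows() {
    const rows = await this.rowsPromise;
    return rows.length;
  }
}

const sketch = new IrisSketch({ url: 'iris.csv' });
const countPromise = sketch.countRows();
countPromise.then(n => console.log(`rows: ${n}`));
```

Keeping the Promise on the instance means the CSV is requested only once, however many methods are called afterwards.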

Figure (image source: http://suruchifialoke.com/2016-10-13-machine-learning-tutorial-iris-classification/)

Data Analysis

First, some basic reshaping of the data. Use filter to filter rows and select to choose columns. The usage feels closer to R's dplyr or Spark data frames than to pandas.

iris.js

filterAndSelectDF() {
    this.dfPromise.then(df => {
      df
        .filter(row => row.get("species") === "versicolor")
        .select("sepal_length", "sepal_width", "species")
        .show(3);
    });
}

Result

sepal_length	sepal_width	species
7	3.2	versicolor
6.4	3.2	versicolor
6.9	3.1	versicolor
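For readers more used to arrays than data frames, the same filter-then-select step can be sketched in plain JavaScript on an array of row objects. This is only an illustration of what the two operations do, not how dataframe-js implements them; selectColumns is a hypothetical helper.

```javascript
// A few rows in array-of-objects form (values are from the iris dataset).
const rows = [
  { sepal_length: 5.1, sepal_width: 3.5, petal_length: 1.4, petal_width: 0.2, species: 'setosa' },
  { sepal_length: 7.0, sepal_width: 3.2, petal_length: 4.7, petal_width: 1.4, species: 'versicolor' },
  { sepal_length: 6.4, sepal_width: 3.2, petal_length: 4.5, petal_width: 1.5, species: 'versicolor' },
];

// Hypothetical helper: keep only the named keys of a row (the role of select).
const selectColumns = (row, columns) =>
  Object.fromEntries(columns.map(name => [name, row[name]]));

const selected = rows
  .filter(row => row.species === 'versicolor') // the role of filter
  .map(row => selectColumns(row, ['sepal_length', 'sepal_width', 'species']));

console.log(selected);
```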

Next, I compute the length-to-width ratio (length / width) of the sepals and the petals.

iris.js

mutateWLratio() {
    this.dfPromise.then(df => {
      df
        .map(row => row.set('sepal_wlratio', row.get('sepal_length') / row.get('sepal_width')))
        .map(row => row.set('petal_wlratio', row.get('petal_length') / row.get('petal_width')))
        .select("sepal_wlratio", "petal_wlratio", "species")
        .show(3);
    });
}

Result

sepal_wlratio	petal_wlratio	species
1.4571	6.9999	setosa
1.6333	6.9999	setosa
1.46875	6.5	setosa
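The 6.9999 in the result is most likely a floating-point effect rather than a problem in the data: 1.4 and 0.2 cannot be represented exactly in binary, so their quotient may land just below 7. The sketch below reproduces the arithmetic for the first three rows in plain JavaScript and rounds the ratios for display.

```javascript
// First three setosa rows of the iris dataset.
const firstRows = [
  { sepal_length: 5.1, sepal_width: 3.5, petal_length: 1.4, petal_width: 0.2 },
  { sepal_length: 4.9, sepal_width: 3.0, petal_length: 1.4, petal_width: 0.2 },
  { sepal_length: 4.7, sepal_width: 3.2, petal_length: 1.3, petal_width: 0.2 },
];

// Length-to-width ratios, computed the same way as in mutateWLratio.
const ratios = firstRows.map(row => ({
  sepal_wlratio: row.sepal_length / row.sepal_width,
  petal_wlratio: row.petal_length / row.petal_width,
}));

// Shows whether the first petal ratio is exactly 7 in double precision.
console.log(ratios[0].petal_wlratio === 7);

// Rounding to two decimals for display hides the representation error.
const rounded = ratios.map(r => ({
  sepal_wlratio: Number(r.sepal_wlratio.toFixed(2)),
  petal_wlratio: Number(r.petal_wlratio.toFixed(2)),
}));
console.log(rounded);
```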

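Finally, I group by species and compute the mean and standard deviation of each trait. groupBy returns a GroupedDataFrame object grouped on the specified column. A GroupedDataFrame has an aggregate method that applies a function to each group (species).

Here I compute the mean and standard deviation of the petal and sepal ratios for each species and combine them into a single data frame.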

iris.js

calcSppStats() {
    this.dfPromise.then(df => {
      const groupedDF = df
        .chain(
          row => row.set('sepal_wlratio', row.get('sepal_length') / row.get('sepal_width')),
          row => row.set('petal_wlratio', row.get('petal_length') / row.get('petal_width'))
        )
        .select('sepal_wlratio', 'petal_wlratio', 'species')
        .groupBy('species');
      groupedDF
        .aggregate(group => group.stat.mean('sepal_wlratio'))
        .rename('aggregation', 'sepal_wlratio_sp_mean')
        .join(
          groupedDF
            .aggregate(group => group.stat.sd('sepal_wlratio'))
            .rename('aggregation', 'sepal_wlratio_sp_sd')
        , 'species', 'inner')
        .join(
          groupedDF
            .aggregate(group => group.stat.mean('petal_wlratio'))
            .rename('aggregation', 'petal_wlratio_sp_mean')
        , 'species', 'inner')
        .join(
          groupedDF
            .aggregate(group => group.stat.sd('petal_wlratio'))
            .rename('aggregation', 'petal_wlratio_sp_sd')
        , 'species', 'inner')
        .show(3);
    });
}

Result

species	sepal_wlratio_sp_mean	sepal_wlratio_sp_sd	petal_wlratio_sp_mean	petal_wlratio_sp_sd
setosa	1.4745	0.1186	7.0779	3.1237
versicolor	2.1604	0.2286	3.2428	0.3124
virginica	2.2304	0.2469	2.7806	0.4073

Also, the group.stat.stats method computes a range of statistics in one call.
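To make explicit what the per-group aggregation computes, here is a minimal plain JavaScript sketch of grouping rows by a key and taking the mean and standard deviation of one column. summariseBy is a hypothetical helper, and the use of the sample standard deviation (dividing by n - 1) is an assumption about what stat.sd does by default; the data is a toy example, not the iris measurements.

```javascript
// Mean of an array of numbers.
const mean = values => values.reduce((sum, v) => sum + v, 0) / values.length;

// Sample standard deviation (divides by n - 1). Assumption: this matches the
// default of stat.sd; the population form would divide by n instead.
const sampleSd = values => {
  const m = mean(values);
  const sumSq = values.reduce((sum, v) => sum + (v - m) ** 2, 0);
  return Math.sqrt(sumSq / (values.length - 1));
};

// Hypothetical helper: group rows by a key and summarise one column per group.
function summariseBy(rows, key, column) {
  const groups = new Map();
  for (const row of rows) {
    if (!groups.has(row[key])) groups.set(row[key], []);
    groups.get(row[key]).push(row[column]);
  }
  return [...groups].map(([group, values]) => ({
    [key]: group,
    mean: mean(values),
    sd: sampleSd(values),
  }));
}

// Toy data, not the iris measurements.
const toyRows = [
  { species: 'a', ratio: 1 },
  { species: 'a', ratio: 3 },
  { species: 'b', ratio: 2 },
  { species: 'b', ratio: 4 },
  { species: 'b', ratio: 6 },
];

const summary = summariseBy(toyRows, 'species', 'ratio');
console.log(summary);
```

Computing both statistics in a single pass over the groups avoids the repeated joins, at the cost of leaving the data frame API.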

Analysis Summary (Irises)

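The results above are summarized in the following table, rounded to two decimal places.

species	Sepal length/width ratio, mean ± SD	Petal length/width ratio, mean ± SD
setosa	1.47 ± 0.12	7.08 ± 3.12
versicolor	2.16 ± 0.23	3.24 ± 0.31
virginica	2.23 ± 0.25	2.78 ± 0.41

As hypothesized, the sepals were most elongated in virginica, followed by versicolor and then setosa. The petals, however, showed the opposite order: setosa, versicolor, virginica. This suggests that the shape (length-to-width ratio) of petals and sepals alone is not enough, and that their area and physiological traits also need to be taken into account.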

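Summary (dataframe-js)

It is no match for pandas, of course, but I felt I could do basic data manipulation without frustration. Being able to manipulate data smoothly on the client side, in combination with something like plotly.js (df.toCollection() converts a data frame to plain objects), seems to offer two benefits: 1. rich visualizations become relatively easy to implement, and 2. there is no longer any need to send work to the server side just to process data with pandas or similar, which keeps the data flow simple.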
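As a rough sketch of that combination, the snippet below reshapes an array of row objects, the form toCollection() produces, into a bar trace with error bars in the structure plotly.js expects. toBarTrace and the 'chart' element id are hypothetical, and the values are copied from the sepal columns of the aggregated result shown earlier.

```javascript
// Shape assumed for resDF.toCollection(): one plain object per species.
const collection = [
  { species: 'setosa', sepal_wlratio_sp_mean: 1.4745, sepal_wlratio_sp_sd: 0.1186 },
  { species: 'versicolor', sepal_wlratio_sp_mean: 2.1604, sepal_wlratio_sp_sd: 0.2286 },
  { species: 'virginica', sepal_wlratio_sp_mean: 2.2304, sepal_wlratio_sp_sd: 0.2469 },
];

// Hypothetical helper: build a Plotly-style bar trace with error bars.
function toBarTrace(rows, meanKey, sdKey) {
  return {
    type: 'bar',
    x: rows.map(row => row.species),
    y: rows.map(row => row[meanKey]),
    error_y: { type: 'data', array: rows.map(row => row[sdKey]) },
  };
}

const trace = toBarTrace(collection, 'sepal_wlratio_sp_mean', 'sepal_wlratio_sp_sd');
console.log(trace);
// In a page that has loaded plotly.js: Plotly.newPlot('chart', [trace]);
```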

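Addendum

I learned that JavaScript can be run on Hatena Blog, so I tried it out just in case. Pressing the 表示 (Show) button loads the data from https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv and runs the same series of steps as above. This confirms that the code works the same way in the browser, so I no longer have to take it on trust.

The result appears here.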

<script src="https://cdn.rawgit.com/Gmousse/dataframe-js/master/dist/dataframe-min.js"></script>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

<input id='show_button' type=button value='表示'>
<!-- Output target that the script below writes to -->
<pre id="result"></pre>
<script type="text/javascript">
$(function() {
    $('#show_button').click(function() {
        const DataFrame = dfjs.DataFrame;
        DataFrame.fromCSV('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
            .then(df => {
                const groupedDF = df
                    .chain(
                    row => row.set('sepal_wlratio', row.get('sepal_length') / row.get('sepal_width')),
                    row => row.set('petal_wlratio', row.get('petal_length') / row.get('petal_width'))
                    )
                    .select('sepal_wlratio', 'petal_wlratio', 'species')
                    .groupBy('species');
                const resDF = groupedDF
                    .aggregate(group => group.stat.mean('sepal_wlratio'))
                    .rename('aggregation', 'sepal_wlratio_sp_mean')
                    .join(
                        groupedDF
                        .aggregate(group => group.stat.sd('sepal_wlratio'))
                        .rename('aggregation', 'sepal_wlratio_sp_sd')
                    , 'species', 'inner')
                    .join(
                        groupedDF
                        .aggregate(group => group.stat.mean('petal_wlratio'))
                        .rename('aggregation', 'petal_wlratio_sp_mean')
                    , 'species', 'inner')
                    .join(
                        groupedDF
                        .aggregate(group => group.stat.sd('petal_wlratio'))
                        .rename('aggregation', 'petal_wlratio_sp_sd')
                    , 'species', 'inner');
                const resJSON = JSON.stringify(resDF.toCollection(), null, 2);
                $('#result').text(resJSON);
            })
    });
});
</script>


References

Official: https://www.npmjs.com/package/dataframe-js
basic usage: https://gmousse.gitbooks.io/dataframe-js/content/doc/BASIC_USAGE.html#dataframe
advanced usage: https://gmousse.gitbooks.io/dataframe-js/content/doc/ADVANCED_USAGE.html#advanced-usage
iris dataset: https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
JS kernel for Jupyter: https://github.com/n-riesco/ijavascript
JS on Hatena Blog: http://doratai.hatenablog.com/entry/2018/09/04/%E3%81%AF%E3%81%A6%E3%81%AA%E3%83%96%E3%83%AD%E3%82%B0%E3%81%A7javascript%E3%81%8C%E6%9B%B8%E3%81%91%E3%82%8B%E3%81%93%E3%81%A8%E3%82%92%E7%9F%A5%E3%81%A3%E3%81%9F
