Dictionary Generation for IPADIC users

WARNING: The steps in this section take several hours, or possibly days, to run.

Prepare the base dictionary

Download IPADIC

First, download IPADIC manually from https://taku910.github.io/mecab and then unpack it in your working directory:

WORKDIR=/path/to/your/work/dir
cd "$WORKDIR" # move to the working directory
cp /path/to/your/download/dir/mecab-ipadic-2.7.0-XXXX.tar.gz "$WORKDIR"
tar zxfv mecab-ipadic-2.7.0-XXXX.tar.gz

Running ls mecab-ipadic-2.7.0-XXXX shows many CSV files and configuration files in the directory. Convert the encoding of these dictionary files from EUC-JP to UTF-8. If your system has the nkf command, run:

find ./mecab-ipadic-2.7.0-* -type f -name "*.csv" | xargs -I{} nkf -w --overwrite {}

Otherwise, you can use Docker.

# Wrap the pipeline in sh -c so that both find and nkf run inside the container.
docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    sh -c 'find ./mecab-ipadic-2.7.0-* -type f -name "*.csv" | xargs -I{} nkf -w --overwrite {}'
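
Whichever route you take, it can help to confirm that every CSV file is now valid UTF-8 before starting the long inference step. The following is a minimal sketch, not part of tdmelodic: the function names is_utf8 and list_non_utf8_csv are hypothetical, and it assumes the standard iconv utility is installed on the host.

```shell
# Return success if the given file decodes cleanly as UTF-8.
# (Hypothetical helper; relies only on the standard iconv utility.)
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Print every *.csv under the given directories that is NOT valid UTF-8.
list_non_utf8_csv() {
    find "$@" -type f -name "*.csv" | while IFS= read -r f; do
        is_utf8 "$f" || printf '%s\n' "$f"
    done
}

# Usage (from the working directory, after the conversion has finished):
#   list_non_utf8_csv ./mecab-ipadic-2.7.0-*
```

If the function prints nothing, every CSV file it found decoded as UTF-8. Any path it does print is a file that probably still needs converting.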

Download NEologd

Also, download the NEologd dictionary as follows.

cd $WORKDIR # move to the working directory
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd/

Then, decompress the CSV files of the NEologd dictionary. If your system has the unxz command, run:

find ./mecab-ipadic-neologd/seed/ -type f -name "*.xz" | xargs -I{} unxz -k {}

Otherwise, use Docker:

find ./mecab-ipadic-neologd/seed/ -type f -name "*.xz" | xargs -I{} \
   docker run --rm -v $(pwd):/root/workspace tdmelodic:latest unxz -k {}

This creates many CSV files under ./mecab-ipadic-neologd/seed/.
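
To check that nothing was skipped, you can look for .xz archives that have no decompressed file next to them. This is a rough sketch only: list_undecompressed is a hypothetical helper name, and it assumes that each archive decompresses to the same path without the .xz suffix, which is what unxz -k does.

```shell
# Print every *.xz under the given directories whose decompressed sibling
# (the same path without the .xz suffix) is missing or empty.
list_undecompressed() {
    find "$@" -type f -name "*.xz" | while IFS= read -r xz; do
        [ -s "${xz%.xz}" ] || printf '%s\n' "$xz"
    done
}

# Usage (from the working directory):
#   list_undecompressed ./mecab-ipadic-neologd/seed/
```

No output means every archive has a non-empty decompressed counterpart.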

Inference

WARNING: This step takes a long time.

Now let's generate the accent dictionary. This step estimates the accents of the words listed in the NEologd dictionary using a machine-learning-based technique.

IPADIC

find ./mecab-ipadic-2.7.0-*/ -type f -name "*.csv" | xargs -I{} \
    docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
        tdmelodic-convert -m ipadic --input {} --output {}.accent

Alternatively, the following commands will also work.

cat ./mecab-ipadic-2.7.0-*/*.csv > ipadic_all.csv
docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-convert -m ipadic \
        --input ipadic_all.csv \
        --output ipadic_all.csv.accent
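
One caveat when merging with cat: if any input file lacks a final newline, its last row is joined to the first row of the next file. Whether the IPADIC CSV files are affected is not stated here, so treat the following as a defensive sketch. The helper name concat_csv is hypothetical; it uses awk 1, which prints every line followed by a newline.

```shell
# Concatenate files so that each one is guaranteed to end with a newline.
# `awk 1` prints every input line followed by a newline, so a missing
# final newline in one file cannot glue two CSV rows together.
concat_csv() {
    awk 1 "$@"
}

# Usage (from the working directory):
#   concat_csv ./mecab-ipadic-2.7.0-*/*.csv > ipadic_all.csv
```

The output is the same as that of cat when every file already ends with a newline.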

NEologd

Run the preprocessor if necessary (use -h to show the preprocessing options).

find ./mecab-ipadic-neologd/seed/ -type f -name "*.csv" | xargs -I{} \
    docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
        tdmelodic-neologd-preprocess -m ipadic --input {} --output {}.preprocessed

Then run the conversion:

find ./mecab-ipadic-neologd/seed/ -type f -name "*.csv" | xargs -I{} \
    docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
        tdmelodic-convert -m ipadic --input {}.preprocessed --output {}.accent

Thus we obtain dictionary files *.csv.accent with the accent information added.
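
Because the conversion runs for a long time, it may be interrupted partway. The sketch below lists the CSV files that do not yet have a non-empty .accent file, so that only those need to be processed again. It is an illustration, not a tdmelodic feature: list_pending is a hypothetical helper name, and the check cannot tell a complete output file from a partially written one.

```shell
# Print every *.csv under the given directories that does not yet have a
# non-empty "<name>.csv.accent" file next to it.
list_pending() {
    find "$@" -type f -name "*.csv" | while IFS= read -r csv; do
        [ -s "$csv.accent" ] || printf '%s\n' "$csv"
    done
}

# Usage sketch: feed only the pending files to the conversion command, e.g.
#   list_pending ./mecab-ipadic-neologd/seed/ | xargs -I{} \
#       docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
#           tdmelodic-convert -m ipadic --input {}.preprocessed --output {}.accent
```

If a run was killed while writing a file, delete that .accent file first so that it is reported as pending.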

Alternatively, the following commands will also work.

cat ./mecab-ipadic-neologd/seed/*.csv > neologd_all.csv

docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-neologd-preprocess -m ipadic \
        --input neologd_all.csv \
        --output neologd_all.csv.preprocessed

docker run --rm -v $(pwd):/root/workspace tdmelodic:latest \
    tdmelodic-convert -m ipadic \
        --input neologd_all.csv.preprocessed \
        --output neologd_all.csv.accent