Last weekend I installed Pig, following the instructions here. It was really straightforward. Pig is handy because it can run in local mode, without a running Hadoop cluster.
After playing with it a little in local mode, I ran Pig script 1 from the tutorial on my Hadoop installation. It took about 10 minutes, but it just worked.
Then I got the Japanese zip code data from Japan Post (the counterpart of USPS in the US). Each record consists of a local government code, zip code, prefecture name, city name, and town name. I ran the Pig Latin commands below to find out how many unique zip codes each prefecture has.
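RAW = load 'ken_all_rome.csv' using PigStorage(',');  -- read the CSV, splitting fields on commas
P = foreach RAW generate $4 as pref, $1 as code;      -- keep only the prefecture name and zip code
P2 = distinct P;                                      -- drop duplicate (prefecture, zip code) pairs
PG = group P2 by pref;                                -- one group per prefecture
PC = foreach PG generate $0, COUNT($1);               -- prefecture name and its number of distinct zip codes
store PC into 'poscnt-result' using PigStorage();     -- write the result, tab-separated by default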
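To sanity-check the result without Pig, the same distinct / group / count pipeline can be sketched in plain Python. This is only an illustration, not something I ran on the real data: the function name count_zip_codes_per_prefecture and the sample rows are made up, and it assumes, as the Pig script does, that column 1 holds the zip code and column 4 the prefecture name. Note that csv.reader strips quotes around fields, which PigStorage(',') does not.

```python
import csv
import io
from collections import defaultdict


def count_zip_codes_per_prefecture(lines, pref_col=4, code_col=1):
    """Count distinct zip codes per prefecture, mirroring the Pig script.

    The column positions follow the Pig script ($4 and $1); they are
    assumptions about the CSV layout, not something verified here.
    """
    codes_by_pref = defaultdict(set)  # a set plays the role of `distinct`
    for row in csv.reader(lines):
        if len(row) <= max(pref_col, code_col):
            continue  # skip blank or short rows
        codes_by_pref[row[pref_col]].add(row[code_col])
    # group + COUNT: one (prefecture, count) entry per prefecture
    return {pref: len(codes) for pref, codes in codes_by_pref.items()}


# Made-up placeholder rows, only to show the shape of the input.
sample = io.StringIO(
    "00001,0000001,x,x,HOKKAIDO\n"
    "00001,0000001,x,x,HOKKAIDO\n"  # duplicate pair, counted once
    "00001,0000002,x,x,HOKKAIDO\n"
    "00002,0000003,x,x,TOKYO\n"
)
print(count_zip_codes_per_prefecture(sample))
# prints {'HOKKAIDO': 2, 'TOKYO': 1}
```

Passing open('ken_all_rome.csv', newline='') instead of the sample should give counts comparable to the Pig output, provided the file's encoding and column layout match these assumptions.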
The prefecture with the most zip codes is Hokkaido, with 8006. The one with the fewest is Kagawa, with 709. Tokyo has 3731, which looks too few considering its population. Its area is relatively small, though, so the number of zip codes is probably roughly proportional to area rather than to population.
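The largest and smallest counts can also be picked out of the stored result with a few lines of Python. This is a rough sketch under some assumptions: the helper name find_extremes is hypothetical, the input is expected to be the tab-separated lines that PigStorage() writes by default, and the three sample lines simply reuse the figures quoted above.

```python
def find_extremes(result_lines):
    """Return ((pref, count), (pref, count)) for the largest and smallest counts.

    Expects lines of the form "<prefecture>\t<count>", which is the default
    PigStorage() output format. Blank lines are ignored.
    """
    counts = {}
    for line in result_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        pref, count = line.rsplit("\t", 1)  # split on the last tab only
        counts[pref] = int(count)
    most = max(counts, key=counts.get)
    least = min(counts, key=counts.get)
    return (most, counts[most]), (least, counts[least])


# Sample lines built from the three figures mentioned in this post.
sample = ["Hokkaido\t8006\n", "Kagawa\t709\n", "Tokyo\t3731\n"]
print(find_extremes(sample))
# prints (('Hokkaido', 8006), ('Kagawa', 709))
```

On a real run the lines would come from the part files inside the poscnt-result directory; their exact names depend on the Pig and Hadoop versions, so they are not spelled out here.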
script1-hadoop-results is the result of Pig script 1.
The number of zip codes each prefecture has.