
      A Simple Data Processing Example

      Date: January 29, 2015  Author: zhjw  Source: Internet

        1. Pig data model

          Bag: a table

          Tuple: a row (record)

          Field: an attribute

          Pig does not require the tuples in a bag to have the same number of fields, or fields of the same types.
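
        To make the last point concrete, here is a minimal plain-Python sketch (not Pig itself) that models a bag as a list of tuples. The name `bag` and the sample values are illustrative assumptions, not part of the original example.

```python
# A bag modeled as a Python list of tuples. This only illustrates the
# data model; it does not use Pig's own Bag or Tuple implementations.
bag = [
    (20130001, 80, 90),     # three int fields
    (20130002, "absent"),   # two fields, the second one a string
    (20130003,),            # a single field
]

# Tuples in the same bag may differ in field count and in field types.
field_counts = [len(t) for t in bag]
field_types = [[type(f).__name__ for f in t] for t in bag]

print(field_counts)
print(field_types)
```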

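        2. Common Pig Latin statements

          1) LOAD: specifies how to load the data

          2) FOREACH: scans the rows one by one and applies some processing to each

          3) FILTER: filters rows

          4) DUMP: prints the result to the screen

          5) STORE: saves the result to a file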

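        3. A simple example:

          Suppose we have a score sheet with a student number, a Chinese score, and a math score per line, with the fields separated by |, as follows: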

      20130001|80|90
      20130002|85|96
      20130003|60|70
      20130004|74|86
      20130005|65|98

        1) Upload the file from the local file system to Hadoop

      [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put /home/coder/score.txt in

        Check that the upload succeeded:

      [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
      Found 1 items
      -rw-r--r--   2 coder supergroup         75 2013-04-20 14:33 /user/coder/in/score.txt

        2) Load the raw data with LOAD

      grunt> scores = LOAD 'hdfs://h1:9000/user/coder/in/score.txt' USING PigStorage('|') AS (num:int,Chinese:int,Math:int);

        The input file is 'hdfs://h1:9000/user/coder/in/score.txt'

        Table name (bag): scores

        When tuples are read from the input file, the fields are split on |

        Each tuple has 3 fields: student number (num), Chinese score (Chinese), and math score (Math), all of type int
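
        As a rough illustration of what this LOAD describes, the plain-Python sketch below splits each line on | and converts every field to int. The helper name `parse_line` is hypothetical, and the third input line is a made-up malformed record added only to show the fallback; a field that fails conversion becomes None, loosely mirroring how Pig yields null for a value that does not fit the declared type.

```python
def parse_line(line, delimiter="|"):
    """Split one line on the delimiter and convert every field to int.

    A field that cannot be converted becomes None (an approximation of
    Pig's null for values that do not match the declared type).
    """
    fields = []
    for raw in line.rstrip("\n").split(delimiter):
        try:
            fields.append(int(raw))
        except ValueError:
            fields.append(None)
    return tuple(fields)


lines = [
    "20130001|80|90\n",
    "20130002|85|96\n",
    "20130003|sixty|70\n",  # hypothetical malformed line, not in score.txt
]
scores = [parse_line(line) for line in lines]
print(scores)
```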

        3) Inspect the table's schema

      grunt> DESCRIBE scores;
      scores: {num: int,Chinese: int,Math: int}

        4) Suppose we need to filter out the record whose student number is 20130005

      grunt> filter_scores = FILTER scores BY num != 20130005;

        View the filtered records

      grunt> dump filter_scores;
      (20130001,80,90)
      (20130002,85,96)
      (20130003,60,70)
      (20130004,74,86)
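
        The same filter can be sketched in plain Python, assuming the five records from score.txt are already held as a list of tuples. This is only an analogy for what FILTER computes, not how Pig executes it.

```python
# The five records from score.txt as (num, Chinese, Math) tuples.
scores = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]

# Keep every tuple whose first field (num) is not 20130005.
filter_scores = [t for t in scores if t[0] != 20130005]

for t in filter_scores:
    print(t)
```

        As in Pig, the filter builds a new relation and leaves scores untouched, which is why step 5, which reads from scores, still includes 20130005.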

        5) Compute each student's total score

      grunt> totalScore = FOREACH scores GENERATE num,Chinese+Math;

        View the result:

      grunt> dump totalScore;

       

      (20130001,170)
      (20130002,181)
      (20130003,130)
      (20130004,160)
      (20130005,163)
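
        For comparison, a plain-Python sketch of the same per-row computation, again assuming the records are held as a list of tuples; the variable names are illustrative.

```python
# The five records from score.txt as (num, Chinese, Math) tuples.
scores = [
    (20130001, 80, 90),
    (20130002, 85, 96),
    (20130003, 60, 70),
    (20130004, 74, 86),
    (20130005, 65, 98),
]

# For each row, emit the student number and the sum of the two scores.
total_score = [
    (num, chinese + math_score) for (num, chinese, math_score) in scores
]

for t in total_score:
    print(t)
```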

        

        6) Write each student's total score to a file

      grunt> store totalScore into 'hdfs://h1:9000/user/coder/out/result' using PigStorage('|');

        View the result:

      [coder@h1 ~]$ hadoop dfs -ls /user/coder/out/result
      Found 2 items
      drwxr-xr-x   - coder supergroup          0 2013-04-20 15:54 /user/coder/out/result/_logs
      -rw-r--r--   2 coder supergroup         65 2013-04-20 15:54 /user/coder/out/result/part-m-00000
      [coder@h1 ~]$ ^C
      [coder@h1 ~]$ hadoop dfs -cat /user/coder/out/result/*
      20130001|170
      20130002|181
      20130003|130
      20130004|160
      20130005|163
      cat: Source must be a file.
      [coder@h1 ~]$ 
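
        The output format can be reproduced with a short plain-Python sketch that joins each tuple's fields with |, one record per line. The helper name `format_tuple` is hypothetical, and this only imitates the text layout of the stored file.

```python
def format_tuple(fields, delimiter="|"):
    """Render one tuple as a delimited text line (without the newline)."""
    return delimiter.join(str(f) for f in fields)


total_score = [
    (20130001, 170),
    (20130002, 181),
    (20130003, 130),
    (20130004, 160),
    (20130005, 163),
]

# One line per tuple, each terminated by a newline.
output = "\n".join(format_tuple(t) for t in total_score) + "\n"
print(output, end="")
```

        The resulting text is 65 bytes long, which agrees with the size of part-m-00000 in the listing above.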

       


        Here is another small example:

        Suppose there is a batch of files in the following format:

      zhangsan#123456#zhangsan@qq.com
      lisi#434dfdds#lisi@126.com
      wangwu#ffere233#wangwu@163.com
      zhouliu#fgrtr43#zhouliu@139.com

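        Each record has three fields: account name, password, and email address, separated by #. The goal is to extract the email addresses from these files.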

        

        1) Upload the file to Hadoop

      [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -put data.txt in

       

      [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -ls /user/coder/in
      Found 1 items
      -rw-r--r--   2 coder supergroup        122 2013-04-24 20:34 /user/coder/in/data.txt
      [coder@h1 hadoop-0.20.2]$ 

        2) Load the raw data file

      grunt> T_A = LOAD '/user/coder/in/data.txt' using PigStorage('#') as (username:chararray,password:chararray,email:chararray);

        3) Extract the email field

      grunt> T_B = FOREACH T_A GENERATE email;

        4) Write the result to a file

      grunt> STORE T_B INTO '/user/coder/out/email';

        5) View the result

      [coder@h1 hadoop-0.20.2]$ bin/hadoop dfs -cat /user/coder/out/email/*
      zhangsan@qq.com
      lisi@126.com
      wangwu@163.com
      zhouliu@139.com
      cat: Source must be a file.
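
        The whole second example can be sketched in plain Python as well: split each line on #, then project the third field. The names `t_a` and `t_b` simply mirror the Pig aliases; this is an analogy, not how Pig runs the job.

```python
# The four sample records from data.txt.
lines = [
    "zhangsan#123456#zhangsan@qq.com",
    "lisi#434dfdds#lisi@126.com",
    "wangwu#ffere233#wangwu@163.com",
    "zhouliu#fgrtr43#zhouliu@139.com",
]

# LOAD ... USING PigStorage('#'): one (username, password, email) tuple per line.
t_a = [tuple(line.split("#")) for line in lines]

# FOREACH T_A GENERATE email: keep only the third field, as a 1-field tuple.
t_b = [(email,) for (_username, _password, email) in t_a]

for (email,) in t_b:
    print(email)
```

        Because each output tuple has a single field, no delimiter appears in the stored file, even though STORE without a USING clause falls back to the default PigStorage delimiter (a tab).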

       
