
Windows Hadoop Hive external table creation with an Avro schema, Azure HDInsight Hive CREATE EXTERNAL TABLE to read Avro files








This post shows a CREATE TABLE query that lets Hive on HDInsight read Avro files.

I have also included the errors that came up while testing.

I hope this helps anyone who lands here from one of those error messages.



Here are the links I went through while studying Azure HDInsight:







I used the Hive code from [site 2] in the links above as a reference.
That document has sample code for Java, Hadoop streaming, Hive, and Pig.
At the beginning of the document there are a sample Avro data file and an .avsc schema file, so you can use those to test right away.





  • I first tried pointing at the .avsc file by URL, but it never took effect, so I switched to writing the schema inline in the CREATE statement.
    'avro.schema.literal' // embed the schema JSON directly in the DDL
    'avro.schema.url'     // point at a schema file by path/URL
  • On Azure HDInsight, a path in the form hdfs:///avrotest/schema/twitter.avsc does not seem to be readable (the cluster's default filesystem is Azure Blob Storage, so an hdfs:// URI without a host cannot be resolved).
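If you go the 'avro.schema.literal' route, the schema has to be pasted into the DDL as one valid JSON string. A small script like the following can minify an .avsc file for pasting (a sketch; it inlines the sample Tweet schema instead of reading twitter.avsc from disk):

```python
import json

# Sample Tweet schema from the avro-hadoop-starter project; in practice
# you would load it from the .avsc file instead:
#   schema = json.load(open("twitter.avsc"))
schema = {
    "name": "Tweet",
    "type": "record",
    "namespace": "com.miguno.avro",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "tweet", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}

# Minify to a single line, and confirm it contains no single quotes,
# which would terminate the Hive string literal early.
literal = json.dumps(schema, separators=(",", ":"))
assert "'" not in literal

print("'avro.schema.literal' = '%s'" % literal)
```

The printed line can then be pasted straight into the WITH SERDEPROPERTIES clause.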





Here is the CREATE statement that works correctly on HDInsight:


CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.literal' = '{
    "name" : "Tweet",
    "type" : "record",
    "namespace" : "com.miguno.avro",
    "fields" : [ {
      "name" : "username",
      "type" : "string"
    }, {
      "name" : "tweet",
      "type" : "string"
    }, {
      "name" : "timestamp",
      "type" : "long"
    } ]
  }'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data/';








Checking that it was applied correctly:


hive> describe tweets;
OK
username string from deserializer
tweet string from deserializer
timestamp bigint

hive> select * from tweets;
OK
miguno Rock: Nerf paper, scissors is fine. 1366150681
BlizzardCS Works as intended. Terran is IMBA. 1366154481
DarkTemplar From the shadows I come! 1366154681
VoidRay Prismatic core online! 1366160000
VoidRay Fire at will, commander. 1366160010
DarkTemplar I am the blade of Shakuras! 1366174681
Immortal I return to serve! 1366175681
Immortal En Taro Adun! 1366176283
VoidRay There is no greater void than the one between your ears. 1366176300
DarkTemplar I strike from the shadows! 1366184681

Time taken: 0.172 seconds, Fetched: 10 row(s)









Below are the errors that came up during testing.


I resolved this by changing 'avro.schema.url' to 'avro.schema.literal'.

The url property should be usable as well, but I still need to work out on Azure HDInsight exactly which path it expects (most likely a full URL on the cluster's default wasb:// filesystem).



CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/YOURUSER/examples/input/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///user/YOURUSER/examples/schema/twitter.avsc'
);


CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///avrotest/schema/twitter.avsc'
);

The setting did not take effect, so I tried other variants.

describe tweets;

Captured error:
hive>
> CREATE EXTERNAL TABLE tweets
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

> LOCATION '/avrotest/data/'
> TBLPROPERTIES (
>
Display all 445 possibilities? (y or n)
>
Display all 445 possibilities? (y or n)
> 'avro.schema.url'='hdfs:///avrotest/schema/twitter.avsc'
> );
OK
Time taken: 0.579 seconds
hive>
>
> describe tweets;
OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
-- if output like this appears, the avsc schema setup did not take effect
Time taken: 0.904 seconds, Fetched: 7 row(s)
hive> select * from tweets;
OK
Failed with exception java.io.IOException:org.apache.avro.AvroTypeException: Found com.miguno.avro.Tweet, expecting org.apache.hadoop.hive.CannotDetermineSchemaSentinel
Time taken: 1.611 seconds
hive> show tabels;
NoViableAltException(26@[654:1: ddlStatement : ( createDatabaseStatement | switchDatabaseStatement | dropDatabaseStatement | createTableStatement | dropTableStatement | truncateTableStatement | alterStatement | descStatement | showStatement | metastoreCheck | createViewStatement | dropViewStatement | createFunctionStatement | createMacroStatement | createIndexStatement | dropIndexStatement | dropFunctionStatement | dropMacroStatement | analyzeStatement | lockStatement | unlockStatement | lockDatabase | unlockDatabase | createRoleStatement | dropRoleStatement | grantPrivileges | revokePrivileges | showGrants | showRoleGrants | showRolePrincipals | showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:409)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:5 cannot recognize input near 'show' 'tabels' '<EO
F>' in ddl statement
-- this ParseException is only from the 'show tabels' typo; 'show tables' parses fine
hive> show tables;
OK
hivesampletable
tweets
Time taken: 0.11 seconds, Fetched: 2 row(s)
hive> show tables;
OK
hivesampletable
tweets
Time taken: 0.062 seconds, Fetched: 2 row(s)
hive> select * from tweets;
OK
Failed with exception java.io.IOException:org.apache.avro.AvroTypeException: Found com.miguno.avro.Tweet, expecting org.apache.hadoop.hive.CannotDetermineSchemaSentinel
Time taken: 0.297 seconds

--

This is the same article as above, but it is maintained on GitHub and looks more up to date:
https://github.com/miguno/avro-hadoop-starter

CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = 'hdfs:///avrotest/schema/twitter.avsc'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data/';


Captured error:
hive>
> CREATE EXTERNAL TABLE tweets
> COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> WITH SERDEPROPERTIES (
> 'avro.schema.url' = 'hdfs:///avrotest/schema/twitter.avsc'
> )
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

> LOCATION '/avrotest/data/';
OK
Time taken: 0.431 seconds
hive>
> describe tweets;
OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
-- if output like this appears, the avsc schema setup did not take effect
Time taken: 0.875 seconds, Fetched: 7 row(s)
hive>


--


CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = 'hdfs:///avrotest/schema/twitter.avsc'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///avrotest/data/';

Captured error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException Incomplete HDFS URI, no host: hdfs:/avrotest/data)
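The message complains about a missing host: in hdfs:///avrotest/data the authority part of the URI is empty, so Hadoop has no namenode to contact, and an HDInsight cluster's default filesystem is an Azure Blob Storage container (wasb://...) rather than HDFS anyway. A quick stdlib check makes the empty-host problem visible:

```python
from urllib.parse import urlparse

uri = urlparse("hdfs:///avrotest/data")

# The scheme is present but the authority (host) component is empty,
# which is exactly what "Incomplete HDFS URI, no host" complains about.
print(uri.scheme)        # hdfs
print(repr(uri.netloc))  # '' (empty: no host)
print(uri.path)          # /avrotest/data
```

Using a scheme-less path such as '/avrotest/data/' (as in the working statement above) lets Hive resolve it against the cluster's default filesystem instead.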

--

CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = '{
"type" : "record",
"name" : "Tweet",
"namespace" : "com.miguno.avro",
"fields" : [ {
"name" : "username",
"type" : "string"
}, {
"name" : "tweet",
"type" : "string"
}, {
"name" : "timestamp",
"type" : "long"
} ]
}'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data2/';

Captured error:
hive> describe tweets;
OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
-- if output like this appears, the avsc schema setup did not take effect
Time taken: 0.585 seconds, Fetched: 7 row(s)
hive>

--
In the example below, the double quotes had backslashes in front of them, so I tried following that too:
https://social.msdn.microsoft.com/Forums/azure/en-US/5e0e93f9-3c02-40dc-afa4-656faacde69b/cannot-read-data-from-hive-external-table-to-a-avro-serilized-files-failed-with-exception?forum=hdinsight

CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = '{
\"type\" : \"record\",
\"name\" : \"Tweet\",
\"namespace\" : \"com.miguno.avro\",
\"fields\" : [ {
\"name\" : \"username\",
\"type\" : \"string\"
}, {
\"name\" : \"tweet\",
\"type\" : \"string\"
}, {
\"name\" : \"timestamp\",
\"type\" : \"long\"
} ]
}'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data2/';


Captured error:
hive>
>
> CREATE EXTERNAL TABLE tweets
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> WITH SERDEPROPERTIES (
> 'avro.schema.url' = '{
> \"type\" : \"record\",
> \"name\" : \"Tweet\",
> \"namespace\" : \"com.miguno.avro\",
> \"fields\" : [ {
> \"name\" : \"username\",
> \"type\" : \"string\"
> }, {
> \"name\" : \"tweet\",
> \"type\" : \"string\"
> }, {
> \"name\" : \"timestamp\",
> \"type\" : \"long\"
> } ]
> }'
> )
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

> LOCATION '/avrotest/data2/';
OK
Time taken: 0.234 seconds
hive>
>
> show tables;
OK
hivesampletable
tweets
Time taken: 0.063 seconds, Fetched: 2 row(s)
hive> describe tweets;
OK
> error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
-- if output like this appears, the avsc schema setup did not take effect
Time taken: 0.516 seconds, Fetched: 7 row(s)
hive>
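As far as I can tell, the root cause in these url-based attempts is that the AvroSerDe never parses 'avro.schema.url' as JSON: it treats the value as a location (an HDFS/WASB path or an HTTP URL) from which to fetch the .avsc file, and when it cannot resolve a schema it exposes the error_error_error... placeholder columns seen above. Inline schema text therefore only works with 'avro.schema.literal'. A rough analogy in a few lines (a sketch, not the SerDe's actual code):

```python
import json

# Inline schema text, as used (incorrectly) with 'avro.schema.url' above.
schema_text = '{ "type" : "record", "name" : "Tweet" }'

# The text itself is perfectly valid JSON...
assert json.loads(schema_text)["name"] == "Tweet"

# ...but a *url* property is used as a location to open, not parsed as
# JSON, so the SerDe effectively does something like this and fails:
try:
    open(schema_text)  # treated as a file path -> no such file
    resolved = True
except OSError:
    resolved = False

print(resolved)  # False: there is no file named after the schema text
```

So the two working shapes are: 'avro.schema.literal' with the JSON inline, or 'avro.schema.url' with a path the cluster can actually open.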

--

CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = '{}'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

LOCATION '/avrotest/data2/';

