Windows Hadoop Hive: creating an external table with an Avro schema (Azure HDInsight Hive CREATE EXTERNAL TABLE to read Avro files)
This is a CREATE TABLE query that lets Hive on Azure HDInsight read Avro files.
It also includes the errors that came up while testing.
I hope it helps anyone who lands here through one of those error messages.
- References:
- The Apache Hive documentation: https://cwiki.apache.org/confluence/display/Hive/AvroSerDe
- [Site 1] with the CREATE query: http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive/
- [Site 2] with the CREATE query (the git repo where the code from the link above is kept current): https://github.com/miguno/avro-hadoop-starter
- I first tried pointing at the .avsc file by URL, but it never took effect, so I switched to writing the schema directly in the CREATE statement.
'avro.schema.literal' // embeds the schema JSON directly in the query
'avro.schema.url' // points to a schema file by path; on Azure HDInsight, hdfs:///avrotest/schema/twitter.avsc did not seem to be readable this way
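Since the working query below embeds the schema as a JSON string inside 'avro.schema.literal', a malformed schema makes the whole table unreadable. A quick sanity check (a sketch in Python, standard library only) is to parse the JSON before pasting it into the DDL:

```python
import json

# The Avro schema that will be embedded via 'avro.schema.literal'.
schema_text = '''{
  "name": "Tweet",
  "type": "record",
  "namespace": "com.miguno.avro",
  "fields": [
    {"name": "username",  "type": "string"},
    {"name": "tweet",     "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}'''

schema = json.loads(schema_text)  # raises ValueError if the JSON is malformed
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # ['username', 'tweet', 'timestamp']

# HiveQL wraps the literal in single quotes, so the schema text itself
# must not contain unescaped single quotes.
assert "'" not in schema_text
```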
This is the CREATE statement that works correctly on HDInsight:
CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.literal' = '{
"name" : "Tweet",
"type" : "record",
"namespace" : "com.miguno.avro",
"fields" : [ {
"name" : "username",
"type" : "string"
}, {
"name" : "tweet",
"type" : "string"
}, {
"name" : "timestamp",
"type" : "long"
} ]
}'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data/';
Checking that it was applied correctly:
OK
username string from deserializer
tweet string from deserializer
timestamp bigint
miguno Rock: Nerf paper, scissors is fine. 1366150681
BlizzardCS Works as intended. Terran is IMBA. 1366154481
DarkTemplar From the shadows I come! 1366154681
VoidRay Prismatic core online! 1366160000
VoidRay Fire at will, commander. 1366160010
DarkTemplar I am the blade of Shakuras! 1366174681
Immortal I return to serve! 1366175681
Immortal En Taro Adun! 1366176283
VoidRay There is no greater void than the one between your ears. 1366176300
DarkTemplar I strike from the shadows! 1366184681
Time taken: 0.172 seconds, Fetched: 10 row(s)
Below are the errors that came up during testing.
They were resolved by switching 'avro.schema.url' to 'avro.schema.literal'.
The URL form looks like it should also work, but which path to supply needs to be checked on Azure HDInsight.
CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/YOURUSER/examples/input/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///user/YOURUSER/examples/schema/twitter.avsc'
);
CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///avrotest/schema/twitter.avsc'
);
The setting did not take effect, so I kept trying other variations.
describe tweets;
Error log:
hive>
> CREATE EXTERNAL TABLE tweets
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION '/avrotest/data/'
> TBLPROPERTIES (
>
Display all 445 possibilities? (y or n)
>
Display all 445 possibilities? (y or n)
> 'avro.schema.url'='hdfs:///avrotest/schema/twitter.avsc'
> );
OK
Time taken: 0.579 seconds
hive>
>
> describe tweets;
OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
Time taken: 0.904 seconds, Fetched: 7 row(s)
hive> select * from tweets;
OK
Failed with exception java.io.IOException:org.apache.avro.AvroTypeException: Found com.miguno.avro.Tweet, expecting org.apache.hadoop.hive.CannotDetermineSchemaSentinel
Time taken: 1.611 seconds
hive> show tabels;
NoViableAltException(26@[654:1: ddlStatement : ( createDatabaseStatement | switchDatabaseStatement | dropDatabaseStatement | createTableStatement | dropTableStatement | truncateTableStatement | alterStatement | descStatement | showStatement | metastoreCheck | createViewStatement | dropViewStatement | createFunctionStatement | createMacroStatement | createIndexStatement | dropIndexStatement | dropFunctionStatement | dropMacroStatement | analyzeStatement | lockStatement | unlockStatement | lockDatabase | unlockDatabase | createRoleStatement | dropRoleStatement | grantPrivileges | revokePrivileges | showGrants | showRoleGrants | showRolePrincipals | showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
        at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
        at org.antlr.runtime.DFA.predict(DFA.java:144)
        at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
        at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
        at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:409)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:5 cannot recognize input near 'show' 'tabels' '<EOF>' in ddl statement
hive> show tables;
OK
hivesampletable
tweets
Time taken: 0.11 seconds, Fetched: 2 row(s)
hive> show tables;
OK
hivesampletable
tweets
Time taken: 0.062 seconds, Fetched: 2 row(s)
hive> select * from tweets;
OK
Failed with exception java.io.IOException:org.apache.avro.AvroTypeException: Found com.miguno.avro.Tweet, expecting org.apache.hadoop.hive.CannotDetermineSchemaSentinel
Time taken: 0.297 seconds
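A note on the DESCRIBE output above: when the AvroSerDe cannot obtain a schema, it appears to fall back to a placeholder schema whose column names spell out the hint "cannot determine schema, check schema url and literal". A hypothetical helper (my own sketch, not part of Hive) that spots this condition in DESCRIBE output:

```python
# Hypothetical helper: detect the AvroSerDe "cannot determine schema"
# placeholder in the output of `describe <table>`. The marker column
# names below are taken from the DESCRIBE output captured above.
SENTINEL_COLUMNS = {
    "error_error_error_error_error_error_error",
    "cannot_determine_schema",
    "check", "schema", "url", "and", "literal",
}

def schema_failed(describe_rows):
    """describe_rows: list of (column_name, column_type) tuples."""
    names = {name for name, _ in describe_rows}
    return SENTINEL_COLUMNS <= names  # True if all marker columns present

rows = [("error_error_error_error_error_error_error", "string"),
        ("cannot_determine_schema", "string"),
        ("check", "string"), ("schema", "string"),
        ("url", "string"), ("and", "string"), ("literal", "string")]
print(schema_failed(rows))  # True
```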
--
This is the same article as the one above, but it looks more current since it is maintained on git:
https://github.com/miguno/avro-hadoop-starter
CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = 'hdfs:///avrotest/schema/twitter.avsc'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data/';
Error log:
hive>
> CREATE EXTERNAL TABLE tweets
> COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> WITH SERDEPROPERTIES (
> 'avro.schema.url' = 'hdfs:///avrotest/schema/twitter.avsc'
> )
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION '/avrotest/data/';
OK
Time taken: 0.431 seconds
hive>
> describe tweets;
OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
Time taken: 0.875 seconds, Fetched: 7 row(s)
hive>
--
CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = 'hdfs:///avrotest/schema/twitter.avsc'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///avrotest/data/';
Error log:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException Incomplete HDFS URI, no host: hdfs:/avrotest/data)
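The "Incomplete HDFS URI, no host" message points at the triple-slash form: hdfs:/// leaves the authority (host) part empty, and there was apparently no default HDFS namenode to fall back on here (HDInsight normally uses Azure Blob storage, wasb, as its default filesystem). The missing host is easy to see with Python's urlparse:

```python
from urllib.parse import urlparse

# The LOCATION that triggered "Incomplete HDFS URI, no host".
uri = "hdfs:///avrotest/data/"
parsed = urlparse(uri)

print(parsed.scheme)  # hdfs
print(parsed.netloc)  # '' -- empty host, which is what DDLTask complains about
print(parsed.path)    # /avrotest/data/
```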
--
CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = '{
"type" : "record",
"name" : "Tweet",
"namespace" : "com.miguno.avro",
"fields" : [ {
"name" : "username",
"type" : "string"
}, {
"name" : "tweet",
"type" : "string"
}, {
"name" : "timestamp",
"type" : "long"
} ]
}'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data2/';
Error log (note that this attempt puts the schema JSON under 'avro.schema.url', which expects a location rather than a literal):
hive> describe tweets;
OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
Time taken: 0.585 seconds, Fetched: 7 row(s)
hive>
--
"앞에 역슬레시가 있어서 따라해 봤음
https://social.msdn.microsoft.com/Forums/azure/en-US/5e0e93f9-3c02-40dc-afa4-656faacde69b/cannot-read-data-from-hive-external-table-to-a-avro-serilized-files-failed-with-exception?forum=hdinsight
CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = '{
\"type\" : \"record\",
\"name\" : \"Tweet\",
\"namespace\" : \"com.miguno.avro\",
\"fields\" : [ {
\"name\" : \"username\",
\"type\" : \"string\"
}, {
\"name\" : \"tweet\",
\"type\" : \"string\"
}, {
\"name\" : \"timestamp\",
\"type\" : \"long\"
} ]
}'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data2/';
Error log:
hive>
>
> CREATE EXTERNAL TABLE tweets
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> WITH SERDEPROPERTIES (
> 'avro.schema.url' = '{
> \"type\" : \"record\",
> \"name\" : \"Tweet\",
> \"namespace\" : \"com.miguno.avro\",
> \"fields\" : [ {
> \"name\" : \"username\",
> \"type\" : \"string\"
> }, {
> \"name\" : \"tweet\",
> \"type\" : \"string\"
> }, {
> \"name\" : \"timestamp\",
> \"type\" : \"long\"
> } ]
> }'
> )
> STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> LOCATION '/avrotest/data2/';
OK
Time taken: 0.234 seconds
hive>
>
> show tables;
OK
hivesampletable
tweets
Time taken: 0.063 seconds, Fetched: 2 row(s)
hive> describe tweets;
OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
Time taken: 0.516 seconds, Fetched: 7 row(s)
hive>
--
CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url' = '{}'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/avrotest/data2/';
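Looking back over the failed attempts, most of them share one mistake: a schema literal (JSON text) was supplied under 'avro.schema.url', which expects a location, while inline JSON belongs under 'avro.schema.literal'. A small hypothetical helper (illustration only, not a Hive API) that picks the property to use:

```python
import json

def avro_schema_property(value):
    """Pick the SerDe property matching the value: JSON text belongs in
    'avro.schema.literal'; anything else is treated as a schema location
    for 'avro.schema.url'. (Hypothetical helper for illustration.)"""
    try:
        json.loads(value)
        return "avro.schema.literal"
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return "avro.schema.url"

print(avro_schema_property('{"type": "record", "name": "Tweet", "fields": []}'))
# avro.schema.literal
print(avro_schema_property("hdfs:///avrotest/schema/twitter.avsc"))
# avro.schema.url
```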