Apache Spark 2.4.4 on HDP 3.0.1.0-187.
First, build Spark against Hadoop 3:
./dev/make-distribution.sh --pip --tgz -Phadoop-3.1 -Phive -Phive-thriftserver -Pyarn -Pkubernetes
When the build completes, copy the package to the HDP cluster. Make a separate copy of the Hadoop configuration files for this Spark installation, and in that copy's yarn-site.xml set yarn.timeline-service.enabled to false. Otherwise startup fails with NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig.
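In the private copy of the Hadoop configuration, the override is a single property (a sketch; everything else in yarn-site.xml stays as copied from the cluster):

```xml
<!-- yarn-site.xml, in the Hadoop conf copy used only by this Spark installation -->
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>
```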
The cause: Spark 2.4.4 and YARN both depend on jersey-client, but on different versions: spark/jars ships jersey-client-2.22.2.jar, while hadoop-yarn/lib ships jersey-client-1.19.jar. When yarn.timeline-service.enabled is true, a code path in the YARN API is taken that uses jersey-client classes and methods which no longer exist in jersey-client-2.22.2.jar, hence the error.
HDP's own Spark 2 does not hit this error because HDP's Spark code explicitly sets yarn.timeline-service.enabled to false; see https://www.jianshu.com/p/460f98111d43 for details.
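The version clash is easy to confirm from the jar file names alone; a tiny helper (the lookup paths in the comments are assumptions about a typical layout) pulls the version out of a jersey-client jar name:

```shell
# Sketch: print the version embedded in a jersey-client jar file name.
# Usage: jersey_version /path/to/jersey-client-2.22.2.jar   -> 2.22.2
jersey_version() {
  basename "$1" .jar | sed 's/^jersey-client-//'
}

# Typical check on the cluster (paths are assumptions, adjust to your layout):
#   jersey_version "$SPARK_HOME"/jars/jersey-client-*.jar
#   jersey_version /usr/hdp/current/hadoop-yarn-client/lib/jersey-client-*.jar
```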
Starting spark-on-yarn then fails with: ShimLoader.getMajorVersion: Unrecognized Hadoop major version number: 3.1.0.
The cause is that the hive-exec bundled with Spark 2 is hive-exec-1.2.1.spark2.jar, and that version does not support Hadoop 3, hence the error.
Fix 1: replace hive-exec-1.2.1.spark2.jar with the HDP-supplied hive-exec-1.21.2.3.0.1.0-187.jar; this resolves the error above.
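The swap in fix 1 can be scripted; this is a sketch under assumed paths (your SPARK_HOME and the HDP repository path will differ), keeping a backup of the Spark-bundled jar:

```shell
# Sketch: replace the Spark-bundled hive-exec jar with the HDP one.
# Usage: swap_hive_exec <spark-jars-dir> <hdp-hive-exec-jar>
swap_hive_exec() {
  local spark_jars="$1" hdp_jar="$2" old
  # find the Spark-bundled hive-exec jar; bail out if it is not there
  old=$(ls "$spark_jars"/hive-exec-*.spark2.jar 2>/dev/null) || return 1
  # keep the original around as a backup, then drop in the HDP build
  mv "$old" "$old.bak" && cp "$hdp_jar" "$spark_jars/"
}

# e.g. (both paths are assumptions):
#   swap_hive_exec "$SPARK_HOME/jars" /usr/hdp/3.0.1.0-187/hive/lib/hive-exec-1.21.2.3.0.1.0-187.jar
```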
Fix 2: build Spark without the Hadoop 3 (or later) profile, and set in spark-defaults.conf:
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 1.2.1
Starting spark-on-yarn again: for several minutes the driver keeps printing Application report for application_1588126420266_0729 (state: ACCEPTED), and it finally fails with Exception message: /data/hadoop/yarn/local/usercache/ocsp/appcache/application_1519982778829_0171/container_e37_1519982778829_0171_02_000001/launch_container.sh: line 32: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
The classpath in launch_container.sh contains the literal, unresolved ${hdp.version} placeholder, which the shell rejects. Adding the following to spark-defaults.conf solves it: spark.driver.extraJavaOptions -Dhdp.version=3.1.0 and spark.yarn.am.extraJavaOptions -Dhdp.version=3.1.0.
See https://www.jianshu.com/p/de762c244663 for more on this problem.
Note: Spark must not reference the HDP cluster's mapred-site.xml; if it does, the problem above persists even with the configuration added.
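Collected in one place, the spark-defaults.conf fragment for this fix looks like the following (the exact hdp.version value depends on your cluster; 3.1.0 is the value used in this write-up):

```properties
# spark-defaults.conf: provide a value for the ${hdp.version} placeholder
# that YARN leaves unresolved in launch_container.sh
spark.driver.extraJavaOptions   -Dhdp.version=3.1.0
spark.yarn.am.extraJavaOptions  -Dhdp.version=3.1.0
```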
Those are the problems hit while bringing up Spark on YARN with static resource allocation; once they were fixed, the application started normally.
Next, Spark on YARN with dynamic resource allocation.
First, the default port of HDP's Spark shuffle service is 7447, not Apache Spark's 7337, so spark.shuffle.service.port must be set to 7447 in spark-defaults.conf. The initial executor count is left unset and defaults to 0.
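A sketch of the relevant spark-defaults.conf entries for dynamic allocation on HDP (the port line is the HDP-specific part; the other keys are standard Spark configuration):

```properties
# spark-defaults.conf: dynamic allocation against HDP's shuffle service
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
# HDP's shuffle service listens on 7447; Apache Spark's default is 7337
spark.shuffle.service.port       7447
# spark.dynamicAllocation.initialExecutors is left unset; it defaults to
# spark.dynamicAllocation.minExecutors, which defaults to 0
```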
Then start Spark on YARN with dynamic allocation. Startup succeeds, and at this point only the ApplicationMaster exists.
Then submit any SQL statement; the console prints
ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged
and then keeps printing the following log line:
WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The ApplicationMaster log shows the following error:
INFO YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
ERROR YarnAllocator: Failed to launch executor 1 on container container_e55_1588126420266_0777_01_000002
org.apache.spark.SparkException: Exception while starting container container_e55_1588126420266_0777_01_000002 on host online-slave-6
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:126)
at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65)
at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$2.run(YarnAllocator.scala:546)
...
Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
So the executor cannot be created because the Spark shuffle service is not found.
Checking the YARN configuration: the shuffle service is registered under the name spark2_shuffle. This is because HDP also bundles Spark 1.x, whose shuffle service is named spark_shuffle.
Apache Spark, however, hardcodes "spark_shuffle" in its source and provides no way to configure the shuffle service name, so I added a spark.shuffle.service.name option (a source change) and set it to spark2_shuffle, which solved the problem. If you would rather not patch the Apache Spark source, the only alternative is to change the YARN configuration and restart YARN.
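For the YARN-side alternative, the NodeManager aux-service registration would look roughly like this (a sketch; the exact service list depends on your cluster, the Spark 2.4.4 yarn-shuffle jar must be on the NodeManager classpath, and YARN must be restarted afterwards):

```xml
<!-- yarn-site.xml on every NodeManager -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle,spark2_shuffle</value>
</property>
<property>
  <!-- register the service under the name Apache Spark expects -->
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```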