Troubleshooting a pyspark startup exception

Running pyspark on the Spark cluster I set up throws an exception; this post records how I tracked it down.

[root@jkstore77 bin]# pyspark
Python 2.7.5 (default, Apr  9 2019, 14:30:50) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
        at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:123)
        at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:123)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:123)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:109)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 7 more
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib/spark/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(pyFiles=add_files)
  File "/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib/spark/python/pyspark/context.py", line 112, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib/spark/python/pyspark/context.py", line 246, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib/spark/python/pyspark/java_gateway.py", line 92, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
>>> 
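
The important part of this output is the Java stack trace: SparkSubmit cannot load org.apache.hadoop.fs.FSDataInputStream, which generally means the Hadoop jars are missing from the JVM's classpath, and the Python-side "Java gateway process exited before sending its port number" is only a consequence of that JVM dying. A quick, hedged sanity check (not part of the original session) is to ask Hadoop itself where its jars are; if this prints a jar list, Hadoop is fine and only the Spark launcher's classpath wiring is broken:

# Print the Hadoop classpath; an error or empty output here would point to a broken Hadoop install instead.
hadoop classpath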

Searching the filesystem for pyspark turns up the following four files, all in different locations.

[root@jkstore77 bin]# find / -type f -name pyspark -print -exec ls -l {} \;    

/var/lib/alternatives/pyspark
-rw-r--r-- 1 root root 88 Sep 17 11:16 /var/lib/alternatives/pyspark

/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib/spark/bin/pyspark
-rwxr-xr-x 1 root root 3545 Aug 10  2018 /opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib/spark/bin/pyspark

/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/bin/pyspark
-rwxr-xr-x 1 root root 654 Aug 10  2018 /opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/bin/pyspark

/opt/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/pyspark
-rwxr-xr-x 1 root root 2987 Sep 27  2018 /opt/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/pyspark

So which of these does the pyspark command actually run?

[root@jkstore77 bin]# whereis pyspark
pyspark: /usr/bin/pyspark

The pyspark command on the PATH is /usr/bin/pyspark.

[root@jkstore77 bin]# ll /usr/bin/pyspark
lrwxrwxrwx 1 root root 25 Sep 17 11:16 /usr/bin/pyspark -> /etc/alternatives/pyspark

/usr/bin/pyspark is a symbolic link pointing to /etc/alternatives/pyspark.

[root@jkstore77 bin]# ll /etc/alternatives/pyspark
lrwxrwxrwx 1 root root 61 Sep 17 11:16 /etc/alternatives/pyspark -> /opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/bin/pyspark
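
As a side note, instead of following each link by hand, readlink -f (GNU coreutils, available on CentOS) resolves the whole chain in one step; a small convenience check, not part of the original session:

# Follow every level of symlink and print the final target of the pyspark command.
readlink -f /usr/bin/pyspark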

/etc/alternatives/pyspark is itself another symbolic link, and it points to /opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/bin/pyspark. So why does running this command fail? Let's look at its configuration file.

[root@jkstore77 bin]# find / -name spark-env.sh -print                         
/etc/spark2/conf.cloudera.spark2_on_yarn/spark-env.sh
/opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/etc/spark/conf.dist/spark-env.sh

Ugh, there are two config files as well. The spark-env.sh under the CDH parcel's own directory should be the one this pyspark reads; a quick check is sketched below, and then let's open the file.
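
Standard Spark launch scripts source spark-env.sh from $SPARK_CONF_DIR, falling back to $SPARK_HOME/conf, so a quick (hedged) way to confirm which file applies is to see where the parcel's conf directory points:

# If conf is a symlink, its target tells us which spark-env.sh actually gets sourced.
ls -l /opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/lib/spark/conf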

[root@jkstore77 bin]# vim /opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/etc/spark/conf.dist/spark-env.sh

export STANDALONE_SPARK_MASTER_HOST=`hostname`

export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

### Let's run everything with JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR='/var/run/spark/'

if [ -n "$HADOOP_HOME" ]; then
  export LD_LIBRARY_PATH=:/usr/lib/hadoop/lib/native
fi

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}

if [[ -d $SPARK_HOME/python ]]
then
    for i in
    do
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:$i
    done
fi

SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$SPARK_LIBRARY_PATH/spark-assembly.jar"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hive/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/flume-ng/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/parquet/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/avro/*"
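
Two things stand out in this template, both hedged observations rather than confirmed causes: the for-loop body and one SPARK_DIST_CLASSPATH entry are empty (possibly lost in pasting), and nearly every path it appends lives under /usr/lib/..., which is where an rpm-based CDH install puts Hadoop; on a parcel-based install the jars live under /opt/cloudera/parcels instead. If none of those directories exist, SPARK_DIST_CLASSPATH ends up with no Hadoop jars, which would explain the NoClassDefFoundError above. The generic way to close that gap, sketched here but not what I ended up doing, is to hand Spark the cluster's real Hadoop classpath in spark-env.sh:

# Hypothetical fix: let the hadoop command report where its jars actually live.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)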

Reading this file leaves me none the wiser. Let's try running the other pyspark and see what happens.

[root@jkstore77 bin]# cd /opt/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/
[root@jkstore77 bin]# ./pyspark
Python 2.7.5 (default, Apr  9 2019, 14:30:50) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0.cloudera4
      /_/

Using Python version 2.7.5 (default, Apr  9 2019 14:30:50)
SparkSession available as 'spark'.
>>> 

It starts without any error. So there are two versions of pyspark on this machine, and I remember that the Spark we installed at the time was 2.2.0.

If the pyspark link can be repointed at the 2.2.0 version, the problem is solved. Let's get to work.

First, find the pyspark symbolic links:

[root@jkstore77 bin]# find / -type l -name pyspark -print -exec ls -l {} \; 
/etc/alternatives/pyspark
lrwxrwxrwx 1 root root 61 Sep 17 11:16 /etc/alternatives/pyspark -> /opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/bin/pyspark
/usr/bin/pyspark
lrwxrwxrwx 1 root root 25 Sep 17 11:16 /usr/bin/pyspark -> /etc/alternatives/pyspark

Why two levels of links? A web search shows that alternatives is a tool for managing multiple installed versions of the same software. The pyspark command is really a symbolic link to /etc/alternatives/pyspark, and /etc/alternatives/pyspark is in turn a symbolic link to /opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/bin/pyspark, so typing pyspark actually runs /opt/cloudera/parcels/CDH-5.15.1-1.cdh5.15.1.p0.4/bin/pyspark. alternatives builds this two-level chain precisely so that scripts and system administration only ever reference one stable path while the real target can be switched underneath.

Looking inside /etc/alternatives, there are a lot of these executable links; evidently CDH uses alternatives to manage versions.
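
On a Red Hat-style system the same alternatives command can inspect and change these registrations directly. A sketch of what repointing could look like (I did not end up doing this, and Cloudera Manager manages these links, so manual changes may not survive a parcel redeploy):

# List every candidate registered under the name "pyspark" and the current selection.
alternatives --display pyspark

# Hypothetically: register the Spark 2 launcher as a candidate and select it.
alternatives --install /usr/bin/pyspark pyspark /opt/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/pyspark 20
alternatives --set pyspark /opt/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2/bin/pyspark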

And one of them is pyspark2. Suddenly it all makes sense: after all this digging, pyspark is the old version and pyspark2 is the new one. Just run pyspark2 instead of pyspark; there is no need to change the link at all.
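
For completeness, pyspark2 is wired up through the same alternatives mechanism, so it can be verified the same way (assuming the pyspark2 alternative is on the PATH, which it is here):

# Confirm where the pyspark2 command ultimately resolves to.
readlink -f $(which pyspark2)

# Then, inside the pyspark2 shell, the SparkSession exposed as 'spark' confirms the version:
#     spark.version    # expected to show 2.2.0.cloudera4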

