Things to know when you are dealing with Apache Spark
1. Main Class
When we package a fat jar/uber jar using the Spring Boot Maven plugin, it packages the class files and Java libraries the Spring way, not the way Spark expects. Inside the uber jar generated by the Spring Boot Maven plugin there are BOOT-INF, META-INF, and org folders. So when we pass the main class to spark-submit or Spark launcher as a parameter, Spark will not be able to find that class, because the package/path given in the parameter no longer matches the changed structure of the jar file. Even if you specify the relocated path starting with BOOT-INF for the main class, it still will not work, because Spring launches the application through a different main class.
Spring Boot's executable jar documentation describes the main class that should be used for launching the fat jar generated by the Spring Boot Maven plugin.
At a high level, the MANIFEST.MF file inside the uber jar contains the entries below. Main-Class is the actual main class, used to bootstrap the Spring-related machinery; after that, your own main class, recorded in the Start-Class entry, is started.
Main-Class: org.springframework.boot.loader.JarLauncher
Start-Class: com.mycompany.project.MyApplication
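If you want to confirm these entries in your own jar, one way is to print the manifest directly (app.jar here is a placeholder name for your packaged jar):

unzip -p app.jar META-INF/MANIFEST.MF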
So, as a conclusion, specifying the main class as "org.springframework.boot.loader.JarLauncher" in spark-submit or Spark launcher will resolve this problem. Note that this only works if you are using the Spring Boot Maven plugin to package the jar.
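For illustration, a minimal spark-submit invocation along those lines might look like this (the master URL and jar path are placeholders, not taken from a real setup):

./bin/spark-submit \
  --master spark://192.168.1.1:7077 \
  --deploy-mode cluster \
  --class org.springframework.boot.loader.JarLauncher \
  /path/to/my-spring-boot-app.jar

The same idea applies when launching programmatically. A sketch using Spark's SparkLauncher API, again with placeholder path and master URL:

import org.apache.spark.launcher.SparkLauncher;

public class LaunchSpringBootJob {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                // Placeholder path to the uber jar built by the Spring Boot Maven plugin
                .setAppResource("/path/to/my-spring-boot-app.jar")
                // Use Spring's JarLauncher, not your Start-Class, as the main class
                .setMainClass("org.springframework.boot.loader.JarLauncher")
                .setMaster("spark://192.168.1.1:7077")
                .setDeployMode("cluster")
                .launch();
        int exitCode = spark.waitFor();
        System.out.println("spark-submit exited with code " + exitCode);
    }
}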
2. External common libraries used in pom.xml vs. the Spark installation
Another issue which might occur while launching an uber jar packaged with the Spring Boot Maven plugin, whether through spark-submit or the launcher application, is jar conflicts. When we package the jar using the Spring Boot Maven plugin, it copies the dependencies into the BOOT-INF/lib folder. For example, suppose you are using the dependency below in pom.xml.
<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.10</version>
</dependency>
Now suppose this dependency already exists with a different version in the Spark installation; in that case the classes will conflict. Since this is a class loader issue, it is better not to use libraries that might create such conflicts, because the application might fail. I have faced this kind of issue with logger classes and JSON libraries, as there are multiple JSON/logging library options available. As a resolution, you can exclude those classes or libraries, or replace the library with an alternative one.
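As a sketch of the exclusion approach: if the conflicting library arrives transitively, it can be kept out of BOOT-INF/lib with a Maven exclusion. The coordinates below are illustrative only; match them to the dependency that actually pulls in the conflicting library in your build:

<dependency>
    <groupId>some.group</groupId>
    <artifactId>some-artifact</artifactId>
    <version>1.0</version>
    <exclusions>
        <!-- Hypothetical example: drop a gson version that clashes with the Spark installation -->
        <exclusion>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
        </exclusion>
    </exclusions>
</dependency>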
For reference, the general spark-submit syntax:

./bin/spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  --driver-memory <value>g \
  --executor-memory <value>g \
  --executor-cores <number-of-cores> \
  --jars <comma-separated-dependencies> \
  --class <main-class> \
  <application-jar> \
  [application-arguments]
And the equivalent submission through Spark's standalone REST API:

curl -X POST http://sparkendpoint.com/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{
    "appResource": "file:/home/user/spark_pi.py",
    "sparkProperties": {
      "spark.executor.memory": "8g",
      "spark.master": "spark://192.168.1.1:7077",
      "spark.driver.memory": "8g",
      "spark.driver.cores": "2",
      "spark.eventLog.enabled": "false",
      "spark.app.name": "Spark REST API - PI",
      "spark.submit.deployMode": "cluster",
      "spark.driver.supervise": "true"
    },
    "clientSparkVersion": "2.4.0",
    "mainClass": "org.apache.spark.deploy.SparkSubmit",
    "environmentVariables": {
      "SPARK_ENV_LOADED": "1"
    },
    "action": "CreateSubmissionRequest",
    "appArgs": [ "/home/user/spark_pi.py", "80" ]
  }'
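One caveat worth knowing: the standalone master's REST submission server typically listens on port 6066, and on Spark 2.4 and later it is disabled by default, so it may need to be switched on with spark.master.rest.enabled=true on the master before this endpoint accepts submissions.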