Bash script to launch a Apache Spark application in a Amazon Elastic MapReduce cluster

The following Bash script launches a new Amazon Elastic MapReduce cluster from a shell in any remote server with the AWS CLI installed.

#!/bin/bash
# Launch AWS EMR Spark Application using aws cli
# Usage: ./aws_launch_spark_app.sh CLUSTER_NAME INSTANCE_TYPE INSTANCE_COUNT LOG_URI CONFIG_FILE [-s step_class step_class_jar [step_class_params]...]...

CLUSTER_NAME=$1
INSTANCE_TYPE=$2
INSTANCE_COUNT=$3
LOG_URI=$4
CONFIG_FILE=$5

STEPS=""

# Build steps argument text
for arg in "$@"; do
  shift
  if [ $arg = '-s' ]; then
    step_class=$1
    step_class_jar=$2
    step_class_params_array=${@:3}
    step_class_params=""

    # Build step class params argument text
    for step_class_param in $step_class_params_array; do
      if [ $step_class_param = '-s' ]; then
        break
      fi
      step_class_params+=",$step_class_param"
    done

    step="$step_class,$step_class_jar$step_class_params"
    step_name="$step_class$step_class_params"
    STEPS+="Type=Spark,Name=${step_name//,/},Args=[--deploy-mode,cluster,--master,yarn-cluster,--class,$step] "
  fi
done

aws emr create-cluster \
--name $CLUSTER_NAME \
--release-label emr-4.2.0 \
--instance-type $INSTANCE_TYPE \
--instance-count $INSTANCE_COUNT \
--ec2-attributes KeyName=emndata \
--applications Name=Spark \
--log-uri $LOG_URI \
--configurations $CONFIG_FILE \
--steps $STEPS\
--auto-terminate

https://github.com/marcalpla/bash-utils/blob/master/aws_launch_spark_app.sh

This script runs properly with the Amazon Elastic MapReduce release 4.2.0 and may also run with later versions changing the release-label parameter.

The script have the role of interface that interacts with the AWS CLI. You can call the script directly from the shell or from a final application launcher such as:

#!/bin/bash
# Launch a specific Spark application
# Usage: ./this_script.sh INSTANCE_TYPE INSTANCE_COUNT

CLUSTER_NAME=SparkApplication
INSTANCE_TYPE=$1
INSTANCE_COUNT=$2
LOG_URI=s3://bucket/log_path
CONFIG_FILE=https://s3-eu-west-1.amazonaws.com/bucket/config_file.json
STEP_CLASS=SparkApplicationMainClassName
STEP_JAR_PATH=s3://bucket/SparkApplication.jar
AWS_EMR_LAUNCHER_PATH=/home/user/aws_launch_spark_app.sh 
LOG_PATH_LOCAL=/var/log/spark_launcher.log

STEPS="-s $STEP_CLASS $STEP_JAR_PATH step1_param_1 step1_param_2 step1_param_n "
STEPS+="-s $STEP_CLASS $STEP_JAR_PATH step2_param_1 step2_param_2 step2_param_n "
STEPS+="-s $STEP_CLASS $STEP_JAR_PATH step3_param_1 step3_param_2 step3_param_n "

echo "$(date +"%Y-%m-%d %H:%M:%S") Launching a cluster..." >> $LOG_PATH_LOCAL 2>&1

$AWS_EMR_LAUNCHER_PATH $CLUSTER_NAME $INSTANCE_TYPE $INSTANCE_COUNT $LOG_URI $CONFIG_FILE $STEPS >> $LOG_PATH_LOCAL 2>&1

echo "$(date +"%Y-%m-%d %H:%M:%S") Done." >> $LOG_PATH_LOCAL 2>&1

 

Leave a Reply

Your email address will not be published. Required fields are marked *