Reading and writing Amazon S3 files from Apache Spark

The S3 Native Filesystem client, which Apache Spark inherits from the underlying Apache Hadoop libraries, gives an Apache Spark application access to the Amazon S3 service. It is enough to set the AWS Access Key ID and the AWS Secret Access Key in the Hadoop configuration of the Spark context, as shown below:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("AppName"))
// Credentials for the S3 Native Filesystem (s3n) client
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "SECRET_ACCESS_KEY")

Then you can read and write files using the s3n URI scheme:

// Read a text file from S3 into an RDD
val textFile = sc.textFile("s3n://bucket/source_path")
// Write the RDD back to S3 as text
textFile.saveAsTextFile("s3n://bucket/target_path")
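
Putting the two steps together, here is a minimal self-contained sketch (the bucket, paths, and key placeholders are illustrative) that reads a file from S3, counts the words in it, and writes the result back:

import org.apache.spark.{SparkConf, SparkContext}

object S3WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3WordCount"))
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "SECRET_ACCESS_KEY")

    // Read from S3, split lines into words, and count each word
    val counts = sc.textFile("s3n://bucket/source_path")
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Write one "word<TAB>count" line per record back to S3
    counts.map { case (w, n) => s"$w\t$n" }.saveAsTextFile("s3n://bucket/target_path")

    sc.stop()
  }
}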
