INTRODUCTION TO SPARK


  • What does Donald Rumsfeld have to do with data analysis?
  • Why is Spark so cool?
  • An introduction to RDDs - Resilient Distributed Datasets
  • Built-in libraries for Spark
  • Installing Spark
  • The PySpark Shell
  • Transformations and Actions
  • See it in Action : Munging Airlines Data with PySpark - I
  • [For Linux/Mac OS Shell Newbies] Path and other Environment Variables




RESILIENT DISTRIBUTED DATASETS


  • RDD Characteristics: Partitions and Immutability
  • RDD Characteristics: Lineage, RDDs know where they came from
  • What can you do with RDDs?
  • Create your first RDD from a file
  • Average distance travelled by a flight using map() and reduce() operations
  • Get delayed flights using filter(), cache data using persist()
  • Average flight delay in one step using aggregate()
  • Frequency histogram of delays using countByValue()
  • See it in Action : Analyzing Airlines Data with PySpark - II




BASIC SEARCH & OPTIMIZATION ALGORITHMS


  • Brute-force search introduction
  • Brute-force search example
  • Stochastic search introduction
  • Stochastic search example
  • Hill climbing introduction
  • Hill climbing example




ADVANCED RDDS: PAIR RESILIENT DISTRIBUTED DATASETS


  • Special Transformations and Actions
  • Average delay per airport, use reduceByKey(), mapValues() and join()
  • Average delay per airport in one step using combineByKey()
  • Get the top airports by delay using sortBy()
  • Lookup airport descriptions using lookup(), collectAsMap(), broadcast()
  • See it in Action : Analyzing Airlines Data with PySpark - III




ADVANCED SPARK: ACCUMULATORS, SPARK SUBMIT, MAPREDUCE, BEHIND THE SCENES


  • Get information from individual processing nodes using accumulators
  • See it in Action : Using an Accumulator variable
  • Long-running programs using spark-submit
  • See it in Action : Running a Python script with Spark-Submit
  • Behind the scenes: What happens when a Spark script runs?
  • Running MapReduce operations
  • See it in Action : MapReduce with Spark




JAVA AND SPARK


  • The Java API and Function objects
  • Pair RDDs in Java
  • Running Java code
  • Installing Maven
  • See it in Action : Running a Spark Job with Java




PAGERANK: RANKING SEARCH RESULTS


  • What is PageRank?
  • The PageRank algorithm
  • Implement PageRank in Spark
  • Join optimization in PageRank using Custom Partitioning
  • See it in Action : The PageRank algorithm using Spark




SPARK SQL


  • DataFrames: RDDs + Tables
  • See it in Action : DataFrames and Spark SQL




MLLIB IN SPARK: BUILD A RECOMMENDATIONS ENGINE


  • Collaborative filtering algorithms
  • Latent Factor Analysis with the Alternating Least Squares method
  • Music recommendations using the Audioscrobbler dataset
  • Implement code in Spark using MLlib




SPARK STREAMING


  • Introduction to streaming
  • Implement stream processing in Spark using DStreams
  • Stateful transformations using sliding windows
  • See it in Action : Spark Streaming




GRAPH LIBRARIES


  • The Marvel social network using Graphs




INTERVIEW WITH SINGAPORE EXPERT


  • Background of Expert
  • Information and Communication Technology in Singapore




