Students have been asking us for an end-to-end, real-world project that uses Spark. Last night we added a new chapter to the Spark Developer In Real World course that demonstrates an end-to-end project using Spark and other tools from the Big Data ecosystem.
Our intention with this project
Real-world projects usually involve more tools and frameworks than just Spark. With a good end-to-end project example, you will be able to visualize, and possibly implement, one of the use cases at work, or solve a problem using Spark in combination with other tools in the ecosystem. Having seen an end-to-end project, you will also be better placed to explain how big data projects are implemented using Spark in your next interview.
What is the project about?
Our goal with this project is to build a “mini Stackoverflow” website with Stackoverflow’s post (Q&A) dataset. Users can go to a web page and type in the text they would like to search for, and the website will bring back the relevant questions and answers by searching the data stored in Elasticsearch. We will write a Spark job to load the data into Elasticsearch.
- Load the Stackoverflow post dataset (27.3 GB) into Elasticsearch by writing a Spark job that transforms the XML data into Q&A documents in JSON format. Each question will have all of its answers in a nested array.
- Visualize the Stackoverflow Q&A documents stored in Elasticsearch using Kibana
- Write a REST service using the Java Spring framework and Spring Boot to expose the data stored in Elasticsearch
- Write an Angular web application that consumes the REST service, letting users search the data stored in Elasticsearch and presenting the results.
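To make the first step in the list more concrete, here is a minimal sketch of the transformation it describes: post rows go in, and one JSON document per question comes out with its answers in a nested array. This is plain Python on a tiny hand-made sample, not the Spark job from the course. It assumes the row layout of the public Stack Exchange data dump (`PostTypeId` of 1 for questions and 2 for answers, with `ParentId` pointing at the question). The output field names (`id`, `title`, `body`, `answers`) and the function name `build_documents` are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny hand-made sample shaped like the Stack Exchange Posts.xml dump.
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" Title="How do I read XML in Spark?" Body="Question body" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="First answer" />
  <row Id="3" PostTypeId="2" ParentId="1" Body="Second answer" />
  <row Id="4" PostTypeId="1" Title="What is Kibana?" Body="Another question" />
</posts>"""


def build_documents(xml_text):
    """Group answer rows under their parent question (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    questions = {}                 # question Id -> question document
    answers = defaultdict(list)    # question Id -> list of answer documents

    for row in root.iter("row"):
        post = row.attrib
        if post.get("PostTypeId") == "1":      # question
            questions[post["Id"]] = {
                "id": post["Id"],
                "title": post.get("Title", ""),
                "body": post.get("Body", ""),
                "answers": [],
            }
        elif post.get("PostTypeId") == "2" and "ParentId" in post:  # answer
            answers[post["ParentId"]].append(
                {"id": post["Id"], "body": post.get("Body", "")}
            )

    # Attach each question's answers as a nested array.
    for question_id, doc in questions.items():
        doc["answers"] = answers.get(question_id, [])
    return list(questions.values())


documents = build_documents(SAMPLE_XML)
for doc in documents:
    print(json.dumps(doc))  # one JSON document per question
```

In a Spark job the same grouping would typically be expressed as a join or group-by on the parent question id, so that it scales across the full dataset instead of holding everything in memory.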
As you can see, we are not using only Spark to solve the problem in our project. In real-world projects, there are usually multiple tools involved in solving a problem, and Spark is usually one link in a long chain. So one thing we were keen on showing students is how Spark is used along with other tools in the big data ecosystem to solve a specific problem.
Here is a quick sneak peek of the end result