Hanging out with the Elephant, Hadoop

Yes, I admit that I am new to Hadoop, but I have made a lot of progress in the past few days. I learned how to run my jobs both locally and on a Hadoop cluster. Python was not new to me, so it was easier for me to navigate my Hadoop journey. The first dataset I used is all about movie ratings because my instructor is part of the IMDB gang.

For the first job, I wanted to count how many ratings fall under each star value of the 5-star rating system. First, a mapper function extracts the rating from each record in the data set and pairs it with the number 1: the rating is the key and 1 is the value. After some shuffling and sorting, we proceed to the next step, which is a reducer function that takes each key (the rating) and sums the 1s associated with it. That sum is the number of times movies have been rated, say, 4 stars.

For the second job, I was interested in sorting the movies by the number of times they have been rated in the database. The more times a movie has been rated, the more popular it is. Again, the mapper extracts the movieID as the key and sets 1 as the value. The reducer then adds up the 1s to get the number of times each movie was rated. This time, however, the reducer outputs the count as the key and the movieID as the value. This way, when the output passes through a second step, the shuffle and sort phase automatically sorts things by the movie count for us. Then we write another function for that second step that takes each unique count and outputs the movies associated with it. The result is that the movie column appears on the left and the count column on the right.
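To make the two jobs concrete, here is a minimal sketch in plain Python that imitates the map, shuffle and sort, and reduce phases in memory. It is not the Hadoop API and not necessarily how my actual jobs were written. The tab-separated record layout (userID, movieID, rating, timestamp), the sample records, and every function name below are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample records. The tab-separated layout
# (userID, movieID, rating, timestamp) is an assumption.
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t242\t5\t880606923",
    "166\t242\t4\t886397596",
    "298\t302\t4\t884182806",
]


def run_step(records, mapper, reducer):
    """Imitate one MapReduce step in memory: map, shuffle and sort, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    output = []
    for key in sorted(groups):  # sort: keys reach the reducer in order
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: how many times each star rating was given.
def map_rating(line):
    _user, _movie, rating, _timestamp = line.split("\t")
    yield rating, 1


def reduce_sum(rating, ones):
    yield rating, sum(ones)


# Job 2, step 1: count ratings per movie, then emit the count as the key.
def map_movie(line):
    _user, movie, _rating, _timestamp = line.split("\t")
    yield movie, 1


def reduce_count_as_key(movie, ones):
    yield sum(ones), movie  # flipped so the next step sorts by count


# Job 2, step 2: pass pairs through, then list the movies under each count.
def map_identity(pair):
    yield pair


def reduce_emit_movies(count, movies):
    for movie in movies:
        yield movie, count  # movie on the left, count on the right


def ratings_histogram(lines):
    return run_step(lines, map_rating, reduce_sum)


def movies_by_popularity(lines):
    counted = run_step(lines, map_movie, reduce_count_as_key)
    return run_step(counted, map_identity, reduce_emit_movies)


if __name__ == "__main__":
    print(ratings_histogram(SAMPLE_LINES))
    print(movies_by_popularity(SAMPLE_LINES))
```

In this sketch the counts are Python integers, so the second step sorts them numerically. When keys travel as text between steps, they are typically compared as strings, so zero-padding the count is one common way to keep the numeric order.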
