MapReduce Exercise

Deploy a Hadoop cluster with at least two nodes (more if possible) using your favorite virtualization technology. You can also use the virtual machines from the virtualization exercise. Configure at least the HDFS and MapReduce modules of Hadoop. You might want to have a look at the «Cloudera» distribution.
 
Download HTML pages from the Web (~100 MB) and store them in HDFS. You can use, for example, «HTTrack» to download entire websites.
 
Write a MapReduce application that counts the occurrences of every HTML tag in the downloaded documents. The question to answer is: which tags are used most frequently on websites? Count invalid tags as well, and treat tag names as case-sensitive. Optional: write a custom InputReader that reads only the tags.
 
Use Java, Python, or C as you prefer. Note, however, that choosing C or Python requires you to use the HadoopStreaming interface.
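
As a starting point for the Python route, the following is a minimal sketch of a mapper and reducer in the line-oriented style that HadoopStreaming expects: records are read from standard input and written to standard output as tab-separated key/value pairs. The file name tagcount.py, the function names, and the regular expression are assumptions made for illustration, not part of the exercise. In particular, the pattern counts only opening tags and accepts any name that starts with a letter, so invalid tags are included and case is preserved.

```python
#!/usr/bin/env python3
"""Tag counter sketch for HadoopStreaming: 'tagcount.py map' or 'tagcount.py reduce'."""
import re
import sys
from itertools import groupby

# Assumption: a "tag" is the name that follows '<' in an opening tag.
# Closing tags, comments and doctype declarations do not match.
TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9:-]*)")


def map_lines(lines):
    """Emit one 'tag<TAB>1' record per opening tag, keeping the original case."""
    for line in lines:
        for tag in TAG_RE.findall(line):
            yield f"{tag}\t1"


def reduce_lines(lines):
    """Sum the counts per tag. The input must already be sorted by key,
    which is how the streaming framework delivers it to a reducer."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for tag, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{tag}\t{sum(int(count) for _, count in group)}"


# Run only when a step is named explicitly on the command line.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The logic can be checked locally without a cluster by piping a few HTML files through the map step, a sort, and the reduce step. On the cluster, the same script would be passed to the streaming jar through its -mapper and -reducer options; the location of that jar depends on the Hadoop distribution you installed.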
 
Optionally, you can look into the advanced configuration of the Hadoop framework and measure how it changes performance. You can check this guide for it.
 
Prepare short documentation for your program. The documentation should include:
Last modified: 22.11.2012, 13:14