MapReduce Exercise

Deploy a Hadoop cluster with at least two nodes (more if possible) using your favorite virtualization technology. You can also use the virtual machines from the virtualization exercise. Configure at least the HDFS and MapReduce modules of Hadoop. You might want to have a look at the «Cloudera» distribution.

Download HTML pages from the Web (~100 MB) and store them in HDFS. You can use, for example, «HTTrack» to download entire websites.

Write a MapReduce application that counts the occurrences of every HTML tag in the HTML documents. The question to answer is: which tags are used most frequently on websites? Count invalid tags as well, and make your program case-sensitive (so <P> and <p> are counted separately). Optional: Write a custom InputReader that reads only the tags.
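To make the counting rule concrete, here is a minimal sketch of the tag extraction in plain Python 3, without any Hadoop involved. It rests on assumptions that the exercise does not spell out: a "tag" is taken to be the name that follows "<" in an opening tag, while closing tags, comments, and doctype declarations are skipped. The names `TAG_PATTERN`, `extract_tags`, and `count_tags` are illustrative only.

```python
import re
from collections import Counter

# Assumption: a tag is the name right after "<" in an opening tag.
# "</p>", "<!-- ... -->" and "<!DOCTYPE ...>" do not match because the
# character after "<" is not a letter.
TAG_PATTERN = re.compile(r"<([A-Za-z][A-Za-z0-9:-]*)")


def extract_tags(text):
    """Return the tag names found in text, in order, with case preserved."""
    return TAG_PATTERN.findall(text)


def count_tags(documents):
    """Count tag occurrences across an iterable of HTML strings."""
    counts = Counter()
    for doc in documents:
        counts.update(extract_tags(doc))
    return counts


if __name__ == "__main__":
    sample = ["<html><BODY><p>Hi</p><p>again</p><foo/></BODY></html>"]
    # Invalid tags such as <foo/> are counted like any other tag,
    # and <BODY> is kept distinct from <body>.
    for tag, n in count_tags(sample).most_common():
        print(tag, n)
```

A regular expression is a simplification: text such as `a<b` inside an inline script would also be counted as a tag. Whether that is acceptable depends on how strictly you read the task.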

Use Java, Python, or C, as you prefer. Note that choosing C or Python requires you to use the Hadoop Streaming interface.
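With Hadoop Streaming, the mapper and the reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The reducer receives its input sorted by key. The following Python 3 sketch shows one possible shape for the tag count under that contract. The file name `tagcount.py`, the `map`/`reduce` argument convention, and the tag pattern are assumptions made for illustration, not part of the exercise.

```python
import re
import sys
from itertools import groupby

# Assumption: a tag is the name right after "<" in an opening tag.
TAG_PATTERN = re.compile(r"<([A-Za-z][A-Za-z0-9:-]*)")


def run_mapper(lines, out):
    """Emit one 'tag<TAB>1' line for every opening tag in the input."""
    for line in lines:
        for tag in TAG_PATTERN.findall(line):
            out.write(f"{tag}\t1\n")


def run_reducer(lines, out):
    """Sum the counts per tag. Input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for tag, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        out.write(f"{tag}\t{total}\n")


# Hypothetical convention: "tagcount.py map" or "tagcount.py reduce".
if __name__ == "__main__" and len(sys.argv) == 2:
    if sys.argv[1] == "map":
        run_mapper(sys.stdin, sys.stdout)
    elif sys.argv[1] == "reduce":
        run_reducer(sys.stdin, sys.stdout)
```

You can try the pipeline locally before touching the cluster with `cat page.html | python3 tagcount.py map | sort | python3 tagcount.py reduce`. On the cluster, the same script would be passed to the Hadoop Streaming jar through its `-mapper`, `-reducer`, `-input`, and `-output` options. The location of that jar depends on your distribution.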

Optionally, you can look into the advanced configuration of the Hadoop framework and check how it changes performance. You can check this guide for it.

Prepare a short documentation of your program. The documentation should include:

Problem description
- List of downloaded websites
Cluster configuration description
Description of program
- (optional) InputReader
- Description of map task
- Description of reduce task
Outline of the program execution
(optional) Advanced configuration description
Runtime of the program

Last modified: 22.11.2012, 13:14 | 202 words

Grid und Cloud Computing (GCC)
