11/10/2009

Map/Reduce - in Functional Programming & Parallel Processing Perspectives

Map/Reduce is a very popular term pair in today's technical community, mainly due to the popularity of its "inventor" - Google.

But in fact, the terms and concepts of map & reduce exist in programming language community long before G company's successful paper "MapReduce: Simplified Data Processing on Large Clusters", which appeared in OSDI04.

In this article, I want to summarize what this term pair means in functional programming literature and parallel processing literature respectively.

I - Map/Reduce in Functional Programming Perspective

  Functional Programming has long history in academia, but not been massively accepted in developer communities yet. It has some beautiful features, compared with our daily use imperative language. Higher Order Function is one of them. Basically, it means that function can be used as input parameters or return value for a function definition.

  Among various higher order functions, map, fold and filter are the most popular ones:

- Map is a higher order function that applies a given function(a.k.a transformer) element-wise to a list of elements and returns a list of results. Transformer is a function applies to each element and will produce one or more new elements.

for example: map (toLower) "abcDEFG12!@#" will produces output:"abcdefg12!@#"

- Fold (a.k.a. Reduce, Accumulate) is a higher order function that processes (using a combiner function) a list of elements in some order and build up a return value. Combiner is a function that is applied to two elements and produces a result that can be combined using combiner with the remaining elements in the list.

for example: fold (+) 0 [1..5] will produces output: 15, which is the sum of all the elements.

- Filter is a higher-order function that processes a list of elements in some order to produce a result containing exactly those original elements for which a given predicate returns the Boolean value true. Predicate is a function that takes one element as input parameter and return either true or false.

for example: filter (isAlpha) "$#!+abcDEF657" will produces output: "abcDEF"

  Essentially, these three higher order functions apply an operation on some list/array and produce some results: map transform each element, filter filtering some elements and reduce combine all the elements.

  Pure functional language, such as haskell/lisp, and some mixed language, such as python, have build-in functions named exactly as Map/Reduce. C# 3.0 introduces some functional features in LINQ subsystem, where Map is called Select and Reduce is called Aggregate.

  More concrete examples can be found in [2].

II - Map/Reduce in Parallel Processing Perspective

  Map/Reduce is a Programming Model & also an Implementation Runtime. The programming model is what you can use to express your computation tasks while implementation runtime is those software components that realize what the model claims.

  This model is called map/reduce, but their meanings are somewhat different:
  - the map semantic is the same as in functional programming language: the transformer (the mapper in Google's paper) is applied to each element of the list
  - the reduce semantic differs. Here, the combiner(the reducer in Google's paper) is applied to multiple sub sections of the elements in the list and thus produces multiple reduce results, but in functional programming language it is applied to all the elements and only produces one result.

  Conceptually, how the elements are divided into multiple sub sections?
  To resolve this problem, this model introduces some structure on the elements that are produced by mapper and consumed by reducer - each element/record has two parts: key and value. Then all the elements are divided according to the key. The records with the same key form a sub section and are passed to a reducer function as a whole.

  From implementation's perspective, the most important advantage of this Programming Model is that - it enables automatic parallelization and distribution of large scale data processing:
  - mapper is applied to each record, it's a data parallel problem by itself, we just need to distribute input data in record boundary among processing nodes.
  - reducer is applied to some sub section, we just need to distribute those sub sections among process nodes.

  Another implementation problem is fail over - what to do when failure happened?
  Simple! It just re-execute the failed specific mapper/reducer, other mappers/reducers won't be bothered at all. Because there is no communication among mappers and reducers respectively, this solution is semantically correct for mapper/reducer.
  Since the input of mapper is persisted in reliable storage system, failed mapper only need to re-execute that mapper. But the input of reducer (also the output of mapper) is persisted in worker's local storage system, re-executed reducer may found some input unavailable (for example intermediate data node crashed). In this situation, failed reducer need to re-execute both some mappers and that reducer.

[Reference]
1. Functional Programming
2. higher order functions - map, fold and filter
3. Map/Reduce/Filter in Python
4. Map/Reduce in PHP
5. Google's MapReduce Programming Model — Revisited
6. MapReduce: Simplified Data Processing on Large Clusters
7. Map-Reduce-Merge: Simplified Relational Data Processingon Large Clusters

10/29/2009

some info about compiler front end development

水木的CS_ARCH版有人在问一个关于Yacc&Lex资料的问题,碰巧知道一些情况,顺便回答了几句。似乎还是有些有用的信息,故转贴在此:

============================================================
============================================================

前段时间用bison和flex做过一个语言的前端.

flex/lex的文档,看"Lex - A Lexical Analyzer Generator "应该够了.

bison manual确实非常详细,不过还是建议你看看OReilly那本lex & yacc 2e:
一来,可以帮你更好理解这些工具,特别是处理复杂的问题时很有用
二来,上面有个简单的sql parser实例,可以照葫芦画瓢当作练习

网上最好的资料莫过于: http://dinosaur.compilertools.net/ 关于flex/bison, lex/yacc的方方面面都有.

lex & yacc那本书国内似乎有人翻译过,不过绝版了,我当时是在taobao上买的翻印版.

现在用flex/bison从头做东西的似乎不多了,这两个工具的维护似乎也跟不上。比如flex对unicode的支持就是个比较麻烦的问题,后端支持的语言也不多。

你如果从头开始做的话,可以考虑试试antlr(http://www.antlr.org),很多domain specific language compiler都用的这个工具, 比如hibernate的HQL, Yahoo!的YQL都是基于antlr的.

10/06/2009

Windows Debugging Facilities

Part I - System/Application Error Collection Tools

These tools are used to collection software data (especially when error occurs) that can be used to identify & fix software defects.

1. Dr. Watson

"Dr. Watson for Windows is a program error debugger that gathers information about your computer when an error (or user-mode fault) occurs with a program. The information obtained and logged by Dr. Watson is the information needed by technical support personnel to diagnose a program error for a computer running Windows."

A text file (Drwtsn32.log) is created whenever an error is detected, and can be delivered to support personnel by the method they prefer. A crash dump file can also be created, which is a binary file that a programmer can load into a debugger.

Starting with Windows Vista, Dr. Watson has been replaced with "Problem Reports and Solutions"

2. Windows Error Report

While Dr.Watson left the memory dump on the user's local machine for debugging, Windows Error Reporting offers to send the memory dump to Microsoft using the internet, more info can be found at - http://en.wikipedia.org/wiki/Windows_Error_Reporting

3. Adplus

ADPlus is console-based Microsoft Visual Basic script. It automates the Microsoft CDB debugger to produce memory dumps and log files that contain debug output from one or more processes. It has many switches to control what data to collect, more info can be found at Microsoft KB on Adplus tool

[Reference]
- Wiki on Dr. Watson Debugger
- Description of Dr. Watson Tool

Part II - Structured Exception Handling

SEH is usually known as a convenient error handling mechanism for windows native code programmers provided by windows operating system itself (with compiler support). But it is also a great system to enable applications talking to software debuggers.

1. Various Concepts

Structured exception handling
is a mechanism for handling both hardware and software exceptions. To full understand SEH mechanism, you should get familiar with the following concepts:
- guarded body of code
- exception handler
- termination handler
- filter expression, follows __except keyword,
evaluated when system conducting exception processing
- filter function
, can only be called in filter expression

filter expression & function can only return the following three values:
-
EXCEPTION_CONTINUE_SEARCH, The system continues its search for an exception handler.
-
EXCEPTION_CONTINUE_EXECUTION, The system stops its search for an exception handler, restores the machine state, and continues thread execution at the point at which the exception occurred.
-
EXCEPTION_EXECUTE_HANDLER, The system transfers control to the exception handler, and thread execution continues sequentially in the stack frame in which the exception handler is found.

2. Stack Unwinding

If the located exception handler is not in the stack frame in which the exception occurred, the system unwinds the stack, leaving the current stack frame and any other stack frames until it is back to the exception handler's stack frame.

3. Vectored Exception Handling

Vectored handlers are called in the order that they were added, after the debugger gets a first chance notification, but before the system begins unwinding the stack. Since they are not framed based, they will be called each time when an exception is raised.

4. Exception & Debugger

SEH is also a communication mechanism between windows application and debugger. The detailed description on the whole exception dispatching process can be found here and the debugger exception handling process can be found here.

The core concepts here are first-chance notification and second-chance(last-chance) notification.
- 1st chance notification is a mechanism to notify debugger the exception information before application get chance to process the exception.
- 2nd chance notification happens after the windows system finds that no application defined exception handler exists.

David Kline wrote a great article on first-chance & second-chance notification.

5. Functions and Keywords

- GetExceptionCode and GetExceptionInformation can be used to get detail information about current exception.
- The SEH compatible compiler recognize __try, __except, __finally, __leave as keywords.

- It will also interpret the GetExceptionCode, GetExceptionInformation, and AbnormalTermination functions as keywords, and their use outside the appropriate exception-handling syntax generates a compiler error.

[Reference]
- Crash Course on SEH
- Structured Exception Handling @ MSDN
- First and Second Chance Exception handling
- David Kline's Article

Part III - Dump File

In Part I, we introduced several tools to get diagnose data to enable offline debugging and analyzing. The most important diagnose data is dump file. There are two dump files:

1. Kernel-Mode Dump (system/core/memory dump)

This kind of dump happens when an stop error occurs in the windows system. The common phenomenon is that the blue screen shows up and at the same time, an core dump file is generated.

There are three kinds of core dump files:

- Complete Memory Dump

A Complete Memory Dump is the largest kernel-mode dump file. This file contains all the physical memory for the machine at the time of the fault.

The Complete Memory Dump file is written to %SystemRoot%\Memory.dmp by default.

- Small Memory Dump

A Small Memory Dump is much smaller than the other two kinds of kernel-mode crash dump files. It is exactly 64 KB in size, and requires only 64 KB of pagefile space on the boot drive.

- Kernel Memory Dump

A Kernel Memory Dump contains all the memory in use by the kernel at the time of the crash. This kind of dump file is significantly smaller than the Complete Memory Dump. Typically, the dump file will be around one-third the size of the physical memory on the system.

This dump file will not include unallocated memory, or any memory allocated to user-mode applications. It only includes memory allocated to the Windows kernel and hardware abstraction level (HAL), as well as memory allocated to kernel-mode drivers and other kernel-mode programs.

Here, you can find more detailed information on kernel-mode dump files.

An great Microsoft KB on system dump(core dump, blue screen) configuration.

2. User-Mode Dump (application/process dump)

This kind of dump file is from specific process, not from the windows system itself.

- Full Dump

A full user-mode dump is the basic user-mode dump file. This dump file includes the entire memory space of a process, the program's executable image itself, the handle table, and other information that will be useful to the debugger.

- Mini Dump

A mini user-mode dump includes only selected parts of the memory associated with a process. The size and contents of a minidump file varies depending on the program being dumped and the application doing the dumping.

The name "minidump" is misleading, because the largest minidump files actually contain more information than the "full" user-mode dump.

User mode dump files can be created using the following methods:
- using task manager
- using adplus
- using userdump

You can also using some debugging tools such as Visual Studio and Windbg to create dump files.

Manipulate Mini Dump Programmatically:
- MiniDumpReadDumpStream()
- MiniDumpWriteDump()
- MINIDUMP_TYPE ENUM

[Reference]
- Dump in Visual Studio
- Crash Dump Doc @ MSDN

9/26/2009

What's Good Research

Some senior people shared his opinions on "What's Good Research" a couple of weeks ago:

1. Like a movie, that has a good and compelling Story to tell;
[clear motivation/vision behind the various technologies]

2. Like Magic, about making seemingly impossible possible;
[non-trivial idea and implementation, should has some technical bars]

3. Emotional, touches people’s Life when you explain it.
[potential to has big impact on this world]

9/20/2009

Qizmt - MySpace's Open Source Mapreduce Framework

MySpace just open sourced it’s C# & Windows Based Map/Reduce framework – Qizmt (http://code.google.com/p/qizmt/)

C#/Windows world has an industrial-quality Hadoop counterpart. It’s a good news for both C# users and Microsoft’s server business.

One nice feature of Qizmt is it's built-in support for debugging distributed computing applications.

Qizmt currently supports .Net 3.5 SP1 on Windows 2003 Server, Windows 2008 Server and Windows Vista