Pre-lunch: Overview and Architecture; Mechanisms for Asynchrony
Post-lunch: Topics in Synchronization; Synchronization Best Practices; Designs and Algorithms; .NET 4.0
These are my notes from the pre-con, with minimal editing. My goal is to provide the raw meat of the talk. I'll sum up and discuss in a later post. My commentary in [ ].
Parallel Computing Program (PCP) : mandate is to prepare Microsoft for the furture
Parallel computing matters more as multi-core hardware architectures become more pervasive.
Some history
Gordon Moore predicted in 1965 that in 1975 there may be 65,000 transistors on a single die. Intel's Nehalem has 731 million transistors. Recommended Windows minimum specs have pretty much tracked Moore's law from Windows 3.1 to Vista. [This means that we, as developers, we'll use as much hardware power as we're given.]
Processors are reaching power densities similiar to a rocket nozzle... we're hitting a power density wall. We're also hitting a memory wall. As a result, single-core performance has more or less reached a limit. However, there's still a capability to improve value through multi-core systems -- and it's the software industry that needs to drive this demand.
"The free lunch" refers to scalable parallel programs, which will be written once and can scale over time with hardware architectures. [The goal here is an application that will run and scale for ten to twenty years.]
Four big themes
Latent parallelism for future scaling
Focus on data -- the scalable dimension
1. Need to consider relations of future features to performance, particularly as it relates to increasing amounts of data
Tasks instead of threads
1. Tasks are single units of work than can be executed separately
2. The concept of "tasks" allow better discussion of parallelism
No silver bullet -- many "right" approaches
1. 30-40 years of experience with parallel computing hasn't revealed a silver bullet
2. Instead, the focus is on multiple approaches that can be intelligently selected and applied
Demonstration code: Connected Components in a node graph
Parallelization strategy:
1. starting searches from random locations; each node is a candidate node
2. some searches will collide; ownership must be arbitrated and the connection must be recorded
3. the connections and candidates reduce to a supergraph
4. update nodes to refer to final components
Some issues:
1. we're going to go through the candidates in a random order
2. root nodes may not be ordered the same each time; is that acceptable?
Parallelism vs Concurrency
1. concurrency: multiple independent incoming requests serviced simultaneously
2. parallelism: a single task divided into component tasks that execute simultaneously
Something of a fractal issue, since the parallelized tasks may well have concurrency considerations (shared resources). Many of these considerations involve latency or scheduling issues. Asynchronicity may also be a consideration. If developers have to manage resources, threads and scheduling, then it's a multi-threaded problem -- this is what the PCP team is trying to move us away from.
System Metrics
Fairness
: allocationg resources to competing uses
Preemption: removing one resources from one use to enable other
** note: these first two are
not a goal, but a context
Responsiveness: latency from request to response; this is an existing architectural concern
Throughput: work (requests) per unit time; implies that overhead needs to be decreased
Approaches to expressing parallelism
1. Task vs data parallelism
- classical data parallel: same operation applied to a homogenous collection
- task parallel: data-focused but built on a "task" model for generality
2. Structured multi-threading
- emphasizes recursive decomposition
- preserves function interfaces (fork, join)
- structured control constructs (parallel loops, co-begin)
- each iteration of a loop is a task; all tasks must complete before the function returns
- developer is not involved in resource allocation; makes components more adaptive
3. Work-sharing (SMPD, OpenMP)
- emphasizes processors (fork/join threads + barrier)
- includes "parallel regions" which allow you to get resources (a "team" of "workers")
- structured control constructs ("shared loops", improving support for recursion)
- involves the developer in resource allocation
4. Dataflow
- a computation is a network where data flows through the connections and work is performed in the nodes
- a good fit for streaming media algorithms, pipelining, concurrent event systems, component interactions for responsiveness, occassionally in recursive algorithms
Summary: different models apply best for different subdomains and application structures. Microsoft has taken steps to provide good support for each of these, as well as putting effort into interoperability.
Mutual exclusion
In a parallel environment, race conditions may exist. As a developer, we need to identify the possible collections, then implement a technique to guarantee mutual exclusion. The typical mechanism for this is using a "lock". However, the lock needs to be intelligently applied; otherwise, you can effectively serialize the entire algorithm.
Locks are also fairly expensive on today's machine. Low-lock and lock-free techniques have been introduced, sometimes at the hardware level (compare_and_swap primitive; atomic operation). These are of short duration, preemption friendly and are useful limited scenarios. Building more useful tools from these primitives has proven to be challenging. Transactional memory may be a future solution.
Concurrent data structures
We need concurrency-safe, high-bandwidth data structures to support parallel computing. There are new versions of classic data structures (vector, hashtables, etc) under development.
Parallel computing
1. Moore's dividend: sequential needs to become parallel
2. over-decompose for scaling
3. relax sequential order to gain more parallelism
4. consider data as well as control flow
The Rabbit Hole of Performance
- parallelism overhead may make a sequential process run slower
-
Amdahl's Law: as more processors get added, the benefit of each processor is reduced; this is partly due to the reduced efficiency of system resources
- resource contention; hardware is currently tuned for single threads; etc.
The Black Hole of Correctness (the new failure modes of parallelism)
1. Data races
2. Deadlock
3. Livelock
4. Memory hierarchy
5. Performance robustness
6. Reproducibility and testing (if output sequences differ, how do we test?)