Did programming languages drive hardware development?


87

The development of programming languages has been influenced by hardware design. One example from this answer mentions how C pointers were, at least in part, influenced by the design of the PDP-11. Did the reverse ever happen: did constructs provided by a language drive the development of hardware?

To be clear, I'm asking about core language constructs (pointers, for example), not about an industry consortium coming up with something like OpenGL that then gets implemented in hardware. Hardware floating-point support, perhaps?

21

Yes. Case in point, the VAX. The instruction set design was influenced by the requirements of the compiled languages of the day. For example, the orthogonality of the ISA; the provision of instructions that map to language constructs such as 'case' statements (in the numbered-case sense of Hoare's original formulation, not the labelled-case of C), loop statements, and so on.

VAX Architecture Ref - see the Introduction.

I am not claiming the VAX is unique in this respect, just an example I know a little about. As a second example, I'll mention the Burroughs B6500 'display' registers. A display, in 1960s speak, is a mechanism for efficient uplevel references. If your language, such as Algol60, permits declaration of nested procedures to any depth, then arbitrary references to the local variables of different levels of enclosing procedure require special handling. The mechanism used (the 'display') was invented for KDF9 Whetstone Algol by Randell and Russell, and described in their book Algol 60 Implementation. The B6500 incorporates that into hardware:

The B6500/B7500 contains a network of Display Registers (D0 through D31) which are caused to point at the appropriate MSCW (Fig. 5). The local variables of all procedures global to the current procedure are addressed in the B6500/B7500 relative to the Display Registers.


10

Arguably, the relatively simple logical structure of DO loops in Fortran motivated the development of vector hardware on the early Cray and Cyber supercomputers. There may be some "chicken and egg" between hardware and software development though, since CDC Fortran included array slicing operations to encourage programmers to write "logic-free" loops long before that syntax became standardized in Fortran 90.

Certainly the Cray XMP hardware enhancements compared with the Cray 1, such as improved "chaining" (i.e. overlapping in time) of vector operations, vector mask instructions, and gather/scatter vector addressing, were aimed at improving the performance of typical code written in "idiomatic" Fortran.

The need to find a way to overcome the I/O bottlenecks caused by faster computation, without the prohibitive expense of large fast memory, led to the development of the early Cray SSD storage devices as an intermediate level between main memory and conventional disk and tape storage devices. Fortran I/O statements made it easy to read and write a random-access file as if it were a large two-dimensional array of data.

See section 2 of http://www.chilton-computing.org.uk/ccd/supercomputers/p005.htm for a 1988 paper by the head of the Cray XMP design team.

There was a downside to this: the performance of the first Cray C compilers (and hence of the first implementation of the Unix-like Cray operating system, UNICOS) was abysmal, since the hardware had no native character-at-a-time instructions, and there was little computer science theory available for vectorizing idiomatic C-style loops, with their relatively unstructured combination of pointer manipulation and logical tests, compared with Fortran's more rigidly structured DO loops.


57

Simply yes.

And not just a few instructions, but whole CPUs have been developed with languages in mind. Perhaps the most prominent is Intel's 8086. Even the base CPU was designed to support the way high-level languages handle memory management, especially stack allocation and usage. With BP, a separate register for stack frames and addressing was added, in conjunction with short encodings for stack-related addressing, to make HLL programs perform well. The 80186/286 went further in this direction by adding ENTER/LEAVE instructions for stack frame handling.

While it can be said that the base 8086 was geared more toward languages like Pascal or PL/M (*1,2), later incarnations added many ways to support the now prevalent C primitives - not least scaling factors for indices.


Since many answers are now piling up various details of CPUs where instructions may match up (or not), there are perhaps two other CPUs worth mentioning: the Pascal Microengine and Rockwell's 65C19 (as well as the RTX2000).

Pascal Microengine, a CPU made for generic HLL implementations

The Pascal Microengine was a WD MCP1600 chipset (*3) based implementation of the virtual 16-bit UCSD p-code processor. Contrary to what the name suggests, it wasn't tied to Pascal as a language, but was a generic stack machine tailored to support the concepts behind HLL operation. Besides rather simple, stack-based execution, the most important part was far-reaching and convenient management of local memory structures for functions and function linkage, as well as data. Modern-day Java bytecode and its interpretation by a native bytecode CPU (e.g. PicoJava) is in no way a new idea (*4).
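To illustrate the execution model, here is a toy stack-machine interpreter in C. The opcodes are invented for this sketch and are not actual UCSD p-codes; the point is only that a p-code CPU like the Microengine implements this kind of fetch-and-dispatch loop directly in hardware/microcode instead of in software.

#include <stdio.h>

/* Toy stack machine: push constants, operate on the top of stack. */
enum { PUSH, ADD, MUL, PRINT, HALT };

int main(void)
{
    int code[] = { PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT };
    int stack[16], sp = 0, pc = 0;

    for (;;) {
        switch (code[pc++]) {
        case PUSH:  stack[sp++] = code[pc++];           break;
        case ADD:   sp--; stack[sp - 1] += stack[sp];   break;
        case MUL:   sp--; stack[sp - 1] *= stack[sp];   break;
        case PRINT: printf("%d\n", stack[--sp]);        break;  /* prints 20 */
        case HALT:  return 0;
        }
    }
}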

R65C19 and N4000, CPUs enhanced or custom-made for a specific language

The Rockwell R65C19 is a 6500 variant with added support for Forth. Its 10 new threaded code instructions (*5) implemented the core functions (like Next) of a Forth system as single machine instructions.

Forth as a language was developed with a keen eye on the way it is executed. It has more in common with assemblers than many other HLLs do (*6). So it's no surprise that, already in 1983, its inventor Charles Moore created a Forth CPU called the N4000 (*7).


*1 - Most remarkable here are the string instructions, which only make sense in languages supporting strings as a discrete data type.

*2 - Stephen Morse's 8086 Primer is still a good read - especially when he talks about the finer details. Similarly recommended is his 2008 interview about the creation of the 8086, where he describes his approach as mostly HLL-driven.

*3 - Which makes it basically an LSI-11 with different microcode.

*4 - As IT historians, we have seen each and every implementation already before, haven't we? So let's play a round of Zork.

*5 - There are other nice additions as well, like mathematical operations that ease filter programming - after all, the 65C19/29/39 series was the heart of many modems.

*6 - The distinction of assembler as not being an HLL, and miles apart from one, becomes quite blurry on close inspection anyway.

*7 - Later sold to Harris, who developed it into the RTX2000 series - with radiation hardened versions that power several deep space probes.


9

I recall from back in the 80's (and it's referenced in the Wikipedia article) that the Bellmac 32 CPU, which became the AT&T Western Electric WE32100, was supposedly designed for the C programming language.

This CPU was used by AT&T in their 3B2 line of Unix systems. There was also a single board VME bus version of it that some third parties used. Zilog also came out with a line of Unix systems using this chip - I think they were a second source for it for AT&T.

I did a lot of work with these in the 80’s and probably early 90’s. It was pretty much a dog in terms of performance, if I remember.


16

Some manufacturers have directly admitted as much

The first page of the Intel 8086 data sheet lists the processor's features, which include

  • Architecture Designed for Powerful Assembly Language and Efficient High Level Languages

In particular, C and other high-level languages use the stack for arguments and local variables. The 8086 has both a stack pointer (SP) and a frame pointer (BP) which address memory using the stack segment (SS) rather than other segments (CS, DS, ES).
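As a concrete (and compiler-dependent) illustration, here is roughly how a 16-bit C compiler for the 8086 would lay out such a frame. The offsets assume the usual small-model cdecl convention with a near call and are shown only as comments:

/* Hypothetical frame layout after the prologue PUSH BP / MOV BP,SP:
 *   [BP+0] saved BP, [BP+2] return address,
 *   [BP+4] a, [BP+6] b, [BP+8] c, [BP-2] local t.
 * All of these accesses implicitly use SS, because they go through BP. */
int sum3(int a, int b, int c)
{
    int t = a + b;
    return t + c;
}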

The datasheet for the 8087 co-processor has the following section:

PROGRAMMING LANGUAGE SUPPORT

Programs for the 8087 can be written in Intel's high-level languages for 8086/8088 and 80186/80188 systems; ASM-86 (the 8086, 8088 assembly language), PL/M-86, FORTRAN-86, and PASCAL-86.

The 80286 added several instructions to the architecture to aid high-level languages. PUSHA, POPA, ENTER, and LEAVE help with subroutine calls. The BOUND instruction was useful for array bounds checking and switch-style control statements. Other instructions unrelated to high-level languages were added as well.

The 80386 added bitfield instructions, which are used in C.
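For example, a C bit-field declaration like the following compiles down to the shift-and-mask sequences that such bit-manipulation instructions were intended to accelerate (a generic sketch, not tied to a particular compiler):

#include <stdio.h>

/* Several small values packed into one word; accessing one member is a
 * read-modify-write on a few bits. */
struct flags {
    unsigned int mode     : 3;
    unsigned int enabled  : 1;
    unsigned int priority : 4;
};

int main(void)
{
    struct flags f = { .mode = 5, .enabled = 1, .priority = 9 };
    f.priority = 12;                       /* update 4 bits in place */
    printf("%u %u %u\n", f.mode, f.enabled, f.priority);
    return 0;
}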


The Motorola MC68000 Microprocessor User's Manual states:

2.2.2 Structured Modular Programming

[...] The availability of advanced, structured assemblers and block-structured high-level languages such as Pascal simplifies modular programming. Such concepts are virtually useless, however, unless parameters are easily transferred between and within software modules that operate on a re-entrant and recursive basis. [...] The MC68000 provides architectural features that allow efficient re-entrant modular programming. Two complementary instructions, link and allocate (LINK) and unlink (UNLK), reduce subroutine call overhead by manipulating linked lists of data areas on the stack. The move multiple register instruction (MOVEM) also reduces subroutine call programming overhead. [...] Other instructions that support modern structured programming techniques are push effective address (PEA), load effective address (LEA), return and restore (RTR), return from exception (RTE), jump to subroutine (JSR), branch to subroutine (BSR), and return from subroutine (RTS).

The 68020 added bitfield instructions, which are used in C.


Whereas the above processors added instructions to support programming languages, Reduced Instruction-Set Computers (RISC) took the opposite approach. By analyzing which instructions compilers actually used, designers were able to discard many complex instructions that went unused. This allowed the architecture to be simplified, the instruction cycle to be shortened, and most instructions to execute in a single cycle, speeding up processors significantly.


44

One interesting example of programming languages driving hardware development is the LISP machine. Since "normal" computers of the time period couldn't execute Lisp code efficiently, and there was a high demand for the language in academia and research, dedicated hardware was built with the sole purpose of executing Lisp code. Although Lisp machines were initially developed for MIT's AI lab, they also saw success in computer animation.

These computers provided increased speed by using a stack machine instead of the typical register-based design, and had native support for type checking of Lisp's data types. Some other important hardware features aided garbage collection and closures. Here's a series of slides that goes into more detail on the design: Architecture of Lisp Machines (PDF) (archive).

The architecture of these computers is specialized enough that, in order to run C code, the C source is transpiled into Lisp and then run normally. The Vacietis compiler is an example of such a system.


13

Over time, there have been various language-specific CPUs, some so dedicated that it would be awkward to use them for a different language. For example, the Harris RTX-2000 was designed to run Forth. One could almost say it and other Forth CPUs were the language in hardware form. I'm not saying they understand the language, but they are designed to execute it at the "assembler" level.

Early on, Forth was known for being extremely memory efficient, fast, and, for programmers who could think bass-ackwards, quick to develop in. Having a CPU that could execute it almost directly was a no-brainer. However, memory got cheaper, CPUs got quicker, and programmers who liked thinking in Forth got scarcer. (How many folks still use calculators in reverse Polish notation mode?)


7

Another interesting example are Java processors, CPUs that execute (a subset of) Java Virtual Machine bytecode as their instruction sets.

If you’re interested enough to ask this question, you might want to read one of the later editions of Andrew Tanenbaum’s textbook, Structured Computer Organization¹, in which he walks the reader step-by-step through the design of a simple Java processor.

¹ Apparently not the third edition or below. It’s in Chapter 4 of the fifth edition.


8

Yes, definitely. A very good example is how Motorola moved from the 68k architecture to the (somewhat) compatible ColdFire range of CPUs. (It is also an example of how such an evolution might go wrong, but that is another story).

The Motorola ColdFire range of CPUs and microcontrollers was basically a 68000 CPU32 core with lots of instructions and addressing modes removed that "normal" C compilers wouldn't use frequently (like arithmetic instructions on byte and word operands, some complex addressing modes, and addressing modes that act only on memory rather than registers, ...). They also simplified the supervisor-mode model and removed some rarely used instructions completely. The whole instruction set was "optimized for C and C++ compilers" (this is how Motorola put it), and the freed-up chip space was used to tune the CPUs for performance (for example by adding larger data and instruction caches).

In the end, the changes made the CPUs too incompatible for customers to stay within the product family, and the MC68k range of CPUs headed towards its demise.


12

Null-terminated strings

When C was invented, many different string formats were in use at the same time. String operations were probably handled mostly in software, so people could use whatever format they wanted. The null-terminated string wasn't a new idea; however, such special hardware support as existed was not necessarily aimed at it.

But later, due to the dominance of C, other platforms began adding instructions that accelerate the null-terminated format:

This had some influence on CPU instruction set design. Some CPUs in the 1970s and 1980s, such as the Zilog Z80 and the DEC VAX, had dedicated instructions for handling length-prefixed strings. However, as the NUL-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the ES/9000 520 in 1992.

https://en.wikipedia.org/wiki/Null-terminated_string#History

On x86, Intel introduced several instructions for text processing in SSE4.2, which operate in parallel up to the first null terminator. Before that there was SCAS, the scan-string instruction, which can be used to find the position of the terminating character:

mov ecx, 100                ; search up to 100 characters
xor eax, eax                ; search for 0
mov edi, offset string      ; search this string
repne scas byte ptr [edi]   ; scan bytes looking for 0 (find end of string)
jnz toolong                 ; not found
sub edi, (offset string) + 1 ; calculate length

https://devblogs.microsoft.com/oldnewthing/20190130-00/?p=100825
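For comparison, the C-level loop that all of this hardware support accelerates is just a linear scan for the terminator. A minimal sketch of strlen:

#include <stdio.h>
#include <stddef.h>

/* Walk the bytes until the 0 terminator; this linear scan is what
 * SCAS/REPNE and later the SSE4.2 string instructions speed up. */
static size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

int main(void)
{
    printf("%zu\n", my_strlen("hello"));   /* 5 */
    return 0;
}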

We all know that nowadays it's considered a bad idea. Unfortunately it was baked into C, hence is used by every modern platform and can't be changed anymore. Luckily we have std::string in C++.


The use of a string with a termination character seems to have already existed on the PDP-7, where the user could choose the terminating character:

The string is terminated by the second occurrence of the delimiting character chosen by the user

http://bitsavers.trailing-edge.com/pdf/dec/pdp7/PDP-7_AsmMan.pdf

However, a real null terminator can be seen in use on the PDP-8 (see the last line in the code block). The ASCIZ directive was then introduced in the assembly languages for the PDP-10/11:

Null-terminated strings were produced by the .ASCIZ directive of the PDP-11 assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.

The B language, which appeared in 1969 and became the precursor to C, might have been influenced by that; it uses a special character for termination, although I'm not sure which one was chosen:

In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e

Dennis M. Ritchie, Development of the C Language



10

Another specific example of hardware designed to match a language was the Rekursiv, which was designed to implement object-oriented language features in hardware.

Our Rekursiv has been preserved in a museum.

See https://en.wikipedia.org/wiki/Rekursiv


7

Yet another example: The Prime minicomputer had a segmented architecture, and anything in segment 07777 was a null pointer (Prime used octal, and the top 4 bits of the 16-bit word had other uses). Segment 0 was a kernel segment, and just loading 0 into a segment register in user code was an access violation. This would have been fine in properly written C (int* p = 0; stored bit pattern 7777/0 into p).

However, it turns out that a lot of C code assumes that if you memset a block of memory to all-bits-zero, any contained pointers will have been set to NULL. They eventually had to add a whole new instruction called TCNP (Test C Null Pointer).
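The problematic idiom looks something like this (a hedged sketch; the struct and names are made up for illustration):

#include <stdio.h>
#include <string.h>

/* The non-portable assumption: memset()ing a structure to all-bits-zero and
 * expecting its pointer members to compare equal to NULL.  On the Prime,
 * where the C null pointer was 07777/0 rather than all-zero bits, this
 * assumption breaks down - hence the TCNP instruction. */
struct node {
    int          value;
    struct node *next;
};

int main(void)
{
    struct node n;
    memset(&n, 0, sizeof n);                    /* all-bits-zero */
    printf("%s\n", n.next == NULL ? "null" : "not null");
    return 0;
}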


48

Interesting question with an interesting answer.

First let me get one thing out of the way:

One example from this answer mentions how C pointers were, at least in part, influenced by the design of the PDP-11

It's a myth to suggest C's design is based on the PDP-11. People often quote, for example, the increment and decrement operators because they have an analogue in the PDP-11 instruction set. This is, however, a coincidence. Those operators were invented before the language was ported to the PDP-11.

There are actually two answers to this question

  • processors that are targeted to a specific high level language
  • processors that include features that a high level language might find useful.

In the former category we have most of the interesting eventual dead ends in computer hardware history. Perhaps one of the earliest examples of a CPU architecture being targeted at a high level language is the Burroughs B5000 and its successors. This is a family of machines targeted at Algol. In fact, there wasn't really a machine language as such that you could program in.

The B5000 had a lot of hardware features designed to support the implementation of Algol. It had a hardware stack and all data manipulations were performed on the stack. It used tagged descriptors for data so the CPU had some idea of what it was dealing with. It had a series of registers called display registers that were used to model static scope* efficiently.

Other examples of machines targeted at specific languages include the Lisp machine already mentioned, arguably the Cray series of supercomputers for Fortran - or even just Fortran loops, the ICL 2900 series (also block structured high level languages), some machines targeted at the Java virtual machine (some ARM processors have hardware JVM support) and many others.

One of the drivers behind creating RISC architectures was the observation that compilers tended to use only a small subset of the available combinations of instructions and addressing modes available on most CPU architectures, so RISC designers ditched the unused ones and filled the space previously used for complex decoding logic with more registers.

In the second category, we have individual features in processors targeted at high level languages. For example, the hardware stack is a useful feature for an assembly language programmer, but more or less essential for any language that allows recursive function calls. The processor may build features on top of that; for example, many CPUs have an instruction to create a stack frame (the data structure on the stack that represents a function's parameters and local variables).
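A minimal illustration of why: each active call of a recursive function needs its own copy of its parameters and locals, which is exactly what a per-call stack frame provides:

#include <stdio.h>

/* Every recursive call gets its own frame holding `n` and the return
 * address; a hardware stack (and a frame-creation instruction) makes
 * this cheap. */
static unsigned long factorial(unsigned n)
{
    if (n <= 1)
        return 1;
    return n * factorial(n - 1);   /* nested frames live on the stack */
}

int main(void)
{
    printf("%lu\n", factorial(10));   /* 3628800 */
    return 0;
}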

*Algol allowed you to declare functions inside other functions. Static scope reflects the way functions were nested in the program source: an inner function could access the variables and functions defined in itself, in the scope in which it was defined, in the scope in which that scope was defined, and so on all the way up to global scope.
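To make the footnote concrete, here is an uplevel-reference example written with gcc's nested-function extension (standard C has no nested procedures, so this is only an approximation of the Algol situation). Each reference from the inner function to the enclosing locals must locate the frame of the textually enclosing procedure, which is the lookup the Burroughs display registers performed in hardware:

#include <stdio.h>

static int sum_to(int n)
{
    int total = 0;

    void add(int i)          /* gcc extension: a nested function */
    {
        total += i;          /* uplevel reference to sum_to's local */
    }

    for (int i = 1; i <= n; i++)
        add(i);
    return total;
}

int main(void)
{
    printf("%d\n", sum_to(10));  /* 55 */
    return 0;
}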


12

Arguably, VLIW architectures were designed mainly for smart compilers. They rely on efficient building of individual very complex instructions (a single "instruction" can do many things at the same time), and while it's not impossible to write the code manually, the idea was that you could get better performance for your applications by using a better compiler, rather than having to upgrade your CPU.

In principle, the difference between e.g. an x86 superscalar CPU and something like SHARC or i860 is that x86 achieves instruction-level parallelism at runtime, while SHARC is a very simple CPU design (comparatively) that relies on the compiler. In both cases, there are many tricks to reorder instructions, rename registers etc. to allow multiple instructions to run at the same time, while still appearing to execute them sequentially. The VLIW approach would in theory be especially handy for platforms like JVM or .NET, which use a just-in-time compiler - every update to .NET or the JVM could make all your applications faster by allowing better optimizations. And of course, during compilation, the compiler has a much better idea of what your whole application is trying to do, while the runtime approach only ever has a small subset to work with, and has to rely on techniques like statistical branch prediction.

In practice, the approach of having the CPU decide won out. This does make the CPUs incredibly complex, but it's a lot easier to just buy a new better CPU than to recompile or update all your applications; and frankly, it's a lot easier to sell a compatible CPU that just runs your applications faster :)


12

Some ARM CPUs used to have partial support for executing Java bytecode in hardware: Jazelle Direct Bytecode eXecution (DBX), https://en.wikipedia.org/wiki/Jazelle

With modern JITing JVMs, that became obsolete, so there was later a variant of Thumb2 mode (compact 16-bit instructions) called ThumbEE, designed as a JIT target for managed languages like Java and C#: https://en.wikipedia.org/wiki/ARM_architecture#Thumb_Execution_Environment_(ThumbEE)

Apparently ThumbEE has automatic NULL-pointer checks, and an array-bounds instruction. But that was deprecated, too, in 2011.


9

Another example is the decline of binary-coded decimal instructions

In the past it was common for computers to be decimal or to have instructions for decimal operations. For example, x86 has AAM, AAD, AAA, FBLD... for operating on packed, unpacked and 10-byte BCD values. Many other classic architectures have similar features:

Several microprocessor families offer limited decimal support. For example, the 80x86 family of microprocessors provide instructions to convert one-byte BCD numbers (packed and unpacked) to binary format before or after arithmetic operations.[3] These operations were not extended to wider formats and hence are now slower than using 32-bit or wider BCD 'tricks' to compute in BCD (see [1]). The x87 FPU has instructions to convert 10-byte (18 decimal digits) packed decimal data, although it then operates on them as floating-point numbers.

The Motorola 68000 provided instructions for BCD addition and subtraction;[4] as does the 6502. In the much later 68000 family-derived processors, these instructions were removed when the Coldfire instruction set was defined, and all IBM mainframes also provide BCD integer arithmetic in hardware. The Zilog Z80, Motorola 6800 and its derivatives, together with other 8-bit processors, and also the Intel x86 family have special instructions that support conversion to and from BCD. The Psion Organiser I handheld computer’s manufacturer-supplied software implemented its floating point operations in software using BCD entirely. All later Psion models used binary only, rather than BCD.

https://en.wikipedia.org/wiki/Decimal_computer#More_modern_computers
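To show what those instructions were doing, here is a plain-C sketch of adding two packed-BCD bytes; the digit-wise "decimal adjust" step is what the x86 adjust instructions mentioned above and the 6502's decimal mode performed automatically (the helper function is invented for illustration):

#include <stdio.h>
#include <stdint.h>

/* Add two packed-BCD bytes (two decimal digits each). */
static uint8_t bcd_add(uint8_t a, uint8_t b, int *carry)
{
    unsigned lo = (a & 0x0F) + (b & 0x0F);
    unsigned hi = (a >> 4) + (b >> 4);
    if (lo > 9) { lo -= 10; hi += 1; }    /* decimal adjust, low digit  */
    *carry = 0;
    if (hi > 9) { hi -= 10; *carry = 1; } /* decimal adjust, high digit */
    return (uint8_t)((hi << 4) | lo);
}

int main(void)
{
    int carry;
    uint8_t sum = bcd_add(0x38, 0x47, &carry);    /* 38 + 47 = 85 */
    printf("%02X carry=%d\n", sum, carry);        /* prints 85 carry=0 */
    return 0;
}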

However, they're rarely used, since modern languages often don't have a way to access those instructions. They either lack a decimal integer type completely (like C or Pascal), or don't have a decimal type that maps cleanly to BCD instructions.

The result is that BCD instructions started to disappear. On x86 they're micro-coded, and therefore very slow, which makes people avoid them even more. Later, AMD removed the BCD instructions in x86-64. Other manufacturers did the same in newer versions of their architectures. Having said that, a remnant of BCD operations is still there in the FLAGS register in x86-64 and many other platforms that use flags: the half-carry flag. Newly designed architectures like ARM, MIPS, SPARC and RISC-V also didn't get any BCD instructions, and most of them don't use a flags register.

In fact, C and C++ allow float, double and long double to be decimal; however, no implementation uses a decimal radix for the default floating-point types, because modern computers are binary and are bad at decimal operations. Very few architectures have decimal floating-point support.

Many C and C++ compilers do have decimal floating-point types as an extension, such as gcc's _Decimal32, _Decimal64, and _Decimal128. Similarly, some other modern languages also have decimal types; however, those are mostly wide floating-point types for financial or scientific problems, not an integer BCD type. For example, decimal in C# is a floating-point type with the mantissa stored in binary, so BCD instructions would be of no help there. Arbitrary-precision types like BigInteger in C# and BigDecimal in Ruby or Java also store the mantissa in binary instead of decimal, for performance. A few languages do have a fixed-point decimal monetary type, but the significand is also stored in binary.
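As a small, hedged example of the gcc extension mentioned above (it assumes a gcc build with decimal floating-point support, e.g. on x86-64 Linux), decimal types avoid the classic binary rounding surprise:

#include <stdio.h>

int main(void)
{
    _Decimal64 a = 0.1DD, b = 0.2DD, c = 0.3DD;   /* all exact in decimal */
    double     x = 0.1,   y = 0.2,   z = 0.3;     /* none exact in binary */

    printf("decimal: %s\n", (a + b == c) ? "equal" : "not equal"); /* equal */
    printf("binary : %s\n", (x + y == z) ? "equal" : "not equal"); /* not equal */
    return 0;
}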

That said, a few floating-point formats can still be stored in BCD or a related form. For example, the mantissa in IEEE-754 decimal floating-point types can be stored in either binary or DPD (a densely packed decimal format which can be converted to BCD easily). However, I doubt that decimal IEEE-754 libraries use BCD instructions, because those instructions often don't exist at all on modern computers, and where they do exist they'd be extremely slow.

BCD was used in many early decimal computers, and is implemented in the instruction set of machines such as the IBM System/360 series and its descendants, Digital Equipment Corporation's VAX, the Burroughs B1700, and the Motorola 68000-series processors. Although BCD per se is not as widely used as in the past and is no longer implemented in newer computers' instruction sets (such as ARM; x86 does not support its BCD instructions in long mode any more), decimal fixed-point and floating-point formats are still important and continue to be used in financial, commercial, and industrial computing, where subtle conversion and fractional rounding errors that are inherent in floating point binary representations cannot be tolerated.

https://en.wikipedia.org/wiki/Binary-coded_decimal#Other_computers_and_BCD


2

Yes. More recently, the TPUs (Tensor Processing Units) designed by Google to accelerate AI work are built to efficiently process their TensorFlow language.


5

Some more examples of programming languages affecting hardware design:

The MIPS RISC ISA often seems strange to newcomers: instructions like ADD generate exceptions on signed integer overflow. It is necessary to use ADDU, add unsigned, to get the usual 2’s complement wrap.

Now, I wasn’t there at the time, but I conjecture that MIPS provided this behavior because it was designed with the Stanford benchmarks - which were originally written in the Pascal programming language, which requires overflow detection.

The C programming language does not require overflow traps. The new MIPSr6 ISA (circa 2012) gets rid of the integer overflow trapping instructions - at least those with a 16-bit immediate - in order to free up opcode space. I was there when this was done.
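To make the language difference concrete: C compilers can map int addition straight to the non-trapping ADDU, while Pascal-style semantics effectively require a check like the hand-written one below, which a trapping ADD provides for free (a sketch; the helper name is made up):

#include <limits.h>
#include <stdio.h>

/* What overflow-checked signed addition costs without a trapping ADD. */
static int checked_add(int a, int b, int *overflow)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        *overflow = 1;
        return 0;
    }
    *overflow = 0;
    return a + b;
}

int main(void)
{
    int ovf;
    int r = checked_add(INT_MAX, 1, &ovf);
    printf("result=%d overflow=%d\n", r, ovf);   /* overflow=1 */
    return 0;
}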

I can testify that programming languages influenced many modern x86 features, at least from P6 onwards. Mainly C/C++; to some extent JavaScript.


2

Multi-operand IMUL on x86

Not quite sure if it was entirely driven by programming languages, but I believe they must have had a strong influence on Intel's decision.

Originally there were only IMUL r/m16 and IMUL r/m8, which produce a product twice as wide as the operands. In reality, however, modern high-level languages generally produce a multiplication result that has the same type as the two operands, unless you cast them to a wider type. For example, in C, if we have int a and int b then a*b will also have type int. The same is true of most other languages. A non-widening multiplication is also faster than computing the full result, so Intel added the 2- and 3-operand forms that don't calculate the high bits:

There are two additional forms for the IMUL instruction which do not fit the above pattern. The first is a two-operand version that follows the pattern for ADD:

 IMUL    r, r/m      ; d *= s (signed)

This is a more traditional-looking two-operand instruction² that updates the destination register in place.

There is even a (gasp) three-operand version similar to what you see in other processors.

 IMUL  r, r/m, i     ; d = s * t (signed)

This three-operand version accepts an immediate as the third operand, and it's the one the compiler typically generates. For example,

 IMUL  EAX, ECX, 212 ; EAX = ECX * 212 (signed)

These additional forms produce only single-precision results, but that's what the C and C++ languages produce, so it fits well with those languages. If you need a double-precision result, then you can use the single-operand MUL and IMUL instructions.

Note that there is no unsigned version of these additional forms. Fortunately, you can use the signed version for unsigned multiplication because the single-precision result is the same for both signed and unsigned multiplication. However, the flags are always set according to the signed result, so you cannot use them to detect unsigned overflow.

In practice, this is not a problem because the C language doesn't give you access to the overflow flags anyway.

The Intel 80386, part 4: Arithmetic

Since they're more flexible, fit perfectly with the type model, and don't care about signedness (because non-widening multiplication is the same for signed and unsigned values), compilers began to use these forms almost exclusively for multiplication. As a result, Intel focused on optimizing those forms of IMUL even further. In fact, nowadays the single-operand MUL and IMUL are very rarely used and may be slow on many microarchitectures. See Why is imul used for multiplying unsigned numbers?
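A small C sketch of the distinction (names and values are arbitrary): the compiler only needs a widening multiply when the programmer explicitly asks for one by casting, and the non-widening low half is identical for signed and unsigned operands:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 100000u, b = 100000u;

    /* Non-widening: result has the operands' type, so only the low 32 bits
     * of the product are kept - the same bits for signed and unsigned. */
    uint32_t narrow = a * b;            /* 1410065408 (low 32 bits) */

    /* Widening: an explicit cast asks for the full product, which is where
     * the older single-operand widening MUL/IMUL still matter. */
    uint64_t wide = (uint64_t)a * b;    /* 10000000000 */

    printf("narrow = %u\n", narrow);
    printf("wide   = %llu\n", (unsigned long long)wide);
    return 0;
}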


3

Not yet retro (as of 2019), but on topic: the ARM Cortex-M design was influenced by the C language (may also apply to Cortex-A and -R, but I don't have experience with them).

  1. No assembly code is needed for startup. The hardware sets the initial stack pointer at reset and then the compiled code can execute without any "magic" (some exceptions to this may exist depending on the exact hardware).
  2. Interrupt handlers are just regular functions from the compiler's point of view. They only require specific names, so that the linker can put their vectors in the right place for the NVIC (interrupt controller) to find them. The hardware takes care of stacking and unstacking registers on ISR entry/exit. Compare that with other architectures (e.g. AVR, MSP430, older ARM), where special compiler attributes are needed for interrupt handlers and the compiler has to emit extra code to handle entry/exit. A minimal sketch in C follows the list.
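A minimal sketch of such a startup in plain C, assuming a GNU-style toolchain. The symbol and section names (_estack, .isr_vector, Reset_Handler, SysTick_Handler) follow common vendor conventions rather than anything mandated by the hardware, and the linker script is assumed to place the vector table where the core boots from:

#include <stdint.h>

extern uint32_t _estack;            /* top of RAM, defined in the linker script */
extern int main(void);

void Reset_Handler(void);           /* ordinary C functions */
void SysTick_Handler(void);

/* The core reads entry 0 as the initial SP and entry 1 as the reset vector,
 * so no startup assembly is needed to set up a stack before C code runs.
 * A full table continues with the other exception vectors (SysTick_Handler
 * would sit in slot 15). */
__attribute__((section(".isr_vector")))
void (* const vector_table[])(void) = {
    (void (*)(void))&_estack,       /* initial stack pointer */
    Reset_Handler,                  /* reset vector */
};

volatile uint32_t ticks;

void SysTick_Handler(void)          /* plain function: the hardware stacks */
{                                   /* and restores registers around it    */
    ticks++;
}

void Reset_Handler(void)
{
    /* a real startup would copy .data and zero .bss here */
    main();
    for (;;) { }
}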