The FR-V thread-local storage ABI

			   Alexandre Oliva
			    Aldy Hernandez
			     Version 0.22
			      2004-12-10
				   
Introduction
============

This document describes the extensions to the FRV-FDPIC ABI for
supporting thread-local storage (TLS).  It is meant to be read as an
extension to [2] and it follows the format and terminology used in
[1].


Run-Time Handling of TLS
========================

Register GR29 is reserved as the register to be used as the thread
context pointer.  This requires the kernel to actually preserve the
value of this register.

GR29 cannot be used for any other purpose throughout the program, even
if the program itself does not use thread-local storage.

The TLS data structures on FRV are specified according to variant I,
with the following exceptions:

- The thread context pointer is biased: it points 2048 bytes past the
beginning of the TCB, such that the pointer to the DTV may be loaded
from the TCB with a single instruction, while (almost) doubling
the range directly addressable with the 12-bit signed literal offset
available in load and store instructions, that executables can take
advantage of to access local variables in TLS;

- Pointers to modules' TLS areas are all biased by +2032 bytes such
that, in the main executable, the thread pointer can be used directly
as the biased pointer to its TLS area; and

- The contents of DTV (Dynamic Thread Vector) entries are unspecified;
they may contain offsets from the thread pointer, instead of absolute
addresses.  It is also unspecified whether they are biased.  Since the
DTV cannot be directly accessed by ABI-specified mechanisms (the
location of the pointer to it in the TCB is left unspecified), its
internal format, and even its very existence, are to be regarded as
internal implementation details, subject to change without notice.

The 16 bytes starting at gr29-2048 are reserved for use by the TLS
implementation, so TLS data for the main executable starts at
gr29-2032.  gr29 must always be aligned to at least a 16-byte
boundary.  If the TLS section in the main executable requires
additional alignment, it is the gr29-2032 address that will be
arranged to satisfy the alignment requirements.  gr29-2048 will
therefore also be aligned to a 16-byte boundary.

The TLS implementation may reserve additional space below gr29-2048,
for example, to hold the data structure that represents the thread
that uses the thread control block.  Should such data structures
require alignment stricter than 16 bytes, gr29-2032 may end up getting
more strictly aligned than hereby specified, such that the offset
between gr29 and this data structure, whose size and alignment are
known to the thread library, is constant.
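
The layout described above can be restated as plain address
arithmetic.  The following C fragment is only an illustrative sketch:
the function and constant names are made up for this example and are
not part of the ABI; only the numeric values come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Constants taken from the text above; the names are hypothetical. */
enum {
    FRV_TP_BIAS      = 2048, /* gr29 points this far past the TCB start */
    FRV_TCB_RESERVED = 16,   /* bytes at gr29-2048 reserved for the TLS
                                implementation */
    FRV_TLS_BIAS     = 2032  /* bias of pointers to module TLS areas */
};

/* Start of the TCB, given the biased thread context pointer (gr29). */
uintptr_t frv_tcb_start(uintptr_t gr29)
{
    assert(gr29 % 16 == 0);  /* gr29 is at least 16-byte aligned */
    return gr29 - FRV_TP_BIAS;
}

/* First byte of the TLS data of the main executable (gr29-2032). */
uintptr_t frv_exec_tls_start(uintptr_t gr29)
{
    return frv_tcb_start(gr29) + FRV_TCB_RESERVED;
}

/* Biased pointer to the TLS area of the main executable. */
uintptr_t frv_exec_tls_biased(uintptr_t gr29)
{
    return frv_exec_tls_start(gr29) + FRV_TLS_BIAS;
}

/* Bytes of the executable's TLS area reachable from gr29 with a
   12-bit signed literal offset, i.e. offsets -2032 through 2047. */
unsigned frv_exec_tls_direct_bytes(void)
{
    return (unsigned)(2047 + FRV_TLS_BIAS + 1);
}
```

In this model the biased pointer to the executable's TLS area is gr29
itself, and the directly addressable range is a little under twice
the 2048 bytes that an unbiased pointer would reach with non-negative
offsets.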


<tls_get_offset> calling conventions
------------------------------------

On most ports, the __tls_get_addr() function is defined as part of the
ABI, and explicitly called to determine the address of thread-local
variables.  The function traditionally obtains from its
arguments the following information: a module ID and an offset.  This
function typically returns the address computed by adding the offset
to the TLS block allocated for the module within the running thread.

Specific features of the FR-V architecture, such as register+register
addressing modes, and of the FDPIC ABI, such as the use of function
descriptors, enable significant optimizations for the most common
cases in which this function used to be used, if we introduce
alternate entry points and specialized calling conventions.
Therefore, __tls_get_addr() is not defined as part of this ABI.  Its
presence is not required, and programs are not supposed to call it.
Different, more efficient mechanisms are introduced to implement
similar functionality.


Consider that in the general dynamic TLS access model, a function call
is necessary to obtain the address of a thread-local variable.
However, in most cases, the variable happens to be in the portion of
memory that is allocated at the time of creation of every thread, at a
constant offset from the thread context pointer (static TLS).  Even in
cases that require dynamic allocation of the TLS for a module, if the
variable is accessed often enough, we can avoid a significant fraction
of the function call overhead, namely the need to spill all registers
not preserved across function calls, by adopting custom calling
conventions that preserve many more registers.

Another advantage of the custom calling conventions is that they
almost remove the penalty for generating code for the general dynamic
access model, where another access model that doesn't require calls
would have been enough, even more so if the linker performs
relaxations that actually remove the call.

The calling conventions and expected behavior are implemented using an
alternate, unnamed pair of entry points, one used for the static TLS
case, one used for the dynamic TLS case.  The actual entry points are
defined internally to the dynamic loader or, in the case of static
executables, the first entry point is defined by the linker, and the
second is never used.  For clarity purposes, we are going to refer to
these entry points as <tls_get_offset>, with angle brackets being used
to imply that it is not an actual symbol name.

The idea is that the two entry points are interchangeable as far as
the compiler and the linker are concerned: when the linker can't tell
for sure whether the referenced symbol is in static or dynamic TLS,
it will be up to the dynamic loader to select the most appropriate
entry point.  For this reason, it is essential to define a common ABI
for <tls_get_offset> implementations.


<tls_get_offset> inputs:

GR9:

  The idea is that the GR8,GR9 register pair is loaded from a TLS
  descriptor from the GOT.  GR8 is expected to be the selected entry
  point for <tls_get_offset>, but the value is not used by
  <tls_get_offset> itself, so it's legitimate to not set it; GR9, on
  the other hand, holds additional information needed by the entry
  point.  In the static TLS case, it's the TLS offset of the
  referenced variable.  In the dynamic TLS case, it's a pointer to a
  data structure allocated by the dynamic loader holding information
  such as the GOT address to be used by the dynamic loader, the module
  ID of the TLS variable, its TLS module offset, and any other
  information that the implementation of the dynamic <tls_get_offset>
  function might need.

GR15:

  The GOT pointer for the module containing the TLS descriptor from
  which the <tls_get_offset> address was loaded.  It can often be used
  if the instruction that calls <tls_get_offset> is optimized into a
  load, but the <tls_get_offset> function itself might rely on its
  value.

GR29:

  Biased pointer to the TCB for the current thread.


<tls_get_offset> outputs:

GR9:

  Offset from GR29 to the address returned by __tls_get_addr(), or
  that would be returned by it should it have been called.

GR8:

  May be modified freely.

All other registers shall be preserved by functions that follow the
<tls_get_offset> ABI.


<tls_get_offset> behavior:

The static <tls_get_offset> just returns, since the correct value 
was already loaded into GR9 by its caller.

The behavior of the dynamic <tls_get_offset> is described with the
following pseudo-code:

If the requested module has already been resolved for the current
  thread:
	Set GR9 to the DTV entry for the module, plus the offset for
	the variable;
	Return.
Save registers;
Load the arguments for __tls_get_addr() and call it;
Set GR9 to the difference between the returned value and the thread
  context pointer;
Restore registers (other than GR9 and GR8);
Return.
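
The behavior of the two entry points can be modeled in C, as a rough
sketch.  Everything named below is hypothetical: the ABI leaves the
DTV format and the layout of the data structure referenced by GR9
unspecified, so struct tls_index, the dtv and resolved arrays, and
model_tls_get_addr() are assumptions made for this example only.
Register saving and restoring is not modeled, and the register
inputs and outputs are represented as ordinary arguments and return
values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MODULES 4   /* arbitrary size for this model */

/* Hypothetical data structure that GR9 points to in the dynamic case. */
struct tls_index {
    size_t   module;    /* module ID */
    intptr_t offset;    /* TLS module offset of the variable */
};

/* Model of a TLS descriptor: the two words ldd loads into GR8/GR9. */
struct tls_desc {
    intptr_t (*entry)(intptr_t gr9, uintptr_t gr29);
    intptr_t arg;
};

/* Toy state for a single thread; this DTV stores offsets from gr29. */
static int           resolved[MAX_MODULES];
static intptr_t      dtv[MAX_MODULES];
static unsigned char blocks[MAX_MODULES][64];
static int           slow_calls;

/* Stand-in for the slow path that allocates the module's TLS block. */
static uintptr_t model_tls_get_addr(const struct tls_index *ti,
                                    uintptr_t gr29)
{
    uintptr_t base = (uintptr_t)blocks[ti->module];

    slow_calls++;
    dtv[ti->module] = (intptr_t)(base - gr29);
    resolved[ti->module] = 1;
    return base + (uintptr_t)ti->offset;
}

/* Static entry point: GR9 already holds the offset, so just return. */
intptr_t tls_get_offset_static(intptr_t gr9, uintptr_t gr29)
{
    (void)gr29;
    return gr9;
}

/* Dynamic entry point: GR9 points to a struct tls_index. */
intptr_t tls_get_offset_dynamic(intptr_t gr9, uintptr_t gr29)
{
    const struct tls_index *ti = (const struct tls_index *)gr9;

    assert(ti->module < MAX_MODULES);
    if (resolved[ti->module])                     /* fast path */
        return dtv[ti->module] + ti->offset;
    /* Slow path: result is the returned address minus gr29. */
    return (intptr_t)(model_tls_get_addr(ti, gr29) - gr29);
}
```

Both functions have the same signature, which mirrors the requirement
that the entry points be interchangeable: a caller that loads a
struct tls_desc and calls its entry member with its arg member does
not need to know which of the two it got.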


Additional implementations of <tls_get_offset> may be introduced, as
long as they do not make assumptions on inputs not warranted by the
ABI specification, and provide the outputs strictly as specified
above.  The current specification is intended to make room for lazy
resolution of TLS descriptor relocations, but this feature is not
defined in the current specification.


TLS Access Models
-----------------

There are two main ways to access thread-local storage, the dynamic
and the static model.  All other models described here fall into these
two categories.

The FRV provides four models, all derived from the aforementioned two
models: general dynamic, local dynamic, initial exec, local exec.
Different models are used to provide as much performance as possible.

General Dynamic TLS Model
-------------------------

The general dynamic TLS model can be used everywhere.  Compilers will
generate code with this model by default and only use a more
restrictive model when it is more efficient or when told explicitly to
use another model.

The generated code for this model does not assume that the module
number or variable offset is known at link or compile time.  The values
for the module ID and the TLS block offset are determined by the
dynamic linker at run-time, and passed to the <tls_get_offset>
function through a pointer in GR9.  Upon return, the <tls_get_offset>
function returns the offset from the thread context pointer to the
variable for the current thread in GR9.

It is desirable to avoid this model whenever possible, as it is the
slowest.

In the following code fragments, the code shown is determining the
address of a thread-local variable x:

	extern __thread int x;

	&x;

General form
------------

Here, we call a (pseudo-)function to determine the offset for the
variable.  As before every call, the GR15 register must hold the GOT
pointer for the current module; unlike other calls, GR15 must be set
prior to the call instruction (i.e., not in the same VLIW pack).
Unlike other calls, GR15 and most other registers are preserved; only
GR8, GR9 can be modified by the callee without being preserved, and LR
is modified by the call instruction itself.

	call	#gettlsoff(x)

The call instruction may be relaxed into a load instruction, so its
VLIW packing must enable such replacement.

The call instruction above requests the linker to generate the
following sequence of instructions in the PLT, and to adjust the call
such that this PLT entry is called:

    plt(gettlsoff(x)):
	sethi.p #gottlsdeschi(x), gr8
	setlo	#gottlsdesclo(x), gr8
	ldd	@(gr15, gr8), gr8
	jmpl	@(gr8, gr0)

Shorter versions of the PLT entry can be used, if the offsets are
known to fit in 16 bits:

    plt(gettlsoff(x)):
	setlos #gottlsdesclo(x), gr8
	ldd	@(gr15, gr8), gr8
	jmpl	@(gr8, gr0)

or even 12 bits:

    plt(gettlsoff(x)):
	lddi	@(gr15, #gottlsdesc12(x)), gr8
	jmpl	@(gr8, gr0)


Like PLT entries for regular function calls, the PLT entries above can
also be inlined into the caller, resulting in code sequences such as:

	sethi.p #gottlsdeschi(x), gr8
	setlo	#gottlsdesclo(x), gr8
	ldd	#tlsdesc(x)@(gr15, gr8), gr8
	calll	#gettlsoff(x)@(gr8, gr0)

The instructions can be freely scheduled, and they can use whatever
registers they like, as long as GR9 holds the value loaded from the
second word of the TLS descriptor when the first instruction in the
callee is executed.  The callee of such a sequence must preserve all
registers other than GR8 and GR9; LR is modified by the call
instruction itself.

If the PLT is inlined, another register holding the GOT pointer may be
used instead of GR15 in the ldd or lddi instructions, but GR15 must be
set to the GOT pointer value before or in parallel with the calll
instruction.  This is less stringent than the non-inlined case, which
requires GR15 to be set before, not in parallel with, the call
instruction.

#tlsdesc and #gettlsoff are annotations that enable the linker to
optimize the code when the dynamic model is not necessary.


Upon return from any of the calls above, gr9+gr29 yields the address
of variable x or sx.  For example, the value of x can be obtained or
modified using a load or store with the addressing mode @(gr29,gr9).


#gottlsdeschi and #gottlsdesclo denote the high and low 16 bits of the
offset from the GOT pointer containing a TLS descriptor; #gottlsdesc12
denotes the low 12 bits.  When requested to create a gettlsoff PLT
entry, or when resolving the relocations denoted by these expressions,
the linker will automatically create such a GOT entry:

    <_GLOBAL_OFFSET_TABLE_+#gottlsdesc(x)>:
	.dword	#<tlsdesc_value>(x)

Note that #<tlsdesc_value> is not well-formed assembly; it's just an
arbitrary notation to convey the fact that we're going to have the
corresponding relocations associated to the given 64-bit value.


Local Dynamic TLS Model
-----------------------

The local dynamic TLS model is an optimization of the general dynamic
TLS model.  The compiler can generate code following this model if it
can recognize that the thread-local variable is defined in the same
object it is referenced in.  This includes thread-local variables with
file scope or variables which are defined to be protected or hidden.

Since a thread-local variable is defined by the module ID and the
offset in the TLS block of that module, in the case of variables which
are known to be referenced and defined in the same object, the offsets
are known at link-time, but the module ID isn't.  Since the module ID
is not known, it is necessary to use general dynamic call sequences to
obtain the module biased base pointer, and then add the constant
offsets known at link time to such base pointer.

The code generated for the general dynamic and the local dynamic
models is so different that the optimizations have to be done by
the compiler, not the linker.


The following code is used to obtain the biased base pointer for the
module:

	call	#gettlsoff(0)

This is equivalent in all regards to #gettlsoff(zero_offset), where
zero_offset denotes a variable whose biased offset is zero.  The PLT
entry that would be generated can be inlined:

	sethi.p	#gottlsdeschi(0), gr14
	setlo	#gottlsdesclo(0), gr14
	ldd	#tlsdesc(0)@(gr15, gr14), gr8
	calll	#gettlsoff(0)@(gr8, gr0)

The calls above end up calling <tls_get_offset>, so, after they
return, gr9+gr29 yields the biased base pointer for the module.


In the following code fragments, the code shown determines the address
of two thread-local variables:

	static __thread int x1;
	static __thread int x2;

	&x1;
	&x2;

Compute the biased base pointer for the module in gr7:

	add	gr9, gr29, gr7
	
and then use it as a base pointer to access other variables.

	sethi.p	#tlsmoffhi(x1), gr8
	setlo	#tlsmofflo(x1), gr8
	;;; @(gr7, gr8) is x1
	...
	sethi.p	#tlsmoffhi(x2), gr10
	setlo	#tlsmofflo(x2), gr10
	;;; @(gr7, gr10) is x2

#tlsmoff represents the offset of the variable into the module.

Instead of computing TLS module offsets using such long instruction
sequences, it is possible to perform the access, or compute the
address, with a single instruction:

	ld	@(gr7, #tlsmoff12(x1)), gr12
	;;; gr12 now holds the value last stored in x1

	addi	gr7, #tlsmoff12(x2), gr13
	;;; gr13 now holds the thread-local address of x2

Since thread-local sections tend to be small (seldom more than tens of
bytes), it is recommended that code using 12-bit offsets be emitted by
the compiler by default, and that a command-line option be introduced
to enable the use of larger TLS sections.
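
As a sketch of the arithmetic involved, the C fragment below models
how a TLS module offset relates to the position of a variable within
its module's TLS block, and which offsets the single-instruction
12-bit forms can reach.  The function names are invented for this
example; the 2032-byte bias and the 12-bit signed range are taken
from this document.

```c
#include <assert.h>
#include <stdint.h>

#define FRV_TLS_BIAS 2032   /* bias of module TLS base pointers */

/* TLS module offset (#tlsmoff) of a variable that sits block_offset
   bytes into its module's TLS block: the offset is measured from the
   biased base pointer, so the bias is subtracted. */
int32_t tlsmoff_of(uint32_t block_offset)
{
    assert(block_offset <= 0x7fffffffu);
    return (int32_t)block_offset - FRV_TLS_BIAS;
}

/* True if v fits the 12-bit signed literal field used by the
   #tlsmoff12 forms of ld and addi. */
int fits_simm12(int32_t v)
{
    return v >= -2048 && v <= 2047;
}

/* Address of a local thread-local variable, given gr29, the GR9
   value returned for #gettlsoff(0), and the variable's #tlsmoff. */
uintptr_t local_dynamic_addr(uintptr_t gr29, intptr_t gr9,
                             int32_t tlsmoff)
{
    uintptr_t gr7 = gr29 + (uintptr_t)gr9;   /* add gr9, gr29, gr7 */

    return gr7 + (uintptr_t)(intptr_t)tlsmoff;
}
```

Under these assumptions, the 12-bit forms cover the first 4080 bytes
of a module's TLS block.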


Initial Exec TLS Model
----------------------

The initial exec TLS model can be applied unconditionally when
generating the executable itself (i.e. when compiling code without the
options to emit code suitable for shared libraries: -fPIC or -fpic).
This optimization is usable if the variables accessed are known to be
in one of the modules available at program start, and if the
programmer chooses to use the static access model.  The generated code
will not use the <tls_get_offset> function, which means that deferred
allocation of memory for the TLS blocks is not possible in this model.
It is still possible however, to defer allocation for dynamically
loaded modules.

With this optimization, for each variable there would be a run-time
relocation for a GOT entry which instructs the dynamic linker to
compute the offset from the TCB.

In the following code fragments, the code shown is determining the
address of a thread-local variable x:

	extern __thread int x;

	&x;

Assuming GR15 holds the PIC pointer, the following instruction
computes the offset from the thread context pointer to variable x:

	ldi    @(gr15, #gottlsoff12(x)), gr9

Ideally, the sequence above should never be generated by the compiler,
or generated only with -fpic, since it assumes GOT offsets fit into 12
bits.  The longer form with sethi/setlo should be emitted for -fPIC.

The linker may relax out-of-line `call #gettlsoff' instructions to the
above, if it doesn't cause GOT overflows.

For -fPIC, the following sequence should be generated instead, once
again assuming GR15 is the PIC pointer:

	sethi.p #gottlsoffhi(x), gr14
	setlo	#gottlsofflo(x), gr14
	ld	#tlsoff(x)@(gr15, gr14), gr9

When relaxing an inlined general dynamic call sequence, the ldd
instruction to gr8 and gr9 is replaced with the ld instruction above,
and the call is replaced with a nop.

The GOT entry used by the initial exec model is:

    _GLOBAL_OFFSET_TABLE_+#gottlsoff(x):
	.word	#<tlsoff>(x)


When linking an executable, PLT entries generated in response to call
#gettlsoff instructions can use static TLS instead, referencing TLS
offsets in the GOT, instead of TLS descriptors:

    plt(gettlsoff(sx)):
	sethi.p #gottlsoffhi(sx), gr8
	setlo	#gottlsofflo(sx), gr8
	ld	@(gr15, gr8), gr9
	ret

Shorter versions for 16- or 12-bit offsets from the GOT can be used
for executables as well.


Local Exec TLS Model
--------------------

This model is an optimization for the local dynamic model.  It can be
used only for code in the executable itself, and only when the
variables accessed are in the executable itself.  These restrictions
mean that, in this model, the TLS block can be addressed relative to
the thread pointer.  It also means that we always use the first TLS
block (the one for the executable), and the size of the other TLS
blocks is irrelevant for address computations.

The code sequences below implement the following fragment:

	static __thread int x;

	&x;

After the two instructions below, gr10+gr29 represents the address of
variable x:

	sethi.p	#tlsmoffhi(x), gr10
	setlo	#tlsmofflo(x), gr10

When relaxing from the Local Dynamic model, the code that calls
<tls_get_offset> can be simplified to:

	setlos	#0, gr9


When linking an executable, and referencing a local symbol, PLT
entries generated in response to call #gettlsoff instructions can use
the local exec access model, computing the TLS offsets directly
instead of using TLS descriptors:

    plt(gettlsoff(sx)):
	sethi.p #tlsmoffhi(sx), gr8
	setlo	#tlsmofflo(sx), gr8
	ret

Shorter versions for 16- or 12-bit offsets from the GOT can be used
for executables as well.


Relocations
===========

Some relocations are available in 12, HI and LO variants.  The 12
variant, to be used as the immediate offset to load or store
instructions, resolves to the least significant 12 bits.  The HI and
LO variants resolve to the most and least significant 16 bits,
respectively, and they're to be used as operands to sethi and setlo.  
Even though LO could be used as the operand to setlos, this is not
recommended, since it silently truncates the value to 16 bits.
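
The C fragment below sketches how the three variants split a 32-bit
value, and why pairing LO with setlos loses information.  It is a
model only: the function names are made up, and the instruction
semantics assumed here are that sethi and setlo fill the upper and
lower 16 bits of a register, while setlos loads a sign-extended
16-bit immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Field values the HI, LO and 12 variants would resolve to. */
uint16_t reloc_hi(uint32_t v) { return (uint16_t)(v >> 16); }
uint16_t reloc_lo(uint32_t v) { return (uint16_t)(v & 0xffffu); }
uint16_t reloc_12(uint32_t v) { return (uint16_t)(v & 0xfffu); }

/* Register contents after sethi #hi followed by setlo #lo. */
uint32_t sethi_setlo(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Register contents after setlos #imm: the immediate sign-extended. */
int32_t setlos(uint16_t imm)
{
    return imm >= 0x8000u ? (int32_t)imm - 0x10000 : (int32_t)imm;
}

/* The sethi/setlo pair always rebuilds the full value. */
void check_roundtrip(uint32_t v)
{
    assert(sethi_setlo(reloc_hi(v), reloc_lo(v)) == v);
}
```

In this model, a value such as -2032 survives setlos with the LO
field, because it fits in 16 signed bits, whereas a value such as
40000 does not: only the sethi/setlo pair reproduces it.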

Some relocations can only be applied to a limited set of instructions,
so as to enable certain optimizations.  The optimizations are
described assuming fields will not overflow.  If they do, the linker
is advised to not perform the transformation, but, if it does, it must
print an error message and offer a command-line option to disable the
transformation.  Given a set of optimizations that can be applied to
relocations referencing a symbol+offset, the linker must perform
either all or none of them, i.e., the semantics of the sequences
depicted in the Linker Optimizations section below must not be broken
by transformations in some instructions but not in others that belong
to the same logical group of instructions.

The following relocations are available in relocatable objects, but
never as dynamic relocations:

25 R_FRV_GETTLSOFF

   This relocation forces a PLT entry to be generated, that references
   a TLS descriptor for the symbol.  The TLS descriptor is implicitly
   generated in the GOT.  The relocation, applied to a call
   instruction, causes it to call this PLT entry.

   If the symbol number is zero, it's a reference to the TLS base
   address for the module, including the bias.

   This relocation must be associated with a call instruction.

   If an executable is being linked and the referenced symbol binds
   locally, the call instruction may be replaced with:

	setlos	#tlsmoff(symbol+addend), gr9

   If an executable is being linked, but the referenced symbol does not
   bind locally (or, optionally, if it does but the substitution above
   would overflow), the call instruction may be replaced with:

	ldi	@(gr15, #gottlsoff12(symbol+addend)), gr9

   A GOT entry with the TLS offset is implicitly generated in this
   case.
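
   The choice between these replacements can be sketched as a small
   decision function.  This is an illustration under assumptions, not
   a description of any actual linker: the names are invented, and
   the GOT offset of the TLS offset entry is taken as a given input,
   whereas a real linker would have to lay out the GOT to know it.

```c
#include <assert.h>
#include <stdint.h>

/* Possible outcomes for a call carrying R_FRV_GETTLSOFF. */
enum gettlsoff_relax { KEEP_CALL, USE_SETLOS, USE_LDI };

/* True if v fits in a signed field of the given width. */
static int fits_signed(int32_t v, int bits)
{
    int32_t lim;

    assert(bits > 1 && bits < 32);
    lim = (int32_t)1 << (bits - 1);
    return v >= -lim && v < lim;
}

/* tlsmoff:   TLS module offset of symbol+addend (local symbols only).
   gottlsoff: GOT offset the TLS offset entry would get (assumed
              known for the purposes of this sketch). */
enum gettlsoff_relax relax_gettlsoff(int linking_executable,
                                     int binds_locally,
                                     int32_t tlsmoff,
                                     int32_t gottlsoff)
{
    if (!linking_executable)
        return KEEP_CALL;               /* call the PLT entry */
    if (binds_locally && fits_signed(tlsmoff, 16))
        return USE_SETLOS;              /* setlos #tlsmoff(...), gr9 */
    if (fits_signed(gottlsoff, 12))
        return USE_LDI;                 /* ldi @(gr15, #gottlsoff12) */
    return KEEP_CALL;
}
```

   The third test covers both the non-local case and the optional
   fallback for a local symbol whose offset overflows setlos.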


27 R_FRV_GOTTLSDESC12
28 R_FRV_GOTTLSDESCHI
29 R_FRV_GOTTLSDESCLO

   These relocations resolve to the GOT offset for a TLS descriptor of
   the symbol.  The TLS descriptor is implicitly generated in the
   GOT.

   If the symbol number is zero, it's a reference to the TLS
   descriptor for the module, including the bias.

   The GOTTLSDESC12 relocation must be associated with an `lddi'
   instruction.  GOTTLSDESCHI must be associated with a `sethi'
   instruction.  GOTTLSDESCLO must be associated with one of `setlo' or
   `setlos'.

   When linking an executable, and the referenced symbol binds locally,
   the `sethi', `setlo' and `setlos' instructions MAY be replaced with
   `nop's.  An `lddi' instruction such as:

	lddi	@(grB, #gottlsdesc12(symbol+offset)), grC

   MAY be replaced with:

	setlos	#tlsmofflo(symbol+offset), gr<C+1>

   When linking an executable, and one of these relocations references
   a symbol that does not bind locally, the immediate operand of
   `sethi', `setlo' and `setlos' instructions must be replaced with
   the value for a corresponding GOTTLSOFF relocation.  An `lddi'
   instruction may be replaced with `ldi', the immediate offset
   replaced with the value for the corresponding GOTTLSOFF relocation,
   and the destination register replaced with the odd-numbered
   register following the destination register of `lddi'.  A GOT entry
   holding the TLS offset for the symbol is implicitly generated.


30 R_FRV_TLSMOFF12
31 R_FRV_TLSMOFFHI
32 R_FRV_TLSMOFFLO

   These relocations resolve to the offset from the biased base
   address for the module to the address of the thread-local symbol.
   The symbol must bind locally in the module.


33 R_FRV_GOTTLSOFF12
34 R_FRV_GOTTLSOFFHI
35 R_FRV_GOTTLSOFFLO

   These relocations resolve to the GOT offset for an entry holding
   the offset from the thread pointer to the thread-local symbol.  It
   causes a TLSOFF entry to be created in the GOT.

   The GOTTLSOFF12 relocation must be associated with an `ldi'
   instruction.  GOTTLSOFFHI must be associated with a `sethi'
   instruction.  GOTTLSOFFLO must be associated with one of `setlo' or
   `setlos'.

   When linking an executable, and the referenced symbol binds
   locally, `sethi', `setlo' and `setlos' instructions may be replaced
   with `nop's.  An `ldi' instruction such as:

	ldi	@(grB, #gottlsoff12(symbol+offset)), grC

   MAY be replaced with:

	setlos	#tlsmofflo(symbol+offset), grC


The following relocations are do-nothing annotations, used for linker
relaxations.

37 R_FRV_TLSDESC_RELAX

   A relocation that must be attached to ldd instructions that load
   the function descriptor for <tls_get_offset> from a TLS descriptor,
   except for those that get a GOTTLSDESC12 relocation.  It indicates
   that, in @(grX, grY), grX holds the GOT address, and grY holds the
   offset from grX to the TLS descriptor for symbol+addend.

   When linking an executable, and the referenced symbol binds
   locally, the `ldd' instruction may be replaced from:

	ldd	#tlsdesc(symbol+addend)@(grX, grY), grC
	
   to

	setlos	#tlsmofflo(symbol+addend), gr<C+1>

   When linking an executable, and the referenced symbol does not bind
   locally, the `ldd' instruction may be replaced with `ld', with the
   destination register replaced with the odd-numbered register
   following the destination register of `ldd'.

   When the referenced TLS descriptor (or offset, after the relaxation
   above) is within the 12-bit-addressable range in the GOT, the load
   instruction may be turned into an immediate load instruction.


38 R_FRV_GETTLSOFF_RELAX

   A relocation that must be attached to calll instructions that call
   <tls_get_offset> for symbol+addend.

   When linking an executable, the `calll' instruction may be replaced
   with a `nop'.


39 R_FRV_TLSOFF_RELAX

   A relocation that must be attached to `ld' instructions that load
   the TLS offset for a symbol+addend from the GOT.  It indicates
   that, in @(grX, grY), grX holds the GOT address, and grY holds the
   offset from it to the GOT entry containing the TLS offset for
   symbol+addend.

   When linking an executable, if the referenced symbol binds locally,
   the `ld' instruction can be replaced with:

	setlos	#tlsmofflo(symbol+addend), grC

   When the referenced TLS offset is within the 12-bit-addressable range
   in the GOT, the load instruction may be turned into an immediate
   load instruction.

40 R_FRV_TLSMOFF

   This relocation resolves to the offset from the biased base
   address for the module to the address of the thread-local symbol.
   The symbol must be defined locally in the module.  This relocation
   is typically only used in debug information.  The assembler
   generates it in response to assembly code such as:

	.picptr	tlsmoff(symbol+addend)


The following relocations are only available as dynamic relocations:

26 R_FRV_TLSDESC_VALUE

   A 64-bit relocation that resolves to a <tls_get_offset> entry point
   followed by the argument to be passed to it in GR9 such that it
   computes the TLS offset for the symbol+offset referenced in the
   relocation.  The in-place addend is stored in the second word.  The
   contents of the first word prior to relocation are reserved for
   future extensions.

36 R_FRV_TLSOFF

   A 32-bit offset from the thread pointer to a thread-local symbol.


Linker Optimizations
====================

We summarize below the relaxations the linker may apply to TLS
relocations.  The relaxable instructions can be scheduled freely, and
they can use registers other than those used in the sample code
snippets above.  The only requirement is that, on call sites (call
#gettlsoff or calll #gettlsoff), the custom calling conventions of
<tls_get_offset> are satisfied.  We have to be careful in the
transformations to make sure that we don't assume too much about
what's in each register, since a compiler may have chosen different
registers and added copy instructions.

In the code snippets below, we use symbolic register names such as
grA, grB and grC.  If the compiler may have introduced copies between
instructions, we add ' to the register name (e.g., grA', grA''), to
indicate it's the same register as the one mentioned before, or a copy
thereof.

The packing bits are only shown below to illustrate the expected
common use.  The linker should retain them where they appear, and the
original code must be emitted in such a way that the substitutions
below don't generate illegal packing combinations.

In some cases, instructions become nops.  With significant additional
effort, we could get sufficient information to the linker to enable it
to actually eliminate such instructions from the instruction stream in
some cases that don't require them due to VLIW packing.  However, due
to the increased demands from the assembler in retaining symbolic
information in object files to enable such code-length changes, and
since it is possible to obtain the improved sequences by providing the
compiler with additional symbol-locality information, we recommend
simple substitution of instructions for nops.


General Dynamic to Initial Exec
-------------------------------

Instead of having to emit a full TLS descriptor for the variable in
the GOT, we only need a TLS offset GOT entry.  In the first case, we
may be unable to perform the optimization if #gottlsoff12
overflows.

From:

	call	#gettlsoff(x)

To:

	ldi	@(gr15, #gottlsoff12(x)), gr9

From:

	sethi.p #gottlsdeschi(x), grA
	setlo	#gottlsdesclo(x), grA'
	ldd	#tlsdesc(x)@(grB, grA''), grC
	calll	#gettlsoff(x)@(grC', gr0)

To:

	sethi.p #gottlsoffhi(x), grA
	setlo	#gottlsofflo(x), grA'
	ld	#tlsoff(x)@(grB, grA''), gr<C+1>
	nop

From:

	setlos	#gottlsdesclo(x), grA
	ldd	#tlsdesc(x)@(grB, grA'), grC
	calll	#gettlsoff(x)@(grC', gr0)

To:

	setlos	#gottlsofflo(x), grA
	ld	#tlsoff(x)@(grB, grA'), gr<C+1>
	nop

From:
    
	lddi.p	@(grB, #gottlsdesc12(x)), grC
	calll	#gettlsoff(x)@(grC', gr0)

To:

	ldi.p	@(grB, #gottlsoff12(x)), gr<C+1>
	nop

One might be tempted to replace the nops with `mov gr<C+1>, gr9', but
at the point of the call, the function descriptor that would have been
loaded to grC/gr<C+1> must have been moved to gr8/gr9, even if gr8 is
not used in the address for the call instruction, because gr8 and gr9
are defined as part of the custom calling conventions of
<tls_get_offset>.


General or Local Dynamic To Local Exec
--------------------------------------

Here we manage to get rid of all dynamic relocations in (almost) all
cases, by taking advantage of the fact that thread-local symbols of
the executable are at fixed offsets from gr29.

In the Local Dynamic case, instead of a symbol x, we'll have a
constant N (expected to be zero).  This constant is the TLS module
offset itself, so #tlsmofflo(N) becomes #lo(N) and #tlsmoffhi(N)
becomes #hi(N).

From:

	call	#gettlsoff(x)

To:

	setlos	#tlsmofflo(x), gr9

or, if #tlsmofflo(x) exceeds a 16-bit signed offset, before giving up,
we may still try the following substitution, that still requires one
dynamic relocation, but saves a TLS PLT entry and most of the TLS
descriptor space in the GOT.

	ldi	@(gr15, #gottlsoff12(x)), gr9

Alternatively, we can optimize the generated PLT entry such that it
computes gr9 using a longer sequence, or one that loads a TLSOFF from
the GOT.

From:

	sethi.p #gottlsdeschi(x), grA
	setlo	#gottlsdesclo(x), grA'
	ldd	#tlsdesc(x)@(grB, grA''), grC
	calll	#gettlsoff(x)@(grC', gr0)

To:

	nop.p
	nop
	setlos	#tlsmofflo(x), gr<C+1>
	nop

From:

	setlos	#gottlsdesclo(x), grA
	ldd	#tlsdesc(x)@(grB, grA'), grC
	calll	#gettlsoff(x)@(grC', gr0)
    
To:

	nop
	setlos	#tlsmofflo(x), gr<C+1>
	nop

From:

	lddi.p	@(grB, #gottlsdesc12(x)), grC
	calll	#gettlsoff(x)@(grC', gr0)
    
To:

	setlos.p #tlsmofflo(x), gr<C+1>
	nop

If the tlsmoff(x) exceeds a signed 16-bit value, instead of:

	setlos	#tlsmofflo(x), gr<C+1>
	nop	;; that replaced the calll instruction
    
we emit, in all cases but the first:

	sethi	#tlsmoffhi(x), gr<C+1>
	setlo	#tlsmofflo(x), gr9


Initial Exec To Local Exec
--------------------------

Here we attempt to get rid of the TLS offset GOT entry, so as to
eliminate the need for a dynamic relocation, but we can't always do
it.  If the symbol TLS offset fits in 16 bits, we can always do it.
Otherwise, we need global information to decide how to handle each
individual relaxable relocation, so the recommendation is that such
cases not be optimized.

TLS offset of symbol x fits in 16 bits (signed):

From:

	sethi.p #gottlsoffhi(x), grA
	setlo	#gottlsofflo(x), grA'
	ld	#tlsoff(x)@(grB, grA''), grC

To:

	nop.p
	nop
	setlos	#tlsmofflo(x), grC

From:

	setlos	#gottlsofflo(x), grA
	ld	#tlsoff(x)@(grB, grA'), grC

To:

	nop
	setlos	#tlsmofflo(x), grC

From:

	ldi	@(grB, #gottlsoff12(x)), grC

To:

	setlos	#tlsmofflo(x), grC


If the TLS Module Offset of X exceeds 16 bits, we can't optimize the
last case.  As for the other cases, we might apply the following
transformations.  However, since the cases below require global
analysis on the input operands of the ld instruction, we recommend it
not to be applied.  Since TLS segments are generally very small (tens
of bytes, far less than 64 kilobytes), the absence of such a
transformation should not be a major problem.

The transformations below are ones that can only be performed when
disambiguation is possible.  One must bear in mind, however, that it
is possible for a single ld #tlsoff instruction to take index register
inputs from both sethi/setlo and setlos sets, with copying and control
flow making it an undecidable problem in the general case.

We could have disambiguated in the assembly level, using new
annotations such as #gottlsofflos, #tlsofflos and #tlsoffhilo in
setlos and ld, generating different relocations that would enable the
linker to immediately tell which case is in use, but since this case
would only be useful for TLS objects defined in the main executable
that didn't fit in the initial 65520 bytes of the TLS segment, and
that were not defined in the same translation unit that references
them, or that were not declared as binding locally (e.g., of hidden or
protected visibility) in other translation units that reference it, we
dismissed it as a non-issue, and decided to leave them unoptimized.

From:
(assuming tlsoffhilo in addition to, or instead of tlsoff)

	sethi.p #gottlsoffhi(x), grA
	setlo	#gottlsofflo(x), grA'
	ld	#<tlsoffhilo>(x)@(grB, grA''), grC

To:

	sethi.p	#tlsmoffhi(x), grA
	setlo	#tlsmofflo(x), grA'
	mov	grA'', grC

or, if grA'' and grC were required to be the same register, you could
use:

	nop.p
	sethi	#tlsmoffhi(x), grA'
	setlo	#tlsmofflo(x), grA'' ;; grA'' and grC must be the same

From:
(assuming gottlsofflos and tlsofflos instead of gottlsofflo and
 tlsoff, and that grA' and grC are required to be the same register)

	setlos	#<gottlsofflos>(x), grA
	ld	#<tlsofflos>(x)@(grB, grA'), grC

To:
   
	sethi	#tlsmoffhi(x), grA
	setlo	#tlsmofflo(x), grA' ;; grA' and grC must be the same



Envisioned Extensions
=====================

The <tls_get_offset> design was created to enable lazy resolution of
R_FRV_TLSDESC_VALUE relocations, but this optimization is so far
unspecified and not implemented.  This optimization could
significantly speed up the start up of programs that use dynamic
libraries that rely on many TLS variables, since the cost of
performing their relocations would be avoided for variables that are
not actually referenced, and delayed to runtime otherwise.  Although
this optimization would require an ABI extension to specify the
expected behavior for an R_FRV_TLSDESC_VALUE in a lazy relocation
section, since this relocation is not specified as a lazy relocation
in the current specification, it is expected that such a change could
be implemented in a fully backward-compatible way.


An optimization to the Dynamic Thread Vector management is also
envisioned.  Under the current specification, it is possible to
arrange for the DTV to not contain entries for static TLS entries,
since they are never going to be used.  This also enables the dynamic
thread vector to be allocated on demand, instead of at thread start-up,
which would enable programs that do not use dynamic TLS (typically,
programs that do not dlopen() libraries that have TLS) to run without
ever allocating dynamic thread vectors.  This could significantly
speed up thread creation time.  Since the internal format of the DTV,
and even its very existence, are unspecified in the ABI, such
optimizations can be implemented in a fully backward-compatible way.


Revision History
================

2004-12-10 (0.22): added 32-bit R_FRV_TLSMOFF relocation.


References
==========

[1] "ELF Handling For Thread-Local Storage" version 0.20, Ulrich
Drepper, Red Hat, 2003.

[2] "The FR-V FDPIC ABI" version 1.0, Red Hat, 2004.