65c816 for 6502 developers: Difference between revisions
NovaSquirrel (talk | contribs) (Created page with "This guide is meant to introduce the 65c816 to someone who is already familiar with the 6502. Most 6502 code will work on it without changes but there are additional features that you can take advantage of to write even faster and smaller code. This page is intended to highlight those features. ==16-bit registers!== The biggest, most obvious change from the 6502 is that it can do 16-bit operations. To enable this, the accumulator and index registers can change between b...") |
NovaSquirrel (talk | contribs) m (→Emulation mode) |
||
Line 221: | Line 221: | ||
==Emulation mode== | ==Emulation mode== | ||
The 65c816 starts out in an "emulation mode" which is supposed to simulate a 65c02 by changing a few aspects of the processor (stack pointer is 8-bit and fixed to $0001xx, registers are fixed to 8-bit, interrupts use 16-bit addresses, some instructions take longer). You can completely ignore it, aside from writing code to switch out of it (<code> CLC \ XCE</code>) as it's really only helpful for 65c816 successors to 6502 computers that are meant to be able to run old programs without changes, like the Apple IIGS. It won't help with porting NES code to the SNES. | The 65c816 starts out in an "emulation mode" which is supposed to simulate a 65c02 by changing a few aspects of the processor (stack pointer is 8-bit and fixed to $0001xx, registers are fixed to 8-bit, interrupts use 16-bit addresses, some instructions take longer). You can completely ignore it, aside from writing code to switch out of it (<code>CLC \ XCE</code>) as it's really only helpful for 65c816 successors to 6502 computers that are meant to be able to run old programs without changes, like the Apple IIGS. It won't help with porting NES code to the SNES. | ||
==SNES-specific math== | ==SNES-specific math== |
Revision as of 17:29, 8 May 2022
This guide is meant to introduce the 65c816 to someone who is already familiar with the 6502. Most 6502 code will work on it without changes but there are additional features that you can take advantage of to write even faster and smaller code. This page is intended to highlight those features.
16-bit registers!
The biggest, most obvious change from the 6502 is that it can do 16-bit operations. To enable this, the accumulator and index registers can change between being 8-bit and 16-bit (sort of). This page will go into more detail about that later on.
Two additional flags are added to the flags register to control register size. One flag ("M") controls the size of accumulator operations, as well as operations that do not involve a register such as BIT
, INC
, or ASL
directly on a memory location. Another flag ("X") controls the size of both the X and Y index registers.
Instead of getting dedicated instructions like SEC
/CLC
, register sizes are changed with the new REP
and SEP
instructions. They take an 8-bit immediate operand and REset or SEt bits in the flags register. This means that you can set and reset multiple bits in the flag register by having multiple bits set in that immediate. For example SEP #$30
sets M and X simultaneously. This creates a possible optimization where a CLC
or SEC
(for ADC
and SBC
) can sometimes be removed, and merged into a REP
or SEP
.
REP/SEP value | Flag | Purpose |
---|---|---|
#$01 | C | Carry |
#$02 | Z | Zero |
#$04 | I | Interrupt disable |
#$08 | D | Decimal mode |
#$10 | X | 8-bit index registers when set, 16-bit when clear |
#$20 | M | 8-bit accumulator/memory when set, 16-bit when clear |
#$40 | C | Carry |
#$80 | C | Carry |
Instead of memorizing these bit values, this is a great opportunity to use macros for more readable code! lorom-template contains ca65 macros named seta8
, seta16
, setxy8
, setxy16
, setaxy8
and setaxy16
and these make it clear what registers are being affected how.
One very important thing to consider with register sizes is they change how the processor reads immediate operands! LDA #1
may assemble to A9 01
or A9 01 00
depending on if the accumulator is supposed to be 8-bit or 16-bit at that point in the code. If the processor tries to run code that's meant for a different register size, it will not work correctly (for instance, A9 01 00
in 8-bit mode will be interpreted as LDA #1 \ BRK
). Because of this, it's important to let the assembler know what sizes are required, and the exact syntax for this differs from assembler to assembler.
To tell ca65 what the register sizes are at a given point in the program, the .a8
, .a16
, .i8
and .i16
directives are used. ca65 also has an optional feature (named .smart) that infers size changes from REP
and SEP
but there are still going to be situations where it needs to be explicitly stated. It can be a good practice to specify the register sizes at the start of each subroutine, which serves both as documentation and a guarantee that the routine will be assembled correctly regardless of what code gets put earlier in the file.
Some assemblers like asar require sizes to specified on a per-instruction basis, like LDA.B #1
or LDA.W #1
.
Bigger accumulator
The accumulator is always two bytes, and register size actually determine if accumulator instructions use the low 8 bits or the all 16 bits. When the accumulator is 8-bit, the other byte can be accessed with a XBA
instruction, which eXchanges the two bytes.
The ability to set the accumulator to use 16-bit values greatly simplifies math, comparisons, and pointer operations, and allows the same code to be written in fewer instructions (which will consequently be faster). For instance, stepping a pointer forward by one byte can become a single INC
instruction.
When an instruction does a 16-bit operation on a memory address, it will actually act on the address you specify (containing the low 8 bits) and the address after it (containing the high 8 bits), just like how pointers work.
8-bit variables in 16-bit mode
It's worth noting that you don't always need the accumulator to be in 8-bit mode to work with 8-bit variables. LDA variable \ AND #$00FF
can often work fine.
Bigger index registers
16-bit index registers simplify the use of arrays that are bigger than 256 items, allowing the programmer to avoid using pointers in many cases.
X and Y's size flag works differently from the accumulator's. When X and Y are 8-bit, they act as if their upper byte is always being forced to zero. Consequently, this happens if the accumulator is 16-bit and the index registers are 8-bit:
LDA #$1234 TAX TXA ; A is now #$0034
This means that if you intend to set X or Y to 8-bit temporarily and then change it back to 16-bit, you must save and restore the other index register if you don't intend to change its value!
Index registers as pointers
16-bit index registers are big enough to hold an entire 16-bit address, which means they can act as pointers. This means that pointers can be just as fast as regular array access, or even faster in some cases.
One technique you can do with this is storing the address of the first byte of a structure in X, and then you can access different fields in the structure using different offsets from X, using smaller (and sometimes faster) zero page instructions. A bonus is that the rarely-used (zeropage,x)
addressing mode is a bit more useful - if the structure contains a pointer you can just access it with one instruction like this.
Arrays of structures
On the 6502, parallel arrays (or "structure of arrays") are usually the most efficient way to lay out things like enemy state. However on the 65c816, it can be a good idea to consider array of structures instead!
Consider a structure that contains both 8-bit and 16-bit fields. On the 6502 the 16-bit values would just be spread across separate arrays, but here you would want to take advantage of being able to access a 16-bit number all at once, so you would probably prefer to have the two bytes sequential in memory. The biggest (or only?) downside is that now you need to do a TXA \ CLC \ ADC #Size \ TAX
sequence to iterate through the list, but it's probably worth it considering the other advantages.
Transfers between differently sized registers
What happens if you use TAX
/TAY
/TXA
/TYA
when the accumulator and index registers are different sizes?
Instruction | Accumulator | Index registers | Outcome |
---|---|---|---|
TAX | 8-bit | 16-bit | Both bytes in accumulator are copied to X. |
TAX | 16-bit | 8-bit | X = Low byte of accumulator. High byte of X is zero. |
TXA | 8-bit | 16-bit | A = X, including changing the high byte in accumulator to zero. |
TXA | 16-bit | 8-bit | Low byte of A = X. High byte of A is unmodified. |
Register size saving
One practice you may find useful is to start a routine with PHP
and end it with PLP
if you change the register size inside of it. This way, a caller will get the registers back in the same sizes they were before.
php ; Insert code here plp rts
If you want to save the register values as well, you should push them before you push the register state, so that the correct sizes are restored before they are pulled. Pushing a 16-bit accumulator and then pulling an 8-bit accumulator will probably lead to a crash, as the return address RTS
gets will be wrong.
pha phx phy php ; Insert code here plp ply plx pla rts
Small optimizations
There are a lot of little changes over the original 6502 that make life easier, and allow you to use fewer or smaller instructions. Most of these were actually introduced with the 65c02.
INA
- Increment accumulator.DEA
- Decrement accumulator.- code>PHX - Push X register.
PHY
- Push Y register.PLX
- Pull X register.PLY
- Pull Y register.TXY
- Copy X register to Y register.TYX
- Copy Y register to X register.STZ
- Store zero. Can be indexed with X, and can be zeropage or absolute.BRA
- Unconditional branch. This can save a byte overJMP
but not a cycle, as both require exactly 3.Indirect addressing
- Indexing is no longer mandatory on indirect addressing. For example you can write things likeLDA ($00)
.
Bit tests
The 65c816 provides a few more tools for bit tests. BIT
becomes much more useful because it can now be indexed, and you can use it with an immediate operand. With an immediate operand it acts identically to AND
without changing the accumulator.
TRB
and TSB
are new tools for bit operations. They take a zeropage or absolute address (non-indexed) and do a bit test, setting the zero flag as if an AND
had been performed between the accumulator and memory. Next, the value at that memory address is changed, with TRB
clearing all bits that are set in the accumulator, and TSB
setting all bits that are set in the accumulator. TSB
can be a faster replacement for most situations that called for ORA
followed by STA
on the same address. This means it's very helpful for piecing together an value from multiple parts:
lda XPos sta Temp lda YPos asl asl asl asl tsb Temp
Jump tables
Jump tables on the 6502 require you to either push the address you want to jump to onto the stack, or store it to an address before using an indirect jump. On the 65c816, there are instructions that are specifically for jump tables. JMP (absolute,x)
and JSR (absolute,x)
both exist, though you'll still have to do it the old ways if you need to preserve X or want to jump to a 24-bit address.
asl tax jmp (Table,x) Table: .addr Routine1 .addr Routine2 .addr Routine3
Banks and 24-bit addresses
Banks work very differently on the 65c816 than they would on a 6502 system like the NES. The address space is 24-bit, which provides access to a whole 16 megabytes of data. This means that it doesn't usually make sense to do bank switching, because there is already plenty of space.
24-bit program counter
To go with the bigger address space, the program counter has an 8-bit "bank" byte added to it, resulting in a 24-bit register. JMP
, JSR
, and RTS
still use 16-bit addresses, and do not modify the bank byte. There are now JML
, JSL
and RTL
instructions that jump to a full 24-bit address and do change the program counter's bank. RTI
also now takes a 24-bit address - which is important to note if you're using RTI for the RTS trick!
This means that when you write routines, you need to decide whether it should be callable from only the current bank, or from any of them. A routine you call with JSL
must return via RTL
just as a routine you JSR
needs a RTS
.
The data bank
Similar to the situation with JMP
, loads, stores and other data accesses with 16-bit addresses (but not zeropage ones) get extended out to 24-bit with a bank byte. In this case it's called the "data bank" register. You can only interact with it through the PHB
and PLB
instructions, so setting the data bank has to involve pushing the bank number to the stack. If you want the data bank to equal the program bank, you can use the PHK
instruction, which pushes the program bank. An example follows:
php ; Save register sizes phb ; Save original data bank phk ; Push the program counter's bank plb ; Store it to the data bank ; Insert code that changes the data bank and register sizes to something else plb ; Restore data bank plp ; Restore register sizes rtl
You need to keep the data bank in mind when calling code that's in another bank. If you JSL
somewhere, the data bank won't necessarily be correct for any lookup tables in the target code bank.
In ca65, in addition to < and > to fetch the bottom 8 bits or next 8 bits of a value/label, you can use ^ to get the bank byte. This can be used both for setting up 24-bit pointers and for setting the data bank to be correct for a specific label.
In order to set the data bank to something other than the current program bank, you can set a register to the desired value and push it, but you an also use the PEA
instruction, which takes a 16-bit immediate value and pushes it to the stack. It's not ideal because it's 16-bit and not 8-bit, but it will work. You can PEA
and then PLB
twice, but if you know what bank value you will need next you can do an optimization. Following is a ca65 macro from lorom-template which makes this easier to work with:
;; ; Pushes two constant bytes in the order second, first ; to be pulled in the order first, second. .macro ph2b first, second .local first_, second_, arg first_ = first second_ = second arg = (first_ & $FF) | ((second_ & $FF) << 8) pea arg .endmacro
Addressing modes
LDA
, STA
, ADC
, SBC
, ORA
, AND
, EOR
, and CMP
get new addressing modes that provide access to 24-bit addresses, ignoring the data bank. The following examples use ca65 syntax:
f:absolute, x
- 24-bit address with indexing[zeropage]
- 24-bit version of (zeropage)[zeropage],y
- 24-bit version of (zeropage) with indexing. [zeropage,x] does not exist.
If you need to access data from a bank different from the one you have set as the data bank, you'll have to plan out how you want to use the X and Y registers, given that only X can be used with far absolute addressing.
Stack changes
The 65c816 stack pointer is 16-bit, so it can point anywhere in the first 64KB of the address space. On the SNES it will probably be somewhere in $0000-$1FFF, which is RAM.
New stack addressing modes
The 65c816 makes it more feasible to put function arguments or local variables on the stack with new addressing modes on LDA
, STA
, ADC
, SBC
, ORA
, AND
, EOR
, and CMP
as well as a new 16-bit stack pointer. The new addressing modes are stack,s
and (stack,s),y
which index an 8-bit address with the stack pointer. Remember that the stack pointer points to the next available slot, so 1,s will go to the most recently pushed byte, 2,s will go to the next recently pushed byte and so on.
If you want to work with values on the stack, you should be aware of the TSC
and TCS
instructions. With these you can easily copy the stack pointer into the accumulator, subtract for however many local variables you want to make room for, and copy back to the stack pointer.
Probably the biggest downside to using the stack like this is that only the above instructions work with it. LDX
, INC
and such don't have the addressing modes available.
Movable "zeropage"/direct page
The 65c816 allows you to move zeropage to anywhere in the first 64KB of the address space. As a result, it's usually renamed to the "direct page". You're provided the TDC
and TCD
instructions to copy the accumulator to/from the base of the direct page. Direct page does not even need to start on a page boundary, but ideally it should be because otherwise there is a cycle added to all direct page instructions.
Some useful places you might want to put direct page are $2100 (PPU registers) and $4300 (DMA registers). If you're trying to write code that executes as fast as possible during vblank, this can help by allowing smaller, faster instructions.
Decimal mode
This isn't new to the 65c816, but will be new to NES developers. The SNES has a functional decimal mode! You should consider using it for values that are mostly for displaying, like money amounts or scores. One thing to keep in mind is that decimal mode only applies to ADC
and SBC
, so increments and decrements must be done using those.
Emulation mode
The 65c816 starts out in an "emulation mode" which is supposed to simulate a 65c02 by changing a few aspects of the processor (stack pointer is 8-bit and fixed to $0001xx, registers are fixed to 8-bit, interrupts use 16-bit addresses, some instructions take longer). You can completely ignore it, aside from writing code to switch out of it (CLC \ XCE
) as it's really only helpful for 65c816 successors to 6502 computers that are meant to be able to run old programs without changes, like the Apple IIGS. It won't help with porting NES code to the SNES.
SNES-specific math
The SNES has multiplication and division I/O registers. These are going to be significantly faster than doing the math with a general-purpose multiplication or division routine. The following math operations are available:
- unsigned 8-bit × 8-bit, with a 16-bit result
- unsigned 16-bit ÷ 8-bit, producing a 16-bit result and 16-bit remainder
- signed 16-bit × 8-bit, producing a 24-bit result
The unsigned math registers take time to work, and they need a delay before the result is valid. This can be accomplished with nop
instructions, or (better) by doing other useful work in the meantime. The signed multiplication registers are part of the PPU, and are not usable when it is rendering Mode 7, as it repurposes hardware meant for that.