This is going to be long and quite technical and LOOONNG (easily The Rocketer long). It's actually more of a rant, but it does tell what I have been doing lately too, so I'm putting it here. I guess the programming thread could be suitable too.
Okay, people who visit the Anime thread might be aware that, just as Spammy is into mecha anime (let's not say obsessed, it's not nice), I'm "obsessed" with a particular franchise.
Anyway, alongside the TV series, the manga and the audio dramas (sound stages), all of which I have already experienced and have translated versions of, two PSP games were also published. Given the small size of the English-speaking audience, they were never localized in English. Some of the paths in the first one, Battle of Aces, were translated, but for the second one that has never happened. We have a pretty detailed synopsis of it, but that is it.
Some time ago I decided to take a look at them because I was becoming disillusioned with the direction the franchise had been taking recently, and the second one especially was supposed to have a pretty good story true to the spirit of the franchise. I played the first one on the PPSSPP emulator (worked like a charm), had some fun, and then decided to see if I could mod the English translation into the first one, the one we have partial translations for.
Some time later, after much searching, I was able to figure out how to extract the data from the pac files on the CD (using the QuickBMS program with the add_pac script followed by the LZS script). BUT even though I had figured out how to unpack the game packages, I was unable to find a way to repack them. Considering how much trouble I had trying to pull the extracted files back together into an ISO the emulator could play (as a test to see if I could rebuild the CD), and the fact that I think the game/PSP expects at least some files to be at set locations on the CD, making changes to the files and then managing to put everything back together would have been a long shot.
So with the plan to translate the game going up in flames, I decided to use the translations we had and put them as subtitles over my gameplay. And I was largely successful in that regard, producing this: https://www.youtube.com/playlist?list=P ... Ueo7HtO1Ht
That also allowed me to practice working with Premiere, which I had not had much experience with. I seem to have learned a thing or two, given how the initial videos differ from the later ones. While I personally do not know Japanese, at times I had to make small corrections when I felt a change was more lore-appropriate. Google Translate, while often hilarious in the ways it could screw up a translation, was also helpful. The most challenging of all the videos was probably the first one, because for that one I did not have the Japanese transcript, so I had no real way of knowing exactly which translated sentence corresponded to which narrated part. What followed was me typing into Google Translate the words that were clear and that I was pretty certain I could spell, and comparing what I got to the translation. That was a pain. But in the end I think I did a pretty good job for someone who doesn't speak the language.
While I was initially searching for tools to pack and unpack the game files, I learned that the Chinese had made a translation patch for the game and had offered the tools they used. Unfortunately the link to the tools was a bust (it was from the early 2010s), but later I figured out that the site did hold the Chinese translation of all the game paths (as I said, in English only 3/9 were fan-translated). During my searches I also ran into a place where one of the fans had made a transcript of the subtitles that appear in one of the paths when the characters speak to one another. So I figured: I have the Japanese transcript, the Chinese translation and a pretty good grasp of the lore and the characters; how hard would it be to try to translate at least one path?
Well, as it turns out, it was difficult but doable. I never did the entire thing, but I did translate about one third of it. In the end the Chinese translation was only useful as a general guidepost, given how freeform their translation was. I also realized that I simply could not rely on the fan-made transcripts, because it takes ONE wrong character to screw up the meaning of a sentence and for GTranslate to go wild. And the other paths were never transcribed, so I would have to find a way to get those transcriptions myself.
I have yet to publish this translation, mostly because I would like someone at least passably familiar with the language to take a look at it, since it might have as much to do with what was actually being said as an abridged series does.
Back when I was working on unpacking the files, I did mostly figure out where the subtitles were probably being kept, but they did not seem to be written in Unicode, so I left it at that. This time when I went back, I went for a file I knew should contain some dialogue written in Western characters. BINGO! There they were: English characters. Considering the number of Japanese characters that would need to be covered, there was no way that a simple extension of ASCII (ASCII proper only defines the first 128 values of a byte; the other 128 are usually assigned to local letters depending on the code page used, so 140 might mean one character under a Western European encoding and another under a Cyrillic one), with its 128 extra characters, would be enough for thousands of Japanese ones. Hell, I don't think katakana and hiragana alone would fit in 128 characters. So the Japanese characters had to be encoded with at least two bytes, giving 65k possible codes to associate with characters. But the ASCII characters certainly did not take two bytes each; they were placed right next to each other. This meant there had to be some way of notifying the game that the following bytes were ASCII and should be read one byte per character, and where that run ended.
I quickly noticed that with all the presumably Japanese characters, the first byte was always greater than or equal to 0x80 (128). Ahh, realization dawned. They are combining the one-byte-per-character and two-bytes-per-character notations, because ASCII characters only cover the first 128 values of a byte (the ones smaller than 0x80). If the game runs into a byte greater than or equal to 0x80, that means it's not an ASCII character, and it should read it as the first byte of a two-byte code. I ran some quick tests, mostly by finding texts that contain the same character in two locations and checking that the bytes were the same, and my deduction turned out to be true. Later I found out that this is called Shift-JIS encoding and that browsers can read it. I renamed the file from asm to txt, opened it in Firefox, changed the encoding, and voila, the characters were there. Great! Now I should have access to the text from all of the paths in the game.
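The lead-byte rule can be sketched in Python. This is a simplified model of what the game seems to do, with my own function name and structure; note that real Shift-JIS also has single-byte half-width katakana (0xA1-0xDF), which this simplification ignores since the game text apparently doesn't use them:

```python
def split_runs(data: bytes):
    """Split a byte stream into ASCII runs and two-byte (Shift-JIS-style) runs.

    Simplified rule from the observation above: a byte below 0x80 is a
    one-byte ASCII character; a byte at or above 0x80 is the lead byte of
    a two-byte character.
    """
    runs, i = [], 0
    while i < len(data):
        if data[i] < 0x80:                       # one-byte ASCII run
            j = i
            while j < len(data) and data[j] < 0x80:
                j += 1
            runs.append(("ascii", data[i:j].decode("ascii")))
        else:                                    # two-byte character run
            j = i
            while j < len(data) and data[j] >= 0x80:
                j += 2                           # skip lead + trail byte
            runs.append(("jis", data[i:j]))
        i = j
    return runs

sample = b"ABC" + "かの".encode("shift_jis") + b"!"
print(split_runs(sample))
```

Only the bytes at lead positions are tested, since a valid Shift-JIS trail byte can itself be below 0x80.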
But before I could continue, I had the bright idea that I should probably do this for the second game instead, since it had the more interesting story and as such would be more interesting to others. And I figured: the games were pretty similar in how they delivered the story, and so were the mechanics, so the files were probably also similar, right?
Oh, how wrong I was. First, after I extracted the CD I found similarly packed files that were packed using add_pac. The trouble was that there were no obvious locations where the story subtitles lived. The first game clearly labeled its story-containing files as story_charactername_stage_xx, while here things weren't that clear. Some files hinted they might contain what I needed, but after being extracted and decompressed (add_pac -> LZS) they did not contain text in the now-familiar Shift-JIS encoding. It took extracting ALL the files (often overwriting them, since there were a LOT of duplicates across the various packs) and finding ways to open GMO and GIM files to rule out the ones that did not contain what I wanted, before I could figure out that the text was probably in one of the files I had first checked. Finally I worked out that the texts were probably in the files named story_evXX.asm and story_btlXXXX.asm, but the bytes found there still did not look anything like what I expected.
Eventually, after some hits and misses, I noticed that the file contained another section that started with FTXT, followed some bytes later by the string "story_evXX.txt". I also noticed that pairs of 0xFF bytes seemingly delimited the bytes after this into sections. Still, the bytes fit neither Unicode nor Shift-JIS. I booted up the game, went to one of the stages, and opened its supposed file in my hex editor. And look at that: the first line the character said (thank god for the Japanese habit of making games where characters wait for you to press a key to acknowledge that you have understood what was said, so staying on one screen was not an issue) was N characters long, and from the header until the first 0xFF 0xFF there were 2*N bytes. The same happened with the next screen/subtitle, and the next one. Then I started noting in Notepad which characters seemed to map to which two-byte numbers, and the same characters did map to the same numbers. So I had found the text, but even after hours spent trying to figure out its encoding, I found that no known encoding would fit the following notation:
Code: Select all
01 00 - 一
02 00 - ...
03 00 - <<space>>
04 00 - 、
05 00 - .
11 00 - か
32 00 - の
33 00 - は
72 02 - !
73 02 - (
74 02 - )
7B 02 - ?
0A F0 - New line
The numbers are in little-endian hexadecimal notation. That endianness did not make things any less "fun". Then I noticed another thing. Even if the programmers were using their own encoding for some reason, they probably were still using the same order of characters as other encodings, right? That would be the sane thing to do, and would make transferring characters from one notation to another simple. WRONG!
Take a look here
That page contains the Shift-JIS code pages. Find the か character. Here, I'll help you: it's in the row labeled 82 9E. That label, BTW, gives the code of the first character in that row, so since the row starts at 9E, our character sits at position 82 A9. But that is beside the point. Now note that in the "game" encoding the character is encoded as 0x11. Note also that の is represented by 0x32. So, given that the rows are 16 characters wide, の should be located two rows below and one column over from か on that page (0x11 + 0x21 = 0x32), right? NOPE! ぬ is in that location. And I have checked some other pairs. In some cases, when the characters are close to each other, the distance between them is the same in both the game encoding and the standard one. But for quite a few others it's not. Which means they are not using the full set of characters, but only a curated set of the characters they actually need; those not needed by the text simply aren't given codes.
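The decoding scheme as I understand it so far can be sketched in Python. The table entries below are the handful of codes I recovered by hand (and remember, they only hold for that one stage); the function name and the exact delimiter handling are my own guesswork:

```python
import struct

# Partial code table recovered by hand for one stage (from the notes above).
# Every stage apparently ships its own table, so this only works for that file.
STAGE_TABLE = {
    0x0003: " ", 0x0004: "、", 0x0011: "か", 0x0032: "の", 0x0033: "は",
    0x0272: "!", 0x0273: "(", 0x0274: ")", 0x027B: "?", 0xF00A: "\n",
}

def decode_stage_text(data: bytes, table=STAGE_TABLE):
    """Decode per-stage-encoded text: little-endian uint16 codes, with
    0xFF 0xFF acting as a subtitle-screen delimiter."""
    screens, current = [], []
    for (code,) in struct.iter_unpack("<H", data):
        if code == 0xFFFF:                        # end of one subtitle screen
            screens.append("".join(current))
            current = []
        else:
            current.append(table.get(code, "?"))  # '?' for unmapped codes
    if current:
        screens.append("".join(current))
    return screens

raw = bytes.fromhex("1100 3200 0400 3300 ffff 7202 7b02")
print(decode_stage_text(raw))   # → ['かの、は', '!?']
```

The real job, of course, is filling in that table for every stage, which is exactly the problem.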
Now, if I could get my hands on their font files, I might be able to figure out how to compute the Unicode codes from the game codes by comparing the order of characters in the font files with the Unicode or Shift-JIS order.
But where were they keeping the font information? I tried the debugger and the other tools that come with the emulator. While I was able to confirm that the text was assembled from some kind of texture/image containing the Japanese characters, I was not able to get my hands on it, and white text is not really legible on a white/gray checkerboard pattern.
I eventually found a couple of files that had "font" in their names. All of them had the tm2 extension, and there were a lot of other files with the same extension. But now, how to open them? The accepted wisdom was to use Noesis. While Noesis was able to open all those GMO and GIM files containing animations and textures, and did not complain when told to open the TM2 files, the output was obviously garbled and looked like the program had not properly figured out the width of the image. Some other wisdom pointed out that these files were probably images encoded the normal 8-bits-per-pixel way. If that was so, I should be able to write a simple MATLAB script (I have experience with it, and for prototyping simple scripts it's great) to import the files. And so started my odyssey last night, trying to figure out the TM2 format with no help.
Well, I was able to load the file and present it as an image without much trouble. I guessed that the image width was probably a power of 2, so I used 256. In the hex editor I noted where the content started after the header, so I simply ordered MATLAB to place the remaining pixels on rows 256 pixels wide. What I got was still quite garbled, but by playing with the width I ended up with this when I set it to 32.
(cropped and zoomed in from the original image 32 pixel wide)
Now, some of those do look like characters, don't they? Near the top you can spot 0, 1, parts of 4, A and B with the bottom looped off, etc. Setting the width to 16 produced somewhat better results, but still not correct ones. Further messing around with the width did not produce significantly better results. After some more time in my hex editor, I noticed the same partially cut characters while reading the raw file.
Now you might be thinking: so? What is weird about that? They are in the picture, so they should be visible in the file too. But you are missing the point: images are usually stored in memory row by row, meaning the entire first row of pixels is stored before you continue to the next. That is not what I was seeing here. It seems that TM2 splits the image into segments 16x8 pixels in size (fortunately 16 is exactly the width in bytes that the hex editor uses for its rows, which is why I was able to see features of the image in the raw byte file). And the image is seemingly formed by concatenating all these segments horizontally. The front part of the B is clearly in the second segment row. So I ordered the program to compute the segments and place them in one big row, and got this:
(zoomed and cropped for emphasis)
It was pretty obvious that the second part of the image should go below the first part to form the next row of segments. Well, to cut a long story short, in the end I was mostly able to reverse engineer the format. I figured out which group of bytes (again little-endian) tells where the image data actually starts, which tells how long the file is, how wide the final image is (from which I could figure out how many segments per row I should use), and when I should divide that number of segments per row by two. I found where the palette was stored. The palettes occupy the last 2048 bytes, since there are TWO of them: 2 palettes * 256 colors * 4 bytes (red, green, blue and alpha) = 2048 bytes. I haven't been able to figure out when each of them is supposed to be used. Often the second one holds grayscale (black and white) color values, but not always. Anyway, the file I was looking at looks like this with no colors applied (when I do apply them, things go awry with this file; I'm probably adding 1 somewhere I shouldn't):
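My MATLAB script does the real work, but the segment reshuffling can be sketched in Python too. The 16x8 tile size and the row-by-row layout within each tile are from my reverse engineering above; the function name and parameters are just illustrative:

```python
def untile(pixels: bytes, img_w: int, img_h: int, tile_w=16, tile_h=8):
    """Rebuild a row-major 8bpp image from TM2-style tiled storage.

    Assumption (from the reverse engineering above): the file stores the
    image as a sequence of tile_w x tile_h tiles, each tile written row
    by row, with tiles ordered left-to-right, top-to-bottom.
    """
    tiles_per_row = img_w // tile_w
    out = bytearray(img_w * img_h)
    for t, off in enumerate(range(0, len(pixels), tile_w * tile_h)):
        tx, ty = t % tiles_per_row, t // tiles_per_row   # tile position
        for row in range(tile_h):
            src = pixels[off + row * tile_w : off + (row + 1) * tile_w]
            dst = (ty * tile_h + row) * img_w + tx * tile_w
            out[dst : dst + tile_w] = src
    return bytes(out)

# Two 16x8 tiles (one all-0, one all-1) forming a 32x8 image:
tiled = bytes([0]) * 128 + bytes([1]) * 128
img = untile(tiled, 32, 8)
```

The 2048 trailing bytes (the two 256-entry RGBA palettes) would then be applied as a lookup over these 8-bit indices.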
No Japanese characters, huh. And the rest of the TM2 files, the ones I spent hours figuring out? Most of them are game art assets, but again, no actual Japanese characters. WTF.
Also, why are these images encoded in so obtuse a fashion?! What's wrong with storing images row by row? You might think it's to save on having to read entire rows just to get the pixels belonging to one letter, but that is not so. As the pictures above demonstrate, the characters almost always overflow across the segment boundary, so you would almost always need to load the next row of segments anyway. So WHY? Were they being deliberately obtuse, wanting to make hacking the game as difficult as possible? Possibly, knowing how they used their own encoding to screw with me. By that logic, using such an obtuse format for small game art assets would be normal.
Now you might think that after I wasted all this time on basically nothing, I would quit. Oh no, making bad life choices is practically our national custom. So I continued on. I went back into the game with the idea of recording which codes belong to which characters, to see if I could figure out roughly how their encoding works. Which is when I got the real WHAT THE FUCK! moment. See, notice how の is coded as 0x32? And how the Japanese comma is 0x04? That is from the second fight of the first stage. When I went back in, I started from the prologue. And guess what: in the prologue (the slideshow stage prologues and endings are in those story_ev files, at least the prologue was), the Japanese comma is, drum roll please... 0x06
WHAT THE FUCK! No, I did not look at the wrong location. The text length matches the byte length (that is, the byte length is twice the character length); ア happens to occur twice in this text, and both times the same code is used in its location; and the codes/characters exhibit the same behavior as before, where their order resembles, say, Shift-JIS, but the distances are often off. Then I found a set of files named story_xx.fnt... I could draw only one conclusion... THE SADISTIC BASTARDS USED A DIFFERENT HOME-BREW ENCODING FOR EACH AND EVERY STAGE!!!
That is why they have to have a different font file for every stage. It also makes any attempt at a translation patch nearly impossible, since you would also need to make your own fonts.
I can just picture them now, planning this. "Have you heard, Butt-san, how those gaijins from the mainlands managed to translate our superior game!" "Oh no, Dick-san, that cannot be; we must endeavor to make the game files entirely impenetrable for our next game, even if it takes extra work to create the converters that make all these different encodings possible. The dirty mainlanders must not experience the glory of our Nippon magical girl video games." "Verily, Butt-san. Now excuse me, I need to fold some steel to make superior steel too."
Hey, guess what, you asses: your steel sucks just as much as this practice of yours. Yes, you heard me! Not everything needs to be overly complicated to be better. You certainly could have used normal-ass fonts and encodings; the only reason for doing this, from what I can see, is to support that quirk of your language where you can write how something is read above obscure kanji (furigana, I think; I don't know enough). But that was not really necessary. You DICKS.
Plan... I guess D for DICKS. Use KanjiTomo to grab characters off screenshots. It works pretty well, but can only really handle up to 4 characters at a time. BUT it does give you options for each of them, so you can verify and correct wrong OCRs right on the spot. This is how I'll grab the text once I get some free time.
The MATLAB file for reading TM2 can be found here