If this were just a website, then dokuwiki with a plugin to show furigana properly would have been enough. However, that was not the purpose. Instead, I wanted a site that would let me present the book, stored in as plain text as possible, on the web. Dokuwiki to the rescue: it stores pages in minimally marked up plain text. This allows me to edit the content with a text editor, while being able to trust dokuwiki to convert that into nice looking html for me.
Of course, since the goal was a book, there are certain things in the text files that should never show up in the dokuwiki pages. For instance, I use a simple markup for furigana, namely kanjiform(furi), and that needs to be converted to proper furi. I also have index and glossary markup in my content, because as a book these are rather required. It would be disastrous if these showed up in the pages that dokuwiki renders, so a plugin that suppresses that was also written.
Finally, I wrote a set of PHP scripts that run outside of dokuwiki: they aggregate all the book's relevant pages, convert them to proper LaTeX, and then run the result through the LaTeX compiler that I use, in order to produce the PDF file that you can download from the main page.
So how does everything tie together?
There are several plugins at work on this dokuwiki to make sure the book's content shows up properly:
This is a public plugin I wrote, which turns tab delimited table data into proper dokuwiki format. Since the goal for me was to have as little explicit markup as possible, this plugin was essential.
This is another public plugin I wrote, which turns simple furigana markup into proper XHTML1.1 “ruby” markup.
A private plugin (which may become public at some point in the future) that allows page visitors to leave a comment per section, without having to log in, and without having to use a separate discussion page.
A private plugin that I needed so that I could use publication quality images (300/600dpi) without having to generate webversions for each image. The plugin looks at the image file, gets its horizontal and vertical dots per inch value, and tells dokuwiki to on-screen resize the images to what they should be if they had screen resolution (which is only 72 dots per inch). The result is an image that looks the right size on your screen, as well as in the PDF file when it's included during the build process.
A private plugin that I needed to suppress glossary and index markup. In order for the PDF to have glossary and index information, I tag content as idx:indexname:… and gls:glossaryname:… - these tags are stripped by the plugin when dokuwiki serves up the page, so you won't ever know they're there unless you look at the plain text content.
I actually don't use LaTeX, I use XeLaTeX, which is the XeTeX engine extended with the LaTeX macros. The advantage of XeLaTeX is that it uses XeTeX rather than TeX. To explain why this is an advantage:
TeX is from an era when text was 7-bit ASCII. As such, the classic engines built on top of TeX expect 7-bit ASCII input, and any package that allows you to use UTF-8 is basically a hack to make a (limited number of) language(s) play nice. XeTeX, on the other hand, is an extended engine that reads its input as Unicode (UTF-8 by default), in a way that is consistent with TeX parsing. So unlike the older TeX flavours, XeTeX actually expects UTF-8 Unicode input. What more do you want?
Well, these benefits really… XeTeX offers two crucial functionalities:
The first lets you define what should happen when a transition is detected from characters from one set (say, unicode latin) to characters from another set (say, CJK characters). The second is a XeLaTeX specific package that gives you access to fonts in the same way any office application does: by name. I cannot tell you how useful this is. If you're used to TeX's MF system, you know how much more convenient a simple command “\setmainfont{Times New Roman}” is if you can rely on it to get all the fontfaces right. Which it does.
Combined, these two features mean that I don't have to worry about sticking in font switches anywhere. Things Just Work, just like you'd expect from a system that's designed for typesetting.
Installing XeLaTeX is basically a no-brainer (but then this goes for all TeX flavours): install either TeXLive (if you're on *nix) or MiKTeX (if you're on windows) and you're done. They come with pretty much every possible flavour of tex and latex so you can try every known working tex processor in existence. Although of course I would recommend trying “xelatex” and then not bothering with any other ones until 2012 when LuaTeX has hopefully been completed.
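To check that the install works, something like the following minimal file should compile with xelatex. This is just a sketch: the file name is made up, and it assumes Palatino Linotype is installed (swap in any font name your system knows).

```latex
% hello.tex (hypothetical file name): compile with "xelatex hello.tex"
\documentclass{article}
\usepackage{fontspec}              % font selection by name
\setmainfont{Palatino Linotype}    % assumption: this font is installed
\begin{document}
Unicode input works directly: naïve, café, Ω.
\end{document}
```

If the font name is not known to the system, fontspec stops with an error telling you that the font cannot be found, which is a quick way to find out what name your system actually uses.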
For English text, I use the commercial "Palatino Linotype" font, which comes with all modern versions of Windows, with the free FreeSerif font filling in the blanks that Palatino Linotype has when it comes to the extended Latin characters. For CJK text I primarily use the commercial "Kozuka Mincho Pro" font, which comes with Adobe Creative Suite programmes, as well as the free HAN NOM A/HAN NOM B fonts, which fill in some of the more exotic characters I use in the book.
LaTeX has CRAPLOADS OF QUIRKS. It also has no unified documentation, so if the prospect of hunting down information on how to fix something that looks like a stupid tiny thing for an entire day sounds like a turn-off to you, LaTeX is not for you. But then you would probably be content with Microsoft Word or Open Office Writer documents, so no worries.
The following things are things I have learned over the course of making all of this work.
Use the relsize package, and its \smaller and \larger macros. These commands can be chained, but be aware that text cannot be made smaller than 6 points, which means you can't use it to get to \tiny, which is 5.5 points (which can be a problem for people who use it for phonetic guide text such as the Japanese furigana).
To get around that, issue \renewcommand\RSsmallest{5pt} after loading the package, so that the smallest size will be 5.
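As a small sketch of how the chaining looks in practice (the sample text is made up, and the \RSsmallest line is the one from the tip above):

```latex
\documentclass{article}
\usepackage{relsize}
\renewcommand\RSsmallest{5pt}      % lower the floor, as described above
\begin{document}
Normal text, {\smaller one step down, {\smaller two steps down}},
and \textsmaller[2]{also two steps down}.

{\relsize{-2} The declaration form takes a signed number of steps.}
\end{document}
```

The braces matter: \smaller and \relsize are declarations, so they stay in effect until the group closes, while \textsmaller only touches its argument.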
Wrap your itemize or enumerate environment in the samepage environment. For instance, for example sentences I use the following code (the examples are basically just bullet-less lists):
\renewcommand{\labelitemi}{}
\newenvironment{exampleblock}{%
\begin{samepage}%
\begin{itemize}%
\setlength{\itemsep}{0pt}%
\setlength{\parsep}{0pt}%
\setlength{\topsep}{0pt}%
\setlength{\partopsep}{0pt}%
\setlength{\parskip}{0pt}%
\setlength{\labelsep}{0pt}}
{\end{itemize}%
\end{samepage}}
Use the longtable package. If you have header/footer code (even if it's just an \hline command) you can specify the “first” and “subsequent” header code, as well as “not last” and “last” footer code. That way, if the table runs past the end of the page, longtable will snip it, and ensure it has the right header/footer code to make everything look peachy.
For instance:
\begin{longtable}[h]{| l l l l l l |}
\hline
\T & kana & pronunciation & & as glide & pronunciation\\
\hline
\endfirsthead
\hline
\T & kana & pronunciation & & as glide & pronunciation\\
\hline
\endhead
\hline
\endfoot
\hline
\endlastfoot
\T き + や & きや & kiya & & きゃ & kya\\
\T し + ゆ & しゆ & shiyu & & しゅ & shu\\
\T ち + よ & ちよ & chiyo & & ちょ & cho\\
\T み + や & みや & miya & & みゃ & mya\\
\T ひ + よ & ひよ & hiyo & & ひょ & hyo\\
\T に + ゆ & にゆ & niyu & & にゅ & nyu\\
\T り + よ & りよ & riyo & & りょ & ryo\\
\end{longtable}
This will build a table:
| | kana | pronunciation | | as glide | pronunciation |
|---|---|---|---|---|---|
| き + や | きや | kiya | | きゃ | kya |
| し + ゆ | しゆ | shiyu | | しゅ | shu |
| ち + よ | ちよ | chiyo | | ちょ | cho |
| み + や | みや | miya | | みゃ | mya |
| ひ + よ | ひよ | hiyo | | ひょ | hyo |
| に + ゆ | にゆ | niyu | | にゅ | nyu |
| り + よ | りよ | riyo | | りょ | ryo |
but it can also split this into two tables midway, each with its own header:
| | kana | pronunciation | | as glide | pronunciation |
|---|---|---|---|---|---|
| き + や | きや | kiya | | きゃ | kya |
| し + ゆ | しゆ | shiyu | | しゅ | shu |
| ち + よ | ちよ | chiyo | | ちょ | cho |
| み + や | みや | miya | | みゃ | mya |

| | kana | pronunciation | | as glide | pronunciation |
|---|---|---|---|---|---|
| ひ + よ | ひよ | hiyo | | ひょ | hyo |
| に + ゆ | にゆ | niyu | | にゅ | nyu |
| り + よ | りよ | riyo | | りょ | ryo |
Give up. There's no way to do this short of writing your own package to make it happen. Either use the standard packages for undefined-width tables and live with manual wrapping, or use the tabularx package to get columns to wrap their text, but live with the fact that you'll have to indicate how wide the table should be.
The idea behind TeX is that “you should know what you want your text to look like” but obviously this philosophy is violated time and again via all sorts of packages.
Issue a \noindent before your \begin{tabular}.
If you find out, contact me. I actually wrote a preprocessor as part of my set of conversion scripts that, when it's converting table data to LaTeX format, also guesstimates what the table width will be, and changes the table's environment from tabular to longtable with autofit left/right margins when the table is wider than 25em, with a \noindent issued just to be sure the table gets to use the entire page's width.
There is the tabularx package, which can do text wrapping if the tables are too large, but it requires you to say how wide the tables need to be, which I can't do since the conversion is an automated process. (the idea behind it is that you are in control of your typesetting, so if it has to look the same on any computer you run the source compile on, you need to rigidly mark how wide wrapping columns are, or your document will look different on different machines).
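For the cases where you do know the width, a minimal tabularx sketch looks like this (the table content is made up; \textwidth is just one reasonable choice for the total width):

```latex
\documentclass{article}
\usepackage{tabularx}
\begin{document}
\noindent
\begin{tabularx}{\textwidth}{| l | X |}
\hline
term & explanation \\
\hline
furigana & A long explanation that does not fit on one line gets wrapped
           inside the X column, because the total table width is fixed. \\
\hline
\end{tabularx}
\end{document}
```

The first argument is the mandatory total width, and every X column shares whatever space is left over after the fixed-width columns have been set.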
If your table uses tabular, then wrap it in the center environment. If you use longtable, there are two lengths, \LTleft and \LTright, that set the horizontal spacing to the left and right of the table (from that point on), and they can take the wonderfully useful \fill as value.
\setlength\LTleft\fill
\setlength\LTright\fill
\begin{longtable}
...
\end{longtable}
Use XeTeX and the fontspec package. This lets you use ttf and otf fonts in the same way any sane program would: by name. To change the font to a new font, you simply issue the command \fontspec{font name as you see it in any text editing/word processing application}. So if you want to change to Gothic Pro, \fontspec{Gothic Pro}.
Use XeLaTeX, which is the LaTeX version of XeTeX. This lets you use “intercharclasstokens”, which is a funky way to say “it lets you define what code to automatically put between characters of different classes”. Class 0 is roman, classes 1 through 3 are CJK, and class 255 is the special 'boundary' class. You can assign any number of characters their own class number (anything from 4 up to 253, because 254 will be a wildcard class in the next version of XeTeX) and then define transition rules for that class to the other classes. If you want to change fonts between roman and CJK, for instance, you would say:
\XeTeXinterchartokenstate = 1
% when going from not CJK to CJK
\XeTeXinterchartoks 0 1 = {\cjkfont}
\XeTeXinterchartoks 0 2 = {\cjkfont}
\XeTeXinterchartoks 0 3 = {\cjkfont}
\XeTeXinterchartoks 255 1 = {\cjkfont}
\XeTeXinterchartoks 255 2 = {\cjkfont}
\XeTeXinterchartoks 255 3 = {\cjkfont}
% when going from CJK to not CJK
\XeTeXinterchartoks 1 0 = {\rmfont}
\XeTeXinterchartoks 2 0 = {\rmfont}
\XeTeXinterchartoks 3 0 = {\rmfont}
\XeTeXinterchartoks 1 255 = {\rmfont}
\XeTeXinterchartoks 2 255 = {\rmfont}
\XeTeXinterchartoks 3 255 = {\rmfont}
This code would go in your preamble, and the \rmfont and \cjkfont could be defined as “\fontspec{Roman}” and “\fontspec{HAN NOM A}”, for instance.
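Assigning your own class works the same way. A minimal sketch, assuming you want a thin space before every exclamation mark that follows a class 0 character (the macro name \exclamclass is made up, and \newXeTeXintercharclass is used to grab a free class number rather than hardcoding one):

```latex
\documentclass{article}
\XeTeXinterchartokenstate = 1
\newXeTeXintercharclass\exclamclass        % allocate an unused class number
\XeTeXcharclass `\! = \exclamclass         % put "!" in that class
\XeTeXinterchartoks 0 \exclamclass = {\,}  % class 0 -> "!": insert a thin space
\begin{document}
Things Just Work!
\end{document}
```

Only the transitions you define get tokens inserted, so an exclamation mark that follows a space or a CJK character is left alone here.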
Use the multicol package, which provides the multicols environment:
\usepackage{multicol}
...
\begin{document}
...
\begin{multicols}{2}
lots of text
\ldots
\end{multicols}
...
\end{document}
Yes, this will let you do different column layouts on the same page, rather than the LaTeX \onecolumn and \twocolumn commands, which force page breaks.
Header/footer styling means you're using a pagestyle, so to make sure that the word “Chapter” doesn't make it into the header, issue a \renewcommand{\chaptername}{} after your \pagestyle{…} command
(I know, what the hell, right? Apparently making “chapter name” empty doesn't actually make your chapter name empty… it makes the prefix to chapter names empty. Could have called it \chapterprefix or something, but no such luck I fear)
“Package Fancyhdr Warning: \headheight is too small (12.0pt): Make it at least 14.49998pt. We now make it that large for the rest of the document.”
Issue \setlength{\headheight}{15pt} in the preamble.
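Put together, the two header fixes above would sit in a preamble roughly like this. It's a sketch assuming the book class and the fancy pagestyle, with a made-up chapter title:

```latex
\documentclass{book}
\usepackage{fancyhdr}
\setlength{\headheight}{15pt}    % silences the fancyhdr \headheight warning
\pagestyle{fancy}
\renewcommand{\chaptername}{}    % after \pagestyle: drops the "Chapter" prefix
\begin{document}
\chapter{A chapter title}
Some text.
\end{document}
```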
When using intercharclass behaviour, a la
\XeTeXinterchartoks 0 1 = {\cjkfont}
\XeTeXinterchartoks 0 2 = {\cjkfont}
\XeTeXinterchartoks 0 3 = {\cjkfont}
\XeTeXinterchartoks 255 1 = {\cjkfont}
\XeTeXinterchartoks 255 2 = {\cjkfont}
\XeTeXinterchartoks 255 3 = {\cjkfont}
\XeTeXinterchartoks 1 0 = {\rmfont}
\XeTeXinterchartoks 2 0 = {\rmfont}
\XeTeXinterchartoks 3 0 = {\rmfont}
\XeTeXinterchartoks 1 255 = {\rmfont}
\XeTeXinterchartoks 2 255 = {\rmfont}
\XeTeXinterchartoks 3 255 = {\rmfont}
do define the transition commands as follows:
\newfontfamily{\cjkfont}{Kozuka Mincho Pro}
\newfontfamily{\rmfont}{Palatino Linotype}
\setmainfont{Palatino Linotype}
and really, really don't use:
\newcommand{\cjkfont}{\setmainfont{Kozuka Mincho Pro}}
\newcommand{\rmfont}{\setmainfont{Palatino Linotype}}
\rmfont
If you use the not-right version, the styling of chapter/section/subsection/subsubsection/paragraph headings will break once a transition rule is applied, and things will suddenly look very, very off.
There are several ways to do this; I went with the highest form of customisation and just redefined the theindex environment. In LaTeX this environment is defined as:
\renewenvironment{theindex}
{\if@twocolumn
\@restonecolfalse
\else
\@restonecoltrue
\fi
\twocolumn[\section*{\indexname}]%
\@mkboth{\MakeUppercase\indexname}%
{\MakeUppercase\indexname}%
\thispagestyle{plain}\parindent\z@
\parskip\z@ \@plus .3\p@\relax
\columnseprule \z@
\columnsep 35\p@
\let\item\@idxitem}
{\if@restonecol\onecolumn\else\clearpage\fi}
To override this behaviour, make a command that renews the \indexname:
\newcommand{\setindextitle}[1]{\renewcommand{\indexname}{#1}}
and simply call \setindextitle{The Title You Want} to make happy things happen. I also stuck in the line
\addcontentsline{toc}{section}{\indexname}%
after the \twocolumn declaration because it would be silly to have to manually add a ToC link.
Alternatively, you can fiddle with the definition directly, but then you'll need to do this either in a .sty file (because it relies on the magical @ values) or put it in your preamble and issue \makeatletter before, and \makeatother afterward (this tells latex that you know what you're doing and that it's okay for you to use these latex-internals values)
I can't get any of them to work satisfactorily. The multind package screws up the ToC and header names if there are index terms in them (which is a lot in my book). The index and splitidx packages seem to give me more headaches than necessary, so for my book I just use makeidx and then correct the .ind file makeindex generates, so that when the Japanese section starts, the “theindex” environment is closed, and a new one is started. This does EXACTLY what one would want for bilingual indexes.
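To give an idea of what the corrected .ind file ends up looking like, here is a sketch. The entries, page numbers and titles are all made up, and putting the \setindextitle calls (the helper command defined earlier) inside the .ind file is just one way of doing it:

```latex
% first index: the part makeindex generated up to the Japanese entries
\setindextitle{English index}
\begin{theindex}

  \item adjectives, 12, 40
  \item particles, 25

  \indexspace

  \item verbs, 31

\end{theindex}

% second index: everything from the first Japanese entry onward
\setindextitle{Japanese index}
\begin{theindex}

  \item 形容詞, 12
  \item 動詞, 31

\end{theindex}
```

Since \printindex simply inputs the .ind file, closing one theindex environment and opening another inside it gives you two separately titled index sections.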
Note I use the word indexes, rather than indices, because in a book an “index” is actually a list of indices. Thus, multiple “index” sections constitute indexes, not indices (since only the indices in the index can be used for looking up things, the index itself is not a lookup key for anything).
Since I use relsize, I don't want to rely on LaTeX's fontsizes, I want the relative sizes. As such, you can make some custom styles (in a .sty file). I use, for instance:
\renewcommand\section{\@startsection{section}{1}{\z@}%
{-3.5ex \@plus -1ex \@minus -.2ex}%
{2.3ex \@plus.2ex}%
{\larger\larger\bfseries}}
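If you'd rather keep this in the preamble than in a .sty file, the same redefinition needs the \makeatletter and \makeatother wrapper mentioned earlier. A minimal sketch, assuming the article class:

```latex
\documentclass{article}
\usepackage{relsize}
\makeatletter                    % allow @ in command names
\renewcommand\section{\@startsection{section}{1}{\z@}%
  {-3.5ex \@plus -1ex \@minus -.2ex}%
  {2.3ex \@plus .2ex}%
  {\larger\larger\bfseries}}
\makeatother                     % and switch it off again
\begin{document}
\section{A heading two steps larger than the body text}
Body text.
\end{document}
```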
The center environment (as well as several other environments) adds fractional vertical spacing when it is used, which can lead to rather crazy block relocation. In the case of this book, it led to a completely blank page at page 62. Removing the center environment and instead setting the proper alignment on tables, etc. does the same thing, without the mystery vertical spacing. Very useful to know: USE THE \fill COMMAND. This sticks “variable width” glue somewhere, and using it for horizontal spacing before and after your content means the content ends up centered.
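As a sketch of the \fill trick for a plain tabular (this fragment goes in the document body, and the table content is made up):

```latex
\noindent\hspace*{\fill}%
\begin{tabular}{l l}
kana & pronunciation \\
きゃ & kya \\
\end{tabular}%
\hspace*{\fill}
```

The starred form of \hspace is used so the glue is not thrown away at the start or end of the line, and the two equal \fill glues push the table to the middle without the extra vertical space that center adds.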
I wrote a collection of PHP scripts that start with the base plain text data, and iteratively replace both the dokuwiki minimal markup, as well as my own markup (tables, furigana, index terms, glossary terms) to produce a legal LaTeX source file. This is then combined with a static latex preamble and backmatter, after which it is run through XeLaTeX several times to – almost – produce the PDF that is put up for download.
I know you're thinking “but php is for the web?” and no it's not. It's just a scripting language, in the same way that perl, python, etc. are scripting languages (as for “why php”, it has what I consider a reasonably friendly syntax). The scripts are actually called via a .bat file (yes, I'm in windows) which passes some startup parameters to the parse-and-compile script, and then the whole process runs without me having to do anything.
Which is nice.
I said it almost produces the PDF that you can download, because the document generated in this way lacks the titlesheet, as well as the security settings that I want applied to the document. The final step therefore consists of opening the file in Acrobat 8, adding in the titlesheet, setting the relevant Dublin Core values and applying the PDF v1.5 security features that I need.
The result is the pdf file that you can find in the “download the book” section on the main page =)
You can download the conversion scripts I use, although of course you should bear in mind it's tailored to my dokuwiki and its content. Two important things to take note of are the fact that tabular data on this dokuwiki are tab-delimited (using the tabtables plugin) and there are masses of furigana all over the place (which uses the xhtmlruby plugin).
I run these scripts from $dokuwiki/bin/tex, so all file paths are relative to that. If you have any questions on how they do the job that they apparently do, feel free to drop me a line!